# hamilton-help
j
hi! i feel like im falling into an antipattern that someone will have dug themselves out of before 🤣 I have a very simple ingestion pipeline that takes a source excel file (don't ask), sucking out one particular worksheet and serializing out to parquet.... i'm doing something like:
@load_from.excel(path="gs://bkt/source/asset1.xlsx")
@save_to.parquet(path="gs://bkt/raw/asset1.parquet")
def asset1(df: pd.DataFrame) -> pd.DataFrame:
    return df
(later i write a
@load_from.parquet
wrapper in a downstream pipeline...) now i actually have a few dozen asset files... so i was tinkering around with some sorta parameterization and resolve magic but it got me to thinking if im doing something really dumb here 😞 my actual working code is just a copy paste of a function for each asset.... any ideas?
s
ah you want to parameterize that 🤔 ?
will have to think about it. Since I think you’re asleep by now, I’ll put this on my queue for later today 🙂
j
i've been playing around and built out the code a bit... but i ended up dropping the
@load_from
and building my own load_from-like node that im parameterizing out .... something i've fallen into before so i now have
@parameterize(**NODE_CONFIG)
def asset(path: str, otherstuff: str) -> pd.DataFrame:
    df = pd.read_excel(path)
    # <do stuff>
    return df
but really this is just working around things, i still do have the load/save use case and a desire to make it a bit DRY
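One way to keep a config like NODE_CONFIG DRY is to generate it from a list of asset names rather than writing each entry by hand. This is a minimal sketch in plain Python: the asset names, the `build_node_config` helper and the `output_path` key are all hypothetical, and the bucket prefixes are copied from the example paths earlier in the thread. Hamilton's parameterization decorators may expect the values to be wrapped in a particular way, so check the Hamilton docs before passing this dict straight in.

```python
from typing import Dict, List

# Hypothetical asset names; in practice this would list the few dozen assets.
ASSET_NAMES: List[str] = ["asset1", "asset2", "asset3"]


def build_node_config(
    names: List[str],
    source_prefix: str = "gs://bkt/source",
    raw_prefix: str = "gs://bkt/raw",
) -> Dict[str, Dict[str, str]]:
    """Map each asset name to its Excel source path and parquet output path."""
    return {
        name: {
            "path": f"{source_prefix}/{name}.xlsx",
            "output_path": f"{raw_prefix}/{name}.parquet",  # hypothetical key
        }
        for name in names
    }


NODE_CONFIG = build_node_config(ASSET_NAMES)
```

Adding a new asset then means adding one name to the list, and the input/output path convention lives in exactly one place.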
s
I assume you want to parameterize this at runtime? Is it a bunch of small DAGs effectively then? or is this used downstream then?
j
parameterizing at runtime i.e. through config isn't really needed here. yeah its essentially a bunch of repeated small DAGs... one thing i'm struggling with is that to load a single asset takes a bit too long a time to work with it interactively.... so half of the time i'm just doing a load operation, a quick massage, then storing back to a faster and more compact format (i.e. excel sources then over to parquet)
e
Chiming in, some Qs: 1. How do you plan to be changing these? Is the reasoning that you’d rather have it in config than code, or do you plan to be changing it regularly? 2. Does the output path always correspond to the input path? IMO (and curious what you think), storing the configurations in code isn’t crazy if they’re relatively static: it’s nice and self-documenting, and forms a single source of truth. Furthermore, if they’re parameterized, you can use a custom loader/saver to parameterize it. Then joining it into a single one
@move_asset
that just calls both load/save is nice. Otherwise if you need runtime-parameterization, you can do more like what you did (wire in a config).
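from hamilton.function_modifiers import load_from, save_to

# Sketch: one decorator that applies both the loader and the saver.
def move_asset(from_: str, to_: str):
    def decorator(fn):
        # Same as stacking @load_from.excel(...) over @save_to.parquet(...) on fn.
        return load_from.excel(path=from_)(save_to.parquet(path=to_)(fn))
    return decorator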
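The "single decorator that calls both load and save" idea can be tried out without Hamilton installed. In the sketch below, `tag_loader` and `tag_saver` are hypothetical stand-ins that only record a path on the function; they are not Hamilton's `load_from` / `save_to`, and just show how two decorator factories compose into one.

```python
from typing import Callable


def tag_loader(path: str) -> Callable:
    """Stand-in loader decorator: records the source path on the function."""
    def decorator(fn: Callable) -> Callable:
        fn.source_path = path
        return fn
    return decorator


def tag_saver(path: str) -> Callable:
    """Stand-in saver decorator: records the target path on the function."""
    def decorator(fn: Callable) -> Callable:
        fn.target_path = path
        return fn
    return decorator


def move_asset(from_: str, to_: str) -> Callable:
    """Apply the saver first, then the loader, as if both were stacked on fn."""
    def decorator(fn: Callable) -> Callable:
        return tag_loader(from_)(tag_saver(to_)(fn))
    return decorator


@move_asset("gs://bkt/source/asset1.xlsx", "gs://bkt/raw/asset1.parquet")
def asset1(df):
    return df
```

Swapping the stand-ins for the real loader and saver decorators keeps the same shape: `move_asset` takes the two paths and returns a decorator, rather than wrapping the function's call signature.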