Hamilton Open Source #hamilton-help

Jan Hurst

05/20/2024, 8:00 AM

hi! i feel like i'm falling into an antipattern that someone will have dug themselves out of before 🤣 I have a very simple ingestion pipeline that takes a source Excel file (don't ask), pulls out one particular worksheet, and serializes it to parquet.... i'm doing something like:

import pandas as pd
from hamilton.function_modifiers import load_from, save_to

@load_from.excel(path="gs://bkt/source/asset1.xlsx")
@save_to.parquet(path="gs://bkt/raw/asset1.parquet")
def asset1(df: pd.DataFrame) -> pd.DataFrame:
    return df

(later i write a @load_from.parquet wrapper in a downstream pipeline...) now i actually have a few dozen asset files... so i was tinkering around with some sorta parameterization and resolve magic, but it got me thinking whether i'm doing something really dumb here 😞 my actual working code is just a copy-paste of a function for each asset.... any ideas?

Stefan Krawczyk

05/20/2024, 3:42 PM

ah you want to parameterize that 🤔 ?

Stefan Krawczyk

05/20/2024, 3:42 PM

will have to think about it. Since I think you’re asleep by now, I’ll put this on my queue for later today 🙂

Jan Hurst

05/20/2024, 3:58 PM

i've been playing around and built out the code a bit... but i ended up dropping the @load_from and building my own load_from-like node that i'm parameterizing out.... something i've fallen into before, so i now have

import pandas as pd
from hamilton.function_modifiers import parameterize

@parameterize(**NODE_CONFIG)
def asset(path: str, otherstuff: str) -> pd.DataFrame:
    df = pd.read_excel(path)
    # <do stuff>
    return df

but really this is just working around things. i still have the load/save use case and a desire to make it a bit more DRY
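For reference, a minimal sketch of how a NODE_CONFIG like the one above could be generated rather than hand-written, assuming every asset sits under the gs://bkt/source/ prefix from the first example. build_node_config, SOURCE_PREFIX, the asset names, and the "Sheet1" value for otherstuff are all made up for illustration:

```python
from typing import Any, Callable, Dict, Iterable

# Assumption: all source workbooks live under this prefix (taken from the example above).
SOURCE_PREFIX = "gs://bkt/source"


def build_node_config(
    asset_names: Iterable[str],
    wrap: Callable[[Any], Any] = lambda v: v,
    **shared: Any,
) -> Dict[str, Dict[str, Any]]:
    """Map each node name to the keyword arguments for that node.

    `wrap` is applied to every argument value; it defaults to the identity.
    `shared` holds arguments that are the same for every asset.
    """
    config = {}
    for name in asset_names:
        kwargs = {"path": f"{SOURCE_PREFIX}/{name}.xlsx", **shared}
        config[name] = {key: wrap(val) for key, val in kwargs.items()}
    return config


# Hypothetical asset names and a hypothetical shared argument.
NODE_CONFIG = build_node_config(["asset1", "asset2"], otherstuff="Sheet1")
```

As far as I know, Hamilton's @parameterize wants each argument wrapped in value(...) or source(...), so in a real module you would likely pass wrap=value (from hamilton.function_modifiers) instead of relying on the identity default.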

Stefan Krawczyk

05/20/2024, 4:56 PM

I assume you want to parameterize this at runtime? Is it a bunch of small DAGs effectively then? or is this used downstream then?

Jan Hurst

05/20/2024, 4:58 PM

parameterizing at runtime, i.e. through config, isn't really needed here. yeah, it's essentially a bunch of repeated small DAGs... one thing i'm struggling with is that loading a single asset takes a bit too long to work with interactively.... so half of the time i'm just doing a load operation, a quick massage, then storing back to a faster and more compact format (i.e. excel sources over to parquet)

Elijah Ben Izzy

05/20/2024, 5:54 PM

Chiming in, some Qs:

1. How do you plan to change these? Is the reasoning that you'd rather have it in config than code, or do you plan to be changing it regularly?
2. Does the output path always correspond to the input path?

IMO (and curious what you think): storing the configurations in code isn't crazy if they're relatively static. It's nice and self-documenting, and it forms a single source of truth. Furthermore, if they're parameterized, you can use a custom loader/saver to parameterize it. Then joining it into a single @move_asset that just calls both load/save is nice. Otherwise, if you need runtime parameterization, you can do more like what you did (wire in a config).

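from hamilton.function_modifiers import load_from, save_to

def move_asset(from_: str, to_: str):
    # decorator factory: from_ and to_ are the source and destination paths
    def decorator(fn):
        # wrap fn with save_to first, then load_from (arguments left elided)
        return load_from(...)(save_to(...)(fn))
    return decorator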
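If the output path always mirrors the input path (Q2 above), the from_/to_ pairs can be derived instead of listed by hand. A rough sketch under that assumption, using the gs://bkt/source/ and gs://bkt/raw/ layout from the first example; raw_path_for, asset_moves, and the second asset name are hypothetical:

```python
from typing import Dict, Iterable, Tuple


def raw_path_for(source_path: str, source_dir: str = "source", raw_dir: str = "raw") -> str:
    """Derive the parquet destination for an Excel source path.

    Assumes a layout of <root>/<source_dir>/<name>.<ext>, as in
    gs://bkt/source/asset1.xlsx -> gs://bkt/raw/asset1.parquet.
    """
    parent, filename = source_path.rsplit("/", 1)
    root, leaf = parent.rsplit("/", 1)
    if leaf != source_dir:
        raise ValueError(f"expected a path under '{source_dir}/', got {source_path!r}")
    stem = filename.rsplit(".", 1)[0]
    return f"{root}/{raw_dir}/{stem}.parquet"


def asset_moves(source_paths: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each asset name to its (from_, to_) pair."""
    moves = {}
    for src in source_paths:
        name = src.rsplit("/", 1)[1].rsplit(".", 1)[0]
        moves[name] = (src, raw_path_for(src))
    return moves


MOVES = asset_moves(["gs://bkt/source/asset1.xlsx", "gs://bkt/source/asset2.xlsx"])
```

Each (from_, to_) pair is what the from_ and to_ arguments of move_asset would receive, so the few dozen assets reduce to a single list of source paths.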
