# hamilton-help
s
@Jan Hurst it might be easier to jump on a call to talk through the code (here’s a link to schedule time). One question: is it difficult because you’re trying to manage processing differently named columns from each of the sources and wanting that to be configuration driven? That’s what it sounds like to me on first read. Otherwise, is @pipe useful here, perhaps?
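To make the @pipe idea concrete, here’s a rough sketch of what configuration-driven column handling per source could look like (node and mapping names are made up, not from your code):

```python
# Minimal sketch (hypothetical names) of configuration-driven column handling with @pipe:
# each step is applied, in order, to the decorated function's first argument before it runs.
import pandas as pd
from hamilton.function_modifiers import pipe, source, step


def _rename_columns(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Rename source-specific column names to a common schema."""
    return df.rename(columns=mapping)


def _drop_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values."""
    return df.dropna()


@pipe(
    # the mapping comes from config / another node named source_a_column_mapping (hypothetical)
    step(_rename_columns, mapping=source("source_a_column_mapping")),
    step(_drop_nulls),
)
def source_a_cleaned(source_a_raw: pd.DataFrame) -> pd.DataFrame:
    # @pipe has already applied the steps above to source_a_raw by this point
    return source_a_raw
```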
j
I think the desire/need to do a row-level sample in the middle of a DAG is messing me up a bit 😕
s
Oh yep, I could see that. Yeah, what I’ve seen work well is breaking the components up into blocks around transform types and what they need to operate over, e.g. operating on datasets (for filters, sorts, joins) or on columns. So you end up having fan-out, fan-in type steps. Without seeing more, and hearing @Elijah Ben Izzy’s thoughts, there could be a general extension of what we did with pyspark (e.g. read this query) to general pandas/dask dataframes, i.e. linearizing things in a way that makes it easy to go between operating over a dataframe and over columns and back, and then doing filters, etc.
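For illustration, a rough sketch of that block structure with made-up node names: dataset-level steps for loading and row-level filtering, column-level steps fanning out from the filtered frame, and a fan-in step joining the columns back:

```python
# Sketch (hypothetical node names) of the fan-out / fan-in block structure described above.
import pandas as pd


def raw_events(events_path: str) -> pd.DataFrame:
    """Dataset-level: load the raw data."""
    return pd.read_parquet(events_path)


def filtered_events(raw_events: pd.DataFrame, min_score: float) -> pd.DataFrame:
    """Dataset-level: row filter / sample before fanning out to columns."""
    return raw_events[raw_events["score"] >= min_score]


def normalized_amount(filtered_events: pd.DataFrame) -> pd.Series:
    """Column-level: operates on a single column of the filtered frame."""
    amount = filtered_events["amount"]
    return (amount - amount.mean()) / amount.std()


def enriched_events(filtered_events: pd.DataFrame, normalized_amount: pd.Series) -> pd.DataFrame:
    """Fan-in: join the column-level results back onto the dataframe."""
    return filtered_events.assign(normalized_amount=normalized_amount)
```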
e
So yeah, adding to what @Stefan Krawczyk said: I think there is room for more abstraction around this, e.g. you go from row-level filters to column-level transforms on occasion. The standard approach (as Stefan said) is to group them mid-way, although you can always let the indices be the source of truth (if you trust pandas indexes for series), or do a grouping operation to join and then a full filter. You can also track with `NaN` or a sentinel value and then do the filtering at the end; it depends on your use case (and there are trade-offs). I think the `with_columns` approach to auto-group would be really clean for this, mirroring exactly how the Spark one works. https://hamilton.dagworks.io/en/latest/reference/decorators/with_columns/?highlight=with_columns#with-columns.
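As a rough sketch of the sentinel idea (hypothetical names, plain pandas + Hamilton functions): flag the rows early, do the column-level work, and apply a single filter at the end:

```python
# Sketch of the "mark now, filter at the end" pattern: instead of dropping rows mid-DAG,
# flag them, use NaN as the sentinel in column-level work, and apply one filter at the end.
import numpy as np
import pandas as pd


def keep_mask(raw_events: pd.DataFrame, min_score: float) -> pd.Series:
    """Row-level predicate computed early, but not applied yet."""
    return raw_events["score"] >= min_score


def masked_amount(raw_events: pd.DataFrame, keep_mask: pd.Series) -> pd.Series:
    """Column-level transform; rows we plan to drop are set to NaN (the sentinel)."""
    return raw_events["amount"].where(keep_mask, np.nan)


def final_events(raw_events: pd.DataFrame, masked_amount: pd.Series, keep_mask: pd.Series) -> pd.DataFrame:
    """One filter at the end, after all column-level work has run."""
    return raw_events.assign(amount=masked_amount)[keep_mask]
```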
j
Unfortunately I have to do some manipulation before I can do my filter, and I’m trying to filter early because a step after the filter is moderately expensive... I think I’m just going to have to pay up, do the workloads in pieces, do all the pre-filter work up front, and serialize out to disk.
e
Makes sense. One approach that would help with that (and it looks like you’re already doing it) is breaking it up using materializers: with `to`/`from_` you can model it all as one DAG, run it as individual pieces with materializers injected in between, and use that to break it into tasks. I go over a similar workflow in this post; it’s talking about ML training and whatnot, but should be fairly relevant: https://blog.dagworks.io/p/separate-data-io-from-transformation.
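Roughly, that could look like the following (hypothetical module, node names, and paths; assumes the pandas parquet saver/loader is registered):

```python
# Hedged sketch of splitting the run with materializers: stage 1 persists the pre-filter work,
# stage 2 loads that artifact and runs the expensive downstream step, all on the same DAG.
from hamilton import driver
from hamilton.io.materialization import from_, to

import my_pipeline  # hypothetical module containing the transform functions

dr = driver.Builder().with_modules(my_pipeline).build()

# Stage 1: compute everything up to (and including) the filter and write it out.
dr.materialize(
    to.parquet(
        id="filtered_events_save",
        dependencies=["filtered_events"],
        path="filtered_events.parquet",
    ),
)

# Stage 2 (later / separate process): load the saved frame instead of recomputing it,
# then run and persist the expensive downstream step.
dr.materialize(
    from_.parquet(target="filtered_events", path="filtered_events.parquet"),
    to.parquet(
        id="expensive_result_save",
        dependencies=["expensive_result"],
        path="expensive_result.parquet",
    ),
)
```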