# hamilton-help
s
@here I have to port a bunch of ugly pandas data cleanup/feature extraction code to use pyspark/databricks and was hoping to use Hamilton to organize it. The code relies on dynamic column names quite a bit (pivot the df to create n new columns, then apply the same transforms to each of those new columns based on some column-name prefix). Is the dataframe -> dataframe UDF method discussed in the pyspark README basically my only option for this? I'd love it if the UDF name could somehow match a pattern of column names instead of a single column name, so that the DAG would execute it all for me:

`def startswith_foo(startswith_bar: pd.Series) -> htypes.column[pd.Series, int]: return startswith_bar + 1.0`

If the df had columns like `bar_col1` and `bar_col2`, the transform would automatically run on both of them. Is this something that already exists and is supported for the pyspark use case?
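For concreteness, a dataframe -> dataframe function that applies the same operation to every column matching a prefix might look like this in plain pyspark (the `bar_` prefix and the `+ 1.0` operation just mirror the example above and are purely illustrative):

```python
import pyspark.sql as ps
import pyspark.sql.functions as F


def bar_transformed(raw_df: ps.DataFrame) -> ps.DataFrame:
    """Apply the same operation to every column whose name starts with `bar_`."""
    for c in raw_df.columns:
        if c.startswith("bar_"):
            raw_df = raw_df.withColumn(c, F.col(c) + F.lit(1.0))
    return raw_df
```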
Possibly this is what `@extract_columns` does, but I'm not sure.
e
Hey! This is a good question. It's tricky, as Hamilton does not allow (much) dynamism in the way of column names. If you know them beforehand it's one thing (you can use `@parameterize`), but if you don't it's a bit trickier. So, do you know them all beforehand? If not, Hamilton (currently) doesn't have an easy way to express dynamic functions (except using parallelism, which is a bit overkill for this). One-hot encoding/pivot operations are where this generally falls apart, but in most other cases people have an idea of the data space. That said, I'd think about levels of what you want versus the difficulty of achieving it:

1. Functions of pyspark dataframe -> pyspark dataframe (not using `with_columns`). Basically have a few different ones, then join them together. This has some potential inefficiencies (joining can be an expensive operation), but it generally works. This way you can have one function per "group" of columns (see the sketch after this list).
2. Move to using `with_columns`: do the column-level operations where you can, and fall back to the df -> df UDF otherwise.
3. Look at that pattern and think through cleaner ways to do things / ways to make it parameterized.

Does this make sense? Happy to draft out some code.
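A rough sketch of option 1, as plain typed functions that Hamilton would wire together by parameter name (the `bar_`/`baz_` prefixes, the operations, and the `id` join key are all placeholder assumptions):

```python
import pyspark.sql as ps
import pyspark.sql.functions as F


def bar_features(raw_df: ps.DataFrame) -> ps.DataFrame:
    """Transforms for one group of columns (everything prefixed `bar_`), plus the join key."""
    bar_cols = [c for c in raw_df.columns if c.startswith("bar_")]
    return raw_df.select("id", *[(F.col(c) + F.lit(1.0)).alias(c) for c in bar_cols])


def baz_features(raw_df: ps.DataFrame) -> ps.DataFrame:
    """Transforms for another group of columns (everything prefixed `baz_`)."""
    baz_cols = [c for c in raw_df.columns if c.startswith("baz_")]
    return raw_df.select("id", *[F.log1p(F.col(c)).alias(c) for c in baz_cols])


def all_features(bar_features: ps.DataFrame, baz_features: ps.DataFrame) -> ps.DataFrame:
    """Join the groups back together; this join is where the potential inefficiency lives."""
    return bar_features.join(baz_features, on="id")
```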
s
Thanks for the detailed reply! As written now, the transform part of the code doesn't know all the potential columns. What can probably be done, though, is to do most of the transforms before the pivot (in row-space), and then have a pivot step which is dataframe -> dataframe. That would let me have a nice `@with_columns` block and then one transform without it at the end. I'm liking the idea of Hamilton, as alternatives like pyspark.ml.pipeline just seem clunky and overkill for what I'm doing.
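Roughly, that final dataframe -> dataframe pivot step might look like this (a minimal sketch; `cleaned_df`, `id`, `category`, and `value` are placeholder names):

```python
import pyspark.sql as ps
import pyspark.sql.functions as F


def pivoted(cleaned_df: ps.DataFrame) -> ps.DataFrame:
    """The one dataframe -> dataframe step: this is where the dynamically-named columns appear."""
    return cleaned_df.groupBy("id").pivot("category").agg(F.sum("value"))
```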
e
Yep! I think that makes sense as a first step. One idea that I was thinking about is a decorator that would apply an operation to each column…
```python
# sketch of a decorator that doesn't exist (yet): apply the same
# column-level function to every column whose name matches a pattern
@for_each_column(
    apply_to=source("raw_df"),
    select_columns=regex("prefix_"),
)
def with_applied(series_in: pd.Series) -> pd.Series:
    return series_in + 1.0  # stand-in for whatever per-column operation you need
```
Then you could run the operation on each column that matches a prefix. This could be done with or without Hamilton (without Hamilton it would just mess with the signature a bit; something like the helper sketched below).
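Without Hamilton, a plain-function version of the same idea might look something like this (a hypothetical pandas-level helper, just to show the shape):

```python
import re
from typing import Callable

import pandas as pd


def apply_to_matching_columns(
    df: pd.DataFrame, pattern: str, fn: Callable[[pd.Series], pd.Series]
) -> pd.DataFrame:
    """Apply `fn` to every column whose name matches `pattern`; leave the rest untouched."""
    out = df.copy()
    for col in df.columns:
        if re.match(pattern, col):
            out[col] = fn(df[col])
    return out


# e.g. apply_to_matching_columns(raw_df, r"prefix_", lambda s: s + 1.0)
```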
s
That is a cool idea… ofc when you start doing these wildcards, you can quickly get to a point where you can no longer understand the dataflow.
e
Yep, it's a bit of a trade-off: how much do you represent, statically, in code? And how much do you use fancier decorators that produce a ton of columns?