Sundeep Amrute
05/29/2024, 12:00 PM
Sundeep Amrute
05/29/2024, 2:42 PM
Elijah Ben Izzy
05/29/2024, 4:03 PM
If you know the columns beforehand you can use @parameterize, but if you don't it's a bit tricky.
So, do you know them all beforehand? If not, Hamilton (currently) doesn't have an easy way to define functions dynamically (except using parallelism, which is a bit overkill for this). One-hot encoding/pivot operations are where this generally falls apart, but in most other cases people have an idea of the data space.
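For the known-columns case, a minimal sketch of what @parameterize could look like (the column names and the fillna transform are hypothetical stand-ins, not from this thread):

import pandas as pd
from hamilton.function_modifiers import parameterize, source

# Hypothetical: the full set of columns is known up front.
KNOWN_COLUMNS = ["prefix_a", "prefix_b", "prefix_c"]

# One node per column: defines prefix_a_cleaned, prefix_b_cleaned, ...
@parameterize(
    **{f"{col}_cleaned": {"series_in": source(col)} for col in KNOWN_COLUMNS}
)
def cleaned(series_in: pd.Series) -> pd.Series:
    # Hypothetical per-column operation.
    return series_in.fillna(0)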
That said, I’d think about levels of what you want/difficulty to achieve:
1. Fns of pyspark dataframe -> pyspark dataframe (not using with_columns). Basically have a few different ones, then join them together. This has some potential inefficiencies (joining can be a complex operation), but it generally works. This way you can have functions for "groups" of columns (see the first sketch after this list).
2. Move to using with_columns: do the column-level operations where you can, and use a df -> df UDF otherwise (see the second sketch after this list).
3. Look at that pattern (2) and think through cleaner ways to do things/ways to make it parameterized.
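A rough sketch of level (1), with made-up column names: each function computes one group of columns off the raw dataframe, and a downstream node joins them back together.

import pyspark.sql as ps
from pyspark.sql import functions as F

def group_a_features(raw_df: ps.DataFrame) -> ps.DataFrame:
    # One hypothetical "group" of columns, keyed by id.
    return raw_df.select("id", (F.col("a") * 2).alias("a_doubled"))

def group_b_features(raw_df: ps.DataFrame) -> ps.DataFrame:
    # Another hypothetical group.
    return raw_df.select("id", (F.col("b") + 1).alias("b_plus_one"))

def all_features(
    group_a_features: ps.DataFrame, group_b_features: ps.DataFrame
) -> ps.DataFrame:
    # Join the groups back together; this is the potentially expensive step.
    return group_a_features.join(group_b_features, on="id")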
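And a sketch of level (2) using the with_columns decorator from Hamilton's pyspark plugin. The column names are hypothetical, and the parameter names (columns_to_pass, select) are from memory, so check the Hamilton docs for the exact signature.

import pandas as pd
import pyspark.sql as ps
from hamilton.plugins import h_spark

# Column-level transforms; Hamilton runs these as UDFs over the dataframe.
def a_doubled(a: pd.Series) -> pd.Series:
    return a * 2

def b_plus_a(b: pd.Series, a_doubled: pd.Series) -> pd.Series:
    return b + a_doubled

@h_spark.with_columns(
    a_doubled,
    b_plus_a,
    columns_to_pass=["a", "b"],  # upstream columns the transforms read
    select=["a_doubled", "b_plus_a"],  # results appended to the dataframe
)
def enriched_df(raw_df: ps.DataFrame) -> ps.DataFrame:
    return raw_df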
Does this make sense? Happy to draft out some code.
Sundeep Amrute
05/29/2024, 5:03 PM
Elijah Ben Izzy
05/29/2024, 5:20 PM
@for_each_column(
    apply_to=source("raw_df"),
    select_columns=regex("prefix_"),
)
def with_applied(series_in: pd.Series) -> pd.Series:
    return do_operation...
Then you could run the operation on each column that matches a prefix. This could be done with or without Hamilton (without Hamilton it would just change the function signature a bit).
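Without Hamilton, the same idea is just a loop over the matching columns; a minimal sketch with hypothetical names:

import re
from typing import Callable

import pandas as pd

def for_each_column(
    df: pd.DataFrame, pattern: str, fn: Callable[[pd.Series], pd.Series]
) -> pd.DataFrame:
    # Apply fn to every column whose name matches the regex.
    out = df.copy()
    for col in df.columns:
        if re.match(pattern, col):
            out[col] = fn(df[col])
    return out

# e.g. zero-fill every column starting with "prefix_" (hypothetical op):
# cleaned = for_each_column(raw_df, r"prefix_", lambda s: s.fillna(0))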
Sundeep Amrute
06/01/2024, 12:40 PM
Elijah Ben Izzy
06/01/2024, 2:45 PM