# hamilton-help
s
What do you have in mind to do? Short answer is yes, you can do it, but spark doesn’t speak polars, so you can’t use polars to define UDFs, for example (like you can with pandas).
The main caveat is that you can’t naively transform a pyspark dataframe into polars, because polars only operates on what fits in memory, while spark doesn’t have that limitation.
They both speak pyarrow, so there’s a way to bridge the two.
g
Makes sense. Our codebase currently does the {prep, train, predict, post-processing} all in pyspark for “historical reasons”. The code is in need of a major refactor though, so I wanted to move things piecewise to polars where possible. I have a PoC pipeline that uses spark -> arrow -> polars to extract the prep and do the rest in polars, so I wanted to use that as a testing ground for hamilton.
👍 1
s
Yep, that sounds pretty reasonable. Are you wanting to run it as a single Hamilton DAG, or at least logically be able to define one?
To bridge the two, which it sounds like you already do, you’d have a function along these lines:
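import polars as pl
import pyspark.sql as ps

def data_set_foo_pyspark(raw_data: ps.DataFrame) -> ps.DataFrame:
    # raw_data is a placeholder input; the spark-side prep would go here
    df = raw_data
    return df  # pyspark dataframe object

def data_set_foo_polars(data_set_foo_pyspark: ps.DataFrame) -> pl.DataFrame:
    # this will be a blocking call and force spark to compute things
    # then it'll bring it into memory; going through pandas is one option
    # (arrow-backed when spark.sql.execution.arrow.pyspark.enabled is true)
    df = pl.from_pandas(data_set_foo_pyspark.toPandas())
    return df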
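The parameter name data_set_foo_pyspark matching the upstream function name is what links the two steps: Hamilton wires the DAG by name. Here is a toy sketch of that idea using only the standard library. It is not Hamilton's actual code; `resolve`, `raw_rows`, and the list/dict stand-ins for the dataframes are all made up for illustration.

```python
import inspect

def data_set_foo_pyspark(raw_rows: list) -> list:
    # stand-in for the spark-side prep (a plain list instead of a pyspark dataframe)
    return [r for r in raw_rows if r is not None]

def data_set_foo_polars(data_set_foo_pyspark: list) -> dict:
    # stand-in for the blocking spark -> arrow -> polars conversion
    return {"value": list(data_set_foo_pyspark)}

def resolve(name, functions, inputs, cache=None):
    """Toy resolver: compute `name` by first computing each of its parameters."""
    cache = {} if cache is None else cache
    if name in inputs:
        # externally supplied value, nothing to compute
        return inputs[name]
    if name not in cache:
        fn = functions[name]
        # each parameter name is looked up as either an input or another function
        kwargs = {
            param: resolve(param, functions, inputs, cache)
            for param in inspect.signature(fn).parameters
        }
        cache[name] = fn(**kwargs)
    return cache[name]

functions = {f.__name__: f for f in (data_set_foo_pyspark, data_set_foo_polars)}
result = resolve("data_set_foo_polars", functions, {"raw_rows": [1, None, 3]})
print(result)
```

Asking for data_set_foo_polars pulls in data_set_foo_pyspark automatically, which is why keeping the conversion as its own function makes it easy to swap the prep step without touching anything downstream.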
👀 1
🙏 1
g
Not sure whether to have it as a single DAG yet, but having the prep + (blocking) extraction as a single DAG is compelling, as a lot of folks want to try different preprocessing experiments (the prep in our current codebase is too tightly coupled to expected strings/regexes for that to be easy). Also, I wasn’t sure if hamilton would do the spark -> arrow -> polars conversion for me in the above functions or if I needed to define it myself.
s
Cool. Yes, right now there’s nothing in the framework that does the conversion by default. So if you start with the explicit conversion via functions, we can then figure out how to “hoist” it up into the framework. There are a few ways we can achieve that.
🙏 1
g
Very cool, thanks for walking me through this!
s
np. As you develop you’ll probably get a feel for a few things, e.g. module structure, code reuse, etc., which would help feed the requirements for bringing it into the framework.
🙏 1