# hamilton-help
s
I'm converting a mostly pandas-based ML app to use pyspark. Does Hamilton do any serialization magic to help with "ModuleNotFound" errors when using pandas_udfs that contain references to other Python modules in the project? I know there are ways to inject Python files into the Spark context, but they are nasty (create a zip file with all dependencies, etc.), so I feel it discourages you from writing nice, modularized code.
e
Hey! So just to be clear — how are you using Hamilton/pyspark together? We have a few integrations (as you might have seen).
s
I’m just starting to play with it but was planning to use the pyspark integration as described here: https://hamilton.dagworks.io/reference/decorators/with_columns/
In this project there are many pre-existing transforms (in pandas) that make heavy use of a utility module. So I was going to turn the transforms into Hamilton @with_columns pandas_udf functions, but in playing with pandas_udfs on their own, I ran into ModuleNotFound errors because the utility module they call isn't available on the Spark server (I'm using remote connections).
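For context, the failing pattern looks roughly like this (module, function, and column names below are placeholders, not the real project code):

```python
# transforms.py -- minimal sketch of the failing pattern (names are placeholders)
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

import my_project.utils as utils  # shared utility module that lives in the repo


@pandas_udf(DoubleType())
def normalized_amount(amount: pd.Series) -> pd.Series:
    # The UDF body references `utils`, so the executor also has to be able to
    # import it. Locally that works; against a remote cluster it fails with
    # ModuleNotFoundError: No module named 'my_project'.
    return utils.normalize(amount)
```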
e
Ok, got it. Yep, certainly tricky. We don't have specific tooling in Hamilton (yet), but happy to help you think through it. The first approach is to get it to work manually (e.g. figure out which packages to send out), then we can pretty easily add something to crawl the module list and add them. Specifically, I'm thinking we can create an adapter that:
1. On graph initialization, scrapes the graph for all the used modules
2. Zips them up + ships them to the cluster using this: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.addPyFile.html#pyspark.SparkContext.addPyFile
3. Maybe messes with the pythonpath (worth playing with)
4. Then it's available when your code is called

So, yeah, first step is to manually call `addPyFile` to ensure it can work, then I can help you get set up with a hook (and we can contribute it back!)
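For the manual check in step 2, here's a rough sketch of what I mean (paths and package names are placeholders; adjust to your project layout):

```python
# driver-side setup -- rough sketch; paths/package names are placeholders
import shutil

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Zip the shared utility package into a single artifact
# (make_archive appends ".zip" to the base name and returns the full path).
zip_path = shutil.make_archive(
    "/tmp/my_project", "zip", root_dir="src", base_dir="my_project"
)

# Ship it to the executors; anything added via addPyFile becomes importable
# inside UDFs running on the workers.
spark.sparkContext.addPyFile(zip_path)
```

One caveat: since you mentioned remote connections, if that's Spark Connect the classic `sparkContext` may not be reachable from the client; newer PySpark versions expose `SparkSession.addArtifacts` for shipping .py/.zip files in that setup, so it's worth checking which applies to your cluster.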
s
Thanks, I will play around and then come back to you! Really, this is something you'd hope spark-connect or databricks-connect tooling would be handling by 2024… it should just be syncing your venv to the cluster when developing.
e
Ha! Yeah, we've solved this problem in many ways in the past and it's always deceptively tough 😢
It’s quite possible there’s tooling we don’t know about though, so worth searching around more
s
Yeah, it's hard because I will restructure code to make the udfs "pure pandas", and before long you're on the road to rewriting the entire shared utility package you wanted to reuse!
e
Heh, yeah, hopefully there's a nice middle ground