# hamilton-help
s
I'm converting a mostly pandas-based ML app to use pyspark. Does Hamilton do any serialization magic to help with "ModuleNotFound" errors when using pandas_udfs that contain references to other Python modules in the project? I know there are ways to inject Python files into the Spark context, but they are nasty (create a zip file with all dependencies, etc.), so I feel it discourages you from writing nice, modularized code.
e
Hey! So just to be clear — how are you using Hamilton/pyspark together? We have a few integrations (as you might have seen).
s
I’m just starting to play with it but was planning to use the pyspark integration as described here: https://hamilton.dagworks.io/reference/decorators/with_columns/
In this project there are many pre-existing transforms (in pandas) that make heavy use of a utility module. So I was going to turn the transforms into Hamilton @with_columns pandas_udf functions, but in playing with pandas_udfs on their own, I ran into ModuleNotFound errors because the utility module they call isn't available on the Spark server (I'm using remote connections).
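For context, the failing pattern looks roughly like this (module, function, and column names below are placeholders, not the real project code):

```python
# transforms.py -- minimal sketch of the failing pattern (names are placeholders)
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

import my_project.utils as utils  # shared utility module that lives in the repo


@pandas_udf(DoubleType())
def normalized_amount(amount: pd.Series) -> pd.Series:
    # The UDF body references `utils`, so the executor also has to be able to
    # import it. Locally that works; against a remote cluster it fails with
    # ModuleNotFoundError: No module named 'my_project'.
    return utils.normalize(amount)
```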
e
Ok, got it. Yep, certainly tricky. We don't have specific tooling in Hamilton (yet), but happy to help you think through it. The first approach is to get it to work manually (e.g. figure out which packages to send out), then we can pretty easily add something to crawl the module list and add them. Specifically, I'm thinking we can create an adapter that:
1. On graph initialization, scrapes the graph for all the used modules
2. Zips them up + ships them to the cluster using this: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.addPyFile.html#pyspark.SparkContext.addPyFile
3. Maybe messes with the pythonpath (worth playing with)
4. Then it's available when your code is called

So, yeah, first step is to manually call `addPyFile` to ensure it can work, then I can help you get set up with a hook (and we can contribute it back!)
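For the manual check in step 2, here's a rough sketch of what I mean (paths and package names are placeholders; adjust to your project layout):

```python
# driver-side setup -- rough sketch; paths/package names are placeholders
import shutil

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Zip the shared utility package into a single artifact
# (make_archive appends ".zip" to the base name and returns the full path).
zip_path = shutil.make_archive(
    "/tmp/my_project", "zip", root_dir="src", base_dir="my_project"
)

# Ship it to the executors; anything added via addPyFile becomes importable
# inside UDFs running on the workers.
spark.sparkContext.addPyFile(zip_path)
```

One caveat: since you mentioned remote connections, if that's Spark Connect the classic `sparkContext` may not be reachable from the client; newer PySpark versions expose `SparkSession.addArtifacts` for shipping .py/.zip files in that setup, so it's worth checking which applies to your cluster.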
s
Thanks, I will play around and then come back to you! Really, this is something you'd hope spark-connect or databricks-connect tooling would be handling by 2024… it should just be syncing your venv to the cluster when developing.
e
Ha! Yeah, we've solved this problem in many ways in the past and it's always deceptively tough 😢
It’s quite possible there’s tooling we don’t know about though, so worth searching around more
s
Yeah, it's hard because I will restructure code to make the udfs "pure pandas", and before long you're on the road to rewriting the entire shared utility package you wanted to reuse!
e
Heh, yeah, hopefully there's a nice middle ground