# hamilton-help
c
some other stuff:
• we'd like to avoid just combining all the SQLs into 1 big scary SQL since our data scientists may not always need all of those columns, and we're also switching to Hamilton to move away from our old featurestore that is just 1 giant scary SQL
• I saw there was an `async` decorator but the README said `async` doesn't play well with other decorators, and we're using `extract_columns` to break the dataframes from queries into individual columns
e
Good morning! So yes, this is a pretty common use-case. Just to be clear — you run a bunch of SQL operations that each load a dataframe of sorts, then join/manipulate them in some way, correct?
`async` could work (although it's still a little undeveloped). Our approach for parallelization has generally been to delegate to other frameworks. So, the `ray` and `dask` graph adapters both naturally do horizontal parallelism. The idea is that it's a quick swap for the driver, and you get the power of distributed systems. Some resources:
• Quick post about scaling with ray
• More information about horizontal scaling with ray/dask
• dask hello_world
• ray hello_world
I think this should happily cover your case: both can be set up pretty easily to run on whatever cores/compute you have. That said, we’re also thinking of having another simple multiprocessing adapter.
s
@Culver McWhirter another idea (as a stop gap) would be to split things into two drivers:
1. One that uses Ray/Dask to parallelize and load the data.
2. Then one that does the downstream computation, passing in the output of the first driver.
It is not always ideal to use Ray/Dask for everything, because the serialization cost between processes can outweigh any parallelization benefits. We currently don’t support arbitrary parallelization of a DAG, but your use case is definitely a motivating one to provide functionality for.
Would you mind creating a github issue with your use case and what you’d like to see happen please? That would help.
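A rough sketch of the two-stage split suggested above, in plain Python rather than Hamilton. It swaps Ray/Dask for a standard-library `ThreadPoolExecutor` just to stay self-contained, and every name in it (`load_users`, `load_spend`, `run_loading_stage`, `run_compute_stage`) is hypothetical. With Hamilton, each stage would presumably be its own driver, and the first stage's output would be passed in as inputs to the second.

```python
from concurrent.futures import ThreadPoolExecutor


# Stage 1 stand-ins: each loader represents one independent SQL query.
def load_users() -> dict:
    return {"user_id": [1, 2, 3]}


def load_spend() -> dict:
    return {"spend": [10.0, 20.0, 30.0]}


def run_loading_stage() -> dict:
    """Run all loaders in parallel and merge their columns into one dict."""
    loaders = [load_users, load_spend]
    with ThreadPoolExecutor(max_workers=len(loaders)) as pool:
        futures = [pool.submit(loader) for loader in loaders]
        columns = {}
        for future in futures:
            columns.update(future.result())
    return columns


def run_compute_stage(columns: dict) -> dict:
    """Stage 2: downstream computation, run in-process on stage 1 output."""
    total = sum(columns["spend"])
    return {"spend_share": [s / total for s in columns["spend"]]}


loaded = run_loading_stage()
features = run_compute_stage(loaded)
print(features)
```

Only the loading stage pays for parallel execution here. The second stage works on already-materialized data in a single process, so nothing has to be serialized between workers for the downstream computation.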
c
sorry for the very late response, i wasn't sure exactly what i wanted, so i wanted to put in more effort and see if i could actually get something working
this is what i ended up doing (i'm sure it's not ideal/perfect, but it does work):
^ this is pretty specific to our code, but i wonder if it could be made more general by throwing each Hamilton func into a new thread with this?
loop = asyncio.get_running_loop()
loop.run_in_executor(THREAD_POOL, some_func)
would love to hear your thoughts
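One way the "throw each func into a new thread" idea could be generalized is a small decorator around `loop.run_in_executor`. This is a minimal sketch, not anything from Hamilton: `threaded` is a hypothetical helper, `THREAD_POOL` and `some_func` only mirror the names in the two lines above, and the thread does not confirm whether functions wrapped this way plug into Hamilton unchanged.

```python
import asyncio
import functools
import time
from concurrent.futures import ThreadPoolExecutor

THREAD_POOL = ThreadPoolExecutor(max_workers=4)  # hypothetical shared pool


def threaded(func):
    """Turn a blocking function into a coroutine function run in THREAD_POOL."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        # run_in_executor only forwards positional args, so bind kwargs here.
        call = functools.partial(func, *args, **kwargs)
        return await loop.run_in_executor(THREAD_POOL, call)

    return wrapper


@threaded
def some_func(name: str) -> str:
    time.sleep(0.2)  # stand-in for a blocking query
    return f"loaded {name}"


async def main():
    start = time.perf_counter()
    # Both calls wait in worker threads, so the two sleeps overlap.
    results = await asyncio.gather(some_func("a"), some_func("b"))
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

Because the wrapper is itself a coroutine function, callers can `await` it like any other async function while the blocking work happens off the event loop.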
e
Nice! Glad it works. OK, a bit confused — are you using the async decorator? If so, then why are you getting the event loop and adding it into another thread? Shouldn’t you get the benefits of async on a single thread?
Also, if you want, we’d be happy to get on a call and talk through your use-case!
s
yeah @Culver McWhirter we also have an AsyncDriver for Hamilton - which we can jump on a call to explain too. But a little more context on where you want this to run would help 🙂
c
this is my first dive into async with Python, so i might be doing stuff a little wrong
the reason i ended up having to put `run_query()` and `get_results()` in new threads is because they're not async functions themselves, so i couldn't `await` them
sorry i left out that important part. I am using `AsyncDriver`, the code snippet I posted was the funcs i pass to the driver
so this would be the other script that actually imports those functions and runs the driver
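To illustrate the point above about `run_query()` and `get_results()` not being awaitable: calling a blocking function directly inside a coroutine holds the event loop, so two such calls run back to back even under `asyncio.gather`. The sketch below contrasts that with handing the call to a worker thread. The `get_results` here is only a stand-in (the real one is not shown in this thread), and `blocking_node`, `threaded_node` and `timed` are hypothetical names.

```python
import asyncio
import time


def get_results(query_id: int) -> str:
    """Stand-in for a blocking call; the real get_results() is not shown."""
    time.sleep(0.2)
    return f"rows for query {query_id}"


async def blocking_node(query_id: int) -> str:
    # Calling the sync function directly holds the event loop for 0.2s.
    return get_results(query_id)


async def threaded_node(query_id: int) -> str:
    # asyncio.to_thread (Python 3.9+) runs the call in a worker thread.
    return await asyncio.to_thread(get_results, query_id)


async def timed(node) -> float:
    """Run two instances of `node` concurrently and return the elapsed time."""
    start = time.perf_counter()
    await asyncio.gather(node(1), node(2))
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(timed(blocking_node))
threaded_elapsed = asyncio.run(timed(threaded_node))
print(f"blocking: {blocking_elapsed:.2f}s, threaded: {threaded_elapsed:.2f}s")
```

The blocking version should take roughly the sum of the two waits, while the threaded version should take roughly the longer of the two, which is why async alone on a single thread does not help when the underlying calls are synchronous.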
s
@Culver McWhirter do you have time now to jump on a quick call?
Thanks for your time @Culver McWhirter here’s the gist of code we walked through https://gist.github.com/skrawcz/677daa5e72cba8b9c26d91728468f9e0