# hamilton-help
e
Hey, welcome! We’ve been working a decent amount with national labs in the US on just this problem, and I think it’s a pretty clean solution (it is, in part, a large reason we designed Hamilton). Re: Beam, I’m going to dig more later (haven’t used it in a bit), but there are a few modes of operating that I can think of:
1. Hamilton instead of Beam: call it in a loop, and manage any state you need. We’re actually digging into this streaming notion to make it more natural… read about it here: https://github.com/DAGWorks-Inc/hamilton/issues/19
2. Hamilton as UDFs inside Beam: you could use groups of Hamilton functions to represent discrete transforms. It’s an interesting notion, allowing you to represent pipelines in Beam and call out to Hamilton for large map-type operations.
3. Hamilton with Beam dataframes: just have your functions return/take these in! https://beam.apache.org/documentation/dsls/dataframes/overview/
So, options here, but I think (1) is pretty reasonable if it’s just a simple pipeline and you don’t need all the fancy Beam streaming features. Lots of customizability, ease of debugging, and readability by anyone who knows Python…
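Here’s a rough sketch of (1), a minimal loop around a Hamilton driver. Note that `my_transforms`, the output node names, and the polling helper are all hypothetical placeholders for whatever your project actually defines:

```python
import pandas as pd
from hamilton import driver

import my_transforms  # hypothetical module holding your Hamilton functions

# Config dict first, then the module(s) that define the DAG.
dr = driver.Driver({}, my_transforms)

def new_csv_paths():
    # Stand-in for however new files arrive (bucket listing, queue, etc.).
    yield "data/session_001.csv"

for path in new_csv_paths():
    raw_df = pd.read_csv(path)
    # Ask for the outputs you want; Hamilton computes just that subgraph.
    result = dr.execute(
        ["physiological_signals", "features"],  # hypothetical output nodes
        inputs={"raw_df": raw_df},
    )
    print(result)
```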
s
@Josh Buckland any particular reason to use beam?
j
@Stefan Krawczyk Right now it’s purely because I was looking at options for building out a data pipeline, and (at first glance) Dataflow on GCP looked like a nice “off the shelf” infra option. I’m basically the only SWE working on this in our group, so I’ll be wearing all the hats, and anything that minimises my DevOps load is a big help. At the same time, Hamilton appealed for the same reason: it’s more readable and closer to plain Python.
s
Makes sense. As @Elijah Ben Izzy mentioned, there are a few options. There are two mental models possible with Hamilton and Beam:
1. Use Hamilton within Beam. This could take a couple of forms, depending on how you model it and what “objects” you’re using (see the sketch below).
2. Use Hamilton to help generate Beam code. We haven’t built anything here, but theoretically we could.
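To make model (1) concrete, here’s a minimal sketch of wrapping a Hamilton driver in a Beam DoFn. As before, `my_transforms` and the output node name are hypothetical placeholders:

```python
import apache_beam as beam
import pandas as pd
from hamilton import driver

import my_transforms  # hypothetical module of Hamilton functions

class HamiltonFn(beam.DoFn):
    """Runs a group of Hamilton transforms over each element."""

    def setup(self):
        # Build the driver once per worker rather than once per element.
        self.dr = driver.Driver({}, my_transforms)

    def process(self, raw_df):
        # 'features' is a hypothetical output node defined in my_transforms.
        yield self.dr.execute(["features"], inputs={"raw_df": raw_df})

with beam.Pipeline() as p:
    (
        p
        | beam.Create([pd.DataFrame({"signal": [1.0, 2.0, 3.0]})])
        | beam.ParDo(HamiltonFn())
    )
```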
j
With that in mind, I’d be interested to know what people are using for Hamilton from an infrastructure side for simple workflows. I’m still trying to nail down what kind of data the biomedical engineers are expecting to produce, but currently it looks like there’ll be a wearable device that will probably dump a CSV into a bucket through some sort of REST API, and that CSV will contain a time series of raw data. There might be some kind of data processing and unsupervised learning on these raw signals. I’ll then need to convert those to physiological signals through a specific algorithm. Finally, the physiological time series will be passed to another step for feature extraction and a machine learning step. It feels like it could be good to handle this with something like Dataflow, but to be honest, it just needs an explainable set of steps that can be defined somewhere. I can’t see the amount of data ever venturing into big data territory.
Hamilton within Beam feels like it could be a nice approach. It means that more complex transformations can be much more understandable, which is really important when publishing the research. A lot of this work is still at the whiteboard stage, so if people have any recommendations for simple infrastructure solutions with Hamilton, I’m all ears. I can definitely see it being really useful for the feature engineering step.
s
Yeah, so two things:
1. Hamilton is a great way to help you structure your Python code for this.
2. Hamilton isn’t infrastructure. I’ve seen people use CircleCI, or anything else that can run Python code, to run Hamilton. So you could use Dataflow/Beam as that place to run Python, or something simpler.
The nice thing about using Hamilton is that you can reuse the same Hamilton code in multiple contexts, making it easy to run identical code in a notebook as well as in something like Beam.
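For instance, the shared code is just a module of plain Python functions, where (in Hamilton) each function’s name is an output and its parameters are its dependencies. This module is a made-up placeholder for your actual transforms:

```python
# my_transforms.py -- hypothetical module; the function names define the DAG.
import pandas as pd

def physiological_signals(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Convert raw sensor readings into physiological signals (placeholder math)."""
    return raw_df * 0.5

def features(physiological_signals: pd.DataFrame) -> pd.DataFrame:
    """Extract rolling-window features for the downstream ML step."""
    return physiological_signals.rolling(10).mean()
```

The same module can then be driven from a notebook, a script, or a Beam job without changing a line of it.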
The only wrinkle would be whether you need aggregations over data that is too big to fit in memory, but that doesn’t sound like it would be an issue for you?
j
I don’t think we’ll ever need anything too big for memory. At least not until we hit the end-stage machine learning step, but that’s a separate question anyway.
That sounds really good. I think I’ll focus on using Hamilton for a lot of the data transformation stuff, and then when it comes to scaling it out to a full pipeline, it sounds like there are a bunch of options here, ranging from lambdas and queues through to Dataflow. Do you have a citation on your website that I can use when we come to publish our work?
👍 1
s
> Do you have a citation on your website that I can use when we come to publish our work?
Yep - https://github.com/dagworks-inc/hamilton#citing-hamilton
j
Brilliant. I’ll add that to my reference manager now
🙌 1
s
Otherwise, there are a few Hamilton users in the London area (IIRC), e.g. @James Marvin, @Mathew Savage; I’ll try to organize a meetup at some point this year!
e
I’ll be there in early June as well if we all want to grab a 🍺