# hamilton-help
e
Hey, welcome! We’ve been working a decent amount with national labs in the US on just this problem, and I think it’s a pretty clean solution (it is, in part, a large reason we designed Hamilton). Re: Beam, I’m going to dig more later (haven’t used it in a bit), but there are a few modes of operating that I can think of:
1. Hamilton instead of Beam: call it in a loop, and manage any state you need. We’re actually digging into this streaming notion to make it more natural… read about it here: https://github.com/DAGWorks-Inc/hamilton/issues/19
2. Hamilton as UDFs inside Beam: you could use groups of Hamilton functions to represent discrete transforms. It’s an interesting notion, allowing you to represent pipelines in Beam and call out to Hamilton for large map-type operations.
3. Hamilton with Beam dataframes: just have your functions return/take these in! https://beam.apache.org/documentation/dsls/dataframes/overview/
So, options here, but I think (1) is pretty reasonable if it’s just a simple pipeline and you don’t need all the fancy Beam streaming features. Lots of customizability, ease of debugging, and readability by anyone who knows Python…
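Here’s a rough sketch of (1), a minimal loop around a Hamilton driver. Note that `my_transforms`, the output node names, and the polling helper are all hypothetical placeholders for whatever your project actually defines:

```python
import pandas as pd
from hamilton import driver

import my_transforms  # hypothetical module holding your Hamilton functions

# Config dict first, then the module(s) that define the DAG.
dr = driver.Driver({}, my_transforms)

def new_csv_paths():
    # Stand-in for however new files arrive (bucket listing, queue, etc.).
    yield "data/session_001.csv"

for path in new_csv_paths():
    raw_df = pd.read_csv(path)
    # Ask for the outputs you want; Hamilton computes just that subgraph.
    result = dr.execute(
        ["physiological_signals", "features"],  # hypothetical output nodes
        inputs={"raw_df": raw_df},
    )
    print(result)
```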
s
@Josh Buckland any particular reason to use beam?
j
@Stefan Krawczyk Right now it’s purely because I was looking at options for building out a data pipeline, and (at first glance) Dataflow on GCP looked like a nice “off the shelf” infra option. I’m basically the only SWE working on this in our group, so I’ll be wearing all the hats, and anything that minimises my DevOps load is a big help. At the same time, Hamilton appealed for the same reason: it’s more readable and closer to plain Python.
s
Makes sense. As @Elijah Ben Izzy mentioned, there are a few options. There are two mental models possible with Hamilton and Beam:
1. Use Hamilton within Beam. This could take a couple of forms, depending on how you model it and what “objects” you’re using (see the sketch below).
2. Use Hamilton to help generate Beam code. We haven’t built anything here, but theoretically we could.
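To make model (1) concrete, here’s a minimal sketch of wrapping a Hamilton driver in a Beam DoFn. As before, `my_transforms` and the output node name are hypothetical placeholders:

```python
import apache_beam as beam
import pandas as pd
from hamilton import driver

import my_transforms  # hypothetical module of Hamilton functions

class HamiltonFn(beam.DoFn):
    """Runs a group of Hamilton transforms over each element."""

    def setup(self):
        # Build the driver once per worker rather than once per element.
        self.dr = driver.Driver({}, my_transforms)

    def process(self, raw_df):
        # 'features' is a hypothetical output node defined in my_transforms.
        yield self.dr.execute(["features"], inputs={"raw_df": raw_df})

with beam.Pipeline() as p:
    (
        p
        | beam.Create([pd.DataFrame({"signal": [1.0, 2.0, 3.0]})])
        | beam.ParDo(HamiltonFn())
    )
```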
j
With that in mind, I’d be interested to know what people are using for Hamilton from an infrastructure side for simple workflows. I’m still trying to nail down what kind of data the biomedical engineers are expecting to produce, but currently it looks like there’ll be a wearable device that will probably dump a CSV into a bucket through some sort of REST API, and that CSV will contain a time series of raw data. There might be some kind of data processing and unsupervised learning on these raw signals. I’ll then need to convert those to physiological signals through a specific algorithm. Finally, the physiological time series will be passed to another step for feature extraction and a machine learning step. It feels like it could be good to handle this with something like Dataflow, but to be honest, it just needs an explainable set of steps that can be defined somewhere. I can’t see the amount of data ever venturing into big data territory.
Hamilton within Beam feels like it could be a nice approach. It means that more complex transformations can be much more understandable, which is really important when publishing the research. A lot of this work is still at the whiteboard stage, so if people have any recommendations for simple infrastructure solutions with Hamilton, I’m all ears. I can definitely see it being really useful for the feature engineering step.
s
Yeah, so two things:
1. Hamilton is a great way to help you structure your Python code for this.
2. Hamilton isn’t infrastructure. I’ve seen people use CircleCI, or anything else that can run Python code, to run Hamilton. So you could use Dataflow/Beam as that place to run Python, or something simpler.
The nice thing about using Hamilton is that you can reuse the same Hamilton code in multiple contexts, making it easy to run identical code in a notebook as well as in something like Beam.
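For instance, the shared code is just a module of plain Python functions, where (in Hamilton) each function’s name is an output and its parameters are its dependencies. This module is a made-up placeholder for your actual transforms:

```python
# my_transforms.py -- hypothetical module; the function names define the DAG.
import pandas as pd

def physiological_signals(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Convert raw sensor readings into physiological signals (placeholder math)."""
    return raw_df * 0.5

def features(physiological_signals: pd.DataFrame) -> pd.DataFrame:
    """Extract rolling-window features for the downstream ML step."""
    return physiological_signals.rolling(10).mean()
```

The same module can then be driven from a notebook, a script, or a Beam job without changing a line of it.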
The only wrinkle would be whether you need aggregations over data that is too big to fit in memory, but that doesn’t sound like it would be an issue for you?
j
I don’t think we’ll ever need anything too big for memory. At least not until we hit the end-stage machine learning step, but that’s a separate question anyway.
That sounds really good. I think I’ll focus on using Hamilton for a lot of the data transformation stuff, and then when it comes to scaling it out to a full pipeline, it sounds like there are a bunch of options here, ranging from lambdas and queues through to Dataflow. Do you have a citation on your website that I can use when we come to publish our work?
👍 1
s
> Do you have a citation on your website that I can use when we come to publish our work?
Yep - https://github.com/dagworks-inc/hamilton#citing-hamilton
j
Brilliant. I’ll add that to my reference manager now
🙌 1
s
Otherwise, there are a few Hamilton users in the London area (IIRC), e.g. @James Marvin, @Mathew Savage; I’ll try to organize a meetup at some point this year!
e
I’ll be there in early June as well if we all want to grab a 🍺