# hamilton-help
v
Hi guys, what are good and easy ways to execute a Hamilton pipeline on a schedule or triggered by an event (for example, a new file is written to an S3 bucket, or a new message is published via MQTT), without using orchestrators like Dagster, Prefect, or Airflow? I am thinking about using a simple webserver (Flask or FastAPI), so that I am able to execute the pipeline via simple HTTP requests. And finally, I'd have a simple Python script or Node-RED flow that listens for the specific events (or waits for a defined time period) and fires a request against the webserver endpoint.
t
@Elijah Ben Izzy is on vacation for a few days, but he's currently writing a blog post on this topic! For scheduled execution, one potential approach is to use the GitHub Actions cron scheduler to launch Hamilton execution on AWS Lambda. For event-triggered execution, a webserver makes sense! We have a write-up on Hamilton + FastAPI for online feature engineering (ETL) and a briefer guide on Hamilton + FastAPI for RAG
v
Thanks for the response. The Hamilton pipelines (simple ETL pipelines) will run on edge devices on our shopfloor. Therefore GitHub is not an option. But I could probably use crontab on the edge device to either run the pipeline directly or fire a request to the webserver, considering that I have several pipelines, which run event-triggered or scheduled.
💡 1
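For reference, crontab entries for the two styles mentioned above might look like this; all paths, schedules, and endpoint names are hypothetical placeholders:

```
# Run the hourly aggregation pipeline directly as a Python process:
0 * * * * /usr/bin/python3 /opt/pipelines/run_aggregation.py >> /var/log/pipelines/aggregation.log 2>&1
# Or poke the local webserver endpoint every five minutes:
*/5 * * * * curl -s -X POST http://localhost:8000/run/ingest_etl
```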
In combination with the Hamilton UI, this seems to be a very good setup for our edge devices. The following would be a very common pattern: messages/data are published on the databus (MQTT). A first ETL is triggered that processes the message and saves it into MinIO (also running on the edge). Once an hour, a second ETL loads the data from MinIO, does some aggregation, and saves it into our data lake.
t
That approach makes a lot of sense to me! @Stefan Krawczyk should be up in a few hours to share his insights
v
Cool, thanks a lot. Let's see if I get the Hamilton UI running on our edge devices. 😅
One more consideration: I think we don't even need a webserver. We can simply use the internal databus (MQTT) instead. The MQTT client listens to a specific topic or topics. When a new message is published on one of the topics, a Hamilton pipeline is executed. Which pipeline is executed depends on the topic name or the message payload
💡 1
s
@Volker Lorrmann yep, Hamilton just needs a Python process to run in. So if you can run a Python process and trigger it somehow, you should be able to make it run Hamilton. What did you have in mind regarding the Hamilton UI? Would you want to run the whole system on the edge? Or do you perhaps mean you'd like the telemetry to be emitted from the edge to some central place?
v
I want to run all of this on the edge. The idea is to deploy JupyterLab and the Hamilton UI on the edge. This allows us to develop our pipelines in the "production" environment, and the Hamilton UI is used to monitor the pipelines.
e
So just to chime in — I'm not super familiar with edge deployments (but I think I get the basic concepts). The Hamilton UI is a server + UI, meaning you can have an instance on a central server that stores data for tracking. Is that phoning home (edge to central server) allowed?
v
I think this could be a problem. Not all edge devices are able to connect to a central server. At least I have to discuss this with our IT guys. That's why I currently prefer running the Hamilton UI on each edge device. However, having one central server that tracks all Hamilton pipelines from several edge devices would indeed be awesome.
e
Got it, yep, makes sense. We can discuss possibilities internally! Tracking centrally would be an easier experience (tag each run with the edge device it came from), as opposed to looking at each individual instance of the UI, but it makes sense that there might not be a network connection…
v
Thanks. Looking forward to implementing all this. Hope to have a first prototype running next week.
🙌 1
One thing I am not yet sure about is whether we need something like a job queue (possible candidates are Celery, APScheduler, or Huey) for pipelines that run longer than the data ingestion rate.
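The core idea behind the job-queue question — decoupling fast event ingestion from slow pipeline runs — can be sketched with just the standard library's `queue` and a worker thread. This is only an illustration of the pattern under the assumption of a single process: unlike Celery or Huey it is not persistent, so queued jobs are lost on restart.

```python
# Sketch: in-process job queue so event handlers return immediately
# even when pipeline runs take longer than the message rate.
import queue
import threading

jobs: queue.Queue = queue.Queue()
results: list[str] = []

def worker() -> None:
    while True:
        job = jobs.get()
        if job is None:  # sentinel to shut the worker down
            jobs.task_done()
            break
        name, payload = job
        # Placeholder: run the named Hamilton pipeline on the payload here.
        results.append(f"{name} processed {payload}")
        jobs.task_done()

# A single worker serializes pipeline runs; more workers would allow parallelism.
threading.Thread(target=worker, daemon=True).start()

# Event callbacks just enqueue and return, regardless of pipeline duration.
jobs.put(("ingest_etl", "msg-1"))
jobs.put(("ingest_etl", "msg-2"))
jobs.join()  # wait for outstanding jobs (for demonstration only)
```

If jobs must survive a device restart, that is where a persistent queue like Huey (backed by SQLite/Redis) earns its keep.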
e
So the UI is currently built around bi-directional communication, but that's mostly just the SDK, and really only important at the beginning of a run. Also, it's open-source! Otherwise yes, there are some interesting orchestration questions; I'd be curious about your requirements (run on a strict cadence versus just run/clean up with a high watermark…). Think through your needs and feel free to reach out with more details; we can help you think it through!
🙏 1
❤️ 1
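To spell out the "high watermark" alternative to a strict cadence mentioned above: each run processes only records newer than the last processed timestamp, then advances the watermark, so the schedule can be loose without reprocessing or missing data. A minimal hypothetical sketch, with in-memory state standing in for a small file or database on the edge device:

```python
# Sketch: incremental processing with a high watermark.
state = {"watermark": 0}  # in real use, persist this across runs

def run_incremental(records: list[tuple[int, str]]) -> list[str]:
    """records are (timestamp, payload) pairs; returns the payloads processed."""
    new = [r for r in records if r[0] > state["watermark"]]
    processed = [payload for _, payload in new]  # placeholder for a Hamilton run
    if new:
        state["watermark"] = max(ts for ts, _ in new)
    return processed
```

Running this every hour (or whenever cron fires) is then safe: a delayed or doubled run simply picks up whatever is new since the watermark.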