# hamilton-help
s
This message was deleted.
e
Ok, so you’ve actually stumbled on an inherently difficult/interesting question 🙂 That said, I think you have the right approach for this. A lot, IMO, depends on the size of the data and the production-criticality of the pipeline.

As you pointed out, the map operations are very easy: you shouldn’t have to make any changes, save for passing `pd.Series` of length 1. For the joins, I actually think it’s not all that different, you just have to know a few things. Assuming a pretty standard left join (basically augmenting a dataset with extra information), you can effectively do what you did above. Let’s break it into two types of data: (1) the streaming index (`scene` above) and (2) the reference data (`title` + other metadata above). You want to consume (1) in chunks, whereas you want (2) loaded up and queryable.

If (2) is small enough, you can likely load it into memory and use a join. All you have to do is keep the upstream function cached, and you should be good to go. I think something like duckdb will be your friend here, allowing you to do lightning-fast/low-memory joins.

If (2) is not small (likely), you’ll have to do what you did above, which is still a join; it’s just loading dynamically rather than statically. With duckdb you might be able to ensure you’re using the same code, but even if not, you should be good to go. I’d just call it a join so that you can pass a series in. Streaming pipelines should (IMO) work for 1 -> n data points, not just 1. It’s up to you whether it’s a join against an in-memory table or a query using IN syntax plus a list of values in SQL; it’s worth messing around to see if duckdb can make the two look the same. For all map operations, you should be in a good position to use the same code.

So, the TL;DR of how I would approach this problem (knowing that I’m not as close to the data as you are), with a rough sketch below:
1. Treat all incoming data as `pd.Series`/dataframes. You’ll take basically no performance hit (even with series of size 1), and you’ll be able to easily call the same code.
2. If the “reference” data (i.e. the data you’re joining against) is small, load it all up; you can do the join in memory (against a series/df of size 1) using the exact same code.
3. If it’s too big, do the join against a SQL table. Explore using duckdb, or just use the IN syntax in SQL.

Hope this helps! I’d be curious what others say about this; it would be awesome if we could gather best practices for reference here!
🔥 1
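To make that concrete, here is a rough sketch of the three cases in plain pandas + duckdb. Everything here is illustrative: `scene`, `title`, the `title_metadata` table, the file paths, and the columns are placeholders taken from this thread, not a real schema.

```python
import duckdb
import pandas as pd


# --- map operations: identical code for a full batch or a length-1 streaming chunk ---
def title_length(title: pd.Series) -> pd.Series:
    return title.str.len()


# --- case (2): small reference data, loaded once and joined in memory ---
def title_metadata() -> pd.DataFrame:
    # hypothetical source; in practice you'd cache/memoize this upstream function
    return pd.read_parquet("title_metadata.parquet")


def scene_enriched(scene: pd.DataFrame, title_metadata: pd.DataFrame) -> pd.DataFrame:
    # a plain left join works the same for a 1-row chunk or a million-row batch
    return scene.merge(title_metadata, on="title", how="left")


# --- case (3): reference data too big for memory, so push the join into SQL ---
def scene_enriched_from_db(scene: pd.DataFrame) -> pd.DataFrame:
    con = duckdb.connect("features.duckdb")  # assumes a local duckdb file holding title_metadata
    con.register("scene_chunk", scene)       # expose the in-memory chunk to SQL
    # only the reference rows matching the chunk's keys are pulled in; an alternative is
    #   SELECT * FROM title_metadata WHERE title IN (SELECT title FROM scene_chunk)
    # followed by a pandas merge
    return con.execute(
        """
        SELECT c.*, m.genre, m.release_year
        FROM scene_chunk c
        LEFT JOIN title_metadata m USING (title)
        """
    ).df()
```

The same functions run unchanged whether `scene` carries one event or a whole batch, which is the point of the TL;DR above.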
и
Thank you @Elijah Ben Izzy for the response! DuckDB can work. One difficulty here is how to keep this DB up to date all the time. It would be possible to store this DB locally with the feature pipeline, because the tables I need to store won't become too large in the near future.
e
yeah, so duckdb can load from other SQL engines, but it's a question of performance: https://duckdb.org/2022/09/30/postgres-scanner.html. I would time it / look at DB loads. That said, I'm just conjecturing; you should be able to prototype this out pretty easily to see how it works. Really excited to see what you come up with! If you can get it working with minimal code changes, that's definitely worth sharing out; I think others could benefit from what you've learned!
👍 1
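For reference, a minimal sketch of the postgres scanner from that blog post looks roughly like this (the connection string and table name are made-up placeholders); it is worth timing this against simply materializing the reference table into a local duckdb file.

```python
import duckdb

con = duckdb.connect()
# install and load the extension described in the linked post
con.execute("INSTALL postgres_scanner;")
con.execute("LOAD postgres_scanner;")
# attach the upstream Postgres database; connection details here are hypothetical
con.execute("CALL postgres_attach('dbname=features host=localhost user=postgres');")
# queries now read straight from Postgres, so the reference data stays up to date
print(con.execute("SELECT count(*) FROM title_metadata").df())
```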