Slackbot  02/02/2023, 8:08 AM

Elijah Ben Izzy  02/02/2023, 4:08 PM
In the scene above, (2) is the reference data — the title + other metadata.
(1) you want to be able to consume in chunks, whereas (2) you want to have loaded/be able to query.
If (2) is small enough, you can likely load it into memory and use a join. All you have to do is keep the upstream function cached, and you should be good to go. I think something like duckdb will be your friend here — allowing you to do lightning-fast/low-memory joins.
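A rough sketch of what that in-memory join could look like (untested; the `events`/`titles` frames and the `item_id` column are made-up names, and it leans on duckdb's ability to query pandas DataFrames in the caller's scope by variable name):

```python
import duckdb
import pandas as pd

# (2) the small reference data. In practice this would come from your cached
# upstream function rather than being built inline.
titles = pd.DataFrame({
    "item_id": [1, 2, 3],
    "title": ["a", "b", "c"],
})

def enrich(events: pd.DataFrame, titles: pd.DataFrame) -> pd.DataFrame:
    # duckdb can see pandas DataFrames in the calling frame by name,
    # so this is a plain SQL join over two in-memory frames.
    return duckdb.query(
        "SELECT e.*, t.title FROM events e JOIN titles t USING (item_id)"
    ).to_df()

# The same code handles a streaming chunk of 1 row or a batch of many.
chunk = pd.DataFrame({"item_id": [2], "value": [42.0]})
print(enrich(chunk, titles))
```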
If (2) is not small (likely), you’ll have to do what you did above, which is still a join — it’s just loading dynamically rather than statically. I think you might be able to, with duckdb, ensure you’re using the same code. But even if not, you should be good to go — I’d just call it a join so that you can pass a series in. Streaming pipelines should (IMO) work for 1 -> n data points, not just 1. It’s up to you whether it’s a join with an in-memory table or a query using IN syntax + a list of values in SQL — worth messing around to see if you can use duckdb to get them to be similar.
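And a sketch of the "not small" case: push the filter down to the database with an IN query built from whatever ids are in the incoming chunk (untested; the `reference.duckdb` file and the `titles`/`item_id` names are hypothetical):

```python
import duckdb
import pandas as pd

# Hypothetical database file holding the big reference table.
con = duckdb.connect("reference.duckdb")

def load_reference(item_ids: pd.Series) -> pd.DataFrame:
    # Parameterized IN (...) query. Same shape whether the chunk has 1 id or many
    # (assumes a non-empty chunk).
    placeholders = ", ".join("?" for _ in item_ids)
    return con.execute(
        f"SELECT item_id, title FROM titles WHERE item_id IN ({placeholders})",
        item_ids.tolist(),
    ).df()

def enrich(events: pd.DataFrame) -> pd.DataFrame:
    # Dynamically load just the reference rows we need, then join in pandas.
    reference = load_reference(events["item_id"])
    return events.merge(reference, on="item_id", how="left")
```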
For all map operations, you should be in a good position to use the same code.
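For example (a tiny untested sketch; `normalize_title` is just an illustrative map step):

```python
import pandas as pd

def normalize_title(title: pd.Series) -> pd.Series:
    # A map operation: element-wise, so 1 row vs. n rows makes no difference.
    return title.str.strip().str.lower()

# Batch of n.
print(normalize_title(pd.Series(["  Foo ", "BAR"])))
# Streaming "chunk" of 1: exact same code path.
print(normalize_title(pd.Series(["  Foo "])))
```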
So, the TL;DR of how I would approach this problem (knowing that I’m not as close to the data as you are) is:
(1) Treat all incoming data as pd.Series/DataFrames — you’ll take basically no performance hit (even with series of size 1), and you’ll be able to easily call the same code
(2) if the “reference” data (e.g. the data you’re joining against) is small, load it all up — you can do the join in memory (against a series/df of size 1) using the exact same code
(3) if it’s too big, do the join against an SQL table. Explore using duckdb or just use the IN syntax in SQL.
Hope this helps! I’d be curious what others say about this — it would be awesome if we could gather best practices for reference here!

Игорь Хохолко  02/02/2023, 6:16 PM

Elijah Ben Izzy  02/02/2023, 6:21 PM