Hamilton Open Source

<@U03LJT9C7F1> took a stab at integrating pandera -- would love some thoughts! <https://github.com/stitchfix/hamilton/pull/147>. What's your github name? Happy to add as a reviewer.

Hi <@U03AV0GG20H>! This is really cool. I’ll take a deeper look later but at first glance it looks good!

Overall it looks good! Would love more examples of how this applies in practice and how results are persisted over time (but I guess that’s a Hamilton overall workflow rather than specific to Pandera)

that’s what a plugin in with whylabs would be for :wink:

right now persistence would be up to the user to do  — not a concern of Hamilton

Otherwise an example using what’s proposed — <https://github.com/stitchfix/hamilton/blob/0f82dc862412023eaf3a152a3444cd1c98f12115/examples/data_quality/quality_flow.py>

Ah gotcha! Just want to make sure I didn’t miss anything!

I guess you can technically capture the pandera report and store it somewhere as well

Super cool. I noticed you’re looking at Flyte integration as well and we are working on this: <https://github.com/flyteorg/flytekit/pull/1100>

<!channel> released! Would love some feedback: <https://twitter.com/hamilton_os/status/1547720103667441665>

Curious — what’s the size of the data? And why are you worried about performance particularly? In my opinion if performance isn’t an issue its probably not worth optimizing for yet… Materialized views generally seem like a win here — my guess is that postgres will be able to handle it well unless things are crazy (massive data…). But, given that it can fit in pandas, I think you’ll be good (unless its append-only). I think (and this is where I might get into trouble from a bit of guesswork), that the penalty you’ll take is the latency due to the time it takes to materialize the views.

Re: denormalization, it completely depends on how you’re querying it. I generally dislike making denormalized tables, but if it makes querying easier then it can be worth it. Often done in analytics which seems like your use case.

<@U03AV0GG20H> is stronger in this space — he’ll be able to dig more into it.

Also curious — hamilton is the first part of your transform — then youre saving it into postgres, right?

And I wouldn’t be surprised if a purpose-built dashboarding tool could help with the aggregation problem so you have to deal with it less, but that’s more outside of my area of expertise.

Largest table I saw so far was 433mil records. The thing about mvs is there's a need to refresh them just like a table everytime the pipeline runs, and mv are part of rapid development cycle so they constantly get updates. I'm just not sure how denormalization helped with downstream views. Some folks have suggested building truncated tables each time for the pandas loading, that copy vs inserts can improve performance. To your point we power a front end web app on this postgres that's used internally

You are correct, there's time outs on refresh of the materialized views in our CI often, it's becoming a bottleneck

Hamilton might be able to help improve the loading if we aren't designing the base tables as efficiently as we could, keen to hear you thoughts on how that helps over just using pandas

:thinking_face:  <@U04Q3ELBKJ5> would you mind drawing out the flow you describe happening please, along with where/what would be updated from run to run? I want to make sure I’m understanding it correctly, since that’s not clear to me.

Regarding materialized views or not, I’m not a postgres expert, but to my knowledge, you commonly use them when you have a query across some tables that is run often for reporting (e.g. like <https://www.postgresql.org/docs/current/rules-materializedviews.html|this example> in the postgres docs) and you want it to be cached effectively.  So whether they’re a good idea or not depends on the query and as you mention the need to propagate updates?

Regarding denormalized tables, you do them because you don’t want to join at query time - the trade-off is space and potentially speed of changes to existing rows. It’s a common practice if you’re using it to power a dashboard as it likely makes that query simpler for aggregations along some dimensions.

hey <@U03AV0GG20H>!

let's just take the following example,
using sqlalchemy+pandas we do the following steps

1. pull from some postgres DB into data frame
2. join this table to others, this should have the widest number of cols, and most number of rows
3. save this elsewhere on a unified DB of postgres as the data warehouse
4. tack on materialized views which aggregate/apply metrics logic onto the base table
Occasional we get change in shape of the base table, but we expect it to become less and less likely as the data matures. We expect the MVs to constantly change, but what we don't want to is to have excessive joins on top of eachother hence step #2

Swap out #1 with internal apis or other source systems, but the rest of the flow remains the same

<@U04Q3ELBKJ5> great that helps. Follow up, does this run incrementally (i.e. only on new data?) or does it recompute the world when it is run?

Also up for discussion, current incrementally loaded

Okay thanks. Yeah I am trying to think of alternative ways to do this. I optimize for the least amount of moving parts, and I think CI is a good means to run and trigger this until it doesn’t scale. Modeling the extraction of data sources and transformations required seems like a good fit for Hamilton. Then using postgres seems fine, though postgres isn’t necessarily optimized for analytics workloads (as opposed to a columnar oriented database). Whether postgres is a good long term choice depends on how much growth and headroom you have in the database. If it’s 12 months+ then probably fine not to worry about it now?

Yeah we are going to move to a proper db this year, in the mean time for our volume of data this"just works" enough

On that note I'm having a bit of trouble getting the example load datas in the GitHub, what would it look like at a high level, 3 files? Loader, transforms and the run file?

Hey! Curious — what’s not working with the github examples? But yeah, that’s one way to organize it:
• `data_loaders.py` — functions to load your data. These functions are usually called nouns (E.G. `my_data`, or `dataset_foo`.). If you’re using pandas they’ll be dataframes, often with `@extract_columns`
• `transforms.py`  — downstream from `data_loaders` — whatever you want
• `run.py` — runs your transforms, using a driver

Note that if you have a lot of functions you’ll want to start utilizing subpackages for each, and this shape might change.

<@U04Q3ELBKJ5> happy to jump on a call once you’re up and working given the timezone difference. Just let us know.