DataHub

Hello everyone,
How does DataHub create the lineage for Redshift objects? In particular, I want to know where DataHub retrieves the information for the lineage between the S3 files and the Redshift table. Is there a particular view that is being ingested? Or is DataHub parsing the queries on the table?

We use STL_LOAD_COMMITS to get that information: <https://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_COMMITS.html>

Thank you, <@UV14447EU>! Follow-up question: what determines which files get used? It appears to be the same file in different folders (one for each day/time that it is uploaded).

We join this table with the query history.
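
To make the join concrete, here is a minimal Python sketch of the idea, not DataHub's actual implementation. It assumes rows shaped like STL_LOAD_COMMITS (with `query` and `filename` columns) and rows shaped like a query-history table such as STL_QUERY (with `query` and `querytxt` columns). The bucket, table name, sample rows, and the function `join_loads_to_queries` are all hypothetical.

```python
# Hypothetical rows shaped like STL_LOAD_COMMITS: one row per loaded file.
load_commits = [
    {"query": 101, "filename": "s3://example-bucket/2022-11-07/orders.csv"},
    {"query": 102, "filename": "s3://example-bucket/2022-11-08/orders.csv"},
]

# Hypothetical rows shaped like a query-history table (for example STL_QUERY).
query_history = [
    {"query": 101, "querytxt": "COPY public.orders FROM 's3://example-bucket/2022-11-07/'"},
    {"query": 102, "querytxt": "COPY public.orders FROM 's3://example-bucket/2022-11-08/'"},
]


def join_loads_to_queries(load_commits, query_history):
    """Pair each loaded file with the text of the query that loaded it.

    This is an inner join on the shared `query` id: load rows without a
    matching query-history row are dropped.
    """
    text_by_query = {row["query"]: row["querytxt"] for row in query_history}
    pairs = []
    for row in load_commits:
        querytxt = text_by_query.get(row["query"])
        if querytxt is not None:
            # Strip whitespace as a precaution in case the column is padded.
            pairs.append((row["filename"].strip(), querytxt))
    return pairs


pairs = join_loads_to_queries(load_commits, query_history)
for filename, querytxt in pairs:
    print(filename, "<-", querytxt)
```

In this sketch each load produces its own row with its own full path, so the same file name uploaded into a different day/time folder comes out as a separate entry.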

Is there anything that can be done to filter/hide the duplicates?