# ingestion
b
I'm not familiar with data lineage and wanted to explore and test out how it works for PostgreSQL. How is a table derived from another one? Is it as simple as
```sql
CREATE TABLE new_table
  AS (SELECT * FROM old_table);
```
I tried that but there was no lineage shown.
g
Hey @better-orange-49102 - you need to emit data lineage yourself separately; we do not attempt to parse SQL at the moment
I would take a look at our docs for lineage in airflow: https://datahubproject.io/docs/metadata-ingestion/#lineage-with-airflow
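For reference, the Airflow integration described in those docs works by declaring inlets and outlets on tasks, which the lineage backend then emits to DataHub. A rough sketch, assuming the `datahub_provider` package from the linked docs (newer plugin releases expose the same `Dataset` entity under a different import path); the DAG, task, and table names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset  # import path varies by plugin version

with DAG(
    dag_id="postgres_ctas_example",  # hypothetical DAG name for illustration
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    create_new_table = BashOperator(
        task_id="create_new_table",
        bash_command='psql -c "CREATE TABLE new_table AS (SELECT * FROM old_table);"',
        # The lineage backend reads these to emit old_table -> new_table lineage
        inlets=[Dataset("postgres", "mydb.public.old_table")],
        outlets=[Dataset("postgres", "mydb.public.new_table")],
    )
```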
b
ah, so only Airflow DAGs can emit data lineage? the rest of the data sources do not, yes?
g
You can also emit lineage from other sources like Dbt and Superset
from where are you executing that sql?
are you just running it on your postgres command line?
as long as we can derive the lineage information from the source, we can emit it.
b
it was a test, so i just created a table, then derived another table from it
g
for example, bigquery has query logs with source and destination tables
b
then had a DataHub recipe extract both tables
g
if Postgres has an API that will tell us what tables another table was derived from, we can use that to derive lineage
or, if you use an orchestration tool like airflow or dbt to schedule your table creation, you can get lineage for free
you can also always emit it yourself manually as a last resort 🙂
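If you do go the manual route, a minimal sketch with the acryl-datahub Python emitter could look like the following; the GMS endpoint and table names are assumptions, so adjust them to your deployment:

```python
from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
from datahub.emitter.rest_emitter import DatahubRestEmitter

# old_table -> new_table, mirroring the CREATE TABLE ... AS SELECT example above
lineage_mce = make_lineage_mce(
    upstream_urns=[make_dataset_urn("postgres", "mydb.public.old_table")],
    downstream_urn=make_dataset_urn("postgres", "mydb.public.new_table"),
)

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed local GMS endpoint
emitter.emit(lineage_mce)
```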
b
my team doesn't use Airflow, so probably we would have to emit our own
alright, thanks!
Or, apply lineage in a transformation
how does your team currently create new tables? all via the postgres command line?
if you store your table schemas under version control, that could be another good place to emit lineage
b
i believe we used some psql trigger, based on the date of the data
a
@better-orange-49102, this one is actually exactly what you are looking for. https://github.com/linkedin/datahub/tree/master/contrib/metadata-ingestion/haskell
a little bit difficult to understand though. This works: the contributor has used `nix` to configure the local `haskell` environment. Once you have `haskell`, you can run the script, which parses a `sql` DDL and generates an `mce` with `lineage` information; in the end, it publishes the `mce` to Kafka.
s
just to share what we've been doing in our company. we used a lot of SQL Server procedures in our legacy database. we put several lines of YAML-like tagging using specific start and stop keywords. Inside the tags we define a few metadata fields such as TABLEIN and TABLEOUT, and then we run SQL to get all those procedure definitions to be fed into a Python script that parses the tags and generates MCE objects. perhaps you can take a similar approach using the Postgres information schema or the show-trigger output and parse it into MCEs
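For anyone who wants to try the same trick against Postgres, here is a rough sketch of that tag-parsing idea. Only TABLEIN/TABLEOUT come from the description above; the start/stop markers, the sample procedure text, and the other names are made-up placeholders, and it assumes the acryl-datahub and PyYAML packages:

```python
import re

import yaml

from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce

# Made-up sample: a procedure definition carrying a YAML block between placeholder markers.
PROCEDURE_DEFINITION = """
-- LINEAGE-START
-- TABLEIN: [mydb.public.orders, mydb.public.customers]
-- TABLEOUT: mydb.public.order_summary
-- LINEAGE-END
CREATE OR REPLACE FUNCTION build_order_summary() ...
"""


def parse_lineage_tags(definition: str) -> dict:
    """Extract the YAML block between the start/stop markers and parse it."""
    match = re.search(r"LINEAGE-START(.*?)LINEAGE-END", definition, re.DOTALL)
    if not match:
        return {}
    # Drop the SQL comment prefixes before handing the block to the YAML parser.
    block = "\n".join(line.lstrip("- ").strip() for line in match.group(1).splitlines())
    return yaml.safe_load(block) or {}


def build_lineage_mce(tags: dict):
    """Turn the TABLEIN/TABLEOUT entries into a DataHub lineage MCE."""
    return make_lineage_mce(
        upstream_urns=[make_dataset_urn("postgres", t) for t in tags.get("TABLEIN", [])],
        downstream_urn=make_dataset_urn("postgres", tags["TABLEOUT"]),
    )


if __name__ == "__main__":
    tags = parse_lineage_tags(PROCEDURE_DEFINITION)
    print(build_lineage_mce(tags))
```

The resulting MCE can then be sent with the REST or Kafka emitter, as in the earlier sketch.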
a
It's a neat solution. Will you guys also take care of column-level lineage with this approach? I know DataHub doesn't have the column-level lineage feature yet.