# ingestion
b
I'm not familiar with data lineage and wanted to explore and test out how it works for PostgreSQL. How is a table derived from another one? Is it as simple as
```sql
CREATE TABLE new_table
  AS (SELECT * FROM old_table);
```
I tried that but there was no lineage shown.
g
Hey @better-orange-49102 - you need to emit data lineage yourself separately; we do not attempt to parse SQL at the moment
I would take a look at our docs for lineage in airflow: https://datahubproject.io/docs/metadata-ingestion/#lineage-with-airflow
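For reference, the Airflow integration described in those docs works by declaring inlets and outlets on tasks, which the lineage backend then emits to DataHub. A rough sketch, assuming the `datahub_provider` package from the linked docs (newer plugin releases expose the same `Dataset` entity under a different import path); the DAG, task, and table names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset  # import path varies by plugin version

with DAG(
    dag_id="postgres_ctas_example",  # hypothetical DAG name for illustration
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    create_new_table = BashOperator(
        task_id="create_new_table",
        bash_command='psql -c "CREATE TABLE new_table AS (SELECT * FROM old_table);"',
        # The lineage backend reads these to emit old_table -> new_table lineage
        inlets=[Dataset("postgres", "mydb.public.old_table")],
        outlets=[Dataset("postgres", "mydb.public.new_table")],
    )
```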
b
ah, so only Airflow DAGs can emit data lineage? the rest of the data sources do not, yes?
g
You can also emit lineage from other sources like Dbt and Superset
from where are you executing that sql?
are you just running it on your postgres command line?
as long as we can derive the lineage information from the source, we can emit it.
b
it was a test, so i just created a table, then derived another table from it
g
for example, bigquery has query logs with source and destination tables
b
then had a DataHub recipe extract both tables
g
if Postgres has an API that will tell us what tables another table was derived from, we can use that to derive lineage
or, if you use an orchestration tool like airflow or dbt to schedule your table creation, you can get lineage for free
you can also always emit it yourself manually as a last resort 🙂
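If you do go the manual route, a minimal sketch with the acryl-datahub Python emitter could look like the following; the GMS endpoint and table names are assumptions, so adjust them to your deployment:

```python
from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
from datahub.emitter.rest_emitter import DatahubRestEmitter

# old_table -> new_table, mirroring the CREATE TABLE ... AS SELECT example above
lineage_mce = make_lineage_mce(
    upstream_urns=[make_dataset_urn("postgres", "mydb.public.old_table")],
    downstream_urn=make_dataset_urn("postgres", "mydb.public.new_table"),
)

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed local GMS endpoint
emitter.emit(lineage_mce)
```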
b
my team doesn't use Airflow, so probably we would have to emit our own
alright, thanks!
Or, apply lineage in a transformation
how does your team currently create new tables? all via the postgres command line?
if you store your table schemas under version control, that could be another good place to emit lineage
b
i believe we used some psql trigger, based on the date of the data
a
@better-orange-49102, this one is actually exactly what you are looking for. https://github.com/linkedin/datahub/tree/master/contrib/metadata-ingestion/haskell
a little bit difficult to understand though. This works: the contributor has used `nix` to configure the local `haskell` environment. Once you have `haskell`, you can run the script, which parses a `sql` DDL and generates an `mce` with `lineage` information; in the end, it publishes the `mce` to Kafka.
s
just to share what we've been doing in our company. we used a lot of SQL Server procedures in our legacy database. we put several lines of YAML-like tagging using specific start and stop keywords. Inside the tags we define a few metadata fields such as TABLEIN and TABLEOUT, and then we run SQL to get all those procedure definitions to be fed into a Python script that parses the tags and generates MCE objects. perhaps you can take a similar approach using the Postgres information schema or the show-trigger output and parse it into MCEs
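For anyone who wants to try the same trick against Postgres, here is a rough sketch of that tag-parsing idea. Only TABLEIN/TABLEOUT come from the description above; the start/stop markers, the sample procedure text, and the other names are made-up placeholders, and it assumes the acryl-datahub and PyYAML packages:

```python
import re

import yaml

from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce

# Made-up sample: a procedure definition carrying a YAML block between placeholder markers.
PROCEDURE_DEFINITION = """
-- LINEAGE-START
-- TABLEIN: [mydb.public.orders, mydb.public.customers]
-- TABLEOUT: mydb.public.order_summary
-- LINEAGE-END
CREATE OR REPLACE FUNCTION build_order_summary() ...
"""


def parse_lineage_tags(definition: str) -> dict:
    """Extract the YAML block between the start/stop markers and parse it."""
    match = re.search(r"LINEAGE-START(.*?)LINEAGE-END", definition, re.DOTALL)
    if not match:
        return {}
    # Drop the SQL comment prefixes before handing the block to the YAML parser.
    block = "\n".join(line.lstrip("- ").strip() for line in match.group(1).splitlines())
    return yaml.safe_load(block) or {}


def build_lineage_mce(tags: dict):
    """Turn the TABLEIN/TABLEOUT entries into a DataHub lineage MCE."""
    return make_lineage_mce(
        upstream_urns=[make_dataset_urn("postgres", t) for t in tags.get("TABLEIN", [])],
        downstream_urn=make_dataset_urn("postgres", tags["TABLEOUT"]),
    )


if __name__ == "__main__":
    tags = parse_lineage_tags(PROCEDURE_DEFINITION)
    print(build_lineage_mce(tags))
```

The resulting MCE can then be sent with the REST or Kafka emitter, as in the earlier sketch.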
a
It's a neat solution. Will you guys also take care of column-level lineage with this approach? I know DataHub doesn't have the column-level lineage feature yet.