# getting-started
i
Hi folks, is field level lineage supported at the moment? I see there was an RFC for it, and there is code in the repo corresponding to the RFCs, though the docs indicate it's still coming soon. https://datahubproject.io/docs/rfc/active/1841-lineage/field_level_lineage
Also: do you have an opinion on where the MCEs would come in that describe the field set lineage? The reason I ask is that I've been experimenting with Airflow integration to Datahub and using the lineage backend to provide dataset lineage data. This has worked out nicely so far; the DAG declares the lineage of the data processed by each task. Do you expect that field level lineage could be defined in a DAG as well? Or would it come from somewhere else?
l
The models exist for Dataset fields. We need to do a bit more work for retrieving them. You can theoretically add emitters for column level lineage similar to dataset lineage but we want to go a bit further than that and support SQL parsing as well. From a timeline standpoint, we will make progress on the backend side in May & June. You can expect support to land in early-mid July.
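To make this concrete, here's a minimal sketch of what a field-level lineage record could carry (upstream field -> downstream field). The class and payload shape below are illustrative assumptions, not the actual DataHub models referred to above:

```python
# Illustrative only: NOT the real DataHub model classes, just an
# approximation of what a field-level lineage edge could look like.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class FieldLineage:
    upstream_dataset: str    # e.g. a dataset URN
    upstream_field: str
    downstream_dataset: str
    downstream_field: str

def to_payload(edges: List[FieldLineage]) -> Dict[str, list]:
    """Group per-field edges into one payload keyed by downstream dataset."""
    payload: Dict[str, list] = {}
    for e in edges:
        payload.setdefault(e.downstream_dataset, []).append(
            {"from": f"{e.upstream_dataset}.{e.upstream_field}",
             "to": e.downstream_field}
        )
    return payload

edges = [
    FieldLineage("raw.users", "user_id", "clean.users", "id"),
    FieldLineage("raw.users", "email_addr", "clean.users", "email"),
]
payload = to_payload(edges)
```

An emitter for column-level lineage would presumably serialize something of this shape into an MCE, the same way dataset-level lineage is emitted today.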
Would love to get more specific requirements and use-cases from you as well.
i
Good to know, thanks. At the moment we've got an in-house orchestration system which we're considering replacing with Airflow, though it's not been confirmed yet. It hooks in with another in-house system we have for managing data pipelines within our data lake; that system runs Jupyter notebooks to do data transformations. I'm thinking of the following:
• Airflow -> use the Airflow lineage backend to declare dataset level lineage
• In-house system -> send detailed MCEs over to Datahub showing the details of each dataset: users/columns etc.
I wasn't sure where the column level lineage would be sent in from: Airflow, or our in-house system using Jupyter notebooks. Still very early in our design process, so I reckon we're pretty flexible. My thought was that declaring the column level lineage (like you do in DAGs for dataset lineage) would potentially be brittle. Reasoning:
• we have a Python task that takes some data, transforms the columns, and puts them somewhere else
• we declare what those input and output columns are
• we change the Python task to tweak the columns
• we forget to change the declaration of the input/output columns, so the lineage is no longer accurate
Perhaps there's a smarter way of doing this; if so I'm all ears 🙂
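One way to soften the drift problem described above is to have the task validate its declarations at runtime, so a forgotten update fails fast instead of silently producing stale lineage. A minimal sketch; all names and column lists below are illustrative:

```python
# Sketch of a runtime guard against declaration drift: the task fails
# fast if the declared input/output columns no longer match the data.
# Names here are illustrative, not a DataHub or Airflow API.
def check_columns(declared, actual, label):
    missing = set(declared) - set(actual)
    if missing:
        raise ValueError(f"{label} columns drifted: missing {sorted(missing)}")

DECLARED_INPUTS = ["user_id", "email_addr"]
DECLARED_OUTPUTS = ["id", "email"]

def transform(row):
    # Validate the declared inputs against the actual row ...
    check_columns(DECLARED_INPUTS, row.keys(), "input")
    out = {"id": row["user_id"], "email": row["email_addr"].lower()}
    # ... and the declared outputs against what was produced.
    check_columns(DECLARED_OUTPUTS, out.keys(), "output")
    return out
```

This doesn't make the declaration automatic, but it at least turns "the declaration is wrong" into a visible failure rather than inaccurate lineage.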
l
If the transformation is being done through custom Python code and not standardized SQL (which is where parsing comes in), I agree that it would be brittle to decouple declaration of input/output columns from transformation logic.
You could potentially make the Python task interface stronger so that you capture the input and output columns more explicitly. Checking with @gray-shoe-75895 to see if he has any better ideas based on the Airflow lineage backend impl.
g
Agree that it certainly makes sense to keep the lineage declarations coupled with the actual transformations to whatever extent possible; getting that information into the Airflow lineage backend or emitting it directly to Datahub is then a matter of building out the plumbing. SQL parsing is nice when possible, but unfortunately SQL dialects are incredibly complex and data transformations are often done with a variety of tools.
A couple of ideas here. One is to colocate the lineage information with the transforms themselves, and then try to expose that information to the Airflow operator so it can construct the inlets/outlets appropriately. With some DB systems, e.g. BigQuery, you can also consume the "audit log" information, which has dataset-level lineage included.
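A minimal sketch of the colocation idea, assuming a hypothetical decorator (not an Airflow or DataHub API) that attaches the declared columns to the transform function itself, where the operator could read them back when building inlets/outlets:

```python
# Hypothetical pattern: keep the column-level lineage declaration on
# the transform itself, so there is one place to update when the
# transform changes. declares_lineage is an invented name, not an API.
from functools import wraps

def declares_lineage(inputs, outputs):
    """Attach declared input/output columns to the transform function."""
    def wrap(fn):
        @wraps(fn)
        def inner(*args, **kwargs):
            return fn(*args, **kwargs)
        inner.lineage = {"inputs": inputs, "outputs": outputs}
        return inner
    return wrap

@declares_lineage(inputs=["raw.users.email_addr"],
                  outputs=["clean.users.email"])
def normalize_emails(rows):
    return [{"email": r["email_addr"].lower()} for r in rows]

# An operator (or custom lineage backend) could then read
# normalize_emails.lineage to populate inlets/outlets.
```

The declaration still has to be maintained by hand, but it now lives next to the code it describes, which makes drift easier to catch in review.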
i
I'll get back to you once I get a chance to look at our use case a bit more closely. Appreciate the input though
Our Jupyter tasks don't look at SQL tables, so the SQL parsing wouldn't be applicable here. We do file -> many file transformations. Those files could be CSV/Parquet/another format. I'm pretty sure we can't do any sort of automatic inference from those reliably, so declaring the lineage close to the transformation should suffice. When you were referring to Airflow inlets/outlets, was that for the dataset level lineage? Or were you thinking that could expand to accommodate field level lineage as well?
l
It is only dataset level lineage at the moment, not field-level.
i
That's fine. Just to clarify, do you plan to allow for declaration of field level lineage when it can't be inferred automatically? I realise it will be a bit brittle potentially, though we could live with it.
l
Yes we will support emission of field level lineage
i
Hi folks, bringing this lineage thread back to life. Is column level lineage still in the plan for the June/July release? Regardless of whether the answer is yes or no, do you have a mock-up/screenshot of what you would expect that lineage to look like on the user interface? Reason I ask is because we're evaluating various catalogs that would cover column level lineage; it would help to have an indication of what the functionality would look like in Datahub.

We're leaning away from defining the lineage within Airflow, and going back to @gray-shoe-75895's suggestion of having it closer to our code. We'd likely build a small library to send those lineage definitions over to Datahub.

Another use case we have is allowing users to define their lineage on the UI, as opposed to through code. @loud-island-88694, this ties back to our discussion on Zoom. We have cases where we'd like to define datasets manually; we'd like to be able to define our lineage manually as well. Is this something you anticipate doing in the UI? Failing that, we could possibly build our own thin UI and pass in the lineage from that to GMS as MCEs. @future-airplane-75730 @miniature-ram-76637
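For what it's worth, a minimal sketch of the URN-building half of such a thin library. The dataset URN format below is DataHub's documented convention; the edge shape is an illustrative placeholder, not the exact MCE schema, and the actual emit/transport step is deliberately left out since it depends on the deployment:

```python
# Sketch of a thin lineage helper. dataset_urn follows DataHub's
# documented dataset URN convention; lineage_edge is an illustrative
# shape only, not the real MCE schema.
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def lineage_edge(upstream: str, downstream: str) -> dict:
    return {"upstream": upstream, "downstream": downstream}

edge = lineage_edge(
    dataset_urn("s3", "lake/raw/users"),
    dataset_urn("s3", "lake/clean/users"),
)
# Emitting `edge` to GMS as an MCE is left as a placeholder here.
```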
l
We are still planning to add the models and emitters in July timeframe
@green-football-43791 can you please take on producing a mock for how column level lineage is likely to look on the UI and update this thread?
As for defining datasets and lineage in the UI, it needs a bit more thought. Without good governance workflows, it can lead to incorrect information getting in. Going the MCE route is definitely recommended first.
i
thanks very much for the update
g
Will do @loud-island-88694 👍