DataHub #design-column-lineage

millions-pencil-75565

03/24/2023, 5:42 AM

Hi, not sure where to post this, but I am interested in how DataHub lineage (table or column level) aligns with or compares to the lineage provided by various source tools. For example, dbt provides lineage graphs. These are generated from the dbt model, so I assume DataHub does exactly the same, or will it be different?

cuddly-garden-9148

06/19/2023, 8:20 AM

Hello everyone, I am new to DataHub. I want to know if it is possible to add column-level lineage for Oracle via GraphQL?

better-agent-91402

06/26/2023, 11:01 AM

Hi, is column-level lineage only available for tables on Snowflake? It doesn't seem to work for any views for me. Is that correct, and if yes, is support for Snowflake views planned?

rich-crowd-33361

07/05/2023, 7:30 PM

Can someone help me with view-to-view column-level lineage for Snowflake? This is one of the pain points we want our data catalog to solve.

powerful-monitor-13002

07/10/2023, 1:18 AM

Hello everyone, has there been any progress on Spark column-level lineage, or are there plans to release such a feature in the near future? I have recently used the Spline project to generate this lineage from a Spark job and repurposed the output to fit into the FineGrainedLineage construct. Would there be any interest in integrating Spline with the DataHub Spark listener jar? Some cool features of Spline: it supports a lot more low-level Spark commands, along with support for multiple data providers out of the box, such as Kafka, Mongo, ES, Hive, JDBC, Cassandra, etc.

gifted-diamond-19544

07/14/2023, 2:15 PM

Hello all! I would like to extract the column-level lineage for a particular Athena table via GraphQL. Basically, what I need is the following: given a certain Athena table, I want to know which Tableau charts (downstream) are using each of the fields of the table. I can't seem to find a GraphQL query for column-level lineage, only for regular lineage. Any help on this? Thank you 🙂

handsome-park-80602

08/01/2023, 3:28 PM

Hi @dazzling-judge-80093 and team, I was told that column level lineage is coming to BigQuery in v.0.10.5.2 during last week's townhall. I however don't see the v.0.10.5.2 release yet (https://datahubproject.io/docs/releases/) I was wondering if there is any estimate to when 0.10.5.2 would be released?

most-monkey-10812

08/21/2023, 10:41 AM

I am observing some strange behaviour on the Dataset Lineage tab in the latest version of DataHub. I want to display the lineage (impact analysis) for the following use case: srcTable(col1) -> dataJob -> destTable(col1). Visual lineage works fine, but I cannot see any column lineage on the Dataset Lineage tab. Is it a bug, or am I providing invalid input for the dataJobInputOutput aspect?

{
  "inputDatasetFields": [
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:athena,catalog.src.srcTable,PROD),col1)"
  ],
  "outputDatasetFields": [
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:athena,catalog.dest.destTable,PROD),col1)"
  ],
  "inputDatasetEdges": [
    {
      "destinationUrn": "urn:li:dataset:(urn:li:dataPlatform:athena,catalog.src.srcTable,PROD)",
      "lastModified": {
        "actor": "urn:li:corpuser:UNKNOWN",
        "time": 1692613604982
      },
      "created": {
        "actor": "urn:li:corpuser:UNKNOWN",
        "time": 1692613604982
      }
    }
  ],
  "outputDatasetEdges": [
    {
      "destinationUrn": "urn:li:dataset:(urn:li:dataPlatform:athena,catalog.dest.destTable,PROD)",
      "lastModified": {
        "actor": "urn:li:corpuser:UNKNOWN",
        "time": 1692613604984
      },
      "created": {
        "actor": "urn:li:corpuser:UNKNOWN",
        "time": 1692613604984
      }
    }
  ],
  "inputDatasets": [
    "urn:li:dataset:(urn:li:dataPlatform:athena,catalog.src.srcTable,PROD)"
  ],
  "outputDatasets": [
    "urn:li:dataset:(urn:li:dataPlatform:athena,catalog.dest.destTable,PROD)"
  ],
  "fineGrainedLineages": [
    {
      "downstreamType": "FIELD",
      "downstreams": [
        "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:athena,catalog.dest.destTable,PROD),col1)"
      ],
      "upstreamType": "FIELD_SET",
      "upstreams": [
        "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:athena,catalog.src.srcTable,PROD),col1)"
      ]
    }
  ]
}
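
For comparison, here is a minimal sketch that assembles the URN strings and the fineGrainedLineages entries of such a payload programmatically, so that the field URNs and the dataset URNs cannot drift apart. It only builds plain dictionaries in the shape shown above and does not call any DataHub API; the helper names (dataset_urn, schema_field_urn, field_to_field_lineage) are hypothetical and chosen for illustration. It does not by itself explain the behaviour on the Lineage tab.

```python
import json
from typing import Dict, List, Tuple


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    # Same string layout as the dataset URNs in the payload above.
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def schema_field_urn(dataset: str, field_path: str) -> str:
    # Same string layout as the schemaField URNs in the payload above.
    return f"urn:li:schemaField:({dataset},{field_path})"


def field_to_field_lineage(
    src_dataset: str, dst_dataset: str, column_pairs: List[Tuple[str, str]]
) -> List[Dict]:
    # One entry per (source column, destination column) pair (hypothetical helper).
    return [
        {
            "upstreamType": "FIELD_SET",
            "upstreams": [schema_field_urn(src_dataset, src_col)],
            "downstreamType": "FIELD",
            "downstreams": [schema_field_urn(dst_dataset, dst_col)],
        }
        for src_col, dst_col in column_pairs
    ]


src = dataset_urn("athena", "catalog.src.srcTable")
dst = dataset_urn("athena", "catalog.dest.destTable")

aspect_fragment = {
    "inputDatasets": [src],
    "outputDatasets": [dst],
    "inputDatasetFields": [schema_field_urn(src, "col1")],
    "outputDatasetFields": [schema_field_urn(dst, "col1")],
    "fineGrainedLineages": field_to_field_lineage(src, dst, [("col1", "col1")]),
}

print(json.dumps(aspect_fragment, indent=2))
```

The printed fragment can be diffed against a hand-written payload to spot typos in platform, table, or column names.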

proud-mouse-39373

08/22/2023, 8:11 AM

Hi, I want a bilateral connection between 2 columns, but it appears as below. Any idea? It is a bilateral connection between 2 columns: legacymono_id <-> gym_id
Please check the scenario below. I believe there is a UI bug as well: sfdc_account <-> sap_business_partner
These 2 tables have a bilateral connection using 2 different columns:
sfdc_account (canonical_id) -> sap_business_partner (sfdccanonicalId)
sap_business_partner (id) -> sfdc_account (sap_id)
This image clearly shows the bilateral connection between these two pairs of columns.

proud-mouse-39373

08/22/2023, 8:12 AM

Hi, is there any example of deleting column lineage with a script, using either GraphQL or Python?

purple-refrigerator-27989

08/24/2023, 3:08 AM

Hi, everyone. 😄 Is API programming the only way to manually add column-level data lineage? Can I add it through the UI? (As far as I know, the UI can only add table-level data lineage.)

clever-match-44392

08/29/2023, 12:39 PM

Hi, I want to create column-level lineage from a single table column to columns in multiple tables. I'm creating it through Python code. In the UI, all the column lineages are shown, but on the visualization page, lineage is visible for only one downstream table. I'm attaching the screenshots and the lineageMcp variable used here.

lineageMcp = MetadataChangeProposalWrapper(
    entityType='dataset',
    changeType='UPSERT',
    entityUrn='urn:li:dataset:(urn:li:dataPlatform:mongodb,adv8.target,PROD)',
    entityKeyAspect=None,
    auditHeader=None,
    aspectName='upstreamLineage',
    aspect=UpstreamLineageClass({
        'upstreams': [UpstreamClass({
            'auditStamp': AuditStampClass({'time': 0, 'actor': 'urn:li:corpuser:unknown', 'impersonator': None, 'message': None}),
            'created': None,
            'dataset': 'urn:li:dataset:(urn:li:dataPlatform:mongodb,adv8.random_owner_assignments,PROD)',
            'type': 'TRANSFORMED',
            'properties': None})],
        'fineGrainedLineages': [FineGrainedLineageClass({
            'upstreamType': 'FIELD_SET',
            'upstreams': ['urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mongodb,adv8.random_owner_assignments,PROD),_id)'],
            'downstreamType': 'FIELD_SET',
            'downstreams': [
                'urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mongodb,adv8.test,PROD),_id)',
                'urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mongodb,adv8.target,PROD),_id)',
                'urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mongodb,adv8.target,PROD),client_id)'],
            'transformOperation': None,
            'confidenceScore': 1.0})]}),
    systemMetadata=None)

Any help would be highly appreciated, and I'll share any other details here if required.
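
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the proposal above is written to adv8.target only, while its downstreams list also names a field of adv8.test. The sketch below groups downstream field URNs by their parent dataset, which makes it easy to see how many datasets a single entry touches and to emit one upstreamLineage aspect per downstream dataset if that turns out to be necessary. It uses plain string handling only; split_field_urn and group_downstreams_by_dataset are hypothetical helper names, and the parsing assumes the field path contains no comma.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

FIELD_PREFIX = "urn:li:schemaField:("


def split_field_urn(field_urn: str) -> Tuple[str, str]:
    # Split "urn:li:schemaField:(<dataset urn>,<field path>)" into its two parts.
    # Assumption: the field path itself contains no comma.
    if not (field_urn.startswith(FIELD_PREFIX) and field_urn.endswith(")")):
        raise ValueError(f"not a schemaField URN: {field_urn}")
    inner = field_urn[len(FIELD_PREFIX):-1]
    dataset, field_path = inner.rsplit(",", 1)
    return dataset, field_path


def group_downstreams_by_dataset(downstreams: List[str]) -> Dict[str, List[str]]:
    # Map each downstream dataset URN to the field paths that belong to it.
    grouped: Dict[str, List[str]] = defaultdict(list)
    for urn in downstreams:
        dataset, field_path = split_field_urn(urn)
        grouped[dataset].append(field_path)
    return dict(grouped)


downstreams = [
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mongodb,adv8.test,PROD),_id)",
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mongodb,adv8.target,PROD),_id)",
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mongodb,adv8.target,PROD),client_id)",
]

grouped = group_downstreams_by_dataset(downstreams)
for dataset, fields in grouped.items():
    print(dataset, fields)
```

Each key of grouped is then a candidate entityUrn for its own proposal, carrying only the fields listed under it.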

icy-yacht-31703

09/18/2023, 12:10 PM

Is there any plan for automatic extraction of spark column-level-lineage?

red-florist-94889

09/22/2023, 2:09 PM

Hi, when can we expect column-level lineage for Athena and S3 files?

tall-flag-84207

09/28/2023, 9:20 AM

I am getting started with column-level lineage in DataHub and would love to contribute to it. Can somebody point me to some good feature requests or bugs that I can work on?

bitter-baker-81702

10/11/2023, 12:14 PM

Hi everyone, we are using DataHub v0.10.4 to expose column-level lineage from our Snowflake data warehouse. We extract this information through DataHub's Snowflake connector as explained here. However, it seems that the connector is not always able to extract column-level lineage, especially for flattened variant columns (I am attaching a relevant example). No errors are reported in the ingestion logs, but parts of the column lineage are missing. I understand that extracting column-level lineage is a complex process. Has anybody else experienced such issues? Are there any known limitations of the Snowflake column-level lineage feature? If yes, could you point me to any relevant documentation describing them? Thank you!

Snowflake_cll_missing_mapping.txt

bulky-shoe-65107

10/16/2023, 12:34 AM

has renamed the channel from "column-level-lineage" to "design-column-lineage"

icy-yacht-31703

10/30/2023, 9:48 AM

Hello, when I use the Python emitter to add lineage to DataHub, datasets from different platforms are involved. Is there any method in the Python emitter to fetch entity information from DataHub, so that I can tell which platform each dataset belongs to? Thank you

stale-ram-69119

11/07/2023, 11:02 AM

Hi, everyone! I'm thinking of extending the tableau.py ingestor to build column-level lineage. Wdyt? Is anyone already doing something in this direction? The idea would be to grab column lineage from LineageRunner, detect the tables, find those tables and fields in Tableau, and emit the relations.

brief-eye-25921

11/08/2023, 12:35 PM

Hi everyone, I couldn't find any other relevant channel, so I'm asking here. I'm currently trying to build column-level lineage using [SQL-QUERIES], but I'm not clear about the specific format and requirements for the JSON file. Do you have any sample files available for reference? Thank you so much. My demo looks like this:

{"query": "SELECT * FROM test.test_table", "downstream_tables": [], "upstream_tables": ["test.test_table"]}
{"query": "INSERT INTO test.test_son (runoob_id) SELECT runoob_id FROM test.test_table", "downstream_tables": ["test.test_son"], "upstream_tables": []}
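
A minimal sketch for generating such a file with one JSON object per line, using only the standard library. The key spellings downstream_tables and upstream_tables are an assumption based on the sql-queries source documentation, so please check them against the version you run. Listing test.test_table as the upstream of the INSERT is simply my reading of that query, and to_json_lines is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable


def to_json_lines(entries: Iterable[Dict]) -> str:
    # Serialize each entry on its own line (JSON Lines), ending with a newline.
    return "".join(json.dumps(entry) + "\n" for entry in entries)


entries = [
    {
        "query": "INSERT INTO test.test_son (runoob_id) "
                 "SELECT runoob_id FROM test.test_table",
        # Assumed key names; verify against the docs of your DataHub version.
        "downstream_tables": ["test.test_son"],
        "upstream_tables": ["test.test_table"],
    },
]

payload = to_json_lines(entries)
print(payload, end="")

# To write the query file (the path is a placeholder):
# with open("queries.json", "w") as fh:
#     fh.write(payload)
```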

late-lizard-17365

11/10/2023, 10:24 AM

Hello! I'm currently trying to implement the column-level lineage feature, but I've run into an issue: if the schema fields in a BigQuery table come from two different pubsub subscriptions and the schema field names are the same, the arrows only show for one of the pubsub subscriptions. The URNs for the two pubsub subscriptions are exactly the same apart from the name. I'm wondering whether this feature simply hasn't been implemented for this use. Does anyone have any details on this?

big-table-62755

11/24/2023, 11:22 AM

Hi DataHub team, I am trying to use SqlQueriesSource to create column lineage:

from datahub.ingestion.source.sql_queries import SqlQueriesSource, SqlQueriesSourceConfig
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.ingestion.api.common import PipelineContext

datahub_graph_client = DataHubGraph(config=DatahubClientConfig())

conf = SqlQueriesSourceConfig(platform='clickhouse', query_file='./queries/subscription_events.sql')
cxt = PipelineContext(run_id='test_column_lineage', graph=datahub_graph_client)
src = SqlQueriesSource(config=conf, ctx=cxt)

I am missing the piece on how to emit this to my DataHub cluster. Should I use DatahubRestEmitter somehow?
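
One possible route, sketched under assumptions rather than confirmed in this thread: instead of driving the source object by hand, describe the source and a datahub-rest sink in a recipe dictionary and let the ingestion Pipeline run it. The source type name sql-queries and the server URL are assumptions (the URL is a placeholder for your GMS endpoint), and build_recipe and run_recipe are hypothetical helper names. The datahub import is done lazily so the recipe itself can be built and inspected without the package installed.

```python
from typing import Any, Dict


def build_recipe(query_file: str, platform: str, server: str) -> Dict[str, Any]:
    # Recipe in the usual source/sink layout; values mirror the snippet above.
    return {
        "source": {
            "type": "sql-queries",  # assumed registered name of SqlQueriesSource
            "config": {"platform": platform, "query_file": query_file},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": server},  # placeholder GMS endpoint
        },
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    # Imported here so that building the recipe does not require acryl-datahub.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


recipe = build_recipe(
    query_file="./queries/subscription_events.sql",
    platform="clickhouse",
    server="http://localhost:8080",
)
print(recipe)

# Uncomment to actually ingest against a running DataHub instance:
# run_recipe(recipe)
```

With this approach the sink does the emitting, so no separate DatahubRestEmitter call should be needed.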

astonishing-byte-5433

11/29/2023, 9:06 AM

Hello, first of all thanks for your amazing work here, it is a really cool feature! After spending some time ingesting a different SQLAlchemy source, I found that the sqlglot dialect, which is set by the platform, wasn't supported. I think it would be very helpful to be able to set a custom dialect via config. Most dialects share their syntax, and after changing the platform to a supported one, some basic views could be parsed, which is better than nothing. This would also help with sources like Athena, which is based on Trino.

early-hydrogen-27542

11/29/2023, 11:27 PM

Is dbt CLL actually live in 0.12.0 today? I don't see it applied when we upgraded, and I see this PR was merged after 0.12.0 went live.

bland-orange-13353

11/30/2023, 8:29 AM

This message was deleted.

glamorous-spoon-27211

12/26/2023, 2:58 AM

Hi team, I have deployed DataHub locally. There are 2 tables in the Hive data source, account_balance and account_balance_delta. I wanted to test and define column lineage between them, but an error occurred indicating that they do not exist upstream, and the lineage was not generated. I don't understand the reason. This is my code. Is the defined format incorrect? Could you give me some advice at your convenience, please?

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
    DatasetLineageType,
    FineGrainedLineage,
    FineGrainedLineageDownstreamType,
    FineGrainedLineageUpstreamType,
    Upstream,
    UpstreamLineage,
)


def datasetUrn(tbl):
    return builder.make_dataset_urn("hive", tbl)


def fldUrn(tbl, fld):
    return builder.make_schema_field_urn(datasetUrn(tbl), fld)


fineGrainedLineages = [
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[
            fldUrn("account_balance", "account_key"),
        ],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("account_balance_delta", "account_balance_key")],
    ),
]


# this is just to check if any conflicts with existing Upstream, particularly the DownstreamOf relationship
upstream = Upstream(
    dataset=datasetUrn("account_balance"), type=DatasetLineageType.TRANSFORMED
)

fieldLineages = UpstreamLineage(
    upstreams=[upstream], fineGrainedLineages=fineGrainedLineages
)

lineageMcp = MetadataChangeProposalWrapper(
    entityUrn=datasetUrn("account_balance_delta"),
    aspect=fieldLineages,
)

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata!
emitter.emit_mcp(lineageMcp)

fresh-book-19245

02/06/2024, 5:13 PM

Hey everyone! 😁 Thank you for all your efforts in providing such a great tool! I am a newbie exploring it in more detail, and a question came up: when using File Based Lineage, is it only possible to define fineGrainedLineages through URNs? Thank you!

gray-sundown-82407

02/08/2024, 3:32 PM

Hi Team, my DataHub deployment has varying levels of success when presenting lineage. I am new to posting in the DataHub Slack, so I hope this makes sense: does the DataHub BQ plugin have any limitations regarding how lineage is presented for BigQuery tables/views? I think the problem I am experiencing arises because some tables/views created in BigQuery are particularly complex, e.g. multiple SQL scripts are required to generate/transform/join the data in tables/views to create the final view. The simpler tables/views have full working lineage as expected, but the more complex ones typically have no lineage, or the lineage ends at "random" points even though the expected elements are present in DataHub. I am still working to troubleshoot this, but would appreciate any guidance that could help me pinpoint the problem or suggest a solution!

some-car-9623

03/13/2024, 2:47 PM

Hello Team, I am trying to create column-level lineage from charts to datasets. I am using the Edges from ChartInfo to get chart-to-dataset lineage at the dataset level. Similarly, is there any aspect that can be used to generate column-level lineage from a chart to a dataset while ingesting the chart? Thanks, Geetha

some-car-9623

03/14/2024, 8:45 PM

Hello everyone, is it possible to have column-level lineage from charts to datasets? If yes, is there any example? Thanks in advance, Geetha