# ingestion

    boundless-bear-68728

    03/08/2024, 10:13 PM
    Hi Team / @gray-shoe-75895, I'm having an issue with the Looker ingestion. I can see discrepancies between the number of datasets that DataHub shows and the actual count of datasets that exist in our Looker: DataHub shows only 97 Explores vs. the 300+ Explores we have. The Looker ingestion logs show success, but I still can't see all the records in DataHub. Can you please help me resolve this issue?

    ripe-machine-72145

    03/09/2024, 2:03 PM
    Hi Team, is there a better way to ingest CSV file metadata? UI-based ingestion, version 0.13, CSV.

    worried-agent-2446

    03/10/2024, 2:42 PM
    Hello! I'm using DataHub, and I'm considering ingesting MySQL with SQL Queries (https://datahubproject.io/docs/generated/ingestion/sources/sql-queries/) to view column-level lineage. I'd like to know whether I can use DDL SQL (like CREATE TABLE or DROP TABLE …) in SQL Queries for this purpose. πŸ™‡β€β™‚οΈ
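    For reference, a minimal sql-queries recipe might look like the sketch below (platform, paths, and the query shown are placeholder assumptions, not from this thread). The source reads a newline-delimited JSON file of queries; statements that move data (e.g. CREATE TABLE ... AS SELECT or INSERT ... SELECT) can yield lineage, while a bare DROP TABLE has no upstream/downstream relationship to record.
    ```yaml
    # A hedged sketch; names and paths are examples only.
    # queries.json is newline-delimited JSON, one object per query, e.g.:
    # {"query": "INSERT INTO db.agg SELECT * FROM db.raw", "timestamp": 1689432649, "user": "etl_user"}
    source:
      type: sql-queries
      config:
        platform: mysql
        default_db: my_db
        query_file: ./queries.json
    ```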

    clean-magazine-98135

    03/11/2024, 2:42 AM
    Hi all! I'm using DataHub version 0.13.0. I want to connect to a Hive database using the UI ingestion feature. Could you please share a demo recipe for a Hive database connection? Thanks a lot.
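    A minimal demo recipe might look like the sketch below; host, database, and credentials are placeholders, and when ingesting through the UI the sink section can usually be omitted.
    ```yaml
    # A sketch, not a verified production recipe; all values are placeholders.
    source:
      type: hive
      config:
        host_port: "my-hive-server:10000"
        database: my_database          # optional; omit to scan all databases
        username: datahub_user
        password: "${HIVE_PASSWORD}"   # reference a managed secret
    ```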

    rich-barista-93413

    03/11/2024, 9:24 AM
    Hey there! πŸ‘‹ Make sure your message includes the following information if relevant, so we can help more effectively!
    1. Are you using UI or CLI for ingestion?
    2. Which DataHub version are you using? (e.g. 0.12.0)
    3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

    breezy-honey-91751

    03/11/2024, 9:29 AM
    Hi Everyone, I am looking to add a custom source for which no connector is available. Can we add some placeholder data assets? This would help me document some data assets and their lineage as well. #ingestion
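    One hedged approach is to emit placeholder entities directly with the Python SDK; the sketch below uses made-up platform and dataset names.
    ```python
    # A sketch with hypothetical names: create a placeholder dataset for a
    # source that has no connector by emitting its metadata directly.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter("http://localhost:8080")  # your GMS endpoint

    # Any platform string is accepted; unknown platforms get a generic icon.
    urn = make_dataset_urn(platform="my-legacy-system", name="crm.customers", env="PROD")
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=urn,
            aspect=DatasetPropertiesClass(
                name="customers",
                description="Placeholder asset documented by hand.",
            ),
        )
    )
    ```
    Lineage between placeholder assets can then be added the same way via the upstreamLineage aspect.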

    tall-answer-76571

    03/11/2024, 9:53 AM
    Hello all! Could someone please explain which PowerBI license is required for integrating dashboards with a data catalog?

    fancy-barista-51991

    03/11/2024, 4:34 PM
    Hi! Could somebody explain more about the ingestion of CSV files? I am a student just starting with DataHub. I need to create a data catalog and I have some CSV files. I checked the documentation, but I still have a lot of questions and couldn't ingest any of my files. I am aware that the CSV should follow the format below, and I am confused: should I add the columns of my CSV here, like id, name, last_name, and so on? Thank you!
    ```
    resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain,ownership_type_urn
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,datahub.growth.users,PROD)",,[urn:li:glossaryTerm:Users],[urn:li:tag:HighQuality],[urn:li:corpuser:lfoe|urn:li:corpuser:jdoe],CUSTOM,"description for users table",urn:li:domain:Engineering,urn:li:ownershipType:a0e9176c-d8cf-4b11-963b-f7a1bc2333c9
    "urn:li:dataset:(urn:li:dataPlatform:hive,datahub.growth.users,PROD)",first_name,[urn:li:glossaryTerm:FirstName],,,,"first_name description",
    "urn:li:dataset:(urn:li:dataPlatform:hive,datahub.growth.users,PROD)",last_name,[urn:li:glossaryTerm:LastName],,,,"last_name description",

    nice-dog-12741

    03/11/2024, 8:23 PM
    In my DataHub instance, I can see that when a failure occurs during an ingestion operation, such as being unable to access a file for a table, the operation is redone from scratch. What can I do to ensure that the result of the first ingestion is final, even if some error occurs?

    salmon-nail-53998

    03/12/2024, 1:37 AM
    Hi all! I'm trying to ingest a glossary via the DataHub CLI (0.13.0) and facing validation issues. I see the error below when using the super simple glossary config shown here. Where can I find the schema and allowed fields for the business glossary? Thanks in advance!
    ```yaml
    version: 1
    source: DataHub
    owners:
      users:
        - useremail@email.com
    url: "https://github.com/datahub-project/datahub/"
    nodes:
      - name: Classification_demo
        description: A set of terms related to Data Classification
        terms:
          - name: Sensitive
            description: Sensitive Data
          - name: Confidential
            description: Confidential Data
          - name: HighlyConfidential
            description: Highly Confidential Data
    ```
    ```
    ERROR    {datahub.ingestion.run.pipeline:69} -  failed to write record with workunit urn:li:glossaryTerm:Classification_demo.Confidential/mce with ('Unable to emit metadata to DataHub GMS: com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', 'status': 422, 'urn': 'urn:li:glossaryTerm:Classification_demo.Confidential'}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', 'status': 422, 'urn': 'urn:li:glossaryTerm:Classification_demo.Confidential'
    ```

    millions-byte-1976

    03/12/2024, 2:35 AM
    Hey there, I am trying to ingest my metadata through a Glue job, but the job is throwing an error: "Max retries exceeded with url (caused by SSLError: certificate_verify_failed)". Can you please help me resolve this issue?
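    If the error is the Glue environment not trusting an internal CA, the datahub-rest sink exposes TLS options; a hedged sketch (option availability may depend on your CLI version):
    ```yaml
    # A sketch; server and paths are placeholders.
    sink:
      type: datahub-rest
      config:
        server: "https://your-gms-host"
        ca_certificate_path: /path/to/internal-ca.pem
        # or, for debugging only (not recommended in production):
        # disable_ssl_verification: true
    ```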

    early-oil-14918

    03/12/2024, 4:55 AM
    Hello, I installed DataHub on GKE using Helm, then uninstalled it and reinstalled it. The installation process was the same, and I also deleted and re-created all persistent volumes (PVs). Afterwards, I attempted ingestion targeting an S3 bucket. Although the ingestion process succeeded and indicated access to the target, no assets were ingested. Interestingly, DataHub running on an EC2 instance with Docker works fine. Any idea what might be causing this? There don't seem to be any relevant logs.

    hallowed-helicopter-80392

    03/12/2024, 6:15 AM
    Has someone connected Confluence to DataHub? Do I need to build a custom connector, or what would be the recommended route?

    blue-cartoon-10359

    03/12/2024, 9:31 AM
    If I have a dataset in DataHub with a Stats section containing summary statistics for each column, how can these be accessed in Python using the graph? I.e., does there exist a class to do the following?
    ```python
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import ???
    
    graph = DataHubGraph(DatahubClientConfig(server=endpoint))
    res = graph.get_aspect(dataset_urn, aspect=???)
    
    # get summary stats for each column
    col_stats = res["some key"]
    ```
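    For context, column profiles are stored in the datasetProfile timeseries aspect, so a sketch (assuming profiling was enabled during ingestion) would fetch the latest timeseries value rather than a versioned aspect:
    ```python
    # A sketch, assuming profiles exist in the datasetProfile timeseries aspect.
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import DatasetProfileClass

    endpoint = "http://localhost:8080"  # placeholder GMS endpoint
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"  # placeholder

    graph = DataHubGraph(DatahubClientConfig(server=endpoint))
    profile = graph.get_latest_timeseries_value(
        entity_urn=dataset_urn,
        aspect_type=DatasetProfileClass,
        filter_criteria_map={},
    )
    if profile:
        for field in profile.fieldProfiles or []:
            print(field.fieldPath, field.uniqueCount, field.nullCount, field.mean)
    ```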

    red-scientist-36390

    03/12/2024, 9:45 AM
    Hello! πŸ‘‹ Has anyone used the dagster-datahub integration? We're evaluating different tools and were wondering whether lineage is also part of ingestion with this integration.

    damp-computer-24317

    03/13/2024, 4:22 AM
    Hi All, I am trying out the new feature for ingestion from Redshift Serverless. I am getting the error below when I try to use the `is_serverless` config. Am I missing anything here? https://github.com/datahub-project/datahub/pull/9998
    ```
    This version of datahub supports report-to functionality
    + exec datahub ingest run -c /tmp/datahub/ingest/96342606-9c11-45af-be9b-a7fdbcc6f2e6/recipe.yml --report-to /tmp/datahub/ingest/96342606-9c11-45af-be9b-a7fdbcc6f2e6/ingestion_report.json
    [2024-03-13 04:17:40,407] INFO     {datahub.cli.ingest_cli:147} - DataHub CLI version: 0.13.0
    [2024-03-13 04:17:40,507] INFO     {datahub.ingestion.run.pipeline:238} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-gms:8080
    Failed to configure the source (redshift): 1 validation error for RedshiftConfig
    is_serverless
      extra fields not permitted (type=value_error.extra)
    ```

    rich-barista-93413

    03/13/2024, 8:34 AM
    Hey there! πŸ‘‹ Make sure your message includes the following information if relevant, so we can help more effectively!
    1. Are you using UI or CLI for ingestion?
    2. Which DataHub version are you using? (e.g. 0.12.0)
    3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

    bland-orange-13353

    03/13/2024, 8:35 AM
    This message was deleted.

    high-area-68604

    03/13/2024, 9:03 AM
    Hi everyone,

    high-area-68604

    03/13/2024, 9:05 AM
    I'd like some help understanding Terms, Tags, and the Data Dictionary. I am confused because they seem similar to me. How can I differentiate them?

    high-area-68604

    03/13/2024, 9:06 AM
    Thank you

    purple-addition-48342

    03/13/2024, 9:34 AM
    Hello everyone ... I am looking into ingesting dataset lineage via MCP using the `UpstreamClass`. This `UpstreamClass` type supports setting the `type` (VIEW, TRANSFORM, COPY) and `properties`, which are not shown in the UI. Is the type somehow reflected in the UI? I saw there is `properties = {"source": "UI"}`, which results in "Added manually" in the UI. I am wondering whether these properties can be used to store custom information, like a "source file" or anything else. Is there any way to display them, or is there a plan for a future implementation? Or is that field only used internally and should not be used? Thx in advance
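    For reference, a minimal sketch of emitting such lineage via MCP (URNs and the extra property are made up; whether `properties` is surfaced anywhere in the UI beyond the "Added manually" badge is exactly the open question above):
    ```python
    # A sketch with made-up URNs: emit upstream lineage via MCP using UpstreamClass.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    upstream = UpstreamClass(
        dataset=make_dataset_urn("hive", "db.upstream_table", "PROD"),
        type=DatasetLineageTypeClass.TRANSFORMED,
        # Free-form string map; {"source": "UI"} is what renders as "Added manually".
        properties={"source_file": "etl/job_42.sql"},
    )
    mcp = MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("hive", "db.downstream_table", "PROD"),
        aspect=UpstreamLineageClass(upstreams=[upstream]),
    )
    DatahubRestEmitter("http://localhost:8080").emit(mcp)
    ```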

    incalculable-sundown-8765

    03/13/2024, 10:50 AM
    Hi, it seems like there is an issue ingesting the glossary. This is my recipe:
    ```yaml
    pipeline_name: my_glossary
    
    source:
      type: datahub-business-glossary
      config:
        file: datahub/resources/glossary/my_glossary.yaml
        enable_auto_id: False
    ```
    This is my `my_glossary.yaml`:
    ```yaml
    version: 1
    source: DataHub
    owners:
      users:
        - my.name
    nodes:
      - id: "urn:li:glossaryNode:customer"
        name: Customer
        description: "Customer Glossary"
        terms:
          - id: "urn:li:glossaryTerm:created_at"
            name: Created At
            description: "Timestamp when customer first being created."
    ```
    I get this error:
    ```
    failed to write record with workunit urn:li:glossaryNode:customer/mce with ('Unable to emit metadata to DataHub GMS: com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/0/com.linkedin.glossary.GlossaryNodeInfo/customProperties :: unrecognized field found but not allowed\nERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/0/com.linkedin.glossary.GlossaryNodeInfo/customProperties :: unrecognized field found but not allowed\nERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', 'status': 422, 'urn': 'urn:li:glossaryNode:customer'}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/0/com.linkedin.glossary.GlossaryNodeInfo/customProperties :: unrecognized field found but not allowed\nERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', 'status': 422, 'urn': 'urn:li:glossaryNode:customer'}
    ```
    I believe this is coming from the `owners`. I'm seeing this issue with the CSV enricher as well whenever I add ownership. DataHub version: v0.12.1

    cuddly-dinner-641

    03/13/2024, 1:10 PM
    In the Databricks ingestion source, it looks like the `include_metastore` flag is deprecated and will always be "false" in the future. Isn't the metastore necessary to guarantee that dataset URNs are unique?

    quiet-computer-34771

    03/13/2024, 4:29 PM
    UI/CLI: UI. Version: 0.13.0. Source: MSSQL inside Amazon AWS (yes, I know about the system account constraint). MSSQL tables and views come in just fine; view lineage, however, is not shown. I found a question about this that was last updated 10 months ago, but the suggested link for the API/SDK reference is dead. What is the solution for supporting view lineage?

    little-painter-30105

    03/13/2024, 6:45 PM
    Hi Team, we have DataHub integrated with Airflow + Snowflake + dbt + Tableau. I am trying to do some custom metadata updates using the GraphQL API. Currently the Airflow DAG owner name flows from the DAG to DataHub. For all DAGs/tasks, we want to keep a different owner name in the Airflow UI, but it should be updated to the `ldap` user name (the DataHub signed-in user) in DataHub. How can I update just the Airflow owner name in DataHub (keeping a different Airflow owner in the Airflow UI)? Is there a way to update it using API calls or ingestion in DataHub?
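    One hedged option (a sketch, not an official pattern): overwrite the ownership aspect on the DataFlow entity yourself after ingestion with the Python SDK. Note the Airflow plugin may re-emit ownership on the next DAG run.
    ```python
    # A sketch with made-up names: replace the ownership aspect on an Airflow
    # DataFlow entity with a chosen DataHub (e.g. LDAP) user.
    from datahub.emitter.mce_builder import make_data_flow_urn, make_user_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        OwnerClass,
        OwnershipClass,
        OwnershipTypeClass,
    )

    flow_urn = make_data_flow_urn(orchestrator="airflow", flow_id="my_dag", cluster="prod")
    ownership = OwnershipClass(
        owners=[
            OwnerClass(owner=make_user_urn("ldap_user"), type=OwnershipTypeClass.TECHNICAL_OWNER)
        ]
    )
    DatahubRestEmitter("http://datahub-gms:8080").emit(
        MetadataChangeProposalWrapper(entityUrn=flow_urn, aspect=ownership)
    )
    ```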

    sparse-arm-36740

    03/13/2024, 8:19 PM
    Hi Team, does DataHub collect table constraint information from ingested tables? For example, TABLE1 from a Postgres database with a primary key of 'my_id' and a foreign key of 'some_other_id' referencing TABLE2. If so, I am unable to find it. Can anyone tell me if it is displayed in the UI? Can I get it from one of the APIs? Any help is appreciated!
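    For what it's worth, SQL-based sources such as Postgres generally land key constraints in the schemaMetadata aspect; a sketch for reading them with the Python graph client (server and URN are placeholders):
    ```python
    # A sketch: read primary/foreign key info from the schemaMetadata aspect.
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import SchemaMetadataClass

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.public.table1,PROD)"

    schema = graph.get_aspect(dataset_urn, SchemaMetadataClass)
    if schema:
        print("primary keys:", schema.primaryKeys)
        for fk in schema.foreignKeys or []:
            print(fk.name, fk.sourceFields, "->", fk.foreignDataset)
    ```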

    microscopic-twilight-7661

    03/14/2024, 10:37 AM
    Hi Everyone, is there a way to ingest LookML using the UI without providing the GitHub deploy key in plaintext? I've tried to add the key as a secret, but that results in an error (debug logs in thread). I am able to ingest the metadata if I provide the key in plaintext. We are using DataHub v0.12.1 and ingesting the metadata through the UI.
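    For comparison while debugging: UI-managed secrets are normally referenced in the recipe body as ${SECRET_NAME}; a hedged sketch of the relevant snippet (repo and secret names are made up):
    ```yaml
    # A sketch; assumes a UI secret named LOOKML_DEPLOY_KEY holds the private key.
    source:
      type: lookml
      config:
        github_info:
          repo: my-org/my-lookml-repo
          deploy_key: "${LOOKML_DEPLOY_KEY}"
    ```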

    victorious-lizard-36455

    03/14/2024, 11:10 AM
    Hi Everyone, ingesting Snowflake metadata into DataHub does not currently offer visibility into the fields within columns of the VARIANT type. This limitation affects the ability to catalog and search nested fields stored in semi-structured formats like JSON within VARIANT columns. Is there any way to access nested fields in the DataHub catalog?

    glamorous-area-45109

    03/14/2024, 4:09 PM
    Hi all, I need to tag BigQuery datasets with the layer they belong to. For this I am employing pattern_add_dataset_tags:
    ```yaml
    transformers:
      - type: "pattern_add_dataset_tags"
        config:
          replace_existing: true
          tag_pattern:
            rules:
              ".*common.*": ["urn:li:tag:layer:common"]
              ".*core.*": ["urn:li:tag:layer:core"]
              ".*consumption.*": ["urn:li:tag:layer:consumption"]
    ```
    However, this only tags the tables and views inside the dataset. Is there any way to tag only the datasets themselves and not the tables and views? Thanks!