# ingestion

    boundless-bear-68728

    03/08/2024, 10:13 PM
    Hi Team / @gray-shoe-75895, I'm having an issue with the Looker ingestion. I can see discrepancies between the number of datasets that DataHub shows and the actual count of datasets that exist in our Looker: DataHub shows only 97 Explores vs. the 300+ Explores we have. The Looker ingestion logs show success, but I still can't see all the records in DataHub. Can you please help me resolve this issue?

    ripe-machine-72145

    03/09/2024, 2:03 PM
    Hi Team, is there a better way to ingest CSV file metadata? UI-based ingestion, version 0.13, CSV.

    worried-agent-2446

    03/10/2024, 2:42 PM
    Hello! I'm using DataHub, and I'm considering ingesting MySQL with SQL Queries (https://datahubproject.io/docs/generated/ingestion/sources/sql-queries/) to view column-level lineage. I'd like to know whether I can use DDL SQL (like CREATE TABLE or DROP TABLE …) in SQL Queries for this purpose. πŸ™‡β€β™‚οΈ
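    For reference, a minimal sql-queries recipe might look like the sketch below (platform, paths, and the query shown are placeholder assumptions, not from this thread). The source reads a newline-delimited JSON file of queries; statements that move data (e.g. CREATE TABLE ... AS SELECT or INSERT ... SELECT) can yield lineage, while a bare DROP TABLE has no upstream/downstream relationship to record.
    ```yaml
    # A hedged sketch; names and paths are examples only.
    # queries.json is newline-delimited JSON, one object per query, e.g.:
    # {"query": "INSERT INTO db.agg SELECT * FROM db.raw", "timestamp": 1689432649, "user": "etl_user"}
    source:
      type: sql-queries
      config:
        platform: mysql
        default_db: my_db
        query_file: ./queries.json
    ```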

    clean-magazine-98135

    03/11/2024, 2:42 AM
    Hi all! I'm using DataHub version 0.13.0. I want to connect to a Hive database using the UI ingestion feature. Could you please share a demo recipe for a Hive database connection? Thanks a lot.
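    A minimal demo recipe might look like the sketch below; host, database, and credentials are placeholders, and when ingesting through the UI the sink section can usually be omitted.
    ```yaml
    # A sketch, not a verified production recipe; all values are placeholders.
    source:
      type: hive
      config:
        host_port: "my-hive-server:10000"
        database: my_database          # optional; omit to scan all databases
        username: datahub_user
        password: "${HIVE_PASSWORD}"   # reference a managed secret
    ```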

    rich-barista-93413

    03/11/2024, 9:24 AM
    Hey there! πŸ‘‹ Make sure your message includes the following information if relevant, so we can help more effectively!
    1. Are you using UI or CLI for ingestion?
    2. Which DataHub version are you using? (e.g. 0.12.0)
    3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

    breezy-honey-91751

    03/11/2024, 9:29 AM
    Hi Everyone, I am looking to add a custom source for which no connector is available. Can we add some placeholder data assets? This would help me document some data assets and their lineage as well. #ingestion
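    One hedged approach is to emit placeholder entities directly with the Python SDK; the sketch below uses made-up platform and dataset names.
    ```python
    # A sketch with hypothetical names: create a placeholder dataset for a
    # source that has no connector by emitting its metadata directly.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter("http://localhost:8080")  # your GMS endpoint

    # Any platform string is accepted; unknown platforms get a generic icon.
    urn = make_dataset_urn(platform="my-legacy-system", name="crm.customers", env="PROD")
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=urn,
            aspect=DatasetPropertiesClass(
                name="customers",
                description="Placeholder asset documented by hand.",
            ),
        )
    )
    ```
    Lineage between placeholder assets can then be added the same way via the upstreamLineage aspect.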

    tall-answer-76571

    03/11/2024, 9:53 AM
    Hello all! Could someone please explain which PowerBI license is required for integrating dashboards with a data catalog?

    fancy-barista-51991

    03/11/2024, 4:34 PM
    Hi! Could somebody explain more about the ingestion of CSV files? I am a student just starting with DataHub. I need to create a data catalog and I have some CSV files. I checked the documentation, but I still have a lot of questions and couldn't ingest any of my files. I am aware that the CSV should follow the format below, and I am confused: should I add the columns of my CSV here, like id, name, last_name, and so on? Thank you!
    ```
    resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain,ownership_type_urn
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,datahub.growth.users,PROD)",,[urn:li:glossaryTerm:Users],[urn:li:tag:HighQuality],[urn:li:corpuser:lfoe|urn:li:corpuser:jdoe],CUSTOM,"description for users table",urn:li:domain:Engineering,urn:li:ownershipType:a0e9176c-d8cf-4b11-963b-f7a1bc2333c9
    "urn:li:dataset:(urn:li:dataPlatform:hive,datahub.growth.users,PROD)",first_name,[urn:li:glossaryTerm:FirstName],,,,"first_name description",
    "urn:li:dataset:(urn:li:dataPlatform:hive,datahub.growth.users,PROD)",last_name,[urn:li:glossaryTerm:LastName],,,,"last_name description",

    nice-dog-12741

    03/11/2024, 8:23 PM
    In my DataHub instance, I can see that when a failure occurs during an ingestion operation, such as being unable to access a file for a table, the operation is redone from scratch. What can I do to ensure that the result of the first ingestion is final, even if some error occurs?

    salmon-nail-53998

    03/12/2024, 1:37 AM
    Hi all! I'm trying to ingest a glossary via the DataHub CLI (0.13.0) and facing validation issues. I see the error below when using the super simple glossary config shown here. Where can I find the schema and allowed fields for the business glossary? Thanks in advance!
    ```yaml
    version: 1
    source: DataHub
    owners:
      users:
        - useremail@email.com
    url: "https://github.com/datahub-project/datahub/"
    nodes:
      - name: Classification_demo
        description: A set of terms related to Data Classification
        terms:
          - name: Sensitive
            description: Sensitive Data
          - name: Confidential
            description: Confidential Data
          - name: HighlyConfidential
            description: Highly Confidential Data
    ```
    ```
    ERROR    {datahub.ingestion.run.pipeline:69} -  failed to write record with workunit urn:li:glossaryTerm:Classification_demo.Confidential/mce with ('Unable to emit metadata to DataHub GMS: com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', 'status': 422, 'urn': 'urn:li:glossaryTerm:Classification_demo.Confidential'}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', 'status': 422, 'urn': 'urn:li:glossaryTerm:Classification_demo.Confidential'
    ```

    millions-byte-1976

    03/12/2024, 2:35 AM
    Hey there, I am trying to ingest my metadata through a Glue job, but the job is throwing an error: "Max retries exceeded with url (caused by SSLError: certificate_verify_failed)". Can you please help me resolve this issue?
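    If the error is the Glue environment not trusting an internal CA, the datahub-rest sink exposes TLS options; a hedged sketch (option availability may depend on your CLI version):
    ```yaml
    # A sketch; server and paths are placeholders.
    sink:
      type: datahub-rest
      config:
        server: "https://your-gms-host"
        ca_certificate_path: /path/to/internal-ca.pem
        # or, for debugging only (not recommended in production):
        # disable_ssl_verification: true
    ```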

    early-oil-14918

    03/12/2024, 4:55 AM
    Hello, I installed DataHub on GKE using Helm, then uninstalled it and reinstalled it. The installation process was the same, and I also deleted and re-created all persistent volumes (PVs). Afterwards, I attempted ingestion targeting an S3 bucket. Although the ingestion process succeeded and indicated access to the target, no assets were ingested. Interestingly, DataHub running on an EC2 instance with Docker works fine. Any idea what might be causing this? There don't seem to be any relevant logs.

    hallowed-helicopter-80392

    03/12/2024, 6:15 AM
    Has someone connected Confluence to DataHub? Do I need to build a custom connector, or what would be the recommended route?

    blue-cartoon-10359

    03/12/2024, 9:31 AM
    If I have a dataset in DataHub with a Stats section containing summary statistics for each column, how can these be accessed in Python using the graph? I.e., does there exist a class to do the following?
    ```python
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import ???
    
    graph = DataHubGraph(DatahubClientConfig(server=endpoint))
    res = graph.get_aspect(dataset_urn, aspect=???)
    
    # get summary stats for each column
    col_stats = res["some key"]
    ```
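    For context, column profiles are stored in the datasetProfile timeseries aspect, so a sketch (assuming profiling was enabled during ingestion) would fetch the latest timeseries value rather than a versioned aspect:
    ```python
    # A sketch, assuming profiles exist in the datasetProfile timeseries aspect.
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import DatasetProfileClass

    endpoint = "http://localhost:8080"  # placeholder GMS endpoint
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"  # placeholder

    graph = DataHubGraph(DatahubClientConfig(server=endpoint))
    profile = graph.get_latest_timeseries_value(
        entity_urn=dataset_urn,
        aspect_type=DatasetProfileClass,
        filter_criteria_map={},
    )
    if profile:
        for field in profile.fieldProfiles or []:
            print(field.fieldPath, field.uniqueCount, field.nullCount, field.mean)
    ```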

    red-scientist-36390

    03/12/2024, 9:45 AM
    Hello! πŸ‘‹ Has anyone used the dagster-datahub integration? We're evaluating different tools and were wondering whether lineage is also part of ingestion with this integration.

    damp-computer-24317

    03/13/2024, 4:22 AM
    Hi All, I am trying out the new feature for ingestion from Redshift Serverless. I am getting the error below when I try to use the `is_serverless` config. Am I missing anything here? https://github.com/datahub-project/datahub/pull/9998
    ```
    This version of datahub supports report-to functionality
    + exec datahub ingest run -c /tmp/datahub/ingest/96342606-9c11-45af-be9b-a7fdbcc6f2e6/recipe.yml --report-to /tmp/datahub/ingest/96342606-9c11-45af-be9b-a7fdbcc6f2e6/ingestion_report.json
    [2024-03-13 04:17:40,407] INFO     {datahub.cli.ingest_cli:147} - DataHub CLI version: 0.13.0
    [2024-03-13 04:17:40,507] INFO     {datahub.ingestion.run.pipeline:238} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-gms:8080
    Failed to configure the source (redshift): 1 validation error for RedshiftConfig
    is_serverless
      extra fields not permitted (type=value_error.extra)
    ```

    rich-barista-93413

    03/13/2024, 8:34 AM
    Hey there! πŸ‘‹ Make sure your message includes the following information if relevant, so we can help more effectively!
    1. Are you using UI or CLI for ingestion?
    2. Which DataHub version are you using? (e.g. 0.12.0)
    3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

    bland-orange-13353

    03/13/2024, 8:35 AM
    This message was deleted.

    high-area-68604

    03/13/2024, 9:03 AM
    Hi everyone,

    high-area-68604

    03/13/2024, 9:05 AM
    I'd like some help understanding Terms, Tags, and the Data Dictionary. I am confused because they seem similar to me. How can I differentiate them?

    high-area-68604

    03/13/2024, 9:06 AM
    Thank you

    purple-addition-48342

    03/13/2024, 9:34 AM
    Hello everyone ... I am looking into ingesting dataset lineage via MCP using the `UpstreamClass`. This `UpstreamClass` type supports setting the `type` (VIEW, TRANSFORM, COPY) and `properties`, which are not shown in the UI. Is the type somehow reflected in the UI? I saw there is `properties = {"source": "UI"}`, which results in "Added manually" in the UI. I am wondering whether these properties can be used to store custom information, like a "source file" or anything else. Is there any way to display them, or is there a plan for a future implementation? Or is that field only used internally and should not be used? Thx in advance
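    For reference, a minimal sketch of emitting such lineage via MCP (URNs and the extra property are made up; whether `properties` is surfaced anywhere in the UI beyond the "Added manually" badge is exactly the open question above):
    ```python
    # A sketch with made-up URNs: emit upstream lineage via MCP using UpstreamClass.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    upstream = UpstreamClass(
        dataset=make_dataset_urn("hive", "db.upstream_table", "PROD"),
        type=DatasetLineageTypeClass.TRANSFORMED,
        # Free-form string map; {"source": "UI"} is what renders as "Added manually".
        properties={"source_file": "etl/job_42.sql"},
    )
    mcp = MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("hive", "db.downstream_table", "PROD"),
        aspect=UpstreamLineageClass(upstreams=[upstream]),
    )
    DatahubRestEmitter("http://localhost:8080").emit(mcp)
    ```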

    incalculable-sundown-8765

    03/13/2024, 10:50 AM
    Hi, it seems like there is an issue ingesting the glossary. This is my recipe:
    ```yaml
    pipeline_name: my_glossary
    
    source:
      type: datahub-business-glossary
      config:
        file: datahub/resources/glossary/my_glossary.yaml
        enable_auto_id: False
    ```
    This is my `my_glossary.yaml`:
    ```yaml
    version: 1
    source: DataHub
    owners:
      users:
        - my.name
    nodes:
      - id: "urn:li:glossaryNode:customer"
        name: Customer
        description: "Customer Glossary"
        terms:
          - id: "urn:li:glossaryTerm:created_at"
            name: Created At
            description: "Timestamp when customer first being created."
    ```
    I get this error:
    ```
    failed to write record with workunit urn:li:glossaryNode:customer/mce with ('Unable to emit metadata to DataHub GMS: com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/0/com.linkedin.glossary.GlossaryNodeInfo/customProperties :: unrecognized field found but not allowed\nERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/0/com.linkedin.glossary.GlossaryNodeInfo/customProperties :: unrecognized field found but not allowed\nERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', 'status': 422, 'urn': 'urn:li:glossaryNode:customer'}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/0/com.linkedin.glossary.GlossaryNodeInfo/customProperties :: unrecognized field found but not allowed\nERROR :: /value/com.linkedin.metadata.snapshot.GlossaryNodeSnapshot/aspects/1/com.linkedin.common.Ownership/ownerTypes :: unrecognized field found but not allowed\n', 'status': 422, 'urn': 'urn:li:glossaryNode:customer'}
    ```
    I believe this is coming from the `owners`. I'm seeing this issue with the CSV enricher as well whenever I add ownership. DataHub version: v0.12.1

    cuddly-dinner-641

    03/13/2024, 1:10 PM
    In the Databricks ingestion source, it looks like the `include_metastore` flag is deprecated and will always be "false" in the future. Isn't the metastore necessary to guarantee that dataset URNs are unique?

    quiet-computer-34771

    03/13/2024, 4:29 PM
    UI/CLI: UI. Version: 0.13.0. Source: MSSQL inside Amazon AWS (yes, I know about the system account constraint). MSSQL tables and views come in just fine; view lineage, however, is not shown. I found a question about this that was last updated 10 months ago, but the suggested link for the API/SDK reference is dead. What is the solution for supporting view lineage?

    little-painter-30105

    03/13/2024, 6:45 PM
    Hi Team, we have DataHub integrated with Airflow + Snowflake + dbt + Tableau. I am trying to do some custom metadata updates using the GraphQL API. Currently the Airflow DAG owner name flows from the DAG to DataHub. For all DAGs/tasks, we want to keep a different owner name in the Airflow UI, but it should be updated to the `ldap` user name (the DataHub signed-in user) in DataHub. How can I update just the Airflow owner name in DataHub (keeping a different Airflow owner in the Airflow UI)? Is there a way to update it using API calls or ingestion in DataHub?
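    One hedged option (a sketch, not an official pattern): overwrite the ownership aspect on the DataFlow entity yourself after ingestion with the Python SDK. Note the Airflow plugin may re-emit ownership on the next DAG run.
    ```python
    # A sketch with made-up names: replace the ownership aspect on an Airflow
    # DataFlow entity with a chosen DataHub (e.g. LDAP) user.
    from datahub.emitter.mce_builder import make_data_flow_urn, make_user_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        OwnerClass,
        OwnershipClass,
        OwnershipTypeClass,
    )

    flow_urn = make_data_flow_urn(orchestrator="airflow", flow_id="my_dag", cluster="prod")
    ownership = OwnershipClass(
        owners=[
            OwnerClass(owner=make_user_urn("ldap_user"), type=OwnershipTypeClass.TECHNICAL_OWNER)
        ]
    )
    DatahubRestEmitter("http://datahub-gms:8080").emit(
        MetadataChangeProposalWrapper(entityUrn=flow_urn, aspect=ownership)
    )
    ```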

    sparse-arm-36740

    03/13/2024, 8:19 PM
    Hi Team, does DataHub collect table constraint information from ingested tables? For example, TABLE1 from a Postgres database with a primary key of 'my_id' and a foreign key of 'some_other_id' referencing TABLE2. If so, I am unable to find it. Can anyone tell me if it is displayed in the UI? Can I get it from one of the APIs? Any help is appreciated!
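    For what it's worth, SQL-based sources such as Postgres generally land key constraints in the schemaMetadata aspect; a sketch for reading them with the Python graph client (server and URN are placeholders):
    ```python
    # A sketch: read primary/foreign key info from the schemaMetadata aspect.
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import SchemaMetadataClass

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.public.table1,PROD)"

    schema = graph.get_aspect(dataset_urn, SchemaMetadataClass)
    if schema:
        print("primary keys:", schema.primaryKeys)
        for fk in schema.foreignKeys or []:
            print(fk.name, fk.sourceFields, "->", fk.foreignDataset)
    ```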

    microscopic-twilight-7661

    03/14/2024, 10:37 AM
    Hi Everyone, is there a way to ingest LookML using the UI without providing the GitHub deploy key in plaintext? I've tried to add the key as a secret, but that results in an error (debug logs in thread). I am able to ingest the metadata if I provide the key in plaintext. We are using DataHub v0.12.1 and ingesting the metadata through the UI.
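    For comparison while debugging: UI-managed secrets are normally referenced in the recipe body as ${SECRET_NAME}; a hedged sketch of the relevant snippet (repo and secret names are made up):
    ```yaml
    # A sketch; assumes a UI secret named LOOKML_DEPLOY_KEY holds the private key.
    source:
      type: lookml
      config:
        github_info:
          repo: my-org/my-lookml-repo
          deploy_key: "${LOOKML_DEPLOY_KEY}"
    ```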

    victorious-lizard-36455

    03/14/2024, 11:10 AM
    Hi Everyone, ingesting Snowflake metadata into DataHub does not currently offer visibility into the fields within columns of the VARIANT type. This limitation affects the ability to catalog and search nested fields stored in semi-structured formats like JSON within VARIANT columns. Is there any way to access nested fields in the DataHub catalog?

    glamorous-area-45109

    03/14/2024, 4:09 PM
    Hi all, I need to tag BigQuery datasets with the layer they belong to. For this I am employing pattern_add_dataset_tags:
    ```yaml
    transformers:
      - type: "pattern_add_dataset_tags"
        config:
          replace_existing: true
          tag_pattern:
            rules:
              ".*common.*": ["urn:li:tag:layer:common"]
              ".*core.*": ["urn:li:tag:layer:core"]
              ".*consumption.*": ["urn:li:tag:layer:consumption"]
    ```
    However, this only tags the tables and views inside the dataset. Is there any way to tag only the datasets themselves and not the tables and views? Thanks!