# ingestion
  • h

    hallowed-kilobyte-916

    05/22/2023, 10:08 AM
    I am trying to add descriptions to columns of a DataHub dataset sourced from S3, using the code below:
    Copy code
    import logging
    import time
    
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    
    # read-modify-write requires access to the DataHubGraph (RestEmitter is not enough)
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    
    # Imports for metadata model classes
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        EditableSchemaFieldInfoClass,
        EditableSchemaMetadataClass,
        InstitutionalMemoryClass,
    )
    
    log = logging.getLogger(__name__)
    logging.basicConfig(level=logging.INFO)
    
    
    def get_simple_field_path_from_v2_field_path(field_path: str) -> str:
        """A helper function to extract simple . path notation from the v2 field path"""
        if not field_path.startswith("[version=2.0]"):
            # not a v2, we assume this is a simple path
            return field_path
        # this is a v2 field path
        tokens = [t for t in field_path.split(".") if not (t.startswith("[") or t.endswith("]"))]
    
        return ".".join(tokens)
    
    
    dictionary = {
        "a": "b",
        "c": "d",
    }
    
    
    # Inputs -> documentation, dataset
    documentation_to_add = "The unique application (service) correlation id on service now"
    dataset_name = "a/b/20230511.csv"
    
    dataset_urn = make_dataset_urn(platform="s3", name=dataset_name, env="PROD")
    
    
    def add_dict(graph, dataset_urn, column, documentation_to_add):
        need_write = False
    
        field_info_to_set = EditableSchemaFieldInfoClass(fieldPath=column, description=documentation_to_add)
    
        # Some helpful variables to fill out objects later
        now = int(time.time() * 1000)  # milliseconds since epoch
        current_timestamp = AuditStampClass(time=now, actor="urn:li:corpuser:ingestion")
    
        current_editable_schema_metadata = graph.get_aspect(
            entity_urn=dataset_urn, aspect_type=EditableSchemaMetadataClass,
        )
    
        # need_write = False
    
        if current_editable_schema_metadata:
            for fieldInfo in current_editable_schema_metadata.editableSchemaFieldInfo:
                if get_simple_field_path_from_v2_field_path(fieldInfo.fieldPath) == column:
                    # we have some editable schema metadata for this field
                    field_match = True
                    if documentation_to_add != fieldInfo.description:
                        fieldInfo.description = documentation_to_add
                        need_write = True
        else:
            # create a brand new editable dataset properties aspect
            current_editable_schema_metadata = EditableSchemaMetadataClass(
                editableSchemaFieldInfo=[field_info_to_set], created=current_timestamp,
            )
            need_write = True
    
        if need_write:
            event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
                entityUrn=dataset_urn, aspect=current_editable_schema_metadata,
            )
            graph.emit(event)
            log.info(f"Documentation added to dataset {dataset_urn}")
    
        else:
            <http://log.info|log.info>("Documentation already exists and is identical, omitting write")
    
    
    # Create the DataHub graph client
    gms_endpoint = "http://localhost:8080"
    graph = DataHubGraph(config=DatahubClientConfig(server=gms_endpoint))
    
    
    for column, documentation_to_add in dictionary.items():
        print(f"column: {column} and documentation_to_add: {documentation_to_add}")
        add_dict(graph, dataset_urn, column, documentation_to_add)
    However, the code just tells me the documentation already exists, while in fact it doesn't exist in DataHub.
    Copy code
    column: a and documentation_to_add: b
    INFO:__main__:Documentation already exists and is identical, omitting write
    column: c and documentation_to_add: d
    INFO:__main__:Documentation already exists and is identical, omitting write
    Am I missing anything? I am following instructions from here: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_add_column_documentation.py
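    One thing that stands out, for anyone reading along: field_match is set but never used. If the aspect already exists but contains no entry for this column, need_write stays False and the script skips the write, which matches the log output above. The linked example guards against that case by appending the new field info; a minimal sketch of that branch, reusing the variable names from the snippet above:
    Copy code
    if current_editable_schema_metadata:
        field_match = False
        for fieldInfo in current_editable_schema_metadata.editableSchemaFieldInfo:
            if get_simple_field_path_from_v2_field_path(fieldInfo.fieldPath) == column:
                field_match = True
                if documentation_to_add != fieldInfo.description:
                    fieldInfo.description = documentation_to_add
                    need_write = True
        if not field_match:
            # the aspect exists but has no entry for this column yet: append it
            current_editable_schema_metadata.editableSchemaFieldInfo.append(field_info_to_set)
            need_write = True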
    g
    • 2
    • 5
  • l

    lemon-scooter-69730

    05/22/2023, 3:27 PM
    Can you specify the CLI version in a YAML-formatted recipe?
    ✅ 1
    g
    • 2
    • 3
  • l

    lemon-scooter-69730

    05/22/2023, 5:05 PM
    Also, can you name a CLI ingest job so that when it shows in the UI it's not just the platform type under the name? So that instead of
    dbt
    for example it's
    a_specific_dbt_instance
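    One knob that may help here, sketched against the recipe format used elsewhere in this channel: recipes accept a top-level pipeline_name, and the assumption in this sketch is that CLI runs show up in the UI grouped under that name rather than just the platform type. Paths below are placeholders:
    Copy code
    pipeline_name: a_specific_dbt_instance
    source:
        type: dbt
        config:
            manifest_path: /path/to/manifest.json
            catalog_path: /path/to/catalog.json
            target_platform: bigquery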
    ✅ 1
    g
    • 2
    • 6
  • h

    hundreds-airline-29192

    05/23/2023, 4:43 AM
    I am facing this error when using Spark lineage with DataHub. Can anybody help me, please?
    m
    • 2
    • 3
  • e

    enough-elephant-56787

    05/23/2023, 6:44 AM
    Hello all! Issue: not able to ingest a single table from an Athena database. I was trying to ingest Athena tables with "AwsDataCatalog" as the data source. I have a database named "employees_db", and under that I have two tables, "male_emp" and "female_emp". When I try to ingest the entire "employees_db", the db and the two tables are fully ingested, but the issue arises when I try to ingest a single table from the db. Please help me out with ingesting a specific table in a database with the Athena source. Recipe file I used:
    Copy code
    source:
        type: athena
        config:
            aws_region: us-east-2
            database: employees_db
            query_result_location: 'S3 location'
            work_group: primary
            table_pattern:
                allow:
                    - male_emp
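    One thing worth checking, as an assumption rather than a confirmed diagnosis: table_pattern.allow entries are regular expressions matched against the fully qualified table name (e.g. employees_db.male_emp), so a bare male_emp may fail to match. A sketch of the tweaked fragment:
    Copy code
    table_pattern:
        allow:
            # assumption: patterns are matched against database.table
            - 'employees_db\.male_emp'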
    ✅ 1
    g
    • 2
    • 3
  • h

    hundreds-airline-29192

    05/23/2023, 8:12 AM
    How do I run datahub docker quickstart with a specific GMS port?
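    For what it's worth, the quickstart docs describe an environment variable for customizing the host port mapping of GMS; a sketch, assuming DATAHUB_MAPPED_GMS_PORT is honored by your quickstart version:
    Copy code
    # map GMS to port 58080 on the host instead of the default 8080
    DATAHUB_MAPPED_GMS_PORT=58080 datahub docker quickstart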
    g
    g
    • 3
    • 19
  • h

    hundreds-airline-29192

    05/23/2023, 9:21 AM
    I am facing this error, please help me!
    s
    • 2
    • 13
  • a

    adamant-sunset-13770

    05/23/2023, 2:48 PM
    Hello! We have encountered an issue where data imported from BigQuery to DataHub does not retain BigQuery labels as tags for BigQuery views. We have already enabled
    capture_table_label_as_tag: true
    and observed that it works for BigQuery tables. Could you please confirm whether this is the intended functionality? Additionally, we would greatly appreciate any available workarounds. Thank you and best regards, Stian
    d
    h
    r
    • 4
    • 7
  • h

    hundreds-airline-29192

    05/24/2023, 8:07 AM

    https://files.slack.com/files-pri/TUMKD5EGJ-F059XGR8W4Q/image.png

    g
    g
    • 3
    • 8
  • h

    hundreds-airline-29192

    05/24/2023, 8:08 AM
    Hey! I cannot see anything when I go to the description of a BigQuery table/dataset.
  • h

    hundreds-airline-29192

    05/24/2023, 8:08 AM
    I can't see tags, glossary terms, etc. of a domain or dataset.
  • a

    agreeable-table-54007

    05/24/2023, 8:47 AM
    Hello y'all! I'm wondering if ingestion with the Databricks Unity Catalog source in the UI is working for any of you? It works with the CLI but not the UI. I got this error:
    Copy code
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (unity-catalog): type object 'Retry' has no attribute 'DEFAULT_METHOD_WHITELIST'
    I tried a lot of stuff, installing various library versions, etc., but it's not working. If you have a solution to use only the UI connector and not the CLI, I'll gladly take it. Thanks. Hope you have a great day, guys.
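    For context: Retry.DEFAULT_METHOD_WHITELIST was deprecated in urllib3 1.26 and removed in urllib3 2.0, so this usually means some dependency in the ingestion environment was written against urllib3 1.x. A possible workaround, assuming you can pin packages in the environment the UI executor runs in:
    Copy code
    pip install 'urllib3<2'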
    g
    • 2
    • 2
  • e

    echoing-evening-57052

    05/24/2023, 12:01 PM
    Hello all, after ingesting metadata from Redshift, I am unable to see information about constraints like primary keys and foreign keys in the schema part of the dataset. Is there anything specific needed while ingesting from Redshift? Can anyone suggest anything?
    d
    • 2
    • 12
  • b

    boundless-nail-65912

    05/24/2023, 1:24 PM
    Hi team, during ingestion I am getting the below error. May I know why I am getting this error? Can anyone help me with it?
    s
    g
    • 3
    • 3
  • a

    adamant-sunset-13770

    05/24/2023, 1:55 PM
    Hello! We are currently having an issue with ingesting descriptions from BigQuery datasets. The problem seems to be that the query used to obtain the dataset metadata uses `select x from {project_id}.INFORMATION_SCHEMA.SCHEMATA`, which automatically sets `region-us` for the query (see documentation). We have all our data in `region-eu`, but there does not seem to be any way of specifying the region.
    1. Can you confirm that `region-us` is being set?
    2. Do you know of any workarounds?
    All the best, Stian
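    For anyone hitting the same thing: BigQuery lets you qualify INFORMATION_SCHEMA with a region, which is presumably what the connector would need to emit here. For example:
    Copy code
    -- region-qualified form; without a qualifier BigQuery defaults to the US region
    SELECT schema_name
    FROM `region-eu`.INFORMATION_SCHEMA.SCHEMATA;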
    ✅ 1
    g
    d
    • 3
    • 4
  • b

    bored-truck-17085

    05/25/2023, 12:14 AM
    Hi everyone, I'm having an issue with ingesting the dbt freshness validation, and I would like to understand why my freshness validation doesn't sync correctly. I currently use the CLI recipe config below, syncing all files (manifest, catalog, sources, and run_results). The last update for a specific table is "2023-05-24", but the stats tab shows "2023-05-20". Has anyone faced this issue before?
    Copy code
    source:
        type: "dbt"
        config:
            # Coordinates
            manifest_path: "/home/datahub/datahub/manifest.json"
            catalog_path: "/home/datahub/datahub/catalog.json"
            sources_path: "/home/datahub/datahub/sources.json"
            test_results_path: "/home/datahub/datahub/run_results.json"
    
            # Options
            stateful_ingestion:
                enabled: false
            entities_enabled:
                models: 'YES'
                sources: 'YES'
                seeds: 'NO'
                test_definitions: 'YES'
                test_results: 'YES'
            target_platform: bigquery
    
    pipeline_name: "source_tests"
    
    sink:
        type: "datahub-rest"
        config:
            server: "http://localhost:8080"
            token: ${DATAHUB_CLI_ACCESS_TOKEN}
    👍 1
    g
    • 2
    • 1
  • s

    silly-ambulance-51171

    05/25/2023, 7:33 AM
    Hi there! Could anyone suggest whether DataHub has any notification options for cases when the ingestion process is not successful? I was thinking about using alerts based on Prometheus metrics (but did not find any useful metric for that). Maybe something like a Slack or email callback is possible?
    g
    • 2
    • 1
  • h

    hundreds-airline-29192

    05/25/2023, 8:03 AM
    Can I display my Spark insert-into-BigQuery flow on DataHub?
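    For reference, the DataHub Spark lineage integration is enabled through Spark properties; a sketch, with the version left as a placeholder and your_spark_job.py as a hypothetical job:
    Copy code
    spark-submit \
        --packages io.acryl:datahub-spark-lineage:<version> \
        --conf spark.extraListeners=datahub.spark.DatahubSparkListener \
        --conf spark.datahub.rest.server=http://localhost:8080 \
        your_spark_job.py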
    g
    • 2
    • 1
  • e

    echoing-branch-87829

    05/25/2023, 8:59 AM
    Hi, I have DataHub on a Linux VM and managed to get the frontend up. When I try to run a dbt ingestion I am receiving the below error:
    Copy code
    [2023-05-25 07:55:30,344] ERROR    {datahub.entrypoints:195} - Command failed: [Errno 13] Permission denied: '/root/xxxx/target/manifest.json'
    I have tried giving permissions on the folder but am struggling to get past this permission issue, if anyone could help please.
    ✅ 1
    g
    • 2
    • 1
  • e

    echoing-branch-87829

    05/25/2023, 8:59 AM
    recipe:
  • e

    echoing-branch-87829

    05/25/2023, 9:00 AM
    Copy code
    source:
        type: dbt
        config:
            manifest_path: /root/virtualone/virtualone/target/manifest.json
            catalog_path: /root/virtualone/virtualone/target/catalog.json
            sources_path: /root/virtualone/virtualone/target/sources.json
            test_results_path: /root/virtualone/virtualone/target/run_results.json
            target_platform: snowflake
  • e

    echoing-branch-87829

    05/25/2023, 9:00 AM
    the paths come from a git fetch on the VM
  • r

    ripe-stone-30144

    05/25/2023, 9:56 AM
    Hi guys! Can someone tell me exactly what roles/rights the user needs for ingesting metadata from MS SQL Server? https://datahubproject.io/docs/generated/ingestion/sources/mssql/#config-details
    ✅ 1
    g
    • 2
    • 2
  • q

    quiet-exabyte-77821

    05/25/2023, 10:31 AM
    Hello everyone, is there any way to ingest metadata for the dbt source other than the CLI or UI?
    ✅ 1
    a
    • 2
    • 1
  • a

    ancient-queen-15575

    05/25/2023, 10:50 AM
    I’m having an issue connecting to DataHub from my terminal and am wondering if anyone can help 🙏. I can successfully run a curl request like the one below:
    Copy code
    curl --location --request POST 'https://dev.mydomain.org/api/graphql' \
    --header 'Authorization: Bearer <datahub token>' \
    --header 'Content-Type: application/json' \
    --data-raw '{ "query": "{ dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:s3,land-dev.insurance/calculations,DEV)\") { domain { associatedUrn domain { urn properties { name } } } } }", "variables":{}}'
    But I can't connect when trying to run a recipe. My environment variable for the token is called DATAHUB_GMS_TOKEN and has the same value as is used in the curl request. For DATAHUB_GMS_URL the value is https://dev.mydomain.org:8080. I’m not understanding why a query to GraphQL would work but connecting to the GMS port wouldn’t.
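    One configuration worth trying, as an assumption about the deployment: if GraphQL works on https://dev.mydomain.org but port 8080 isn't reachable from outside, GMS may only be exposed through the frontend proxy, which serves it under /api/gms. In that case:
    Copy code
    # point the CLI at the proxied GMS path instead of the raw port
    export DATAHUB_GMS_URL='https://dev.mydomain.org/api/gms'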
    g
    • 2
    • 1
  • h

    helpful-guitar-93961

    05/25/2023, 11:51 AM
    I am configuring the GCS source in the CLI and this error happens. How do I solve this?
    d
    r
    a
    • 4
    • 12
  • f

    flat-engineer-75197

    05/25/2023, 12:22 PM
    👋 I’d like to update the JSON files used in the integration tests for dbt. Do I have to manually update these, or is there a command to run (similar to updating golden files)?
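    In case it helps others: the metadata-ingestion test suite regenerates golden files via a pytest flag, and my assumption is the dbt integration test JSONs are updated the same way (the exact test path may differ):
    Copy code
    pytest tests/integration -k dbt --update-golden-files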
    ✅ 1
    g
    • 2
    • 4
  • h

    helpful-guitar-93961

    05/25/2023, 1:11 PM
    I am ingesting GCS metadata and facing this error. How do I solve this?
    g
    • 2
    • 3
  • a

    adventurous-pillow-74569

    05/25/2023, 1:47 PM
    Hello, I am trying to ingest data from BigQuery using UI-based ingestion and am getting the below error. What could be the resolution for this?
    g
    • 2
    • 1
  • a

    adamant-sunset-13770

    05/25/2023, 3:37 PM
    Hello, we’re facing an issue with our ingestion pipelines where manually added tags from the UI are getting overwritten.
    • Is this expected?
    • Is it possible to disable this behaviour and keep both the manually added tags and the ingested tags?
    Thanks, Stian
    g
    • 2
    • 4