# ingestion
  • gentle-plastic-92802

    02/01/2023, 8:25 PM
    Hi all. Is there any way to create a business glossary other than through YAML or the UI?
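    For reference, glossary entities can also be created programmatically. A minimal sketch with the Python emitter, assuming the acryl-datahub package and a GMS at localhost:8080; the term name and definition are placeholders:
    Copy code
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import GlossaryTermInfoClass

    # Placeholder server URL; point this at your GMS instance.
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    # Upsert a glossary term by emitting its info aspect.
    term_info = GlossaryTermInfoClass(
        definition="A customer of the business.",  # placeholder definition
        termSource="INTERNAL",
    )
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="glossaryTerm",
            entityUrn="urn:li:glossaryTerm:Customer",  # placeholder term
            aspect=term_info,
        )
    )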
    👀 1
  • plain-france-42647

    02/01/2023, 9:13 PM
    I want to write Python code to parse the result of an ingestion that was written to a file. Is there a good way to parse the objects? What I do right now is:
    Copy code
    import json

    # Load the ingestion output; '/bla.json' is the file written by the sink.
    with open('/bla.json', 'rt') as f:
        d = json.load(f)
    The problem is that d is now a list of dictionaries of different types (e.g. ChartSnapshot). For each item in d I need to check its type (how do I do that?) and then use the type-specific extractor. Is there any generic code that can easily do this? (FTR, in my specific case I have the result of running ingestion for Tableau.)
  • magnificent-lock-58916

    02/02/2023, 4:33 AM
    I have a question about file-based lineage https://datahubproject.io/docs/generated/ingestion/sources/file-based-lineage/ I need to connect a ClickHouse table to a Tableau SQL query. The issue is that our Tableau has tons of different queries with the same name (e.g. most data sources have a query named just "Custom SQL Query"). The same applies to ClickHouse tables: they can share a name but have different schemas. How can I specify a particular entity in the YAML file if it shares its name with many other entities of the same type and same platform?
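    One note on identification, as a sketch rather than a confirmed answer: a DataHub dataset is keyed by (platform, name, env), so the name in the lineage file must be the unique name from the entity's URN, urn:li:dataset:(urn:li:dataPlatform:...,<name>,PROD), not the display name shown in the UI. All names below are placeholders:
    Copy code
    version: 1
    lineage:
        - entity:
              # For a Tableau custom SQL query, copy the name segment out of
              # the existing entity's URN (visible in the dataset page URL).
              name: placeholder-custom-sql-id
              type: dataset
              env: PROD
              platform: tableau
          upstream:
              - entity:
                    name: mydb.my_table  # fully qualified ClickHouse name
                    type: dataset
                    env: PROD
                    platform: clickhouse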
    ✅ 1
  • plain-cricket-83456

    02/02/2023, 6:47 AM
    Hi team. When I change the parameters of a data ingestion, the cron parameter that was set for scheduled execution is reset to none. Why? I'm using DataHub v0.8.41.
    ✅ 1
  • fresh-zoo-34934

    02/02/2023, 9:15 AM
    Hi team, I have this setup on DataHub: • Looker + LookML • Databricks Unity Catalog (UC). LookML points to Trino tables, but we didn't ingest the Trino tables because all of them are also available in Databricks UC. Is it possible to do these steps using Python? 1. Get all the Trino tables from DataHub. 2. Change all of these Trino tables to Databricks UC for now. I know there is a GraphQL UpdateLineageInput to update the lineage, and we could probably set the Trino tables to archived using BatchDatasetUpdateInput, but I don't know whether this is the best solution because there is no batch UpdateLineageInput.
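    For what it's worth, a sketch of how that loop could look in Python, calling the updateLineage mutation one edge at a time since there is no batch variant. Server, token, and URNs are placeholders, and the LineageEdge field names reflect my reading of the GraphQL schema:
    Copy code
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

    # Placeholder server and token.
    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080", token="..."))

    UPDATE_LINEAGE = """
    mutation updateLineage($input: UpdateLineageInput!) {
      updateLineage(input: $input)
    }
    """

    def replace_upstream(downstream_urn: str, trino_urn: str, uc_urn: str) -> None:
        """Remove the Trino upstream edge and add the Databricks UC one."""
        graph.execute_graphql(
            UPDATE_LINEAGE,
            variables={
                "input": {
                    "edgesToAdd": [{"downstreamUrn": downstream_urn, "upstreamUrn": uc_urn}],
                    "edgesToRemove": [{"downstreamUrn": downstream_urn, "upstreamUrn": trino_urn}],
                }
            },
        )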
  • best-dawn-94548

    02/02/2023, 9:42 AM
    Hi all, I have installed DataHub and now I am looking into ingesting metadata from a Microsoft SQL database. Is there a way to use Windows authentication, or would I always need to specify a username and password? (I currently do not have access to a service account.) Thank you for your help in advance! 🙏
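    Not verified end-to-end, but the mssql source exposes use_odbc and uri_args, and ODBC connection strings accept Trusted_Connection=yes for Windows authentication, so a recipe along these lines may work; host, database, and driver version are placeholders:
    Copy code
    source:
        type: mssql
        config:
            host_port: 'myserver:1433'
            database: mydb
            use_odbc: true
            uri_args:
                driver: 'ODBC Driver 17 for SQL Server'
                Trusted_Connection: 'yes'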
    ✅ 1
  • aloof-egg-97140

    02/02/2023, 10:04 AM
    Hi all, is it possible to include field-level lineage for BigQuery sources? I haven't found anything in the documentation.
    ✅ 1
  • best-umbrella-88325

    02/02/2023, 12:43 PM
    Hello community! I'm trying to make a change in the existing S3 ingestion source. I've made the changes on my local system and then ran the command
    Copy code
    python -m build
    followed by
    Copy code
    python -m pip install .
    to install the datahub CLI locally, in the metadata-ingestion directory. I see the 0.0.0.dev0 version of datahub getting installed as well. However, when I run datahub ingest -c file.yaml or datahub version, it fails with the following error.
    Copy code
    Traceback (most recent call last):
      File "C:\XXX\XXX\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\XXX\XXX\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "C:\XXX\XXX\AppData\Local\Programs\Python\Python310\Scripts\datahub.exe\__main__.py", line 4, in <module>
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\entrypoints.py", line 12, in <module>
        from datahub.cli.check_cli import check
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\cli\check_cli.py", line 7, in <module>
        from datahub.cli.json_file import check_mce_file
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\cli\json_file.py", line 3, in <module>
        from datahub.ingestion.source.file import GenericFileSource
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\ingestion\source\file.py", line 17, in <module>
        from datahub.emitter.mcp import MetadataChangeProposalWrapper
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\emitter\mcp.py", line 5, in <module>
        from datahub.emitter.aspect import ASPECT_MAP, TIMESERIES_ASPECT_MAP
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\emitter\aspect.py", line 1, in <module>
        from datahub.metadata.schema_classes import ASPECT_CLASSES
    ModuleNotFoundError: No module named 'datahub.metadata'
    Can someone help me here as to what is going wrong? Thanks in advance!
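    For anyone hitting this later: datahub.metadata.schema_classes is generated code, so a plain build from the source tree won't contain it until the codegen step has run. The developer docs suggest installing through Gradle instead, e.g.:
    Copy code
    ../gradlew :metadata-ingestion:installDev  # run from the metadata-ingestion directory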
    ✅ 1
  • delightful-orange-22738

    02/02/2023, 1:49 PM
    Hello, I don't see in the Airflow properties how to set the Airflow host. https://datahubproject.io/docs/lineage/airflow/#how-to-validate-installation
    ✔️ 1
    ✅ 1
  • elegant-salesmen-99143

    02/02/2023, 6:43 PM
    Hey guys. I'm trying to add a Pattern Add Dataset glossaryTerms transformer, and I've managed to add a term to a table field, but I don't understand how to write a rule that adds a term to a whole table based on a field it contains. Here's a piece of my recipe with the transformer:
    Copy code
    - type: pattern_add_dataset_schema_terms
      config:
          semantics: OVERWRITE
          term_pattern:
              rules:
                  first_name: ['urn:li:glossaryTerm:XXX']
    It adds term XXX to the first_name field within a table. Let's say I want to add this term to the whole table with a type: "pattern_add_dataset_terms" transformer. How do I do that? So far the ways I've tried to write it didn't work...
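    A sketch of what the second transformer could look like, with the caveat that for pattern_add_dataset_terms the rule keys are matched against the dataset URN rather than a field name, so the pattern has to identify the table itself; the table name below is a placeholder:
    Copy code
    - type: pattern_add_dataset_terms
      config:
          semantics: OVERWRITE
          term_pattern:
              rules:
                  '.*my_table_name.*': ['urn:li:glossaryTerm:XXX']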
    ✅ 1
  • rich-state-73859

    02/02/2023, 10:16 PM
    Hi all, when I ingested Glue locally, it ran successfully but no metadata was updated. I can visit the table directly by URL, but I can't search for it. I got a document_missing_exception from Elasticsearch when checking the GMS logs.
  • elegant-salesmen-99143

    02/03/2023, 8:57 AM
    I'm still a bit confused about which sources allow the Query History functionality and which don't, so I'll just ask: is it possible for Presto on Hive? I think Presto does store its logs. And the second part of the question: apart from the system Presto logs, if we have our own separate table where we store queries for Presto, is it possible to import them and show them in DataHub in the Queries tab for a dataset?
    ✅ 1
  • best-wire-59738

    02/03/2023, 11:42 AM
    Hello team, we were wondering whether we can restrict users' access to all of DataHub's APIs. We found that we could enable metadata service authentication, but that also restricts ingestion, so we would need to provide a token when ingesting. We want to keep ingestion token-free while restricting all other kinds of API access DataHub currently has. Could you let me know if that's possible?
  • microscopic-twilight-7661

    02/03/2023, 2:18 PM
    Hi everyone, we have a *.proto source file that contains multiple non-nested messages. Is there a way to specify which message to emit, or even multiple messages?
  • brainy-intern-50400

    02/03/2023, 4:44 PM
    Hi, I don't know if I found a bug, but if I update lineage information with the Python emitter, the entity information is updated while the graph data stays the same. I could also make a GraphQL query, but it seems to me the graph should be updated automatically. Neo4j actually responds with an error:
    Copy code
    Remove edge not supported by Neo4JGraphService at this time.
  • great-kangaroo-88413

    02/03/2023, 10:21 PM
    I have installed DataHub on Kubernetes. I created a Kafka topic called data.now and populated it with test data:
    Copy code
    {
      "topic": "data.now",
      "partition": 0,
      "offset": 2000,
      "tstype": "create",
      "ts": 1675455920363,
      "broker": 1,
      "key": null,
      "payload": "{\"id\":1,\"first_name\":\"Zachariah\",\"last_name\":\"Wiffield\",\"email\":\"zwiffield0@amazonaws.com\",\"gender\":\"Male\",\"ip_address\":\"176.189.152.5\"}"
    }
    {
      "topic": "data.now",
      "partition": 0,
      "offset": 2001,
      "tstype": "create",
      "ts": 1675455920363,
      "broker": 1,
      "key": null,
      "payload": "{\"id\":2,\"first_name\":\"Hilton\",\"last_name\":\"Siverns\",\"email\":\"hsiverns1@csmonitor.com\",\"gender\":\"Male\",\"ip_address\":\"216.205.159.252\"}"
    }
    This is my source
    Copy code
    source:
        type: kafka
        config:
            connection:
                consumer_config:
                    security.protocol: PLAINTEXT
                bootstrap: 'kafka0.com:9094,kafka1.com:9094,kafka2.com:9094'
                schema_registry_url: 'http://dh-cp-schema-registry:8081'
            stateful_ingestion:
                enabled: false
            topic_patterns:
                allow:
                    - data.now
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-datahub-gms.mynamespace.svc.cluster.local:8080'
    I get this message
    The schema registry subject for the value schema is not found. The topic is either schema-less, or no messages have been written to the topic yet.
    This is what I end up with. What am I missing?
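    One reading of that warning: the kafka source looks up the value schema under the subject data.now-value in the schema registry, and messages produced as raw JSON (as above) never register one, so the topic is treated as schema-less. If you want a schema, it has to be registered first; a sketch with the confluent-kafka client, where the Avro schema is a placeholder matching the test payload:
    Copy code
    from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

    client = SchemaRegistryClient({"url": "http://dh-cp-schema-registry:8081"})

    # Placeholder Avro schema for the test payload above.
    value_schema = """
    {
      "type": "record",
      "name": "Person",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "gender", "type": "string"},
        {"name": "ip_address", "type": "string"}
      ]
    }
    """

    # Subjects follow the TopicNameStrategy convention: <topic>-value.
    client.register_schema("data.now-value", Schema(value_schema, "AVRO"))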
  • rhythmic-glass-37647

    02/03/2023, 11:04 PM
    I'm trying to add Tableau as an ingestion source, but I'm not getting any assets. I'm running Tableau Server and everything seems happy, but no metadata is being ingested. I've made many attempts to modify the recipe into something that works; I've currently stripped it down mostly to the basics. Any ideas why my ingestion is coming up empty?
    Copy code
    source:
        type: tableau
        config:
            ingest_owner: true
            connect_uri: 'https://mytableau.mycompany.com'
            ssl_verify: true
            token_name: datahub
            token_value: 'mytoken'
            ingest_tags: true
    pipeline_name: 'urn:li:dataHubIngestionSource:ba12380e-7fc1-425e-9783-88ada4ab8b61'
    ✅ 2
  • bitter-evening-61050

    02/06/2023, 6:43 AM
    Hi, I have created an ingestion from Databricks Hive to DataHub, but I am not able to see the lineage.
    Copy code
    source:
        type: hive
        config:
            username: token
            password: 'xxxx'
            host_port: 'https://xxxxxxxx'
            scheme: databricks+pyhive
            database: gmfdb
            options:
                connect_args:
                    http_path: xxxxx
            stateful_ingestion:
                enabled: false
                remove_stale_metadata: true
            profiling:
                enabled: true
                profile_table_level_only: true
    sink:
        type: datahub-rest
        config:
            server: http://xx.xx.xxx.xxx:8080
            token: xxxx
    ✅ 1
  • plain-cricket-83456

    02/06/2023, 7:37 AM
    @hundreds-photographer-13496 Does data extraction affect CPU consumption on the server where the database resides? During Hive data ingestion last week, I found that the CPU consumption of the Hive server increased.
  • fresh-balloon-59613

    02/06/2023, 8:03 AM
    Hi everyone, I am trying to integrate DataHub with Airflow. I am getting this message: Successfully added conn_id='datahub_rest' : generic//*'<http//2|http>//*******'. But from the DataHub platform I am not able to see the DAG lineage in Pipelines.
  • better-state-74960

    02/06/2023, 8:23 AM
    Is there documentation for the DataTemplate(com.linkedin.data.template.DataTemplate) implementation class?
  • better-state-74960

    02/06/2023, 8:25 AM
    Can I delete UpstreamLineage metadata using the Java emitter SDK?
  • steep-fountain-54482

    02/06/2023, 10:31 AM
    Hello, I'm getting this message in the UI when looking at a dataset I just created using the API.
    ✅ 1
  • steep-fountain-54482

    02/06/2023, 10:31 AM
    Copy code
    This entity is not discoverable via search or lineage graph. Contact your DataHub admin for more information.
  • steep-fountain-54482

    02/06/2023, 10:32 AM
    This dataset seems to be the output of a job, and they appear connected.
  • steep-fountain-54482

    02/06/2023, 10:33 AM
    However, I can only reach it by setting the URN in the browser URL.
  • square-yak-42039

    02/06/2023, 11:33 AM
    How can we protect manual lineage set in the UI from being overridden by the next ingestion (we use the dbt platform)?
    👍 1
  • elegant-salesmen-99143

    02/06/2023, 1:17 PM
    Hi. We recently tried ingesting Presto on Hive and ran into two problems: 1. In Platform view it doesn't show tables within a schema (screenshot 1). 2. In Dataset view it shows tables, but doesn't show the fields in them (screenshot 2). Any idea what we're doing wrong? We're on 0.9.6.1 and the recipe is
    Copy code
    source:
        type: presto
        config:
            host_port: 'XXX'
            database: hive
            username: hive
            include_views: false
            include_tables: false
            profiling:
                enabled: true
                profile_table_level_only: true
                include_field_sample_values: true
            schema_pattern:
                allow:
                    - sandbox_data
            stateful_ingestion:
                enabled: true
    transformers:
        -
            type: set_dataset_browse_path
            config:
                replace_existing: true
                path_templates:
                    - /ENV/PLATFORM/DATASET_PARTS
    (The transformer in the recipe is what made tables display in Dataset view; without it they weren't shown there either, just like in Platform view.)
  • crooked-carpet-28986

    02/06/2023, 2:18 PM
    Hi everyone, is there any way to disable SSL verification for the Trino connector? If not, what is the best/softest way to install a certificate? I am deploying on a K8s cluster; I connect to the actions container and install it there, but that is tricky since we do not have full access to the K8s node to log in to the container as root.
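    Unverified, but possibly useful as a sketch: the trino source passes SQLAlchemy connect_args through to the Trino Python client, which understands a verify flag for TLS verification; host and credentials are placeholders:
    Copy code
    source:
        type: trino
        config:
            host_port: 'my-trino-host:443'
            database: mydb
            username: myuser
            options:
                connect_args:
                    verify: false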
  • gentle-plastic-92802

    02/06/2023, 7:00 PM
    Hi all, does someone have an example of programmatically sending a request to Rest.li? I would like to create a business glossary.
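    A sketch against the legacy Rest.li ingest action, assuming a GMS at localhost:8080; the term URN and definition are placeholders:
    Copy code
    import requests

    # GlossaryTermSnapshot wrapped in the Rest.li snapshot union envelope.
    payload = {
        "entity": {
            "value": {
                "com.linkedin.metadata.snapshot.GlossaryTermSnapshot": {
                    "urn": "urn:li:glossaryTerm:ExampleTerm",
                    "aspects": [
                        {
                            "com.linkedin.glossary.GlossaryTermInfo": {
                                "definition": "Example definition",
                                "termSource": "INTERNAL",
                            }
                        }
                    ],
                }
            }
        }
    }
    resp = requests.post(
        "http://localhost:8080/entities?action=ingest",
        headers={"X-RestLi-Protocol-Version": "2.0.0"},
        json=payload,
    )
    resp.raise_for_status()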
    ✅ 1