# ingestion
  • gentle-plastic-92802

    02/01/2023, 8:25 PM
    Hi all. Is there any way to create a business glossary other than through YAML or the UI?
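    For reference, glossary entities can also be created programmatically. A minimal sketch with the Python emitter, assuming the acryl-datahub package and a GMS at localhost:8080; the term name and definition are placeholders:
    Copy code
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import GlossaryTermInfoClass

    # Placeholder server URL; point this at your GMS instance.
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    # Upsert a glossary term by emitting its info aspect.
    term_info = GlossaryTermInfoClass(
        definition="A customer of the business.",  # placeholder definition
        termSource="INTERNAL",
    )
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="glossaryTerm",
            entityUrn="urn:li:glossaryTerm:Customer",  # placeholder term
            aspect=term_info,
        )
    )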
    👀 1
  • plain-france-42647

    02/01/2023, 9:13 PM
    I want to write Python code to parse the result of an ingestion that was written to a file. Is there a good way to parse the objects? What I do right now is:
    Copy code
    import json

    # Load the ingestion output; '/bla.json' is the file written by the sink.
    with open('/bla.json', 'rt') as f:
        d = json.load(f)
    The problem is that d is now a list of dictionaries of different types (e.g. ChartSnapshot). For each item in d I need to check its type (how do I do that?) and then use the type-specific extractor. Is there any generic code that can easily do this? (FTR, in my specific case I have the result of running ingestion for Tableau.)
  • magnificent-lock-58916

    02/02/2023, 4:33 AM
    I have a question about file-based lineage https://datahubproject.io/docs/generated/ingestion/sources/file-based-lineage/ I need to connect a ClickHouse table to a Tableau SQL query. The issue is that our Tableau has tons of different queries with the same name (e.g. most data sources have a query named just "Custom SQL Query"). The same applies to ClickHouse tables: they can share a name but have different schemas. How can I specify a particular entity in the YAML file if it shares its name with many other entities of the same type and same platform?
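    One note on identification, as a sketch rather than a confirmed answer: a DataHub dataset is keyed by (platform, name, env), so the name in the lineage file must be the unique name from the entity's URN, urn:li:dataset:(urn:li:dataPlatform:...,<name>,PROD), not the display name shown in the UI. All names below are placeholders:
    Copy code
    version: 1
    lineage:
        - entity:
              # For a Tableau custom SQL query, copy the name segment out of
              # the existing entity's URN (visible in the dataset page URL).
              name: placeholder-custom-sql-id
              type: dataset
              env: PROD
              platform: tableau
          upstream:
              - entity:
                    name: mydb.my_table  # fully qualified ClickHouse name
                    type: dataset
                    env: PROD
                    platform: clickhouse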
    ✅ 1
  • plain-cricket-83456

    02/02/2023, 6:47 AM
    Hi team. When I change the parameters of a data ingestion, the cron parameter that was set for scheduled execution is reset to none. Why? I'm using DataHub v0.8.41.
    ✅ 1
  • fresh-zoo-34934

    02/02/2023, 9:15 AM
    Hi team, I have this setup on DataHub: • Looker + LookML • Databricks Unity Catalog (UC). LookML points to Trino tables, but we didn't ingest the Trino tables because all of them are also available in Databricks UC. Is it possible to do these steps using Python? 1. Get all the Trino tables from DataHub. 2. Change all of these Trino tables to Databricks UC for now. I know there is a GraphQL UpdateLineageInput to update the lineage, and we could probably set the Trino tables to archived using BatchDatasetUpdateInput, but I don't know whether this is the best solution because there is no batch UpdateLineageInput.
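    For what it's worth, a sketch of how that loop could look in Python, calling the updateLineage mutation one edge at a time since there is no batch variant. Server, token, and URNs are placeholders, and the LineageEdge field names reflect my reading of the GraphQL schema:
    Copy code
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

    # Placeholder server and token.
    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080", token="..."))

    UPDATE_LINEAGE = """
    mutation updateLineage($input: UpdateLineageInput!) {
      updateLineage(input: $input)
    }
    """

    def replace_upstream(downstream_urn: str, trino_urn: str, uc_urn: str) -> None:
        """Remove the Trino upstream edge and add the Databricks UC one."""
        graph.execute_graphql(
            UPDATE_LINEAGE,
            variables={
                "input": {
                    "edgesToAdd": [{"downstreamUrn": downstream_urn, "upstreamUrn": uc_urn}],
                    "edgesToRemove": [{"downstreamUrn": downstream_urn, "upstreamUrn": trino_urn}],
                }
            },
        )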
  • best-dawn-94548

    02/02/2023, 9:42 AM
    Hi all, I have installed DataHub and now I am looking into ingesting metadata from a Microsoft SQL database. Is there a way to use Windows authentication, or would I always need to specify a username and password? (I currently do not have access to a service account.) Thank you for your help in advance! 🙏
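    Not verified end-to-end, but the mssql source exposes use_odbc and uri_args, and ODBC connection strings accept Trusted_Connection=yes for Windows authentication, so a recipe along these lines may work; host, database, and driver version are placeholders:
    Copy code
    source:
        type: mssql
        config:
            host_port: 'myserver:1433'
            database: mydb
            use_odbc: true
            uri_args:
                driver: 'ODBC Driver 17 for SQL Server'
                Trusted_Connection: 'yes'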
    ✅ 1
  • aloof-egg-97140

    02/02/2023, 10:04 AM
    Hi all, is it possible to include field-level lineage for BigQuery sources? I haven't found anything in the documentation.
    ✅ 1
  • best-umbrella-88325

    02/02/2023, 12:43 PM
    Hello community! I'm trying to make a change in the existing S3 ingestion source. I've made the changes on my local system and then ran the command
    Copy code
    python -m build
    followed by
    Copy code
    python -m pip install .
    to install the datahub CLI locally, in the metadata-ingestion directory. I see the 0.0.0.dev0 version of datahub getting installed as well. However, when I run datahub ingest -c file.yaml or datahub version, it fails with the following error.
    Copy code
    Traceback (most recent call last):
      File "C:\XXX\XXX\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\XXX\XXX\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "C:\XXX\XXX\AppData\Local\Programs\Python\Python310\Scripts\datahub.exe\__main__.py", line 4, in <module>
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\entrypoints.py", line 12, in <module>
        from datahub.cli.check_cli import check
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\cli\check_cli.py", line 7, in <module>
        from datahub.cli.json_file import check_mce_file
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\cli\json_file.py", line 3, in <module>
        from datahub.ingestion.source.file import GenericFileSource
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\ingestion\source\file.py", line 17, in <module>
        from datahub.emitter.mcp import MetadataChangeProposalWrapper
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\emitter\mcp.py", line 5, in <module>
        from datahub.emitter.aspect import ASPECT_MAP, TIMESERIES_ASPECT_MAP
      File "C:\XXX\XXX\datahub\datahub\metadata-ingestion\src\datahub\emitter\aspect.py", line 1, in <module>
        from datahub.metadata.schema_classes import ASPECT_CLASSES
    ModuleNotFoundError: No module named 'datahub.metadata'
    Can someone help me here as to what is going wrong? Thanks in advance!
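    For anyone hitting this later: datahub.metadata.schema_classes is generated code, so a plain build from the source tree won't contain it until the codegen step has run. The developer docs suggest installing through Gradle instead, e.g.:
    Copy code
    ../gradlew :metadata-ingestion:installDev  # run from the metadata-ingestion directory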
    ✅ 1
  • delightful-orange-22738

    02/02/2023, 1:49 PM
    Hello, I don't see in the Airflow properties how to set the Airflow host. https://datahubproject.io/docs/lineage/airflow/#how-to-validate-installation
    ✔️ 1
    ✅ 1
  • elegant-salesmen-99143

    02/02/2023, 6:43 PM
    Hey guys. I'm trying to add a Pattern Add Dataset glossaryTerms transformer, and I've managed to add a term to a table field, but I don't understand how to write a rule that adds a term to a whole table based on a field it contains. Here's a piece of my recipe with the transformer:
    Copy code
    - type: pattern_add_dataset_schema_terms
      config:
          semantics: OVERWRITE
          term_pattern:
              rules:
                  first_name: ['urn:li:glossaryTerm:XXX']
    It adds term XXX to the first_name field within a table. Let's say I want to add this term to the whole table with a type: "pattern_add_dataset_terms" transformer. How do I do that? So far the ways I've tried to write it didn't work...
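    A sketch of what the second transformer could look like, with the caveat that for pattern_add_dataset_terms the rule keys are matched against the dataset URN rather than a field name, so the pattern has to identify the table itself; the table name below is a placeholder:
    Copy code
    - type: pattern_add_dataset_terms
      config:
          semantics: OVERWRITE
          term_pattern:
              rules:
                  '.*my_table_name.*': ['urn:li:glossaryTerm:XXX']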
    ✅ 1
  • rich-state-73859

    02/02/2023, 10:16 PM
    Hi all, when I ingested Glue locally, it ran successfully but no metadata was updated. I can visit the table directly by URL, but I can't search for it. I got a document_missing_exception from Elasticsearch when checking the GMS logs.
  • elegant-salesmen-99143

    02/03/2023, 8:57 AM
    I'm still a bit confused about which sources allow the Query History functionality and which don't, so I'll just ask: is it possible for Presto on Hive? I think Presto does store its logs. And the second part of the question: apart from the system Presto logs, if we have our own separate table where we store queries for Presto, is it possible to import them and show them in DataHub in the Queries tab for a dataset?
    ✅ 1
  • best-wire-59738

    02/03/2023, 11:42 AM
    Hello team, we were wondering whether we can restrict users' access to all of DataHub's APIs. We found that we could enable metadata service authentication, but that also restricts ingestion, so we would need to provide a token when ingesting. We want to keep ingestion token-free while restricting all other kinds of API access DataHub currently has. Could you let me know if that's possible?
  • microscopic-twilight-7661

    02/03/2023, 2:18 PM
    Hi everyone, we have a *.proto source file that contains multiple non-nested messages. Is there a way to specify which message to emit, or even multiple messages?
  • brainy-intern-50400

    02/03/2023, 4:44 PM
    Hi, I don't know if I found a bug, but if I update lineage information with the Python emitter, the entity information is updated while the graph data stays the same. I could also make a GraphQL query, but it seems to me the graph should be updated automatically. Neo4j actually responds with an error:
    Copy code
    Remove edge not supported by Neo4JGraphService at this time.
  • great-kangaroo-88413

    02/03/2023, 10:21 PM
    I have installed DataHub on Kubernetes. I created a Kafka topic called data.now and populated it with test data:
    Copy code
    {
      "topic": "data.now",
      "partition": 0,
      "offset": 2000,
      "tstype": "create",
      "ts": 1675455920363,
      "broker": 1,
      "key": null,
      "payload": "{\"id\":1,\"first_name\":\"Zachariah\",\"last_name\":\"Wiffield\",\"email\":\"zwiffield0@amazonaws.com\",\"gender\":\"Male\",\"ip_address\":\"176.189.152.5\"}"
    }
    {
      "topic": "data.now",
      "partition": 0,
      "offset": 2001,
      "tstype": "create",
      "ts": 1675455920363,
      "broker": 1,
      "key": null,
      "payload": "{\"id\":2,\"first_name\":\"Hilton\",\"last_name\":\"Siverns\",\"email\":\"hsiverns1@csmonitor.com\",\"gender\":\"Male\",\"ip_address\":\"216.205.159.252\"}"
    }
    This is my source
    Copy code
    source:
        type: kafka
        config:
            connection:
                consumer_config:
                    security.protocol: PLAINTEXT
                bootstrap: 'kafka0.com:9094,kafka1.com:9094,kafka2.com:9094'
                schema_registry_url: 'http://dh-cp-schema-registry:8081'
            stateful_ingestion:
                enabled: false
            topic_patterns:
                allow:
                    - data.now
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-datahub-gms.mynamespace.svc.cluster.local:8080'
    I get this message
    The schema registry subject for the value schema is not found. The topic is either schema-less, or no messages have been written to the topic yet.
    This is what I end up with. What am I missing?
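    One reading of that warning: the kafka source looks up the value schema under the subject data.now-value in the schema registry, and messages produced as raw JSON (as above) never register one, so the topic is treated as schema-less. If you want a schema, it has to be registered first; a sketch with the confluent-kafka client, where the Avro schema is a placeholder matching the test payload:
    Copy code
    from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

    client = SchemaRegistryClient({"url": "http://dh-cp-schema-registry:8081"})

    # Placeholder Avro schema for the test payload above.
    value_schema = """
    {
      "type": "record",
      "name": "Person",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "gender", "type": "string"},
        {"name": "ip_address", "type": "string"}
      ]
    }
    """

    # Subjects follow the TopicNameStrategy convention: <topic>-value.
    client.register_schema("data.now-value", Schema(value_schema, "AVRO"))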
  • rhythmic-glass-37647

    02/03/2023, 11:04 PM
    I'm trying to add Tableau as an ingestion source, but I'm not getting any assets. I'm running Tableau Server and everything seems happy, but no metadata is being ingested. I've made many attempts to modify the recipe into something that works; I've currently stripped it down mostly to the basics. Any ideas why my ingestion is coming up empty?
    Copy code
    source:
        type: tableau
        config:
            ingest_owner: true
            connect_uri: 'https://mytableau.mycompany.com'
            ssl_verify: true
            token_name: datahub
            token_value: 'mytoken'
            ingest_tags: true
    pipeline_name: 'urn:li:dataHubIngestionSource:ba12380e-7fc1-425e-9783-88ada4ab8b61'
    ✅ 2
  • bitter-evening-61050

    02/06/2023, 6:43 AM
    Hi, I have created an ingestion from Databricks Hive to DataHub, but I am not able to see the lineage.
    Copy code
    source:
        type: hive
        config:
            username: token
            password: 'xxxx'
            host_port: 'https://xxxxxxxx'
            scheme: databricks+pyhive
            database: gmfdb
            options:
                connect_args:
                    http_path: xxxxx
            stateful_ingestion:
                enabled: false
                remove_stale_metadata: true
            profiling:
                enabled: true
                profile_table_level_only: true
    sink:
        type: datahub-rest
        config:
            server: http://xx.xx.xxx.xxx:8080
            token: xxxx
    ✅ 1
  • plain-cricket-83456

    02/06/2023, 7:37 AM
    @hundreds-photographer-13496 Does data extraction affect CPU consumption on the server where the database resides? During Hive data ingestion last week, I found that the CPU consumption of the Hive server increased.
  • fresh-balloon-59613

    02/06/2023, 8:03 AM
    Hi everyone, I am trying to integrate DataHub with Airflow. I am getting this message: Successfully added conn_id='datahub_rest' : generic//*'<http//2|http>//*******'. But from the DataHub platform I am not able to see the DAG lineage in Pipelines.
  • better-state-74960

    02/06/2023, 8:23 AM
    Is there documentation for the DataTemplate(com.linkedin.data.template.DataTemplate) implementation class?
  • better-state-74960

    02/06/2023, 8:25 AM
    Can I delete UpstreamLineage metadata using the Java emitter SDK?
  • steep-fountain-54482

    02/06/2023, 10:31 AM
    Hello, I'm getting this message in the UI when looking at a dataset I just created using the API.
    ✅ 1
  • steep-fountain-54482

    02/06/2023, 10:31 AM
    Copy code
    This entity is not discoverable via search or lineage graph. Contact your DataHub admin for more information.
  • steep-fountain-54482

    02/06/2023, 10:32 AM
    This dataset seems to be the output of a job, and they appear connected.
  • steep-fountain-54482

    02/06/2023, 10:33 AM
    However, I can only reach it by setting the URN in the browser URL.
  • square-yak-42039

    02/06/2023, 11:33 AM
    How can we protect manual lineage set in the UI from being overridden by the next ingestion (we use the dbt platform)?
    👍 1
  • elegant-salesmen-99143

    02/06/2023, 1:17 PM
    Hi. We recently tried ingesting Presto on Hive and ran into two problems: 1. In Platform view it doesn't show tables within a schema (screenshot 1). 2. In Dataset view it shows tables, but doesn't show the fields in them (screenshot 2). Any idea what we're doing wrong? We're on 0.9.6.1 and the recipe is
    Copy code
    source:
        type: presto
        config:
            host_port: 'XXX'
            database: hive
            username: hive
            include_views: false
            include_tables: false
            profiling:
                enabled: true
                profile_table_level_only: true
                include_field_sample_values: true
            schema_pattern:
                allow:
                    - sandbox_data
            stateful_ingestion:
                enabled: true
    transformers:
        -
            type: set_dataset_browse_path
            config:
                replace_existing: true
                path_templates:
                    - /ENV/PLATFORM/DATASET_PARTS
    (The transformer in the recipe is what made tables display in Dataset view; without it they weren't shown there either, just like in Platform view.)
  • crooked-carpet-28986

    02/06/2023, 2:18 PM
    Hi everyone, is there any way to disable SSL verification for the Trino connector? If not, what is the best/softest way to install a certificate? I am deploying on a K8s cluster; I connect to the actions container and install it there, but that is tricky since we do not have full access to the K8s node to log in to the container as root.
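    Unverified, but possibly useful as a sketch: the trino source passes SQLAlchemy connect_args through to the Trino Python client, which understands a verify flag for TLS verification; host and credentials are placeholders:
    Copy code
    source:
        type: trino
        config:
            host_port: 'my-trino-host:443'
            database: mydb
            username: myuser
            options:
                connect_args:
                    verify: false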
  • gentle-plastic-92802

    02/06/2023, 7:00 PM
    Hi all, does someone have an example of programmatically sending a request to Rest.li? I would like to create a business glossary.
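    A sketch against the legacy Rest.li ingest action, assuming a GMS at localhost:8080; the term URN and definition are placeholders:
    Copy code
    import requests

    # GlossaryTermSnapshot wrapped in the Rest.li snapshot union envelope.
    payload = {
        "entity": {
            "value": {
                "com.linkedin.metadata.snapshot.GlossaryTermSnapshot": {
                    "urn": "urn:li:glossaryTerm:ExampleTerm",
                    "aspects": [
                        {
                            "com.linkedin.glossary.GlossaryTermInfo": {
                                "definition": "Example definition",
                                "termSource": "INTERNAL",
                            }
                        }
                    ],
                }
            }
        }
    }
    resp = requests.post(
        "http://localhost:8080/entities?action=ingest",
        headers={"X-RestLi-Protocol-Version": "2.0.0"},
        json=payload,
    )
    resp.raise_for_status()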
    ✅ 1