# ingestion
  • wooden-football-7175 (02/08/2022, 3:54 PM)
    Hello all. I posted an issue in the troubleshoot channel: https://datahubspace.slack.com/archives/C029A3M079U/p1644330772754709
  • mysterious-portugal-30527 (02/09/2022, 12:43 AM)
    I don’t get it. Version 0.8.25, running the docker quickstart on Linux and connecting through Chrome on a MBP, adding an ingestion through the web application. Choosing `Execute` fails. Why is this failing:
    Copy code
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'
    Log shows:
    Copy code
    ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fae83a81a30>: Failed to establish a new connection: [Errno 111] Connection refused'))
    2022-02-09 00:27:27.263935 [exec_id=e989b898-fb4d-4eec-9d9c-965a78650cb9] INFO: Failed to execute 'datahub ingest'
    2022-02-09 00:27:27.269727 [exec_id=e989b898-fb4d-4eec-9d9c-965a78650cb9] INFO: Caught exception EXECUTING task_id=e989b898-fb4d-4eec-9d9c-965a78650cb9, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task
        self.event_loop.run_until_complete(task_future)
      File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 81, in run_until_complete
        return f.result()
      File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
        raise self._exception
      File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
        result = coro.send(None)
      File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute
        raise TaskError("Failed to execute 'datahub ingest'")
    acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
    Curl shows:
    Copy code
    curl http://localhost:8080/config
    {
      "models" : { },
      "versions" : {
        "linkedin/datahub" : {
          "version" : "v0.8.25",
          "commit" : "306fe0b5ffe3e59857ca5643136c8b29d80d4d60"
        }
      },
      "statefulIngestionCapable" : true,
      "retention" : "true",
      "noCode" : "true"
    }
    What am I missing??
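    A likely cause: with quickstart, the UI ingestion executor runs inside a Docker container, so localhost:8080 points at the executor container itself rather than the host where GMS answers the curl. A minimal sketch of a sink that targets GMS by its Docker network name instead (assuming the default quickstart compose service name datahub-gms):
    Copy code
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-gms:8080'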
  • shy-island-99768 (02/09/2022, 7:35 AM)
    Hello all, I have a question about the ingestion of documentation that we have written in YAML for (BigQuery) tables. What would be the best way to enrich the out-of-the-box BigQuery ingestion metadata with documentation that we keep in version-controlled yml files? Example below:
    Copy code
    full_name: project-p-p:stats.active_stats
    name: active_stats
    owners:
      - email: abel@vanmoof.com
    notes:
    description: Collect stats...
    usage:
      - department_name:
        example_usage:
          - hello
    bigquery_link: https://bigquery.googleapis.com/bigquery/v2/projects/blabla/datasets/bla/tables/active_stats
    columns:
      - name: frame_number
        description:
        is_primary_key:
        aliases: []
        unit:
        relations: []
      - name: created_at
        description:
        is_primary_key:
        aliases: []
        unit:
        relations: []
      - name: product_id
        description:
        is_primary_key:
        aliases: []
        unit:
        relations: []
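    One possible approach (a sketch, not an official recipe): keep these YAML files in version control and attach their contents during ingestion through a custom transformer referenced from the recipe. The recipe's transformers section is a real mechanism, but the module path and the docs_dir option below are hypothetical placeholders for code you would write yourself:
    Copy code
    source:
        type: bigquery
        config:
            project_id: project-p-p
    transformers:
        - type: 'my_company.ingestion.AddDocsFromYamlTransformer'  # hypothetical custom transformer class
          config:
              docs_dir: './table_docs'                             # hypothetical option: folder of yml docs
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'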
  • plain-farmer-27314 (02/09/2022, 2:24 PM)
    Also:
    Copy code
    We now support the ability to ignore specific users when calculating Top Users of a Dataset/Column — this is useful when you want to exclude users designated for maintenance/automated execution.
    So we can yeet our airflow user out of datahub 🙂
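    The knob for this appears to be the usage sources' user email allow/deny pattern. A minimal sketch, assuming the user_email_pattern option on a usage source and a made-up Airflow service-account address:
    Copy code
    source:
        type: bigquery-usage
        config:
            projects:
                - my-project
            user_email_pattern:
                deny:
                    - 'airflow@my-project.iam.gserviceaccount.com'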
  • lively-fall-12210 (02/09/2022, 4:03 PM)
    Hello! In the Kafka metadata source, I am not sure how the config values `domain.domain_key.allow` and `domain.domain_key.deny` are used. Are they intended to extract domain names from the topic name by a capturing group in the regex? Or are they used to only keep topics that belong to a certain domain? Does somebody have an example? The documentation is a bit short here. Thanks a lot!
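    My reading (worth confirming against the source code): the key names the domain, and the allow/deny regexes select which topics get assigned to that domain; they are not capture groups. A sketch, with a made-up domain URN and topic pattern:
    Copy code
    source:
        type: kafka
        config:
            connection:
                bootstrap: 'broker:9092'
            domain:
                'urn:li:domain:sales':
                    allow:
                        - 'sales_.*'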
  • wooden-football-7175 (02/09/2022, 6:25 PM)
    Hello channel. I have a silly question. I want to create lineage with the `glue` pipelines that I imported from the aws source. I managed to use the Airflow backend for lineage, but I cannot find documentation on how to configure a `glue` job to connect two different `datasets` (also glue). Anyone have any reference? Thanks in advance!!
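    If directly connecting the two Glue datasets is enough, one option might be the file-based lineage source, assuming it is available in your CLI version; the dataset names below are made up:
    Copy code
    # lineage.yml
    version: 1
    lineage:
        - entity:
              name: db.table_b
              type: dataset
              platform: glue
              env: PROD
          upstream:
              - entity:
                    name: db.table_a
                    type: dataset
                    platform: glue
                    env: PROD
    This file would then be referenced from a recipe whose source type is datahub-lineage-file, with config.file pointing at it.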
  • handsome-football-66174 (02/09/2022, 7:50 PM)
    Hi everyone, are we able to add descriptions to datasets during ingestion? Are there any transformations or configurations for this?
  • rich-policeman-92383 (02/09/2022, 8:13 PM)
    Hello, how do we configure DataHub to pull metadata from an SSL-enabled ES cluster? Configuring host: "https://ip:9200" results in an assertion error: host contains bad character. If we omit the scheme, we get a connection error caused by ProtocolError. DataHub version v0.8.26, ES version 7.5.x.
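    A sketch of what the elasticsearch source config might look like with SSL, keeping the scheme out of the host; the use_ssl, verify_certs and ca_certs option names are my assumption, so check them against the elasticsearch source docs for your version:
    Copy code
    source:
        type: elasticsearch
        config:
            host: 'ip:9200'
            use_ssl: true
            verify_certs: true
            ca_certs: '/path/to/ca.pem'
            username: es_user
            password: es_password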
  • glamorous-house-64036 (02/09/2022, 10:18 PM)
    Good day, I'm trying to do a simple ingestion from PostgreSQL but facing some error messages that I struggle to understand. (DataHub is running locally via `datahub docker quickstart`.) My yaml file:
    Copy code
    source:
      type: postgres
      config:
        # Coordinates
        host_port: URL:5432
        database: DATABASENAME
        # Credentials
        username: user
        password: password
        #Options
        include_tables: True
        include_views: True
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:9002/api/gms"   # this path is what the UI ingestion tool suggests; I also tried the default http://localhost:8080 with the same result
    Both postgres and datahub-rest plugins look enabled. Upd: Error log moved into thread.
  • rich-winter-40155 (02/10/2022, 4:22 AM)
    Hi all, we are setting up Hive metadata ingestion on the Airflow cluster. We are trying to follow the architecture proposed here: https://github.com/linkedin/datahub/blob/master/docs/architecture/metadata-ingestion.md . How do we publish Hive metadata events to Kafka and on to DataHub? If there is any example config, can you please point me to it? Thanks.
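    A minimal sketch of a recipe that pushes Hive metadata through Kafka instead of the REST sink, assuming made-up hostnames and a default schema registry:
    Copy code
    source:
        type: hive
        config:
            host_port: 'hive-server:10000'
            username: hive_user
    sink:
        type: datahub-kafka
        config:
            connection:
                bootstrap: 'broker:9092'
                schema_registry_url: 'http://schema-registry:8081'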
  • broad-tomato-45373 (02/10/2022, 6:31 AM)
    Hi, I am trying to add users for login via the user.props file. Steps I followed: 1. Created a configmap user-props (with the added users). 2. Created a file (named overide_chart_values.yml) to override the chart values:
    Copy code
    extraVolumes:
      - name: user-props
        configMap:
          name: user-props
    extraVolumeMounts:
      - name: user-props
        mountPath: /datahub-frontend/conf/user.props
    3. Upgraded the helm chart with the new values:
    Copy code
    helm upgrade --install -f values.yaml -f overide_chart_values.yml datahub datahub/datahub
    But I didn't succeed in getting the new users to log in. I am very new to K8s and Helm charts. Any help would be much appreciated.
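    Two things that commonly trip this up (a sketch against the standard datahub helm chart, so double-check the key names for your chart version): the volumes need to be declared under the datahub-frontend section of the values file, and mounting a single file from a ConfigMap needs subPath so the whole conf/ directory is not shadowed:
    Copy code
    datahub-frontend:
        extraVolumes:
            - name: user-props
              configMap:
                  name: user-props
        extraVolumeMounts:
            - name: user-props
              mountPath: /datahub-frontend/conf/user.props
              subPath: user.props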
  • gray-spoon-5206 (02/10/2022, 6:36 AM)
    Good day everyone, I'm trying to ingest data from Snowflake; however, I got an error (posted in the thread). Can someone help me with it? Thanks very much.
  • adorable-flower-19656 (02/10/2022, 6:57 AM)
    Hi guys, I'm using BigQuery ingestion. Is there a difference in performance when using the 'use_exported_bigquery_audit_metadata' option? What are the pros and cons? https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/bigquery.py#L309
  • square-machine-96318 (02/10/2022, 6:58 AM)
    Hi datahub! I'm having some trouble with my DataHub system; can I get help? I want to ingest datasets from PostgreSQL, so I created an ingestion source in the web UI as in the picture below and set the schedule to '00 03 * * *'. I first ran it manually to check that ingestion works, but it doesn't sync normally. What could be the reason and how can I solve it? Ingesting manually through the CLI on my EKS server works.
  • few-air-56117 (02/10/2022, 7:33 AM)
    Hi guys, I have a question: is it possible to see the ingestion UI logs while the ingestion is running?
  • great-dusk-47152 (02/10/2022, 8:10 AM)
    Following this: https://datahubproject.io/docs/docker/airflow/local_airflow, I can't run the examples; only Airflow comes up, no DataHub.
  • rhythmic-kitchen-64860 (02/10/2022, 8:12 AM)
    Hi all, I want to ask how to add a scheduler if I'm ingesting data using the `datahub.ingestion.run.pipeline` package. The whole code is:
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    # The pipeline configuration is similar to the recipe YAML files provided to the CLI tool.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",
                "config": {
                    "username": "postgres",
                    "password": "strongpass",
                    "database": "northwind",
                    "host_port": "localhost:5432",
                    "database_alias": "test",
                    "schema_pattern": {"allow": ["public"]},
                    "table_pattern": {
                        "allow": [
                            "test.public.region",
                            "test.public.suppliers",
                        ]
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )

    # Run the pipeline and report the results.
    pipeline.run()
    pipeline.pretty_print_summary()
    I want to schedule runs of that config; is it possible to do that? Thank you.
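    One way to schedule this (a sketch, not the only option): save the same configuration as a recipe YAML and have cron, Airflow, or the UI ingestion scheduler run `datahub ingest -c recipe.yml` on the desired cadence, rather than scheduling the Python script yourself:
    Copy code
    # recipe.yml -- same configuration as the Python snippet above
    source:
        type: postgres
        config:
            username: postgres
            password: strongpass
            database: northwind
            host_port: 'localhost:5432'
            database_alias: test
            schema_pattern:
                allow:
                    - public
            table_pattern:
                allow:
                    - test.public.region
                    - test.public.suppliers
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'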
  • few-air-56117 (02/10/2022, 8:54 AM)
    Hi guys, I'm trying to ingest bigquery-usage via the UI. This is the recipe:
    Copy code
    source:
        type: bigquery-usage
        config:
            projects:
                - p1
                - p2
            credential:
                project_id: 
                private_key_id: 
                private_key: '${PRIVATE_KEY}'
                client_email: 
                client_id: 
    sink:
        type: datahub-rest
        config:
            server:
    I got this error
    Copy code
    1 validation error for BigQueryUsageConfig
    credential
      extra fields not permitted (type=value_error.extra)
    so it looks like I can't add credentials on bigquery-usage (on bigquery it works)
  • silly-beach-19296 (02/10/2022, 12:19 PM)
    Hello, I deployed DataHub on EKS and I want to switch the authentication connector to Okta. How could I do it? The documentation only mentions modifying the .env file inside the React frontend.
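    For a Helm deployment, the OIDC settings can be supplied as frontend environment variables instead of a .env file. A sketch against the standard datahub chart (the extraEnvs key and the placeholder values are assumptions to verify against your chart version and the OIDC docs):
    Copy code
    datahub-frontend:
        extraEnvs:
            - name: AUTH_OIDC_ENABLED
              value: 'true'
            - name: AUTH_OIDC_CLIENT_ID
              value: '<okta-client-id>'
            - name: AUTH_OIDC_CLIENT_SECRET
              value: '<okta-client-secret>'
            - name: AUTH_OIDC_DISCOVERY_URI
              value: 'https://<okta-domain>/.well-known/openid-configuration'
            - name: AUTH_OIDC_BASE_URL
              value: 'https://<your-datahub-url>'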
  • crooked-van-51704 (02/10/2022, 2:15 PM)
    Does anyone here have some experience with `dbt`? I am seeing an issue when I try to ingest a dbt project: it causes a `DuplicateKeyException`. When I disable the dbt node creation using `disable_dbt_node_creation: True` it works fine, so it must be related to the dbt-specific metadata. Oddly, I can disable the dbt nodes, do the ingestion successfully, re-enable the dbt nodes, and then ingestion works without any errors. The specific error I see in the stack trace is this:
    Copy code
    Caused by: java.sql.BatchUpdateException: Duplicate entry 'urn:li:dataset:(urn:li:dataPlatform:snowflake,citibike_tripdata.' for key 'metadata_aspect_v2.PRIMARY'
  • limited-cricket-18852 (02/10/2022, 4:29 PM)
    Hello! I can see in the Q3 2021 roadmap that DataHub could show a preview of the data; has it already been released? Is there any entity in the demo instance that showcases this feature? Thanksss!
  • ambitious-guitar-89068 (02/11/2022, 5:01 AM)
    Faced an issue with Tableau Ingestion: https://github.com/linkedin/datahub/issues/4119
  • curved-truck-53235 (02/11/2022, 1:47 PM)
    Hi, everyone! I'm trying to ingest schemas from Kafka but it fails
  • narrow-bird-99605 (02/11/2022, 2:51 PM)
    Hello, is there a way to set domain during ingestion?
  • handsome-football-66174 (02/11/2022, 4:21 PM)
    Hi everyone, quick question regarding lineage: how do we add lineage between an existing data task (in a data pipeline) and a dataset? I see this example - https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/library/lineage_dataset_job_dataset.py - but it seems to be adding lineage to a data_job entity:
    Copy code
    entityUrn=builder.make_data_job_urn(
            orchestrator="airflow", flow_id="flow1", job_id="job1", cluster="PROD"
        ),
  • modern-monitor-81461 (02/11/2022, 6:09 PM)
    MySQL databases and schemas vs. containers and platform instances: I have installed 0.8.26 to play with domains and containers and I have ingested metadata from a MySQL database. I can see that every schema now has its own container, and that's all good. When I look at the parent of those containers (schemas), I see a container representing the database with a name of `none`. In MySQL, a database is pretty much the same as a schema, so it is unclear to me what the database should be... Seeing `none` is not what I was expecting. Is it to represent the fact that it is absent from MySQL? If so, would it be better to simply not create that container? Now, if I want to document the hostname of the MySQL server in DataHub, is that when I need to use platform instances? I thought platform instances were used to differentiate different instances of MySQL servers, am I right? Looking for guidance here on how to use those concepts, since I want to apply them to data lakes in an Iceberg source I am currently working on.
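    On the last point, this is roughly what platform instances are for; a sketch, assuming the mysql source supports the platform_instance option in your version and using a made-up hostname:
    Copy code
    source:
        type: mysql
        config:
            host_port: 'prod-mysql-01.internal:3306'
            platform_instance: prod-mysql-01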
  • cool-painting-92220 (02/12/2022, 12:13 AM)
    Hi there! I've taken a look around the community Slack and the DataHub documentation and didn't seem to find anything on it, so I wanted to check here - is there a good way of connecting to AzureML for metadata ingestion, or is that something projected to be added to the selection of compatible sources in the future?
  • mysterious-nail-70388 (02/14/2022, 8:25 AM)
    Hello team, I used pip to install DataHub, but it failed to run after installing 0.8.14 or 0.8.12, while it runs properly after installing 0.8.24 on the same server. Why? It works normally from 0.8.16 onwards...
  • proud-accountant-49377 (02/14/2022, 9:58 AM)
    Hi team! I've been trying for several days to remove terms associated with a field of my dataset via the API, and an unknown error appears that does not allow me to do so. Is it a bug? Does anyone know anything about this? Thank you!
  • red-napkin-59945 (02/14/2022, 6:20 PM)
    Hey team, I want to check if there is an existing Python lib to parse URNs? I did not find any in the datahub repo. I wrote some on my own and am wondering if I should contribute it back.