# ingestion
  • b

    brief-insurance-68141

    09/23/2021, 9:55 PM
    thrift.transport.TTransport.TTransportException: Could not start SASL: b’Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found’
    m
    m
    +3
    • 6
    • 31
  • a

    adamant-pharmacist-61996

    09/24/2021, 2:38 AM
    hey everyone! 👋 In our Airflow instance we run subdags within one main DAG in order to manage dependencies between workflows. We’re noticing that this results in broken lineage between the parent DAG and the sub-DAG. Has anyone noticed this before and found a workaround?
    h
    l
    • 3
    • 3
  • b

    brave-market-65632

    09/24/2021, 5:03 AM
    Business glossary question: Thanks for the demo and PR. It was great! In the demo video, there was a subtle point about keeping the glossary configuration in a single YAML file vs. splitting it into multiple files, as long as the tree structure is preserved. This means if the following is the structure in one file
    Copy code
    node 1
    		> term 1
    		> term 2
    		node 2
    			> term a
    			> term b
    and if one were to introduce a new node and terms collection at an arbitrary location in the tree, the file should be defined like this. Is this a fair assumption?
    Copy code
    node 1
    		> term 1
    		> term 2
    		node 2
    			node 3
    				> term c
    				> term d
    This did work for me. Wondering if I'm missing something here. This meant that I had to repeat the name and description configs for the nodes. One could write an abstraction to generate the yaml file. Would it make sense to simply have a parent_urn or something like that in the nodes config to make it canonical to attach a node and term collection at any point in the tree? Something like
    Copy code
    node 3
    				parent_node: node 1.node 2 or simply node 2
    				> term c
    				> term d
    Thanks!
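    A rough sketch of how the nested layout above maps onto the glossary YAML's nodes/terms fields, with field names based on the example business_glossary.yml (an illustration only, not the authoritative schema):
    Copy code
    nodes:
      - name: node 1
        description: Top-level node
        terms:
          - name: term 1
            description: First term under node 1
          - name: term 2
            description: Second term under node 1
        nodes:
          - name: node 2
            description: Nested node, repeated so the tree structure is preserved
            terms:
              - name: term a
                description: Example nested term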
    m
    • 2
    • 2
  • m

    mysterious-monkey-71931

    09/24/2021, 5:41 AM
    Hello. We already use Debezium for CDC from different data sources (MySQL, Postgres, MSSQL, ...). Can we reuse the Kafka schema registry or
    dbhistory
    topics for DataHub?
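    For the schema-registry part, the kafka source can point at an existing registry; a minimal sketch, with placeholder broker and registry addresses (check the kafka source docs for the full option list):
    Copy code
    source:
      type: kafka
      config:
        connection:
          bootstrap: "broker:9092"                           # placeholder: existing Kafka cluster
          schema_registry_url: "http://schema-registry:8081" # placeholder: existing schema registry
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"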
    m
    • 2
    • 5
  • w

    witty-keyboard-20400

    09/24/2021, 8:37 AM
    I'm new to DataHub; it looks very promising for metadata management. I want to take the "file to DataHub (REST)" path with a YAML config. However, I'm not sure what the different fields mean. Is there any doc with an example to get me up to speed here? I'd really appreciate the help.
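    For context, the recipe for that path is small; a minimal sketch with placeholder filename and server values looks like this:
    Copy code
    source:
      type: "file"
      config:
        filename: "./bootstrap_mce.json"  # path to the MCE JSON file to ingest
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"   # address of the DataHub GMS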
    b
    w
    • 3
    • 7
  • a

    adventurous-scooter-52064

    09/25/2021, 6:53 AM
    Hi, I’m using AWS Glue Schema Registry, and I’m wondering what I should put for my datahub-kafka sink’s
    connection.schema_registry_url
    ? 😢 https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub#config-details-1
    e
    • 2
    • 1
  • n

    nice-planet-17111

    09/27/2021, 1:31 AM
    Hello, I'm a noob to DataHub. I've deployed DataHub on GKE, and I'm trying to ingest BigQuery metadata via
    datahub-rest
    . The app (DataHub) and BigQuery are in the same private project. When I try the console sink or the file sink, it succeeds without error. However, the datahub-rest sink fails with
    ConnectionError
    ☹️ Is there something I'm missing? Here's my recipe...
    Copy code
    source:
      type: bigquery
      config:
        project_id: <my_project_id>
    
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    Error message:
    Copy code
    ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10ba85d00>: Failed to establish a new connection: [Errno 61] Connection refused'
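    For what it's worth, the ConnectionError above just means nothing is listening on localhost:8080 where the recipe runs; with DataHub on GKE the sink has to point at an address where GMS is actually reachable (a port-forward or an exposed service). A sketch under that assumption, with a placeholder host name:
    Copy code
    sink:
      type: "datahub-rest"
      config:
        # Placeholder: replace with the address where GMS is reachable from the
        # machine running "datahub ingest", e.g. after
        #   kubectl port-forward svc/<release>-datahub-gms 8080:8080
        # or via an exposed LoadBalancer/Ingress endpoint.
        server: "http://<gms-host>:8080"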
    m
    • 2
    • 3
  • n

    nice-planet-17111

    09/27/2021, 8:32 AM
    Hi, another newbie question here 😂 Is there a way to automatically upsert metadata, detecting only the changed parts? I'm trying to ingest BigQuery metadata via datahub-rest. Since several people are using the same project, it is hard to know exactly which part of the dataset was modified and when. What I want is to update only the changed parts even though I don't define anything specific (like a particular table) in the recipe, using Airflow or similar. Optimally, whenever a change occurs in the data source, I want DataHub to automatically upsert that change. Is there a way I can do this? 🙂
    s
    l
    l
    • 4
    • 9
  • b

    bumpy-activity-74405

    09/27/2021, 11:11 AM
    Hey, how do you deal with datasets deleted in the source after they’ve already been ingested by a previous run? I am trying to figure out how to automate the process. I was thinking of running some job that would compare what is already ingested to what I would be ingesting, and sending MCEs for the diff items with a status aspect where
    removed=true
    . Curious to know if anyone has had success with this or any other approach.
    s
    l
    • 3
    • 3
  • s

    stocky-noon-61140

    09/27/2021, 12:56 PM
    Hi everyone - I'm looking for a description of the business glossary file format. In particular, I would like to know which relationship types I can specify among business terms. The example yml file provided only contains the relationship elements "contains" and "inherits" (https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/bootstrap_data/business_glossary.yml). My goal is to specify, e.g., that "Glossary Term A" RELATES TO "Glossary Term B".
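    For reference, the two relationship elements mentioned above appear on terms in that example file roughly like this (a sketch only; exact placement and naming should be checked against business_glossary.yml, and a generic RELATES TO element is not part of it):
    Copy code
    nodes:
      - name: ExampleNode
        description: Example glossary node
        terms:
          - name: GlossaryTermA
            description: Example term
            inherits:
              - ExampleNode.GlossaryTermB   # "inherits" relationship to another term
            contains:
              - ExampleNode.GlossaryTermC   # "contains" relationship to another term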
    l
    g
    • 3
    • 11
  • b

    bland-orange-13353

    09/27/2021, 7:46 PM
    This message was deleted.
    m
    a
    • 3
    • 6
  • a

    astonishing-lunch-91223

    09/27/2021, 8:44 PM
    OK, let me try this one more time… I have the following metadata ingestion config that I’m trying to run via the
    linkedin/datahub-ingestion
    container (
    ingest -c /workspace/data_recipe.yml
    ):
    Copy code
    source:
      type: "file"
      config:
        filename: "/workspace/bootstrap_mce.json"
    sink:
      type: "datahub-rest"
      config:
        server: '<http://localhost:8080>'
    and I’m using this `bootstrap_mce.json`: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/mce_files/bootstrap_mce.json with DataHub version v0.8.14. Any ideas why I’m getting the errors from the attached log? Basically I’m hitting that
    No root resource defined for path '/corpUsers'
    issue again.
    err.log
    b
    • 2
    • 6
  • a

    adventurous-scooter-52064

    09/28/2021, 6:24 AM
    Is anyone here using AWS Athena with SQL profiling? How are you using it? We just can’t find a way around SQL profiling on big tables in AWS Athena 😞
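    One hedged sketch of reining profiling in on an Athena recipe: scope which tables get profiled with a pattern. The option names below are based on the generic SQL profiling config and may differ by version, so treat this as an illustration:
    Copy code
    source:
      type: athena
      config:
        # ... connection options ...
        profiling:
          enabled: true
        profile_pattern:
          allow:
            - 'my_db\.small_table'   # placeholder: only profile tables known to be small
          deny:
            - 'my_db\.big_.*'        # placeholder: skip known-large tables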
    s
    l
    +2
    • 5
    • 7
  • n

    numerous-cricket-19689

    09/28/2021, 6:44 PM
    I am a newbie in this space, and one question I had is: how can I implement RDBMS schema ingestion using a push model? For example, the https://datahubproject.io/docs/metadata-ingestion document talks about how it can scan a MySQL database and publish all the databases, tables, and schemas. Can it generate a change log (e.g. column x was added to table A), or will I have to implement something myself, e.g. listen to schema changes in MySQL using a tool like Debezium and, when I receive a schema-change event, use it to publish to DataHub? I really like the DataHub project, so thank you for creating this wonderful technology. Thanks!
    h
    m
    • 3
    • 4
  • r

    rough-eye-60206

    09/28/2021, 8:52 PM
    Hello, I am new to DataHub and I was trying to ingest data (a local file) to http://localhost:9002/, but I am getting the following error. Can someone please help me?
    Copy code
    File "/Users/vn0d5ac/Library/Python/3.7/lib/python/site-packages/datahub/emitter/rest_emitter.py", line 94, in test_connection
        f"This version of {__package_name__} requires GMS v0.8.0 or higher"
    
    ValueError: This version of acryl-datahub requires GMS v0.8.0 or higher
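    One detail worth checking for this setup: http://localhost:9002 is the DataHub frontend UI in the quickstart, while the datahub-rest sink talks to GMS, which the quickstart exposes on port 8080. A minimal sink sketch under that assumption:
    Copy code
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"   # GMS endpoint (quickstart default), not the UI at :9002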
    g
    b
    • 3
    • 12
  • b

    brief-insurance-68141

    09/28/2021, 11:04 PM
    Looks like the cron job in DataHub does not remove tables that were dropped in the source database.
    m
    • 2
    • 6
  • s

    sparse-energy-27188

    09/29/2021, 2:02 AM
    Hey, I've been trying to use the datahub-ingestion Docker image to ingest and kept getting a lot of weird errors about the recipe being invalid in ways that contradicted the documentation. I just figured out that the "latest" tag on the image is 4 months old. It would be good if someone updated the latest tag to v0.8.14 so others don't experience the same problems.
    e
    b
    • 3
    • 3
  • b

    breezy-guitar-97226

    09/30/2021, 10:48 AM
    Hi here, we are currently using the add_dataset_browse_path transformation to add custom browse paths to our ingested datasets. At the same time, though, we would like to prevent the canonical ingestion path from being used, by removing it from the ingested object. We are going to achieve this with our own transformer, but I was wondering whether such a feature could also be a useful contribution, by making it an option of the current transformer via a flag (i.e.
    remove_existing_browse_paths: true
    ) Thanks!
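    A sketch of what the proposed option could look like in a recipe; note that remove_existing_browse_paths is the flag being proposed above, not an existing config, and the transformer's existing options are elided:
    Copy code
    transformers:
      - type: "add_dataset_browse_path"        # name as referenced above; check the transformer docs for the registered name
        config:
          # ... existing browse path configuration ...
          remove_existing_browse_paths: true   # proposed flag: drop the canonical ingestion path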
    w
    m
    • 3
    • 2
  • r

    red-smartphone-15526

    09/30/2021, 11:34 AM
    Hey! Working on a dbt -> DataHub ingestion. Is there any way to exclude ephemeral models from showing up in the dataset list (but still include them in lineage)?
    m
    l
    +2
    • 5
    • 5
  • a

    adorable-portugal-3397

    09/30/2021, 1:48 PM
    Hi, I'm running a simple datahub ingest -c <path to yml file>, but I get the following error. The source is a JSON file. Has anyone faced such issues? DataHub is running on k8s; locally it works just fine.
    {'error': 'Unable to emit metadata to DataHub GMS',
     'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
              'message': "All action methods (specified via 'action' in URI) must be submitted as a POST (was GET)",
              'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status400] All action methods (specified via ' "'action' in URI)
    m
    • 2
    • 3
  • c

    chilly-nail-87894

    09/30/2021, 6:19 PM
    This polly is closed. @little-megabyte-1074 has a polly for you!
    l
    c
    +3
    • 6
    • 10
  • p

    polite-flower-25924

    10/02/2021, 8:51 AM
    Hey team, I’m very pleased that
    redshift-usage
    statistics were added with this PR in v0.8.15. This connector requires Redshift superuser privileges to query the
    svv_table_info
    and
    svl_user_info
    tables and also to look at other users’ queries. What approach do you follow to pass a Redshift superuser to this connector? I’m not sure our data platform team will allow us to use superuser credentials in a connector. I guess @witty-state-99511 can give better suggestions here 🙂
    m
    q
    • 3
    • 8
  • t

    tall-controller-60779

    10/04/2021, 11:29 AM
    Hi. We've set up DataHub with LDAP authentication. Then I launched a recipe for data ingestion. According to the logs it completed successfully, and I can even query the new objects using GraphiQL. But in the UI I don't see any objects at all. Do you have any ideas why?
    m
    b
    k
    • 4
    • 9
  • w

    witty-keyboard-20400

    10/04/2021, 1:51 PM
    Is there a way to just clean the sample data (bootstrap_mce.json) so that I could modify and ingest it cleanly? I've been using
    datahub docker nuke
    . But this removes all the containers and subsequent
    datahub docker quickstart
    results in pulling all the container images again over the network.
    b
    p
    • 3
    • 10
  • w

    witty-keyboard-20400

    10/04/2021, 2:12 PM
    When I directly execute
    datahub ingest -c  sample.yml
    which points to the latest checked-out bootstrap_mce.json, I see that only 4 records are ingested.
    Copy code
    [user@localhost datahub]$ datahub ingest -c ./metadata-ingestion/examples/mce_files/sample.yml 
    [2021-10-04 19:35:09,834] INFO     {datahub.cli.ingest_cli:57} - Starting metadata ingestion
    [2021-10-04 19:35:09,858] INFO     {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:0
    [2021-10-04 19:35:09,886] INFO     {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:1
    [2021-10-04 19:35:09,907] INFO     {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:2
    [2021-10-04 19:35:09,964] INFO     {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:3
    [2021-10-04 19:35:09,964] INFO     {datahub.cli.ingest_cli:59} - Finished metadata ingestion
    
    Source (file) report:
    {'failures': {},
     'warnings': {},
     'workunit_ids': ['file://./sample.json:0', 'file://./sample.json:1', 'file://./sample.json:2', 'file://./sample.json:3'],
     'workunits_produced': 4}
    Sink (datahub-rest) report:
    {'failures': [], 'records_written': 4, 'warnings': []}
    
    Pipeline finished successfully
    My sample.yml is:
    Copy code
    source:
      type: "file"
      config:
        filename: "./sample.json"
    
    # see <https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub> for complete documentation
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    However, when I execute datahub docker ingest-sample-data, I see that 82 records are ingested:
    Copy code
    ...
                      'file:///var/folders/89/c3061_k547b_g6tgh77dbxsm0000gp/T/tmpnczwmxw4.json:79',
                      'file:///var/folders/89/c3061_k547b_g6tgh77dbxsm0000gp/T/tmpnczwmxw4.json:80',
                      'file:///var/folders/89/c3061_k547b_g6tgh77dbxsm0000gp/T/tmpnczwmxw4.json:81'],
     'workunits_produced': 82}
    Sink (datahub-rest) report:
    {'failures': [], 'records_written': 82, 'warnings': []}
    What is the difference between manually ingesting against the bootstrap_mce.json vs the
    ingest-sample-data
    command? @mammoth-bear-12532 @big-carpet-38439
    l
    • 2
    • 4
  • w

    witty-keyboard-20400

    10/04/2021, 3:58 PM
    In the bootstrap_mce.json, SampleKafkaDataset --> SchemaMetadata --> fields, the 1st field definition is
    Copy code
    "fields": [
                    {
                      "fieldPath": "[version=2.0].[type=boolean].field_foo_2",
                      "jsonPath": null,
                      "nullable": false,
                      "description": {
                        "string": "Foo field description"
                      },
                      "type": {
                        "type": {
                          "com.linkedin.pegasus2avro.schema.BooleanType": {}
                        }
                      },
                      "nativeDataType": "varchar(100)",
                      "globalTags": {
                        "tags": [{ "tag": "urn:li:tag:NeedsDocumentation" }]
                      },
                      "recursive": false
                    },
    ....
    ]
    I checked the type of fieldPath; it's declared simply as:
    Copy code
    fieldPath: SchemaFieldPath
    ..and SchemaFieldPath is defined as:
    Copy code
    typeref SchemaFieldPath = string
    Question: Is there any significance to including
    version
    and
    type: boolean
    in the SchemaFieldPath:
    "[version=2.0].[type=boolean].field_foo_2"
    ?
    m
    • 2
    • 1
  • n

    nice-planet-17111

    10/05/2021, 6:01 AM
    Hi, does anyone know how to define credentials in the recipe file, or how to handle permission errors, when ingesting from
    bigquery-usage
    ?
    • options.credentials_path or extra_client_options.credentials_path does not work (the run fails with "got an unexpected keyword argument" or "extra fields not permitted").
    • I tried export GOOGLE_APPLICATION_CREDENTIALS -> the recipe runs but stops with the error "the caller does not have permission".
    • BigQuery ingestion under the same environment & configs works without errors.
    m
    b
    • 3
    • 8
  • w

    witty-keyboard-20400

    10/05/2021, 8:03 AM
    Question on nativeDataType. In the file test_serde_large.json, I see
    Copy code
    "nativeDataType": "INTEGER(unsigned=True)"
    while in the glue_mces_golden.json, I see
    Copy code
    "nativeDataType": "int",
    Does the nativeDataType attribute refer to the data type natively supported by the source system, or are both formats supported by DataHub?
    g
    • 2
    • 4
  • w

    witty-keyboard-20400

    10/05/2021, 3:32 PM
    Question on upstream lineage (UpstreamLineage): In the bootstrap_mce.json, I see that UpstreamLineage is defined at the DatasetSnapshot level.
    Copy code
    {
      "com.linkedin.pegasus2avro.dataset.UpstreamLineage": {
        "upstreams": [
          {
            "auditStamp": {
              "time": 1581407189000,
              "actor": "urn:li:corpuser:jdoe",
              "impersonator": null
            },
            "dataset": "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)",
            "type": "TRANSFORMED"
          }
        ]
      }
    }
    Shouldn't there be a feature to track lineage at field level? @green-football-43791 @big-carpet-38439
    g
    h
    • 3
    • 13
  • r

    rough-eye-60206

    10/05/2021, 10:50 PM
    Hello, I am new to DataHub and am currently able to ingest metadata from Hive. Can someone guide me or point me to an example/documentation on how to ingest the metadata descriptions for tables/columns?
    l
    m
    • 3
    • 12