# ingestion
  • p

    purple-terabyte-64712

    03/06/2023, 11:48 AM
    Hi, does the ingestion file follow some metadata standard?
  • a

    alert-football-80212

    03/06/2023, 3:33 PM
    Hi all 👋 Does someone have an example of a dbt ingestion recipe, and is anyone available for a few questions?
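Since the question asks for an example: a minimal sketch of a dbt ingestion recipe, run programmatically with the Python Pipeline API. The file paths, target platform and server address are placeholders; the same source/sink blocks work in a plain YAML recipe.
    # Minimal sketch of a dbt ingestion pipeline; paths and platform are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "dbt",
                "config": {
                    # Artifacts produced by `dbt compile` / `dbt docs generate`.
                    "manifest_path": "./target/manifest.json",
                    "catalog_path": "./target/catalog.json",
                    # Platform the dbt models materialize into (e.g. snowflake, bigquery).
                    "target_platform": "snowflake",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()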
  • r

    red-waitress-53338

    03/06/2023, 5:53 PM
    Hi Team, I am sourcing from BigQuery and have enabled column-level profiling, but I cannot see the column-level profiling stats in the UI; I can only see table-level stats.
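A hedged thing to check rather than a confirmed fix: column-level stats only appear when profiling is enabled and not restricted to table level. A sketch of the relevant profiling block (project and server values are placeholders; credentials and the rest of the connection settings are omitted):
    # Sketch of the profiling portion of a BigQuery recipe; all values are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "bigquery",
                "config": {
                    "project_id": "my-gcp-project",  # placeholder; credentials omitted
                    "profiling": {
                        "enabled": True,
                        # If this is True, only table-level stats are emitted.
                        "profile_table_level_only": False,
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()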
  • l

    lively-dusk-19162

    03/06/2023, 8:18 PM
    Hello everyone, can anyone clarify whether resolver test code should be written in the graphql module, and what that resolver test code is for?
    ✅ 1
  • r

    red-waitress-53338

    03/06/2023, 10:10 PM
    Hi Team, is there a way to extract BigQuery tags as part of the ingestion? If yes, I am wondering where those tags would be stored in DataHub, because in BigQuery there are Tag Templates and they are key-value pairs - those key-value pairs cannot be mapped directly to tags in DataHub (which are more like hashtags). I think the right place would be the properties section in DataHub, where those Tag Templates could be mapped.
  • c

    clever-author-65853

    03/07/2023, 7:25 AM
    Hey community, is there a way for tables synced from Snowflake to inherit the dbt documentation that was ingested for them?
    ✅ 1
  • e

    elegant-salesmen-99143

    03/07/2023, 7:44 AM
    Is there a way to add the pattern_add_dataset_terms transformer not within a single ingestion recipe, but across all ingestions and all data sources at once?
    ✅ 1
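As far as I know transformers are declared per recipe, so applying this everywhere means repeating (or templating) the same block in each recipe. A sketch of that block, with a hypothetical regex and glossary term:
    # Sketch of the transformer block that would be added to each recipe's config dict
    # (the same structure goes under `transformers:` in a YAML recipe).
    transformers_block = [
        {
            "type": "pattern_add_dataset_terms",
            "config": {
                "term_pattern": {
                    "rules": {
                        # Hypothetical rule: datasets whose URN matches the regex get this term.
                        ".*customers.*": ["urn:li:glossaryTerm:Classification.PII"],
                    }
                }
            },
        }
    ]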
  • b

    brave-judge-32701

    03/07/2023, 7:45 AM
    I’m on Spark 3.2.3 and used spark-shell to run this SQL:
    create table test.testtable4 as select * from test.testtable3
    but testtable4's upstream shows up as
    sql at <console>:23
    instead of
    testtable3
    Is this a compatibility issue? Also, Spark runs on Hive, and a Hive table created by Spark does not show up in DataHub immediately; I need to run a batch ingestion task to ingest the Hive metastore data.
    ✅ 1
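For context, a sketch of the usual wiring of the DataHub Spark lineage agent on a Spark session; the package version, listener configuration keys and GMS address should be verified against the agent docs for the versions in use:
    # Sketch: typical DataHub Spark lineage agent configuration on a SparkSession.
    # Package version and server address are placeholders; check the exact config
    # keys against the datahub-spark-lineage documentation for your version.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("lineage-test")
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.10.0")
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:8080")
        .enableHiveSupport()
        .getOrCreate()
    )

    # The CTAS from the question; the agent reports the job's lineage to DataHub.
    spark.sql("create table test.testtable4 as select * from test.testtable3")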
  • f

    fresh-balloon-59613

    03/07/2023, 9:44 AM
    Hi, I am trying to visualize the lineage of tables and stored procedures using inlets and outlets in Airflow. After running the DAG, when we check the lineage to get the dataset details, it creates a new dummy table/stored procedure. But I want it to point to the original dataset which we have already ingested into DataHub.
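A sketch of how inlets/outlets are typically declared with the DataHub Airflow plugin so they resolve to the dataset that was already ingested: the platform, name and env must match that dataset's URN exactly, otherwise a new placeholder entity is created. The import path and the example names below are assumptions for the plugin version in use:
    # Sketch, assuming the acryl-datahub Airflow plugin circa early 2023; the entity's
    # platform/name/env must exactly match the URN of the dataset already in DataHub,
    # otherwise a new (empty) dataset entity is created.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from datahub_provider.entities import Dataset  # import path may differ by version

    with DAG(
        "lineage_example",
        start_date=datetime(2023, 3, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        task = BashOperator(
            task_id="run_proc",
            bash_command="echo run",
            inlets=[Dataset(platform="mssql", name="mydb.dbo.source_table", env="PROD")],
            outlets=[Dataset(platform="mssql", name="mydb.dbo.target_table", env="PROD")],
        )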
  • g

    great-flag-53653

    03/07/2023, 3:48 PM
    Hi. I'm on the latest version, 0.10.0, and I've successfully ingested into my local environment from BigQuery, Looker and dbt - but now that I'm trying to ingest "LookML" I'm getting error messages like this. After searching here in Slack, the only previously suggested answer has been to upgrade GMS to a newer version, but I'm already on the latest version:
    Pipeline running successfully so far; produced 1 events in 2 minutes and 13.81 seconds.
    |[2023-03-07 16:45:46,896] ERROR    {datahub.ingestion.run.pipeline:62} -  failed to write record with workunit lookml-view-LookerViewId(project_name='lookml-bq-prd', model_name='ip-model', view_name='account') with ('Unable to emit metadata to DataHub GMS', {'message': 'HTTPConnectionPool(host=\'localhost\', port=8080): Max retries exceeded with url: /entities?action=ingest (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'localhost\', port=8080): Read timed out. (read timeout=30)"))', 'id': 'urn:li:dataset:(urn:li:dataPlatform:looker,lookml-bq-prd.view.account,PROD)'}) and info {'message': 'HTTPConnectionPool(host=\'localhost\', port=8080): Max retries exceeded with url: /entities?action=ingest (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'localhost\', port=8080): Read timed out. (read timeout=30)"))', 'id': 'urn:li:dataset:(urn:li:dataPlatform:looker,lookml-bq-prd.view.account,PROD)'}
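The error is a client-side read timeout against GMS rather than a LookML parsing problem, so one hedged option is raising the REST sink timeout. A sketch with placeholder Looker credentials; the timeout_sec option should be verified against the datahub-rest sink config for the installed CLI:
    # Sketch: same lookml source, larger datahub-rest sink timeout. All values are
    # placeholders; timeout_sec's default (30s) matches the read timeout in the error.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "lookml",
                "config": {
                    "base_folder": "./lookml-bq-prd",
                    "api": {
                        "base_url": "https://company.looker.com",
                        "client_id": "${LOOKER_CLIENT_ID}",
                        "client_secret": "${LOOKER_CLIENT_SECRET}",
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080", "timeout_sec": 120},
            },
        }
    )
    pipeline.run()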
  • n

    numerous-scientist-83156

    03/07/2023, 4:00 PM
    Hello, I am trying to build an ADLS Gen2 ingestor in Python. Based on the storage account and the container within that storage account, the ingestor should pull every folder (which in our case is used as the dataset name) and derive the schema from the files within. So the current config is:
    source:
      type: adlsg2.source.ADLSg2Source
      config:
        platform_instance: ${ADLS_ACCOUNT}
        env: ${ENVIRONMENT}
        adls:
          account_name: ${ADLS_ACCOUNT}
          container_name: ${ADLS_CONTAINER}
          base_path: /
          tenant_id: ${TENANT_ID}
          client_id: ${CLIENT_ID}
          client_secret: ${CLIENT_SECRET}
    Currently, the way I am building this, I am using the storage account's name as the platform instance: since the platform is adlsGen2, it makes sense for the instance to be ourInstanceTest. I want to make a DataHub container, based on the name of the ADLS container, that the datasets can live in. The idea is that it gives us a nicely structured URN and search flow in DataHub: _`urn:li:dataset:(urn:li:dataPlatform:<platform>,<platform_instance>.<adls_container>.<dataset_name>,<env>)`_. I use the gen_containers() function from the datahub.emitter.mcp_builder module to make the MetadataWorkUnits, and then use the add_dataset_to_container function to add my dataset to the container. But this is where I run into trouble. If I just do as described above, my dataset URN looks like this:
    urn:li:dataset:(urn:li:dataPlatform:adlsGen2,<platform_instance>.<dataset_name>,TEST)
    There is no container present in the URN, and there is also no container present in the breadcrumbs (see first image). If I then add the ADLS container's name as part of the dataset name, separated by a forward slash (<container_name>/<dataset_name>), I get the following URN:
    urn:li:dataset:(urn:li:dataPlatform:adlsGen2,<platform_instance>.<container_name>/<dataset_name>,TEST)
    This time the container is present in the dataset URN, and the breadcrumbs also work... sort of. One of the breadcrumbs does not separate the platform_instance and the container_name from each other, so the breadcrumb ends up looking like <platform_instance>.<container_name>, which is not what I want (see second image). The reason for this test is that, from what I understand, this is how it is done in the existing [s3 ingestor](https://github.com/datahub-project/datahub/blob/1d3339276129a7cb8385c07a958fcc93ac[…]ta-ingestion/src/datahub/ingestion/source/s3/data_lake_utils.py). Though I do not think the s3 ingestor uses the platform_instance; it just makes a container to represent that? It also seems a bit off, since it generates its datasets with / between container and dataset name, whereas mssql uses . ? But what I do not understand is that the [mssql](https://github.com/datahub-project/datahub/blob/1d3339276129a7cb8385c07a958fcc93ac[…]etadata-ingestion/src/datahub/ingestion/source/sql/sql_utils.py) ingestor also "just" calls the gen_containers function and uses the platform_instance, but their implementation works without weird breadcrumbs? Am I misunderstanding how the platform instance is supposed to work, or how containers are supposed to be configured?
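Not an answer from the thread, just a hedged sketch of the pattern described above: emit one DataHub container per ADLS container and attach each dataset to it with add_dataset_to_container, so the breadcrumb comes from containment rather than from the dataset name. The key class and its fields are assumptions and may need adjusting to the installed datahub version:
    # Hypothetical sketch (not the author's code). Class and field names are assumptions
    # about datahub.emitter.mcp_builder circa v0.10.x; newer releases rename the key
    # base class to ContainerKey, so verify against your installed version.
    from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance
    from datahub.emitter.mcp_builder import PlatformKey, add_dataset_to_container, gen_containers


    class AdlsContainerKey(PlatformKey):
        # Extra field that makes the container key unique per ADLS container (hypothetical).
        container: str


    PLATFORM = "adlsGen2"
    INSTANCE = "ourInstanceTest"  # storage account used as platform_instance
    ENV = "TEST"


    def dataset_workunits(adls_container: str, dataset_name: str):
        key = AdlsContainerKey(platform=PLATFORM, instance=INSTANCE, container=adls_container)

        # Emit the container entity representing the ADLS container.
        yield from gen_containers(container_key=key, name=adls_container, sub_types=["ADLS container"])

        # Dataset URN: platform instance plus a dotted name; no slash needed, because the
        # containment (and hence the breadcrumb) comes from add_dataset_to_container below.
        dataset_urn = make_dataset_urn_with_platform_instance(
            platform=PLATFORM,
            name=f"{adls_container}.{dataset_name}",
            platform_instance=INSTANCE,
            env=ENV,
        )
        yield from add_dataset_to_container(container_key=key, dataset_urn=dataset_urn)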
  • m

    magnificent-sunset-43871

    03/07/2023, 4:58 PM
    Hi folks! I'm trying to ingest a JSON schema via Meltano. The schema looks like this:
    schema.json
  • m

    magnificent-sunset-43871

    03/07/2023, 4:59 PM
    And the Meltano config is like this:
    meltano.yml
  • m

    magnificent-sunset-43871

    03/07/2023, 5:00 PM
    This is the recipe:
    recipe.yml
  • m

    magnificent-sunset-43871

    03/07/2023, 5:01 PM
    With these configs I can start the ingestion like this: meltano --cwd ./meltano invoke datahub ingest -c ingest-recipe.yml
  • m

    magnificent-sunset-43871

    03/07/2023, 5:01 PM
    But I get an error:
  • m

    magnificent-sunset-43871

    03/07/2023, 5:01 PM
    Failed to process file output/schemas/Administrative-Unit.json
    Traceback (most recent call last):
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/source/schema/json_schema.py", line 374, in get_workunits_internal
        yield from self._load_one_file(
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/source/schema/json_schema.py", line 279, in _load_one_file
        meta: models.SchemaMetadataClass = get_schema_metadata(
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 546, in get_schema_metadata
        schema_fields = list(JsonSchemaTranslator.get_fields_from_schema(json_schema))
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 510, in get_fields_from_schema
        yield from JsonSchemaTranslator.get_fields(
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 475, in get_fields
        yield from generator.__get__(cls)(
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 359, in _field_from_complex_type
        yield from JsonSchemaTranslator.get_fields(
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 467, in get_fields
        datahub_type = JsonSchemaTranslator.get_type_mapping(json_type)
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 440, in get_type_mapping
        return JsonSchemaTranslator.field_type_mapping.get(json_type, NullTypeClass)
  • m

    magnificent-sunset-43871

    03/07/2023, 5:02 PM
    It's because the JsonSchemaTranslator cannot translate the "type" when it's an array:
  • m

    magnificent-sunset-43871

    03/07/2023, 5:02 PM
    When I remove the array as the type and only write a string, then it works.
  • m

    magnificent-sunset-43871

    03/07/2023, 5:03 PM
    Do you maybe have some suggestions, or do you already know this bug? Maybe I also just configured it wrong.
  • m

    magnificent-sunset-43871

    03/07/2023, 5:03 PM
    Thanks in advance for your help.
  • m

    magnificent-sunset-43871

    03/07/2023, 5:05 PM
    I don't think this error is related to Meltano, because under the hood Meltano just starts the Python command for the ingestion. So the big question is: is it possible to ingest JSON schemas which have an array as a type? (like type: ["string", "null"])
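A hedged sketch of the workaround described above, applied as a preprocessing step instead of hand-editing schemas: rewrite union types such as ["string", "null"] to their non-null member before running the ingestion. Whether newer CLI versions handle union types natively is worth checking separately.
    # Sketch of a preprocessing step for JSON schemas whose "type" is an array such as
    # ["string", "null"]: keep the first non-null member, i.e. the manual workaround
    # described above. File paths are placeholders.
    import json
    from pathlib import Path


    def simplify_union_types(node):
        """Recursively rewrite {"type": ["string", "null"]} into {"type": "string"}."""
        if isinstance(node, dict):
            t = node.get("type")
            if isinstance(t, list):
                non_null = [x for x in t if x != "null"]
                node["type"] = non_null[0] if non_null else "null"
            for value in node.values():
                simplify_union_types(value)
        elif isinstance(node, list):
            for item in node:
                simplify_union_types(item)
        return node


    path = Path("output/schemas/Administrative-Unit.json")
    schema = json.loads(path.read_text())
    path.with_suffix(".simplified.json").write_text(json.dumps(simplify_union_types(schema), indent=2))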
  • g

    green-lock-62163

    03/07/2023, 5:48 PM
    Hello, I'm trying to launch
    lineage_emitter_dataset_finegrained.py
    from the metadata-ingestion/library. Where is the test data corresponding to the column lineage produced in that code section? I might be missing an obvious setup phase...
  • f

    few-branch-52297

    03/07/2023, 10:00 PM
    Hi, I have a use case where there is a producer Java application writing to Kafka and a consumer Java application reading from Kafka. I need to capture the lineage of these custom producer/consumer applications. • Should I use the Java emitter (REST/Kafka) for this? • Is there any sample code in Java to emit lineage?
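A sketch of emitting lineage for the Kafka case, written in Python for brevity; the Java emitter in the datahub-client package builds the same MetadataChangeProposal aspects. URNs and the server address are placeholders:
    # Sketch: declare the Kafka topic as an upstream of the table the consumer writes to.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    topic_urn = make_dataset_urn(platform="kafka", name="orders-topic", env="PROD")
    sink_table_urn = make_dataset_urn(platform="mysql", name="shop.orders", env="PROD")

    lineage = UpstreamLineageClass(
        upstreams=[UpstreamClass(dataset=topic_urn, type=DatasetLineageTypeClass.TRANSFORMED)]
    )
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=sink_table_urn, aspect=lineage))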
  • n

    nice-farmer-35903

    03/07/2023, 11:45 PM
    @astonishing-answer-96712 added a workflow to this channel: *Community Support Bot*.
  • i

    important-afternoon-19755

    03/08/2023, 1:39 AM
    Hi Team, I want to use the Python emitter for Glue metadata ingestion. After installing
    'acryl-datahub[datahub-rest]'
    in Python, I try to import
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    but I get an error. Does anybody know the solution? Below is the traceback of the error.
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    /tmp/ipykernel_6987/4112535850.py in <module>
          5 
          6 from datahub.emitter.mcp import MetadataChangeProposalWrapper
    ----> 7 from datahub.emitter.rest_emitter import DatahubRestEmitter
          8 from datahub.emitter.mce_builder import (
          9     get_sys_time,
    
    ~/.local/lib/python3.7/site-packages/datahub/emitter/rest_emitter.py in <module>
         11 from requests.exceptions import HTTPError, RequestException
         12 
    ---> 13 from datahub.cli.cli_utils import get_system_auth
         14 from datahub.configuration.common import ConfigurationError, OperationalError
         15 from datahub.emitter.mcp import MetadataChangeProposalWrapper
    
    ~/.local/lib/python3.7/site-packages/datahub/cli/cli_utils.py in <module>
         11 import requests
         12 import yaml
    ---> 13 from pydantic import BaseModel, ValidationError
         14 from requests.models import Response
         15 from requests.sessions import Session
    
    ~/.local/lib/python3.7/site-packages/pydantic/__init__.cpython-37m-x86_64-linux-gnu.so in init pydantic.__init__()
    
    ~/.local/lib/python3.7/site-packages/pydantic/dataclasses.cpython-37m-x86_64-linux-gnu.so in init pydantic.dataclasses()
    
    ~/.local/lib/python3.7/site-packages/pydantic/main.cpython-37m-x86_64-linux-gnu.so in init pydantic.main()
    
    TypeError: dataclass_transform() got an unexpected keyword argument 'field_specifiers'
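For reference, what the emitter usage would look like once the import succeeds; the traceback above fails while importing pydantic, which usually points at a pydantic / typing-extensions mismatch in that Python 3.7 environment rather than at the emitter itself. A minimal sketch with placeholder URN and server:
    # Minimal sketch of emitting one aspect for a Glue table with the Python REST emitter;
    # the dataset name, description and GMS address are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    mcp = MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="glue", name="my_database.my_table", env="PROD"),
        aspect=DatasetPropertiesClass(description="Table registered in AWS Glue"),
    )
    emitter.emit(mcp)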
  • a

    adorable-computer-92026

    03/08/2023, 10:59 AM
    Hello, I want to know: when I run the command 'datahub docker ingest-sample-data', where is the data ingested from, and where is it stored (in which database)? Is it in MySQL? Thank you!
    ✅ 1
  • f

    fresh-zoo-34934

    03/08/2023, 12:40 PM
    [Bulk Emit MetadataChangeProposal] Hi Team, I tried to add assertion data to DataHub using Python; is there a way to send this data in bulk, or perhaps using YAML file ingestion?
    ✅ 1
    plus1 1
  • w

    white-hydrogen-24531

    03/08/2023, 6:28 PM
    Hi Team, we created a set of glossary terms through a
    business_glossary.yml
    file, and they do show up in the UI after successfully executing the YAML file. But when we attach the terms to Hive datasets, DataHub randomly throws exceptions about duplicate terms. DataHub version: 0.8.45
    Caused by: java.lang.IllegalStateException: Duplicate key EntityAspectIdentifier(urn=urn:li:glossaryTerm:ABC>TEST, aspect=glossaryTermKey, version=0) (attempted merging values com.linkedin.metadata.entity.EntityAspect@4c802884 and com.linkedin.metadata.entity.EntityAspect@4c802884)
    ✅ 1
  • b

    bitter-school-38051

    03/08/2023, 9:22 PM
    Hi everybody! I have added a new custom ingestion source (according to https://datahubproject.io/docs/metadata-ingestion/adding-source/), built both the frontend Docker container and the GMS Docker container (using ./gradlew datahub-frontend:build, ./gradlew datahub-frontend:dist, ./gradlew datahub-frontend:docker and ./gradlew metadata-service:war:docker with build), and deployed both containers. I can see the new source in the UI, but it fails, and in the logs there is the error: Failed to find a registered source for type my_source_name: ‘Did not find a registered class for my_source_name’. Maybe I forgot to do something else? I am running all DataHub containers: datahub-actions, datahub-gms, datahub-frontend.
    ✅ 1
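A hedged note on the registration error above: a short source type name only resolves if the package exposing the source is installed where ingestion actually runs (for UI-based ingestion that is typically the datahub-actions container) and registers it as a plugin entry point; alternatively, the fully qualified class path can be used as the recipe's type. A sketch of the entry-point wiring, with package, module and class names as placeholders and the group name to be verified against the adding-a-source docs:
    # Sketch of registering a custom source as a plugin so `type: my_source_name` resolves;
    # package, module and class names are placeholders, and the entry-point group name
    # should be checked against the "adding a source" docs for your version.
    from setuptools import find_packages, setup

    setup(
        name="my-datahub-sources",
        version="0.1.0",
        packages=find_packages(),
        install_requires=["acryl-datahub"],
        entry_points={
            "datahub.ingestion.source.plugins": [
                "my_source_name = my_sources.my_source:MySource",
            ],
        },
    )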