brave-judge-32701
03/07/2023, 7:45 AM
I run create table test.testtable4 as select * from test.testtable3, but the upstream of table testtable4 shows as sql at <console>:23, not testtable3. Is this a compatibility issue?
Also, Spark runs on Hive, and Hive tables created through Spark do not show up in DataHub immediately; I need to run a batch ingestion task to ingest the Hive metastore data.
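(A minimal sketch of the kind of batch Hive metastore ingestion described above, run through the Python Pipeline API; it assumes the acryl-datahub[hive] extra is installed, and the host_port and server values are placeholders rather than anything from this thread.)
```python
# Sketch only: batch Hive metastore ingestion run programmatically.
# host_port and the GMS server URL below are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {"host_port": "localhost:10000"},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```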
great-flag-53653
03/07/2023, 3:48 PM
Pipeline running successfully so far; produced 1 events in 2 minutes and 13.81 seconds.
[2023-03-07 16:45:46,896] ERROR {datahub.ingestion.run.pipeline:62} - failed to write record with workunit lookml-view-LookerViewId(project_name='lookml-bq-prd', model_name='ip-model', view_name='account') with ('Unable to emit metadata to DataHub GMS', {'message': 'HTTPConnectionPool(host=\'localhost\', port=8080): Max retries exceeded with url: /entities?action=ingest (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'localhost\', port=8080): Read timed out. (read timeout=30)"))', 'id': 'urn:li:dataset:(urn:li:dataPlatform:looker,lookml-bq-prd.view.account,PROD)'}) and info {'message': 'HTTPConnectionPool(host=\'localhost\', port=8080): Max retries exceeded with url: /entities?action=ingest (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'localhost\', port=8080): Read timed out. (read timeout=30)"))', 'id': 'urn:li:dataset:(urn:li:dataPlatform:looker,lookml-bq-prd.view.account,PROD)'}
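(The read timeout in the error above is the 30-second default on the DataHub REST client; one possible mitigation, if GMS is simply slow to accept writes, is to raise that timeout. The sketch below shows the idea with the Python emitter; the server URL and the 120-second value are examples, not values from this thread.)
```python
# Sketch only: a REST emitter configured with a longer read timeout than the
# default 30s that appears in the error above. URL and timeout are examples.
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(
    gms_server="http://localhost:8080",
    read_timeout_sec=120,
)
emitter.test_connection()
```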
numerous-scientist-83156
03/07/2023, 4:00 PM
source:
  type: adlsg2.source.ADLSg2Source
  config:
    platform_instance: ${ADLS_ACCOUNT}
    env: ${ENVIRONMENT}
    adls:
      account_name: ${ADLS_ACCOUNT}
      container_name: ${ADLS_CONTAINER}
      base_path: /
      tenant_id: ${TENANT_ID}
      client_id: ${CLIENT_ID}
      client_secret: ${CLIENT_SECRET}
Currently, the way I am doing this, I use the storage account's name as the platform instance; since the platform is adlsGen2, it makes sense for the instance to be ourInstanceTest.
I want to make a DataHub container, based on the name of the ADLS container, that the datasets can live in.
The idea is that it gives us a nice structured URN and search flow in DataHub: urn:li:dataset:(urn:li:dataPlatform:<platform>,<platform_instance>.<adls_container>.<dataset_name>,<env>)
I use the gen_containers() function from the datahub.emitter.mcp_builder module to make the MetadataWorkUnits, and then use the add_dataset_to_container function to add my dataset to the container.
But this is where I run into trouble..
If I just do as described above, my dataset URN looks like this: urn:li:dataset:(urn:li:dataPlatform:adlsGen2,<platform_instance>.<dataset_name>,TEST)
There is no container present in the URN, and no container shows up in the breadcrumbs either (see first image).
If I then add the ADLS container's name as part of the dataset name, separated by a forward slash (<container_name>/<dataset_name>), I get the following URN:
urn:li:dataset:(urn:li:dataPlatform:adlsGen2,<platform_instance>.<container_name>/<dataset_name>,TEST)
This time the container is present in the dataset URN, and the breadcrumbs also work.. sort of.
One of the breadcrumbs does not separate the platform_instance and the container_name from each other, so the breadcrumb ends up looking like <platform_instance>.<container_name>, which is not what I want (see second image).
The reason for this test is that, from what I understand, this is how it is done in the existing [s3 ingestor](https://github.com/datahub-project/datahub/blob/1d3339276129a7cb8385c07a958fcc93ac[…]ta-ingestion/src/datahub/ingestion/source/s3/data_lake_utils.py).
Though I do not think the s3 ingestor uses the platform_instance; it just makes a container to represent that?
It also seems a bit off, since it generates its dataset names with / between the containers and the dataset name, whereas mssql uses . ?
But what I do not understand is that the [mssql](https://github.com/datahub-project/datahub/blob/1d3339276129a7cb8385c07a958fcc93ac[…]etadata-ingestion/src/datahub/ingestion/source/sql/sql_utils.py) ingestor also "just" calls the gen_containers function, and uses the platform_instance, but their implementation works without weird breadcrumbs?
Am I misunderstanding how the platform instance is supposed to work? Or how containers are supposed to be configured?
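(A rough sketch of the container pattern described above: one DataHub container per ADLS container, created with gen_containers and linked to the dataset with add_dataset_to_container, while the dataset URN itself stays free of the container name. The adlsGen2 platform, the FolderKey key type and all names are assumptions taken from this thread, not an existing DataHub source, and key field names can differ between DataHub versions.)
```python
# Sketch only: generate a container for an ADLS container and attach a dataset
# to it. Platform, key type and all names/values are assumptions, not a real
# DataHub source; key field names may differ between versions.
from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance
from datahub.emitter.mcp_builder import (
    FolderKey,
    add_dataset_to_container,
    gen_containers,
)

platform = "adlsGen2"
platform_instance = "ourInstanceTest"
adls_container = "my-adls-container"

container_key = FolderKey(
    platform=platform,
    instance=platform_instance,
    folder_abs_path=adls_container,
)

# Work units that create the container entity itself.
workunits = list(
    gen_containers(
        container_key=container_key,
        name=adls_container,
        sub_types=["ADLS container"],
    )
)

# The container does not appear in the dataset URN; the relationship is carried
# by the container aspect emitted by add_dataset_to_container.
dataset_urn = make_dataset_urn_with_platform_instance(
    platform=platform,
    name="my_dataset",
    platform_instance=platform_instance,
    env="TEST",
)
workunits.extend(add_dataset_to_container(container_key, dataset_urn))

for wu in workunits:
    print(wu.id)
```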
green-lock-62163
03/07/2023, 5:48 PM
I am looking at lineage_emitter_dataset_finegrained.py from the metadata-ingestion library. Where is the test data corresponding to the column lineage produced in that code section? I might be missing an obvious setup phase ...
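(For context: that example appears to emit column-level lineage for dataset URNs built directly in the script, rather than reading any data. A stripped-down sketch of the same pattern, with made-up table and column names and a placeholder GMS URL, looks like this.)
```python
# Sketch only: fine-grained (column-level) lineage emission.
# Dataset and column names are made up; the GMS URL is a placeholder.
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    FineGrainedLineageClass,
    FineGrainedLineageDownstreamTypeClass,
    FineGrainedLineageUpstreamTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

upstream_urn = builder.make_dataset_urn("hive", "db.source_table")
downstream_urn = builder.make_dataset_urn("hive", "db.target_table")

lineage = UpstreamLineageClass(
    upstreams=[
        UpstreamClass(dataset=upstream_urn, type=DatasetLineageTypeClass.TRANSFORMED)
    ],
    fineGrainedLineages=[
        FineGrainedLineageClass(
            upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
            upstreams=[builder.make_schema_field_urn(upstream_urn, "amount")],
            downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
            downstreams=[builder.make_schema_field_urn(downstream_urn, "amount")],
        )
    ],
)

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(entityUrn=downstream_urn, aspect=lineage)
)
```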
important-afternoon-19755
03/08/2023, 1:39 AM
I installed 'acryl-datahub[datahub-rest]'. In Python, when I try to import from datahub.emitter.rest_emitter import DatahubRestEmitter, I get an error. Does anybody know the solution?
Below is the traceback of the error.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_6987/4112535850.py in <module>
5
6 from datahub.emitter.mcp import MetadataChangeProposalWrapper
----> 7 from datahub.emitter.rest_emitter import DatahubRestEmitter
8 from datahub.emitter.mce_builder import (
9 get_sys_time,
~/.local/lib/python3.7/site-packages/datahub/emitter/rest_emitter.py in <module>
11 from requests.exceptions import HTTPError, RequestException
12
---> 13 from datahub.cli.cli_utils import get_system_auth
14 from datahub.configuration.common import ConfigurationError, OperationalError
15 from datahub.emitter.mcp import MetadataChangeProposalWrapper
~/.local/lib/python3.7/site-packages/datahub/cli/cli_utils.py in <module>
11 import requests
12 import yaml
---> 13 from pydantic import BaseModel, ValidationError
14 from requests.models import Response
15 from requests.sessions import Session
~/.local/lib/python3.7/site-packages/pydantic/__init__.cpython-37m-x86_64-linux-gnu.so in init pydantic.__init__()
~/.local/lib/python3.7/site-packages/pydantic/dataclasses.cpython-37m-x86_64-linux-gnu.so in init pydantic.dataclasses()
~/.local/lib/python3.7/site-packages/pydantic/main.cpython-37m-x86_64-linux-gnu.so in init pydantic.main()
TypeError: dataclass_transform() got an unexpected keyword argument 'field_specifiers'
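(The dataclass_transform error above usually means the installed typing_extensions is older than what this pydantic build expects on Python 3.7; a quick diagnostic, purely as a sketch, is to print the versions actually importable in that environment.)
```python
# Diagnostic sketch only: show the versions of the packages involved in the
# traceback above, to spot a pydantic / typing_extensions mismatch.
import sys

import pkg_resources

print("python", sys.version)
for dist in ("acryl-datahub", "pydantic", "typing-extensions"):
    try:
        print(dist, pkg_resources.get_distribution(dist).version)
    except pkg_resources.DistributionNotFound:
        print(dist, "not installed")
```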
white-hydrogen-24531
03/08/2023, 6:28 PM
We define glossary terms in a business_glossary.yml file, and they do show up in the UI after the yml file is executed successfully. But when we attach the terms to Hive datasets, DataHub randomly throws exceptions about duplicate terms.
DataHub version: 0.8.45
Caused by: java.lang.IllegalStateException: Duplicate key EntityAspectIdentifier(urn=urn:li:glossaryTerm:ABC>TEST, aspect=glossaryTermKey, version=0) (attempted merging values com.linkedin.metadata.entity.EntityAspect@4c802884 and com.linkedin.metadata.entity.EntityAspect@4c802884)