# ingestion
  • n

    nutritious-train-7865

    04/23/2022, 3:51 AM
    Hi team, I am getting the following errors in Docker: it fails to start the hive-metastore-postgresql container and I get a timeout exception while waiting on the service when running the integration tests for the presto-on-hive and trino sources on my M1 Mac. Could anyone help me with this? Thanks!
  • m

    most-waiter-95820

    04/23/2022, 3:27 PM
    Heya. Does anyone use a combination of BigQuery and DataHub but with BQ tables and views being controlled by Terraform? It seems like such tables/views are not exporting any log data towards GCP cloud logging and exported audit logs (table name: cloudaudit_googleapis_com_data_access) regarding table/view creation or updates, hence DataHub cannot extract any lineage from them (so far I've tried to ingest via acryl-datahub[bigquery], I don't think that [bigquery-usage] will help). My theory is that Terraform creates objects via BigQuery API rather than SQL jobs so they naturally don't appear under "BigQuery job" part of logging/exported audit logs. I've managed to find some traces of Terraform'ed tables in table cloudaudit_googleapis_com_activity though. Just want to get your opinions if that sounds plausible as we would need to make an update to metadata ingestion.
  • c

    cuddly-arm-8412

    04/24/2022, 6:21 AM
    Hi team... I want to debug ingestion locally. I ran ../gradlew :metadata-ingestion:installDev, then source venv/bin/activate and datahub version (which should print "version: unavailable (installed via git)"). What do I do next?
  • i

    icy-ram-1893

    04/24/2022, 7:47 AM
    Hi! I've been able to successfully deploy DataHub on my local network. However, as can be seen in the photo, when I try to ingest data from the UI Ingestion tab I run into problems. Here is one of them: when I choose a source (an Oracle DB, for instance), in the "Configure Oracle Recipe" section the box below sticks on "Loading" and I can't progress. Any idea how I can fix it? Where should I begin troubleshooting?
  • s

    salmon-rose-54694

    04/25/2022, 2:28 AM
    Hi, is there a fast way to enable Stats for Hive and MySQL datasets?
  • m

    mysterious-nail-70388

    04/25/2022, 6:25 AM
    Hi, I'm having problems with the metadata ingestion process. Does anyone know what is causing this 😅? When I run it again it works.
  • m

    many-guitar-67205

    04/25/2022, 7:44 AM
    Hello, I am experimenting with ingesting data from Apache Atlas using the Python libraries. I ran into an issue with the ingestion of an upstreamLineage aspect. This particular dataset (a Kafka topic) is reported by Atlas to have a lineage of over 6000 HDFS files (background: these are JSON files generated every 15 minutes by some external process). The ingestion reports that all workunits have succeeded, but the GMS logs show the following:
    07:23:56.574 [qtp544724190-15] INFO  c.l.m.r.entity.AspectResource:125 - INGEST PROPOSAL proposal: {aspectName=upstreamLineage, systemMetadata={lastObserved=1650871434684, runId=file-2022_04_25-09_23_54}, entityUrn=urn:li:dataset:(urn:li:dataPlatform:kafka,udexprd.RESOURCEPERFORMANCE.PROD.STREAM.FAST.15MIN.RAW.FAMILIES,PROD), entityType=dataset, aspect={contentType=application/json, value=ByteString(length=1912263,bytes=7b227570...227d5d7d)}, changeType=UPSERT}
    07:23:57.343 [qtp544724190-15] ERROR c.l.m.d.producer.KafkaEventProducer:146 - Failed to emit MCL for entity with urn urn:li:dataset:(urn:li:dataPlatform:kafka,udexprd.RESOURCEPERFORMANCE.PROD.STREAM.FAST.15MIN.RAW.FAMILIES,PROD)
    org.apache.kafka.common.errors.RecordTooLargeException: The message is 1856310 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
    It's clear that the message is too large for Kafka. I could play around with the Kafka configuration and increase max.request.size, but that's not a good long-term solution. Several questions:
    1. The ingest should report a failure. Why doesn't it? (Probably because the Kafka publish is async?)
    2. Is there any other way to add lineage than as an update to the single aspect?
    3. I could try to put some hard limits on the lineage ingestion, but then you lose information. Are there any other ways this could be modeled/ingested?
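    On question 3, a rough sketch of what a hard cap could look like with the Python REST emitter; the cap value, the HDFS paths and the GMS endpoint are made-up placeholders, not an official recommendation:
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    MAX_UPSTREAMS = 500  # hypothetical cap to keep the serialized aspect well under the 1 MiB default
    hdfs_paths = ["data/2022-04-25/part-000.json", "data/2022-04-25/part-001.json"]  # stand-in for the 6000+ HDFS files from Atlas

    upstreams = [
        UpstreamClass(
            dataset=make_dataset_urn("hdfs", path, "PROD"),
            type=DatasetLineageTypeClass.TRANSFORMED,
        )
        for path in hdfs_paths[:MAX_UPSTREAMS]  # truncating does lose information, as noted above
    ]

    emitter = DatahubRestEmitter("http://localhost:8080")
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn("kafka", "udexprd.RESOURCEPERFORMANCE.PROD.STREAM.FAST.15MIN.RAW.FAMILIES", "PROD"),
            aspectName="upstreamLineage",
            aspect=UpstreamLineageClass(upstreams=upstreams),
        )
    )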
  • c

    cuddly-arm-8412

    04/25/2022, 12:15 PM
    Hello, to develop metadata-ingestion locally I ran pip install -e '.[mysql]' and then python3 -m datahub ingest -c /datahub/docker/ingestion/mysql_recipe.yml, and ingesting the MySQL data succeeded. But when I debug main.py directly, the following error is raised:
    Traceback (most recent call last):
      File "<frozen importlib._bootstrap>", line 991, in _find_and_load
      File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 783, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/github/datahub/metadata-ingestion/src/datahub/entrypoints.py", line 10, in <module>
        from datahub.cli.check_cli import check
      File "/github/datahub/metadata-ingestion/src/datahub/cli/check_cli.py", line 5, in <module>
        from datahub.ingestion.sink.sink_registry import sink_registry
      File "/github/datahub/metadata-ingestion/src/datahub/ingestion/sink/sink_registry.py", line 1, in <module>
        from datahub.ingestion.api.registry import PluginRegistry
      File "/github/datahub/metadata-ingestion/src/datahub/ingestion/api/registry.py", line 5, in <module>
        import entrypoints
      File "/github/datahub/metadata-ingestion/src/datahub/entrypoints.py", line 10, in <module>
        from datahub.cli.check_cli import check
    ImportError: cannot import name 'check' from partially initialized module 'datahub.cli.check_cli' (most likely due to a circular import) (/github/datahub/metadata-ingestion/src/datahub/cli/check_cli.py)
    How can I debug the code locally?
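    For reference, the traceback shows registry.py doing a bare import entrypoints, which re-imports entrypoints.py while datahub.cli.check_cli is still initializing; that bare import only resolves when the src/datahub directory itself is on sys.path, which typically happens when a file inside the package is run directly as a script. One hedged workaround is a tiny wrapper that calls the installed CLI entry point, so an IDE debugger can attach without running main.py as a file (debug_ingest.py is hypothetical; the recipe path is the one from the message above):
    # debug_ingest.py - hypothetical wrapper for stepping through an ingestion run in an IDE
    # assumes the editable install from `pip install -e '.[mysql]'` is active in the venv
    import sys

    from datahub.entrypoints import main  # same entry point the `datahub` command uses

    if __name__ == "__main__":
        # equivalent to: datahub ingest -c /datahub/docker/ingestion/mysql_recipe.yml
        sys.argv = ["datahub", "ingest", "-c", "/datahub/docker/ingestion/mysql_recipe.yml"]
        main()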
  • b

    bright-beard-86474

    04/25/2022, 4:17 PM
    Hello! Could someone please clarify and/or share a doc link about all the keys & values in the Source report? What is workunits_produced? I ran the ingestion process for Glue; it listed all the tables, but the workunits_produced value is 0. I played with the parameters and removed the table_pattern, and now I see numbers for the workunits_produced and records_written keys (is that a bug?), but I don't see my tables in the UI (another bug?). Could someone please help me understand these details and why I don't see my tables in the UI? Thanks! Deployed with Docker (quickstart). Docker containers are working fine: python3 -m datahub docker check -> ✔️ No issues detected.
  • l

    lemon-terabyte-66903

    04/25/2022, 4:21 PM
    Hello, I am trying to ingest using the new S3 connector with profiling on. It ends with a failure:
  • l

    lemon-terabyte-66903

    04/26/2022, 12:36 AM
    I am unable to resolve the pydeequ version problem while running profiling. Can somebody tell me the correct pydeequ and deequ JAR versions to use?
  • b

    brash-photographer-9183

    04/26/2022, 11:06 AM
    Is there an example, similar to the ones under the library folder in the repo, of using the REST API for creating containers? I have a custom ingestion I want to run and I can't seem to find docs on how to create nested containers.
  • b

    brash-photographer-9183

    04/26/2022, 12:05 PM
    I'm looking at the Tableau ingestion source in the repo and trying to understand how I would do something similar with the DatahubRestEmitter, and I can't quite work it out. I want to do something similar to what's in the Tableau ingestion source, where I have some containers that then contain more concrete types. Is the DatahubRestEmitter equivalent just to emit the individual containerProperties, subTypes, dataPlatformInstance, etc. MCPW items through the emitter, or is there a way to box it all up like gen_containers seems to do, but for the REST API?
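    For what it's worth, a rough sketch of the boxed-up option: gen_containers() from datahub.emitter.mcp_builder (the helper the Tableau and SQL sources use) yields work units wrapping the individual container aspects, and those can be pushed through a DatahubRestEmitter one by one. The platform/database/schema values are placeholders, and the key-class field names vary a bit between versions, so treat this as a sketch rather than the definitive API:
    from datahub.emitter.mcp_builder import DatabaseKey, SchemaKey, gen_containers
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter("http://localhost:8080")

    # the container urn is derived (as a guid) from the key's fields
    db_key = DatabaseKey(platform="mysql", instance="PROD", database="test")
    schema_key = SchemaKey(platform="mysql", instance="PROD", database="test", schema="public")

    # top-level container: yields containerProperties, subTypes, dataPlatformInstance, ...
    for wu in gen_containers(container_key=db_key, name="test", sub_types=["Database"]):
        emitter.emit_mcp(wu.metadata)

    # nested container, linked to its parent via parent_container_key
    for wu in gen_containers(
        container_key=schema_key,
        name="public",
        sub_types=["Schema"],
        parent_container_key=db_key,
    ):
        emitter.emit_mcp(wu.metadata)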
  • l

    lemon-terabyte-66903

    04/26/2022, 4:24 PM
    Hi, I tried to ingest a dataset using the new S3 data lake connector. Since the dataset had multiple schemas, I tried ingesting with the mergeSchema option, but it failed with "Failed to merge incompatible data types double and bigint". Is it possible to handle this error and record this schema change in the schemaMetadata aspect? cc @hundreds-photographer-13496
  • m

    mysterious-nail-70388

    04/27/2022, 2:54 AM
    Hi, what are the Kerberos authentication parameters for the Hive source?
  • i

    icy-ram-1893

    04/27/2022, 6:26 AM
    Hi! I am trying to ingest data from Oracle and SQL Server sources. I have tried both the UI and CLI ingestion methods, but both failed. It should be mentioned that all plug-ins were installed successfully. Besides, I checked telnet to the source servers and it was OK. I am sharing screenshots of the config and error logs. Any idea where I should begin to solve the problem?
  • m

    mysterious-lamp-91034

    04/27/2022, 6:54 AM
    Hello, I created an entity and ran
    ./gradlew :metadata-ingestion:codegen
    Then I don't see the snapshot class generated in metadata-ingestion/src/datahub/metadata/com/linkedin/pegasus2avro/metadata/snapshot/__init__.py, but I do see lots of other snapshots generated. Is that expected? I want the snapshot because I want to ingest an MCE, and MCE takes a snapshot as its first parameter. I know snapshots are legacy; is there a way to ingest without MCE? Thanks
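    On the last question, a minimal sketch of the snapshot-free path: emit each aspect as a MetadataChangeProposal instead of wrapping a Snapshot in an MCE. The urn and description below are placeholders:
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    emitter = DatahubRestEmitter("http://localhost:8080")

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:mysql,db.my_table,PROD)",
        aspectName="datasetProperties",
        aspect=DatasetPropertiesClass(description="one aspect at a time, no snapshot needed"),
    )
    emitter.emit_mcp(mcp)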
  • o

    orange-coat-2879

    04/27/2022, 7:44 AM
    Hello everyone, I ingested data from Snowflake, and the Stats tab can show the number of monthly queries. I am wondering whether there is a chart or table that can show the top queried tables (the tables with the top 10 number of monthly queries, NOT the search queries in DataHub) in DataHub? Thanks!
  • b

    brash-photographer-9183

    04/27/2022, 9:59 AM
    Is there a way to delete all the metadata for a specific platform? As in to clear out entities in order to recreate them?
  • f

    fresh-coat-71059

    04/27/2022, 12:39 PM
    Hi everyone, I have a problem when trying to show lineage between a dataset and a dashboard; the lineage does not work as expected. I ingested a table named test from MySQL and then ingested a Superset dashboard that uses data from that table. It seems DataHub does not align the datasets correctly, because MySQL datasets are named <db>.<table> while Superset datasets are named <connection_name>.<db>.<table>:
    # mysql dataset
    urn:li:dataset:(urn:li:dataPlatform:mysql,MySQL.test.test1,PROD)
    # dataset used in a superset dashboard
    urn:li:dataset:(urn:li:dataPlatform:mysql,test.test1,PROD)
  • m

    millions-sundown-65420

    04/27/2022, 12:41 PM
    Hi. I have MongoDB as the source database from which I would like to 'push' metadata to DataHub. I have set up DataHub completely and was looking at the Metadata Change Event emitters. I can emit metadata events to Kafka using the Python emitter example, but I am not sure how to connect this to my source database. Essentially, whenever a new record is inserted into my Mongo collection, I would like metadata to be ingested into DataHub automatically. May I know how I can set this up?
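    Not an official integration, but one way this is sometimes wired up: listen on a MongoDB change stream with pymongo and push an aspect through the Python REST emitter on every insert. Change streams require a replica set, and the database, collection, urn and server addresses below are all placeholders:
    from pymongo import MongoClient

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    client = MongoClient("mongodb://localhost:27017")
    collection = client["mydb"]["mycollection"]  # placeholder database/collection
    emitter = DatahubRestEmitter("http://localhost:8080")
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:mongodb,mydb.mycollection,PROD)"

    # block on the change stream and refresh the dataset's properties on every insert
    with collection.watch([{"$match": {"operationType": "insert"}}]) as stream:
        for change in stream:
            emitter.emit_mcp(
                MetadataChangeProposalWrapper(
                    entityType="dataset",
                    changeType=ChangeTypeClass.UPSERT,
                    entityUrn=dataset_urn,
                    aspectName="datasetProperties",
                    aspect=DatasetPropertiesClass(
                        customProperties={"lastInsertedId": str(change["documentKey"]["_id"])}
                    ),
                )
            )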
  • p

    prehistoric-salesclerk-23462

    04/27/2022, 1:31 PM
    Hi guys, can anyone else help here? The user and password are correct 👆🏻
  • c

    colossal-easter-99672

    04/27/2022, 4:06 PM
    Hello team. Is there any way to partially update UpstreamLineage via Python?
  • p

    purple-student-30113

    04/27/2022, 4:14 PM
    Hello guys, I'm testing DataHub using the quick start with Docker on my computer and I can use the UI. The containers installed are datahub-frontend, datahub-gms, datahub-actions, schema-registry, broker, elasticsearch, mysql and zookeeper. I want to add Kafka streaming from MySQL to MySQL (or MySQL to another database) using Debezium too. I don't know where to add the Connect properties in DataHub and the Debezium plugin. Should I add a container running Kafka Connect, or is there another way? Sorry, I'm a noob.
  • c

    curved-football-28924

    04/27/2022, 7:02 PM
    Hello team, I am building my own validation plugin for DataHub. Right now only Great Expectations (GE) is supported for validation ingestion; similar to GE, I am trying to build support for AWS DataBrew. Could you guide me on how to achieve this? In the code below I tried to pass the GE assertion variable directly to the emitter class and got this error. What is the value that should be sent in "assertionResults"? Code: https://gist.github.com/12345k/a8747ebe889fa03f2d33ac72adaca674 Error:
    File "/home/karthickaravindan/.local/lib/python3.8/site-packages/datahub/emitter/mcp.py", line 17, in _make_generic_aspect
    serialized = json.dumps(pre_json_transform(codegen_obj.to_obj()))
    AttributeError: 'str' object has no attribute 'to_obj'
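    The traceback suggests a plain string was passed where the MCP wrapper expects a generated aspect object (something with a .to_obj() method). A hedged sketch of building the assertionRunEvent aspect from the generated classes instead, where the result goes in the result field as an AssertionResultClass; the urns and run id are placeholders:
    import time

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import (
        AssertionResultClass,
        AssertionResultTypeClass,
        AssertionRunEventClass,
        AssertionRunStatusClass,
        ChangeTypeClass,
    )

    run_event = AssertionRunEventClass(
        timestampMillis=int(time.time() * 1000),
        assertionUrn="urn:li:assertion:databrew-profile-check",  # placeholder
        asserteeUrn="urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket.my-table,PROD)",  # placeholder
        runId="databrew-2022_04_27-00_00_00",
        status=AssertionRunStatusClass.COMPLETE,
        result=AssertionResultClass(type=AssertionResultTypeClass.SUCCESS),
    )

    mcp = MetadataChangeProposalWrapper(
        entityType="assertion",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=run_event.assertionUrn,
        aspectName="assertionRunEvent",
        aspect=run_event,  # a generated aspect object, not a JSON string
    )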
  • r

    rhythmic-stone-77840

    04/27/2022, 7:47 PM
    Hello 😄 I'm trying to remove the glossary terms & nodes that have been created, but nothing is being removed from the UI. I run
    datahub delete --entity_type glossaryTerm --query "*" -f --hard
    datahub delete --entity_type glossaryNode --query "*" -f --hard
    And the run says that it did hard delete the rows for the entries found, but I'm still seeing the nodes and terms show up in the DataHub UI and I can still click through them. Anyone have an idea of what's going on?
  • c

    cuddly-arm-8412

    04/28/2022, 9:55 AM
    Hi, I'm trying to ingest MySQL data locally and send it to Kafka. When I debugged, I found two methods: one is emit_mce_async and the other is emit_mcp_async. I want to know: what's the difference between MCE and MCP?
  • m

    microscopic-mechanic-13766

    04/28/2022, 10:07 AM
    Hi, I have connected a few sources to DataHub and have ingested their respective datasets. The problem is that the Lineage tab isn't available. I have read https://datahubproject.io/docs/lineage/sample_code/, so I wanted to check that I understood it correctly: I would need to create a script similar to the ones that appear there for every source I have in order to see its lineage. Is that right?
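    That is the general idea for sources that don't extract lineage automatically: a lineage edge is just another aspect you emit yourself. A small sketch along the lines of that sample code page, with placeholder Hive table names and GMS endpoint:
    from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter("http://localhost:8080")

    # declare that fct_users_created is derived from logging_events
    lineage_mce = make_lineage_mce(
        upstream_urns=[make_dataset_urn("hive", "logging_events", "PROD")],
        downstream_urn=make_dataset_urn("hive", "fct_users_created", "PROD"),
    )
    emitter.emit_mce(lineage_mce)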