# ingestion
  • f

    few-grass-66826

    08/27/2022, 12:42 PM
Hi guys, I am ingesting metadata from Kafka and the only thing it gets is topic names. What else can it ingest, and how?
    h
    • 2
    • 2
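A minimal sketch for the question above, assuming a Confluent Schema Registry is reachable: with connection.schema_registry_url set, the Kafka source can also pull topic schemas, not just topic names. The broker, registry, and GMS hostnames below are placeholders.

```python
# Hedged sketch: run a Kafka ingestion recipe from Python.
# With a schema registry configured, topic schemas are ingested as well.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "kafka",
            "config": {
                "connection": {
                    "bootstrap": "broker:9092",  # placeholder
                    "schema_registry_url": "http://schema-registry:8081",  # enables schema ingestion
                },
                # "topic_patterns": {"allow": ["^prod\\..*"]},  # optional topic filtering
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```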
  • l

    lemon-engine-23512

    08/27/2022, 4:12 PM
Hi team, is it necessary to build a Python package of our project for adding custom sources? Also, where do I install this package if I am not using the CLI but Airflow to schedule the custom source?
    h
    • 2
    • 8
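A hedged sketch on the packaging question: a custom source is typically referenced by its fully qualified class path in the recipe, so the module only needs to be importable in the environment that actually executes the recipe (for Airflow, that means the scheduler/worker image, not your laptop). The class path and config key below are hypothetical.

```python
# Hedged sketch: referencing a custom source class directly in a recipe.
# "my_company.ingestion.custom_source.CustomSource" is a hypothetical class path;
# install (or mount) the containing package wherever this pipeline runs.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "my_company.ingestion.custom_source.CustomSource",  # hypothetical
            "config": {"some_option": "value"},  # whatever your source expects
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```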
  • j

    jolly-yacht-10587

    08/28/2022, 9:51 AM
Hi, I have questions about ingesting metadata via OpenAPI: 1. Can I specify a semantic version for an aspect when doing a POST request? I tried using this as the request body, but the version that I specified didn't seem to appear in the UI.
    Copy code
    {
      "aspect": {
        "__type": "SchemaMetadata",
        "schemaName": "mongodb",
        "platform": "urn:li:dataPlatform:mongodb",
        "platformSchema": {
          "__type": "MySqlDDL",
          "tableSchema": "schema"
        },
        "version": "3",
        "hash": "",
        "fields": [
          {
            "fieldPath": "hello",
            "jsonPath": "null",
            "nullable": true,
            "description": "test hello 18",
            "type": {
              "type": {
                "__type": "RecordType"
              }
            },
            "nativeDataType": "Record()",
            "recursive": false
          }
        ]
      },
      "entityType": "dataset",
      "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mongodb,hello.hello3,PROD)"
    }
2. Is it possible to delete an aspect using a POST request instead of a DELETE request, by passing a body similar to the one above?
3. If I want to delete an aspect but still want it to be shown in the UI, just marked as "deleted" or something so users can view the version history of this dataset, is this possible?
    o
    • 2
    • 1
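On question 3, a hedged sketch using the Python emitter rather than OpenAPI: DataHub's soft delete writes a Status aspect with removed=True, which hides the entity from search/browse while keeping its stored aspect history. The GMS URL is a placeholder; the dataset URN is the one from the request body above.

```python
# Hedged sketch: "soft delete" a dataset by emitting Status(removed=True).
# The entity and its aspect versions remain in the backend; it is only hidden.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:mongodb,hello.hello3,PROD)",
    aspectName="status",
    aspect=StatusClass(removed=True),
)
emitter.emit(mcp)
```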
  • f

    few-grass-66826

    08/28/2022, 11:47 AM
Another issue now with Airflow: I changed airflow.cfg in my Docker image and added the DataHub lineage backend, but Airflow is unable to find datahub and the whole Airflow deployment fails. I rebuilt every DataHub module but with no result.
    d
    • 2
    • 1
  • b

    better-actor-97450

    08/29/2022, 2:56 AM
I'm ingesting with Oracle, but the job status does not update to finished even though the job's log records it as finished. How can I fix it?
    f
    • 2
    • 4
  • s

    straight-agent-79732

    08/29/2022, 5:31 AM
    Does datahub support tablename.propertyName in the rules regex?
    b
    • 2
    • 4
  • a

    aloof-oil-31167

    08/29/2022, 7:57 AM
Hey, how can I get a DataHub base image with version-based tags instead of a SHA -
    Copy code
    FROM linkedin/datahub-ingestion:85a55ff
    the following one is not pulling anything -
    Copy code
    FROM linkedin/datahub-ingestion:0.8.43.3
    b
    m
    • 3
    • 9
  • f

    few-grass-66826

    08/29/2022, 8:48 AM
Hi guys, always the same issue, I've tried literally everything. Same with Airflow: pip3 install acryl-datahub[kafka-connect] fails with no matches found: acryl-datahub[kafka-connect]
    d
    • 2
    • 1
  • f

    flat-painter-78331

    08/29/2022, 9:52 AM
Hi guys. I've been working on integrating Airflow. I've installed the plugin and created the REST hook connection. After these two steps, the documentation says to define the inlets and outlets in the DAG. Can I please know how to define the inlets and outlets when doing an ingestion from MySQL to BigQuery?
    d
    j
    • 3
    • 17
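A hedged sketch of declaring inlets/outlets on a task for a MySQL-to-BigQuery job, assuming the acryl-datahub Airflow plugin's Dataset helper; the table names, project, and task command are placeholders.

```python
# Hedged sketch: manual lineage on an Airflow task via inlets/outlets.
# Dataset(platform, name) comes from the DataHub Airflow integration;
# table/project names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset

with DAG(
    dag_id="mysql_to_bigquery",
    start_date=datetime(2022, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="load_orders",
        bash_command="echo 'run the actual MySQL -> BigQuery job here'",
        # lineage declared manually: MySQL table in, BigQuery table out
        inlets=[Dataset("mysql", "shop_db.orders")],
        outlets=[Dataset("bigquery", "my-project.analytics.orders")],
    )
```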
  • s

    square-hair-99480

    08/29/2022, 10:15 AM
Hi friends, my doubt/problem is: I created a Snowflake ingestion. Initially I did not specify its platform_instance, so it appeared with the name datahub in the UI. After a few days of ingesting data I had to change it and add a platform_instance to this ingestion, since I would be ingesting data from two distinct Snowflake accounts. Later I was asked to change the platform_instance value another time. So now when I go in the UI to Datasets -> Prod -> Snowflake I see 3 names (datahub, name_01, name_02) for the same ingestion job. How can I delete the older data so I only see and access the data related to the last value of the ingestion's platform_instance? I have tried things like
    Copy code
    datahub delete --urn "urn:li:dataPlatformInstance:(urn:li:dataPlatform:snowflake,DATAHUB.datahub,PROD)" --soft
    but it did not work.
    • 1
    • 1
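A hedged sketch of the direction I would try: the stale entities are the datasets themselves, so the URNs to delete are dataset URNs that embed the old platform_instance, not the dataPlatformInstance URN. The snippet below just builds those URNs so they can be soft-deleted (for example with datahub delete --urn ... --soft); the instance and table names are placeholders.

```python
# Hedged sketch: construct dataset URNs that carry the old platform_instance,
# for use with a soft delete. Instance and table names are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance

stale_tables = ["DB.SCHEMA.TABLE_A", "DB.SCHEMA.TABLE_B"]  # placeholders
for table in stale_tables:
    print(
        make_dataset_urn_with_platform_instance(
            platform="snowflake",
            name=table,
            platform_instance="name_01",  # old instance value to clean up
            env="PROD",
        )
    )
```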
  • a

    alert-fall-82501

    08/29/2022, 11:44 AM
Hi team - I am scheduling Hive metadata transfer jobs in an Apache Airflow DAG. In the Airflow Docker container it shows that Hive is disabled: No module found "pyhive". How do I enable this? Please help. Thanks in advance.
    d
    • 2
    • 19
  • a

    aloof-oil-31167

    08/29/2022, 1:22 PM
Hey, I'm trying to use the Spark lineage feature and getting the following error from the driver -
    Copy code
    Caused by: java.lang.ClassNotFoundException: datahub.spark.DatahubSparkListener
I added the following configs to the Spark session -
    Copy code
    "spark.jars.packages" = "io.acryl:datahub-spark-lineage:0.8.23",
    "spark.extraListeners" = "datahub.spark.DatahubSparkListener",
    "spark.datahub.rest.server" = ${?DATAHUB_URL},
    "spark.datahub.rest.token" = ${?DATAHUB_TOKEN}
    "spark.datahub.metadata.dataset.env" = "STG"
    does anyone have an idea?
    g
    c
    f
    • 4
    • 15
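A hedged PySpark sketch of the same setup: a ClassNotFoundException for DatahubSparkListener usually means the lineage jar was not resolvable on the driver classpath when the session started, so spark.jars.packages (or an explicit --packages/--jars on spark-submit) has to succeed before the listener is registered; a blocked route to the Maven repository ends in the same error. The server URL and token below are placeholders.

```python
# Hedged sketch (PySpark): register the DataHub Spark lineage listener.
# The io.acryl:datahub-spark-lineage package must be downloadable/resolvable
# at session start, otherwise the listener class cannot be found.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lineage-test")
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "https://datahub-gms.example.com")  # placeholder
    .config("spark.datahub.rest.token", "<token>")  # placeholder
    .config("spark.datahub.metadata.dataset.env", "STG")
    .getOrCreate()
)
```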
  • s

    stocky-minister-77341

    08/29/2022, 1:42 PM
Hi! I'm trying to set up a new MySQL ingestion. I'm getting an error that mysql is disabled, although I see that it is enabled when running datahub check plugins. I'll add the stack trace and the recipe file in the comments. Any ideas?
    g
    • 2
    • 5
  • b

    brave-businessperson-3969

    08/29/2022, 2:09 PM
Hi, I have some significant problems with the Trino ingestion when profiling is enabled (platform: DataHub 0.8.43 with the acryldata command line tool 0.8.43.5; the source is Starburst, a commercial variant of Trino):
• Various errors of the type trino/sqlalchemy/datatype.py:209 SAWarning: Did not recognize type 'str'. This error shows up for various data types (text, bool, float, float64, int32, etc.)
• config.table_pattern.allow changes the table names in DataHub when used
• TrinoUserError TABLE_NOT_FOUND errors. For some reason the ingestion source replaces _ in table names with $ and then, of course, does not find the table
• Schema not found exception: for some reason the schema name gets additional double quotes (e.g. trino.exceptions.TrinoUserError: [...] Schema '"dwh"' does not exist)
All these errors only show up if profiling is enabled and table_pattern.allow is used. I'm willing and able to debug Python code, but currently I lack an understanding of how the Trino connector works overall (e.g. where the SQL code is generated or where the check for pattern_allow is performed). Has anybody managed to ingest table statistics from Trino, and do you have an idea how to debug these issues?
    g
    g
    g
    • 4
    • 10
  • a

    alert-coat-46957

    08/29/2022, 3:11 PM
Hi team, does anyone know if we can integrate a Databricks 🧱 data source with DataHub? Do we have any documentation?
    m
    g
    q
    • 4
    • 9
  • s

    steep-finland-24780

    08/29/2022, 6:35 PM
Hi, our team has been using Metabase as our primary BI tool and I was wondering if anyone else is also ingesting it into DataHub? Do you do any transformations, or anything beyond the basic recipe, to ingest the Metabase collections and use them as containers in DataHub, so that user-created collections are easily searchable? Does anyone have any tips on this?
    g
    w
    • 3
    • 7
  • m

    miniature-plastic-43224

    08/29/2022, 8:40 PM
Team, I have a question about LDAP ingestion. I can see that all ingested users (CorpUserInfoClass) will always be set up as "active=True" (it is hardcoded in ldap.py). It means that if I need to filter out all "not active" users (which mostly means those who are no longer with the enterprise) I need to apply an LDAP filter on my own. This is fine. However, the model project has a note on the "CorpUserInfo" object: "Deprecated! Use CorpUserStatus instead. Whether the corpUser is active, ...". Yet I don't see CorpUserStatus during LDAP ingestion; the MCE doesn't have it. So, where should I get CorpUserStatus from?
    b
    • 2
    • 1
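A hedged sketch of one way to fill that gap: derive active/suspended state from your own LDAP filter and emit a corpUserStatus aspect yourself after (or alongside) the LDAP ingestion. The status string, username, and GMS URL below are illustrative assumptions.

```python
# Hedged sketch: emit a corpUserStatus aspect for a user, since the LDAP
# source only writes corpUserInfo. Status value and user name are illustrative.
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeTypeClass,
    CorpUserStatusClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder

now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:datahub")
status = CorpUserStatusClass(status="SUSPENDED", lastModified=now)  # or "ACTIVE"

emitter.emit(
    MetadataChangeProposalWrapper(
        entityType="corpuser",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn="urn:li:corpuser:jdoe",  # placeholder
        aspectName="corpUserStatus",
        aspect=status,
    )
)
```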
  • c

    careful-insurance-60247

    08/29/2022, 9:49 PM
I have noticed a few character-case mismatches in URNs with MSSQL and Tableau when ingesting lineage. Will other database source ingestion processes support the convert_urns_to_lowercase option that Snowflake has?
    • 1
    • 1
  • c

    cool-translator-98249

    08/29/2022, 10:57 PM
    Hi, I just got the install done and am trying a first few ingestions. When I do a dry run of our first CLI ingestion, I'm getting an error on the sink of:
    Copy code
    [2022-08-29 22:53:33,805] ERROR    {datahub.entrypoints:195} - Command failed: 
    	Tree is empty.
    g
    • 2
    • 3
  • a

    alert-fall-82501

    08/30/2022, 5:33 AM
    Copy code
    note: This error originates from a subprocess, and is likely not a problem with pip.
    error: legacy-install-failure
    
    × Encountered error while trying to install package.
    ╰─> sasl
    d
    g
    • 3
    • 13
  • a

    alert-fall-82501

    08/30/2022, 5:35 AM
Can anybody advise on this issue? I am trying to ingest metadata from Hive to DataHub and am facing this error while running pip install 'acryl-datahub[hive]'.
  • f

    few-carpenter-93837

    08/30/2022, 6:18 AM
Hey guys, just to confirm: in the current state, does ingestion through the CLI overwrite all elements of a dataset (for example tags), unless we use custom transformer logic to first request the current state from the server?
    b
    g
    • 3
    • 4
  • m

    microscopic-mechanic-13766

    08/30/2022, 7:34 AM
Good morning team, I am trying to connect Spark on Jupyter notebooks to DataHub. I have created a notebook whose Spark session is the following:
val spark = SparkSession.builder()
  .appName("test-application")
  .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.43")
  .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
  .config("spark.datahub.rest.server", "http://datahub-gms:8080")
  .enableHiveSupport()
  .getOrCreate()
After that, the initial datasets (which are not ingested into DataHub, as they are .csv files) are modified. My "problem" is that after executing the whole notebook, nothing appears in DataHub. Do I need to install anything in Jupyter itself, or does it look for the jars in some repository like Maven? I would really appreciate some guidance on how this connection works! Thanks in advance 🙂
    g
    h
    • 3
    • 3
  • b

    brave-tomato-16287

    08/30/2022, 7:36 AM
Hello all. After increasing the server limit from 20000 to 100000 we are still facing the Tableau ingestion error:
    Copy code
{'message': 'Showing partial results. The request exceeded the 100000 node limit. Use pagination, additional filtering, or both in the query to adjust results.', 'extensions':
    Can anybody suggest something?
    h
    • 2
    • 7
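A hedged sketch of the knob I would try next, assuming the Tableau source's page_size option (which, as I understand it, controls how many metadata objects are fetched per metadata API query): lowering it keeps each query under the node limit instead of raising the server limit further. Hostnames, site, and credentials below are placeholders.

```python
# Hedged sketch: Tableau recipe with a smaller page_size so each metadata API
# query stays under the server's node limit. Connection details are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "tableau",
            "config": {
                "connect_uri": "https://tableau.example.com",  # placeholder
                "site": "my_site",  # placeholder
                "username": "svc-datahub",  # placeholder
                "password": "${TABLEAU_PASSWORD}",
                "page_size": 5,  # assumption: smaller pages => fewer nodes per query
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```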
  • a

    alert-fall-82501

    08/30/2022, 7:45 AM
    Copy code
    sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:databricks.pyhive
  • a

    alert-fall-82501

    08/30/2022, 7:46 AM
Can anyone advise on this? I am trying to ingest metadata from Hive to DataHub.
    g
    • 2
    • 1
  • b

    bumpy-journalist-41369

    08/30/2022, 9:11 AM
Hello. I have a problem ingesting data from S3 buckets. I have set up DataHub in a Kubernetes cluster and am using the UI, not the CLI. The ingestion source looks like this:
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080'
source:
  type: s3
  config:
    path_spec:
      include: 's3://<bucket_name>/<table_name>/{partition_key[0]}={partition[0]}/*.parquet'
    platform: s3
    aws_config:
      aws_access_key_id: '*****'
      aws_region: us-east-1
      aws_session_token: '*******'
      aws_secret_access_key: '******'
pipeline_name: 'urn:li:dataHubIngestionSource:7ba22ca7-6c50-4b71-a766-8e89fa8fac52'
The S3 bucket structure is the following:
Bucket_name/
  Table_name/
    Sh_date=2022-08-30/
      part-00000-7a70bb8c-48b0-4c9b-bea0-585c9146c8cf.c000.snappy.parquet
      part-00001-7a70bb8c-48b0-4c9b-bea0-585c9146c8cf.c000.snappy.parquet
      ...
The ingestion fails and the output is the following:
    exec-urn_li_dataHubExecutionRequest_28e3ac39-d148-4306-b0f7-08dd063c52b9.log
    d
    • 2
    • 8
  • b

    bumpy-journalist-41369

    08/30/2022, 9:11 AM
    Does anyone have any idea how to fix the issue?
  • c

    colossal-hairdresser-6799

    08/30/2022, 9:27 AM
    UPSERT
    Python Emitter
    Add or update aspect (tags, terms, owners)
Hi, when looking at the documentation for adding tags, terms, and owners to a dataset, all the examples include: 1. Get the current tags
    Copy code
    current_tags: Optional[GlobalTagsClass] = graph.get_aspect_v2(
        entity_urn=dataset_urn,
        aspect="globalTags",
        aspect_type=GlobalTagsClass,
    )
2. Check that the tag does not already exist
    Copy code
    if current_tags:
        if tag_to_add not in [x.tag for x in current_tags.tags]:
3. If it doesn't, append it to the list of tags
    Copy code
# tags exist, but this tag is not present in the current tags
current_tags.tags.append(TagAssociationClass(tag_to_add))  # <- new tag
    4. Then add the current_tags with an UPSERT.
    Copy code
    event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="globalTags",
            aspect=current_tags,
    )
My understanding of an UPSERT is "if the aspect exists, update that aspect; if not, add it". So what I don't understand is why we would need to go through steps 1-3 if we're using UPSERT in the end anyway?
    b
    g
    • 3
    • 3
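For what it's worth, a hedged sketch of steps 1-4 stitched together. The reason the read is needed: UPSERT replaces the whole globalTags aspect value, so emitting only the new tag would drop the tags that are already there; the read-modify-write preserves them. The dataset URN, tag URN, and GMS URL below are placeholders.

```python
# Hedged sketch: append a tag without losing existing tags. UPSERT writes the
# entire aspect, so current tags are fetched first and the new tag appended.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GlobalTagsClass,
    TagAssociationClass,
)

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))  # placeholder

dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"  # placeholder
tag_to_add = "urn:li:tag:pii"  # placeholder

current_tags = graph.get_aspect_v2(
    entity_urn=dataset_urn,
    aspect="globalTags",
    aspect_type=GlobalTagsClass,
) or GlobalTagsClass(tags=[])

if tag_to_add not in [assoc.tag for assoc in current_tags.tags]:
    current_tags.tags.append(TagAssociationClass(tag=tag_to_add))
    graph.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="globalTags",
            aspect=current_tags,
        )
    )
```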
  • c

    colossal-hairdresser-6799

    08/30/2022, 9:54 AM
    graph.emit
Information regarding a successful update versus a skipped write when the aspect already exists
Hi, when using graph.emit to update an aspect, is there any way to see whether it was updated or just skipped because it already existed? Right now I can only see a log saying
INFO - metadata ingestion - Owner urn:li:corpGroup:test already exists, omitting write
    b
    • 2
    • 1