# ingestion

    lively-carpenter-82828

    09/01/2023, 11:27 AM
    Hi, during yesterday's DataHub Town Hall - Aug 2023 meeting, I asked a question about lineage for Kafka producers and consumers. Does anyone know how to achieve such lineage? Currently, I have around 40 Kafka Streams applications and approximately 120 KSQL streams. From what I've gathered, it's relatively easy to extract information about the association between consumers and topics. However, there isn't such information available for producers. I found a similar issue here: https://www.datasciencecentral.com/kapxy-a-kafka-utility-for-topic-lineage/. In Kafka Streams applications, it's possible to extract information from the configuration. However, in ksqlDB streams, I can't do that.
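
    A minimal sketch, under the assumption that the producer-to-topic relationship has to be declared by hand (for example from the Streams app's own configuration): emit an upstreamLineage aspect between the consumed and produced topics with the DataHub Python REST emitter. The GMS address and topic names below are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    # Topic the Streams/ksqlDB app consumes from, and topic it produces to (placeholders).
    input_topic = make_dataset_urn(platform="kafka", name="orders-raw", env="PROD")
    output_topic = make_dataset_urn(platform="kafka", name="orders-enriched", env="PROD")

    # Attach an upstreamLineage aspect to the produced (downstream) topic.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=output_topic,
            aspect=UpstreamLineageClass(
                upstreams=[
                    UpstreamClass(dataset=input_topic, type=DatasetLineageTypeClass.TRANSFORMED)
                ]
            ),
        )
    )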

    dry-raincoat-85182

    09/01/2023, 3:06 PM
    Hi team, I am trying to ingest IBM Db2 metadata using the sqlalchemy source type, as described at https://datahubproject.io/docs/generated/ingestion/sources/sqlalchemy/, and have pip-installed the required dialect (ibm_db_sa in this case), but schemaMetadata is not getting ingested. I'm getting the error below in the logs:
    Traceback (most recent call last):
      File "/data/vdc/conda/condapub/svc_am_cicd/envs/dh-actions/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 373, in run
        for record_envelope in self.transform(record_envelopes):
      File "/data/vdc/conda/condapub/svc_am_cicd/envs/dh-actions/lib/python3.10/site-packages/datahub/ingestion/extractor/mce_extractor.py", line 77, in get_records
        raise ValueError(
    ValueError: source produced an invalid metadata work unit: MetadataChangeEventClass(
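
    For reference, a minimal sketch (assuming placeholder host, port, database, and credentials) of pointing the generic sqlalchemy source at Db2 through the ibm_db_sa dialect, run programmatically rather than from a YAML recipe:
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "sqlalchemy",
                "config": {
                    # SQLAlchemy URI for the ibm_db_sa dialect (placeholder values).
                    "connect_uri": "ibm_db_sa://user:password@db2-host:50000/MYDB",
                    # Platform name used in the DataHub URNs.
                    "platform": "db2",
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()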

    square-painter-33350

    09/01/2023, 5:00 PM
    This is a newbie question about DataHub ingestion: I want to ingest data and extract the metadata in DataHub. During this process, I'd like to know whether DataHub can not only extract the metadata but also push the actual data into some persistent datastore, or whether I should load the data into a datastore first and then ingest and extract metadata from that datastore. I'm just trying to understand the basic data flow through DataHub for starters.

    important-autumn-58748

    09/01/2023, 6:23 PM
    How do I add a Dataset to a Container? I'm struggling with the Python Kafka emitter/SDK.
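
    A minimal sketch of one way to do this (the dataset name, container GUID, and server address are placeholders): emit the dataset's container aspect. The Kafka emitter accepts the same MetadataChangeProposalWrapper as the REST emitter used here.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ContainerClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    dataset_urn = make_dataset_urn(platform="kafka", name="my_topic", env="PROD")
    container_urn = "urn:li:container:21d4204e13d5b984c58acad468ecdbdd"  # placeholder GUID

    # Point the dataset's "container" aspect at the existing container.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=dataset_urn,
            aspect=ContainerClass(container=container_urn),
        )
    )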

    best-monitor-90704

    09/02/2023, 1:06 PM
    Hi, we are getting the error below while connecting DataHub to MongoDB: Failed to configure the source (mongodb): Could not reach any servers in [('ip-x-x-x-x', 27017)]. Replica set is configured with internal hostnames or IPs?, Timeout: 30s, Topology Description: <TopologyDescription id:6xxxxxxx91, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('ip-x-x-x-x, 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('ip-x-x-x-x27017 timed out')>]
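
    In case it helps as a starting point, a sketch of a mongodb recipe run programmatically, using a connection URI with the standard MongoDB directConnection option so the driver skips replica-set discovery when only the public address is reachable. The host, credentials, and whether this fits your topology are assumptions.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mongodb",
                "config": {
                    # directConnection=true is a standard MongoDB URI option, not DataHub-specific.
                    "connect_uri": "mongodb://ip-x-x-x-x:27017/?directConnection=true",
                    "username": "datahub",       # placeholder
                    "password": "placeholder",   # placeholder
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()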

    refined-gold-30439

    09/04/2023, 8:09 AM
    Hi team, I'm getting the following error. Why is it occurring?
    [2023-09-04 06:03:02,906] ERROR    {datahub.utilities.sqlalchemy_query_combiner:403} - Failed to execute queue using combiner: (pymysql.err.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'jgtcpmgyvhhzjeoo AS \n(SELECT count(*) AS count_1 \nFROM a.email)\n SELECT ' at line 1")
    And it seems like the tagging syntax isn't working properly... What part is incorrect?
    source:
        type: mysql
        config:
            host_port: 'hostname:3306'
            database: null
            username: user
            include_tables: true
            include_views: true
            profiling:
                enabled: true
                profile_table_level_only: false
                profile_table_row_count_estimate_only: true
                field_sample_values_limit: 0
                include_field_sample_values: false
            stateful_ingestion:
                enabled: true
            password: '${mysql_pw}'
            schema_pattern:
                allow:
                    - public
    transformers:
        -
            type: simple_add_dataset_tags
            config:
                tag_urns:
                    - 'urn:li:tag:us'

    fierce-monkey-46092

    09/04/2023, 8:21 AM
    Hi, folks. I need to know where the logs for glossary terms / glossary term groups are located on the OS. I suppose they're stored in ES indexes. My main goal is to find out who created/added these glossary terms in DataHub, and how many users did so.

    dazzling-stone-78871

    09/04/2023, 12:47 PM
    Hi team 🙂 I first ran ingestion of Redshift and Tableau data sources with default parameters using the DataHub CLI, and everything ran OK. I wanted to change the default env from prod to dev, so I deleted the platforms, since my data won't change. I do not see any error when running datahub --debug ingest -c my_file.yaml (I get "Finished metadata ingestion [...] Pipeline finished successfully"), but my metadata is not updated, and I do not see any new runs in the UI or from datahub ingest list-runs. Has anyone encountered this kind of issue? I have deployed DataHub on AWS with Kubernetes and use AWS MSK (Kafka v3.4.0), AWS OpenSearch (Elasticsearch v7.10) and RDS (PostgreSQL v13.8). I am on DataHub version 0.10.4. Thank you in advance 🙂

    straight-eve-29501

    09/04/2023, 12:53 PM
    Wondering if anyone has thought about logging to the ingestion console/UI from custom ingestion scripts. I find it a good idea to bring together whatever ingestions are happening, regardless of whether they are the standard ones or just some scripts running somewhere. I found DatahubIngestionRunSummaryProvider and was able to initialize it, but even though the events get sent (I also see them in the database), they don't show up in the UI. Any ideas would be appreciated. Pasting some code below in case something obviously wrong is visible at a glance.

    some-alligator-9844

    09/04/2023, 2:18 PM
    Is there a way to provide keytab details in the recipe instead of picking them up from the default ticket cache? I am trying to ingest data from multiple Hive sources, but each source requires a separate keytab for authentication, and username/password authentication is not available. I have 2 recipes that I want to run simultaneously using different keytabs for different Hive sources.

    alert-analyst-73197

    09/04/2023, 4:53 PM
    Hi, I have a big amount of datasets that I uploaded through the Python emitter. At the moment each one has only custom properties and nothing else. However, some of them share a dependency relation with one another. It is not proper lineage, because it has nothing to do with content transformation; it is only a "dataset is a dependency of dataset" relation. I searched through all the possible relationship types and found nothing suitable. I could use lineage for that, but I think lineage is not the proper way to describe this relation. Do you have any suggestions?

    acceptable-stone-72571

    09/05/2023, 8:06 AM
    Hi, I have created lineage between datasets and between columns. I wanted to delete both the dataset and the column lineage, so I used the GraphQL query provided at https://datahubproject.io/docs/api/tutorials/lineage#add-lineage, but I can see it has deleted only the dataset-level lineage, not the column-level lineage. Does the GraphQL query work the same way for both? Or is there any other way to delete the column-level lineage? Please help me out.
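
    A sketch of a possible workaround (an assumption based on how aspect upserts behave, not a confirmed answer): since an upsert replaces the whole upstreamLineage aspect, re-emitting it with only the table-level upstreams and an empty fineGrainedLineages list should drop the column-level lineage. The URNs below are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    downstream = make_dataset_urn(platform="hive", name="db.target_table", env="PROD")
    upstream = make_dataset_urn(platform="hive", name="db.source_table", env="PROD")

    # Keep (or rebuild) the table-level edge, and omit the column-level entries.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=downstream,
            aspect=UpstreamLineageClass(
                upstreams=[
                    UpstreamClass(dataset=upstream, type=DatasetLineageTypeClass.TRANSFORMED)
                ],
                fineGrainedLineages=[],
            ),
        )
    )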

    nutritious-lighter-88459

    09/05/2023, 9:16 AM
    Hi, I have integrated DataHub with Great Expectations for running validations/assertions. The validation result is exported to DataHub via the datahub_action and is displayed in DataHub under the Validation tab as expected. However, I was wondering whether there is a way to update the description of the assertion, which gets auto-generated (PFA)? We would like to provide our own custom description, maybe in the form of some attribute in the meta tag of the expectation suite. TIA
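
    For illustration, this is what attaching a custom attribute through an expectation's meta field looks like in Great Expectations; whether the DataHub action picks it up as the assertion description is exactly the open question here. The suite name, column, and description are placeholders.
    from great_expectations.core import ExpectationSuite
    from great_expectations.core.expectation_configuration import ExpectationConfiguration

    suite = ExpectationSuite(expectation_suite_name="orders_suite")
    suite.add_expectation(
        ExpectationConfiguration(
            expectation_type="expect_column_values_to_not_be_null",
            kwargs={"column": "order_id"},
            # Custom attribute we would like DataHub to surface as the description.
            meta={"description": "order_id must always be populated"},
        )
    )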

    best-kite-4934

    09/05/2023, 11:40 AM
    Has anyone been able to integrate Great Expectations validation results with DataHub? I followed https://datahubproject.io/docs/metadata-ingestion/integration_docs/great-expectations/ but it does not work when using a token.
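
    For comparison, a sketch of the checkpoint action_list entry for the DataHub validation action with a token, written as the Python dict that Great Expectations checkpoints accept; the server URL and the token environment variable are placeholders.
    import os

    datahub_action = {
        "name": "datahub_action",
        "action": {
            "module_name": "datahub.integrations.great_expectations.action",
            "class_name": "DataHubValidationAction",
            "server_url": "http://datahub-gms:8080",
            # Personal access token of a user/service account allowed to push assertions.
            "token": os.environ.get("DATAHUB_GMS_TOKEN", ""),
        },
    }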

    bitter-florist-92385

    09/05/2023, 12:50 PM
    Hey, I'm trying to load some YAML files into my DataHub instance. They are on DataHub's host server, but not in the venv. Still, when I run the CLI it won't find them. Any idea why that might be? (raise NewConnectionError( urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f22895fc5d0>: Failed to establish a new connection: [Errno 111] Connection refused)

    shy-diamond-99510

    09/05/2023, 12:58 PM
    Hey guys, I'm new to this Slack and to DataHub. I have to install a custom ingestion source in a Docker setup. I have followed all the tutorials related to what I am trying to achieve, but it still doesn't work. Does anybody have experience with installing a custom ingestion source? I really need help.

    great-florist-68068

    09/05/2023, 10:44 PM
    Hello! I'm trying to run an example ingestion for Hive: https://datahubproject.io/docs/generated/ingestion/sources/hive. My Hive instance uses Kerberos for auth. When running datahub ingest -c ./hive-datahub.yml, I'm getting "Server not found in Kerberos database". I tried running kinit beforehand but still get the same error. It is not clear to me how to provide krb5.conf in hive-datahub.yml.
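
    A sketch under assumptions (placeholder host and service name, and that Kerberos options are passed through to PyHive via options.connect_args as shown in the Hive source docs): krb5.conf itself is not referenced from the recipe; it is taken from the environment of the process running the ingestion, for example via the standard KRB5_CONFIG variable.
    import os

    from datahub.ingestion.run.pipeline import Pipeline

    # Standard MIT Kerberos variable; only needed if krb5.conf is not in the default location.
    os.environ["KRB5_CONFIG"] = "/etc/krb5.conf"

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "hive-server:10000",  # placeholder
                    "options": {
                        "connect_args": {
                            "auth": "KERBEROS",
                            "kerberos_service_name": "hive",  # assumption: default service name
                        }
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()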

    enough-pizza-64105

    09/06/2023, 3:08 AM
    Hi guys, I'm quite new to DataHub. I would like to ask how you would configure the ingestion source for a Delta table that is housed within HDFS (in other tools the location would be something like hdfs://localhost:9000/my_env/my_subfolder_location).

    best-umbrella-88325

    09/06/2023, 2:08 PM
    Hello community!! We're trying to ingest metadata using Airflow DAGs and the Python SDKs. Although the pipeline completes successfully, we are not able to see it as an 'ingestion' in the Ingestion tab. Ideally, it should show the source as 'CLI' ingestion under that tab. We use the PythonVirtualEnvOperator to facilitate the ingestion. Does anyone have any idea how we can achieve this? I tried 0.10.3 and 0.10.1 as the acryl-datahub package version. We use MWAA v2.5.1. Thanks in advance!!

    great-florist-68068

    09/06/2023, 9:15 PM
    Hello! I'm ingesting data from the Hive metastore and the table has a partition key:
    create table hive_example(a string, b int) partitioned by(c int);
    but the DataHub ingestion doesn't indicate isPartitioningKey in the payload:
    {
      "fieldPath": "c",
      "nullable": true,
      "type": {
        "type": {
          "com.linkedin.schema.NumberType": {}
        }
      },
      "nativeDataType": "int",
      "recursive": false,
      "isPartOfKey": false
    }

    lively-energy-75016

    09/07/2023, 3:47 AM
    Hi all, I have a question for you: when DataHub syncs the Hive metadata, only the database is synced; the tables are not synced over. The whole sync is shown as successful, yet there is an error reported as:

    better-orange-49102

    09/07/2023, 8:36 AM
    I noticed that mongodb doesn't have the ability to specify a platform instance. Just wondering, for teams that have already ingested metadata from mongodb: if the ingestion adds the ability to specify instances, how would you migrate your documentation from the existing entity to the new entity? (Because once a platform instance is added, the URN of the entity changes.)

    mammoth-musician-30735

    09/07/2023, 2:37 PM
    Hello, I want to delete ingested metadata from DataHub (Vertica platform metadata only). I see this is possible through the command line. Is there any option to delete it through the DataHub UI?

    able-library-93578

    09/07/2023, 9:12 PM
    Hi all, I am seeing weird ingestion behavior when using the UI (to be honest, I have not tested whether the CLI has the same issue). I am ingesting PowerBI metadata (UI YAML below); we pull in endorsements as tags. It works great with no issues on the initial ingestion: I see the tags, for example Certified. I then create another tag programmatically (Python, DataHubGraph), SourcesSDP, so now there are 2 tags. Great. When I run the UI ingestion again, it deletes all tags and re-writes the Certified tag. I verified that the GMS log shows an UPSERT for that entity and tag. What is the behavior supposed to be for the UPSERT? GMS log:
    2023-09-07 18:08:32,038 [qtp1577592551-60548] INFO c.l.m.r.entity.AspectResource:180 - INGEST PROPOSAL proposal: {aspectName=globalTags, systemMetadata={lastObserved=1694110111876, runId=2359db8a-8c78-471f-b38a-0e8167a4f431}, entityUrn=urn:li:dataset:(urn:li:dataPlatform:powerbi,SMG_Channel_Reporting_Dataset.DATE_DIM,PROD), entityType=dataset, aspect={contentType=application/json, value=ByteString(length=43,bytes=7b227461...227d5d7d)}, changeType=UPSERT}
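
    What seems to be happening (a hedged reading of the log, not a confirmed answer): the connector emits the full globalTags aspect with changeType UPSERT, and an upsert replaces the aspect wholesale, so tags added out-of-band are dropped. A sketch of a read-merge-write workaround with DataHubGraph (the server address is a placeholder; the URN and tag come from the log above):
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    dataset_urn = (
        "urn:li:dataset:(urn:li:dataPlatform:powerbi,"
        "SMG_Channel_Reporting_Dataset.DATE_DIM,PROD)"
    )
    extra_tag = "urn:li:tag:SourcesSDP"

    # Read the tags currently on the entity and merge our extra tag into them.
    current = graph.get_aspect(entity_urn=dataset_urn, aspect_type=GlobalTagsClass)
    tags = current.tags if current else []
    if not any(t.tag == extra_tag for t in tags):
        tags.append(TagAssociationClass(tag=extra_tag))

    graph.emit_mcp(
        MetadataChangeProposalWrapper(
            entityUrn=dataset_urn, aspect=GlobalTagsClass(tags=tags)
        )
    )
    A more durable option may be adding the extra tag via the simple_add_dataset_tags transformer in the ingestion recipe, so every scheduled run re-applies it instead of overwriting it.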

    eager-monitor-4683

    09/08/2023, 1:52 AM
    Hey team, just want to check whether event-based ingestion is supported in DataHub? Thanks

    mammoth-musician-30735

    09/08/2023, 4:51 AM
    Hello, I am doing S3 ingestion through the DataHub UI. I want to exclude files on S3; can we have only folders or not? If yes, could you please provide an example config to have only folders.
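
    A sketch under the assumption that "only folders" means each folder should appear as one dataset instead of every file: the {table} marker in path_specs.include groups a folder's files into a single dataset, and exclude drops unwanted paths. The bucket, prefix, patterns, and region are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "s3",
                "config": {
                    "path_specs": [
                        {
                            # Every folder under raw/ becomes one dataset ({table});
                            # the files inside it are only used to infer the schema.
                            "include": "s3://my-bucket/raw/{table}/*.parquet",
                            "exclude": ["**/_tmp/**", "**/*.json"],
                        }
                    ],
                    "aws_config": {"aws_region": "us-east-1"},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()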

    mammoth-musician-30735

    09/08/2023, 7:35 AM
    Hello team, I am doing Vertica ingestion through the DataHub UI. I have a Vertica database that has 552 tables and 541 views. I started the ingestion yesterday afternoon (IST) and it is still in progress; I don't know what's wrong with it. I am adding the config here, could you please help us?
    source:
        type: vertica
        config:
            host_port: 'host:5433'
            database: databse
            schema_pattern:
                allow:
                    - '^specific_schema_name*'
            username: '${VERTICA_USERNAME}'
            password: '${VERTICA_PASSWORD}'
            include_tables: true
            include_views: true
            include_projections: false
            include_models: false
            include_view_lineage: false
            include_projection_lineage: false
            profiling:
                enabled: false
                field_sample_values_limit: 10
                max_workers: 1
    As per the document https://datahubproject.io/docs/generated/ingestion/sources/vertica/, profiling is disabled. I also tried without the profiling section in the config and the result is the same (taking more than a day for a small ingestion).

    bland-barista-59197

    09/08/2023, 4:47 PM
    Hi team, any help is appreciated: is it possible to run profiling as a standalone job? In my scenario, profiling takes a long time (>18 hours) and I sometimes get a GMS error.

    melodic-dusk-2080

    09/10/2023, 4:32 PM
    Hi, is it possible to ingest REST APIs, and if so, how?
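
    One built-in option worth checking is the openapi source, which reads an OpenAPI/Swagger definition and registers the endpoints it describes as datasets. A sketch with placeholder name, URL, and spec path (treat the exact option names as assumptions to verify against the openapi source docs):
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "openapi",
                "config": {
                    "name": "my_api",                       # placeholder label for this API
                    "url": "https://api.example.com/",      # base URL of the service (placeholder)
                    "swagger_file": "openapi/v3/api-docs",  # path to the spec, relative to url (assumption)
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()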

    refined-gold-30439

    09/11/2023, 1:48 AM
    Hi team! I have two MySQL databases with the same schema but different descriptions (in different languages). When I run a script for ingestion from the command line, only one schema is created instead of two (the database information from the last executed script overwrites the previous one). How can I make them distinguishable? 🫠
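
    A sketch of one way to keep them apart (hostnames, credentials, and instance names are placeholders): give each database its own platform_instance so the generated URNs differ and the second ingestion no longer overwrites the first.
    from datahub.ingestion.run.pipeline import Pipeline

    def mysql_recipe(host_port: str, instance: str) -> dict:
        return {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": host_port,
                    "username": "datahub",      # placeholder
                    "password": "placeholder",  # placeholder
                    # Distinguishes otherwise-identical schemas in the URN.
                    "platform_instance": instance,
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }

    for host, instance in [("mysql-en:3306", "english"), ("mysql-fr:3306", "french")]:
        pipeline = Pipeline.create(mysql_recipe(host, instance))
        pipeline.run()
        pipeline.raise_from_status()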