# ingestion
  • l

    late-arm-1146

    12/13/2022, 11:27 PM
    Hi All, a question around data volume growth: with profiling I can see the volume of data, but is there any support in DataHub for reporting the volume growth rate?
    d
    • 2
    • 3
  • r

    refined-energy-76018

    12/13/2022, 11:31 PM
    For the DataHub Airflow plugin, I was curious whether there are any limitations around emitting DAG runs as DataProcessInstance. I know the documentation of DataProcessInstance suggests it can be used to represent an instance of a job/flow (DataFlow?) run, but the DataHub Airflow plugin only emits DataProcessInstances for DataJobs. From what I can understand, the plugin's implementation uses the Airflow task callbacks to emit DataProcessInstances and other metadata. I'm not aware of a similar callback for DAG runs, which leads me to wonder whether the DataHub team has previously considered adding DataProcessInstances for DAG runs to the plugin but found it infeasible due to not having a DAG-level callback.
    d
    g
    • 3
    • 5
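For the DAG-run question above, a minimal sketch of what emitting a DataProcessInstance for a whole DAG run could look like, e.g. called from a DAG-level on_success_callback. The GMS address, actor, DAG id, and run id are placeholders, and the aspect layout reflects my reading of the dataProcessInstance model rather than anything the plugin currently does:

import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    DataProcessInstancePropertiesClass,
    DataProcessInstanceRelationshipsClass,
)

# Placeholder identifiers for illustration only.
dag_id = "example_dag"
run_id = "scheduled__2022-12-14T00:00:00"
dpi_urn = f"urn:li:dataProcessInstance:{dag_id}_{run_id}"
flow_urn = f"urn:li:dataFlow:(airflow,{dag_id},prod)"

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS address

properties = DataProcessInstancePropertiesClass(
    name=run_id,
    created=AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:airflow"),
)
# Tie the run instance back to the DAG (DataFlow) it was spawned from.
relationships = DataProcessInstanceRelationshipsClass(
    upstreamInstances=[],
    parentTemplate=flow_urn,
)

for aspect in (properties, relationships):
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=dpi_urn, aspect=aspect))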
  • b

    better-orange-49102

    12/14/2022, 7:01 AM
    For existing datasets without a data platform instance, can I add the dataPlatformInstance aspect to the dataset without having to rename the dataset URN? I don't want to break bookmarks pointing to the entity.
    a
    • 2
    • 1
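For the question above, a minimal sketch of attaching the dataPlatformInstance aspect to an existing dataset via the Python emitter; writing the aspect does not itself rename the URN. The dataset name, instance name, and server address are made up, and whether the UI then groups the dataset under the instance is worth verifying:

from datahub.emitter.mce_builder import (
    make_data_platform_urn,
    make_dataplatform_instance_urn,
    make_dataset_urn,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataPlatformInstanceClass

# Hypothetical dataset and instance names; the URN stays exactly as it was.
dataset_urn = make_dataset_urn(platform="mysql", name="mydb.mytable", env="PROD")
aspect = DataPlatformInstanceClass(
    platform=make_data_platform_urn("mysql"),
    instance=make_dataplatform_instance_urn("mysql", "prod_instance_1"),
)

DatahubRestEmitter("http://localhost:8080").emit(
    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=aspect)
)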
  • p

    plain-cricket-83456

    12/14/2022, 7:11 AM
    For now, I see that metadata ingestion supports three types of database test connections. I wonder if you have any plans to add test connections for general databases such as Hive, MySQL, and Postgres; looking forward to your reply.
    d
    • 2
    • 1
  • l

    late-book-30206

    12/14/2022, 8:50 AM
    Hello, I have a problem when I try to ingest datasets from MongoDB for one of my collections. The ingestion finishes as succeeded, but no assets are ingested; when I look at the logs, they say DataHub produced my 3 workunits, yet nothing appears in my Datasets. The logs:
    Copy code
    '[2022-12-14 08:46:01,730] INFO     {datahub.ingestion.run.pipeline:103} - sink wrote workunit coffreoDirectory.coffreoDirectory\n'
               '[2022-12-14 08:46:02,131] INFO     {datahub.ingestion.run.pipeline:103} - sink wrote workunit coffreoDirectory.coffreoDirectoryDump\n'
               '[2022-12-14 08:46:02,422] INFO     {datahub.ingestion.run.pipeline:103} - sink wrote workunit coffreoDirectory.coffreoDirectory_logs\n'
               '[2022-12-14 08:46:02,423] INFO     {datahub.cli.ingest_cli:137} - Finished metadata ingestion\n'
               '\n'
               'Source (mongodb) report:\n'
               "{'workunits_produced': '3',\n"
               " 'workunit_ids': ['coffreoDirectory.coffreoDirectory', 'coffreoDirectory.coffreoDirectoryDump', "
               "'coffreoDirectory.coffreoDirectory_logs'],\n"
               " 'warnings': {},\n"
               " 'failures': {},\n"
               " 'cli_version': '0.8.43',\n"
               " 'cli_entry_location': '/tmp/datahub/ingest/venv-3c4dd982-80a2-42a3-b7fc-9e0bbd0f5c06/lib/python3.9/site-packages/datahub/__init__.py',\n"
               " 'py_version': '3.9.9 (main, Dec 21 2021, 10:03:34) \\n[GCC 10.2.1 20210110]',\n"
               " 'py_exec_path': '/tmp/datahub/ingest/venv-3c4dd982-80a2-42a3-b7fc-9e0bbd0f5c06/bin/python3',\n"
               " 'os_details': 'Linux-4.15.0-194-generic-x86_64-with-glibc2.31',\n"
               " 'filtered': ['CoffreoPdfClient',\n"
               "              'app',\n"
               "              'auth',\n"
               "              'booking',\n"
               "              'cmde',\n"
               "              'coffreo-entities',\n"
               "              'coffreoEvents',\n"
               "              'deliveries',\n"
               "              'delivery',\n"
               "              'documentsapi',\n"
               "              'documentsapi-vms',\n"
               "              'entitySettings',\n"
               "              'eventStore',\n"
               "              'events',\n"
               "              'hosting',\n"
               "              'integrationCME',\n"
               "              'log',\n"
               "              'messagers',\n"
               "              'metadata',\n"
               "              'mission',\n"
               "              'notifications',\n"
               "              'oauth2_server',\n"
               "              'randstadB2B',\n"
               "              'randstadBPE',\n"
               "              'standard',\n"
               "              'startpeople',\n"
               "              'sunnyCerts',\n"
               "              'timesheet',\n"
               "              'versioning']}\n"
               'Sink (datahub-rest) report:\n'
               "{'records_written': '3',\n"
               " 'warnings': [],\n"
               " 'failures': [],\n"
               " 'downstream_start_time': '2022-12-14 08:46:00.594476',\n"
               " 'downstream_end_time': '2022-12-14 08:46:02.422375',\n"
               " 'downstream_total_latency_in_seconds': '1.827899',\n"
               " 'gms_version': 'v0.8.43'}\n"
               '\n'
               'Pipeline finished successfully producing 3 workunits\n',
               "2022-12-14 08:46:02.955894 [exec_id=3c4dd982-80a2-42a3-b7fc-9e0bbd0f5c06] INFO: Successfully executed 'datahub ingest'"]}
    Execution finished successfully!
    p
    b
    • 3
    • 11
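For the MongoDB question above, one way to narrow the problem down is to check whether the datasets actually reached GMS (pointing to a search/index issue) or were never written. A sketch using the Python graph client, assuming a recent datahub version where DataHubGraph.get_aspect is available and a local GMS address; the URN uses one of the collection names from the log and the default PROD environment:

from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import SchemaMetadataClass

# Assumed GMS address; adjust to your deployment.
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Dataset URN built from the platform and a collection name seen in the log above.
urn = "urn:li:dataset:(urn:li:dataPlatform:mongodb,coffreoDirectory.coffreoDirectory,PROD)"

schema = graph.get_aspect(urn, SchemaMetadataClass)
print("Found in GMS" if schema is not None else "Not found in GMS", urn)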
  • w

    wonderful-vegetable-45135

    12/14/2022, 1:23 PM
    Hi guys, I'm wondering if some of you can help me / guide me with contributing to DataHub by adding lineage capability for MySQL and MariaDB. I feel like I'm very close to figuring out what to do, but due to my lack of experience with OOP I feel a bit lost in the code. I know that with the [lineage_emitter_mcpw_rest.py] file we can make a lineage object; however, you have to manually gather the metadata of all the datasets. I want to incorporate it in a similar way to how lineage was incorporated for Redshift in this commit https://github.com/datahub-project/datahub/commit/0fdd3352bdd22830c5efacaec9f50a50389fe06f?diff=split#diff-53a1aa12c6137e954cc2ffa690902e677a00e530ab13d6b6222fdb478963bdb9. I can also see that within the [sql_common.py] file, the method [_process_table] uses the MetadataChangeProposalWrapper class just like the [lineage_emitter_mcpw_rest.py] file. So it should be possible to create the lineage with the metadata already scraped by the [mysql.py] file. In any case, any tips or explanations on how the code works, or on what the next steps should be, are very much appreciated!
    d
    • 2
    • 7
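To make the pieces mentioned above concrete, this is roughly what lineage_emitter_mcpw_rest.py boils down to: build an upstreamLineage aspect and emit it as a MetadataChangeProposalWrapper for the downstream dataset. A source such as mysql.py would do the same thing with table names it has already scraped; the dataset names and server address below are invented:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

# Hypothetical tables: orders_clean is derived from orders_raw.
upstream = UpstreamClass(
    dataset=make_dataset_urn("mysql", "shop.orders_raw", "PROD"),
    type=DatasetLineageTypeClass.TRANSFORMED,
)
lineage = UpstreamLineageClass(upstreams=[upstream])

mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn("mysql", "shop.orders_clean", "PROD"),
    aspect=lineage,
)
DatahubRestEmitter("http://localhost:8080").emit(mcp)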
  • s

    swift-dream-78272

    12/14/2022, 2:38 PM
    Hey, has anyone ever tried concatenating secrets in a recipe YAML file? For example something like this, when I have saved the host and port separately:
    Copy code
    host_port: $HOST:$PORT 
    ....
    Is that even possible? So far all my experiments give me the error:
    invalid literal int base 10
    d
    m
    • 3
    • 3
  • k

    kind-vr-18537

    12/14/2022, 4:31 PM
    Hi all, is there a way to use the S3 source module to read data from JSON files? I am trying to ingest some data in S3 that has its metadata stored in JSON files, so I want to have the JSON values as well as the keys displayed in the UI. I am new to DataHub. I have ingestion working for our S3 bucket, but I'm not sure how to go about pulling values as well as keys out of the JSON files. Could I achieve this with a transform? Or will I need to fork the source module and add in some code?
    d
    • 2
    • 3
  • p

    plain-controller-95961

    12/14/2022, 5:50 PM
    Hi all, what are the criteria for enabling column-level lineage for other SQL sources? Is it something that we would have to handle in the connector codebase, or would DataHub be responsible for enabling that feature for specific sources?
    a
    d
    • 3
    • 4
  • l

    lively-dusk-19162

    12/14/2022, 7:04 PM
    Hello all, can anyone help me with how to use an Airflow DataJob for column-level lineage?
    d
    • 2
    • 6
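For the column-level lineage question above, one option (whether or not the job runs in Airflow) is to attach fineGrainedLineages to the downstream dataset's upstreamLineage aspect, e.g. from a task in the DAG. A minimal sketch with all dataset and column names invented:

from datahub.emitter.mce_builder import make_dataset_urn, make_schema_field_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    FineGrainedLineageClass,
    FineGrainedLineageDownstreamTypeClass,
    FineGrainedLineageUpstreamTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

upstream_ds = make_dataset_urn("mysql", "shop.orders_raw", "PROD")
downstream_ds = make_dataset_urn("mysql", "shop.orders_clean", "PROD")

# Column-level edge: orders_raw.amount -> orders_clean.total_amount (hypothetical columns).
fine_grained = FineGrainedLineageClass(
    upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
    upstreams=[make_schema_field_urn(upstream_ds, "amount")],
    downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
    downstreams=[make_schema_field_urn(downstream_ds, "total_amount")],
)
lineage = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=upstream_ds, type=DatasetLineageTypeClass.TRANSFORMED)],
    fineGrainedLineages=[fine_grained],
)

DatahubRestEmitter("http://localhost:8080").emit(
    MetadataChangeProposalWrapper(entityUrn=downstream_ds, aspect=lineage)
)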
  • r

    refined-energy-76018

    12/14/2022, 10:30 PM
    Does the DataHub Airflow plugin support creating lineage via Airflow's concept of Datasets? If not, are there any plans to do so? https://airflow.apache.org/docs/apache-airflow/stable/concepts/datasets.html
    d
    • 2
    • 2
  • i

    incalculable-queen-1487

    12/15/2022, 2:46 AM
    Hi all, does DataHub support table- and column-level lineage for Apache Hive? Or is it planned?
    b
    d
    • 3
    • 3
  • f

    full-chef-85630

    12/15/2022, 5:49 AM
    Hi team, I’m using bigquery-usage. There are two fields that I don’t understand: “count” and “totalSqlQueries”.
    Copy code
    "usageStats": {
                  "aggregations": {
                    "uniqueUserCount": 1,
                    "users": [
                      {
                        "user": {
                          "username": "content-backend-prod"
                        },
                        "userEmail": "<http://xxx.gserviceaccount.com|xxx.gserviceaccount.com>",
                        "count": 165
                      }
                    ],
    
                    "totalSqlQueries": 0
                  }
                },
    d
    • 2
    • 4
  • m

    modern-answer-65441

    12/15/2022, 7:08 AM
    Hello Team, after creating new groups or new secrets, the page refreshes without displaying the data. The backend reports an Elasticsearch "document missing" error, and the data is not written to ES.
    Copy code
    01:02:16.024 [I/O dispatcher 1] ERROR c.l.m.s.e.update.BulkListener:25 - Failed to feed bulk request. Number of events: 1 Took time ms: -1 Message: failure in bulk execution:
    [0]: index [datahubsecretindex_v2], type [_doc], id [urn%3Ali%3AdataHubSecret%3ASECRET_CREATED_BY_SSEITLER_SSO_2], message [[datahubsecretindex_v2/-OUJuv_aSvWNBB2-NiVKfg][[datahubsecretindex_v2][0]] ElasticsearchException[Elasticsearch exception [type=document_missing_exception, reason=[_doc][urn%3Ali%3AdataHubSecret%3ASECRET_CREATED_BY_SSEITLER_SSO_2]: document missing]]
    Copy code
    7:06:44.882 [I/O dispatcher 1] ERROR c.l.m.s.e.update.BulkListener:25 - Failed to feed bulk request. Number of events: 2 Took time ms: -1 Message: failure in bulk execution:
    [0]: index [corpgroupindex_v2], type [_doc], id [urn%3Ali%3AcorpGroup%3Ac377308d-1b2a-42e9-a407-6595a4520962], message [[corpgroupindex_v2/E8CJttR6RwKYFEFXdblmTg][[corpgroupindex_v2][0]] ElasticsearchException[Elasticsearch exception [type=document_missing_exception, reason=[_doc][urn%3Ali%3AcorpGroup%3Ac377308d-1b2a-42e9-a407-6595a4520962]: document missing]]]
    [1]: index [corpgroupindex_v2], type [_doc], id [urn%3Ali%3AcorpGroup%3Ac377308d-1b2a-42e9-a407-6595a4520962], message [[corpgroupindex_v2/E8CJttR6RwKYFEFXdblmTg][[corpgroupindex_v2][0]] ElasticsearchException[Elasticsearch exception [type=document_missing_exception, reason=[_doc][urn%3Ali%3AcorpGroup%3Ac377308d-1b2a-42e9-a407-6595a4520962]: document missing]]]
    This happens when we do any ingestion from any data source. Can someone help debug this? I followed this link: https://github.com/datahub-project/datahub/issues/6689, which says I need to set MAE_CONSUMER_ENABLED to true; however, it didn't help. Any suggestions, please?
    b
    b
    +5
    • 8
    • 48
  • m

    magnificent-lock-58916

    12/15/2022, 9:45 AM
    Hello! Can you please tell me whether the Airflow DataHub ingestion is stateful? If yes, how can we enable this ingestion to be stateful?
    d
    • 2
    • 3
  • t

    tall-father-13753

    12/15/2022, 10:10 AM
    Hi, I’ve produced some unwanted entities and now I would like to remove them. Their URNs follow the pattern:
    urn:li:dataFlow:(drone,<SOME-ID>,gold)
    . I’ve found this doc: https://datahubproject.io/docs/how/delete-metadata/ and, according to it, used the command:
    Copy code
    $ datahub delete --env gold --entity_type dataFlow --platform drone --hard -f
    But it didn’t work. I got:
    Copy code
    [2022-12-15 10:06:21,679] INFO     {datahub.cli.delete_cli:287} - datahub configured with <http://localhost:8080>
    [2022-12-15 10:06:22,075] INFO     {datahub.cli.delete_cli:323} - Filter matched  dataFlow entities of drone. Sample: []
    No urns to delete. Maybe you want to change entity_type=dataFlow or platform=drone to be something different?
    Took 1.007 seconds to hard delete 0 versioned rows and 0 timeseries aspect rows for 0 entities.
    Am I doing something wrong?
    d
    • 2
    • 4
  • s

    salmon-angle-92685

    12/15/2022, 10:53 AM
    Hello everyone, I've created a pipeline in Dagster to launch my data ingestion. However, when the ingestion fails with an error, the datahub CLI doesn't exit with code 1, so my pipeline seems to have worked when it actually hasn't. What should I do to force it to exit with code 1 when it fails? Thanks!
    d
    • 2
    • 3
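For the Dagster exit-code question above, one hedged workaround is to run the recipe programmatically instead of shelling out to the CLI, and let raise_from_status surface failures as an exception that fails the Dagster op. The MySQL source config below is a placeholder:

from datahub.ingestion.run.pipeline import Pipeline

# Recipe expressed as a dict; source config values here are placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",
                "username": "datahub",
                "password": "datahub",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.pretty_print_summary()
# Raises an exception (and thus fails the surrounding Dagster op) if the run had failures.
pipeline.raise_from_status()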
  • f

    future-iron-16086

    12/15/2022, 12:45 PM
    Hi, all. I'm facing some issues ingesting metadata from Oracle (version 12.1.0.2.0). With version 19.11.0.0 it is OK to connect and collect the metadata. Oracle (version 12.1.0.2.0):
    • User permissions:
    ◦ SELECT_CATALOG_ROLE (dba_role_privs)
    ◦ CREATE TABLE / CREATE VIEW / CREATE SESSION / SELECT ANY DICTIONARY / CREATE SEQUENCE (dba_sys_privs)
    With these permissions it works fine in 19.11.0.0, but in 12.1.0.2.0 I'm getting the error below.
    Copy code
    'sqlalchemy.exc.DatabaseError: (cx_Oracle.DatabaseError) ORA-00942: table or view does not exist\n'
               '[SQL: SELECT username FROM dba_users ORDER BY username]\n'
               '(Background on this error at: <http://sqlalche.me/e/13/4xp6>)\n'
               '[2022-12-15 12:25:46,603] ERROR    {datahub.entrypoints:195} - Command failed: \n'
               '\t(cx_Oracle.DatabaseError) ORA-00942: table or view does not exist\n'
               '[SQL: SELECT username FROM dba_users ORDER BY username]\n'
               '(Background on this error at: <http://sqlalche.me/e/13/4xp6>) due to \n'
               "\t\t'ORA-00942: table or view does not exist'.\n"
    Any help to solve this?
    d
    r
    • 3
    • 5
  • c

    colossal-sandwich-50049

    12/15/2022, 3:18 PM
    Hello, I am seeing different behavior w.r.t. how dataset metadata is displayed in the DataHub UI depending on what platform I specify in the dataset URN (using the Java REST emitter). As can be seen in the images, when
    platform=delta
    a folder structure is created (i.e. the dataset is shown under the folder mlprod in the UI); this is the desired behavior. When
    platform=s3
    this does not happen; the dataset is not created under a folder. Can someone advise on why this happens? Ideally I would like the same behavior with platform=s3 as with platform=delta. Note: I have isolated the cause to the platform (i.e. FabricType, the aspect type, etc. do not factor in).
    Copy code
    String platform = "s3"; // different behavior depending on whether this value is s3 or delta
    DataPlatformUrn defaultDataPlatformUrn = new DataPlatformUrn(platform);
    DatasetUrn defaultDatasetUrn = new DatasetUrn(
              defaultDataPlatformUrn,
              "mlprod." + name,
              FabricType.DEV
    );
    cc: @great-toddler-2251
    ✅ 1
    a
    • 2
    • 1
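For the s3-vs-delta foldering question above, one thing worth checking is the browsePaths aspect, which drives the folder hierarchy in the UI; if it is not being generated for the s3 URN, it can be emitted explicitly. The sketch below uses the Python emitter (the Java emitter has equivalent classes), and the path value is an assumption about the desired layout rather than documented s3 behavior:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import BrowsePathsClass

# Hypothetical dataset mirroring the Java example (mlprod.<name> on s3, DEV).
dataset_urn = make_dataset_urn("s3", "mlprod.my_dataset", "DEV")
browse_paths = BrowsePathsClass(paths=["/dev/s3/mlprod"])

DatahubRestEmitter("http://localhost:8080").emit(
    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=browse_paths)
)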
  • c

    curved-apple-55756

    12/15/2022, 4:27 PM
    Hello, I'm trying to ingest DB2 AS400 metadata into DataHub. I've created a YAML file with the SQLAlchemy connection parameters, but
    datahub ingest -c file.yaml
    failed. Here is the error:
    sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'IBM DB2 ODBC DRIVER' : file not found
    My DataHub instance is installed on Debian. Have you ever run into this kind of issue? Do you know how to solve it? Any advice would be appreciated!
    d
    i
    • 3
    • 5
  • b

    breezy-rainbow-26977

    12/15/2022, 7:12 PM
    Hi all, I'm getting this basic error while trying to set up a connection to a Trino instance with no password (it's just a proof of concept; I'm leaving the password field blank in the source setup):
    Copy code
    Failed to create ingestion source!: Failed to fetch
    I'm unsure whether it's a networking error or an authentication error. Are there any logs I can check, or something similar?
    h
    • 2
    • 1
  • a

    acceptable-account-83031

    12/15/2022, 7:41 PM
    Hello everyone, how can I preserve the original upstream lineage of a dataset when overwriting lineage using a Python file and a MetadataChangeProposalWrapper object?
    ✅ 1
    👀 1
    m
    a
    • 3
    • 4
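For the lineage-preservation question above, the pattern I would try is read-modify-write: fetch the current upstreamLineage aspect, append to it, and emit the merged aspect instead of a fresh one. A sketch assuming a recent datahub package (where DataHubGraph.get_aspect exists; older releases expose get_aspect_v2) and placeholder URNs:

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))  # assumed address

downstream = "urn:li:dataset:(urn:li:dataPlatform:mysql,shop.orders_clean,PROD)"
new_upstream = "urn:li:dataset:(urn:li:dataPlatform:mysql,shop.customers,PROD)"

# Start from whatever lineage already exists instead of overwriting it.
lineage = graph.get_aspect(downstream, UpstreamLineageClass) or UpstreamLineageClass(upstreams=[])
if new_upstream not in {u.dataset for u in lineage.upstreams}:
    lineage.upstreams.append(
        UpstreamClass(dataset=new_upstream, type=DatasetLineageTypeClass.TRANSFORMED)
    )

graph.emit(MetadataChangeProposalWrapper(entityUrn=downstream, aspect=lineage))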
  • a

    ambitious-notebook-45027

    12/16/2022, 3:24 AM
    Hi everyone, how can I make the pods (actions, gms, frontend) highly available?
    b
    d
    • 3
    • 10
  • b

    bland-room-3136

    12/16/2022, 4:30 AM
    Hi, I have just started exploring DataHub. I tried to ingest data from Superset, and I have a few doubts: 1. By default, only chart/dashboard metadata is ingested into DataHub. Can I include virtual datasets (similar to views in a DB) or all Superset datasets in the ingestion? 2. Some dashboards have missing charts when viewed in DataHub (those charts are present in the DataHub chart list but not linked to the correct dashboards). Any way to debug this?
    h
    • 2
    • 2
  • f

    few-tent-75240

    12/16/2022, 12:52 PM
    @dazzling-judge-80093 Hey Tamas, silly question: how do I edit a schedule on an ingestion job that's already configured? I don't see an option to edit/remove the schedule and just run the ingestion manually... Thanks for your help!
    d
    b
    • 3
    • 19
  • w

    wonderful-hair-89448

    12/16/2022, 1:16 PM
    Hi, I am very new to the group; I joined a few minutes back. I am learning about DataHub and trying to deploy a DataHub instance on my Mac. I am facing the issue below:
    Detected M1 machine
    requests.exceptions.SSLError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /datahub-project/datahub/master/docker/quickstart/docker-compose-without-neo4j-m1-quickstart.yml (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))
    The error appears when I run "datahub docker quickstart". I need to resolve this, but I am not aware of a fix. Is there a known fix for this error message?
    ✅ 1
    b
    a
    • 3
    • 5
  • g

    gifted-bird-57147

    12/16/2022, 2:36 PM
    Hi Team, I'm trying to ingest data from MSSQL, including profiling. For one of my tables this results in:
    Copy code
    'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
                   'info': {'message': '413 Client Error: Payload Too Large for url: '
                                       '<https://noord-holland.acryl.io/api/gms/aspects?action=ingestProposal>',
                            'id': 'urn:li:dataset:(urn:li:dataPlatform:mssql,GPBUAPP.areaaldata.BEHEERGRENZEN_V,PROD)'}}]
    I suspect it has to do with the profile of a column of type varbinary(MAX). It would be really useful to have this feature request implemented: https://feature-requests.datahubproject.io/p/sql-profile-allow-deny-by-data-type so I can exclude those columns from profiling.
  • a

    acceptable-account-83031

    12/16/2022, 3:38 PM
    Hello everyone, I’m facing an issue: after custom lineage is established using a Python file, whenever new metadata comes in from an ingestion scheduled at set intervals, the original lineage is shown instead of the custom lineage. How do I avoid this without having to run the Python file every single time the recipe is run on a schedule to pick up new metadata?
    b
    • 2
    • 2
  • w

    wonderful-hair-89448

    12/16/2022, 4:12 PM
    Hi, I am trying to deploy a DataHub instance for the first time. I get the following error when I run: datahub docker quickstart --quickstart-compose-file ./docker/quickstart/docker-compose-without-neo4j-m1.quickstart.yml
    unable to run quickstart - the following issues were detected:
    • kafka-setup is still running
    • datahub-gms is running but not healthy
    Please let me know if you have any suggestions or a quick fix for this. Thank you.
    a
    a
    c
    • 4
    • 7
  • a

    ambitious-shoe-92590

    12/16/2022, 9:32 PM
    Question: does anyone familiar with S3 Parquet ingestion know whether it's possible to NOT separate out each Parquet file into its own "dataset"? This is my current ingestion config:
    Copy code
    ...
    path_specs:
  - include: 's3://bucket-name/path/table/{partition_key[0]}={partition[0]}/*.parquet'
    ...
    and this is producing n "datasets" under table, based on how many `partition_key`s there are
    d
    • 2
    • 4
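For the Parquet partition question above: if I remember the path_spec semantics correctly, replacing the literal table folder with the {table} token should make the partition folders beneath it roll up into a single dataset rather than one dataset per file. A hedged sketch of the recipe as a programmatic pipeline, with placeholder bucket and path names:

from datahub.ingestion.run.pipeline import Pipeline

# Placeholder bucket and path; {table} marks the dataset-level folder so the
# partition_key=value folders below it are treated as partitions of one dataset.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [
                    {
                        "include": "s3://bucket-name/path/{table}/{partition_key[0]}={partition[0]}/*.parquet"
                    }
                ],
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)
pipeline.run()
pipeline.raise_from_status()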