# ingestion
  • b

    bland-orange-13353

    05/02/2023, 6:02 PM
    This message was deleted.
    ✅ 1
    l
    • 2
    • 1
  • q

    quiet-television-68466

    05/03/2023, 9:17 AM
    Not a deal breaker for us at all, but was curious if there’s any way for us to modify the urns of containers as they are ingested? Our Datasets have a very readable urn:
    urn:li:dataset:(urn:li:dataPlatform:snowflake,source.github.pull_requests,PROD)
    , but its corresponding schema and database have the following urns:
    urn:li:container:0080ebfa374633b2294b7ff38c82923b, urn:li:container:0a6efd87d585a012e259f1457f68ce0d
    The main use case for us is having them be parsable in the same way the dataset urns are, but if it's not possible that's fine!
    📖 1
    ✅ 1
    🔍 1
    l
    g
    • 3
    • 3
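    A minimal sketch for the container-URN question above, using the Python SDK's URN builders (verify the helper names against your SDK version): dataset URNs are assembled directly from platform/name/env, while the built-in sources derive container URNs from a GUID hashed out of the container's key fields, which is why they are not human-readable.

    from datahub.emitter.mce_builder import make_container_urn, make_dataset_urn

    # Dataset URNs are built from their readable parts.
    dataset_urn = make_dataset_urn(
        platform="snowflake", name="source.github.pull_requests", env="PROD"
    )
    # -> urn:li:dataset:(urn:li:dataPlatform:snowflake,source.github.pull_requests,PROD)

    # Containers you emit yourself can use any stable string as the id
    # (assumption: the helper does not enforce an actual GUID), but the bundled
    # Snowflake source will keep producing the hashed form for its containers.
    container_urn = make_container_urn("snowflake.source.github")
    # -> urn:li:container:snowflake.source.github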
  • a

    acoustic-quill-54426

    05/03/2023, 3:29 PM
    Hi! We are eager to use an ingestion fix before it makes it into the next release. We've done this before using the commit SHA, but it seems that the datahub-ingestion Docker step has been manually disabled 🤔 Is this intended?
    ✅ 1
    🔍 1
    📖 1
    l
    g
    • 3
    • 4
  • c

    colossal-hairdresser-6799

    05/03/2023, 3:36 PM
    Good day! How can I add a glossaryTerm to a column?
    ✅ 1
    l
    g
    • 3
    • 2
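    For the question above, a minimal sketch of attaching a glossary term to one column with the Python SDK, assuming a GMS at http://localhost:8080 and a hypothetical dataset, column, and term. Note that emitting editableSchemaMetadata this way replaces any existing editable field info; the upstream example does a read-modify-write against the graph first.

    import time

    from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        EditableSchemaFieldInfoClass,
        EditableSchemaMetadataClass,
        GlossaryTermAssociationClass,
        GlossaryTermsClass,
    )

    dataset_urn = make_dataset_urn(platform="snowflake", name="source.github.pull_requests", env="PROD")

    # Glossary term attached to a single column (schema field).
    field_info = EditableSchemaFieldInfoClass(
        fieldPath="author_login",
        glossaryTerms=GlossaryTermsClass(
            terms=[GlossaryTermAssociationClass(urn=make_term_urn("Classification.Public"))],
            auditStamp=AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:ingestion"),
        ),
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=dataset_urn,
            aspect=EditableSchemaMetadataClass(editableSchemaFieldInfo=[field_info]),
        )
    )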
  • a

    ancient-queen-15575

    05/03/2023, 4:21 PM
    Could anyone help me understand stateful ingestion and the
    bucket_duration
    variable? An initial run of a snowflake ingestion I’m trying takes about 3 minutes. If I use stateful ingestion and remove the
    ignore_start_time_lineage: true
    line, then a rerun takes about 30s. That seems great, but what I understood from the docs is that only lineage changes from the past day will be picked up this way. It would be nice if the past few days were checked in case DataHub went down for a few days. Is there a way to configure checking, for example, the past 3 days? I see there's a
    bucket_duration
    variable that’s an enum, but what are the accepted values for it? I can’t see any documentation for that.
    📖 1
    🔍 1
    l
    h
    • 3
    • 5
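    On the enum question above: a small sketch, assuming the config module path below (check your installed CLI version). bucket_duration accepts DAY or HOUR, and start_time is the usual way to widen the lookback window.

    from datahub.configuration.time_window_config import BucketDuration

    print([b.value for b in BucketDuration])  # ['DAY', 'HOUR']

    # Expressed as the equivalent source config dict for a recipe:
    snowflake_source_config = {
        "bucket_duration": "DAY",
        # Assumption: your source version accepts an absolute ISO timestamp here,
        # letting you re-check the last few days after an outage.
        "start_time": "2023-05-01T00:00:00Z",
    }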
  • b

    brainy-oxygen-20792

    05/03/2023, 4:50 PM
    How do y'all handle multiple runs of the same dbt project? We have one dbt project, and multiple departments own seeds and models within it. Rather than running everything in one run once per day, we have each department running their pipelines (via
    --select
    ) on their own schedule, which may be daily or hourly. So scheduling DataHub to pull on a fixed schedule means our assertions (run_results.json) may not be complete. Ideas we're considering are in the thread.
    ✅ 1
    🔍 1
    📖 1
    l
    g
    • 3
    • 4
  • p

    purple-salesmen-12745

    05/03/2023, 7:07 PM
    Hi, I would like to know whether it's possible to set up a machine-to-machine connection between Tableau Cloud and DataHub using a token. The reason is that using a personal token is blocked in Tableau because of 2FA. If the answer is yes, do you have a script to suggest for the ingestion?
    l
    g
    • 3
    • 9
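    A hedged sketch for the Tableau question above: the Tableau source supports personal access token auth (token_name/token_value) instead of username/password, which avoids the interactive 2FA prompt. All values below are placeholders; verify the field names against the Tableau source docs for your CLI version.

    import os

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "tableau",
                "config": {
                    "connect_uri": "https://prod-useast-a.online.tableau.com",
                    "site": "my_site",
                    "token_name": "datahub_ingestion",          # PAT created in Tableau Cloud
                    "token_value": os.environ["TABLEAU_TOKEN"],  # keep the secret out of the recipe
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()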
  • b

    bland-lighter-26751

    05/03/2023, 7:26 PM
    Hello! Does anyone here have BigQuery + Metabase lineage working? It would be awesome to hear from even just one person that it works for them. It half-worked up until 3 months ago, before breaking completely with the latest /ingestion/source/metabase.py change.
    🔍 1
    📖 1
    l
    a
    +4
    • 7
    • 78
  • b

    bland-orange-13353

    05/03/2023, 7:57 PM
    This message was deleted.
    ✅ 1
    l
    l
    • 3
    • 2
  • l

    lively-dusk-19162

    05/03/2023, 8:50 PM
    Hi team, can anyone help me figure out how to stop datahub-upgrade from running during the development phase?
    l
    p
    +2
    • 5
    • 5
  • a

    able-evening-90828

    05/04/2023, 1:57 AM
    For both Snowflake and MySQL ingestions, we have noticed that some of our datasets are missing the platform instance. The other datasets from the same UI ingestion run have the platform instance. We are running
    0.10.2
    server and
    0.10.2.2
    CLI for the UI ingestion. We looked at the
    metadata_aspect_v2
    table and noticed the
    dataPlatformInstance
    is missing the
    instance
    field in the
    metadata
    column. We saw the following:
    {"platform":"urn:li:dataPlatform:mysql"}
    as opposed to
    {"platform":"urn:li:dataPlatform:mysql","instance":"<OUR_PLATFORM_INSTANCE_URN"}
    We have never had this problem before. Has anyone else seen this?
    ✅ 1
    🔍 1
    📖 1
    l
    a
    g
    • 4
    • 7
  • s

    steep-midnight-37232

    05/04/2023, 10:49 AM
    Hi there, I have a problem with the ingestion of tags defined in Looker. In my case I have a hidden column with a tag "hidden_dimension". I have ingested Looker and LookML metadata into DataHub, but I can only see the tag "Dimension" and not the one I defined, as shown in the picture. Could you help me with this? Thanks!
    🔍 1
    📖 1
    l
    a
    • 3
    • 6
  • a

    adamant-honey-44884

    05/04/2023, 1:24 PM
    Hello, I ran the Athena ingestion and it was successful, but there is an extra container defined. Has anyone seen this and does anyone know how I can get rid of it?
    l
    b
    • 3
    • 5
  • l

    loud-hospital-37195

    05/04/2023, 5:38 PM
    Hi, we have set up DataHub on a Kubernetes cluster on Azure and we want to do a bulk ingest of business terms. We tried a test with the recipe and the .yaml downloaded locally, but we get an error. The error is attached:
    📖 1
    🔍 1
    l
    • 2
    • 1
  • l

    loud-hospital-37195

    05/04/2023, 5:39 PM
    ~~~~ Execution Summary - RUN_INGEST ~~~~
    Execution finished with errors.
    {'exec_id': '7741f040-bc32-4b90-9182-bd273621ab7e',
     'infos': ['2023-05-04 16:58:30.727353 INFO: Starting execution for task with name=RUN_INGEST',
               "2023-05-04 16:58:34.832761 INFO: Failed to execute 'datahub ingest'",
               '2023-05-04 16:58:34.832922 INFO: Caught exception EXECUTING task_id=7741f040-bc32-4b90-9182-bd273621ab7e, name=RUN_INGEST, '
               'stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
     'errors': []}
    
    ~~~~ Ingestion Logs ~~~~
    Obtaining venv creation lock...
    Acquired venv creation lock
    venv setup time = 0
    This version of datahub supports report-to functionality
    datahub  ingest run -c /tmp/datahub/ingest/7741f040-bc32-4b90-9182-bd273621ab7e/recipe.yml --report-to /tmp/datahub/ingest/7741f040-bc32-4b90-9182-bd273621ab7e/ingestion_report.json
    [2023-05-04 16:58:33,369] INFO     {datahub.cli.ingest_cli:165} - DataHub CLI version: 0.10.0
    [2023-05-04 16:58:33,395] INFO     {datahub.ingestion.run.pipeline:179} - Sink configured successfully. DataHubRestEmitter: configured to talk to <http://datahub-datahub-gms:8080>
    Failed to configure the source (datahub-business-glossary): 1 validation error for BusinessGlossarySourceConfig
    file
      file or directory at path "cristina.narros/testterms.yml" does not exist (type=value_error.path.not_exists; path=cristina.narros/testterms.yml)
    b
    • 2
    • 2
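    On the error above: the path check fails because a UI-triggered run executes inside the actions container, where a relative local path like the one in the recipe does not exist. One option (a sketch under that assumption, not the only fix) is to run the glossary recipe from the machine that actually has the file, using an absolute path and pointing straight at GMS.

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "datahub-business-glossary",
                "config": {
                    # Absolute path on the machine/container that runs the ingestion.
                    "file": "/absolute/path/to/testterms.yml",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://datahub-datahub-gms:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()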
  • l

    lemon-scooter-69730

    05/05/2023, 9:52 AM
    Is there a GraphQL endpoint or SDK method to get all ingestion sources, instead of fetching them one by one?
    l
    b
    • 3
    • 3
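    A sketch for the question above, assuming the listIngestionSources GraphQL query that the UI's Ingestion page uses (field names can differ slightly across versions; check GraphiQL on your instance).

    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    query = """
    query listIngestionSources($input: ListIngestionSourcesInput!) {
      listIngestionSources(input: $input) {
        start
        count
        total
        ingestionSources {
          urn
          name
          type
          schedule { interval timezone }
        }
      }
    }
    """
    result = graph.execute_graphql(query, variables={"input": {"start": 0, "count": 100}})
    for source in result["listIngestionSources"]["ingestionSources"]:
        print(source["urn"], source["type"])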
  • g

    great-notebook-53658

    05/07/2023, 7:32 AM
    Hi everyone. I have previously ingested into the business glossary, but when I run the recipe again (after making some updates to a glossary term definition to see if the plugin can successfully update a previously ingested term), I encounter the following error:
    'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
    'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
    'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status 500] java.lang.RuntimeException: Unknown aspect institutionalMemory for entity glossaryNode\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)',
    'message': 'java.lang.RuntimeException: Unknown aspect institutionalMemory for entity glossaryNode',
    'status': 500,
    'id': 'urn:li:glossaryNode:510f2c45a4622cb5ae7d4616c2aeafa2'}}]
    l
    d
    • 3
    • 4
  • g

    great-notebook-53658

    05/08/2023, 2:00 AM
    I did not see in the documentation that PowerBI supports DAX code. Is this something on the roadmap? Thanks!
    🩺 1
    ✅ 1
    l
    m
    g
    • 4
    • 5
  • b

    best-wire-59738

    05/08/2023, 3:44 AM
    Hi Team, We are getting
    SSLV3_ALERT_HANDSHAKE_FAILURE
    error while connecting to MariaDB using an SSL account. Can you please help me overcome this issue?
    ✅ 1
    l
    a
    f
    • 4
    • 6
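    A hedged sketch for the MariaDB SSL question: the mysql source is SQLAlchemy/pymysql based, so TLS settings can usually be passed through options.connect_args (an assumption; adjust the certificate paths and keys to what your server requires).

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "mariadb.example.internal:3306",
                    "username": "datahub_reader",
                    "password": "********",
                    "options": {
                        # Passed through to SQLAlchemy's create_engine / pymysql.
                        "connect_args": {
                            "ssl": {
                                "ca": "/etc/ssl/certs/mariadb-ca.pem",
                                # "cert": "/etc/ssl/certs/client-cert.pem",
                                # "key": "/etc/ssl/private/client-key.pem",
                            }
                        }
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()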
  • l

    loud-hospital-37195

    05/08/2023, 8:53 AM
    Hi, we are trying to bring in lineage from Snowflake. We have done the metadata ingestion and all the entities appear correctly, but no lineage appears for any of them. The entities are in different schemas in Snowflake; could that be the reason? We have followed all the steps in the guide. Thank you very much in advance!
    🔍 1
    📖 1
    ✅ 1
    l
    m
    • 3
    • 2
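    A hedged sketch for the Snowflake lineage question: lineage across schemas is supported, but the lineage flags need to be enabled and the ingestion role needs access to snowflake.account_usage. Field names follow the snowflake source docs; verify against your CLI version.

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "snowflake",
                "config": {
                    "account_id": "my_account",
                    "username": "datahub_user",
                    "password": "********",
                    "warehouse": "COMPUTE_WH",
                    "role": "datahub_role",  # needs IMPORTED PRIVILEGES on the "snowflake" database
                    "include_table_lineage": True,
                    "include_view_lineage": True,
                    "ignore_start_time_lineage": True,  # pull all historical lineage on the first run
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()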
  • d

    delightful-painter-8227

    05/08/2023, 10:21 AM
    Hello. 👋 Is anyone facing similar issues as described here? https://github.com/acryldata/datahub-helm/issues/266. Thanks.
    📖 1
    l
    d
    • 3
    • 3
  • a

    acceptable-morning-73148

    05/08/2023, 11:41 AM
    Hello. In the Python SDK I see that making a container URN expects a guid in the
    make_container_urn
    function. Is there a particular reason for that? Can we use another string instead of a UUID given the fact that our custom containers always have a unique name?
    📖 1
    🔍 1
    l
    m
    s
    • 4
    • 11
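    A sketch for the GUID question above: make_container_urn simply prefixes whatever id you give it, and the connectors build that id by hashing the container's key fields (datahub_guid below is an assumption about the helper involved; verify on your SDK version). A stable, unique name of your own should therefore work as the id.

    from datahub.emitter.mce_builder import datahub_guid, make_container_urn

    # How a connector-style GUID is derived (assumption: a hash over the key fields).
    key = {"platform": "urn:li:dataPlatform:snowflake", "database": "source"}
    print(make_container_urn(datahub_guid(key)))      # urn:li:container:<hash of the key>

    # A readable id also produces a valid container URN.
    print(make_container_urn("analytics-warehouse"))  # urn:li:container:analytics-warehouse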
  • f

    fierce-finland-15121

    05/08/2023, 10:58 PM
    Hello. I am trying to integrate Confluent Kafka with DataHub using the documentation here (helm deploy): https://datahubproject.io/docs/deploy/confluent-cloud/. I've mounted a custom executor.yml file with sasl.username and sasl.password, and I've confirmed that the file is mounted and the environment variables are there (will post the relevant config in replies). I've also confirmed that the credentials are correct and work with a basic local consumer. Unfortunately, when trying to run this action I get the following error. Could I get some help figuring out what exactly could be wrong here? I believe everything is mounted correctly and the credentials are correct, so I am really confused about why I am getting a message that just says authentication failed.
    datahub-actions actions -c /etc/datahub/actions/system/conf/executor.yaml
    %3|1683586489.986|FAIL|rdkafka#consumer-1| [thrd:sasl_ssl://<my broker url>/bootstr]: sasl_ssl://<my broker url>/bootstrap: SASL authentication error: Authentication failed (after 5055ms in state AUTH_REQ, 5 identical error(s) suppressed)
    🔍 1
    📖 1
    l
    a
    • 3
    • 5
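    A hedged sanity check for the SASL failure above: datahub-actions talks to Kafka through librdkafka, so reproducing the exact consumer_config with the confluent-kafka Python client (also librdkafka based) usually surfaces the same error. One common culprit is sasl.mechanism not being set to PLAIN alongside security.protocol SASL_SSL. Broker and credentials below are placeholders.

    from confluent_kafka import Consumer

    consumer = Consumer(
        {
            "bootstrap.servers": "<my broker url>:9092",
            "security.protocol": "SASL_SSL",
            "sasl.mechanism": "PLAIN",      # Confluent Cloud API keys authenticate with PLAIN
            "sasl.username": "<api key>",
            "sasl.password": "<api secret>",
            "group.id": "datahub-actions-sasl-check",
        }
    )
    metadata = consumer.list_topics(timeout=10)  # fails here if the SASL settings are wrong
    print(sorted(metadata.topics))
    consumer.close()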
  • c

    cool-flag-71835

    05/08/2023, 11:07 PM
    Hello everybody! How are you doing? I need some help with a Glue ingestion recipe. I want the first recipe run to extract the properties and documentation; after that, if I make any alteration to this metadata (properties and documentation) using the DataHub API, subsequent ingestion runs should only alter the tables' columns but keep the alterations made via the API. This is my recipe now:
    l
    a
    • 3
    • 2
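    One way to get the behavior described above (a sketch, assuming your API edits can target the "editable" aspects): connector-owned metadata lands in datasetProperties/schemaMetadata and gets refreshed on every run, while editableDatasetProperties (what the UI edits) is left alone by ingestion, so descriptions written there survive re-ingestion. The dataset below is hypothetical.

    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import EditableDatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn(platform="glue", name="sales_db.orders", env="PROD"),
            aspect=EditableDatasetPropertiesClass(
                description="Curated description maintained via the API, not the Glue catalog."
            ),
        )
    )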
  • f

    flaky-refrigerator-97518

    05/09/2023, 2:48 AM
    Hi everyone, I have added a new custom entity. When I start DataHub (docker-compose) I get the following error when visiting the UI:
    org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=index_not_found_exception, reason=no such index]
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
    Caused by: java.lang.RuntimeException: Failed to get entity counts
        at com.linkedin.datahub.graphql.resolvers.group.EntityCountsResolver.lambda$get$2(EntityCountsResolver.java:53)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
        ... 16 common frames omitted
    Caused by: org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=index_not_found_exception, reason=no such index]
    🔍 1
    📖 1
    l
    d
    • 3
    • 3
  • i

    important-afternoon-19755

    05/09/2023, 6:20 AM
    Hi team, is there any way to ingest query information into the Queries tab for Athena or Glue datasets using the Python emitter? Or is this impossible?
    l
    d
    • 3
    • 3
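    A hedged sketch for the Queries-tab question: in recent versions that tab is fed from the datasetUsageStatistics timeseries aspect (its topSqlQueries field), which can be emitted with the Python emitter. This is an assumption about what your server version reads; the dataset and queries below are hypothetical.

    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        CalendarIntervalClass,
        DatasetUsageStatisticsClass,
        TimeWindowSizeClass,
    )

    usage = DatasetUsageStatisticsClass(
        timestampMillis=1683590400000,  # start of the usage bucket, ms since epoch
        eventGranularity=TimeWindowSizeClass(unit=CalendarIntervalClass.DAY, multiple=1),
        totalSqlQueries=2,
        topSqlQueries=[
            "SELECT * FROM events WHERE ds = '2023-05-08'",
            "SELECT count(*) FROM events",
        ],
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn(platform="athena", name="analytics.events", env="PROD"),
            aspect=usage,
        )
    )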
  • c

    colossal-hairdresser-6799

    05/09/2023, 9:10 AM
    Hello! I'm trying to modify https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/mce_files/bootstrap_mce.json to add some additional records for a BigQuery dataset. Is there any simple way of exporting the MCE records from a DataHub instance so that I can easily add some realistic BigQuery entries?
    📖 1
    ❌ 1
    🔍 1
    l
    a
    • 3
    • 4
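    A sketch for the export question above: instead of hand-writing MCE JSON, you can pull real aspects for an existing BigQuery dataset out of a running instance and use them as a template (the urn below is hypothetical; verify the client method against your SDK version). The CLI equivalent, if available in your version, is "datahub get --urn <dataset urn>".

    import json

    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import DatasetPropertiesClass, SchemaMetadataClass

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
    urn = "urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.my_dataset.my_table,PROD)"

    for aspect_type in (DatasetPropertiesClass, SchemaMetadataClass):
        aspect = graph.get_aspect(entity_urn=urn, aspect_type=aspect_type)
        if aspect:
            print(json.dumps(aspect.to_obj(), indent=2))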
  • r

    rapid-crowd-46218

    05/09/2023, 3:08 PM
    Hello. I would like to ask for your opinions. In a scenario where I have ingested tables from Glue for the first time, I want to ingest new Glue tables into DataHub in near real time after the initial ingestion. I was thinking about using an Airflow DAG to re-ingest all Glue tables once a day, but I would prefer a way to ingest only the new tables immediately after they are created in Glue. Currently, I am considering writing a recipe that uses an AWS Lambda function to retrieve the new tables as a variable and then call "datahub ingest -c". What do you think is the best approach? I would also like to hear from anyone who has experience with this. The main point is how to add new tables after the initial ingestion in the best way possible. Thanks in advance!
    ✅ 1
    📖 1
    🔍 1
    l
    a
    a
    • 4
    • 8
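    A hedged sketch of the Lambda idea above: trigger on a Glue Data Catalog table event and ingest only that table by running a filtered Glue recipe in-process. The event shape and the pattern semantics are assumptions (adapt them to the EventBridge rule you configure; table_pattern is assumed to match the fully qualified "database.table" name).

    from datahub.ingestion.run.pipeline import Pipeline


    def handler(event, context):
        # Assumption: the event detail carries the database and table names;
        # adjust the keys to your rule's actual payload.
        database = event["detail"]["databaseName"]
        table = event["detail"]["tableName"]

        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "glue",
                    "config": {
                        "aws_region": "us-east-1",
                        "database_pattern": {"allow": [f"^{database}$"]},
                        "table_pattern": {"allow": [f"^{database}\\.{table}$"]},
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://datahub-gms.internal:8080"},
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()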
  • p

    prehistoric-wall-71780

    05/09/2023, 6:47 PM
    Hello people. I'm testing DataHub and I deployed it to GKE. I have a question about the ingestion framework: which Kubernetes component does it run on?
    l
    d
    • 3
    • 2
  • g

    gorgeous-psychiatrist-31553

    05/10/2023, 6:09 AM
    Good afternoon. I have a problem with ingestion sources disappearing.
    When I create a new ingestion source in DataHub, after saving it has the status Pending (which I think is normal).
    After a while, it disappears. Even if I launch it immediately, it disappears.
    This started after one of the ingestion sources had a connection error: the server the account was on was unavailable and was fixed later. I did a rollback of one of the ingestion runs, and now none of my new ingestion sources work; they disappear within a few seconds.
    Can you help me?
    I restarted the Docker containers, but it did not help.
    l
    a
    • 3
    • 3