# ingestion
  • b

    blue-beach-27940

    06/28/2022, 8:05 AM
Then, refer to the example DAG: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py
  • b

    blue-beach-27940

    06/28/2022, 8:09 AM
    datahub docker quickstart
  • b

    blue-beach-27940

    06/28/2022, 8:13 AM
Hello, I am new to DataHub 0.8.38. I launched another Python virtual env, installed datahub==0.8.38, and executed the command:
datahub docker quickstart
but got the error below. Has anyone had this problem?
  • b

    brainy-crayon-53549

    06/28/2022, 1:13 PM
Hi, I was getting the error below when trying to connect to Postgres. Does anyone have a solution for this?
  • m

    microscopic-helicopter-87069

    06/28/2022, 1:31 PM
Hi, I would like to know what the Hive recipe should look like when Postgres is used as the metastore.
  • g

    green-lion-58215

    06/28/2022, 10:21 PM
Hi Team, I am trying to set up a pipeline to ingest dbt metadata into DataHub through a Lambda function. However, I am getting the below error. Does anyone know how to resolve it? I am using a layer which has the 'acryl-datahub[dbt]' package installed. Below is the pipeline step:
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "dbt",
            "config": {
                "manifest_path": "/tmp/manifest.json",
                "catalog_path": "/tmp/catalog.json",
                "sources_path": "/tmp/sources.json",
                "target_platform": "databricks",
                "load_schemas": True,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://<masked>:8080"},
        },
    }
)
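For reference, a minimal sketch of actually executing the pipeline created above and surfacing failures; the Lambda handler wrapper here is purely illustrative:

# Hypothetical Lambda handler around the Pipeline object built above (illustrative only)
def handler(event, context):
    pipeline.run()                   # execute the dbt source -> datahub-rest sink
    pipeline.raise_from_status()     # raise if the source or sink reported failures
    pipeline.pretty_print_summary()  # print a summary of records produced/failed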
  • m

    many-glass-61317

    06/29/2022, 1:22 AM
Hi team, just looking at this doc. Is there a way to change the Airflow environment name to DEV or UAT etc. instead of PROD? https://datahubproject.io/docs/docker/airflow/local_airflow
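If it is the Airflow lineage backend described in that doc, a minimal sketch of the airflow.cfg settings; the cluster value is what shows up as the environment, and "dev" here is only an illustrative choice (verify the exact keys against the docs for your version):

[lineage]
backend = datahub_provider.lineage.datahub.DatahubLineageBackend
datahub_kwargs = {
    "datahub_conn_id": "datahub_rest_default",
    "cluster": "dev",
    "capture_ownership_info": true,
    "capture_tags_info": true,
    "graceful_exceptions": true }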
  • b

    brainy-crayon-53549

    06/29/2022, 11:20 AM
I'm using DataHub and Airflow in Docker, but both use port 8080, so I can't run both at the same time. Is there any way to change the port number?
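One common workaround, sketched here under the assumption that Airflow runs from its own docker-compose file with a service named airflow-webserver, is to publish Airflow on a different host port:

# docker-compose.override.yml for the Airflow stack (service name assumed)
services:
  airflow-webserver:
    ports:
      - "8081:8080"   # host port 8081 -> container port 8080, leaving 8080 free for DataHub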
  • t

    tall-fall-45442

    06/29/2022, 5:13 PM
Is it possible to set up an ingestion to run on a schedule by sending requests to GMS? I know this is possible through the UI, but we would like to use the UI for viewing data and not for any management tasks.
  • m

    mysterious-eye-58423

    06/29/2022, 9:13 PM
Hi team, I was looking at the DataHub documentation on timeseries data and it mentions:
"This makes restoring timeseries aspects in a disaster scenario a bit more challenging."
Has a solution been discussed/implemented to recover the search index when the metadata attributes are not persisted in the relational store?
  • l

    lemon-zoo-63387

    06/30/2022, 4:11 AM
Hello everyone. DataHub uses Docker to automatically pull MySQL and store all metadata there. Can I change it to my company's MSSQL DB? Thanks in advance for your help!
  • b

    bitter-oxygen-31974

    06/30/2022, 4:33 AM
Hi team, I am exploring integration of Metabase with DataHub and having some issues while fetching metadata. It would be helpful if someone could share config specs.
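For what it's worth, a minimal recipe sketch for the Metabase source; the URL and credentials are placeholders, and the exact option names should be checked against the Metabase source docs:

source:
  type: metabase
  config:
    connect_uri: http://localhost:3000   # Metabase base URL (placeholder)
    username: metabase_user              # placeholder
    password: metabase_password          # placeholder
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080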
  • b

    bitter-toddler-42943

    06/30/2022, 5:54 AM
Hello, all! How do I exclude specific columns when ingesting data? I am using MSSQL, but some columns contain personal information. I don't want to bring in sample data from the personal-information columns, but I do want sample data from the rest of the columns. Anyone know?
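Not a per-column answer, but one relevant knob as a sketch: include_field_sample_values switches sample values off for every column; whether individual columns can be excluded (e.g. via a profile pattern) is an assumption to verify against the MSSQL source docs:

source:
  type: mssql
  config:
    # ... connection options ...
    profiling:
      enabled: true
      include_field_sample_values: false   # suppress sample values for all columns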
  • b

    brainy-crayon-53549

    06/30/2022, 11:27 AM
Does anyone have similar steps to follow in Windows cmd for the Airflow integration? datahubproject.io/docs/docker/airflow/local_airflow/
  • m

    mammoth-honey-57770

    06/30/2022, 12:03 PM
Hi, I am trying to ingest data from BigQuery. I deployed DataHub with Helm. My ingestion always fails when it's almost done, so I tried locally on my computer and it worked. The error seems to indicate it was due to a property 'debug', which is only called by the logger in the code. Any tips?
  • b

    billions-twilight-48559

    06/30/2022, 12:06 PM
Hi there. We are crawling tables from Hive, but the column descriptions are not ingested. Is this not supported, or are we missing a parameter? We are on v0.8.36.
  • q

    quick-megabyte-61846

    06/30/2022, 12:26 PM
Hello there,
Our case: periodic tasks run on Airflow with dbt. I was thinking it should be possible to ingest only specific artifacts; in my example I would like to ingest only
run_results.json
, i.e. the test/assertion data, into DataHub. That way I don't re-ingest data that is already in DataHub, only the assertions that are needed.
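A sketch of what that might look like using the dbt source's test_results_path option, assuming the installed version supports it; the paths and target platform are placeholders, and whether the non-assertion metadata can be skipped entirely is something to verify:

source:
  type: dbt
  config:
    manifest_path: /path/to/manifest.json
    catalog_path: /path/to/catalog.json
    test_results_path: /path/to/run_results.json   # test/assertion results
    target_platform: snowflake                      # placeholder platform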
  • e

    elegant-salesmen-99143

    06/30/2022, 1:23 PM
Is it possible to manually turn off some of the column stats (i.e. min, max, median, null count, etc.) in the Profiling tab to lessen the load of ingestion runs?
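As a sketch, the profiling section of a recipe exposes per-statistic toggles along these lines (the source type and option names here are assumptions to double-check against the profiling docs for your version):

source:
  type: snowflake   # placeholder source
  config:
    # ... connection options ...
    profiling:
      enabled: true
      include_field_min_value: false
      include_field_max_value: false
      include_field_median_value: false
      include_field_null_count: false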
  • c

    colossal-easter-99672

    06/30/2022, 1:29 PM
Hello, team. Is there any way to solve this case for a custom ETL source? For example, I have a DataFlow with a DataJob which inserts data from 10 datasets into another 10 datasets:
    insert into a select * from k;
    insert into b select * from l;
    insert into c select * from m;
    insert into d select * from n;
    insert into e select * from o;
    insert into f select * from p;
    insert into g select * from q;
    insert into h select * from r;
insert into i select * from s;
    insert into j select * from t;
If I set this DataJob's outputs = a,b,c,d,e,f,g,h,i,j and inputs = k,l,m,n,o,p,q,r,s,t, I get mixed (and wrong) lineage for the datasets. Right now I generate 10 fake DataJobs (outputs = a / inputs = k; outputs = b / inputs = l; etc.) to solve this. Is there a better solution?
  • d

    delightful-barista-90363

    06/30/2022, 8:46 PM
Hello, wondering if there exist any ingestion integrations for h5 files (or if a feature request for one already exists)?
  • b

    blue-beach-27940

    07/01/2022, 2:28 AM
Are there any examples? I can't find any in the docs.
  • b

    blue-beach-27940

    07/01/2022, 3:45 AM
    example https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/recipes/bigquery_to_datahub.dhub.yaml
  • b

    blue-beach-27940

    07/01/2022, 3:45 AM
    👍
  • g

    gray-architect-29447

    07/01/2022, 5:38 AM
Hi, my ingestion works fine without any error. However, I cannot see it in the DataHub web interface. Have you ever seen this? I just run "datahub ingest -c ingestion.yaml" and it shows the pipeline finished successfully.
  • l

    late-bear-87552

    07/01/2022, 7:54 AM
Hi team, I tried running a bigquery-usage recipe through the UI and am getting the error below. Can anyone help me with this?
    'typing-extensions-4.2.0 typing-inspect-0.7.1 tzdata-2022.1 tzlocal-4.2 urllib3-1.26.9 websocket-client-1.3.3 wrapt-1.14.1\n'
               '[2022-07-01 07:34:29,599] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.38.4\n'
               '[2022-07-01 07:34:30,173] INFO     {datahub.cli.ingest_cli:115} - Starting metadata ingestion\n'
               '[2022-07-01 07:44:41,592] INFO     {datahub.ingestion.source.usage.bigquery_usage:975} - Starting log load from GCP Logging\n'
               '/usr/local/bin/run_ingest.sh: line 26: 20382 Killed                  ( python3 -m datahub ingest -c "$4/$1.yml" )\n',
               "2022-07-01 07:49:20.890195 [exec_id=cceb1680-1556-4712-bc57-0b1d631836a3] INFO: Failed to execute 'datahub ingest'",
               '2022-07-01 07:49:20.890775 [exec_id=cceb1680-1556-4712-bc57-0b1d631836a3] INFO: Caught exception EXECUTING '
               'task_id=cceb1680-1556-4712-bc57-0b1d631836a3, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
  • a

    astonishing-dusk-99990

    07/01/2022, 7:56 AM
Hi Team, recently I installed and ran DataHub using
datahub docker quickstart --quickstart-compose-file=docker-compose.quickstart.yml
but when I ran the ingestion on Postgres it always showed N/A, and when I checked the datahub-actions container it had an error. Anyone know why?
  • h

    hallowed-dog-79615

    07/01/2022, 8:41 AM
Greetings Team! Today I come with questions about Stats. Not usage stats, but data profiling. We are trying to implement DataHub in our company, and we find that the current profiling is very slow for our purposes (this is properly warned about in the documentation 🙂), and we are wondering a couple of things:
• What metrics are included in the turn_off_expensive_profiling_metrics parameter?
• Can we find an extensive list of parameters for the profiling recipe? I think it's absent from the docs. We found some YouTube guides, but we are not sure whether the parameters covered there are all the available ones.
• Can the columns included in the profiling process be selected by name? I know there is an option to limit the number of columns included, but I suppose this works by taking the first n columns found in the table.
• Does DataHub support external profiling? We have implemented different quality assertions through Great Expectations or dbt test, but I couldn't find references about ingesting another tool's profiling info to be shown in Stats, so we assume this is not an option right now.
In our setup, profiling the full tables each time we ingest is quite expensive. We thought a plausible solution would be to profile only the increments of the tables in each iteration of our pipelines, but at the same time keep these partial profiles under the full table's entity in DataHub. That way, the dataset list in DataHub would only list each table once, but within each one we would have the list of incremental profilings (similar to Validations, where you have a timelined list). We think this is not achievable right now, is it? Is there any option to achieve a reduced or preliminary version of this in current DataHub? Thanks for your time and for your wonderful tool!! Dani
  • p

    proud-baker-56489

    07/01/2022, 11:56 AM
    hi
  • p

    proud-baker-56489

    07/01/2022, 11:58 AM
When I want to see the original Airflow address from DataHub, I press the "View in Airflow" button as shown above, but the URL turns into http://localhost:8080/taskinstance/list/?flt1_dag_id_equals=auto_airflow_test3&_flt_3_task_id=task_b. How can I change the default localhost to a specific IP like 192.168.1.2?
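One thing worth checking, as a sketch: Airflow builds its own links from the [webserver] base_url setting in airflow.cfg, so pointing that at the desired host may fix the generated URL (the IP and port here are illustrative):

[webserver]
base_url = http://192.168.1.2:8080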
  • p

    proud-baker-56489

    07/04/2022, 3:48 AM
Hello, does anyone know whether DataHub supports the Hudi data format for ingestion? I found it on the roadmap here, but don't know the current status: https://datahubproject.io/docs/roadmap/#data-lake-ecosystem-integration