# ingestion
  • c

    clean-tomato-22549

    09/26/2022, 9:46 AM
    Besides, according to the docs, Account Usage system tables are needed to ingest table lineage. Why? The database INFORMATION_SCHEMA should be enough, right? https://datahubproject.io/docs/generated/ingestion/sources/snowflake/#prerequisites-1
    Copy code
    If you plan to enable extraction of table lineage, via the include_table_lineage config flag or extraction of usage statistics, via the include_usage_stats config, you'll also need to grant access to the Account Usage system tables, using which the DataHub source extracts information. This can be done by granting access to the snowflake database.
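    For reference, a minimal sketch of a Snowflake recipe with the two flags quoted above enabled (the flag names come from the doc excerpt; all other values are placeholders, and the grant on the snowflake database from the linked prerequisites is still required):
    Copy code
    source:
      type: snowflake
      config:
        account_id: my_account            # placeholder
        username: datahub_user            # placeholder
        password: '${SNOWFLAKE_PASSWORD}' # placeholder secret
        role: datahub_role                # placeholder
        warehouse: COMPUTE_WH             # placeholder
        include_table_lineage: true       # needs the Account Usage tables
        include_usage_stats: true         # needs the Account Usage tables
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder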
  • f

    fancy-alligator-33404

    09/26/2022, 10:54 AM
    hello! I have a problem with Hive ingestion. There are tables in the Hive DB, but when I ingest them, I cannot see the tables in DataHub. Instead, if I set the 'include_views: true' option, the table information is brought into DataHub, but as views... I am attaching a screenshot of the ingestion I did. I would be very grateful if you could tell me the solution!
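    For context, a hedged sketch of the recipe shape being discussed, with placeholder connection values (include_tables and include_views are the standard options of SQLAlchemy-based sources such as Hive that control what gets extracted):
    Copy code
    source:
      type: hive
      config:
        host_port: hive-server:10000   # placeholder
        database: my_db                # placeholder
        username: user                 # placeholder
        password: '${HIVE_PASSWORD}'   # placeholder secret
        include_tables: true           # extract tables
        include_views: true            # extract views
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder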
  • f

    few-carpenter-93837

    09/26/2022, 7:02 AM
    Hey, does anyone have any information regarding what the Tableau DataHub ingestion includes under "Tags"? Is it only attribute-level tags that are brought over, or should it also include the workbook-level tags?
  • c

    careful-action-61962

    09/26/2022, 11:14 AM
    Hey folks, I'm trying to ingest Tableau metadata into DataHub. I'm getting the following error repeatedly, and I'm not able to find the same asset in Tableau when I search for it there. Would really appreciate your help. Thanks
    exec-urn_li_dataHubExecutionRequest_04e93dc4-f2b5-4014-ba02-47bcd06e06f7.log
  • f

    fierce-baker-1392

    09/26/2022, 12:21 PM
    Hey folks, I downloaded the DataHub source code, but lots of Python libraries do not install automatically when I compile the metadata-ingestion module. I am not very familiar with Python; how can I solve this problem? Thanks~
  • l

    lemon-engine-23512

    09/26/2022, 1:11 PM
    Hello, I am unable to establish a connection between MWAA and DataHub REST for a scheduled DAG. I get a connection timeout from the DAG. I even tested by opening all ports and IP addresses, but I still get the same error. Am I missing anything? Thank you
  • p

    proud-table-38689

    09/26/2022, 3:58 PM
    We have a database that's exposed via HTTPS. I believe that DataHub uses SQLAlchemy under the hood; is there a way we can add these parameters to our ingestion script?
    Copy code
    from sqlalchemy import create_engine

    # ca_path points at the CA certificate bundle used for the TLS connection
    ssl_args = {'ssl_ca': ca_path}
    engine = create_engine("mysql+mysqlconnector://<user>:<pass>@<addr>/<schema>",
                           connect_args=ssl_args)
  • p

    proud-table-38689

    09/26/2022, 3:58 PM
    see the connect_args=ssl_args one
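    For SQLAlchemy-based sources, a hedged sketch of how such arguments can be passed through the recipe instead of a script, via the options block that is forwarded to create_engine (connection values and the CA path are placeholders):
    Copy code
    source:
      type: mysql
      config:
        host_port: my-db-host:3306   # placeholder
        database: my_schema          # placeholder
        username: user               # placeholder
        password: '${DB_PASSWORD}'   # placeholder secret
        options:
          connect_args:
            ssl_ca: /path/to/ca.pem  # placeholder CA bundle path
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder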
  • c

    creamy-tent-10151

    09/26/2022, 5:18 PM
    Hi all, the documentation for Athena ingestion mentions that you can enable table-level lineage through configuration. How do I enable lineage? Thank you.
  • b

    bland-balloon-48379

    09/26/2022, 7:14 PM
    Hey team, I wanted to start a discussion about how datasets get represented over time. In particular I'm thinking of a situation where a dataset is created, ingested, edited in DataHub, and deleted at some point. Then some time in the future, a new dataset with the same name and in the same database/schema is created. The same URN would be generated in both cases, so internally they would be treated as the same dataset. And if the original dataset was soft-deleted, then the subsequent one would inherit that custom documentation even if it doesn't apply to the new case. Should these be considered two separate datasets? Would the second one be considered a new version of the original? If the original dataset had its status set to removed when it was deleted from the database, would there be some indication that it had been removed and then reingested? I think there is a lot to consider here. I'm interested to hear the perspective of DataHub and the community on this scenario, both from a stateful and stateless ingestion standpoint, and whether there is a particular direction DataHub has in mind. Thanks!
  • c

    clean-tomato-22549

    09/27/2022, 3:06 AM
    I am trying to ingest presto-on-hive, but got the following error; could anyone help check it? Thanks
    Copy code
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure source (presto-on-hive)
    [2022-09-26 05:58:34,312] ERROR    {datahub.entrypoints:195} - Command failed: 
    	Failed to configure source (presto-on-hive) due to 
    		'TSocket read 0 bytes'.
    	Run with --debug to get full stacktrace.
  • m

    melodic-beach-18239

    09/27/2022, 8:00 AM
    Hello! I have ingested metadata from MySQL and Postgres. Could I create lineage between these two datasets?
  • m

    microscopic-mechanic-13766

    09/27/2022, 10:29 AM
    Hi, I am planning on making a custom connector to enable HDFS ingestion. Could someone walk me through what would need to be created? Just a file like the ones that can be found under /datahub/metadata-ingestion/src/datahub/ingestion/source inside the project?
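    If it helps, a hedged sketch of how a custom source module can be wired into a recipe once written: the source type can usually point at the fully qualified path of a custom Source implementation (check the custom ingestion source guide; all names and config keys below are hypothetical):
    Copy code
    source:
      # hypothetical module path and class name of a custom Source subclass
      type: my_hdfs_package.hdfs_source.HdfsSource
      config:
        path: hdfs://namenode:8020/data   # hypothetical config key of the custom source
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder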
  • b

    bumpy-journalist-41369

    09/27/2022, 11:15 AM
    Hello. I have a problem when ingesting data from Glue. Even though I can see that I have ingested data, the run fails, and these are the exceptions that I can see in the log:
    Copy code
    " 'warnings': {'<s3://aws-glue-scripts-063693278873-us-east-1/NilayDev/CompressionS3.py>': ['Error parsing DAG for Glue job. The script '\n"
               '                                                                                         '
               "'<s3://aws-glue-scripts-063693278873-us-east-1/NilayDev/CompressionS3.py> '\n"
               "                                                                                         'cannot be processed by Glue (this usually "
               "occurs when it '\n"
               "                                                                                         'has been user-modified): An error occurred '\n"
               "                                                                                         '(InvalidInputException) when calling the "
               "GetDataflowGraph '\n"
               "                                                                                         'operation: line 19:4 no viable alternative at "
               "input '\n"
               '                                                                                         "\'e3g:))\'e:)o.)) #\'"],\n'
               "              '<s3://cdc-analytics-dev-us-east-1-alert-classification-glue/ttp_window_features.py>': ['Error parsing DAG for Glue job. The "
               "script '\n"
               '                                                                                                    '
               "'<s3://cdc-analytics-dev-us-east-1-alert-classification-glue/ttp_window_features.py> '\n"
               "                                                                                                    'cannot be processed by Glue (this "
               "usually '\n"
               "                                                                                                    'occurs when it has been "
               "user-modified): An '\n"
               "                                                                                                    'error occurred "
               "(InvalidInputException) when '\n"
               "                                                                                                    'calling the GetDataflowGraph "
               "operation: line '\n"
               "                                                                                                    '337:12 no viable alternative at "
               "input '\n"
               '
    as well as this :
    Copy code
    exception=NoSuchKey('An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.')
    The recipe I am using is:
    Copy code
    sink:
      type: datahub-rest
      config:
        server: 'http://datahub-datahub-gms:8080'
    source:
      type: glue
      config:
        aws_region: us-east-1
        env: DEV
        database_pattern:
          allow:
            - cdca
    Does anyone know how to fix the problem?
  • f

    future-hair-23690

    09/27/2022, 2:40 PM
    Hi, this might be a pretty basic question, but I'm trying to ingest MSSQL from the UI, and it requires pyodbc. I have a deployment via the official Helm chart. Where/how can I actually install it? I mean, which subchart actually takes care of ingestion?
  • k

    kind-dawn-17532

    09/27/2022, 11:25 AM
    Hi Team, What am I missing here - I see that platform is needed to make a Feature table urn, but only a table name is needed to make a Feature? Are we thinking the feature table name will be unique enough?
  • f

    fast-potato-13714

    09/27/2022, 3:20 PM
    Hi, I'm loading objects from Trino and Hive. Is there a way to unify the datasets, since some of them appear as Trino and others as Hive? Thanks in advance!
  • g

    green-lion-58215

    09/27/2022, 3:43 PM
    Hello all, is there a way to remove a specific tag from all resources in an environment fabric in datahub?
  • w

    wonderful-notebook-20086

    09/27/2022, 7:18 PM
    I'm running DataHub locally from the getting-started Docker container images, based on the Quickstart guide. I tried setting up a connection to our Redshift cluster and ran into this error:
    Copy code
    '2022-09-27 18:30:45.362431 [exec_id=fec3ab48-c33b-4403-abfc-f61720c609ae] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-27 18:47:03.670827 [exec_id=fec3ab48-c33b-4403-abfc-f61720c609ae] INFO: Caught exception EXECUTING '
               'task_id=fec3ab48-c33b-4403-abfc-f61720c609ae, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 182, in execute\n'
               '    await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 126, in '
               '_read_output_lines\n'
               '    full_log_file.write(line)\n'
               'OSError: [Errno 28] No space left on device\n'
               '\n'
               'During handling of the above exception, another exception occurred:\n'
               '\n'
               'OSError: [Errno 28] No space left on device\n'
               '\n'
               'During handling of the above exception, another exception occurred:\n'
               '\n'
               'Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 188, in execute\n'
               '    full_log_file.close()\n'
               'OSError: [Errno 28] No space left on device\n'
    Ingestion recipe yml looks something like this:
    Copy code
    source:
        type: redshift
        config:
            start_time: '2022-09-26 00:00:00Z'
            end_time: '2022-09-26 12:00:00Z'
            table_lineage_mode: mixed
            include_table_lineage: true
            database: insightsetl
            password: '${etl2_test_datahub_creds}'
            profiling:
                enabled: true
            host_port: 'pi-redshift-etl-2-test.ccvpgkqogsrc.us-east-1.redshift.amazonaws.com:8192'
            stateful_ingestion:
                enabled: true
            username: datahub_ingestion
    pipeline_name: 'urn:li:dataHubIngestionSource:93b5640d-8ed3-456e-89f9-0ec3def38733'
    I'm not sure if it's a container issue or config or something else.
  • a

    aloof-leather-92383

    09/27/2022, 9:14 PM
    Hi, I'm doing a Kafka ingestion; is there any way to display the topic name without the cluster ID? All the ingested topics are written as <cluster_id>.<topic-name>. Thank you!
  • p

    proud-table-38689

    09/28/2022, 1:45 AM
    How would I add a package to the ingestion virtual environment? I'm trying to add teradatasqlalchemy. I added it to the Docker image, but I don't see it used in the venv that's used per ingestion.
  • f

    few-sugar-84064

    09/28/2022, 2:36 AM
    Hi, I found that all ingested Glue jobs have a DataFlow URN only and no DataJob URN, so I can't see lineage even for jobs with auto-generated scripts. Does anyone know how to get Glue jobs ingested as DataJobs, not just DataFlows? Below is the YAML I used to ingest the jobs, thanks.
    Copy code
    source:
      type: glue
      config:
        aws_region: "ap-northeast-2"
        extract_transforms: True
        catalog_id: "catalog_id"
    
    sink:
      type: "datahub-rest"
      config:
        server: "gms sever address"
  • c

    chilly-ability-77706

    09/28/2022, 5:33 AM
    Hi, I am looking for a Hive recipe which supports KNOX authentication with a required SSL certificate. Below is the one I tried, but I received the error message "CannotSendHeader = <class 'http.client.CannotSendHeader'>"
  • c

    chilly-ability-77706

    09/28/2022, 5:33 AM
    Copy code
    source:
        type: hive
        config:
            database: <>
            password: <>
            host_port: <>
            stateful_ingestion:
                enabled: true
            username: <>
            options:
                connect_args:
                    auth: NOSASL
                    http_path: "/gateway/default/hive"
                    ssl_cert: "required"
            scheme: hive+https
  • g

    gifted-diamond-19544

    09/28/2022, 8:40 AM
    Hello! What should my IAM permissions be to use the Athena ingestion? Which tables should I be able to run queries on? I don't think this is specified in the documentation (at least I can't seem to find it). Thank you!
  • t

    thankful-ghost-61888

    09/28/2022, 9:02 AM
    Hi all, a question about the LookML ingestion: we have seen recurring errors with the SQL lineage analyzer; it looks like it doesn't recognise some LookML syntax like the ${EXTENDED} keyword (docs here). From the ingestion logs:
    Copy code
    b"2022-09-25 10:05:57,890 ERROR    SQL lineage analyzer error 'An Identifier is expected, got Token[value: EXTENDED] instead.' for query: 'SELECT\n"
    [2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b"      date_trunc('week',purchase_date) as purchase_date,\n"
    [2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b'      user_id,\n'
    [2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b'      buyer_country,\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      count(id) as items_bought_week,\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      sum(GMV) as gmv_bought_week,\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      min(user_order_sequence_number) as user_order_sequence_number_minweek,\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      max(user_order_sequence_number) as user_order_sequence_number_maxweek,\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b"      percent_rank() over (partition by date_trunc('week',purchase_date) order by items_bought_week, sum(GMV)) as rank_items_bought,\n"
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b"      percent_rank() over (partition by date_trunc('week',purchase_date),buyer_country order by items_bought_week, sum(GMV)) as rank_items_bought_country\n"
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'    FROM (EXTENDED)\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'    GROUP BY 1,2,3\n'
    [2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'2022-09-25 10:05:57,890 ERROR    sql holder not present so cannot get tables\n'
    [2022-09-25 10:05:57,894] {{pod_launcher.py:156}} INFO - b'2022-09-25 10:05:57,890 ERROR    sql holder not present so cannot get columns\n'
    The original lookML sql looks like this:
    Copy code
    include: "dt_engine_purchasetransaction.view"
    
    view: dt_buyer_transactions_weekly {
      extends: [dt_engine_purchasetransaction]
      derived_table: {
      sql:
        SELECT
          date_trunc('week',purchase_date) as purchase_date,
          user_id,
          buyer_country,
          count(id) as items_bought_week,
          sum(GMV) as gmv_bought_week,
          min(user_order_sequence_number) as user_order_sequence_number_minweek,
          max(user_order_sequence_number) as user_order_sequence_number_maxweek,
          percent_rank() over (partition by date_trunc('week',purchase_date) order by items_bought_week, sum(GMV)) as rank_items_bought,
          percent_rank() over (partition by date_trunc('week',purchase_date),buyer_country order by items_bought_week, sum(GMV)) as rank_items_bought_country
        FROM (${EXTENDED})
        GROUP BY 1,2,3
        ;;
      }
    We’re also seeing an issue with parsing a new line after the FROM statement:
    Copy code
    b"2022-09-25 10:06:10,069 ERROR    SQL lineage analyzer error 'An Identifier is expected, got Token[value: \n"
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b"] instead.' for query: 'SELECT\n"
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        __d_a_t_e\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , userid\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , query\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , sum(total_searches) AS total_searches\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        FROM\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'            (\n'
    [2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'            (SELECT\n'...
    The original syntax looks like this:
    Copy code
    view: product_interaction_searches {
        derived_table: {
          sql_trigger_value: SELECT max(event_date::date) FROM datalake_processed.etl_tracking_searches;;
          distribution: "date"
          sortkeys: ["date"]
          sql:
           SELECT
            date
            , userid
            , query
            , sum(total_searches) AS total_searches
            FROM
                (
                (SELECT
                  cast(date as date) AS date
                  , userid AS userid
                  , query
                  , count(*) AS total_searches
                FROM datalake_compacted.mixpanel_tracking_search_results_query_action...
    Any thoughts about the above?
  • c

    careful-action-61962

    09/28/2022, 10:12 AM
    Hey Folks, has anyone used service principals in databricks instead of their own user to ingest metadata into datahub?
  • f

    flaky-soccer-57765

    09/28/2022, 11:50 AM
    Hello all, I am trying to build an emitter that will attach predefined glossary terms to a column. However, I have noticed that when creating terms in the UI, the URN has a GUID format, e.g. urn:li:glossaryTerm:bfe84277-037a-4c4c-b650-5cf0a4c2002e. Ideally, I would like the URN to be set up with the name of the node / name of the term, so I can build logic inside the emitter. Can you suggest how to do that, please?
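    One possibly relevant option (a hedged sketch; check the datahub-business-glossary source docs for the exact schema): terms defined through the business glossary YAML file source can get URNs derived from the node/term names rather than UI-generated GUIDs. The node and term names below are hypothetical:
    Copy code
    version: 1
    source: DataHub
    owners:
      users:
        - datahub                  # placeholder owner
    nodes:
      - name: Classification       # hypothetical glossary node
        description: Classification-related terms
        terms:
          - name: Sensitive        # hypothetical glossary term
            description: Data that requires restricted handling
    The file would then be ingested with a recipe whose source type is datahub-business-glossary, pointing at this file.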
  • e

    early-airplane-84388

    09/28/2022, 12:19 PM
    Hi all, after I enabled metadata service authentication in DataHub, my DataHub ingestions couldn't access the secrets created for ingestions. This isn't an issue in my localhost deployment of DataHub using the quickstart. Can you please suggest how to fix this?
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': 'efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc',
     'infos': ['2022-09-28 11:44:07.154056 [exec_id=efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-28 11:44:07.182384 [exec_id=efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc] INFO: Caught exception EXECUTING '
               'task_id=efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 67, in execute\n'
               '    recipe: dict = SubProcessTaskUtil._resolve_recipe(validated_args.recipe, ctx, self.ctx)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_task_common.py", line 84, in _resolve_recipe\n'
               '    raise TaskError(f"Failed to resolve secret with name {match}. Aborting recipe execution.")\n'
               'acryl.executor.execution.task.TaskError: Failed to resolve secret with name Dev_DataHub_PRIVATE_KEY_ID. Aborting recipe execution.\n']}
    Execution finished with errors.
  • a

    ancient-policeman-73437

    09/28/2022, 7:09 PM
    Dear all, I am ingesting metadata from Glue and it starts loading the data, but at some point it shows the following error message:
    Copy code
    '/usr/local/bin/run_ingest.sh: line 26:  4036 Killed                  ( python3 -m datahub ingest -c "$4/$1.yml" )\n',
               "2022-09-28 14:04:23.217735 [exec_id=900d34f5-3632-4ada-a541-aa104a65e6ca] INFO: Failed to execute 'datahub ingest'",
               '2022-09-28 14:04:23.218122 [exec_id=900d34f5-3632-4ada-a541-aa104a65e6ca] INFO: Caught exception EXECUTING '
    What could be the reason? Thank you in advance!