# ingestion
  • s

    shy-lion-56425

    09/21/2022, 8:50 PM
    Any recommendations on setting include and exclude path_specs for s3?
    Copy code
    source:
        type: s3
        config:
            path_specs:
            - include : "<s3://cseo-global-cloudtrail/AWSLogs/057183463473/{table}/{partition[0]}/{partition[1]}/{partition[2]}/{partition[3]}/*_CloudTrail-Digest_*.json.gz>"
            - exclude : "**/AWSLogs/057183463473/CloudTrail-Digest/**"
            aws_config:
                aws_access_key_id: "{aws_key}"
                aws_secret_access_key: "{aws_secret}"
                aws_region: us-east-1
            profiling:
                enabled: false
    Error:
    Copy code
    [2022-09-21 15:47:53,596] ERROR    {datahub.ingestion.run.pipeline:127} - 'include'
    Traceback (most recent call last):
      File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 178, in __init__
        self.source: Source = source_class.create(
      File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/s3/source.py", line 321, in create
        config = DataLakeSourceConfig.parse_obj(config_dict)
      File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
      File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
      File "pydantic/main.py", line 1056, in pydantic.main.validate_model
      File "pydantic/fields.py", line 868, in pydantic.fields.ModelField.validate
      File "pydantic/fields.py", line 901, in pydantic.fields.ModelField._validate_sequence_like
      File "pydantic/fields.py", line 1067, in pydantic.fields.ModelField._validate_singleton
      File "pydantic/fields.py", line 857, in pydantic.fields.ModelField.validate
      File "pydantic/fields.py", line 1074, in pydantic.fields.ModelField._validate_singleton
      File "pydantic/fields.py", line 1121, in pydantic.fields.ModelField._apply_validators
      File "pydantic/class_validators.py", line 313, in pydantic.class_validators._generic_validator_basic.lambda12
      File "pydantic/main.py", line 704, in pydantic.main.BaseModel.validate
      File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
      File "pydantic/main.py", line 1082, in pydantic.main.validate_model
      File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/aws/path_spec.py", line 104, in validate_path_spec
        if "**" in values["include"]:
    KeyError: 'include'
    [2022-09-21 15:47:53,598] INFO     {datahub.cli.ingest_cli:119} - Starting metadata ingestion
    [2022-09-21 15:47:53,598] INFO     {datahub.cli.ingest_cli:137} - Finished metadata ingestion
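    A note on the traceback above: validate_path_spec reads values["include"] for every entry in path_specs, so a list item that contains only exclude fails with this KeyError. A minimal sketch of a recipe under that reading, with exclude nested as a list inside the same entry as include (bucket, account ID, and credentials are the poster's placeholders):
    Copy code
    source:
        type: s3
        config:
            path_specs:
                - include: "s3://cseo-global-cloudtrail/AWSLogs/057183463473/{table}/{partition[0]}/{partition[1]}/{partition[2]}/{partition[3]}/*_CloudTrail-Digest_*.json.gz"
                  exclude:
                      - "**/AWSLogs/057183463473/CloudTrail-Digest/**"
            aws_config:
                aws_access_key_id: "{aws_key}"
                aws_secret_access_key: "{aws_secret}"
                aws_region: us-east-1
            profiling:
                enabled: false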
  • f

    few-sugar-84064

    09/22/2022, 3:17 AM
    Hi, has anyone ingested lineage for a Glue job (Redshift table to Redshift table ETL job) manually? If yes, sample code would be really helpful for me. Thanks. I've tried the below: • Glue Annotation - got a parsing error
    Copy code
    Error parsing DAG for Glue job. The script <s3://steadio-glue-info/scripts/test-datahub-lineage.py> cannot be processed by Glue (this usually occurs when it has been user-modified): An error occurred (InvalidInputException) when calling the GetDataflowGraph operation: line 11:87 no viable alternative at input \'## @type: DataSource\\n## @args: [catalog_connection = "redshiftconnection", connection_options = {"database" =\'']}
    • Dataset job code - I have no idea what I need to put for the job id and flow id
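    For the manual route, a minimal sketch using the Python REST emitter that models this as a direct Redshift-to-Redshift dataset edge instead of going through the Glue job (table names and GMS address are placeholders):
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    # Upstream and downstream dataset URNs (placeholder table names).
    upstream = builder.make_dataset_urn("redshift", "dev.public.source_table")
    downstream = builder.make_dataset_urn("redshift", "dev.public.target_table")

    # Build a lineage MCE declaring that downstream is derived from upstream.
    lineage_mce = builder.make_lineage_mce([upstream], downstream)

    # Emit to GMS (placeholder address).
    emitter = DatahubRestEmitter("http://localhost:8080")
    emitter.emit_mce(lineage_mce)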
  • c

    cool-vr-73109

    09/22/2022, 8:06 AM
    Hi Team, for the scenario below can you help me ingest data lineage into DataHub: Oracle -> S3 -> AWS Glue -> Redshift. I will be ingesting metadata from AWS Glue. My question is whether DataHub will detect this whole lineage from source to target, or whether we should add it manually via a YAML lineage file. Please help?
  • k

    kind-scientist-44426

    09/22/2022, 9:33 AM
    Hi all, I'm trying to ingest the LDAP users from an LDAP server, for which I'm using the recipe below:
    Copy code
    source:
      type: "ldap"
      config:
        ldap_server: <server>
        ldap_user: "cn=<user_name>,dc=example,dc=org"
        ldap_password: "<password>"
        base_dn: "dc=example,dc=org"
    but when running this recipe from the UI I'm getting the errors below:
    Copy code
    ERROR    {datahub.ingestion.run.pipeline:127} - LDAP connection failed\n'
    
    "AttributeError: 'Pipeline' object has no attribute 'source'\n"
               "[2022-09-22 07:19:56,128] ERROR    {datahub.entrypoints:188} - Command failed with 'Pipeline' object has no attribute 'source'
    Can someone suggest the reason?
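    One way to check whether the bind itself works outside DataHub is with python-ldap (the library the ldap plugin builds on); a quick sketch with placeholder values, assuming ldap_server should be a full URI such as ldap://host:389:
    Copy code
    import ldap

    # Placeholder server and credentials.
    conn = ldap.initialize("ldap://<server>:389")
    conn.simple_bind_s("cn=<user_name>,dc=example,dc=org", "<password>")

    # Simple search to confirm the base DN is readable with these credentials.
    results = conn.search_s("dc=example,dc=org", ldap.SCOPE_SUBTREE, "(objectClass=*)")
    print(f"found {len(results)} entries")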
  • m

    mammoth-air-95743

    09/22/2022, 10:37 AM
    Hi everyone! I am ingesting from an S3 bucket containing JSON files, and in the ingestion task's log I get a message that it's extracting the table schema, but nothing actually appears there; it doesn't infer the schema. Here's the logger output:
    Copy code
    '[2022-09-20 09:41:44,078] INFO     {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: '
               '<s3://path/to/file.json>\n'
    '[2022-09-20 09:41:44,078] INFO     {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: '
               'path/to/file.json\n'
  • m

    mammoth-air-95743

    09/22/2022, 10:40 AM
    My second question relates to ingestion breaking for some Mongo collections and JSON files. The Mongo collection causes I found were column values containing some sort of encoded HTML or JSON, plus one case of a really big schema. For the few JSONs that failed I imagine it's a similar case, but I haven't debugged it properly yet. My main issue is that there's no useful output anywhere to see why it failed. Is there anywhere I can look, such as some pod's logs? Alternatively, can I add a deny list to the ingestion recipe so it skips some collections/files?
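    On the skip-list question, a sketch of how that might look for the mongodb source, assuming its collection_pattern deny list matches database.collection names (connection URI and names are placeholders):
    Copy code
    source:
        type: mongodb
        config:
            connect_uri: "mongodb://localhost:27017"
            collection_pattern:
                deny:
                    - "mydb.huge_schema_collection"
                    - "mydb.encoded_html_collection"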
  • c

    careful-action-61962

    09/22/2022, 11:32 AM
    Hey folks, I'm new to DataHub and want to ingest Tableau metadata. Is there a way to ingest all the projects at once instead of adding their names manually?
  • s

    some-printer-33912

    09/22/2022, 3:03 PM
    Hi! I am trying to get metadata from MSSQL and I have this error (local machine, Windows, SQL Server running locally, Docker):
    Copy code
    "[2022-09-22 14:54:17,208] ERROR    {datahub.entrypoints:188} - Command failed with HTTPConnectionPool(host='localhost', port=8080): Max "
               "retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1859651c90>: Failed "
               "to establish a new connection: [Errno 111] Connection refused')). Run with --debug to get full trace\n"
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '2547645d-cd36-4752-b64d-d5bf7552b7c6',
     'infos': ['2022-09-22 14:54:03.509045 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-22 14:54:17.675643 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: stdout=venv setup time = 0\n'
               'This version of datahub supports report-to functionality\n'
               'datahub  ingest run -c /tmp/datahub/ingest/2547645d-cd36-4752-b64d-d5bf7552b7c6/recipe.yml --report-to '
               '/tmp/datahub/ingest/2547645d-cd36-4752-b64d-d5bf7552b7c6/ingestion_report.json\n'
               '[2022-09-22 14:54:04,862] INFO     {datahub.cli.ingest_cli:170} - DataHub CLI version: 0.8.42\n'
               "[2022-09-22 14:54:17,208] ERROR    {datahub.entrypoints:188} - Command failed with HTTPConnectionPool(host='localhost', port=8080): Max "
               "retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1859651c90>: Failed "
               "to establish a new connection: [Errno 111] Connection refused')). Run with --debug to get full trace\n"
               '[2022-09-22 14:54:17,208] INFO     {datahub.entrypoints:191} - DataHub CLI version: 0.8.42 at '
               '/tmp/datahub/ingest/venv-mssql-0.8.42/lib/python3.10/site-packages/datahub/__init__.py\n',
               "2022-09-22 14:54:17.677042 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: Failed to execute 'datahub ingest'",
               '2022-09-22 14:54:17.677216 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: Caught exception EXECUTING '
               'task_id=2547645d-cd36-4752-b64d-d5bf7552b7c6, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 168, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
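    A common cause of this particular error when running ingestion from the UI on Docker is that the recipe's sink points at localhost:8080, which inside the actions container does not resolve to GMS. A hedged sketch of a sink pointing at the GMS container instead (the service name may differ in your compose setup):
    Copy code
    sink:
        type: datahub-rest
        config:
            server: "http://datahub-gms:8080"   # container/service name rather than localhost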
  • l

    limited-forest-73733

    09/22/2022, 3:16 PM
    Hey team! What's the plan for the next DataHub release, i.e. 0.8.45?
  • l

    limited-forest-73733

    09/22/2022, 3:16 PM
    Any tentative date?
  • t

    thankful-morning-85093

    09/22/2022, 4:33 PM
    Hi Team, can we use Trino/Presto to push data for the Hive platform into DataHub? Profiling in Hive is super slow for us.
  • b

    bland-balloon-48379

    09/22/2022, 7:12 PM
    Hey everyone, I got a quick question. Is there a process or timeframe by which soft-deleted entities get automatically hard-deleted? I'm interested in hiding some datasets from the UI, but want to make sure that historical data will always be retained. Thanks!
  • c

    careful-engine-38533

    09/23/2022, 4:19 AM
    Hi, my MongoDB ingestion fails with the following error message - any help?
    Copy code
    '/usr/local/bin/run_ingest.sh: line 40:    79 Killed                  ( datahub ingest run -c "${recipe_file}" ${report_option} )\n',
               "2022-09-22 06:29:49.739560 [exec_id=29430983-bfd2-4551-b153-c869537f5fe5] INFO: Failed to execute 'datahub ingest'",
               '2022-09-22 06:29:49.739831 [exec_id=29430983-bfd2-4551-b153-c869537f5fe5] INFO: Caught exception EXECUTING '
               'task_id=29430983-bfd2-4551-b153-c869537f5fe5, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in
  • c

    cuddly-arm-8412

    09/23/2022, 2:18 AM
    Hi team, I tried lineage_job_dataflow_new_api.py and the data is ingested successfully.
    Copy code
    jobFlow = DataFlow(cluster="prod", orchestrator="airflow", id="flow_new_api")
    jobFlow.emit(emitter)
    
    dataJob = DataJob(flow_urn=jobFlow.urn, id="flow_new_api_job1")
    dataJob.emit(emitter)
    But in the UI I found that they were not related. How do I troubleshoot this?
  • n

    narrow-toothbrush-13209

    09/23/2022, 7:16 AM
    Hi, can the DataHub Provider for Airflow handle connection errors? Tasks are failing if the connection to DataHub is not established: datahub_provider.lineage.datahub.DatahubLineageBackend
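    If the goal is for tasks not to fail when DataHub is unreachable, the lineage backend's graceful_exceptions option appears intended for exactly that; an airflow.cfg sketch with a placeholder connection id (treat the option name as an assumption and verify it against your provider version):
    Copy code
    [lineage]
    backend = datahub_provider.lineage.datahub.DatahubLineageBackend
    datahub_kwargs = {"datahub_conn_id": "datahub_rest_default", "graceful_exceptions": true}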
  • b

    boundless-student-48844

    09/23/2022, 8:14 AM
    Hi team, can I check if there's a code repo for the acryl-executor pip package?
  • b

    brave-tomato-16287

    09/23/2022, 11:13 AM
    Hello all. We faced a Tableau ingestion error when trying to ingest a Google Sheet:
    Copy code
    'urn:li:dataset:(urn:li:dataPlatform:tableau,d1ad8766-18c8-6938-770e-42929141371c,PROD)\\n Cause: ERROR :: '
                          '/upstreams/0/dataset :: \\"Provided urn '
                          "urn:li:dataset:(urn:li:dataPlatform:google-sheets,temp_0ufiu670cqle3e165n9eh12vw5vo.'am, bad debt users, to make "
                          'bal$\',PROD)\\" is invalid: Failed to convert urn to entity key: urns parts and key fields do not have same length\\n", '
                          '"message": "Invalid urn format for aspect: {upstreams=[{type=TRANSFORMED, auditStamp={actor=urn:li:corpuser:unknown, time=0}, '
                          'dataset=urn:li:dataset:(urn:li:dataPlatform:google-sheets,temp_0ufiu670cqle3e165n9eh12", "status": 400, "id": '
                          '"urn:li:dataset:(urn:li:dataPlatform:tableau,d1ad8766-18c8-6938-770e-42929141371c,PROD)"}}], "failures": [{"error": "Unable '
                          'to emit metadata to DataHub GMS", "info": {"exceptionClass": "com.linkedin.restli.server.RestLiServiceException",
  • g

    glamorous-wire-83850

    09/23/2022, 11:47 AM
    Hello, I am currently trying to ingest a multi-project GCP setup, but it only ingests "second_project", which is set as storage_project_id. I want to ingest all projects in GCP. What should I do?
    Copy code
    source:
        type: bigquery
        config:
            project_id: service_acc_project
            storage_project_id: second_project
            credential:
                project_id: service_acc_project
                private_key_id: '${BQ_PRIVATE_KEY_ID2}'
                client_email: abc-abc@service.iam.gserviceaccount.com
                private_key: '${BQ_PRIVATE_KEY2}'
                client_id: '11111111'
            include_tables: true
            include_views: true
            include_table_lineage: true
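    Since project_id / storage_project_id in this recipe each name a single project, one workaround (an assumption rather than a confirmed limitation of this source version) is to run one copy of the recipe per GCP project with the same service-account credential:
    Copy code
    source:
        type: bigquery
        config:
            project_id: third_project          # repeat the recipe once per project to ingest
            credential:
                project_id: service_acc_project
                private_key_id: '${BQ_PRIVATE_KEY_ID2}'
                client_email: abc-abc@service.iam.gserviceaccount.com
                private_key: '${BQ_PRIVATE_KEY2}'
                client_id: '11111111'
            include_tables: true
            include_views: true
            include_table_lineage: true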
  • l

    lemon-engine-23512

    09/23/2022, 11:48 AM
    Hi team, I am trying to schedule ingestion with Airflow (managed Apache Airflow on AWS, MWAA). Here we upload the DAG and YAML files to an S3 location, but when I run the schedule in Airflow I get the error datahub.configuration.common.ConfigurationError: cannot open config file <s3 path to yaml>. Any way to resolve this? Thank you
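    One workaround, since the worker cannot open an S3 URI as a local config file, is to download the recipe in the task and hand the parsed dict to the SDK; a sketch with a placeholder bucket and key:
    Copy code
    import boto3
    import yaml

    from datahub.ingestion.run.pipeline import Pipeline


    def run_recipe_from_s3(bucket: str, key: str) -> None:
        # Fetch the recipe YAML from S3 (placeholder bucket/key supplied by the caller).
        body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
        config = yaml.safe_load(body)

        # Run the ingestion pipeline from the parsed config dict.
        pipeline = Pipeline.create(config)
        pipeline.run()
        pipeline.raise_from_status()


    # Example: call from a PythonOperator callable.
    # run_recipe_from_s3("my-bucket", "recipes/mysql.yml")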
  • a

    adamant-rain-51672

    09/23/2022, 12:32 PM
    Hey, I upgraded to 0.8.44, however I'm seeing this ingestion error for all ingestions:
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': 'e7e241c2-dcbf-43c9-9363-0eb77c8a1fad',
     'infos': ['2022-09-23 12:30:07.735004 [exec_id=e7e241c2-dcbf-43c9-9363-0eb77c8a1fad] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-23 12:30:07.735648 [exec_id=e7e241c2-dcbf-43c9-9363-0eb77c8a1fad] INFO: Caught exception EXECUTING '
               'task_id=e7e241c2-dcbf-43c9-9363-0eb77c8a1fad, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 71, in execute\n'
               '    validated_args = SubProcessIngestionTaskArgs.parse_obj(args)\n'
               '  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj\n'
               '  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__\n'
               'pydantic.error_wrappers.ValidationError: 1 validation error for SubProcessIngestionTaskArgs\n'
               'debug_mode\n'
               '  extra fields not permitted (type=value_error.extra)\n']}
    Execution finished with errors.
    Do you know what might be causing this?
  • f

    future-smartphone-53257

    09/23/2022, 12:57 PM
    Hi, I'm trying to ingest glossary terms from a file, but I can't figure out how to indicate the Contains and Inherits relationships that I can set in the UI. Is there some way I can export the metadata from DataHub to MXE/MCE format?
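    A sketch of how the business glossary YAML can express these relationships, assuming the file format supports inherits and contains keys on a term (names here are invented):
    Copy code
    version: 1
    source: DataHub
    owners:
        users:
            - datahub
    nodes:
        - name: Classification
          terms:
              - name: Sensitive
                description: Sensitive data
              - name: Email
                description: Email address field
              - name: PII
                description: Personally identifiable information
                inherits:
                    - Classification.Sensitive
                contains:
                    - Classification.Email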
  • b

    bumpy-whale-50799

    09/23/2022, 1:12 PM
    Does Metadata lineage work if a Stored Procedure creates a temp table before populating the output table?
  • g

    gray-cpu-75769

    09/22/2022, 12:40 PM
    Hi all, I'm trying to enable data profiling for a Google BigQuery table but am getting the following error; the recipe is below. Does anyone have any idea about it?
    Copy code
    source:
        type: bigquery
        config:
            credential:
                private_key_id: '${private_key_id}'
                project_id: '${project_id}'
                client_email: '${Client_email}'
                private_key: '${private_key}'
                client_id: '${client_id}'
            profiling:
                enabled: true
            project_id: '${project_id}'
            table_pattern:
                allow:
                    - daas-prod-251711.cdo.online_merchant
            profile_pattern:
                allow:
                    - daas-prod-251711.cdo.online_merchant
    pipeline_name: 'urn:li:dataHubIngestionSource:60e4b0c9-dc16-4138-8fa7-d0c881af095a'
  • c

    chilly-potato-57465

    09/23/2022, 1:33 PM
    Hello! In our case we have huge datasets stored in regular file systems (images) and HDFS. As far as I could see from previous questions and the source documentation, there are no plugins to ingest metadata from regular file systems (attributes such as created/modified/ownership/size/access rights/etc. and folder structure) or from HDFS. Is this still so? Additionally, I wonder how to ingest metadata (column names) from CSV files. I see that this is possible from the S3 source; is it also possible from regular file systems? Thank you!!
  • f

    fresh-nest-42426

    09/23/2022, 9:16 PM
    Hi all and happy Friday! We are doing a POC with DataHub, specifically ingesting table lineage from Redshift, and realized that it also automatically ingests upstream S3 COPY lineage, which is very interesting and useful. However, we have many event-level tables that get many small batches of S3 files loaded via Kinesis Firehose, so the S3 lineage looks extremely verbose and even seems to completely break ingestion sometimes:
    Copy code
    '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 96, in _read_output_lines\n'
               '    line_bytes = await ingest_process.stdout.readline()\n'
               '  File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline\n'
               '    raise ValueError(e.args[0])\n'
               'ValueError: Separator is found, but chunk is longer than limit\n']}
    I see a related thread here: https://datahubspace.slack.com/archives/CUMUWQU66/p1663143783318239. Is there a way to exclude the upstream S3 lineage collection for certain Redshift tables? So far I've had to exclude such tables with extensive S3 upstreams, otherwise ingestion doesn't work. I'm using v0.8.44 and datahub-actions 0.0.7. Thanks!
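    I'm not aware of a per-table switch, but if dropping COPY-based lineage entirely is acceptable, the redshift source has an include_copy_lineage flag (worth verifying against your version's docs); a sketch:
    Copy code
    source:
        type: redshift
        config:
            # ... connection settings ...
            include_table_lineage: true
            include_copy_lineage: false   # assumption: disables s3 COPY upstream lineage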
  • g

    green-lion-58215

    09/23/2022, 10:47 PM
    Quick question on dbt ingestion. I am ingesting dbt test results with run_results.json, and I can also see the test cases in the DataHub UI, but it does not show any passes/failures for the test cases; they all come up as "no evaluations found". For context, I am using DataHub 0.8.41 and dbt version 0.21.1. Any help is appreciated.
  • l

    little-spring-72943

    09/24/2022, 9:55 PM
    We are trying to build an Assertion from our DQ toolset (the Great Expectations equivalent is expect_column_sum_to_be_between) and have set the following values:
    Copy code
    scope = DatasetAssertionScope.DATASET_COLUMN
    operator = AssertionStdOperator.BETWEEN
    aggregation = AssertionStdAggregation.SUM
    The UI shows "Column Amount values are between 0 and 1" - the Sum aggregation is missing. How can we fix this or provide custom text here?
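    For reference, a sketch of emitting those fields with the Python SDK's schema classes, under the assumption that the UI text is rendered from the emitted aggregation/operator/parameters (urns, ids, and values are placeholders):
    Copy code
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AssertionInfoClass,
        AssertionStdAggregationClass,
        AssertionStdOperatorClass,
        AssertionStdParameterClass,
        AssertionStdParametersClass,
        AssertionStdParameterTypeClass,
        AssertionTypeClass,
        ChangeTypeClass,
        DatasetAssertionInfoClass,
        DatasetAssertionScopeClass,
    )

    assertion_urn = "urn:li:assertion:amount-sum-between"  # placeholder assertion id
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"  # placeholder

    info = AssertionInfoClass(
        type=AssertionTypeClass.DATASET,
        datasetAssertion=DatasetAssertionInfoClass(
            dataset=dataset_urn,
            scope=DatasetAssertionScopeClass.DATASET_COLUMN,
            fields=[f"urn:li:schemaField:({dataset_urn},Amount)"],
            operator=AssertionStdOperatorClass.BETWEEN,
            aggregation=AssertionStdAggregationClass.SUM,
            parameters=AssertionStdParametersClass(
                minValue=AssertionStdParameterClass(value="0", type=AssertionStdParameterTypeClass.NUMBER),
                maxValue=AssertionStdParameterClass(value="1", type=AssertionStdParameterTypeClass.NUMBER),
            ),
            nativeType="expect_column_sum_to_be_between",  # free-text native assertion type
        ),
    )

    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="assertion",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=assertion_urn,
            aspectName="assertionInfo",
            aspect=info,
        )
    )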
  • a

    acceptable-judge-21659

    09/26/2022, 6:52 AM
    Hello, I have to ingest a database which is on a VPS and I'm not sure how to do it... Any advice?
  • b

    bumpy-journalist-41369

    09/26/2022, 7:33 AM
    I have a problem when ingesting data from Glue. I get the following exception:
    Copy code
    '2022-09-21 12:33:03.932429 [exec_id=14acb269-e6af-4ca0-871b-684c02a11814] INFO: Caught exception EXECUTING '
               'task_id=14acb269-e6af-4ca0-871b-684c02a11814, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/asyncio/streams.py", line 525, in readline\n'
               '    line = await self.readuntil(sep)\n'
               '  File "/usr/local/lib/python3.10/asyncio/streams.py", line 620, in readuntil\n'
               '    raise exceptions.LimitOverrunError(\n'
               'asyncio.exceptions.LimitOverrunError: Separator is found, but chunk is longer than limit\n'
               '\n'
               'During handling of the above exception, another exception occurred:\n'
               '\n'
               'Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 147, in execute\n'
               '    await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 99, in _read_output_lines\n'
               '    line_bytes = await ingest_process.stdout.readline()\n'
               '  File "/usr/local/lib/python3.10/asyncio/streams.py", line 534, in readline\n'
               '    raise ValueError(e.args[0])\n'
               'ValueError: Separator is found, but chunk is longer than limit\n']}
    Execution finished with errors.
    And eventually the ingestion fails, even though it managed to ingest some of the data. My recipe looks like this:
    Copy code
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-datahub-gms:8080'
    source:
        type: glue
        config:
            aws_region: us-east-1
            database_pattern:
                allow:
                    - product_metrics
    I don't see any other exceptions in the log. Does anyone know how to fix it?
  • c

    clean-tomato-22549

    09/26/2022, 9:42 AM
    Hello, how can I light up the "View Definition", "Lineage", and "Queries" tabs? I use type: snowflake and have enabled the following parameters while ingesting. What other settings do I need to populate these tabs?
    Copy code
    type: snowflake
    
    
    ignore_start_time_lineage: true
    include_table_lineage: true
    include_view_lineage: true
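    As far as I understand, the Queries tab (and usage stats generally) come from usage ingestion rather than the snowflake metadata source alone, so on this version a separate snowflake-usage run would likely also be needed; a rough sketch with placeholder connection details (field names may differ by version):
    Copy code
    source:
        type: snowflake-usage
        config:
            host_port: "<account_identifier>"   # assumption: check your version's docs for the exact field
            warehouse: "COMPUTE_WH"
            username: "<user>"
            password: "<password>"
            role: "ACCOUNTADMIN"
            top_n_queries: 10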