# ingestion
  • s

    shy-lion-56425

    09/21/2022, 8:50 PM
    Any recommendations on setting include and exclude path_specs for s3?
    Copy code
    source:
        type: s3
        config:
            path_specs:
            - include : "<s3://cseo-global-cloudtrail/AWSLogs/057183463473/{table}/{partition[0]}/{partition[1]}/{partition[2]}/{partition[3]}/*_CloudTrail-Digest_*.json.gz>"
            - exclude : "**/AWSLogs/057183463473/CloudTrail-Digest/**"
            aws_config:
                aws_access_key_id: "{aws_key}"
                aws_secret_access_key: "{aws_secret}"
                aws_region: us-east-1
            profiling:
                enabled: false
    Error:
    Copy code
    [2022-09-21 15:47:53,596] ERROR    {datahub.ingestion.run.pipeline:127} - 'include'
    Traceback (most recent call last):
      File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 178, in __init__
        self.source: Source = source_class.create(
      File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/s3/source.py", line 321, in create
        config = DataLakeSourceConfig.parse_obj(config_dict)
      File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
      File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
      File "pydantic/main.py", line 1056, in pydantic.main.validate_model
      File "pydantic/fields.py", line 868, in pydantic.fields.ModelField.validate
      File "pydantic/fields.py", line 901, in pydantic.fields.ModelField._validate_sequence_like
      File "pydantic/fields.py", line 1067, in pydantic.fields.ModelField._validate_singleton
      File "pydantic/fields.py", line 857, in pydantic.fields.ModelField.validate
      File "pydantic/fields.py", line 1074, in pydantic.fields.ModelField._validate_singleton
      File "pydantic/fields.py", line 1121, in pydantic.fields.ModelField._apply_validators
      File "pydantic/class_validators.py", line 313, in pydantic.class_validators._generic_validator_basic.lambda12
      File "pydantic/main.py", line 704, in pydantic.main.BaseModel.validate
      File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
      File "pydantic/main.py", line 1082, in pydantic.main.validate_model
      File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/aws/path_spec.py", line 104, in validate_path_spec
        if "**" in values["include"]:
    KeyError: 'include'
    [2022-09-21 15:47:53,598] INFO     {datahub.cli.ingest_cli:119} - Starting metadata ingestion
    [2022-09-21 15:47:53,598] INFO     {datahub.cli.ingest_cli:137} - Finished metadata ingestion
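    A note on the traceback above: validate_path_spec reads values["include"] for every entry in path_specs, so a list item that contains only exclude fails with this KeyError. A minimal sketch of a recipe under that reading, with exclude nested as a list inside the same entry as include (bucket, account ID, and credentials are the poster's placeholders):
    Copy code
    source:
        type: s3
        config:
            path_specs:
                - include: "s3://cseo-global-cloudtrail/AWSLogs/057183463473/{table}/{partition[0]}/{partition[1]}/{partition[2]}/{partition[3]}/*_CloudTrail-Digest_*.json.gz"
                  exclude:
                      - "**/AWSLogs/057183463473/CloudTrail-Digest/**"
            aws_config:
                aws_access_key_id: "{aws_key}"
                aws_secret_access_key: "{aws_secret}"
                aws_region: us-east-1
            profiling:
                enabled: false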
  • f

    few-sugar-84064

    09/22/2022, 3:17 AM
    Hi, has anyone ingested lineage for a Glue job (Redshift table to Redshift table ETL job) manually? If yes, sample code would be really helpful for me. Thanks. I've tried the below: • Glue Annotation - got a parsing error
    Copy code
    Error parsing DAG for Glue job. The script <s3://steadio-glue-info/scripts/test-datahub-lineage.py> cannot be processed by Glue (this usually occurs when it has been user-modified): An error occurred (InvalidInputException) when calling the GetDataflowGraph operation: line 11:87 no viable alternative at input \'## @type: DataSource\\n## @args: [catalog_connection = "redshiftconnection", connection_options = {"database" =\'']}
    • Dataset job code - I have no idea what I need to put for the job id and flow id
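    For the manual route, a minimal sketch using the Python REST emitter that models this as a direct Redshift-to-Redshift dataset edge instead of going through the Glue job (table names and GMS address are placeholders):
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    # Upstream and downstream dataset URNs (placeholder table names).
    upstream = builder.make_dataset_urn("redshift", "dev.public.source_table")
    downstream = builder.make_dataset_urn("redshift", "dev.public.target_table")

    # Build a lineage MCE declaring that downstream is derived from upstream.
    lineage_mce = builder.make_lineage_mce([upstream], downstream)

    # Emit to GMS (placeholder address).
    emitter = DatahubRestEmitter("http://localhost:8080")
    emitter.emit_mce(lineage_mce)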
  • c

    cool-vr-73109

    09/22/2022, 8:06 AM
    Hi Team, for the scenario below can you help me ingest data lineage into DataHub: Oracle -> S3 -> AWS Glue -> Redshift. I will be ingesting metadata from AWS Glue. My question is whether DataHub will detect this whole lineage from source to target, or whether we should add it manually via a YAML lineage file. Please help?
  • k

    kind-scientist-44426

    09/22/2022, 9:33 AM
    Hi all, I'm trying to ingest the LDAP users from an LDAP server, for which I'm using the recipe below:
    Copy code
    source:
      type: "ldap"
      config:
        ldap_server: <server>
        ldap_user: "cn=<user_name>,dc=example,dc=org"
        ldap_password: "<password>"
        base_dn: "dc=example,dc=org"
    but when running this recipe from the UI I'm getting the errors below:
    Copy code
    ERROR    {datahub.ingestion.run.pipeline:127} - LDAP connection failed\n'
    
    "AttributeError: 'Pipeline' object has no attribute 'source'\n"
               "[2022-09-22 07:19:56,128] ERROR    {datahub.entrypoints:188} - Command failed with 'Pipeline' object has no attribute 'source'
    Can someone suggest the reason?
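    One way to check whether the bind itself works outside DataHub is with python-ldap (the library the ldap plugin builds on); a quick sketch with placeholder values, assuming ldap_server should be a full URI such as ldap://host:389:
    Copy code
    import ldap

    # Placeholder server and credentials.
    conn = ldap.initialize("ldap://<server>:389")
    conn.simple_bind_s("cn=<user_name>,dc=example,dc=org", "<password>")

    # Simple search to confirm the base DN is readable with these credentials.
    results = conn.search_s("dc=example,dc=org", ldap.SCOPE_SUBTREE, "(objectClass=*)")
    print(f"found {len(results)} entries")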
  • m

    mammoth-air-95743

    09/22/2022, 10:37 AM
    Hi everyone! I am ingesting from an S3 bucket containing JSON files, and in the ingestion task's log I get a message that it's extracting the table schema, but nothing actually appears there; it doesn't infer the schema. Here's the logger output:
    Copy code
    '[2022-09-20 09:41:44,078] INFO     {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: '
               '<s3://path/to/file.json>\n'
    '[2022-09-20 09:41:44,078] INFO     {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: '
               'path/to/file.json\n'
  • m

    mammoth-air-95743

    09/22/2022, 10:40 AM
    My second question relates to ingestion breaking for some Mongo collections and JSON files. The Mongo collection causes I found were column values containing some sort of encoded HTML or JSON, plus one case of a really big schema. For the few JSONs that failed I imagine it's a similar case, but I haven't debugged it properly yet. My main issue is that there's no useful output anywhere to see why it failed. Is there anywhere I can look, such as some pod's logs? Alternatively, can I add a deny list to the ingestion recipe so it skips some collections/files?
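    On the skip-list question, a sketch of how that might look for the mongodb source, assuming its collection_pattern deny list matches database.collection names (connection URI and names are placeholders):
    Copy code
    source:
        type: mongodb
        config:
            connect_uri: "mongodb://localhost:27017"
            collection_pattern:
                deny:
                    - "mydb.huge_schema_collection"
                    - "mydb.encoded_html_collection"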
  • c

    careful-action-61962

    09/22/2022, 11:32 AM
    Hey folks, I'm new to DataHub and want to ingest Tableau metadata. Is there a way to ingest all the projects at once instead of adding their names manually?
  • s

    some-printer-33912

    09/22/2022, 3:03 PM
    Hi! I am trying to get metadata from MSSQL and I have this error (local machine, Windows, SQL Server running locally, Docker):
    Copy code
    "[2022-09-22 14:54:17,208] ERROR    {datahub.entrypoints:188} - Command failed with HTTPConnectionPool(host='localhost', port=8080): Max "
               "retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1859651c90>: Failed "
               "to establish a new connection: [Errno 111] Connection refused')). Run with --debug to get full trace\n"
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '2547645d-cd36-4752-b64d-d5bf7552b7c6',
     'infos': ['2022-09-22 14:54:03.509045 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-22 14:54:17.675643 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: stdout=venv setup time = 0\n'
               'This version of datahub supports report-to functionality\n'
               'datahub  ingest run -c /tmp/datahub/ingest/2547645d-cd36-4752-b64d-d5bf7552b7c6/recipe.yml --report-to '
               '/tmp/datahub/ingest/2547645d-cd36-4752-b64d-d5bf7552b7c6/ingestion_report.json\n'
               '[2022-09-22 14:54:04,862] INFO     {datahub.cli.ingest_cli:170} - DataHub CLI version: 0.8.42\n'
               "[2022-09-22 14:54:17,208] ERROR    {datahub.entrypoints:188} - Command failed with HTTPConnectionPool(host='localhost', port=8080): Max "
               "retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1859651c90>: Failed "
               "to establish a new connection: [Errno 111] Connection refused')). Run with --debug to get full trace\n"
               '[2022-09-22 14:54:17,208] INFO     {datahub.entrypoints:191} - DataHub CLI version: 0.8.42 at '
               '/tmp/datahub/ingest/venv-mssql-0.8.42/lib/python3.10/site-packages/datahub/__init__.py\n',
               "2022-09-22 14:54:17.677042 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: Failed to execute 'datahub ingest'",
               '2022-09-22 14:54:17.677216 [exec_id=2547645d-cd36-4752-b64d-d5bf7552b7c6] INFO: Caught exception EXECUTING '
               'task_id=2547645d-cd36-4752-b64d-d5bf7552b7c6, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 168, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
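    A common cause of this particular error when running ingestion from the UI on Docker is that the recipe's sink points at localhost:8080, which inside the actions container does not resolve to GMS. A hedged sketch of a sink pointing at the GMS container instead (the service name may differ in your compose setup):
    Copy code
    sink:
        type: datahub-rest
        config:
            server: "http://datahub-gms:8080"   # container/service name rather than localhost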
  • l

    limited-forest-73733

    09/22/2022, 3:16 PM
    Hey team! What's the plan for the next DataHub release, i.e. 0.8.45?
  • l

    limited-forest-73733

    09/22/2022, 3:16 PM
    Any tentative date?
  • t

    thankful-morning-85093

    09/22/2022, 4:33 PM
    Hi Team, can we use Trino/Presto to push data for the Hive platform into DataHub? Profiling in Hive is super slow for us.
  • b

    bland-balloon-48379

    09/22/2022, 7:12 PM
    Hey everyone, I got a quick question. Is there a process or timeframe by which soft-deleted entities get automatically hard-deleted? I'm interested in hiding some datasets from the UI, but want to make sure that historical data will always be retained. Thanks!
  • c

    careful-engine-38533

    09/23/2022, 4:19 AM
    Hi, my MongoDB ingestion fails with the following error message - any help?
    Copy code
    '/usr/local/bin/run_ingest.sh: line 40:    79 Killed                  ( datahub ingest run -c "${recipe_file}" ${report_option} )\n',
               "2022-09-22 06:29:49.739560 [exec_id=29430983-bfd2-4551-b153-c869537f5fe5] INFO: Failed to execute 'datahub ingest'",
               '2022-09-22 06:29:49.739831 [exec_id=29430983-bfd2-4551-b153-c869537f5fe5] INFO: Caught exception EXECUTING '
               'task_id=29430983-bfd2-4551-b153-c869537f5fe5, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in
  • c

    cuddly-arm-8412

    09/23/2022, 2:18 AM
    Hi team, I tried lineage_job_dataflow_new_api.py and the data is ingested successfully.
    Copy code
    jobFlow = DataFlow(cluster="prod", orchestrator="airflow", id="flow_new_api")
    jobFlow.emit(emitter)
    
    dataJob = DataJob(flow_urn=jobFlow.urn, id="flow_new_api_job1")
    dataJob.emit(emitter)
    But in the UI I found that they were not related. How do I troubleshoot this?
  • n

    narrow-toothbrush-13209

    09/23/2022, 7:16 AM
    Hi, can the DataHub Provider for Airflow handle connection errors? Tasks are failing if the connection to DataHub is not established: datahub_provider.lineage.datahub.DatahubLineageBackend
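    If the goal is for tasks not to fail when DataHub is unreachable, the lineage backend's graceful_exceptions option appears intended for exactly that; an airflow.cfg sketch with a placeholder connection id (treat the option name as an assumption and verify it against your provider version):
    Copy code
    [lineage]
    backend = datahub_provider.lineage.datahub.DatahubLineageBackend
    datahub_kwargs = {"datahub_conn_id": "datahub_rest_default", "graceful_exceptions": true}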
  • b

    boundless-student-48844

    09/23/2022, 8:14 AM
    Hi team, can I check if there's a code repo for the acryl-executor pip package?
  • b

    brave-tomato-16287

    09/23/2022, 11:13 AM
    Hello all. We faced a Tableau ingestion error when trying to ingest a Google Sheet:
    Copy code
    'urn:li:dataset:(urn:li:dataPlatform:tableau,d1ad8766-18c8-6938-770e-42929141371c,PROD)\\n Cause: ERROR :: '
                          '/upstreams/0/dataset :: \\"Provided urn '
                          "urn:li:dataset:(urn:li:dataPlatform:google-sheets,temp_0ufiu670cqle3e165n9eh12vw5vo.'am, bad debt users, to make "
                          'bal$\',PROD)\\" is invalid: Failed to convert urn to entity key: urns parts and key fields do not have same length\\n", '
                          '"message": "Invalid urn format for aspect: {upstreams=[{type=TRANSFORMED, auditStamp={actor=urn:li:corpuser:unknown, time=0}, '
                          'dataset=urn:li:dataset:(urn:li:dataPlatform:google-sheets,temp_0ufiu670cqle3e165n9eh12", "status": 400, "id": '
                          '"urn:li:dataset:(urn:li:dataPlatform:tableau,d1ad8766-18c8-6938-770e-42929141371c,PROD)"}}], "failures": [{"error": "Unable '
                          'to emit metadata to DataHub GMS", "info": {"exceptionClass": "com.linkedin.restli.server.RestLiServiceException",
  • g

    glamorous-wire-83850

    09/23/2022, 11:47 AM
    Hello, I am currently trying to ingest a multi-project GCP setup, but it only ingests "second_project", which is set as storage_project_id. I want to ingest all projects in GCP. What should I do?
    Copy code
    source:
        type: bigquery
        config:
            project_id: service_acc_project
            storage_project_id: second_project
            credential:
                project_id: service_acc_project
                private_key_id: '${BQ_PRIVATE_KEY_ID2}'
                client_email: abc-abc@service.iam.gserviceaccount.com
                private_key: '${BQ_PRIVATE_KEY2}'
                client_id: '11111111'
            include_tables: true
            include_views: true
            include_table_lineage: true
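    Since project_id / storage_project_id in this recipe each name a single project, one workaround (an assumption rather than a confirmed limitation of this source version) is to run one copy of the recipe per GCP project with the same service-account credential:
    Copy code
    source:
        type: bigquery
        config:
            project_id: third_project          # repeat the recipe once per project to ingest
            credential:
                project_id: service_acc_project
                private_key_id: '${BQ_PRIVATE_KEY_ID2}'
                client_email: abc-abc@service.iam.gserviceaccount.com
                private_key: '${BQ_PRIVATE_KEY2}'
                client_id: '11111111'
            include_tables: true
            include_views: true
            include_table_lineage: true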
  • l

    lemon-engine-23512

    09/23/2022, 11:48 AM
    Hi team, I am trying to schedule ingestion with Airflow (managed Apache Airflow on AWS, MWAA). Here we upload the DAG and YAML files to an S3 location, but when I run the schedule in Airflow I get the error datahub.configuration.common.ConfigurationError: cannot open config file <s3 path to yaml>. Any way to resolve this? Thank you
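    One workaround, since the worker cannot open an S3 URI as a local config file, is to download the recipe in the task and hand the parsed dict to the SDK; a sketch with a placeholder bucket and key:
    Copy code
    import boto3
    import yaml

    from datahub.ingestion.run.pipeline import Pipeline


    def run_recipe_from_s3(bucket: str, key: str) -> None:
        # Fetch the recipe YAML from S3 (placeholder bucket/key supplied by the caller).
        body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
        config = yaml.safe_load(body)

        # Run the ingestion pipeline from the parsed config dict.
        pipeline = Pipeline.create(config)
        pipeline.run()
        pipeline.raise_from_status()


    # Example: call from a PythonOperator callable.
    # run_recipe_from_s3("my-bucket", "recipes/mysql.yml")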
  • a

    adamant-rain-51672

    09/23/2022, 12:32 PM
    Hey, I upgraded to 0.8.44, however I'm seeing this ingestion error for all ingestions:
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': 'e7e241c2-dcbf-43c9-9363-0eb77c8a1fad',
     'infos': ['2022-09-23 12:30:07.735004 [exec_id=e7e241c2-dcbf-43c9-9363-0eb77c8a1fad] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-23 12:30:07.735648 [exec_id=e7e241c2-dcbf-43c9-9363-0eb77c8a1fad] INFO: Caught exception EXECUTING '
               'task_id=e7e241c2-dcbf-43c9-9363-0eb77c8a1fad, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 71, in execute\n'
               '    validated_args = SubProcessIngestionTaskArgs.parse_obj(args)\n'
               '  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj\n'
               '  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__\n'
               'pydantic.error_wrappers.ValidationError: 1 validation error for SubProcessIngestionTaskArgs\n'
               'debug_mode\n'
               '  extra fields not permitted (type=value_error.extra)\n']}
    Execution finished with errors.
    Do you know what might be causing this?
  • f

    future-smartphone-53257

    09/23/2022, 12:57 PM
    Hi, I'm trying to ingest glossary terms from a file, but I can't figure out how to indicate the Contains and Inherits relationships that I can set in the UI. Is there some way I can export the metadata from DataHub to MXE/MCE format?
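    A sketch of how the business glossary YAML can express these relationships, assuming the file format supports inherits and contains keys on a term (names here are invented):
    Copy code
    version: 1
    source: DataHub
    owners:
        users:
            - datahub
    nodes:
        - name: Classification
          terms:
              - name: Sensitive
                description: Sensitive data
              - name: Email
                description: Email address field
              - name: PII
                description: Personally identifiable information
                inherits:
                    - Classification.Sensitive
                contains:
                    - Classification.Email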
  • b

    bumpy-whale-50799

    09/23/2022, 1:12 PM
    Does Metadata lineage work if a Stored Procedure creates a temp table before populating the output table?
  • g

    gray-cpu-75769

    09/22/2022, 12:40 PM
    Hi all, I'm trying to enable data profiling for a Google BigQuery table but am getting the following error; the recipe is below. Does anyone have any idea about it?
    Copy code
    source:
        type: bigquery
        config:
            credential:
                private_key_id: '${private_key_id}'
                project_id: '${project_id}'
                client_email: '${Client_email}'
                private_key: '${private_key}'
                client_id: '${client_id}'
            profiling:
                enabled: true
            project_id: '${project_id}'
            table_pattern:
                allow:
                    - daas-prod-251711.cdo.online_merchant
            profile_pattern:
                allow:
                    - daas-prod-251711.cdo.online_merchant
    pipeline_name: 'urn:li:dataHubIngestionSource:60e4b0c9-dc16-4138-8fa7-d0c881af095a'
  • c

    chilly-potato-57465

    09/23/2022, 1:33 PM
    Hello! In our case we have huge datasets stored in regular file systems (images) and HDFS. As far as I could see from previous questions and the source documentation, there are no plugins to ingest metadata from regular file systems (attributes such as created/modified/ownership/size/access rights/etc. and folder structure) or from HDFS. Is this still so? Additionally, I wonder how to ingest metadata (column names) from CSV files. I see that this is possible from the S3 source; is it also possible from regular file systems? Thank you!!
  • f

    fresh-nest-42426

    09/23/2022, 9:16 PM
    Hi all and happy Friday! We are doing a POC with DataHub, specifically ingesting table lineage from Redshift, and realized that it also automatically ingests upstream S3 COPY lineage, which is very interesting and useful. However, we have many event-level tables that get many small batches of S3 files loaded via Kinesis Firehose, so the S3 lineage looks extremely verbose and even seems to completely break ingestion sometimes:
    Copy code
    '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 96, in _read_output_lines\n'
               '    line_bytes = await ingest_process.stdout.readline()\n'
               '  File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline\n'
               '    raise ValueError(e.args[0])\n'
               'ValueError: Separator is found, but chunk is longer than limit\n']}
    I see a related thread here: https://datahubspace.slack.com/archives/CUMUWQU66/p1663143783318239. Is there a way to exclude the upstream S3 lineage collection for certain Redshift tables? So far I've had to exclude such tables with extensive S3 upstreams, otherwise ingestion doesn't work. I'm using v0.8.44 and datahub-actions 0.0.7. Thanks!
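    I'm not aware of a per-table switch, but if dropping COPY-based lineage entirely is acceptable, the redshift source has an include_copy_lineage flag (worth verifying against your version's docs); a sketch:
    Copy code
    source:
        type: redshift
        config:
            # ... connection settings ...
            include_table_lineage: true
            include_copy_lineage: false   # assumption: disables s3 COPY upstream lineage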
  • g

    green-lion-58215

    09/23/2022, 10:47 PM
    Quick question on dbt ingestion. I am ingesting dbt test results with run_results.json, and I can also see the test cases in the DataHub UI, but it does not show any passes/failures for the test cases; they all come up as "no evaluations found". For context, I am using DataHub 0.8.41 and dbt version 0.21.1. Any help is appreciated.
  • l

    little-spring-72943

    09/24/2022, 9:55 PM
    We are trying to build an Assertion from our DQ toolset (the Great Expectations equivalent is expect_column_sum_to_be_between) and have set the following values:
    Copy code
    scope = DatasetAssertionScope.DATASET_COLUMN
    operator = AssertionStdOperator.BETWEEN
    aggregation = AssertionStdAggregation.SUM
    The UI shows "Column Amount values are between 0 and 1" - the Sum aggregation is missing. How can we fix this or provide custom text here?
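    For reference, a sketch of emitting those fields with the Python SDK's schema classes, under the assumption that the UI text is rendered from the emitted aggregation/operator/parameters (urns, ids, and values are placeholders):
    Copy code
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AssertionInfoClass,
        AssertionStdAggregationClass,
        AssertionStdOperatorClass,
        AssertionStdParameterClass,
        AssertionStdParametersClass,
        AssertionStdParameterTypeClass,
        AssertionTypeClass,
        ChangeTypeClass,
        DatasetAssertionInfoClass,
        DatasetAssertionScopeClass,
    )

    assertion_urn = "urn:li:assertion:amount-sum-between"  # placeholder assertion id
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"  # placeholder

    info = AssertionInfoClass(
        type=AssertionTypeClass.DATASET,
        datasetAssertion=DatasetAssertionInfoClass(
            dataset=dataset_urn,
            scope=DatasetAssertionScopeClass.DATASET_COLUMN,
            fields=[f"urn:li:schemaField:({dataset_urn},Amount)"],
            operator=AssertionStdOperatorClass.BETWEEN,
            aggregation=AssertionStdAggregationClass.SUM,
            parameters=AssertionStdParametersClass(
                minValue=AssertionStdParameterClass(value="0", type=AssertionStdParameterTypeClass.NUMBER),
                maxValue=AssertionStdParameterClass(value="1", type=AssertionStdParameterTypeClass.NUMBER),
            ),
            nativeType="expect_column_sum_to_be_between",  # free-text native assertion type
        ),
    )

    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="assertion",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=assertion_urn,
            aspectName="assertionInfo",
            aspect=info,
        )
    )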
  • a

    acceptable-judge-21659

    09/26/2022, 6:52 AM
    Hello, I have to ingest a database which is on a VPS and I'm not sure how to do it... Any advice?
  • b

    bumpy-journalist-41369

    09/26/2022, 7:33 AM
    I have a problem when ingesting data from Glue. I get the following exception:
    Copy code
    '2022-09-21 12:33:03.932429 [exec_id=14acb269-e6af-4ca0-871b-684c02a11814] INFO: Caught exception EXECUTING '
               'task_id=14acb269-e6af-4ca0-871b-684c02a11814, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/asyncio/streams.py", line 525, in readline\n'
               '    line = await self.readuntil(sep)\n'
               '  File "/usr/local/lib/python3.10/asyncio/streams.py", line 620, in readuntil\n'
               '    raise exceptions.LimitOverrunError(\n'
               'asyncio.exceptions.LimitOverrunError: Separator is found, but chunk is longer than limit\n'
               '\n'
               'During handling of the above exception, another exception occurred:\n'
               '\n'
               'Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 147, in execute\n'
               '    await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 99, in _read_output_lines\n'
               '    line_bytes = await ingest_process.stdout.readline()\n'
               '  File "/usr/local/lib/python3.10/asyncio/streams.py", line 534, in readline\n'
               '    raise ValueError(e.args[0])\n'
               'ValueError: Separator is found, but chunk is longer than limit\n']}
    Execution finished with errors.
    And eventually the ingestion fails, even though it managed to ingest some of the data. My recipe looks like this:
    Copy code
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-datahub-gms:8080'
    source:
        type: glue
        config:
            aws_region: us-east-1
            database_pattern:
                allow:
                    - product_metrics
    I don't see any other exceptions in the log. Does anyone know how to fix it?
  • c

    clean-tomato-22549

    09/26/2022, 9:42 AM
    Hello, how can I light up the "View Definition", "Lineage", and "Queries" tabs? I use type: snowflake and have enabled the following parameters while ingesting. What other settings do I need to populate these tabs?
    Copy code
    type: snowflake
    
    
    ignore_start_time_lineage: true
    include_table_lineage: true
    include_view_lineage: true
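    As far as I understand, the Queries tab (and usage stats generally) come from usage ingestion rather than the snowflake metadata source alone, so on this version a separate snowflake-usage run would likely also be needed; a rough sketch with placeholder connection details (field names may differ by version):
    Copy code
    source:
        type: snowflake-usage
        config:
            host_port: "<account_identifier>"   # assumption: check your version's docs for the exact field
            warehouse: "COMPUTE_WH"
            username: "<user>"
            password: "<password>"
            role: "ACCOUNTADMIN"
            top_n_queries: 10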