# ingestion
  • n

    numerous-account-62719

    09/02/2022, 7:41 AM
    Hi team, I am trying to trigger the ingestion pipeline from the UI and I'm getting the following error. Please refer to the stack trace below:
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '918dc5d5-c95c-4051-ad58-0867a0bc89f8',
     'infos': ['2022-09-01 14:06:39.052519 [exec_id=918dc5d5-c95c-4051-ad58-0867a0bc89f8] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-01 14:07:05.133419 [exec_id=918dc5d5-c95c-4051-ad58-0867a0bc89f8] INFO: stdout=Requirement already satisfied: pip in '
               '/tmp/datahub/ingest/venv-918dc5d5-c95c-4051-ad58-0867a0bc89f8/lib/python3.9/site-packages (21.2.4)\n'
               'WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fce28cfaf70>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/pip/\n"
               'WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fce28cfaeb0>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/pip/\n"
               'WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fce28cfadf0>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/pip/\n"
               'WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fce28cfad00>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/pip/\n"
               'WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fce28d00a60>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/pip/\n"
               'WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fce28cc2e20>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/wheel/\n"
               'WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fce28cc6070>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/wheel/\n"
               'WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fce28cc6220>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/wheel/\n"
               'WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fce28cc63d0>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/wheel/\n"
               'WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fce28cc6580>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/wheel/\n"
               'ERROR: Could not find a version that satisfies the requirement wheel (from versions: none)\n'
               'ERROR: No matching distribution found for wheel\n'
               'WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd10d26f40>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/acryl-datahub/\n"
               'WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd10d42190>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/acryl-datahub/\n"
               'WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd10d42340>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/acryl-datahub/\n"
               'WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd10d424f0>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/acryl-datahub/\n"
               'WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd10d261c0>: Failed to establish a new connection: "
               "[Errno -3] Temporary failure in name resolution')': /simple/acryl-datahub/\n"
               'ERROR: Could not find a version that satisfies the requirement acryl-datahub[datahub-rest,oracle]==0.8.41 (from versions: none)\n'
               'ERROR: No matching distribution found for acryl-datahub[datahub-rest,oracle]==0.8.41\n'
               '/tmp/datahub/ingest/venv-918dc5d5-c95c-4051-ad58-0867a0bc89f8/bin/python3: No module named datahub\n',
               "2022-09-01 14:07:05.133542 [exec_id=918dc5d5-c95c-4051-ad58-0867a0bc89f8] INFO: Failed to execute 'datahub ingest'",
               '2022-09-01 14:07:05.140360 [exec_id=918dc5d5-c95c-4051-ad58-0867a0bc89f8] INFO: Caught exception EXECUTING '
               'task_id=918dc5d5-c95c-4051-ad58-0867a0bc89f8, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
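    The repeated "Temporary failure in name resolution" lines suggest the executor's pip cannot resolve PyPI at all. A minimal check (a sketch, assuming you can run Python inside the datahub-actions pod) to confirm whether DNS is the root cause:
    Copy code
    # Sketch: verify DNS resolution to PyPI from inside the ingestion executor's environment.
    # If these lookups fail, pip cannot download wheel/acryl-datahub, matching the log above.
    import socket

    for host in ("pypi.org", "files.pythonhosted.org"):
        try:
            print(host, "->", socket.gethostbyname(host))
        except OSError as exc:
            print(host, "-> resolution failed:", exc)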
  • s

    some-hairdresser-53679

    09/02/2022, 11:09 AM
    Hello. How can I send validations for a dataset without using Great Expectations?
  • e

    enough-monitor-24292

    09/02/2022, 11:34 AM
    I'm not able to see the Stats option on DataHub version 0.8.38. Can anyone help with how we can push stats to DataHub from Presto or Delta Lake?
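    For context, the Stats tab is backed by the datasetProfile aspect; below is a hedged sketch (class and parameter names from the DataHub Python SDK; the dataset URN and counts are hypothetical) of pushing a profile yourself via the REST emitter:
    Copy code
    # Sketch: emit a datasetProfile aspect for a Presto/Delta Lake table so the Stats tab has data.
    import time

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetProfileClass

    profile = DatasetProfileClass(
        timestampMillis=int(time.time() * 1000),
        rowCount=1000,       # example values
        columnCount=12,
    )
    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:presto,example_db.example_table,PROD)",
        aspectName="datasetProfile",
        aspect=profile,
    )
    DatahubRestEmitter(gms_server="http://datahub-gms:8080").emit(mcp)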
  • r

    ripe-tiger-90198

    09/02/2022, 12:11 PM
    Hello, team! I'm trying to run ingestion from the UI for a BigQuery source and getting the error below. Please help!
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '5d21d1f2-21ba-48d5-abcb-1099d069f959',
     'infos': ['2022-09-02 12:02:55.086878 [exec_id=5d21d1f2-21ba-48d5-abcb-1099d069f959] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-02 12:03:01.329263 [exec_id=5d21d1f2-21ba-48d5-abcb-1099d069f959] INFO: stdout=Elapsed seconds = 0\n'
               '  --report-to TEXT                Provide an output file to produce a\n'
               'This version of datahub supports report-to functionality\n'
               'datahub  ingest run -c /tmp/datahub/ingest/5d21d1f2-21ba-48d5-abcb-1099d069f959/recipe.yml --report-to '
               '/tmp/datahub/ingest/5d21d1f2-21ba-48d5-abcb-1099d069f959/ingestion_report.json\n'
               '[2022-09-02 12:02:57,138] INFO     {datahub.cli.ingest_cli:170} - DataHub CLI version: 0.8.42\n'
               '[2022-09-02 12:02:57,197] INFO     {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured '
               'to talk to <http://datahub-gms:8080>\n'
               "[2022-09-02 12:02:59,501] INFO     {datahub.ingestion.source.sql.sql_common:284} - Applying table_pattern {'allow': ['.*\\\\.tracks']} "
               'to view_pattern.\n'
               '[2022-09-02 12:02:59,501] ERROR    {datahub.ingestion.run.pipeline:127} - 1 validation error for BigQueryConfig\n'
               'include_view_lineage\n'
               '  extra fields not permitted (type=value_error.extra)\n'
               '[2022-09-02 12:02:59,502] INFO     {datahub.cli.ingest_cli:119} - Starting metadata ingestion\n'
               '[2022-09-02 12:02:59,502] INFO     {datahub.cli.ingest_cli:137} - Finished metadata ingestion\n'
               "[2022-09-02 12:03:00,041] ERROR    {datahub.entrypoints:188} - Command failed with 'Pipeline' object has no attribute 'source'. Run with "
               '--debug to get full trace\n'
               '[2022-09-02 12:03:00,041] INFO     {datahub.entrypoints:191} - DataHub CLI version: 0.8.42 at '
               '/tmp/datahub/ingest/venv-bigquery-0.8.42/lib/python3.9/site-packages/datahub/__init__.py\n',
               "2022-09-02 12:03:01.331049 [exec_id=5d21d1f2-21ba-48d5-abcb-1099d069f959] INFO: Failed to execute 'datahub ingest'",
               '2022-09-02 12:03:01.332045 [exec_id=5d21d1f2-21ba-48d5-abcb-1099d069f959] INFO: Caught exception EXECUTING '
               'task_id=5d21d1f2-21ba-48d5-abcb-1099d069f959, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 142, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
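    The pydantic error above ("include_view_lineage ... extra fields not permitted") means the bigquery source config in the recipe contains a key this CLI version does not accept. A rough sketch (assuming a local acryl-datahub install; the project id is a placeholder) for reproducing the same validation outside the UI:
    Copy code
    # Sketch: run an equivalent recipe programmatically; the config validator rejects unknown keys
    # such as the include_view_lineage entry reported in the log above.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "bigquery",
                "config": {
                    "project_id": "my-project",                   # placeholder
                    "table_pattern": {"allow": [".*\\.tracks"]},
                    # "include_view_lineage": True,               # the key rejected in the log
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()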
  • l

    limited-forest-73733

    09/02/2022, 1:29 PM
    @dazzling-judge-80093 Can you please help me out?
  • l

    limited-forest-73733

    09/02/2022, 1:27 PM
    Hey, I am facing an issue while enabling profiling for different schemas.
  • l

    limited-forest-73733

    09/02/2022, 1:28 PM
    And if I am providing profile_if_modified_since_days: 360
  • a

    adamant-rain-51672

    09/02/2022, 4:08 PM
    Is anyone experiencing login failures when ingesting data from Tableau? (Deployed on an ECK cluster.)
  • c

    cool-gpu-73611

    09/03/2022, 6:01 AM
    Hi! I've deleted containers using the CLI. They're not visible in the web UI now, but they still exist somewhere... how can I restore them in the web UI?
  • e

    enough-monitor-24292

    09/03/2022, 6:30 AM
    Hi, does DataHub provide a data quality layer, i.e. can we add test cases or rules for data quality over datasets? Thanks.
  • m

    millions-sundown-65420

    09/03/2022, 7:14 AM
    Hi. Does DataHub support Spark running with the Kubernetes operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)? I get this exception:
    Copy code
    22/09/02 10:16:00 ERROR DatahubSparkListener: java.lang.NullPointerException
            at datahub.spark.DatahubSparkListener$3.apply(DatahubSparkListener.java:258)
            at datahub.spark.DatahubSparkListener$3.apply(DatahubSparkListener.java:254)
            at scala.Option.foreach(Option.scala:407)
            at datahub.spark.DatahubSparkListener.processExecutionEnd(DatahubSparkListener.java:254)
            at datahub.spark.DatahubSparkListener.onOtherEvent(DatahubSparkListener.java:241)
            at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
            at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
            at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
            at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
            at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
            at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
            at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
            at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
            at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
            at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
            at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
            at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
            at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
            at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
  • a

    aloof-oil-31167

    09/04/2022, 1:55 PM
    Hey, an Airflow integration question… we realized that if we set non-existing datasets in the inlets or outlets inside a DAG, the datahub-airflow plugin will create these datasets and lineage for them. Is there an option to make it skip datasets that don't exist?
  • m

    melodic-beach-18239

    09/05/2022, 3:40 AM
    I configured it as the docs say.
  • b

    better-orange-49102

    09/05/2022, 5:41 AM
    Noticed that whenever I have a policy with an entity filter (for instance, a new metadata policy applicable to notebook entities only) and I try to query the new policy using something like
    Copy code
    curr_policy = graph.get_aspect_v2(
            entity_urn=policy_urn,
            aspect="dataHubPolicyInfo",
            aspect_type=DataHubPolicyInfoClass,
    )
    I always get the error message:
    Copy code
    File "/home/*redacted*/datahub/*redacted*/policy.py", line 54, in <module>
        curr_policy = graph.get_aspect_v2(
      File "/home/*redacted*/datahub/metadata-ingestion/src/datahub/ingestion/graph/client.py", line 171, in get_aspect_v2
        return aspect_type.from_obj(post_json_obj)
      File "/home/*redacted*/miniconda3/envs/*redacted*/lib/python3.9/site-packages/avrogen/dict_wrapper.py", line 41, in from_obj
        return conv.from_json_object(obj, cls.RECORD_SCHEMA)
      File "/home/*redacted*/miniconda3/envs/*redacted*/lib/python3.9/site-packages/avrogen/avrojson.py", line 104, in from_json_object
        return self._generic_from_json(json_obj, writers_schema, readers_schema)
      File "/home/*redacted*/miniconda3/envs/*redacted*/lib/python3.9/site-packages/avrogen/avrojson.py", line 257, in _generic_from_json
        result = self._record_from_json(json_obj, writers_schema, readers_schema)
      File "/home/*redacted*/miniconda3/envs/*redacted*/lib/python3.9/site-packages/avrogen/avrojson.py", line 345, in _record_from_json
        field_value = self._generic_from_json(json_obj[field.name], writers_field.type, field.type)
      File "/home/*redacted*/miniconda3/envs/*redacted*/lib/python3.9/site-packages/avrogen/avrojson.py", line 255, in _generic_from_json
        result = self._union_from_json(json_obj, writers_schema, readers_schema)
      File "/home/*redacted*/miniconda3/envs/*redacted*/lib/python3.9/site-packages/avrogen/avrojson.py", line 314, in _union_from_json
        raise schema.AvroException('Datum union type not in schema: %s', value_type)
    avro.schema.AvroException: ('Datum union type not in schema: %s', 'filter')
    Any idea what causes this? What is weird is that it can be overcome by going to the policy, adding another entity, saving it, and then undoing it again from the UI. Then querying the policy again will not produce the same error. Almost like something was missing the first time round when it was created... I was trying to query all my policies and store them as a JSON file (backup).
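    For reference, a rough sketch of the backup loop described above (assumptions: GMS reachable at http://localhost:8080 and the policy URNs already collected by some listing step):
    Copy code
    # Sketch: fetch the dataHubPolicyInfo aspect for each policy and dump the result to JSON.
    import json

    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import DataHubPolicyInfoClass

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    policy_urns = ["urn:li:dataHubPolicy:example-id"]  # hypothetical; fill with real policy URNs
    backup = {}
    for policy_urn in policy_urns:
        curr_policy = graph.get_aspect_v2(
            entity_urn=policy_urn,
            aspect="dataHubPolicyInfo",
            aspect_type=DataHubPolicyInfoClass,
        )
        if curr_policy is not None:
            backup[policy_urn] = curr_policy.to_obj()  # avro-generated classes serialize via to_obj()

    with open("policies_backup.json", "w") as f:
        json.dump(backup, f, indent=2)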
  • f

    flat-painter-78331

    09/05/2022, 7:01 AM
    Hi guys, when integrating with Airflow, after defining inlets and outlets, how should I check it from the DataHub side? I tried checking the logs after running the DAG and they do not show any DataHub-related messages. This is the code I used for defining inlets and outlets:
    # (Fragment as posted; the imports below are assumed: BashOperator from Airflow and
    # Dataset from the DataHub Airflow plugin's entities module.)
    from airflow.operators.bash import BashOperator
    from datahub_provider.entities import Dataset

    datahub_lineage_task_1 = BashOperator(
        task_id="extract_data",
        dag=dag,
        inlets=[Dataset("mysql", "extract_sql.dag")],
        outlets=[Dataset("s3", "test-project-100/project_100/table_01")],
        bash_command="echo Dummy Task 1",
    )
    datahub_lineage_task_2 = BashOperator(
        task_id="load_data",
        dag=dag,
        inlets=[Dataset("s3", "test-project-100/project_100/table_01")],
        outlets=[Dataset("bigquery", "project-test.tb_bq_datahub")],
        bash_command="echo Dummy Task 2",
    )
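    One way to check from the DataHub side (a sketch; assumes GMS is reachable and the DataHub Airflow lineage backend/plugin is actually enabled in airflow.cfg) is to look up the dataset URNs the tasks should have produced:
    Copy code
    # Sketch: after the DAG run, see whether the outlet dataset exists in DataHub at all.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    graph = DataHubGraph(DatahubClientConfig(server="http://datahub-gms:8080"))
    urn = make_dataset_urn(platform="s3", name="test-project-100/project_100/table_01", env="PROD")
    props = graph.get_aspect_v2(
        entity_urn=urn, aspect="datasetProperties", aspect_type=DatasetPropertiesClass
    )
    print(urn, "found" if props is not None else "not found")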
  • e

    enough-monitor-24292

    09/05/2022, 8:22 AM
    Hi, I have installed the Great Expectations plugin for DataHub. Tests are working fine, but the Validation tab is not visible. Do we need to restart DataHub after installing the plugin? If yes, can anyone send me the steps? Thanks.
  • b

    bumpy-journalist-41369

    09/05/2022, 8:49 AM
    I have a problem running ingestion from S3 buckets. I followed the documentation to create an IAM role and policy to grant requests access to metadata in S3 (https://datahubproject.io/docs/deploy/aws -> IAM policies for UI-based ingestion); however, the ingestion request fails with the following message:
    Copy code
    '[2022-09-05 08:32:57,596] ERROR    {datahub.entrypoints:188} - Command failed with An error occurred (AccessDenied) when calling the '
               'ListObjects operation: Access Denied. Run with --debug to get full trace\n'
    I have created an iamserviceaccount associated with the Kubernetes cluster, called acryl-datahub-actions, with the following policy:
    Copy code
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:*"],
          "Resource": [
            "arn:aws:s3:::cdca-dev-us-east-1-product-metrics",
            "arn:aws:s3:::cdca-dev-us-east-1-product-metrics/*"
          ]
        }
      ]
    }
    The recipe that I am trying is the following:
    Copy code
    sink:
      type: datahub-rest
      config:
        server: 'http://datahub-datahub-gms:8080'
    source:
      type: s3
      config:
        profiling:
          enabled: false
        path_spec:
          include: 's3://my-bucket/table/sh_date=2021-06-23/test.parquet'
        env: DEV
        aws_config:
          aws_region: us-east-1
    P.S. In the policy I have given all permissions for S3, which I will eventually narrow down.
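    As a side check (a sketch; assumes boto3 picks up the same service-account credentials the actions pod uses, and the bucket/prefix below are placeholders), the AccessDenied above comes from the ListObjects call itself, so it can be reproduced outside DataHub:
    Copy code
    # Sketch: confirm whether the pod's credentials can list the bucket that the s3 source reads.
    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="table/", MaxKeys=5)  # placeholders
    for obj in resp.get("Contents", []):
        print(obj["Key"])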
  • a

    adamant-rain-51672

    09/05/2022, 9:48 AM
    [postgres ingestion] Do you know if there's a way to ingest column descriptions?
  • m

    mysterious-dress-35051

    09/05/2022, 10:51 AM
    Hi! I am new here, testing possibilities with DataHub. I have a problem with ingestion and profiling. My recipe looks like this: source: type: mssql. The ingestion works, but I don't see any stats, and there is a bunch of error messages right before:
    Copy code
    "AttributeError: 'Insert' object has no attribute 'columns'\n"
    '[2022-08-31 09:46:52,291] ERROR    {datahub.utilities.sqlalchemy_query_combiner:249} - Failed to execute query normally, using '
               'fallback: \n'
               'CREATE TABLE "#ge_temp_dbf5dfdd" (\n'
               '\tcondition INTEGER NOT NULL\n'
               ')\n'
               '\n'
    I found this question in the history of Slack, but there is no answer there. Could you help me with this problem?🙏
  • b

    bumpy-journalist-41369

    09/05/2022, 11:15 AM
    Is it possible to ingest data from the latest partition of an S3 bucket instead of from all the partitions?
  • m

    melodic-beach-18239

    09/05/2022, 3:40 AM
    image.png
  • f

    full-chef-85630

    09/05/2022, 1:35 PM
    Hi, when ingesting metadata (BigQuery), multiple tables have the same structure but different table names, such as user_1 and user_2. Why is there only the table user in the final dataset? Is there any logic for merging? If so, can it be canceled? @dazzling-judge-80093
  • a

    adamant-rain-51672

    09/05/2022, 7:32 PM
    [okta ingestion] Has anyone experienced the following problem when running Okta ingestion?
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '894f9189-bbb0-4d44-8dfb-2a7056fd6e65',
     'infos': ['2022-09-05 19:28:16.998654 [exec_id=894f9189-bbb0-4d44-8dfb-2a7056fd6e65] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-05 19:28:40.653581 [exec_id=894f9189-bbb0-4d44-8dfb-2a7056fd6e65] INFO: stdout=Requirement already satisfied: pip in '
               '/tmp/datahub/ingest/venv-894f9189-bbb0-4d44-8dfb-2a7056fd6e65/lib/python3.9/site-packages (21.2.4)\n'
    
    [...PACKAGE INSTALLATION...]
    
               '[2022-09-05 19:28:39,965] INFO     {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured '
               'to talk to <http://datahub-datahub-gms:8080>\n'
               '[2022-09-05 19:28:40,128] INFO     {datahub.cli.ingest_cli:119} - Starting metadata ingestion\n'
               '[2022-09-05 19:28:40,129] INFO     {datahub.cli.ingest_cli:123} - Source (okta) report:\n'
               "{'workunits_produced': '0',\n"
               " 'workunit_ids': [],\n"
               " 'warnings': {},\n"
               " 'failures': {},\n"
               " 'cli_version': '0.8.43',\n"
               " 'cli_entry_location': '/tmp/datahub/ingest/venv-894f9189-bbb0-4d44-8dfb-2a7056fd6e65/lib/python3.9/site-packages/datahub/__init__.py',\n"
               " 'py_version': '3.9.9 (main, Dec 21 2021, 10:03:34) \\n[GCC 10.2.1 20210110]',\n"
               " 'py_exec_path': '/tmp/datahub/ingest/venv-894f9189-bbb0-4d44-8dfb-2a7056fd6e65/bin/python3',\n"
               " 'os_details': 'Linux-5.4.209-116.363.amzn2.x86_64-x86_64-with-glibc2.31',\n"
               " 'filtered': []}\n"
               '[2022-09-05 19:28:40,130] INFO     {datahub.cli.ingest_cli:126} - Sink (datahub-rest) report:\n'
               "{'records_written': '0', 'warnings': [], 'failures': [], 'gms_version': 'v0.8.43'}\n"
               '[2022-09-05 19:28:40,418] ERROR    {datahub.entrypoints:188} - Command failed with There is no current event loop in thread '
               "'asyncio_0'.. Run with --debug to get full trace\n"
               '[2022-09-05 19:28:40,418] INFO     {datahub.entrypoints:191} - DataHub CLI version: 0.8.43 at '
               '/tmp/datahub/ingest/venv-894f9189-bbb0-4d44-8dfb-2a7056fd6e65/lib/python3.9/site-packages/datahub/__init__.py\n',
               "2022-09-05 19:28:40.654203 [exec_id=894f9189-bbb0-4d44-8dfb-2a7056fd6e65] INFO: Failed to execute 'datahub ingest'",
               '2022-09-05 19:28:40.654552 [exec_id=894f9189-bbb0-4d44-8dfb-2a7056fd6e65] INFO: Caught exception EXECUTING '
               'task_id=894f9189-bbb0-4d44-8dfb-2a7056fd6e65, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
  • b

    better-actor-97450

    09/06/2022, 2:50 AM
    Hi! Has anybody ingested Iceberg using the Hive metastore to store Iceberg metadata? If I use the Hive metastore for Iceberg, then the right way to ingest metadata is the Hive source, right?
  • b

    brainy-intern-50400

    09/06/2022, 7:17 AM
    Maybe somebody can help me 🙂 I am trying to ingest a notebook entity. When I ingest a notebook entity via the Python emitter, I get a key error. Probably I missed ingesting some kind of key..?:
    Copy code
    react-dom.production.min.js:216 Error: Unrecognized key NOTEBOOK provided in map {}
        at Gt (EntityRegistry.tsx:11:11)
        at e.value (EntityRegistry.tsx:153:24)
        at renderItem (EntityNameList.tsx:118:53)
        at index.js:143:12
        at index.js:243:14
        at Array.map (<anonymous>)
        at A (index.js:242:33)
        at ai (react-dom.production.min.js:157:137)
        at Xc (react-dom.production.min.js:267:460)
        at _s (react-dom.production.min.js:250:347)
    Here is the data I ingest:
    Copy code
        # DataHub emitter
        # (Fragment from a larger script; imports from the DataHub Python SDK, e.g. datahub.emitter.rest_emitter,
        # datahub.emitter.mcp and datahub.metadata.schema_classes, are assumed and not shown here.)
        emitter: DatahubRestEmitter = DatahubRestEmitter(gms_server=DATAHUB_SERVER, extra_headers={})  # token=DATAHUB_API_KEY
        emitter.test_connection()
        
        #milliseconds since epoch
        now: int = int(time.time() * 1000)
        current_timestamp: AuditStampClass = AuditStampClass(time=now, actor="urn:li:corpuser:ingestion")
        
        last_modified = ChangeAuditStampsClass(current_timestamp)    
        
        inputs_notebook: List[NotebookCellClass] = [
            NotebookCellClass(
                type=NotebookCellTypeClass().CHART_CELL,
                chartCell=ChartCellClass(
                    cellId="2",
                    changeAuditStamps=last_modified,
                    cellTitle="second",
                ),
                
            )
        ]
        
        properties: dict[str,str] = {}
        properties = {
            '..': '..'
        }
        
        notebook_info: NotebookInfoClass = NotebookInfoClass(
            title="Janatka Notebook",
            changeAuditStamps=last_modified,
            customProperties=properties,
            externalUrl="",
        )
        
        browse_path: BrowsePathsClass = BrowsePathsClass(
            ["/test/notebook/test/querybook"]
        )
        
        #notebook_key: NotebookKeyClass = NotebookKeyClass(
        #    notebookTool="Zeppelin",
        #    notebookId="Janatka_Test"
        #)
        
        notebook_urn = "urn:li:notebook:(querybook,1234)"
    
        #Construct a MetadataChangeProposalWrapper object with the Notebook aspects.
            
        notebook_info_mce = MetadataChangeProposalWrapper(
            entityType="notebook",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=notebook_urn,
            aspectName="notebookInfo",
            aspect=notebook_info,
        )
        
        notebook_content_mce = MetadataChangeProposalWrapper(
            entityType="notebook",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=notebook_urn,
            aspectName="notebookContent",
            aspect=NotebookContentClass(inputs_notebook),
        )
        
        notebook_path_mce = MetadataChangeProposalWrapper(
            entityType="notebook",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=notebook_urn,
            aspectName="browsePaths",
            aspect=browse_path,
        )
        
        #Emit metadata!
        emitter.emit(notebook_info_mce)
        emitter.emit(notebook_content_mce)
        emitter.emit(notebook_path_mce)
  • a

    ancient-apartment-23316

    09/06/2022, 7:47 AM
    Hello, I have an error while ingesting from Snowflake:
    Copy code
    'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
                   'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                            'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: Failed to validate record with class '
                                          'com.linkedin.dataset.DatasetUsageStatistics: ERROR :: /userCounts/0/user :: "Provided urn urn:li:corpuser:" '
                                          'is invalid\n'
                                          '\n'
                                          '\tat com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$3(AspectResource.java:142)',
                            'message': 'Failed to validate record with class com.linkedin.dataset.DatasetUsageStatistics: ERROR :: /userCounts/0/user :: '
                                       '"Provided urn urn:li:corpuser:" is invalid\n',
                            'status': '422'}}],
    I used the search here and found that I must use the transformers block. What should I add?
    Copy code
    transformers:
      - type: "simple_add_dataset_ownership"
        config:
          owner_urns:
            - "urn:li:corpuser" #like this?
  • a

    adamant-rain-51672

    09/06/2022, 7:48 AM
    Has anyone experienced problems with ingesting data into a DataHub instance deployed on AWS/EKS (following the DataHub tutorial)? I'm having problems ingesting data from both Tableau and Okta. These flows work perfectly fine locally with the same recipes. The only thing that is different is where DataHub is deployed. Has anyone had a similar problem?
  • s

    salmon-angle-92685

    09/06/2022, 8:00 AM
    Hello everyone, ingestion via a YAML recipe adds all the metadata from our data warehouse into DataHub. However, it does not delete the tables that have been dropped from the warehouse. Is there a way of ingesting the new data as well as removing the deprecated entries? Thank you so much!
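    One mechanism that may cover this is stateful ingestion with stale-entity removal (supported by many, but not all, sources; option names per the stateful ingestion docs). A hedged sketch, with the source type and connection details as placeholders:
    Copy code
    # Sketch: with remove_stale_metadata enabled, entities seen in a previous run but missing
    # from the current run are soft-deleted, covering tables dropped from the warehouse.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "pipeline_name": "warehouse_ingestion",   # required so consecutive runs can be compared
            "source": {
                "type": "snowflake",                  # placeholder source type
                "config": {
                    "stateful_ingestion": {
                        "enabled": True,
                        "remove_stale_metadata": True,
                    },
                    # ...connection options omitted...
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
        }
    )
    pipeline.run()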
  • j

    jolly-traffic-67085

    09/06/2022, 8:18 AM
    Hi everyone, I have a question. I connect DataHub to Glue, but I want to ingest only a specific table from the Glue catalog. Can this be configured in the ingestion source? Thanks.
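    If it helps, the glue source supports allow/deny patterns, so a single table can usually be selected with a table_pattern allow regex. A rough sketch (option names per the glue source docs; database and table names are placeholders):
    Copy code
    # Sketch: restrict the glue source to one catalog table via table_pattern.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "glue",
                "config": {
                    "aws_region": "us-east-1",
                    "database_pattern": {"allow": ["^my_database$"]},
                    "table_pattern": {"allow": ["^my_database\\.my_table$"]},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
        }
    )
    pipeline.run()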
  • m

    microscopic-mechanic-13766

    09/06/2022, 8:30 AM
    Hello everyone, one quick question: is it possible to erase a dataset, or all the datasets, from a source? I ask because I initially did an erroneous ingestion from Hive and the datasets are empty, so I want to erase them and ingest them again with the full info. Thanks in advance!