# ingestion
  • g

    green-lion-58215

    11/09/2022, 11:26 PM
    Hello all, I have been working on ingesting Airflow metadata into DataHub, but I am hitting this error. Does anyone know what causes it?
    Copy code
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/airflow/models/dagbag.py", line 256, in process_file
        m = imp.load_source(mod_name, filepath)
      File "/usr/local/lib/python3.7/imp.py", line 171, in load_source
        module = _load(spec)
      File "<frozen importlib._bootstrap>", line 696, in _load
      File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 728, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/usr/local/airflow/dags/marketing/bing_ads/bing_ads_dag.py", line 66, in <module>
        outlets=Dataset("delta-lake", "l1_dev.bing_ads_ads"),
      File "/usr/local/lib/python3.7/site-packages/airflow/utils/decorators.py", line 98, in wrapper
        result = func(*args, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/airflow/contrib/operators/databricks_operator.py", line 448, in __init__
        super(DatabricksRunNowOperator, self).__init__(**kwargs)
      File "/usr/local/lib/python3.7/site-packages/airflow/utils/decorators.py", line 98, in wrapper
        result = func(*args, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 447, in __init__
        self._inlets.update(inlets)
    TypeError: cannot convert dictionary update sequence element #0 to a sequence
    d
    • 2
    • 9
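    For the traceback above, here is a minimal, hedged sketch of how inlets/outlets are usually declared for the DataHub lineage backend on Airflow 1.10.x, where BaseOperator stores them in a dict, so a bare Dataset object cannot be merged in. The task_id and job_id values are placeholders, not taken from the original DAG.
    Copy code
    # Airflow 1.10.x expects dict-shaped inlets/outlets; Airflow 2.x accepts plain lists.
    from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator
    from datahub_provider.entities import Dataset

    bing_ads_task = DatabricksRunNowOperator(
        task_id="bing_ads_ads",  # placeholder
        job_id=12345,            # placeholder Databricks job id
        # Wrap the Dataset in a dict under the "datasets" key instead of passing it bare;
        # on Airflow 2.x this would instead be outlets=[Dataset("delta-lake", "l1_dev.bing_ads_ads")].
        outlets={"datasets": [Dataset("delta-lake", "l1_dev.bing_ads_ads")]},
    )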
  • l

    lively-dusk-19162

    11/10/2022, 1:24 AM
    Hello team, I have written code for column-level lineage using the Python emitter, but it is not showing any columns in DataHub. Here is the attachment of the code. Could anyone please help me?
  • l

    lively-dusk-19162

    11/10/2022, 2:17 AM
    This is a screenshot of the output of the Python SDK for column lineage, from the code in the screenshot above. I am unable to view the columns in the UI. Could anyone please help me with that?
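    Since the code attachment is not readable here, below is a minimal sketch (not the poster's code) of emitting column-level lineage with the Python REST emitter; the platform, table, and column names are placeholders. One thing worth checking, as far as I know: the Lineage tab can only draw column edges for fields that exist in the datasets' schemaMetadata aspect, so if the datasets were emitted without schemas there may be no columns to display.
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetLineageTypeClass,
        FineGrainedLineageClass,
        FineGrainedLineageDownstreamTypeClass,
        FineGrainedLineageUpstreamTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    upstream_urn = builder.make_dataset_urn("hive", "db.source_table")    # placeholder
    downstream_urn = builder.make_dataset_urn("hive", "db.target_table")  # placeholder

    # One column-to-column edge: db.source_table.col_a -> db.target_table.col_a
    fine_grained = FineGrainedLineageClass(
        upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
        upstreams=[builder.make_schema_field_urn(upstream_urn, "col_a")],
        downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
        downstreams=[builder.make_schema_field_urn(downstream_urn, "col_a")],
    )

    lineage_aspect = UpstreamLineageClass(
        upstreams=[UpstreamClass(dataset=upstream_urn, type=DatasetLineageTypeClass.TRANSFORMED)],
        fineGrainedLineages=[fine_grained],
    )

    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS address
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=downstream_urn,
            aspectName="upstreamLineage",
            aspect=lineage_aspect,
        )
    )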
  • r

    rough-fish-51544

    11/10/2022, 7:37 AM
    Hi everyone! I have some questions about PowerBI. We used PowerBI Cloud to upload sources to DataHub, following the configuration example https://datahubproject.io/docs/generated/ingestion/sources/powerbi/#starter-recipe. Here is an example of what we got: the directories where the dashboards themselves are stored are named after the value given in the workspace_id parameter. The question is, can the directories (the workspaces where dashboards are stored) be named not by their identifier (workspace_id), but by the name of the workspace itself, as it was given and as it exists in PowerBI Cloud?
    g
    • 2
    • 1
  • f

    fierce-baker-1392

    11/10/2022, 7:44 AM
    Hello team, I ingested an Avro schema into DataHub and found that some nested fields cannot be expanded. Is this a bug?
    g
    • 2
    • 2
  • s

    stocky-helicopter-7122

    11/10/2022, 12:58 PM
    Hello! I want to ingest metadata from a Hive Metastore with a Postgres backend. I have a URL in the format jdbcpostgresql// and a user/password. However, I don't understand how to use the credentials or which scheme to use for connecting to the Hive Metastore. Could you kindly provide a snippet? Thanks in advance! P.S. My current config is in the thread.
    plus1 1
    a
    • 2
    • 2
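    A hedged sketch of one way this is commonly wired up: as far as I know, the presto-on-hive source can read a Hive Metastore's backing Postgres database directly, without a HiveServer. The host, credentials, and even some field names below are assumptions drawn from the starter recipe, so please verify them against the presto-on-hive source docs.
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "presto-on-hive",
                "config": {
                    "host_port": "metastore-db-host:5432",  # placeholder Postgres host:port
                    "database": "metastore",                # placeholder metastore DB name
                    "username": "hive",                     # placeholder credentials
                    "password": "${HIVE_METASTORE_PASSWORD}",
                    "scheme": "postgresql+psycopg2",        # Postgres backend instead of the default MySQL
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},  # placeholder GMS address
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()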
  • b

    brief-ability-41819

    11/10/2022, 1:31 PM
    Hello, I’d like to ask if AWS MSK metadata ingestion is officially supported. We’re using a very simple recipe:
    Copy code
    source:
        type: kafka
        config:
            schema_registry_class: datahub.ingestion.source.confluent_schema_registry.ConfluentSchemaRegistry
            platform_instance: Kafka
            connection:
                bootstrap: 'bootstrap-server:9092'
    but only 34 out of ~1200 topics are ingested, and those 34 are also “empty” inside. No errors are being thrown 🤔 Thanks in advance!
    plus1 1
    d
    r
    • 3
    • 6
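    For the MSK question above: the kafka source should work against any reachable Kafka cluster as far as I know, and a common reason for "missing" topics is the topic filter rather than MSK itself. Below is a hedged sketch that makes the filter explicit via the programmatic pipeline API; the bootstrap address and the default deny pattern for internal topics are assumptions to double-check against the kafka source docs.
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "kafka",
                "config": {
                    "platform_instance": "Kafka",
                    "connection": {"bootstrap": "bootstrap-server:9092"},  # placeholder
                    # Make the topic filter explicit; internal topics (names starting with "_")
                    # are, I believe, denied by default.
                    "topic_patterns": {
                        "allow": [".*"],
                        "deny": ["^_.*"],
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},  # placeholder
        }
    )
    pipeline.run()
    pipeline.raise_from_status()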
  • p

    proud-accountant-49377

    11/10/2022, 2:14 PM
    Hi! I am having some problems ingesting MLModels via OpenAPI. Apparently the structure of my schema is fine, but I always get a weird error: I am typing MLModelProperties with ML uppercase, yet the error shows the m as if it were lowercase, so it doesn't get recognised as an aspect of my entity. Thanks!!!! 😊
    • 1
    • 1
  • b

    bland-orange-13353

    11/10/2022, 3:41 PM
    This message was deleted.
    l
    h
    • 3
    • 4
  • g

    green-lion-58215

    11/10/2022, 3:45 PM
    Does anybody know what causes this error? I am trying to set up lineage for Airflow; the error happens for all DAGs.
    Copy code
    [2022-11-10 06:01:41,020] {{taskinstance.py:1150}} ERROR - 1 validation error for DatahubLineageConfig
    enabled
      extra fields not permitted (type=value_error.extra)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 990, in _run_raw_task
        task_copy.post_execute(context=context, result=result)
      File "/usr/local/lib/python3.7/site-packages/airflow/lineage/__init__.py", line 82, in wrapper
        outlets=self.outlets, context=context)
      File "/usr/local/lib/python3.7/site-packages/datahub_provider/lineage/datahub.py", line 75, in send_lineage
        config = get_lineage_config()
      File "/usr/local/lib/python3.7/site-packages/datahub_provider/lineage/datahub.py", line 35, in get_lineage_config
        return DatahubLineageConfig.parse_obj(kwargs)
      File "pydantic/main.py", line 526, in pydantic.main.BaseModel.parse_obj
      File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
    pydantic.error_wrappers.ValidationError: 1 validation error for DatahubLineageConfig
    enabled
      extra fields not permitted (type=value_error.extra)
    d
    • 2
    • 20
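    A hedged reading of the pydantic error above: DatahubLineageConfig is built from the datahub_kwargs JSON in airflow.cfg's [lineage] section, and "extra fields not permitted" for enabled usually means the installed DataHub provider predates that option (or the key is misspelled). Below is a sketch of the keys that, to my knowledge, older releases accept; the values are illustrative.
    Copy code
    # Contents of the datahub_kwargs entry under [lineage] in airflow.cfg, shown as a Python dict.
    datahub_kwargs = {
        "datahub_conn_id": "datahub_rest_default",  # Airflow connection pointing at GMS
        "cluster": "prod",
        "capture_ownership_info": True,
        "capture_tags_info": True,
        "graceful_exceptions": True,
        # "enabled": True,  # only recognised by newer provider versions; drop it or upgrade the package
    }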
  • e

    elegant-salesmen-99143

    11/10/2022, 4:15 PM
    Hi. Am I able to cancel the schedule of an existing ingestion that's already scheduled, from the UI? I don't see such an option; I only see changing the schedule, not removing it.
    a
    • 2
    • 1
  • p

    purple-monitor-41675

    11/10/2022, 6:34 PM
    Hi guys, I am trying to test UI ingestion, but I am not getting all the UI features, like the status and the logs. At the moment I cannot see the status of the ingestion job or debug it. I am using DataHub Helm chart version 0.2.109. Any help on what is missing?
    a
    g
    • 3
    • 6
  • l

    lively-dusk-19162

    11/10/2022, 10:21 PM
    Hello all, can anyone help me with how to declare the schema for INSERT INTO ... SELECT queries? For example: 1. I have 3 tables, define schemas for them, and ingest them into DataHub. 2. Next I write some SQL queries, parse them, and ingest column-level lineage into DataHub using fine-grained lineages. 3. Then I write an INSERT INTO ... SELECT query, so the columns in the target table will change. If that is the case, how should the schema be defined?
    a
    g
    • 3
    • 4
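    On the third point, a minimal sketch (placeholder names, not a definitive recipe) of one way to re-declare the target table's schema before emitting fine-grained lineage for an INSERT INTO ... SELECT: emit a schemaMetadata aspect whose fields match the columns the query now produces, then point the column-level lineage at those field URNs.
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        NumberTypeClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    target_urn = builder.make_dataset_urn("hive", "db.target_table")  # placeholder

    # Declare the columns that the INSERT INTO ... SELECT now writes into the target table.
    schema_aspect = SchemaMetadataClass(
        schemaName="db.target_table",
        platform=builder.make_data_platform_urn("hive"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=[
            SchemaFieldClass(
                fieldPath="customer_id",
                type=SchemaFieldDataTypeClass(type=NumberTypeClass()),
                nativeDataType="bigint",
            ),
            SchemaFieldClass(
                fieldPath="customer_name",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="string",
            ),
        ],
    )

    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS address
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=target_urn,
            aspectName="schemaMetadata",
            aspect=schema_aspect,
        )
    )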
  • a

    alert-fall-82501

    11/11/2022, 5:34 AM
    Copy code
    Can anybody suggest anything on the error messages below? I am ingesting the bigquery-beta source.
  • a

    alert-fall-82501

    11/11/2022, 5:34 AM
    Copy code
    /tmp
    [2022-11-10, 06:17:38 UTC] {{subprocess.py:74}} INFO - Running command: ['bash', '-c', 'python3 -m datahub ingest -c /usr/local/airflow/dags/dt_datahub/recipes/prod/bigquery/bmp5.yaml']
    [2022-11-10, 06:17:38 UTC] {{subprocess.py:85}} INFO - Output:
    [2022-11-10, 06:17:40 UTC] {{subprocess.py:89}} INFO - [2022-11-10, 06:17:40 UTC] INFO     {datahub.cli.ingest_cli:179} - DataHub CLI version: 0.8.44
    [2022-11-10, 06:17:40 UTC] {{subprocess.py:89}} INFO - [2022-11-10, 06:17:40 UTC] INFO     {datahub.ingestion.run.pipeline:165} - Sink configured successfully. DataHubRestEmitter: configured to talk to https://datahub-gms.digitalturbine.com:8080
    [2022-11-10, 06:17:44 UTC] {{subprocess.py:89}} INFO - [2022-11-10, 06:17:44 UTC] INFO     {datahub.ingestion.run.pipeline:190} - Source configured successfully.
    [2022-11-10, 06:17:44 UTC] {{subprocess.py:89}} INFO - [2022-11-10, 06:17:44 UTC] INFO     {datahub.cli.ingest_cli:126} - Starting metadata ingestion
    [2022-11-10, 06:18:08 UTC] {{subprocess.py:89}} INFO - [2022-11-10, 06:18:08 UTC] INFO     {datahub.ingestion.source.bigquery_v2.lineage:145} - Populating lineage info via GCP audit logs
    [2022-11-10, 06:18:08 UTC] {{subprocess.py:89}} INFO - [2022-11-10, 06:18:08 UTC] INFO     {datahub.ingestion.source.bigquery_v2.lineage:208} - Start loading log entries from BigQuery start_time=2022-11-08T23:45:00Z and end_time=2022-11-10T06:32:44Z
    [2022-11-10, 06:18:11 UTC] {{subprocess.py:89}} INFO - [2022-11-10, 06:18:11 UTC] INFO     {datahub.ingestion.source.bigquery_v2.lineage:227} - Finished loading 0 log entries from BigQuery so far
    [2022-11-10, 06:18:11 UTC] {{subprocess.py:89}} INFO - [2022-11-10, 06:18:11 UTC] INFO     {datahub.ingestion.source.bigquery_v2.lineage:319} - Parsing BigQuery log entries: number of log entries successfully parsed=0
    [2022-11-10, 06:18:11 UTC] {{subprocess.py:89}} INFO - [2022-11-10, 06:18:11 UTC] INFO     {datahub.ingestion.source.bigquery_v2.lineage:433} - Built lineage map containing 0 entries.
    [2022-11-10, 06:21:00 UTC] {{subprocess.py:93}} INFO - Command exited with return code -9
    [2022-11-10, 06:21:00 UTC] {{taskinstance.py:1703}} ERROR - Task failed with exception
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1332, in _run_raw_task
        self._execute_task_with_callbacks(context)
      File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1458, in _execute_task_with_callbacks
        result = self._execute_task(context, self.task)
      File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1509, in _execute_task
        result = execute_callable(context=context)
      File "/usr/local/lib/python3.7/site-packages/airflow/operators/bash.py", line 188, in execute
        f'Bash command failed. The command returned a non-zero exit code {result.exit_code}.'
    airflow.exceptions.AirflowException: Bash command failed. The command returned a non-zero exit code -9.
    d
    • 2
    • 1
  • f

    full-chef-85630

    11/11/2022, 6:09 AM
    airflow run job error info: @dazzling-judge-80093 datahub: v0.9.0 airflow: 2.4.1
    Copy code
    [2022-11-11T11:10:48.012+0800] {logging_mixin.py:117} INFO - Exception: Traceback (most recent call last):
      File "/opt/miniconda3/envs/xavier/lib/python3.10/site-packages/datahub_airflow_plugin/datahub_plugin.py", line 339, in custom_on_success_callback
        datahub_on_success_callback(context)
      File "/opt/miniconda3/envs/xavier/lib/python3.10/site-packages/datahub_airflow_plugin/datahub_plugin.py", line 192, in datahub_on_success_callback
        inlets = get_inlets_from_task(task, context)
      File "/opt/miniconda3/envs/xavier/lib/python3.10/site-packages/datahub_airflow_plugin/datahub_plugin.py", line 46, in get_inlets_from_task
        and isinstance(task._inlets, list)
    AttributeError: '_PythonDecoratedOperator' object has no attribute '_inlets'
    d
    • 2
    • 6
  • g

    glamorous-library-1322

    11/11/2022, 9:51 AM
    Hey all, I have S3 ingestion set up and my dataset is ingested from a nested folder in a bucket. The URN is:
    urn:li:dataset:(urn:li:dataPlatform:s3,path/to/my/data/my_file_1.csv,PROD)
    , and
    datahub get --urn 'urn:li:dataset:(urn:li:dataPlatform:s3,path/to/my/data/my_file_1.csv,PROD)'
    works well. But when I try to see the timeline with
    datahub timeline -c TECHNICAL_SCHEMA --urn 'urn:li:dataset:(urn:li:dataPlatform:s3,path/to/my/data/my_file_1.csv,PROD)'
    I get an error from Jetty ("status": 404). I'm guessing my URN is wrong, so what should it be? How do I find out? There is no problem with other datasets (whose names contain only '.' as a special character). Help appreciated. DataHub version 0.9.1.
    a
    g
    o
    • 4
    • 6
  • s

    sparse-australia-70466

    11/11/2022, 7:39 PM
    Hey all! I've had some ingestion sources working for a few months now but I noticed they've started erroring with
    Copy code
    "[2022-11-11 19:33:18,895] ERROR    {datahub.ingestion.run.pipeline:127} - s3 is disabled; try running: pip install 'acryl-datahub[s3]'\n"
    This is using the vanilla ingestion functionality offered through the UI, so I'm unsure of where to interject with adding a
    pip install
    command or some other way of providing a custom container image... Any ideas? Thanks!
    • 1
    • 1
  • t

    thousands-branch-81757

    11/14/2022, 3:45 AM
    I have the same problem using DataHub 0.9.0.4 and dbt 1.0.8. Does anyone know how to fix this? Thanks in advance.
    a
    m
    g
    • 4
    • 11
  • s

    stocky-helicopter-7122

    11/14/2022, 7:27 AM
    Hello! I still have an issue with ingesting metadata from a Hive Metastore with a Postgres backend. The metastore is standalone and doesn't have a Hive Server, so the DataHub Hive connector doesn't work. The metastore is connected to Spark via a JDBC URL, but I don't completely understand how to connect it to DataHub. Thanks!
    plus1 1
    d
    • 2
    • 4
  • a

    alert-fall-82501

    11/14/2022, 8:44 AM
    Copy code
    Can anybody suggest anything on this? I am not able to start DataHub using the command python3 -m datahub docker quickstart
  • a

    alert-fall-82501

    11/14/2022, 8:44 AM
    Copy code
    requests.exceptions.SSLError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /datahub-project/datahub/master/docker/quickstart/docker-compose-without-neo4j.quickstart.yml (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
    [2022-11-14 14:11:44,353] ERROR    {datahub.entrypoints:195} - Command failed: 
    	HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /datahub-project/datahub/master/docker/quickstart/docker-compose-without-neo4j.quickstart.yml (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)'))).
    	Run with --debug to get full stacktrace.
    	e.g. 'datahub --debug docker quickstart'
    d
    • 2
    • 3
  • m

    microscopic-mechanic-13766

    11/14/2022, 9:00 AM
    Hello everyone, I am currently working on improving the Spark lineage plugin (to solve the issue mentioned here) and I want to test a few modifications I have made. The thing is, how is this plugin built? If I am not mistaken, it is built with Java 8.
    d
    • 2
    • 1
  • m

    mammoth-gigabyte-6392

    11/14/2022, 9:18 AM
    Can someone please explain how to use S3 path specs during ingestion in the pipeline API? https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs
    h
    • 2
    • 64
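    A hedged sketch of what path_specs can look like when driven through the programmatic pipeline API; the bucket, prefixes, and AWS settings are placeholders, and the path-specs documentation linked above remains the reference.
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "s3",
                "config": {
                    "path_specs": [
                        {
                            # {table} groups all matching files under one dataset; *.csv limits the file type.
                            "include": "s3://my-bucket/landing/{table}/*.csv",
                            "exclude": ["**/_tmp/**"],
                        }
                    ],
                    "aws_config": {"aws_region": "eu-west-1"},  # plus credentials/profile as needed
                    "env": "PROD",
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},  # placeholder
        }
    )
    pipeline.run()
    pipeline.raise_from_status()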
  • a

    alert-petabyte-2924

    11/14/2022, 12:27 PM
    Hi all - I am trying to ingest dbt data to datahub,
    dbt version - 1.2.2
    I am getting the below error when I am running my recipe:
    Copy code
    RUN_INGEST - {'errors': [],
     'exec_id': '70be018f-168c-41e6-b7a3-eb01a9468ff8',
     'infos': ['2022-11-14 12:20:28.468413 [exec_id=70be018f-168c-41e6-b7a3-eb01a9468ff8] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-11-14 12:20:44.744602 [exec_id=70be018f-168c-41e6-b7a3-eb01a9468ff8] INFO: stdout=venv setup time = 0\n'
               'This version of datahub supports report-to functionality\n'
               'datahub  ingest run -c /tmp/datahub/ingest/70be018f-168c-41e6-b7a3-eb01a9468ff8/recipe.yml --report-to '
               '/tmp/datahub/ingest/70be018f-168c-41e6-b7a3-eb01a9468ff8/ingestion_report.json\n'
               '[2022-11-14 12:20:31,186] INFO     {datahub.cli.ingest_cli:167} - DataHub CLI version: 0.9.2\n'
               '[2022-11-14 12:20:43,924] ERROR    {datahub.entrypoints:206} - Command failed: Failed to set up framework context: Failed to connect to '
               'DataHub\n'
               'Traceback (most recent call last):\n'
               '  File "/tmp/datahub/ingest/venv-dbt-0.9.2/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn\n'
               '    conn = connection.create_connection(\n'
               '  File "/tmp/datahub/ingest/venv-dbt-0.9.2/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection\n'
               '    raise err\n'
               '  File "/tmp/datahub/ingest/venv-dbt-0.9.2/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection\n'
               '    sock.connect(sa)\n'
               'ConnectionRefusedError: [Errno 111] Connection refused\n'
               '\n'
               'During handling of the above exception, another exception occurred:\n'
               '\n'
               'Traceback (most recent call last):\n'
    My recipe looks like below:
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'
    source:
      type: dbt
      config:
        manifest_path: /Users/annujoshi/Downloads/artifacts/manifest_file.json
        test_results_path: /Users/annujoshi/Downloads/artifacts/run_results.json
        sources_path: /Users/annujoshi/Downloads/artifacts/sources_file.json
        target_platform: snowflake
        catalog_path: /Users/annujoshi/Downloads/artifacts/catalog_file.json
    d
    • 2
    • 15
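    A hedged note on the ConnectionRefusedError above: when a recipe is executed by the UI ingestion executor, localhost:8080 refers to the executor's own container rather than GMS. Below is a sketch of the sink pointed at an address reachable from that container; the service name is an assumption and depends on the deployment.
    Copy code
    # Replace the sink's server with a GMS address that resolves inside the executor container.
    sink_config = {
        "type": "datahub-rest",
        "config": {
            "server": "http://datahub-gms:8080",  # placeholder: k8s service name or host IP of GMS
        },
    }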
  • b

    bulky-salesclerk-62223

    11/14/2022, 12:35 PM
    Hi all. I am using the latest version of the Snowflake (not legacy) source in the DataHub ingestion CLI. This is my config:
    Copy code
    pipeline_name: "snowflake_metadata"
    
    source:
      type: snowflake
      config:
        ignore_start_time_lineage: true
        account_id: XXX
        warehouse: XXXX
        role: XXX
        include_table_lineage: true
        include_view_lineage: true
        profiling:
          enabled: true
        stateful_ingestion:
          fail_safe_threshold: 100.0
          enabled: true
        username: XXX
        password: '${PASSWORD}'
    I have given my role
    grant imported privileges on database snowflake to role XXX;
    but the "Queries" tab is still greyed out. The CLI Logs also have "include_usage_stats" set to True. Any ideas why it's not working?
    a
    g
    h
    • 4
    • 6
  • a

    alert-fall-82501

    11/14/2022, 1:43 PM
    Hi team, just a quick question: I am working on the DataHub Actions framework to send notifications for changes in metadata over email and a Slack channel. I want to be clear about how long we can keep a DataHub Actions pipeline running.
    d
    • 2
    • 1
  • s

    square-ocean-28447

    11/14/2022, 1:51 PM
    Hi everyone, I would like to ask: is there a way to programmatically sync the metadata of BigQuery tables to DataHub? Currently our flow is: whenever I emit some column-level lineage (using FineGrainedLineage via the Java emitter) from our Beam streaming job, we have to run a separate Airflow job to sync the schema/metadata using
    datahub ingest -c recipe.yaml
    From there, after the Airflow job completes, the rendered lineage changes from dataset to table, with the corresponding schema and relationship. I was wondering if there's already a way to bundle those 2 separate steps programmatically?
    a
    • 2
    • 4
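    On the question above, a hedged sketch of the Python side: the ingestion framework exposes a programmatic pipeline API, so the recipe run can be triggered from code (for example right after the lineage emission step finishes) instead of shelling out to datahub ingest -c recipe.yaml from a separate Airflow job. The source config keys below are placeholders; reuse whatever recipe.yaml already contains.
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    def sync_bigquery_metadata() -> None:
        # Same content as recipe.yaml, expressed as a dict.
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "bigquery",
                    "config": {"project_id": "my-gcp-project"},  # placeholder
                },
                "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},  # placeholder
            }
        )
        pipeline.run()
        pipeline.raise_from_status()

    # Call sync_bigquery_metadata() after the lineage emission completes so both steps run in one job.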
  • p

    purple-monitor-41675

    11/14/2022, 2:16 PM
    Hi guys, could someone help me with this please https://datahubspace.slack.com/archives/CUMUWQU66/p1668105252326929
  • g

    green-lion-58215

    11/14/2022, 9:15 PM
    Hello all, I am ingesting Airflow metadata into DataHub version 0.8.44, but I am seeing this weird behaviour. Is it expected? 1. I have a DAG with 5 tasks. 2. I ingested the metadata through the backend lineage model. 3. I updated the DAG, and now it has 3 tasks. 4. I re-triggered the DAG and ingested the metadata again. 5. DataHub still shows the old tasks even though I removed them. I am confused: why is it still showing tasks that are no longer present?
    m
    • 2
    • 3