# ingestion
  • r

    refined-energy-76018

    12/17/2022, 1:55 AM
    For the DataHub Airflow plugin, what does the future look like for `customProperties` of `dataProcessInstanceProperties`? Are there plans to add new attributes to `customProperties`, or should new attributes be added by extending the entity? I noticed there are some properties the Airflow API has that are missing when comparing with what `airflow_generator.py` emits; more specifically, comparing this with the screenshot attached.
    • 1
    • 1
  • g

    great-fall-93268

    12/17/2022, 4:37 PM
    Hello, I'm testing the SQL profiling function. I ingested a test table from MySQL successfully. I can get results for Null Count, Null %, Distinct Count, and Distinct %, but the min, max, mean, and median of the columns show as unknown in DataHub. Please let me know if you have any suggestions on this. Thank you.
    • 3
    • 3
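    One thing worth checking for the profiling question above: column-level statistics are controlled by the profiling options in the recipe. A hedged sketch of the relevant section as a Python dict fragment (the same keys go under `profiling:` in a YAML recipe); the option names are taken from DataHub's GE-based profiling config and should be verified against your CLI version's docs:
    Copy code
    # Profiling section of a recipe, as a Python dict fragment (option names assumed
    # from the GE-based profiling config; the same keys go under `profiling:` in YAML).
    profiling_config = {
        "enabled": True,
        "profile_table_level_only": False,   # must be False to get column-level stats
        "include_field_min_value": True,
        "include_field_max_value": True,
        "include_field_mean_value": True,
        "include_field_median_value": True,
    }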
  • b

    brave-lunch-64773

    12/19/2022, 4:25 AM
    How can we extract only the required schemas and tables from Oracle in DataHub? Please share the exact syntax for using `schema_pattern` / `schema_pattern.allow`. The help docs aren't clear on this, and when we follow them we get the error below:
    Copy code
    (oracle): 1 validation error for OracleConfig\n'
               'schema_pattern.allow\n'
               '  extra fields not permitted (type=value_error.extra)\n',
    ✅ 1
    • 2
    • 4
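    For the Oracle question above, that pydantic "extra fields not permitted" error typically means `schema_pattern.allow` was written as one flat, dotted key rather than a nested block. A hedged sketch of the intended nesting, expressed as a programmatic pipeline (the same structure goes under `source:`/`config:` in a YAML recipe); host, credentials, service name, and patterns are placeholders:
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    # Placeholder connection details; allow/deny patterns are regexes.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "oracle",
                "config": {
                    "host_port": "oracle-host:1521",
                    "username": "datahub",
                    "password": "****",
                    "service_name": "ORCLPDB1",
                    "schema_pattern": {"allow": ["^HR$", "^SALES$"]},
                    "table_pattern": {"allow": ["^HR\\.EMPLOYEES$"]},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()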
  • l

    limited-forest-73733

    12/19/2022, 7:49 AM
    Hey team, any plans for a new release, i.e. 0.9.4?
    • 2
    • 2
  • d

    damp-ambulance-34232

    12/19/2022, 8:04 AM
    Hello guys, is there any way to manually add a foreign key from a field of one dataset to a field in another dataset? Thank you.
    • 2
    • 1
  • a

    aloof-energy-17918

    12/19/2022, 9:05 AM
    Hello all, could someone point me in the right direction? I'm looking for a way to programmatically ingest metadata (Python emitter, maybe?). Say, for example, I have a bunch of email reporting going on, and let's assume that each email can be classified as a dashboard or chart. How would I ingest that into DataHub?
    • 2
    • 2
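    For the question above, the Python REST emitter is the usual route for this kind of custom ingestion. A hedged sketch that models one email report as a dashboard; the GMS URL, platform, dashboard name, and title are placeholders, and the class/constructor details should be checked against the installed `acryl-datahub` version:
    Copy code
    import time

    from datahub.emitter.mce_builder import make_dashboard_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        ChangeAuditStampsClass,
        DashboardInfoClass,
    )

    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS URL

    now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:ingestion")

    # Model one email report as a dashboard entity (platform/name are placeholders).
    dashboard_info = DashboardInfoClass(
        title="Weekly revenue email",
        description="Email report sent every Monday morning",
        lastModified=ChangeAuditStampsClass(created=now, lastModified=now),
        charts=[],
    )

    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dashboard_urn(platform="email", name="weekly-revenue"),
            aspect=dashboard_info,
        )
    )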
  • f

    faint-actor-78390

    12/19/2022, 9:37 AM
    Hi Team, I'm trying to connect my Docker-deployed DataHub to an external PostgreSQL DB, but it's not clear how to let a Docker container reach a port in the outside world (port mapping?). I'm trying that in the docker-compose file but am always stuck; I can access the DB perfectly with DBeaver. 'datahub ingest run -c /tmp/datahub/ingest/e957040d-e032-477e-9d4c-8735f67726eb/recipe.yml --report-to ' '/tmp/datahub/ingest/e957040d-e032-477e-9d4c-8735f67726eb/ingestion_report.json\n' '[2022-12-19 084531,524] INFO {datahub.cli.ingest_cli:182} - DataHub CLI version: 0.9.1\n' 'sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "localhost" (127.0.0.1), port 5435 failed: ' 'Connection refused\n' '\tIs the server running on that host and accepting TCP/IP connections?\n' 'connection to server at "localhost" (::1), port 5435 failed: Cannot assign requested address\n' '\tIs the server running on that host and accepting TCP/IP connections?\n' '\n'
    • 3
    • 5
  • b

    bitter-park-52601

    12/19/2022, 10:15 AM
    Datahub API Authorisation Hi everyone, I am struggling to get the OpenAPI to work. I can test it on Swagger and it works fine, but outside the documentation I keep getting a 401 Unauthorised error. What I have done so far: 1. Set the `METADATA_SERVICE_AUTH_ENABLED` environment variable to "true" for the `datahub-gms` AND `datahub-frontend` containers / pods. 2. Granted the privileges `Generate Personal Access Tokens` or `Manage All Access Tokens` to my user. 3. Generated an access token without an expiration date. 4. Tried the following request: curl -X 'GET' '<myserverurl>/openapi/entities/v1/latest?urns=<myurn>' -H 'Authorization: Bearer <my token>' -H 'accept: application/json' Any ideas? 🙂
    • 2
    • 6
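    One detail that can silently break the request above is the curly quotes around the curl arguments; a plain-Python equivalent sidesteps that. A sketch, with the server URL, URN, and token as placeholders:
    Copy code
    import requests

    DATAHUB_GMS = "http://localhost:8080"  # placeholder server URL
    TOKEN = "<personal-access-token>"      # placeholder token
    URN = "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)"  # placeholder URN

    resp = requests.get(
        f"{DATAHUB_GMS}/openapi/entities/v1/latest",
        params={"urns": URN},
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "accept": "application/json",
        },
    )
    resp.raise_for_status()
    print(resp.json())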
  • s

    stocky-truck-96371

    12/19/2022, 2:02 PM
    Hi Team, does anyone have an idea how to create metadata policies with entities matching a certain pattern instead of specifying the entity names, e.g. dataset names starting with 'Stage*'?
    • 2
    • 9
  • r

    rhythmic-church-10210

    12/19/2022, 3:00 PM
    Hey guys, two questions: 1. Does DataHub support manual lineage (like a drag-and-drop model)? 2. Is it possible to link the same datasets? We have a lot of duplicate datasets.
    • 2
    • 5
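    On question 1 above: lineage can be added "manually" with the Python emitter. A hedged sketch using the lineage helper from `mce_builder`; the dataset URNs and GMS URL are placeholders:
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    # Placeholder URNs for two existing datasets.
    upstream = builder.make_dataset_urn("postgres", "public.raw_orders", "PROD")
    downstream = builder.make_dataset_urn("postgres", "public.orders", "PROD")

    # Build a lineage MCE stating that `downstream` is derived from `upstream`.
    lineage_mce = builder.make_lineage_mce([upstream], downstream)

    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS URL
    emitter.emit_mce(lineage_mce)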
  • a

    aloof-lamp-5537

    12/19/2022, 3:27 PM
    Hi, I am trying to implement a custom Kafka schema registry (as outlined here) for Azure EventHubs, but am a bit confused as to how exactly: 1. As far as I can tell, `get_schema_metadata()` is given a Kafka topic and then fetches a single schema. How would I handle topics that hold messages with more than one schema? 2. Do I need to fork the source code, or can I just copy `src/datahub/metadata/schema_classes.py` into my project in order to get the `SchemaMetadata` class? Or is there a better way?
    • 4
    • 12
  • f

    future-florist-65080

    12/19/2022, 9:00 PM
    Hi, I am trying to ingest glossary terms from dbt meta with the automated mappings (https://datahubproject.io/docs/generated/ingestion/sources/dbt/#dbt-meta-automated-mappings). Is it possible to associate these terms with an existing Glossary Term that lives in a hierarchical structure? I have tried using the same string, but this creates a new term rather than associating it with the existing one.
    • 3
    • 4
  • l

    lively-dusk-19162

    12/19/2022, 11:25 PM
    Hello all, I am trying out a data quality rule on a dataset. Can anyone help me with how to implement data quality checks using the Python SDK?
    • 2
    • 1
  • s

    swift-evening-68463

    12/20/2022, 6:59 AM
    Hi everyone, might be a dumb question (and also a dumb functionality), but one of our customers is looking for a data catalog in which he is also able to add a dataset completely manually. So far I couldn't find a way... I wonder if that's possible?
    • 2
    • 1
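    On the question above: a dataset can be registered entirely by hand with the Python emitter, even if no source system backs it. A hedged minimal sketch; platform, name, and GMS URL are placeholders:
    Copy code
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS URL

    # Placeholder platform/name for a hand-registered dataset.
    urn = make_dataset_urn(platform="file", name="manual.customer_emails", env="PROD")

    properties = DatasetPropertiesClass(
        name="customer_emails",
        description="Manually registered dataset with no automated source.",
    )

    emitter.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=properties))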
  • a

    alert-fall-82501

    12/20/2022, 11:29 AM
    Hi Team - I am ingesting metadata from Hive to DataHub, but the jobs are failing with the following exception. Can anybody please help me with this?
    • 3
    • 27
  • a

    alert-fall-82501

    12/20/2022, 11:30 AM
    Copy code
    [2022-12-20, 06:00:12 UTC] {{subprocess.py:74}} INFO - Running command: ['bash', '-c', 'python3 -m datahub ingest -c /usr/local/airflow/dags/dt_datahub/recipes/prod/Hive/hive.yaml']
    [2022-12-20, 06:00:12 UTC] {{subprocess.py:85}} INFO - Output:
    [2022-12-20, 06:00:16 UTC] {{subprocess.py:89}} INFO - [2022-12-20, 06:00:16 UTC] INFO     {datahub.cli.ingest_cli:179} - DataHub CLI version: 0.8.44
    [2022-12-20, 06:00:16 UTC] {{subprocess.py:89}} INFO - [2022-12-20, 06:00:16 UTC] INFO     {datahub.ingestion.run.pipeline:165} - Sink configured successfully. DataHubRestEmitter: configured to talk to <https://datahub-gms.digitalturbine.com:8080>
    [2022-12-20, 06:00:21 UTC] {{subprocess.py:89}} INFO - [2022-12-20, 06:00:21 UTC] INFO     {datahub.ingestion.run.pipeline:190} - Source configured successfully.
    [2022-12-20, 06:00:21 UTC] {{subprocess.py:89}} INFO - [2022-12-20, 06:00:21 UTC] INFO     {datahub.cli.ingest_cli:126} - Starting metadata ingestion
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO - [2022-12-20, 06:00:22 UTC] INFO     {datahub.cli.ingest_cli:134} - Source (hive) report:
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO - {'entities_profiled': '0',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'event_ids': [],
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'events_produced': '0',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'events_produced_per_sec': '0',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'failures': {},
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'filtered': [],
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'read_rate': '0',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'running_time_in_seconds': '0',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'soft_deleted_stale_entities': [],
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'start_time': '2022-12-20 06:00:21.431859',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'tables_scanned': '0',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'views_scanned': '0',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'warnings': {}}
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO - [2022-12-20, 06:00:22 UTC] INFO     {datahub.cli.ingest_cli:137} - Sink (datahub-rest) report:
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO - {'current_time': '2022-12-20 06:00:22.094502',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'failures': [],
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'gms_version': 'v0.8.45',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'pending_requests': '0',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'records_written_per_second': '0',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'start_time': '2022-12-20 06:00:14.207811',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'total_duration_in_seconds': '7.89',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'total_records_written': '0',
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -  'warnings': []}
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO - [2022-12-20, 06:00:22 UTC] ERROR    {datahub.entrypoints:192} -
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO - Traceback (most recent call last):
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -   File "/usr/local/airflow/.local/lib/python3.7/site-packages/datahub/entrypoints.py", line 149, in main
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -     sys.exit(datahub(standalone_mode=False, **kwargs))
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -     return self.main(*args, **kwargs)
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1053, in main
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -     rv = self.invoke(ctx)
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -     return _process_result(sub_ctx.command.invoke(sub_ctx))
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -     return _process_result(sub_ctx.command.invoke(sub_ctx))
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -     return ctx.invoke(self.callback, **ctx.params)
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -     return __callback(*args, **kwargs)
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -   File "/usr/local/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -     return f(get_current_context(), *args, **kwargs)
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -   File "/usr/local/airflow/.local/lib/python3.7/site-packages/datahub/telemetry/telemetry.py", line 347, in wrapper
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -     raise e
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -   File "/usr/local/airflow/.local/lib/python3.7/site-packages/datahub/telemetry/telemetry.py", line 299, in wrapper
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -     res = func(*args, **kwargs)
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -   File "/usr/local/airflow/.local/lib/python3.7/site-packages/datahub/utilities/memory_leak_detector.py", line 102, in wrapper
    [2022-12-20, 06:00:22 UTC] {{subprocess.py:89}} INFO -     return func(*args, **kwargs)
    packages/airflow/operators/bash.py", line 188, in execute
        f'Bash command failed. The command returned a non-zero exit code {result.exit_code}.'
    airflow.exceptions.AirflowException: Bash command failed. The command returned a non-zero exit code 1.
    [2022-12-20, 06:00:23 UTC] {{taskinstance.py:1280}} INFO - Marking task as UP_FOR_RETRY. dag_id=datahub_hive_ingest, task_id=hive_ingest, execution_date=20221219T060000, start_date=20221220T060011, end_date=20221220T060023
    [2022-12-20, 06:00:23 UTC] {{standard_task_runner.py:91}} ERROR - Failed to execute job 85663 for task hive_ingest
  • c

    chilly-spring-43918

    12/20/2022, 12:37 PM
    Hi, I am facing an error when ingesting BigQuery via the UI. Here is my configuration:
    Copy code
    source:
        type: bigquery
        config:
            credential:
                private_key_id: #####key_id#####
                project_id: #####project_id#####
                client_email: #####client_email#####
                private_key: '${stg_pvt_key}'
                client_id: '#####client_d#####
            project_id_pattern:
                allow:
                    - #####bigquery_project#####
    and here is the error
    Copy code
    ⏳ Pipeline running successfully so far; produced 19 events in 7.76 seconds.
    /usr/local/bin/run_ingest.sh: line 40:   376 Killed                  ( datahub ${debug_option} ingest run -c "${recipe_file}" ${report_option} )
    
    2022-12-20 12:18:14.874076 [exec_id=1d5677d6-5b68-4652-87df-9842306804aa] INFO: Failed to execute 'datahub ingest'
    2022-12-20 12:18:14.874434 [exec_id=1d5677d6-5b68-4652-87df-9842306804aa] INFO: Caught exception EXECUTING task_id=1d5677d6-5b68-4652-87df-9842306804aa, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task
        task_event_loop.run_until_complete(task_future)
      File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
        return future.result()
      File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 168, in execute
        raise TaskError("Failed to execute 'datahub ingest'")
    acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
    
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '1d5677d6-5b68-4652-87df-9842306804aa',
    I am using DataHub version v0.9.3 with helm chart version 0.2.120.
    • 5
    • 17
  • b

    best-wire-59738

    12/20/2022, 12:49 PM
    Hi guys, is it possible to add owners to a dashboard during ingestion using transformers?
    ✅ 1
    • 2
    • 3
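    For the question above, ownership is normally attached at ingestion time through the `transformers` section of a recipe. A hedged sketch of that section as a Python dict fragment; note that `simple_add_dataset_ownership` is the dataset-oriented transformer, so whether the stock transformers also cover dashboards should be verified for your version:
    Copy code
    # `transformers` section of a recipe, expressed as a Python dict fragment.
    # Keys follow the documented simple_add_dataset_ownership transformer.
    transformers = [
        {
            "type": "simple_add_dataset_ownership",
            "config": {
                "owner_urns": ["urn:li:corpuser:jdoe"],  # placeholder owner URN
                "ownership_type": "DATAOWNER",
            },
        }
    ]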
  • b

    best-wire-59738

    12/20/2022, 1:41 PM
    Hello Team, I am facing the error "'snowflake.connector.errors.ForbiddenError: 000403: HTTP 403: Forbidden\n'" while ingesting from the UI. I am currently using v0.9.3. The same recipe worked fine in v0.9.2. Could you please help me overcome the issue?
    • 2
    • 4
  • p

    purple-terabyte-64712

    12/20/2022, 1:45 PM
    Hi, is it possible to write a custom file-based ingestion sink? I would like to read out the metadata but save it in a different file format.
    • 3
    • 3
  • l

    limited-forest-73733

    12/20/2022, 2:10 PM
    Hey team, I updated all components to 0.9.3, the ingestion image is pointing to 0.9.3.2, and the CLI version is 0.9.3.2. I am unable to see column-level lineage or dbt snapshots.
    • 2
    • 60
  • m

    microscopic-machine-90437

    12/20/2022, 2:49 PM
    Hello everyone, I want to delete the Tableau metadata I ingested a few days ago. Can someone help me with the deletion using the CLI? I have gone through the documentation but couldn't understand what exactly a URN is.
    • 3
    • 2
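    For context on the question above: a URN is simply the stable identifier string DataHub assigns to each entity, and the Python helpers show its shape. A small sketch with placeholder platform/name; the CLI's `datahub delete` command then takes such URNs (or a platform filter) to remove the ingested metadata:
    Copy code
    from datahub.emitter.mce_builder import make_dataset_urn

    # A dataset URN has the shape:
    #   urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
    urn = make_dataset_urn(platform="tableau", name="my_workbook.my_sheet", env="PROD")
    print(urn)
    # -> urn:li:dataset:(urn:li:dataPlatform:tableau,my_workbook.my_sheet,PROD)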
  • m

    microscopic-mechanic-13766

    12/20/2022, 3:55 PM
    Hello everyone, I am developing an HDFS ingestion source and I want to build it in order to add it to my local DataHub deployment and try it out. Could someone guide me a bit on which command I should use to build it and what I should do to add it to my deployment? Thanks in advance!
    • 3
    • 13
  • f

    faint-tiger-13525

    12/21/2022, 11:02 AM
    Hello Team! Could you please advise whether I can change the dashboard's default link? E.g. for Looker, the dashboard URL contains "Charts", and I need to change it to start from the Documentation tab instead. In other words, I need to open the dashboard from the link https://demo.datahubproject.io/dashboard/urn:li:dashboard:(looker,baz)/Documentation?is_lineage_mode=false instead of https://demo.datahubproject.io/dashboard/urn:li:dashboard:(looker,baz)/Charts?is_lineage_mode=false . Can I do this without changing the core app?
    • 2
    • 1
  • l

    late-ability-59580

    12/21/2022, 12:07 PM
    DBT ingestion results in lowercase db, schema, table. Hi everyone! In my dbt project all resources are named in uppercase (DB.SCHEMA.TABLE), and the target platform is Snowflake. When ingesting dbt I do get entity URNs in uppercase, but they seem to be composed of a dbt model in uppercase and a Snowflake table in lowercase. It's a weird situation where the URN is all upper, but the Snowflake part of the ingested entity appears in lowercase. This is a problem because later, when ingesting Snowflake itself, I end up with two separate entities. Any ideas why this is happening? Is there some flag in the dbt ingestion to force the target platform to be uppercase (like the URN)?
    • 2
    • 5
  • l

    late-ability-59580

    12/21/2022, 1:16 PM
    Snowflake Shares. Hi all, I know that `platform_instance` can be used to differentiate between two Snowflake accounts and allows for identical resource (<db.schema.table>) names in different accounts. My question is about shared databases and tables: is there a way to automatically identify shared entities and provide lineage between them?
    • 2
    • 1
  • b

    bumpy-egg-8563

    12/21/2022, 3:20 PM
    Hello everyone! I'm hoping someone could help me as I'm a bit confused. With v0.8.39, `dbt` and `bigquery` entities have been wrapped into one, user-friendly-looking dataset. So I would expect to see the results of BQ SQL profiling (`Stats`) next to the `dbt tests` results (`Validation`), assuming ingestion was performed using two separate recipes, am I right? If not, could you please give me a hint about what kind of action I should take to make both tabs available? P.S. I'm using `v0.8.44` atm.
    👀 1
    • 4
    • 6
  • a

    abundant-airport-72599

    12/21/2022, 6:54 PM
    Soft-delete vs. Deprecation Hey all, we've been working on adding lineage information to DataHub and I'm trying to figure out the right thing to do about e.g. a DataJob that no longer exists. I've played around with the deprecation feature, and a few things that seemed to be lacking for this use case were:
    • If you're browsing the catalog, there's no visual indicator that something is deprecated unless you click into it, and there's no way to filter deprecated entities out.
    • Searching for "not deprecated" is awkward; `!(deprecated:true)` is the only way I can figure out how to do it, I guess because the deprecated property doesn't exist at all until it's first set to true?
    • The lineage graph visuals give no indication that something downstream is deprecated unless you click on the deprecated thing.
    • The language around deprecation makes it sound like it's meant only as a first step toward removal. For example, if you put a deprecation date that's in the past, the UI still states that the entity is planned to be decommissioned on that past date, not that it already has been, e.g. "Scheduled to be decommissioned on 16/Nov/2022".
    Ideally I'd want deprecated items to be A) excluded by default but toggle-able and/or B) visually indicated as deprecated in all contexts. Should I be soft-deleting instead? Is there a way to explicitly ask to see soft-deleted items in the UI?
    • 2
    • 2
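    On the "should I be soft-deleting instead?" part above: a soft delete is just the `Status` aspect with `removed: true`, which hides the entity from search and browse while keeping its history. A hedged sketch of doing that programmatically; the URN and GMS URL are placeholders:
    Copy code
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import StatusClass

    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS URL

    # Placeholder URN of the DataJob that no longer exists.
    urn = "urn:li:dataJob:(urn:li:dataFlow:(airflow,my_dag,prod),my_task)"

    # removed=True soft-deletes the entity (hidden from search/browse);
    # emitting removed=False later restores it.
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=StatusClass(removed=True)))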
  • h

    helpful-greece-26038

    12/21/2022, 7:04 PM
    Ingestion is removing descriptions from data sets. Currently when I ingest data from Microsoft SQL Server databases, the column-level descriptions are removed and the timeline events show that the existing columns are now treated as if they were new. For example if column A exists in a data set definition, when the ingestion is run again, the timeline event shows that column A is added again. Is there any way to avoid this? I have been experimenting with stateful ingestion settings but that doesn't seem to be the root cause of the issue.
    • 2
    • 2
  • l

    lively-dusk-19162

    12/21/2022, 9:56 PM
    Hello all, can anyone help me out? Is there any Python SDK to ingest data profiling, like dataset usage and query history data, into DataHub?
    ✅ 1
    • 2
    • 3
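    For the question above, profiles and usage statistics are timeseries aspects that can be emitted with the same Python emitter used for other metadata. A hedged sketch that writes a table-level profile (the dataset URN, GMS URL, and numbers are placeholders); usage data follows the same pattern via the `DatasetUsageStatistics` aspect, whose exact fields should be checked in `schema_classes.py`:
    Copy code
    import time

    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetProfileClass

    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS URL
    urn = make_dataset_urn("mysql", "testdb.test_table", "PROD")  # placeholder dataset

    # Table-level profile; field-level stats go in the fieldProfiles list.
    profile = DatasetProfileClass(
        timestampMillis=int(time.time() * 1000),
        rowCount=45000,   # placeholder numbers
        columnCount=12,
    )

    emitter.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=profile))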