# ingestion
  • b

    better-orange-49102

    03/10/2022, 10:18 AM
    For Elasticsearch ingestion, I currently have a bunch of indices (each representing a single day) that share a common alias, but I think the current implementation creates a dataset for each and every index? Is there currently an option to ingest only the alias and have a single common dataset?
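    A minimal recipe sketch for the question above: as far as I know there is no alias-level rollup option, but the elasticsearch source's index_pattern filter can at least drop the per-day indices. The host, the sink server, and the deny regex (which assumes date-suffixed index names) are placeholders, not confirmed values.

    source:
      type: elasticsearch
      config:
        host: "localhost:9200"                      # placeholder
        index_pattern:
          allow: [".*"]
          deny: ["^.*-\\d{4}\\.\\d{2}\\.\\d{2}$"]   # assumption: daily indices end in a date suffix
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"             # placeholder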
  • b

    brief-toothbrush-55766

    03/10/2022, 10:43 AM
    Hi, good people of the DataHub community. We are looking to extend Postgres ingestion support so that DataHub can extract spatial metadata from Postgres (PostGIS-enabled) tables. Our current workflow involves first ingesting metadata for the PG tables into DataHub, and then updating the extracted dataset metadata properties with spatial info that we extract from the same PG tables using a Python script, either programmatically via the Java client API or via the REST API. So far so good. But we would like to wrap all of this up in the ingestion itself, so that ingesting a PG source generates both the normal metadata and the spatial metadata (via a Python extension). Is this possible, and does it make sense?
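    One way to fold the spatial-metadata step into the same ingestion run is a custom transformer referenced from the recipe by its fully qualified class name. The sketch below is only an illustration: the my_company.ingestion module, class, and its config are hypothetical and would wrap the existing PostGIS-querying Python script; host, database, credentials, and the sink server are placeholders.

    source:
      type: postgres
      config:
        host_port: "localhost:5432"   # placeholder
        database: mydb                # placeholder
    transformers:
      # Hypothetical custom transformer that looks up geometry_columns / PostGIS
      # metadata for each dataset and merges it into the dataset properties.
      - type: "my_company.ingestion.postgis_transformer.PostgisPropertiesTransformer"
        config:
          connection_uri: "postgresql://user:pass@localhost:5432/mydb"   # placeholder
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"   # placeholder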
  • n

    nutritious-bird-77396

    03/10/2022, 4:15 PM
    With MWAA supporting Airflow version 2.2.2, has anyone in the community tried to push lineage from MWAA to DataHub? https://docs.aws.amazon.com/mwaa/latest/userguide/airflow-versions.html#airflow-versions-v222
  • b

    billowy-rocket-47022

    03/10/2022, 5:36 PM
    df.write.mode('overwrite').saveAsTable('sparktable')
    Copy code
    22/03/10 09:05:18 ERROR Schema: Failed initialising database.
    Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
    java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@5c09afbc, see the next exception for details.
    	at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    	at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    	at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
  • s

    shy-parrot-64120

    03/10/2022, 6:47 PM
    Hi @lemon-hydrogen-83671, very impressed with your file-based lineage; very handy stuff for an initial data bootstrap. One question from my side regarding this source: does the data file support yaml-anchors? Like this:
    Copy code
    version: 1
    lineage:
      - entity: &dataset
          name: report.payment_reconciliation
          type: dataset
          platform: postgres
          platform_instance: mvp
        upstream:
          - entity: &datajob
              name: report.load_payment_reconciliation
              type: datajob
              platform: postgres
              platform_instance: mvp
      - entity:
          <<: *datajob
          name: report.load_payment_reconciliation
        upstream:
          - entity:
              <<: *dataset
              name: core.payment
          - entity:
              <<: *dataset
              name: core.ph2_transaction
          - entity:
              <<: *dataset
              name: core.ph2_order
    Afaik the answer is no. Do you have any plans to support something like this?
  • b

    better-orange-49102

    03/11/2022, 2:54 AM
    I see a deprecation aspect on datasets now. Is there any difference in their purpose? I.e., status (removed=true) removes the dataset from the UI and search; does deprecation do anything?
  • f

    fierce-waiter-13795

    03/11/2022, 7:01 AM
    Hi, DataHub's ingestion CLI seems to import table/column descriptions if the data platform has that metadata. Is there any way to turn this feature off?
  • m

    mysterious-australia-30101

    03/11/2022, 9:48 AM
    @here if multiple databases need to be ingested by running datahub ingest -c postgreys.yml, how do I define multiple DBs in the YAML file?
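    As far as I know a single postgres recipe connects to one database, so a common workaround is one recipe file per database, each run with datahub ingest -c <file>. A sketch with placeholder values:

    # postgres_db1.yml (repeat the file with database: db2, db3, ...)
    source:
      type: postgres
      config:
        host_port: "my-host:5432"   # placeholder
        database: db1               # change per file
        username: my_user           # placeholder
        password: my_password       # placeholder
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"   # placeholder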
  • m

    mysterious-nail-70388

    03/11/2022, 9:51 AM
    Hi, if I want to use my locally built datahub-ingestion client on another server, how can I move it there and use it without building it again?
  • b

    brief-toothbrush-55766

    03/11/2022, 11:26 AM
    Hi everyone. I know that MinIO ingestion is not currently supported. However, if we wanted to do it, what would the suggestion be: use an S3-like connector? File ingestion? What would you recommend?
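    A hedged sketch of the S3-connector route, assuming the source's aws_endpoint_url can be pointed at MinIO instead of AWS; the bucket layout, credentials, and endpoint are placeholders, and the exact path_spec shape depends on the connector version.

    source:
      type: s3
      config:
        path_spec:
          include: "s3://my-bucket/data/*/*.parquet"     # placeholder layout
        aws_config:
          aws_access_key_id: minio-access-key            # placeholder
          aws_secret_access_key: minio-secret-key        # placeholder
          aws_region: us-east-1
          aws_endpoint_url: "http://minio.internal:9000" # assumption: MinIO endpoint
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"                  # placeholder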
  • c

    careful-insurance-60247

    03/13/2022, 2:55 PM
    We use Cloudflare to protect some of our applications; because of this, we need to set a header in the DataHub recipe to be able to pull data from the source system. Is this currently possible? https://developers.cloudflare.com/cloudflare-one/identity/service-auth/service-tokens/#renew-service-tokens
  • s

    salmon-rose-54694

    03/14/2022, 1:40 AM
    Can I see in the UI when a dataset was ingested into DataHub?
  • g

    green-pencil-45127

    03/14/2022, 1:49 PM
    We want to bring all of the tags from either the tag or meta property inside our dbt documentation into DataHub. After reviewing the example recipe, it looks more like the command is to "do X if Y is detected". While this makes sense for known tags (like PII), ideally we would send all tags from dbt to DataHub without any predefined knowledge or recipe. Any ideas on the syntax to do that?
  • p

    plain-farmer-27314

    03/15/2022, 3:18 PM
    Posting again: BigQuery ingest doesn't seem to pick up materialized view tables. I have include_views set to true and the dataset/table patterns in my allow config. Views are successfully picked up, fwiw.
  • h

    handsome-football-66174

    03/15/2022, 4:53 PM
    Hi everyone, trying to ingest with Kafka as the sink and getting the following:
    Copy code
    [2022-03-15 15:59:16,848] {logging_mixin.py:104} INFO -  Pipeline config is {'source': {'type': 'glue', 'config': {'env': 'PROD', 'aws_region': 'us-east-1', 'extract_transforms': 'false', 'table_pattern': {'allow': ['testdb.*'], 'ignoreCase': 'false'}}}, 'transformers': [{'type': 'simple_remove_dataset_ownership', 'config': {}}, {'type': 'simple_add_dataset_ownership', 'config': {'owner_urns': ['urn:li:corpuser:user1']}}, {'type': 'set_dataset_browse_path', 'config': {'path_templates': ['/Platform/PLATFORM/DATASET_PARTS']}}], 'sink': {'type': 'datahub-kafka', 'config': {'connection': {'bootstrap': 'bootstrapserver:9092', 'schema_registry_url': '<https://schemaregistryurl>'}}}}
    [2022-03-15 16:05:46,022] {pipeline.py:85} ERROR - failed to write record with workunit testdb.person_era with KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"} and info {'error': KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}, 'msg': <cimpl.Message object at 0x7f0863603560>}
    [2022-03-15 16:05:46,078] {taskinstance.py:1482} ERROR - Task failed with exception
    Traceback (most recent call last):
  • g

    gifted-queen-80042

    03/15/2022, 5:52 PM
    Hi team! I would like some more context on the profiling.limit configuration for SQL profiling.
    • Scenario 1: Without this config parameter, profiling runs successfully.
    • Scenario 2: However, upon setting it to, say, 20 rows, I run into an Operational Error:
    Copy code
    sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1044, "Access denied for user '<username>'@'%' to database '<database_name>'")
    [SQL: CREATE TEMPORARY TABLE ge_temp_<temp_table> AS SELECT * 
    FROM <table_name> 
     LIMIT 20]
    (Background on this error at: <http://sqlalche.me/e/13/e3q8>)
    My question is more about how this parameter is implemented. Given that both scenarios above run a SELECT query, why does LIMIT result in an access-denied error, but without LIMIT there's no error?
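    For reference, the stack trace above shows what the limit path does differently: the profiler materialises the sample via CREATE TEMPORARY TABLE ... SELECT ... LIMIT N, which suggests the ingestion user also needs temporary-table privileges that a plain SELECT does not. A recipe sketch of the config in question (all values are placeholders):

    source:
      type: mysql
      config:
        host_port: "my-host:3306"   # placeholder
        database: my_db             # placeholder
        username: profiler_user     # assumption: needs CREATE TEMPORARY TABLES once limit is set
        password: "changeme"        # placeholder
        profiling:
          enabled: true
          limit: 20                 # triggers the temp-table sampling seen in the error above
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"   # placeholder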
  • l

    lemon-terabyte-66903

    03/15/2022, 7:28 PM
    Hi, when using the delta-lake source to ingest S3 parquet files, each part file is shown separately in the UI. How can I avoid that and display only the main file name instead of the individual chunks?
  • p

    plain-farmer-27314

    03/15/2022, 8:47 PM
    Hi all, wondering what the difference between use_v2_audit_metadata = true and false is (for BigQuery usage). Looking at the source, it just seems like it parses a different log version. Are there any tangible differences between the two log types?
  • p

    prehistoric-optician-40107

    03/16/2022, 11:36 AM
    Hi all. I'm trying to learn DataHub and I'm having trouble ingesting metadata via the UI. I was able to ingest my metadata via a YAML file, but not via the UI. These are my execution details. How can I fix this?
    Copy code
    "ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by "
               "NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb58e5d14f0>: Failed to establish a new connection: [Errno 111] "
               "Connection refused'))\n",
               "2022-03-16 11:30:30.325290 [exec_id=d287226a-592b-4029-879a-583a3cfa64eb] INFO: Failed to execute 'datahub ingest'",
               '2022-03-16 11:30:30.325765 [exec_id=d287226a-592b-4029-879a-583a3cfa64eb] INFO: Caught exception EXECUTING '
               'task_id=d287226a-592b-4029-879a-583a3cfa64eb, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 81, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
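    A common cause of this particular error is that the recipe's sink points at localhost:8080, which from inside the UI executor container is the container itself rather than GMS. A sketch of the sink block, assuming a quickstart/docker-compose deployment where the service is named datahub-gms; adjust the host to whatever GMS address is reachable from the executor:

    sink:
      type: datahub-rest
      config:
        server: "http://datahub-gms:8080"   # assumption: compose service name; use your GMS host here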
  • b

    brave-secretary-27487

    03/16/2022, 12:59 PM
    Hey all, is there a way to propagate documentation based on lineage? For example, we have BigQuery views that are well documented. We have just integrated Looker, and there is a lot of resemblance between the BigQuery views and the Looker views. Is there a way to propagate the documentation of the BQ view to the Looker view based on lineage? Or are there other solutions I could use to achieve the same effect?
  • d

    damp-queen-61493

    03/16/2022, 1:00 PM
    Hi everyone! I'm trying to configure the Airflow lineage backend to use a DataHub Kafka sink connection. If it is configured with extra parameters to point to the schema_registry_url, I receive this error:
    Copy code
    [2022-03-16, 12:39:38 UTC] {base.py:79} INFO - Using connection to: id: datahub_kafka_default. Host: prerequisites-kafka.datahub-prereqs-prod.svc.cluster.local:9092, Port: None, Schema: , Login: ***, Password: ***, extra: {'schema_registry_url': '<http://prerequisites-cp-schema-registry.datahub-prereqs-prod.svc.cluster.local:8081>'}
    [2022-03-16, 12:39:38 UTC] {base.py:79} INFO - Using connection to: id: datahub_kafka_default. Host: prerequisites-kafka.datahub-prereqs-prod.svc.cluster.local:9092, Port: None, Schema: , Login: ***, Password: ***, extra: {'schema_registry_url': '<http://prerequisites-cp-schema-registry.datahub-prereqs-prod.svc.cluster.local:8081>'}
    [2022-03-16, 12:39:38 UTC] {datahub.py:122} ERROR - 1 validation error for KafkaSinkConfig
    schema_registry_url
      extra fields not permitted (type=value_error.extra)
    And without the extra, this error:
    Copy code
    [2022-03-16, 12:58:42 UTC] {base.py:79} INFO - Using connection to: id: datahub_kafka_default. Host: prerequisites-kafka.datahub-prereqs-prod.svc.cluster.local:9092, Port: None, Schema: , Login: ***, Password: ***, extra: {}
    [2022-03-16, 12:58:42 UTC] {base.py:79} INFO - Using connection to: id: datahub_kafka_default. Host: prerequisites-kafka.datahub-prereqs-prod.svc.cluster.local:9092, Port: None, Schema: , Login: ***, Password: ***, extra: {}
    [2022-03-16, 12:58:42 UTC] {datahub.py:122} ERROR - KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /subjects/MetadataChangeEvent_v4-value/versions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff97e544a50>: Failed to establish a new connection: [Errno 111] Connection refused'))"}
    [2022-03-16, 12:58:42 UTC] {datahub.py:123} INFO - Supressing error because graceful_exceptions is set
    So, what is the proper way to configure it?
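    For comparison, the recipe form of the Kafka sink (as also seen in the Glue example earlier on this page) nests schema_registry_url under connection, which is what the "extra fields not permitted" validation error is pointing at. A sketch of that shape; whether the lineage backend merges the Airflow connection's Extra JSON at the top level of this config (so that the Extra would need the same {"connection": {...}} nesting) is an assumption worth verifying:

    sink:
      type: datahub-kafka
      config:
        connection:
          bootstrap: "prerequisites-kafka.datahub-prereqs-prod.svc.cluster.local:9092"
          schema_registry_url: "http://prerequisites-cp-schema-registry.datahub-prereqs-prod.svc.cluster.local:8081"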
  • e

    eager-florist-67924

    03/16/2022, 11:27 PM
    Hi, I am trying to use the Java emitter to create a dataflow entity and datajobs linked to it. Those jobs will have relations to datasets as inputs and outputs. However, when I tried to write the dataflow:
    Copy code
    MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
            .entityType("dataflow")
            .entityUrn("urn:li:dataflow:(urn:li:dataPlatform:kafka,trace-pipeline,PROD)")
            .upsert()
            .aspect(new DataFlowInfo()
                    .setName("Trace pipeline")
                    .setDescription("Pipeline for trace service")
            )
            .build();
    I am able to successfully emit it:
    Copy code
    emitter.emit(mcpw, new Callback()
    but then, when executing this GraphQL query:
    Copy code
    graphql query
    {
      search(input: { type: DATA_FLOW, query: "*", start: 0, count: 10 }) {
        start
        count
        total
        searchResults {
          entity{
            urn
            type
            ...on DataFlow {
                cluster
             }
          }
        }
      }
    }
    I get the following error:
    Copy code
    response
    {
      "errors": [
        {
          "message": "The field at path '/search/searchResults[0]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'",
          "path": [
            "search",
            "searchResults",
            0,
            "entity"
          ],
          "extensions": {
            "classification": "NullValueInNonNullableField"
          }
        }
      ],
      "data": {
        "search": null
      }
    }
    So basically, what should such a dataflow entity look like? Did I miss some required fields? And how can I tell from the entities documentation which fields are optional and which are mandatory? Thanks!
  • b

    billowy-book-26360

    03/17/2022, 1:20 AM
    Hey all, has anyone encountered the Databricks Hive ingestion error ValueError: ('# Detailed Table Information', None, None) is not in list? I encounter this for all tables, but all database names are ingested fine.
  • s

    stale-jewelry-2440

    03/17/2022, 1:08 PM
    Hello! I'm trying to ingest validation results from Great Expectations within an Airflow pipeline, but I get a strange error:
    Copy code
    [2022-03-17, 13:35:39 CET] {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL
    Note that the GE-Airflow part works fine, i.e. if I deactivate the action that sends results to DataHub, everything works. I also set the logging level to debug, but nothing interesting is printed out. Any hints?
  • m

    miniature-hair-20451

    03/17/2022, 2:47 PM
    Hi, I'm really new to DataHub. Can you help me please? I'm trying to ingest with the console sink and don't understand how to use it with Kerberos. Kerberos is working fine; I just don't understand the options in the YAML config.
    Copy code
    datahub ingest -c hive_2_datahub.yml
    cat hive_2_datahub.yml 
    source:
      type: hive
      config:
        host_port: rnd-dwh-nn-002.msk.mts.ru:10010
        database: digital_dm
        username: aaplato9
        options.connect_args: 'KERBEROS' 
    
    sink:
        type: "console"
    Error
    Copy code
    Error:
    1 validation error for HiveConfig
    options.auth
      extra fields not permitted (type=value_error.extra)
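    A hedged sketch of the Kerberos part, assuming the Hive source passes options.connect_args through to PyHive; the auth and kerberos_service_name values below are the usual PyHive connect args and may need adjusting for your cluster:

    source:
      type: hive
      config:
        host_port: rnd-dwh-nn-002.msk.mts.ru:10010
        database: digital_dm
        username: aaplato9
        options:
          connect_args:
            auth: KERBEROS
            kerberos_service_name: hive   # assumption: the Hive service principal name
    sink:
      type: "console"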
  • h

    high-family-71209

    03/18/2022, 12:16 PM
    Hi all, I found the Slack/roadmap/docs a bit inconclusive. Can I or can I not ingest Kafka metadata from AWS MSK?
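    A sketch of what an MSK recipe might look like, assuming the kafka source only needs brokers it can reach and that connection.consumer_config is passed through to the underlying Kafka client; the broker address, TLS setting, and schema registry are placeholders (IAM-authenticated clusters would need additional client settings):

    source:
      type: kafka
      config:
        connection:
          bootstrap: "b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094"   # placeholder MSK broker
          schema_registry_url: "http://my-schema-registry:8081"                  # placeholder, if you run one
          consumer_config:
            security.protocol: SSL          # assumption: TLS listener on the MSK cluster
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"     # placeholder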
  • s

    swift-breakfast-25077

    03/18/2022, 1:03 PM
    Hi all, I am trying to ingest metadata from Postgres. When I execute pip install 'acryl-datahub[postgres]' I get this error:
  • g

    green-pencil-45127

    03/18/2022, 1:37 PM
    Hello, me again! We're trying to get DataHub configured correctly for our environment. I noticed today that our dbt ingestion is not encoding sources as nodes; in fact, sources aren't being integrated at all. We use dbt Core (not Cloud), and looking more into how the sources function works (it relies on sources.json), it seems like it might be a Cloud-only feature. Can anyone confirm that this is the case?
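    For what it's worth, sources.json is not Cloud-only: dbt Core writes target/sources.json when dbt source freshness is run, and the dbt recipe can point at it next to the manifest and catalog. Whether that explains the missing source nodes is uncertain, but here is a sketch with placeholder paths:

    source:
      type: dbt
      config:
        manifest_path: "./target/manifest.json"   # placeholder
        catalog_path: "./target/catalog.json"     # placeholder
        sources_path: "./target/sources.json"     # assumption: produced by `dbt source freshness`
        target_platform: bigquery                 # placeholder
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"           # placeholder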
  • t

    thankful-glass-88027

    03/18/2022, 3:41 PM
    For those who are looking to ingest objects like tables from Vertica: install the SQLAlchemy plugins:
    Copy code
    # python3 -m pip install 'acryl-datahub[sqlalchemy]'
    # python3 -m pip install sqlalchemy-vertica-python
    Build the ingestion YAML • vertica_ingest.yaml
    Copy code
    source:
        type: sqlalchemy
        config:
            platform: vertica
            connect_uri: 'vertica+vertica_python://datahub_user:password@1.1.1.1:5433/verticadb'
    sink:
        type: datahub-rest
        config:
            server: 'http://1.1.1.1:8080'
    To ingest via CLI:
    Copy code
    datahub ingest -c vertica_ingest.yaml
    Could the Vertica dialect for SQLAlchemy be added to the official image? :)
  • a

    adamant-laptop-28839

    03/18/2022, 6:01 PM
    Hi everyone, I'm trying to ingest my MSSQL metadata into DataHub with this config:
    source:
      type: mssql
      config:
        uname,pas,port
        database: db_name
        database_alias: db_alias
    But it doesn't use the database_alias (like db_alias.dbo.table); it's still using the database name (db_name.dbo.table). Can anyone help me fix this? Thank you!!