# ingestion
  • b

    better-orange-49102

    03/10/2022, 10:18 AM
    For Elasticsearch ingestion, I currently have a bunch of indices (each representing a single day) that share a common alias, but I think the current implementation creates a dataset for each and every index? Is there currently an option to ingest only the alias and have a single common dataset?
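    A minimal recipe sketch for the question above: as far as I know there is no alias-level rollup option, but the elasticsearch source's index_pattern filter can at least drop the per-day indices. The host, the sink server, and the deny regex (which assumes date-suffixed index names) are placeholders, not confirmed values.

    source:
      type: elasticsearch
      config:
        host: "localhost:9200"                      # placeholder
        index_pattern:
          allow: [".*"]
          deny: ["^.*-\\d{4}\\.\\d{2}\\.\\d{2}$"]   # assumption: daily indices end in a date suffix
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"             # placeholder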
  • b

    brief-toothbrush-55766

    03/10/2022, 10:43 AM
    Hi, good people of the DataHub community. We are looking to extend Postgres ingestion support so that DataHub can extract spatial metadata from Postgres (PostGIS-enabled) tables. Our current workflow involves first ingesting metadata for the PG tables into DataHub, and then updating the extracted dataset metadata properties with spatial info that we extract from the same PG tables using a Python script, either programmatically via the Java client API or via the REST API. So far so good. But we would like to wrap all of this up in the ingestion itself, so that ingesting a PG source generates both the normal metadata and the spatial metadata (via a Python extension). Is this possible, and does it make sense?
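    One way to fold the spatial-metadata step into the same ingestion run is a custom transformer referenced from the recipe by its fully qualified class name. The sketch below is only an illustration: the my_company.ingestion module, class, and its config are hypothetical and would wrap the existing PostGIS-querying Python script; host, database, credentials, and the sink server are placeholders.

    source:
      type: postgres
      config:
        host_port: "localhost:5432"   # placeholder
        database: mydb                # placeholder
    transformers:
      # Hypothetical custom transformer that looks up geometry_columns / PostGIS
      # metadata for each dataset and merges it into the dataset properties.
      - type: "my_company.ingestion.postgis_transformer.PostgisPropertiesTransformer"
        config:
          connection_uri: "postgresql://user:pass@localhost:5432/mydb"   # placeholder
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"   # placeholder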
  • n

    nutritious-bird-77396

    03/10/2022, 4:15 PM
    With MWAA supporting Airflow version 2.2.2, has anyone in the community tried to push lineage from MWAA to DataHub? https://docs.aws.amazon.com/mwaa/latest/userguide/airflow-versions.html#airflow-versions-v222
  • b

    billowy-rocket-47022

    03/10/2022, 5:36 PM
    df.write.mode('overwrite').saveAsTable('sparktable')
    Copy code
    22/03/10 09:05:18 ERROR Schema: Failed initialising database.
    Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
    java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@5c09afbc, see the next exception for details.
    	at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    	at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    	at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
  • s

    shy-parrot-64120

    03/10/2022, 6:47 PM
    Hi @lemon-hydrogen-83671, very impressed with your file-based lineage; very handy stuff for an initial data bootstrap. One question from my side regarding this source: does the data file support yaml-anchors? Like this:
    Copy code
    version: 1
    lineage:
      - entity: &dataset
          name: report.payment_reconciliation
          type: dataset
          platform: postgres
          platform_instance: mvp
        upstream:
          - entity: &datajob
              name: report.load_payment_reconciliation
              type: datajob
              platform: postgres
              platform_instance: mvp
      - entity:
          <<: *datajob
          name: report.load_payment_reconciliation
        upstream:
          - entity:
              <<: *dataset
              name: core.payment
          - entity:
              <<: *dataset
              name: core.ph2_transaction
          - entity:
              <<: *dataset
              name: core.ph2_order
    Afaik the answer is no. Do you have any plans to support something like this?
  • b

    better-orange-49102

    03/11/2022, 2:54 AM
    I see a deprecation aspect on datasets now. Is there any difference in their purpose? I.e., status (removed=true) removes the dataset from the UI and search; does deprecation do anything?
  • f

    fierce-waiter-13795

    03/11/2022, 7:01 AM
    Hi, DataHub's ingestion CLI seems to import table/column descriptions if the data platform has that metadata. Is there any way to turn this feature off?
  • m

    mysterious-australia-30101

    03/11/2022, 9:48 AM
    @here if multiple databases need to be ingested by running datahub ingest -c postgreys.yml, how do I define multiple DBs in the YAML file?
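    As far as I know a single postgres recipe connects to one database, so a common workaround is one recipe file per database, each run with datahub ingest -c <file>. A sketch with placeholder values:

    # postgres_db1.yml (repeat the file with database: db2, db3, ...)
    source:
      type: postgres
      config:
        host_port: "my-host:5432"   # placeholder
        database: db1               # change per file
        username: my_user           # placeholder
        password: my_password       # placeholder
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"   # placeholder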
  • m

    mysterious-nail-70388

    03/11/2022, 9:51 AM
    Hi, if I want to use my locally built datahub-ingestion client on another server, how can I move it there and use it without building it again?
  • b

    brief-toothbrush-55766

    03/11/2022, 11:26 AM
    Hi everyone. I know that MinIO ingestion is not currently supported. However, if we wanted to do it, what would the suggestion be: use an S3-like connector? File ingestion? What would you recommend?
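    A hedged sketch of the S3-connector route, assuming the source's aws_endpoint_url can be pointed at MinIO instead of AWS; the bucket layout, credentials, and endpoint are placeholders, and the exact path_spec shape depends on the connector version.

    source:
      type: s3
      config:
        path_spec:
          include: "s3://my-bucket/data/*/*.parquet"     # placeholder layout
        aws_config:
          aws_access_key_id: minio-access-key            # placeholder
          aws_secret_access_key: minio-secret-key        # placeholder
          aws_region: us-east-1
          aws_endpoint_url: "http://minio.internal:9000" # assumption: MinIO endpoint
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"                  # placeholder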
  • c

    careful-insurance-60247

    03/13/2022, 2:55 PM
    We use Cloudflare to protect some of our applications; because of this, we need to set a header in the DataHub recipe to be able to pull data from the source system. Is this currently possible? https://developers.cloudflare.com/cloudflare-one/identity/service-auth/service-tokens/#renew-service-tokens
  • s

    salmon-rose-54694

    03/14/2022, 1:40 AM
    Can I see in the UI when a dataset was ingested into DataHub?
  • g

    green-pencil-45127

    03/14/2022, 1:49 PM
    We want to bring all of the tags from either the tag or meta property inside our dbt documentation into DataHub. After reviewing the example recipe, it looks more like the command is to "do X if Y is detected". While this makes sense for known tags (like PII), ideally we would send all tags from dbt to DataHub without any predefined knowledge or recipe. Any ideas on the syntax to do that?
  • p

    plain-farmer-27314

    03/15/2022, 3:18 PM
    Posting again: BigQuery ingest doesn't seem to pick up materialized view tables. I have include_views set to true and the dataset/table patterns in my allow config. Views are successfully picked up, fwiw.
  • h

    handsome-football-66174

    03/15/2022, 4:53 PM
    Hi everyone, trying to ingest with Kafka as the sink and getting the following:
    Copy code
    [2022-03-15 15:59:16,848] {logging_mixin.py:104} INFO -  Pipeline config is {'source': {'type': 'glue', 'config': {'env': 'PROD', 'aws_region': 'us-east-1', 'extract_transforms': 'false', 'table_pattern': {'allow': ['testdb.*'], 'ignoreCase': 'false'}}}, 'transformers': [{'type': 'simple_remove_dataset_ownership', 'config': {}}, {'type': 'simple_add_dataset_ownership', 'config': {'owner_urns': ['urn:li:corpuser:user1']}}, {'type': 'set_dataset_browse_path', 'config': {'path_templates': ['/Platform/PLATFORM/DATASET_PARTS']}}], 'sink': {'type': 'datahub-kafka', 'config': {'connection': {'bootstrap': 'bootstrapserver:9092', 'schema_registry_url': '<https://schemaregistryurl>'}}}}
    [2022-03-15 16:05:46,022] {pipeline.py:85} ERROR - failed to write record with workunit testdb.person_era with KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"} and info {'error': KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}, 'msg': <cimpl.Message object at 0x7f0863603560>}
    [2022-03-15 16:05:46,078] {taskinstance.py:1482} ERROR - Task failed with exception
    Traceback (most recent call last):
  • g

    gifted-queen-80042

    03/15/2022, 5:52 PM
    Hi team! I would like some more context on the profiling.limit configuration for SQL profiling.
    • Scenario 1: Without this config parameter, profiling runs successfully.
    • Scenario 2: However, upon setting it to, say, 20 rows, I run into an Operational Error:
    Copy code
    sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1044, "Access denied for user '<username>'@'%' to database '<database_name>'")
    [SQL: CREATE TEMPORARY TABLE ge_temp_<temp_table> AS SELECT * 
    FROM <table_name> 
     LIMIT 20]
    (Background on this error at: <http://sqlalche.me/e/13/e3q8>)
    My question is more about how this parameter is implemented. Given that both scenarios above run a SELECT query, why does LIMIT result in an access-denied error, but without LIMIT there's no error?
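    For reference, the stack trace above shows what the limit path does differently: the profiler materialises the sample via CREATE TEMPORARY TABLE ... SELECT ... LIMIT N, which suggests the ingestion user also needs temporary-table privileges that a plain SELECT does not. A recipe sketch of the config in question (all values are placeholders):

    source:
      type: mysql
      config:
        host_port: "my-host:3306"   # placeholder
        database: my_db             # placeholder
        username: profiler_user     # assumption: needs CREATE TEMPORARY TABLES once limit is set
        password: "changeme"        # placeholder
        profiling:
          enabled: true
          limit: 20                 # triggers the temp-table sampling seen in the error above
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"   # placeholder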
  • l

    lemon-terabyte-66903

    03/15/2022, 7:28 PM
    Hi, when using the delta-lake source to ingest S3 parquet files, each part file is shown separately in the UI. How can I avoid that and display only the main file name instead of the individual chunks?
  • p

    plain-farmer-27314

    03/15/2022, 8:47 PM
    Hi all, wondering what the difference between use_v2_audit_metadata = true and false is (for BigQuery usage). Looking at the source, it just seems like it parses a different log version. Are there any tangible differences between the two log types?
  • p

    prehistoric-optician-40107

    03/16/2022, 11:36 AM
    Hi all. I'm trying to learn DataHub and I'm having trouble ingesting metadata via the UI. I was able to ingest my metadata via a YAML file, but not via the UI. These are my execution details. How can I fix this?
    Copy code
    "ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by "
               "NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb58e5d14f0>: Failed to establish a new connection: [Errno 111] "
               "Connection refused'))\n",
               "2022-03-16 11:30:30.325290 [exec_id=d287226a-592b-4029-879a-583a3cfa64eb] INFO: Failed to execute 'datahub ingest'",
               '2022-03-16 11:30:30.325765 [exec_id=d287226a-592b-4029-879a-583a3cfa64eb] INFO: Caught exception EXECUTING '
               'task_id=d287226a-592b-4029-879a-583a3cfa64eb, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 81, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
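    A common cause of this particular error is that the recipe's sink points at localhost:8080, which from inside the UI executor container is the container itself rather than GMS. A sketch of the sink block, assuming a quickstart/docker-compose deployment where the service is named datahub-gms; adjust the host to whatever GMS address is reachable from the executor:

    sink:
      type: datahub-rest
      config:
        server: "http://datahub-gms:8080"   # assumption: compose service name; use your GMS host here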
  • b

    brave-secretary-27487

    03/16/2022, 12:59 PM
    Hey all, is there a way to propagate documentation based on lineage? For example, we have BigQuery views that are well documented. We have just integrated Looker, and there is a lot of resemblance between the BigQuery views and the Looker views. Is there a way to propagate the documentation of the BQ view to the Looker view based on lineage? Or are there other solutions I could use to achieve the same effect?
  • d

    damp-queen-61493

    03/16/2022, 1:00 PM
    Hi everyone! I'm trying to configure the Airflow lineage backend to use a DataHub Kafka sink connection. If it is configured with extra parameters to point to the schema_registry_url, I receive this error:
    Copy code
    [2022-03-16, 12:39:38 UTC] {base.py:79} INFO - Using connection to: id: datahub_kafka_default. Host: prerequisites-kafka.datahub-prereqs-prod.svc.cluster.local:9092, Port: None, Schema: , Login: ***, Password: ***, extra: {'schema_registry_url': '<http://prerequisites-cp-schema-registry.datahub-prereqs-prod.svc.cluster.local:8081>'}
    [2022-03-16, 12:39:38 UTC] {base.py:79} INFO - Using connection to: id: datahub_kafka_default. Host: prerequisites-kafka.datahub-prereqs-prod.svc.cluster.local:9092, Port: None, Schema: , Login: ***, Password: ***, extra: {'schema_registry_url': '<http://prerequisites-cp-schema-registry.datahub-prereqs-prod.svc.cluster.local:8081>'}
    [2022-03-16, 12:39:38 UTC] {datahub.py:122} ERROR - 1 validation error for KafkaSinkConfig
    schema_registry_url
      extra fields not permitted (type=value_error.extra)
    And without the extra, this error:
    Copy code
    [2022-03-16, 12:58:42 UTC] {base.py:79} INFO - Using connection to: id: datahub_kafka_default. Host: prerequisites-kafka.datahub-prereqs-prod.svc.cluster.local:9092, Port: None, Schema: , Login: ***, Password: ***, extra: {}
    [2022-03-16, 12:58:42 UTC] {base.py:79} INFO - Using connection to: id: datahub_kafka_default. Host: prerequisites-kafka.datahub-prereqs-prod.svc.cluster.local:9092, Port: None, Schema: , Login: ***, Password: ***, extra: {}
    [2022-03-16, 12:58:42 UTC] {datahub.py:122} ERROR - KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /subjects/MetadataChangeEvent_v4-value/versions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff97e544a50>: Failed to establish a new connection: [Errno 111] Connection refused'))"}
    [2022-03-16, 12:58:42 UTC] {datahub.py:123} INFO - Supressing error because graceful_exceptions is set
    So, what is the proper way to configure it?
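    For comparison, the recipe form of the Kafka sink (as also seen in the Glue example earlier on this page) nests schema_registry_url under connection, which is what the "extra fields not permitted" validation error is pointing at. A sketch of that shape; whether the lineage backend merges the Airflow connection's Extra JSON at the top level of this config (so that the Extra would need the same {"connection": {...}} nesting) is an assumption worth verifying:

    sink:
      type: datahub-kafka
      config:
        connection:
          bootstrap: "prerequisites-kafka.datahub-prereqs-prod.svc.cluster.local:9092"
          schema_registry_url: "http://prerequisites-cp-schema-registry.datahub-prereqs-prod.svc.cluster.local:8081"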
  • e

    eager-florist-67924

    03/16/2022, 11:27 PM
    Hi, I am trying to use the Java emitter to create a dataflow entity and datajobs linked to it. Those jobs will have relations to datasets as inputs and outputs. However, when I tried to write the dataflow:
    Copy code
    MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
            .entityType("dataflow")
            .entityUrn("urn:li:dataflow:(urn:li:dataPlatform:kafka,trace-pipeline,PROD)")
            .upsert()
            .aspect(new DataFlowInfo()
                    .setName("Trace pipeline")
                    .setDescription("Pipeline for trace service")
            )
            .build();
    I am able to successfully emit it:
    Copy code
    emitter.emit(mcpw, new Callback()
    but then, when executing this GraphQL query:
    Copy code
    graphql query
    {
      search(input: { type: DATA_FLOW, query: "*", start: 0, count: 10 }) {
        start
        count
        total
        searchResults {
          entity{
            urn
            type
            ...on DataFlow {
                cluster
             }
          }
        }
      }
    }
    I get the following error:
    Copy code
    response
    {
      "errors": [
        {
          "message": "The field at path '/search/searchResults[0]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value.  The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult'",
          "path": [
            "search",
            "searchResults",
            0,
            "entity"
          ],
          "extensions": {
            "classification": "NullValueInNonNullableField"
          }
        }
      ],
      "data": {
        "search": null
      }
    }
    So basically, what should such a dataflow entity look like? Did I miss some required fields? And how can I tell from the entities documentation which fields are optional and which are mandatory? Thanks!
  • b

    billowy-book-26360

    03/17/2022, 1:20 AM
    Hey all, has anyone encountered the Databricks Hive ingestion error ValueError: ('# Detailed Table Information', None, None) is not in list? I encounter this for all tables, but all database names are ingested fine.
  • s

    stale-jewelry-2440

    03/17/2022, 1:08 PM
    Hello! I'm trying to ingest validation results from Great Expectations within an Airflow pipeline, but I get a strange error:
    Copy code
    [2022-03-17, 13:35:39 CET] {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL
    Note that the GE-Airflow part works fine, i.e. if I deactivate the action that sends results to DataHub, everything works. I also set the logging level to debug, but nothing interesting is printed out. Any hints?
  • m

    miniature-hair-20451

    03/17/2022, 2:47 PM
    Hi, I'm really new to DataHub. Can you help me please? I'm trying to ingest with the console sink and don't understand how to use it with Kerberos. Kerberos is working fine; I just don't understand the options in the YAML config.
    Copy code
    datahub ingest -c hive_2_datahub.yml
    cat hive_2_datahub.yml 
    source:
      type: hive
      config:
        host_port: rnd-dwh-nn-002.msk.mts.ru:10010
        database: digital_dm
        username: aaplato9
        options.connect_args: 'KERBEROS' 
    
    sink:
        type: "console"
    Error
    Copy code
    Error:
    1 validation error for HiveConfig
    options.auth
      extra fields not permitted (type=value_error.extra)
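    A hedged sketch of the Kerberos part, assuming the Hive source passes options.connect_args through to PyHive; the auth and kerberos_service_name values below are the usual PyHive connect args and may need adjusting for your cluster:

    source:
      type: hive
      config:
        host_port: rnd-dwh-nn-002.msk.mts.ru:10010
        database: digital_dm
        username: aaplato9
        options:
          connect_args:
            auth: KERBEROS
            kerberos_service_name: hive   # assumption: the Hive service principal name
    sink:
      type: "console"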
  • h

    high-family-71209

    03/18/2022, 12:16 PM
    Hi all, I found the Slack/roadmap/docs a bit inconclusive. Can I or can I not ingest Kafka metadata from AWS MSK?
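    A sketch of what an MSK recipe might look like, assuming the kafka source only needs brokers it can reach and that connection.consumer_config is passed through to the underlying Kafka client; the broker address, TLS setting, and schema registry are placeholders (IAM-authenticated clusters would need additional client settings):

    source:
      type: kafka
      config:
        connection:
          bootstrap: "b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094"   # placeholder MSK broker
          schema_registry_url: "http://my-schema-registry:8081"                  # placeholder, if you run one
          consumer_config:
            security.protocol: SSL          # assumption: TLS listener on the MSK cluster
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"     # placeholder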
  • s

    swift-breakfast-25077

    03/18/2022, 1:03 PM
    Hi all, I am trying to ingest metadata from Postgres. When I execute pip install 'acryl-datahub[postgres]' I get this error:
  • g

    green-pencil-45127

    03/18/2022, 1:37 PM
    Hello, me again! We're trying to get DataHub configured correctly for our environment. I noticed today that our dbt ingestion is not encoding sources as nodes; in fact, sources aren't being integrated at all. We use dbt Core (not Cloud), and looking more into how the sources function works (it relies on sources.json), it seems like it might be a Cloud-only feature. Can anyone confirm that this is the case?
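    For what it's worth, sources.json is not Cloud-only: dbt Core writes target/sources.json when dbt source freshness is run, and the dbt recipe can point at it next to the manifest and catalog. Whether that explains the missing source nodes is uncertain, but here is a sketch with placeholder paths:

    source:
      type: dbt
      config:
        manifest_path: "./target/manifest.json"   # placeholder
        catalog_path: "./target/catalog.json"     # placeholder
        sources_path: "./target/sources.json"     # assumption: produced by `dbt source freshness`
        target_platform: bigquery                 # placeholder
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"           # placeholder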
  • t

    thankful-glass-88027

    03/18/2022, 3:41 PM
    For those who are looking to ingest objects like tables from Vertica: install the SQLAlchemy plugins:
    Copy code
    # python3 -m pip install 'acryl-datahub[sqlalchemy]'
    # python3 -m pip install sqlalchemy-vertica-python
    Build the ingestion YAML • vertica_ingest.yaml
    Copy code
    source:
        type: sqlalchemy
        config:
            platform: vertica
            connect_uri: 'vertica+vertica_python://datahub_user:password@1.1.1.1:5433/verticadb'
    sink:
        type: datahub-rest
        config:
            server: 'http://1.1.1.1:8080'
    To ingest via CLI:
    Copy code
    datahub ingest -c vertica_ingest.yaml
    Could the Vertica dialect for SQLAlchemy be added to the official image? :)
  • a

    adamant-laptop-28839

    03/18/2022, 6:01 PM
    Hi everyone, I'm trying to ingest my MSSQL metadata into DataHub with this config:
    source:
      type: mssql
      config:
        uname,pas,port
        database: db_name
        database_alias: db_alias
    But it doesn't use the database_alias (like db_alias.dbo.table); it's still using the database name (db_name.dbo.table). Can anyone help me fix this? Thank you!!