# ingestion
  • f

    fierce-agent-11572

    05/31/2023, 8:46 AM
Hello, can I use the DataHub CLI to ingest data from a remote server?
    ✅ 1
    b
    • 2
    • 11
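For the question above: the CLI itself can run anywhere; the recipe's source points at the remote system and the sink points at your DataHub server. A minimal sketch, assuming a MySQL source and a reachable GMS endpoint (hostnames and credentials are placeholders):
Copy code
# recipe.yml -- run with: datahub ingest -c recipe.yml
source:
  type: mysql
  config:
    host_port: remote-db.example.com:3306   # remote source reachable from wherever the CLI runs
    username: ingest_user
    password: "${MYSQL_PASSWORD}"
sink:
  type: datahub-rest
  config:
    server: http://datahub-gms.example.com:8080   # remote DataHub GMS endpoint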
  • p

    proud-school-44110

    05/31/2023, 9:19 AM
Hello Team, I have opened the GitHub issue below to address the Oracle ingestion bug when using Service Name, as discussed in Office Hours on 30th May: https://github.com/datahub-project/datahub/issues/8148
    ✅ 1
  • b

    billions-rose-75566

    05/31/2023, 12:44 PM
    Hi all! I have a question regarding Kafka ingestion. We have a secure schema registry proxy, which needs certs to create a successful connection (like this curl -s -k -X GET $registry/schemas/types --cacert ca.crt --cert client.pem --key client.key). How can we configure this in our recipe?
    ✅ 1
    g
    • 2
    • 3
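For the schema-registry certificate question above, one possibility is to pass the client certificates through the kafka source's schema-registry settings, which are handed to the underlying Confluent SchemaRegistryClient. A sketch only; paths are placeholders and the `schema_registry_config` option name should be checked against your CLI version:
Copy code
source:
  type: kafka
  config:
    connection:
      bootstrap: broker.example.com:9092
      schema_registry_url: https://registry-proxy.example.com
      schema_registry_config:
        ssl.ca.location: /path/to/ca.crt                # --cacert
        ssl.certificate.location: /path/to/client.pem   # --cert
        ssl.key.location: /path/to/client.key           # --key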
  • p

    powerful-tent-14193

    05/31/2023, 1:32 PM
Hi, I have ingested metadata from our Kafka cluster into DataHub, but unfortunately I cannot see any lineage for the topics. I want to know how many upstream and downstream dependencies my topics have. Is there any way to fix that?
    a
    • 2
    • 3
  • p

    powerful-tent-14193

    05/31/2023, 1:53 PM
Hi Team, sorry, I still have one more question. I tried to ingest data from Druid into DataHub with the following recipe:
Copy code
source:
  type: druid
  config:
    host_port: 'imply-query.bi-druid:8888'
Unfortunately the ingestion fails every time. Can someone help me with this? I uploaded the logs as well.
    exec-urn_li_dataHubExecutionRequest_4517e2ae-30a5-4537-92a5-47783c41f829 (1).log
    ✅ 1
    d
    • 2
    • 2
  • l

    late-arm-1146

    06/01/2023, 9:41 AM
Hi everyone, I am using `csv-enricher` with v0.8.45. I am unable to get an existing domain added to a dataset. I have checked the URN for the domain created through the UI, and the CSV ingestion does not throw an error either. Is there anything I might be missing?
    ✅ 1
    g
    • 2
    • 7
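For reference on the csv-enricher question above, a minimal recipe sketch; the filename is a placeholder, and as far as I recall the domain column in the CSV is expected to carry the full URN (urn:li:domain:<id>) rather than the display name:
Copy code
source:
  type: csv-enricher
  config:
    filename: ./enrich.csv     # CSV whose domain column holds urn:li:domain:<id> values
    write_semantics: PATCH     # merge with existing metadata instead of overwriting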
  • s

    strong-wall-16201

    06/01/2023, 11:37 AM
    Hello everyone 😄 I am trying to ingest datasets from Google Earth Engine. I currently have all metadata from the platform in a JSON file. Does anyone have advice? I want to know if there is something I can try before writing my own connector plugin. Thanks!
    ✅ 1
    g
    g
    a
    • 4
    • 5
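One thing to try before writing a connector for the question above: if the Google Earth Engine JSON can be converted into DataHub's MCE/MCP file format, the generic file source can load it directly. A minimal sketch (path and server are placeholders):
Copy code
source:
  type: file
  config:
    path: ./gee_metadata_mces.json   # must already be DataHub MCE/MCP JSON, not raw GEE metadata
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080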
  • s

    silly-intern-25190

    06/01/2023, 2:30 PM
Hi everyone, I am writing integration tests for the Vertica plugin and am stuck at a weird place. Once we start the database container, we run a few queries to create default tables and views, and in the DataHub table/view properties we show the create time of the table/view. The problem is that the mce_golden file has the create time from when it was last updated, but each test run produces a different create time, since the tables are created when the test runs; the test then fails because the data in the temp JSON file and the mce_golden file do not match. I tried the pytest freeze_time decorator, but since the tables are created inside the pytest Docker container, the Vertica database applies the create time from the host system by default, so it did not work. Please let me know if anyone else has faced this issue and what could be done to overcome it.
    ✅ 1
    g
    • 2
    • 1
  • n

    nutritious-lifeguard-19727

    06/01/2023, 6:16 PM
Hello, I just wanted to check whether I am missing any context that may not be in the docs. I am interested in setting up ingestion from an AWS OpenSearch source. Guidance from Acryl was that, due to the similarities between the OpenSearch and Elasticsearch APIs, this might be possible. However, from the docs it seems the only supported authentication method is username/password. I was wondering whether I had missed something and there is some level of IAM support. The AWS docs show an example of this, but using the opensearch-py SDK (instead of elasticsearch, which I believe is what DataHub uses). As a follow-up: if this isn't supported, is adding AWS OpenSearch support on the radar at all?
    ✅ 1
    g
    • 2
    • 1
  • l

    limited-train-99757

    06/02/2023, 8:03 AM
Hi everyone, is there any way to find out why three instances are scheduled simultaneously here for data ingestion?
    plus1 1
    heads down 1
    g
    • 2
    • 4
  • l

    limited-forest-73733

    06/02/2023, 1:16 PM
Hey team, I am using DataHub 0.10.3 and trying to integrate Airflow with DataHub using datahub-kafka, but the datahub airflow plugin is not compatible with the Python packages that ship with airflow:2.6.1. Can anyone please help me here? Thanks.
    g
    f
    a
    • 4
    • 7
  • l

    little-spring-72943

    06/03/2023, 2:15 AM
Hello, I am trying out the Databricks Unity Catalog profiling feature in v0.10.3, but we keep getting the following error on the SQL warehouse:
    ERROR    {datahub.entrypoints:199} - Command failed: failed to reach RUNNING, got State.STOPPED: current status: State.STOPPED
Does anyone know how we can increase the wait time for the warehouse cluster to warm up?
    g
    • 2
    • 1
  • s

    shy-dog-84302

    06/05/2023, 7:58 AM
Hi! Kafka metadata ingestion is failing on DataHub CLI 0.10.3+ with the error `AttributeError: 'str' object has no attribute 'get'`; it was working fine up to v0.10.2.3. Did anyone else experience the same? Logs in 🧵
    g
    a
    • 3
    • 7
  • c

    creamy-ram-28134

    06/05/2023, 1:56 PM
Hi Team! I was trying to update DataHub and am running into this issue in the update job; does anyone know how to fix it? The following two messages are repeated:
ANTLR Tool version 4.5 used for code generation does not match the current runtime version 4.7.2
ANTLR Runtime version 4.5 used for parser compilation does not match the current runtime version 4.7.2
    ✅ 1
    g
    • 2
    • 1
  • a

    acoustic-quill-54426

    06/05/2023, 2:24 PM
We have issues with Redshift profiling, so we were eager to test the new unified source that came with v0.10.2. Our GMS is still at v0.9.6.1, but we tried it anyway (I know we should upgrade everything at the same time, but this is usually not a problem), and somehow the stateful ingestion soft-deleted all the datasets.
    ✅ 1
    f
    • 2
    • 13
  • h

    happy-branch-61686

    06/05/2023, 2:59 PM
Hi all! I have a question related to data ingestion. Specifically, I have a recipe that reads Parquet files from AWS S3. The recipe runs on a schedule, but every time it runs it overwrites some of the metadata I have added through the UI for those files. For example, customProperties and description are being overwritten, but tags remain intact. Is there a way for the ingestion to keep my added metadata?
    g
    d
    • 3
    • 3
  • n

    numerous-address-22061

    06/05/2023, 5:47 PM
I had a question about Snowflake ingestion, if anyone can help. I am trying to split a Snowflake ingestion into three different pieces:
1. Table catalog / lineage / usage
2. Table stats profiling (row counts over time; we want frequent data points)
3. Column stats profiling (less frequent, as it takes a long time, and only some of the column stats)
My question is whether this is somewhat of an anti-pattern for Snowflake ingestion. Is it more optimized to do all of this at one time? Is there a piece of work that is repeated in all three ingestions?
    g
    • 2
    • 1
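On the Snowflake question above, a rough sketch of how the split might look, assuming the source's include_* flags and profiling options (option names may vary by version). Each run still performs the basic schema scan, so at least that part of the work is repeated across the three recipes:
Copy code
# 1. catalog / lineage / usage
source:
  type: snowflake
  config:
    include_table_lineage: true
    include_usage_stats: true
    profiling:
      enabled: false
---
# 2. frequent, table-level stats only
source:
  type: snowflake
  config:
    profiling:
      enabled: true
      profile_table_level_only: true
---
# 3. infrequent, full column profiling
source:
  type: snowflake
  config:
    profiling:
      enabled: true
      profile_table_level_only: false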
  • b

    broad-yak-43537

    06/06/2023, 10:05 AM
Hello all. I have created a new user via the REST API, and the user is visible in the UI, but I don't know the new user's password, and the reset-password button is grayed out. This is the request JSON:
Copy code
{
  "entity": {
    "value": {
      "com.linkedin.metadata.snapshot.CorpUserSnapshot": {
        "urn": "urn:li:corpuser:__datahub_system",
        "aspects": [
          {
            "com.linkedin.identity.CorpUserInfo": {
              "active": true,
              "displayName": "jack",
              "email": "jack@acryl.io",
              "title": "Software Engineer",
              "fullName": "jack full"
            }
          }
        ]
      }
    }
  }
}
This is the UI:
    a
    • 2
    • 1
  • n

    nutritious-megabyte-12020

    06/06/2023, 1:37 PM
Did anyone ingest unstructured data like PNGs, images, or other files? For me it fails with
    ERROR    {datahub.entrypoints:199} - Command failed: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Is it supported at all? Recipe:
Copy code
source:
  type: "file"
  config:
    path: ./exampleimgs/
    file_extension: ".jpg"
sink:
  type: "datahub-rest"
  config:
    server: secret
    token: secret
    a
    i
    • 3
    • 3
  • e

    elegant-salesmen-99143

    06/06/2023, 3:15 PM
Hi. I'm not sure I understand the difference between table-level profiling and column-level profiling. I have an ingest recipe that has `profile_table_level_only` set to `true`, but in the Stats tab for tables I see stats for each column. Is that correct behaviour? I see the same thing on Presto and on Postgres. The DataHub version is 10.1.
    Copy code
    profiling:
                enabled: true
                profile_table_level_only: true
                include_field_sample_values: false
    👀 1
    a
    f
    +2
    • 5
    • 21
  • w

    wide-florist-83539

    06/06/2023, 5:27 PM
Hi, I am ingesting RDS into DataHub via the username/password and endpoint configuration. If I want to ingest the database's AWS tags as DataHub tags, or even as properties, how could I go about this? They currently do not show up at all in DataHub.
    ✅ 1
    a
    a
    • 3
    • 6
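On the question above: the SQL-based RDS sources connect over the database protocol only, so AWS resource tags aren't visible to them. As a stopgap, tags can be attached at ingestion time with a transformer; a sketch, where the tag URNs are examples mirroring your AWS tags:
Copy code
transformers:
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - "urn:li:tag:aws"
        - "urn:li:tag:rds-production"   # hypothetical tag mirroring an AWS tag value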
  • f

    few-sugar-84064

    06/07/2023, 7:48 AM
Hi, I wrote a recipe for S3 as below, but the metadata is not ingested as I expected. Could you help me find out why? • [actual S3 paths]
    Copy code
    - <s3://test-datalake/user-event/etl_year={year}/etl_month={month}/etl_day={day}/etl_hour={hour}/{i}.parquet>
    - <s3://test-datalake/platform-event/pn/aos/etl_year={year}/etl_month={month}/etl_day={day}/etl_hour={hour}/{i}.parquet>
    - <s3://test-datalake/platform-event/pn/ios/etl_year={year}/etl_month={month}/etl_day={day}/etl_hour={hour}/{i}.parquet>
    • [recipe]
    Copy code
    ...
    path_specs:
      - include: "<s3://test-datalake/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.parquet>"
      - include: "<s3://test-datalake/platform-event/pn/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.parquet>"
• [expectation]
  ◦ the first path: a dataset named "user-event" is created under the "test-datalake" folder
  ◦ the second path: two datasets named "aos" and "ios" are created under "test-datalake/platform-event/pn"
• [result on datahub]
  ◦ the first path: a dataset named "1.parquet" is created under "test-datalake/user-event/etl_year=2023/etl_month=1/etl_date=1/etl_hour=3"
  ◦ the second path: no dataset was created
    d
    • 2
    • 2
  • c

    cold-father-66356

    06/07/2023, 8:44 AM
Hello 🙂 I have one question: is it possible to use Athena ingestion and have DataHub connect upstream and downstream automatically with Spark jobs that just save the data to S3? The idea would be to use the S3 path to connect the output of the Spark job with the Athena table. Is that possible?
    d
    • 2
    • 2
  • p

    proud-dusk-671

    06/07/2023, 9:24 AM
Hi team, is it possible to pass an environment variable in the private key field of a Snowflake ingestion recipe? Essentially we want to mask our private key somehow. We are deploying through Helm.
    ✅ 1
    g
    • 2
    • 9
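On the question above: recipe files do support ${ENV_VAR} interpolation (the Kafka recipe further down this page relies on it), so something along these lines may work. Whether the field is private_key or private_key_path depends on the source version, so treat this as a sketch:
Copy code
source:
  type: snowflake
  config:
    # ...existing Snowflake connection settings...
    private_key: "${SNOWFLAKE_PRIVATE_KEY}"                    # injected from a k8s secret via Helm
    private_key_password: "${SNOWFLAKE_PRIVATE_KEY_PASSWORD}"  # if the key is encrypted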
  • p

    prehistoric-kangaroo-75605

    06/07/2023, 4:15 PM
    When ingesting, we are seeing deadlock errors. General ingestion works fine, but the deadlocks seem to occur during the profiling phase. Our configuration is focused on a single table. I suspect the issue is with our connection setup or permissions on the user account.
    [ODBC Driver 18 for SQL Server][SQL Server]Transaction (Process ID 207) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. (1205)
    We're seeing failures prior to that like:
    Copy code
    2023-06-07 10:49:31,337 INFO sqlalchemy.engine.Engine BEGIN (implicit)
    [2023-06-07 10:49:31,337] INFO     {sqlalchemy.engine.Engine:1032} - BEGIN (implicit)
    2023-06-07 10:49:31,338 INFO sqlalchemy.engine.Engine SELECT object_id(?, 'U')
    [2023-06-07 10:49:31,338] INFO     {sqlalchemy.engine.Engine:1858} - SELECT object_id(?, 'U')
    2023-06-07 10:49:31,338 INFO sqlalchemy.engine.Engine [generated in 0.00016s] ('tempdb.dbo.[#ge_temp_133ecb0f]',)
    [2023-06-07 10:49:31,338] INFO     {sqlalchemy.engine.Engine:1863} - [generated in 0.00016s] ('tempdb.dbo.[#ge_temp_133ecb0f]',)
    |[2023-06-07 10:49:31,427] ERROR    {datahub.utilities.sqlalchemy_query_combiner:257} - Failed to execute query normally, using fallback: 
    CREATE TABLE "#ge_temp_133ecb0f" (
    	condition INTEGER NOT NULL
    )
    
    
    Traceback (most recent call last):
      File "/Users/jerrythome/Library/Python/3.9/lib/python/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 253, in _sa_execute_fake
        handled, result = self._handle_execute(conn, query, args, kwargs)
      File "/Users/jerrythome/Library/Python/3.9/lib/python/site-packages/datahub/utilities/sqlalchemy_query_combiner.py", line 218, in _handle_execute
        if not self.is_single_row_query_method(query):
      File "/Users/jerrythome/Library/Python/3.9/lib/python/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 228, in _is_single_row_query_method
        column_names = [column.name for column in query_columns]
      File "/Users/jerrythome/Library/Python/3.9/lib/python/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 228, in <listcomp>
        column_names = [column.name for column in query_columns]
    AttributeError: 'CreateColumn' object has no attribute 'name'
    2023-06-07 10:49:31,428 INFO sqlalchemy.engine.Engine 
    CREATE TABLE [#ge_temp_133ecb0f] (
    	condition INTEGER NOT NULL
    )
=========== SETUP ===========
DataHub: Quickstart v0.10.3
Datasource: Azure SQL Database
General setup:
    Copy code
    source:
      type: mssql
      
      config:
        # Coordinates
        host_port: <myserver>.database.windows.net:1433
        database: <mydb>
      
        # Credentials
        username: <username>
        password: <password>
    
        # Options
        use_odbc: True
        uri_args:
          driver: "ODBC Driver 18 for SQL Server"
          Encrypt: "yes"
          TrustServerCertificate: "Yes"
          ssl: "True"
Has anyone experienced this? Is there an alternate connection for Azure SQL that's different from local SQL Server? I tried this with an admin user (vs. read-only) too, with the same results. Thanks for any thoughts.
    d
    • 2
    • 2
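Not a fix for the root cause above, but the profiling section can usually be dialed back to check whether concurrency is what triggers the deadlocks; the option names below come from the GE-based profiler config and may differ by version:
Copy code
profiling:
  enabled: true
  max_workers: 1                  # serialize profiling queries
  query_combiner_enabled: false   # send individual queries instead of combined ones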
  • e

    early-hydrogen-27542

    06/07/2023, 8:00 PM
    👋 folks - how should we update our Kafka ingest to bring in the schemas themselves, and not just the folders? Our recipe looks like this:
    Copy code
    source:
      type: kafka
      config:
        connection:
          bootstrap: ${KAFKA_BOOTSTRAP}
          consumer_config:
            security.protocol: "SSL"
          schema_registry_url: ${KAFKA_SCHEMAURL}
        env: ${DATAHUB_ENV}
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true
    
    pipeline_name: kafka_ingest
    
    sink:
      type: "datahub-rest"
      config:
        server:  ${DATAHUB_REST}
        retry_max_times: 10
We have a set of Avro schemas (.avsc) that are grouped in folders. For example, `test_folder` holds `test_one.avsc` and `test_two.avsc`. The above recipe only ingests `test_folder` as a topic. How would we also tell it to ingest `test_one.avsc` and `test_two.avsc`?
    ✅ 1
    m
    • 2
    • 6
  • f

    freezing-sunset-28534

    06/08/2023, 3:09 AM
Hi folks, I am trying to use the "Transformers - Pattern Add Dataset Schema Field glossaryTerms" transformer to automatically associate columns with business terms based on regex patterns, but I'm not sure whether this method matches on the column's name or on the column's data values. Could anyone let me know whether it works if I want to match on the data values in a column? And if it cannot, is there any other way to do that? Thanks a lot.
    ✅ 1
    m
    • 2
    • 6
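For the transformer question above: as far as I understand, the rules are matched against the schema field path (i.e. the column name), not against the data values stored in the column, so value-based matching would need something else (e.g. classification during profiling or a custom transformer). The usual shape looks roughly like this, where the patterns and term URNs are examples:
Copy code
transformers:
  - type: pattern_add_dataset_schema_terms
    config:
      term_pattern:
        rules:
          ".*email.*": ["urn:li:glossaryTerm:Classification.Email"]
          ".*ssn.*": ["urn:li:glossaryTerm:Classification.SSN"]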
  • o

    orange-river-19475

    06/08/2023, 3:25 AM
Hi, is there a way to ingest BigQuery usage only? I want to separate ingesting BigQuery metadata and BigQuery usage into different recipe.yaml files.
    ✅ 1
    d
    • 2
    • 3
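For the BigQuery question above, a sketch of how the split might look; the flag names are assumed from the unified bigquery source and may differ slightly by version, and the usage run may still enumerate tables:
Copy code
# recipe A: metadata only
source:
  type: bigquery
  config:
    # project / credentials as in the existing recipe
    include_usage_statistics: false
    include_table_lineage: true
---
# recipe B: usage only
source:
  type: bigquery
  config:
    # project / credentials as in the existing recipe
    include_usage_statistics: true
    include_table_lineage: false
    profiling:
      enabled: false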
  • m

    microscopic-room-90690

    06/08/2023, 6:10 AM
Hi team, I ingest metadata from Hive into DataHub daily. It worked well and "Properties" were shown in the UI, but recently the "Properties" are no longer shown. When I run "DESCRIBE FORMATTED <table_name>" in Hive I do get the metadata, so I'm wondering whether something in DataHub changed? The version is v0.8.43. Can anyone help?
    d
    a
    • 3
    • 11