# ingestion
  • n

    numerous-ram-92457

    01/19/2023, 9:58 PM
    Hey all 👋🏽, trying to ingest LookML via the UI and am running into errors. Is there a specific part of the log file that we can share to help troubleshoot? Thanks!
    ✅ 1
  • c

    cool-tiger-42613

    01/20/2023, 9:31 AM
    Hi all, I am having issues with the display of the dataset name. From this example, sales is the dataset belonging to realestate_db, hence the display name should only be sales, but the output is not as expected. The code is from the example here https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_schema.py Can I get some help with this please?
    ✅ 1
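    A possible direction, sketched from the dataset_schema.py example rather than verified: emit a DatasetProperties aspect with an explicit name alongside the schema, which the UI can use as the display name instead of the full URN-derived path. The platform and dataset name follow the example; the GMS address is an assumption.
    Copy code
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD")

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumption: local GMS
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=dataset_urn,
            aspect=DatasetPropertiesClass(
                name="sales",  # short display name
                qualifiedName="realestate_db.sales",
            ),
        )
    )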
  • k

    kind-dusk-91074

    01/20/2023, 11:35 AM
    Hi team, I got this 'schema-registry is not running' error when running the quickstart. Does anyone have an idea why this is happening and how it can be resolved?
    👀 1
    ✅ 1
  • c

    calm-dinner-63735

    01/20/2023, 12:48 PM
    Can anyone share an existing code repo to pull topic information from MSK and schema information from the Glue Schema Registry?
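    I'm not aware of a ready-made repo; below is a rough starting sketch (not a verified integration) that lists MSK topics with a plain Kafka client and pulls schema definitions from the Glue Schema Registry with boto3. The broker address, region, and registry name are placeholders.
    Copy code
    import boto3
    from kafka import KafkaConsumer

    # Topics from MSK (it is just Kafka, so any Kafka client works)
    consumer = KafkaConsumer(bootstrap_servers="b-1.my-msk-cluster:9092")  # placeholder broker
    print(sorted(consumer.topics()))

    # Schema definitions from the Glue Schema Registry
    glue = boto3.client("glue", region_name="us-east-1")  # assumption: registry region
    for schema in glue.list_schemas(RegistryId={"RegistryName": "my-registry"})["Schemas"]:
        version = glue.get_schema_version(
            SchemaId={"RegistryName": "my-registry", "SchemaName": schema["SchemaName"]},
            SchemaVersionNumber={"LatestVersion": True},
        )
        print(schema["SchemaName"], version["SchemaDefinition"][:80])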
  • w

    witty-butcher-82399

    01/20/2023, 2:01 PM
    Hi team! Checking about monitoring https://datahubproject.io/docs/advanced/monitoring - it only mentions the Java backend components. Are connectors producing metrics too? I would like to have metrics about the number of URNs issued by pipeline and stage (source, transform, sink), soft-deleted ones, errors and warnings, etc. That would also make it easy to set up alarms.
  • a

    alert-fall-82501

    01/20/2023, 5:24 PM
    Hi Team - I am working on importing Airflow DAG jobs into DataHub. I am trying to test it locally with a Docker container; I have put all the required config in the Dockerfile and am running it, but I am hitting some bugs. Can anybody help me with this? The bugs and the Docker config are in the thread. If anyone has done this before, I would appreciate the help.
  • l

    lemon-lock-92370

    01/21/2023, 3:38 AM
    Hi team! 🙇 We are trying to deploy datahub v0.9.3 on AWS EKS using the Helm chart. As far as I understand, the datahub-ingestion image is for the CLI and the datahub-actions image is for UI ingestion. I made some changes in the metadata-ingestion/src/datahub/ingestion/source/aws/glue.py file and built an image for datahub-ingestion (using docker/datahub-ingestion/Dockerfile). But I’d like to apply this code to UI ingestion, and there seems to be no such AWS Glue file in the datahub-actions code..? How can I apply code in the metadata-ingestion directory to UI ingestion? I couldn’t find the right place in values.yaml either.. 😢 Please help 🙏 Thank you in advance 🙇
    ✅ 1
    👀 1
  • c

    careful-ability-12984

    01/23/2023, 5:31 AM
    hello everyone!! I'm trying ingestion from Oracle Cloud Autonomous Database, but I am having trouble connecting - could you give me some advice? I have already installed and configured the Oracle client required for the container, and have verified that a simple Python program using cx_Oracle can retrieve data from the Autonomous Database. In this case we are connecting using the tns service entry (sample_high) from tnsnames.ora. Apart from this, the DataHub setting requires a host_port, but I don't know what value to set, so it would be helpful if you could tell me. The entry in tnsnames.ora is as follows:
    sample_high = (description=(retry_count=20)(retry_delay=3)(address=(protocol=tcps)(port=1522)(host=adb.ap-tokyo-1.oraclecloud.com))(connect_data=(service_name=sample_high.adb.oraclecloud.com))(security=(ssl_server_dn_match=yes)))
    Setting host_port to adb.ap-tokyo-1.oraclecloud.com:1522 as above results in an error.
    [recipe-oracle.yaml]
    source:
      type: oracle
      config:
        host_port: adb.ap-tokyo-1.oraclecloud.com:1522
        database: null
        username: user
        password: password
        service_name: gmaxoac_high
    Error log:
    [2023-01-19 01:18:37,321] ERROR {datahub.entrypoints:213} - Command failed: (cx_Oracle.DatabaseError) ORA-12537: TNS:connection closed (Background on this error at: http://sqlalche.me/e/13/4xp6)
    Thank you.
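    A debugging sketch, not a fix: as far as I understand, the oracle source builds a SQLAlchemy URL from host_port and service_name, so testing a URL of roughly that shape outside of DataHub can help narrow down whether the problem is the host_port value or the TCPS/wallet requirements of Autonomous Database. The URL shape is my reading of the source's behaviour; user, password, and service name below are taken from the message above.
    Copy code
    from sqlalchemy import create_engine, text

    # Roughly the URL style the oracle source derives from host_port + service_name
    url = (
        "oracle+cx_oracle://user:password@adb.ap-tokyo-1.oraclecloud.com:1522"
        "/?service_name=sample_high.adb.oraclecloud.com"
    )
    engine = create_engine(url)
    with engine.connect() as conn:
        print(conn.execute(text("SELECT 1 FROM dual")).scalar())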
  • e

    elegant-salesmen-99143

    01/23/2023, 10:26 AM
    Is there any way to make Stateful Ingestion mark the tables that are no longer there as deleted, but without actually deleting them? Kind of like a deprecation. Right now I can guess it by looking at the "last synchronized" marking: with our daily ingestions, seeing that a table was last synchronized 4 weeks ago would mean that it was removed 4 weeks ago. But it's not a very obvious way. We don't currently have stateful ingestion on, because we would like to keep the info about removed tables. So is there any way that DataHub would mark/deprecate stale metadata without deleting it?
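    For context, a minimal sketch of the configuration being discussed, assuming a Postgres source and a local GMS. My understanding is that stale-entity removal performs a soft delete (it sets the status aspect to removed) rather than dropping rows, so the metadata is hidden from search but not destroyed - worth verifying on your version before relying on it.
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",
                "config": {
                    "host_port": "localhost:5432",  # assumption
                    "database": "mydb",             # assumption
                    "username": "datahub",
                    "password": "datahub",
                    "stateful_ingestion": {
                        "enabled": True,
                        # soft-deletes entities missing from the latest run
                        "remove_stale_metadata": True,
                    },
                },
            },
            # required so state can be tracked across runs
            "pipeline_name": "postgres_daily",
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()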
  • s

    steep-family-13549

    01/23/2023, 10:37 AM
    Hi Team and @hundreds-photographer-13496 N! We have created column level lineage using Java emitter and are exploring the feature of editing the lineage on the UI. When we update the lineage from the UI, the changes are visible only till the time we are present on the page. If we refresh it, or move to any other page and come back, the changes are reverted. Is this the desired behavior? Attached are a few images of the same. PS: We are on version 0.9.6
  • l

    lively-spring-5482

    01/23/2023, 10:42 AM
    Hi, hello, good morning. I’m Jarek Filonik, working for Westwing (www.westwing.com) and involved in the implementation of Data Hub as a data catalog tool in our company. We have noticed a particular behaviour of the lineage ingestion process (on Snowflake). To better understand the issue, let’s consider the script below.
    Copy code
    CREATE TEMPORARY TABLE tmp1 AS (
      SELECT
        t1.id
        , t1.attr_column_1
        , t1.attr_column_2
        , t2.attr_column_3
        , t3.not_used
      FROM src_table_1 AS t1
      INNER JOIN src_table_2 AS t2 USING (id_1)
      INNER JOIN src_table_3 AS t3 USING (id_2)
    );
    
    INSERT INTO target_table
    SELECT
      t.id
      , t.attr_column_1
      , t.attr_column_2
      , t.attr_column_3
      , s.attr_column_4
    FROM tmp1 AS t
    INNER JOIN src_table_4 AS s USING (id);
    What was observed when ingesting the lineage for target_table is that the (somewhat unfortunate) use of a temporary table in the script results in partial sourcing information. Specifically, target_table shows src_table_4 as its only upstream source, while technically speaking this is not the case -> it is sourced from src_table_1, src_table_2 & src_table_4 (whether or not src_table_3 should be included is a separate discussion). I wonder if this behaviour can be modified by configuration in release 0.9.6? If not, is it a limitation that you plan to remove? Is there a workaround you could suggest other than, of course, refactoring to use CTEs? Thanks in advance for looking into it. Have an excellent day :)
    ✅ 1
  • w

    wooden-jackal-88380

    01/23/2023, 10:55 AM
    Hi there, is there a way to use both Airflow Datasets (data-aware scheduling since Airflow 2.4) and DataHub Datasets at the same time? They both use the same keyword argument, "outlets", on an Airflow operator.
    ✅ 1
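    An untested sketch of declaring both kinds of outlets on one task: a DataHub lineage entity plus an Airflow 2.4 Dataset for data-aware scheduling. The DAG, table name, and dataset URI are hypothetical, and whether each integration cleanly ignores the other's objects needs verifying on your Airflow and plugin versions.
    Copy code
    from datetime import datetime

    from airflow import DAG
    from airflow.datasets import Dataset as AirflowDataset
    from airflow.operators.bash import BashOperator
    from datahub_provider.entities import Dataset as DatahubDataset

    with DAG("orders_load", start_date=datetime(2023, 1, 1), schedule=None):
        load_orders = BashOperator(
            task_id="load_orders",
            bash_command="echo load",  # placeholder for the real load step
            outlets=[
                DatahubDataset("snowflake", "analytics.public.orders"),  # hypothetical table
                AirflowDataset("snowflake://analytics/public/orders"),   # hypothetical URI
            ],
        )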
  • s

    straight-camera-35934

    01/23/2023, 10:59 AM
    Hi, Is there any update on https://feature-requests.datahubproject.io/p/need-excel-support-for-s3-profiling-while-metadata-ingestion?
    ✅ 1
  • l

    lively-dusk-19162

    01/23/2023, 5:44 PM
    Hello all, could anyone help me with how to emit Glue job metadata into DataHub? Is there a Python SDK for that?
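    One option, sketched rather than verified: the metadata-ingestion package ships a glue source that can also be run programmatically through the Python SDK's Pipeline API. The region and GMS address below are assumptions.
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "glue",
                "config": {"aws_region": "us-east-1"},  # assumption: your Glue region
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},  # assumption: local GMS
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()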
  • n

    nutritious-yacht-6205

    01/23/2023, 7:40 PM
    Hello all, I have tags in the description field in Postgres. I want to parse the description and add them as tags with a transformer. How can I get the description field?
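    A rough sketch of the core logic, assuming the tags are embedded as #hashtags in the description (the parsing rule, dataset name, and GMS address are all assumptions): parse the string and emit a GlobalTags aspect with the Python emitter. The same logic could be wrapped in a custom transformer following the transformer docs.
    Copy code
    import re

    from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

    description = "Orders fact table #pii #finance"  # assumption: tags written as #hashtags
    tags = re.findall(r"#(\w+)", description)

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumption: local GMS
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn("postgres", "mydb.public.orders", "PROD"),  # hypothetical
            aspect=GlobalTagsClass(
                tags=[TagAssociationClass(tag=make_tag_urn(t)) for t in tags]
            ),
        )
    )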
  • l

    lively-dusk-19162

    01/23/2023, 9:53 PM
    Hi everyone, I am trying to ingest Glue metadata into DataHub. I have the SQL queries with me. Previously I found lineage between queries using sqllineage and emitted it using the Python SDK. Now how can I use those queries to ingest Glue metadata?
  • l

    lively-dusk-19162

    01/23/2023, 9:53 PM
    Could anyone please help me on that?
  • r

    rich-state-73859

    01/24/2023, 12:10 AM
    Hi all, I’m using the datahub-protobuf lib to ingest protobuf schemas, but it could not parse the message comments correctly after I updated the lib to the latest version (v0.9.6). Here is the detailed issue info. Could someone help me with that?
    ✅ 1
  • m

    microscopic-machine-90437

    01/24/2023, 9:02 AM
    Hello everyone, I have a question on metadata deletion. I ingested dbt dev data into our DataHub prod server and deleted it later on. But our dbt is linked to Snowflake, so the Snowflake dev metadata is still intact in the DataHub environment. Now I have both dev and prod metadata available for Snowflake, and I have to delete the dev metadata. I tried all the combinations (like delete --env dev --platform snowflake, delete --entity_type dataset --env dev --platform snowflake) to delete the dev metadata, but couldn't succeed. Can someone help me delete the Snowflake dev data?
    👀 1
  • b

    blue-rainbow-97669

    01/24/2023, 9:55 AM
    Hi Team, when I insert a table-level expectation from Great Expectations into DataHub, it creates a new entry in the Validation UI. I want all the table-level validations to be inserted under a single entry, i.e. it should create one entry. Can anyone please let me know how to fix this? I am using the expectation expect_table_row_count_to_equal.
  • b

    best-umbrella-88325

    01/24/2023, 3:22 PM
    Hello community! A couple of questions around setting up DataHub locally and making changes.
    1. We've been trying to create a custom ingestion source on DataHub which would be visible on the DataHub UI as well. We've made the corresponding changes in the datahub-web-react and GMS projects as per the documentation https://datahubproject.io/docs/metadata-ingestion/adding-source. We are wondering how we let GMS/Frontend know that it has to pick up the newer metadata-ingestion Docker image that we created, since this is not present in values.yaml.
    2. Is there any way we can get Frontend and GMS up and running on local ports like 3000 and 8080 before creating the Docker images? Creating the Docker images / building GMS and Frontend takes a relatively large amount of time. For dev purposes, we would love to know if we can start and stop the processes on local ports before creating the images.
    Any help appreciated. Please let me know if any further clarification is needed.
  • e

    elegant-salesmen-99143

    01/24/2023, 4:23 PM
    I have a question about ingestions that get stuck at Pending status, and ones that show Running status for hours and hours while the ingestion job is in fact also stuck. (I'm not talking about the reasons for it at the moment; it might be due to not enough memory or whatever, but my question is different.)
    1) We have daily ingestions for our data sources, and if a previous ingestion got stuck, it does not get cancelled by DataHub when the new ingestion job starts. So they both try to run, the old job gets in the way of the new one, and everything just collapses. The only way we found to cancel it all is to reboot DataHub completely. Are we missing something? Does DataHub have any way of knowing that an ingestion got stuck so it can cancel it before starting a new one?
    2) Is there a way to cancel an ingestion job with Pending status from the UI? I don't see a cancel button for such ingestions.
  • h

    helpful-tent-87247

    01/24/2023, 5:56 PM
    Has anyone seen this error when ingesting data:
    Copy code
    '2023-01-24 17:53:10.177088 [exec_id=22af9dd4-b420-4021-880f-4eec9c4e0677] INFO: Caught exception EXECUTING '
    'task_id=22af9dd4-b420-4021-880f-4eec9c4e0677, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
    '  File "/usr/local/lib/python3.10/asyncio/streams.py", line 525, in readline\n'
    '    line = await self.readuntil(sep)\n'
    '  File "/usr/local/lib/python3.10/asyncio/streams.py", line 620, in readuntil\n'
    '    raise exceptions.LimitOverrunError(\n'
    'asyncio.exceptions.LimitOverrunError: Separator is found, but chunk is longer than limit\n'
    thank you 1
  • r

    rhythmic-glass-37647

    01/24/2023, 8:53 PM
    Hi, I'm pretty new to DataHub and trying to plan out what ingestion will look like for our environment. I saw https://feature-requests.datahubproject.io/p/support-ingestion-from-aws-dynamodb and it looks like it's not being actively worked on. Has anybody found a solution for DynamoDB tables? On first look, I'm guessing the easiest path forward is to use a Glue crawler and ingest the Glue catalog?
  • e

    elegant-salesmen-99143

    01/25/2023, 11:38 AM
    Hi, can anyone please help me understand the stateful_ingestion.ignore_old_state and stateful_ingestion.ignore_new_state parameters?
  • l

    limited-forest-73733

    01/25/2023, 12:22 PM
    Hey team, can I set a domain in the dbt ingestion recipe as well? I am not able to see any documentation for this. Can anyone please help me out? Thanks!
    ✅ 1
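    If the dbt recipe itself doesn't offer a domain setting, one alternative sketch is attaching a domain to the dbt datasets afterwards with the Python emitter (the domain, model name, and GMS address here are hypothetical); the add-domain transformers described in the transformer docs are another route.
    Copy code
    from datahub.emitter.mce_builder import make_dataset_urn, make_domain_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DomainsClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumption: local GMS
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn("dbt", "my_project.my_model", "PROD"),  # hypothetical model
            aspect=DomainsClass(domains=[make_domain_urn("marketing")]),  # hypothetical domain
        )
    )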
  • b

    blue-rainbow-97669

    01/25/2023, 3:12 PM
    Hi Team, we are emitting Great Expectations validation results to DataHub using AssertionRunEventClass, but when searching for those result values in the metadata_aspect_v2 table, we are not able to find the information passed via AssertionRunEventClass under the aspect column. Can you help me understand where the AssertionRunEvent aspect information is stored?
    ✅ 1
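    For what it's worth, assertionRunEvent is a timeseries aspect, and as far as I know timeseries aspects are stored in the Elasticsearch timeseries index rather than in metadata_aspect_v2. A sketch of checking the values through the getTimeseriesAspectValues endpoint (the assertion URN and GMS address are placeholders):
    Copy code
    import requests

    resp = requests.post(
        "http://localhost:8080/aspects?action=getTimeseriesAspectValues",
        headers={"X-RestLi-Protocol-Version": "2.0.0"},
        json={
            "urn": "urn:li:assertion:my-assertion-id",  # placeholder assertion URN
            "entity": "assertion",
            "aspect": "assertionRunEvent",
            "latest": True,
        },
    )
    print(resp.json())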
  • m

    magnificent-lawyer-97772

    01/25/2023, 3:19 PM
    Hi folks, could someone please update the documentation on adding statefulness to sources? It seems to be a bit out of date since the introduction of GenericCheckpointState. @gray-shoe-75895 I noticed that you did a lot of work in that area.
    👀 2
  • s

    stocky-energy-24880

    01/25/2023, 3:20 PM
    Hi Team, I have found an issue with TimeSeries aspects. DataHub version: 0.9.6. Issue: if the timestampMillis value is the same for multiple datasets, then fetching the TimeSeries aspect for one dataset urn returns the aspect values for the other datasets as well. Please find the details below. I have created a TimeSeries aspect with the .pdl files below:
    DatasetTimeSeriesTest.pdl
    Copy code
    namespace com.mine.tests
    
    import com.linkedin.timeseries.TimeseriesAspectBase
    
    @Aspect = {
      "name": "datasetTimeSeriesTest",
      "type": "timeseries"
    }
    record DatasetTimeSeriesTest includes TimeseriesAspectBase {
      @TimeseriesFieldCollection = {"key":"urn"}
      testItems: optional array[TestItem]
    }
    TestItem.pdl
    Copy code
    namespace com.mine.tests
    
    import com.linkedin.common.Urn
    
    record TestItem {
    
      urn: Urn
    
      @TimeseriesField = {}
      name: string
    
      @TimeseriesField = {}
      count: long
    }
    With the TimeSeries aspect (datasetTimeSeriesTest) I am able to ingest values correctly for two different datasets, urn:li:dataset:(urn:li:dataPlatform:postgres,lusiadas.public.company,PROD) and urn:li:dataset:(urn:li:dataPlatform:postgres,lusiadas.public.department,PROD), with the same timestampMillis (1674658250386):
    Copy code
    curl --location --request POST 'http://localhost:8080/aspects?action=ingestProposal' \
    --header 'X-RestLi-Protocol-Version: 2.0.0' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "proposal" : {
        "entityType": "dataset",
        "entityUrn" : "urn:li:dataset:(urn:li:dataPlatform:postgres,lusiadas.public.company,PROD)",
        "changeType" : "UPSERT",
        "aspectName" : "datasetTimeSeriesTest",
        "aspect" : {
          "value" : "{ \"timestampMillis\":1674658250386, \"testItems\": [ {\"urn\": \"urn:li:dataset:(urn:li:dataPlatform:postgres,lusiadas.public.company,PROD)\",\"name\": \"company1\", \"count\": 101}]}",
          "contentType": "application/json"
        }
      }
    }'
    Copy code
    curl --location --request POST 'http://localhost:8080/aspects?action=ingestProposal' \
    --header 'X-RestLi-Protocol-Version: 2.0.0' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "proposal" : {
        "entityType": "dataset",
        "entityUrn" : "urn:li:dataset:(urn:li:dataPlatform:postgres,lusiadas.public.department,PROD)",
        "changeType" : "UPSERT",
        "aspectName" : "datasetTimeSeriesTest",
        "aspect" : {
          "value" : "{ \"timestampMillis\":1674658250386, \"testItems\": [ {\"urn\": \"urn:li:dataset:(urn:li:dataPlatform:postgres,lusiadas.public.department,PROD)\",\"name\": \"department1\", \"count\": 102}]}",
          "contentType": "application/json"
        }
      }
    }'
    But then when I queried the aspect for one dataset urn (urn:li:dataset:(urn:li:dataPlatform:postgres,lusiadas.public.company,PROD)), I got the response for the other dataset urn as well (urn:li:dataset:(urn:li:dataPlatform:postgres,lusiadas.public.department,PROD)). Query:
    Copy code
    curl -X POST 'http://localhost:8080/aspects?action=getTimeseriesAspectValues' \
    --data '{
        "urn": "urn:li:dataset:(urn:li:dataPlatform:postgres,lusiadas.public.company,PROD)",
        "entity": "dataset",
        "aspect": "datasetTimeSeriesTest",
        "latest": true
    }'
    Response:
    Copy code
    {
      "value": {
        "aspectName": "datasetTimeSeriesTest",
        "entityName": "dataset",
        "values": [
          {
            "aspect": {
              "value": "{\"timestampMillis\":1674658250386,\"testItems\":[{\"urn\":\"urn:li:dataset:(urn:li:dataPlatform:postgres,lusiadas.public.company,PROD)\",\"name\":\"company1\",\"count\":101}]}",
              "contentType": "application/json"
            }
          },
          {
            "aspect": {
              "value": "{\"timestampMillis\":1674658250386,\"testItems\":[{\"urn\":\"urn:li:dataset:(urn:li:dataPlatform:postgres,lusiadas.public.department,PROD)\",\"name\":\"department1\",\"count\":102}]}",
              "contentType": "application/json"
            }
          }
        ],
        "limit": 10000
      }
    }
    Is this a known issue, or am I doing something wrong? Can you please suggest? Also, does "autoRender": true not work with TimeSeries aspects? When I tried the code below with versioned aspects I was able to view the aspect in the DataHub UI for a dataset, but I am not able to view it for TimeSeries aspects.
    Copy code
    "autoRender": true,
      "renderSpec": {
        "displayType": "tabular", // or properties
        "key": "tests",
        "displayName": "My Tests"
      }
    Can we fetch custom TimeSeries Aspects with graphql?
  • l

    lively-dusk-19162

    01/25/2023, 4:13 PM
    Hello team, is there any way I can ingest lineage information from AWS Glue?
    ✅ 1
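    If the glue source doesn't capture the lineage you need, a fallback sketch is emitting an UpstreamLineage aspect yourself with the Python emitter; the dataset names and GMS address below are hypothetical.
    Copy code
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    upstream = UpstreamClass(
        dataset=make_dataset_urn("glue", "raw_db.orders", "PROD"),  # hypothetical upstream
        type=DatasetLineageTypeClass.TRANSFORMED,
    )
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumption: local GMS
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn("glue", "curated_db.orders", "PROD"),  # hypothetical downstream
            aspect=UpstreamLineageClass(upstreams=[upstream]),
        )
    )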