# ingestion
  • m

    mammoth-bear-12532

    06/08/2022, 4:52 PM
    It is raining good news today! If you've ever found yourself stumped by how to fill out the ingestion configuration correctly, we finally have a solution: YAML validators for VS Code and IntelliJ, so you can write those recipe files correctly the first time. Check out the docs here. Huge kudos to @dazzling-judge-80093 for building this for all ingestion enthusiasts! 🧵
    Screen Recording 2022-06-06 at 18.52.04.mov
    🤩 2
    👍 7
  • h

    handsome-football-66174

    06/08/2022, 5:34 PM
    Hi everyone, I'm trying to use https://datahubproject.io/docs/generated/metamodel/entities/dataset/#fine-grained-lineage and trying to understand the usage of upstream in the code below (we are already adding the necessary column lineages via fieldLineages). Also, are we able to view the fine-grained lineages?
    Copy code
    upstream = Upstream(dataset=datasetUrn("bar2"), type=DatasetLineageType.TRANSFORMED)
    
    fieldLineages = UpstreamLineage(
        upstreams=[upstream], fineGrainedLineages=fineGrainedLineages
    )
    
    lineageMcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=datasetUrn("bar"),
        aspectName="upstreamLineage",
        aspect=fieldLineages,
    )
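    For reference, a hedged, self-contained version of the snippet above (assumes acryl-datahub around 0.8.x; the platform, dataset, and column names are placeholders). The table-level upstream creates the lineage edge itself, while the fineGrainedLineages entries annotate that edge with column-level detail.
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
        DatasetLineageType,
        FineGrainedLineage,
        FineGrainedLineageDownstreamType,
        FineGrainedLineageUpstreamType,
        Upstream,
        UpstreamLineage,
    )
    from datahub.metadata.schema_classes import ChangeTypeClass


    def datasetUrn(name: str) -> str:
        # Placeholder platform/env; use whatever your real datasets use.
        return builder.make_dataset_urn(platform="postgres", name=name, env="PROD")


    def fieldUrn(name: str, column: str) -> str:
        return builder.make_schema_field_urn(datasetUrn(name), column)


    # Column-level mapping: bar2.col_a feeds bar.col_a (placeholder columns).
    fineGrainedLineages = [
        FineGrainedLineage(
            upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
            upstreams=[fieldUrn("bar2", "col_a")],
            downstreamType=FineGrainedLineageDownstreamType.FIELD,
            downstreams=[fieldUrn("bar", "col_a")],
        )
    ]

    # The table-level upstream is what makes the bar2 -> bar edge exist;
    # the fine-grained entries only add column detail to that edge.
    fieldLineages = UpstreamLineage(
        upstreams=[Upstream(dataset=datasetUrn("bar2"), type=DatasetLineageType.TRANSFORMED)],
        fineGrainedLineages=fineGrainedLineages,
    )

    lineageMcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=datasetUrn("bar"),
        aspectName="upstreamLineage",
        aspect=fieldLineages,
    )
    DatahubRestEmitter("http://localhost:8080").emit_mcp(lineageMcp)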
  • b

    bright-cpu-56427

    06/09/2022, 7:32 AM
    Hi guys, I am creating a custom data platform (looking at this).
    Copy code
    pydantic.error_wrappers.ValidationError: 1 validation error for PipelineConfig
    source -> filename:
      extra fields not permitted (type=value_error.extra)
    This error comes up and I don't know which part is wrong. My Python code is:
    Copy code
    def add_quicksight_platform():
        pipeline = Pipeline.create(
            # This configuration is analogous to a recipe configuration.
            {
                "source": {
                    "type": "file",
                    "filename:":"/opt/airflow/dags/datahub/quicksight.json"
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "{datahub-gms-ip}:8080"},
                }
            }
        )
        pipeline.run()
        pipeline.raise_from_status()
    quicksight.json
    Copy code
    {
      "auditHeader": null,
      "proposedSnapshot": {
        "com.linkedin.pegasus2avro.metadata.snapshot.DataPlatformSnapshot": {
          "urn": "urn:li:dataPlatform:quicksight",
          "aspects": [
            {
              "com.linkedin.pegasus2avro.dataplatform.DataPlatformInfo": {
                "datasetNameDelimiter": "/",
                "name": "quicksight",
                "type": "OTHERS",
                "logoUrl": "<https://play-lh.googleusercontent.com/dbiOAXowepd9qC69dUnCJWEk8gg8dsQburLUyC1sux9ovnyoyH5MsoLf0OQcBqRZILB0=w240-h480-rw>"
              }
            }
          ]
        }
      },
      "proposedDelta": null
    }
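    For comparison, a hedged sketch of how this recipe is usually written: the "extra fields not permitted" error is consistent with the stray colon inside the "filename:" key, and the generic file source normally takes its path under "config" (the path and server below are placeholders).
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline


    def add_quicksight_platform():
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "file",
                    # "filename" (no trailing colon), nested under "config"
                    "config": {"filename": "/opt/airflow/dags/datahub/quicksight.json"},
                },
                "sink": {
                    "type": "datahub-rest",
                    # the REST sink expects a full URL, including the scheme
                    "config": {"server": "http://datahub-gms-ip:8080"},
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()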
  • b

    bumpy-activity-74405

    06/09/2022, 8:13 AM
    Hey, I just started using bigquery/dbt/airflow so forgive me if the questions sound stupid 🙂 I’ve somewhat successfully ingested metadata/usage/stats of bigquery tables but it also had lineage (from bigquery logs, I assume). I initially thought that lineage would have to be ingested from dbt/airflow sources. Is there any reason I would look into those sources? Any pros/cons of getting lineage from bq source vs dbt/airflow?
  • b

    brave-pencil-21289

    06/09/2022, 12:11 PM
    Did anyone succeed with an Azure Synapse recipe? If yes, can you please share the recipe template?
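    For reference, a hedged sketch of the kind of recipe people typically use here: Synapse SQL pools speak the MSSQL protocol, so the mssql source is the usual route (host, database, credentials, and driver below are placeholders).
    Copy code
    source:
        type: mssql
        config:
            host_port: "your-workspace.sql.azuresynapse.net:1433"
            database: your_database
            username: your_user
            password: "${SYNAPSE_PASSWORD}"
            use_odbc: true
            uri_args:
                driver: "ODBC Driver 17 for SQL Server"
                Encrypt: "yes"
    sink:
        type: datahub-rest
        config:
            server: "http://localhost:8080"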
  • c

    crooked-lunch-27985

    06/09/2022, 5:48 PM
    Hey there! Got a noob Q: is there example code (Python) that someone can point me to that does ingestion of both metadata and the actual data using DataHub APIs? E.g., from a Snowflake data source I want to identify all tables that have, say, a "member_id" column and then pull the actual raw data from those tables and dump it into a local/server-side DB/file, etc. Is this a typical use case, or is actually pulling raw data not supported? Thanks!
  • r

    rich-policeman-92383

    06/09/2022, 6:11 PM
    Hello. Do we support query usage for Hive and Oracle data sources? datahub_version: v0.8.35
  • s

    stocky-midnight-78204

    06/10/2022, 2:49 AM
    How do I import Hive table statistics? I have executed "analyze table xxx compute statistics" and then imported the table into DataHub, but I still can't see statistics in DataHub.
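    A possibly relevant detail (hedged): the Stats tab is populated by DataHub's own profiler rather than by Hive's ANALYZE statistics, so profiling has to be enabled in the recipe. A sketch with placeholder connection values:
    Copy code
    source:
        type: hive
        config:
            host_port: "hive-server:10000"
            username: your_user
            profiling:
                enabled: true
    sink:
        type: datahub-rest
        config:
            server: "http://localhost:8080"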
  • r

    rich-policeman-92383

    06/10/2022, 9:25 AM
    In stateful ingestion, how can I get the list of datasets that were marked as "soft-deleted"? Ingestion source: Hive. DataHub version: v0.8.35. Reason: on the DataHub UI the dataset count for Hive is 13K, while at the source it is 10K.
  • b

    brave-pencil-21289

    06/10/2022, 11:41 AM
    I am getting "logintime out expired" Error while working on azure synase ingestion. Can someone help on this.
  • f

    fresh-garage-83780

    06/10/2022, 12:59 PM
    I'm trying to bring my Confluent Cloud instance into DataHub. Schemas are protobuf in the Confluent Schema Registry. Most of our message keys are UUIDs; the schema for a topic key in the registry is:
    Copy code
    syntax = "proto3";
    package atech.proto.shared;
    
    message Uuid {
      string value = 1;
    }
    What I'm seeing is that it pulls in one topic, then errors on the second. I think it's telling me there are duplicated shared proto objects:
    Copy code
    TypeError: Couldn't build proto file into descriptor pool!
    Invalid proto descriptor for file "shared/uuid.proto":
      atech.proto.shared.Uuid.value: "atech.proto.shared.Uuid.value" is already defined in file "atech_app_journey_ape_completed-key.proto".
      atech.proto.shared.Uuid: "atech.proto.shared.Uuid" is already defined in file "atech_app_journey_ape_completed-key.proto".
    It seems I might have hit this limitation, but I can't think of a workaround, so I'm not quite sure where to go from here. Any clues?
  • s

    some-lighter-85578

    06/10/2022, 1:36 PM
    Hi, I would like to ask for some advice. I'm trying to ingest data from Snowflake (that works fine), but for reasons beyond my control we don't have descriptions of tables/columns directly in Snowflake; they live in another app. I can easily export those into JSON or CSV, but how can I merge them into the ingestion process? Is it possible to use transformers for that? Any kind of help/example would be really great.
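    One hedged alternative in case transformers turn out to be awkward: since the descriptions can already be exported to CSV/JSON, they can also be pushed as a separate step after ingestion using the Python emitter and the editable description aspect. A sketch (the CSV layout, platform, and GMS address are assumptions):
    Copy code
    import csv

    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        EditableDatasetPropertiesClass,
    )

    emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS address

    # Assumed export format from the other app: table_name,description
    with open("descriptions.csv") as f:
        for row in csv.DictReader(f):
            urn = builder.make_dataset_urn("snowflake", row["table_name"], env="PROD")
            mcp = MetadataChangeProposalWrapper(
                entityType="dataset",
                changeType=ChangeTypeClass.UPSERT,
                entityUrn=urn,
                aspectName="editableDatasetProperties",
                aspect=EditableDatasetPropertiesClass(description=row["description"]),
            )
            emitter.emit_mcp(mcp)
    Column-level descriptions could be pushed the same way through the editableSchemaMetadata aspect.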
  • a

    adorable-guitar-54244

    06/10/2022, 3:22 PM
    While trying to ingest the schema registry I am getting this error.
  • b

    bland-orange-13353

    06/10/2022, 3:31 PM
    This message was deleted.
  • b

    brainy-shampoo-83265

    06/10/2022, 5:25 PM
    If I want to update the Validation tab on a regular schedule, should I make a new type of Source? If so, is there a better way of doing that than using the DatahubRestEmitter?
  • s

    stocky-midnight-78204

    06/13/2022, 1:52 AM
    What is the difference between Presto on Hive and Trino? How do I choose?
  • s

    stocky-midnight-78204

    06/13/2022, 3:42 AM
    Does anyone have a sample recipe for ingesting presto-on-hive?
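    A hedged sketch of what a presto-on-hive recipe tends to look like; it reads metadata from the Hive metastore database directly, and the host, database, credentials, and scheme below are placeholders.
    Copy code
    source:
        type: presto-on-hive
        config:
            host_port: "metastore-db-host:5432"
            database: metastore
            username: metastore_user
            password: "${METASTORE_PASSWORD}"
            scheme: "postgresql+psycopg2"   # or mysql+pymysql, depending on the metastore DB
    sink:
        type: datahub-rest
        config:
            server: "http://localhost:8080"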
  • l

    lemon-zoo-63387

    06/13/2022, 6:39 AM
    Could you please help me with a question? The Oracle client has been installed, but the run still ends up like this; it's proving really difficult.
  • l

    lemon-zoo-63387

    06/13/2022, 6:43 AM
    I have the same problem with MySQL; I really need your help.
  • b

    bright-cpu-56427

    06/13/2022, 9:48 AM
    Hi guys. When I make a request from the Airflow server to DataHub with the code below, the Airflow server is set to localhost; can I not change this value? Operator code:
    Copy code
    emit_lineage_task = DatahubEmitterOperator(
        task_id=f"emit_lineage_{table}",
        datahub_conn_id="datahub_rest_default",
        mces=[
            builder.make_lineage_mce(
                upstream_urns=upstreams,
                downstream_urn=builder.make_dataset_urn(downstream[0], downstream[1]),
            )
        ],
    )
    emit_lineage_task.execute(kwargs)
    log file
    Copy code
    [2022-06-13 09:27:43,912] {_lineage_core.py:67} INFO - Emitted from Lineage: 
        DataFlow(urn=<datahub.utilities.urns.data_flow_urn.DataFlowUrn object at 0x7f9bcf23eee0>, ... url='<http://localhost:8080/tree?dag_id=dag>', ...)
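    In case it helps, a hedged guess at where the localhost comes from: the lineage backend appears to build that link from Airflow's own webserver base_url rather than from anything on the DataHub side, so changing that setting (or the matching environment variable) should change the emitted URL.
    Copy code
    # airflow.cfg -- hedged guess: the emitted DataFlow/DataJob browse URL is
    # derived from this value; it can also be set via AIRFLOW__WEBSERVER__BASE_URL
    [webserver]
    base_url = https://airflow.mycompany.example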
  • r

    rich-rocket-77152

    06/13/2022, 10:25 AM
    @millions-waiter-49836 hello. Currently I am using Glue profiling (thank you for your contribution). Today I added statistics to my Glue table through update-column-statistics-for-table (https://docs.aws.amazon.com/cli/latest/reference/glue/update-column-statistics-for-table.html). I checked the statistics through get-column-statistics-for-table (https://docs.aws.amazon.com/cli/latest/reference/glue/get-column-statistics-for-table.html), and I ingested the Glue table, but nothing is showing up in the Stats tab. I looked at your code: you read statistics through get_table (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_table) and get_partitions (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_partition). When I called the get_table API via boto3, there was no Parameters field in the Table field, and the same for get_partitions. Why didn't you use get_column_statistics_for_partition (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_column_statistics_for_partition) and get_column_statistics_for_table (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_column_statistics_for_table)? Is what I checked wrong? Let me know if there is a right way. I'll attach a screenshot of the Stats tab from my DataHub and a screenshot of the get-column-statistics-for-table result.
  • f

    fresh-napkin-5247

    06/13/2022, 10:56 AM
    Hello, I came up with another question. I would like to change part of the URN of a given entity via a transformer. For example, I would like to replace the word "prod" with "dev" in the URN string (the reasons for this have to do with our internal architecture). Is this a possibility? There does not seem to be a guide for this in the docs. Thank you! Version: acryl-datahub, version 0.8.38
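    If the part being swapped is just the trailing PROD/DEV segment of the URN, a hedged alternative to rewriting URNs in a transformer is the env setting on the source, which controls that segment at ingestion time (the source type and connection values below are only examples).
    Copy code
    source:
        type: snowflake
        config:
            env: DEV   # sets the environment segment of the emitted dataset URNs
            host_port: "xxxx.snowflakecomputing.com"
            username: your_user
            password: "${SNOWFLAKE_PASSWORD}"
    sink:
        type: datahub-rest
        config:
            server: "http://localhost:8080"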
  • b

    bumpy-daybreak-85714

    06/13/2022, 12:26 PM
    Hello! I would like to ingest data from dbt into DataHub. I want to ingest the manifest and catalog that are put on an S3 bucket. Is there a way to use the instance-level metadata for the role (IAM roles)? https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#id1 Right now I am getting the following error when I do not specify a role:
    Copy code
    INFO - ERROR:root:Error: Unable to locate credentials
    From the code perspective, I guess that for this to work this line needs to be modified: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/aws/aws_common.py#L114 Am I right? Using version acryl-datahub[dbt] == 0.8.36
  • s

    salmon-angle-92685

    06/13/2022, 1:18 PM
    Hello, have you ever had this problem when trying to set the top_n_queries field in Snowflake? Would you know why this happens and how to solve it?
    1 validation error for SnowflakeConfig
    top_n_queries
    extra fields not permitted (type=value_error.extra)
    Thank you in advance !
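    The error reads as though top_n_queries ended up in the plain snowflake source config; a hedged guess is that it is an option of the snowflake-usage source instead (connection values below are placeholders).
    Copy code
    source:
        type: snowflake-usage
        config:
            host_port: "xxxx.snowflakecomputing.com"
            username: your_user
            password: "${SNOWFLAKE_PASSWORD}"
            top_n_queries: 10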
  • s

    stocky-energy-24880

    06/13/2022, 1:39 PM
    Hello, we are doing Redshift ingestion with the source below:
    Copy code
    source:
        type: "redshift"
        config:
            platform_instance: "dataland"
            env: "DEV"
            username: "${REDSHIFT_USER}"
            password: "${REDSHIFT_PASSWORD}"
            host_port: "dataland-internal.big.dev.scmspain.io:5439"
            database: "dwh_sch_sp_db"
            schema_pattern:
                deny:
                    - .*_mgmt$
            table_pattern:
                deny:
                    - .*dim_ad_normalization$
                    - .*dim_api_client$
            include_table_lineage: False
            include_views: False
            stateful_ingestion:
                enabled: True
                remove_stale_metadata: True
            profiling:
                enabled: true
                limit: 1000
                turn_off_expensive_profiling_metrics: True
            profile_pattern:
                allow:
                    - ^dwh_sch_sp_db.motos
                deny:
                    - .tmp.
                    - .temp.
            options:
                connect_args:
                    sslmode: prefer
    After the ingestion ran successfully, we can see in the logs that the tables matched by the deny patterns were soft deleted:
    Copy code
    'soft_deleted_stale_entities': ['urn:li:dataset:(urn:li:dataPlatform:redshift,dataland.dwh_sch_sp_db.pro_infojobs_es.dim_api_client,DEV)',
                                    'urn:li:dataset:(urn:li:dataPlatform:redshift,dataland.dwh_sch_sp_db.infojobs_es.dim_api_client,DEV)'],
    'query_combiner': {'total_queries': 676, 'uncombined_queries_issued': 379, 'combined_queries_issued': 82, 'queries_combined': 297, 'query_exceptions': 0},
    'saas_version': 'PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.38698',
    'upstream_lineage': {}}
    Sink (datahub-kafka) report:
    {'records_written': 8252, 'warnings': [], 'failures': [], 'downstream_start_time': None, 'downstream_end_time': None, 'downstream_total_latency_in_seconds': None}
    However, the tables are still visible in the UI, and the MySQL table has two entries, first with {"removed":true} and again with {"removed":false}. Can you please explain what is going wrong here?
  • b

    busy-airport-23391

    06/13/2022, 9:31 PM
    Good afternoon, I'm trying to set up a custom Java emitter to update a Glue dataset's stats. At the moment, all I'm trying to upsert is the timestampMillis. When I try to set the timestampMillis, I get the following error despite passing a long value:
    Copy code
    java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
    	at com.linkedin.metadata.timeseries.transformer.TimeseriesAspectTransformer.getCommonDocument(TimeseriesAspectTransformer.java:80)
    	at com.linkedin.metadata.timeseries.transformer.TimeseriesAspectTransformer.transform(TimeseriesAspectTransformer.java:47)
    	at com.linkedin.metadata.kafka.hook.UpdateIndicesHook.updateTimeseriesFields(UpdateIndicesHook.java:217)
    	at com.linkedin.metadata.kafka.hook.UpdateIndicesHook.invoke(UpdateIndicesHook.java:115)
    	at com.linkedin.metadata.kafka.MetadataChangeLogProcessor.consume(MetadataChangeLogProcessor.java:77)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:169)
    	at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:119)
    	at org.springframework.kafka.listener.adapter.HandlerAdapter.invoke(HandlerAdapter.java:56)
    	at org.springframework.kafka.listener.adapter.MessagingMessageListenerAdapter.invokeHandler(MessagingMessageListenerAdapter.java:347)
    	at org.springframework.kafka.listener.adapter.RecordMessagingMessageListenerAdapter.onMessage(RecordMessagingMessageListenerAdapter.java:92)
    	at org.springframework.kafka.listener.adapter.RecordMessagingMessageListenerAdapter.onMessage(RecordMessagingMessageListenerAdapter.java:53)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doInvokeOnMessage(KafkaMessageListenerContainer.java:2334)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeOnMessage(KafkaMessageListenerContainer.java:2315)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doInvokeRecordListener(KafkaMessageListenerContainer.java:2237)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doInvokeWithRecords(KafkaMessageListenerContainer.java:2150)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeRecordListener(KafkaMessageListenerContainer.java:2032)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeListener(KafkaMessageListenerContainer.java:1705)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeIfHaveRecords(KafkaMessageListenerContainer.java:1276)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:1268)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:1163)
    	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    So I had two questions: 1. Is there some more documentation on how to update a dataset's profile aspect? 2. Is there a reason the long value is getting converted to an int? Or am I trying to update this aspect incorrectly? Here's my Java function:
    Copy code
    MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
                    .entityType("dataset")
                    .entityUrn("urn:li:dataset:(<urn>,TEST)")
                    .upsert()
                    .aspect(new DatasetProfile()
                            //.setColumnCount(11)
                            .setTimestampMillis(0L)
                    )
                    .build();
    Thanks in advance for your help!
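    For comparison, a hedged, self-contained sketch using the datahub-client Java emitter (the URN and GMS address are placeholders). A realistic epoch-millis value such as System.currentTimeMillis() is large enough to come back as a Long on the server side, which may sidestep the Integer-to-Long cast seen above when a small value like 0 is deserialized as an int.
    Copy code
    import com.linkedin.dataset.DatasetProfile;

    import datahub.client.rest.RestEmitter;
    import datahub.event.MetadataChangeProposalWrapper;

    public class EmitDatasetProfile {
        public static void main(String[] args) throws Exception {
            // Build the datasetProfile timeseries aspect with a long epoch-millis
            // timestamp; the dataset URN below is a placeholder.
            MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
                    .entityType("dataset")
                    .entityUrn("urn:li:dataset:(urn:li:dataPlatform:glue,my_db.my_table,PROD)")
                    .upsert()
                    .aspect(new DatasetProfile()
                            .setTimestampMillis(System.currentTimeMillis()))
                    .build();

            // Assumes GMS is reachable at http://localhost:8080
            RestEmitter emitter = RestEmitter.createWithDefaults();
            emitter.emit(mcpw, null).get();  // block until the write is acknowledged
        }
    }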
  • t

    tall-fall-45442

    06/13/2022, 9:59 PM
    I'm having some issues using the UI to ingest new data. I have created a secret called my-postgres-password through the UI. Then when I create an ingestion recipe for Postgres and set the password value to '${my-postgres-password}', it says that it can't authenticate with the database. However, when I put the password in plain text the ingestion succeeds. Is there something I am doing wrong, or must secrets follow a certain naming convention?
  • b

    best-umbrella-24804

    06/14/2022, 12:25 AM
    Hello, I have a recipe for snowflake usage that looks like this
    Copy code
    source:
        type: snowflake-usage
        config:
            env: DEV
            host_port: xxxx.snowflakecomputing.com
            warehouse: DEVELOPER_X_SMALL
            username: DATAHUB_DEV_USER
            password: '${SNOWFLAKE_DEV_PASSWORD}'
            role: DATAHUB_DEV_ACCESS
            top_n_queries: 10
    sink:
        type: datahub-rest
        config:
            server: 'http://xxxxxx'
    We are finding that there are 1600 instances of the following error in our logs
    Copy code
    '2022-06-13 23:52:41.644712 [exec_id=66448a5e-1f15-406f-8ee7-700ed32f427b] INFO: stdout=[2022-06-13 23:52:37,635] WARNING  '
    "{datahub.ingestion.source.usage.snowflake_usage:93} - usage => Failed to parse usage line {'query_start_time': datetime.datetime(2022, "
    '6, 12, 4, 33, 7, 869000, tzinfo=datetime.timezone.utc), \'query_text\': "INSERT INTO xxxxx( ..... )            '
    'validation error for SnowflakeJoinedAccessEvent\n'
    'email\n'
    '  none is not an allowed value (type=type_error.none.not_allowed)\n'
    Any idea what is going on here?
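    A hedged guess at the knob involved: the usage source expects each access event to carry a user email, and this validation error appears when the Snowflake user has none on file; if this version exposes the email_domain option, it can synthesize username@<domain> instead.
    Copy code
    source:
        type: snowflake-usage
        config:
            # hedged guess: used to build user@<domain> when the Snowflake
            # user record has no email, which is what the error above rejects
            email_domain: mycompany.com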
  • l

    lemon-zoo-63387

    06/14/2022, 4:05 AM
    Could you please help me with a question? I ran the ingestion successfully with a YAML recipe, but the UI-based run does not report anything. How should it be configured?
  • m

    many-morning-40345

    06/14/2022, 5:44 AM
    Hi, is it possible to add dataset/glossary term properties via the UI?