# ingestion
  • c

    cool-painting-92220

    02/02/2022, 1:27 AM
    Hey everyone! I had a question about Snowflake query usage stat ingestion: the user account I created in Snowflake for DataHub ingestion is not an `accountadmin`; instead, I've applied a lower-level role (it has restricted access to tables and is masked from seeing particular sensitive rows/columns) and have granted the user access to the account_usage Snowflake schema. Will the queries pulled in this user's ingestion job also include queries that have been made by users with higher-level access (e.g. an account admin)? As an example of this scenario:
    Copy code
    Tables:
    Table A
    Table B
    
    Users:
    User 1: can only access Table A
    User 2: can access Table A and Table B
    
    User 2 has made a query before of the following: 
    SELECT a.uid FROM "Table A" a JOIN "Table B" b ON a.uid = b.uid
    
    Let's say I used User 1's credentials for my ingestion job of Table A: would the query usage stats pull User 2's query above?
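    As far as I understand, the snowflake-usage source reads query history from the shared snowflake.account_usage views rather than from the tables themselves, so what it returns depends on the grants on that schema rather than on the role's table-level access. A minimal sketch of such a job driven through datahub.ingestion.run.pipeline, with hypothetical account, role, and credential values:
    Copy code
    # Hypothetical sketch: usage ingestion with a restricted (non-accountadmin) role.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "snowflake-usage",
                "config": {
                    "host_port": "my_account",          # placeholder account identifier
                    "warehouse": "my_warehouse",
                    "username": "datahub_user",
                    "password": "...",
                    "role": "datahub_restricted_role",  # the lower-level role described above
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()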
    o
    • 2
    • 1
  • r

    rich-winter-40155

    02/02/2022, 1:31 AM
    Hi, I am new to DataHub. We are looking to set up Google-based SSO for our DataHub instance; how will the metadata REST connector work if we enable security via Google sign-in? I tried to look into the docs and I see there is a token option, but I am not sure how that will work. Appreciate any help here. Thanks
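    For what it's worth, the datahub-rest sink and the Python REST emitter both accept a token, so ingestion can authenticate with a personal access token even when the frontend sits behind Google sign-in; this assumes metadata-service token authentication is enabled in your deployment. A hedged sketch with placeholder server and token values:
    Copy code
    # Sketch: authenticating the REST emitter with a personal access token (placeholders).
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(
        gms_server="http://datahub-gms:8080",  # placeholder GMS address
        token="<personal-access-token>",       # token generated from the DataHub UI
    )
    emitter.test_connection()
    In a YAML recipe, the equivalent would be a token field under the datahub-rest sink config.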
    b
    • 2
    • 8
  • r

    rhythmic-kitchen-64860

    02/02/2022, 2:31 AM
    Hi, I want to know: can we ingest only one table from the database? Thanks in advance!!
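    If it helps, most SQL-based sources take allow/deny regex patterns, so ingestion can be restricted to a single table. A rough sketch via datahub.ingestion.run.pipeline (the MySQL source and all names here are placeholders):
    Copy code
    # Sketch: restrict ingestion to one table via table_pattern.allow (placeholder names).
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",
                    "username": "datahub",
                    "password": "...",
                    # regex anchored to exactly one <database>.<table>
                    "table_pattern": {"allow": ["^mydb\\.customers$"]},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()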
    b
    • 2
    • 7
  • c

    curved-truck-53235

    02/02/2022, 6:56 AM
    Hi everyone! Can we use environment variables in YAML? I know about datahub.ingestion.run.pipeline, but YAML is preferable for us.
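    If I remember correctly, the CLI expands ${VAR}-style environment variable references inside recipe YAML, which covers the common case of keeping secrets out of the file; failing that, the same recipe can be built in Python and read values from os.environ. A small sketch with made-up variable names:
    Copy code
    # Sketch: pulling connection details from environment variables (hypothetical names).
    import os

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",
                "config": {
                    "host_port": os.environ["DATAHUB_PG_HOST_PORT"],
                    "username": os.environ["DATAHUB_PG_USER"],
                    "password": os.environ["DATAHUB_PG_PASSWORD"],
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": os.environ.get("DATAHUB_GMS", "http://localhost:8080")},
            },
        }
    )
    pipeline.run()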
    b
    • 2
    • 2
  • m

    modern-monitor-81461

    02/02/2022, 12:22 PM
    How to disable Airflow lineage for some DAGs: Hi all, I am using Airflow as a job scheduler and I have been enjoying the lineage backend with DataHub. I have looked at the code and did not see any hint of this, so I'll ask here: is there a way to configure a DAG or an Operator to prevent Airflow from emitting task and pipeline lineage to DataHub? By default, once you install and configure the backend, any task and DAG that runs in Airflow will emit to DataHub. That's all cool, but we have jobs running in Airflow that are unrelated to data (infrastructure maintenance, housekeeping, etc.) and it makes no sense to see those in DataHub. It would be nice if there were a flag I could set on a DAG and/or Operator indicating whether Airflow should emit to DataHub, with a default for that flag settable in the lineage backend config so you can overwrite the current default behavior (emit by default or not). Does this make sense?
    o
    s
    • 3
    • 4
  • d

    dazzling-cat-48477

    02/02/2022, 5:36 PM
    Hello everyone. I have a pipeline from which I would like to extract the lineage. The pipeline consists of the following components: AWS S3 buckets, AWS Glue jobs (PySpark), and AWS Redshift, all orchestrated by AWS MWAA (Airflow). So far I have managed to visualize the lineage of the S3, Redshift, and Glue jobs (although the latter was a bit difficult), and I wanted to try to get the lineage from Airflow, taking into account that the Airflow tasks are all of type AwsGlueJobOperator. Since our Airflow is operated by AWS, we are not able to use the lineage backend due to version incompatibility, so I plan to try to get it with the help of the DatahubEmitterOperator. My questions are: 1. Is it possible to tell the lineage emitter that both its upstream and downstream tasks are AwsGlueJobOperator-type tasks? 2. If this is not possible, could it be done with the Spline spark-agent to extract the data from the Glue jobs?
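    A rough sketch of the DatahubEmitterOperator approach, under stated assumptions: the dataset URNs, platform names, and the Airflow connection id below are made up, and the operator simply emits a hand-built lineage MCE from inside the DAG, independent of what operator type the surrounding tasks use:
    Copy code
    # Hypothetical sketch: emit S3 -> Redshift lineage from an MWAA DAG (inside the DAG definition).
    import datahub.emitter.mce_builder as builder
    from datahub_provider.operators.datahub import DatahubEmitterOperator

    emit_lineage = DatahubEmitterOperator(
        task_id="emit_glue_job_lineage",
        datahub_conn_id="datahub_rest_default",  # Airflow connection pointing at GMS
        mces=[
            builder.make_lineage_mce(
                upstream_urns=[builder.make_dataset_urn("s3", "my-bucket/raw/events", "PROD")],
                downstream_urn=builder.make_dataset_urn("redshift", "dev.stg.version_detail", "PROD"),
            )
        ],
    )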
    o
    m
    b
    • 4
    • 5
  • h

    handsome-football-66174

    02/02/2022, 6:02 PM
    Hi everyone, I see that there is ingestion via the Web UI. Wanted to understand whether there is governance on who can execute the ingestions, etc.
    o
    • 2
    • 2
  • l

    late-bear-87552

    02/02/2022, 6:04 PM
    Copy code
    source:
      type: "bigquery"
      config:
        ## Coordinates
        project_id: adf-adfa-240416
        credential:
          project_id: adf-adfa-240416
          private_key_id: ""
          private_key: "-----BEGIN PRIVATE KEY"
          client_email: ""
          client_id: ""
        table_pattern:
          deny:
          - 
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    o
    • 2
    • 2
  • a

    ancient-apartment-23316

    02/02/2022, 8:07 PM
    Hi, I'm trying to ingest from Snowflake to DataHub using my local machine, and I get an HTTP 500 error:
    Copy code
    [2022-02-02 19:57:29,766] ERROR    {datahub.ingestion.run.pipeline:87} - failed to write record with workunit corp_data_forge_ods_dev.ada.curr_ada_permissions with ('Unable to emit metadata to DataHub GMS'
    'status': 500
    Errors in the GMS pod:
    Copy code
    16:31:33.065 [qtp544724190-11] INFO  c.l.m.filter.RestliLoggingFilter:56 - POST /entities?action=ingest - ingest - 500 - 0ms
    16:31:33.066 [qtp544724190-11] ERROR c.l.m.filter.RestliLoggingFilter:38 - java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    This is my recipe:
    Copy code
    source:
      type: snowflake
      config:
        env: POC
        host_port: "myacc"
        warehouse: "wh-name"
        database_pattern:
          allow:
            - "db-name"
        username: "username"
        password: "pass"
        role: "myrole"
    sink:
      type: "datahub-rest"
      config:
        server: "<http://123123-123123.us-east-1.elb.amazonaws.com:8080>"
    GMS is up; I can send an API request and receive a response:
    Copy code
    curl --location --request POST 'http://123123-12312123.us-east-1.elb.amazonaws.com:8080/entities?action=search' \
    --header 'X-RestLi-Protocol-Version: 2.0.0' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "input": "*",
        "entity": "dataset",
        "start": 0,
        "count": 1000
    }'
    But I can set the sink to JSON, and it works. Then I'm able to set source = json, sink = datahub, and it works! I don't know how that happens.
    o
    • 2
    • 12
  • g

    glamorous-microphone-33484

    02/03/2022, 1:12 AM
    In our org we will use Spark to read from Kafka and write to Kafka/Hive/files. Can DataHub extract this lineage info from Spark streaming jobs using DatahubSparkListener?
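    For reference, the batch-job setup I'm aware of looks like the sketch below (the package version and GMS address are placeholders); whether the listener also captures structured-streaming writes is exactly the open question here.
    Copy code
    # Sketch: attaching DatahubSparkListener to a Spark session (placeholder version and server).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("kafka_to_hive_job")
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23")  # placeholder version
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://datahub-gms:8080")
        .getOrCreate()
    )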
    o
    l
    c
    • 4
    • 7
  • l

    late-bear-87552

    02/03/2022, 5:42 AM
    Facing an issue while trying to ingest via the UI; it works through the datahub ingest command. Any idea what is missing?
    Copy code
    source:
        type: bigquery
        config:
            project_id: re-240416
            credential:
                private_key_id: 134143qefqafa12341
                private_key: "-----BEGIN PRIVATE KEY-----\n\n-----END PRIVATE KEY-----\n"
                client_email: test-query@re.gserviceaccount.com
                client_id: '4512451451341341'
    sink:
        type: datahub-rest
        config:
            server: 'http://localhost:8080'
    d
    • 2
    • 3
  • f

    few-air-56117

    02/03/2022, 7:13 AM
    Hi guys, I tried to ingest bigquery-usage for 2 projects. It started, but after 2-3 minutes I get this error:
    Copy code
    Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute' of service 'logging.googleapis.com' for consumer 'project_number:491986273194'. [{'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'RATE_LIMIT_EXCEEDED', 'domain': 'googleapis.com', 'metadata': {'consumer': 'projects/491986273194', 'quota_metric': 'logging.googleapis.com/read_requests', 'quota_limit': 'ReadRequestsPerMinutePerProject', 'service': 'logging.googleapis.com'}}]
    This is the recipe:
    Copy code
    source:
      type: bigquery-usage
      config:
        # Coordinates
        projects:
          - <project1>
          - <project2>
        max_query_duration: 5
    
    sink:
      type: "datahub-rest"
      config:
        server: <ip>
    I use a k8s cronjob and this image
    Copy code
    linkedin/datahub-ingestion:v0.8.24
    with this command
    Copy code
    args: ["ingest", "-c", "file"]
    Thx 😄.
    ✅ 1
    d
    b
    • 3
    • 17
  • s

    sparse-planet-56664

    02/03/2022, 12:27 PM
    Hi, just testing out the meta_mapping with the dbt ingestion. What if we have a meta key that contains different values that should map to different terms? Let's say we can have this in different models:
    Copy code
    meta:
      some_key: S1
    meta:
      some_key: S2
    Is this possible? Currently we are doing the mapping ourselves, but I wanted to test this out so we don't have to add our own logic/complexity. I can't see anywhere in the documentation that we can reuse the actual value from the meta key. Or is it possible to use a regexp match in the "match" field?
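    For anyone following along, this is the meta_mapping shape I have seen documented, where each meta key maps to a single match pattern and operation (term names and paths below are made up); whether the matched value itself can be reused in the term name is the open question here.
    Copy code
    # Sketch of the documented meta_mapping shape (hypothetical term/key names).
    dbt_source_config = {
        "manifest_path": "/path/to/manifest.json",
        "catalog_path": "/path/to/catalog.json",
        "target_platform": "snowflake",
        "meta_mapping": {
            "some_key": {
                "match": "S1",  # pattern matched against the meta value
                "operation": "add_term",
                "config": {"term": "Term_S1"},
            },
        },
    }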
    o
    m
    m
    • 4
    • 6
  • b

    bland-orange-13353

    02/03/2022, 1:59 PM
    This message was deleted.
    l
    • 2
    • 1
  • h

    high-family-71209

    02/03/2022, 2:08 PM
    Hi, what is the status of the Kafka metadata connector? I would like to ingest some Avro schemas that are propagated via Kafka.
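    For what it's worth, the kafka source pulls topic schemas from a Confluent-compatible schema registry, which is where the Avro comes from; a minimal sketch with placeholder broker and registry addresses:
    Copy code
    # Sketch: kafka metadata source reading Avro schemas from a schema registry (placeholders).
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "kafka",
                "config": {
                    "connection": {
                        "bootstrap": "broker:9092",
                        "schema_registry_url": "http://schema-registry:8081",
                    }
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()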
    l
    l
    • 3
    • 2
  • m

    millions-waiter-49836

    02/03/2022, 10:27 PM
    Hi everyone, about the data profiling feature: I noticed we use Great Expectations for SQL data stores and Deequ for the data lake. Can I ask what the considerations were behind this (besides Deequ lacking SQLAlchemy support)? If possible, I would also like to hear your comparison of those two tools, such as which tool is better for which scenarios, etc.
    👍 1
    s
    • 2
    • 1
  • g

    glamorous-microphone-33484

    02/04/2022, 9:17 AM
    Hi all, I have a few questions regarding the Kafka connector. 1. Does it support Cloudera-distributed Kafka or just Confluent Kafka? 2. Regarding security options, will the connector work on a Kafka cluster that uses Kerberos (i.e. sasl.mechanism=GSSAPI)? I tried to connect to my cluster by defining the mandatory parameters for Kerberos such as sasl.kerberos.service.name, sasl.kerberos.principal, sasl.kerberos.keytab, etc. However, it failed with the following exception: "No provider for SASL mechanism GSSAPI: recompile librdkafka with libsasl2 or openssl support. Current build options: PLAIN SASL_SCRAM OAUTHBEARER". Can I assume GSSAPI is not supported at the moment?
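    The Kerberos settings mentioned above would go through connection.consumer_config, which as far as I know is passed straight to the underlying librdkafka-based consumer, so GSSAPI availability comes down to how that library was built. A sketch of that block with hypothetical values:
    Copy code
    # Sketch: SASL/GSSAPI settings passed through connection.consumer_config (hypothetical values).
    kafka_connection = {
        "bootstrap": "broker:9092",
        "schema_registry_url": "http://schema-registry:8081",
        "consumer_config": {
            "security.protocol": "SASL_PLAINTEXT",
            "sasl.mechanism": "GSSAPI",
            "sasl.kerberos.service.name": "kafka",
            "sasl.kerberos.principal": "datahub@EXAMPLE.COM",
            "sasl.kerberos.keytab": "/etc/security/keytabs/datahub.keytab",
        },
    }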
    plus1 1
    r
    • 2
    • 1
  • r

    rich-policeman-92383

    02/04/2022, 11:26 AM
    In v0.8.20, setting env: "QA" in hive_source.yaml results in an "unknown Fabric Type" exception. Can we use all the FabricTypes defined here for all data sources?
    h
    • 2
    • 1
  • g

    gray-table-56299

    02/04/2022, 1:52 PM
    👋 I'm running into
    ValueError: source produced an invalid metadata work unit:
    when I am trying to write a custom ingestion script using the Python library. Is it possible to get a more specific exception message that provides info on which part of the MCP is invalid?
    m
    o
    • 3
    • 4
  • b

    bulky-arm-32887

    02/04/2022, 3:44 PM
    Hi everyone, I have a question about the BigQuery connector. It seems like external tables are ignored during the ingestion process. Is this a known limitation?
    o
    • 2
    • 1
  • b

    broad-battery-31188

    02/04/2022, 5:13 PM
    I am experiencing the error
    duplicate key value violates unique constraint "pk_metadata_aspect_v2"
    for DBT ingestion. Recipe:
    Copy code
    source:
      type: "dbt"
      config:
        manifest_path: "home/user/manifest.json"
        catalog_path: "/home/user/catalog.json"
        target_platform: "snowflake" 
        load_schemas: False
    q
    i
    • 3
    • 8
  • d

    dazzling-cat-48477

    02/04/2022, 10:10 PM
    Hi again everyone. Has anyone been able to visualize the lineage between AWS Glue and AWS Redshift? I think my annotation for the DataSink in the job is the problem, as it shows the Redshift dataset as a Glue dataset (see the red circle in the first image); it should look like the second image. I generated the Glue annotation manually because when I try to generate it through Glue Studio I get an error in the DataSink:
    [gluestudio-service.us[MASK].amazonaws.com] createScript: InvalidInputException: Invalid DataSink: DataSink(name=Amazon Redshift, classification=DataSink, type=Redshift, inputs=[node-2], isSinkInStreamingDAG=false)
    Am I missing something? I attach the Glue annotation below. Thank you!
    Copy code
    ## @type: DataSink
    ## @args: [database = "redshift_test", table_name = "dev_stg_stg_version_detail", transformation_ctx = "df3"]
    ## @return: df3
    ## @inputs: []
    b
    a
    b
    • 4
    • 5
  • n

    nutritious-egg-28432

    02/06/2022, 9:06 PM
    Hello all, is it possible to integrate Dataiku with DataHub?
    plus1 2
    l
    • 2
    • 2
  • g

    glamorous-microphone-33484

    02/07/2022, 5:12 AM
    Hi all, do you have any connector ready to ingest from MinIO?
    l
    • 2
    • 1
  • h

    high-hospital-85984

    02/07/2022, 1:11 PM
    I'm implementing a custom SQL parser (Snowflake dialect) for use with, for example, the LookML integration. I'm looking at the get_columns function and can't really figure out what the output should be. Is it supposed to return the "schema" of the LookML view, or the source columns from a lineage point of view? Based on the tests it's only the column names, without any possible source table prefix?
    d
    • 2
    • 5
  • c

    cool-gpu-73611

    02/07/2022, 2:48 PM
    Hi! Where can I see examples of creating metadata directly using the API? Better documentation would help. I see the ability to add data using plugins, but that is not enough for us, and it is too hard for me to create new plugins. Marquez, for example, supports a user-friendly API; it is easy to manipulate data using that API. But I don't see any user-friendly way with DataHub.
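    For what it's worth, the Python REST emitter can be used directly without writing a source plugin. A small sketch (the dataset name, description, and GMS address are placeholders) that upserts a single aspect:
    Copy code
    # Sketch: emit one aspect straight to GMS with the Python REST emitter (placeholder values).
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("hive", "mydb.mytable", "PROD"),
        aspectName="datasetProperties",
        aspect=DatasetPropertiesClass(description="Created directly via the REST API"),
    )
    emitter.emit_mcp(mcp)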
    i
    l
    • 3
    • 24
  • s

    some-crayon-90964

    02/07/2022, 5:08 PM
    Hey Acryl team, we are wondering if it is possible to have GMS accept metadata but not actually store it in the database. We would like a pipeline that we developed to test through GMS when deployed, to make sure it works with our other systems. Thanks in advance!
    i
    • 2
    • 1
  • b

    busy-sandwich-94034

    02/08/2022, 4:19 AM
    Hi everyone, we are looking to ingest a Kafka schema registry, and we have customized the schema-registry authentication mechanism to only allow JWT. But I only find the basic-auth option in the Kafka metadata recipe.yaml. Do you have any idea how to use JWT as authentication? Thank you!
    i
    • 2
    • 2
  • g

    gray-table-56299

    02/08/2022, 12:25 PM
    👋 since only
    UPSERT
    is supported for MCPs, what's the recommended way to delete an aspect…?
    i
    • 2
    • 1
  • b

    bland-salesmen-77140

    02/08/2022, 1:15 PM
    Hi, some time ago we did a PoC at our company and we spotted that metadata ingestion from Snowflake has a constraint that all DBs, schemas, and table names should be in upper case. Is that still the case? Is there some kind of workaround? Unfortunately we use a case-sensitive naming convention and we cannot change that at this point.
    d
    • 2
    • 5