# ingestion
  • g

    glamorous-library-1322

    08/05/2022, 2:38 PM
    Hey all, I'm trying to do profiling on Druid datasets. It works OK for the table stats (with
    profile_table_level_only: true
    ), but when it gets to columns it gets stuck on the null count (throws an error for
    datahub/ingestion/source/ge_data_profiler.py
    on
    get_column_nonnull_count
    ). Side note: unfortunately the query that Great Expectations tries to run against Druid to count all the nulls is not allowed 😞 There is an option to disable the null count in the Druid data source,
    include_field_null_count: false
    , but this does not stop the error (or make any difference). Does anybody have experience with profiling on Druid data sources? I'm currently running 0.8.36, I run the ingestion via the client, and my ingestion yaml is very simple (below).
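    (For reference, a minimal sketch of what such a recipe could look like; the coordinates and sink are placeholders, and whether include_field_null_count is honoured for Druid is exactly the open question here.)
    source:
      type: druid
      config:
        host_port: "localhost:8082"        # placeholder broker coordinates
        profiling:
          enabled: true
          profile_table_level_only: true   # table-level stats reportedly work
          include_field_null_count: false  # reportedly does not suppress the failing null-count query
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"    # placeholder GMS endpoint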
  • b

    brave-tomato-16287

    08/05/2022, 3:14 PM
    Hey all! How do I ingest items from sub-folder projects in Tableau? I have the structure:
    Copy code
    root / Operations / [Operations] Common reports / workbooks*
    Operations
    is included in the projects section of the yaml and it is ingested, but items in the sub-folder, for example
    [Operations] Common reports
    are not ingested.
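    (One hedged guess, with placeholder values: list the sub-project under projects by its own name and see whether it is then picked up; whether nested projects are traversed automatically in this version is exactly the open question.)
    source:
      type: tableau
      config:
        connect_uri: "https://tableau.example.com"   # placeholder
        site: "mysite"                               # placeholder
        username: "user"
        password: "pass"
        projects:
          - "Operations"
          - "[Operations] Common reports"   # sub-project listed explicitly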
  • b

    bulky-keyboard-25193

    08/05/2022, 3:40 PM
    Hi all, brand new here, just tried ingesting from
    postgres
    and saw that
    composite types
    do not seem to be supported. Anything I’m missing before I look at the code?
  • b

    bulky-keyboard-25193

    08/05/2022, 4:13 PM
    OK, I looked at the DataHub code and I see that it delegates to
    sqlalchemy
    . Looking there, I see that it treats composite types as a collection of columns, like (c1,c2,c3…), and it expects you to access them via the
    orm
    : https://docs.sqlalchemy.org/en/14/orm/composites.html . So I guess I need to write my own ingestion code to get my composite types into DataHub?
  • g

    gifted-knife-16120

    08/06/2022, 9:46 AM
    Hi all, right now we have DataHub in our dev environment and we need to deploy it to production as well. I have set up all the owners, descriptions, validations and so on, so I need advice on how to replicate all of that information to production. Is it possible?
  • c

    cold-autumn-7250

    08/07/2022, 9:05 AM
    Hey all, how do you connect Airflow jobs with dbt models? We trigger our dbt DAGs with Airflow and I would like to connect them. Both the Airflow DAGs and the dbt models are in DataHub. For connecting them, I have the following ideas: 1. Connect the Airflow trigger task with all dbt nodes. 2. Connect the Airflow trigger task with only the dbt leaf nodes / sources (only connect the first nodes instead of all of them). The first idea seems the easiest to implement but might mess up lineage. The second one seems hard to implement, as dbt does not help you by giving you only the source/leaf nodes. Therefore my question to you: how do you solve the connection between Airflow and dbt when you trigger dbt jobs with Airflow? An additional question in my mind: how do you e.g. also connect external sources like S3 with dbt (in the case of external tables)? Thanks a lot for your ideas and insights 🙂
  • v

    victorious-tomato-25942

    08/07/2022, 11:59 AM
    Hey, I ran the below ingestion for one of our read-only Aurora instances, and it resulted in CPU going from 2% to ~100% for a long while. Are there any guidelines on using profiling safely in production environments?
    Copy code
    source:
      type: postgres
      config:
        host_port: ddddd
        database: ddddd
        username: dddd
        password: xxxx
        include_tables: True
        include_views: True
        table_pattern:
          deny:
            - '.*\.gateway_raw_.*'
        profiling:
          enabled: True
          turn_off_expensive_profiling_metrics: True
    
    sink:
      type: "datahub-rest"
      config:
        server: xxxx
        token: xxxx
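    (A hedged sketch of settings that may keep profiling cheaper on a production replica; option names assume the standard profiling config of the SQL-based sources, and the pattern and limits are placeholders.)
    # additions under source.config
    profile_pattern:
      allow:
        - 'mydb\.public\..*'                    # profile only the tables you actually need
    profiling:
      enabled: true
      turn_off_expensive_profiling_metrics: true
      profile_table_level_only: true            # start with row/column counts only
      max_number_of_fields_to_profile: 20       # cap per-table column work, if supported in this version
      limit: 100000                             # sample rows instead of full scans, if supported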
  • a

    aloof-oil-31167

    08/07/2022, 2:25 PM
    Hey everyone, I'm trying to ingest S3 with the delta-lake ingestion type. Is anyone familiar with the following error -
    Copy code
    TypeError: argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'
    ??🙏
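    (One hedged guess, based on the delta-lake recipe shared later in this channel: the reader may need explicit storage options, so supplying an s3.aws_config block, even with empty keys, might avoid the NoneType storage_options. Paths and region below are placeholders.)
    source:
      type: delta-lake
      config:
        base_path: "s3://my-bucket/path/to/table"
        s3:
          aws_config:
            aws_region: "us-east-1"
            aws_access_key_id: ""
            aws_secret_access_key: ""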
  • l

    lemon-answer-80661

    08/07/2022, 3:15 PM
    Hey, I installed the Athena and Trino plugins but they don't appear in the ingestion options. Has anyone faced this issue?
  • c

    crooked-rose-22807

    08/08/2022, 8:16 AM
    Hi everyone, I’m currently trying to understand the
    ignore_old_state
    and
    ignore_new_state
    for dbt
    stateful_ingestion
    . I don't quite get how I can check or monitor the checkpoint to see these flags working on my data. Can someone clarify where I can check, or point me to any useful articles to read? TQVM
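    (For context, a minimal sketch of where those flags sit in a recipe; broadly, ignore_old_state appears to skip reading the previous checkpoint and ignore_new_state appears to skip committing a new one. The pipeline_name, paths and sink are placeholders; a pipeline_name is needed for checkpoint state to be stored at all.)
    pipeline_name: dbt_prod_ingestion
    source:
      type: dbt
      config:
        manifest_path: "./target/manifest.json"
        catalog_path: "./target/catalog.json"
        target_platform: postgres          # placeholder
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true
          ignore_old_state: false          # true = do not read the previous checkpoint
          ignore_new_state: false          # true = do not write a new checkpoint
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"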
  • m

    mysterious-nail-70388

    08/08/2022, 8:20 AM
    Hello, is the Schema Registry container always started?
  • a

    aloof-oil-31167

    08/08/2022, 12:22 PM
    Hey, I'm using delta-lake ingestion and I added a transformer in order to add an owner via the recipe, but I'm getting the following error -
    Copy code
    'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
    'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
               'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: Failed to validate record with class '
                             'com.linkedin.common.Ownership: ERROR :: /owners/0/owner :: "Provided urn Allegro" is invalid\n'
                             '\n'
                               '\tat com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$3(AspectResource.java:142)\n'
                             '\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)\n'
                             '\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)\n'
    this is my recipe -
    Copy code
    source:
      type: delta-lake
      config:
        env: $ENV
        platform_instance: "riskified-delta-lake"
        base_path: $DELTA_TABLE_PATH # test one table, and then make this recipe work for entire bucket
        s3:
          aws_config:
            aws_role: $AWS_ROLE_NAME
            aws_region: "us-east-1"
            env: $ENV
            aws_access_key_id: "" 
            aws_secret_access_key: ""
    transformers:
      - type: "simple_add_dataset_ownership"
        config:
          owner_urns:
            - $OWNER
    sink:
      type: "datahub-rest"
      config:
        server: "<https://riskified.acryl.io/gms>"
        token: $DATAHUB_TOKEN
    any ideas?
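    (The validation failure suggests the transformer received a bare name rather than a full URN; a hedged sketch of the transformer block with a fully qualified owner. The user/group name is a placeholder.)
    transformers:
      - type: "simple_add_dataset_ownership"
        config:
          owner_urns:
            - "urn:li:corpuser:Allegro"      # or urn:li:corpGroup:<group-name>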
  • a

    alert-football-80212

    08/08/2022, 12:41 PM
    Hi all, how can I create featureTable features and their lineage? I can't find this in the DataHub UI.
  • l

    little-twilight-71687

    08/08/2022, 3:30 PM
    Hi there. According to docs:
    Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the
    max_rows
    recipe parameter (see below) JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance. We are working on using iterator-based JSON parsers to avoid reading in the entire JSON object.
    I have many JSON files which cannot be ingested because of:
    could not infer schema for file s3://path/to/file.json: ' 'Trailing data']
    It looks like DataHub uses ujson for ingestion. How can I work around this problem, and/or when will this be fixed?
  • v

    victorious-pager-14424

    08/08/2022, 3:43 PM
    Hi everyone! We're using the Trino ingestion recipe and we want all data ingested from it to have a different platform name. Is that possible? From what I've read in this article, we are able to create new data platforms, and the recipe also has a
    platform
    string parameter. If I pass the new data platform name or URN to this parameter, will it assign all ingested data to the new platform?
  • b

    bright-receptionist-94235

    08/08/2022, 8:07 PM
    Hi all, any plans to add Vertica ingestion from the UI?
  • c

    cuddly-apple-7818

    08/08/2022, 9:42 PM
    For BigQuery, is there a way to get lineage computed incrementally? Currently, if table1 updates table2 on 01/01, and table3 updates table2 on 01/02, and we trigger two runs with date range 01/01 and 01/02 respectively, the second run will overwrite the table1 to table2 lineage. We’d like to get the full lineage from the very start but would hate to have to parse through all historical logs every single time.
  • l

    lemon-zoo-63387

    08/09/2022, 2:36 AM
    Hey everyone, I don't know if I understand this correctly. The Actions framework subscribes to MetadataChangeLog_v1 and PlatformEvent_v1, but my startup file is docker-compose-without-neo4j.quickstart.yml and there is no Kafka container there. If I install Kafka with Docker, how does DataHub write data to these two topics? Also, how can I create an issue using JIRA? https://datahubproject.io/docs/actions/sources/kafka-event-source https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/cli/docker.py
  • f

    famous-florist-7218

    08/09/2022, 6:42 AM
    Hi guys, DataHub ingestion seems to be missing the environment variables needed by the Kafka Connect and S3 data lake sources. Any recommendations? Thanks in advance!
    Copy code
    '[2022-08-09 06:34:00,334] ERROR    {datahub.ingestion.run.pipeline:126} - No JVM shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable properly.\n'
    
    '[2022-08-09 06:36:45,327] ERROR    {logger:26} - Please set env variable SPARK_VERSION
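    (A hedged sketch of one way to supply those variables when ingestion runs inside the actions/ingestion container; the service name, JVM path and Spark version are assumptions for illustration.)
    # docker-compose override, assumed service name and values
    services:
      datahub-actions:
        environment:
          - JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64   # for the "No JVM shared library file" error
          - SPARK_VERSION=3.2                              # for the "Please set env variable SPARK_VERSION" error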
  • f

    few-grass-66826

    08/09/2022, 11:19 AM
    Hi guys, I am using profiling: enable: True, but DataHub doesn't ingest stats for all tables. Is there something wrong, or does it have limitations?
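    (A hedged sketch of knobs worth checking when only some tables get stats; note the flag is usually spelled "enabled" rather than "enable", and profile_pattern / field limits can quietly exclude tables or columns. Values are placeholders.)
    # additions under source.config
    profile_pattern:
      allow:
        - '.*'                                 # make sure the missing tables are not filtered out here
    profiling:
      enabled: true
      profile_table_level_only: false
      max_number_of_fields_to_profile: 100     # very wide tables may otherwise be truncated, if supported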
  • a

    alert-football-80212

    08/09/2022, 2:52 PM
    Hi all, I have a Kafka ingestion recipe with one topic and its schema. All the recipe parameters look perfectly fine, but after I execute the ingestion command I still have a schema-less topic in the DataHub UI. For the love of DataHub, what's wrong with my Kafka recipe?
    Copy code
    source:
      type: "kafka"
      config:
        # Coordinates
        env: PROD
        connection:
          bootstrap: some_url
          consumer_config:
            security.protocol: "SASL_SSL"
            sasl.mechanism: "PLAIN"
            sasl.username: user_name
            sasl.password: some_password
          schema_registry_url: some_scheme_url
        topic_patterns:
          allow:
            - some_topic_name
        topic_subject_map:
          some_topic_name-value: some_schema_name
    transformers:
      - type: "simple_add_dataset_ownership"
        config:
          owner_urns:
            - some_owner_name
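    (Two hedged things to double-check against the recipe above: topic_subject_map keys take the <topic>-value / <topic>-key form mapped to the exact schema-registry subject name, and the transformer's owner_urns need full URNs, since a bare name fails GMS validation as in the delta-lake error earlier in this channel. Placeholder values below.)
    topic_subject_map:
      some_topic_name-value: "exact-subject-name-value"   # subject exactly as registered in the schema registry
      some_topic_name-key: "exact-subject-name-key"       # optional, for the key schema
    transformers:
      - type: "simple_add_dataset_ownership"
        config:
          owner_urns:
            - "urn:li:corpuser:some_owner_name"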
  • s

    shy-parrot-64120

    08/09/2022, 6:33 PM
    Hi all, has anyone tried to ingest metadata from AWS Athena views? The views are ingested, however no upstream lineage and no SQL definitions are filled in. Filed a bug here: https://github.com/datahub-project/datahub/issues/5599
  • c

    curved-magazine-23582

    08/10/2022, 1:52 AM
    hello, I am looking at PowerBI ingestion, and have some questions. Does it work with admin level user or common user credentials?
  • s

    steep-soccer-91284

    08/10/2022, 6:33 AM
    Can I ingest Airflow lineage from another EKS cluster? I'm wondering whether it would work.
  • k

    kind-whale-32412

    08/10/2022, 7:15 AM
    Can I add tags with
    MetadataChangeProposalWrapper
    if I am building a custom ingestion? I couldn't find a way to do that with the Java library. I also couldn't see any reference (i.e. it exists for the GraphQL API https://datahubproject.io/docs/graphql/mutations/ but I couldn't find anything for MCPW). An example of the GraphQL API query that I'm trying to reproduce with MCPW is this:
    {
      "operationName": "addTags",
      "variables": {
        "input": {
          "tagUrns": ["urn:li:tag:someTag"],
          "resourceUrn": "urn:li:dataset:(urn:li:dataPlatform:plato,something.here,PROD)",
          "subResource": "_file_name",
          "subResourceType": "DATASET_FIELD"
        }
      },
      "query": "mutation addTags($input: AddTagsInput!) {\n  addTags(input: $input)\n}\n"
    }
  • a

    alert-football-80212

    08/10/2022, 9:30 AM
    Hello, I want to create three entities (Model, featureTable, mlFeature) and their connections. I didn't find a way to do it from the UI, and I looked for an API for this but can't find one. Does anyone know what I can do? Thank you!
  • b

    busy-umbrella-4099

    08/10/2022, 9:35 AM
    I have set up a Docker-based instance of DataHub.
    1. Using the recipe.yml: source: type:"postgres" config: username:"postgres" password:"postgres" host_port"postgreshost5432" sink: type:"datahub-rest" config: server: 'localhost:8080' and running: datahub ingest -c recipe.yml, the error I got was: ERROR {datahub.entrypoints:188} - Command failed with mapping values are not allowed here in "<file>", line 3, column 9. Run with --debug to get full trace.
    2. I also tried to add the same data source using the ingestion UI. It took me through the process and showed messages that the ingestion was initiated, but I don't see any data source added. I had scheduled it to run every minute.
    Any guidance on how I can make the ingestion work?
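    (That "mapping values are not allowed here" error is plain YAML parsing: each key needs "key: value" with a space after the colon, and nesting needs consistent indentation. A hedged sketch of the same recipe laid out as valid YAML; the host and port are assumptions.)
    source:
      type: postgres
      config:
        username: postgres
        password: postgres
        host_port: "postgreshost:5432"   # assumed host:port, note the colon between them
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"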
  • l

    limited-forest-73733

    08/10/2022, 10:29 AM
    Hey, I am working on enabling profiling for the tables in schemas in databases, and I want to ask how profiling happens for tables. I enabled profiling and added a database pattern, schema pattern and profiling pattern, but it does not enable profiling for the tables. I just want to confirm: what does it consider as the basis for enabling profiling on a table? This is the recipe we are using to enable profiling for the ESG.T_ESG_MSCI.* tables.
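    (A hedged sketch of how those patterns typically compose for the SQL-based sources: a table must pass the database/schema/table patterns to be ingested at all, and additionally match profile_pattern, which is usually evaluated against the fully qualified database.schema.table name; exact semantics can vary a bit by source. Names follow the ESG.T_ESG_MSCI example above.)
    database_pattern:
      allow:
        - 'ESG'
    schema_pattern:
      allow:
        - 'T_ESG_MSCI'
    profile_pattern:
      allow:
        - 'ESG\.T_ESG_MSCI\..*'    # fully qualified db.schema.table regex
    profiling:
      enabled: true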
  • m

    microscopic-mechanic-13766

    08/10/2022, 11:07 AM
    Good morning everyone, I am trying to ingest metadata from a Kerberized Hive but I am getting this error:
    Copy code
    TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found
    I am currently using datahub-gms version v0.8.42 (the release 4f35a6c where the
    file:///etc/datahub/plugins/auth/resources
    is fixed), 0.8.42 for CLI and
    acryldata/datahub-actions:v0.0.4
    . My recipe is the following:
    Copy code
    source:
        type: hive
        config:
            database: null
            host_port: 'hive-server:10000'
            options:
                connect_args:
                    auth: KERBEROS
                    kerberos_service_name: hive-server
    sink:
        type: datahub-rest
        config:
            server: '<http://datahub-gms:8080>'
    I have seen this same error in messages from almost a year ago, where the problem was that some libraries were missing. Although I think that might have been solved, I added said libraries and I still get the same error, or a very similar one. I have also seen that the problem might be that the authentication protocol is not the same on both sides, but in my case Hive uses Kerberos:
    Copy code
    <property>
        <name>hive.server2.authentication</name>
        <value>kerberos</value>
      </property>
  • e

    elegant-salesmen-99143

    08/10/2022, 1:40 PM
    Hello community. Does anyone know whether there is an integration between DataHub and Fine BI?