# ingestion
  • g

    glamorous-library-1322

    08/05/2022, 2:38 PM
    Hey all, I'm trying to do profiling on Druid datasets. It works OK for the table stats (with
    profile_table_level_only: true
    ), but when it gets to columns it gets stuck on the null count (throws an error for
    datahub/ingestion/source/ge_data_profiler.py
    on
    get_column_nonnull_count
    ). Side note: unfortunately the query that Great Expectations tries to run against Druid to count all the nulls is not allowed 😞 There is an option to disable the null count in the Druid data source,
    include_field_null_count: false
    , but this does not stop the error (or make any difference). Does anybody have experience with profiling on Druid data sources? I'm currently running 0.8.36, I run the ingestion via the client, and my ingestion yaml is very simple (below).
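    (For reference, a minimal sketch of what such a recipe could look like; the coordinates and sink are placeholders, and whether include_field_null_count is honoured for Druid is exactly the open question here.)
    source:
      type: druid
      config:
        host_port: "localhost:8082"        # placeholder broker coordinates
        profiling:
          enabled: true
          profile_table_level_only: true   # table-level stats reportedly work
          include_field_null_count: false  # reportedly does not suppress the failing null-count query
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"    # placeholder GMS endpoint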
  • b

    brave-tomato-16287

    08/05/2022, 3:14 PM
    Hey all! How do I ingest items from sub-folder projects in Tableau? I have the structure:
    Copy code
    root / Operations / [Operations] Common reports / workbooks*
    Operations
    is included in the projects section of the yaml and it is ingested, but items in the sub-folder, for example
    [Operations] Common reports
    are not ingested.
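    (One hedged guess, with placeholder values: list the sub-project under projects by its own name and see whether it is then picked up; whether nested projects are traversed automatically in this version is exactly the open question.)
    source:
      type: tableau
      config:
        connect_uri: "https://tableau.example.com"   # placeholder
        site: "mysite"                               # placeholder
        username: "user"
        password: "pass"
        projects:
          - "Operations"
          - "[Operations] Common reports"   # sub-project listed explicitly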
  • b

    bulky-keyboard-25193

    08/05/2022, 3:40 PM
    Hi all, brand new here, just tried ingesting from
    postgres
    and saw that
    composite types
    do not seem to be supported. Anything I’m missing before I look at the code?
  • b

    bulky-keyboard-25193

    08/05/2022, 4:13 PM
    OK, I looked at the DataHub code and I see that it delegates to
    sqlalchemy
    . Looking there, I see that it treats composite types as a collection of columns, like (c1,c2,c3…), and it expects you to access them via the
    orm
    : https://docs.sqlalchemy.org/en/14/orm/composites.html . So I guess I need to write my own ingestion code to get my composite types into DataHub?
  • g

    gifted-knife-16120

    08/06/2022, 9:46 AM
    Hi all, right now we have DataHub in our dev environment and we need to deploy it to production as well. I have set up all the owners, descriptions, validations and so on, so I need advice on how to replicate all of that information to production. Is it possible?
  • c

    cold-autumn-7250

    08/07/2022, 9:05 AM
    Hey all, how do you connect Airflow jobs with dbt models? We trigger our dbt DAGs with Airflow and I would like to connect them. Both the Airflow DAGs and the dbt models are in DataHub. For connecting them, I have the following ideas: 1. Connect the Airflow trigger task with all dbt nodes. 2. Connect the Airflow trigger task with only the dbt leaf nodes / sources (only connect the first nodes instead of all of them). The first idea seems the easiest to implement but might mess up lineage. The second one seems hard to implement, as dbt does not help you by giving you only the source/leaf nodes. Therefore my question to you: how do you solve the connection between Airflow and dbt when you trigger dbt jobs with Airflow? An additional question in my mind: how do you e.g. also connect external sources like S3 with dbt (in the case of external tables)? Thanks a lot for your ideas and insights 🙂
  • v

    victorious-tomato-25942

    08/07/2022, 11:59 AM
    Hey, I ran the below ingestion for one of our read-only Aurora instances, and it resulted in CPU going from 2% to ~100% for a long while. Are there any guidelines on using profiling safely in production environments?
    Copy code
    source:
      type: postgres
      config:
        host_port: ddddd
        database: ddddd
        username: dddd
        password: xxxx
        include_tables: True
        include_views: True
        table_pattern:
          deny:
            - '.*\.gateway_raw_.*'
        profiling:
          enabled: True
          turn_off_expensive_profiling_metrics: True
    
    sink:
      type: "datahub-rest"
      config:
        server: xxxx
        token: xxxx
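    (A hedged sketch of settings that may keep profiling cheaper on a production replica; option names assume the standard profiling config of the SQL-based sources, and the pattern and limits are placeholders.)
    # additions under source.config
    profile_pattern:
      allow:
        - 'mydb\.public\..*'                    # profile only the tables you actually need
    profiling:
      enabled: true
      turn_off_expensive_profiling_metrics: true
      profile_table_level_only: true            # start with row/column counts only
      max_number_of_fields_to_profile: 20       # cap per-table column work, if supported in this version
      limit: 100000                             # sample rows instead of full scans, if supported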
  • a

    aloof-oil-31167

    08/07/2022, 2:25 PM
    Hey everyone, I'm trying to ingest S3 with the delta-lake ingestion type. Is anyone familiar with the following error -
    Copy code
    TypeError: argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'
    ??🙏
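    (One hedged guess, based on the delta-lake recipe shared later in this channel: the reader may need explicit storage options, so supplying an s3.aws_config block, even with empty keys, might avoid the NoneType storage_options. Paths and region below are placeholders.)
    source:
      type: delta-lake
      config:
        base_path: "s3://my-bucket/path/to/table"
        s3:
          aws_config:
            aws_region: "us-east-1"
            aws_access_key_id: ""
            aws_secret_access_key: ""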
  • l

    lemon-answer-80661

    08/07/2022, 3:15 PM
    Hey, I installed the Athena and Trino plugins but they don't appear in the ingestion options. Has anyone faced this issue?
  • c

    crooked-rose-22807

    08/08/2022, 8:16 AM
    Hi everyone, I’m currently trying to understand the
    ignore_old_state
    and
    ignore_new_state
    for dbt
    stateful_ingestion
    . I don't quite get how I can check or monitor the checkpoint to see these flags working on my data. Can someone clarify where I can check, or point me to any useful articles to read? TQVM
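    (For context, a minimal sketch of where those flags sit in a recipe; broadly, ignore_old_state appears to skip reading the previous checkpoint and ignore_new_state appears to skip committing a new one. The pipeline_name, paths and sink are placeholders; a pipeline_name is needed for checkpoint state to be stored at all.)
    pipeline_name: dbt_prod_ingestion
    source:
      type: dbt
      config:
        manifest_path: "./target/manifest.json"
        catalog_path: "./target/catalog.json"
        target_platform: postgres          # placeholder
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true
          ignore_old_state: false          # true = do not read the previous checkpoint
          ignore_new_state: false          # true = do not write a new checkpoint
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"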
  • m

    mysterious-nail-70388

    08/08/2022, 8:20 AM
    Hello, is the Schema Registry container always started?
  • a

    aloof-oil-31167

    08/08/2022, 12:22 PM
    Hey, I'm using delta-lake ingestion and I added a transformer in order to add an owner via the recipe, but I'm getting the following error -
    Copy code
    'failures': [{'error': 'Unable to emit metadata to DataHub GMS',
    'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
               'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: Failed to validate record with class '
                             'com.linkedin.common.Ownership: ERROR :: /owners/0/owner :: "Provided urn Allegro" is invalid\n'
                             '\n'
                               '\tat com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$3(AspectResource.java:142)\n'
                             '\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)\n'
                             '\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)\n'
    this is my recipe -
    Copy code
    source:
      type: delta-lake
      config:
        env: $ENV
        platform_instance: "riskified-delta-lake"
        base_path: $DELTA_TABLE_PATH # test one table, and then make this recipe work for entire bucket
        s3:
          aws_config:
            aws_role: $AWS_ROLE_NAME
            aws_region: "us-east-1"
            env: $ENV
            aws_access_key_id: "" 
            aws_secret_access_key: ""
    transformers:
      - type: "simple_add_dataset_ownership"
        config:
          owner_urns:
            - $OWNER
    sink:
      type: "datahub-rest"
      config:
        server: "<https://riskified.acryl.io/gms>"
        token: $DATAHUB_TOKEN
    any ideas?
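    (The validation failure suggests the transformer received a bare name rather than a full URN; a hedged sketch of the transformer block with a fully qualified owner. The user/group name is a placeholder.)
    transformers:
      - type: "simple_add_dataset_ownership"
        config:
          owner_urns:
            - "urn:li:corpuser:Allegro"      # or urn:li:corpGroup:<group-name>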
  • a

    alert-football-80212

    08/08/2022, 12:41 PM
    Hi all, how can I create featureTable features and their lineage? I can't find this in the DataHub UI.
  • l

    little-twilight-71687

    08/08/2022, 3:30 PM
    Hi there. According to docs:
    Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the
    max_rows
    recipe parameter (see below) JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance. We are working on using iterator-based JSON parsers to avoid reading in the entire JSON object.
    I have many JSON files which cannot be ingested because of:
    could not infer schema for file s3://path/to/file.json: ' 'Trailing data']
    It looks like DataHub uses ujson for ingestion. How can I work around this problem, and/or when will this be fixed?
  • v

    victorious-pager-14424

    08/08/2022, 3:43 PM
    Hi everyone! We're using the Trino ingestion recipe and we want all data ingested from it to have a different platform name. Is that possible? From what I've read in this article, we are able to create new data platforms, and the recipe also has a
    platform
    string parameter. If I pass the new data platform name or URN to this parameter, will it assign all ingested data to the new platform?
  • b

    bright-receptionist-94235

    08/08/2022, 8:07 PM
    Hi all, any plans to add Vertica ingestion from the UI?
  • c

    cuddly-apple-7818

    08/08/2022, 9:42 PM
    For BigQuery, is there a way to get lineage computed incrementally? Currently, if table1 updates table2 on 01/01, and table3 updates table2 on 01/02, and we trigger two runs with date range 01/01 and 01/02 respectively, the second run will overwrite the table1 to table2 lineage. We’d like to get the full lineage from the very start but would hate to have to parse through all historical logs every single time.
  • l

    lemon-zoo-63387

    08/09/2022, 2:36 AM
    Hey everyone, I don't know if I understand this correctly. The Actions framework subscribes to MetadataChangeLog_v1 and PlatformEvent_v1, but my startup file is docker-compose-without-neo4j.quickstart.yml and there is no Kafka container there. If I install Kafka with Docker, how does DataHub write data to these two topics? Also, how can I create an issue using JIRA? https://datahubproject.io/docs/actions/sources/kafka-event-source https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/cli/docker.py
  • f

    famous-florist-7218

    08/09/2022, 6:42 AM
    Hi guys, DataHub ingestion seems to be missing the environment variables needed by the Kafka Connect and S3 data lake sources. Any recommendations? Thanks in advance!
    Copy code
    '[2022-08-09 06:34:00,334] ERROR    {datahub.ingestion.run.pipeline:126} - No JVM shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable properly.\n'
    
    '[2022-08-09 06:36:45,327] ERROR    {logger:26} - Please set env variable SPARK_VERSION
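    (A hedged sketch of one way to supply those variables when ingestion runs inside the actions/ingestion container; the service name, JVM path and Spark version are assumptions for illustration.)
    # docker-compose override, assumed service name and values
    services:
      datahub-actions:
        environment:
          - JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64   # for the "No JVM shared library file" error
          - SPARK_VERSION=3.2                              # for the "Please set env variable SPARK_VERSION" error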
  • f

    few-grass-66826

    08/09/2022, 11:19 AM
    Hi guys, I am using profiling: enable: True, but DataHub doesn't ingest stats for all tables. Is there something wrong, or does it have limitations?
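    (A hedged sketch of knobs worth checking when only some tables get stats; note the flag is usually spelled "enabled" rather than "enable", and profile_pattern / field limits can quietly exclude tables or columns. Values are placeholders.)
    # additions under source.config
    profile_pattern:
      allow:
        - '.*'                                 # make sure the missing tables are not filtered out here
    profiling:
      enabled: true
      profile_table_level_only: false
      max_number_of_fields_to_profile: 100     # very wide tables may otherwise be truncated, if supported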
  • a

    alert-football-80212

    08/09/2022, 2:52 PM
    Hi all, I have a Kafka ingestion recipe with one topic and its schema. All the recipe parameters look perfectly fine, but after I execute the ingestion command I still have a schema-less topic in the DataHub UI. For the love of DataHub, what's wrong with my Kafka recipe?
    Copy code
    source:
      type: "kafka"
      config:
        # Coordinates
        env: PROD
        connection:
          bootstrap: some_url
          consumer_config:
            security.protocol: "SASL_SSL"
            sasl.mechanism: "PLAIN"
            sasl.username: user_name
            sasl.password: some_password
          schema_registry_url: some_scheme_url
        topic_patterns:
          allow:
            - some_topic_name
        topic_subject_map:
          some_topic_name-value: some_schema_name
    transformers:
      - type: "simple_add_dataset_ownership"
        config:
          owner_urns:
            - some_owner_name
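    (Two hedged things to double-check against the recipe above: topic_subject_map keys take the <topic>-value / <topic>-key form mapped to the exact schema-registry subject name, and the transformer's owner_urns need full URNs, since a bare name fails GMS validation as in the delta-lake error earlier in this channel. Placeholder values below.)
    topic_subject_map:
      some_topic_name-value: "exact-subject-name-value"   # subject exactly as registered in the schema registry
      some_topic_name-key: "exact-subject-name-key"       # optional, for the key schema
    transformers:
      - type: "simple_add_dataset_ownership"
        config:
          owner_urns:
            - "urn:li:corpuser:some_owner_name"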
  • s

    shy-parrot-64120

    08/09/2022, 6:33 PM
    Hi all, has anyone tried to ingest metadata from AWS Athena views? The views are ingested, however no upstream lineage and no SQL definitions are filled in. Filed a bug here: https://github.com/datahub-project/datahub/issues/5599
  • c

    curved-magazine-23582

    08/10/2022, 1:52 AM
    hello, I am looking at PowerBI ingestion, and have some questions. Does it work with admin level user or common user credentials?
  • s

    steep-soccer-91284

    08/10/2022, 6:33 AM
    Can I ingest Airflow lineage from another EKS cluster? I'm wondering whether it would work.
  • k

    kind-whale-32412

    08/10/2022, 7:15 AM
    Can I add tags with
    MetadataChangeProposalWrapper
    if I am building a custom ingestion? I couldn't find a way to do that with the Java library. I also couldn't see any reference (i.e. it exists for the GraphQL API https://datahubproject.io/docs/graphql/mutations/ but I couldn't find anything for MCPW). An example of the GraphQL API query that I'm trying to reproduce with MCPW is this:
    {
      "operationName": "addTags",
      "variables": {
        "input": {
          "tagUrns": ["urn:li:tag:someTag"],
          "resourceUrn": "urn:li:dataset:(urn:li:dataPlatform:plato,something.here,PROD)",
          "subResource": "_file_name",
          "subResourceType": "DATASET_FIELD"
        }
      },
      "query": "mutation addTags($input: AddTagsInput!) {\n  addTags(input: $input)\n}\n"
    }
  • a

    alert-football-80212

    08/10/2022, 9:30 AM
    Hello, I want to create three entities (Model, featureTable, mlFeature) and their connections. I didn't find a way to do it from the UI, and I looked for an API for this but can't find one. Does anyone know what I can do? Thank you!
  • b

    busy-umbrella-4099

    08/10/2022, 9:35 AM
    I have set up a Docker-based instance of DataHub.
    1. Using the recipe.yml: source: type:"postgres" config: username:"postgres" password:"postgres" host_port"postgreshost5432" sink: type:"datahub-rest" config: server: 'localhost:8080' and running: datahub ingest -c recipe.yml, the error I got was: ERROR {datahub.entrypoints:188} - Command failed with mapping values are not allowed here in "<file>", line 3, column 9. Run with --debug to get full trace.
    2. I also tried to add the same data source using the ingestion UI. It took me through the process and showed messages that the ingestion was initiated, but I don't see any data source added. I had scheduled it to run every minute.
    Any guidance on how I can make the ingestion work?
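    (That "mapping values are not allowed here" error is plain YAML parsing: each key needs "key: value" with a space after the colon, and nesting needs consistent indentation. A hedged sketch of the same recipe laid out as valid YAML; the host and port are assumptions.)
    source:
      type: postgres
      config:
        username: postgres
        password: postgres
        host_port: "postgreshost:5432"   # assumed host:port, note the colon between them
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"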
  • l

    limited-forest-73733

    08/10/2022, 10:29 AM
    Hey, I am working on enabling profiling for the tables in schemas in databases, and I want to ask how profiling happens for tables. I enabled profiling and added a database pattern, schema pattern and profiling pattern, but it does not enable profiling for the tables. I just want to confirm: what does it consider as the basis for enabling profiling on a table? This is the recipe we are using to enable profiling for the ESG.T_ESG_MSCI.* tables.
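    (A hedged sketch of how those patterns typically compose for the SQL-based sources: a table must pass the database/schema/table patterns to be ingested at all, and additionally match profile_pattern, which is usually evaluated against the fully qualified database.schema.table name; exact semantics can vary a bit by source. Names follow the ESG.T_ESG_MSCI example above.)
    database_pattern:
      allow:
        - 'ESG'
    schema_pattern:
      allow:
        - 'T_ESG_MSCI'
    profile_pattern:
      allow:
        - 'ESG\.T_ESG_MSCI\..*'    # fully qualified db.schema.table regex
    profiling:
      enabled: true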
  • m

    microscopic-mechanic-13766

    08/10/2022, 11:07 AM
    Good morning everyone, I am trying to ingest metadata from a Kerberized Hive but I am getting this error:
    Copy code
    TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found
    I am currently using datahub-gms version v0.8.42 (the release 4f35a6c where the
    file:///etc/datahub/plugins/auth/resources
    is fixed), 0.8.42 for CLI and
    acryldata/datahub-actions:v0.0.4
    . My recipe is the following:
    Copy code
    source:
        type: hive
        config:
            database: null
            host_port: 'hive-server:10000'
            options:
                connect_args:
                    auth: KERBEROS
                    kerberos_service_name: hive-server
    sink:
        type: datahub-rest
        config:
            server: '<http://datahub-gms:8080>'
    I have seen this same error in messages from almost a year ago, where the problem was that some libraries were missing. Although I think that might have been solved, I added said libraries and I still get the same error, or a very similar one. I have also seen that the problem might be that the authentication protocol is not the same on both sides, but in my case Hive uses Kerberos:
    Copy code
    <property>
        <name>hive.server2.authentication</name>
        <value>kerberos</value>
      </property>
  • e

    elegant-salesmen-99143

    08/10/2022, 1:40 PM
    Hello community. Does anyone know whether there is an integration between DataHub and Fine BI?