# ingestion
  • lively-jackal-83760

    03/02/2023, 8:44 AM
    Hi guys. Is it possible to sync between runs by run_id? For instance, I run a Hive ingestion, get 10 tables, and set run_id=1. A week later I run the same recipe with the same run_id=1 but get 9 tables (1 was removed). It would be great to remove that table from DataHub or set it to deprecated status. Otherwise it's hard to manage all deletions and renamings across several instances.
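    A minimal sketch of how stateful ingestion can cover this case, assuming a Hive source run through a recipe with a stable pipeline_name (stale-entity removal is keyed on the pipeline name rather than a reused run_id); the pipeline name below is a placeholder:
    # Sketch: on each run, tables missing since the previous run are soft-deleted.
    pipeline_name: hive_prod_pipeline   # placeholder; must stay identical across runs
    source:
        type: hive
        config:
            # ...connection settings...
            stateful_ingestion:
                enabled: true
                remove_stale_metadata: true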
  • alert-football-80212

    03/02/2023, 9:45 AM
    Hi all, I am ingesting dbt into DataHub. Does anyone know if it's possible to ingest lineage between dbt and its tables in S3?
  • best-notebook-58252

    03/02/2023, 1:34 PM
    Any thoughts on https://datahubspace.slack.com/archives/CUMUWQU66/p1677433300922219?
  • aloof-holiday-45827

    03/02/2023, 1:51 PM
    Hi, I have a question on lineage. For custom lineage, does DataHub support the OpenLineage format? Can lineage information in OpenLineage format be pushed into DataHub?
  • chilly-potato-57465

    03/02/2023, 2:50 PM
    Hello everyone! I am wondering about the following: has anyone worked with or ingested the HDF5 file format? It is a scientific data format, and I am curious whether anyone has come across it and ingested data from such files. Thank you!
  • blue-engineer-74605

    03/02/2023, 5:20 PM
    Hey folks! I'm working on Superset metadata extraction, updating it to Superset 2.0, but I'm stuck and have some questions. Is anyone available for a chat? I'm willing to open a PR too.
  • brave-animal-98220

    03/02/2023, 5:34 PM
    Hey guys, I'm ingesting from Superset (DataHub v0.10.0) but I get the error below; can anyone help?
  • stocky-portugal-689

    03/02/2023, 5:38 PM
    Hi all, I have a scenario that requires associating several schemas with the same topic, since we use distinct schemas within one topic. How can this be achieved in DataHub? Alternatively, is there any way to ingest metadata directly (and only) from the Schema Registry?
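    A hedged sketch of one relevant knob, assuming the Kafka source's topic_subject_map option (which maps a topic's key and value to explicit Schema Registry subjects) fits this setup; note it binds one subject per key/value, so truly multiple schemas per topic may still need a different approach. Topic, broker, and subject names are placeholders:
    source:
        type: kafka
        config:
            connection:
                bootstrap: "broker:9092"                            # placeholder
                schema_registry_url: "http://schema-registry:8081"  # placeholder
            topic_subject_map:
                # "<topic>-key" / "<topic>-value" -> registry subject
                "my-topic-key": "my_key_subject"
                "my-topic-value": "my_value_subject"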
  • late-dawn-4912

    03/02/2023, 5:47 PM
    Hi all! We are trying to get the LookML recipe working with a transformer using UI ingestion. This is the recipe we are using, but it keeps failing with "extra fields not permitted" (without the transformer the recipe works). Any pointers would be greatly appreciated! 😄 This is the recipe (deploy_key, repo, and base_url values removed just to keep it simple):
    source:
        type: lookml
        config:
            stateful_ingestion:
                enabled: false
            github_info:
                repo: repo_url
                deploy_key: "-----BEGIN OPENSSH PRIVATE KEY----------END OPENSSH PRIVATE KEY-----\n"
                branch: develop
            api:
                base_url: 'URL'
                client_id: '${looker_client_id}'
                client_secret: '${looker_client_secret}'
            parse_table_names_from_sql: true
    transformers:
        - type: "simple_add_dataset_domain"
          config:
              replace_existing: true
              domains:
                  - "urn:li:domain:OSS"
  • prehistoric-furniture-42991

    03/02/2023, 6:15 PM
    Hi all, I'm new to DataHub and trying the
    put
    command in the CLI. I'm able to update ownership with the example JSON from the documentation:
    datahub put --urn "urn:li:dataset:(urn:li:dataPlatform:s3,test.parquet,PROD)" --aspect ownership -d editable_schema.json
    I'm also trying to update the description using --aspect editableSchemaMetadata and need an example JSON file for that aspect; the different JSON files I tried were not accepted.
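    A hedged sketch of what such a JSON file might look like, based on the editableSchemaMetadata aspect's editableSchemaFieldInfo structure; the field path and description below are invented placeholders:
    {
        "editableSchemaFieldInfo": [
            {
                "fieldPath": "my_column",
                "description": "Example description for my_column"
            }
        ]
    }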
  • salmon-vr-6357

    03/02/2023, 6:47 PM
    Hi all, has anyone had experience ingesting metadata from Pub/Sub or Bigtable? Would love to hear how it went, thank you.
  • cuddly-dinner-641

    03/02/2023, 7:31 PM
    Is it on the roadmap to enhance the Unity Catalog emitter to ingest from multiple/all workspaces? It seems it can only connect to a single workspace at a time.
  • gifted-bear-4760

    03/03/2023, 11:04 AM
    Hi everyone! I have a BW/4HANA source. I just want to ingest the schema of the data table, nothing else. Can this be done with any of the currently available source connectors (e.g. MySQL)?
  • rich-policeman-92383

    03/03/2023, 11:07 AM
    Is there a way to specify platform_instance in https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py#L38? If not, please share a workaround. DataHub version: v0.9.5
  • elegant-salesmen-99143

    03/03/2023, 12:38 PM
    Hi, can somebody please show me an example/template of how to add Schema Registry configs (for Protobuf schemas) to a Kafka ingestion recipe? It is not clear to me from the Kafka ingestion documentation :(
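    A hedged sketch of where the Schema Registry settings usually sit in a Kafka recipe, assuming a Confluent-compatible registry (Protobuf subjects are resolved through the same connection block); the URL and credentials are placeholders:
    source:
        type: kafka
        config:
            connection:
                bootstrap: "broker:9092"                            # placeholder
                schema_registry_url: "http://schema-registry:8081"  # placeholder
                schema_registry_config:
                    # passed through to the registry client; only needed for authenticated registries
                    basic.auth.user.info: "registry_user:registry_password"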
  • rich-policeman-92383

    03/03/2023, 1:43 PM
    Hello, is there a way to populate the Stats tab from the INFORMATION_SCHEMA of a SQL source? This would be much quicker and less resource-intensive than profiling individual tables. @rough-journalist-49506 @magnificent-notebook-88304
  • most-animal-32096

    03/03/2023, 2:59 PM
    Hello, does anyone have an idea why, once
    datahub:metadata-integration:java:datahub-client
    is imported, the very basic
    RestEmitter
    example described in the documentation fails with a
    ClassNotFoundException: org.apache.http.ssl.TrustStrategy
    error?
    RestEmitterSample.java
  • lively-jackal-83760

    03/03/2023, 3:26 PM
    Hi guys, a question: is it possible to do stateful ingestion via the Java emitter? For instance, I build entities in Java from some external configs, push them to the GMS server, and want to remove the old ones on the next run. I couldn't find any related classes in your Java lib, i.e. nothing similar to Python's stateful ingestion.
  • witty-monitor-636

    03/03/2023, 6:02 PM
    Hello, can someone point me in the right direction for ingesting a dbt catalog.json from a private GitHub repo? Is that possible?
  • bitter-midnight-96257

    03/03/2023, 8:12 PM
    Hi everyone, I'm currently trying to ingest metadata from DocumentDB (MongoDB on AWS). AWS recommends using a certificate file. How can I pass this file to my pipeline? I tried putting it in the YAML:
    options:
        tls: true
        tlsCRLFile: /home/ec2-user/crl.pem
    But it didn't recognize the path. Thank you in advance and a great weekend to all 🙂
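    A hedged guess at the cause plus a sketch, assuming the recipe runs inside the ingestion executor container (where host paths such as /home/ec2-user are not mounted) and that the options block is passed through to the MongoDB client. For the AWS CA bundle the usual key is tlsCAFile, and the path below assumes the certificate was mounted into, or baked into, the container:
    source:
        type: mongodb
        config:
            # ...connection settings...
            options:
                tls: true
                tlsCAFile: /etc/certs/rds-combined-ca-bundle.pem   # path inside the container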
  • able-evening-90828

    03/03/2023, 8:50 PM
    I have a question about
    bigquery
    ingestion. I want to ingest data from Google's public
    bigquery-public-data
    project. I was only able to get it working using the
    project_id: bigquery-public-data
    setting in the recipe, although the docs say it is deprecated. I tried to use
    project_id_pattern
    as follows, but it wasn't able to pick up any datasets:
    project_id_pattern:
        allow:
            - '.*bigquery-public-data.*'
    How do you read data from a project that is different from the service account's project?
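    A hedged sketch of one likely explanation and workaround: project_id_pattern filters the projects the service account can list, and public projects such as bigquery-public-data do not appear in that listing, so the pattern never sees them. Assuming the BigQuery source in your release supports an explicit project_ids list (worth verifying in the docs for your version), it names the project directly:
    source:
        type: bigquery
        config:
            project_ids:
                - bigquery-public-data   # named explicitly, so no project listing is required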
  • few-branch-52297

    03/05/2023, 4:19 AM
    I'm using AWS Glue Schema Registry and kafka-setup is failing due to the last line in kafka-setup.sh: https://github.com/datahub-project/datahub/blob/01ee351c4c925a2c784b72a3065b506b72e2b7ba/docker/kafka-setup/kafka-setup.sh#L144 With AWS GSR, the _schemas topic is not required and is not created.
  • few-branch-52297

    03/05/2023, 4:40 AM
    I've deployed DataHub v0.9.6.1 and I'm getting the following error and warnings:
    datahub-datahub-system-update-job-nqrrl   0/1     CreateContainerConfigError
    W0305 04:38:31.609377   15424 warnings.go:70] spec.template.spec.containers[0].env[31].name: duplicate name "DATAHUB_UPGRADE_HISTORY_TOPIC_NAME"
    W0305 04:38:31.609414   15424 warnings.go:70] spec.template.spec.containers[0].env[33].name: duplicate name "ENTITY_REGISTRY_CONFIG_PATH"
    W0305 04:38:31.609423   15424 warnings.go:70] spec.template.spec.containers[0].env[34].name: duplicate name "KAFKA_BOOTSTRAP_SERVER"
    W0305 04:38:31.609430   15424 warnings.go:70] spec.template.spec.containers[0].env[35].name: duplicate name "KAFKA_SCHEMAREGISTRY_URL"
    W0305 04:38:31.609438   15424 warnings.go:70] spec.template.spec.containers[0].env[39].name: duplicate name "ELASTICSEARCH_HOST"
    W0305 04:38:31.609445   15424 warnings.go:70] spec.template.spec.containers[0].env[40].name: duplicate name "ELASTICSEARCH_PORT"
    W0305 04:38:31.609453   15424 warnings.go:70] spec.template.spec.containers[0].env[41].name: duplicate name "SKIP_ELASTICSEARCH_CHECK"
    W0305 04:38:31.609460   15424 warnings.go:70] spec.template.spec.containers[0].env[42].name: duplicate name "ELASTICSEARCH_USE_SSL"
    W0305 04:38:31.609467   15424 warnings.go:70] spec.template.spec.containers[0].env[43].name: duplicate name "ELASTICSEARCH_USERNAME"
    W0305 04:38:31.609474   15424 warnings.go:70] spec.template.spec.containers[0].env[44].name: duplicate name "ELASTICSEARCH_PASSWORD"
    W0305 04:38:31.609496   15424 warnings.go:70] spec.template.spec.containers[0].env[48].name: duplicate name "GRAPH_SERVICE_IMPL"
  • alert-football-80212

    03/05/2023, 9:59 AM
    Hi all, has anyone managed to get lineage between dbt and S3?
  • bitter-evening-61050

    03/06/2023, 7:51 AM
    Hi, I am trying to integrate the Teams action with Kafka as the source, but I am getting the error below: Error: |rdkafka#consumer-1| [thrd:xx.xx.xx.xx:9092/bootstrap]: xx.xx.xx.xx:9092/bootstrap: Connect to ipv4#xx.xx.xx.xx:9092 failed: Unknown error (after 21031ms in state CONNECT). Kafka is running in Kubernetes. We have created an external API to Kafka, but that too is not working. Can anyone help me resolve this issue?
  • best-wire-59738

    03/06/2023, 8:21 AM
    Hi team, we are facing a Kafka client re-balancing issue whenever we use the Kafka sink for our ingestions. We have increased
    KAFKA_LISTENER_CONCURRENCY
    to 10 to support parallel processing of the Kafka offsets. At this point our UI is also frozen, because the consumer is stuck in a re-balancing loop: it is not consuming offsets, and the offset lag keeps increasing as ingestion pulls in more data. While debugging we found that DataHub uses the MetadataChangeLog_Versioned_v1 topic both for changes made to the Metadata Graph through the UI and for ingestion via the Kafka sink. For this reason our UI stays frozen until the consumer (
    generic-mae-consumer-job-client
    ) reads through all the partitions of the topic, since the change made in the UI is queued somewhere behind the ingestion messages in the Kafka topic. 1. Can we use a separate topic for changes made through the UI, so that the UI is free of the freezing issue? 2. How can we get out of the group re-balancing issue and speed up our ingestion, given that Kafka is asynchronous and the MCE consumers are slow to read the offsets? We have yet to deploy standalone MCE and MAE consumers; we hope that will increase ingestion speed, but we still need a solution for the re-balancing issue. Pulling up this thread, which has the logs for the Kafka client re-balancing issue: https://datahubspace.slack.com/archives/CV2UVAPPG/p1677676301370689
  • calm-dinner-63735

    03/06/2023, 10:25 AM
    I have a topic in MSK and a schema in the Glue Schema Registry. Can someone share a recipe for integrating that with DataHub?
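    A hedged sketch of the shape such a recipe could take. The Kafka source exposes a schema_registry_class hook for plugging in a non-Confluent registry, but DataHub does not ship a Glue implementation here, so the class path below is hypothetical and would have to come from a custom or community implementation; the broker address is a placeholder:
    source:
        type: kafka
        config:
            connection:
                bootstrap: "my-msk-broker:9092"   # placeholder
            # Hypothetical class implementing DataHub's schema-registry interface for AWS Glue
            schema_registry_class: "my_pkg.glue_registry.GlueSchemaRegistry"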
  • microscopic-machine-90437

    03/06/2023, 10:52 AM
    Hi team, my ingestion sources are disappearing from the UI. I tried creating a Snowflake ingestion source; it gets saved, but when I refresh the page it is missing. Please let me know if anyone else is facing the same issue.
  • best-notebook-58252

    03/06/2023, 11:27 AM
    Hi all, I think I found a bug (or something that is still not supported) in LookML ingestion. I have an explore file that uses a view declared inline:
    …
    view_name: payment_events {
        fields: [
          payment_events.payment_id,
          …
        ]
      }
    …
    this seems to be unsupported, because it causes an error and the explore is skipped:
    Traceback (most recent call last):
      File "…/datahub/ingestion/source/looker/lookml_source.py", line 1634, in get_internal_workunits
        explore: LookerExplore = LookerExplore.from_dict(
      File "…/datahub/ingestion/source/looker/looker_common.py", line 550, in from_dict
        view_names.add(dict.get("view_name") or dict.get("from") or dict["name"])
    TypeError: unhashable type: 'dict'
    Should I open a bug/feature request?
  • elegant-salesmen-99143

    03/06/2023, 11:45 AM
    Hi. Is there a way to enable something like stateful ingestion for the Airflow integration, to clean up stale metadata left by DAGs that are no longer up to date?