# ingestion

    lively-carpenter-82828

    09/01/2023, 11:27 AM
    Hi, during yesterday's DataHub Town Hall - Aug 2023 meeting, I asked a question about lineage for Kafka producers and consumers. Does anyone know how to achieve such lineage? Currently, I have around 40 Kafka Streams applications and approximately 120 KSQL streams. From what I've gathered, it's relatively easy to extract information about the association between consumers and topics. However, there isn't such information available for producers. I found a similar issue here: https://www.datasciencecentral.com/kapxy-a-kafka-utility-for-topic-lineage/. In Kafka Streams applications, it's possible to extract information from the configuration. However, in ksqlDB streams, I can't do that.
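
    A minimal sketch, under the assumption that the producer-to-topic relationship has to be declared by hand (for example from the Streams app's own configuration): emit an upstreamLineage aspect between the consumed and produced topics with the DataHub Python REST emitter. The GMS address and topic names below are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    # Topic the Streams/ksqlDB app consumes from, and topic it produces to (placeholders).
    input_topic = make_dataset_urn(platform="kafka", name="orders-raw", env="PROD")
    output_topic = make_dataset_urn(platform="kafka", name="orders-enriched", env="PROD")

    # Attach an upstreamLineage aspect to the produced (downstream) topic.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=output_topic,
            aspect=UpstreamLineageClass(
                upstreams=[
                    UpstreamClass(dataset=input_topic, type=DatasetLineageTypeClass.TRANSFORMED)
                ]
            ),
        )
    )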

    dry-raincoat-85182

    09/01/2023, 3:06 PM
    Hi team, I am trying to ingest IBM Db2 metadata using the sqlalchemy source type, as described at https://datahubproject.io/docs/generated/ingestion/sources/sqlalchemy/, and have pip-installed the required dialect (ibm_db_sa in this case), but schemaMetadata is not getting ingested. I'm getting the error below in the logs:
    Traceback (most recent call last):
      File "/data/vdc/conda/condapub/svc_am_cicd/envs/dh-actions/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 373, in run
        for record_envelope in self.transform(record_envelopes):
      File "/data/vdc/conda/condapub/svc_am_cicd/envs/dh-actions/lib/python3.10/site-packages/datahub/ingestion/extractor/mce_extractor.py", line 77, in get_records
        raise ValueError(
    ValueError: source produced an invalid metadata work unit: MetadataChangeEventClass(
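
    For reference, a minimal sketch (assuming placeholder host, port, database, and credentials) of pointing the generic sqlalchemy source at Db2 through the ibm_db_sa dialect, run programmatically rather than from a YAML recipe:
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "sqlalchemy",
                "config": {
                    # SQLAlchemy URI for the ibm_db_sa dialect (placeholder values).
                    "connect_uri": "ibm_db_sa://user:password@db2-host:50000/MYDB",
                    # Platform name used in the DataHub URNs.
                    "platform": "db2",
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()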

    square-painter-33350

    09/01/2023, 5:00 PM
    This is a newbie question about DataHub ingestion: I want to ingest data and extract the metadata in DataHub. During this process, I'd like to know whether DataHub can not only extract the metadata but also push the actual data into some persistent datastore, or whether I should load the data into a datastore first and then ingest and extract metadata from that datastore. I'm just trying to understand the basic data flow through DataHub for starters.

    important-autumn-58748

    09/01/2023, 6:23 PM
    How do I add a Dataset to a Container? I'm struggling with the Python Kafka emitter/SDK.
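
    A minimal sketch of one way to do this (the dataset name, container GUID, and server address are placeholders): emit the dataset's container aspect. The Kafka emitter accepts the same MetadataChangeProposalWrapper as the REST emitter used here.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ContainerClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    dataset_urn = make_dataset_urn(platform="kafka", name="my_topic", env="PROD")
    container_urn = "urn:li:container:21d4204e13d5b984c58acad468ecdbdd"  # placeholder GUID

    # Point the dataset's "container" aspect at the existing container.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=dataset_urn,
            aspect=ContainerClass(container=container_urn),
        )
    )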

    best-monitor-90704

    09/02/2023, 1:06 PM
    Hi, we are getting the error below while connecting DataHub to MongoDB: Failed to configure the source (mongodb): Could not reach any servers in [('ip-x-x-x-x', 27017)]. Replica set is configured with internal hostnames or IPs?, Timeout: 30s, Topology Description: <TopologyDescription id:6xxxxxxx91, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('ip-x-x-x-x, 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('ip-x-x-x-x27017 timed out')>]
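
    In case it helps as a starting point, a sketch of a mongodb recipe run programmatically, using a connection URI with the standard MongoDB directConnection option so the driver skips replica-set discovery when only the public address is reachable. The host, credentials, and whether this fits your topology are assumptions.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mongodb",
                "config": {
                    # directConnection=true is a standard MongoDB URI option, not DataHub-specific.
                    "connect_uri": "mongodb://ip-x-x-x-x:27017/?directConnection=true",
                    "username": "datahub",       # placeholder
                    "password": "placeholder",   # placeholder
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()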

    refined-gold-30439

    09/04/2023, 8:09 AM
    Hi team, I'm getting the following error. Why is it occurring?
    [2023-09-04 06:03:02,906] ERROR    {datahub.utilities.sqlalchemy_query_combiner:403} - Failed to execute queue using combiner: (pymysql.err.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'jgtcpmgyvhhzjeoo AS \n(SELECT count(*) AS count_1 \nFROM a.email)\n SELECT ' at line 1")
    And it seems like the tagging syntax isn't working properly... What part is incorrect?
    source:
        type: mysql
        config:
            host_port: 'hostname:3306'
            database: null
            username: user
            include_tables: true
            include_views: true
            profiling:
                enabled: true
                profile_table_level_only: false
                profile_table_row_count_estimate_only: true
                field_sample_values_limit: 0
                include_field_sample_values: false
            stateful_ingestion:
                enabled: true
            password: '${mysql_pw}'
            schema_pattern:
                allow:
                    - public
    transformers:
        -
            type: simple_add_dataset_tags
            config:
                tag_urns:
                    - 'urn:li:tag:us'

    fierce-monkey-46092

    09/04/2023, 8:21 AM
    Hi, folks. I need to know where the logs for glossary terms / glossary term groups are located on the OS. I suppose they're stored in ES indexes. My main goal is to find out who created/added these glossary terms in DataHub, and how many users did so.

    dazzling-stone-78871

    09/04/2023, 12:47 PM
    Hi team 🙂 I first ran ingestion of Redshift and Tableau data sources with default parameters using the DataHub CLI, and everything ran OK. I wanted to change the default env from prod to dev, so I deleted the platforms, since my data won't change. I do not see any error when running datahub --debug ingest -c my_file.yaml (I get "Finished metadata ingestion [...] Pipeline finished successfully"), but my metadata is not updated, and I do not see any new runs in the UI or from datahub ingest list-runs. Has anyone encountered this kind of issue? I have deployed DataHub on AWS with Kubernetes and use AWS MSK (Kafka v3.4.0), AWS OpenSearch (Elasticsearch v7.10) and RDS (PostgreSQL v13.8). I am on DataHub version 0.10.4. Thank you in advance 🙂

    straight-eve-29501

    09/04/2023, 12:53 PM
    Wondering if anyone has thought about logging to the ingestion console/UI from custom ingestion scripts. I find it a good idea to bring together whatever ingestions are happening, regardless of whether they are the standard ones or just some scripts running somewhere. I found DatahubIngestionRunSummaryProvider and was able to initialize it, but even though the events get sent (I also see them in the database), they don't show up in the UI. Any ideas would be appreciated. Pasting some code below in case something obviously wrong is visible at a glance.

    some-alligator-9844

    09/04/2023, 2:18 PM
    Is there a way to provide keytab details in the recipe instead of picking them up from the default ticket cache? I am trying to ingest data from multiple Hive sources, but each source requires a separate keytab for authentication, and username/password authentication is not available. I have 2 recipes that I want to run simultaneously using different keytabs for different Hive sources.

    alert-analyst-73197

    09/04/2023, 4:53 PM
    Hi, I have a big amount of datasets that I uploaded through the Python emitter. At the moment each one has only custom properties and nothing else. However, some of them share a dependency relation with one another. It is not proper lineage, because it has nothing to do with content transformation; it is only a "dataset is a dependency of dataset" relation. I searched through all the possible relationship types and found nothing suitable. I could use lineage for that, but I think lineage is not the proper way to describe this relation. Do you have any suggestions?

    acceptable-stone-72571

    09/05/2023, 8:06 AM
    Hi, I have created lineage between datasets and between columns. I wanted to delete both the dataset and the column lineage, so I used the GraphQL query provided at https://datahubproject.io/docs/api/tutorials/lineage#add-lineage, but I can see it has deleted only the dataset-level lineage, not the column-level lineage. Does the GraphQL query work the same way for both? Or is there any other way to delete the column-level lineage? Please help me out.
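
    A sketch of a possible workaround (an assumption based on how aspect upserts behave, not a confirmed answer): since an upsert replaces the whole upstreamLineage aspect, re-emitting it with only the table-level upstreams and an empty fineGrainedLineages list should drop the column-level lineage. The URNs below are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    downstream = make_dataset_urn(platform="hive", name="db.target_table", env="PROD")
    upstream = make_dataset_urn(platform="hive", name="db.source_table", env="PROD")

    # Keep (or rebuild) the table-level edge, and omit the column-level entries.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=downstream,
            aspect=UpstreamLineageClass(
                upstreams=[
                    UpstreamClass(dataset=upstream, type=DatasetLineageTypeClass.TRANSFORMED)
                ],
                fineGrainedLineages=[],
            ),
        )
    )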

    nutritious-lighter-88459

    09/05/2023, 9:16 AM
    Hi, I have integrated DataHub with Great Expectations for running validations/assertions. The validation result is exported to DataHub via the datahub_action and is displayed in DataHub under the Validation tab as expected. However, I was wondering whether there is a way to update the description of the assertion, which gets auto-generated (PFA)? We would like to provide our own custom description, maybe in the form of some attribute in the meta tag of the expectation suite. TIA
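
    For illustration, this is what attaching a custom attribute through an expectation's meta field looks like in Great Expectations; whether the DataHub action picks it up as the assertion description is exactly the open question here. The suite name, column, and description are placeholders.
    from great_expectations.core import ExpectationSuite
    from great_expectations.core.expectation_configuration import ExpectationConfiguration

    suite = ExpectationSuite(expectation_suite_name="orders_suite")
    suite.add_expectation(
        ExpectationConfiguration(
            expectation_type="expect_column_values_to_not_be_null",
            kwargs={"column": "order_id"},
            # Custom attribute we would like DataHub to surface as the description.
            meta={"description": "order_id must always be populated"},
        )
    )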

    best-kite-4934

    09/05/2023, 11:40 AM
    Has anyone been able to integrate Great Expectations validation results with DataHub? I followed https://datahubproject.io/docs/metadata-ingestion/integration_docs/great-expectations/ but it does not work when using a token.
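
    For comparison, a sketch of the checkpoint action_list entry for the DataHub validation action with a token, written as the Python dict that Great Expectations checkpoints accept; the server URL and the token environment variable are placeholders.
    import os

    datahub_action = {
        "name": "datahub_action",
        "action": {
            "module_name": "datahub.integrations.great_expectations.action",
            "class_name": "DataHubValidationAction",
            "server_url": "http://datahub-gms:8080",
            # Personal access token of a user/service account allowed to push assertions.
            "token": os.environ.get("DATAHUB_GMS_TOKEN", ""),
        },
    }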

    bitter-florist-92385

    09/05/2023, 12:50 PM
    Hey, I'm trying to load some YAML files into my DataHub instance. They are on DataHub's host server, but not in the venv. Still, when I run the CLI it won't find them. Any idea why that might be? (raise NewConnectionError( urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f22895fc5d0>: Failed to establish a new connection: [Errno 111] Connection refused)

    shy-diamond-99510

    09/05/2023, 12:58 PM
    Hey guys, I'm new to this Slack and to DataHub. I have to install a custom ingestion source in a Docker setup. I have followed all the tutorials related to what I am trying to achieve, but it still doesn't work. Does anybody have experience with installing a custom ingestion source? I really need help.

    great-florist-68068

    09/05/2023, 10:44 PM
    Hello! I'm trying to run an example ingestion for Hive: https://datahubproject.io/docs/generated/ingestion/sources/hive. My Hive instance uses Kerberos for auth. When running datahub ingest -c ./hive-datahub.yml, I'm getting "Server not found in Kerberos database". I tried running kinit beforehand but still get the same error. It is not clear to me how to provide krb5.conf in hive-datahub.yml.
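
    A sketch under assumptions (placeholder host and service name, and that Kerberos options are passed through to PyHive via options.connect_args as shown in the Hive source docs): krb5.conf itself is not referenced from the recipe; it is taken from the environment of the process running the ingestion, for example via the standard KRB5_CONFIG variable.
    import os

    from datahub.ingestion.run.pipeline import Pipeline

    # Standard MIT Kerberos variable; only needed if krb5.conf is not in the default location.
    os.environ["KRB5_CONFIG"] = "/etc/krb5.conf"

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "hive-server:10000",  # placeholder
                    "options": {
                        "connect_args": {
                            "auth": "KERBEROS",
                            "kerberos_service_name": "hive",  # assumption: default service name
                        }
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()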

    enough-pizza-64105

    09/06/2023, 3:08 AM
    Hi guys, I'm quite new to DataHub. I would like to ask how you would configure the ingestion source for a Delta table that is housed within HDFS (in other tools the location would be something like hdfs://localhost:9000/my_env/my_subfolder_location).

    best-umbrella-88325

    09/06/2023, 2:08 PM
    Hello community!! We're trying to ingest metadata using Airflow DAGs and the Python SDKs. Although the pipeline completes successfully, we are not able to see it as an 'ingestion' in the Ingestion tab. Ideally, it should show the source as 'CLI' ingestion under that tab. We use the PythonVirtualEnvOperator to facilitate the ingestion. Does anyone have any idea how we can achieve this? I tried 0.10.3 and 0.10.1 as the acryl-datahub package version. We use MWAA v2.5.1. Thanks in advance!!

    great-florist-68068

    09/06/2023, 9:15 PM
    Hello! I'm ingesting data from the Hive metastore and the table has a partition key:
    create table hive_example(a string, b int) partitioned by(c int);
    but the DataHub ingestion doesn't indicate isPartitioningKey in the payload:
    {
      "fieldPath": "c",
      "nullable": true,
      "type": {
        "type": {
          "com.linkedin.schema.NumberType": {}
        }
      },
      "nativeDataType": "int",
      "recursive": false,
      "isPartOfKey": false
    }

    lively-energy-75016

    09/07/2023, 3:47 AM
    Hi all, I have a question for you: when DataHub syncs the Hive metadata, only the database is synced; the tables are not synced over. The whole sync is shown as successful, yet there is an error reported as:

    better-orange-49102

    09/07/2023, 8:36 AM
    I noticed that mongodb doesn't have the ability to specify a platform instance. Just wondering, for teams that have already ingested metadata from mongodb: if the ingestion adds the ability to specify instances, how would you migrate your documentation from the existing entity to the new entity? (Because once a platform instance is added, the URN of the entity changes.)

    mammoth-musician-30735

    09/07/2023, 2:37 PM
    Hello, I want to delete ingested metadata from DataHub (Vertica platform metadata only). I see this is possible through the command line. Is there any option to delete it through the DataHub UI?

    able-library-93578

    09/07/2023, 9:12 PM
    Hi all, I am seeing weird ingestion behavior when using the UI (to be honest, I have not tested whether the CLI has the same issue). I am ingesting PowerBI metadata (UI YAML below); we pull in endorsements as tags. It works great with no issues on the initial ingestion: I see the tags, for example Certified. I then create another tag programmatically (Python, DataHubGraph), SourcesSDP, so now there are 2 tags. Great. When I run the UI ingestion again, it deletes all tags and re-writes the Certified tag. I verified that the GMS log shows an UPSERT for that entity and tag. What is the behavior supposed to be for the UPSERT? GMS log:
    2023-09-07 18:08:32,038 [qtp1577592551-60548] INFO c.l.m.r.entity.AspectResource:180 - INGEST PROPOSAL proposal: {aspectName=globalTags, systemMetadata={lastObserved=1694110111876, runId=2359db8a-8c78-471f-b38a-0e8167a4f431}, entityUrn=urn:li:dataset:(urn:li:dataPlatform:powerbi,SMG_Channel_Reporting_Dataset.DATE_DIM,PROD), entityType=dataset, aspect={contentType=application/json, value=ByteString(length=43,bytes=7b227461...227d5d7d)}, changeType=UPSERT}
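
    What seems to be happening (a hedged reading of the log, not a confirmed answer): the connector emits the full globalTags aspect with changeType UPSERT, and an upsert replaces the aspect wholesale, so tags added out-of-band are dropped. A sketch of a read-merge-write workaround with DataHubGraph (the server address is a placeholder; the URN and tag come from the log above):
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    dataset_urn = (
        "urn:li:dataset:(urn:li:dataPlatform:powerbi,"
        "SMG_Channel_Reporting_Dataset.DATE_DIM,PROD)"
    )
    extra_tag = "urn:li:tag:SourcesSDP"

    # Read the tags currently on the entity and merge our extra tag into them.
    current = graph.get_aspect(entity_urn=dataset_urn, aspect_type=GlobalTagsClass)
    tags = current.tags if current else []
    if not any(t.tag == extra_tag for t in tags):
        tags.append(TagAssociationClass(tag=extra_tag))

    graph.emit_mcp(
        MetadataChangeProposalWrapper(
            entityUrn=dataset_urn, aspect=GlobalTagsClass(tags=tags)
        )
    )
    A more durable option may be adding the extra tag via the simple_add_dataset_tags transformer in the ingestion recipe, so every scheduled run re-applies it instead of overwriting it.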

    eager-monitor-4683

    09/08/2023, 1:52 AM
    Hey team, just want to check whether event-based ingestion is supported in DataHub? Thanks

    mammoth-musician-30735

    09/08/2023, 4:51 AM
    Hello, I am doing S3 ingestion through the DataHub UI. I want to exclude files on S3; can we have only folders or not? If yes, could you please provide an example config to have only folders.
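
    A sketch under the assumption that "only folders" means each folder should appear as one dataset instead of every file: the {table} marker in path_specs.include groups a folder's files into a single dataset, and exclude drops unwanted paths. The bucket, prefix, patterns, and region are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "s3",
                "config": {
                    "path_specs": [
                        {
                            # Every folder under raw/ becomes one dataset ({table});
                            # the files inside it are only used to infer the schema.
                            "include": "s3://my-bucket/raw/{table}/*.parquet",
                            "exclude": ["**/_tmp/**", "**/*.json"],
                        }
                    ],
                    "aws_config": {"aws_region": "us-east-1"},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()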

    mammoth-musician-30735

    09/08/2023, 7:35 AM
    Hello team, I am doing Vertica ingestion through the DataHub UI. I have a Vertica database that has 552 tables and 541 views. I started the ingestion yesterday afternoon (IST) and it is still in progress; I don't know what's wrong with it. I am adding the config here, could you please help us?
    source:
        type: vertica
        config:
            host_port: 'host:5433'
            database: databse
            schema_pattern:
                allow:
                    - '^specific_schema_name*'
            username: '${VERTICA_USERNAME}'
            password: '${VERTICA_PASSWORD}'
            include_tables: true
            include_views: true
            include_projections: false
            include_models: false
            include_view_lineage: false
            include_projection_lineage: false
            profiling:
                enabled: false
                field_sample_values_limit: 10
                max_workers: 1
    As per the document https://datahubproject.io/docs/generated/ingestion/sources/vertica/, profiling is disabled. I also tried without the profiling section in the config and the result is the same (taking more than a day for a small ingestion).

    bland-barista-59197

    09/08/2023, 4:47 PM
    Hi team, any help is appreciated: is it possible to run profiling as a standalone job? In my scenario, profiling takes a long time (>18 hours) and I sometimes get a GMS error.

    melodic-dusk-2080

    09/10/2023, 4:32 PM
    Hi, is it possible to ingest REST APIs, and if so, how?
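
    One built-in option worth checking is the openapi source, which reads an OpenAPI/Swagger definition and registers the endpoints it describes as datasets. A sketch with placeholder name, URL, and spec path (treat the exact option names as assumptions to verify against the openapi source docs):
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "openapi",
                "config": {
                    "name": "my_api",                       # placeholder label for this API
                    "url": "https://api.example.com/",      # base URL of the service (placeholder)
                    "swagger_file": "openapi/v3/api-docs",  # path to the spec, relative to url (assumption)
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()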

    refined-gold-30439

    09/11/2023, 1:48 AM
    Hi team! I have two MySQL databases with the same schema but different descriptions (in different languages). When I run a script for ingestion from the command line, only one schema is created instead of two (the database information from the last executed script overwrites the previous one). How can I make them distinguishable? 🫠
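
    A sketch of one way to keep them apart (hostnames, credentials, and instance names are placeholders): give each database its own platform_instance so the generated URNs differ and the second ingestion no longer overwrites the first.
    from datahub.ingestion.run.pipeline import Pipeline

    def mysql_recipe(host_port: str, instance: str) -> dict:
        return {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": host_port,
                    "username": "datahub",      # placeholder
                    "password": "placeholder",  # placeholder
                    # Distinguishes otherwise-identical schemas in the URN.
                    "platform_instance": instance,
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }

    for host, instance in [("mysql-en:3306", "english"), ("mysql-fr:3306", "french")]:
        pipeline = Pipeline.create(mysql_recipe(host, instance))
        pipeline.run()
        pipeline.raise_from_status()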