# ingestion
  • r

    rhythmic-flag-69887

    02/19/2022, 12:29 PM
    Hello, I'm following the quickstart guide and trying to connect to my Postgres server. I know the Postgres connection works, since these are the same credentials used for dbt. I entered it like so:
    Copy code
    source:
        type: postgres
        config:
            host_port: '***.com:5432'
            database: **
            username: **
            password: **
    However, I'm getting an error. Is this because I wrote something wrong in the recipe?
    Copy code
    ConnectionRefusedError: [Errno 111] Connection refused
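    Errno 111 is raised at the TCP layer, before any credentials are checked, so the recipe syntax is usually not the cause; more often the host/port is simply not reachable from the environment where the recipe runs (for the quickstart, that is the machine or container running the ingestion, not the dbt host). A minimal reachability check, reusing the masked host_port from the recipe as a placeholder:
    Copy code
    # Quick TCP check of the Postgres host/port from the environment that runs the recipe.
    # Replace host/port with the real values from host_port (masked as ***.com here).
    import socket
    
    host, port = "***.com", 5432
    try:
        with socket.create_connection((host, port), timeout=5):
            print("TCP connection succeeded; the problem is likely elsewhere")
    except OSError as exc:
        print(f"Cannot reach {host}:{port} from here: {exc}")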
  • m

    mysterious-nail-70388

    02/21/2022, 3:30 AM
    Hi Team, when will Clickhouse data sources be supported?
  • m

    mysterious-nail-70388

    02/21/2022, 5:59 AM
    I built DataHub 0.8.26 locally, and it reported an error when obtaining metadata from the ES data source. I reinstalled the ES plug-in, but there is still a problem. I don't know why.
  • b

    breezy-noon-83306

    02/21/2022, 8:29 AM
    Good morning DataHub Community, I'm just starting with DataHub and I have some questions about ingestion: 1- Once the data is ingested, if there are data/metadata changes at the source, does it update automatically or do you have to ingest it again?
  • b

    breezy-noon-83306

    02/21/2022, 8:30 AM
    2- How do you update to a new release if you have installed DataHub through Kubernetes rather than Docker? Thank you very much, community!
  • f

    fierce-waiter-13795

    02/21/2022, 9:22 AM
    Hi Team, I'm having some issues ingesting redshift-lineage into datahub. I've followed the documentation here, yet I'm unable to see any lineage on datahub. Posting the recipe file and other details in the thread.
  • g

    gifted-piano-21322

    02/21/2022, 10:04 AM
    I know that DynamoDB is a document database, but is there a way to store its 'schema' in DataHub? Or maybe not the schema, but at least a datasource description, sample field values, etc.?
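    There was no DynamoDB connector at this point, but a 'schema' and a description can be pushed manually with the Python emitter by writing the schemaMetadata and datasetProperties aspects yourself. A minimal sketch; the platform name "dynamodb", the table/field names, and the GMS address are all assumptions, not an official example:
    Copy code
    # Sketch: manually emit a schema and description for a DynamoDB table.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetPropertiesClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )
    
    dataset_urn = builder.make_dataset_urn("dynamodb", "my_table", "PROD")  # placeholder names
    
    schema = SchemaMetadataClass(
        schemaName="my_table",
        platform=builder.make_data_platform_urn("dynamodb"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=[
            SchemaFieldClass(
                fieldPath="user_id",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="S",
                description="Partition key; sample value: 'u-123'",
            ),
        ],
    )
    props = DatasetPropertiesClass(description="DynamoDB table holding user events")
    
    emitter = DatahubRestEmitter("http://localhost:8080")  # GMS URL is a placeholder
    for aspect_name, aspect in [("schemaMetadata", schema), ("datasetProperties", props)]:
        emitter.emit_mcp(
            MetadataChangeProposalWrapper(
                entityType="dataset",
                changeType=ChangeTypeClass.UPSERT,
                entityUrn=dataset_urn,
                aspectName=aspect_name,
                aspect=aspect,
            )
        )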
  • s

    silly-beach-19296

    02/21/2022, 7:25 PM
    Hello, how do I add my glossary of terms to DataHub running on EKS? Should I connect to the node or directly to a pod?
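    As a general note (a sketch, not EKS-specific advice): the business glossary is ingested like any other source, from wherever the DataHub CLI or a Python pipeline can reach the GMS endpoint, so there is normally no need to exec into a node or pod. The glossary file path and the GMS service URL below are assumptions:
    Copy code
    # Sketch: ingest a business glossary file against a reachable GMS endpoint.
    from datahub.ingestion.run.pipeline import Pipeline
    
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "datahub-business-glossary",
                "config": {"file": "./business_glossary.yml"},  # path is a placeholder
            },
            "sink": {
                "type": "datahub-rest",
                # e.g. the GMS service exposed by the helm chart, or a port-forward; placeholder URL
                "config": {"server": "http://datahub-datahub-gms:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()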
  • m

    mysterious-nail-70388

    02/22/2022, 8:12 AM
    Hello, in metadata ingestion we specify in the YAML file that metadata should be ingested into the REST service. When we delete DataHub data, how do we specify the address of the GMS service?
  • w

    witty-painting-90923

    02/22/2022, 10:28 AM
    Hello! I am trying to ingest Elasticsearch metadata. I am also using some transformers to put it in the right browse path and add a tag. This worked really well with MongoDB and Postgres, but it doesn't work at all for Elasticsearch, as if the transformers are being ignored entirely. This is driving me crazy. I tried both a programmatic pipeline and a YAML recipe… The code is literally copy-pasted from Mongo and Postgres. Is there a chance that for ES the transformers need to be written differently, or am I missing something? Thank you!
    Copy code
    pipeline = Pipeline.create(
            # This configuration is analogous to a recipe configuration.
            {
                "source": {
                    "type": "elasticsearch",
                    "config": {
                        "env": ENV,
                        "host": es_connection_host_port,
                        "username": es_connection_login,
                        "password": es_connection_password,
                        "index_pattern": {
                            "deny": [es_deny_index_pattern]
                        }
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": datahub_server},
                },
                "transformers": [
                    {
                        "type": "set_dataset_browse_path",
                        "config": {
                            "path_templates": [f"/ENV/PLATFORM/EsComments/DATASET_PARTS"]
                        }
                    },
                    {
                        "type": "simple_add_dataset_tags",
                        "config": {
                            "tag_urns": [f"urn:li:tag:EsComments"]
                        }
                    }
                ]
    
            })
    
    pipeline.run()
  • b

    breezy-guitar-97226

    02/22/2022, 11:44 AM
    Hi here, I have a question about Platform Instances 🙂 A bit of context: in my company we run multiple Kafka clusters, and in our plans each cluster would be modelled as a Platform Instance in DataHub, to replace the custom Catalogue API we currently offer to our internal users for listing them. However, unlike Platform Instances, our Cluster entities can carry custom properties (representing cluster configuration) and ownership, a little bit like Containers in DataHub. The issue is that we fall into a middle ground: on one side, Clusters are our own internal abstraction and would be hard to generalise into the Kafka connector; on the other side, the available Platform Instances do not offer enough customisation to fulfil all our needs. In an ideal world (for us), Platform Instances would allow as much customisation as Containers currently do (and in a sense Clusters also represent physical data containers). Does this make any sense 🙂 ? Is there a suggested/possible way we can fully model our Kafka Cluster concept using the current DataHub data model? Thanks!
  • l

    lively-fall-12210

    02/22/2022, 2:24 PM
    Hello! I am trying to use the "domain" feature of the Kafka Metadata ingestion. My recipe looks like this:
    Copy code
    source:
        type: kafka
        config:
            connection:
                bootstrap: 'my-broker:9092'
                schema_registry_url: 'http://my-schema-registry:8081'
            topic_patterns:
                deny:
                    - ^_.+
            domain:
                'urn:li:domain:3215d470-9bb9-4cdf-be43-e971047b4b72':
                    allow:
                        - '^foo\.bar*'
                'urn:li:domain:a518ea17-b705-4e59-94be-75cd1c600ca7':
                    allow:
                        - '^foo\.bazz*'
                'urn:li:domain:ea46bbbe-33c2-4a7e-bedd-665037df50fc':
                    allow:
                        - '^foo\.blub*'
    sink:
        type: datahub-rest
        config:
            server: 'http://my-datahub:8080'
    However, when executing the recipe, I get the following validation error:
    Copy code
    '1 validation error for KafkaSourceConfig\n'
               'domain\n'
               '  extra fields not permitted (type=value_error.extra)\n',
    According to the source, my deployment of DataHub should support the domain field already. Am I doing something subtly wrong here? Thank you very much for your support!
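    One thing worth ruling out (the thread has the actual resolution): the "extra fields not permitted" message comes from the client-side config model, so it usually means the acryl-datahub library that parses the recipe predates the domain option, regardless of what the server deployment supports. A quick check of the client version in use:
    Copy code
    # Print the version of the acryl-datahub library that parses the recipe;
    # the `domain` option only takes effect if this client version supports it.
    import datahub
    
    print(datahub.__version__)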
  • p

    plain-farmer-27314

    02/22/2022, 2:39 PM
    Hey all, after updating our ingestion plugins to 0.8.26.3, I'm seeing the following error:
    Copy code
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     return self.main(*args, **kwargs)
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     rv = self.invoke(ctx)
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     return ctx.invoke(self.callback, **ctx.params)
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     return callback(*args, **kwargs)
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "discord_data/python/bin/datahub/datahub_looker_ingest", line 33, in datahub_looker_ingest
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     'sink': {'type': 'datahub-rest', 'config': {'server': f'{server_url}'}},
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 175, in create
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     return cls(config, dry_run=dry_run, preview_mode=preview_mode)
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 116, in __init__
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     preview_mode=preview_mode,
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/api/common.py", line 41, in __init__
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     self.graph = DataHubGraph(datahub_api) if datahub_api is not None else None
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/graph/client.py", line 47, in __init__
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     ca_certificate_path=self.config.ca_certificate_path,
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/emitter/rest_emitter.py", line 121, in __init__
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     allowed_methods=self._retry_methods,
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - TypeError: __init__() got an unexpected keyword argument 'allowed_methods'
    Any thoughts on what could be causing this? Could also be that another dependency needs to be updated
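    For context: allowed_methods on urllib3's Retry was introduced in urllib3 1.26, and older releases only accept method_whitelist, so this traceback typically points at an outdated urllib3 pinned by another dependency in the image rather than at the datahub plugins themselves. A quick check in the same environment:
    Copy code
    # Verify whether the installed urllib3 accepts the `allowed_methods` keyword
    # (added in urllib3 1.26; older versions raise the same TypeError as above).
    import urllib3
    from urllib3.util.retry import Retry
    
    print("urllib3 version:", urllib3.__version__)
    Retry(total=1, allowed_methods=["GET", "POST"])
    print("allowed_methods is supported")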
  • f

    fierce-alligator-27212

    02/22/2022, 4:18 PM
    Hi, we are trying to enable lineage info for BigQuery. We can see all the datasets/tables in the UI, but not the lineage info. Based on the logs, it seems it wasn't able to get any entries back. Wondering if anyone has run into this issue and knows the potential causes. Thanks.
    Copy code
    [2022-02-22 10:38:24,592] INFO     {datahub.cli.ingest_cli:86} - Starting metadata ingestion
    [2022-02-22 10:38:24,592] INFO     {datahub.ingestion.source.sql.bigquery:320} - Populating lineage info via GCP audit logs
    [2022-02-22 10:38:25,997] INFO     {datahub.ingestion.source.sql.bigquery:381} - Start loading log entries from BigQuery
    [2022-02-22 11:00:36,725] INFO     {datahub.ingestion.source.sql.bigquery:520} - Creating lineage map: total number of entries=0, number skipped=0.
    [2022-02-22 11:00:36,726] INFO     {datahub.ingestion.source.sql.bigquery:316} - Built lineage map containing 0 entries.
    config:
    Copy code
    source:
      type: bigquery
      config:
        project_id: <GCP Project ID>
        env: prod
        include_table_lineage: True
        start_time: 2022-02-20 00:00:00Z
        end_time: 2022-02-21 00:00:00Z
    
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
  • p

    plain-farmer-27314

    02/22/2022, 4:40 PM
    The latest version, 0.8.26.7, seems not to set the project_id correctly in Great Expectations when running profiling for BigQuery tables. Logs in thread.
  • s

    silly-beach-19296

    02/22/2022, 4:55 PM
    Is there a Swagger spec for the API documentation? I am looking for the structure of the body that the API expects when ingesting glossary terms.
  • h

    handsome-football-66174

    02/22/2022, 9:05 PM
    Hi everyone, we have DataHub deployed on an EKS cluster. We are able to use Airflow for pull-based ingestion. We would like to do push-based ingestion via Kafka. How do we achieve this, and what configuration needs to be used?
    Copy code
    sink:
      type: "datahub-kafka"
      config:
        connection:
          bootstrap: localhost:9092
          schema_registry_url: http://localhost:8081
    I believe we need to point the schema registry to something other than the above?
    Copy code
    kafka:
      bootstrap:
        server: "<bootstrap server>"
      zookeeper:
        server: "<zookeeper server>"
      schemaregistry:
        url: "<http://prerequisites-cp-schema-registry:8081>"
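    A hedged sketch of the push-based path: the datahub-kafka sink publishes events to DataHub's Kafka topics, and the metadata consumers in the cluster apply them, so the bootstrap and schema-registry addresses should be the in-cluster services rather than localhost. The service names below follow the helm prerequisites defaults and are assumptions:
    Copy code
    # Sketch: any recipe, expressed as a programmatic pipeline, with a datahub-kafka sink.
    from datahub.ingestion.run.pipeline import Pipeline
    
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",  # placeholder; any source works the same way
                "config": {"host_port": "mydb:5432", "database": "mydb", "username": "u", "password": "p"},
            },
            "sink": {
                "type": "datahub-kafka",
                "config": {
                    "connection": {
                        "bootstrap": "prerequisites-kafka:9092",  # assumed in-cluster service name
                        "schema_registry_url": "http://prerequisites-cp-schema-registry:8081",
                    },
                },
            },
        }
    )
    pipeline.run()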
  • h

    handsome-football-66174

    02/22/2022, 10:07 PM
    Hi everyone, I'm trying to add lineage between a data job and a dataset (specifically an S3 location). Is there a convention to follow for the S3 path (i.e., what is usually present in AWS)? I see that a dot convention has been used in the S3 samples ingested in the demo DataHub.
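    A hedged sketch of wiring that lineage with the Python emitter via the dataJobInputOutput aspect; the orchestrator/flow/job names, the GMS URL, and the S3 dataset name are placeholders, and the naming convention for the S3 path is exactly the open question above:
    Copy code
    # Sketch: link a data job to an S3 dataset through the dataJobInputOutput aspect.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DataJobInputOutputClass
    
    s3_dataset_urn = builder.make_dataset_urn("s3", "my-bucket/path/to/table", "PROD")  # placeholder path
    datajob_urn = builder.make_data_job_urn(
        orchestrator="airflow", flow_id="my_dag", job_id="my_task", cluster="PROD"
    )
    
    io_aspect = DataJobInputOutputClass(
        inputDatasets=[s3_dataset_urn],  # datasets the job reads
        outputDatasets=[],               # datasets the job writes
    )
    
    emitter = DatahubRestEmitter("http://localhost:8080")  # GMS URL is a placeholder
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="datajob",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=datajob_urn,
            aspectName="dataJobInputOutput",
            aspect=io_aspect,
        )
    )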
  • s

    silly-beach-19296

    02/23/2022, 12:26 PM
    Hello again, I am trying to ingest glossary terms through the API and it is giving me this error: "message": "com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/glossaryTermInfo :: unrecognized field found but not allowed\nERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/urn :: field is required but not found and has no default value\nERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/aspects :: field is required but not found and has no default value\n",     "status": 422
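    Reading the 422 above, the request body appears to carry glossaryTermInfo at the snapshot's top level, while GlossaryTermSnapshot expects a urn field plus an aspects array containing the term info. A hedged sketch of the same ingest via the Python SDK classes, which emit the expected shape (the term urn, definition, and GMS URL are placeholders):
    Copy code
    # Sketch: emit a glossary term as a GlossaryTermSnapshot (urn + aspects).
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        GlossaryTermInfoClass,
        GlossaryTermSnapshotClass,
        MetadataChangeEventClass,
    )
    
    term_snapshot = GlossaryTermSnapshotClass(
        urn="urn:li:glossaryTerm:Classification.Sensitive",  # placeholder term urn
        aspects=[
            GlossaryTermInfoClass(
                definition="Sensitive data",
                termSource="INTERNAL",
            )
        ],
    )
    
    emitter = DatahubRestEmitter("http://localhost:8080")  # GMS URL is a placeholder
    emitter.emit_mce(MetadataChangeEventClass(proposedSnapshot=term_snapshot))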
  • r

    rhythmic-bear-20384

    02/23/2022, 2:08 PM
    Hello, I am getting this error when I try to ingest from a MySQL data source: Connection Refused to /api/gms/config. Is there a config or setting I need to change to make the endpoint available?
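    For troubleshooting (hedged; the thread has the details): the datahub-rest sink probes the server's /config endpoint before emitting, and a connection refused there means nothing is answering at the configured host/port. The sink normally points straight at GMS (port 8080 in the quickstart) rather than at the frontend's /api/gms proxy path. A quick probe, with the URL below standing in for whatever `server` your recipe uses:
    Copy code
    # Probe the GMS config endpoint that the rest sink checks before emitting.
    import requests
    
    resp = requests.get("http://localhost:8080/config", timeout=10)  # replace with your `server`
    print(resp.status_code)
    print(resp.json())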
  • m

    modern-monitor-81461

    02/23/2022, 8:12 PM
    I am using the Azure AD source to ingest users and groups from Azure AD, but I'm using the groups_pattern and users_pattern since I only want to ingest specific users and groups. My AD contains thousands of entries, and this creates a huge log of filtered items, which just pollutes the logs without adding any real value. I still want the logs, since when things go sideways I need to know what is going on, so redirecting the logs to /dev/null is not an option. I could hack it with grep, but I'd like to know if there is a way to disable some of the reporting. From reading the code, I don't think there is, but I might have missed something. I think the reporting is done via introspection of a dataclass, so the filtered list is printed if defined. Would there be a way (by modifying the existing code) to disable that list using a param passed to the AzureADSourceReport constructor? And instead of recording all the filtered names, I could simply keep a count...
    Copy code
    @dataclass
    class AzureADSourceReport(SourceReport):
        filtered: List[str] = field(default_factory=list)
    
        def report_filtered(self, name: str) -> None:
            self.filtered.append(name)
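    For what it's worth, a sketch of the kind of local modification described above: keep a counter (plus an opt-in flag) instead of appending every filtered name. The field and flag names here are made up for illustration:
    Copy code
    # Sketch of a locally modified report: count filtered entries instead of storing
    # every name. `filtered_count` and `log_filtered_names` are illustrative names.
    from dataclasses import dataclass, field
    from typing import List
    
    from datahub.ingestion.api.source import SourceReport
    
    
    @dataclass
    class AzureADSourceReport(SourceReport):
        filtered_count: int = 0
        log_filtered_names: bool = False
        filtered: List[str] = field(default_factory=list)
    
        def report_filtered(self, name: str) -> None:
            self.filtered_count += 1
            if self.log_filtered_names:
                self.filtered.append(name)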
  • f

    fierce-airplane-70308

    02/23/2022, 10:29 PM
    I'm trying to create lineage between a (custom) Qlik dashboard and 2 datasets, but I just get an internal error. Are there any examples of using the Python emitter to establish lineage between datasets and a dashboard?
    Copy code
    from typing import List
    
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineage,
    )
    from datahub.metadata.schema_classes import ChangeTypeClass
    
    # Construct upstream tables.
    upstream_tables: List[UpstreamClass] = []
    upstream_table_1 = UpstreamClass(
        dataset=builder.make_dataset_urn("mssql", "Analytics.PDDBI_DL.USERS","PROD"),
        type=DatasetLineageTypeClass.TRANSFORMED,
    )
    upstream_tables.append(upstream_table_1)
    upstream_table_2 = UpstreamClass(
        dataset=builder.make_dataset_urn("mssql", "Analytics.PDDBI_DL.JOBS","PROD"),
        type=DatasetLineageTypeClass.TRANSFORMED,
    )
    upstream_tables.append(upstream_table_2)
    
    # Construct a lineage object.
    upstream_lineage = UpstreamLineage(upstreams=upstream_tables)
    
    # Construct a MetadataChangeProposalWrapper object.
    lineage_mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dashboard_urn(platform="QlikSense", name="14542bf2-65a8-46ee-b140-953a2f67ebee"),
        aspectName="upstreamLineage",
        aspect=upstream_lineage,
    )
    
    # Create an emitter to the GMS REST API.
    emitter = DatahubRestEmitter("http://localhost:8080")
    
    # Emit metadata!
    emitter.emit_mcp(lineage_mcp)
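    One thing that stands out in the snippet (hedged; the thread has the actual resolution): the MCP sets entityType="dataset" and uses upstreamLineage, which is a dataset aspect, while entityUrn is a dashboard URN, and that mismatch alone can make the server reject the proposal. For comparison, this is the shape dataset-to-dataset lineage takes; the downstream dataset name below is a placeholder:
    Copy code
    # Sketch: upstreamLineage attached to a dataset URN, with entityType matching the URN.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineage,
    )
    from datahub.metadata.schema_classes import ChangeTypeClass
    
    upstream = UpstreamClass(
        dataset=builder.make_dataset_urn("mssql", "Analytics.PDDBI_DL.USERS", "PROD"),
        type=DatasetLineageTypeClass.TRANSFORMED,
    )
    lineage_mcp = MetadataChangeProposalWrapper(
        entityType="dataset",  # matches the dataset URN below
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mssql", "Analytics.PDDBI_DL.SOME_DOWNSTREAM", "PROD"),  # placeholder
        aspectName="upstreamLineage",
        aspect=UpstreamLineage(upstreams=[upstream]),
    )
    
    DatahubRestEmitter("http://localhost:8080").emit_mcp(lineage_mcp)  # GMS URL placeholder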
  • a

    adorable-flower-19656

    02/24/2022, 1:17 AM
    Hi, is there a way to specify how many historical runs are kept for UI ingestion?
  • s

    square-machine-96318

    02/24/2022, 2:31 AM
    When I run ingestion from the DataHub Web UI, the new metadata seems to be uploaded fine. But does it not support deletion? For example, dataset 'A' contains (a1, a2, a3). If 'a2' is deleted and 'a4' is newly created, the expected result for 'A' after ingestion is (a1, a3, a4). However, the result is (a1, a2, a3, a4). How can I get deletions applied?
  • b

    better-orange-49102

    02/24/2022, 6:18 AM
    What's the purpose of the "url" in the business glossary? i.e., the sample glossary looks like this:
    Copy code
    version: 1
    source: DataHub
    owners:
      users:
        - mjames
    url: "<https://github.com/linkedin/datahub/>"
    nodes:
      - name: Classification
        description: A set of terms related to Data Classification
        terms:
          - name: Sensitive
            description: Sensitive Data
            custom_properties:
              is_confidential: false
    That particular field doesn't show up in MySQL and seems to cause a display bug if you omit it, as discussed here: https://datahubspace.slack.com/archives/C029A3M079U/p1644386207180329
  • b

    breezy-controller-54597

    02/24/2022, 8:34 AM
    When ingesting from S3 with the data-lake source type, getFileStatus for s3a:// is executed against an object addressed as s3://, and an error occurs.
  • l

    late-animal-78943

    02/24/2022, 11:19 AM
    Is DataHub capable of getting data lineage from a managed Airflow solution, e.g. https://aws.amazon.com/managed-workflows-for-apache-airflow/?
  • h

    hundreds-memory-3344

    02/24/2022, 5:31 PM
    Hello 😃 I am trying to insert a tag using the Python emitter. However, even if I modify the tags of DatasetPropertiesClass, the tag does not appear in DataHub. 1. If I simply append a string to tags, doesn't it get picked up? 2. Do I need to put a URN in tags? I'm attaching the code I used as a sample.
    Copy code
    dataset_properties = DatasetPropertiesClass(
        description="This is Google Sample",
        externalUrl="https://www.google.com",
        customProperties={},
        tags=['Active'],
    )
    
    metadata_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("google_sheet", "sample1"),
        aspectName="datasetProperties",
        aspect=dataset_properties,
    )
    
    emitter.emit(metadata_event)
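    A hedged note with a sketch: the tags shown in the UI come from the globalTags aspect rather than from the tags field on datasetProperties, and globalTags takes tag URNs via TagAssociationClass. Reusing the same dataset URN as in the snippet above (the GMS URL is a placeholder):
    Copy code
    # Sketch: attach the 'Active' tag through the globalTags aspect (tags are URN references).
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        GlobalTagsClass,
        TagAssociationClass,
    )
    
    tags_aspect = GlobalTagsClass(
        tags=[TagAssociationClass(tag=builder.make_tag_urn("Active"))]
    )
    
    tag_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("google_sheet", "sample1"),
        aspectName="globalTags",
        aspect=tags_aspect,
    )
    
    emitter = DatahubRestEmitter("http://localhost:8080")
    emitter.emit(tag_event)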
  • g

    gentle-father-80172

    02/24/2022, 6:58 PM
    Hey Team! 👋 Any reason Glue ingestion is formatting my schema incorrectly? Looks like the ingestion isn't parsing Glue properly....
  • m

    mysterious-portugal-30527

    02/24/2022, 9:56 PM
    How can I load query info for MySQL and Postgres data sources?