# ingestion
  • r

    rhythmic-flag-69887

    02/19/2022, 12:29 PM
    Hello, I'm following the quickstart guide and trying to connect to my Postgres server. I know the Postgres connection works, since these are the same credentials used for dbt. I entered it like so:
    Copy code
    source:
        type: postgres
        config:
            host_port: '***.com:5432'
            database: **
            username: **
            password: **
    However, I'm getting an error. Is this because I wrote something wrong in the recipe?
    Copy code
    ConnectionRefusedError: [Errno 111] Connection refused
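    Errno 111 is raised at the TCP layer, before any credentials are checked, so the recipe syntax is usually not the cause; more often the host/port is simply not reachable from the environment where the recipe runs (for the quickstart, that is the machine or container running the ingestion, not the dbt host). A minimal reachability check, reusing the masked host_port from the recipe as a placeholder:
    Copy code
    # Quick TCP check of the Postgres host/port from the environment that runs the recipe.
    # Replace host/port with the real values from host_port (masked as ***.com here).
    import socket
    
    host, port = "***.com", 5432
    try:
        with socket.create_connection((host, port), timeout=5):
            print("TCP connection succeeded; the problem is likely elsewhere")
    except OSError as exc:
        print(f"Cannot reach {host}:{port} from here: {exc}")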
  • m

    mysterious-nail-70388

    02/21/2022, 3:30 AM
    Hi Team, when will Clickhouse data sources be supported?
  • m

    mysterious-nail-70388

    02/21/2022, 5:59 AM
    I built DataHub 0.8.26 locally, and it reported an error when obtaining metadata from the ES data source. I reinstalled the ES plug-in, but there is still a problem. I don't know why.
  • b

    breezy-noon-83306

    02/21/2022, 8:29 AM
    Good morning DataHub Community, I'm just starting with DataHub and I have some questions about ingestion: 1- Once the data is ingested, if there are data/metadata changes at the source, does it update automatically or do you have to ingest it again?
  • b

    breezy-noon-83306

    02/21/2022, 8:30 AM
    2- How do you update to a new release if you have installed DataHub through Kubernetes rather than Docker? Thank you very much, community!
  • f

    fierce-waiter-13795

    02/21/2022, 9:22 AM
    Hi Team, I'm having some issues ingesting redshift-lineage into datahub. I've followed the documentation here, yet I'm unable to see any lineage on datahub. Posting the recipe file and other details in the thread.
  • g

    gifted-piano-21322

    02/21/2022, 10:04 AM
    I know that DynamoDB is a document database, but is there a way to store its 'schema' in DataHub? Or maybe not the schema, but at least a datasource description, sample field values, etc.?
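    There was no DynamoDB connector at this point, but a 'schema' and a description can be pushed manually with the Python emitter by writing the schemaMetadata and datasetProperties aspects yourself. A minimal sketch; the platform name "dynamodb", the table/field names, and the GMS address are all assumptions, not an official example:
    Copy code
    # Sketch: manually emit a schema and description for a DynamoDB table.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetPropertiesClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )
    
    dataset_urn = builder.make_dataset_urn("dynamodb", "my_table", "PROD")  # placeholder names
    
    schema = SchemaMetadataClass(
        schemaName="my_table",
        platform=builder.make_data_platform_urn("dynamodb"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=[
            SchemaFieldClass(
                fieldPath="user_id",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="S",
                description="Partition key; sample value: 'u-123'",
            ),
        ],
    )
    props = DatasetPropertiesClass(description="DynamoDB table holding user events")
    
    emitter = DatahubRestEmitter("http://localhost:8080")  # GMS URL is a placeholder
    for aspect_name, aspect in [("schemaMetadata", schema), ("datasetProperties", props)]:
        emitter.emit_mcp(
            MetadataChangeProposalWrapper(
                entityType="dataset",
                changeType=ChangeTypeClass.UPSERT,
                entityUrn=dataset_urn,
                aspectName=aspect_name,
                aspect=aspect,
            )
        )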
  • s

    silly-beach-19296

    02/21/2022, 7:25 PM
    Hello, how do I add my glossary of terms to DataHub running on EKS? Should I connect to the node or directly to a pod?
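    As a general note (a sketch, not EKS-specific advice): the business glossary is ingested like any other source, from wherever the DataHub CLI or a Python pipeline can reach the GMS endpoint, so there is normally no need to exec into a node or pod. The glossary file path and the GMS service URL below are assumptions:
    Copy code
    # Sketch: ingest a business glossary file against a reachable GMS endpoint.
    from datahub.ingestion.run.pipeline import Pipeline
    
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "datahub-business-glossary",
                "config": {"file": "./business_glossary.yml"},  # path is a placeholder
            },
            "sink": {
                "type": "datahub-rest",
                # e.g. the GMS service exposed by the helm chart, or a port-forward; placeholder URL
                "config": {"server": "http://datahub-datahub-gms:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()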
  • m

    mysterious-nail-70388

    02/22/2022, 8:12 AM
    Hello, in metadata ingestion we specify in the YAML file that metadata should be ingested into the REST service. When we delete DataHub data, how do we specify the address of the GMS service?
  • w

    witty-painting-90923

    02/22/2022, 10:28 AM
    Hello! I am trying to ingest Elasticsearch metadata. I am also using some transformers to put it in the right browse path and add a tag. This worked really well with MongoDB and Postgres, but it doesn't work at all for Elasticsearch, as if the transformers are being ignored entirely. This is driving me crazy. I tried both a programmatic pipeline and a YAML recipe… The code is literally copy-pasted from Mongo and Postgres. Is there a chance that for ES the transformers need to be written differently, or am I missing something? Thank you!
    Copy code
    pipeline = Pipeline.create(
            # This configuration is analogous to a recipe configuration.
            {
                "source": {
                    "type": "elasticsearch",
                    "config": {
                        "env": ENV,
                        "host": es_connection_host_port,
                        "username": es_connection_login,
                        "password": es_connection_password,
                        "index_pattern": {
                            "deny": [es_deny_index_pattern]
                        }
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": datahub_server},
                },
                "transformers": [
                    {
                        "type": "set_dataset_browse_path",
                        "config": {
                            "path_templates": [f"/ENV/PLATFORM/EsComments/DATASET_PARTS"]
                        }
                    },
                    {
                        "type": "simple_add_dataset_tags",
                        "config": {
                            "tag_urns": [f"urn:li:tag:EsComments"]
                        }
                    }
                ]
    
            })
    
    pipeline.run()
  • b

    breezy-guitar-97226

    02/22/2022, 11:44 AM
    Hi here, I have a question about Platform Instances 🙂 A bit of context: in my company we run multiple Kafka clusters, and in our plans each cluster would be modelled as a Platform Instance in DataHub, to replace the custom Catalogue API we currently offer to our internal users for listing them. However, unlike Platform Instances, our Cluster entities can carry custom properties (representing cluster configuration) and ownership, a little bit like Containers in DataHub. The issue is that we fall into a middle ground: on one side, Clusters are our own internal abstraction and would be hard to generalise into the Kafka connector; on the other side, the available Platform Instances do not offer enough customisation to fulfil all our needs. In an ideal world (for us), Platform Instances would allow as much customisation as Containers currently do (and in a sense Clusters also represent physical data containers). Does this make any sense 🙂 ? Is there a suggested/possible way we can fully model our Kafka Cluster concept using the current DataHub data model? Thanks!
  • l

    lively-fall-12210

    02/22/2022, 2:24 PM
    Hello! I am trying to use the "domain" feature of the Kafka Metadata ingestion. My recipe looks like this:
    Copy code
    source:
        type: kafka
        config:
            connection:
                bootstrap: 'my-broker:9092'
                schema_registry_url: 'http://my-schema-registry:8081'
            topic_patterns:
                deny:
                    - ^_.+
            domain:
                'urn:li:domain:3215d470-9bb9-4cdf-be43-e971047b4b72':
                    allow:
                        - '^foo\.bar*'
                'urn:li:domain:a518ea17-b705-4e59-94be-75cd1c600ca7':
                    allow:
                        - '^foo\.bazz*'
                'urn:li:domain:ea46bbbe-33c2-4a7e-bedd-665037df50fc':
                    allow:
                        - '^foo\.blub*'
    sink:
        type: datahub-rest
        config:
            server: 'http://my-datahub:8080'
    However, when executing the recipe, I get the following validation error:
    Copy code
    '1 validation error for KafkaSourceConfig\n'
               'domain\n'
               '  extra fields not permitted (type=value_error.extra)\n',
    According to the source, my deployment of DataHub should support the domain field already. Am I doing something subtly wrong here? Thank you very much for your support!
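    One thing worth ruling out (the thread has the actual resolution): the "extra fields not permitted" message comes from the client-side config model, so it usually means the acryl-datahub library that parses the recipe predates the domain option, regardless of what the server deployment supports. A quick check of the client version in use:
    Copy code
    # Print the version of the acryl-datahub library that parses the recipe;
    # the `domain` option only takes effect if this client version supports it.
    import datahub
    
    print(datahub.__version__)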
  • p

    plain-farmer-27314

    02/22/2022, 2:39 PM
    Hey all, after updating our ingestion plugins to 0.8.26.3, I'm seeing the following error:
    Copy code
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     return self.main(*args, **kwargs)
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     rv = self.invoke(ctx)
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     return ctx.invoke(self.callback, **ctx.params)
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     return callback(*args, **kwargs)
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "discord_data/python/bin/datahub/datahub_looker_ingest", line 33, in datahub_looker_ingest
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     'sink': {'type': 'datahub-rest', 'config': {'server': f'{server_url}'}},
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 175, in create
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     return cls(config, dry_run=dry_run, preview_mode=preview_mode)
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 116, in __init__
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     preview_mode=preview_mode,
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/api/common.py", line 41, in __init__
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     self.graph = DataHubGraph(datahub_api) if datahub_api is not None else None
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/ingestion/graph/client.py", line 47, in __init__
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     ca_certificate_path=self.config.ca_certificate_path,
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -   File "/usr/local/lib/python3.7/site-packages/datahub/emitter/rest_emitter.py", line 121, in __init__
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO -     allowed_methods=self._retry_methods,
    [2022-02-18, 14:00:15 UTC] {pod_launcher.py:149} INFO - TypeError: __init__() got an unexpected keyword argument 'allowed_methods'
    Any thoughts on what could be causing this? Could also be that another dependency needs to be updated
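    For context: allowed_methods on urllib3's Retry was introduced in urllib3 1.26, and older releases only accept method_whitelist, so this traceback typically points at an outdated urllib3 pinned by another dependency in the image rather than at the datahub plugins themselves. A quick check in the same environment:
    Copy code
    # Verify whether the installed urllib3 accepts the `allowed_methods` keyword
    # (added in urllib3 1.26; older versions raise the same TypeError as above).
    import urllib3
    from urllib3.util.retry import Retry
    
    print("urllib3 version:", urllib3.__version__)
    Retry(total=1, allowed_methods=["GET", "POST"])
    print("allowed_methods is supported")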
  • f

    fierce-alligator-27212

    02/22/2022, 4:18 PM
    Hi, we are trying to enable lineage info for BigQuery. We can see all the datasets/tables in the UI, but not the lineage info. Based on the logs, it seems it wasn't able to get any entries back. Wondering if anyone has run into this issue and knows the potential causes. Thanks.
    Copy code
    [2022-02-22 10:38:24,592] INFO     {datahub.cli.ingest_cli:86} - Starting metadata ingestion
    [2022-02-22 10:38:24,592] INFO     {datahub.ingestion.source.sql.bigquery:320} - Populating lineage info via GCP audit logs
    [2022-02-22 10:38:25,997] INFO     {datahub.ingestion.source.sql.bigquery:381} - Start loading log entries from BigQuery
    [2022-02-22 11:00:36,725] INFO     {datahub.ingestion.source.sql.bigquery:520} - Creating lineage map: total number of entries=0, number skipped=0.
    [2022-02-22 11:00:36,726] INFO     {datahub.ingestion.source.sql.bigquery:316} - Built lineage map containing 0 entries.
    config:
    Copy code
    source:
      type: bigquery
      config:
        project_id: <GCP Project ID>
        env: prod
        include_table_lineage: True
        start_time: 2022-02-20 00:00:00Z
        end_time: 2022-02-21 00:00:00Z
    
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
  • p

    plain-farmer-27314

    02/22/2022, 4:40 PM
    The latest version, 0.8.26.7, seems not to set the project_id correctly in Great Expectations when running profiling for BigQuery tables. Logs in thread.
  • s

    silly-beach-19296

    02/22/2022, 4:55 PM
    Is there a Swagger spec for the API documentation? I am looking for the structure of the body that the API expects when ingesting glossary terms.
  • h

    handsome-football-66174

    02/22/2022, 9:05 PM
    Hi everyone, we have DataHub deployed on an EKS cluster. We are able to use Airflow for pull-based ingestion. We would like to do push-based ingestion via Kafka. How do we achieve this, and what configuration needs to be used?
    Copy code
    sink:
      type: "datahub-kafka"
      config:
        connection:
          bootstrap: localhost:9092
          schema_registry_url: http://localhost:8081
    I believe we need to point the schema registry to something other than the above?
    Copy code
    kafka:
      bootstrap:
        server: "<bootstrap server>"
      zookeeper:
        server: "<zookeeper server>"
      schemaregistry:
        url: "<http://prerequisites-cp-schema-registry:8081>"
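    A hedged sketch of the push-based path: the datahub-kafka sink publishes events to DataHub's Kafka topics, and the metadata consumers in the cluster apply them, so the bootstrap and schema-registry addresses should be the in-cluster services rather than localhost. The service names below follow the helm prerequisites defaults and are assumptions:
    Copy code
    # Sketch: any recipe, expressed as a programmatic pipeline, with a datahub-kafka sink.
    from datahub.ingestion.run.pipeline import Pipeline
    
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",  # placeholder; any source works the same way
                "config": {"host_port": "mydb:5432", "database": "mydb", "username": "u", "password": "p"},
            },
            "sink": {
                "type": "datahub-kafka",
                "config": {
                    "connection": {
                        "bootstrap": "prerequisites-kafka:9092",  # assumed in-cluster service name
                        "schema_registry_url": "http://prerequisites-cp-schema-registry:8081",
                    },
                },
            },
        }
    )
    pipeline.run()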
  • h

    handsome-football-66174

    02/22/2022, 10:07 PM
    Hi everyone, I'm trying to add lineage between a data job and a dataset (specifically an S3 location). Is there a convention to follow for the S3 path (i.e., what is usually present in AWS)? I see that a dot convention has been used in the S3 samples ingested in the demo DataHub.
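    A hedged sketch of wiring that lineage with the Python emitter via the dataJobInputOutput aspect; the orchestrator/flow/job names, the GMS URL, and the S3 dataset name are placeholders, and the naming convention for the S3 path is exactly the open question above:
    Copy code
    # Sketch: link a data job to an S3 dataset through the dataJobInputOutput aspect.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DataJobInputOutputClass
    
    s3_dataset_urn = builder.make_dataset_urn("s3", "my-bucket/path/to/table", "PROD")  # placeholder path
    datajob_urn = builder.make_data_job_urn(
        orchestrator="airflow", flow_id="my_dag", job_id="my_task", cluster="PROD"
    )
    
    io_aspect = DataJobInputOutputClass(
        inputDatasets=[s3_dataset_urn],  # datasets the job reads
        outputDatasets=[],               # datasets the job writes
    )
    
    emitter = DatahubRestEmitter("http://localhost:8080")  # GMS URL is a placeholder
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="datajob",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=datajob_urn,
            aspectName="dataJobInputOutput",
            aspect=io_aspect,
        )
    )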
  • s

    silly-beach-19296

    02/23/2022, 12:26 PM
    Hello again, I am trying to ingest glossary terms through the API and it is giving me this error: "message": "com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/glossaryTermInfo :: unrecognized field found but not allowed\nERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/urn :: field is required but not found and has no default value\nERROR :: /value/com.linkedin.metadata.snapshot.GlossaryTermSnapshot/aspects :: field is required but not found and has no default value\n",     "status": 422
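    Reading the 422 above, the request body appears to carry glossaryTermInfo at the snapshot's top level, while GlossaryTermSnapshot expects a urn field plus an aspects array containing the term info. A hedged sketch of the same ingest via the Python SDK classes, which emit the expected shape (the term urn, definition, and GMS URL are placeholders):
    Copy code
    # Sketch: emit a glossary term as a GlossaryTermSnapshot (urn + aspects).
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        GlossaryTermInfoClass,
        GlossaryTermSnapshotClass,
        MetadataChangeEventClass,
    )
    
    term_snapshot = GlossaryTermSnapshotClass(
        urn="urn:li:glossaryTerm:Classification.Sensitive",  # placeholder term urn
        aspects=[
            GlossaryTermInfoClass(
                definition="Sensitive data",
                termSource="INTERNAL",
            )
        ],
    )
    
    emitter = DatahubRestEmitter("http://localhost:8080")  # GMS URL is a placeholder
    emitter.emit_mce(MetadataChangeEventClass(proposedSnapshot=term_snapshot))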
  • r

    rhythmic-bear-20384

    02/23/2022, 2:08 PM
    Hello, I am getting this error when I try to ingest from a MySQL data source: Connection Refused to /api/gms/config. Is there a config or setting I need to change to make the endpoint available?
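    For troubleshooting (hedged; the thread has the details): the datahub-rest sink probes the server's /config endpoint before emitting, and a connection refused there means nothing is answering at the configured host/port. The sink normally points straight at GMS (port 8080 in the quickstart) rather than at the frontend's /api/gms proxy path. A quick probe, with the URL below standing in for whatever `server` your recipe uses:
    Copy code
    # Probe the GMS config endpoint that the rest sink checks before emitting.
    import requests
    
    resp = requests.get("http://localhost:8080/config", timeout=10)  # replace with your `server`
    print(resp.status_code)
    print(resp.json())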
  • m

    modern-monitor-81461

    02/23/2022, 8:12 PM
    I am using the Azure AD source to ingest users and groups from Azure AD, but I'm using the groups_pattern and users_pattern since I only want to ingest specific users and groups. My AD contains thousands of entries, and this creates a huge log of filtered items, which just pollutes the logs without adding any real value. I still want the logs, since when things go sideways I need to know what is going on, so redirecting the logs to /dev/null is not an option. I could hack it with grep, but I'd like to know if there is a way to disable some of the reporting. From reading the code, I don't think there is, but I might have missed something. I think the reporting is done via introspection of a dataclass, so the filtered list is printed if defined. Would there be a way (by modifying the existing code) to disable that list using a param passed to the AzureADSourceReport constructor? And instead of recording all the filtered names, I could simply keep a count...
    Copy code
    @dataclass
    class AzureADSourceReport(SourceReport):
        filtered: List[str] = field(default_factory=list)
    
        def report_filtered(self, name: str) -> None:
            self.filtered.append(name)
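    For what it's worth, a sketch of the kind of local modification described above: keep a counter (plus an opt-in flag) instead of appending every filtered name. The field and flag names here are made up for illustration:
    Copy code
    # Sketch of a locally modified report: count filtered entries instead of storing
    # every name. `filtered_count` and `log_filtered_names` are illustrative names.
    from dataclasses import dataclass, field
    from typing import List
    
    from datahub.ingestion.api.source import SourceReport
    
    
    @dataclass
    class AzureADSourceReport(SourceReport):
        filtered_count: int = 0
        log_filtered_names: bool = False
        filtered: List[str] = field(default_factory=list)
    
        def report_filtered(self, name: str) -> None:
            self.filtered_count += 1
            if self.log_filtered_names:
                self.filtered.append(name)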
  • f

    fierce-airplane-70308

    02/23/2022, 10:29 PM
    I'm trying to create lineage between a (custom) Qlik dashboard and 2 datasets, but I just get an internal error. Are there any examples of using the Python emitter to establish lineage between datasets and a dashboard?
    Copy code
    from typing import List
    
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineage,
    )
    from datahub.metadata.schema_classes import ChangeTypeClass
    
    # Construct upstream tables.
    upstream_tables: List[UpstreamClass] = []
    upstream_table_1 = UpstreamClass(
        dataset=builder.make_dataset_urn("mssql", "Analytics.PDDBI_DL.USERS","PROD"),
        type=DatasetLineageTypeClass.TRANSFORMED,
    )
    upstream_tables.append(upstream_table_1)
    upstream_table_2 = UpstreamClass(
        dataset=builder.make_dataset_urn("mssql", "Analytics.PDDBI_DL.JOBS","PROD"),
        type=DatasetLineageTypeClass.TRANSFORMED,
    )
    upstream_tables.append(upstream_table_2)
    
    # Construct a lineage object.
    upstream_lineage = UpstreamLineage(upstreams=upstream_tables)
    
    # Construct a MetadataChangeProposalWrapper object.
    lineage_mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dashboard_urn(platform="QlikSense", name="14542bf2-65a8-46ee-b140-953a2f67ebee"),
        aspectName="upstreamLineage",
        aspect=upstream_lineage,
    )
    
    # Create an emitter to the GMS REST API.
    emitter = DatahubRestEmitter("http://localhost:8080")
    
    # Emit metadata!
    emitter.emit_mcp(lineage_mcp)
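    One thing that stands out in the snippet (hedged; the thread has the actual resolution): the MCP sets entityType="dataset" and uses upstreamLineage, which is a dataset aspect, while entityUrn is a dashboard URN, and that mismatch alone can make the server reject the proposal. For comparison, this is the shape dataset-to-dataset lineage takes; the downstream dataset name below is a placeholder:
    Copy code
    # Sketch: upstreamLineage attached to a dataset URN, with entityType matching the URN.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineage,
    )
    from datahub.metadata.schema_classes import ChangeTypeClass
    
    upstream = UpstreamClass(
        dataset=builder.make_dataset_urn("mssql", "Analytics.PDDBI_DL.USERS", "PROD"),
        type=DatasetLineageTypeClass.TRANSFORMED,
    )
    lineage_mcp = MetadataChangeProposalWrapper(
        entityType="dataset",  # matches the dataset URN below
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mssql", "Analytics.PDDBI_DL.SOME_DOWNSTREAM", "PROD"),  # placeholder
        aspectName="upstreamLineage",
        aspect=UpstreamLineage(upstreams=[upstream]),
    )
    
    DatahubRestEmitter("http://localhost:8080").emit_mcp(lineage_mcp)  # GMS URL placeholder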
  • a

    adorable-flower-19656

    02/24/2022, 1:17 AM
    Hi, is there a way to specify how many historical runs are kept for UI ingestion?
  • s

    square-machine-96318

    02/24/2022, 2:31 AM
    When I run ingestion from the DataHub Web UI, the new metadata seems to be uploaded fine. But does it not support deletion? For example, dataset 'A' contains (a1, a2, a3). If 'a2' is deleted and 'a4' is newly created, the expected result for 'A' after ingestion is (a1, a3, a4). However, the result is (a1, a2, a3, a4). How can I get deletions applied?
  • b

    better-orange-49102

    02/24/2022, 6:18 AM
    What's the purpose of the "url" in the business glossary? i.e., the sample glossary looks like this:
    Copy code
    version: 1
    source: DataHub
    owners:
      users:
        - mjames
    url: "<https://github.com/linkedin/datahub/>"
    nodes:
      - name: Classification
        description: A set of terms related to Data Classification
        terms:
          - name: Sensitive
            description: Sensitive Data
            custom_properties:
              is_confidential: false
    That particular field doesn't show up in MySQL and seems to cause a display bug if you omit it, as discussed here: https://datahubspace.slack.com/archives/C029A3M079U/p1644386207180329
  • b

    breezy-controller-54597

    02/24/2022, 8:34 AM
    When ingesting from S3 with the data-lake source type, getFileStatus for s3a:// is executed against an object addressed as s3://, and an error occurs.
  • l

    late-animal-78943

    02/24/2022, 11:19 AM
    Is DataHub capable of getting data lineage from a managed Airflow solution, e.g. https://aws.amazon.com/managed-workflows-for-apache-airflow/?
  • h

    hundreds-memory-3344

    02/24/2022, 5:31 PM
    Hello 😃 I am trying to insert a tag using the Python emitter. However, even if I modify the tags of DatasetPropertiesClass, the tag does not appear in DataHub. 1. If I simply append a string to tags, doesn't it get picked up? 2. Do I need to put a URN in tags? I'm attaching the code I used as a sample.
    Copy code
    dataset_properties = DatasetPropertiesClass(
        description="This is Google Sample",
        externalUrl="https://www.google.com",
        customProperties={},
        tags=['Active'],
    )
    
    metadata_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("google_sheet", "sample1"),
        aspectName="datasetProperties",
        aspect=dataset_properties,
    )
    
    emitter.emit(metadata_event)
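    A hedged note with a sketch: the tags shown in the UI come from the globalTags aspect rather than from the tags field on datasetProperties, and globalTags takes tag URNs via TagAssociationClass. Reusing the same dataset URN as in the snippet above (the GMS URL is a placeholder):
    Copy code
    # Sketch: attach the 'Active' tag through the globalTags aspect (tags are URN references).
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        GlobalTagsClass,
        TagAssociationClass,
    )
    
    tags_aspect = GlobalTagsClass(
        tags=[TagAssociationClass(tag=builder.make_tag_urn("Active"))]
    )
    
    tag_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("google_sheet", "sample1"),
        aspectName="globalTags",
        aspect=tags_aspect,
    )
    
    emitter = DatahubRestEmitter("http://localhost:8080")
    emitter.emit(tag_event)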
  • g

    gentle-father-80172

    02/24/2022, 6:58 PM
    Hey Team! 👋 Any reason Glue ingestion is formatting my schema incorrectly? Looks like the ingestion isn't parsing Glue properly....
  • m

    mysterious-portugal-30527

    02/24/2022, 9:56 PM
    How can I load query info for MySQL and Postgres data sources?