# ingestion
  • h

    hallowed-kilobyte-916

    05/22/2023, 10:08 AM
    I am trying to add descriptions to columns of a DataHub dataset sourced from S3, using the code below:
    Copy code
    import logging
    import time
    
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    
    # read-modify-write requires access to the DataHubGraph (RestEmitter is not enough)
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    
    # Imports for metadata model classes
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        EditableSchemaFieldInfoClass,
        EditableSchemaMetadataClass,
        InstitutionalMemoryClass,
    )
    
    log = logging.getLogger(__name__)
    logging.basicConfig(level=logging.INFO)
    
    
    def get_simple_field_path_from_v2_field_path(field_path: str) -> str:
        """A helper function to extract simple . path notation from the v2 field path"""
        if not field_path.startswith("[version=2.0]"):
            # not a v2, we assume this is a simple path
            return field_path
        # this is a v2 field path
        tokens = [t for t in field_path.split(".") if not (t.startswith("[") or t.endswith("]"))]
    
        return ".".join(tokens)
    
    
    dictionary = {
        "a": "b",
        "c": "d",
    }
    
    
    # Inputs -> documentation, dataset
    documentation_to_add = "The unique application (service) correlation id on service now"
    dataset_name = "a/b/20230511.csv"
    
    dataset_urn = make_dataset_urn(platform="s3", name=dataset_name, env="PROD")
    
    
    def add_dict(graph, dataset_urn, column, documentation_to_add):
        need_write = False
    
        field_info_to_set = EditableSchemaFieldInfoClass(fieldPath=column, description=documentation_to_add)
    
        # Some helpful variables to fill out objects later
        now = int(time.time() * 1000)  # milliseconds since epoch
        current_timestamp = AuditStampClass(time=now, actor="urn:li:corpuser:ingestion")
    
        current_editable_schema_metadata = graph.get_aspect(
            entity_urn=dataset_urn, aspect_type=EditableSchemaMetadataClass,
        )
    
        # need_write = False
    
        if current_editable_schema_metadata:
            for fieldInfo in current_editable_schema_metadata.editableSchemaFieldInfo:
                if get_simple_field_path_from_v2_field_path(fieldInfo.fieldPath) == column:
                    # we have some editable schema metadata for this field
                    field_match = True
                    if documentation_to_add != fieldInfo.description:
                        fieldInfo.description = documentation_to_add
                        need_write = True
        else:
            # create a brand new editable dataset properties aspect
            current_editable_schema_metadata = EditableSchemaMetadataClass(
                editableSchemaFieldInfo=[field_info_to_set], created=current_timestamp,
            )
            need_write = True
    
        if need_write:
            event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
                entityUrn=dataset_urn, aspect=current_editable_schema_metadata,
            )
            graph.emit(event)
            log.info(f"Documentation added to dataset {dataset_urn}")
    
        else:
            <http://log.info|log.info>("Documentation already exists and is identical, omitting write")
    
    
    # Create the DataHub graph client
    gms_endpoint = "http://localhost:8080"
    graph = DataHubGraph(config=DatahubClientConfig(server=gms_endpoint))
    
    
    for column, documentation_to_add in dictionary.items():
        print(f"column: {column} and documentation_to_add: {documentation_to_add}")
        add_dict(graph, dataset_urn, column, documentation_to_add)
    However, the code just tells me the documentation already exists, while in fact it doesn't exist in DataHub.
    Copy code
    column: a and documentation_to_add: b
    INFO:__main__:Documentation already exists and is identical, omitting write
    column: c and documentation_to_add: d
    INFO:__main__:Documentation already exists and is identical, omitting write
    Am I missing anything? I am following instructions from here: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_add_column_documentation.py
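    One thing that stands out, for anyone reading along: field_match is set but never used. If the aspect already exists but contains no entry for this column, need_write stays False and the script skips the write, which matches the log output above. The linked example guards against that case by appending the new field info; a minimal sketch of that branch, reusing the variable names from the snippet above:
    Copy code
    if current_editable_schema_metadata:
        field_match = False
        for fieldInfo in current_editable_schema_metadata.editableSchemaFieldInfo:
            if get_simple_field_path_from_v2_field_path(fieldInfo.fieldPath) == column:
                field_match = True
                if documentation_to_add != fieldInfo.description:
                    fieldInfo.description = documentation_to_add
                    need_write = True
        if not field_match:
            # the aspect exists but has no entry for this column yet: append it
            current_editable_schema_metadata.editableSchemaFieldInfo.append(field_info_to_set)
            need_write = True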
    g
    • 2
    • 5
  • l

    lemon-scooter-69730

    05/22/2023, 3:27 PM
    Can you specify the CLI version in a YAML-formatted recipe?
    ✅ 1
    g
    • 2
    • 3
  • l

    lemon-scooter-69730

    05/22/2023, 5:05 PM
    Also, can you name a CLI ingest job so that when it shows in the UI it's not just the platform type under the name? So that instead of
    dbt
    for example it's
    a_specific_dbt_instance
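    One knob that may help here, sketched against the recipe format used elsewhere in this channel: recipes accept a top-level pipeline_name, and the assumption in this sketch is that CLI runs show up in the UI grouped under that name rather than just the platform type. Paths below are placeholders:
    Copy code
    pipeline_name: a_specific_dbt_instance
    source:
        type: dbt
        config:
            manifest_path: /path/to/manifest.json
            catalog_path: /path/to/catalog.json
            target_platform: bigquery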
    ✅ 1
    g
    • 2
    • 6
  • h

    hundreds-airline-29192

    05/23/2023, 4:43 AM
    I am facing this error when using Spark lineage with DataHub. Can anybody help me, please?
    m
    • 2
    • 3
  • e

    enough-elephant-56787

    05/23/2023, 6:44 AM
    Hello all! Issue: not able to ingest a single table from an Athena database. I was trying to ingest Athena tables with "AwsDataCatalog" as the data source. I have a database named "employees_db", and under that I have two tables, "male_emp" and "female_emp". When I try to ingest the entire "employees_db", the db and the two tables are fully ingested, but the issue arises when I try to ingest a single table from the db. Please help me out with ingesting a specific table in a database with the Athena source. Recipe file I used:
    Copy code
    source:
        type: athena
        config:
            aws_region: us-east-2
            database: employees_db
            query_result_location: 'S3 location'
            work_group: primary
            table_pattern:
                allow:
                    - male_emp
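    One thing worth checking, as an assumption rather than a confirmed diagnosis: table_pattern.allow entries are regular expressions matched against the fully qualified table name (e.g. employees_db.male_emp), so a bare male_emp may fail to match. A sketch of the tweaked fragment:
    Copy code
    table_pattern:
        allow:
            # assumption: patterns are matched against database.table
            - 'employees_db\.male_emp'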
    ✅ 1
    g
    • 2
    • 3
  • h

    hundreds-airline-29192

    05/23/2023, 8:12 AM
    How do I run datahub docker quickstart with a specific GMS port?
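    For what it's worth, the quickstart docs describe an environment variable for customizing the host port mapping of GMS; a sketch, assuming DATAHUB_MAPPED_GMS_PORT is honored by your quickstart version:
    Copy code
    # map GMS to port 58080 on the host instead of the default 8080
    DATAHUB_MAPPED_GMS_PORT=58080 datahub docker quickstart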
    g
    g
    • 3
    • 19
  • h

    hundreds-airline-29192

    05/23/2023, 9:21 AM
    I am facing this error, please help me!
    s
    • 2
    • 13
  • a

    adamant-sunset-13770

    05/23/2023, 2:48 PM
    Hello! We have encountered an issue where data imported from BigQuery to DataHub does not retain BigQuery labels as tags for BigQuery views. We have already enabled
    capture_table_label_as_tag: true
    and observed that it works for BigQuery tables. Could you please confirm whether this is the intended functionality? Additionally, we would greatly appreciate any available workarounds. Thank you and best regards, Stian
    d
    h
    r
    • 4
    • 7
  • h

    hundreds-airline-29192

    05/24/2023, 8:07 AM

    https://files.slack.com/files-pri/TUMKD5EGJ-F059XGR8W4Q/image.png

    g
    g
    • 3
    • 8
  • h

    hundreds-airline-29192

    05/24/2023, 8:08 AM
    Hey! I cannot see anything when I go to the description of a BigQuery table/dataset.
  • h

    hundreds-airline-29192

    05/24/2023, 8:08 AM
    I can't see tags, glossary terms, etc. of a domain or dataset.
  • a

    agreeable-table-54007

    05/24/2023, 8:47 AM
    Hello y'all! I'm wondering if ingestion with the Databricks Unity Catalog source in the UI is working for any of you? It works with the CLI but not the UI. I got this error:
    Copy code
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (unity-catalog): type object 'Retry' has no attribute 'DEFAULT_METHOD_WHITELIST'
    I tried a lot of stuff, installing various library versions, etc., but it's not working. If you have a solution to use only the UI connector and not the CLI, I'll gladly take it. Thanks. Hope you have a great day, guys.
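    For context: Retry.DEFAULT_METHOD_WHITELIST was deprecated in urllib3 1.26 and removed in urllib3 2.0, so this usually means some dependency in the ingestion environment was written against urllib3 1.x. A possible workaround, assuming you can pin packages in the environment the UI executor runs in:
    Copy code
    pip install 'urllib3<2'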
    g
    • 2
    • 2
  • e

    echoing-evening-57052

    05/24/2023, 12:01 PM
    Hello all, after ingesting metadata from Redshift, I am unable to see information about constraints like primary keys and foreign keys in the schema part of the dataset. Is there anything specific needed while ingesting from Redshift? Can anyone suggest anything?
    d
    • 2
    • 12
  • b

    boundless-nail-65912

    05/24/2023, 1:24 PM
    Hi team, during ingestion I am getting the below error. May I know why I am getting this error? Can anyone help me with it?
    s
    g
    • 3
    • 3
  • a

    adamant-sunset-13770

    05/24/2023, 1:55 PM
    Hello! We are currently having an issue with ingesting descriptions from BigQuery datasets. The problem seems to be that the query used to obtain the dataset metadata uses `select x from {project_id}.INFORMATION_SCHEMA.SCHEMATA`, which automatically sets `region-us` for the query (see documentation). We have all our data in `region-eu`, but there does not seem to be any way of specifying the region.
    1. Can you confirm that `region-us` is being set?
    2. Do you know of any workarounds?
    All the best, Stian
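    For anyone hitting the same thing: BigQuery lets you qualify INFORMATION_SCHEMA with a region, which is presumably what the connector would need to emit here. For example:
    Copy code
    -- region-qualified form; without a qualifier BigQuery defaults to the US region
    SELECT schema_name
    FROM `region-eu`.INFORMATION_SCHEMA.SCHEMATA;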
    ✅ 1
    g
    d
    • 3
    • 4
  • b

    bored-truck-17085

    05/25/2023, 12:14 AM
    Hi everyone, I'm having an issue with ingesting the dbt freshness validation, and I would like to understand why my freshness validation doesn't sync correctly. I currently use the CLI recipe config below, syncing all files (manifest, catalog, sources, and run_results). The last update for a specific table is "2023-05-24", but the stats tab shows "2023-05-20". Has anyone faced this issue before?
    Copy code
    source:
        type: "dbt"
        config:
            # Coordinates
            manifest_path: "/home/datahub/datahub/manifest.json"
            catalog_path: "/home/datahub/datahub/catalog.json"
            sources_path: "/home/datahub/datahub/sources.json"
            test_results_path: "/home/datahub/datahub/run_results.json"
    
            # Options
            stateful_ingestion:
                enabled: false
            entities_enabled:
                models: 'YES'
                sources: 'YES'
                seeds: 'NO'
                test_definitions: 'YES'
                test_results: 'YES'
            target_platform: bigquery
    
    pipeline_name: "source_tests"
    
    sink:
        type: "datahub-rest"
        config:
            server: "http://localhost:8080"
            token: ${DATAHUB_CLI_ACCESS_TOKEN}
    👍 1
    g
    • 2
    • 1
  • s

    silly-ambulance-51171

    05/25/2023, 7:33 AM
    Hi there! Could anyone suggest whether DataHub has any notification options for cases when the ingestion process is not successful? I was thinking about using alerts based on Prometheus metrics (but did not find any useful metric for that). Maybe something like a Slack or email callback is possible?
    g
    • 2
    • 1
  • h

    hundreds-airline-29192

    05/25/2023, 8:03 AM
    Can I display my Spark insert-into-BigQuery flow on DataHub?
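    For reference, the DataHub Spark lineage integration is enabled through Spark properties; a sketch, with the version left as a placeholder and your_spark_job.py as a hypothetical job:
    Copy code
    spark-submit \
        --packages io.acryl:datahub-spark-lineage:<version> \
        --conf spark.extraListeners=datahub.spark.DatahubSparkListener \
        --conf spark.datahub.rest.server=http://localhost:8080 \
        your_spark_job.py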
    g
    • 2
    • 1
  • e

    echoing-branch-87829

    05/25/2023, 8:59 AM
    Hi, I have DataHub on a Linux VM and managed to get the frontend up. When I try to run a dbt ingestion I am receiving the below error:
    Copy code
    [2023-05-25 07:55:30,344] ERROR    {datahub.entrypoints:195} - Command failed: [Errno 13] Permission denied: '/root/xxxx/target/manifest.json'
    I have tried giving permissions on the folder but am struggling to get past this permission issue, if anyone could help please.
    ✅ 1
    g
    • 2
    • 1
  • e

    echoing-branch-87829

    05/25/2023, 8:59 AM
    recipe:
  • e

    echoing-branch-87829

    05/25/2023, 9:00 AM
    Copy code
    source:
        type: dbt
        config:
            manifest_path: /root/virtualone/virtualone/target/manifest.json
            catalog_path: /root/virtualone/virtualone/target/catalog.json
            sources_path: /root/virtualone/virtualone/target/sources.json
            test_results_path: /root/virtualone/virtualone/target/run_results.json
            target_platform: snowflake
  • e

    echoing-branch-87829

    05/25/2023, 9:00 AM
    the paths come from a git fetch on the VM
  • r

    ripe-stone-30144

    05/25/2023, 9:56 AM
    Hi guys! Can someone tell me exactly what roles/rights the user needs for ingesting metadata from MS SQL Server? https://datahubproject.io/docs/generated/ingestion/sources/mssql/#config-details
    ✅ 1
    g
    • 2
    • 2
  • q

    quiet-exabyte-77821

    05/25/2023, 10:31 AM
    Hello everyone, is there any way to ingest metadata for the dbt source other than the CLI or UI?
    ✅ 1
    a
    • 2
    • 1
  • a

    ancient-queen-15575

    05/25/2023, 10:50 AM
    I’m having an issue connecting to DataHub from my terminal and am wondering if anyone can help 🙏. I can successfully run a curl request like the one below:
    Copy code
    curl --location --request POST 'https://dev.mydomain.org/api/graphql' \
    --header 'Authorization: Bearer <datahub token>' \
    --header 'Content-Type: application/json' \
    --data-raw '{ "query": "{ dataset(urn: \"urn:li:dataset:(urn:li:dataPlatform:s3,land-dev.insurance/calculations,DEV)\") { domain { associatedUrn domain { urn properties { name } } } } }", "variables":{}}'
    But I can't connect when trying to run a recipe. My environment variable for the token is called DATAHUB_GMS_TOKEN and has the same value as is used in the curl request. For DATAHUB_GMS_URL the value is https://dev.mydomain.org:8080. I’m not understanding why a query to GraphQL would work but connecting to the GMS port wouldn’t.
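    One configuration worth trying, as an assumption about the deployment: if GraphQL works on https://dev.mydomain.org but port 8080 isn't reachable from outside, GMS may only be exposed through the frontend proxy, which serves it under /api/gms. In that case:
    Copy code
    # point the CLI at the proxied GMS path instead of the raw port
    export DATAHUB_GMS_URL='https://dev.mydomain.org/api/gms'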
    g
    • 2
    • 1
  • h

    helpful-guitar-93961

    05/25/2023, 11:51 AM
    I am configuring the GCS source in the CLI and this error happens. How do I solve this?
    d
    r
    a
    • 4
    • 12
  • f

    flat-engineer-75197

    05/25/2023, 12:22 PM
    👋 I’d like to update the JSON files used in the integration tests for dbt. Do I have to manually update these, or is there a command to run (similar to updating golden files)?
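    In case it helps others: the metadata-ingestion test suite regenerates golden files via a pytest flag, and my assumption is the dbt integration test JSONs are updated the same way (the exact test path may differ):
    Copy code
    pytest tests/integration -k dbt --update-golden-files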
    ✅ 1
    g
    • 2
    • 4
  • h

    helpful-guitar-93961

    05/25/2023, 1:11 PM
    I am ingesting GCS metadata and facing this error. How do I solve this?
    g
    • 2
    • 3
  • a

    adventurous-pillow-74569

    05/25/2023, 1:47 PM
    Hello, I am trying to ingest data from BigQuery using UI-based ingestion and am getting the below error. What could be the resolution for this?
    g
    • 2
    • 1
  • a

    adamant-sunset-13770

    05/25/2023, 3:37 PM
    Hello, we’re facing an issue with our ingestion pipelines where manually added tags from the UI are getting overwritten.
    • Is this expected?
    • Is it possible to disable this behaviour and keep both the manually added tags and the ingested tags?
    Thanks, Stian
    g
    • 2
    • 4