# ingestion
  • f

    few-grass-66826

    08/10/2022, 2:43 PM
    Hello everyone, another one: I have this dataflow: S3 -> DB1.Table1 -> DB2.Table1. Lineage for DB1.Table1 shows that it gets data from S3 into DB1.Table1, but lineage for DB2.Table1 only shows that it gets data from DB1.Table1. Is there any solution so that the lineage for DB2.Table1 also shows that the root is the S3 bucket?
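    One way to make the S3 root visible on DB2.Table1 is to emit an explicit upstream edge for it. A minimal sketch with the Python REST emitter, assuming a GMS at http://localhost:8080 and purely illustrative platforms and dataset names:
    # Sketch only: add the S3 object as a direct upstream of DB2.Table1.
    from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS address

    lineage_mce = make_lineage_mce(
        upstream_urns=[make_dataset_urn("s3", "my-bucket/path/to/data", "PROD")],  # hypothetical S3 dataset
        downstream_urn=make_dataset_urn("postgres", "DB2.Table1", "PROD"),         # hypothetical downstream
    )
    emitter.emit_mce(lineage_mce)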
  • j

    jolly-balloon-85466

    08/10/2022, 3:51 PM
    Hello everyone. I'm getting the following errors when trying to ingest data from BigQuery:
    2022-08-10 15:00:55.500831 [exec_id=c22f29ad-b1a9-4ad6-a1e8-ca6831fc6e47] INFO: Failed to execute 'datahub ingest'
    2022-08-10 15:00:55.506885 [exec_id=c22f29ad-b1a9-4ad6-a1e8-ca6831fc6e47] INFO: Caught exception EXECUTING task_id=c22f29ad-b1a9-4ad6-a1e8-ca6831fc6e47, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task
        self.event_loop.run_until_complete(task_future)
      File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete
        return f.result()
      File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
        raise self._exception
      File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
        result = coro.send(None)
      File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute
        raise TaskError("Failed to execute 'datahub ingest'")
    acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
  • j

    jolly-balloon-85466

    08/10/2022, 3:53 PM
    'failures': {'lineage-gcp-logs': ["Error was 'datasetId'"]},
  • j

    jolly-balloon-85466

    08/10/2022, 3:53 PM
    This has been marked as the failure.
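    Since the only reported failure is lineage-gcp-logs, one way to narrow this down is to re-run the ingestion with audit-log lineage turned off and check whether everything else succeeds. A hedged sketch using the programmatic pipeline API; the project id and server are placeholders, and include_table_lineage should be verified against the bigquery source docs for your CLI version:
    from datahub.ingestion.run.pipeline import Pipeline

    # Sketch only: skip lineage extracted from GCP audit logs while the 'datasetId'
    # failure is investigated; every value is a placeholder.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "bigquery",
                "config": {
                    "project_id": "my-gcp-project",
                    "include_table_lineage": False,
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()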
  • r

    rapid-house-76230

    08/10/2022, 10:22 PM
    Hi team, I’m trying to ingest from Hive using a recipe that I’ve used before without a problem. Now I’m just getting a successful report with no schema ingested? 🧵
  • m

    microscopic-mechanic-13766

    08/11/2022, 7:49 AM
    Hi everyone, since I updated DataHub to v0.8.42 my recipes change from what they were initially. For example, initially a recipe would look like this:
    source:
        type: postgres
        config:
            host_port: 'postgresql:5432'
            database: <db>
            username: <usr>
            password: <psswd>
            include_tables: true
            include_views: true
            profiling:
                enabled: True
                max_workers: 20
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-gms:8080'
    But after the source is created, it looks like this:
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-gms:8080'
    source:
        type: postgres
        config:
            include_tables: true
            database: <db>
            profiling:
                max_workers: 20
                enabled: true
            host_port: 'postgresql:5432'
            include_views: true
            username: <usr>
            password: <psswd>
    pipeline_name: 'urn:li:dataHubIngestionSource:7c70f090-79cc-432a-a757-7c01b8c091b9'
    Is this intended? If so, may I know why the change of structure? Thanks in advance!!
  • a

    alert-fall-82501

    08/11/2022, 8:22 AM
    Hi Team, I am working on creating a DAG with Apache Airflow to run my task with datahub ingest, but I am having an issue with the source.
  • a

    alert-fall-82501

    08/11/2022, 8:22 AM
    Can anyone please advise on this?
  • a

    alert-fall-82501

    08/11/2022, 8:22 AM
    raise KeyError(f"Did not find a registered class for {key}")
    KeyError: 'Did not find a registered class for s3'
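    This KeyError usually means the s3 source plugin isn't installed in the environment that runs the task (pip install 'acryl-datahub[s3]'). A hedged sketch of driving the same ingestion from an Airflow DAG with a PythonOperator and the programmatic pipeline API; the bucket, path_spec layout, region, and GMS address are placeholders, and the path_spec key naming has changed across CLI versions, so check the s3 source docs for yours:
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    from datahub.ingestion.run.pipeline import Pipeline


    def run_s3_ingest() -> None:
        # Programmatic equivalent of a recipe file; assumes the s3 plugin extra is
        # installed in the same environment as the Airflow worker.
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "s3",
                    "config": {
                        "path_spec": {"include": "s3://my-bucket/foo/tests/{table}/*.avro"},
                        "aws_config": {"aws_region": "us-east-1"},
                    },
                },
                "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
            }
        )
        pipeline.run()
        pipeline.raise_from_status()


    with DAG(
        dag_id="datahub_s3_ingest",
        start_date=datetime(2022, 8, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="ingest_s3", python_callable=run_s3_ingest)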
  • f

    famous-florist-7218

    08/11/2022, 9:18 AM
    Hi team! Does anyone know why the s3 ingestion job ran successfully but the UI doesn't load the s3 dataset?
  • e

    echoing-farmer-38304

    08/11/2022, 1:10 PM
    Hello everyone, I have a question about delta lake ingestion. There is an option to ingest from AWS S3, for example:
    source:
      type: "delta-lake"
      config:
        base_path:  "<s3://my-bucket/my-folder/sales-table>"
        s3:
          aws_config:
    I tried to use it with MinIO credentials but it doesn't see my data (it runs successfully but doesn't load data). In my opinion this happens because of trouble with base_path. Is there any way to use delta lake ingestion with MinIO credentials? And if there is no ready solution and we want this feature, should we implement it as an additional module, or can we just make some changes to the current module so it can use both AWS and MinIO?
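    For the MinIO case, the s3.aws_config block also accepts an aws_endpoint_url (it shows up in a later delta-lake recipe in this channel), so pointing the S3 client at the MinIO endpoint may be all that's needed. An untested sketch where every value is a placeholder:
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "delta-lake",
                "config": {
                    "env": "PROD",
                    "base_path": "s3://my-bucket/my-folder/sales-table",
                    "s3": {
                        "aws_config": {
                            "aws_access_key_id": "minio-access-key",      # placeholder
                            "aws_secret_access_key": "minio-secret-key",  # placeholder
                            "aws_region": "us-east-1",                    # often still expected, even if MinIO ignores it
                            "aws_endpoint_url": "http://minio:9000",      # point the S3 client at MinIO
                        }
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()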
  • g

    gifted-knife-16120

    08/11/2022, 1:29 PM
    Let's say I have 3 databases (A, B, C) on the athena platform, and I would like to delete 2 of them. How can that be done?
  • d

    damp-queen-61493

    08/11/2022, 8:26 PM
    Hi team! I'm trying to ingest Kafka with Schema Registry, but I'm unable to get schema registry updates for a topic. The schema was ingested correctly the first time, but after that it looks like DataHub doesn't update the schema anymore. DataHub version v0.8.43 (dev env).
    ## Recipe
    source:
        type: kafka
        config:
            platform_instance: poc_cluster_0
            connection:
                bootstrap: 'xxxxxx.gcp.confluent.cloud:9092'
                consumer_config:
                    security.protocol: SASL_SSL
                    sasl.mechanism: PLAIN
                    sasl.username: '${CLUSTER_API_KEY_ID}'
                    sasl.password: '${CLUSTER_API_KEY_SECRET}'
                schema_registry_url: 'https://xxxxx.gcp.confluent.cloud'
                schema_registry_config:
                    basic.auth.user.info: '${REGISTRY_API_KEY_ID}:${REGISTRY_API_KEY_SECRET}'
  • c

    colossal-sandwich-50049

    08/11/2022, 8:50 PM
    Hello, I am running datahub ingest -c my-delta-recipe.yml locally and getting the error below; can someone assist? Note, I am running DataHub using datahub quickstart.
    ###### Recipe
    source:
      type: "delta-lake"
      config:
        env: "PROD"
        platform_instance: "my-delta-lake"
        platform: "delta-lake"
        base_path: "<s3://my-bucket/data/v3/>"
        s3:
          aws_config:
            aws_region: "eu-west-1"
            aws_endpoint_url: "http://<local-ip>:4566"
    sink:
      type: "datahub-rest"
      config:
        server: "http://<local-ip>:8080" # local IP
    ###### Logs
    [2022-08-11 16:46:47,941] INFO     {datahub.cli.ingest_cli:170} - DataHub CLI version: 0.8.43
    [2022-08-11 16:46:48,004] INFO     {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://10.12.242.238:8080
    [2022-08-11 16:46:49,063] ERROR    {logger:26} - Please set env variable SPARK_VERSION
    [2022-08-11 16:46:49,565] INFO     {datahub.cli.ingest_cli:119} - Starting metadata ingestion
    [2022-08-11 16:46:49,565] INFO     {datahub.cli.ingest_cli:123} - Source (delta-lake) report:
    {'workunits_produced': '0',
     'workunit_ids': [],
     'warnings': {},
     'failures': {},
     'cli_version': '0.8.43',
     'cli_entry_location': '/usr/local/lib/python3.9/site-packages/datahub/__init__.py',
     'py_version': '3.9.9 (main, Nov 21 2021, 03:23:42) \n[Clang 13.0.0 (clang-1300.0.29.3)]',
     'py_exec_path': '/usr/local/opt/python@3.9/bin/python3.9',
     'os_details': 'macOS-12.4-x86_64-i386-64bit',
     'filtered': []}
    [2022-08-11 16:46:49,565] INFO     {datahub.cli.ingest_cli:126} - Sink (datahub-rest) report:
    {'records_written': '0', 'warnings': [], 'failures': [], 'gms_version': 'v0.8.43'}
    [2022-08-11 16:46:50,061] ERROR    {datahub.entrypoints:188} - Command failed with argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'. Run with --debug to get full trace
    [2022-08-11 16:46:50,061] INFO     {datahub.entrypoints:191} - DataHub CLI version: 0.8.43 at /usr/local/lib/python3.9/site-packages/datahub/__init__.py
  • k

    kind-whale-32412

    08/11/2022, 8:54 PM
    Hey there, I am using this example to tag columns in a table. One issue I noticed is the graph.get_aspect_v2 part, where you always have to make a GET request to the server first to obtain all existing tags, then append if it's a new tag, and then emit it to DataHub. I find this design a little odd in that the client side has to know what all the tags are, while the server side is completely stateless. I attempted to bypass getting the aspect and tried to just construct a MetadataChangeProposalWrapper with GlobalTagsClass(tags=[tag_association_to_add]) no matter what the state is. I noticed that this removes all the other tags. I was expecting that this would append only the tag that I am attempting to add, not remove other tags. Is this intended by design? Is there a way to change this with a flag or any other way to submit? One big issue here is the race condition: if I am submitting these changes through Kafka events (or even in a synchronous parallel way) and there happen to be multiple MCPWs for the same column, other tags could be lost.
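    For reference, the read-modify-write pattern being discussed looks roughly like this at the dataset level (the column-level variant goes through editableSchemaMetadata instead). The server address, dataset URN, and tag are placeholders:
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        GlobalTagsClass,
        TagAssociationClass,
    )

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))  # placeholder GMS
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)"    # placeholder URN

    # Read the current aspect first: emitting GlobalTags replaces the whole aspect,
    # so the merge has to happen client-side.
    current = graph.get_aspect_v2(
        entity_urn=dataset_urn,
        aspect="globalTags",
        aspect_type=GlobalTagsClass,
    ) or GlobalTagsClass(tags=[])

    new_tag = TagAssociationClass(tag="urn:li:tag:NeedsDocumentation")
    if all(t.tag != new_tag.tag for t in current.tags):
        current.tags.append(new_tag)

    graph.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="globalTags",
            aspect=current,
        )
    )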
  • d

    dazzling-insurance-83303

    08/11/2022, 9:55 PM
    Domain assignment using simple_add_dataset_domain: Hello. I am trying to associate domains with datasets using the transformers *simple_add_dataset_domain* specification, but I am getting the following error:
    KeyError: 'Did not find a registered class for simple_add_dataset_domain'
    I am able to associate a domain with database tables via the domains section under the source configuration section, but I am trying to group all the ownerships and associations in the same place, i.e., transformers. Per the documentation, the code under the transformers section should look like this:
    transformers:
      - type: "simple_add_dataset_domain"
        config:
          semantics: OVERWRITE
          domains:
            - urn:li:domain:engineering
    My construct, which throws the error, is as follows:
    transformers:
      - type: "simple_add_dataset_ownership"
      ...
      - type: "simple_add_dataset_domain"
        config:
          domains:
            - "urn:li:domain:xxxxxxxx-xxxx-xxxx-xxxx-bffdfb8b977d" # Finance
    Thoughts? Thanks!
  • a

    alert-fall-82501

    08/12/2022, 6:19 AM
    Hi Team - I have partitioned data on S3 for various partners. I am able to get that data onto the appropriate server. Right now I gave a hardcoded base path in the config file. Can anybody tell me how I can pick up just the table and schema without a hardcoded path?
  • a

    alert-fall-82501

    08/12/2022, 6:19 AM
    s3://my-bucket/foo/tests/bar.avro  # single file table
    s3://my-bucket/foo/tests/*.*  # multiple file level tables
    s3://my-bucket/foo/tests/{table}/*.avro  # table without partition
    s3://my-bucket/foo/tests/{table}/*/*.avro  # table where partitions are not specified
    s3://my-bucket/foo/tests/{table}/*.*  # table where neither partitions nor data type are specified
    s3://my-bucket/{dept}/tests/{table}/*.avro  # specifying keywords to be used in display name
    s3://my-bucket/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro  # specify partition key and value format
    s3://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro  # specify partition value only format
    s3://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.*  # for all extensions
    s3://my-bucket/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.*  # table is present at 2 levels down in bucket
    s3://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.*  # table is present at 3 levels down in bucket
  • a

    alert-fall-82501

    08/12/2022, 6:23 AM
    I am following this documentation but need to get more clarification.
  • f

    full-chef-85630

    08/12/2022, 6:49 AM
    Hi all, I'm using DataHub's Airflow lineage plugin, and DataHub has token verification enabled. How do I set the token in Airflow?
  • b

    bright-cpu-56427

    08/12/2022, 7:15 AM
    Hi all, what value should I put in the database field of the YAML recipe when ingesting MySQL? If I put a database name that does not exist in MySQL, an error occurs; if I put the name of an existing database, it is displayed in DataHub as dbname.dbname.table, which is confusing. Should I configure it to ingest only one database per recipe?
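    If the goal is one recipe that covers several databases without hard-coding database, the MySQL source's database_pattern filter may help. A hedged sketch with placeholder credentials and database names (verify the option names against the MySQL source docs for your version):
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "mysql:3306",
                    "username": "datahub",  # placeholder
                    "password": "datahub",  # placeholder
                    "database_pattern": {"allow": ["^sales$", "^marketing$"]},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()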
  • a

    average-rocket-98592

    08/12/2022, 11:56 AM
    Hi, I’m trying to ingest metadata from PowerBI report server on premise. Did someone already try to do it? Thanks in advance for your help!!
  • f

    fancy-thailand-73281

    08/12/2022, 2:00 PM
    Hi All, we deployed DataHub v0.8.36 in an AWS EKS cluster with helm charts. We are using AWS MSK (Kafka) with SASL/SCRAM bootstrap servers and ZooKeeper TLS. Everything works fine up to this point, but we are not able to ingest data (Snowflake) from the UI: I see 'N/A' when I try to run ingestion. The DataHub docs (https://datahubproject.io/docs/ui-ingestion/) say that we need to enable datahub-actions, so we deployed the public.ecr.aws/datahub/acryl-datahub-actions pods. The pod is not running (CrashLoopBackOff) and we see the error logs below:
    [2022-08-11 19:56:59,004] ERROR    {datahub.entrypoints:138} - File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 77, in run
        67   def run(config: str, dry_run: bool, preview: bool, strict_warnings: bool) -> None:
    KafkaException: KafkaError{code=_INVALID_ARG,val=-186,str="Failed to create consumer: No provider for SASL mechanism GSSAPI: recompile librdkafka with libsasl2 or openssl support. Current build options: PLAIN SASL_SCRAM OAUTHBEARER"}
    2022/08/11 19:56:59 Command exited with error: exit status 1
    Could someone please help us? Thanks in advance.
  • r

    rapid-fall-7147

    08/12/2022, 4:29 PM
    Hi All, we are ingesting Redshift metadata, and even though we have enabled profiling, the field descriptions (which are part of the Redshift DDL statement) are not getting populated in DataHub. Any suggestions?
  • d

    delightful-zebra-4875

    08/15/2022, 11:46 AM
    Hi, I want to display Flink catalog information when extracting metadata using Hive. The Flink catalog metadata is shown in Properties, but after I modified the column logic in hive.py and sql_common.py and replaced the corresponding files in the datahub Python package inside the acryldata/datahub-actions Docker image and re-ran it, the front end doesn't show the data I want.
  • c

    chilly-sundown-93656

    08/15/2022, 1:18 PM
    Hello, I'm trying to build some automation around DataHub ingestion sources. In our company we have lots of different data stores and numerous clusters; for example, we have 40 different Kafka clusters, and I'm not sure if my approach is the correct one. Initially, I created a Python script that creates sources with the GraphQL API and triggers the executions:
    query = """
       mutation
        {
          createIngestionSource(input: {
            name: "$name",
            type: "kafka",
            description: "$name",
            config: {
              recipe: "$recipe",
              executorId: "default"
            }
          })
    
        }
    
    """
    The rationale is to let users from different teams add their own data stores and see what we are already ingesting. I'm not sure if this is a legit method, because the service experiences a lot of trouble pulling topics from several clusters in parallel: there are a lot of MySQL, Kafka, and Schema Registry exceptions, and sometimes I need to rerun the execution to make it pass. I'd appreciate any advice on this matter. Thanks.
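    One thing that tends to make scripts like this sturdier is passing the recipe through GraphQL variables instead of string substitution, so quotes and newlines in the recipe don't need manual escaping. A hedged sketch against the /api/graphql endpoint with a personal access token; the input type name is from memory, so confirm it against your server's GraphQL schema, and every value is a placeholder:
    import json

    import requests

    DATAHUB_GRAPHQL = "http://datahub-gms:8080/api/graphql"  # or <frontend>/api/v2/graphql
    TOKEN = "<personal-access-token>"

    MUTATION = """
    mutation createSource($input: UpdateIngestionSourceInput!) {
      createIngestionSource(input: $input)
    }
    """

    recipe = {
        "source": {
            "type": "kafka",
            "config": {"connection": {"bootstrap": "broker:9092"}},
        },
    }

    resp = requests.post(
        DATAHUB_GRAPHQL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "query": MUTATION,
            "variables": {
                "input": {
                    "name": "kafka-cluster-1",
                    "type": "kafka",
                    "description": "kafka-cluster-1",
                    # The recipe is stored as a JSON string, matching the "$recipe" field above.
                    "config": {"recipe": json.dumps(recipe), "executorId": "default"},
                }
            },
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())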
  • b

    bright-receptionist-94235

    08/15/2022, 4:49 PM
    Hi All, general question: who writes the data source plugins, DataHub or the data source team?
  • b

    boundless-mechanic-19488

    08/15/2022, 6:18 PM
    Hey guys, I am trying to ingest my Glue catalog with the following recipe, deployed via docker-compose, but it always returns 0 assets:
    source:
        type: glue
        config:
            aws_region: eu-central-1
            aws_access_key_id: YYY
            aws_secret_access_key: XXX
    I've attached the following policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "glue:GetDatabases",
            "glue:GetTables"
          ],
          "Resource": [
            "arn:aws:glue:eu-central-1:XXX:catalog",
            "arn:aws:glue:eu-central-1:XXX:database/*",
            "arn:aws:glue:eu-central-1:XXX:table/*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": [
            "glue:GetDataflowGraph",
            "glue:GetJobs"
          ],
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": [
            "arn:aws:s3:::*"
          ]
        }
      ]
    }
    Does anyone have a hint? The pipeline executes successfully in about 2.5 seconds every time, ingesting 0 assets and not reporting any issues apart from a client-server incompatibility warning:
    Your client version 0.8.42 is older than your server version 0.8.43. Upgrading the cli to 0.8.43 is recommended.
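    One quick way to rule out permissions or region issues is to list the catalog with boto3 using the exact same credentials and region as the recipe; if this prints nothing, the recipe also has nothing to ingest. A sketch reusing the placeholder credentials from the recipe above:
    import boto3

    # Sanity check outside DataHub: confirm these credentials/region can actually see
    # Glue databases and tables.
    glue = boto3.client(
        "glue",
        region_name="eu-central-1",
        aws_access_key_id="YYY",
        aws_secret_access_key="XXX",
    )

    for db in glue.get_databases()["DatabaseList"]:
        tables = glue.get_tables(DatabaseName=db["Name"])["TableList"]
        print(db["Name"], len(tables))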
  • q

    quick-megabyte-61846

    08/12/2022, 1:17 PM
    Hello, while trying to do some housekeeping inside my demo DataHub I found a weird bug:
    ❯ datahub get --urn "urn:li:dataPlatform:dbt"
    {
      "dataPlatformInfo": {
        "datasetNameDelimiter": ".",
        "displayName": "dbt",
        "logoUrl": "/assets/platforms/dbtlogo.png",
        "name": "dbt",
        "type": "OTHERS"
      },
      "dataPlatformKey": {
        "platformName": "dbt"
      }
    }
  • a

    alert-fall-82501

    08/16/2022, 6:21 AM
    Hi Team - what if I need to ingest multiple schemas and tables from an S3 delta lake into DataHub using a single config file? Please suggest; I will have data on different S3 paths and want to use a single config file.