wonderful-egg-79350
05/09/2022, 7:33 AM
fresh-napkin-5247
05/09/2022, 12:08 PM
source:
  type: athena
  config:
    # Coordinates
    aws_region: "region"
    s3_staging_dir: "s3_staging_dir"
    work_group: "work_group"
    profiling:
      enabled: true
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
However, the Stats tab on the Athena tables is still not filled in. Isn't that the purpose of the profiling flag? What am I missing?
Additionally, why is the Athena connector so much slower than the Glue connector? The Athena connector takes around 1h for fewer tables than the Glue connector.
Finally, after enabling profiling for Redshift, I started getting this error:
... cursor.execute(statement, parameters)
psycopg2.errors.InsufficientPrivilege: permission denied for schema 'schema'
However, I have granted all the permissions from the Redshift documentation page. What other permissions do I need to run the profiling?
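One thing worth checking on the Redshift error: profiling runs SELECT queries against the tables themselves, so beyond the catalog permissions listed in the docs the ingestion user also needs USAGE on the schema and SELECT on its tables. A sketch of the grants (host, schema and user names are placeholders; psycopg2 is already pulled in by the connector, as the traceback shows):
```python
# Hypothetical grants for the DataHub ingestion user -- adjust names, run as an admin.
import psycopg2

conn = psycopg2.connect(
    host="my-redshift-host", port=5439, dbname="mydb",
    user="admin_user", password="...",
)
conn.autocommit = True
with conn.cursor() as cur:
    # Metadata-only ingestion can get by with catalog access, but the profiler
    # SELECTs from the tables, so it needs schema USAGE plus table SELECT.
    cur.execute('GRANT USAGE ON SCHEMA "schema" TO datahub_user')
    cur.execute('GRANT SELECT ON ALL TABLES IN SCHEMA "schema" TO datahub_user')
conn.close()
```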
Thank you 🙂
adamant-furniture-37835
05/09/2022, 2:04 PM
agreeable-army-26750
05/09/2022, 2:40 PM
billowy-flag-4217
05/09/2022, 3:17 PM
orange-coat-2879
05/10/2022, 12:32 AM
source:
  type: mysql
  config:
    # Coordinates
    host_port: localhost:port
    database: database
    # Credentials
    username: username
    password: password
    profiling:
      enabled: true
sink:
  # sink configs
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
orange-coat-2879
05/10/2022, 12:53 AM
pip install acryl-datahub[airflow]: nbconvert requires jinja2>=3.0 meanwhile flask requires jinja2<3.0? Thanks!
cuddly-arm-8412
05/10/2022, 1:04 AM
brave-insurance-80044
05/10/2022, 2:41 AM
swift-breakfast-25077
05/10/2022, 10:02 AM
alert-football-80212
05/10/2022, 10:39 AM
agreeable-army-26750
05/10/2022, 11:42 AM
source:
  type: postgres
  config:
    # Coordinates
    host_port: localhost:5432
    database: postgres
    # Credentials
    username: admin
    password: admin
    # Options
    database_alias: DatabaseNameToBeIngested
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
gifted-bird-57147
05/10/2022, 5:13 PM
File "/python3.9/site-packages/datahub/ingestion/transformer/base_transformer.py", line 252, in transform
    for urn, state in self.entity_map.items():
AttributeError: 'AddCustomOwnership' object has no attribute 'entity_map'
Is the example still current, or has something changed recently?
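That AttributeError usually points at the base class never being initialized: entity_map lives on BaseTransformer, so if AddCustomOwnership defines its own __init__ without calling super().__init__(), the attribute is missing by the time transform() runs. A minimal sketch of the constructor only, assuming the class follows the docs example (keep your existing aspect/transform methods as they are):
```python
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.transformer.base_transformer import BaseTransformer


class AddCustomOwnership(BaseTransformer):
    # Constructor sketch only; the config type and the rest of the class are
    # whatever you already have -- the point is the super().__init__() call.
    def __init__(self, config, ctx: PipelineContext):
        super().__init__()  # initializes base state such as entity_map
        self.config = config
        self.ctx = ctx
```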
orange-coat-2879
05/10/2022, 10:30 PM
modern-artist-55754
05/11/2022, 3:52 AM
astonishing-dusk-99990
05/11/2022, 8:13 AM
square-solstice-69079
05/11/2022, 8:21 AM
straight-telephone-84434
05/11/2022, 10:21 AM
fresh-napkin-5247
05/11/2022, 12:28 PM
source:
  type: tableau
  config:
    # Coordinates
    connect_uri: https://region.online.tableau.com
    site: site
    workbooks_page_size: 1
    token_name: token
    token_value: token
    projects: ['project1', 'project2', …]
    # Options
    ingest_tags: True
    ingest_owner: True
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
There are no errors or warning logs:
Sink (datahub-rest) report:
{'records_written': 4778,
'warnings': [],
'failures': [],
'downstream_start_time': datetime.datetime(2022, 5, 11, 15, 0, 49, 396728),
'downstream_end_time': datetime.datetime(2022, 5, 11, 15, 2, 0, 111509),
'downstream_total_latency_in_seconds': 70.714781,
'gms_version': 'v0.8.34'}
What could be the problem?
DataHub version: acryl-datahub, version 0.8.34.1
elegant-article-21703
05/11/2022, 1:35 PM
My ingestion is now failing with a failed to parse error. The change I introduced since my previous ingestion is that I changed one customProperties entry (customProp2) from a string value to a list. Here is a sample:
[
  {
    "auditHeader": null,
    "proposedSnapshot": {
      "com.linkedin.pegasus2avro.metadata.snapshot.DashboardSnapshot": {
        "urn": "urn:li:dashboard:(powerbi,analytics_update)",
        "aspects": [
          {
            "com.linkedin.pegasus2avro.common.Ownership": {
              "owners": [
                {
                  "owner": "urn:li:corpGroup:some_owner",
                  "type": "DATAOWNER",
                  "source": null
                }
              ],
              "lastModified": {
                "time": 0,
                "actor": "urn:li:corpuser:dev",
                "impersonator": null
              }
            }
          },
          {
            "com.linkedin.pegasus2avro.dashboard.DashboardInfo": {
              "title": "Analytics_Update",
              "description": "Explanatory text about what this power BI is and what information the user can get",
              "dashboardUrl": "google.com",
              "customProperties": {
                "customProp1": "MainDomain",
                "customProp2": ["Charizard", "Pikachu"]
              },
              "lastModified": {
                "created": {
                  "time": 1650279002,
                  "actor": "urn:li:corpuser:devn",
                  "impersonator": null
                },
                "deleted": null
              },
              "access": null,
              "lastRefreshed": null
            }
          }
        ]
      }
    }
  }
]
Isn't a list value supported in customProperties? Is there any other workaround?
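As far as I know, customProperties is modeled as a string-to-string map, so a JSON array as a value will not validate; the usual workaround is to serialize the list into a single string and parse it back on the consumer side. A small sketch (the serialization format is just a choice):
```python
import json

# customProperties values must be plain strings, so encode the list as one
# string (JSON here; a delimiter-separated value works just as well).
custom_properties = {
    "customProp1": "MainDomain",
    "customProp2": json.dumps(["Charizard", "Pikachu"]),
}

# Whoever reads the property turns it back into a list.
values = json.loads(custom_properties["customProp2"])
print(values)  # ['Charizard', 'Pikachu']
```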
Thank you all in advance!
agreeable-army-26750
05/11/2022, 2:46 PM
(cd docker && COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose -p datahub -f docker-compose-without-neo4j.yml -f docker-compose-without-neo4j.override.yml -f docker-compose.dev.yml up -d --no-deps --force-recreate datahub-actions)
But when I try the datahub CLI, my changes are not picked up. I am sure I am missing a build step, so the Docker image won't change…
Can you tell me what I have to rebuild and run to test my changes? Thanks in advance!
rich-policeman-92383
05/11/2022, 5:08 PM
source:
  type: oracle
  config:
    host_port: mydb:1521
    env: "PROD"
    username: myuser
    password: mypass
    service_name: myservice # omit database if using this option
    schema_pattern:
      allow:
        - "schema.tablename"
    table_pattern:
      allow:
        - "schema.tablename"
    profiling:
      enabled: True
    profile_pattern:
      allow:
        - "schema.tablename"
sink:
  type: "datahub-rest"
  config:
    server: 'https://mydatahubinstance.com:8080'
microscopic-controller-88617
05/11/2022, 8:31 PM
I have provided url, source_ref and source_url for the Glossary, but I can't find that information in DataHub after ingesting it. Is it just not implemented yet, or am I missing a step? Thanks in advance!
nice-country-99675
05/11/2022, 8:37 PM
I am having an issue with the Superset ingestion, using DataHub 0.8.34. In the method `emit_dashboard_mces` there is this piece of code:
dashboard_response = self.session.get(
    f"{self.config.connect_uri}/api/v1/dashboard",
    params=f"q=(page:{current_dashboard_page},page_size:{PAGE_SIZE})",
)
payload = dashboard_response.json()
The request is failing with Missing Authorization Header even though the session object already has a token:
{'User-Agent': 'python-requests/2.26.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Authorization': 'Bearer .....', 'Content-Type': 'application/json'}
Has anybody faced a similar issue with Superset? By the way, all these requests worked fine with Postman...
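One way to narrow the Superset issue down is to replay the same call outside DataHub with plain requests and a freshly issued token; if that works, the problem is on the connector side, and if it fails too, check resp.history for redirects, since auth headers can get dropped when a redirect changes host or scheme. A sketch with placeholder host and credentials (the login payload follows Superset's /api/v1/security/login API):
```python
import requests

connect_uri = "https://superset.example.com"  # placeholder
session = requests.Session()

# Superset's login endpoint returns a JWT access token.
login = session.post(
    f"{connect_uri}/api/v1/security/login",
    json={"username": "admin", "password": "admin", "provider": "db", "refresh": True},
)
session.headers.update(
    {
        "Authorization": f"Bearer {login.json()['access_token']}",
        "Content-Type": "application/json",
    }
)

resp = session.get(
    f"{connect_uri}/api/v1/dashboard",
    params="q=(page:0,page_size:25)",
)
# Any entries in resp.history mean the request was redirected along the way.
print(resp.status_code, [r.url for r in resp.history])
```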
orange-coat-2879
05/12/2022, 1:00 AM
I tried to install acryl-datahub-actions but got the error below. Appreciate any help.
Building wheels for collected packages: confluent-kafka
Building wheel for confluent-kafka (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [50 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-310
creating build/lib.linux-x86_64-cpython-310/confluent_kafka
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/usr/include/python3.10 -c /tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202/src/confluent_kafka/src/Admin.c -o build/temp.linux-x86_64-cpython-310/tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202/src/confluent_kafka/src/Admin.o
In file included from /tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202/src/confluent_kafka/src/Admin.c:17:
/tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202/src/confluent_kafka/src/confluent_kafka.h:23:10: fatal error: librdkafka/rdkafka.h: No such file or directory
23 | #include <librdkafka/rdkafka.h>
| ^~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for confluent-kafka
Running setup.py clean for confluent-kafka
Failed to build confluent-kafka
Installing collected packages: confluent-kafka, fastavro, acryl-datahub-actions
Running setup.py install for confluent-kafka ... error
error: subprocess-exited-with-error
× Running setup.py install for confluent-kafka did not run successfully.
│ exit code: 1
╰─> [52 lines of output]
running install
/home/ubuntu/.local/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-310
creating build/lib.linux-x86_64-cpython-310/confluent_kafka
copying src/confluent_kafka/error.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka
copying src/confluent_kafka/serializing_producer.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka
copying src/confluent_kafka/deserializing_consumer.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka
copying src/confluent_kafka/__init__.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka
creating build/lib.linux-x86_64-cpython-310/confluent_kafka/schema_registry
copying src/confluent_kafka/schema_registry/json_schema.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/schema_registry
copying src/confluent_kafka/schema_registry/error.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/schema_registry
copying src/confluent_kafka/schema_registry/avro.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/schema_registry
copying src/confluent_kafka/schema_registry/__init__.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/schema_registry
copying src/confluent_kafka/schema_registry/protobuf.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/schema_registry
copying src/confluent_kafka/schema_registry/schema_registry_client.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/schema_registry
creating build/lib.linux-x86_64-cpython-310/confluent_kafka/kafkatest
copying src/confluent_kafka/kafkatest/verifiable_client.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/kafkatest
copying src/confluent_kafka/kafkatest/verifiable_consumer.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/kafkatest
copying src/confluent_kafka/kafkatest/__init__.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/kafkatest
copying src/confluent_kafka/kafkatest/verifiable_producer.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/kafkatest
creating build/lib.linux-x86_64-cpython-310/confluent_kafka/avro
copying src/confluent_kafka/avro/cached_schema_registry_client.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/avro
copying src/confluent_kafka/avro/error.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/avro
copying src/confluent_kafka/avro/__init__.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/avro
copying src/confluent_kafka/avro/load.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/avro
creating build/lib.linux-x86_64-cpython-310/confluent_kafka/serialization
copying src/confluent_kafka/serialization/__init__.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/serialization
creating build/lib.linux-x86_64-cpython-310/confluent_kafka/admin
copying src/confluent_kafka/admin/__init__.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/admin
creating build/lib.linux-x86_64-cpython-310/confluent_kafka/avro/serializer
copying src/confluent_kafka/avro/serializer/message_serializer.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/avro/serializer
copying src/confluent_kafka/avro/serializer/__init__.py -> build/lib.linux-x86_64-cpython-310/confluent_kafka/avro/serializer
running build_ext
building 'confluent_kafka.cimpl' extension
creating build/temp.linux-x86_64-cpython-310
creating build/temp.linux-x86_64-cpython-310/tmp
creating build/temp.linux-x86_64-cpython-310/tmp/pip-install-5ufi6nkd
creating build/temp.linux-x86_64-cpython-310/tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202
creating build/temp.linux-x86_64-cpython-310/tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202/src
creating build/temp.linux-x86_64-cpython-310/tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202/src/confluent_kafka
creating build/temp.linux-x86_64-cpython-310/tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202/src/confluent_kafka/src
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/usr/include/python3.10 -c /tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202/src/confluent_kafka/src/Admin.c -o build/temp.linux-x86_64-cpython-310/tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202/src/confluent_kafka/src/Admin.o
In file included from /tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202/src/confluent_kafka/src/Admin.c:17:
/tmp/pip-install-5ufi6nkd/confluent-kafka_98e4581a7dd144dcaaf7c59075c25202/src/confluent_kafka/src/confluent_kafka.h:23:10: fatal error: librdkafka/rdkafka.h: No such file or directory
23 | #include <librdkafka/rdkafka.h>
| ^~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> confluent-kafka
great-nest-9369
05/12/2022, 1:04 AM
most-plumber-32123
05/12/2022, 6:44 AM
[2022-05-12 12:11:31,452] INFO {datahub.cli.ingest_cli:96} - DataHub CLI version: 0.8.34.1
[2022-05-12 12:11:31,738] ERROR {datahub.entrypoints:165} - Unable to connect to http://localhost:9002/api/gms/config with status_code: 401. Maybe you need to set up authentication? Please check your configuration and make sure you are talking to the DataHub GMS (usually <datahub-gms-host>:8080) or Frontend GMS API (usually <frontend>:9002/api/gms).
[2022-05-12 12:11:31,738] INFO {datahub.entrypoints:176} - DataHub CLI version: 0.8.34.1 at C:\Users\*****\AppData\Local\Programs\Python\Python39\lib\site-packages\datahub\__init__.py
[2022-05-12 12:11:31,738] INFO {datahub.entrypoints:179} - Python version: 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)] at C:\Users\*****\AppData\Local\Programs\Python\Python39\python.exe on Windows-10-10.0.22000-SP0
[2022-05-12 12:11:31,738] INFO {datahub.entrypoints:182} - GMS config {}
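The 401 above is against the frontend proxy (localhost:9002/api/gms), which usually means Metadata Service Authentication is enabled; the CLI then needs either the GMS address directly or a personal access token. A quick connectivity check with the Python emitter, assuming the rest emitter's token parameter as described in the Python emitter docs (server and token values are placeholders):
```python
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Either point at GMS directly (usually :8080) or keep the frontend proxy URL,
# and pass a personal access token if Metadata Service Authentication is on.
emitter = DatahubRestEmitter(
    gms_server="http://localhost:8080",
    token="<personal-access-token>",
)
emitter.test_connection()  # fails loudly if the server/token combination is rejected
```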
polite-orange-57255
05/12/2022, 7:09 AM
many-morning-40345
05/12/2022, 8:34 AM
alert-football-80212
05/12/2022, 10:59 AM
ERROR: Please set env variable SPARK_VERSION
Does anyone know anything about this?
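That message looks like the check pydeequ performs at import time (it is used by the data lake profiling); if that is indeed the source, setting the variable before anything imports pydeequ should get past it. A minimal sketch for a programmatic run; with the CLI you would export SPARK_VERSION in the shell before `datahub ingest` instead (the 3.0 value is an assumption, match your Spark):
```python
import os

# pydeequ reads SPARK_VERSION from the environment; set it before the
# profiler (and therefore pydeequ) is imported.
os.environ["SPARK_VERSION"] = "3.0"
```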