# ingestion
  • l

    lively-dusk-19162

    02/23/2023, 5:50 PM
Hello team, I am getting the following error when building DataHub to reflect the changes after creating a new entity. I was running ./gradlew build. Could anyone please help me with this?
    b
    a
    • 3
    • 2
  • w

    white-horse-97256

    02/23/2023, 7:27 PM
Hi Team, I am trying to create data lineage from the Java SDK client using the DataJobInputOutput() class. I see that the function setInputDatajobs() is deprecated and there is no other function in that class for setting job values for the _inputDatajobsField variable. Where and how can I create dataset -> datajob -> dataset lineage using the Java SDK? I can't find any reference documentation either.
    h
    m
    • 3
    • 5
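For the question above, here is a minimal sketch of emitting a DataJobInputOutput aspect with the Python emitter, as an analogous approach to the Java SDK; the GMS address and every URN value below are placeholders.

```python
# Sketch only: dataset -> datajob -> dataset lineage via the Python emitter.
# The GMS address and all URNs below are placeholders.
from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataJobInputOutputClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# The datajob reads one dataset and writes another; these two lists are what
# draw the dataset -> datajob -> dataset edges in the lineage graph.
io_aspect = DataJobInputOutputClass(
    inputDatasets=[make_dataset_urn("snowflake", "db.schema.source_table", "PROD")],
    outputDatasets=[make_dataset_urn("snowflake", "db.schema.target_table", "PROD")],
)

emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=make_data_job_urn(
            orchestrator="airflow", flow_id="my_flow", job_id="my_job", cluster="PROD"
        ),
        aspect=io_aspect,
    )
)
```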
  • h

    handsome-flag-16272

    02/23/2023, 7:33 PM
1. Is datahub-mce-consumer responsible for consuming the events?
2. To improve throughput, we need to increase the number of Kafka partitions and scale out the datahub-mce-consumer instances. Are all messages sent to one topic, or can we configure different Kafka topics per domain? Does DataHub support scaling out consumers based on the number of pending messages in the Kafka topic?
3. If there are no standalone datahub-mce-consumer instances, which component does the job, datahub-gms or another service?
4. Where does datahub-mce-consumer send data to: datahub-gms, or directly to Elasticsearch for index building? It appears to be datahub-gms.
    h
    • 2
    • 2
  • r

    rich-daybreak-77194

    02/24/2023, 1:53 AM
Why don't column stats show for a table with too many rows? Another table's column stats are displayed even though it has 1M rows.
    ✅ 1
    h
    • 2
    • 1
  • m

    magnificent-lawyer-97772

    02/24/2023, 10:00 AM
Hi folks, we are writing a custom transformer. We noticed that in the documentation, developing transformers depends on datahub_actions. There's another Transformer in the datahub.ingestion package. In the past the documentation pointed towards the datahub.ingestion package, but nowadays it points to datahub_actions. Are transformers based on datahub_actions the new way forward, and will the datahub.ingestion package be deprecated, or will they continue to co-exist?
    ✅ 1
    h
    b
    • 3
    • 5
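As a reference point for the discussion above, here is a sketch of what an ingestion-side transformer built on the datahub.ingestion package can look like; the AddNoteConfig/AddNoteTransformer names and the "note" property are invented for illustration, and exact base-class signatures may vary between versions.

```python
# Illustrative sketch of a custom transformer on the datahub.ingestion package.
# AddNoteConfig / AddNoteTransformer / the "note" property are invented names.
from typing import List, Optional

from datahub.configuration.common import ConfigModel
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.transformer.base_transformer import (
    BaseTransformer,
    SingleAspectTransformer,
)
from datahub.metadata.schema_classes import DatasetPropertiesClass


class AddNoteConfig(ConfigModel):
    note: str


class AddNoteTransformer(BaseTransformer, SingleAspectTransformer):
    """Adds a custom property to every dataset's datasetProperties aspect."""

    def __init__(self, config: AddNoteConfig, ctx: PipelineContext):
        super().__init__()
        self.config = config
        self.ctx = ctx

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "AddNoteTransformer":
        return cls(AddNoteConfig.parse_obj(config_dict), ctx)

    def entity_types(self) -> List[str]:
        return ["dataset"]

    def aspect_name(self) -> str:
        return "datasetProperties"

    def transform_aspect(
        self, entity_urn: str, aspect_name: str, aspect: Optional[DatasetPropertiesClass]
    ) -> Optional[DatasetPropertiesClass]:
        # Mutate (or create) the aspect and hand it back to the pipeline.
        props = aspect or DatasetPropertiesClass()
        props.customProperties["note"] = self.config.note
        return props
```

In a recipe, a custom transformer like this is referenced by its fully qualified class path under the transformers section.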
  • b

    boundless-nail-65912

    02/24/2023, 2:50 PM
Hi Team, I am experiencing a weird issue. I took a fresh Linux machine (RHEL 8.x) and installed DataHub. I checked the datahub version and it is 0.10.0.2, but when I went inside the datahub-actions docker container and checked the datahub version there, it says 0.9.6.2. Also, for Vertica: inside the container it has the old dialect (sqlalchemy_vertica), while outside the docker container it has the latest dialect (vertica_sqlalchemy_dialect). Can someone help me with this?
    a
    • 2
    • 3
  • d

    dazzling-microphone-98929

    02/24/2023, 5:57 PM
Hello All, I am getting an error with Power BI ingestion. Attaching the logs here and looking for some guidance. Thanks in advance!!!
    exec-urn_li_dataHubExecutionRequest_305aece4-b17c-46e9-bab2-80a368fe4322.log
  • g

    gray-ghost-82678

    02/24/2023, 7:58 PM
Hi all, I have a question about creating tables and jobs. Is it possible to create jobs manually, or is that only available by ingesting from Airflow? Also, can tables with their column names be created manually, either from the UI or the CLI? Thanks!
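Besides the UI and CLI, one programmatic route for the "tables with column names" part of the question above is the Python emitter: a minimal sketch that creates a dataset with a single column by emitting a schemaMetadata aspect. The platform, dataset name, field, and GMS address are placeholders.

```python
# Sketch: create a dataset with one column by emitting a schemaMetadata aspect.
# Platform, dataset name, field, and GMS address are placeholders.
from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

schema = SchemaMetadataClass(
    schemaName="customer",
    platform=make_data_platform_urn("hive"),
    version=0,
    hash="",
    platformSchema=OtherSchemaClass(rawSchema=""),
    lastModified=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
    fields=[
        SchemaFieldClass(
            fieldPath="customer_id",
            type=SchemaFieldDataTypeClass(type=StringTypeClass()),
            nativeDataType="VARCHAR(100)",
            description="Primary key",
        )
    ],
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="hive", name="mydb.customer", env="PROD"),
        aspect=schema,
    )
)
```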
  • w

    white-horse-97256

    02/24/2023, 8:43 PM
Hi Team, I am trying to create a datajob with the Java emitter and I get a 422 error:
MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
        .entityType("datajob")
        .entityUrn(Utils.createDataJobUrn("Connectors", "source_connector", "cdc_digital_account_master_account_dbz_source_connector_test", "STG"))
        .upsert()
        .aspect(new DataJobInfo()
                .setName("cdc_digital_account_master_account_dbz_source_connector_test")
                .setFlowUrn(new DataFlowUrn("Connectors", "source_connector", "STG")))
        .build();
    • 1
    • 3
  • b

    breezy-controller-54597

    02/25/2023, 6:42 AM
Hi. 👋 I propose a method to create projects like dbt: define some datasets and lineages in YAML files and ingest them per project. This is useful for managing metadata as code when you cannot extract metadata directly from data sources for some reason. It would be something like a combination of the current csv-enricher and File-based Lineage, and it would be easy to manage if it worked even when split into multiple YAML files and directories, like dbt.
    m
    • 2
    • 2
  • r

    rich-daybreak-77194

    02/25/2023, 1:28 PM
Can we update dataset properties without ingesting data? I mean only updating the dataset properties, not refreshing the data in the dataset. Thank you.
    👍 1
    f
    a
    • 3
    • 4
  • b

    best-notebook-58252

    02/26/2023, 5:41 PM
Hi everybody, I'm running a LookML ingestion and noticed a weird behavior, but I don't know if this is expected because I'm not a Looker expert. I have some model lkml files where explores are not defined directly but are included:
    connection: "starburst"
    
    include: "/partners_dm/explores/*.explore.lkml"
Checking the source code, when scanning for reachable views, only explores defined in model files are considered, ignoring the ones included from separate files: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/looker/lookml_source.py#L1632 So every view is marked as unreachable. Is this expected behavior?
    m
    • 2
    • 2
  • i

    important-afternoon-19755

    02/27/2023, 2:30 AM
Hi team. I want to populate the Stats tab for a Glue source. When I checked this link, it looks like I need to use a Glue crawler, but I don't have access to one. Is there any way to put profiling values in the data catalog without using a Glue crawler when loading the table from Spark to S3?
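One possible workaround, sketched below under stated assumptions, is to compute the stats in the Spark job itself and push a datasetProfile aspect directly with the Python emitter; the dataset URN, counts, and GMS address are placeholders, and whether this fits depends on having network access to GMS from the job.

```python
# Sketch: push profiling numbers computed in the Spark job as a datasetProfile
# aspect, without a Glue crawler. URN, counts, and GMS address are placeholders.
import time

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetProfileClass

profile = DatasetProfileClass(
    timestampMillis=int(time.time() * 1000),
    rowCount=1_250_000,  # e.g. df.count() from the Spark job
    columnCount=42,      # e.g. len(df.columns)
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("glue", "mydb.mytable", "PROD"),
        aspect=profile,
    )
)
```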
  • c

    colossal-easter-99672

    02/27/2023, 8:52 AM
Hello, team. How do we solve the problem of upstream lineage being overwritten by different sources? Right now I query the current lineage and add to it incrementally, but in the long term I can have problems with stale lineage.
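A sketch of the read-merge-write pattern described above, using DataHubGraph to fetch the current upstreamLineage aspect, append a new edge, and write it back; the server address and URNs are placeholders.

```python
# Sketch of the read-merge-write pattern for upstream lineage
# (server address and URNs are placeholders).
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

downstream = make_dataset_urn("snowflake", "db.schema.target_table", "PROD")
new_upstream = make_dataset_urn("kafka", "events.topic", "PROD")

# Read whatever upstream lineage already exists for the downstream dataset.
existing = graph.get_aspect(entity_urn=downstream, aspect_type=UpstreamLineageClass)
upstreams = existing.upstreams if existing else []

# Add the new edge only if it is not already present, then write back.
if not any(u.dataset == new_upstream for u in upstreams):
    upstreams.append(
        UpstreamClass(dataset=new_upstream, type=DatasetLineageTypeClass.TRANSFORMED)
    )
graph.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=downstream, aspect=UpstreamLineageClass(upstreams=upstreams)
    )
)
```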
  • s

    salmon-vr-6357

    02/27/2023, 4:23 PM
Hello team, I'm trying to ingest BigQuery via the DataHub UI with keys set up in Secrets, but got this error while executing:
    google.auth.exceptions.DefaultCredentialsError: ('Failed to load service account credentials from /tmp/tmplmjtp2eo', ValueError('Could not deserialize key data. The data may be in an incorrect format, it may be encrypted with an unsupported algorithm, or it may be an unsupported key type (e.g. EC curves with explicit parameters).', [_OpenSSLErrorWithText(code=503841036, lib=60, reason=524556, reason_text=b'error:1E08010C:DECODER routines::unsupported')]))
Has anyone experienced this before, and what could be wrong? BTW, I'm on v0.9.6.1.
    ✅ 2
    a
    a
    • 3
    • 10
  • r

    ripe-tailor-61058

    02/27/2023, 6:10 PM
Is it possible via recipe-based ingestion to change the folder structure/containers created in DataHub? I am doing an S3-based ingestion, and what I see is that in DataHub it gets placed at <env>/s3/<bucket-name>/folder1/folder2/file.csv. What if I want it to be cataloged at <env>/s3/<bucket-name>/folder3/file.csv? Is there a transformer that does that? Thanks!
    a
    • 2
    • 3
  • c

    cold-dress-65039

    02/28/2023, 7:31 AM
Hi! I'm trying to ingest Mode, but I keep getting this ParserError: Unknown string format: None error. It doesn't look like this has been flagged before. Any idea on how to triage or fix this?
    g
    a
    • 3
    • 26
  • a

    agreeable-cricket-61480

    02/28/2023, 10:53 AM
Hi, I am calling a stored procedure in Snowflake from an Airflow DAG. I am able to see the lineage in the task, but I want to see the lineage with its upstreams from Snowflake. How can I do this?
    g
    • 2
    • 4
  • c

    crooked-rose-22807

    02/28/2023, 4:05 PM
Hi, the Business Glossary plugin is not behaving as expected. 1. Deleting glossary terms via entity_type does nothing. Instead I need to delete them by urn one by one, which is impractical, and even that method still has more issues:
    datahub delete --entity_type glossaryTerm # DOES NOT WORK
    datahub delete --urn "urn:li:glossaryTerm:green_transport_revenue" # assigned a custom ID, WORKS
    datahub delete --hard --urn "urn:li:glossaryTerm:<randomid>" # enable_auto_id=True WORKS
    datahub delete --hard --urn "urn:li:glossaryTerm:Metrics%203.My%20Revenue" # enable_auto_id=False, no custom ID assigned, the urn automatically takes glossary term `name`, DOES NOT WORK
2. Using the contains and/or inherits keys in the business glossary yaml works as expected. However, if a term is removed from these keys, the term is NOT actually removed from the UI. Interchangeably switching a term between the contains and inherits keys, surprisingly, works. All in all, this should be fixed ASAP, or if there is already an available solution, kindly let me know!
    a
    • 2
    • 6
  • b

    busy-train-56443

    02/28/2023, 7:18 PM
Hi everyone! My goal is to set Properties for a Dataset object. I didn't find a straightforward way to do it. Could someone please advise on how to add Properties to a Dataset entity through a Python script or a POST method?
    f
    g
    • 3
    • 5
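For the question above, a minimal sketch of setting Dataset properties from a Python script by emitting a datasetProperties aspect; the URN, property values, and GMS address are placeholders. Note that emitting the full aspect replaces any existing datasetProperties, so a read-modify-write against the current aspect may be needed to preserve existing values.

```python
# Sketch: attach custom properties to a dataset via the Python emitter
# (URN, property values, and GMS address are placeholders).
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

props = DatasetPropertiesClass(
    description="Customer master table",
    customProperties={"owner_team": "data-platform", "tier": "gold"},
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("snowflake", "db.schema.customers", "PROD"),
        aspect=props,
    )
)
```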
  • r

    refined-energy-76018

    02/28/2023, 11:38 PM
Has there been any success with implementing the Airflow cluster policy for DAGs to emit dataProcessInstances at the DAG level? I'm playing around with the code for the DataHub Airflow plugin locally and can't seem to set the on_success_callback or on_failure_callback in dag_policy in a way that has any effect upon DAG completion. Would that explain why run_dataflow and complete_dataflow are implemented in airflow_generator.py but not used?
    ✅ 1
    a
    g
    a
    • 4
    • 5
  • a

    acceptable-nest-20465

    03/01/2023, 12:10 AM
Has anyone tried ingesting metadata from Neo4j into DataHub? I am able to get JSON with all relationships, labels, nodes, etc. by using call apoc.meta.schema, but I couldn't figure out how to ingest it into DataHub. To use either aspects or dataset snapshots, it seems the platform should be one of the existing supported data platforms. Otherwise, do we have to build the whole logic for processing metadata from Neo4j ourselves?
    ✅ 1
    b
    • 2
    • 4
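If no built-in source fits, one option is to map the apoc.meta.schema output to DataHub aspects yourself and emit them; a dataset URN can reference "neo4j" as the platform even though there is no dedicated connector. A minimal sketch, where the label, properties, subtype name, and GMS address are placeholders:

```python
# Sketch: one Neo4j label from apoc.meta.schema mapped to a DataHub dataset.
# Label, properties, subtype, and GMS address are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass, SubTypesClass

label = "Person"
meta = {"count": "123", "relationships": "ACTED_IN, DIRECTED"}

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
dataset_urn = make_dataset_urn(platform="neo4j", name=label, env="PROD")

# Emit one MCP per aspect for the same dataset URN.
for aspect in [
    DatasetPropertiesClass(name=label, customProperties=meta),
    SubTypesClass(typeNames=["Neo4j Node"]),
]:
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=aspect))
```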
  • e

    elegant-salesmen-99143

    03/01/2023, 12:36 PM
Hi all. I just tried connecting Kafka to DataHub. I see the topics ingested alright, but I don't see the partitions in them. Is that the expected behavior? What info from Kafka is DataHub supposed to display?
    plus1 2
    l
    s
    +2
    • 5
    • 11
  • q

    quiet-jelly-11365

    03/01/2023, 1:37 PM
Hi all, has anyone managed to sync AWS-managed Kafka with the AWS Glue Schema Registry to DataHub?
  • l

    lively-dusk-19162

    03/01/2023, 3:31 PM
Hello everyone, does anyone have an idea: when we create a new entity, is it possible to write the GraphQL code changes in Python, given that the original code for the other entities is in Java?
    g
    g
    • 3
    • 13
  • h

    handsome-flag-16272

    03/01/2023, 8:12 PM
Hi Team, can anyone help answer the following questions about stateful ingestion (DataHub v0.10.0)? 1. Between ingestions of the same data, only one new table column was added. How many events should be sent from the actions node to the GMS node? I assume it should be only one, but I actually found that the full DB metadata was sent. Is there anything wrong in my config?
    source:
      type: snowflake
      config:
        platform_instance: DEV
    
        # Coordinates
        account_id: MY_ACCOUNT
        warehouse: MY_WH
    
        # Credentials
        username: MY_NAME
        password: MY_PASS
    
        # Options
        include_table_lineage: true
        include_view_lineage: true
        include_operational_stats: false
        include_usage_stats: false
    
        database_pattern:
          allow:
            - MY_DB
        schema_pattern:
          allow:
            - MY_SCHEMA
        stateful_ingestion:
          enabled: true
    
    datahub_api:
server: 'http://localhost:8080'
    
    sink:
      type: datahub-rest
      config:
server: 'http://localhost:8080'
    
    pipeline_name: 'urn:li:dataHubIngestionSource:dev_snowflake_db'
2. Before the 3rd run of stateful ingestion, I dropped the my_test table. This time I can see the summary in the CLI as below:
    Pipeline finished with at least 2 warnings; produced 168 events in 30.94 seconds.
In the 1st and 2nd runs, the message was "… produced 169 events …". The issues I've found:
• It also indicates this stateful ingestion is a full ingestion rather than a delta ingestion.
• When I log in to the UI, I can still see the my_test table. However, it is neither marked as soft deleted nor is the "Last synchronized" time updated correctly; the "Last synchronized" shown is the 2nd ingestion time.
    g
    • 2
    • 8
  • r

    ripe-tailor-61058

    03/01/2023, 9:13 PM
Hello, I am trying to ingest via a recipe and use a transformer to add dataset properties, per https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer/#simple-add-dataset-datasetproperties. I am running into an error with the following transformer:
transformers:
  - type: "simple_add_dataset_properties"
    config:
      semantics: PATCH
      properties:
        bucket: djla-dev-tenant-jna
        dataset: dataset2
    [2023-03-01 15:59:22,624] DEBUG  {datahub.telemetry.telemetry:239} - Sending Telemetry
    [2023-03-01 15:59:22,689] DEBUG  {datahub.ingestion.run.pipeline:181} - Source type:s3,<class 'datahub.ingestion.source.s3.source.S3Source'> configured
    [2023-03-01 15:59:22,689] ERROR  {datahub.ingestion.run.pipeline:127} - 1 validation error for SimpleAddDatasetPropertiesConfig
    semantics
     extra fields not permitted (type=value_error.extra)
    Traceback (most recent call last):
     File "/home/jabplana/repos/dpl-scripts/datahub/.venv/lib64/python3.6/site-packages/datahub/ingestion/run/pipeline.py", line 197, in __init__
      self._configure_transforms()
     File "/home/jabplana/repos/dpl-scripts/datahub/.venv/lib64/python3.6/site-packages/datahub/ingestion/run/pipeline.py", line 212, in _configure_transforms
      transformer_class.create(transformer_config, self.ctx)
     File "/home/jabplana/repos/dpl-scripts/datahub/.venv/lib64/python3.6/site-packages/datahub/ingestion/transformer/add_dataset_properties.py", line 97, in create
      config = SimpleAddDatasetPropertiesConfig.parse_obj(config_dict)
     File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
     File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
    pydantic.error_wrappers.ValidationError: 1 validation error for SimpleAddDatasetPropertiesConfig
    semantics
     extra fields not permitted (type=value_error.extra)
    [2023-03-01 15:59:22,691] INFO   {datahub.cli.ingest_cli:119} - Starting metadata ingestion
    [2023-03-01 15:59:22,692] INFO   {datahub.cli.ingest_cli:137} - Finished metadata ingestion
    
    Failed to configure transformers due to 1 validation error for SimpleAddDatasetPropertiesConfig
    semantics
     extra fields not permitted (type=value_error.extra)
    [2023-03-01 15:59:22,703] DEBUG  {datahub.telemetry.telemetry:239} - Sending Telemetry
It works fine without the semantics: PATCH line, but I can't get it to work when including it before or after the properties.
    ✅ 1
    a
    • 2
    • 4
  • w

    white-horse-97256

    03/01/2023, 11:20 PM
Hi Team, is there a way to ingest datasets in bulk with the Python SDK?
    a
    f
    • 3
    • 4
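There is no single bulk call I can point to with certainty for the question above, but a common pattern is to reuse one REST emitter and loop over MetadataChangeProposals; a minimal sketch, where the dataset list, aspect values, and GMS address are placeholders.

```python
# Sketch: emit many datasets in one pass with a single REST emitter
# (the dataset list and GMS address are placeholders).
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

datasets = [
    {"name": "db.schema.orders", "description": "Order fact table"},
    {"name": "db.schema.customers", "description": "Customer dimension"},
]

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
for ds in datasets:
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn("snowflake", ds["name"], "PROD"),
            aspect=DatasetPropertiesClass(description=ds["description"]),
        )
    )
```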
  • a

    agreeable-cricket-61480

    03/02/2023, 7:21 AM
Can someone help me with these questions: 1. Does DataHub provide encryption for columns before sharing with downstream consumers? 2. How can I find how many tables use a particular column? 3. How can I check which users recently accessed a table? 4. How can I check whether a user has the proper permission to access a table?
    a
    • 2
    • 4