# ingestion
  • b

    boundless-student-48844

    09/08/2022, 3:36 PM
    Hi team, the :metadata-ingestion:lint task failed due to lint errors when running the mypy command (mypy src/ tests/ examples/). There are 72 errors, listed in thread. A suggestion: do you think the lint check could be enforced on PRs to metadata-ingestion, for better QA? 😅
  • c

    clean-tomato-22549

    09/09/2022, 5:33 AM
    Hi team, I have a question about LookML ingestion. It seems it requires specifying base_folder, which is the location of the local LookML git repo. Why is this required, given that we can specify github_info.repo?
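    For context, a hedged sketch of the two settings being discussed, written as a Python recipe dict (paths and repo name are placeholders; other required LookML connection settings are omitted, so treat this as illustrative only, not a working recipe):
    Copy code
    # Hedged sketch of a LookML recipe fragment; base_folder and github_info.repo
    # are the fields discussed above, all values are placeholders.
    lookml_recipe = {
        "source": {
            "type": "lookml",
            "config": {
                "base_folder": "/path/to/local/lookml/checkout",   # local clone of the LookML repo
                "github_info": {"repo": "my-org/my-lookml-repo"},  # used to build source links
                # NOTE: real recipes also need Looker API or connection_to_platform_map
                # settings, omitted here for brevity.
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # placeholder GMS URL
        },
    }
    # from datahub.ingestion.run.pipeline import Pipeline
    # Pipeline.create(lookml_recipe).run()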
  • j

    jolly-library-86177

    09/09/2022, 8:56 AM
    Looking through the connectors for DataHub, and wondering if anyone is using DataHub as part of a Logical Data Warehouse architecture? I.e. not connecting directly to data sources, but instead ingesting through an access layer such as Denodo, TIBCO, etc.?
  • s

    silly-finland-62382

    09/09/2022, 9:14 AM
    Hey team, we are using the DataHub Spark lineage listener on Databricks to populate Spark lineage. Lineage is created successfully, but we are facing the following error while running this command:
    Copy code
    df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/nishchay.agarwal@meesho.com/services_classification.csv")
    df.write.mode("overwrite").saveAsTable("new_p")
    
    While I am running this command via a Databricks cluster, the pipeline is created successfully with the name given in the cluster Spark conf (spark.datahub.databricks.cluster = shell_dbx), but while I am running a Delta table command, I am getting this error:
    22/09/09 09:06:56 ERROR DatasetExtractor: class org.apache.spark.sql.catalyst.plans.logical.Project is not supported yet. Please contact datahub team for further support.
    
    Also, I am not able to see the schema of the dataset that I built using spark-lineage, and both the upstream and downstream tables are showing as the same (see screenshot; that's not expected).
    Also, can you help me with how to enable Delta catalog support from Databricks? It is not working on Databricks.
  • f

    fresh-cricket-75926

    09/09/2022, 10:26 AM
    Hi all, is there any way we can ingest Oracle schema and table metadata without the SELECT privilege on the tables?
  • r

    rich-battery-25772

    09/09/2022, 11:05 AM
    Hi all! I found that the ingestion process from Delta Lake can use a lot of memory (in my case more than 8 GB), and reducing that memory usage is critical for me. The DataHub ingestion library uses the deltalake library (in Python), and the deltalake library creates a vector with all the parquet file names for all of the delta table's states. That vector can be big. Very big! Dramatically huge! DataHub only needs the vector to calculate the number of files. The deltalake Python library in turn uses a deltalake library written in Rust, and the Rust library has a special flag (require_files) which controls whether the files vector is created or not. Avoiding the vector should save that memory.
    Copy code
    pub struct DeltaTableLoadOptions {
    	..............
        /// Indicates whether DeltaTable should track files.
        /// This defaults to `true`
        ///
        /// Some append-only applications might have no need of tracking any files.
        /// Hence, DeltaTable will be loaded with significant memory reduction.
        pub require_files: bool,
    }
    The main problem is that the flag cannot be controlled from the Python deltalake library (it would need to be changed to expose the flag). There is also the question of how we could calculate the number of files in an alternative way.
    • DataHub's code (use of the DeltaTable class): https://github.com/datahub-project/datahub/blob/083ab9bc0e7b9d8ba293afcf9fae4ffb71c4f86c/metadata-ingestion/src/datahub/ingestion/source/delta_lake/delta_lake_utils.py#L24
    • Deltalake's Python library:
      - DeltaTable class: https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/python/deltalake/table.py#L72
      - RawDeltaTable class: https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/python/src/lib.rs#L78
    • Deltalake's Rust library:
      - DeltaTableBuilder class (require_files is in the options: DeltaTableLoadOptions field): https://github.com/delta-io/delta-rs/blob/45a0404287287ead94005740dad90b67922e0ec9/rust/src/builder.rs#L116
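    For reference, a minimal sketch of the memory-heavy pattern being described, assuming the deltalake Python package: the full file list for the current table state is materialized just to take its length.
    Copy code
    # Sketch only: counting files via deltalake's DeltaTable, which first builds
    # the complete list of parquet file paths in memory.
    from deltalake import DeltaTable
    
    dt = DeltaTable("/path/to/delta-table")  # placeholder path
    num_files = len(dt.files())  # dt.files() returns every file path for the current state
    print(f"delta table tracks {num_files} files")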
  • w

    witty-butcher-82399

    09/09/2022, 2:06 PM
    Is there any mechanism preventing the ingestion events from being sent if there is a failure in the pipeline during ingestion? I'm asking because I noticed the process_commit function in pipeline.py. It checks whether there are errors, and depending on that and the commit policy, it will either commit the checkpoint or not. https://github.com/datahub-project/datahub/blob/23b929ea10daded7447f806f8860447626[…]e573a6/metadata-ingestion/src/datahub/ingestion/run/pipeline.py However, I don't see the same behaviour for the ingestion events themselves, which means the ingestion pipeline could be publishing some events via the sink while not committing the checkpoint. In my opinion, the publishing policy in the sink should be aligned with the commit policy. WDYT?
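    A rough sketch of the commit-policy behaviour described above (the names and CommitPolicy values here are assumptions for illustration, not the actual pipeline.py code):
    Copy code
    # Illustrative only: "commit the checkpoint unless errors occurred and the
    # policy requires a clean run". Not the real DataHub implementation.
    from enum import Enum, auto
    
    
    class CommitPolicy(Enum):
        ALWAYS = auto()
        ON_NO_ERRORS = auto()
        ON_NO_WARNINGS = auto()
    
    
    def should_commit(policy: CommitPolicy, has_errors: bool, has_warnings: bool) -> bool:
        if policy is CommitPolicy.ALWAYS:
            return True
        if policy is CommitPolicy.ON_NO_ERRORS:
            return not has_errors
        # ON_NO_WARNINGS implies no errors either
        return not has_errors and not has_warnings
    
    
    # Example: with ON_NO_ERRORS, a failed run skips the checkpoint commit,
    # even though the sink may already have published some events.
    assert should_commit(CommitPolicy.ON_NO_ERRORS, has_errors=True, has_warnings=False) is False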
  • b

    busy-glass-61431

    09/12/2022, 5:11 AM
    Hi, I have set up DataHub with AWS OpenSearch and managed Postgres. There seems to be some issue with my OpenSearch domain and I need to recreate it. Is there a way to restore the data from Postgres into ES?
  • c

    creamy-controller-55842

    09/12/2022, 8:22 AM
    Hi, I was integrating Hive with DataHub and ingesting metadata from the UI, but I can see that the partition column info is not present. I checked the code, and it is written such that if a row contains partition information, the loop breaks in hive.py. May I know the reason behind this?
  • m

    many-hairdresser-79517

    09/12/2022, 10:03 AM
    Hi, I'm ingesting Redash with DataHub and enabled parse_table_names_from_sql: true as described in the doc https://datahubproject.io/docs/generated/ingestion/sources/redash It works fine for getting the table names into the inputs, but the source platform is still unknown (it is supposed to be a Databricks Hive table; details in the image). Do we have any options that would let us get the data source name as well? Thank you so much.
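    For reference, a minimal sketch of the option being discussed, expressed as a recipe run through the Python Pipeline API (connect_uri and api_key are placeholders; only parse_table_names_from_sql is the setting in question):
    Copy code
    # Hedged sketch of a Redash recipe; credentials and URLs are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline
    
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "redash",
                "config": {
                    "connect_uri": "http://redash.example.com",  # placeholder
                    "api_key": "REDASH_API_KEY",                 # placeholder
                    "parse_table_names_from_sql": True,          # the option discussed above
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()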
  • f

    famous-florist-7218

    09/12/2022, 10:59 AM
    Hi guys, I got this ERROR when integrating Spark with DataHub. It seems the start event didn't fire, and McpEmitter reports "REST Emitter Configuration is missing". Any thoughts?
    Copy code
    22/09/12 17:54:35 ERROR DatahubSparkListener: Application end event received, but start event missing for appId local-1662980072825
    Spark version: v3.1.1
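    For context, a hedged sketch of the Spark listener configuration that a "REST Emitter Configuration is missing" message usually points at. The property names below follow the DataHub spark-lineage docs as I recall them; verify them against the jar version you use, and treat all values as placeholders.
    Copy code
    # Sketch only: configure the DataHub Spark listener and its REST emitter.
    from pyspark.sql import SparkSession
    
    spark = (
        SparkSession.builder
        .appName("datahub-lineage-demo")
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.44")  # placeholder version
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:8080")            # placeholder GMS URL
        # .config("spark.datahub.rest.token", "<token>")  # only if GMS auth is enabled
        .getOrCreate()
    )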
  • i

    important-answer-79732

    09/12/2022, 11:04 AM
    Hi team, I'm getting the below error while creating the BigQuery integration in the Kubernetes deployment, whereas a similar integration succeeds with the same configuration on my localhost (with quickstart).
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '7f529d57-21f5-4d39-a8e8-2b92580692ab',
     'infos': ['2022-09-12 10:22:14.801662 [exec_id=7f529d57-21f5-4d39-a8e8-2b92580692ab] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-12 10:22:14.855554 [exec_id=7f529d57-21f5-4d39-a8e8-2b92580692ab] INFO: Caught exception EXECUTING '
               'task_id=7f529d57-21f5-4d39-a8e8-2b92580692ab, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 71, in execute\n'
               '    validated_args = SubProcessIngestionTaskArgs.parse_obj(args)\n'
               '  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj\n'
               '  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__\n'
               'pydantic.error_wrappers.ValidationError: 1 validation error for SubProcessIngestionTaskArgs\n'
               'debug_mode\n'
               '  extra fields not permitted (type=value_error.extra)\n']}
    Execution finished with errors.
  • c

    chilly-scientist-91160

    09/12/2022, 11:34 AM
    We're using MongoDB a lot, so I tried to hook up the MongoDB ingestion plugin: https://datahubproject.io/docs/generated/ingestion/sources/mongodb/ While it is able to see the data, it does not seem to extract the validation JSON schema as metadata. Is this correct?
  • b

    busy-glass-61431

    09/12/2022, 11:40 AM
    Hi, I am running into an issue and not sure how to debug it. I've tried ingesting data using both Kafka and the REST emitter. I can see that the entries are created in Postgres, but they are not getting ingested into Elasticsearch. Any leads on how I can debug this? I've checked the GMS logs but don't see any errors.
  • s

    silly-finland-62382

    09/12/2022, 5:18 PM
    Hey, can someone tell me why spark.sql("") in DataHub is treated as an HDFS input dataset instead of a Hive one, given that in spark.sql() I passed a select * from the Hive table name?
  • b

    bland-sundown-49496

    09/12/2022, 10:49 PM
    Hello, I am new to DataHub. I am getting an error when ingesting metadata from an S3 source. Could you please help me with these questions: 1. Can I use a sink of type "file" for an S3 source? I got an error saying that I can't use a file-type sink. 2. When I use GMS as the sink, it is failing. Please help me. Thanks
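    For context, a hedged sketch of an S3 source recipe with a datahub-rest (GMS) sink, written as a Python dict. Field names vary by version (path_spec vs. path_specs), so check the S3 source docs; the bucket path, region and server URL are placeholders.
    Copy code
    # Hedged sketch of an S3 recipe fragment; all values are placeholders.
    s3_recipe = {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": "s3://my-bucket/data/*.csv"}],  # placeholder layout
                "aws_config": {"aws_region": "us-east-1"},                 # placeholder region
            },
        },
        # Sink pointed at GMS; the "file"-sink question above is left to the thread.
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
    # from datahub.ingestion.run.pipeline import Pipeline
    # Pipeline.create(s3_recipe).run()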
  • s

    stocky-truck-96371

    09/13/2022, 7:45 AM
    Hi team, we are ingesting metadata from the Hive platform using the SQLAlchemy-based plugin, but it's not picking up the column descriptions of the tables. We are on version v0.8.43. Can anyone help with this?
  • g

    great-branch-515

    09/13/2022, 9:15 AM
    @here does the CLI support ingestion from SSL-only MySQL databases? I am getting this error:
    Copy code
    (pymysql.err.OperationalError) (3159, 'Connections using insecure transport are prohibited while --require_secure_transport=ON.')
    (Background on this error at: http://sqlalche.me/e/13/e3q8) due to
    '(3159, 'Connections using insecure transport are prohibited while --require_secure_transport=ON.')'.
    Any idea?
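    A hedged sketch of one way to pass SSL settings through to the driver, assuming the MySQL source forwards options.connect_args to SQLAlchemy/PyMySQL as documented for SQLAlchemy-based sources; the host, credentials and CA path are placeholders, and field names should be verified for your DataHub version.
    Copy code
    # Sketch only: pass PyMySQL SSL arguments via the source's SQLAlchemy options.
    from datahub.ingestion.run.pipeline import Pipeline
    
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "mysql.example.com:3306",  # placeholder
                    "username": "datahub",                  # placeholder
                    "password": "********",
                    "options": {
                        "connect_args": {
                            "ssl": {"ca": "/etc/ssl/certs/mysql-ca.pem"}  # placeholder CA path
                        }
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()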
  • b

    better-orange-49102

    09/13/2022, 2:21 PM
    Is there a way to use the Python SDK to retrieve past versions of an aspect? I mean, I could go to the RDBMS and retrieve the stored string, but that's not very ideal. IIRC I can use curl commands.
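    As a point of reference, a sketch of fetching a specific aspect version over the GMS REST aspects endpoint with plain requests. The URN, aspect name and version are placeholders, and the exact endpoint shape should be checked against the GMS API docs for your version.
    Copy code
    # Sketch only: read a particular version of an aspect straight from GMS.
    import urllib.parse
    
    import requests
    
    gms = "http://localhost:8080"  # placeholder GMS URL
    urn = "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)"  # placeholder URN
    
    resp = requests.get(
        f"{gms}/aspects/{urllib.parse.quote(urn, safe='')}",
        params={"aspect": "datasetProperties", "version": 1},  # version 0 refers to the latest
        # headers={"Authorization": "Bearer <token>"},  # only if metadata auth is enabled
    )
    resp.raise_for_status()
    print(resp.json())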
  • b

    brave-pencil-21289

    09/13/2022, 2:23 PM
    Can we use TNS details to ingest an Oracle source? Any sample recipe showing how to use the TNS details in the recipe?
  • g

    gentle-camera-33498

    09/13/2022, 2:33 PM
    Hello guys! I soft-deleted all entities and forced a new ingestion to make a full update, but the Status aspect did not update after the ingestion. Is this expected?
  • c

    cool-actor-73767

    09/13/2022, 7:19 PM
    Hi everyone, I'm having issues with ingestion from Metabase. I'm receiving the errors printed below. Did the ingestion process stop after this error, so that the other chart/dashboard metadata wasn't loaded? Does anyone know what I need to do to solve this?
  • r

    rhythmic-sundown-12093

    09/13/2022, 6:13 AM
    Copy code
    source:
      type: "dbt"
      config:
        # Coordinates
        # To use this as-is, set the environment variable DBT_PROJECT_ROOT to the root folder of your dbt project
        manifest_path: "${DBT_PROJECT_ROOT}/target/manifest.json"
        catalog_path: "${DBT_PROJECT_ROOT}/target/catalog.json"
        test_results_path: "${DBT_PROJECT_ROOT}/target/run_results.json" # optional for recording dbt test results after running dbt test
    
        # Options
        target_platform: "redshift" # e.g. bigquery/postgres/etc.
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
  • m

    many-hairdresser-79517

    09/13/2022, 4:15 AM
    Hello, regarding Redash metadata for dashboards that have the table chart type: is there any way to also ingest the list of columns in the table chart into DataHub?
  • p

    polite-art-12182

    09/14/2022, 5:38 AM
    Hi, is there a way to use NiFi as a source with a self-signed cert? I have a NiFi instance I want to pull from. Right now in dev it's in the default configuration with a self-signed cert and single-user sign-on. When I try to connect DataHub to it, the connection fails with:
    Copy code
    "retries exceeded with url: /nifi-api/access/token (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] "
    Any help resolving this without having to re-configure NiFi certs would be appreciated.
  • b

    blue-boots-43993

    09/14/2022, 5:39 AM
    Hey everyone, could you please assist in understanding dashboards <> containers? As I see here (and in several other places as well), the dashboard entity has a container aspect; however, when looking here I don't see Container as a supported aspect. I am writing a custom ingestion source for Qlik Sense where I map so-called Streams and Apps to containers. Streams are basically collections of Apps, and Apps contain so-called Sheets (mapped as Dashboards), datasets, charts and Load Scripts (mapped as DataFlows). I would like to be able to see all of the entities that are part of one App in the respective container's entity list, which I currently cannot. Thanks in advance for any help provided!
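    For what it's worth, a hedged sketch of emitting a container aspect for a dashboard from a custom source using the Python emitter API. The platform name, dashboard id, container GUID and GMS URL are placeholders; whether the UI then lists the dashboard under the container is exactly the open question above.
    Copy code
    # Sketch only: attach a "container" aspect to a dashboard entity.
    from datahub.emitter.mce_builder import make_container_urn, make_dashboard_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, ContainerClass
    
    dashboard_urn = make_dashboard_urn(platform="qlik-sense", name="sheet-1234")  # placeholder
    container_urn = make_container_urn(guid="qlik-app-5678")                      # placeholder
    
    mcp = MetadataChangeProposalWrapper(
        entityType="dashboard",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=dashboard_urn,
        aspectName="container",
        aspect=ContainerClass(container=container_urn),
    )
    
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder GMS URL
    emitter.emit(mcp)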
  • b

    bumpy-journalist-41369

    09/14/2022, 7:14 AM
    How do I increase the log level of the Ingestion Run Details (the ones you can see in the UI) to DEBUG when running ingestions from the UI? I have deployed DataHub on a Kubernetes cluster using the Helm charts provided in this repository: https://github.com/acryldata/datahub-helm.
  • b

    bland-orange-13353

    09/14/2022, 7:30 AM
    This message was deleted.
  • m

    microscopic-mechanic-13766

    09/14/2022, 8:23 AM
    Good morning. I was trying to ingest metadata from Kafka using the following recipe:
    Copy code
    source:
        type: kafka
        config:
            platform_instance: <platform_instance>
            connection:
                consumer_config:
                    security.protocol: SASL_PLAINTEXT
                    sasl.username: <user>
                    sasl.mechanism: PLAIN
                    sasl.password: <password>
                bootstrap: 'broker1:9092'
                schema_registry_url: 'http://schema-registry:8081'
    I got the following error:
    Copy code
    File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 98, in _read_output_lines\n'
               '    line_bytes = await ingest_process.stdout.readline()\n'
               '  File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline\n'
               '    raise ValueError(e.args[0])\n'
               'ValueError: Separator is not found, and chunk exceed the limit\n']}
    Note that this recipe worked in previous versions (the current version is v0.8.44). Thanks in advance!
  • t

    thankful-vr-12699

    09/14/2022, 8:48 AM
    Hi everyone, since the Browse Paths upgrade of August, we have to remove the table name in our transformer to change the path from platform/db/schema/table to platform/db/schema. In the documentation for the browse paths transformer, the only option we have is to use DATASET_PARTS, which includes the table name. Is there another variable we can use to remove the table name from DATASET_PARTS? Or a way to split DATASET_PARTS to keep only the db name and the schema? Thank you for your help!
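    For context, a sketch of the transformer configuration being discussed, using the set_dataset_browse_path transformer with its documented template variables (written as a Python recipe fragment; whether a finer-grained alternative to DATASET_PARTS exists is the open question).
    Copy code
    # Hedged sketch: the browse-path transformer as documented. DATASET_PARTS
    # expands to the full db.schema.table path, which is the limitation above.
    transformers = [
        {
            "type": "set_dataset_browse_path",
            "config": {
                "replace_existing": True,
                "path_templates": ["/ENV/PLATFORM/DATASET_PARTS/"],  # includes the table name
            },
        }
    ]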