# ingestion
  • full-cartoon-72793
    03/04/2022, 9:48 PM
    Hello. I am getting an error trying to ingest metadata via hive from Databricks. I am including my recipe and the captured output (after cleaning up some identifiers and keys). My DataHub deployment is in Azure Kubernetes Service and I am running the default deployment that is created via the documented helm commands in the getting started guide.
    Copy code
    helm repo add datahub https://helm.datahubproject.io/
    helm install prerequisites datahub/datahub-prerequisites
    helm install datahub datahub/datahub
    I also created the metadata ingestion source using the DataHub UI. Any ideas on how to fix this?
    captured_output.txt, cleaned_recipe.yml
  • salmon-rose-54694
    03/07/2022, 1:56 AM
    Hi team, is there a way to ingest metadata via the GMS endpoint directly, rather than through the DataHub CLI? I'd appreciate an example. Thank you.
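    For reference, a minimal sketch of writing straight to GMS with the Python REST emitter rather than going through the CLI; the GMS address, dataset URN, and the status aspect used here are placeholder assumptions.
    Copy code
    # Minimal sketch: push one aspect straight to the GMS REST endpoint.
    # The GMS address, dataset URN, and the status aspect are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    emitter.test_connection()  # raises if GMS is not reachable

    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn(platform="hive", name="example_db.example_table", env="PROD"),
            aspectName="status",
            aspect=StatusClass(removed=False),
        )
    )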
  • rough-van-26693
    03/07/2022, 2:27 AM
    Hi Team, Any example of how to add tags for metadata-ingested tables on GCP?
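    A minimal sketch of attaching a tag to an already-ingested table with the Python REST emitter; the BigQuery table name, tag name, and GMS address are placeholder assumptions.
    Copy code
    # Minimal sketch: attach a tag to an already-ingested table.
    # The BigQuery table name, tag name, and GMS address are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        GlobalTagsClass,
        TagAssociationClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn(platform="bigquery", name="project.dataset.table", env="PROD"),
            aspectName="globalTags",
            # note: upserting this aspect replaces any tags already on the table
            aspect=GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("pii"))]),
        )
    )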
  • brave-secretary-27487
    03/07/2022, 9:43 AM
    Hey, I'm looking for a way to change the description of a schema with the Python emitter. Is there an object that I can use for this, like DatasetPropertiesClass? Or can I use this object to also change the schema of a dataset?
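    DatasetPropertiesClass carries the dataset-level description; the schema itself is modeled by the schemaMetadata / editableSchemaMetadata aspects rather than this object. A minimal sketch, with a placeholder URN and GMS address:
    Copy code
    # Minimal sketch: update a dataset's description via DatasetPropertiesClass.
    # The dataset URN and GMS address are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn(platform="hive", name="example_db.example_table", env="PROD"),
            aspectName="datasetProperties",
            aspect=DatasetPropertiesClass(description="New table-level description"),
        )
    )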
  • red-accountant-48681
    03/07/2022, 11:46 AM
    Hi all, hopefully this is the correct channel for my query. I am trying to find class definitions for the schema_classes, with no luck. I would like to know how these classes function, and where I can find the source code for them.
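    As far as I can tell, schema_classes is code-generated from the metadata models at build time, so the .py file ships inside the installed acryl-datahub package rather than living as hand-written source in the repo. A quick way to locate and inspect the generated module:
    Copy code
    # Locate the generated schema_classes module inside the installed package
    # and peek at the source of one of the generated classes.
    import inspect

    import datahub.metadata.schema_classes as schema_classes

    print(inspect.getfile(schema_classes))
    print(inspect.getsource(schema_classes.DatasetPropertiesClass))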
  • many-guitar-67205
    03/07/2022, 12:59 PM
    Hello, Is there any easy way of ingesting a protobuf schema? There is code in the datahub repo to read and ingest avro schemas (https://github.com/linkedin/datahub/blob/c8a3e6820204bf8c59ce4afae3d2b5d9dfdc0b71/[…]tadata-ingestion/src/datahub/ingestion/extractor/schema_util.py) so I'm wondering if there is something similar for protobuf out there ...
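    A rough workaround sketch, assuming no protobuf extractor exists yet: hand-build the schemaMetadata aspect (e.g. from a parsed protobuf descriptor) and emit it yourself. All names, field types, and the GMS address below are placeholders.
    Copy code
    # Rough sketch: hand-build a schemaMetadata aspect for a dataset whose schema
    # comes from protobuf. Field names/types are placeholders; in practice they
    # would be derived from the parsed protobuf descriptor.
    from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        ChangeTypeClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    schema_aspect = SchemaMetadataClass(
        schemaName="my_proto.UserEvent",
        platform=make_data_platform_urn("kafka"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema="<contents of the .proto file>"),
        lastModified=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
        fields=[
            SchemaFieldClass(
                fieldPath="user_id",
                nativeDataType="string",  # the protobuf type name
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                description="Example field",
            )
        ],
    )

    DatahubRestEmitter(gms_server="http://localhost:8080").emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn(platform="kafka", name="my_topic", env="PROD"),
            aspectName="schemaMetadata",
            aspect=schema_aspect,
        )
    )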
  • most-nightfall-36645
    03/07/2022, 2:20 PM
    Hi, I am having some issues ingesting Postgres data. When I try, I receive a bunch of exceptions about an unknown aspect, container:
    Copy code
    {'error': 'Unable to emit metadata to DataHub GMS',
     'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
              'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: java.lang.RuntimeException: Unknown '
                            'aspect container for entity dataset\n'
                            <OMITTED>
              'message': 'java.lang.RuntimeException: Unknown aspect container for entity dataset', 'status': 500}}
    This seems similar to an issue described in this thread; however, no solution was found there. Would anyone be able to help me understand this issue? I am using DataHub v0.8.24 and CLI 0.8.27.
  • full-cartoon-72793
    03/07/2022, 7:48 PM
    I have deployed DataHub with Kubernetes into Azure Kubernetes Service, following the deployment guide in the docs (adapted a bit for azure). I need to install a plugin into my AKS cluster (specifically “sqlalchemy.dialects:databricks.pyhive”). How do I do this? I am getting the following error telling me I need this plugin:
    Copy code
    NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:databricks.pyhive
  • damp-greece-27806
    03/07/2022, 8:31 PM
    Hi - for the metabase ingestion, the recipe calls for username/password, but we use OAuth and it can’t seem to connect to it. Is this a known limitation?
  • plain-farmer-27314
    03/07/2022, 9:29 PM
    Another Looker/LookML bug we've spotted: we are also finding examples of charts that don't have lineage to their explores. Interestingly, our staging environment has the correct lineage (and is running a much older version of the looker/lookml ingestors). We can see the explores in the properties field, but no lineage is actually being established to them.
  • silly-beach-19296
    03/08/2022, 1:10 AM
    Hi folks, I'm still learning about ingestion. I have an emitter written in Python that reads a CSV with all my glossary terms, descriptions, and relationships. So far I have only been able to ingest the terms without relationships, whether "inherits" or "contains". How could I extend my code to read the row["inherit"] or row["contains"] fields and emit those relationships?
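    One possible sketch, assuming the glossaryRelatedTerms aspect is the right home for these relationships (with isRelatedTerms roughly corresponding to "inherits" and hasRelatedTerms to "contains"); the CSV layout and term URN format here are assumptions.
    Copy code
    # Rough sketch: emit "inherits"/"contains" relationships for glossary terms
    # read from a CSV. Assumes the glossaryRelatedTerms aspect, with
    # isRelatedTerms ~ "inherits" and hasRelatedTerms ~ "contains"; the CSV
    # columns (term, inherit, contains) and URN format are assumptions.
    import csv

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, GlossaryRelatedTermsClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    with open("glossary.csv") as f:
        for row in csv.DictReader(f):
            related = GlossaryRelatedTermsClass(
                isRelatedTerms=[f"urn:li:glossaryTerm:{row['inherit']}"] if row.get("inherit") else None,
                hasRelatedTerms=[f"urn:li:glossaryTerm:{row['contains']}"] if row.get("contains") else None,
            )
            emitter.emit(
                MetadataChangeProposalWrapper(
                    entityType="glossaryTerm",
                    changeType=ChangeTypeClass.UPSERT,
                    entityUrn=f"urn:li:glossaryTerm:{row['term']}",
                    aspectName="glossaryRelatedTerms",
                    aspect=related,
                )
            )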
  • stocky-midnight-78204
    03/08/2022, 3:33 AM
    Ingestion via the UI is not working. When I try to execute ingestion via the UI, nothing happens.
  • melodic-helmet-78607
    03/08/2022, 9:53 AM
    Hello team, for some tables I am manually constructing MetadataChangeEventClass objects for ingestion. How do I start debugging an MCE format mismatch from an AvroTypeException? I can't seem to find which record has the error. What keyword should I search for in the error log?
    Copy code
    avro.io.AvroTypeException: The datum MetadataChangeEventClass({'auditHeader': None, 'proposedSnapshot': ...
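    One low-tech way to narrow this down, assuming the MCEs are built as a Python list before emission, is to emit them one at a time and log which index/URN fails:
    Copy code
    # Sketch: emit the MCEs one at a time so the record that trips the
    # AvroTypeException can be identified, instead of failing on the whole batch.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    events = []  # assumed: your list of MetadataChangeEventClass objects

    for i, mce in enumerate(events):
        try:
            emitter.emit_mce(mce)
        except Exception as e:
            # proposedSnapshot.urn shows which entity the bad record belongs to
            print(f"Record {i} ({mce.proposedSnapshot.urn}) failed: {e}")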
  • broad-thailand-41358
    03/08/2022, 5:29 PM
    Hello, I'm getting the following error when trying to ingest elasticsearch:
    host
    host contains bad characters, found http://master1.search.dfw.wordpress.com (type=assertion_error)
    Here is the portion of my config
    Copy code
    source:
      type: "elasticsearch"
      config:
        # Coordinates
        host: http://master1.search.dfw.wordpress.com:9203
  • plain-farmer-27314
    03/08/2022, 8:01 PM
    Hey y'all - does the BigQuery ingestor pull in materialized view schemas? I'm not sure which section (table/view/schema) I need to add the allow regex to. Also, FWIW, we are picking up regular Views, just not Materialized Views.
  • high-toothbrush-90528
    03/08/2022, 9:11 PM
    Hi everybody. I want to create a container using the Python emitter via the REST API, and unfortunately when I run:
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import ChangeTypeClass, ContainerPropertiesClass
    
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    
    
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={})
    emitter.test_connection()
    
    
    metadata_event = MetadataChangeProposalWrapper(
        aspect=ContainerPropertiesClass(
            name="DataProduct1",
            description="original description for dp1"
        ),
        entityType="container",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn="urn:li:container:DATAPROD",
        aspectName="containerProperties",
    )
    
    # Emit metadata! This is a blocking call
    emitter.emit(metadata_event)
    I receive:
    The field at path '/search/searchResults[0]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value. The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult' (code undefined)
    I should mention that I created the container before ingesting any data.
  • melodic-helmet-78607
    03/09/2022, 12:36 AM
    Hi, does anyone know if it is possible to upsert column-level terms/tags? I thought I could make two separate ingestions for two separate column-level tags/terms by ingesting two MCEs with their respective EditableSchemaFieldInfoClass aspects, but apparently they override each other. Is there any other method without using GraphQL? I want to be able to stage all MCE files before ingesting them in bulk.
  • lemon-terabyte-66903
    03/09/2022, 6:05 AM
    Hi, it looks like GMS needs Kafka, Elasticsearch, Neo4j, and MySQL, whereas Amundsen requires only Elasticsearch and Neo4j. Is Kafka really required for DataHub, or can it function without a Kafka service on the backend?
  • handsome-cartoon-79613
    03/09/2022, 9:34 AM
    Does datahub support Vertica?
  • stocky-midnight-78204
    03/09/2022, 12:30 PM
    22/03/09 163903 ERROR McpEmitter: Failed to emit metadata to DataHub
    java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: 10,000 milliseconds timeout on connection http-outgoing-3 [ACTIVE]
        at datahub.spark2.shaded.http.concurrent.BasicFuture.getResult(BasicFuture.java:71)
        at datahub.spark2.shaded.http.concurrent.BasicFuture.get(BasicFuture.java:84)
        at datahub.spark2.shaded.http.impl.nio.client.FutureWrapper.get(FutureWrapper.java:70)
        at datahub.client.MetadataResponseFuture.get(MetadataResponseFuture.java:52)
        at datahub.client.MetadataResponseFuture.get(MetadataResponseFuture.java:13)
        at datahub.spark.consumer.impl.McpEmitter.lambda$emit$1(McpEmitter.java:39)
        at java.util.ArrayList.forEach(ArrayList.java:1257)
        at datahub.spark.consumer.impl.McpEmitter.emit(McpEmitter.java:37)
        at datahub.spark.consumer.impl.McpEmitter.accept(McpEmitter.java:71)
        at datahub.spark.DatahubSparkListener$3.apply(DatahubSparkListener.java:300)
        at datahub.spark.DatahubSparkListener$3.apply(DatahubSparkListener.java:285)
        at scala.Option.foreach(Option.scala:407)
        at datahub.spark.DatahubSparkListener.processExecutionEnd(DatahubSparkListener.java:285)
        at datahub.spark.DatahubSparkListener.onOtherEvent(DatahubSparkListener.java:272)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:82)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
        at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
        at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
        at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
        at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
        at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
    Caused by: java.net.SocketTimeoutException: 10,000 milliseconds timeout on connection http-outgoing-3 [ACTIVE]
        at datahub.spark2.shaded.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
        at datahub.spark2.shaded.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:98)
        at datahub.spark2.shaded.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:40)
        at datahub.spark2.shaded.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
        at datahub.spark2.shaded.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
        at datahub.spark2.shaded.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReact
  • polite-application-51650
    03/09/2022, 12:50 PM
    Hi all, I'm getting started with Datahub, but during metadata ingestion for BigQuery I got the following error.
    Copy code
    'NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fe02a7ef100>: Failed to establish a new connection: [Errno 111] '
               'Connection refused\n'
               '\n'   
    "MaxRetryError: HTTPConnectionPool(host='localhost', port=9002): Max retries exceeded with url: /api/gms/config (Caused by "
               "NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe02a7ef100>: Failed to establish a new connection: [Errno 111] "
               "Connection refused'))\n"
               '\n'
    Can somebody please help me rectify this?
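    The traceback shows the client trying /api/gms/config on localhost:9002, which is the datahub-frontend port. A small sanity-check sketch (assuming a quickstart-style setup where GMS itself listens on 8080) to see which address is actually reachable:
    Copy code
    # Sanity check: see which address GMS is actually reachable on. Port 8080 is
    # the usual GMS port in quickstart-style setups; the error above shows the
    # client trying port 9002, which is the datahub-frontend port.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    for server in ["http://localhost:8080", "http://localhost:9002/api/gms"]:
        try:
            DatahubRestEmitter(gms_server=server).test_connection()
            print(f"GMS reachable at {server}")
        except Exception as e:
            print(f"Not reachable at {server}: {e}")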
  • salmon-area-51650
    03/09/2022, 3:54 PM
    Hi team 👋 I want to get the lineage between Kafka and Snowflake. I have a Kafka Connect sink connector connecting the two platforms. I know that Kafka Connect sink connectors are not implemented in DataHub yet, but is there any option to implement this using a custom emitter or something like that? Any suggestions? Thanks!!
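    A custom emitter can work for this; a minimal sketch that emits a lineage edge from a Kafka topic to a Snowflake table over REST (topic/table names and the GMS address are placeholders):
    Copy code
    # Minimal sketch: manually emit lineage from a Kafka topic to a Snowflake
    # table. Topic/table names and the GMS address are placeholders.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    lineage_mce = builder.make_lineage_mce(
        [builder.make_dataset_urn("kafka", "my_topic", "PROD")],              # upstream(s)
        builder.make_dataset_urn("snowflake", "db.schema.my_table", "PROD"),  # downstream
    )
    emitter.emit_mce(lineage_mce)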
  • prehistoric-room-17640
    03/09/2022, 5:04 PM
    I'm trying to utilize the elasticsearch connector to ingest data from an OpenSearch cluster, and I am getting the following error when trying to use the elasticsearch recipe as a base: KeyError: 'Did not find a registered class for elasticsearch'. Is this a version issue with the datahub CLI tool I'm using?
  • green-pencil-45127
    03/09/2022, 6:57 PM
    We use holistics as our main BI tool. Any move toward including ingestion from there?
  • freezing-farmer-89710
    03/09/2022, 7:22 PM
    Hello, I hope you are all well. I have a question about generated runIds, I hope you can help me please =) I currently ingest through a recipe with the glue plugin, and I would like to persist the runId to a DB to later use it in a rollback if necessary. The ingestion response does not return the associated runId, and when I get runIds through the command (datahub ingest list-runs) it only returns the last 10 ingestions. It also seems that the runId is not generated immediately, so it is difficult to associate the execution pipeline of my job with a specific runId. Is there any way to obtain the runId associated with an ingestion? And how can I get all the past runIds that are not listed in the first 10? Thank you very much!
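    One possible sketch, assuming you can run the recipe programmatically instead of via the CLI, and assuming the Pipeline API exposes the run id as pipeline.ctx.run_id: create the pipeline in your job, persist the id yourself, then run it. The glue recipe below is reduced to placeholders.
    Copy code
    # Sketch: run the recipe programmatically so the run id is known to your job.
    # Assumes the Pipeline API exposes it as pipeline.ctx.run_id; the glue recipe
    # below is reduced to placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {"type": "glue", "config": {"aws_region": "us-east-1"}},
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    run_id = pipeline.ctx.run_id  # persist this to your DB for later rollback
    pipeline.run()
    pipeline.raise_from_status()
    print(f"Ingestion run id: {run_id}")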
  • microscopic-elephant-47912
    03/09/2022, 8:13 PM
    Hi team, in Looker we use labels for the explores. Does Looker ingestion pick up the label information? I couldn't find a place that mentions it. It is important for us because end users see the label, not the technical name, when they open the Looker object. Thanks.
  • shy-parrot-64120
    03/09/2022, 9:01 PM
    Hi folks, I've found that the glue ingestor is not adding the Data source (e.g. AwsDataCatalog) to the dataset. Are there any ways to do so in a “natural” way? The issue is that I can't map datasets ingested from `Superset`/`Tableau` to the ones ingested from Glue.
  • kind-baker-52130
    03/10/2022, 12:21 AM
    ValueError: invalid literal for int() with base 10: ‘’\n
  • eager-church-8887
    03/10/2022, 4:22 AM
    Hi all, is there any central repository of openapi config files for common use cases like Salesforce, Fivetran, etc?
  • polite-application-51650
    03/10/2022, 6:10 AM
    Hi all, does DataHub support metadata ingestion through Java code?