# ingestion
  • full-cartoon-72793
    03/04/2022, 9:48 PM
    Hello. I am getting an error trying to ingest metadata via hive from Databricks. I am including my recipe and the captured output (after cleaning up some identifiers and keys). My DataHub deployment is in Azure Kubernetes Service and I am running the default deployment that is created via the documented helm commands in the getting started guide.
    Copy code
    helm repo add datahub https://helm.datahubproject.io/
    helm install prerequisites datahub/datahub-prerequisites
    helm install datahub datahub/datahub
    I also created the metadata ingestion source using the DataHub UI. Any ideas on how to fix this?
    captured_output.txt, cleaned_recipe.yml
  • salmon-rose-54694
    03/07/2022, 1:56 AM
    Hi team, is there a way to ingest metadata via the GMS endpoint directly, rather than through the DataHub CLI? I'd appreciate an example. Thank you.
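    For reference, a minimal sketch of writing straight to GMS with the Python REST emitter rather than going through the CLI; the GMS address, dataset URN, and the status aspect used here are placeholder assumptions.
    Copy code
    # Minimal sketch: push one aspect straight to the GMS REST endpoint.
    # The GMS address, dataset URN, and the status aspect are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    emitter.test_connection()  # raises if GMS is not reachable

    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn(platform="hive", name="example_db.example_table", env="PROD"),
            aspectName="status",
            aspect=StatusClass(removed=False),
        )
    )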
  • rough-van-26693
    03/07/2022, 2:27 AM
    Hi Team, Any example of how to add tags for metadata-ingested tables on GCP?
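    A minimal sketch of attaching a tag to an already-ingested table with the Python REST emitter; the BigQuery table name, tag name, and GMS address are placeholder assumptions.
    Copy code
    # Minimal sketch: attach a tag to an already-ingested table.
    # The BigQuery table name, tag name, and GMS address are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        GlobalTagsClass,
        TagAssociationClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn(platform="bigquery", name="project.dataset.table", env="PROD"),
            aspectName="globalTags",
            # note: upserting this aspect replaces any tags already on the table
            aspect=GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("pii"))]),
        )
    )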
  • brave-secretary-27487
    03/07/2022, 9:43 AM
    Hey, I'm looking for a way to change the description of a schema with the Python emitter. Is there an object that I can use for this, like DatasetPropertiesClass? Or can I use this object to also change the schema of a dataset?
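    DatasetPropertiesClass carries the dataset-level description; the schema itself is modeled by the schemaMetadata / editableSchemaMetadata aspects rather than this object. A minimal sketch, with a placeholder URN and GMS address:
    Copy code
    # Minimal sketch: update a dataset's description via DatasetPropertiesClass.
    # The dataset URN and GMS address are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    emitter.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn(platform="hive", name="example_db.example_table", env="PROD"),
            aspectName="datasetProperties",
            aspect=DatasetPropertiesClass(description="New table-level description"),
        )
    )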
  • red-accountant-48681
    03/07/2022, 11:46 AM
    Hi all, hopefully this is the correct channel for my query. I am trying to find class definitions for the schema_classes, with no luck. I would like to know how these classes function, and where I can find the source code for them.
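    As far as I can tell, schema_classes is code-generated from the metadata models at build time, so the .py file ships inside the installed acryl-datahub package rather than living as hand-written source in the repo. A quick way to locate and inspect the generated module:
    Copy code
    # Locate the generated schema_classes module inside the installed package
    # and peek at the source of one of the generated classes.
    import inspect

    import datahub.metadata.schema_classes as schema_classes

    print(inspect.getfile(schema_classes))
    print(inspect.getsource(schema_classes.DatasetPropertiesClass))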
  • many-guitar-67205
    03/07/2022, 12:59 PM
    Hello, Is there any easy way of ingesting a protobuf schema? There is code in the datahub repo to read and ingest avro schemas (https://github.com/linkedin/datahub/blob/c8a3e6820204bf8c59ce4afae3d2b5d9dfdc0b71/[…]tadata-ingestion/src/datahub/ingestion/extractor/schema_util.py) so I'm wondering if there is something similar for protobuf out there ...
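    A rough workaround sketch, assuming no protobuf extractor exists yet: hand-build the schemaMetadata aspect (e.g. from a parsed protobuf descriptor) and emit it yourself. All names, field types, and the GMS address below are placeholders.
    Copy code
    # Rough sketch: hand-build a schemaMetadata aspect for a dataset whose schema
    # comes from protobuf. Field names/types are placeholders; in practice they
    # would be derived from the parsed protobuf descriptor.
    from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        ChangeTypeClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    schema_aspect = SchemaMetadataClass(
        schemaName="my_proto.UserEvent",
        platform=make_data_platform_urn("kafka"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema="<contents of the .proto file>"),
        lastModified=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
        fields=[
            SchemaFieldClass(
                fieldPath="user_id",
                nativeDataType="string",  # the protobuf type name
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                description="Example field",
            )
        ],
    )

    DatahubRestEmitter(gms_server="http://localhost:8080").emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn(platform="kafka", name="my_topic", env="PROD"),
            aspectName="schemaMetadata",
            aspect=schema_aspect,
        )
    )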
  • most-nightfall-36645
    03/07/2022, 2:20 PM
    Hi, I am having some issues ingesting Postgres data. When I try, I receive a bunch of exceptions about an unknown aspect, container:
    Copy code
    {'error': 'Unable to emit metadata to DataHub GMS',
     'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
              'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: java.lang.RuntimeException: Unknown '
                            'aspect container for entity dataset\n'
                            <OMITTED>
              'message': 'java.lang.RuntimeException: Unknown aspect container for entity dataset', 'status': 500}}
    This seems similar to an issue described in this thread; however, no solution was found there. Would anyone be able to help me understand this issue? I am using DataHub v0.8.24 and CLI 0.8.27.
  • full-cartoon-72793
    03/07/2022, 7:48 PM
    I have deployed DataHub with Kubernetes into Azure Kubernetes Service, following the deployment guide in the docs (adapted a bit for azure). I need to install a plugin into my AKS cluster (specifically “sqlalchemy.dialects:databricks.pyhive”). How do I do this? I am getting the following error telling me I need this plugin:
    Copy code
    NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:databricks.pyhive
  • damp-greece-27806
    03/07/2022, 8:31 PM
    Hi - for the metabase ingestion, the recipe calls for username/password, but we use OAuth and it can’t seem to connect to it. Is this a known limitation?
  • plain-farmer-27314
    03/07/2022, 9:29 PM
    Another Looker/LookML bug we've spotted: we are also finding examples of charts that don't have lineage to their explores. Interestingly, our staging environment has the correct lineage (and is running a much older version of the looker/lookml ingestors). We can see the explores in the properties field, but no lineage is actually being established to them.
  • silly-beach-19296
    03/08/2022, 1:10 AM
    Hi folks, I'm still learning about ingestion. I have an emitter written in Python that reads a CSV with all my glossary terms, descriptions, and relationships. So far I have only been able to ingest the terms without relationships, whether "inherits" or "contains". How could I extend my code to read the row["inherit"] or row["contains"] fields and emit those relationships?
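    One possible sketch, assuming the glossaryRelatedTerms aspect is the right home for these relationships (with isRelatedTerms roughly corresponding to "inherits" and hasRelatedTerms to "contains"); the CSV layout and term URN format here are assumptions.
    Copy code
    # Rough sketch: emit "inherits"/"contains" relationships for glossary terms
    # read from a CSV. Assumes the glossaryRelatedTerms aspect, with
    # isRelatedTerms ~ "inherits" and hasRelatedTerms ~ "contains"; the CSV
    # columns (term, inherit, contains) and URN format are assumptions.
    import csv

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, GlossaryRelatedTermsClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    with open("glossary.csv") as f:
        for row in csv.DictReader(f):
            related = GlossaryRelatedTermsClass(
                isRelatedTerms=[f"urn:li:glossaryTerm:{row['inherit']}"] if row.get("inherit") else None,
                hasRelatedTerms=[f"urn:li:glossaryTerm:{row['contains']}"] if row.get("contains") else None,
            )
            emitter.emit(
                MetadataChangeProposalWrapper(
                    entityType="glossaryTerm",
                    changeType=ChangeTypeClass.UPSERT,
                    entityUrn=f"urn:li:glossaryTerm:{row['term']}",
                    aspectName="glossaryRelatedTerms",
                    aspect=related,
                )
            )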
  • stocky-midnight-78204
    03/08/2022, 3:33 AM
    Ingestion via the UI is not working. When I try to execute ingestion via the UI, nothing happens.
  • melodic-helmet-78607
    03/08/2022, 9:53 AM
    Hello team, for some tables I am manually constructing MetadataChangeEventClass objects for ingestion. How do I start debugging an MCE format mismatch from an AvroTypeException? I can't seem to find which record has the error. What keyword should I search for in the error log?
    Copy code
    avro.io.AvroTypeException: The datum MetadataChangeEventClass({'auditHeader': None, 'proposedSnapshot': ...
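    One low-tech way to narrow this down, assuming the MCEs are built as a Python list before emission, is to emit them one at a time and log which index/URN fails:
    Copy code
    # Sketch: emit the MCEs one at a time so the record that trips the
    # AvroTypeException can be identified, instead of failing on the whole batch.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    events = []  # assumed: your list of MetadataChangeEventClass objects

    for i, mce in enumerate(events):
        try:
            emitter.emit_mce(mce)
        except Exception as e:
            # proposedSnapshot.urn shows which entity the bad record belongs to
            print(f"Record {i} ({mce.proposedSnapshot.urn}) failed: {e}")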
  • broad-thailand-41358
    03/08/2022, 5:29 PM
    Hello, I'm getting the following error when trying to ingest elasticsearch:
    host
    host contains bad characters, found http://master1.search.dfw.wordpress.com (type=assertion_error)
    Here is the portion of my config
    Copy code
    source:
      type: "elasticsearch"
      config:
        # Coordinates
        host: http://master1.search.dfw.wordpress.com:9203
  • plain-farmer-27314
    03/08/2022, 8:01 PM
    Hey y'all - does the BigQuery ingestor pull in materialized view schemas? I'm not sure which section (table/view/schema) I need to add the allow regex to. Also, FWIW, we are picking up regular Views, just not Materialized Views.
  • high-toothbrush-90528
    03/08/2022, 9:11 PM
    Hi everybody. I want to create a container using the Python emitter via the REST API, and unfortunately when I run:
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import ChangeTypeClass, ContainerPropertiesClass
    
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    
    
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={})
    emitter.test_connection()
    
    
    metadata_event = MetadataChangeProposalWrapper(
        aspect=ContainerPropertiesClass(
            name="DataProduct1",
            description="original description for dp1"
        ),
        entityType="container",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn="urn:li:container:DATAPROD",
        aspectName="containerProperties",
    )
    
    # Emit metadata! This is a blocking call
    emitter.emit(metadata_event)
    I receive:
    The field at path '/search/searchResults[0]/entity' was declared as a non null type, but the code involved in retrieving data has wrongly returned a null value. The graphql specification requires that the parent field be set to null, or if that is non nullable that it bubble up null to its parent and so on. The non-nullable type is 'Entity' within parent type 'SearchResult' (code undefined)
    I should mention that I created the container before ingesting any data.
  • melodic-helmet-78607
    03/09/2022, 12:36 AM
    Hi, does anyone know if it is possible to upsert column-level terms/tags? I thought I could make two separate ingestions for two separate column-level tags/terms by ingesting two MCEs with their respective EditableSchemaFieldInfoClass aspects, but apparently they override each other. Is there any other method without using GraphQL? I want to be able to stage all MCE files before ingesting them in bulk.
  • lemon-terabyte-66903
    03/09/2022, 6:05 AM
    Hi, it looks like GMS needs Kafka, Elasticsearch, Neo4j, and MySQL, whereas Amundsen requires only Elasticsearch and Neo4j. Is Kafka really required for DataHub, or can it function without a Kafka service on the backend?
  • handsome-cartoon-79613
    03/09/2022, 9:34 AM
    Does datahub support Vertica?
  • stocky-midnight-78204
    03/09/2022, 12:30 PM
    22/03/09 163903 ERROR McpEmitter: Failed to emit metadata to DataHub
    java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: 10,000 milliseconds timeout on connection http-outgoing-3 [ACTIVE]
        at datahub.spark2.shaded.http.concurrent.BasicFuture.getResult(BasicFuture.java:71)
        at datahub.spark2.shaded.http.concurrent.BasicFuture.get(BasicFuture.java:84)
        at datahub.spark2.shaded.http.impl.nio.client.FutureWrapper.get(FutureWrapper.java:70)
        at datahub.client.MetadataResponseFuture.get(MetadataResponseFuture.java:52)
        at datahub.client.MetadataResponseFuture.get(MetadataResponseFuture.java:13)
        at datahub.spark.consumer.impl.McpEmitter.lambda$emit$1(McpEmitter.java:39)
        at java.util.ArrayList.forEach(ArrayList.java:1257)
        at datahub.spark.consumer.impl.McpEmitter.emit(McpEmitter.java:37)
        at datahub.spark.consumer.impl.McpEmitter.accept(McpEmitter.java:71)
        at datahub.spark.DatahubSparkListener$3.apply(DatahubSparkListener.java:300)
        at datahub.spark.DatahubSparkListener$3.apply(DatahubSparkListener.java:285)
        at scala.Option.foreach(Option.scala:407)
        at datahub.spark.DatahubSparkListener.processExecutionEnd(DatahubSparkListener.java:285)
        at datahub.spark.DatahubSparkListener.onOtherEvent(DatahubSparkListener.java:272)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:82)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
        at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
        at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
        at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
        at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
        at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
    Caused by: java.net.SocketTimeoutException: 10,000 milliseconds timeout on connection http-outgoing-3 [ACTIVE]
        at datahub.spark2.shaded.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
        at datahub.spark2.shaded.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:98)
        at datahub.spark2.shaded.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:40)
        at datahub.spark2.shaded.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
        at datahub.spark2.shaded.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
        at datahub.spark2.shaded.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReact
  • polite-application-51650
    03/09/2022, 12:50 PM
    Hi all, I'm getting started with Datahub, but during metadata ingestion for BigQuery I got the following error.
    Copy code
    'NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fe02a7ef100>: Failed to establish a new connection: [Errno 111] '
               'Connection refused\n'
               '\n'   
    "MaxRetryError: HTTPConnectionPool(host='localhost', port=9002): Max retries exceeded with url: /api/gms/config (Caused by "
               "NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe02a7ef100>: Failed to establish a new connection: [Errno 111] "
               "Connection refused'))\n"
               '\n'
    Can somebody please help me rectify this?
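    The traceback shows the client trying /api/gms/config on localhost:9002, which is the datahub-frontend port. A small sanity-check sketch (assuming a quickstart-style setup where GMS itself listens on 8080) to see which address is actually reachable:
    Copy code
    # Sanity check: see which address GMS is actually reachable on. Port 8080 is
    # the usual GMS port in quickstart-style setups; the error above shows the
    # client trying port 9002, which is the datahub-frontend port.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    for server in ["http://localhost:8080", "http://localhost:9002/api/gms"]:
        try:
            DatahubRestEmitter(gms_server=server).test_connection()
            print(f"GMS reachable at {server}")
        except Exception as e:
            print(f"Not reachable at {server}: {e}")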
  • salmon-area-51650
    03/09/2022, 3:54 PM
    Hi team 👋 I want to get the lineage between Kafka and Snowflake. I have a Kafka Connect sink connector connecting the two platforms. I know that Kafka Connect sink connectors are not implemented in DataHub yet, but is there any option to implement this using a custom emitter or something like that? Any suggestions? Thanks!!
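    A custom emitter can work for this; a minimal sketch that emits a lineage edge from a Kafka topic to a Snowflake table over REST (topic/table names and the GMS address are placeholders):
    Copy code
    # Minimal sketch: manually emit lineage from a Kafka topic to a Snowflake
    # table. Topic/table names and the GMS address are placeholders.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    lineage_mce = builder.make_lineage_mce(
        [builder.make_dataset_urn("kafka", "my_topic", "PROD")],              # upstream(s)
        builder.make_dataset_urn("snowflake", "db.schema.my_table", "PROD"),  # downstream
    )
    emitter.emit_mce(lineage_mce)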
  • prehistoric-room-17640
    03/09/2022, 5:04 PM
    I'm trying to utilize the elasticsearch connector to ingest data from an OpenSearch cluster, and I am getting the following error when trying to use the elasticsearch recipe as a base: KeyError: 'Did not find a registered class for elasticsearch'. Is this a version issue with the datahub CLI tool I'm using?
  • green-pencil-45127
    03/09/2022, 6:57 PM
    We use holistics as our main BI tool. Any move toward including ingestion from there?
  • freezing-farmer-89710
    03/09/2022, 7:22 PM
    Hello, I hope you are all well. I have a question about generated runIds, I hope you can help me please =) I currently ingest through a recipe with the glue plugin, and I would like to persist the runId to a DB to later use it in a rollback if necessary. The ingestion response does not return the associated runId, and when I get runIds through the command (datahub ingest list-runs) it only returns the last 10 ingestions. It also seems that the runId is not generated immediately, so it is difficult to associate the execution pipeline of my job with a specific runId. Is there any way to obtain the runId associated with an ingestion? And how can I get all the past runIds that are not listed in the first 10? Thank you very much!
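    One possible sketch, assuming you can run the recipe programmatically instead of via the CLI, and assuming the Pipeline API exposes the run id as pipeline.ctx.run_id: create the pipeline in your job, persist the id yourself, then run it. The glue recipe below is reduced to placeholders.
    Copy code
    # Sketch: run the recipe programmatically so the run id is known to your job.
    # Assumes the Pipeline API exposes it as pipeline.ctx.run_id; the glue recipe
    # below is reduced to placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {"type": "glue", "config": {"aws_region": "us-east-1"}},
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    run_id = pipeline.ctx.run_id  # persist this to your DB for later rollback
    pipeline.run()
    pipeline.raise_from_status()
    print(f"Ingestion run id: {run_id}")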
  • microscopic-elephant-47912
    03/09/2022, 8:13 PM
    Hi team, in Looker we use labels for the explores. Does Looker ingestion pick up the label information? I couldn't find a place that mentions it. It is important for us because end users see the label, not the technical name, when they open the Looker object. Thanks.
  • shy-parrot-64120
    03/09/2022, 9:01 PM
    Hi folks, I've found that the glue ingestor is not adding the Data source (e.g. AwsDataCatalog) to the dataset. Are there any ways to do so in a “natural” way? The issue is that I can't map datasets ingested from `Superset`/`Tableau` to the ones ingested from Glue.
  • kind-baker-52130
    03/10/2022, 12:21 AM
    ValueError: invalid literal for int() with base 10: ‘’\n
  • eager-church-8887
    03/10/2022, 4:22 AM
    Hi all, is there any central repository of openapi config files for common use cases like Salesforce, Fivetran, etc?
  • polite-application-51650
    03/10/2022, 6:10 AM
    Hi all, does DataHub support metadata ingestion through Java code?