# ingestion
  • brave-tomato-16287

    04/14/2022, 8:19 AM
    Hello all! Is it possible to connect to Superset via a JWT token?
  • brave-forest-5974

    04/14/2022, 8:41 AM
    🤔 From what I can tell from the dbt source, the lineage is taken from dbt's view of the lineage, which ignores "hard-coded" table references. I wonder if there's any call (apart from us 😄) to attempt to parse the SQL files and extract those hard-coded table references. The general complications of Jinja make it a bit interesting, but to us it could be useful 🤔
  • famous-match-44342

    04/14/2022, 1:32 PM
    [image attachment: 图片.png]
  • famous-match-44342

    04/14/2022, 1:35 PM
    [image attachment: 图片.png]
  • cool-architect-34612

    04/15/2022, 12:48 AM
    Hi, I want to exclude only the columns containing sensitive information, such as name or address, by using a deny pattern, but it doesn't work. What should I do? When ingesting the tb table, I want to exclude db.tb.name and db.tb.addr.
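    A hedged note on the question above: table_pattern and profile_pattern are documented as table-level filters, and schema ingestion generally has no column-level filter, so denying db.tb.name outright may simply not be supported by the source. If the goal is only to keep those columns out of profiling stats, a sketch like the following may work, assuming your source's profile_pattern accepts fully qualified column names (db.table.column) — verify against the source docs:

      source:
        type: mysql                  # illustrative source; substitute your own
        config:
          host_port: localhost:3306
          table_pattern:
            allow:
              - "db\\.tb"
          profiling:
            enabled: true
          profile_pattern:
            deny:                    # assumption: column-level matching is honored here
              - "db\\.tb\\.name"
              - "db\\.tb\\.addr"
      sink:
        type: "datahub-rest"
        config:
          server: "http://localhost:8080"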
  • nutritious-bird-77396

    04/15/2022, 4:16 PM
    I am seeing a 500 error in GMS, and records are being dropped in a batch update from Okta related to
    GroupMembership
    for users... Even though I am inserting the same ingestion record, the records should be skipped instead of erroring. Because of this one error, all the other updates in the batch are dropped as well, causing the users and groups information to be mismatched across environments. Error details in 🧵
  • icy-ram-1893

    04/16/2022, 7:04 AM
    Hello! I've got a basic question: when we gather data from diverse sources, where is it physically stored? In a database? In a file?
  • orange-coat-2879

    04/18/2022, 2:14 AM
    Hello everyone, I used the sa account to ingest a dataset into DataHub, but the table metadata is all empty. Also, when I enable the profiling function in the configuration, the ingestion fails. Can anyone give me a hint? Thanks!
    source:
      type: mssql
      config:
        # Coordinates
        host_port: localhost:1433
        database: TutorialDB
        schema_pattern:
          allow:
            - "QQ"
        table_pattern:
          allow:
            - "accessories"
            - "raw_account"
        # Credentials
        username: sa
        password: pwd
    
        profiling:
          enabled: true
    
    sink:
      # sink configs
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
  • cool-architect-34612

    04/18/2022, 5:11 AM
    Hi, I am performing table parsing using the 'parse_table_names_from_sql' option during redash ingestion, but too many tables are duplicated with mysql's tables. Is there any way to connect directly to the mysql tables when parsing in redash? My recipe:

      source:
        type: "redash"
        config:
          connect_uri: ###
          api_key: ###
          parse_table_names_from_sql: true
      sink:
        type: "datahub-rest"
        config:
          server: "http://localhost:8080"
  • silly-application-87541

    04/18/2022, 9:17 AM
    Hi, can't we ingest metadata without having a service account key for a BigQuery connection?
  • better-orange-49102

    04/18/2022, 9:36 AM
    For datasets with a containerAspect, is it possible to remove the dataset's container without resorting to rollback? I am ingesting containerAspect, and I realized that ingesting empty values for container doesn't work to "undo" the relationship.
  • fresh-electrician-85277

    04/18/2022, 12:31 PM
    Hello everyone, if I want to use the lineage features of the Spark integration, but my Spark version is 3.1.2, do I need to recompile DataHub with Spark 3.1.2? Thanks a lot.
  • delightful-barista-90363

    04/18/2022, 9:23 PM
    On top of these, I noticed that there is a datalake folder and an s3 folder, both of which seem to get info from S3. Which one is used by the following config?
    source:
      type: "s3"
  • rich-policeman-92383

    04/19/2022, 9:24 AM
    What regex pattern should I use to ingest and profile only a single hive dataset?
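    A minimal sketch for the question above: anchor the pattern with ^ and $ so only one fully qualified table matches, and mirror it in profile_pattern (the shape follows the hive recipe further down this page; the database and table names are hypothetical):

      source:
        type: hive
        config:
          host_port: hive:10000        # illustrative coordinates
          table_pattern:
            allow:
              - "^mydb\\.mytable$"     # hypothetical db.table; anchors prevent prefix matches
          profiling:
            enabled: true
          profile_pattern:
            allow:
              - "^mydb\\.mytable$"
      sink:
        type: "datahub-rest"
        config:
          server: "http://localhost:8080"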
  • acoustic-quill-54426

    04/19/2022, 10:58 AM
    👋 If someone else is scratching their head over the bigquery usage ingestion and the warnings about
    Failed to match table read event XXX with job; try increasing query_log_delay or max_query_duration
    but there is nothing wrong with those: please make sure your job/QueryEvent is not being filtered here.
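    The two knobs named in that warning are config options on the bigquery-usage source. A hedged sketch of where they go (the project id and the values are illustrative; check the source docs for defaults and units before copying):

      source:
        type: bigquery-usage
        config:
          projects:
            - my-gcp-project        # hypothetical project id
          max_query_duration: 30    # assumption: widens the window for matching read events to jobs
          query_log_delay: 100      # assumption: buffers more log entries before matching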
  • quaint-lighter-81058

    04/19/2022, 6:19 PM
    raise tds_base.Error('Client does not have encryption enabled but it is required by server, '
    The Azure Managed Instance connection is failing from the recipe with the error above. DataHub is unable to ingest from Azure Managed Instance. Need your help please.
  • best-umbrella-24804

    04/21/2022, 4:12 AM
    Hello, I'm trying to ingest metadata from Glue. Almost everything has been ingested except for 2 pipelines/Glue jobs. They are both throwing the following error. Does anyone know what these errors mean?
  • alert-football-80212

    04/21/2022, 8:33 AM
    Hi all, I am trying to use the Rest.li API for retrieving entity aspects. I followed the example of a simple curl GET request and inserted my URL-encoded entity URN:
    curl 'http://localhost:8080/entitiesV2/<url-encoded-entity-urn>'
    While the documentation shows the response as a JSON describing the entity, I receive the page HTML (with response 200). Has anyone used the Rest.li API and can help?
  • billions-twilight-48559

    04/21/2022, 2:54 PM
    Hi there, is there any way to ignore SSL validation when GMS is behind an https URL with a corporate certificate that is not trusted? I mean, in the recipes. Thanks
  • early-librarian-13786

    04/21/2022, 3:54 PM
    Hello everyone, I have a problem with column stats after profiling a postgres table: all stats (min, max, mean, null and distinct count) are "unknown", even though the ingestion pipeline finished successfully with
    'entities_profiled': 1
    I tried different sink types (datahub-kafka, datahub-rest) and postgres tables with different column types and row counts, but the result was the same. Has anyone else faced this issue, and is there any solution?
  • red-pizza-28006

    04/21/2022, 4:54 PM
    Hello, I need some guidance. We have some tables where we run SQL data quality checks before our analysts can use them. I am wondering how I can expose these checks in DataHub, and what would be the recommended way of marking a table as "certified" or "gold"?
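    One ingestion-time option for the "certified"/"gold" marking asked about above is a tag transformer in the recipe. A hedged sketch using the simple_add_dataset_tags transformer (the tag name is illustrative, and the transformer tags every dataset the recipe ingests, so pair it with suitable allow patterns):

      transformers:
        - type: "simple_add_dataset_tags"
          config:
            tag_urns:
              - "urn:li:tag:Certified"   # hypothetical tag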
  • lemon-terabyte-66903

    04/21/2022, 8:16 PM
    Hi, I have multiple datasets (parquet) ingested using the
    data-lake
    source. How do I merge them all into one, so that they show as one dataset in the UI?
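    For the merge question above, the data-lake/s3 sources group many files into one dataset via a path spec with a {table} placeholder. A hedged sketch (the option name and bucket layout are assumptions — the key has been spelled path_spec or path_specs depending on version, so check the docs for yours):

      source:
        type: s3
        config:
          path_spec:                 # assumption: singular key in this version
            include: "s3://my-bucket/data/{table}/*.parquet"   # hypothetical layout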
  • nutritious-bird-77396

    04/21/2022, 8:49 PM
    Hi team... I am looking at the Queries tab and wondering if there are any PII concerns here. Are the values in the query hashed? Can you help me point to the code for that? I am having a hard time finding it.
  • curved-football-28924

    04/22/2022, 5:27 AM
    url: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/emitter/mcp.py#L16
    Error message:

      emitter.emit_mcp(dataset_assertionResult_mcp)
        File "/home/karthickaravindan/.local/lib/python3.8/site-packages/datahub/emitter/rest_emitter.py", line 208, in emit_mcp
          mcp_obj = pre_json_transform(mcp.to_obj())
        File "/home/karthickaravindan/.local/lib/python3.8/site-packages/datahub/emitter/mcp.py", line 69, in to_obj
          return self.make_mcp().to_obj(tuples=tuples)
        File "/home/karthickaravindan/.local/lib/python3.8/site-packages/datahub/emitter/mcp.py", line 42, in make_mcp
          serializedAspect = _make_generic_aspect(self.aspect)
        File "/home/karthickaravindan/.local/lib/python3.8/site-packages/datahub/emitter/mcp.py", line 17, in _make_generic_aspect
          serialized = json.dumps(pre_json_transform(codegen_obj.to_obj()))
      AttributeError: 'str' object has no attribute 'to_obj'

    I get this error when inserting my own custom validation for assertionInfo. What should the parameter for codegen_obj be? Is there any documentation for writing your own validation plugin? Similar to Great Expectations, I am trying to write a plugin for AWS DataBrew. Kindly guide me on how to achieve it.
  • dazzling-alarm-64985

    04/22/2022, 6:00 AM
    Hello, I am ingesting from Kafka and Schema Registry into DataHub, but I'm only getting empty datasets. The message below is in the logs, which is not true; how can I troubleshoot this?
    'xxxxxx': ['The schema registry subject for the value schema is not found. '
               'The topic is either schema-less, or no messages have been written to the topic yet.']
  • rich-policeman-92383

    04/22/2022, 6:59 AM
    While profiling a hive dataset I am only getting row and column counts. All other stats (min, max, mean, median, null %, null count, distinct %, distinct count, sample values) are unknown. No errors. YAML:
    ---
    source:
      type: hive
      config:
        host_port: hive:10000
        env: "PROD"
        table_pattern:
          allow:
            - "A.B\\$"
        options:
          connect_args: {'auth': 'KERBEROS','kerberos_service_name': 'hive'}
        profiling:
          enabled: true
        profile_pattern:
          allow:
            - "A.B\\$"
    sink:
      type: "datahub-rest"
      config:
        server: "<https://datahub:8080>"
  • magnificent-hospital-52323

    04/22/2022, 7:53 AM
    Hi all! I'm trying to build assertions using the Python emitter. I started out with the example: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/data_quality_mcpw_rest.py In that script, I only modified the data table and column to match my dataset. The generated URNs seem okay to me (at least they match the ones seen in the DataHub front-end). However, when I ran the script I hit the following error:
    datahub-gms               | 07:49:26.623 [qtp544724190-14] ERROR c.l.m.filter.RestliLoggingFilter:38 - Rest.li error:
    datahub-gms               | com.linkedin.restli.server.RestLiServiceException: Failed to validate record with class com.linkedin.assertion.AssertionInfo: ERROR :: /datasetAssertion/nativeParameters :: unrecognized field found but not allowed
    datahub-gms               | ERROR :: /datasetAssertion/nativeType :: unrecognized field found but not allowed
    datahub-gms               | ERROR :: /datasetAssertion/aggregation :: unrecognized field found but not allowed
    datahub-gms               | ERROR :: /datasetAssertion/parameters :: unrecognized field found but not allowed
    datahub-gms               | ERROR :: /datasetAssertion/dataset :: unrecognized field found but not allowed
    datahub-gms               | ERROR :: /datasetAssertion/operator :: unrecognized field found but not allowed
    datahub-gms               | 
    datahub-gms               | 	at com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$3(AspectResource.java:140)
    datahub-gms               | 	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)
    datahub-gms               | 	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)
    datahub-gms               | 	at com.linkedin.metadata.resources.entity.AspectResource.ingestProposal(AspectResource.java:133)
    datahub-gms               | 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    datahub-gms               | 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    datahub-gms               | 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    datahub-gms               | 	at java.lang.reflect.Method.invoke(Method.java:498)
    datahub-gms               | 	at com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:172)
    datahub-gms               | 	at com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:326)
    datahub-gms               | 	at com.linkedin.restli.internal.server.filter.FilterChainDispatcherImpl.onRequestSuccess(FilterChainDispatcherImpl.java:47)
    datahub-gms               | 	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:86)
    datahub-gms               | 	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.lambda$onRequest$0(RestLiFilterChainIterator.java:73)
    datahub-gms               | 	at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
    datahub-gms               | 	at java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683)
    datahub-gms               | 	at java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010)
    datahub-gms               | 	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:72)
    datahub-gms               | 	at com.linkedin.restli.internal.server.filter.RestLiFilterChain.onRequest(RestLiFilterChain.java:55)
    datahub-gms               | 	at com.linkedin.restli.server.BaseRestLiServer.handleResourceRequest(BaseRestLiServer.java:218)
    datahub-gms               | 	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequestWithRestLiResponse(RestRestLiServer.java:242)
    datahub-gms               | 	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:211)
    datahub-gms               | 	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:181)
    datahub-gms               | 	at com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:164)
    datahub-gms               | 	at com.linkedin.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:120)
    datahub-gms               | 	at com.linkedin.restli.server.RestLiServer.handleRequest(RestLiServer.java:132)
    datahub-gms               | 	at com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(DelegatingTransportDispatcher.java:70)
    datahub-gms               | 	at com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(DispatcherRequestFilter.java:70)
    datahub-gms               | 	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)
    datahub-gms               | 	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
    datahub-gms               | 	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
    datahub-gms               | 	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
    datahub-gms               | 	at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)
    datahub-gms               | 	at com.linkedin.r2.filter.transport.ServerQueryTunnelFilter.onRestRequest(ServerQueryTunnelFilter.java:58)
    datahub-gms               | 	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)
    datahub-gms               | 	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
    datahub-gms               | 	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
    datahub-gms               | 	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
    datahub-gms               | 	at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)
    datahub-gms               | 	at com.linkedin.r2.filter.message.rest.RestFilter.onRestRequest(RestFilter.java:50)
    datahub-gms               | 	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)
    datahub-gms               | 	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
    datahub-gms               | 	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
    datahub-gms               | 	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
    datahub-gms               | 	at com.linkedin.r2.filter.FilterChainImpl.onRestRequest(FilterChainImpl.java:96)
    datahub-gms               | 	at com.linkedin.r2.filter.transport.FilterChainDispatcher.handleRestRequest(FilterChainDispatcher.java:75)
    datahub-gms               | 	at com.linkedin.r2.util.finalizer.RequestFinalizerDispatcher.handleRestRequest(RequestFinalizerDispatcher.java:61)
    datahub-gms               | 	at com.linkedin.r2.transport.http.server.HttpDispatcher.handleRequest(HttpDispatcher.java:101)
    datahub-gms               | 	at com.linkedin.r2.transport.http.server.AbstractR2Servlet.service(AbstractR2Servlet.java:105)
    datahub-gms               | 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    datahub-gms               | 	at com.linkedin.restli.server.spring.ParallelRestliHttpRequestHandler.handleRequest(ParallelRestliHttpRequestHandler.java:63)
    datahub-gms               | 	at org.springframework.web.context.support.HttpRequestHandlerServlet.service(HttpRequestHandlerServlet.java:73)
    datahub-gms               | 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    datahub-gms               | 	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:852)
    datahub-gms               | 	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604)
    datahub-gms               | 	at com.datahub.authentication.filter.AuthenticationFilter.doFilter(AuthenticationFilter.java:77)
    datahub-gms               | 	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591)
    datahub-gms               | 	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:542)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    datahub-gms               | 	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:536)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
    datahub-gms               | 	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1581)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1307)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
    datahub-gms               | 	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:482)
    datahub-gms               | 	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1549)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1204)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
    datahub-gms               | 	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    datahub-gms               | 	at org.eclipse.jetty.server.Server.handle(Server.java:494)
    datahub-gms               | 	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:374)
    datahub-gms               | 	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:268)
    datahub-gms               | 	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    datahub-gms               | 	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    datahub-gms               | 	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
    datahub-gms               | 	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
    datahub-gms               | 	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
    datahub-gms               | 	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
    datahub-gms               | 	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
    datahub-gms               | 	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:367)
    datahub-gms               | 	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782)
    datahub-gms               | 	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:918)
    datahub-gms               | 	at java.lang.Thread.run(Thread.java:748)
    What could the issue be? I don't quite understand the error message. Thanks.
  • mammoth-fountain-32989

    04/22/2022, 12:23 PM
    Hi, I'm trying to ingest data from Hive into DataHub from the UI, which was set up with the quickstart Docker image. I provided the username and password in the recipe YAML but get the below error on execution: ValueError: Password should be set if and only if in LDAP or CUSTOM mode; Remove password or use one of those modes. This seems to be an issue with the pyhive package in the Docker image; it is trying to connect as PLAIN. Any workaround for this? Thanks
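    On the pyhive error above: pyhive raises that ValueError when a password is supplied while the connection is left in the default PLAIN/NONE auth mode. If the HiveServer2 actually uses LDAP or CUSTOM auth, declaring it via connect_args (the same mechanism the KERBEROS recipe earlier on this page uses) may help; the mode below is an assumption about the server's setup:

      source:
        type: hive
        config:
          host_port: hive:10000               # illustrative
          username: myuser                    # hypothetical credentials
          password: mypassword
          options:
            connect_args: {'auth': 'LDAP'}    # assumption: server is in LDAP mode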
  • bright-beard-86474

    04/22/2022, 6:38 PM
    Hello! Happy Friday! I'm new to DataHub. I followed all the steps in the Quickstart Guide and everything looks good except one thing: there is no metadata in the UI. I tried to ingest a file using the following:
    python3 -m datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml
    The output log says the pipeline finished successfully, with no warnings and no errors, but I don't see any records in the DataHub UI. Could someone please help me figure out where the blocker is? Thanks!