# ingestion
  • n

    numerous-address-22061

    05/25/2023, 5:23 PM
    Hello, I am noticing buggy behavior with the browse path of my ingested Kafka topics. Some are getting a nice, fully qualified browse path, and some are not. I am not explicitly defining the browse path in my ingestion; here is an example...
    Ingestion
    Copy code
    pipeline_name: ${PIPELINE_NAME}
    source:
      type: "kafka"
      config:
        platform_instance: ${CLUSTER_NAME}
        connection:
          bootstrap: ${BOOTSTRAP_BROKERS}
          consumer_config:
            security.protocol: "SASL_SSL"
            sasl.mechanism: "SCRAM-SHA-512"
            sasl.username: "${KAFKA_USERNAME}"
            sasl.password: "${KAFKA_PASSWORD}"
          schema_registry_url: ${SCHEMA_REGISTRY_URL}
    sink:
      type: "datahub-rest"
      config:
        server: ${DATAHUB_GMS_ENDPOINT}
    First topic
    (queried using GraphQL)
    Copy code
    {
      "data": {
        "dataset": {
          "urn": "urn:li:dataset:(urn:li:dataPlatform:kafka,platform-instance.org.db.app.topic_name,PROD)",
          "platform": {
            "name": "kafka"
          },
          "browsePaths": [
            {
              "path": [
                "prod",
                "kafka",
                "platform-instance",
                "org",
                "db",
                "app"
              ]
            }
          ],
          "properties": {
            "name": "org.db.app.topic_name"
          }
        }
      }
    }
    Second Topic
    (note this is undesired, and I can't figure out why it is getting a different browse path than the topic above)
    Copy code
    {
      "data": {
        "dataset": {
          "urn": "urn:li:dataset:(urn:li:dataPlatform:kafka,platform-instance.org.db.app.topic_name_2,PROD)",
          "platform": {
            "name": "kafka"
          },
          "browsePaths": [
            {
              "path": [
                "prod",
                "kafka",
                "platform-instance"
              ]
            }
          ],
          "properties": {
            "name": "org.db.app.topic_name_2"
          }
        }
      }
    }
    Why is the second browse path so short? It is very unfortunate for discovery in the UI.
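    If an explicit, fully qualified browse path is wanted regardless of what the source computes, one option is the set_dataset_browse_path transformer in the recipe. A minimal sketch, assuming that transformer is available in your CLI version; the instance segment in the template is a placeholder:
    Copy code
    transformers:
      - type: "set_dataset_browse_path"
        config:
          replace_existing: true  # overwrite whatever browse path the source produced
          path_templates:
            - /ENV/PLATFORM/platform-instance/DATASET_PARTS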
  • c

    creamy-ram-28134

    05/25/2023, 7:56 PM
    Hey all, I am having trouble executing ingestion. Can someone share examples for CSV and file ingestion?
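    For reference, minimal recipes for the two sources look roughly like this; paths and the server URL are placeholders, and the csv-enricher fields shown are just the common ones:
    Copy code
    # File source: replays metadata events from a local JSON file
    source:
      type: file
      config:
        path: /path/to/metadata_events.json
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
    Copy code
    # CSV enricher: applies owners, tags, terms and descriptions from a CSV file
    source:
      type: csv-enricher
      config:
        filename: /path/to/enrichment.csv
        write_semantics: PATCH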
  • b

    brainy-balloon-97302

    05/25/2023, 9:38 PM
    Hi all! I have a Glue ingestion job that constantly fails. It's failing with this error, and I was wondering if anyone has come across it before and was able to fix it?
    Copy code
    'failures': {'<s3://aws-glue-assets-XXXXXX-us-west-2/scripts/Untitled> job.py': ['Unable to download DAG for Glue job from <s3://aws-glue-assets-XXXXXX-us-west-2/scripts/Untitled> job.py, so job subtasks and lineage will be missing: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.', 'Unable to download DAG for Glue job from <s3://aws-glue-assets-XXXXXX-us-west-2/scripts/Untitled> job.py, so job subtasks and lineage will be missing: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.']}
    I don't have that file in S3, nor a Glue job called Untitled job.py, so I am trying to see what I can do to resolve this. The rest of the metadata is being pulled over, but it's annoying that it's being marked as a failure.
    ✅ 1
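    If the missing script only matters for job/DAG extraction, one possible workaround is to turn that part of the Glue source off. A sketch, assuming the extract_transforms flag (the region is a placeholder):
    Copy code
    source:
      type: glue
      config:
        aws_region: us-west-2
        extract_transforms: false  # skip downloading job scripts, so script-derived subtasks/lineage are not attempted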
  • h

    hundreds-airline-29192

    05/26/2023, 5:18 AM
    Hey, I am facing this error when ingesting from GCS. Please help me!
    ✅ 1
  • h

    hundreds-airline-29192

    05/26/2023, 7:55 AM
    Copy code
    botocore.exceptions.PaginationError: Error during pagination: The same next token was received twice: {'Marker': 'dwh/dev/fact/fact_gross_profit/order_date_key_07%3D20230109/part-00018-f1470254-2c8b-4a23-aaad-0260cdca7054.c000.snappy.parquet'}
  • h

    hundreds-airline-29192

    05/26/2023, 7:55 AM
    Can anyone who knows this error help me?
  • g

    gifted-bird-57147

    05/26/2023, 10:28 AM
    Hi Team, I receive the following warning in my ingestion recipe for our Athena source: '''Global Warnings: ['env is deprecated and will be removed in a future release. Please use platform_instance instead.']''' However, I think this warning is misleading: env points to the 'general' environment the data source is part of (so: urn:li:dataset:(urn:li:dataPlatform:athena,mytablename,PROD)), whereas platform_instance refers to a subset within the platform (so: urn:li:dataset:(urn:li:dataPlatform:athena,PROD.mytablename,PROD)). Is this a bug or a misunderstanding on my side?
    ✅ 1
  • f

    freezing-fall-69290

    05/26/2023, 10:32 AM
    Hi guys, can I ingest "dataset-notebook-dataset" lineage using the Python API?
    ✅ 1
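    Dataset-to-dataset lineage can be emitted with the Python SDK; a minimal sketch (the GMS URL, platform, and dataset names are placeholders, and a notebook entity in the middle would need its own aspect rather than this dataset-only helper):
    Copy code
    from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter("http://localhost:8080")

    upstream = make_dataset_urn(platform="gcs", name="bucket/input_table", env="PROD")
    downstream = make_dataset_urn(platform="gcs", name="bucket/output_table", env="PROD")

    # Build an upstream-lineage MCE for the downstream dataset and send it to GMS
    emitter.emit(make_lineage_mce(upstream_urns=[upstream], downstream_urn=downstream))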
  • b

    brainy-needle-61527

    05/26/2023, 12:46 PM
    Has anyone been able to visualize the lineage between AWS Glue and AWS Redshift?
  • b

    brainy-intern-50400

    05/26/2023, 4:34 PM
    Hi community, I am using the Python API a lot. Now we encounter the problem that we want to emit a lot of events, but with the DataHub emitter that takes time. Does somebody have a solution for emitting a list of MCP events to DataHub, or something similar? I thought about an MCP stack, which could be emitted in parallel with creating the events.
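    One workaround, assuming a REST sink and a reasonably recent SDK, is to keep producing MCPs while a thread pool fans out the emit() calls instead of emitting them one by one; a rough sketch (the GMS URL and the example aspect are placeholders):
    Copy code
    from concurrent.futures import ThreadPoolExecutor

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import StatusClass

    emitter = DatahubRestEmitter("http://localhost:8080")

    def build_mcps():
        # stand-in for whatever actually produces the MCP events
        for i in range(1000):
            yield MetadataChangeProposalWrapper(
                entityUrn=f"urn:li:dataset:(urn:li:dataPlatform:kafka,example.topic_{i},PROD)",
                aspect=StatusClass(removed=False),
            )

    # Emit concurrently; depending on the SDK version it may be safer to give each worker its own emitter
    with ThreadPoolExecutor(max_workers=10) as pool:
        for _ in pool.map(emitter.emit, build_mcps()):
            pass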
  • l

    late-addition-48515

    05/26/2023, 5:07 PM
    Hi everyone, I have ingested parquet files into DH from GCS. Is there a cleaner way of ingesting lineage data for parquet files that have been partitioned than my example in the comments?
  • r

    rapid-controller-60841

    05/29/2023, 8:29 AM
    Hi community! I would like to know whether the connection timeout can be set through configuration. If anyone knows, please tell me how to set it. Thank you very much!
    source:
      type: hive
      config:
        env: PROD
        platform: databricks
        host_port: 'http://JD-in-us.cloud.databricks.com/published'
        username: token
        password: '${databricks_token}'
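    There is no dedicated timeout field in a recipe like the one above, but SQL-based sources accept an options block that is passed through to SQLAlchemy's create_engine, so driver-level connection arguments can in principle go there. A hedged sketch; the connect_args keys depend entirely on the PyHive/Databricks driver in use and are only illustrative:
    Copy code
    source:
      type: hive
      config:
        # ...same connection settings as above...
        options:
          connect_args:
            # illustrative key only; check which arguments your Hive/Databricks driver accepts
            timeout: 600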
  • m

    many-rocket-80549

    05/29/2023, 10:36 AM
    Hi, I am evaluating DataHub to be implemented in our company. I am trying to ingest a sample file that you provide in the docs. I am not sure where I should drop the file; I just dropped it in a folder like so: /home/miquelp/datahub/file_onboarding/test_containers.json. However, I am getting an error while executing the recipe; it seems like it doesn't find the file (the error message could be improved?). I have looked for a similar error but couldn't find anything. Which Linux user is the one that executes the ingestion? Can you give us a hand? Thanks
    Copy code
    ~~~~ Execution Summary - RUN_INGEST ~~~~
    Execution finished with errors.
    {'exec_id': '06b7698c-048e-470e-bf2c-1ff4fca75bd0',
     'infos': ['2023-05-29 10:18:33.415813 INFO: Starting execution for task with name=RUN_INGEST',
               "2023-05-29 10:18:37.476974 INFO: Failed to execute 'datahub ingest'",
               '2023-05-29 10:18:37.477118 INFO: Caught exception EXECUTING task_id=06b7698c-048e-470e-bf2c-1ff4fca75bd0, name=RUN_INGEST, '
               'stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
     'errors': []}
    
    ~~~~ Ingestion Report ~~~~
    {
      "cli": {
        "cli_version": "0.10.0.7",
        "cli_entry_location": "/usr/local/lib/python3.10/site-packages/datahub/__init__.py",
        "py_version": "3.10.10 (main, Mar 14 2023, 02:37:11) [GCC 10.2.1 20210110]",
        "py_exec_path": "/usr/local/bin/python",
        "os_details": "Linux-5.15.0-72-generic-x86_64-with-glibc2.31",
        "peak_memory_usage": "57.82 MB",
        "mem_info": "57.82 MB"
      },
      "source": {
        "type": "file",
        "report": {
          "events_produced": 0,
          "events_produced_per_sec": 0,
          "entities": {},
          "aspects": {},
          "warnings": {},
          "failures": {},
          "total_num_files": 0,
          "num_files_completed": 0,
          "files_completed": [],
          "percentage_completion": "0%",
          "estimated_time_to_completion_in_minutes": -1,
          "total_bytes_read_completed_files": 0,
          "total_parse_time_in_seconds": 0,
          "total_count_time_in_seconds": 0,
          "total_deserialize_time_in_seconds": 0,
          "aspect_counts": {},
          "entity_type_counts": {},
          "start_time": "2023-05-29 10:18:35.206188 (now)",
          "running_time": "0 seconds"
        }
      },
      "sink": {
        "type": "datahub-rest",
        "report": {
          "total_records_written": 0,
          "records_written_per_second": 0,
          "warnings": [],
          "failures": [],
          "start_time": "2023-05-29 10:18:35.161225 (now)",
          "current_time": "2023-05-29 10:18:35.208860 (now)",
          "total_duration_in_seconds": 0.05,
          "gms_version": "v0.10.3",
          "pending_requests": 0
        }
      }
    }
    
    ~~~~ Ingestion Logs ~~~~
    Obtaining venv creation lock...
    Acquired venv creation lock
    venv setup time = 0
    This version of datahub supports report-to functionality
    datahub  ingest run -c /tmp/datahub/ingest/06b7698c-048e-470e-bf2c-1ff4fca75bd0/recipe.yml --report-to /tmp/datahub/ingest/06b7698c-048e-470e-bf2c-1ff4fca75bd0/ingestion_report.json
    [2023-05-29 10:18:35,113] INFO     {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.0.7
    No ~/.datahubenv file found, generating one for you...
    [2023-05-29 10:18:35,164] INFO     {datahub.ingestion.run.pipeline:184} - Sink configured successfully. DataHubRestEmitter: configured to talk to <http://datahub-gms:8080>
    [2023-05-29 10:18:35,206] INFO     {datahub.ingestion.run.pipeline:201} - Source configured successfully.
    [2023-05-29 10:18:35,207] INFO     {datahub.cli.ingest_cli:129} - Starting metadata ingestion
    [2023-05-29 10:18:35,209] INFO     {datahub.ingestion.reporting.file_reporter:52} - Wrote UNKNOWN report successfully to <_io.TextIOWrapper name='/tmp/datahub/ingest/06b7698c-048e-470e-bf2c-1ff4fca75bd0/ingestion_report.json' mode='w' encoding='UTF-8'>
    [2023-05-29 10:18:35,209] INFO     {datahub.cli.ingest_cli:134} - Source (file) report:
    {'events_produced': 0,
     'events_produced_per_sec': 0,
     'entities': {},
     'aspects': {},
     'warnings': {},
     'failures': {},
     'total_num_files': 0,
     'num_files_completed': 0,
     'files_completed': [],
     'percentage_completion': '0%',
     'estimated_time_to_completion_in_minutes': -1,
     'total_bytes_read_completed_files': 0,
     'total_parse_time_in_seconds': 0,
     'total_count_time_in_seconds': 0,
     'total_deserialize_time_in_seconds': 0,
     'aspect_counts': {},
     'entity_type_counts': {},
     'start_time': '2023-05-29 10:18:35.206188 (now)',
     'running_time': '0 seconds'}
    [2023-05-29 10:18:35,210] INFO     {datahub.cli.ingest_cli:137} - Sink (datahub-rest) report:
    {'total_records_written': 0,
     'records_written_per_second': 0,
     'warnings': [],
     'failures': [],
     'start_time': '2023-05-29 10:18:35.161225 (now)',
     'current_time': '2023-05-29 10:18:35.210294 (now)',
     'total_duration_in_seconds': 0.05,
     'gms_version': 'v0.10.3',
     'pending_requests': 0}
    [2023-05-29 10:18:35,809] ERROR    {datahub.entrypoints:188} - Command failed: Failed to process /home/miquelp/datahub/file_onboarding/test_containers.json
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/datahub/entrypoints.py", line 175, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
      File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
        return __callback(*args, **kwargs)
      File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 379, in wrapper
        raise e
      File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 334, in wrapper
        res = func(*args, **kwargs)
      File "/usr/local/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
        return func(ctx, *args, **kwargs)
      File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 198, in run
        loop.run_until_complete(run_func_check_upgrade(pipeline))
      File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
        return future.result()
      File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 158, in run_func_check_upgrade
        ret = await the_one_future
      File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 149, in run_pipeline_async
        return await loop.run_in_executor(
      File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 140, in run_pipeline_to_completion
        raise e
      File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 132, in run_pipeline_to_completion
        pipeline.run()
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 339, in run
        for wu in itertools.islice(
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/file.py", line 196, in get_workunits
        for f in self.get_filenames():
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/file.py", line 193, in get_filenames
        raise Exception(f"Failed to process {self.config.path}")
    Exception: Failed to process /home/miquelp/datahub/file_onboarding/test_containers.json
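    For context, the file source resolves the path on whatever machine actually runs the ingestion; with UI-based ingestion that is typically the executor container rather than the host, so a path on the host filesystem is usually not visible to it. A minimal sketch, assuming the file has been copied somewhere the ingestion process can read (the path is a placeholder):
    Copy code
    source:
      type: file
      config:
        path: /tmp/datahub/ingest/test_containers.json  # must be readable by the process running 'datahub ingest'
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080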
  • a

    acceptable-helmet-19082

    05/29/2023, 10:53 AM
    Hello, I am using DataHub to ingest the table metadata of Databricks, with Hive selected as the source. A data table with 200 million rows is currently stuck in the analysis step of the ingestion process, and the log reports that no new logs have been generated for many seconds (WARNING: These logs appear to be stale. No new logs have been received since 2023-05-26 102225.389811 (53443 seconds ago). However, the ingestion process still appears to be running and may complete normally.). I guess this may be caused by the execution time of the analysis SQL exceeding the Databricks connection timeout; the analysis SQL takes about two minutes to execute. So I want to know how to configure the timeout for the Databricks connection. Can you help me?
    ✅ 1
  • a

    astonishing-father-13229

    05/29/2023, 5:15 PM
    Hi Team, I'm facing a build issue for datahub/metadata-ingestion. Steps to reproduce: clone the datahub repository, cd metadata-ingestion, ../gradlew build. Screenshots are attached in the thread for reference. Could you please advise me? Thanks in advance 🙏
    ✅ 1
  • h

    hundreds-airline-29192

    05/30/2023, 2:21 AM
    Why can't my DataHub load data from Elasticsearch?
    ✅ 1
  • h

    hundreds-airline-29192

    05/30/2023, 2:24 AM
    Copy code
    com.linkedin.restli.server.RestLiServiceException: com.datahub.util.exception.ESQueryException: Search query failed:
            at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)
            at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)
            at com.linkedin.metadata.resources.usage.UsageStats.query(UsageStats.java:320)
            at com.linkedin.metadata.resources.usage.UsageStats.queryRange(UsageStats.java:386)
            at jdk.internal.reflect.GeneratedMethodAccessor375.invoke(Unknown Source)
            at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.base/java.lang.reflect.Method.invoke(Method.java:566)
            at com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:177)
            at com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333)
            at com.linkedin.restli.internal.server.filter.FilterChainDispatcherImpl.onRequestSuccess(FilterChainDispatcherImpl.java:47)
            at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:86)
            at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.lambda$onRequest$0(RestLiFilterChainIterator.java:73)
            at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:753)
            at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:731)
            at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2108)
            at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:72)
            at com.linkedin.restli.internal.server.filter.RestLiFilterChain.onRequest(RestLiFilterChain.java:55)
            at com.linkedin.restli.server.BaseRestLiServer.handleResourceRequest(BaseRestLiServer.java:262)
            at com.linkedin.restli.server.RestRestLiServer.handleResourceRequestWithRestLiResponse(RestRestLiServer.java:294)
            at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:262)
            at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:232)
            at com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:215)
            at com.linkedin.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:171)
            at com.linkedin.restli.server.RestLiServer.handleRequest(RestLiServer.java:130)
            at com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(DelegatingTransportDispatcher.java:70)
            at com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(DispatcherRequestFilter.java:70)
            at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
            at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
            at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
            at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
            at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)
            at com.linkedin.r2.filter.transport.ServerQueryTunnelFilter.onRestRequest(ServerQueryTunnelFilter.java:58)
            at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
            at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
            at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
            at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
            at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)
            at com.linkedin.r2.filter.message.rest.RestFilter.onRestRequest(RestFilter.java:50)
            at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
            at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
            at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
            at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
            at com.linkedin.r2.filter.FilterChainImpl.onRestRequest(FilterChainImpl.java:106)
            at com.linkedin.r2.filter.transport.FilterChainDispatcher.handleRestRequest(FilterChainDispatcher.java:75)
            at com.linkedin.r2.util.finalizer.RequestFinalizerDispatcher.handleRestRequest(RequestFinalizerDispatcher.java:61)
            at com.linkedin.r2.transport.http.server.HttpDispatcher.handleRequest(HttpDispatcher.java:101)
            at com.linkedin.r2.transport.http.server.AbstractR2Servlet.service(AbstractR2Servlet.java:105)
            at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
            at com.linkedin.restli.server.RestliHandlerServlet.service(RestliHandlerServlet.java:21)
            at com.linkedin.restli.server.RestliHandlerServlet.handleRequest(RestliHandlerServlet.java:26)
            at org.springframework.web.context.support.HttpRequestHandlerServlet.service(HttpRequestHandlerServlet.java:73)
            at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
            at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
            at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1631)
            at com.datahub.auth.authentication.filter.AuthenticationFilter.doFilter(AuthenticationFilter.java:102)
            at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
            at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
            at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
            at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
            at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
            at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
            at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
            at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
            at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
            at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
            at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
            at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
            at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
            at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
            at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
            at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
            at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
            at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
            at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
            at org.eclipse.jetty.server.Server.handle(Server.java:516)
            at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
            at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
            at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
            at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
            at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
            at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
            at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
            at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
            at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
            at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
            at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
            at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
            at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
            at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
            at java.base/java.lang.Thread.run(Thread.java:829)
    Caused by: com.datahub.util.exception.ESQueryException: Search query failed:
            at com.linkedin.metadata.timeseries.elastic.query.ESAggregatedStatsDAO.getAggregatedStats(ESAggregatedStatsDAO.java:375)
            at com.linkedin.metadata.timeseries.elastic.ElasticSearchTimeseriesAspectService.getAggregatedStats(ElasticSearchTimeseriesAspectService.java:216)
            at com.linkedin.metadata.resources.usage.UsageStats.getBuckets(UsageStats.java:182)
            at com.linkedin.metadata.resources.usage.UsageStats.lambda$query$1(UsageStats.java:348)
            at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)
            ... 89 common frames omitted
    Caused by: org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]
            at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:187)
            at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1911)
            at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1888)
            at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1645)
            at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1602)
            at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1572)
            at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:1088)
            at com.linkedin.metadata.timeseries.elastic.query.ESAggregatedStatsDAO.getAggregatedStats(ESAggregatedStatsDAO.java:371)
            ... 93 common frames omitted
            Suppressed: org.elasticsearch.client.ResponseException: method [POST], host
  • b

    bland-orange-13353

    05/30/2023, 2:24 AM
    This message was deleted.
  • h

    hundreds-airline-29192

    05/30/2023, 2:27 AM
    I know this is open source, but why does it have so many bugs and no stable version? Besides, the support team does not provide timely support.
  • h

    hundreds-airline-29192

    05/30/2023, 2:53 AM
    Copy code
    think about it: you are demoing datahub to the company and boom! Unable to load the description of tables
    ✅ 1
  • b

    bitter-evening-61050

    05/30/2023, 5:50 AM
    Hi Team, I have DataHub running in Kubernetes. We have created one user from Kubernetes and are able to log in to DataHub with it, but using this user we are not able to add tokens, glossary terms, owners, etc. The Permissions and Policy tabs are missing. Error: "Failed to add: Unauthorized to perform this action. Please contact your DataHub administrator." Can anyone please help me resolve this issue?
    ✅ 1
  • m

    microscopic-room-90690

    05/30/2023, 6:06 AM
    Hi team, I'm wondering how to exclude specific paths using a regex pattern for the S3 source. "**/*test*/**" works, while "**/(^|_)(tmp|temp|test)(_|$)/**" does not. Can anyone help?
    ✅ 1
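    For reference, the exclude entries in the S3 source's path_specs are wildcard globs rather than full regular expressions, so alternation like (tmp|temp|test) is not expected to match; listing the variants as separate globs is the safer pattern. A sketch with a placeholder bucket and prefixes:
    Copy code
    source:
      type: s3
      config:
        path_specs:
          - include: 's3://my-bucket/data/*/*.parquet'
            exclude:
              - '**/tmp/**'
              - '**/temp/**'
              - '**/*test*/**'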
  • l

    lemon-scooter-69730

    05/30/2023, 11:06 AM
    Hello, while trying to ingest with the DataHub Kafka sink I keep getting this error:
    Copy code
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (bigquery): Missing provider configuration.
    This is what the recipe looks like
    Copy code
    pipeline_name: analytics
    source:
        type: bigquery
        config:
            env: DEV
            include_table_lineage: true
            include_usage_statistics: true
            include_tables: true
            include_views: true
            profiling:
                enabled: true
                profile_table_level_only: false
            stateful_ingestion:
                enabled: true
            credential:
                project_id: <redacted>
                private_key: <redacted>
                private_key_id: <redacted>
                client_email: <redacted>
                client_id: <redacted>
    sink:
        type: datahub-kafka
        config:
            connection:
                bootstrap: 'datahub-prerequisites-kafka:9092'
                schema_registry_url: '<http://datahub-prerequisites-cp-schema-registry:8081>'
    ✅ 1
  • m

    microscopic-elephant-47912

    05/30/2023, 12:08 PM
    Untitled.txt
    ✅ 1
  • m

    microscopic-elephant-47912

    05/30/2023, 12:09 PM
    Hi team, I'm using the quickstart on Docker, and after ingesting Looker metadata I cannot see any of it in the UI. I checked Kafka and can see the events in the topics, and when I checked the GMS Docker logs I saw many consumer errors like the ones above.
  • n

    narrow-bear-42430

    05/30/2023, 2:57 PM
    Hi DataHub folks - this was asked a while ago, but I was wondering if anyone in the community has done any work to integrate Thoughtspot with DataHub? Any pointers/ info would be gratefully received! Thank you
    ✅ 1
  • a

    astonishing-father-13229

    05/30/2023, 8:37 PM
    Hi Team, I'm facing a build issue for datahub. Steps to reproduce: clone the datahub repository, cd datahub, ../gradlew build. Screenshots are attached in the thread for reference. Could you please advise me? Thanks in advance 🙏
    ✅ 1
  • g

    great-rainbow-70545

    05/30/2023, 9:52 PM
    I've been working through a bunch of permutations trying to get Hive ingestion working. There is no auth, access is controlled via security groups, and I have verified that I can connect from the container. The error is always: Command failed: TSocket read 0 bytes. Not finding much via Google other than a possibly wrong Thrift version, but I figured that would be coming up for other people as well. Ring any bells?
  • f

    few-air-34037

    05/31/2023, 4:44 AM
    The new Power BI ingestion uses platform_instance tags for lineage. We haven't used platform_instance yet, but we have added a lot of metadata to objects... If we now add platform_instance, we get a new hierarchy and new objects. What would be the easiest way to migrate metadata from the old objects without platform_instance to the new ones with it?
    ✅ 1
  • c

    cool-architect-34612

    05/31/2023, 5:01 AM
    Hi, I want to ingest Presto datasets but something is slow: the platform was ingested first and the datasets were ingested afterwards. Why does it work like this?
    ✅ 1