gifted-queen-61023
09/08/2021, 11:20 AM
When running ./gradlew build I keep getting stuck at 99% due to the installation of dependencies (with pip).
It seems to have difficulties with docutils and dill from metadata-ingestion's installDev.
Should they have stricter version intervals in metadata-ingestion/setup.py or something of that sort? Am I doing something wrong?
Screenshot from my 4th attempt.

adorable-judge-53430
09/08/2021, 11:46 AM
source:
  type: datahub-business-glossary
  config:
    # Coordinates
    file: ~/business_glossary.yml
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
If I then run the ingestion, I get the following error:
KeyError: 'Did not find a registered class for datahub-business-glossary'
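[Editor's note: this KeyError typically means the installed acryl-datahub version predates the business-glossary source; upgrading the package and re-running `datahub check plugins` to confirm the source is listed usually resolves it. A hypothetical sketch (not DataHub's actual code) of why a plugin registry raises exactly this message:]

```python
# Minimal sketch of an ingestion-plugin registry; class and method names
# here are illustrative, not DataHub's real implementation.

class SourceRegistry:
    def __init__(self):
        self._registry = {}

    def register(self, name, cls):
        self._registry[name] = cls

    def get(self, name):
        if name not in self._registry:
            # Mirrors the error above: the source type was never registered,
            # usually because the installed package does not ship that plugin.
            raise KeyError(f"Did not find a registered class for {name}")
        return self._registry[name]

registry = SourceRegistry()
registry.register("postgres", object)  # postgres works...

try:
    registry.get("datahub-business-glossary")  # ...but the glossary source is absent
except KeyError as e:
    print(e)
```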
Ingesting data from Postgres and other sources worked great though. Any ideas what's happening here?

curved-sandwich-81699
09/09/2021, 7:59 PM
source:
  type: "snowflake"
  config:
    username: ...
    password: ...
    host_port: ...
    database_pattern:
      ignoreCase: true
      allow:
        - "database"
    schema_pattern:
      ignoreCase: true
      allow:
        - "schema"
    table_pattern:
      ignoreCase: true
      deny:
        - ".*"
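[Editor's note: the allow/deny values are regular expressions matched against the fully qualified name (e.g. database.schema.table for table_pattern), so an unescaped dot matches any character. A rough sketch of allow/deny semantics, under the assumption that deny rules win over allow rules and patterns are matched from the start of the name; if tables still come through with deny [".*"], the pattern may not be applied by the installed version, or the config key may not match what that version expects.]

```python
import re

# Rough sketch of allow/deny pattern matching as commonly implemented by
# ingestion frameworks (an assumption, not DataHub's exact code).

def allowed(name, allow=(".*",), deny=(), ignore_case=True):
    flags = re.IGNORECASE if ignore_case else 0
    # Deny rules win over allow rules; each value is a regex matched
    # from the start of the fully qualified name.
    if any(re.match(p, name, flags) for p in deny):
        return False
    return any(re.match(p, name, flags) for p in allow)

# With deny [".*"] every table name should be rejected:
print(allowed("database.schema.my_table", deny=[".*"]))  # False
# The dot is a regex metacharacter; escape it to match literally:
print(allowed("databaseXschema.t", allow=[r"database\.schema\..*"]))  # False
```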
The tables from database.schema are still getting ingested. Same thing if using database.* or database.schema.* as table_pattern.deny... Or am I missing something?

handsome-belgium-11927
09/10/2021, 8:20 AM
Caused by: java.net.URISyntaxException: Urn entity type should be 'dataset'.: urn:li:dataset:(urn:li:dataPlatform:exasol,main.dds.h_car,PROD)
The urn is correct for sure; I used it for other examples, like profiling.
Any help would be much appreciated.

fresh-carpet-31048
09/10/2021, 10:03 PM

square-activity-64562
09/13/2021, 8:24 AM

square-activity-64562
09/13/2021, 9:18 AM
_ but the complete one does

square-activity-64562
09/13/2021, 9:33 AM
+ Add Description
is shown. If we move the cursor over the schema descriptions (all empty) it feels like the schema is jumping. Probably need to increase the height of the row.

square-activity-64562
09/13/2021, 9:41 AM

microscopic-musician-99632
09/13/2021, 9:53 AM

cool-state-20157
09/13/2021, 10:10 PM

curved-jordan-15657
09/14/2021, 10:02 AM
09:57:26.739 [Thread-6973] ERROR c.l.d.g.a.service.AnalyticsService - Search query failed: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]
09:57:26.739 [Thread-6973] ERROR c.l.d.g.e.DataHubDataFetcherExceptionHandler - Failed to execute DataFetcher
java.lang.RuntimeException: Search query failed:
at com.linkedin.datahub.graphql.analytics.service.AnalyticsService.executeAndExtract(AnalyticsService.java:245)
at com.linkedin.datahub.graphql.analytics.service.AnalyticsService.getHighlights(AnalyticsService.java:216)
at com.linkedin.datahub.graphql.analytics.resolver.GetHighlightsResolver.getHighlights(GetHighlightsResolver.java:50)
at com.linkedin.datahub.graphql.analytics.resolver.GetHighlightsResolver.get(GetHighlightsResolver.java:29)
at com.linkedin.datahub.graphql.analytics.resolver.GetHighlightsResolver.get(GetHighlightsResolver.java:19)
at graphql.execution.ExecutionStrategy.fetchField(ExecutionStrategy.java:270)
at graphql.execution.ExecutionStrategy.resolveFieldWithInfo(ExecutionStrategy.java:203)
at graphql.execution.AsyncExecutionStrategy.execute(AsyncExecutionStrategy.java:60)
at graphql.execution.Execution.executeOperation(Execution.java:165)
at graphql.execution.Execution.execute(Execution.java:104)
at graphql.GraphQL.execute(GraphQL.java:557)
at graphql.GraphQL.parseValidateAndExecute(GraphQL.java:482)
at graphql.GraphQL.executeAsync(GraphQL.java:446)
at graphql.GraphQL.execute(GraphQL.java:377)
at com.linkedin.datahub.graphql.GraphQLEngine.execute(GraphQLEngine.java:88)
at com.datahub.metadata.graphql.GraphQLController.lambda$postGraphQL$0(GraphQLController.java:82)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]
at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:187)
at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1892)
at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1869)
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1626)
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1583)
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1553)
at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:1069)
at com.linkedin.datahub.graphql.analytics.service.AnalyticsService.executeAndExtract(AnalyticsService.java:240)
... 17 common frames omitted
Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [https://vpc-datahub-o67waaz2xr5zttbor35tgmlksa.us-east-1.es.amazonaws.com:443], URI [/datahub_datahub_usage_event/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [browserId] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"datahub_datahub_usage_event","node":"M5OibEC5ThKefEm2b1wR4Q","reason":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [browserId] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}}],"caused_by":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [browserId] in order to load field data by uninverting the inverted index. Note that this can use significant memory.","caused_by":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [browserId] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}}},"status":400}
at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:302)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:272)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:246)
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1613)
... 21 common frames omitted
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=illegal_argument_exception, reason=Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [browserId] in order to load field data by uninverting the inverted index. Note that this can use significant memory.]
at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:496)
at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:407)
at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:437)
at org.elasticsearch.ElasticsearchException.failureFromXContent(ElasticsearchException.java:603)
at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:179)
... 24 common frames omitted
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=illegal_argument_exception, reason=Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [browserId] in order to load field data by uninverting the inverted index. Note that this can use significant memory.]
at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:496)
at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:407)
at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:437)
... 28 common frames omitted
09:57:26.740 [Thread-6973] ERROR c.d.m.graphql.GraphQLController - Errors while executing graphQL query: "query getHighlights {\n getHighlights {\n value\n title\n body\n __typename\n }\n}\n", result: {errors=[{message=An unknown error occurred., locations=[{line=2, column=3}], path=[getHighlights], extensions={code=500, classification=DataFetchingException}}], data=null}, errors: [DataHubGraphQLError{path=[getHighlights], code=SERVER_ERROR, locations=[SourceLocation{line=2, column=3}]}]
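[Editor's note: the actionable part of this trace is the suppressed 400 response. The analytics query aggregates on browserId, which is mapped as a text field in the datahub_datahub_usage_event index, and Elasticsearch refuses to build fielddata for text fields. The usual remedies are recreating the index with the mapping DataHub ships (so browserId is a keyword) or, as a stopgap, aggregating on a keyword sub-field if the mapping defines one. A sketch of the request and mapping shapes involved; the aggregation name is illustrative:]

```python
import json

# The failing aggregation targets a text field (sketch of the request shape):
failing = {"aggs": {"users": {"cardinality": {"field": "browserId"}}}}

# If the index mapping defines a keyword sub-field (a common ES convention,
# assumed here, not guaranteed for this index), aggregating on it works:
working = {"aggs": {"users": {"cardinality": {"field": "browserId.keyword"}}}}

# A mapping that stores browserId as a keyword avoids the problem at the
# source; this is what a corrected index template could carry:
mapping = {"mappings": {"properties": {"browserId": {"type": "keyword"}}}}

print(json.dumps(working))
```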
How do I resolve the issue?

bland-orange-13353
09/14/2021, 3:48 PM

better-orange-49102
09/15/2021, 6:50 AM

millions-soccer-98440
09/15/2021, 8:56 AM
from datahub.ingestion.run.pipeline import Pipeline

def ingest_metadata(**kwargs):
    """
    :param ingest_param: source & sink datahub param
    :type ingest_param: json/struct
    """
    ingest_param = kwargs.get('ingest_param')
    pipeline = Pipeline.create(ingest_param)
    pipeline.run()
    pipeline.raise_from_status()

kafka_connect = {
    "source": {
        "type": "kafka-connect",
        "config": {
            "connect_uri": "http://127.0.0.1:8083",
            "cluster_name": "ts-connect",
        },
    },
    "sink": {
        "type": "datahub-kafka",
        "config": {
            "connection": {
                "bootstrap": "127.0.0.1:19092",
                "schema_registry_url": "http://127.0.0.1:17081"
            }
        },
    },
}

ingest_metadata(ingest_param=kafka_connect)
I get this error after running the code:
Skipping connector saleordering-postcodes. Sink Connector not yet implemented
Skipping connector thestreet-image-receipts. Sink Connector not yet implemented
Traceback (most recent call last):
File "kafkaconnect.py", line 36, in <module>
ingest_metadata(ingest_param=kafka_connect)
File "kafkaconnect.py", line 14, in ingest_metadata
pipeline.run()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/datahub/ingestion/run/pipeline.py", line 108, in run
for wu in self.source.get_workunits():
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/datahub/ingestion/source/kafka_connect.py", line 468, in get_workunits
connectors_manifest = self.get_connectors_manifest()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/datahub/ingestion/source/kafka_connect.py", line 308, in get_connectors_manifest
connector_manifest.topic_names = topics[c]["topics"]
KeyError: 'sales-ordering-prod-v5'
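[Editor's note: the traceback shows the source assuming every connector returned by the Connect REST API also appears in its topics map; the connector sales-ordering-prod-v5 is missing there (perhaps paused or without active tasks), so topics[c]["topics"] raises KeyError. A sketch of a tolerant lookup as a local workaround; the variable shapes are assumed from the traceback, not copied from DataHub's source:]

```python
# Sketch of a defensive version of the failing lookup in kafka_connect.py.

topics = {
    "working-connector": {"topics": ["orders-v1"]},
    # "sales-ordering-prod-v5" is absent, e.g. a paused connector
}

def topic_names_for(connector_name, topics):
    # .get() with a default avoids the KeyError and simply yields no topics
    # for connectors the Connect API did not report topic state for.
    return topics.get(connector_name, {}).get("topics", [])

print(topic_names_for("working-connector", topics))       # ['orders-v1']
print(topic_names_for("sales-ordering-prod-v5", topics))  # []
```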
cool-state-20157
09/15/2021, 5:22 PM
ERROR: for datahub-frontend-react Cannot start service datahub-frontend-react: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "datahub-frontend/bin/playBinary": stat datahub-frontend/bin/playBinary: no such file or directory: unknown
ERROR: for datahub-frontend-react Cannot start service datahub-frontend-react: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "datahub-frontend/bin/playBinary": stat datahub-frontend/bin/playBinary: no such file or directory: unknown
ERROR: Encountered errors while bringing up the project.
hundreds-twilight-96303
09/15/2021, 7:18 PM
[root@QgY85nPtI2 ~]# python3 -m datahub check plugins
Sources:
athena
azure-ad
bigquery (disabled)
bigquery-usage (disabled)
datahub-business-glossary
dbt
druid (disabled)
feast
file
glue (disabled)
hive (disabled)
kafka (disabled)
kafka-connect
ldap (disabled)
looker (disabled)
lookml (disabled)
mongodb (disabled)
mssql (disabled)
mysql
okta (disabled)
oracle (disabled)
postgres (disabled)
redash (disabled)
redshift (disabled)
sagemaker (disabled)
snowflake (disabled)
snowflake-usage (disabled)
sqlalchemy
superset
Sinks:
console
datahub-kafka (disabled)
datahub-rest
file
Transformers:
add_dataset_ownership
add_dataset_tags
mark_dataset_status
pattern_add_dataset_ownership
set_dataset_browse_path
simple_add_dataset_ownership
simple_add_dataset_tags
simple_remove_dataset_ownership
And when I try to ingest data with the sql-profile option enabled, I encounter an error telling me: Table profiles requested but profiler plugin is not enabled. Try running: pip install 'acryl-datahub[sql-profiles]'
File "/usr/local/python3/lib/python3.6/site-packages/datahub/entrypoints.py", line 91, in main
sys.exit(datahub(standalone_mode=False, **kwargs))
File "/usr/local/python3/lib/python3.6/site-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/usr/local/python3/lib/python3.6/site-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/usr/local/python3/lib/python3.6/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/python3/lib/python3.6/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/python3/lib/python3.6/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/python3/lib/python3.6/site-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/usr/local/python3/lib/python3.6/site-packages/datahub/cli/ingest_cli.py", line 52, in run
pipeline = Pipeline.create(pipeline_config)
File "/usr/local/python3/lib/python3.6/site-packages/datahub/ingestion/run/pipeline.py", line 103, in create
return cls(config)
File "/usr/local/python3/lib/python3.6/site-packages/datahub/ingestion/run/pipeline.py", line 72, in __init__
self.config.source.dict().get("config", {}), self.ctx
File "/usr/local/python3/lib/python3.6/site-packages/datahub/ingestion/source/sql/mysql.py", line 23, in create
return cls(config, ctx)
File "/usr/local/python3/lib/python3.6/site-packages/datahub/ingestion/source/sql/mysql.py", line 18, in __init__
super().__init__(config, ctx, "mysql")
File "/usr/local/python3/lib/python3.6/site-packages/datahub/ingestion/source/sql/sql_common.py", line 278, in __init__
    "Table profiles requested but profiler plugin is not enabled. "
ConfigurationError: Table profiles requested but profiler plugin is not enabled. Try running: pip install 'acryl-datahub[sql-profiles]'
Could someone help me out? Many thanks in advance.

square-activity-64562
09/16/2021, 10:29 AM

adamant-pharmacist-61996
09/17/2021, 4:17 AM

handsome-belgium-11927
09/20/2021, 4:01 PM
/browse/chart/tableau. I've got 2 charts there and I can find them via search, but not through browsing. Where can I look for this error description? I've tried searching the docker logs but no luck yet.

millions-soccer-98440
09/20/2021, 5:21 PM

millions-soccer-98440
09/20/2021, 5:53 PM
{
    "source": {
        "type": "postgres",
        "config": {
            "username": login,
            "password": password,
            "database": "user_activity",
            "host_port": host,
            "schema_pattern": {
                "deny": ["information_schema"]
            }
        },
    },
    "sink": {
        "type": "datahub-kafka",
        "config": {
            "connection": {
                "bootstrap": "prerequisites-kafka.datahub:9092",
                "schema_registry_url": "http://prerequisites-cp-schema-registry.datahub:8081"
            }
        },
    },
}
straight-dentist-7439
09/21/2021, 7:50 AM
KeyError: 'Did not find a registered class for datahub-business-glossary'
Any ideas?

handsome-belgium-11927
09/22/2021, 3:26 PM

proud-jelly-46237
09/22/2021, 8:29 PM
Failed to pull image "acryldata/datahub-mysql-setup:v0.8.14": rpc error: code = Unknown desc = Error response from daemon: manifest for acryldata/datahub-mysql-setup:v0.8.14 not found: manifest unknown: manifest unknown
careful-artist-3840
09/23/2021, 12:10 AM
2021-09-22T20:07:53-04:00 00:07:53 [application-akka.actor.default-dispatcher-197] WARN auth.sso.oidc.OidcCallbackLogic - Failed to extract groups: No OIDC claim with name groups found
What would cause this error?

adamant-van-40260
09/23/2021, 10:43 AM

colossal-furniture-76714
09/23/2021, 3:31 PM

colossal-furniture-76714
09/23/2021, 3:57 PM

rough-garage-43684
09/24/2021, 8:44 AM
mutation updateDataset($input: DatasetUpdateInput!) {
It runs successfully in the datahub-frontend-react GraphiQL (localhost:9002/api/graphiql)
but fails in the metadata-service GraphiQL (localhost:8080/api/graphiql),
with this log in the datahub-gms backend. Am I missing something?
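[Editor's note: a likely explanation, offered as an assumption rather than something the log confirms: the frontend's GraphiQL runs with the browser session's authentication, while calls against GMS on port 8080 hit a separate endpoint that, depending on the DataHub version, may require an explicit actor or authorization header. The mutation body itself is the same either way; a sketch of the JSON payload such a mutation posts, with the selection set left elided as in the original message and the variables purely hypothetical:]

```python
import json

# Sketch of the GraphQL request body for the updateDataset mutation above.
# The selection set ("...") and the urn value are placeholders, not real data.
payload = {
    "query": "mutation updateDataset($input: DatasetUpdateInput!) { ... }",
    "variables": {
        "input": {
            # Hypothetical example urn; the actual DatasetUpdateInput fields
            # depend on the deployed schema version.
            "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,example.table,PROD)",
        }
    },
}

print(json.dumps(payload)[:60])
```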