salmon-area-51650
06/06/2023, 9:57 AM
I'm running v0.9.6 and I'm getting the following error in `datahub-gms`:
Caught exception while executing bootstrap step IngestRolesStep. Continuing...
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms com.amazonaws.services.schemaregistry.exception.AWSSchemaRegistryException: Exception occurred while fetching or registering schema definition = {"type":"record","name":"MetadataChangeLog","namespace":"com.linkedin.pegasus2avro.mxe","doc":"Kafka event for capturing update made to an entity's metadata.","fields":[{"name":"auditHeader","type":["null",{"type":
...
...
:"An audit stamp detailing who and when the aspect was changed by. Required for all intents and purposes.","default":null}]}, schema name = MetadataChangeLog_Versioned_v1
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.amazonaws.services.schemaregistry.common.SchemaByDefinitionFetcher.getORRegisterSchemaVersionId(SchemaByDefinitionFetcher.java:99)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.amazonaws.services.schemaregistry.serializers.GlueSchemaRegistrySerializationFacade.getOrRegisterSchemaVersion(GlueSchemaRegistrySerializationFacade.java:86)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.amazonaws.services.schemaregistry.serializers.GlueSchemaRegistryKafkaSerializer.serialize(GlueSchemaRegistryKafkaSerializer.java:113)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at org.apache.kafka.common.serialization.Serializer.serialize(Serializer.java:62)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:902)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:862)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.linkedin.metadata.dao.producer.KafkaEventProducer.produceMetadataChangeLog(KafkaEventProducer.java:114)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.linkedin.metadata.entity.EntityService.produceMetadataChangeLog(EntityService.java:1286)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.linkedin.metadata.entity.EntityService.produceMetadataChangeLog(EntityService.java:1311)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.linkedin.metadata.boot.steps.IngestRolesStep.ingestRole(IngestRolesStep.java:111)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.linkedin.metadata.boot.steps.IngestRolesStep.execute(IngestRolesStep.java:79)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.linkedin.metadata.boot.BootstrapManager.lambda$start$0(BootstrapManager.java:44)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at java.base/java.lang.Thread.run(Thread.java:829)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms Caused by: com.amazonaws.services.schemaregistry.exception.AWSSchemaRegistryException: Failed to get schemaVersionId by schema definition for schema name = MetadataChangeLog_Versioned_v1
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.amazonaws.services.schemaregistry.common.AWSSchemaRegistryClient.getSchemaVersionIdByDefinition(AWSSchemaRegistryClient.java:148)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.amazonaws.services.schemaregistry.common.SchemaByDefinitionFetcher$SchemaDefinitionToVersionCache.load(SchemaByDefinitionFetcher.java:110)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.amazonaws.services.schemaregistry.common.SchemaByDefinitionFetcher$SchemaDefinitionToVersionCache.load(SchemaByDefinitionFetcher.java:106)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.google.common.cache.LocalCache.get(LocalCache.java:3951)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3974)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4935)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.amazonaws.services.schemaregistry.common.SchemaByDefinitionFetcher.getORRegisterSchemaVersionId(SchemaByDefinitionFetcher.java:74)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms ... 15 common frames omitted
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms Caused by: software.amazon.awssdk.services.glue.model.AccessDeniedException: User: arn:aws:sts::277977467804:assumed-role/eks-node-iam-role/XXXX is not authorized to perform: glue:GetSchemaByDefinition on resource: arn:aws:glue:us-east-1:XXXXXXX:registry/default-registry because no identity-based policy allows the glue:GetSchemaByDefinition action (Service: Glue, Status Code: 400, Request ID: 7df19e37-f1f8-40ab-9032-d24796a3972b)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleErrorResponse(CombinedResponseHandler.java:125)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:82)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:60)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:41)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:30)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
...
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms at com.amazonaws.services.schemaregistry.common.AWSSchemaRegistryClient.getSchemaVersionIdByDefinition(AWSSchemaRegistryClient.java:144)
datahub-datahub-gms-5ff98bf459-st6pn datahub-gms ... 25 common frames omitted
It’s like it’s trying to access AWS Glue instead of the schema registry, even though I have set up the schema registry in the global variables. Any ideas 🙏?
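For reference, the root cause in the trace is the final `AccessDeniedException`: the EKS node role has no identity-based policy allowing `glue:GetSchemaByDefinition`, which confirms GMS is going down the Glue serializer path. If Glue is the intended registry, a policy along these lines on the node (or IRSA) role should unblock it — a minimal sketch only; the exact action list and ARNs below are assumptions inferred from the error message, not an official policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetSchemaByDefinition",
        "glue:GetSchemaVersion",
        "glue:RegisterSchemaVersion",
        "glue:CreateSchema"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:<account-id>:registry/default-registry",
        "arn:aws:glue:us-east-1:<account-id>:schema/default-registry/*"
      ]
    }
  ]
}
```

If a Confluent/Kafka schema registry was intended instead, double-check which registry type GMS was deployed with (e.g. the `SCHEMA_REGISTRY_TYPE` / schema registry URL settings in the deployment values); the Glue classes in the stack trace mean the Glue path is still active regardless of the registry URL.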
boundless-student-48844
06/06/2023, 10:12 AM
brash-plumber-28960
06/06/2023, 10:58 AM
brash-plumber-28960
06/06/2023, 11:12 AM
important-bear-9390
06/06/2023, 12:20 PM
refined-continent-34966
06/06/2023, 1:28 PM
I was getting `document_missing` errors, but if I queried the postgres db directly I could find that domain. If I added a dataset to that domain, I was also able to find it from that dataset's profile under Domains.
Running the reindexing cron job seemed to have no effect, but then after a few days, entities that were added would suddenly appear in the UI. Has anyone experienced anything similar? I'm tempted to remove all data from the neo4j and elasticsearch volumes and then try the reindexing cron job again.
astonishing-answer-88639
06/06/2023, 4:40 PM
When I run `datahub docker quickstart`, it gives me the error below and then gets stuck. What do I need to do to get to the UI?
able-evening-90828
06/06/2023, 7:23 PM
In the `datasetindex_v2` index, we have discovered that all the slowness came from the `simple_query_string` clause with the `query_word_delimited` analyzer. If we removed this clause from the query, the query returned in 0.5 seconds; otherwise, it took more than 12 seconds. Is there any way to disable this particular `simple_query_string` via some settings?
We are on `10.2.2`.
{
  "simple_query_string": {
    "query": "parquet",
    "fields": [
      "id.delimited^4.0",
      "editedFieldDescriptions.delimited^0.040000003",
      "fieldDescriptions.delimited^0.040000003",
      "name.delimited^4.0",
      "description.delimited^0.4",
      "fieldLabels.delimited^0.080000006",
      "urn.delimited^5.0",
      "fieldPaths.delimited^2.0",
      "qualifiedName.delimited^4.0",
      "editedDescription.delimited^0.4"
    ],
    "analyzer": "query_word_delimited",
    "flags": -1,
    "default_operator": "and",
    "analyze_wildcard": false,
    "auto_generate_synonyms_phrase_query": true,
    "fuzzy_prefix_length": 0,
    "fuzzy_max_expansions": 50,
    "fuzzy_transpositions": true,
    "boost": 1
  }
}
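On disabling it via settings: recent DataHub releases ship a search-query customization hook that can switch off the simple-query clause per query pattern. A hedged sketch, assuming your build has the custom search configuration feature (typically enabled with something like `ELASTICSEARCH_QUERY_CUSTOM_CONFIG_ENABLED=true` plus a mounted config file) — verify the exact field names against the docs for your version:

```yaml
# hypothetical search_config.yml; field names follow DataHub's search
# customization docs, but confirm against your server version
queryConfigurations:
  - queryRegex: .*           # match all queries
    simpleQuery: false       # drop the simple_query_string clause
    prefixMatchQuery: true   # keep prefix matching
    exactMatchQuery: true    # keep exact-match boosting
```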
microscopic-country-10588
06/06/2023, 7:26 PM
early-hydrogen-27542
06/06/2023, 8:32 PM
`searchAcrossEntities` appears to slow down heavily past a certain pagination threshold. It gets progressively slower as I increase `start`. For instance, this is fast...
{
searchAcrossEntities(
input: {types: DATASET, start: 0, count: 10, query: "*"}
) {
total
}
}
This is a bit slower...
{
searchAcrossEntities(
input: {types: DATASET, start: 100, count: 10, query: "*"}
) {
total
}
}
This is slower still...
{
searchAcrossEntities(
input: {types: DATASET, start: 250, count: 10, query: "*"}
) {
total
}
}
And this fails with a `503 Service Unavailable`...
{
searchAcrossEntities(
input: {types: DATASET, start: 500, count: 10, query: "*"}
) {
total
}
}
Do you all have any ideas on speeding this up?
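This is the classic Elasticsearch deep-pagination cost: a `from`/`size` search has to collect and rank `start + count` hits on every request, so latency grows with `start` until the cluster gives up. If the server version supports it, a scroll-style API sidesteps that by carrying a cursor instead of an offset. A sketch assuming the `scrollAcrossEntities` GraphQL field is available in your deployment (check the schema in GraphiQL first):

```graphql
{
  scrollAcrossEntities(input: {types: DATASET, count: 10, query: "*"}) {
    # pass nextScrollId back as input.scrollId on the next request
    # instead of incrementing start
    nextScrollId
    searchResults {
      entity {
        urn
      }
    }
  }
}
```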
strong-potato-63475
06/07/2023, 1:15 AM
ripe-eye-60209
06/07/2023, 9:32 AM
loud-painting-41553
06/07/2023, 1:12 PM
swift-processor-45491
06/07/2023, 2:25 PM
victorious-monkey-86128
06/07/2023, 2:30 PM
When I run `./gradlew quickstart`, it looks like it froze at a container instantiation. How could I solve this problem?
mammoth-breakfast-21990
06/07/2023, 5:50 PM
witty-journalist-16013
06/07/2023, 8:36 PM
elegant-student-62491
06/07/2023, 10:35 PM
brief-advantage-89816
06/08/2023, 2:51 AM
My database is `dev`, and my `source.config.env` = `dev`. But for some reason my urn is getting this PROD:
urn:li:dataset:(urn:li:dataPlatform:redshift,dev.<schema_name>.salesorder,PROD)
I don’t have a clue where this PROD is coming from.
{process_utils.py:187} INFO - Source (redshift) report:
{process_utils.py:187} INFO - {'aspects': {'container': {'container': 3, 'containerProperties': 4, 'dataPlatformInstance': 4, 'status': 4, 'subTypes': 4},
{process_utils.py:187} INFO - 'dataset': {'container': 95, 'datasetProfile': 1, 'datasetProperties': 95, 'schemaMetadata': 95, 'subTypes': 95}},
{process_utils.py:187} INFO - 'entities': {'container': ['urn:li:container:e46efd1c881f4d6ee511bcb6024fdaf8',
{process_utils.py:187} INFO - 'urn:li:container:e99a7636015d37a29ddf5e05efeacf57',
{process_utils.py:187} INFO - 'urn:li:container:4ddad5b8ba6c86bf31cb5d757fe631e9',
{process_utils.py:187} INFO - 'urn:li:container:94782e3c226027a0cf9a9b12c5eddc1d'],
{process_utils.py:187} INFO - 'dataset': ['urn:li:dataset:(urn:li:dataPlatform:redshift,dev.<schema_name>.salesorder,PROD)',
{process_utils.py:187} INFO - 'urn:li:dataset:(urn:li:dataPlatform:redshift,dev.<schema_name>.vicidial_users,PROD)',
{process_utils.py:187} INFO - 'urn:li:dataset:(urn:li:dataPlatform:redshift,dev.<schema_name>.dim_contact,PROD)',
{process_utils.py:187} INFO - 'urn:li:dataset:(urn:li:dataPlatform:redshift,dev.<schema_name>.dim_lead_crm,PROD)',
{process_utils.py:187} INFO - 'urn:li:dataset:(urn:li:dataPlatform:redshift,dev.<schema_name>.envision_ssot,PROD)',
{process_utils.py:187} INFO - 'urn:li:dataset:(urn:li:dataPlatform:redshift,dev.<schema_name>.fct_lead_activity,PROD)',
{process_utils.py:187} INFO - 'urn:li:dataset:(urn:li:dataPlatform:redshift,dev.<schema_name>.fct_lead_opps_sale,PROD)',
{process_utils.py:187} INFO - 'urn:li:dataset:(urn:li:dataPlatform:redshift,dev.<schema_name>.fct_salesorder_activity,PROD)',
{process_utils.py:187} INFO - 'urn:li:dataset:(urn:li:dataPlatform:redshift,dev.<schema_name>.fct_salesorder_payment,PROD)',
{process_utils.py:187} INFO - 'urn:li:dataset:(urn:li:dataPlatform:redshift,dev.<schema_name>.fct_user_login,PROD)',
{process_utils.py:187} INFO - '... sampled of 96 total elements']},
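The trailing `PROD` is the URN's fabric/environment component. It comes from the recipe's `env` setting, which defaults to `PROD` when the setting isn't picked up, and it is unrelated to the Redshift database name (`dev` above). If the `env` key isn't sitting directly under `source.config`, the default wins. A minimal sketch of the expected placement — host and database values are placeholders:

```yaml
source:
  type: redshift
  config:
    host_port: my-cluster.example.com:5439  # placeholder
    database: dev
    env: DEV  # fabric type used in the URN; defaults to PROD if omitted or misplaced
```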
strong-potato-63475
06/08/2023, 3:44 AM
strong-potato-63475
06/08/2023, 3:46 AM
gray-airplane-39227
06/08/2023, 4:18 PM
My test user has only the privileges `Generate Personal Access Token` and `Edit Entity`; I log in as this user and confirm that I’m not authorized to view a dataset and its fields. I also confirm the GMS rest api is guarded by setting the env variable `REST_API_AUTHORIZATION_ENABLED` to `true`: I get a 401 when I curl a search request to the GMS rest api.
However, if I open GraphiQL from the UI and make a search request on a dataset, I’m able to view all metadata of any dataset, and similarly I can get results by making graphql search queries to the GMS graphql endpoint.
I checked the code and it seems `SearchResolver.java` doesn’t have any auth check on it. I would like to confirm whether this is a valid issue, thank you!
bland-gigabyte-28270
06/09/2023, 9:26 AM
We deployed DataHub (version `0.10.3`, helm chart `0.2.165`); however, our `datahub:datahub` user doesn’t seem to have any permissions. Note that this worked in a previous PoC, and somehow it doesn’t work anymore.
best-rose-86507
06/09/2023, 9:52 AM
mysterious-advantage-78411
06/09/2023, 12:11 PM
best-market-29539
06/09/2023, 2:07 PM
handsome-park-80602
06/09/2023, 2:59 PM
When I log in as `datahub`, I have no permission to view the policies, and my datahub user doesn't seem to have the root role, as I don't have visibility into the Ingestion UI tab either.
I tried restoring indices, as suggested here: https://datahubspace.slack.com/archives/C029A3M079U/p1675057819949539?thread_ts=1674544681.800709&cid=C029A3M079U, and that also didn't work.
I was wondering if anyone else has seen this issue before. I am not sure where to look next, as the gms log has no indication that it attempted to ingest the policies.json file during the boot process, despite me mounting the policies.json file explicitly into datahub-gms:
datahub-gms:
  enabled: true
  image:
    repository: linkedin/datahub-gms
    # tag: "v0.10.0" # defaults to .global.datahub.version
  resources:
    limits:
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 1Gi
  extraVolumes:
    - name: datahub-policies-volume
      configMap:
        name: "datahub-policies-cm"
  extraVolumeMounts:
    - name: datahub-policies-volume
      mountPath: /datahub/datahub-gms/resources/policies.json
      subPath: policies.json
  extraEnvs:
    - name: UI_INGESTION_ENABLED
      value: "true"
Any help would be appreciated.
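A quick sanity check is to read the file back from the running GMS pod, to confirm the mount actually landed at the path the boot step reads. A sketch — the deployment name below is an assumption based on the pod names earlier in this channel:

```sh
# deployment name assumed from the datahub-datahub-gms-* pod names above
kubectl exec deploy/datahub-datahub-gms -- \
  cat /datahub/datahub-gms/resources/policies.json | head
```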
lemon-greece-73651
06/09/2023, 8:30 PM
bland-orange-13353
06/09/2023, 9:16 PM
eager-winter-63685
06/10/2023, 1:42 AM
When I run `datahub docker quickstart`, it gives me an error:
[2023-06-10 09:34:07,069] ERROR {datahub.entrypoints:189} - Command failed with Unknown color 'bright_red'. Run with --debug to get full trace
I’m sure that I have datahub installed successfully:
$ datahub version
/home/leo/.local/lib/python3.6/site-packages/datahub/__init__.py:23: FutureWarning: DataHub will require Python 3.7 or newer in a future release. Please upgrade your Python version to continue using DataHub.
FutureWarning,
DataHub CLI version: 0.8.43
Python version: 3.6.9 (default, Nov 25 2022, 14:10:45)
[GCC 8.4.0]
does anyone know why this happens?
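The `FutureWarning` in that output is the tell: CLI 0.8.43 on Python 3.6 is well behind what `datahub docker quickstart` expects, and the `Unknown color 'bright_red'` error most likely comes from an old terminal-styling dependency that predates that color name. A hedged first step, assuming a newer interpreter is available on the machine:

```sh
# use any Python >= 3.7; 3.8 here is just an example interpreter name
python3.8 -m pip install --upgrade acryl-datahub
datahub version  # confirm the CLI now reports a recent version
```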