# troubleshoot
g
Hello everyone, I'm having problems with my 0.10.0 deployment. Context: Before updating the version, I decided to soft delete datasets, charts, and dashboards. With this, I could delete all entities and force reingestion to ingest new ones. Problem: I'm receiving a lot of exception messages from the BulkListener like the one below:
[I/O dispatcher 2] ERROR c.l.m.s.e.update.BulkListener:44 - Failed to feed bulk request. Number of events: 5 Took time ms: -1 Message: failure in bulk execution:
[1]: index [datasetindex_v2_1678278613797], type [_doc], id [urn...], message [[datasetindex_v2_1678278613797/YrBRraPeT6OLr7JvUNdy6A][[datasetindex_v2_1678278613797][0]] ElasticsearchException[Elasticsearch exception [type=document_missing_exception, reason=[_doc][urn...]: document missing]]]
I'm unsure if it is the cause, but I do not see any dataset in the UI. NOTE: Yes, I tried to run the restoreIndices job, but nothing changed.
I fully reset my deployment (uninstalled the chart and deleted the storage), but the restoreIndices job takes very long. More than 4 hours have passed since the job completed, and only 10k entities have loaded. I'm using two replicas of GMS, three of Elasticsearch, and two of Kafka.
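If the restore stays this slow, the restore-indices job can usually be tuned through the helm values. A minimal sketch, assuming the datahub-helm chart's datahubUpgrade block (the key names may differ in your chart version):

# Sketch only: assumes the datahub-helm `datahubUpgrade` values; verify the keys for your chart version.
datahubUpgrade:
  enabled: true
  batchSize: 1000     # rows read from SQL per batch; larger batches can speed up big restores
  batchDelayMs: 100   # pause between batches; lower it if Elasticsearch keeps up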
Weird. Entities can be searched and found, but the metadata counts and entity listings either error out or say no metadata exists. For example:
I can find this glossary
But:
(screenshots attached)
a
Hey @gentle-camera-33498, this is likely due to some search issues on head that might be messing stuff up; @brainy-tent-14503 may be able to help here
g
I tried a lot of things to resolve this, but without success. I'm having problems with my glossary and dataset indices. I can find them in search, but the count and entity listing appear to only get data from Elasticsearch. In my Elasticsearch indexes, no documents were registered for these indices, and GMS is logging a lot of the error messages I presented above.
b
I’ve noticed that restore indices doesn’t work if the entities are marked soft-deleted in SQL. First set the status back to undeleted for those URNs, and then restore indices.
g
But I have datasets where removed=false
But on my homepage:
My indices
b
You have 144 datasets in Elastic. The restore indices should work for the missing ones. Is the count shown being divided by 1k?
the count in your sql output
g
I tried to force the ingestion, but it's still not working. I'll check whether BigQuery ingestion forces the Status aspect to removed=false
b
Also check if there is some lag on your Kafka topics. Are you running standalone consumers?
async ingestion?
g
No, I'm running sync ingestion without standalone consumers
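For context, "standalone consumers" means the MCE/MAE consumers run as their own deployments instead of inside GMS. A minimal sketch of how that is typically switched on in the helm values; the flag name is an assumption from the datahub-helm chart and may differ in your version:

# Sketch only: the flag name assumes the datahub-helm chart; verify it for your chart version.
global:
  datahub_standalone_consumers_enabled: true   # run the MAE/MCE consumers as separate pods instead of inside GMS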
The restore indices job does not restore my glossary metadata either
b
Yeah, I’ve only used it for datasets and platforms. I am not sure what kind of support it might have for terms.
g
But if I navigate directly by URL, I can find them: URL: https://<host>/glossaryTerm/urn:li:glossaryTerm:c7c57cda-696f-42b2-af3b-73b924e05db6/Documentation?is_lineage_mode=false Returns:
Using the restore indices URL isn't working either:
Is there any documentation about how I can check the Kafka consumers? My Prometheus instance is off, so right now I can't extract metrics from my deployment.
b
Kafka would be via the CLI, or if you're using Confluent they have a UI, at least for their managed solution
are these terms in a glossary node?
I see both a glossary node index and a glossary term index. I am wondering if a restore on the node first, then the term, would work
g
Ok, let me try
So, I think it didn't work
I turned on my ingestion pipeline. The GMS logs:
21:48:15.965 [I/O dispatcher 2] INFO  c.l.m.s.e.update.BulkListener:47 - Successfully fed bulk request. Number of events: 6 Took time ms: -1
21:48:15.966 [pool-10-thread-2] INFO  c.l.m.filter.RestliLoggingFilter:55 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 1008ms
21:48:15.970 [I/O dispatcher 2] ERROR c.l.m.s.e.update.BulkListener:44 - Failed to feed bulk request. Number of events: 2 Took time ms: -1 Message: failure in bulk execution:
[0]: index [datasetindex_v2_1678308848188], type [_doc], id [urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Abigquery%2Cbi-data-science.bing_ads.ad_performance_report_daily_scd%2CPROD%29], message [[datasetindex_v2_1678308848188/WXDHhwssRXy3_bPyPQ14OQ][[datasetindex_v2_1678308848188][0]] ElasticsearchException[Elasticsearch exception [type=document_missing_exception, reason=[_doc][urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Abigquery%2Cbi-data-science.bing_ads.ad_performance_report_daily_scd%2CPROD%29]: document missing]]]
My dataset index:
(attachment: snippetjson.json)
The index config:
The chart index has the same routing config but does not appear to have errors when updating the indices.
The chart and dataset indexes have the same settings (except uuid, provided_name, and creation date)
@brainy-tent-14503 Could you tell me what else I should check in my deployment?
hmm...
I'm running out of ideas of what to try... 😓
b
I will need to try to reproduce this locally and drop a breakpoint on the upsert request that is returning the document-not-found error. Typically those are seen on non-critical operations; however, this one seems different, like the restore indices is not generating the right sequence of upserts. Or alternatively, the MAE consumer is not generating the right upserts.
g
Sorry for taking your time @astonishing-answer-96712 and @brainy-tent-14503! Before going to bed, I forced the restoreIndices job again, and the platform magically resolved itself overnight. I'm not sure why, but I'll spend all day today investigating to see if I can replicate it.
What I'm guessing is that the first time I ran the restore job, it was using v0.9.6.1 while the frontend and GMS were on v0.10.0.
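If that kind of version skew is the cause, pinning the restore/upgrade job to the same version as GMS and the frontend should avoid it. A minimal sketch, assuming the datahub-helm global.datahub.version and datahubUpgrade.image keys:

# Sketch only: keep the restore/upgrade job on the same image version as GMS and the frontend.
global:
  datahub:
    version: v0.10.0
datahubUpgrade:
  image:
    repository: acryldata/datahub-upgrade
    tag: v0.10.0   # should match the GMS/frontend version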
a
Can we try setting this and then restore
- name: GRAPH_SERVICE_DIFF_MODE_ENABLED
  value: 'false'
g
Of course! I will try. I expect that with some of these changes, the restore indices job will take less time to finish. (Before upgrading to 0.10.0, restoring all indices took less than 30 minutes; now it is very slow.)
For now, it's working. I'll come back here once I've made the change, OK? I have to take a look at the drift between my chart and the community chart to search for possible causes. Thank you @brainy-tent-14503!
a
Let me see if this is exposed in the community chart… actually, it is not exposed. We need to add this to the community chart. I'd throw it on GMS, MCE, and MAE if they are split, otherwise just GMS
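Until the chart exposes it, one workaround is to pass the variable through each component's extraEnvs list, assuming your chart version supports extraEnvs on the subcharts; set it on datahub-gms, and repeat for the MCE/MAE consumer subcharts if they run standalone:

# Sketch only: assumes the datahub-helm subcharts accept an extraEnvs list.
datahub-gms:
  extraEnvs:
    - name: GRAPH_SERVICE_DIFF_MODE_ENABLED
      value: "false"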
w
Hi everyone! I have a similar issue with
document missing
after restoring indices (migration from 0.9.6 to 0.10.1). I didn't quite understand what needs to be done to solve this problem. Do I need to set this parameter,
enableGraphDiffMode: false
under Values.global.datahub.enableGraphDiffMode?
After changing this parameter and re-running the restore indices job, I still faced the same issue.
Also, my GMS service is actively spamming the following logs:
a
Check your settings: set always-emit-changelog to true and graph-diff-mode to false. Then run the restore index process. Then set
alwaysEmitChangeLog: false
and
enableGraphDiffMode: true
to reduce processing once everything is restored.
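Putting that together, a minimal sketch of the helm values for the two phases, using the global.datahub keys referenced above:

# During the restore: emit full change logs and disable graph diff mode so everything is reprocessed.
global:
  datahub:
    alwaysEmitChangeLog: true
    enableGraphDiffMode: false

# After the restore completes: switch back so only diffs are processed.
global:
  datahub:
    alwaysEmitChangeLog: false
    enableGraphDiffMode: true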