# troubleshoot
g
Hello everyone, do you know what can cause GMS to process MAEs so slowly? I cleaned my Elasticsearch and ran the reindex job. There are 64994 MAE rows. More than 1 hour later I still can't see all metadata in the front end. Previous versions of GMS could process this volume very fast. Deployment details:
- Environment: Kubernetes
- DataHub version: 0.10.2
- GMS replicas: 1
- Standalone consumers: False
a
Hi, @brainy-tent-14503 might be able to help you out with this!
g
In addition to GMS taking a long time to consume all the MAEs from the reindexing job, it is also slow to consume events coming from ingestion. What could be causing this?
b
If it is not making any forward progress, it might be due to some `assert` statements which are Errors and not Exceptions; the former is not caught, and under some circumstances can fail to acknowledge the Kafka messages. I have a PR to fix this here. I do not know whether this is the cause, so I would also take a look at the bulk processing batch size; if it's <1000, you might need to increase the `ES_BULK_FLUSH_PERIOD` interval [doc] (this is slow progress vs the other condition, which might be very slow to no progress).
Additionally, are there any seemingly unrelated exceptions in the GMS logs?
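For reference, one way to adjust those bulk-processor settings on a Kubernetes deployment is through the GMS environment variables. This is a minimal sketch: the `datahub-gms` deployment name is an assumption from the standard Helm chart, and if you deploy via Helm you would normally put these in your values file instead so they are not overwritten on upgrade.

```
# Inspect the current bulk-processor settings on the GMS deployment
kubectl set env deployment/datahub-gms --list | grep ES_BULK

# Raise the flush period (seconds) so partial batches have time to fill up;
# variable names follow the DataHub GMS config, deployment name is an assumption
kubectl set env deployment/datahub-gms ES_BULK_FLUSH_PERIOD=5 ES_BULK_REQUESTS_LIMIT=1000
```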
g
I only found exceptions caused by DataFetcher. It seems to have a bug when I edit the view to filter entities of Term Group (using the rule 'is one of'). This causes 500 errors to appear in the UI, but the data is returned.
I'm using the following configurations for the bulk processor:
a
The DataFetcher exceptions due to a view filter shouldn't be affecting performance, for sure. Are you seeing messages like
`c.l.m.s.e.update.BulkListener:47 - Successfully fed bulk request. Number of events: 18`
where the number is maybe in the 100-700 range? If so, then increasing `flushPeriod` to something like 5 can make sure those become full batches of 1000. That is delaying the write to accumulate more documents. This is, however, only typical for large ingestions. I'd also check the CPU limit and how close you're coming to that cap.
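If it helps, here is a quick sketch for checking both of those signals on Kubernetes; the deployment name and pod label are assumptions, so adapt them to your release, and `kubectl top` requires metrics-server to be installed.

```
# Look at recent bulk flushes and their batch sizes (log line format as above)
kubectl logs deployment/datahub-gms --since=10m | grep "Successfully fed bulk request"

# Compare current CPU usage against the container's configured limit
kubectl top pod -l app.kubernetes.io/name=datahub-gms
kubectl describe pod -l app.kubernetes.io/name=datahub-gms | grep -A2 "Limits"
```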