# all-things-deployment
bitter-lizard-32293 (Piyush)
hey folks, are there any rough guidelines on scaling GMS instances based on ingest load? We started dialing up traffic from our ingest systems -> GMS (currently all over HTTP) and have gone from around 60 calls/min to 570 calls/min. When we hit those spikes, our ingest latencies also tend to go up from 200ms to 3s. This is being handled by 3 GMS instances and an ES cluster with 12 data nodes + 3 masters + 3 routers. Trying to get a better sense of where the bottlenecks might be to inform the right scaling. I looked at our DB metrics and they seem healthy (low latencies, low CPU, well below the connection count limit), so I suspect the bottleneck is either the search cluster or the GMS service. I'm digging through the JMX metrics we push to figure out which, and pointers from folks who might already know would be great.
I think our ES metrics also look good. Trying to figure out whether there are settings we need to tune on the GMS web-service side to allow more requests in flight, whether we should add more instances, or whether we need to tune the calls to ES / the threadpools there.
orange-night-91387
Hey Piyush! Have you checked the ingestion step latencies? We track method latencies along the way and publish them using Dropwizard: https://datahubproject.io/docs/advanced/monitoring/#metrics The DataHub dashboard we provide includes these metric queries:
metrics_com_linkedin_metadata_entity_ebean_EbeanEntityService_ingestAspectToLocalDB
metrics_com_linkedin_metadata_entity_ebean_EbeanEntityService_produceMCL
and then on the MCL side:
metrics_com_linkedin_metadata_kafka_MetadataChangeLogProcessor_UpdateIndicesHook_latency
These might give you a better idea of where the latency is coming from
square-activity-64562 (Aseem)
@bitter-lizard-32293 Per our internal benchmarks done back in Jan 2022 with this configuration (in the screenshot), we found it better to run the consumer outside of GMS and scale out the number of GMS instances. By default GMS is set to use 50 connections, but we tested with 200 connections per GMS and it was able to handle ~95 QPS with 1 GMS and 1 MAE consumer (at that time it was mainly the MAE consumer). Note that this benchmark is a bit dated (Jan 2022), it was purely for the ingestion endpoint, and the load test was driven by a client running in separate pods inside the same Kubernetes cluster.
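Roughly, running the consumers standalone looks like this in datahub-helm values. This is a hedged sketch from memory; the flag and sub-chart names may differ across chart versions, so check your chart's values.yaml before applying:
```yaml
# Illustrative datahub-helm values: run the MAE/MCE consumers as separate deployments
# instead of inside GMS, and scale GMS horizontally. Key names may vary by chart version.
global:
  datahub_standalone_consumers_enabled: true   # deploy separate mae-consumer / mce-consumer pods

datahub-gms:
  replicaCount: 3   # scale GMS out independently of the consumers
```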
bitter-lizard-32293 (Piyush)
Thanks @orange-night-91387 and @square-activity-64562. Let me dig into those 3 metrics and see what I find. Aseem, do you have a sense of what the ingest QPS was prior to your bump to 200? We might explore bumping to 200 connections as a stopgap, since carving out the consumer might be a little more work.
The first metric that @orange-night-91387 pointed out seems quite low:
metrics_com_linkedin_metadata_entity_ebean_EbeanEntityService_ingestAspectToLocalDB
We seem to be < 20ms at the p95. I'm not able to find the others, so I might need to dig into whether I missed emitting those in our JMX hook yaml.
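For reference, the GMS timers are Dropwizard metrics registered under the `metrics` JMX domain, so a prometheus-jmx-exporter rule along these lines should surface them. This is only a sketch; the exact ObjectName pattern depends on how the JmxReporter is configured, so confirm the MBean names with jconsole before relying on it:
```yaml
# Sketch of a jmx-exporter rule exposing the Dropwizard timers (Count, Mean, percentiles).
# Assumes the default "metrics:name=<timer name>" MBean layout; verify with jconsole/jmxterm
# that this matches what your GMS actually registers. A plain ".*" catch-all rule is the
# simplest way to check whether the timers show up at all.
rules:
  - pattern: "metrics<name=(.+)><>(Count|Mean|95thPercentile|99thPercentile)"
    name: "metrics_$1_$2"
    type: GAUGE
```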
square-activity-64562 (Aseem)
@bitter-lizard-32293 I can see one other internal benchmark doc with EBEAN_MAX_CONNECTIONS set to 40. It's not an apples-to-apples comparison since the RDS instance size was different; QPS was good enough but latency was higher. We don't have a benchmark with the consumers running inside GMS, because we realised early on that taking them out made the response times more stable. Do note the K8s resources mentioned in the earlier picture: we were running GMS with less memory, but increased it after seeing that GMS memory allocation was the first bottleneck we hit, followed by the number of RDS connections.
bitter-lizard-32293 (Piyush)
Ah ok, thanks for clarifying, Aseem. On our end the RDS metrics look quite good: CPU is ~2% and write latencies / IOPS are fairly low. I dug a little more into our metrics and it looks like our traffic is fairly bursty; we see 500-700 calls made over HTTP to GMS within a short window. I'm wondering if the first thing to try might be to decouple the ingest and use the Kafka emitter instead of the HTTP (REST) emitter.
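For our ingestion recipes that would roughly mean switching the sink from datahub-rest to datahub-kafka, along these lines (broker and schema-registry endpoints are placeholders for our environment):
```yaml
# Recipe sketch: emit changes to Kafka and let the consumer apply them asynchronously,
# instead of calling GMS synchronously over HTTP. Endpoints below are placeholders.
source:
  type: <your-source>
  config: {}
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: "kafka-broker:9092"
      schema_registry_url: "http://schema-registry:8081"
```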
orange-night-91387
Using the KafkaEmitter is definitely a good option here if you can handle a bit of delay. One thing you may also be running into is max connections; you could try bumping your Ebean max connections higher if your RDS isn't currently throttling. Doing this will increase the memory overhead of GMS, though, so I'd recommend giving that a bump too if you're on the default. If you have Jaeger or some other tracing tool deployed, you could also look through the span graph of the distributed traces we expose to see where requests are lagging. Not sure why not all of the JMX metrics are popping up for you.
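To make those two bumps concrete, something like the following in the GMS helm values; the numbers are placeholders to size against your RDS connection limit and node headroom, and key names may differ by chart version:
```yaml
# Illustrative GMS values: widen the Ebean connection pool and give GMS more memory
# to go with it. Figures are placeholders; tune against your RDS limits and pod headroom.
datahub-gms:
  extraEnvs:
    - name: EBEAN_MAX_CONNECTIONS   # default is 50
      value: "100"
  resources:
    requests:
      memory: "4Gi"
    limits:
      memory: "4Gi"
```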