# troubleshooting
l
hey friends, I wonder if anyone has run into issues like this: we have a table and we want to ingest data from the beginning of the topic. However, while we are ingesting this data the servers in the cluster get really, really busy and the p99 response metrics for other tables are impacted greatly. Has anyone come across this? Do you know what the bottleneck is and why the servers get so impacted in terms of response times? It's really weird that adding one table can impact the cluster so negatively.
the servers are trying to catch up, but the response times are impacted greatly
our p99 is usually 15ms, and with the new table it spikes to 300ms
j
it’s likely CPU. Pinot tries to ingest as many events as possible per second
l
right but we are at 32 cores and usually they are chilling
unless we do something like this 😄
should we over-provision or add more replicas?
j
there is a
topic.consumption.rate.limit
config described at https://docs.pinot.apache.org/basics/data-import/pinot-stream-ingestion that you can use. we’ve been meaning to experiment with it, but I can’t say for certain how well it works yet
👍 1
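For reference, a minimal sketch of where that property sits in a realtime table's streamConfigs, based on the stream-ingestion docs linked above; the topic name, consumer settings and limit value here are placeholders, not taken from this thread:

"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "my-topic",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
  "topic.consumption.rate.limit": "100000"
}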
l
omg
#pro
#PROTIP
j
the description for that config also mentions there’s tons of GC with high ingestion, so maybe check those metrics as well? I know we had to disallow “smallest” offset ingestion for our tables to avoid this same latency impact
l
we just did smallest because we need the data from the beginning of the topic, and wow, yeah, GC pauses, I need to check that too
image.png
i do see it goes up but it’s still .3% time spent on GC
j
what is that metric? i thought GC metrics were emitted as milliseconds
l
it’s % of CPU time on GC based on what i see lol
like this:
rate(jvm_gc_collection_seconds_sum{ kubernetes_namespace="pinot", component="server"}[4m])
image.png
you can see how it just spikes up, and that’s when we started ingesting all those records
and we are going right now at 200k/s
j
jvm_gc_collection_seconds_sum
sounds like seconds spent doing GC no? ~.3 sounds like 300ms which correlates with the p99 you’re seeing
😮 1
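A quick note on the unit, since this is easy to misread: jvm_gc_collection_seconds_sum accumulates seconds spent in GC, so rate() over it yields GC seconds per second of wall-clock time. A value of ~0.3 therefore means roughly 300ms of GC per second (about 30% of the time), not 0.3%. A sketch reusing the same selector as the query above; the second expression assumes the matching _count series is exported alongside the _sum, as is normal for this summary metric:

# fraction of wall-clock time spent in GC (0.3 ≈ 300ms of GC per second, i.e. ~30%)
rate(jvm_gc_collection_seconds_sum{kubernetes_namespace="pinot", component="server"}[4m])

# rough average duration of a single collection
rate(jvm_gc_collection_seconds_sum{kubernetes_namespace="pinot", component="server"}[4m])
  / rate(jvm_gc_collection_seconds_count{kubernetes_namespace="pinot", component="server"}[4m])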
l
thank you this is very insightful
it seems like it has caught up now
r
as @Johan Adami mentioned, limiting it at the table level using
topic.consumption.rate.limit
is the right way for this use case. it works quite well.
• it is also a dynamic config, i.e. there is no need to reload or reset the table for it to take effect, although it will not take effect until the next segment starts consuming
• it doesn't limit the rate at the server level, i.e. it doesn't enforce the limit across all consumers within the same server
• it only limits at the per-partition level for the lower-level consumer
l
do you all know how to establish the rate?
topic.consumption.rate.limit
like I know it’s a double, but is it messages per second or something like that it should process? @Rong R @Johan Adami
j
double topicRateLimit = streamConfig.getTopicConsumptionRateLimit().get();
double partitionRateLimit = topicRateLimit / partitionCount;
LOGGER.info("A consumption rate limiter is set up for topic {} in table {} with rate limit: {} "
    + "(topic rate limit: {}, partition count: {})", streamConfig.getTopicName(), tableName, partitionRateLimit,
    topicRateLimit, partitionCount);
MetricEmitter metricEmitter = new MetricEmitter(serverMetrics, metricKeyName);
return new RateLimiterImpl(partitionRateLimit, metricEmitter);
It seems it naively assumes equal traffic per partition, so if you set 1000 as the limit on a topic with 10 partitions, each partition consumer will get 100 as its rate limit
l
that rate limit is messages/s (?)
j
correct
i can’t tell if it’s recreated or not when partition count changes, though
l
thank you Johan you the MVP
thankyou 1
j
just some links so you or others can double check. here is where the rate limiter is created. here is where it’s created in the stream processing code. here is where the rate limiting is being done
🙏 1
l
it works like a charm
one question about rate limiting: what happens when we hit the rate limit with
topic.consumption.rate.limit
? are messages just queued up or dropped? I think they get queued, because I think we just sleep if we hit the rate limit, right?
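For what it's worth on the queued-vs-dropped question: nothing should be dropped. The usual pattern (which matches the guess above that consumption just sleeps) is a token-bucket limiter that blocks the consuming thread until a permit is available, so the consumer simply falls further behind the topic instead of losing messages. A minimal sketch of that blocking behaviour, assuming a Guava-style RateLimiter and made-up numbers; this illustrates the pattern, not Pinot's actual code:

import com.google.common.util.concurrent.RateLimiter;

public class RateLimitBehaviorSketch {
  public static void main(String[] args) {
    // Made-up numbers: a 1000 msg/s topic limit split evenly across 4 partitions.
    double topicRateLimit = 1000.0;
    int partitionCount = 4;
    RateLimiter partitionLimiter = RateLimiter.create(topicRateLimit / partitionCount);

    for (int i = 0; i < 10; i++) {
      // acquire() blocks (sleeps) the consuming thread until a permit is available,
      // so messages are delayed and the consumer lags, but nothing is discarded.
      double secondsSlept = partitionLimiter.acquire();
      System.out.printf("message %d consumed after sleeping %.3fs%n", i, secondsSlept);
    }
  }
}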