# troubleshooting
a
Hi team, I have an issue. Could you help? Pinot stopped consuming Kafka data, while there are new messages coming into Kafka.
```
2022/09/27 13:20:48.886 INFO [LLRealtimeSegmentDataManager_table__0__224__20220927T0712Z] [telemetry__0__224__20220927T0712Z] Consumed 0 events from (rate:0.0/s), currentOffset=261332557, numRowsConsumedSoFar=633915, numRowsIndexedSoFar=633915
```
h
Hi team, what could cause Pinot to stop pulling messages from Kafka?
a
When I call the reset API to reset this segment, it consumes data again but stops at the same offset.
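A minimal sketch of the reset call being described, assuming a controller at localhost:9000 and a table named telemetry_REALTIME (both placeholders); the segment name is taken from the log line above, and the exact endpoint path can differ by Pinot version:

```bash
# Sketch only: controller host/port and table name are assumptions;
# the segment name comes from the log line above.
curl -X POST \
  "http://localhost:9000/segments/telemetry_REALTIME/telemetry__0__224__20220927T0712Z/reset"
```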
k
Check the log for any exceptions.
The message might be corrupted.
a
no error
n
Are you able to consume messages directly from the Kafka topic (from the last offset in the log, 261332557) for partition 0?
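A sketch of that check with the stock Kafka console consumer, assuming a broker at localhost:9092 and a topic named telemetry (both placeholders); the partition and offset come from the question above:

```bash
# Read a few messages starting at the offset where Pinot appears stuck.
kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic telemetry \
  --partition 0 \
  --offset 261332557 \
  --max-messages 10
```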
n
I’m facing the same issue too. Pinot stopped consuming data from all my Kinesis Streams. I thought it might be because of memory issues and re-created the entire setup. However, I’m still stuck at the same step; further details here https://apache-pinot.slack.com/archives/C011C9JHN7R/p1664306337847879
h
Hi @Navina, yes, Alice and I could pull messages directly using a console consumer, which confirmed there were new messages flowing into the topic and the broker/topic was healthy. After a few hours of looking around we ended up restarting the Pinot servers and it worked. I have seen a Pinot table stop consuming messages several times. In my experience there were 2 scenarios. 1. I updated the table config fairly often, here and there, but the table config was still correct because a new table with the same config worked. 2. It’s what we faced yesterday: I did not update the table config and did not touch anything in Kafka or the topic. I stopped/restarted the upstream Flink job, which is the message producer, many times. The Pinot table just stopped pulling messages without any error and did not resume consuming after resetting the consuming segments. We finally restarted the Pinot servers.
n
@Huaqiang He starting and stopping the producer should not affect the consumer; that's the point of Kafka. 1. So, the only thing we can look at right now is the server logs. Is that something you can share? We can look at the logs around the time when it got stuck. 2. Also, do you remember checking the Ideal State (IS) and External View (EV) in ZK? In a healthy state, these should match each other and every partition should have 1 consuming segment.
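One way to pull the IS and EV for comparison is the controller REST API; a sketch assuming a controller at localhost:9000 and a table named telemetry (both placeholders):

```bash
# Compare the two JSON payloads: every partition should show exactly one CONSUMING segment in both.
curl "http://localhost:9000/tables/telemetry/idealstate"
curl "http://localhost:9000/tables/telemetry/externalview"
```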
a
1. Here’s the server log around the time when it stopped consuming. 2. I forgot to check IS and EV, but I checked the consuming status. All partition offsets were stuck even though lastConsumedTimestamp kept updating.
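The consuming-status check mentioned here can also be done through the controller; a sketch assuming the same placeholder controller and table names as above (the endpoint and its exact response fields depend on the Pinot version):

```bash
# Expected to return per-partition consumer state and offsets for the consuming segments.
curl "http://localhost:9000/tables/telemetry/consumingSegmentsInfo"
```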
n
I think `lastConsumedTimestamp` is not coming from the record metadata. It is just coming from the Pinot consumer. I will check again on its semantics.
Though there are no errors, I see this log: "Recreating stream consumer ... exceeded idle timeout...". This shouldn't really cause the consumer to get stuck; it will likely close the current consumer and spin up a new Kafka consumer. But I wonder if this is causing some issue in consumption itself. Can you try increasing the idle timeout to a few hours? Set `idle.timeout.millis` in the stream configs to `86400000` and see if the problem reoccurs.
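A sketch of where that setting would live in the table's streamConfigs; only the idle.timeout.millis value is the change being suggested, and the surrounding keys (topic, broker list) are illustrative placeholders:

```json
{
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.topic.name": "telemetry",
    "stream.kafka.broker.list": "localhost:9092",
    "idle.timeout.millis": "86400000"
  }
}
```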
h
Thanks, Navina, we will try it, though we may not be able to reproduce the stuck state. Meanwhile, the behavior of Pinot stopping pulling messages seems to be a separate issue.
Hi @Navina, this time Pinot stopped pulling messages from 5 of the 6 partitions. It’s been 29 hours. What information do you want us to collect now?
@Kishore G are you there? I think we should collect some debugging info before I recover the stuck setup.
a
I adjusted idle.timeout.millis to 86400000. The issue occurred again. I checked the ideal state and external view; they’re consistent, and the statuses are only “online” or “consuming”.
m
@Alice are you not seeing any errors/exceptions in the server log? Do you see any segment generation log that might have failed?
For example, pick the partition that stopped consuming and search for the latest segment name for that partition in the log
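For example, a quick way to do that on the server log, assuming the stuck partition is 3 and the log file is pinotServer.log (both placeholders):

```bash
# Segment names follow <table>__<partition>__<sequence>__<startTime>, so grepping the
# table/partition prefix surfaces the most recent segments for that partition.
grep "telemetry__3__" pinotServer.log | tail -n 20
```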
a
It’s weird that there’s no error/exception, only logs like “Consumed 0 events from (rate:0.0/s)…” and “Recreating stream consumer for topic partition”.
s
If `idle.timeout.millis` is set to `86400000`, we should see the “Recreating stream consumer for topic partition” message only once a day, right? Was the consuming segment refreshed after the change?
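A simple way to verify that expectation against the server log (the log path is a placeholder):

```bash
# With idle.timeout.millis=86400000 (24h), this count should grow by roughly one per day per idle partition.
grep -c "Recreating stream consumer" pinotServer.log
```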
h
@Mayank the issue looks like this one