# troubleshooting
a
Hi team, I have an issue. Could you help? Pinot stopped consuming Kafka data, while there are new messages coming into Kafka.
```
2022/09/27 13:20:48.886 INFO [LLRealtimeSegmentDataManager_table__0__224__20220927T0712Z] [telemetry__0__224__20220927T0712Z] Consumed 0 events from (rate:0.0/s), currentOffset=261332557, numRowsConsumedSoFar=633915, numRowsIndexedSoFar=633915
```
h
Hi team, what could cause Pinot to stop pulling messages from Kafka?
a
When I call the reset API to reset this segment, it consumes data again but stops at the same offset.
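A minimal sketch of the reset call being described, assuming a controller at localhost:9000 and a table named telemetry_REALTIME (both placeholders); the segment name is taken from the log line above, and the exact endpoint path can differ by Pinot version:

```bash
# Sketch only: controller host/port and table name are assumptions;
# the segment name comes from the log line above.
curl -X POST \
  "http://localhost:9000/segments/telemetry_REALTIME/telemetry__0__224__20220927T0712Z/reset"
```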
k
Check the log for any exceptions.
The message might be corrupted.
a
no error
n
Are you able to consume messages directly from the Kafka topic (from the last offset in the log, 261332557) for partition 0?
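A sketch of that check with the stock Kafka console consumer, assuming a broker at localhost:9092 and a topic named telemetry (both placeholders); the partition and offset come from the question above:

```bash
# Read a few messages starting at the offset where Pinot appears stuck.
kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic telemetry \
  --partition 0 \
  --offset 261332557 \
  --max-messages 10
```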
n
I’m facing the same issue too. Pinot stopped consuming data from all my Kinesis Streams. I thought it might be because of memory issues and re-created the entire setup. However, I’m still stuck at the same step; further details here https://apache-pinot.slack.com/archives/C011C9JHN7R/p1664306337847879
h
Hi @Navina, yes, Alice and I could pull messages directly using a console consumer, which confirmed there were new messages flowing into the topic and the broker/topic was healthy. After a few hours of looking around we ended up restarting the Pinot servers and it worked. I have seen a Pinot table stop consuming messages several times. In my experience there were 2 scenarios. 1. I updated the table config fairly often, here and there, but the table config was still correct because a new table with the same config worked. 2. It’s what we faced yesterday: I did not update the table config and did not touch anything in Kafka or the topic. I stopped/restarted the upstream Flink job, which is the message producer, many times. The Pinot table just stopped pulling messages without any error and did not resume consuming after resetting the consuming segments. We finally restarted the Pinot servers.
n
@Huaqiang He starting and stopping the producer should not affect the consumer; that's the point of Kafka. 1. So, the only thing we can look at right now is the server logs. Is that something you can share? We can look at the logs around the time when it got stuck. 2. Also, do you remember checking the Ideal State (IS) and External View (EV) in ZK? In a healthy state, these should match each other and every partition should have 1 consuming segment.
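One way to pull the IS and EV for comparison is the controller REST API; a sketch assuming a controller at localhost:9000 and a table named telemetry (both placeholders):

```bash
# Compare the two JSON payloads: every partition should show exactly one CONSUMING segment in both.
curl "http://localhost:9000/tables/telemetry/idealstate"
curl "http://localhost:9000/tables/telemetry/externalview"
```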
a
1. Here’s the server log around the time when it stopped consuming. 2. I forgot to check IS and EV, but I checked the consuming status. All partition offsets were stuck even though lastConsumedTimestamp kept updating.
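The consuming-status check mentioned here can also be done through the controller; a sketch assuming the same placeholder controller and table names as above (the endpoint and its exact response fields depend on the Pinot version):

```bash
# Expected to return per-partition consumer state and offsets for the consuming segments.
curl "http://localhost:9000/tables/telemetry/consumingSegmentsInfo"
```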
n
I think `lastConsumedTimestamp` is not coming from the record metadata. It is just coming from the Pinot consumer. I will check again on its semantics.
Though there are no errors, I see this log: "Recreating stream consumer ... exceeded idle timeout...". This shouldn't really cause the consumer to get stuck; it will likely close the current consumer and spin up a new Kafka consumer. But I wonder if this is causing some issue in consumption itself. Can you try increasing the idle timeout to a few hours? Set `idle.timeout.millis` in the stream configs to `86400000` and see if the problem reoccurs.
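A sketch of where that setting would live in the table's streamConfigs; only the idle.timeout.millis value is the change being suggested, and the surrounding keys (topic, broker list) are illustrative placeholders:

```json
{
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.topic.name": "telemetry",
    "stream.kafka.broker.list": "localhost:9092",
    "idle.timeout.millis": "86400000"
  }
}
```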
h
Thanks, Navina, we will try it, though we may not be able to reproduce the stuck state. Meanwhile, the behavior of Pinot stopping pulling messages seems to be a separate issue.
Hi @Navina, this time Pinot stopped pulling messages from 5 of the 6 partitions. It’s been 29 hours. What information do you want us to collect now?
@Kishore G are you there? I think we should collect some debugging info before I recover the stuck setup.
a
I adjusted idle.timeout.millis to 86400000. The issue occurred again. I checked the ideal state and external view; they’re consistent, and the statuses are only “online” or “consuming”.
m
@Alice are you not seeing any errors/exceptions in the server log? Do you see any segment generation log that might have failed?
For example, pick the partition that stopped consuming and search for the latest segment name for that partition in the log
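For example, a quick way to do that on the server log, assuming the stuck partition is 3 and the log file is pinotServer.log (both placeholders):

```bash
# Segment names follow <table>__<partition>__<sequence>__<startTime>, so grepping the
# table/partition prefix surfaces the most recent segments for that partition.
grep "telemetry__3__" pinotServer.log | tail -n 20
```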
a
It’s weird that there’s no error/exception, only logs like “Consumed 0 events from (rate:0.0/s)…” and “Recreating stream consumer for topic partition”.
s
If `idle.timeout.millis` is set to `86400000`, we should see the “Recreating stream consumer for topic partition” message only once a day, right? Was the consuming segment refreshed after the change?
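A simple way to verify that expectation against the server log (the log path is a placeholder):

```bash
# With idle.timeout.millis=86400000 (24h), this count should grow by roughly one per day per idle partition.
grep -c "Recreating stream consumer" pinotServer.log
```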
h
@Mayank the issue looks like this one