Apache Pinot #troubleshooting

# troubleshooting

Elon

11/19/2020, 10:49 PM

Untitled

Mayank

11/19/2020, 10:52 PM

Is it an expensive query?

Elon

11/19/2020, 10:54 PM

But it looks like the error came from a server though

Elon

11/19/2020, 10:55 PM

It wasn't expensive, we bounced the servers and the query latency went from 25s to < 1 second after that.

Elon

11/19/2020, 10:55 PM

And we saw no red flags, cpu, network, disk usage etc. were fine

Elon

11/19/2020, 10:55 PM

Also I wasn't aware those combiner services run on the servers, maybe the logs are messed up in google cloud?

Alex

11/19/2020, 11:01 PM

@Mayank yep, error is from the server

Mayank

11/19/2020, 11:03 PM

Sorry, my original message was incorrect. This is from the server-level combine.

👍 1

Mayank

11/19/2020, 11:03 PM

Was there GC?

Elon

11/19/2020, 11:04 PM

I can check

Elon

11/19/2020, 11:04 PM

1 sec, also thanks so much for the super quick response!

👍 1

Elon

11/19/2020, 11:10 PM

We need to edit the jmx config, looks like we didn't expose that metric

Elon

11/19/2020, 11:10 PM

Will do this

Mayank

11/20/2020, 1:06 AM

@Elon, just to add a bit more context: the part of the code throwing the exception is waiting on worker threads that are executing the query on a bunch of segments. numBlocksMerged: 0 seems to suggest that none of those threads returned within the timeout.
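
To make the point above concrete, here is a minimal sketch of a combine step that waits on per-segment worker threads with a deadline. It is not Pinot's actual implementation: `scan_segment`, `combine`, the segment names, and the delays are all hypothetical, and the pattern is only meant to show why a merged-block count of 0 means no worker finished before the timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def scan_segment(segment_name, delay_s):
    """Hypothetical stand-in for executing the query on one segment."""
    time.sleep(delay_s)
    return f"block({segment_name})"


def combine(segment_delays, timeout_s):
    """Run one task per segment and merge only what finishes before the deadline.

    Returns (merged_blocks, timed_out). An empty merged_blocks list is the
    analogue of "numBlocksMerged: 0": no worker returned within the timeout.
    """
    pool = ThreadPoolExecutor(max_workers=len(segment_delays))
    futures = [
        pool.submit(scan_segment, name, delay)
        for name, delay in segment_delays.items()
    ]
    done, not_done = wait(futures, timeout=timeout_s)
    # Keep submission order and collect results from finished workers only.
    merged_blocks = [f.result() for f in futures if f in done]
    # Do not block on stragglers; they keep running in the background.
    pool.shutdown(wait=False)
    return merged_blocks, bool(not_done)


if __name__ == "__main__":
    # Every worker is slower than the deadline, e.g. during a long GC pause.
    blocks, timed_out = combine({"seg_a": 0.5, "seg_b": 0.5}, timeout_s=0.05)
    print(f"numBlocksMerged: {len(blocks)}, timedOut: {timed_out}")
```

In this sketch the timeout is observed by the combining thread, which is why the exception shows up there even when the real cause (GC, a frozen VM, an expensive query) lies elsewhere.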

Alex

11/20/2020, 1:06 AM

@Mayank some context

Elon

11/20/2020, 1:07 AM

thanks! So what would likely cause that?

Mayank

11/20/2020, 1:07 AM

It is typically things like an expensive query, GC, or a VM freeze

👍 1

Alex

11/20/2020, 1:08 AM

our pinot cluster stalled in prod, with 3 out of 5 nodes having those log messages. Their appearance correlates with the time the observed query latencies went up

Alex

11/20/2020, 1:08 AM

rolling restart fixed the problem for now, but we don't have a root cause

Mayank

11/20/2020, 1:08 AM

This is the most common stack trace you will see in case of latency spikes. It is not the root cause, but a side effect of some other problem.

Alex

11/20/2020, 1:09 AM

hmmm

Alex

11/20/2020, 1:09 AM

what can cause an issue that is fixed by a restart?

Mayank

11/20/2020, 1:10 AM

GC, High read load

Elon

11/20/2020, 1:10 AM

We're using jdk8 - do you recommend us changing to g1gc? Right now it's using concurrent mark sweep.

Mayank

11/20/2020, 1:10 AM

Yes, please move to g1gc

❤️ 1

Elon

11/20/2020, 1:10 AM

Will do!

Mayank

11/20/2020, 1:10 AM

we have not used CMS in Pinot production at all at LinkedIn

Mayank

11/20/2020, 1:11 AM

Are these realtime nodes?

Elon

11/20/2020, 1:11 AM

Yep

Mayank

11/20/2020, 1:11 AM

Are you using all the realtime optimizations (like offheap etc)?

Alex

11/20/2020, 1:11 AM

they are hybrid

Alex

11/20/2020, 1:11 AM

realtime + offline

Elon

11/20/2020, 1:11 AM

I believe so, mmap for segments, is that how you do off heap?

Elon

11/20/2020, 1:12 AM

Any jvm options you recommend?

Mayank

11/20/2020, 1:12 AM

how much data per node?

Mayank

11/20/2020, 1:12 AM

most of the time we use xms=xmx=16G

Elon

11/20/2020, 1:14 AM

6 nodes x 100gb-280gb (for the older nodes w more realtime data)

Mayank

11/20/2020, 1:15 AM

ingestion rate?

Elon

11/20/2020, 1:15 AM

and ~27gb memory

Mayank

11/20/2020, 1:15 AM

ok, you likely need more heap than 16G for sure

Elon

11/20/2020, 1:16 AM

But if we're using mmap for segments that would imply we need to leave space for offheap, right?

Mayank

11/20/2020, 1:16 AM

also something like:

<value>-XX:MaxGCPauseMillis=20</value>

👍 2

Elon

11/20/2020, 1:17 AM

and how do we take advantage of offheap, is that just the segment loadmode = mmap?

Mayank

11/20/2020, 1:17 AM

<value>-XX:+UseG1GC</value>
<value>-XX:+ParallelRefProcEnabled</value>
<value>-XX:+DisableExplicitGC</value>

Mayank

11/20/2020, 1:17 AM

I think there's another setting for mmap. Let me find.

Elon

11/20/2020, 1:18 AM

thanks!

Elon

11/20/2020, 1:18 AM

we ingest ~6-8mb/s but that will be increasing

Elon

11/20/2020, 1:19 AM

@Mayank, you're really helping us, thanks!

👍 1

Alex

11/20/2020, 1:19 AM

do we need to set MaxDirectMemorySize?

Mayank

11/20/2020, 1:19 AM

@Alex it is not needed, but in practice we set it to the size of main memory minus Xmx
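
As a rough illustration of that arithmetic, the sketch below assembles the JVM flags mentioned in this thread and derives the direct memory limit as main memory minus heap. The helper `jvm_options` and the 64 GB host size are hypothetical; the flags themselves are standard HotSpot options, and, as noted above, setting `-XX:MaxDirectMemorySize` is optional.

```python
def jvm_options(main_memory_gb, heap_gb, max_gc_pause_ms=20):
    """Build a JVM flag list from the settings discussed in this thread.

    main_memory_gb and heap_gb are whole gigabytes. The direct memory limit
    follows the "main memory minus Xmx" rule of thumb.
    """
    if heap_gb >= main_memory_gb:
        raise ValueError("heap must be smaller than main memory")
    direct_gb = main_memory_gb - heap_gb
    return [
        f"-Xms{heap_gb}G",  # Xms = Xmx, as suggested in the thread
        f"-Xmx{heap_gb}G",
        "-XX:+UseG1GC",
        f"-XX:MaxGCPauseMillis={max_gc_pause_ms}",
        "-XX:+ParallelRefProcEnabled",
        "-XX:+DisableExplicitGC",
        f"-XX:MaxDirectMemorySize={direct_gb}G",
    ]


if __name__ == "__main__":
    # Assumed example: a 64 GB host with a 16 GB heap leaves 48 GB off heap.
    print(" ".join(jvm_options(main_memory_gb=64, heap_gb=16)))
```

The numbers are placeholders, not a sizing recommendation; the right heap size depends on data volume and ingestion rate, as the rest of the thread discusses.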

Alex

11/20/2020, 1:21 AM

got it, thank you! do you do it to allocate additional ram? or just historical?

Elon

11/20/2020, 1:27 AM

sorry, just want to make sure we know how to take advantage of offheap, I thought it was just setting the load mode to MMAP, lmk if there's anything else

Mayank

11/20/2020, 1:27 AM

https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime#controlling-memory-allocation

🙌 3

Mayank

11/20/2020, 1:27 AM

@Elon ^^
