# troubleshooting
e
We ran into an issue where a thread was blocked, and it caused the entire cluster to stop processing queries due to the worker (pqw threads in the default scheduler) thread pool on one server being blocked. Would it help to make the worker thread pool use a cached thread pool while still keeping the query runner thread pool (pqr threads in the default scheduler) fixed? Or do you recommend using one of the other query schedulers? Here is the thread dump:
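(For context, a minimal sketch of the two pool types in question, using plain java.util.concurrent rather than Pinot's actual scheduler wiring: a fixed pool caps the number of worker threads and queues everything else, while a cached pool keeps creating threads as work arrives.)
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration only: fixed vs cached pools in plain java.util.concurrent,
// not Pinot's actual worker/query-runner scheduler setup.
public class PoolSketch {
  public static void main(String[] args) {
    // Fixed pool: at most 8 threads; anything beyond that waits in the queue,
    // so a few blocked threads can stall the queued work behind them.
    ExecutorService fixedWorkers = Executors.newFixedThreadPool(8);

    // Cached pool: spins up a new thread whenever no idle one is available,
    // so blocked threads don't hold up new tasks, but the thread count is unbounded.
    ExecutorService cachedWorkers = Executors.newCachedThreadPool();

    for (int i = 0; i < 20; i++) {
      final int task = i;
      fixedWorkers.submit(() -> System.out.println("fixed pool ran task " + task));
      cachedWorkers.submit(() -> System.out.println("cached pool ran task " + task));
    }
    fixedWorkers.shutdown();
    cachedWorkers.shutdown();
  }
}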
m
Hmm, could you elaborate on how just one thread being blocked caused the entire cluster to stop processing?
Also, any insights on why it was blocked?
e
Yep, it was really strange but happened multiple times, coinciding with a large # of segments scanned for a particular table. Only one pqr threadpool on one worker was blocked, but no queries in that tenant were completing.
We would see messages in the logs like this:
(attached log snippet: Untitled)
Thread was blocked (or deadlocked?) reading.
maybe it was deadlocked with a thread ingesting realtime segments?
m
I think the root cause might be something else, I don't see how one thread can block the entire cluster.
e
Either way each time it happened just bouncing that one server fixed the issue.
I checked threads on all the servers in that tenant and only one particular server would have pqw threads in BLOCKED state, with a similar thread dump each time.
I did see that direct memory dropped and mmapped memory suddenly would spike each time this happened.
And no queries completed (for tables on that tenant) until the server was restarted.
m
That probably means segment completion
e
Any idea what would cause that?
m
The direct mem vs mmap change is segment completion.
Are you saying one thread in one server is causing the entire cluster to stop processing?
Or are you saying all/several pqw threads in one server?
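(Illustration of the two memory kinds being compared here, not Pinot's segment code: direct off-heap buffers show up under direct memory, while a file mapped via FileChannel.map shows up as mmapped memory; the file name below is just a placeholder.)
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch only: shows the two memory categories in the metrics discussion.
public class MemoryKindsSketch {
  public static void main(String[] args) throws Exception {
    // Direct (off-heap) buffer: counted under "direct memory" metrics.
    ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024 * 1024);
    direct.putInt(0, 42);

    // Memory-mapped file: counted under "mmapped memory" metrics.
    // "segment.tmp" is just a placeholder path for this sketch.
    new java.io.File("segment.tmp").deleteOnExit();
    try (RandomAccessFile raf = new RandomAccessFile("segment.tmp", "rw");
         FileChannel channel = raf.getChannel()) {
      MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, 64 * 1024 * 1024);
      mapped.putInt(0, direct.getInt(0)); // move the in-memory data into the mapped file
    }
    // Data moving from the first kind to the second matches the observed
    // drop in direct memory and spike in mmapped memory at segment completion.
  }
}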
e
Only for queries hitting tables on that tenant
and only 1 server had blocked pqw threads, always with the same thread dump
m
Ok, multiple blocked pqw threads on 1 server, then?
e
From the thread dump I see this code:
public void add(int dictId, int docId) {
  if (_bitmaps.size() == dictId) {
    // Bitmap for the dictionary id does not exist, add a new bitmap into the list
    ThreadSafeMutableRoaringBitmap bitmap = new ThreadSafeMutableRoaringBitmap(docId);
    try {
      _writeLock.lock();
      _bitmaps.add(bitmap);
    } finally {
      _writeLock.unlock();
    }
  } else {
    // Bitmap for the dictionary id already exists, check and add document id into the bitmap
    _bitmaps.get(dictId).add(docId);
  }
}
Maybe because a segment was added while the unsynchronized code was getting the size or calling get? (i.e. the first and last lines)
And although it was multiple pqw threads blocked on 1 server, there were still runnable pqw threads. Each time this happened it was the same scenario: 1 server, 1-3 threads blocked with that same thread dump.
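For reference, a sketch of what guarding those reads with the read lock could look like, using a placeholder Bitmap type instead of ThreadSafeMutableRoaringBitmap; this is just to illustrate the locking pattern under discussion, not the actual Pinot code or a proposed fix:
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: the same add() pattern with the size()/get() reads also guarded,
// so a concurrent append to the list cannot race with them.
public class GuardedBitmapList {
  static class Bitmap { // stand-in for ThreadSafeMutableRoaringBitmap
    private final BitSet _docs = new BitSet();
    synchronized void add(int docId) {
      _docs.set(docId);
    }
  }

  private final List<Bitmap> _bitmaps = new ArrayList<>();
  private final ReentrantReadWriteLock _lock = new ReentrantReadWriteLock();

  public void add(int dictId, int docId) {
    _lock.readLock().lock();
    try {
      if (dictId < _bitmaps.size()) {
        // Bitmap already exists; reading it under the read lock is safe
        _bitmaps.get(dictId).add(docId);
        return;
      }
    } finally {
      _lock.readLock().unlock();
    }
    // A new bitmap may need to be appended: take the write lock and re-check,
    // since another thread could have appended it between the two lock sections.
    _lock.writeLock().lock();
    try {
      if (dictId == _bitmaps.size()) {
        Bitmap bitmap = new Bitmap();
        bitmap.add(docId);
        _bitmaps.add(bitmap);
      } else if (dictId < _bitmaps.size()) {
        _bitmaps.get(dictId).add(docId);
      }
    } finally {
      _lock.writeLock().unlock();
    }
  }
}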
m
Are you using a text index?
if not then nm
e
ok:)
we're not using text indexes on this table at all
m
what release?
e
0.6.0
we're testing 0.7.1, do you recommend we upgrade?
m
I am curious to know what is happening here. I haven't caught any race conditions on my radar recently.
e
yep, I just noticed that the size() and get() are called unsynchronized, but it doesn't seem like that would lead to any deadlocks...
From the logs it looks like a thread was stuck reading - also not sure why it would block other queries from completing.
m
I can see how multiple pqw threads blocked can cause this. But single one, I am not so sure.
e
there were multiple pqw threads blocked (1-3 out of 8) but there were also runnable pqw threads each time
m
There's a metric for wait time in scheduler queue for the query
e
Nice, is it in the fcfs code?
m
ServerQueryExecutorV1Impl
e
thanks!
m
Check for SCHEDULER_WAIT
see if that spikes
If so, then it is building a backlog
One way to mitigate that is to set table level timeout
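(A toy sketch of the backlog idea, not Pinot's SCHEDULER_WAIT metric implementation: with a fixed pool in which some workers are stuck, the time each new task spends waiting in the queue keeps climbing.)
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy illustration of scheduler queue wait growing into a backlog.
public class SchedulerWaitSketch {
  public static void main(String[] args) throws InterruptedException {
    ExecutorService workers = Executors.newFixedThreadPool(2); // tiny stand-in for the pqw pool

    // One worker gets stuck, like a BLOCKED pqw thread.
    workers.submit(() -> {
      try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
    });

    // The remaining "queries" each take 200 ms on the one live worker,
    // so the time each new one spends waiting in the queue keeps climbing.
    for (int i = 0; i < 5; i++) {
      final int queryId = i;
      final long enqueuedAt = System.nanoTime();
      workers.submit(() -> {
        long waitMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - enqueuedAt);
        System.out.println("query " + queryId + " waited " + waitMs + " ms in the scheduler queue");
        try { Thread.sleep(200); } catch (InterruptedException ignored) { }
      });
    }

    workers.shutdown();
    workers.awaitTermination(3, TimeUnit.SECONDS); // queued queries finish in about a second
    workers.shutdownNow();                         // interrupt the stuck worker so the JVM can exit
  }
}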
e
ok, and the other thing that happened each time was a huge (like 75%) drop in direct memory used, and a spike in mmapped memory.
Oh nice, how do we set the timeout?
m
QueryConfig
Inside of Table config
Although, since in Java you cannot force-interrupt a thread, I am not sure if that will preempt this blocked thread.
Perhaps it does
But again, this is mitigation.
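(Side note on the force-interrupt point, generic java.util.concurrent and nothing Pinot-specific: Thread.interrupt() only sets a flag, so a thread parked in ReentrantLock.lock() stays blocked, while lockInterruptibly() backs out. A small sketch:)
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of why interrupting a blocked thread may not preempt it:
// lock() ignores interrupts, lockInterruptibly() honors them.
public class InterruptSketch {
  public static void main(String[] args) throws InterruptedException {
    ReentrantLock lock = new ReentrantLock();
    lock.lock(); // main thread holds the lock so the others must block

    Thread stuck = new Thread(() -> {
      lock.lock(); // does NOT respond to interrupt; stays blocked
      lock.unlock();
    }, "stuck");

    Thread preemptible = new Thread(() -> {
      try {
        lock.lockInterruptibly(); // responds to interrupt
        lock.unlock();
      } catch (InterruptedException e) {
        System.out.println("preemptible thread backed out after interrupt");
      }
    }, "preemptible");

    stuck.start();
    preemptible.start();
    TimeUnit.MILLISECONDS.sleep(200);

    stuck.interrupt();        // no effect while blocked in lock()
    preemptible.interrupt();  // causes lockInterruptibly() to throw

    TimeUnit.MILLISECONDS.sleep(200);
    System.out.println("stuck thread state: " + stuck.getState()); // still waiting on the lock

    lock.unlock();            // release so the stuck thread can finish and the JVM can exit
    stuck.join();
  }
}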
I'd recommend filing an issue with as much details as possible. We should get to the root cause.
e
Sure, I saved a bunch of logs and metrics...
m
Seems like some race condition might be causing some threads to block, building a backlog (please check the scheduler wait metric)
e
Would changing pqw to a cached threadpool be dangerous?
Yep, I'm looking for it now, thanks!
m
Need to think about the repercussions of that. I'd rather get to the bottom of the issue instead of trying something.
e
yep, makes sense
thanks, I'll look into the metrics
m
Please also include info on whether segment commit was happening (and whether it failed or anything), any special characteristics of the table/kafka-topic this happens on, etc.
This will help get some clues.
e
Sure, I'll file an issue, thanks for the advice!
m
thanks for reporting.
e
So I do see spikes in the scheduler wait metric:
(screenshot attached: Screen Shot 2021-06-02 at 9.57.33 PM.png)
m
Do they correspond to the time when the issue happened? If so, try setting a table-level timeout as a fallback, but we should still debug the issue
e
yep
also another note: using the Trino connector I could query the servers directly with no blocks or delays
same table...
it was only broker queries which were blocked - not sure why restarting 1 server would fix the issue each time.
m
Hmm, broker-to-server connection is async now (and has been for a long time), so not sure why that would happen
Oh, I think it could be because the Trino-to-server connection uses a different thread pool (not shared with pqw)?
Also tagging @Jackie
j
If it is always the same server that is blocking, maybe check hardware failure?
E.g. block on IO because one segment is unreadable
(Just random guess)
k
Hmm, I had something similar happen when I did a DISTINCT query on a column with very high cardinality. A broker somehow got “wedged”, and would no longer process queries - we had to bounce it. IIRC the guess at that time was some issue with the connection pool getting hung when the response from a server was too big.
m
Yeah, so the end result is queries piling up and timing out. And that may be mitigated by choosing a more appropriate table-level query timeout. In your case it was an expensive query, but here I am more concerned about the possible deadlock
e
Yeah, it happened today again - sorry for the delay. I saved thread dumps (all similar). Today it was multiple servers, all pqw pool.
And the scheduler wait metric showed pileups again. Seemed like it wasn't one large query; many small ones were blocked...