# troubleshooting
e
We ran into an issue where a thread was blocked, and it caused the entire cluster to stop processing queries due to the worker (pqw threads in the default scheduler) thread pool on one server being blocked. Would it help to make the worker thread pool use a cached thread pool while still keeping the query runner thread pool (pqr threads in the default scheduler) fixed? Or do you recommend using one of the other query schedulers? Here is the thread dump:
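(For context, a minimal sketch of the two pool types in question, using plain java.util.concurrent rather than Pinot's actual scheduler wiring: a fixed pool caps the number of worker threads and queues everything else, while a cached pool keeps creating threads as work arrives.)
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration only: fixed vs cached pools in plain java.util.concurrent,
// not Pinot's actual worker/query-runner scheduler setup.
public class PoolSketch {
  public static void main(String[] args) {
    // Fixed pool: at most 8 threads; anything beyond that waits in the queue,
    // so a few blocked threads can stall the queued work behind them.
    ExecutorService fixedWorkers = Executors.newFixedThreadPool(8);

    // Cached pool: spins up a new thread whenever no idle one is available,
    // so blocked threads don't hold up new tasks, but the thread count is unbounded.
    ExecutorService cachedWorkers = Executors.newCachedThreadPool();

    for (int i = 0; i < 20; i++) {
      final int task = i;
      fixedWorkers.submit(() -> System.out.println("fixed pool ran task " + task));
      cachedWorkers.submit(() -> System.out.println("cached pool ran task " + task));
    }
    fixedWorkers.shutdown();
    cachedWorkers.shutdown();
  }
}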
m
Hmm, could you elaborate on how just one thread being blocked caused the entire cluster to stop processing?
Also, any insights on why it was blocked?
e
Yep, it was really strange but happened multiple times, coinciding with a large # of segments scanned for a particular table. Only one pqr threadpool on one worker was blocked, but no queries in that tenant were completing.
We would see messages in the logs like this:
(attached log snippet: Untitled)
Thread was blocked (or deadlocked?) reading.
maybe it was deadlocked with a thread ingesting realtime segments?
m
I think the root cause might be something else, I don't see how one thread can block the entire cluster.
e
Either way each time it happened just bouncing that one server fixed the issue.
I checked threads on all the servers in that tenant and only one particular server would have pqw threads in BLOCKED state, with a similar thread dump each time.
I did see that direct memory dropped and mmapped memory suddenly would spike each time this happened.
And no queries completed (for tables on that tenant) until the server was restarted.
m
That probably means segment completion
e
Any idea what would cause that?
m
The direct mem vs mmap change is segment completion.
Are you saying one thread in one server is causing the entire cluster to stop processing?
Or are you saying all/several pqw threads in one server?
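(Illustration of the two memory kinds being compared here, not Pinot's segment code: direct off-heap buffers show up under direct memory, while a file mapped via FileChannel.map shows up as mmapped memory; the file name below is just a placeholder.)
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch only: shows the two memory categories in the metrics discussion.
public class MemoryKindsSketch {
  public static void main(String[] args) throws Exception {
    // Direct (off-heap) buffer: counted under "direct memory" metrics.
    ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024 * 1024);
    direct.putInt(0, 42);

    // Memory-mapped file: counted under "mmapped memory" metrics.
    // "segment.tmp" is just a placeholder path for this sketch.
    new java.io.File("segment.tmp").deleteOnExit();
    try (RandomAccessFile raf = new RandomAccessFile("segment.tmp", "rw");
         FileChannel channel = raf.getChannel()) {
      MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, 64 * 1024 * 1024);
      mapped.putInt(0, direct.getInt(0)); // move the in-memory data into the mapped file
    }
    // Data moving from the first kind to the second matches the observed
    // drop in direct memory and spike in mmapped memory at segment completion.
  }
}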
e
Only for queries hitting tables on that tenant
and only 1 server had blocked pqw threads, always with the same thread dump
m
Ok, multiple blocked pqw threads on 1 server, then?
e
From the thread dump I see this code:
public void add(int dictId, int docId) {
  if (_bitmaps.size() == dictId) {
    // Bitmap for the dictionary id does not exist, add a new bitmap into the list
    ThreadSafeMutableRoaringBitmap bitmap = new ThreadSafeMutableRoaringBitmap(docId);
    try {
      _writeLock.lock();
      _bitmaps.add(bitmap);
    } finally {
      _writeLock.unlock();
    }
  } else {
    // Bitmap for the dictionary id already exists, check and add document id into the bitmap
    _bitmaps.get(dictId).add(docId);
  }
}
Maybe because a segment was added while the unsynchronized code was getting the size or calling get? (i.e. the first and last lines)
And although it was multiple pqw threads blocked on 1 server, there were still runnable pqw threads. Each time this happened it was the same scenario: 1 server, 1-3 threads blocked with that same thread dump.
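For reference, a sketch of what guarding those reads with the read lock could look like, using a placeholder Bitmap type instead of ThreadSafeMutableRoaringBitmap; this is just to illustrate the locking pattern under discussion, not the actual Pinot code or a proposed fix:
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: the same add() pattern with the size()/get() reads also guarded,
// so a concurrent append to the list cannot race with them.
public class GuardedBitmapList {
  static class Bitmap { // stand-in for ThreadSafeMutableRoaringBitmap
    private final BitSet _docs = new BitSet();
    synchronized void add(int docId) {
      _docs.set(docId);
    }
  }

  private final List<Bitmap> _bitmaps = new ArrayList<>();
  private final ReentrantReadWriteLock _lock = new ReentrantReadWriteLock();

  public void add(int dictId, int docId) {
    _lock.readLock().lock();
    try {
      if (dictId < _bitmaps.size()) {
        // Bitmap already exists; reading it under the read lock is safe
        _bitmaps.get(dictId).add(docId);
        return;
      }
    } finally {
      _lock.readLock().unlock();
    }
    // A new bitmap may need to be appended: take the write lock and re-check,
    // since another thread could have appended it between the two lock sections.
    _lock.writeLock().lock();
    try {
      if (dictId == _bitmaps.size()) {
        Bitmap bitmap = new Bitmap();
        bitmap.add(docId);
        _bitmaps.add(bitmap);
      } else if (dictId < _bitmaps.size()) {
        _bitmaps.get(dictId).add(docId);
      }
    } finally {
      _lock.writeLock().unlock();
    }
  }
}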
m
Are you using a text index?
if not then nm
e
ok:)
we're not using text indexes on this table at all
m
what release?
e
0.6.0
we're testing 0.7.1, do you recommend we upgrade?
m
I am curious to know what is happening here. I haven't caught any race conditions on my radar recently.
e
yep, I just noticed that the size() and get() are called unsynchronized, but it doesn't seem like that would lead to any deadlocks...
From the logs it looks like a thread was stuck reading - also not sure why it would block other queries from completing.
m
I can see how multiple pqw threads blocked can cause this. But single one, I am not so sure.
e
there were multiple pqw threads blocked (1-3 out of 8) but there were also runnable pqw threads each time
m
There's a metric for wait time in scheduler queue for the query
e
Nice, is it in the fcfs code?
m
ServerQueryExecutorV1Impl
e
thanks!
m
Check for SCHEDULER_WAIT
see if that spikes
If so, then it is building a backlog
One way to mitigate that is to set table level timeout
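(A toy sketch of the backlog idea, not Pinot's SCHEDULER_WAIT metric implementation: with a fixed pool in which some workers are stuck, the time each new task spends waiting in the queue keeps climbing.)
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy illustration of scheduler queue wait growing into a backlog.
public class SchedulerWaitSketch {
  public static void main(String[] args) throws InterruptedException {
    ExecutorService workers = Executors.newFixedThreadPool(2); // tiny stand-in for the pqw pool

    // One worker gets stuck, like a BLOCKED pqw thread.
    workers.submit(() -> {
      try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
    });

    // The remaining "queries" each take 200 ms on the one live worker,
    // so the time each new one spends waiting in the queue keeps climbing.
    for (int i = 0; i < 5; i++) {
      final int queryId = i;
      final long enqueuedAt = System.nanoTime();
      workers.submit(() -> {
        long waitMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - enqueuedAt);
        System.out.println("query " + queryId + " waited " + waitMs + " ms in the scheduler queue");
        try { Thread.sleep(200); } catch (InterruptedException ignored) { }
      });
    }

    workers.shutdown();
    workers.awaitTermination(3, TimeUnit.SECONDS); // queued queries finish in about a second
    workers.shutdownNow();                         // interrupt the stuck worker so the JVM can exit
  }
}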
e
ok, and the other thing that happened each time was a huge (like 75%) drop in direct memory used, and a spike in mmapped memory.
Oh nice, how do we set the timeout?
m
QueryConfig
Inside of Table config
Although, since in Java you cannot force-interrupt a thread, I am not sure if that will preempt this blocked thread.
Perhaps it does
But again, this is mitigation.
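(Side note on the force-interrupt point, generic java.util.concurrent and nothing Pinot-specific: Thread.interrupt() only sets a flag, so a thread parked in ReentrantLock.lock() stays blocked, while lockInterruptibly() backs out. A small sketch:)
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of why interrupting a blocked thread may not preempt it:
// lock() ignores interrupts, lockInterruptibly() honors them.
public class InterruptSketch {
  public static void main(String[] args) throws InterruptedException {
    ReentrantLock lock = new ReentrantLock();
    lock.lock(); // main thread holds the lock so the others must block

    Thread stuck = new Thread(() -> {
      lock.lock(); // does NOT respond to interrupt; stays blocked
      lock.unlock();
    }, "stuck");

    Thread preemptible = new Thread(() -> {
      try {
        lock.lockInterruptibly(); // responds to interrupt
        lock.unlock();
      } catch (InterruptedException e) {
        System.out.println("preemptible thread backed out after interrupt");
      }
    }, "preemptible");

    stuck.start();
    preemptible.start();
    TimeUnit.MILLISECONDS.sleep(200);

    stuck.interrupt();        // no effect while blocked in lock()
    preemptible.interrupt();  // causes lockInterruptibly() to throw

    TimeUnit.MILLISECONDS.sleep(200);
    System.out.println("stuck thread state: " + stuck.getState()); // still waiting on the lock

    lock.unlock();            // release so the stuck thread can finish and the JVM can exit
    stuck.join();
  }
}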
I'd recommend filing an issue with as much details as possible. We should get to the root cause.
e
Sure, I saved a bunch of logs and metrics...
m
Seems like some race condition might be causing some threads to block, building a backlog (please check the scheduler wait metric)
e
Would changing pqw to a cached threadpool be dangerous?
Yep, I'm looking for it now, thanks!
m
Need to think about the repercussions of that. I'd rather get to the bottom of the issue instead of trying something.
e
yep, makes sense
thanks, I'll look into the metrics
m
Please also include info on whether segment commit was happening (and whether it failed or anything), any special characteristics of the table/kafka-topic this happens on, etc.
This will help get some clues.
e
Sure, I'll file an issue, thanks for the advice!
m
thanks for reporting.
e
So I do see spikes in the scheduler wait metric:
(screenshot attached: Screen Shot 2021-06-02 at 9.57.33 PM.png)
m
Do they correspond to the time when the issue happened? If so, try setting a table-level timeout as a fallback, but we should still debug the issue
e
yep
also another note: using the Trino connector I could query the servers directly with no blocks or delays
same table...
it was only broker queries which were blocked - not sure why restarting 1 server would fix the issue each time.
m
Hmm, broker-to-server connection is async now (and has been for a long time), so not sure why that would happen
Oh, I think it could be because the Trino-to-server connection uses a different thread pool (not shared with pqw)?
Also tagging @Jackie
j
If it is always the same server that is blocking, maybe check hardware failure?
E.g. block on IO because one segment is unreadable
(Just random guess)
k
Hmm, I had something similar happen when I did a DISTINCT query on a column with very high cardinality. A broker somehow got “wedged”, and would no longer process queries - we had to bounce it. IIRC the guess at that time was some issue with the connection pool getting hung when the response from a server was too big.
m
Yeah, so the end result is queries piling up and timing out. And that may be mitigated by choosing a more appropriate table-level query timeout. In your case it was an expensive query, but here I am more concerned about the possible deadlock
e
Yeah, it happened today again - sorry for the delay. I saved thread dumps (all similar). Today it was multiple servers, all pqw pool.
And the scheduler wait metric showed pileups again. Seemed like it wasn't one large query; many small ones were blocked...