# dev
c
My current path of investigation is to track down where the segments unannounce. I am pretty sure this is happening in the indexers, when a task completes and the segment is sent to deep storage.
g
In situations like this it can be tough to identify the "real" bug. The idea here is that `ChangeRequestHttpSyncer` will keep trying forever, because something else is responsible for telling it when the server went away. Peeling apart how it's supposed to work:

The `ChangeRequestHttpSyncer` is held by a `DruidServerHolder`. This is created by `HttpServerInventoryView#serverAdded` in two cases:

1a) `DruidNodeDiscovery` fires `nodesAdded` to its listeners. One of these listeners is in `HttpServerInventoryView` and creates the `DruidServerHolder`.
2a) `HttpServerInventoryView#scheduleSyncMonitoring` notices a server holder is not OK and re-creates it.

Then, `ChangeRequestHttpSyncer#stop` is called by `DruidServerHolder#stop` in two cases:

1b) `DruidNodeDiscovery` fires `nodesRemoved` to its listeners. One of these listeners is in `HttpServerInventoryView` and stops the `DruidServerHolder`.
2b) `HttpServerInventoryView#scheduleSyncMonitoring` notices a server holder is not OK and stops it prior to re-creating it.
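To make that wiring concrete, here's a rough sketch of the create/stop flow described above. This is my own simplification, not the actual Druid code: the class and method names below are stand-ins for `HttpServerInventoryView`, `DruidServerHolder`, and the discovery listener, and the real signatures differ.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for DruidServerHolder: owns a ChangeRequestHttpSyncer that retries until stopped.
class ServerHolderSketch
{
  ServerHolderSketch(String host) {}
  void start() { /* would start the syncer's retry loop here */ }
  void stop()  { /* would tell the syncer to stop retrying here */ }
  boolean isOk() { /* in Druid this checks how recently the syncer succeeded */ return true; }
}

// Stand-in for the listener that HttpServerInventoryView registers with DruidNodeDiscovery.
class InventoryViewSketch
{
  private final Map<String, ServerHolderSketch> holders = new ConcurrentHashMap<>();

  // (1a) discovery fires nodesAdded -> serverAdded creates and starts a holder per host
  void nodesAdded(Collection<String> hosts)
  {
    for (String host : hosts) {
      holders.computeIfAbsent(host, ServerHolderSketch::new).start();
    }
  }

  // (1b) discovery fires nodesRemoved -> the holder is stopped, which stops its syncer.
  // If this never fires for a deleted pod, the syncer keeps retrying forever.
  void nodesRemoved(Collection<String> hosts)
  {
    for (String host : hosts) {
      ServerHolderSketch holder = holders.remove(host);
      if (holder != null) {
        holder.stop();
      }
    }
  }

  // (2a)/(2b) scheduleSyncMonitoring: stop and re-create any holder that is not OK
  void syncMonitoring()
  {
    holders.replaceAll((host, holder) -> {
      if (holder.isOk()) {
        return holder;
      }
      holder.stop();                         // (2b)
      ServerHolderSketch fresh = new ServerHolderSketch(host);
      fresh.start();                         // (2a)
      return fresh;
    });
  }
}
```

The key point is that nothing inside the holder gives up on its own; either (1b) or (2b) has to fire for the syncer to stop.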
So, what's supposed to happen when the pod goes away is (1b) above. The `DruidNodeDiscovery` created by `K8sDruidNodeDiscoveryProvider` is supposed to call `nodesRemoved` for that pod on its listeners, which in turn causes the `ChangeRequestHttpSyncer` to stop trying.

Two things that could be going wrong here:
• Perhaps the k8s-based discovery provider doesn't properly call `nodesRemoved` on its listeners.
• Perhaps it does properly call `nodesRemoved`, but the `ChangeRequestHttpSyncer#stop` method doesn't "work": it keeps trying even after `stop` is called.

My guess is it's the first one: something wrong with the k8s discovery provider. That's because the `ChangeRequestHttpSyncer` is used anytime http segment announcing is in play, which is relatively common, and we haven't seen such problems with it there.

Looking at the code for `K8sDruidNodeDiscoveryProvider`, I see the listener fires in one place:
```java
case WatchResult.DELETED:
  baseNodeRoleWatcher.childRemoved(item.object.getNode());
  break;
```
Is it possible something's wrong with the way we do these watches? This part seems suspicious:
```java
} else {
  // Try again by starting the watch from the beginning. This can happen if the
  // watch goes bad.
  LOGGER.debug("Received NULL item while watching node type [%s]. Restarting watch.", this.nodeRole);
  return;
}
```
I'm not sure what "watch goes bad" means. But the logic does `return` from a method that is processing an iterator. Is it possible there is more stuff in the iterator that we're now ignoring, like perhaps another `WatchResult.DELETED` that we'll miss?
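To illustrate the concern, here's a tiny self-contained sketch (hypothetical types and names, not the actual `K8sDruidNodeDiscoveryProvider` code): if a null item shows up before a `DELETED` event for the dead pod and the loop returns at that point, the `DELETED` is never processed and `nodesRemoved` never fires.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical event wrapper, standing in for the items the Druid watcher iterates over.
enum EventType { ADDED, MODIFIED, DELETED }

class WatchEvent
{
  final EventType type;
  final String pod;
  WatchEvent(EventType type, String pod) { this.type = type; this.pod = pod; }
}

class WatchLoopSketch
{
  // A `return` on a null item abandons the iterator, so any events still buffered
  // behind it (e.g. a DELETED for the dead pod) are never processed.
  static void processWatch(Iterator<WatchEvent> watch)
  {
    while (watch.hasNext()) {
      WatchEvent item = watch.next();
      if (item == null) {
        // "watch goes bad" case: bailing out here means the DELETED below is never seen,
        // so childRemoved/nodesRemoved never fires and the syncer keeps retrying.
        return;
      }
      switch (item.type) {
        case DELETED:
          System.out.println("childRemoved -> nodesRemoved for " + item.pod);
          break;
        default:
          System.out.println(item.type + " for " + item.pod);
          break;
      }
    }
  }

  public static void main(String[] args)
  {
    // A null item arrives before the DELETED event: the DELETED is silently dropped.
    List<WatchEvent> events = java.util.Arrays.asList(
        new WatchEvent(EventType.ADDED, "indexer-pod-1"),
        null,
        new WatchEvent(EventType.DELETED, "indexer-pod-1")
    );
    processWatch(events.iterator());
  }
}
```

Whether the real watch actually delivers events in that order is exactly the thing to verify; the sketch just shows why the `return` is worth a close look.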
btw, I just wrote this comment on your other PR: https://github.com/apache/druid/pull/13175#pullrequestreview-1138158359. It should be re-branched onto `master`, and something seems wrong with the build.
c
That really clarifies the process for me, thank you! I also suspect the k8s deletion handling is malfunctioning. I can debug that directly now that I know where to look.
FYI: I found some hidden issues in the move to k8s api v16, so I am working with 11.0.3 (the latest v11 with security patches). I will fix the bugs with deleted nodes and anything else I've uncovered against 11.0.3, then do the upgrade separately afterwards to decouple the changes.
g
ah, makes sense. A comment on GitHub about that would be great, in case someone only follows that conversation. Thank you for digging into this!
c
No worries, glad to help, and I'm enjoying the deep dive.