Slackbot
10/10/2022, 11:45 PM
Cory Johannsen
10/11/2022, 7:12 PM
Gian Merlino
10/11/2022, 9:39 PM
`ChangeRequestHttpSyncer` will keep trying forever on its own, because it relies on something else to tell it when the server has gone away. Peeling apart how it's supposed to work:
The `ChangeRequestHttpSyncer` is held by a `DruidServerHolder`. This is created by `HttpServerInventoryView#serverAdded` in two cases:
1a) `DruidNodeDiscovery` fires `nodesAdded` to its listeners. One of these listeners is in `HttpServerInventoryView` and creates the `DruidServerHolder`.
2a) `HttpServerInventoryView#scheduleSyncMonitoring` notices a server holder is not OK and re-creates it.
Then, `ChangeRequestHttpSyncer#stop` is called by `DruidServerHolder#stop` in two cases:
1b) `DruidNodeDiscovery` fires `nodesRemoved` to its listeners. One of these listeners is in `HttpServerInventoryView` and stops the `DruidServerHolder`.
2b) `HttpServerInventoryView#scheduleSyncMonitoring` notices a server holder is not OK and stops it prior to re-creating it.
So, what's supposed to happen when the pod goes away is (1b) above. The `DruidNodeDiscovery` created by `K8sDruidNodeDiscoveryProvider` is supposed to call `nodesRemoved` for that pod on its listeners, which in turn causes the `ChangeRequestHttpSyncer` to stop trying.
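Roughly, the lifecycle in (1a)/(1b) and (2a)/(2b) can be modeled like this. It's a simplified sketch, not Druid's actual code: `SyncerSketch`, `InventoryViewSketch`, `monitorOnce`, and the plain string server keys are hypothetical stand-ins for `ChangeRequestHttpSyncer`, `HttpServerInventoryView`, the sync monitoring, and the real node types.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ChangeRequestHttpSyncer: it retries until stopped.
class SyncerSketch {
    private boolean stopped = false;
    private int attempts = 0;

    // One sync attempt. Returns true if another attempt should be scheduled.
    boolean attemptSync() {
        if (stopped) {
            return false;
        }
        attempts++;
        return true; // assumption: the server is unreachable, so keep retrying
    }

    void stop() { stopped = true; }
    boolean isStopped() { return stopped; }
    int attempts() { return attempts; }
}

// Hypothetical stand-in for HttpServerInventoryView and its discovery listener.
class InventoryViewSketch {
    private final Map<String, SyncerSketch> holders = new HashMap<>();

    // Case (1a): discovery reports a new node, so create a syncer for it.
    void nodesAdded(String server) {
        holders.computeIfAbsent(server, s -> new SyncerSketch());
    }

    // Case (1b): discovery reports the node is gone, so stop and drop its syncer.
    void nodesRemoved(String server) {
        SyncerSketch syncer = holders.remove(server);
        if (syncer != null) {
            syncer.stop();
        }
    }

    // Cases (2a)/(2b): monitoring stops an unhealthy syncer, then re-creates it.
    void monitorOnce(String server, boolean healthy) {
        SyncerSketch old = holders.get(server);
        if (old != null && !healthy) {
            old.stop();
            holders.put(server, new SyncerSketch());
        }
    }

    SyncerSketch syncerFor(String server) { return holders.get(server); }
}
```

In this model, if `nodesRemoved` never arrives for a pod, nothing ever calls `stop` on its syncer, so `attemptSync` keeps returning true.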
Two things that could be going wrong here:
• Perhaps the k8s-based discovery provider doesn't properly call `nodesRemoved` on its listeners.
• Perhaps it does properly call `nodesRemoved`, but the `ChangeRequestHttpSyncer#stop` method doesn't "work": it keeps trying even after `stop` is called.
My guess is it's the first one: something wrong with the k8s discovery provider. That's because the `ChangeRequestHttpSyncer` is used anytime HTTP segment announcing is in play, which is relatively common, and we haven't seen such problems with it there.
Looking at the code for `K8sDruidNodeDiscoveryProvider`, I see the listener fires in one place:
case WatchResult.DELETED:
  baseNodeRoleWatcher.childRemoved(item.object.getNode());
  break;
Is it possible something's wrong with the way we do these watches?
This part seems suspicious:
} else {
  // Try again by starting the watch from the beginning. This can happen if the
  // watch goes bad.
  LOGGER.debug("Received NULL item while watching node type [%s]. Restarting watch.", this.nodeRole);
  return;
}
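The shape of that branch can be modeled with a minimal sketch, which is not the actual `K8sDruidNodeDiscoveryProvider` code. `WatchLoopSketch`, `Event`, and `processUntilNull` are hypothetical names, and the sketch assumes (unconfirmed against the real watch implementation) that one watch iterator can yield a null item followed by further events.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class WatchLoopSketch {
    // Hypothetical event: a type ("ADDED" or "DELETED") and a node name.
    static final class Event {
        final String type;
        final String node;

        Event(String type, String node) {
            this.type = type;
            this.node = node;
        }
    }

    // Mirrors the shape of the branch above: return as soon as a null item is seen.
    // Returns the nodes for which a removal was observed.
    static List<String> processUntilNull(Iterator<Event> watch) {
        List<String> removed = new ArrayList<>();
        while (watch.hasNext()) {
            Event item = watch.next();
            if (item != null) {
                if ("DELETED".equals(item.type)) {
                    removed.add(item.node);
                }
            } else {
                // Early return: anything still in the iterator is never looked at.
                return removed;
            }
        }
        return removed;
    }
}
```

Feeding this an iterator over ADDED(pod-1), null, DELETED(pod-1) returns an empty list, because the removal sits behind the null item. Whether that matters in practice depends on what the restarted watch replays, which this sketch does not model.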
I'm not sure what "watch goes bad" means. But the logic does `return` from a method that is processing an iterator. Is it possible there is more stuff in the iterator that we're now ignoring, like perhaps another `WatchResult.DELETED` that we'll miss?
Gian Merlino
10/11/2022, 9:40 PM
`master` & something seems wrong with the build
Cory Johannsen
10/11/2022, 9:42 PM
Cory Johannsen
10/11/2022, 9:44 PM
Gian Merlino
10/11/2022, 9:50 PM
Cory Johannsen
10/11/2022, 9:51 PM