# dev
c
My current path of investigation is to track down where the segments unannounce. I am pretty sure this is happening in the indexers, when a task completes and the segment is sent to deep storage.
g
In situations like this it can be tough to identify the "real" bug. The idea here is that `ChangeRequestHttpSyncer` will keep trying forever, because something else is responsible for telling it when the server went away. Peeling apart how it's supposed to work:

The `ChangeRequestHttpSyncer` is held by a `DruidServerHolder`. This is created by `HttpServerInventoryView#serverAdded` in two cases:

1a) `DruidNodeDiscovery` fires `nodesAdded` to its listeners. One of these listeners is in `HttpServerInventoryView` and creates the `DruidServerHolder`.
2a) `HttpServerInventoryView#scheduleSyncMonitoring` notices a server holder is not OK and re-creates it.

Then, `ChangeRequestHttpSyncer#stop` is called by `DruidServerHolder#stop` in two cases:

1b) `DruidNodeDiscovery` fires `nodesRemoved` to its listeners. One of these listeners is in `HttpServerInventoryView` and stops the `DruidServerHolder`.
2b) `HttpServerInventoryView#scheduleSyncMonitoring` notices a server holder is not OK and stops it prior to re-creating it.
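To make that wiring concrete, here's a rough sketch of the create/stop flow described above. This is my own simplification, not the actual Druid code: the class and method names below are stand-ins for `HttpServerInventoryView`, `DruidServerHolder`, and the discovery listener, and the real signatures differ.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for DruidServerHolder: owns a ChangeRequestHttpSyncer that retries until stopped.
class ServerHolderSketch
{
  ServerHolderSketch(String host) {}
  void start() { /* would start the syncer's retry loop here */ }
  void stop()  { /* would tell the syncer to stop retrying here */ }
  boolean isOk() { /* in Druid this checks how recently the syncer succeeded */ return true; }
}

// Stand-in for the listener that HttpServerInventoryView registers with DruidNodeDiscovery.
class InventoryViewSketch
{
  private final Map<String, ServerHolderSketch> holders = new ConcurrentHashMap<>();

  // (1a) discovery fires nodesAdded -> serverAdded creates and starts a holder per host
  void nodesAdded(Collection<String> hosts)
  {
    for (String host : hosts) {
      holders.computeIfAbsent(host, ServerHolderSketch::new).start();
    }
  }

  // (1b) discovery fires nodesRemoved -> the holder is stopped, which stops its syncer.
  // If this never fires for a deleted pod, the syncer keeps retrying forever.
  void nodesRemoved(Collection<String> hosts)
  {
    for (String host : hosts) {
      ServerHolderSketch holder = holders.remove(host);
      if (holder != null) {
        holder.stop();
      }
    }
  }

  // (2a)/(2b) scheduleSyncMonitoring: stop and re-create any holder that is not OK
  void syncMonitoring()
  {
    holders.replaceAll((host, holder) -> {
      if (holder.isOk()) {
        return holder;
      }
      holder.stop();                         // (2b)
      ServerHolderSketch fresh = new ServerHolderSketch(host);
      fresh.start();                         // (2a)
      return fresh;
    });
  }
}
```

The key point is that nothing inside the holder gives up on its own; either (1b) or (2b) has to fire for the syncer to stop.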
So, what's supposed to happen when the pod goes away is (1b) above. The `DruidNodeDiscovery` created by `K8sDruidNodeDiscoveryProvider` is supposed to call `nodesRemoved` for that pod on its listeners, which in turn causes the `ChangeRequestHttpSyncer` to stop trying.

Two things that could be going wrong here:
• Perhaps the k8s-based discovery provider doesn't properly call `nodesRemoved` on its listeners.
• Perhaps it does properly call `nodesRemoved`, but the `ChangeRequestHttpSyncer#stop` method doesn't "work": it keeps trying even after `stop` is called.

My guess is it's the first one: something wrong with the k8s discovery provider. That's because the `ChangeRequestHttpSyncer` is used anytime http segment announcing is in play, which is relatively common, and we haven't seen such problems with it there.

Looking at the code for `K8sDruidNodeDiscoveryProvider`, I see the listener fires in one place:
```java
case WatchResult.DELETED:
  baseNodeRoleWatcher.childRemoved(item.object.getNode());
  break;
```
Is it possible something's wrong with the way we do these watches? This part seems suspicious:
```java
} else {
  // Try again by starting the watch from the beginning. This can happen if the
  // watch goes bad.
  LOGGER.debug("Received NULL item while watching node type [%s]. Restarting watch.", this.nodeRole);
  return;
}
```
I'm not sure what "watch goes bad" means. But the logic does `return` from a method that is processing an iterator. Is it possible there is more stuff in the iterator that we're now ignoring, like perhaps another `WatchResult.DELETED` that we'll miss?
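To illustrate the concern, here's a tiny self-contained sketch (hypothetical types and names, not the actual `K8sDruidNodeDiscoveryProvider` code): if a null item shows up before a `DELETED` event for the dead pod and the loop returns at that point, the `DELETED` is never processed and `nodesRemoved` never fires.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical event wrapper, standing in for the items the Druid watcher iterates over.
enum EventType { ADDED, MODIFIED, DELETED }

class WatchEvent
{
  final EventType type;
  final String pod;
  WatchEvent(EventType type, String pod) { this.type = type; this.pod = pod; }
}

class WatchLoopSketch
{
  // A `return` on a null item abandons the iterator, so any events still buffered
  // behind it (e.g. a DELETED for the dead pod) are never processed.
  static void processWatch(Iterator<WatchEvent> watch)
  {
    while (watch.hasNext()) {
      WatchEvent item = watch.next();
      if (item == null) {
        // "watch goes bad" case: bailing out here means the DELETED below is never seen,
        // so childRemoved/nodesRemoved never fires and the syncer keeps retrying.
        return;
      }
      switch (item.type) {
        case DELETED:
          System.out.println("childRemoved -> nodesRemoved for " + item.pod);
          break;
        default:
          System.out.println(item.type + " for " + item.pod);
          break;
      }
    }
  }

  public static void main(String[] args)
  {
    // A null item arrives before the DELETED event: the DELETED is silently dropped.
    List<WatchEvent> events = java.util.Arrays.asList(
        new WatchEvent(EventType.ADDED, "indexer-pod-1"),
        null,
        new WatchEvent(EventType.DELETED, "indexer-pod-1")
    );
    processWatch(events.iterator());
  }
}
```

Whether the real watch actually delivers events in that order is exactly the thing to verify; the sketch just shows why the `return` is worth a close look.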
btw, I just wrote this comment on your other PR: https://github.com/apache/druid/pull/13175#pullrequestreview-1138158359. It should be re-branched onto `master`, and something seems wrong with the build.
c
That really clarifies the process for me, thank you! I also suspect the k8s deletion handling is malfunctioning. I can debug that directly now that I know where to look.
FYI: I found some hidden issues in the move to k8s api v16, so I am working with 11.0.3 (the latest v11 with security patches). I will fix the bugs with deleted nodes and anything else I've uncovered against 11.0.3, then do the upgrade separately afterwards to decouple the changes.
g
ah, makes sense. A comment on GitHub about that would be great, in case someone only follows that conversation. Thank you for digging into this!
c
No worries, glad to help, and I'm enjoying the deep dive.