# general
n
we have a list of the important ones created by our SREs: https://apache-pinot.readthedocs.io/en/latest/in_production.html#monitoring-pinot
a
how should I interpret this?
"pinot.controllerpercentSegmentsAvailable.region_behavior_OFFLINE\"",} -9.223372036854776E18
It's supposed to be the percentage of complete online replicas in external view compared to replicas in ideal state, but I'm not sure what to make of this value?
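The value itself is a clue: -9.223372036854776E18 is exactly Long.MIN_VALUE rendered as a double, which usually indicates a gauge that was never set. A minimal sketch checking that the number matches (the "unset long-backed gauge" interpretation is an assumption, not confirmed in this thread):

```java
public class GaugeSentinelDemo {
    static double exportedDefault() {
        // Assumption: an unset long-backed gauge exports its sentinel,
        // Long.MIN_VALUE, cast to a double by the metrics exporter.
        return (double) Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        // Prints -9.223372036854776E18, the exact value in the metrics dump.
        System.out.println(Double.toString(exportedDefault()));
    }
}
```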
n
that is strange. In our setup, we usually see values between 0 and 100, as expected.
_controllerMetrics.setValueOfTableGauge(tableNameWithType, ControllerGauge.PERCENT_SEGMENTS_AVAILABLE,
        (nSegments > 0) ? (100 - (nOffline * 100 / nSegments)) : 100);
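The quoted formula can be exercised on its own; a minimal sketch with made-up segment counts (the nSegments/nOffline values are illustrative, not numbers from this cluster):

```java
public class PercentSegmentsAvailableDemo {
    // Mirrors the quoted formula: percent of segments that are not offline,
    // defaulting to 100 when a table has no segments yet.
    static long percentSegmentsAvailable(long nSegments, long nOffline) {
        return (nSegments > 0) ? (100 - (nOffline * 100 / nSegments)) : 100;
    }

    public static void main(String[] args) {
        System.out.println(percentSegmentsAvailable(10, 0)); // 100: all online
        System.out.println(percentSegmentsAvailable(10, 3)); // 70
        System.out.println(percentSegmentsAvailable(0, 0));  // 100 by convention
    }
}
```

Whatever the inputs, this expression can only yield 0 to 100, so a reading of -9.2E18 points at the gauge never being updated rather than at the arithmetic.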
i can try to debug tomorrow
a
Okay, that would be great. Maybe we can just do an update of Pinot
n
is it possible to share the ideal state and external view of table region_behavior?
and is it just this metric which looks off, or are they all giving weird values?
a
all the metrics for controllerpercentSegmentsAvailable look like that
},
"region_behavior_2018_01_2018_01_59": {
    "Server_server-01_7000": "ONLINE",
    "Server_server-03_7000": "ONLINE",
    "Server_server-07_7000": "ONLINE"
},
this is a snippet of ideal state
n
could you share the entire ideal state and external view?
a
like from the rest api?
n
thanks! I'll take a look today
a
basically all the percent segments available metrics have this issue as well
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerrealtimeTableCount\"",} 0.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.region_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerofflineTableCount\"",} 0.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.region_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.region_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.dma_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.demographics_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.demographics_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.region_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllernumberOfReplicas.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerdataDir.exists\"",} 1.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerhelix.leader\"",} 0.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.dma_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllernumberOfReplicas.demographics_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.region_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerhelix.connected\"",} 1.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllernumberOfReplicas.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.dma_customer_journey_OFFLINE\"",} -9.223372036854776E18
even though the segments are available and online
n
oh, then it's likely not the code
but i'll look anyway
a
Hm, what could be causing these erroneous metrics?
Would having the controllers not share the same storage be causing this issue?
n
maybe SegmentStatusChecker is not running correctly.
do you see the logs/warn from the SegmentStatusChecker file:
LOGGER.info("Processing {} tables in task: {}", numTables, _taskName);
LOGGER.error("Caught exception while processing table: {} in task: {}", tableNamesWithType, _taskName, e);
LOGGER.error("Caught exception while updating segment status for table {}", tableNameWithType, e);
LOGGER.warn("Table {} is disabled. Skipping segment status checks", tableNameWithType);
or others from that file
a
Where do I find the SegmentStatusChecker file?
a
Sorry, I meant: should I just ssh into the controller and look at the logs, or is there some UI to do that?
n
oh, yes ssh to the controller
a
No I’m not seeing such exceptions
n
not even this info line?
LOGGER.info("Processing {} tables in task: {}", numTables, _taskName);
the task name will be SegmentStatusChecker.
a
I should do a keyword search on SegmentStatusChecker in the controller logs?
So I’m seeing a lot of
Skipping status check not a leader
Maybe I should check other controllers?
I see a lot of this
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:11:05.325 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table footfall_OFFLINE has 2 replicas, below replication threshold :3
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:11:05.325 INFO [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Segment status metrics completed in 2163ms
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:05.326 INFO [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Starting Segment Status check for metrics
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:06.337 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table region_behavior_OFFLINE has 2 replicas, below replication threshold :3
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:06.339 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table demographics_OFFLINE has 2 replicas, below replication threshold :3
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:06.346 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table dma_customer_journey_OFFLINE has 2 replicas, below replication threshold
I noticed a lot of my segments only have 2 replicas instead of the 3 we configured; not sure what the fix is here, or whether it's related to the incorrect metric values
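The WARN lines above are consistent with a per-segment replica count falling short of the configured target. A hypothetical sketch of that kind of check (the external-view map and server names are made up for illustration, not taken from this cluster):

```java
import java.util.Map;

public class ReplicaThresholdDemo {
    // Count how many replicas of a segment report ONLINE in the external view.
    static long onlineReplicas(Map<String, String> instanceStates) {
        return instanceStates.values().stream().filter("ONLINE"::equals).count();
    }

    public static void main(String[] args) {
        // Hypothetical external-view entry for one segment.
        Map<String, String> segment = Map.of(
                "Server_server-01_7000", "ONLINE",
                "Server_server-03_7000", "OFFLINE",
                "Server_server-07_7000", "ONLINE");
        long online = onlineReplicas(segment);
        int configuredReplication = 3;
        if (online < configuredReplication) {
            System.out.println("below replication threshold: "
                    + online + " < " + configuredReplication);
        }
    }
}
```

Comparing the ideal state (where every replica is listed) with the external view (where only healthy replicas show ONLINE) for the affected segments should show which server is missing its copy.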
n
what version/commit of Pinot are you on? i don't see this log at all in the code:
Starting Segment Status check for metrics
The last time it existed was in 2018-10.
also, the log you read says pinotTools.log. There should be another controller log, right?
a
I did a grep on all the logs
Hm, I’ll check the Pinot version
What’s a quick way to check the Pinot version?
Pinot tools and broker are version 0.016
n
what does your setup look like? what artifacts do you use to start the services? If you take the latest Pinot release, that should work. Or, if you're using QuickStart/pinot admin, you'll have to pull the latest code and build.
a
We use Ansible to deploy a pre-downloaded Pinot jar
So you recommend using the latest release of Pinot?
Is our version outdated?
n
Pre-downloaded - where was it downloaded from?
a
That I actually don’t know
I think git repo
n
yes, i'd recommend getting the latest release. was your cluster set up a long time ago? is an upgrade easily possible?
a
I’d like to think it’s possible
Yeah it was set up by a different eng a long time ago
Like more than a year
n
just curious, which organization is this? is the cluster production-facing?
btw, can you check if other metrics work at least? (broker level, server level, or other controller metrics)
a
Yeah, those work. I think the only weird ones are the segment-level ones