# general
we have a list of the important ones created by our SREs: https://apache-pinot.readthedocs.io/en/latest/in_production.html#monitoring-pinot
how should I interpret this?
"pinot.controllerpercentSegmentsAvailable.region_behavior_OFFLINE\"",} -9.223372036854776E18
It's supposed to be the percentage of complete online replicas in the external view compared to replicas in the ideal state, but I'm not sure what to make of this value.
that is strange. In our setup, we usually see values between 0 and 100, as expected.
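One thing worth noting about the number itself: -9.223372036854776E18 is exactly what Java prints for `Long.MIN_VALUE` converted to a double. A minimal check is sketched below; the class and method names are made up for illustration. One possible reading, not confirmed in this thread, is that the gauge was registered but never set by this controller.

```java
public class SentinelCheck {
    // True when a reported gauge value equals Long.MIN_VALUE widened to double.
    static boolean isLongMinSentinel(double reported) {
        return reported == (double) Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        // The value copied from the metrics output above.
        double reported = Double.parseDouble("-9.223372036854776E18");
        System.out.println((double) Long.MIN_VALUE);     // -9.223372036854776E18
        System.out.println(isLongMinSentinel(reported)); // true
    }
}
```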
_controllerMetrics.setValueOfTableGauge(tableNameWithType, ControllerGauge.PERCENT_SEGMENTS_AVAILABLE,
        (nSegments > 0) ? (100 - (nOffline * 100 / nSegments)) : 100);
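To see why this expression should stay between 0 and 100, here is the same arithmetic pulled into a standalone sketch. The class name, method name, and the `long` parameter types are assumptions for illustration; only the expression itself comes from the snippet above.

```java
public class PercentAvailable {
    // Same expression as the controller snippet, using integer division.
    static long percentSegmentsAvailable(long nSegments, long nOffline) {
        return (nSegments > 0) ? (100 - (nOffline * 100 / nSegments)) : 100;
    }

    public static void main(String[] args) {
        System.out.println(percentSegmentsAvailable(10, 0)); // 100
        System.out.println(percentSegmentsAvailable(10, 3)); // 70
        System.out.println(percentSegmentsAvailable(3, 1));  // 67 (100 - 33, integer division)
        System.out.println(percentSegmentsAvailable(0, 0));  // 100 (no segments)
    }
}
```

As long as 0 <= nOffline <= nSegments, the result cannot leave the 0 to 100 range, so a value near -9.2E18 is unlikely to be produced by this line.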
i can try to debug tomorrow
Okay, that would be great. Maybe we can just do an update of Pinot.
is it possible to share the ideal state and external view of the table?
and is it just this metric which looks off, or are they all giving weird values?
all the metrics for controllerpercentSegmentsAvailable look like that
}, "region_behavior_2018_01_2018_01_59": { "Server_server-01_7000": "ONLINE", "Server_server-03_7000": "ONLINE", "Server_server-07_7000": "ONLINE" },
this is a snippet of ideal state
could you share the entire ideal state and external view?
like from the rest api?
thanks ! I'll take a look today
basically all the percent segments available have this issue as well
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerrealtimeTableCount\"",} 0.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.region_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerofflineTableCount\"",} 0.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.region_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.region_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.dma_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.demographics_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.demographics_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.region_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllernumberOfReplicas.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerdataDir.exists\"",} 1.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerhelix.leader\"",} 0.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.dma_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllernumberOfReplicas.demographics_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.region_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerhelix.connected\"",} 1.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllernumberOfReplicas.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.dma_customer_journey_OFFLINE\"",} -9.223372036854776E18
even though the segments are available and online
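A quick way to list which gauges carry that value in a dump like the one above is sketched here. It is a rough sketch that assumes every line has the same shape as the pasted output (metric name starting with `pinot.` and ending at a backslash, value after the last space); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class UnsetGaugeScan {
    // Collects metric names whose reported value equals Long.MIN_VALUE as a double.
    static List<String> unsetGauges(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            int space = line.lastIndexOf(' ');      // value follows the last space
            int start = line.indexOf("pinot.");     // metric name starts here
            if (space < 0 || start < 0) continue;
            int end = line.indexOf('\\', start);    // name ends at the escaped quote
            if (end < 0) continue;
            double value = Double.parseDouble(line.substring(space + 1).trim());
            if (value == (double) Long.MIN_VALUE) {
                names.add(line.substring(start, end));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> sample = new ArrayList<>();
        sample.add("x_Value{name=\"\\\"pinot.controllerhelix.leader\\\"\",} 0.0");
        sample.add("x_Value{name=\"\\\"pinot.controllerpercentSegmentsAvailable.footfall_OFFLINE\\\"\",} -9.223372036854776E18");
        System.out.println(unsetGauges(sample));
    }
}
```

In the paste above, the global gauges (`helix.leader`, `helix.connected`, `dataDir.exists`, the table counts) have ordinary values and only the per-table gauges show the large negative number.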
oh, then it's likely not the code
but i'll look anyway
Hm, what could be causing these erroneous metrics?
Would having the controllers not share the same storage cause this issue?
maybe SegmentStatusChecker is not running correctly.
do you see the logs/warn from the SegmentStatusChecker file:
LOGGER.info("Processing {} tables in task: {}", numTables, _taskName);
"Caught exception while processing table: {} in task: {}", tableNamesWithType, _taskName, e);
LOGGER.error("Caught exception while updating segment status for table {}", tableNameWithType, e);
LOGGER.warn("Table {} is disabled. Skipping segment status checks", tableNameWithType);
or others from that file
Where do I find the SegmentStatusChecker file ?
Sorry I meant should I just ssh into the controller and look at the logs or is there some ui to do that
oh, yes ssh to the controller
No I’m not seeing such exceptions
not even this info line:
LOGGER.info("Processing {} tables in task: {}", numTables, _taskName);
? task name will be
I should do a keyword search on SegmentStatusChecker in the controller logs ?
So I’m seeing a lot of
Skipping status check not a leader
Maybe I should check other controllers ?
I see a lot of this
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:11:05.325 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table footfall_OFFLINE has 2 replicas, below replication threshold :3
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:11:05.325 INFO [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Segment status metrics completed in 2163ms
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:05.326 INFO [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Starting Segment Status check for metrics
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:06.337 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table region_behavior_OFFLINE has 2 replicas, below replication threshold :3
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:06.339 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table demographics_OFFLINE has 2 replicas, below replication threshold :3
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:06.346 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table dma_customer_journey_OFFLINE has 2 replicas, below replication threshold
I noticed a lot of my segments only have 2 replicas instead of 3 like we configured. Not sure what the fix is here, or if it's related to the incorrect metric values.
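To get a per-table summary out of those WARN lines instead of reading them one by one, something like the following could work. It is only a sketch: the regex is derived from the log lines pasted above, and the class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplicaWarnings {
    // Matches the SegmentStatusChecker warning text seen in the pasted logs.
    private static final Pattern WARN =
        Pattern.compile("Table (\\S+) has (\\d+) replicas, below replication threshold");

    // Maps table name to the most recently logged replica count.
    static Map<String, Integer> underReplicated(Iterable<String> logLines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = WARN.matcher(line);
            if (m.find()) {
                result.put(m.group(1), Integer.parseInt(m.group(2)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(underReplicated(Arrays.asList(
            "WARN [SegmentStatusChecker] [] Table footfall_OFFLINE has 2 replicas, below replication threshold :3",
            "INFO [SegmentStatusChecker] [] Segment status metrics completed in 2163ms")));
    }
}
```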
what version/commit of pinot are you on? i don't see this log at all in the code: "Starting Segment Status check for metrics". The last time it existed was in 2018-10.
also, the log you read says
. There should be another controller log right?
I did a grep on all the logs
Hm, I’ll check the Pinot version
What’s a quick way to check the Pinot version?
Pinot tools and broker are version 0.016
what does your setup look like? what artifacts do you use to start the services? If you take the latest Pinot release, that should work. Or if you're using QuickStart/Pinot admin, you'll have to pull latest code and build
We use ansible to deploy a pre downloaded Pinot jar
So you recommend using the latest release of Pinot?
Is our version outdated?
Pre-downloaded - where was it downloaded from?
That I actually don’t know
I think git repo
yes, recommend getting the latest release. was your cluster setup a long time ago? is an upgrade easily possible?
I’d like to think it’s possible
Yeah it was set up by a different eng a long time ago
Like more than a year
just curious, which organization is this? is the cluster production facing?
btw, can you check if other metrics work at least? (broker level, server level, or other controller metrics)
Yeah, those work. I think the only weird ones are segment level