# general
n
we have a list of the important ones created by our SREs: https://apache-pinot.readthedocs.io/en/latest/in_production.html#monitoring-pinot
a
how should I interpret this?
"pinot.controllerpercentSegmentsAvailable.region_behavior_OFFLINE\"",} -9.223372036854776E18
It's supposed to be the percentage of complete online replicas in external view compared to replicas in ideal state, but I'm not sure what to make of this value?
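The value itself is a clue: -9.223372036854776E18 is exactly Long.MIN_VALUE rendered as a double, which usually indicates a gauge that was never set. A minimal sketch checking that the number matches (the "unset long-backed gauge" interpretation is an assumption, not confirmed in this thread):

```java
public class GaugeSentinelDemo {
    static double exportedDefault() {
        // Assumption: an unset long-backed gauge exports its sentinel,
        // Long.MIN_VALUE, cast to a double by the metrics exporter.
        return (double) Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        // Prints -9.223372036854776E18, the exact value in the metrics dump.
        System.out.println(Double.toString(exportedDefault()));
    }
}
```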
n
that is strange. In our setup, we usually see values between 0 and 100, as expected.
_controllerMetrics.setValueOfTableGauge(tableNameWithType, ControllerGauge.PERCENT_SEGMENTS_AVAILABLE,
        (nSegments > 0) ? (100 - (nOffline * 100 / nSegments)) : 100);
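The quoted formula can be exercised on its own; a minimal sketch with made-up segment counts (the nSegments/nOffline values are illustrative, not numbers from this cluster):

```java
public class PercentSegmentsAvailableDemo {
    // Mirrors the quoted formula: percent of segments that are not offline,
    // defaulting to 100 when a table has no segments yet.
    static long percentSegmentsAvailable(long nSegments, long nOffline) {
        return (nSegments > 0) ? (100 - (nOffline * 100 / nSegments)) : 100;
    }

    public static void main(String[] args) {
        System.out.println(percentSegmentsAvailable(10, 0)); // 100: all online
        System.out.println(percentSegmentsAvailable(10, 3)); // 70
        System.out.println(percentSegmentsAvailable(0, 0));  // 100 by convention
    }
}
```

Whatever the inputs, this expression can only yield 0 to 100, so a reading of -9.2E18 points at the gauge never being updated rather than at the arithmetic.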
i can try to debug tomorrow
a
Okay, that would be great. Maybe we can just do an update of Pinot
n
is it possible to share the ideal state and external view of table region_behavior?
and is it just this metric which looks off, or are they all giving weird values?
a
all the metrics for controllerpercentSegmentsAvailable look like that
},
"region_behavior_2018_01_2018_01_59": {
    "Server_server-01_7000": "ONLINE",
    "Server_server-03_7000": "ONLINE",
    "Server_server-07_7000": "ONLINE"
},
this is a snippet of ideal state
n
could you share the entire ideal state and external view?
a
like from the rest api?
n
thanks! I'll take a look today
a
basically all the percent segments available metrics have this issue as well
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerrealtimeTableCount\"",} 0.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.region_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerofflineTableCount\"",} 0.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.region_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.region_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.dma_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.demographics_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.demographics_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.region_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllersegmentsInErrorState.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllernumberOfReplicas.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerdataDir.exists\"",} 1.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerhelix.leader\"",} 0.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.dma_customer_journey_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllernumberOfReplicas.demographics_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.region_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerhelix.connected\"",} 1.0
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.dma_behavior_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllernumberOfReplicas.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentOfReplicas.footfall_OFFLINE\"",} -9.223372036854776E18
_com_linkedin_pinot_common_metrics_ControllerMetrics_Value{name="\"pinot.controllerpercentSegmentsAvailable.dma_customer_journey_OFFLINE\"",} -9.223372036854776E18
even though the segments are available and online
n
oh, then it's likely not the code
but i'll look anyway
a
Hm, what could be causing these erroneous metrics?
Would having the controllers not share the same storage be causing this issue?
n
maybe SegmentStatusChecker is not running correctly.
do you see the logs/warn from the SegmentStatusChecker file:
LOGGER.info("Processing {} tables in task: {}", numTables, _taskName);
LOGGER.error("Caught exception while processing table: {} in task: {}", tableNamesWithType, _taskName, e);
LOGGER.error("Caught exception while updating segment status for table {}", tableNameWithType, e);
LOGGER.warn("Table {} is disabled. Skipping segment status checks", tableNameWithType);
or others from that file
a
Where do I find the SegmentStatusChecker file?
a
Sorry, I meant: should I just ssh into the controller and look at the logs, or is there some UI to do that?
n
oh, yes ssh to the controller
a
No I’m not seeing such exceptions
n
not even this info line?
LOGGER.info("Processing {} tables in task: {}", numTables, _taskName);
the task name will be SegmentStatusChecker.
a
I should do a keyword search on SegmentStatusChecker in the controller logs?
So I’m seeing a lot of
Skipping status check not a leader
Maybe I should check other controllers?
I see a lot of this
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:11:05.325 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table footfall_OFFLINE has 2 replicas, below replication threshold :3
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:11:05.325 INFO [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Segment status metrics completed in 2163ms
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:05.326 INFO [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Starting Segment Status check for metrics
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:06.337 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table region_behavior_OFFLINE has 2 replicas, below replication threshold :3
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:06.339 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table demographics_OFFLINE has 2 replicas, below replication threshold :3
./pinot-tools-pkg/pinotTools.log.2:2019/10/16 02:16:06.346 WARN [com.linkedin.pinot.controller.helix.SegmentStatusChecker] [] Table dma_customer_journey_OFFLINE has 2 replicas, below replication threshold
I noticed a lot of my segments only have 2 replicas instead of the 3 we configured; not sure what the fix is here, or whether it's related to the incorrect metric values
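The WARN lines above are consistent with a per-segment replica count falling short of the configured target. A hypothetical sketch of that kind of check (the external-view map and server names are made up for illustration, not taken from this cluster):

```java
import java.util.Map;

public class ReplicaThresholdDemo {
    // Count how many replicas of a segment report ONLINE in the external view.
    static long onlineReplicas(Map<String, String> instanceStates) {
        return instanceStates.values().stream().filter("ONLINE"::equals).count();
    }

    public static void main(String[] args) {
        // Hypothetical external-view entry for one segment.
        Map<String, String> segment = Map.of(
                "Server_server-01_7000", "ONLINE",
                "Server_server-03_7000", "OFFLINE",
                "Server_server-07_7000", "ONLINE");
        long online = onlineReplicas(segment);
        int configuredReplication = 3;
        if (online < configuredReplication) {
            System.out.println("below replication threshold: "
                    + online + " < " + configuredReplication);
        }
    }
}
```

Comparing the ideal state (where every replica is listed) with the external view (where only healthy replicas show ONLINE) for the affected segments should show which server is missing its copy.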
n
what version/commit of Pinot are you on? i don't see this log at all in the code:
Starting Segment Status check for metrics
The last time it existed was in 2018-10.
also, the log you read says pinotTools.log. There should be another controller log, right?
a
I did a grep on all the logs
Hm, I’ll check the Pinot version
What’s a quick way to check the Pinot version?
Pinot tools and broker are version 0.016
n
what does your setup look like? what artifacts do you use to start the services? If you take the latest Pinot release, that should work. Or, if you're using QuickStart/pinot admin, you'll have to pull the latest code and build.
a
We use Ansible to deploy a pre-downloaded Pinot jar
So you recommend using the latest release of Pinot?
Is our version outdated?
n
Pre-downloaded - where was it downloaded from?
a
That I actually don’t know
I think git repo
n
yes, i'd recommend getting the latest release. was your cluster set up a long time ago? is an upgrade easily possible?
a
I’d like to think it’s possible
Yeah it was set up by a different eng a long time ago
Like more than a year
n
just curious, which organization is this? is the cluster production-facing?
btw, can you check if other metrics work at least? (broker level, server level, or other controller metrics)
a
Yeah, those work. I think the only weird ones are the segment-level ones