# general

Adrian Cole

08/27/2020, 3:23 AM
hi. wondering about composite health endpoint for servicemanager
  @ApiOperation(value = "Get Pinot Instances Status")
  @ApiResponses(value = {@ApiResponse(code = 200, message = "Instance Status"), @ApiResponse(code = 500, message = "Internal server error")})
  public Map<String, PinotInstanceStatus> getPinotAllInstancesStatus() {
    // Collect the status of every instance this service manager is running
    Map<String, PinotInstanceStatus> results = new HashMap<>();
    for (String instanceId : _pinotServiceManager.getRunningInstanceIds()) {
      results.put(instanceId, _pinotServiceManager.getInstanceStatus(instanceId));
    }
    return results;
  }
seems ^^ could be slightly changed, or copied into a similar /health endpoint, so that it returns HTTP 200 when all instances are healthy and something else when not
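
A minimal sketch of the status decision described above, assuming each instance's health can be reduced to a boolean. The class `AggregateHealthStatus`, the method `httpStatusFor`, and the choice of 503 for "something else" are all hypothetical and not part of Pinot:

```java
import java.util.Map;

// Hypothetical helper, not part of Pinot: reduces per-instance health to one HTTP status.
final class AggregateHealthStatus {
  static final int OK = 200;
  static final int SERVICE_UNAVAILABLE = 503; // assumed choice for "not healthy"

  // healthByInstance maps an instance id to true when that instance reports healthy.
  static int httpStatusFor(Map<String, Boolean> healthByInstance) {
    // Assumption: no running instances means the process is not serviceable.
    if (healthByInstance.isEmpty()) {
      return SERVICE_UNAVAILABLE;
    }
    for (boolean healthy : healthByInstance.values()) {
      if (!healthy) {
        return SERVICE_UNAVAILABLE; // a single unhealthy instance fails the whole check
      }
    }
    return OK;
  }
}
```

A resource method could compute this value from the per-instance statuses and set it as the response code, while still returning the per-instance map as the body.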

Xiang Fu

08/27/2020, 3:54 AM
agree, we can have a separate endpoint
for serviceManager
and another endpoint for all the roles

Adrian Cole

08/27/2020, 5:33 AM
personally I think SM health should return health of its subordinates in aggregate
as I don't know why it would be important to return success if, as a process, it is down. Right now, we rely a lot on knowing things are all on different ports, but if you think of the unit of SM, it represents a composite service, correct?
for example, in zipkin we have a composite health endpoint because if its connection to storage is busted it is not serviceable as a process
we sometimes start a second listener depending on config, but because we use armeria, generally we don't need a new port for every feature running
and when you take network ports out of it, it simplifies health: is the unit serviceable or not, basically. if it is composite you can see what things actually work, if nice json is used. spring boot does this by default but we wrote our own for perf.
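
A rough sketch of what such a composite JSON body could look like, so a reader can see which parts work. The class `CompositeHealthReport`, the `UP`/`DOWN` labels, and the `status`/`components` field names are assumptions for illustration, not the zipkin, Spring Boot, or Pinot format. Component names are assumed to contain no characters that need JSON escaping:

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical renderer: overall status plus one entry per component.
final class CompositeHealthReport {
  static String toJson(Map<String, Boolean> healthByComponent) {
    // Overall status is UP only when at least one component exists and none is down.
    boolean allUp = !healthByComponent.isEmpty() && !healthByComponent.containsValue(false);

    StringJoiner components = new StringJoiner(",", "{", "}");
    // TreeMap sorts the names so the output order is stable.
    for (Map.Entry<String, Boolean> e : new TreeMap<>(healthByComponent).entrySet()) {
      components.add("\"" + e.getKey() + "\":\"" + (e.getValue() ? "UP" : "DOWN") + "\"");
    }
    return "{\"status\":\"" + (allUp ? "UP" : "DOWN") + "\",\"components\":" + components + "}";
  }
}
```

In a real service a JSON library would be a safer choice than string concatenation; it is avoided here only to keep the sketch free of dependencies.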
ack it is a little more complex in impl than we probably need for the pinot side. main thing is to represent the composite of its health so you can know if the process should be in service or not
for example, i noticed one part of the process fail in docker, maybe because zip extraction took too long, no idea. it still passes the health check! that's bad as it makes other things fail. incidentally that failure went away when I flattened the classpath, but that's a separate topic

Daniel Lavoie

08/27/2020, 12:37 PM
I agree with Adrian, if anything within the subsystem is not healthy, a single endpoint should return the aggregate of all health checks for the process and 5xx when any of those is not healthy.