# ask-for-help
m
Or did I misunderstand the `max_latency_ms` parameter, and the 503s are only returned when the inference time of the model is higher than the `max_latency_ms` value, without taking into account the number of requests and how long they have to wait?
Update: After a few more tests, I see that receiving more requests does not affect the number of 503 responses. So I guess that `max_latency_ms` should be set to (more or less) the time the model needs to process a single batch of `max_batch_size`, and with that I should not get any errors no matter the number of requests. Am I right?
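For context, here is roughly where those two knobs live in a BentoML 1.x-style setup. This is a hedged sketch: the framework module, model name, and exact configuration keys are placeholders and may differ between BentoML versions, so check the docs linked later in the thread for your release.

```python
import bentoml
import numpy as np
from sklearn.linear_model import LogisticRegression

# A tiny placeholder model so the example is self-contained.
model = LogisticRegression().fit(np.array([[0.0], [1.0]]), [0, 1])

# Mark the predict signature as batchable when saving the model, so the
# runner is allowed to group incoming requests into batches.
bentoml.sklearn.save_model(
    "my_model",
    model,
    signatures={"predict": {"batchable": True, "batch_dim": 0}},
)

# max_batch_size and max_latency_ms are then set in bentoml_configuration.yaml
# (values below are illustrative placeholders only):
#
# runners:
#   batching:
#     enabled: true
#     max_batch_size: 32     # largest batch the runner will form
#     max_latency_ms: 500    # latency budget discussed in this thread
```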
c
cc @Sean
s
The `max_latency_ms` parameter sets an upper limit, in milliseconds, on the latency of a request. Runners will cancel requests if the current time spent in the queue plus the estimated execution time of the model runner exceeds the specified latency.
`max_latency_ms` can be thought of as the service level objective. During a peak of requests, if the latency tolerance is high, `max_latency_ms` should be set higher so the optimizer does not reject requests and keeps them enqueued.
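Restating the cancellation rule described above as a small sketch (an illustration of the rule, not BentoML's actual implementation):

```python
def should_cancel(time_in_queue_ms: float,
                  estimated_exec_ms: float,
                  max_latency_ms: float) -> bool:
    """A request is cancelled (returned as a 503) when the time it has already
    spent waiting in the runner queue plus the estimated execution time of the
    model would exceed the configured max_latency_ms."""
    return time_in_queue_ms + estimated_exec_ms > max_latency_ms

# Example: 300 ms already queued + ~250 ms estimated inference > 500 ms budget,
# so this request would be rejected rather than left to miss its SLO.
assert should_cancel(300, 250, 500) is True
```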
m
Thank you so much @Sean. I thought `max_latency_ms` was also used as a parameter that could help batching. For example, if an item is enqueued for inference in the runner, I thought it would wait up to `max_latency_ms` to see whether any other item is enqueued too, so that both could be batched and run through inference together. Was I wrong, then?
That is what I understood when reading the “1. Batching Window” paragraph and the “Max Latency” section in https://docs.bentoml.org/en/latest/guides/batching.html
Hi, sorry to bother you @Sean. Is my assumption wrong?
s
While `max_latency_ms` can influence the batching window size, it is not equivalent to the batching window size. `max_latency_ms` is used as a parameter when training the batching algorithm to determine the optimal window size, hence the “adaptive” part of adaptive batching.
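To make the “adaptive” point concrete, here is a toy sketch of how a latency budget can bound, but not equal, the batching window. This is my own illustration under simplified assumptions, not BentoML's actual optimizer:

```python
def batching_window_ms(recent_exec_ms: list[float], max_latency_ms: float) -> float:
    """Toy illustration only: the dispatcher observes recent batch execution
    times and spends whatever budget remains after the expected execution time
    waiting for more requests, so the window adapts to model speed and load
    while staying under max_latency_ms."""
    expected_exec = sum(recent_exec_ms) / len(recent_exec_ms)
    return max(0.0, max_latency_ms - expected_exec)

# A fast model leaves a large window for collecting a batch...
print(batching_window_ms([40, 50, 45], max_latency_ms=500))     # ~455 ms
# ...while a slow model leaves almost none.
print(batching_window_ms([480, 470, 490], max_latency_ms=500))  # ~20 ms
```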
m
Understood, thanks!