# ask-for-help
m
Or did I misunderstand the `max_latency_ms` parameter, and the 503s are only returned when the inference time of the model is higher than the `max_latency_ms` value, without taking into account the number of requests and how long they have to wait?
Update: After a few more tests, I see that receiving more requests does not affect the number of 503 responses. So I guess that `max_latency_ms` should be set to (more or less) the time the model needs to process a single batch of `max_batch_size`, and with that I should not get any errors no matter the number of requests. Am I right?
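For context, here is roughly where those two knobs live in a BentoML 1.x-style setup. This is a hedged sketch: the framework module, model name, and exact configuration keys are placeholders and may differ between BentoML versions, so check the docs linked later in the thread for your release.

```python
import bentoml
import numpy as np
from sklearn.linear_model import LogisticRegression

# A tiny placeholder model so the example is self-contained.
model = LogisticRegression().fit(np.array([[0.0], [1.0]]), [0, 1])

# Mark the predict signature as batchable when saving the model, so the
# runner is allowed to group incoming requests into batches.
bentoml.sklearn.save_model(
    "my_model",
    model,
    signatures={"predict": {"batchable": True, "batch_dim": 0}},
)

# max_batch_size and max_latency_ms are then set in bentoml_configuration.yaml
# (values below are illustrative placeholders only):
#
# runners:
#   batching:
#     enabled: true
#     max_batch_size: 32     # largest batch the runner will form
#     max_latency_ms: 500    # latency budget discussed in this thread
```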
c
cc @Sean
s
The `max_latency_ms` parameter sets an upper limit, in milliseconds, on the latency of a request. Runners will cancel requests if the current time spent in the queue plus the estimated execution time of the model runner exceeds the specified latency.
`max_latency_ms` can be thought of as the service level objective. During a peak of requests, if the latency tolerance is high, `max_latency_ms` should be set higher so the optimizer does not reject requests and keeps them enqueued.
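Restating the cancellation rule described above as a small sketch (an illustration of the rule, not BentoML's actual implementation):

```python
def should_cancel(time_in_queue_ms: float,
                  estimated_exec_ms: float,
                  max_latency_ms: float) -> bool:
    """A request is cancelled (returned as a 503) when the time it has already
    spent waiting in the runner queue plus the estimated execution time of the
    model would exceed the configured max_latency_ms."""
    return time_in_queue_ms + estimated_exec_ms > max_latency_ms

# Example: 300 ms already queued + ~250 ms estimated inference > 500 ms budget,
# so this request would be rejected rather than left to miss its SLO.
assert should_cancel(300, 250, 500) is True
```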
m
Thank you so much @Sean. I thought `max_latency_ms` was also used as a parameter that could help batching. For example, if an item is enqueued for inference in the runner, I thought it would wait up to `max_latency_ms` to see whether any other item is enqueued too, so that both could be batched and run through inference together. Was I wrong, then?
That is what I understood when reading the “1. Batching Window” paragraph and the “Max Latency” section in https://docs.bentoml.org/en/latest/guides/batching.html
Hi, sorry to bother you @Sean. Is my assumption wrong?
s
While `max_latency_ms` can influence the batching window size, it is not equivalent to the batching window size. `max_latency_ms` is used as a parameter when training the batching algorithm to determine the optimal window size, hence the “adaptive” part of adaptive batching.
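To make the “adaptive” point concrete, here is a toy sketch of how a latency budget can bound, but not equal, the batching window. This is my own illustration under simplified assumptions, not BentoML's actual optimizer:

```python
def batching_window_ms(recent_exec_ms: list[float], max_latency_ms: float) -> float:
    """Toy illustration only: the dispatcher observes recent batch execution
    times and spends whatever budget remains after the expected execution time
    waiting for more requests, so the window adapts to model speed and load
    while staying under max_latency_ms."""
    expected_exec = sum(recent_exec_ms) / len(recent_exec_ms)
    return max(0.0, max_latency_ms - expected_exec)

# A fast model leaves a large window for collecting a batch...
print(batching_window_ms([40, 50, 45], max_latency_ms=500))     # ~455 ms
# ...while a slow model leaves almost none.
print(batching_window_ms([480, 470, 490], max_latency_ms=500))  # ~20 ms
```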
m
Understood, thanks!