Mikel Menta (05/09/2023, 8:06 AM): So with the max_latency_ms parameter, is a 503 only returned when the inference time of the model is higher than the max_latency_ms value? Without taking into account the number of requests and how long they have to wait?
Mikel Menta (05/09/2023, 8:40 AM): So max_latency_ms should be set to (more or less) the time that the model needs to process a single batch of max_batch_size. And with that I should not get any errors, no matter the number of requests. Am I right?
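A quick way to get that per-batch baseline is to time the model directly on one full batch; a minimal sketch, assuming a TorchScript model (the file name and input shape here are hypothetical):

```
import time
import torch

model = torch.jit.load("my_model.pt")  # hypothetical model file
model.eval()
batch = torch.randn(100, 3, 224, 224)  # one full batch of max_batch_size = 100

with torch.no_grad():
    model(batch)  # warm-up run
    runs = 20
    start = time.perf_counter()
    for _ in range(runs):
        model(batch)
    elapsed_ms = (time.perf_counter() - start) / runs * 1000

# max_latency_ms should sit comfortably above this number, since time
# spent queueing counts against the same budget.
print(f"avg inference time per full batch: {elapsed_ms:.1f} ms")
```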
Chaoyu (05/10/2023, 5:17 PM):

Sean (05/10/2023, 6:20 PM): The max_latency_ms parameter sets an upper limit, in milliseconds, on the latency of a request. Runners will cancel a request if the time it has already spent in the queue plus the estimated execution time of the model runner exceeds the specified latency.
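Put another way, the cancellation rule Sean describes boils down to a single check. A minimal sketch of that condition (an illustration of the rule as stated, not BentoML's internal code):

```
def should_cancel(time_in_queue_ms: float,
                  estimated_exec_ms: float,
                  max_latency_ms: float) -> bool:
    # A request is cancelled (surfacing as a 503 to the caller) once the
    # time it has already waited plus the predicted model execution time
    # would exceed the configured max_latency_ms budget.
    return time_in_queue_ms + estimated_exec_ms > max_latency_ms
```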
Sean (05/10/2023, 6:22 PM): max_latency_ms can be thought of as the service level objective. During a peak of requests, if the latency tolerance is high, max_latency_ms should be set higher so that the optimizer keeps requests enqueued instead of rejecting them.
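For context, in BentoML 1.x both limits are set per runner in the BentoML configuration file; a sketch of that config (the numbers are placeholders, and the exact keys should be checked against the docs for your version):

```
runners:
  batching:
    enabled: true
    max_batch_size: 100
    max_latency_ms: 500
```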
Mikel Menta (05/12/2023, 7:22 AM): I thought max_latency_ms was also used as a parameter that could help batching. For example: if an item is enqueued for inference in the runner, I thought it would wait up to max_latency_ms to see if any other item is enqueued too, so that both can be batched and run through inference together. Was I wrong, then?
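The fixed-wait behaviour described in that question would look roughly like the sketch below (a toy illustration of the assumed behaviour, which Sean's reply further down corrects; not how BentoML works):

```
import queue
import time

def fixed_window_batcher(q: queue.Queue, window_ms: float, max_batch_size: int):
    # Block until a first item arrives, then hold the batch open for a
    # fixed window, hoping more items show up to be batched with it.
    batch = [q.get()]
    deadline = time.monotonic() + window_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```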
Mikel Menta (05/12/2023, 11:33 AM):

Mikel Menta (05/17/2023, 7:15 AM):
Sean (05/17/2023, 9:46 AM): While max_latency_ms can influence the batching window size, it is not equivalent to the batching window size. max_latency_ms is used as a parameter for training the batching algorithm to determine the optimal window size, hence the "adaptive" part in adaptive batching.
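One way to picture the "training" Sean mentions: keep a running estimate of how long a batch of a given size takes, and size the next window so the predicted latency still fits the max_latency_ms budget. A toy sketch of that idea (not BentoML's actual optimizer):

```
class AdaptiveWindow:
    """Toy model of adaptive batching: learn the per-item cost from
    observed batches, then pick the largest batch size whose predicted
    execution time still fits the remaining latency budget."""

    def __init__(self, max_latency_ms: float, max_batch_size: int):
        self.max_latency_ms = max_latency_ms
        self.max_batch_size = max_batch_size
        self.per_item_ms = 1.0  # running estimate, refined by observe()

    def observe(self, batch_size: int, latency_ms: float, alpha: float = 0.1):
        # Exponential moving average over each executed batch.
        self.per_item_ms += alpha * (latency_ms / max(batch_size, 1) - self.per_item_ms)

    def target_batch_size(self, queue_wait_ms: float) -> int:
        # Budget left for the oldest queued request; dispatch immediately
        # if it is already exhausted.
        budget_ms = self.max_latency_ms - queue_wait_ms
        if budget_ms <= 0:
            return 1
        return max(1, min(self.max_batch_size, int(budget_ms / self.per_item_ms)))
```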
Mikel Menta (05/17/2023, 9:49 AM):