# ask-for-help
Hi @Shihgian Lee, the input to your `predict_2()` is expected to be a batchable format as well. Instead of using `InputFeatures`, could you please try `list[InputFeatures]`? The runner will then batch the lists into a single list.
Hi @Sean Thanks for your reply. I tried `list[InputFeatures]` but my `input_data` still came through as a single item and not a list. I have a couple of questions about `list[InputFeatures]`: 1. Will `input_data` be turned into a list automatically by BentoML? 2. If not, do I need to add `[]` to `domain_request` on the `run` method, e.g. `my_runner_2.predict.run([domain_request])`? This is the last hurdle I need to overcome before I deploy BentoML 1.0 to our cluster for load testing.
Hi @Shihgian Lee, thanks for sharing your experience. This part is in fact a confusing topic. 1. No, `input_data` will not be automatically turned into a list. The typing `list[InputFeatures]` is more of a type hint. 2. Yes, you are expected to add `[]` to the input.
Hi @Sean Please bear with me, I am still pretty confused about enabling micro-batching. 1. If we always receive a single item for `input_data` in the service, why do we need the `list[InputFeatures]` type hint? That doesn't seem correct from a programming perspective and causes more confusion for users. 2. I passed `[domain_request]` to the run method, and the custom runner is happy and returns a list of `PredictionResult` data classes to the service. Per our last conversation, I thought the runner would "unbatch" the list and return a single item to the service. But the service gets a list back instead. Please see the JSON output below. Did I misunderstand you?

```
[
  {
    "rate": 0.169609017,
    ...
  }
]
```
Hi @Shihgian Lee, no worries, I probably haven't explained myself well. The type hint is irrelevant. For adaptive batching to function in the runner, the input in the API must be a batchable format, e.g. `list`, `ndarray`, `dataframe`. Using the following `numpy.ndarray` as an example, two requests were sent to the API server.

Request 1: `np.array([[1, 2, 3], [4, 5, 6]])`

Request 2: `np.array([[7, 8, 9]])`

When these two requests are batched in the runner, what the runner sees is the following.

```python
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
```

Say the runner processes the batch and returns `np.array([0, 1, 2])`. The responses to requests 1 & 2 look like the following.

Response 1: `np.array([0, 1])`

Response 2: `np.array([2])`

To answer your 2nd question, the response is indeed unbatched, but still in a batchable format, which is a `list`.
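To make the above concrete, here is a minimal, self-contained numpy sketch that only mimics what the runner does with these two requests (an illustration, not BentoML code):

```python
import numpy as np

# Two independent requests arriving at the API server.
request_1 = np.array([[1, 2, 3], [4, 5, 6]])   # batch of 2 rows
request_2 = np.array([[7, 8, 9]])              # batch of 1 row

# What the runner sees after adaptive batching (concatenated along axis 0).
batched = np.concatenate([request_1, request_2], axis=0)
print(batched)

# Say the model returns one value per row.
batched_result = np.array([0, 1, 2])

# The result is split back according to each request's original length.
response_1, response_2 = np.split(batched_result, [len(request_1)])
print(response_1)   # [0 1]
print(response_2)   # [2]
```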
Hi @Sean Thank you for the examples. I have modified the service code based on your explanations. Please let me know your thoughts. I also have a few follow-up questions and hope you can help answer them: 1. On line# 8, PyCharm complains that `input_data` is a list but `to_domain_2` accepts a single item, because I know we will be dealing with a single item in the service (`predict_2`). If the `List[InputFeatures]` type hint is not relevant, can we remove it? But you pointed out that this is how we tell the API server that batching is enabled. This is the contradiction I am trying to reconcile. Can you help? On line# 9, I pass in a list of `domain_request` manually. 2. Since the runner always returns a list, do I always return `results[0]` (line# 10) since we know the service always deals with a single item at a time? This is because the upstream service doesn't expect a list. My frustration is due to the lack of explanation of request batching at the boundary between API services and runner batching. The batching architecture docs could be improved by providing code examples for the different cases.
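The service code with the referenced line numbers is not visible in this thread, so the following is a hedged pure-Python reconstruction of the wrap/unwrap pattern under discussion; `fake_runner_run` is a made-up stand-in for `my_runner_2.predict.run`, and the other names are borrowed from the thread rather than from real code:

```python
# Toy stand-in for my_runner_2.predict.run: a batchable runner method takes a
# list and returns a list of the same length (batch_axes=0 behavior).
def fake_runner_run(batch):
    return [{"rate": 0.169609017} for _ in batch]

domain_request = {"feature_a": 1.0, "feature_b": 2.0}   # one converted request (the "line# 8" role)

results = fake_runner_run([domain_request])             # wrap the single item before run() ("line# 9")
assert isinstance(results, list) and len(results) == 1

single_result = results[0]                              # unwrap for the upstream caller ("line# 10")
print(single_result)                                    # {'rate': 0.169609017}
```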
Hi @Sean Can I get a little bit of time from you to help with the above 👆? I think we are getting close. Thanks!
Hey @Shihgian Lee, indeed we need a better explanation of this in our docs. The design behind this is to make it easier to use adaptive batching and provide more flexibility. For the API function, the input argument maps to exactly one HTTP request - this can be confusing for many 0.13 users. The idea is that users can fully control the lifecycle of how one client request is processed. And a Runner is a building block representing a compute-intensive unit that could benefit from batching. So when multiple API server processes are calling `runner.predict.run`, the runner instance will receive those `run` calls and batch their execution. E.g. API server process #1 has:
`runner_a.predict.run([[0,0,0,0]])`
API server process #2 has:
`runner_a.predict.run([[1,1,1,1]])`
The runner_a instance will aggregate those calls and execute the predict function with input `[[0,0,0,0], [1,1,1,1]]`. Let's say the return from this function call is `[0, 1]`: process #1 will receive a return value of `0` and process #2 will receive a return value of `1`. This is assuming we are using the default `batch_axes=0`, which tells BentoML how to batch multiple input requests and split the responses.
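For reference, a hedged sketch of how a batchable custom runner can be declared in BentoML 1.0. The class and method here are made up for illustration, and the batching option is assumed to be spelled `batch_dim=0` on `Runnable.method` (the thread refers to the same idea as `batch_axes=0`):

```python
import bentoml


class SumRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("cpu",)
    SUPPORTS_CPU_MULTI_THREADING = True

    # batchable=True lets BentoML aggregate run() calls from multiple API
    # workers into one call of this method; batch_dim=0 means inputs are
    # concatenated (and outputs split) along the first axis.
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, inputs):
        # Receives a batch (e.g. a list of rows); must return one result per row.
        return [sum(row) for row in inputs]


runner_a = bentoml.Runner(SumRunnable, name="runner_a")
# Inside a Service API function this would be called as:
#   runner_a.predict.run([[0, 0, 0, 0]])   # => [0]
```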
Hi @Chaoyu Thanks for the explanation! It is consistent with Sean's explanation. So far so good for me. I take it the representation of the API server is `predict_2` above? Yes, I am using the default `batch_axes=0` to keep things simple. Currently, I only send 1 request from my local Swagger to `predict_2`. However, the `results` on line# 11 above returns me a list, e.g. `[0]`, and NOT `0`. What am I missing?
Ah my fault, it should be `[0]` not `0`. This is to keep the interface consistent with most ML frameworks' inference APIs, as well as enabling users to handle inputs that are already batched on the client side. E.g.: API server process #1 has:
`runner_a.predict.run([[0,0,0,0]])`
API server process #2 has:
`runner_a.predict.run([[1,1,1,1],[2,2,2,2]])`
The runner_a instance will aggregate those calls and execute the predict function with input `[[0,0,0,0], [1,1,1,1], [2,2,2,2]]`, returning `[0,1,2]`. So the return will be `[0]` in process #1, and `[1,2]` in process #2.
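A toy sketch of that bookkeeping, showing how the batched output is sliced back per caller (plain Python, not the actual BentoML internals):

```python
# Mirrors the example above: batch_axes=0, i.e. concatenate and split along the first axis.
call_1 = [[0, 0, 0, 0]]                    # from API server process #1
call_2 = [[1, 1, 1, 1], [2, 2, 2, 2]]      # from API server process #2

batched_input = call_1 + call_2            # what the runner's predict function sees
batched_output = [0, 1, 2]                 # what the (hypothetical) model returns

out_1 = batched_output[:len(call_1)]       # => [0]   returned to process #1
out_2 = batched_output[len(call_1):]       # => [1, 2] returned to process #2
print(out_1, out_2)
```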
@Chaoyu Cool. So far so good for me on the aggregation explanation. So, my final question is: do I always `return results[0]` (line# 12 above) if the upstream service is expecting a single item returned to them?
yes, the returned value should always have the same length as the input data to the runner, when batching is enabled
> yes, the returned value should always have the same length as the input data to the runner, when batching is enabled

@Chaoyu got it. So it is up to the user (me) to extract the single-element list to return to the client, if we know the client always sends one request at a time and expects one item returned to them at a time. Am I correct?

One last question for this thread, I promise 🙂 Please consider the API service below:

```python
@svc.api(input=input_spec, output=JSON())
def predict(input_data: List[InputFeatures], ctx: bentoml.Context):
```

Sean pointed out that I need to provide the `List[InputFeatures]` type hint to enable batching. Do I need to do that if the client always sends 1 request at a time?
> got it. So it is up to the user (me) to extract the single-element list to return to the client, if we know the client always sends one request at a time and expects one item returned to them at a time. Am I correct?

Yes, that's correct. This is also most natural for most ML frameworks' inference APIs, e.g. in scikit-learn, you can directly map the model inference call to runner.run:

```python
sklearn_model.predict([[1,1,1,1],[2,2,2,2]])       # => [1, 2]
# 👇
model_runner.predict.run([[1,1,1,1],[2,2,2,2]])    # => [1, 2]
```

I think the type hint here is not relevant to batching behavior. If the client is only sending one `InputFeatures`, you don't need to make that a list.
Thank you so much! Finally I understand the micro-batching in the new architecture.

> I think the type hint here is not relevant to batching behavior. If the client is only sending one InputFeatures, you don't need to make that a list.

@Chaoyu I assume you are referring to the type hinting. If so, agreed! I will remove the `List[]` type hint and only have `InputFeatures` as the input to the service.
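Pulling the conclusion of the thread together, a hedged end-to-end sketch of what the simplified service might look like. All names, including the `InputFeatures` fields and the runnable, are stand-ins rather than the actual code discussed above, and it assumes the pydantic-backed `JSON` IO descriptor and custom `Runnable` API from BentoML 1.0:

```python
# service.py -- sketch only; not the actual service from this conversation.
import bentoml
from bentoml.io import JSON
from pydantic import BaseModel


class InputFeatures(BaseModel):
    feature_a: float
    feature_b: float


class MyModelRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("cpu",)
    SUPPORTS_CPU_MULTI_THREADING = True

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, inputs):
        # Receives a batch (list); must return a list of the same length.
        return [{"rate": 0.5 * (x["feature_a"] + x["feature_b"])} for x in inputs]


my_runner_2 = bentoml.Runner(MyModelRunnable, name="my_runner_2")
svc = bentoml.Service("demo_service", runners=[my_runner_2])


@svc.api(input=JSON(pydantic_model=InputFeatures), output=JSON())
def predict_2(input_data: InputFeatures):
    # Single item in, single item out: wrap before run(), unwrap after.
    results = my_runner_2.predict.run([input_data.dict()])
    return results[0]
```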
cc @sauyon