# ask-for-help
Hi @Shihgian Lee, the input to your `predict_2()` is expected to be a batchable format as well. Instead of using `InputFeatures`, could you please try `list[InputFeatures]`? The runner will then batch the lists into a single list.
Hi @Sean Thanks for your reply. I tried `list[InputFeatures]` but my `input_data` still came through as a single item and not a list. I have a couple of questions about `list[InputFeatures]`: 1. Will `input_data` be turned into a list automatically by BentoML? 2. If not, do I need to add `[]` to `domain_request` on the `run` method, e.g. `my_runner_2.predict.run([domain_request])`? This is the last hurdle I need to overcome before I deploy BentoML 1.0 to our cluster for load testing.
Hi @Shihgian Lee, thanks for sharing your experience. This part is in fact a confusing topic. 1. No, `input_data` will not be automatically turned into a list. The typing `list[InputFeatures]` is more of a type hint. 2. Yes, you are expected to add `[]` to the input.
Hi @Sean Please bear with me, I am still pretty confused about enabling micro-batching. 1. If we always receive a single item for `input_data` in the service, why do we need the `list[InputFeatures]` type hint? That doesn't seem correct from a programming perspective and causes more confusion for users. 2. I passed `[domain_request]` to the run method, and the custom runner is happy and returns a list of `PredictionResult` data classes to the service. Per our last conversation, I thought the runner would "unbatch" the list and return a single item to the service. But the service gets a list back instead. Please see the JSON output below. Did I misunderstand you?

```
[
  {
    "rate": 0.169609017,
    ...
  }
]
```
Hi @Shihgian Lee, no worries, I probably haven't explained myself well. The type hint is irrelevant. For adaptive batching to function in the runner, the input in the API must be a batchable format, e.g. `list`, `ndarray`, `dataframe`. Using the following `numpy.ndarray` as an example, two requests were sent to the API server.

Request 1: `np.array([[1, 2, 3], [4, 5, 6]])`

Request 2: `np.array([[7, 8, 9]])`

When these two requests are batched in the runner, what the runner sees is the following.

```python
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
```

Say the runner processes the batch and returns `np.array([0, 1, 2])`. The responses to requests 1 & 2 look like the following.

Response 1: `np.array([0, 1])`

Response 2: `np.array([2])`

To answer your 2nd question, the response is indeed unbatched, but still in a batchable format, which is a `list`.
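To make the above concrete, here is a minimal, self-contained numpy sketch that only mimics what the runner does with these two requests (an illustration, not BentoML code):

```python
import numpy as np

# Two independent requests arriving at the API server.
request_1 = np.array([[1, 2, 3], [4, 5, 6]])   # batch of 2 rows
request_2 = np.array([[7, 8, 9]])              # batch of 1 row

# What the runner sees after adaptive batching (concatenated along axis 0).
batched = np.concatenate([request_1, request_2], axis=0)
print(batched)

# Say the model returns one value per row.
batched_result = np.array([0, 1, 2])

# The result is split back according to each request's original length.
response_1, response_2 = np.split(batched_result, [len(request_1)])
print(response_1)   # [0 1]
print(response_2)   # [2]
```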
Hi @Sean Thank you for the examples. I have modified the service code based on your explanations. Please let me know your thoughts. I also have a few follow-up questions and hope you can help answer them: 1. On line# 8, PyCharm complains that `input_data` is a list but `to_domain_2` accepts a single item, because I know we will be dealing with a single item in the service (`predict_2`). If the `List[InputFeatures]` type hint is not relevant, can we remove it? But you pointed out that this is how we tell the API server that batching is enabled. This is the contradiction I am trying to reconcile. Can you help? On line# 9, I pass in a list of `domain_request` manually. 2. Since the runner always returns a list, do I always return `results[0]` (line# 10) since we know the service always deals with a single item at a time? This is because the upstream service doesn't expect a list. My frustration is due to the lack of explanation of request batching at the boundary between API services and runner batching. The batching architecture docs could be improved by providing code examples for the different cases.
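The service code with the referenced line numbers is not visible in this thread, so the following is a hedged pure-Python reconstruction of the wrap/unwrap pattern under discussion; `fake_runner_run` is a made-up stand-in for `my_runner_2.predict.run`, and the other names are borrowed from the thread rather than from real code:

```python
# Toy stand-in for my_runner_2.predict.run: a batchable runner method takes a
# list and returns a list of the same length (batch_axes=0 behavior).
def fake_runner_run(batch):
    return [{"rate": 0.169609017} for _ in batch]

domain_request = {"feature_a": 1.0, "feature_b": 2.0}   # one converted request (the "line# 8" role)

results = fake_runner_run([domain_request])             # wrap the single item before run() ("line# 9")
assert isinstance(results, list) and len(results) == 1

single_result = results[0]                              # unwrap for the upstream caller ("line# 10")
print(single_result)                                    # {'rate': 0.169609017}
```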
Hi @Sean Can I get a little bit of time from you to help with the above 👆? I think we are getting close. Thanks!
Hey @Shihgian Lee, indeed we need a better explanation of this in our docs. The design behind this is to make it easier to use adaptive batching and provide more flexibility. For the API function, the input argument maps to exactly one HTTP request - this can be confusing for many 0.13 users. The idea is that users can fully control the lifecycle of how one client request is processed. And a Runner is a building block representing a compute-intensive unit that could benefit from batching. So when multiple API server processes are calling `runner.predict.run`, the runner instance will receive those `run` calls and batch their execution. E.g. API server process #1 has:
`runner_a.predict.run([[0,0,0,0]])`
API server process #2 has:
`runner_a.predict.run([[1,1,1,1]])`
The runner_a instance will aggregate those calls and execute the predict function with input `[[0,0,0,0], [1,1,1,1]]`. Let's say the return from this function call is `[0, 1]`: process #1 will receive a return value of `0` and process #2 will receive a return value of `1`. This is assuming we are using the default `batch_axes=0`, which tells BentoML how to batch multiple input requests and split the responses.
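For reference, a hedged sketch of how a batchable custom runner can be declared in BentoML 1.0. The class and method here are made up for illustration, and the batching option is assumed to be spelled `batch_dim=0` on `Runnable.method` (the thread refers to the same idea as `batch_axes=0`):

```python
import bentoml


class SumRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("cpu",)
    SUPPORTS_CPU_MULTI_THREADING = True

    # batchable=True lets BentoML aggregate run() calls from multiple API
    # workers into one call of this method; batch_dim=0 means inputs are
    # concatenated (and outputs split) along the first axis.
    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, inputs):
        # Receives a batch (e.g. a list of rows); must return one result per row.
        return [sum(row) for row in inputs]


runner_a = bentoml.Runner(SumRunnable, name="runner_a")
# Inside a Service API function this would be called as:
#   runner_a.predict.run([[0, 0, 0, 0]])   # => [0]
```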
Hi @Chaoyu Thanks for the explanation! It is consistent with Sean's explanation. So far so good for me. I take it the representation of the API server is `predict_2` above? Yes, I am using the default `batch_axes=0` to keep things simple. Currently, I only send 1 request from my local Swagger to `predict_2`. However, the `results` on line# 11 above returns me a list, e.g. `[0]`, and NOT `0`. What am I missing?
Ah my fault, it should be `[0]` not `0`. This is to keep the interface consistent with most ML frameworks' inference APIs, as well as enabling users to handle inputs that are already batched on the client side. E.g.: API server process #1 has:
`runner_a.predict.run([[0,0,0,0]])`
API server process #2 has:
`runner_a.predict.run([[1,1,1,1],[2,2,2,2]])`
The runner_a instance will aggregate those calls and execute the predict function with input `[[0,0,0,0], [1,1,1,1], [2,2,2,2]]`, returning `[0,1,2]`. So the return will be `[0]` in process #1, and `[1,2]` in process #2.
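A toy sketch of that bookkeeping, showing how the batched output is sliced back per caller (plain Python, not the actual BentoML internals):

```python
# Mirrors the example above: batch_axes=0, i.e. concatenate and split along the first axis.
call_1 = [[0, 0, 0, 0]]                    # from API server process #1
call_2 = [[1, 1, 1, 1], [2, 2, 2, 2]]      # from API server process #2

batched_input = call_1 + call_2            # what the runner's predict function sees
batched_output = [0, 1, 2]                 # what the (hypothetical) model returns

out_1 = batched_output[:len(call_1)]       # => [0]   returned to process #1
out_2 = batched_output[len(call_1):]       # => [1, 2] returned to process #2
print(out_1, out_2)
```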
@Chaoyu Cool. So far so good for me on the aggregation explanation. So, my final question is: do I always `return results[0]` (line# 12 above) if the upstream service is expecting a single item returned to them?
yes, the returned value should always have the same length as the input data to the runner, when batching is enabled
> yes, the returned value should always have the same length as the input data to the runner, when batching is enabled

@Chaoyu got it. So it is up to the user (me) to extract the single-element list to return to the client, if we know the client always sends one request at a time and expects one item returned to them at a time. Am I correct?

One last question for this thread, I promise 🙂 Please consider the API service below:

```python
@svc.api(input=input_spec, output=JSON())
def predict(input_data: List[InputFeatures], ctx: bentoml.Context):
```

Sean pointed out that I need to provide the `List[InputFeatures]` type hint to enable batching. Do I need to do that if the client always sends 1 request at a time?
> got it. So it is up to the user (me) to extract the single-element list to return to the client, if we know the client always sends one request at a time and expects one item returned to them at a time. Am I correct?

Yes, that's correct. This is also most natural for most ML frameworks' inference APIs, e.g. in scikit-learn, you can directly map the model inference call to runner.run:

```python
sklearn_model.predict([[1,1,1,1],[2,2,2,2]])       # => [1, 2]
# 👇
model_runner.predict.run([[1,1,1,1],[2,2,2,2]])    # => [1, 2]
```

I think the type hint here is not relevant to batching behavior. If the client is only sending one `InputFeatures`, you don't need to make that a list.
Thank you so much! Finally I understand the micro-batching in the new architecture.

> I think the type hint here is not relevant to batching behavior. If the client is only sending one InputFeatures, you don't need to make that a list.

@Chaoyu I assume you are referring to the type hinting. If so, agreed! I will remove the `List[]` type hint and only have `InputFeatures` as the input to the service.
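Pulling the conclusion of the thread together, a hedged end-to-end sketch of what the simplified service might look like. All names, including the `InputFeatures` fields and the runnable, are stand-ins rather than the actual code discussed above, and it assumes the pydantic-backed `JSON` IO descriptor and custom `Runnable` API from BentoML 1.0:

```python
# service.py -- sketch only; not the actual service from this conversation.
import bentoml
from bentoml.io import JSON
from pydantic import BaseModel


class InputFeatures(BaseModel):
    feature_a: float
    feature_b: float


class MyModelRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("cpu",)
    SUPPORTS_CPU_MULTI_THREADING = True

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, inputs):
        # Receives a batch (list); must return a list of the same length.
        return [{"rate": 0.5 * (x["feature_a"] + x["feature_b"])} for x in inputs]


my_runner_2 = bentoml.Runner(MyModelRunnable, name="my_runner_2")
svc = bentoml.Service("demo_service", runners=[my_runner_2])


@svc.api(input=JSON(pydantic_model=InputFeatures), output=JSON())
def predict_2(input_data: InputFeatures):
    # Single item in, single item out: wrap before run(), unwrap after.
    results = my_runner_2.predict.run([input_data.dict()])
    return results[0]
```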
cc @sauyon