BentoML

Hi <@UKB4CLKP1>, speaking to <@UKSB3UA3Z> on the Spark support, he tells me the you guys are not considering Dask instead of Spark and you are working on being able to serialize the BentoService object. Is that correct?

<@U01JV6VTVK3> yes that’s correct, we are planning to add APIs to make it easy to create Spark UDFs with BentoML packaged models

And we are considering using Dask for BentoML’s own batch inference implementation

Where can I find out what exactly is being done for the Spark UDF API? I looked at in over the weekend and the issue I ran into was the fact that BentoService cannot be pickled.

BentoML’s own batch inference API is relatively higher-level APIs, where a user will specify where’s the input, where’s the output directory, applying which model.

The BentoML to Spark UDF integration gives more flexibility for Spark users who needs to involve a ML model packaged with BentoML.

It is still being worked on, BentoService can not be pickled today

I’m going to create a github issue for this so you can follow the progress

<https://github.com/bentoml/BentoML/issues/890>

I prototyped a possible solution to creating a pandas udf to call the predict API. Obviously calling the load in every call is not the way to go so I am wondering if 1. make model pickle serializable or 2. Caching it so that the load() function call is essentially becomes a very cheap call to make the second time around.
```def load():
    return saved_bundle.load_from_dir(__module_path)


def get_pandas_udf():
    from pyspark.sql.functions import pandas_udf
    import pandas as pd
    import pyspark.sql.types as T

    def predict(*args):
        X1 = pd.concat(args, axis=1)
        model = load()
        y = pd.Series(model.predict(X1).tolist())
        return y

    return pandas_udf(predict, "integer")```

BTW i added this to INIT_PY_TEMPLATE in saved_bundle/templates.py

has renamed the channel from "spark-integration" to "batch-inference"

<@UQF2L5GQ0> <@UKSB3UA3Z> just renamed the channel to batch-inference, for you guys to post batch inference related plans and progress here.

<@UQF2L5GQ0>, you could post the latest notebook here, I will try to run it myself, and then sync with you about the implementation plan.

dask-k8s.ipynb

In this method, the dask controller is started on the node where the jupyter notebook is located (could be the Host rather than inside of the cluster). And it launches workers in k8s cluster.

Hello all. Can anyone help me for batch inference in fastai

Hi !!<@UQF2L5GQ0> can you give an example? It would be very helpful for me

<@U02LZ8P8KNH> Hi there. would u mind sharing the service code?

plus i guess you want to query multiple json in one HTTP request right?

`@api(input=JsonInput(),output=JsonOutput(),route="pos_flair0.9_pos-english_v1",batch=True)`
    def predictpos(self, parsed_json:List[JsonSerializable]):
        model = self.artifacts.posmodel.get("model")
        sentence = self.artifacts.posmodel.get("sentence")
        output = []
        for i in parsed_json[0]:
            inputsentence = sentence(i['koo_text'])
            model.predict(inputsentence)
            o = <http://inputsentence.to|inputsentence.to>_dict(tag_type='pos')
            dictionary = response_create_pos(o['entities'])
            dictionary['koo_id'] = i['koo_id']
            output.append(dictionary)
        return [output]

Sample input:
[
    {
        "koo_id": 123,
        "koo_text": "Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app."
    },
    {
        "koo_id": 234,
        "koo_text": "I love Berlin."
    }
]

not sure if this is the right way ... Im trying to serve flair pos model using this

<@UQF2L5GQ0> any updates on this? actually when I'm giving around 20 inputs in json im getting a timeout from server , is there anyway to optimise this?

also for batch inference im doing this:
for i in parsed_json[0]:
            inputsentence = sentence(i['koo_text'])
            sent.append(inputsentence)
        model.predict(sent)

hi there. it looks hacky just because the bentoml's `batch=True` is not clear enough in naming.

basically, if you want your API server to handle many concurrent requests on the same time and each request container only one piece of data (a json here), you enable the `batch=True`

in short you may just let `batch=False` , then your code will be like
```    @api(input=JsonInput(),output=JsonOutput(),route="pos_flair0.9_pos-english_v1",batch=False)
    def predictpos(self, parsed_json:JsonSerializable):
        model = self.artifacts.posmodel.get("model")
        sentence = self.artifacts.posmodel.get("sentence")
        output = []
        for i in parsed_json:
            inputsentence = sentence(i['koo_text'])
            model.predict(inputsentence)
            o = inputsentence.to_dict(tag_type='pos')
            dictionary = response_create_pos(o['entities'])
            dictionary['koo_id'] = i['koo_id']
            output.append(dictionary)
        return output```

can you just help me understand the trade off here? Incase I don't do a bulk predict the number of http requests to the service increases highly and in case i do a batch predict the model server timesout giving a 408 code as the input text is long for my usecase

what would you recommend doing in this case? <@UQF2L5GQ0>

hi guys, I'd like to integrate bentoml model into pyspark. Do you have any examples?

you could use boto to download the images to local (of the docker container) and then process them directly if your model supports batch inference else store them in a dataframe and use apply for batch inference. Then delete the images to free up space

The S3 thing is not so important, I can easily read the files using `boto`. I don’t know how to deal with long running tasks. The batch inference can take an hour easily. I’m talking about batch inference in mass / whole dataset sense.

Where do you need the outputs to be ? Hour long inferences imho should be treated as batch jobs instead of having the model on api. You could run cron or on demand jobs on something like sagemaker or kubernetes. If you're still adamant on using bento api server for some reason you could increase the timeout to 2-3 hrs? How often do you need to process this batch job? if it's in very short intervals you could build on something like celery but if it's long intervals then probably go with batch jobs instead of api based solution

We are using Bento server to process images on the go as they are created and there is occasional need to process batch of images e.g. with an updated model. It feels natural to reuse the Bento server for this and not deal with another model instances, etc. The results could be stored anywhere: S3 or returned through API. This is not an important aspect. Increasing the timeout seems pretty fragile. 

It would be cool to have a long running task API like <https://softwareengineering.stackexchange.com/a/414356|https://softwareengineering.stackexchange.com/a/414356> in BentoML.