# ask-for-help
m
We do routinely use BentoML “for our online and offline inference needs”. However, up to now we’ve been making network API calls to the bento services in both cases, not running the BentoML container locally as you’re suggesting.
For offline eval we use explicit batching, in which we break the overall offline dataset into many chunks (typically something like 20k rows per chunk). We’ve had success with two different approaches:
1. Send the dataframe chunks as parquet binary over the network to individual bentos.
2. Send a GCS location to a “root bento” service, which loads the dataframe chunk from GCS, makes several network calls to various other bentos in our local network, combines the results, writes them to GCS, and returns the result location.
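A minimal sketch of what approach 1 could look like on the client side, assuming a hypothetical bento endpoint at /predict that accepts parquet bytes and returns scored rows as parquet (the URL, route, and response format here are illustrative, not the exact setup described above):

import io

import pandas as pd
import requests

BENTO_URL = "http://my-bento-service.internal:3000/predict"  # hypothetical endpoint
CHUNK_SIZE = 20_000  # roughly 20k rows per chunk, as described above


def score_offline(df: pd.DataFrame) -> pd.DataFrame:
    results = []
    for start in range(0, len(df), CHUNK_SIZE):
        chunk = df.iloc[start:start + CHUNK_SIZE]

        # Serialize the chunk to parquet in memory and send the bytes as the body.
        buf = io.BytesIO()
        chunk.to_parquet(buf, index=False)
        resp = requests.post(
            BENTO_URL,
            data=buf.getvalue(),
            headers={"Content-Type": "application/octet-stream"},
        )
        resp.raise_for_status()

        # Assume the service also returns scored rows as parquet bytes.
        results.append(pd.read_parquet(io.BytesIO(resp.content)))

    return pd.concat(results, ignore_index=True)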
Both of these scenarios are orchestrated via Airflow DAGs, and they both lend themselves very nicely to parallelization. For example, we have one Airflow DAG with 10 parallel tasks, each working through several hundred dataframe chunks as described above. With that approach we’re able to score ~150 million rows on 13 machine learning models (spread across 7 distinct bento services) in just under 3 hours (total run time could be shortened by scaling up the parallelization).
And the best part is that these bentos are the exact same services that we call in an online/real-time use case.
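A rough sketch of that fan-out pattern, assuming approach 2 and a recent Airflow (2.4+) with dynamic task mapping; the DAG/task names, the root-bento route, and the GCS layout below are hypothetical:

from datetime import datetime

import requests
from airflow.decorators import dag, task

ROOT_BENTO_URL = "http://root-bento.internal:3000/score_gcs"  # hypothetical route
NUM_WORKERS = 10


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def offline_scoring():
    @task
    def list_chunk_slices() -> list[list[str]]:
        # In practice this would list the parquet chunks in GCS; hard-coded here.
        paths = [f"gs://my-bucket/offline/chunk_{i:05d}.parquet" for i in range(1000)]
        # One slice of the chunk list per parallel worker task.
        return [paths[i::NUM_WORKERS] for i in range(NUM_WORKERS)]

    @task
    def score_chunks(chunk_paths: list[str]) -> list[str]:
        result_paths = []
        for path in chunk_paths:
            # The root bento loads the chunk from GCS, fans out to the other
            # bentos, writes the combined results to GCS, and returns that path.
            resp = requests.post(ROOT_BENTO_URL, json={"gcs_path": path})
            resp.raise_for_status()
            result_paths.append(resp.json()["result_path"])
        return result_paths

    # Dynamic task mapping: one mapped score_chunks task per slice, run in parallel.
    score_chunks.expand(chunk_paths=list_chunk_slices())


offline_scoring()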
e
@Mike Kuhlen This is amazing! When you say "root bento", does this mean you're using Yatai as your bento host, and calling each of the downstream "sub bento" services via the inference graph feature? Or are you hosting each bento individually, and using REST to send the dataframe from the root to the others? Do you think this would be both cost-effective and stable at low volume? Like, one advantage I can see of running bentos directly in your DAG orchestrator is that you're only paying to run your bento when it's actually needed. Once the data volume gets high (frequent), I imagine you'd always want a bento running.
I can see how this would be fabulous for parallelization, because you could just put your bentos in an auto-scaling group, and if Airflow starts to hammer them with millions of requests, you could autoscale the bentos on demand to handle that.
🎯 1
m
Or are you hosting each bento individually, and using REST to send the dataframe from the root to the others?
Yes, that’s what we do.
Do you think this would be both cost-effective and stable at low volume? Like, one advantage I can see of running bentos directly in your DAG orchestrator is that you're only paying to run your bento when it's actually needed. Once the data volume gets high (frequent), I imagine you'd always want a bento running.
Yes, that’s a good point. We’re in a place where just keeping them running is fine. And we do use auto-scaling, so when we’re not making heavy use of the bentos, they scale down to 2 pods with pretty low CPU/memory requests (the limits are higher).
e
That makes sense! I like the idea of hosting long-running REST APIs because then the system is more decoupled. Your batch process could hit the API, but so could an end user.
s
Thought it might be worth mentioning we're working on a Python API for starting a server, which might make it easier to integrate bentos into DAG workflows:
server = bentoml.HTTPServer("my_bento:version")

with server.start() as client:
    client.classify(...)
❤️ 2
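For what it's worth, a hypothetical sketch of how that upcoming API could be embedded directly in a DAG task, assuming the service exposes a classify endpoint that accepts a dataframe and returns scores (the bento tag and endpoint name are placeholders):

import bentoml
import pandas as pd


def score_chunk_locally(chunk: pd.DataFrame):
    # Start the bento just for the duration of this task, so compute is only
    # used while the task actually runs.
    server = bentoml.HTTPServer("my_bento:version")
    with server.start() as client:
        return client.classify(chunk)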