Sean, 12/07/2022, 8:37 PM:
v1.0.11 is here, featuring the introduction of an inference data collection and model monitoring API that can be easily integrated with any model monitoring framework.
• Introduced the bentoml.monitor API for monitoring features, predictions, and target data in numerical, categorical, and numerical sequence types.
import numpy as np

import bentoml
from bentoml.io import NumpyNdarray, Text

CLASS_NAMES = ["setosa", "versicolor", "virginica"]

iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

@svc.api(
    input=NumpyNdarray.from_sample(np.array([4.9, 3.0, 1.4, 0.2], dtype=np.double)),
    output=Text(),
)
async def classify(features: np.ndarray) -> str:
    with bentoml.monitor("iris_classifier_prediction") as mon:
        mon.log(features[0], name="sepal length", role="feature", data_type="numerical")
        mon.log(features[1], name="sepal width", role="feature", data_type="numerical")
        mon.log(features[2], name="petal length", role="feature", data_type="numerical")
        mon.log(features[3], name="petal width", role="feature", data_type="numerical")

        results = await iris_clf_runner.predict.async_run([features])
        result = results[0]
        category = CLASS_NAMES[result]

        mon.log(category, name="pred", role="prediction", data_type="categorical")
    return category
• Enabled monitoring data collection through log file forwarding using any forwarder (Fluent Bit, Filebeat, Logstash) or OTLP exporter implementation.
◦ Configuration for monitoring data collection through log files:
monitoring:
  enabled: true
  type: default
  options:
    log_path: path/to/log/file
◦ Configuration for monitoring data collection through an OTLP exporter:
monitoring:
  enabled: true
  type: otlp
  options:
    endpoint: http://localhost:5000
    insecure: true
    credentials: null
    headers: null
    timeout: 10
    compression: null
    meta_sample_rate: 1.0
• Supported third-party monitoring data collector integrations through BentoML Plugins. See bentoml/plugins repository for more details.
🐳 Improved containerization SDK and CLI options; read more in #3164.
• Added support for multiple backend builder options (Docker, nerdctl, Podman, Buildah, Buildx) in addition to buildctl (standalone buildkit builder).
• Improved the Python SDK for containerization with different backend builder options.
import bentoml

# Additional backend-specific options can be passed as keyword arguments.
bentoml.container.build(
    "iris_classifier:latest",
    backend="podman",
    features=["grpc", "grpc-reflection"],
)
• Improved the bentoml containerize CLI to include the newly added options.
• Standardized the generated Dockerfile in bentos to be compatible with all build tools for use cases that require building from a Dockerfile directly.
💡 We continue to update the documentation and examples on every release to help the community unlock the full power of BentoML.
• Learn more about inference data collection and model monitoring capabilities in BentoML.
• Learn more about the default metrics that come out-of-the-box and how to add custom metrics in BentoML.
Sean, 02/17/2023, 5:00 PM:
The v1.0.15 release is here, featuring the introduction of the bentoml.diffusers framework.
• Learn more about the capabilities of the bentoml.diffusers framework in the Creating Stable Diffusion 2.0 Services With BentoML And Diffusers blog and the BentoML Diffusers example project.
• Import a diffusion model with the bentoml.diffusers.import_model API.
import bentoml

bentoml.diffusers.import_model(
    "sd2",
    "stabilityai/stable-diffusion-2",
)
• Create a text2img service using a Stable Diffusion 2.0 model runner with the familiar to_runner API from the bentoml.diffusers framework.
import bentoml
from bentoml.io import Image, JSON

bento_model = bentoml.diffusers.get("sd2:latest")
stable_diffusion_runner = bento_model.to_runner()

svc = bentoml.Service("stable_diffusion_v2", runners=[stable_diffusion_runner])

@svc.api(input=JSON(), output=Image())
def txt2img(input_data):
    images, _ = stable_diffusion_runner.run(**input_data)
    return images[0]
⭐ Fixed an incompatibility introduced in starlette==0.25.0 that resulted in the type MultiPartMessage not being found in starlette.formparsers:
ImportError: cannot import name 'MultiPartMessage' from 'starlette.formparsers' (/opt/miniconda3/envs/bentoml/lib/python3.10/site-packages/starlette/formparsers.py)
Sean, 05/10/2023, 1:28 AM:
v1.0.19 is released with enhanced GPU utilization and expanded ML framework support.
• Optimized GPU resource utilization: Enabled scheduling of multiple instances of the same runner using the workers_per_resource scheduling strategy configuration. The following configuration allows scheduling 2 instances of the “iris” runner per GPU instance. workers_per_resource is 1 by default.
runners:
  iris:
    resources:
      nvidia.com/gpu: 1
    workers_per_resource: 2
• New ML framework support: We’ve added support for EasyOCR and Detectron2 to our growing list of supported ML frameworks.
• Enhanced runner communication: Implemented PEP 574 out-of-band pickling to improve runner communication by eliminating memory copying, resulting in better performance and efficiency.
• Backward compatibility for Hugging Face Transformers: Resolved compatibility issues with Hugging Face Transformers versions prior to v4.18, ensuring a seamless experience for users with older versions.
⚙️ With the release of Kubeflow 1.7, BentoML now has native integration with Kubeflow, allowing developers to leverage BentoML’s cloud-native components. Previously, developers were limited to exporting and deploying a Bento as a single container. With this integration, models trained in Kubeflow can easily be packaged, containerized, and deployed to a Kubernetes cluster as microservices. This architecture enables the individual models to run in their own pods, utilizing the most optimal hardware for their respective tasks and enabling independent scaling.
💡 With each release, we consistently update our blog, documentation and examples to empower the community in harnessing the full potential of BentoML.
• Learn more about the scheduling strategy to get better resource utilization.
• Learn more about model monitoring and drift detection in BentoML and integration with various monitoring frameworks.
• Learn more about using Nvidia Triton Inference Server as a runner to improve your application’s performance and throughput.
Sean, 07/24/2023, 9:14 PM:
v1.1.0 is here, our first minor version update since the milestone v1.0.
• Backward Compatibility: Rest assured that this release maintains full API backward compatibility with v1.0.
• Official gRPC Support: We’ve transitioned gRPC support in BentoML from experimental to official status, expanding your toolkit for high-performance, low-latency services.
• Ray Integration: Ray is a popular open-source compute framework that makes it easy to scale Python workloads. BentoML integrates natively with Ray Serve to enable users to deploy Bento applications in a Ray cluster without modifying code or configuration.
• Enhanced Hugging Face Transformers and Diffusers Support: All Hugging Face Diffuser models and pipelines can be seamlessly imported and integrated into BentoML applications through the Transformers and Diffusers framework libraries.
• Enhanced Model Version Management: Enjoy greater flexibility with the improved model version management, enabling flexible configuration and synchronization of model versions with your remote model store.
🦾 We are also excited to announce the launch of OpenLLM v0.2.0, featuring support for Llama 2 models.
• GPU and CPU Support: Running Llama 2 is supported on both GPU and CPU.
• Model variations and parameter sizes: Supports all model weights and parameter sizes on Hugging Face. Users can use any weights on HuggingFace (e.g. TheBloke/Llama-2-13B-chat-GPTQ), custom weights from a local path (e.g. /path/to/llama-1), or fine-tuned weights as long as they adhere to LlamaModelForCausalLM. Use openllm models --show-available to learn more.
• Stay tuned for fine-tuning capabilities in OpenLLM: Fine-tuning various Llama 2 models will be added in a future release. Try the experimental script for fine-tuning Llama 2 with QLoRA under the OpenLLM playground, python -m openllm.playground.llama2_qlora --help.
Sean, 08/31/2023, 6:59 PM:
BentoML v1.1.4 and OpenLLM v0.2.27 are released. See an example service definition for SSE streaming with Llama 2.
• Added response streaming through SSE to the bentoml.io.Text IO descriptor type.
• Added async generator support to both the API Server and Runner to yield incremental text responses.
• Added support to ☁️ BentoCloud to natively handle SSE streaming. A minimal sketch of a streaming endpoint follows this list.
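The original Llama 2 streaming example is not reproduced here, so the following is a minimal sketch of the mechanism only: an API function written as an async generator with a Text output descriptor, so each yielded chunk is streamed to the client over SSE. The service name and the word-by-word echo logic are illustrative assumptions, not taken from the release.
import bentoml
from bentoml.io import Text

svc = bentoml.Service("sse_stream_demo")

@svc.api(input=Text(), output=Text())
async def stream(prompt: str):
    # The API function is an async generator; each yielded chunk is
    # sent to the client incrementally via SSE instead of one response.
    for word in prompt.split():
        yield word + " "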
🦾 OpenLLM added token streaming capabilities to support streaming responses from LLMs.
• Added a /v1/generate_stream endpoint for streaming responses from LLMs.
curl -N -X 'POST' 'http://0.0.0.0:3000/v1/generate_stream' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "### Instruction:\n What is the definition of time (200 words essay)?\n\n### Response:",
    "llm_config": {
      "use_llama2_prompt": false,
      "max_new_tokens": 4096,
      "early_stopping": false,
      "num_beams": 1,
      "num_beam_groups": 1,
      "use_cache": true,
      "temperature": 0.89,
      "top_k": 50,
      "top_p": 0.76,
      "typical_p": 1,
      "epsilon_cutoff": 0,
      "eta_cutoff": 0,
      "diversity_penalty": 0,
      "repetition_penalty": 1,
      "encoder_repetition_penalty": 1,
      "length_penalty": 1,
      "no_repeat_ngram_size": 0,
      "renormalize_logits": false,
      "remove_invalid_values": false,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "encoder_no_repeat_ngram_size": 0,
      "n": 1,
      "best_of": 1,
      "presence_penalty": 0.5,
      "frequency_penalty": 0,
      "use_beam_search": false,
      "ignore_eos": false
    },
    "adapter_name": null
  }'
Sean, 02/20/2024, 4:00 PM:
v1.2 is here, the biggest release since the launch of v1.0. This release includes improvements from all the learning and feedback from our community over the past year. We invite you to read our release blog post for a comprehensive overview of the new features and the motivations behind their development.
Here are a few key points to note before we delve into the new features:
• v1.2 ensures complete backward compatibility, meaning that Bentos built with v1.1 will continue to function seamlessly with this release.
• We remain committed to supporting v1.1. Critical bug fixes and security updates will be backported to the v1.1 branch.
• BentoML documentation has been updated with examples and guides for v1.2. More guides are being added every week.
• BentoCloud is fully equipped to handle deployments from both v1.1 and v1.2 releases of BentoML.
⛏️ Introduced a simplified service SDK to empower developers with greater control and flexibility.
• Simplified the service and API interfaces as Python classes, allowing developers to add custom logic and use third party libraries flexibly with ease.
• Introduced @bentoml.service and @bentoml.api decorators to customize the behaviors of services and APIs.
• Moved configuration from YAML files to the service decorator @bentoml.service next to the class definition.
• See the vLLM example demonstrating the flexibility of the service API by initializing a vLLM AsyncEngine in the service constructor and running inference with continuous batching in the service API. A minimal sketch of the new decorators follows this list.
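As a minimal sketch of the new class-based SDK (the service name, resource values, and truncation logic are illustrative assumptions, not from the release notes):
import bentoml

@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
class Summarizer:
    @bentoml.api
    def summarize(self, text: str) -> str:
        # Custom logic and third-party libraries can be used freely here;
        # a real service would run model inference instead of truncating.
        return text[:100]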
🔭 Revamped IO descriptors with more familiar input and output types.
• Enabled use of Pythonic types directly, without the need for additional IO descriptor definitions or decorations.
• Integrated with Pydantic to leverage its robust validation capabilities and wide array of supported types (see the sketch after this list).
• Expanded support to ML and Generative AI specific IO types.
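A small hedged sketch of Pydantic-backed validation on a plain Python API signature; the service, parameter names, and constraints are illustrative assumptions:
from typing import Annotated

from pydantic import Field

import bentoml

@bentoml.service
class ImageTool:
    @bentoml.api
    def resize(
        self,
        width: Annotated[int, Field(gt=0, le=4096)] = 512,
        height: Annotated[int, Field(gt=0, le=4096)] = 512,
    ) -> str:
        # Pydantic validates the parameters before the handler runs.
        return f"resizing to {width}x{height}"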
📦 Updated model saving and loading API to be more generic to enable integration with more ML frameworks.
• Allow flexible saving and loading of models using the bentoml.models.create API instead of framework-specific APIs, e.g. bentoml.pytorch.save_model, bentoml.tensorflow.save_model (see the sketch below).
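A hedged sketch of the generic saving flow, assuming bentoml.models.create is used as a context manager and arbitrary artifact files are written into the model directory; the model name and file contents are placeholders:
import os

import bentoml

# Create a new entry in the local model store and write artifacts into it.
with bentoml.models.create(name="my_model") as model_ref:
    with open(os.path.join(model_ref.path, "weights.bin"), "wb") as f:
        f.write(b"...")  # placeholder; a real framework would serialize its weights here
print(f"Saved: {model_ref.tag}")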
🚚 Streamlined the deployment workflow to allow more rapid development iterations and a faster time to production.
• Enabled direct deployment to production through CLI and Python API from Git projects.
🎨 Improved API development experience with generated web UI and rich Python client.
• All bentos are now accompanied by a custom-generated UI in the BentoCloud Playground, tailored to their API definitions.
• BentoClient offers a Pythonic way to invoke the service endpoint, allowing parameters to be supplied in native Python format and letting the client efficiently handle the necessary serialization while ensuring compatibility and performance (see the sketch after this item).
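A hedged sketch of calling a deployed endpoint with the Python client; the URL and the summarize endpoint name are assumptions for illustration:
import bentoml

# Parameters are passed as native Python values; the client handles serialization.
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    result = client.summarize(text="BentoML makes it easy to serve and scale models.")
    print(result)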
🎭 We’ve learned that the best way to showcase what BentoML can do is not through dry, conceptual documentation but through real-world examples. Check out our current list of examples, and we’ll continue to publish new ones to the gallery as exciting new models are released.
• BentoVLLM
• BentoControlNet
• BentoSDXLTurbo
• BentoWhisperX
• BentoXTTS
• BentoCLIP
🙏 Thank you for your continued support!
Sherlock Xu, 07/19/2024, 12:48 PM:
BentoML 1.3 is here!
Introduced the @bentoml.task decorator to set a task endpoint for executing long-running workloads (such as batch processing or video generation).
◦ Added the .submit() method to both the sync and async clients, which submits task inputs via the task endpoint; dedicated worker processes constantly monitor task queues for new work to perform.
◦ Full compatibility with BentoCloud to run Bentos defined with task endpoints.
◦ See the Services and Clients docs for examples of defining a long-running task in a Service, creating clients to call the endpoint, and retrieving task status; a minimal sketch follows this list.
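A hedged sketch of a task endpoint plus client-side submission; the service, endpoint name, prompt, and the status/result method names reflect our reading of the 1.3 task client and are assumptions rather than the official example:
import bentoml

@bentoml.service
class VideoSynthesizer:
    @bentoml.task
    def generate(self, prompt: str) -> str:
        # Long-running work executes on dedicated worker processes.
        return f"video for: {prompt}"

# Client side: submit the task, then poll for status and fetch the result.
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    task = client.generate.submit(prompt="a cat surfing at sunset")
    print(task.get_status())
    print(task.get())  # blocks until the task completes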
🚀 Optimized the build cache to accelerate the build process
◦ Enhanced build speed for bentoml build & containerize through pre-installed large packages like torch
◦ Switched to uv as the installer and resolver, replacing pip
🔨 Supported concurrency-based autoscaling on BentoCloud
◦ Added the concurrency configuration to the @bentoml.service decorator to set the ideal number of simultaneous requests a Service is designed to handle.
◦ Added the external_queue configuration to the @bentoml.service decorator to queue excess requests until they can be processed within the defined concurrency limits.
◦ See the documentation to configure concurrency and an external queue; a minimal configuration sketch follows this list.
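A hedged configuration sketch, assuming the concurrency and external_queue fields live under the traffic option of @bentoml.service; the values and service body are illustrative:
import bentoml

@bentoml.service(
    traffic={
        "concurrency": 32,       # ideal number of simultaneous requests per instance
        "external_queue": True,  # queue excess requests on BentoCloud until capacity frees up
    }
)
class MyService:
    @bentoml.api
    def predict(self, text: str) -> str:
        return text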
🔒 Secure data handling with secrets in BentoCloud
◦ You can now create and manage credentials, such as HuggingFace tokens and AWS secrets, securely on BentoCloud and easily apply them across multiple Deployments.
◦ Added secret subcommands to the BentoML CLI for secret management. Run bentoml secret -h to learn more.
🗒️ Added streamed logs for Bento image deployment
◦ Makes it easier to troubleshoot build issues and enables faster development iterations
🙏 Thank you for your continued support! Feel free to try 1.3 now!
Sherlock Xu, 02/20/2025, 1:38 PM:
BentoML 1.4 is here!
Introduced Codespaces, a new cloud development experience:
◦ New bentoml code command for creating a Codespace
◦ Auto-sync of local changes to the cloud environment
◦ Access to a variety of powerful cloud GPUs
◦ Real-time logs and debugging through the cloud dashboard
◦ Eliminate dependency headaches and ensure consistency between dev and prod environments
🐍 New Python SDK for runtime configurations
◦ Added bentoml.images.PythonImage for defining the Bento runtime environment in Python instead of using bentofile.yaml or pyproject.toml
◦ Support customizing runtime configurations (e.g., Python version, system packages, and dependencies) directly in the service.py file
◦ Introduced a context-sensitive run() method for running custom build commands
◦ Backward compatible with existing bentofile.yaml and pyproject.toml configurations; a minimal sketch follows this list
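A hedged sketch of defining the runtime image in service.py; the Python version, package list, build command, and method chaining are illustrative assumptions based on the items above:
import bentoml

# Runtime environment defined in Python instead of bentofile.yaml / pyproject.toml.
image = (
    bentoml.images.PythonImage(python_version="3.11")
    .python_packages("torch", "transformers")
    .run("echo building...")  # context-sensitive custom build command
)

@bentoml.service(image=image)
class MyService:
    @bentoml.api
    def predict(self, text: str) -> str:
        return text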
⚡ Accelerated model loading
◦ Implemented build-time model downloads and parallel loading of model weights using safetensors to reduce cold start time and improve scaling performance. See the documentation to learn more.
◦ Added bentoml.models.HuggingFaceModel for loading models from Hugging Face. It supports private model repositories and custom endpoints (see the sketch after this list).
◦ Added bentoml.models.BentoModel for loading models from BentoCloud and the local Model Store.
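A hedged sketch of referencing a Hugging Face model so its weights are fetched at build time and resolved to a local path when the service starts; the repository id, the sentence-transformers usage, and the attribute pattern are illustrative assumptions:
import bentoml
from bentoml.models import HuggingFaceModel

@bentoml.service
class Embedder:
    # Declared as a class attribute so weights are downloaded at build time
    # and loaded when the service starts.
    model_path = HuggingFaceModel("sentence-transformers/all-MiniLM-L6-v2")

    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(self.model_path)

    @bentoml.api
    def embed(self, text: str) -> list[float]:
        return self.model.encode(text).tolist()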
🌍 External deployment dependencies
◦ Extended bentoml.depends() to support external deployments
◦ Added support for calling BentoCloud Deployments via name or URL
◦ Added support for calling self-hosted HTTP AI services outside BentoCloud; a minimal sketch follows this list
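A hedged sketch of depending on an external deployment; the deployment name, URL, endpoint name, and the deployment/url keyword arguments are assumptions for illustration of the options described above:
import bentoml

@bentoml.service
class Gateway:
    # Call a BentoCloud Deployment by name, or any self-hosted HTTP AI service by URL.
    upstream = bentoml.depends(deployment="my-deployment")
    # upstream = bentoml.depends(url="http://localhost:3001")

    @bentoml.api
    def generate(self, prompt: str) -> str:
        # Methods on the remote service are invoked as if they were local.
        return self.upstream.generate(prompt=prompt)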
⚠️ Legacy Service API deprecation
◦ The legacy bentoml.Service API (with runners) is now officially deprecated and is scheduled for removal in a future release. We recommend you use the @bentoml.service decorator.
Note that:
• 1.4 remains fully compatible with Bentos created by 1.3.
• The BentoML documentation has been updated with examples and guides for 1.4.
🙏 As always, we appreciate your continued support!
Sean, 04/22/2025, 4:00 PM:
Security advisory: two vulnerabilities involving insecure pickle deserialization of requests sent with the Content-Type: application/vnd.bentoml+pickle header have been identified and fixed.
CVE‑2025‑27520:
• Scope: Insecure pickle deserialization in the entry service
• Affected versions: BentoML ≥ 1.3.4 and < 1.4.3
• Action: Upgrade to v1.4.3 or later.
CVE‑2025‑32375:
• Scope: Insecure pickle deserialization in dependent (runner) services
• Affected versions: BentoML ≤ v1.4.8
• Exposure: Only when runners are launched explicitly with bentoml start-runner-server.
◦ Deployments started with standard bentoml serve and containerized via bentoml containerize are not exposed, because runner ports are not published.
◦ As of v1.4.8, the start-runner-server sub-command has been removed, fully closing this attack vector.
• Action: Upgrade to v1.4.8 or later.
Recommended next steps:
1. Upgrade immediately to the minimum safe version listed above (or any newer release).
2. Audit ingress rules to ensure only intended content types are accepted if pickle support is truly required; otherwise, consider disabling pickle inputs altogether.
If you have questions or need assistance, please open an issue or reach out in our community Slack.
Stay safe,
The BentoML Team