Bo
09/28/2022, 6:00 PM
Sean
10/03/2022, 1:05 AM
v1.0.7 is released as a patch to quickly fix a critical module import issue introduced in v1.0.6. The import error manifests in the import of any modules under io.* or models.*. The following is an example of a typical error message and traceback. Please upgrade to v1.0.7 to address this import issue.
packages/anyio/_backends/_asyncio.py", line 21, in <module>
from io import IOBase
ImportError: cannot import name 'IOBase' from 'bentoml.io'
Sean
10/17/2022, 7:06 PM
The v1.0.0 release of Yatai is here! Yatai (屋台) is the Japanese word for a food stall where bentos 🍱 can be served (yes, pun intended 😛). If you are not already a user, Yatai is a production-first platform that brings collaborative BentoML workflows to Kubernetes, helps run model serving at scale, and simplifies model management and deployment across teams.
• Scale BentoML to its full potential on a distributed system, optimized for cost-saving and performance.
• Manage deployment lifecycle to deploy, update, or roll back via API or Web UI.
• Centralized registry providing the foundation for CI/CD via artifact management APIs, labeling, and WebHooks for custom integration.
☁️ Improved compatibility with major cloud providers (AWS, GCP, and Azure)
• Improved AWS EKS installation documentation for yatai and yatai-deployment.
• Enhanced Kaniko image builder support to address the permission issues seen with Google Kubernetes Engine (GKE).
👩‍💻 Enhanced DevOps experience with better Kubernetes-native CRD workflows and observability support.
• Kubernetes-native workflow via BentoDeployment CRD (Custom Resource Definition), which can easily fit into an existing GitOps workflow.
• Native integration with Grafana stack for observability.
◦ Follow the metrics collection guide for setting up Prometheus and Grafana dashboards for BentoDeployment metrics.
◦ Follow the log collection guide for setting up Loki for BentoDeployment log collection, storage, and querying.
• Support for traffic control with Istio.
⚠️ For users of Yatai v0.4.6, version v1.0.0 introduces a few breaking changes.
• Split Yatai into two components, yatai and yatai-deployment, for better modularization and separation of concerns; see architecture.
• Updated container image building trigger from bento push event to bento deployment event. Image building status can be viewed in the console logs UI.
• Removed all previous component operators, e.g., metrics and logging, for more standard integration with the ecosystem.
• See the complete migration guide for upgrading Yatai from v0.4.6 to v1.0.0.
Bo
10/19/2022, 6:57 PM
Sean
11/01/2022, 12:51 AM
v1.0.8 is released with a list of improvements we hope you’ll find useful.
• Introduced Bento Client for easy access to the BentoML service over HTTP. Both sync and async calls are supported. See the Bento Client Guide for more details.
import numpy as np
from bentoml.client import Client

client = Client.from_url("http://localhost:3000")
# Sync call
response = client.classify(np.array([[4.9, 3.0, 1.4, 0.2]]))
# Async call
response = await client.async_classify(np.array([[4.9, 3.0, 1.4, 0.2]]))
• Introduced custom metrics support for easy instrumentation of custom metrics over Prometheus. See Metrics Guide for more details. Full Prometheus style syntax is supported for instrumenting custom metrics inside API and Runner definitions.
# Histogram metric
inference_duration = bentoml.metrics.Histogram(
    name="inference_duration",
    documentation="Duration of inference",
    labelnames=["nltk_version", "sentiment_cls"],
)

# Counter metric
polarity_counter = bentoml.metrics.Counter(
    name="polarity_total",
    documentation="Count total number of analysis by polarity scores",
    labelnames=["polarity"],
)

# Histogram
inference_duration.labels(
    nltk_version=nltk.__version__, sentiment_cls=self.sia.__class__.__name__
).observe(time.perf_counter() - start)

# Counter
polarity_counter.labels(polarity=is_positive).inc()
• Improved health checking to also cover the status of runners to avoid returning a healthy status before runners are ready.
• Added SSL/TLS support to gRPC serving.
bentoml serve-grpc --ssl-certfile=credentials/cert.pem --ssl-keyfile=credentials/key.pem --production --enable-reflection
• Added channelz support for easy debugging of gRPC serving.
• Allowed nested requirements with the -r syntax.
# requirements.txt
-r nested/requirements.txt
pydantic
Pillow
fastapi
• Improved the adaptive batching dispatcher’s auto-tuning ability to avoid sporadic request failures due to batching at the beginning of the runner lifecycle.
• Fixed a bug where runners would raise a TypeError when overloaded. An HTTP 503 Service Unavailable is now returned when the runner is overloaded.
File "python3.9/site-packages/bentoml/_internal/runner/runner_handle/remote.py", line 188, in async_run_method
return tuple(AutoContainer.from_payload(payload) for payload in payloads)
TypeError: 'Response' object is not iterable
💡 We continue to update the documentation and examples on every release to help the community unlock the full power of BentoML.
• Check out the updated PyTorch Framework Guide on how to use external_modules to save classes or utility functions required by the model.
• See the Metrics Guide on how to add custom metrics to your API and custom Runners.
• Learn more about how to use the Bento Client to easily call your BentoML service with Python.
• Check out the latest blog post on why model serving over gRPC matters to data scientists.
🥂 We’d like to thank the community for your continued support and engagement.
• Shout out to @judahrand for multiple contributions to BentoML and bentoctl.
• Shout out to @phildamore-phdata, @quandollar, @2JooYeon, and @fortunto2 for their first contribution to BentoML.
🦄 After years of work, we’re proud to announce that next week, we’ll be launching Yatai 1.0. Sign up for the launch event at https://app.livestorm.co/bentoml/yatai-10-launch?type=detailed 🎉
Bo
11/02/2022, 9:38 PM
Tim Liu
11/04/2022, 6:15 PM
Sean
11/09/2022, 2:39 AM
v1.0.10 is released to address a recurring broken pipe error reported by the community. Also included in this release is a list of improvements we’d like to share with the community.
• Fixed an aiohttp.client_exceptions.ClientOSError caused by asymmetrical keep-alive timeout settings between the API Server and Runner.
aiohttp.client_exceptions.ClientOSError: [Errno 32] Broken pipe
• Added multi-output support for ONNX and TensorFlow frameworks.
• Added from_sample support to all IO Descriptors, in addition to just bentoml.io.NumpyNdarray, and the sample is reflected in the Swagger UI.
# Pandas Example
@svc.api(
    input=PandasDataFrame.from_sample(
        pd.DataFrame([1, 2, 3, 4])
    ),
    output=PandasDataFrame(),
)
def predict_df(input_df):  # illustrative handler for the decorated API
    ...

# JSON Example
@svc.api(
    input=JSON.from_sample(
        {"foo": 1, "bar": 2}
    ),
    output=JSON(),
)
def predict_json(input_data):  # illustrative handler for the decorated API
    ...
💡 We continue to update the documentation and examples on every release to help the community unlock the full power of BentoML.
• Check out the updated multi-model inference graph guide and example to learn how to compose multiple models in the same Bento service.
• Did you know BentoML supports OpenTelemetry tracing out of the box? Check out the Tracing guide for tracing support for OTLP, Jaeger, and Zipkin.
🦄 After years of work, we’re proud to announce that next week, we’ll be launching Yatai 1.0. Sign up for the launch event here.
Bo
11/11/2022, 7:38 PM
Bo
11/28/2022, 7:48 PM
Sean
12/07/2022, 8:37 PM
v1.0.11 is here, featuring the introduction of an inference data collection and model monitoring API that can be easily integrated with any model monitoring framework.
• Introduced the bentoml.monitor API for monitoring any features, predictions, and target data in numerical, categorical, and numerical sequence types.
import numpy as np

import bentoml
from bentoml.io import Text
from bentoml.io import NumpyNdarray

CLASS_NAMES = ["setosa", "versicolor", "virginica"]

iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

@svc.api(
    input=NumpyNdarray.from_sample(np.array([4.9, 3.0, 1.4, 0.2], dtype=np.double)),
    output=Text(),
)
async def classify(features: np.ndarray) -> str:
    with bentoml.monitor("iris_classifier_prediction") as mon:
        mon.log(features[0], name="sepal length", role="feature", data_type="numerical")
        mon.log(features[1], name="sepal width", role="feature", data_type="numerical")
        mon.log(features[2], name="petal length", role="feature", data_type="numerical")
        mon.log(features[3], name="petal width", role="feature", data_type="numerical")

        results = await iris_clf_runner.predict.async_run([features])
        result = results[0]
        category = CLASS_NAMES[result]

        mon.log(category, name="pred", role="prediction", data_type="categorical")
    return category
• Enabled monitoring data collection through log file forwarding using any forwarders (fluentbit, filebeat, logstash) or OTLP exporter implementations.
◦ Configuration for monitoring data collection through log files.
monitoring:
  enabled: true
  type: default
  options:
    log_path: path/to/log/file
◦ Configuration for monitoring data collection through an OTLP exporter.
monitoring:
  enabled: true
  type: otlp
  options:
    endpoint: http://localhost:5000
    insecure: true
    credentials: null
    headers: null
    timeout: 10
    compression: null
    meta_sample_rate: 1.0
• Supported third-party monitoring data collector integrations through BentoML Plugins. See bentoml/plugins repository for more details.
🐳 Improved containerization SDK and CLI options, read more in #3164.
• Added support for multiple backend builder options (Docker, nerdctl, Podman, Buildah, Buildx) in addition to buildctl (standalone buildkit builder).
• Improved Python SDK for containerization with different backend builder options.
import bentoml
bentoml.container.build("iris_classifier:latest", backend="podman", features=["grpc","grpc-reflection"], **kwargs)
• Improved CLI to include the newly added options.
• Standardized the generated Dockerfile in bentos to be compatible with all build tools for use cases that require building from a Dockerfile directly.
💡 We continue to update the documentation and examples on every release to help the community unlock the full power of BentoML.
• Learn more about inference data collection and model monitoring capabilities in BentoML.
• Learn more about the default metrics that come out-of-the-box and how to add custom metrics in BentoML.
Bo
12/12/2022, 9:12 PM
Bo
12/15/2022, 5:34 PM
ABC
01/06/2023, 6:26 AM
Sean
01/20/2023, 3:58 AM
v1.0.13 is released featuring a preview of batch inference with Spark.
• Run the batch inference job using the bentoml.batch.run_in_spark() method. This method takes the bento, the API name, the Spark DataFrame containing the input data, and the Spark session itself as parameters, and it returns a DataFrame containing the results of the batch inference job.
import bentoml
# Import the bento from a repository or get the bento from the bento store
bento = bentoml.import_bento("s3://bentoml/quickstart")
# Run the run_in_spark function with the bento, API name, and Spark session
results_df = bentoml.batch.run_in_spark(bento, "classify", df, spark)
• Internally, what happens when you run run_in_spark is as follows:
◦ First, the bento is distributed to the cluster. Note that if the bento has already been distributed, i.e. you have already run a computation with that bento, this step is skipped.
◦ Next, a process function is created, which starts a BentoML server on each of the Spark workers, then uses a client to process all the data. This is done so that the workers take advantage of the batch processing features of the BentoML server. PySpark pickles this process function and dispatches it, along with the relevant data, to the workers.
◦ Finally, the function is evaluated on the given dataframe. Once all methods that the user defined in the script have been executed, the data is returned to the master node.
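Below is a minimal sketch of preparing the Spark session and input DataFrame that bentoml.batch.run_in_spark() consumes; the schema, column names, and input path here are illustrative assumptions rather than part of the release.
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType, StructField, StructType

# Create (or reuse) the Spark session passed to run_in_spark
spark = SparkSession.builder.appName("bentoml-batch-inference").getOrCreate()

# Hypothetical schema matching the inputs expected by the "classify" API
schema = StructType([StructField(f"feature_{i}", FloatType(), False) for i in range(4)])

# Hypothetical input location; any Spark-readable source works here
df = spark.read.csv("s3://bucket/path/to/input.csv", schema=schema, header=True)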
⚠️ The bentoml.batch API may undergo incompatible changes until general availability is announced in a later minor version release.
🥂 Shout out to jeffthebear, KimSoungRyoul, Robert Fernandez, Marco Vela, Quan Nguyen, and y1450 from the community for their contributions in this release.
Sean
02/17/2023, 5:00 PM
The v1.0.15 release is here featuring the introduction of the bentoml.diffusers framework.
• Learn more about the capabilities of the bentoml.diffusers framework in the Creating Stable Diffusion 2.0 Services With BentoML And Diffusers blog and the BentoML Diffusers example project.
• Import a diffusion model with the bentoml.diffusers.import_model API.
bentoml.diffusers.import_model(
"sd2",
"stabilityai/stable-diffusion-2",
)
• Create a text2img service using a Stable Diffusion 2.0 model runner with the familiar to_runner API from the bentoml.diffusers framework.
import torch
from diffusers import StableDiffusionPipeline
import bentoml
from bentoml.io import Image, JSON, Multipart
bento_model = bentoml.diffusers.get("sd2:latest")
stable_diffusion_runner = bento_model.to_runner()
svc = bentoml.Service("stable_diffusion_v2", runners=[stable_diffusion_runner])
@svc.api(input=JSON(), output=Image())
def txt2img(input_data):
    images, _ = stable_diffusion_runner.run(**input_data)
    return images[0]
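For reference, a hypothetical client call against the service above, assuming it is served locally on port 3000 and that the JSON payload mirrors the pipeline’s keyword arguments:
from bentoml.client import Client

client = Client.from_url("http://localhost:3000")
# The returned PIL image can be saved or post-processed further
image = client.txt2img({"prompt": "a bento box on a wooden table"})
image.save("output.png")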
⭐ Fixed an incompatibility introduced in starlette==0.25.0 that resulted in the type MultiPartMessage not being found in starlette.formparsers.
ImportError: cannot import name 'MultiPartMessage' from 'starlette.formparsers' (/opt/miniconda3/envs/bentoml/lib/python3.10/site-packages/starlette/formparsers.py)
Sean
02/22/2023, 10:55 PM
Sean
03/14/2023, 9:21 PM
The v1.0.16 release is here featuring the introduction of the bentoml.triton framework. With this integration, BentoML now supports running NVIDIA Triton Inference Server as a Runner. See the Triton Inference Server documentation to learn more!
• Triton Inference Server can be configured as a Runner in BentoML with its model repository and CLI arguments specified as parameters.
import bentoml
triton_runner = bentoml.triton.Runner(
"triton_runner",
model_repository="<s3://bucket/path/to/model_repository>",
cli_args=["--load-model=torchscrip_yolov5s", "--model-control-mode=explicit"],
)
• Models served by the Triton Inference Server Runner can be called as a method on the runner handle both synchronously and asynchronously.
import typing as t

import numpy as np
from numpy.typing import NDArray
from PIL.Image import Image

@svc.api(
    input=bentoml.io.Image.from_sample("./data/0.png"), output=bentoml.io.NumpyNdarray()
)
async def bentoml_torchscript_mnist_infer(im: Image) -> NDArray[t.Any]:
    arr = np.array(im) / 255.0
    arr = np.expand_dims(arr, (0, 1)).astype("float32")
    InferResult = await triton_runner.torchscript_mnist.async_run(arr)
    return InferResult.as_numpy("OUTPUT__0")
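For reference, a minimal sketch of the equivalent synchronous call on the same runner handle, reusing the arr input and output tensor name from the example above:
InferResult = triton_runner.torchscript_mnist.run(arr)
output_arr = InferResult.as_numpy("OUTPUT__0")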
• Build bentos and containerize images with Triton Runners by specifying the nvcr.io/nvidia/tritonserver base image in bentofile.yaml.
service: service:svc
include:
- /model_repository
- /data/*.png
- /*.py
exclude:
- /__pycache__
- /venv
- /train.py
- /build_bento.py
- /containerize_bento.py
python:
  packages:
    - bentoml[triton]
docker:
  base_image: nvcr.io/nvidia/tritonserver:22.12-py3
💡 If you are an existing Triton user, the integration provides simpler ways to add custom logic in Python, deploy distributed multi-model inference graphs, unify model management across different ML frameworks and workflows, and standardize model packaging format with versioning and collaboration features. If you are an existing BentoML user, the integration improves runner efficiency and throughput under high load thanks to Triton’s efficient C++ runtime.
Sean
04/06/2023, 8:59 PM
We are excited to announce the release of v1.0.17, which includes support for 🤗 Hugging Face Transformers pre-trained instances. Prior to this release, only pipelines could be saved and loaded using the bentoml.transformers APIs. However, based on the community’s demand to work with pre-trained models, tokenizers, preprocessors, etc., without pipelines, we have expanded our capabilities in the bentoml.transformers APIs. With this release, all pre-trained instances can be saved and loaded into either built-in Transformers framework runners or custom runners. This update opens up new possibilities for users to work with pre-trained models, and we are thrilled to see what the community will create using this feature. To learn more, visit the BentoML Transformers framework documentation.
• Pre-trained models and instances, such as tokenizers, preprocessors, and feature extractors, can also be saved as standalone models using the bentoml.transformers.save_model API.
import bentoml
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
bentoml.transformers.save_model("speecht5_tts_processor", processor)
bentoml.transformers.save_model("speecht5_tts_model", model, signatures={"generate_speech": {"batchable": False}})
bentoml.transformers.save_model("speecht5_tts_vocoder", vocoder)
• Pre-trained models and instances can be run either independently as Transformers framework runners or jointly in a custom runner. To use pre-trained models and instances as individual framework runners, simply get the model references and convert them to runners using the to_runner method.
import bentoml
import torch
from bentoml.io import Text, NumpyNdarray
from datasets import load_dataset

processor_runner = bentoml.transformers.get("speecht5_tts_processor").to_runner()
model_runner = bentoml.transformers.get("speecht5_tts_model").to_runner()
vocoder_runner = bentoml.transformers.get("speecht5_tts_vocoder").to_runner()

embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

svc = bentoml.Service("text2speech", runners=[processor_runner, model_runner, vocoder_runner])

@svc.api(input=Text(), output=NumpyNdarray())
def generate_speech(inp: str):
    inputs = processor_runner.run(text=inp, return_tensors="pt")
    speech = model_runner.generate_speech.run(input_ids=inputs["input_ids"], speaker_embeddings=speaker_embeddings, vocoder=vocoder_runner.run)
    return speech.numpy()
• To use the pre-trained models and instances together in a custom runner, use the bentoml.transformers.get API to get the model references and load them in a custom runner. The pre-trained instances can then be used for inference in the custom runner.
import bentoml
import torch
from datasets import load_dataset

processor_ref = bentoml.models.get("speecht5_tts_processor:latest")
model_ref = bentoml.models.get("speecht5_tts_model:latest")
vocoder_ref = bentoml.models.get("speecht5_tts_vocoder:latest")

class SpeechT5Runnable(bentoml.Runnable):
    def __init__(self):
        self.processor = bentoml.transformers.load_model(processor_ref)
        self.model = bentoml.transformers.load_model(model_ref)
        self.vocoder = bentoml.transformers.load_model(vocoder_ref)
        self.embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
        self.speaker_embeddings = torch.tensor(self.embeddings_dataset[7306]["xvector"]).unsqueeze(0)

    @bentoml.Runnable.method(batchable=False)
    def generate_speech(self, inp: str):
        inputs = self.processor(text=inp, return_tensors="pt")
        speech = self.model.generate_speech(inputs["input_ids"], self.speaker_embeddings, vocoder=self.vocoder)
        return speech.numpy()

text2speech_runner = bentoml.Runner(SpeechT5Runnable, name="speecht5_runner", models=[processor_ref, model_ref, vocoder_ref])
svc = bentoml.Service("talk_gpt", runners=[text2speech_runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.NumpyNdarray())
async def generate_speech(inp: str):
    return await text2speech_runner.generate_speech.async_run(inp)
Sean
04/14/2023, 4:00 PM
v1.0.18 brings a new way of creating the server and client natively from Python.
• Start an HTTP or gRPC server and client asynchronously with a context manager.
import numpy as np
from bentoml import HTTPServer

server = HTTPServer("iris_classifier:latest", production=True, port=3000)
# Start the server in a separate process and connect to it using a client
with server.start() as client:
    res = client.classify(np.array([[4.9, 3.0, 1.4, 0.2]]))
• Start an HTTP or gRPC server synchronously.
server = HTTPServer("iris_classifier:latest", production=True, port=3000)
server.start(blocking=True)
• As always, a client can be created and connected to a running server.
client = Client.from_url("http://localhost:3000")
res = client.classify(np.array([[4.9, 3.0, 1.4, 0.2]]))
Sean
05/10/2023, 1:28 AM
v1.0.19 is released with enhanced GPU utilization and expanded ML framework support.
• Optimized GPU resource utilization: Enabled scheduling of multiple instances of the same runner using the workers_per_resource scheduling strategy configuration. The following configuration allows scheduling 2 instances of the “iris” runner per GPU instance. workers_per_resource is 1 by default.
runners:
  iris:
    resources:
      nvidia.com/gpu: 1
    workers_per_resource: 2
• New ML framework support: We’ve added support for EasyOCR and Detectron2 to our growing list of supported ML frameworks.
• Enhanced runner communication: Implemented PEP 574 out-of-band pickling to improve runner communication by eliminating memory copying, resulting in better performance and efficiency.
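For illustration only (plain Python, not BentoML internals), a minimal sketch of PEP 574 out-of-band pickling, where large buffers are captured separately instead of being copied into the pickle stream:
import pickle

import numpy as np

arr = np.zeros((1024, 1024), dtype=np.float32)
buffers = []
# Protocol 5 with a buffer_callback collects the array's memory out-of-band
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)
# The receiver reconstructs the object from the payload plus the zero-copy buffers
restored = pickle.loads(payload, buffers=buffers)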
• Backward compatibility for Hugging Face Transformers: Resolved compatibility issues with Hugging Face Transformers versions prior to v4.18, ensuring a seamless experience for users with older versions.
⚙️ With the release of Kubeflow 1.7, BentoML now has native integration with Kubeflow, allowing developers to leverage BentoML’s cloud-native components. Previously, developers were limited to exporting and deploying Bento as a single container. With this integration, models trained in Kubeflow can easily be packaged, containerized, and deployed to a Kubernetes cluster as microservices. This architecture enables the individual models to run in their own pods, utilizing the most optimal hardware for their respective tasks and enabling independent scaling.
💡 With each release, we consistently update our blog, documentation and examples to empower the community in harnessing the full potential of BentoML.
• Learn more about the scheduling strategy to get better resource utilization.
• Learn more about model monitoring and drift detection in BentoML and integration with various monitoring frameworks.
• Learn more about using NVIDIA Triton Inference Server as a runner to improve your application’s performance and throughput.
Sean
05/10/2023, 1:29 AM
v1.0.20 is released with improved usability and compatibility features.
• Production Mode by Default: The bentoml serve command will now run with the --production option by default. This change is made to simulate production behavior during development. The --reload option will continue to work as expected. To achieve the previous serving behavior, use --development instead.
• Optional Dependency for OpenTelemetry Exporter: The opentelemetry-exporter-otlp-proto-http dependency has been moved from a required dependency to an optional one to address a protobuf dependency incompatibility issue. ⚠️ If you are currently using the Model Monitoring and Inference Data Collection feature, you must install the package with the monitor-otlp option from this release onwards to include the necessary dependency.
pip install "bentoml[monitor-otlp]"
• OpenTelemetry Trace ID Configuration Option: A new configuration option has been added to return the OpenTelemetry Trace ID in the response. This feature is particularly helpful when tracing has not been initialized in the upstream caller, but the caller still wishes to log the Trace ID in case of an error.
api_server:
  http:
    response:
      trace_id: True
• Start from a Service: Added the ability to start a server from a bentoml.Service object. This is helpful for troubleshooting a project in a development environment where no Bentos have been built yet.
import bentoml
# import the Service defined in `/clip_api_service/service.py` file
from clip_api_service.service import svc
if __name__ == "__main__":
    # start a server:
    server = bentoml.HTTPServer(svc)
    server.start(blocking=False)
    client = server.get_client()
    client.predict(..)
Tim Liu
05/31/2023, 8:01 PM
Sean
06/12/2023, 8:49 PM
The v1.0.22 release has brought a list of well-anticipated updates.
• Added support for Pydantic 2 for better validation performance.
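A minimal sketch of pairing a Pydantic model with the JSON IO descriptor, which benefits from the faster Pydantic 2 validation; the service and field names below are illustrative assumptions.
import bentoml
from bentoml.io import JSON
from pydantic import BaseModel

class IrisFeatures(BaseModel):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

svc = bentoml.Service("pydantic_validation_demo")

@svc.api(input=JSON(pydantic_model=IrisFeatures), output=JSON())
def validate(features: IrisFeatures) -> dict:
    # Input is validated against IrisFeatures before reaching this function
    return features.model_dump()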
• Added support for CUDA 12 versions in builds and containerization.
• Introduced service lifecycle events, allowing custom logic to be added on on_deployment, on_startup, and on_shutdown. State can be managed using the context ctx variable during the on_startup and on_shutdown events and during request serving in the API.
@svc.on_deployment
def on_deployment():
    pass

@svc.on_startup
def on_startup(ctx: bentoml.Context):
    ctx.state["object_key"] = create_object()

@svc.on_shutdown
def on_shutdown(ctx: bentoml.Context):
    cleanup_state(ctx.state["object_key"])

@svc.api
def predict(input_data, ctx):
    object = ctx.state["object_key"]
    pass
• Added support for traffic control for both API Server and Runners. Timeout and maximum concurrency can now be configured through configuration.
api_server:
  traffic:
    timeout: 10  # API Server request timeout in seconds
    max_concurrency: 32  # Maximum concurrent requests in the API Server
runners:
  iris:
    traffic:
      timeout: 10  # Runner request timeout in seconds
      max_concurrency: 32  # Maximum concurrent requests in the Runner
• Improved bentoml push performance for large Bentos.
🚀 One more thing: the team is delighted to unveil our latest endeavor, OpenLLM. This innovative project allows you to effortlessly build with state-of-the-art open-source or fine-tuned Large Language Models.
• Supports all variants of Flan-T5, Dolly V2, StarCoder, Falcon, StableLM, and ChatGLM out of the box. Fully customizable with model-specific arguments.
openllm start [falcon | flan_t5 | dolly_v2 | chatglm | stablelm | starcoder]
• Exposes the familiar BentoML APIs and transforms LLMs seamlessly into Runners.
llm_runner = openllm.Runner("dolly-v2")
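A rough sketch of wiring that runner into a BentoML service; the service name, endpoint, and output handling are assumptions, so consult the OpenLLM documentation for the exact generated output shape.
import bentoml
import openllm
from bentoml.io import Text

llm_runner = openllm.Runner("dolly-v2")
svc = bentoml.Service("llm-service", runners=[llm_runner])

@svc.api(input=Text(), output=Text())
async def prompt(input_text: str) -> str:
    # The runner exposes the model's generation method; output handling is illustrative
    generated = await llm_runner.generate.async_run(input_text)
    return str(generated)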
• Builds LLM application into the Bento format that can be deployed to BentoCloud or containerized into OCI images.
openllm build [falcon | flan_t5 | dolly_v2 | chatglm | stablelm | starcoder]
Our dedicated team is working hard to pioneer more integrations of advanced models for upcoming releases of OpenLLM. Stay tuned for the unfolding developments.
Sean
07/24/2023, 9:14 PM
We are excited to announce the release of v1.1.0, our first minor version update since the milestone v1.0.
• Backward Compatibility: Rest assured that this release maintains full API backward compatibility with v1.0.
• Official gRPC Support: We’ve transitioned gRPC support in BentoML from experimental to official status, expanding your toolkit for high-performance, low-latency services.
• Ray Integration: Ray is a popular open-source compute framework that makes it easy to scale Python workloads. BentoML integrates natively with Ray Serve to enable users to deploy Bento applications in a Ray cluster without modifying code or configuration.
• Enhanced Hugging Face Transformers and Diffusers Support: All Hugging Face Diffuser models and pipelines can be seamlessly imported and integrated into BentoML applications through the Transformers and Diffusers framework libraries.
• Enhanced Model Version Management: Enjoy greater flexibility with the improved model version management, enabling flexible configuration and synchronization of model versions with your remote model store.
🦾 We are also excited to announce the launch of OpenLLM v0.2.0 featuring the support of Llama 2 models.
• GPU and CPU Support: Running Llama 2 is supported on both GPU and CPU.
• Model variations and parameter sizes: Supports all model weights and parameter sizes on Hugging Face. Users can use any weights on Hugging Face (e.g. TheBloke/Llama-2-13B-chat-GPTQ), custom weights from a local path (e.g. /path/to/llama-1), or fine-tuned weights as long as they adhere to LlamaModelForCausalLM. Use openllm models --show-available to learn more.
• Stay tuned for fine-tuning capabilities in OpenLLM: Fine-tuning various Llama 2 models will be added in a future release. Try the experimental script for fine-tuning Llama-2 with QLoRA under the OpenLLM playground, python -m openllm.playground.llama2_qlora --help.
Sean
08/31/2023, 6:59 PM
BentoML v1.1.4 and OpenLLM v0.2.27 are released. See an example service definition for SSE streaming with Llama2.
• Added response streaming through SSE to the bentoml.io.Text IO Descriptor type.
• Added async generator support to both API Server and Runner to yield incremental text responses, as sketched below.
• Added native SSE streaming support to ☁️ BentoCloud.
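A minimal sketch of the async generator pattern referenced above, with an illustrative service that streams text chunks over SSE (a real service would yield tokens from an LLM runner):
import bentoml
from bentoml.io import Text

svc = bentoml.Service("stream_demo")

@svc.api(input=Text(), output=Text())
async def generate(prompt: str):
    # Each yielded chunk is streamed to the client incrementally via SSE
    for word in ("echoing", "your", "prompt:", prompt):
        yield word + " "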
🦾 OpenLLM added token streaming capabilities to support streaming responses from LLMs.
• Added a /v1/generate_stream endpoint for streaming responses from LLMs.
curl -N -X 'POST' 'http://0.0.0.0:3000/v1/generate_stream' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
"prompt": "### Instruction:\n What is the definition of time (200 words essay)?\n\n### Response:",
"llm_config": {
"use_llama2_prompt": false,
"max_new_tokens": 4096,
"early_stopping": false,
"num_beams": 1,
"num_beam_groups": 1,
"use_cache": true,
"temperature": 0.89,
"top_k": 50,
"top_p": 0.76,
"typical_p": 1,
"epsilon_cutoff": 0,
"eta_cutoff": 0,
"diversity_penalty": 0,
"repetition_penalty": 1,
"encoder_repetition_penalty": 1,
"length_penalty": 1,
"no_repeat_ngram_size": 0,
"renormalize_logits": false,
"remove_invalid_values": false,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"encoder_no_repeat_ngram_size": 0,
"n": 1,
"best_of": 1,
"presence_penalty": 0.5,
"frequency_penalty": 0,
"use_beam_search": false,
"ignore_eos": false
},
"adapter_name": null
}'
Chaoyu
10/06/2023, 6:14 PM
Jian Shen Yap
01/22/2024, 4:53 AM
Sean
02/20/2024, 4:00 PM
We are thrilled to announce the release of BentoML v1.2, the biggest release since the launch of v1.0. This release includes improvements from all the learning and feedback from our community over the past year. We invite you to read our release blog post for a comprehensive overview of the new features and the motivations behind their development.
Here are a few key points to note before we delve into the new features:
• v1.2 ensures complete backward compatibility, meaning that Bentos built with v1.1 will continue to function seamlessly with this release.
• We remain committed to supporting v1.1. Critical bug fixes and security updates will be backported to the v1.1 branch.
• BentoML documentation has been updated with examples and guides for v1.2. More guides are being added every week.
• BentoCloud is fully equipped to handle deployments from both v1.1 and v1.2 releases of BentoML.
⛏️ Introduced a simplified service SDK to empower developers with greater control and flexibility.
• Simplified the service and API interfaces as Python classes, allowing developers to add custom logic and use third party libraries flexibly with ease.
• Introduced @bentoml.service and @bentoml.api decorators to customize the behaviors of services and APIs.
• Moved configuration from YAML files to the service decorator @bentoml.service next to the class definition.
• See the vLLM example demonstrating the flexibility of the service API by initializing a vLLM AsyncEngine in the service constructor and run inference with continuous batching in the service API.
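A minimal sketch of the new service SDK described above, using an illustrative model tag and endpoint (not the vLLM example itself):
import bentoml
import numpy as np

@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 10})
class IrisClassifier:
    def __init__(self) -> None:
        # Custom initialization logic (e.g. loading a model) lives in the constructor
        self.model = bentoml.sklearn.load_model("iris_clf:latest")

    @bentoml.api
    def classify(self, input_series: np.ndarray) -> np.ndarray:
        return self.model.predict(input_series)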
🔭 Revamped IO descriptors with more familiar input and output types.
• Enable use of Pythonic types directly, without the need for additional IO descriptor definitions or decorations.
• Integrated with Pydantic to leverage its robust validation capabilities and wide array of supported types.
• Expanded support to ML and Generative AI specific IO types.
📦 Updated model saving and loading API to be more generic to enable integration with more ML frameworks.
• Allow flexible saving and loading of models using the bentoml.models.create API instead of framework-specific APIs, e.g. bentoml.pytorch.save_model, bentoml.tensorflow.save_model.
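A minimal sketch of the generic saving flow, assuming an arbitrary artifact produced by your own training code:
import bentoml

trained_weights = b"\x00" * 16  # placeholder bytes standing in for a real artifact

# Any files written into the model directory are versioned in the model store
with bentoml.models.create("my_custom_model") as model_ref:
    with open(model_ref.path_of("weights.bin"), "wb") as f:
        f.write(trained_weights)

print(model_ref.tag)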
🚚 Streamlined the deployment workflow to allow more rapid development iterations and a faster time to production.
• Enabled direct deployment to production through CLI and Python API from Git projects.
🎨 Improved API development experience with generated web UI and rich Python client.
• All bentos are now accompanied by a custom-generated UI in the BentoCloud Playground, tailored to their API definitions.
• BentoClient offers a Pythonic way to invoke the service endpoint, allowing parameters to be supplied in native Python format and letting the client efficiently handle the necessary serialization while ensuring compatibility and performance.
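A brief sketch of that Pythonic invocation, assuming a v1.2 service exposing a classify endpoint served locally on port 3000:
import bentoml
import numpy as np

client = bentoml.SyncHTTPClient("http://localhost:3000")
# Parameters are passed as native Python objects; the client handles serialization
result = client.classify(np.array([[4.9, 3.0, 1.4, 0.2]]))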
🎭 We’ve learned that the best way to showcase what BentoML can do is not through dry, conceptual documentation but through real-world examples. Check out our current list of examples, and we’ll continue to publish new ones to the gallery as exciting new models are released.
• BentoVLLM
• BentoControlNet
• BentoSDXLTurbo
• BentoWhisperX
• BentoXTTS
• BentoCLIP
🙏 Thank you for your continued support!
Chaoyu
02/27/2024, 7:55 PM