Slackbot (03/06/2023, 9:44 AM)

larme (shenyang) (03/06/2023, 9:49 AM):
Try shift + h. Does htop still output so many bentoml lines? By default htop shows all threads in a process, so api_server.workers=1 will run one api_server process, but with several threads.

Ariel Zadok (03/06/2023, 9:51 AM)

Ariel Zadok (03/06/2023, 10:00 AM)

larme (shenyang) (03/06/2023, 10:05 AM):
ps -aux | grep http_api_server ?

Ariel Zadok (03/06/2023, 10:06 AM)

larme (shenyang) (03/06/2023, 10:47 AM)

larme (shenyang) (03/06/2023, 10:48 AM):
The api_server using multithreading should not affect memory usage, because threads share memory.

larme (shenyang) (03/06/2023, 10:49 AM)

Ariel Zadok (03/06/2023, 10:52 AM)

Ariel Zadok (03/06/2023, 10:53 AM)

larme (shenyang) (03/06/2023, 10:54 AM):
service.py? I think maybe because the api_server also loads the model?

Ariel Zadok (03/06/2023, 10:54 AM):
import os
import time
import bentoml
import mlflow
import pandas as pd
from fastapi import FastAPI
from utils import CORRELATION_ID_HEADER, generate_correlation_id, generate_input_and_output_descriptors
from custom_bento_service import CustomBentoService
BENTO_REGISTRY_MODEL_NAME = os.environ['BENTO_REGISTRY_MODEL_NAME']
BENTO_REGISTRY_MODEL_VERSION = os.environ['BENTO_REGISTRY_MODEL_VERSION']
OPEN_API_PREFIX = os.environ.get('OPEN_API_PREFIX', '')
bento_model_path = f"{BENTO_REGISTRY_MODEL_NAME}:{BENTO_REGISTRY_MODEL_VERSION}"
pyfunc_model: mlflow.pyfunc.PyFuncModel = bentoml.mlflow.load_model(bento_model_path)
artifact_name = pyfunc_model.metadata.artifact_path
bento_model = bentoml.mlflow.get(bento_model_path)
input_descriptor, output_descriptor = generate_input_and_output_descriptors(bento_model_path, pyfunc_model)
del pyfunc_model
model_runner = bento_model.to_runner()
svc = CustomBentoService(BENTO_REGISTRY_MODEL_NAME, runners=[model_runner])
@svc.api(
    input=input_descriptor,
    output=output_descriptor,
)
def predict(input_df: pd.DataFrame, ctx: bentoml.Context) -> pd.DataFrame:
    start_time = time.time()
    # get request headers
    request_headers = ctx.request.headers
    x_cr_id = request_headers.get(CORRELATION_ID_HEADER)
    if x_cr_id is None:
        x_cr_id = generate_correlation_id(artifact_name)
    response = model_runner.run(input_df)
    process_time = time.time() - start_time
    ctx.response.headers.append(CORRELATION_ID_HEADER, x_cr_id)
    ctx.response.headers.append("x-process-time", str(process_time))
    return response
fastapi_app = FastAPI(
    openapi_url="/docs.json",
    root_path=OPEN_API_PREFIX,
)
svc.mount_asgi_app(fastapi_app)

@fastapi_app.get("/metadata")
def metadata():
    return {"name": bento_model.tag.name, "version": bento_model.tag.version}
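The service above dels pyfunc_model after extracting metadata, which leads into the gc discussion that follows. As a minimal, BentoML-free sketch of why del alone may not return memory (FakeModel is a hypothetical stand-in, not part of the service):

```python
import gc

class FakeModel:
    # Hypothetical stand-in for a large in-memory model.
    def __init__(self):
        self.weights = [0.0] * 1000

model = FakeModel()
alias = model.weights     # a second live reference into the object graph
del model                 # removes the name 'model' only
gc.collect()              # cannot free what is still reachable
print(len(alias))         # the weights list is still alive
```

del only unbinds a name; as long as any other reference keeps the data reachable, neither reference counting nor gc.collect() can reclaim it.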
larme (shenyang) (03/06/2023, 11:03 AM):
del pyfunc_model won't free up the memory (immediately). Could you try adding gc code:

import gc

del pyfunc_model
gc.collect()

and see what happens?

Ariel Zadok (03/06/2023, 11:04 AM)

Ariel Zadok (03/06/2023, 11:04 AM)

Ariel Zadok (03/06/2023, 11:04 AM)

Ariel Zadok (03/06/2023, 11:05 AM)

Ariel Zadok (03/06/2023, 11:06 AM)

larme (shenyang) (03/06/2023, 11:06 AM):
bentoml.mlflow.get(bento_model_path) will only return a reference to a model inside the bentoml model store. to_runner will turn this reference into a runner, and the models used by the runner will be lazily loaded inside the runner process.

Ariel Zadok (03/06/2023, 11:07 AM)

Ariel Zadok (03/06/2023, 11:48 AM):
gc.collect() didn't help

Ariel Zadok (03/06/2023, 11:48 AM)

Patrick Alves (04/03/2023, 4:22 PM)
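On larme's earlier point that one api_server process with several threads should not multiply memory usage: threads share their process's address space. A small, BentoML-free Python sketch (all names here are made up for the example):

```python
import threading

# Threads share their process's memory: every worker reads the
# same list in place instead of getting a private copy.
shared_data = list(range(100_000))
results = []
lock = threading.Lock()

def worker():
    total = sum(shared_data)   # reads the shared list directly
    with lock:
        results.append(total)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 4 workers, one shared list
```

This is why htop shows many bentoml lines with workers=1 (one line per thread, toggled with shift + h) while memory usage is counted once per process.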