# ask-for-help
l
Hi Ariel, when you are in htop, if you press shift + h, does htop still show so many bentoml lines? By default htop lists every thread in a process as its own row; api_server.workers=1 runs a single api_server process, but that process has several threads.
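If you want to double-check outside of htop, a small script like this could help (just a rough sketch, assuming psutil is installed and that the server processes have http_api_server in their command line); it prints each matching process with its PID, thread count, and resident memory:

import psutil

# List every process whose command line mentions the BentoML API server,
# together with its thread count and resident memory (RSS) in GB.
for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "http_api_server" in cmdline:
        rss_gb = proc.memory_info().rss / 1024 ** 3
        print(proc.info["pid"], proc.num_threads(), f"{rss_gb:.1f} GB")

If it prints a single PID, there is only one api_server process, and the extra rows in htop are its threads, which all share that process's memory.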
a
So bentoml is opening multiple threads?
It still shows them, by the way, even after shift + h.
l
Could you share the output of ps aux | grep http_api_server ?
a
@Asaf Horovitz
l
That means only one api_server process is running. I think maybe your htop has different key bindings from mine.
The api_server using multithreading should not affect memory usage, because threads share memory.
From the htop screenshot I think the runner is using around 37 GB of memory and the api_server around 19 GB. Is that not what you expect?
a
The runner using around 37 GB is somewhat logical.
The api_server using 19 GB is very odd.
l
Can you share your service.py? I think maybe the api_server also loads the model?
a
import os
import time
import bentoml
import mlflow
import pandas as pd
from fastapi import FastAPI
from utils import CORRELATION_ID_HEADER, generate_correlation_id, generate_input_and_output_descriptors
from custom_bento_service import CustomBentoService


BENTO_REGISTRY_MODEL_NAME = os.environ['BENTO_REGISTRY_MODEL_NAME']
BENTO_REGISTRY_MODEL_VERSION = os.environ['BENTO_REGISTRY_MODEL_VERSION']
OPEN_API_PREFIX = os.environ.get('OPEN_API_PREFIX', '')

bento_model_path = f"{BENTO_REGISTRY_MODEL_NAME}:{BENTO_REGISTRY_MODEL_VERSION}"

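# Loading the full pyfunc model here pulls the model weights into the api_server process;
# it is only used below to read metadata and build the IO descriptors.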
pyfunc_model: mlflow.pyfunc.PyFuncModel = bentoml.mlflow.load_model(bento_model_path)
artifact_name = pyfunc_model.metadata.artifact_path
bento_model = bentoml.mlflow.get(bento_model_path)

input_descriptor, output_descriptor = generate_input_and_output_descriptors(bento_model_path, pyfunc_model)

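# Drop the pyfunc model; the runner created below loads its own copy in the runner process.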
del pyfunc_model

model_runner = bento_model.to_runner()
svc = CustomBentoService(BENTO_REGISTRY_MODEL_NAME, runners=[model_runner])



@svc.api(
    input=input_descriptor,
    output=output_descriptor,
)
def predict(input_df: pd.DataFrame, ctx: bentoml.Context) -> pd.DataFrame:
    start_time = time.time()
    # get request headers
    request_headers = ctx.request.headers
    x_cr_id = request_headers.get(CORRELATION_ID_HEADER)
    if x_cr_id is None:
        x_cr_id = generate_correlation_id(artifact_name)
    response = model_runner.run(input_df)
    process_time = time.time() - start_time
    ctx.response.headers.append(CORRELATION_ID_HEADER, x_cr_id)
    ctx.response.headers.append("x-process-time", str(process_time))
    return response


fastapi_app = FastAPI(
    openapi_url="/docs.json",
    root_path=OPEN_API_PREFIX,
)
svc.mount_asgi_app(fastapi_app)


@fastapi_app.get("/metadata")
def metadata():
    return {"name": bento_model.tag.name, "version": bento_model.tag.version}
l
Maybe del pyfunc_model won't free the memory (at least not immediately). Could you try adding an explicit gc call:
import gc
del pyfunc_model
gc.collect()
and see what happens?
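If gc doesn't help (CPython's allocator does not always return freed memory to the OS, so RSS can stay high even after gc.collect()), another option might be to avoid loading the pyfunc model in the api_server at all and read only the MLflow model metadata. This is just a sketch under a couple of assumptions: that your descriptors can be built from the saved signature alone, and that the MLflow model sits in an mlflow_model subdirectory of the bento model (you'd need to check the actual layout):

import bentoml
from mlflow.models import Model

bento_model = bentoml.mlflow.get(bento_model_path)  # reference only, nothing loaded yet

# Read the MLmodel file (metadata and signature) without loading the model weights.
# "mlflow_model" is an assumption about where bentoml.mlflow stores the MLflow model.
mlflow_meta = Model.load(bento_model.path_of("mlflow_model"))

artifact_name = mlflow_meta.artifact_path
input_schema = mlflow_meta.get_input_schema()
output_schema = mlflow_meta.get_output_schema()

Your generate_input_and_output_descriptors would then need to work from the schemas instead of the loaded pyfunc model, so the api_server process never holds the weights.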
a
Will try
What about bentoml.mlflow.get(bento_model_path), which I then pass to the runner?
Is it copied in the background to a different process?
And if so, does the reference hold a copy?
l
bentoml.mlflow.get(bento_model_path) only returns a reference to a model inside the BentoML model store. to_runner turns this reference into a runner that is lazily loaded (including the models the runner uses) inside the runner process.
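To illustrate (a minimal sketch; my_model:latest is just a placeholder tag):

import bentoml

bento_model = bentoml.mlflow.get("my_model:latest")  # store entry only, no weights in memory
runner = bento_model.to_runner()                      # still nothing loaded here

# runner.init_local() would load the model into *this* process (intended for debugging only).
# Under `bentoml serve`, the load happens lazily inside the separate runner process instead,
# which is why the runner's memory shows up under a different PID in htop.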
a
ok
Looks like adding gc.collect() didn't help.
Any other suggestions?
p
I am having a similar issue