# ask-for-help
Hi Jack.
Because of this, each time they use it, the service seems to have to restart and load all the runners, and it takes a long time before they can actually use the endpoint.
I don't think this is the default behavior of BentoML. It might be caused by a service.py that doesn't follow best practices.
Would you mind sharing your service.py? You can blur sensitive information before posting it.
Hi @Jiang, sure thing, it's possible I've made a mistake in the service.py. It's essentially this:
```python
# define IO descriptors
# ...
success_probability_input_descriptor = JSON(pydantic_model=SuccessProbabilityInputDescriptor)
success_probability_output_descriptor = JSON(pydantic_model=SuccessProbabilityOutputDescriptor)
# ...

# Load Bento Models from Google Cloud Storage; also convert them to Runners
bento_models, runners = load_bentos_from_gcs()

svc = bentoml.Service("probability-models-bento", runners=list(runners.values()))


def success_probability_pipeline(input_data):
    # do stuff with input_data
    ...


@svc.api(input=success_probability_input_descriptor,
         output=success_probability_output_descriptor,
         route="success/predict_proba")
def success_predict_proba(
        input_data: SuccessProbabilityInputDescriptor) -> SuccessProbabilityOutputDescriptor:
    validated_results = success_probability_pipeline(input_data)
    return validated_results

# other pipelines ...
```
> has to restart and load all the runners
Can you tell me more about the phenomenon behind this? And could you provide the source of `load_bentos_from_gcs`?
@Jiang Sure, this is `load_bentos_from_gcs`:
```python
def load_bentos_from_gcs(bento_framework_model_names, load_from_gcs=True):
    bento_models = {}
    runners = {}
    for model_tmp in bento_framework_model_names:
        model_name_tmp = model_tmp["model_name"]
        framework = model_tmp["framework"]
        if load_from_gcs:
            bentoml_logger.info(f"Importing {BUCKET}/{model_name_tmp}.bentomodel from GCS")
            try:
                bentoml.models.import_model(f"{BUCKET}/{model_name_tmp}.bentomodel")
            except (BentoMLException, ImportServiceError) as e:
                bentoml_logger.warning(f"The Bento model is already in the store. No harm done, though! "
                                       f"The {e.__class__.__name__} exception caught from Bento was:"
                                       f"\n\n\t'{e}'\n\nProceeding with the rest of the build.")

        bentoml_logger.info(f"Loading in the {model_name_tmp}:latest model using the {framework} framework.")
        if framework == "sklearn":
            bento_model = bentoml.sklearn.get(f"{model_name_tmp}:latest")
        elif framework == "picklable_model":
            bento_model = bentoml.picklable_model.get(f"{model_name_tmp}:latest")
        else:
            raise ValueError(f"Unsupported framework for {model_name_tmp}: {framework}")

        bentoml_logger.info("Success.")
        bentoml_logger.info("Converting the model to a runner.")
        runner = bento_model.to_runner()
        bento_models[model_name_tmp] = bento_model
        runners[model_name_tmp] = runner
    return bento_models, runners
```
With regard to the restarting and loading of all the runners: when I make a request after a fair amount of idle time, the endpoint hangs for a long time waiting for a response. Looking at my Cloud Run service logs, I can see the loading from GCS happening. Once all the models are loaded in on each of the 8 CPUs (roughly 5 minutes later), the request completes successfully. Thereafter, requests are fast as usual.
Sure, I understand now:
1. You are using Cloud Run's serverless mode, which automatically shuts down the server after a certain period of inactivity (about 15 minutes) to save costs.
2. Your Bento models are re-downloaded every time the server starts.
The combination of these two factors is causing the current unavailability issue.
There are two possible solutions to this issue:
1. Set Cloud Run to always-on CPU allocation: https://cloud.google.com/blog/products/serverless/cloud-run-gets-always-on-cpu-allocation
2. Keep the cost-saving behavior and tolerate a short request delay by modifying your Bento container: import the models during the image build using a setup.sh script, as described in https://docs.bentoml.org/en/latest/guides/containerization.html. This should reduce the cold-start wait to a few seconds (depending mostly on how long your models take to load from disk into memory).
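For the second option, here is a minimal sketch (an illustration, not the exact mechanism from the docs) of a Python helper that setup.sh could run during the image build, mirroring the `bentoml.models.import_model` call you already use in `load_bentos_from_gcs`. The bucket and model names below are placeholders:
```python
# import_models.py - hypothetical build-time helper invoked from setup.sh, so the
# .bentomodel files are already in the image's local model store when the
# container starts. BUCKET and MODEL_NAMES are placeholders; reuse the values
# from your service.
import bentoml
from bentoml.exceptions import BentoMLException

BUCKET = "gs://your-bucket"            # placeholder GCS bucket
MODEL_NAMES = ["success_probability"]  # placeholder model names

for name in MODEL_NAMES:
    try:
        bentoml.models.import_model(f"{BUCKET}/{name}.bentomodel")
    except BentoMLException:
        # Model already present in the local store; nothing to do.
        pass
```
setup.sh would then just invoke this script (e.g. `python import_models.py`), so the download happens once at build time instead of on every cold start.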
You could also consider our hosted BentoCloud, which supports both serverless and always-on deployments.
@Jiang, thank you very much for these options and for taking the time to explain! Yes, I think our immediate solution will be setting Cloud Run to always be on. We chose not to import the models during image build because we liked the idea of keeping the models separate from the service. Our reasoning was that if we retrained a model, we could simply upload the new model to GCS and restart the service; with the image-build approach, we would have to rebuild the image and redeploy. Maybe that doesn't actually save that much downtime, though. I will look into BentoCloud again. Is it an easy transition from Cloud Run to BentoCloud?
@Jiang, I was able to turn that option on using `gcloud beta run services update SERVICE-NAME --no-cpu-throttling`. This made a revision deployment. That's fine, but now when I try to deploy using bentoctl, it gives me the error:
```
Error 409: Revision named 'probability-models-bento-cloud-run-service-00023-pud' with different configuration already exists.
```
How can I either pull that configuration or update the bentoctl commands to include the `--no-cpu-throttling` flag?
It seems that we need a new revision name?
> Is it an easy transition from Cloud Run to BentoCloud?
Yeah, it's much easier, since all you need to use BentoCloud is your bento, and you already have one.
I am not the maintainer of bentoctl. My suggestion is to first delete the revision, modify the terraform file to include the `--no-cpu-throttling` setting, and then recreate it through terraform.