# ask-for-help
j
Hi. AFAIK `torch.cuda.empty_cache()` will not release the memory occupied by tensors, and model weights are tensors, too.
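For example, a rough sketch (assuming a CUDA device is available; the exact numbers depend on the setup):
```python
import torch

x = torch.randn(4096, 4096, device="cuda")  # a live tensor (model weights behave the same way)
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())         # still ~64 MB: x is referenced, empty_cache() can't touch it

del x                                        # drop the last reference first
print(torch.cuda.memory_reserved())          # the block is now cached by the allocator, not yet returned
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved())          # only now is the cached block released back to the driver
```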
y
I have been using `torch.cuda.empty_cache()` to release memory used by my models before. Do you have another suggestion for releasing the memory?
j
I think it's a tradeoff, right? Is your application latency-critical?
y
It is latency-critical
j
Then I think we'd best not call `empty_cache` manually. Just rely on PyTorch itself to do that.
If the memory consumption is higher than expected, then we just need to fix that bug rather than touching this.
y
What are you suggesting then?
j
To figure out the real reason for the unexpected memory usage.
y
How would you suggest doing that?
c
In case anyone else is looking for this, @Yakir Saadia and I discussed it in DM. Basically, you need to call `empty_cache` from the Runner code, not from the API server code. In this case, you will need to create a custom runner that wraps the Python runner. BentoML schedules runners & models in their own processes, so calling `empty_cache` from the main API server wouldn't apply to those other Python processes.
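Something along these lines (a rough sketch using the BentoML 1.x `Runnable` API; `MyModelRunnable`, the tag `my_model:latest`, and the `predict` method are placeholder names to adapt to your service):
```python
import bentoml
import torch

class MyModelRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False

    def __init__(self):
        # Runs inside the Runner process, so the CUDA context lives here.
        self.model = bentoml.pytorch.load_model("my_model:latest").cuda().eval()

    @bentoml.Runnable.method(batchable=False)
    def predict(self, x):
        with torch.no_grad():
            out = self.model(x.cuda()).cpu()
        # Called from the Runner process, where the GPU memory is actually allocated.
        torch.cuda.empty_cache()
        return out

my_model_runner = bentoml.Runner(MyModelRunnable, name="my_model_runner")
svc = bentoml.Service("my_service", runners=[my_model_runner])
```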
y
@Chaoyu It doesn't even need a custom runner, necessarily; the BentoML v1 `signatures` kwarg is a very useful option. I have added a function to my already-initialized model class that wraps the inference and calls `empty_cache`. This way it runs in the Runner process and not the API server process.
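Roughly what that looks like (a sketch; `ModelWrapper`, `infer_and_clear`, and the tag `my_model` are hypothetical names, and the `signatures` dict tells BentoML which model methods to expose on the runner):
```python
import bentoml
import torch

class ModelWrapper(torch.nn.Module):
    # Hypothetical wrapper around an already-initialized model.
    def __init__(self, model):
        super().__init__()
        self.model = model

    def infer_and_clear(self, x):
        with torch.no_grad():
            out = self.model(x.cuda()).cpu()
        # Runs inside the Runner process, so the right CUDA cache gets cleared.
        torch.cuda.empty_cache()
        return out

my_trained_model = torch.nn.Linear(10, 2)  # stand-in for your real, already-initialized model

# Save time: register the wrapper method via the `signatures` kwarg.
bentoml.pytorch.save_model(
    "my_model",
    ModelWrapper(my_trained_model),
    signatures={"infer_and_clear": {"batchable": False}},
)

# Service code: the runner now exposes `infer_and_clear`,
# e.g. `await runner.infer_and_clear.async_run(batch)`.
runner = bentoml.pytorch.get("my_model:latest").to_runner()
```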
👍 1