# ask-for-help
j
Hi. AFAIK `torch.cuda.empty_cache()` will not release the memory occupied by tensors, and model weights are tensors, too.
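For example, a rough sketch (assuming a CUDA device is available; the exact numbers depend on the setup):
```python
import torch

x = torch.randn(4096, 4096, device="cuda")  # a live tensor (model weights behave the same way)
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())         # still ~64 MB: x is referenced, empty_cache() can't touch it

del x                                        # drop the last reference first
print(torch.cuda.memory_reserved())          # the block is now cached by the allocator, not yet returned
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved())          # only now is the cached block released back to the driver
```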
y
I have been using `torch.cuda.empty_cache()` to release memory used by my models before. Do you have another suggestion for releasing the memory?
j
I think it's a tradeoff, right? Is your application latency-critical?
y
It is latency-critical
j
Then I think we'd best not call `empty_cache` manually. Just rely on PyTorch itself to do that.
If the memory consumption is higher than expected, then we just need to fix that bug rather than touching this.
y
What are you suggesting then?
j
To figure out the real reason for the unexpected memory usage.
y
How would you suggest doing that?
c
In case anyone else is looking for this, @Yakir Saadia and I discussed it in DM. Basically, you need to call `empty_cache` from the Runner code, not from the API server code. In this case, you will need to create a custom runner that wraps the Python runner. BentoML schedules runners & models in their own processes, so calling `empty_cache` from the main API server wouldn't apply to those other Python processes.
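Something along these lines (a rough sketch using the BentoML 1.x `Runnable` API; `MyModelRunnable`, the tag `my_model:latest`, and the `predict` method are placeholder names to adapt to your service):
```python
import bentoml
import torch

class MyModelRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = False

    def __init__(self):
        # Runs inside the Runner process, so the CUDA context lives here.
        self.model = bentoml.pytorch.load_model("my_model:latest").cuda().eval()

    @bentoml.Runnable.method(batchable=False)
    def predict(self, x):
        with torch.no_grad():
            out = self.model(x.cuda()).cpu()
        # Called from the Runner process, where the GPU memory is actually allocated.
        torch.cuda.empty_cache()
        return out

my_model_runner = bentoml.Runner(MyModelRunnable, name="my_model_runner")
svc = bentoml.Service("my_service", runners=[my_model_runner])
```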
y
@Chaoyu It doesn't even need a custom runner, necessarily; the BentoML v1 `signatures` kwarg is a very useful option. I have added a function to my already-initialized model class that wraps the inference and calls `empty_cache`. This way it runs in the Runner process and not the API server process.
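Roughly what that looks like (a sketch; `ModelWrapper`, `infer_and_clear`, and the tag `my_model` are hypothetical names, and the `signatures` dict tells BentoML which model methods to expose on the runner):
```python
import bentoml
import torch

class ModelWrapper(torch.nn.Module):
    # Hypothetical wrapper around an already-initialized model.
    def __init__(self, model):
        super().__init__()
        self.model = model

    def infer_and_clear(self, x):
        with torch.no_grad():
            out = self.model(x.cuda()).cpu()
        # Runs inside the Runner process, so the right CUDA cache gets cleared.
        torch.cuda.empty_cache()
        return out

my_trained_model = torch.nn.Linear(10, 2)  # stand-in for your real, already-initialized model

# Save time: register the wrapper method via the `signatures` kwarg.
bentoml.pytorch.save_model(
    "my_model",
    ModelWrapper(my_trained_model),
    signatures={"infer_and_clear": {"batchable": False}},
)

# Service code: the runner now exposes `infer_and_clear`,
# e.g. `await runner.infer_and_clear.async_run(batch)`.
runner = bentoml.pytorch.get("my_model:latest").to_runner()
```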
👍 1