# general
g
Hi, with the new flotilla runner, should I expect OOM on the head node? I see `get_next_partition` is running there.
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 10.216.137.197, ID: 4b0c3bed34f46dc36c802eeaae07829efd7d23477c65d805d0f818a7) where the task (actor ID: 901ddb5502f9a3d9b16edc4dd6000000, name=flotilla-plan-runner:RemoteFlotillaRunner.__init__, pid=1964799, memory used=2.71GB) was running was 12.40GB / 13.04GB (0.950941), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 4dea7b56a15827e931b4354fd7f20a407c1a561dfc13db8331e950dc) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.216.137.197`. To see the logs of the worker, use `ray logs worker-4dea7b56a15827e931b4354fd7f20a407c1a561dfc13db8331e950dcout -ip 10.216.137.197. Top 10 memory users:
PID	MEM(GB)	COMMAND
43	3.70	/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/...
1964799	2.71	ray::RemoteFlotillaRunner
135	2.34	/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/dashbo...
239	0.32	/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name...
134	0.25	/home/ray/anaconda3/bin/python -m ray.util.client.server --address=10.216.137.197:6379 --host=0.0.0....
327	0.19	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/age...
244	0.14	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/log_...
1	0.12	/home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --head --no-monitor --memory=140000...
589951	0.12	ray::_StatsActor
329	0.08	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runt...
Refer to the documentation on how to address the out of memory issue: <https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html>. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
a bit weird, I don't think those add up to >12GB
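For reference, a minimal sketch of the knobs that OOM message points at, under the assumption of a locally started Ray instance; on an existing cluster the two env vars have to be set before `ray start` on each node, and the retry values here are illustrative only:

import os
import ray

# Memory-monitor settings are read by the raylet at startup, so setting them
# here only affects a Ray instance started by this process.
os.environ.setdefault("RAY_memory_usage_threshold", "0.9")     # kill earlier than the 0.95 default
os.environ.setdefault("RAY_memory_monitor_refresh_ms", "0")    # 0 disables worker killing entirely

ray.init()

# Opt an actor into retries if its worker gets OOM-killed.
@ray.remote(max_restarts=3, max_task_retries=3)
class Worker:  # hypothetical stand-in for the real workload
    def run(self, x):
        return x * 2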
k
@Colin Ho could you take a look?
g
I am going to restart the cluster and rerun the job to see what happens. The cluster is dedicated to this single job.
c
Do you have the `.explain` output? Also, head node OOM tends to happen for me when there are too many object refs / partitions, so reducing via `into_partitions` or `repartition` might help
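For concreteness, a minimal Daft sketch of that suggestion; the input path and the partition count of 64 are made-up placeholders, adjust them to the real job:

import daft

# Hypothetical input, standing in for the real job's source.
df = daft.read_parquet("s3://my-bucket/data/*.parquet")

df.explain(show_all=True)   # inspect the plan and how many partitions it produces

# Coalesce into fewer partitions so the scheduler tracks fewer object refs,
# or shuffle to an explicit target count instead.
df = df.into_partitions(64)
# df = df.repartition(64)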
g
So, what I am seeing on a long-running Ray cluster is that head node memory slowly increases over time as I run more Daft jobs on it, and it never seems to be deallocated.
I don't know if this is a Daft or Ray issue. Here is what the head node looks like with nothing running on it:
c
are there any actors up?
g
nope, no actors up that I could see
I just kicked off fresh jobs and the head node is now up to 9.75GB
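One way to sanity-check the "no actors up" observation, sketched with Ray's state API (Ray 2.x); `address="auto"` assumes the script runs on a node of the cluster:

import ray
from ray.util.state import list_actors

ray.init(address="auto")  # attach to the existing cluster

# Any actors still ALIVE after the jobs finish would be candidates for
# holding on to head-node memory; an empty list matches "no actors up".
for actor in list_actors(filters=[("state", "=", "ALIVE")]):
    print(actor)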
c
In the log you sent, I see:
Top 10 memory users:
PID	MEM(GB)	COMMAND
43	3.70	/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/...
1964799	2.71	ray::RemoteFlotillaRunner
135	2.34	/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/dashbo...
Looks like the Ray GCS server and dashboard are hogging the memory, in addition to the flotilla runner
g
I am going to try to use `RAY_task_events_max_num_task_in_gcs=10000`; the default is 100,000
👍 1
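Worth noting that this setting is read by the GCS at startup, so it has to be in the head-node process environment rather than the driver. A sketch of what that could look like when starting the head node manually (on a managed cluster it would go in the head node's env config instead; any extra `ray start` flags are omitted here):

import os
import subprocess

# Launch the head node with a smaller task-event buffer in the GCS.
env = dict(os.environ, RAY_task_events_max_num_task_in_gcs="10000")  # default is 100000
subprocess.run(["ray", "start", "--head"], env=env, check=True)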