Garrett Weaver
09/04/2025, 8:41 PM
`get_next_partition` is running there.

Garrett Weaver
09/04/2025, 8:49 PM
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 10.216.137.197, ID: 4b0c3bed34f46dc36c802eeaae07829efd7d23477c65d805d0f818a7) where the task (actor ID: 901ddb5502f9a3d9b16edc4dd6000000, name=flotilla-plan-runner:RemoteFlotillaRunner.__init__, pid=1964799, memory used=2.71GB) was running was 12.40GB / 13.04GB (0.950941), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 4dea7b56a15827e931b4354fd7f20a407c1a561dfc13db8331e950dc) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.216.137.197`. To see the logs of the worker, use `ray logs worker-4dea7b56a15827e931b4354fd7f20a407c1a561dfc13db8331e950dc*out -ip 10.216.137.197`. Top 10 memory users:
PID MEM(GB) COMMAND
43 3.70 /home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/...
1964799 2.71 ray::RemoteFlotillaRunner
135 2.34 /home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/dashbo...
239 0.32 /home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name...
134 0.25 /home/ray/anaconda3/bin/python -m ray.util.client.server --address=10.216.137.197:6379 --host=0.0.0....
327 0.19 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/age...
244 0.14 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/log_...
1 0.12 /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --head --no-monitor --memory=140000...
589951 0.12 ray::_StatsActor
329 0.08 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runt...
Refer to the documentation on how to address the out of memory issue: <https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html>. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
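
A minimal sketch of the threshold tuning the error message suggests, assuming the variables are set in the environment that starts the Ray node processes; the values shown are illustrative, not recommendations:

```
import os
import ray

# Illustrative values only. These variables are read by the Ray node processes,
# so on a real cluster they belong in the environment of `ray start --head`,
# not in a driver that merely connects to an existing cluster.
os.environ["RAY_memory_usage_threshold"] = "0.9"   # raise/lower the 0.95 kill threshold
os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # 0 disables the OOM killer entirely

ray.init()  # only takes effect when this process launches the local raylet
```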

Garrett Weaver
09/04/2025, 8:50 PM

Kevin Wang
09/04/2025, 8:56 PM

Garrett Weaver
09/04/2025, 8:57 PM

Colin Ho
09/04/2025, 9:32 PM
`.explain`?
Also, head node OOM tends to happen for me when there's too many object refs / partitions. So perhaps reducing via `into_partitions` or `repartition` might help.
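
A rough sketch of that suggestion in Daft, assuming a DataFrame `df` stands in for the real pipeline and using current Daft method names; the read path and partition count are placeholders:

```
import daft

# Placeholder read; stands in for whatever builds the actual pipeline.
df = daft.read_parquet("s3://bucket/path/*.parquet")

# Inspect the plan (including partition counts) before running anything.
df.explain(show_all=True)

# Coalesce down to fewer partitions so the head node tracks fewer object refs...
df = df.into_partitions(64)

# ...or do a full shuffle to a target partition count instead.
# df = df.repartition(64)
```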

Garrett Weaver
09/05/2025, 8:48 PM

Garrett Weaver
09/05/2025, 8:50 PM

Colin Ho
09/05/2025, 9:13 PM

Garrett Weaver
09/05/2025, 9:16 PM

Garrett Weaver
09/05/2025, 9:17 PM

Colin Ho
09/05/2025, 10:00 PM
Top 10 memory users:
PID MEM(GB) COMMAND
43 3.70 /home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/...
1964799 2.71 ray::RemoteFlotillaRunner
135 2.34 /home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/dashbo...
looks like ray gcs server and dashboard are hogging the memory, in addition to the flotilla runner

Garrett Weaver
09/05/2025, 10:11 PM

Garrett Weaver
09/05/2025, 10:17 PM
`RAY_task_events_max_num_task_in_gcs=10000`, default is 100,000
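
A sketch of where that override could go, under the same assumption as above that it is set before the Ray head node (and its GCS) starts:

```
import os
import ray

# Cap how many task events the GCS retains (the default is 100,000).
# Assumption: this must be in the environment that launches the head node /
# GCS process; exporting it after the cluster is already up has no effect.
os.environ["RAY_task_events_max_num_task_in_gcs"] = "10000"

ray.init()  # applies only when this process starts a local head node
```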