# general
g
Hi, with the new flotilla runner, should I expect OOM on the head node? I see `get_next_partition` is running there.
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 10.216.137.197, ID: 4b0c3bed34f46dc36c802eeaae07829efd7d23477c65d805d0f818a7) where the task (actor ID: 901ddb5502f9a3d9b16edc4dd6000000, name=flotilla-plan-runner:RemoteFlotillaRunner.__init__, pid=1964799, memory used=2.71GB) was running was 12.40GB / 13.04GB (0.950941), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 4dea7b56a15827e931b4354fd7f20a407c1a561dfc13db8331e950dc) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.216.137.197`. To see the logs of the worker, use `ray logs worker-4dea7b56a15827e931b4354fd7f20a407c1a561dfc13db8331e950dcout -ip 10.216.137.197. Top 10 memory users:
PID	MEM(GB)	COMMAND
43	3.70	/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/...
1964799	2.71	ray::RemoteFlotillaRunner
135	2.34	/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/dashbo...
239	0.32	/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name...
134	0.25	/home/ray/anaconda3/bin/python -m ray.util.client.server --address=10.216.137.197:6379 --host=0.0.0....
327	0.19	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/age...
244	0.14	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/log_...
1	0.12	/home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --head --no-monitor --memory=140000...
589951	0.12	ray::_StatsActor
329	0.08	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/runt...
Refer to the documentation on how to address the out of memory issue: <https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html>. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
a bit weird, I don't think those add up to >12GB
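For reference, a minimal sketch of the knobs that OOM message points at, under the assumption of a locally started Ray instance; on an existing cluster the two env vars have to be set before `ray start` on each node, and the retry values here are illustrative only:

import os
import ray

# Memory-monitor settings are read by the raylet at startup, so setting them
# here only affects a Ray instance started by this process.
os.environ.setdefault("RAY_memory_usage_threshold", "0.9")     # kill earlier than the 0.95 default
os.environ.setdefault("RAY_memory_monitor_refresh_ms", "0")    # 0 disables worker killing entirely

ray.init()

# Opt an actor into retries if its worker gets OOM-killed.
@ray.remote(max_restarts=3, max_task_retries=3)
class Worker:  # hypothetical stand-in for the real workload
    def run(self, x):
        return x * 2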
k
@Colin Ho could you take a look?
g
I am going to restart the cluster and rerun the job to see what happens. The cluster is dedicated to this single job.
c
Do you have the `.explain` output? Also, head node OOM tends to happen for me when there are too many object refs / partitions, so reducing via `into_partitions` or `repartition` might help
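For concreteness, a minimal Daft sketch of that suggestion; the input path and the partition count of 64 are made-up placeholders, adjust them to the real job:

import daft

# Hypothetical input, standing in for the real job's source.
df = daft.read_parquet("s3://my-bucket/data/*.parquet")

df.explain(show_all=True)   # inspect the plan and how many partitions it produces

# Coalesce into fewer partitions so the scheduler tracks fewer object refs,
# or shuffle to an explicit target count instead.
df = df.into_partitions(64)
# df = df.repartition(64)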
g
So, what I am seeing on a long-running Ray cluster is that head node memory slowly increases over time as I run more Daft jobs on it, and it never seems to be deallocated.
I don't know if this is a Daft or Ray issue. Here is what the head node looks like with nothing running on it:
c
are there any actors up?
g
nope, no actors up that I could see
I just kicked off fresh jobs and the head node is now up to 9.75GB
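One way to sanity-check the "no actors up" observation, sketched with Ray's state API (Ray 2.x); `address="auto"` assumes the script runs on a node of the cluster:

import ray
from ray.util.state import list_actors

ray.init(address="auto")  # attach to the existing cluster

# Any actors still ALIVE after the jobs finish would be candidates for
# holding on to head-node memory; an empty list matches "no actors up".
for actor in list_actors(filters=[("state", "=", "ALIVE")]):
    print(actor)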
c
In the log you sent, I see:
Top 10 memory users:
PID	MEM(GB)	COMMAND
43	3.70	/home/ray/anaconda3/lib/python3.12/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/...
1964799	2.71	ray::RemoteFlotillaRunner
135	2.34	/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.12/site-packages/ray/dashboard/dashbo...
Looks like the Ray GCS server and dashboard are hogging the memory, in addition to the flotilla runner
g
I am going to try to use `RAY_task_events_max_num_task_in_gcs=10000`; the default is 100,000
👍 1
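Worth noting that this setting is read by the GCS at startup, so it has to be in the head-node process environment rather than the driver. A sketch of what that could look like when starting the head node manually (on a managed cluster it would go in the head node's env config instead; any extra `ray start` flags are omitted here):

import os
import subprocess

# Launch the head node with a smaller task-event buffer in the GCS.
env = dict(os.environ, RAY_task_events_max_num_task_in_gcs="10000")  # default is 100000
subprocess.run(["ray", "start", "--head"], env=env, check=True)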