cool-lifeguard-49380
04/04/2023, 4:31 PM
freezing-airport-6809
cool-lifeguard-49380
04/05/2023, 7:27 AM
freezing-airport-6809
cool-lifeguard-49380
04/13/2023, 6:06 PM
torch.distributed.init_process_group(), see here. Back then I just created a Kubeflow PytorchJob to run it, which worked. The image needed nvidia-cuda-toolkit. To summarize: as of ~1.5 years ago I think it would already have been supported.
cool-lifeguard-49380
04/17/2023, 8:00 AM
nnodes=1 in a single pod, and with nnodes>1 with the pytorch operator.
I think we could try with alpaca now 🦙
The problems with rendezvous flakiness I mentioned in the call on Thursday were actually related to network config on my notebook (no ipv6 enabled).
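As an aside, the c10d warning quoted in this thread is just a failed getaddrinfo() lookup for a reverse-DNS (ip6.arpa) name. A minimal, hedged way to check whether a given rendezvous host resolves on the current network config (the helper name is mine, not anything from torch):

```python
import socket

def can_resolve(host: str) -> bool:
    """Return True if getaddrinfo() can resolve `host`.

    The "[c10d] ... (gai error: 8 - nodename nor servname provided,
    or not known)" warning is exactly this lookup failing.
    """
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

On a notebook without IPv6 enabled, the reverse lookup of the loopback ip6.arpa name fails in the same way, which matches the flakiness being local network config rather than the rendezvous logic itself.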
[W socket.cpp:601] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49651) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).
I have one question about the execute method I copied from PythonFunctionTask: We don’t need the else case here for dynamic even though the original docstring hints one should implement it as well, right?
freezing-airport-6809
cool-lifeguard-49380
04/20/2023, 9:17 AM
broad-monitor-993
04/20/2023, 12:54 PM
facebook/opt-125m, currently trying to get it to work with a pre-existing llama model on huggingface
broad-monitor-993
04/20/2023, 12:55 PM
freezing-airport-6809
broad-monitor-993
04/20/2023, 2:02 PM
cool-lifeguard-49380
04/20/2023, 4:17 PM
torchrun allows the user to set --nnodes, which could e.g. be 2 but could also be "1:2", meaning min 1, max 2. Currently this is what our new task_config=Elastic() exposes as well.
The kubeflow PytorchJob allows setting minReplicas, maxReplicas (which by default are both None), and replicas (see here). In theory you could say min 2, max 4, replicas 3 (without going into how much sense this makes).
If a user specifies 2:3, we currently set min to 2 and max and replicas to 3.
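That mapping can be sketched roughly like this (the function name and exact parsing are my illustration, not the plugin's actual code):

```python
def parse_nnodes(nnodes: str) -> tuple[int, int, int]:
    """Map a torchrun-style nnodes value ("2" or "1:2") onto the
    PytorchJob fields (minReplicas, maxReplicas, replicas).

    Mirrors the behavior described above: "2:3" becomes
    min=2, max=3, replicas=3; a plain "2" pins all three to 2.
    """
    if ":" in nnodes:
        low, high = (int(part) for part in nnodes.split(":"))
    else:
        low = high = int(nnodes)
    return low, high, high  # replicas pinned to the maximum


parse_nnodes("2:3")  # (2, 3, 3)
```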
To summarize: should we expose nnodes like torchrun, or min_replicas, max_replicas, and replicas like the pytorchjob, to the user?
cool-lifeguard-49380
04/23/2023, 12:11 PM
cool-lifeguard-49380
04/23/2023, 12:13 PM
freezing-airport-6809
cool-lifeguard-49380
05/03/2023, 6:45 AM
cool-lifeguard-49380
06/19/2023, 7:25 AM
shy-accountant-549
06/30/2023, 5:37 PM
RendezvousTimeoutError when launching ddp on eks. It happens when some workers have started running while others are still waiting for resources to become available. After investigating the logs and the pytorch code, we believe it is due to the join_timeout parameter, which defaults to 600s, as the RendezvousTimeoutError shows up exactly 600s after the pod starts running.
Not sure what the best workaround is, but adding something like rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))} to the LaunchConfig could probably solve it.
Please lmk if this is the right approach. Would love to contribute!
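The proposed override could look roughly like this. PET_RDZV_JOIN_TIMEOUT is the env var name from the message above; the helper name is mine, and the LaunchConfig wiring is shown only as a comment (a sketch of the idea, not the final patch):

```python
import os

def elastic_rdzv_configs() -> dict:
    """Build the rendezvous config dict, letting an env var override the
    600s default join_timeout (the value behind the RendezvousTimeoutError
    when some workers are still waiting for cluster resources)."""
    return {"join_timeout": int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))}

# The dict would then be passed through to the elastic launcher, e.g.:
#   torch.distributed.launcher.api.LaunchConfig(..., rdzv_configs=elastic_rdzv_configs())
```

Keeping 600 as the fallback preserves today's behavior for everyone who doesn't set the variable, so the change would be backwards compatible.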