# torch-elastic
  • cool-lifeguard-49380 (04/04/2023, 4:31 PM)
    👍
  • freezing-airport-6809 (04/04/2023, 5:56 PM)
    Cc @boundless-pizza-95864
  • freezing-airport-6809 (04/05/2023, 12:38 AM)
    @cool-lifeguard-49380 do you think you can update the PR?
  • freezing-airport-6809 (04/05/2023, 12:39 AM)
    Or should I create a new PR? What I want is an easy way to train on a single node with multiple GPUs.
  • cool-lifeguard-49380 (04/05/2023, 7:27 AM)
    Got it, first single-node multi-GPU 👍 I can implement this on Friday or Saturday and then ping you for review. Or do you need it before that?
  • freezing-airport-6809 (04/10/2023, 4:00 AM)
    @cool-lifeguard-49380 also maybe we should try to integrate accelerate in the same way - Hopefully @abundant-hamburger-66584 can help 😄
  • freezing-airport-6809 (04/12/2023, 4:07 PM)
    Also maybe - https://www.deepspeed.ai/getting-started/
  • freezing-airport-6809 (04/12/2023, 4:07 PM)
    @cool-lifeguard-49380 have you used this ^
  • cool-lifeguard-49380 (04/13/2023, 6:06 PM)
    I haven’t used it much, but ~1.5 years ago I got an example to train with it on k8s (which they don’t explicitly mention as supported in the docs). Ultimately, under the hood it also just uses `torch.distributed.init_process_group()`, see here. Back then I just created a kubeflow PyTorchJob to run it, which worked. The image needed `nvidia-cuda-toolkit`. To summarize: as of ~1.5 years ago, I think it would already have been supported.
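    For context, `init_process_group()` with the default `env://` init method reads its rendezvous information from environment variables, which is exactly what a kubeflow PyTorchJob injects into each pod. A minimal sketch of the variables involved (the helper name and the example values are made up for illustration):

    ```python
    # Sketch of the environment a PyTorchJob-style launcher provides to each
    # pod. torch.distributed.init_process_group() with the default "env://"
    # init method reads exactly these variables; this helper only assembles
    # them. (build_dist_env is a hypothetical name, not part of any library.)

    def build_dist_env(master_addr: str, master_port: int,
                       rank: int, world_size: int) -> dict:
        """Return the env vars a worker needs for env:// rendezvous."""
        return {
            "MASTER_ADDR": master_addr,       # host of the rank-0 worker
            "MASTER_PORT": str(master_port),  # open TCP port on rank 0
            "RANK": str(rank),                # this worker's global rank
            "WORLD_SIZE": str(world_size),    # total number of workers
        }

    env = build_dist_env("trainer-master-0", 29500, rank=1, world_size=4)
    print(env["WORLD_SIZE"])  # → 4
    ```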
  • cool-lifeguard-49380 (04/17/2023, 8:00 AM)
    Still draft PRs because I will add more tests and docs:
    • https://github.com/flyteorg/flytekit/pull/1583
    • https://github.com/flyteorg/flyteplugins/pull/343
    • https://github.com/flyteorg/flyteidl/pull/394
    But the torch elastic task now works for me when executing locally, with `nnodes=1` in a single pod, and with `nnodes>1` with the pytorch operator. I think we could try with alpaca now 🦙 The problems with rendezvous flakiness I mentioned in the call on Thursday were actually related to network config on my notebook (no IPv6 enabled):
    `[W socket.cpp:601] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49651) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).`
    I have one question about the `execute` method I copied from `PythonFunctionTask`: we don’t need the else case here for dynamic, even though the original docstring hints one should implement it as well, right?
  • freezing-airport-6809 (04/20/2023, 4:11 AM)
    cc @broad-monitor-993 did you end up trying alpaca?
  • freezing-airport-6809 (04/20/2023, 4:12 AM)
    @cool-lifeguard-49380 what do you think should we merge the idl PR?
  • freezing-airport-6809 (04/20/2023, 4:12 AM)
    how can i help take them over the finish line?
  • cool-lifeguard-49380 (04/20/2023, 9:17 AM)
    I will add tests and documentation on the weekend, then I’ll request PR reviews. You could help by testing it: so far I have only run minimal working examples (e.g. this one) that don’t do much more than making sure the process group can be initialized.
  • broad-monitor-993 (04/20/2023, 12:54 PM)
    The code works; I successfully ran the workflow on `facebook/opt-125m` and am currently trying to get it to work on a pre-existing llama model on huggingface.
  • broad-monitor-993 (04/20/2023, 12:55 PM)
    Also still need to test it on multiple CPUs/GPUs.
  • freezing-airport-6809 (04/20/2023, 1:58 PM)
    @broad-monitor-993 I can help with multi core example
  • broad-monitor-993 (04/20/2023, 2:02 PM)
    cool, I just updated our fork/branch with my changes: https://github.com/unionai-oss/stanford_alpaca/tree/flytekit-alpaca
  • cool-lifeguard-49380 (04/20/2023, 4:17 PM)
    One other thing about which I’m interested in your opinion: `torchrun` allows the user to set `--nnodes`, which could e.g. be `2` but also `"1:2"`, which means min 1, max 2. Currently this is what our new `task_config=Elastic()` exposes as well. The kubeflow PytorchJob allows setting `minReplicas` and `maxReplicas` (which by default are both None), and `replicas` (see here). In theory you could say min 2, max 4, replicas 3 (without going into how much sense this makes). If a user specifies `2:3`, we currently set min to 2, and max and replicas to 3. To summarize: should we expose `nnodes` like torchrun, or `min_replicas`, `max_replicas`, and `replicas` like the PytorchJob to the user?
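    The mapping described above can be sketched as a small helper (the function name is hypothetical, not part of flytekit; the rule follows the message: a fixed value maps to min == max == replicas, and `"2:3"` maps to min 2 with max and replicas 3):

    ```python
    def nnodes_to_replicas(nnodes) -> tuple:
        """Translate a torchrun-style nnodes value (int, "2", or "min:max")
        into (min_replicas, max_replicas, replicas) for a PytorchJob.
        A range maps replicas to the max, as described in this thread."""
        if isinstance(nnodes, int):
            low = high = nnodes
        elif ":" in str(nnodes):
            low_s, high_s = str(nnodes).split(":")
            low, high = int(low_s), int(high_s)
        else:
            low = high = int(nnodes)
        return low, high, high

    print(nnodes_to_replicas("2:3"))  # → (2, 3, 3)
    print(nnodes_to_replicas(2))      # → (2, 2, 2)
    ```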
  • cool-lifeguard-49380 (04/23/2023, 12:11 PM)
    Ready for review from my side:
  • cool-lifeguard-49380 (04/23/2023, 12:12 PM)
    • https://github.com/flyteorg/flyte/issues/3614
    • https://github.com/flyteorg/flytekit/pull/1603
    • https://github.com/flyteorg/flyteidl/pull/394
    • https://github.com/flyteorg/flyteplugins/pull/343
    • https://github.com/flyteorg/flytesnacks/pull/987
  • cool-lifeguard-49380 (04/23/2023, 12:13 PM)
    What does the merge process typically look like when idl is changed? Tests in flytekit and flyteplugins fail since the idl changes are not there yet.
  • freezing-airport-6809 (04/23/2023, 6:13 PM)
    cc +@polite-ability-4005
  • freezing-airport-6809 (04/23/2023, 6:13 PM)
    @polite-ability-4005 we are enabling torch-elastic in flytekit now
  • freezing-airport-6809 (04/23/2023, 6:14 PM)
    @cool-lifeguard-49380 / @polite-ability-4005 seems these instructions are no longer valid - https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html#deployment-plugin-setup-k8s - as we have one training operator now. cc @glamorous-carpet-83516 / @great-school-54368
  • freezing-airport-6809 (04/24/2023, 4:36 AM)
    Also, @cool-lifeguard-49380, do you folks use https://github.com/libffcv/ffcv?
  • freezing-airport-6809 (04/24/2023, 4:36 AM)
    @broad-monitor-993 / @many-wire-75890 / @abundant-hamburger-66584
  • cool-lifeguard-49380 (05/03/2023, 6:45 AM)
    Thanks for finishing the PR and merging 🚀
  • cool-lifeguard-49380 (06/19/2023, 7:25 AM)
    https://github.com/flyteorg/flytekit/pull/1677 Need feedback on this fix, thx 🙂 Maybe @glamorous-carpet-83516 @broad-monitor-993 @high-accountant-32689? It’s not time-critical.
  • shy-accountant-549 (06/30/2023, 5:37 PM)
    We are getting `RendezvousTimeoutError` when launching DDP on EKS. It happens when some workers have started running while others are still waiting for resources to become available. After investigating the logs and pytorch code, we believe it is due to the `join_timeout` parameter, which defaults to 600s, as the `RendezvousTimeoutError` shows up exactly 600s after the pod starts running. Not sure what the best workaround is, but adding something like `rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))}` to the LaunchConfig could probably solve it. Please lmk if this is the right approach; would love to contribute.
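    A minimal sketch of the proposed override (the env var name `PET_RDZV_JOIN_TIMEOUT` and the helper name come from this thread and are not an existing API; whether LaunchConfig in a given flytekit version accepts `rdzv_configs` this way should be double-checked):

    ```python
    import os

    # Sketch: build the rdzv_configs dict with an env-var override for
    # join_timeout, falling back to torch elastic's default of 600s.
    # make_rdzv_configs is a hypothetical helper, not part of any library.
    def make_rdzv_configs() -> dict:
        return {"join_timeout": int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))}

    print(make_rdzv_configs())  # {'join_timeout': 600} unless the env var is set
    ```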