# torch-elastic
  • cool-lifeguard-49380 (04/04/2023, 4:31 PM)
    👍
  • freezing-airport-6809 (04/04/2023, 5:56 PM)
    Cc @boundless-pizza-95864
  • freezing-airport-6809 (04/05/2023, 12:38 AM)
    @cool-lifeguard-49380 do you think you can update the PR?
  • freezing-airport-6809 (04/05/2023, 12:39 AM)
    Or should I create a new PR? What I want is an easy way to train on a single node with multiple GPUs.
  • cool-lifeguard-49380 (04/05/2023, 7:27 AM)
    Got it, first single-node multi-GPU 👍 I can implement this on Friday or Saturday and then ping you for review. Or do you need it before that?
  • freezing-airport-6809 (04/10/2023, 4:00 AM)
    @cool-lifeguard-49380 also maybe we should try to integrate accelerate in the same way - Hopefully @abundant-hamburger-66584 can help 😄
  • freezing-airport-6809 (04/12/2023, 4:07 PM)
    Also maybe - https://www.deepspeed.ai/getting-started/
  • freezing-airport-6809 (04/12/2023, 4:07 PM)
    @cool-lifeguard-49380 have you used this ^
  • cool-lifeguard-49380 (04/13/2023, 6:06 PM)
    I haven’t used it much, but ~1.5 years ago I got an example to train with it on k8s (which they don’t explicitly mention as supported in the docs). Ultimately, under the hood it also just uses `torch.distributed.init_process_group()`, see here. Back then I just created a kubeflow PyTorchJob to run it, which worked. The image needed `nvidia-cuda-toolkit`. To summarize: as of ~1.5 years ago, I think it would already have been supported.
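    For context, `init_process_group()` with the default `env://` init method reads its rendezvous information from environment variables, which is exactly what a kubeflow PyTorchJob injects into each pod. A minimal sketch of the variables involved (the helper name and the example values are made up for illustration):

    ```python
    # Sketch of the environment a PyTorchJob-style launcher provides to each
    # pod. torch.distributed.init_process_group() with the default "env://"
    # init method reads exactly these variables; this helper only assembles
    # them. (build_dist_env is a hypothetical name, not part of any library.)

    def build_dist_env(master_addr: str, master_port: int,
                       rank: int, world_size: int) -> dict:
        """Return the env vars a worker needs for env:// rendezvous."""
        return {
            "MASTER_ADDR": master_addr,       # host of the rank-0 worker
            "MASTER_PORT": str(master_port),  # open TCP port on rank 0
            "RANK": str(rank),                # this worker's global rank
            "WORLD_SIZE": str(world_size),    # total number of workers
        }

    env = build_dist_env("trainer-master-0", 29500, rank=1, world_size=4)
    print(env["WORLD_SIZE"])  # → 4
    ```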
  • cool-lifeguard-49380 (04/17/2023, 8:00 AM)
    Still draft PRs because I will add more tests and docs:
    • https://github.com/flyteorg/flytekit/pull/1583
    • https://github.com/flyteorg/flyteplugins/pull/343
    • https://github.com/flyteorg/flyteidl/pull/394
    But the torch elastic task now works for me when executing locally, with `nnodes=1` in a single pod, and with `nnodes>1` with the pytorch operator. I think we could try with alpaca now 🦙 The problems with rendezvous flakiness I mentioned in the call on Thursday were actually related to network config on my notebook (no IPv6 enabled):
    `[W socket.cpp:601] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 49651) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).`
    I have one question about the `execute` method I copied from `PythonFunctionTask`: we don’t need the else case here for dynamic, even though the original docstring hints one should implement it as well, right?
  • freezing-airport-6809 (04/20/2023, 4:11 AM)
    cc @broad-monitor-993 did you end up trying alpaca?
  • freezing-airport-6809 (04/20/2023, 4:12 AM)
    @cool-lifeguard-49380 what do you think should we merge the idl PR?
  • freezing-airport-6809 (04/20/2023, 4:12 AM)
    how can i help take them over the finish line?
  • cool-lifeguard-49380 (04/20/2023, 9:17 AM)
    I will add tests and documentation on the weekend, then I’ll request PR reviews. You could help by testing it: so far I have only run minimal working examples (e.g. this one) that don’t do much more than making sure the process group can be initialized.
  • broad-monitor-993 (04/20/2023, 12:54 PM)
    The code works; I successfully ran the workflow on `facebook/opt-125m` and am currently trying to get it to work on a pre-existing llama model on huggingface.
  • broad-monitor-993 (04/20/2023, 12:55 PM)
    Also still need to test it on multiple CPUs/GPUs.
  • freezing-airport-6809 (04/20/2023, 1:58 PM)
    @broad-monitor-993 I can help with multi core example
  • broad-monitor-993 (04/20/2023, 2:02 PM)
    cool, I just updated our fork/branch with my changes: https://github.com/unionai-oss/stanford_alpaca/tree/flytekit-alpaca
  • cool-lifeguard-49380 (04/20/2023, 4:17 PM)
    One other thing about which I’m interested in your opinion: `torchrun` allows the user to set `--nnodes`, which could e.g. be `2` but also `"1:2"`, which means min 1, max 2. Currently this is what our new `task_config=Elastic()` exposes as well. The kubeflow PytorchJob allows setting `minReplicas` and `maxReplicas` (which by default are both None), and `replicas` (see here). In theory you could say min 2, max 4, replicas 3 (without going into how much sense this makes). If a user specifies `2:3`, we currently set min to 2, and max and replicas to 3. To summarize: should we expose `nnodes` like torchrun, or `min_replicas`, `max_replicas`, and `replicas` like the PytorchJob to the user?
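    The mapping described above can be sketched as a small helper (the function name is hypothetical, not part of flytekit; the rule follows the message: a fixed value maps to min == max == replicas, and `"2:3"` maps to min 2 with max and replicas 3):

    ```python
    def nnodes_to_replicas(nnodes) -> tuple:
        """Translate a torchrun-style nnodes value (int, "2", or "min:max")
        into (min_replicas, max_replicas, replicas) for a PytorchJob.
        A range maps replicas to the max, as described in this thread."""
        if isinstance(nnodes, int):
            low = high = nnodes
        elif ":" in str(nnodes):
            low_s, high_s = str(nnodes).split(":")
            low, high = int(low_s), int(high_s)
        else:
            low = high = int(nnodes)
        return low, high, high

    print(nnodes_to_replicas("2:3"))  # → (2, 3, 3)
    print(nnodes_to_replicas(2))      # → (2, 2, 2)
    ```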
  • cool-lifeguard-49380 (04/23/2023, 12:11 PM)
    Ready for review from my side:
  • cool-lifeguard-49380 (04/23/2023, 12:12 PM)
    • https://github.com/flyteorg/flyte/issues/3614
    • https://github.com/flyteorg/flytekit/pull/1603
    • https://github.com/flyteorg/flyteidl/pull/394
    • https://github.com/flyteorg/flyteplugins/pull/343
    • https://github.com/flyteorg/flytesnacks/pull/987
  • cool-lifeguard-49380 (04/23/2023, 12:13 PM)
    What does the merge process typically look like when idl is changed? Tests in flytekit and flyteplugins fail since the idl changes are not there yet.
  • freezing-airport-6809 (04/23/2023, 6:13 PM)
    cc +@polite-ability-4005
  • freezing-airport-6809 (04/23/2023, 6:13 PM)
    @polite-ability-4005 we are enabling torch-elastic in flytekit now
  • freezing-airport-6809 (04/23/2023, 6:14 PM)
    @cool-lifeguard-49380 / @polite-ability-4005 seems these instructions are no longer valid - https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html#deployment-plugin-setup-k8s - as we have one training operator now. cc @glamorous-carpet-83516 / @great-school-54368
  • freezing-airport-6809 (04/24/2023, 4:36 AM)
    Also, @cool-lifeguard-49380, do you folks use https://github.com/libffcv/ffcv?
  • freezing-airport-6809 (04/24/2023, 4:36 AM)
    @broad-monitor-993 / @many-wire-75890 / @abundant-hamburger-66584
  • cool-lifeguard-49380 (05/03/2023, 6:45 AM)
    Thanks for finishing the PR and merging 🚀
  • cool-lifeguard-49380 (06/19/2023, 7:25 AM)
    https://github.com/flyteorg/flytekit/pull/1677 Need feedback on this fix, thx 🙂 Maybe @glamorous-carpet-83516 @broad-monitor-993 @high-accountant-32689? It’s not time-critical.
  • shy-accountant-549 (06/30/2023, 5:37 PM)
    We are getting `RendezvousTimeoutError` when launching DDP on EKS. It happens when some workers have started running while others are still waiting for resources to become available. After investigating the logs and pytorch code, we believe it is due to the `join_timeout` parameter, which defaults to 600s, as the `RendezvousTimeoutError` shows up exactly 600s after the pod starts running. Not sure what the best workaround is, but adding something like `rdzv_configs={'join_timeout': int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))}` to the LaunchConfig could probably solve it. Please lmk if this is the right approach; would love to contribute.
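    A minimal sketch of the proposed override (the env var name `PET_RDZV_JOIN_TIMEOUT` and the helper name come from this thread and are not an existing API; whether LaunchConfig in a given flytekit version accepts `rdzv_configs` this way should be double-checked):

    ```python
    import os

    # Sketch: build the rdzv_configs dict with an env-var override for
    # join_timeout, falling back to torch elastic's default of 600s.
    # make_rdzv_configs is a hypothetical helper, not part of any library.
    def make_rdzv_configs() -> dict:
        return {"join_timeout": int(os.getenv("PET_RDZV_JOIN_TIMEOUT", "600"))}

    print(make_rdzv_configs())  # {'join_timeout': 600} unless the env var is set
    ```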