# slurm-flyte-wg
  • c

    creamy-shampoo-53278

    12/06/2024, 4:48 PM
    Hello everyone, I succeeded in setting up a Slurm cluster with this guide, https://github.com/SergioMEV/slurm-for-dummies?tab=readme-ov-file, and ran a naive `sbatch` example. Han-Ru and I think this can be a good starting point!
  • f

    freezing-airport-6809

    12/08/2024, 10:14 PM
    @creamy-shampoo-53278 / @damp-lion-88352 and team, as an eventual goal I would love for nanotron (this script) to be fully converted to a Flyte + Slurm runner. This would be a fantastic achievement: we would be able to train a full LLM from scratch using one Flyte workflow (a small one, that's OK).
  • f

    freezing-airport-6809

    12/12/2024, 3:37 AM
    @damp-lion-88352 @creamy-shampoo-53278 should we use this instead? https://intertwin-eu.github.io/interLink/docs/intro
  • f

    freezing-airport-6809

    12/12/2024, 3:40 PM
    Another one: read “Pretrain Your Own AI Models with Fast-LLM” by Jesus Rodriguez on Medium: https://jrodthoughts.medium.com/pretrain-your-own-ai-models-with-fast-llm-df70846cc37b
  • c

    creamy-shampoo-53278

    12/16/2024, 3:43 PM
    Hello everyone, we're excited to share that a very naive Slurm agent with synchronous `create` and `get` methods works fine locally. At this early stage, we set up a single-host (Ubuntu 20.04) machine for our local development and testing. Some takeaways are:
    1. Run the controller daemons (`slurmctld`, `slurmdbd`), one compute node daemon (`slurmd`), and the REST API daemon (`slurmrestd`) on the same machine.
       a. The Slurm agent can interact with `slurmrestd` through the host base URL `http://localhost:6820`.
       b. Authentication is done by JWT.
    2. Test the Slurm agent locally based on this guide to mimic FlytePropeller’s behavior.
    We'll keep pushing forward to make this feature a reality!
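A minimal sketch of how a client might address `slurmrestd` with JWT authentication, assuming the token was obtained separately (for example with `scontrol token`). The API version string `v0.0.40`, the `ping` endpoint choice, the user name, and the helper name `build_slurmrestd_request` are illustrative assumptions, not details from the agent itself.

```python
def build_slurmrestd_request(base_url, user, token, api_version="v0.0.40"):
    """Return (url, headers) for a slurmrestd ping call authenticated with a JWT.

    The api_version default is an assumption; use whatever your slurmrestd serves.
    """
    url = f"{base_url.rstrip('/')}/slurm/{api_version}/ping"
    headers = {
        # slurmrestd's JWT auth reads the user and token from these two headers.
        "X-SLURM-USER-NAME": user,
        "X-SLURM-USER-TOKEN": token,
    }
    return url, headers


url, headers = build_slurmrestd_request("http://localhost:6820", "ubuntu", "<jwt-token>")
print(url)
```

The returned pair can be handed to any HTTP client, for instance `urllib.request.Request(url, headers=headers)`; nothing is sent over the network by the helper itself.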
  • c

    creamy-shampoo-53278

    12/18/2024, 4:24 PM
    Hi all, the Slurm agent's `create` and `get` methods (naive version) are now implemented with `asyncssh`, but we still use `PythonTask` so far. We'll support `delete` and switch to `ShellTask` soon. If there's any mistake, please let me know. Thanks a lot!
  • c

    creamy-shampoo-53278

    12/30/2024, 2:35 PM
    Hello everyone, we're happy to share that the Slurm agent v1 (with `PythonFunctionTask`) has been implemented. It supports the following three core methods:
    1. `create`: Use `srun` to run a Slurm job which executes the Flyte entrypoints, `pyflyte-fast-execute` and `pyflyte-execute`
    2. `get`: Use `scontrol show job <job_id>` to monitor the Slurm job state
    3. `delete`: Use `scancel <job_id>` to cancel the Slurm job (this method is still under test)
    We set up an environment to test it locally without running the agent gRPC server. The setup is divided into three components: a client (localhost), a remote tiny Slurm cluster, and an Amazon S3 bucket that facilitates communication between the two. The attached figure below illustrates the interaction between the client and the remote Slurm cluster.
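For the `get` step, a rough sketch of pulling the job state out of `scontrol show job <job_id>` output might look like the following. The sample output is abbreviated and illustrative, and both the helper names and the state-to-phase mapping are assumptions for this example, not the agent's actual code.

```python
import re

# Assumed mapping from Slurm job states to coarse phases (illustrative only).
SLURM_STATE_TO_PHASE = {
    "PENDING": "queued",
    "RUNNING": "running",
    "COMPLETED": "succeeded",
    "FAILED": "failed",
    "CANCELLED": "failed",
    "TIMEOUT": "failed",
}


def parse_job_state(scontrol_output: str) -> str:
    """Extract the value of the JobState=... field from scontrol output."""
    match = re.search(r"\bJobState=(\w+)", scontrol_output)
    if match is None:
        raise ValueError("JobState not found in scontrol output")
    return match.group(1)


def to_phase(job_state: str) -> str:
    """Translate a Slurm state into a coarse phase, defaulting to 'unknown'."""
    return SLURM_STATE_TO_PHASE.get(job_state, "unknown")


# Abbreviated, hand-written sample of what scontrol prints for a job.
sample = "JobId=42 JobName=tiny-slurm\n   JobState=RUNNING Reason=None Dependency=(null)\n"
state = parse_job_state(sample)
print(state, to_phase(state))
```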
  • c

    creamy-shampoo-53278

    01/04/2025, 2:41 PM
    Happy new year! We did some research on SSH connections through `asyncssh` and have a question to discuss with you! As far as we know, establishing a connection with client keys is preferred over passwords for security, convenience, and scalability. Before using key pairs for the connection, we must first complete the following steps:
    1. Generate an SSH key pair (take RSA as an example): `ssh-keygen -t rsa -b 4096`
    2. Place the public key on the remote server (the Slurm cluster in this case): `ssh-copy-id <username>@<hostname>`
  • c

    creamy-shampoo-53278

    01/07/2025, 4:18 PM
    Hi everyone, we successfully implemented the Slurm agent with `ShellTask`, but there's still room for improvement! Here is how it works:
    • To mimic NFS, we put a simple Python script on the Slurm cluster in advance
    • To run the user-defined batch script on the Slurm cluster, we write the script content to a tmp file and transfer it to the cluster through SFTP
    • Within the `create` method, we construct an `sbatch` command based on the user-defined `sbatch` options and run this command on the cluster
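The last bullet, turning user-defined `sbatch` options into a command line, could be sketched roughly as below. The helper name `build_sbatch_command` and the script path are hypothetical; the option names are taken from the examples in this channel.

```python
import shlex


def build_sbatch_command(sbatch_conf: dict, script_path: str) -> str:
    """Build an sbatch command line from long-form options and a script path."""
    parts = ["sbatch"]
    for option, value in sbatch_conf.items():
        # Quote values so spaces or shell metacharacters survive the remote shell.
        parts.append(f"--{option}={shlex.quote(str(value))}")
    parts.append(shlex.quote(script_path))
    return " ".join(parts)


cmd = build_sbatch_command(
    {"partition": "debug", "job-name": "tiny-slurm"},
    "/tmp/task.slurm",  # hypothetical location of the uploaded batch script
)
print(cmd)
```

Quoting each value matters here because the resulting string is executed by a shell on the cluster side rather than passed as an argument list.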
  • r

    red-farmer-96033

    01/14/2025, 3:35 PM
    @red-farmer-96033 has left the channel
  • d

    damp-lion-88352

    01/15/2025, 3:10 PM
    Hi team, please take a look at this comment. This video demo showcases two task types on Slurm, running on an AWS EC2 instance:
    1. Slurm Python Task – Executes a shell script on the Slurm host.
    2. Slurm Shell Task – Executes a shell script provided by my computer.
    Let me know how @creamy-shampoo-53278 and I can help! PR Link cc @eager-processor-63090
  • e

    eager-processor-63090

    01/15/2025, 9:05 PM
    Going to start opening threads for any bugs I encounter to keep the PR clean. I'll react with a ✅ when I'm able to resolve!
  • e

    eager-processor-63090

    01/15/2025, 9:18 PM
    RuntimeError: Error encountered while executing 'job2':
      Task <Task pending name='Task-57' coro=<AsyncAgentExecutorMixin._create() running at /Users/pryceturner/Desktop/slurm-flyte/flytekit/flytekit/extend/backend/base_agent.py:376> cb=[_run_until_complete_cb() at
    /Users/pryceturner/.pyenv/versions/3.12.8/lib/python3.12/asyncio/base_events.py:181]> got Future <Future pending> attached to a different loop
  • c

    creamy-shampoo-53278

    01/17/2025, 3:03 PM
    Hi Pryce, regarding your comments in the PR, I have some questions:
    1. By "consistent persistence layer", do you mean that we should implement an independent object store to hold the inputs and outputs of heterogeneous tasks (i.e., Slurm tasks and other task types) composed in a single workflow?
    2. As for offloading GPU-accelerated tasks, could I assume that some tasks run on CPU (e.g., data preprocessing) and some tasks, like LLM fine-tuning, should be offloaded to another cloud service with GPU nodes? If so, we again need a consistent persistence layer for input/output access.
    Sorry for the dumb questions. I just want to confirm that I fully understand what you said, and I wonder what the top priority for the next step will be. Thanks!
  • f

    freezing-airport-6809

    01/28/2025, 6:00 AM
    @creamy-shampoo-53278 how can we start testing this, is there a PR?
  • d

    damp-lion-88352

    01/29/2025, 5:00 PM
    These are some notes from a discussion with @faint-gpu-25916 and @freezing-airport-6809 yesterday. Let's chat in this thread or hop on a call. cc @creamy-shampoo-53278 https://hackmd.io/@Future-Outlier/r18R22w_Jx
  • d

    damp-lion-88352

    02/05/2025, 3:55 PM
    Hi @freezing-airport-6809, you mentioned that you want the Slurm Python function task to look like this:
    @task(
        task_config=SlurmFunction(
            slurm_host="aws",
            srun_conf={
                "partition": "debug",
                "job-name": "tiny-slurm",
            },
            script="""
    #!/bin/bash
    # Pre-execute
    echo "Hello, world!"
    export MY_ENV_VAR=123
    
    # Run the python function here
    {flyte.fn}
    
    # Post-execute
    exit -1
    """
        )
    )
    def plus_one(x: int) -> int: 
        print(os.getenv("MY_ENV_VAR"))
        return x + 1
    So the execution order will be (1) the script's pre-execute part, then (2) the function. In the agent's operation framework (`create`, `get`, and `delete`), should the pre-execute part go in the `create` operation, or should every step go in the `get` operation? Each agent operation has a timeout mechanism, so we need to figure out a way to implement this. TL;DR: the agent framework is designed for one execution, but your proposal involves more than one execution; that is still possible to implement, but ugly.
  • d

    damp-lion-88352

    02/06/2025, 3:44 PM
    Here is a demo with slurm's python function task.
    @task(
        task_config=SlurmFunction(
            slurm_host="ec2-18-207-193-50.compute-1.amazonaws.com",
            srun_conf={
                "partition": "debug",
                "job-name": "fn-task",
                "output": "/home/ubuntu/fn_task.log"
            },
            script="""
    #!/bin/bash
    
    # == Pre-Execution ==
    echo "Hello, world!"
    
    # Setup env vars
    export MY_ENV_VAR=456
    
    # Activate virtual env
    . /home/ubuntu/.cache/pypoetry/virtualenvs/demo-poetry-RLi6T71_-py3.12/bin/activate
    
    # == Execute Flyte Task Function ==
    {task.fn}
    
    # == Post-Execution ==
    echo "Success!!"
    """
        )
    )
    def plus_one(x: int) -> int:
        print(os.getenv("MY_ENV_VAR"))
        return int(os.getenv("MY_ENV_VAR")) + 1
    PR link: https://github.com/flyteorg/flytekit/pull/3005 cc @glamorous-carpet-83516 @freezing-airport-6809 @eager-processor-63090 Can we tag some potential users to take a look at this?
    2025-02-06 23-20-21.mp4
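One way to picture the `{task.fn}` placeholder in the demo script is as plain template substitution: the pre-execution lines, the Flyte entrypoint command, and the post-execution lines end up in a single script, so only one Slurm submission is needed. The sketch below assumes that reading; `render_script` and the `pyflyte-execute <args>` stand-in command are illustrative, and the plugin may well implement this differently.

```python
PLACEHOLDER = "{task.fn}"


def render_script(template: str, entrypoint_cmd: str) -> str:
    """Replace the {task.fn} placeholder with the command that runs the task."""
    if PLACEHOLDER not in template:
        raise ValueError(f"script must contain {PLACEHOLDER}")
    # str.replace (not str.format) leaves shell braces such as ${VAR} untouched.
    return template.replace(PLACEHOLDER, entrypoint_cmd)


template = """#!/bin/bash
export MY_ENV_VAR=456
{task.fn}
echo "Success!!"
"""
script = render_script(template, "pyflyte-execute <args>")  # stand-in command
print(script)
```

Using `str.replace` rather than `str.format` is a deliberate choice in this sketch, since batch scripts routinely contain `${...}` expansions that `format` would try to interpret.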
  • f

    freezing-airport-6809

    02/06/2025, 5:04 PM
    Awesome stuff let’s go
  • c

    creamy-shampoo-53278

    02/08/2025, 3:15 AM
    Hi all, The setup documentation has been moved here. We will continue to complete the remaining sections, including setting up the environment for local testing and agent deployment. PR link: https://github.com/flyteorg/flyte/pull/6231 Thanks!
  • c

    creamy-shampoo-53278

    02/17/2025, 1:02 PM
    (Update) We now support establishing an SSH connection using the following configuration:
    ssh_config={
        "host": "<hostname-or-ip-address>",
        "username": "<username>",
        "client_keys": "<file-path-to-private-key>"  # Support both string and list of strings
    }
    where `host` and `username` are required, and `client_keys` is optional. Here are the common use cases:
    1. Use OpenSSH client config files – Load settings from `~/.ssh/config`
    2. Specify private key paths explicitly – Define key file paths in the `client_keys` field
    3. Use the `FLYTE_SLURM_PRIVATE_KEY` secret – Write the private key content to a local file, then add its path to `client_keys`
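A small sketch of how such an `ssh_config` could be validated and normalized before opening a connection, assuming the rules stated above (`host` and `username` required, `client_keys` optional and given as a string or a list of strings). The function name `normalize_ssh_config` is hypothetical.

```python
from typing import Any, Dict


def normalize_ssh_config(ssh_config: Dict[str, Any]) -> Dict[str, Any]:
    """Check required keys and turn client_keys into a list when present."""
    missing = [key for key in ("host", "username") if not ssh_config.get(key)]
    if missing:
        raise ValueError(f"ssh_config is missing required keys: {missing}")

    normalized = dict(ssh_config)  # copy so the caller's dict is not mutated
    client_keys = normalized.get("client_keys")
    if client_keys is None:
        # No explicit keys: leave key discovery to the SSH client defaults.
        normalized.pop("client_keys", None)
    elif isinstance(client_keys, str):
        normalized["client_keys"] = [client_keys]
    else:
        normalized["client_keys"] = list(client_keys)
    return normalized


config = normalize_ssh_config(
    {"host": "aws2", "username": "ubuntu", "client_keys": "~/.ssh/id_rsa"}
)
print(config)
```

Since `host`, `username`, and `client_keys` are also keyword options of `asyncssh.connect`, a dict shaped like this could plausibly be unpacked into that call, though how the agent actually wires it up is not shown here.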
  • c

    creamy-shampoo-53278

    02/18/2025, 4:39 PM
    (Update) We now support input/output interpolation in `SlurmShellTask`! Here’s an example task:
    write_file_task = SlurmShellTask(
        name="write-file",
        script="""#!/bin/bash
    
    echo "[SlurmShellTask] Write something into a file..." >> {inputs.x}
    if grep "file" {inputs.x}
    then
        echo "Found a string 'file'!" >> {inputs.x}
    else
        echo "'file' not found!"
    fi
        """,
        task_config=Slurm(
            ssh_config={
                "host": "aws2",
                "username": "ubuntu",
            },
            sbatch_conf={
                "partition": "debug",
                "job-name": "tiny-slurm",
            }
        ),
        inputs=kwtypes(x=str),
        output_locs=[OutputLocation(var="i", var_type=FlyteFile, location="{inputs.x}")],
    )
    A workflow that demonstrates passing files between tasks: https://github.com/JiangJiaWei1103/Flyte-Demos/blob/main/slurm_agent/script/shell_3.py
    Observations when passing files between tasks:
    1. FlyteFile as an input type is problematic – The file will be created on the Slurm cluster and can't be found on the local machine
    2. Output location type mismatch – Even though `var_type` in `output_locs` is set to `FlyteFile`, it’s actually a `str`
    I’d love to hear your thoughts on this. Thanks!
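To make the `{inputs.x}` interpolation concrete, here is a rough sketch of the idea using a regular expression. It is only an illustration under the assumption that each `{inputs.<name>}` token is replaced by the string form of the matching input; `interpolate_inputs` is a hypothetical helper, not the task's real implementation.

```python
import re

# Matches tokens such as {inputs.x} and captures the input name.
_INPUT_TOKEN = re.compile(r"\{inputs\.(\w+)\}")


def interpolate_inputs(text: str, inputs: dict) -> str:
    """Replace every {inputs.<name>} token with the corresponding input value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"no input named {name!r}")
        return str(inputs[name])

    return _INPUT_TOKEN.sub(substitute, text)


script = 'echo "hello" >> {inputs.x}\ngrep "file" {inputs.x}\n'
print(interpolate_inputs(script, {"x": "/tmp/out.txt"}))
```

Because the substituted value is always a string, this also hints at why an output declared with `location="{inputs.x}"` can end up behaving like a `str` path rather than a file object.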
  • d

    damp-lion-88352

    02/24/2025, 4:48 AM
    Based on this article from the official Slurm site, I want to use the Slurm agent to run a GPU task and write a blog post about it. https://www.schedmd.com/what-can-a-high-performance-computer-do/ cc @creamy-shampoo-53278 @eager-processor-63090
  • e

    eager-processor-63090

    02/24/2025, 9:38 PM
    Current state (please correct me where I'm wrong) 🧵
  • d

    damp-lion-88352

    02/25/2025, 6:23 AM
    [Need help] I'm trying to set up a GPU Slurm cluster. These are the last 2 lines of my `/etc/slurm/slurm.conf`:
    NodeName=localhost Gres=gpu:1 CPUs=4 RealMemory=15006 Sockets=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
    PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
    This is the `/etc/slurm/gres.conf`:
    AutoDetect=nvml
    NodeName=localhost Name=gpu Type=tesla  File=/dev/nvidia0 COREs=0
    After changing the config, I restarted my Slurm cluster and ran `slurmd -C`, but it doesn't show that I have a GPU. CC @rich-application-44533 @red-school-96573 @fierce-oil-47448
  • d

    damp-lion-88352

    02/28/2025, 4:39 AM
    Here's a slurm training and inference example running on Union's cluster! Thank you @creamy-shampoo-53278 https://github.com/JiangJiaWei1103/Flyte-Demos/issues/1
  • d

    damp-lion-88352

    03/06/2025, 3:43 AM
    The Slurm shell task and function task are done now! We will keep improving them and will add some examples to flytesnacks. Please let me know if you want to try it. I am happy to do a 1-on-1 with you to help you set up your agent service with Slurm and to gather feedback from you to improve this plugin. Thank you ❤️
  • d

    damp-lion-88352

    03/06/2025, 3:51 AM
    Also, here are examples of three kinds of use cases, SlurmTask, SlurmShellTask, and SlurmFunctionTask, written by my friend @creamy-shampoo-53278. Feel free to try them! https://github.com/JiangJiaWei1103/Flyte-Demos/tree/main/slurm_agent
  • e

    eager-processor-63090

    03/17/2025, 9:05 PM
    Thread for usability fixes @creamy-shampoo-53278 @damp-lion-88352 🧵
  • a

    average-finland-92144

    04/02/2025, 4:00 PM
    Hey, so with the docs additions that went out with Flyte 1.15.1, is the Slurm connector (agent) fully released?