Flyte #slurm-flyte-wg

creamy-shampoo-53278

12/06/2024, 4:48 PM

Hello everyone, I succeeded in setting up a Slurm cluster with this guide, https://github.com/SergioMEV/slurm-for-dummies?tab=readme-ov-file and ran a naive `sbatch` example. Han-Ru and I think this can be a good starting point!

freezing-airport-6809

12/08/2024, 10:14 PM

@creamy-shampoo-53278 / @damp-lion-88352 and team, I would love, as an eventual goal, for this script (nanotron) to be fully converted to a Flyte + Slurm runner. This would be a fantastic achievement: we would be able to train a full LLM from scratch using one Flyte workflow (a small one, that's OK).

freezing-airport-6809

12/12/2024, 3:37 AM

@damp-lion-88352 @creamy-shampoo-53278 should we use this instead https://intertwin-eu.github.io/interLink/docs/intro

freezing-airport-6809

12/12/2024, 3:40 PM

Another one: “Pretrain Your Own AI Models with Fast-LLM” by Jesus Rodriguez on Medium: https://jrodthoughts.medium.com/pretrain-your-own-ai-models-with-fast-llm-df70846cc37b

creamy-shampoo-53278

12/16/2024, 3:43 PM

Hello everyone, we're excited to share that a very naive Slurm agent with synchronous `create` and `get` methods works fine locally. At this early stage, we set up a single-host (Ubuntu 20.04) machine for local development and testing. Some takeaways:
1. Run the controller daemons (`slurmctld`, `slurmdbd`), one compute node daemon (`slurmd`), and the REST API daemon (`slurmrestd`) on the same machine.
   a. The Slurm agent can interact with `slurmrestd` through the host base URL `http://localhost:6820`.
   b. Authentication is done by JWT.
2. Test the Slurm agent locally based on this guide to mimic FlytePropeller’s behavior.

We'll keep pushing forward to make this feature a reality!
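A minimal sketch of how a client might address `slurmrestd` with JWT authentication, for readers following along. The `X-SLURM-USER-NAME` and `X-SLURM-USER-TOKEN` headers are the ones `slurmrestd` reads for JWT auth. The API version string (`v0.0.40`), the choice of the `ping` endpoint, and the helper names are assumptions for illustration and depend on the installed Slurm release. This is not the agent's actual code.

```python
import json
import urllib.request


def build_slurmrestd_request(base_url, user, token, api_version="v0.0.40"):
    """Return the URL and headers for a slurmrestd ping call (nothing is sent).

    api_version is an assumption; use the version your slurmrestd exposes.
    """
    url = f"{base_url.rstrip('/')}/slurm/{api_version}/ping"
    headers = {
        "X-SLURM-USER-NAME": user,
        "X-SLURM-USER-TOKEN": token,
    }
    return url, headers


def ping(base_url, user, token):
    """Send the ping request and return the decoded JSON body."""
    url, headers = build_slurmrestd_request(base_url, user, token)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the single-host setup described above, `ping("http://localhost:6820", "<username>", "<jwt>")` would be the call to try once a token has been issued for the user.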

creamy-shampoo-53278

12/18/2024, 4:24 PM

Hi all, the Slurm agent's `create` and `get` methods (naive version) are now implemented with `asyncssh`, but we still use `PythonTask` so far. We'll support `delete` and switch to `ShellTask` soon. If there's any mistake, please let me know. Thanks a lot!

creamy-shampoo-53278

12/30/2024, 2:35 PM

Hello everyone, we're happy to share that the Slurm agent v1 (with `PythonFunctionTask`) has been implemented. It supports the following three core methods:
1. `create`: Use `srun` to run a Slurm job which executes the Flyte entrypoints, `pyflyte-fast-execute` and `pyflyte-execute`
2. `get`: Use `scontrol show job <job_id>` to monitor the Slurm job state
3. `delete`: Use `scancel <job_id>` to cancel the Slurm job (this method is still under test)

We set up an environment to test it locally without running the agent gRPC server. The setup is divided into three components: a client (localhost), a remote tiny Slurm cluster, and an Amazon S3 bucket that facilitates communication between the two. The attached figure below illustrates the interaction between the client and the remote Slurm cluster.
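The `get` and `delete` steps above can be sketched roughly as follows. This is an illustrative sketch, not the agent's implementation: the helper names and the coarse phase mapping are assumptions, and only the `scontrol show job` / `scancel` commands and the `JobState=` field come from Slurm itself.

```python
import re
import shlex

# Hypothetical mapping from Slurm job states to coarse phases.
# Any state not listed here is treated as a failure, which is a simplification.
_PHASES = {
    "PENDING": "queued",
    "CONFIGURING": "queued",
    "RUNNING": "running",
    "COMPLETING": "running",
    "COMPLETED": "succeeded",
}


def get_command(job_id):
    """Command used to poll the job state."""
    return f"scontrol show job {shlex.quote(str(job_id))}"


def delete_command(job_id):
    """Command used to cancel the job."""
    return f"scancel {shlex.quote(str(job_id))}"


def parse_job_state(scontrol_output):
    """Extract the JobState value from `scontrol show job` output."""
    match = re.search(r"\bJobState=(\S+)", scontrol_output)
    if match is None:
        raise ValueError("JobState not found in scontrol output")
    return match.group(1)


def to_phase(job_state):
    """Translate a Slurm job state into a coarse phase string."""
    return _PHASES.get(job_state, "failed")
```

A caller would run `get_command(job_id)` on the cluster, feed the stdout to `parse_job_state`, and report `to_phase(...)` back to whatever is polling.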

creamy-shampoo-53278

01/04/2025, 2:41 PM

Happy new year! We did some research on SSH connections through `asyncssh` and have a question to discuss with you. As far as we know, connecting with client keys is preferred over passwords for security, convenience, and scalability. Before using key pairs for the connection, we must first complete the following steps:

1. Generate an SSH key pair (taking RSA as an example):

ssh-keygen -t rsa -b 4096

2. Place the public key on the remote server (the Slurm cluster in this case):

ssh-copy-id <username>@<hostname>

creamy-shampoo-53278

01/07/2025, 4:18 PM

Hi everyone, we successfully implemented the Slurm agent with `ShellTask`, but there's still room for improvement! Here is how it works:
• To mimic NFS, we put a simple Python script on the Slurm cluster in advance
• To run the user-defined batch script on the Slurm cluster, we write the script content to a tmp file and transfer it to the cluster through SFTP
• Within the `create` method, we construct an `sbatch` command based on the user-defined `sbatch` options and run this command on the cluster
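The command construction in the last bullet might look something like the sketch below. It assumes the user-defined options arrive as a dict of long option names to values (as in the `sbatch_conf` examples later in this channel). The function names and the script path are hypothetical, and the SFTP transfer itself is left out.

```python
import re
import shlex


def build_sbatch_command(sbatch_conf, script_path):
    """Turn a dict of long sbatch options into a single command line."""
    parts = ["sbatch"]
    for option, value in sbatch_conf.items():
        # Use the --option=value form and quote the value for the remote shell.
        parts.append(f"--{option}={shlex.quote(str(value))}")
    parts.append(shlex.quote(script_path))
    return " ".join(parts)


def parse_job_id(sbatch_stdout):
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError("could not find a job ID in sbatch output")
    return match.group(1)
```

The job ID returned by `parse_job_id` is what the later `get` and `delete` calls would need to carry around.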

damp-lion-88352

01/15/2025, 3:10 PM

Hi Team, please take a look at this comment. This video demo showcases two task types on Slurm, running on an AWS EC2 instance:
1. Slurm Python Task – Executes a shell script on the Slurm host.
2. Slurm Shell Task – Executes a shell script provided by my computer.

Let me know how @creamy-shampoo-53278 and I can help! PR Link cc @eager-processor-63090

eager-processor-63090

01/15/2025, 9:05 PM

Going to start opening threads for any bugs I encounter to keep the PR clean. I'll react with a ✅ when I'm able to resolve!

eager-processor-63090

01/15/2025, 9:18 PM

RuntimeError: Error encountered while executing 'job2':
  Task <Task pending name='Task-57' coro=<AsyncAgentExecutorMixin._create() running at /Users/pryceturner/Desktop/slurm-flyte/flytekit/flytekit/extend/backend/base_agent.py:376> cb=[_run_until_complete_cb() at
/Users/pryceturner/.pyenv/versions/3.12.8/lib/python3.12/asyncio/base_events.py:181]> got Future <Future pending> attached to a different loop

creamy-shampoo-53278

01/17/2025, 3:03 PM

Hi Pryce, regarding your comments on the PR, I have some questions:
1. When you mention a consistent persistence layer, do you mean that we should implement an independent object store to hold the inputs and outputs of heterogeneous tasks (i.e., Slurm tasks and other task types) composed in a single workflow?
2. As for offloading GPU-accelerated tasks, can I assume that some tasks run on CPU (e.g., data preprocessing) and some tasks like LLM finetuning should be offloaded to another cloud service with GPU nodes? If so, we again need a consistent persistence layer for input/output access.

Sorry for the dumb questions. I just want to check that I fully understand what you said, and I wonder what the top priority for the next step will be. Thanks!

freezing-airport-6809

01/28/2025, 6:00 AM

@creamy-shampoo-53278 how can we start testing this, is there a PR?

damp-lion-88352

01/29/2025, 5:00 PM

These are some notes from the discussion with @faint-gpu-25916 and @freezing-airport-6809 yesterday. Let's chat in this thread or hop on a call. cc @creamy-shampoo-53278 https://hackmd.io/@Future-Outlier/r18R22w_Jx

damp-lion-88352

02/05/2025, 3:55 PM

Hi @freezing-airport-6809, you mentioned that you want the Slurm Python function task to look like this:

@task(
    task_config=SlurmFunction(
        slurm_host="aws",
        srun_conf={
            "partition": "debug",
            "job-name": "tiny-slurm",
        },
        script="""
#!/bin/bash
# Pre-execute
echo "Hello, world!"
export MY_ENV_VAR=123

# Run the python function here
{flyte.fn}

# Post-execute
exit -1
"""
    )
)
def plus_one(x: int) -> int: 
    print(os.getenv("MY_ENV_VAR"))
    return x + 1

So the execution order will be (1) script (pre-execute), then (2) function. I am wondering: in the agent's operation framework (`create`, `get`, and `delete`), the pre-execute part should be put in the `create` operation, right? Or should everything be put in the `get` operation? Each agent operation has a timeout mechanism, so we need to figure out a way to implement this. TL;DR: the agent framework is designed for one execution, but your proposal involves more than one execution, which is still possible to implement but ugly.
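One way to keep this a single execution is to substitute the placeholder before submission, so that the pre-execute lines, the function entrypoint, and the post-execute lines all run inside one Slurm job. A minimal sketch, assuming a literal `{flyte.fn}` placeholder as in the snippet above and a hypothetical entrypoint string. This is not necessarily how the plugin resolves it.

```python
def render_script(script, entrypoint, placeholder="{flyte.fn}"):
    """Replace the function placeholder with the command that runs the task.

    The placeholder must appear exactly once so the function runs once,
    between the pre-execute and post-execute sections.
    """
    occurrences = script.count(placeholder)
    if occurrences != 1:
        raise ValueError(
            f"expected exactly one {placeholder} in the script, found {occurrences}"
        )
    return script.replace(placeholder, entrypoint)
```

The rendered script can then be submitted in `create` as one job, leaving `get` to poll only that job's state.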

damp-lion-88352

02/06/2025, 3:44 PM

Here is a demo with slurm's python function task.

@task(
    task_config=SlurmFunction(
        slurm_host="ec2-18-207-193-50.compute-1.amazonaws.com",
        srun_conf={
            "partition": "debug",
            "job-name": "fn-task",
            "output": "/home/ubuntu/fn_task.log"
        },
        script="""
#!/bin/bash

# == Pre-Execution ==
echo "Hello, world!"

# Setup env vars
export MY_ENV_VAR=456

# Activate virtual env
. /home/ubuntu/.cache/pypoetry/virtualenvs/demo-poetry-RLi6T71_-py3.12/bin/activate

# == Execute Flyte Task Function ==
{task.fn}

# == Post-Execution ==
echo "Success!!"
"""
    )
)
def plus_one(x: int) -> int:
    print(os.getenv("MY_ENV_VAR"))
    return int(os.getenv("MY_ENV_VAR")) + 1

PR link: https://github.com/flyteorg/flytekit/pull/3005 cc @glamorous-carpet-83516 @freezing-airport-6809 @eager-processor-63090 Can we tag some potential users to take a look at this?

2025-02-06 23-20-21.mp4

freezing-airport-6809

02/06/2025, 5:04 PM

Awesome stuff let’s go

creamy-shampoo-53278

02/08/2025, 3:15 AM

Hi all, The setup documentation has been moved here. We will continue to complete the remaining sections, including setting up the environment for local testing and agent deployment. PR link: https://github.com/flyteorg/flyte/pull/6231 Thanks!

creamy-shampoo-53278

02/17/2025, 1:02 PM

(Update) We now support establishing an SSH connection using the following configuration:

ssh_config={
    "host": "<hostname-or-ip-address>",
    "username": "<username>",
    "client_keys": "<file-path-to-private-key>"  # Support both string and list of strings
}

where `host` and `username` are required, and `client_keys` is optional. Here are the common use cases:
1. Use OpenSSH client config files – Load settings from `~/.ssh/config`
2. Specify private key paths explicitly – Define key file paths in the `client_keys` field
3. Use the `FLYTE_SLURM_PRIVATE_KEY` secret – Write the private key content to a local file, then add its path to `client_keys`
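A rough sketch of how such an `ssh_config` dict could be validated and passed to `asyncssh`. The helper names are hypothetical and this is not the plugin's code. `asyncssh.connect` does accept `host`, `username`, and `client_keys` keyword arguments, and the import is deferred so the validation helper can be used without `asyncssh` installed.

```python
import os


def to_connect_kwargs(ssh_config):
    """Validate ssh_config and normalize it into asyncssh.connect() kwargs."""
    for key in ("host", "username"):
        if not ssh_config.get(key):
            raise ValueError(f"ssh_config is missing required field: {key}")
    kwargs = {"host": ssh_config["host"], "username": ssh_config["username"]}
    client_keys = ssh_config.get("client_keys")
    if client_keys:
        # Accept either a single path or a list of paths.
        if isinstance(client_keys, str):
            client_keys = [client_keys]
        kwargs["client_keys"] = [os.path.expanduser(path) for path in client_keys]
    return kwargs


async def run_remote(ssh_config, command):
    """Open an SSH connection and run one command, returning its stdout."""
    import asyncssh  # third-party dependency, imported lazily

    async with asyncssh.connect(**to_connect_kwargs(ssh_config)) as conn:
        result = await conn.run(command, check=True)
        return result.stdout
```

When `client_keys` is omitted, nothing is passed for it, so `asyncssh` falls back to its default key lookup.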

creamy-shampoo-53278

02/18/2025, 4:39 PM

(Update) We now support input/output interpolation in `SlurmShellTask`! Here’s an example task:

write_file_task = SlurmShellTask(
    name="write-file",
    script="""#!/bin/bash

echo "[SlurmShellTask] Write something into a file..." >> {inputs.x}
if grep "file" {inputs.x}
then
    echo "Found a string 'file'!" >> {inputs.x}
else
    echo "'file' not found!"
fi
    """,
    task_config=Slurm(
        ssh_config={
            "host": "aws2",
            "username": "ubuntu",
        },
        sbatch_conf={
            "partition": "debug",
            "job-name": "tiny-slurm",
        }
    ),
    inputs=kwtypes(x=str),
    output_locs=[OutputLocation(var="i", var_type=FlyteFile, location="{inputs.x}")],
)

A workflow that demonstrates passing files between tasks: https://github.com/JiangJiaWei1103/Flyte-Demos/blob/main/slurm_agent/script/shell_3.py

Observations when passing files between tasks:
1. `FlyteFile` as an input type is problematic – The file will be created on the Slurm cluster and can't be found on the local machine
2. Output location type mismatch – Even though `var_type` in `output_locs` is set to `FlyteFile`, it’s actually a `str`

I’d love to hear your thoughts on this. Thanks!
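For reference, the `{inputs.x}` interpolation shown in the example task can be pictured with a small sketch like the one below. The function name and the regex are assumptions for illustration, not the plugin's implementation, and values are substituted as-is without shell quoting.

```python
import re

# Matches placeholders of the form {inputs.<name>}.
_INPUT_PLACEHOLDER = re.compile(r"\{inputs\.(\w+)\}")


def interpolate_inputs(script, inputs):
    """Replace every {inputs.<name>} in the script with the matching value."""

    def substitute(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"script references unknown input: {name}")
        return str(inputs[name])

    return _INPUT_PLACEHOLDER.sub(substitute, script)
```

Because the result is plain text, a path interpolated this way is just a string on the cluster side, which is consistent with the second observation above.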

damp-lion-88352

02/24/2025, 4:48 AM

Based on this article from the official Slurm site, I want to use the Slurm agent to run a GPU task and write a blog post about it. https://www.schedmd.com/what-can-a-high-performance-computer-do/ cc @creamy-shampoo-53278 @eager-processor-63090

eager-processor-63090

02/24/2025, 9:38 PM

Current state (please correct me where I'm wrong) 🧵

damp-lion-88352

02/25/2025, 6:23 AM

[need help] I'm trying to set up a GPU Slurm cluster. These are the last 2 lines in my `/etc/slurm/slurm.conf`:

NodeName=localhost Gres=gpu:1 CPUs=4 RealMemory=15006 Sockets=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

This is the `/etc/slurm/gres.conf`:

AutoDetect=nvml
NodeName=localhost Name=gpu Type=tesla  File=/dev/nvidia0 COREs=0

After changing the config, I restarted my Slurm cluster and ran `slurmd -C`, but it doesn't show that I have a GPU. CC @rich-application-44533 @red-school-96573 @fierce-oil-47448

damp-lion-88352

02/28/2025, 4:39 AM

Here's a slurm training and inference example running on Union's cluster! Thank you @creamy-shampoo-53278 https://github.com/JiangJiaWei1103/Flyte-Demos/issues/1

damp-lion-88352

03/06/2025, 3:43 AM

The Slurm shell task and function task are done now! We will make some improvements and add some examples to flytesnacks. Please let me know if you want to try it. I am willing to work 1-on-1 with you to help you set up your agent service with Slurm and gather your feedback to improve this plugin. Thank you ❤️

damp-lion-88352

03/06/2025, 3:51 AM

Also, here are examples of 3 kinds of use cases, including SlurmTask, SlurmShellTask, and SlurmFunctionTask, written by my friend @creamy-shampoo-53278. Feel free to try them! https://github.com/JiangJiaWei1103/Flyte-Demos/tree/main/slurm_agent

eager-processor-63090

03/17/2025, 9:05 PM

Thread for usability fixes @creamy-shampoo-53278 @damp-lion-88352 🧵

average-finland-92144

04/02/2025, 4:00 PM

Hey, so with the docs additions that went out with Flyte 1.15.1, is the Slurm connector (agent) fully released?