creamy-shampoo-53278
12/06/2024, 4:48 PM
An sbatch example. Han-Ru and I think this can be a good starting point!
freezing-airport-6809
creamy-shampoo-53278
12/16/2024, 3:43 PM
The create and get methods work fine locally. At this early stage, we set up a single-host (Ubuntu 20.04) machine for our local development and testing. Some takeaways are:
1. Run the controller daemons (slurmctld, slurmdbd), one compute node daemon (slurmd), and also the REST API daemon (slurmrestd) on the same machine.
a. The Slurm agent can interact with slurmrestd through the host base URL <http://localhost:6820>.
b. Authentication is done by JWT (a sketch follows the list).
2. Test the Slurm agent locally based on this guide to mimic FlytePropeller's behavior.
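For reference, a minimal sketch of talking to slurmrestd with JWT from Python. Assumptions: the token is exported as SLURM_JWT via `scontrol token`, and the versioned endpoint path (v0.0.39 here) matches your slurmrestd build.

import os
import requests

BASE_URL = "http://localhost:6820"
headers = {
    "X-SLURM-USER-NAME": os.environ["USER"],
    "X-SLURM-USER-TOKEN": os.environ["SLURM_JWT"],  # from `scontrol token`
}

# Ping the REST API; the versioned path depends on the slurmrestd build.
resp = requests.get(f"{BASE_URL}/slurm/v0.0.39/ping", headers=headers)
resp.raise_for_status()
print(resp.json())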
We'll keep pushing forward to make this feature a reality!
creamy-shampoo-53278
12/18/2024, 4:24 PM
The create and get methods (naive version) are now implemented with asyncssh, but we still use PythonTask so far. We'll support delete and switch to ShellTask soon. If there's any mistake, please let me know. Thanks a lot!
creamy-shampoo-53278
12/30/2024, 2:35 PM
The Slurm agent (for PythonFunctionTask) has been implemented. It supports the following three core methods:
1. `create`: Use srun to run a Slurm job which executes the Flyte entrypoints, pyflyte-fast-execute and pyflyte-execute
2. `get`: Use scontrol show job <job_id> to monitor the Slurm job state
3. `delete`: Use scancel <job_id> to cancel the Slurm job (this method is still under test)
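To make the mapping concrete, here is a minimal sketch of what `get` and `delete` might look like over SSH (hypothetical helpers, not the actual agent code; assumes asyncssh with key-based auth already configured):

import asyncssh

async def get_job_state(host: str, job_id: int) -> str:
    """Query the job state via `scontrol show job <job_id>`."""
    async with asyncssh.connect(host) as conn:
        result = await conn.run(f"scontrol show job {job_id}", check=True)
    # scontrol output contains a field like "JobState=RUNNING"
    for token in result.stdout.split():
        if token.startswith("JobState="):
            return token.split("=", 1)[1]
    raise RuntimeError(f"JobState not found for job {job_id}")

async def cancel_job(host: str, job_id: int) -> None:
    """Cancel the job via `scancel <job_id>`."""
    async with asyncssh.connect(host) as conn:
        await conn.run(f"scancel {job_id}", check=True)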
We set up an environment to test it locally without running the agent gRPC server. The setup is divided into three components: a client (localhost), a remote tiny Slurm cluster, and an Amazon S3 bucket that facilitates communication between the two. The attached figure below illustrates the interaction between the client and the remote Slurm cluster.
creamy-shampoo-53278
01/04/2025, 2:41 PM
We've been working on SSH authentication with asyncssh and have a question to discuss with you! As far as we know, establishing a connection with client keys is preferred over passwords for security, convenience, and scalability.
Before using key pairs for connection, we must first complete the following steps:
1. Generate an SSH key pair (taking RSA as an example):
ssh-keygen -t rsa -b 4096
2. Place the public key on the remote server (the Slurm cluster in this case):
ssh-copy-id <username>@<hostname>
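Once the key pair is in place, connecting from Python looks roughly like this (a minimal sketch with a hypothetical host and key path; client_keys can be omitted to fall back on default keys or ~/.ssh/config):

import asyncio
import asyncssh

async def main() -> None:
    async with asyncssh.connect(
        "slurm-cluster.example.com",    # hypothetical host
        username="ubuntu",
        client_keys=["~/.ssh/id_rsa"],  # path to the private key
    ) as conn:
        result = await conn.run("sinfo", check=True)
        print(result.stdout)

asyncio.run(main())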
creamy-shampoo-53278
01/07/2025, 4:18 PM
The agent now supports ShellTask, but there's still room for improvement!
The following describes how it works:
• To mimic NFS, we put a simple Python script on the Slurm cluster in advance
• To run the user-defined batch script on the Slurm cluster, we write the script content to a tmp file and transfer it to the cluster through SFTP
• Within the create method, we construct an sbatch command based on user-defined sbatch options and run this command on the cluster (see the sketch below)
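A minimal sketch of that flow (a hypothetical helper, assuming asyncssh and the user-defined script content passed in as a string):

import tempfile
import asyncssh

async def submit_script(host: str, script: str, sbatch_opts: dict) -> int:
    """Write the script to a tmp file, SFTP it over, and submit with sbatch."""
    async with asyncssh.connect(host) as conn:
        remote_path = "/tmp/flyte_batch.sh"  # hypothetical remote location
        with tempfile.NamedTemporaryFile("w", suffix=".sh") as f:
            f.write(script)
            f.flush()
            async with conn.start_sftp_client() as sftp:
                await sftp.put(f.name, remote_path)
        # Turn user-defined options into flags,
        # e.g. {"partition": "debug"} -> "--partition=debug"
        opts = " ".join(f"--{k}={v}" for k, v in sbatch_opts.items())
        result = await conn.run(f"sbatch {opts} {remote_path}", check=True)
        # sbatch prints "Submitted batch job <job_id>"
        return int(result.stdout.split()[-1])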
red-farmer-96033
01/14/2025, 3:35 PM
damp-lion-88352
01/15/2025, 3:10 PM
eager-processor-63090
01/15/2025, 9:05 PM
eager-processor-63090
01/15/2025, 9:18 PM
RuntimeError: Error encountered while executing 'job2':
Task <Task pending name='Task-57' coro=<AsyncAgentExecutorMixin._create() running at /Users/pryceturner/Desktop/slurm-flyte/flytekit/flytekit/extend/backend/base_agent.py:376> cb=[_run_until_complete_cb() at /Users/pryceturner/.pyenv/versions/3.12.8/lib/python3.12/asyncio/base_events.py:181]> got Future <Future pending> attached to a different loop
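For context, this error usually means an awaitable (e.g. an asyncssh connection) was created under one event loop and then awaited under another. A minimal sketch of the usual fix pattern, with hypothetical names:

import asyncssh

async def run_remote(host: str, cmd: str) -> str:
    # Create the connection inside the loop that awaits it, rather than
    # reusing a connection object created under a different event loop.
    async with asyncssh.connect(host) as conn:
        result = await conn.run(cmd, check=True)
        return result.stdout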
creamy-shampoo-53278
01/17/2025, 3:03 PM
freezing-airport-6809
damp-lion-88352
01/29/2025, 5:00 PM
damp-lion-88352
02/05/2025, 3:55 PM
@task(
    task_config=SlurmFunction(
        slurm_host="aws",
        srun_conf={
            "partition": "debug",
            "job-name": "tiny-slurm",
        },
        script="""
#!/bin/bash
# Pre-execute
echo "Hello, world!"
export MY_ENV_VAR=123

# Run the python function here
{flyte.fn}

# Post-execute
exit -1
"""
    )
)
def plus_one(x: int) -> int:
    print(os.getenv("MY_ENV_VAR"))
    return x + 1
so the execution order will be:
(1) script (pre-execute)
(2) function
I am wondering: in the agent's operation framework (create, get, and delete), should the pre-execute part be put in the create operation? Or should everything be put in the get operation?
Each of the agent's operations has a timeout mechanism, so we need to figure out a way to implement this.
TLDR: the agent framework is for 1 execution, but in your proposal this will be more than 1 execution, which is still possible to implement but ugly
damp-lion-88352
02/06/2025, 3:44 PM
@task(
    task_config=SlurmFunction(
        slurm_host="ec2-18-207-193-50.compute-1.amazonaws.com",
        srun_conf={
            "partition": "debug",
            "job-name": "fn-task",
            "output": "/home/ubuntu/fn_task.log"
        },
        script="""
#!/bin/bash
# == Pre-Execution ==
echo "Hello, world!"

# Setup env vars
export MY_ENV_VAR=456

# Activate virtual env
. /home/ubuntu/.cache/pypoetry/virtualenvs/demo-poetry-RLi6T71_-py3.12/bin/activate

# == Execute Flyte Task Function ==
{task.fn}

# == Post-Execution ==
echo "Success!!"
"""
    )
)
def plus_one(x: int) -> int:
    print(os.getenv("MY_ENV_VAR"))
    return int(os.getenv("MY_ENV_VAR")) + 1
PR link: https://github.com/flyteorg/flytekit/pull/3005
cc @glamorous-carpet-83516 @freezing-airport-6809 @eager-processor-63090
can we tag some potential users to take a look at this?
freezing-airport-6809
creamy-shampoo-53278
02/08/2025, 3:15 AM
creamy-shampoo-53278
02/17/2025, 1:02 PM
The SSH connection is now configured through ssh_config:
ssh_config={
    "host": "<hostname-or-ip-address>",
    "username": "<username>",
    "client_keys": "<file-path-to-private-key>"  # Supports both a string and a list of strings
}
where host and username are required, and client_keys is optional.
Here are the common use cases:
1. Use OpenSSH client config files – Load settings from ~/.ssh/config
2. Specify private key paths explicitly – Define key file paths in the client_keys field
3. Use the FLYTE_SLURM_PRIVATE_KEY secret – Write the private key content to a local file, then add its path to client_keys (see the sketch below)
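A minimal sketch of use case 3 (hypothetical helper; assumes the private key content has already been fetched from the FLYTE_SLURM_PRIVATE_KEY secret):

import os
import tempfile

def build_ssh_config(host: str, username: str, private_key: str) -> dict:
    """Write the key material to a local file and reference it in ssh_config."""
    fd, key_path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write(private_key)
    os.chmod(key_path, 0o600)  # SSH clients reject overly permissive key files
    return {
        "host": host,
        "username": username,
        "client_keys": [key_path],
    }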
creamy-shampoo-53278
02/18/2025, 4:39 PM
We now support SlurmShellTask! Here’s an example task:
write_file_task = SlurmShellTask(
    name="write-file",
    script="""#!/bin/bash
echo "[SlurmShellTask] Write something into a file..." >> {inputs.x}

if grep "file" {inputs.x}
then
    echo "Found a string 'file'!" >> {inputs.x}
else
    echo "'file' not found!"
fi
""",
    task_config=Slurm(
        ssh_config={
            "host": "aws2",
            "username": "ubuntu",
        },
        sbatch_conf={
            "partition": "debug",
            "job-name": "tiny-slurm",
        }
    ),
    inputs=kwtypes(x=str),
    output_locs=[OutputLocation(var="i", var_type=FlyteFile, location="{inputs.x}")],
)
A workflow that demonstrates passing files between tasks: https://github.com/JiangJiaWei1103/Flyte-Demos/blob/main/slurm_agent/script/shell_3.py
Observations when passing files between tasks:
1. FlyteFile as an input type is problematic – the file is created on the Slurm cluster and can't be found on the local machine
2. Output location type mismatch – even though var_type in output_locs is set to FlyteFile, it’s actually a str
I’d love to hear your thoughts on this. Thanks!
damp-lion-88352
02/24/2025, 4:48 AM
eager-processor-63090
02/24/2025, 9:38 PM
damp-lion-88352
02/25/2025, 6:23 AM
this is the /etc/slurm/slurm.conf:
NodeName=localhost Gres=gpu:1 CPUs=4 RealMemory=15006 Sockets=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
this is the /etc/slurm/gres.conf:
AutoDetect=nvml
NodeName=localhost Name=gpu Type=tesla File=/dev/nvidia0 COREs=0
After changing the config, I restarted my Slurm cluster and ran slurmd -C, but it doesn't show that I have a GPU.
CC @rich-application-44533 @red-school-96573 @fierce-oil-47448
damp-lion-88352
02/28/2025, 4:39 AM
damp-lion-88352
03/06/2025, 3:43 AM
damp-lion-88352
03/06/2025, 3:51 AM
eager-processor-63090
03/17/2025, 9:05 PM
average-finland-92144
04/02/2025, 4:00 PM