# ray-integration
  • f

    future-notebook-79388

    01/13/2023, 4:33 AM
    Hi, with KubeRay version 0.3.0, while trying to run Ray training remotely using
    pyflyte --config ~/.flyte/config-remote.yaml run --remote --image <image_name> ray_demo.py wf
    I am getting this issue in the logs and the task stays queued in the console. When the same workflow is executed locally using
    pyflyte --config ~/.flyte/config-remote.yaml run --image <image_name> ray_demo.py wf
    it works fine.
  • f

    future-notebook-79388

    01/16/2023, 5:48 AM
    Hi, I have a doubt regarding scaling of nodes. Is there an option to make each worker pod run on a different node, so that 'n' smaller-memory instances are spawned instead of one large one? For example, if I request 8G memory, 4 CPU, and 4 replicas, a single higher-memory instance is spawned and all worker pods are scheduled onto that one node. Instead, I need each worker pod to be scheduled on one of 4 different nodes with smaller instances. Is there any way to achieve this kind of scaling?
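    A possible Kubernetes-level sketch for this (not a confirmed Flyte or KubeRay feature): a topology spread constraint on the worker pods forces the scheduler to place at most one worker per node, so several smaller nodes are brought up instead of one large one. It assumes the worker pods carry the KubeRay label ray.io/node-type: worker and that you have a way to inject this into the worker pod spec (for example via a pod template).
    # Sketch only: spread Ray worker pods one per node
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname   # spread across distinct nodes
          whenUnsatisfiable: DoNotSchedule      # hard requirement, not best-effort
          labelSelector:
            matchLabels:
              ray.io/node-type: worker          # assumed KubeRay worker label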
  • a

    ambitious-australia-27749

    01/30/2023, 8:42 PM
    Hello! I am running Ray on Flyte. I am getting a warning about Ray using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. To fix this I would normally specify --shm-size=3.55gb when running the container, but Flyte runs the containers for us, so I cannot figure out how to specify any run options. Is there a way to specify run options for the containers that Flyte runs? The full text of the warning is attached.
    ray_on_flyte_error.txt
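    For reference, the usual Kubernetes-level workaround for a small /dev/shm (independent of how Flyte would expose it) is to mount a memory-backed emptyDir at /dev/shm in the Ray containers; the sizeLimit plays the role of docker's --shm-size. A sketch, with a hypothetical container name:
    spec:
      containers:
        - name: ray-worker          # hypothetical
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 4Gi          # roughly the --shm-size you would pass to docker run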
  • l

    late-alligator-89730

    02/23/2023, 8:55 PM
    Hi, I'm trying to get some of my data processing jobs working on Ray with Flyte. I have KubeRay 0.4.0 installed. When my Flyte task starts, the Ray cluster gets created. It is accessible: I can open the dashboard and I can also submit jobs to it from my local machine with port forwarding. Unfortunately the original job is never run, so the cluster waits with no tasks. I looked at the logs of kuberay-operator and all I see are these lines repeated forever:
    2023-02-23T18:08:48.386Z    INFO    controllers.RayJob    RayJob associated rayCluster found    {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:48.387Z    INFO    controllers.RayJob    waiting for the cluster to be ready    {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:51.387Z    INFO    controllers.RayJob    reconciling RayJob    {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
    2023-02-23T18:08:51.388Z    INFO    controllers.RayJob    RayJob associated rayCluster found    {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:51.388Z    INFO    controllers.RayJob    waiting for the cluster to be ready    {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:54.388Z    INFO    controllers.RayJob    reconciling RayJob    {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
    2023-02-23T18:08:54.388Z    INFO    controllers.RayJob    RayJob associated rayCluster found    {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:54.389Z    INFO    controllers.RayJob    waiting for the cluster to be ready    {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:57.389Z    INFO    controllers.RayJob    reconciling RayJob    {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
    These logs seem to be generated by this piece of code: https://github.com/ray-project/kuberay/blob/89f5fba8d6f868f9fedde1fbe22a6eccad88ecc1/ray-operator/controllers/ray/rayjob_controller.go#L174 and are unexpected, since the cluster is healthy and I can use it on the side. I would appreciate any help and advice. Do you think it could be the operator version? My Flyte deployment is version 1.2.1, Ray in the cluster is 2.2.0, and flytekitplugins-ray is 1.2.7.
  • g

    glamorous-carpet-83516

    02/23/2023, 9:46 PM
    KubeRay 0.4.0 has some problems; install 0.3.0 or build from the master branch. The KubeRay community is going to release 0.4.1 soon.
    👍 1
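    One way to pin the operator to a known-good release, mirroring the kustomize-based install used elsewhere in this channel (the v0.3.0 tag name is an assumption):
    git clone https://github.com/ray-project/kuberay.git
    cd kuberay
    git checkout v0.3.0
    kubectl create -k ray-operator/config/default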
  • d

    dry-egg-91175

    03/17/2023, 8:04 PM
    Hi, we recently opened a pull request to address the following issue (inter-cluster communication between Flyte and a custom Ray cluster). Can someone please review it? It is part of a product Spotify is building that is integral to our machine learning platform.
    👍 1
    👀 1
  • f

    future-notebook-79388

    04/12/2023, 4:25 AM
    Hi, I installed the master version of the KubeRay operator and deployed the Ray cluster. When I submit the workflow, only the head pod gets created; the worker pods are not created. In the Ray operator logs I can see an error about creating the worker pods: it says "failed quota", but we have enough project quota and the requested resources are very small. Does anyone have an idea how to solve this issue?
    @task(task_config=ray_config, requests=Resources(mem="2000Mi", cpu="1"), limits=Resources(mem="3000Mi", cpu="2"))
    - development:
                - projectQuotaCpu:
                    value: "64"
                - projectQuotaMemory:
                    value: "150Gi"
    
    value: |
            apiVersion: v1
            kind: ResourceQuota
            metadata:
              name: project-quota
              namespace: {{ namespace }}
            spec:
              hard:
                limits.cpu: {{ projectQuotaCpu }}
                limits.memory: {{ projectQuotaMemory }}
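    A hedged note on the symptom above: because this ResourceQuota enforces limits.cpu and limits.memory, Kubernetes rejects any pod in the namespace that does not declare explicit limits, even if plenty of quota is free; if the generated Ray worker pods lack limits, that alone produces "failed quota" errors. Two quick checks (the namespace placeholder is assumed to be the usual <project>-<domain>):
    # How much of the quota is actually used, and which resources it enforces
    kubectl -n <project>-<domain> describe resourcequota project-quota
    # Recent admission errors ("exceeded quota" / "must specify limits.cpu,limits.memory")
    kubectl -n <project>-<domain> get events --sort-by=.lastTimestamp | tail -n 20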
  • t

    tall-exabyte-99685

    04/12/2023, 9:00 PM
    Hi all, I created an issue here before realizing there was a Slack. Any ideas as to why the Python Ray example (from the docs) registers its workflow just fine, but the Jupyter Notebook example doesn't find any entities? I'm probably missing something obvious, so apologies if that's the case. I noticed that VS Code thinks there is a `\n` after `@workflow` (unsurprising, since Jupyter notebooks are typically run in the browser); I'm not sure if that could be causing the problem.
  • g

    gorgeous-beach-23305

    05/29/2023, 5:52 PM
    Hi All, I am working on integrating Ray with Flyte. I have been able to register and run the Ray task and it completes successfully. But I am not able to find any logs anywhere saying that the task was run through Ray. Also, I can't see any pods being created / destroyed. There is a Ray cluster created, but it is also not destroyed after the task run. I have installed the Ray operator, Ray cluster, and Ray api-server using their Helm charts. And I have added the configmap in the `inline` section of the `configuration` in values.yaml.
    configuration:
      inline:
        configmap:
          enabled_plugins:
            # -- Task specific configuration [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#GetConfig)
            tasks:
              # -- Plugins configuration, [structure](https://pkg.go.dev/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/config#TaskPluginConfig)
              task-plugins:
                # -- [Enabled Plugins](https://pkg.go.dev/github.com/flyteorg/flyteplugins/go/tasks/config#Config). Enable SageMaker*, Athena if you install the backend
                # plugins
                enabled-plugins:
                  - container
                  - sidecar
                  - k8s-array
                  - ray
                default-for-task-types:
                  container: container
                  sidecar: sidecar
                  container_array: k8s-array
                  ray: ray
    I have all the ray pods running -
    NAME                                                 READY   STATUS    RESTARTS   AGE
    flyte-flyte-binary-6cfdcfc575-9l42x                  1/1     Running   0          3d2h
    flyte-ray-cluster-kuberay-head-9q6jq                 1/1     Running   0          147m
    flyte-ray-cluster-kuberay-worker-workergroup-bts8b   1/1     Running   0          147m
    kuberay-apiserver-d7bbb9864-htsw4                    1/1     Running   0          97m
    kuberay-operator-55c84695b8-vftmn                    1/1     Running   0          11h
    And also all the services -
    NAME                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
    flyte-flyte-binary-grpc              ClusterIP   x.x.x.x.   <none>        8089/TCP                                        3d3h
    flyte-flyte-binary-http              ClusterIP   x.x.x.x.   <none>        8088/TCP                                        3d3h
    flyte-flyte-binary-webhook           ClusterIP   x.x.x.x.    <none>        443/TCP                                         3d3h
    flyte-ray-cluster-kuberay-head-svc   ClusterIP   x.x.x.x.    <none>        10001/TCP,6379/TCP,8265/TCP,8080/TCP,8000/TCP   166m
    kuberay-apiserver-service            NodePort    x.x.x.x.   <none>        8888:31888/TCP,8887:31887/TCP                   116m
    kuberay-operator                     ClusterIP   x.x.x.x.    <none>        8080/TCP                                        3d2h
    Questions:
    1. Have I configured Flyte to use Ray correctly using the configmap in values.yaml?
    2. How do I verify that the Ray task that Flyte says was successful was indeed run on a Ray cluster?
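    A hedged way to answer question 2: if the task really went through the Ray plugin, each execution should leave behind RayJob/RayCluster custom resources and a per-execution head pod whose logs show the job being submitted. For example (namespace and pod names are placeholders):
    # List the Ray custom resources created for executions
    kubectl get rayjobs,rayclusters --all-namespaces
    # Inspect the head pod of one execution for the submitted job
    kubectl -n <namespace> logs <execution-raycluster-head-pod>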
  • t

    tall-exabyte-99685

    05/31/2023, 11:02 PM
    I know the Ray plugin typically spins up an ephemeral Ray cluster to run a Ray job (for isolation purposes). Does the flyte-ray plugin support running Ray jobs on an already-running static Ray cluster? If so, how would one enable that functionality? Thank you!
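    Whether the plugin supports this directly is the open question; as a workaround, a plain Flyte task (without the Ray task_config) can attach to an existing cluster through the Ray client, assuming the head service exposes the client port 10001 and is reachable from the task pod. A minimal sketch with a placeholder service address:
    import ray
    from flytekit import task


    @ray.remote
    def square(x: int) -> int:
        return x * x


    @task
    def run_on_existing_cluster(n: int) -> int:
        # Connect to the already-running (static) Ray cluster via the Ray client port.
        ray.init(address="ray://my-ray-head-svc.ray-system.svc.cluster.local:10001")
        return sum(ray.get([square.remote(i) for i in range(n)]))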
  • u

    user

    06/01/2023, 12:41 PM
    This message was deleted.
  • d

    dry-egg-91175

    06/09/2023, 5:59 PM
    Question about the Ray plugin. The docs mention that the Ray plugin uses Ray's Job Submission to run the Ray task. However, looking at the code, I don't see the `JobSubmissionClient` being used anywhere. Am I missing something?
  • b

    bright-fireman-49979

    06/22/2023, 5:42 PM
    Hello! We noticed that when the Ray plugin is used to create a RayCluster and RayJob on another, separate GKE cluster, it defaults to creating those resources in a namespace (in the separate GKE cluster) with the same name as the one running the Flyte workflow. Some questions here: • Is there any way to override this behaviour? ◦ If there isn't, is this something that could be added?
    ➕ 2
  • f

    future-notebook-79388

    06/29/2023, 1:29 PM
    Hi, we have integrated Ray with Flyte for a model tuning process. Where can we find logs in the Ray pods? We could only find the Ray cluster creation logs in the head and worker pods. Where can we find the Python task logs?
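    A hedged pointer: the driver output of a Ray job usually lives with the job itself rather than in the head/worker container logs, so the Ray job API is one place to look (the head service name is a placeholder; 8265 is the dashboard/job API port):
    # List jobs the cluster has run, then pull the driver (Python task) logs for one of them
    ray job list --address http://<head-svc>:8265
    ray job logs <submission_id> --address http://<head-svc>:8265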
  • f

    future-notebook-79388

    07/13/2023, 1:27 PM
    Hi, I am using the master version of kuberay-operator. The commands used to deploy kuberay-operator are:
    git clone https://github.com/ray-project/kuberay.git
    cd kuberay
    kubectl create -k ray-operator/config/default
    When I submit the Ray workflow, the RayJob, head, and worker pods get created and are up and running. But the workflow is not being submitted to the cluster, and the job has been queued in the console for more than 1 hour. Script:
    import typing
    import ray
    import time
    from flytekit import Resources, task, workflow
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
    
    
    @ray.remote
    def square(x):
        return x * x
    
    ray_config = RayJobConfig(
          head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
          worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
          runtime_env={"pip": ["numpy", "pandas"]},
    )
    
    
    @task(task_config=ray_config, requests=Resources(mem="2000Mi", cpu="1"), limits=Resources(mem="3000Mi", cpu="2"))
    def ray_task(n: int) -> typing.List[int]:
        futures = [square.remote(i) for i in range(n)]
        return ray.get(futures)
    
    
    @workflow
    def ray_workflow(n: int) -> typing.List[int]:
        return ray_task(n=n)
    
    if __name__ == "__main__":
        print(ray_workflow(n=10))
  • g

    gray-ocean-62145

    07/18/2023, 8:23 PM
    I'm going through the plugin integrations trying to iron out the kinks in our configuration. Setting up the Ray integration has been very straightforward, and the example in the docs is mostly working; however, the RayJob, and therefore the RayCluster, is not deleted once the Ray task has successfully finished. The Flyte role has permissions to do this:
    - apiGroups:
        - ray.io
          resources:
            - rayjobs
          verbs:
            - "*"
    And the RayJob is in a SUCCEEDED state
    apiVersion: ray.io/v1alpha1
    kind: RayJob
    metadata:
      creationTimestamp: '2023-07-18T20:09:53Z'
      finalizers:
        - ray.io/rayjob-finalizer
      name: f9865b58322e24b91a6d-n0-0
      namespace: flyte-playground-development
      ownerReferences:
        - apiVersion: flyte.lyft.com/v1alpha1
          blockOwnerDeletion: true
          controller: true
          kind: flyteworkflow
          name: f9865b58322e24b91a6d
          uid: dd5f635e-73b8-4641-b36f-96a47b39ce31
      resourceVersion: '391072831'
      uid: 6d09ea85-24f4-4191-a66c-098ceab3ad27
      ...
    status:
      endTime: '2023-07-18T20:10:13Z'
      jobDeploymentStatus: Running
      jobId: f9865b58322e24b91a6d-n0-0-9jcjv
      jobStatus: SUCCEEDED
    There isn't anything in the logs to suggest Propeller is having an issue removing it. I guess my question is this: is it Flyte or Ray that is responsible for cleaning up the RayJob/RayCluster? I'm running Flyte 1.7.0 and ray-operator 1.5.2, which I've seen others say is working for them. Any ideas?
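    For context, a hedged sketch of the KubeRay-side fields on the RayJob spec that control this cleanup; whether the Flyte plugin version in use sets or exposes them is exactly the open question here:
    apiVersion: ray.io/v1alpha1
    kind: RayJob
    spec:
      # Tear the RayCluster down once the job reaches a terminal state
      shutdownAfterJobFinishes: true
      # Optional grace period (seconds) before the cluster is deleted
      ttlSecondsAfterFinished: 300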
  • d

    dry-egg-91175

    07/24/2023, 8:39 PM
    Hi folks, can I get some feedback on the following issues: • [Core feature] Different resources for Ray head pod and worker pods • [Core feature] Namespace configuration for Ray plugin
    👍 1
  • q

    quiet-grass-7164

    08/28/2023, 5:34 PM
    Hi Flyte folks, In our organization, we have a strict requirement to label all of our Kubernetes resources. Using Flytectl, I've successfully labeled the head and worker pods for Ray. However, these labels are not being propagated to the Kubernetes Jobs managed by the RayCluster. Could you provide guidance on how to label these Jobs through Flyte? Thank you https://github.com/flyteorg/flyteplugins/blob/master/go/tasks/plugins/k8s/ray/ray.go#L144C2-L144C9
  • f

    future-notebook-79388

    08/30/2023, 5:01 AM
    Hi, I am performing XGBoost tuning with the tune.run API in Flyte. The issue is that more than half of the trials fail because a worker dies unexpectedly in between. After this error, the worker and head pods are terminated, a new Ray cluster is created, and execution starts again. What could be the reason for workers failing repeatedly in the middle of the process? Even with a very small number of trials, e.g. 2, I see the same issue: one trial executes successfully and the other fails with the same error message. I can't figure out why this is happening, and because of it lots of trials fail even though there is no issue with the code.
    Failure # 1 (occurred at 2023-08-25_05-10-14)
     Traceback (most recent call last):
     File "/tmp/ray/session_2023-08-25_05-08-10_330236_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 934, in get_next_executor_event
       future_result = ray.get(ready_future)
     File "/tmp/ray/session_2023-08-25_05-08-10_330236_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
       return func(*args, **kwargs)
     File "/tmp/ray/session_2023-08-25_05-08-10_330236_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/worker.py", line 1833, in get
       raise value
     ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
           class_name: ImplicitFunc
           actor_id: 7ce680c3be6578ac3b02370c02000000
           pid: 131
           namespace: c2845d95-7689-447a-ab70-b45ab9bb75b8
           ip: 172.22.1.70
     The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR_EXIT
    Failure # 1 (occurred at 2023-08-24_15-04-28)
     Traceback (most recent call last):
     File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 934, in get_next_executor_event
       future_result = ray.get(ready_future)
     File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
       return func(*args, **kwargs)
     File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/worker.py", line 1833, in get
       raise value
     ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
     traceback: Traceback (most recent call last):
     File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/exceptions.py", line 38, in from_ray_exception
       return pickle.loads(ray_exception.serialized_exception)
     File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/mlflow/exceptions.py", line 83, in __init__
       error_code = json.get("error_code", ErrorCode.Name(INTERNAL_ERROR))
     AttributeError: 'str' object has no attribute 'get'
     
    The above exception was the direct cause of the following exception:
     
    Traceback (most recent call last):
     File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/serialization.py", line 340, in deserialize_objects
       obj = self._deserialize_object(data, metadata, object_ref)
     File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/serialization.py", line 260, in _deserialize_object
       return RayError.from_bytes(obj)
     File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/exceptions.py", line 32, in from_bytes
       return RayError.from_ray_exception(ray_exception)
     File "/tmp/ray/session_2023-08-24_15-02-51_915721_9/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.8/site-packages/ray/exceptions.py", line 41, in from_ray_exception
       raise RuntimeError(msg) from e
     RuntimeError: Failed to unpickle serialized exception
    Can anyone suggest some ways to resolve this, and also confirm whether this is an issue on the Ray side?
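    One hedged first check: SYSTEM_ERROR_EXIT on Tune actors is very often the worker being killed for memory, which the pod status records. For a worker pod of the failed run (names are placeholders):
    # Prints e.g. "OOMKilled" if the Ray worker container was killed for memory
    kubectl -n <namespace> get pod <ray-worker-pod> \
      -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}'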
  • a

    acceptable-jackal-25563

    08/31/2023, 3:37 PM
    Hey folks. I have a Ray Cluster running on GCP and am relatively new to MLOps -- from reading this blog post https://flyte.org/blog/ray-and-flyte I did not fully grasp the benefits of running Ray inside Flyte -- can some users of both explain the benefits to me?
  • b

    brash-piano-42461

    09/06/2023, 12:50 PM
    Hey team, in the documentation "Ray and Flyte: Distributed Computing and Orchestration" it is written that
    Flyte starts a Ray dashboard by default that provides cluster metrics and logs across many machines in a single pane as well as Ray memory utilization while debugging memory errors. The dashboard helps Ray users understand Ray clusters and libraries.
    But I don't see a Ray dashboard; I just see the Flyte console.
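    A hedged way to reach the dashboard while a task's cluster is up, independent of whether a dashboard link is wired into the Flyte console: the head pod serves it on port 8265, so a port-forward against the per-execution head service (placeholder name below) works:
    kubectl -n <namespace> port-forward svc/<raycluster-head-svc> 8265:8265
    # then open http://localhost:8265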
  • b

    brash-piano-42461

    09/11/2023, 10:56 AM
    Hey team, when I run Flyte with the Ray integration, the pods tend to stay in Pending state and in the Flyte console the task keeps on running.

    https://flyte-org.slack.com/files/U05RR32SN00/F05RNV5KE4D/screenshot_2023-09-11_at_2.11.01_pm.png

    This is what the logs of the pending pods show:
    Defaulted container "ray-worker" out of: ray-worker, init-myservice (init)
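    A hedged first diagnostic for Pending pods (the pod name is a placeholder): the Events section of kubectl describe states why the scheduler cannot place the pod, e.g. insufficient CPU/memory, an unsatisfiable affinity, or an exceeded quota.
    kubectl -n <namespace> describe pod <pending-ray-worker-pod>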
  • b

    brash-piano-42461

    09/12/2023, 7:30 AM
    Hey team, when I try running my code with the Ray plugin enabled, the pod seems to be running fine and there are no issues in the log, but the code keeps on running and I never get the output; the same code runs fine locally. Can you please look into it? Thanks! Code:
    import typing
    
    from flytekit import ImageSpec, Resources, task, workflow
    
    custom_image = ImageSpec(
        name="ray-flyte-plugin",
        registry="anirudh1905",
        packages=["flytekitplugins-ray"],
    )
    
    if custom_image.is_container():
        import ray
        from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
    
    @ray.remote
    def f1(x):
        return x * x
    
    @ray.remote
    def f2(x):
        return x%2
    
    ray_config = RayJobConfig(
        head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
        worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)],
        runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
    )
    
    @task(cache=True, cache_version="0.2",
        task_config=ray_config,
        requests=Resources(mem="2Gi", cpu="1"),
        container_image=custom_image,
    )
    def ray_task(n: int) -> int:
        futures = [f2.remote(f1.remote(i)) for i in range(n)]
        return sum(ray.get(futures))
    
    
    @workflow
    def ray_workflow(n: int) -> int:
        return ray_task(n=n)
    project_config.yaml
    domain: development
    project: flytesnacks
    defaults:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "3"
      memory: "8Gi"
    I also tried with KubeRay versions 0.3 and 0.5.2; it is not working with either.
  • g

    gorgeous-beach-23305

    09/13/2023, 9:58 AM
    Hi, we are executing Ray tasks from Flyte and we want to link the Ray dashboard on the task UI. Ray automatically creates an ingress and we are able to access the dashboard by looking at the ingress details, but we are not sure how to surface this in the Flyte UI. Any ideas? We can access the Ray dashboard like so - http://<some-code>.elb.eu-central-1.amazonaws.com/*atvxdf4jvppwptx6ss9w-n0-0*-raycluster-j7xhw/
  • t

    thankful-tailor-28399

    09/13/2023, 1:41 PM
    Team! One doubt here: is there a way to define resources for the Ray workers being created? If I got it right, we can define the number of replicas, but I did not see how to set the resources for those workers:
    @dataclass
    class WorkerNodeConfig:
        group_name: str
        replicas: int
        min_replicas: typing.Optional[int] = None
        max_replicas: typing.Optional[int] = None
        ray_start_params: typing.Optional[typing.Dict[str, str]] = None
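    A hedged reading of this dataclass: WorkerNodeConfig itself carries no resource fields, so the requests/limits set on the @task are what end up on the Ray pods (which is also why an earlier message in this channel asks for separate head/worker resources as a feature). A minimal sketch under that assumption:
    from flytekit import Resources, task
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

    ray_config = RayJobConfig(
        head_node_config=HeadNodeConfig(),
        worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
    )

    # The task-level requests/limits are the available knob; head and worker pods
    # are assumed to share these values in this plugin version.
    @task(
        task_config=ray_config,
        requests=Resources(mem="8Gi", cpu="4"),
        limits=Resources(mem="8Gi", cpu="4"),
    )
    def tune_task() -> None:
        ...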
  • t

    thankful-tailor-28399

    09/14/2023, 9:59 AM
    Hey! I'm following this guide to configure Ray. However, no Ray cluster is being created when I launch tasks. Is there anything else I should configure? Permissions, for example? Or do I need to install the API server when installing the KubeRay operator?
    Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
  • g

    glamorous-carpet-83516

    09/15/2023, 12:26 AM
    cc @freezing-boots-56761
  • a

    agreeable-thailand-50345

    09/15/2023, 4:52 PM
    Posting a solution from another channel: https://flyte-org.slack.com/archives/CP2HDHKE1/p1694796138166579?thread_ts=1677661973.390819&cid=CP2HDHKE1
  • f

    future-notebook-79388

    11/06/2023, 8:11 AM
    Hi, I am facing issues while performing Ray tuning in Flyte. I am using KubeRay operator v0.5.2. Using RayJobConfig, I am requesting a head node and 1 worker node:
    ray_config = RayJobConfig(
          head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
          worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)],
          runtime_env={"pip": ["numpy", "pandas"]},
    )
    For instance: the pod is assigned to a node with IP 172.22.1.123, and that node has secondary IPs 172.22.1.234 and 172.22.1.456. The trials of the tuning process run on the secondary IPs. The trials running on 172.22.1.234 run and produce proper results, but the trials running on the other secondary IPs fail with an error (screenshot attached). Why are the trials being assigned to the secondary IPs, and why do only the trials on that single IP pass while the trials assigned to the other IPs fail with the attached error?
  • a

    average-finland-92144

    02/27/2025, 6:14 PM
    archived the channel