# flyte-support
  • n

    numerous-refrigerator-74026

    06/05/2025, 6:09 PM
    Hello! Does anyone have any experience using Bazel and an OCI image to properly create containers, upload them to a cloud provider registry (e.g. ECR), and then register remotely?
  • r

    ripe-battery-86055

    06/06/2025, 3:34 PM
    Hello, we noticed that many jobs have been stuck in the ABORTING state for several hours. This happened after we deleted them using flytectl, following a period where Propeller was overloaded. After we increased Propeller’s resources, some of the pending jobs completed successfully and some ABORTING jobs transitioned to the ABORTED state. However, a large number of jobs still remain stuck in the ABORTING phase.
  • r

    ripe-battery-86055

    06/06/2025, 3:36 PM
    We wanted to do overload testing on our Flyte instance and are not sure whether these ABORTING jobs have any impact on it. Can someone enlighten me on how to fix this?
  • p

    powerful-australia-73346

    06/09/2025, 6:53 AM
    I think there is a documentation error at https://www.union.ai/docs/flyte/deployment/flyte-configuration/configuring-notifications/#webhook-connector, where a try/except block is used inside a workflow. That should not be possible, as it is straight-up Python inside a workflow, no? It should only work when executed locally, not on remote. At least I can't get it to work.
    Copy code
    notification_task = WebhookTask(
       name="failure-notification",
       url="https://hooks.slack.com/services/xyz",  # your Slack webhook
       method=http.HTTPMethod.POST,
       headers={"Content-Type": "application/json"},
       data={"text": "Workflow failed: {inputs.error_message}"},
       dynamic_inputs={"error_message": str},
       show_data=True,
       show_url=True,
       description="Send notification on workflow failure"
    )
    ...
    
    @fl.workflow
    def ml_workflow_with_failure_handling() -> float:
       try:
           X, y = load_and_preprocess_data()
           model = train_model(X=X, y=y)
           accuracy = evaluate_model(model=model, X=X, y=y)
           return accuracy
       except Exception as e:
           # Trigger the notification task on failure
           notification_task(error_message=str(e))
           raise
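The concern raised here can be illustrated with a toy model in plain Python. This is a sketch under the assumption that a workflow body only records task calls when it is compiled and that the tasks run later; `toy_task`, `toy_workflow`, and `load_data` are hypothetical names, not flytekit APIs.

```python
# Toy model (not flytekit): a "task" call inside a workflow body is only
# recorded at compile time; the task function itself runs later.
recorded_calls = []


def toy_task(fn):
    """Hypothetical decorator: calling the task records it instead of running it."""
    def record_call(*args, **kwargs):
        recorded_calls.append((fn, args, kwargs))
    return record_call


@toy_task
def load_data():
    raise RuntimeError("data source unavailable")


handler_ran = False


def toy_workflow():
    global handler_ran
    try:
        load_data()          # recorded, not executed, so nothing is raised here
    except Exception:
        handler_ran = True   # never reached during compilation


toy_workflow()               # "compile" step

# "Execution" step: the recorded task runs outside the try block.
runtime_errors = []
for fn, args, kwargs in recorded_calls:
    try:
        fn(*args, **kwargs)
    except RuntimeError as exc:
        runtime_errors.append(str(exc))

print(handler_ran, runtime_errors)
```

In this model the `except` branch never runs, because the failure happens after the workflow body has finished. For failure handling on a remote cluster, the `on_failure` argument of flytekit's workflow decorator may be the mechanism to look at instead.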
  • n

    numerous-refrigerator-74026

    06/09/2025, 6:40 PM
    I’m running into an issue with my dockerfile + flyte + ray integration (🧵 ).
    Copy code
    timestamp: 2025-06-05 22:36:50,156
    level: ERROR
    message: !! Begin Unknown System Error Captured by Flyte !!
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/flytekit/bin/entrypoint.py", line 179, in dispatch_execute
        outputs = task_def.dispatch_execute(ctx, idl_input_literals)
        
      File "/usr/local/lib/python3.11/site-packages/flytekit/core/base_task.py", line 728, in dispatch_execute
        new_user_params = self.pre_execute(ctx.user_space_params)
        
      File "/home/flytekit/.local/lib/python3.11/site-packages/flytekitplugins/ray/task.py", line 81, in pre_execute
        ray.init(**init_params)
        
      File "/home/flytekit/.local/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
        return func(*args, **kwargs)
        
      File "/home/flytekit/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 1780, in init
        _global_node = ray._private.node.Node()
        
      File "/home/flytekit/.local/lib/python3.11/site-packages/ray/_private/node.py", line 365, in __init__
        self.start_ray_processes()
        
      File "/home/flytekit/.local/lib/python3.11/site-packages/ray/_private/node.py", line 1500, in start_ray_processes
        ) = ray._private.services.determine_plasma_store_config()
        
      File "/home/flytekit/.local/lib/python3.11/site-packages/ray/_private/services.py", line 2198, in determine_plasma_store_config
        raise ValueError(
    
    ValueError: Attempting to cap object store memory usage at 46216396 bytes, 
               but the minimum allowed is 78643200 bytes.
    
    timestamp: 2025-06-05 22:36:50,157
    level: ERROR
    message: !! End Error Captured by Flyte !!
    This is my ray task
    Copy code
    @task(
        container_image=CONTAINER_IMAGE,
        task_config=RayJobConfig(
            # Configure the Ray cluster for this task
            head_node_config=HeadNodeConfig(
                ray_start_params={
                    "log-color": "True",
                    "object-store-memory": "100_000_000",
                },
            ),
            worker_node_config=[
                WorkerNodeConfig(
                    ray_start_params={
                        "object-store-memory": "100_000_000",
                    },
                    replicas=2,
                    group_name="worker-group",
                )
            ],  # 2 Ray workers
            shutdown_after_job_finishes=True,  # Clean up Ray cluster after task
            ttl_seconds_after_finished=120,  # Keep cluster for 2 mins after completion for debugging
        ),
        requests=Resources(cpu="1", mem="1Gi"),  # Resources for the task pod
        limits=Resources(cpu="1", mem="1Gi"),  # Maximum resources
    )
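The byte counts in the ValueError are easier to compare in MiB. A minimal sketch in plain Python (no Ray import) that converts the two values copied from the traceback; the helper name `to_mib` is only for illustration.

```python
MIB = 1024 * 1024

# Values copied from the ValueError in the traceback.
attempted_cap_bytes = 46_216_396
minimum_allowed_bytes = 78_643_200


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


shortfall_bytes = minimum_allowed_bytes - attempted_cap_bytes

print(f"attempted cap: {to_mib(attempted_cap_bytes):.1f} MiB")
print(f"minimum:       {to_mib(minimum_allowed_bytes):.1f} MiB")
print(f"shortfall:     {to_mib(shortfall_bytes):.1f} MiB")
```

The attempted cap is well below the 100_000_000 bytes set in `ray_start_params`. Since the traceback shows the failure inside `ray.init(**init_params)` in `pre_execute`, it may be worth checking whether those head and worker settings reach that call at all, and how much memory the task pod itself is given.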
  • g

    gentle-tomato-480

    06/10/2025, 8:01 AM
    Hey, does flytekit 1.16 work with flyte-binary 1.15.x?
  • g

    gentle-tomato-480

    06/10/2025, 8:01 AM
    Since flyte 1.16 is not released yet
  • c

    crooked-holiday-38139

    06/10/2025, 11:32 AM
    We're attempting to debug a Flyte workflow on EKS where our pods complain there isn't enough ephemeral storage. Is it possible to get the pod spec that is issued to Kubernetes? The details on the "Task Details" link look close to what the pod would issue but I wondered if that's everything
  • a

    ambitious-airplane-25777

    06/10/2025, 1:56 PM
    Hi, is there a way to encode a Flyte serialized object into JSON? Like the ones that represent WorkflowExecution inputs and outputs.
  • m

    mysterious-painter-66441

    06/10/2025, 7:23 PM
    Hi all, can I fetch a workflow with a specific version from a Flyte cluster, and then create and register a launch plan for it? Something like this, but it is not working for me.
    Copy code
    remote = FlyteRemote(
        config=Config.auto(config_file=str(Path(__file__).parent.parent / "flytekit.config")),
        default_project=PROJECT_NAME,
        default_domain=PROJECT_DOMAIN,
    )
    
    wf = remote.fetch_workflow(
        name="workflows.test.wf", version=version, domain=PROJECT_DOMAIN, project=PROJECT_NAME
    )
    
    lp_name = "test_lp"
    
    wf_lp = LaunchPlan.get_or_create(
        name=lp_name,
        workflow=wf,
        default_inputs={
            "input_num": 1
        },
        schedule=CronSchedule(schedule="0 0 1 * *"),
    )
  • e

    early-addition-41415

    06/10/2025, 8:29 PM
    Has anyone deployed Flyte on Oracle Cloud?
  • m

    mammoth-quill-44336

    06/11/2025, 3:44 PM
    Hi team, wondering if it's possible to write a for loop in a workflow and run tasks in it? The parameter changes on each iteration, and I got a type error like: argument 1 must be str, not Promise
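An error of this shape typically appears when plain Python code inside a workflow receives a placeholder instead of a real value. The sketch below reproduces it with a toy `Promise` class and a hypothetical `task_stub` helper; neither is flytekit code, and the assumption is that task calls in a workflow body return placeholders while the workflow is being compiled.

```python
class Promise:
    """Toy stand-in for the placeholder a workflow sees instead of a task output."""

    def __init__(self, source: str):
        self.source = source

    def __repr__(self) -> str:
        return f"Promise({self.source!r})"


def task_stub(name: str):
    """Hypothetical helper: calling the 'task' returns a Promise, not a value."""
    def call(**kwargs):
        return Promise(f"{name}.o0")
    return call


make_path = task_stub("make_path")


def workflow_body():
    results = []
    for i in range(3):                    # a literal range is known at compile time
        results.append(make_path(index=i))
    return results


promises = workflow_body()

# Ordinary string code cannot consume the placeholder.
try:
    "data/file.csv".replace(promises[0], "x")
    error = ""
except TypeError as exc:
    error = str(exc)

print(promises)
print(error)
```

Looping over a fixed range and passing each placeholder on to another task works in this model. String handling on task outputs is usually moved inside a task, and loops whose bounds depend on runtime values are usually written in a `@dynamic` workflow.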
  • s

    square-carpet-13590

    06/12/2025, 2:25 PM
    Hi team, has anyone faced the project-quota object modification issue? It looks like a race condition: the resource quota is being updated in k8s, and Propeller rejects the node config because its version is not the same as the one with which the node was created. Thank you
    Copy code
    {"json":{"exec_id":"a96pdwqvmh4gpb4cw8cz","node":"n14","ns":"flyte-pai-staging","res_ver":"620035405","routine":"worker-15","wf":"flyte-pai:staging:optimization_engine_workflows.wf.main_wf"},"level":"error","msg":"failed Execute for node. Error: failed at Node[n14]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [Conflict] failed to create resource, caused by: Operation cannot be fulfilled on resourcequotas \"project-quota\": the object has been modified; please apply your changes to the latest version and try again","ts":"2025-06-11T07:57:26Z"}
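The "object has been modified" message is the Kubernetes API server's optimistic-concurrency conflict: a write carries the version it was based on and is rejected if the stored version has moved on. The sketch below models that with an in-memory store and a read-retry loop. `QuotaStore`, `ConflictError`, and `add_pod_with_retry` are hypothetical names for illustration and say nothing about how Propeller handles the conflict.

```python
class ConflictError(Exception):
    """Raised when a write is based on a stale version."""


class QuotaStore:
    """Toy stand-in for a versioned object such as a ResourceQuota."""

    def __init__(self):
        self.resource_version = 1
        self.used_pods = 0

    def read(self):
        return self.resource_version, self.used_pods

    def update(self, expected_version: int, new_used: int) -> None:
        if expected_version != self.resource_version:
            raise ConflictError("the object has been modified")
        self.used_pods = new_used
        self.resource_version += 1


def add_pod_with_retry(store, max_attempts=3, interfere=None) -> int:
    """Re-read and retry on conflict; returns the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        version, used = store.read()
        if interfere is not None and attempt == 1:
            interfere(store)      # simulate a concurrent writer after our read
        try:
            store.update(version, used + 1)
            return attempt
        except ConflictError:
            continue              # stale version: read again and retry
    raise ConflictError("gave up after retries")


def other_writer(store) -> None:
    version, used = store.read()
    store.update(version, used + 1)


store = QuotaStore()
attempts = add_pod_with_retry(store, interfere=other_writer)
print(attempts, store.used_pods, store.resource_version)
```

In this model the first write fails because another writer bumped the version in between, and the second attempt succeeds after a fresh read.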
  • c

    cool-belgium-90703

    06/12/2025, 7:21 PM
    Hi I see this message in my propeller logs:
    Garbage collector is disabled, as ttl [0] is <=0
    Why would a TTL == 0 disable the GC? I just want the pods to be cleaned up immediately, not disable the GC.
  • c

    curved-whale-1505

    06/14/2025, 4:57 PM
    Trying to integrate Ray and Flyte and I’m running into this error:
    Copy code
    File "/tmp/ray/session_2025-06-14_16-47-27_250220_1/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.11/site-packages/flytekit/core/data_persistence.py", line 614, in async_get_data
        raise FlyteDownloadDataException(
    flytekit.exceptions.system.FlyteDownloadDataException: SYSTEM:DownloadDataError: error=Failed to get data from s3://mybucket/flytesnacks/development/NF2PQ3P4WNNOVWDFEX6OAYWBD4======/fast9ba0043893d5986322db7a23c8d8bbd0.tar.gz to ./ (recursive=False).
    
    Original exception: Unable to locate credentials
    I’m running the example as described in this document: https://www.union.ai/docs/flyte/integrations/native-backend-plugins/ray-plugin/ray-example/ and I installed Ray on the cluster as described in this document: https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html#deployment-plugin-setup-k8s What am I missing here?
  • s

    square-agency-59624

    06/16/2025, 2:55 PM
    My input node keeps on getting stuck in queue. I'm following this guide. Not sure what I'm doing wrong, any suggestions on how I can go about this issue?
  • m

    mammoth-mouse-1111

    06/18/2025, 8:32 PM
    Hi all - I'm having trouble connecting to the console UI for flyte-core (revision 1.16.0-b2). I'm deploying this on my on-prem system. I'm running the following port forwards:
    Copy code
    kubectl -n flyte port-forward service/flyteadmin 8089:81
    
    kubectl -n flyte port-forward service/flyteconsole 8088:80
    I'm getting "Error: invalid wire type 4 at offset 1" on the console page. Inspecting the console, I get
    Copy code
    Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://calpha-cluster-manager/me. (Reason: CORS request did not succeed). Status code: (null).
    as the first error I run into. If anyone knows how to get around this, I'd greatly appreciate it. Thanks 🙏
  • h

    high-night-5814

    06/20/2025, 6:36 PM
    Hi, I am reaching out to understand an error that I keep encountering with the simple training script provided in the flyte-school repo: https://github.com/unionai-oss/flyte-school/blob/main/00-intro/workflows/example_intro.py. For some reason the workflow notebook here: https://github.com/unionai-oss/flyte-school/blob/main/00-intro/workshop.ipynb keeps failing at the train step when I push to a local cluster in Docker (i.e. with --remote). I am using a Python 3.10.12 environment, and this is the error:
    Copy code
    Traceback (most recent call last):
    
          File "/opt/venv/lib/python3.10/site-packages/flytekit/exceptions/scopes.py", line 165, in system_entry_point
            return wrapped(*args, **kwargs)
          File "/opt/venv/lib/python3.10/site-packages/flytekit/core/base_task.py", line 530, in dispatch_execute
            raise type(exc)(msg) from exc
    
    Message:
    
        Failed to convert inputs of task 'workflows.example_intro.train_model':
      'NoneType' object has no attribute 'DESCRIPTOR'
    
    SYSTEM ERROR! Contact platform administrators.
  • w

    worried-airplane-87065

    06/20/2025, 6:37 PM
    Hi Flyte community! We're running Flyte on GKE using the flyte-core deployment. I have a workflow that does GPU inference followed by CPU post-processing. I want to invoke this workflow on the order of 100-1500 times from a single workflow. Currently we're using @dynamic workflows to fan out with max-parallelism=200, but we're seeing a great deal of latency in workflow progress. Looking at https://www.union.ai/docs/flyte/deployment/flyte-configuration/performance/ and https://www.union.ai/docs/flyte/user-guide/core-concepts/workflows/subworkflows-and-sub-launch-plans/, it looks like we can achieve similar concurrency by invoking a sub-launch plan 100-1500 times and increasing the free worker count for FlytePropeller. We have explored map_tasks, but it's a little too restrictive for our use case. Has anyone been in a similar situation and would be willing to share how they approached the fanout issue? Using sub-launch plans for fanout:
    Copy code
    import flytekit as fl
    
    
    @fl.task
    def my_gpu_task() -> None:
        pass
    
    @fl.task
    def my_cpu_task() -> None:
        pass
    
    @fl.workflow
    def my_workflow() -> None:
        my_gpu_task() >> my_cpu_task()
    
    my_workflow_lp = fl.LaunchPlan.get_or_create(my_workflow)
    
    
    @fl.dynamic
    def dynamic_lp(num_fanout: int) -> list[int]:
        return [my_workflow_lp() for i in range(num_fanout)]
  • m

    millions-plastic-44322

    06/23/2025, 10:29 AM
    Hi! I've only just started with Flyte. I have it deployed in a k8s cluster, with S3 set up for object storage. When I try to run the example task with
    pyflyte run --remote --project  jiri-test --domain development hello_world.py hello_world_wf
    it fails with
    Copy code
    FlyteSystemException: SYSTEM:Unknown: error=None, cause=<_InactiveRpcError of RPC that terminated with:
            status = StatusCode.INTERNAL
            details = "failed to create a signed url. Error: NoCredentialProviders: no valid providers in chain. Deprecated.
            For verbose messaging see aws.Config.CredentialsChainVerboseErrors"
            debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"failed to create a signed url. Error: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see
    aws.Config.CredentialsChainVerboseErrors", grpc_status:13}"
    What's wrong? Is it a problem with the S3 connection? I can access it from the console (aws s3 ls works), and I'm logged in using SSO...
  • g

    gentle-tomato-480

    06/23/2025, 12:47 PM
    Hey, is there a programmatic way to abort all currently running workflowexecutions?
  • m

    magnificent-byte-14700

    06/23/2025, 6:35 PM
    Does Flyte copilot only support AWS endpoints? I'm trying to use a ContainerTask with a MinIO endpoint but getting a NoCredentialProviders error.
  • m

    microscopic-furniture-57275

    06/23/2025, 6:35 PM
    Hi all - I'm trying to upgrade to flytekit 1.16.1 (released ~2 weeks ago) to get a fix for a bug that was introduced in 1.16.0, but I see that flyte-binary, whose versions typically match flytekit releases, is not yet available for 1.16.1. The latest appears to be 1.16.0-b2. Should I stick to 1.15.3 for flyte-binary, use the beta version of 1.16.0, or wait for 1.16.1?
  • s

    steep-nest-3156

    06/25/2025, 3:15 PM
    Hi, I heard that offline batch inference is not recommended in Flyte, and that it's better to deploy the model to an endpoint and send requests there. Is that true?
  • f

    freezing-airport-6809

    06/25/2025, 9:53 PM
    https://flyte-org.slack.com/archives/CNMKCU6FR/p1750888412021649
  • r

    red-match-4610

    06/26/2025, 8:18 PM
    Hey guys, does flyte-binary support PagerDuty notifications, or is it only flyte-core? Deployed with Helm charts. I don't see any logs/notifications when doing tests.
  • f

    flat-waiter-82487

    06/27/2025, 7:54 AM
    Is there a way to speed up FlyteDirectory.download()? We have an S3 directory with ~200k files, 2-20kB each (which tbh is not THAT large by modern standards), and FlyteDirectory.download() fails due to:
    Copy code
    An error occurred (RequestTimeTooSkewed) when calling the GetObject operation: The difference between the request time and the current time is too large.
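S3 returns RequestTimeTooSkewed when the timestamp on a signed request differs from the server's clock by more than its tolerance (documented as 15 minutes). The sketch below only illustrates that check; `is_too_skewed` is a hypothetical helper. Whether the skew here comes from a drifting host clock or from requests that were signed and then waited a long time behind ~200k others is an assumption that would need to be verified.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=15)  # S3's documented tolerance


def is_too_skewed(signed_at: datetime, received_at: datetime,
                  max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if the gap between signing and receipt exceeds the tolerance."""
    return abs(received_at - signed_at) > max_skew


signed = datetime(2025, 6, 27, 7, 0, tzinfo=timezone.utc)

sent_quickly = is_too_skewed(signed, signed + timedelta(minutes=5))
sent_late = is_too_skewed(signed, signed + timedelta(minutes=16))

print(sent_quickly, sent_late)
```

If queuing turns out to be the cause, downloading in smaller batches or lowering concurrency so that requests are sent soon after they are signed may help.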
  • c

    clean-glass-36808

    06/27/2025, 9:31 PM
    Is it possible for the same workflow to be executed in parallel by propeller? It seems possible since workflow update handler will just enqueue workflows when an update happens. I suppose one of the writes to etcd would be ignored during the race but it still seems like propeller could progress the same workflow twice with same state? As long as plugins are idempotent this should be ok?
  • e

    early-napkin-90297

    06/29/2025, 2:36 PM
    Trying to understand how to use ImageSpec.is_container(), and I found this old thread interesting: https://discuss.flyte.org/t/15616465/hey-guys-how-do-i-add-new-packages-to-the-project-in-flyte-i#01266f7b-9c80-4c3c-907e-2f8e47325480 TL;DR: pyflyte expects the dependency (cv2 in this case) to be installed locally even though it's guarded by "if cv2_image_spec.is_container():", which, in my understanding, should imply that the dependency is only needed at run time in the task pod. This happens because ExecutionState.mode is None and is_container() returns True when running pyflyte run --remote or pyflyte register. I've confirmed this behavior with flytekit==1.16.1. Is this the expected behavior or a bug? If the former, is a late import within the task definition the only recourse for this scenario?
  • d

    dazzling-shampoo-1177

    07/02/2025, 1:56 AM
    Hi all, we are trying out Flyte for model training and data processing. Following the doc for authentication here, I am trying to run
    Copy code
    pyflyte create api-key admin --name my-admin-key
    but got error
    Copy code
    No such command 'create'.
    Looking at the main branch on GitHub, it does look like the create command doesn't exist. Can anyone tell me how to create an api-key?