Flyte #flyte-support

gentle-tomato-480

11/03/2025, 12:03 PM

Btw, the links in the https://www.union.ai/docs/v1/flyte/deployment/configuration-reference/ subpages (scheduler, datacatalog, flyteadmin, propeller) don't point to the page sections but instead refer back to https://www.union.ai/docs/v1/flyte/deployment/configuration-reference/

abundant-laptop-47033

11/04/2025, 9:33 PM

Hello! Is there a plan to release a 1.16 patch with this fix? We would love to try it out when it's available!

gentle-tomato-480

11/05/2025, 2:23 PM

Did flytectl v0.9.0 get removed/deprecated for the flytectl-setup-action? I was using that in my CI/CD and it was still working last week. Today I'm getting:

Error: Unable to find flytectl version "v0.9.0" for platform "Linux" and architecture "x86_64".

in my GHA workflow when running this action.

high-autumn-89220

11/10/2025, 5:08 PM

Hey all, I'm trying to get Flyte working with Okta for user and machine-to-machine auth. Has anyone been able to make Okta work with the Client Credential (ClientSecret) auth type? Does anyone know if it will work without custom auth servers on our plan? I've been struggling with this for a few weeks to no avail.

wonderful-continent-24967

11/12/2025, 12:01 AM

What could be potential reasons for a cache write error in a Flyte task? I am seeing this error in the Flyte console: "Failed to write output for this execution to cache." I looked into the datacatalog logs for the corresponding Flyte task and found nothing unusual there. Datacatalog created, updated, and deleted reservations for that task just as it did for other tasks. We are using Flyte 1.15.3.

fancy-hamburger-89099

11/12/2025, 10:10 AM

Hi, I am facing a very strange issue, and I am out of ideas. We have 4 instances of Flyte, all of which are configured the same and run the latest version. Each of them is running on a different cluster, and we route traffic using Ingress Nginx Controller, which is configured in exactly the same way on all clusters. All instances use Azure AD SSO, and all use the same App Registration/credentials. However, for some reason, one of these 4 instances does not work. When I access the URL, I get to the login page and successfully log in using Azure AD SSO, but after that every request fails with a 400 error:

400 Bad Request
Request Header Or Cookie Too Large
nginx

I tried different browsers, incognito mode, wiping cookies, everything. This only happens on that one instance, and it works without any issues on the other 3. Any ideas?
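Since nginx returns this error when a single request header line does not fit into one of its header buffers, one way to narrow the problem down is to measure how large the Cookie header for the failing instance actually is. The sketch below is only an illustration: it assumes nginx's documented default of `large_client_header_buffers 4 8k` (the ingress on the affected cluster may be configured differently), and the cookie names and sizes are made up.

```python
# Assumption: nginx default `large_client_header_buffers 4 8k`, meaning a
# single header line larger than 8 KiB is rejected with a 400 response.
NGINX_HEADER_BUFFER_BYTES = 8 * 1024


def cookie_header_size(cookies: dict[str, str]) -> int:
    """Return the byte length of the Cookie header line a browser would send."""
    value = "; ".join(f"{name}={val}" for name, val in cookies.items())
    return len(f"Cookie: {value}".encode("utf-8"))


def exceeds_buffer(cookies: dict[str, str], limit: int = NGINX_HEADER_BUFFER_BYTES) -> bool:
    """True if the Cookie header line alone is larger than one header buffer."""
    return cookie_header_size(cookies) > limit


if __name__ == "__main__":
    # Hypothetical cookies; paste the real ones from the browser dev tools instead.
    cookies = {
        "session_a": "x" * 3000,
        "session_b": "y" * 3000,
        "session_c": "z" * 3000,
    }
    print(cookie_header_size(cookies), exceeds_buffer(cookies))
```

Comparing the cookies set for the broken instance's domain with those of a working instance would show whether the difference is in the cookies themselves or in the buffer configuration.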

abundant-judge-84756

11/12/2025, 11:34 AM

Hi! 👋 We're running into an issue where executions are stuck in an ABORTING state and can't be fully terminated. The executions include a dynamic workflow step, and these dynamic workflows show 2 tasks as RUNNING; the task descriptions specify they are initializing. I think these initializing dynamic tasks are somehow blocking the workflows from resolving the abort request. Any suggestions for ways we can trigger these workflows to transition to ABORTED? We're currently running Flyte 1.15.3.

cool-waitress-85601

11/12/2025, 3:40 PM

Hi, is there a way to use podman instead of docker to build images when running pyflyte run --remote?

fierce-monitor-77717

11/13/2025, 12:20 PM

Hi, is there any plan to support Python 3.13/3.14 in flytekit anytime soon?

cool-waitress-85601

11/17/2025, 5:41 PM

Hi everyone, I'm desperately trying to set up flyte-core with an S3 bucket and provide my access key and secret key via a secret. I can't find how to do that: the documentation isn't clear on what form that secret should take, and the AI bot isn't helping and is giving contradictory and false information. Can someone please provide an example? Thanks a lot

mysterious-painter-66441

11/17/2025, 9:57 PM

Hi Flyte Team, I noticed that in Flyte UI, workflow inputs defined as structured types (e.g., dataclass) are displayed as a single opaque field rather than expanding into individual attributes. This makes it unclear to users what values are expected for each field. Could you advise if there’s a recommended approach to make structured inputs more user-friendly in the UI? For example, is there a way to automatically expand fields or provide schema hints for structured types? Thanks for your help!

brash-ram-89454

11/18/2025, 1:23 PM

Just a heads up that the Flyte v1 docs are down at the moment: https://www.union.ai/docs/v1/flyte/user-guide/

cool-waitress-85601

11/18/2025, 8:49 PM

Hi! I'm trying to figure out if/how it's possible to set up Flyte for multi-tenancy, i.e. isolate tenant workloads in separate namespaces without sharing/mounting any global secret, thus relying only on tenant-scoped secrets. Ideally tenant workloads would run under the tenant namespace. While there seems to be a way to have one propeller per tenant, thus enabling true parallelism, IIUC there doesn't seem to be any way to isolate metadata per tenant, since there's a single S3 configuration shared by admin and all propellers/task executions. That means sharing the bucket secret with all tenants, which wouldn't fit our requirements. Does anybody have any experience/recommendations to share? Thanks a lot

gray-ocean-43286

11/19/2025, 4:33 PM

Hello Gents, I am currently working on a Flyte to AWS SageMaker integration and facing problems with the idempotence_token in the create_sagemaker_deployment method of the flytekitplugin-awssagemaker_inference plugin, version 1.16.1. I am currently testing model, endpoint config, and endpoint deployment using the Flyte SageMaker plugin and passing idempotence_token=False to the create_sagemaker_deployment method. But the endpoint_config deployment task still keeps expecting the idempotence_token field in its input (which is the model_creation task's output). Copilot keeps saying this is a known bug and that I need to set it to True in order to resolve it. But when I set it to True, the model_creation task itself fails in Flyte and gives me an error like so: failed to do boto task with error: Could not find the key model_name}-{idempotence_token in {'model_path': 's3://s3-bucket/models/xgboost-model.tar.gz', 'execution_role_arn': 'arn:aws:iam::account-id:role/app-flyte-sagemaker-executor-role', 'model_name': 'xgboost-diabetes-endpoint-model'}. Having a tough time figuring this one out. I have tried multiple approaches, but all in vain. Does anyone know what this is all about?

cool-waitress-85601

11/19/2025, 4:45 PM

Hello folks, do you know what metadata goes into the project/domain-specific bucket vs the global bucket when you use raw_output_data_config? For instance, will the user's local code be uploaded to the global or the project-scoped bucket when using fast registration? More generally, what data would go in the global bucket vs the project-scoped bucket? Thanks

early-addition-41415

11/20/2025, 10:21 PM

In flyte-binary, if you are not on AWS, is there a way to provide access keys using secrets in the Helm values, so that AWS can be accessed from somewhere else?

early-addition-41415

11/20/2025, 10:22 PM

specifically here https://github.com/flyteorg/flyte/blob/master/charts/flyte-binary/values.yaml#L85-L87

early-addition-41415

11/20/2025, 10:23 PM

we need to use authtype as accesskey

fancy-twilight-30247

11/21/2025, 10:12 AM

Hey everyone- I have a question about running multi-node pytorch workflows and error/exception handling. We're currently defining our training task as something like this:

@task(
    task_config=task_config,
    cache=False,
    container_image=container_image,
    pod_template=pod_template,
    timeout=timeout,
    retries=max_retries,
)
def flyte_training_main_task():
  ...

with the task_config being (note that we don't really need the elastic part of things - we just need to launch a multi-node pytorch task):

task_config = Elastic(
    nnodes=num_nodes,
    nproc_per_node=8,
)

Now imagine that a rank in the distributed training has an error of some sort. Is there a way for us to configure our task so that the whole task/workflow is terminated (including all the pods corresponding to it) as soon as a single rank errors? Currently it seems like it requires all the ranks to exit/error until the task/workflow is terminated, which we often don't want (because other ranks might be stuck until NCCL timeout or might be stuck for other reasons). I've tried raising special exception types like SignalException and ChildFailedError, but it seems like it always waits until all the ranks exit. One hacky workaround I could think of is to manually terminate the workflow, but that also does not seem ideal. Thanks!!

👀 1

numerous-hamburger-7178

11/25/2025, 11:45 PM

Do newer versions of Flyte show pydantic inputs to workflows/tasks as something other than structs? I've been using dataclassjsonmixin to get well-formatted input in the UI and want to try switching over to pydantic, but on a Flyte 1.16.2 deployment an example wf shows up as a struct.

cool-waitress-85601

11/26/2025, 1:16 PM

Hi folks, has anybody tried using Dex as the external authorization server? I'd be interested to hear about it. Thanks

aloof-magazine-44547

12/01/2025, 10:39 AM

Hi, can I get some help to merge https://github.com/flyteorg/flytekit/pull/3339? It's about serialising and deserialising models with FlyteFile/FlyteDirectory in them, which causes an attribute error. cc @swift-oil-78197

thankful-lighter-72752

12/01/2025, 11:01 PM

Hello. Is there a recommended approach for removing older intermediate values from s3 that aren't required anymore? I have some large values returned from tasks that are taking up a lot of space. I can use PutBucketLifecycleConfiguration on the s3 side, but currently I don't want the workflow results to expire

👍 1
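An S3 lifecycle rule can be restricted to a key prefix, so one possible approach is to expire only the prefix that holds the intermediate values and leave everything else untouched. The sketch below only builds the rule in the shape that PutBucketLifecycleConfiguration accepts. The prefix and retention period are hypothetical, and whether intermediate values and final workflow results actually live under separable prefixes in a given deployment is an assumption that would need to be checked first.

```python
def intermediate_expiry_rule(prefix: str, days: int) -> dict:
    """Build a lifecycle configuration that expires objects under `prefix` only.

    `prefix` and `days` are illustrative; objects outside the prefix are not
    affected by this rule.
    """
    if not prefix:
        # An empty prefix would match every object in the bucket.
        raise ValueError("refusing to build a rule without a prefix")
    if days < 1:
        raise ValueError("days must be at least 1")
    rule_id = "expire-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical prefix and retention period.
    config = intermediate_expiry_rule("intermediate/", 30)
    print(config)
    # Applying it with boto3 would look like this (not executed here). Note
    # that this call replaces any lifecycle configuration already on the bucket:
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="my-bucket", LifecycleConfiguration=config)
```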

proud-napkin-10936

12/03/2025, 11:16 AM

Hey everyone. I'm preparing a large batch job (256 parallel tasks), and I noticed this "max parallelism" under domain settings in the Flyte console.
• What is this limit exactly?
• How can I adjust it?
Can't really find anything on it in the (legacy) docs.
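To get a feel for what such a limit would mean for a 256-task job, here is a rough back-of-the-envelope sketch. It assumes the setting caps how many tasks of one execution run at the same time and that all tasks take about the same time. The value 25 is only an example, not a statement about this deployment's configuration.

```python
import math


def waves_needed(total_tasks: int, max_parallelism: int) -> int:
    """Number of sequential batches if at most `max_parallelism` tasks run at once."""
    if max_parallelism <= 0:
        raise ValueError("max_parallelism must be positive in this sketch")
    return math.ceil(total_tasks / max_parallelism)


if __name__ == "__main__":
    # 256 tasks with an example cap of 25 concurrent tasks.
    print(waves_needed(256, 25))
    # With a cap at least as large as the job, everything runs in one batch.
    print(waves_needed(256, 256))
```

Under these assumptions, the wall-clock time of the batch scales with the number of waves rather than with the number of tasks.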

wooden-scooter-1097

12/03/2025, 9:50 PM

Hi folks, I'm starting my Flyte journey, going through the local install docs. At the point of running flytectl demo start, the flyte-sandbox-xxx and flyteconnector-xxx services never get out of the Pending state. The Docker (Rancher) logs show a few x509 CA cert issues as well as some "back-off" entries. Not sure what's going on. I am on the company VPN, which I'm not allowed to disable, so if it's a cert issue, I'm not sure how to get around it. --admin.insecure and --admin.insecureSkipVerify don't help. Ideas? Sample log entries...

2025-12-03T21:43:57.491204261Z E1203 21:43:57.491163      68 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"local-path-provisioner\" with ErrImagePull: \"failed to pull and unpack image \\\"docker.io/rancher/local-path-provisioner:v0.0.24\\\": failed to copy: httpReadSeeker: failed open: failed to do request: Get \\\"https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/10/10ada9a7f8ab578464314da2df287d1d384c6ef9f474d00dc73bf232599df55f/data?expires=1764801238&signature=KC81Pwa1VNzUPyOJ089%2BQZbYlH4%3D&version=2\\\": tls: failed to verify certificate: x509: certificate signed by unknown authority\"" pod="kube-system/local-path-provisioner-84db5d44d9-q2chh" podUID="fad13c92-96bd-4cec-b19f-0e9ade5ffb19"

...

2025-12-03T21:44:05.221227848Z E1203 21:44:05.220969      68 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"coredns\" with ImagePullBackOff: \"Back-off pulling image \\\"rancher/mirrored-coredns-coredns:1.10.1\\\"\"" pod="kube-system/coredns-6799fbcd5-27h25" podUID="1fa7b663-8c6b-492e-a816-d35a29e56e30"

fierce-oil-47448

12/03/2025, 11:54 PM

Hello. The flytectl install instructions mention:
• curl -sL https://ctl.flyte.org/install | bash

This errors out on Ubuntu:

flyteorg/flyte info checking GitHub for latest tag
flyteorg/flyte crit unable to find '' - use 'latest' or see https://github.com/flyteorg/flyte/releases for details

handsome-lock-30336

12/04/2025, 5:00 PM

Hi! How would Flyte most easily support multi-cluster/multi-cloud and compute allocation? I see discussion about a volcano plugin here. What's the general direction? cc @freezing-airport-6809 @ancient-apple-95774

nice-hairdresser-45030

12/05/2025, 2:31 PM

I have a question about the expected flytepropeller performance with a large number of pods in combination with array node/map task:
• Workflow with 4 to 5 map tasks, between 5 and 15k pods existing at the same time.
• I'm seeing that propeller sometimes doesn't look at the status of some completed pods for hours (have seen up to 10h)
  ◦ (I put print statements into plugin manager to see in which phase which resource is evaluated; they are not evaluated despite having completed, so this is not related to errors sending update events to admin)
• Sometimes the succeeded pods have been garbage collected and propeller treats the "missing" pod as a failure

I'm aware that I can prevent the last point with inject-finalizer to at least get eventual consistency. But my question is whether propeller not evaluating pods for hours in such a scenario is expected or unexpected. I know that I can shard propeller, but would this only help me if I break this down into multiple workflows? Are there any other parameters I can tune to have propeller evaluate the pods earlier? Thank you!

👀 1

melodic-mechanic-59879

12/06/2025, 10:45 PM

Hi! How can I load data from a service bus directly and use it in Flyte, please?

fierce-farmer-40956

12/08/2025, 12:29 PM

Hello, we are getting lots of these error logs:

duplicate key value violates unique constraint "tasks_pkey" (SQLSTATE 23505)

and our executions are all in UNKNOWN state. Would you have a pointer?