numerous-refrigerator-74026
06/05/2025, 6:09 PM
ripe-battery-86055
06/06/2025, 3:34 PM
flytectl, following a period where Propeller was overloaded.
After increasing Propeller’s resources, some of the pending jobs completed successfully and some ABORTING jobs transitioned to the ABORTED state. However, a large number of jobs still remain stuck in the ABORTING phase.
ripe-battery-86055
06/06/2025, 3:36 PM
powerful-australia-73346
06/09/2025, 6:53 AM
try/except block is used inside workflow, which should not be possible, as it is plain Python inside a workflow, no? It should only work when executed locally, not on remote. At least I can't get it to work.
notification_task = WebhookTask(
    name="failure-notification",
    url="https://hooks.slack.com/services/xyz",  # your Slack webhook
    method=http.HTTPMethod.POST,
    headers={"Content-Type": "application/json"},
    data={"text": "Workflow failed: {inputs.error_message}"},
    dynamic_inputs={"error_message": str},
    show_data=True,
    show_url=True,
    description="Send notification on workflow failure"
)
...
@fl.workflow
def ml_workflow_with_failure_handling() -> float:
    try:
        X, y = load_and_preprocess_data()
        model = train_model(X=X, y=y)
        accuracy = evaluate_model(model=model, X=X, y=y)
        return accuracy
    except Exception as e:
        # Trigger the notification task on failure
        notification_task(error_message=str(e))
        raise
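A toy model of why the try/except above cannot fire on remote, assuming (as the question suggests) that a workflow body only wires task calls into a graph and the tasks themselves run later. This is plain Python, not flytekit; ToyNode, toy_task, toy_workflow, and run_graph are hypothetical names used only for illustration.

```python
"""Toy model (not flytekit): a workflow body only records nodes; tasks run later."""
from typing import Callable, List


class ToyNode:
    """A recorded task call; nothing is executed when the node is created."""

    def __init__(self, fn: Callable[[], object]) -> None:
        self.fn = fn


GRAPH: List[ToyNode] = []


def toy_task(fn: Callable[[], object]) -> Callable[[], ToyNode]:
    """Hypothetical decorator: calling the task appends a node instead of running fn."""

    def record() -> ToyNode:
        node = ToyNode(fn)
        GRAPH.append(node)
        return node

    return record


@toy_task
def failing_step() -> None:
    raise RuntimeError("task failed at run time")


caught_at_compile_time = False


def toy_workflow() -> None:
    """Mirrors the try/except pattern: the except branch never sees the task error."""
    global caught_at_compile_time
    try:
        failing_step()  # only records a node, so nothing can raise here
    except Exception:
        caught_at_compile_time = True


def run_graph() -> List[str]:
    """Executes the recorded nodes later, outside the workflow's try/except."""
    errors = []
    for node in GRAPH:
        try:
            node.fn()
        except Exception as exc:
            errors.append(str(exc))
    return errors


toy_workflow()            # "compile" step: builds the graph
run_errors = run_graph()  # "execution" step: the failure surfaces here
```

In this model the failure only shows up in run_errors, so a failure hook has to be attached to the graph rather than written as an except branch. flytekit's @workflow decorator accepts an on_failure argument for that purpose; the exact handler signature should be checked against the flytekit documentation.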
numerous-refrigerator-74026
06/09/2025, 6:40 PM
timestamp: 2025-06-05 22:36:50,156
level: ERROR
message: !! Begin Unknown System Error Captured by Flyte !!
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/flytekit/bin/entrypoint.py", line 179, in dispatch_execute
    outputs = task_def.dispatch_execute(ctx, idl_input_literals)
  File "/usr/local/lib/python3.11/site-packages/flytekit/core/base_task.py", line 728, in dispatch_execute
    new_user_params = self.pre_execute(ctx.user_space_params)
  File "/home/flytekit/.local/lib/python3.11/site-packages/flytekitplugins/ray/task.py", line 81, in pre_execute
    ray.init(**init_params)
  File "/home/flytekit/.local/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/flytekit/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 1780, in init
    _global_node = ray._private.node.Node()
  File "/home/flytekit/.local/lib/python3.11/site-packages/ray/_private/node.py", line 365, in __init__
    self.start_ray_processes()
  File "/home/flytekit/.local/lib/python3.11/site-packages/ray/_private/node.py", line 1500, in start_ray_processes
    ) = ray._private.services.determine_plasma_store_config()
  File "/home/flytekit/.local/lib/python3.11/site-packages/ray/_private/services.py", line 2198, in determine_plasma_store_config
    raise ValueError(
ValueError: Attempting to cap object store memory usage at 46216396 bytes,
but the minimum allowed is 78643200 bytes.
timestamp: 2025-06-05 22:36:50,157
level: ERROR
message: !! End Error Captured by Flyte !!
This is my Ray task:
@task(
    container_image=CONTAINER_IMAGE,
    task_config=RayJobConfig(
        # Configure the Ray cluster for this task
        head_node_config=HeadNodeConfig(
            ray_start_params={
                "log-color": "True",
                "object-store-memory": "100_000_000",
            },
        ),
        worker_node_config=[
            WorkerNodeConfig(
                ray_start_params={
                    "object-store-memory": "100_000_000",
                },
                replicas=2,
                group_name="worker-group",
            )
        ],  # 2 Ray workers
        shutdown_after_job_finishes=True,  # Clean up Ray cluster after task
        ttl_seconds_after_finished=120,  # Keep cluster for 2 mins after completion for debugging
    ),
    requests=Resources(cpu="1", mem="1Gi"),  # Resources for the task pod
    limits=Resources(cpu="1", mem="1Gi"),  # Maximum resources
)
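For reference, the numbers in the traceback above can be compared directly. A minimal sketch in plain Python, taking the 78643200-byte minimum from the error message itself rather than from Ray's source; check_object_store_memory is a hypothetical helper, not a Ray or flytekit API.

```python
# Minimum quoted in the ValueError above; taken from the error text, not from Ray's source.
MIN_OBJECT_STORE_BYTES = 78_643_200


def check_object_store_memory(requested_bytes: int) -> bool:
    """Return True if the requested size meets the minimum from the error message."""
    return requested_bytes >= MIN_OBJECT_STORE_BYTES


# The value configured in ray_start_params for the head and worker nodes.
configured_ok = check_object_store_memory(int("100_000_000"))

# The value reported in the error raised when ray.init() ran in pre_execute.
reported_ok = check_object_store_memory(46_216_396)

# How far below the minimum the reported value is, in MiB.
shortfall_mib = (MIN_OBJECT_STORE_BYTES - 46_216_396) / (1024 * 1024)
```

The configured 100_000_000 clears the minimum, while the 46216396 in the error does not. Since the traceback shows the failure in ray.init() called from the plugin's pre_execute, the failing value appears to come from that call rather than from the ray_start_params shown here, though that is only an inference from the traceback.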
gentle-tomato-480
06/10/2025, 8:01 AM
gentle-tomato-480
06/10/2025, 8:01 AM
crooked-holiday-38139
06/10/2025, 11:32 AM
ambitious-airplane-25777
06/10/2025, 1:56 PM
mysterious-painter-66441
06/10/2025, 7:23 PM
remote = FlyteRemote(
    config=Config.auto(config_file=str(Path(__file__).parent.parent / "flytekit.config")),
    default_project=PROJECT_NAME,
    default_domain=PROJECT_DOMAIN,
)
wf = remote.fetch_workflow(
    name="workflows.test.wf", version=version, domain=PROJECT_DOMAIN, project=PROJECT_NAME
)
lp_name = "test_lp"
wf_lp = LaunchPlan.get_or_create(
    name=lp_name,
    workflow=wf,
    default_inputs={
        "input_num": 1
    },
    schedule=CronSchedule(schedule="0 0 1 * *"),
)
early-addition-41415
06/10/2025, 8:29 PM
mammoth-quill-44336
06/11/2025, 3:44 PM
square-carpet-13590
06/12/2025, 2:25 PM
{"json":{"exec_id":"a96pdwqvmh4gpb4cw8cz","node":"n14","ns":"flyte-pai-staging","res_ver":"620035405","routine":"worker-15","wf":"flyte-pai:staging:optimization_engine_workflows.wf.main_wf"},"level":"error","msg":"failed Execute for node. Error: failed at Node[n14]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [Conflict] failed to create resource, caused by: Operation cannot be fulfilled on resourcequotas \"project-quota\": the object has been modified; please apply your changes to the latest version and try again","ts":"2025-06-11T07:57:26Z"}
cool-belgium-90703
06/12/2025, 7:21 PM
Garbage collector is disabled, as ttl [0] is <=0
Why would a TTL == 0 disable the GC? I just want the pods to be cleaned up immediately, not disable the GC.
curved-whale-1505
06/14/2025, 4:57 PM
  File "/tmp/ray/session_2025-06-14_16-47-27_250220_1/runtime_resources/pip/14a3daa1f865c8d76d1b356d563406ae44a38d58/virtualenv/lib/python3.11/site-packages/flytekit/core/data_persistence.py", line 614, in async_get_data
    raise FlyteDownloadDataException(
flytekit.exceptions.system.FlyteDownloadDataException: SYSTEM:DownloadDataError: error=Failed to get data from s3://mybucket/flytesnacks/development/NF2PQ3P4WNNOVWDFEX6OAYWBD4======/fast9ba0043893d5986322db7a23c8d8bbd0.tar.gz to ./ (recursive=False).
Original exception: Unable to locate credentials
I’m running the example as described in this document: https://www.union.ai/docs/flyte/integrations/native-backend-plugins/ray-plugin/ray-example/
And I installed Ray on the cluster as described in this document: https://docs.flyte.org/en/latest/deployment/plugins/k8s/index.html#deployment-plugin-setup-k8s
What am I missing here?
square-agency-59624
06/16/2025, 2:55 PM
mammoth-mouse-1111
06/18/2025, 8:32 PM
flyte-core (revision 1.16.0-b2). I'm deploying this on my on-prem system.
I'm running the following port forwards:
kubectl -n flyte port-forward service/flyteadmin 8089:81
kubectl -n flyte port-forward service/flyteconsole 8088:80
I'm getting the following error on the console page:
Error: invalid wire type 4 at offset 1
Inspecting the browser console, the first error I run into is:
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://calpha-cluster-manager/me. (Reason: CORS request did not succeed). Status code: (null).
If anyone knows how to get around this, I'd greatly appreciate it. Thanks 🙏
high-night-5814
06/20/2025, 6:36 PM
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/flytekit/exceptions/scopes.py", line 165, in system_entry_point
    return wrapped(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/flytekit/core/base_task.py", line 530, in dispatch_execute
    raise type(exc)(msg) from exc
Message:
Failed to convert inputs of task 'workflows.example_intro.train_model':
'NoneType' object has no attribute 'DESCRIPTOR'
SYSTEM ERROR! Contact platform administrators.
worried-airplane-87065
06/20/2025, 6:37 PM
@dynamic workflows to fanout with max-parallelism=200, but we're seeing a great deal of latency in workflow progress. Looking at https://www.union.ai/docs/flyte/deployment/flyte-configuration/performance/ and https://www.union.ai/docs/flyte/user-guide/core-concepts/workflows/subworkflows-and-sub-launch-plans/, it looks like we can achieve similar concurrency by invoking a sub launch plan 100-1500 times and increasing the free worker count for FlytePropeller. We have explored map_tasks, but it's a little too restrictive for our use case. Has anyone been in a similar situation and would be willing to share how they approached the fanout issue?
Using sub launch plans for fanout:
import flytekit as fl
@fl.task
def my_gpu_task() -> None:
    pass
@fl.task
def my_cpu_task() -> None:
    pass
@fl.workflow
def my_workflow() -> None:
    # Run the CPU task only after the GPU task has finished
    my_gpu_task() >> my_cpu_task()
my_workflow_lp = fl.LaunchPlan.get_or_create(my_workflow)
@fl.dynamic
def dynamic_lp(num_fanout: int) -> None:
    # Each call adds one sub launch plan execution; my_workflow has no outputs to collect
    for _ in range(num_fanout):
        my_workflow_lp()
millions-plastic-44322
06/23/2025, 10:29 AM
pyflyte run --remote --project jiri-test --domain development hello_world.py hello_world_wf
it fails with
FlyteSystemException: SYSTEM:Unknown: error=None, cause=<_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "failed to create a signed url. Error: NoCredentialProviders: no valid providers in chain. Deprecated.
    For verbose messaging see aws.Config.CredentialsChainVerboseErrors"
    debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"failed to create a signed url. Error: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors", grpc_status:13}"
What's wrong? Is it a problem with the S3 connection? I can access it from the console (aws s3 ls works), and I'm logged in using SSO...
gentle-tomato-480
06/23/2025, 12:47 PM
magnificent-byte-14700
06/23/2025, 6:35 PM
microscopic-furniture-57275
06/23/2025, 6:35 PM
steep-nest-3156
06/25/2025, 3:15 PM
freezing-airport-6809
red-match-4610
06/26/2025, 8:18 PM
flat-waiter-82487
06/27/2025, 7:54 AM
FlyteDirectory.download()? We have an S3 directory with ~200k files, 2-20 kB each (which, to be honest, is not that large by modern standards), and FlyteDirectory.download() fails due to:
An error occurred (RequestTimeTooSkewed) when calling the GetObject operation: The difference between the request time and the current time is too large.
clean-glass-36808
06/27/2025, 9:31 PM
early-napkin-90297
06/29/2025, 2:36 PM
ImageSpec.is_container() and I found this old thread interesting:
https://discuss.flyte.org/t/15616465/hey-guys-how-do-i-add-new-packages-to-the-project-in-flyte-i#01266f7b-9c80-4c3c-907e-2f8e47325480
TL;DR: pyflyte expects the dependency (cv2 in this case) to be installed locally even though it's guarded by if cv2_image_spec.is_container():, which, in my understanding, should imply that the dependency is only needed at run time in the task pod.
This happens because the ExecutionState.mode is None and is_container() returns True when running pyflyte run --remote or pyflyte register. I've confirmed this behavior with flytekit==1.16.1. Is this the expected behavior or a bug? If the former, is a late import within the task definition the only recourse for this scenario?
dazzling-shampoo-1177
07/02/2025, 1:56 AM
pyflyte create api-key admin --name my-admin-key
but got the error
No such command 'create'.
From looking at the main branch on GitHub, it does look like the create command doesn't exist. Can anyone tell me how to create an API key?