# troubleshooting
I am observing the following exception in one of our jobs:
```
File already exists: /tmp/flink-dist-cache-05e9a489-4829-4cc9-ab22-ae97c12d68d7/b3f2cc1d95d2b57669500d9752f89a50/spsp-bfso/sp-bfso-0.0.2-ubuntu_22_04.zip
```
`sp-bfso-0.0.2-ubuntu_22_04.zip` is the UDF attached to the job. We download this UDF from S3 and add it as a Python archive in the table environment, roughly as in the sketch below. When the job fails and restarts for any reason, it downloads the UDF again and then starts failing with the above exception. We are using a restart strategy. My understanding is that the restart strategy causes the job graph to be rebuilt, which triggers the download and submission of the plugin via the Python archive, and since the plugin already exists for the same jobId, this causes the error. Any idea how we can overcome this problem? Is there a way to detect whether the file is already there, or to clear up the directory whenever the job fails?
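For context, our registration flow looks roughly like this (the bucket, key, and local paths here are illustrative, not our real names). Note that the local existence check only avoids a redundant re-download; the collision itself happens later, inside the `flink-dist-cache` directory that Flink manages:

```python
import os

import boto3
from pyflink.table import EnvironmentSettings, TableEnvironment

# Illustrative names; the real bucket/key/paths differ.
BUCKET = "my-udf-bucket"
KEY = "sp-bfso/sp-bfso-0.0.2-ubuntu_22_04.zip"
LOCAL_PATH = "/tmp/udfs/sp-bfso-0.0.2-ubuntu_22_04.zip"


def fetch_archive_if_missing() -> str:
    """Download the UDF archive from S3 only if it is not already on disk."""
    if not os.path.exists(LOCAL_PATH):
        os.makedirs(os.path.dirname(LOCAL_PATH), exist_ok=True)
        boto3.client("s3").download_file(BUCKET, KEY, LOCAL_PATH)
    return LOCAL_PATH


table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register the zip as a Python archive with the table environment.
table_env.add_python_archive(fetch_archive_if_missing())
```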
Obviously, I can overcome this problem by adding a UUID to the file name, but this will cause disk issues, as duplicate copies of the plugin will accumulate in the cluster.
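For completeness, the UUID workaround I mean would look roughly like the sketch below. Pruning our own older downloads limits growth in the local download directory, but the copies under `flink-dist-cache` are managed by Flink itself, which is why I would rather avoid this approach:

```python
import glob
import os
import uuid

# Illustrative location and base name for the downloaded archive.
DOWNLOAD_DIR = "/tmp/udfs"
BASE_NAME = "sp-bfso-0.0.2-ubuntu_22_04"


def unique_archive_path() -> str:
    """Build a per-submission file name so each restart registers a distinct file."""
    return os.path.join(DOWNLOAD_DIR, f"{BASE_NAME}-{uuid.uuid4().hex}.zip")


def prune_old_copies(keep: str) -> None:
    """Delete previously downloaded copies so duplicates don't pile up locally."""
    for path in glob.glob(os.path.join(DOWNLOAD_DIR, f"{BASE_NAME}-*.zip")):
        if path != keep:
            os.remove(path)
```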