Hey, I am having problem to deploy my pyflink app ...
# troubleshooting
n
Hey, I'm having a problem deploying my PyFlink app to a k3s cluster using the Flink Kubernetes Operator. It seems the Python deployment example from the flink-kubernetes-operator link here doesn't work for me either. I have a local k3s/k3d cluster running in Docker, and I deployed the flink-operator using Helm with this chart file. I built the image using the example Dockerfile and python_demo.py, and pushed it to a local registry. The tricky thing is that it worked fine once deployed. But then I renamed python_demo.py to word_count.py (no code change), rebuilt the image, pushed it to the registry, and deployed the app again. This time it failed with this error message in the flink-operator/python-demo pod:
Caused by: java.nio.file.NoSuchFileException: /tmp/pyflink/3e67a85b-296d-4586-aa5b-b654963e6464/7fc3a4b9-0c8f-4d6a-9178-8cff5e5c58e6/word_count.py
I looked at the PythonDriver.java code; it seems it should either create a soft link to the original word_count.py under /tmp/pyflink/xxx/yyy/word_count.py or copy the file over to the /tmp directory, so I don't understand why it complains. The other weird thing is that if I change the code in python_demo.py to do something different, the deployment still succeeds and runs the same stream job. It looks to me like the pod is running something from a pod template, not from my python-example.yaml. Has anybody run into the same issue? Or did I miss something? Thanks in advance!!
python-example.yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: python-demo
  namespace: flink-operator
spec:
  image: registry.localhost:5000/flink-python-demo:latest
  flinkVersion: v1_16
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "1"
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/opt/flink-python-1.16.1.jar # Note, this jarURI is actually a placeholder
    entryClass: "org.apache.flink.client.python.PythonDriver"
    args: ["-pyclientexec", "/usr/local/bin/python3", "-py", "/opt/flink/usrlib/word_count.py"]
    parallelism: 1
    upgradeMode: stateless
g
Can you show me your docker image definition for
registry.localhost:5000/flink-python-demo:latest
this is the one that should contain your python script
(not the Flink Kubernetes Operator)
n
Hey Gyula, I used the same Dockerfile from https://github.com/apache/flink-kubernetes-operator/blob/main/examples/flink-python-example/Dockerfile and just changed the filename to word_count.py. No code change inside the Python file.
# Check https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker for more details
FROM flink:1.16

# install python3: it has updated Python to 3.9 in Debian 11 and so install Python 3.7 from source, \
# it currently only supports Python 3.6, 3.7 and 3.8 in PyFlink officially.

RUN apt-get update -y && \
apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libffi-dev && \
wget https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz && \
tar -xvf Python-3.7.9.tgz && \
cd Python-3.7.9 && \
./configure --without-tests --enable-shared && \
make -j6 && \
make install && \
ldconfig /usr/local/lib && \
cd .. && rm -f Python-3.7.9.tgz && rm -rf Python-3.7.9 && \
ln -s /usr/local/bin/python3 /usr/local/bin/python && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

# install PyFlink
RUN pip3 install "apache-flink>=1.16.0,<1.17.0"

# add python script
USER flink
RUN mkdir /opt/flink/usrlib
ADD word_count.py /opt/flink/usrlib/word_count.py
$ docker build -t registry.localhost:5000/flink-python-demo:latest .
$ docker push registry.localhost:5000/flink-python-demo:latest
g
Could it be that the tmp file could not be created in your env for some reason? Sounds strange
n
yeah, that's what i am suspecting as well...
g
You could try sshing into the running container to try to reproduce it
n
That's actually what I'm struggling with right now. The k3s cluster is running in Docker, so the running container is a container inside the k3s Docker container. How can I do that?
vscode ➜ /workspaces/flux (main) $ docker ps
CONTAINER ID   IMAGE                                                COMMAND                  CREATED       STATUS      PORTS                                                                                                                                                             NAMES
b59bde4de198   ghcr.io/k3d-io/k3d-proxy:5.4.8                       "/bin/sh -c nginx-pr…"   10 days ago   Up 9 days   0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, 0.0.0.0:6443->6443/tcp, :::443->443/tcp, 0.0.0.0:9093-9094->9093-9094/tcp, :::9093-9094->9093-9094/tcp   k3d-k3s-default-serverlb
4cd1b5cf880d   rancher/k3s:v1.25.6-k3s1                             "/bin/k3d-entrypoint…"   10 days ago   Up 9 days                                                                                                                                                                     k3d-k3s-default-agent-0
0bf17549dc3f   rancher/k3s:v1.25.6-k3s1                             "/bin/k3d-entrypoint…"   10 days ago   Up 9 days                                                                                                                                                                     k3d-k3s-default-server-0
10494f92ba6c   registry:2                                           "/entrypoint.sh /etc…"   10 days ago   Up 9 days   0.0.0.0:5000->5000/tcp                                                                                                                                            registry.localhost
360322be10cd   vsc-flux-a10c43daac5c024fe6e7274f561a88ca            "/bin/sh -c 'echo Co…"   2 weeks ago   Up 9 days                                                                                                                                                                     nervous_allen
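Since k3d runs the pods under containerd inside the k3d node containers, they do not show up in `docker ps` on the host, and `kubectl exec` is the usual way in. A minimal sketch, assuming kubectl is pointed at this k3d cluster; `inspect_pyflink_tmp` is a hypothetical helper name, and the pod name has to be looked up first with `kubectl get pods -n flink-operator`:

```shell
# Hypothetical helper: list the files PythonDriver staged under /tmp/pyflink
# and the scripts baked into the image, from inside a running pod.
# $1 = namespace, $2 = pod name (look it up with: kubectl get pods -n <namespace>)
inspect_pyflink_tmp() {
  kubectl exec -n "$1" "$2" -- ls -lR /tmp/pyflink /opt/flink/usrlib
}
```

For example: `inspect_pyflink_tmp flink-operator <jobmanager-pod-name>`. If the pod is crash-looping and exec is refused, `kubectl logs -n flink-operator <pod-name>` may be the only option.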
Finally found the issue: the local registry had messed up the image version. Everything works fine now.
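One way to rule out a stale image in a setup like this is to pull the tag back from the registry and check that the script is really in it. A rough sketch, assuming docker can reach the registry from the machine the image was pushed from; `check_image_script` is a hypothetical helper, not part of Flink or k3d:

```shell
# Hypothetical helper: pull an image back from the registry and confirm
# that the expected script exists inside it.
# $1 = image reference, $2 = script path expected inside the image
check_image_script() {
  docker pull "$1" && docker run --rm --entrypoint ls "$1" -l "$2"
}
```

For example: `check_image_script registry.localhost:5000/flink-python-demo:latest /opt/flink/usrlib/word_count.py`. Pushing each build under a new tag instead of reusing `latest` may also make this kind of mix-up easier to spot.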
g
good news, thanks for the update!