# ask-ai
Andrzej Lewandowski:
Hi, we have some problems with running Airbyte on k8s. We use Karpenter for scaling, and I noticed that when workloads are scheduled onto different machines (e.g. the source on node A, the destination on the next node) the sync fails; if everything runs on a single node, it works fine. I get an error when normalization is executed:
2023-04-27 07:05:04 normalization > 21 of 68 OK created view model ***.***............................................. [SUCCESS 1 in 3.87s]
2023-04-27 07:05:04 normalization > 20 of 68 OK created view model _AIRBYTE_AIRBYTE_SCHEMA.***................................................. [SUCCESS 1 in 4.04s]
2023-04-27 07:05:04 normalization > 25 of 68 START incremental model ***.***........................................................... [RUN]
2023-04-27 07:05:04 normalization > 26 of 68 START incremental model ***.**........................................................... [RUN]
2023-04-27 07:05:04 normalization > 27 of 68 START table model ***.***................................................... [RUN]
2023-04-27 07:05:09 normalization > 27 of 68 OK created table model ***.***.............................................. [SUCCESS 1 in 4.49s]
2023-04-27 07:05:09 normalization > 28 of 68 START incremental model ***.***......................................................... [RUN]
2023-04-27 07:05:13 INFO i.a.w.p.KubePodProcess(close):760 - (pod: airbyte / normalization-snowflake-normalize-19-1-nlpfa) - Closed all resources for pod
2023-04-27 07:05:13 INFO i.a.w.n.DefaultNormalizationRunner(close):194 - Terminating normalization process...
2023-04-27 07:05:13 ERROR i.a.w.g.DefaultNormalizationWorker(run):86 - Normalization failed for job 19.
io.airbyte.workers.exception.WorkerException: Normalization process did not terminate normally (exit code: 137)
	at io.airbyte.workers.normalization.DefaultNormalizationRunner.close(DefaultNormalizationRunner.java:205) ~[io.airbyte-airbyte-commons-worker-0.43.1.jar:?]
	at io.airbyte.workers.general.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:84) ~[io.airbyte-airbyte-commons-worker-0.43.1.jar:?]
	at io.airbyte.workers.general.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:37) ~[io.airbyte-airbyte-commons-worker-0.43.1.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$6(TemporalAttemptExecution.java:202) ~[io.airbyte-airbyte-workers-0.43.1.jar:?]
	at java.lang.Thread.run(Thread.java:1589) ~[?:?]
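A note on the exit code: 137 is 128 + 9, i.e. the container received SIGKILL from outside (for example the kubelet stopping the pod, or an OOM kill) rather than the normalization process failing on its own. A quick local illustration of that convention:
```sh
# Exit code 137 = 128 + SIGKILL(9): kill a background process with
# SIGKILL and check the exit status the shell reports for it.
sleep 60 &
kill -9 $!
wait $!
echo $?   # prints 137
```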
I've noticed that the process was killed:
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  39s   default-scheduler  Successfully assigned airbyte/normalization-snowflake-normalize-17-1-qhcws to ip-10-197-18-83.eu-west-1.compute.internal
  Normal  Pulled     39s   kubelet            Container image "busybox:1.28" already present on machine
  Normal  Created    39s   kubelet            Created container init
  Normal  Started    39s   kubelet            Started container init
  Normal  Pulling    36s   kubelet            Pulling image "airbyte/normalization-snowflake:0.4.0"
  Normal  Pulled     21s   kubelet            Successfully pulled image "airbyte/normalization-snowflake:0.4.0" in 15.303836555s
  Normal  Created    21s   kubelet            Created container main
  Normal  Started    21s   kubelet            Started container main
  Normal  Pulled     21s   kubelet            Container image "alpine/socat:1.7.4.3-r0" already present on machine
  Normal  Created    21s   kubelet            Created container relay-stdout
  Normal  Started    21s   kubelet            Started container relay-stdout
  Normal  Pulled     21s   kubelet            Container image "alpine/socat:1.7.4.3-r0" already present on machine
  Normal  Created    21s   kubelet            Created container relay-stderr
  Normal  Started    21s   kubelet            Started container relay-stderr
  Normal  Pulled     21s   kubelet            Container image "curlimages/curl:7.83.1" already present on machine
  Normal  Created    21s   kubelet            Created container call-heartbeat-server
  Normal  Started    20s   kubelet            Started container call-heartbeat-server
  Normal  Killing    16s   kubelet            Stopping container main
  Normal  Killing    16s   kubelet            Stopping container call-heartbeat-server
  Normal  Killing    16s   kubelet            Stopping container relay-stdout
  Normal  Killing    16s   kubelet            Stopping container relay-stderr
The kill happened immediately after call-heartbeat-server started, so I looked into that container's entrypoint:
trap "touch /termination/check" EXIT
      (set -e; while true; do curl -s 10.197.19.169:9000 &> /dev/null; sleep 1; done) &
      CHILD_PID=$!
      (while true; do if [ -f /termination/main ]; then kill $CHILD_PID; exit 0; fi; sleep 1; done) &
      wait $CHILD_PID
      EXIT_CODE=$?

      if [ -f /termination/main ]
      then
        exit 0
      else
        exit $EXIT_CODE
      fi
I confirmed that there is communication between pods on different nodes. Do you have any ideas what I can do about this issue? Or is this a limitation, and should the whole sync process run on a single node?
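(For reference, if you want to constrain where the job pods land while debugging, the Airbyte worker can apply node selectors to job pods via the `JOB_KUBE_NODE_SELECTORS` environment variable. A rough sketch with hypothetical names, assuming the OSS Helm chart's worker deployment; `pool=airbyte-jobs` and `airbyte-worker` are illustrative:)
```sh
# Illustrative sketch: keep source/destination/normalization job pods
# on one labelled node group so they are scheduled together.
# Label the node(s) the jobs should land on (hypothetical label).
kubectl label node <node-name> pool=airbyte-jobs

# Point the worker's job pods at that label (hypothetical deployment name).
kubectl set env deployment/airbyte-worker -n airbyte \
  JOB_KUBE_NODE_SELECTORS="pool=airbyte-jobs"
```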
kapa.ai:
@Andrzej Lewandowski That's a great question, but unfortunately it is too long. Can you ask me again but make it shorter? I'm here to help!
Andrzej Lewandowski:
normalization process was terminated with exit code 137
Fixed
Vicky Kwan:
Hi @Andrzej Lewandowski, we are seeing the same error message, and most of the time it shows up during the normalization step. We also use Karpenter with k8s as the backend. How did you resolve this issue?
Andrzej Lewandowski:
Hi @Vicky Kwan, we noticed that the node started draining between sync and normalization, so the node was deleted. Increasing the `ttlSecondsUntilExpired` value helps.
👍 1
Vicky Kwan:
oh thank you!
I'll try that. what value worked for you? what connector are you using?
@Andrzej Lewandowski where could I set this `ttlSecondsUntilExpired`?
oh, was it on the k8s side, not the Airbyte Helm chart?
Our `ttlSecondsUntilExpired` is currently set to 1 week. Curious what value your team set it to?
Andrzej Lewandowski:
We set it for 5 minutes; it's Karpenter configuration.
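For anyone else landing here: `ttlSecondsUntilExpired` lives in the Karpenter Provisioner spec (v1alpha5 API), not in the Airbyte Helm chart; newer Karpenter releases moved the equivalent knob to the NodePool disruption settings. A minimal sketch, assuming a provisioner named `default` and an illustrative one-week value:
```sh
# Illustrative only: raise ttlSecondsUntilExpired so Karpenter does not
# expire and drain a node in the middle of a long sync/normalization.
# "default" is a hypothetical provisioner name; 604800 s = 7 days.
kubectl patch provisioner default --type merge \
  -p '{"spec":{"ttlSecondsUntilExpired": 604800}}'
```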