# ask-ai
Andrzej Lewandowski:
Hi, we have some problems with running Airbyte on k8s. We use Karpenter for scaling, and I noticed that when workloads are scheduled onto different machines (e.g. the source on node A, the destination on the next node) the sync fails; if everything runs on a single node, it works fine. I get an error when normalization is executed:
2023-04-27 07:05:04 normalization > 21 of 68 OK created view model ***.***............................................. [SUCCESS 1 in 3.87s]
2023-04-27 07:05:04 normalization > 20 of 68 OK created view model _AIRBYTE_AIRBYTE_SCHEMA.***................................................. [SUCCESS 1 in 4.04s]
2023-04-27 07:05:04 normalization > 25 of 68 START incremental model ***.***........................................................... [RUN]
2023-04-27 07:05:04 normalization > 26 of 68 START incremental model ***.**........................................................... [RUN]
2023-04-27 07:05:04 normalization > 27 of 68 START table model ***.***................................................... [RUN]
2023-04-27 07:05:09 normalization > 27 of 68 OK created table model ***.***.............................................. [SUCCESS 1 in 4.49s]
2023-04-27 07:05:09 normalization > 28 of 68 START incremental model ***.***......................................................... [RUN]
2023-04-27 07:05:13 INFO i.a.w.p.KubePodProcess(close):760 - (pod: airbyte / normalization-snowflake-normalize-19-1-nlpfa) - Closed all resources for pod
2023-04-27 07:05:13 INFO i.a.w.n.DefaultNormalizationRunner(close):194 - Terminating normalization process...
2023-04-27 07:05:13 ERROR i.a.w.g.DefaultNormalizationWorker(run):86 - Normalization failed for job 19.
io.airbyte.workers.exception.WorkerException: Normalization process did not terminate normally (exit code: 137)
	at io.airbyte.workers.normalization.DefaultNormalizationRunner.close(DefaultNormalizationRunner.java:205) ~[io.airbyte-airbyte-commons-worker-0.43.1.jar:?]
	at io.airbyte.workers.general.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:84) ~[io.airbyte-airbyte-commons-worker-0.43.1.jar:?]
	at io.airbyte.workers.general.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:37) ~[io.airbyte-airbyte-commons-worker-0.43.1.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$6(TemporalAttemptExecution.java:202) ~[io.airbyte-airbyte-workers-0.43.1.jar:?]
	at java.lang.Thread.run(Thread.java:1589) ~[?:?]
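A note on the exit code: 137 is 128 + 9, i.e. the container received SIGKILL from outside (for example the kubelet stopping the pod, or an OOM kill) rather than the normalization process failing on its own. A quick local illustration of that convention:
```sh
# Exit code 137 = 128 + SIGKILL(9): kill a background process with
# SIGKILL and check the exit status the shell reports for it.
sleep 60 &
kill -9 $!
wait $!
echo $?   # prints 137
```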
I've noticed that the process was killed:
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  39s   default-scheduler  Successfully assigned airbyte/normalization-snowflake-normalize-17-1-qhcws to ip-10-197-18-83.eu-west-1.compute.internal
  Normal  Pulled     39s   kubelet            Container image "busybox:1.28" already present on machine
  Normal  Created    39s   kubelet            Created container init
  Normal  Started    39s   kubelet            Started container init
  Normal  Pulling    36s   kubelet            Pulling image "airbyte/normalization-snowflake:0.4.0"
  Normal  Pulled     21s   kubelet            Successfully pulled image "airbyte/normalization-snowflake:0.4.0" in 15.303836555s
  Normal  Created    21s   kubelet            Created container main
  Normal  Started    21s   kubelet            Started container main
  Normal  Pulled     21s   kubelet            Container image "alpine/socat:1.7.4.3-r0" already present on machine
  Normal  Created    21s   kubelet            Created container relay-stdout
  Normal  Started    21s   kubelet            Started container relay-stdout
  Normal  Pulled     21s   kubelet            Container image "alpine/socat:1.7.4.3-r0" already present on machine
  Normal  Created    21s   kubelet            Created container relay-stderr
  Normal  Started    21s   kubelet            Started container relay-stderr
  Normal  Pulled     21s   kubelet            Container image "curlimages/curl:7.83.1" already present on machine
  Normal  Created    21s   kubelet            Created container call-heartbeat-server
  Normal  Started    20s   kubelet            Started container call-heartbeat-server
  Normal  Killing    16s   kubelet            Stopping container main
  Normal  Killing    16s   kubelet            Stopping container call-heartbeat-server
  Normal  Killing    16s   kubelet            Stopping container relay-stdout
  Normal  Killing    16s   kubelet            Stopping container relay-stderr
The kill happened immediately after call-heartbeat-server started, so I looked into that container's entrypoint:
trap "touch /termination/check" EXIT
      (set -e; while true; do curl -s 10.197.19.169:9000 &> /dev/null; sleep 1; done) &
      CHILD_PID=$!
      (while true; do if [ -f /termination/main ]; then kill $CHILD_PID; exit 0; fi; sleep 1; done) &
      wait $CHILD_PID
      EXIT_CODE=$?

      if [ -f /termination/main ]
      then
        exit 0
      else
        exit $EXIT_CODE
      fi
I confirmed that there is communication between pods on different nodes. Do you have any ideas what I can do about this issue? Or is this a limitation, and should the whole sync process run on a single node?
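(For reference, if you want to constrain where the job pods land while debugging, the Airbyte worker can apply node selectors to job pods via the `JOB_KUBE_NODE_SELECTORS` environment variable. A rough sketch with hypothetical names, assuming the OSS Helm chart's worker deployment; `pool=airbyte-jobs` and `airbyte-worker` are illustrative:)
```sh
# Illustrative sketch: keep source/destination/normalization job pods
# on one labelled node group so they are scheduled together.
# Label the node(s) the jobs should land on (hypothetical label).
kubectl label node <node-name> pool=airbyte-jobs

# Point the worker's job pods at that label (hypothetical deployment name).
kubectl set env deployment/airbyte-worker -n airbyte \
  JOB_KUBE_NODE_SELECTORS="pool=airbyte-jobs"
```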
kapa.ai:
@Andrzej Lewandowski That's a great question, but unfortunately it is too long. Can you ask me again but make it shorter? I'm here to help!
Andrzej Lewandowski:
normalization process was terminated with exit code 137
Fixed
Vicky Kwan:
Hi @Andrzej Lewandowski, we are seeing the same error message, and most of the time it shows up during the normalization step. We also use Karpenter with k8s as the backend. How did you resolve this issue?
Andrzej Lewandowski:
Hi @Vicky Kwan, we noticed that the node started draining between sync and normalization, so the node was deleted. Increasing the `ttlSecondsUntilExpired` value helps.
👍 1
Vicky Kwan:
oh thank you!
I'll try that. what value worked for you? what connector are you using?
@Andrzej Lewandowski where could I set this `ttlSecondsUntilExpired`?
oh, was it on the k8s side, not the Airbyte Helm chart?
Our `ttlSecondsUntilExpired` is currently set to 1 week. Curious what value your team set it to?
Andrzej Lewandowski:
We set it for 5 minutes; it's Karpenter configuration.
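For anyone else landing here: `ttlSecondsUntilExpired` lives in the Karpenter Provisioner spec (v1alpha5 API), not in the Airbyte Helm chart; newer Karpenter releases moved the equivalent knob to the NodePool disruption settings. A minimal sketch, assuming a provisioner named `default` and an illustrative one-week value:
```sh
# Illustrative only: raise ttlSecondsUntilExpired so Karpenter does not
# expire and drain a node in the middle of a long sync/normalization.
# "default" is a hypothetical provisioner name; 604800 s = 7 days.
kubectl patch provisioner default --type merge \
  -p '{"spec":{"ttlSecondsUntilExpired": 604800}}'
```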