# ask-community-for-troubleshooting
k
Hi team! 👋 I'm running into an issue with increasing memory for connector pods in Airbyte 2.0 on Kubernetes (Helm v2 charts).

***Problem:*** My Postgres source connector is being OOMKilled with the default 2GB limit, and I need to increase it to 4GB.

***Documentation gap:*** The "Configuring Connector Resources" page (https://docs.airbyte.com/platform/operator-guides/configuring-connector-resources) says to "set the following variables in your values.yaml file" and lists:
- JOB_MAIN_CONTAINER_MEMORY_REQUEST
- JOB_MAIN_CONTAINER_MEMORY_LIMIT
- etc.

But it doesn't specify where in values.yaml these should go. After digging through GitHub issues and the Helm chart structure, I think they go under `workloadLauncher.extraEnv`, but I'm not certain.

***Questions:***
1. Is `workloadLauncher.extraEnv` the correct place for these variables in Helm v2?
2. Is there newer documentation or a different approach I should be using instead?
3. Would it be possible to update the docs to show the full values.yaml structure?

Currently using:
- Airbyte 2.0 with Helm v2 charts
- K3s cluster
- RDS Postgres → Snowflake connection

Thanks for any guidance! The product is great, just trying to get the resource configuration sorted. 🙏
k
Great questions! Here's what the knowledge sources say about configuring connector pod resources in Airbyte 2.0 with Helm v2 charts:

---

### 1. Is `workloadLauncher.extraEnv` the correct place for these variables in Helm v2?

No, for Helm v2 charts you should not use `workloadLauncher.extraEnv` for the main resource variables. In Helm v2, many environment variables that were previously set via `extraEnv` now have dedicated fields in the `values.yaml` file. For resource limits like memory and CPU, you should use the new structured fields under `global.workloads.resources.mainContainer`:

```yaml
global:
  workloads:
    resources:
      mainContainer:
        memory:
          request: 4Gi
          limit: 4Gi
        cpu:
          request: 1
          limit: 2
```

This is the recommended approach for instance-wide resource settings in Helm v2. The documentation explicitly maps the old environment variables to these new fields:

- `JOB_MAIN_CONTAINER_MEMORY_REQUEST` → `global.workloads.resources.mainContainer.memory.request`
- `JOB_MAIN_CONTAINER_MEMORY_LIMIT` → `global.workloads.resources.mainContainer.memory.limit`
- (and similarly for CPU)

[See mapping table](https://docs.airbyte.com/platform/1.8/enterprise-setup/chart-v2-enterprise#update-your-valuesyaml-file)

If you need to set resource requirements for specific job types (like check, discover, spec), you may need to use the corresponding fields under `global.jobs.kube.scheduling` or, if no direct mapping exists, use `extraEnv` as a fallback for those specific cases.

---

### 2. Is there newer documentation or a different approach I should be using instead?

Yes, the newer approach is to use the structured fields in `values.yaml` as described above. The documentation for Helm v2 charts (Airbyte 2.0+) provides a full mapping of environment variables to their new locations in the values file. This is a change from Helm v1, where `extraEnv` was commonly used.

[See the full mapping and examples](https://docs.airbyte.com/platform/1.8/enterprise-setup/chart-v2-enterprise#update-your-valuesyaml-file)

---

### 3. Would it be possible to update the docs to show the full values.yaml structure?

The documentation does provide examples of the new structure, but it could be clearer. Here's a minimal example for setting instance-wide memory for all connector pods in Helm v2:

```yaml
global:
  workloads:
    resources:
      mainContainer:
        memory:
          request: 4Gi
          limit: 4Gi
```

If you need to set job-type-specific resources (e.g., for check/discover/spec jobs), you may need to use the corresponding fields under `global.jobs.kube.scheduling` or use `extraEnv` for those specific variables if no direct mapping exists.

[See example and mapping](https://docs.airbyte.com/platform/1.8/enterprise-setup/chart-v2-enterprise#update-your-valuesyaml-file)

---

Summary:
- Use `global.workloads.resources.mainContainer.memory.request` and `.limit` for instance-wide settings in Helm v2.
- Only use `extraEnv` for variables that do not have a direct mapping in Helm v2.
- The documentation mapping table is your best reference for where to put each variable.

If you follow this structure, your connector pods (including your Postgres source) should get the increased memory allocation you need.

---

If you have a specific job type (like check/discover/spec) that still OOMs, you may need to set those job-type-specific variables as well, either via the new structured fields or, if not available, via `extraEnv` as a fallback. Let me know if you need a more detailed example for job-type-specific settings!

---

References:
- [Upgrade to Helm chart V2 (Self-Managed Enterprise) – Update your values.yaml file](https://docs.airbyte.com/platform/1.8/enterprise-setup/chart-v2-enterprise#update-your-valuesyaml-file)
- [Configuring Connector Resources](https://docs.airbyte.com/platform/operator-guides/configuring-connector-resources)
t
I can confirm the AI's suggestion 3 here as I am also battling OOM issues at the moment. For example, this sort of config works in my environment:
```yaml
global:
  workloads:
    resources:
      mainContainer:
        cpu:
          request: "2"
          limit: "4"
        memory:
          request: "6Gi"
          limit: "12Gi"
```
For reference I found this in the documentation in the “migrating helm charts v1 -> v2” guide
k
thanks @Tom Sweeting, did you find that worked for connectors? I've been having trouble getting my postgres source connector pod to ever allocate more than 2gb
t
Applying that seems to apply to the "replication-job" workers from what I can see. Increasing the memory to higher limits meant I could get past syncing tables that were previously failing at around 1GB of data loaded. In my case I'm reading from MySQL to send data to Postgres.
An additional config that may help you (it did me) is this
```yaml
worker:
  maxSyncWorkers: "4"
```
That key goes at the top level of the values.yaml. In my case I used this to play with having fewer workers with more resources vs. more workers with less. I'm leaning towards fewer workers + more resources per worker in my case.
k
great thx, maybe i'll try that. what's the default? i have a pretty beefy machine, 64gb ram, 16 cpu.
so that memory change didn't work for me.. i even tailed the airbyte-workload-launcher to see what it's doing and it shows:
```
i.a.w.l.p.KubePodClient(launchReplication):100 - [source] image: airbyte/source-postgres:3.7.0 resources: ResourceRequirements(claims=[], limits={memory=2Gi, cpu=2}, requests={memory=1Gi, cpu=1}, additionalProperties={})
```
ie using those low defaults. and the configmap shows that the values have been applied:
```
kubectl get configmap airbyte-airbyte-env -n airbyte-v2 -o yaml | grep -A 1 "JOB_MAIN_CONTAINER"
  CHECK_JOB_MAIN_CONTAINER_CPU_LIMIT: ""
  CHECK_JOB_MAIN_CONTAINER_CPU_REQUEST: ""
  CHECK_JOB_MAIN_CONTAINER_MEMORY_LIMIT: ""
  CHECK_JOB_MAIN_CONTAINER_MEMORY_REQUEST: ""
  CLUSTER_TYPE: hybrid
--
  DISCOVER_JOB_MAIN_CONTAINER_CPU_LIMIT: ""
  DISCOVER_JOB_MAIN_CONTAINER_CPU_REQUEST: ""
  DISCOVER_JOB_MAIN_CONTAINER_MEMORY_LIMIT: ""
  DISCOVER_JOB_MAIN_CONTAINER_MEMORY_REQUEST: ""
  DISCOVER_REFRESH_WINDOW_MINUTES: ""
--
  JOB_MAIN_CONTAINER_CPU_LIMIT: "4"
  JOB_MAIN_CONTAINER_CPU_REQUEST: "2"
  JOB_MAIN_CONTAINER_MEMORY_LIMIT: 8Gi
  JOB_MAIN_CONTAINER_MEMORY_REQUEST: 2Gi
  JOB_SOURCE_DECLARATIVE_MANIFEST_KUBE_NODE_SELECTORS: ""
```
kind of at a loss here 🤷
t
I had an issue like this too with the limits not showing after running `helm upgrade`. I found that running this afterwards helps to guarantee the new limits apply instantly:
```bash
kubectl rollout restart deployment -n airbyte-v2
```
k
hmm, did you have to change:
```yaml
resources:
  useConnectorResourceDefaults: true
```
to false? I am getting some guidance that maybe that is enforcing connector defaults that won't be overridden.. then I found this issue.
i'll try that command though, ty
is this maybe all due to the recent airbyte 2.0 release? I wonder if I've joined at just the wrong time 😓
t
You are welcome! I did not have to change that in my environment to see the changes apply. To confirm that the limits applied to the replication job I would do:
```bash
kubectl get pods -n airbyte-v2
```
And then pick out an example job (after starting a sync attempt) and run something like this:
```bash
kubectl get pod replication-job-XXXX-attempt-0 -n airbyte-v2 -o yaml | grep resources: -A6
```
k
thx, i actually installed k9s as part of this (not too familiar with kubernetes before) and that has been a helpful way to visualize what's going on
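in case it helps anyone else, launching it scoped to the Airbyte namespace is just a one-liner (assuming the binary is installed as `k9s` and you're using the `airbyte-v2` namespace like above):
```bash
# Open k9s directly in the Airbyte namespace to browse pods, logs, and resource usage
k9s -n airbyte-v2
```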
👍 1
t
Same, I was running `abctl` before but switched away from that myself the other day so I could get a little closer to the config/logs
k
haha same, i couldn't even tell what was happening with things failing in abctl, had no idea that pods were running out of memory...
😆 1
t
Yep.. it's a ton better installing using Helm IMO - also applying config changes is easier and faster!
(when you can find out what the changes need to be..)
k
ok i'm actually seeing the increased memory/cpu limits on the containers in the replication pod now. well, at least for 2/3..strangely the snowflake destination did not seem to pick them up
Screenshot 2025-10-20 at 2.07.30 PM.png
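fwiw, the per-container check I'm using looks like this (the pod name is a placeholder, grab the real one from `kubectl get pods` after a sync starts):
```bash
# Print each container in the replication pod with its resource limits,
# to see which of the three picked up the new values (pod name is a placeholder)
kubectl get pod replication-job-XXXX-attempt-0 -n airbyte-v2 \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits}{"\n"}{end}'
```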
t
Sounds like progress at least!! Interestingly the problem you are having may be related to my issue (mine fails at the postgres destination for larger tables)
k
i actually had progress on Friday I think by executing DB commands to airbyte's postgres to manually change the connector values there..
👍 1
but then that stopped working today 🤷 I've reached out to their sales team, maybe I can get some enterprise support... I saw somewhere in the docs that the per-connection stuff is actually configurable via UI in enterprise
t
Interesting, I saw that you can configure the per-destination resources in the DB directly, so that might be my next step to unblock my own sync now. Unfortunately for me I had a fully working incremental sync, but now I need to do a full load due to a schema change and can't get it to complete 😞
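If I go down that path, it looks like it's roughly an update to the `actor_definition` table in Airbyte's internal Postgres. A sketch of what I have in mind (table/column names and the JSON shape are taken from the connector-resources doc as I read it; host, database, credentials and the definition id are placeholders I'd still need to fill in and verify against my own instance):
```bash
# Sketch only: raise the sync-job memory for one connector definition in Airbyte's
# internal Postgres. Host/db/user and the definition id are placeholders; verify the
# schema on your own instance before running anything.
psql -h <airbyte-db-host> -U <airbyte-db-user> -d <airbyte-db-name> <<'SQL'
-- find the definition id for the destination in question
SELECT id, name FROM actor_definition WHERE name ILIKE '%postgres%';

-- then set job-specific requirements for sync jobs (replace <definition-id>)
UPDATE actor_definition
SET resource_requirements = '{"jobSpecific": [{"jobType": "sync", "resourceRequirements": {"memory_request": "2Gi", "memory_limit": "4Gi"}}]}'
WHERE id = '<definition-id>';
SQL
```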
k
maybe try disabling
```yaml
resources:
  useConnectorResourceDefaults: true
```
? I wonder if there's a specific destination default that is being set by that, and then cannot be overwritten by lower precedence level settings..
I'm actually going to check out the airbyte source code and ask Cursor how it works 😆 , maybe it'll find some clues
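if the Cursor route doesn't pan out, a plain grep over the connector metadata files should show the same thing (assuming a checkout of the airbytehq/airbyte monorepo; the paths are my guess at its layout):
```bash
# List connectors whose metadata pins custom resource requirements,
# then show what the Snowflake destination declares
# (paths assume the airbytehq/airbyte monorepo layout)
grep -l "resourceRequirements" airbyte-integrations/connectors/*/metadata.yaml
grep -A8 "resourceRequirements" airbyte-integrations/connectors/destination-snowflake/metadata.yaml
```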
👍 1
t
I'll give that a go my end… good luck checking out the source!
Cursor is a great idea… hope you have plenty of tokens available 😅
😬 1
k
Cursor confirmed my suspicions; the snowflake destination connector hardcodes a 2GB memory limit, so that overrides any high-level memory limits i put in. tldr:

🔍 **Summary**
Out of hundreds of connectors:
• Only 22 connectors (out of ~400+) define custom resource requirements
• Snowflake specifically hardcodes 2Gi for sync jobs
• Most connectors (like Postgres) have NO defaults and will use your Helm chart settings

This explains your exact issue:
• Postgres source → No metadata.yaml resourceRequirements → Uses your Helm 8GB setting
• Snowflake destination → Has metadata.yaml with 2Gi hardcoded → Overrides your Helm setting

**Recommended Fix**
Set this in your Helm values:
```yaml
global:
  workloads:
    resources:
      useConnectorResourceDefaults: false  # 🔑 This is the key!
      mainContainer:
        cpu:
          request: "2"
          limit: "4"
        memory:
          request: "2Gi"
          limit: "8Gi"
```
This will make all connectors ignore their metadata.yaml defaults and use your Helm chart settings uniformly.

**Alternative:** If you want to keep connector defaults for most connectors but override Snowflake specifically, you'd need to set it per-connection in the Airbyte UI (Connection Settings → Advanced → Resource Requirements).
t
Interesting & thanks for sharing your research here. I'm going to try that and also add anything else I find. I'm hoping not to have to switch away from Airbyte since, other than these large tables, it's been doing great!
Something I have noticed in my environment is that, in the logs, one of the worker threads seems to never flush for some reason. E.g. there are logs that start like this:
```
INFO pool-5-thread-1 i.a.c.i.d.a.DetectStreamToFlush(getNextStreamToFlush):119 flushing: trigger info: ...
```
And end like this:
```
... time trigger: false , size trigger: true current threshold b: 25 MB, queue size b: 2.24 GB, penalty b: 0 bytes, after penalty b: 2.24 GB
```
It's always the same: I have one particular thread that keeps expanding until I hit the OOM error. The log seems to suggest it's supposed to flush when it reaches 25MB, yet it never does 🤔
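I'm going to watch per-container memory while a sync runs to line that log up with actual usage, something like this (assumes metrics-server is installed; same `airbyte-v2` namespace as the earlier commands):
```bash
# Watch per-container memory/CPU of the replication job pods during a sync
# (requires metrics-server; namespace matches the earlier kubectl commands)
watch -n 15 'kubectl top pod --containers -n airbyte-v2 | grep replication-job'
```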
k
i don't immediately see the issue there
t
It falls apart for me unfortunately, since I want to sync some tables that are larger than the amount of memory I have to allocate. Even though it's supposed to be batching, in practice it seems to need at least enough memory per worker to keep the whole dataset in memory, perhaps 2x that (at least right now it does, which I presume is a memory leak or bug)
k
going into postgres?
t
Yep, it's an initial load of MySQL CDC -> Postgres
For tables that get past that the syncs are super small and fast, but the initial load is impossible for me to complete
k
and you're seeing it just run out of memory, never flushing?
t
Yeah. I see other threads flushing, but one of them keeps growing, like it's either not actually respecting the batch or leaking memory
k
if i'm understanding your issue, here's what Cursor is saying:

🔍 **Root Cause: Postgres Destination Memory Buffering**
You're experiencing a legitimate architectural issue with the Postgres destination. Here's what's happening:

**The Problem**
From the code I just analyzed, Postgres Destination (destination-postgres/build.gradle line 21):
```
BufferManager(defaultNamespace, (Runtime.getRuntime().maxMemory() * 0.2).toLong())
```
The Postgres destination buffers incoming data in memory and uses:
• 20% of JVM heap for buffer
• 25MB batch size threshold before flushing
• 10,000 records per batch when writing to database

If your container has limited memory:
1. Container gets 2GB memory (Snowflake's default we saw earlier)
2. JVM uses 75% of container = ~1.5GB heap (-XX:MaxRAMPercentage=75.0)
3. Buffer manager gets 20% of heap = ~300MB buffer
4. If MySQL source is fast and Postgres destination writes are slow, buffer fills up → OOM!

**Solutions**

**Solution 1: Increase Postgres Destination Memory (Recommended)**
In your Helm values, set larger memory limits specifically for your sync:
```yaml
global:
  workloads:
    resources:
      useConnectorResourceDefaults: false
      # For the main connector containers (MySQL source + Postgres destination)
      mainContainer:
        memory:
          limit: "8Gi"   # Much larger limit
          request: "4Gi" # Larger request
        cpu:
          limit: "4"
          request: "2"
      # For the replication orchestrator
      replication:
        memory:
          limit: "4Gi"
          request: "2Gi"
```
Why this works:
• More memory → larger buffer (20% of larger heap)
• More headroom before OOM
• Postgres can buffer more data while writing to DB

**Solution 2: Slow Down MySQL Source (Workaround)**
Reduce the MySQL source's throughput to match Postgres write speed. Edit your MySQL source configuration and add:
```yaml
# In MySQL source connector config (via UI or API)
checkpoint_target_interval_seconds: 60  # Checkpoint every minute (default 300)
```
Or in application.yml (if you control the connector build):
```yaml
throughput-bytes-per-second: 1000000  # Limit to 1MB/s instead of 10MB/s
```
Why this works:
• Source produces data slower
• Gives destination time to flush
• Reduces memory pressure

**Solution 3: Use Per-Connection Resource Override (Targeted)**
Instead of global settings, override resources for this specific connection via the Airbyte UI or API.
Via UI:
1. Edit your connection
2. Go to Advanced Settings → Resource Requirements
3. Set higher memory for destination

Via API:
```json
{
  "connectionId": "your-connection-id",
  "resourceRequirements": {
    "destination": {
      "memoryLimit": "8Gi",
      "memoryRequest": "4Gi"
    }
  }
}
```

📊 **Expected Behavior vs Your Experience**
How it SHOULD work:
```
Source (MySQL) → [streaming, 1024 row chunks]
    ↓
Orchestrator → [300MB buffer, flushes at 25MB]
    ↓
Destination (Postgres) → [writes in 10k batches]
```
What's likely happening:
```
Source (MySQL) → [streaming fast, large table]
    ↓
Orchestrator → [300MB buffer fills up FAST]
    ↓
Destination (Postgres) → [writing slow, can't keep up]
    ⚠️ BUFFER OVERFLOW → OOM
```

🔍 **Debug Commands**
To confirm this is the issue, check the orchestrator logs during sync:
```bash
# Get the replication pod
kubectl get pods -n <namespace> | grep replication

# Watch the logs for memory/buffer messages
kubectl logs -f <replication-pod> -n <namespace> | grep -E "ASYNC QUEUE INFO|memory|buffer|flush"
```
You should see messages like:
```
[ASYNC QUEUE INFO] Global: max: 300MB, allocated: 290MB (96% used)
```
If allocated memory is consistently near max, you're hitting the limit.

💡 **Recommended Action**
Immediate fix:
```yaml
# In your Helm values.yaml
global:
  workloads:
    resources:
      useConnectorResourceDefaults: false
      mainContainer:
        memory:
          limit: "8Gi"   # Increase from 2Gi
          request: "2Gi"
```
Redeploy and retry your sync.

Long-term:
• Monitor buffer usage during syncs
• Adjust memory based on your largest tables
• Consider using CDC mode for MySQL (it's more memory-efficient for large tables)

Does this match what you're seeing? Let me know if you want help checking the pod logs to confirm!
👀 1
t
Ahh that's telling. If I apply the same kind of numbers to my own resource limits and settings, and consider that 20% JVM heap for the buffer, it makes sense why it OOMs earlier than expected. This is helpful - I think I'm going to do the same as you and pull the source down too 👍 AI chat on its own without the source as a reference is not as helpful. I'm still convinced I can get this to work so going to keep at it.
k
good luck! yes I pulled down two repos, airbyte and airbyte-platform, and opened Cursor on the folder containing them both.
🙏 1
t
In case it's useful info to anyone in future: I was unable to find a workaround for the memory requirements, and in the end gave up and upgraded to a 64GB memory instance. After some tweaking of my values.yaml I was able to get the sync to complete (9.56GB | 6.45m records). Peak memory use was ~42GB during the sync.

The total sync took 7h 34m 45s, which was unfortunately long enough for the CDC cursor to be invalid… so now I need to increase binlog retention and start over. 🤦🏻‍♂️

Around 3.5 hours of the sync time was spent inserting the data into the final table. The query that moves the data is SUPER intensive (unpacking the JSON from _airbyte_data, applying type casting, etc.) which probably explains the official recommendation to use something like Snowflake instead for larger data sets.

so tl;dr for my issues:
• Airbyte is memory hungry: ensure you have > 4.5x the max source table data size available.
• Initial sync can take a LONG time, so if using CDC make sure your binlog retention is long enough.
• The postgres destination does not scale well with large datasets, due to how the data is handled (to be fair to Airbyte, that is the official recommendation)
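For the binlog retention piece, the knob depends on where MySQL is hosted, something along these lines (values are just examples; the RDS variant only applies if the source happens to live there):
```bash
# Self-managed MySQL 8.0+: keep binlogs for 7 days (value is in seconds)
mysql -h <mysql-host> -u <admin-user> -p -e "SET GLOBAL binlog_expire_logs_seconds = 604800;"

# RDS / Aurora MySQL instead exposes a stored procedure (value is in hours)
mysql -h <rds-endpoint> -u <admin-user> -p -e "CALL mysql.rds_set_configuration('binlog retention hours', 168);"
```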
k
fwiw, I am new to snowflake but found it to be very straightforward and fast. provisioned in the same datacenter as my source RDS postgres, and costs are low too (you can choose a small instance, and auto-suspend)
t
Thanks for the heads up! In your syncs does the final step go pretty fast then? For my postgres destination once it gets to this point i’m in for a long wait:
```
destination INFO type-and-dedupe i.a.c.d.j.JdbcDatabase(executeWithinTransaction$lambda$1):65 executing query within transaction: .....
```
The query to copy the data to the final table takes a long time.
k
it does it in batches of 10k rows I think
so there's no long final step
t
Huh, interesting. My sync ends with giant queries that seem to move the entire table. (an insert, then two different deletes, then an update)
Perhaps this is due to using CDC source and the “Incremental | Deduped” mode 🤔
k
i used incremental | deduped as well; I think it does a check at the end for distinct IDs or something, but it doesn't take too long. Also, Snowflake isn't a transactional database I don't think, so the way you load data in (the destination connectors) is probably coded quite differently vs Postgres
t
Yeah.. I'm sure the official advice that says to consider using Snowflake is there for a very good reason. I'm not sure if switching to Snowflake will be on the table for me right now unfortunately, but I would sure be interested to try it.
Airbyte + postgres has been working great for all our smaller tables, but it seems like once you get to a certain data size the architecture hits a hard scaling wall.
k
maybe another analytics (OLAP) vs transactional (OLTP) db is an option? redshift, bigquery? there are open source ones as well.