# ask-community-for-troubleshooting
š
Hi All, we have recently set up a CDC connection between RDS PostgreSQL and BigQuery (using the pgoutput plugin). The connection runs hourly and emits records correctly; however, we noticed that the WAL is not flushed properly with every run, meaning data accumulates in our Postgres log (sometimes only a partial flush happens). Is anyone facing the same issue, or does anyone know why this happens?
👍 1
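(Not from the thread, but a hedged sketch for anyone hitting this: you can measure how much WAL a replication slot is retaining with the built-in `pg_replication_slots` view and `pg_wal_lsn_diff`, available on PostgreSQL 10+.)

```sql
-- Sketch: how much WAL is each slot holding back, and how far behind is
-- the consumer's confirmed flush position?
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS unflushed_lag
FROM pg_replication_slots;
```

If `retained_wal` keeps growing between syncs, the slot's consumer (here, Airbyte) is not acknowledging new LSNs, so Postgres cannot recycle those segments.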
a
As far as I understand, depending on your Postgres settings, Postgres determines when to purge old records from the WAL - and I don’t believe that it is related to Airbyte reading the WAL.
👍 1
From: https://www.postgresql.org/docs/current/wal-configuration.html

The number of WAL segment files in the `pg_wal` directory depends on `min_wal_size`, `max_wal_size` and the amount of WAL generated in previous checkpoint cycles. When old log segment files are no longer needed, they are removed or recycled (that is, renamed to become future segments in the numbered sequence). If, due to a short-term peak of log output rate, `max_wal_size` is exceeded, the unneeded segment files will be removed until the system gets back under this limit. Below that limit, the system recycles enough WAL files to cover the estimated need until the next checkpoint, and removes the rest. The estimate is based on a moving average of the number of WAL files used in previous checkpoint cycles. The moving average is increased immediately if the actual usage exceeds the estimate, so it accommodates peak usage rather than average usage to some extent. `min_wal_size` puts a minimum on the amount of WAL files recycled for future usage; that much WAL is always recycled for future use, even if the system is idle and the WAL usage estimate suggests that little WAL is needed. Independently of `max_wal_size`, the most recent `wal_keep_size` megabytes of WAL files plus one additional WAL file are kept at all times. Also, if WAL archiving is used, old segments cannot be removed or recycled until they are archived. If WAL archiving cannot keep up with the pace that WAL is generated, or if `archive_command` or `archive_library` fails repeatedly, old WAL files will accumulate in `pg_wal` until the situation is resolved. A slow or failed standby server that uses a replication slot will have the same effect (see Section 27.2.6).
š
Thanks a lot @Alex Marquardt (Airbyte). In the end we solved the problem by introducing a heartbeat table, which we included in the Airbyte connection, similar to this tutorial: https://wolfman.dev/posts/pg-logical-heartbeats/. I wonder if it would be helpful to mention this on the CDC documentation page. The problem is that when you set up a slot in Postgres, WAL is generated from all databases (including internal ones, e.g., on AWS RDS), but the WAL is only flushed when Airbyte syncs some records. This means that if you sync only a subset of tables, or sync from a database where changes are infrequent (e.g., a dev DB), the unflushed WAL can grow until it crashes that DB.
🙌 1
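(A minimal sketch of the heartbeat approach described above; table and column names are illustrative assumptions, not taken from the tutorial.)

```sql
-- Hypothetical heartbeat table; include it in the Airbyte connection so
-- every sync sees at least one fresh change.
CREATE TABLE IF NOT EXISTS airbyte_heartbeat (
    id integer PRIMARY KEY,
    ts timestamptz NOT NULL
);

-- Run periodically (e.g. via cron or pg_cron). The upsert generates WAL
-- on every run, so the connector has something to consume and can
-- acknowledge a newer LSN, letting Postgres recycle old segments.
INSERT INTO airbyte_heartbeat (id, ts)
VALUES (1, now())
ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts;
```

With the heartbeat table selected in the connection, each hourly sync advances the slot's confirmed flush position even when the real source tables are quiet.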
d
@Šimon Appelt Thank you so much! I had the same issue and was searching for days. 🙌