# ask-community-for-troubleshooting
l
Hello! So - I have two running Postgres CDC incremental syncs into BigQuery, using the batch run type. I have added a third connection, using a distinct replication slot & publication, to the same Postgres database. I am attempting to sync 100 partitions of a larger table that, when originally attempted as a single table, failed numerous times mid-sync. Shifting to syncs of the individual partitions finally succeeded after a week of partition-by-partition additions to the connection, but the sync is now not completing. Upon inspection of the logs & the files sent over to GCS, the buffers are being flushed after 302 bytes, which seems surprisingly small, and the syncs I cancelled saw 1715 changes emitted within an hour. My BigQuery file buffer count is at 40. What is the expected file size for a buffer file that has been flushed?
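For reference, a minimal sketch (assuming psycopg2 and the standard Postgres catalogs) of how the slot's WAL backlog and the publication's table coverage could be checked; the slot name, publication name, and connection details below are placeholders:
```python
# Sketch: inspect a CDC replication slot's backlog and a publication's table coverage.
# The slot/publication names and connection string are placeholders.
import psycopg2

SLOT_NAME = "airbyte_slot_partitions"        # hypothetical slot name
PUBLICATION_NAME = "airbyte_pub_partitions"  # hypothetical publication name

conn = psycopg2.connect("dbname=mydb user=airbyte host=localhost")
with conn, conn.cursor() as cur:
    # How far behind the slot is (bytes of WAL not yet confirmed as flushed).
    cur.execute(
        """
        SELECT slot_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS bytes_behind
        FROM pg_replication_slots
        WHERE slot_name = %s
        """,
        (SLOT_NAME,),
    )
    print(cur.fetchone())

    # Confirm the publication actually covers all 100 partition tables.
    cur.execute(
        "SELECT count(*) FROM pg_publication_tables WHERE pubname = %s",
        (PUBLICATION_NAME,),
    )
    print("tables in publication:", cur.fetchone()[0])
conn.close()
```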
k
A message from kapa.ai
Hmm, I don't know enough to give you a confident answer yet. However, you can refer to the Airbyte documentation for more information on how to configure and troubleshoot your syncs: [Airbyte Documentation](https://docs.airbyte.com).
l
I have, @kapa.ai! Back and forth. Can you look into prior threads & conversations around small buffer files?
k
A message from kapa.ai
Hmm, I don't know enough to give you a confident answer yet. However, you can refer to the [Airbyte documentation](https://docs.airbyte.com) and [Airbyte forum](https://discuss.airbyte.io) for more information and support.
l
@kapa.ai can you give me a list of threads & conversations that may be related, even if you are not confident?
k
A message from kapa.ai
I found a couple of conversations from the provided extracts that might be related to your issue:
1. [Redshift destination buffer stream uploading duplicate files to S3 staging area](https://discuss.airbyte.io/t/redshift-destination-buffer-stream-uploading-duplicate-files-to-s3-staging-area/3313) - This forum post discusses an issue with duplicates in the Redshift destination buffer stream.
2. [Slack conversation about duplicates during Full Refresh](https://airbytehq.slack.com/archives/C021JANJ6TY/p1669113283280629) - This conversation is about duplicates occurring during a Full Refresh | Overwrite sync.
Please note that I cannot guarantee the relevance of these conversations to your specific issue, but they might be helpful for you to explore further.
l
I increased to 50 buffers (against a sync of 100 tables) and that doesn't seem to have helped. The individual files being uploaded look like this
i.e. very small given that the WAL is pretty large now
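To put numbers on "very small", a rough sketch (assuming google-cloud-storage) for listing the staged files and their sizes; the bucket name and prefix are placeholders for the actual GCS staging location:
```python
# Sketch: list the files Airbyte staged in GCS and print their sizes,
# to confirm how small the flushed buffer files actually are.
# Bucket name and prefix are placeholders; point them at your staging path.
from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs("my-airbyte-staging-bucket", prefix="data_sync/")
for blob in blobs:
    print(f"{blob.name}\t{blob.size} bytes")
```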
No resource constraints on either side of the workers; in fact, they've been overprovisioned. On this connection I have the following set for resource_requirements:
{"cpu_limit": "3", "cpu_request": "2", "memory_limit": "12Gi", "memory_request": "10Gi"}
Took the resource requirements off; no effect. Tried a new connection where just one of the 100 partitions was read from the same source/destination, using the same publication & replication slot. It did not look like it picked up from the last CDC cutoff point and instead attempted to resync from scratch, so I cancelled. This makes me afraid of reducing the number of tables in the connection: I will lose sync progress for those tables up until this point
Is there an unstated limit to the number of tables allowed to be synced via cdc?
This one publication manages CDC from 100 tables - my best guess is that because changes are happening at random across all 100 tables (which are hash partitions on ID, meaning writes captured by the publication happen at random across those 100 tables), and because there’s a hard limit of 50 buffers that can be open at a given time, Airbyte is closing buffers when a write to another table is in a buffer that isn’t currently open
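To make that theory concrete, here is a toy simulation under assumed behavior: 100 streams, at most 50 open buffers, uniformly random writes, and a flush forced whenever a write arrives for a stream with no open buffer. The eviction policy and per-row size are made up, so this only illustrates the shape of the effect, not the destination's actual logic:
```python
# Toy model of the theory: random writes across 100 streams, only 50 buffers
# may be open at once, and an incoming write for an unbuffered stream forces
# an existing buffer to be flushed. Eviction policy and row size are assumptions.
import random
from collections import OrderedDict

NUM_STREAMS = 100
MAX_OPEN_BUFFERS = 50
ROW_BYTES = 300          # rough per-change payload size, made up
NUM_WRITES = 100_000

open_buffers = OrderedDict()   # stream id -> accumulated bytes, least-recently-written first
flush_sizes = []

for _ in range(NUM_WRITES):
    stream = random.randrange(NUM_STREAMS)
    if stream not in open_buffers and len(open_buffers) >= MAX_OPEN_BUFFERS:
        # No free buffer slot: flush the least-recently-written buffer to make room.
        _, flushed_bytes = open_buffers.popitem(last=False)
        flush_sizes.append(flushed_bytes)
    # Accumulate the write; re-inserting moves the stream to the most-recent end.
    open_buffers[stream] = open_buffers.pop(stream, 0) + ROW_BYTES

print("flushes:", len(flush_sizes))
print("average flush size (bytes):", round(sum(flush_sizes) / len(flush_sizes)))
```
Under these assumptions the average flush ends up being only a couple of rows' worth of data, which is roughly the pattern of tiny files described above.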
Is there a way to force 100 buffers to be opened on the destination, in order to test this theory?
I would limit my connection to 50 tables, but I will lose all progress on the other 50 tables & it will take days to resync, all for the sake of an experiment. Any BigQuery destination users at larger scale that know the answer to this question?
I edited the configuration in-database to try & force a 100-file buffer, but hit an error on spin-up of the destination pod ("must have a maximum value of 50"), so unless there's a way around validation I don't think that's a viable avenue
Is there any way around cutting to 50 tables, and is there any way to confirm that this is an actual path forward?
@Gabriel Levine maybe you have an idea, I remember your usage of BigQuery being substantial, wondering if you ran into this sort of cap
Or @[DEPRECATED] Marcos Marx maybe you know
(sorry for the tags…)
@kapa.ai Is there an unstated limit to the number of tables allowed to be synced via CDC to the BigQuery destination? Is the limit based on buffer count?
k
A message from kapa.ai
Based on the provided extracts, there is no explicit mention of an unstated limit to the number of tables allowed to be synced via CDC to the BigQuery destination or a limit based on buffer count. However, there are some limitations and considerations when using CDC with various sources:
- [General CDC limitations](https://docs.airbyte.com/understanding-airbyte/cdc#limitations)
- [Postgres CDC considerations](https://docs.airbyte.com/integrations/sources/postgres#configuring-postgres-connector-with-change-data-capture-cdc)
- [MySQL CDC considerations](https://docs.airbyte.com/integrations/sources/mysql#change-data-capture)
- [MSSQL CDC considerations](https://docs.airbyte.com/integrations/sources/mssql#change-data-capture)
It's important to review these limitations and considerations to ensure proper configuration and usage of CDC with your specific source and destination.
g
Hey @Leo Sussan, I recall something somewhere about the number of buffers not substantially increasing throughput once the number of buffers exceeds the number of streams. Since you're syncing one really big table it wouldn't surprise me if you're not seeing much benefit from the additional buffers (more buffers does mean more memory pressure, though I recall you had allocated a lot of memory to your pods)
l
Yeah - no memory pressure. So for context, the sync is of 100 Postgres "partitions", which are just individual tables in Postgres. To get around the fact that the sync of the top-level table kept failing around day 3, meaning I'd have to start from scratch, I instead synced one partition at a time (by adding each table to the connection after the previous one completed), and then created a view in BigQuery that is essentially a stitched-together representation of those 100 tables
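A hedged sketch of that stitched-together view (assuming google-cloud-bigquery); the project, dataset, and partition-table naming below are placeholders rather than the real names:
```python
# Sketch: create (or replace) a BigQuery view that UNION ALLs the 100 partition
# tables back into one logical table. Project/dataset/table names are placeholders.
from google.cloud import bigquery

PROJECT = "my-project"
DATASET = "my_dataset"
PARTITION_TABLES = [f"big_table_p{i}" for i in range(100)]  # hypothetical naming scheme

union_sql = "\nUNION ALL\n".join(
    f"SELECT * FROM `{PROJECT}.{DATASET}.{t}`" for t in PARTITION_TABLES
)
view_sql = f"CREATE OR REPLACE VIEW `{PROJECT}.{DATASET}.big_table_stitched` AS\n{union_sql}"

client = bigquery.Client(project=PROJECT)
client.query(view_sql).result()  # wait for the DDL statement to finish
```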
So, technically, this connection is attempting to sync 100 tables per job. The Postgres publication is defined against those 100 partitions, meaning it should essentially look like a regular 100-table incremental sync with CDC
and according to the logs, each of those 100 tables is getting a buffer, as expected. Those buffers are being flushed really, really quickly, like after 4 KB of data, 7 KB of data, etc.
My guess is, because the publication contains writes for each of the 100 partitions / tables, but Airbyte can only keep 50 buffers open at a time, Airbyte is forced to flush those buffers early, so that a change to the 51st table can be registered to a buffer it doesn’t have open
I.e. — number of buffers is at the max, 50. Number of tables is, for all intents, 100.
but I wanted to check that against other people’s experience to see if that observation had any merit - it feels like others operating at larger #s of buffers would have experienced something similar
g
I dropped my number of buffers down to 20 but they’re flushing fairly large amounts of data. MBs not KBs
Well, it's a mix. Some streams are flushing bytes or KBs, but I think those are just smaller streams
l
I have a feeling that if writes to the database were distributed 100% randomly between tables (they are in this case; the partition scheme randomly assigns writes to one of the 100 tables), your buffers would look like mine, but that's entirely a guess
you might be able to see this in your logs: when buffers are closed, I bet a new table shows up in the newly opened set of buffers listed right after
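A rough sketch of that check; the regex is a guess at whatever the destination logs print when a buffer is flushed, so it would almost certainly need adjusting to the real log format:
```python
# Sketch: scan a destination log for buffer-flush lines and tally flushes per
# stream. The regex pattern is hypothetical; adapt it to the actual log output.
import re
from collections import Counter

FLUSH_PATTERN = re.compile(r"[Ff]lushing buffer.*?stream[=: ]+(?P<stream>\S+)")  # guess

flushes = Counter()
with open("destination-pod.log") as fh:   # placeholder log file path
    for line in fh:
        match = FLUSH_PATTERN.search(line)
        if match:
            flushes[match.group("stream")] += 1

for stream, count in flushes.most_common(20):
    print(f"{stream}\t{count} flushes")
```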
if that’s true — well, it still sucks (I’m going to have to resync half of my partitions again 😬 ) but at least it is guidance
Another thing to mention - I did have CDC working at larger scale earlier in this process (i.e. Friday of last week), when a smaller set of partitions was synced - we did a fairly sizable data import & the WAL was cleared within minutes
Bumping this - could really use some assistance, nothing I’ve tried here is working
g
@Leo Sussan https://airbytehq.slack.com/archives/C01A4CAP81L/p1686236400692849 I’d be curious for you to try it out. Be a good stress test of the “any volume of data” claim https://airbyte.com/blog/checkpointing
l
yeah no doubt 👀