Vikram Kumar

02/04/2022, 11:42 AM
Hello. We are using Airbyte to replicate Aurora to Postgres RDS, with about 2T rows and a 2TB database. We have barely processed 200M rows in 24 hours. At this rate we are looking at 10-15 days for the initial sync. Is there any way to accelerate the initial full refresh?

Augustin Lafanechere (Airbyte)

02/04/2022, 1:48 PM
Hi @Vikram Kumar, I think the bottleneck is related to the default fetchsize Airbyte uses to read the source in batches. This fetchsize is currently 10000 records. It is not configurable at the moment, so if you want to customize it you'll have to build your own version of the Postgres connector and patch this file: airbyte-db/lib/src/main/java/io/airbyte/db/jdbc/PostgresJdbcStreamingQueryConfiguration.java. An issue exists in our repo about this topic and we plan to make the fetch size configurable.
If you want to go ahead with the 10-15 day initial sync, you need to increase the value of the SYNC_JOB_MAX_TIMEOUT_DAYS env variable, which defaults to 3 days.
You can also have a look at this message, in which Liren explains that TB-scale data replication is not very well supported by Airbyte at the moment: https://airbytehq-team.slack.com/archives/C021JANJ6TY/p1643934145373819
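
To make the numbers in this reply concrete, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted in the thread (a fetchsize of 10000, 200M rows in 24 hours, a 3-day default for SYNC_JOB_MAX_TIMEOUT_DAYS); the function names are hypothetical helpers for illustration and are not part of Airbyte.

```python
# Figures taken from the thread; everything else here is illustrative.
DEFAULT_FETCH_SIZE = 10_000    # records per batch, as stated above
DEFAULT_TIMEOUT_DAYS = 3       # default SYNC_JOB_MAX_TIMEOUT_DAYS, as stated above


def fetch_round_trips(total_rows: int, fetch_size: int = DEFAULT_FETCH_SIZE) -> int:
    """Number of batches needed to read total_rows at a given fetch size."""
    if fetch_size <= 0:
        raise ValueError("fetch_size must be positive")
    # Integer ceiling division avoids floating-point rounding on large counts.
    return -(-total_rows // fetch_size)


def rows_per_second(rows_done: int, hours: float) -> float:
    """Observed throughput from a progress report such as '200M rows in 24 hours'."""
    return rows_done / (hours * 3600)


def needs_longer_timeout(estimated_days: float,
                         timeout_days: int = DEFAULT_TIMEOUT_DAYS) -> bool:
    """True if the estimated sync duration exceeds the job timeout."""
    return estimated_days > timeout_days


if __name__ == "__main__":
    print(fetch_round_trips(200_000_000))           # batches at the default fetchsize
    print(fetch_round_trips(200_000_000, 100_000))  # batches with a 10x larger fetchsize
    print(round(rows_per_second(200_000_000, 24)))  # observed rows per second
    print(needs_longer_timeout(15))                 # a 15-day sync vs. the 3-day default
```

Fewer batches does not guarantee a proportional speedup, since network, source load, and destination writes also matter; the sketch only shows how the batch count and the timeout check scale with the values mentioned above.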

Vikram Kumar

02/04/2022, 3:25 PM
Thanks for the reply. What’s a reliable way to pause the sync, stop the docker, reset those configs and resume?

Augustin Lafanechere (Airbyte)

02/04/2022, 3:29 PM
We do not have a pause feature, but if you're running an incremental sync, stopping the sync job and restarting it will make the new sync start from the last cursor checkpointed by the previous job.
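
The resume behaviour described here can be illustrated with a small toy model in Python. This is not Airbyte's implementation: run_sync, the state dictionary, and the checkpoint interval are all hypothetical, and the model assumes that rows read after the last checkpoint are simply read again on restart.

```python
from typing import Dict, List, Optional


def run_sync(cursor_values: List[int],
             state: Dict[str, int],
             checkpoint_every: int = 2,
             stop_after: Optional[int] = None) -> List[int]:
    """Read rows whose cursor is beyond the saved checkpoint.

    cursor_values stands in for the source table (one cursor value per row).
    state["cursor"] is the last checkpointed cursor, updated in place.
    stop_after simulates the job being stopped after that many rows.
    """
    last = state.get("cursor")
    pending = [v for v in sorted(cursor_values) if last is None or v > last]
    read: List[int] = []
    for count, value in enumerate(pending, start=1):
        if stop_after is not None and count > stop_after:
            # Job stopped: only the last periodic checkpoint survives.
            return read
        read.append(value)
        if count % checkpoint_every == 0:
            state["cursor"] = value
    if read:
        # Job finished normally: checkpoint the final cursor value.
        state["cursor"] = read[-1]
    return read


if __name__ == "__main__":
    state: Dict[str, int] = {}
    first = run_sync([1, 2, 3, 4, 5], state, stop_after=3)
    print(first, state)    # job stopped partway through
    second = run_sync([1, 2, 3, 4, 5], state)
    print(second, state)   # restarted job picks up after the checkpoint
```

In this toy model the restarted job re-reads any row that was processed after the last checkpoint, so the cost of stopping and restarting is bounded by the checkpoint interval rather than by the size of the whole table.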