
Phoebe Yang

02/04/2022, 12:22 AM
Hi! I’ve just started experimenting with Airbyte. It’s a powerful tool and I want to adopt it for production use at my organization. A couple of questions I have:
• Can Airbyte handle a big data migration, ~1TB (from Heroku Postgres to AWS Postgres), and what’s the best way to optimize it? For context, we cannot have downtime on the source, need to move a lot of data over, and will perform custom dbt transformations after loading the data into the destination. Is this sync feasible, and how long does it usually take? If so, what are the machine requirements (for both the Airbyte host and the destination Postgres) to accomplish this as fast as possible?
• How does the sync handle schema changes/updates after the initial full refresh? If we change the source schema or the transformation logic, will the destination be updated automatically? The next follow-up is more of a dbt question: if I change the transformation logic and run a full refresh on a big table (~300GB), how long does that usually take, and will the table be locked during the refresh? Performance and availability are important for my use case.
Would love to get thoughts and recommendations on these. Thanks in advance!

Liren Tu (Airbyte)

02/04/2022, 4:59 AM
• Can Airbyte handle a big data migration, ~1TB (from Heroku Postgres to AWS Postgres), and what’s the best way to optimize it?
Currently, I’m afraid the Postgres-to-Postgres connection won’t handle this load well. If the 1TB of data is all synced in one connection, it will probably take a few days and is also likely to get stuck along the way. If the data is distributed across multiple schemas and tables, and multiple connections are set up to each sync a subset of them, it may work. However, I don’t think the performance will meet your requirement. Our initial benchmark results showed that it can take at least ~2 hours to sync 10GB of data. So, unfortunately, it is pretty much infeasible to transfer 1TB of data reliably at this moment.
• How does the sync handle schema changes/updates after the initial full refresh?
This is a very important topic we have not had the bandwidth to address yet. Currently, a schema change requires a reset on the destination side, i.e. purging the data. So again, it does not work for your use case.
This quarter we are mainly focusing on data warehouse destinations and API sources. Database performance may be worked on next quarter.
Sorry that we cannot handle your use case right now. Database replication at terabyte scale is one of our goals. It’s hard to give an ETA on when we will get there, but hopefully it will be within this year.
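For readers weighing feasibility: a back-of-the-envelope extrapolation from the ~2 hours per 10GB benchmark mentioned above, assuming naively linear scaling (real syncs rarely scale this cleanly, so treat it as a lower bound):

```python
# Rough extrapolation from the benchmark in the thread:
# ~2 hours to sync 10 GB, scaled linearly to 1 TB.
BENCHMARK_GB = 10
BENCHMARK_HOURS = 2

def estimated_hours(total_gb: float) -> float:
    """Naive linear estimate; ignores slowdowns, retries, and stalls."""
    return total_gb / BENCHMARK_GB * BENCHMARK_HOURS

hours = estimated_hours(1024)  # 1 TB = 1024 GB
print(f"{hours:.0f} hours (~{hours / 24:.1f} days)")  # → 205 hours (~8.5 days)
```

This is consistent with the "few days, likely to get stuck" assessment above.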

Gerardo Santacruz

02/04/2022, 1:44 PM
Are there any alternative recommendations you've found to work? Curious as this is something we're thinking about at my org as well.

Augustin Lafanechere (Airbyte)

02/04/2022, 2:51 PM
Hi @Gerardo Santacruz, the workaround Liren suggests is to create multiple connections on the same source and add a subset of the tables to each connection; this can improve parallelism a bit.
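To make the workaround concrete: a minimal sketch of splitting a table list into per-connection groups. The table names and connection count here are made up, and creating the connections themselves happens in the Airbyte UI/API, which is not shown:

```python
from typing import Dict, List

def partition_tables(tables: List[str], num_connections: int) -> Dict[int, List[str]]:
    """Round-robin tables across N Airbyte connections so each
    connection syncs a disjoint subset and can run in parallel."""
    groups: Dict[int, List[str]] = {i: [] for i in range(num_connections)}
    for idx, table in enumerate(tables):
        groups[idx % num_connections].append(table)
    return groups

# Hypothetical table list; in practice, group by table size so each
# connection gets a comparable data volume, not just a comparable count.
tables = ["users", "orders", "events", "invoices", "payments"]
print(partition_tables(tables, 2))
# → {0: ['users', 'events', 'payments'], 1: ['orders', 'invoices']}
```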

Phoebe Yang

02/04/2022, 3:41 PM
Gotcha, thanks for the info. Do you think it’s possible to use `pg_dump` for the data migration/full refresh, and configure Airbyte to do incremental refresh only on new data? From the docs it seems like the first sync is always a full refresh; if that’s configurable, it’d work too.

Liren Tu (Airbyte)

02/04/2022, 4:52 PM
Do you think it’s possible to use `pg_dump` for the data migration/full refresh, and configure Airbyte to do incremental refresh only on new data?
Yes, this is something we have thought about, but have not had the time to try yet.
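For anyone exploring this route: a hedged sketch of what the one-off bulk copy could look like. The connection URLs are placeholders, and this covers only the initial load, not the Airbyte cursor handoff discussed later in the thread:

```python
import shlex

# Placeholder connection URLs; substitute real Heroku/AWS credentials.
SOURCE_URL = "postgres://user:pass@heroku-host:5432/sourcedb"
DEST_URL = "postgres://user:pass@aws-host:5432/destdb"

def bulk_copy_command(source_url: str, dest_url: str) -> str:
    """Build a shell pipeline that dumps the source and streams it
    straight into the destination. --no-owner/--no-acl avoid role
    mismatches between hosting providers."""
    dump = f"pg_dump --no-owner --no-acl {shlex.quote(source_url)}"
    restore = f"psql {shlex.quote(dest_url)}"
    return f"{dump} | {restore}"

print(bulk_copy_command(SOURCE_URL, DEST_URL))
```

For a 1TB database you would likely prefer `pg_dump -Fc` plus `pg_restore --jobs=N` for parallel restore, at the cost of an intermediate dump file.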

Augustin Lafanechere (Airbyte)

02/04/2022, 4:54 PM
Using `pg_dump` will require you to manually set the state of your connection to the latest cursor value you loaded with `pg_dump`. This is done by running some SQL queries against the Airbyte database's `state` table. Am I right @Liren Tu (Airbyte)?

Liren Tu (Airbyte)

02/04/2022, 5:07 PM
Correct. This is probably not a trivial thing to do right now.

Phoebe Yang

02/04/2022, 6:05 PM
How complicated do you think it’ll be? If it’s just SQL queries, can we have a script that does it?
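A minimal sketch of what such a script might build, assuming the connection state lives as JSON in a `state` table keyed by connection id. The JSON shape and column names here are assumptions: Airbyte's internal database schema is version-dependent and not a public API, so inspect a real state row on your deployment before running anything like this.

```python
import json

def build_state_update(connection_id: str, stream: str,
                       cursor_field: str, cursor_value: str):
    """Build a parameterized UPDATE against a hypothetical Airbyte
    `state` table, setting the stream's cursor to the last value
    loaded by pg_dump. Verify the real row shape first."""
    state_blob = json.dumps({
        "cdc": False,
        "streams": [
            {"stream_name": stream,
             "cursor_field": [cursor_field],
             "cursor": cursor_value},
        ],
    })
    sql = "UPDATE state SET state = %s::jsonb WHERE connection_id = %s"
    return sql, (state_blob, connection_id)

sql, params = build_state_update("c0ffee", "public.orders",
                                 "updated_at", "2022-02-04T00:00:00Z")
print(sql)
# Execute with a Postgres driver, e.g. psycopg2: cur.execute(sql, params)
```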