Is there documentation or recommendations from Airbyte on how to parallelize a single data source?
I have an MS SQL source to Snowflake connection with about 2 billion rows that is unable to complete in under 18 hours, which is about the maximum time I'm able to stay connected to the MS SQL server to perform the backfill.
The whole system is running on AWS EC2 at the moment with a 2-core instance, 16 GB of memory, and 100 GB of EBS storage.
11/01/2021, 2:13 PM
if it’s historical data/a one-time sync, just split it into multiple jobs by table. Or dump the tables to an S3 bucket and load them from there instead.
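One way to think about the "split by table" suggestion: assign tables to connections so each connection carries a similar row volume. Here is a minimal sketch of that idea; the table names and row counts are made up for illustration, and in Airbyte itself you would reflect the groups by enabling different tables in each connection's catalog.

```python
# Hypothetical sketch: split one large sync into several Airbyte
# connections by assigning tables to groups of roughly equal row volume.

def partition_tables(row_counts: dict[str, int], n_groups: int) -> list[list[str]]:
    """Greedily assign tables (largest first) to the currently
    lightest group, so each connection syncs a similar row count."""
    groups: list[list[str]] = [[] for _ in range(n_groups)]
    totals = [0] * n_groups
    for table, rows in sorted(row_counts.items(), key=lambda kv: -kv[1]):
        i = totals.index(min(totals))  # lightest group so far
        groups[i].append(table)
        totals[i] += rows
    return groups

# Example: four hypothetical tables spread over two connections.
counts = {"orders": 1_200_000_000, "events": 600_000_000,
          "customers": 150_000_000, "products": 50_000_000}
for i, grp in enumerate(partition_tables(counts, 2), start=1):
    print(f"connection {i}: {grp}")
```

Greedy largest-first assignment isn't optimal bin packing, but for a handful of tables it gets the connections close enough in size that they finish in roughly the same wall-clock window.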
11/01/2021, 2:37 PM
So just create multiple connections with the same source and destination and partition which tables sync from each connection?
Also, in terms of the S3 staging, do you know if it uses Snowpipe under the hood, and whether that improves the overall throughput?
11/01/2021, 2:47 PM
“So just create multiple connections with the same source and destination and partition which tables sync from each connection?” yes, that is what I meant.
My other option was to dump the tables from within the MS SQL server to an external, more open, intermediary destination, like an S3 bucket. From that bucket you will be able to use Snowpipe yourself to load it into ❄️
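The S3-then-Snowpipe path above boils down to two Snowflake statements: an external stage pointing at the bucket the MS SQL dump landed in, and a pipe that copies new files into the target table. A minimal sketch, assuming hypothetical bucket, stage, pipe, and table names; in practice you would execute these statements against Snowflake (e.g. with snowflake-connector-python):

```python
# Hypothetical sketch of the S3 -> Snowflake load: build the DDL for an
# external stage and a Snowpipe pipe. All object names here are made up.

def stage_ddl(stage: str, bucket: str, prefix: str) -> str:
    # External stage pointing at the S3 location holding the dumped files.
    return (f"CREATE STAGE {stage} "
            f"URL='s3://{bucket}/{prefix}' "
            f"FILE_FORMAT=(TYPE=CSV FIELD_OPTIONALLY_ENCLOSED_BY='\"')")

def pipe_ddl(pipe: str, table: str, stage: str) -> str:
    # AUTO_INGEST=TRUE lets Snowpipe pick up new files automatically
    # via S3 event notifications instead of manual REFRESH calls.
    return (f"CREATE PIPE {pipe} AUTO_INGEST=TRUE AS "
            f"COPY INTO {table} FROM @{stage}")

print(stage_ddl("mssql_dump_stage", "my-backfill-bucket", "dbo/orders/"))
print(pipe_ddl("orders_pipe", "analytics.orders", "mssql_dump_stage"))
```

For a one-time backfill of a fixed set of files, a plain `COPY INTO` run once can be simpler than setting up a pipe; Snowpipe earns its keep when files keep arriving.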