Elaaf

09/27/2021, 11:21 AM
Hi everyone, I have a question! This may have already been asked, but... how does Airbyte scale data migration? Is there an in-house developed MPP system being used to distribute and manage a HUGE ingestion job across multiple workers? Regards, Elaaf
charles

09/27/2021, 3:01 PM
Good question! Can you describe a little more what you mean by a huge ingestion job? Total bytes sent is very large? Frequent syncs? Huge records? etc.?
Elaaf

09/28/2021, 4:41 AM
I am trying to gauge the performance of Airbyte against Spark (+ some orchestrator) for data ingestion. Let's say I am trying to ingest a table (~200 GB) with 1.5 billion+ records from an Oracle DB into AWS S3; how would the data lift-and-shift scale for that table? Going through the docs, it seems Temporal handles the worker orchestration. How would the table be ingested in a partitioned/parallel manner across these workers?
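For context on the Spark side of the comparison: Spark's JDBC reader parallelizes a single-table read by splitting a numeric partition column's range into chunks and issuing one bounded query per task (its `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` options). A minimal Python sketch of that range-splitting idea — the table and column names below are made up for illustration:

```python
"""Sketch of range-partitioned table ingestion: split a numeric key range
into N contiguous chunks and build one bounded query per worker."""

def partition_ranges(lower, upper, num_partitions):
    """Split the inclusive key range [lower, upper] into contiguous chunks."""
    total = upper - lower + 1
    step, remainder = divmod(total, num_partitions)
    ranges, start = [], lower
    for i in range(num_partitions):
        size = step + (1 if i < remainder else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

def chunk_queries(table, column, ranges):
    """One range-bounded SELECT per chunk; each can run on a separate worker."""
    return [
        f"SELECT * FROM {table} WHERE {column} BETWEEN {lo} AND {hi}"
        for lo, hi in ranges
    ]

# e.g. 1.5B rows keyed 1..1_500_000_000, fanned out over 8 workers:
ranges = partition_ranges(1, 1_500_000_000, 8)
queries = chunk_queries("ORDERS", "ORDER_ID", ranges)
```

Each worker then streams its own slice to S3 independently, which is what the serial Airbyte worker (at the time of this thread) does not do.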
charles

09/28/2021, 4:05 PM
Right now there is no parallelization or partitioning for a single connection in Airbyte. It is on our roadmap, but the earliest you would see it would be early spring of 2022. So if you are using Spark and leveraging its parallelization heavily, then Airbyte, as it is now, is not a good fit.
200GB doesn't seem out of the realm of possibility for doing things in a serial manner, however. I know we have replicated databases with 50GB of data, but we would definitely be at the edge of our known support.
Elaaf

09/29/2021, 4:18 AM
Ok, thank you so much for the information. This helps a lot.
Blake Enyart

10/30/2021, 8:27 PM
@charles, for single source parallelization, could you provide a link to the roadmap outlining when this might be completed by? I'm also pretty interested in seeing this feature come to fruition
charles

11/01/2021, 6:06 PM
Blake, could you mention the requirements you're looking for? Same questions that I asked Elaaf earlier in this thread.
Blake Enyart

11/01/2021, 9:15 PM
Absolutely. For myself at least, I'm struggling to get a performant backload ingest from an MS SQL database source with ~190 million rows (~90GB) that need to be synced to a Snowflake data warehouse on a 24-hour cadence. I'm finding that the insert functionality takes too long (>18 hours), which conflicts with a database rebuild job on the source. Rather than a serialized ingestion, one table processed after another, I would love some manner of parallelizing, maybe up to 10 connections to the database, all pulling different tables to increase throughput. Does that get at some of the requirements?
I'm also working on using the AWS S3 staging method for syncing the data, coupled with multiple connections for the same source + destination (with subsets of the tables in each connection) to pseudo-parallelize under the current process, hoping for a sync that finishes in less than 3-4 hours. We especially want to run this in the early AM to get the data as fresh as possible for the day. One constraint I'm finding with Airbyte right now is that we will likely be resetting connections quite regularly, since the development team frequently builds additional tables that need to be added.
@charles, do you have any guidance or feedback on this?
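The multiple-connections workaround described above can be scripted. A hedged sketch, assuming an Airbyte deployment whose config API is reachable at a local URL and that the tables are already split across several connections; the URL and connection IDs are made up, and the `POST /api/v1/connections/sync` path is my reading of Airbyte's config API, so verify it against your deployment:

```python
"""Sketch: fan sync triggers out across several Airbyte connections at once,
instead of triggering them by hand within a couple of seconds of each other."""
from concurrent.futures import ThreadPoolExecutor

AIRBYTE_URL = "http://localhost:8000"  # hypothetical deployment address

def trigger_all(connection_ids, trigger, max_workers=4):
    """Run one trigger call per connection concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(trigger, connection_ids))

def airbyte_trigger(connection_id):
    """POST a manual sync request for one connection (requires `requests`)."""
    import requests
    resp = requests.post(
        f"{AIRBYTE_URL}/api/v1/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job"]["id"]

# Exercising the fan-out logic offline with a stub trigger:
stub_jobs = trigger_all(["conn-a", "conn-b", "conn-c"], lambda cid: f"job-{cid}")
```

In production you would pass `airbyte_trigger` instead of the stub; the thread-pool fan-out itself is just ordinary `concurrent.futures` usage, nothing Airbyte-specific.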
charles

11/05/2021, 10:07 PM
and you need to resync all 190mm rows every time?
let me rephrase.
you want to resend all 190mm rows every time? or it's one sync with 190mm rows and every 24 hours there is some delta as an update.
Blake Enyart

11/05/2021, 11:27 PM
So it was one initial load of 190mm rows, and now it syncs every 12 hours (just timing I picked, given the lack of a cron config). With that, I anticipate needing to full-refresh the system a couple of times shortly, with the new additions of the incremental loads and other components in Airbyte. I've split the connection into 4 connections, all triggered within a couple of seconds of each other for now, but I'm nervous about when I need to update any of the connections in the future for additional tables, as the whole setup process took me about 1.5 hours to complete.
charles

11/06/2021, 12:44 AM
Got it. I agree that sounds precarious.
Realistically, the earliest we would start working on something like this is January, and likely it would actually be later in the quarter. Our focus is really nailing down the stability of the platform and connectors at a slightly lower scale before building up.
Blake Enyart

11/06/2021, 1:02 AM
Awww…that’s super helpful. With that, is there a specific issue on GitHub you would recommend I follow to track this?
charles

11/08/2021, 5:48 PM
i'm not sure if it 100% fits.
i think (and please correct me if i'm wrong), you have a database with a lot of data in it across multiple tables. so you need airbyte to replicate multiple tables in parallel.
the issue i linked is trying to handle the case where 1 really large table takes too long and for a single table we need to send data in parallel.
am i right in saying that multiple tables at the same time is the thing you care about? that's the thing that we are probably going to tackle first.
Blake Enyart

11/09/2021, 7:50 PM
@charles yes! You captured it great. I’m looking for multiple tables in parallel right now rather than a single large table. I’ll follow the second issue in that case. Thanks so much for sharing!
Vinoth Govindarajan

01/19/2022, 5:27 AM
@Elaaf @charles - Sorry I'm late to the party on this, but I just discovered an interesting integration of Airbyte with Apache Hudi, which runs a different replication launcher on Spark. The author himself claims that it might be deviating from Airbyte's current roadmap, but it's good to see that someone has already solved it.