# troubleshooting
Saman Arefi
Hi everyone, hope you're all having a sensational day. Could I get some pointers regarding Airbyte's scalability? The docs recommend a `t2.large` instance and describe, in detail, how Airbyte is mainly memory and disk bound. I've been testing things out on a `t3.xlarge` and noticed the following: loading one large-ish Oracle table (~9 GB, 7M rows) takes me about 30 min, which I think is pretty good. However, loading two tables at the same time via the same connector (9 GB, 7M rows and 13 GB, 7M rows) takes an hour in total, with each sync running for roughly the full hour. What gives? Looking at htop, I seem to be running into a CPU limit as well, so I'm not sure what's causing this. These are my two largest tables, but in production I'd use Airbyte for another 30 or so tables, each between 10k and 1M rows, so this doesn't seem to scale well. Or am I doing something wrong?
Augustin Lafanechere (Airbyte)
Hi @Saman Arefi, the read and write operations on streams in a single connection currently run sequentially; in other words, Airbyte starts by syncing the first table, then the second one. A workaround could be to create a separate connection for each table and tweak the `MAX_SYNC_WORKERS` env var to increase sync parallelism.
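For reference, a minimal sketch of what that could look like on a docker-compose deployment, assuming the worker counts are set through the `.env` file next to `docker-compose.yaml` (the value 10 below is just an illustrative guess, not a tested recommendation):

```
# .env for the Airbyte docker-compose deployment (sketch, illustrative value)
# Lets up to 10 sync jobs run in parallel instead of the default (5, if I recall correctly)
MAX_SYNC_WORKERS=10
```

After changing it you'd need to restart the stack (e.g. `docker-compose down && docker-compose up -d`) so the workers pick up the new value.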
Saman Arefi
Thanks for the speedy reply @Augustin Lafanechere (Airbyte)! ☺️ Would you mind further clarifying "connection" in this context? This is what my setup currently looks like. Two connections, both connected to the same Oracle DB, each loading a different table. Launching both at the same time resulted in what I described above.
Augustin Lafanechere (Airbyte)
Ok, so you already implemented the workaround I suggested 😄 If you have more than 5 connections, increasing the `MAX_SYNC_WORKERS` value to >5 might help. I'd also suggest you try upsizing your instance to a t3.2xlarge to check whether you get a performance boost.
Saman Arefi
Yeah, I think I'll try that, cheers. I also noticed that memory usage currently sits at only ~50% with 4 connections running at the same time, yielding a throughput of only ~20 GB/h - aren't both of those values quite low?
Augustin Lafanechere (Airbyte)
Do you mean 50% memory usage of the container request or of the overall instance memory? Feel free to tweak your Docker engine config to give more memory to containers, or use the `JOB_MAIN_CONTAINER_MEMORY_REQUEST` env var.
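A minimal sketch of what that could look like in the same `.env` file, with purely illustrative values; if I remember the naming correctly there is also a companion `JOB_MAIN_CONTAINER_MEMORY_LIMIT` for the upper bound:

```
# .env - illustrative values, size these to your instance
JOB_MAIN_CONTAINER_MEMORY_REQUEST=2g   # memory requested per job container
JOB_MAIN_CONTAINER_MEMORY_LIMIT=4g     # hard ceiling per job container
```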