# ask-community-for-troubleshooting
Hello all, I recently stumbled upon Airbyte and was exploring it for a big data application. As per the documentation, data between the Airbyte source and destination containers is moved through unix pipes (STDOUT --> STDIN). My questions are around this implementation:
• How reliable is this approach of moving data through unix pipes?
• Do we have any documented benchmarks or performance/throughput stats for Airbyte?
• What is the maximum limit to the data being moved from source to destination? How does Airbyte perform if the data is very large (in GBs, let's say)?
Any help would be greatly appreciated! 🙂
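(For context, the STDOUT --> STDIN pattern looks roughly like this; a minimal sketch, not Airbyte's actual code, and `source.py`/`destination.py` are hypothetical placeholders:)

```python
import subprocess
import sys

# Minimal sketch: a "source" process writes newline-delimited JSON to
# STDOUT, and a "destination" process reads it from STDIN. This mirrors
# the STDOUT --> STDIN pattern the Airbyte docs describe; the scripts
# here are hypothetical placeholders, not real connectors.
source = subprocess.Popen(
    [sys.executable, "source.py"],       # hypothetical source connector
    stdout=subprocess.PIPE,
)
destination = subprocess.Popen(
    [sys.executable, "destination.py"],  # hypothetical destination connector
    stdin=source.stdout,
)
source.stdout.close()  # let the source see SIGPIPE if the destination exits
destination.wait()
source.wait()
```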
> How reliable is this approach of moving data through unix pipes?
What other method of moving data do you want to compare it with?
> Do we have any documented benchmarks or performance/throughput stats for Airbyte?
Not yet! There is an open issue on GitHub to create some benchmarks.
> What is the maximum limit to the data being moved from source to destination? How does Airbyte perform if the data is very large (in GBs, let's say)?
There is no limit. You can control your resources and transfer the amount of data you want. I know some users who transfer 100+ GB in each sync, and others who transfer 17 TB of data using Airbyte. For large cases you may need to plan and size your instance differently.
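To make the "plan and size your instance" point concrete, here's a back-of-envelope sketch; the throughput figure is an assumption for illustration, not a measured Airbyte number:

```python
# Back-of-envelope sync-duration estimate. The throughput value is an
# assumption for illustration, not a measured Airbyte benchmark.
data_volume_gb = 100          # e.g. a 100+ GB sync
throughput_mb_per_s = 20      # assumed sustained end-to-end throughput

seconds = data_volume_gb * 1024 / throughput_mb_per_s
print(f"~{seconds / 3600:.1f} hours")  # ~1.4 hours at these assumptions
```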
Hi, I'm doing some Airbyte research and testing and am interested in Airbyte's scalability and performance. Could you provide some more details and help answer a couple of questions?
1. Does Airbyte stream data from source to destination in parallel, or is it always a single thread?
2. How efficient is JSON serialization for AirbyteMessages? (See the micro-benchmark sketched below for the kind of cost I mean.)
3. Could you provide more details about the 100+ GB use cases (what source and destination are used, approximate time for such a data transfer, etc.)?
4. Is there any doc to understand Airbyte data transfer in more detail (serialization, data transfer protocol, format, datatype conversion between source and destination, etc.)?
5. Where can I find a doc about tuning Airbyte workers (mostly interested in tuning on K8s, setting requests and limits) to be able to ingest a large amount of data?
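A quick way to get a feel for the serialization cost in question 2 is a micro-benchmark along these lines (the record shape is an illustrative assumption, not the actual AirbyteMessage schema):

```python
import json
import time

# Micro-benchmark sketch: measure JSON encode/decode throughput for a
# record-shaped payload. The field names are illustrative assumptions,
# not the real AirbyteMessage schema.
record = {"type": "RECORD", "stream": "users",
          "data": {"id": 1, "name": "x" * 100, "ts": "2022-01-01T00:00:00Z"}}

n = 100_000
start = time.perf_counter()
for _ in range(n):
    line = json.dumps(record)
    json.loads(line)
elapsed = time.perf_counter() - start

mb = n * len(line) / 1e6
print(f"{mb / elapsed:.1f} MB/s round-trip")
```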
@[DEPRECATED] Marcos Marx
> What other method of moving data do you want to compare it with?
Using something like a message queue or streaming?
> There is no limit. You can control your resources and transfer the amount of data you want. I know some users who transfer 100+ GB in each sync, and others who transfer 17 TB of data using Airbyte. For large cases you may need to plan and size your instance differently.
This happens in batches, right? What is the maximum batch size? I am assuming that for large data sizes, small batch jobs will slow down the process.
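For reference, destination-side batching commonly looks something like the sketch below; the thresholds are arbitrary assumptions, not Airbyte defaults:

```python
import json
import sys

# Sketch of a destination that buffers incoming records from STDIN and
# flushes in batches. Thresholds are arbitrary assumptions; real Airbyte
# destinations choose their own batch/buffer sizes per connector.
MAX_RECORDS = 10_000
MAX_BYTES = 16 * 1024 * 1024

def flush(batch):
    # Placeholder: a real destination would write the batch to storage.
    print(f"flushing {len(batch)} records", file=sys.stderr)

records, buffered_bytes = [], 0
for line in sys.stdin:
    records.append(json.loads(line))
    buffered_bytes += len(line)
    if len(records) >= MAX_RECORDS or buffered_bytes >= MAX_BYTES:
        flush(records)
        records, buffered_bytes = [], 0

if records:
    flush(records)
```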
@Syed Farhan Ahmed Actually, the best I was able to do with localhost on `source-s3` (using MinIO, with the connector patched to support a custom endpoint) is 168 KB/s (end-to-end: from S3/MinIO to stdout). At the same time, `wget` for the same file from MinIO produces 850 MB/s. I also have profiling data, so I can see the hotspots. Could you please share your benchmark?
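For comparing numbers, this is roughly how raw connector STDOUT throughput can be measured; the command is a placeholder for the actual `source-s3` invocation, not a complete one:

```python
import subprocess
import time

# Sketch: measure end-to-end STDOUT throughput of a connector process.
# The command below is a placeholder; substitute the real connector
# invocation with its --config/--catalog arguments.
cmd = ["docker", "run", "--rm", "airbyte/source-s3", "read", "..."]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
total, start = 0, time.perf_counter()
while chunk := proc.stdout.read(1 << 20):  # read in 1 MiB chunks
    total += len(chunk)
elapsed = time.perf_counter() - start
proc.wait()
print(f"{total / 1e3 / elapsed:.1f} KB/s over {total} bytes")
```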
cc @[DEPRECATED] Marcos Marx
@Syed Farhan Ahmed sorry if I mis-notified you there