Andrey Morskoy

10/11/2021, 9:50 AM
Dear Team. I have a question on Airbyte's roadmap for performance and scalability. I could potentially work on performance improvements myself, if I am lucky enough to have some time. Could someone please comment on my speculations below? 1. I inspected the Python CDK and
, as well as
. It seems that the source spends ~60% of its time converting data into AirbyteMessage (before transformers) and later making
. Are there any plans to make these conversions less painful? I would be happy to get any information that helps me understand the general direction in which this architecture is moving. 2. Are there any plans for scalability? At the moment, the conversions and transformations performed in
container are both obvious candidates for running in parallel. To me it looks quite promising to have
responsible only for fetching data in some raw form (byte arrays?) and to delegate complex conversions and transformations/normalization to a scalable middle layer (even a naive Apache Spark Streaming setup would be a good improvement, I suppose). May I ask which direction Airbyte is following to deal with scalability?
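
A rough way to check a figure like the ~60% above is to time the fetch step and the wrap-and-serialize step separately. The sketch below is only an illustration: `RecordEnvelope`, `fetch_raw`, `wrap_and_serialize` and `measure` are hypothetical names, and the envelope is a stand-in, not Airbyte's actual message class or CDK code.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Dict, Iterator


@dataclass
class RecordEnvelope:
    """Hypothetical per-record wrapper (an assumption, not Airbyte's class)."""
    type: str
    stream: str
    data: Dict[str, Any]
    emitted_at: int


def fetch_raw(n: int) -> Iterator[Dict[str, Any]]:
    """Simulates the fetch step by yielding plain dicts."""
    for i in range(n):
        yield {"id": i, "name": f"user-{i}", "score": i * 0.5}


def wrap_and_serialize(record: Dict[str, Any], stream: str) -> str:
    """Wraps one record in the envelope and serializes it to a JSON line."""
    envelope = RecordEnvelope("RECORD", stream, record, int(time.time() * 1000))
    return json.dumps(asdict(envelope))


def measure(n: int = 10_000) -> Dict[str, float]:
    """Times fetching and conversion separately and reports the conversion share."""
    t0 = time.perf_counter()
    records = list(fetch_raw(n))
    t1 = time.perf_counter()
    lines = [wrap_and_serialize(r, "users") for r in records]
    t2 = time.perf_counter()
    fetch_s, convert_s = t1 - t0, t2 - t1
    total = fetch_s + convert_s
    return {
        "fetch_s": fetch_s,
        "convert_s": convert_s,
        "convert_share": convert_s / total if total else 0.0,
        "lines": len(lines),
    }


if __name__ == "__main__":
    print(measure())
```

The numbers depend entirely on the machine and on how expensive the real fetch is, so this only shows how to separate the two costs, not what the split is in any real connector.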


10/11/2021, 11:37 AM
I have no definitive answer as I am totally new to Airbyte. However, on point 2, I would be surprised and disappointed if Airbyte went with a model that simply delegates transformations to something like Spark. This would mean that Airbyte is basically no different from StreamSets or Airflow, and it would introduce a dependency on expensive and complex services that people don't otherwise want. I do agree that transformation should be separated from Sources/Destinations, though, but as a middle layer of connectors (e.g. Source, Transformer, Destination). As these are already containers in a k8s environment, you already have the ability to parallelise and dynamically expand, so you may as well take advantage of the architecture ** Though I suppose this may be related to Airbyte calling itself EL(T) rather than ETL, so doing these inline transforms probably isn't the purpose of the tool...


10/11/2021, 11:48 AM
And on point 1, this is similar to a problem you'll find in StreamSets, where everything is a Record, which adds overhead and makes handling arbitrary binary files difficult and large datasets much less efficient. Contrast this with NiFi, where everything is just a bytestream and you can optionally apply Records on top as needed. I'd be interested to learn what the team's thoughts are on both points as well.