Horizontal scaling question: I understand there is a setting to control max workers (syncs). But what about parallel / scale-out execution for a single connection with a huge number of tables?
E.g. a k8s deployment with a single connection should ideally be split across the many available nodes in the pool.
The current situation is that you have to split the load manually across multiple connections, being careful to keep the streams distinct. Not nice; this should be a setting, to lessen the maintenance...
Ideally there should be a queue of streams that are automatically assigned to available resources.
Probably a connection-level setting. Connections (sources and destinations) shouldn't care about it. It is strictly the responsibility of the orchestration/scheduler implementation. E.g. split the streams into queues and assign them to workers.
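To illustrate the queue idea, here is a minimal Python sketch. It is not how Airbyte's scheduler works; `sync_stream`, `run_connection`, and the stream names are hypothetical. The streams of one connection go into a shared queue, and each worker pulls the next stream as soon as it is free, so the source and destination never need to know about the split.

```python
import queue
import threading


def sync_stream(stream_name):
    # Hypothetical placeholder for replicating one stream (extract + load).
    return f"synced {stream_name}"


def run_connection(streams, max_workers):
    """Drain a shared queue of streams with a fixed pool of workers."""
    pending = queue.Queue()
    for name in streams:
        pending.put(name)

    results = {}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                name = pending.get_nowait()
            except queue.Empty:
                return  # no streams left for this worker
            outcome = sync_stream(name)
            with lock:
                results[name] = (worker_id, outcome)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(max_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_connection(["users", "orders", "events"], max_workers=2)` returns one entry per stream, tagged with the id of the worker that handled it. Which worker gets which stream depends on timing.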
The scheduler could be further optimized either with stats or by changing the source spec to include estimated row counts or similar useful metrics.
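As a rough sketch of how such estimates could be used (an assumption for illustration, not an existing Airbyte feature): with an estimated row count per stream, a planner can assign the largest streams first, each to the worker with the least work so far. `plan_assignment` and the numbers below are hypothetical.

```python
import heapq


def plan_assignment(estimated_rows, num_workers):
    """Greedy longest-first plan: largest stream goes to the least-loaded worker."""
    # Min-heap of (rows assigned so far, worker id).
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    plan = {worker: [] for worker in range(num_workers)}

    # Largest estimate first; ties are broken by stream name for determinism.
    ordered = sorted(estimated_rows.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, rows in ordered:
        load, worker = heapq.heappop(heap)
        plan[worker].append(name)
        heapq.heappush(heap, (load + rows, worker))
    return plan


estimates = {"events": 900, "orders": 500, "users": 300, "invoices": 200}
print(plan_assignment(estimates, 2))
# {0: ['events'], 1: ['orders', 'users', 'invoices']}
```

Here the two workers end up with 900 and 1000 estimated rows. This greedy heuristic does not guarantee an optimal split, but it keeps one huge table from being queued behind many small ones.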
Normalization for successful EL streams should optionally be postponed if it affects workers (not sure how it is implemented). The idea is to be able to scale in the k8s pool, as dbt is a destination workload.
One important consideration, and the reasoning behind the original suggestion: I should only ever have to split the source load across multiple connections if I want different sync schedules, never because of parallelism.
Augustin Lafanechere (Airbyte)
05/02/2022, 3:49 PM
Hi @Hrvoje Piasevoli, thank you for this feedback. As you observed, replication of streams currently happens sequentially. You can only tweak the parallelization of jobs, not of streams inside a single job, as you mentioned. This is something we plan to work on; you can follow this epic issue on GitHub. Feel free to share your suggestions there too 👍🏻
05/02/2022, 3:57 PM
Thanks very much for this @Augustin Lafanechere (Airbyte). Exactly what I needed and hoped already existed but couldn't find. I'll add my comments there