# ask-community-for-troubleshooting
j
Hi there! Is there a maximum number of connections you can define in Airbyte? We make business intelligence software for bridal stores. We need to ELT each of our clients’ data into our database for analysis. The data comes from disparate systems (accounting, point-of-sale) using client specific credentials (api/oauth tokens). The way I’m thinking of using Airbyte for this is by dynamically creating a new source and connection for each client via the Airbyte API. This would result in many sources and connections. I’d like to know if my thinking is on the right track or if there are practical limitations for this approach.
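For context, the per-client provisioning flow described above can be sketched against Airbyte's Configuration API (`POST /api/v1/sources/create`, then `POST /api/v1/connections/create`). This is a sketch under assumptions: the definition ID, credential fields, and schedule shape are placeholders, not verified against any particular Airbyte version.

```python
import json

AIRBYTE_URL = "http://localhost:8000/api/v1"  # assumed local deployment

def source_payload(workspace_id: str, client_name: str, credentials: dict) -> dict:
    """Build a body for POST /sources/create — one source per client."""
    return {
        "workspaceId": workspace_id,
        "sourceDefinitionId": "<source-definition-uuid>",  # placeholder: e.g. the POS connector
        "name": f"{client_name}-pos",
        "connectionConfiguration": credentials,  # client-specific API/OAuth tokens
    }

def connection_payload(source_id: str, destination_id: str) -> dict:
    """Build a body for POST /connections/create; field names are illustrative."""
    return {
        "sourceId": source_id,
        "destinationId": destination_id,
        "status": "active",
        "schedule": {"units": 24, "timeUnit": "hours"},
    }

# In practice you would POST these bodies to the API, e.g. with
# requests.post(f"{AIRBYTE_URL}/sources/create", json=body).
body = source_payload("ws-123", "acme-bridal", {"api_key": "***"})
print(json.dumps(body, indent=2))
```

A consistent naming convention (here `<client>-pos`) makes the resulting thousands of sources scriptable later, since the API is the only practical way to manage them at that scale.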
s
What number of connections are you anticipating? @Harshith (Airbyte) operated an Airbyte instance with thousands of connections via the API. The UI would face limitations, but an API-based approach could work.
j
We’d be looking at easily thousands in the short term. Probably tens of thousands in the medium term.
So, it feels as though our use case is a bit of an edge case for Airbyte considering sources are associated with an individual user’s credentials. Is there another way to come at this problem that I’m not thinking of?
s
Hi, I found this question interesting, so I'd like to follow up with similar questions. We also make business intelligence software for companies and ELT each of our customers' data into our database, so I'm curious what problems may arise when Airbyte is scaled past tens of thousands of connections. For instance, if we want to change all connections with Stripe as the source, would we have to change each connection individually, or is it possible to group connections somehow? Is our use case not optimal for Airbyte open source, or is there some kind of best practice for dynamically creating and maintaining tens of thousands of connections?
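Since there is no server-side grouping of connections in the API, one workaround is to script the grouping client-side: list sources, filter by connector type, then update each matching connection. A minimal sketch, using in-memory stand-ins for the hypothetical list/update API calls:

```python
# Hypothetical stand-ins for the responses of /sources/list and /connections/list.
sources = [
    {"sourceId": "s1", "sourceName": "Stripe",   "name": "client-a-stripe"},
    {"sourceId": "s2", "sourceName": "Postgres", "name": "client-a-db"},
    {"sourceId": "s3", "sourceName": "Stripe",   "name": "client-b-stripe"},
]
connections = [
    {"connectionId": "c1", "sourceId": "s1", "status": "active"},
    {"connectionId": "c2", "sourceId": "s2", "status": "active"},
    {"connectionId": "c3", "sourceId": "s3", "status": "active"},
]

def connections_for_source_type(source_type: str) -> list:
    """Group connections by connector type, since the API has no built-in grouping."""
    ids = {s["sourceId"] for s in sources if s["sourceName"] == source_type}
    return [c for c in connections if c["sourceId"] in ids]

# Pause every Stripe-backed connection (in practice: one update call per connectionId).
for conn in connections_for_source_type("Stripe"):
    conn["status"] = "inactive"

print([c["connectionId"] for c in connections if c["status"] == "inactive"])
# → ['c1', 'c3']
```

The same filter-then-iterate pattern applies to any bulk change (schedules, sync modes, credentials rotation) across a fleet of connections.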
a
Hi @Jay Tavares and @Sebastian Berner, as Sherif explained, the UI will be pretty unusable if you have thousands of connections. But you can still provision thousands of connections with the API or our upcoming CLI tool (#public-octavia-cli); the limiting factor will be the size of your Airbyte instance. This kind of setup will require you to properly size your Airbyte database and configure the right job parallelism.
👍 1
j
We use Helm/Kubernetes for some other things, so we’d probably be looking to use it for Airbyte as well. I’ve seen some discussion about potential issues when too many jobs get scheduled. We’d want to make sure that Airbyte has the ability to queue sync jobs for later execution if the resources of our k8s cluster are overcommitted. I recognize that this is all pretty bleeding edge.
While the API-only approach could work, it would be nice to have access to the UI for troubleshooting. What UI limitations are we talking about? Is it just the display of the sheer number of connections? Could some tweaks like pagination/infinite scroll, list filtering, etc. be implemented to help?
a
> Is it just the display of the sheer number of connections? Could some tweaks like pagination/infinite scroll, list filtering, etc. be implemented to help?

Exactly, the UI might be slow to respond and hard to manage due to the scrolling effort required. We are indeed working on making it usable with pagination.
> We’d want to make sure that Airbyte has the ability to queue sync jobs for later execution if the resources of our k8s cluster are overcommitted. I recognize that this is all pretty bleeding edge.

The number of concurrent jobs is manageable using the `MAX_SYNC_WORKER` environment variable, and queuing already exists in our current scheduling logic. Moreover, we have an ongoing effort to revamp our scheduling engine using Temporal's scheduling capabilities. If you have additional follow-up questions, feel free to post on our Discourse forum, as we're migrating our community support there to boost topics' discoverability.
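As a config fragment, the concurrency cap mentioned above would be set in the environment of the worker deployment; the variable name is taken verbatim from this thread and the value here is purely illustrative:

```shell
# In the .env consumed by docker-compose (or the equivalent Helm values),
# cap the number of sync jobs running at once; excess jobs stay queued.
MAX_SYNC_WORKER=10
```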
j
Awesome. Sounds like things are moving in the right direction, and while they may not be ideal for this setup today, they may be soon.
> The number of concurrent jobs is manageable using the `MAX_SYNC_WORKER` environment variable, and queuing already exists in our current scheduling logic.
👍