# feedback-and-requests
b
Hello! I am wondering if something like "remote workers" is being considered. Hosting Airbyte in the cloud is great. However, large companies will often also have a bunch of sources on premises behind a firewall. Opening the firewall towards the cloud is not always going to be possible (even through private connections). Many solutions use a pattern where some "gateway" or "agent" is deployed on premises, which polls the main environment (in the cloud) for work and executes any work it gets. It seems this would also be an interesting model for Airbyte: manage your EL pipelines all in the same place, but allow some workers to execute, for example, on an on-premises Kubernetes cluster (only needing to open HTTPS from inside to outside, depending on targets). I did a quick search of the Airbyte docs and GitHub but didn't find anything related to this.
u
u
I think they are considering it, but having that as a feature would probably add to the volume and complexity of support, especially when launching cloud. 🤞 it’ll be an option in the future
b
Hi, this is definitely on our roadmap. If you look closely at our current architecture, you’ll see a separation of control and data plane that makes it easy to support something like this. The current thinking is for this to happen in the second half of 2022.
u
Thanks for the confirmation. Great to hear. The architecture indeed already looks pretty much ready for this. It would be great if the implementation could allow only outgoing connections (a pull model). For some companies, this would avoid a lot of discussion when deploying on-prem.
u
What do you mean by ‘only allowing outgoing connections’? Only allowing connections initiated by an egress call? Want to make sure I understand the concern here
b
I mean that only the remote "agent" or "execution environment" (whatever it gets called) connects to the "main Airbyte environment" (cloud or self-hosted somewhere else). Similar to how GitLab Runners only need outbound connectivity (for example https://forum.gitlab.com/t/how-does-communicate-gitlab-runners/7553).
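The pull model described above (agent polls the control plane over outbound connections only) can be sketched roughly as follows. This is a minimal illustration, not Airbyte's actual API: the `ControlPlane` and `RemoteAgent` classes and the job names are all hypothetical.

```python
import queue

class ControlPlane:
    """Stands in for the cloud-hosted Airbyte control plane."""
    def __init__(self):
        self._jobs = queue.Queue()

    def schedule(self, job):
        self._jobs.put(job)

    def poll(self):
        # In a real deployment the agent would call this over outbound
        # HTTPS; the control plane never initiates a connection inward.
        try:
            return self._jobs.get_nowait()
        except queue.Empty:
            return None

class RemoteAgent:
    """Runs on-premises; only ever makes outbound calls to the control plane."""
    def __init__(self, control_plane):
        self.control_plane = control_plane
        self.completed = []

    def run_once(self):
        job = self.control_plane.poll()
        if job is not None:
            # Execute the sync locally (e.g. on an on-prem Kubernetes
            # cluster) and record the result; reporting back would again
            # be an outbound call.
            self.completed.append(f"synced:{job}")
        return job

plane = ControlPlane()
plane.schedule("postgres->snowflake")  # hypothetical connection name
agent = RemoteAgent(plane)
agent.run_once()
print(agent.completed)  # ['synced:postgres->snowflake']
```

The key property, as with GitLab Runners, is that the firewall only needs to permit connections initiated from inside the network.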
u
Yes that’s exactly what we have in mind. fyi @charles @Jared Rhizor (Airbyte) who are also involved in this initiative
u
This may be related to the question above; apologies, I will delete it if not. Is a possible related issue with Airbyte the "noisy neighbour" problem, whereby a resource-intensive or long-running connection interferes with other connections? Is this a common issue? If so, is the idea of splitting the compute and scheduling concerns into different components addressed by the roadmap above, or is this currently possible? (I guess this is becoming a question of orchestration rather than data movement)
j
My reason for the question is security. Opening firewalls for connections from cloud to on-prem is either not possible or difficult for some companies. So if you want to ingest, for example, data from on-prem operational databases, you are quickly limited in what is possible.
u
@Bruno Quinart thanks, yes my question possibly tangential - related to resource management, but I suppose possibly the solution to your issue is related to mine 😄
u
I’ve not seen it in our tests, but that doesn’t mean noisy neighbour isn’t there. I doubt noisy neighbour is a significant cause unless there is one connection with such high bandwidth that the CPU is saturated. Most cloud providers peg instance bandwidth to the number of CPU cores. What we have seen is multiple ongoing syncs to the same destination eating each other’s bandwidth, and the destination starting to throttle writes from a single IP. We’ve seen this on BigQuery. We have reports on Snowflake but haven’t had time to reproduce them.
u
Ok very interesting, thanks Davin! Minor concern from me on that being an issue, I was mostly curious
u
The scenario of running an on-premises agent in order to have only outgoing connections makes sense from a security standpoint; I’ve been in that position before. As a shorter-term option: it might be possible to host a pair of bastion servers, where one is inside your company network and the other is outside of it. The inside one provides reverse-SSH tunneling to port-forward the things you want to load data from, but it only sends them to the external bastion; Airbyte connectors then access the data through an SSH tunnel to the external bastion. This strategy leaves you in control of both servers, and ensures that you’re making only outbound connections from the private-network side. Is this acceptable to corporate security policy? Or is the existence of the external bastion still considered too much of a risk? I’m trying to get a sense of whether this is a sensible recipe to offer folks in similar positions.
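For reference, a rough sketch of the reverse-tunnel recipe described above. All hostnames, users, and ports here are made up for illustration; the real setup depends on your network.

```shell
# On the INSIDE bastion (inside the private network):
# open an outbound-only SSH connection to the external bastion and
# forward its local port 15432 back to the internal database.
ssh -N \
  -R 127.0.0.1:15432:internal-db.corp.example:5432 \
  tunnel@external-bastion.example.com

# The Airbyte connector then reaches the EXTERNAL bastion via its
# SSH tunnel option, with the source host set to 127.0.0.1 and
# port 15432 as seen from that bastion.
```

Note that no inbound firewall rule into the private network is needed; the only listening endpoint exposed to Airbyte is on the external bastion, which you control.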
m
Our issue is that our data can’t leave our private network as part of contractual agreements, and we need to move it from disparate DCs in different geolocations (abiding by and cleaning data in compliance with local laws, GDPR especially) to our warehouse. For us, having the connectors run outside of our network is the main issue. Self-hosted is working great so far though. octavia loves
u
Okay, thanks for the clarity!