Hi there,
I'm considering using Airbyte to transfer my "final" tables from a PostgreSQL database, where all the data transformations are performed, to a read-only target database dedicated to a BI tool.
Do you know if this is good practice? Or would it be more appropriate to create an RDS database with a read replica, for instance?
Thank you in advance for your help!
✅ 1
[DEPRECATED] Augustin Lafanechere
12/27/2021, 2:05 PM
Hi @Louis Monier, it's a good use case. It's an excellent design to decouple BI analysis from the transactional workload your source Postgres is probably serving. Depending on the volume or the availability requirements of your source database, you might also find it helpful to create a read replica of your source database, to avoid load from the Airbyte sync on your master database.
Louis Monier
12/28/2021, 10:46 AM
Hey @[DEPRECATED] Augustin Lafanechere,
Thanks for your answer. Our database is not that big (~600 GB) but has a lot of schemas and tables (~1,800). I'd say we only use ~50 "final" tables for our BI analysis.
I'll write a query to find out what fraction of the total size those "final" tables represent; my first guess is something like 200 GB.
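Something like this rough sketch, using pg_total_relation_size (the schema and table names are placeholders for our ~50 BI tables):
```sql
-- Size of the "final" BI tables vs. the whole database.
-- The names listed in bi_tables are placeholders; substitute the real ones.
WITH bi_tables(schema_name, table_name) AS (
    VALUES ('analytics', 'final_orders'),
           ('analytics', 'final_customers')  -- ...and the other ~48 tables
)
SELECT
    pg_size_pretty(SUM(pg_total_relation_size(
        format('%I.%I', schema_name, table_name)::regclass))) AS bi_tables_size,
    pg_size_pretty(pg_database_size(current_database()))      AS database_size,
    round(100.0 * SUM(pg_total_relation_size(
        format('%I.%I', schema_name, table_name)::regclass))
          / pg_database_size(current_database()), 1)          AS pct_of_database
FROM bi_tables;
```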
One thing is 100% sure: if we create a replica, we're never going to read the rest of the data, only those 50 tables.
• Replica option: is it unreasonable to have a replica with lots of data we never use, or is that pretty common?
• Airbyte option: is 200 GB synced each night from the master DB to the read DB too much data? Is there an incremental mechanism that handles this quickly, or would the volume + load on the master DB be too big?
I've searched the Internet to see whether it's possible to create only a partial read replica. It's not possible, but there are workarounds:
• DMS, but apparently it's not recommended
• AWS Data Pipeline (looks similar to what Airbyte would achieve)
• Disable the read-only option on the replica and flush out the unwanted data (I think that's the worst one!)
[DEPRECATED] Augustin Lafanechere
12/28/2021, 10:52 AM
I'd go for the Airbyte option. If your tables can be loaded incrementally, you'll only have a heavy load on the first sync; future daily syncs will only query recently created data.
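Conceptually, each incremental sync boils down to a query like this against the source (the table and cursor column names here are just placeholders):
```sql
-- An incremental sync only reads rows whose cursor value is newer than
-- the last value recorded at the end of the previous sync.
SELECT *
FROM analytics.final_orders
WHERE updated_at > '2021-12-27 08:00:00'  -- last recorded cursor value
ORDER BY updated_at;
```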
✅ 1
Louis Monier
12/28/2021, 11:44 AM
Thanks for your help!
@[DEPRECATED] Augustin Lafanechere
I investigated a bit further into the amount of data we really use in our BI tool.
• Total size of the DB currently: 570 GB
• Sum of the sizes of the tables we actually use in our BI tool: approximately 109 GB, i.e. 19% of the total size of the database.
I asked for a trial of Airbyte, but it seems there is a waitlist and it's rather long (no access within a few hours/days of the request). I've searched a bit, and right now I'm considering using Stitch for a test.
• With those more precise numbers, do you still suggest using an ETL tool rather than a simple read replica on AWS?
• What do you think about Stitch?
• I've looked a bit into doing it myself:
◦ Create an EC2 instance dedicated to transferring the data from Postgres to Postgres
◦ pgloader and pgsync are open-source tools that could do the trick, but this looks complicated to implement and may not be robust.
◦ What's your opinion on this as well?
I should mention that the main problem with ETL tools is that they change the schema names. I need exactly the same names for the schemas/tables/columns, because my BI tool will break all the dashboard filters if a reference name is different.
Thanks a lot in advance!
[DEPRECATED] Augustin Lafanechere
01/11/2022, 1:53 PM
Hi @Louis Monier, is your company located in the US? If not, I'm afraid it's not possible to use Airbyte Cloud at the moment... I'd suggest you try Airbyte locally or on a cloud instance, simply by using docker-compose from our open-source repo. I don't recommend implementing the entire ETL pipeline yourself; instead, give our custom transformation feature a try if you want to rename the output tables to match your needs.
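For illustration, under the hood such a transformation can be as simple as a SQL statement run against the destination after each sync (all schema and table names below are placeholders; adapt them to wherever your sync lands the data):
```sql
-- Hypothetical post-sync transformation: re-create the synced table under
-- the exact schema/table name the BI tool expects.
CREATE SCHEMA IF NOT EXISTS analytics;
DROP TABLE IF EXISTS analytics.final_orders;
CREATE TABLE analytics.final_orders AS
SELECT * FROM public.final_orders;  -- wherever the sync landed the table
```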