# feedback-and-requests
Hi, I don’t really know where to put this message, but feedback seems like the most appropriate channel. I’m trying to sync a database with only a handful of tables (only 4, actually), but a couple of them are sizable: one with 50+ million rows and another with 10+ million. Using the “Incremental | Deduped + history” sync mode to load this into a Postgres database seems all but impossible. Part of the problem is how Airbyte/DBT loads and transforms in the destination, and the other part seems to be how Postgres handles updates (INSERT/UPDATE/DELETE statements).

First, Airbyte copies the rows over and writes them in raw/JSON format; that’s not so bad, it takes about 4 hours on my AWS t3.xlarge/db.t3.xlarge instances. Next it creates the `_stg` tables. After that the `_scd` tables are created. Then the final tables are created. This means each row is processed at least 4 times (not counting indexing those tables). In my case, working with about 70 million records, 280 million records are getting processed. Incremental syncs should be better, but this problem is compounded by Postgres.

Quickly, for anyone who is unfamiliar with Postgres: when you do an update, Postgres actually writes a new record to disk and marks the old one as dead. Later a process (autovacuum) scans through the table looking for dead records, catalogs them, then removes them. Deleting a record simply marks it as dead. The way DBT creates/updates/deletes records creates a lot of dead records; a significant portion of those 280 million records will need to be vacuumed. Vacuuming, of course, takes resources, and the database instance grinds to a crawl under the combined load of the DBT transforms and vacuuming. So far the 50-million-record table has never completely synced, even after letting it churn for, literally, days.

Would a database other than Postgres be a better choice for a warehouse? Probably, but for now it’s what I have to work with. Could tuning Postgres/autovacuum improve performance? Possibly, but that’s a bit beyond my Postgres skill/knowledge/experience. I’ve also wondered whether a different sync mode would work better. Sorry, that turned into a bit of a rant, but hopefully it gives you a sense of some of the pain points of, at least, an initial sync of a significant dataset. I don’t know if there’s any possibility in the future to cut down on the amount of processing that happens in the destination database.
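For anyone in the same boat, the dead-tuple buildup can be checked, and autovacuum made more aggressive on just the big tables, with something along these lines. The table name is a placeholder and the thresholds are only a starting point, not tuned values:

```sql
-- Sketch: see how many dead tuples each table has accumulated
-- and when autovacuum last got to it.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Make autovacuum kick in earlier on one specific large table.
-- "my_big_table_scd" is a placeholder; the numbers are illustrative.
ALTER TABLE my_big_table_scd
  SET (autovacuum_vacuum_scale_factor = 0.01,
       autovacuum_vacuum_cost_limit   = 2000);
```

Per-table storage parameters like these only affect the named table, so the rest of the database keeps the instance-wide defaults.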
Here’s a screenshot of a database where I restarted a sync that I had killed because it was running so long.
Here’s a screenshot of a database I let run for several days before I killed the query and restarted it. The drops in CPU (green) correspond to table syncs completing.
Hi @Jason Edwards. I have no insight on the Postgres performance tuning in particular, but as always you should make sure that your disk is performant if you host it yourself. For example, on GCP I have had great success with local NVMe SSD disks (IOPS scales with the number of partitions, i.e. disk size). I have measured orders-of-magnitude differences in performance compared to "standard" setups for certain workloads. That being said, I have faced similar issues for a long time (not Airbyte-specific, but tricky OLTP database ingestion workloads in general), and I have been working on something that lets you ingest this amount of data in a short time. It implements a Singer/Airbyte-like protocol natively and can act as both a source and a destination in Airbyte. In your case it could potentially serve as a middle layer between your source and the Postgres destination, where it only loads the normalized records once. I don't want to spread artificial benchmarks that compare apples to oranges, but the ingestion time in your specific case would likely be measured in minutes. I also don't want to share too many details (in public) before it's ready, but in case you are interested and don't have a super critical workload, it would be interesting to get in touch and see if this could help you (and I would love to get your feedback).
@Christian Persson I’m interested, you can send me a DM and let’s talk more to see if it’s a fit. I’m far from an AWS RDS wizard, but as far as I can tell it doesn’t seem to be an I/O issue. When I look at the read/write latency/IOPS, they seem really low; there’s not much I/O, and it seems more like it’s CPU bound.
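For what it’s worth, the same CPU-vs-I/O question can be sanity-checked from inside Postgres rather than from CloudWatch: an active backend with no wait event is essentially burning CPU, while I/O or lock wait events would point elsewhere. A rough sketch, nothing Airbyte-specific:

```sql
-- Sketch: what are the long-running backends waiting on?
-- An 'active' backend with wait_event IS NULL is using CPU.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS runtime,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;
```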
Looking into it more, I’ve realized it’s still indexing the big (50+ million record) `_stg` table. The indexing has been running for 26 hours, with 11,904,285 tuples done out of a total of 49,982,795. So, at least for me, indexing is a big problem.
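On Postgres 12+ tuple counts like those come from pg_stat_progress_create_index, and giving the backend more maintenance_work_mem is the usual knob for slow index builds. A sketch only; the database name and the 1GB value below are placeholders, not recommendations:

```sql
-- Sketch: watch index-build progress (Postgres 12+).
SELECT p.pid,
       p.phase,
       p.tuples_done,
       p.tuples_total,
       a.query
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a USING (pid);

-- More memory for the sort phase of index builds.
-- "my_warehouse" and '1GB' are placeholders; size to the instance.
ALTER DATABASE my_warehouse SET maintenance_work_mem = '1GB';
```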
I'll send you a DM later today. In the meantime, maybe it's worth looking into a service like https://ottertune.com. I have no personal experience using this tool, but one of the founders is Andy Pavlo.
Thank you @Jason Edwards for this valuable feedback! @Christian Persson, could you please explain a bit more about the role of the tool you're working on and how it will integrate with Airbyte? If you prefer not to spoil it before release, I'll understand.
@[DEPRECATED] Augustin Lafanechere I’m not quite ready, so it wouldn’t make sense to go public yet. I saw this as a good opportunity to test it and to get some early feedback.