Jason Edwards
01/12/2022, 6:47 PM_stg
tables. After that the _scd
tables are created. Then the final tables are created. This means each row is processed at least 4 times (I’m not including indexing those tables). In my case, working with about 70 million records, 280 million records are getting processed. Incremental syncs should be better.
But this problem is compounded by Postgres. Quickly, for anyone who is unfamiliar with Postgres, when you do an update, Postgres actually writes a new record to disk and marks the old one as dead. Later a process (Autovacuum) scans through the table looking for dead records, catalogs them, then removes them. Deleting a record simply marks it as dead. The way DBT creates/updates/deletes records creates a lot of dead records. Like a significant portion of those 280 million records will need to be vacuumed. Vacuuming, of course takes resources, and the database instance grinds to a crawl under the load DBT transforms and vacuuming. So far the 50 million record table has never completely synced, even letting it churn for, literally, days.
Would a database other than Postgres be a better choice for a warehouse? Probably, but for now it’s what I have to work with. Could tuning Postgres/autovacuum improve performance? Possibly, but that’s a bit beyond my Postgres skill/knowledge/experience. I’ve also wondered a different sync mode would work better.
Sorry, that turned into a bit of a rant. But hopefully it gives you a sense of some of the pain points of, at least, an initial sync of a significant dataset. I don’t know if there’s any possibility in the future to cut down on the amount of processing that happens in the destination database.Jason Edwards
01/12/2022, 6:55 PMJason Edwards
01/12/2022, 7:02 PMMijbel Alqattan
01/12/2022, 9:19 PMMijbel Alqattan
01/12/2022, 11:25 PMMijbel Alqattan
01/12/2022, 11:29 PM_stg
table. The indexing has been running for 26 hours with 11,904,285 tuples done out of a total of 49,982,795 tuples. So, at least, for me indexing is a big problem.Titas Skrebė
01/13/2022, 6:25 AMTitas Skrebė
01/13/2022, 2:41 PMSurya Prakash
01/16/2022, 1:27 PM