we went from 11 -> 12 in prod a week ago and al...
# aws
z
we went from 11 -> 12 in prod a week ago and all hell broke loose. I’d highly recommend running an ANALYZE on all your indexes, we had a lot of problems until we did that
b
@Zach oof thats a bit worrisome, we ran the upgrades for all our environments over the span of 2 months (we have many - qa, demo, trial, and 6 or so other environments) where either internal users or external users (customers) are accessing our many sites there and never got any complaints. We have 2 production RDS servers, one which we upgraded 2 weekends ago and one which we upgraded this past weekend.
What problems did you have?
z
horrible performance on indexed operations
we initially just ran renidexing, but then later found some documentation saying that computed indices during a major version upgrade really need to have Analyze run
we caught this issue when I finally found that RDS does a pgupgrade, which doesn’t catch all the recommended actions from postgres on major version upgrades. They actually recommend doing a pgdump and restore … obviously not so great on RDS but we didn’t realize initially that RDS didn’t conform to that
b
Interesting, other than the re-indexing and analyze and perfomance hits, you didn’t run into any other issues? My worry is more-so with data loss or data corruption. we did plan on doing the reindexing as mentioned here and will definitely run an analyze as well now: https://info.crunchydata.com/blog/just-upgrade-how-postgresql-12-can-improve-your-performance
z
that was the only issue, and we had problems with just about every database on 3 different RDS until we did the analyze
queries were taking forever to complete
b
hmm, we have roughly 1000 customers (big and small) and have not heard any complaints in regards to application performance. we migrated roughly 500 of them over 10 days ago.
z
well thats good then. You seem to be seeing the perf increases we had hoped we’d see from pg12 😆
b
I’ve sanity checked many of their sites but nonetheless I’m still EXTREMELY anxious about it.
was the re-indexing resource intensive? I know pg12 allows for concurrent re-indexing now which is nice
z
yes big cpu/load spikes even with the concurrent
also make sure you are nowhere near low storage, because it has to make copies of the indices
b
thanks for answering my many questions @Zach. how much storage are you using and what instance type?
we actually are WAY overprovisioned on storage as we had to scale up to handle upwards of 12,000 iops. We’re currently using 500GB out of 4TB of storage.
z
we’re pretty small, these are like m5.2xl with 500gb
we tend to be cpu bound more than memory or iops for some reason
b
does ANALYZE do anything in regards to improving performance?
Allows the query planner to be smarter?
Copy code
ANALYZE collects statistics about the contents of tables in the database, and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.
z
if the indices are using ‘computed’ values, it regenerates those
our problem seemed to be upgrade + reindex leaving lots of stale values in the indexes, so the planner decided to just ignore it
I’m not a DB guy so thats just what I learned on the fly as we put the fire out
b
neither am I. I’m wondering if your issue was specific to indicies w/ computed values and if my team doesn’t use computed values
I’m asking as we speak
z
I think our issue was computed values + really bloated partition tables
b
how did you verify analyze + reindex worked? I assumed since it was a fire, just by having the application re-run its queries?
z
db load shot back down
it was operating at like 40-50x load for the # of vcpus
b
hmm. my db load is low
z
yah your useage seems very different
or alternatively, my teams write bad sql 🤷‍♂️
b
yeah knock on wood but I might be lucky with this one. I’m definitely going to keep digging though. much appreciated @Zach
for the record, our cpu load is hanging around 20-40%
vcpu sessions in perf insights between 1~2 out of 8vcpus
sorry last question @Zach, if you dont mind roughly how many databases are you running in your server? im curious because we have many databases in a single server, my index tables could be small compared to yours even though we run basically the same size DB Server?
z
~10 services on one, which is where the fire came from when some of the unanalyzed indexes caused everything else to slam to a halt
the others are 1-2 services per rds