we went from 11 gt 12 in prod a week ago and all hell broke SweetOps #aws

we went from 11 -> 12 in prod a week ago and al...

Zach

02/16/2021, 9:27 PM

we went from 11 -> 12 in prod a week ago and all hell broke loose. I’d highly recommend running an ANALYZE on all your indexes, we had a lot of problems until we did that

btai

02/16/2021, 9:43 PM

@Zach oof thats a bit worrisome, we ran the upgrades for all our environments over the span of 2 months (we have many - qa, demo, trial, and 6 or so other environments) where either internal users or external users (customers) are accessing our many sites there and never got any complaints. We have 2 production RDS servers, one which we upgraded 2 weekends ago and one which we upgraded this past weekend.

btai

02/16/2021, 9:44 PM

What problems did you have?

Zach

02/16/2021, 9:51 PM

horrible performance on indexed operations

Zach

02/16/2021, 9:51 PM

we initially just ran renidexing, but then later found some documentation saying that computed indices during a major version upgrade really need to have Analyze run

Zach

02/16/2021, 9:53 PM

we caught this issue when I finally found that RDS does a pgupgrade, which doesn’t catch all the recommended actions from postgres on major version upgrades. They actually recommend doing a pgdump and restore … obviously not so great on RDS but we didn’t realize initially that RDS didn’t conform to that

btai

02/16/2021, 9:57 PM

Interesting, other than the re-indexing and analyze and perfomance hits, you didn’t run into any other issues? My worry is more-so with data loss or data corruption. we did plan on doing the reindexing as mentioned here and will definitely run an analyze as well now: https://info.crunchydata.com/blog/just-upgrade-how-postgresql-12-can-improve-your-performance

Zach

02/16/2021, 9:58 PM

that was the only issue, and we had problems with just about every database on 3 different RDS until we did the analyze

Zach

02/16/2021, 9:58 PM

queries were taking forever to complete

btai

02/16/2021, 10:00 PM

hmm, we have roughly 1000 customers (big and small) and have not heard any complaints in regards to application performance. we migrated roughly 500 of them over 10 days ago.

Zach

02/16/2021, 10:00 PM

well thats good then. You seem to be seeing the perf increases we had hoped we’d see from pg12 😆

btai

02/16/2021, 10:01 PM

I’ve sanity checked many of their sites but nonetheless I’m still EXTREMELY anxious about it.

btai

02/16/2021, 10:02 PM

was the re-indexing resource intensive? I know pg12 allows for concurrent re-indexing now which is nice

Zach

02/16/2021, 10:03 PM

yes big cpu/load spikes even with the concurrent

Zach

02/16/2021, 10:04 PM

also make sure you are nowhere near low storage, because it has to make copies of the indices

btai

02/16/2021, 10:04 PM

thanks for answering my many questions @Zach. how much storage are you using and what instance type?

btai

02/16/2021, 10:05 PM

we actually are WAY overprovisioned on storage as we had to scale up to handle upwards of 12,000 iops. We’re currently using 500GB out of 4TB of storage.

Zach

02/16/2021, 10:07 PM

we’re pretty small, these are like m5.2xl with 500gb

Zach

02/16/2021, 10:07 PM

we tend to be cpu bound more than memory or iops for some reason

btai

02/16/2021, 10:10 PM

does ANALYZE do anything in regards to improving performance?

btai

02/16/2021, 10:11 PM

Allows the query planner to be smarter?

Copy code

ANALYZE collects statistics about the contents of tables in the database, and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.

Zach

02/16/2021, 10:11 PM

if the indices are using ‘computed’ values, it regenerates those

Zach

02/16/2021, 10:12 PM

our problem seemed to be upgrade + reindex leaving lots of stale values in the indexes, so the planner decided to just ignore it

Zach

02/16/2021, 10:13 PM

I’m not a DB guy so thats just what I learned on the fly as we put the fire out

btai

02/16/2021, 10:13 PM

neither am I. I’m wondering if your issue was specific to indicies w/ computed values and if my team doesn’t use computed values

btai

02/16/2021, 10:13 PM

I’m asking as we speak

Zach

02/16/2021, 10:14 PM

I think our issue was computed values + really bloated partition tables

btai

02/16/2021, 10:16 PM

how did you verify analyze + reindex worked? I assumed since it was a fire, just by having the application re-run its queries?

Zach

02/16/2021, 10:16 PM

db load shot back down

Zach

02/16/2021, 10:17 PM

it was operating at like 40-50x load for the # of vcpus

btai

02/16/2021, 10:17 PM

hmm. my db load is low

Zach

02/16/2021, 10:17 PM

yah your useage seems very different

Zach

02/16/2021, 10:17 PM

or alternatively, my teams write bad sql 🤷‍♂️

btai

02/16/2021, 10:18 PM

yeah knock on wood but I might be lucky with this one. I’m definitely going to keep digging though. much appreciated @Zach

btai

02/16/2021, 10:19 PM

for the record, our cpu load is hanging around 20-40%

btai

02/16/2021, 10:20 PM

vcpu sessions in perf insights between 1~2 out of 8vcpus

btai

02/16/2021, 10:24 PM

sorry last question @Zach, if you dont mind roughly how many databases are you running in your server? im curious because we have many databases in a single server, my index tables could be small compared to yours even though we run basically the same size DB Server?

Zach

02/16/2021, 11:51 PM

~10 services on one, which is where the fire came from when some of the unanalyzed indexes caused everything else to slam to a halt

Zach

02/16/2021, 11:51 PM

the others are 1-2 services per rds

Open in Slack

Previous Next