A year ago we added SMART metrics collection to our monitoring agent, which collects disk drive attributes on our clients' servers.
So here are a couple of interesting cases from the real world.
Because we needed it to work without installing any additional software, like smartmontools, we implemented collection of only the basic, non-vendor-specific attributes, so we could provide a consistent experience. That way we also skipped the burdensome task of maintaining a knowledge base of vendor-specific details, and I like that a lot :)
This time we'll discuss only one SMART attribute, named "media wearout indicator". Normalized, it shows the percentage of "write resource" left in the device. Under the hood, the device keeps track of the number of erase cycles the NAND media has undergone, and the percentage is calculated against the maximum cycle count for that device. The normalized value declines linearly from 100 to 1 as the average erase cycle count grows from 0 to the rated maximum.
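The normalization described above boils down to a tiny calculation. A minimal sketch (the rated cycle count of 3000 below is just an illustrative assumption; every device has its own rating):

```python
def media_wearout_indicator(avg_erase_cycles, rated_max_cycles):
    """Normalized SMART value: declines linearly from 100 (new drive)
    as the average erase cycle count approaches the rated maximum."""
    remaining = 100 - round(100 * avg_erase_cycles / rated_max_cycles)
    return max(remaining, 1)  # per the docs, it never goes below 1

# e.g. a hypothetical drive rated for 3000 erase cycles:
print(media_wearout_indicator(0, 3000))     # 100
print(media_wearout_indicator(1500, 3000))  # 50
print(media_wearout_indicator(3000, 3000))  # 1
```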
Though SSDs are pretty common nowadays, just a couple of years ago you could hear a lot of scare talk about SSD wearout. We wanted to see if any of it was true, so we searched for the maximum wearout across all the devices of all of our clients.
It was just 1%
Reading the docs revealed that it just won't go below 1%. So that device is completely worn out.
We notified this client. Turns out it was a dedicated server at Hetzner. Their support replaced the device:
Since we introduced SMART monitoring for some of our clients quite a while ago, we have accumulated history. And now we can see it on a timeline.
The server with the highest wearout rate across our clients' servers was unfortunately added to okmeter.io monitoring only two months ago:
This chart shows that in just those two months it burned through 8% of its "write resource".
So under that load, 100% of this SSD's lifetime will be used up in 100 / (8% / 2 months) = 25 months, roughly 2 years.
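That back-of-the-envelope estimate is just a linear extrapolation:

```python
def months_until_worn_out(percent_used, months_observed):
    """Naive linear extrapolation of SSD wearout:
    assumes the observed write rate stays constant."""
    rate_per_month = percent_used / months_observed  # percent per month
    return 100 / rate_per_month

# 8% burned through in 2 months:
print(months_until_worn_out(8, 2))  # 25.0 months, roughly 2 years
```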
Is that a lot or a little? I don't know. But let's check what kind of load it's serving.
As you can see, it's ceph doing all the disk writes, but it's not doing these writes for itself: it's a storage system for some application. This particular environment was running under Kubernetes, so let's sneak a peek at what's running inside:
It’s Redis! Though you might’ve noticed divergence in values with the previous chart — values here are 2 times lower (it’s probably due to ceph’s data replication), load profile is the same, so we conclude it’s redis after all.
Let’s see what redis is doing:
So it’s on average less than 100 write commands per second. As you might know, there’s two ways Redis makes actual writes to disk:
- RDB — which periodically snapshots all the dataset to the disk, and
- AOF — which writes a log of all the changes.
It’s obvious that’s here we saw RDB with 1 minute dums:
We see three common patterns of server storage setup with SSDs:
- Two SSDs in a RAID-1 that holds everything there is.
- Some HDDs plus SSDs in a RAID-10; we see that setup a lot on traditional RDBMS servers: OS, WAL and some "cold" data on HDDs, while the SSD array holds the hottest data.
- Just a bunch of SSDs (JBOD) for some NoSQL like Apache Cassandra.
In the first case, RAID-1, writes go to both disks symmetrically, so wearout progresses at the same rate:
Looking for some anomalies we found one server where it was completely different:
Checking mount options to understand this didn't produce much insight: all the partitions were RAID-1.
But looking at per-device IO metrics, we again see a difference between the two disks: /dev/sda gets more bytes written:
Turns out there's swap configured on one of the /dev/sda partitions, and there's pretty decent swap IO on this server:
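A quick way to check which device backs swap on a Linux box is /proc/swaps. A minimal parser, shown here against sample content (the /dev/sda3 partition is hypothetical; the real file only exists on Linux):

```python
def swap_devices(proc_swaps_text):
    """Parse /proc/swaps content and return the backing device paths."""
    lines = proc_swaps_text.strip().splitlines()[1:]  # skip the header row
    return [line.split()[0] for line in lines]

sample = """\
Filename                                Type            Size    Used    Priority
/dev/sda3                               partition       8388604 123456  -2
"""
print(swap_devices(sample))  # ['/dev/sda3']
```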
This journey began with me wanting to check SSD wearout under different Postgres write load profiles. But no luck there: all of our clients' Postgres databases with any substantial write load are configured pretty carefully, with writes going mostly to HDDs.
But I found one pretty interesting case nevertheless:
We see that these two SSDs in a RAID-1 wore out by 4% over 3 months. But our guess that it was a high volume of WAL writes turned out to be wrong: it's less than 100 KB/s:
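A quick sanity check shows why WAL couldn't explain the wearout: a sustained 100 KB/s is under 9 GB written per day, tiny compared to the endurance of any modern SSD.

```python
def kbs_to_gb_per_day(kb_per_sec):
    """Convert a sustained write rate in KB/s to GB written per day."""
    return kb_per_sec * 86400 / 1e6  # 86400 seconds/day, 1e6 KB per GB

print(kbs_to_gb_per_day(100))  # 8.64 GB/day
```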
I figured Postgres probably generates writes some other way, and indeed it does: constant temp file writes, all the time:
Thanks to Postgres' elaborate internal statistics and okmeter.io's rich support for them, we easily spotted the root cause:
It was a SELECT query generating all that load and wearout!
SELECT’s in Postgres can sometime generate even non-temp file, but real writes. Read about it here.
- Redis+RDB generates a ton of disk writes, and the volume depends not on the amount of changes in the Redis DB, but on the DB size and dump frequency. RDB seems to produce the highest write amplification of all the storage systems I know of.
- Actively used swap on an SSD is probably a bad idea. Unless you want to add some jitter to your RAID-1 SSDs' wearout.
- In DBMSes like PostgreSQL, it might not be only WAL and data files that dominate disk writes. Bad database design or access patterns can produce a lot of temp file writes. Read how to monitor Postgres queries.