The QNAP TS-859U+ is an old system that I have been using for years and years now. It never gave me any problems and was a reliable friend, until...
Yes... anyone who works with storage will tell you: “It’s not a question of if a drive fails, the question is when the drive is going to fail”. Maybe this sounds a bit cheesy, but it’s true. Using RAID may therefore prevent data loss when a disk fails. Of course, this only applies to a RAID level that provides redundancy. In this case the NAS can hold 8 disks, and I use them as two RAID 5 sets of 4 disks each. However, a RAID set is not a replacement or substitute for backups.
In a RAID 5 set one drive can fail and everything is still all right. The idea is that once the bad disk is replaced, the RAID 5 set rebuilds itself and everything is o.k. again. But anyone with enough experience knows that rebuilding a RAID set (5 or 6) puts stress on the remaining disks, which brings the danger of a second disk failing during the rebuild. Therefore, having a backup is a must if you care about your data.
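To make the “one drive can fail” part a bit more concrete, here is a minimal Python sketch of the XOR parity idea behind RAID 5. It is only an illustration: the block contents and the `xor_blocks` helper are made up for this example, and a real array also rotates the parity across the disks, which is left out here.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-disk RAID 5 set:
# three data blocks plus one parity block (illustrative contents only).
data = [b"QNAP", b"RAID", b"DISK"]
parity = xor_blocks(data)

# The disk holding data[1] fails: XOR the survivors with the parity
# to get the lost block back.
rebuilt = xor_blocks([data[0], data[2], parity])

# If a second disk fails as well, the survivors only give the XOR of the
# two lost blocks, not the blocks themselves.
leftover = xor_blocks([data[2], parity])
```

With one disk missing, `rebuilt` equals the lost block. With two disks missing, `leftover` is just the two lost blocks XORed together, and that cannot be split back into the originals. That is why a second failure during a rebuild is fatal for a RAID 5 set.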
The problem with backups (and it depends on how you make them) is that an incomplete backup is no backup. With that being said, let’s take a look at what happened...
Ignoring a warning sign… not smart
These QNAPs have one feature which is really useful, as long as it’s not being ignored: they sound a loud beep when something important is happening, like a disk failure. In the middle of the night I woke up because I thought I heard a beep. I listened for a few moments, didn’t hear anything else alarming, went back to sleep, and had forgotten about it by the morning.
In the afternoon I noticed that the QNAP was reacting very slowly. It took ages before the user interface loaded. So I walked over to the QNAP, which is in another room, and noticed red lights on two disks in the same RAID 5 set. Not good.
Thinking back, this was what I had heard at night: the QNAP was trying to warn me that something bad was happening. To make things worse, the backup hadn’t run...
An unpleasant situation
At this point I didn’t want to reboot the QNAP, since I wasn’t sure how it would boot up or whether I would still have the data on this RAID 5 set afterwards. So I managed to log into the CLI, which is just a Linux shell, and started stopping every service I didn’t need. At this point I could also tell that the data was still there, but that one disk was “ejected” and one disk was in an “unhealthy” status.
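For readers wondering how you can tell from a Linux shell that an array is running with a disk missing: on systems using Linux software RAID (md), `/proc/mdstat` lists every array with a status pattern such as `[UU_U]`, where `_` marks a missing member. The Python sketch below parses that format. Note that the sample text is invented for illustration and is not the actual output from my QNAP, and `missing_members` is just a hypothetical helper name.

```python
import re

def missing_members(mdstat_text):
    """Map each md array name to the number of missing members ('_' slots)."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header line, e.g. "md0 : active raid5 sda3[0] ..."
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status line ends with a pattern like "[UU_U]"
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if status and current:
            result[current] = status.group(1).count("_")
            current = None
    return result

# Made-up sample in /proc/mdstat format: two 4-disk RAID 5 sets,
# the first one running with one member missing.
SAMPLE = """\
Personalities : [raid1] [raid5]
md0 : active raid5 sda3[0] sdb3[1] sdd3[3]
      4390708800 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
md1 : active raid5 sde3[0] sdf3[1] sdg3[2] sdh3[3]
      4390708800 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
"""

report = missing_members(SAMPLE)
```

On a live Linux box the same function could be fed the contents of `/proc/mdstat` instead of `SAMPLE`. Any array with a count above zero is degraded, and for RAID 5 a count of 2 means the set is lost.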
After stopping the services, which I did to lower the disk activity and to reduce the load on the CPU, I started to think about a rescue plan...
The rescue plan
The first thing I tried was to trigger a rebuild of the array by replacing the ejected disk. And that is where the fun really started. After a while the new disk was marked as “bad”. So I moved this new disk to another QNAP, which is almost the same model, and there the disk was just fine. I swapped the disk back, and after some time it was marked as bad again. The only explanation I could think of was that the slot in the QNAP itself had a problem. This put me in a rough spot, since I now had a RAID 5 set with only 3 drives, one of which was on the point of giving up the ghost. At that point I realized that no matter what, I was going to lose data. In the end I got all the important data off the array, along with some data I didn’t really care about. But to bring the RAID 5 array back to life, a fix was needed...
What could be wrong with the drive bay?
Thinking about possible reasons why a drive gets marked bad after some period of time, I considered the following: the SATA logic is working, so chances are the ICs etc. are all fine. The most likely issue is a bad power line. This could be a power supply issue, or just the power line of the drive bay itself. Since only one drive bay has this problem, I don’t think it’s the power supply. Most likely one or more capacitors are the root cause.
After taking the QNAP out of the server rack and removing its top cover, I noticed that all the drives are connected to what looks like a power distribution board with a lot of caps on it. All the connections to this board were nicely labelled. So I took the capacitors off the board in the area of connector “4” and tested them. They were low in capacitance, so I decided to replace all the caps on the board. After putting everything together again I started testing with drive bay 4, and yes! It worked again. Hopefully this QNAP will continue to run reliably for quite some years to come.