Introduction
On the server where I host a couple of Virtual Machines (VMs) I use a software RAID 5. This RAID 5 is built on top of iSCSI drives: four iSCSI targets spread over three QNAP NAS devices, so one NAS has two targets configured.
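For context, an array like this is typically put together by logging in to the iSCSI targets with open-iscsi and then creating the md device on top of the resulting disks. The portal address and target name below are made-up examples (my actual configuration differs), but the general shape is:

# Discover the targets offered by a NAS (example portal address)
iscsiadm -m discovery -t sendtargets -p 192.168.1.10

# Log in to one of the discovered targets (example IQN)
iscsiadm -m node -T iqn.2004-04.com.qnap:example-target -p 192.168.1.10 --login

# Once all four iSCSI disks are present, build the RAID 5 array on them
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdc /dev/sdd /dev/sde /dev/sdf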
When a failure occurs
While reconfiguring a network switch I had to reload its config, which caused one NAS to be disconnected from the network. This, of course, caused a failure on the software RAID. The following messages appeared in dmesg:
[69298.238316] connection4:0: detected conn error (1022)
[69729.155489] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4312322049, last ping 4312323328, now 4312324608
[69729.733780] connection2:0: detected conn error (1022)
[69849.987513] session2: session recovery timed out after 120 secs
[71820.937756] perf: interrupt took too long (12466 > 10041), lowering kernel.perf_event_max_sample_rate to 16000
[125542.257008] sd 5:0:0:0: rejecting I/O to offline device
[125542.514793] blk_update_request: I/O error, dev sdd, sector 16 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[125543.013298] md: super_written gets error=10
[125543.215817] md/raid:md0: Disk failure on sdd, disabling device.
This is to be expected. Once the switch was back and the NAS was reachable again, the state of the software RAID can be checked with:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd[4](F) sdc[2] sde[0] sdf[1]
      1572467712 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

unused devices: <none>
Notice that the device sdd is marked as failed (F). More details can be obtained with the following command:
mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Sun Feb 27 13:08:41 2022
        Raid Level : raid5
        Array Size : 1572467712 (1499.62 GiB 1610.21 GB)
     Used Dev Size : 524155904 (499.87 GiB 536.74 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Mar 12 06:25:43 2022
             State : clean, degraded
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 1
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : darklord:0  (local to host darklord)
              UUID : 75a05c94:d25d97da:56950464:c5aa539a
            Events : 3429

    Number   Major   Minor   RaidDevice State
       0       8       64        0      active sync   /dev/sde
       1       8       80        1      active sync   /dev/sdf
       2       8       32        2      active sync   /dev/sdc
       -       0        0        3      removed

       4       8       48        -      faulty   /dev/sdd
So now we know this drive has failed, how do we fix it? Since this is an “iSCSI disk”, the drive is not really “faulty”.
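It only means the iSCSI session behind it was interrupted. Before touching the array it is worth confirming that the session to the NAS is back and that /dev/sdd still maps to the expected target; a quick check (the output will obviously differ per setup) is:

# List the active iSCSI sessions; the previously dropped target should be listed again
iscsiadm -m session

# Show which iSCSI target each disk is backed by
ls -l /dev/disk/by-path/ | grep iscsi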
Fixing the RAID 5 array
Fixing the array is actually quite simple. First we remove the failed drive:
mdadm --manage /dev/md0 --remove /dev/sdd
mdadm: hot removed /dev/sdd from /dev/md0
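Optionally, before re-adding the disk, you can verify that its md superblock survived the outage; it should have, since only the network connection dropped, not the storage itself:

# Inspect the md superblock on the returned iSCSI disk
mdadm --examine /dev/sdd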
Next we re-add the /dev/sdd device to the array:
mdadm --manage /dev/md0 -a /dev/sdd
mdadm: re-added /dev/sdd
Next we check the RAID 5 array:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd[4] sdc[2] sde[0] sdf[1]
      1572467712 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [================>....]  recovery = 84.8% (444654656/524155904) finish=106.8min speed=12396K/sec
      bitmap: 2/4 pages [8KB], 65536KB chunk
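Instead of re-running cat by hand, the recovery can also be followed continuously; both of the commands below are just convenience variants of the same check:

# Refresh the RAID status every 10 seconds
watch -n 10 cat /proc/mdstat

# Or ask mdadm directly for the array state and rebuild progress
mdadm --detail /dev/md0 | grep -E 'State :|Rebuild'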
So the RAID 5 array is rebuilding. After a while:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd[4] sdc[2] sde[0] sdf[1]
      1572467712 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 2/4 pages [8KB], 65536KB chunk
So all is good again.