Replace a failed drive in a software RAID 5 on iSCSI

Introduction

On my server, where I host a couple of virtual machines (VMs), I use a software RAID 5. This RAID 5 is built on top of iSCSI drives: four iSCSI targets spread over three QNAP NAS devices, so one of the NAS devices has two targets configured.
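
For reference, attaching such targets with open-iscsi and building the array on top of them looks roughly like this (a minimal sketch; the portal address and target IQN are made up, the device names match the array shown later):

# discover and log in to one of the iSCSI targets (repeat per target)
iscsiadm -m discovery -t sendtargets -p 192.168.1.11
iscsiadm -m node -T iqn.2004-04.com.qnap:nas1.target0 -p 192.168.1.11 --login

# create the RAID 5 array over the four resulting disks
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdc /dev/sdd /dev/sde /dev/sdf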

When a failure occurs

While reconfiguring a network switch I had to reload its configuration, which disconnected one NAS from the network and, of course, caused a failure on the software RAID. The following messages appeared in dmesg:

[69298.238316] connection4:0: detected conn error (1022)
[69729.155489] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4312322049, last ping 4312323328, now 4312324608
[69729.733780] connection2:0: detected conn error (1022)
[69849.987513] session2: session recovery timed out after 120 secs
[71820.937756] perf: interrupt took too long (12466 > 10041), lowering kernel.perf_event_max_sample_rate to 16000
[125542.257008] sd 5:0:0:0: rejecting I/O to offline device
[125542.514793] blk_update_request: I/O error, dev sdd, sector 16 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[125543.013298] md: super_written gets error=10
[125543.215817] md/raid:md0: Disk failure on sdd, disabling device.
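
These errors are to be expected: the connection to the iSCSI target was gone, so the kernel eventually marked the disk as failed. In this case the session recovered by itself once the network was restored; if it had not, open-iscsi could have been used to check the sessions and log in again (a sketch; run as root):

# list the currently established iSCSI sessions
iscsiadm -m session

# log in again to all targets configured on this host
iscsiadm -m node --loginall=all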

Once the switch was back and the NAS was reachable again, the state of the software RAID can be checked with:

cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd[4](F) sdc[2] sde[0] sdf[1]
1572467712 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
bitmap: 2/4 pages [8KB], 65536KB chunk

unused devices: <none>

Notice that the device sdd is marked as failed (F). More details can be obtained with the following command:

mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sun Feb 27 13:08:41 2022
Raid Level : raid5
Array Size : 1572467712 (1499.62 GiB 1610.21 GB)
Used Dev Size : 524155904 (499.87 GiB 536.74 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Sat Mar 12 06:25:43 2022
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 1
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 512K

Consistency Policy : bitmap

Name : darklord:0 (local to host darklord)
UUID : 75a05c94:d25d97da:56950464:c5aa539a
Events : 3429

Number Major Minor RaidDevice State
0 8 64 0 active sync /dev/sde
1 8 80 1 active sync /dev/sdf
2 8 32 2 active sync /dev/sdc
- 0 0 3 removed

4 8 48 - faulty /dev/sdd

So now we know this drive has failed, how do we fix it? Since this is an iSCSI disk, the drive is not really faulty; it was only kicked out of the array because the connection to the target was lost.
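
Because the disk itself is fine, the main thing to verify before repairing the array is that the kernel sees the SCSI device as online again; after the 120 second session recovery timeout it may have been left offline (a sketch, using the sdd device from the output above; run as root):

# should print "running"; "offline" means the device is still disabled
cat /sys/block/sdd/device/state

# if it is still offline, bring it back online
echo running > /sys/block/sdd/device/state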

Fixing the RAID 5 array

Fixing the RAID 5 array is actually quite simple. First we remove the failed drive from the array:

mdadm --manage /dev/md0 --remove /dev/sdd
mdadm: hot removed /dev/sdd from /dev/md0
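
In this case the kernel had already marked the disk as faulty, so the remove succeeds right away. If mdadm refuses because it still considers the disk active, the disk can be marked as failed explicitly first (a sketch):

# mark the disk as failed before removing it
mdadm --manage /dev/md0 --fail /dev/sdd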

Next we re-add the /dev/sdd device to the array:

mdadm --manage /dev/md0 -a /dev/sdd
mdadm: re-added /dev/sdd
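
One thing to watch out for with iSCSI: after a reconnect the LUN does not always reappear under the same /dev/sdX name. The persistent names under /dev/disk/by-path (or /dev/disk/by-id) can be used to confirm which device belongs to which target before re-adding it (a sketch; the exact names depend on the portal and IQN):

# show which sdX device maps to which iSCSI portal/target
ls -l /dev/disk/by-path/ | grep iscsi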

Next we check the state of the RAID 5 array:

cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd[4] sdc[2] sde[0] sdf[1]
1572467712 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
[================>....] recovery = 84.8% (444654656/524155904) finish=106.8min speed=12396K/sec
bitmap: 2/4 pages [8KB], 65536KB chunk
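
The array is now rebuilding. Instead of re-running cat /proc/mdstat by hand, the progress can be followed continuously, or mdadm can be told to block until the rebuild has finished (a sketch):

# refresh the status every 30 seconds
watch -n 30 cat /proc/mdstat

# or simply wait until the recovery is done
mdadm --wait /dev/md0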

After a while the rebuild is finished:

cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd[4] sdc[2] sde[0] sdf[1]
1572467712 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
bitmap: 2/4 pages [8KB], 65536KB chunk

So all is good again.
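
Optionally, once the rebuild has finished, a consistency check of the whole array can be started to verify that data and parity still agree (a sketch; run as root):

# trigger a full check of the array
echo check > /sys/block/md0/md/sync_action

# after the check has completed, show the number of mismatches found
cat /sys/block/md0/md/mismatch_cnt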