Fixing Qnap TS-859U+

Introduction

The Qnap TS-859U+ is an old system, which I've been using for years and years now. It never gave me any problems, being a reliable friend, until….

Disaster strikes

Yes.. anyone who works with storage will tell you: "It's not a question of if a drive fails, but when it fails". Maybe this sounds a bit cheesy, but it's true. Using RAID can therefore prevent data loss when a disk fails, provided of course that the RAID set brings redundancy. In this case the NAS can hold 8 disks, and I use two RAID 5 sets of 4 disks each. However, a RAID set is not a replacement or substitute for backups.

In a RAID 5 set one drive can fail and everything is still alright. The idea is that when the bad disk is replaced, the RAID 5 set rebuilds and everything is OK again. Well, anyone with enough experience knows that rebuilding a RAID set (5 or 6) puts stress on the remaining disks. This brings the danger of a second disk failing during the rebuild. Therefore, having a backup is a must if you care about your data.

The problem with backups (and it depends on how you make them) is that an incomplete backup is no backup. So with that said, let's take a look at what happened..

Ignoring a warning sign… not smart

These QNAPs have one really useful feature, as long as it's not ignored: they sound a loud beep when something important is happening, like a disk failure. In the middle of the night I woke up because I thought I heard a beep. I listened for a few moments, didn't hear anything else alarming, went back to sleep, and had forgotten about it by morning.

In the afternoon I noticed that the Qnap was reacting very slowly.. it took ages before the user interface loaded. So I walked over to the QNAP, which is in another room, and noticed two red lights on two disks in the same RAID 5 set. Not good.

Thinking back, this was what I heard at night.. the QNAP was trying to warn me that something bad was happening. To make things worse.. the backup hadn't run…

An unpleasant situation

At this point I didn't want to reboot the Qnap, since I wasn't sure how it would boot up and whether I would still have the data on this RAID 5 set. So I logged into the CLI, which is just a Linux shell, and started stopping every service I didn't need. At this point I could also tell that the data was still there, but that one disk was "ejected" and another disk was in an "unhealthy" state.
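Under the hood the RAID on these QNAPs is plain Linux md, so its state can be checked from the shell with the usual tools. A minimal check, assuming the affected array is /dev/md0 (the device name may differ per unit):

# overview of all md arrays; a failed member is marked with (F)
cat /proc/mdstat

# more detail on the affected array (md0 is an assumption here)
mdadm --detail /dev/md0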

After stopping the services, which I did to lower disk activity and reduce the load on the CPU, I started to think about a rescue plan…

The rescue plan

The first thing I tried was to trigger a rebuild of the array by replacing the ejected disk. And that is where the fun really started. After a while the new disk was also marked as "bad". So I swapped this new disk into another QNAP, which is almost identical, and there the disk was just fine. I swapped the disk back, and after some period of time.. it was marked as bad again. The only thing I could think of was that the drive bay in the QNAP itself had a problem. This put me in a rough spot, since I now had a RAID 5 set with only 3 drives, one of which was on the point of giving up the ghost. At that point I realized that no matter what, I was going to lose data. In the end I got all the important data off the array, along with some data I didn't really care about. But to bring back a working RAID 5 array, a fix was needed..
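Getting the data off was nothing fancy; roughly something like the following, run from the QNAP shell, where the share path and the destination are placeholders and not my exact setup:

# copy the important shares to another NAS over SSH; -a preserves
# permissions and timestamps, --progress shows how far along it is
rsync -a --progress /share/MD0_DATA/important/ admin@other-nas:/share/backup/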

What could be wrong with the drive bay?

Thinking about the possible reason why a drive is marked bad after some period of time, I considered the following: the SATA logic is working, so chances are the ICs etc. are all fine. The most likely issue is a bad power line. This could be a power supply issue or just the power line of the drive bay itself. Since only one drive bay has this problem, I don't think it's a power supply problem. Most likely one or more capacitors are the root cause.

Finally fixing the QNAP’s drive problem

After taking the QNAP out of the server rack and removing the top cover, I noticed that all the drives are connected to what looks like a power distribution board, with a lot of caps on it. All the connections to this board were nicely labelled. So I took the capacitors off the board in the area of connector "4" and tested them. They were low in capacitance, so I decided to replace all the caps on the board. After putting everything back together I started testing with drive bay 4, and yes! It worked again. Hopefully this QNAP will continue to run reliably for quite some years to come.

 

Get a Netapp DS2246 with Netapp disks working with an HP DL380p – part two

Introduction

In part one I got myself a Netapp DS2246, an LSI MegaRAID controller card and the right SAS cable, and hooked everything up. At that point I thought I would have my storage up and running quickly. But I was wrong. Very wrong. The RAID controller did see the disks, which is a good thing. However, the controller marked these drives as "Unsupported".

Next step is to figure out what went wrong

In part one I already mentioned that I'm a network guy. Yeah sure, way back I once was a server dude, messing around with 24/7 clusters, fiber channel, SCSI and the like. Actually I played around with DL380 gen 1 servers a lot. Anyhow, at this point I didn't know if it was a problem with the disks or with the SAS RAID controller. Luckily I have two extra 2.5″ spare disks. Since my setup is based on RAID 0 I really like to have a spare disk or two. (Actually I always buy spare disks, just in case.)

So I removed two disks from the DS2246 and swapped them out with my spare disks. And lo and behold: after doing a re-scan of the disks the LSI MegaRAID controller recognized them, and I could configure them as a RAID 0 or RAID 1 disk. That proved to me that the DS2246 is good, the SAS cable is good, and the LSI MegaRAID controller is good. Since all the disks in the DS2246 show a green LED, I figured that the disks must be good as well. But why doesn't the RAID controller support them? Maybe firmware?
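As a side note, you don't need to reboot into the WebBIOS just to see what the controller thinks of the drives. With the storcli tool (introduced below), something like this lists all physical drives and their state, assuming the card is controller 0:

# list every physical drive on controller 0, across all enclosures and slots
storcli64 /c0 /eall /sall show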

The first rookie mistake

In part one I already mentioned that I made a rookie mistake, and that was upgrading the firmware. I upgraded the LSI MegaRAID card to 23.34.0. This resulted in a crashing WebBIOS: once I entered the WebBIOS it just hung. I also got a memory conflict error at startup. In the end I was able to do a downgrade. To upgrade or downgrade these cards, it's just a matter of getting the "storcli" tool.

Upgrading or downgrading the card is done with the following command:

storcli64 /c# download file=firmwarefile nosigchk noverchk

Where # is the controller card number and firmwarefile is the downloaded firmware. The firmware for this card can be found at: 9286CV-8e firmware
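Before and after flashing it's also worth checking which firmware is actually running; assuming the card is controller 0, something like:

# show controller details; the output includes the firmware package version
storcli64 /c0 show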

Well, that was a lucky escape. And since the drives didn't show up even with the newer firmware (the one where I could not enter the WebBIOS), it was not a firmware issue.

After some googling I found people mentioning that Netapp formats its disks with a different sector size. Instead of the usual 512-byte sector, Netapp uses 520 bytes. Once I read that, I knew the sector size was the problem. So how to get these drives to work? Well, as it turns out, the drives can be reformatted to a sector size of 512 bytes. The problem is: how to do that, since the RAID controller doesn't support the drives as-is with their 520-byte sector size.
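Once a drive is visible as an sg device (which is what the HBA mode exercise below is about), its reported sector size can be verified with the sg3-utils. A quick check, assuming the drive shows up as /dev/sg3:

# read the capacity; the "Logical block length" line reports 520 bytes on an
# untouched Netapp drive and 512 after reformatting
sg_readcap --long /dev/sg3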

Back to the HBA mode or IT mode

In part one I talked about HBA mode, also called IT mode. In this mode the RAID controller card is in pass-through mode: it just presents the disks to the OS, without interfering. So I needed to get my SAS controller card into HBA mode. Unfortunately the LSI MegaRAID 9286CV-8e card I got doesn't support that. However, the internal SAS controller (Smart Array P420i) in my HP DL380p gen8 does. The card doesn't support it by default, but with a little trick it can be put in HBA mode.

There is one downside however. Once the P420i card is in HBA mode, it's no longer possible to boot from the disks. This means that I have to reconfigure my server, since I boot from a RAID 0 set. I hoped it would be easy to convert back from HBA mode to RAID mode, and that I could just place the disks back without any data loss. However, since I wasn't sure, and things can go wrong, I started backing up everything, just in case.

Onto the path of victory and success

Since I have two spare disks which work with the LSI card and the DS2246, and the possibility to put the P420i card in HBA mode, I could maybe get this to work. The plan I came up with is this:

      1. Get two disks in a RAID 0 set working with the DS2246 and LSI card
      2. Get this logical RAID 0 disk to be bootable
      3. Back up the data from the existing disks (2x logical drive) connected to the P420i card
      4. Remove the existing disks (2x logical drive) from the P420i card
      5. Install Ubuntu 20.04 on the newly created RAID 0 disk
      6. Boot into Ubuntu 20.04
      7. Use the ssacli tool to put the Smart Array P420i card into HBA mode
      8. Use the lsscsi and sg_scan tools to see if the Netapp drives with 520-byte sectors are accessible
      9. Reformat the Netapp drives to 512-byte sectors
      10. Plug the reformatted drive back into the DS2246
      11. Test with sg_scan whether the drive works

To execute this plan I basically have to reinstall my server. That is a lot of work, but in the end I should have 24 working drives.

The key thing in this plan is getting the HP Smart Array P420i card into HBA mode. And I must be able to boot from a disk connected to the DS2246.

As it turns out, getting the server to boot from the DS2246 was easy. Installing Ubuntu onto the disk was also easy.

Getting the Smart Array P420i into HBA mode

The next step was to install the tool to get the Smart Array into HBA mode. To get this controller into HBA mode it needs to have recent firmware. I run 8.32 and, as you will see, that works fine. Since I had removed all the drives from the controller, I didn't have to clear the configuration.

If there is older firmware installed, try to get your hands on the HP SPP (Service Pack for ProLiant) for the DL380p gen 8 (or whatever generation server you have, for that matter).

I followed the steps documented here

It comes down to:

Set up the repository:

I added in /etc/apt/sources.list.d/mcp.list:

deb http://downloads.linux.hpe.com/SDR/repo/mcp focal/current non-free

Next I added the HPE Public Keys:

curl https://downloads.linux.hpe.com/SDR/hpPublicKey2048.pub | apt-key add -
curl https://downloads.linux.hpe.com/SDR/hpPublicKey2048_key1.pub | apt-key add -
curl https://downloads.linux.hpe.com/SDR/hpePublicKey2048_key1.pub | apt-key add -

Next I updated the apt sources with:

apt-get update

And finally:

apt-get install ssacli

The next step is getting the controller into HBA mode:

ssacli controller slot=0 modify hbamode=on

To check if the controller is in HBA mode:

ssacli controller slot=0 show

Which outputs:

Smart Array P420i in Slot 0 (Embedded)
Bus Interface: PCI
Slot: 0
Serial Number: 0014380225BD250
Cache Serial Number: PBKUC0ARH2P0SK
RAID 6 Status: Enabled
Controller Status: OK
Hardware Revision: B
Firmware Version: 8.32
Firmware Supports Online Firmware Activation: False
Cache Board Present: True
Cache Status: Not Configured
Total Cache Size: 1.0
Total Cache Memory Available: 0.8
Battery Backed Cache Size: 0.8
Cache Backup Power Source: Capacitors
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
Controller Temperature (C): 48
Cache Module Temperature (C): 29
Capacitor Temperature (C): 22
Number of Ports: 2 Internal only
Driver Name: hpsa
Driver Version: 3.4.20
HBA Mode Enabled: True
PCI Address (Domain:Bus:Device.Function): 0000:02:00.0
Port Max Phy Rate Limiting Supported: False
Host Serial Number: CZ22280G56
Sanitize Erase Supported: False
Primary Boot Volume: Unknown (600508B1001C83E36DFBA10AEBE3971A)
Secondary Boot Volume: None

Accessing the Netapp drive from the OS

This looks good.  Next I placed one of the Netapp drives into the DL380P server, and checked if I could see the drive:

ssacli controller slot=0 physicaldrive all show

Smart Array P420i in Slot 0 (Embedded)

Unsupported Drives

physicaldrive 1I:2:1 (port 1I:box 2:bay 1, SAS, 1.2 TB, OK)

This looks good, let’s see what more we can find out:

from dmesg:

[ 565.080210] sd 3:0:1:0: Attached scsi generic sg3 type 0
[ 565.080448] sd 3:0:1:0: [sdb] Unsupported sector size 520.
[ 565.080805] sd 3:0:1:0: [sdb] 0 512-byte logical blocks: (0 B/0 B)
[ 565.080808] sd 3:0:1:0: [sdb] 520-byte physical blocks
[ 565.081074] sd 3:0:1:0: [sdb] Write Protect is off
[ 565.081076] sd 3:0:1:0: [sdb] Mode Sense: f7 00 10 08
[ 565.081452] sd 3:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 565.130243] sd 3:0:1:0: [sdb] Unsupported sector size 520.
[ 565.138773] sd 3:0:1:0: [sdb] Attached SCSI disk

At this point I plugged two drives into the DL380p. Let's see what we can do with the sg utilities:

sg_map
/dev/sg0 /dev/sr0
/dev/sg1
/dev/sg2 /dev/sda
/dev/sg3 /dev/sdb

Next, let's see if we can reformat the drive:

sg_format -v --format --size=512 /dev/sg3
NETAPP X425_HCBEP1T2A10 NA01 peripheral_type: disk [0x0]
PROTECT=1
<< supports protection information>>
Unit serial number: KZHLXDBF
LU name: 5000cca01d5ac328
mode sense(10) cdb: 5a 00 01 00 00 00 00 00 fc 00
Mode Sense (block descriptor) data, prior to changes:
Number of blocks=2344225968 [0x8bba0cb0]
Block size=520 [0x208]
mode select(10) cdb: 55 11 00 00 00 00 00 00 1c 00

A FORMAT UNIT will commence in 15 seconds
ALL data on /dev/sg3 will be DESTROYED
Press control-C to abort

A FORMAT UNIT will commence in 10 seconds
ALL data on /dev/sg3 will be DESTROYED
Press control-C to abort

A FORMAT UNIT will commence in 5 seconds
ALL data on /dev/sg3 will be DESTROYED
Press control-C to abort
Format unit cdb: 04 18 00 00 00 00

Format unit has started
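From a second shell the drive can also be polled for a progress indication while the format runs, something like:

# ask the drive to report format progress (a percentage), if it provides one
sg_requests --progress /dev/sg3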

The format takes a while, but after some time:

FORMAT UNIT Complete

And after doing an sg_scan, dmesg confirmed:

[29086.192424] hpsa 0000:02:00.0: scsi 3:0:1:0: updated Direct-Access NETAPP X425_HCBEP1T2A10 PHYS DRV SSDSmartPathCap- En- Exp=1
[29116.364679] hpsa 0000:02:00.0: SCSI status: LUN:0000000000800001 CDB:12010000040000000000000000000000
[29116.364684] hpsa 0000:02:00.0: SCSI Status = 02, Sense key = 0x05, ASC = 0x25, ASCQ = 0x00
[29116.364963] hpsa 0000:02:00.0: Acknowledging event: 0x80000002 (HP SSD Smart Path configuration change)
[29116.398781] hpsa 0000:02:00.0: scsi 3:0:1:0: removed Direct-Access NETAPP X425_HCBEP1T2A10 PHYS DRV SSDSmartPathCap- En- Exp=1
[29210.540998] scsi 2:0:83:0: Direct-Access NETAPP X425_HCBEP1T2A10 NA01 PQ: 0 ANSI: 6
[29210.542399] sd 2:0:83:0: Attached scsi generic sg3 type 0
[29210.559470] sd 2:0:83:0: [sdb] 2344225968 512-byte logical blocks: (1.20 TB/1.09 TiB)
[29210.577646] sd 2:0:83:0: [sdb] Write Protect is off
[29210.577652] sd 2:0:83:0: [sdb] Mode Sense: f7 00 10 08
[29210.614517] sd 2:0:83:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[29211.011025] sd 2:0:83:0: [sdb] Attached SCSI disk

So now 23 disks to go…
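Doing them all by hand gets old quickly. Something along these lines can loop over whichever drives are currently attached and reformat only the ones that identify themselves as NETAPP (treat it as a sketch, not an exact script: sg_format destroys all data on the drive, so double-check the device list first):

# reformat every drive that identifies itself as NETAPP to 512-byte sectors
# WARNING: this destroys all data on those drives
for dev in /dev/sg*; do
    if sg_inq "$dev" 2>/dev/null | grep -q "NETAPP"; then
        echo "Reformatting $dev"
        sg_format --format --size=512 "$dev"
    fi
done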

Getting the server to boot from the HP Smart Array controller

After I had formatted all 24 disks and checked in the WebBIOS that I could actually use them, it was time to revert the server back to the configuration it had before I started messing around with the Netapp DS2246.

So from Ubuntu, which was booted from the DS2246, I reconfigured the Smart Array back to RAID mode with:

ssacli controller slot=0 modify hbamode=off

After that I inserted the disks back in the order I took them out (when I took the disks out, I marked them with a Sharpie).

At that point I pulled the two disks out of the Netapp DS2246 and rebooted the server. And sure enough: the Smart Array controller detected the two logical RAID 0 devices, and booted the original Ubuntu installation without any issue. Even the data on the second disk was intact. That saved me a lot of time. Now all I have to do is configure the desired RAID sets and start moving data around.
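By the way, a quick sanity check from the OS that both logical drives are back can be done with ssacli, something like:

# list the logical drives the Smart Array presents after turning HBA mode off
ssacli controller slot=0 logicaldrive all show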

Conclusion

Getting a Netapp DS2246 and its 520-byte-sector Netapp disks to work is not trivial, but it is quite a learning experience. If you decide to get yourself a Netapp DS2246 or something similar and it comes with disks with 520-byte sectors, make sure you have a SAS controller card which can do at least HBA or IT mode. If you can get your hands on a card which supports both HBA and RAID, go for it. Then you will get it to work.

The LSI controller works fine; however, I'm not a big fan of the WebBIOS interface. To be honest I find it horrible. Adding a disk to an existing array is a very hard and confusing process.

Get a Netapp DS2246 with Netapp disks working with an HP DL380p – part one

Introduction

I use an HP DL380p gen 8 for virtualization. The DL380p is a perfect server for that. These servers are cheap to get, can hold a lot of memory, and have reasonably powerful CPUs. The DL380p gen 8 can take two CPUs. When I bought this server I also got four 2.5″ SAS disks of 600GB each. And when playing around with virtualization, disk space can be a thing.

I didn't want to configure these disks in a RAID 5 set, since that would cost storage capacity. Another option would be to configure the disks as a JBOD (Just a Bunch Of Disks). Unfortunately the built-in RAID adapter of the DL380p gen 8 (Smart Array P420i) doesn't support this. But more on that later, since it turns out to be the key thing.

So I ended up configuring the disks in two RAID 0 sets, giving me two logical drives of roughly 1.2TB each.

But I really don't like RAID 0, since if one drive fails, you lose all the data. And sure, I make backups. But reinstalling and reconfiguring a server is not something I like to do.

The DL380p I have can hold a total of 8 drives. So I could add another 4 drives and configure these in a RAID 5 set. Another option I explored is iSCSI: I created iSCSI targets on the QNAP NAS servers and configured a software RAID 5 on top of them. And while this works, it makes me depend on my network, which is not always ideal.

But there is a much cooler way of getting plenty of storage, and that is playing around with disk shelves.

Getting a Disk Shelf

Currently it's possible to get a Netapp disk shelf like the DS2246 for cheap. These disk shelves are dumb: they just present the disks. They don't do fancy stuff like RAID, SMB, NFS or anything else. The DS2246 can hold 24 2.5″ SAS, SATA or SSD disks.

A dive into external disk shelves and SAS

Since I'm a network guy, and not a storage dude, I had to dive into how to connect a disk shelf to a server. And well, it didn't sound that complicated. It seems all I need to do is connect the DS2246 to a server, and all you need for that is a SAS controller card with an external SAS port. The important bit is that a "special" cable is needed, since the Netapp uses an SFF-8436 port, while most SAS controllers with external ports use SFF-8088. These cables are called "QSFP SFF-8436 to Mini SAS SFF-8088".

Once I understood how the physical connection works, it was time to tackle the next question: RAID or HBA?

Using a hardware RAID controller or an HBA controller

Basically there are two ways of presenting the disks of the disk shelf to the server. One method is to use a hardware RAID controller. These controllers allow you to configure RAID 0, 1, 5 or 6, for example, and some even allow RAID 50 or 60, which is a RAID 0 stripe across multiple RAID 5 or RAID 6 sets. Once a RAID set is configured, the server sees a logical drive. So for instance, if two 1TB drives are configured as a RAID 0, the OS on the server sees a 2TB drive.

The other way is using a controller which supports HBA mode, also called "IT mode". In this mode the controller works in pass-through mode, meaning it presents the disks as-is to the server. It doesn't provide any RAID capability whatsoever. The idea behind this is that all the individual disks are visible in the OS of the server, which allows using software RAID to create RAID sets.
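As a rough example of what that enables (not part of this post's setup): with the disks passed through, a software RAID set can be built directly with mdadm, where the device names below are made up:

# build a 4-disk software RAID 5 array out of pass-through disks
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# watch the initial sync
cat /proc/mdstat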

There are RAID controller cards which allow running in either RAID mode or HBA mode. This can be important; more on that later. But up front: get a SAS controller with external ports which supports both HBA and RAID. It can make your life much easier.

Making rookie mistakes

Armed with all the knowledge I gained, I decided it was time to get myself a SAS controller card and a Netapp DS2246. This DS2246 came with 24 1.2TB disks. For the controller card I picked up an LSI MegaRAID SAS 9286CV-8e. This card can do 6Gb/s, which is perfect, since the DS2246 has two IOM6 modules, which also provide 6Gb/s. And I got myself a 1 meter long QSFP SFF-8436 to Mini SAS SFF-8088 cable.

Once all the stuff arrived I installed the LSI card in my HP DL380p gen 8 server, hooked up the DS2246 with the cable, and turned both devices on. When I powered on the DS2246 I was shocked at how noisily this beast starts up. It is really loud. Luckily, after a few seconds all the fans spun down to a very acceptable noise level.

Once everything had started up, I entered the RAID BIOS, which on LSI cards is called the "WebBIOS". I could see all the drives, but they were all marked as "unsupported". Oh boy.

In part two I try to get this working. Hopefully I can get it to work, or else I'm stuck with a lot of unusable disks, 24 to be precise…


Replacing a failed drive in a software RAID 5 on iSCSI

Introduction

On the server where I host a couple of virtual machines (VMs) I use a software RAID 5. This RAID 5 is built on top of iSCSI drives: four iSCSI targets, which live on three QNAP NAS devices, so one NAS has two targets configured.
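For completeness: on Linux such iSCSI targets are attached with open-iscsi. Roughly like this, where the IP address and target name are placeholders rather than my actual configuration:

# discover the targets a NAS offers
iscsiadm -m discovery -t sendtargets -p 192.168.1.10

# log in to one of the discovered targets; it then shows up as a /dev/sdX disk
iscsiadm -m node -T iqn.2004-04.com.qnap:ts-453:iscsi.example -p 192.168.1.10 --login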

When a failure occurs

While reconfiguring a network switch, I had to reload its config, which caused one NAS to be disconnected from the network. This of course caused a failure on the software RAID. The following messages appeared in dmesg:

[69298.238316] connection4:0: detected conn error (1022)
[69729.155489] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4312322049, last ping 4312323328, now 4312324608
[69729.733780] connection2:0: detected conn error (1022)
[69849.987513] session2: session recovery timed out after 120 secs
[71820.937756] perf: interrupt took too long (12466 > 10041), lowering kernel.perf_event_max_sample_rate to 16000
[125542.257008] sd 5:0:0:0: rejecting I/O to offline device
[125542.514793] blk_update_request: I/O error, dev sdd, sector 16 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[125543.013298] md: super_written gets error=10
[125543.215817] md/raid:md0: Disk failure on sdd, disabling device.

This is to be expected. Once the switch was back and the NAS was reachable again, the state of the software RAID can be checked with:

cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd[4](F) sdc[2] sde[0] sdf[1]
1572467712 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
bitmap: 2/4 pages [8KB], 65536KB chunk

unused devices: <none>

Notice that the device sdd is marked as failed (F). More details can be obtained with the following command:

mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sun Feb 27 13:08:41 2022
Raid Level : raid5
Array Size : 1572467712 (1499.62 GiB 1610.21 GB)
Used Dev Size : 524155904 (499.87 GiB 536.74 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Sat Mar 12 06:25:43 2022
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 1
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 512K

Consistency Policy : bitmap

Name : darklord:0 (local to host darklord)
UUID : 75a05c94:d25d97da:56950464:c5aa539a
Events : 3429

Number Major Minor RaidDevice State
0 8 64 0 active sync /dev/sde
1 8 80 1 active sync /dev/sdf
2 8 32 2 active sync /dev/sdc
- 0 0 3 removed

4 8 48 - faulty /dev/sdd

So now we know this drive has failed, how do we fix it? Since this is an "iSCSI disk", the drive is not really "faulty".

Fixing the RAID 5 array

Fixing the RAID 5 array is actually quite simple. First we remove the failed drive:

mdadm --manage /dev/md0 --remove /dev/sdd
mdadm: hot removed /dev/sdd from /dev/md0

Next we re-add the /dev/sdd device back into the array:

mdadm --manage /dev/md0 -a /dev/sdd
mdadm: re-added /dev/sdd

Next we check the RAID 5 array:

cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd[4] sdc[2] sde[0] sdf[1]
1572467712 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
[================>....] recovery = 84.8% (444654656/524155904) finish=106.8min speed=12396K/sec
bitmap: 2/4 pages [8KB], 65536KB chunk

So the RAID 5 array is rebuilding. After a while:

cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd[4] sdc[2] sde[0] sdf[1]
1572467712 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
bitmap: 2/4 pages [8KB], 65536KB chunk

So all is good again.
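Since I only noticed this failure via dmesg, it is worth letting mdadm watch the array and send a mail when a member drops out. A minimal sketch (the mail address is a placeholder, and many distributions already run this via mdadm.conf):

# run mdadm in monitor mode as a daemon; it mails on events such as
# a failed device or a degraded array
mdadm --monitor --scan --daemonise --mail=admin@example.com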