This is going to be a bit of a head scratcher.
I have two HP Gen 10 Plus servers, Jaguar and Panther, both fitted with SSDs. Each has a general OS SSD in bay 0, plus 8TB Samsung QVO drives in bays 1, 2 and 3 forming a RAID 5 style ZFS pool (raidz1). (There is still a spinning-rust drive in bay 2 of server 2, but I don't think that has any bearing on what's happening.) What happened on Jaguar also happened on Panther, and the results were the same.
Static hostname: jaguar
Icon name: computer-desktop
Chassis: desktop 🖥️
Machine ID: 103129ff23a64aa1b7987000ae53604b
Boot ID: a248f3747b0a432d80465dd383c104b0
Operating System: Debian GNU/Linux 12 (bookworm)
Kernel: Linux 6.1.0-21-amd64
Architecture: x86-64
Hardware Vendor: HPE
Hardware Model: ProLiant MicroServer Gen10 Plus
Firmware Version: U48
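For completeness, each data pool is a plain three-disk raidz1, i.e. something created along these lines (the device names below are placeholders for illustration, not necessarily the exact ones used at creation time):
Code:
# placeholder device names, for illustration only
zpool create jaguar raidz1 sda sdc sdd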
I rebooted the servers and both of them came up with the first drive of the pool marked as failed.
This seemed to be more than a coincidence to me.
The system saw the drives.
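By "the system saw the drives" I mean basic checks along these lines; I'm not pasting the exact session, and the device name is just whatever the kernel had assigned at the time:
Code:
# quick visibility / health checks on the affected drive
lsblk -o NAME,SIZE,MODEL,SERIAL
smartctl -H /dev/sda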
A 'zpool online' of the missing device failed with:
Code:
        NAME                      STATE     READ WRITE CKSUM
        jaguar                    DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            261934978400867995    UNAVAIL      0     0     0  was /dev/sda1
            sdc                   ONLINE       0     0     0
            sdd                   ONLINE       0     0     0

errors: No known data errors
root@jaguar:/home/mich# zpool online jaguar /dev/sda1
cannot online /dev/sda1: cannot relabel '/dev/sda1': unable to read disk capacity
I tried various options, but then I ended up doing an export and import and that cleared it:
Code:
root@jaguar:~# zpool export jaguar
root@jaguar:~# zpool import jaguar
root@jaguar:~# zpool status
  pool: jaguar
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 04:23:03 with 0 errors on Sun Aug 11 04:47:04 2024
config:

        NAME                                           STATE     READ WRITE CKSUM
        jaguar                                         ONLINE       0     0     0
          raidz1-0                                     ONLINE       0     0     0
            ata-Samsung_SSD_870_QVO_8TB_S5SSNF0W506592B  ONLINE     0     0     1
            sdc                                        ONLINE       0     0     0
            sdd                                        ONLINE       0     0     0

errors: No known data errors
root@jaguar:~# zpool clear jaguar
My support with HP ran out about six months ago, so I have no support from them and can't get any further firmware; I think my last firmware patches to the servers were early this year. So I'm not sure what I'm dealing with.
The fact that it happened to both servers in exactly the same way makes me believe that this is not a drive failure per se; the drives were bought months apart.
I am dealing with something hardware, firmware or OS related, but I can't figure out which, particularly as it happened to both systems at the same time.
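In case it helps narrow that down, the sort of thing I can still dig through is the kernel log from the boot where the drive went UNAVAIL, plus the drive's own SMART logs; roughly (the by-id path is taken from the zpool status output above):
Code:
# kernel messages from the previous boot, filtered for the affected disk
journalctl -k -b -1 | grep -iE 'ata|sda'
# full SMART output, including error and self-test logs, for the drive that dropped
smartctl -x /dev/disk/by-id/ata-Samsung_SSD_870_QVO_8TB_S5SSNF0W506592B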
For safety, next time I reboot the servers I'll be exporting the pools beforehand and then importing them again afterwards... but there is the obvious question of why this happened in the first place, and I'm scratching my head.
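Roughly what I have in mind for the next reboot is below; the -d /dev/disk/by-id part is just an idea for keeping the vdevs on stable names rather than sdX, not something I've confirmed makes a difference here:
Code:
# before the reboot
zpool export jaguar
systemctl reboot
# after the reboot, import using the stable by-id names
zpool import -d /dev/disk/by-id jaguar
zpool status jaguar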
Grateful for any thoughts please.
Posted by msknight — 2024-08-11 14:48