This section describes the possible hardware-related issues that can occur during the operation of the Steelhead appliance.
There are three kinds of storage devices in the Steelhead appliances:
USB flash disk with the boot-partition, the configuration and a copy of the latest installed RiOS software.
One or more hard disk drives.
One or more solid-state disks.
From a troubleshooting point of view, the USB flash disk is not interesting.
The hard disks in a Steelhead appliance get hammered a lot and can fail over time. That is why the higher-end Steelhead appliances have their hard disks mirrored in a RAID array, so that if one hard disk fails the data is still available on the mirror hard disk.
On a Steelhead appliance without a RAID array, the failure of a hard disk will first cause slowness in the processing of optimized TCP sessions; at a certain moment the hardware watchdog will time out and restart the appliance, after which the device will fail in the boot process.
In case of a hard disk failure, the steps to take differ per model:
The low-end desktop models of the xx20, xx50, CX and EX series have a single hard disk or a dual non-RAID set of disks. The whole chassis needs to be replaced.
The 1RU xx20 series models have a single hard disk. The whole chassis needs to be replaced.
The 1RU xx50 series 1050U, 1050L and 1050M models have a single hard disk, but the appliance can rebuild itself during startup from the flash partition. Only a hard disk replacement is needed.
The 1RU xx50 series 1050H model has two hard disks, but they are in a RAID0 array. The appliance can rebuild itself during startup and only a hard disk replacement is needed.
If a system dump is not available but the serial console bootlogs are, search them for the available disks with the string "Attached SCSI disk" (a quick way to count the attached disks is shown after this list):
Figure 5.1. Grep for "Attached SCSI disk" in the bootlogs
[~/sysdumps] edwin@t43> grep "Attached SCSI disk" dmesg-boot
sd 0:0:0:0: [sda] Attached SCSI disk
sd 1:0:0:0: [sdb] Attached SCSI disk
sd 6:0:0:0: [sdc] Attached SCSI disk
The first number identifies the disk: if sd 1:0:0:0 shows up but sd 0:0:0:0 does not, then disk 0 is missing.
The 1RU xx50 series 1050UR, 1050LR, 1050MR and 1050HR models and the CX1555 model have four hard disks in a RAID10 array. In case of a single hard disk failure, only the hard disk needs to be replaced. In case of a failure of multiple hard disks in the same RAID1 array, the appliance can rebuild itself during startup from the flash partition.
The 1RU xx50 series 2050 model and the 3RU xx50 series 5050 and 6050 models have multiple hard disks in a RAID10 array. In case of a single hard disk failure, only the hard disk needs to be replaced. In case of a failure of multiple hard disks in the same RAID1 array, the whole chassis needs to be replaced.
The 7050, 5055 and 6055 models have their hard disks in a RAID1 configuration, while the solid-state disks are in an FTS configuration.
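A quick way to check the bootlogs for a missing disk is to count the number of attached disks and compare it against the number of disks the model is supposed to have. This is a minimal sketch against the same dmesg-boot file used above; the file name and the expected count depend on the system dump and the model:

[~/sysdumps] edwin@t43> grep -c "Attached SCSI disk" dmesg-boot
3

Here three disks were detected; on a model that ships with four disks this would point at one missing disk.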
During the RMA process of a broken hard disk, the device is running in a degraded mode. If the appliance has a Silver level or Gold level support contract, you might want to obtain spare hard disks so that you can replace them faster and thus restore the redundancy of the RAID array faster.
The solid-state disks suffer from write endurance: the number of write cycles to any block of flash is limited. The Steelhead appliance keeps track of the number of writes per block of flash, and if it gets higher than the manufacturer's specifications it will raise an alarm that the solid-state disk needs to be replaced.
The operational status of the RAID array can be seen with the command show raid diagram:
Figure 5.2. Output of the command "show raid diagram" on a 2050 model
SH # show raid diagram
[ 0 : online ][ 1 : online ][ 2 : online ][ 3 : online ]
When a hard disk fails, its status will go from online to missing or degraded. When the broken hard disk is replaced, the status will go to rebuilding. When the rebuilding has finished, the status will go back to online.
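When working from a saved copy of this output in a system dump instead of the live CLI, the disks that are not online can be filtered out with grep. This is a minimal sketch; the file name raid-diagram.txt is hypothetical:

[~/sysdumps] edwin@t43> grep -oE '\[[^]]*\]' raid-diagram.txt | grep -v online

Every disk that is not in the online state is then printed on its own line.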
In the following example there is a RAID10 array of 12 hard disks: the RAID1 array of disks 2 and 3 is being rebuilt, the RAID1 array of disks 10 and 11 is degraded because of a broken disk, and the RAID1 array of disks 8 and 9 is still okay although disk 9 is currently degraded.
Figure 5.3. Output of the command "show raid diagram" on a 5050 model
SH # show raid diagram
[ 0 : online ][ 1 : online ][ 2 : online ][ 3 : rebuilding ]
[ 4 : online ][ 5 : online ][ 6 : online ][ 7 : online ]
[ 8 : online ][ 9 : degraded ][ 10 : online ][ 11 : missing ]
[ 12 : missing ][ 13 : missing ][ 14 : missing ][ 15 : missing ]
The rebuild of a RAID array will take about 12 hours, depending on how busy the Steelhead appliance is. On the xx50, CX and EX series models, the progress of the rebuilding can be seen with the command raid swraid mdstat:
Figure 5.4. Output of the command "raid swraid mdstat" on a 5050 model
SH # raid swraid mdstat
Personalities : [linear] [raid0] [raid10] [raid6] [raid5]
md3 : active raid10 sdd6[12] sda6[0] sdl6[11] sdk6[10] sdj6[9] sdi6[8] sdh6[7] sdg6[6] sdf6[5] sde6[4] sdc6[2] sdb6[1]
      146512128 blocks 64K chunks 2 near-copies [12/11] [UUU_UUUUUUUU]
      resync=DELAYED
md0 : active raid10 sdd5[12] sda5[0] sdl5[11] sdk5[10] sdj5[9] sdi5[8] sdh5[7] sdg5[6] sdf5[5] sde5[4] sdc5[2] sdb5[1]
      781288704 blocks 64K chunks 2 near-copies [12/11] [UUU_UUUUUUUU]
      resync=DELAYED
md2 : active raid10 sdd2[3] sda2[0] sdl2[11] sdk2[10] sdj2[9] sdi2[8] sdh2[7] sdg2[6] sdf2[5] sde2[4] sdc2[2] sdb2[1]
      37784448 blocks 64K chunks 2 near-copies [12/12] [UUUUUUUUUUUU]
md1 : active raid10 sdd3[12] sda3[0] sdl3[11] sdk3[10] sdj3[9] sdi3[8] sdh3[7] sdg3[6] sdf3[5] sde3[4] sdc3[2] sdb3[1]
      100678656 blocks 64K chunks 2 near-copies [12/11] [UUU_UUUUUUUU]
      [=>...................]  recovery =  5.4% (920064/16779776) finish=6.3min speed=41821K/sec
unused devices: <none>
In this example, disk 3 was replaced. The re-syncing of array md2 has already been completed, md1 is currently at 5.4% and estimated to finish in 6.3 minutes, and md3 and md0 will be made redundant after that.
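When only a system dump is available, the same progress information can be pulled out of the copy of the software RAID status with grep. This is a minimal sketch; the file name mdstat is an assumption about how the status is stored in the system dump:

[~/sysdumps] edwin@t43> grep -E 'resync|recovery' mdstat
      resync=DELAYED
      resync=DELAYED
      [=>...................]  recovery =  5.4% (920064/16779776) finish=6.3min speed=41821K/sec

This shows at a glance which arrays are still waiting for their turn and how far the currently rebuilding array has progressed.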
A machine with multiple hard disks in a RAID10 configuration can survive multiple hard disk failures as long as the failures are not on the same RAID1 array.
If a disk is in the degraded state, it has not yet been removed from the RAID array but it can still cause performance problems. To remove it from the RAID array, use the command raid swraid fail-disk.
The failure of a disk is often indicated by slower-than-normal performance of a Steelhead appliance: the RAID controller or RAID software has not yet marked the hard disk as broken, but it sees long delays when reading from and writing to it.
In the system logs, a failing disk can be recognized by log entries like these:
Figure 5.5. A Failing disk in the system logs
SH kernel: ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
SH kernel: ata4.00: (irq_stat 0x40000008)
SH kernel: ata4.00: cmd 60/3e:00:43:00:00/00:00:00:00:00/40 tag 0 cdb 0x0 data 31744 in
SH kernel:          res 41/40:3e:43:00:00/00:00:00:00:00/40 Emask 0x9 (media error)
SH kernel: ata4.00: configured for UDMA/133
SH kernel: SCSI error : <3 0 0 0> return code = 0x8000002
SH kernel: Info fld=0x4000000 (nonstd), Invalid sdd: sense = 72 11
SH kernel: end_request: I/O error, dev sdd, sector 67
SH kernel: Buffer I/O error on device sdd1, logical block 33
SH kernel: ata4: EH complete
SH kernel: SCSI device sdd: 490350672 512-byte hdwr sectors (251060 MB)
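To quickly check the system logs in a system dump for this kind of disk trouble, search them for media errors and I/O errors. This is a minimal sketch; the exact names of the log files in the system dump may differ:

[~/sysdumps] edwin@t43> grep -E 'media error|I/O error' messages*

Every matching line points at the affected device (sdd in the example above), which identifies the disk that is failing.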
The Steelhead appliances with multiple power supplies are able to sustain the loss of one. An alarm will be raised if a power supply has failed.
Depending on the model, there are one or more fans in a Steelhead appliance. An alarm will be raised if a fan has failed.
Figure 5.6. Output of the command "show stats fan" on the 100 / 200 / 300 models
SH # show stats fan
FanId   RPM     Min RPM Status
1       5869    2657    ok
Figure 5.7. Output of the command "show stats fan" on the 520 / 1020 / 1520 / 2020 models
SH # show stats fan
FanId   RPM     Min RPM Status
1       3000    750     ok
2       3000    750     ok
Figure 5.8. Output of the command "show stats fan" on the 3020 / 3520 / 5520 / 6020 / 6120 models
SH # show stats fan
FanId   RPM     Min RPM Status
0       4963    712     ok
1       4821    712     ok
2       4963    712     ok
3       4821    712     ok
4       4963    712     ok
5       4963    712     ok
Figure 5.9. Output of the command "show stats fan" on the 150 / 250 / 550 / 555 / 755 models
SH # show stats fan
FanId   RPM     Min RPM Status
1       3559    2500    ok
2       3573    2500    ok
3       3588    2500    ok
Figure 5.10. Output of the command "show stats fan" on the 1050 / 2050 models
SH # show stats fan
FanId   RPM     Min RPM Status
1       8760    1080    ok
2       9240    1080    ok
3       9480    1080    ok
5       9120    1080    ok
7       9120    1080    ok
Figure 5.11. Output of the command "show stats fan" on the 5050 / 6050 / 7050 models
SH # show stats fan
FanId   RPM     Min RPM Status
1       3720    1080    ok
2       3840    1080    ok
3       3720    1080    ok
4       4920    1080    ok
5       3720    1080    ok
6       3720    1080    ok
Figure 5.12. Output of the command "show stats fan" on the CX255 model
SH # show stats fan
FanId   RPM     Min RPM Status
1       5133    164     ok
3       5192    164     ok
Figure 5.13. Output of the command "show stats fan" on the CX555 / CX755 model
SH # show stats fan
FanId   RPM     Min RPM Status
1       4518    2500    ok
2       4397    2500    ok
3       4368    2500    ok
Figure 5.14. Output of the command "show stats fan" on the CX5055 / CX7055 model
SH # show stats fan
FanId   RPM     Min RPM Status
1       5520    1080    ok
2       6840    1080    ok
3       5520    1080    ok
4       6960    1080    ok
5       5400    1080    ok
6       6840    1080    ok
7       5520    1080    ok
8       6840    1080    ok
Figure 5.15. Output of the command "show stats fan" on the DX8000 model
SH # show stats fan
FanId   RPM     Min RPM Status
1       5280    1080    ok
2       6840    1080    ok
3       5400    1080    ok
4       7080    1080    ok
5       5280    1080    ok
6       6720    1080    ok
7       5400    1080    ok
8       6720    1080    ok
Figure 5.16. Output of the command "show stats fan" on the EX560 / EX760 model
SH # show stats fan
FanId   RPM     Min RPM Status
1       5520    1800    ok
2       5520    1800    ok
3       5520    1800    ok
Figure 5.17. Output of the command "show stats fan" on the EX1160 model
SH # show stats fan
FanId   RPM     Min RPM Status
1       6000    1080    ok
3       5640    1080    ok
4       5040    1080    ok
5       5760    1080    ok
6       5160    1080    ok
7       5640    1080    ok
8       5160    1080    ok
9       5640    1080    ok
10      5160    1080    ok
Figure 5.18. Output of the command "show stats fan" on the EX1260 / EX1360 model
SH # show stats fan
FanId   RPM     Min RPM Status
1       5520    1080    ok
2       6840    1080    ok
3       5520    1080    ok
4       7080    1080    ok
5       5520    1080    ok
6       6840    1080    ok
7       5760    1080    ok
8       6840    1080    ok
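When going through these fan tables in a large number of system dumps, a fan that is running below its minimum speed can be flagged automatically. This is a minimal sketch; the file name fan-stats.txt is hypothetical and is assumed to contain the table as shown above:

[~/sysdumps] edwin@t43> awk '$1 ~ /^[0-9]+$/ && $2 + 0 < $3 + 0 { print "fan " $1 " below minimum: " $2 " < " $3 }' fan-stats.txt

The first condition skips the header line, the second compares the measured RPM against the minimum RPM; nothing is printed when all fans are fine.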
The memory in the Steelhead appliances supports ECC, which means that it can detect problems inside the memory modules. An alarm will be raised if such a problem has been detected.
Figure 5.19. Output of the command "show stats ecc-ram"
SH # show stats ecc-ram
No ECC memory errors have been detected
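If this alarm was raised in the past, the system logs in a system dump can also be searched for ECC-related entries. This is a minimal sketch; the log file names and the exact wording of the kernel messages depend on the RiOS version:

[~/sysdumps] edwin@t43> grep -i ecc messages*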