5.2. Hardware related issues

This section describes the hardware-related issues that can occur during the operation of the Steelhead appliance.

5.2.1. Storage

There are three kinds of storage devices in the Steelhead appliances:

  • USB flash disk with the boot-partition, the configuration and a copy of the latest installed RiOS software.

  • One or more hard disk drives.

  • One or more solid-state disks.

From a troubleshooting point of view, the USB flash disk is of little interest.

5.2.1.1. Hard disks

The hard disk in a Steelhead appliance endures a heavy write load and can fail over time. That is why, in the higher-end Steelhead appliances, the hard disks are mirrored in a RAID array: if one hard disk fails, the data is still available on the mirror hard disk.

On Steelhead appliances without a RAID array, the failure of a hard disk will first cause slowness in the processing of optimized TCP sessions; at a certain moment the hardware watchdog will time out and restart the appliance, after which the device will fail in the boot process.

In case of a hard disk failure, the replacement procedure differs per model:

  • The low-end desktop models of the xx20, xx50, CX and EX series models have a single hard disk or a dual non-RAID set of disks. The whole chassis needs to be replaced.

  • The 1RU xx20 series models have a single hard disk. The whole chassis needs to be replaced.

  • The 1RU xx50 series 1050U, 1050L and 1050M models have a single hard disk, but the appliance can rebuild itself during startup from the flash partition. Only a hard disk replacement is needed.

  • The 1RU xx50 series 1050H model has two hard disks, but they are in a RAID0 array. The appliance can rebuild itself during startup and only a hard disk replacement is needed.

    If a system dump is not available but the serial console bootlogs are, search for the available disks with the string Attached SCSI disk:

    Figure 5.1. Grep for "Attached SCSI disk" in the bootlogs

      [~/sysdumps] edwin@t43>grep "Attached SCSI disk" dmesg-boot 
      sd 0:0:0:0: [sda] Attached SCSI disk
      sd 1:0:0:0: [sdb] Attached SCSI disk
      sd 6:0:0:0: [sdc] Attached SCSI disk
    


    If the list starts with sd 1:0:0:0, that is disk 1, which means that disk 0 is missing.

  • The 1RU xx50 series 1050UR, 1050LR, 1050MR, 1050HR and CX1555 models have four hard disks in a RAID10 array. In case of a single hard disk failure, only the hard disk needs to be replaced. In case of multiple hard disk failures in the same RAID1 array, the chassis can rebuild itself during startup from the flash partition.

  • The 1RU xx50 series 2050 model and the 3RU xx50 series 5050 and 6050 models have multiple hard disks in a RAID10 array. In case of a single hard disk failure, only the hard disk needs to be replaced. In case of multiple hard disk failures in the same RAID1 array, the whole chassis needs to be replaced.

  • The 7050, 5055 and 6055 models have the hard disks in a RAID1 configuration, while the solid-state disks are in an FTS configuration.

During the RMA process for a broken hard disk, the device runs in a degraded mode. If the appliance has a Silver-level or Gold-level support contract, you might want to obtain spare hard disks so you can replace them faster and thus restore the redundancy of the RAID array sooner.
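
The check from Figure 5.1 can be scripted. Below is a minimal sketch; the function name check_disks and the gap-reporting logic are my own, and SCSI host numbering varies per model, so a gap is only a hint that a disk was not detected:

```shell
# check_disks: list the SCSI host numbers that attached a disk in a
# dmesg-boot file and report any gap in the numbering. A gap can mean a
# disk was not detected at boot, but host numbering varies per model.
check_disks() {
    hosts=$(grep "Attached SCSI disk" "$1" | sed 's/^sd \([0-9]*\):.*/\1/' | sort -n)
    max=$(printf '%s\n' "$hosts" | tail -n 1)
    for i in $(seq 0 "$max"); do
        printf '%s\n' "$hosts" | grep -qx "$i" \
            || echo "no disk attached on SCSI host $i"
    done
}
```

For the bootlog in Figure 5.1 this would report hosts 2 to 5 as having no disk, so the output still has to be interpreted against the expected disk layout of the model.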

5.2.1.2. Solid-state disks

Solid-state disks suffer from limited write endurance: the number of write cycles to any block of flash is limited. The Steelhead appliance keeps track of the number of writes per block of flash, and if it exceeds the manufacturer's specification, it will raise an alarm indicating that the solid-state disk needs to be replaced.
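
As a back-of-the-envelope illustration of write endurance (all numbers below are assumptions for the example, not Riverbed or manufacturer specifications):

```shell
# Illustrative wear-out estimate. The capacity, cycle rating and write
# rate are example values, not manufacturer specifications.
capacity_gb=160        # usable flash capacity in GB
pe_cycles=10000        # rated program/erase cycles per block
writes_gb_day=500      # average amount of data written per day in GB
# With ideal wear-levelling, the total writable amount before the rated
# endurance is reached is capacity times cycles:
total_gb=$((capacity_gb * pe_cycles))
days=$((total_gb / writes_gb_day))
echo "estimated lifetime: $days days"
```

In practice, write amplification lowers this number, which is why the appliance tracks actual per-block write counts instead of relying on such an estimate.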

5.2.1.3. RAID status and rebuilding

The operational status of the RAID array can be seen with the command show raid diagram.

Figure 5.2. Output of the command "show raid diagram" on a 2050 model

SH # show raid diagram
[  0 : online   ][  1 : online   ][  2 : online   ][  3 : online ]

When a hard disk fails, its status will go from online to missing or degraded. When the broken hard disk has been replaced, the status will change to rebuilding. When the rebuilding has finished, the status will go back to online.

In the following example there is a RAID10 array of 12 hard disks: the RAID1 array of disks 2 and 3 is being rebuilt, the RAID1 array of disks 10 and 11 is degraded because disk 11 is broken, and the RAID1 array of disks 8 and 9 is still working but disk 9 is currently in the degraded state. Slots 12 to 15 are not populated on this model and therefore show as missing.

Figure 5.3. Output of the command "show raid diagram" on a 5050 model

SH # show raid diagram
[  0 : online   ][  1 : online   ][  2 : online   ][  3 : rebuilding ]
[  4 : online   ][  5 : online   ][  6 : online   ][  7 : online     ]
[  8 : online   ][  9 : degraded ][ 10 : online   ][ 11 : missing    ]
[ 12 : missing  ][ 13 : missing  ][ 14 : missing  ][ 15 : missing    ]
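
The diagram output can be scanned for disks that are not online with a small filter. This is a sketch; the function name raid_alerts is my own:

```shell
# raid_alerts: read "show raid diagram" output on stdin and print every
# disk whose status is not "online".
raid_alerts() {
    grep -o '[0-9]\+ : [a-z]\+' \
        | awk -F' : ' '$2 != "online" { print "disk " $1 ": " $2 }'
}
```

Fed with the output of Figure 5.3, this reports the rebuilding, degraded and missing slots.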

The rebuild of a RAID array will take about 12 hours, depending on how busy the Steelhead appliance is. On the xx50, CX and EX series models, the progress of the rebuilding can be seen with the command raid swraid mdstat:

Figure 5.4. Output of the command "raid swraid mdstat" on a 5050 model

SH # raid swraid mdstat
Personalities : [linear] [raid0] [raid10] [raid6] [raid5] 
md3 : active raid10 sdd6[12] sda6[0] sdl6[11] sdk6[10] sdj6[9] sdi6[8] sdh6[7] sdg6[6] sdf \
    6[5] sde6[4] sdc6[2] sdb6[1]
      146512128 blocks 64K chunks 2 near-copies [12/11] [UUU_UUUUUUUU]
        resync=DELAYED
      
md0 : active raid10 sdd5[12] sda5[0] sdl5[11] sdk5[10] sdj5[9] sdi5[8] sdh5[7] sdg5[6] sdf \
    5[5] sde5[4] sdc5[2] sdb5[1]
      781288704 blocks 64K chunks 2 near-copies [12/11] [UUU_UUUUUUUU]
        resync=DELAYED
      
md2 : active raid10 sdd2[3] sda2[0] sdl2[11] sdk2[10] sdj2[9] sdi2[8] sdh2[7] sdg2[6] sdf2 \
    [5] sde2[4] sdc2[2] sdb2[1]
      37784448 blocks 64K chunks 2 near-copies [12/12] [UUUUUUUUUUUU]
      
md1 : active raid10 sdd3[12] sda3[0] sdl3[11] sdk3[10] sdj3[9] sdi3[8] sdh3[7] sdg3[6] sdf \
    3[5] sde3[4] sdc3[2] sdb3[1]
      100678656 blocks 64K chunks 2 near-copies [12/11] [UUU_UUUUUUUU]
      [=>...................]  recovery =  5.4% (920064/16779776) finish=6.3min speed=4182 \
    1K/sec
      
unused devices: <none>

In this example, disk 3 was replaced. The re-syncing of RAID1 array md2 has already been completed, md1 is currently at 5.4% with completion estimated in 6.3 minutes, and the arrays md3 and md0, marked DELAYED, will be resynced after that.
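
The [12/11] [UUU_UUUUUUUU] fields show, per md array, the configured versus active disk count and which slot is out of the array (the "_"). The slot positions can be extracted with a sketch like the following; the function name missing_slots is my own:

```shell
# missing_slots: read "raid swraid mdstat" output on stdin and report,
# per md array, the slot positions marked "_" (out of the array).
missing_slots() {
    awk '/^md/ { md = $1 }
         match($0, /\[[U_]+\]/) {
             s = substr($0, RSTART + 1, RLENGTH - 2)
             for (i = 1; i <= length(s); i++)
                 if (substr(s, i, 1) == "_")
                     print md ": slot " i - 1 " missing"
         }'
}
```

For the output in Figure 5.4 this reports slot 3 for md3, md0 and md1, matching the replaced disk that is still resyncing.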

On machines with multiple hard disks in a RAID10 configuration, the machine can survive a multiple hard disk failure as long as the failures are not on the same RAID1 array.
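
This rule can be expressed as a small check. The sketch below assumes that adjacent slots (0-1, 2-3, and so on) form the RAID1 pairs, which may not match the actual layout of every model; the function name is my own:

```shell
# raid10_survives: given the failed slot numbers as arguments, check
# whether any RAID1 mirror pair lost both of its disks. Assumes adjacent
# slots (0-1, 2-3, ...) are paired, which may differ per model.
raid10_survives() {
    for a in "$@"; do
        for b in "$@"; do
            if [ "$a" -lt "$b" ] && [ $((a / 2)) -eq $((b / 2)) ]; then
                echo "data lost: disks $a and $b are in the same RAID1 pair"
                return 1
            fi
        done
    done
    echo "array survives: no RAID1 pair lost both disks"
}
```

For example, raid10_survives 3 9 11 reports survival, while raid10_survives 10 11 reports data loss.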

If a disk is in the degraded state, it has not yet been removed from the RAID array, but it can still cause performance problems. To remove it from the RAID array, use the command raid swraid fail-disk.

5.2.1.4. Indication of a failure of a disk

The failure of a disk is often indicated by slower-than-normal performance of the Steelhead appliance: the RAID controller or RAID software has not yet marked the hard disk as broken, but it sees long delays when reading from and writing to it.

In the system logs, a failing disk can be recognized by log entries like:

Figure 5.5. A failing disk in the system logs

SH kernel: ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
SH kernel: ata4.00: (irq_stat 0x40000008)
SH kernel: ata4.00: cmd 60/3e:00:43:00:00/00:00:00:00:00/40 tag 0 cdb 0x0 data 31744 in
SH kernel:          res 41/40:3e:43:00:00/00:00:00:00:00/40 Emask 0x9 (media error)
SH kernel: ata4.00: configured for UDMA/133
SH kernel: SCSI error : <3 0 0 0> return code = 0x8000002
SH kernel: Info fld=0x4000000 (nonstd), Invalid sdd: sense = 72 11
SH kernel: end_request: I/O error, dev sdd, sector 67
SH kernel: Buffer I/O error on device sdd1, logical block 33
SH kernel: ata4: EH complete
SH kernel: SCSI device sdd: 490350672 512-byte hdwr sectors (251060 MB)
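
To see which device the errors concentrate on, the end_request lines can be counted per disk. This is a sketch; the function name io_error_count is my own:

```shell
# io_error_count: read a system log on stdin and count the
# "end_request: I/O error" lines per device.
io_error_count() {
    grep -o 'I/O error, dev sd[a-z]*' \
        | awk '{ count[$NF]++ } END { for (d in count) print d, count[d] }'
}
```

A device with a steadily growing count is the likely candidate for replacement.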

5.2.2. Power supplies

The Steelhead appliances with multiple power supplies can sustain the loss of one of them. An alarm will be raised if a power supply has failed.

5.2.3. Fans

Depending on the model, there are one or multiple fans in a Steelhead appliance. An alarm will be raised if a fan has failed.

Figure 5.6. Output of the command "show stats fan" on the 100 / 200 / 300 models

SH # show stats fan
FanId   RPM     Min RPM Status
1       5869    2657    ok

Figure 5.7. Output of the command "show stats fan" on the 520 / 1020 / 1520 / 2020 models

SH # show stats fan
FanId   RPM     Min RPM Status
1       3000    750     ok
2       3000    750     ok

Figure 5.8. Output of the command "show stats fan" on the 3020 / 3520 / 5520 / 6020 / 6120 models

SH # show stats fan
FanId   RPM     Min RPM Status
0       4963    712     ok
1       4821    712     ok
2       4963    712     ok
3       4821    712     ok
4       4963    712     ok
5       4963    712     ok

Figure 5.9. Output of the command "show stats fan" on the 150 / 250 / 550 / 555 / 755 models

SH # show stats fan
FanId   RPM     Min RPM Status
1       3559    2500    ok
2       3573    2500    ok
3       3588    2500    ok

Figure 5.10. Output of the command "show stats fan" on the 1050 / 2050 models

SH # show stats fan
FanId   RPM     Min RPM Status
1       8760    1080    ok
2       9240    1080    ok
3       9480    1080    ok
5       9120    1080    ok
7       9120    1080    ok

Figure 5.11. Output of the command "show stats fan" on the 5050 / 6050 / 7050 models

SH # show stats fan
FanId   RPM     Min RPM Status
1       3720    1080    ok
2       3840    1080    ok
3       3720    1080    ok
4       4920    1080    ok
5       3720    1080    ok
6       3720    1080    ok

Figure 5.12. Output of the command "show stats fan" on the CX255 model

SH # show stats fan
FanId   RPM     Min RPM Status
1       5133    164     ok
3       5192    164     ok

Figure 5.13. Output of the command "show stats fan" on the CX555 / CX755 model

SH # show stats fan
FanId   RPM     Min RPM Status
1       4518    2500    ok
2       4397    2500    ok
3       4368    2500    ok

Figure 5.14. Output of the command "show stats fan" on the CX5055 / CX7055 model

SH # show stats fan
FanId   RPM     Min RPM Status
1       5520    1080    ok
2       6840    1080    ok
3       5520    1080    ok
4       6960    1080    ok
5       5400    1080    ok
6       6840    1080    ok
7       5520    1080    ok
8       6840    1080    ok

Figure 5.15. Output of the command "show stats fan" on the DX8000 model

SH # show stats fan
FanId   RPM     Min RPM Status
1       5280    1080    ok
2       6840    1080    ok
3       5400    1080    ok
4       7080    1080    ok
5       5280    1080    ok
6       6720    1080    ok
7       5400    1080    ok
8       6720    1080    ok

Figure 5.16. Output of the command "show stats fan" on the EX560 / EX760 model

SH # show stats fan
FanId   RPM     Min RPM Status
1       5520    1800    ok
2       5520    1800    ok
3       5520    1800    ok

Figure 5.17. Output of the command "show stats fan" on the EX1160 model

SH # show stats fan
FanId   RPM     Min RPM Status
1       6000    1080    ok
3       5640    1080    ok
4       5040    1080    ok
5       5760    1080    ok
6       5160    1080    ok
7       5640    1080    ok
8       5160    1080    ok
9       5640    1080    ok
10      5160    1080    ok

Figure 5.18. Output of the command "show stats fan" on the EX1260 / EX1360 model

SH # show stats fan
FanId   RPM     Min RPM Status
1       5520    1080    ok
2       6840    1080    ok
3       5520    1080    ok
4       7080    1080    ok
5       5520    1080    ok
6       6840    1080    ok
7       5760    1080    ok
8       6840    1080    ok

5.2.4. Memory errors

The memory in the Steelhead appliances supports ECC, which means it can determine if there are problems inside the memory modules. An alarm will be raised if such a problem has been detected.

Figure 5.19. Output of the command "show stats ecc-ram"

SH # show stats ecc-ram
No ECC memory errors have been detected