5.8. Unexpected reboots

Unexpected reboots can happen for three reasons: kernel panics, power failures and hardware watchdog initiated reboots.

5.8.1. Kernel panics

The optimization service runs on top of a Linux kernel. When there is a problem in the kernel, it will panic and reboot the machine. Problems in the kernel can be caused either by software (for example a reference to an invalid block of memory or a division by zero) or by hardware (errors in the memory modules). When this happens, the kernel will try to create a crash dump and reboot. During startup, the startup scripts will detect these crash dumps and extract the necessary information from them.

Figure 5.42. Reboot caused by a kernel panic

localhost kernel: con_dump: restoring oops message, timestamp=1351785953 ---------
localhost kernel: divide error: 0000 [1] SMP 
localhost kernel: CPU 0 
localhost kernel: Pid: 0, comm: swapper Tainted: PF     2.6.9-34.EL-rbt-11902SMP
localhost kernel: RIP: 0010:[tcp_snack_new_ofo_skb+356/684] <ffffffff8045d302>{tcp_snack_n \
    ew_ofo_skb+356}
localhost kernel: RIP: 0010:[<ffffffff8045d302>] <ffffffff8045d302>{tcp_snack_new_ofo_skb+ \
    356}
localhost kernel: RSP: 0018:ffffffff806b4cc8  EFLAGS: 00010246
localhost kernel: RAX: 0000000000000551 RBX: 00000100755d5338 RCX: 0000000000000000
localhost kernel: RDX: 0000000000000000 RSI: 00000101286f7040 RDI: 00000100755d5338
localhost kernel: RBP: 00000100755d5000 R08: ffffffff01000000 R09: 00000101286f7040
localhost kernel: R10: 0000000000000000 R11: deaf1eed01000000 R12: 0000010074b5c034
localhost kernel: R13: 00000101286f7040 R14: 0000000000000295 R15: 0000000000000000
localhost kernel: FS:  0000000040401960(0000) GS:ffffffff80753e00(0000) knlGS:000000000000 \
    0000
localhost kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
localhost kernel: CR2: 0000002a98215000 CR3: 0000000000101000 CR4: 00000000000006e0
localhost kernel: Process swapper (pid: 0, threadinfo ffffffff80758000, task ffffffff805d9 \
    480)
localhost kernel: Stack: 00000100755d5338 ffffffff80460dc2 0000000000000001 00000100755d50 \
    00 
localhost kernel:        0000000000000003 000001012a83c800 0000000000000000 000001012a83c8 \
    00 
localhost kernel:        00000000ba098a8f ffffffff8046e25e 
localhost kernel: Call Trace:<IRQ> <ffffffff80460dc2>{tcp_rcv_state_process+2471} <fffffff \
    f8046e25e>{tcp_child_process+51} 
localhost kernel:        <ffffffff8046a72a>{tcp_v4_do_rcv+391} <ffffffff8046ad4a>{tcp_v4_r \
    cv+1449} 
localhost kernel:        <ffffffff8044ad04>{ip_local_deliver_finish+248} <ffffffff8044ac0c \
    >{ip_local_deliver_finish+0} 
localhost kernel:        <ffffffff8043e7a2>{nf_hook_slow+184} <ffffffff8044b007>{ip_local_ \
    deliver+552} 
localhost kernel:        <ffffffff8044b20e>{ip_rcv_finish+512} <ffffffff8044b00e>{ip_rcv_f \
    inish+0} 
localhost kernel:        <ffffffff8043e7a2>{nf_hook_slow+184} <ffffffff8044b6d7>{ip_rcv+11 \
    38} 
localhost kernel:        <ffffffff80435528>{netif_receive_skb+590} <ffffffff804355eb>{proc \
    ess_backlog+137} 
localhost kernel:        <ffffffff804356f5>{net_rx_action+129} <ffffffff8013ac80>{__do_sof \
    tirq+88} 
localhost kernel:        <ffffffff8013ad29>{do_softirq+49} <ffffffff8011311f>{do_IRQ+328} 
localhost kernel:        <ffffffff801107cb>{ret_from_intr+0}  <EOI> <ffffffff801
localhost kernel: con_dump: end of oops ------------------------------------

When this happens, please open a case with Riverbed TAC for follow-up.

In RiOS version 8.0 and later, a crash kernel feature is available: when a kernel panic happens, instead of rebooting and losing the memory contents, the Steelhead appliance boots into a second kernel which collects the memory contents so that Riverbed engineering can debug them later.

Figure 5.43. Enabling the crash kernel feature

SH (config) # support show kexec
Kexec Mode Enabled: no 
SH (config) # support kexec enable
You must reboot the appliance for your changes to take effect
SH (config) # write memory
SH (config) # reload

When this crash kernel feature is enabled, recovery from a kernel crash will take longer than normal. Afterwards a file named kernel-crashdump-<timestamp>.tgz will be available in the process dump directory.
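To confirm that a crash dump was written, the process dump directory can be checked from a shell. This is only a sketch: DUMP_DIR is a placeholder, since the actual path of the process dump directory depends on the appliance.

```shell
# DUMP_DIR is a placeholder; substitute the appliance's process dump
# directory. Lists any saved kernel crash dumps, if present.
DUMP_DIR=${DUMP_DIR:-.}
ls "$DUMP_DIR" | grep '^kernel-crashdump-.*\.tgz$' \
    || echo "no kernel crash dumps found"
```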

5.8.2. Power failure caused reboots

A power failure is either when the power to the power supplies fails or when a power supply itself fails. On the xx20 series models without a hardware RAID controller and on the xx50, CX and EX series models, this kind of reboot can be identified by checking the SMART status of the hard disks. In case of a power off/on, the value of the Current Power Cycle Count attribute increases, while in case of a normal reboot it stays the same.

So by searching for the lines Current Power Cycle Count and shutdown_check in the startup logs, the kind of reboot can be determined:

Figure 5.44. This device was rebooted because of loss of power.

localhost disk: Drive=sda, Serial= 080806DP1D10DGGHV06P, Current Power Cycle Count=68, Las \
    t Power Cycle Count=66
[...]
localhost shutdown_check: Checking for unexpected shutdown
localhost shutdown_check: Detected unexpected shutdown!
localhost shutdown_check: 
localhost rc: Starting shutdown_check:  succeeded

In this case the Current Power Cycle Count was increased: this was a power related restart.
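The decision rule above can be sketched as a small shell helper. This is a hypothetical function, not part of RiOS; it simply compares the two counter values as logged by the startup scripts.

```shell
# Hypothetical helper, not a Riverbed tool: classify a reboot from the
# Current and Last Power Cycle Count values found in the startup logs.
classify_reboot() {    # classify_reboot CURRENT LAST
    if [ "$1" -ne "$2" ]; then
        echo "power related restart"
    else
        echo "restart without power cycle (kernel panic or watchdog)"
    fi
}

classify_reboot 68 66    # values from Figure 5.44: prints "power related restart"
```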

5.8.3. Hardware watchdog initiated reboots

The hardware watchdog is integrated on the motherboard and will reboot the appliance if the process wdt has not communicated with it for 30 seconds.

This can happen when:

  • The kernel hangs and doesn't allow the user-land to run. This is a bad situation and the reboot will give the machine a chance to recover, hopefully without getting into the same situation again.

  • The kernel is waiting for a hard disk to complete a read or write operation and it takes too long. This can happen when a hard disk is in the process of failing but its controller hasn't given up yet.

    For non-RAID or non-FTS appliances, this might prevent the Steelhead appliance from coming back up, since an essential hard disk has malfunctioned.

    For RAID-appliances, this might be the first step for the RAID controller, RAID software or FTS process to determine that a disk is broken.

    Figure 5.45. The watchdog interval was larger than expected.

    SH wdt[6805]: [wdt.WARNING]: watchdog poll late: 16415 with interval 1000 
    

    In this example, the watchdog timer had been activated 16.415 seconds after the last time instead of the expected one second.
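The check behind that log message can be sketched as follows. This is only a conceptual re-creation, not the actual wdt code: it warns when the gap between two polls exceeds the configured interval.

```shell
# Toy re-creation of the lateness check (assumption: this mirrors the
# real wdt logic only conceptually). Arguments are in milliseconds.
check_poll() {    # check_poll GAP_MS INTERVAL_MS
    if [ "$1" -gt "$2" ]; then
        echo "watchdog poll late: $1 with interval $2"
    fi
}

check_poll 16415 1000    # the values from Figure 5.45
```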

Unlike in the previous section, where the Current Power Cycle Count line showed different values for the current and last count, a hardware watchdog initiated reboot does not increase that value.

Figure 5.46. This device was rebooted by the hardware watchdog.

localhost disk: Drive=sda, Serial= 080806DP1D10DGGHV06P, Current Power Cycle Count=68, Las \
    t Power Cycle Count=68
[...]
localhost shutdown_check: Checking for unexpected shutdown
localhost shutdown_check: Detected unexpected shutdown!
localhost shutdown_check: 
localhost rc: Starting shutdown_check:  succeeded

In the file hardwareinfo.txt, which is found in a system dump and discussed in the section Hardware Information, the hardware watchdog initiated reboots can be seen in the IPMI log:

Figure 5.47. Determining reboots based on the IPMI logs

Output of 'show_ipmi_info':

Motherboard '400-00100-01'

SEL Record ID          : 0005
 Record Type           : 02
 Timestamp             : 04/06/2011 15:56:54
 Generator ID          : 0020
 EvM Revision          : 04
 Sensor Type           : Critical Interrupt
 Sensor Number         : e7
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : 03ffff
 Description           : Software NMI

SEL Record ID          : 0157
 Record Type           : 02
 Timestamp             : 04/06/2011 15:56:55
 Generator ID          : 0020
 EvM Revision          : 04
 Sensor Type           : Watchdog 2
 Sensor Number         : 81
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : c104ff
 Description           : Hard reset
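Watchdog events like the one above can be filtered out of the IPMI SEL with a simple grep. This assumes the hardwareinfo.txt from a system dump is in the current directory; the -A 5 option prints the five lines after each match so the Description line is visible.

```shell
# Filter the IPMI SEL in hardwareinfo.txt for watchdog sensor events;
# the lines following each match include the Description (e.g. Hard reset).
grep -A 5 'Sensor Type.*Watchdog' hardwareinfo.txt
```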

Since RiOS version 7.0, a software NMI handler is called before the hardware watchdog reboots the device. It saves a copy of the kernel stack, which can be retrieved during the next startup of the device.

When the hardware watchdog resets the Steelhead appliance, please open a case with Riverbed TAC for investigation.