5.32. High CPU related issues

A high CPU alarm on the Steelhead appliance can have various causes:

5.32.1. Traffic related CPU issues

If the CPU issues are related to the traffic, then the CPU load should follow the traffic pattern.

In this example, the CPU load recorded in the System Accounting Records on this device shows that CPUs 0-6 and 8-11 all have more or less the same pattern, while CPU 7 shows a huge %soft value, which might be related to the traffic:

Figure 5.219. CPU 7 shows a higher CPU load than the other ones

This is the traffic, in packets per second on the LAN and WAN sides, as recorded in the System Accounting Records:

Figure 5.220. Number of packets going through the LAN and WAN side

This confirms that the CPU load follows the traffic pattern, which means that the traffic is causing the CPU alarm.
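
The %soft figure seen here is time spent in software interrupt handling. On a generic Linux system this per-CPU breakdown can be watched live with the mpstat tool from the sysstat package; a minimal sketch, assuming mpstat is available (the Steelhead appliance itself records the same data in the System Accounting Records):

# Show the utilization of every CPU every five seconds, including
# the %soft column for time spent in software interrupt handling.
# One CPU with a much higher %soft value than the others points at
# network interrupt handling being concentrated on a single core.
mpstat -P ALL 5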

5.32.1.1. NIC limitations and multi-queuing

On some NICs, all traffic is handled by a single CPU. This is fine for desktop and 1RU devices, but on the large 3RU devices it is often a bottleneck. Since RiOS 8.0, some NICs can distribute the traffic over multiple queues, spreading the load over multiple CPUs and thus preventing a single CPU from being overloaded.

Whether the NIC supports this feature can be seen in the section Output of 'cat /proc/interrupts' of the file sysinfo.txt in the system dump:

The first output shows that each NIC is dealt with by its own single CPU:

Figure 5.221. Distribution of interrupts for a NIC dealt with by a single CPU

      CPU0       CPU1       CPU2       CPU3
[...]
 48:   143         14    1217838         96  PCI-MSI-edge  wan0_0
 49:     3         88        108    1217838  PCI-MSI-edge  lan0_0
 50:    15          8    2573665         65  PCI-MSI-edge  wan0_1
 51:    54         99         82    8434328  PCI-MSI-edge  lan0_1

The second output shows that each NIC queue is spread over all CPUs:

Figure 5.222. Distribution of interrupts for a NIC dealt with by multiple CPUs

      CPU0       CPU1       CPU2       CPU3
[...]
 52:     4          0          0          0  PCI-MSI-edge  wan1_1
 53:    68  375249745  383988175  371645180  PCI-MSI-edge  wan1_1-rx-0
 54:    22  424257135  414943503  415290180  PCI-MSI-edge  wan1_1-rx-1
 55:    22  327132986  319572437  314607486  PCI-MSI-edge  wan1_1-rx-2
 56:    22  405144547  408061894  400918998  PCI-MSI-edge  wan1_1-rx-3
 57:    22  650505324  625281921  621856160  PCI-MSI-edge  wan1_1-tx-0
 58:  7663  540561142  529003647  513297250  PCI-MSI-edge  wan1_1-tx-1
 59:    22  668132358  654101138  626834780  PCI-MSI-edge  wan1_1-tx-2
 60:    27  640642031  624693149  605929490  PCI-MSI-edge  wan1_1-tx-3

The third output shows that each NIC queue is dealt with by its own single CPU:

Figure 5.223. Distribution of interrupts for a NIC dealt with by multiple CPUs

      CPU0       CPU1       CPU2       CPU3
[...]
 88: 1379090814          0          0          0     PCI-MSI-edge      wan2_0-TxRx-0
 89:         97 1975362362          0          0     PCI-MSI-edge      wan2_0-TxRx-1
 90:         97          0 1616533530          0     PCI-MSI-edge      wan2_0-TxRx-2
 91:         97          0          0 1533039750     PCI-MSI-edge      wan2_0-TxRx-3
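
On a live system with shell access, the same information can be pulled out of the interrupt table directly; a minimal sketch, assuming in-path interface names like the ones above:

# Print the CPU header followed by the interrupt rows of the
# in-path interfaces. Multiple -rx-N/-tx-N or -TxRx-N rows per
# interface mean that the NIC spreads its traffic over multiple
# queues.
head -1 /proc/interrupts
grep -E 'wan|lan' /proc/interrupts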

Whether multiple queues are in use can also be seen in the output of the ethtool statistics:

Figure 5.224. Statistics show that there are multiple queues for TX and RX

Output of 'ethtool -S lan1_0':

NIC statistics:
     rx_packets: 63450051974
     tx_packets: 98718359686
     rx_bytes: 37546291563374
     tx_bytes: 37018865599885
[...]
     tx_queue_0_packets: 25155414255
     tx_queue_0_bytes: 9268449949519
     tx_queue_0_restart: 0
     tx_queue_1_packets: 23339319194
     tx_queue_1_bytes: 9115355264290
     tx_queue_1_restart: 0
     tx_queue_2_packets: 23978453361
     tx_queue_2_bytes: 8207159290199
     tx_queue_2_restart: 0
     tx_queue_3_packets: 26245172876
     tx_queue_3_bytes: 10032958161335
     tx_queue_3_restart: 0
     rx_queue_0_packets: 15784045932
     rx_queue_0_bytes: 9098239857902
     rx_queue_0_drops: 0
     rx_queue_0_csum_err: 0
     rx_queue_0_alloc_failed: 0
     rx_queue_1_packets: 15658010393
     rx_queue_1_bytes: 9148562150602
     rx_queue_1_drops: 0
     rx_queue_1_csum_err: 0
     rx_queue_1_alloc_failed: 0
     rx_queue_2_packets: 15809170269
     rx_queue_2_bytes: 9485746463221
     rx_queue_2_drops: 0
     rx_queue_2_csum_err: 0
     rx_queue_2_alloc_failed: 0
     rx_queue_3_packets: 16198806283
     rx_queue_3_bytes: 9559962031124
     rx_queue_3_drops: 0
     rx_queue_3_csum_err: 0
     rx_queue_3_alloc_failed: 0
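
To quickly verify that the load is spread evenly over the queues, the per-queue packet counters can be filtered out of these statistics; a minimal sketch, assuming the interface lan1_0 from the output above:

# Show only the per-queue packet counters; roughly equal values
# across the queues indicate that the load is evenly distributed.
ethtool -S lan1_0 | grep -E 'queue_[0-9]+_packets'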

If the right NIC is installed but the RiOS version is lower than 8.0, then upgrade to a newer RiOS version. If the right NIC is installed but not in use, then move the cables and reconfigure the IP addresses to the right in-path interface.

The full list of NICs supporting multiple queues is available in KB article S23409.

5.32.1.2. Bypass the traffic

If the CPU is overloaded because of traffic, the most logical workaround would be not to send the traffic through the Steelhead appliance. There are several methods:

  • If a large chunk of the traffic is not optimizable, for example UDP voice traffic towards a VLAN with VoIP phones, the traffic could be routed from the WAN router around the Steelhead appliance towards the LAN switch and vice versa. This way the Steelhead appliance doesn't see it and thus doesn't have to process it.

  • If the NIC in the Steelhead appliance supports hardware bypass, the traffic can be passed through at the NIC level instead of having to be dealt with at the kernel level on the Steelhead appliance.

  • Change to a virtual in-path design where only the relevant traffic gets sent to the Steelhead appliance. In case of a WCCP or PBR deployment, use the right redirection access-lists to forward only optimizable traffic, as shown in the sketch after this list. In case of an Interceptor deployment, only optimizable traffic gets redirected to the Steelhead appliance.
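
As an illustration of such a redirection access-list, here is a minimal Cisco IOS sketch. The WCCP service group numbers 61 and 62 are the pair commonly used in Steelhead deployments, but the access-list number and the client subnet are hypothetical:

! Redirect only TCP traffic from the hypothetical client subnet;
! only TCP traffic can be optimized. Everything else, such as UDP
! voice traffic, stays on the router and never reaches the
! Steelhead appliance.
access-list 120 permit tcp 10.0.1.0 0.0.0.255 any
access-list 120 deny ip any any
ip wccp 61 redirect-list 120
ip wccp 62 redirect-list 120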

5.32.2. RSP related CPU issues

In the System Accounting Records this could look like this:

Figure 5.225. A lot of CPU usage is in the 'nice' category

The type is "Nice", which indicates an RSP related process: the vmware-vmx processes running the RSP slots have a positive nice value (note the NI column in the top output below), so their CPU time is accounted in the "nice" category.

The contents of the file top_output can show that the process causing the most load is the vmware-vmx process:

Figure 5.226. Single spinning CPU

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 5436 admin     18   1 1065m 797m 539m R 98.7 10.0 169856:25 vmware-vmx         

The next step would be to determine what the virtual machine in the RSP slot is doing.
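
If multiple RSP slots are in use, the PID from the top output can be matched against the running virtual machines; a minimal sketch, assuming shell access (the path of the .vmx configuration file in the command line identifies the slot):

# List the vmware-vmx processes with their full command lines;
# the .vmx file path shows which RSP slot each process runs.
ps -ww -o pid,args -C vmware-vmx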

5.32.3. Process related CPU issues

In the System Accounting Records this could look like this:

Figure 5.227. A userland process is spinning on a CPU

What you see here is that since 21 January around 22:10, there were suddenly high CPU spikes on all CPUs. The spikes can show up on all CPUs, or be stuck on a single CPU. The type is "User", which means that a userland process is responsible.

The contents of the file top_output can show that a single process uses 100% of the CPU time:

Figure 5.228. Single spinning CPU

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 1414 admin     25   0 81176  45m  41m R 100.9  0.1   1005:54 winbindd          

That 100.9% CPU means that the process is spinning in a loop, consuming a full CPU. The expected behaviour would be that the winbindd process only gets woken up when Active Directory integration related tasks are running, for example when a new encrypted MAPI session gets set up.
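
To find out what a spinning process is actually doing, its system call activity can be sampled; a minimal sketch, assuming shell access and the PID 1414 from the top output above:

# Attach to the spinning process, count its system calls and print
# a summary on detach (Ctrl-C after a few seconds). A tight
# userland loop shows few or no system calls; a loop around a
# failing call shows the same call repeating with an error code.
strace -c -p 1414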