When there is a high CPU alarm on the Steelhead appliance, there can be several reasons for it:
High CPU due to traffic issues.
High CPU due to RSP issues.
High CPU due to spinning processes.
If the CPU issues are related to the traffic, then the CPU load should follow the traffic pattern.
In this example, the CPU load recorded in the System Accounting Records on this device shows that CPUs 0-6 and 8-11 all follow more or less the same pattern, while CPU 7 shows a huge %soft value, which might be related to the traffic:
This is the traffic, in number of packets per second on the LAN side and on the WAN side, as recorded in the System Accounting Records:
So that confirms that the CPU load is following the traffic pattern, which means that the traffic is causing the CPU alarm.
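The System Accounting Records are normally collected with the sar utility, so when the raw sysstat data files are included in the system dump they can be replayed to correlate the per-CPU load with the per-interface packet rates. A minimal sketch, assuming the sysstat tools are available on a workstation and sa21 is a hypothetical daily data file taken from the dump:

    # Per-CPU utilization, including the %soft column, to spot a single overloaded CPU
    sar -P ALL -f sa21

    # Per-interface packet rates (rxpck/s and txpck/s) to compare against the CPU pattern
    sar -n DEV -f sa21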
For some of the NICs, all traffic on an interface is dealt with by a single CPU. This is fine for desktop and 1RU devices; however, for the large 3RU devices this is often a bottleneck. Since RiOS 8.0, for some of the NICs the traffic can be distributed over multiple queues, spreading the load over multiple CPUs and thus preventing a single CPU from becoming overloaded.
Whether the NIC supports this feature can be seen in the section "Output of 'cat /proc/interrupts'" of the file sysinfo.txt in the system dump:
The first output shows that each NIC is dealt with by its own single CPU:
Figure 5.221. Distribution of interrupts for a NIC dealt with by a single CPU
            CPU0       CPU1       CPU2       CPU3       [...]
 48:         143         14    1217838         96       PCI-MSI-edge    wan0_0
 49:           3         88        108    1217838       PCI-MSI-edge    lan0_0
 50:          15          8    2573665         65       PCI-MSI-edge    wan0_1
 51:          54         99         82    8434328       PCI-MSI-edge    lan0_1
The second output shows that each NIC queue is spread over all CPUs:
Figure 5.222. Distribution of interrupts for a NIC dealt with by multiple CPUs
            CPU0         CPU1         CPU2         CPU3         [...]
 52:           4            0            0            0         PCI-MSI-edge    wan1_1
 53:          68    375249745    383988175    371645180         PCI-MSI-edge    wan1_1-rx-0
 54:          22    424257135    414943503    415290180         PCI-MSI-edge    wan1_1-rx-1
 55:          22    327132986    319572437    314607486         PCI-MSI-edge    wan1_1-rx-2
 56:          22    405144547    408061894    400918998         PCI-MSI-edge    wan1_1-rx-3
 57:          22    650505324    625281921    621856160         PCI-MSI-edge    wan1_1-tx-0
 58:        7663    540561142    529003647    513297250         PCI-MSI-edge    wan1_1-tx-1
 59:          22    668132358    654101138    626834780         PCI-MSI-edge    wan1_1-tx-2
 60:          27    640642031    624693149    605929490         PCI-MSI-edge    wan1_1-tx-3
The third output shows that each NIC queue is dealt with by its own single CPU:
Figure 5.223. Distribution of interrupts for a NIC dealt with by multiple CPUs
            CPU0          CPU1          CPU2          CPU3          [...]
 88:  1379090814             0             0             0          PCI-MSI-edge    wan2_0-TxRx-0
 89:          97    1975362362             0             0          PCI-MSI-edge    wan2_0-TxRx-1
 90:          97             0    1616533530             0          PCI-MSI-edge    wan2_0-TxRx-2
 91:          97             0             0    1533039750          PCI-MSI-edge    wan2_0-TxRx-3
and in the output of the ethtool statistics:
Figure 5.224. Statistics show that there are multiple queues for TX and RX
Output of 'ethtool -S lan1_0':
NIC statistics:
     rx_packets: 63450051974
     tx_packets: 98718359686
     rx_bytes: 37546291563374
     tx_bytes: 37018865599885
     [...]
     tx_queue_0_packets: 25155414255
     tx_queue_0_bytes: 9268449949519
     tx_queue_0_restart: 0
     tx_queue_1_packets: 23339319194
     tx_queue_1_bytes: 9115355264290
     tx_queue_1_restart: 0
     tx_queue_2_packets: 23978453361
     tx_queue_2_bytes: 8207159290199
     tx_queue_2_restart: 0
     tx_queue_3_packets: 26245172876
     tx_queue_3_bytes: 10032958161335
     tx_queue_3_restart: 0
     rx_queue_0_packets: 15784045932
     rx_queue_0_bytes: 9098239857902
     rx_queue_0_drops: 0
     rx_queue_0_csum_err: 0
     rx_queue_0_alloc_failed: 0
     rx_queue_1_packets: 15658010393
     rx_queue_1_bytes: 9148562150602
     rx_queue_1_drops: 0
     rx_queue_1_csum_err: 0
     rx_queue_1_alloc_failed: 0
     rx_queue_2_packets: 15809170269
     rx_queue_2_bytes: 9485746463221
     rx_queue_2_drops: 0
     rx_queue_2_csum_err: 0
     rx_queue_2_alloc_failed: 0
     rx_queue_3_packets: 16198806283
     rx_queue_3_bytes: 9559962031124
     rx_queue_3_drops: 0
     rx_queue_3_csum_err: 0
     rx_queue_3_alloc_failed: 0
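On a device with shell access, or against the copy of these outputs in the system dump, a quick check along the following lines shows whether an in-path interface is using multiple queues. A minimal sketch, assuming wan0_0 and lan1_0 are the interfaces of interest:

    # A single line means the NIC is serviced by one interrupt; multiple lines with
    # -rx-N/-tx-N or -TxRx-N suffixes mean the traffic is spread over multiple queues
    grep wan0_0 /proc/interrupts

    # Per-queue counters in the ethtool statistics also indicate multi-queue operation
    ethtool -S lan1_0 | grep _queue_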
If the right NIC is installed but the RiOS version is lower than 8.0, then upgrade to a newer RiOS version. If the right NIC is installed but not in use, then move the cables and reconfigure the IP addresses to the right in-path interface.
The full list of NICs supporting multiple queues is available in KB article S23409.
If the CPU is overloaded because of traffic, the most logical workaround would be to not send the traffic through the Steelhead appliance. There are several methods:
If a large chunk of the traffic is not optimizable, for example UDP voice traffic towards a VLAN with VoIP phones, the traffic could be routed from the WAN router around the Steelhead appliance towards the LAN switch and vice versa. This way the Steelhead appliance doesn't see it and thus doesn't have to process it.
If the NIC in the Steelhead appliance supports hardware by-pass, the traffic can be passed through at the NIC level instead of having to be dealt with at the kernel level on the Steelhead appliance.
Change to a virtual in-path design where only the relevant traffic gets sent to the Steelhead appliance. In case of a WCCP or PBR deployment, use the right redirection access-lists to only forward optimizable traffic. In case of an Interceptor deployment, only optimizable traffic gets redirected to the Steelhead appliance.
In the System Accounting Records this could look like:
The type is "Nice", which is indicates to RSP related process
The contents of the file top_output can show that the process causing the most load is the vmware-vmx process:
Figure 5.226. Single spinning CPU
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND
 5436 admin     18   1 1065m 797m 539m R 98.7 10.0 169856:25  vmware-vmx
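When the top_output file contains many samples, the heaviest consumers can be found by sorting the process lines on the %CPU column. A minimal sketch, assuming top_output uses the standard top column layout shown above:

    # Select the process lines (starting with a PID) and sort descending on %CPU (field 9)
    grep -E '^ *[0-9]+ ' top_output | sort -rn -k9 | head -5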
The next step would be to determine what the RSP slot is doing.
In the System Accounting Records this could look like:
What you see here is that since 21 January around 22:10, there were suddenly high CPU spikes on all CPUs. The spikes can show up on all CPUs, or they can stay stuck on a single CPU. The type is "User", which means that the time is spent in a userland process.
The contents of the file top_output can show that a single process uses 100% of the CPU time:
Figure 5.228. Single spinning CPU
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 1414 admin     25   0 81176  45m  41m R 100.9  0.1  1005:54  winbindd
That 100.9% CPU means that the process is spinning in a loop, consuming a full CPU. The expected behaviour would be that the winbindd process only gets woken up when Active Directory integration related tasks are running, for example when a new encrypted MAPI session gets set up.
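When shell access to the appliance is available, sampling the process a few times confirms that it keeps consuming a full CPU rather than showing a short burst. A minimal sketch, using the PID 1414 from the example above:

    # Three batch-mode samples, five seconds apart; the second and third samples
    # reflect usage over the interval, so a %CPU value pinned near 100 in those
    # samples indicates a spinning process rather than a momentary spike
    top -b -d 5 -n 3 -p 1414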