5.17. Alarms and health status

The Steelhead appliance has various alarms which indicate a problem with the Steelhead appliance in general or with the optimization service specifically.

To make sure you are informed about possible issues on the Steelhead appliances, alarms can be sent out via several methods:

There are six styles of lines for reporting alarm related issues:

The full output of the logging for alarms is:

Figure 5.102. Full output of a log entry for an alarm before RiOS 7.0

SH statsd[312]: [statsd.NOTICE]: Alarm triggered for rising error for event xxx


and

Figure 5.103. Full output of a log entry for an alarm on RiOS 7.0 and later

SH alarmd[29863]: [alarmd.NOTICE]: Alarm 'xxx' triggering 
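
The two log formats above can be told apart mechanically. The following is a minimal Python sketch, assuming log lines shaped exactly like the two figures above; the function name parse_alarm_line and the keys of the returned dictionary are illustrative and not part of RiOS.

```python
import re

# Before RiOS 7.0: "... [statsd.NOTICE]: Alarm triggered for rising error for event xxx"
STATSD_RE = re.compile(
    r"\[statsd\.(?P<severity>\w+)\]: "
    r"Alarm triggered for (?P<state>\w+) error for event (?P<name>\S+)"
)
# RiOS 7.0 and later: "... [alarmd.NOTICE]: Alarm 'xxx' triggering"
ALARMD_RE = re.compile(
    r"\[alarmd\.(?P<severity>\w+)\]: Alarm '(?P<name>[^']+)' (?P<state>\w+)"
)


def parse_alarm_line(line):
    """Return a dict describing the alarm in a log line, or None if there is none."""
    for source, pattern in (("statsd", STATSD_RE), ("alarmd", ALARMD_RE)):
        match = pattern.search(line)
        if match:
            result = match.groupdict()
            result["source"] = source  # which daemon logged the alarm
            return result
    return None
```

Calling parse_alarm_line() on each line of a log file and discarding the None results gives a list of alarm events regardless of the RiOS version that produced them.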


With the introduction of RiOS 7.0 there are aggregate alarms, which rise if one of the source alarms rises. The aggregation tree is currently:

Figure 5.104. Aggregate alarm tree for RiOS 8.0

health
 |\_ admission_control
 |    |\_ admission_conn
 |    |\_ admission_cpu
 |    |\_ admission_mapi
 |    |\_ admission_mem
 |     \_ admission_tcp
 |\_ arcount
 |\_ block_store
 |    \_ block_store:*
 |\_ bypass
 |\_ connection_forwarding
 |    |\_ cf_ipv6_incompatible_cluster
 |    |\_ disconnected_sh_alert
 |     \_ single_cf
 |         |\_ cf_ack_timeout_aggr
 |         |     \_ cf_ack_timeout:*
 |         |\_ cf_conn_failure_aggr
 |         |     \_ cf_conn_failure:*
 |         |\_ cf_conn_lost_eos_aggr
 |         |     \_ cf_conn_lost_eos:*
 |         |\_ cf_conn_lost_err_aggr
 |         |     \_ cf_conn_lost_err:*
 |         |\_ cf_keepalive_timeout_aggr
 |         |     \_ cf_keepalive_timeout:*
 |         |\_ cf_latency_exceeded_aggr
 |         |     \_ cf_latency_exceeded:*
 |          \_ cf_read_info_timeout_aggr
 |               \_ cf_read_info_timeout:*
 |\_ cpu_util_indiv
 |     \_ cpu:*:util
 |\_ datastore
 |    |\_ datastore_error
 |    |\_ datastore_sync_error
 |    |\_ disk_not_setup
 |     \_ store_corruption
 |\_ domain_join_error
 |\_ duplex
 |\_ flash_protection_failed
 |\_ fs_mnt
 |     \_ fs_mnt:*:full
 |\_ granite-core
 |     \_ granite-core:*
 |\_ hardware
 |    |\_ disk
 |    |     \_ disk_error:*
 |    |\_ fan_error
 |    |\_ flash_error
 |    |\_ ipmi
 |    |\_ memory_error
 |    |\_ other_hardware_error
 |    |\_ power_supply
 |    |\_ raid_disk_indiv
 |    |     \_ disk:*:status
 |     \_ ssd_wear
 |          \_ ssd_wear_warning:*
 |\_ high_availability
 |     \_ high_availability:*
 |\_ inbound_qos_wan_bw_err
 |\_ iscsi
 |     \_ iscsi:*
 |\_ lan_wan_loop
 |\_ licensing
 |    |\_ appliance_unlicensed
 |    |\_ autolicense_error
 |    |\_ autolicense_info
 |    |\_ license_expired
 |     \_ license_expiring
 |\_ link_duplex
 |     \_ link_state:*:half_duplex
 |\_ link_io_errors
 |     \_ link_state:*:io_errors
 |\_ linkstate
 |     \_ link_state:*:link_error
 |\_ lun
 |     \_ lun:*
 |\_ nfs_v2_v4
 |\_ optimization_service
 |    |\_ halt_error
 |    |\_ optimization_general
 |     \_ service_error
 |\_ outbound_qos_wan_bw_err
 |\_ paging
 |\_ path_selection_path_down
 |\_ pfs
 |    |\_ pfs_config
 |     \_ pfs_operation
 |\_ profile_switch_failed
 |\_ rsp
 |    |\_ rsp_general_alarm
 |    |\_ rsp_license_expired
 |    |\_ rsp_license_expiring
 |     \_ rsp_service
 |\_ secure_vault
 |    |\_ secure_vault_rekey_needed
 |    |\_ secure_vault_uninitialized
 |     \_ secure_vault_unlocked
 |\_ serial_cascade_misconfig
 |\_ smb_alert
 |\_ snapshot
 |     \_ snapshot:*
 |\_ ssl
 |    |\_ certs_expiring
 |    |\_ crl_error:*
 |    |\_ non_443_sslservers_detected_on_upgrade
 |     \_ ssl_peer_scep_auto_reenroll
 |\_ sticky_staging_dir
 |\_ sw_version_aggr
 |    |\_ mismatch_peer_aggr
 |    |     \_ mismatch_peer:*
 |     \_ sw_version_mismatch_aggr
 |          \_ sw_version:*
 |\_ system_detail_report
 |\_ temperature
 |    |\_ critical_temp
 |     \_ warning_temp
 |\_ uncommitted_data
  \_ vsp
      |\_ esxi_communication_failed
      |\_ esxi_disk_creation_failed
      |\_ esxi_initial_config_failed
      |\_ esxi_license
      |    |\_ esxi_license_expired
      |    |\_ esxi_license_expiring
      |     \_ esxi_license_is_trial
      |\_ esxi_memory_overcommitted
      |\_ esxi_not_set_up
      |\_ esxi_version_unsupported
      |\_ esxi_vswitch_mtu_unsupported
      |\_ virt_cpu_util_indiv
      |     \_ virt_cpu:*:util
      |\_ vsp_general_alarm
      |\_ vsp_service
      |\_ vsp_service_not_running
       \_ vsp_unsupported_vm_count

If an alarm in one of the end nodes gets triggered, the nodes above it get triggered too. For example, if the warning_temp alarm gets triggered, the temperature alarm gets triggered and then the health alarm gets triggered.
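
The upward propagation can be illustrated with a short Python sketch. It only models a small subset of the tree above, and the PARENT table and the triggered_chain function are hypothetical names used for illustration; this is not how alarmd is implemented.

```python
# Child -> parent links for a small subset of the aggregate alarm tree.
PARENT = {
    "warning_temp": "temperature",
    "critical_temp": "temperature",
    "temperature": "health",
    "fan_error": "hardware",
    "hardware": "health",
}


def triggered_chain(alarm):
    """Return the alarm followed by every aggregate alarm above it, up to the root."""
    chain = [alarm]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain
```

For warning_temp this returns the chain warning_temp, temperature, health, which matches the example in the text.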

The examples of log messages in this section use a shorter notation without the hostname, process name and severity.

5.17.1. General alarms

5.17.1.1. Generic health alarm

Figure 5.105. Generic health alarm

Alarm 'health' triggered

This is the top level node which gets triggered when any of the other alarms gets triggered.

5.17.1.2. Certificates related alarms

Figure 5.106. Certificates related alarms

Alarm 'certs_expiring' triggered
Alarm 'crl_error:SSL_CAs' triggered
Alarm 'crl_error:SSL_Peering_CAs' triggered
Alarm 'ssl' triggered
Alarm 'ssl_peer_scep_auto_reenroll' triggered

Alarm triggered for rising error for event certs_expiring
Alarm triggered for rising error for event crl_error
Alarm triggered for rising error for event ssl_peer_scep_auto_enroll

The certs_expiring alarm gets triggered if any Root CA certificate, SSL peering certificate or SSL server certificate will expire within two months or has already expired.

Any expired Root CA certificates can be safely removed if they are not used by one of your SSL server certificates.

The expired SSL peering certificates need to be reissued on the Steelhead appliance itself.

The expired SSL server certificates need to be obtained again from the department which runs the SSL server and imported again on the Steelhead appliance.

Next steps: Check the expiry dates on the certificates in the Certificate Authority and the certificates in the SSL server list.
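
When checking many certificates, the expiry dates can be classified with a small script. The following Python sketch assumes the expiry dates have already been extracted from the certificates; the expiry_status function is illustrative, and the two-month window is approximated as 60 days, which may differ from the exact calculation RiOS uses.

```python
from datetime import datetime, timedelta


def expiry_status(not_after, now, warning_days=60):
    """Classify a certificate by its expiry date.

    not_after:    expiry date of the certificate
    now:          the moment to evaluate against
    warning_days: assumed length of the warning window (60 days here)
    """
    if not_after <= now:
        return "expired"
    if not_after - now <= timedelta(days=warning_days):
        return "expiring"
    return "ok"
```

Certificates reported as expired or expiring are the ones to remove, reissue or re-import as described above.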

The crl_error alarm gets triggered when the LDAP server containing the Certificate Revocation List cannot be contacted.

Next steps: Check the connectivity and availability of the LDAP server containing the CRL.

The ssl_peer_scep_auto_reenroll alarm gets triggered when the SCEP functionality, which is used to enroll peering certificates, has encountered an error.

Next steps: Check the connectivity and the content of the LDAP server.

5.17.1.3. CPU load related alarms

Figure 5.107. CPU load related alarms

Alarm 'cpu_util_indiv' triggered
Alarm 'cpu:N:util' triggered

Alarm triggered for rising error for event cpu_util_indiv 

These alarms get triggered when the usage on one of the CPU cores exceeds a certain threshold. The CPU usage metrics are generally measured in four usage types: System, User, I/O Wait and Idle.

  • System: The percentage of time the CPU spends in the kernel.

  • User: The percentage of time the CPU spends in the user land (optimization service, the GUI, the CLI, the SNMP server etc).

  • I/O Wait: The percentage of time the CPU is waiting for an I/O device to complete its actions. On the Steelhead appliance this is most likely a read or write request to the hard disk.

  • Idle: The percentage of time the CPU is waiting for interrupts to process.

Normally the CPU usage pattern should be more or less equal on all CPUs. It can differ in the following situations:

  • A data recovery scenario with few TCP sessions, which are all handled on one single CPU. This will show as a high User usage on one of the CPUs and not on the others.

    This can be changed with the option Multi-Core Balancing at Configure -> Optimization -> Performance.

  • PBR, WCCP and Interceptor redirected traffic is handled by a single thread, therefore it will be done on a single CPU. This will show up as a lot of System usage on one of the CPUs. For PBR and WCCP this can be resolved by only redirecting the traffic to be optimized and not the pass-through traffic.

    On a 10Gbps bypass card, this issue has been resolved since RiOS 8.0 by distributing this load over multiple CPUs.

  • If there is a problem with one of the threads in the optimization service. This will show as a lot of user-land usage on one of the CPUs. Please open a case with Riverbed TAC to troubleshoot this.

A generic high CPU utilization can have the following causes:

  • If the CPU usage is mostly I/O Wait, then either one of the disks in the appliance has operational problems or there is a lot of encrypted traffic being optimized which causes a large amount of searching through the data store and a lot of writing of new segments. Inspecting the Traffic Summary and the SMART related part of the system dump would be the next steps.

  • If the CPU usage is on average at 80-90% and the CPU pattern follows the traffic pattern, then it could just be that the machine is underpowered for the traffic load.
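
The patterns described above can be expressed as a rough first-pass check. The following Python sketch is only a heuristic under stated assumptions: the per-core percentages are supplied by the caller, the tag names are invented for illustration, and the 80% threshold is an arbitrary choice rather than the threshold used by the alarm itself.

```python
def classify_cpu_pattern(cores, high=80.0):
    """Return a list of tags describing the CPU usage pattern.

    cores: one dict per CPU core with 'system', 'user', 'iowait' and 'idle'
           percentages. Tag names and the 'high' threshold are illustrative.
    """
    if not cores:
        return []
    tags = []
    if len(cores) > 1:
        # High User or System usage on exactly one core, not on the others.
        for kind in ("user", "system"):
            hot = [i for i, core in enumerate(cores) if core[kind] >= high]
            if len(hot) == 1:
                tags.append("single_core_" + kind)
    # Generic high utilization: look at the average busy time over all cores.
    busy = sum(100.0 - core["idle"] for core in cores)
    iowait = sum(core["iowait"] for core in cores)
    if busy / len(cores) >= high:
        tags.append("io_wait_bound" if iowait > busy / 2 else "generally_high")
    return tags
```

A single_core_user or single_core_system tag points at the uneven patterns listed earlier, io_wait_bound at the disk or data store, and generally_high at a possibly underpowered appliance.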

Next steps: Open a case with Riverbed TAC to determine the reason for the high CPU load.

5.17.1.4. License alarms

Figure 5.108. License alarms

Alarm 'appliance_unlicensed' triggered
Alarm 'license_expired' triggered
Alarm 'license_expiring' triggered
Alarm 'licensing' triggered
Alarm 'autolicense_error' triggered
Alarm 'autolicense_info' triggered

Alarm triggered for rising error for event license 
Alarm triggered for rising error for event license 

The appliance unlicensed alarm is raised when there is no MSPEC license on the xx55 and xx60 platforms.

The license_expired and license_expiring alarms are raised when one or more evaluation license keys on the Steelhead appliance have expired or are about to expire.

Figure 5.109. RSP License alarms

Alarm 'rsp_license_expired' triggered
Alarm 'rsp_license_expiring' triggered

Alarm triggered for rising error for event rsp_license_expired 
Alarm triggered for rising error for event rsp_license_expiring

These alarms get triggered when the evaluation RSP licenses are about to expire or already have expired.

When an RSP instance is initially set up in an evaluation environment, it gets a time-limited RSP license. When the evaluation is finished and an RSP license is purchased and configured, but the evaluation license has not been removed, the evaluation license will expire and this alarm is raised.

Next steps: Remove the temporary licenses, configure the proper licenses or stop the RSP service.

5.17.1.5. Disk alarms

Figure 5.110. Disk alarms

Alarm 'raid_disk_indiv' triggered
Alarm 'disk:X:status' triggered
Alarm 'disk' triggered
Alarm 'disk_error:X' triggered
Alarm 'ssd_wear' triggered
Alarm 'ssd_wear_warning:X' triggered

Alarm triggered for rising error for event disk_error
Alarm triggered for rising error for event disk_not_setup
Alarm triggered for rising error for event raid_error
Alarm triggered for rising error for event ssd_wear_warning

These alarms get triggered when a hard disk has a high SMART error rate or when a hard disk in a RAID array has failed. The ssd_wear alarm is related to the 5055, 7050 and 7055 models, where it gets triggered when the number of writes to a solid-state disk in the Fault Tolerant Segstore has exceeded the threshold.

Next steps: Contact Riverbed TAC for the replacement of the hard disk.

5.17.1.6. Domain joining alarm

Figure 5.111. Domain joining alarms

Alarm 'domain_join_error' triggered

Alarm triggered for rising error for event domain_join_error

This alarm gets triggered when the Steelhead appliance has been joined to the domain but the communication with one or more Domain Controllers has been interrupted.

Next steps: Check the settings for the domain and perform the domain join again.

5.17.1.7. Filesystem related alarm

Figure 5.112. File system alarms

Alarm 'fs_mnt' triggered
Alarm 'fs_mnt:X:full' triggered

Alarm triggered for rising error for event fs_mnt

This alarm gets triggered when the usage of one of the partitions is above a threshold or when the partition is completely filled up.

For the /var partition, there can be various reasons for this:

  • Too many system dumps or process dumps have been created. You can remove them via the GUI under Reports -> Diagnostics -> System Dumps or in the CLI via the command files debug ... delete and files snapshot ... delete.

  • A background tcpdump capture has captured too much data and has filled up the partition.

  • The logging level has been set too high, most likely INFO level, and the rotation-interval of the log files has been changed.

  • The rotation of the log files has failed. This can be determined by checking the dates on the earlier log files to spot a skip in time between them.

  • For older RiOS versions: The log files of the neural framing algorithm are filling up the partition or the rotation of the log files has failed due to a race condition between two logrotate processes.
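
Spotting a skip in time between rotated log files can be automated once their timestamps are known. The following Python sketch assumes the timestamps have been collected by the caller, for example from a file listing in a system dump; the find_rotation_gaps function, the expected interval and the tolerance are illustrative assumptions.

```python
from datetime import datetime, timedelta


def find_rotation_gaps(timestamps, expected_interval, tolerance=timedelta(hours=1)):
    """Return (earlier, later) pairs of consecutive log files which are
    further apart than the expected rotation interval plus a tolerance."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > expected_interval + tolerance:
            gaps.append((earlier, later))
    return gaps
```

With a daily rotation, a returned pair that is two or more days apart indicates the period in which the rotation did not run.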

If the usage was 100%, then once the disk space has been reclaimed the best next step is to reboot the appliance so that all processes and all log files are recreated properly.

For the /proxy partition, in use by the RSP system, there can be various solutions for this:

  • Remove old RSP installation images.

  • Remove old RSP packages.

  • Remove old RSP slots.

Note that the optimization service itself isn't affected by a full-disk situation since the data store is located on its own partition.

Next steps: If the space cannot be reclaimed via the removal of system dumps, process dumps or RSP related files, then contact Riverbed TAC for further investigation.

5.17.1.8. Paging alarms

Figure 5.113. Paging alarms

Alarm 'paging' triggered

Alarm triggered for rising error for event paging

This alarm gets triggered when there is excessive swapping happening.

Next steps: Do not reset the device and contact Riverbed TAC to analyse this issue.

5.17.1.9. RSP / VSP Alarms

Figure 5.114. RSP / VSP General alarms

Alarm 'rsp' triggered
Alarm 'rsp_general_alarm' triggered
Alarm 'rsp_service' triggered
Alarm 'virt_cpu_util_indiv' triggered
Alarm 'virt_cpu:*:util' triggered
Alarm 'vsp' triggered
Alarm 'vsp_general_alarm' triggered
Alarm 'vsp_service' triggered
Alarm 'vsp_service_not_running' triggered
Alarm 'vsp_unsupported_vm_count' triggered

Alarm triggered for rising error for event rsp_general_alarm 

These alarms get triggered when the VSP or RSP service has experienced a problem.

For RSP on the xx20 and xx50 series models, most likely this will be an incompatibility between the RiOS version installed and the RSP version installed.

Next steps: Install the correct RSP version for this RiOS release.

For VSP on the EX series models, the issue could be related to communication between RiOS and the ESXi infrastructure.

Figure 5.115. ESXi specific alarms

Alarm 'esxi_communication_failed' triggered
Alarm 'esxi_disk_creation_failed' triggered
Alarm 'esxi_initial_config_failed' triggered
Alarm 'esxi_license' triggered
Alarm 'esxi_license_expired' triggered
Alarm 'esxi_license_expiring' triggered
Alarm 'esxi_license_is_trial' triggered
Alarm 'esxi_memory_overcommitted' triggered
Alarm 'esxi_not_set_up' triggered
Alarm 'esxi_version_unsupported' triggered
Alarm 'esxi_vswitch_mtu_unsupported' triggered

These alarms are related to the ESXi part of the EX appliances.

The esxi_not_set_up alarm happens when an EX appliance doesn't have the VSP service enabled yet.

Next step: Enable the VSP service in the GUI under Configure -> Virtualization -> Virtual Services Platform.

The esxi_disk_creation_failed alarm and the esxi_initial_config_failed alarm happen when the initial setup has failed.

Next steps: Contact the Riverbed TAC.

The esxi_communication_failed alarm happens when the communication towards the ESXi platform doesn't work anymore. The most likely reason is that the password has been changed in the ESXi infrastructure.

Next steps: Update the password in the VSP configuration.

The esxi_memory_overcommitted alarm happens when the memory required by the ESXi system is more than what is available.

Next steps: Reduce the memory requirements of the VMs in the ESXi system.

The esxi_version_unsupported alarm happens when the version of ESXi has changed due to patches.

Next steps: Back out to the original ESXi version.

The esxi_vswitch_mtu_unsupported alarm is raised when the vSwitch on the ESXi platform has an MTU size of more than 1500.

Next steps: Undo the MTU size changes on the vSwitch.

Figure 5.116. RSP License alarms

Alarm 'rsp_license_expired' triggered
Alarm 'rsp_license_expiring' triggered

Alarm triggered for rising error for event rsp_license_expired 
Alarm triggered for rising error for event rsp_license_expiring

These alarms get triggered when the evaluation RSP licenses are about to expire or already have expired.

When an RSP instance is initially set up in an evaluation environment, it gets a time-limited RSP license. When the evaluation is finished and an RSP license is purchased but not installed, the evaluation license will expire and the RSP service will not start again at the next restart.

Next steps: Remove the temporary licenses, configure the proper licenses or stop the RSP service.

5.17.1.10. Secure Vault alarm

Figure 5.117. Secure Vault alarms

Alarm 'secure_vault' triggered
Alarm 'secure_vault_rekey_needed' triggered
Alarm 'secure_vault_uninitialized' triggered
Alarm 'secure_vault_unlocked' triggered

Alarm triggered for falling error for event secure_vault_unlocked

The secure_vault alarm is an aggregate alarm which gets triggered when one of the other secure vault alarms gets triggered.

The secure vault unlocked alarm gets triggered when the secure vault cannot be opened with the default password.

This happens when the password for the secure vault has been changed: After a restart of the Steelhead appliance the secure vault cannot be automatically opened anymore and this alarm gets triggered.

Next steps: Unlock the secure vault manually via the GUI of the Steelhead appliance or configure the correct secure vault password on the CMC.

5.17.2. Hardware related alarms

5.17.2.1. Fan alarms

Figure 5.118. Fan alarms

Alarm 'fan_error' triggered

Alarm triggered for rising error for event fan_error

This alarm gets triggered when one of the fans in the chassis has failed.

Next steps: Contact Riverbed TAC for the replacement of the fan.

5.17.2.2. Flash Error alarms

Figure 5.119. Flash Error alarms

Alarm 'flash_error' triggered
Alarm 'flash_protection_failed' triggered

Alarm triggered for rising error for event flash_error

The flash_error alarm gets triggered when the USB Flash Memory has become read-only or when it has become unavailable.

This is an issue with the xx50 series models, where the USB bus towards the Flash Memory becomes locked due to timing issues.

Check KB article S15568 to confirm that the device is running the right minimum RiOS version to best overcome this issue.

Next steps: Reboot the appliance as soon as possible to unlock the USB Flash Memory.

The flash_protection_failed alarm is raised when the backup of the USB Flash Memory could not be completed.

Next steps: Confirm that there is enough free space on the /var partition for the backup.

5.17.2.3. Generic hardware alarm

Figure 5.120. Generic Hardware alarms

Alarm 'hardware' triggered
Alarm 'other_hardware_error' triggered

Alarm triggered for rising error for event hardware_error

These alarms get triggered when there is a mismatch with the configuration of the Steelhead appliance:

  • A faulty disk, insufficient memory or missing CPUs.

  • An unqualified hard disk, memory stick or network card has been detected.

Next steps: Contact Riverbed TAC for investigation.

5.17.2.4. IPMI alarms

Figure 5.121. IPMI alarms

Alarm 'ipmi' triggered

Alarm triggered for rising error for event ipmi

This alarm gets triggered when the IPMI subsystem reports an issue:

  • The chassis of the Steelhead appliance has been opened,

  • The ECC memory has reported an error,

  • A hard disk is failing or has failed,

  • A power supply is failing or has failed.

Next steps: If the alarm relates to a chassis intrusion, it can be reset from the GUI. If the alarm relates to failing hardware, contact Riverbed TAC for a replacement.

5.17.2.5. Memory error alarms

Figure 5.122. Memory error alarms

Alarm 'memory_error' triggered

Alarm triggered for rising error for event memory_error

This alarm gets triggered when an ECC error on one of the memory sticks gets reported.

Next steps: Contact Riverbed TAC for identification and replacement.

5.17.2.6. Power supply alarms

Figure 5.123. Power supply alarms

Alarm 'power_supply' triggered

Alarm triggered for rising error for event power_supply

This alarm gets triggered when one of the power supplies in the chassis has failed.

Next steps: Confirm that the power source to the power supply is working properly. If it is and the alarm persists, contact Riverbed TAC for a replacement.

5.17.2.7. SSL Hardware alarm

Figure 5.124. SSL Hardware alarm

Alarm triggered for rising error for event ssl_hardware

This alarm gets triggered when the SSL Offload hardware has encountered an error.

Next steps: Contact Riverbed TAC for investigation.

5.17.2.8. Temperature alarms

Figure 5.125. Temperature alarms

Alarm 'temperature' triggered
Alarm 'critical_temp' triggered
Alarm 'warning_temp' triggered

Alarm triggered for rising error for event warning_temp
Alarm triggered for rising error for event critical_temp 

These alarms get triggered when the temperature of the Steelhead appliance is too high. Possible causes are fan-failures, high operating temperatures, high humidity, blocked air vents or dusty environment.

Next steps: Confirm that the environment temperature is not too high and that the airflow is sufficient, and clear the chassis of excessive dust. If the issue keeps happening, contact Riverbed TAC for investigation.

5.17.3. Network related alarms

5.17.3.1. Asymmetric Routing alarms

Figure 5.126. Asymmetric routing alarms

Alarm 'arcount' triggered

Alarm triggered for rising error for event arcount 

This alarm gets triggered if one or more instances of asymmetric routing have been detected. See the chapter IP Routing Related Issues on how to deal with this.

Next steps: If the cause of the alarm is real network asymmetry, then make sure that all paths in the network are covered with Steelhead appliances. If the cause of the alarm is SYN retransmission, then configure Fixed-Target in-path rules to overcome the auto-discovery issue, or configure pass-through in-path rules to prevent optimization towards that subnet.

5.17.3.2. Bypass alarms

Figure 5.127. Bypass alarms

Alarm 'bypass' triggered

Alarm triggered for rising error for event bypass 

This alarm gets triggered when one or more in-path interfaces have gone into bypass, directly linking the WAN and LAN interfaces together. This happens when the optimization service is stopped or when the network card watchdog has been activated.

Next steps: Restart the optimization service if the optimization service is stopped. Contact Riverbed TAC to investigate if the issue is related to the watchdog.

5.17.3.3. Duplex alarms

Figure 5.128. Duplex error

Alarm 'duplex' triggered
Alarm 'link_duplex' triggered
Alarm 'link_state:*:half_duplex' triggered
Alarm 'link_io_errors' triggered
Alarm 'link_state:*:io_errors' triggered

Alarm triggered for rising error for event duplex 

These alarms get triggered when there is a large number of frame errors or carrier errors on one of the interfaces, often indicating a speed/duplex configuration mismatch.

Next steps: Check the speed and duplex settings on the interface reported. Consider a new Ethernet cable or a different port on the switch if the speed and duplex settings are the same.

5.17.3.4. Link state alarms

Figure 5.129. Link state alarms

Alarm 'linkstate' triggered
Alarm 'link_state:X:link_error' triggered

Alarm triggered for rising error for event linkstate 

These alarms get triggered when one of the interfaces of the Steelhead appliance has lost its Ethernet link.

Next steps: Investigate the loss of Ethernet link.

5.17.4. Virtual Steelhead related alarms

Figure 5.130. Virtual Steelhead related alarms

Alarm 'disk_not_setup' triggered
Alarm 'lan_wan_loop' triggered

The disk_not_setup alarm gets triggered when the Virtual Steelhead appliance detects that the disk reserved for the data store is not yet provisioned or not of the right size.

Next steps: In the ESXi environment, check the properties of the virtual disks for the Virtual Steelhead.

The lan_wan_loop alarm gets triggered when the Virtual Steelhead appliance detects that the LAN and the WAN interface are configured to be on the same virtual switch.

Next steps: In the ESXi environment, put the LAN and the WAN interfaces in different virtual switches.

5.17.5. Optimization service alarms

5.17.5.1. Admission control alarms

Figure 5.131. Various admission control alarms

Alarm 'admission_control' rising
Alarm 'admission_conn' rising
Alarm 'admission_cpu' rising
Alarm 'admission_mapi' rising
Alarm 'admission_mem' rising
Alarm 'admission_tcp' rising

Alarm triggered for rising error for event admission_conn 
Alarm triggered for rising error for event admission_cpu 
Alarm triggered for rising error for event admission_mapi 
Alarm triggered for rising error for event admission_mem 
Alarm triggered for rising error for event admission_tcp 

The following admission control alarms exist:

  • The generic admission control alarm, it is raised when one of the other alarms is raised.

  • Connection based admission control, when the number of TCP sessions optimized has increased to above the number of licensed TCP sessions.

    Next steps: Check the list of TCP connections in the Current Connections overview and reduce the number of optimized TCP sessions by applying in-path rules to pass through traffic which does not get a high optimization factor. If the issue is structural, consider an upgrade to a higher model.

  • CPU load admission control, when the CPU load is too high and no new TCP sessions will be optimized.

    Next steps: Open a case with Riverbed TAC when this happens.

  • MAPI based admission control, when the number of optimized TCP sessions exceeds a threshold, by default 85% of the maximum number of licensed TCP sessions. This is to overcome the problems caused by the MAPI protocol's use of multiple TCP sessions, which all need to be optimized by the same client-side and server-side Steelhead appliances.

    Next steps: The same as with Connection based admission control.

  • Memory based admission control, where the internal memory pool of the optimization service has exceeded a certain threshold.

    Next steps: Open a case with Riverbed TAC to investigate.

  • TCP based admission control, where the TCP buffers in the kernel running on the Steelhead appliance are running out of memory.

    Next steps: In the sysinfo.txt in the system dump, check the netstat output and find the hosts with the biggest send-queue values. Those are the hosts with a slow network stack. Also, open a case with Riverbed TAC to investigate.
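
Finding the hosts with the biggest send-queue values can be scripted. The following Python sketch assumes the netstat section of sysinfo.txt uses the common Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State) and has been copied into a string; the top_send_queues function is illustrative, and summing the send queues per remote host is a design choice made here, not something RiOS does.

```python
from collections import defaultdict


def top_send_queues(netstat_text, limit=5):
    """Return (remote_host, total_send_queue) pairs, biggest total first."""
    totals = defaultdict(int)
    for line in netstat_text.splitlines():
        fields = line.split()
        # Only consider TCP lines with at least the five leading columns.
        if len(fields) < 5 or not fields[0].startswith("tcp"):
            continue
        try:
            send_q = int(fields[2])  # third column: Send-Q
        except ValueError:
            continue
        host = fields[4].rsplit(":", 1)[0]  # strip the port from the foreign address
        totals[host] += send_q
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

The hosts at the top of the returned list are the first candidates to investigate for a slow network stack.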

5.17.5.2. Connection Forwarding related alarms

Figure 5.132. Connection Forwarding alarms

Alarm 'cf_ipv6_incompatible_cluster' triggered
Alarm 'connection_forwarding' triggered
Alarm 'disconnected_sh_alert' triggered
Alarm 'single_cf' triggered
Alarm 'cf_ack_timeout_aggr' triggered
Alarm 'cf_ack_timeout:*' triggered
Alarm 'cf_conn_failure_aggr' triggered
Alarm 'cf_conn_failure:*' triggered
Alarm 'cf_conn_lost_eos_aggr' triggered
Alarm 'cf_conn_lost_eos:*' triggered
Alarm 'cf_conn_lost_err_aggr' triggered
Alarm 'cf_conn_lost_err:*' triggered
Alarm 'cf_keepalive_timeout_aggr' triggered
Alarm 'cf_keepalive_timeout:*' triggered
Alarm 'cf_latency_exceeded_aggr' triggered
Alarm 'cf_latency_exceeded:*' triggered
Alarm 'cf_read_info_timeout_aggr' triggered
Alarm 'cf_read_info_timeout:*' triggered

Alarm triggered for rising error for event cf_ack_timeout 
Alarm triggered for rising error for event cf_conn_failure 
Alarm triggered for rising error for event cf_conn_lost_eos 
Alarm triggered for rising error for event cf_conn_lost_err 
Alarm triggered for rising error for event cf_keepalive_timeout
Alarm triggered for rising error for event cf_latency_exceeded 
Alarm triggered for rising error for event cf_read_info_timeout

The most common alarms related to the Connection Forwarding protocol are the cf_conn_failure alarm and the cf_latency_exceeded alarm.

The cf_conn_failure alarm happens when the Steelhead appliance is unable to set up the TCP session for the Connection Forwarding protocol. This is most likely caused by a network issue between the two in-path interfaces or by the optimization service on the other Steelhead appliance not being operational.

The cf_latency_exceeded alarm happens when the responses in the Connection Forwarding sessions take too long to come back to the Steelhead appliance. This could be related to the network between the two in-path interfaces but also to the load on the neighbour Steelhead appliance.

Next steps: Check the network between the in-path interfaces.

The disconnected_sh_alert alarm gets triggered when a Connection Forwarding neighbour gets disconnected.

The single_cf alarm gets triggered when there are no neighbouring nodes in the Connection Forwarding cluster.

Next steps: Determine why the neighbours got disconnected.

The cf_ipv6_incompatible_cluster alarm gets triggered when one of the nodes in the cluster does not support optimization over IPv6 yet.

Next steps: Upgrade all nodes in the cluster to the right RiOS version.

5.17.5.3. Data store related alarms

Figure 5.133. Data store related alarms

Alarm 'datastore' triggered
Alarm 'datastore_error' triggered
Alarm 'datastore_sync_error' triggered
Alarm 'store_corruption' triggered

Alarm triggered for rising error for event datastore_error
Alarm triggered for rising error for event datastore_sync_error
Alarm triggered for rising error for event store_corruption

When the optimization service detects a corruption in the data store, it tries to recover from the issue and raises the store_corruption alarm. This alarm can also happen when encryption of the data store has been enabled or disabled but the restart of the optimization service didn't include the clearing of the data store.

Next steps: Restart the optimization service and clear the data store.

The datastore_sync_error alarm gets triggered when the Steelhead appliance is part of a data store synchronization cluster but its peer has become unreachable.

Next steps: Check the network connectivity between the two Steelhead appliances.

The datastore_error alarm gets triggered when the metadata in the data store cannot be initialized with the current settings. This alarm can happen when encryption of the data store has been changed or when the Extended Peering Table has been enabled or disabled but the restart of the optimization service didn't include the clearing of the data store.

Next steps: Restart the optimization service with the option to clear the data store.

5.17.5.4. Halt alarm

Figure 5.134. Halt alarms

Alarm 'halt_error' triggered

Alarm triggered for rising error for event halt_error

Next steps: Open a case with Riverbed TAC to investigate.

5.17.5.5. Mismatched Peer alarm

Figure 5.135. Mismatched Peer alarms

Alarm triggered for rising error for event mismatch_peer

Next steps: Open a case with Riverbed TAC to investigate.

5.17.5.6. NFS related alarm

Figure 5.136. NFS related alarms

Alarm 'nfs_v2_v4' triggered

Alarm triggered for rising error for event nfs_v2_v4

This alarm gets triggered when the NFS latency optimization has detected an NFS server which only uses the NFS version 2 or NFS version 4 protocols. The NFS latency optimization can only optimize NFS version 3 traffic.

Next steps: See if the NFS servers reported can be changed to support NFS version 3. If not, consider a pass-through rule for them.

5.17.5.7. Optimization Service alarm

Figure 5.137. Optimization Service alarms

Alarm 'service_error' triggered
Alarm 'optimization_service' triggered
Alarm 'optimization_general' triggered

Alarm triggered for rising error for event service_error

These alarms get triggered when the optimization service is not running or when an issue has been encountered which indicates a critical failure in the optimization protocol.

Next steps: Open a case with Riverbed TAC to investigate.

5.17.5.8. PFS related alarms

Figure 5.138. PFS related alarms

Alarm 'pfs' triggered
Alarm 'pfs_config' triggered
Alarm 'pfs_operation' triggered

Alarm triggered for rising error for event pfs_config
Alarm triggered for rising error for event pfs_operation

These alarms get triggered when there is an issue with the Proxy File Service.

Next steps: Open a case with Riverbed TAC to investigate.

5.17.5.9. Process dump related alarm

Figure 5.139. Process dump related alarms

Alarm 'sticky_staging_dir' triggered

Alarm triggered for rising error for event sticky_staging_dir

This alarm gets triggered when there is an issue creating new process dumps. Most of the time it correlates with an fs_mnt alarm.

Next steps: Open a case with Riverbed TAC to investigate.

5.17.5.10. Serial Cascade configuration alarm

Figure 5.140. Serial Cascade configuration alarms

Alarm 'serial_cascade_misconfig' triggered

Alarm triggered for rising error for event serial_cascade_misconfig

Next steps: Open a case with Riverbed TAC to investigate.

5.17.5.11. SMB alarm

Figure 5.141. SMB alarms

Alarm 'smb_alert' triggered

Alarm triggered for rising error for event smb_alert

This alarm gets triggered when there is an issue with the SMB Signing feature.

Next steps: Check the status of the computer object in the Active Directory service and the log files on the Domain Controllers. Check the status of the delegation user in the Active Directory service and the log files on the Domain Controllers. Check the connectivity between the Steelhead appliance primary interface and the Domain Controllers.

5.17.5.12. QoS related alarms

Figure 5.142. QoS alarms

Alarm 'inbound_qos_wan_bw_err' triggered
Alarm 'outbound_qos_wan_bw_err' triggered

These alarms get raised when the sum of the configured QoS bandwidth values exceeds the configured WAN bandwidth.

Next steps: Check the QoS settings page and confirm that the QoS bandwidth values are correct.
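The condition behind these alarms is a simple sum which can be verified by hand or with a few lines of code. The sketch below assumes that all values are expressed in the same unit (kbps here); the function name and the example numbers are hypothetical.

```python
def qos_exceeds_wan(qos_bandwidths_kbps, wan_bandwidth_kbps):
    """Return True when the QoS values add up to more than the WAN bandwidth."""
    return sum(qos_bandwidths_kbps) > wan_bandwidth_kbps


if __name__ == "__main__":
    # 512 + 1024 + 768 = 2304 kbps, which is more than a 2048 kbps WAN link.
    print(qos_exceeds_wan([512, 1024, 768], 2048))
    # 512 + 1024 + 256 = 1792 kbps, which fits within the same link.
    print(qos_exceeds_wan([512, 1024, 256], 2048))
```

The first call prints True and the second prints False. The same check applies separately to the inbound and the outbound direction.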

5.17.5.13. Software version alarm

Figure 5.143. Software version alarms

Alarm 'sw_version_aggr' triggered
Alarm 'mismatch_peer_aggr' triggered
Alarm 'mismatch_peer:*' triggered
Alarm 'sw_version_mismatch_aggr' triggered
Alarm 'sw_version:*' triggered

Alarm triggered for rising error for event sw-version

These alarms get triggered when an incompatibility with the RiOS software version of a remote Steelhead appliance has been detected.

Next steps: Upgrade the remote Steelhead appliance to a compatible RiOS version or configure Peering Rules to prevent optimization between the two Steelhead appliances.

5.17.5.14. SSL related alarm

Figure 5.144. SSL related alarms

Alarm 'non_443_ssl_servers_detected_on_upgrade' triggered

Alarm triggered for rising error for event non_443_ssl_servers_detected_on_upgrade

This alarm gets triggered when, after the upgrade to RiOS version 6.0 or later, an SSL server on a port other than port 443 has been detected.

In RiOS version 6.0 and later, traffic on TCP port 443 is automatically assumed to be SSL encapsulated. When SSL optimization has been enabled, no in-path rule is required to get that traffic optimized. SSL traffic on any other TCP port still requires an in-path rule.

Next steps: Configure the correct In-path Rules and Peering Rules to optimize this SSL traffic.
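To find out which SSL servers still need an explicit in-path rule, the port rule described above can be applied to a list of known SSL server ports. This is a minimal sketch which assumes that SSL optimization is enabled; the function name and the example ports are hypothetical.

```python
SSL_DEFAULT_PORT = 443  # Assumed to be SSL by RiOS 6.0 and later.


def ports_needing_in_path_rule(ssl_server_ports):
    """Return the sorted, unique SSL ports which need an explicit in-path rule."""
    return sorted({port for port in ssl_server_ports if port != SSL_DEFAULT_PORT})


if __name__ == "__main__":
    print(ports_needing_in_path_rule([443, 8443, 443, 9443]))
```

For the example list, only ports 8443 and 9443 are returned.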

5.17.5.15. System Detail Report alarm

Figure 5.145. System Detail Report alarms

Alarm 'system_detail_report' triggered

Alarm triggered for rising error for event system_detail_report

This alarm gets triggered when an issue with the operation or configuration of the optimization service has been detected.

Next steps: Check the System Detail Report and attend to the issues reported.

5.17.6. Granite related alarms

Figure 5.146. Granite related alarms

Alarm 'block_store' triggered
Alarm 'block_store:*' triggered
Alarm 'profile_switch_failed' triggered
Alarm 'granite-core' triggered
Alarm 'granite-core:*' triggered
Alarm 'high_availability' triggered
Alarm 'high_availability:*' triggered
Alarm 'iscsi' triggered
Alarm 'iscsi:*' triggered
Alarm 'lun' triggered
Alarm 'lun:*' triggered
Alarm 'snapshot' triggered
Alarm 'snapshot:*' triggered
Alarm 'uncommitted_data' triggered

The block_store alarm gets triggered when there are issues with the Granite block store.

Next steps: Check the system logs and contact Riverbed TAC for investigation.

The profile_switch_failed alarm gets triggered when repartitioning the drives fails during the switching of the storage profile.

Next steps: Contact Riverbed TAC for investigation.

The granite-core alarm gets triggered when there are communication problems towards the Granite Core appliance.

Next steps: Confirm that there are no network related issues between the Granite Edge and the Granite Core.

The high_availability alarm gets triggered when there are communication problems towards a node in the high availability cluster.

Next steps: Check connectivity with the HA peer.

The iscsi alarm gets triggered when the iSCSI initiator isn't accessible.

Next steps: Confirm the iSCSI configuration.

The lun alarm gets triggered when a LUN isn't available to the Granite Core.

Next steps: Confirm the status of the LUN on the Data Center NAS.

The snapshot alarm gets triggered when a snapshot couldn't be completed or committed.

Next steps: Confirm the status of the Data Center NAS.

The uncommitted_data alarm gets triggered when the Granite Edge has a large amount of uncommitted data in its block store.

Next steps: Confirm that the Granite Core and the NAS are operating normally.
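The Granite alarms in the figure above come in pairs: an aggregate name such as lun and a per-instance name such as lun:*. A lookup of the next steps can therefore ignore everything after the colon. The following is a minimal sketch, not part of RiOS: the table pairs each alarm name with its next step in the order of the paragraphs above, and the function name and the instance identifier in the example are hypothetical.

```python
GRANITE_NEXT_STEPS = {
    "block_store": "Check the system logs and contact Riverbed TAC.",
    "profile_switch_failed": "Contact Riverbed TAC.",
    "granite-core": "Check the network between Granite Edge and Granite Core.",
    "high_availability": "Check connectivity with the HA peer.",
    "iscsi": "Confirm the iSCSI configuration.",
    "lun": "Confirm the status of the LUN on the Data Center NAS.",
    "snapshot": "Confirm the status of the Data Center NAS.",
    "uncommitted_data": "Confirm that the Granite Core and the NAS are working.",
}


def granite_next_step(alarm):
    """Return the next step for a Granite alarm name, or None if it is unknown."""
    # "lun:42" and "lun" share the same advice, so drop the instance part.
    base, _, _instance = alarm.partition(":")
    return GRANITE_NEXT_STEPS.get(base)


if __name__ == "__main__":
    print(granite_next_step("lun:42"))  # "42" is a made-up instance identifier.
```

Keeping the advice in a single table means that the aggregate alarm and its per-instance alarms always lead to the same next step.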