The Steelhead appliance has various alarms which indicate a problem with the Steelhead appliance in general or with the optimization service specifically.
To make sure you are informed about possible issues on the Steelhead appliances, alarms can be sent out via several methods:
SNMP Traps - Configure the SNMP trap servers under Configure -> System Settings -> SNMP Basic. You can use the CLI configuration command snmp-server trap-test to confirm that the SNMP trap gets delivered.
Email alerts - Configure the email alerts under Configure -> System Settings -> Email. You can use the CLI command email send-test to confirm that the email delivery works.
There are six styles of lines for reporting alarm related issues:
Alarm triggered for rising error: An error condition has been set because of the monitored value being too high. This is before RiOS 7.0.
Alarm triggered for rising clear: An error condition caused by the monitored value being too high has been cleared. This is before RiOS 7.0.
Alarm triggered for falling error: An error condition has been set because of the monitored value being too low. This is before RiOS 7.0.
Alarm triggered for falling clear: An error condition caused by the monitored value being too low has been cleared. This is before RiOS 7.0.
Alarm 'X' clearing: An error condition has been cleared. This is on RiOS 7.0 and higher.
Alarm 'X' triggering: An error condition has been set. This is on RiOS 7.0 and higher.
The full output of the logging for alarms is:
Figure 5.102. Full output of a log entry for an alarm before RiOS 7.0
SH statsd[312]: [statsd.NOTICE]: Alarm triggered for rising error for event xxx
and
Figure 5.103. Full output of a log entry for an alarm on RiOS 7.0 and later
SH alarmd[29863]: [alarmd.NOTICE]: Alarm 'xxx' triggering
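The two log formats above can be recognised with a pair of regular expressions. The following is a minimal sketch, assuming only the message texts shown in this section; the function name parse_alarm_line is hypothetical and not part of RiOS.

```python
import re

# Pre-7.0 format, as in "Alarm triggered for rising error for event xxx".
OLD_STYLE = re.compile(
    r"Alarm triggered for (?P<direction>rising|falling) "
    r"(?P<state>error|clear) for event (?P<name>\S+)"
)
# RiOS 7.0 and later, as in "Alarm 'xxx' triggering".
NEW_STYLE = re.compile(r"Alarm '(?P<name>[^']+)' (?P<action>triggering|clearing)")


def parse_alarm_line(line):
    """Return (alarm_name, is_raised), or None if the line is not an alarm."""
    match = OLD_STYLE.search(line)
    if match:
        return match.group("name"), match.group("state") == "error"
    match = NEW_STYLE.search(line)
    if match:
        return match.group("name"), match.group("action") == "triggering"
    return None
```

Because the expressions are searched for anywhere in the line, the same function works on full syslog lines and on the shorter notation used later in this section.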
With the introduction of RiOS 7.0 there are aggregate alarms, which rise if one of the source alarms rises. The aggregation tree is currently:
Figure 5.104. Aggregate alarm tree for RiOS 8.0
health
|\_ admission_control
|   |\_ admission_conn
|   |\_ admission_cpu
|   |\_ admission_mapi
|   |\_ admission_mem
|   \_ admission_tcp
|\_ arcount
|\_ block_store
|   \_ block_store:*
|\_ bypass
|\_ connection_forwarding
|   |\_ cf_ipv6_incompatible_cluster
|   |\_ disconnected_sh_alert
|   \_ single_cf
|       |\_ cf_ack_timeout_aggr
|       |   \_ cf_ack_timeout:*
|       |\_ cf_conn_failure_aggr
|       |   \_ cf_conn_failure:*
|       |\_ cf_conn_lost_eos_aggr
|       |   \_ cf_conn_lost_eos:*
|       |\_ cf_conn_lost_err_aggr
|       |   \_ cf_conn_lost_err:*
|       |\_ cf_keepalive_timeout_aggr
|       |   \_ cf_keepalive_timeout:*
|       |\_ cf_latency_exceeded_aggr
|       |   \_ cf_latency_exceeded:*
|       \_ cf_read_info_timeout_aggr
|           \_ cf_read_info_timeout:*
|\_ cpu_util_indiv
|   \_ cpu:*:util
|\_ datastore
|   |\_ datastore_error
|   |\_ datastore_sync_error
|   |\_ disk_not_setup
|   \_ store_corruption
|\_ domain_join_error
|\_ duplex
|\_ flash_protection_failed
|\_ fs_mnt
|   \_ fs_mnt:*:full
|\_ granite-core
|   \_ granite-core:*
|\_ hardware
|   |\_ disk
|   |   \_ disk_error:*
|   |\_ fan_error
|   |\_ flash_error
|   |\_ ipmi
|   |\_ memory_error
|   |\_ other_hardware_error
|   |\_ power_supply
|   |\_ raid_disk_indiv
|   |   \_ disk:*:status
|   \_ ssd_wear
|       \_ ssd_wear_warning:*
|\_ high_availability
|   \_ high_availability:*
|\_ inbound_qos_wan_bw_err
|\_ iscsi
|   \_ iscsi:*
|\_ lan_wan_loop
|\_ licensing
|   |\_ appliance_unlicensed
|   |\_ autolicense_error
|   |\_ autolicense_info
|   |\_ license_expired
|   \_ license_expiring
|\_ link_duplex
|   \_ link_state:*:half_duplex
|\_ link_io_errors
|   \_ link_state:*:io_errors
|\_ linkstate
|   \_ link_state:*:link_error
|\_ lun
|   \_ lun:*
|\_ nfs_v2_v4
|\_ optimization_service
|   |\_ halt_error
|   |\_ optimization_general
|   \_ service_error
|\_ outbound_qos_wan_bw_err
|\_ paging
|\_ path_selection_path_down
|\_ pfs
|   |\_ pfs_config
|   \_ pfs_operation
|\_ profile_switch_failed
|\_ rsp
|   |\_ rsp_general_alarm
|   |\_ rsp_license_expired
|   |\_ rsp_license_expiring
|   \_ rsp_service
|\_ secure_vault
|   |\_ secure_vault_rekey_needed
|   |\_ secure_vault_uninitialized
|   \_ secure_vault_unlocked
|\_ serial_cascade_misconfig
|\_ smb_alert
|\_ snapshot
|   \_ snapshot:*
|\_ ssl
|   |\_ certs_expiring
|   |\_ crl_error:*
|   |\_ non_443_sslservers_detected_on_upgrade
|   \_ ssl_peer_scep_auto_reenroll
|\_ sticky_staging_dir
|\_ sw_version_aggr
|   |\_ mismatch_peer_aggr
|   |   \_ mismatch_peer:*
|   \_ sw_version_mismatch_aggr
|       \_ sw_version:*
|\_ system_detail_report
|\_ temperature
|   |\_ critical_temp
|   \_ warning_temp
|\_ uncommitted_data
\_ vsp
    |\_ esxi_communication_failed
    |\_ esxi_disk_creation_failed
    |\_ esxi_initial_config_failed
    |\_ esxi_license
    |   |\_ esxi_license_expired
    |   |\_ esxi_license_expiring
    |   \_ esxi_license_is_trial
    |\_ esxi_memory_overcommitted
    |\_ esxi_not_set_up
    |\_ esxi_version_unsupported
    |\_ esxi_vswitch_mtu_unsupported
    |\_ virt_cpu_util_indiv
    |   \_ virt_cpu:*:util
    |\_ vsp_general_alarm
    |\_ vsp_service
    |\_ vsp_service_not_running
    \_ vsp_unsupported_vm_count
If an end-node alarm gets triggered, the nodes above it get triggered too. For example, if the warning_temp alarm gets triggered, the temperature alarm and then the health alarm get triggered as well.
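The propagation towards the top of the tree can be illustrated with a small child-to-parent map. This is only a sketch of the idea using a hand-copied excerpt of the tree; the PARENT table and the triggered_chain function are hypothetical names, not how RiOS implements it.

```python
# Child -> parent links for a small excerpt of the aggregation tree.
PARENT = {
    "warning_temp": "temperature",
    "critical_temp": "temperature",
    "temperature": "health",
    "admission_cpu": "admission_control",
    "admission_control": "health",
}


def triggered_chain(alarm, parent=PARENT):
    """Return the alarm followed by every aggregate alarm above it."""
    chain = [alarm]
    # Walk upwards until a node without a parent (the top level) is reached.
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain
```

Calling triggered_chain("warning_temp") walks from the end-node through temperature up to health, which mirrors the example in the text.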
The examples of log messages in this section use a shorter notation without the hostname, process name and severity.
The health alarm is the top-level node; it gets triggered when any of the other alarms gets triggered.
Figure 5.106. Certificates related alarms
Alarm 'certs_expiring' triggered
Alarm 'crl_error:SSL_CAs' triggered
Alarm 'crl_error:SSL_Peering_CAs' triggered
Alarm 'ssl' triggered
Alarm 'ssl_peer_scep_auto_enroll' triggered
Alarm triggered for rising error for event certs_expiring
Alarm triggered for rising error for event crl_error
Alarm triggered for rising error for event ssl_peer_scep_auto_enroll
The certs_expiring alarm gets triggered if any Root CA certificate, SSL peering certificate or SSL server certificate will expire within two months or has already expired.
Any expired Root CA certificates can be safely removed if they are not used by one of your SSL server certificates.
The expired SSL peering certificates need to be reissued on the Steelhead appliance itself.
The expired SSL server certificates need to be obtained again from the department which runs the SSL server and imported again on the Steelhead appliance.
Next steps: Check the expiry dates on the certificates in the Certificate Authority and the certificates in the SSL server list.
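When the expiry dates have been collected from the certificate lists, the check can be automated along these lines. This is a minimal sketch: it assumes that "two months" can be approximated as 60 days, and the certificate names in the usage example are made up.

```python
from datetime import datetime, timedelta

# Assumption: "two months" is approximated as 60 days.
WARNING_WINDOW = timedelta(days=60)


def expiring_certificates(expiry_dates, now):
    """Return the names of certificates that have expired or expire soon.

    expiry_dates maps a certificate name to its expiry datetime.
    """
    return sorted(
        name
        for name, not_after in expiry_dates.items()
        if not_after - now <= WARNING_WINDOW
    )
```

For example, with now set to 1 January 2013, a certificate expiring on 15 February 2013 and one that expired on 1 December 2012 are both reported, while one expiring in 2014 is not.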
The crl_error alarm gets triggered when the LDAP server containing the Certificate Revocation List cannot be contacted.
Next steps: Check the connectivity and availability of the LDAP server containing the CRL.
The ssl_peer_scep_auto_enroll alarm gets triggered when the SCEP functionality, which is used to enroll peering certificates, has encountered an error.
Next steps: Check the connectivity and the content of the LDAP server.
Figure 5.107. CPU load related alarms
Alarm 'cpu_util_indiv' triggered
Alarm 'cpu:N:util' triggered
Alarm triggered for rising error for event cpu_util_indiv
These alarms get triggered when the usage on one of the CPU cores exceeds a certain threshold. CPU usage is generally measured in four usage types: System, User, I/O Wait and Idle.
System: The percentage of time the CPU spends in the kernel.
User: The percentage of time the CPU spends in the user land (optimization service, the GUI, the CLI, the SNMP server etc).
I/O Wait: The percentage of time the CPU is waiting for an I/O device to complete its actions. On the Steelhead appliance this is most likely waiting for a read or write request to the hard disk to complete.
Idle: The percentage of time the CPU is waiting for interrupts to process.
Normally the CPU usage pattern should be more or less equal on all CPUs. It can differ in the following situations:
A data recovery scenario with few TCP sessions, which are all handled on one single CPU. This will show as high User usage on one of the CPUs and not on the others.
This can be changed with the option Multi-Core Balancing at Configure -> Optimization -> Performance.
PBR, WCCP and Interceptor redirected traffic is handled by a single thread, therefore it will be done on a single CPU. This will show up as a lot of System usage on one of the CPUs. For PBR and WCCP this can be resolved by only redirecting the traffic to be optimized and not the pass-through traffic.
On a 10Gbps bypass card, this issue has been resolved since RiOS 8.0 by distributing this load over multiple CPUs.
A problem with one of the threads in the optimization service. This will show as a lot of User usage on one of the CPUs. Please open a case with Riverbed TAC to troubleshoot this.
The following could cause generally high CPU utilization:
If the CPU usage is mostly I/O Wait, then either one of the disks in the appliance has operational problems or there is a lot of encrypted traffic being optimized which causes a large amount of searching through the data store and a lot of writing of new segments. Inspecting the Traffic Summary and the SMART related part of the system dump would be the next steps.
If the CPU usage is on average at 80-90% and the CPU pattern follows the traffic pattern, then it could just be that the machine is underpowered for the traffic load.
Next steps: Open a case with Riverbed TAC to determine the reason for the high CPU load.
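As a first triage step, the usage types and patterns described above can be summarised per core. The following is only a sketch: the 80% threshold is an assumption for illustration (the real alarm thresholds are not given here) and the function name busy_cores is hypothetical.

```python
def busy_cores(cores, threshold=80.0):
    """Return {core_index: dominant_usage_type} for cores busier than threshold.

    cores is a list of dicts with 'system', 'user', 'iowait' and 'idle'
    percentages. The default threshold is an assumption for illustration.
    """
    result = {}
    for index, usage in enumerate(cores):
        if 100.0 - usage["idle"] > threshold:
            # Ignore idle time when looking for the dominant usage type.
            active = {kind: value for kind, value in usage.items() if kind != "idle"}
            result[index] = max(active, key=active.get)
    return result
```

A single busy core dominated by user points towards few TCP sessions or a thread problem, a single busy core dominated by system points towards redirected traffic, and iowait across many cores points towards the disks or the data store.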
Figure 5.108. License alarms
Alarm 'appliance_unlicensed' triggered
Alarm 'license_expired' triggered
Alarm 'license_expiring' triggered
Alarm 'licensing' triggered
Alarm 'autolicense_error' triggered
Alarm 'autolicense_info' triggered
Alarm triggered for rising error for event license
Alarm triggered for rising error for event license
The appliance unlicensed alarm is raised when there is no MSPEC license on the xx55 and xx60 platforms.
The license_expired and license_expiring alarms are raised when one or more evaluation license keys on the Steelhead appliance have expired or are about to expire.
Figure 5.109. RSP License alarms
Alarm 'rsp_license_expired' triggered
Alarm 'rsp_license_expiring' triggered
Alarm triggered for rising error for event rsp_license_expired
Alarm triggered for rising error for event rsp_license_expiring
These alarms get triggered when the evaluation RSP licenses are about to expire or already have expired.
When an RSP instance is initially set up in an evaluation environment, it gets a time-limited RSP license. If an RSP license is purchased and configured once the evaluation is finished but the evaluation license is not removed, the evaluation license will still expire and this alarm is raised.
Next steps: Remove the temporary licenses, configure the proper licenses or stop the RSP service.
Figure 5.110. Disk alarms
Alarm 'raid_disk_indiv' triggered
Alarm 'disk:X:status' triggered
Alarm 'disk' triggered
Alarm 'disk_error:X' triggered
Alarm 'ssd_wear' triggered
Alarm 'ssd_wear_warning:X' triggered
Alarm triggered for rising error for event disk_error
Alarm triggered for rising error for event disk_not_setup
Alarm triggered for rising error for event raid_error
Alarm triggered for rising error for event ssd_wear_warning
These alarms get triggered when a hard disk has a high SMART error rate or when a hard disk in a RAID array has failed. The ssd_wear alarm applies to the 5055, 7050 and 7055 models and is raised when the number of writes to a solid-state disk in the Fault Tolerant Segstore has exceeded the threshold.
Next steps: Contact Riverbed TAC for the replacement of the hard disk.
Figure 5.111. Domain joining alarms
Alarm 'domain_join_error' triggered
Alarm triggered for rising error for event domain_join_error
This alarm gets triggered when the Steelhead appliance has been joined to the domain but the communication with one or more Domain Controllers has been interrupted.
Next steps: Check the settings for the domain and perform the domain join again.
Figure 5.112. File system alarms
Alarm 'fs_mnt' triggered
Alarm 'fs_mnt:X:full' triggered
Alarm triggered for rising error for event fs_mnt
This alarm gets triggered when the usage of one of the partitions is above a threshold or when the partition is completely full.
For the /var partition, there can be various reasons for this:
Too many system dumps or process dumps have been created. You can remove them via the GUI under Reports -> Diagnostics -> System Dumps or in the CLI via the commands files debug ... delete and files snapshot ... delete.
A background tcpdump capture has captured too much data and has filled up the partition.
The logging level has been set too high, most likely INFO level, and the rotation-interval of the log files has been changed.
The rotation of the log files has failed. This can be determined by checking the dates on the earlier log files to spot a skip in time between them.
For older RiOS versions: The log files of the neural framing algorithm are filling up the partition or the rotation of the log files has failed due to a race condition between two logrotate processes.
If the usage was 100%, then once the disk space has been reclaimed the best next step is to reboot the appliance so that all processes and all log files are recreated properly.
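The check for a failed log rotation mentioned above, spotting a skip in time between the dates of the earlier log files, can be sketched as follows. The daily rotation interval and the 1.5 interval tolerance are assumptions for illustration, and the function name rotation_gaps is hypothetical.

```python
from datetime import datetime, timedelta


def rotation_gaps(timestamps, expected=timedelta(days=1)):
    """Return (earlier, later) pairs where at least one rotation was skipped.

    Assumption: logs rotate once per 'expected' interval; a gap of more than
    1.5 intervals between two files is treated as a skipped rotation.
    """
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > expected * 1.5
    ]
```

The timestamps would be the modification dates of the rotated log files; any pair returned marks the period in which the rotation did not run.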
For the /proxy partition, in use by the RSP system, there can be various solutions for this:
Remove old RSP installation images.
Remove old RSP packages.
Remove old RSP slots.
Note that the optimization service itself isn't affected by a full-disk situation since the data store is located on its own partition.
Next steps: If the space cannot be reclaimed via the removal of system dumps, process dumps or RSP related files, then contact Riverbed TAC for further investigation.
Figure 5.113. Paging alarms
Alarm 'paging' triggered
Alarm triggered for rising error for event paging
This alarm gets triggered when excessive swapping occurs.
Next steps: Do not reset the device and contact Riverbed TAC to analyse this issue.
Figure 5.114. RSP / VSP General alarms
Alarm 'rsp' triggered
Alarm 'rsp_general_alarm' triggered
Alarm 'rsp_service' triggered
Alarm 'virt_cpu_util_indiv' triggered
Alarm 'virt_cpu:*:util' triggered
Alarm 'vsp' triggered
Alarm 'vsp_general_alarm' triggered
Alarm 'vsp_service' triggered
Alarm 'vsp_service_not_running' triggered
Alarm 'vsp_unsupported_vm_count' triggered
Alarm triggered for rising error for event rsp_general_alarm
These alarms get triggered when the VSP or RSP service has experienced a problem.
For RSP on the xx20 and xx50 series models, most likely this will be an incompatibility between the RiOS version installed and the RSP version installed.
Next steps: Install the correct RSP version for this RiOS release.
For VSP on the EX series models, the issue could be related to communication between RiOS and the ESXi infrastructure.
Figure 5.115. ESXi specific alarms
Alarm 'esxi_communication_failed' triggered
Alarm 'esxi_disk_creation_failed' triggered
Alarm 'esxi_initial_config_failed' triggered
Alarm 'esxi_license' triggered
Alarm 'esxi_license_expired' triggered
Alarm 'esxi_license_expiring' triggered
Alarm 'esxi_license_is_trial' triggered
Alarm 'esxi_memory_overcommitted' triggered
Alarm 'esxi_not_set_up' triggered
Alarm 'esxi_version_unsupported' triggered
Alarm 'esxi_vswitch_mtu_unsupported' triggered
These alarms are related to the ESXi part of the EX appliances.
The esxi_not_set_up alarm happens when an EX appliance doesn't have the VSP service enabled yet.
Next step: Enable the VSP service in the GUI under Configure -> Virtualization -> Virtual Services Platform.
The esxi_disk_creation_failed alarm and the esxi_initial_config_failed alarm happen when the initial setup has failed.
Next steps: Contact the Riverbed TAC.
The esxi_communication_failed alarm happens when the communication towards the ESXi platform doesn't work anymore. The most likely reason is that the password has been changed in the ESXi infrastructure.
Next steps: Update the password in the VSP configuration.
The esxi_memory_overcommitted alarm happens when the memory required by the ESXi system is more than what is available.
Next steps: Reduce the memory requirements of the VMs in the ESXi system.
The esxi_version_unsupported alarm happens when the version of ESXi has changed due to patches.
Next steps: Back out to the original ESXi version.
The esxi_vswitch_mtu_unsupported alarm is raised when the vSwitch on the ESXi platform has an MTU size of more than 1500.
Next steps: Undo the MTU size changes on the vSwitch.
Figure 5.116. RSP License alarms
Alarm 'rsp_license_expired' triggered
Alarm 'rsp_license_expiring' triggered
Alarm triggered for rising error for event rsp_license_expired
Alarm triggered for rising error for event rsp_license_expiring
These alarms get triggered when the evaluation RSP licenses are about to expire or already have expired.
When an RSP instance is initially set up in an evaluation environment, it gets a time-limited RSP license. If an RSP license is purchased but not installed when the evaluation is finished, the evaluation license will expire and the RSP service will not start at the next restart.
Next steps: Remove the temporary licenses, configure the proper licenses or stop the RSP service.
Figure 5.117. Secure Vault alarms
Alarm 'secure_vault' triggered
Alarm 'secure_vault_rekey_needed' triggered
Alarm 'secure_vault_uninitialized' triggered
Alarm 'secure_vault_unlocked' triggered
Alarm triggered for falling error for event secure_vault_unlocked
The secure_vault alarm is a general alarm which gets triggered when one of the other secure vault alarms gets triggered.
The secure vault unlocked alarm gets triggered when the secure vault cannot be opened with the default password.
This happens when the password for the secure vault has been changed: After a restart of the Steelhead appliance the secure vault cannot be automatically opened anymore and this alarm gets triggered.
Next steps: Unlock the secure vault manually via the GUI of the Steelhead appliance or configure the correct secure vault password on the CMC.
Figure 5.118. Fan alarms
Alarm 'fan_error' triggered
Alarm triggered for rising error for event fan_error
This alarm gets triggered when one of the fans in the chassis has failed.
Next steps: Contact Riverbed TAC for the replacement of the fan.
Figure 5.119. Flash Error alarms
Alarm 'flash_error' triggered
Alarm 'flash_protection_failed' triggered
Alarm triggered for rising error for event flash_error
The flash_error alarm gets triggered when the USB Flash Memory has become read-only or when it has become unavailable.
This is an issue with the xx50 series models, where the USB bus towards the Flash Memory becomes locked due to timing issues.
Check KB article S15568 to confirm that the device is running the right minimum RiOS version to best overcome this issue.
Next steps: Reboot the appliance as soon as possible to unlock the USB Flash Memory.
The flash_protection_failed alarm is raised when the backup of the USB Flash Memory could not be completed.
Next steps: Confirm that there is enough free space on the /var partition for the backup.
Figure 5.120. Generic Hardware alarms
Alarm 'hardware' triggered
Alarm 'other_hardware_error' triggered
Alarm triggered for rising error for event hardware_error
These alarms get triggered when there is a mismatch with the configuration of the Steelhead appliance:
A faulty disk, insufficient memory or missing CPUs.
An unqualified hard disk, memory stick or network card has been detected.
Next steps: Contact Riverbed TAC for investigation.
The ipmi alarm gets triggered when the IPMI subsystem reports an issue:
The chassis of the Steelhead appliance has been opened,
The ECC memory has reported an error,
A hard disk is failing or has failed,
A power supply is failing or has failed.
Next steps: If the alarm relates to chassis intrusion, it can be reset from the GUI. If the alarm relates to failing hardware, contact Riverbed TAC for a replacement.
Figure 5.122. Memory error alarms
Alarm 'memory_error' triggered
Alarm triggered for rising error for event memory_error
This alarm gets triggered when an ECC error on one of the memory sticks gets reported.
Next steps: Contact Riverbed TAC for identification and replacement.
Figure 5.123. Power supply alarms
Alarm 'power_supply' triggered
Alarm triggered for rising error for event power_supply
This alarm gets triggered when one of the power supplies in the chassis has failed.
Next steps: Confirm that the power source to the power supply is working properly. If the power source is fine, contact Riverbed TAC for a replacement.
This alarm gets triggered when the SSL Offload hardware has encountered an error.
Next steps: Contact Riverbed TAC for investigation.
Figure 5.125. Temperature alarms
Alarm 'temperature' triggered
Alarm 'critical_temp' triggered
Alarm 'warning_temp' triggered
Alarm triggered for rising error for event warning_temp
Alarm triggered for rising error for event critical_temp
These alarms get triggered when the temperature of the Steelhead appliance is too high. Possible causes are fan-failures, high operating temperatures, high humidity, blocked air vents or dusty environment.
Next steps: Confirm that the environment temperature is not too high and that the airflow is sufficient, and clear the chassis of excessive dust. If the issue keeps happening, contact Riverbed TAC for investigation.
Figure 5.126. Asymmetric routing alarms
Alarm 'arcount' triggered
Alarm triggered for rising error for event arcount
This alarm gets triggered if one or more instances of asymmetric routing have been detected. See the chapter IP Routing Related Issues on how to deal with this.
Next steps: If the cause of the alarm is real network asymmetry, then make sure that all paths in the network are covered with Steelhead appliances. If the cause of the alarm is SYN retransmission, then configure Fixed-Target in-path rules to overcome the auto-discovery issue, or configure pass-through in-path rules to prevent optimization towards that subnet.
Figure 5.127. Bypass alarms
Alarm 'bypass' triggered
Alarm triggered for rising error for event bypass
This alarm gets triggered when one or more in-path interfaces have gone into bypass, linking the WAN router and LAN interface together. This happens when the optimization service is stopped or when the network card watchdog has been activated.
Next steps: Restart the optimization service if the optimization service is stopped. Contact Riverbed TAC to investigate if the issue is related to the watchdog.
Figure 5.128. Duplex error
Alarm 'duplex' triggered
Alarm 'link_duplex' triggered
Alarm 'link_state:*:half_duplex' triggered
Alarm 'link_io_errors' triggered
Alarm 'link_state:*:io_errors' triggered
Alarm triggered for rising error for event duplex
These alarms get triggered when there is a large number of frame errors or carrier errors on one of the interfaces, often indicating a speed/duplex configuration mismatch.
Next steps: Check the speed and duplex settings on the interface reported. Consider a new Ethernet cable or a different port on the switch if the speed and duplex settings are the same.
Figure 5.129. Link state alarms
Alarm 'linkstate' triggered
Alarm 'link_state:X:link_error' triggered
Alarm triggered for rising error for event linkstate
These alarms get triggered when one of the interfaces of the Steelhead appliance has lost its Ethernet link.
Next steps: Investigate the loss of Ethernet link.
Figure 5.130. Virtual Steelhead related alarms
Alarm 'disk_not_setup' triggered
Alarm 'lan_wan_loop' triggered
The disk_not_setup alarm gets triggered when the Virtual Steelhead appliance detects that the disk reserved for the data store is not yet provisioned or not of the right size.
Next steps: In the ESXi environment, check the properties of the virtual disks for the Virtual Steelhead.
The lan_wan_loop alarm gets triggered when the Virtual Steelhead appliance detects that the LAN and the WAN interface are configured to be on the same virtual switch.
Next steps: In the ESXi environment, put the LAN and the WAN interfaces in different virtual switches.
Figure 5.131. Various admission control alarms
Alarm 'admission_control' rising
Alarm 'admission_conn' rising
Alarm 'admission_cpu' rising
Alarm 'admission_mapi' rising
Alarm 'admission_mem' rising
Alarm 'admission_tcp' rising
Alarm triggered for rising error for event admission_conn
Alarm triggered for rising error for event admission_cpu
Alarm triggered for rising error for event admission_mapi
Alarm triggered for rising error for event admission_mem
Alarm triggered for rising error for event admission_tcp
The following admission control alarms exist:
The generic admission control alarm, which is raised when one of the other alarms is raised.
Connection based admission control, when the number of optimized TCP sessions has risen above the number of licensed TCP sessions.
Next steps: Check the list of TCP connections in the Current Connections overview and reduce the number of optimized TCP sessions by applying pass-through in-path rules to traffic which does not get a high optimization factor. If the issue is structural, consider an upgrade to a higher model.
CPU load admission control, when the CPU load is too high and no new TCP sessions will be optimized.
Next steps: Open a case with Riverbed TAC when this happens.
MAPI based admission control, when the number of optimized TCP sessions exceeds a threshold, by default 85%, of the maximum number of licensed TCP sessions. This is to overcome the problems caused by the MAPI protocol's use of multiple TCP sessions, which all need to be optimized through the same client-side and server-side Steelhead appliances.
Next steps: The same as with Connection based admission control.
Memory based admission control, where the internal memory pool of the optimization service has exceeded a certain threshold.
Next steps: Open a case with Riverbed TAC to investigate.
TCP based admission control, where the TCP buffers in the kernel running on the Steelhead appliance are running out of memory.
Next steps: In the sysinfo.txt in the system dump, check the netstat output and find out which hosts have the biggest send-queue values. That is the host with the slow network stack. Also, open a case with Riverbed TAC to investigate.
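Finding the hosts with the biggest send-queue values can be scripted once the netstat section has been copied out of sysinfo.txt. The following is a rough sketch, assuming the usual Linux netstat column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name top_send_queues is hypothetical and the exact layout in a given system dump should be verified first.

```python
from collections import defaultdict


def top_send_queues(netstat_lines, limit=5):
    """Sum Send-Q per remote host and return the largest totals first."""
    totals = defaultdict(int)
    for line in netstat_lines:
        fields = line.split()
        # Assumed columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) < 5 or not fields[0].startswith("tcp"):
            continue
        if not fields[2].isdigit():
            continue
        # Strip the port number from the foreign address.
        host = fields[4].rsplit(":", 1)[0]
        totals[host] += int(fields[2])
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

The hosts at the top of the returned list are the candidates for the slow network stack mentioned above.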
Figure 5.132. Connection Forwarding alarms
Alarm 'cf_ipv6_incompatible_cluster' triggered
Alarm 'connection_forwarding' triggered
Alarm 'disconnected_sh_alert' triggered
Alarm 'single_cf' triggered
Alarm 'cf_ack_timeout_aggr' triggered
Alarm 'cf_ack_timeout:*' triggered
Alarm 'cf_conn_failure_aggr' triggered
Alarm 'cf_conn_failure:*' triggered
Alarm 'cf_conn_lost_eos_aggr' triggered
Alarm 'cf_conn_lost_eos:*' triggered
Alarm 'cf_conn_lost_err_aggr' triggered
Alarm 'cf_conn_lost_err:*' triggered
Alarm 'cf_keepalive_timeout_aggr' triggered
Alarm 'cf_keepalive_timeout:*' triggered
Alarm 'cf_latency_exceeded_aggr' triggered
Alarm 'cf_latency_exceeded:*' triggered
Alarm 'cf_read_info_timeout_aggr' triggered
Alarm 'cf_read_info_timeout:*' triggered
Alarm triggered for rising error for event cf_ack_timeout
Alarm triggered for rising error for event cf_conn_failure
Alarm triggered for rising error for event cf_conn_lost_eos
Alarm triggered for rising error for event cf_conn_lost_err
Alarm triggered for rising error for event cf_keepalive_timeout
Alarm triggered for rising error for event cf_latency_exceeded
Alarm triggered for rising error for event cf_read_info_timeout
The most common alarms with regard to the Connection Forwarding protocol are the cf_conn_failure alarm and the cf_latency_exceeded alarm.
The cf_conn_failure alarm happens when the Steelhead appliance is unable to set up the TCP session for the Connection Forwarding protocol. This is most likely caused by a network issue between the two in-path interfaces or by the optimization service on the other Steelhead appliance not being operational.
The cf_latency_exceeded alarm happens when the responses in the Connection Forwarding sessions take too long to come back to the Steelhead appliance. This could be related to the network between the two in-path interfaces but also to the load on the neighbour Steelhead appliance.
Next steps: Check the network between the in-path interfaces.
The disconnected_sh_alert alarm gets triggered when a Connection Forwarding neighbour gets disconnected.
The single_cf alarm gets triggered when there are no neighbouring nodes in the Connection Forwarding cluster.
Next steps: Determine why the neighbours got disconnected.
The cf_ipv6_incompatible_cluster alarm gets triggered when one of the nodes in the cluster does not support optimization over IPv6 yet.
Next steps: Upgrade all nodes in the cluster to the right RiOS version.
Figure 5.133. Data store related alarms
Alarm 'datastore' triggered
Alarm 'datastore_error' triggered
Alarm 'datastore_sync_error' triggered
Alarm 'store_corruption' triggered
Alarm triggered for rising error for event datastore_error
Alarm triggered for rising error for event datastore_sync_error
Alarm triggered for rising error for event store_corruption
When the optimization service detects corruption in the data store, it tries to recover from the issue and raises the store_corruption alarm. This alarm can also happen when encryption of the data store has been enabled or disabled but the restart of the optimization service didn't include the clearing of the data store.
Next steps: Restart the optimization service and clear the data store.
The datastore_sync_error alarm gets triggered when the Steelhead appliance is part of a data store synchronization cluster but its peer has become unreachable.
Next steps: Check the network connectivity between the two Steelhead appliances.
The datastore_error alarm gets triggered when the metadata in the data store cannot be initialized with the current settings. This alarm can happen when encryption of the data store has been changed or when the Extended Peering Table has been enabled or disabled but the restart of the optimization service didn't include the clearing of the data store.
Next steps: Restart the optimization service with the option to clear the data store.
Figure 5.134. Halt alarms
Alarm 'halt_error' triggered
Alarm triggered for rising error for event halt_error
Next steps: Open a case with Riverbed TAC to investigate.
Figure 5.136. NFS related alarms
Alarm 'nfs_v2_v4' triggered
Alarm triggered for rising error for event nfs_v2_v4
This alarm gets triggered when the NFS latency optimization has detected an NFS server which only uses the NFS version 2 or NFS version 4 protocols. The NFS latency optimization can only optimize NFS version 3 traffic.
Next steps: See if the NFS servers reported can be changed to support NFS version 3. If not, consider a pass-through rule for them.
Figure 5.137. Optimization Service alarms
Alarm 'service_error' triggered
Alarm 'optimization_service' triggered
Alarm 'optimization_general' triggered
Alarm triggered for rising error for event service_error
These alarms get triggered when the optimization service is not running or when an issue has been encountered which indicates a critical failure in the optimization protocol.
Next steps: Open a case with Riverbed TAC to investigate.
Figure 5.138. PFS related alarms
Alarm 'pfs' triggered
Alarm 'pfs_config' triggered
Alarm 'pfs_operation' triggered
Alarm triggered for rising error for event pfs_config
Alarm triggered for rising error for event pfs_operation
These alarms get triggered when there is an issue with the Proxy File Service.
Next steps: Open a case with Riverbed TAC to investigate.
Figure 5.139. Process dump related alarms
Alarm 'sticky_staging_dir' triggered
Alarm triggered for rising error for event sticky_staging_dir
This alarm gets triggered when there is an issue creating new process dumps. Most of the time it correlates with an fs_mnt alarm.
Next steps: Open a case with Riverbed TAC to investigate.
Figure 5.140. Serial Cascade configuration alarms
Alarm 'serial_cascade_misconfig' triggered
Alarm triggered for rising error for event serial_cascade_misconfig
Next steps: Open a case with Riverbed TAC to investigate.
Figure 5.141. SMB alarms
Alarm 'smb_alert' triggered
Alarm triggered for rising error for event smb_alert
This alarm gets triggered when there is an issue with the SMB Signing feature.
Next steps: Check the status of the computer object in the Active Directory service and the log files on the Domain Controllers. Check the status of the delegation user in the Active Directory service and the log files on the Domain Controllers. Check the connectivity between the Steelhead appliance primary interface and the Domain Controllers.
Figure 5.142. QoS alarms
Alarm 'inbound_qos_wan_bw_err' triggered
Alarm 'outbound_qos_wan_bw_err' triggered
These alarms get raised when the sum of the configured QoS bandwidth values exceeds the configured WAN bandwidth.
Next steps: Check the QoS settings page and confirm that the QoS bandwidth values are correct.
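The comparison behind these alarms can be reproduced by hand before changing the configuration. The following is a minimal sketch, assuming the bandwidth values have been copied from the QoS settings page into a dictionary. The function name, the labels and the numbers are hypothetical, and the exact set of values that RiOS sums is not confirmed by this text.

```python
# Hypothetical sketch: compare the sum of the configured QoS bandwidth values
# with the configured WAN bandwidth. All values are in kbps and are made up.


def qos_oversubscription(wan_bandwidth_kbps, qos_bandwidths_kbps):
    """Return by how many kbps the QoS values exceed the WAN bandwidth (0 if they fit)."""
    total = sum(qos_bandwidths_kbps.values())
    return max(0, total - wan_bandwidth_kbps)


configured = {"voice": 2000, "video": 5000, "bulk": 4000}
excess = qos_oversubscription(10000, configured)

if excess > 0:
    print(f"QoS bandwidth values exceed the WAN bandwidth by {excess} kbps")
else:
    print("QoS bandwidth values fit within the WAN bandwidth")
```

In this example the configured values add up to 11000 kbps against a WAN bandwidth of 10000 kbps, which is the condition under which the alarm would be expected.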
Figure 5.143. Software version alarms
Alarm 'sw_version_aggr' triggered
Alarm 'mismatch_peer_aggr'
Alarm 'mismatch_peer:*'
Alarm 'sw_version_mismatch_aggr'
Alarm 'sw_version:*'
Alarm triggered for rising error for event sw-version
These alarms get triggered when an incompatibility with the RiOS software version of a remote Steelhead appliance has been detected.
Next steps: Upgrade the remote Steelhead appliance to a compatible RiOS version or configure Peering Rules to prevent optimization between the two Steelhead appliances.
Figure 5.144. SSL related alarms
Alarm 'non_443_ssl_servers_detected_on_upgrade' triggered
Alarm triggered for rising error for event non_443_ssl_servers_detected_on_upgrade
This alarm gets triggered when, after an upgrade to RiOS version 6.0 or later, an SSL server on a port other than 443 has been detected.
In RiOS version 6.0 and later, traffic on TCP port 443 is automatically assumed to be SSL traffic, so when SSL optimization is enabled no in-path rule is required to get that traffic optimized. SSL traffic on any other TCP port still requires an in-path rule.
Next steps: Configure the correct In-path Rules and Peering Rules to optimize this SSL traffic.
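The port rule described above can be sketched as a simple filter over a list of SSL servers. This is only an illustration, assuming SSL optimization is enabled and that you maintain your own list of SSL servers. The addresses, ports and the function name are hypothetical.

```python
# Hypothetical sketch: list the SSL servers that still need an in-path rule.
# Assumes SSL optimization is enabled; the server list is made-up example data.

DEFAULT_SSL_PORT = 443  # handled automatically in RiOS 6.0 and later


def needs_in_path_rule(server_port):
    """Return True when SSL traffic to this port is not picked up automatically."""
    return server_port != DEFAULT_SSL_PORT


ssl_servers = [
    ("10.0.1.10", 443),
    ("10.0.1.11", 8443),
    ("10.0.1.12", 636),
]

rules_needed = [(ip, port) for ip, port in ssl_servers if needs_in_path_rule(port)]

for ip, port in rules_needed:
    print(f"In-path rule required for SSL server {ip}:{port}")
```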
Figure 5.145. System Detail Report alarms
Alarm 'system_detail_report' triggered
Alarm triggered for rising error for event system_detail_report
This alarm gets triggered when an issue with the operation or configuration of the optimization service has been detected.
Next steps: Check the System Detail Report and attend to the issues reported.
Figure 5.146. Granite related alarms
Alarm 'block_store' triggered
Alarm 'block_store:*' triggered
Alarm 'profile_switch_failed' triggered
Alarm 'granite-core' triggered
Alarm 'granite-core:*' triggered
Alarm 'high_availability' triggered
Alarm 'high_availability:*' triggered
Alarm 'iscsi' triggered
Alarm 'iscsi:*' triggered
Alarm 'lun' triggered
Alarm 'lun:*' triggered
Alarm 'snapshot' triggered
Alarm 'snapshot:*' triggered
Alarm 'uncommitted_data' triggered
The block_store alarm gets triggered when there are issues with the Granite block store.
Next steps: Check the system logs and contact Riverbed TAC for investigation.
The profile_switch_failed alarm gets triggered when repartitioning the drives fails while switching the storage profile.
Next steps: Contact Riverbed TAC for investigation.
The granite-core alarm gets triggered when there are communication problems towards the Granite Core appliance.
Next steps: Confirm that there are no network related issues between the Granite Edge and the Granite Core.
The high_availability alarm gets triggered when there are communication problems towards a node in a high availability cluster.
Next steps: Check connectivity with the HA peer.
The iscsi alarm gets triggered when the iSCSI initiator isn't accessible.
Next steps: Confirm the iSCSI configuration.
The lun alarm gets triggered when a LUN isn't available for the Granite Core.
Next steps: Confirm the status of the LUN on the Data Center NAS.
The snapshot alarm gets triggered when a snapshot couldn't be completed or committed.
Next steps: Confirm the status of the Data Center NAS.
The uncommitted_data alarm gets triggered when the Granite Edge has a large amount of uncommitted data in its block store.
Next steps: Confirm that the Granite Core and the NAS are working fine.
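The Granite alarms and their next steps can be collected in a small lookup table, for example to annotate log lines during triage. The following is a minimal sketch and not a Riverbed tool: the function and variable names are hypothetical, and it assumes that per-instance alarms follow the 'name:instance' pattern shown in the figure above.

```python
# Hypothetical sketch: map a Granite alarm name to the next steps listed above.
# Per-instance alarms such as 'lun:<instance>' are reduced to their base name.

GRANITE_NEXT_STEPS = {
    "block_store": "Check the system logs and contact Riverbed TAC for investigation.",
    "profile_switch_failed": "Contact Riverbed TAC for investigation.",
    "granite-core": "Confirm that there are no network related issues between "
                    "the Granite Edge and the Granite Core.",
    "high_availability": "Check connectivity with the HA peer.",
    "iscsi": "Confirm the iSCSI configuration.",
    "lun": "Confirm the status of the LUN on the Data Center NAS.",
    "snapshot": "Confirm the status of the Data Center NAS.",
    "uncommitted_data": "Confirm that the Granite Core and the NAS are working fine.",
}

UNKNOWN = "Not a Granite alarm covered by this table."


def next_steps_for(alarm_name):
    """Return the next steps for an alarm, ignoring any ':instance' suffix."""
    base_name = alarm_name.split(":", 1)[0]
    return GRANITE_NEXT_STEPS.get(base_name, UNKNOWN)


# 'lun:example0' is a made-up instance name used only for illustration.
print(next_steps_for("lun:example0"))
print(next_steps_for("uncommitted_data"))
```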