Sometimes the optimization service determines that something is wrong, either in the data store or in the communication with a peer. It will then raise the service_error alarm, which is non-fatal for the optimization service but fatal for the TCP session on which it was encountered.
When a new computer is rolled out and its software is installed by cloning it from a master image, the data store ID of the Steelhead Mobile Client software gets cloned too. As a result, the server-side Steelhead appliance will see duplicate labels and raise the service_error alarm:
Figure 5.204. Duplicate DEF and a Service error with a Steelhead Mobile Client
SH sport[1448]: [segpage.ERR] - {- -} Duplicate DEF {}136 hash 11237544087762729201 vs. 440090/248488450:20#0{}176 hash 4723104384001475725, memcmp() 1
SH sport[1448]: [defunpacker.ALERT] - {- -} ALARM (clnt: 10.0.1.1:49358 serv: 192.168.1.1:1352 cfe: 10.0.1.1:49349 sfe: 192.168.1.6:7800) name maps to more than one segment has a steelhead been improperly installed and/or steelhead mobiles are sharing config files p
SH sport[1448]: [defunpacker.ALERT] - {- -} ossibly through copying or cloning?
SH sport[1448]: [defunpacker.WARN] - {- -} (clnt: 10.0.1.1:49358 serv: 192.168.1.1:1352 cfe: 10.0.1.1:49349 sfe: 192.168.1.6:7800) Killing page 0x2b8835a000 name: 440090/248488450:20 off 59732398 refs 2 cnt 50 flags CA--- tag 1954835526/12, segment #
SH sport[1448]: [defunpacker.WARN] - {- -} 544087762729201
SH sport[1448]: [segstore/kill_page.ERR] - {- -} Killing page name: 440090/248488450:20 off 59732398 refs 2 cnt 50 flags CA--- tag 1954835526/12 and dumping it to refd_pages.1448
SH statsd[28893]: [statsd.NOTICE]: Alarm triggered for rising error for event service_error
The IP addresses of the clnt (client) and the cfe (client-side Steelhead) are the same, so this problem is caused by a Steelhead Mobile Client.
Figure 5.205. Duplicate DEF and a Service error but not with a Steelhead Mobile Client
SH sport[26651]: [defunpacker.ALERT] - {- -} ALARM (clnt: 10.0.1.1:51748 serv: 192.168.1.1:80 cfe: 10.0.1.6:7801 sfe: 192.168.1.6:7800) name maps to more than one segment has a steelhead been improperly installed and/or steelhead mobiles are sharing conf
SH sport[26651]: [defunpacker.ALERT] - {- -} ig files possibly through copying or cloning?
Here the IP addresses of the clnt (client) and the cfe (client-side Steelhead) are not the same, so the remote IP address belongs to a real Steelhead appliance and not a Steelhead Mobile Client.
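As a quick way to apply this check, the sketch below pulls the clnt and cfe fields out of such an ALARM line and reports whether the peer causing the duplicate DEF looks like a Steelhead Mobile Client or a Steelhead appliance. The regular expression, function name and example line are illustrative assumptions based on the log format shown above, not an official parser.

import re

# Matches the connection tuple printed in the defunpacker ALARM lines shown above.
# The field names (clnt, serv, cfe, sfe) come from the log format; the regex itself
# is an illustrative assumption.
TUPLE_RE = re.compile(r"clnt:\s*(?P<clnt>[\d.]+):\d+.*?cfe:\s*(?P<cfe>[\d.]+):\d+")

def classify_duplicate_def(log_line: str) -> str:
    """Return a hint about who caused a 'name maps to more than one segment' alarm."""
    match = TUPLE_RE.search(log_line)
    if not match:
        return "no connection tuple found in this line"
    clnt, cfe = match.group("clnt"), match.group("cfe")
    if clnt == cfe:
        # Client IP equals client-side Steelhead IP: a Steelhead Mobile Client.
        return f"{clnt}: client and client-side Steelhead match -> Steelhead Mobile Client"
    # Different IPs: the client-side peer is a real Steelhead appliance.
    return f"{cfe}: client-side Steelhead differs from client -> Steelhead appliance"

if __name__ == "__main__":
    example = (
        "SH sport[1448]: [defunpacker.ALERT] - {- -} ALARM (clnt: 10.0.1.1:49358 "
        "serv: 192.168.1.1:1352 cfe: 10.0.1.1:49349 sfe: 192.168.1.6:7800) "
        "name maps to more than one segment"
    )
    print(classify_duplicate_def(example))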
The most likely cause is that there has been an issue with the hosts in a data store synchronization cluster and the data store was not cleared after the cluster was taken apart. Clearing the data store and restarting the optimization service with the command restart clean should resolve this issue.
When the sending Steelhead appliance sends a reference and the receiving Steelhead appliance does not have that reference in its data store, the receiving Steelhead appliance will request that reference from the sending Steelhead appliance. If the sending Steelhead appliance no longer has that reference in its data store, it will throw this error:
Figure 5.206. Requested reference does not exist anymore
SH sport[32414]: [replypacker.ALERT] - {- -} ALARM ACK problem: REQd segment 380198/193427159:2600447491#16 absent
This can happen when the optimized TCP connection uses the SDR-M optimization policy, or when it is a CIFS connection and the reference was only available in the CIFS cache part of the data store.
If this happens only once, just clear the alarm. If it happens repeatedly, open a TAC case.
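The interaction behind this alarm can be illustrated with a small sketch of reference-based deduplication. This is a simplified model under assumed names, not Riverbed's actual SDR implementation: the sender replaces data it believes the peer has seen with a reference, the receiver asks back for references it cannot resolve, and the alarm corresponds to the case where the sender can no longer serve the request because the segment has been evicted from its own store.

import hashlib

class SegmentStore:
    """A toy segment store: maps a reference (hash) to the segment data."""
    def __init__(self):
        self._segments = {}

    def learn(self, data: bytes) -> str:
        ref = hashlib.sha256(data).hexdigest()[:16]
        self._segments[ref] = data
        return ref

    def lookup(self, ref: str):
        return self._segments.get(ref)

    def evict(self, ref: str) -> None:
        self._segments.pop(ref, None)

def resolve(reference: str, sender: SegmentStore, receiver: SegmentStore) -> bytes:
    """Receiver expands a reference, asking the sender for the data if needed."""
    data = receiver.lookup(reference)
    if data is not None:
        return data                      # reference known locally, nothing to request
    data = sender.lookup(reference)      # ask the sending side for the raw data
    if data is None:
        # Corresponds to "ALARM ACK problem: REQd segment ... absent":
        # the sender advertised a reference it can no longer serve.
        raise LookupError(f"REQd segment {reference} absent")
    receiver.learn(data)                 # receiver stores it for next time
    return data

if __name__ == "__main__":
    sender, receiver = SegmentStore(), SegmentStore()
    ref = sender.learn(b"some optimized payload")
    print(resolve(ref, sender, receiver))   # works: sender still has the segment
    sender.evict(ref)
    receiver.evict(ref)
    try:
        resolve(ref, sender, receiver)      # now fails, like the alarm above
    except LookupError as err:
        print("alarm:", err)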
If there is a problem in the protocol spoken on the inner channel between the Steelhead appliances, a service error will be raised.
Since the protocol runs over TCP, the network stack takes care of TCP checksum issues and the data should arrive intact. However, when the WAN visibility modes port transparency or full transparency are used, there might be firewalls or IPS devices along the path which inspect the protocol and "fix" perceived irregularities, which can corrupt the inner channel.
Figure 5.207. A service error due to a checksum mismatch
SH sport[3612]: [sportpkt.ALERT] - {- -} ALARM decode_hdr: checksum mismatch, cmd 1, len 32789
SH statsd[9205]: [statsd.NOTICE]: Alarm triggered for rising error for event service_error
To troubleshoot checksum mismatches, take traces at various places in the network between the two Steelhead appliances until the issue happens again and check where the corruption first appears. Then use a divide-and-conquer approach to find the device which is corrupting the content.
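One way such a comparison could be scripted is sketched below. It assumes two packet captures taken on either side of a suspect device, filtered to the inner channel (here assumed to be TCP port 7800, as in the log examples above), and it matches segments by TCP sequence number; the file names and port are placeholders. Real captures need more care (retransmissions, reordering, address rewriting with full transparency), so treat this only as a starting point for the divide-and-conquer search.

import hashlib
from scapy.all import rdpcap
from scapy.layers.inet import TCP

INNER_CHANNEL_PORT = 7800  # assumption: inner channel on the default port 7800

def payload_digests(pcap_file: str) -> dict:
    """Map TCP sequence number -> digest of the segment payload for the inner channel."""
    digests = {}
    for pkt in rdpcap(pcap_file):
        if TCP not in pkt:
            continue
        tcp = pkt[TCP]
        if INNER_CHANNEL_PORT not in (tcp.sport, tcp.dport):
            continue
        payload = bytes(tcp.payload)
        if payload:
            digests[tcp.seq] = hashlib.sha256(payload).hexdigest()
    return digests

def compare(before_file: str, after_file: str) -> None:
    """Report segments whose payload changed between the two capture points."""
    before = payload_digests(before_file)
    after = payload_digests(after_file)
    for seq in sorted(before.keys() & after.keys()):
        if before[seq] != after[seq]:
            print(f"payload modified in transit at seq {seq}")

if __name__ == "__main__":
    # Placeholder file names: captures taken on the LAN and WAN side of a suspect device.
    compare("lan_side.pcap", "wan_side.pcap")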