5.28. Service Error

When the optimization service detects a problem, either in the data store or in the communication with a peer, it raises the service_error alarm. This error is not fatal for the optimization service itself, but it is fatal for the TCP session on which it was encountered.

5.28.1. Data store related service errors

5.28.1.1. Steelhead Mobile Client Cloning

When a new computer is rolled out by cloning it from a master image that already contains the Steelhead Mobile Client software, the data store ID of the Steelhead Mobile Client is cloned too. As a result, the server-side Steelhead appliance sees duplicate labels and raises the service_error alarm:

Figure 5.204. Duplicate DEF and a Service error with a Steelhead Mobile Client

SH sport[1448]: [segpage.ERR] - {- -} Duplicate DEF {}136 hash 11237544087762729201 vs. 44 \
    0090/248488450:20#0{}176 hash 4723104384001475725, memcmp() 1
SH sport[1448]: [defunpacker.ALERT] - {- -} ALARM (clnt: 10.0.1.1:49358 serv: 192.168.1.1: \
    1352 cfe: 10.0.1.1:49349 sfe: 192.168.1.6:7800) name maps to more than one segment has \
     a steelhead been improperly installed and/or steelhead mobiles are sharing config fil \
    es p
SH sport[1448]: [defunpacker.ALERT] - {- -} ossibly through copying or cloning?
SH sport[1448]: [defunpacker.WARN] - {- -} (clnt: 10.0.1.1:49358 serv: 192.168.1.1:1352 cf \
    e: 10.0.1.1:49349 sfe: 192.168.1.6:7800) Killing page 0x2b8835a000 name: 440090/248488 \
    450:20 off 59732398 refs 2 cnt 50 flags CA--- tag 1954835526/12, segment #
SH sport[1448]: [defunpacker.WARN] - {- -} 544087762729201
SH sport[1448]: [segstore/kill_page.ERR] - {- -} Killing page name: 440090/248488450:20 of \
    f 59732398 refs 2 cnt 50 flags CA--- tag 1954835526/12 and dumping it to refd_pages.14 \
    48
SH statsd[28893]: [statsd.NOTICE]: Alarm triggered for rising error for event service_erro \
    r

The IP addresses of the clnt (client) and the cfe (client-side Steelhead) are the same, which means that a Steelhead Mobile Client is causing this problem.
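The comparison of the clnt and cfe addresses can be automated when many log lines need to be checked. The following is a minimal sketch, not a supported tool. It assumes the ALARM line has already been unwrapped into a single line (the figures wrap long lines with a trailing backslash), and the function names parse_endpoints and is_mobile_client are hypothetical.

```python
import re

# Matches the endpoint fields as they appear in the ALARM line,
# for example "clnt: 10.0.1.1:49358".
ENDPOINT_RE = re.compile(
    r"\b(clnt|serv|cfe|sfe):\s*(\d{1,3}(?:\.\d{1,3}){3}):\s*(\d+)"
)


def parse_endpoints(line):
    """Return {"clnt": (ip, port), ...} for every endpoint found in the line."""
    return {name: (ip, int(port)) for name, ip, port in ENDPOINT_RE.findall(line)}


def is_mobile_client(line):
    """True when clnt and cfe share an IP address, False when they differ.

    Returns None when the line does not contain both fields.
    """
    endpoints = parse_endpoints(line)
    if "clnt" not in endpoints or "cfe" not in endpoints:
        return None
    return endpoints["clnt"][0] == endpoints["cfe"][0]
```

A result of True points at a Steelhead Mobile Client, a result of False at a real client-side Steelhead appliance.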

5.28.1.2. Data store clustering

Figure 5.205. Duplicate DEF and a Service error but not with a Steelhead Mobile Client

SH sport[26651]: [defunpacker.ALERT] - {- -} ALARM (clnt: 10.0.1.1:51748 serv: 192.168.1.1 \
    :80 cfe: 10.0.1.6:7801 sfe: 192.168.1.6:7800) name maps to more than one segment has a \
     steelhead been improperly installed and/or  steelhead mobiles are sharing conf 
SH sport[26651]: [defunpacker.ALERT] - {- -} ig files possibly through copying or cloning? \
     

Here the IP addresses of the clnt (client) and the cfe (client-side Steelhead) are not the same, therefore the remote IP address belongs to a real Steelhead appliance and not to a Steelhead Mobile Client.

The most likely cause is an issue with the hosts in a data store synchronization cluster, where the data store was not cleared after the cluster was taken apart. Clearing the data store and restarting the service with the command restart clean should resolve this issue.

5.28.1.3. Requested frame does not exist anymore

When the sending Steelhead appliance sends a reference and the receiving Steelhead appliance does not have that reference in its data store, the receiving Steelhead appliance will request that reference from the sending Steelhead appliance. If the sending Steelhead appliance does not have that reference in its data store anymore, it will throw this error:

Figure 5.206. Requested reference does not exist anymore

SH sport[32414]: [replypacker.ALERT] - {- -} ALARM ACK problem: REQd segment 380198/193427 \
    159:2600447491#16 absent 

This can happen when the optimized TCP connection uses the SDR-M optimization policy, or when the optimized TCP connection is a CIFS connection and the reference was only available in the CIFS cache part of the data store.

If this happens once, just clear the alarm. If it keeps happening, open a TAC case.
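To judge whether the error is a one-off or a recurring problem, the alerts can be counted in the system log. The following is a rough sketch only: the function names and the threshold of one occurrence are assumptions made for illustration, and the pattern is derived from the single log line shown in the figure above, unwrapped into one line.

```python
import re

# Pattern taken from the unwrapped log line in the figure above.
ABSENT_RE = re.compile(
    r"\[replypacker\.ALERT\].*ALARM ACK problem: REQd segment (\S+) absent"
)


def absent_segments(log_lines):
    """Return the segment names from every 'REQd segment ... absent' alert."""
    found = []
    for line in log_lines:
        match = ABSENT_RE.search(line)
        if match:
            found.append(match.group(1))
    return found


def recommended_action(log_lines, threshold=1):
    """Suggest an action based on the alert count.

    The threshold is a hypothetical cut-off, not a documented value.
    """
    count = len(absent_segments(log_lines))
    if count == 0:
        return "none"
    return "clear alarm" if count <= threshold else "open TAC case"
```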

5.28.2. Communication related service errors

If there is a problem in the protocol spoken on the inner channel between the Steelhead appliances, a service error will be raised.

Since the protocol is spoken over TCP, the network stack will take care of TCP checksum issues and the data should be valid. When using the WAN visibility modes of port transparency or full transparency, there might be firewalls or IPS devices which check the protocol and "fix" irregularities in it, which can cause corruption on the inner channel.

Figure 5.207. A service error due to a checksum mismatch

SH sport[3612]: [sportpkt.ALERT] - {- -} ALARM decode_hdr: checksum mismatch, cmd 1, len 3 \
    2789
SH statsd[9205]: [statsd.NOTICE]: Alarm triggered for rising error for event service_error

To troubleshoot checksum mismatches, take traces at various places in the network between the two Steelhead appliances until the issue happens again, and check at each place whether the corruption is already present. Then use a divide-and-conquer approach to find the device that is corrupting the content.
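The divide-and-conquer step can be sketched as a bisection over the capture points, ordered from the sending to the receiving Steelhead appliance. This is an illustration under stated assumptions: the capture point names are hypothetical, is_corrupt stands for whatever check is used to compare a trace against the original content (usually a manual comparison), and corruption, once introduced, is assumed to be visible at every later capture point.

```python
def first_corrupt_point(points, is_corrupt):
    """Return the index of the first capture point that shows corruption.

    points     -- capture points ordered from sender to receiver
    is_corrupt -- callable returning True when the trace taken at that
                  point already contains the corrupted content
    Returns None when the last point is clean or no points are given.
    """
    lo, hi = 0, len(points) - 1
    if not points or not is_corrupt(points[hi]):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if is_corrupt(points[mid]):
            hi = mid          # corruption was introduced at or before mid
        else:
            lo = mid + 1      # corruption was introduced after mid
    return lo
```

With an index i greater than zero, the suspect device sits between points[i - 1] and points[i]. An index of zero means the content was already corrupt at the first capture point.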