…at least not how I thought…

The project I’m currently working on consists of a WCF service that controls and monitors a machine on a factory production line, and a UI that displays its state and allows setup of the various machine parameters. Any changes to the configuration are pushed from the UI to the service via WCF, and updates to the current state (whether the machine is running, various counters, etc.) are sent back from the service via WCF’s callback channel mechanism.
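
To give a feel for the shape of the thing, here’s a minimal sketch of the kind of duplex contract involved – the interface names and the MachineState type are illustrative, and the real contracts are richer than this:

using System.ServiceModel;

[ServiceContract(CallbackContract = typeof(IMachineStateCallback))]
public interface IMachineService
{
  [OperationContract]
  void SetMachineProperties(MachineProperty[] aProperties);
}

// Implemented on the UI side; the service pushes state updates back through
// it over the callback channel. MachineState stands in for whatever DTO
// carries the state in the real project.
public interface IMachineStateCallback
{
  [OperationContract(IsOneWay = true)]
  void OnMachineStateChanged(MachineState aState);
}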

Now, this all worked fine and dandy in my test setup – I had the thing running for hours with no problems at all, even when the updates were coming thick and fast. However, on the first day of testing at my customer’s factory, disaster: after a few minutes of the machine sitting idle, my customer tried to make a change to the configuration – which promptly threw an exception and died :(

We restarted, tried to make the same change, and all worked fine. Then, after a few more minutes, another change. Bang – same exception. Damn.

After a bit of hunting around I worked out what the problem was: the communication channel bindings used in WCF (netTcpBinding, wsDualHttpBinding, etc.) have a reliableSession with an inactivityTimeout value – if there is no communication over the channel for that length of time, the channel is closed. In my test setup I never left the machine idle, so there was always data flying backwards and forwards between the service and the UI. In the real factory, however, the machine was not running, so there were no state updates and hence no communication between the service and the UI. The channel was silently closed down and, when the subsequent configuration change was sent down the (now closed) channel, the exception was thrown.
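
For reference, that timeout lives on the binding’s reliableSession and defaults to 10 minutes. It can be set in config or in code – something along these lines (the binding and values here are purely illustrative):

NetTcpBinding binding = new NetTcpBinding(SecurityMode.Transport);
binding.ReliableSession.Enabled = true;

// Default is 10 minutes - after that much silence the reliable session is torn down
binding.ReliableSession.InactivityTimeout = TimeSpan.FromHours(1);

// The binding's receiveTimeout can also close an idle session, so raise it too
binding.ReceiveTimeout = TimeSpan.FromHours(1);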

So, the simple answer is to increase the inactivityTimeout to a ridiculously high value, e.g. 999 years. That would probably have been good enough. However, there is always a chance that the session could go down for some other reason. So, I took a two-pronged approach: firstly, I introduced a UI-service heartbeat every 30 seconds (sketched at the end of this post); and secondly, each attempt by the UI to send something to the service is wrapped in a method that spots that the channel has been pulled out from under us and reconnects before retrying. E.g.

public void SetMachineProperties(MachineProperty[] aProperties)
{
  ExecuteActionReconnectingIfNecessary(
    delegate
    {
      _Client.SetMachineProperties(aProperties);
    });
}

where ExecuteActionReconnectingIfNecessary looks like this:

private void ExecuteActionReconnectingIfNecessary(Action aAction)
{
  try
  {
    aAction();
  }
  catch (CommunicationException)
  {
    // Attempt to reconnect - this may also throw an exception, but we don't want to catch that
    Connect();
    // Try again - if this fails too, let the exception bubble up
    aAction();
  }
}
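
Connect() isn’t shown here, but the gist is to abort the dead channel and build a fresh one. A rough sketch, assuming the client comes from a DuplexChannelFactory (the field and endpoint names are illustrative):

private void Connect()
{
  if (_Client != null)
  {
    // A faulted channel can't be Closed, only Aborted
    ((ICommunicationObject)_Client).Abort();
  }

  // Recreate the duplex channel, handing WCF the UI-side callback handler
  InstanceContext context = new InstanceContext(_CallbackHandler);
  DuplexChannelFactory<IMachineService> factory =
    new DuplexChannelFactory<IMachineService>(context, "MachineServiceEndpoint");
  _Client = factory.CreateChannel();
}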

This seems to work fine. Yay!
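
For completeness, the heartbeat (the first prong) is nothing more than a no-op call fired at the service every 30 seconds so the channel never looks idle. A rough sketch, assuming a Ping operation has been added to the service contract (the names here are illustrative):

private System.Threading.Timer _HeartbeatTimer;

private void StartHeartbeat()
{
  _HeartbeatTimer = new System.Threading.Timer(
    delegate
    {
      // Route the ping through the same wrapper so a dead channel gets
      // reconnected here too, not just on a real call
      ExecuteActionReconnectingIfNecessary(
        delegate
        {
          _Client.Ping();
        });
    },
    null,
    TimeSpan.FromSeconds(30),
    TimeSpan.FromSeconds(30));
}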