Re: [Quickfix-developers] Network disconnect recovery testing

Hi Dermot!

On Mon, Apr 17, 2017 at 5:45 AM,  <daw...@ya...> wrote:
> ...
> Hi Mike,
>
> Just getting back to orders that are sent (and don't go anywhere) during network downtime - is there a recommended approach to identifying and cleaning these up? I'm thinking to just run a process every minute or so that checks "if currently logged on and order hasn't been acked in the last minute then delete from database". What do you think?

Some words of advice from the trenches ...

Think through your recovery strategy carefully (as you seem to
be doing).

First, if you send an order and receive an ack, then your counter-
party has accepted responsibility for that order.  (Hopefully he
fulfills his responsibility.)

If you try to send an order and don't receive an ack, you don't
know whether your counter-party has it or not.  (Send order,
receive and process order, send ack -- network goes down --
don't receive ack.  Note, in this scenario, your counter-party
doesn't know -- until and unless you request a gap fill -- that
you didn't receive the ack.  FIX doesn't ack acks.)  Note, this
means that you REALLY don't want to delete unacked orders
from your database and forget about them -- they might be
live on your counter-party's side.  So you need, at least, to
put them in some kind of limbo state, and have a protocol
for cleaning them up (or leaving them live when you get your
ack after a reconnect and gap fill).
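
As a rough illustration of the "limbo state" idea, here is a minimal
Python sketch.  The OrderBook class, the state names, and the method
names are all hypothetical (they are not QuickFIX API); a real system
would hook these into its own database and session callbacks.

```python
from enum import Enum


class OrderState(Enum):
    PENDING_NEW = "pending_new"  # sent (or attempted), no ack yet
    LIVE = "live"                # acked by the counter-party
    LIMBO = "limbo"              # unacked across an outage; may be live remotely


class OrderBook:
    """Hypothetical in-memory stand-in for the order database."""

    def __init__(self):
        self.orders = {}  # order id -> OrderState

    def on_send(self, order_id):
        self.orders[order_id] = OrderState.PENDING_NEW

    def on_ack(self, order_id):
        # An ack, including one replayed after a reconnect, makes the order live.
        if order_id in self.orders:
            self.orders[order_id] = OrderState.LIVE

    def on_disconnect(self):
        # Never delete unacked orders: park them until their fate is known.
        for order_id, state in self.orders.items():
            if state is OrderState.PENDING_NEW:
                self.orders[order_id] = OrderState.LIMBO

    def needs_cleanup(self):
        # Orders that still need a decision (cancel, query, manual check).
        return sorted(oid for oid, s in self.orders.items()
                      if s is OrderState.LIMBO)
```

The point of the sketch is only that a disconnect moves unacked orders
to a separate state instead of removing them, and that a late ack can
still promote them to live.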

You need to decide on a business protocol for handling network
outages.  Some counter-parties will let you (or require you to)
have them (pre-arranged -- not via FIX) auto-cancel any orders
if connectivity drops.  Note such a "cure" can be worse than the
problem if your network goes down for a few milliseconds or a
few seconds or even a few minutes.  It depends on your use case.

Similarly, you need to decide what you want to do with orders
that you have attempted to issue but haven't sent yet.  Again,
if the outage was only for a few milliseconds or seconds, you
might do best to just send them.

The problem is with orders that you might have sent.  You can't
really leave them out of a gap fill, because that could be
inconsistent with the traffic your counter-party has already seen.
You could negotiate with your counter-party an "auto-cancel" policy
for new, previously unseen orders that come in a gap fill.

Let's say you do decide to filter out (somehow) possibly unsent traffic
when gap filling after an outage.  While you might want to "filter out"
new orders (or have your counter-party ignore / auto-cancel them)
you almost certainly do not want to filter out possibly unsent cancel
requests.
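
To make the asymmetry concrete, here is a small Python sketch of such
a filter.  The Outbound record, the should_resend function, and the
age threshold are hypothetical, and whether you may filter at all is
something to agree with your counter-party.  The MsgType values are
the standard FIX ones (D = NewOrderSingle, F = OrderCancelRequest).

```python
from dataclasses import dataclass

NEW_ORDER = "D"        # FIX MsgType for NewOrderSingle
CANCEL_REQUEST = "F"   # FIX MsgType for OrderCancelRequest


@dataclass
class Outbound:
    seq_num: int
    msg_type: str
    age_seconds: float  # time since the message was first generated


def should_resend(msg, max_new_order_age=5.0):
    """Decide whether a possibly-unsent message goes out in a resend.

    max_new_order_age is an assumed policy knob, not a FIX concept.
    """
    if msg.msg_type == CANCEL_REQUEST:
        return True  # never hold back a cancel
    if msg.msg_type == NEW_ORDER:
        return msg.age_seconds <= max_new_order_age  # drop stale new orders
    return True  # everything else is resent unchanged
```

Anything this filter rejects still has to be accounted for in the
sequence numbers (in FIX, by a SequenceReset with GapFillFlag set),
so the filter decides content, not whether the gap is closed.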

You are right that it makes a lot of sense to monitor whether you
have connectivity with your counter-party.  Note there are several
levels to this:

Business "connectivity" -- e.g., receive acks
FIX connectivity -- heartbeats, isLoggedOn
Network connectivity -- e.g., a free-standing "ping" watchdog
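
One way to combine the three levels is to check them from the bottom
up and report the lowest layer that looks broken.  This is a Python
sketch only: LinkStatus, diagnose, and both time limits are
hypothetical, and the inputs would come from your own watchdog, your
FIX engine's logged-on state, and your order tracking.

```python
from dataclasses import dataclass


@dataclass
class LinkStatus:
    ping_ok: bool               # free-standing network watchdog
    logged_on: bool             # FIX session state
    last_heartbeat_age: float   # seconds since last inbound FIX message
    oldest_unacked_age: float   # seconds; 0.0 if nothing awaits an ack


def diagnose(status, heartbeat_limit=60.0, ack_limit=5.0):
    """Return the lowest layer that appears down, or "ok"."""
    if not status.ping_ok:
        return "network"
    if not status.logged_on or status.last_heartbeat_age > heartbeat_limit:
        return "fix"
    if status.oldest_unacked_age > ack_limit:
        return "business"
    return "ok"
```

Checking in this order avoids blaming the counter-party's order
handling for what is really a dead link.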

It does make sense -- to reduce potential gap-fill load and economic
exposure if you don't manage a timely reconnect -- to pause
sending orders on your side if you detect possible connectivity
interruption.  You may or may not wish to queue up such unsent
traffic to send if you reestablish connectivity after a "short" amount
of time.  Note, there is real business logic in how you decide to
handle this.  Would you really not want to send cancel requests
that you generated and got queued up during a (real or imagined)
loss of connectivity?
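
Here is a minimal Python sketch of one such policy: queue while
paused, flush everything after a short outage, and after a long one
send only the cancel requests.  The SendGate class, the "new" and
"cancel" labels, and the two-second limit are hypothetical choices
for illustration; the right policy is a business decision.

```python
from collections import deque


class SendGate:
    """Hypothetical gate between order logic and the FIX session."""

    def __init__(self, send, max_outage_seconds=2.0):
        self.send = send                      # callable(kind, payload)
        self.max_outage_seconds = max_outage_seconds
        self.paused_at = None                 # None means not paused
        self.queue = deque()

    def pause(self, now):
        # Keep the first pause time so the outage length is measured fully.
        if self.paused_at is None:
            self.paused_at = now

    def submit(self, kind, payload):
        if self.paused_at is None:
            self.send(kind, payload)
        else:
            self.queue.append((kind, payload))

    def resume(self, now):
        """Flush the queue; return whatever was deliberately not sent."""
        short = (self.paused_at is None
                 or now - self.paused_at <= self.max_outage_seconds)
        self.paused_at = None
        dropped = []
        while self.queue:
            kind, payload = self.queue.popleft()
            if short or kind == "cancel":
                self.send(kind, payload)      # cancels always go out
            else:
                dropped.append((kind, payload))
        return dropped
```

The caller passes in the current time rather than the gate reading a
clock, which keeps the outage logic easy to test.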

Repeating two key points:

It is well worth designing your recovery strategy with care.

The details of your strategy will depend on your specific business
use case -- auto-cancellation policies, how long an outage puts
you into a clean-up mode, rather than recovering and proceeding
normally (keeping orders live, etc.).

And last:  Test to the extent you can afford and your business requires.
I have NEVER seen automated recovery work completely correctly in
institutional environments -- even with reasonably well-tested systems.
(I'm not saying it can't happen -- I've just never seen it.)

> Dermot

Good luck.

K. Frank