[quagga-dev 4361] BGP 0.99 testers needed

Paul Jakma paul at clubi.ie
Thu Sep 14 04:42:12 BST 2006


I've taken the liberty of putting back what I believe may be the 
final 0.99 bgpd regression fixes. See below for details on those.

What would be really useful is for people with test networks, and any 
others who are affected by the issues below, to take the CVS snapshot 
as of 20060914 and stress test it as much as possible.

In particular stress testing of any and all combinations of:

  - deleting neighbours and adding them back
  - reconfiguring neighbours
    (in ways that require bgpd to reset the session particularly)
  - hard clearing neighbours
    (in as wide a variety of BGP peer states as possible)
  - maximum prefix overflow
    (this was borked because of the prefix-count drift. That is fixed,
     but max-prefix isn't well-tested in combination with the
     clearing/shutdown/deleted changes. In theory those changes
     shouldn't affect max-prefix, but confirmation would be good).

would be appreciated.

One changeset in particular has not been widely tested, and if it has 
a mistake it will cause odd crashes (though it's one of a series 
intended to eliminate a crash..).

I'll be away for a while and may not have email access for up to a 
week and a half.



Regressions known in 0.99.5, and their status in CVS:

- Prefix count issue: Believed to be fixed by

   2006-09-06 Paul Jakma <paul.jakma at sun.com>

         * (general) Squash any and all prefix-count issues by
           abstracting route flag changes, and maintaining count as and
           when flags are modified (rather than relying on explicit
           modifications of count being sprinkled in just the right
           places throughout the code).

   Fix confirmed by at least one tester, who was seeing this in
   production. If there are any issues, please report the output of:

    'show .... bgp neighbor <address> prefix-counts'


   (NB: this command does a RIB walk, so it's not a 'cheap' command
    and you may wish to use it sparingly. It's an enable-mode only
    command for a reason)

- shutdown sometimes doesn't stick, and 'no neighbour' could still
   cause crashes: Believed to be fixed by

   2006-09-14 Paul Jakma <paul.jakma at sun.com>

         * (general) Fix some niggly issues around 'shutdown' and clearing
           by adding a Clearing FSM wait-state and a hidden 'Deleted'
           FSM state, to allow deleted peers to 'cool off' and hit 0
           references. This introduces a slow memory leak of struct peer,
           however that's more a testament to the fragility of the
           reference counting than a bug in this patch, cleanup of
           reference counting to fix this is to follow.

   One tester has tried to torture this a bit, and it fixes the 'no
   neighbour' crash he saw.

   The mentioned leak should be fixed by:

         * (general) fix the peer refcount issue exposed by previous, by
           just removing refcounting of peer threads, which is mostly
           senseless as they're references leading from struct peer,
           which peer_free cancels anyway. No need to muck around..

   which I've tested extensively, though in a slightly different form.
   It's fairly sane, but wider testing is needed to ensure there are
   no dumb mistakes.
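
For anyone wondering what the prefix-count changeset above boils down
to, here is a rough sketch of the idea only. It is not the bgpd code:
the flag names, structs and helper functions are all made up for
illustration, and which flags make a route "counted" is an assumption.
The point is that the count is adjusted in exactly one place, as a
side effect of changing flags, instead of by hand all over the code.

```c
/* Hypothetical flags and structs, for illustration only;
 * these are not the actual bgpd definitions. */
#define ROUTE_VALID    (1u << 0)
#define ROUTE_HISTORY  (1u << 1)

struct peer_counts {
    unsigned long pcount;       /* number of counted prefixes */
};

struct route {
    unsigned int flags;
    struct peer_counts *owner;  /* counters of the peer this route came from */
};

/* Assumption for this sketch: a route is counted only while it is
 * valid and not history. */
static int route_counted(unsigned int flags)
{
    return (flags & ROUTE_VALID) && !(flags & ROUTE_HISTORY);
}

/* The only place the count is ever touched: compare the counted
 * status before and after the flag change. */
static void route_flags_update(struct route *r, unsigned int newflags)
{
    int was = route_counted(r->flags);
    int is = route_counted(newflags);

    r->flags = newflags;
    if (!was && is)
        r->owner->pcount++;
    else if (was && !is)
        r->owner->pcount--;
}

/* Callers use these two rather than poking at r->flags directly. */
static void route_set_flag(struct route *r, unsigned int f)
{
    route_flags_update(r, r->flags | f);
}

static void route_unset_flag(struct route *r, unsigned int f)
{
    route_flags_update(r, r->flags & ~f);
}
```

Setting a flag that is already set, or clearing one that is already
clear, leaves the count alone, which is the sort of case that is easy
to get wrong when the count is adjusted explicitly at each call site.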

If there are crashes, chances are high that /only/ the latter 'fix 
the peer refcount issue exposed' changeset needs to be reverted to 
regain stability (at the cost of bringing back a very slow leak of a 
~40kB struct peer every time an ACCEPT_PEER peer comes in..).
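
To make the 'Deleted' state and the leak a bit more concrete, here is
a minimal sketch of the "cool off until 0 references" idea. Again this
is not the actual bgpd code: the names, the state list and the
peers_freed counter are assumptions for the example, and it leaves out
the FSM and threads entirely.

```c
#include <stdlib.h>

/* Hypothetical, heavily simplified peer; not the real bgpd struct. */
enum peer_status { PEER_IDLE, PEER_ESTABLISHED, PEER_CLEARING, PEER_DELETED };

struct peer {
    enum peer_status status;
    unsigned int refcnt;
};

/* Only here so the sketch can show when a free actually happens. */
static unsigned int peers_freed;

/* A new peer starts with one reference, held by the configuration. */
static struct peer *peer_new(void)
{
    struct peer *p = calloc(1, sizeof(*p));

    if (p) {
        p->status = PEER_IDLE;
        p->refcnt = 1;
    }
    return p;
}

/* Take a reference, e.g. from something that points back at the peer. */
static struct peer *peer_lock(struct peer *p)
{
    p->refcnt++;
    return p;
}

/* Drop a reference; the peer is freed only when the last one goes.
 * Always returns NULL so the caller can clear its own pointer. */
static struct peer *peer_unlock(struct peer *p)
{
    if (--p->refcnt == 0) {
        free(p);
        peers_freed++;
    }
    return NULL;
}

/* 'no neighbour': park the peer in Deleted and drop the config's
 * reference. If anything else still holds a reference, the peer
 * lingers in Deleted until that reference is dropped too. */
static void peer_delete(struct peer *p)
{
    p->status = PEER_DELETED;
    peer_unlock(p);
}
```

In this picture the leak is simply a reference that is taken but never
dropped: the peer then sits in Deleted forever and never gets freed,
which is harmless for stability but costs a struct peer each time.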

Paul Jakma	paul at clubi.ie	paul at jakma.org	Key ID: 64A2FF6A
All right, let's not panic.  I'll make the money back by selling one
of my livers.  I can get by with one.

 		-- Homer Simpson
 		   Homer vs. Patty and Selma
