[quagga-dev 4132] Re: Bug in long delay networks

Spagnolo, Phillip A phillip.a.spagnolo at boeing.com
Thu May 11 19:30:08 BST 2006


Vincent,

I am sorry for the long delay in responding to this e-mail.  Somehow I
missed the response on the list.  I appended the original message at the
bottom.
 
> -----Original Message-----
> From: Vincent Jardin [mailto:vincent.jardin at 6wind.com] 
> Sent: Tuesday, April 11, 2006 3:17 PM
> To: Spagnolo, Phillip A
> Cc: quagga-dev at lists.quagga.net; Kushi, David M; Henderson, Thomas R
> Subject: Re: [quagga-dev 4082] Bug in long delay networks
> 
> Hi,
> 
> >The solution we found is to simply increase 2 in OSPF_TIMER_ON
> >(ospf->t_maxage, ospf_maxage_lsa_remover, 2) to a reasonable value.
> >Maybe 60 or 600???
> >  
> >
> There is no recommendation in the RFC for using 2, 60, or something
> else. My concern is that the higher the timer is, the more entries
> will need to be kept until the remover is run.
> 
> Since I don't think it would be possible to guess the best value
> automatically, maybe this value should be configurable from the CLI,
> and the default could remain 2, couldn't it?

As for the best default value, I don't know the answer.  However, the
same problem does not occur with a Cisco router because it keeps the
LSAs around for at least a couple hundred seconds.

A CLI addition would be fine.  We could also just put a comment in the
code and let people change it if needed.  
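
For the CLI option, here is a minimal sketch of how the delay could be
stored and validated.  The struct, the function names, and both macros
are hypothetical and are not taken from the ospfd source; the CLI
command that would call the setter is left out, and the 3600 cap is an
arbitrary choice for this sketch.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical names for illustration; not part of ospfd. */
#define MAXAGE_REMOVE_DELAY_DEFAULT 2     /* seconds, today's hard-coded value */
#define MAXAGE_REMOVE_DELAY_MAX     3600  /* arbitrary cap for this sketch */

struct maxage_cfg
{
  unsigned int maxage_delay;  /* seconds before the MaxAge remover runs */
};

/* Start from the existing behaviour so the default stays at 2. */
void
maxage_cfg_init (struct maxage_cfg *cfg)
{
  cfg->maxage_delay = MAXAGE_REMOVE_DELAY_DEFAULT;
}

/* Parse a textual argument such as one a CLI handler would receive.
   Returns 0 on success, -1 if arg is not a number or is out of range.
   On failure the stored value is left unchanged. */
int
maxage_cfg_set (struct maxage_cfg *cfg, const char *arg)
{
  char *end = NULL;
  unsigned long val;

  if (arg == NULL || *arg == '\0')
    return -1;

  errno = 0;
  val = strtoul (arg, &end, 10);
  if (errno != 0 || *end != '\0' || val > MAXAGE_REMOVE_DELAY_MAX)
    return -1;

  cfg->maxage_delay = (unsigned int) val;
  return 0;
}
```

With something like this in place, the scheduling call would pass the
stored value instead of the literal 2.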

> 
> >Attached is a patch with this fix and a couple of minor related changes
> >with explanations within the code.  
> >  
> >
> Please can you elaborate more about this comment:
> > "+  /* This does not seem to be necessary.  This LSA was already flooded
> > +     when it entered the maxage list.  This flood is redundant // */
> > " ?
> For instance, can you describe a case when it occurs ?

This is the sequence of function calls:
-ospf_lsa_flush_area()
  -MAXAGE LSA   --> set LSA to maxage
  -ospf_flood_through_area()  --> LSA is flooded throughout area
  -ospf_lsa_maxage()
     -OSPF_TIMER_ON (ospf->t_maxage, ospf_maxage_lsa_remover, 2);
        --> add to maxage list and schedule remover
-ospf_maxage_lsa_remover()
  -check if the LSA can be removed????
    -ospf_flood_through()
      -ospf_lsa_flush_area() --> already flooded above, so there is
         no need to do it again

Does this make sense?  I don't see why an LSA that has already been
maxaged and flooded needs to be reflooded after it has been checked for
neighbor state and retransmission count.
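
To make the point concrete, here is a toy model of the sequence above
that only counts floods.  Every name in it is a stand-in invented for
this sketch, not an ospfd function, and the neighbor state and
retransmission checks are deliberately left out.

```c
#include <stdbool.h>

/* Toy stand-in for an LSA; not the ospfd structure. */
struct toy_lsa
{
  bool maxaged;         /* age has been set to MaxAge */
  bool on_maxage_list;  /* queued for the remover */
  int flood_count;      /* number of times it was flooded */
};

void
toy_flood (struct toy_lsa *lsa)
{
  lsa->flood_count++;
}

/* Models the flush step: maxage the LSA, flood it, queue it. */
void
toy_flush (struct toy_lsa *lsa)
{
  lsa->maxaged = true;
  toy_flood (lsa);
  lsa->on_maxage_list = true;
}

/* Models the remover step.  reflood == true follows the sequence
   described above; reflood == false skips the second flood. */
void
toy_remover (struct toy_lsa *lsa, bool reflood)
{
  if (!lsa->on_maxage_list)
    return;
  if (reflood)
    toy_flood (lsa);
  lsa->on_maxage_list = false;
}
```

Calling toy_flush() and then toy_remover() with reflood set to true
leaves flood_count at 2; with reflood set to false it stays at 1, and
the LSA is removed from the list either way.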

Sincerely,
Phil


> 
> Regards,
>   Vincent
> 


Original Message:
All,

We have found a bug in ospfd for quagga 0.98.5 when it is used in high
delay networks.  I think the problem exists in 0.99.3 because the same
code is found there.

The bug exists in ospf_lsa.c.  It is found in ospf_lsa_maxage() when
OSPF_TIMER_ON (ospf->t_maxage, ospf_maxage_lsa_remover, 2) is called to
schedule removal of the LSA from the database.

Here is an example:
    5-----|
    |     |
    2--|  |
   /   |  |
  / 6-----|
 /  |  |  |
1---3--|  |
 \     |  |
  \ 7-----|
   \|  |
    4--|

Nodes 2,3,4 are connected by a broadcast network.
Nodes 5,6,7 are connected by a PTMP network.
Let the delay of the PTMP network be 8 secs.

If the broadcast network of 2,3,4 is brought down, then nodes 2,3,4
will generate a Network LSA and then maxage the Network LSA as all
neighbors are removed from the link.  This is correct (RFC 2328 12.4.2
para 4).  The problem is that these LSAs reach all nodes in the network
and are purged from the databases while copies are still in transit in
the PTMP network (5,6,7).  When those copies come out of the PTMP
network, they are reinstalled and flooded again because the databases
no longer hold them.  The new flood then repeats the same process.

Short story, flooding is maintained for 3600 secs.
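
The timing condition in the example can be written down as a very
simplified check.  The function name is made up for this sketch, and it
assumes the purge happens exactly when the remover timer fires, ignoring
the neighbor and retransmission checks.

```c
#include <stdbool.h>

/* Returns true when a MaxAge LSA that is still in transit arrives
   after the receiver has already purged its copy, which is the case
   that leads to the reinstall and re-flood described above.
   Both arguments are in seconds.  Simplified model, not ospfd code. */
bool
lsa_outlives_purge (unsigned int transit_delay, unsigned int remover_delay)
{
  return transit_delay > remover_delay;
}
```

With the 8 sec PTMP delay and the current remover delay of 2 the check
is true; with a remover delay of 60 it is false.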

The solution we found is to simply increase 2 in OSPF_TIMER_ON
(ospf->t_maxage, ospf_maxage_lsa_remover, 2) to a reasonable value.
Maybe 60 or 600???

Attached is a patch with this fix and a couple of minor related changes
with explanations within the code.  

Is this the correct fix???  Are there reasons not to increase this value?

Thanks,
Phil



Phil Spagnolo 
Network Technology Engineer 
The Boeing Company 
Phone:  (425) 865-6723



