[Quagga-bugs] [Bug 330] routes disappear with 'could not determine nexthop' log entry

bugzilla-daemon at allevil.dishone.st
Thu Jan 11 12:22:57 GMT 2007


Please do not reply directly to this email. All additional
comments should be made in the comments box of this bug
report.

http://bugzilla.quagga.net/show_bug.cgi?id=330

------- Additional Comments From windo at p6drad-teel.net  2007-01-11 12:22 -------
I think I understand how this problem occurs (though I don't think I understand
Quagga well enough to patch it). To reproduce it, there must be multiple links
(in this case, 2) between two routers.

When I disable the low-cost link at just the right moment, the HELLO in one
direction still gets through but the one in the other direction doesn't. (I
deduced this because I had 'watch -n1 "show ip ospf neighbor"' running on both
routers and there was a 5s difference between the dead timers, which matches my
hello interval.) When the first router decides the link is dead, it sends a new
LSA, which triggers an SPF calculation on the second router.
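
The timing can be checked with a little arithmetic. The sketch below is only an
illustration: the 5s hello interval comes from this report, but the 20s dead
interval, the assumption that both routers send their HELLOs at the same
instant, and the function name dead_timer_expiry are made up for the example.

```python
HELLO_INTERVAL = 5   # seconds, from the report
DEAD_INTERVAL = 20   # seconds; assumed (4 x hello), not stated for the test setup


def dead_timer_expiry(last_hello_seen, dead_interval=DEAD_INTERVAL):
    """Time at which a router declares its neighbour dead."""
    return last_hello_seen + dead_interval


# Simplification: both routers send a HELLO at t = 0, 5, 10, ...
# The link is disabled around t = 10 so that only one of the two HELLOs
# sent at that moment is delivered.
first_router_expiry = dead_timer_expiry(5)    # missed the t = 10 HELLO
second_router_expiry = dead_timer_expiry(10)  # still received the t = 10 HELLO

# Length of the window in which the first router has already withdrawn the
# link while the second router still believes the adjacency is up.
window = second_router_expiry - first_router_expiry
print(window)
```

Under these assumptions the window is exactly one hello interval, which agrees
with the 5s between the two SPF runs in the logs (11:55:39 and 11:55:44).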

In this situation, the logs on the second router read like this:

2007/01/11 11:55:39 OSPF: ospf_nexthop_calculation(): Start
2007/01/11 11:55:39 OSPF: V (parent): Router vertex 192.168.36.149  distance 0 flags 0
2007/01/11 11:55:39 OSPF: W (dest)  : Router vertex 192.168.25.149  distance 40 flags 0
2007/01/11 11:55:39 OSPF: ospf_nexthop_calculation(): considering link type 1 link_id 192.168.25.149 link_data 10.1.0.2
2007/01/11 11:55:39 OSPF: ospf_nexthop_calculation(): could not determine nexthop for link
2007/01/11 11:55:39 OSPF: found Router LSA 192.168.25.149
2007/01/11 11:55:39 OSPF: ospf_intra_add_router: Start
2007/01/11 11:55:39 OSPF: ospf_intra_add_router: LS ID: 192.168.25.149
2007/01/11 11:55:39 OSPF: ospf_intra_add_router: this router is neither ASBR nor ABR, skipping it
2007/01/11 11:55:39 OSPF: found Router LSA 192.168.36.149
2007/01/11 11:55:39 OSPF: The LSA is already in SPF
2007/01/11 11:55:39 OSPF: SPF Result: 0 [R] 192.168.36.149

I think this is because the SPF calculation code for point-to-point links checks
whether any of the remote router's links terminates at the local link, but none
does (since the disappearing link is what triggered the LSA in the first place),
and it does not check any of the other local links (there is only one
"considering link" entry in the log).
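
A toy model of the suspected behaviour, to make the failure concrete. This is
not the code from ospf_spf.c: the Link type, the PEERS table, the function
nexthop_first_link_only and the peer address 10.1.0.1 are hypothetical, as is
the cost on the remote side. Router IDs, the other addresses and the distances
40 and 50 are taken from the logs.

```python
from typing import NamedTuple


class Link(NamedTuple):
    link_id: str    # router ID of the neighbour at the far end
    link_data: str  # advertising router's own interface address
    cost: int


V_ID, W_ID = "192.168.36.149", "192.168.25.149"

# Hypothetical table: local interface address -> address of the p2p peer.
# 10.1.0.1 is an assumed address; 10.1.0.3 appears in the logs as the nexthop.
PEERS = {"10.1.0.2": "10.1.0.1", "10.1.0.4": "10.1.0.3"}


def nexthop_first_link_only(v_links, w_links):
    """Return (nexthop, local interface) or None; only the cheapest V->W link is tried."""
    candidates = sorted((l for l in v_links if l.link_id == W_ID),
                        key=lambda l: l.cost)
    if not candidates:
        return None
    local = candidates[0]  # the single "considering link" line in the log
    for back in w_links:
        if back.link_id == V_ID and back.link_data == PEERS.get(local.link_data):
            return back.link_data, local.link_data
    return None  # "could not determine nexthop for link"


# 11:55:39: W has already withdrawn the low-cost link, V still lists both.
v_links = [Link(W_ID, "10.1.0.2", 40), Link(W_ID, "10.1.0.4", 50)]
w_links = [Link(V_ID, "10.1.0.3", 50)]  # cost on W's side is assumed
result_39 = nexthop_first_link_only(v_links, w_links)

# 11:55:44: V's own dead timer has fired, so only the working link is left.
result_44 = nexthop_first_link_only(v_links[1:], w_links)
print(result_39, result_44)
```

In this model the first run finds no nexthop even though a usable link exists,
and the second run succeeds only because the dead link has dropped out of the
local router's own link list.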

Now, when the dead timer on the second router (the one I'm pasting the logs
from) reaches zero as well, the logs show the nexthop calculation using the
other, working link:

2007/01/11 11:55:44 OSPF: ospf_nexthop_calculation(): Start
2007/01/11 11:55:44 OSPF: V (parent): Router vertex 192.168.36.149  distance 0 flags 0
2007/01/11 11:55:44 OSPF: W (dest)  : Router vertex 192.168.25.149  distance 50 flags 0
2007/01/11 11:55:44 OSPF: ospf_nexthop_calculation(): considering link type 1 link_id 192.168.25.149 link_data 10.1.0.4
2007/01/11 11:55:44 OSPF: ospf_intra_add_router: Start
2007/01/11 11:55:44 OSPF: ospf_intra_add_router: LS ID: 192.168.25.149
2007/01/11 11:55:44 OSPF: ospf_intra_add_router: this router is neither ASBR nor ABR, skipping it
2007/01/11 11:55:44 OSPF: found Router LSA 192.168.36.149
2007/01/11 11:55:44 OSPF: The LSA is already in SPF
2007/01/11 11:55:44 OSPF: SPF Result: 0 [R] 192.168.36.149
2007/01/11 11:55:44 OSPF: SPF Result: 1 [R] 192.168.25.149
2007/01/11 11:55:44 OSPF:  nexthop 0x80c1e20 10.1.0.3 t-dcl-cpn-tdc-2:10.1.0.4

This results in working routing later on.

I think this is a problem because, while the router has no route, it answers
with ICMP net unreachable, which causes clients to fail and give up rather than
retry. And this is apart from the fact that there are (or at least could be)
other working routes available.

This could probably be fixed by trying the other links to build the tree as
well (if the first ones fail)?
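
A rough sketch of what such a fallback could look like, as a toy model and not
as a patch against ospf_spf.c. The Link type, the PEERS table, the function
nexthop_try_all_links and the peer address 10.1.0.1 are hypothetical, and so is
the cost on the remote side; the router IDs and the remaining addresses are
taken from the logs.

```python
from typing import NamedTuple


class Link(NamedTuple):
    link_id: str    # router ID of the neighbour at the far end
    link_data: str  # advertising router's own interface address
    cost: int


V_ID, W_ID = "192.168.36.149", "192.168.25.149"

# Hypothetical table: local interface address -> address of the p2p peer.
PEERS = {"10.1.0.2": "10.1.0.1", "10.1.0.4": "10.1.0.3"}


def nexthop_try_all_links(v_links, w_links):
    """Return (nexthop, local interface), trying every V->W link in cost order."""
    candidates = sorted((l for l in v_links if l.link_id == W_ID),
                        key=lambda l: l.cost)
    for local in candidates:
        for back in w_links:
            if back.link_id == V_ID and back.link_data == PEERS.get(local.link_data):
                return back.link_data, local.link_data
        # No counterpart for this link in W's LSA: fall through to the next one.
    return None


# State at 11:55:39: W has withdrawn the low-cost link, V still lists both.
v_links = [Link(W_ID, "10.1.0.2", 40), Link(W_ID, "10.1.0.4", 50)]
w_links = [Link(V_ID, "10.1.0.3", 50)]  # cost on W's side is assumed
fallback = nexthop_try_all_links(v_links, w_links)
print(fallback)
```

With the link state from the failing SPF run, this loop would pick 10.1.0.3 via
10.1.0.4 right away, the same nexthop the real daemon only finds 5 seconds
later.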

There were also a couple of longer routing outages in our live environment
(the hello/dead timers are 10/40 there): first a 30-second one and then a
5-second one. If my understanding of this bug is right, then that could
theoretically have been the same issue, where the 30-second gap could have been
produced by a short loss of connectivity in one direction?

PS:
This latest test was with quagga-0.99.5, but a diff with the 0.99.6 source
revealed no differences in ospf_spf.c, so I'll refrain from testing with the CVS
version for now (especially since I saw the problem occurring with the CVS
version as well).
  
  
  
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

