[quagga-dev 4182] Re: possible ospfd bug: in state ExStart with routers that are neither DR nor BDR

Andrew J. Schorr aschorr at telemetry-investments.com
Thu Jun 29 03:34:00 BST 2006


On Thu, Jun 29, 2006 at 02:23:46AM +0100, Paul Jakma wrote:
> On Wed, 28 Jun 2006, Andrew J. Schorr wrote:
> 
> >I'm not sure if this is at all helpful; I doubt it. :-(
> 
> There was an election on your network and the DR and/or BDR changed 
> (presumably both).

Yes, that's what I think.  We had turned off a T3 wan link to
test that the backup route (4 T1 circuits bonded with TEQL) was
working properly.  When we turned it back on, we started getting
the errors.

> 77's NSM for some reason failed to reset state properly for those two 
> routers, or missed an AdjOK? event...
> 
> Can you dump core of 77's ospfd? (e.g. 'gcore' or with gdb?). 
> Actually, 'show ip ospf neigh <ip addr>' for each of the ExStart 
> neighbours probably is all the info. Anything earlier in the logs of 
> 77 (particularly about ISM changes and/or elections and anything 
> after that.)

Sadly (foolishly?), I already restarted the ospfd, so too late for a core or
for more debugging info.  But I imagine this will happen again
(I believe it has happened in the past), so I think we'll get more
chances to debug it.

The T3 went down at 19:14:27, and here's what I see in ti77's log at
that time:

2006/06/28 19:14:27 OSPF: AdjChg: Nbr 192.168.11.76 on eth0.19:192.168.167.77: 2-Way -> Down
2006/06/28 19:14:27 OSPF: ospfTrapNbrStateChange trap sent: 192.168.167.76 now Down/DROther
2006/06/28 19:14:27 OSPF: DR-Election[1st]: Backup 192.168.167.78
2006/06/28 19:14:27 OSPF: DR-Election[1st]: DR     192.168.167.79
2006/06/28 19:14:27 OSPF: AdjChg: Nbr 192.168.11.74 on eth0.19:192.168.167.77: 2-Way -> Down
2006/06/28 19:14:27 OSPF: ospfTrapNbrStateChange trap sent: 192.168.167.74 now Down/DROther
2006/06/28 19:14:27 OSPF: DR-Election[1st]: Backup 192.168.167.78
2006/06/28 19:14:27 OSPF: DR-Election[1st]: DR     192.168.167.79
2006/06/28 19:14:27 OSPF: AdjChg: Nbr 192.168.11.75 on eth0.19:192.168.167.77: 2-Way -> Down
2006/06/28 19:14:27 OSPF: ospfTrapNbrStateChange trap sent: 192.168.167.75 now Down/DROther
2006/06/28 19:14:27 OSPF: DR-Election[1st]: Backup 192.168.167.78
2006/06/28 19:14:27 OSPF: DR-Election[1st]: DR     192.168.167.79

There are 3 hosts on each side of the T3 link, so that explains
the 3 AdjChg Down messages.  On one side we have ti74, ti75, & ti76,
and on the other ti77, ti78, & ti79.  So with the T3 bridge down, the network
is partitioned into two pieces.

We then brought the T3 back up 34 seconds later, and here's what's in the log:

2006/06/28 19:15:01 OSPF: DR-Election[1st]: Backup 192.168.167.78
2006/06/28 19:15:01 OSPF: DR-Election[1st]: DR     192.168.167.79
2006/06/28 19:15:01 OSPF: DR-Election[1st]: Backup 192.168.167.78
2006/06/28 19:15:01 OSPF: DR-Election[1st]: DR     192.168.167.79
2006/06/28 19:15:01 OSPF: DR-Election[1st]: Backup 192.168.167.78
2006/06/28 19:15:01 OSPF: DR-Election[1st]: DR     192.168.167.79
2006/06/28 19:15:01 OSPF: DR-Election[1st]: Backup 192.168.167.78
2006/06/28 19:15:01 OSPF: DR-Election[1st]: DR     192.168.167.79
2006/06/28 19:15:01 OSPF: Packet[DD]: Neighbor 192.168.11.75: Initial DBD from Slave, ignoring.
2006/06/28 19:15:01 OSPF: ospfTrapNbrStateChange trap sent: 192.168.167.74 now 2-Way/DROther
2006/06/28 19:15:01 OSPF: DR-Election[1st]: Backup 192.168.167.78
2006/06/28 19:15:01 OSPF: DR-Election[1st]: DR     192.168.167.79

And it's at this same time that the errors started in ti75's log:

2006/06/28 19:15:01 OSPF: Packet[DD]: Neighbor 192.168.130.77 state is 2-Way, packet discarded.
...

Note that, from ti77's perspective, the DR & BDR did not change when
the link came back up (still ti78 & ti79, which are on ti77's side
of the T3).  But the election on the other side (ti75's side) did result
in a change of DR & BDR.  When we took down the link, we saw this
on ti75:

2006/06/28 19:14:27 OSPF: AdjChg: Nbr 192.168.130.79 on eth0.19:192.168.167.75: Full -> Down
2006/06/28 19:14:27 OSPF: nsm_change_state(192.168.130.79, Full -> Down): scheduling new router-LSA origination
2006/06/28 19:14:27 OSPF: DR-Election[1st]: Backup 192.168.167.78
2006/06/28 19:14:27 OSPF: DR-Election[1st]: DR     192.168.167.78
2006/06/28 19:14:27 OSPF: AdjChg: Nbr 192.168.130.78 on eth0.19:192.168.167.75: Full -> Down
2006/06/28 19:14:27 OSPF: nsm_change_state(192.168.130.78, Full -> Down): scheduling new router-LSA origination
2006/06/28 19:14:27 OSPF: DR-Election[1st]: Backup 192.168.167.77
2006/06/28 19:14:27 OSPF: DR-Election[1st]: DR     192.168.167.77
2006/06/28 19:14:27 OSPF: AdjChg: Nbr 192.168.130.77 on eth0.19:192.168.167.75: ExStart -> Down
2006/06/28 19:14:27 OSPF: ospfTrapNbrStateChange trap sent: 192.168.167.77 now Down/DR
2006/06/28 19:14:27 OSPF: DR-Election[1st]: Backup 192.168.167.76
2006/06/28 19:14:27 OSPF: DR-Election[1st]: DR     192.168.167.76
2006/06/28 19:14:27 OSPF: Packet[DD]: Neighbor 192.168.11.76 Negotiation done (Slave).
2006/06/28 19:14:27 OSPF: DR-Election[1st]: Backup 192.168.167.75
2006/06/28 19:14:27 OSPF: DR-Election[1st]: DR     192.168.167.76
2006/06/28 19:14:27 OSPF: DR-Election[2nd]: Backup 192.168.167.75
2006/06/28 19:14:27 OSPF: DR-Election[2nd]: DR     192.168.167.76
2006/06/28 19:14:27 OSPF: ospfTrapIfStateChange trap sent: 192.168.167.75 now Backup
2006/06/28 19:14:27 OSPF: interface 192.168.167.75 [12] join AllDRouters Multicast group.
2006/06/28 19:14:27 OSPF: Packet[DD]: Neighbor 192.168.11.74: Initial DBD from Slave, ignoring.
2006/06/28 19:14:27 OSPF: Packet[DD]: Neighbor 192.168.11.74 Negotiation done (Master).
2006/06/28 19:14:27 OSPF: AdjChg: Nbr 192.168.11.76 on eth0.19:192.168.167.75: Exchange -> Full
2006/06/28 19:14:27 OSPF: nsm_change_state(192.168.11.76, Exchange -> Full): scheduling new router-LSA origination
2006/06/28 19:14:27 OSPF: AdjChg: Nbr 192.168.11.74 on eth0.19:192.168.167.75: Exchange -> Full
2006/06/28 19:14:27 OSPF: nsm_change_state(192.168.11.74, Exchange -> Full): scheduling new router-LSA origination

So at that point (with the network partitioned into 2 pieces), ti75 was the
BDR, and ti76 was the DR on one side, with ti78/ti79 on the other.

And when the link came back up, we have this in ti75's log:

2006/06/28 19:15:01 OSPF: DR-Election[1st]: Backup 192.168.167.75
2006/06/28 19:15:01 OSPF: DR-Election[1st]: DR     192.168.167.76
2006/06/28 19:15:01 OSPF: DR-Election[1st]: Backup 192.168.167.75
2006/06/28 19:15:01 OSPF: DR-Election[1st]: DR     192.168.167.79
2006/06/28 19:15:01 OSPF: DR-Election[1st]: Backup 192.168.167.75
2006/06/28 19:15:01 OSPF: DR-Election[1st]: DR     192.168.167.79
2006/06/28 19:15:01 OSPF: DR-Election[1st]: Backup 192.168.167.75
2006/06/28 19:15:01 OSPF: DR-Election[1st]: DR     192.168.167.79
2006/06/28 19:15:01 OSPF: DR-Election[1st]: Backup 192.168.167.78
2006/06/28 19:15:01 OSPF: DR-Election[1st]: DR     192.168.167.79
2006/06/28 19:15:01 OSPF: DR-Election[2nd]: Backup 192.168.167.78
2006/06/28 19:15:01 OSPF: DR-Election[2nd]: DR     192.168.167.79
2006/06/28 19:15:01 OSPF: ospfTrapIfStateChange trap sent: 192.168.167.75 now DROther
2006/06/28 19:15:01 OSPF: interface 192.168.167.75 [12] leave AllDRouters Multicast group.
2006/06/28 19:15:01 OSPF: AdjChg: Nbr 192.168.11.74 on eth0.19:192.168.167.75: Full -> 2-Way
2006/06/28 19:15:01 OSPF: nsm_change_state(192.168.11.74, Full -> 2-Way): scheduling new router-LSA origination
2006/06/28 19:15:01 OSPF: AdjChg: Nbr 192.168.11.76 on eth0.19:192.168.167.75: Full -> 2-Way
2006/06/28 19:15:01 OSPF: nsm_change_state(192.168.11.76, Full -> 2-Way): scheduling new router-LSA origination
2006/06/28 19:15:01 OSPF: AdjChg: Nbr 192.168.130.77 on eth0.19:192.168.167.75: ExStart -> 2-Way
2006/06/28 19:15:01 OSPF: ospfTrapNbrStateChange trap sent: 192.168.167.77 now 2-Way/DROther
2006/06/28 19:15:01 OSPF: DR-Election[1st]: Backup 192.168.167.78
2006/06/28 19:15:01 OSPF: DR-Election[1st]: DR     192.168.167.79
2006/06/28 19:15:01 OSPF: Packet[DD]: Neighbor 192.168.130.79 Negotiation done (Slave).
2006/06/28 19:15:01 OSPF: Packet[DD]: Neighbor 192.168.130.77 state is 2-Way, packet discarded.
2006/06/28 19:15:01 OSPF: Packet[DD]: Neighbor 192.168.130.78 Negotiation done (Slave).
2006/06/28 19:15:01 OSPF: AdjChg: Nbr 192.168.130.78 on eth0.19:192.168.167.75: Exchange -> Full
2006/06/28 19:15:01 OSPF: nsm_change_state(192.168.130.78, Exchange -> Full): scheduling new router-LSA origination
2006/06/28 19:15:01 OSPF: AdjChg: Nbr 192.168.130.79 on eth0.19:192.168.167.75: Exchange -> Full
2006/06/28 19:15:01 OSPF: nsm_change_state(192.168.130.79, Exchange -> Full): scheduling new router-LSA origination

It almost seems as if ti75 & ti77 moved into ExStart before the new election
was completed, and then ti77 failed to reset state properly?  I say
this because of this message:

2006/06/28 19:15:01 OSPF: AdjChg: Nbr 192.168.130.77 on eth0.19:192.168.167.75: ExStart -> 2-Way

So this suggests that ti77 got into state ExStart before ti75 regressed
it to 2-Way...

Regards,
Andy



More information about the Quagga-dev mailing list