[quagga-dev 3510] Re: rfc2385

Andreas John quagga at aj.net-lab.net
Sat Jun 11 10:54:32 BST 2005


Hi All!

> hi,
> we have the same strange kernel panic as mentioned before, with quagga:
> .
> kernel BUG at slab.c:1130!

I can confirm that bug! This router was (is) running in production for 
several weeks (months?). Today I had to reset it for a hardware issue
(new kind of power distribution, so we had to pull the plug).

After rebooting ... the machine ran for 1 minute .... and boom ... same 
kernel crash. After the second reboot (someone noticed sweat on my face) the 
machine came up with:

Jun 11 01:34:14 core03 kernel: MD5 Hash NOT expected but found 
(xxxxxxxx, 179)->(xxxxxxxxx, 32768)

I assume that this happens when the neighbor's session did not go down and 
the hash is still around, i.e. we get a packet with md5 for a tcp 
connection that we didn't build up from our point of view?

The peer did not go up until I made a shut / no shut on this particular 
neighbor. Then everything worked as expected.

> - this problem comes randomly after quagga start, sometimes is router 
> alive 1 minute, sometimes 2-3 days
> - system is Debian Woody, quagga_0.98.3-0.backports
> - vanilla kernel 2.4.28+rfc2385-2.4.28 patch, also tested vanilla kernel 
> 2.4.29+rfc2385-2.4.28.patch, kernel-2.4.30+rfc2385-2.4.30.patch

I'm using 2.4.30 vanilla + RFC2385.

The only difference from the last time it was rebooted is that the routing 
table grew a little ;) and that I activated md5 for our upstream (which is a 
Cisco 7206VXR AFAIR .. could be a 75xx ... I don't remember).

As some of you may be aware, I am not a coder. But I know about 
if/then/while constructs, so I dared to have a look into the source of slab.c 
(line 1130 as the previous poster mentioned - I don't know if I got exactly the 
same line, but I guess so):

/*
 * The test for missing atomic flag is performed here, rather than
 * the more obvious place, simply to reduce the critical path length
 * in kmem_cache_alloc(). If a caller is seriously mis-behaving they
 * will eventually be caught here (where it matters).
 */
if (in_interrupt() && (flags & SLAB_LEVEL_MASK) != SLAB_ATOMIC)
        BUG();
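
To make clearer what that test does, here is a small user-space sketch of its logic. It is only an illustration: the DEMO_* flag values and the function names are made up for this example and are not the kernel's definitions, and in_interrupt() is replaced by a plain boolean argument. The idea it shows is that an allocation made from interrupt context must carry exactly the "atomic" level flags, so a request that is allowed to sleep trips the check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in flag bits; the real ones live in the kernel headers. */
#define DEMO_WAIT        0x01u  /* allocation is allowed to sleep */
#define DEMO_HIGH        0x02u  /* allocation may use emergency reserves */
#define DEMO_IO          0x04u  /* allocation may start I/O */
#define DEMO_LEVEL_MASK  (DEMO_WAIT | DEMO_HIGH | DEMO_IO)
#define DEMO_ATOMIC      (DEMO_HIGH)                        /* never sleeps */
#define DEMO_KERNEL      (DEMO_WAIT | DEMO_HIGH | DEMO_IO)  /* may sleep */

/* Same shape as the quoted test: returns true where BUG() would fire. */
bool slab_check_would_bug(bool in_interrupt, unsigned int flags)
{
    return in_interrupt && (flags & DEMO_LEVEL_MASK) != DEMO_ATOMIC;
}

/* Walks through the three interesting combinations. */
void slab_check_demo(void)
{
    /* Atomic request from interrupt context: allowed. */
    assert(!slab_check_would_bug(true, DEMO_ATOMIC));
    /* Sleeping request from interrupt context: this is the BUG() case. */
    assert(slab_check_would_bug(true, DEMO_KERNEL));
    /* Sleeping request from process context: allowed. */
    assert(!slab_check_would_bug(false, DEMO_KERNEL));
}
```

Read this way, the check fires when some code path allocates slab memory from interrupt context with flags other than the atomic ones. Whether that is what the patch actually does is not shown by this sketch.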

I also looked into rfc2385-2.4.30.patch and 
ht-20050321-0.98.2-bgp-md5.patch and spotted two occurrences of something 
that has to do with kmem..., one of them shortly after freeing the memory for 
the md5 key:

-----
if (atomic_dec_and_test(&tw->refcnt)) {
...
#ifdef CONFIG_TCP_RFC2385
                /* Free the memory used for any md5 key */
                if (tw->md5_key) {
                        kfree (tw->md5_key);
                        tw->md5_key = NULL;
                        tw->md5_keylen = 0;
                }
#endif
                kmem_cache_free(tcp_timewait_cachep, tw);
-----

It reads like (atomic_dec_and_test(&tw->refcnt)) is true, but after 
calling kmem_cache_free(tcp_timewait_cachep, tw); it's no longer atomic? 
Isn't that a race condition? So I assume all of this happens when
"MD5 Hash NOT expected but found" (see above) shows up, i.e. the tw->md5 stuff 
is cleared and at the same time an unexpected one is coming in (out?) 
and gets processed via interrupt? If the timing is "lucky" we get a race 
condition?

Rgds,
Andreas





