[quagga-users 14031] "SLOW THREAD" errors and "Hold Timer Expired" events on all neighbors

Andrew Gideon andrew-quagga-users-918 at tagonline.com
Mon Jun 29 04:26:45 BST 2015


I've an older "router" running CentOS5 that I'm trying to replace with
one running CentOS7.  Both machines have 1G of RAM.

I'm having no problems with the older router.  My reason for upgrading
is to get away from the old 2.6.18 kernel and into a newer (3.10) kernel
with support for RFC6164, ipset, etc.

The older machine is running Quagga v0.99.22.4.  I've tried both
quagga-0.99.22.4-4.el7 and 0.99.24.1 on the newer machine.

In both cases, there are up to six IPv4 neighbors.  Two are eBGP peers
providing full feeds.  The other four are iBGP peers providing whatever
routes are "best" on those devices.

Because of our topology, this usually means 500000+ routes from three of
the six peers.  The others send only a small number of routes for the
subnets to which they provide gateway service.

The newer router also has two IPv6 iBGP peerings, each providing about
10000 to 20000 routes.

Very rarely, I'll see something like:

        SLOW THREAD: task bgp_read (987250) ran for 9292ms (cpu time 9266ms)

on the older router.

On the newer router, the slightest route change (i.e., adding or removing
a gateway IP for some subnet, which means adding or removing a single
route that will be distributed by BGP) often causes errors like:

        SLOW THREAD: task bgp_scan_timer (7fc91c8a9120) ran for 62227ms (cpu time 1875ms)

Sometimes, on the newer router, the problem goes further and all the
peerings drop with:

        %NOTIFICATION: sent to neighbor 207.111.77.38 4/0 (Hold Timer Expired) 0 bytes

That never occurs on the older router.
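
One way to compare the two SLOW THREAD samples above is to look at the
gap between wall-clock time and CPU time for each.  The following is a
minimal Python sketch, assuming the log format is exactly as in the two
lines quoted above; the regular expression and the function name
parse_slow_thread are illustrative, not part of Quagga.

```python
import re

# Pattern inferred only from the two sample lines quoted in this message;
# other Quagga versions may format the line differently.
SLOW_THREAD_RE = re.compile(
    r"SLOW THREAD: task (?P<task>\S+) \((?P<addr>[0-9a-fA-F]+)\) "
    r"ran for (?P<wall>\d+)ms \(cpu time (?P<cpu>\d+)ms\)"
)


def parse_slow_thread(line):
    """Return (task, wall_ms, cpu_ms, off_cpu_ms), or None if no match."""
    m = SLOW_THREAD_RE.search(line)
    if m is None:
        return None
    wall = int(m.group("wall"))
    cpu = int(m.group("cpu"))
    # Time the thread was running according to the clock but not on the CPU.
    return m.group("task"), wall, cpu, wall - cpu


samples = [
    "SLOW THREAD: task bgp_read (987250) ran for 9292ms (cpu time 9266ms)",
    "SLOW THREAD: task bgp_scan_timer (7fc91c8a9120) ran for 62227ms (cpu time 1875ms)",
]
for sample in samples:
    print(parse_slow_thread(sample))
```

For the older router's sample the off-CPU time is 26ms; for the newer
router's sample it is 60352ms.  A task that spends about a minute off
the CPU is waiting on something rather than computing, which would be
consistent with paging or blocked I/O but does not by itself prove it.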

I'm trying to understand why the older and newer routers behave so
differently.

I've noticed that bgpd gets significantly larger on the newer
router than on the older, with ps reporting about 400000+M on the older
router and as much as 700000+M on the newer router.  Given that the
machines are rather memory-limited, I'm guessing that the problem on the
newer router is that the process is paged out too often.
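
A way to check how much of bgpd is resident versus swapped is to read
the Vm* lines from /proc/<pid>/status.  This is a minimal Python sketch;
the function names are hypothetical, and the VmSwap line is an
assumption about the newer kernel only, since older kernels such as
2.6.18 may not report it.

```python
def parse_proc_status(text):
    """Extract the Vm* fields (values in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.startswith("Vm"):
            parts = value.split()
            # Each Vm* line looks like "VmRSS:     1234 kB".
            if parts and parts[0].isdigit():
                fields[key] = int(parts[0])
    return fields


def read_proc_status(pid):
    """Read and parse the status file of a running process (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return parse_proc_status(f.read())


def swapped_fraction(fields):
    """Return VmSwap / (VmRSS + VmSwap), or None if VmSwap is not reported."""
    if "VmSwap" not in fields or "VmRSS" not in fields:
        return None
    total = fields["VmRSS"] + fields["VmSwap"]
    return float(fields["VmSwap"]) / total if total else 0.0
```

Calling read_proc_status() with the pid of bgpd and passing the result
to swapped_fraction() gives a rough idea of how much of the process is
sitting in swap at that moment.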

I seem to have improved things (made the errors less frequent) by
making the following changes on the new router:

      * Removing all "soft-reconfiguration inbound" which, at least on
        Cisco routers, consumes extra memory.  Note that this remains
        enabled on the older router for all neighbors.
      * Renicing the bgpd process to -4
      * Ionicing the bgpd process to Realtime

I've also tried running on the new router the same bgpd configuration
(with a change in neighbor IPs, of course) as on the older router.  Even
without everything (neighbors, additional route-maps, etc.) added for the
IPv6 neighbors, the newer router still uses more memory than the older
router.

I'm going to try to put some more memory into the newer router later
this week.  But...I'm still discomforted by this (or at least my lack of
understanding of what is behind this).  I don't *know* that memory (and
therefore paging) is behind the timeouts, but I suspect so if only
because I tend to assume two odd problems occurring together are
related.
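
One way to test the paging suspicion directly would be to watch the
major page fault counter of bgpd while making a route change.  The
following is a minimal Python sketch, assuming a Linux /proc filesystem
where majflt is field 12 of /proc/<pid>/stat; the function names and
the sampling interval are illustrative choices.

```python
import time


def parse_majflt(stat_text):
    """Return the major page fault count (field 12) from /proc/<pid>/stat text."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces, so split on the last ')' before counting fields.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 12 (majflt) is rest[9].
    return int(rest[9])


def read_majflt(pid):
    """Read the current major fault count of a running process."""
    with open("/proc/%d/stat" % pid) as f:
        return parse_majflt(f.read())


def watch_majflt(pid, interval_s=5.0, samples=12):
    """Print how many major faults the process took in each interval."""
    previous = read_majflt(pid)
    for _ in range(samples):
        time.sleep(interval_s)
        current = read_majflt(pid)
        print("major faults in last %.0fs: %d" % (interval_s, current - previous))
        previous = current
```

Major faults are those that require I/O to bring a page back into
memory, so a burst of them coinciding with a SLOW THREAD message would
support the paging explanation, while a flat counter would point
elsewhere.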

If anyone has any thoughts or suggestions, I'd welcome them.

Thanks...

	Andrew




