[quagga-users 14035] Re: "SLOW THREAD" errors and "Hold Timer Expired" events on all neighbors

Donald Sharp sharpd at cumulusnetworks.com
Wed Jul 1 15:40:23 BST 2015


I agree that it's a memory issue, but we are not doing ourselves any
favors by creating hashes with 256 buckets, storing hundreds of
thousands of entries in them, and then expanding and reordering the
hash (which hash_get does) as more entries arrive.

/* Allocate a new hash with default hash size.  */
struct hash *
hash_create (unsigned int (*hash_key) (void *),
             int (*hash_cmp) (const void *, const void *))
{
  return hash_create_size (HASH_INITIAL_SIZE, hash_key, hash_cmp);
}

/* Default hash table size.  */
#define HASH_INITIAL_SIZE     256       /* initial number of buckets. */
#define HASH_THRESHOLD        10        /* expand when buckets get too full. */
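
To make the cost concrete, here is a simplified sketch of the kind of
chained lookup hash_get performs (illustrative only, not the actual
lib/hash.c code; the struct and function names are made up).  With 256
buckets and hundreds of thousands of entries, the work ends up either in
long walks of these chains or in repeated expand-and-rehash passes:

#include <stddef.h>

/* Illustrative only -- not the actual lib/hash.c code.  A chained hash
 * lookup hashes the item, picks a bucket, then walks that bucket's
 * linked list comparing entries. */
struct bucket
{
  unsigned int key;
  void *data;
  struct bucket *next;
};

static void *
lookup_sketch (struct bucket **index, unsigned int size,
               unsigned int key, void *data,
               int (*cmp) (const void *, const void *))
{
  struct bucket *b;

  /* Cost is proportional to the chain length in this one bucket. */
  for (b = index[key % size]; b != NULL; b = b->next)
    if (b->key == key && cmp (b->data, data))
      return b->data;

  return NULL;
}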

This will show up in hash_get, especially since it ends up walking the
per-bucket linked list.  I would be interested in seeing the call tree
for hash_get, figuring out which tables are involved, and using
hash_create_size for those hashes (perf report -g -o <outputfile> should
do the trick).
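
For illustration only (not a patch; the table name, callbacks, and size
below are placeholders): once the offending tables are identified,
pre-sizing them with hash_create_size might look roughly like this:

#include "hash.h"   /* Quagga's lib/hash.h, for hash_create_size() */

/* Placeholders standing in for whatever key/compare callbacks the real
 * table already registers. */
extern unsigned int table_key_make (void *);
extern int table_cmp (const void *, const void *);

/* Illustrative starting size, chosen from the observed entry counts
 * rather than HASH_INITIAL_SIZE's 256. */
#define BIG_TABLE_INITIAL_SIZE  32768

static struct hash *big_table;

static void
big_table_init (void)
{
  /* Same callbacks as before, but enough buckets up front that
   * hash_get does not have to keep expanding and rehashing while
   * half a million entries are inserted. */
  big_table = hash_create_size (BIG_TABLE_INITIAL_SIZE,
                                table_key_make, table_cmp);
}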

I would especially be interested in seeing the call graph for the first two
items on the perf list.

donald

On Wed, Jul 1, 2015 at 10:08 AM, Andrew Gideon <andrew-quagga-users-918 at tagonline.com> wrote:

> On Mon, 2015-06-29 at 16:52 -0400, Donald Sharp wrote:
> > I wouldn't mind getting a look at a perf run over bgp, as it is coming
> > up/receiving all these routes.
>
>         [root@m10015 tmpdebug]# perf record -o addingnttfeed.fp -p 12977 --call-graph fp
>         ^C[ perf record: Woken up 38 times to write data ]
>         [ perf record: Captured and wrote 9.470 MB addingnttfeed.fp (~413770 samples) ]
>
>         [root@m10015 tmpdebug]#
>
>
> I took this after bgpd had stabilized with four iBGP peers and no eBGP
> peers.  The iBGP peers had provided 535613, 125415, 11, and 43 routes
> respectively.
>
> I started perf record and then added an eBGP peer which eventually sent
> 535542 routes (jittering up and down a bit).  Once the number of routes
> from the eBGP peer had settled down reasonably well, I stopped the
> recording.
>
> Here's the first page:
>
>         +   22.58%     0.00%  bgpd  [unknown]             [.] 0000000000000000
>         +   16.66%    16.07%  bgpd  libzebra.so.0.0.0     [.] hash_get
>         +    6.62%     0.01%  bgpd  [kernel.kallsyms]     [k] system_call_fastpath
>         +    6.12%     5.57%  bgpd  libc-2.17.so          [.] _int_malloc
>         +    4.92%     0.42%  bgpd  libc-2.17.so          [.] __vsnprintf_chk
>         +    4.16%     0.04%  bgpd  libc-2.17.so          [.] __select
>         +    4.08%     3.96%  bgpd  libzebra.so.0.0.0     [.] route_node_get
>         +    3.92%     0.04%  bgpd  [kernel.kallsyms]     [k] sys_select
>         +    3.66%     0.08%  bgpd  [kernel.kallsyms]     [k] core_sys_select
>         +    3.50%     0.29%  bgpd  [kernel.kallsyms]     [k] do_select
>         +    3.34%     3.22%  bgpd  libzebra.so.0.0.0     [.] route_lock_node
>         +    3.25%     3.17%  bgpd  bgpd                  [.] 0x0000000000033b72
>         +    2.85%     2.72%  bgpd  bgpd                  [.] 0x0000000000054dd2
>         +    2.85%     2.79%  bgpd  bgpd                  [.] 0x0000000000033b34
>         +    2.57%     2.50%  bgpd  libc-2.17.so          [.] __libc_calloc
>         +    2.27%     2.23%  bgpd  libc-2.17.so          [.] _int_free
>         +    2.25%     2.17%  bgpd  libc-2.17.so          [.] vfprintf
>         +    2.21%     2.19%  bgpd  libc-2.17.so          [.] malloc_consolidate
>         +    2.19%     2.13%  bgpd  libzebra.so.0.0.0     [.] prefix_match
>         +    2.04%     0.07%  bgpd  libc-2.17.so          [.] __GI___libc_read
>         +    2.03%     0.31%  bgpd  [kernel.kallsyms]     [k] sock_poll
>         +    1.76%     0.07%  bgpd  [kernel.kallsyms]     [k] apic_timer_interrupt
>         +    1.67%     0.02%  bgpd  [kernel.kallsyms]     [k] smp_apic_timer_interrupt
>         +    1.66%     0.03%  bgpd  [kernel.kallsyms]     [k] sys_read
>         +    1.58%     0.04%  bgpd  [kernel.kallsyms]     [k] vfs_read
>         +    1.52%     0.49%  bgpd  [kernel.kallsyms]     [k] tcp_poll
>         +    1.43%     0.02%  bgpd  [kernel.kallsyms]     [k] local_apic_timer_interrupt
>         +    1.39%     0.00%  bgpd  [unknown]             [.] 0x00000000274d6fcf
>         +    1.38%     0.04%  bgpd  [kernel.kallsyms]     [k] hrtimer_interrupt
>         +    1.38%     0.06%  bgpd  libc-2.17.so          [.] __GI___getrusage
>         +    1.29%     0.01%  bgpd  [kernel.kallsyms]     [k] __run_hrtimer
>         +    1.22%     0.03%  bgpd  [kernel.kallsyms]     [k] do_sync_read
>         +    1.22%     1.19%  bgpd  libzebra.so.0.0.0     [.] work_queue_run
>         +    1.19%     0.00%  bgpd  [unknown]             [.] 0x00007f4171ed7530
>         +    1.18%     0.01%  bgpd  [kernel.kallsyms]     [k] sock_aio_read
>         +    1.17%     0.00%  bgpd  [kernel.kallsyms]     [k] tick_sched_timer
>         +    1.15%     0.05%  bgpd  [kernel.kallsyms]     [k] sock_aio_read.part.7
>
>
> I ran this a second time with a more burdensome sample.  It started in
> the same "iBGP only" state (after I'd restarted bgpd to bring things
> back to zero), but I added, one after the other, two eBGP peers from
> different ASes, each providing over 500000 routes.  I also added a local
> gateway IP with an implied /29 route, which bgpd redistributes to the
> iBGP peers.
>
>         [root@m10015 tmpdebug]# perf record -o addingnttfeed.fp.2 -p 13398 --call-graph fp
>         ^C[ perf record: Woken up 128 times to write data ]
>         [ perf record: Captured and wrote 32.192 MB addingnttfeed.fp.2 (~1406481 samples) ]
>
>         [root@m10015 tmpdebug]#
>
> I'll post the first page of the report separately.  It is taking a long
> time to load because of the same memory-limiting issues that likely
> started this thread in the first place.
>
> It turns out that this machine requires ECC memory, and I don't have
> any on hand.  It will therefore take a little time to try upgrading the
> memory.
>
> However, I have a CentOS 6 device with only 0.5 GB of RAM that, when
> running Quagga, exhibits similar errors.  I brought it up to 1.5 GB of
> RAM to see what would happen, and the errors ceased.  I'm reasonably
> comfortable, therefore, that this is related to memory.
>
>         - Andrew
>
>
>