[quagga-dev 3506] IPv6 and netlink problems revisited

Hasso Tepper hasso at estpak.ee
Thu Jun 9 13:57:51 BST 2005

Between 0.98.2 and 0.98.3 I fixed a race condition in the netlink code which 
caused trouble with IPv6 routes. It turned out that although the race is 
fixed, the fix introduced an even bigger problem. Let me remind you how this 
netlink stuff works in the zebra daemon.

We use two netlink sockets. One (netlink-cmd) is for sending messages to the 
kernel and receiving ACKs. It isn't subscribed to any netlink group, so it 
shouldn't receive any route/address/etc messages _originated_ from the kernel. 
The other socket (netlink-listen) is for receiving messages from the kernel, 
i.e. all messages about added/deleted interfaces/addresses/routes come to the 
zebra daemon via this socket. All this allows us to handle kernel 
communication in quite a clever way: we can block the netlink-cmd socket until 
the ACK for the message we sent is received, without any worry that we must 
drop something useful because of that.

1) Send message to the kernel and set socket blocking
2) Wait until ack is received
3) Set socket nonblocking

* See zebra/rt_netlink.c:netlink_talk() for details.
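
A minimal sketch of the three steps above, not the actual netlink_talk() 
code. It assumes a plain datagram socket and a made-up message layout 
(struct demo_msg); demo_set_nonblocking() and demo_talk() are hypothetical 
names used only for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a netlink message; not the real nlmsghdr. */
struct demo_msg {
    uint32_t seq;    /* sequence number chosen by the sender */
    int      is_ack; /* nonzero if this message is an ACK */
};

/* Switch O_NONBLOCK on or off; returns 0 on success, -1 on error. */
static int demo_set_nonblocking(int fd, int on)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    flags = on ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
    return fcntl(fd, F_SETFL, flags);
}

/* Send one request and wait for the ACK carrying the same sequence
 * number. Returns 0 once acknowledged, -1 on error. */
static int demo_talk(int fd, uint32_t seq)
{
    struct demo_msg req = { seq, 0 };
    struct demo_msg reply;
    int ret = -1;

    assert(fd >= 0);

    /* 1) send the message and set the socket blocking */
    if (send(fd, &req, sizeof req, 0) != (ssize_t) sizeof req)
        return -1;
    if (demo_set_nonblocking(fd, 0) < 0)
        return -1;

    /* 2) wait until the ACK is received; anything else read here
     *    is thrown away */
    for (;;) {
        ssize_t n = recv(fd, &reply, sizeof reply, 0);

        if (n != (ssize_t) sizeof reply)
            break;
        if (reply.is_ack && reply.seq == seq) {
            ret = 0;
            break;
        }
    }

    /* 3) set the socket nonblocking again */
    if (demo_set_nonblocking(fd, 1) < 0)
        return -1;
    return ret;
}
```

The loop in step 2 is harmless only as long as nothing but ACKs can arrive 
on that socket, which is exactly what the separate command socket is meant 
to guarantee.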

So, it can't happen that we lose actual change messages from the kernel (if 
we have a large enough receive buffer) because of blocking the command 
socket. And all should be happy.

The problem I discovered some time ago was that zebra used the netlink-listen 
socket to send messages about AF_INET6 routes to the kernel. It's obvious 
that there is a race condition (yes, it's real, that's why I know about it at 
all):

1) Send message to the kernel and set (now netlink-listen) socket blocking
2) Wait until ack is received    <= at this point some useful messages
                                    are arriving at the zebra daemon,
                                    but they are dropped, because we
                                    wait for the ACK
3) Set socket nonblocking
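
The loss in step 2 can be modelled without any sockets at all. The sketch 
below is a simulation under assumed names (struct sim_msg and 
sim_wait_for_ack() are hypothetical, not zebra code): it walks a queue of 
pending messages in arrival order and counts everything that is discarded 
before the ACK shows up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message model; not the real netlink structures. */
struct sim_msg {
    uint32_t seq;    /* sequence number of the message */
    int      is_ack; /* 1 = ACK, 0 = route/address notification */
};

/* Consume messages in arrival order until the ACK for `seq` is found.
 * Everything read before that ACK is discarded. Returns the number of
 * discarded messages, or -1 if the queue holds no matching ACK. */
static int sim_wait_for_ack(const struct sim_msg *queue, size_t len,
                            uint32_t seq)
{
    int dropped = 0;
    size_t i;

    assert(queue != NULL || len == 0);

    for (i = 0; i < len; i++) {
        if (queue[i].is_ack && queue[i].seq == seq)
            return dropped;
        dropped++; /* a notification we will never see again */
    }
    return -1;
}
```

With two notifications queued ahead of the ACK, as can happen on the 
subscribed netlink-listen socket, the function returns 2. With only the ACK 
in the queue, as on a socket that is subscribed to nothing, it returns 0.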

So, we lose messages and zebra and the kernel get out of sync :(. I didn't 
realize then why it was done that way, so I fixed the race.

Now it turns out that there was a "reason". I can't avoid using quotes 
here ... it's so fucking ugly that I refuse to even call it a reason.

We have to distinguish between route delete messages (received via the 
netlink-listen socket) caused by zebra and those caused by other software 
(or the user). See 

if (rtm->rtm_protocol == RTPROT_ZEBRA && h->nlmsg_type == RTM_NEWROUTE)
  return 0;

If a route is flagged as a zebra route and it's new, we drop the message. But 
we can't drop RTM_DELROUTE messages that way. A route flagged as a zebra 
route can be deleted by the zebra daemon, but by the user as well. If the 
zebra daemon deleted it, we should do nothing - we know all the details 
already. But if the user deleted it, we must pass it to the rib processing. 
So, we need a way to find out what caused the delete. For IPv4 routes it 
turns out that this one works:

/* skip unsolicited messages originating from command socket */
if (nl != &netlink_cmd && h->nlmsg_pid == netlink_cmd.snl.nl_pid)
      zlog_debug ("netlink_parse_info: %s packet comes from %s",
                   nl->name, netlink_cmd.name);

Note that the order of arguments is actually wrong in the zlog_debug() call 
;P. Anyway, this way we know that if we received a message via the 
netlink-listen socket, but its pid is the pid of the netlink_cmd socket, it 
was originated by the zebra daemon and must be dropped (btw, the seq number 
is also the number we sent the message with).

But because of a buggy kernel it works for IPv4 only; IPv6 route messages 
have pid 0 (originated by the kernel) and the seq number is also 0.
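
To make the difference concrete, here is a small sketch of the two checks 
quoted above, applied to a simplified message. It is an illustration, not 
the rt_netlink.c code: struct demo_route_msg, enum demo_type and 
demo_should_skip() are hypothetical, and the command socket's pid is assumed 
to be nonzero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the fields the checks above look at. */
enum demo_type { DEMO_NEWROUTE, DEMO_DELROUTE };

struct demo_route_msg {
    enum demo_type type;     /* plays the role of h->nlmsg_type */
    uint32_t       pid;      /* plays the role of h->nlmsg_pid */
    int            by_zebra; /* rtm->rtm_protocol == RTPROT_ZEBRA */
};

/* Decide whether a message seen on the listen socket should be
 * ignored. Returns 1 to skip it, 0 to pass it on to rib processing. */
static int demo_should_skip(const struct demo_route_msg *m,
                            uint32_t cmd_pid)
{
    assert(m != NULL);

    /* A new route flagged as a zebra route: nothing to learn. */
    if (m->by_zebra && m->type == DEMO_NEWROUTE)
        return 1;

    /* The message carries the pid of the command socket, so it was
     * caused by a request zebra sent itself. */
    if (m->pid == cmd_pid)
        return 1;

    return 0;
}
```

For an IPv4 delete caused by zebra the pid matches and the message is 
skipped. For the same delete on IPv6 the message carries pid 0, the pid 
check cannot match, and zebra's own delete looks exactly like one done by 
the user.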

So, what was the "solution"? Send IPv6 route messages via the netlink-listen 
socket. While waiting for the ACK, the RTM_DELROUTE messages will also be 
received, but because we are in blocking mode, we drop them.

One side note on why it's fatal ... Two commands are enough to reproduce it:

ipv6 route dead::/64 x:x:x::1 20
ipv6 route dead::/64 x:x:x::2

After the second command we delete the first route from the kernel. We 
receive the RTM_DELROUTE message from the kernel about the route we just 
deleted and pass it to rib_delete_ipv6(). The prefix is looked up and it is 
found that a fib route exists (btw, why don't we check the nexthop if it 
isn't ZEBRA_ROUTE_CONNECT?), but it isn't the same one => unset the FIB flag 
on all nexthops and unset the active flag. Process rib -> the same route we 
already have in the kernel will be selected for the FIB, but because it 
already exists in the kernel, adding fails. As a result we have two 
dead::/64 routes in the RIB, none of them with the FIB flag, but the kernel 
has one of them in its FIB => we are already fucked up.

The RIB code contains a lot of questionable code as well, but it isn't 
fatal. For example, we don't check the return value of kernel_delete_ipv6() 
in rib_uninstall_kernel() etc.


Fix the kernel. There can't be a better solution. If anyone has the 
knowledge and time and can come up with a patch, it would be really welcome. 
If no one comes up with a patch, I will bug the kernel developers at the 
weekend; I will be away for the next few days. Therefore don't expect any 
answers from me to this mail either. I think I made it quite clear where the 
problem is and why ;).

As a temporary solution you can revert the patch which was applied before 
0.98.3, but that reintroduces the race condition.

Hasso Tepper
Elion Enterprises Ltd.
WAN administrator
