[quagga-dev 3506] IPv6 and netlink problems revisited
hasso at estpak.ee
Thu Jun 9 13:57:51 BST 2005
Between 0.98.2 and 0.98.3 I fixed a race condition in the netlink code which
caused trouble with IPv6 routes. It turned out that although the race is fixed,
the fix introduced an even bigger problem. Let me remind you how this netlink
stuff works in
We use two netlink sockets. One (netlink-cmd) is for sending messages to the
kernel and receiving ACKs. It isn't subscribed to any netlink group, so it
shouldn't receive any route/address/etc. messages _originated_ by the kernel.
The other socket (netlink-listen) is for receiving messages from the kernel,
i.e. all messages about added/deleted interfaces/addresses/routes come to the
zebra daemon via this socket. This allows us to handle kernel communication in
quite a clever way: we can block the netlink-cmd socket until the ACK for the
message we sent is received, without any worry that we must drop something
useful because of that.
1) Send message to the kernel and set socket blocking
2) Wait until ack is received
3) Set socket nonblocking
* See zebra/rt_netlink.c:netlink_talk() for details.
So we can't lose actual change messages from the kernel (if we have a large
enough receive buffer) because of the blocking command socket. And everyone
should be happy.
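The three steps above can be sketched roughly as follows. This is only an
illustration on a generic socket descriptor, not the actual netlink_talk()
code: set_nonblocking() and talk_blocking() are hypothetical names, and the
real function builds netlink messages and parses the ACK instead of copying
raw bytes.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Switch O_NONBLOCK on or off for fd; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd, int on)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    flags = on ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
    return fcntl(fd, F_SETFL, flags);
}

/* Hypothetical helper: send one request on the command socket and block
 * until one reply (standing in for the ACK) arrives.  Returns the number
 * of reply bytes, or -1 on error. */
static ssize_t talk_blocking(int cmd_fd, const void *req, size_t req_len,
                             void *ack, size_t ack_len)
{
    ssize_t n;
    int saved_errno;

    /* 1) Send message and set socket blocking */
    if (send(cmd_fd, req, req_len, 0) < 0)
        return -1;
    if (set_nonblocking(cmd_fd, 0) < 0)
        return -1;

    /* 2) Wait until the reply is received (retry if interrupted) */
    do {
        n = recv(cmd_fd, ack, ack_len, 0);
    } while (n < 0 && errno == EINTR);

    /* 3) Set socket nonblocking again, preserving errno from recv() */
    saved_errno = errno;
    set_nonblocking(cmd_fd, 1);
    errno = saved_errno;
    return n;
}
```

Blocking like this is harmless only as long as nothing but replies to our own
requests can arrive on cmd_fd, which is exactly what the separate
netlink-cmd socket guarantees.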
The problem I discovered some time ago was that zebra used the netlink-listen
socket to send messages about AF_INET6 routes to the kernel. Obviously that is
a race condition (yes, it's real, that's how I know about it at all):
1) Send message to the kernel and set (now netlink-listen) socket blocking
2) Wait until ack is received   <= at this point some useful messages
                                   arrive at the zebra daemon, but they
                                   are dropped, because we wait for ACK
3) Set socket nonblocking
So we lose messages, and zebra and the kernel get out of sync :(. I didn't
realize then why it was done that way, so I fixed the race.
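To make the race concrete, here is a toy model (not zebra code) of step 2.
The queue is just an array standing in for what is pending on the socket, and
wait_for_ack_dropping() and the msg_kind values are hypothetical names. It
counts how many non-ACK messages get thrown away while we wait.

```c
#include <stddef.h>

/* Simplified message kinds; real netlink messages carry far more. */
enum msg_kind { MSG_ACK, MSG_ROUTE_CHANGE };

/* Read queued messages in order until the ACK shows up, discarding
 * everything else.  Returns the number of discarded messages and stores
 * in *consumed how many messages were read (including the ACK). */
static size_t wait_for_ack_dropping(const enum msg_kind *queue, size_t len,
                                    size_t *consumed)
{
    size_t dropped = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (queue[i] == MSG_ACK) {
            i++;            /* the ACK itself is consumed too */
            break;
        }
        dropped++;          /* a change notification is lost here */
    }
    *consumed = i;
    return dropped;
}
```

On a socket that only ever carries ACKs the function always returns 0. On a
socket that also carries kernel notifications, every notification queued
ahead of the ACK is lost.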
Now it turns out that there was a "reason". I can't avoid the quotes here ...
it's so fucking ugly that I refuse to even call it a reason.
We have to distinguish between route-deletion messages (received via the
netlink-listen socket) caused by zebra and those caused by other software
(the user). See
if (rtm->rtm_protocol == RTPROT_ZEBRA && h->nlmsg_type == RTM_NEWROUTE)
If a route is flagged as a zebra route and it's new, we drop the message. But
we can't drop RTM_DELROUTE messages that way. A route flagged as a zebra route
can be deleted by the zebra daemon, but by the user as well. If the zebra
daemon deleted it, we should do nothing - we know all the details already.
But if the user deleted it, we must pass it to the rib processing. So we need
a way to find out what caused the delete. For IPv4 routes it turns out that
this one works:
/* skip unsolicited messages originating from command socket */
if (nl != &netlink_cmd && h->nlmsg_pid == netlink_cmd.snl.nl_pid)
zlog_debug ("netlink_parse_info: %s packet comes from %s",
Note that the order of arguments in the zlog_debug() call is actually wrong ;P.
Anyway, this way we know that if we received a message via the netlink-listen
socket, but its pid is the pid of the netlink_cmd socket, it was originated
by the zebra daemon and must be dropped (btw, the seq number is also the
number we sent the message with).
But because of a kernel bug this works for IPv4 only; IPv6 route messages have
pid 0 (i.e. originated by the kernel) and seq number 0 as well.
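Put together, the two checks quoted above amount to something like the
following sketch for a message that arrived on the netlink-listen socket. The
function and enum names are hypothetical; only the struct fields and the
RTM_*/RTPROT_ZEBRA constants come from the Linux netlink headers, and the
real netlink_parse_info() does considerably more.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

enum route_msg_action { ROUTE_MSG_SKIP, ROUTE_MSG_PROCESS };

/* Decide what to do with a route message received on the listen socket.
 * cmd_pid is the netlink pid bound to the command socket. */
static enum route_msg_action
classify_listen_msg(const struct nlmsghdr *h, const struct rtmsg *rtm,
                    uint32_t cmd_pid)
{
    /* A new route flagged as zebra's: we installed it ourselves. */
    if (rtm->rtm_protocol == RTPROT_ZEBRA && h->nlmsg_type == RTM_NEWROUTE)
        return ROUTE_MSG_SKIP;

    /* Echo of something we sent via the command socket. */
    if (h->nlmsg_pid == cmd_pid)
        return ROUTE_MSG_SKIP;

    /* Anything else (e.g. a delete done by the user) goes to the rib. */
    return ROUTE_MSG_PROCESS;
}
```

With a correct pid, an RTM_DELROUTE caused by zebra itself is skipped. With
pid 0, as the kernel reports for IPv6 routes, the same delete falls through to
ROUTE_MSG_PROCESS and looks as if the user had done it.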
So, what was the "solution"? Send IPv6 route messages via the netlink-listen
socket. While waiting for the ACK, RTM_DELROUTE messages will also be
received, but because we are in blocking mode, we drop them.
A side note on why it's fatal ... Two commands are enough to reproduce it:
ipv6 route dead::/64 x:x:x::1 20
ipv6 route dead::/64 x:x:x::2
After the second command we delete the first route from the kernel. We receive
an RTM_DELROUTE message from the kernel about the route we just deleted and
pass it to rib_delete_ipv6(). The prefix is looked up and a FIB route is found
(btw, why don't we check the nexthop if it isn't ZEBRA_ROUTE_CONNECT?), but it
isn't the same one => unset the FIB flag on all nexthops and unset the active
flag. Process rib -> the same route we already have in the kernel will be
selected for the FIB, but because it already exists in the kernel, adding
fails. As a result we have two dead::/64 routes in the RIB, none of them with
the FIB flag, but the kernel has one of them in its FIB => we are already
fucked up.
The RIB code contains a lot of questionable code as well, but that isn't
fatal. For example, we don't check the return value of kernel_delete_ipv6() in
Fix the kernel. There can't be a better solution. If anyone has the knowledge
and time and can come up with a patch, it would be really welcome. If no one
comes up with a patch, I will bug the kernel developers at the weekend; I will
be away for the next few days. Therefore don't expect any answers from me to
this mail either. I think I made it quite clear where the problem is and
why ;).
As a temporary solution you can revert the patch which was applied before 0.98.3 -
But that reintroduces the race condition.
Elion Enterprises Ltd.