
I'm running BGP using FRR on Debian Linux on several machines. My question might end up having to do with something in the FRR/BGP configuration but I'm trying to understand at a more basic level why a particular IPv6 route selection is happening (from the Linux kernel).

I have a machine "a3" which is peered with "a1" and "a2". "a1" and "a2" are route reflectors and are both providing a default gateway to a3. Here you can see a3's IPv6 routing table:

root@a3:~# ip -6 route
::1 dev lo proto kernel metric 256 pref medium
2602:fbbc:0:2::/64 dev vxbr2 proto kernel metric 256 pref medium
2602:fbbc:0:65::/64 dev vxbr101 proto kernel metric 256 pref medium
2602:fbbc:1:1::/64 dev 000_bridge proto kernel metric 256 pref medium
fe80::/64 dev 000_bridge proto kernel metric 256 pref medium
fe80::/64 dev vnet7 proto kernel metric 256 pref medium
fe80::/64 dev vxbr101 proto kernel metric 256 pref medium
fe80::/64 dev vxbr2 proto kernel metric 256 pref medium
fe80::/64 dev vnet40 proto kernel metric 256 pref medium
fe80::/64 dev vnet43 proto kernel metric 256 pref medium
fe80::/64 dev vnet46 proto kernel metric 256 pref medium
fe80::/64 dev vnet47 proto kernel metric 256 pref medium
fe80::/64 dev vnet54 proto kernel metric 256 pref medium
fe80::/64 dev vnet57 proto kernel metric 256 pref medium
fe80::/64 dev vnet58 proto kernel metric 256 pref medium
fe80::/64 dev vnet63 proto kernel metric 256 pref medium
fe80::/64 dev 001_bridge proto kernel metric 256 pref medium
default nhid 36 proto bgp metric 20 pref medium
    nexthop via 2602:fbbc:1:1::1 dev 000_bridge weight 1
    nexthop via 2602:fbbc:1:1::2 dev 000_bridge weight 1

As I understand it, the line near the bottom reading default nhid 36 proto bgp metric 20 pref medium indicates that nexthop object 36 is used for the default route, and that this object contains two separate entries, one for 2602:fbbc:1:1::1 and one for 2602:fbbc:1:1::2.

Here's the nexthop table:

root@a3:~# ip nexthop
id 15 dev 001_bridge scope host proto zebra
id 16 dev 000_bridge scope link proto zebra
id 26 dev vxbr2 scope link proto zebra
id 27 dev vxbr101 scope link proto zebra
id 31 via 2602:fbbc:1:1::1 dev 000_bridge scope link proto zebra
id 32 via 10.1.0.1 dev 001_bridge scope link proto zebra
id 36 group 31/37 proto zebra
id 37 via 2602:fbbc:1:1::2 dev 000_bridge scope link proto zebra

So I would think, given the ordering here (it appears earlier in the nexthop list, has the lower ID, and comes first in id 36 group 31/37 proto zebra), that 2602:fbbc:1:1::1 would be selected as the default gateway, but this is not the case. Looking up any random public IPv6 address gives:

root@a3:~# ip -6 route get 2001:4860:4860::8888
2001:4860:4860::8888 from :: via 2602:fbbc:1:1::2 dev 000_bridge proto bgp src 2602:fbbc:1:1::a3 metric 20 pref medium

And I can confirm via traceroute6 and any other tools available that 2602:fbbc:1:1::2 is definitely being selected as the gateway, not 2602:fbbc:1:1::1. And I have no idea why.

Also, ip -6 route show cache gives no output, and ip -6 route flush cache has no effect, so it doesn't seem to be route cache related. There do not appear to be any custom rules configured either:

root@a3:~# ip -6 rule show
0:  from all lookup local
32766:  from all lookup main

I'm sure I will have more to tweak on the BGP configuration to resolve this but just from the perspective of how the route selection is done in Linux, does anyone have an idea on what could be causing this? (And any ideas on what parameter could be tuned to fix it?)


1 Answer


It's a multipath route: both gateways are used, and neither takes precedence over the other. For a given destination (combined with other flow properties), however, the same gateway is always chosen, so that flows already using it are not disturbed. If only a few destinations are used or tested, one gateway can therefore appear to be favored over the other.

A multipath route can be set with the "simple" syntax using directly ip route add ... nexthop ... nexthop ... or the newer and more featureful syntax using ip nexthop add id XXX ... and ip route add ... nhid XXX.
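As a minimal sketch of the two syntaxes, the commands below mirror the gateways, device and nexthop IDs from the question. The run helper and the two function names are hypothetical, introduced only for illustration; by default the commands are printed rather than executed.

```bash
# Dry-run by default: commands are printed, not executed.
# Set APPLY=1 (and run as root) to really change the routing table.
run() {
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

GW1=2602:fbbc:1:1::1
GW2=2602:fbbc:1:1::2
DEV=000_bridge

# "Simple" syntax: the next hops are listed inline in the route itself.
multipath_inline() {
    run ip -6 route add default \
        nexthop via "$GW1" dev "$DEV" weight 1 \
        nexthop via "$GW2" dev "$DEV" weight 1
}

# Nexthop-object syntax: create two nexthops, group them, then point
# the route at the group by ID.
multipath_nhid() {
    run ip nexthop add id 31 via "$GW1" dev "$DEV"
    run ip nexthop add id 37 via "$GW2" dev "$DEV"
    run ip nexthop add id 36 group 31/37
    run ip -6 route add default nhid 36
}
```

Calling multipath_inline or multipath_nhid prints the corresponding commands; only one of the two approaches would be applied on a real system.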

Here, route nhid 36 selects nexthop id 36 which is a nexthop group of id 31 and id 37. They have equal participation in the group (because no specific weight was set).

An algorithm selects which gateway is used for a specific destination: the default is the hash-threshold algorithm, as mentioned in the documentation for the alternate (resilient) algorithm and in RFC 2992. This algorithm ensures that on average both gateways are used, but that for a specific destination the same one is always used.
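The idea behind hash-threshold can be sketched as follows: the hash keyspace is split into equal regions, one per next hop, and the region a flow's hash falls into decides the gateway. This is only an illustration under stated assumptions: cksum (CRC-32) stands in for the kernel's real flow hash, all next hops have equal weight, and pick_nexthop is a hypothetical name. It will not reproduce the kernel's actual choices.

```bash
# Hash-threshold selection in the style of RFC 2992 (illustrative only).
# ASSUMPTION: cksum's 32-bit CRC replaces the kernel's flow hash.
KEYSPACE=4294967296   # 2^32 possible CRC values

# pick_nexthop FLOW_KEY NEXTHOP [NEXTHOP...]
# The body runs in a subshell so its variables stay local.
pick_nexthop() (
    flow=$1
    shift
    n=$#                                   # number of next hops
    key=$(printf '%s' "$flow" | cksum | cut -d' ' -f1)
    region=$(( KEYSPACE / n ))             # size of each hop's region
    idx=$(( key / region ))                # region the key falls into
    # Integer division can leave a small remainder region at the top
    # of the keyspace; fold it into the last next hop.
    [ "$idx" -ge "$n" ] && idx=$(( n - 1 ))
    shift "$idx"
    printf '%s\n' "$1"
)
```

For example, pick_nexthop 2001:db8::1 2602:fbbc:1:1::1 2602:fbbc:1:1::2 returns the same gateway every time it is called with that key, while different keys spread across both gateways.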

One can verify this by comparing routes for multiple different destination addresses. For example, with a mock-up configuration mimicking OP's default route, a loop (with bash and jq) gave this:

# for i in 2001:db8::{{0..9},{a..f}}; do ip -6 -json route get $i; done | jq -j '.[] | .dst, " via ", .gateway, "\n"'
2001:db8:: via 2602:fbbc:1:1::1
2001:db8::1 via 2602:fbbc:1:1::2
2001:db8::2 via 2602:fbbc:1:1::2
2001:db8::3 via 2602:fbbc:1:1::2
2001:db8::4 via 2602:fbbc:1:1::2
2001:db8::5 via 2602:fbbc:1:1::1
2001:db8::6 via 2602:fbbc:1:1::1
2001:db8::7 via 2602:fbbc:1:1::1
2001:db8::8 via 2602:fbbc:1:1::2
2001:db8::9 via 2602:fbbc:1:1::1
2001:db8::a via 2602:fbbc:1:1::1
2001:db8::b via 2602:fbbc:1:1::1
2001:db8::c via 2602:fbbc:1:1::2
2001:db8::d via 2602:fbbc:1:1::2
2001:db8::e via 2602:fbbc:1:1::2
2001:db8::f via 2602:fbbc:1:1::1

Results on another system might differ, but overall both gateways will be used evenly, with only one of them per destination to minimize flow disruption (e.g. a firewall in the path should see the whole flow rather than only a part of it).
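To check the split without counting by eye, the "dst via gateway" lines can be tallied per gateway. The helper name tally_gateways is hypothetical; it only assumes input in the format printed by the jq filter above.

```bash
# Count how many destinations were routed through each gateway.
# Reads lines of the form "<dst> via <gateway>" on standard input.
tally_gateways() {
    awk '$2 == "via" { count[$3]++ }
         END { for (gw in count) print count[gw], gw }' | sort -k2
}
```

Appending | tally_gateways to the loop above prints one line per gateway with its count. Applied to the sample output shown here, it reports 8 destinations for each of the two gateways.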

The hash is actually not based on the destination alone: it may also use the source, the protocol and probably other properties (e.g. adding ipproto tcp to the ip route get command above changes the result, and choosing udp or ipv6-icmp instead of tcp changes it again).
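Which fields feed the hash is governed on Linux by the net.ipv6.fib_multipath_hash_policy sysctl. The sketch below reads it if present and prints a description; the descriptions paraphrase the kernel's ip-sysctl documentation, the exact field set can vary with kernel version, and describe_hash_policy is a hypothetical helper name.

```bash
# Map a fib_multipath_hash_policy value to a short description.
describe_hash_policy() {
    case $1 in
        0) echo "Layer 3: source and destination addresses plus flow label" ;;
        1) echo "Layer 4: standard 5-tuple" ;;
        2) echo "Layer 3, or inner Layer 3 if present" ;;
        3) echo "custom: fields chosen by fib_multipath_hash_fields" ;;
        *) echo "unknown policy value: $1" ;;
    esac
}

# Report the current IPv6 policy when the sysctl exists on this system.
policy_file=/proc/sys/net/ipv6/fib_multipath_hash_policy
if [ -r "$policy_file" ]; then
    describe_hash_policy "$(cat "$policy_file")"
else
    echo "$policy_file is not readable on this system"
fi
```

Changing the policy alters how flows are distributed across the gateways, not which gateway is "preferred": with equal weights, neither gateway has precedence under any policy.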

A.B