Leaky Redistribution (2)

In part 1 we had a look at some redistribution scenarios and what happens when equal-cost multi-pathing is involved.

The problem we look at in this post is what gave the series its title and it builds on the previous setup by adding VRFs and MPBGP to the mix.

The setup

I strongly suggest reading part 1 to get familiar with the setup and what's going on. What I did here is put the OSPF domain into a VRF (left) and the BGP domain into a different VRF (right). I still want communication between them, so I'm using MPBGP RT import/export on R2 to exchange routes between the two domains.

We're getting into deeper waters here, so you should have a good understanding about how MPBGP and L3VPNs work, as well as what VRF leaking is. There are lots of great articles you can read about them on the net, so I won't explain the concepts.

In L3VPN terms, R1 and R3 are CE routers and R2 is a PE router. The configs below show the relevant bits on R2, as this is where all the action takes place (R1,R3 configuration is trivial and doesn't change from part 1).

As you can see below, the two VRFs are configured and the RT import/export mechanism ensures that MPBGP exchanges all routes (no filtering) between the two L3VPNs.

ip vrf left
 rd 1:1
 route-target export 1:1
 route-target import 2:2
!
ip vrf right
 rd 2:2
 route-target export 2:2
 route-target import 1:1

For route exchange to actually happen, I have to redistribute between BGP and OSPF in VRF left, while VRF right enjoys the fact that it's running pure BGP and does everything transparently.

router ospf 1 vrf left
 router-id 2.2.2.2
 log-adjacency-changes
 redistribute bgp 2 subnets
!
router bgp 2
 bgp router-id 2.2.2.2
 no bgp default ipv4-unicast
 bgp log-neighbor-changes
 !
 address-family ipv4 vrf right
  neighbor 10.1.23.3 remote-as 3
  neighbor 10.1.23.3 activate
  no synchronization
 exit-address-family
 !
 address-family ipv4 vrf left
  redistribute ospf 1 vrf left match internal external 1 external 2
  no synchronization
 exit-address-family

Looking at the results

Once the bits shuffle around and the dust settles, I should be able to ping from 1.1.1.1 to 3.3.3.3. Job done. Or is it?

The expected end result should be the same as the one in Part 1/OSPF and BGP. R2 has two equal cost paths to reach R1, but only one route makes it to R3 (via redistribution).

Check the routing table output below... it looks almost right, doesn't it?

#R1
O E2    3.3.3.3 [110/1] via 10.2.12.2, 00:11:35, FastEthernet0/1
                [110/1] via 10.1.12.2, 00:11:35, FastEthernet0/0

#R3
B       1.1.1.1 [20/0] via 10.1.23.2, 00:11:54

#R2
Routing Table: left
O       1.1.1.1 [110/11] via 10.2.12.1, 00:13:20, FastEthernet0/1
                [110/11] via 10.1.12.1, 00:13:20, FastEthernet0/0
B       3.3.3.3 [20/0] via 10.1.23.3 (right), 00:12:21

Routing Table: right
B       1.1.1.1 [20/11] via 10.1.12.1 (left), 00:12:50, FastEthernet0/0
B       3.3.3.3 [20/0] via 10.1.23.3, 00:13:03
B       10.2.12.0 is directly connected, 00:12:50, FastEthernet0/1
B       10.1.12.0 is directly connected, 00:12:50, FastEthernet0/0

If you can't spot what I'm talking about (I won't blame you, took me a while to figure out), hold on for a bit, well come right back to it.

Let's walk through what happens (also with the help of the BGP tables below):

OSPF route 1.1.1.1/32 gets redistributed into BGP from VRF left (remember redistribution is done out of the RIB!). It might have two different next-hops, but as we've seen in part 1, it picks the numerically lowest one.
The routes that make it in BGP get tagged with RT 1:1 (the export RT for VRF left).
Because VRF right imports routes with RT 1:1, 1.1.1.1/32 appears in the BGP table for VRF right, with the same next-hop of 10.1.12.1. One of your eyebrows should be gaining altitude at this point.
The routes are then advertised to R3, which is a BGP neighbor in VRF right, and we get the routing table output you've seen above.

R2#show bgp vpnv4 unicast all
BGP table version is 12, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 1:1 (default for vrf left)
*> 1.1.1.1/32       10.1.12.1               11         32768 ?
*> 3.3.3.3/32       10.1.23.3                0             0 3 i
*> 10.1.12.0/24     0.0.0.0                  0         32768 ?
*> 10.2.12.0/24     0.0.0.0                  0         32768 ?
Route Distinguisher: 2:2 (default for vrf right)
*> 1.1.1.1/32       10.1.12.1               11         32768 ?
*> 3.3.3.3/32       10.1.23.3                0             0 3 i
*> 10.1.12.0/24     0.0.0.0                  0         32768 ?
*> 10.2.12.0/24     0.0.0.0                  0         32768 ?

If I traceroute from R3 to 1.1.1.1/32, based on previous experience, I would expect R2 to load-balance between its two equal cost paths. But it doesn't, and in this case it's impossible to make it do so, no matter what crazy tricks you employ.

R3#trace 1.1.1.1 so lo0

Type escape sequence to abort.
Tracing the route to 1.1.1.1

  1 10.1.23.2 28 msec 32 msec 32 msec
  2 10.1.12.1 [AS 2] 56 msec 56 msec 64 msec

You were probably expecting this, but I bet there's a bit of confusion floating around as to exactly why. I gave you a few hints along the way, so lets take them in order.

VRF right shows the following route: B 1.1.1.1 [20/11] via 10.1.12.1 (left), 00:12:50, FastEthernet0/0

what's interesting is that the next-hop is 10.1.12.1 and the outbound interface is Fa0/0 - both of which are in another VRF (left)!
this is VRF leaking in action: traffic from VRF right is explicitly forced to go out of an interface in VRF left via a path in the routing table. This happens because there is no physical link between the two VRFs (which are like two separate routers).
the BGP table for VRF right shows where this came from: *> 1.1.1.1/32 10.1.12.1 11 32768 ?

VRF right shows another funny entry: B 10.1.12.0 is directly connected, 00:12:50, FastEthernet0/0

so it's a BGP route B which is directly connected. Say what now?
again, it's VRF leaking in action. Fa0/0 is, indeed, directly connected to R2. But the route has to logically pass between two VRFs and BGP was used to exchange that information.

OK, so when a packet arrives in R2-right from R3 with a destination of 1.1.1.1, the routing table instructs it so send it out via interface Fa0/0 with a next-hop of 10.1.12.1.

We can also see this in the FIB of the router:

the CEF entry for 1.1.1.1/32 in VRF left has two next-hops and two outbound interfaces, as expected
the CEF entry for 1.1.1.1/32 in VRF right has only one exit point, completely bypassing VRF left RIB

R2#sh ip cef vrf left
Prefix              Next Hop             Interface
1.1.1.1/32          10.2.12.1            FastEthernet0/1
                    10.1.12.1            FastEthernet0/0
3.3.3.3/32          10.1.23.3            FastEthernet1/0

R2#sh ip cef vrf right
Prefix              Next Hop             Interface
1.1.1.1/32          10.1.12.1            FastEthernet0/0
3.3.3.3/32          10.1.23.3            FastEthernet1/0

The packet we're talking about will get chucked out of Fa0/0 directly without it actually triggering a lookup in the RIB of VRF left. And that's why there is no load-balancing.

The bottom line

So here you have it ladies and gentlemen: leaky redistribution. An ECMP route (1.1.1.1/32) gets redistributed, loses information about its equal-cost paths in the process, then VRF leaking bypasses the normal routing procedure and only uses the redistributed next-hop to forward traffic.

You might be wondering how does that apply in real-life and if it can happen to you. It can and it is quite hard to pin-point and troubleshoot as the torture you had to go through above well shows.

In today's networks it's not too difficult to have the ingredients for this to happen:

Equal-cost paths and load-sharing PE-CE.
Big PEs with a lot of L3VPNs on them.
Extranetting between at least two of these L3VPNs.

As long as the L3VPNs you do extranetting between are on different PEs, you won't feel a thing. But as soon as they reside on the same PE, those PEs will do VRF leaking (Cisco does it by default, while Juniper doesn't) and your carefully planned load-sharing might not work.

I'll leave you with the following diagram for an example where this issue happens to something that doesn't look like a lab-only topology.

And, as always, thanks for reading.

LEAKY REDISTRIBUTION (2)

The setup

Looking at the results

The bottom line

Any comments? Contact me via Mastodon or e-mail.

Share & Subscribe!