BGP VxLAN EVPN – Part 2: Underlay

In the previous post I provided an overview of BGP VxLAN EVPN and mentioned that various IGPs could be utilised to provide the underlay. In this post I am going to flesh out what a potential underlay setup may look like, based on OSPFv2.

There are some initial considerations which need to be defined when planning the underlay design, including:
  • MTU
  • Unicast Routing Protocol
  • IP addressing
  • Multicast for BUM traffic replication

VxLAN adds 50 bytes to the original Ethernet frame, which needs to be catered for to avoid fragmentation. The simplest way of doing this is to enable jumbo frames in the IP network where VxLAN will run. As most servers utilise a jumbo MTU of 9000, it is recommended that the switches be configured with a jumbo MTU of 9192 or 9216, depending on what the hardware model supports. This caters for the servers' 9000 bytes plus the VxLAN overhead.
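As a sketch of what this looks like on NX-OS (the interface name is hypothetical, and on some platforms the jumbo MTU is set globally rather than per interface, so check the platform documentation):

```
! some NX-OS platforms expose a global jumbo MTU ceiling (commonly 9216 by default)
system jumbomtu 9216

! per-link MTU on a hypothetical fabric interface:
! 9000 (server MTU) + 50 (VxLAN overhead) fits comfortably within 9216
interface Ethernet1/1
  mtu 9216
```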

The next consideration is which IGP (unicast routing protocol) to utilise; as mentioned, this post will focus on OSPF.

IP addressing for the underlay needs to cater for the P2P links between the spine and leaf switches, the loopback interfaces on each spine and leaf switch and the multicast Rendezvous-Point (RP) address.

Whilst discussed in more detail later in this post, it should be noted that the mode of multicast utilised will likely depend on the hardware model being used. On the Cisco Nexus range, for example, unfortunately not all models support the same multicast mode. Below is a list of what is supported on each Nexus model:

  • Nexus 1000v – IGMP v2/v3
  • Nexus 3000 – PIM ASM
  • Nexus 5600 – PIM BiDir
  • Nexus 7000/F3 – PIM ASM / PIM BiDir
  • Nexus 9000 – PIM ASM

In this example we will leverage a loopback address for our multicast RP address. As a sizing exercise first, consider a medium sized spine and leaf deployment utilising 4 spine switches and 20 leaf switches; the following IP address usage needs to be catered for:

  • 4 Spine x 20 leaf = 80 P2P Links
  • 80 links, with an IP address at each end = 160 P2P IP addresses
  • 24 devices in total = 24 Loopback IP addresses.
  • Total = 160 P2P IP + 24 Loopback IP = 184 IP Addresses

Also note that to conserve IP addresses, 'ip unnumbered loopback0' may be used on the P2P interfaces, which means one IP address per device. This should be seriously considered for large deployments; however, for simplicity in this example I am going to utilise 2 spine switches and 3 leaf switches with a unique IP address everywhere, meaning I need to cater for:

2 Spines x 3 Leaves = 6 P2P links, with an IP address at each end = 12 P2P IP addresses, plus 5 Loopback IP addresses (one per device). As each spine-leaf pair in this example actually uses 2 x 10G links, the P2P count doubles to 12 links and 24 addresses.
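For reference, the 'ip unnumbered' alternative mentioned above would look something like the following sketch on NX-OS (the interface name is hypothetical); each P2P interface borrows the loopback0 address instead of consuming a /30 per link:

```
interface Ethernet1/1
  ! NX-OS requires the point-to-point medium before unnumbered can be applied
  medium p2p
  ! borrow the loopback0 address rather than assigning a unique /30
  ip unnumbered loopback0
```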

I am also going to assume in this example that the servers are utilising the 10/8 IP address range, and thus I have opted to use the 192.168/16 range for the loopback interfaces (which are also used as the router IDs) and the 172.16/12 range for the physical layer 3 P2P interfaces.

For reference, whilst most of the theory is independent of the vendor and hardware, in this example I am using Cisco Nexus 9000 switches to implement this network technology. As with all Nexus switches the relevant features first need to be enabled, thus I have enabled the following:

Spine-1# show run | incl feature
feature nxapi
feature ospf
feature bgp
feature pim
feature interface-vlan
feature vn-segment-vlan-based
feature lacp
feature lldp
feature nv overlay

As the spine switches are the simplest to configure I'll start with the first spine switch. As mentioned, depending on how MAC address replication and flooding is configured in the environment, multicast may be required. I'll explain this in more detail later, but in this example I have enabled multicast and nominated this spine switch as one of the RPs, with the following commands:
ip pim rp-address 192.168.1.0
ip pim anycast-rp 192.168.1.0 192.168.1.1
ip pim anycast-rp 192.168.1.0 192.168.1.2
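For the shared anycast RP address 192.168.1.0 to be reachable, it needs to live on an interface of its own on both spines. A sketch of what that dedicated loopback might look like is shown below; the loopback number is an assumption, and it is advertised into the underlay the same way as loopback0:

```
interface loopback1
  description Anycast-RP
  ! same /32 on both spines; PIM anycast-RP keeps their state in sync
  ip address 192.168.1.0/32
  ip ospf network point-to-point
  ip router ospf UNDERLAY area 0.0.0.0
  ip pim sparse-mode
```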
Once this is done the next step is to enable the underlay routing protocol. As I am using OSPF to provide IP reachability across the fabric, the first step is to configure the loopback interface which will be used as the router ID for the routing protocol, and then configure OSPF itself.

interface loopback0
description Router-ID - Spine1
ip address 192.168.1.1/32

router ospf UNDERLAY
router-id 192.168.1.1
log-adjacency-changes
maximum-paths 12
auto-cost reference-bandwidth 100000 Mbps
passive-interface default

The router-id matches the IP address assigned to the loopback0 interface; I will use this same loopback address for all router IDs defined on this switch.

The OSPF configuration is standard and should be familiar to anyone who has configured OSPF before, however the command 'maximum-paths' may not be. This is enabled to provide Equal Cost Multi-Pathing (ECMP) between my leaf and spine switches. I chose 12 just to have a large number I will likely never need to revisit; as long as the value is equal to, or greater than, the number of physical links it will be fine. It is also good practice to define the reference bandwidth, and in this example I have configured 100000 Mbps, which is 100 Gbps and should cater for the largest link this environment will have. Finally, I prefer to manually nominate any interfaces I wish to participate in OSPF, thus I have configured interfaces to be passive by default.

TIP: By default OSPF uses the broadcast network type for message propagation and DR/BDR election; however, we want to utilise the point-to-point network type, so ensure that 'ip ospf network point-to-point' is configured on the loopback and P2P interfaces.

Once this is done I can go back into the loopback interface and assign the OSPF and multicast parameters so the loopback interface participates in these protocols, with the following configuration:

interface loopback0
  description Router-ID - Spine1
  ip address 192.168.1.1/32
  ip ospf network point-to-point
  ip router ospf UNDERLAY area 0.0.0.0
  ip pim sparse-mode

The next step is to configure the point-to-point interfaces and enable OSPF and multicast on them. As we are using VxLAN we are going to increase the MTU to cater for the additional header size. Technically only an additional 50 bytes is required, but for simplicity I've decided to enable jumbo frames and set the MTU to 9216 on all physical interfaces.

interface Ethernet1/43
  description DC01-LSL06-03 [Eth1/47]
  mtu 9216
  ip address 172.16.1.1/30
  ip ospf network point-to-point
  no ip ospf passive-interface
  ip router ospf UNDERLAY area 0.0.0.0
  ip pim sparse-mode
  no shutdown

It's important to configure the OSPF network type as point-to-point here to ensure there is no DR/BDR election, to keep the LSA database more optimised, and to avoid a full SPF calculation on a link failure. Also, as we have nominated passive-interface default in OSPF, we need to enable this interface to participate in OSPF with the command 'no ip ospf passive-interface'. I have also used a /30 for the point-to-point link, which is not ideal for preserving IP address space and may cause scale issues in a very large deployment, but for simplicity of configuration and troubleshooting I've decided the trade-off here is fine.
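To confirm the interface is behaving as intended, the OSPF interface details can be checked; the following commands (output omitted here) should report the network type as P2P, no DR/BDR on the link, and the interface no longer passive:

```
! verify the OSPF network type and that no DR/BDR election has occurred
show ip ospf interface Ethernet1/43

! verify which interfaces are actively participating in OSPF
show ip ospf interface brief
```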

All the interconnects between the leaf and spine switches are via 2 x 10G interfaces thus I need to replicate the above configuration on an additional interface as per the following configuration.

interface Ethernet1/44
  description DC01-LSL06-03 [Eth1/48]
  mtu 9216
  ip address 172.16.1.5/30
  ip ospf network point-to-point
  no ip ospf passive-interface
  ip router ospf UNDERLAY area 0.0.0.0
  ip pim sparse-mode
  no shutdown

This should be repeated for all links between each spine and leaf, adjusting the IP addresses as required, until all of your switches form a neighbor relationship as shown here:

Spine-1# show ip ospf neighbors
 OSPF Process ID UNDERLAY VRF default
 Total number of neighbors: 6
 Neighbor ID     Pri State            Up Time  Address         Interface
 192.168.1.13      1 FULL/ -          1w5d     172.16.1.2      Eth1/43
 192.168.1.13      1 FULL/ -          1w5d     172.16.1.6      Eth1/44
 192.168.1.12      1 FULL/ -          1w5d     172.16.1.10     Eth1/45
 192.168.1.12      1 FULL/ -          1w5d     172.16.1.14     Eth1/46
 192.168.1.11      1 FULL/ -          1w5d     172.16.1.18     Eth1/47
 192.168.1.11      1 FULL/ -          1w5d     172.16.1.22     Eth1/48

Also, as we enabled PIM multicast earlier, to confirm it has formed the appropriate neighbor relationships we use the following commands:

Spine-1# show ip pim neighbor
PIM Neighbor Status for VRF "default"
Neighbor        Interface            Uptime    Expires   DR       Bidir-  BFD
                                                         Priority Capable State
172.16.1.2      Ethernet1/43         1w5d      00:01:42  1        yes     n/a
172.16.1.6      Ethernet1/44         1w5d      00:01:35  1        yes     n/a
172.16.1.10     Ethernet1/45         1w5d      00:01:26  1        yes     n/a
172.16.1.14     Ethernet1/46         1w5d      00:01:23  1        yes     n/a
172.16.1.18     Ethernet1/47         1w5d      00:01:34  1        yes     n/a
172.16.1.22     Ethernet1/48         1w5d      00:01:44  1        yes     n/a
Spine-1# show ip pim interface brief
PIM Interface Status for VRF "default"
Interface            IP Address      PIM DR Address  Neighbor  Border
                                                     Count     Interface
Ethernet1/43         172.16.1.1      172.16.1.2      1         no
Ethernet1/44         172.16.1.5      172.16.1.6      1         no
Ethernet1/45         172.16.1.9      172.16.1.10     1         no
Ethernet1/46         172.16.1.13     172.16.1.14     1         no
Ethernet1/47         172.16.1.17     172.16.1.18     1         no
Ethernet1/48         172.16.1.21     172.16.1.22     1         no
loopback0            192.168.1.1     192.168.1.1     0         no

Note: As this example is from the spine switch and each spine has 2 x 10G links to the 3 leaf switches, there are 6 entries above, plus the loopback, depending on which command is used.

The underlay network is now formed with OSPF and multicast, and we can build the overlay and control plane on top of it. It is critical that reachability across the underlay is consistent throughout the fabric, so this is a good point to test failure scenarios for the underlay. It is also a good point to finish this blog; the next post will provide the overlay and control plane configuration details.
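As a sketch of the kind of failure test mentioned above (reusing the interfaces and addresses from earlier in this post), one fabric link can be shut down while confirming loopback reachability survives over the remaining ECMP path:

```
! on the spine, fail one of the two links to a leaf
interface Ethernet1/43
  shutdown

! on the affected leaf (router ID 192.168.1.13 in the neighbor table above),
! confirm the spine loopback is still reachable via the surviving link
ping 192.168.1.1 source 192.168.1.13
show ip route 192.168.1.1

! then restore the link and confirm the OSPF and PIM adjacencies reform
interface Ethernet1/43
  no shutdown
```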


BGP VxLAN EVPN – Part 1: Overview

This post focuses on BGP VxLAN EVPN, so an understanding of BGP and VxLAN is very helpful. EVPN and VxLAN are considered overlay technologies which run over an underlay IP fabric; in this context the underlay fabric's purpose is to provide reachability between VTEPs.

Whilst outside the focus of this post, the main choices for the underlay are OSPF or IS-IS. There are pros and cons to each option. OSPFv2 is very well understood by most engineers and simple to deploy, however it does not support IPv6, and the OSPFv3 implementations that would be required for IPv6 support are still maturing across vendors. Alternatively, IS-IS has supported both IPv4 and IPv6 for many years and is well supported by vendors, but is not well understood by many engineers outside of telcos. Note: BGP can also be used as the underlay, but as it is also utilised in the overlay this can cause confusion and complexity. Utilising BGP for the underlay is fine, however I would recommend doing your own research on the underlay protocol to use, taking into account the skills of the engineers who will be deploying and supporting the fabric.

In summary, VxLAN is a tunneling mechanism which takes a layer 2 frame or a layer 3 packet, encapsulates it with a UDP/IP header, and routes it to a VxLAN Virtual Tunnel End Point (VTEP) for decapsulation; it effectively encapsulates a MAC frame inside an IP packet.

Similar to VLANs, which have a 12-bit field specifying the VLAN to which a frame belongs (for a total of 4096 VLAN tags), the VxLAN header includes a 24-bit field called the VxLAN Network Identifier (VNI), which allows up to 16 million layer 2 domains.

By default VxLAN uses flood-and-learn behavior with a multicast control plane, which is fine for small deployments but does have scalability limitations in large deployments. Another method is ingress Head End Replication (HER), which does not require multicast but is still a flood-and-learn data plane procedure. There are also some controller based solutions, but these are outside the scope of this discussion.

To resolve the scaling limitations of the flood-and-learn approach, the Ethernet VPN (EVPN) control plane was created, utilising a new address family in Multi-Protocol BGP (MP-BGP) to distribute layer 2 and layer 3 host reachability information. MP-BGP's Network Layer Reachability Information (NLRI) was extended to carry both layer 2 MAC and layer 3 IP information at the same time, and this is called EVPN - Ethernet Virtual Private Network. It also offers a range of other benefits, such as a reduction of data center traffic through ARP suppression.

Utilising BGP as the control plane for VxLAN enables capabilities such as MAC address learning and VRF multi-tenancy while providing optimized equal-cost multi-pathing (ECMP). The new BGP address family in Multi-Protocol BGP is utilised to exchange Network Layer Reachability Information (NLRI) via a series of route types. Of these route types, the two most applicable for this discussion are:

Type 2 – Host MAC and IP addresses (MAC-VRF)
Type 5 – IP Prefix information (IP-VRF)

Type-2 routes (RT-2) are utilised to advertise an end host's MAC and IP address within a VLAN over an IP network. A VxLAN Network Identifier (VNI) is mapped to a VLAN, and all VTEPs (typically leaf switches) within the VNI utilise RT-2 to share and learn end host MAC addresses, providing layer 2 reachability.

Type-5 routes (RT-5) are utilised to advertise IP prefixes. A VxLAN Network Identifier (VNI) is mapped to a Virtual Routing & Forwarding (VRF) instance which identifies a tenant within the fabric, allowing multiple tenants and route tables to coexist.

The advertisement of the type 5 EVPN attribute will provide the NLRI between subnets and routing contexts, allowing for learning of prefixes (not MACs) that are advertised across different VRFs in the fabric. This means the fabric can provide end-to-end segmentation without being aware of the segmentation itself. For example, a VRF context can be created on a pair of Leaf switches and be extended to some other pair of Leaf switches without the devices in between being aware of the VRFs. With EVPN, only the leaf switches need to possess the VRFs which endpoints are attached to, allowing the Spine switches to simply provide transit between Leafs.

There are two models to provide inter-subnet routing with EVPN, which are asymmetric integrated routing and bridging (IRB) and symmetric IRB. The main difference between the asymmetric IRB model and symmetric IRB model is how and where the routing lookup is done, which results in differences concerning which VNI the packet travels on through the infrastructure.

The asymmetric model allows routing and bridging on the VxLAN tunnel ingress, but only bridging on the egress. This results in bi-directional VxLAN traffic traveling on different VNIs in each direction (always the destination VNI) across the routed infrastructure.

The symmetric model routes and bridges on both the ingress and egress leafs. This results in bi-directional traffic being able to travel on the same VNI, hence the name. However, a new specialty transit VNI, called the L3VNI, is used for all routed VxLAN traffic. All traffic that needs to be routed is routed onto the L3VNI, tunneled across the layer 3 infrastructure, routed off the L3VNI to the appropriate VLAN, and ultimately bridged to the destination.
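To make the L3VNI idea concrete, below is a minimal NX-OS-style sketch of the plumbing on a leaf switch. All names, VLAN and VNI numbers are hypothetical, and the full overlay configuration is the subject of a later post:

```
! tenant VRF with its transit L3VNI
vrf context TENANT-A
  vni 50999

! a VLAN and SVI reserved solely to carry the L3VNI
vlan 999
  vn-segment 50999

interface Vlan999
  vrf member TENANT-A
  ! no IP address needed; the SVI just enables routing onto the L3VNI
  ip forward
  no shutdown

! associate the L3VNI with the VTEP interface
interface nve1
  member vni 50999 associate-vrf
```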

Depending on the vendor hardware the topic of asymmetric or symmetric model may not be of concern as some hardware only supports one model and thus you will need to configure the fabric based on that limitation.

Generally, if you configure all VLANs/Subnets/VNIs on all leafs anyway then the asymmetric model is fine and may be simpler to configure as it doesn’t require extra VNIs.

If your VLANs/Subnets/VNIs are widely dispersed and/or provisioned on the fly, then the symmetric model is better and all routed traffic will use a transit VNI (L3VNI), while bridged traffic will use L2VNI.

NOTE: The symmetric model is what Cisco utilises and supports.

This should provide an overview of EVPN, and I’ll delve into more technical detail and configuration in subsequent posts.