BGP VxLAN EVPN – Part 1: Overview

This post focuses on BGP VxLAN EVPN, so an understanding of BGP and VxLAN is very helpful. EVPN and VxLAN are overlay technologies which run over an underlay IP fabric; in this context the underlay fabric's purpose is to provide reachability between VTEPs.

Whilst outside the focus of this post, the main choices for the underlay are OSPF or IS-IS, and there are pros and cons to each. OSPFv2 is very well understood by most engineers and simple to deploy, however it does not support IPv6; OSPFv3 would be required for IPv6 support, and vendor implementations of OSPFv3 are still maturing. Alternatively, IS-IS has supported both IPv4 and IPv6 for many years and is well supported by vendors, but is not well understood by many engineers outside of Telcos. Note: BGP can also be used as the underlay, but as it is also utilised in the overlay this can cause confusion and complexity. Utilising BGP for the underlay is fine, however I would recommend doing your own research on the underlay protocol, taking into account the skills of the engineers who will be deploying and supporting the fabric.

In summary, VxLAN is a tunnelling mechanism which takes a layer 2 frame or layer 3 packet, encapsulates it with an IP header, and routes it to a VxLAN Tunnel End Point (VTEP) for decapsulation; it effectively encapsulates a MAC address inside an IP packet.

Similar to VLANs, which have a 12-bit field specifying the VLAN to which the frame belongs, for a total of 4096 VLAN tags, the VxLAN header includes a 24-bit field called the VxLAN Network Identifier (VNI), which allows up to 16 million layer 2 domains.
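
To make those numbers concrete, the following Python sketch builds the 8-byte VxLAN header defined in RFC 7348 and shows the 12-bit vs 24-bit ID spaces (the function name and example VNI are illustrative, not from any particular implementation):

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VxLAN header (RFC 7348).

    Byte 0 carries the flags; the I bit (0x08) indicates a valid VNI.
    The 24-bit VNI occupies the top three bytes of the second word.
    """
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    flags_word = 0x08 << 24           # I flag set, all other bits reserved
    return struct.pack("!II", flags_word, vni << 8)

print(2 ** 12)                    # 4096 VLAN tags
print(2 ** 24)                    # 16777216 possible VNIs
print(vxlan_header(10100).hex())  # 0800000000277400
```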

By default, VXLAN uses flood-and-learn behaviour with a multicast-based control plane, which is fine for small deployments but has scalability limitations in large ones. Another method is ingress Head End Replication (HER), which does not require multicast but is still a flood-and-learn data plane procedure. There are also some controller-based solutions, but these are outside the scope of this discussion.

To resolve the scaling limitations of the flood-and-learn approach, the Ethernet VPN (EVPN) control plane was created, utilising a new address family in Multi-Protocol BGP (MP-BGP) to distribute layer 2 and layer 3 host reachability information. In other words, MP-BGP was extended with Network Layer Reachability Information (NLRI) that carries both Layer 2 MAC and Layer 3 IP information at the same time, and this is called EVPN – Ethernet Virtual Private Network. It also offers a range of other benefits, such as reducing data centre traffic through ARP suppression.

Utilising BGP as the control plane for VxLAN enables capabilities such as MAC address learning and VRF multi-tenancy while providing optimized equal-cost multi-pathing (ECMP). The new BGP address family in Multi-Protocol BGP is utilised to exchange Network Layer Reachability Information (NLRI) via a series of route types. Of these route types, the two most applicable for this discussion are:

Type 2 – Host MAC and IP addresses (MAC-VRF)
Type 5 – IP Prefix information (IP-VRF)
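
As a rough illustration, enabling this address family on Cisco NX-OS looks something like the following (the AS number and neighbour address are hypothetical examples, not taken from this post):

nv overlay evpn
feature bgp

router bgp 65001
 neighbor 10.0.0.1
  remote-as 65001
  update-source loopback0
  address-family l2vpn evpn
   send-community extended

The l2vpn evpn address family is what carries the route types listed above between the leaf and spine switches.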

Type-2 routes (RT-2) are utilised to advertise an end host's MAC and IP addresses within a VLAN over the IP network. A VxLAN Network Identifier (VNI) is mapped to a VLAN, and all VTEPs (typically leaf switches) within the VNI utilise RT-2 to share and learn end hosts' MAC addresses, providing Layer 2 reachability.
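
On NX-OS, for example, the VLAN-to-VNI mapping and BGP-based host reachability might be configured along these lines (VLAN 100 and VNI 10100 are illustrative values only):

vlan 100
 vn-segment 10100

interface nve1
 no shutdown
 host-reachability protocol bgp
 source-interface loopback1
 member vni 10100
  ingress-replication protocol bgp

Here 'host-reachability protocol bgp' replaces flood-and-learn with EVPN control-plane learning, and ingress replication avoids the need for multicast in the underlay.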

Type-5 routes (RT-5) are utilised to advertise IP prefixes. A VXLAN Network Identifier (VNI) is mapped to a Virtual Routing & Forwarding (VRF) instance which identifies a tenant within the fabric, allowing multiple tenants and routing tables to coexist.

The advertisement of type-5 EVPN routes provides the NLRI between subnets and routing contexts, allowing prefixes (not MACs) to be learnt across the fabric within each VRF. This means the fabric can provide end-to-end segmentation without being aware of the segmentation itself. For example, a VRF context can be created on a pair of Leaf switches and extended to some other pair of Leaf switches without the devices in between being aware of the VRFs. With EVPN, only the leaf switches need to possess the VRFs to which endpoints are attached, allowing the Spine switches to simply provide transit between Leafs.

There are two models to provide inter-subnet routing with EVPN, which are asymmetric integrated routing and bridging (IRB) and symmetric IRB. The main difference between the asymmetric IRB model and symmetric IRB model is how and where the routing lookup is done, which results in differences concerning which VNI the packet travels on through the infrastructure.

The asymmetric model routes and bridges on the VXLAN tunnel ingress, but only bridges on the egress. This results in bi-directional VXLAN traffic travelling on different VNIs in each direction (always the destination VNI) across the routed infrastructure.

The symmetric model routes and bridges on both the ingress and the egress leafs. This results in bi-directional traffic being able to travel on the same VNI, hence the name. However, a new specialty transit VNI, called the L3VNI, is used for all routed VXLAN traffic. All traffic that needs to be routed will be routed onto the L3VNI, tunnelled across the layer 3 infrastructure, routed off the L3VNI to the appropriate VLAN, and ultimately bridged to the destination.
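
As a sketch of what this looks like on NX-OS, a symmetric IRB tenant with its L3VNI might be configured as follows (the VRF name TENANT-A, VLAN 2500 and VNI 50001 are hypothetical examples):

vrf context TENANT-A
 vni 50001
 rd auto
 address-family ipv4 unicast
  route-target both auto evpn

vlan 2500
 vn-segment 50001

interface Vlan2500
 no shutdown
 mtu 9216
 vrf member TENANT-A
 ip forward

interface nve1
 member vni 50001 associate-vrf

The SVI for the L3VNI has no IP address of its own ('ip forward'); it simply provides the routed hop onto and off the transit VNI.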

Depending on the vendor hardware the topic of asymmetric or symmetric model may not be of concern as some hardware only supports one model and thus you will need to configure the fabric based on that limitation.

Generally, if you configure all VLANs/Subnets/VNIs on all leafs anyway, then the asymmetric model is fine and may be simpler to configure, as it doesn't require extra VNIs.

If your VLANs/Subnets/VNIs are widely dispersed and/or provisioned on the fly, then the symmetric model is better: all routed traffic will use a transit VNI (L3VNI), while bridged traffic will use an L2VNI.

NOTE: The symmetric model is what Cisco utilises and supports.

This should provide an overview of EVPN, and I’ll delve into more technical detail and configuration in subsequent posts.

OTV – Configuration and Verification

In my previous post here, I discussed the concepts of OTV, so if you are not familiar with OTV concepts perhaps go read that post first. In this post I intend to dive a little deeper into OTV configuration and verification.

The assumption is that IP connectivity is in place, within the Data Centre and between Data Centres.

One of the first configuration steps is to define the OTV site VLAN and OTV site identifier.

As mentioned, the local OTV edge devices need to communicate as part of the AED election process, and the requirement for this election is that the participating devices are connected via a local VLAN. Note: this site VLAN must NOT be stretched over the OTV link, but rather trunked between the OTV edge devices. Thus on each of the OTV edge devices a site VLAN needs to be configured, such as:

otv site-vlan 999


This site VLAN can be the same in each site, but must not be extended over the Overlay interface. It enables the OTV edge devices to discover each other and determine, on a per-VLAN basis, which device is the nominated AED; later in this post you will see that even VLANs are active on one OTV edge whilst odd VLANs are active on the other, shown via the 'show otv vlan' command.

OTV uses the site identifier to identify the OTV edge devices which exist in a specific site (where a site is a single geographic data centre) and which can form an adjacency with another site. In a site with dual AEDs the site identifier needs to match, but it should be unique per site: in one site the identifier may be 0x001 on both OTV edge devices, whilst in the other site it may be 0x002. The following example shows a site identifier:

otv site-identifier 0x001

At a high level the overlay interface configuration would look like the following, where port-channel1 is defined as the join interface, which as per the previous post is the L3 physical link or port-channel used to route upstream towards the DCI / Core. The Overlay interface is the logical OTV tunnel interface which performs the encapsulation and where the OTV configuration is done.

interface Overlay1
 description Overlay Network
 otv join-interface port-channel1
 otv extend-vlan 100, 205
 otv use-adjacency-server 172.16.50.1 unicast-only
 no shutdown

The Join interface is the L3 interface on the OTV edge device connecting to the DCI or Core (IP transport network). This interface is used as the source of OTV encapsulation and assigned to the logical ‘Overlay’ interface.

interface port-channel1
 description OTV/GRE uplink to Core / DCI
 mtu 9216
 ip address 172.16.50.1/30

The Internal interface is a L2 interface, typically configured as a trunk or access port, which takes part in STP and learns MAC addresses per normal. This is typically the interface that connects the device performing L3 gateway / SVI functionality to the OTV edge device.

interface port-channel10
 description To L3 Router VDC
 switchport
 switchport mode trunk
 switchport trunk allowed vlan 100,205
 spanning-tree port type normal
 mtu 9216
 vpc 10

As mentioned, the local OTV edge devices need to communicate as part of the AED election process via the site VLAN, which is local to these devices and NOT stretched over the OTV link, but rather trunked between the OTV edge devices. This enables the OTV edge devices to discover each other and determine their roles as the AED on a per-VLAN basis.

Once the configuration is in place, all of the OTV edge devices need to form an adjacency. This allows each OTV edge device to learn and distribute the list of neighbours to which it can replicate control packets. Every OTV edge device which joins the OTV domain registers with the Adjacency Server, through which the other OTV edge devices are discovered dynamically; thus all are aware of each other and can update when OTV devices join or leave.
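
For completeness, the server side of this arrangement is a single command under the Overlay interface on the device acting as the Adjacency Server (here the device owning 172.16.50.1, which the 'show otv overlay 1' output below confirms with 'Is Adjacency Server : Yes'):

interface Overlay1
 otv adjacency-server unicast-only

The remaining OTV edge devices point at it with the 'otv use-adjacency-server' command shown in the earlier Overlay interface configuration.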

To check the OTV Adjacency use the command “show otv adjacency”

Hostname                         System-ID      Dest Addr       Up Time   State
otv-site1-2                      4055.3905.64c1 172.16.50.2     2w1d      UP
otv-site2-2                      4055.3905.b6c1 172.16.50.50    2w1d      UP
otv-site2-1                      4055.3905.c641 172.16.50.46    2w1d      UP

The command "show otv overlay 1" also provides good information, including the Adjacency Server details.

#show otv overlay 1

OTV Overlay Information
Site Identifier 0000.0000.0100

Overlay interface Overlay1

VPN name : Overlay1
VPN state : UP
Extended vlans : 100 205 (Total:2)
Join interface(s) : Po1 (172.16.50.1)
Site vlan : 999 (up)
AED-Capable : Yes
Capability : Unicast-Only
Is Adjacency Server : Yes
Adjacency Server(s) : 172.16.50.1

To confirm which VLANs have been stretched over the Overlay interface and identify which OTV edge device is the active AED for each, the command "show otv vlan" can be used.

#show otv vlan

OTV Extended VLANs and Edge Device State Information (* - AED)

Legend:
(NA) - Non AED, (VD) - Vlan Disabled, (OD) - Overlay Down
(DH) - Delete Holddown, (HW) - HW: State Down
(NFC) - Not Forward Capable

VLAN   Auth. Edge Device                     Vlan State               Overlay
----   -----------------------------------   ----------------------   -------
 100*  otv-site-1                            active                   Overlay1
 205   otv-site-2                            inactive(NA)             Overlay1

As can be seen above, on this OTV edge device VLAN 100 is active, meaning this device is the AED responsible for encapsulating that VLAN's traffic and sending it to the other site. VLAN 205 is also stretched across the OTV, but this device's neighbour is the active AED for that odd VLAN.

The command “show otv route” can be utilised to see where a specific MAC address is learnt from. In the following example the MAC address for the host in VLAN100 is learnt over the OTV link, whilst the MAC address for the host in VLAN205 is learnt from the downstream gateway device local to this site.

#show otv route

OTV Unicast MAC Routing Table For Overlay1

VLAN MAC-Address     Metric  Uptime    Owner      Next-hop(s)
---- --------------  ------  --------  ---------  -----------
 100 0050.568d.16d7  42      1w5d      overlay    otv-site-1
 205 0050.568d.5b2d  1       1w5d      site       port-channel10

On the gateway router where the SVIs exist, it is always a good idea to check that the MAC addresses are being learnt locally and that the HSRP/VRRP status is what you expect to see. In this example no FHRP (HSRP/VRRP) filtering is done, thus traffic to/from the end hosts is always routed via the same gateway in the same site. There is an issue with this approach, as it can cause traffic to trombone across the OTV link, adding latency and providing a less than optimal path, but that is a post for another time.

#show hsrp brief
*:IPv6 group #:group belongs to a bundle
P indicates configured to preempt.
|
 Interface   Grp  Prio P State    Active addr      Standby addr     Group addr
  Vlan100     1    120  P Active   local            172.24.24.2      172.24.24.1 (conf)

Also check that the primary gateway device can see the appropriate MAC addresses being learnt:

#show ip arp vrf VPN-GW-1 | incl Vlan100
172.24.24.10    00:07:22  0050.568d.16d7  Vlan100
172.24.24.11    00:00:26  0050.568d.43db  Vlan100

Note: whilst the MTU in this example has been set to 9216, as this is supported by the IP transport which OTV runs over, the gateway SVIs should be set lower (here 9000) to allow for the 42 bytes of encapsulation overhead added by the OTV edge devices.

Just for completeness the configuration of the SVI gateway for VLAN100 is as follows:

interface Vlan100
 description : Gateway_Test_OTV-SPAN
 no shutdown
 mtu 9000
 vrf member VPN-GW-1
 ip address 172.24.24.0/24
 ip unreachables
 hsrp version 2
 hsrp 1
   preempt
   priority 120
   ip 172.24.24.1