VyosAristaCVX-VXLAN-lab

From NesevoWiki
Jump to navigationJump to search

Introduction

More and more, EVPN is making its place in Datacenter field. Mainstream vendors integrate its support, with VXLAN or MPLS for data plane + BGP for control plane. In previous labs, we were able to observe that interoperability was quite good between vendors. Nevertheless, even we've already tested CumulusVX, there are also other open sources NOS which can support EVPN.

Today we'll test VyOS in it's latest rolling version, which handles the support of EVPN.

The targets of this lab are to :

  • create a VXLAN Fabric with VyOS nodes
  • provide a L2VPN and a L3VPN to client nodes within the fabric
  • integrate some other vendors node into the fabric, here Arista and Cumulus VX
  • test that all works as expected

Lab setup

To reach those goals, we'll use the EVE-NG tool in community edition (https://www.eve-ng.net/index.php/download/#DL-COMM) and use the following NOS version :

We'll build a 3-tiers fabric, with leaf/spine/superspine devices, VXLAN/EVPN running on all nodes. It gives the following setup:

Nesevo-Lab-Vyatta-Arista-CVX.png

On EVE-NG, we add IP information, ASN, loopback IPs. All other information will be past after when dealing with technical details.

So, it gives:

2022-02 eveng-lab-vyatta-cum-arista.png

VXLAN/BGP infos

Note: in our lab, route-distinguisher is constructed on the model <loopback_ip>:<vlan_id>

service vlan id route-target VNI Commentaires
L2 cross-DC 100 target:65000:100 100 the 6 clients are in the same broadcast domain / subnet
L3VPN cros-DC 200 target:65000:200 200 the 6 clients have their own subnets. Here we use only one vni because no ESI will be implemented. One leaf one subnet.

Client devices are represented by computers on EVE-NG. IPs in green are using the L2VPN service (extended layer 2 between pods) and IPs in orange the L3VPN service (routing service).

Below some diagrams showing the logic topologies :

Nesevo-Lab-Vyatta-Arista-CVX-l2vpn.png


Nesevo-Lab-Vyatta-Arista-CVX-l3vpn.png


Below the link to the configurations to allow you to quickly reproduce the tab.

To be noted that the configurations are not "ready-to-prod " and need to be optimized (adding bfd, bgp optimization, etc.). Some typos can subsist. Whatever, the lab works and gives the expected results :)

Link to devices configurations : lab-vyos-eos-cvx_v2.zip

Lab details

Considering that the lab is up and running, we'll now come back to the architecture.

We have 3 kinds of network nodes :

  • leaf switches, on edge, which provide L2 and L3 services.
  • couple of spine switches which allow leaf from a same pod to exchange traffic. They are also connected to superspine to reach other pods
  • superspine switches which interconnext the different spine switches from the different pods, thus allowing interpods traffic


All those network devices control plane can exchange routes' information within 2 layers of bgp : underlay and overlay.

The role of the "underlay" is to dynamically route the traffic between loopback interfaces from the different nodes. Loopback interfaces are used to mount BGP sessions for the underlay and are used as VTEP interfaces on leaf nodes. Generally, it's better to have 2 different Loopback for this usage (in case of anycast vtep setup for example), but some vendors don't allow to 2 loopback interfaces in the same routing context. Whatever, in our setup, having only one loopback is fine.

Here, the underlay is done in eBGP, ipv4 unicast family only + route-map to only redistribute loopback interfaces IP.

  • Each node mount a session to the directly connected peers.
  • Each leaf node has a dedicated ASN.
  • Each couple of spine/superspine has a ASN for two, to follow the best practices to optimize the routing table (ECMP).

The role of the "overlay" is to exchange EVPN routes type 2,3,5 here (type 1/4 routes seem to not be managed by VyOS at the moment, TBC).

  • eBGP multihop sessions are mounted to directly neighbor loopback IPs.
  • policies/options are set to not change the original nexthop when propagating routes
  • for L2 services we only allow the redistribution of type 2 (macip) and type 3 (inclusive mcast for BUM management) routes
  • for L3 services we only allow the redistribution of type 5 (prefix) routes
  • regarding the setup, to enhance the scalability, it could be nice to use dedicated nodes (VM or phys) to take the role of route-servers. Here, we could connect them to superspine. Thus, spine/superspine would only do underlay and could not be a bottleneck for the control plane scalability



Moreover, regarding the lab setup, no bgp pic / bfd / VMTO / other option are implemented to optimize the convergence / traffic delivery. The goal was to verify that VyOS is compatible in a basic but functionnal VXLAN setup.

Now, we'll take a look on the different output from the 2 vendors leaf / spine / superspine to verify that all works as expected.

We verify that underlay works well. All is fine if:

  • we have an eBGP session per directly connected peer
  • we receive the same number of prefixes per peer type:
    • a leef has to receive the same number of prefixes from its spine neighbors
    • spine neigbors have to receive the same amount of prefixes from superspine, and reciprocally
    • all VTEP IPs have to be known into the GRT (global routing table / default routing instance)
On VyOS

Below we can see that we receive 17 prefixes from both neighbors and that we know all loopback IPs from the setup (13 remote, 1 local). We have a bit too much routes, because on Cumulus nodes the filters that I have configured seem misconstructed.

To be noted that spine will announce all the Loopback IPs except the one from it's twin, since they are not directly connected and superspine respect as loop protection constraint (don't reflect routes originated by an ASN "X" to a peer from the same ASN "X").

vyos@leaf1-pod1:~$ show ip bgp summary established 

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.255.255.5, local AS number 65011 vrf-id 0
BGP table version 20
RIB entries 37, using 6808 bytes of memory
Peers 2, using 1446 KiB of memory
Peer groups 2, using 128 bytes of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
10.254.254.9    4      65001     29993     29988        0    0    0 02w6d19h           17       19 N/A
10.254.254.11   4      65001     17218     17220        0    0    0 01w4d22h           17       19 N/A

Displayed neighbors 2
Total number of neighbors 2

vyos@leaf1-pod1:~$ show ip route | match /32
B>* 10.255.255.1/32 [20/0] via 10.254.254.9, eth0, weight 1, 02w6d19h
B>* 10.255.255.2/32 [20/0] via 10.254.254.11, eth1, weight 1, 01w4d22h
B>* 10.255.255.3/32 [20/0] via 10.254.254.9, eth0, weight 1, 01w4d18h
B>* 10.255.255.4/32 [20/0] via 10.254.254.9, eth0, weight 1, 01w4d18h
C>* 10.255.255.5/32 is directly connected, dum0, 02w6d19h
B>* 10.255.255.6/32 [20/0] via 10.254.254.9, eth0, weight 1, 01w4d22h
B>* 10.255.255.7/32 [20/0] via 10.254.254.9, eth0, weight 1, 01w4d18h
B>* 10.255.255.8/32 [20/0] via 10.254.254.9, eth0, weight 1, 01w4d18h
B>* 10.255.255.9/32 [20/0] via 10.254.254.9, eth0, weight 1, 01w4d18h
B>* 10.255.255.10/32 [20/0] via 10.254.254.9, eth0, weight 1, 01w4d18h
B>* 10.255.255.11/32 [20/0] via 10.254.254.9, eth0, weight 1, 01w2d01h
B>* 10.255.255.12/32 [20/0] via 10.254.254.9, eth0, weight 1, 01w2d00h
B>* 10.255.255.13/32 [20/0] via 10.254.254.9, eth0, weight 1, 01w2d01h
B>* 10.255.255.14/32 [20/0] via 10.254.254.9, eth0, weight 1, 01w2d00h
On Arista

We verify that we receive prefix + same leaf / superspine / remote pods spine IPs (we note there misses some filters on spine since we receive /31 interco routes):

leaf1-pod3#show ip bgp neighbors 10.254.254.43 received-routes 
BGP routing table information for VRF default
Router identifier 10.255.255.11, local AS number 65015
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - ECMP
                    S - Stale, c - Contributing to ECMP, b - backup, L - labeled-unicast
                    % - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI Origin Validation codes: V - valid, I - invalid, U - unknown
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop

          Network                Next Hop              Metric  AIGP       LocPref Weight  Path
          10.254.254.36/31       10.254.254.43         0       -          -       -       65004 ?
          10.254.254.38/31       10.254.254.43         0       -          -       -       65004 ?
          10.254.254.42/31       10.254.254.43         0       -          -       -       65004 ?
          10.254.254.44/31       10.254.254.43         -       -          -       -       65004 65016 ?
          10.254.254.46/31       10.254.254.43         0       -          -       -       65004 ?
 *        10.255.255.1/32        10.254.254.43         -       -          -       -       65004 65003 65001 ?
 *        10.255.255.2/32        10.254.254.43         -       -          -       -       65004 65003 65001 ?
 *        10.255.255.3/32        10.254.254.43         -       -          -       -       65004 65003 65002 ?
 *        10.255.255.4/32        10.254.254.43         -       -          -       -       65004 65003 65002 ?
 *        10.255.255.5/32        10.254.254.43         -       -          -       -       65004 65003 65001 65011 ?
 *        10.255.255.6/32        10.254.254.43         -       -          -       -       65004 65003 65001 65012 ?
 *        10.255.255.7/32        10.254.254.43         -       -          -       -       65004 65003 65002 65013 ?
 *        10.255.255.8/32        10.254.254.43         -       -          -       -       65004 65003 65002 65014 ?
 *        10.255.255.9/32        10.254.254.43         -       -          -       -       65004 65003 ?
 *        10.255.255.10/32       10.254.254.43         -       -          -       -       65004 65003 ?
 *        10.255.255.12/32       10.254.254.43         -       -          -       -       65004 65016 ?
 * >      10.255.255.14/32       10.254.254.43         0       -          -       -       65004 ?
leaf1-pod3#show ip bgp neighbors 10.254.254.41 received-routes
BGP routing table information for VRF default
Router identifier 10.255.255.11, local AS number 65015
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - ECMP
                    S - Stale, c - Contributing to ECMP, b - backup, L - labeled-unicast
                    % - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI Origin Validation codes: V - valid, I - invalid, U - unknown
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop

          Network                Next Hop              Metric  AIGP       LocPref Weight  Path
 * >      10.255.255.1/32        10.254.254.41         -       -          -       -       65004 65003 65001 ?
 * >      10.255.255.2/32        10.254.254.41         -       -          -       -       65004 65003 65001 ?
 * >      10.255.255.3/32        10.254.254.41         -       -          -       -       65004 65003 65002 ?
 * >      10.255.255.4/32        10.254.254.41         -       -          -       -       65004 65003 65002 ?
 * >      10.255.255.5/32        10.254.254.41         -       -          -       -       65004 65003 65001 65011 ?
 * >      10.255.255.6/32        10.254.254.41         -       -          -       -       65004 65003 65001 65012 ?
 * >      10.255.255.7/32        10.254.254.41         -       -          -       -       65004 65003 65002 65013 ?
 * >      10.255.255.8/32        10.254.254.41         -       -          -       -       65004 65003 65002 65014 ?
 * >      10.255.255.9/32        10.254.254.41         -       -          -       -       65004 65003 ?
 * >      10.255.255.10/32       10.254.254.41         -       -          -       -       65004 65003 ?
 * >      10.255.255.12/32       10.254.254.41         -       -          -       -       65004 65016 ?
 * >      10.255.255.13/32       10.254.254.41         -       -          -       -       65004 i
On Cumulus
cumulus@cumulus:mgmt:~$ net show bgp summary 
show bgp ipv4 unicast summary
=============================
BGP router identifier 10.255.255.12, local AS number 65016 vrf-id 0
BGP table version 103
RIB entries 37, using 7104 bytes of memory
Peers 2, using 43 KiB of memory

Neighbor                   V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
10.254.254.45              4      65004    306764    261331        0    0    0 01:39:02           12
spine1-pod3(10.254.254.47) 4      65004    261602    261664        0    0    0 01w2d02h           16

Total number of neighbors 2

[...]

cumulus@cumulus:mgmt:~$ net show route | grep /32
B>* 10.255.255.1/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
B>* 10.255.255.2/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
B>* 10.255.255.3/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
B>* 10.255.255.4/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
B>* 10.255.255.5/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
B>* 10.255.255.6/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
B>* 10.255.255.7/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
B>* 10.255.255.8/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
B>* 10.255.255.9/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
B>* 10.255.255.10/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
B>* 10.255.255.11/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
C>* 10.255.255.12/32 is directly connected, lo, 01w2d02h
B>* 10.255.255.13/32 [20/0] via 10.254.254.45, swp1, weight 1, 01:39:03
B>* 10.255.255.14/32 [20/0] via 10.254.254.47, swp2, weight 1, 01w2d02h

We can also verify that routing is ok by pinging remote loopback IPs from local loopback + check eBGP EVPN session status:

# ping example from VyOS leaf
vyos@leaf1-pod1:~$ ping 10.255.255.12 source-address 10.255.255.5
PING 10.255.255.12 (10.255.255.12) from 10.255.255.5 : 56(84) bytes of data.
64 bytes from 10.255.255.12: icmp_seq=1 ttl=61 time=2.60 ms
64 bytes from 10.255.255.12: icmp_seq=2 ttl=61 time=2.54 ms
64 bytes from 10.255.255.12: icmp_seq=3 ttl=61 time=2.44 ms

# check of EVPN sessions on the different vendors

# vyos

vyos@leaf1-pod1:~$ show bgp l2vpn evpn summary 
BGP router identifier 10.255.255.5, local AS number 65011 vrf-id 0
BGP table version 0
RIB entries 31, using 5704 bytes of memory
Peers 2, using 1446 KiB of memory
Peer groups 2, using 128 bytes of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
10.255.255.1    4      65001     30205     30208        0    0    0 02w6d21h           13       13 N/A
10.255.255.2    4      65001     17392     17429        0    0    0 01w5d00h           11       13 N/A

# Arista

leaf1-pod3#show bgp evpn summary
BGP summary information for VRF default
Router identifier 10.255.255.11, local AS number 65015
Neighbor Status Codes: m - Under maintenance
  Description              Neighbor         V AS           MsgRcvd   MsgSent  InQ OutQ  Up/Down State   PfxRcd PfxAcc
  OVL_spine1-pod3          10.255.255.13    4 65004          15608     15503    0    0 00:03:47 Estab   17     17
  OVL_spine2-pod3          10.255.255.14    4 65004         262780    308558    0    0    9d02h Estab   13     13

# Cumulus

cumulus@cumulus:mgmt:~$ net show bgp l2vpn evpn summary
BGP router identifier 10.255.255.12, local AS number 65016 vrf-id 0
BGP table version 0
RIB entries 29, using 5568 bytes of memory
Peers 2, using 43 KiB of memory

Neighbor                   V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
10.255.255.13              4      65004    307243    261693        0    0    0 00:04:25           14
spine1-pod3(10.255.255.14) 4      65004    262001    262025        0    0    0 01w2d02h           10


On EVPN, we should receive quite the same number of prefixes from both spine, but looks like there differences of best path election + redistribution between vendors. Perhaps something to dig. The most important point is to verify that the FIB for the services are consistent.

L2 service implementation

On that topic, we'll firstly focus on the Layer 2 service. To be sure that all works as expected we have to check that:

  • mac-address-table are consistent on the whole fabric. In traditionnal L2 setup, switch FIBs depend of the traffic passing through them and a mac @ can be see "everywhere" if it does a BUM on the network. In VXLAN, in a mac-vrf topology without filer, even a host doesn't send BUM, the mac address is propagated on the whole fabric (within BGP update). But, as it is on traditional L2 topology, you have to be sure that mac @ table aging time are consistent on all devices, with an aging time superior to ARP timer on hosts.
  • routes type 2 are visible on each leaf nodes for a same vni, routes type 3 to reach remote VTEP have to be learned
  • a common route-target for the VRF routes need to be used/set within policies. In effect, there are differences between vendors on route "tagging" (features to auto generate route-target, "hard coded" alogos, etc.). Here, on VyOS, we got to add adminstrative extended communities within route-map for the L2 VNI. We'll show it below
  • all fabric connections (interconnections between network nodes) have the sufficient "IP MTU". Please keep in mind that VXLAN overhead has a size of 50 bytes. So, to be sure that Jumbo traffic for host is ok, configure it at least at 9050 Bytes. I would advise to push it at 9100 Bytes in case of VXLAN usage on client HOSTS for Virtualized setup on servers.


Check FDB on leaf switch + debug type 2 routes' propagation for leaf1 pod1

##############
# VyOS
##############

vyos@leaf1-pod1:~$ show bridge br100 fdb
50:01:00:01:00:02 dev eth2 vlan 1 master br100 permanent
50:01:00:01:00:02 dev eth2 master br100 permanent
33:33:00:00:00:01 dev eth2 self permanent
33:33:00:00:00:02 dev eth2 self permanent
01:00:5e:00:00:01 dev eth2 self permanent
01:80:c2:00:00:0e dev eth2 self permanent
01:80:c2:00:00:03 dev eth2 self permanent
01:80:c2:00:00:00 dev eth2 self permanent
33:33:00:00:00:01 dev br100 self permanent
33:33:00:00:00:02 dev br100 self permanent
33:33:ff:01:00:02 dev br100 self permanent
01:00:5e:00:00:6a dev br100 self permanent
33:33:00:00:00:6a dev br100 self permanent
01:00:5e:00:00:01 dev br100 self permanent
33:33:ff:00:00:00 dev br100 self permanent
00:50:79:66:68:15 dev vxlan100 vlan 1 extern_learn master br100 
00:50:79:66:68:15 dev vxlan100 extern_learn master br100 
8a:d0:00:16:1c:17 dev vxlan100 vlan 1 master br100 permanent
8a:d0:00:16:1c:17 dev vxlan100 master br100 permanent
00:00:00:00:00:00 dev vxlan100 dst 10.255.255.6 self permanent
00:00:00:00:00:00 dev vxlan100 dst 10.255.255.11 self permanent
00:00:00:00:00:00 dev vxlan100 dst 10.255.255.12 self permanent
00:00:00:00:00:00 dev vxlan100 dst 10.255.255.7 self permanent
00:00:00:00:00:00 dev vxlan100 dst 10.255.255.8 self permanent
00:50:79:66:68:15 dev vxlan100 dst 10.255.255.12 self extern_learn 

# and to focus on remote nodes

vyos@leaf1-pod1:~$ show bridge br100 fdb | grep "self extern_learn"
00:50:79:66:68:15 dev vxlan100 dst 10.255.255.12 self extern_learn 
00:50:79:66:68:13 dev vxlan100 dst 10.255.255.11 self extern_learn 
00:50:79:66:68:19 dev vxlan100 dst 10.255.255.8 self extern_learn 
00:50:79:66:68:0c dev vxlan100 dst 10.255.255.6 self extern_learn 
00:50:79:66:68:17 dev vxlan100 dst 10.255.255.7 self extern_learn 

# or on local clients 

vyos@leaf1-pod1:~$ show bridge br100 fdb | grep eth | grep -v perm
00:50:79:66:68:0b dev eth2 master br100 

# Now if we want to focus on route type 3 learned 

Displayed 7 prefixes (12 paths) (of requested type)
vyos@leaf1-pod1:~$ show bgp l2vpn evpn route type 3 | grep Dis
Route Distinguisher: 10.255.255.5:3
Route Distinguisher: 10.255.255.6:3
Route Distinguisher: 10.255.255.7:2
Route Distinguisher: 10.255.255.8:2
Route Distinguisher: 10.255.255.11:100
Route Distinguisher: 10.255.255.12:100

# we notice that RDs for VyOS nodes don't respect the logic of <Loopback:vni>. It s because for the moment we can't set it by vni (or I didn't find how to do that)

##############
# Arista
##############

# we check the fdb. Remote host are behind the port Vx1

leaf1-pod3#show mac address-table vlan 100
          Mac Address Table
------------------------------------------------------------------

Vlan    Mac Address       Type        Ports      Moves   Last Move
----    -----------       ----        -----      -----   ---------
 100    0050.7966.680c    DYNAMIC     Vx1        1       0:00:38 ago
 100    0050.7966.6813    DYNAMIC     Et3        1       0:14:08 ago
 100    0050.7966.6815    DYNAMIC     Vx1        1       9 days, 2:36:39 ago
 100    0050.7966.6817    DYNAMIC     Vx1        1       0:00:36 ago
 100    0050.7966.6819    DYNAMIC     Vx1        1       0:00:35 ago
Total Mac Addresses for this criterion: 5

# we check that all VTEP are ready for BUM

leaf1-pod3#show bgp evpn route-type imet | i >
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - ECMP
 * >Ec   RD: 10.255.255.5:3 imet 10.255.255.5
 * >Ec   RD: 10.255.255.6:3 imet 10.255.255.6
 * >Ec   RD: 10.255.255.7:2 imet 10.255.255.7
 * >Ec   RD: 10.255.255.8:2 imet 10.255.255.8
 * >     RD: 10.255.255.11:100 imet 10.255.255.11
 * >Ec   RD: 10.255.255.12:100 imet 10.255.255.12
[...]

leaf1-pod3#show vxlan vtep detail
Remote VTEPS for Vxlan1:

VTEP               Learned Via        MAC Address Learning       Tunnel Type(s) 
------------------ ------------------ -------------------------- -------------- 
10.255.255.5       control plane      control plane              flood, unicast 
10.255.255.6       control plane      control plane              flood, unicast 
10.255.255.7       control plane      control plane              flood, unicast 
10.255.255.8       control plane      control plane              flood, unicast 
10.255.255.12      control plane      control plane              flood, unicast 

##############
# Cumulus
##############

# we check the fdb. Remote host are behind the port Vx1

cumulus@cumulus:mgmt:~$ net show bridge macs vlan 100

VLAN  Master  Interface  MAC                TunnelDest  State      Flags         LastSeen
----  ------  ---------  -----------------  ----------  ---------  ------------  ----------------
 100  bridge  bridge     50:01:00:11:00:03              permanent                9 days, 02:47:42
 100  bridge  swp3       00:50:79:66:68:15                                       00:00:06
 100  bridge  vni100     00:50:79:66:68:0c                         extern_learn  00:00:40
 100  bridge  vni100     00:50:79:66:68:13                         extern_learn  00:21:01
 100  bridge  vni100     00:50:79:66:68:17                         extern_learn  00:00:38
 100  bridge  vni100     00:50:79:66:68:19                         extern_learn  00:00:36

# --> here we see that it misses the @mac from the host on leaf1 pod1. 
# we check the evpn routes on the Cumulus node and we see that from leaf1 pod1 vni 100 we have only those routes

Route Distinguisher: 10.255.255.5:3
*  [3]:[0]:[32]:[10.255.255.5]
                    10.255.255.5                           0 65004 65003 65001 65011 i
                    RT:65000:100 RT:65011:100 ET:8
*  [3]:[0]:[32]:[10.255.255.5]
                    10.255.255.5                           0 65004 65003 65001 65011 i
                    RT:65000:100 RT:65011:100 ET:8
*> [3]:[0]:[32]:[10.255.255.5]
                    10.255.255.5                           0 65004 65003 65001 65011 i
                    RT:65000:100 RT:65011:100 ET:8

# --> something is wrong on the remote leaf.
# on the leaf we check the policy for the EVPN routes export

  route-map RM-OUT-VNI {
         rule 1 {
             action permit
             match {
                 evpn {
                     route-type multicast
                     vni 100
                 }
             }
             set {
                 extcommunity {
                     rt 65000:100
                 }
             }
         }
         rule 2 {
             action permit
             match {
                 evpn {
                     route-type prefix
                     vni 100
                 }
             }
             set {
                 extcommunity {
                     rt 65000:100
                 }
             }
         }
         rule 3 {
             action permit
             match {
                 evpn {
                     vni 200
                 }
             }
             set {
                 extcommunity {
                     rt 65000:200
                 }

# we notice that a typo in rule numbering removed the export of macip routes for vni 100.
# we correct that : 

route-map RM-OUT-VNI {
         rule 1 {
             action permit
             match {
                 evpn {
                     route-type multicast
                     vni 100
                 }
             }
             set {
                 extcommunity {
                     rt 65000:100
                 }
             }
         }
         rule 2 {
             action permit
             match {
                 evpn {
                     route-type macip
                     vni 100
                 }
             }
             set {
                 extcommunity {
                     rt 65000:100
                 }
             }
         }
         rule 3 {
             action permit
             match {
                 evpn {
                     route-type prefix
                     vni 200
                 }
             }
             set {
                 extcommunity {
                     rt 65000:200
                 }


# once corrected we check the fdb on the cumulus node + evpn routes

cumulus@cumulus:mgmt:~$ net show bridge macs vlan 100

VLAN  Master  Interface  MAC                TunnelDest  State      Flags         LastSeen
----  ------  ---------  -----------------  ----------  ---------  ------------  ----------------
 100  bridge  bridge     50:01:00:11:00:03              permanent                9 days, 02:51:48
 100  bridge  swp3       00:50:79:66:68:15                                       00:02:01
 100  bridge  vni100     00:50:79:66:68:0b                         extern_learn  00:00:26
 100  bridge  vni100     00:50:79:66:68:0c                         extern_learn  00:04:46
 100  bridge  vni100     00:50:79:66:68:13                         extern_learn  00:25:08
 100  bridge  vni100     00:50:79:66:68:17                         extern_learn  00:04:45
 100  bridge  vni100     00:50:79:66:68:19                         extern_learn  00:04:43

Route Distinguisher: 10.255.255.5:3
*  [2]:[0]:[48]:[00:50:79:66:68:0b]
                    10.255.255.5                           0 65004 65003 65001 65011 i
                    RT:65000:100 RT:65011:100 ET:8
*  [2]:[0]:[48]:[00:50:79:66:68:0b]
                    10.255.255.5                           0 65004 65003 65001 65011 i
                    RT:65000:100 RT:65011:100 ET:8
*> [2]:[0]:[48]:[00:50:79:66:68:0b]
                    10.255.255.5                           0 65004 65003 65001 65011 i
                    RT:65000:100 RT:65011:100 ET:8
*  [3]:[0]:[32]:[10.255.255.5]
                    10.255.255.5                           0 65004 65003 65001 65011 i
                    RT:65000:100 RT:65011:100 ET:8
*  [3]:[0]:[32]:[10.255.255.5]
                    10.255.255.5                           0 65004 65003 65001 65011 i
                    RT:65000:100 RT:65011:100 ET:8
*> [3]:[0]:[32]:[10.255.255.5]
                    10.255.255.5                           0 65004 65003 65001 65011 i
                    RT:65000:100 RT:65011:100 ET:8


# we have now Type 2 routes, all is ok now and we learn the client mac 00:50:79:66:68:0b
# Nevertheless, the trafic was working between the client hosts, but trafic from Cumulus leaf client to VyOS leaf client was broadcasted to all leaf since the @mac was not advertised on the fabric

It could be interesting to check how each node is configured to deliver the service, but the article would be too long. So, to sum the steps to provide a L2 service:

  1. create a VLAN (on Arista) / bridge (VyOS and Cumulus) + associate it to the VNI/VXLAN instances/interface
  2. create the mac vrf + define the route-target into bgp (Arista + Cumulus) / update the route-maps + community list to allow the redistribution of the VNI routes (type 2 and type 3)

L3 service implementation

Now, we'll now look at the L3 service. On that kind of setup we need to implement / check:

  • leaf IP subnets are well redistributed betweeb all leaf (here we are in a vrf "any to any" topology)
  • we only redistribute routes for IP prefix. Routes type 2 (mac @) don't need to be redistributed between leaf
  • that all clients can ping each other through the L3VPN

To achieve that, we need to implement analog configuration than for the L2 service, but with the following differences:

  • create a VLAN (on Arista) / bridge (VyOS and Cumulus) + associate it to the VNI/VXLAN instances|interface
  • create the L3 VRF + define the route-target into bgp (Arista + Cumulus + VyOS) + allow connected routes distribution (Arista / Cumulus) / update the route-maps + community list to allow the redistribution of the VNI routes (type 5) for VyOS / restrict the routes redistribution to type 5 routes into the VNI (Cumulus)


Now we check the routing table with common cli commands + into the EVPN RIB :

# check routes into the vrf 

vyos@leaf1-pod1:~$ show ip route vrf VRF-L3-1
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF VRF-L3-1:
K>* 0.0.0.0/0 [255/8192] unreachable (ICMP unreachable), 02w6d22h
C>* 192.168.10.0/24 is directly connected, br200, 02w6d22h
B>* 192.168.20.0/24 [20/0] via 10.255.255.6, br200 onlink, weight 1, 05:11:08
B>* 192.168.30.0/24 [20/0] via 10.255.255.7, br200 onlink, weight 1, 05:04:15
B>* 192.168.40.0/24 [20/0] via 10.255.255.8, br200 onlink, weight 1, 04:59:41
B>* 192.168.50.0/24 [20/0] via 10.255.255.11, br200 onlink, weight 1, 05:11:08
B>* 192.168.60.0/24 [20/0] via 10.255.255.12, br200 onlink, weight 1, 05:11:08

leaf1-pod3#show ip route vrf VRF-L3-1

VRF: VRF-L3-1
Codes: C - connected, S - static, K - kernel, 
       O - OSPF, IA - OSPF inter area, E1 - OSPF external type 1,
       E2 - OSPF external type 2, N1 - OSPF NSSA external type 1,
       N2 - OSPF NSSA external type2, B - BGP, B I - iBGP, B E - eBGP,
       R - RIP, I L1 - IS-IS level 1, I L2 - IS-IS level 2,
       O3 - OSPFv3, A B - BGP Aggregate, A O - OSPF Summary,
       NG - Nexthop Group Static Route, V - VXLAN Control Service,
       DH - DHCP client installed default route, M - Martian,
       DP - Dynamic Policy Route, L - VRF Leaked,
       G  - gRIBI, RC - Route Cache Route

Gateway of last resort is not set

 B E      192.168.10.0/24 [200/0] via VTEP 10.255.255.5 VNI 200 router-mac 12:9b:9b:d8:f6:86 local-interface Vxlan1
 B E      192.168.20.0/24 [200/0] via VTEP 10.255.255.6 VNI 200 router-mac 50:01:00:02:00:03 local-interface Vxlan1
 B E      192.168.30.0/24 [200/0] via VTEP 10.255.255.7 VNI 200 router-mac 50:01:00:09:00:03 local-interface Vxlan1
 B E      192.168.40.0/24 [200/0] via VTEP 10.255.255.8 VNI 200 router-mac 50:01:00:0a:00:03 local-interface Vxlan1
 C        192.168.50.0/24 is directly connected, Vlan200
 B E      192.168.60.0/24 [200/0] via VTEP 10.255.255.12 VNI 200 router-mac 50:01:00:11:00:03 local-interface Vxlan1


cumulus@cumulus:mgmt:~$ net show route vrf VRF-L3-1
show ip route vrf VRF-L3-1 
===========================
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route


VRF VRF-L3-1:
K>* 0.0.0.0/0 [255/8192] unreachable (ICMP unreachable), 01w2d03h
B>* 192.168.10.0/24 [20/0] via 10.255.255.5, vlan200 onlink, weight 1, 03:18:37
B>* 192.168.20.0/24 [20/0] via 10.255.255.6, vlan200 onlink, weight 1, 03:18:37
B>* 192.168.30.0/24 [20/0] via 10.255.255.7, vlan200 onlink, weight 1, 03:18:37
B>* 192.168.40.0/24 [20/0] via 10.255.255.8, vlan200 onlink, weight 1, 03:18:37
B>* 192.168.50.0/24 [20/0] via 10.255.255.11, vlan200 onlink, weight 1, 03:18:37
C>* 192.168.60.0/24 is directly connected, vlan200, 01w2d03h

# we can notice that the output are quite similar, except on Arista which indicates the VNI for remote targets, instead of referencing the local bridge / vlan200

# Now we check the routes for the VNI 200

# VyOS 

vyos@leaf1-pod1:~$ show bgp l2vpn evpn route type prefix 
BGP table version is 1, local router ID is 10.255.255.5
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[EthTag]:[ESI]:[IPlen]:[VTEP-IP]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]

   Network          Next Hop            Metric LocPrf Weight Path
                    Extended Community
Route Distinguisher: 10.255.255.5:200
*> [5]:[0]:[24]:[192.168.10.0]
                    10.255.255.5             0         32768 ?
                    ET:8 RT:65000:200 Rmac:12:9b:9b:d8:f6:86
Route Distinguisher: 10.255.255.6:200
*> [5]:[0]:[24]:[192.168.20.0]
                    10.255.255.6                           0 65001 65012 ?
                    RT:65000:200 ET:8 Rmac:50:01:00:02:00:03
Route Distinguisher: 10.255.255.7:200
*  [5]:[0]:[24]:[192.168.30.0]
                    10.255.255.7                           0 65001 65003 65002 65013 ?
                    RT:65000:200 RT:65013:200 ET:8 Rmac:50:01:00:09:00:03
*> [5]:[0]:[24]:[192.168.30.0]
                    10.255.255.7                           0 65001 65003 65002 65013 ?
                    RT:65000:200 RT:65013:200 ET:8 Rmac:50:01:00:09:00:03
Route Distinguisher: 10.255.255.8:200
*> [5]:[0]:[24]:[192.168.40.0]
                    10.255.255.8                           0 65001 65003 65002 65014 ?
                    RT:65000:200 RT:65014:200 ET:8 Rmac:50:01:00:0a:00:03
*  [5]:[0]:[24]:[192.168.40.0]
                    10.255.255.8                           0 65001 65003 65002 65014 ?
                    RT:65000:200 RT:65014:200 ET:8 Rmac:50:01:00:0a:00:03
Route Distinguisher: 10.255.255.11:200
*  [5]:[0]:[24]:[192.168.50.0]
                    10.255.255.11                          0 65001 65003 65004 65015 i
                    RT:65000:200 ET:8 Rmac:50:01:00:cd:ff:68
*> [5]:[0]:[24]:[192.168.50.0]
                    10.255.255.11                          0 65001 65003 65004 65015 i
                    RT:65000:200 ET:8 Rmac:50:01:00:cd:ff:68
Route Distinguisher: 10.255.255.12:10200
*  [5]:[0]:[24]:[192.168.60.0]
                    10.255.255.12                          0 65001 65003 65004 65016 i
                    RT:65000:200 ET:8 Rmac:50:01:00:11:00:03
*> [5]:[0]:[24]:[192.168.60.0]
                    10.255.255.12                          0 65001 65003 65004 65016 i
                    RT:65000:200 ET:8 Rmac:50:01:00:11:00:03

# Arista
                        
leaf1-pod3#show bgp evpn route ip-prefix ipv4 
BGP routing table information for VRF default
Router identifier 10.255.255.11, local AS number 65015
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - ECMP
                    S - Stale, c - Contributing to ECMP, b - backup
                    % - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop

          Network                Next Hop              Metric  LocPref Weight  Path
 * >     RD: 10.255.255.5:200 ip-prefix 192.168.10.0/24
                                 10.255.255.5          -       100     0       65004 65003 65001 65011 ?
 *       RD: 10.255.255.5:200 ip-prefix 192.168.10.0/24
                                 10.255.255.5          -       100     0       65004 65003 65001 65011 ?
 * >     RD: 10.255.255.6:200 ip-prefix 192.168.20.0/24
                                 10.255.255.6          -       100     0       65004 65003 65001 65012 ?
 *       RD: 10.255.255.6:200 ip-prefix 192.168.20.0/24
                                 10.255.255.6          -       100     0       65004 65003 65001 65012 ?
 * >     RD: 10.255.255.7:200 ip-prefix 192.168.30.0/24
                                 10.255.255.7          -       100     0       65004 65003 65002 65013 ?
 *       RD: 10.255.255.7:200 ip-prefix 192.168.30.0/24
                                 10.255.255.7          -       100     0       65004 65003 65002 65013 ?
 * >     RD: 10.255.255.8:200 ip-prefix 192.168.40.0/24
                                 10.255.255.8          -       100     0       65004 65003 65002 65014 ?
 *       RD: 10.255.255.8:200 ip-prefix 192.168.40.0/24
                                 10.255.255.8          -       100     0       65004 65003 65002 65014 ?
 * >     RD: 10.255.255.11:200 ip-prefix 192.168.50.0/24
                                 -                     -       -       0       i
 * >     RD: 10.255.255.12:10200 ip-prefix 192.168.60.0/24
                                 10.255.255.12         -       100     0       65004 65016 i
 *       RD: 10.255.255.12:10200 ip-prefix 192.168.60.0/24
                                 10.255.255.12         -       100     0       65004 65016 i

# CVX

cumulus@cumulus:mgmt:~$ net show bgp l2vpn evpn route type prefix
BGP table version is 7, local router ID is 10.255.255.12
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]

   Network          Next Hop            Metric LocPrf Weight Path
                    Extended Community
Route Distinguisher: 10.255.255.5:200
*  [5]:[0]:[24]:[192.168.10.0]
                    10.255.255.5                           0 65004 65003 65001 65011 ?
                    RT:65000:200 ET:8 Rmac:12:9b:9b:d8:f6:86
*> [5]:[0]:[24]:[192.168.10.0]
                    10.255.255.5                           0 65004 65003 65001 65011 ?
                    RT:65000:200 ET:8 Rmac:12:9b:9b:d8:f6:86
Route Distinguisher: 10.255.255.6:200
*  [5]:[0]:[24]:[192.168.20.0]
                    10.255.255.6                           0 65004 65003 65001 65012 ?
                    RT:65000:200 ET:8 Rmac:50:01:00:02:00:03
*> [5]:[0]:[24]:[192.168.20.0]
                    10.255.255.6                           0 65004 65003 65001 65012 ?
                    RT:65000:200 ET:8 Rmac:50:01:00:02:00:03
Route Distinguisher: 10.255.255.7:200
*  [5]:[0]:[24]:[192.168.30.0]
                    10.255.255.7                           0 65004 65003 65002 65013 ?
                    RT:65000:200 RT:65013:200 ET:8 Rmac:50:01:00:09:00:03
*> [5]:[0]:[24]:[192.168.30.0]
                    10.255.255.7                           0 65004 65003 65002 65013 ?
                    RT:65000:200 RT:65013:200 ET:8 Rmac:50:01:00:09:00:03
Route Distinguisher: 10.255.255.8:200
*  [5]:[0]:[24]:[192.168.40.0]
                    10.255.255.8                           0 65004 65003 65002 65014 ?
                    RT:65000:200 RT:65014:200 ET:8 Rmac:50:01:00:0a:00:03
*> [5]:[0]:[24]:[192.168.40.0]
                    10.255.255.8                           0 65004 65003 65002 65014 ?
                    RT:65000:200 RT:65014:200 ET:8 Rmac:50:01:00:0a:00:03
Route Distinguisher: 10.255.255.11:200
*  [5]:[0]:[24]:[192.168.50.0]
                    10.255.255.11                          0 65004 65015 i
                    RT:65000:200 ET:8 Rmac:50:01:00:cd:ff:68
*> [5]:[0]:[24]:[192.168.50.0]
                    10.255.255.11                          0 65004 65015 i
                    RT:65000:200 ET:8 Rmac:50:01:00:cd:ff:68
Route Distinguisher: 10.255.255.12:10200
*> [5]:[0]:[24]:[192.168.60.0]
                    10.255.255.12            0         32768 i
                    ET:8 RT:65000:200 Rmac:50:01:00:11:00:03

# Now we check with traceroute on the 3 clients connected on focused leaf switches

# client leaf VyOS

VPCS> ping 192.168.50.2

84 bytes from 192.168.50.2 icmp_seq=1 ttl=62 time=16.456 ms
84 bytes from 192.168.50.2 icmp_seq=2 ttl=62 time=11.852 ms
^C
VPCS> ping 192.168.60.2

84 bytes from 192.168.60.2 icmp_seq=1 ttl=62 time=5.554 ms
84 bytes from 192.168.60.2 icmp_seq=2 ttl=62 time=4.432 ms
^C
VPCS> trace 192.168.50.2
trace to 192.168.50.2, 8 hops max, press Ctrl+C to stop
 1   192.168.10.1   0.499 ms  3.143 ms  1.432 ms
 2   192.168.50.1   10.465 ms  12.512 ms  8.992 ms
 3   *192.168.50.2   11.056 ms (ICMP type:3, code:3, Destination port unreachable)

VPCS> trace 192.168.60.2
trace to 192.168.60.2, 8 hops max, press Ctrl+C to stop
 1   192.168.10.1   0.525 ms  0.312 ms  0.278 ms
 2   192.168.60.1   4.949 ms  4.232 ms  4.022 ms
 3   *192.168.60.2   4.167 ms (ICMP type:3, code:3, Destination port unreachable)


# client leaf Arista

VPCS> ping 192.168.10.2

84 bytes from 192.168.10.2 icmp_seq=1 ttl=62 time=10.914 ms
^C
VPCS> ping 192.168.60.2

84 bytes from 192.168.60.2 icmp_seq=1 ttl=62 time=11.755 ms
^C
VPCS> trace 192.168.10.2
trace to 192.168.10.2, 8 hops max, press Ctrl+C to stop
 1   192.168.50.1   4.742 ms  4.123 ms  4.293 ms
 2   192.168.10.1   12.550 ms  15.039 ms  9.411 ms
 3   *192.168.10.2   8.939 ms (ICMP type:3, code:3, Destination port unreachable)

VPCS> 
VPCS> trace 192.168.60.2
trace to 192.168.60.2, 8 hops max, press Ctrl+C to stop
 1   192.168.50.1   3.073 ms  2.901 ms  2.585 ms
 2   192.168.60.1   10.815 ms  11.493 ms  12.144 ms
 3   *192.168.60.2   10.429 ms (ICMP type:3, code:3, Destination port unreachable)

# client leaf Cumulus

VPCS> ping 192.168.10.2

84 bytes from 192.168.10.2 icmp_seq=1 ttl=62 time=4.731 ms
^C
VPCS> ping 192.168.50.2

84 bytes from 192.168.50.2 icmp_seq=1 ttl=62 time=11.722 ms
^C
VPCS> trace 192.168.50.2
trace to 192.168.50.2, 8 hops max, press Ctrl+C to stop
 1   192.168.60.1   0.487 ms  0.266 ms  0.248 ms
 2   192.168.50.1   6.691 ms  6.281 ms  6.613 ms
 3   *192.168.50.2   9.355 ms (ICMP type:3, code:3, Destination port unreachable)

VPCS> 
VPCS> trace 192.168.10.2
trace to 192.168.10.2, 8 hops max, press Ctrl+C to stop
 1   192.168.60.1   0.436 ms  0.315 ms  0.380 ms
 2   192.168.10.1   4.535 ms  4.661 ms  3.994 ms
 3   *192.168.10.2   4.292 ms (ICMP type:3, code:3, Destination port unreachable)

Conclusion of the lab

These first results are satisfying, it comforts the idea that VXLAN is a strong option to deliver L2/L3 services in datacenters, but not only. This kind of architecture is more and more chosen in campus designs, PRA, NFV solutions.

The support of this technology on more and more open source NOS are encouraging regarding the innovation. If we look at Vyatta project for example, whom OS can be compiled "brick per brick", we could imagine to create NFV CPEs providing VXLAN fabric extensions of some client VRFs. CPE could be connected to the central Fabric within encrypted tunnels (wireguard,ipsec) or through direct link over MacSec (this feature in on the Vyatta Roadmap).

Please contact us if you want to feed the topic / discuss the lab.

Cheers

Pierre L.