Multiple Upstream Failover HowTo

About

This document covers how to achieve automatic failover in a setup that uses multiple upstream providers. This is useful when we want to increase the redundancy of our upstream connections.

In order to achieve this we will need to utilize a few different functionalities, including:

Static unicast routing.
Policy-Based Routing (PBR).
Ping trigger, part of the alarm system.

Introduction

For this example we imagine a topology where a router is connected to two different upstream providers. To improve redundancy, one of them serves as the primary uplink and the other as a backup, in case we lose the connection over the primary. Figure 1 below shows this scenario.

                          198.19.20.21
                                |
                             .--.-.
                            ( (    )__
                           (_,  \ ) ,_)  Internet
                             '-'--`--'
                              |     |
                              |     |
                      .-------'     '-------.
             .-->     |                     |     <--.
             |        |                     |        |
     Primary |   .----+----.           .----+----.   | Secondary
     Uplink  |   |         |           |         |   | Uplink
             |   |  ISP1   |           |  ISP2   |   |
             |   |         |           |         |   |
             |   '----+----'           '----+----'   |
                  .99 |                     | .99
                      |                     |
        172.16.1.0/24 '-------.     .-------' 172.16.2.0/24
                              |     |
                           .1 |     | .1
           Interface: vlan1 .-+-----+-. Interface: vlan2
                            |         |
                            |   R1    |
                            |         |  GW: 172.16.1.99 distance 1
                            '----+----'      172.16.2.99 distance 10
                                 |
                                 |
                             .--.-.
                            ( (    )__
                           (_,  \ ) ,_)  Lan
                             '-'--`--'

Figure 1: The router R1, connected to two different upstream providers ISP1 and ISP2, in order to achieve redundancy, by having ISP1 serve as the primary and ISP2 the secondary.

As can be seen in Figure 1, the router R1 is connected to the two upstream providers ISP1 and ISP2. Two different default gateways are configured on R1 towards ISP1 at 172.16.1.99 and ISP2 at 172.16.2.99 respectively. In order to ensure that the first route will serve as the primary, the distance of this route is configured to 1 and for the other it is set to 10.

Note

A route with a lower distance is considered the better route. Therefore, by adjusting this value, we can specify which route is selected first.
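The selection rule in the note can be illustrated with a minimal Python sketch. It is only a model of "lowest distance wins", not the router's actual implementation; the StaticRoute class and select_route function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticRoute:
    """Hypothetical model of a static route: prefix, next hop, distance."""
    prefix: str
    next_hop: str
    distance: int


def select_route(routes):
    """Return the route with the lowest distance, or None if there are none."""
    return min(routes, key=lambda r: r.distance, default=None)


# The two default routes from Figure 1.
routes = [
    StaticRoute("0.0.0.0/0", "172.16.1.99", 1),   # primary, via ISP1
    StaticRoute("0.0.0.0/0", "172.16.2.99", 10),  # secondary, via ISP2
]

best = select_route(routes)
print(best.next_hop)
```

With the distances from Figure 1, the route via 172.16.1.99 is selected because 1 is lower than 10.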

Handle the Failover

The next step is to address how the failover is supposed to be carried out, but first it is good to understand why we need special handling for this. As it stands right now, the only way for the secondary route to become selectable is if the upstream interface, vlan1 on R1, gets a direct link failure. Then the network that the default route is associated with will go down and the route will become inactive, so the secondary route will be selected instead.

However, if the link breaks somewhere along the path between ISP1 and our target destination, as shown in Figure 2, R1 has no way of knowing about it in the current setup. Traffic from the Lan will still be routed towards ISP1, even though it cannot reach the intended destination that way.

                  198.19.20.21
                        |
                     .--.-.
                    ( (    )__
                   (_,  \ ) ,_)  Internet
                     '-'--`--'
                      |     |
                      |     |
Link          .-------'     '-------.
Failure ----> X                     |
              |                     |
         .----+----.           .----+----.
         |         |           |         |
         |  ISP1   |           |  ISP2   |
         |         |           |         |
         '----+----'           '----+----'
          .99 |                     | .99
              |                     |
172.16.1.0/24 '-------.     .-------' 172.16.2.0/24
                      |     |
                   .1 |     | .1
  Interface: vlan1  .-+-----+-. Interface: vlan2
                    |         |
                    |   R1    |
                    |         |
                    '----+----'
                         |
                         |
                     .--.-.
                    ( (    )__
                   (_,  \ ) ,_)  Lan
                     '-'--`--'

Figure 2: A link failure has occurred somewhere between ISP1 and our target destination.

In order to handle this, the ping trigger functionality will be used. It tracks an upstream destination; in this example we will check if the address 198.19.20.21 is reachable. Two different ping triggers will be used, one for each of the upstream interfaces vlan1 and vlan2.

Subsequently, the ping trigger is attached to the route as a remote tracker. If the ping trigger is not able to reach its provided destination address, it will adjust the distance of the route to 255. When this happens, the route with the next best distance will be used instead. In this specific example it will be the route to our secondary provider ISP2.

Note

When a route is configured with a distance of 255, it is considered infinite, and the route will not be used, i.e. it will not be active in the routing table.
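The interaction between a ping trigger and the route distance can be sketched as follows. This is a simplified Python model under stated assumptions, not the device's implementation: TrackedRoute, effective_distance and active_route are hypothetical names, and the trigger state is reduced to a single boolean.

```python
from dataclasses import dataclass

INFINITE_DISTANCE = 255  # a route with this distance is never used


@dataclass
class TrackedRoute:
    """Hypothetical model of a default route tracked by a ping trigger."""
    next_hop: str
    distance: int
    trigger_ok: bool = True  # True while the ping trigger reaches its peer

    def effective_distance(self) -> int:
        # A failing trigger raises the distance to 255.
        return self.distance if self.trigger_ok else INFINITE_DISTANCE


def active_route(routes):
    """Return the usable route with the lowest effective distance, or None."""
    usable = [r for r in routes if r.effective_distance() < INFINITE_DISTANCE]
    return min(usable, key=lambda r: r.effective_distance(), default=None)


primary = TrackedRoute("172.16.1.99", 1)
secondary = TrackedRoute("172.16.2.99", 10)

print(active_route([primary, secondary]).next_hop)  # primary while both work

primary.trigger_ok = False  # the failure shown in Figure 2
print(active_route([primary, secondary]).next_hop)  # now the secondary
```

If both triggers fail, the sketch returns None, which corresponds to having no active default route at all.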

Ensure Correct Upstream Interface is Used for Ping Trigger

It is important to ensure that the requests sent by the ping triggers egress on the correct associated upstream interface, regardless of the state of the routing table. What we want to keep track of is whether we can reach our intended destination from R1 on interface vlan1 over the upstream provider ISP1, and likewise on vlan2 over ISP2.

In order to achieve this desired behavior, policy-based routing will be utilized. The policy-based route instances that will be applied are only intended to ensure that traffic, originating from the device itself, will egress on the correct upstream interface. For each of the upstream interfaces we want to specify an associated policy-based route, that should be configured based on the following requirements:

Match on the source IP address of the upstream interface itself.
Only consider locally originated frames, i.e. no frames being routed through the device should be considered for the policy-based route.
Set the next hop address to the upstream provider associated with the particular interface the route is intended to support.
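The three requirements above can be expressed as a small lookup function. The following Python sketch is only a model of the matching logic, under the assumption that a policy-based route is checked before the regular routing table; PolicyRoute and lookup_next_hop are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRoute:
    """Hypothetical model of one policy-based route instance."""
    saddr: ipaddress.IPv4Network  # source address to match
    next_hop: str                 # upstream provider for that interface


def lookup_next_hop(policies, src_ip, locally_originated, table_next_hop):
    """Return the next hop for a frame.

    Policy routes are only considered for locally originated frames.
    Anything else falls through to the regular routing table result.
    """
    src = ipaddress.ip_address(src_ip)
    if locally_originated:
        for policy in policies:
            if src in policy.saddr:
                return policy.next_hop
    return table_next_hop


policies = [
    PolicyRoute(ipaddress.ip_network("172.16.1.1/32"), "172.16.1.99"),
    PolicyRoute(ipaddress.ip_network("172.16.2.1/32"), "172.16.2.99"),
]

# A ping sourced from the vlan2 address goes to ISP2, even though the
# regular routing table currently points at ISP1.
print(lookup_next_hop(policies, "172.16.2.1", True, "172.16.1.99"))
```

Routed traffic (locally_originated set to False) never matches a policy in this model, so it keeps following the regular routing table.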

Without the policy-based routes, all active ping trigger requests will be routed based on the currently active routes in the regular routing table. Hence, we will not be able to ensure that the requests are directed towards the intended interfaces. Depending on how the routes are set up on the rest of the network, this could lead to some of the following issues:

The ping trigger reaches its destination across an unintended path, and therefore does not actually verify the path that was intended. This could lead to false positives, where we believe that a destination is reachable across a specific upstream provider when it is not.
The ping trigger indicates that it cannot reach its destination, even though the destination would be reachable if the request were sent on the correct interface.

Warning

When these policy routes are active, any frame originating from the device that happens to match the policy will also be routed towards the specified next hop. This does not apply only to the ping trigger traffic itself.

For instance, if the IP address of one of the upstream interfaces is pinged, and the response must be routed, the response will be routed based on the next hop configured for the associated policy-based route. This is done regardless of whether a “better” route exists on the system, because policy-based routes always have priority over any other route that resides in the regular routing table.

Configuration

All the configuration will be carried out on router R1. As described above we need to configure a number of different things, so this section consists of subsections focusing on the different components of the setup.

For the sake of this configuration example, we assume that the relevant interfaces have already been configured, along with their IP addresses.

Ping Trigger

First we start with the configuration of the two ping triggers, one to be associated with each of the upstream interfaces. We want both of the ping triggers to peer against the IP address 198.19.20.21, but to ensure that this is done on different outbound interfaces.

We begin with the configuration of ping trigger 1, aimed to operate over vlan1 across the upstream provider ISP1:

R1:/#> configure
R1:/config/#> alarm
R1:/config/alarm/#> trigger ping
R1:/config/alarm/trigger-1/#> peer 198.19.20.21
R1:/config/alarm/trigger-1/#> outbound 172.16.1.1
R1:/config/alarm/trigger-1/#> end
R1:/config/alarm/#>

Next we configure ping trigger 2, to operate over vlan2 across the upstream provider ISP2:

R1:/config/alarm/#> trigger ping
R1:/config/alarm/trigger-2/#> peer 198.19.20.21
R1:/config/alarm/trigger-2/#> outbound 172.16.2.1
R1:/config/alarm/trigger-2/#> end
R1:/config/alarm/#> end
R1:/config/#>

Take note that we specify the outbound option for both of the ping triggers. Doing this is very important, since this will be needed to correctly match on the intended policy-based routes that we are going to add later. The IP addresses provided are the ones configured for vlan1 and vlan2 respectively.

Warning

When configuring the outbound option we specify the IP address of the intended interface directly, instead of providing the interface name. This is done because some inconsistencies have been noted with regard to the matching behavior of the policy-based routes, when matching on source IP address, if the interface name is given instead.

Tip

In order to adjust the reaction time of the ping trigger it is possible to adjust the transmission interval and the number of frames that must be lost.

R1:/config/alarm/trigger-1/#> interval 2
R1:/config/alarm/trigger-1/#> number 1
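As a rough guide to what these two settings mean for reaction time, the arithmetic can be sketched in Python. This rests on assumptions that the text does not confirm: that interval is given in seconds, that the trigger is raised after that many consecutive lost frames, and that the reply timeout can be ignored. The function name is hypothetical.

```python
def approx_detection_seconds(interval_s: float, lost_frames: int) -> float:
    """Approximate time from failure to trigger activation.

    Assumes one request every `interval_s` seconds and activation after
    `lost_frames` consecutive losses. The reply timeout is not included.
    """
    if interval_s <= 0 or lost_frames < 1:
        raise ValueError("interval must be positive and lost_frames >= 1")
    return float(interval_s) * lost_frames


# The values from the tip above: interval 2, number 1.
print(approx_detection_seconds(2, 1))
```

Under the same assumptions, a larger number makes the trigger more tolerant of occasional packet loss, at the cost of a proportionally slower failover.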

With the ping triggers configured, they can now be connected to the default routes that are going to be configured in the next step.

Default Routes

In this step the two default routes are configured. The two routes will have their next hop set to the IP address of ISP1 and ISP2 respectively. The previously created ping triggers are connected to the two routes by providing the route configuration with the track option. The value provided to this option is the id of the specific ping trigger. If the associated ping trigger cannot reach the configured destination, the distance of the route will be adjusted to 255, so that it will not be usable.

First we configure the route that will serve as the primary uplink:

R1:/config/#> ip
R1:/config/ip/#> route default 172.16.1.99 track 1

Note

When we provide no specific distance in a route's configuration, it will be given a distance of 1.

Next we add the route that will act as the secondary uplink. This route is configured with a distance of 10, so that it will not be selected while the primary route is active:

R1:/config/ip/#> route default 172.16.2.99 10 track 2

In summary, we configured two different default gateways, one pointing to ISP1 at 172.16.1.99 and one to ISP2 at 172.16.2.99. The route pointing to ISP2 has a higher distance, and will therefore not be active while the primary route is. Both of the routes have been connected to a ping trigger using the track option. If the associated ping trigger fails to reach its destination, the route will have its distance increased to 255, making it inactive. This is the mechanism that allows the failover to occur.

Policy-Based Routes

Lastly we configure the necessary policy-based routes that will ensure the ping trigger traffic egresses towards the correct upstream. We configure two different policy-based route instances, one associated with each ping trigger.

First, we start with the route that should handle ping trigger 1, responsible for verifying the uplink towards ISP1:

R1:/config/ip/#> policy-route 1
R1:/config/ip/policy-route-1/#> in-iface lo
R1:/config/ip/policy-route-1/#> next-hop 172.16.1.99
R1:/config/ip/policy-route-1/#> match saddr 172.16.1.1/32
R1:/config/ip/policy-route-1/#> end
R1:/config/ip/#>

Next, we configure the route that should handle ping trigger 2, responsible for verifying the uplink towards ISP2:

R1:/config/ip/#> policy-route 2
R1:/config/ip/policy-route-2/#> in-iface lo
R1:/config/ip/policy-route-2/#> next-hop 172.16.2.99
R1:/config/ip/policy-route-2/#> match saddr 172.16.2.1/32
R1:/config/ip/policy-route-2/#> end
R1:/config/ip/#>

In summary, two policy-based routes have been configured, one associated with each ping trigger. This is the intention for each of the options set for the instance:

The in-iface is set to the lo interface. Doing this will ensure that only traffic originating from the device itself can match the route.
The next-hop is configured to be the address of the upstream next-hop, in this case either ISP1 for policy-route 1, or ISP2 for policy-route 2.
The match saddr is set to the IP address of the local upstream interface. In this case the IP address of interface vlan1 for ISP1 and the IP address of interface vlan2 for ISP2.

With the ping triggers, default routes and policy-based routes in place, the configuration is complete. We can now apply the configuration and verify that it works.

R1:/config/ip/#> leave
Applying configuration.
Configuration activated.  Remember "copy run start" to save to flash (NVRAM).
R1:/#>

Status

Initially we expect everything to be up and running correctly, with no link failures anywhere. Therefore, we start with simply verifying that everything looks correct.

The first thing we can verify is that our policy-based routes look correct, so that we can expect our ping triggers to egress on the correct interfaces. It should look something like this.

R1:/#> show ip policy-route
PRIO  SADDR          DADDR  SPORT  DPORT  IN-IFACE  NEXT-HOP   
300   172.16.1.1/32  -      -      -      lo        172.16.1.99
301   172.16.2.1/32  -      -      -      lo        172.16.2.99

If the policy-based routes look correct, we can check the alarm status, to see if the ping triggers can correctly reach their destination address. It should look something like this, with both ping triggers reporting that the ping works.

R1:/#> show alarm
NO TRIGGER          ENA ACT REASON
 1 Ping             YES  NO ping to 198.19.20.21 works
 2 Ping             YES  NO ping to 198.19.20.21 works

At this point we can check if our default gateways look as we want them to. If everything is correct the route towards ISP1 at 172.16.1.99 should be the active default route:

R1:/#> show ip route
S - Static | C - Connected | K - Kernel route  | > - Selected route
O - OSPF   | R - RIP       | [Distance/Metric] | * - FIB route

S   0.0.0.0/0 [10/0] via 172.16.2.99, vlan2, weight 1, 07:03:24
S>* 0.0.0.0/0 [1/0] via 172.16.1.99, vlan1, weight 1, 07:03:24
C>* 172.16.1.0/24 is directly connected, vlan1, 07:03:24
C>* 172.16.2.0/24 is directly connected, vlan2, 07:03:24

If all the above checks out, we should be in an operating state that matches what we expected, based on the technical setup description.
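When this check is repeated often, for example during testing, the routing table output can be inspected by a script. The following Python sketch parses text in the format shown above and reports which default route is marked as selected. It is only an illustration: the regular expression is an assumption based on the sample output in this document, the function name is hypothetical, and continuation lines for additional next hops of the same route are not handled.

```python
import re

# Matches lines such as:
#   S>* 0.0.0.0/0 [1/0] via 172.16.1.99, vlan1, weight 1, 07:03:24
DEFAULT_ROUTE_RE = re.compile(
    r"^(?P<flags>\S+)\s+0\.0\.0\.0/0\s+"
    r"\[(?P<distance>\d+)/(?P<metric>\d+)\]\s+"
    r"via\s+(?P<next_hop>[\d.]+),\s+(?P<iface>\w+)"
)


def selected_default_routes(output: str):
    """Return (next_hop, iface, distance) for default routes flagged with '>'."""
    selected = []
    for line in output.splitlines():
        match = DEFAULT_ROUTE_RE.match(line.strip())
        if match and ">" in match.group("flags"):
            selected.append(
                (match.group("next_hop"),
                 match.group("iface"),
                 int(match.group("distance")))
            )
    return selected


SAMPLE = """\
S - Static | C - Connected | K - Kernel route  | > - Selected route
O - OSPF   | R - RIP       | [Distance/Metric] | * - FIB route

S   0.0.0.0/0 [10/0] via 172.16.2.99, vlan2, weight 1, 07:03:24
S>* 0.0.0.0/0 [1/0] via 172.16.1.99, vlan1, weight 1, 07:03:24
C>* 172.16.1.0/24 is directly connected, vlan1, 07:03:24
C>* 172.16.2.0/24 is directly connected, vlan2, 07:03:24
"""

print(selected_default_routes(SAMPLE))
```

For the sample text, the only selected default route is the one via 172.16.1.99 on vlan1 with distance 1.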

Status On Link Failure

If we have a link failure, as shown in Figure 2, we expect the primary route to have its distance increased, so that the secondary route will now be the active route.

First, we can check the status of our ping triggers. If the destination address is unreachable over interface vlan1 through ISP1, trigger 1 should now indicate this:

R1:/#> show alarm
NO TRIGGER          ENA ACT REASON                                            
 1 Ping             YES YES ping to 198.19.20.21 fails
 2 Ping             YES  NO ping to 198.19.20.21 works

If the ping trigger indicates that it fails to reach the destination, the associated route should now have had its distance increased to 255. If we check the routing table it should now show that the default route towards ISP2 at 172.16.2.99 is the active default route:

R1:/#> show ip route
S - Static | C - Connected | K - Kernel route  | > - Selected route
O - OSPF   | R - RIP       | [Distance/Metric] | * - FIB route

S>* 0.0.0.0/0 [10/0] via 172.16.2.99, vlan2, weight 1, 00:00:02
S   0.0.0.0/0 [255/0] via 172.16.1.99, vlan1, weight 1, 00:00:02
C>* 172.16.1.0/24 is directly connected, vlan1, 07:10:13
C>* 172.16.2.0/24 is directly connected, vlan2, 07:10:13

If everything has behaved as described above, the upstream connection has now correctly failed over to the secondary uplink.

Possibilities When Combined With ECMP Routing

It is possible to combine this solution with ECMP routing, if we want to have failover but also utilize both of the uplinks at the same time. For an example of how load balancing can be achieved, in a situation similar to the one presented in this document, refer to the howto on ECMP load balancing.

In ECMP routing, multiple routes of equal cost towards the same destination can be active at the same time. In this use case, this allows load balancing between the primary and secondary uplinks. The failover mechanism is the same as before: the affected route has its distance increased so that it is no longer selectable. The failover can therefore be seen as simply disabling a specific route, instead of switching over to a secondary route.
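The difference from the primary/secondary case can be modeled in a few lines of Python. This is a sketch under stated assumptions, not how the router is implemented: every route tied at the lowest usable distance is treated as active, and a CRC32 hash of the flow is used to pick one of them. The hash function and both function names are illustrative choices only; the actual per-flow selection on a device may differ.

```python
import zlib

INFINITE_DISTANCE = 255


def active_next_hops(routes):
    """Return all next hops tied at the lowest usable distance.

    `routes` is a list of (next_hop, distance) tuples.
    """
    usable = [(nh, d) for nh, d in routes if d < INFINITE_DISTANCE]
    if not usable:
        return []
    best = min(d for _, d in usable)
    return [nh for nh, d in usable if d == best]


def pick_next_hop(next_hops, flow):
    """Pick one next hop per flow, so packets of a flow share a path."""
    key = "|".join(str(field) for field in flow).encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


# Both default routes with the same distance: both are active.
both = active_next_hops([("172.16.1.99", 1), ("172.16.2.99", 1)])
print(both)

# Hypothetical flow: source address, destination address, protocol.
print(pick_next_hop(both, ("192.168.0.10", "198.19.20.21", "icmp")))

# Trigger 1 fails: the route via ISP1 is raised to 255 and drops out.
print(active_next_hops([("172.16.1.99", 255), ("172.16.2.99", 1)]))
```

In this model a failed ping trigger simply removes one member from the set of active next hops, and the remaining uplink carries all flows.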

In order to achieve this type of setup, the only thing that needs to change is the configuration of the routes themselves. Instead of providing the secondary route with a higher distance value, we ensure that all the upstream routes have the same distance. Configuring the two default routes with the same distance gives them the same cost, making ECMP routing possible. Therefore, we perform the route configuration from the Default Routes step in the following manner instead:

R1:/#> configure
R1:/config/#> ip
R1:/config/ip/#> route default 172.16.1.99 track 1
R1:/config/ip/#> route default 172.16.2.99 track 2
R1:/config/ip/#> leave
Applying configuration.
Configuration activated.  Remember "copy run start" to save to flash (NVRAM).
R1:/#>

Now both of these routes should be active at the same time, and either of them can be selected for an individual flow of packets. In order to verify this, check the active routing table:

R1:/#> show ip route
S - Static | C - Connected | K - Kernel route  | > - Selected route
O - OSPF   | R - RIP       | [Distance/Metric] | * - FIB route

S>* 0.0.0.0/0 [1/0] via 172.16.1.99, vlan1, weight 1, 00:00:01
  *                 via 172.16.2.99, vlan2, weight 1, 00:00:01
C>* 172.16.1.0/24 is directly connected, vlan1, 00:01:32
C>* 172.16.2.0/24 is directly connected, vlan2, 00:01:32

As we can see, both of our default routes are now active at the same time. If we have a link failure as described in Figure 2, the routing table should look similar to how it did in the case where we did not utilize ECMP routing:

R1:/#> show ip route
S - Static | C - Connected | K - Kernel route  | > - Selected route
O - OSPF   | R - RIP       | [Distance/Metric] | * - FIB route

S>* 0.0.0.0/0 [1/0] via 172.16.2.99, vlan2, weight 1, 00:00:02
S   0.0.0.0/0 [255/0] via 172.16.1.99, vlan1, weight 1, 00:00:02
C>* 172.16.1.0/24 is directly connected, vlan1, 00:01:32
C>* 172.16.2.0/24 is directly connected, vlan2, 00:01:32