Multiple Upstream Failover HowTo
About
This document will cover how we can achieve automatic failover in a setup that utilizes multiple upstream providers. This can be useful if we want to increase the redundancy for our upstream connections.
In order to achieve this we will need to utilize a few different functionalities, including:
- Static unicast routing.
- Policy-Based Routing (PBR).
- Ping trigger, part of the alarm system.
Introduction
For this example we imagine a topology where a router is connected to two different upstream providers. One provider acts as the primary uplink and the other as a backup, used if the connection over the primary is lost. Figure 1 below shows this scenario.
                           198.19.20.21
                                 |
                              .--.-.
                            (  (   )__
                           (_,  \ ) ,_)  Internet
                             '-'--`--'
                              |     |
                              |     |
                      .-------'     '-------.
             .-->     |                     |     <--.
             |        |                     |        |
    Primary  |   .----+----.           .----+----.   |  Secondary
    Uplink   |   |         |           |         |   |  Uplink
             |   |  ISP1   |           |  ISP2   |   |
             |   |         |           |         |   |
             |   '----+----'           '----+----'   |
                  .99 |                     | .99
                      |                     |
    172.16.1.0/24     '-------.     .-------'     172.16.2.0/24
                              |     |
                           .1 |     | .1
        Interface: vlan1    .-+-----+-.    Interface: vlan2
                            |         |
                            |   R1    |
                            |         |    GW: 172.16.1.99 distance 1
                            '----+----'        172.16.2.99 distance 10
                                 |
                                 |
                              .--.-.
                            (  (   )__
                           (_,  \ ) ,_)  Lan
                             '-'--`--'

Figure 1: R1 connected to the upstream providers ISP1 (primary) and ISP2 (secondary).
As can be seen in Figure 1, the router R1 is connected to the two upstream
providers ISP1 and ISP2. Two different default gateways are configured on R1,
towards ISP1 at 172.16.1.99 and ISP2 at 172.16.2.99 respectively.
In order to ensure that the first route will serve as the primary, the distance
of this route is configured to 1 and for the other it is set to 10.
Note
A route with a lower distance will be considered the better route. Therefore by adjusting this value we can specify what route will be selected first in this case.
Handle the Failover
The next step is to address how the failover is carried out, but first it's
good to understand why this needs special handling. As it stands right now, the
only way for the secondary route to become selectable is if the upstream
interface, vlan1 on R1, gets a direct link failure. Then the network that the
default route is associated with will go down and the route will become
inactive, so the secondary route will be selected instead.
However, if the link breaks somewhere along the path between ISP1 and our
target destination, as shown in Figure 2, R1 has no way of becoming aware of it
in the current setup. Traffic from the Lan will then still be routed towards
ISP1, which will not be able to reach the intended destination.
                           198.19.20.21
                                 |
                              .--.-.
                            (  (   )__
                           (_,  \ ) ,_)  Internet
                             '-'--`--'
                              |     |
                              |     |
        Link          .-------'     '-------.
        Failure ----> X                     |
                      |                     |
                 .----+----.           .----+----.
                 |         |           |         |
                 |  ISP1   |           |  ISP2   |
                 |         |           |         |
                 '----+----'           '----+----'
                  .99 |                     | .99
                      |                     |
    172.16.1.0/24     '-------.     .-------'     172.16.2.0/24
                              |     |
                           .1 |     | .1
        Interface: vlan1    .-+-----+-.    Interface: vlan2
                            |         |
                            |   R1    |
                            |         |
                            '----+----'
                                 |
                                 |
                              .--.-.
                            (  (   )__
                           (_,  \ ) ,_)  Lan
                             '-'--`--'

Figure 2: Link failure on the path between ISP1 and the target destination.
In order to handle this, the ping trigger functionality will be used. It tracks
an upstream destination; in this example we check whether the address
198.19.20.21 is reachable. Two different ping triggers will be used, one for
each of the upstream interfaces vlan1 and vlan2.
Subsequently, each ping trigger is attached to its route as a remote tracker.
If the ping trigger is not able to reach its destination address, it will
adjust the distance of the route to 255. When this happens, the route with the
next best distance is used instead. In this specific example that is the route
to our secondary provider ISP2.
Note
When a route is configured with a distance of 255, it is considered infinite, and the route will not be used, i.e. it will not be active in the routing table.
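The selection rule described above can be modelled in a few lines of Python. This is only an illustrative sketch, not how the router implements it: the class StaticRoute, the flag tracker_ok and the function select_route are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

UNUSABLE_DISTANCE = 255  # distance a route gets when its ping trigger fails


@dataclass
class StaticRoute:
    next_hop: str
    distance: int             # configured distance, lower is better
    tracker_ok: bool = True   # hypothetical flag: state of the attached ping trigger

    def effective_distance(self) -> int:
        # A failing tracker raises the distance to 255, otherwise the
        # configured value applies.
        return self.distance if self.tracker_ok else UNUSABLE_DISTANCE


def select_route(routes):
    """Return the usable route with the lowest effective distance, or None."""
    usable = [r for r in routes if r.effective_distance() < UNUSABLE_DISTANCE]
    return min(usable, key=lambda r: r.effective_distance(), default=None)


primary = StaticRoute("172.16.1.99", distance=1)
secondary = StaticRoute("172.16.2.99", distance=10)

print(select_route([primary, secondary]).next_hop)  # 172.16.1.99
primary.tracker_ok = False                          # ping trigger 1 fails
print(select_route([primary, secondary]).next_hop)  # 172.16.2.99
```

If both trackers fail, select_route returns None, which corresponds to having no active default route at all.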
Ensure Correct Upstream Interface is Used for Ping Trigger
It is important to ensure that the requests sent by the ping triggers egress on
the correct associated upstream interface, regardless of the state of the
routing table. What we want to keep track of is whether we can reach our
intended destination from R1 on interface vlan1 over the upstream provider
ISP1, and likewise on interface vlan2 over ISP2.
In order to achieve this desired behavior, policy-based routing will be utilized. The policy-based route instances that will be applied are only intended to ensure that traffic, originating from the device itself, will egress on the correct upstream interface. For each of the upstream interfaces we want to specify an associated policy-based route, that should be configured based on the following requirements:
- Match on the Source IP address of the upstream interface itself.
- Only consider locally originated frames, i.e. no frames being routed through the device should be considered for the policy-based route.
- Set the next hop address to the upstream provider associated with the particular interface the route is intended to support.
Without the policy-based routes, all active ping trigger requests will be routed based on the currently active routes in the regular routing table. Hence, we will not be able to ensure that the requests are directed towards the intended interfaces. Depending on how the routes are set up on the rest of the network, this could lead to some of the following issues:
- The ping trigger reaching its destination across an unintended path. Therefore not actually verifying the path that was intended. This could lead to false positive scenarios, where we may think that a destination is reachable across a specific upstream provider.
- The ping trigger indicating that it cannot reach its destination, even though it actually would, if sent on the correct interface.
Warning
When these policy routes are active, any frame originating from the device that happens to match the policy will also be routed towards the specified next hop. This applies to all matching traffic, not just to the ping trigger traffic itself.
For instance, if the IP address of one of the upstream interfaces is pinged, and the response must be routed, the response will be routed based on the next hop configured for the associated policy-based route. This is done regardless of whether a "better" route exists on the system, because policy-based routes always have priority over any other route that resides in the regular routing table.
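The lookup order described in this section, policy-based routes first and the regular routing table second, can be sketched as follows. This is a simplified model under stated assumptions: PolicyRoute and lookup_next_hop are hypothetical names, and the LAN host 192.168.0.10 on interface vlan3 in the usage lines is invented for illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolicyRoute:
    saddr: str      # source prefix to match
    in_iface: str   # "lo" limits the match to locally originated traffic
    next_hop: str   # upstream gateway used when the rule matches


POLICY_ROUTES = [
    PolicyRoute("172.16.1.1/32", "lo", "172.16.1.99"),
    PolicyRoute("172.16.2.1/32", "lo", "172.16.2.99"),
]


def lookup_next_hop(src: str, in_iface: str,
                    table_next_hop: Optional[str]) -> Optional[str]:
    """Return the policy next hop if a rule matches, else the table's next hop."""
    src_ip = ipaddress.ip_address(src)
    for rule in POLICY_ROUTES:
        if in_iface == rule.in_iface and src_ip in ipaddress.ip_network(rule.saddr):
            return rule.next_hop
    return table_next_hop


# Ping trigger 2 sends from 172.16.2.1 and is forced towards ISP2,
# even though the routing table currently prefers ISP1.
print(lookup_next_hop("172.16.2.1", "lo", "172.16.1.99"))       # 172.16.2.99
# Forwarded LAN traffic (hypothetical host and interface) is unaffected.
print(lookup_next_hop("192.168.0.10", "vlan3", "172.16.1.99"))  # 172.16.1.99
```

The first call also illustrates the warning above: any locally originated frame with source address 172.16.2.1 takes the policy next hop, not only the ping trigger traffic.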
Configuration
All the configuration will be carried out on the router R1.
As described above we need to configure a number of different things, so this
section consists of a number of subsections focusing on the different
components of the setup.
For the sake of this configuration example, we assume that the relevant interfaces have already been configured, along with their IP addresses.
Ping Trigger
First we start with the configuration of the two ping triggers, one to be
associated with each of the upstream interfaces. We want both of the ping
triggers to peer against the IP address 198.19.20.21, but to ensure that this
is done on different outbound interfaces.
We begin with the configuration of ping trigger 1, intended to operate over
vlan1 across the upstream provider ISP1:
R1:/#> configure
R1:/config/#> alarm
R1:/config/alarm/#> trigger ping
R1:/config/alarm/trigger-1/#> peer 198.19.20.21
R1:/config/alarm/trigger-1/#> outbound 172.16.1.1
R1:/config/alarm/trigger-1/#> end
R1:/config/alarm/#>
Next we configure ping trigger 2, to operate over vlan2 across the upstream
provider ISP2:
R1:/config/alarm/#> trigger ping
R1:/config/alarm/trigger-2/#> peer 198.19.20.21
R1:/config/alarm/trigger-2/#> outbound 172.16.2.1
R1:/config/alarm/trigger-2/#> end
R1:/config/alarm/#> end
R1:/config/#>
Take note that we specify the outbound option for both of the ping triggers.
Doing this is very important, since it is needed to correctly match on the
intended policy-based routes that we are going to add later. The IP addresses
provided are the ones configured for vlan1 and vlan2 respectively.
Warning
When configuring the outbound option we specify the IP address of the intended
interface directly, instead of providing the interface name. This is done
because some inconsistencies have been noted with regard to the matching
behavior of the policy-based routes, when matching on source IP address in this
specific situation.
Tip
In order to adjust the reaction time of the ping trigger it is possible to
adjust the transmission interval and the number of frames that must be lost.
R1:/config/alarm/trigger-1/#> interval 2
R1:/config/alarm/trigger-1/#> number 1
With the ping triggers configured, they can now be connected to the default routes that are going to be configured in the next step.
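To get a feeling for how the interval and number options from the tip above influence the reaction time, the trigger can be modelled as a simple loss counter. This is a rough sketch under assumptions that are not confirmed by this document: that lost replies are counted consecutively, that a single reply clears the trigger, and that reply timeouts can be ignored. The class PingTrigger and its methods are hypothetical.

```python
class PingTrigger:
    """Simplified model of a ping trigger (assumed semantics, see lead-in)."""

    def __init__(self, interval: int, number: int):
        self.interval = interval  # seconds between echo requests
        self.number = number      # lost replies needed before the trigger activates
        self.lost = 0             # consecutive lost replies seen so far
        self.active = False       # True corresponds to "ping fails"

    def record(self, reply_received: bool) -> bool:
        """Feed the result of one echo request and return the trigger state."""
        if reply_received:
            self.lost = 0
            self.active = False
        else:
            self.lost += 1
            if self.lost >= self.number:
                self.active = True
        return self.active

    def approx_detection_time(self) -> int:
        """Rough time in seconds from first lost reply to activation."""
        return self.interval * self.number


fast = PingTrigger(interval=2, number=1)
print(fast.record(False))            # True
print(fast.approx_detection_time())  # 2
```

A lower interval or number gives faster failover, at the cost of reacting to short, transient packet loss.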
Default Routes
In this step the two different default routes are configured. The two routes
will have their next hop set to the IP address of ISP1 and ISP2 respectively.
The previously created ping triggers are connected to the two routes by
providing the route configuration with the track option. The value provided to
this option is the id of the specific ping trigger. If the associated ping
trigger cannot reach the configured destination, the distance of the route will
be adjusted to 255, so that it will not be usable.
First we configure the route that will serve as the primary uplink:
R1:/config/#> ip
R1:/config/ip/#> route default 172.16.1.99 track 1
Note
When we provide no specific distance in a route's configuration, it will be given a distance of 1.
Next we will add the route that will act as the secondary uplink. This route
will be configured to have a distance of 10, so that it will not be usable
while the primary route is active:
R1:/config/ip/#> route default 172.16.2.99 10 track 2
In summary, we configured two different default gateways, pointing to ISP1 at
172.16.1.99 and ISP2 at 172.16.2.99 respectively. The route pointing to ISP2
has a higher distance, and will therefore not be active while the primary route
is. Both of the routes have been connected to a ping trigger using the track
option. If the associated ping trigger fails to reach its destination, the
route will have its distance increased to 255, making it inactive. This is the
mechanism that allows for the failover to occur.
Policy-Based Routes
Lastly we configure the necessary policy-based routes that will ensure the ping trigger traffic egresses towards the correct upstream. We configure two different policy-based route instances, one associated with each ping trigger.
First, we start with the route that should handle ping trigger 1, responsible
for verifying the uplink towards ISP1:
R1:/config/ip/#> policy-route 1
R1:/config/ip/policy-route-1/#> in-iface lo
R1:/config/ip/policy-route-1/#> next-hop 172.16.1.99
R1:/config/ip/policy-route-1/#> match saddr 172.16.1.1/32
R1:/config/ip/policy-route-1/#> end
R1:/config/ip/#>
Next, we configure the route that should handle ping trigger 2, responsible for
verifying the uplink towards ISP2:
R1:/config/ip/#> policy-route 2
R1:/config/ip/policy-route-2/#> in-iface lo
R1:/config/ip/policy-route-2/#> next-hop 172.16.2.99
R1:/config/ip/policy-route-2/#> match saddr 172.16.2.1/32
R1:/config/ip/policy-route-2/#> end
R1:/config/ip/#>
In summary, two policy-based routes have been configured, one associated with each ping trigger. This is the intention for each of the options set for the instance:
- The in-iface is set to the lo interface. Doing this will ensure that only traffic originating from the device itself can match the route.
- The next-hop is configured to be the address of the upstream next hop, in this case either ISP1 for policy-route 1, or ISP2 for policy-route 2.
- The match saddr is set to the IP address of the local upstream interface, in this case the IP address of interface vlan1 for ISP1 and the IP address of interface vlan2 for ISP2.
With all of the above in place, the setup is complete. We can now apply the configuration and verify that it works.
R1:/config/ip/#> leave
Applying configuration.
Configuration activated. Remember "copy run start" to save to flash (NVRAM).
R1:/#>
Status
Initially we expect everything to be up and running correctly, with no link failures anywhere. Therefore, we start with simply verifying that everything looks correct.
The first thing we can verify is that our policy-based routes look correct, so that we can expect our ping triggers to egress on the correct interfaces. It should look something like this.
R1:/#> show ip policy-route
PRIO  SADDR          DADDR  SPORT  DPORT  IN-IFACE  NEXT-HOP
300   172.16.1.1/32  -      -      -      lo        172.16.1.99
301   172.16.2.1/32  -      -      -      lo        172.16.2.99
If the policy-based routes look correct, we can check the alarm status, to see if the ping triggers can correctly reach their destination address. It should look something like this, with both ping triggers reporting that the ping works.
R1:/#> show alarm
NO  TRIGGER  ENA  ACT  REASON
1   Ping     YES  NO   ping to 198.19.20.21 works
2   Ping     YES  NO   ping to 198.19.20.21 works
At this point we can check if our default gateways look as we want them to. If
everything is correct, the route towards ISP1 at 172.16.1.99 should be the
active default route:
R1:/#> show ip route
S - Static | C - Connected | K - Kernel route | > - Selected route
O - OSPF | R - RIP | [Distance/Metric] | * - FIB route

S   0.0.0.0/0 [10/0] via 172.16.2.99, vlan2, weight 1, 07:03:24
S>* 0.0.0.0/0 [1/0] via 172.16.1.99, vlan1, weight 1, 07:03:24
C>* 172.16.1.0/24 is directly connected, vlan1, 07:03:24
C>* 172.16.2.0/24 is directly connected, vlan2, 07:03:24
If all of the above checks out, we should be in an operating state that mirrors what we expected, based on the technical setup description.
Status On Link Failure
If we have a link failure, as shown in Figure 2, we expect the primary route to have its distance increased, so that the secondary route will now be the active route.
However, first we can check the status of our ping triggers. If the destination
address is unreachable over interface vlan1 through ISP1, trigger 1 should now
indicate this:
R1:/#> show alarm
NO  TRIGGER  ENA  ACT  REASON
1   Ping     YES  YES  ping to 198.19.20.21 fails
2   Ping     YES  NO   ping to 198.19.20.21 works
If the ping trigger indicates that it fails to reach the destination, the
associated route should now have had its distance increased to 255. If we check
the routing table it should now show that the default route towards ISP2 at
172.16.2.99 is the active default route:
R1:/#> show ip route
S - Static | C - Connected | K - Kernel route | > - Selected route
O - OSPF | R - RIP | [Distance/Metric] | * - FIB route

S>* 0.0.0.0/0 [10/0] via 172.16.2.99, vlan2, weight 1, 00:00:02
S   0.0.0.0/0 [255/0] via 172.16.1.99, vlan1, weight 1, 00:00:02
C>* 172.16.1.0/24 is directly connected, vlan1, 07:10:13
C>* 172.16.2.0/24 is directly connected, vlan2, 07:10:13
If everything has behaved as above, the upstream connection has now correctly failed over to the secondary uplink.
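If this check is to be repeated often, for example from a monitoring host that has captured the output of show ip route, the selected default route can be extracted with a small script. The following is a sketch that assumes the output format shown in this document; the function name selected_default_routes is hypothetical, and the ECMP continuation lines that start with "* via" are not handled.

```python
import re

# Matches lines such as:
#   S>* 0.0.0.0/0 [10/0] via 172.16.2.99, vlan2, weight 1, 00:00:02
ROUTE_RE = re.compile(
    r"^(?P<code>[A-Z])(?P<flags>[>*]*)\s+(?P<prefix>\S+)\s+"
    r"\[(?P<distance>\d+)/(?P<metric>\d+)\]\s+"
    r"via\s+(?P<next_hop>[\d.]+),\s*(?P<iface>\w+)"
)


def selected_default_routes(output: str):
    """Return (next_hop, iface, distance) for each selected default route."""
    selected = []
    for line in output.splitlines():
        m = ROUTE_RE.match(line.strip())
        if m and m["prefix"] == "0.0.0.0/0" and ">" in m["flags"]:
            selected.append((m["next_hop"], m["iface"], int(m["distance"])))
    return selected


SAMPLE = """\
S>* 0.0.0.0/0 [10/0] via 172.16.2.99, vlan2, weight 1, 00:00:02
S   0.0.0.0/0 [255/0] via 172.16.1.99, vlan1, weight 1, 00:00:02
C>* 172.16.1.0/24 is directly connected, vlan1, 07:10:13
C>* 172.16.2.0/24 is directly connected, vlan2, 07:10:13
"""

print(selected_default_routes(SAMPLE))  # [('172.16.2.99', 'vlan2', 10)]
```

A result pointing at 172.16.2.99 indicates that the failover to the secondary uplink has taken place.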
Possibilities When Combined With ECMP Routing
It is possible to combine this solution with ECMP routing, if we want to have failover, but utilize both of the uplinks at the same time. For an example of how load balancing can be achieved in a situation similar to the one presented in this document, refer to the howto on ECMP load balancing.
In ECMP routing, we can have multiple routes, that are of equal cost, for the same destination active at the same time. This can allow for load balancing to occur between the Primary and Secondary uplinks, in this use case example. The failover scenario in this situation will be the same, the route will have its distance increased so that it is no longer selectable. Therefore, the failover scenario can be seen as simply disabling a specific route, instead of performing a failover to a secondary route.
In order to achieve this type of setup, the only thing that needs to change is the configuration of the routes themselves. Instead of providing the secondary route with a higher distance value, we simply ensure that all the upstream routes have the same distance. Configuring the two default routes with the same distance gives them the same cost, making ECMP routing possible. Therefore, we perform the route configuration from the Default Routes section in the following manner instead:
R1:/#> configure
R1:/config/#> ip
R1:/config/ip/#> route default 172.16.1.99 track 1
R1:/config/ip/#> route default 172.16.2.99 track 2
R1:/config/ip/#> leave
Applying configuration.
Configuration activated. Remember "copy run start" to save to flash (NVRAM).
R1:/#>
Now both of these routes should be active at the same time, and should be selectable at the same time, for individual flows of packets. In order to verify this, check the active routing table:
R1:/#> show ip route
S - Static | C - Connected | K - Kernel route | > - Selected route
O - OSPF | R - RIP | [Distance/Metric] | * - FIB route

S>* 0.0.0.0/0 [1/0] via 172.16.1.99, vlan1, weight 1, 00:00:01
  *                 via 172.16.2.99, vlan2, weight 1, 00:00:01
C>* 172.16.1.0/24 is directly connected, vlan1, 00:01:32
C>* 172.16.2.0/24 is directly connected, vlan2, 00:01:32
As we can see, both of our default routes will now be active at the same time. Again, if we have a link failure as described in Figure 2, it should still look similar to how it did in the case where we did not utilize ECMP routing:
R1:/#> show ip route
S - Static | C - Connected | K - Kernel route | > - Selected route
O - OSPF | R - RIP | [Distance/Metric] | * - FIB route

S>* 0.0.0.0/0 [1/0] via 172.16.2.99, vlan2, weight 1, 00:00:02
S   0.0.0.0/0 [255/0] via 172.16.1.99, vlan1, weight 1, 00:00:02
C>* 172.16.1.0/24 is directly connected, vlan1, 00:01:32
C>* 172.16.2.0/24 is directly connected, vlan2, 00:01:32
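The ECMP behavior described in this section, where individual flows are spread over the equal-distance routes and a failed route simply drops out of the candidate set, can be sketched as below. This is a conceptual model only: the function pick_next_hop is hypothetical, the CRC32 hash is a stand-in for whatever flow hash the forwarding plane really uses, and the LAN address 192.168.0.10 is invented for illustration.

```python
import zlib

UNUSABLE_DISTANCE = 255


def pick_next_hop(flow, routes):
    """Pick a next hop for a flow among the usable routes with the best distance.

    flow   -- tuple identifying the flow, e.g. (src, dst, proto, sport, dport)
    routes -- list of (next_hop, distance) tuples
    """
    usable = [(nh, d) for nh, d in routes if d < UNUSABLE_DISTANCE]
    if not usable:
        return None
    best = min(d for _, d in usable)
    candidates = sorted(nh for nh, d in usable if d == best)
    # Hash the flow so that all packets of one flow use the same uplink.
    key = "|".join(map(str, flow)).encode()
    return candidates[zlib.crc32(key) % len(candidates)]


flow = ("192.168.0.10", "198.19.20.21", 6, 40000, 443)

both_up = [("172.16.1.99", 1), ("172.16.2.99", 1)]
isp1_down = [("172.16.1.99", 255), ("172.16.2.99", 1)]

print(pick_next_hop(flow, both_up))    # one of the two gateways, stable per flow
print(pick_next_hop(flow, isp1_down))  # 172.16.2.99
```

With ISP1 marked as failed, every flow ends up on 172.16.2.99, which matches the routing table shown above.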