Failover detection and triggers

To achieve high availability, BIG-IP uses failover managers that monitor various components and services of the system. When a failover manager detects a failed process, it takes one of several configurable actions: restart the process, fail over to the standby peer, or reboot the device. Some of the failover managers are:

watchdog – performs hardware health checks
overdog – software that corrects hardware failures; it simply reboots the device
sod (switchover daemon) – monitors the switch fabric and takes corrective action

High availability table

Stores the list of features that can cause a failover, along with the state of each feature. All failover managers update and monitor this table. For each feature the table contains its name, the action on failure, whether it is enabled, and whether it is in a failed state. To see the content of the HA table, run the tmsh command show /sys ha-status all-properties.
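To make the table structure concrete, here is a minimal Python sketch of such a table. It is only an illustration of the four columns described above, not how BIG-IP stores the data internally. The class name, the feature names and the action strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HaFeature:
    """One row of a simplified HA table (hypothetical model, not F5 code)."""
    name: str             # name of the feature
    action: str           # action on failure: "restart", "failover" or "reboot"
    enabled: bool         # whether the feature is enabled
    failed: bool = False  # whether the feature is currently in a failed state


def pending_actions(table):
    """Return (name, action) for every enabled feature that is in a failed state."""
    return [(f.name, f.action) for f in table if f.enabled and f.failed]


# Made-up example rows, for illustration only.
table = [
    HaFeature("example-daemon", "restart", enabled=True),
    HaFeature("example-vlan-failsafe", "failover", enabled=True, failed=True),
    HaFeature("example-disabled", "reboot", enabled=False, failed=True),
]

print(pending_actions(table))
```

Only the second row is reported, because the first is healthy and the third is disabled.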

Failover triggers

Failover can be caused by network-related problems but also by issues with the BIG-IP itself. Failover managers constantly monitor 6 failover processes (daemons). See the picture.

VLAN failsafe

If you enable this failover mode, the BIG-IP monitors traffic on a specific VLAN. If no traffic is detected for half of the failsafe timeout, the BIG-IP tries to ping a known host to verify that traffic still works. If the BIG-IP does not receive any response, it takes the configured action. The default timeout is 90 seconds.
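The timing logic can be sketched in a few lines of Python. This is a simplified model under assumptions, not F5 code: the function name and the state names are made up, and it assumes the configured action is taken once the full timeout has passed without any response.

```python
def vlan_failsafe_state(idle_seconds, probe_answered, timeout=90.0):
    """Classify a VLAN by how long it has been silent (hypothetical model).

    idle_seconds   -- seconds since traffic was last seen on the VLAN
    probe_answered -- True if the ping to a known host got a reply
    timeout        -- failsafe timeout in seconds (90 is the default from the text)
    """
    if idle_seconds < timeout / 2:
        return "ok"           # traffic was seen recently enough, nothing to do
    if probe_answered:
        return "ok"           # the ping proved that traffic still works
    if idle_seconds < timeout:
        return "probing"      # past half of the timeout, still waiting for a reply
    return "take_action"      # assumption: act once the full timeout has expired


print(vlan_failsafe_state(10, False))
print(vlan_failsafe_state(50, False))
print(vlan_failsafe_state(90, False))
```

With the default 90 seconds, probing starts after 45 seconds of silence.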

Gateway failsafe

This is not commonly configured. If you want to use two gateways, you can configure gateway failsafe so that when one gateway fails, you fail over to the second gateway.

Hardware failover

Hardware failover is always enabled. It uses the front-panel failover port of the BIG-IP and a specially pinned DB9 failover cable. Voltage applied by the active unit tells the standby unit to remain in standby mode. The failover cable carries no data and can be at most 15 meters long. If the voltage is lost because of a broken cable, both systems take the active role. Not good! To mitigate this risk you have to use network failover as well. Then the standby unit does not take the active role unless both failover methods (hardware and network) fail.
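The decision the standby unit makes can be written as a small truth function. This Python sketch only restates the rule from the paragraph above. The function and parameter names are hypothetical and do not correspond to any BIG-IP setting.

```python
def standby_should_go_active(cable_voltage_present, heartbeat_received,
                             network_failover_enabled):
    """Decide whether the standby unit takes the active role (hypothetical model)."""
    if not network_failover_enabled:
        # Hardware failover alone: losing the voltage is enough,
        # so a broken cable leaves both units active.
        return not cable_voltage_present
    # With network failover added, both signals must be lost.
    return (not cable_voltage_present) and (not heartbeat_received)


# Broken cable while the peer is healthy and still sending heartbeats:
print(standby_should_go_active(False, True, network_failover_enabled=False))
print(standby_should_go_active(False, True, network_failover_enabled=True))
```

The first call shows the active/active problem, the second shows how network failover prevents it.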

Network failover

Network failover uses a system heartbeat sent over the network by the active unit, which causes the peer system to remain in standby mode. You are not limited by the 15-meter cable length as with hardware failover. If you want to set up network failover, you have to do it on both devices because this configuration is not synchronized.
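A heartbeat check of this kind boils down to remembering when the last heartbeat arrived. The Python sketch below shows the idea on the standby side. The class name is hypothetical, and the 3-second timeout is an arbitrary value picked for the example, not a BIG-IP default.

```python
class HeartbeatMonitor:
    """Tracks the peer's heartbeats on the standby unit (hypothetical model)."""

    def __init__(self, timeout_seconds, now):
        self.timeout_seconds = timeout_seconds
        self.last_seen = now  # treat start-up time as the first heartbeat

    def record_heartbeat(self, now):
        """Call this whenever a heartbeat arrives from the active unit."""
        self.last_seen = now

    def peer_alive(self, now):
        """The peer counts as alive while the last heartbeat is recent enough."""
        return (now - self.last_seen) <= self.timeout_seconds


# Timestamps are plain numbers (seconds) so the example is deterministic.
monitor = HeartbeatMonitor(timeout_seconds=3.0, now=0.0)
print(monitor.peer_alive(now=2.0))   # heartbeat is 2 s old
print(monitor.peer_alive(now=3.5))   # heartbeat is 3.5 s old
```

While peer_alive returns True the unit stays in standby mode.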

Stateful failover

By default, after a failover the standby unit accepts new connections, but the existing connections and persistence records of the previously active unit are lost. If you want to keep connections and persistence, you have to configure stateful failover, which BIG-IP calls mirroring. You have to configure it on every device because this configuration is not synchronized.

Types of mirroring

Connection mirroring – F5 recommends configuring this only for long-lived connections such as Telnet, FTP, SSH and RDP.
Persistence mirroring – F5 recommends turning this on. If a failover occurs, subsequent connections are still directed to the proper pool member.
SNAT mirroring – if you use SNAT (dynamic PAT), you also have to configure SNAT mirroring for stateful failover. It is not needed for NAT, because that is a static 1:1 translation.

The pictures below show how to enable connection, persistence and SNAT mirroring.

MAC masquerading

When a failover occurs, the standby unit, now active, sends gratuitous ARP (GARP) to tell everyone on the network to update their ARP tables with the new MAC address. If you have a device on the network that does not update its ARP table (maybe some old legacy gear), you can configure MAC masquerading. MAC masquerading introduces a floating virtual MAC address that is shared between the peers. You have to configure it on every device because this setting is not synchronized. With it in place, the BIG-IP does not have to send GARP when a failover occurs.

Summary of communication between units in HA mode

Synchronization data – configuration synchronization with the peer over TCP/443
Network failover – uses UDP/1026 to send keepalives to the peer
Mirroring data – uses TCP/1028 to synchronize the connection and persistence tables between HA units
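When planning firewall rules between the two units, it can help to keep these ports in one lookup table. The Python sketch below does only that. The dictionary keys and the function name are made up for this example, and the port numbers are simply the ones from the summary above.

```python
# Ports taken from the summary above; the key names are hypothetical.
HA_CHANNELS = {
    "config sync": ("tcp", 443),
    "network failover": ("udp", 1026),
    "mirroring": ("tcp", 1028),
}


def required_ports(features):
    """Return the sorted (protocol, port) pairs needed for the given HA features."""
    return sorted(HA_CHANNELS[name] for name in features)


print(required_ports(["mirroring", "network failover"]))
```

A pair that uses network failover and mirroring needs UDP/1026 and TCP/1028 open between the units.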