Failover

How Failover Works in a Communications Network

Failover in a communications network is the process of immediately transferring tasks from a failed component to a similar redundant component in order to avoid disruption and maintain operations. Automated failover reroutes data from a failed component, server or network connection to a functioning one without manual action, and is essential for mission-critical systems.

Failover Hierarchy, Network Failover, WAN Link Failover

Failover occurs in a communications network when the work of a failed component, such as a controller, disk drive or server, is transferred to a redundant component of the same type so that there is no gap in data flow or operation. If a primary component becomes unavailable because of either failure or scheduled downtime, the secondary component serves as a backup and takes over for its counterpart.

Automated failover is the capability to switch to a redundant or standby server, system or network upon failure without human intervention (see Failover Hierarchy below for other types of failover). It is essential in servers, systems or networks that require continuous availability and a high degree of reliability, namely those responsible for mission-critical processes and data (see examples below).
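The primary and secondary relationship described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Component` and `FailoverPair` names are hypothetical, and availability is a flag set by hand rather than the result of real health monitoring.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """A network component such as a controller, disk drive or server."""
    name: str
    available: bool = True


class FailoverPair:
    """A primary component with a redundant secondary of the same type."""

    def __init__(self, primary: Component, secondary: Component) -> None:
        self.primary = primary
        self.secondary = secondary

    def active(self) -> Component:
        """Return the component that should currently carry the work."""
        if self.primary.available:
            return self.primary
        if self.secondary.available:
            return self.secondary
        raise RuntimeError("primary and secondary are both unavailable")


pair = FailoverPair(Component("server-a"), Component("server-b"))
pair.primary.available = False  # failure or scheduled downtime
print(pair.active().name)       # server-b
```

Because `active()` checks the primary first on every call, this sketch also switches back to the primary once it is marked available again. Whether that kind of automatic failback is desirable depends on the system.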

Failover Hierarchy

There are different types of failover. Some are intentionally not entirely automatic and require manual intervention. When hardware is on cold standby, failover must be performed manually, which invites error.

In contrast, where hardware is on warm standby, the backup system runs in the background, so the transfer takes place automatically. However, the current transaction may be aborted because it was not possible to synchronize the data prior to failure. 

Therefore, the most reliable scenario is hot standby, wherein both systems permanently run in parallel—data on both systems is 100 percent synchronized at all times. Users will not be aware of any failures.
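The three standby modes described above can be summarized as a small lookup table. The sketch below records only the properties stated in the text. The names are hypothetical, and treating cold standby data as unsynchronized is an assumption, since the text does not say.

```python
from dataclasses import dataclass
from enum import Enum


class Standby(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


@dataclass(frozen=True)
class StandbyTraits:
    automatic_switch: bool   # does the transfer happen without manual steps?
    fully_synchronized: bool  # is data identical on both systems at failure?


TRAITS = {
    # Assumption: cold standby data is not kept in sync with the primary.
    Standby.COLD: StandbyTraits(automatic_switch=False, fully_synchronized=False),
    Standby.WARM: StandbyTraits(automatic_switch=True, fully_synchronized=False),
    Standby.HOT: StandbyTraits(automatic_switch=True, fully_synchronized=True),
}


def may_abort_current_transaction(mode: Standby) -> bool:
    """A transaction in flight can be lost when data was not synchronized."""
    return not TRAITS[mode].fully_synchronized


for mode in Standby:
    traits = TRAITS[mode]
    print(mode.value, traits.automatic_switch, may_abort_current_transaction(mode))
```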

Here is how the four general approaches to system failover compare in terms of recovery time, expense and user impact:


Some enterprises implement hot failover and cold failover for disaster recovery. It is important to differentiate between failover and disaster recovery[1]. Failover is a methodology to resume system availability in an acceptable period of time, while disaster recovery is a methodology to resume system availability when all failover strategies have failed.

Critical Role of Failover

The convergence of voice, data and video over a single IP network is making the network infrastructure one of the most critical elements in operational success. These voice, video, and data services are increasingly integrated with business-critical applications such as email, customer relationship management (CRM), and human resources management (HRM).

Therefore, all forms of communication with customers, suppliers and employees are inextricably tied to network operation. If the network fails, access to critical information can be lost or compromised, with potentially calamitous results. For example, an airport risks massive delays that affect passengers, and delays at a major medical center may compromise patients' health.

Examples of Organizations that Need Failover

  • Small and medium-sized businesses need both incoming and outgoing failover and aggregation for an increasing assortment of critical business traffic, from applications to VoIP to email. Examples include the local corner store that does online banking and bill-pay over the Internet, and a manufacturing company that needs email, web services, hosted enterprise resource planning (ERP) and ecommerce applications available 100 percent of the time.
  • Companies with a central headquarters and a number of branch offices need secure and reliable data communication among those locations. These businesses frequently use virtual private network (VPN) tunnels between their remote locations and headquarters and carry heavy company traffic 24/7. They need reliable performance and high availability for their VPN data, including the ability of the tunnel to fail over automatically if a WAN link goes down.
  • Web hosting companies, ASPs and small ISPs need incoming aggregation and failover to their services, with extra bandwidth and redundancy available to their servers. Their mission-critical ecommerce applications need to be up and running 24/7. If a WAN link goes down, the failover process has to be smooth and transparent to users.
  • Many companies now need Quality of Service (QoS)[2] levels and traffic-shaping for guaranteed bandwidth to critical services/applications. These companies are attempting to deploy reliable and affordable VoIP solutions to cut expenses and enhance productivity.

Failover Requirements

Most corporate and government networks comprise three main elements: LAN, WAN and network infrastructure services. The LAN provides interconnectivity within a single organizational location. The WAN interconnects these geographically separate locations, links them to business partners, and provides access to public networks such as the public switched telephone network for voice traffic and the Internet for data traffic.

The following are other critical elements that comprise a failover environment:

Power: The Source
With power failures cited as the single largest reason for network and systems failures, all critical network components at the primary data center, call center or failover site must be connected to a power source that has very high availability: 99.999 percent in the case of a data center.
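To put the 99.999 percent figure in perspective, an availability target can be converted into the downtime it permits. The following is a minimal sketch that assumes a 365-day year; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


print(f"{allowed_downtime_minutes(99.999):.2f} minutes per year")
```

Under that assumption, 99.999 percent availability leaves a little over five minutes of downtime per year.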

Network Redundancy
Levels of redundancy should be determined for the primary and backup networks based on the identification of critical network components, impact analyses, and established recovery objectives. Redundancy should be considered for the network elements themselves (e.g., switches and routers) and for the components within them, such as power supplies, CPUs and circuit cards.

In addition, the redundancy and diversity of WAN circuits should be considered in conjunction with automated failover. Redundancy can be achieved by providing multiple circuits and multiple types of circuits between critical sites and applications.

Multiple carriers are often used in conjunction with multi-homing WAN link failover and Internet load-balancing appliances to provide Internet access diversity and redundancy to companies that rely heavily on Internet connectivity for ecommerce.

Capacity
Several capacity factors of alternate sites must be properly assessed to avoid failures caused by unanticipated high traffic volumes arriving from a primary site. The first is the peak capacity of the secondary site to which the traffic will be rerouted. The second is the peak capacity coming from the primary site that failed. The WAN circuits should be sized for both peak capacities plus an additional 25-40 percent, to accommodate new peak traffic volumes from added VoIP and/or data traffic caused by customers, suppliers, and employees trying to learn how the problem affects them.
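The sizing rule above can be written as simple arithmetic. In this sketch the extra 25-40 percent is applied to the sum of the two peaks, which is one reading of the rule and should be treated as an assumption. The function name and the traffic figures are hypothetical.

```python
def required_wan_capacity(primary_peak_mbps: float,
                          secondary_peak_mbps: float,
                          headroom: float = 0.25) -> float:
    """Capacity covering both site peaks plus a fractional headroom."""
    if primary_peak_mbps < 0 or secondary_peak_mbps < 0 or headroom < 0:
        raise ValueError("peaks and headroom must be non-negative")
    # Assumption: headroom is a fraction of the combined peak traffic.
    return (primary_peak_mbps + secondary_peak_mbps) * (1.0 + headroom)


# Hypothetical peaks: 100 Mbps from the failed primary, 60 Mbps at the secondary.
low = required_wan_capacity(100.0, 60.0, headroom=0.25)
high = required_wan_capacity(100.0, 60.0, headroom=0.40)
print(f"size the WAN circuits for roughly {low:.0f} to {high:.0f} Mbps")
```

For these example figures the result is a range of about 200 to 224 Mbps.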

Aggregated bandwidth must be ample enough to provide ISP failover and redundancy, and to handle both inbound and outbound failover and Internet load balancing. Intelligent load balancing monitors bandwidth availability throughout the network and assigns traffic, by priority, to the link with the greatest available bandwidth, so that time-sensitive traffic (voice and video) and critical applications receive the bandwidth required for smooth, consistent performance.
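Choosing the link with the greatest available bandwidth can be sketched as follows. This is a simplified illustration, not how any particular appliance works. The `WanLink` type, its fields and the sample figures are all hypothetical, and a real device would measure utilization continuously.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WanLink:
    name: str
    capacity_mbps: float
    used_mbps: float
    up: bool = True

    @property
    def available_mbps(self) -> float:
        """Unused bandwidth on this link, never negative."""
        return max(self.capacity_mbps - self.used_mbps, 0.0)


def pick_link(links: Iterable[WanLink]) -> WanLink:
    """Return the working link with the most available bandwidth."""
    candidates = [link for link in links if link.up]
    if not candidates:
        raise RuntimeError("no WAN link is up")
    return max(candidates, key=lambda link: link.available_mbps)


links = [
    WanLink("isp-a", capacity_mbps=100.0, used_mbps=90.0),
    WanLink("isp-b", capacity_mbps=50.0, used_mbps=10.0),
    WanLink("isp-c", capacity_mbps=200.0, used_mbps=0.0, up=False),
]
print(pick_link(links).name)  # isp-b
```

The link that is down is skipped even though it has the largest nominal capacity, which is the failover part of the decision.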

In addition to the availability of WAN circuits, there is a need for a load balancer to connect clients to an available server. If the server where the client is connected suddenly becomes unavailable, the load balancer redirects the request to one of the other replicated servers.
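The redirect behavior described above can be sketched like this. It assumes the load balancer is handed a list of replicated servers and some way to check availability. The `route_request` function and the `is_available` callback are hypothetical names, not part of any specific product.

```python
from typing import Callable, Sequence


def route_request(servers: Sequence[str],
                  preferred: str,
                  is_available: Callable[[str], bool]) -> str:
    """Return the server that should handle the client's request."""
    if is_available(preferred):
        return preferred
    # The preferred server is unavailable: try the other replicas in order.
    for server in servers:
        if server != preferred and is_available(server):
            return server
    raise RuntimeError("no replicated server is available")


replicas = ["app-1", "app-2", "app-3"]
down = {"app-1"}  # pretend app-1 has just failed
print(route_request(replicas, "app-1", lambda s: s not in down))  # app-2
```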

Many companies employ an appliance located at the LAN gateway that merges WAN failover and load-balancing technology to cost-effectively automate the elimination of downtime for business-critical, time-sensitive applications and ensure network performance.

Summary of Key Failover Issues

  • Improve network performance and eliminate downtime for business-critical, time-sensitive applications. 
  • Globally manage all enterprise and remote WAN resources.
  • Provide redundant hardware failover and monitoring capabilities for mission-critical applications to eliminate all potential single points of WAN link failure.
  • Establish reliable network connections.
  • Ensure inbound and outbound traffic management over the best-performing WAN link.
  • Provide traffic load balancing (both inbound and outbound) from the network for ample bandwidth aggregation.
  • Increase scalability and throughput of WAN connectivity.
  • Fail over to a secondary data center if all links at the primary data center are down.
  • Coordinate point-to-point channel bonding among all locations, providing uninterrupted Internet access for reliable performance of applications like VPN and VoIP.
  • Provide multi-homing[3] WAN link failover and Internet load balancing for guaranteed high Internet availability.

    Tom Pick
    Marketing & PR Executive
    Version: 7
    Last edited: Apr 16, 2009 2:19 PM.
