Network Load Balancing Architecture

**unlimitedtech** · 01-12-2010

To maximize throughput and high availability, Network Load Balancing uses a fully distributed software architecture. An identical copy of the Network Load Balancing driver runs in parallel on each cluster host. The drivers arrange for all cluster hosts on a single subnet to concurrently detect incoming network traffic for the cluster's primary IP address (and for additional IP addresses on multihomed hosts). On each cluster host, the driver acts as a filter between the network adapter's driver and the TCP/IP stack, allowing a portion of the incoming network traffic to be received by the host. By this means incoming client requests are partitioned and load-balanced among the cluster hosts.

Network Load Balancing runs as a network driver logically situated beneath higher-level application protocols, such as HTTP and FTP. Figure 2 below shows the implementation of Network Load Balancing as an intermediate driver in the Windows 2000 network stack.

This architecture maximizes throughput by using the broadcast subnet to deliver incoming network traffic to all cluster hosts and by eliminating the need to route incoming packets to individual cluster hosts. Since filtering unwanted packets is faster than routing packets (which involves receiving, examining, rewriting, and resending), Network Load Balancing delivers higher network throughput than dispatcher-based solutions. As network and server speeds grow, its throughput also grows proportionally, thus eliminating any dependency on a particular hardware routing implementation. For example, Network Load Balancing has demonstrated 250 megabits per second (Mbps) throughput on Gigabit networks.

Another key advantage to Network Load Balancing's fully distributed architecture is the enhanced availability resulting from (N-1)-way failover in a cluster with N hosts. In contrast, dispatcher-based solutions create an inherent single point of failure that must be eliminated using a redundant dispatcher that provides only 1-way failover. This offers a less robust failover solution than does a fully distributed architecture.

Network Load Balancing's architecture takes advantage of the subnet's hub and/or switch architecture to simultaneously deliver incoming network traffic to all cluster hosts. However, this approach increases the burden on switches by occupying additional port bandwidth. (Please refer to the "Network Load Balancing Performance" section of this paper for performance measurements of switch usage.) This is usually not a concern in most intended applications, such as Web services and streaming media, since the percentage of incoming traffic is a small fraction of total network traffic. However, if the client-side network connections to the switch are significantly faster than the server-side connections, incoming traffic can occupy a prohibitively large portion of the server-side port bandwidth. The same problem arises if multiple clusters are hosted on the same switch and measures are not taken to setup virtual LANs for individual clusters.

During packet reception, Network Load Balancing's fully pipelined implementation overlaps the delivery of incoming packets to TCP/IP and the reception of other packets by the network adapter driver. This speeds up overall processing and reduces latency because TCP/IP can process a packet while the network driver interface specification (NDIS) driver receives a subsequent packet. It also reduces the overhead required for TCP/IP and the NDIS driver to coordinate their actions, and in many cases, it eliminates an extra copy of packet data in memory. During packet sending, Network Load Balancing also enhances throughput and reduces latency and overhead by increasing the number of packets that TCP/IP can send with one NDIS call. To achieve these performance enhancements, Network Load Balancing allocates and manages a pool of packet buffers and descriptors that it uses to overlap the actions of TCP/IP and the NDIS driver.

**unlimitedtech** · 01-12-2010

Distribution of Cluster Traffic

Network Load Balancing uses layer-two broadcast or multicast to simultaneously distribute incoming network traffic to all cluster hosts. In its default unicast mode of operation, Network Load Balancing reassigns the station address ("MAC" address) of the network adapter for which it is enabled (called the cluster adapter), and all cluster hosts are assigned the same MAC address. Incoming packets are thereby received by all cluster hosts and passed up to the Network Load Balancing driver for filtering. To insure uniqueness, the MAC address is derived from the cluster's primary IP address entered in the Network Load Balancing Properties dialog box. For a primary IP address of 1.2.3.4, the unicast MAC address is set to 02-BF-1-2-3-4. Network Load Balancing automatically modifies the cluster adapter's MAC address by setting a registry entry and then reloading the adapter's driver; the operating system does not have to be restarted.

If the cluster hosts are attached to a switch instead of a hub, the use of a common MAC address would create a conflict since layer-two switches expect to see unique source MAC addresses on all switch ports. To avoid this problem, Network Load Balancing uniquely modifies the source MAC address for outgoing packets; a cluster MAC address of 02-BF-1-2-3-4 is set to 02-h-1-2-3-4, where h is the host's priority within the cluster (set in the Network Load Balancing Properties dialog box). This technique prevents the switch from learning the cluster's actual MAC address, and as a result, incoming packets for the cluster are delivered to all switch ports. If the cluster hosts are connected directly to a hub instead of to a switch, Network Load Balancing's masking of the source MAC address in unicast mode can be disabled to avoid flooding upstream switches. This is accomplished by setting the Network Load Balancing registry parameter MaskSourceMAC to 0. The use of an upstream level three switch will also limit switch flooding.

Network Load Balancing's unicast mode has the side effect of disabling communication between cluster hosts using the cluster adapters. Since outgoing packets for another cluster host are sent to the same MAC address as the sender, these packets are looped back within the sender by the network stack and never reach the wire. This limitation can be avoided by adding a second network adapter card to each cluster host. In this configuration, Network Load Balancing is bound to the network adapter on the subnet that receives incoming client requests, and the other adapter is typically placed on a separate, local subnet for communication between cluster hosts and with back-end file and database servers. Network Load Balancing only uses the cluster adapter for its heartbeat and remote control traffic.

Note that communication between cluster hosts and hosts outside the cluster is never affected by Network Load Balancing's unicast mode. Network traffic for a host's dedicated IP address (on the cluster adapter) is received by all cluster hosts since they all use the same MAC address. Since Network Load Balancing never load balances traffic for the dedicated IP address, Network Load Balancing immediately delivers this traffic to TCP/IP on the intended host. On other cluster hosts, Network Load Balancing treats this traffic as load balanced traffic (since the target IP address does not match another host's dedicated IP address), and it may deliver it to TCP/IP, which will discard it. Note that excessive incoming network traffic for dedicated IP addresses can impose a performance penalty when Network Load Balancing operates in unicast mode due to the need for TCP/IP to discard unwanted packets.

Network Load Balancing provides a second mode for distributing incoming network traffic to all cluster hosts. Called multicast mode, this mode assigns a layer two multicast address to the cluster adapter instead of changing the adapter's station address. The multicast MAC address is set to 03-BF-1-2-3-4 for a cluster's primary IP address of 1.2.3.4. Since each cluster host retains a unique station address, this mode alleviates the need for a second network adapter for communication between cluster hosts, and it also removes any performance penalty from the use of dedicated IP addresses.

Network Load Balancing's unicast mode induces switch flooding in order to simultaneously deliver incoming network traffic to all cluster hosts. Also, when Network Load Balancing uses multicast mode, switches often flood all ports by default to deliver multicast traffic. However, Network Load Balancing's multicast mode gives the system administrator the opportunity to limit switch flooding by configuring a virtual LAN within the switch for the ports corresponding to the cluster hosts. This can be accomplished by manually programming the switch or by using the Internet Group Management Protocol (IGMP) or the GARP (Generic Attribute Registration Protocol) Multicast Registration Protocol (GMRP). The current version of Network Load Balancing does not provide automatic support for IGMP or GMRP.

Network Load Balancing implements the Address Resolution Protocol (ARP) functionality needed to ensure that the cluster's primary IP address and other virtual IP addresses resolve to the cluster's multicast MAC address. (The dedicated IP address continues to resolve to the cluster adapter's station address.) Experience has shown that Cisco routers currently do not accept an ARP response from the cluster that resolves unicast IP addresses to multicast MAC addresses. This problem can be overcome by adding a static ARP entry to the router for each virtual IP address, and the cluster's multicast MAC address can be obtained from the Network Load Balancing Properties dialog box or from the Wlbs.exe remote control program. The default unicast mode avoids this problem because the cluster's MAC address is a unicast MAC address.

Network Load Balancing does not manage any incoming IP traffic other than TCP traffic, User Datagram Protocol (UDP) traffic, and Generic Routing Encapsulation (GRE) traffic (as part of PPTP traffic) for specified ports. It does not filter IGMP, ARP (except as described above), the Internet Control Message Protocol (ICMP), or other IP protocols. All such traffic is passed unchanged to the TCP/IP protocol software on all of the hosts within the cluster. As a result, the cluster can generate duplicate responses from certain point-to-point TCP/IP programs (such as ping) when the cluster IP address is used. Because of the robustness of TCP/IP and its ability to deal with replicated datagrams, other protocols behave correctly in the clustered environment. These programs can use the dedicated IP address for each host to avoid this behavior.

**unlimitedtech** · 01-12-2010

Load Balancing Algorithm

Network Load Balancing employs a fully distributed filtering algorithm to map incoming clients to the cluster hosts. This algorithm was chosen to enable cluster hosts to make a load-balancing decision independently and quickly for each incoming packet. It was optimized to deliver statistically even load balance for a large client population making numerous, relatively small requests, such as those typically made to Web servers. When the client population is small and/or the client connections produce widely varying loads on the server, Network Load Balancing's load balancing algorithm is less effective. However, the simplicity and speed of its algorithm allows it to deliver very high performance, including both high throughput and low response time, in a wide range of useful client/server applications.

Network Load Balancing load-balances incoming client requests by directing a selected percentage of new requests to each cluster host; the load percentage is set in the Network Load Balancing Properties dialog box for each port range to be load-balanced. The algorithm does not respond to changes in the load on each cluster host (such as the CPU load or memory usage). However, the mapping is modified when the cluster membership changes, and load percentages are renormalized accordingly.

When inspecting an arriving packet, all hosts simultaneously perform a statistical mapping to quickly determine which host should handle the packet. The mapping uses a randomization function that calculates a host priority based on the client's IP address, port, and other state information maintained to optimize load balance. The corresponding host forwards the packet up the network stack to TCP/IP, and the other cluster hosts discard it. The mapping does not vary unless the membership of cluster hosts changes, ensuring that a given client's IP address and port will always map to the same cluster host. However, the particular cluster host to which the client's IP address and port map cannot be predetermined since the randomization function takes into account the current and past cluster's membership to minimize remappings.

The load-balancing algorithm assumes that client IP addresses and port numbers (when client affinity is not enabled) are statistically independent. This assumption can break down if a server-side firewall is used that proxies client addresses with one IP address and, at the same time, client affinity is enabled. In this case, all client requests will be handled by one cluster host and load balancing is defeated. However, if client affinity is not enabled, the distribution of client ports within the firewall usually provides good load balance.

In general, the quality of load balance is statistically determined by the number of clients making requests. This behavior is analogous to coin tosses where the two sides of the coin correspond to the number of cluster hosts (thus, in this analogy, two), and the number of tosses corresponds to the number of client requests. The load distribution improves as the number of client requests increases just as the fraction of coin tosses resulting in "heads" approaches 1/2 with an increasing number of tosses. As a rule of thumb, with client affinity set, there must be many more clients than cluster hosts to begin to observe even load balance.

As the statistical nature of the client population fluctuates, the evenness of load balance can be observed to vary slightly over time. It is important to note that achieving precisely identical load balance on each cluster host imposes a performance penalty (throughput and response time) due to the overhead required to measure and react to load changes. This performance penalty must be weighed against the benefit of maximizing the use of cluster resources (principally CPU and memory). In any case, excess cluster resources must be maintained to absorb the client load in case of failover. Network Load Balancing takes the approach of using a very simple but powerful load-balancing algorithm that delivers the highest possible performance and availability.

Network Load Balancing's client affinity settings are implemented by modifying the statistical mapping algorithm's input data. When client affinity is selected in the Network Load Balancing Properties dialog box, the client's port information is not used as part of the mapping. Hence, all requests from the same client always map to the same host within the cluster. Note that this constraint has no timeout value (as is often the case in dispatcher-based implementations) and persists until there is a change in cluster membership. When single affinity is selected, the mapping algorithm uses the client's full IP address. However, when class C affinity is selected, the algorithm uses only the class C portion (the upper 24 bits) of the client's IP address. This ensures that all clients within the same class C address space map to the same cluster host.

In mapping clients to hosts, Network Load Balancing cannot directly track the boundaries of sessions (such as SSL sessions) since it makes its load balancing decisions when TCP connections are established and prior to the arrival of application data within the packets. Also, it cannot track the boundaries of UDP streams, since the logical session boundaries are defined by particular applications. Instead, Network Load Balancing's affinity settings are used to assist in preserving client sessions. When a cluster host fails or leaves the cluster, its client connections are always dropped. After a new cluster membership is determined by convergence (described below), clients that previously mapped to the failed host are remapped among the surviving hosts. All other client sessions are unaffected by the failure and continue to receive uninterrupted service from the cluster. In this manner, Network Load Balancing's load-balancing algorithm minimizes disruption to clients when a failure occurs.

When a new host joins the cluster, it induces convergence, and a new cluster membership is computed. When convergence completes, a minimal portion of the clients will be remapped to the new host. Network Load Balancing tracks TCP connections on each host, and, after their current TCP connection completes, the next connection from the affected clients will be handled by the new cluster host; UDP streams are immediately handled by the new cluster host. This can potentially break some client sessions that span multiple connections or comprise UDP streams. Hence, hosts should be added to the cluster at times that minimize disruption of sessions. To completely avoid this problem, session state must be managed by the server application so that it can be reconstructed or retrieved from any cluster host. For example, session state can be pushed to a back-end database server or kept in client cookies. SSL session state is automatically recreated by re-authenticating the client.

The GRE stream within the PPTP protocol is a special case of a session that is unaffected by adding a cluster host. Since the GRE stream is temporally contained within the duration of its TCP control connection, Network Load Balancing tracks this GRE stream along with its corresponding control connection. This prevents the addition of a cluster host from disrupting the PPTP tunnel.

**unlimitedtech** · 01-12-2010

Convergence

Network Load Balancing hosts periodically exchange multicast or broadcast heartbeat messages within the cluster. This allows them to monitor the status of the cluster. When the state of the cluster changes (such as when hosts fail, leave, or join the cluster), Network Load Balancing invokes a process known as convergence, in which the hosts exchange heartbeat messages to determine a new, consistent state of the cluster and to elect the host with the highest host priority as the new default host. When all cluster hosts have reached consensus on the correct new state of the cluster, they record the change in cluster membership upon completion of convergence in the Windows 2000 event log.

During convergence, the hosts continue to handle incoming network traffic as usual, except that traffic for a failed host does not receive service. Client requests to surviving hosts are unaffected. Convergence terminates when all cluster hosts report a consistent view of the cluster membership for several heartbeat periods. If a host attempts to join the cluster with inconsistent port rules or an overlapping host priority, completion of convergence is inhibited. This prevents an improperly configured host from handling cluster traffic.

At the completion of convergence, client traffic for a failed host is redistributed to the remaining hosts. If a host is added to the cluster, convergence allows this host to receive its share of load-balanced traffic. Expansion of the cluster does not affect ongoing cluster operations and is achieved in a manner transparent to both Internet clients and to server programs. However, it may affect client sessions because clients may be remapped to different cluster hosts between connections, as described in the previous section.

In unicast mode, each cluster host periodically broadcasts heartbeat messages, and in multicast mode, it multicasts these messages. Each heartbeat message occupies one Ethernet frame and is tagged with the cluster's primary IP address so that multiple clusters can reside on the same subnet. Network Load Balancing's heartbeat messages are assigned an ether type-value of hexadecimal 886F. The default period between sending heartbeats is one second, and this value can be adjusted with the AliveMsgPeriod registry parameter. During convergence, the exchange period is reduced by half in order to expedite completion. Even for large clusters, the bandwidth required for heartbeat messages is very low (for example, 24 Kbytes/second for a 16-way cluster).

Network Load Balancing assumes that a host is functioning properly within the cluster as long as it participates in the normal heartbeat exchange among the cluster hosts. If other hosts do not receive a heartbeat message from any member for several periods of message exchange, they initiate convergence. The number of missed heartbeat messages required to initiate convergence is set to five by default and can be adjusted using the AliveMsgTolerance registry parameter.

A cluster host will immediately initiate convergence if it receives a heartbeat message from a new host or if it receives an inconsistent heartbeat message that indicates a problem in the load distribution. When receiving a heartbeat message from a new host, the host checks to see if the other host has been handling traffic from the same clients. This problem could arise if the cluster subnet was rejoined after having been partitioned. More likely, the new host was already converged alone in a disjoint subnet and has received no client traffic. This can occur if the switch introduces a lengthy delay in connecting the host to the subnet. If a cluster host detects this problem and the other host has received more client connections since the last convergence, it immediately stops handling client traffic in the affected port range. Since both hosts are exchanging heartbeats, the host that has received more connections continues to handle traffic, while the other host waits for the end of convergence to begin handling its portion of the load. This heuristic algorithm eliminates potential conflicts in load handling when a previously partitioned cluster subnet is rejoined; this event is recorded in the event log.

Remote Control

Network Load Balancing's remote control mechanism uses the UDP protocol and is assigned port 2504. Remote control datagrams are sent to the cluster's primary IP address. Since the Network Load Balancing driver on each cluster host handles them, these datagrams must be routed to the cluster subnet (instead of to a back-end subnet to which the cluster is attached). When remote control commands are issued from within the cluster, they are broadcast on the local subnet. This ensures that all cluster hosts receive them even if the cluster runs in unicast mode.