
Thread: Networking Guide 8 - Fault Tolerance and Disaster Recovery

  1. #1

    Post Networking Guide 8 - Fault Tolerance and Disaster Recovery

Computers are not perfect. They can (and do) have problems that affect their users’ productivity. These problems range from small errors to total system failure, and they can be the result of environmental problems, hardware and software failures, hacking (malicious, unauthorized use of a computer or a network), or natural disasters.

In all cases, you can take measures to minimize the impact of computer and network problems. These measures fall into two major categories: fault tolerance and disaster recovery. Fault tolerance is the capability of a computer or a network system to respond to a condition automatically, usually resolving it, and thus reducing the impact on the system. If fault-tolerant measures have been implemented, it is unlikely that a user would know that a problem existed. Disaster recovery, as its name suggests, is the ability to get a system functional after a total system failure (a disaster for a company and the network administrator) in the least amount of time. Strictly speaking, if enough fault tolerance methods are in place, you shouldn’t need disaster recovery.

    In this guide, we will look at the following:
    • How to assess fault tolerance and disaster recovery needs
    • Power management
    • Disk system fault tolerance methods
    • Backup considerations
    • Virus protection

  2. #2

    Post Assessing Fault Tolerance and Disaster Recovery Needs


    Before implementing fault tolerance or disaster recovery, you should determine how critical your systems are to daily business operations. Additionally, you should determine how long each system could afford to be nonfunctional (down). Making these determinations will dictate which fault tolerance and disaster recovery methods you implement and to what extent. The more vital the system, the greater lengths (and, thus, greater expense) you should go to in order to protect it from downtime. Less critical systems may call for simpler measures. For example, banks, insurance companies, the U.S. government, and airlines all run highly critical computer and network systems. Thus, they all have complex and expensive fault tolerance and disaster recovery systems in place.

In terms of how fault tolerance and disaster recovery are implemented, sites can be described as hot, warm, or cold. As the temperature decreases, so does the level of fault tolerance and disaster recovery implemented at the site.

    Hot Sites

    In a hot site, every computer system and piece of information has a redundant copy (possibly multiple redundancies). This level of fault tolerance is used when systems must be up 100 percent of the time. Hot sites are strictly fault-tolerant implementations, not disaster recovery implementations (as no downtime is allowed). Budgets for this type of fault-tolerant implementation are typically large.

In a system that has 100-percent redundancy, the redundant system(s) will take over for the failed system without any downtime. The technology used to implement hot sites is clustering, which is the process of grouping multiple computers in order to provide increased performance and fault tolerance.

    Clustering Technologies

Although servers are commonly clustered, workstations normally are not, because they are simple and cheap to replace. Each computer in the cluster is connected to the other computers in the cluster by high-speed, redundant links (usually multiple fiber-optic cables). Each computer runs special clustering software that makes the cluster of computers appear as a single entity to clients.

There are two levels of cluster service: failover clustering and true clustering.

    Failover Clustering

A failover cluster includes two entities (usually servers). The first is the active device (the device that responds to network requests); the second is the failover device. The failover device is an exact duplicate of the active device, but it is inactive and connected to the active device by a high-speed link. The failover device monitors the active device’s condition by using what is known as a heartbeat: a signal sent by the active device at a specified interval. If the failover device doesn’t receive a heartbeat from the active device within the specified interval, it assumes the active device has failed, comes online (becomes active), and takes over as the active device.

    When the previously active device comes back online, it starts sending out the heartbeat. The failover device, which currently is responding to requests as the active device, hears the heartbeat and detects that the active device is now back online. The failover device then goes back into standby mode and starts listening to the heartbeat of the active device again.
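The heartbeat logic described above can be sketched in a few lines of Python. This is only an illustrative model, not any vendor’s actual implementation; the timeout value and the method names here are hypothetical.

```python
import time

HEARTBEAT_TIMEOUT = 3.0   # seconds of silence before declaring failure (hypothetical)

class FailoverMonitor:
    """Runs on the failover device; watches the active device's heartbeat."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False            # start in standby mode

    def on_heartbeat(self):
        """Called whenever a heartbeat arrives from the active device."""
        self.last_heartbeat = time.monotonic()
        if self.active:
            self.active = False        # active device is back; return to standby

    def check(self):
        """Called periodically; take over if the heartbeat has stopped."""
        if not self.active and time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            self.active = True         # cutover: become the active device
```

The cutover time discussed below corresponds to the gap between the last heartbeat and the moment `check()` finally takes over.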

In a failover cluster, both servers must be running failover clustering software, such as Novell’s SFT III (System Fault Tolerance, Level III), Standby Server, and High Availability Server (with Novell’s High Availability software, either server can fail and the other will take over), or Microsoft’s Cluster Server (MSCS) for Windows NT servers. Each software package provides failover functionality.

Here are some advantages of this approach to fault tolerance:

    • Resources are almost always available. This approach ensures that the network service(s) that the device provides will be available as much as 99 percent of the time. Each network service and all data are exactly duplicated on each device, and when one experiences problems, the other takes over for virtually uninterrupted service.
    • It is relatively inexpensive when compared with true clustering (discussed in the next section).

    But, as with any technology, there are disadvantages, and failover clustering has its fair share:

    • There is only one level of fault tolerance. This technology works great if the active device fails, but if the failover device fails as well, the network will totally lose that device’s functionality.
    • There is no load balancing. Servers in a failover-clustering configuration are in either active or standby mode. There is no balancing of network service load across both servers in the cluster. The active server responds to network requests, and the failover server simply monitors the active server, wasting its processor resources.
    • During cutover time, the server can’t respond to requests. Failover clusters take anywhere from a few seconds to a few minutes to detect and recover from a failed server. This is called cutover time. During cutover time, the server can’t respond to network client requests, so the server is effectively down. This time is indeed short, but, nevertheless, clients can’t get access to their services during it.
    • Hardware and software must be exactly duplicated. In most failover configurations, the hardware for both active and failover devices must be identical. If it’s not, the transition of the failover device to active device may be hindered. These differences may even cause the failover to fail. This is a disadvantage because it involves checking all aspects of the hardware. (For servers this means disk types and sizes, NICs, processor speed and type, and RAM.)


    Note Even though Microsoft Cluster Server (MSCS) is described above as a failover clustering technology, it does have some capability for load balancing (according to Microsoft). It currently supports only a two-device configuration, so it primarily fits into this category of clustering.


    True Clustering

True clustering differs from failover clustering in two major ways:
    • It supports multiple devices.
    • It provides load balancing.

In true clustering (also called multiple-server clustering), multiple servers (or any network devices) act together as a kind of super server. True clusters must provide load balancing. For example, 20 servers can act as one big server. All network services are duplicated across all servers, and network requests are distributed across all servers. Each server is connected to the other servers through a high-speed, dedicated link. If one server in the cluster malfunctions, the other servers automatically take over the burden of the failed server. When the failed server comes back online, it resumes responding to requests as part of the cluster.
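As a rough illustration of the load-balancing idea, the toy dispatcher below spreads requests across cluster nodes and skips nodes that are down. Real clustering software is far more sophisticated; the class and server names here are invented for the sketch.

```python
import itertools

class Cluster:
    """Toy round-robin dispatcher: any node can serve any request,
    and failed nodes are skipped until they rejoin the cluster."""

    def __init__(self, nodes):
        self.up = {node: True for node in nodes}
        self.ring = itertools.cycle(nodes)

    def dispatch(self, request):
        for _ in range(len(self.up)):          # try each node at most once
            node = next(self.ring)
            if self.up[node]:
                return f"{node} handles {request}"
        raise RuntimeError("no servers available")

cluster = Cluster(["srv1", "srv2", "srv3"])
cluster.up["srv2"] = False                     # one server goes down...
print(cluster.dispatch("GET /index.html"))     # srv1 handles GET /index.html
print(cluster.dispatch("GET /index.html"))     # srv3 handles it; srv2 is skipped
```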

    This technology can provide greater than 99-percent availability for network services hosted by the cluster. Unfortunately, most NOS vendors (including Novell and Microsoft) don’t currently ship true clustering software solutions. One notable exception is VMS, by Digital Equipment Corporation. However, both Novell and Microsoft have announced plans to release true clustering server solutions.

Several advantages are associated with true clustering, including:

    • There is more than 99-percent availability for network services. With multiple servers, the impact of a single server, or even more than one server, in the cluster going down is minimized because other servers take over the functionality.
    • It offers increased performance. Because each server is taking part of the load of the cluster, much higher total performance is possible.
    • There is no cutover time. Because multiple servers are always responding to network requests, true clusters don’t suffer from cutover time when a server goes down. The remaining servers do receive an increased load, and clients may see a Server Busy or Not Found error message if they should, by some chance, try to communicate with the server that is going down. But if the user tries the operation again, one of the remaining servers will respond to the request.
    • It provides for replication. If the clustering software in use supports it, a few servers can be located off site in case the main site is destroyed by fire, flood, or other disaster. Because there is a replica (copy) of all data in a different location, this technology is known as replication.

    But these advantages don’t come without their price. Here are a couple of disadvantages to true clustering:

    • The more servers, the more complex the cluster. As you add servers to the cluster to increase performance, you also increase the complexity. For this reason, most clustering software is limited to a maximum of 64 servers. As technology develops, this limit will increase. The minimum number of servers in a true cluster is two.
    • It is much more expensive. Because of the hardware involved and the complexity of the clustering software, true clustering requires a serious financial commitment. To justify the expense, ask the keepers of the purse strings how much money would be lost if the system were down for a day.

  3. #3

    Post Warm Site and Cold Site

    Warm Site

In a warm site (also called a nearline site), the network service and data are available most of the time (more than 85 percent of the time). The data and services are less critical than those in a hot site. With hot-site technologies, all fault tolerance procedures are automatic and are controlled by the NOS. Warm-site technologies require a little more administrator intervention, but they aren’t as expensive.
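To put such percentages in perspective, here is a quick back-of-the-envelope calculation (not taken from the original guide) converting availability figures into allowable downtime per year:

```python
HOURS_PER_YEAR = 24 * 365

for label, availability in [("hot site", 0.99), ("warm site", 0.85)]:
    downtime = HOURS_PER_YEAR * (1 - availability)
    print(f"{label}: {availability:.0%} uptime allows "
          f"{downtime:.0f} hours (~{downtime / 24:.1f} days) of downtime per year")
```

Even 99-percent uptime still permits more than three days of downtime per year, which is why hot sites aim for redundancy well beyond that figure.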

The most commonly used warm-site technology is a duplicate server. A duplicate server, as its name suggests, is a spare server that is not currently in use and is available to replace any server that fails. When a server fails, the administrator installs the duplicate server and restores the data; the network services are then available to users with a minimum of downtime. The administrator sends the failed server out to be repaired. Once the repaired server comes back, it becomes the spare server, available for the next failure.

    Using a duplicate server is a disaster recovery method because the entire server is replaced, but in a shorter time than if all the components had to be ordered and configured at the time of the system failure. The major advantage of using duplicate servers rather than clustering is that it’s less expensive. A single duplicate server costs much less than a comparable clustering solution.

    Corporate networks don’t often use duplicate servers, and that’s because there are some major disadvantages associated with using them:

• You must keep current backups. Because the duplicate server relies on a current backup, you must back up every day and verify every backup, which is time-consuming. To stay as current as possible, some companies run continuous backups.
    • You can lose data. If a server fails in mid-afternoon and the backup was run the evening before, you will lose any data that was placed on the server since the last backup. This may not be a big problem on servers that aren’t updated frequently.

    Cold Site

    A cold site cannot guarantee server uptime. Generally speaking, cold sites have little or no fault tolerance and rely completely on efficient disaster recovery methods to ensure data integrity. If a server fails, the IT personnel will do their best to recover and fix the problem. If a major component needs to be replaced, the server stays down until the component is replaced. Errors and failures are handled as they occur. Apart from regular system backups, no fault tolerance or disaster recovery methods are implemented.

    This type of site has one major advantage: It is the cheapest way to deal with errors and system failures. No extra hardware is required (except the hardware required for backing up).

  4. #4

    Post Power Management


A key element of any fault tolerance plan is a power management strategy. Electricity powers the network, switches, hubs, PCs, and computer servers. Variations in power can cause problems ranging from a reboot after a short loss of service to damaged equipment and data. Fortunately, a number of products are available to help protect sensitive systems from the dangers of lightning strikes, dirty (uneven) power, and accidental power cable disconnection, including surge protectors, standby power supplies, uninterruptible power supplies, and line conditioners. What you use depends on how critical your system is (whether you decide that it is a hot, warm, or cold site). At a minimum, you should connect individual workstations to surge protectors, and network hardware and servers should use uninterruptible power supplies or line conditioners. Critical operations, such as ambulance corps and hospitals, typically go one step further and also have a gas-powered backup generator to provide long-term supplemental power to all systems.

  5. #5

    Post Surge Protectors


    Surge protectors (also commonly referred to as surge suppressors ) are typically power blocks or power strips with electronics that limit the amount of voltage, current (amps), and noise that can get through to your equipment. They are designed to protect your equipment from long-lasting increases in voltage (surges) and high, short bursts of voltage (spikes). The unit does not provide any power, however. Rather, it blocks harmful electricity from reaching your equipment. The surge protector detects a surge or a spike and clamps down on the incoming voltage, reducing it to safe levels. If the surge is large enough, it can trip the built-in safety mechanism. You may then lose power and have to reset the equipment you are protecting. Common causes of surges and spikes are fluctuations in power from the electricity company, additions of equipment to the power grid by customers, and natural storms.
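The clamping behavior amounts to a simple limiter. The sketch below is a toy model only (real suppressors do this with components such as metal-oxide varistors, not software); the 330V clamp level anticipates the let-through rating discussed below.

```python
CLAMP_VOLTS = 330   # let-through level (see the IEEE 587 A rating below)

def clamp(input_volts):
    """Pass normal voltage through; limit spikes to the let-through level."""
    return min(input_volts, CLAMP_VOLTS)

# Normal line voltage, then a large spike (e.g., a nearby lightning strike):
for volts in [110, 112, 900, 6000, 111]:
    print(f"in = {volts:>5} V -> out = {clamp(volts):>3} V")
```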

    Level of Protection

    Unfortunately, surge protectors provide only a limited amount of protection. Surge protectors are simple devices that can only protect against large spikes and surges. Small increases in voltage are allowed to pass. These small increases may not cause immediate damage, but over time, they can damage sensitive computer equipment. It is definitely better to have a surge protector than not have one, but the surge protector must be of high quality (these usually cost more than $30).

    Warning The $5.99 power strips you find at Wal-Mart and similar stores are not true surge protectors. They are simply multiple-outlet strips with a single circuit breaker and provide only the most basic protection. Don’t use them with computer equipment.


    Common Components/Features

    Tripp Lite’s Isobar and American Power Conversion’s (APC) SurgeArrest are two leading surge protector products. When selecting a surge protector, look for these components and features:

    Active Protection Light When this light is illuminated, the unit is properly functioning. It should be on at all times.

    Site Wiring Fault Light When this light is illuminated, there is a wiring fault in the circuit to which the surge protector is connected. This light should be off at all times.

Ground Make sure that the unit has a three-prong plug; the third (middle) prong is for ground. If the ground is missing, the user can receive a lethal shock. This may seem obvious, but it is important to remember.

    IEEE 587 A Let-Through Rating Check the value of the IEEE 587 A Let-Through rating. This value indicates how much voltage is let through when the surge protector clamps down on the incoming spike or surge. The lower this rating, the lower the voltage that is let through and the better you are protected. A 330V rating is excellent protection.

    UL Listing Underwriters Laboratories Inc. is an independent testing laboratory that certifies electrical equipment specifications. A UL listing indicates that the surge protector meets national electrical code and safety standards.

    Circuit Breaker This button pops out after a large spike or surge. When the circuit breaker trips, you will lose all power to your equipment. Press the button back in to reset the surge protector.

Additional Ports Newer surge protectors protect much more than power cables. Today’s surge protectors have RJ-45 and coaxial connectors for protecting network cards from extremely high surges. Also, RJ-11 and ISDN ports protect modems from lightning strikes on telephone lines (a strike can follow the phone line right into the modem and damage it).

Note IEEE stands for the Institute of Electrical and Electronics Engineers, an organization that is involved in creating standards. For more information, visit www.ieee.org.

  6. #6

    Post Battery Backup Systems


Battery backup systems protect computer systems from power failures. There are several types of power failures, including brownouts and blackouts. A brownout occurs when the voltage drops below normal levels and stays there for several minutes (or longer). A brownout may eventually lead to a blackout, or total loss of power.
    Battery backup systems use a battery to power the computer and its assorted peripherals. Generally speaking, when these devices are activated due to a power failure, they permit the user to save data and initiate a graceful shutdown of the system. They normally aren’t used to run the system for an extended period (unless the units use a very large-capacity battery).

    Note Never plug a laser printer or copier into a battery backup device. These devices draw tremendous amounts of current when they are turned on (much more than any computer or network device would ever draw). If you do this, you could permanently damage or disable your battery backup device.


There are two main types of battery backup systems:
    • Standby Power Supply (SPS)
    • Uninterruptible Power Supply (UPS)

    Note Power output from battery-powered inverters isn’t exactly perfect. Normal power output alternates polarity 60 times a second (60 Hertz). When graphed, this output looks like a sine wave. Output from inverters is stepped to approximate this sine wave output, but it really never duplicates it. Today’s inverter technology can come extremely close, but the differences between inverter and true AC power can cause damage to computer power supplies over the long run.


    Standby Power Supply (SPS)

    A Standby Power Supply (SPS) contains a battery, a switchover circuit, and an inverter (a device to convert the DC voltage from the battery into AC voltage that the computer and peripherals need). The outlets on the SPS are connected to the switching circuit, which is in turn connected to the incoming AC power (called line voltage). The switching circuit monitors the line voltage. When it drops below a factory-preset threshold, the switching circuit switches from line voltage to the battery and inverter. The battery and inverter power the outlets (and, thus, the computers or devices plugged into them) until the switching circuit detects that line voltage is present again at the correct levels. The switching circuit then switches the outlets back to line voltage.
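The switching circuit’s decision amounts to a small state machine, simulated below in Python; the threshold value and the simulated readings are hypothetical.

```python
THRESHOLD_VOLTS = 95   # factory-preset switchover point (hypothetical value)

def select_source(line_volts):
    """One monitoring cycle: decide which source feeds the outlets."""
    # Below the threshold, switch the outlets to the battery and inverter;
    # otherwise feed them straight from line voltage.
    return "battery/inverter" if line_volts < THRESHOLD_VOLTS else "line"

# Simulated line readings: normal power, a sag, then recovery.
for volts in [112, 109, 82, 78, 108, 111]:
    print(f"line = {volts:>3} V -> outlets fed from {select_source(volts)}")
```

The brief power gap discussed below occurs in the instant when a real switching circuit changes from one source to the other, which no software model captures.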

    Level of Protection

An SPS can provide some protection against power outages (more so than surge protectors, at any rate). Unfortunately, because the switching circuit must switch between power sources, there is a short period of time during which the outlets have no power. Computers and network devices can usually ride out this extremely brief gap, but they don’t always handle it gracefully. Some devices will lock up or experience errors; others can even reboot (thus negating the reason for having a battery backup system).

For this reason, SPSs have never been very popular with computer and electronic equipment users. They are inexpensive and provide a basic level of protection, but this is usually not sufficient for sites that require 100-percent uptime.


    Common Components/Features

    Most Standby Power Supplies will have one or more of these features or components:

    Multiple Outlets Each SPS will have at least one outlet for connecting computers or network devices to the SPS. Most have multiple outlets. The number of outlets depends on the capacity of the battery, the inverter, and the switching circuit in the SPS.

    Line Voltage Indicator This light or indicator, when illuminated, indicates that the SPS is receiving sufficient AC line voltage to power the equipment plugged into the SPS.

    Battery Power Indicator This light or indicator, when illuminated, indicates that the equipment plugged into the SPS is running off the battery and inverter in the SPS. When this indicator is initially illuminated, a beep will sound, warning that power to the SPS has failed.

    System Management Port This is usually a standard serial port (although USB ports are becoming more popular). It allows the SPS to connect to the host computer (or server) it is protecting. The host computer runs SPS management software that gathers statistics about the power the SPS is using and providing. Also, when a power failure occurs, this port is used to send a signal from the SPS informing the management software on the host computer that the power to the SPS has failed. The management software can then initiate a graceful shutdown of the workstation computer or server.



    Uninterruptible Power Supply (UPS)

    An Uninterruptible Power Supply (UPS) is another type of battery backup often found on computers and network devices today. It is similar to an SPS in that it has outlets, a battery, and an inverter. The similarities end there, though. A UPS uses an entirely different method to provide continuous AC voltage to the equipment it supports.

    In a UPS, the equipment is always running off the inverter and battery. A UPS contains a charging/monitoring circuit that charges the battery constantly. It also monitors the AC line voltage. When a power failure occurs, the charger just stops charging the battery. The equipment never senses any change in power. The monitoring part of the circuit senses the change and emits a beep to tell the user the power has failed.

    Level of Protection

    A UPS provides a significant amount of protection against many types of power problems because the computer is always running off the battery and inverter. Problems with the input line voltage don’t really affect the output voltage. They only affect the efficiency of the charging circuit. A UPS is the most popular form of power protection because it provides significant protection at a fairly low cost.


    Common Components/Features

    When buying a UPS, you must look for the features that will solve your particular power problems or that meet your needs in general. Some of the features of a UPS include:

Multiple Outlets Each UPS will have at least one outlet for connecting computers or network devices to the UPS. Most have multiple outlets. The number of outlets depends on the capacity of the battery and inverter in the UPS.

    Line Voltage Indicator This light or indicator, when illuminated, indicates that the UPS is receiving sufficient AC line voltage to power the charging circuit of the UPS.

    Battery Power Indicator This light or indicator, when illuminated, indicates that the equipment plugged into the UPS is running off the battery and inverter in the UPS and that the charging circuit is not active. When this indicator is initially illuminated, a beep will sound, warning that power to the UPS has failed.

    System Management Port This is usually a standard serial port (although USB ports are becoming more popular). It allows the UPS to connect to the host computer (or server) it is protecting. The host computer runs UPS management software that gathers statistics about the power the UPS is using and providing. Also, when a power failure occurs, this port is used to send a signal to the management software on the host computer that the power to the UPS has failed. The management software can then initiate a graceful shutdown of the workstation computer or server.
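The management software’s role can be sketched as a polling loop like the one below. Everything in it is hypothetical: `read_ups_status()` stands in for whatever vendor-specific protocol the management port actually speaks, and the grace period and shutdown command would vary by site and operating system.

```python
import subprocess
import time

GRACE_PERIOD = 120   # seconds to run on battery before shutting down (hypothetical)

def read_ups_status():
    """Placeholder for the vendor-specific management-port query.
    Assume it returns 'ONLINE' or 'ONBATT' (an invented protocol)."""
    raise NotImplementedError("depends on the UPS vendor's protocol")

def monitor():
    on_battery_since = None
    while True:
        if read_ups_status() == "ONBATT":
            if on_battery_since is None:
                on_battery_since = time.monotonic()    # power just failed
            elif time.monotonic() - on_battery_since > GRACE_PERIOD:
                # The battery won't last forever: shut down gracefully.
                subprocess.run(["shutdown", "-h", "now"])
                return
        else:
            on_battery_since = None                    # line power is back
        time.sleep(5)
```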

  7. #7

    Post Line Conditioners


The AC voltage that powers our everyday devices comes from power sources usually located far from where we use it. The power is conducted through wires and substations over many miles on its trip from where it’s generated to where it’s used. At any point along this trip, erroneous electrical patterns or signals that computers may not be able to handle properly can be introduced into the power. These erroneous signals are known as line noise and can cause many types of problems, including random lockups, random reboots, and system crashes.

    All power signals have varying degrees of line noise. In areas that have particularly bad line noise, a device known as a line conditioner is used. This device filters out the erroneous signals in the power, leaving the devices it supplies with clean, 110-volt, 60Hz power.

    Line conditioners are complex (and, thus, expensive) devices that incorporate a number of power-correction technologies to provide electronic devices with clean power. Some of these technologies include UPS, surge suppression, and power filtering.

    Level of Protection

Line conditioners provide the highest level of power protection for electronic devices. Hot sites will have a large line conditioner (or multiple line conditioners) servicing every computer in the organization. These conditioners are often wired directly into the electrical system of a company. Special outlets (with markings that indicate they are protected outlets) are wired in each room, and wires from these outlets lead directly back to the line conditioner. These devices are usually cost-prohibitive for smaller companies or for a single computer, although some small companies will invest in a small line conditioner for their main server, if it is a critical server.


    Common Components/Features

Line conditioners usually have control panel interfaces; some manufacturers replace the control panel with a computer running power management software. These interfaces can report both incoming and outgoing voltages, as well as any problems the conditioner might be experiencing. Line conditioners are complex and large enough that they typically require large cooling fans and an adequate supply of cool air.

  8. #8

    Post Disk System Fault Tolerance

A hard disk is ultimately a temporary storage device: every hard disk will eventually fail. The most common problem is a complete hard-disk failure (also known as a hard-disk crash). When this happens, all stored data is irretrievable. Therefore, if you want your data to be accessible 90 to 100 percent of the time (as with warm and hot sites), you need to use some method of disk fault tolerance. Typically, disk fault tolerance is achieved through disk management technologies such as mirroring, striping, and duplexing drives, which provide some level of data protection. As with other methods of fault tolerance, disk fault tolerance means that a disk system is able to recover from an error condition of some kind.

    The methods that provide fault tolerance for hard-disk systems include:
    • Mirroring
    • Duplexing
    • Data striping
    • Redundant array of independent (or inexpensive) disks (RAID)
Understanding Disk Volumes

    Before you read about the various methods of providing fault tolerance for disk systems, you should know about one important concept: volumes. When you install a new hard disk into a computer and prepare it for use, the NOS sets up the disk so that you can store data on it, in a process known as formatting. Once this has been achieved, the NOS can access the disk. Before it can store data on the disk, it must set up what is known as a volume. A volume, for all practical purposes, is a named chunk of disk space. This chunk can exist on part of a disk, can exist on all of a disk, or can span multiple disks. Volumes provide a way of organizing disk storage.

  9. #9

    Post Disk Mirroring


Mirroring a drive means designating one hard-disk drive in the computer as a mirror, or duplicate, of another, specified drive. The two drives are attached to a single disk controller. This disk fault tolerance feature is provided by most network operating systems. When the NOS writes data to the specified drive, the same data is also written to the drive designated as the mirror. If the first drive fails, the mirror drive is already online, and because it has a duplicate of the information contained on the specified drive, users won’t know that a disk drive in the server has failed. The NOS notifies the administrator that the failure has occurred. The downside is that if the disk controller fails, neither drive is available.

The drives do not need to be identical, but it helps if they are. Both drives must have the same amount of free space available to form the mirror. For example, suppose you have two 4GB drives: one has 3GB free, and the other has 2GB free. You can create one 2GB mirrored volume.
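The essential idea can be shown with a toy model (real mirroring is done at the block level by the NOS or controller, not in application code):

```python
class MirroredVolume:
    """Toy mirror: every write goes to both drives, and reads
    fall back to the mirror if the primary drive has failed."""

    def __init__(self):
        self.primary = {}        # block number -> data
        self.mirror = {}
        self.primary_ok = True

    def write(self, block, data):
        if self.primary_ok:
            self.primary[block] = data
        self.mirror[block] = data            # duplicate every write

    def read(self, block):
        drive = self.primary if self.primary_ok else self.mirror
        return drive[block]

vol = MirroredVolume()
vol.write(0, b"payroll records")
vol.primary_ok = False                       # the primary drive crashes
print(vol.read(0))                           # still served, from the mirror
```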

    Note Mirroring is an implementation of RAID level 1, which is discussed in detail later in this guide.

  10. #10

    Post Disk Duplexing


As with mirroring, duplexing also saves data to a mirror drive. In fact, the only major difference between duplexing and mirroring is that duplexing uses two separate disk controllers (one for each disk). Thus, duplexing provides not only a redundant disk but also a redundant controller and data cable. Duplexing provides fault tolerance even if one of the controllers fails.

    Note Duplexing is also an implementation of RAID level 1.

  11. #11

    Post Disk Striping


From a performance point of view, writing data to a single drive is slow. When three drives are configured as a single volume, information must fill the first drive before it can go to the second, and fill the second before filling the third. If you configure that volume to use disk striping, you will see a definite performance gain. Disk striping breaks up the data to be saved into small portions and writes them across all the disks in small areas called stripes. The data is broken into sections, and each section is written in sequence to a separate disk, so all of the read/write heads are kept working constantly, which maximizes performance.

    Striping data across multiple disks improves only performance; it does not improve fault tolerance. To add fault tolerance to disk striping, it is necessary to use parity. Disk striping is also known as RAID level 0.

Parity Information

    Parity, as it relates to disk fault tolerance, is a general term for the fault tolerance information computed for each chunk of data written to a disk. This parity information can be used to reconstruct missing data should a disk fail. Striping can use parity or not, but if the striping technology doesn’t use parity, you won’t gain any fault tolerance. When using striping with parity, the parity information is computed for each block and written to the drive.

    The advantage to using parity with striping is gaining fault tolerance. If any part of the data gets lost or destroyed, the information can be rebuilt from the parity information. The downside to using parity is that computing and writing parity information reduces the total performance of a disk system that uses striping. The parity information also reduces the total amount of free disk space.
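Parity is typically computed with the exclusive-or (XOR) operation, because XOR-ing the surviving stripes with the parity block regenerates a lost stripe. A minimal sketch, assuming equal-length stripes and one parity block per stripe set:

```python
def xor_blocks(blocks):
    """XOR the given equal-length blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

stripes = [b"AAAA", b"BBBB", b"CCCC"]        # three data stripes
parity = xor_blocks(stripes)                 # computed when the data is written

# The disk holding stripe 1 fails; rebuild it from the survivors plus parity.
rebuilt = xor_blocks([stripes[0], stripes[2], parity])
assert rebuilt == b"BBBB"
print("stripe recovered:", rebuilt)
```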

  12. #12

    Post Redundant Array of Inexpensive (or Independent) Disks (RAID)


    RAID is a technology that uses an array of less expensive hard disks instead of one enormous hard disk and provides several methods for writing to those disks to ensure redundancy. Those methods are described as levels, and each level is designed for a specific purpose:

RAID 0 (Commonly Used) This method is the fastest because all read/write heads are constantly being used without the burden of parity or duplicate data being written. A system using this method has multiple disks, and the information to be stored is striped across the disks in blocks without parity. This RAID level only improves performance; it does not provide fault tolerance.

    RAID 1 (Commonly Used) This level uses two hard disks, one mirrored to the other (commonly known as mirroring; duplexing is also an implementation of RAID 1). This is the most basic level of disk fault tolerance. If the first hard disk fails, the second automatically takes over. No parity or error-checking information is stored. Rather, each drive has duplicate information of the other. If both drives fail, a new drive must be installed and configured, and the data must be restored from a backup.

    RAID 2 At this level, individual bits are striped across multiple disks. One drive (designated as the parity drive) in this configuration is dedicated to storing parity data. If any data drive (a drive in this configuration that is not the parity drive) fails, the data on that drive can be rebuilt from parity data stored on the parity drive. At least three disk drives are required in this configuration. This is not a commonly used implementation.

    RAID 3 At this level, data is striped across multiple hard drives using a parity drive (similar to RAID 2). The main difference is that the data is striped in bytes, not bits as in RAID 2. This configuration is popular because more data is written and read in one operation, increasing overall disk performance.

    RAID 4 This level is similar to RAID 2 and 3 (striping with parity drive), except that data is striped in blocks, which facilitates fast reads from one drive. RAID 4 is the same as RAID 0, with the addition of a parity drive. This is not a popular implementation.

    RAID 5 (Commonly Used) At this level, the data and parity are striped across several drives. This allows for fast writes and reads. The parity information for data on one disk is stored with the data on another disk, so if any one disk fails, the drive can be replaced and its data can be rebuilt from the parity data stored on the other drives. This works well if one disk fails. If more than one disk fails, however, the data will need to be recovered from backup media. A minimum of three disks is required. Five or more disks are most often used.
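What sets RAID 5 apart from RAID 3 and 4 is that the parity blocks rotate across all the drives instead of living on one dedicated parity drive, so no single disk becomes a write bottleneck. The sketch below prints one common rotation pattern; actual on-disk layouts vary by implementation.

```python
def raid5_layout(num_disks, stripe_sets):
    """Show which disk holds the parity block (P) for each stripe set."""
    for stripe in range(stripe_sets):
        parity_disk = (num_disks - 1) - (stripe % num_disks)   # rotate parity
        row = ["P" if d == parity_disk else "D" for d in range(num_disks)]
        print(f"stripe set {stripe}: " + "  ".join(row))

raid5_layout(num_disks=4, stripe_sets=4)
# stripe set 0: D  D  D  P
# stripe set 1: D  D  P  D
# stripe set 2: D  P  D  D
# stripe set 3: P  D  D  D
```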

  13. #13

    Post Backup Considerations


    Although you can never be completely prepared for every natural disaster or human foible that can bring down your network, you can make sure that you have a solid backup plan in place to minimize the impact of lost data. Even if the worst happens, you don’t have to lose days or weeks of work, provided that you have a solid plan in place. A backup plan is the set of guidelines and schedules that determine which data should be backed up and how often. A backup plan includes information such as:

    • What to back up
    • Where to back it up
    • When to back up
    • How often to back up
    • Who should be responsible for backups
    • Where media should be stored
    • How often to test backups
    • The procedure to follow in case of data loss
    This section covers some of the items that are contained in a common backup plan, including:
    • Backup media options
    • Backup utilities
    • Backup types
    • Tape rotation schedule

  14. #14

    Post Backup Media Options


    When you back up your network’s data, you must have something on which to store that data, which is called the backup medium. You have several options, including:

    • Small-capacity removable disks
    • Large-capacity removable disks
    • Removable optical disks
    • Magnetic tape
    Let’s examine the advantages and disadvantages of each type, starting with small-capacity removable disks.

    Small-Capacity Removable Disks

    Small-capacity disks are magnetic media disks with a capacity of less than 500MB, which can be removed from their drives and replaced as they get filled. They are popular because of their low cost and ease of use. Additionally, because they are inexpensive, many computers come with one or more of these drives.

    Large-Capacity Removable Disks

Large-capacity removable disks are virtually the same as small-capacity removable disks except that they can store more data (more than 500MB per disk). The drives and media cost more, but the increase in capacity easily offsets the increased cost. Large-capacity removable disks are good for backing up a workstation that has only one or two disks. You can also use them to back up a server, but because a single removable disk doesn’t have the capacity to hold a full server backup (multiple disks would be required for each backup), their use is limited.

    Removable Optical Disks

Removable optical disks use a laser (or some kind of light beam) to read and write information stored on a removable disk. They typically have large capacities but fairly slow access times (more than 100 milliseconds, as opposed to less than 50 milliseconds for magnetic disks). The advantage of optical disks is that capacities start at about 128MB and go up from there (650MB is a common size). There are even special optical jukeboxes, containing hundreds of disks and a robotic arm to select disks and put them in the drive(s), that have capacities in the hundreds of terabytes (1 terabyte is 1024 gigabytes).

    Magnetic Tape

Magnetic tape is the oldest and most popular backup medium for offline (not readily accessible) data storage. It stores data in the form of magnetically oriented metal particles (typically iron oxide or chromium dioxide) on a polyester tape. It is popular because it is simple and inexpensive and has a high capacity. Most networks use a magnetic tape backup of some kind.


    Real World Scenario: Copying Workstation Data to the Network

Servers must be backed up because they contain all the data for the entire network. In most networks, workstations are not backed up because they usually don’t contain any data of major importance. (Individual workstations would need to be backed up only if users are improperly trained and don’t store all their data on the network.) Users can mistakenly save their data to their local workstations. Also, user application configuration data is normally stored on the workstation, so if a workstation’s hard disk goes down, that configuration is lost.

    For backups to be successful, users need to ensure that all necessary data is located on the network. You can do this in two ways: user training and folder replication. Training is time-consuming and costly, but productive in the long run. Users should understand the general network layout and know how to save their data in the proper place. This keeps all user data centralized and makes it easy for the administrator to back up the data.

    When you replicate folders, client platforms that support replication will share their hard disks (or portions of them) with the rest of the network. The network backup software then backs up those portions of the workstation that the administrator specifies.

  15. #15

    Post Backup Utilities

    A backup utility is a software program that can archive the data on a hard disk to a removable medium. Backup utilities can compress data before they store it, making it more efficient to use a backup program to archive data than to simply copy it to the backup medium.

Most operating systems include backup utilities, but these are usually simple programs that lack the advanced features of full-fledged, third-party packages (such as Seagate’s Backup Exec and Computer Associates’ ARCserve):

    • Windows 98 comes with Microsoft Backup
    • Windows NT has a backup program with a similar interface
    • Novell’s NetWare comes with SBACKUP
    • Unix comes with a command-based tape archive utility called tar.
All of these backup utilities are good for an initial backup of your system. For a complete set of features, including scheduling and managing tape sets, purchase a third-party product that fits your platforms and specific backup requirements.

    Backup Types

    After you choose your backup medium and backup utility, you must decide what type of backup to run. The types vary by how much data they back up each time and by how many tapes it takes to restore data after a complete system crash. The three backup types are:
    • Full
    • Differential
    • Incremental
    Full Backup

    In a full backup, all network data is backed up (without skipping any files). This type of backup is straightforward because you simply tell the software which servers (and, if applicable, workstations) to back up and where to back up the data, and then you start the backup. If you have to do a restore after a crash, you have only one set of tapes to restore from (as many tapes as it took to back up everything). Simply insert the most recent full backup into the drive and start the restore.

    If you have a tape system with a maximum capacity of half the size of all the data on your server, the backup utility will stop the backup halfway through and ask you to insert the next tape. Normally, full backups take several hours, and most companies can’t afford to have a user sit in front of the tape drive and change tapes. So you need a backup drive and medium with enough capacity or a backup system that can automatically change its own tapes (such as a DAT autoloader).


    Differential Backup

In a differential backup strategy, a single full backup is typically done once a week. Every night for the next six nights, the backup utility backs up all files that have changed since the last full backup (the actual differential backup). After a week’s worth of differential backups, another full backup is done, starting the cycle all over again. With differential backups, you use a maximum of two backup sessions to restore a file or group of files.

    Here’s how it works: The backup utility keeps track of which files have been backed up through the use of the archive bit, which is simply an attribute that indicates a file’s status with respect to the current backup type. The archive bit is cleared for each file backed up during the full backup. After that, any time a program opens and changes a file, the NOS sets the archive bit, indicating that the file has changed and needs to be backed up. Then each night, in a differential backup, the backup program copies every item that has its archive bit set, indicating the file has changed since the last full backup. The archive bit is not touched during each differential backup.

    When restoring a server after a complete server failure, you must restore two sets of tapes: the last full backup and the most current differential backup. A full restoration may take longer, but each differential backup takes much less time than a full backup. This type of backup is used when the amount of time each day available to perform a system backup (called the backup window) is smaller during the week and larger on the weekend.

    Incremental Backup

    In an incremental backup, a full backup is used in conjunction with daily partial backups to back up the entire server, thus reducing the amount of time it takes for a daily backup. With an incremental backup, the weekly full backup takes place as it does during a differential backup, and the archive bit is cleared during the full backup. The incremental, daily backups back up only the data that has changed since the last backup ( not the last full backup). The archive bit is cleared each time a backup occurs. With this method, only files that have changed since the previous day’s backup are backed up. Each day’s backup is a different size because a different number of files are modified each day.

This method provides the fastest daily backups for networks whose daily backup window is extremely small. However, the network administrator pays a price for the shortened backup sessions: restores made with this method after a server failure take the longest of the three methods. The full backup set is restored first, followed by every incremental tape made since that full backup, in order, up to the day of the failure.
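The archive-bit rules that distinguish the three backup types can be captured in one short simulation. This illustrates the logic described above and is not modeled on any particular backup utility:

```python
def run_backup(files, backup_type):
    """files maps each file name to its archive bit (True = changed)."""
    if backup_type == "full":
        backed_up = list(files)                 # copy everything...
        for name in files:
            files[name] = False                 # ...and clear every archive bit
    else:
        backed_up = [name for name, bit in files.items() if bit]
        if backup_type == "incremental":
            for name in backed_up:
                files[name] = False             # incremental clears the bit
        # a differential backup leaves the archive bits untouched
    return backed_up

files = {"a.doc": True, "b.xls": True, "c.txt": True}
print(run_backup(files, "full"))               # ['a.doc', 'b.xls', 'c.txt']
files["a.doc"] = True                          # Monday: a.doc changes
print(run_backup(files, "differential"))       # ['a.doc']
files["b.xls"] = True                          # Tuesday: b.xls changes
print(run_backup(files, "differential"))       # ['a.doc', 'b.xls'] -- grows daily
```

Switching the last two calls to "incremental" would instead yield ['a.doc'] and then ['b.xls'], which is why incremental restores need every tape made since the full backup.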
