Networking Guide 9 - Network Troubleshooting
There is no doubt about it. The only way to get good at troubleshooting computers and networks is the same way to get good at any other art: practice, practice, practice. And as with any art, you must learn some basic skills before you can start practicing.
This guide introduces you to some items to keep in mind when troubleshooting networks as well as the troubleshooting topics covered on the Network+ exam. In this chapter, we’ll examine some basic troubleshooting techniques. First, we’ll look at how to check quickly for simple problems. Then, we’ll discuss a common troubleshooting model that you can use to identify many network problems. Finally, we’ll look at some common troubleshooting resources, and tips and tricks that you can use to make troubleshooting easier. Let’s start with how you go about narrowing down the problem.
Narrowing Down the Problem
Troubleshooting a network problem can be daunting. That’s why it’s best to start by trying to narrow down the source of the problem. You do this by checking a few key areas, beginning with the simple stuff.
Checking for the Simple Stuff
The first thing to check, as most people will tell you, is the simple stuff. There’s a saying that goes “all things being equal, the simplest explanation is probably the correct one.” For computers, it’s rather hard to categorize simple stuff because what’s simple to one person might be complex to another. I like to define simple stuff (as it relates to troubleshooting) as those items that you don’t think to check, but when it turns out that one of those items is the problem, you say, “Oh, DUH!” Almost everyone can agree on a few items that fall into this category:
- Correct login procedure and rights
- Link lights/collision lights
- Power switch
- Operator error
Real World Scenario: Can the Problem Be Reproduced? The first question to ask anyone who reports a network or computer problem is “Can you show me what ‘not working’ looks like?” If you can reproduce the problem, you can identify the conditions under which it occurs. And if you can identify the conditions, you can start to determine the source.
Unfortunately, not every problem can be reproduced. The hardest problems to solve are those that can’t be reproduced, but instead appear randomly.
The Correct Login Procedure and Rights
To gain access to the network, users must follow the correct login procedure exactly. If they don’t, they will be denied access. Considering everything that must be done correctly and in the correct order, it’s a miracle that anyone logs in to a network correctly at all. There are so many opportunities for making a mistake.
First, a user must enter the username and password correctly. As easy as this sounds, users frequently enter this information incorrectly, don’t realize it, and report to the network administrator that the network is broken or that they can’t log in. The most common problem is simply mistyping the username or password. In some operating systems, this can happen when Caps Lock is accidentally left on. Unix is an example: because passwords there are case-sensitive, a user who types with Caps Lock on will not be able to log in unless his or her password is in all capital letters.
Additionally, in NetWare and Windows NT the network administrator can restrict the times and conditions under which users can log in. If a user doesn’t log in at the right time or from the right workstation, the network operating system will reject the login request, even though the username and password were entered correctly. A network administrator might also restrict how many times a user can log in to the network simultaneously. If that user tries to establish more connections than are allowed, access will be denied. Any time users are denied access to the network, they are likely to interpret that as a problem, even though the network operating system might be doing exactly what it should.
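To make these restrictions concrete, here is a minimal Python sketch of the kind of check a network operating system might apply after the username and password have already been accepted. It is not how NetWare or Windows NT implements the feature; the `LoginPolicy` class, its field names, and the workstation names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class LoginPolicy:
    # Hypothetical policy fields, invented for this illustration.
    allowed_start: time              # earliest permitted login time
    allowed_end: time                # latest permitted login time
    allowed_workstations: frozenset  # workstations the user may log in from
    max_concurrent: int              # simultaneous connections allowed


def check_login(policy, now, workstation, active_sessions):
    """Return (allowed, reason) for a login whose credentials are already valid."""
    if not (policy.allowed_start <= now <= policy.allowed_end):
        return False, "outside permitted login hours"
    if workstation not in policy.allowed_workstations:
        return False, "workstation not permitted"
    if active_sessions >= policy.max_concurrent:
        return False, "too many simultaneous connections"
    return True, "ok"


policy = LoginPolicy(time(8, 0), time(18, 0), frozenset({"WS-01", "WS-02"}), 2)
print(check_login(policy, time(19, 30), "WS-01", 0))
# (False, 'outside permitted login hours')
```

From the user's point of view, every one of these rejections looks like "the network is broken," which is why the reason for a denial is worth checking before anything else.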
To test for these types of problems, first check to see if the username and password are being typed correctly and whether or not the Caps Lock key is pressed. Try the login yourself from another workstation (assuming that doesn’t violate the security policy). If it works, you might try asking the user to check to see if the Caps Lock light on the keyboard is on (indicating that the Caps Lock key has been pressed). If that doesn’t solve the problem, check the network documentation to see if the aforementioned kinds of restrictions are in place.
Tip If intruder detection is enabled on the network, the user’s account will be locked after a specified number of incorrect login attempts. In this case, the user cannot log in until the administrator has unlocked the account, or until a certain amount of time specified by the administrator has elapsed, after which the account is unlocked.
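The lockout behavior described in this tip can be sketched in a few lines of Python. This is an illustrative model only, not any vendor's implementation; the class name, the default of three attempts, and the 900-second lockout period are assumptions chosen for the example. Times are plain seconds so the logic is easy to follow.

```python
class IntruderLockout:
    """Toy model of intruder detection: lock after N failures, unlock on a timer."""

    def __init__(self, max_attempts=3, lockout_seconds=900):
        self.max_attempts = max_attempts        # assumed threshold
        self.lockout_seconds = lockout_seconds  # assumed lockout duration
        self.failures = 0
        self.locked_at = None                   # time the lock began, or None

    def is_locked(self, now):
        if self.locked_at is None:
            return False
        if now - self.locked_at >= self.lockout_seconds:
            # The administrator-specified period has elapsed: unlock automatically.
            self.admin_unlock()
            return False
        return True

    def record_failure(self, now):
        if self.is_locked(now):
            return  # already locked; further attempts change nothing
        self.failures += 1
        if self.failures >= self.max_attempts:
            self.locked_at = now

    def admin_unlock(self):
        # Manual unlock by the administrator (also used by the timed unlock).
        self.locked_at = None
        self.failures = 0


account = IntruderLockout(max_attempts=3, lockout_seconds=900)
for t in (0, 10, 20):          # three bad passwords in quick succession
    account.record_failure(t)
print(account.is_locked(30))    # True: locked shortly after the third failure
print(account.is_locked(1000))  # False: the lockout period has elapsed
```

A user in the locked state will be rejected even with the correct password, so a report of "my password stopped working" after several failed attempts often points to a lockout rather than a forgotten password.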
The Link and Collision Lights
The link light is a small light-emitting diode (LED) found on both the NIC and the hub. It is typically green and is labeled link (or some abbreviation). A link light indicates that the NIC and hub (in the case of 10BaseT) are detecting each other’s link signal, that is, that the physical connection between them is intact. You can usually assume that the workstation and hub are communicating if the link lights are lit on both the workstation’s NIC and the hub port to which the workstation is connected.
Note The link lights on some NICs aren’t activated until the operating system driver is loaded for that NIC. So, if the link light isn’t on when the system is first turned on, you may have to wait until the operating system loads the NIC driver.
The collision light is also a small LED, typically amber in color. It can usually be found on both Ethernet NICs and hubs. When lit, it indicates that an Ethernet collision has occurred. It is important to know that this light will blink occasionally, because collisions are somewhat common on busy Ethernet networks. However, if this light stays on continuously, there are too many collisions happening for legitimate network traffic to get through. This can be caused by a malfunctioning network card or another malfunctioning network device.
Warning Be careful not to confuse the collision light with the network activity or network traffic light (usually green). The network activity light indicates that a device is transmitting. This particular light should be blinking on and off continually as the device transmits and receives data on the network.
The Power Switch
To function properly, all computer and network components must be turned on and powered up. As obvious as this is, network administrators often hear a user complain, “My computer is on, but my monitor is dark.” In this case, our response is to ask, “Is the monitor turned on?” After a pause, the voice on the other end usually says sheepishly, “Oh. Thanks.”
Most systems include a power indicator such as a Power or PWR light, and the power switch typically has a 1 or an On indicator. However, the unit could be powerless even if the power switch is in the On position. Thus, you need to check that all power cables are plugged in, including the power strip.
Tip Remember that every cable has two ends, and both must be plugged in to something.
When troubleshooting power problems, start with the most obvious device and work your way back to the power service panel. There could be any number of power problems between the device and the service panel, including a bad power cable, bad outlet, bad electrical wire, tripped circuit breaker, or blown fuse. Any of these items can cause power problems at the device.
Operator Error
The problem may be that the user simply doesn’t know how to perform the operation correctly; in other words, the problem may be due to OE (operator error). Those in the computer and networking industry have devised several colorful expressions to describe operator error:
- EEOC (Equipment Exceeds Operator Capability)
- PEBCAK (Problem Exists Between Chair And Keyboard)
- ID Ten T Error (written as ID10T)
Assuming that all problems are related to operator error, however, is a mistake. Before you attribute any problem to operator error, ask the user to reproduce the problem in your presence, and pay close attention. You may find out that the user is having a problem because he or she is using an incorrect procedure—for example, flipping the power switch without following proper shutdown procedures. You may also find out that the user was trained incorrectly, in which case you might want to see if others are having the same difficulty. If the problem and solution are not obvious, try the procedure yourself, or ask someone else at another workstation to do so.
Note This is only a partial list of simple stuff. You’ll come up with your own expanded list over time, as you troubleshoot more and more systems.
Is Hardware or Software Causing the Problem?
A hardware problem typically manifests itself as a device in your computer that fails to operate correctly. You can usually tell that a hardware failure has occurred because, when you try to use that piece of hardware, the computer issues an error message. Some failures, such as hard-disk failures, may give warning signs (for example, a Disk I/O error or something similar). Other components give no warning: the device operates fine and then simply fails.
The solution to hardware problems usually involves changing hardware settings, updating device drivers, or replacing hardware. As we have discussed in previous chapters, I/O address, IRQ (interrupt request), and DMA (direct memory access) conflicts can cause computers (including workstations and servers) to malfunction. Change the hardware settings to solve these types of problems.
If the hardware has actually failed, however, you must get out your tools and start replacing components. If this is not one of your skills, you can send the device out for repair. In either case, because the system can be down for anywhere from an hour to several days, it’s always prudent to have backup hardware on hand.
Software problems are a little more elusive. Some problems might result in General Protection Fault messages, which indicate a Windows or Windows program error of some type. Also, a program might suddenly stop responding (hang), or the entire machine might lock up randomly. The solution to these problems generally involves a trip to the manufacturer’s support website to get software updates and patches or to search for the answer in a knowledge base.
Sometimes software will give you a precise message regarding the source of the problem, such as a message that a file is missing or has become corrupt. In this case, you can either restore the file or, if necessary, reinstall the software. Neither solution takes long, and your computer will be up and running in a short time.
Tip Sometimes fragmented memory, which occurs after you open and close too many programs, is the source of the problem. The solution may be to reboot the computer, thus clearing memory. Be sure to add this to your network-troubleshooting bag of tricks.
Is It a Workstation or a Server Problem?
Troubleshooting this problem involves first determining whether one person or a group of people are affected. If only one person is affected, think workstation. If several people are affected, the server or, more generally speaking, a portion of the network is probably experiencing problems.
If a single user is affected, your first line of defense is to try to log in from another workstation within the same group of users. If you can do so, the problem is related to the user’s workstation. Look for a cabling fault, a bad NIC, or some other problem.
On the other hand, if several people in a group (such as a whole department) can’t access a server, the problem may be related to that server. Go to the server in question, and check for user connections. If everyone is logged in, the problem could be related to something else, such as individual rights or permissions. If no one can log in to that server, including the administrator, the server may have a communication problem with the rest of the network. If it has crashed, you might see messages to that effect on the server’s monitor, or the screen might be blank, indicating that the server is no longer running. These symptoms vary among network operating systems.
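The workstation-versus-server reasoning above can be summarized as a small decision function. The Python sketch below is only a restatement of that reasoning; the function name, parameter names, and the wording of the returned suggestions are invented for the example, and a real diagnosis will always need more judgment than this.

```python
def triage(affected_users, login_works_elsewhere=False,
           users_connected=False, admin_can_login=True):
    """Suggest where to look first, following the reasoning in the text.

    affected_users        -- how many people report the problem
    login_works_elsewhere -- the affected user can log in from another workstation
    users_connected       -- the server shows the affected users as logged in
    admin_can_login       -- the administrator can still log in to the server
    """
    if affected_users <= 1:
        if login_works_elsewhere:
            return "workstation: check cabling, the NIC, and local configuration"
        return "not isolated to the workstation: check the account, then the server"
    if users_connected:
        return "server is reachable: check individual rights or permissions"
    if not admin_can_login:
        return "server: check its network connection and whether it has crashed"
    return "server or network segment: investigate further"


print(triage(1, login_works_elsewhere=True))
# workstation: check cabling, the NIC, and local configuration
print(triage(25, users_connected=False, admin_can_login=False))
# server: check its network connection and whether it has crashed
```

The point of writing it down this way is that the first question, how many people are affected, rules out roughly half of the possible causes before you touch any hardware.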
Which Segments of the Network Are Affected?
Making this determination can be tough. If multiple segments are affected, the problem could be a network address conflict. As you may remember from “Networking Guide 4 - TCP/IP Utilities,” network addresses must be unique across an entire network. If two segments have the same IPX network address, for example, all the routers and NetWare servers will complain bitterly and send out error messages, hoping that it’s just a simple problem that a router can correct. This is rarely the case, however, and, thus, the administrator must find and resolve the issue. Also keep in mind that the continuous broadcasting of error messages will negatively impact network performance.
If all users of the network are experiencing the problem, it could be related to a different device, such as a server that everyone accesses. Or, a main router or hub could be down, making network transmissions impossible.
Additionally, if the network has WAN connections, you can determine if a network problem is related to the WAN connection by checking to see if stations on both sides can communicate. If they can, the problem isn’t related to the WAN. If they can’t communicate, you must check everything between the sending station and the receiving one, including the WAN hardware. Usually, the WAN devices have built-in diagnostics that can indicate whether the WAN link is functioning correctly to help you determine if the fault is related to the WAN link or to the hardware involved.
The Troubleshooter’s Resources
In the process of troubleshooting a workstation, a server, or other network component, you have many resources at your disposal. In this section, we’ll take a brief look at some of them. Those you use depend on the situation and your personal preferences. You will eventually have your own favorites.
Log Files
Log files can indicate the general health of a server. Each log file format is different, but, generally speaking, the log files contain a running list of all errors and notices, the time and date they occurred, and any other pertinent information. Let’s look at a couple of the log files from the most commonly used network operating systems, NetWare 5 and Windows NT 4.
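Because most log files share the same basic shape (a timestamp, a severity, and a message), a short script can pull out just the errors. The Python sketch below assumes a made-up line format of `YYYY-MM-DD HH:MM:SS LEVEL message`; it is not the actual format of any NetWare or Windows NT log, and the sample entries are invented for illustration.

```python
from datetime import datetime


def parse_entry(line):
    """Split one line of the assumed format into (timestamp, level, message)."""
    date_part, time_part, level, message = line.split(" ", 3)
    stamp = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
    return stamp, level, message


def errors_only(lines):
    """Return the parsed entries whose level is ERROR, skipping blank lines."""
    entries = (parse_entry(line) for line in lines if line.strip())
    return [entry for entry in entries if entry[1] == "ERROR"]


# Invented sample entries in the assumed format.
sample = [
    "2001-03-14 09:15:02 NOTICE service started",
    "2001-03-14 09:47:40 ERROR disk nearly full",
    "",
    "2001-03-14 10:02:11 NOTICE user logged in",
]

for stamp, level, message in errors_only(sample):
    print(stamp.isoformat(), message)
# 2001-03-14T09:47:40 disk nearly full
```

For a real log you would adapt `parse_entry` to the format the operating system actually writes, but the approach of filtering by severity and reading the timestamps in order carries over.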
NetWare Log Files
NetWare uses three log files that can help you diagnose problems on a NetWare server:
- The Console Log file (CONSOLE.LOG)
- The Abend Log file (ABEND.LOG)
- The Server Log file (SYS$LOG.ERR)
Each file has different uses in the troubleshooting process.