Networking Guide 9 - Network Troubleshooting
There is no doubt about it. The only way to get good at troubleshooting computers and networks is the same way to get good at any other art: practice, practice, practice. And as with any art, you must learn some basic skills before you can start practicing.
This guide introduces you to some items to keep in mind when troubleshooting networks as well as the troubleshooting topics covered on the Network+ exam. In this chapter, we’ll examine some basic troubleshooting techniques. First, we’ll look at how to check quickly for simple problems. Then, we’ll discuss a common troubleshooting model that you can use to identify many network problems. Finally, we’ll look at some common troubleshooting resources, and tips and tricks that you can use to make troubleshooting easier. Let’s start with how you go about narrowing down the problem.
Narrowing Down the Problem
Troubleshooting a network problem can be daunting. That’s why it’s best to start by trying to narrow down the source of the problem. You do this by checking a few key areas, beginning with the simple stuff.
Checking for the Simple Stuff
The first thing to check, as most people will tell you, is the simple stuff. There’s a saying that goes “all things being equal, the simplest explanation is probably the correct one.” For computers, it’s rather hard to categorize simple stuff because what’s simple to one person might be complex to another. I like to define simple stuff (as it relates to troubleshooting) as those items that you don’t think to check, but when it turns out that one of those items is the problem, you say, “Oh, DUH!” Almost everyone can agree on a few items that fall into this category:
Real World Scenario: Can the Problem Be Reproduced? The first question to ask anyone who reports a network or computer problem is “Can you show me what ‘not working’ looks like?” If you can reproduce the problem, you can identify the conditions under which it occurs. And if you can identify the conditions, you can start to determine the source.
Unfortunately, not every problem can be reproduced. The hardest problems to solve are those that can’t be reproduced, but instead appear randomly.
The Correct Login Procedure and Rights
To gain access to the network, users must follow the correct login procedure exactly. If they don’t, they will be denied access. Considering everything that must be done correctly and in the correct order, it’s a miracle that anyone logs in to a network correctly at all. There are so many opportunities for making a mistake.
First, a user must enter the username and password correctly. As easy as this sounds, users frequently enter this information incorrectly, don’t realize it, and report to the network administrator that the network is broken or that they can’t log in. The most common problem is simply mistyping the username or password. In some operating systems, this can happen when you accidentally leave the Caps Lock key on. An example of this is Unix, in which passwords are case-sensitive; with Caps Lock on, the user will not be able to log in unless his or her password is in all capital letters.
Additionally, in NetWare and Windows NT the network administrator can restrict the times and conditions under which users can log in. If a user doesn’t log in at the right time or from the right workstation, the network operating system will reject the login request, even though it might be a valid request in terms of the username and password being spelled correctly. Additionally, a network administrator might restrict how many times a user can log in to the network simultaneously. If that user tries to establish more connections than are allowed, access will be denied. Any time a user is denied access to the network, they are likely to interpret that as a problem, even though the network operating system might be doing what it should.
To test for these types of problems, first check to see if the username and password are being typed correctly and whether or not the Caps Lock key is pressed. Try the login yourself from another workstation (assuming that doesn’t violate the security policy). If it works, you might try asking the user to check to see if the Caps Lock light on the keyboard is on (indicating that the Caps Lock key has been pressed). If that doesn’t solve the problem, check the network documentation to see if the aforementioned kinds of restrictions are in place.
Tip If intruder detection is enabled on the network, the user’s account will be locked after a specified number of incorrect login attempts. In this case, the user cannot log in until the administrator has unlocked the account, or until a certain amount of time specified by the administrator has elapsed, after which the account is unlocked.
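The checks described in this section can be sketched as a small decision function. This is only an illustration in Python, not how NetWare, Windows NT, or any other network operating system actually evaluates a login. The LoginPolicy fields and the explain_denial function are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class LoginPolicy:
    """Hypothetical per-user restrictions an administrator might configure."""
    earliest_hour: int = 0            # first hour of the day when login is allowed
    latest_hour: int = 23             # last hour of the day when login is allowed
    allowed_workstations: Optional[FrozenSet[str]] = None  # None means any workstation
    max_connections: int = 1          # simultaneous logins permitted
    lockout_threshold: int = 3        # failed attempts before the account is locked


def explain_denial(policy, *, credentials_ok, hour, workstation,
                   open_connections, failed_attempts):
    """Return the first reason a login would be refused, or None if it is allowed."""
    if failed_attempts >= policy.lockout_threshold:
        return "account locked by intruder detection"
    if not credentials_ok:
        return "wrong username or password (check Caps Lock)"
    if not policy.earliest_hour <= hour <= policy.latest_hour:
        return "login attempted outside the permitted hours"
    if (policy.allowed_workstations is not None
            and workstation not in policy.allowed_workstations):
        return "login attempted from a workstation that is not permitted"
    if open_connections >= policy.max_connections:
        return "too many simultaneous connections"
    return None
```

The order of the checks mirrors the order in which you would rule things out by hand: lockout first, then typing mistakes, then the administrator’s restrictions.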
The Link and Collision Lights
The link light is a small light-emitting diode (LED) found on both the NIC and the hub. It is typically green and is labeled link (or some abbreviation). A link light indicates that the NIC and hub (in the case of 10BaseT) are making a logical (Data Link layer) connection. You can usually assume that the workstation and hub are communicating if the link lights are lit on both the workstation’s NIC and the hub port to which the workstation is connected.
Note The link lights on some NICs aren’t activated until the operating system driver is loaded for that NIC. So, if the link light isn’t on when the system is first turned on, you may have to wait until the operating system loads the NIC driver.
The collision light is also a small LED, typically amber in color. It can usually be found on both Ethernet NICs and hubs. When lit, it indicates that an Ethernet collision has occurred. It is important to know that this light will blink occasionally, because collisions are somewhat common on busy Ethernet networks. However, if this light stays on continuously, there are too many collisions happening for legitimate network traffic to get through. This can be caused by a malfunctioning network card or another malfunctioning network device.
Warning Be careful not to confuse the collision light with the network activity or network traffic light (usually green). The network activity light indicates that a device is transmitting. This particular light should be blinking on and off continually as the device transmits and receives data on the network.
The Power Switch
To function properly, all computer and network components must be turned on and powered up. As obvious as this is, network administrators often hear a user complain, “My computer is on, but my monitor is dark.” In this case, our response is to ask, “Is the monitor turned on?” After a pause, the voice on the other end usually says sheepishly, “Oh. Thanks.”
Most systems include a power indicator such as a Power or PWR light, and the power switch typically has a 1 or an On indicator. However, the unit could be powerless even if the power switch is in the On position. Thus, you need to check that all power cables are plugged in, including the power strip.
Tip Remember that every cable has two ends, and both must be plugged in to something.
When troubleshooting power problems, start with the most obvious device and work your way back to the power service panel. There could be any number of power problems between the device and the service panel, including a bad power cable, bad outlet, bad electrical wire, tripped circuit breaker, or blown fuse. Any of these items can cause power problems at the device.
The problem may be that the user simply doesn’t know how to perform the operation correctly; in other words, the problem may be due to OE (operator error). Those in the computer and networking industry have devised several colorful expressions to describe operator error:
Note This is only a partial list of simple stuff. You’ll come up with your own expanded list over time, as you troubleshoot more and more systems.
Is Hardware or Software Causing the Problem?
A hardware problem typically manifests itself as a device in your computer that fails to operate correctly. You can usually tell that a hardware failure has occurred because you will try to use that piece of hardware, and the computer will issue an error indicating that this has happened. Some failures, such as hard-disk failures, may give warning signs—for example, a Disk I/O error or something similar. Other components may just suddenly fail. The device will be operating fine and then simply fail.
The solution to hardware problems usually involves changing hardware settings, updating device drivers, or replacing hardware. As we have discussed in previous chapters, I/O address, IRQ (interrupt request), and DMA (direct memory access) conflicts can cause computers (including workstations and servers) to malfunction. Change the hardware settings to solve these types of problems.
If the hardware has actually failed, however, you must get out your tools and start replacing components. If this is not one of your skills, you can send the device out for repair. In either case, because the system can be down for anywhere from an hour to several days, it’s always prudent to have backup hardware on hand.
Software problems are a little more elusive. Some problems might result in General Protection Fault messages, which indicate a Windows or Windows program error of some type. Also, a program might suddenly stop responding (hang), or the entire machine might lock up randomly. The solution to these problems generally involves a trip to the manufacturer’s support website to get software updates and patches or to search for the answer in a knowledge base.
Sometimes software will give you a precise message regarding the source of the problem, such as the software is missing a file or a file has become corrupt. In this case, you can either provide the file or, if necessary, reinstall the software. Neither solution takes long, and your computer will be up and running in a short time.
Tip Sometimes fragmented memory, which occurs after you open and close too many programs, is the source of the problem. The solution may be to reboot the computer, thus clearing memory. Be sure to add this to your network-troubleshooting bag of tricks.
Is It a Workstation or a Server Problem?
Troubleshooting this problem involves first determining whether one person or a group of people are affected. If only one person is affected, think workstation. If several people are affected, the server or, more generally speaking, a portion of the network is probably experiencing problems.
If a single user is affected, your first line of defense is to try to log in from another workstation within the same group of users. If you can do so, the problem is related to the user’s workstation. Look for a cabling fault, a bad NIC, or some other problem.
On the other hand, if several people in a group (such as a whole department) can’t access a server, the problem may be related to that server. Go to the server in question, and check for user connections. If everyone is logged in, the problem could be related to something else, such as individual rights or permissions. If no one can log in to that server, including the administrator, the server may have a communication problem with the rest of the network. If it has crashed, you might see messages to that effect on the server’s monitor, or the screen might be blank, indicating that the server is no longer running. These symptoms vary among network operating systems.
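The reasoning in this section can be written down as a short triage function. This is a rough sketch in Python. The function name, its parameters, and the wording of the results are all hypothetical, and the branch for a single user who cannot log in from another workstation is an inference, since the text does not spell that case out.

```python
def triage(affected_users, works_from_other_workstation=False,
           server_accepts_logins=True):
    """Suggest where to look first, based on how many people are affected."""
    if affected_users < 1:
        raise ValueError("at least one user must be affected")

    if affected_users == 1:
        if works_from_other_workstation:
            # The same login works elsewhere, so suspect the original machine.
            return "workstation: check cabling, the NIC, and local configuration"
        # Assumption: if the login fails elsewhere too, the workstation alone
        # is not the cause.
        return "account or server: the workstation alone is not the cause"

    if server_accepts_logins:
        # Users are connected but still cannot work.
        return "rights or permissions: check what the affected users are allowed to do"
    return "server or network segment: check the server console and its network connection"
```

For example, triage(1, works_from_other_workstation=True) points at the workstation, while triage(12, server_accepts_logins=False) points at the server or the segment it sits on.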
Which Segments of the Network Are Affected?
Making this determination can be tough. If multiple segments are affected, the problem could be a network address conflict. As you may remember from “Networking Guide 4 - TCP/IP Utilities,” network addresses must be unique across an entire network. If two segments have the same IPX network address, for example, all the routers and NetWare servers will complain bitterly and send out error messages, hoping that it’s just a simple problem that a router can correct. This is rarely the case, however, and, thus, the administrator must find and resolve the issue. Also keep in mind that the continuous broadcasting of error messages will negatively impact network performance.
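If you keep a list of the network address assigned to each segment in your documentation, a few lines of code can flag duplicates before the routers start complaining. This is a minimal sketch in Python. The function name, the segment names, and the addresses are made up for illustration, and it assumes addresses are recorded as hexadecimal strings.

```python
from collections import defaultdict


def find_address_conflicts(segments):
    """Map each network address used by more than one segment to those segments.

    `segments` maps a segment name to its network address (a hex string).
    Addresses are compared case-insensitively.
    """
    by_address = defaultdict(list)
    for name, address in segments.items():
        by_address[address.upper()].append(name)
    return {address: sorted(names)
            for address, names in by_address.items()
            if len(names) > 1}


documented = {
    "Floor1": "0000A1B2",
    "Floor2": "0000C3D4",
    "Lab": "0000a1b2",   # same address as Floor1, typed in lowercase
}
conflicts = find_address_conflicts(documented)
```

Here conflicts contains a single entry showing that Floor1 and Lab share the address 0000A1B2.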
If all users of the network are experiencing the problem, it could be related to a different device, such as a server that everyone accesses. Or, a main router or hub could be down, making network transmissions impossible.
Additionally, if the network has WAN connections, you can determine if a network problem is related to the WAN connection by checking to see if stations on both sides can communicate. If they can, the problem isn’t related to the WAN. If they can’t communicate, you must check everything between the sending station and the receiving one, including the WAN hardware. Usually, the WAN devices have built-in diagnostics that can indicate whether the WAN link is functioning correctly to help you determine if the fault is related to the WAN link or to the hardware involved.
After you determine whether the problem is related to the whole network, to a single segment, or to a single workstation, you must determine whether the problem is related to network cabling. First, check to see if the cables are properly connected to the correct port. More than once, I’ve seen a wall phone cable plugged into a modem in the In jack.
Additionally, patch cables from workstation to wall jack can and do go bad, especially if they get moved or tripped over often. This problem is often characterized by connection problems. If you test the NIC and there is no link light (discussed earlier in this chapter), the problem could be related to a bad patch cable.
It is also possible to have a cabling problem in the walls where the cabling wasn’t installed correctly. If a network cable was run over a fluorescent light, for example, the workstation attached to that cable might have problems only when the lights are on. The problem is that the fluorescent lights produce a large amount of EMI and can disrupt communications in that cable. This kind of problem may manifest itself only at times when most lights need to be on.
Next, check the MDI/MDX port setting on small workgroup hubs, a potential source of trouble that is often overlooked. This port is used to uplink, for example, to a hub on the network’s backbone. The port has to be set to either MDI or MDX (sometimes labeled MDI-X), depending on the type of cable used for the hub-to-hub connection. A standard network patch cable running to a standard port on the other hub requires that the port be set to MDI (the uplink setting); a crossover cable (discussed later in this chapter) requires that the port be set to MDX, which makes it behave like a standard port. You can usually adjust the setting via a regular switch or a DIP (Dual Inline Package) switch. Check the hub’s documentation.
Note Some hubs just have a port labeled MDI or Uplink, since the MDX setting is really just another standard port for all intents and purposes. If you connect hubs using a standard patch cable, you must connect that uplink port to a standard port on the backbone hub.
In the Network+ troubleshooting model, there are eight steps:
Step 1: Establish Symptoms
Obviously, if you can’t identify a problem, you can’t begin to solve it. Typically, you need to ask some questions to begin to clarify exactly what is happening. In our example, we should ask the user the following:
Step 2: Identify the Affected Area
Computers and networks are fickle; they can work fine for months, suddenly malfunction horribly, and then continue to work fine for several more months, never again exhibiting that particular problem. And that’s why it’s important to be able to reproduce the problem and identify the affected area. Identifying the affected area narrows down what you have to troubleshoot.
One of your goals is to make problems easier to troubleshoot and, thus, get users working again as soon as possible. Therefore, the best advice you can give when training users is that when something isn’t working, try it again and then write down exactly what is and is not happening. Most users’ knee-jerk reaction is to call you immediately when they experience a problem. This isn’t necessarily the best thing to do, because your response is most likely, “What were you doing when the problem occurred?” And most users don’t know precisely what they were doing at the computer because they were primarily trying to get their job done. Therefore, if you train users to reproduce the problem first, they’ll be able to give you the information you need to start troubleshooting it.
In our example, we find out that when the user tries to access the corporate intranet, he gets the following error message:
We’re in luck—we can re-create this problem.
Tip It is a definite advantage to be able to watch the user try to reproduce the problem, because you can determine whether the user is performing the operation correctly.
Step 3: Establish What Has Changed
If you can reproduce the problem, your next step is to attempt to determine the cause by determining what has changed. Drawing on your knowledge of networking, you might ask yourself and your user questions such as the following:
Were you ever able to do this? If not, then maybe this is not an operation the hardware or software is designed to do. You can inform the user that the system won’t do the operation (or that she may need additional hardware or software to do it).
If so, when did you become unable to do it? If the computer was able to do the operation and then suddenly could not, the conditions that surround this change become extremely important. You may be able to discover the cause of the problem if you know what happened immediately before the change. It is likely that the cause of the problem is related to the conditions surrounding the change.
Has anything changed since you were last able to do this? This question can give you insight into a possible source for the problem. Most often, the thing that changed before the problem started is the source of the problem. When you ask this question of a user, the answer is typically that nothing has changed, so you might need to rephrase it. For example, you can try asking, “Did anyone add anything to your computer?” or “Are you doing anything differently from the way you normally proceed?”
Were any error messages displayed? This is one of the best indicators of the cause of a problem. Error messages are designed by programmers to help them determine what aspect of a computer system is not functioning correctly. These error messages are sometimes clear, such as “Disk Full” (indicating that the disk cannot store any more files on it because it is full). Or they can be cryptic, such as “A random bit has been flipped in the I/O subsystem of memory junction 44FA380h” (this is a fictitious error, but you may encounter those just as complex). If you get a cryptic error message, you can go to the software or hardware vendor’s support website and usually get a translation of the “programmerese” of the error message into English.
Are other people experiencing this problem? This is one question you must ask yourself. That way you might be able to narrow down the problem to a specific item that may be causing the problem. Try to duplicate the problem yourself from your own workstation. If you can’t duplicate the problem on another workstation, it may be related to only one user or group of users (or possibly their workstations). If more than one user is experiencing this problem, you may know this already because several people will be calling in with the same problem.
Is the problem always the same? Generally speaking, when problems crop up, they are almost always the same problem each time they occur. But their symptoms may change ever so slightly as conditions surrounding them change. A related question is, “If you do x, does the problem get better or worse?” For example, you might ask a user, “If you use a different file, does the problem get better or worse?” If the symptoms become less severe, it might indicate that the problem is related to the original file being used.
These are just a few of the questions you can use to isolate the cause of the problem.
In our example, we find out that the problem is unique to one user, indicating that the problem is specific to his workstation. When we watch him as he attempts to reproduce the problem, we notice that he is typing the address correctly. The error message leads us to believe that the problem has something to do with DNS (Domain Name System) lookups on his workstation.
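One quick way to confirm a suspicion like this is to ask the workstation to resolve the name and see whether anything comes back. The following is a minimal sketch in Python using the standard socket module. The resolve function and the host name in the usage note are hypothetical, and the resolver parameter exists only so the lookup can be swapped out when trying the function without a network.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, unique addresses for hostname, or [] if lookup fails."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # Name resolution failed: the name is unknown or no resolver answered.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    return sorted({info[4][0] for info in results})
```

If resolve("intranet.example.com") returns an empty list on the affected workstation but returns addresses on a neighboring one, the workstation’s DNS configuration is a reasonable suspect.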
Step 4: Select the Most Probable Cause
After you observe the problem and isolate the cause, your next step is to select the most probable cause for the problem. Trust me, this gets easier with time and experience.
You must come up with at least one possible cause, even though it may not be correct. And you don’t always have to come up with it yourself. Someone else in the group may have the answer. Also, don’t forget to check online sources and vendor documentation.
In our example, we determined earlier that the cause was improperly configured DNS lookup on the workstation. The correction, then, is to reconfigure DNS on the workstation.
Step 5: Implement a Solution
In this step, you implement the solution. In our example, we need to reconfigure DNS on the workstation by following these steps:
Step 6: Test the Result
Now that you have made the changes, you must test your solution to see if it solves the problem. In our example, we’d ask the user to try to access the intranet (since that was the problem reported). In general terms, ask the user to repeat the operation that previously did not work. If it works, great! The problem is solved. If it doesn’t, try the operation yourself.
If the problem isn’t solved, you may have to go back to step 4, select a new possible cause, and redo steps 5 and 6. But it is important to make note of what worked and what didn’t so that you don’t make the same mistakes twice.
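The loop through steps 4, 5, and 6 can be sketched in a few lines. This is an illustration in Python rather than a tool. The function name is hypothetical, and apply_fix and problem_solved stand in for whatever you actually do at the keyboard.

```python
def work_through_causes(candidate_causes, apply_fix, problem_solved):
    """Try each probable cause in turn and keep notes on what happened.

    candidate_causes: causes ordered from most to least probable (step 4)
    apply_fix:        callable that implements the fix for a cause (step 5)
    problem_solved:   callable that retests the failing operation (step 6)

    Returns (cause_that_fixed_it_or_None, notes), where notes is a list of
    (cause, worked) pairs recording every attempt.
    """
    notes = []
    for cause in candidate_causes:
        apply_fix(cause)
        worked = problem_solved()
        notes.append((cause, worked))   # record failures as well as successes
        if worked:
            return cause, notes
    return None, notes
```

The notes list is the point of the exercise: it records what did not work as well as what did, so the same dead end is not explored twice.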
Step 7: Recognize the Potential Effects of the Solution
A common failing among network technicians is solving only the problem at hand without realizing what other problems the solution may cause. It is possible that the solution may be worse than the problem. As the saying goes, “Sometimes the cure is worse than the disease.”
Before fully implementing the solution to a problem, make sure you are completely aware of the potential effects of the solution and the other problems it may cause. If it causes more problems than it fixes, the solution probably isn’t the best solution for the problem.
Step 8: Document the Solution
You’ll definitely want to document problems and solutions so that you have the information at hand when a similar problem arises in the future. With documented solutions to documented problems, you can assemble your own database of information that you can use to troubleshoot other problems. Be sure to include information such as the following:
The Troubleshooter’s Resources
In the process of troubleshooting a workstation, a server, or other network component, you have many resources at your disposal. In this section, we’ll take a brief look at some of them. Those you use depend on the situation and your personal preferences. You will eventually have your own favorites.
Log files can indicate the general health of a server. Each log file format is different, but, generally speaking, the log files contain a running list of all errors and notices, the time and date they occurred, and any other pertinent information. Let’s look at a couple of the log files from the most commonly used network operating systems, NetWare 5 and Windows NT 4.
NetWare Log Files
NetWare uses three log files that can help you diagnose problems on a NetWare server:
The CONSOLE.LOG File
The Console Log file ( CONSOLE.LOG) keeps a history of all errors and information that have been displayed on the server’s console. It is located in the SYS:\ETC directory on the server and is created and maintained by the utility CONLOG.NLM that comes with NetWare versions 3.12 and later. You must load this utility manually (or place the load command in the AUTOEXEC.NCF file so that it starts automatically upon server startup) by typing the following at the console prompt:
LOAD CONLOG
Once this utility is loaded, it erases whatever CONSOLE.LOG file currently exists and starts logging to the new file.
Note This command works with NetWare 3.12 and later. However, if you are using NetWare 5 or later, the LOAD command is optional. It is required in versions 3.12 to 4.1x.
From this log file, we can tell that someone edited the AUTOEXEC.NCF file and then restarted the server. This indicates a major change on the server. If we were trying to troubleshoot a server that was starting to exhibit strange problems after a recent reboot, this might be a source to check.
Warning The information in the CONSOLE.LOG file is lost every time CONLOG.NLM is unloaded and reloaded. It doesn’t keep a history of every command ever issued, only those issued since CONLOG.NLM was loaded. However, you can use the ARCHIVE=YES parameter to make CONLOG keep a history of all the log files. The first file is saved with a .000 extension, the next with a .001 extension, and so forth. The complete command to run at the console (or add to AUTOEXEC.NCF) is Conlog archive=yes.
The ABEND.LOG File
This log file registers all Abends on a NetWare server. An Abend (ABnormal END) is an error condition that can halt the proper operation of the NetWare server. Abends can be serious enough to lock the server, or they can simply force an NLM to shut down. You know an Abend has occurred when you see an error message that contains the word Abend on the console. Additionally, the server command prompt will include a number in angle brackets (for example, <1>) that indicates the number of times the server has Abended since it was brought online.
Because the server may reboot after an Abend, these error messages and what they mean can be lost. NetWare versions 4.11 and later include a routine to capture the output of the Abend both to the console and to the ABEND.LOG file. ABEND.LOG is located in the SYS:SYSTEM directory on the server.
The ABEND.LOG file contains all the information that is output to the console screen during an Abend, plus much more:
Server S1 halted Friday, February 12, 1999 2:37:03 pm
Abend 1 on P00: Server-5.00a: Page Fault Processor Exception (Error code 00000002)
Registers:
CS = 0008 DS = 0010 ES = 0010 FS = 0010 GS = 0010 SS = 0010
EAX = 00000000 EBX = D0AC2238 ECX = 0697DEF0 EDX = 00000009
ESI = D0C5C040 EDI = 00000000 EBP = 0697DED0 ESP = 0697DEC0
EIP = D0AC2232 FLAGS = 00014246
D0AC2232 C600CC MOV [EAX]=?,CC
EIP in ABENDEMO.NLM at code start +00000232h
Running process: Abendemo Process
Created by: NetWare Application
Thread Owned by NLM: ABENDEMO.NLM
Stack pointer: 697DCE0
OS Stack limit: 697A000
Scheduling priority: 67371008
Wait state: 5050170 (Blocked on keyboard)
Stack: D0AC22C1 (ABENDEMO.NLM|MenuAction+89)
D1FEA602 (NWSNUT.NLM|NWSShowPortalLine+3602)
--00000008 ?
--00000000 ?
--0697DF20 ?
--D0134080 ?
--00000001 ?
D1FEA949 (NWSNUT.NLM|NWSShowPortalLine+3949)
--00000010 ?
--0697DEF0 ?
--0697DEF4 ?
--0697DFAC ?
--D0C2E100 (CONNMGR.NLM|WaitForBroadcastsToClear+C90C)
--00000003 ?
--00000008 ?
--00000012 ?
--00000000 ?
--00000019 ?
--00000050 ?
--000000FF ?
--00000001 ?
--00000010 ?
--00000001 ?
--00000000 ?
--00000011 ?
--0697DFDC ?
--0000000B ?
--00000000 ?
D1FEABD9 (NWSNUT.NLM|NWSShowPortalLine+3BD9)
--0000000B ?
--00000000 ?
--00000000 ?
Additional Information:
The CPU encountered a problem executing code in ABENDEMO.NLM. The problem may be in that module or in data passed to that module by a process owned by ABENDEMO.NLM.
Loaded Modules:
SERVER.NLM NetWare Server Operating System
Version 5.00 August 27, 1998
Code Address: FC000000h Length: 000A5000h
Data Address: FC5A5000h Length: 000C9000h
LOADER.EXE NetWare OS Loader
Code Address: 000133D0h Length: 0001D000h
Data Address: 000303D0h Length: 00020C30h
CDBE.NLM NetWare Configuration DB Engine
Version 5.00 August 12, 1998
Code Address: D087E000h Length: 00007211h
Data Address: D0887000h Length: 0000684Ch
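Most of an Abend entry is only useful to the software vendor, but a few fields tell you at a glance which server halted, when, and in which module. The following is a rough sketch in Python that pulls those fields out of an entry shaped like the one above. The function name and the regular expressions are assumptions based solely on this one sample, not a description of a documented ABEND.LOG format.

```python
import re

# Fields worth reading first, with patterns inferred from the sample entry.
ABEND_PATTERNS = {
    "server": r"Server (\S+) halted",
    "halted_at": r"halted (.+? \d{1,2}:\d{2}:\d{2} [ap]m)",
    "abend_number": r"Abend (\d+) on",
    "exception": r"Abend \d+ on \S+ \S+ (.+?) \(Error code",
    "module": r"EIP in (\S+) at code start",
}


def summarize_abend(text):
    """Return the key fields of an Abend entry; missing fields are None."""
    summary = {}
    for field, pattern in ABEND_PATTERNS.items():
        match = re.search(pattern, text)
        summary[field] = match.group(1) if match else None
    return summary


sample = (
    "Server S1 halted Friday, February 12, 1999 2:37:03 pm\n"
    "Abend 1 on P00: Server-5.00a: Page Fault Processor Exception "
    "(Error code 00000002)\n"
    "EIP in ABENDEMO.NLM at code start +00000232h\n"
)
summary = summarize_abend(sample)
```

For the sample, summary reports server S1, Abend number 1, a Page Fault Processor Exception, and ABENDEMO.NLM as the module that was executing.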
The SYS$LOG.ERR File
The general Server Log file, found in the SYS:SYSTEM directory, lists any errors that occur on the server, including Abends and NDS errors and the time and date of their occurrence. An error in the SYS$LOG.ERR file might look something like this:
1-07-1999 11:51:10 am: DS-7.9-17
Severity = 1 Locus = 17 Class = 19
Directory Services: Could not open local database, error: -723
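When a log like this grows long, it can help to break each entry into its parts so that you can sort or filter by date or severity. The following is a minimal sketch in Python. The parse_entry function and its regular expression are assumptions derived from the single sample entry above, so real files may need a more forgiving pattern.

```python
import re

# Pattern inferred from the sample entry; whitespace between fields may be
# spaces or line breaks.
ENTRY = re.compile(
    r"(?P<date>\d{1,2}-\d{2}-\d{4})\s+"
    r"(?P<time>\d{1,2}:\d{2}:\d{2} [ap]m):\s+"
    r"(?P<source>\S+)\s+"
    r"Severity = (?P<severity>\d+)\s+"
    r"Locus = (?P<locus>\d+)\s+"
    r"Class = (?P<error_class>\d+)\s+"
    r"(?P<message>.+)"
)


def parse_entry(text):
    """Split one log entry into named fields, or return None if it doesn't match."""
    match = ENTRY.search(text)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the numeric fields so they can be compared and sorted.
    for key in ("severity", "locus", "error_class"):
        entry[key] = int(entry[key])
    return entry


sample = (
    "1-07-1999 11:51:10 am: DS-7.9-17\n"
    "Severity = 1 Locus = 17 Class = 19\n"
    "Directory Services: Could not open local database, error: -723"
)
entry = parse_entry(sample)
```

With the fields separated, filtering is a one-liner, for example keeping only entries whose severity is above a threshold you choose.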
Windows NT 4 Log Files
Windows NT, like other network operating systems, employs comprehensive error and informational logging routines. Every program and process theoretically could have its own logging utility, but Microsoft has come up with a rather slick utility, Event Viewer, which, through log files, tracks all events on a particular Windows NT computer. Normally, though, you must be an administrator or a member of the Administrators group to have access to Event Viewer.
To use Event Viewer, follow these steps:
Warning Even though this list displays Windows 95/98 computers, you cannot view log files on those computers because their logging system isn’t designed to interface with Event Viewer.
Using Event Viewer, you can take a look at three types of files:
The System Log
The Security Log
The Application Log
Tip To view the log files of any Windows NT machine from your Windows 95/98 client, copy the Server Tools from the Windows NT Server CD to your hard disk and create a shortcut for them. The Server Tools directory is located in the \CLIENTS\SRVTOOLS\ directory on the Windows NT Server Installation CD.