UW Performance troubleshooting procedures

Jonathan Morris
NDC Network Operations Center
University of Washington

October 16, 2002

Abstract:

The University of Washington is home to a large, generally
high-performance network. When network performance degrades from the
point of view of the end user, the troubleshooting steps outlined
below should be followed. These steps cover the whole network path,
from the user's computer out to the University's external links.

1. Overview

This document is a reference for tracking down the cause of any
report of poor network performance. The steps outlined below are part
of an evolving set of troubleshooting procedures and will change over
time. Not all procedures listed are necessary in every trouble case.

The information outlined in section 2, "Fast steps to troubleshooting",
is meant to work as a quick reference. Section 3, "In depth 
troubleshooting procedures", contains more detailed information on 
data gathering and problem solving.

2. Fast steps to troubleshooting

	2.1 The following quick steps should be used as a guide while 
	troubleshooting. More in depth explanations and steps follow in 
	section 3.
	
	Begin trouble tracking.

	Check for current known events that would affect the user.

	Check the user's configuration

		IP address
		
		DNS servers
		
		DHCP client configuration
		
		Netmask setting
		
		Gateway setting
		
		Cabling to wallport
		
			Include check of hublet/switchlet
		
		NIC configuration
			
			Speed and duplex settings
			
			Multiple NIC configuration

		Possible host based or network firewall configuration(s)
		
	Recreate the problem
			
		Use ping and traceroute to the user and their destination
				
			Look for packet loss, high latency, routing anomalies
				
		Check DNS resolution for any delay

		Perform IPERF tests if feasible
				
			Use IPERF if the problem is not visible through other means
		
	Try to localize the problem

		Transport and Internetworking layers

			ARP resolution
			
				Possible IP conflict

			Supported transport protocols
				
				TCP
			
				UDP
			
				Multicast

			Unsupported transport protocols
			
				Appletalk
					Supported over IP encapsulation only
		
				NetBios
					Supported over IP encapsulation only
			
				IPX
					Supported over IP encapsulation only

			Route infrastructure health

				High CPU on routers

				Low memory, small "largest free" blocks

				High input packets-per-second, high bandwidth

				Log messages indicating router problems
					VIP resets
					malloc-fails

			Routing

				Check for correct routing

					OSPF

					BGP

					IS-IS

		Link Layer (Layer2) and Physical Layer
		
			Using metric data, begin examining the network
		
				Check the user's Layer2 port
				
					Errors

					Link Saturation
				
				Check the health of the Layer2 network
					
					CPU

					Memory

					Log messages
				
				Check intra-Layer2 device links

					Switch uplinks

					Hub AUI ports

				Check the subnet's router(s)

					Link saturation
					
					Errors

				Check the UW's border

					Link Saturation

					Errors

				Check the GigaPop infrastructure and links to 
					commodity Internet and Abilene

					Link Saturation

					Errors
					

3. In depth troubleshooting procedures

	3.1 Trouble complaint

		3.1.1 Begin trouble tracking to start record of the problem

		3.1.2 Always search the ticket database for any planned or known
		events. Also double-check e-mail for any recent information.

		3.1.3 Begin gathering information from the user, in as much detail
		as possible. Other information, such as machine location or the MAC
		address for a given IP, can be obtained from our tools.

			1. Exact text, if available, of any error message they have received

			2. Information about the IPs involved:
			
				A. Source(s)
			
				B. Destination(s)
		
				C. Is DHCP in use or are the IPs manually assigned?
		
				D. Associated DNS for both source(s) and destination(s)

				E. Configured netmask(s)

				F. Configured gateway(s)
		
			3. MAC address:
			
				A. Source(s)
			
				B. Destination(s)
		
			4. What is their physical location(s)?
		
			5. The user's computer information:
			
				A. Is it a desktop?
			
				B. Is it a server?
			
					a. Does it have multiple NICs?

					b. Does it have any virtual interfaces?
			
				C. What OS(s) do they use?
			
				D. What type of NIC(s) do they have?
			
				E. What is their connection speed set for? What duplex?
			
				F. What applications are being used?
	
				G. What type of protocol are they using?

					a. TCP

					b. UDP

					c. Multicast

					d. Appletalk

					e. NetBios

					f. IPX
	
				H. What do they have as their primary DNS server? Secondary?
				
			6. When did the problem start?

			7. Is it intermittent?
		
			8. What type of network are they on?
			
				A. Layer2
				
					a. Switched
				
					b. Shared
				
					c. Segment switched
		
				B. Upstream router and uplink
				
					a. Gigabit Ethernet
				
					b. 100Mbit Ethernet
				
					c. 10Mbit Ethernet
			
					d. DS1 serial link
			
					e. Other serial links

			9. Do they connect through a user hublet?

			10. Do they sit behind a firewall?

			11. Does the other end of their attempted connection 
				have a firewall?

			12. Are others in the area having similar problems?
	
	3.2 Trouble investigation

		3.2.1 Check that the user's configuration is correct for the items
		gathered in section 3.1.3. Once it is clear that the problems
		are not due to the user's configuration, begin looking top-down at
		Layer3 and Layer2 network infrastructure. If multiple users are
		involved or it is clear that the problem is not local to the user,
		checking the Layer3 routing infrastructure should precede looking
		at Layer2.
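
As a minimal sketch of the address checks in 3.2.1, the Python
fragment below tests whether a reported IP address, netmask, and
gateway are consistent with one another. The function name and the
sample addresses are illustrative assumptions and are not part of
any existing UW tool.

```python
import ipaddress


def check_host_config(ip, netmask, gateway):
    """Return a list of problems found in a host's reported IP settings.

    Raises ValueError if any of the three values is not a valid address
    or netmask.
    """
    problems = []
    # strict=False allows a host address (not only a network address) here.
    network = ipaddress.ip_network(f"{ip}/{netmask}", strict=False)
    host = ipaddress.ip_address(ip)
    gw = ipaddress.ip_address(gateway)

    if gw not in network:
        problems.append(f"gateway {gw} is outside {network}")
    if gw == host:
        problems.append("gateway is the same as the host address")
    # /31 and /32 networks have no separate network/broadcast addresses.
    if network.prefixlen < 31:
        if host == network.network_address:
            problems.append(f"{host} is the network address of {network}")
        elif host == network.broadcast_address:
            problems.append(f"{host} is the broadcast address of {network}")
    return problems
```

An empty list means the three values agree with each other. It does
not show that they are the values actually assigned to the user, so
they should still be compared against the subnet's real settings.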
		
		3.2.2 Try to recreate the problem as the user sees it

			1. ICMP ping

				A. Ping with varying size packets for varying lengths of time
				from:
				
					a. Inside the network

					b. The user's computer

					c. The user's destination (if possible)

					d. A point with no known problems, usually one of our servers
						or workstations.
					
				B. Forcing a ping to fragment by using packet sizes greater than
				the normal MTU of the user's network can be beneficial. Using
				route ping can show if there is any kind of load sharing or
				possible asymmetric routing.

				C. High round trip times (RTTs) and high latency (delay) can
				indicate a performance problem. Any packet loss may be a
				symptom of the same root problem.
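
When many pings are collected from several vantage points, it can
help to reduce each run to a few numbers. The sketch below pulls
packet loss and RTT figures out of the summary lines that common
Unix ping programs print. The exact output format varies by
platform, so the regular expressions and the sample text are
assumptions to adjust for the ping in use.

```python
import re

# Summary lines as printed by common Unix ping implementations (assumed format).
LOSS_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")
RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")

# Made-up sample output used only to illustrate the parser.
SAMPLE = """\
--- host.example.edu ping statistics ---
10 packets transmitted, 8 received, 20% packet loss, time 9012ms
rtt min/avg/max/mdev = 0.512/1.250/4.100/1.020 ms
"""


def summarize_ping(output):
    """Extract loss and RTT figures from the summary lines of ping output."""
    summary = {}
    loss = LOSS_RE.search(output)
    if loss:
        sent, received = int(loss.group(1)), int(loss.group(2))
        summary["sent"] = sent
        summary["received"] = received
        summary["loss_pct"] = 100.0 * (sent - received) / sent if sent else 0.0
    rtt = RTT_RE.search(output)
    if rtt:
        summary["rtt_min_ms"] = float(rtt.group(1))
        summary["rtt_avg_ms"] = float(rtt.group(2))
        summary["rtt_max_ms"] = float(rtt.group(3))
    return summary
```

For the fragmentation test in item B, note that on a link with a
1500-byte MTU an IPv4 ICMP echo carries at most 1472 bytes of
payload unfragmented (1500 minus a 20-byte IP header and an 8-byte
ICMP header), so any larger payload size forces fragmentation.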
			
			2. Traceroute

				A. Use ICMP or UDP traceroute to the source(s) and
				destination(s). Do this several times to determine attributes
				of the network such as load-sharing or asymmetric routing.
				Work from these points:

					a. Inside the network

					b. The user's computer

					c. The user's destination (if possible)

					d. A point with no known problems, usually one of our servers
						or workstations.

				B. Unbalanced load-sharing and asymmetric routing can introduce
				latency and possibly packet loss.
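
One way to compare repeated traceroutes is to record the hop
addresses from each run and list the hop positions where the
answering address changes, which is what load-sharing looks like
from a single vantage point. The following is a sketch only; the
function name and the input layout (one list of hop addresses per
run, already parsed from traceroute output) are assumptions.

```python
def varying_hops(runs):
    """Map hop number -> sorted addresses seen, for hops that differ between runs.

    `runs` is a list of traceroute results, each a list of hop addresses
    in order. Use None for a hop that did not answer.
    """
    seen = {}
    for run in runs:
        for hop_number, address in enumerate(run, start=1):
            if address is not None:
                seen.setdefault(hop_number, set()).add(address)
    # Keep only the hops where more than one distinct address answered.
    return {hop: sorted(addrs) for hop, addrs in seen.items() if len(addrs) > 1}
```

A single vantage point only shows the forward path. Detecting an
asymmetric return path still requires a traceroute from the far
end, which is why the list above includes the user's destination.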

			3. DNS resolution

				A. Check that the forward and reverse of any DNS records
				reported (such as www.yahoo.com or www.washington.edu) resolve
				as expected. 

					a. On the user's machine

					b. From a known good source (such as our servers or
					workstations)

				B. Double check any MX, CNAME, or SRV records for servers.

				C. Make sure that resolution is not taking longer than
				expected.
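
The forward and reverse checks, together with the timing check in
item C, can be sketched with Python's standard socket module. The
function name, the returned field names, and the one-second
"slow" threshold are assumptions chosen for illustration; the
threshold in particular should be set from local experience.

```python
import socket
import time


def check_dns(name, forward=socket.gethostbyname,
              reverse=socket.gethostbyaddr, slow_after=1.0):
    """Resolve name -> address -> name, timing each lookup in seconds.

    A failed forward lookup raises OSError to the caller. A failed
    reverse lookup is reported as reverse_name=None.
    """
    start = time.perf_counter()
    address = forward(name)
    forward_secs = time.perf_counter() - start

    start = time.perf_counter()
    try:
        # gethostbyaddr returns (hostname, aliases, addresses).
        reverse_name = reverse(address)[0]
    except OSError:
        reverse_name = None  # no PTR record, or the lookup failed
    reverse_secs = time.perf_counter() - start

    return {
        "address": address,
        "reverse_name": reverse_name,
        "forward_secs": forward_secs,
        "reverse_secs": reverse_secs,
        # slow_after is an arbitrary illustrative threshold.
        "slow": max(forward_secs, reverse_secs) > slow_after,
    }
```

Calling check_dns("www.washington.edu") uses the resolver
configured on the machine where it runs, so running it both on the
user's machine and on a known good host gives the comparison
described in item A. The reverse name will not always equal the
name queried (for example when the name is a CNAME), so a mismatch
is a prompt to look closer rather than proof of a fault.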

			4. IPERF testing

				A. This powerful tool is usually used by Network
				Implementation (NI) in the field to reproduce problems
				that may or may not be visible through other means. Because NI
				must usually make a trip specifically to the user's location
				for this purpose, encouraging the user to try IPERF themselves
				may be reasonable.

		3.2.3 After gathering metric data on the health of the network, attempt
		to isolate the problem to a specific area.
	
			1. Transport and Internetwork Layers
				
				A. Check that there are no simple problems involving:

					a. ARP resolution

					b. Transport protocols

					The University only supports TCP/IP protocols on user
					connections. Supported protocols include TCP, UDP, and
					Multicast.

					Unsupported protocols include Appletalk, NetBios, and IPX.
					These can be tunnelled via IP, in which case connectivity
					between the hosts involved should be verified at the IP layer.
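
For the ARP check, one simple test for a possible IP conflict is
to collect (IP, MAC) pairs from ARP tables over time and flag any
IP that has answered with more than one MAC address. The sketch
below assumes the pairs have already been gathered; the function
name and input layout are illustrative.

```python
import re


def find_ip_conflicts(arp_entries):
    """Return {ip: sorted MACs} for each IP seen with more than one MAC.

    `arp_entries` is an iterable of (ip, mac) string pairs. MACs are
    reduced to bare lowercase hex so that 00:A0:C9:14:C8:29,
    00-a0-c9-14-c8-29 and 00a0.c914.c829 compare equal.
    """
    macs_by_ip = {}
    for ip, mac in arp_entries:
        normalized = re.sub(r"[^0-9a-f]", "", mac.lower())
        macs_by_ip.setdefault(ip, set()).add(normalized)
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}
```

More than one MAC for an IP is a hint, not proof, of a conflict. A
replaced NIC or a failover pair can legitimately produce the same
pattern, so the timing of each observation matters.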

				B. Examine the health of routing equipment looking for:

					a. High CPU usage

					b. Low available memory and/or too small blocks of
					"largest free" available memory.

					c. High input packet rates or bandwidth usage

					d. Any log messages indicating such things as Cisco VIP
					resets or malloc-fails.

				C. More complicated routing problems can occur:

					a. OSPF

					b. BGP

					c. IS-IS

			2. Link Layer (Layer2) and Physical Layer

				A. Is the problem between the Layer2 port and the user's
				computer?
					
					a. Are there any errors present on the Layer2 port?
					
					b. What speed is the user running at? What duplex?
				
					c. Do they have a Cat5 cable? Cat3? Thin coax?

					d. What is the building wiring? Cat3, Cat5, Thin coax?
					
					FCS/CRC, alignment, giants, runts, and other errors may show
					up on the Layer2 port. These can indicate a speed or duplex
					mis-match, a bad cable, or a malfunctioning NIC.

					The user may experience severe performance hits when trying
					to use 100Mbps and/or full-duplex settings if the Layer2 port
					is not configured similarly. It is recommended that the user
					always set their computer to Auto-Negotiate.

					If the user is trying to use Cat3 and has settings for
					Auto-Negotiate, 100Hdx or 100Fdx, they may receive errors.
					Cat5 cable is recommended.  

					If they have thin coax, terminators should be checked.

					e. Is there a hublet or switchlet between the physical
					wallport and the user's NIC?
					
					f. Are there other devices on the hublet or switchlet?

					g. Is the link becoming saturated?
					
					If there is a device here, is the uplink to the 
					wallport properly configured? The uplink port on most
					hublets is shared with the first or last port in the
					hublet. The shared port is either a cross-over or
					straight-through port. If there is not a specific uplink
					or cross-over port, a cross-over uplink cable will be
					needed.  If the uplink port is a shared port no connection
					can be made to the straight-through side if the cross-over
					side is in use.

					If the device is a switch, the uplink may also be
					shared, straight-through or cross-over.  Switches can
					sometimes be set to auto-negotiate. The configuration of
					the switchlet should be checked to make sure there is not
					a duplex or speed mis-match.

					Hublets will perform very poorly if the Layer2 port is
					configured for anything but 10Hdx; however, auto-negotiate
					should work in most cases without error.

					If there are several devices in use at the same time
					through the same hublet or switchlet it could be that they
					are exceeding the maximum bandwidth for their segment from
					the Layer2 port. On shared networks this would be visible
					by most people on that network.
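
Whether a link is becoming saturated can be estimated from two
readings of an interface's octet counter taken a known time apart.
The sketch below assumes a 32-bit counter that wraps to zero (as
many devices expose) and at most one wrap between readings; the
function and parameter names are illustrative.

```python
COUNTER32_MODULUS = 2 ** 32


def utilization_pct(octets_start, octets_end, interval_secs, link_bps,
                    modulus=COUNTER32_MODULUS):
    """Percent utilization of a link between two octet-counter samples."""
    if interval_secs <= 0 or link_bps <= 0:
        raise ValueError("interval_secs and link_bps must be positive")
    # The modulo handles a single counter wrap between the two samples.
    delta_octets = (octets_end - octets_start) % modulus
    return 100.0 * delta_octets * 8 / (interval_secs * link_bps)
```

For example, 37,500,000 octets in 60 seconds on a 10 Mbit/s port
is 50 percent utilization. On a half-duplex or shared segment,
traffic in both directions competes for the same bandwidth, so the
input and output counters should be added before comparing against
the link speed.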

				C. Make a check of the health of the Layer2 infrastructure
				devices.

					a. Check for high CPU usage

					b. Check for low memory availability and/or high packet
					buffer usage. 

					c. Check the device logs for messages concerning resets,
					malloc-fails, low-buffers.

				D. Are there errors within the Layer2 infrastructure?

					a. Hub AUI

					b. Switch uplinks

				E. Is there congestion in the Layer2 infrastructure?

					a. Hub AUI

					b. Switch uplinks

				F. The uplink to the router serving the subnetwork.

					a. The router interface may be saturated.

					b. Errors on the router or subnet uplink.
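
Interface error counters usually accumulate from the last time they
were cleared, so a large total may be old history. Comparing two
snapshots shows whether errors are still occurring and how they
relate to traffic volume. This is a sketch under stated
assumptions: the counter names are hypothetical keys, not the
field names of any particular device.

```python
def error_ratio(before, after,
                error_keys=("crc", "alignment", "runts", "giants")):
    """Input errors per received packet between two counter snapshots.

    `before` and `after` are dicts with a "packets_in" count plus one
    entry per name in `error_keys` (hypothetical names). Returns None
    when no packets arrived between the snapshots.
    """
    packets = after["packets_in"] - before["packets_in"]
    errors = sum(after[key] - before[key] for key in error_keys)
    if packets <= 0:
        return None
    return errors / packets
```

A ratio that stays at zero between snapshots suggests that errors
seen in the totals are historical. A ratio that keeps growing
points back at the speed, duplex, and cabling checks for that
link.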

				G. Within the University's backbone infrastructure.

					a. Check the backbone interfaces on the specific router.

					b. Check the interfaces on the backbone switches themselves

					c. Check the backbone interfaces on the University's
					border routers with the PNW-GigaPop.

				H. On the commodity or Abilene Internet links.

					a. First, using traceroute data, check the PNW-GP
					routers along the path out to the Internet.

					b. Check the "peer route server" routers if the route goes
					through the Pacific Wave exchange point.

					c. Check the "commodity NSP" routers for traffic destined
					to the commercial Internet.

					d. Check the "high-performance NSP" routers for traffic
					destined to Abilene participants, peers, and connectors.