UW Performance troubleshooting procedures
Jonathan Morris
NDC Network Operations Center
University of Washington
October 16, 2002
Abstract:
The University of Washington is home to a large, generally high
performance network. When network performance degrades from the
point of view of the end user, the troubleshooting steps outlined
below should be followed. These steps focus on the entirety of
the network picture.
1. Overview
The following document is meant to be a reference for tracking down
the cause of any reported case of poor network performance. The steps
outlined below are part of the evolving set of troubleshooting procedures
used. These can and will change over time. Not all procedures listed are
necessary in all trouble cases.
The information outlined in section 2, "Fast steps to troubleshooting",
is meant to work as a quick reference. Section 3, "In depth
troubleshooting procedures", contains more detailed information on
data gathering and problem solving.
2. Fast steps to troubleshooting
2.1 The following quick steps should be used as a guide while
troubleshooting. More in depth explanations and steps follow in
section 3.
Begin trouble tracking.
Check for current known events that would affect the user.
Check the user's configuration
IP address
DNS servers
DHCP client configuration
Netmask setting
Gateway setting
Cabling to wallport
Include check of hublet/switchlet
NIC configuration
Speed and duplex settings
Multiple NIC configuration
Possible host based or network firewall configuration(s)
Recreate the problem
Use ping and traceroute to the user and their destination
Look for packet loss, high latency, routing anomalies
Check DNS resolution for any delay
Perform IPERF tests if feasible
Use IPERF if the problem is not visible through other means
Try to localize the problem
Transport and Internetworking layers
ARP resolution
Possible IP conflict
Supported transport protocols
TCP
UDP
Multicast
Unsupported transport protocols
Appletalk
Supported over IP encapsulation only
NetBios
Supported over IP encapsulation only
IPX
Supported over IP encapsulation only
Route infrastructure health
High CPU on routers
Low memory, small "largest free" blocks
High input packets-per-second, high bandwidth
Log messages indicating router problems
VIP resets
malloc-fails
Routing
Check for correct routing
OSPF
BGP
IS-IS
Link Layer (Layer2) and Physical Layer
Using metric data, begin examining the network
Check the user's Layer2 port
Errors
Link Saturation
Check the health of the Layer2 network
CPU
Memory
Log messages
Check intra-Layer2 device links
Switch uplinks
Hub AUI ports
Check the subnet's router(s)
Link saturation
Errors
Check the UW's border
Link Saturation
Errors
Check the GigaPop infrastructure and links to
commodity Internet and Abilene
Link Saturation
Errors
3. In depth troubleshooting procedures
3.1 Trouble complaint
3.1.1 Begin trouble tracking to start record of the problem
3.1.2 Always search the ticket database for any planned or known
events. Also double-check e-mail for any recent information.
3.1.3 Begin gathering information from the user, in as much detail
as possible. Other information, such as machine locations and MAC
addresses (given an IP), can be obtained from our tools.
1. Exact text, if available, of any error message they have received
2. Information about the IPs involved:
A. Source(s)
B. Destination(s)
C. Is DHCP in use or are the IPs manually assigned?
D. Associated DNS for both source(s) and destination(s)
E. Configured netmask(s)
F. Configured gateway(s)
3. MAC address:
A. Source(s)
B. Destination(s)
4. What is their physical location(s)?
5. The user's computer information:
A. Is it a desktop?
B. Is it a server?
a. Does it have multiple NICs?
b. Does it have any virtual interfaces?
C. What OS(s) do they use?
D. What type of NIC(s) do they have?
E. What is their connection speed set for? What duplex?
F. What applications are being used?
G. What type of protocol are they using?
a. TCP
b. UDP
c. Multicast
d. Appletalk
e. NetBios
f. IPX
H. What do they have as their primary DNS server? Secondary?
6. When did the problem start?
7. Is it intermittent?
8. What type of network are they on?
A. Layer2
a. Switched
b. Shared
c. Segment switched
B. Upstream router and uplink
a. Gigabit Ethernet
b. 100Mbit Ethernet
c. 10Mbit Ethernet
d. DS1 serial link
e. Other serial links
9. Do they connect through a user hublet?
10. Do they sit behind a firewall?
11. Does the other end of their attempted connection
have a firewall?
12. Are others in the area having similar problems?
3.2 Trouble investigation
3.2.1 Check that the user's configuration for the information
gathered in section 3.1.3 is correct. Once it is clear that the
problems are not due to the user's configuration, begin looking
top-down at Layer3 and Layer2 network infrastructure. If multiple
users are involved, or it is clear that the problem is not local to
the user, checking the Layer3 routing infrastructure should precede
looking at Layer2.
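The IP address, netmask, and gateway reported by the user can be
checked against each other before any network testing begins. The
following is a minimal sketch using Python's standard ipaddress
module. The function name check_host_config and the wording of its
messages are illustrative assumptions, not part of any existing UW
tool.

```python
import ipaddress


def check_host_config(ip, netmask, gateway):
    """Return a list of problems found; an empty list means none were found."""
    try:
        iface = ipaddress.ip_interface(f"{ip}/{netmask}")
        gw = ipaddress.ip_address(gateway)
    except ValueError as exc:
        # Covers malformed addresses and non-contiguous netmasks.
        return [f"unparseable setting: {exc}"]

    problems = []
    net = iface.network
    # On ordinary subnets the first and last addresses are reserved.
    if net.prefixlen < 31 and iface.ip in (net.network_address,
                                           net.broadcast_address):
        problems.append(
            f"{iface.ip} is the network or broadcast address of {net}")
    if gw not in net:
        problems.append(f"gateway {gw} is outside {net}")
    elif gw == iface.ip:
        problems.append("gateway is the same as the host address")
    return problems


if __name__ == "__main__":
    # Documentation-range addresses, used here only as an example.
    for problem in check_host_config("192.0.2.10", "255.255.255.0",
                                     "192.0.3.1"):
        print(problem)
```

An empty result only means the three settings are consistent with
each other. It does not confirm that they match what the subnet is
actually configured for.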
3.2.2 Try to recreate the problem as the user sees it
1. ICMP ping
A. Ping with varying size packets for varying lengths of time
from:
a. Inside the network
b. The user's computer
c. The user's destination (if possible)
d. A point with no known problems, usually one of our servers
or workstations.
B. Forcing a ping to fragment by using packet sizes greater than
the normal MTU of the user's network can be beneficial. Using
route ping can show if there is any kind of load sharing or
possible asymmetric routing.
C. High round trip times (RTTs) and high latency (delay) can be
indicative of a performance problem. Any packet loss could
be caused by the root performance problem.
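The ping checks above involve two small pieces of arithmetic: how
large a ping must be before it fragments, and how to summarize loss
and RTT from a run of probes. The sketch below assumes IPv4 with no
IP options (20 byte header) and an ICMP echo header of 8 bytes. The
function names fragments_needed and summarize_rtts are illustrative,
and the RTT samples are assumed to have been collected by whatever
ping tool is in use.

```python
import math
import statistics

IP_HEADER = 20    # bytes, IPv4 header without options (assumption)
ICMP_HEADER = 8   # bytes, ICMP echo header


def fragments_needed(payload_bytes, mtu=1500):
    """Number of IPv4 fragments an ICMP echo with this payload would need."""
    total = payload_bytes + ICMP_HEADER   # data carried inside the IP packet
    last = mtu - IP_HEADER                # the final fragment may be any size
    if total <= last:
        return 1
    per_fragment = last // 8 * 8          # other fragments: multiples of 8
    return 1 + math.ceil((total - last) / per_fragment)


def summarize_rtts(samples):
    """Summarize RTT samples in milliseconds; None marks a lost probe."""
    if not samples:
        raise ValueError("no probes were sent")
    received = [s for s in samples if s is not None]
    summary = {
        "sent": len(samples),
        "loss_pct": 100.0 * (len(samples) - len(received)) / len(samples),
    }
    if received:
        summary.update(min=min(received),
                       avg=statistics.mean(received),
                       max=max(received))
    return summary


if __name__ == "__main__":
    print(fragments_needed(1472), fragments_needed(1473))
    print(summarize_rtts([10.0, None, 30.0, 20.0]))
```

With a 1500 byte MTU, any ping payload larger than 1472 bytes is
fragmented, so sizes just below and just above that value make a
useful pair of tests.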
2. Traceroute
A. Use ICMP or UDP traceroute to the source(s) and
destination(s). Do this several times to determine attributes
of the network such as load-sharing or asymmetric routing.
Work from points:
a. Inside the network
b. The user's computer
c. The user's destination (if possible)
d. A point with no known problems, usually one of our servers
or workstations.
B. Unbalanced load-sharing and asymmetric routing can introduce
latency and possibly packet loss.
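Repeated traceroutes are easier to compare when each run is reduced
to a list of responding addresses, one per hop. The sketch below
reports the hops where different runs saw different routers, which
is one sign of load-sharing. The function name varying_hops and the
input layout (None for a hop that did not answer) are assumptions
made for illustration.

```python
def varying_hops(runs):
    """Map 1-based hop number to the set of addresses seen at that hop,
    for hops where the runs disagree. Unanswered hops (None) are ignored."""
    result = {}
    depth = max((len(run) for run in runs), default=0)
    for i in range(depth):
        seen = {run[i] for run in runs if i < len(run) and run[i] is not None}
        if len(seen) > 1:
            result[i + 1] = seen
    return result


if __name__ == "__main__":
    # Example data only; these are not real UW router addresses.
    runs = [
        ["10.0.0.1", "10.0.1.1", "10.0.9.9"],
        ["10.0.0.1", "10.0.2.1", "10.0.9.9"],
        ["10.0.0.1", None, "10.0.9.9"],
    ]
    for hop, addresses in sorted(varying_hops(runs).items()):
        print(hop, sorted(addresses))
```

This comparison does not detect asymmetric routing by itself. A
traceroute run in the opposite direction normally shows different
interface addresses on the same routers, so forward and reverse
paths have to be compared router by router rather than address by
address.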
3. DNS resolution
A. Check that the forward and reverse of any DNS records
reported (such as www.yahoo.com or www.washington.edu) resolve
as expected.
a. On the user's machine
b. From a known good source (such as our servers or
workstations)
B. Double check any MX, CNAME, or SRV records for servers.
C. Make sure that resolution is not taking longer than
expected.
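The forward, reverse, and timing checks above can be combined in one
pass. The sketch below uses the system resolver through Python's
socket module, so it measures what the host itself would see,
including any local caching. The function name forward_reverse is
illustrative, and what counts as "longer than expected" is left to
the person running it.

```python
import socket
import time


def forward_reverse(name, forward=socket.gethostbyname,
                    reverse=socket.gethostbyaddr):
    """Resolve name -> address -> name, timing each step in seconds."""
    result = {"name": name, "address": None, "ptr": None,
              "forward_s": None, "reverse_s": None, "error": None}
    try:
        start = time.perf_counter()
        result["address"] = forward(name)
        result["forward_s"] = time.perf_counter() - start

        start = time.perf_counter()
        # gethostbyaddr returns (hostname, aliases, addresses)
        result["ptr"] = reverse(result["address"])[0]
        result["reverse_s"] = time.perf_counter() - start
    except OSError as exc:
        # socket.gaierror and socket.herror are both OSError subclasses
        result["error"] = str(exc)
    return result


if __name__ == "__main__":
    print(forward_reverse("www.washington.edu"))
```

Running the same check on the user's machine and on a known good
host shows whether a delay belongs to the user's configured DNS
servers or to the name itself. The forward and reverse names will
not always match, for example when a name is a CNAME for a host
with a different canonical name.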
4. IPERF testing
A. This powerful tool is usually utilized by Network
Implementation when out in the field to reproduce problems
that may or may not be visible through other means. Because NI
must usually make a trip specifically to the user's location
for this purpose, encouraging the user to attempt using IPERF
may be reasonable.
3.2.3 After gathering metric data on the health of the network, attempt
to isolate the problem to a specific area.
1. Transport and Internetwork Layers
A. Check that there are no simple problems involving:
a. ARP resolution
b. Transport protocols
The University only supports TCP/IP protocols on user
connections. Supported protocols include TCP, UDP, and
Multicast.
Unsupported protocols include Appletalk, NetBios, and IPX.
These can be tunnelled over IP, in which case connectivity to
the hosts involved should be verified at the IP layer.
B. Examine the health of routing equipment looking for:
a. High CPU usage
b. Low available memory and/or too small blocks of
"largest free" available memory.
c. High input packet rates or bandwidth usage
d. Any log messages indicating such things as Cisco VIP
resets, malloc-fails, etc...
C. More complicated routing problems can occur:
a. OSPF
b. BGP
c. IS-IS
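For the ARP resolution check, one simple test for a possible IP
conflict is to look for an IP address that has been seen with more
than one MAC address. The sketch below assumes the (IP, MAC) pairs
have already been collected from ARP tables or similar sources. The
function name find_ip_conflicts and the input layout are
illustrative.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Given (ip, mac) pairs, return {ip: sorted MACs} for every IP
    that was seen with more than one MAC address."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize case and the dash-separated form so the same MAC
        # written two ways is not counted twice.
        macs_by_ip[ip].add(mac.lower().replace("-", ":"))
    return {ip: sorted(macs)
            for ip, macs in macs_by_ip.items() if len(macs) > 1}


if __name__ == "__main__":
    # Example data only.
    seen = [
        ("192.0.2.10", "00:11:22:33:44:55"),
        ("192.0.2.10", "00-11-22-33-44-55"),
        ("192.0.2.11", "00:AA:BB:CC:DD:01"),
        ("192.0.2.11", "00:aa:bb:cc:dd:02"),
    ]
    print(find_ip_conflicts(seen))
```

A hit is a lead and not a verdict. A replaced NIC or a router
redundancy protocol can legitimately change the MAC address seen
for an IP. Other MAC notations, such as the dotted form printed by
some routers, would need their own normalization.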
2. Link Layer (Layer2) and Physical Layer
A. Is the problem between the Layer2 port and the user's
computer?
a. Are there any errors present on the Layer2 port?
b. What speed is the user running at? What duplex?
c. Do they have a Cat5 cable? Cat3? Thin coax?
d. What is the building wiring? Cat3, Cat5, Thin coax?
FCS/CRC, alignment, giants, runts, and other errors may show
up on the Layer2 port. These can indicate a speed or duplex
mis-match, a bad cable or malfunctioning NIC.
The user may experience some severe performance hits if trying
to use 100Mbps and/or full-duplex settings if the Layer2 port
is not configured similarly. It is recommended that the user
always set their computer to Auto-Negotiate.
If the user is trying to use Cat3 and has settings for
Auto-Negotiate, 100Hdx or 100Fdx, they may receive errors.
Cat5 cable is recommended.
If they have thin coax, terminators should be checked.
e. Is there a hublet or switchlet between the physical
wallport and the user's NIC?
f. Are there other devices on the hublet or switchlet?
g. Is the link becoming saturated?
If there is a device here, is the uplink to the
wallport properly configured? The uplink port on most
hublets is shared with the first or last port in the
hublet. The shared port is either a cross-over or
straight-through port. If there is not a specific uplink
or cross-over port, a cross-over uplink cable will be
needed. If the uplink port is a shared port no connection
can be made to the straight-through side if the cross-over
side is in use.
If the device is a switch, the uplink may also be
shared, straight-through or cross-over. Switches can
sometimes be set to auto-negotiate. The configuration of
the switchlet should be checked to make sure there is not
a duplex or speed mis-match.
Hublets will perform very poorly if the Layer2 port is
configured for anything but 10Hdx; however, auto-negotiate
should work in most cases without error.
If there are several devices in use at the same time
through the same hublet or switchlet it could be that they
are exceeding the maximum bandwidth for their segment from
the Layer2 port. On shared networks this would be visible
by most people on that network.
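Whether a link is becoming saturated can be estimated from two
readings of an interface's octet counter taken a known number of
seconds apart. The sketch below assumes 32 bit counters that wrap
at most once between readings. The function names counter_delta and
utilization_pct are illustrative, and how the counter values are
collected is outside the scope of the example.

```python
def counter_delta(start, end, counter_bits=32):
    """Difference between two counter readings, allowing for one wrap."""
    if end >= start:
        return end - start
    return end + 2 ** counter_bits - start


def utilization_pct(octets_start, octets_end, interval_s, link_bps,
                    counter_bits=32):
    """Average utilization of one direction of a link, in percent."""
    bits = 8 * counter_delta(octets_start, octets_end, counter_bits)
    return 100.0 * bits / (interval_s * link_bps)


if __name__ == "__main__":
    # 37,500,000 octets in 60 seconds on a 10 Mbit/s port.
    print(utilization_pct(0, 37_500_000, 60, 10_000_000))
```

On a half-duplex or shared segment, inbound and outbound traffic
compete for the same capacity, so the two directions should be
added together before comparing against the link speed. The result
is an average over the interval, and short bursts can saturate a
link without showing up in it.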
C. Make a check of the health of the Layer2 infrastructure
devices.
a. Check for high CPU usage
b. Check for low memory availability and/or high packet
buffer usage.
c. Check the device logs for messages concerning resets,
malloc-fails, low-buffers.
D. Are there errors within the Layer2 infrastructure?
a. Hub AUI
b. Switch uplinks
E. Is there congestion in the Layer2 infrastructure?
a. Hub AUI
b. Switch uplinks
F. The uplink to the router serving the subnetwork.
a. The router interface may be saturated.
b. Errors on the router or subnet uplink.
G. Within the University's backbone infrastructure.
a. Check the backbone interfaces on the specific router.
b. Check the interfaces on the backbone switches themselves.
c. Check the backbone interfaces on the University's
border routers with the PNW-GigaPop.
H. On the commodity or Abilene Internet links.
a. First, using traceroute data, check the PNW-GP
routers connected along the path out to the Internet
b. Check the "peer route server" routers if the route goes
through the Pacific Wave exchange point.
c. Check the "commodity NSP" routers for traffic destined
to the commercial Internet.
d. Check the "high-performance NSP" routers for traffic
destined to Abilene participants, peers, and connectors.