Monitoring and Statistics
Why Monitor
performance tuning
reduce bottlenecks, tune and optimize systems, improve QoS,
optimize investments, balance workloads
troubleshooting
increase reliability/availability, identify problems before users
see them
planning
expectations
security
accounting
What to monitor - LAN
performance of networked applications
communications devices
computers and peripherals
host resource utilization
What to do with the data?
Gather
Analyze
Reduce
Alert
Gather What?
#good packets, #kilobytes, pkt size distribution
#errors, types of errors
# pkts dropped, discarded, buffer/controller overflows
critical servers, router interfaces, ethermeters
off-site links
critical unix daemons and services
arp caches
appearance of new unregistered nodes
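Example: a minimal sketch of checking gathered ARP cache entries for duplicate IP addresses and for nodes that are not registered. The function name `audit_arp`, the (ip, mac) input format, and the set of registered MAC addresses are assumptions made for illustration; they do not come from any particular tool.

```python
from collections import defaultdict

def audit_arp(entries, registered_macs):
    """Audit ARP cache entries (hypothetical helper).

    entries: iterable of (ip, mac) pairs collected from ARP caches.
    registered_macs: iterable of MAC addresses known to the site.
    Returns (duplicates, unregistered).
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in entries:
        macs_by_ip[ip].add(mac.lower())

    # An IP address claimed by more than one MAC is a likely duplicate.
    duplicates = {ip: sorted(macs)
                  for ip, macs in macs_by_ip.items() if len(macs) > 1}

    # Any MAC seen on the wire but missing from the registry is a new node.
    registered = {m.lower() for m in registered_macs}
    seen = {mac for macs in macs_by_ip.values() for mac in macs}
    unregistered = sorted(seen - registered)
    return duplicates, unregistered
```

Usage: pass the merged entries from several ARP caches and report only when either result is non-empty.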
Gather when?
polling
rolling 5 minute averages
event driven
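Example: a minimal sketch of turning polled counter readings into a rolling 5 minute average rate. The class name `RollingRate` and the (timestamp, counter) sample format are assumptions; the counter is assumed to be cumulative, and counter wrap or reset is not handled.

```python
from collections import deque

class RollingRate:
    """Rolling average rate over a time window (hypothetical helper)."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.samples = deque()  # (timestamp_seconds, cumulative_counter)

    def add(self, timestamp, counter):
        """Record one poll result and discard samples outside the window."""
        self.samples.append((timestamp, counter))
        # Keep the newest sample that is at or before the window start,
        # so the rate always spans the full window when data allows.
        while (len(self.samples) > 2
               and self.samples[1][0] <= timestamp - self.window):
            self.samples.popleft()

    def rate(self):
        """Average counter units per second over the window, or None."""
        if len(self.samples) < 2:
            return None
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        if t1 <= t0:
            return None
        return (c1 - c0) / (t1 - t0)
```

Usage: call `add()` once per poll (for example every 60 seconds) and read `rate()` after each poll.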
Analysis
hourly and daily graphs
long-term graphs
interface stats
top talkers
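Example: a minimal sketch of a top talkers report. The input format, a sequence of (source, bytes) records, and the function name `top_talkers` are assumptions for illustration.

```python
from collections import Counter

def top_talkers(records, n=10):
    """Return the n sources with the most bytes, largest first.

    records: iterable of (source, byte_count) pairs (assumed format).
    """
    totals = Counter()
    for source, byte_count in records:
        totals[source] += byte_count
    return totals.most_common(n)
```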
Data reduction
analysis generates thousands of reports, mostly boring
reduction examines the analysis output and reports only exceptions:
duplicate IP addresses
appearance of new unregistered nodes
loss of connectivity
data values exceeding thresholds
errors > 1 per 10000 pkts
total util on a subnet > 10% for the day
broadcast rates > 150 pkts/sec
bridge/router overflows
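Example: a minimal sketch of data reduction by threshold, using the three numeric thresholds listed above. The dictionary keys (`packets`, `errors`, `utilization_pct`, `broadcast_pps`) and the function name are assumptions chosen for illustration.

```python
def find_exceptions(stats):
    """Return a list of threshold violations for one subnet or interface.

    stats: dict with the hypothetical keys 'packets', 'errors',
    'utilization_pct' (daily total) and 'broadcast_pps'.
    """
    exceptions = []
    # errors > 1 per 10000 pkts (integer arithmetic avoids rounding)
    if stats["packets"] > 0 and stats["errors"] * 10000 > stats["packets"]:
        exceptions.append("errors > 1 per 10000 pkts")
    # total utilization on a subnet > 10% for the day
    if stats["utilization_pct"] > 10:
        exceptions.append("utilization > 10% for the day")
    # broadcast rate > 150 pkts/sec
    if stats["broadcast_pps"] > 150:
        exceptions.append("broadcast rate > 150 pkts/sec")
    return exceptions
```

An empty list means the report is boring and need not be sent.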
What to monitor - WAN
throughput
delay
availability
packet loss
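Example: a minimal sketch of computing availability for a WAN link from a list of outages. It assumes availability is defined as the share of the reporting period during which the link was up, and that outages are given as non-overlapping (start, end) times in seconds; both the definition and the function name are assumptions.

```python
def availability_pct(period_seconds, outages):
    """Percentage of the period during which the link was up.

    outages: list of non-overlapping (start, end) pairs in seconds,
    all inside the reporting period (assumed format).
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    downtime = sum(end - start for start, end in outages)
    return 100.0 * (period_seconds - downtime) / period_seconds
```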
Why monitor the Internet
what is really happening/improve performance
who are grade A providers
settlements
Specific Internet Metrics
General Metrics
Routing Metrics
Path metrics
Other metrics
Topology visualization
For a metric, consider
clear definition
where to monitor
why to monitor
measurement tools
public or private
General metrics
access capacity per link
at exchanges; for settlements; known a priori; public
connect time
total traffic (bytes)
peak traffic (bits/sec sustained for n seconds)
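Example: a minimal sketch of a peak rate sustained for n seconds. It assumes the metric means the highest average bit rate over any n consecutive one-second samples; that interpretation, the input format, and the function name are assumptions.

```python
def peak_sustained_bps(bits_per_second, n):
    """Highest average rate over any n consecutive one-second samples.

    bits_per_second: list of per-second bit counts (assumed format).
    Returns None when there are fewer than n samples.
    """
    if n <= 0 or len(bits_per_second) < n:
        return None
    window = sum(bits_per_second[:n])
    best = window
    # Slide the window one sample at a time, updating the running sum.
    for i in range(n, len(bits_per_second)):
        window += bits_per_second[i] - bits_per_second[i - n]
        best = max(best, window)
    return best / n
```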
Routing metrics
announced routes (number)
route flaps (#)
stability (route uptime/downtime)
presence of more-specific routes alongside less-specific (covering) routes
number of reachable destinations covered by a route
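Example: a minimal sketch of finding more-specific routes that are already covered by a less-specific route in the same table. It uses the Python standard library `ipaddress` module; the function name and the plain list of prefix strings are assumptions for illustration.

```python
import ipaddress

def covered_routes(prefixes):
    """Return sorted (more_specific, covering) prefix pairs.

    prefixes: list of announced prefixes as strings (assumed format).
    """
    nets = [ipaddress.ip_network(p) for p in prefixes]
    pairs = []
    for specific in nets:
        for covering in nets:
            # subnet_of() requires both networks to be the same IP version.
            if (specific != covering
                    and specific.version == covering.version
                    and specific.subnet_of(covering)):
                pairs.append((str(specific), str(covering)))
    return sorted(pairs)
```

The pairwise comparison is quadratic in the number of routes, which is fine for a sketch but not for a full routing table.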
Path metrics
delays (ms)
anywhere; for settlements, performance, and grading; ping
flow capacity
anywhere; for performance and grading; treno
mean packet loss (%)
mean RTT (sec)
Hop counts, congestion (traceroute)
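Example: a minimal sketch of computing mean packet loss and mean RTT from a series of ping-style probes. The input format, one RTT in milliseconds per probe with None for a lost probe, and the function name are assumptions; the code does not run ping itself.

```python
def summarize_probes(rtts_ms):
    """Return (loss_pct, mean_rtt_ms) for a list of probe results.

    rtts_ms: one entry per probe sent; a float RTT in milliseconds,
    or None when no reply arrived (assumed format).
    mean_rtt_ms is None when every probe was lost.
    """
    sent = len(rtts_ms)
    if sent == 0:
        return None
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(replies)) / sent
    mean_rtt_ms = sum(replies) / len(replies) if replies else None
    return loss_pct, mean_rtt_ms
```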
Other
flow characteristics
network outage information
AS x AS matrices
caches
mean NOC turnaround
Packages
Network Probe Daemon (Vern Paxson)
traceroute, ping and treno
NetSCARF
Comments
Poisson vs self-similarity
Sampled versus full data
Privacy considerations
Impact of monitoring on performance
ANX