Monitoring and Statistics

Why Monitor

        performance tuning
                reduce bottlenecks, tune and optimize systems, improve QoS,
                optimize investments, balance workloads
        troubleshooting
                increase reliability/availability, identify problems before
                users see them
        planning
        expectations
        security
        accounting

What to monitor - LAN

        performance of networked applications
        communications devices
        computers and peripherals
        host resource utilization

What to do with the data?

        Gather
        Analyze
        Reduce
        Alert

Gather What?

        #good packets, #kilobytes, pkt size distribution
        #errors, types of errors
        # pkts dropped, discarded, buffer/controller overflows
        critical servers, router interfaces, ethermeters
        off-site links
        critical unix daemons and services
        arp caches
        appearance of new unregistered nodes

Gather when?

        polling
        rolling 5 minute averages
        event driven
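
A rolling 5 minute average can be computed from polled counters roughly as
follows. This is a minimal sketch, not a definitive implementation: the class
name RollingAverage and its methods are hypothetical, and it assumes the
polled counter only ever increases (no wrap or reset handling).

```python
from collections import deque


class RollingAverage:
    """Rolling average rate of a polled counter over a fixed time window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, counter_value), oldest first

    def add(self, timestamp, counter_value):
        """Record one poll. Assumes the counter never wraps or resets."""
        self.samples.append((timestamp, counter_value))
        # Drop polls older than the window, but always keep at least two
        # samples so that a rate can still be computed.
        while (len(self.samples) > 2
               and self.samples[1][0] <= timestamp - self.window_seconds):
            self.samples.popleft()

    def rate(self):
        """Average counter increase per second over the window, or None."""
        if len(self.samples) < 2:
            return None
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        if t1 <= t0:
            return None
        return (c1 - c0) / (t1 - t0)
```

Polling once a minute and calling rate() after each poll gives a packets/sec
(or bytes/sec) figure smoothed over the last five minutes.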

Analysis

        hourly and daily graphs
        long-term graphs
        interface stats
        top talkers
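
A top talkers report can be sketched as a simple aggregation. The record
format used here, (source address, byte count) pairs, is an assumption made
for illustration; real collectors export richer records.

```python
from collections import Counter


def top_talkers(records, n=10):
    """Return the n sources that sent the most bytes, largest first.

    records: iterable of (source, byte_count) pairs (hypothetical format).
    """
    totals = Counter()
    for source, byte_count in records:
        totals[source] += byte_count
    return totals.most_common(n)
```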

Data reduction

        analysis generates thousands of reports, mostly boring
        reduction examines the analysis and reports only the exceptions:
                duplicate IP addresses
                appearance of new unregistered nodes
                loss of connectivity
        data values exceeding thresholds
                errors > 1 per 10000 pkts
                total util on a subnet > 10% for the day
                broadcast rates > 150 pkts/sec
                bridge/router overflows
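
The reduction step can be sketched as a set of small filters that return
nothing when all is well. The thresholds are the ones listed above; the
function names, the stats dictionary keys, and the (ip, mac) ARP entry format
are assumptions made for illustration.

```python
from collections import defaultdict

# Thresholds taken from the list above.
ERRORS_PER_PACKETS = 10000     # more than 1 error per this many packets
MAX_DAILY_UTILIZATION = 0.10   # fraction of subnet capacity over the day
MAX_BROADCAST_RATE = 150       # broadcast packets per second


def threshold_exceptions(stats):
    """Return the exceptions found in one subnet's daily statistics.

    stats: dict with keys packets, errors, utilization (0..1) and
    broadcast_pps (a hypothetical layout).
    """
    exceptions = []
    # Integer comparison avoids floating-point rounding at the boundary.
    if stats["errors"] * ERRORS_PER_PACKETS > stats["packets"]:
        exceptions.append("error rate above 1 per 10000 packets")
    if stats["utilization"] > MAX_DAILY_UTILIZATION:
        exceptions.append("daily utilization above 10%")
    if stats["broadcast_pps"] > MAX_BROADCAST_RATE:
        exceptions.append("broadcast rate above 150 pkts/sec")
    return exceptions


def duplicate_ips(arp_entries):
    """Map each IP address seen with more than one MAC to its sorted MACs."""
    macs_by_ip = defaultdict(set)
    for ip, mac in arp_entries:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items()
            if len(macs) > 1}


def unregistered_nodes(arp_entries, registered_macs):
    """Return the sorted MAC addresses seen on the wire but not registered."""
    registered = {mac.lower() for mac in registered_macs}
    return sorted({mac.lower() for _, mac in arp_entries} - registered)
```

Only non-empty results need to reach a person, which is what keeps the
thousands of routine reports out of the way.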

What to monitor - WAN

        throughput
        delay
        availability
        packet loss

Why to monitor the Internet

        what is really happening/improve performance
        who are grade A providers
        settlements

Specific Internet Metrics

        General metrics
        Routing metrics
        Path metrics
        Other metrics
        Topology visualization

For a metric, consider

        clear definition
        where to monitor
        why to monitor
        measurement tools
        public or private

General metrics

        access capacity per link
                at exchanges, for settlements; a priori public
        connect time
        total traffic (bytes)
        peak traffic (bits/sec sustained for n seconds)
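
One possible reading of "sustained for n seconds" is the highest average rate
over any n consecutive seconds. The sketch below assumes that reading and
assumes one traffic sample per second; the function name is hypothetical.

```python
def peak_sustained_rate(bits_per_second, n):
    """Highest average rate (bits/sec) over any n consecutive seconds.

    bits_per_second: list with one sample per second.
    Returns None if fewer than n samples are available.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(bits_per_second) < n:
        return None
    # Slide a window of n samples across the data, updating the sum in place.
    window_sum = sum(bits_per_second[:n])
    best = window_sum
    for i in range(n, len(bits_per_second)):
        window_sum += bits_per_second[i] - bits_per_second[i - n]
        best = max(best, window_sum)
    return best / n
```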

Routing metrics

        announced routes (number)
        route flaps (#)
        stability (route uptime/downtime)
        presence of more specific routes with less specific routes
        number of reachable destinations covered by a route
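
Finding more specific routes that are announced alongside a covering less
specific route can be sketched with Python's standard ipaddress module. The
function name and the plain list-of-prefix-strings input are assumptions; the
pairwise comparison is quadratic and only suitable for small tables.

```python
import ipaddress


def covered_more_specifics(prefixes):
    """Return (more_specific, less_specific) pairs found in a prefix list."""
    networks = sorted(
        {ipaddress.ip_network(p) for p in prefixes},
        key=lambda net: (net.version, net.prefixlen, int(net.network_address)),
    )
    pairs = []
    for i, specific in enumerate(networks):
        # Only networks sorted earlier can be shorter (less specific).
        for general in networks[:i]:
            if (general.version == specific.version
                    and general.prefixlen < specific.prefixlen
                    and specific.subnet_of(general)):
                pairs.append((str(specific), str(general)))
    return pairs
```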

Path metrics

        delays (ms)
                anywhere for settlements and performance and grading, ping
        flow capacity
                anywhere for performance and grading, treno
        mean packet loss (%)
        mean RTT (sec)
        Hop counts, congestion (traceroute)
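
Mean packet loss and mean RTT can be derived from a series of probe results.
The sketch below does not run ping itself; it assumes the round-trip times
have already been collected into a list, with None marking a lost probe. The
function name and the result keys are hypothetical.

```python
def summarize_probes(rtts_ms):
    """Summarize probe results as loss percentage and mean RTT.

    rtts_ms: list of round-trip times in milliseconds, None for a lost probe.
    Returns None if no probes were sent.
    """
    sent = len(rtts_ms)
    if sent == 0:
        return None
    answered = [rtt for rtt in rtts_ms if rtt is not None]
    loss_percent = 100.0 * (sent - len(answered)) / sent
    # The mean RTT is undefined when every probe was lost.
    mean_rtt_ms = sum(answered) / len(answered) if answered else None
    return {"sent": sent, "loss_percent": loss_percent,
            "mean_rtt_ms": mean_rtt_ms}
```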

Other

        flow characteristics
        network outage information
        AS x AS matrices
        caches
        mean NOC turnaround

Packages

        Network Probe Daemon (Vern Paxson)
        traceroute, ping and treno
        NetSCARF

Comments

        Poisson vs self-similarity
        Sampled versus full data
        Privacy considerations
        impact of monitoring on performance
        ANX