ISMA Jan '99 - Workshop Report

Introduction and Goals

On January 14-15, 1999, CAIDA hosted an Internet Statistics and Metrics Analysis (ISMA) workshop focused on passive monitoring and analysis of Internet traffic data. The meeting was intended to engage researchers and practitioners experienced in this field in discussions aimed at:

  • calibrating what each group is doing in this arena
  • identifying how these efforts fit within the context of `the field'
  • clarifying critical priorities and next steps

Fifty-four (54) individuals participated, representing Internet service providers (ISPs), the research and education (R&E) community, and vendors. Each participant completed survey questions summarizing their interests and current activities in the field.

The ISMA workshop was held at the San Diego Supercomputer Center (SDSC) on the campus of the University of California, San Diego (UCSD). The meeting was sponsored by the Cooperative Association for Internet Data Analysis (CAIDA).

Demonstrations of several collection and analysis tools were conducted:

  • cflowd by Daniel McRobb,
  • Coral by Hans-Werner Braun,
  • CoralReef by David Moore,
  • DAG by Ian Graham, Stephen Donnelly and Jed Martens,
  • NeTraMet by Nevil Brownlee.

Tutorials on cflowd, CoralReef, and NeTraMet were also conducted immediately following the workshop.

Findings and Conclusions

The first day highlighted community activities in this field, including available measurement hardware and software and the types of analyses underway within the private sector, academic, and research communities. The second day focused on priorities and future directions for this nascent sector. Attendees also explored requirements for measurement functionality within future Internet routers and switches during discussions of a draft measurement specification. The sections that follow describe highlights from these sessions.

What to Measure?

Daniel McRobb (UCSD/CAIDA) opened discussions with a presentation on router-based measurements using the cflowd software. cflowd is a flow analysis tool currently used to analyze data exported by Cisco's NetFlow statistics collection functionality. Version 2.0 of cflowd replaces an earlier version of the traffic analysis software developed by McRobb while at ANS. The current release includes the collection, storage, and basic analysis modules for cflowd and for the arts++ libraries. It permits data collection and analysis by ISPs and network engineers in support of capacity planning, trend analysis, and characterization of workloads in a network service provider environment. Autonomous system (AS) matrices continue to be a particularly useful feature of cflowd for ISPs. Other areas where cflowd has proven useful include tracking usage for web hosting, accounting and billing, network planning and analysis, network monitoring, developing user profiles, data warehousing and mining, as well as security-related investigations (see Sager's presentation below). Collection and reporting granularities are user definable. Future enhancements of cflowd will include support for v8 flow-export, addition of plotting utilities using XRT/PDS, and development of Java-based web reporting tools. A mailing list is available for users and developers at cflowd-dev@caida.org. Presentation
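
To illustrate the kind of AS-matrix aggregation cflowd performs, the following Python sketch accumulates byte totals per (source AS, destination AS) pair from flow records. The record fields and values are hypothetical stand-ins for NetFlow-style export data, not cflowd's actual formats or interfaces.

    # Minimal sketch of AS-matrix aggregation in the style of cflowd.
    # The flow-record fields (src_as, dst_as, bytes) are hypothetical
    # stand-ins for fields carried in NetFlow-style flow export.
    from collections import defaultdict

    def build_as_matrix(flow_records):
        """Accumulate byte totals per (source AS, destination AS) pair."""
        matrix = defaultdict(int)
        for rec in flow_records:
            matrix[(rec["src_as"], rec["dst_as"])] += rec["bytes"]
        return matrix

    if __name__ == "__main__":
        flows = [
            {"src_as": 1740, "dst_as": 701, "bytes": 1500000},
            {"src_as": 1740, "dst_as": 701, "bytes": 40000},
            {"src_as": 701, "dst_as": 1740, "bytes": 9000},
        ]
        for (src, dst), total in sorted(build_as_matrix(flows).items()):
            print(f"AS{src} -> AS{dst}: {total} bytes")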

Presenters described several stand-alone passive monitoring devices. Ian Graham (Waikato University) provided an overview of a family of traffic monitor cards known as DAG. Various Asian sites are deploying versions of the DAG-2 and DAG-3, operating at DS3 and OC3 speeds. These cards, and an OC12 version, will undergo testing and evaluation starting mid-1999 by the National Laboratory for Applied Network Research (NLANR) and CAIDA. The DAG cards are designed for IP-over-ATM measurements, but UNIX drivers for packet-over-SONET measurements are expected this summer. Waikato's comparison of measurement accuracy between the DAG cards and tcpdump suggests roughly two orders of magnitude better results using the DAG card.

Graham also provided a short overview of results from interarrival time and one-way delay measurements underway between New Zealand and San Diego using DAG-2 cards and GPS antennas (100 nanosecond accuracy); see http://atm.cs.waikato.ac.nz/wand/delay/. He attributed the significant differences between the forward and reverse paths mostly to queueing delays in the U.S. Waikato's active measurement initiative provides measurements similar to those of the Surveyor project; however, the Surveyor measurement hosts are currently designed exclusively for active measurements, with no passive monitoring. A DAG card for Ethernet measurements, e.g., of collisions and non-deterministic behavior, and DAG support for OC48 are also planned for this year.

kc claffy (CAIDA) provided an overview of Coral as a flexible platform for passive network monitoring. Currently, the Coral monitor is the lowest-cost TCP/IP traffic monitor available operating at OC3 and OC12 speeds. Originally developed by MCI's vBNS team in collaboration with NLANR, the functionality of the early monitors has expanded to include support on a Unix platform, as well as a range of real-time and packet trace analysis software in the form of CoralReef (described below). Development of security capabilities for the monitors is also underway within MCI and within the Pacific Institute for Computer Security (PICS) at UCSD/SDSC. Presentation

Jonathan Wang summarized Bellcore's efforts to analyze passive traffic in support of SS7 networks, and provided specific examples of test results involving Bellcore's frame relay traffic data collector. Bellcore uses very fast tape drives to store gigabytes of data collected from frame relay monitors. On average, up to 2.5 hours of data may be collected during a single period. These raw data, ranging from headers to entire traces depending on client requirements, are shipped to New Jersey for analysis. A paper entitled Operations Measurements for Engineering Support of High-Speed Networks with Self-Similar Traffic summarizes Bellcore's findings with respect to acceptable sampling times for specific applications. Presentation

Greg Miller (MCI Worldcom) presented an ISP perspective, from the vantage point of the very-high-performance Backbone Network Service (vBNS). MCI/vBNS has been particularly sensitive to the important contribution that measurement and analysis of traffic can make in a high-performance network. MCI Worldcom maintains public web access to both active and passive measurements on the vBNS at http://www.vbns.net/. Miller described these analyses through illustrative graphics depicting "average" flows across the vBNS at 5-minute intervals. Miller also described instances where traffic analyses were able to dispute assertions by vendors, e.g., a case where reassembly of packets was suggested as a problem with a specific router (OC3mon was able to establish that it never saw more than about 20 packets simultaneously being reassembled, far below the number the vendor claimed the box could handle). In another, Coral monitors were used to refute the hypothesis that traffic was being dropped due to a core switch setting the CLP bit (the OC3mon verified that CLP was not being set). Presentation

Special analyses underway on vBNS traffic include: TCP analysis (retransmits, window size, MSS, options, flags); number of simultaneous reassemblies in progress; TOS and precedence bits usage; individual flow tracing; and multicast. Analysis of AS pairs is also important. Areas where MCI Worldcom does not currently use OC3mon/OC12mon monitors (but might consider in the future) include: billing and accounting, capacity planning, security incidents, SLA verification, and traffic engineering. Note however that most flows currently present on the vBNS are of short duration, suggesting that new techniques such as MPLS may not be immediately useful for traffic engineering at the micro-flow level on this network. Presentation

Phil Emer (NCSU) described how campuses are using passive collection tools to monitor their networks, drawing on the experiences of the North Carolina GigaPoP and the North Carolina Research and Education Network (NCREN). Specific areas where Emer feels measurement techniques can assist campuses include determining when a specific link is 'full' (i.e., exhibits excessive packet delay) and monitoring for `obscene amounts of traffic' from individual sources that might indicate a DOS attack or cases where campus users are engaging in inappropriate behavior. Knowing who or what is using a given percentage of the pipe is also particularly relevant for effective management of campus networks. Deciding where to monitor is not always easy, Emer explained, citing an example where traffic patterns did not improve following the addition of a third DS3 line to a site. The presence of ATM circuits and multiple interfaces also tends to complicate effective measurements. Representatives of other universities echoed similar situations and described home-grown scripts used for various traffic and security analyses.

SESSION DISCUSSIONS: Participants discussed the pros and cons of in-router measurements (via cflowd, NetFlow, and similar tools) versus those of stand-alone monitors. For ISPs, measurement from within the router has clear advantages. Aggregation schemes focusing on key statistics (e.g., AS pairs) are of increasing importance as a means to deal with the high volumes of data generated by high-speed routers. Participants felt that exporting flow data from the router via UDP poses significant problems (loss in routers when UDP flow-export loads are too high, requirements for placing collection boxes in relatively close proximity to the routers), urging TCP as an alternative. (Later discussions included expression of concern over vendors that implemented different versions of TCP for different applications (ssh, BGP, telnet) within the same box.)

Participants also discussed the benefits and costs of collecting all data (headers or full traces for short periods) versus traffic sampling. The consensus was that the usefulness of sampling depends on what one is measuring; e.g., sampling 1 in 50 or 100 packets may suffice for link capacity planning, but many research questions, e.g., analysis of interarrival times, will require finer-grained data sets.
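
As a rough illustration of this distinction, the sketch below (using a hypothetical in-memory trace format) estimates link load from a 1-in-N packet sample by scaling the sampled byte count back up; the same sampled subset, however, stretches interarrival times by roughly a factor of N and is therefore unsuitable for interarrival-time studies.

    # Sketch: 1-in-N packet sampling for load estimation.
    # Each packet in the (hypothetical) trace is a (timestamp_seconds, size_bytes) pair.

    def estimate_load_bps(packets, n, duration_s):
        """Estimate average link load (bits/s) from every n-th packet."""
        sampled_bytes = sum(size for i, (_, size) in enumerate(packets) if i % n == 0)
        return sampled_bytes * n * 8 / duration_s  # scale back up by the sampling rate

    if __name__ == "__main__":
        # Toy one-second trace: 1000 packets of 500 bytes each.
        trace = [(i / 1000.0, 500) for i in range(1000)]
        true_bps = sum(size for _, size in trace) * 8 / 1.0
        print(f"true load      : {true_bps:.0f} bit/s")
        print(f"1-in-50 sample : {estimate_load_bps(trace, 50, 1.0):.0f} bit/s")
        # Interarrival times within the sampled subset are ~50x the true spacing,
        # so this sample cannot support interarrival-time analysis.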

Extensive debate centered on definitions of "flow". The 64-second timeout definition proposed by Claffy in the mid-1990s is still used as the default by several groups (including MCI, NLANR, and CAIDA); however, there are no recent studies suggesting that this is (or is not) a reasonable value for measurement of current Internet traffic. While this value should remain user-configurable in measurement devices, additional research is needed into acceptable timeout values for measuring different types of flows. Participants also urged benchmarking of both router-based and stand-alone monitors, with particular attention to accuracy of data and ability to report packets dropped during export of flows.
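
The timeout-based flow definition under debate can be summarized in the short sketch below, assuming a simple 5-tuple flow key and the 64-second default (the key fields and packet format are illustrative, not any particular tool's implementation): a packet extends an existing flow if it arrives within the timeout of that flow's previous packet, and otherwise starts a new flow.

    # Sketch of timeout-based flow counting (64-second default, user-configurable).
    # Each packet is (timestamp_seconds, flow_key, size_bytes); the flow key is an
    # illustrative 5-tuple of (src IP, dst IP, src port, dst port, protocol).

    def count_flows(packets, timeout=64.0):
        last_seen = {}  # flow key -> timestamp of the most recent packet
        flows = 0
        for ts, key, _size in sorted(packets, key=lambda p: p[0]):
            if key not in last_seen or ts - last_seen[key] > timeout:
                flows += 1  # idle longer than the timeout: start a new flow
            last_seen[key] = ts
        return flows

    if __name__ == "__main__":
        key = ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
        pkts = [(0.0, key, 40), (10.0, key, 1500), (100.0, key, 1500)]
        print(count_flows(pkts))               # 2 flows with the 64 s default
        print(count_flows(pkts, timeout=200))  # 1 flow with a longer timeout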

Status of Analysis Activities

Several individuals made short presentations on the status and findings of their passive measurement projects. The presentations tended to focus on relevance of analysis of passive data to:

  • designing networks
  • operating networks
  • research (next generation hardware, software, protocols), or
  • billing or accounting

Nevil Brownlee (U. of Auckland) described the architecture proposed for flow metering under the Internet Engineering Task Force (IETF) Realtime Traffic Flow Measurement (RTFM) working group and the implementation status of the tool NeTraMet. Nevil reviewed the key attributes of bidirectional TCP flows measured by NeTraMet. One of the newest areas of research uses NeTraMet for performance analysis by passively measuring packet loss using a byte loss percentage (BLP), defined as 100 times the difference between sent and acked bytes, divided by the number of acked bytes. These results can then be aggregated by AS number to generate estimates of loss over specific autonomous systems at a specific point in time. Users can easily collect distributions of turnaround times (as an indication of host/network response times) for non-TCP flows using the distribution of times between successive packets in each direction. TCP flows, however, require the use of packet pairs, where the first packet sends data which is then acknowledged by the second packet. Brownlee provided illustrations of packet loss analyses/distributions using both passive measurements (NeTraMet) and active measurements (the Surveyor monitor). He suggested that NeTraMet's BLP and TCP-derived turnaround times can make important contributions to performance analysis and merit additional research/development. Presentation
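
A minimal sketch of the BLP arithmetic as defined above (not NeTraMet's actual implementation; the byte counts are illustrative):

    # Sketch of the byte loss percentage (BLP) described above, computed from
    # passively observed TCP byte counts. Inputs are illustrative.

    def byte_loss_percentage(sent_bytes, acked_bytes):
        """BLP = 100 * (sent - acked) / acked."""
        if acked_bytes == 0:
            raise ValueError("no acknowledged bytes observed")
        return 100.0 * (sent_bytes - acked_bytes) / acked_bytes

    if __name__ == "__main__":
        # e.g., 1,050,000 bytes seen leaving the sender, 1,000,000 bytes acked:
        print(f"BLP = {byte_loss_percentage(1050000, 1000000):.1f}%")  # 5.0%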

IBM uses Lotus Notes for its international corporate infrastructure, explained Sigmund Handelman; analysis of this traffic can therefore contribute to optimization of enterprise networks such as IBM's. Using the SNMP MIB structure (rule and flow tables, etc.), the NeTraMet meter reader can collect SNMP data for use in defining daily traffic patterns and mixes and as inputs to traffic models. The results of Handelman's research suggest that traffic peaks within the IBM network are closely correlated with perceptions of delay expressed by clients. Analysis clearly identified LAN choke points, resulting in upgrades through inter-site PVCs. The ability of measurement devices to focus on specific traffic, e.g., Lotus Notes, can contribute to an organization's ability to use (and evaluate) differentiated service classes in the future. Presentation

The National Laboratory for Applied Network Research is actively deploying dozens of OC3mon and OC12mon monitors at campus high performance connection sites. Hans-Werner Braun (UCSD/NLANR) described this National Analysis Infrastructure (NAI) and how NLANR's Measurement and Operation Analysis Team (MOAT) is using these data to assist NSF-supported research and education networking efforts. Two-minute header traces are collected every three hours, synchronized across machines. These data are anonymized, stored at SDSC, and accessible to the public through a Datacube file structure by specifying origin site, project, date, and/or measurement time parameters. Braun reviewed the trade-offs between specific forms of analysis (real-time analysis, local staggered access, or central analysis), and described the various means of data collection (Ethernet, FDDI, DS-3, OC-3, OC-12). The University of Waikato's DS-3, OC-3, and OC-12 DAG cards will undergo testing and evaluation during March/April. POS drivers for these cards are of particular interest to NLANR and to some of its collaborating universities. Presentation

Wenjia Fang (Princeton University) described her use of Coral data from the NAI to study inter-ISP traffic behavior, providing charts of flows between hosts, networks, and ASes. Today's commercial Internet bears little resemblance to the old NSFnet due to the adoption of CIDR and longest-prefix-matching algorithms in routers, deployment of BGP and AS organizing principles for routing, and the sheer increase in the number of hosts, routers, ISPs, and volume of traffic. Of key interest to her research are issues of geographic uniformity of flows. If spatial locality of traffic is widespread, she asserts, there are opportunities for aggregation of flows by network and AS into "jumbo flows", supporting the cost-effectiveness of reserving bandwidth for these flows. A tool named fs2jumboflows is then used on trace data to map IP addresses to their respective networks and ASes for analysis against routing table data from Merit. Presentation
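
The core of such a mapping, resolving each IP address to its routed prefix and origin AS by longest-prefix match against a routing table, can be sketched as follows (the routing entries and AS numbers below are toy examples, not Merit data or the fs2jumboflows implementation):

    # Sketch: map IP addresses to (prefix, origin AS) by longest-prefix match.
    # The routing entries below are toy examples, not actual routing-table data.
    import ipaddress

    ROUTES = [  # (prefix, origin AS)
        (ipaddress.ip_network("192.168.0.0/16"), 64500),
        (ipaddress.ip_network("192.168.64.0/18"), 64501),
        (ipaddress.ip_network("0.0.0.0/0"), 64499),  # default route
    ]

    def lookup(addr_str):
        """Return the (prefix, AS) pair of the longest matching route."""
        addr = ipaddress.ip_address(addr_str)
        matches = [(net, asn) for net, asn in ROUTES if addr in net]
        return max(matches, key=lambda m: m[0].prefixlen)

    if __name__ == "__main__":
        for ip in ("192.168.70.5", "192.168.1.1", "10.1.2.3"):
            net, asn = lookup(ip)
            print(f"{ip} -> {net} (AS{asn})")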

Glenn Sager (SDSC/PICS) is developing security applications both for the Coral OC12 monitors and for the cflowd tool. With the advent of ATM switches, encapsulation, and higher speed networks, monitoring for security purposes is becoming increasingly difficult. For security monitoring, details on source and destination addresses, the start time of flows, and the destination port are very useful; these data are all readily obtainable through the cflowd and Coral tools. cflowd provides a vehicle for security monitoring at OC3 and lower speeds, including use for detection of sparse, long-duration scans, site traffic profiling and anomaly detection, and identification of link layer forged-source denial of service (DOS) attacks. Results can be linked to an active policy-enforcement module such as Stanford's 2swatch tool for generating security alerts and periodic reports. The lack of access to packet payload data through cflowd limits some forms of security analyses requiring peering into the content of the packet. Presentation

Ramon Caceres (AT&T Research) described measurement and analysis of AT&T Worldnet traffic using "PacketScope" on FDDIs to monitor dialup, server, and backbone traffic passively. The PacketScope hosts are 500 MHz Alpha workstations, supported by a 10 GB disk array and a 140 GB tape robot. Data collected from a specific link is sent to AT&T Labs via a dedicated T1 line to ensure that transmissions do not affect collection. PacketScope gathers per-packet trace information using a modified version of tcpdump and timestamps of 10 microsecond resolution. The tool filters packet streams to reduce data volume. It takes continuous traces for arbitrary periods, comprising up to 150 million packets with normally less than 0.3% packet loss. Considerations of security and network integrity are critical in designing architectures for this type of measurement, including receive-only capabilities on the monitors, encryption of IP addresses, and independent means of data retrieval. PacketScope has contributed to analyses of web proxy caching, IP traffic over flow-switched networks, multifractal traffic analysis, and related studies; see papers at http://www.attresearch.com/~{ramon, jrex, anja}. Presentation

Will Leland (Bellcore) described a model of Net News traffic across a modem-based network. The NNTP model was developed by Bellcore's Teunis Ott using NNTP data from Southwestern Bell's network. Bellcore analyzed a sample of approximately 700 flow pairs covering part of a day's traffic from a set of modems. Most packets were on the order of 1500 bytes in size, with HTTP traffic representing roughly 50-80% of the total during the sample period. The NNTP traffic tended to be heavy-tailed only in the number of articles accessed per session. Presentation

David Moore (CAIDA) summarized CoralReef -- a comprehensive software suite for analysis of passive Internet traffic data. CoralReef is being developed by CAIDA for public release in March 1999. It includes drivers, libraries, utilities, and analysis software, as well as basic web report generation forms, examples, and capture software. Interfaces for other monitors and tcpdump formats are also being developed. The goal is to have CAIDA developers maintain this software, with future development occurring with the support and collaboration of the Internet measurement community. CAIDA's Coral project is focused on the OC3mon and OC12mon collection tools implemented on Intel-based workstations running FreeBSD (DOS versions are being maintained by MCI as part of the vBNS activities). CAIDA's plans include development of an OC48mon monitor (development is continuing jointly with Joel Apisdorf of MCI Worldcom and the University of Waikato under CAIDA's NGI project). CoralReef will soon include support for the DAG-2 network cards, which support OC3mon and OC12mon on both POS and ATM networks.

SESSION DISCUSSIONS: Discussions focused on the role of TCP sequence numbers as an indicator of "goodput". Matt Mathis suggested that by analyzing the actual sequence numbers of the packets received, the percentage of packet loss can be passively determined.
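
A minimal sketch of that idea (ignoring sequence-number wraparound, reordering, and SACK; the segment tuples are illustrative): track the highest sequence number seen in one direction, treat any segment starting below that high-water mark as retransmitted data, and use the retransmitted fraction as a passive loss estimate.

    # Sketch: inferring retransmissions (and hence loss) from passively observed
    # TCP sequence numbers, per the discussion above. Ignores wraparound,
    # reordering, and SACK. Segments are (seq, payload_len) for one direction.

    def retransmission_fraction(segments):
        highest = None  # highest byte sequence number seen so far
        total = retrans = 0
        for seq, length in segments:
            total += length
            if highest is not None and seq < highest:
                retrans += length  # starts below the high-water mark: (partly) resent data
            highest = max(highest or 0, seq + length)
        return retrans / total if total else 0.0

    if __name__ == "__main__":
        segs = [(0, 1460), (1460, 1460), (1460, 1460), (2920, 1460)]  # one retransmit
        print(f"retransmitted fraction: {retransmission_fraction(segs):.2%}")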

Checksum analysis of individual packets (IP, TCP, and UDP) was also discussed. Checksum analyses are important for trend analysis and for monitoring of static PVCs.

Additional areas requiring research include visualization of protocol traffic and relationships among networks. Also required are improvements to processor speeds and a way to cope with disk interface limitations for stand-alone monitors.

Panel on Tools

Nevil Brownlee (U. of Auckland) chaired the panel, with Ramon Caceres, Daniel McRobb, David Moore, and Marc Pucci as panelists.

The rapid pace at which the Internet is evolving poses several challenges. The panel focused on the critical challenges inherent in emerging backbone networks and on differences between ISP and user requirements. Highlights of this discussion include:

  • size and speeds of emerging networks - with the speed of backbones increasing to OC48 and OC192, monitoring these networks for management, billing, problem diagnosis or other purposes is becoming increasingly difficult due to hardware and software limitations, e.g., the absence of capture cards capable of OC48 speeds, I/O limitations on the host, disk and RAM constraints, etc. As one participant put it, "we'll need an 18 wheeler to simply transport the monitoring equipment".
  • IPsec - the IETF is fostering development and deployment of encrypted data streams using IPsec, which will virtually eliminate the utility of the current traffic characterization tools. Alternatives include the placement of probes at the locations where services are being provided (ends vs. core of the network) and the development of tools that can recognize applications or protocols based on various behavior characteristics, e.g., behavior characteristics of email vs. video vs. games. Nevertheless, individuals could also subvert these techniques by layering, e.g., IP on IP on UDP/Audio/FTP etc. McRobb noted that ISPs are mostly interested in the size of the payload, not its content. Therefore, with the exception of capacity planning, ISPs are not overly concerned about the implications of IPsec for traffic monitoring. In the long-run, the real issue is that the more valuable the instrumentation, the more willing the IPsec folks should be to devise compromise solutions. Steve Bellovin, for example, is proposing the export of key header fields as clear text for measurement purposes. This solution will only be pursued, however, if ISPs demand it. [Ed: This idea unfortunately got a lukewarm reception at the TF-ESP session of the Minneapolis IETF meeting.]
  • Port ID - Characterizing traffic based on port numbers is becoming increasingly difficult (a minimal sketch of this naive approach follows this list). For example, real audio and video streams often start on port 80 (typically associated with HTTP traffic), then RTSP dynamically negotiates the ultimate port within a 200-port range (including AFS ports). One example given was CMU traffic that appeared to be real audio traffic, but was in fact AFS traffic. The increasing usage of UDP for games like Quake also makes identification difficult. (Caceres noted the strong potential utility of tools that could parse traffic from RTSP and similar session control protocols, then dynamically set up packet filters to capture traffic on the dynamically negotiated ports.)
  • encapsulation and layering - Pucci described a case of severe protocol abuse where packets were found to have seven layers of layer 2-4 protocol encapsulated over frame relay frames. Similarly, virtual private networks (VPNs) are proliferating. VPNs' use of encapsulation layers and in-house applications, e.g., video-conferencing, significantly complicates identification of traffic types.
  • SLAs - Increasing customer expectations are forcing the need for standardized metrics for service level agreements (SLAs) and for tools to measure and validate these metrics. Specifically, tools are needed that reflect and pinpoint problems in the network. Additional definitions of flow based on TOS byte or differentiated service may be needed soon. Validating SLAs using differentiated services will be particularly difficult given that performance can be directly affected by traffic shaping (backbone) and policing activities (edges).
  • timing - As network speeds increase, data accuracy becomes increasingly difficult to ensure due to "clock issues". Many current efforts use the network time protocol (NTP), which is accurate only to roughly millisecond granularity. More detailed measurements would require global positioning system (GPS) clocks; however, this option is difficult due to requirements for installation of antennas.
  • billing and settlements - Accounting is increasingly viewed as an important area for development by ISPs, however, it is very difficult in transit environments and in environments with speeds of DS3 or higher. Deployment of lots of independent monitors on backbone links is difficult and the alternative router-based measurement is complicated by a lack of comparability across routers, the pace of change within router products, and the general lack of support for measurement issues from vendors.
  • flow definitions - The lack of standard definition for flow complicates standardization of passive metrics. Alternative definitions include TCP syn/fin transactions, individual connections, or timeout values.
  • monitors - ISPs like SBC want to measure on the edge of networks and characterize traffic in order to determine opportunities for deployment of technologies such as web caching. For SBC, ensuring that measurements are not biased is an important rationale for using independent monitors. For BBN, however, the focus is on measurements at the core of the network, for which external boxes are not viewed as an option. Cost recovery will be the factor driving ISP deployment of monitors anywhere. Currently, the focus of ISPs is on meeting exponential demand -- when demand slows however, ISPs will refocus on issues of recovering costs for their buildouts. Sampling and its use for specific applications, ranging from capacity planning to billing, will also need to be reexamined.
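
The naive port-based classification referred to in the Port ID item above can be sketched as follows (the port-to-application table is illustrative): it works only for statically assigned ports, so traffic on dynamically negotiated ports, or applications tunneled over port 80, is misclassified unless the monitor also parses the session-control exchange.

    # Naive port-based traffic classification (illustrative mapping only).
    # As discussed above, this breaks down for dynamically negotiated ports
    # (e.g., RTSP-negotiated media streams) and for traffic tunneled over port 80.

    WELL_KNOWN_PORTS = {
        25: "smtp",
        53: "dns",
        80: "http",
        119: "nntp",
        554: "rtsp",
    }

    def classify(src_port, dst_port):
        """Guess the application from whichever port is well known."""
        for port in (dst_port, src_port):
            if port in WELL_KNOWN_PORTS:
                return WELL_KNOWN_PORTS[port]
        return "unknown"

    if __name__ == "__main__":
        print(classify(34567, 80))     # 'http' -- but could be tunneled streaming media
        print(classify(34567, 40212))  # 'unknown' -- e.g., a negotiated media or game port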

Panel on Passive Analysis Priorities

Anja Feldman (AT&T Research) chaired the panel, with Paul Barford, Ted Hardie, Aaron Klink, and Matt Mathis as panelists.

The panel explored alternative requirements for passive measurements. Mathis described TCP-related research questions concerning the reconstruction of TCP state (TCP sequencing) and TCP interactions with the transport protocols, and the importance of these techniques for traffic characterization and performance analysis.

Hardie stressed the need to use passive measurements to assist in making peering as effective as possible. For NAP purposes, it is important that only data useful for engineering and accounting purposes be available -- data on user interactions are to be avoided due to liability and related issues.

Barford discussed the community's need for analysis of trends in packet trace data. Using Boston University data from 1994, 1995 and 1998, for example, BU analysts assert that the opportunities for web caching have decreased for their campus. Caceres urged the public release of traces or characterization of traces in order to compare different views of the Internet. Mathis cautioned that standardized definitions and formats for data collection are important for comparisons of data across locations and time. Further, such comparisons require well-synced clocks on tracers, to UTC if possible.

Techniques for analyzing how people use the Web are emerging and include methodologies for inferring server performance based on information available at the client site. Analysis of HTTP logs and network performance suggests a strong correlation between the two. Use of statistical methodologies for characterizing user behavior is difficult due to the size of the datasets and the inadequacies of current statistical techniques in depicting Internet traffic behavior (e.g., whether Internet, or HTTP, traffic is fractal in behavior).

How much data should be retained, if any, engendered a lively discussion among participants. Mathis characterized network measurement as "a continuing lesson in humility" and suggested that practices are evolving at such a pace that only real-time analysis is warranted. In many cases the huge volume of statistical data that higher speed circuits can produce creates formidable storage problems, limiting what users will retain.

Leland noted that the science and engineering communities have different priorities and that we are continually redefining questions to explore with current and past data. Trend analysis, several participants suggested, is sufficient reason for retaining some historic data. The development of predictive models and algorithms for identifying signatures of impending problems is also very important, e.g., when a routing change goes from causing jitter-induced discomfort to causing an actual outage (disconnectivity). It is uncertain whether we are at a point where models can be developed that reflect long-term changes within the Internet. Catastrophic events, such as a backhoe accident, cannot be predicted and will continue to require use of active measurements for diagnosis.

Issues of passive data collection were explored, with concerns expressed about in-band versus out-of-band transfer of data. The legal issues associated with collection are also of concern, with participants agreeing that data collected for traffic engineering purposes are acceptable, but care is required to ensure privacy of IP addresses. One-way encryption was suggested as a possibility; however, in the event of a later billing dispute, the question of who the firm was communicating with may be important. Service level agreements and monitoring of differentiated services or multicast services may also require retention of significant amounts of IP address-related information. All agreed that address encryption is an important consideration for traffic monitoring.

How to verify traffic measurement data is also a concern. SNMP switch counters are frequently used to check utilization, but participants noted the significant differences between readings from the switch and alternative measurements provided by NetFlow exports. In one example, NetFlow reported traffic throughput at 30 Mbps, while the reading provided by the switch was 45 Mbps. This variance can be attributed to a variety of factors, including:

  • SNMP encapsulation,
  • frequent inaccuracies in MIB counters on routers/switch,
  • the fact that NetFlow only measures TCP data,
  • the effect of a specific next hop on the counter, and
  • the probability that NetFlow is dropping some flows (it is unclear whether some flows are dropped disproportionately to others).
Given the growing use of usage-based billing, the lack of accurate, verifiable meters is problematic. Benchmarking of the various measurement alternatives (measurement tools, routers, and MIB counters) is necessary.
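
One simple form of such a cross-check, comparing the byte rate implied by successive interface octet-counter readings against the byte rate summed from flow export over the same interval, is sketched below. All values are hypothetical; a real check would poll the router's interface MIB and sum exported flow records over the same polling window.

    # Sketch: cross-check a flow-export byte rate against SNMP interface octet
    # counters. All numbers below are hypothetical; a real check would poll
    # the interface octet counters and sum exported flow records over one window.

    def counter_rate_mbps(octets_t0, octets_t1, interval_s, counter_bits=32):
        """Average bit rate implied by two counter readings, handling one wrap."""
        delta = (octets_t1 - octets_t0) % (2 ** counter_bits)
        return delta * 8 / interval_s / 1e6

    def flow_rate_mbps(flow_byte_counts, interval_s):
        return sum(flow_byte_counts) * 8 / interval_s / 1e6

    if __name__ == "__main__":
        interval = 300  # 5-minute polling interval
        snmp = counter_rate_mbps(1200000000, 2887500000, interval)
        flows = flow_rate_mbps([450000000, 450000000, 225000000], interval)
        print(f"SNMP counters: {snmp:.1f} Mbit/s")
        print(f"flow export  : {flows:.1f} Mbit/s")
        if abs(snmp - flows) / snmp > 0.10:
            print("discrepancy > 10%: investigate dropped flows, non-IP traffic, etc.")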

Means of anticipating network behavior through simulation, particularly simulations incorporating actual traffic data, are limited. Network simulators capable of depicting and predicting traffic behavior in large-scale networks are rare, and participants did not consider the current state of the art very useful. The best simulators can do at this point is to demonstrate the existence of a specific test phenomenon. For example, BU uses the Vint/ns simulator from ISI/UC Berkeley to test for specific problems associated with large web cache infrastructures; PSC uses ns to prove the presence of corner cases. Bellcore noted that their clients often prefer simulators that contain graphics capabilities appropriate for presentations to management.

Barford noted that persistent connections (Jeff Mogul's work was referenced) can have a serious performance impact on servers, but this does not show up in simulators. Use of certain protocols also shapes an application's signature. For example, NNTP traffic may appear somewhat different depending on whether it is viewed through an Apache web server versus another HTTP server. Participants suggested that a study of these effects as witnessed from endpoints would be useful. Correlation of results from any passively monitored data with specific events, network topologies, routing data, and network management systems would also likely be quite illuminating.

Panel on Measurement Specification for Routers

kc claffy (UCSD/CAIDA) chaired this session, with Randy Bush, Stan Hanks, Daniel McRobb, and Greg Miller as panelists.

kc provided an overview of requirements for a measurement specification for Internet routers and plans for a draft measurement spec. The purpose of the specification is to provide guidance to router vendors on what should be measured and how in order to support specific requirements associated with:

  • capacity planning
  • peering engineering
  • SLA verification
  • tracking topology and routing changes
  • ATM/cell level errors

CAIDA is preparing an initial strawman document for input and development by the community; see https://www.caida.org/tools/measurement/measurementspec/. The document includes details relating to the collection of application-specific data, e.g., reconstruction of full HTTP headers from packet-level data, and flow aggregation parameters, e.g., granularity, timeout values, and configurability options. The focus is on the IP/TCP level, but the specification also covers topics such as queue lengths.

Participants agreed that a common measurement specification would provide customers with leverage to use in requesting measurement functionality from their vendors and would greatly assist comparability of data across platforms. Currently, Cisco is the only vendor whose routers provide detailed measurement data (via NetFlow flow export statistics). However, participants questioned whether these data are sufficient to meet customers' requirements and whether router vendors should be the driver behind the definition of a measurement specification for Internet traffic metrics.

Operational requirements for future Internet routers -- which customers may pay for -- include data supporting:

  • autonomous system (AS) matrices, encompassing next hop AS, origin/destination AS, city-pair, peer details
  • traffic import/export statistics by flow, MAC accounting
  • queue length information focusing on packet drop counters on interfaces -- particularly relevant to capacity planning (MIB-2 implemented in the router may provide some of this information, e.g., discards, though not input and output queue lengths)
  • basic traffic characterization (while less important to engineers, this has an important role to play in new hardware and software development and deployment)
  • support for traffic engineering, including MPLS and beyond
  • support for QoS and Differentiated Services, including data on latencies and packet loss
  • support for SLA compliance monitoring associated with connectivity, delay, and dropped packets
  • security/vulnerability related support, e.g., post-processing or on-card kernel packet filtering against DOS attacks and other threats
Research, or longer-term analysis requirements that might be addressed in a measurement specification (and preferably through a stats router i/f card) include:
  • interarrival times for distribution analysis and identification of packet run lengths
  • protocol relevant data, including information on TCP retransmissions and sequencing and packet sizes
  • routing/address space coverage, indicating the efficiencies and overall development patterns of infrastructure as a whole
Network simulation tools are currently being used to model larger networks, e.g., to compare caching algorithms, but such modeling is not universally respected. Some measurement researchers are looking into identifying types of traffic by looking for typical patterns/behavior instead of, or in addition to, extrapolating from port numbers. It is hard to instrument large ISPs with passive monitors and to maintain monitor deployment in the face of constant upgrades to their infrastructures.

Sampling and aggregation of data before it is sent from the router is critical, especially in high capacity networks/routers/switches. McRobb described analyses at ANS in the mid-90's that validated packet sampling for capacity planning purposes, but suggested that new studies are needed to revalidate acceptable rates given current network speeds and complexity. Verification of the numbers coming out of measurement tools is also very important, particularly for currently available tools such as NetFlow, cflowd, Coral/OCxmons, and MIBs. How much data should be retained from these tools will remain a problem, requiring individual networks to balance the gathering and storing of large quantities of traffic data not only against technical support for the technology, but also against protection of user privacy and limitation of corporate liability.

Hanks (Enron) prophesied that new financial models will soon emerge in which bandwidth will be traded like a commodity, e.g., natural gas and electricity. This in turn will require more accurate tools for monitoring service delivery between two points, with the measurement focus being on metrics of throughput and packet loss. SLAs are being fielded now (e.g., UUnet claims 85 ms RTT within their U.S. backbone, 120 ms NYC to London, outage notification within 15 minutes, and `guaranteed' 100% uptime for your connection if they provide the local loop) and, importantly, they require measurement functionality in routers to support them. MPLS is coming, as is IPsec. Diagnostic issues will be increasingly difficult to trace and solve, requiring tools that can look inside the label at the IP header. Participants agreed that it is incumbent on groups monitoring and analyzing packet data to publish useful analyses of plain text packet headers as proof to the IPsec community that they should make these data available in the future.

Active measurements are critical to diagnosis of network problems, as well as to performance monitoring by ISPs, customers, and others. Many of these active measurement tools are ICMP-based, which poses a research integrity problem. Cisco routers, in particular, process ICMP packets on the router CPU, which is a different code path from the one normal Internet traffic travels. To defend against ICMP-based denial-of-service attacks, and also to support firewalls intended to hide the topology of internal enterprise networks from the global Internet, many ISPs have disabled the processing of ICMP on their routers. The solution, supported by all participants, is for ICMP traffic to be processed by line cards on the router (e.g., as with Bay Networks'), not by the CPU. Ensuring that future routers adequately support active measurements and SLAs should be an important design goal for vendors.

Participants requested that the current draft measurement specification document be divided into two parts: one general, focusing on the objectives, and one detailed, defining critical metrics and recommending implementations of specific router capabilities. CAIDA will draft the first document, based on comments at this meeting and inputs from the community. Development of the second document will be suggested to IETF leadership as a standards-based effort appropriate to an IETF working group.


Workshop Organization/Acknowledgments

This ISMA was organized by Amy Blanchard of UCSD/CAIDA, with technical and multicast support from Jay Dombrowski, Sean McCreary, and Brendan White. Many thanks to the staff at SDSC and UCSD who contributed to ISMA's success.

Additional focused ISMA meetings are planned by CAIDA: ISMA Network Visualization (April 15-16, 1999) and ISMA Active Measurements and Analysis (Fall 1999). For more information, contact info @ caida.org.

Related Objects

See https://catalog.caida.org/paper/1999_isma9901/ to explore related objects to this document in the CAIDA Resource Catalog.