NSF 98-120: Project Description
This page contains the Project Description for the CAIDA proposal entitled: "Correlating Heterogeneous Measurement Data to Achieve System-Level Analysis of Internet Traffic Trends."
Principal Investigator: kc claffy
Funding source: NSF 0137121 Period of performance: September 15, 2002 - February 28, 2009.
1 The Challenge: Characterizing Internet Traffic Trends
The transition of Internet infrastructure from NSF stewardship to a competitive service industry left this incredible resource with no framework for system-level analysis of wide-area, cross-domain Internet traffic behavior. Nonetheless, development of applications using data and computing resources distributed throughout the Internet is underway. Development occurs, at incalculable risk, in spite of a lack of Internet traffic models based on real data.
Competitive Internet providers, struggling to meet burgeoning demands of customers for additional services, do not significantly invest in gathering or analyzing workload data on their networks. Instead, Internet service providers match rising demand by increasing network capacity as fast as possible; today's core backbone links are OC48 and will be OC192c by 2002. This `traditional' approach is primarily based on brute force over-engineering. For example, ISPs simply upgrade after reaching a certain link utilization level, rather than examining parameters of how network capacity is actually utilized, or determining if link use is efficient.
The lack of specific traffic flow parameters or any realistic model of Internet traffic is a situation that shows little sign of changing without a substantial shift in attention toward the task. There is as yet no instrumentation available for gathering fine-grained workload information from anything above OC12 bandwidth links.1 The few high-speed links that are monitored are typically found at lightly utilized research sites. Larger providers have little incentive to invest in such instrumentation, much less risk political damage by making any resulting data public. The lack of rigorous analysis tools to support wide-area Internet data collection, and the absence of baseline data against which to compare any independent results serve to further dissuade efforts to collect data.
As a result, evaluation of macroscopic workload trends, and systematic preparation for the growing expectations of Internet users, is not possible today. Lack of historic or current data providing a cross-domain characterization of traffic on the wide-area Internet prevents accurate projection of the network's evolution. Existing projections either have no empirical basis or are based upon small data sets from few locations with no justification for claiming to be representative of larger scale infrastructure. Without cross-domain analysis, we cannot determine the extent to which local phenomena (e.g., caching, routing flaps, flash events, denial-of-service incidents) correlate to global Internet behavior.
Overcoming Myths and Obsolete Assumptions
Globally relevant measurement requires: 1) research into methods for classifying, archiving, and retrieving data from massive, distributed datasets, 2) improvements in both measurement and traffic characterization methods, and 3) analysis software capable of correlating and visualizing data in time to be useful for traffic engineering purposes. The CAIDA team consists of experts in network measurement, systems engineering, and data analysis.
The P.I. for the proposed effort has published a number of studies involving the collection and analysis of massive datasets monitoring heavily used research and commercial Internet links. Team members have years of experience engineering systems for Internet measurement, and analyzing data from both active and passive measurement infrastructures.
1.1 Our Mission
The research team's goals:
- Utilize multiple deployed and tested NSF-funded networking technologies.
- Establish a network measurement meta-data repository to facilitate access to results as well as raw data by both the Internet research community and application developers. Support this with an annotation system applicable to other researchers' demonstrably relevant Internet data sets.
- Enable testing of network traffic analysis methodologies to determine which parameters and attributes are vital to network management. Create a language for labeling and annotating data sets.
1.1.1 Relevance to Present State of Knowledge and the Future of the Internet
There are several currently deployed Internet measurement infrastructures having various intents and scope of analysis.2 Several sources of Internet data exist, each focused on specific aspects of workload, topology, performance, and routing, but subject to significant limitations. For example, NLANR/Moat, through its Passive Measurement Analysis (PMA) program, has been archiving packet header trace samples (under 2 minutes each, several times a day) from OC3mon and OC12mon devices located at NSF-sponsored High Performance Computing institutions, typically college campuses attached to the vBNS or Abilene backbones. One strength of this data is the large scale of its deployment. Monitors are located at more than 20 HPC campus measurement points, supporting several research and analysis projects. However, there are limitations for projects wishing to utilize this data. The data mining tasks involved are formidable: the topological situation of each campus measurement device is not standard, and sometimes not well-documented. Trace formats have changed over time, and thus conversion utilities are necessary to analyze long-term trends. Users have also expressed interest in longer traces (e.g., greater than five minutes) for some time. Though it is technically possible to capture both directions of traffic flow on a link for hours, bidirectional flow analysis is difficult without clock drift compensation across the two data collection interfaces. Furthermore, PMA archiving policy requires IP address sanitization, which precludes answering any question involving geography or actual topology. While there are legitimate privacy concerns, CAIDA has been able to navigate them successfully while carrying out research involving geographic and topological information using unsanitized data. In addition to needing geographic data, traffic profiling would benefit from correlating the results of many disparate sources of data.
Strategic use of distributed data sets could enhance the ability to detect and model network anomalies and trends, improving the ability to predict the effects of external hardware, software, security, and news events.
Current papers that propose new techniques and protocols often make assumptions about traffic characteristics that are simply not validated by real data. The proposed meta-data repositories will allow researchers to investigate hypotheses about the level of fragmented traffic, encrypted traffic, traffic favoritism, path symmetry, address space utilization and consumption, directional balance of traffic volume, routing protocol behavior and policy, distribution statistics of path lengths, flow sizes, packet sizes, prefix lengths, and routing announcements. In cases where analysis is based on locally generated academic data sets, attempts to generalize typically lose integrity when applied to additional real-world data sets. The community could make better use of its collective intellectual resources if they could test hypotheses against a larger variety of empirical data sets before investing research and development time and energy into specific studies.
1.1.2 Impact, Innovations, and Longer-term Goals
The proposed meta-data repositories will seed a new generation of Internet research, putting the community in a position to significantly accelerate the pace of progress of measurement-based network research. The project yields the opportunity to place realistic network measurement data within easy reach of the community of researchers and application developers most likely to benefit from it. Further, the proposed project solves some problems with current measurement projects that limit their utility, increasing return on NSF investment in those projects.
In the next decade, the need to access and manage massive heterogeneous tracefile datasets will increase dramatically. An annotation and storage system suitable for distributed repositories of demonstrably relevant Internet data sets, in conjunction with a language for describing traffic phenomena along a variety of dimensions, will support the field of network research for the foreseeable future. The network measurement meta-data repository will allow for cross-domain characterization of wide-area Internet traffic, and evaluation of macroscopic trends in workload, performance, and routing behavior. Researchers will have the opportunity to correlate data across time, space (trace location), and data features. Such studies will yield predictive models that can then be applied back to the measurement tools to improve their operational utility. The ability to coordinate data collection from multiple sites in the community will also provide a facility to track distributed security attacks more effectively, and to assess potential consequences of introducing new or emerging protocols and technology into current networks. It will also facilitate the study of hybrid techniques to support technologies that have been resistant to solution, e.g., the use of real-time passive measurements in correlation with real-time active measurements to support realistic and enforceable Service Level Agreements (SLAs), or bandwidth estimation techniques. (See related proposal of PIs.)
While the proposed work collects data needed to reach long-term research goals, its most immediate and yet presumably lasting effect will be the ability to base proposed new techniques and protocols on empirical traffic data rather than assumptions - before investing research and development time and energy on them. In short, the proposed effort places us in a safer position to project as well as control characteristics of the network's evolution.
As described in Section 1.1.1, there is no dearth of Internet measurement data. On the contrary, additional data will not help without a rational architecture for 1) collecting measurement data, 2) storing, processing, indexing, and searching that data, and 3) making data accessible to a wide variety of users. Developing such a system will provide network researchers, application developers, and traffic engineers with a fundamentally new vantage point, from which they can accurately refute or confirm crucial assumptions about Internet traffic, behavior, and development.
This proposal is motivated by the recognition, shared by many in both the research and operational communities, that understanding Internet behavior and trends requires a carefully designed collection of data. Most research efforts that need Internet data for experimentation and validation (e.g., packet traces, flow export records, macroscopic topology, performance, or routing information) require large data sets, easily several Gigabytes for a single data file. The NLANR/Moat project alone has almost a Terabyte worth of archived (un-annotated, un-indexed) data. The proposed work described below will facilitate access, archiving, and long-term storage of such data sets.
2 Specific Goals
2.1 Goal 1: Deploy Strategic Internet Measurement Instrumentation
Frequent passive header capture from a statistically significant number of monitors with any reasonable amount of traffic is unsustainable. We must design a strategic approach in terms of trace schedules and duration, post-processing, analysis, visualization and archival to minimize system management requirements. We also need to provide maximally representative data sets.3
NLANR/Moat's PMA measurements provide high-precision brief traces of HPC site packet headers, with addresses anonymized to protect user privacy. The DAG project network measurement cards support high-precision timestamping and clock synchronization, which allows collection of longer bidirectional traces. We propose to complement the current PMA measurement program with strategically located commercial measurement sites, and to corroborate passive data with other types of measurements (e.g., active probing, routing tables). We will gather longer (multiple-hour) traces on several high bandwidth commodity backbone links utilizing card-to-card synchronization to prevent clock drift and allow for bidirectional flow-based analysis of Internet traffic. We will support various levels of aggregation of these traces. Because certain studies might require different parts of the packet headers, we will provide some application-specific traces to support analysis of e.g., streaming media protocol behavior and performance.
CAIDA continues to support CoralReef, a publicly available comprehensive software suite developed to collect and analyze data from passive Internet traffic monitors, in real-time or from trace files. CoralReef is a package of libraries, device drivers, classes, and applications written in, and for use with, several programming languages. Its architecture makes it a powerful, extensible, efficient, and convenient package for passive data collection and traffic characterization, enabling the addition of tools for correlating with other types of data. CoralReef includes modules for the storage and manipulation of frequently collected data including: source and destination hosts, IP protocols, ports, and amounts of traffic in bytes, packets and flows. CoralReef's demonstrated passive monitoring and analysis capabilities represent a key strength for achieving the proposed goals. CAIDA key technical personnel and collaborators also support other passive measurement tools4, including NeTraMet, cflowd, and FlowScan. These tools support continuous monitoring and archiving of data and can be used for calibration of finer-grained measurements, or benchmarking against commercial statistics collection functionality (e.g., NetFlow).
Network measurement improvements are necessary to apply the meta-data repository to current and emerging research problems, examples of which appear in Section 2.3. For example, we expect to add software modules that can trigger more complete packet capture upon detection of DoS activity.
2.2 Goal 2: Facilitate Community Access to Data Repositories
Rather than trying to support data set storage and access from a central location, we will design an annotation system for the repository in which meta-data for data sets is archived and served from many other sites. We will support a large storage infrastructure at SDSC, but the meta-data repository will multiply its value by accommodating raw data sets from other sites.
We will develop common formats, terminology, and a formal language to allow multiple annotations to a given data set based on independent analyses. External researchers making use of the meta-data repository can then query for specific signatures in data sets, and register their own annotations based on results of their own analyses.
The "Internet Meta-Data Repository" architecture will be sufficiently flexible to support a variety of data sets and include capabilities to add user-defined annotations. Measurement data from tools5 such as skitter, Mantra, CoralReef, cflowd, FlowScan, and NeTraMet, as well as web cache logs, Route-Views, and MRTd will initially seed the meta-data repository. Raw data or meta-data may then be catalogued for distribution from the sites providing the data.
Participants can submit data sets with extensive annotations gathered as the set was collected and processed, including user-perceived performance data or exogenous events occurring during the trace (e.g., users observed that the network was `slow' at time t1, or the campus web cache was disabled at time t2 during this data set). More importantly, subsequent research on the same traces could yield additional knowledge that could subsequently be annotated, rather than only archived in the prose of less-accessible journal or workshop proceedings.
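The annotation model described above can be illustrated with a minimal sketch. The class, field, and function names (DataSetRecord, Annotation, find_by_tag) and the sample values are hypothetical and are not part of any specified design; the sketch only shows how several independent annotations could attach to one data set whose raw data stays at the contributing site, and how a researcher could query for them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """One independent observation about a data set (hypothetical schema)."""
    author: str
    start: float  # seconds from the start of the trace
    end: float
    tag: str      # e.g. "user-reported-slow", "web-cache-disabled"
    note: str = ""


@dataclass
class DataSetRecord:
    """Meta-data for a raw data set that remains at the contributing site."""
    name: str
    site: str
    tool: str  # e.g. "CoralReef", "skitter"
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, annotation: Annotation) -> None:
        """Register one more annotation; earlier annotations are kept."""
        self.annotations.append(annotation)


def find_by_tag(records: List[DataSetRecord], tag: str) -> List[DataSetRecord]:
    """Return the records carrying at least one annotation with the given tag."""
    return [r for r in records if any(a.tag == tag for a in r.annotations)]


# Illustrative data only: names, sites, and times are made up.
trace = DataSetRecord(name="campus-trace-01", site="example-campus", tool="CoralReef")
trace.annotate(Annotation("operator", 120.0, 300.0, "user-reported-slow"))
trace.annotate(Annotation("researcher", 900.0, 1500.0, "web-cache-disabled"))
other = DataSetRecord(name="topology-run-07", site="example-lab", tool="skitter")

matches = find_by_tag([trace, other], "web-cache-disabled")
print([r.name for r in matches])
```

Keeping annotations as a list on the record, rather than as fixed fields, is one way to let later analyses add knowledge to a trace without changing the schema.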
Maximizing the quality of data in the repository will be difficult without ensuring the availability and dissemination of measurement tools with compelling operational relevance, as described in section 2.1. However, one aspect of this project will be to ensure that measurement tools are sufficiently responsive to user needs. While we will support existing tools, our core focus concerns specifying and developing the back end information management system necessary to support a wide variety of tools, traces, and analysis needs for the community. As such, this project provides an opportunity to bring several of the independent measurement projects in the community together.
2.3 Goal 3: Apply Repository Data to Current Research Problems
The proposed Internet meta-data repository, carefully architected and annotated, will significantly advance the possibilities for Internet data analysis and modeling. Without such a repository, Internet research will continue to be handicapped by lack of baseline data calibrated against real traffic behavior. For example, after identifying a given `killer application' protocol, a researcher using the proposed resources could determine when its usage began, and annotate related traces accordingly. The comprehensive nature of the data and the ability to tie different data sets together, will enable us to explore macroscopic questions regarding Internet robustness and efficiency that we cannot answer from single viewpoints.
For example, sets of potential Internet research questions are presented below, roughly organized by analysis category. Many of these questions have political and regulatory relevance (e.g., trade balance of traffic, traffic favoritism, traffic locality). Answers to these questions require the wide variety in type, scale, and global context of data that the proposed meta-data repository will afford:
2.3.1 Workload Trend Research
- For what applications or traffic categories is usage growing most quickly?
- How rapidly are new protocols such as SCTP and RTP being deployed on the Internet? (Such protocols provide alternatives to TCP and UDP for newly emerging services.)
- How much growth is there in tunneling technologies (e.g., encapsulation for IPv6, IPsec, MPLS) and how does this growth impact levels of packet fragmentation?
- To what degree is traffic growth due to more users and to what degree is it due to more traffic/user? (for various definitions of `user', e.g., hosts, prefixes, sites, ASes, and aggregating effects such as web proxies, IP masquerading at firewalls, and de-aggregating effects such as IP assignment on dialup modems).
- How can we classify traffic categories at a semantically higher level, such as behavioral characteristics, without relying only upon inconclusive or even possibly misleading header fields such as TCP/UDP ports? In particular, what traffic classifications are useful for engineering purposes (beyond rudimentary `bulk transfer' vs `interactive') and what characteristics are best used for the classification (e.g., inter-arrival time distribution directional symmetry, packet sizes and directional sequence patterns, initial ports, distribution of destinations per source, address signatures, matrix of host pairs making lots of connections).
- How do different models of flows compare (e.g., SYN/FIN vs timeout-based definitions) for a given trace in terms of statistics such as flow size distributions?
- Is traffic locality changing with growth, e.g., what percent of traffic stays within a campus, region, or country?
- How much global distributed denial-of-service activity is occurring (using information coordinated from multiple sites)? Note that two of the PI/KTPs have recently published a paper quantitatively assessing the degree of global DDoS activity on the Internet based on CAIDA measurements. The methodology described in that paper was recently applied to track hosts infected with several variations of the Code Red worm.
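The question above about comparing flow definitions can be made concrete with a small sketch of a timeout-based definition. The packet representation, the function name timeout_flows, and the sample packets are assumptions made for illustration and do not reflect how CoralReef or any other tool defines flows; a SYN/FIN-based definition would need TCP flag information that this simplified record omits.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

# (timestamp in seconds, flow key such as a 5-tuple, bytes in packet)
Packet = Tuple[float, Hashable, int]


def timeout_flows(packets: Iterable[Packet], timeout: float) -> List[int]:
    """Group packets into flows under a timeout-based definition.

    A gap longer than `timeout` between packets with the same key starts a
    new flow. Packets must be sorted by timestamp. Returns flow sizes in bytes.
    """
    last_seen: Dict[Hashable, float] = {}
    current: Dict[Hashable, int] = {}
    sizes: List[int] = []
    for ts, key, nbytes in packets:
        if key in last_seen and ts - last_seen[key] > timeout:
            sizes.append(current.pop(key))  # idle too long: close the old flow
        current[key] = current.get(key, 0) + nbytes
        last_seen[key] = ts
    sizes.extend(current.values())  # close flows still open at end of trace
    return sizes


# Made-up packets for two keys, "a" and "b".
packets = [(0.0, "a", 100), (1.0, "b", 40), (2.0, "a", 200),
           (70.0, "a", 50), (71.0, "b", 60)]

sizes_64 = sorted(timeout_flows(packets, timeout=64.0))
sizes_128 = sorted(timeout_flows(packets, timeout=128.0))
print(sizes_64, sizes_128)
```

Running the same trace with two timeouts shows how the choice of definition alone changes the flow size distribution: the shorter timeout splits each key into two flows, while the longer one keeps a single flow per key.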
2.3.2 Performance Assessment Research
- Using patterns of acknowledgments, sequence numbers, and advertised window sizes, how much bandwidth is wasted on retransmissions? (e.g., a congestion indicator might be defined based on ACK retransmissions within 12-20 secs compared to the number of outstanding unACKed packets).
- How elastic (responsive to congestion conditions) are flows at various levels of granularity (host, net, autonomous system, city)?
- How common are high-bandwidth flows in the Internet that are not using end-to-end congestion control?
- Are TCP flows really using their entire bandwidth-delay product? How much buffer space should routers allocate for TCP flows?
- What is the macroscopic effect of flash events on Internet traffic behavior, e.g., an unsuccessful presidential election or the transition to gTLD server infrastructure?
- What are performance effects of violations of the traditional end-to-end model, e.g., transparent caching, global load balancing, CDNs?
- How does the DNS system perform, e.g., has the gTLD mesh improved the macroscopic performance for users? Brownlee and Nemeth have done passive and active analysis of the root name server and gTLD mesh over the last year, and are developing robust methods for correlating the two techniques.
2.3.3 Topology Correlations to Workload, Performance, and Routing
- To what countries is the US a net exporter of IP packets? Are these numbers growing or shrinking? 6
- How can we identify (and monitor long-term performance of) critical routers and sites that play a significant (and thus perhaps vulnerable) role in the infrastructure?
- How can we develop a calculus for describing and drawing the difference between two given `snapshots' of network topology?
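The first question in this list, about net exports of IP packets, reduces to a simple balance once flow records carry a country label for each endpoint. The sketch below assumes that labeling has already been done (the proposal notes it requires unsanitized addresses); the record layout, the function name net_exports, and the sample counts are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (source country, destination country, packet count)
FlowRecord = Tuple[str, str, int]


def net_exports(flows: Iterable[FlowRecord], home: str) -> Dict[str, int]:
    """Packets sent from `home` to each country minus packets received from it.

    A positive value means `home` is a net exporter of packets to that country.
    Traffic that stays inside `home` is ignored.
    """
    balance: Counter = Counter()
    for src, dst, packets in flows:
        if src == home and dst != home:
            balance[dst] += packets
        elif dst == home and src != home:
            balance[src] -= packets
    return dict(balance)


# Made-up counts, for illustration only.
flows = [("US", "NZ", 500), ("NZ", "US", 200),
         ("US", "DE", 100), ("DE", "US", 400),
         ("US", "US", 999)]

balance = net_exports(flows, home="US")
print(balance)
```

Repeating the calculation over traces from different periods would show whether the balance for a given country is growing or shrinking.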
2.3.4 Routing and Addressing Research
- What are long-term trends in per-prefix routing table growth? Is there an uneven distribution of traffic exchanged with few sites?
- How is IPv4 address space being announced versus actually used over time? 
- What gives rise to the discrepancies seen between actual traffic behavior (forward paths) and routing policies articulated via BGP?
- What are the macroscopic effects of different multicast architectures, e.g., traditional versus `single-source' multicast (SSM)?
- What are long-term trends in a) per-prefix routing table growth? b) prevalence of packet fragmentation? c) number of globally reachable hosts? d) IP path hop count distribution? e) AS path length distribution? f) traffic flow by prefix length distributions?
2.4 Key Collaborators
CAIDA has developed many collaborative relationships with Internet researchers. Opportunities for sharing data and methods provide benefit to the community at large. For example:
- UO Route-Views
- - CAIDA is collaborating with the University of Oregon's Advanced Network Technology Center's Route-Views project, which provides archives of a union of several dozen unpruned backbone tables.
- - Team members are well acquainted with the principals involved in another NSF project for "Multiresolution Analysis for the Global Internet" that is actively pursuing analysis algorithms and methodologies using both real and simulated network data. The research proposed here complements and offers technologies and tools which help the MRA effort to accomplish its goals. (A letter of support from an MRA principal investigator is attached.)
- University of Wisconsin
- - Collaborator David Plonka developed FlowScan - a network traffic flow visualization and reporting utility. He uses it to monitor and graph flow information from Cisco and Riverstone routers at the University of Wisconsin in near real-time. The FlowScan utility has proven to be very useful for facilitating traffic engineering, and offers much needed support for managing time-series data.
- Waikato University
- - The WAND/WITS Project has developed a network hardware interface and software drivers under subcontract to CAIDA. These DAG Project interfaces support traffic monitoring of up to OC48 speeds. WAND/WITS also publishes Trace data for network researchers.
- ISPs and Vendors
- - To be of lasting effect for the continued evolution of the Internet, we recognize that the measurement and meta-data repository infrastructure will require support beyond the duration of this project. CAIDA has a proven record of effective technology transfer to industry and also of engaging industry in cost-sharing for measurement and analysis activities of direct relevance to their activities. We hope to take advantage of this experience in creating a lasting architecture that will help support Internet infrastructure research and development for at least the next decade.
2.5 Curriculum outreach
As part of this project, CAIDA plans to develop network analysis curriculum materials for undergraduate and graduate use as part of CAIDA's NSF-supported Internet Engineering Curriculum (IEC) repository and Internet Teaching Labs7. CAIDA will also sponsor tutorials and workshops on how to use both the measurement tools themselves as well as the data repository. For several years, the IEC project held curriculum training workshops for professors of Internet classes. Documentation from the proposed tasks would be ideal for incorporation into future workshops, as well as the curriculum repository itself. Curriculum modules for undergraduate and graduate education will be patterned after the successful Traffic Analysis module8 and suggested Projects for Networking Classes page9.
3 Work Plan: Task Goals and Objectives
Senior personnel assigned to each task are listed, with the task lead indicated in bold print. Top-level goals are given by year, along with more specific objectives.
3.1 Task 1: Improve Internet Measurement Instrumentation
Researchers: Brownlee, Claffy, Moore, Voelker, GSR(1)
- Year One:
Deployment of High-Speed Passive Monitors
- Deploy 4-5 passive monitors at strategic high-speed commercial global Internet locations.
- Identify optimal data collection strategies.
- Coordinate movement of traffic measurement data.
- Year Two:
Meaningful, Maintainable Passive Traffic
- Based on community feedback (See Task 3, Year One), run specialized targeted trace collection.
- Correlate active probes with passive measurement (header capture) techniques for developing new Service Level Agreement models.
- Year Three:
Collection and Annotation Refinement
- Coordinate with results to date from Task 2 to automatically post-process, filter, aggregate, annotate, or index collected tracefiles in order to more efficiently facilitate their inclusion into published traffic meta-data repositories.
3.2 Task 2: Develop Distributed Meta-Data Repositories
Researchers: Brownlee, Claffy, GSR(1)
- Year One:
- Data modeling
- Standardize traffic attributes and schemas (consider inclusion of XML DTD or XML Schema definitions) that correspond to the various low-level Internet traffic monitor data.
- Create logical representations for a hierarchy of trace information from low-level trace data fields through user-defined hierarchies of annotations, to even higher event-level grouping relationships and concepts
- Define Internet data collection management strategies for handling distributed repositories and archives.
Objectives for Year One include: a "Community Internet Data Annotation Language" that will be drafted and submitted to the research community for review.
- Year Two:
automatic methods for annotation and
- Evaluate trace data annotation and meta-indexing strategies.
- Standardize the APIs and query interfaces to the traffic data collections. (For example, consider using XPath or XQuery mechanisms to specify analysis and filter methods.)
- Evaluate requirements for interfacing with collection-based persistent archive software.
Objectives for Year Two include: Develop specifications for manipulating and querying Internet data using a new yet "community-based" approach.
- Year Three:
collection for prototypes.
- Publish Internet traffic data at distributed sites.
- Define specifications for data ingestion, storage, and interaction modules using standards-based interfaces.
Objectives for Year Three include: Distributed data collection and publication.
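As an illustration of the Year Two suggestion to consider XPath for query interfaces, the sketch below runs path queries against a small XML catalog of annotated traces. The element and attribute names (catalog, trace, annotation, tag) and the sample entries are hypothetical, not a proposed schema. The example uses Python's standard xml.etree.ElementTree module, which implements only a subset of XPath; a full XPath or XQuery engine would allow richer filters.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: each trace carries zero or more annotations.
CATALOG = """
<catalog>
  <trace id="t1" tool="CoralReef" site="example-campus">
    <annotation tag="dos-suspected" start="120" end="300"/>
  </trace>
  <trace id="t2" tool="NeTraMet" site="example-lab">
    <annotation tag="web-cache-disabled" start="0" end="600"/>
  </trace>
  <trace id="t3" tool="CoralReef" site="example-lab"/>
</catalog>
"""

root = ET.fromstring(CATALOG)


def traces_with_tag(catalog: ET.Element, tag: str) -> list:
    """Return ids of traces that have at least one annotation with this tag."""
    return [t.get("id") for t in catalog.findall("./trace")
            if t.find(f"annotation[@tag='{tag}']") is not None]


# Attribute predicate on the trace element itself.
coralreef_ids = [t.get("id") for t in root.findall("./trace[@tool='CoralReef']")]
# Filter on an annotation attribute of a child element.
dos_ids = traces_with_tag(root, "dos-suspected")
print(coralreef_ids, dos_ids)
```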
3.3 Task 3: Apply Repository Data to Current Internet Research Problems
Researchers: Brownlee, Claffy, Moore, Voelker, GSR(2)
- Year One:
research questions in response to concerns of both ISP operators and high-speed application developers.
- Participate in a separately funded Internet Statistics and Metrics Analysis (ISMA) or other relevant conference for academic and commercial researchers. Survey their research goals and concerns.
- Design data collection experiments to address selected research questions.
- Define and classify parameters and attributes for link capacity measurements.
- Define annotation language.
- Analyze and visualize results. Publish results in technical journals and on web-site.
- Years Two and Three:
- Continue analysis and visualization activities taking into account currently relevant research issues and questions.
4 Previous Results
- "CAIDA: Cooperative Association for Internet Data Analysis". NCR-9711092. $3,143,580. Oct 1997 - Jul 2001. (Brownlee, Claffy, Moore, Murray) This collaborative undertaking brings together organizations in the commercial, government, and research sectors. CAIDA provides a neutral framework to support cooperative technical endeavors, and encourages the creation and dissemination of Internet traffic metrics and measurement methodologies. 10 Results of this collaborative research and analytic environment can be seen on published web pages.11 CAIDA also develops advanced Internet measurement and visualization tools. 12
- "IEC: Internet Engineering Curriculum Repository". ANI-97-06181. $590,555. Aug 1997 - Sep 2001. (Claffy) This CAIDA project helps educators and others interested in Internet technology to keep up with developments in the field. A repository of collected teaching materials is published on the web.13
- "Internet Atlas". ANI-99-96248. $304,816. Jan 1999 - Dec 2001. (Claffy, Murray) This effort involves developing techniques and tools for mapping the Internet, focusing on Internet topology, performance, workload, and routing data. A gallery that presents and evaluates state-of-the-art techniques and tools in this nascent sector is published on the web.14
5 Management Plan
All research participants have significant experience mentoring students. CAIDA staff and previous students have produced software prototypes and production code. In addition:
- Administration of the project will be provided by CAIDA/SDSC.
- Communication will be facilitated by monthly project status meetings. Additionally, personnel involved with each specific task, including post-docs and students, will set a meeting schedule as appropriate for reviewing progress and investigating results of research. CAIDA also has a well-regarded history of remote online collaboration using text-based virtual environments, including one in support of inter-ISP coordination.
- Yearly workshops, funded by other grants, will bring together members of the academic and commercial research community to guide the evolution of yearly goals for analysis and instrumentation. The workshops will be attended by all proposal participants, including junior personnel. We plan to invite CAIDA industrial sponsors and members to the workshops. Additionally, we expect to have smaller joint workshops with other network research organizations (e.g., with the MRA project collaborators). Historically, CAIDA has sponsored a series of workshops on Internet Statistics and Metrics Analysis (ISMA), where both academics and industrial researchers can meet with traffic engineering and operations personnel to discuss issues and solution strategies.15
- Dissemination of Results:
- In addition to publication of research results through scientific journals, we will make results of this project available in several other ways. Data, tools and specifications developed during the course of this project will be made available via the CAIDA website.
6 Conclusion: Significance of Proposed Effort
As it grows, the Internet is becoming more fragile in many ways. The complexity in managing or repairing damage to the system can only be navigated with sustained understanding of the evolving commercial Internet infrastructure. The research and tools proposed under this effort lead to such insights. In particular, richer access to data will facilitate development of tools for navigation, analysis, and correlated visualization of massive network data sets and path specific performance and routing data that are critical to advancing both research and operational efforts. We also expect to be able to offer suggestions to ISPs and routing vendors with respect to what instrumentation within the router would facilitate diagnosing and fixing problems in [closer to] real-time. Finally, this research has obvious relevance to public policy and regulatory questions regarding concentration of administration of Internet infrastructure.