CAIDA: Cooperative Association for Internet Data Analysis
(NBCHC070133) Supporting Research and Development of Security Technologies through Network and Security Data Collection

-----summary of contents-----

The CAIDA PREDICT project "Supporting Research and Development of Security Technologies through Network and Security Data Collection" (DHS contract NBCHC070133) started September 1, 2007. Our primary activities for this contract include collection, documentation, anonymization, and distribution of routing, peering point, and UCSD Network Telescope data, and providing advice on technical, legal, and practical aspects of PREDICT policies and procedures. CAIDA's previous PREDICT project "Network Traffic Data Repository to Develop Secure IT Infrastructure" (DHS contract NBCHC040159) ran from August 8, 2004 to July 31, 2007.


-----end summary of contents-----

Supporting Research and Development of Security Technologies through Network and Security Data Collection

Executive Summary:

Research and development targeted at identifying and mitigating Internet security threats requires current network data. To fulfill this need, the Cooperative Association for Internet Data Analysis (CAIDA), a program at the University of California's San Diego Supercomputer Center, which is based at the University of California, San Diego (UCSD), will collect backbone/peering point data from large ISPs (subject to link access, which is not guaranteed), trace data from the UCSD Network Telescope, datasets on past (Code-Red, Witty) and future Internet worms, and data on the IPv4 and IPv6 topologies, and will provide realtime monitors to view traffic on monitored backbone/peering links and the UCSD Network Telescope. This data will be curated, anonymized, and distributed to the network security community.

Technical Approach:

Over the past two decades, the Internet has become critical infrastructure for almost every aspect of American life. Commerce, business, government, education, and even interpersonal relationships rely on networked computers for communication and data distribution. Yet the discovery of new security threats continues to outpace the development of new technologies to ensure the security, integrity, and privacy of digital information. The current lack of data documenting both malicious and benign traffic traversing the Internet impedes security threat mitigation efforts because there are:

  • few or no ground truth examples of neoteric attacks in the wild, so focusing research and development to target current threats remains difficult
  • no realtime datasets available to allow those responsible for high-security sites to differentiate between general attacks and those specifically targeting their installations
  • no easily available traces containing traffic from current high-speed networks to use in developing and testing mitigation technologies to minimize both false negatives and false positives in deployed infrastructure
  • no canonical data sets with which to compare the efficacy of competing technologies that promise to detect or respond to a given threat
  • limited availability and evaluation of tools for anonymizing data sets for protected sharing with researchers

The state of the art in the development of security technologies could be improved through coordinated data collection and distribution efforts. We propose the collection, curation, anonymization[1], and distribution of Internet data to support research and development activities, with the goal of eventual participation as a Data Provider and Data Host in the Protected REpository for the Defense of Infrastructure against Cyber Threats (PREDICT) program. PREDICT provides thoroughly vetted central infrastructure designed to maximize data access while ensuring data security and privacy.

This basic fundamental research is being performed on a reasonable efforts basis. CAIDA's network data collection capabilities include:

Passive network monitors:

Each monitor consists of a pair of 2-unit servers instrumented with either an off-the-shelf NIC or an Endace DAG high-performance data collection card. The servers are time-synchronized with stratum one time servers to allow interpolation of trace data collected at disparate locations. Currently, one OC12 link at AMPATH is monitored, as well as several GigE links at UCSD. Data from the UCSD links cannot be redistributed.

The Archipelago (Ark) active measurement platform:

Ark is a new platform designed, developed, and deployed by CAIDA for optimized, coordinated active network measurements. We currently have 8 Ark monitoring locations, and we expect to grow to 15 locations by July 2008. skitter, a legacy active measurement project (to be replaced by Ark in 2008), collects traceroute data from 16 locations. Ark will support a variety of macroscopic Internet active measurement projects, including the scamper IPv4/v6 topology discovery tools. Existing skitter monitors will be upgraded to Ark monitors during the expected period of performance of this project.

The UCSD Network Telescope:

The UCSD Network Telescope consists of a large piece of globally announced IPv4 address space. The telescope contains almost no legitimate hosts, so inbound traffic to nonexistent machines is always anomalous in some way. Because the network telescope contains approximately 1/256th of all IPv4 addresses, we receive roughly one out of every 256 packets sent by an Internet worm with an unbiased random number generator. Because we are uniquely situated to receive traffic from every worm-infected host, we provide a global view of the spread of Internet worms.
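The 1-in-256 figure above can be worked through with a little arithmetic. The following is a minimal sketch, assuming the telescope behaves like a single /8-sized block (2^24 of the 2^32 IPv4 addresses) and that a worm picks targets uniformly at random; the function names are illustrative and not part of any CAIDA tool.

```python
def telescope_fraction(prefix_len=8):
    """Fraction of the IPv4 space covered by one block of the given prefix length.

    A /8 (assumed here) covers 2**24 of 2**32 addresses, i.e. 1/256.
    """
    return 2 ** (32 - prefix_len) / 2 ** 32


def prob_observed(probes, fraction):
    """Probability that at least one of `probes` uniformly random probes
    from a single infected host lands inside the telescope."""
    return 1.0 - (1.0 - fraction) ** probes


def estimate_total_probes(observed, fraction):
    """Scale a packet count seen at the telescope up to an Internet-wide estimate."""
    return observed / fraction
```

Under these assumptions, a host that sends 256 random probes has roughly a 63% chance of being seen at least once, and 100 packets observed at the telescope correspond to an estimated 25,600 packets sent overall. Worms with biased target selection would not follow this simple model.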

Adaptive NetFlow:

Adaptive NetFlow, deployable through an update to router software, addresses many shortcomings of NetFlow by dynamically adapting the sampling rate to achieve robustness without sacrificing accuracy. Thus collection infrastructure remains intact during flooding attacks, sampling rates are automatically tuned to data volume, and flow data reporting interacts well with applications that operate on time-binned data. To enable counting of non-TCP flows, we also developed an optional Flow Counting Extension that can augment existing hardware at routers. Both of our proposed solutions readily provide traffic descriptions of progressively smaller sizes. Transmitting these at progressively higher levels of reliability allows graceful degradation of the accuracy of traffic reports in response to network congestion on the reporting path. They also provide low, statistically provable error rates on sampled data. Adaptive NetFlow has been incorporated into CoralReef, CAIDA's passive measurement software suite, and is available for data collection on high-speed links.
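The general idea of adapting a sampling rate to traffic volume can be illustrated with a small sketch. This is not the Adaptive NetFlow algorithm itself or any CoralReef code; it is a simplified, hypothetical model in which a fixed-size flow table halves its packet sampling probability whenever it overflows and thins the counts it already holds so they stay consistent with the new rate.

```python
import random


class AdaptiveSampler:
    """Toy flow counter with a bounded table and a self-adjusting sampling rate."""

    def __init__(self, max_flows, seed=0):
        self.max_flows = max_flows      # hard limit on the number of table entries
        self.prob = 1.0                 # current packet sampling probability
        self.counts = {}                # flow key -> sampled packet count
        self.rng = random.Random(seed)  # seeded for reproducible runs

    def observe(self, flow_key):
        """Process one packet belonging to `flow_key`."""
        if self.rng.random() >= self.prob:
            return  # packet not sampled at the current rate
        self.counts[flow_key] = self.counts.get(flow_key, 0) + 1
        while len(self.counts) > self.max_flows:
            self._halve_rate()

    def _halve_rate(self):
        """Halve the sampling probability and thin existing counts to match:
        each previously sampled packet is kept with probability 1/2."""
        self.prob /= 2.0
        thinned = {}
        for key, n in self.counts.items():
            kept = sum(1 for _ in range(n) if self.rng.random() < 0.5)
            if kept:
                thinned[key] = kept
        self.counts = thinned

    def estimate(self, flow_key):
        """Estimated true packet count: sampled count scaled by 1 / prob."""
        return self.counts.get(flow_key, 0) / self.prob
```

In this sketch a flood of new flows lowers the sampling probability instead of exhausting memory, which mirrors the robustness goal described above; the published design should be consulted for the actual renormalization and reporting details.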

We will provide access to data for researchers in several ways in accordance with UCSD policy. We will maintain one or more data servers to allow researchers to download data via secure login and encrypted transfer protocols. We will receive, format, transfer data to, and return USB hard drives to researchers who wish to access datasets whose volume prohibits timely data download via the Internet. Finally, we will provide a near-realtime, interactive graphical interface to passive monitors and the Network Telescope to allow researchers a continual view of statistics of Internet traffic on these links and to allow them to identify time periods containing traffic characteristics of interest for further investigation using raw traces.

Statement of Work:

We propose to pursue the installation of existing equipment to monitor OC48 and OC192 links, including running the CoralReef report generator and collecting packet traces, as allowed by link owners and Data Providers.

We propose to collect, process, and distribute data from the following sources:

  • Internet OC48/GigE, OC192/10GigE, and ISP peering point links (when links and monitors are available). Data will include raw traces and statistical summaries.
  • The UCSD Network Telescope, including data on random-spread Internet worms, distributed Denial-of-Service attacks, port and host scanning, and botnets. We will provide data on every worm or virus we deem important that is monitored by our measurement infrastructure. Quarterly traces will be coordinated with other collectors of Internet background radiation data to ensure broad, time-synchronized datasets for researchers.
  • scamper running on the Ark infrastructure, collecting IPv4 and IPv6 network topology as discovered via continuous, active traceroute-like probing (including all /24 networks of the IPv4 address space). In conjunction with BGP routing tables from RouteViews or RIPE, this data allows us to create and serve Autonomous System (AS)-level topology graphs updated weekly (for use in virus, worm, botnet spread propagation research, routing security database support, infrastructure stability and vulnerability analysis).
  • Realtime (or near-realtime) detailed traffic reports (from CoralReef report generator software) from any available OC48/GigE, OC192/10GigE links (subject to the approval of the Data Provider), and from the UCSD Network Telescope, to provide data on current threats and help researchers identify periods of interest in collected trace data.
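The step of combining traceroute-like paths with BGP routing tables to obtain AS-level links can be sketched as follows. This is a minimal illustration, not CAIDA's processing pipeline: it assumes the routing table has already been reduced to (prefix, origin AS) pairs, uses documentation-range prefixes and AS numbers as hypothetical inputs, and conservatively records no link across an unresponsive or unmapped hop.

```python
import ipaddress


def build_prefix_table(routes):
    """Turn (prefix, origin_asn) pairs into a list sorted longest prefix first."""
    table = [(ipaddress.ip_network(prefix), asn) for prefix, asn in routes]
    return sorted(table, key=lambda entry: entry[0].prefixlen, reverse=True)


def ip_to_as(table, addr):
    """Longest-prefix match: return the origin AS for `addr`, or None."""
    ip = ipaddress.ip_address(addr)
    for net, asn in table:
        if ip.version == net.version and ip in net:
            return asn
    return None


def as_links(table, hops):
    """Collapse one IP-level path into a set of undirected AS-level links.

    `hops` is a list of hop addresses; None marks an unresponsive hop.
    """
    links = set()
    prev = None
    for hop in hops:
        asn = ip_to_as(table, hop) if hop is not None else None
        if asn is None:
            prev = None  # do not infer adjacency across a gap
            continue
        if prev is not None and asn != prev:
            links.add((min(prev, asn), max(prev, asn)))
        prev = asn
    return links
```

For example, with prefixes 192.0.2.0/24 (AS 64496) and 198.51.100.0/24 (AS 64497), a path that moves from the first prefix into the second yields the single link (64496, 64497). Merging the link sets from many paths gives an AS-level graph of the kind described above.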

We will also continue to distribute previously collected data, including denial-of-service attack datasets, Code-Red and Witty worm datasets, UCSD Network Telescope traces, and scamper topology data. This data will allow previously impossible longitudinal analysis of threat evolution over the last several years.

We will pursue participation in the PREDICT program via development of mutually acceptable Memoranda of Agreement and help to develop appropriate PREDICT infrastructure to serve the evolving needs of the research and development communities.

Sharing of sensitive network data with researchers is almost always blocked by the need to protect personally identifying information, but the research community has so far paid little attention to analyzing and comparing existing anonymization schemes for data leakage and other performance characteristics. We will investigate current and proposed anonymization schemes that support PREDICT's goal to protect privacy while supporting cybersecurity research. In the first year we will make available via the web an initial taxonomy of known tools, techniques, related publications, and known issues. We will also provide suggestions to PREDICT data providers on the use of current and future data anonymization schemes to increase security and privacy. We will update this web page and set of suggestions as technology develops in future years of the project.
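As one concrete example of the kind of scheme such a taxonomy might cover, the sketch below shows prefix-preserving IPv4 address anonymization: two addresses that share their first k bits still share their first k bits after anonymization, so subnet structure survives while the real addresses do not. It is an illustrative construction using HMAC-SHA256 as the keyed function, not a reimplementation of any specific published tool, and the function name and key are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ipv4(addr, key):
    """Return a prefix-preserving pseudonym for the IPv4 address `addr`.

    Each output bit is the input bit XORed with one keyed pseudorandom bit
    that depends only on the bits before it, so equal prefixes map to
    equal prefixes.
    """
    value = int(ipaddress.IPv4Address(addr))
    result = 0
    for i in range(32):
        prefix = value >> (32 - i)  # the first i bits of the address
        msg = i.to_bytes(1, "big") + prefix.to_bytes(4, "big")
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] & 1
        bit = (value >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))
```

Preserving prefixes is itself a source of data leakage, since an attacker who learns a few mappings can narrow down others, which is the sort of trade-off a comparison of anonymization schemes would need to weigh.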


[1] Or other appropriate measures of privacy protection.

Cooperative Association for Internet Data Analysis (CAIDA)
  Last Modified: Fri Dec-14-2007 11:13:26 PDT
  Maintained by: Alex Ma
  Page URL: http://www.caida.org/funding/predict/index.xml