Supporting Research and Development of Security Technologies through Network and Security Data Collection

Our primary activities for this project include collection, curation, hosting, and distribution of active and passive Internet measurement data as well as providing advice on technical, legal, and practical aspects of PREDICT policies and procedures.

The original project summary and description (PDF) is available for viewing.

Sponsored by:
Department of Homeland Security (DHS)

Principal Investigator: kc claffy

Funding source: FA8750-12-2-0326 NBCHC070133

Period of performance: September 28, 2012 - September 27, 2017.

Executive Summary:

Research and development targeted at identifying and mitigating Internet security threats requires current network data. To fulfill this need, the Cooperative Association for Internet Data Analysis (CAIDA), a program at the San Diego Supercomputer Center, which is based at the University of California, San Diego (UCSD), will collect packet header data from large backbone ISPs (so long as we have access to links, which is not guaranteed) and from the UCSD Network Telescope, collect IPv4 and IPv6 topology data, and operate real-time monitors to view traffic on monitored links. We will curate this data, anonymize it in some cases, and distribute it to the network and security research community. In light of progress and pitfalls encountered in the first two years of this process, and in the face of increased concerns over policy obstacles to cybersecurity research, in September 2009 we re-aligned our statement of work to better support what PREDICT needs to accomplish in the next two years: community building and demonstrated responsiveness to current public and private sector needs in Cybersecurity S&T research. We have replaced fixed data set collection intervals with a more flexible approach designed to better meet the current needs of researchers, including eventual access to real-time traffic data from the telescope for vetted security researchers. Many of our deliverables have changed in support of these new objectives.


We now critically depend on the Internet for our professional, personal, and political lives. This dependence has grown much faster than our comprehension of the Internet's underlying structure, performance limits, dynamics, and evolution. Fundamental characteristics of the Internet are perpetually challenging to research and analyze, and we must admit we know little about what keeps the system stable. As a result, researchers and policymakers currently analyze what is literally a trillion-dollar ecosystem essentially in the dark, and agencies charged with infrastructure protection have little situational awareness regarding global dynamics and operational threats. To make matters worse, the few data points available suggest a dire picture, casting doubt on the Internet's ability to maintain and strengthen its role as the world's communications substrate.

The current lack of data documenting both malicious and benign Internet traffic impedes security threat mitigation efforts because of the following gaps:

  • no realtime datasets available to allow those responsible for high-security sites to differentiate between general attacks and those targeting their installations,
  • no easily available traces containing traffic from current high-speed networks to use in development, testing, and comparison of mitigation technologies,
  • limited availability and evaluation of tools for anonymizing data sets for protected sharing with researchers.

The state of the art in the development of security technologies could be improved through coordinated data collection and distribution efforts. We propose to collect, curate, anonymize (or otherwise protect the privacy of), and distribute Internet data to support research and development activities, and to participate as a Data Provider and Data Host in the Protected REpository for the Defense of Infrastructure against Cyber Threats (PREDICT) program.
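To make the anonymization step more concrete, the following is a minimal sketch of one common family of techniques, prefix-preserving IPv4 anonymization: two addresses that share their first k bits still share exactly their first k bits after anonymization, so subnet structure survives while real addresses are hidden. This is an illustration only and not the scheme or tool used in this project; the function names and the HMAC-based construction are assumptions made for the example.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ipv4(address: str, key: bytes) -> str:
    """Prefix-preserving anonymization sketch (illustrative, not a vetted tool).

    Bit i of the output is bit i of the input, flipped or not depending on a
    keyed hash of the i bits that precede it in the original address.
    """
    original = int(ipaddress.IPv4Address(address))
    result = 0
    for i in range(32):
        prefix = original >> (32 - i)  # top i bits of the original (0 when i == 0)
        digest = hmac.new(key, f"{i}:{prefix}".encode(), hashlib.sha256).digest()
        flip = digest[0] & 1
        bit = (original >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits shared by two IPv4 addresses."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()
```

With a hypothetical key such as `b"example-key"`, anonymizing `192.0.2.1` and `192.0.2.200` yields two different addresses that still share a 24-bit prefix, as the originals do. Preserving prefixes is also what makes such schemes vulnerable to some inference attacks, which is the kind of information leakage that a taxonomy of anonymization schemes needs to document.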

Technical Approach:

This basic fundamental research is being performed on a reasonable efforts basis.

CAIDA's network data collection capabilities include:

The Archipelago (Ark) active measurement platform:

Ark is a distributed platform designed, developed, and operated by CAIDA for optimized, coordinated active network measurements. As of May 2012, we have 60 Ark monitors (28 of them IPv6 capable) deployed on 6 continents in 30 countries. Ark supports a variety of macroscopic Internet measurement projects, including ongoing IPv4/IPv6 topology discovery with the scamper tool.

The UCSD Network Telescope:

The UCSD Network Telescope consists of a large piece of globally announced IPv4 address space. This address space contains almost no legitimate hosts, so inbound traffic to non-existent machines is unsolicited and anomalous in some way. Our network telescope contains approximately 1/256th of all public IPv4 addresses and consequently receives roughly one out of every 256 packets sent by malicious software that selects targets with an unbiased random number generator.
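The 1/256 figure lends itself to a quick calculation. The sketch below assumes a source that picks IPv4 targets uniformly and independently at random, and estimates how likely the telescope is to see at least one packet from that source after a given number of probes. The function names are illustrative and not part of any CAIDA tool.

```python
import math

TELESCOPE_FRACTION = 1 / 256  # approximate share of IPv4 space, per the text above


def detection_probability(probes: int, fraction: float = TELESCOPE_FRACTION) -> float:
    """Chance that at least one of `probes` uniformly random targets hits the telescope."""
    if probes < 0:
        raise ValueError("probes must be non-negative")
    return 1.0 - (1.0 - fraction) ** probes


def probes_needed(confidence: float, fraction: float = TELESCOPE_FRACTION) -> int:
    """Smallest number of probes giving at least `confidence` chance of one hit."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))
```

Under these assumptions, a randomly scanning host has an even chance of being seen after fewer than 200 probes, which is why a telescope of this size can observe the early spread of a random-scanning worm. Malware with a biased or targeted address generator does not follow this model.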

The telescope has enabled us to provide a unique global view of the spread of some Internet worms. The advent of the Conficker worm and its associated traffic load have changed our approach to sharing telescope data. We are transitioning from a model of static trace sharing (which is of extremely limited utility to researchers, especially when anonymized as it was for Phase I of PREDICT) and indefinite storage of data on CAIDA servers, to a model of real-time data sharing with vetted researchers that retains only a 60-day window of history.

Passive network monitors:

Each monitor consists of a pair of 2-unit servers instrumented with either an off-the-shelf NIC or an Endace DAG high-performance data collection card. The servers are time-synchronized with stratum-1 time servers to allow comparison of trace data collected at disparate locations.

Adaptive NetFlow:

Adaptive NetFlow, deployable through an update to router software, addresses shortcomings of NetFlow by dynamically adapting the sampling rate to achieve robustness without sacrificing accuracy. Thus, collection infrastructure remains functional during flooding attacks, as sampling rates are automatically tuned to data volume. Flow data reporting interacts well with applications that operate on time-binned data. Adaptive NetFlow has been incorporated into CoralReef, CAIDA's passive measurement software suite, and is available for data collection on high-speed links.
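The idea of tuning the sampling rate to data volume can be illustrated with a small sketch. The class below is a deliberately simplified stand-in, not the Adaptive NetFlow algorithm or any CoralReef API: it keeps at most a fixed budget of sampled packets per time bin, halves the sampling rate whenever the budget is exceeded, and thins the samples already held so that every retained packet corresponds to the current rate. All names and parameters are assumptions made for the example.

```python
import random
from typing import List, Tuple


class AdaptiveSampler:
    """Bounded-memory packet sampler for one time bin (illustrative sketch)."""

    def __init__(self, budget: int, seed: int = 0) -> None:
        if budget < 1:
            raise ValueError("budget must be positive")
        self.budget = budget
        self.rate = 1.0                 # current sampling probability
        self.kept: List[int] = []       # sizes (bytes) of sampled packets
        self._rng = random.Random(seed)

    def observe(self, size_bytes: int) -> None:
        """Sample one packet, lowering the rate if the budget is exceeded."""
        if self._rng.random() < self.rate:
            self.kept.append(size_bytes)
        while len(self.kept) > self.budget:
            # Over budget: halve the rate and keep each existing sample
            # with probability 1/2 so all samples match the new rate.
            self.rate /= 2.0
            self.kept = [s for s in self.kept if self._rng.random() < 0.5]

    def estimate(self) -> Tuple[float, float]:
        """Scale samples by 1/rate to estimate total packets and bytes."""
        return len(self.kept) / self.rate, sum(self.kept) / self.rate

    def end_bin(self) -> Tuple[float, float]:
        """Report estimates for the finished time bin and reset for the next."""
        result = self.estimate()
        self.rate = 1.0
        self.kept = []
        return result
```

Because memory use is capped by the budget regardless of traffic volume, a sudden flood lowers the sampling rate instead of exhausting the collector, and resetting per bin matches the time-binned reporting described above. The estimates are approximate, and their variance grows as the rate falls.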

Data Sharing:

We provide access to data for researchers in several ways in accordance with PREDICT requirements and UCSD policy. We maintain one or more data servers to allow researchers to download data via secure login and encrypted transfer protocols. Optionally, for datasets whose volume prohibits timely download via the Internet, we may receive USB hard drives from researchers, format them, transfer the requested data to them, and return them. We also support a near-realtime, interactive graphical interface to passive monitors and the Network Telescope to provide a continual view of statistics of Internet traffic on these links and to allow researchers to identify time periods containing traffic characteristics of interest for further investigation. Finally, we are enabling real-time sharing of the telescope data, using the framework we developed for balancing privacy and utility in Internet research.


Task Number | Task Description (Note: PREDICT site will not be utilized for performance of Tasks 1 through 6).
Task 1 | PREDICT Framework and function: Work with legal counsel to facilitate the incorporation of non-anonymized and real-time traffic data in PREDICT. Work with PREDICT Coordination Center to develop appropriate processes and procedures for researchers, hosts, and providers to use to provide, process, receive, and safeguard data.
Task 2 | PREDICT Memos of Agreement (MOAs): Help develop MOAs that allow CAIDA to participate in PREDICT project, updating MOAs as needed to support new types of data.
Task 3 | Pursue the installation of passive monitors on Internet backbone and peering point links.
Task 4 | Collect, process, anonymize as necessary, and serve passive data from monitors.
Task 5 | Investigate inadvertent information leakage via current anonymization schemes and create taxonomy of schemes and known issues.
Task 6 | Collect, process, curate, and serve IPv4 and IPv6 topology data.
Task 7 | Operate realtime interactive monitor displaying aggregated traffic statistics from all monitored passive network links pursuant to link owner permission.
Task 8 | Develop metadata, software and processes for CAIDA interaction with PREDICT portal to optimize assessment of data requests and response to approved data requests.
Task 9 | Install and deploy new monitors to replace aging Network Telescope monitoring infrastructure, and support vetted real-time sharing of data.
Task 10 | Organize and host "ethical and responsible network research" report-writing (and supporting) workshops under discretion of Program Manager, prepare guiding material in advance and summaries after workshops. Co-author final draft of document.


# | Deliverable Description | Due date | Status
1 | Quarterly Denial-of-Service Backscatter and other older telescope traces available to researchers | 12/31/2007; quarterly through 2008. | done
2 | Scamper data processed and available to researchers. | 1/31/2008 | done
3 | Provide suggestions on use of current and future data anonymization schemes to PREDICT data providers to increase the security and privacy of the resulting datasets. | 6/31/2008 | done
4 | Taxonomy of current anonymization techniques, tools, related publications and known issues, made available via the web. | 6/31/2008 | done
5 | Realtime interactive monitor on UCSD Network Telescope | 11/31/2007 | done
6 | Realtime interactive monitor on passive network links | As soon as monitors can be deployed | done
7 | The UCSD Network Telescope, with real-time data sharing supported under a new manuscript entitled, "Internet Data Sharing Framework For Balancing Privacy and Utility", which we will also publish in 2009. | 12/31/2010 | done
8 | Quarterly large-ISP anonymized traces available to researchers. | As soon as monitors can be installed and connected; quarterly thereafter | done
9 | Make available IPv4 topology and derived AS topology datasets. | 12/31/2007, regularly thereafter | done
10 | Publish papers to educate lawyers and researchers about the current legal obstacles to network research, and how to effectively apply privacy protections to their research. | 01/31/2010 | done
11 | Help organize, host, prepare guiding material for, and summarize reports of workshops intended to result in a documented set of guidelines for considering and addressing ethical issues in network and security research. | 01/31/2011 | done