Collection, curation, and sharing of data for scientific analysis of Internet traffic, topology, routing, performance, and security-related events are CAIDA's core objectives. Our Overview of available CAIDA Data has links to data descriptions, request forms for restricted data, download locations for publicly available data, real-time reports, and other metadata. Note that since April 2016, some CAIDA datasets have been distributed exclusively through IMPACT (Information Marketplace for Policy and Analysis of Cyber-risk and Trust).
CAIDA operates active and passive measurement infrastructures enabling visibility into global Internet behavior. We collect, curate, archive, and share datasets resulting from these measurements. We also produce and share multiple derivative datasets.
Active measurements: CAIDA's flagship Macroscopic Topology Project, Archipelago, measures Internet connectivity and latency using active probing to a stratified cross-section of the commodity IPv4 and IPv6 Internet. We are currently collecting more than 50 Mb/day of raw Archipelago data from more than 200 monitors located on 6 continents in over 60 countries.
Passive measurements: CAIDA collaborates with organizations that operate network infrastructure in various environments to passively monitor traffic on selected links. As of 2018, CAIDA continuously collects two types of passive measurement data: Internet Background Radiation data and US backbone bidirectional traffic data.
- Internet Background Radiation data is collected using the UCSD Network Telescope infrastructure, which consists of a globally routed but lightly utilized /8 network prefix, that is, 1/256th of the whole IPv4 address space. The prefix contains few legitimate hosts; inbound traffic to non-existent machines, so-called Internet Background Radiation (IBR), is unsolicited and results from a wide range of events, including misconfiguration (e.g. mistyping an IP address), scanning of address space by attackers or malware looking for vulnerable targets, backscatter from denial-of-service attacks that use randomly spoofed source addresses, and the automated spread of malware. CAIDA continuously captures and archives this anomalous traffic, discarding the legitimate packets destined to the few reachable IP addresses in this prefix. We are currently collecting more than 3 TB of uncompressed IBR traffic trace data per day.
- US backbone bidirectional traffic data have been collected by CAIDA since 2008. These data contain anonymized passive traffic traces from various CAIDA monitors on high-speed Internet backbone links. Since March 2018, we have been capturing one-hour-long monthly traces on a 10 Gb link monitor in New York City. During capture, packets are truncated at a snap length selected to avoid excessive packet loss due to disk I/O overload. These data are anonymized using Crypto-PAn prefix-preserving anonymization, stored in pcap format, and published quarterly. The size of one trace is currently about 340 GB. Depending on storage resources, we may decide to retain and publish all monthly traces (as we did before 2014).
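To illustrate what "prefix-preserving" means for the anonymized backbone traces, here is a minimal sketch in Python. It is not the Crypto-PAn implementation and not CAIDA's tooling: it swaps in HMAC-SHA256 as the keyed pseudorandom function, and the function names and the key are hypothetical. It only demonstrates the property itself: two addresses that share their first k bits before anonymization also share exactly their first k bits afterwards.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ipv4(addr: str, key: bytes) -> str:
    """Toy prefix-preserving anonymization of an IPv4 address.

    Each output bit is the input bit XORed with a pseudorandom bit that
    depends only on the more significant input bits and the secret key.
    HMAC-SHA256 is an assumption made for this sketch.
    """
    value = int(ipaddress.IPv4Address(addr))
    result = 0
    for i in range(32):
        # The i most significant bits of the original address (0 when i == 0).
        prefix = value >> (32 - i)
        # Include the prefix length so prefixes of different lengths
        # never map to the same PRF input.
        message = bytes([i]) + prefix.to_bytes(4, "big")
        flip = hmac.new(key, message, hashlib.sha256).digest()[0] & 1
        bit = (value >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits shared by two IPv4 addresses."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


if __name__ == "__main__":
    key = b"hypothetical-secret-key"
    a, b = "192.0.2.17", "192.0.2.200"  # share a 24-bit prefix
    anon_a, anon_b = anonymize_ipv4(a, key), anonymize_ipv4(b, key)
    print(anon_a, anon_b)
    print(common_prefix_len(a, b), common_prefix_len(anon_a, anon_b))
```

Because the flip applied to each bit depends only on the bits before it, subnet structure survives anonymization while the actual addresses are hidden, which is what keeps the traces useful for routing and topology analysis.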
Access to Datasets
We share the collected datasets with researchers in accordance with University of California, San Diego, policies. We maintain servers that allow researchers to download data via a secure login and encrypted transfer protocols. For our most sensitive data, we enforce a "bring-code-to-data" approach, giving vetted researchers accounts on CAIDA computers to analyze data using CAIDA resources.
There are two complementary ways that users can request access to CAIDA's data:
- through the CAIDA portal
- through the Information Marketplace for Policy and Analysis of Cyber-risk and Trust (IMPACT) portal (for academic researchers, government agencies, and corporate entities from DHS-Approved Locations*)
* Currently the US, Canada, Australia, United Kingdom, Israel, Japan, the Netherlands, and Singapore.
Datasets that can be requested through the CAIDA portal fall into two categories: public and "by-request." Public datasets are available to users who agree to CAIDA's Acceptable Use Policy for public data. After filling out the corresponding public data request form, users are redirected to the data download page. Access to the "by-request" datasets is subject to approval by a CAIDA data administrator. Access to CAIDA datasets through IMPACT must be approved by IMPACT staff.
User Activity: During the three years 2015-2017, over 1.2 million unique visitors browsed our website, and CAIDA granted 1716 researchers from 82 countries access to restricted data. The countries with the most users were the U.S. (504), China (265), and India (134). Collectively these users downloaded 179.4 TB of data. Over the same period, approximately 36,000 users downloaded 136.3 TB of public CAIDA data.
Research Publications using CAIDA Data
CAIDA data provide an empirical foundation for Internet research. Researchers worldwide have used these data to publish papers in the scientific literature. We maintain lists of publications by CAIDA researchers and collaborators, as well as publications by external researchers who report their use of CAIDA data, as required by our Acceptable Use Policy (AUP).
User activity: During the three years 2015-2017, we found 529 papers by external authors using CAIDA data. The most-used data are our AS-relationship data (190 papers) and Anonymized Internet Traces (166 papers). The affiliated institutions for the first authors were located in 53 different countries; the most active being the US (140 papers), China (63 papers), India (34 papers), and Germany (27 papers).
CAIDA has developed a privacy-sensitive data sharing framework to address the inevitable conflict between data privacy and science. The framework employs technical and policy means to balance individual privacy, security, and legal concerns against the needs of governments, researchers, and scientists for access to data.