CAIDA's Annual Report for 2021

A report on CAIDA research initiatives, project progress and results, data sets, tool development, publications, presentations, workshops, web site statistics, and operating expenses for 2021.

Mission Statement: CAIDA investigates practical and theoretical aspects of the Internet, focusing on activities that:

  • provide insights into the macroscopic function of Internet infrastructure, behavior, usage, and evolution,
  • foster a collaborative environment in which data can be acquired, analyzed, and (as appropriate) shared,
  • improve the integrity of the field of Internet science,
  • inform science, technology, and communications public policies.

Executive Summary

This annual report summarizes CAIDA’s activities for 2021 in the areas of research, infrastructure, data collection and analysis. Our research projects span: Internet cartography and performance; security, stability, and resilience studies; economics; and policy. Our infrastructure, software development, and data sharing activities support measurement-based Internet research, both at CAIDA and around the world, with focus on the health and integrity of the global Internet ecosystem.

Internet Mapping. We continued to pioneer methods and tools for Internet cartography, including for identifying and geolocating cloud interconnections, and inferring regional access topologies of Internet Service Providers. We extended our efforts from last year in automated learning of semantic structure in router hostnames, to include extraction of geographic and network ownership information. We also developed a methodology to identify state-owned Internet operators in observed network level (BGP) topology data. These building blocks will enable creation of macroscopic Internet topology maps of unprecedented richness and fidelity.

Performance Measurement. We designed and implemented a system to leverage thousands of public speed test servers to comprehensively measure performance from the public clouds to other networks. In collaboration with U. Twente, we built on our anycast census work from last year to characterize the growing adoption of anycast in DNS authoritative infrastructure, and its implications.

Security, Stability, and Resilience (SSR) of the Internet’s addressing, routing, and naming systems. We completed five technical and two policy studies on vulnerabilities of the subsystems that constitute the Internet’s fundamental plumbing: IP addressing, DNS, and BGP routing. Each of these systems was characterized by critical flaws that continue to leave the Internet ecosystem vulnerable to a variety of attacks. Our published studies in 2021 included: analyzing the (in)accuracy of the existing Internet Routing Registry (IRR) databases as the emerging Resource Public Key Infrastructure gains traction; comparing the administrative (observable in Regional Internet Registry (RIR) data) vs operational properties (observable in BGP data) of autonomous systems; IPv6 privacy mechanism vulnerabilities; and risks exposed by DNS registrar name management practices. Our collaboration with MIT Lincoln Labs continued as they used the UCSD Telescope data to study security-relevant scaling characteristics of darkspace traffic.

Economics and Policy. Our final research thrust focuses on the implications of empirical studies of the Internet on public policy. KC completed participation in ICANN’s Second Security, Stability, and Resiliency (SSR2) Review Team, and continues to serve as a shepherd to support ICANN’s processing of the recommendations. We supported economists at Harvard in using our topology data to study the impact of GDPR on the interconnection ecosystem. Finally, we described data-driven approaches to improve the security of the Internet infrastructure.

Infrastructure Operations. Our continuing NSF and new DARPA support allowed us to make progress on almost all infrastructure components that create the data products most in demand by the community, including Ark, AS Rank, AS-to-Org mapping, BGPStream, Periscope, Spoofer, and the UCSD Network Telescope. We reached milestones with new infrastructure components: a new DNS TLD zone database; an IP address metadata software library; a new BGPView data processing pipeline component of BGPStream; the FANTAIL project for processing and querying terabytes of traceroute data; and our rich-context Resource Catalog for CAIDA Internet Data Science Resources.

New Infrastructure Awards. In collaboration with U. Oregon’s Network Startup Resource Center (NSRC) and MIT, we received two large National Science Foundation (NSF) infrastructure grants that started in October 2021. The first is Integrated Laboratory for Advancing Network Data Science (ILANDS), which will support enhancements to our infrastructure to handle 100GB packet rates and projected routing table growth, including deploying enhanced storage and compute resources to support long-term use of the data. The second is our largest award to date: a Mid-scale Research Infrastructure Design Project to build a Global Measurement Infrastructure. We are grateful for this project, which offers a potential path to put CAIDA’s activities on a sustainable footing. It will support our design and prototyping of a new highly distributed network measurement platform capable of capturing several types of data relevant to security research, as well as hosting new vetted experiments. We will have more to report on these projects next year.

Community Service. At NSF’s invitation/request, we had the honor of co-organizing the first NSF-sponsored Workshop on Overcoming Measurement Barriers to Internet Research in early 2021, and posted a report that was cited in NSF’s new Internet Measurement Research solicitation, the first U.S.-government solicitation ever focused on Internet measurement research. Progress!

Everything Else. As always, we engaged in a variety of tool development, data sharing, and outreach activities, including updating our web site, and publishing 13 peer-reviewed papers, 2 workshop reports, 9 presentations, and 2 blog entries. This report summarizes the status of our activities. Details about our research are available in papers, presentations, and interactive resources on our web sites. We provide listings and links to software tools and data sets shared, and statistics reflecting their usage. Finally, we offer a “CAIDA in numbers” section: statistics on our performance, financial reporting, and supporting resources, including visiting scholars and students, and all funding sources.

CAIDA’s program plan for 2018-2023 is available at https://www.caida.org/about/progplan/progplan2018/. We will begin our 2023-2028 program plan in late 2022. Please feel free to send comments or questions to info at caida dot org. Please note the link to donate to CAIDA at the top of our newly renovated web site; donations are tax-deductible, and because UC San Diego charges no overhead on donations, 100% goes to research!

Research and Analysis

Internet Cartography (Mapping) Methods

Inferring Cloud Interconnections

We built a foundation to learn the network paths from clouds to external devices and developed a new scalable tool for identifying the interconnections that clouds use to reach destinations around the world, enabling entities to learn how clouds reach current or potential deployments via their public WAN. We developed two techniques for geolocating interconnections between cloud networks at the city level. Applying them, we discovered that clouds interconnect on every populated continent, and often interconnect in the same cities as each other. (Inferring Cloud Interconnections: Validation, Geolocation, and Routing Behavior, PAM)

Minimum latency from each location to a single server in San Diego. Colored regions indicate the measurements were handled by the same Edge CO (Central Office, inferred from IPv6 addresses). (Inferring Regional Access Network Topologies: Methods and Applications, IMC)

Inferring Regional Access Network Topologies

Applying new Internet cartography methods, we undertook a comprehensive active measurement-driven study of the topology of U.S. regional access ISPs. We discovered that access networks are prone to single points-of-failure, with facility-level aggregation as the root cause. ISPs use dedicated access networks to provide connectivity for each geographic region with a strict hierarchy of facilities within each regional network. Based on validation with engineers of two regional access networks, our tools allowed us to make surprisingly accurate maps despite considerable noise in our input signals, e.g., missing or incorrect DNS or traceroute hops. We identified different approaches to provisioning redundancy across links, nodes, buildings, and at different levels of the hierarchy. These measurements provide a basis for reasoning about sources of performance and reliability impairment. (Inferring Regional Access Network Topologies: Methods and Applications, IMC)

Learning to Automatically Extract Geographic Location and Network Information from Internet Router Hostnames

In collaboration with Matthew Luckie from University of Waikato, we took a new approach to geolocating Internet infrastructure, designing a machine learning solution capable of accurately and comprehensively extracting geographic information that network operators publicly disclose via router interface hostnames. (Learning to Extract Geographic Information from Internet Router Hostnames, CoNEXT)

This collaboration also led to the design, implementation, evaluation, and validation of a fully automated system that learns regular expressions (regexes) to extract network names from Internet hostnames assigned by operators using their own conventions. Our method first learns the dictionary of network names, and then automatically generates and evaluates regexes that extract these names. We validated our dictionary against ground truth, finding that 97.3% of the names our regexes extract were valid names for the networks. (Learning Regexes to Extract Network Names from Hostnames, AINTEC)
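
As a rough illustration of the final extraction step only, the sketch below applies one hand-written regex to hostnames and checks the captured token against a small dictionary of names. The hostnames, the naming convention, the dictionary, and the function name are all hypothetical. The system described above learns both the dictionary and the regexes automatically, which this sketch does not attempt.

```python
import re

# Hypothetical dictionary of network names (the real system learns this).
KNOWN_NAMES = {"examplenet", "transitco"}

# Hypothetical naming convention: <interface>.<router>.<network>.example.net
NAME_REGEX = re.compile(r"^[^.]+\.[^.]+\.([a-z0-9-]+)\.example\.net$")


def extract_network_name(hostname):
    """Return the network name embedded in hostname, or None.

    A name is returned only if the regex matches and the captured
    token appears in the dictionary of known names.
    """
    match = NAME_REGEX.match(hostname.lower())
    if match and match.group(1) in KNOWN_NAMES:
        return match.group(1)
    return None
```

Checking the captured token against the dictionary is what lets a candidate regex be scored: a regex that captures many tokens outside the dictionary is probably matching something other than a network name.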

We added data supplements for these two projects. The first data supplement contains the data used to train our model to learn regular expressions that extract AS names from router hostnames, as well as various model outputs. The second supplement contains data used for training the model to extract geohints from router hostnames. These data were derived from ITDK dataset, and are designed to be used with sc_hoiho, one of scamper’s utilities.

Identifying ASes of State-Owned Internet Operators

We developed a methodology to accurately identify state-owned Internet operators worldwide and their Autonomous System Numbers (ASNs). We obtained the first accurate dataset of ASNs of state-owned Internet operators, made it available to the research community together with the lessons we learned in the process, and performed a preliminary analysis based on our data. We found that 53% (i.e., 123) of the world’s countries were majority owners of Internet operators. We also documented the existence of subsidiaries of state-owned operators operating in foreign countries, an aspect that touches every continent and particularly affects Africa. (Identifying ASes of State-Owned Internet Operators, IMC)

Performance Measurement

Measuring the network performance of Google Cloud Platform

We designed and implemented the CLoud-based Applications Speed Platform (CLASP) to measure performance from the cloud to other networks. CLASP conducts speed tests from cloud virtual machines to external speed test servers, indicating the likely bandwidth available to devices near the server. In our five-month measurement experiment using Google’s Cloud Platform (GCP), we found that 30-70% of ISPs we measured showed severe throughput degradation during peak usage hours. We designed a method to identify diurnal congestion events by analyzing variations in throughput that can allow for available bandwidth predictions. (Measuring the network performance of Google Cloud Platform, IMC)
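
To make the idea of diurnal throughput degradation concrete, here is a minimal sketch that compares median throughput in peak hours against off-peak hours. It is not the CLASP method: the peak-hour window, the 50% threshold, and the function names are assumptions chosen for illustration.

```python
from statistics import median

# Assumed local peak usage window: 19:00 through 22:59.
PEAK_HOURS = range(19, 23)


def peak_degradation(samples, peak_hours=PEAK_HOURS):
    """Return the fractional drop in median throughput during peak hours.

    samples: iterable of (hour_of_day, throughput_mbps) pairs.
    Returns None when either period has no samples or the off-peak
    median is zero, since the ratio is undefined in those cases.
    """
    samples = list(samples)
    peak = [t for h, t in samples if h in peak_hours]
    off_peak = [t for h, t in samples if h not in peak_hours]
    if not peak or not off_peak:
        return None
    baseline = median(off_peak)
    if baseline == 0:
        return None
    return 1.0 - median(peak) / baseline


def shows_diurnal_degradation(samples, threshold=0.5):
    """Flag a server whose peak-hour median drops by at least threshold."""
    drop = peak_degradation(samples)
    return drop is not None and drop >= threshold
```

Medians are used rather than means so that a single unusually fast or slow test does not dominate the comparison.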

Characterization of Anycast Adoption in the DNS Authoritative Infrastructure

In collaboration with researchers from the University of Twente, we used the data of our anycast census to characterize the adoption of anycast in DNS Authoritative infrastructure. Our findings show that anycast adoption changes the DNS service availability risk profile but does not eliminate all risks. In fact, anycast can hide certain types of availability failures, and limit recovery options. A mixed deployment that includes traditional unicast redundancy as well as anycast options mitigates this risk, but increases cost and complexity. (Characterization of Anycast Adoption in the DNS Authoritative Infrastructure, TMA Best paper award).

Security, Stability, and Resilience (SSR) of the Internet’s addressing, routing, and naming systems

Invalid IPv4 prefix assertions from ISPs that publicly announced they had started to discard invalid assertions with respect to RPKI ROAs in BGP, from January 2019 until November 2020. The vertical dotted lines correspond to the date of their public announcement. (A Data-Driven Approach to Understanding the State of Internet Routing Security, TPRC)

A Data-Driven Approach to Understanding the State of Internet Infrastructure Security

We described a data-driven approach to improve the security of the Internet infrastructure. We framed the key vulnerabilities in terms of achievable regional security rather than unachievable global security, and introduced a concept we call zones of trust. We described how zones of trust can mitigate security concerns within the BGP, DNS, and CA systems. Our long-term goal is to foster the emergence of zones of trust within the Internet with proper framing and shaping of incentives. (Trust Zones: A Path to a More Secure Internet Infrastructure, TPRC and JIP). In parallel work, we drilled down on the Internet routing system as an example of why security of the Internet “plumbing” layers is a difficult and persistent problem, including summarizing evidence of malicious routing behavior, and discussing proposed ways forward and their complications. (A Data-Driven Approach to Understanding the State of Internet Routing Security, TPRC).

IRR Hygiene in the RPKI Era

The Internet Routing Registry (IRR) and Resource Public Key Infrastructure (RPKI) are designed to protect BGP from origin hijacking. Network operators may register their address blocks in either database, and may query either database to validate ownership of other registered address blocks. The rapid growth in use of the RPKI provides an opportunity to use it as ground truth against which to measure IRR data correctness. Tools that identify inconsistencies in the two databases can help those wanting (or willing) to maximize the utility of both platforms. (IRR Hygiene in the RPKI Era, IMC)
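
A minimal sketch of such a consistency check follows. It applies the general origin-validation logic of RFC 6811 (a covering ROA with a matching origin AS and a permitted prefix length) to IRR route objects. This is not necessarily the classification used in the paper, and the function name, labels, and data layout are assumptions.

```python
import ipaddress


def classify_route_object(route_prefix, route_origin, roas):
    """Classify one IRR route object against a collection of ROAs.

    roas: iterable of (prefix, max_length, asn) tuples.
    Returns "consistent" if some covering ROA authorizes this origin
    at this prefix length, "inconsistent" if covering ROAs exist but
    none authorize it, and "not-in-rpki" if no ROA covers the prefix.
    """
    net = ipaddress.ip_network(route_prefix)
    covered = False
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        # subnet_of() raises TypeError across IP versions, so check first.
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        if asn == route_origin and net.prefixlen <= max_length:
            return "consistent"
    return "inconsistent" if covered else "not-in-rpki"
```

The example prefixes and AS numbers used to exercise this sketch should come from the ranges reserved for documentation, so that no real network is implied.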

The parallel lives of Autonomous Systems: ASN Allocations vs. BGP

Autonomous Systems (ASes) exist in two dimensions on the Internet: the administrative and the operational. Regional Internet Registries (RIRs) govern the former, while BGP governs the latter. We presented a methodology to extract insights about AS life cycles, including dealing with pitfalls affecting authoritative public datasets. We then performed a joint analysis to establish the relationship (or lack thereof) between these two dimensions for all allocated ASNs and all ASNs visible in BGP. We characterized usual behaviors and specific differences between RIRs and historical resources, and measured the discrepancies between the two “parallel” lives. (The parallel lives of Autonomous Systems: ASN Allocations vs. BGP, IMC).
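
One simple way to picture the comparison of the two lives is as a comparison of two date intervals per ASN: the period during which it was allocated and the period during which it was visible in BGP. The sketch below is only an illustration of that idea. The category labels and the single-interval simplification are assumptions, and real ASNs can have several allocation and activity periods.

```python
from datetime import date


def compare_lives(allocated, visible):
    """Compare an ASN's administrative and operational intervals.

    allocated, visible: (start, end) tuples of datetime.date, or None
    when the ASN has no record in that dimension.
    """
    if visible is None:
        return "never seen in BGP"
    if allocated is None:
        return "seen in BGP, never allocated"
    a_start, a_end = allocated
    v_start, v_end = visible
    if a_start <= v_start and v_end <= a_end:
        return "BGP activity within allocation"
    if v_end < a_start or v_start > a_end:
        return "BGP activity entirely outside allocation"
    return "BGP activity partly outside allocation"
```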

Follow the Scent: Defeating IPv6 Prefix Rotation Privacy

IPv6’s large address space allows ample freedom for choosing and assigning addresses. To improve client privacy and resist IP-based tracking, standardized techniques leverage this large address space, including privacy extensions and provider prefix rotation. We developed measurement techniques that exploit legacy edge routers to make tracking such moving IPv6 clients feasible by combining intelligent search space reduction with modern high-speed active probing. Via an Internet-wide measurement campaign, we discovered more than 9M affected edge routers and approximately 13k /48 prefixes employing prefix rotation in hundreds of ASes worldwide. We used the IPv6 topology data set to seed a six-week measurement campaign to characterize the size and dynamics of these deployed IPv6 rotation pools, and demonstrated via a case study the ability to remotely track client address movements over time. (Follow the Scent: Defeating IPv6 Prefix Rotation Privacy, IMC).

Risky BIZness: Risks Derived from Registrar Name Management

We explored a domain hijacking vulnerability that is an accidental byproduct of undocumented operational practices between domain registrars and registries. We showed how over the last nine years over 512K domains have been implicitly exposed to the risk of hijacking, affecting names in most popular TLDs (including .com and .net) as well as legacy TLDs with tight registration control (such as .edu and .gov). Moreover, we showed that this weakness has been actively exploited by multiple parties who, over the years, have assumed control over 163K domains without having any ownership interest in those names. (Risky BIZness: Risks Derived from Registrar Name Management, IMC)

Spatial Temporal Analysis of Internet Darkspace Packets

Using the combined resources of the Supercomputing Centers at UC San Diego, Lawrence Berkeley National Laboratory, and MIT, the spatial temporal structure of anonymized source-destination pairs from the CAIDA Telescope data has been analyzed with GraphBLAS hierarchical hypersparse matrices. These analyses provided unique insight on this unsolicited Internet darkspace traffic with the discovery of many previously unseen scaling relations. The data showed a significant sustained increase in unsolicited traffic corresponding to the start of the COVID-19 pandemic, but relatively little change in the underlying scaling relations associated with unique sources, source fan-outs, unique links, destination fan-ins, and unique destinations. This work provided a demonstration of the practical feasibility and benefit of the safe collection and analysis of significant quantities of anonymized Internet traffic. (Spatial Temporal Analysis of 40,000,000,000,000 Internet Darkspace Packets, HPEC).

Economics and Policy

Second Security, Stability, and Resiliency (SSR2) Review Team Final Report

The SSR Review is a Specific Review mandated by ICANN’s Bylaws that require a periodic assessment of the Security, Stability, and Resiliency of the Domain Name System (DNS). The issues that the review team for the SSR Review may assess are the following: security, operational stability and resiliency matters, both physical and network, relating to the coordination of the Internet’s system of unique identifiers; conformance with appropriate security contingency planning framework for the Internet’s system of unique identifiers. The SSR2 Review Team Final Report contains 63 full consensus recommendations. (Second Security, Stability, and Resiliency (SSR2) Review Team Final Report, SSR2).

The impact of the General Data Protection Regulation on internet interconnection

The Internet comprises thousands of independently operated networks, interconnected using bilaterally negotiated data exchange agreements. The European Union (EU)’s General Data Protection Regulation (GDPR) imposes strict restrictions on handling of personal data of European Economic Area (EEA) residents. We investigated whether any resulting decline in derived demand for data exchange impacts EEA networks’ decisions to interconnect relative to those of non-EEA OECD networks. The evidence suggests that the GDPR has no visible short-run effects on these measures at the internet layer. (The impact of the General Data Protection Regulation on internet interconnection, Telecommunications Policy).

Challenges in measuring the Internet for the public Interest

We wrote a paper to offer framing for conversations about the role of measurement in informing public policy about the Internet, the barriers to gathering measurements, public policy challenges that are creating pressure for reform in this space, and recommended actions that could facilitate gathering of measurements to support policymaking. (Challenges in measuring the Internet for the public Interest, TPRC).

Measurement Infrastructure and Data Sharing Projects

We continued to evolve our measurement, data analytics, and data sharing platforms and pipelines for collecting and curating infrastructure data in a form that facilitates query, integration, and analysis. Below we list the main changes and updates that happened in 2021.

The slide deck CAIDA Measurement Data Infrastructure Overview contains detailed information about existing CAIDA datasets used for networking and security research.

Archipelago

We continued to maintain the Archipelago (Ark) active measurement platform to the extent our resources allowed. The project has had no dedicated funding since 2020, but because Ark feeds the FANTAIL system, we made numerous updates to the Ark software stack and the scamper measurement tool. Ark monitors continue to provide raw data for most of our macroscopic Internet data sets. We continued to support Vela, a system for executing on-demand measurements from Ark nodes.

AS Rank

We updated the GraphQL-based AS Rank API. In addition to adding IPv6 support for measurement and meta-data access libraries, we updated the data preprocessing scripts to receive data through the AS-to-organization API, and added functionality for manual annotations and incorporation of third-party corrections.

AS to Organization mapping

We continued to support our AS-to-Organization mapping by increasing the amount of raw data we collected from National Internet Registries beyond the base WHOIS data that we download from the Regional Internet Registries (RIRs). We integrated the BGPStream service into the AS-to-Organization mapping code base and wrote a RESTful API for the AS to Organization mapping data.

BGPStream

We continued to maintain, support, and evolve the BGPStream code, including bug handling, patching, caching, IPv4/IPv6 storage, and reduced memory usage. In 2021, 9739 unique IP addresses in 968 unique ASes sent requests to BGPStream. Most of these requests originated from organizations that span various industries such as Computer and Information Technology, Education and Research, Community Groups and Nonprofits, and more. Requests originated from over 119 countries, led by the US, China, Germany, and France. Within the Education and Research community, 170 unique IP addresses sent requests, 71 of them in the United States.

DNS Zone Database (DZDB) platform for querying DNS TLD zone files

We continued working with collaborator Ian Foster to transition his software and hardware platform for querying DNS zone files to CAIDA infrastructure. Since the beginning of 2016 this platform has ingested daily zone files from Top-Level Domains, indexing and annotating them in a database. Researchers can query this database for domains, nameservers, IP addresses and more by requesting access to the DZDB API. At the end of December, we were downloading and adding more than 1300 zone files daily to our DZDB database.

IODA (Internet Outage Detection and Analysis)

CAIDA’s infrastructure for detecting macroscopic Internet-edge outage events fused three data sources: Internet Background Radiation (darknet traffic), Border Gateway Protocol (BGP) update messages (used to exchange reachability information between Internet Service Providers), and active probing results that reveal the reachability of end-hosts. In August 2021, IODA PI Alberto Dainotti moved to the Georgia Institute of Technology, and over the next six months migrated the IODA platform from CAIDA to Georgia Tech.

IP address metadata: Libipmeta

Libipmeta is a library that supports querying historical and realtime IP metadata, including CAIDA’s geolocation information, Prefix-To-AS databases, and future metadata on IP addresses. This library, combined with its companion pyipmeta library, supports several CAIDA analysis projects. In 2021, we added IPv6 support to the software framework.

Internet Topology Data Kit (ITDK)

Our ongoing collection of Macroscopic Internet Topology Data Kits (ITDK) started in 2010 and now includes 21 Kits. In early 2022 we published the 2021-03 ITDK. These data sets contain router-level topologies generated from the Ark IPv4 Routed /24 Topology Dataset.

IP Prefix to AS Mapping

One of CAIDA’s most frequently requested datasets is the RouteViews Prefix to AS Mapping Dataset for IPv4 and IPv6. This dataset contains IPv4/IPv6 Prefix-to-Autonomous System (AS) mappings derived from the NSRC RouteViews Project, which gathers BGP updates from hundreds of vantage points around the world. CAIDA uses the RouteViews BGP table dumps to perform a longest-prefix match on observed prefixes, to produce daily snapshots of the Prefix to AS mapping. This daily updated dataset goes back to May 2005 and is publicly available for downloading. We are working on an expansion of this data set to include additional BGP routing tables from RIPE NCC’s Routing Information Services (RIPE RIS) and RouteViews, to increase coverage of the routing tables (address space, ASes, links, paths). We received an XSEDE HPC allocation to use our BGPView software framework and raw data from RouteViews and from RIPE RIS to update the Prefix to AS dataset for the last twenty years (2001-2022). We will release these data in mid-2022.
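
For readers unfamiliar with longest-prefix matching, the sketch below shows how a prefix-to-AS table can be used to map an IP address to an origin AS: among all prefixes containing the address, the most specific one wins. The table contents and function names are hypothetical, and the linear scan is for clarity only. Production tools typically use a radix trie for this lookup.

```python
import ipaddress


def build_table(entries):
    """Parse (prefix, asn) pairs into (ip_network, asn) pairs."""
    return [(ipaddress.ip_network(prefix), asn) for prefix, asn in entries]


def lookup_origin(table, address):
    """Return the ASN of the longest prefix containing address, or None."""
    addr = ipaddress.ip_address(address)
    best = None
    for net, asn in table:
        # Only compare within the same address family (IPv4 or IPv6).
        if addr.version != net.version or addr not in net:
            continue
        if best is None or net.prefixlen > best[0].prefixlen:
            best = (net, asn)
    return best[1] if best else None
```

An address covered by both a /24 and a more specific /25 maps to the AS announcing the /25, which mirrors how routers select among overlapping routes.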

Measurement and ANalysis of Internet Congestion (MANIC)

We provided minimal support for the MANIC (Measurement and Analysis of Network Interdomain Congestion) component of our infrastructure, supporting the DARPA project Performance Evaluation Network Measurements and Analytics (PENMAN) goals to improve the ability of a third party to characterize performance bottlenecks along a given path of interest. (This project will sunset in 2022.)

MIDAR

We continued to maintain all backend and database components that use the MIDAR IPv4 alias resolution service. The MIDAR web API delivers access to MIDAR’s functionality. This tool is a pillar of our Macroscopic Internet Topology Data Kits (ITDK) described above.

Periscope

The Periscope Looking Glass API remained in public alpha testing. We continued to maintain it on the backend and plan to improve automation of account creation. We began the process of adapting it to support its use with RouteViews collectors.

Spoofer

Ark monitors continue to help measure the Internet’s susceptibility to spoofed source address IP packets. Without dedicated funding supporting development on Spoofer, we primarily performed maintenance bugfixes, including updating the backend server code to adapt to high load and improving the AS notification email system. We continued to support the Spoofer client software package on new OS releases for Windows, MacOS, and Linux.

STARDUST (Sustainable Tools for Analysis and Research on Darknet Unsolicited Traffic)

Beginning in 2020, the STARDUST project allowed us to provide access to the UCSD Network Telescope data via a VM-based analysis platform. Historical FlowTuple and daily RSDoS attack metadata and the most recent 30 days of raw telescope packets are kept in Swift, our OpenStack object-based cloud storage, and accessed via the STARDUST VM environment. The FlowTuple data format enables more efficient processing and analysis for many research use cases that do not need access to the full packet contents. The current version of FlowTuple data (2008 - current) uses the Apache Avro data format, parseable with our PyAvro-STARDUST libraries or PySpark STARDUST API. To reduce storage requirements for traffic flow data, we developed a new version (v4) of our flowtuple record format (FlowTuple4), and converted older archives to v4, freeing a few hundred TB of storage. We also moved the RSDoS attacks dataset to the Corsaro 3 framework and changed its output to Apache Avro format. Each record in an Avro file describes an individual DDoS attack observed within a particular 5-minute time interval. This dataset starts July 14, 2020, and is updated daily.

This project entered a transition phase this year, as its funding ended and its principal investigators departed. We are currently granting data access only to collaborators or projects that contribute to funding telescope operations. We published documentation and tutorials for tools and data, and created substantial internal documentation to support future infrastructure management, licensed use of the data, and potential transition.


Facilitating Advances in Network Topology Analysis (FANTAIL)

Illustration of the FANTAIL four-component system.

We are still developing the FANTAIL system to enable discovery of the full potential value of massive raw Internet end-to-end path measurement data sets. We are creating a four-component system: (1) an interactive web interface; (2) an API built on web standards; (3) a full-text search system based on Elasticsearch; and (4) a big data processing system based on Spark, leveraging SDSC’s cluster resources. In 2021, we completed the API client script, server-side support, and updated the pipeline between the web interface and FANTAIL. We completed software to annotate traces with ITDK aliases and AS assignments, and started working on annotating traces with IXP prefixes. We also finished the analysis module to map IP addresses in traceroute paths to network operators (ASes) using bdrmapIT data generated from CAIDA’s ITDKs. We look forward to releasing FANTAIL in 2022.

PacketLab

CAIDA supported the University of Illinois at Urbana-Champaign (PI Kirill Levchenko) on the PacketLab project, which is investigating new technical solutions to sharing network measurement infrastructure by developing an experimental interface to disparate measurement endpoints maintained by different research teams. PacketLab is built on two key ideas: It moves the measurement logic out of the endpoint to a separate experiment control server, making each endpoint a lightweight packet source/sink. At the same time, it provides a way to delegate access to measurement endpoints while retaining fine-grained control over how one’s endpoints are used by others, allowing research groups to share measurement infrastructure with each other with little overhead. CAIDA developed the data visualization on the PacketLab web page, and performed work on developing experiments for PacketLab.

Data Collection Statistics

These graphs show the cumulative volume of data accrued over the last several years by our primary data collection infrastructures, Archipelago (Ark) and the UCSD Network Telescope.

Compressed size of the UCSD Network Telescope raw data stored at NERSC.


CAIDA executes the following Ark measurements on an ongoing basis:

  • IPv4 team probing: daily traceroutes to all routed /24 IPv4 networks – Ark IPv4 Routed /24 Topology Dataset
  • IPv4 prefix probing: daily traceroutes to every BGP-announced IPv4 prefix from a subset of Ark monitors – IPv4 Prefix-Probing Traceroute Dataset
  • IPv6 topology measurements collected by a subset of Ark monitors that probe all announced IPv6 prefixes (/48 or shorter) once every 48 hours – Ark IPv6 Topology Dataset
  • Congestion measurements (part of the MANIC project above) aimed at detecting congestion on interdomain links of the networks hosting the Ark monitors. Congestion measurements comprise Time-Sequence Ping (TSP) measurements and Border Mapping (bdrmap) measurements.

As of January 2022, we collect about 9 TB of uncompressed data per day, more than 95% of which is Telescope (darknet) data. We archive raw telescope data at NERSC. In 2021 CAIDA captured about 17 TB of uncompressed topology traceroute data, and about 1.5 PB of Internet darknet traffic data.

CAIDA Resource Catalog

In pursuit of the FAIR (findable, accessible, interoperable, reusable) principles of our scientific data infrastructure mission, we invested significant effort to make our data sets and associated resources more accessible to other researchers. We continued development of the CAIDA Resource Catalog, which indexes papers and presentations with related datasets and software. We opened the catalog for public access and allowed third-party contributions. We added a handful of Catalog Recipes, which are instructions (with code) on how to solve various Internet security-related problems using existing CAIDA and other datasets and tools. We improved search functionality by adding a “relevance score” and annotation tags. We put substantial effort into increasing the richness of related links and improving the accuracy of the underlying data.

CAIDA Website upgrade

In 2021 we finished migrating our 15-year-old web site backend to a modern content management framework, which was needed to unify it with the CAIDA Resource Catalog. Changing our version control system from CVS to Git allowed us to take advantage of existing continuous integration/continuous deployment (CI/CD) practices for faster collaborative development. Moving from XML-based web pages run on Apache Cocoon to the Hugo static site generator made our web pages simpler to create, edit, and maintain, because configured pages can now be written in Markdown.


Data Distribution Statistics

CAIDA has shared data and knowledge, including sensitive knowledge about critical Internet infrastructure, for over two decades. The field of Internet infrastructure research is essentially dependent on CAIDA’s measurement and data sharing infrastructure. CAIDA’s data enables organizations to identify and remediate various Internet transport layer vulnerabilities including IP spoofing, BGP routing attacks, DNS abuse, and Certificate Authority manipulations.

Users now request access to CAIDA’s data through the CAIDA Resource Catalog. CAIDA datasets fall into two categories: public and by-request. We make public datasets available to users who agree to CAIDA’s Acceptable Use Policy for public data. We vet requests for by-request datasets and make these datasets available for use by academic researchers, US government agencies, and corporate entities through UC San Diego’s Office of Innovation and Commercialization. Users fill out the appropriate request form, including a brief description of their intended use of the data, and agree to an Acceptable Use Policy.

In 2021, CAIDA shared publicly available datasets with many U.S. government organizations and their contractors. For instance, NIST used CAIDA data for validation of algorithms. NIWS Pacific used our IXP data to build statistical traffic models. DoD used UCSD telescope data to validate traffic classification algorithms. MITRE used our publicly available trace data for testing their analytical tools. The list of government organizations and contractors who used our data last year includes: FCC, Naval Postgraduate School, Peraton Labs, Raytheon, SAIC, the Sandia, Los Alamos, and Idaho National Labs, CACI, SCIUMO, Cynnovative, and the Laboratory for Telecommunication Sciences (LTS). We are developing new licensing to accommodate growing interest in commercial use of our data.

The graphs below show the annual counts of unique visitors who downloaded CAIDA datasets (public and by-request) and the total size of downloaded data. These statistics do not include Near-Real-Time Telescope datasets (raw traffic traces in pcap format, aggregated flow and daily RSDoS attack metadata), which users access via the STARDUST platform.

The decrease in the number of users and downloaded volume of “by-request” datasets can be explained by stricter US export control policies. In 2021 we received nearly 600 requests for the CAIDA by-request datasets and granted access to only 340 new users. The decrease of the “Anonymized Passive Traces” users might also be explained by the fact that this data is now three years old. Our last passive traffic trace was collected in January 2019, due to lack of funding to keep up with the link upgrade. We continue to get many community requests for more recent samples of such data, and are in the process of building a 100GB packet capture monitor for deployment in 2022.

Our most popular dataset in 2021 was RouteViews Prefix to AS mapping (see IP Prefix to AS Mapping above) which was downloaded by more than 55,000 unique users. This dataset is used to map IP addresses to prefixes and to AS numbers.
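To illustrate the kind of lookup this dataset supports, here is a minimal Python sketch of longest-prefix matching from an IP address to its origin AS. It is not CAIDA's tooling. It assumes a tab-separated layout of prefix, prefix length, and origin AS per line, and the sample rows, the AS numbers (taken from the documentation range), and the function names `load_pfx2as` and `lookup` are hypothetical.

```python
import ipaddress

# Hypothetical sample rows; the tab-separated "prefix, length, origin AS"
# layout is an assumption about the file format, not a specification.
SAMPLE_PFX2AS = (
    "192.0.2.0\t24\t64500\n"
    "198.51.100.0\t24\t64501\n"
    "198.51.100.128\t25\t64502\n"
)


def load_pfx2as(text):
    """Parse prefix-to-AS rows into a dict keyed by network object."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        prefix, length, origin = line.split("\t")
        table[ipaddress.ip_network(f"{prefix}/{length}")] = origin
    return table


def lookup(table, addr):
    """Return (prefix, origin AS) for the longest matching prefix, or None."""
    ip = ipaddress.ip_address(addr)
    # Try the most specific prefix length first and widen until a match.
    for length in range(ip.max_prefixlen, -1, -1):
        candidate = ipaddress.ip_network(f"{ip}/{length}", strict=False)
        if candidate in table:
            return candidate, table[candidate]
    return None


table = load_pfx2as(SAMPLE_PFX2AS)
match = lookup(table, "198.51.100.200")
print(match)
```

The linear scan over prefix lengths keeps the sketch short. A lookup over a full routing table would more likely use a radix tree or a similar structure.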


Publications using public and/or restricted CAIDA data (by non-CAIDA authors)

We know of a total of 88 publications in 2021 by non-CAIDA authors that used CAIDA data. We update the external publications database as we learn of new publications. Some papers used more than one dataset. As of January 2022, the database listed 2276 papers with first authors in 91 countries. Please let us know if you know of a paper using CAIDA data not yet on our list: Non-CAIDA Publications using CAIDA Data.


CAIDA Software

CAIDA develops and maintains supporting tools for Internet data collection, analysis, and visualization, available in CAIDA’s Resource Catalog.


Workshops

In mid-January 2021, CAIDA held the NSF Workshop on Overcoming Measurement Barriers to Internet Research (WOMBIR-1) via video conference. Part two of the WOMBIR workshop (WOMBIR-2) was held in mid-April via video conference. The goals of this workshop were to identify critical research questions that warrant a call for network measurement (broadly defined), identify barriers and facilitators of that research, and discuss how research results can have impact beyond the research community. (WOMBIR 2021 Final Report, CCR)

In March 2021, we presented at an invited NSF workshop, “Making the Leap to Large”, providing insights on data management and data sharing (Thoughts from the Data Trenches, NSF “Making the Leap to Large” workshop 2021).

In mid-July, CAIDA held the 3rd International Workshop on Darkspace and UnSolicited Traffic Analysis (DUST 2021) via video conference. The goal of the DUST workshop series is to bring together researchers, operators, and analysts interested in unsolicited traffic analysis, especially traffic destined to unassigned (dark) IP address space. The 2021 DUST workshop focused on three topics: infrastructure for capture and analysis of unsolicited traffic; research and education case studies; and distributed telescopes and the future of the STARDUST project. (DUST 2021)


CAIDA in Numbers

In 2021, CAIDA published 13 peer-reviewed papers (see below) and 2 workshop reports, made 9 presentations, and posted 2 blog entries. Presented materials are listed on the CAIDA Resource Catalog. Our web site www.caida.org attracted approximately 316,619 unique visitors, with an average of 1.89 visits per visitor, serving an average of 3.15 pages per visit. During 2021, CAIDA employed 18 staff (researchers, programmers, data administrators, technical support staff), and hosted 2 postdocs, 4 PhD students, and 22 undergraduate students.

The tables below show CAIDA expenses by type of operating expense, by funding source, and by program area:

Expense type                       Amount ($)        Percentage
Labor                              $1,524,642.30     42%
Indirect Costs                     $1,095,592.44     30%
Benefits                           $582,712.91       16%
Subcontracts                       $274,382.57       7%
Supplies & Expenses                $139,246.05       4%
Professional Development           $36,666.15        1%
Equipment                          $15,733.64        <1%
Total                              $3,668,976.06     100%

Funding Source                     Amount ($)        Percentage
NSF                                $2,346,655.43     64%
DARPA                              $533,481.85       15%
State Dept                         $413,965.84       11%
Gift                               $232,579.04       6%
DHS                                $105,123.63       3%
Other                              $37,170.27        1%
Total                              $3,668,976.06     100%

Research Program Area              Amount ($)        Percentage
Infrastructure & Data Sharing      $1,464,958.14     40%
Security, Stability, Resilience    $971,111.82       26%
Cartography                        $630,020.81       17%
Performance                        $320,289.95       9%
Outreach                           $282,595.35       8%
Total                              $3,668,976.07     100%


Publications

Publications are grouped by research categories.

Internet Cartography (Mapping) Methods

Performance Measurement

Internet Security, Stability, and Resilience

Economics and Policy Research

Workshop Reports

Supporting Resources

CAIDA’s accomplishments are in large measure due to the high quality of our visiting students and collaborators. We are also fortunate to have financial and IT support from sponsors, members, collaborators, and monitor hosting sites.

UC San Diego Graduate Students

Visiting Scholars

Funding Sources
