CAIDA's Annual Report for 2022

A report on CAIDA research initiatives, project progress and results, data sets, tool development, publications, presentations, workshops, web site statistics, and operating expenses for 2022.

Mission Statement: CAIDA investigates practical and theoretical aspects of the Internet, focusing on activities that:

  • provide insights into the macroscopic function of Internet infrastructure, behavior, usage, and evolution,
  • foster a collaborative environment in which data can be acquired, analyzed, and (as appropriate) shared,
  • improve the integrity of the field of Internet science,
  • inform science, technology, and communications public policies.

Executive Summary

This annual report summarizes CAIDA’s activities for 2022 in the areas of research, infrastructure, data collection and analysis.

Research. Our research primarily focused on security, resilience, and performance studies of the underlying transport systems of the Internet: forwarding, BGP routing, naming (DNS), and TLS certificates.
Each of these systems has critical flaws that leave the Internet ecosystem vulnerable to a variety of attacks. Our research in these areas focused on independent assessment of the extent of the problem and effectiveness of mitigations.

BGP Routing. Focused on the routing system, we used cryptographically authenticated RPKI information to analyze the inaccuracy of Internet Routing Registry (IRR) databases, which are still commonly used to support route filtering to protect against hijacks and route leaks. We also analyzed what public blacklists can tell us about the effectiveness of IRR/RPKI as a routing security mechanism. Finally, we provided the first independent look into the efficacy of collective action efforts to advance routing security, revealing significant room for improvement.

DNS. We took a similar approach with the DNS, undertaking four studies to ascertain what an independent party can analyze regarding DNS vulnerabilities and their exploitation. Each study required joining one or more DNS data sets with other diverse sources of data – Internet-wide scans, darknet traffic data, TLS certificates, BGP data, AS metadata, and geolocation data. One study introduced a new approach to analyzing the impact of (Distributed) Denial of Service attacks against DNS infrastructure. Another study measured longitudinal changes in the makeup of naming, hosting, and certificate issuance for domains in the Russian Federation since the start of hostilities in Ukraine.

Traffic Analysis. We developed new collaborations to broaden the impact of our Network Telescope data, including collaborations that spanned industry, government, and academic stakeholders to compare phenomena seen with what is seen in industry honeypot data sources. The observations suggest a correlated high frequency concentration of suspicious sources that drifts on time scales of months. We began development of a new machine learning framework to scale event detection in this traffic data, and examined Internet-wide scan traffic through a reactive network telescope, finding that today’s scans are highly targeted and vary across regions.

Performance. We published a unified and configurable framework for facilitating automatic test execution and cross-layer analysis of test results for five major web-based speed test platforms, and applied it to investigate impediments to accuracy of latency measurements, which play a vital role in today’s speed tests. We also created a jitter-based congestion inference framework called Jitterbug, and applied it to a range of traffic scenarios to identify both recurrent and one-off congestion events.

Policy. We participated in policy research and discussions related to these security issues. We published an analysis of the role of measurement in informing public policy about the Internet, including different stakeholders’ approaches to measurements and associated challenges.
We also published a taxonomy of harms at the Internet transport layer and measurements that currently inform their analysis. We participated in FCC’s Notice of Inquiry related to routing security, which continues into 2023.

Infrastructure Operations and Design. Our NSF mid-scale design effort allowed us to make progress with our infrastructure, software development, and data sharing activities to support Internet research, both at CAIDA and around the world. We continued to support and enhance our infrastructure components that create data products in the most demand by the community, including Ark, AS Rank, AS-to-Org mapping, BGPStream, DNS Zone Database, Internet Topology Data Kit, MIDAR, Periscope, Spoofer, and the UCSD Network Telescope. We introduced new tools including an improved inference engine for hostname-based geolocation. We continued expanding our rich-context Resource Catalog for CAIDA Internet Data Science Resources. We engaged with partners from industry, academia, and government to gain insights into measurement needs and data acquisition infrastructure design.

Everything Else. As always, we engaged in a variety of tool development, data sharing, and outreach activities, including publishing 16 peer-reviewed papers. We provide select highlights in this report; details are available in papers, presentations, blog, and interactive resources on our web sites. We list and link to publications, tools and data sets shared. Finally, we offer a “CAIDA in numbers” section: statistics on our performance, collaborators, finances and funding sources.

We are still developing CAIDA’s program plan for 2023-2027. Please feel free to send comments or questions to info at caida dot org. Please note the link to donate to CAIDA at the top of our web site. Donations are tax-deductible, and because UC San Diego charges no overhead on them, they go 100% to research!

Research and Analysis

Our research collaborations spanned academic, government, and industry stakeholders. We focused on security and stability studies of the Internet’s routing, naming, and certificate systems. We leveraged our large darknet to pursue scalable methods to infer security-relevant events in the face of increasing traffic volumes, and developed new methods for network performance assessments.

Security, Stability, and Resilience (SSR) of the Internet’s routing system

The global routing protocol (the Border Gateway Protocol or BGP) propagates topology and routing policy information across 70K+ independent networks called autonomous systems (ASes). The critical security vulnerability with BGP is well-known: a rogue AS can announce a false assertion that it originates or is in the path to a block of addresses that it does not in fact have the authority to announce. BGP, as part of its design, does not include mechanisms to prevent such false assertions. Routers that accept such a false assertion will then deflect traffic intended for addresses in that block to the rogue AS, which can drop, inspect, or manipulate that traffic, or send traffic masquerading as those addresses. A malicious AS can falsify any part of a BGP announcement, including the origin prefix or AS, or the path. This attack is called a route hijack.

IRR Hygiene in the RPKI Era

The Internet Routing Registry (IRR) and Resource Public Key Infrastructure (RPKI) both allow networks to register routing information and develop route filters based on information other networks have registered. RPKI is a cryptographically authenticated system, with associated complexity and policy challenges; it has seen substantial but slowing adoption. IRR databases contain inaccurate records due to a lack of validation standards. We quantified the consistency between IRR and RPKI records, analyzed the causes of inconsistency, and examined which ASes are contributing correct IRR information. (IRR Hygiene in the RPKI Era, PAM)
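One way to picture a consistency check between the two systems is to test each IRR route object against the set of RPKI Route Origin Authorizations (ROAs), in the style of RFC 6811 origin validation. The following is a minimal sketch under stated assumptions, not the methodology of the paper: the `Roa` record layout, the `classify` function, the three labels, and the sample data (documentation prefixes and private-use ASNs) are all hypothetical.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    """Hypothetical, simplified view of one RPKI Route Origin Authorization."""
    prefix: str       # authorized prefix
    max_length: int   # longest prefix length the ROA permits
    asn: int          # authorized origin AS


def classify(irr_prefix: str, irr_origin: int, roas: list[Roa]) -> str:
    """Label one IRR route object by comparing it with covering ROAs."""
    net = ipaddress.ip_network(irr_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Skip ROAs of the other address family or that do not cover the prefix.
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == irr_origin and net.prefixlen <= roa.max_length:
            return "consistent"
    # Covered by at least one ROA, but none matches origin and length.
    return "inconsistent" if covered else "not-in-rpki"


# Made-up sample data for illustration only.
SAMPLE_ROAS = [
    Roa("192.0.2.0/24", 24, 64496),
    Roa("198.51.100.0/24", 24, 64497),
]

if __name__ == "__main__":
    for prefix, origin in [
        ("192.0.2.0/24", 64496),
        ("198.51.100.0/24", 64500),
        ("203.0.113.0/24", 64496),
    ]:
        print(prefix, origin, classify(prefix, origin, SAMPLE_ROAS))
```

A real comparison would also have to handle record age, multiple IRR databases, and overlapping registrations, which this sketch ignores.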

Analyzing the Effectiveness of DROP: Don’t Route Or Peer list

We analyzed what public blacklists (as a source of information about hijacked prefixes) can tell us about the effectiveness of IRR/RPKI as a routing security mechanism. We analyzed the properties of prefixes that appeared in Spamhaus’ Don’t Route Or Peer (DROP) list over a nearly three-year period from June 2019 to March 2022. We showed that attackers subverted multiple defenses against malicious use of address space, including creating fraudulent Internet Routing Registry records for prefixes shortly before using them. Other attackers disguised their activities by announcing routes with spoofed origin ASes consistent with historic route announcements, and in one case, with the ASN in a Route Origin Authorization. We quantified the substantial and actively-exploited attack surface in unrouted address space, which warrants reconsideration of RPKI eligibility restrictions by RIRs, and reconsideration of AS0 policies by both operators and RIRs. (Stop, DROP, and ROA: Effectiveness of Defenses through the lens of DROP, IMC)

A Study of Collective Action Efforts to Improve Routing Security

MANRS – Mutually Agreed Norms for Routing Security – is an industry-led initiative to improve Internet interdomain (BGP) routing security by encouraging participating networks to implement a series of mandatory or recommended actions. We provided the first independent look into the MANRS ecosystem by using publicly available data to analyze the routing behavior of participant networks, quantify MANRS participants’ conformance with MANRS requirements, and compare the behavior of MANRS and non-MANRS networks. Although most networks are conformant, the gaps suggest a need for sustained independent auditing of MANRS practices. (Mind Your MANRS: Measuring the MANRS Ecosystem, IMC, Studying Conformance of MANRS Members, blog summary)

DNS and TLS Certificates

The Domain Name System (DNS) translates human-meaningful domain names into IP addresses to which routers forward packets. The Certificate Authority system manages and distributes to users encryption keys used for transport connections so that users can confirm the identity of the party with which they are communicating. Both systems have vulnerabilities that can mislead users to malicious sites they did not intend to reach.

Retroactively Identifying DNS Infrastructure Hijacks

Combining data from Internet-wide scans, passive DNS records, CAIDA’s DNS Zone Database (DZDB) and Certificate Transparency logs, we constructed a methodology for identifying potential victims of sophisticated DNS infrastructure hijacking and used it to identify a range of victims (primarily government agencies), both those named in prior reporting, and others previously unknown. (Retroactive Identification of Targeted DNS Infrastructure Hijacking, IMC)

Investigating the Impact of DDoS Attacks on DNS Infrastructure

To characterize recent DDoS attacks against authoritative DNS infrastructure, we joined two data sets to discover evidence that millions of domains (up to 5% of the DNS namespace) experienced a DoS attack during our observation window. Most attacks did not substantially harm DNS performance, but in some cases we saw 100-fold increases in DNS resolution time, or complete unreachability. Our data corroborates the value of known best practices to improve DNS resilience to attacks, including the use of anycast and topological redundancy in nameserver infrastructure. (Investigating the impact of DDoS attacks on DNS infrastructure, IMC)

Assessing the Impact on Russian Domain Infrastructure from Hostilities in Ukraine

Economic sanctions against Russia in the wake of hostilities in Ukraine led to internal pressures on Russian sites to (re-)patriate the infrastructure they depend on (e.g., naming and hosting) and external pressures arising from Western providers disassociating from some or all Russian customers. We directly measured longitudinal changes in the makeup of naming, hosting, and certificate issuance for domains in the Russian Federation. While a considerable number of Russian websites faced limitations in accessing Western service providers, the impact on their operations was not catastrophic. Certificate issuance, however, emerged as a notable area of vulnerability for Russia. The most surprising result was the near-complete control Let’s Encrypt holds in securing Russian web sites. While Let’s Encrypt has a public interest mission that provides free CA service to all comers, it is also a US entity and subject to US law and export control restrictions. (Where .ru? Assessing the Impact of Conflict on Russian Domain Infrastructure, IMC)

A Study of Collective Action Efforts to Improve DNS Security and Resilience

Inspired by the MANRS effort, ICANN recently proposed an initiative to codify best practices into a set of global norms to improve security: the Knowledge-Sharing and Instantiating Norms for DNS and Naming Security (KINDNS). One challenge for both initiatives is independent verification of conformance with the practices. Stakeholders of the KINDNS initiative are still debating what should be in the set of practices, and we analyzed possible best practices in terms of their measurability by third parties, including a review of DNS measurement studies and available data sets. (Observable KINDNS: Validating DNS Hygiene, IMC Poster)

Security-relevant analyses of Internet darknet (background) traffic

Temporal Correlation of Internet Observatories and Outposts

Our collaboration with MIT Lincoln Labs continued as they used the UCSD Telescope data to study security-relevant scaling characteristics of darkspace traffic. The collaboration leveraged data and infrastructure from DOE’s NERSC, Globus, SDSC, MIT Lincoln Labs, Texas A&M, and cybersecurity company GreyNoise. The team compared unsolicited Internet traffic sources from the CAIDA telescope with those from GreyNoise’s commercial honeyfarm, using GraphBLAS hyperspace matrices and associative arrays. Over 6 months, 70% of the highest frequency sources in the CAIDA telescope were consistently detected by the GreyNoise honeyfarm. These observations suggest a correlated high frequency beam of sources that drifts on time scales of months. Each of these observations provides a basis for predictions for future measurements and for theoretical modeling of the underlying generative processes. (Temporal Correlation of Internet Observatories and Outposts, GrAPL) (MIT Lincoln Labs published four other studies based on this data in 2022.)

A Scalable Network Event Detection Framework for Darknet Traffic

In pursuit of scalable methods to navigate the overwhelming growth in Internet background radiation (IBR) traffic, we proposed a machine learning (ML)-based framework to detect events by characterizing traffic dynamics across many time series generated from raw traffic flows. Our proposed method leverages ML techniques to extract meaningful signals from aggregated data, enabling the identification of specific time periods that warrant using raw packet traces for further investigation of potential attacks. (A Scalable Network Event Detection Framework for Darknet Traffic, IMC Poster)

Investigating the evolution of Internet-wide scanning behavior

Through an international collaboration, we conducted an in-depth analysis of Internet-wide scan traffic using a reactive network telescope that operates in real time. Our findings provided a clear signature of today’s scans as highly targeted, varying across regions, and generated in significant part from malicious sources. (Spoki: Unveiling a New Wave of Scanners through a Reactive Network Telescope, USENIX Security Symposium)

Performance Measurement

Web-based Speed Test Analysis Tool Kit

We introduced WebTestKit, a unified and configurable framework for facilitating automatic test execution and cross-layer analysis of test results for five major web-based speed test platforms. Capturing only packet headers of traffic traces, WebTestKit performs in-depth analysis by carefully extracting HTTP and timing information from test runs. We applied WebTestKit to investigate impediments to accuracy of latency measurements, which play a vital role in test server selection and throughput estimation. (Design and Implementation of Web-based Speed Test Analysis Tool Kit, PAM)

Jitter-based Congestion Inference

We discovered a jitter-based time series that is characteristic of periods of congestion. We created a jitter-based congestion inference framework called Jitterbug, and applied it to a range of traffic scenarios to identify both recurrent and one-off congestion events. (Jitterbug: A new framework for jitter-based congestion inference, PAM)
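To make the idea of a jitter-based time series concrete, the sketch below derives jitter from a sequence of round-trip-time samples and flags windows with elevated mean jitter as candidates for closer inspection. This is an illustration only and not the Jitterbug inference method: the definition of jitter as the absolute difference between consecutive RTTs, the window size, the threshold, and all function names are assumptions made for this example.

```python
from statistics import mean


def jitter_series(rtts_ms):
    """Jitter as the absolute difference between consecutive RTT samples (assumed definition)."""
    return [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]


def high_jitter_windows(rtts_ms, window=4, threshold_ms=5.0):
    """Return start indices of non-overlapping jitter windows whose mean exceeds the threshold."""
    jitter = jitter_series(rtts_ms)
    flagged = []
    for start in range(0, len(jitter) - window + 1, window):
        if mean(jitter[start:start + window]) > threshold_ms:
            flagged.append(start)
    return flagged


# Made-up RTT samples in milliseconds: a quiet period, a noisy period, then quiet again.
SAMPLE_RTTS = [20, 21, 20, 21, 20, 40, 25, 45, 22, 21, 20, 21, 20]

if __name__ == "__main__":
    print(jitter_series(SAMPLE_RTTS))
    print(high_jitter_windows(SAMPLE_RTTS))
```

A fixed threshold is the simplest possible choice; any practical detector would need to adapt to the baseline behavior of each path.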

Policy Activities

Challenges in measuring the Internet for the public Interest

We identified barriers to Internet infrastructure measurement, societal challenges that create pressure to overcome these barriers, and steps that could facilitate measurement to support policymaking. (Challenges in measuring the Internet for the public Interest, Journal of Information Policy)

Public Comments In FCC’s NOI on Secure Internet Routing

David Clark and Cecilia Testart (MIT/CSAIL), and KC Claffy (CAIDA) submitted a response to the U.S. Federal Communications Commission’s (FCC) Notice of Inquiry seeking comment on steps that the FCC should take to protect the nation’s communications network from BGP vulnerabilities. (Comments before the FCC in the matter of Secure Internet Routing, FCC)

The EU NIS-2 proposal and the DNS

David Clark summarized the potential impact on the DNS of the EU’s 2022 Network and Information Security (NIS-2) Directive. This new EU regulation mandates that TLD registries and entities providing domain name registration services collect and maintain accurate and complete domain name registration data, efficiently provide access to “legitimate seekers”, and make public any registration data that falls outside EU data protection rules. The Directive underlines the necessity of a reliable and secure DNS for the Internet’s integrity. It also requires adherence to EU data protection law when processing personal data, and requires that data be available to public authorities for DNS abuse prevention. Moreover, the Directive assigns jurisdiction to the Member State where an entity has its main establishment, and stipulates that entities not established in the EU but offering services within it should designate a representative. The Directive emphasizes the crucial role of collaboration between research institutions and providers of network and information services deemed critical or significant. Specifically, these entities should mitigate cybersecurity risks emanating from their interactions and relationships within a wider ecosystem. This includes ensuring that their collaborations with academic and research institutions align with their cybersecurity policies and adhere to best practices regarding secure information access and dissemination, with particular attention to intellectual property protection. Finally, the Directive urges EU Member States to adopt policies that support academic and research institutions in the development of cybersecurity tools and secure network infrastructure. (The EU NIS-2 proposal and the DNS, MIT/CSAIL tech report)

Measurement Infrastructure and Data Sharing Projects

Designing a Global Measurement Infrastructure to Improve Internet Security (GMI3S)

Overview of the GMI3S structure

We continued to evolve our measurement, data analytics, and data sharing platforms and pipelines for collecting and curating infrastructure data in a form that facilitates query, integration, and analysis. Framing these efforts was our new GMI Design Project (Designing a Global Measurement Infrastructure to Improve Internet Security, or GMI3S). The goal of the GMI project is to design a new generation of measurement infrastructure for the Internet, which will support collection, curation, archiving, and expanded sharing of data needed to advance critical scientific research on the security, stability, and resilience of Internet infrastructure. In collaboration with the University of Oregon’s Network Startup Resource Center (NSRC), MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), and numerous others in industry and academia, CAIDA is investigating sustainable production-data acquisition, curation tools, meta-data generation and efficient storage and dissemination to identify security gaps and explore potential areas of improvement within the domain. (GMI3S Design Project website)

As part of this project, we are developing a semi-automatic inventory system tracking all CAIDA virtual and hardware machines and software, and how they map to services and data shared with the research community. We are also designing and developing mechanisms to map these software and hardware systems to datasets and software objects in the catalog. Below we list the main changes and updates that happened in 2022.

Archipelago

We continued to maintain the Archipelago (Ark) active measurement platform as needed to inform our next generation active measurement infrastructure design effort. Ark monitors continue to provide raw data for most of our macroscopic Internet data sets.

AS Rank

AS Rank integrates heterogeneous datasets, including inferred AS relationships, customer cones, AS to organization mapping, Netacuity geolocation, and more, to generate a unified and comprehensive perspective of the Internet’s Autonomous Systems (ASes). This unified view allows for a better understanding of the relationships, hierarchy, and characteristics of ASes within the global Internet infrastructure. We updated our GraphQL-based AS Rank API. Based on community feedback, we added a default sort of an AS’s neighbors by providers, then peers, then customers. We continue to update AS Rank’s inferences with operator ground truth, increasing the accuracy of the presented data.

AS to Organization mapping

We continued updating the as2org data set every quarter. In addition, we manually integrated seven annotations provided by operators and other Internet experts.

View of the BGP2GO dashboard

BGP metadata and BGP2GO to facilitate access and sharing of relevant MRT files

We prototyped a web application that assists in selecting and obtaining relevant MRT data sets for further analysis. We have indexed almost ten years of RouteViews data into a supporting metadata database, and created a graphical interface to explore and compile MRT file information. With each lookup, the application presents a summary to the user: the number of MRT files, approximate download size, the earliest and latest available MRT files, and the involved collectors. Lead designer Thomas Krenc gave several talks on BGPMeta to get feedback on the design. We hope to extend this prototype to include other types of data, e.g., RIR allocation files and DNS data such as OpenINTEL. Indexing more data will facilitate correlation of activities of an identifier across data sets.
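The per-lookup summary described above can be pictured as a simple aggregation over metadata records. The sketch below is illustrative only and does not reflect the prototype's actual schema or code: the `MrtFile` record, its fields, the `summarize` function, and the sample records are hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class MrtFile(NamedTuple):
    """Hypothetical metadata record for one indexed MRT file."""
    collector: str
    start: datetime
    size_bytes: int


def summarize(files):
    """Aggregate the fields a lookup summary would report for the matching files."""
    if not files:
        return {"count": 0, "total_bytes": 0, "earliest": None,
                "latest": None, "collectors": []}
    return {
        "count": len(files),
        "total_bytes": sum(f.size_bytes for f in files),
        "earliest": min(f.start for f in files),
        "latest": max(f.start for f in files),
        "collectors": sorted({f.collector for f in files}),
    }


# Made-up records for illustration; sizes and timestamps are not real measurements.
SAMPLE_FILES = [
    MrtFile("route-views2", datetime(2022, 3, 1, 0, 0, tzinfo=timezone.utc), 1_500_000),
    MrtFile("route-views2", datetime(2022, 3, 1, 0, 15, tzinfo=timezone.utc), 1_200_000),
    MrtFile("route-views.sydney", datetime(2022, 2, 28, 23, 45, tzinfo=timezone.utc), 800_000),
]

if __name__ == "__main__":
    for key, value in summarize(SAMPLE_FILES).items():
        print(f"{key}: {value}")
```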

BGPStream

We continued limited support for the BGPStream broker, which is seeing growing use.

DNS Zone Database (DZDB) platform for querying DNS TLD zone files

CAIDA’s DNS Zone Database (DZDB) is a platform providing access to time-series data derived from current and historical zone files provided by generic Top-Level Domains (gTLDs) participating in the Central Zone Data Service (CZDS) or directly by TLD Registry Operators in compliance with license agreements. We continued working with collaborator Ian Foster to transition the DZDB platform to CAIDA infrastructure. Researchers can query this database for domains, nameservers, IP addresses and more by requesting access to the DZDB API. At the end of December, we were downloading and adding more than 1250 zone files daily to our DZDB database. We added a feature to the API to let users type in a prefix and get all nameserver IPs/hostnames and domains served by that prefix. We also added documentation for the internal web API.

Facilitating Advances in Network Topology Analysis (FANTAIL)

We developed the Facilitating Advances in Network Topology Analysis (FANTAIL) system to facilitate discovery of the full potential value of massive raw Internet end-to-end path measurement data sets, allowing researchers to use high-level queries to perform data processing and analysis tasks on matching traces without owning/operating a cluster, and without learning big data programming. FANTAIL is a four-component system: (1) an interactive web interface; (2) an API built on web standards; (3) a full-text search system based on Elasticsearch; and (4) a big data processing system based on Spark. In 2022, we implemented support for trace annotations (aliases, ASes, hostnames, IXPs) in the FANTAIL web and API interfaces. We completed a recipe to find all interconnection links of a given AS or set of ASes (sibling ASes) in a given time period, and a recipe to construct the topology of a target AS (that is, all routers and links that map to the AS) in a given time period, presented as a graph. We integrated these recipes with FANTAIL’s front end interfaces, and implemented a simple way of executing analysis recipes via both the API and web interfaces. We will begin beta access to FANTAIL in 2023.
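As a rough illustration of what an interconnection-link recipe computes, the sketch below scans AS-annotated traceroute paths and collects adjacent hop pairs where exactly one side belongs to the target AS set (sibling ASes are treated as one network). It is a simplified sketch and not FANTAIL's recipe, API, or data model: the trace representation, the function name, and the sample traces are hypothetical, and real border inference must handle third-party addresses and unresponsive hops that this example ignores.

```python
def interconnection_links(traces, target_asns):
    """Return (near_ip, far_ip) hop pairs that cross the boundary of the target AS set.

    Each trace is a list of (ip, asn) hops in path order; asn may be None
    when a hop has no AS annotation.
    """
    links = set()
    for hops in traces:
        for (ip_a, as_a), (ip_b, as_b) in zip(hops, hops[1:]):
            if as_a is None or as_b is None or as_a == as_b:
                continue
            # Keep the pair only if exactly one endpoint is inside the target set.
            if (as_a in target_asns) != (as_b in target_asns):
                links.add((ip_a, ip_b))
    return links


# Made-up annotated traces using documentation addresses and private-use ASNs.
SAMPLE_TRACES = [
    [("192.0.2.1", 64496), ("192.0.2.2", 64496),
     ("198.51.100.1", 64500), ("198.51.100.2", 64500)],
    [("203.0.113.1", 64501), ("198.51.100.9", 64500), ("203.0.113.9", 64502)],
]

if __name__ == "__main__":
    for link in sorted(interconnection_links(SAMPLE_TRACES, {64500})):
        print(link)
```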

Hoiho

Holistic Orthography of Internet Hostname Observations (Hoiho) is an open-source tool released as a part of scamper. It uses CAIDA’s Macroscopic Internet Topology Data Kit (ITDK) and observed round trip times to infer regular expressions that extract apparent geolocation hints from hostnames. The ITDK contains a large dataset of routers with annotated hostnames, which are used as input to Hoiho for its inference rules (encoded as regular expressions) that extract these annotations. In 2022, we created a web API for Hoiho. This public interface provides hostname-location lookups.
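To show what a regular expression that extracts a geolocation hint from a hostname looks like in practice, the sketch below applies one hand-written rule to a fictional naming convention. Hoiho infers such rules automatically from data; here the rule, the `example.net` convention, the small code-to-location dictionary, and the `locate` function are all hypothetical and written by hand for illustration.

```python
import re

# Hypothetical rule: the label before the domain starts with a three-letter
# location code followed by digits, e.g. "lax01" in "ae-1.r01.lax01.example.net".
RULE = re.compile(r"^.+\.([a-z]{3})\d+\.example\.net$")

# Tiny illustrative dictionary mapping codes to locations.
GEOHINTS = {
    "lax": "Los Angeles, US",
    "fra": "Frankfurt, DE",
}


def locate(hostname):
    """Return the location implied by the hostname, or None if no hint is found."""
    match = RULE.match(hostname.lower())
    if match is None:
        return None
    return GEOHINTS.get(match.group(1))


if __name__ == "__main__":
    for name in ["ae-1.r01.lax01.example.net", "xe-0.fra2.example.net", "www.example.org"]:
        print(name, "->", locate(name))
```

An extracted hint is only a candidate: the report notes that Hoiho also uses observed round trip times, which can rule out locations that are physically implausible.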

Libipmeta to support querying for IP address metadata

Libipmeta is a library to support the querying for historical and realtime IP metadata including CAIDA’s geolocation information, Prefix-To-AS databases, and future metadata on IP addresses. This library has a companion pyipmeta library. We implemented a Golang-native version of ipmeta, to enable concurrent processing of data, and used it to analyze network telescope traffic. We hope to release this version in 2023.

Internet Topology Data Kit (ITDK)

Our ongoing collection of Macroscopic Internet Topology Data Kits (ITDK) started in 2010 and now includes 22 Kits. In early 2022 we published the 2021-03 ITDK, and later the 2022-02 ITDK. These data sets contain router-level topologies generated from the Ark IPv4 Routed /24 Topology Dataset.

IP Prefix to AS Mapping

One of CAIDA’s most frequently requested datasets is its RouteViews Prefix to AS Mapping Dataset for IPv4 and IPv6. This dataset contains IPv4/IPv6 Prefix-to-Autonomous System (AS) mappings derived from the NSRC RouteViews Project, which gathers BGP updates from hundreds of vantage points around the world. CAIDA uses RouteViews BGP table dumps (from one collector) to perform a longest-prefix match on observed prefixes, to produce daily snapshots of the Prefix to AS mapping. This daily updated dataset goes back to May 2005. We also did major refactoring, redesign, optimizations, output formatting, and code cleanup in the prefix2as code within BGPView (a BGPStream-based library).
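Longest-prefix matching, the lookup rule behind a prefix-to-AS mapping, selects the most specific prefix that contains a given address. The sketch below shows the rule with Python's standard `ipaddress` module; it is a linear scan meant for clarity rather than the prefix2as implementation, and the function names and sample table (documentation prefixes, private-use ASNs) are hypothetical.

```python
import ipaddress


def build_table(entries):
    """Parse (prefix, asn) pairs into (network, asn) tuples."""
    return [(ipaddress.ip_network(prefix), asn) for prefix, asn in entries]


def lookup(table, address):
    """Return (network, asn) for the most specific covering prefix, or None."""
    addr = ipaddress.ip_address(address)
    best = None
    for net, asn in table:
        # Only compare within the same address family.
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, asn)
    return best


# Made-up table for illustration only.
SAMPLE_TABLE = build_table([
    ("192.0.2.0/24", 64496),
    ("192.0.2.128/25", 64497),
    ("2001:db8::/32", 64498),
])

if __name__ == "__main__":
    for ip in ["192.0.2.1", "192.0.2.200", "2001:db8::1", "198.51.100.1"]:
        print(ip, "->", lookup(SAMPLE_TABLE, ip))
```

A production lookup over a full routing table would typically use a radix tree or similar structure instead of scanning every prefix.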

MIDAR

We continued to maintain all backend and database components that use the MIDAR IPv4 alias resolution service. The MIDAR web API delivers access to MIDAR’s functionality. This tool is a pillar of our Macroscopic Internet Topology Data Kits (ITDK).

PacketLab

CAIDA continued to collaborate with UIUC’s team (led by PI Kirill Levchenko) investigating a new technical approach to sharing network measurement infrastructure by developing an experimental interface to disparate measurement endpoints maintained by different research teams. PacketLab is built on two key ideas: It moves the measurement logic out of the endpoint to a separate experiment control server, making each endpoint a lightweight packet source/sink. At the same time, it provides a way to delegate access to measurement endpoints while retaining fine-grained control over how one’s endpoints are used by others, allowing research groups to share measurement infrastructure with each other with little overhead.
In 2022, UIUC released an Alpha version of the PacketLab software package (at pktlab.github.io) for community evaluation. (PacketLab - Tools Alpha Release and Demo, IMC)

Periscope

In 2022, the Periscope Looking Glass API continued public beta testing. We maintained the backend and improved automation of account creation. We began the process of adapting the backend and API to support its use for access to scamper processes running on BGP collectors, and its integration with CAIDA’s new Keycloak-based authentication/authorization framework.

Spoofer

Spoofer is a suite of open-source software tools to assess and report on the deployment of source address validation (SAV) anti-spoofing best practices. This client-server system periodically tests a network’s ability to both send and receive packets with forged source IP addresses (spoofed packets). The CAIDA Spoofer Data API (https://api.spoofer.caida.org/) provides a public interface to the publicly sharable data collected by the Spoofer service.

We continued to support the Spoofer client software package on new OS releases for Windows, MacOS, and Linux. We added new documentation in a recipe form for how to download and run the spoofer client. We incorporated Spoofer results into the AS Rank API and web interface.

Telescope

We have experimented with several different ways of sharing Telescope (IBR) data with researchers for open source and commercial efforts:

  1. A direct stream of curated (reduced from original) RS-DOS event data with trusted collaborators at U. Twente. A joint DHS/NWO project originally funded this data sharing, and we have sustained it beyond the end of that project. This is the highest-fidelity way to access the traffic but requires a startup cost and high trust in collaborators. Our Dutch collaborators used this data to correlate DDoS events with active measurement indicators of performance degradation, e.g., Investigating the impact of DDoS attacks on DNS infrastructure (IMC22).
  2. Access to raw historical pcap files from the NERSC archive, again with trusted collaborators who already have an established relationship with NERSC (Lincoln Labs, a DOD FFRDC.) Extending this approach to commercial users will require additional permissions from NSF and NERSC. This mode of data sharing has contributed to at least six publications listed in our catalog.
  3. We have established a data exporter to send a subset of packets received by the telescope to an industry partner (DomainTools) over our existing infrastructure. This is the most likely mode that commercial users will want to leverage.
  4. In some cases we shared data through virtual machine (enclave) access through our OpenStack Hypervisor system. This requires that users log into our virtual machines, and that we limit CPU, memory, and disk usage on a per-user basis. Most publications using telescope data that are reported in our catalog have used this method.
  5. For collaborators who cannot meet their processing requirements using the VM option, we have provided temporary direct access to a CAIDA computer server.
  6. UCSD researchers are leveraging Expanse (ACCESS) to process telescope data, a mode we have offered to other researchers who qualify for ACCESS. Thus far this new project has led to “A Scalable Network Event Detection Framework for Darknet Traffic” (IMC22 poster).
  7. We continue to support our time-series dashboard of statistics of traffic coming to the UCSD telescope (available to anyone with Globus or Github account), which researchers are using to identify suspicious events in telescope data. This approach sometimes leads to questions that we try to help answer, such as case studies of traffic spikes that are likely malicious events.

The variety of approaches has shown us how beneficial it is to accommodate different needs in accessing the data. Each of these approaches, however, required dedicated IT staff time and attention.

Data Collection Statistics: Topology and Traffic

The slide deck CAIDA Measurement Data Infrastructure Overview summarizes CAIDA datasets used for networking and security research.

The graphs presented here show the cumulative volume of data accrued over the last several years by our primary data collection infrastructures, Archipelago (Ark) and the UCSD Network Telescope.

Uncompressed size of Ark topology measurements. Light green shading indicates the size of IPv4 team probing measurements, dark green -- the size of IPv4 prefix probing, blue -- IPv4 TSLP congestion, red -- IPv4 Border Mapping, purple -- IPv6 topology.

Compressed size of the UCSD Network Telescope raw data stored at NERSC.

In 2022 CAIDA executed the following Ark measurements on an ongoing basis:
• IPv4 team probing: daily traceroutes to all routed /24 IPv4 networks – Ark IPv4 Routed /24 Topology Dataset
• IPv4 prefix probing: daily traceroutes to every BGP-announced IPv4 prefix from a subset of Ark monitors – IPv4 Prefix-Probing Traceroute Dataset
• IPv6 topology measurements collected by a subset of Ark monitors that probe all announced IPv6 prefixes (/48 or shorter) once every 48 hours – Ark IPv6 Topology Dataset

In 2022 CAIDA captured about 17 TB of uncompressed topology traceroute data, and about 1 PB of compressed Internet darknet traffic data. As of January 2023, we collect about 10 TB of uncompressed data per day, more than 95% of which is Telescope (darknet) data. We archive raw telescope data at NERSC.


CAIDA Resource Catalog

Schematic representation of the Resource Catalog architecture.

In pursuit of the FAIR (findable, accessible, interoperable, reusable) principles of our scientific data infrastructure mission, we invested significant effort to make our data sets and associated resources more accessible to other researchers. We continued development of the CAIDA Resource Catalog, which provides a unified interface to metadata and relationships between datasets, papers, presentations, media, software, and recipes (code and instructions on how to solve various Internet security-related problems using datasets, tools, and other objects indexed in the catalog). We added new types of resources (presentations and collections, which group resources by CAIDA topic or project) and new metadata fields to facilitate discovery of resources. We responded to community feedback by adding “suggestions” when searching and by placing a link to search instructions under the search bar on all pages. We created DOIs for all ongoing datasets and will propagate those into the catalog citation. As the amount of data grew, we also took time to refactor our search pages and improve their load time. We created a feedback page to solicit feedback on the catalog design or on a specific resource.


Data Distribution Statistics

Unique users downloading CAIDA data and corresponding ASes aggregated by country.

For over a quarter of a century, CAIDA has been a trusted source of data and knowledge, including sensitive knowledge about critical Internet infrastructure, and has played a crucial role in advancing research on that infrastructure. The field of Internet infrastructure research relies heavily on CAIDA’s robust measurement and data sharing infrastructure. CAIDA’s data enables organizations to identify and remediate vulnerabilities in various Internet transport layers, including IP spoofing, BGP routing attacks, DNS abuse, and Certificate Authority manipulations.

To facilitate seamless access to its valuable resources, CAIDA employs multiple data sharing methods. These methods encompass downloadable files, interactive Web services, programmatic access to data streams, and APIs.

Users now request access to CAIDA’s data through the CAIDA Resource Catalog. CAIDA datasets fall into two categories: public and by-request. We make public datasets available to users who agree to CAIDA’s Acceptable Use Policy for public data. We safeguard restricted datasets and make them available for use by academic researchers, U.S. government agencies, and corporate entities through UC San Diego’s Office of Innovation and Commercialization. Users fill out the appropriate request form, including a brief description of their intended use of the data, and agree to an Acceptable Use Policy.

The Data Distribution Statistics graph shows the annual counts of unique visitors who downloaded CAIDA datasets (public and by-request). These statistics do not include Near-Real-Time Telescope datasets (raw traffic traces in pcap format, aggregated flow and daily RSDoS attack metadata).

Our most popular publicly available datasets are the ongoing “Routeviews Prefix to AS Mapping”, “Topology Ark data”, and “Internet eXchange Points Dataset”, along with the 2019-2021 Hoiho papers data supplements (for which the number of users has tripled in comparison to 2021).

Even though the number of unique users downloading topology and passive traces data decreased in 2022, the downloaded volume of these data increased by nearly 25% in comparison to 2021. In 2022 users downloaded about 104 TB of data. The main data categories contributing to the volume of downloaded data are: Anonymized Internet Passive Traces (57 TB) and Ark Topology data (45 TB).


AS Rank usage statistics

AS Rank requests received annually aggregated by country.

AS Rank requests received annually aggregated by organization type.

In 2022, over 190 ASes from U.S. Education and Research Groups used AS Rank, alongside 34 ASes from U.S. Government and Public Administration Groups. The spike in 2021, which did not recur in 2022, was due to an influx of requests from multiple telecommunications and internet service providers (e.g., IPTP Networks, Akamai Technologies Inc, GTT, Telia).


BGPStream

BGPStream requests received annually aggregated by country.

BGPStream requests received annually aggregated by organization type.

BGPStream usage has grown steadily over the last three years. In 2022, 9,141 unique IPs sent requests to BGPStream, from 943 unique ASes in 104 countries, led by the US, France, China, and Japan. Of these unique IPs, at least 329 were from the research and education community, 220 of those in the U.S.


Publications using public and/or restricted CAIDA data (by non-CAIDA authors)

Users of CAIDA datasets agree, as part of the Acceptable Use Agreements (see “Data Distribution Statistics” section), to provide CAIDA with information about their publications using CAIDA data. Our Data Publication Report Page provides instructions on how to report papers most easily.

In addition, we conduct an extensive literature search to locate relevant papers. We start with Google Scholar, using search phrases derived from the names of CAIDA datasets and aligned with the reference format specified in the AUA. This approach yields the highest number of search results. To complement the Google Scholar results, we also use other search engines with a stronger focus on computer science, such as IEEE Xplore Digital Library, ACM Digital Library, ScienceDirect.com, and Springer, among others. Although these targeted searches add relatively few hits beyond Google Scholar (approximately 5-15% of the total), they do reveal papers on subjects that extend beyond the domain of computer science.

We know of a total of 278 publications in 2022 by non-CAIDA authors that used CAIDA data. We update the external publications database as we learn of new publications. Some papers used more than one dataset. As of January 2023, we have found 3412 papers.

Please let us know if you know of a paper using CAIDA data not yet on our list: Non-CAIDA Publications using CAIDA Data.


Workshops

In 2022 (starting from late 2021), all workshop-related activity was centered around the GMI3S project, in the form of internal working group meetings via video conference. These GMI meetings brought together academic, government, and industry stakeholders to discuss the four measurement dimensions of the project: BGP, DNS, DDoS, and active measurement. Meeting minutes are currently intended for internal distribution only.

We conducted monthly GMI-DDoS workgroup meetings involving academic and industry researchers. During these meetings, we discussed progress on a collaborative white paper that aims to compile insights from both industry and academia. The paper focuses on documenting DDoS trends, related vulnerabilities, and mitigation strategies for tackling DDoS-related security challenges.

We also had meetings to discuss next steps in DNS and BGP measurement and data processing architectures, as well as how to enhance and sustain the utility of the telescope data for a wider variety of users, including commercial users.


CAIDA in Numbers

In 2022, CAIDA published 16 peer-reviewed papers (see below), made 5 presentations, and posted 3 blog entries. A list of presented materials is available on the CAIDA Resource Catalog. Our web site www.caida.org attracted approximately 223,577 unique visitors, with an average of 1.78 visits per visitor, serving an average of 2.83 pages per visit. During 2022, CAIDA employed 16 staff (researchers, programmers, data administrators, technical support staff) and hosted 1 postdoc, 5 PhD students, 4 masters students, and 24 undergraduate students.

The chart below shows CAIDA operating expenses, with a breakdown of operating expenses by type and program area:

Expense type                     Amount ($)    Percentage
Labor                            $1,400,602    38%
(UCSD) Indirect Costs            $1,125,930    31%
(UCSD) Benefits                  $537,325      15%
Subcontracts                     $338,944      9%
Equipment                        $117,114      3%
Professional Development         $98,853       3%
Supplies & Expenses              $71,996       2%
Total                            $3,690,763    100%

Research Program Area            Amount ($)    Percentage
Infrastructure & Data Sharing    $2,063,578    56%
Security, Stability, Resilience  $1,238,985    34%
Performance                      $214,369      6%
Cartography                      $167,851      5%
Outreach                         $5,982        <1%
Total                            $3,690,763    100%
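
For readers who want to check the figures, the following minimal Python sketch recomputes each percentage share from the dollar amounts reported above. The dictionary and function names are illustrative and not part of any CAIDA tool; the sketch assumes the percentages were rounded to the nearest whole number, which is why each column of shares can sum to slightly more than 100% and why the itemized amounts can differ from the reported total by a dollar or two.

```python
# Amounts (US dollars) copied from the expense breakdown in this report.
# All names below are illustrative; they do not come from any CAIDA tool.
EXPENSES_BY_TYPE = {
    "Labor": 1_400_602,
    "(UCSD) Indirect Costs": 1_125_930,
    "(UCSD) Benefits": 537_325,
    "Subcontracts": 338_944,
    "Equipment": 117_114,
    "Professional Development": 98_853,
    "Supplies & Expenses": 71_996,
}

EXPENSES_BY_PROGRAM = {
    "Infrastructure & Data Sharing": 2_063_578,
    "Security, Stability, Resilience": 1_238_985,
    "Performance": 214_369,
    "Cartography": 167_851,
    "Outreach": 5_982,
}

REPORTED_TOTAL = 3_690_763


def percentage_shares(amounts, total):
    """Return each amount as a share of total, rounded to a whole percent.

    Assumes nearest-integer rounding, which is a guess about how the
    percentages in the report were produced.
    """
    return {name: round(100 * amount / total) for name, amount in amounts.items()}


if __name__ == "__main__":
    for label, amounts in (("By expense type", EXPENSES_BY_TYPE),
                           ("By program area", EXPENSES_BY_PROGRAM)):
        print(label)
        for name, share in percentage_shares(amounts, REPORTED_TOTAL).items():
            print(f"  {name}: {share}%")
        # Itemized amounts may differ from the reported total by rounding.
        print(f"  itemized sum: ${sum(amounts.values()):,}")
```

Under that rounding assumption, a share below 0.5% (such as Outreach) rounds to 0, which corresponds to the “<1%” entry in the table.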

Publications

Publications are grouped by research categories.

Performance Measurement

Internet Security, Stability, and Resilience

Economics and Policy Research

Supporting Resources

CAIDA’s accomplishments are in large measure due to the high quality of our visiting students and collaborators. We are also fortunate to have financial and IT support from sponsors, members, collaborators, and monitor hosting sites. In 2022 we welcomed Matthew Luckie (who was a visiting scholar in the first half of the year) and Brendon Jones as consulting research scientists to advise on and execute our infrastructure deployment efforts.

UC San Diego Graduate Students

Visiting Scholars

Funding Sources
