CAIDA's Annual Report for 2023

A report on CAIDA research initiatives, project progress and results, data sets, tool development, publications, presentations, workshops, web site statistics, and operating expenses for 2023.

Mission Statement: CAIDA investigates practical and theoretical aspects of the Internet, focusing on activities that:

  • provide insights into the macroscopic function of Internet infrastructure, behavior, usage, and evolution,
  • foster a collaborative environment in which data can be acquired, analyzed, and (as appropriate) shared,
  • improve the integrity of the field of Internet science,
  • inform science, technology, and communications public policies.

Executive Summary

This annual report summarizes CAIDA’s activities for 2023 in the areas of research, infrastructure, data collection and analysis.

Infrastructure Operations and Design. Our research infrastructure funding from NSF, most notably the NSF mid-scale design project, allowed us to make significant progress in developing the next generation of Internet measurement infrastructure to enhance the security and utility of Internet measurements. We focused on creating innovative platforms and software tools for data collection, curation and utilization, particularly targeting data related to the security vulnerabilities within the packet carriage layer of the Internet, which often lead to significant harm. We enhanced infrastructure components that create data products or services requested by the community, including Archipelago (Ark), AS Rank, AS-to-Org mapping, DNS Zone Database (DZDB), Internet Topology Data Kit (ITDK), Facilitating Advances in Network Topology Analysis (FANTAIL), Periscope, Spoofer, and the UCSD Network Telescope. To support researchers trying to find and make use of the best available data from these and other infrastructures, we expanded and designed new functionality for our rich-context Resource Catalog for CAIDA Internet Data Science Resources, most notably data access via the catalog.

We also initiated the design of new infrastructure components – BGP, passive traffic capture, and active measurement – to overcome scaling limitations of current systems. To facilitate scientific use of the data generated by these platforms, we explored current and potential approaches to data analysis and visualization, addressing the need for standardization, interoperability, and AI readiness of our data and platforms. We engaged with partners from industry, academia, and government to gain insights into measurement needs and data acquisition infrastructure design.

Research. Our research continued to focus on Internet cartography (mapping), security, resilience, and performance studies, in the following categories.

Internet cartography and security. We developed and demonstrated new techniques for analyzing access-network topology to show the feasibility of targeted attacks on access network infrastructure, and suggested possible mitigation approaches. We developed new metrics to identify and rank the most important networks from a connectivity perspective for countries around the world, with some case studies to illustrate the geopolitical insights provided by these metrics. We undertook two analyses related to the latest routing security techniques and their effectiveness, using global data sources. We completed the first phase of our effort to infer the semantics of BGP communities in the wild. Finally, we continued our DOD-funded research to build automated techniques to identify and avoid adversarial components of infrastructure paths and divert communications to safe paths.

Performance. We made progress on three projects related to Internet performance measurement. First, we designed and implemented a crowdsourcing-based platform (QUINCE) to measure the QoE of video streaming and video conferencing applications. Second, we are leveraging CloudBank resources to understand performance bottlenecks in commercial cloud connectivity. Finally, we began a new NSF-funded project to develop a new measurement toolkit to enable reproducible, comprehensive speed test infrastructure discovery and characterization, and consistent test parameters across platforms.

Policy. We proposed a new approach to routing security that achieves four design goals: improved incentive alignment to implement best practices; protection against path hijacks; expanded scope of such protection to customers of those engaged in the practices; and reliance on existing capabilities rather than needing complex new software in every participating router. We were motivated by the FCC’s Notice of Inquiry on Routing Security, and wanted to suggest an alternative to regulation, under which the industry can make practical, measurable progress against the threat of route hijacks in the short term by leveraging institutionalized cooperation rooted in transparency and accountability. We submitted our idea to the FCC public comment process.

With four industry and 11 academic partners, we undertook a detailed analysis of Distributed Denial-of-Service (DDoS) attacks by integrating perspectives from both industry reports and academic research. We implemented a new approach to transparency with industry by aggregating target information (IPs) from academic sources and allowing industry players to join this data with their data sources revealing gaps in visibility and sharing results. This approach helped validate an industry-reported 2021-2022 drop in spoofed reflection-amplification attacks that increased again in 2023.

We analyzed and summarized elements of the EU Digital Services Act intended to ensure that independent, third-party researchers such as academics have access to the data necessary to understand the nature of the harms and the effectiveness of the mitigations.

Everything Else. As always, we engaged in a variety of tool development, data sharing, and outreach activities, including publishing 7 peer-reviewed papers, 5 blog entries, and 22 presentations, all indexed in the CAIDA Resource Catalog. Our web site www.caida.org attracted approximately 261,770 unique visitors, with an average of 1.84 visits per visitor, serving an average of 3.13 pages per visit. During 2023, CAIDA employed 17 staff (researchers, programmers, data administrators, technical support staff) and hosted 1 postdoc, 7 PhD students, 12 masters students, and 35 undergraduate students. We provide select highlights in this report; details are available in papers, presentations, blog, and interactive resources on our web sites. We list and link to publications, tools and data sets shared. Finally, we offer a CAIDA in numbers section: statistics on our performance, collaborators, finances and funding sources. We are still developing CAIDA’s program plan for 2025-2030. Please feel free to send comments or questions to info at caida dot org. Please note the link to donate to CAIDA at the top of our web site. Donations are tax-deductible and go 100% to research; UC San Diego charges no overhead on them.

Measurement and Data Analysis Infrastructure

Overview of the GMI3S structure

We continued redesigning our measurement, data analytics, and data sharing platforms and pipelines for collecting and curating infrastructure data in a form that facilitates query, integration, and analysis. Framing these efforts was our GMI Design Project (Designing a Global Measurement Infrastructure to Improve Internet Security, or GMI3S). The goal of the GMI project is to design a new generation of measurement infrastructure for the Internet, which will support collection, curation, archiving, and expanded sharing of data needed to advance critical scientific research on the security, stability, and resilience of Internet infrastructure. In collaboration with the University of Oregon’s Network Startup Resource Center (NSRC), MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), and numerous others in industry and academia, CAIDA is investigating sustainable production-data acquisition, curation tools, meta-data generation and efficient storage and dissemination to identify security gaps and explore potential areas of improvement within the domain. (GMI3S Design Project website) Guiding our work is our (evolving) catalog of Data Needs for Securing Internet Infrastructure. The GMI project and our other infrastructure funding

Archipelago

We overhauled the Archipelago (Ark) active measurement platform as needed to inform our next generation active measurement infrastructure design effort. In an effort to make a platform that was easier to use for external measurement researchers, while also providing important access control, we designed and implemented a researcher development environment that allows for complex, distributed, and reactive measurements built on a well-defined set of measurement primitives, from a set of distributed vantage points (VPs). The result is our scamper Python module, which acts as a bridge to the measurement tools on each Ark node. In January 2024, we released the source code of this module, allowing researchers to develop and test their measurements locally before trying them on Ark. We also designed and tested a new Kafka infrastructure to support coordination among nodes, a certificate authority system to allow authentication of remote nodes, and a monitoring management platform. To evaluate our new design, we supported several third-party experiments by researchers in the U.S. and global R&E community. We also used this new software underlay to launch a complete redesign and reimplementation of the software development packaging for creating several of our flagship data sets (including ITDK), modernizing and automating all components.

Hardware. Ark monitors continue to provide raw data for most of our macroscopic Internet data sets. In an effort to reevaluate our Ark infrastructure going forward, we began evaluating single-board computers other than the Raspberry Pi to discover whether any viable alternatives have emerged since the initial deployment of Ark almost two decades ago. We evaluated several different devices, but due to the supply chain abundance of Raspberry Pis, and their prevalence in other research cyberinfrastructure, we are leaning toward the Raspberry Pi (version 4) as the base hardware for Ark nodes in the field. We are also investing in software-only deployments, which has required designing, prototyping, and field testing containerized versions of the Ark software platform.

AS Rank

AS Rank integrates heterogeneous datasets, including AS-relationships, inferred relationships, customer cones, AS to organization mapping, Netacuity geolocation, and more, to generate a unified and comprehensive perspective of the Internet’s Autonomous Systems (ASes). This unified view allows for a better understanding of the relationships, hierarchy, and characteristics of ASes within the global Internet infrastructure. Seeking operator ground truth in our rankings, we worked on an interface to solicit feedback and corrections from the community. We continue to update AS Rank’s inferences with operator ground truth, increasing the accuracy of the presented data.

Cloud-Native Active Internet Measurement Pipeline

We developed a software pipeline to deploy measurements on public cloud platforms at scale. The pipeline integrates modern software stacks, enabling us to provision multiple measurement virtual machines (VMs) across cloud platforms and securely collect measurement results. Leveraging this pipeline, we conducted large-scale traceroute measurements to geolocate 2.5 million IP addresses discovered by CAIDA’s Macroscopic Internet Topology Data Kit (ITDK) from VMs we set up in 85 data centers from three major cloud providers (Amazon AWS, Google GCP, and Microsoft Azure). We inferred the geolocation of the IPs using hoiho, which analyzes geolocation hints in hostnames and cross-validates with network latency. Conducting measurements from diverse vantage points could improve the accuracy of the geolocation inference.

DNS Zone Database (DZDB) platform for querying DNS TLD zone files

CAIDA’s DNS Zone Database (DZDB) is a platform providing access to time-series data derived from current and historical zone files provided by generic Top-Level Domains (gTLDs) participating in the Central Zone Data Service (CZDS) or directly by TLD Registry Operators in compliance with license agreements. We continued working with collaborator Ian Foster to transition the DZDB platform to CAIDA infrastructure and updated the import code to run on our systems. Researchers can query this database for domains, nameservers, IP addresses and more via the DZDB API, whose documentation we significantly expanded.

Facilitating Advances in Network Topology Analysis (FANTAIL)

Over the last few years we have developed the Facilitating Advances in Network Topology Analysis (FANTAIL) system to facilitate discovery of Internet end-to-end path measurement data from massive archives. FANTAIL allows researchers to use high-level queries to perform data processing and analysis tasks on matching traces without owning/operating a cluster, and without learning big data programming. FANTAIL is a four-component system: (1) a web interface; (2) a standard API; (3) a search system based on Elasticsearch; and (4) a big data processing system based on Spark. This year we enhanced system stability, performance, and manageability, and completed implementation of some data analysis recipes. We imported all traces and annotations through 2023, and documented the process for doing so. We enhanced the user documentation with hints on proper querying techniques and interpreting outputs, transitioned the query output format from JSON to JSONL, and created wireframes outlining a restructured layout for the query landing page. We began offering limited access to beta-testers in the research community. (FANTAIL API scripts and file formats documentation)
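To illustrate why a move from JSON to JSONL can help with large query results, the following is a minimal sketch, not FANTAIL code. It assumes the older output was a single top-level JSON array; the field names `trace_id` and `hops` are hypothetical and do not reflect FANTAIL's actual schema.

```python
import json


def json_array_to_jsonl(json_text: str) -> str:
    """Convert a top-level JSON array into JSON Lines (one record per line)."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Each record is serialized on its own line, terminated by a newline.
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


def iter_jsonl(jsonl_text: str):
    """Yield records one at a time; no need to parse the whole result first."""
    for line in jsonl_text.splitlines():
        if line.strip():
            yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    sample = '[{"trace_id": 1, "hops": 3}, {"trace_id": 2, "hops": 5}]'
    for record in iter_jsonl(json_array_to_jsonl(sample)):
        print(record["trace_id"], record["hops"])
```

Because every line is an independent JSON document, a consumer can stream, split, or resume processing of a large result without holding it all in memory.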

DNS-based inference of infrastructure properties

We demonstrated a new design of a platform that relies on active measurement and DNS data to extensively label Internet topology (with geolocation and infrastructure ownership). Holistic Orthography of Internet Hostname Observations (Hoiho) is an open-source tool released as a part of scamper. It uses CAIDA’s Macroscopic Internet Topology Data Kit (ITDK) and observed round trip times to infer regular expressions that extract apparent geolocation hints from hostnames. The ITDK contains a large dataset of routers with annotated hostnames, which are used as input to Hoiho for its inference rules (encoded as regular expressions) that extract these annotations. In 2023, in response to user feedback, we expanded the documentation of the web API for Hoiho that provides hostname-location lookups.
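The idea of extracting a geolocation hint from a hostname and checking it against measured round trip time can be sketched as follows. This is an illustration under stated assumptions, not Hoiho itself: Hoiho infers its regular expressions from data, whereas here the pattern, the `example.net` naming convention, and the two-entry hint dictionary are all hypothetical and fixed by hand. The latency check assumes signals travel at roughly 200 km per millisecond in fiber, so one millisecond of RTT allows at most about 100 km of distance.

```python
import math
import re

# Hypothetical hint dictionary: location code -> (latitude, longitude).
HINTS = {"lax": (33.94, -118.41), "lhr": (51.47, -0.45)}

# Hypothetical naming convention: "<interface>.<code><digits>.example.net".
PATTERN = re.compile(r"^[^.]+\.([a-z]{3})\d+\.example\.net$")


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def extract_hint(hostname):
    """Return the location code embedded in the hostname, or None."""
    match = PATTERN.match(hostname.lower())
    if match and match.group(1) in HINTS:
        return match.group(1)
    return None


def rtt_consistent(hint, vp_location, rtt_ms, km_per_rtt_ms=100.0):
    """True if the hinted location is physically reachable within the RTT."""
    return haversine_km(HINTS[hint], vp_location) <= rtt_ms * km_per_rtt_ms


if __name__ == "__main__":
    san_diego = (32.72, -117.16)  # assumed vantage point location
    hint = extract_hint("ae1.lax2.example.net")
    print(hint, rtt_consistent(hint, san_diego, rtt_ms=5.0))
```

A hint that fails the check (for example, a London code on a router 5 ms from San Diego) suggests the extracted string is not a location, or that the hostname is stale.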

GILLnet: Next Generation BGP data collection and analysis platform

We initiated a fundamental reconceptualization of public BGP data collection architectures to address scalability limitations with the current RIPE RIS and RouteViews systems. The project was led by researchers at the University of Strasbourg, with first author and graduate student Thomas Alfroy, who completed his 4-month internship at CAIDA in December 2023. We iterated on the design of our “overshoot-and-discard” (redundant-data) paradigm for data collection, and published the first version of our design in ACM SIGCOMM HotNets (https://catalog.caida.org/paper/2023_internet_science_moonshot) to introduce the data collection method and evaluate its effectiveness in detecting two important phenomena using BGP data: AS-topology mapping and hijacks. Thomas Alfroy presented this work at the HotNets conference in November 2023. We deployed our prototype system at https://bgproutes.quest where we have invited R&E networks to peer. (The peering is entirely automated via a web form.) To scale this system up we will need to leverage hardware and personnel at UCSD, funded by a future grant. (Internet Science Moonshot: Expanding BGP Data Horizons, ACM)

Internet Data Science Resource Catalog

Users now request access to CAIDA’s data through the CAIDA Resource Catalog. CAIDA datasets fall into two categories: public and by-request. We make public datasets available to users who agree to CAIDA’s Acceptable Use Policy for public data. We safeguard restricted datasets and make them available for use by academic researchers, U.S. government agencies, and corporate entities through UC San Diego’s Office of Innovation and Commercialization. Users fill out the appropriate request form, including a brief description of their intended use of the data, and agree to an Acceptable Use Policy.

Internet Topology Data Kit (ITDK)

Our ongoing collection of Macroscopic Internet Topology Data Kits (ITDK) started in 2010 and now includes 23 Kits. In 2023 we published the 2023-03 ITDK. These data sets contain router-level topologies generated from the Ark IPv4 Routed /24 Topology Dataset.

Libipmeta to support querying for IP address metadata

Libipmeta is a library to support the querying for historical and realtime IP metadata including CAIDA’s geolocation information, Prefix-To-AS databases, and future metadata on IP addresses. This library has a companion pyipmeta library. We implemented a Golang-native version of ipmeta, to enable concurrent processing of data, and used it to analyze network telescope traffic. We released version v3.2.1 this year.
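A Prefix-To-AS lookup of the kind mentioned above is typically a longest-prefix match. The sketch below shows that technique using only the Python standard library; it does not use or reproduce the libipmeta or pyipmeta APIs. The class name `PrefixToAS` is hypothetical, and the prefixes and AS numbers come from ranges reserved for documentation.

```python
import ipaddress


class PrefixToAS:
    """Longest-prefix-match table from IP prefixes to origin AS numbers."""

    def __init__(self, entries):
        # One hash table per (IP version, prefix length), keyed by the
        # integer value of the network address.
        self._tables = {}
        for prefix, asn in entries:
            net = ipaddress.ip_network(prefix)
            table = self._tables.setdefault((net.version, net.prefixlen), {})
            table[int(net.network_address)] = asn

    def lookup(self, address):
        """Return the AS of the most specific covering prefix, or None."""
        ip = ipaddress.ip_address(address)
        bits = ip.max_prefixlen
        # Try the longest prefix length first, then progressively shorter ones.
        for plen in range(bits, -1, -1):
            table = self._tables.get((ip.version, plen))
            if not table:
                continue
            mask = ((1 << plen) - 1) << (bits - plen)
            asn = table.get(int(ip) & mask)
            if asn is not None:
                return asn
        return None


if __name__ == "__main__":
    table = PrefixToAS([
        ("198.51.100.0/24", 64496),
        ("198.51.100.128/25", 64497),
    ])
    print(table.lookup("198.51.100.200"))  # falls in the more specific /25
    print(table.lookup("198.51.100.5"))    # only the /24 covers this address
```

Production libraries generally use a trie or similar structure for speed; the per-length hash tables here trade some performance for brevity.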

PacketLab

CAIDA continued to collaborate with UIUC’s team (led by PI Kirill Levchenko) investigating a new technical approach to sharing network measurement infrastructure by developing an experimental interface to disparate measurement endpoints maintained by different research teams. The goal of PacketLab is to move the measurement logic out of the endpoint to a separate experiment control server, making each endpoint a lightweight packet source/sink. It also provides a mechanism to delegate access to measurement endpoints while retaining fine-grained control over how one’s endpoints are used by others, allowing research groups to share measurement infrastructure with each other with little overhead. We surveyed recent Internet measurement studies and empirical comparisons with native implementations to discover that PacketLab produces similar results to native versions for various measurement types. The findings suggest that PacketLab could contribute to reproducing or extending many surveyed studies, covering measurements such as latency, throughput, network path, and non-timing data. (Empirically Testing the PacketLab Model, IMC). UIUC released an updated Alpha version of the PacketLab software package at pktlab.github.io for community evaluation.

Periscope

In 2023, the Periscope Looking Glass API continued public beta testing. We continued adapting the backend and API to support its use for access to scamper processes running on BGP collectors, and its integration with CAIDA’s new Keycloak-based authentication/authorization framework.

Spoofer

Contrasting the calculations for Customer Cone (CC) and AS Hegemony (AH) for an AS topology with provider-to-customer and peer-to-peer relationships. (On the Importance of Being an AS, IMC)

Spoofer is a suite of open-source software tools to assess and report on the deployment of source address validation (SAV), an anti-spoofing best practice. This client-server system periodically tests a network’s ability to both send and receive packets with forged source IP addresses (spoofed packets). The CAIDA Spoofer Data API (https://api.spoofer.caida.org/) provides a public interface to the publicly sharable data collected by the Spoofer service. We improved the API by allowing easier browser access to larger downloads of data for time series analysis. We also streamlined the process of notifying interested administrators when we detect spoofing in their network.

Telescope Traffic Monitor

CAIDA operates the world’s largest Internet traffic observatory (UCSD-NT) to capture Internet background radiation (IBR) (unsolicited traffic) from a large segment of mostly unutilized IPv4 address space. The infrastructure now captures (unprecedented for the network research community) O(1TB) per day. The data collection pipeline includes capturing raw packets and processing them into a more compressed flow record format for archiving. In parallel, we also extract thousands of time-series statistics directly from the packet headers. We completed our plan for deploying a new telescope hardware prototype. In the meantime, the existing infrastructure again encountered several problems with data integrity, which we investigated in collaboration with the security researchers who brought them to our attention. We explored options for offloading some of the pipeline, e.g. flowtuple generation, to other machines while we wait for new hardware deployment.
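The step of reducing raw packets to a more compact flow record can be sketched as a simple aggregation by header fields. This is a minimal illustration only: the `Packet` type and the choice of key fields (addresses, ports, protocol) are assumptions made for the example, and the telescope's actual flowtuple format may carry different or additional fields. The addresses are taken from ranges reserved for documentation.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical parsed packet header; not the telescope's real schema."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    length: int


FlowKey = Tuple[str, str, int, int, int]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, Tuple[int, int]]:
    """Collapse packets that share a key into (packet_count, byte_count)."""
    totals = defaultdict(lambda: [0, 0])
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        totals[key][0] += 1
        totals[key][1] += p.length
    return {key: (count, size) for key, (count, size) in totals.items()}


if __name__ == "__main__":
    packets = [
        Packet("203.0.113.7", "192.0.2.10", 40000, 23, 6, 60),
        Packet("203.0.113.7", "192.0.2.10", 40000, 23, 6, 60),
        Packet("203.0.113.8", "192.0.2.11", 53, 53, 17, 80),
    ]
    for key, (count, size) in aggregate_flows(packets).items():
        print(key, count, size)
```

Scanning traffic tends to repeat the same header combinations many times, which is why a per-key count compresses well relative to storing every packet.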

Two-way Passive Traces Monitor

Although this phase took us 5 years, we finally got the two-way traffic monitor installed. We will begin sharing these Anonymized Two-Way Traffic Packet Header Traces with researchers in 2024.

Internet Cartography and Security research

Assessing Physical Risks to Internet Access Networks

Regional access networks, crucial for connecting users to the Internet, face vulnerabilities due to economic and architectural constraints, leaving them susceptible to targeted physical attacks. The study combines novel techniques for analyzing access-network infrastructure with large-scale outage measurements to demonstrate the feasibility and quantify potential impacts of such attacks. The research provides insights into the physical attack surfaces and resiliency limits of regional access networks, suggesting potential mitigation approaches while acknowledging drawbacks identified by network operators. The empirical evaluation aims to inform risk assessments, operational practices, and stimulate further analyses of this critical infrastructure. (Access Denied: Assessing Physical Risks to Internet Access Networks, USENIX Security Symposium)

Country-level AS Rankings

Currently, there is no reliable technique for quantifying Internet sovereignty metrics, such as the extent to which a country’s Internet communication relies on networks potentially controlled by adversarial nation-states. To address this gap, we created metrics to identify and rank the most important ASes from a connectivity perspective for countries around the world. We adapted the two most-used AS Ranking metrics to country-specific versions, and navigated the challenges of incomplete BGP data coverage and geolocation. We analyzed our country-specific metrics through two prisms: international, which considered inbound paths to the country; and national, for paths starting and ending within the country. The Customer Cone International metric identifies transit providers commonly used outside a country to reach addresses in that country, while the Customer Cone National metric identifies the top ISPs in the domestic transit market. The corresponding AS Hegemony metrics capture dominant providers without regard for whether links are transit (customer) or peering. We showed that the metrics are consistent with geopolitical and economic knowledge about the ranked networks and countries. We provided case studies on Australia, Japan, Russia, Taiwan, and the United States, which revealed insights into telecommunications market concentration and interdependence. The metrics also confirm the dominant role the U.S. still plays in global telecommunications infrastructure, with at least one dominant U.S. carrier (Lumen) providing international transit for 81% of countries. (On the Importance of Being an AS: An Approach to Country-Level AS Rankings, IMC)
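For readers unfamiliar with the customer cone, the basic notion can be sketched as the set of ASes reachable from a given AS by following only provider-to-customer links. The code below implements that simple recursive definition on a toy topology with documentation-range AS numbers. It is an illustration of the underlying idea only: the published AS Rank and country-level metrics may apply further constraints (such as restricting to paths observed in BGP data, or to addresses geolocated to a country) that this sketch does not attempt to reproduce.

```python
from typing import Dict, Iterable, Set


def customer_cone(asn: int, p2c: Dict[int, Iterable[int]]) -> Set[int]:
    """Return asn plus every AS reachable via provider-to-customer links.

    p2c maps a provider AS to its direct customers. Peer-to-peer links are
    deliberately not part of the input, so they never extend the cone.
    """
    cone = {asn}
    stack = [asn]
    while stack:
        current = stack.pop()
        for customer in p2c.get(current, ()):
            if customer not in cone:
                cone.add(customer)
                stack.append(customer)
    return cone


if __name__ == "__main__":
    # Toy topology: 64496 is a provider of 64497 and 64498;
    # 64497 is a provider of 64499; 64500 has no customers.
    p2c = {64496: [64497, 64498], 64497: [64499]}
    print(sorted(customer_cone(64496, p2c)))
    print(sorted(customer_cone(64500, p2c)))
```

Ranking ASes by cone size favors networks that sell transit to many others, which is why a metric such as AS Hegemony, which ignores link type, can give a complementary view.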

Investigating Irregularities in the Internet Routing Registry

Illustration of the threat model where attackers register false IRR records. (IRRegularities in the Internet Routing Registry, IMC)

The Internet Routing Registry (IRR) is a set of distributed databases used by networks to register routing policy information and to validate messages received in the Border Gateway Protocol (BGP). Attackers have begun to register false records in the IRR to bypass operators’ defenses when launching attacks on the Internet routing system, such as BGP hijacks. We performed a longitudinal analysis of the IRR over the span of 1.5 years. We developed a workflow to identify irregular IRR records that contain conflicting information compared to different routing data sources. We found IRR databases prone to staleness and errors, confirming the importance of operators transitioning to RPKI-based filtering. In addition, we found inconsistencies between IRR databases, suggesting opportunities for improved coordination across IRR providers to improve routing security. Finally, we described the challenges of inferring the suspiciousness of such irregular objects and compiled a list of 6,373 suspicious route objects. We hope this work inspires new directions in automating the detection of abuse of IRRs, such as a multilateral comparison across databases, ideally in time to prevent or thwart an attacker’s ultimate objective.
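The core comparison behind flagging irregular records can be sketched as follows: a route object is flagged when another data source knows the same prefix but does not list the registered origin AS. This is a deliberately simplified illustration, not the workflow from the study, which is more elaborate. The function name, the exact-prefix matching, and the sample data (documentation-range prefixes and AS numbers) are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple

RouteObject = Tuple[str, int]  # (prefix, origin AS)


def find_irregular(irr_objects: Iterable[RouteObject],
                   reference: Dict[str, Set[int]]) -> List[RouteObject]:
    """Return route objects whose origin conflicts with a reference source.

    reference maps a prefix to the set of origin ASes seen for it in another
    data source (for example, another IRR database or BGP observations).
    Prefixes missing from the reference are not flagged: there is nothing to
    conflict with.
    """
    irregular = []
    for prefix, origin in irr_objects:
        known_origins = reference.get(prefix)
        if known_origins and origin not in known_origins:
            irregular.append((prefix, origin))
    return irregular


if __name__ == "__main__":
    irr = [
        ("192.0.2.0/24", 64496),     # agrees with the reference
        ("198.51.100.0/24", 64511),  # conflicts with the reference
        ("203.0.113.0/24", 64500),   # unknown to the reference
    ]
    reference = {"192.0.2.0/24": {64496}, "198.51.100.0/24": {64497}}
    print(find_irregular(irr, reference))
```

A conflict on its own does not prove abuse, since legitimate causes such as stale records or address transfers also produce mismatches; that is the difficulty in inferring suspiciousness that the paragraph above describes.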

Utilizing RPKI for Validation of BGP Messages

Networks use RPKI to check whether the Autonomous System (AS) at the origin of the AS path in a BGP announcement is authorized to originate the IP prefixes being announced. We explored a lightweight technique to identify ASes that propagate RPKI invalid prefixes, i.e., do not perform ROV. If the ASes responsible for propagating the most invalid prefixes were to deploy ROV, it could dramatically increase the security of the routing ecosystem. Thus, stakeholders can focus on promoting ROV deployment in those ASes. Our technique can help optimize future ROV deployment, e.g., to estimate which ASes would provide the greatest marginal increase in protection. (Taking the Low Road: How RPKI Invalids Propagate, SIGCOMM Poster)
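The origin check described above can be sketched as a small function that classifies an announcement as valid, invalid, or not-found against a set of Route Origin Authorizations (ROAs), following the outcome definitions in RFC 6811. This is an illustrative sketch, not a validator implementation and not the propagation-inference technique from the poster. The `ROA` type and function name are hypothetical, and the prefixes and AS numbers come from documentation ranges.

```python
import ipaddress
from typing import Iterable, NamedTuple


class ROA(NamedTuple):
    """Simplified ROA entry: authorized prefix, maximum length, origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate_origin(prefix: str, origin_asn: int, roas: Iterable[ROA]) -> str:
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    announced = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if announced.version != roa_net.version:
            continue
        if not announced.subnet_of(roa_net):
            continue
        # At least one ROA covers the announced prefix.
        covered = True
        if roa.asn == origin_asn and announced.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    roas = [ROA("192.0.2.0/24", 24, 64496)]
    print(validate_origin("192.0.2.0/24", 64496, roas))
    print(validate_origin("192.0.2.0/24", 64497, roas))
    print(validate_origin("192.0.2.0/25", 64496, roas))
    print(validate_origin("198.51.100.0/24", 64496, roas))
```

An AS that performs ROV drops announcements classified as invalid, so seeing an invalid prefix in paths that traverse an AS is evidence that the AS did not filter it.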

Techniques to support inference of BGP community semantics

Operators use BGP communities to influence routing decisions made by other networks, or to record metadata that operators can use when applying BGP policy. This figure illustrates some community values observed in the wild. (Coarse-grained Inference of BGP Community Intent, IMC)

BGP communities allow operators to influence routing decisions made by other networks (action communities) and to annotate their network’s routing information with metadata such as where each route was learned or the relationship the network has with their neighbor (information communities). BGP communities also help researchers understand complex Internet routing behaviors. However, there is no standard convention for how operators assign community values. We discovered that ignoring the coarse-grained classification (information vs action community) comes at significant cost in accuracy, of both inference and validation. To advance this powerful direction in Internet infrastructure research, in 2023 we designed and validated an algorithm to execute this first fundamental step: inferring whether a BGP community is in the action (request action from another network) or information (signal information to another network) category. We validated our results extensively and published and presented them at ACM SIGCOMM IMC 2023. We publicly shared our code, dictionaries, inferences, and datasets to enable the community to benefit from them. In the recent reporting period (October 2023 to March 2024), we undertook the next step of this inference process: inferring geolocation semantics in BGP communities. We are still working on an algorithm we will submit to IMC 2024; this will be the basis of the design of the BGP community dictionary we propose to include in the Implementation Phase of this project. (Coarse-grained Inference of BGP Community Intent, IMC 2023)

Building communication systems to avoid adversarial network infrastructure

Led by Johns Hopkins University, and in collaboration with Princeton and USC/ISI, CAIDA is participating in an NSF-funded Convergence Accelerator project to build an automated system (AVOID: Automated Verification Of Internet Data-paths) that helps Department of Defense (DOD) operators who want to communicate with 5G devices by avoiding nation-state adversaries and moving communications to safe paths. In 2023, we spent considerable effort obtaining preliminary results to demonstrate the power of our design, and submitting a Phase II proposal. We focused on designing, training, and testing a base station vendor classifier, as well as demonstrating the power of our geolocation-based approach. We filed a provisional patent to protect new intellectual property developed under this project.

Measuring Network Performance

Platform for Measuring Quality of Experience (QoE)

We designed and implemented a crowdsourcing-based platform (QUINCE) to measure the QoE of video streaming and video conferencing applications. We integrated the QUINCE experiment platform with public cloud platforms to automate the deployment of experiments in geographically diverse cloud regions. Leveraging NSF-funded CloudBank resources, we evaluated software and APIs offered by AWS, Microsoft Azure, and Google Cloud platform in terms of ease of use, deployment speed, and functionality for deploying virtual machines (VMs) and Docker images to support live video streaming and videoconferencing experiments. These tools enabled us to develop a customized web console in QUINCE to monitor the use of cloud resources in real-time. We prototyped crowdsourcing-based video conferencing experiments and are addressing challenges in operationalizing the prototype. We enhanced the QUINCE prototype by introducing more visualization and gamification features to boost the intrinsic motivation of subjects.

Measurement of Cloud Performance and Reachability

We are developing tools to understand cloud connectivity performance and reachability in the U.S. and around the world. In particular we want to discover performance bottlenecks outside the cloud networks where the high cost of deployment and operations leads to infrastructure bottlenecks for cloud applications. We started by developing tools to identify performance bottleneck links between cloud datacenters and thousands of publicly accessible speed test servers, by synthesizing active measurements with TCP flows. We evaluated three packet processing frameworks (DPDK, eBPF, and ConnectX SmartNIC) for implementing an in-band measurement tool that injects traceroute into speed test TCP flows to identify network bottlenecks. We built a prototype on top of Capsule-rs, a network function pipeline written in Rust. We compared the packet forwarding, processing, and filtering performance across these implementations on the FABRIC virtual testbed, which enabled us to customize routing between physical test sites to realistically emulate path latency.

Reproducible Assessment of Broadband Internet Topology and Speed (RABBITS)

We began a new NSF-funded project to develop a measurement toolkit to enable reproducible, comprehensive speed test infrastructure discovery and characterization, and consistent test parameters across platforms. The goal of the toolkit is to overcome the obstacles that have prevented these widely deployed global measurement infrastructures from supporting either rigorous scientific research or public policy needs for consistent and reusable tests of broadband performance and service availability.

Informing Public Policy

Design and evaluate measurement-based approaches to improve routing security via trust zones

Although Internet routing security best practices have recently seen auspicious increases in uptake, ISPs have limited incentives to deploy them. They are operationally complex and expensive to implement, provide little competitive advantage, and protect only against origin hijacks, leaving unresolved the more general threat of path hijacks. We proposed a new approach that achieves four design goals: improved incentive alignment to implement best practices; protection against path hijacks; expanded scope of such protection to customers of those engaged in the practices; and reliance on existing capabilities rather than needing complex new software in every participating router. Our proposal leverages an existing coherent core of interconnected ISPs to create a zone of trust, a topological region that protects not only all networks in the region, but all directly attached customers of those networks. Customers benefit from choosing ISPs committed to the practices, and ISPs thus benefit from committing to the practices. We compare our approach to other schemes, and discuss how a related proposal, ASPA, could be used to increase the scope of protection our scheme achieves. We hope this proposal inspires discussion of how the industry can make practical, measurable progress against the threat of route hijacks in the short term by leveraging institutionalized cooperation rooted in transparency and accountability. We submitted our idea as a public comment to the FCC (A path forward: Improving Internet routing security by enabling trust zones, FCC) and also submitted a follow-up response to other comments on the FCC’s Notice of Inquiry on Secure Internet Routing, in the context of the FCC’s exploration of possible regulatory intervention. Our discussion highlighted the increasing tension on the topic amid prolonged multistakeholder efforts and the growing risk of BGP hijacks affecting even major corporations (Notice of Ex Parte Meeting, Secure Internet Routing, FCC).
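The scope of protection described above (all networks in the zone plus their directly attached customers) can be illustrated with a small sketch. This is a toy model under stated assumptions, not the proposal's implementation: the function name, the AS numbers (taken from the range reserved for documentation), and the customer map are all hypothetical.

```python
def protected_networks(zone, customers_of):
    """Return the set of ASes covered by a zone of trust.

    zone:         set of AS numbers that belong to the zone (hypothetical).
    customers_of: dict mapping an AS number to the set of its directly
                  attached customer ASes (hypothetical input).
    """
    protected = set(zone)
    for asn in zone:
        # Only direct customers of zone members are added; customers of
        # those customers are not, since the text says "directly attached".
        protected |= customers_of.get(asn, set())
    return protected


# Made-up example topology using documentation AS numbers.
zone = {64500, 64501}
customers_of = {
    64500: {64510, 64511},
    64501: {64511},
    64510: {64520},  # 64520 is a customer of a customer, outside the scope
}
covered = protected_networks(zone, customers_of)
```

In this toy topology, AS 64520 is not covered because it attaches to the zone only through AS 64510, which is itself a customer rather than a zone member.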

Reporting on the EU Digital Services Act

The EU Digital Services Act (DSA) is intended to reduce the risks and challenges for individual recipients of Digital Information services, in particular what are described as intermediate services. The regulation imposes specific obligations on platforms that allow online trading, and expanded obligations on very large online platforms and very large search engines. A second goal of the DSA is to provide a single, harmonized regulatory framework for the Union, and to preempt the creation of divergent regulatory structures by individual States. The regulation states that Member States should not adopt or maintain additional national requirements relating to the matters falling within the scope of this Regulation. (The EU Digital Services Act and Academic Research - Technical Report, CAIDA)

Multistakeholder analysis of the DDoS Landscape

With four industry and 11 academic partners, we undertook a detailed analysis of Distributed Denial-of-Service (DDoS) attacks by integrating perspectives from both industry reports and academic research. We conducted a multi-stakeholder analysis, examining 24 industry reports characterizing DDoS in 2022-2023 and 9 datasets across both academic and industry sources spanning from 2019 to 2023. This analysis identified and analyzed discrepancies in DDoS data reporting and analysis methods, focusing on two primary types of attacks: direct-path (both spoofed and unspoofed) and reflection-amplification attacks. We implemented a new approach to transparency with industry by aggregating target information (IPs) from academic sources and sharing it with industry. Industry partners then joined it with their own data sources, revealing gaps in visibility, and shared the results. This approach helped us to validate the industry-reported 2021-2022 drop in spoofed reflection-amplification attacks, which increased again in 2023. We will publish the report in 2024.

Data Collection Statistics: Topology and Traffic

The slide deck, CAIDA Measurement Data Infrastructure Overview, continues to summarize CAIDA datasets used for networking and security research.

The graphs presented here show the cumulative volume of data accrued over the last several years by our primary data collection infrastructures, Archipelago (Ark) and the UCSD Network Telescope.

Topology Measurements from Ark

In 2023 CAIDA executed the following Ark measurements on an ongoing basis:

Compressed size of Ark topology measurements.

Compressed size of the UCSD Network Telescope raw data stored at NERSC.


AS to Organization mapping

We continued updating the as2org data set every quarter.

IP Prefix to AS Mapping

One of CAIDA’s most frequently requested datasets is its RouteViews Prefix to AS Mapping Dataset for IPv4 and IPv6. This dataset contains IPv4/IPv6 Prefix-to-Autonomous System (AS) mappings derived from the NSRC RouteViews Project, which gathers BGP updates from hundreds of vantage points around the world. CAIDA uses RouteViews BGP table dumps (from one collector) to perform a longest-prefix match on observed prefixes, producing daily snapshots of the Prefix to AS mapping. This daily updated dataset goes back to May 2005.
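To illustrate what a longest-prefix match against a prefix-to-AS table looks like, here is a minimal sketch using only the Python standard library. It is not CAIDA's tooling and does not parse the dataset's actual file format; the function names, the sample prefixes (documentation ranges), and the AS numbers (reserved for documentation) are assumptions made for the example.

```python
import ipaddress


def build_table(rows):
    """Index (prefix, asn) rows by (IP version, prefix length)."""
    table = {}
    for prefix, asn in rows:
        net = ipaddress.ip_network(prefix)
        bucket = table.setdefault((net.version, net.prefixlen), {})
        bucket[int(net.network_address)] = asn
    return table


def lookup(table, address):
    """Return (prefix, asn) for the most specific covering prefix, or None."""
    addr = ipaddress.ip_address(address)
    bits = addr.max_prefixlen  # 32 for IPv4, 128 for IPv6
    net_cls = ipaddress.IPv4Network if addr.version == 4 else ipaddress.IPv6Network
    # Try the longest prefix length first so the first hit is the best match.
    for plen in range(bits, -1, -1):
        bucket = table.get((addr.version, plen))
        if not bucket:
            continue
        mask = ((1 << plen) - 1) << (bits - plen)
        key = int(addr) & mask
        if key in bucket:
            return str(net_cls((key, plen))), bucket[key]
    return None


# Made-up sample rows; a real snapshot would be loaded from a file.
table = build_table([
    ("192.0.2.0/24", 64500),
    ("192.0.2.128/25", 64501),
    ("2001:db8::/32", 64502),
])
match = lookup(table, "192.0.2.200")
```

Here `192.0.2.200` is covered by both the /24 and the /25, and the lookup returns the more specific /25 with its origin AS. Real snapshots may also contain prefixes with multiple origin ASes, which this sketch does not model.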

RouteViews BGP Peerstats Tooling

We created a page of RouteViews peer statistics for the RouteViews team to enable them to reason about noisy peers and contact the respective operators.

UCSD Telescope

In 2023 CAIDA captured about 1.3 PB of compressed Internet darknet traffic data. As of January 2024, we collect about 1.7 TB of compressed data per day, more than 95% of which is Telescope (darknet) data. We archive raw telescope data at NERSC.

Annotation Schema for Data Sets

We developed a unified relational descriptive representation of the data contained in our and other contributed datasets. After much disappointing research on other existing ontology frameworks, we were surprised to find that we mostly had to build our own, although we are building on schema.org as much as possible. Our goal is to provide a high-level unified structure for comparing, joining, searching, querying, and understanding datasets and databases from different sources and formats. In pursuit of this goal, the Annotated Schema provides a set of canonical object categories, properties, and namespaces. We simplified our Annotated Schema design draft based on feedback from the community, and updated AS Rank’s data schema to align with these updates. (Annotated Schema: Mapping Ontologies onto Dataset Schemas, CAIDA) Standardization of the conceptual model for basic elements of Internet measurements is required to provide a consistent framework for developing a sustainable interoperable system. This effort provided some assurance that we are implementing the “FAIR” data guiding principles, i.e., making data Findable, Accessible, Interoperable, and Reusable.

Data Distribution Statistics

In the following graphs, a unique user is defined as a unique IP address that has accessed a dataset or service. We try to filter potential bots, crawlers, and spiders from the listed totals. The totals in this year’s report reflect a more accurate account of usage than in previous years, when numbers may have been inflated by previously uncaught spiders and scanners. The Data Distribution Statistics (bar and pie) graphs show how many visitors downloaded CAIDA’s most popular datasets (public and by-request) over time and in 2023. These statistics do not include Near-Real-Time Telescope datasets (raw traffic traces in pcap format, aggregated flow and daily RSDoS attack metadata). The time-series bar graph shows a surge in users accessing Anonymized Internet Traces, Ark Topology, and RouteViews Prefix2as in 2020-2021, which we identified as an influx of IP addresses from three ASes: Amazon (Hosting), CHINA UNICOM Industrial Internet Backbone, and China Telecom.
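The counting rule described above (one user per distinct IP address, after filtering likely bots) can be sketched as follows. This is an illustration only: the record layout, the user-agent substring heuristic, and the sample values are assumptions for the example, and the report does not describe how CAIDA's filtering is actually implemented.

```python
import ipaddress

# Assumed heuristic: substrings that mark a user agent as automated.
BOT_MARKERS = ("bot", "crawler", "spider")


def unique_users(records):
    """Count distinct client IPs per dataset, skipping likely bots.

    records: iterable of (ip, user_agent, dataset) tuples (hypothetical layout).
    """
    seen = {}
    for ip, agent, dataset in records:
        if any(marker in agent.lower() for marker in BOT_MARKERS):
            continue
        # Parsing normalizes the address, so different spellings of the
        # same IPv6 address count as one user.
        seen.setdefault(dataset, set()).add(ipaddress.ip_address(ip))
    return {dataset: len(ips) for dataset, ips in seen.items()}


# Made-up access records using documentation address ranges.
records = [
    ("192.0.2.10", "curl/8.0", "Ark Topology"),
    ("192.0.2.10", "curl/8.0", "Ark Topology"),        # repeat visit
    ("192.0.2.11", "ExampleBot/1.0", "Ark Topology"),  # filtered as a bot
    ("2001:db8::1", "Wget/1.21", "Prefix2as"),
    ("2001:DB8:0::1", "Wget/1.21", "Prefix2as"),       # same address, other spelling
    ("198.51.100.7", "python-requests/2.31", "Prefix2as"),
]
counts = unique_users(records)
```

Counting by IP address is only a proxy for people: several users behind one NAT count once, and one user on several networks counts more than once.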

Unique users downloading CAIDA data and corresponding ASes aggregated by country.


AS Rank usage statistics

In 2023, over 41 ASes from U.S. education and research groups used AS Rank, making up nearly 26% of AS Rank’s user base. The remaining users fall under computer and information technology (74%), with less than 1% under other organization types such as commercial, government and administration, and community groups. We observed requests from 12,529 unique IPs in 2023, with the top ASes belonging to Google, Comcast, and ISPs. Apart from the U.S., the top countries from which requests originated were China, Brazil, Japan, Great Britain, and France.

AS Rank unique user count by year.

AS Rank requests received annually aggregated by organization type.


BGPStream

BGPStream usage has grown steadily over the last 4 years, with 2021 being an outlier in which we saw a surge of unique users. In 2023, 2,177 unique IPs sent requests to BGPStream, from 480 unique ASes in 70 countries, the top five being the U.S., France, Spain, Japan, and Germany. Of these unique IPs, at least 302 were from the research and education community, 213 of those in the U.S. As shown in the pie graph below, the majority of our users come from education, followed by computer and information technology.

BGPStream unique user count by year.

BGPStream requests received annually aggregated by organization type.


Publications using public and/or restricted CAIDA data (by non-CAIDA authors)


Users of CAIDA datasets agree, as part of the Acceptable Use Agreements, to provide CAIDA with information about their publications that use CAIDA data, but many users forget to report. We therefore conduct extensive literature searches to locate relevant papers, searching Google Scholar for names and DOIs of CAIDA datasets and services. We also use computer science search engines, such as IEEE Xplore Digital Library, ACM Digital Library, ScienceDirect.com, and Springer, among others. We are aware of 327 publications authored by non-CAIDA researchers that used CAIDA data and were published in 2023. We update our external publications database as we become aware of new publications. As of September 2024, we have indexed 3603 papers in our database. Please let us know if you know of a paper using CAIDA data not yet on our list: Non-CAIDA Publications using CAIDA Data.

CAIDA in Numbers: Outreach, Workshops, Publications, Funding

We held three in-person retreats/workshops with external academic collaborators, in January, May, and October. During these workshops we discussed active measurement needs of the community as we planned for the future of CAIDA’s Internet measurement infrastructure: Ark, BGP analysis platforms, and the Telescope. We also held monthly workgroup meetings with academic and industry researchers for several projects related to the GMI project: BGP, DNS, and DDoS.

In 2023, CAIDA published 7 peer-reviewed papers and 4 non-peer-reviewed papers, made 22 presentations, and posted 5 blog entries. Presented materials are listed on the CAIDA Resource Catalog. Our web site www.caida.org attracted approximately 261,770 unique visitors, with an average of 1.84 visits per visitor, serving an average of 3.13 pages per visit. During 2023, CAIDA employed 17 staff (researchers, programmers, data administrators, technical support staff), and hosted 1 postdoc, 7 PhD students, 12 masters students, and 35 undergraduate students.

The chart below shows CAIDA operating expenses, with a breakdown of operating expenses by type and program area:

Expense type                Amount ($)    Percentage
Labor                       $1,669,395    37%
(UCSD) Benefits             $554,749      12%
Supplies & Expenses         $472,142      10%
Subcontracts                $321,307      7%
Equipment                   $106,633      2%
Professional Development    $5,821        <1%
(UCSD) Indirect Costs       $1,368,653    30%
Total                       $4,498,700    100%

Research Program Area              Amount ($)    Percentage
Infrastructure & Data Sharing      $2,214,412    49%
Security, Stability, Resilience    $1,387,854    31%
Cartography                        $611,701      14%
Performance                        $284,732      6%
Total                              $4,498,699    100%

Publications

Publications are grouped by research categories.

Network Measurement and Analysis

Internet Routing and Security

Data and Ontology Mapping

Supporting Resources

CAIDA’s accomplishments are in large measure due to the high quality of our visiting students and collaborators. We are also fortunate to have financial and IT support from sponsors, members, collaborators, and monitor hosting sites. In 2022 we welcomed Matthew Luckie (who was a visiting scholar in the first half of the year) and Brendon Jones as consulting research scientists to advise on and execute our infrastructure deployment efforts.

UC San Diego Graduate Students

Visiting Scholars

Funding Sources
