CAIDA's Annual Report for 2020

A report on CAIDA research initiatives, project progress and results, data sets, tool development, publications, presentations, workshops, web site statistics, and operating expenses for 2020.

Mission Statement: CAIDA investigates practical and theoretical aspects of the Internet, focusing on activities that:

  • provide insights into the macroscopic function of Internet infrastructure, behavior, usage, and evolution,
  • foster a collaborative environment in which data can be acquired, analyzed, and (as appropriate) shared,
  • improve the integrity of the field of Internet science,
  • inform science, technology, and communications public policies.

Executive Summary

This annual report summarizes CAIDA’s activities for 2020 in the areas of research, infrastructure, data collection and analysis. Our research projects span: Internet cartography and performance; security, stability, and resilience studies; economics; and policy. Our infrastructure, software development, and data sharing activities support measurement-based Internet research, both at CAIDA and around the world, with a focus on the health and integrity of the global Internet ecosystem.

Internet Mapping. We continued to pioneer methods for Internet cartography, including router-level mapping of transit and cloud interconnections, inference of underlying layer-2 topology structure from layer-3 measurements, geolocation, detection of anycast prefixes, and automated learning of semantic structure in router hostnames. We also studied IPv6 address assignment practices, and created a new animation of the history of IPv4 address space.

Performance Measurement. The COVID pandemic and associated quarantine provided opportunities to measure its impacts on Internet infrastructure. We analyzed our existing (lightweight RTT measurement) data to see whether changes in demand triggered congestion on interconnection links. Because these links involve joint business decisions, they may not be as well-provisioned as assets internal to a single network, and may be harder to upgrade quickly. Performance of these links may reflect the Internet’s resilience to changing demands. Although our visibility of the interconnection ecosystem is limited, in the initial lock-down period in March, we saw evidence of congestion on some interconnection links, which largely went away in a few weeks.

We also undertook experimental work measuring the impact of the pandemic-induced quarantine on the cloud-based applications on which we all became dependent, leveraging speed test servers in access networks to execute measurements that would reflect the user’s quality of experience with cloud applications. In addition, we proposed a new active measurement tool that leverages application TCP flows and recursive packet trains to locate bandwidth bottlenecks. And we continued our work to improve the efficiency of QoE (quality of experience) crowdtesting.

We completed a complex investigation that combined topology and performance data to analyze the impact of the 2019 deployment of the first submarine cable connecting Africa to South America, unexpectedly finding some paths whose performance actually degraded after the cable went live, due to suboptimal routing. This study won the Best Paper Award at PAM 2020.

Security, Stability, and Resilience (SSR) of the Internet’s addressing, routing, and naming systems. We completed seven technical studies on vulnerabilities of the subsystems that constitute the Internet’s fundamental plumbing: IP addressing, DNS, and BGP routing. Each of these systems is characterized by critical flaws that continue to leave the Internet ecosystem vulnerable to a variety of attacks. Our published studies in 2020 included: application of methods for detection of spoofed traffic at Internet Exchange Points (IXPs); investigation of the prevalence, persistence, and risks of several types of DNS misconfigurations; and analyses of operational use of BGP features with implications for routing security.

Economics and Policy. Our final research thrust focuses on the implications of empirical studies of the Internet on public policy. We responded to U.S. government requests for information on the American research environment, updated a policy analysis of interconnection congestion, and published a new evidence-based proposal to advance the security of the persistently vulnerable Internet addressing, naming, and routing systems, which we call trust zones. kc served on two ICANN review teams, one of which (Accountability and Transparency) completed its work in 2020. (The SSR2 Review finished in early 2021.)

Infrastructure Operations. Although all of our DHS-funded research infrastructure projects ended in 2020, we were heavily engaged in two substantial NSF-funded infrastructure development projects, with a renewed focus on the security and resilience of the key Internet subsystems mentioned above. Our continuing NSF and new DARPA support allowed us to make progress on almost all infrastructure components that create the data products most in demand by the community, including: Ark, AS Rank, AS-to-Org mapping, BGPStream, Periscope, Spoofer, the UCSD Network Telescope, and IODA. We also reached milestones with new infrastructure components: a new DNS TLD zone database; an IP address metadata software library; a new BGPView data processing pipeline component of BGPStream; and the FANTAIL project for processing and querying terabytes of traceroute data. Last but not least, based on enthusiastic feedback from many researchers, we launched a new (prototype) rich-context Resource Catalog for CAIDA Internet Data Science Resources.

Everything Else. As always, we engaged in a variety of tool development, data sharing, and outreach activities, including maintaining our web site, and publishing 17 peer-reviewed papers, 1 workshop report, 25 presentations, and 7 blog entries. This report summarizes the status of our activities. Details about our research are available in papers, presentations, and interactive resources on our web sites. We provide listings and links to software tools and data sets shared, and statistics reflecting their usage. Finally, we offer a “CAIDA in numbers” section: statistics on our performance, financial reporting, and supporting resources, including visiting scholars and students, and all funding sources. With the ending of DHS support for many of our projects, we spent time in 2020 pursuing more sustainable funding for CAIDA’s activities. We have not yet succeeded, but NSF has recognized our challenge. At NSF’s invitation/request, we had the honor of co-organizing the first NSF-sponsored Workshop on Overcoming Measurement Barriers to Internet Research in early 2021 and will post the final report on our web site as soon as possible.

CAIDA’s program plan for 2018-2023 is available at https://www.caida.org/about/progplan/progplan2018/. We will begin our 2023-2028 program plan in late 2021. Please feel free to send comments or questions to info at caida dot org. Please note the link to donate to CAIDA at the top of our newly renovated web site: donations are tax-deductible and go 100% to research, because UC San Diego charges no overhead on them!

Research and Analysis

Internet Cartography (Mapping) Methods

APPLE: Alias Pruning by Path Length Estimation

We developed a new technique, Alias Pruning by Path Length Estimation (APPLE), for resolving router IP aliases. Our approach avoids relying on system-specific IP implementations. Instead, it filters potential router aliases seen in traceroute by comparing the reply path length from each address to a distributed set of vantage points. APPLE’s coverage of potential alias pairs in the ground truth networks rivals the current state-of-the-art in IPv4, and far exceeds existing techniques in IPv6. APPLE complements existing alias resolution techniques. (APPLE: Alias Pruning by Path Length Estimation, PAM)
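The pruning idea described above can be illustrated with a minimal sketch. It is not the published APPLE algorithm: it assumes that reply path length is estimated from the TTL remaining on each reply, that routers use one of a few common initial TTL values, and that a candidate pair is kept only when the estimates agree at every shared vantage point. The function names, the thresholds, and the sample addresses are all hypothetical.

```python
from itertools import combinations

# Common initial TTL values; treating these as the only possibilities is an
# assumption of this sketch, not something stated in the report.
INITIAL_TTLS = (32, 64, 128, 255)


def reply_path_length(reply_ttl):
    """Estimate how many hops a reply travelled from the TTL left on arrival."""
    initial = min(t for t in INITIAL_TTLS if t >= reply_ttl)
    return initial - reply_ttl


def prune_alias_pairs(reply_ttls, max_diff=0, min_common=2):
    """Keep candidate pairs whose reply path lengths agree at shared vantage points.

    reply_ttls maps address -> {vantage_point: reply TTL observed there}.
    max_diff and min_common are hypothetical tuning knobs.
    """
    kept = []
    for a, b in combinations(sorted(reply_ttls), 2):
        common = reply_ttls[a].keys() & reply_ttls[b].keys()
        if len(common) < min_common:
            continue  # too little evidence to compare this pair
        if all(
            abs(reply_path_length(reply_ttls[a][vp])
                - reply_path_length(reply_ttls[b][vp])) <= max_diff
            for vp in common
        ):
            kept.append((a, b))
    return kept


# Made-up observations using documentation address space.
observations = {
    "192.0.2.1": {"vp1": 245, "vp2": 240, "vp3": 250},
    "192.0.2.9": {"vp1": 245, "vp2": 240, "vp3": 250},
    "198.51.100.5": {"vp1": 245, "vp2": 236, "vp3": 250},
}
candidate_aliases = prune_alias_pairs(observations)
```

In this toy input, the third address reaches vp2 over a longer reply path than the other two, so pairs involving it are pruned.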

vrfinder: Finding Outbound Addresses in Traceroute

Current methods to analyze the Internet’s router-level topology with paths collected using traceroute assume that the source address of each router’s response is either an inbound or an off-path address on that router. However, outbound addresses are commonly observed, and can mislead inferences of router ownership and interdomain links. We hypothesized that the primary contributor to outbound addresses is Layer 3 Virtual Private Networks (L3VPNs), and proposed vrfinder, a technique for identifying L3VPN outbound addresses in traceroute collections. We validated vrfinder against ground truth from two large research and education networks. We extended the bdrmapIT tool with this technique, substantially increasing the accuracy of its router ownership inferences. (vrfinder: Finding Outbound Addresses in Traceroute, SIGMETRICS)

Improving accuracy of router-level mapping of cloud interconnections

We used our new fast-probing traceroute tool bdrmapIT to execute periodic interconnection-mapping measurements of three major cloud platforms (AWS, Google, Azure) in all U.S. regions. This tool enables us to assess the effects of high-rate probing on the completeness and accuracy of inference of cloud interconnections, and to determine optimal probing rates. The measurement data enables us to infer inter-domain connections between cloud regions and the Internet, and to capture changes in these connections over time. We are still improving the tool based on what we discover in the data.

Learning to Extract and Use ASNs in Hostnames

We presented the design, implementation, evaluation, and validation of a system that learns regular expressions (regexes) to extract Autonomous System Numbers (ASNs) from hostnames associated with router interfaces. Our modifications increased the agreement between extracted and inferred ASNs for routers in CAIDA’s January 2020 Internet Topology Data Kit (ITDK) from 87.4% to 97.1% and reduced the error rate from 1/7.9 to 1/34.5. In addition to increasing the coverage and accuracy of inferences in our existing data sets, this work provides a new avenue for collecting validation data, opening a broader horizon of opportunity to advance methods for evidence-based router ownership inference. (Learning to Extract and Use ASNs in Hostnames, IMC)
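To make the idea concrete, the sketch below applies one regex of the kind such a system might learn, and checks the arithmetic relating the agreement percentages to the "1 in N" error rates quoted above. The domain suffix, the regex, and the function names are hypothetical; they are not taken from the paper or from CAIDA's tools.

```python
import re

# Hypothetical output of the learning step: one regex per domain suffix.
LEARNED_REGEXES = {
    "example.net": re.compile(r"^as(\d+)\.[a-z0-9-]+\.example\.net$"),
}


def extract_asn(hostname):
    """Return the ASN embedded in a router hostname, or None if no regex matches."""
    for suffix, regex in LEARNED_REGEXES.items():
        if hostname.endswith("." + suffix):
            match = regex.match(hostname)
            if match:
                return int(match.group(1))
    return None


def one_in_n_error_rate(agreement_pct):
    """Convert an agreement percentage into N, where the error rate is 1/N."""
    return 100.0 / (100.0 - agreement_pct)


asn = extract_asn("as64500.lon1.example.net")
before = round(one_in_n_error_rate(87.4), 1)  # disagreement of 12.6%
after = round(one_in_n_error_rate(97.1), 1)   # disagreement of 2.9%
```

The two conversions reproduce the reported figures: 12.6% disagreement is roughly 1 in 7.9, and 2.9% is roughly 1 in 34.5.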

RIPE IPmap Active Geolocation

Knowledge about the geographic locations of Internet routers and servers is highly valuable for research on various aspects of Internet structure, performance, economics, and security. Many commercial geolocation databases target end hosts, but our critical infrastructure research needs to geolocate core Internet routers. RIPE NCC offers an open IPmap platform, including its single-radius engine, for geolocation of core Internet infrastructure. We evaluated the accuracy, coverage, and consistency of geolocation of the single-radius method for different types of autonomous systems. We demonstrated the general effectiveness and accuracy of the IPmap single-radius method relative to commercial geolocation databases, and discussed the important role that the IPmap platform can play in future research to improve geolocation of core infrastructure. (RIPE IPmap Active Geolocation: Mechanism and Performance Evaluation, CCR)

Detecting anycast prefixes on the global Internet

We developed a methodology for detecting anycast prefixes that uses a distributed measurement platform of anycast vantage points as sources to probe potential anycast destinations. We used this approach to analyze how the DNS ecosystem uses anycast deployment to eliminate sensitivity to latency dynamics and improve efficiency and scalability. (MAnycast2 - Using Anycast to Measure Anycast, IMC)
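One way to read this approach is sketched below, under the assumption that every vantage point probes from the same anycast source address. Replies from a unicast target should then all be routed to a single vantage point, while replies from an anycast target may land at several. The classification rule, the labels, and the sample data are illustrative only and do not reproduce the published method's handling of routing dynamics.

```python
from collections import defaultdict


def classify_targets(responses):
    """Label each target by how many vantage points received its replies.

    responses is an iterable of (target, receiving_vantage_point) pairs.
    """
    receivers = defaultdict(set)
    for target, vantage_point in responses:
        receivers[target].add(vantage_point)
    return {
        target: "anycast-candidate" if len(vps) > 1 else "unicast-candidate"
        for target, vps in receivers.items()
    }


# Made-up replies: the first target answers toward three different sites.
sample = [
    ("192.0.2.53", "site-a"),
    ("192.0.2.53", "site-b"),
    ("192.0.2.53", "site-c"),
    ("198.51.100.7", "site-b"),
    ("198.51.100.7", "site-b"),
]
labels = classify_targets(sample)
```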

Analyzing address assignment practices in IPv4 and IPv6

We analyzed IPv6 address assignment dynamics around the world. Among other discoveries, we found that IPv6 assignments have longer durations than IPv4 assignments, often stable for months, facilitating long-term fingerprinting of IPv6 subscribers. Our observations benefit many applications, including host reputation systems, active probing methods, and mechanisms for privacy preservation. (DynamIPs: Analyzing address assignment practices in IPv4 and IPv6, CoNEXT)

IPv4 address allocation history

We completed an animated visualization of the history of IPv4 address allocations, which shows how the growing demand for Internet addresses transformed the Internet’s address governance model from a relatively small contract with the U.S. Department of Defense, to the global multi-stakeholder governance model we have today. We cover the exhaustion of the unallocated IPv4 addresses, and the now-substantial market for buying and selling IPv4 addresses. We hope the video serves to educate those interested in the past, present, and future of the Internet addressing architecture.


Performance Measurement

Measuring the impact of COVID-19 on cloud network performance

We participated in NSF’s COVID-19 Internet measurement program, which supported many studies of how the Internet handled unprecedented surges of traffic. Our project focused on the use of cloud-based applications, such as online shopping, video conferencing, and video streaming. End users often use network throughput measurement (or speed measurement) services to understand the performance of last-mile links. However, these test results are not representative of cloud-based application performance because web speed test servers are usually located within the same access network as the client performing the test. To better understand the performance impact of congestion between cloud platforms and access ISPs, we developed the CLASP (CLoud-based Applications Speed measurement Platform) to conduct throughput measurements from speed test servers to virtual clients in the cloud. These measurements are representative of a component of video conferencing: sending video (and audio) data from access networks to the cloud. We used a topology-aware approach to select test servers such that our measurements traversed interconnections to different networks. (Measuring the impact of COVID-19 on cloud network performance, COVID-19 Network Impacts Workshop)

Improving the Efficiency of QoE Crowdtesting

Crowdsourced testing is an increasingly popular way to study the quality of experience (QoE) of applications such as video streaming, because it provides a more realistic assessment environment than laboratory-based assessments allow. We proposed a novel experiment design to conduct a longitudinal crowdsourcing study aimed at improving the efficiency of crowdsourced QoE assessments. Our experimental approach yielded a high level of revisit intent and continual participation in measurements. We replicated the video streaming QoE assessments in a traditional laboratory setting, finding similar trends in the relationship between video bitrate and QoE, confirming previous findings. (Improving the Efficiency of QoE Crowdtesting, ACM Quality of Experience in Visual Multimedia Applications)

Examples of suboptimal trajectories found after deployment of trans-Atlantic undersea cable (SACS) between South America and Africa.

Effects of submarine cable deployment on Internet routing

We used traceroute and BGP data from globally distributed Internet measurement infrastructures to study the impact of the first submarine cable directly connecting Africa to South America. We leveraged archived data from RIPE Atlas and CAIDA Ark platforms, and measurements from strategic vantage points, to analyze latency and path lengths before and after deployment of this new South-Atlantic cable. We found that ASes operating in South America significantly benefit from this new cable with reduced latency to all measured African countries, but for some paths latency actually increased due to routing suboptimalities. After notifying one network of our results, they resolved most of these suboptimalities. Our method generalizes to the study of other cable deployments or outages. We shared our code to promote reproducibility and extension of our work. (Unintended consequences: Effects of submarine cable deployment on Internet routing, PAM Best Paper Award)

FlowTrace

Active measurements provide an important tool for understanding and diagnosing performance bottlenecks on the Internet. We proposed FlowTrace - a readily deployable user-space active measurement framework that leverages application TCP flows to carry out in-band network measurements. Our implementation of Pathneck using FlowTrace creates recursive packet trains to locate bandwidth bottlenecks. Experimental evaluation on a testbed showed that FlowTrace locates bandwidth bottlenecks as accurately as Pathneck, with significantly less impact on the network. (FlowTrace: A Framework for Active Bandwidth Measurements using In-band Packet Trains, PAM)

Security, Stability, and Resilience (SSR) of the Internet’s addressing, routing, and naming systems

Inference of traffic with spoofed source addresses at IXPs: Challenges, methods and analysis

We updated our 2017 study and observed no significant improvement in deployment of Source Address Validation (SAV) in networks that used a mid-size IXP between 2017 and 2019. We explored the feasibility of scaling our inference system, Spoofer-IX, to larger and more diverse IXPs that want to avoid use of their infrastructure for launching spoofed-source DoS attacks. To promote this goal, and broad replicability of our results, we made the source code of Spoofer-IX publicly available. (Spoofed traffic inference at IXPs: Challenges, methods and analysis, Computer Networks)

Spoofer-IX inference method overview

Prevalence, Persistence, and Perils of Lame Delegations and other Misconfigurations

DNS zone administration is a complex task involving manual work and several entities, and often results in misconfigurations. Faulty configurations that bind domains, nameservers and glue records jeopardize the correct and efficient implementation of the function of the DNS. In particular, lame delegations, which occur when a nameserver responsible for a domain is unable to provide authoritative information about the domain, introduce both performance and security risks. We performed a broad-based measurement study of lame delegations, using both longitudinal zone data and active querying. We found that lame delegations of various kinds are common, that they can significantly degrade lookup latency (when they do not lead to outright failure), and that they expose hundreds of thousands of domains to adversarial takeover. We explored circumstances that give rise to this surprising prevalence of lame delegations, including unforeseen interactions between the operational procedures of registrars and registries. (Unresolved Issues: Prevalence, Persistence, and Perils of Lame Delegations, IMC)

One type of lame delegation can be caused by an orphan DNS record, in which a glue record for a delegation that does not exist anymore is forgotten in the zone file. An attacker may easily hijack domains that have these records in their delegation, by registering the domain associated with the orphan. In collaboration with researchers from the University of Twente, we identified a new type of glue record misconfiguration – abandoned records – and analyzed the continued prevalence of both types of misconfigurations compared to a decade-old study. (The Forgotten Side of DNS: Orphan and Abandoned Records, Workshop on Traffic Measurements for Cybersecurity)

Our U. Twente collaborators also developed and deployed a tool for detection of DNS parent-children configuration mismatches (https://superdns.nl). Analysis of the results underlined the risk such inconsistencies pose to the availability of misconfigured domains. (When parents and children disagree: Diving into DNS delegation inconsistency, PAM)

DNS Cache Snooping Rare Domains at Large Public DNS Resolvers

UC San Diego CSE collaborators used our Ark system to develop and evaluate Trufflehunter, a DNS cache snooping tool for estimating the prevalence of rare and sensitive Internet applications. Trufflehunter models the complex behavior of large multi-layer distributed caching infrastructures (e.g., Google Public DNS). Using a controlled testbed, we evaluated how accurately Trufflehunter can estimate domain name usage across the U.S. Applying this technique in the wild, we provided a lower-bound estimate of the popularity of several rare and sensitive applications (most notably smartphone stalkerware) that otherwise present measurement challenges. (Trufflehunter: Cache Snooping Rare Domains at Large Public DNS Resolvers, IMC)

Routing Security: Empirical studies

We used BGP and RPKI data to analyze the degree to which an ISP’s registration of routes in the RPKI (Resource Public Key Infrastructure) protects its network from illicit announcements of its prefixes. (To Filter or not to Filter: Measuring the Benefits of Registering in the RPKI Today, PAM)
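For readers unfamiliar with how RPKI registration translates into filtering, the sketch below shows the standard route origin validation outcome (valid, invalid, not-found) as described in RFC 6811. It is background illustration, not the measurement method of the study; the function name and the sample ROA, built from documentation prefixes and documentation AS numbers, are hypothetical.

```python
import ipaddress


def validate_origin(prefix, origin_asn, roas):
    """Classify an announcement against ROA tuples of (prefix, max_length, asn)."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        # Only ROAs of the same address family that cover the route matter.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if route.prefixlen <= max_length and origin_asn == roa_asn:
            return "valid"
    # Covered by some ROA but matched by none: a filtering network would drop it.
    return "invalid" if covered else "not-found"


roas = [("192.0.2.0/24", 24, 64500)]
legitimate = validate_origin("192.0.2.0/24", 64500, roas)
wrong_origin = validate_origin("192.0.2.0/24", 64501, roas)
more_specific = validate_origin("192.0.2.0/25", 64500, roas)
unregistered = validate_origin("198.51.100.0/24", 64500, roas)
```

The third case shows why the maximum length matters: a more-specific announcement is invalid even when it carries the registered origin.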

We also studied AS Path Prepending (ASPP) – a well-known technique that inflates the AS path in order to engineer traffic toward (or away from) certain paths. We found 18% of the prepended prefixes contained unnecessary prepends that achieved no apparent goal other than amplifying existing routing security risks. (AS-Path Prepending: there is no rose without a thorn, IMC)
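Prepending is visible in an AS path as consecutive repetitions of the same AS number. The sketch below only counts those repetitions; it does not attempt the paper's judgment of whether a prepend is unnecessary. The function names and the sample path (documentation AS numbers) are hypothetical.

```python
from itertools import groupby


def prepend_counts(as_path):
    """Return {asn: number of extra consecutive copies} for an AS path."""
    counts = {}
    for asn, run in groupby(as_path):
        extra = sum(1 for _ in run) - 1
        if extra > 0:
            counts[asn] = counts.get(asn, 0) + extra
    return counts


def strip_prepends(as_path):
    """Collapse consecutive duplicates to recover the path without prepending."""
    return [asn for asn, _ in groupby(as_path)]


path = [64500, 64501, 64501, 64501, 64502]
prepends = prepend_counts(path)
collapsed = strip_prepends(path)
```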

Economics and Policy

Policy Outreach

Early in 2020, we submitted a four-page response to the Office of Science and Technology Policy (OSTP) National Science and Technology Council Request for Information on the American Research Environment (Federal Register). Our response focused on how the lack of rigorous scientific research on the character of the Internet will grow more problematic as the Internet becomes ever more deeply embedded as critical infrastructure for society. (Comments on Request for Information on the American Research Environment, NSTC)

Policy challenges in mapping Internet interdomain congestion

We published an updated version of our analysis of the policy implications of different interdomain congestion measurement methods in the Journal of Information Policy. We used six case studies that show how our conceptual model can guide a critical analysis of what is or should be measured and reported, and how to soundly interpret these measurements. (Policy challenges in mapping Internet interdomain congestion, Journal of Information Policy)

IoT deployment and the DNS

The DNS in IoT: Opportunities, Risks, and Challenges

A key challenge of the modern Internet is how to protect users and Internet infrastructure operators from attacks on or launched through vast numbers of autonomously operating sensors and actuators. We explored how the security extensions of the Domain Name System (DNS) offer an opportunity to help tackle that challenge, while also outlining the risks that the IoT poses to the DNS in terms of complex and quickly growing IoT-powered Distributed Denial of Service (DDoS) attacks. We identified three challenging opportunities for the DNS and IoT industries to address the risks, for example by making DNS security functions (e.g., response verification and encryption) available on popular IoT operating systems. (The DNS in IoT: Opportunities, Risks, and Challenges, IEEE Internet Computing)

Trust Zones: A Path to a More Secure Internet Infrastructure

We proposed and analyzed a data-driven approach to improve the security of the Internet infrastructure. We identified the key vulnerabilities and described why barriers to progress are not just technical, but embedded in a complex space of misaligned incentives, negative externalities, lack of agreement as to priority and approach, and missing leadership. We described current trends in how applications are designed on the Internet, which lead to increasing localization of the Internet experience. Exploiting this trend, we focused on regional security rather than what we consider the unachievable aspiration of global security, and introduced a concept we call zones of trust to make realistic, measurable progress on persistently unsolved challenges. (Trust Zones: A Path to a More Secure Internet Infrastructure, Telecommunications Policy Research Conference)

ATRT3 Report and Minority Statement

KC Claffy participated in ICANN’s third Accountability and Transparency Review Team (ATRT3), which completed its work in May 2020 with five (multi-part) recommendations. She published a Minority Statement indicating her concerns with ATRT3’s recommendation on the future of reviews. The report recommended terminating all Specific Reviews other than ATRT, namely the Security, Stability, and Resiliency (SSR) Review, the Competition, Consumer Trust, and Consumer Choice (CCT) Review, and the Registration Directory Service (RDS) Review (formerly WHOIS Review). More precisely, the report recommended suspending any further SSR or RDS reviews until and unless a future ATRT deems them necessary again, and allowing only one additional CCT review, but not until after the next round of new gTLDs. The report recommended replacing these terminated reviews with a single new holistic review approximately every 8 years, a remarkably long time for the Internet industry. In addition, the report recommended terminating all independent organizational reviews and replacing them with self-directed, i.e., not independent, “continuous improvement programs which have to produce a status report at least every three years.” Implementing these changes would require substantial changes to ICANN’s Bylaws. The ICANN Board has since moved to accept this recommendation and submitted it to the implementation process. (Third Accountability and Transparency Review Team (ATRT3) Report and kc’s Minority Statement, ICANN ATRT3)

Measurement Infrastructure and Data Sharing Projects

KISMET architecture: data, people, institutional components

We continued to evolve our measurement, data analytics, and data sharing platforms for collecting and curating infrastructure data in a form that facilitates query, integration and analysis. We also undertook a complete re-evaluation of CAIDA’s data pipeline and the strategy around interfaces, services and microservices to provide researchers more accessible, calibrated and user-friendly tools for collecting, analyzing, querying, and interpreting measurements of the Internet ecosystem. Below we list the main changes and updates that happened in 2020.

KISMET project

First, we completed our participation in the first cohort of NSF’s Convergence Accelerator program with our KISMET (Knowledge of Internet Structure: Measurement, Epistemology, and Technology) project. Our experience with this new and unique NSF program was priceless. We conducted over 80 interviews across disciplines and sectors: infrastructure operators, threat intelligence, network security, economists, government regulators, IT consultants, and researchers focused on the various aspects of the Domain Name System (DNS), routing dynamics and infrastructure, and threat intelligence. In the process, we developed a cohesive team of partners and collaborators that includes experts with deep understanding of subdomains of the Internet identifier ecosystem and with records of delivering software, services, methods of analysis, technical training, or deploying and maintaining complementary measurement infrastructure and datasets. Although the project did not advance to Phase 2, we appreciated the reviewers' concerns that the problems we were proposing to tackle required a larger budget than the Convergence Accelerator program allowed. We took this feedback seriously and submitted larger proposals in 2021.

Data Infrastructure Building Blocks for Applied Network Data Analysis

Several of our infrastructure projects ended in 2020, but we are making valiant efforts to keep operational those in use by the community.

Archipelago

We continue to maintain the Archipelago (Ark) active measurement platform as much as we can (the project has no dedicated funding). Ark monitors continue to provide raw data for most of our macroscopic Internet data sets. We continued development of Vela, a system for executing on-demand measurements from Ark nodes.

AS Rank

We released a new version of the AS Rank API based on GraphQL. The new version reflects a move from a RESTful API (v1) to one that uses GraphQL (v2), a more flexible and powerful query language. To support users not comfortable switching, we released version 2.1, which reduced complexity of the full-featured GraphQL interface through a simplified RESTful API.
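As a rough illustration of what a GraphQL request to such an API looks like, the sketch below builds (but does not send) an HTTP POST carrying a query for one AS. The endpoint URL and the field names asn, asnName, and rank are assumptions not stated in this report and should be checked against the AS Rank API documentation before use; build_request is a hypothetical helper.

```python
import json
import urllib.request

# Assumed endpoint; verify against the AS Rank API documentation.
ASRANK_GRAPHQL_URL = "https://api.asrank.caida.org/v2/graphql"


def build_request(asn):
    """Build a POST request asking for a few fields about one AS."""
    # Field names below are assumptions made for illustration.
    query = "{ asn(asn: %s) { asn asnName rank } }" % json.dumps(str(asn))
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        ASRANK_GRAPHQL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )


request = build_request(3356)
# To send it: json.load(urllib.request.urlopen(request))
```

The appeal of GraphQL in this setting is that the client names exactly the fields it wants in a single request, rather than combining several fixed REST responses.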

AS to Organization mapping

We improved our AS-to-Organization mapping by increasing the amount of raw data we collected from National Internet Registries beyond the base WHOIS data that we download from the Regional Internet Registries (RIRs). We integrated the BGPStream service into the AS-to-Organization mapping code base and wrote a RESTful API for the AS to Organization mapping data.

BGPStream

We continued to maintain, support, and evolve the BGPStream code, including bug handling, patching, caching, IPv4/IPv6 storage, and reduced memory usage. We released libbgpstream v2 in August 2020. So many projects depend on our BGP pipeline that, for performance reasons, we decided it was prudent to overhaul its BGPView component to optimize data structures and libraries for (re-)construction, transport and analysis of BGP routing tables. We did major refactoring, redesign, and optimizations and cleaned up code formatting in the prefix2as code within BGPView. We plan to make this new component available in 2021.

DNS Zone Database platform for querying DNS TLD zone files

We worked with collaborator Ian Foster (Google) to transition his software and hardware platform for querying DNS zone files to CAIDA infrastructure. While a student, Ian created a longitudinal dataset of legacy and new gTLD zone files spanning nine years that allows researchers to study trends in vulnerabilities in the naming ecosystem, but he could no longer maintain the system. We agreed to support it as resources permit; it supported several of our DNS research studies. (DZDB)

IP address metadata: Libipmeta

We spent significant effort to fix bugs to release libipmeta v2 and to extend functionality for version 3. Libipmeta is a library to support the querying for historical and realtime IP metadata including CAIDA’s geolocation information, Prefix-To-AS databases, and future metadata on IP addresses. This library combined with its companion pyipmeta library supports several CAIDA analysis projects.

Measurement and ANalysis of Internet Congestion (MANIC)

We continue to support the MANIC (Measurement and Analysis of Network Interdomain Congestion) component of our infrastructure. It now supports the goal of the DARPA project Performance Evaluation Network Measurements and Analytics (PENMAN): to improve the ability of a third party to characterize performance bottlenecks along a given path of interest.

MIDAR

We continued to maintain all backend and database components that use the MIDAR IPv4 alias resolution service. The MIDAR web API delivers access to MIDAR’s functionality.

Periscope

Spoofer session report showing path taken by spoofed packets.

This year CAIDA put the Periscope Looking Glass API service into public alpha testing. We improved the account creation process and monitoring functionality to detect future problems. We monitored logs and fixed bugs.

Spoofer

Ark monitors continue to help measure the Internet’s susceptibility to spoofed source address IP packets. We implemented an improved visual interface to the data and created a spoofing notification registration system in the Spoofer project’s reporting engine that allows a vetted operator within an AS to receive these notifications. We also continued to support the Spoofer client software package on new OS releases for Windows, macOS, and Linux. We expanded our efforts to build Spoofer packages for home access router platforms, e.g., OpenWRT-based. We extended the capabilities of the client to test networks whose Network Address Translation (NAT) router obscures our view of SAV deployment. To infer SAV policy for these networks, we modified our server software to respond to packets whose spoofed source address has been rewritten.


STARDUST platform

As part of the Sustainable Tools for Analysis and Research on Darknet Unsolicited Traffic (STARDUST) project, which aims to maintain continued operation of the UCSD Network Telescope infrastructure, we continued to refine and extend Corsaro, the time series component of the platform. We collaborated with Kentik (kentik.com) to explore the feasibility of using their commercial traffic monitoring platform, either as an additional component of the STARDUST platform or as a way to reduce the amount of infrastructure that CAIDA needs to operate on an ongoing basis. We began working with Merit Network, which operates another (much smaller) network telescope, to transition analysis of their data from the legacy Corsaro v2 system to the new Corsaro v3 Time Series system developed by the STARDUST project. These data will be available via an interface alongside data from the UCSD Network Telescope. We completed the deployment of all key components of the STARDUST infrastructure, which is now fully operational and serving users. We migrated users to the STARDUST virtualized cloud research compute environment. We deployed a completely new interface for interactive dashboards and for exploration of our time series data. (STARDUST interface)

Facilitating Advances in Network Topology Analysis (FANTAIL)

Illustration of the FANTAIL four-component system.

In response to research community feedback, we are developing the FANTAIL system to enable discovery of the full potential value of massive raw Internet end-to-end path measurement data sets. We are creating a four-component system: (1) an interactive web interface; (2) an API built on web standards; (3) a full-text search system based on Elasticsearch; and (4) a big data processing system based on Spark, leveraging SDSC’s cluster resources. In 2020, we designed a document model for traceroute data that supports the initial set of queries proposed in the FANTAIL proposal, and wrote scripts to import traces in JSON format into Elasticsearch (ES) and to export from the current data store into ES, including transforming the trace format. We developed a query script to support traces partitioned into (monitor, year) indexes and implemented support for querying across all yearly indexes of all monitors in one query. All queries executed from the current web interface now use data collected between 2016 and 2019 (37.2B traces, 26 TB as stored in ES). The interface can output the raw query string in JSON format, which will allow a user to run Spark ES jobs on the set of traces that a query returns. We provide accounts to researchers and students for access to the prototype via the Vela interface.
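As a rough illustration of querying traces partitioned into (monitor, year) indexes, the sketch below builds an Elasticsearch query body and a multi-index search path. The index naming scheme (`traces-<monitor>-<year>`), the monitor names, and the field names `dst` and `start_time` are hypothetical; the report does not describe FANTAIL's actual document model. Only the standard Elasticsearch query DSL (`bool`, `filter`, `term`, `range`) and the comma-separated multi-index search path are relied upon.

```python
import json

def index_names(monitors, years, prefix="traces"):
    """One index per (monitor, year) pair; this naming scheme is hypothetical."""
    return [f"{prefix}-{m}-{y}" for m in monitors for y in years]

def destination_query(dst_ip, start_epoch, end_epoch):
    """Build a query body; the field names 'dst' and 'start_time' are assumed."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"dst": dst_ip}},
                    {"range": {"start_time": {"gte": start_epoch, "lt": end_epoch}}},
                ]
            }
        }
    }

# Two made-up monitors, yearly indexes for 2016 through 2019.
indexes = index_names(["mon-a", "mon-b"], range(2016, 2020))
# Time window: calendar year 2019 in UTC, as Unix timestamps.
body = destination_query("192.0.2.1", 1546300800, 1577836800)

# Elasticsearch accepts a comma-separated list of index names in the search path.
search_path = "/" + ",".join(indexes) + "/_search"
print(search_path)
print(json.dumps(body, indent=2))
```

The printed JSON body is the kind of raw query string that could be handed to a separate batch job that re-runs the same selection over the matching traces.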

IODA platform

Our infrastructure for detecting macroscopic Internet-edge outage events fuses three data sources: Internet Background Radiation from the UCSD Network Telescope (IBR – one-way unsolicited traffic generated by millions of Internet hosts worldwide), Border Gateway Protocol (BGP) update messages (used to exchange reachability information between Internet Service Providers), and active probing results that reveal the reachability of end-hosts. By analyzing how an event manifested itself across the various data sources, we can investigate its potential underlying cause(s). In 2020, the IODA project upgraded the codebase of the user interface (UI) and back-end APIs to simplify future development. IODA now monitors BGP reachability of IPv6 network prefixes. IPv6 BGP visibility time series data for each Country/AS/Sub-national Region are now accessible through the IODA Explorer interface. CAIDA provides users access to the IODA dashboards. We used IODA to investigate four macroscopic outages that affected Iranian networks in early 2020. (An analysis of large Internet outages affecting Iranian networks in early 2020, CAIDA)
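The idea of comparing one event across several data sources can be sketched as follows. This is not IODA's detection algorithm: the drop rule (latest value below half of the median of the preceding window), the window size, the two-source agreement threshold, and the sample series are all assumptions chosen to keep the example small.

```python
from statistics import median

def dropped(series, window=6, ratio=0.5):
    """True if the latest value is below `ratio` times the median of the prior window."""
    if len(series) <= window:
        return False
    baseline = median(series[-window - 1:-1])
    return baseline > 0 and series[-1] < ratio * baseline

def fuse(signals, min_sources=2):
    """Flag an outage when at least `min_sources` signals drop at the same time."""
    alerts = sorted(name for name, series in signals.items() if dropped(series))
    return {"outage": len(alerts) >= min_sources, "sources": alerts}

# Made-up time series for one region; the last point is the newest sample.
signals = {
    "bgp": [100, 101, 99, 100, 100, 102, 40],
    "active_probing": [800, 790, 805, 810, 795, 800, 120],
    "telescope": [50, 55, 48, 52, 51, 49, 47],
}
print(fuse(signals))
```

Requiring agreement between independent sources reduces false alarms from a single noisy signal, and the set of sources that dropped hints at the cause: for example, a BGP and active-probing drop without a telescope drop looks different from a drop in all three.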

Data Sharing Infrastructure

Note: Changes in Data Access

IMPACT project discontinued

For the last five years CAIDA made some by-request datasets available exclusively through the DHS Information Marketplace for Policy and Analysis of Cyber-risk and Trust (IMPACT) portal. DHS terminated the IMPACT program, so users should now request access to these data via the corresponding CAIDA data access request forms (linked as “Request access”) in the details of restricted CAIDA datasets.

Access to UCSD Network Telescope Datasets now via STARDUST virtual machine (VM) platform

Beginning in 2020, we provide access to the UCSD Network Telescope data via a VM-based analysis platform developed specifically for UCSD Network Telescope data users. Users analyze these datasets only on CAIDA computers. Currently, all historical aggregated flow and daily RSDoS attack metadata, and about 30 days of the most recently collected raw telescope data, are kept in Swift, our OpenStack object-based cloud storage, and accessed via the STARDUST VM environment. To extend use of our finite resources to more researchers, we updated the user account expiration policy.

Data Collection

Compressed size of the UCSD Network Telescope raw data stored at NERSC.

Uncompressed size of Ark topology measurements. Light green shading indicates the size of IPv4 team probing measurements, dark green -- the size of IPv4 prefix probing, blue -- IPv4 TSLP congestion, red -- IPv4 Border Mapping, purple -- IPv6 topology.

These graphs show the cumulative volume of data accrued over the last several years by our primary data collection infrastructures, Archipelago (Ark) and the UCSD Network Telescope. CAIDA executes the following Ark measurements on an ongoing basis:

  • IPv4 team probing: daily traceroutes to all routed /24 IPv4 networks – Ark IPv4 Routed /24 Topology Dataset
  • IPv4 prefix probing: daily traceroutes to every BGP-announced IPv4 prefix from a subset of Ark monitors – IPv4 Prefix-Probing Traceroute Dataset
  • IPv6 topology measurements collected by a subset of Ark monitors that probe all announced IPv6 prefixes (/48 or shorter) once every 48 hours – Ark IPv6 Topology Dataset
  • Congestion measurements aimed at detecting congestion on interdomain links of the networks hosting the Ark monitors. These consist of Time-Sequence Ping (TSP) measurements and Border Mapping (bdrmap) measurements.
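The IPv6 target selection rule in the list above ("/48 or shorter") can be sketched with Python's standard `ipaddress` module. This is only an illustration of the prefix-length filter, not Ark's probing code; the function name `probe_targets` and the sample prefixes (from the IPv6 documentation range) are assumptions.

```python
import ipaddress

def probe_targets(announced_prefixes, max_len=48):
    """Keep IPv6 prefixes of length /max_len or shorter, dropping duplicates."""
    selected = set()
    for text in announced_prefixes:
        net = ipaddress.ip_network(text, strict=False)
        if net.version == 6 and net.prefixlen <= max_len:
            selected.add(net)
    return sorted(selected)

# Hypothetical announcements: one duplicate, one too-specific /64, one IPv4 prefix.
announced = [
    "2001:db8::/32",
    "2001:db8:1::/48",
    "2001:db8:1:2::/64",
    "192.0.2.0/24",
    "2001:db8::/32",
]
for net in probe_targets(announced):
    print(net)
```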

We currently collect about 9 TB of uncompressed data per day (more than 95% of which is Telescope (darknet) data). In 2020 CAIDA captured about 12 TB of uncompressed topology traceroute data, and about 1.6 PB of Internet darknet traffic data.

New and Improved Datasets

Historical Database of DNS Top-Level Domain (TLD) Zone Files

We adopted a Historical Database of DNS Top-Level Domain Zone files from ICANN’s Centralized Zone Data Service that had insufficient resources left for stewardship. Our new service ingests zone files from over 1,300 Top-Level Domains, then indexes and annotates them to allow users to query for domains, nameservers, IP addresses, and more. The service is available for research use until at least the end of 2021.

Internet Topology Data Kit (ITDK)

We added ITDK 2020-01 and ITDK 2020-08 to our ongoing collection of Macroscopic Internet Topology Data Kits (ITDK) that started in 2010 and now includes 20 Kits. These data sets contain router-level topologies generated from the Ark IPv4 Routed /24 Topology Dataset. They also include an IPv6 router-level topology, assignments of routers to ASes, geographic locations of each router, and Domain Name Service (DNS) lookups of all observed IP addresses.

Data Supplements

We added data supplements for two papers: Learning to Extract and Use ASNs in Hostnames (IMC 2020) and Unintended consequences: Effects of submarine cable deployment on Internet routing (PAM 2020).

New CAIDA Internet Measurement Resource Catalog (BETA!)

CAIDA Resource Catalog interface at initial launch.

In pursuit of the FAIR (findable, accessible, interoperable, reusable) principles of our scientific data infrastructure mission, we invested significant effort to make our data sets and associated resources more accessible to other researchers. In 2020, we designed and developed a prototype back end and user interface for a new CAIDA Resource Catalog. This catalog indexes papers, presentations, and other media, as well as datasets, software, and “recipes” (aka solutions, gists). The catalog matches papers to the CAIDA tools and datasets used in the papers, both for CAIDA-authored papers and for papers by external authors that used CAIDA resources. While we emphasize the early (prototype) nature of this catalog, we hope to continue to evolve it so that it helps users interested in doing Internet research get started by supplying them with a rich set of metadata and tools for each dataset. CAIDA Datasets in the Catalog.


Data Distribution Statistics

CAIDA has shared data and knowledge, including sensitive knowledge about critical Internet infrastructure, for two decades. The field of Internet infrastructure research depends heavily on CAIDA’s measurement and data sharing infrastructure. CAIDA’s data enables organizations to identify and remediate various Internet transport-layer vulnerabilities, including IP spoofing, BGP routing attacks, DNS abuse, and Certificate Authority manipulations.

Users now request access to CAIDA’s data through the CAIDA Resource Catalog. CAIDA datasets fall into two categories: public and by-request. We make public datasets available to users who agree to CAIDA’s Acceptable Use Policy for public data. We vet requests for by-request datasets and make these datasets available for use by academic researchers, US government agencies, and corporate entities through UC San Diego’s Office of Innovation and Commercialization. Users fill out the appropriate request form, including a brief description of their intended use of the data, and agree to an Acceptable Use Policy. In the last two years, CAIDA has shared ongoing publicly available datasets with government organizations and their contractors, including FCC, CACI, IRS, SAIC, Idaho National Lab, Johns Hopkins Applied Physics Lab, NIST, USCG, DHS, Sandia, Fermi, MIT Lincoln Labs, Lawrence Berkeley National Labs, Naval Postgraduate School, General Dynamics, and Northrop Grumman. CAIDA also shared data with DARPA contractors (e.g. Raytheon Applied Signal Technology – DARPA I20 SearchLight and NICE) and US Air Force contractors (e.g. Perspecta Lab/USAF CHARON). Companies including Cisco, Microsoft, Nokia Bell, Airbus, cPacket Networks, Zvelo, Greynoise, and Wells Fargo have used CAIDA data in the last two years to understand traffic and topology characteristics, including to map logical to physical network topologies.

The graphs below show the annual counts of unique visitors who downloaded CAIDA datasets (public and by-request) and the total size of downloaded data. In 2020 we granted access to the CAIDA by-request datasets to more than 300 new users. These statistics do not include Near-Real-Time Telescope datasets (raw traffic traces in pcap format, aggregated flow and daily RSDoS attack metadata), which users access via the STARDUST platform. Our last passive traffic trace was collected in January 2019, due to lack of funding to keep up with the link upgrade. We continue to get many community requests for more recent samples of such data, and are trying to build a 100GB packet capture monitor for experimental deployment in 2021.

Data Distribution Statistics: Unique users downloading CAIDA data annually.

Data Distribution Statistics: Volume of data downloaded annually. Multiple downloads of the same file by the same user, which are common, are counted only once.

Unique users downloading CAIDA data and corresponding ASes aggregated by country.


Publications using public and/or restricted CAIDA data (by non-CAIDA authors)

We know of a total of 50 publications in 2020 by non-CAIDA authors that used CAIDA data. We update the external publications database as we learn of new publications. Some papers used more than one dataset. As of January 2021 we found 1,918 papers with 1,322 different authors in 92 countries. Please let us know if you know of a paper using CAIDA data not yet on our list: Non-CAIDA Publications using CAIDA Data. (These data lag because we have not had time to chase down 2020 papers yet, and sometimes have to wait for researchers to report them.)

Impact of CAIDA data sharing: Annual number of non-CAIDA publications using CAIDA data

Impact of CAIDA data sharing: Country of affiliation of authors of non-CAIDA papers using CAIDA data.


Tools

CAIDA develops and maintains supporting tools for Internet data collection, analysis and visualization.

In 2020, we modified the Scamper tool to extract AS numbers from hostnames. The work presents the design, implementation, evaluation, and validation of a system that learns regular expressions to extract Autonomous System Numbers (ASNs) from hostnames associated with router interfaces. (Learning to Extract and Use ASNs in Hostnames, IMC)
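To make the idea concrete, the sketch below applies a single hand-written regular expression to a few hostnames. The system described above learns such expressions automatically; this example does not reproduce that learning step or any Scamper interface. The naming convention (`as<number>` delimited by dots or hyphens), the hostnames, and the function name `extract_asn` are assumptions for illustration.

```python
import re

# Hand-written pattern for one hypothetical convention, e.g. "as64500.example.net".
# The "as" token must start the hostname or follow a "." or "-" delimiter.
ASN_PATTERN = re.compile(r"(?:^|[.-])as(\d{1,10})(?=[.-]|$)", re.IGNORECASE)

def extract_asn(hostname):
    """Return the ASN embedded in a router hostname, or None if no match."""
    match = ASN_PATTERN.search(hostname)
    if match is None:
        return None
    asn = int(match.group(1))
    # Reject values outside the 32-bit AS number space.
    return asn if 0 < asn < 2**32 else None

for name in [
    "as64500.pop1.example.net",
    "xe-0-0-0.as64501-gw.example.net",
    "atlas12.example.net",   # "as12" is part of a word, so it is not treated as an ASN
    "core1.example.net",
]:
    print(name, extract_asn(name))
```

Anchoring the token to a delimiter is what keeps ordinary words that happen to contain "as" followed by digits from being misread as AS numbers.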

In addition to rearchitecting the Corsaro system to utilize the nDAG multicast architecture, we also drastically improved the performance of the Corsaro TimeSeries plugin to lower CPU requirements and increase capacity to handle exceptional packet rates.

In 2020 we migrated the STARDUST telescope time series data from DBATS, our custom time series database, to an InfluxDB/Grafana based architecture (accessible from https://explore.stardust.caida.org). This drastically reduces the amount of in-house code that we have to maintain and support, further improving sustainability of this platform.

The following chart and table display CAIDA-developed and currently supported tools and the number of external downloads (by unique IP address) during 2020. Descriptions of tool functionality are available in CAIDA’s Resource Catalog.

CAIDA tools downloads in 2020

Workshops

In late February, CAIDA held the Workshop on Active Internet Measurements: Knowledge of Internet Structure: Measurement, Epistemology, and Technology (AIMS-KISMET 2020), part two of the WIE-KISMET workshop held in December 2019. The goal of this in-person workshop was to further discuss possible future Internet scenarios that might change the security profile of the Internet. We did not publish a workshop report; instead, we integrated community feedback into the KISMET Phase 2 proposal, which was not funded.

Then the world blew up and we had to convert the community PAM conference to a virtual conference on very short notice. We published an editorial note with key takeaways, lessons learned, and suggestions for future virtual conferences distilled from this experience. (Lessons Learned Organizing the PAM 2020 Virtual Conference, CCR)

We got to use these insights six months later, when we hosted a virtual version of our annual Workshop on Internet Economics (WIE 2020), inviting users of CAIDA data who had explicitly tried to conduct economics or policy research with the data, or who could provide feedback on what they would like to do with the data. We integrated this feedback into our 2021 data infrastructure plans. (WIE 2020 report)

CAIDA in Numbers

In 2020, CAIDA published 17 peer-reviewed papers (see below) and 1 workshop report, made 25 presentations, and posted 7 blog entries. Presented materials are listed on the CAIDA Resource Catalog. Our web site www.caida.org attracted approximately 434,840 unique visitors, with an average of 1.72 visits per visitor, serving an average of 3.12 pages per visit. During 2020, CAIDA employed 21 staff (researchers, programmers, data administrators, technical support staff) and hosted 2 postdocs, 4 PhD students, 1 master’s student, and 14 undergraduate students.

The charts below show CAIDA expenses by type of operating expense, by funding source, and by program area:

Expense type | Amount ($) | Percentage
Labor | $1,851,050.00 | 42%
Benefits | $809,897.88 | 18%
Indirect Costs | $1,330,473.94 | 30%
Professional Development | $62,088.78 | 1%
Supplies & Expenses | $70,958.61 | 2%
Subcontracts | $266,094.79 | 6%
Equipment | $44,349.13 | 1%
Total | $4,434,913.13 | 100%

Funding Source | Amount ($) | Percentage
NSF | $2,740,885.33 | 61.80%
DHS | $674,620.10 | 15.21%
Gift | $145,551.24 | 3.28%
DARPA | $508,674.24 | 11.47%
Other | $365,182.22 | 8.23%
Total | $4,434,913.13 | 100%

Research Program Area | Amount ($) | Percentage
Security, Stability, Resilience | $618,875.85 | 14%
Performance | $313,501.54 | 7%
Infrastructure & Data Sharing | $1,672,116.48 | 38%
Cartography | $1,119,967.24 | 25%
Econ & Policy | $710,452.02 | 16%
Total | $4,434,913.13 | 100%


Publications

Publications are grouped by research categories.

Internet Cartography (Mapping) Methods

Performance Measurement

Internet Security, Stability, and Resilience

Economics and Policy Research

Workshop Reports

Supporting Resources

CAIDA’s accomplishments are in large measure due to the high quality of our visiting students and collaborators. We are also fortunate to have financial and IT support from sponsors, members, collaborators, and monitor hosting sites.

UC San Diego Graduate Students

Visiting Scholars

Funding Sources
