CAIDA's Annual Report for 2019

A report on CAIDA research initiatives, project progress and results, data sets, tool development, publications, presentations, workshops, web site statistics, funding sources, and operating expenses for 2019.

Mission Statement: CAIDA investigates practical and theoretical aspects of the Internet, focusing on activities that:

  • provide insight into the macroscopic function of Internet infrastructure, behavior, usage, and evolution,
  • foster a collaborative environment in which data can be acquired, analyzed, and (as appropriate) shared,
  • improve the integrity of the field of Internet science,
  • inform science, technology, and communications public policies.

Executive Summary

This annual report summarizes CAIDA's activities for 2019 in the areas of research, infrastructure, data collection and analysis. Our research projects span Internet cartography, security and stability studies (of outages, performance, and vulnerabilities), economics, and policy. Our infrastructure, software development, and data sharing activities support measurement-based Internet research, both at CAIDA and around the world, with a focus on the health and integrity of the global Internet ecosystem.

Internet Mapping and Performance Measurement. We completed a study tracking IPv6 deployment over the last several decades, outlined open challenges in geolocation of BGP prefixes, and began to consider new approaches to inferring anycast prefixes. Our performance studies focused on mobile application performance, including one on the application of reinforcement learning to reconfigure edge networks to improve video streaming performance over wireless networks. We also continued development of our QUINCE system for correlating crowdsourced QoE assessments with observed network performance across the same paths.

Monitoring Global Internet Security and Stability. We published several studies on outage detection, including the intentional use of DoS attacks to disrupt connectivity as a political act. We developed new methodologies for studying BGP hijacks, and in collaboration with MIT, published a study characterizing the behavior of "serial" BGP hijackers, providing insights about BGP hijacking events detected in the wild. We began a new project in collaboration with Dutch colleagues (co-funded by U.S. and Dutch governments) on mapping DNS-related DDoS vulnerabilities to improve protection of this vital core Internet infrastructure. We also continued our study of the state of source address validation (to prevent spoofed source attacks), including developing new methods to use IXP traffic data to expand visibility of compliance with source address validation best practices. We published our most important work thus far on the prospect of remediating this fundamental architectural vulnerability: a comprehensive analysis of deployment and characteristics of IP source address validation on the Internet since 2005, including an analysis of approaches taken to encouraging remediation and the challenges of evaluating their impact.

Economics and Policy. We published a series of instructional videos on Internet public policy topics, such as reasonable network management and network neutrality. We also published a preliminary but, we hope, comprehensive annotated taxonomy of harms that arise in the Internet ecosystem, with the aim of advancing the rigor of conversations in today's hectic and reactive Internet policy environment. At the same conference, we expanded on previous work describing the implications for regulation when platforms embed a layered communications architecture. Finally, as our contribution to science policy this year, we developed a set of recommendations that the Internet scientific research community can undertake to initiate a cultural change toward reproducibility of our work.

Infrastructure Operations. We operate active and passive measurement infrastructure to provide visibility into global Internet behavior, and associated software tools that facilitate network research and security and stability analysis for the community. We continued to support the IODA platform for outage detection, and the underlying Network Telescope that serves as a data source to this platform. With accessibility as a goal, we are creating APIs to access many of our data services. With sustainability as a goal, we are migrating our data processing platforms to an OpenStack environment with a Swift storage back end. We tried to maintain the Ark active measurement infrastructure and the MANIC congestion measurement system, although both ran out of funding this year. Unfortunately, we lost our backbone traffic monitor in January 2019 when the link was upgraded to 100 Gb, leaving our 10 Gb hardware incapable of capturing traces. These traces are by far our most popular data set in the research community; we will try to recover this capability in 2020, resources permitting.

New Projects. We began three new projects this year. The first is a collaboration with NPS to rigorously investigate, develop, and evaluate new strategies for large-scale IPv6 active mapping. This effort will include measurement strategies that can amplify topology measurement coverage by orders of magnitude; innovations in IPv6-specific algorithms to infer router-level topologies; and analysis and remediation of security and privacy risks that our measurements reveal. The second new project will develop a platform to enable discovery of the full potential value of massive raw Internet end-to-end path measurement (traceroute) data sets. Finally, we began Phase I of a project under NSF's new Convergence Accelerator program to explore the feasibility of codifying an Open Knowledge Network (OKN) about properties of the Internet identifier system - the domain names and addresses that represent communication entities - and the rich structural relationships among these entities. We will have more to report on these projects in 2020. The proposals for all of our funded projects are available on our web site.

We engaged in a variety of tool development, data sharing, and outreach activities, including maintaining web sites and publishing 18 peer-reviewed papers, 2 workshop reports, 30 presentations, and 6 blog entries. This report summarizes the status of our activities; details about our research are available in papers, presentations, and interactive resources on our web sites. We provide listings and links to software tools and data sets shared, and statistics reflecting their usage. Finally, we offer a "CAIDA in numbers" section: statistics on our performance, financial reporting, and supporting resources, including visiting scholars and students, and all funding sources.

CAIDA's program plan for 2018-2023 is available at www.caida.org/about/progplan/progplan2018/. Please feel free to send comments or questions to info at caida dot org. Please note the link to donate to CAIDA at the top of our web site; donations are tax-deductible, and because UC San Diego charges no overhead on them, they go 100% to research!


Research and Analysis


Internet Mapping

Studying the Evolution of Content Providers in IPv4 and IPv6 Internet Cores. The core of the Internet, formerly dominated by large transit providers, has been reshaped after the transition to a multimedia-oriented network, first by general-purpose CDNs and now by private CDNs. We used k-cores, an element of graph theory, to define which ASes compose the core of the Internet and to track the evolution of the IPv4 and IPv6 core since 1999. We demonstrated that content providers have taken a decisive role in the AS ecosystem, where seven large companies in the Internet content market have moved toward the core of the network. (Studying the Evolution of Content Providers in IPv4 and IPv6 Internet Cores, Computer Communications Journal)
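
To illustrate the k-core idea mentioned above, here is a minimal sketch of the standard "peeling" computation of core numbers on a toy AS-level graph. The AS numbers are hypothetical (taken from the private-use range) and the edge list is invented for illustration; it is not the data or the code used in the paper.

```python
from collections import defaultdict

def core_numbers(edges):
    """Return {node: core number} for an undirected graph given as edge pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            adj[a].add(b)
            adj[b].add(a)
    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel off the node with the smallest remaining degree.
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nbr in adj[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core

# Hypothetical topology: four fully meshed ASes, one AS attached to two
# of them, and one stub AS hanging off that attachment.
edges = [
    (64512, 64513), (64512, 64514), (64512, 64515),
    (64513, 64514), (64513, 64515), (64514, 64515),
    (64516, 64512), (64516, 64513),
    (64517, 64516),
]
cores = core_numbers(edges)
k_max = max(cores.values())
innermost = {asn for asn, k in cores.items() if k == k_max}
print(k_max, sorted(innermost))
```

In this toy graph the four fully meshed ASes form the innermost core, while the stub AS has core number 1. Tracking how an AS's core number changes across yearly snapshots is one way to express "moving toward the core".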

Tracking the deployment of IPv6. We used historical BGP data and recent active measurements to analyze trends in the growth, structure, dynamics and performance of the evolving IPv6 Internet, and compare them to the evolution of IPv4. Routing dynamics in the IPv6 topology are largely similar to those in IPv4, and churn in both networks grows at the same rate as the underlying topologies. Our measurements suggest that performance over IPv6 paths is now largely comparable to (or better than) that over IPv4 paths. (Tracking the deployment of IPv6: Topology, routing and performance, Computer Networks)

Towards Passive Analysis of Anycast in Global Routing. Anycast has been widely adopted by today's Internet services, including DNS, CDN, and DDoS protection, in which the same IP address is announced from distributed locations and clients are directed to the topologically-nearest service replica. Supporting researchers at the University of Delaware, we proposed and investigated a method based on passive measurements to infer anycast prefixes. (Towards Passive Analysis of Anycast in Global Routing: Unintended Impact of Remote Peering, CCR)

Geo-Locating BGP Prefixes. Geo-locating BGP prefixes can help us understand routing anomalies, prefix aggregation, or reveal what regions are affected by an Internet outage. We published a study showing that the naive approach to prefix geo-location -- simply mapping each IP address to its corresponding geo-location -- can be ambiguous because a prefix may contain another, separately-announced prefix that maps to a different geographical location. This work identified other issues with geo-location of prefixes, which paves the way towards more sophisticated applications such as the geo-location of autonomous systems. (Geo-Locating BGP prefixes, TMA)
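
The ambiguity described above can be shown with a small sketch using Python's standard ipaddress module. The prefixes come from documentation address space and the country codes are invented; the helper names are hypothetical and not part of any CAIDA tool.

```python
import ipaddress

# Hypothetical routing table: a covering prefix and a separately
# announced more-specific prefix, each mapped to a different location.
announced = {
    ipaddress.ip_network("198.51.100.0/24"): "US",
    ipaddress.ip_network("198.51.100.128/25"): "DE",
}

def longest_match(addr, table):
    """Return the most specific announced prefix containing addr, or None."""
    addr = ipaddress.ip_address(addr)
    matches = [net for net in table if addr in net]
    return max(matches, key=lambda net: net.prefixlen) if matches else None

def locations_inside(prefix, table):
    """Locations of the prefix itself and of every more-specific inside it."""
    return {loc for net, loc in table.items() if net.subnet_of(prefix)}

covering = ipaddress.ip_network("198.51.100.0/24")
low = longest_match("198.51.100.10", announced)
high = longest_match("198.51.100.200", announced)
print(announced[low], announced[high], locations_inside(covering, announced))
```

Mapping every address in the /24 to a location yields two answers, because the upper half of the block is routed by the more-specific /25. A prefix-level geo-location therefore has to decide whether addresses covered by more-specific announcements count toward the covering prefix.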


Performance Measurement

Figure. The system architecture of QFlow. (QFlow, ACM MobiHoc)
Figure. The overall architecture of the QUINCE measurement platform. (Quince, SIGCOMM Poster)

An Empirical Study of Mobile Network Behavior and Application Performance in the Wild. Monitoring mobile network performance is critical for optimizing the Quality of Experience (QoE) of mobile apps. We analyzed a two-year-long dataset collected by a crowdsourcing per-app measurement tool to gain new insights into mobile network behavior and application performance. We observed that only a small portion of WiFi networks worked in high-speed mode, and more than one-third of the observed ISPs still had not deployed 4G. For cellular networks, DNS settings on smartphones can have a significant impact on mobile app network performance. We proposed an automatic performance degradation detection and localization method for finding possible network problems in our huge, imbalanced and sparse dataset. Our evaluation and case studies show that our method was effective and the running time acceptable. (An Empirical Study of Mobile Network Behavior and Application Performance in the Wild, IWQoS)

QFlow. We considered the design, implementation, and evaluation of QFlow, a platform for reinforcement learning based edge network configuration. Working with off-the-shelf hardware and open source operating systems and protocols, we showed how to couple queueing, learning and scheduling to develop a system that is able to reconfigure itself to best suit the needs of video streaming applications. As our YouTube observations suggest, such a holistic framework that accounts for this entire chain can reveal efficiencies and interactions that a narrow focus on individual components of the system is incapable of achieving. We believe our system will be applicable in upcoming small cell wireless architectures such as 5G. (QFlow: A Reinforcement Learning Approach to High QoE Video Streaming over Wireless Networks, ACM MobiHoc)

QUINCE. We developed QUINCE, a QoE measurement platform, which uses a gamified approach to enable longitudinal study with repeated and varying measurements in a single platform. We leveraged existing Internet measurement data and infrastructures to integrate three different types of network and QoE measurements to yield a more comprehensive view of subjects. Our preliminary results show that QUINCE achieves a high level of engagement from subjects and collects data that is useful for correlating network performance and YouTube video streaming QoE. (QUINCE: A unified crowdsourcing-based QoE measurement platform, SIGCOMM Poster)


Monitoring Global Internet Security and Stability

Finding Correlated Internet Failures. We analyzed simultaneous disruptions of multiple addresses related by geography and ISP, indicative of a shared cause. Using binomial testing, we characterized groups of likely correlated disruptions, challenging conventional wisdom on how such outages affect Internet address blocks. (How to Find Correlated Internet Failures, PAM)
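
As a rough illustration of how a binomial test can flag correlated disruptions, the sketch below computes the probability of seeing at least k simultaneous disruptions among n addresses if each address failed independently with probability p. The group size, per-address probability, and counts are invented, and the paper's actual test procedure may differ.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more failures."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical group: 20 addresses in the same region and ISP, each with an
# assumed 1% chance of being disrupted in a given hour if failures were
# independent of one another.
n_addresses = 20
p_independent = 0.01

p_single = binomial_tail(1, n_addresses, p_independent)  # one address down
p_five = binomial_tail(5, n_addresses, p_independent)    # five down at once
print(p_single, p_five)
```

Under the independence assumption, one disrupted address in an hour is unremarkable, while five at once is very unlikely. A small tail probability of this kind is evidence that the disruptions share a cause.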

Figure. Mean hourly inflation in dropout probability by U.S. state for thunderstorm, rain, and snow. Large geographic regions can exhibit common behavior; northern states are more prone to failures in thunderstorms, midwestern states in rain, and southern states in snow. (Residential Links Under the Weather, SIGCOMM)

Outages in Residential Links Due to Weather. Investigating outages in residential networks due to weather is challenging because residential Internet is heterogeneous: there are different media types, different protocols, and different providers, in varying contexts of different local climate and geography. Sensitivity to these different factors leads to narrow categories when estimating how weather affects these different links. To address these issues, we performed a large-scale study looking at eight years of active outage measurements that were collected across the bulk of the last mile Internet infrastructure in the United States. (Residential Links Under the Weather, SIGCOMM)

Figure. MADDVIPR overview poster. (DHS C&I Showcase)

Outage detection for Internet Background Radiation. We proposed Chocolatine, which detects remote outages using Internet Background Radiation traffic. The underlying predictive methodology is based on SARIMA models. The method is easy to deploy, and the data are easy to collect, in most ISPs. Our method is tailored to seasonal data and is robust to noise, so it is applicable to many other data sources reflecting Internet activity. For example, we plan to experiment with deploying it on access logs of widely popular content, and its operational integration into CAIDA's IODA outage detection system is already in progress. (Chocolatine: Outage Detection for Internet Background Radiation, TMA)
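
The general idea of comparing observed traffic against a seasonal prediction can be sketched as follows. This is deliberately not SARIMA: it is a much simpler stand-in that predicts the next value as the median of past values at the same point in the cycle. The function names, the drop threshold, and the traffic counts are all assumptions made for illustration.

```python
from statistics import median

def seasonal_baseline(history, period):
    """Predict the next sample as the median of past samples at the same phase."""
    if len(history) < period:
        raise ValueError("need at least one full period of history")
    phase = len(history) % period      # position of the next sample in the cycle
    return median(history[phase::period])

def is_outage(observed, history, period, drop_fraction=0.5):
    """Flag an outage when traffic falls below a fraction of the prediction."""
    return observed < drop_fraction * seasonal_baseline(history, period)

# Hypothetical counts of unique source addresses seen per time bin,
# with a cycle of four bins repeated three times.
history = [10, 50, 80, 30,
           12, 52, 78, 31,
           11, 49, 82, 29]
expected = seasonal_baseline(history, period=4)
print(expected, is_outage(3, history, 4), is_outage(10, history, 4))
```

Using the median across cycles gives some tolerance to isolated noisy bins. A SARIMA model additionally captures trend and autocorrelation and yields prediction intervals, which this sketch does not attempt.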

Mapping DNS DDoS vulnerabilities to improve protection and prevention (MADDVIPR). With researchers from the University of Twente, Netherlands, our MADDVIPR project aims to comprehensively characterize DDoS attacks targeting the DNS, and vulnerabilities that impede resilience of the DNS in the face of such DDoS attacks. We presented a poster introducing the project in March. The University of Twente created and presented a new poster, "It's Time To Lie: The DNS TTL Mismatch Problem", reporting preliminary results at TMA.

BGP Hijacking Classification. We completed a set of methodologies for detecting BGP hijacking events, and co-authored a paper in collaboration with Stony Brook University, U. Mass-Amherst, and IIJ Research Lab presenting new methods to detect path manipulation attacks and misconfigurations, both leading to prefix hijacking. We also completed the prototype deployment of our Internet global monitoring system using these methods. (BGP Hijacking Classification, TMA).

Profiling BGP Serial Hijackers. We worked, in collaboration with MIT, on characterizing the behavior of "serial" BGP hijackers, providing insights about BGP hijacking events detected in the wild. Our work presents a solid first step toward identifying and understanding this important category of events, which can aid network operators in taking proactive measures to defend themselves against prefix hijacking and serve as input for current and future detection systems. (Profiling BGP Serial Hijackers: Capturing Persistent Misbehavior in the Global Routing Table, IMC)

Political use of Denial-of-service Attacks. We studied the political use of denial-of-service (DoS) attacks, a particular form of cyberattack that disables web services by flooding them with high levels of data traffic. Non-democratic governments employ DoS attacks to censor regime-threatening information, and activists use DoS attacks as a tool to publicly undermine the government's authority. Our results show that in authoritarian countries, elections increased the number of DoS attacks. However, these attacks did not seem to be directed primarily against the country itself but rather against other states that serve as hosts for news websites from this country. (At Home and Abroad: The Use of Denial-of-service Attacks during Elections in Nondemocratic Regimes, Journal of Conflict Resolution)

Deployment of Source Address Validation in the Internet. The Spoofer project has collected data on the deployment and characteristics of IP source address validation on the Internet since 2005. Data from the project comes from participants who install an active probing client that runs in the background. The client automatically runs tests both periodically and when it detects a new network attachment point. We analyzed the rich dataset of Spoofer tests in multiple dimensions: across time, networks, autonomous systems, countries, and by Internet protocol version. In our data for the year ending August 2019, at least a quarter of tested ASes did not filter packets with spoofed source addresses leaving their networks. We showed that routers performing Network Address Translation do not always filter spoofed packets, as 6.4% of IPv4 /24 blocks tested in the year ending August 2019 did not filter. Worse, at least two-thirds of tested ASes did not filter packets entering their networks with source addresses claiming to be from within their network that arrived from outside their network. We explored several approaches to encouraging remediation and the challenges of evaluating their impact. (Network Hygiene, Incentives, and Regulation: Deployment of Source Address Validation in the Internet, IMC)

Challenges in Inferring Spoofed Traffic at IXPs. We completed a study of a new method for inferring spoofed packets at IXPs. The method accounts for both epistemological and operational challenges, and we showed how it reveals inaccuracies in methods that are agnostic to AS relationship semantics. We also found that epistemological challenges remain. (Challenges in Inferring Spoofed Traffic at IXPs, CoNEXT)


Economics and Policy

Video: The Net Neutrality Debate. (Net Neutrality video series, SDSC)

Video tutorials on network neutrality and related topics. The San Diego Supercomputer Center published a series of instructional videos in which we explain network neutrality and the role and importance of measuring the Internet to inform public policy. The series includes short videos on 1) Net Neutrality, 2) The Death of Common Carriage, 3) Is My Internet Being Throttled?, 4) The Dirt Road Problem, and 5) The Hidden Cost of Free Internet. (video series on San Diego Supercomputer Center YouTube channel)

Regulation when platforms are layered. Drawing on the layered platform nature of the Internet ecosystem as described in our 2014 paper "Platform Models for Sustainable Internet Regulation", we explored how this model could help scope the duties for an agency (or agencies) with sector-specific expertise. (Regulation When Platforms Are Layered, TPRC)

Threats to the Internet ecosystem. One foundational justification for regulatory intervention is that there are threats or harms occurring of a character that create a public interest in mitigating them. News headlines for the last few years suggest that the range of such harms is unbounded. We undertook an effort to comprehensively classify harms in the Internet ecosystem, hoping to facilitate conversations and development of ideas to help mitigate harms in a more systematic way, as opposed to fighting an endless defensive battle against whatever happens next. (Toward a Theory of Harms in the Internet Ecosystem, TPRC)

Encouraging Reproducibility in Scientific Research of the Internet. For several reasons, including the sensitive and/or proprietary nature of some Internet measurements, the networking research community pays limited attention to the reproducibility of results, tending to accept papers that appear plausible. A Dagstuhl seminar on Encouraging Reproducibility in Scientific Research of the Internet, held in October 2018, discussed challenges to improving reproducibility of scientific Internet research and developed a set of recommendations that the research community can undertake to initiate a cultural change toward reproducibility of our work. (Encouraging Reproducibility in Scientific Research of the Internet, Dagstuhl Reports)


Measurement Infrastructure and Data Sharing Projects


Platform for Applied Network Data Analysis (PANDA)

Figure. Planned architecture of the PANDA infrastructure.

For more than 20 years, CAIDA has developed many data-focused services, products, tools and resources to advance the study of the Internet. We have also spent years cultivating relationships across disciplines (networking, security, economics, law, policy) with those interested in CAIDA data, but the impact thus far has been limited to a handful of researchers. The current mode of collaboration simply does not scale to the exploding interest in scientific study of the Internet. To address this gap, we are integrating a number of existing measurement and analysis components previously developed by CAIDA. Our goal is to enable new scientific directions, experiments and data products for a wide set of researchers from four targeted disciplines: networking, security, economics, and public policy. The platform will employ efficient indexing and processing of terabyte archives, provide advanced visualization tools to show geographic and economic aspects of Internet structure, and support careful interpretation of displayed results.

In 2019, we tested the new integrated system by moving the AS Rank web application stack, the API, and the user interface services from a bare metal environment with data stored on local disk or NFS mounted storage to an environment with containerized applications running on virtualized machines with shared network and storage resources in a local cloud. We chose to migrate to the OpenStack environment with a Swift storage back end. We built it with sustainability as a goal; we plan to use this system not only for other software components for this project, but also as lasting infrastructure for other CAIDA software projects.

Archipelago. We continued our bordermapping measurements that detect network (AS) boundaries in traceroute measurements launched from the Ark nodes. We continued to expand Ark coverage and presence until February 2019 (when the dedicated CRI funding ended). As of February 2019, we had 247 active measurement nodes listed as enabled, 146 of which contributed to our team-probing experiment to measure all routed prefixes at a /24 granularity. By August 2019, the number of active nodes had dropped to 190, demonstrating how much care and feeding this sort of infrastructure needs. To increase sustainability, we designed a new web-based monitor deployment management application, built on the Python Flask platform. This app provides functions for tracking information about each monitor (e.g., hosting organization, location) and its lifecycle (e.g., monitor health, connectivity or data issues). These features will allow hosting sites to diagnose and service their own monitors when they experience temporary outages or hardware failures.

AS Rank. In 2019, we redesigned and reimplemented the AS Rank web and API services to support historical data, releasing AS Rank version 2 in August 2019.

BGPStream. We improved the BGPStream software framework based on community feedback. Most importantly, we updated the BGPStream libraries to support the new "RIS Live" streaming format that RIPE is using to stream data from its RIPE RIS routing infrastructure. We added native support for processing raw BGP data in BMP format. OpenBMP is an open source project that implements BMP protocol version 3, and allows sharing of BGP data from routers in a more systematic and complete way than current methods. We implemented a new high-level PyBGPStream API (prototype) and new interface to filters. We ported the HTTP API to Symfony 4. We switched the production version https://bgpstream.caida.org/ (including broker queries from libbgpstream) to the new OpenStack-based deployment environment.

Measurement and ANalysis of Internet Congestion (MANIC). We completed the MANIC home page and dashboard; the dashboard uses a Grafana front end and version 1 of the MANIC API. We released version 1 of the API, which gives programmatic access to the data, continued to improve it throughout the year, and began work on version 2.

MIDAR. We completed development of all components that use the MIDAR IPv4 alias resolution service. The MIDAR web API delivers access to MIDAR's functionality. The system relies on backend applications and databases that implement the job queue and handle execution of MIDAR. We completed considerable fault tolerance enhancements and refactoring of the MIDAR code. We improved documentation for the database we created to store the MIDAR-enabled ITDK IP address aliases and a standalone command-line tool, aliasq, for efficiently querying that database.

Periscope. We overhauled the Periscope service, which provides a unified interface to public Internet measurement infrastructure that allows traceroute and BGP queries. We released it for internal beta testing in April 2019 and enabled wider use of it toward the end of the year.

Spoofer. We continued to use Ark to support our Spoofer project. Ark monitors help measure the Internet's susceptibility to spoofed source address IP packets. We created and released a web-based RESTful API for the Spoofer service, which allows researchers to programmatically extract data from the Spoofer back-end database.

Vela. We continued development of Vela, a prototype system for executing on-demand measurements on the Ark platform, and for querying IPv4 address aliases. We provided Vela accounts to researchers and students for their projects. MIT graduate students used Vela to execute measurements inspired by Vern Paxson's 1997 paper on observed routes in the Internet (persistence, prevalence, symmetry) as coursework for a Computer Networks course. Iowa State researchers used Vela to conduct bulk ping measurements from five continents to Netflix prefixes to find servers geographically close to the measurement nodes but exhibiting high RTT.


IODA platform

Figure. A high-level view of the architecture of IODA.

Our infrastructure for detecting macroscopic Internet-edge outage events uses three data sources: Internet Background Radiation (IBR -- one-way unsolicited traffic generated by millions of Internet hosts worldwide), Border Gateway Protocol (BGP) update messages (used to exchange reachability information between Internet Service Providers), and active probing results that reveal the reachability of end-hosts. Fusing event signals extracted from these data sources increases IODA's overall accuracy and coverage. By analyzing how an event manifested itself across various data sources, we can investigate its potential underlying cause(s). The prototype IODA platform, with interactive dashboards, is accessible at ioda.caida.org.


UCSD Network Telescope

We maintain and continuously improve the UCSD Network Telescope measurement infrastructure to enable the study of Internet phenomena by monitoring and analyzing unsolicited traffic arriving at a globally routed underutilized /8 network. We enabled near-real-time data access for vetted researchers, which requires tackling challenges in storage, curation, and privacy-protected sharing of large volumes of data. In 2019 we moved all the "Daily Randomly and Uniformly Spoofed Denial-of-Service (RSDoS) Attack Metadata" and "Aggregated Flow Dataset" timeseries into OpenStack Swift, an object-based cloud storage system. We also store the most recent (30 days) pcap files containing the raw Network Telescope traffic in Swift. Users can either download data from Swift or access these data via the native Swift API. We implemented a new VM-based analysis platform, which gives users a dedicated VM to process telescope data.

We continued developing Corsaro, our open source software suite for capturing, processing, management, analysis, visualization, and reporting of collected Telescope data. In 2019 we released a new version (v3.0) of Corsaro which aims to be better suited to processing parallel traffic sources such as nDAG streams or DPDK pipelines. It includes new meta-data tagging modules related to spoofed traffic, erratic traffic components, IP geolocation, and AS lookup tagging. Corsaro was specifically designed to be used with passive traces captured by darknets, but the overall structure is generic enough to be used with any type of passive trace data.


Data

In the interests of reproducibility of our own work and to facilitate expanded scientific analysis of the research topics pursued, we invest significant effort to ensure that data we gather or derive from various raw data sources is available to other researchers. We list all available data sets, including legacy ones, on our CAIDA Data Overview page, and twice a year email our data users with updates and important news. In 2019, we added or improved some of the datasets as described below.

New and Improved Datasets

We added ITDK 2019-01 and ITDK 2019-04 to our ongoing collection of Macroscopic Internet Topology Data Kits (ITDK) that started in 2010 and now includes 18 Kits. The ITDKs contain two router-level topologies generated from the same IP-level topology based on data from the Ark IPv4 Routed /24 Topology Dataset. They also include an IPv6 router-level topology, assignments of routers to ASes, geographic locations of each router, and Domain Name Service (DNS) lookups of all observed IP addresses. These ITDKs utilize traceroutes not only from our Archipelago measurement infrastructure but also some traceroutes from the RIPE Atlas Internet measurement platform.

US backbone bidirectional traffic data. In January 2019 we took the last monthly trace on our 10 Gb link monitor in New York City. The link upgrade to 100 Gb left our 10 Gb hardware incapable of capturing traces. January 2019 traces are available online.

Peering DB is a public dataset that now includes the version 2 JSON files.

We released the data supplement for our paper Learning Regexes to Extract Router Names from Hostnames (IMC).

Data Collection

The graphs below show the cumulative amount of data accrued over the last several years by our primary data collection infrastructures, Archipelago and the UCSD Network Telescope. We are currently collecting about 5 TB of uncompressed data per day (more than 95% of which is Telescope data). In 2019 CAIDA captured about 13 TB of uncompressed topology traceroute data, and about 1.3 PB of Internet background radiation (IBR) traffic data.
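
As a quick consistency check on the figures above, the following sketch converts the 2019 annual totals into an average daily rate and a Telescope share. It assumes decimal units (1 PB = 1000 TB) and that the annual totals and the daily rate are measured the same way; both are assumptions, not statements from the report.

```python
TB_PER_PB = 1000            # assumption: decimal units

topology_tb = 13            # uncompressed topology traceroute data, 2019
ibr_tb = 1.3 * TB_PER_PB    # Internet background radiation data, 2019

total_tb = topology_tb + ibr_tb
avg_tb_per_day = total_tb / 365
telescope_share = ibr_tb / total_tb

print(round(avg_tb_per_day, 1), round(telescope_share, 3))
```

The implied yearly average is lower than the current rate of about 5 TB per day, which is what one would expect if the collection rate grew during the year. The Telescope share computed from the annual totals is consistent with the "more than 95%" figure.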

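To provide continuous and timely Internet topology data, the following categories of Ark measurements execute on an ongoing basis: IPv4 team probing, IPv4 prefix probing, IPv4 TSLP congestion measurements, IPv4 border mapping, and IPv6 topology measurements.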

[Figure: Archipelago measurements cumulative data capture] [Figure: UCSD Network Telescope capture]

Cumulative amount of data accrued over the years. Left panel shows uncompressed size of Ark topology measurements. Light green shading indicates the size of IPv4 team probing measurements, dark green -- the size of IPv4 prefix probing, blue -- IPv4 TSLP congestion, red -- IPv4 Border Mapping, purple -- IPv6 topology. Right panel shows compressed and uncompressed size of the UCSD Network Telescope raw data.

Data Distribution Statistics

There are two complementary ways that users can request access to CAIDA's data: through the CAIDA portal and through the Information Marketplace for Policy and Analysis of Cyber-risk and Trust (IMPACT) portal. Datasets shared through the CAIDA portal fall into two categories: public and by-request. Public datasets are available to users who agree to CAIDA's Acceptable Use Policy for public data. By-request datasets are available for use by academic researchers, US government agencies, and corporate entities who participate in CAIDA's membership program. Users requesting them provide a brief description of their intended use of the data, and agree to an Acceptable Use Policy.

Access to the CAIDA datasets through IMPACT is subject to the corresponding IMPACT Terms. These datasets are available for use by academic researchers, government agencies and corporate entities from DHS-Approved Locations (US, Canada, Australia, United Kingdom, Israel, Japan, the Netherlands, and Singapore).

The graphs below show the annual counts of unique visitors who downloaded CAIDA datasets (public, by-request, and IMPACT) and the total size of downloaded data. In 2019 we granted access to the CAIDA by-request and IMPACT datasets to more than 300 new users. Even though the number of users who downloaded Anonymized Internet traces slightly decreased (in comparison with 2018), the volume of downloaded data increased by nearly 10 TB. Also note that our last trace of this category was collected in January 2019 (see above). There is clearly an unmet demand for this type of data by the research community.

The volume of downloaded Ark topology data increased by about 17 TB. These statistics do not include dissemination of the Near-Real-Time Telescope datasets (raw traffic traces in pcap format, aggregated flow and daily RSDoS attack metadata). Users can analyze these datasets only on CAIDA computers and are not allowed to download them. Currently, about 30 days of the most recently collected raw telescope data are kept in Swift, our OpenStack object-based cloud storage.

[Figure: total request counts statistics for data] [Figure: download statistics for CAIDA data]

Data Distribution Statistics: Unique users downloading CAIDA data and volume of data downloaded annually. Multiple downloads of the same file by the same user, which are common, are counted only once.
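The counting rule in this caption can be illustrated with a minimal sketch. The log records, user names, file names, and sizes below are hypothetical and are not CAIDA's actual accounting code or data; the sketch only shows how repeated downloads of the same file by the same user would be counted once.

```python
# Hypothetical download log: (user, file name, size in bytes).
# None of these records come from CAIDA; they only illustrate the rule.
downloads = [
    ("alice", "trace-a.pcap.gz", 500),
    ("alice", "trace-a.pcap.gz", 500),  # repeat of the same file by the same user
    ("alice", "topo-b.warts.gz", 200),
    ("bob", "trace-a.pcap.gz", 500),
]


def summarize(records):
    """Return (unique users, total bytes), counting each (user, file) pair once."""
    seen = set()
    total_bytes = 0
    for user, name, size in records:
        key = (user, name)
        if key in seen:
            continue  # this user already downloaded this file; skip the repeat
        seen.add(key)
        total_bytes += size
    unique_users = {user for user, _ in seen}
    return len(unique_users), total_bytes


unique_users, volume = summarize(downloads)
print(unique_users, volume)
```

Under this rule the second download of the same file by the first user adds nothing to either the user count or the volume.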

[Figure: users' IPs and ASes geolocation]

Unique users downloading CAIDA data and corresponding ASes aggregated by country.

Publications using public and/or restricted CAIDA data (by non-CAIDA authors)

We know of a total of 104 publications in 2019 by non-CAIDA authors that used CAIDA data. We update this list as we learn of new publications. Some papers used more than one dataset. As of 2019 we are aware of 1846 papers with 1264 different authors in 95 countries. Please let us know if you know of a paper using CAIDA data that is not on our list: Non-CAIDA Publications using CAIDA Data.

[Figure: Number of papers by dataset] [Figure: Country of affiliation of authors of non-CAIDA papers using CAIDA data]

Impact of CAIDA data sharing: (a) Annual number of non-CAIDA publications using CAIDA data; (b) Country of affiliation of authors of non-CAIDA papers using CAIDA data.

Tools

CAIDA develops and maintains supporting tools for Internet data collection, analysis and visualization.

In 2019, we deployed AS Rank API version 2, moving to a GraphQL API that lets clients write queries that specify exactly which values they require and that can span multiple resources. We also made the first public release of BGPStream V2 (release candidate 2). The HiCube and MANIC platforms were also made available for preview in 2019.
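As a rough illustration of what a GraphQL request that names only the values it needs might look like, the sketch below builds a request body in Python. The endpoint URL, the `asn` query, its string argument, and the field names (`rank`, `organization`, `orgName`) are assumptions made for illustration; consult the AS Rank API documentation for the actual schema.

```python
import json
import urllib.request

# Assumed endpoint; verify against the AS Rank API documentation.
ASRANK_URL = "https://api.asrank.caida.org/v2/graphql"


def build_asn_query(asn, fields):
    """Build a GraphQL request body asking only for the listed fields of one AS.

    The query name and argument type are assumptions, not a confirmed schema.
    """
    selection = "\n    ".join(fields)
    query = (
        "query ($asn: String!) {\n"
        "  asn(asn: $asn) {\n"
        f"    {selection}\n"
        "  }\n"
        "}"
    )
    return {"query": query, "variables": {"asn": str(asn)}}


def send(payload, url=ASRANK_URL):
    """POST the request body as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Ask for the rank and, in the same request, a nested organization name
# (hypothetical field names).
payload = build_asn_query(3356, ["asn", "rank", "organization { orgName }"])
body = json.dumps(payload)
print(payload["query"])
```

The nested `organization { ... }` selection shows the point of the design: one request can pull values from a related resource, and the response would carry only the fields named in the query. Calling `send(payload)` would issue the request, assuming the endpoint and schema match.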

ARTEMIS: Neutralizing BGP Hijacking within a Minute. In 2019, in collaboration with FORTH/University of Crete, we released the open source software ARTEMIS V1, an implementation of our ARTEMIS: Neutralizing BGP Hijacking within a Minute methodology. We successfully piloted deployment of ARTEMIS at three network operators: Internet2, Great Plains Network, and Merit. During our collaboration with Internet2, ARTEMIS successfully detected a real prefix hijacking event affecting a /30 network of Internet2 within a few seconds.

The following chart and table list the tools that CAIDA has developed and currently supports, along with the number of external downloads of each (by unique IP address) during 2019.

[Figure: The number of times each tool was downloaded from the CAIDA web site in 2019.]
Tool | Description | Downloads
arkutil | RubyGem containing utility classes used by the Archipelago measurement infrastructure and the MIDAR alias-resolution system. | 333
Autofocus | Internet traffic reports and time-series graphs. | 249
BGPStream | Open-source software framework for live and historical BGP data analysis, supporting scientific research, operational monitoring, and post-event analysis. | 1,100
Chart::Graph | A Perl module that provides a programmatic interface to several popular graphing packages. | 162*
CoralReef | Measures and analyzes passive Internet traffic monitor data. | 135
Corsaro | Extensible software suite designed for large-scale analysis of passive trace data captured by darknets, but generic enough to be used with any type of passive trace data. | 392
Cuttlefish | Produces animated graphs showing diurnal and geographical patterns. | 169
dbats | High performance time series database engine. | 110
dnsstat | DNS traffic measurement utility. | 220
iatmon | Ruby+C+libtrace analysis module that separates one-way traffic into defined subsets. | 84
iffinder | Discovers IP interfaces belonging to the same router. | 504
kapar | Graph-based IP alias resolution. | 342
libsea | Scalable graph file format and graph library. | 94
libtimeseries | Provides a high-performance abstraction layer for efficiently writing to time series databases. | 72
Marinda | A distributed tuple space implementation. | 228
MIDAR | Monotonic ID-Based Alias Resolution tool that identifies IPv4 addresses belonging to the same router (aliases) and scales up to millions of nodes. | 447
Motu | Dealiases pairs of IPv4 addresses. | 78
mper | Probing engine for conducting network measurements with ICMP, UDP, and TCP probes. | 104
otter | Visualizes arbitrary network data. | 344
plot-latlong | Plots points on geographic maps. | 123
plotpaths | Displays forward traceroute path data. | 74
rb-mperio | RubyGem for writing network measurement scripts in Ruby that use the mper probing engine. | 295
RouterToAsAssignment | Assigns each router from a router-level graph to its Autonomous System (AS). | 361
rv2atoms (including straightenRV) | A tool to analyze and process a Route Views table and compute BGP policy atoms. | 27
scamper | A tool to actively probe the Internet to analyze topology and performance. | 2,426
sk_analysis_dump | A tool for analysis of traceroute-like topology data. | 112
spoofer | Source address validation measurement program that measures susceptibility to spoofed source address IP packets. | 7,064
topostats | Computes various statistics on network topologies. | 141
Walrus | Visualizes large graphs in three-dimensional space. | 410
* Note: Chart::Graph is also available on CPAN.org. The number shown is direct downloads from caida.org only (statistics from CPAN not available).

Workshops

Workshop on Active Internet Measurement Systems (AIMS). In April, CAIDA hosted our annual Workshop on Active Internet Measurement Systems (AIMS) at the UC San Diego Supercomputer Center. This workshop series provides a forum for stakeholders in Internet active measurement projects to communicate their interests and concerns, and explore cooperative approaches to maximizing the collective benefit of deployed infrastructure and gathered data. An overarching theme this year was scaling the storage, indexing, annotation, and usage of Internet measurements. We discussed tradeoffs in use of commercial cloud services to make measurement results more accessible and informative to researchers in various disciplines. Other agenda topics included status updates on recent measurement infrastructures and community feedback; measurement of poorly configured infrastructure; and recent successes and approaches to evolving challenges in geolocation, topology, route hijacking, and performance measurement. (The 11th Workshop on Active Internet Measurements (AIMS-11) Workshop Report, CCR)

DUST. On September 9th-10th, 2019, we hosted the 2nd International Workshop on Darkspace and UnSolicited Traffic Analysis (DUST 2019) at the San Diego Supercomputer Center, UCSD, San Diego, California. The goal of the DUST workshop series is to bring together researchers, operators, and analysts interested in unsolicited traffic analysis, especially traffic destined to unassigned (dark) IP address space. In this workshop we introduced STARDUST, the project that aims at maintaining continued operation of the UCSD Network Telescope infrastructure and maximizing its utility to researchers from various disciplines.

WIE and KISMET. In December, CAIDA hosted the 10th interdisciplinary Workshop on Internet Economics (WIE). This workshop series provides a forum for researchers, Internet facilities and service providers, technologists, economists, theorists, policymakers, and other stakeholders to exchange views on current and emerging economic and policy debates. This year's meeting had a narrower focus than in years past, motivated by a new NSF-funded project being launched at CAIDA: KISMET (Knowledge of Internet Structure: Measurement, Epistemology, and Technology). The objective of the KISMET project is to improve the security and resilience of key Internet systems by collecting and curating infrastructure data in a form that facilitates query, integration and analysis. This project is a part of NSF's new Convergence Accelerator Phase I program, which seeks to support fundamental scientific exploration by creating partnerships across public and private sectors to solve problems of national importance (Workshop on Internet Economics (WIE-KISMET 2019) report, CCR).


CAIDA 2019 in Numbers

In 2019, CAIDA published 18 peer-reviewed papers (see below) and 2 workshop reports, made 30 presentations, and posted 6 blog entries. No technical reports were published this year. A complete list of presented materials is available on the CAIDA Presentations page. We also organized and hosted three workshops: AIMS 2019: Workshop on Active Internet Measurements, DUST 2019: 2nd International Workshop on Darkspace and UnSolicited Traffic Analysis, and WIE-KISMET 2019: Workshop on Internet Economics: Knowledge of Internet Structure: Measurement, Epistemology, and Technology. We provided logistical support in processing student travel grants for TPRC 47: Research Conference on Communications, Information and Internet Policy.

In 2019, our web site www.caida.org attracted 336,929 unique visitors, with an average of 1.96 visits per visitor, serving an average of 6.42 pages per visit.

During 2019, CAIDA employed 24 staff (researchers, programmers, data administrators, technical support staff), hosted 4 postdocs, 7 PhD students, 5 graduate students, and 11 undergraduate students.

We received $4.4M to support our research activities from the following sources:

[Figure: Allocations by funding source]
Funding Source | Amount ($) | Percentage
NSF | $2,798,477 | 63%
DHS | $800,000 | 18%
Gift & Members | $249,940 | 6%
Other | $563,428 | 13%
Total | $4,411,845 | 100%

Two views of historical funding allocations are shown below, presented by total amount received and by percentage based on funding source.


The charts below show CAIDA expenses, by type of operating expense and by program area:

[Figure: Operating Expenses]
Expense Type | Amount ($) | Percentage
Labor | $2,628,648.37 | 59%
Indirect Costs (IDC) | $1,338,385.50 | 30%
Professional Development | $2,756.40 | <1%
Supplies & Expenses | $92,937.26 | 2%
Workshop & Visitor Support | $55,700.70 | 1%
CAIDA Travel | $82,501.99 | 2%
Subcontracts | $155,822.47 | 4%
Equipment | $62,086.74 | 1%
Total | $4,418,839.43 | 100%
[Figure: Expenses by Program Area]
Program Area | Amount ($) | Percentage
Economics & Policy | $110,400.41 | 2%
Future Internet | $87,088.82 | 2%
Mapping & Congestion | $164,097.48 | 4%
Infrastructure | $1,686,440.13 | 38%
Security & Stability | $2,030,025.73 | 46%
Outreach | $121,403.14 | 3%
CAIDA Internal Operations | $8,799.25 | <1%
Total | $4,418,839.43 | 100%


Publications

(listed by primary topic area, but many cross multiple topics)

Supporting Resources

CAIDA's accomplishments are in large measure due to the high quality of our visiting students and collaborators. We are also fortunate to have financial and IT support from sponsors, members, collaborators, and the sites that host our monitors.

UC San Diego Students

  • Alex Gamero-Garrido, PhD student at UC San Diego
  • Chongyang Du, graduate student at UC San Diego
  • Zesen Zhang, PhD student at UC San Diego
  • Rui Yang, graduate student at UC San Diego
  • Gautam Akiwate, PhD student at UC San Diego

Visiting Scholars

  • Roderick Fanou, postdoc from IMDEA Networks
  • Raphael Hiesgen, PhD student at Hamburg University of Applied Sciences, Germany
  • Shinyoung Cho, PhD student at Stony Brook University
  • Rafael Almeida, graduate student at Universidade Federal de Minas Gerais, Brazil
  • Elverton Fazzion, PhD student at Universidade Federal de Minas Gerais, Brazil
  • Loqman Salamatian, graduate student at Sorbonne University, France
  • Ran Zhou, PhD student at Harvard University
  • Elena Dominguez-Rodriguez, researcher at RIPE NCC
  • Alexander Marder, postdoc from University of Pennsylvania
  • Ramakrishna Padmanabhan, postdoc from University of Maryland, College Park
  • Shuai Hao, postdoc from College of William and Mary

Funding Sources

Published