CAIDA's Annual Report for 2013

A report on CAIDA research initiatives, project progress and results, data sets, tool development, publications, presentations, workshops, web site statistics, funding sources, and operating expenses for 2013.

Mission Statement: CAIDA investigates practical and theoretical aspects of the Internet, focusing on activities that:

  • provide insight into the macroscopic function of Internet infrastructure, behavior, usage, and evolution,
  • foster a collaborative environment in which data can be acquired, analyzed, and (as appropriate) shared,
  • improve the integrity of the field of Internet science,
  • inform science, technology, and communications public policies.

Executive Summary

This annual report covers CAIDA's activities in 2013, summarizing highlights from our research, infrastructure, data-sharing and outreach activities. Our research projects span Internet topology, routing, traffic, security and stability, future Internet architecture, economics and policy. Our infrastructure activities support measurement-based Internet studies, both at CAIDA and around the world, with a focus on the health and integrity of the global Internet ecosystem.

We collect and share the largest Internet topology data sets (IPv4 and IPv6) available to academic researchers. We also curate and share many aggregated derivative data sets, including rankings of ISPs by customer cone based on (our inferred) business relationships between autonomous networks. Applying our improved alias resolution techniques for mapping IP addresses to physical routers, we collected, analyzed, curated and released our sixth published Internet Topology Data Kit (ITDK), based on measurements taken in April 2013. We also completed redesigning our AS relationship inference algorithm, and assembled the largest source of validation data for AS-relationship inferences to date. On the theoretical side, we gained additional insights into the structural dynamics of large-scale graphs that could support mathematically rigorous models of evolving complex networks.

We continued our involvement in the Named Data Networking project, a 12-university collaboration exploring a generalization of the Internet architecture that allows naming not just the endpoints, i.e., source and destination IP addresses, but rather the data (content) itself. By naming data instead of locations, the new architecture transforms data into a first-class entity while addressing several known technical challenges of today's Internet, including routing scalability, network security, content protection and privacy. We participated in NDN routing research, testbed operations, management duties, and provided web site support.

We also continued our empirical study of a more immediate architectural transition of the Internet -- to IPv6. To support IPv6 topology and performance analysis, in 2013 we developed, tested, and validated new methods to perform IPv6 address alias resolution at scale. We looked at the IPv4 address block transfer "grey market" and tested various techniques to infer likely IP address ownership transfers from publicly available data. We began a comparative analysis of BGP update churn in the IPv4 and IPv6 routing systems, demonstrating that routing dynamics are qualitatively similar in both address spaces. Finally, we created new annual IPv4 and IPv6 AS core visualizations using January 2013 raw Ark-based traceroute topology snapshots.

In the area of Internet security and stability, we are developing an operational capability to aggregate Internet measurement data from multiple available sources in order to detect, monitor, and characterize global connectivity disruptions due to political or catastrophic causes. We are also studying the volume, geographic origin, and characteristics of malicious and anomalous unsolicited traffic observed with the UCSD Network Telescope, a large IPv4 (/8) darkspace monitor.

Our economics research seeks to understand the structure and dynamics of the Internet ecosystem from an economic perspective, capturing relevant interactions between network business relations, internetwork topology, routing policies, and resulting interdomain traffic flow. We used modeling and simulation of interdomain network formation and peering selection strategies to analyze aspects of provider behavior such as the gravitation of Internet transit providers toward open peering. We pursued several other related projects: characterizing the Internet peering ecosystem using PeeringDB data; analyzing the 95th percentile transit billing mechanism and possible alternatives; and developing measurable proxy metrics for ISP size.

In collaboration with MIT CSAIL, we undertook new policy research in 2013, aiming to develop a model of industry dynamics that captures two durable and persistent features of today's telecommunications: the use of layered platforms to implement desired functionality; and interconnection between actors at different platform layers. We used multi-sided platform theory to explore several recent and impending industry innovations that have been naively conflated with the public Internet, and explored their differences.

We continued to support several measurement and data infrastructure projects. We made two improvements to Ark, our active measurement infrastructure. First, we enabled a web interface to request measurements from the Ark infrastructure. Second, we started a gradual transition from deploying Ark nodes as 1U PC hardware to using credit-card-sized Raspberry Pi devices instead. The much smaller form factor allowed us to substantially accelerate monitor deployment; by the end of 2013, we had increased the number of vantage points to 79 Ark monitors (34 are IPv6-capable and 31 are Pi-based) deployed in 35 countries.

We also maintained our passive traffic collection system known as the UCSD Network Telescope, which monitors a large volume of unsolicited traffic arriving at a globally routed underutilized /8 network. We released version 2.0 of Corsaro, a software suite for performing large-scale analysis of trace data. Although designed for use with passive traces captured by darknets, users can apply this software to process any type of passive trace data.

We continued to support three community-oriented projects related to data sharing: (1) DHS's Protected Repository for the Defense of Infrastructure Against Cyber Threats (PREDICT) to support protected sharing of security-related Internet data with researchers; (2) the Internet Measurement Data Catalog (DatCat), an index of information (metadata) about data sets and their availability under various usage policies; and (3) measurement guidance for the International Research Network Connections Program (IRNC).

Finally, as always, we engaged in a variety of tool development and outreach activities, including web sites, 12 peer-reviewed papers, 4 technical and workshop reports, 28 presentations, 9 blog entries, 4 workshops, and a seminar series. Details of our activities are below. CAIDA's program plan for 2010-2013 is available at https://www.caida.org/about/progplan/progplan2010/. We will be creating a new 4-year program plan in 2014. Please do not hesitate to send comments or questions to info at caida dot org.


Research Areas


Internet Topology Measurement, Analysis, and Modeling

Goals

CAIDA's long-term topology research agenda includes two strategic areas: 1) integrating macroscopic Internet topology measurements and analysis capabilities in both IPv4 and IPv6 address space (see below in the Exploring the evolution of IPv6 section) to create comprehensive annotated Internet topology maps; and 2) developing mathematically rigorous models of complex networks.

Activities

  1. We continued large-scale macroscopic topology measurements using our Archipelago (Ark) measurement platform. We completed the 6th calendar year of the IPv4 Routed /24 Topology Dataset collection, including automated DNS name lookups for discovered IP addresses. We created the annual IPv4 AS Core Graph using April 2013 Ark data.
  2. We published "Internet-Scale IPv4 Alias Resolution with MIDAR", which documents our methodology and software implementation for resolving millions of observed interfaces into routers (alias resolution) based on similarities in IP ID time series produced by coordinated probing of many IP addresses.
  3. We released a new Internet Topology Data Kit (ITDK), synthesizing the IPv4 Routed Topology Dataset and targeted alias resolution measurements conducted in April 2013. The April 2013 ITDK includes: two related router-level topologies; router-to-AS assignments; geographic location of each router; and DNS lookups of all observed IP addresses.
  4. We developed, implemented, and validated a method for discovering currently invisible IXP peering links by mining BGP communities used by IXP route servers to implement multilateral peering. In Inferring Multilateral Peering, we analyzed route server data juxtaposed with a mapping of BGP community values to infer 206K p2p links at 13 large European IXPs, four times as many p2p links as public BGP data reveal. The proposed technique uses only existing public BGP data sources, and does not require the deployment of additional vantage points.
  5. In AS Relationships, Customer Cones, and Validation, we published a new algorithm to infer business relationships between ASes using BGP path data. Unlike previous approaches, our algorithm does not assume the presence (or seek to maximize the number) of valley-free paths, instead relying on three assumptions about the Internet's inter-domain structure: (1) ASes enter into provider relationships in order to become globally reachable; (2) a peering clique of ASes exists at the top of the hierarchy; and (3) paths contain no cycles of p2c links. We assembled the largest source of validation data for AS-relationship inferences to date, validating over 30% of the AS graph. Using these inferred relationships, we evaluated three algorithms for inferring an AS customer cone, defined as the set of ASes an AS can reach using customer links. We demonstrated the utility of our algorithms for studying the rise and fall of large transit providers over the last 15 years as well as for investigating recent claims about the flattening of the AS-level topology and the decreasing influence of tier-1 ASes on the global Internet.
  6. We explored the vast realm of network models that can be roughly divided into equilibrium and nonequilibrium approaches. In the former, one studies equilibrium ensembles of graphs of a fixed size. In the latter, graphs grow, usually by adding nodes one at a time, introducing statistical dependencies. In our study Duality between equilibrium and growing networks, we showed that under certain conditions, there exists an equilibrium formulation for any growing network model, and vice-versa. Moreover, the equivalence between the equilibrium and nonequilibrium formulations is exact not only asymptotically but for any finite system size. These required conditions are satisfied in random geometric graphs in general and causal sets in particular, and to a large extent in some real networks.
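
The customer cone definition in item 5 above (the set of ASes an AS can reach using customer links) can be illustrated with a minimal Python sketch. It shows only the simplest recursive reading of that definition and is not necessarily any of the three algorithms the paper evaluated. The function name, the toy AS numbers, and the choice to count an AS as a member of its own cone are all assumptions made for illustration.

```python
from collections import defaultdict

def customer_cone(p2c_links, asn):
    """Return the set of ASes reachable from `asn` by following only
    provider-to-customer (p2c) links. Assumption: the cone includes
    `asn` itself."""
    # Index the links as provider -> set of direct customers.
    customers = defaultdict(set)
    for provider, customer in p2c_links:
        customers[provider].add(customer)

    cone = {asn}
    stack = [asn]
    while stack:
        current = stack.pop()
        for cust in customers.get(current, ()):
            if cust not in cone:  # visit each AS once, even if multihomed
                cone.add(cust)
                stack.append(cust)
    return cone

# Hypothetical toy topology: (provider, customer) pairs, not real AS data.
toy_links = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (6, 5)]

for asn in (1, 4, 6):
    print(asn, sorted(customer_cone(toy_links, asn)))
```

In this toy graph AS 4 is multihomed to ASes 2 and 3, so it appears once in the cone of AS 1. Ranking ASes by cone size then reduces to comparing `len(customer_cone(...))` across ASes.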

Outreach

  1. UCSD undergraduate students Jorge Landaverde, Adam Velasco, and Jonathan Yuan assisted CAIDA personnel with various tasks in topology measurement and analysis via the Research Experience for Undergraduates (REU) program.
  2. We hosted visiting graduate students Chiara Orsini (University of Pisa, Italy) and Pol Colomer de Simon (University of Barcelona, Spain).
  3. We organized and hosted the 5th Workshop on Active Internet Measurement (AIMS-5). The workshop report is available.
  4. We organized and hosted the Network Geometry (NetGeo) Workshop that brought together a small group of invited researchers to discuss their collaborative work-in-progress and future research directions in describing structural and dynamical properties of real networks via mathematical formalism developed for random geometric graphs in non-Euclidean spaces.
  5. From October 2010 through December 2013, CAIDA organized and hosted the UCSD Complex Network Seminar: Different Angles on Network Complexity, Engineering, and Science (DANCES). The seminar fostered communication and collaboration among junior and senior researchers (including UCSD graduate students and post-docs) studying networks in different disciplines: physics, biology and bioengineering, sociology, computer science, math, neuroscience, cognitive science, etc. It also provided young researchers with a forum to practice their presentation and communication skills.

Publications


Future Internet Research

Our research on the future of the Internet is currently focused on two primary areas: 1) participating in the Future Internet architecture project Named Data Networking (NDN); and 2) studying the deployment evolution of the Internet Protocol version 6 (IPv6).

Named Data Networking (NDN)

Goals

The main goals of the collaborative Named-Data Networking project include research, development, and testbed deployment of a new Internet architecture that replaces IP with a network layer routing directly on content names. By naming data instead of locations, this architecture aims to transition the Internet from its current reliance on "where" data is located (addresses and hosts) to "what" the data is (the content that users and applications care about). Our activities for this project included testbed participation, research on fundamentally new NDN-compatible routing protocols, and overall management support.

Activities

  1. We maintained a node on the national NDN testbed using the NDNx software that branched from the previous CCNX hub software. We also hosted a desktop computer configured with NDN-based video and audio software (provided by UCLA Center for Research in Engineering, Media, and Performance) to support team experiments and testing of instrumented environments, participatory sensing, and media distribution via the NDN infrastructure.
  2. In 2010, we published a description of the modified greedy forwarding algorithm (MGF) that uses a concept of a hyperbolic metric space underlying complex networks to enable efficient greedy forwarding without any global knowledge of the network topology. In 2013 we measured the performance metrics for the MGF algorithm on the NDN testbed, both for the full graph of participating sites and for all graphs obtained from the full graph by removing one link without disconnecting the full graph. We found that MGF in the resulting networks was efficient and resilient with regard to link removals, increasing our confidence that it could potentially provide highly efficient forwarding if used by the NDNx software stack.
  3. We provided overall management support to the NDN project spread across 11 participating institutions. We hosted and maintained the internal NDN project Wiki, and assisted in the development and maintenance of the Named Data Network site based at UCLA.
  4. We hosted the 4th NDN Project retreat in November 2013.
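
The greedy forwarding idea in item 2 above can be sketched in a few lines of Python. This is plain greedy forwarding, which gives up at a local minimum. It does not implement the modifications that define MGF, and it is not the code used on the testbed. The node names, coordinates, and adjacency lists are hypothetical. The distance function is the standard hyperbolic law of cosines for polar coordinates (r, theta) in a plane of curvature -1.

```python
import math

def hyperbolic_distance(a, b):
    """Distance between points a = (r, theta) and b = (r, theta) in the
    hyperbolic plane of curvature -1 (hyperbolic law of cosines)."""
    (r1, t1), (r2, t2) = a, b
    x = (math.cosh(r1) * math.cosh(r2)
         - math.sinh(r1) * math.sinh(r2) * math.cos(t1 - t2))
    return math.acosh(max(1.0, x))  # clamp guards against rounding below 1

def greedy_forward(coords, adjacency, src, dst, max_hops=64):
    """Forward hop by hop to the neighbor closest to dst, using only the
    coordinates of the current node's neighbors and of the destination.
    Returns (path, delivered)."""
    path = [src]
    current = src
    while current != dst and len(path) <= max_hops:
        best = min(adjacency[current],
                   key=lambda n: hyperbolic_distance(coords[n], coords[dst]))
        if (hyperbolic_distance(coords[best], coords[dst])
                >= hyperbolic_distance(coords[current], coords[dst])):
            return path, False  # local minimum: no neighbor is closer
        path.append(best)
        current = best
    return path, current == dst

# Hypothetical four-node network: hub H at the origin, leaves A, B, C.
coords = {"A": (3.0, 0.0), "H": (0.0, 0.0), "B": (3.0, math.pi), "C": (3.0, 0.1)}
adjacency = {"A": ["H", "C"], "H": ["A", "B"], "B": ["H"], "C": ["A"]}

print(greedy_forward(coords, adjacency, "A", "B"))
print(greedy_forward(coords, adjacency, "C", "B"))
```

The second call fails because C's only neighbor, A, is farther from B than C is. Handling such local minima more gracefully is the kind of case a modified greedy scheme is meant to address.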

Exploring the evolution of IPv6: topology, performance, and traffic

Goals

CAIDA aims to measure the evolution of IPv6 in three areas: topology, traffic, and performance. Our goal is to uncover characteristics of current IPv6 deployment that can be used to infer how to advance IPv6 deployment, either via technical capability or policy development. Some of our IPv4 topology work serves as a baseline to support IPv6 topology analysis.

Activities

  1. We completed the 5th full calendar year of the IPv6 Topology Dataset collection and created new annual IPv4 and IPv6 AS Core Graph visualizations using January 2013 Ark data.
  2. We developed and validated a fingerprint-based Internet-scale IPv6 alias resolution technique that utilizes induced fragmented responses from IPv6 router interfaces. In IPv6 Alias Resolution via Induced Fragmentation, we demonstrated the accuracy of this technique in a controlled environment and on a small subset of the production IPv6 Internet for which we have ground-truth. In Speedtrap: Internet-Scale IPv6 Alias Resolution, we described the design, implementation, and validation of this alias resolution technique and produced router-level Internet IPv6 topologies using measurements collected by CAIDA's distributed infrastructure.
  3. As Internet RIRs have begun to ration IPv4 allocations, IPv4 transfer markets have emerged as a new mechanism to acquire IPv4 addresses, but have largely flown under the radar. In A First Look at IPv4 Transfer Markets, we made an attempt to characterize the transfer market using data from three RIRs and BGP data from the Routeviews and RIPE repositories: the types of players involved, the sizes and characteristics of transferred address blocks, and the visibility of transferred address blocks in the routing table before and after the transfer. We also described additional data sources and analysis techniques that may help shed light on the currently opaque processes of many IPv4 address block transfers.
  4. In the mid 2000s there was concern in the research and operational communities over the scalability of BGP, the Internet's interdomain routing protocol. The worry was that update churn (the number of protocol messages exchanged during route changes) was growing too fast for routers to handle. Recent work somewhat allayed those fears, showing that update churn grows slowly in IPv4, but the question of IPv6 routing scalability remained. We developed a model that expresses BGP churn in terms of four measurable properties of the routing system and showed that the number of updates normalized by the size of the topology is constant, and that routing dynamics are qualitatively similar in IPv4 and IPv6. We found that the exponential growth of IPv6 churn is entirely expected, as the underlying IPv6 topology is also growing exponentially.

Outreach

  1. We hosted graduate student Ioana Livadariu (Simula Research Laboratory, Norway), who collaborated with CAIDA researchers in developing a quantitative model for IPv6 adoption.

Publications


Security and Stability

Goals

In the area of Internet security and stability, we develop new methods of analysis and aggregation of Internet measurement data from multiple available sources that shed light on Internet security related events, including global connectivity disruptions due to political or catastrophic causes. Our goal is to make use of our methodology and findings as the basis for automated early-warning detection systems for large-scale Internet outages. We also study in depth the one-way unsolicited traffic collected on the UCSD Network Telescope, a large IPv4 (/8) darkspace monitor, seeking to characterize malicious and anomalous activities ubiquitously present in the Internet.

Activities

  1. We studied the effects of "Patch Tuesday" events (Microsoft's release of accumulated security patches on the 2nd Tuesday of each month) on the volume and characteristics of malicious and unwanted traffic. In The Day After Patch Tuesday: Effects Observable in IP Darkspace Traffic, we described a significant increase in the number of active hosts observed at our darkspace monitor the day after Patch Tuesday in each of the six months that we examined (while there were no significant changes in the overall traffic volume). Our results suggest that the observable effects of Patch Tuesday merit further investigation, and potential tuning of sampling methods toward activity periods that likely contain more interesting information (i.e., many new malicious sources) than other time periods.
  2. We have shown previously that analysis of unsolicited network traffic (mostly generated by malicious software) can be used to identify large-scale disruptions of connectivity at an Autonomous System (AS) granularity. In Gaining Insight into AS-level Outages through Analysis of Internet Background Radiation we explored what metrics of this spurious traffic may shed light on the causes of macroscopic connectivity disruptions. We considered metrics indicating packet loss (e.g., due to link congestion) along a path from a specific AS to our observation point, and conducted three case studies to illustrate how our metrics can help identify packet loss characteristics of an outage. These metrics could make up the diagnostic component of a semi-automated system for detecting and characterizing large-scale outages.
  3. We studied the correlation between country governance regimes and the reputation of their Internet address allocations. We used two qualitative measures of national governance: the Corruption Perceptions Index and the Democracy Index. To represent the reputation of a country's Internet, we used its number of blacklisted IPv4 addresses on well-known blacklists. Our results confirm that countries with more transparent, democratic governmental institutions harbor a smaller fraction of misbehaving (blacklisted) hosts.
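
Selecting the observation windows for the Patch Tuesday analysis in item 1 above requires the date of the second Tuesday of each month and the day after it. The Python sketch below shows one way to compute those dates with the standard library. The function names are hypothetical, and the loop over the first half of 2013 is an illustrative choice, not the six months the study examined.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() numbers Monday as 0, so Tuesday is 1

def patch_tuesday(year, month):
    """Return the second Tuesday of the given month."""
    first = date(year, month, 1)
    days_to_first_tuesday = (TUESDAY - first.weekday()) % 7
    return first + timedelta(days=days_to_first_tuesday + 7)

def day_after_patch_tuesday(year, month):
    """Return the Wednesday following Patch Tuesday."""
    return patch_tuesday(year, month) + timedelta(days=1)

# Illustrative months only; the study's actual months are not assumed here.
for month in range(1, 7):
    print(patch_tuesday(2013, month), day_after_patch_tuesday(2013, month))
```

The resulting dates could then be used to pick which days of darkspace traffic to compare against the rest of the month.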

Outreach

  1. Karyn Benson, a UCSD graduate student, continued analyzing telescope data as part of her thesis work. Graduate student Andreas Reuter (University of Berlin, Germany) developed software based on RTRlib to correlate BGP measurements with information from the RPKI and analyze the correlations in the context of potential BGP hijacks.
  2. We hosted visiting scholar Tanja Zseby (FOKUS - Fraunhofer Institute for Open Communication Systems, Berlin, Germany), who analyzed the Patch Tuesday effects and helped curate two educational data kits from the UCSD Network Telescope data for future publication.

Publications


Economics and Policy

Goals

Our economics research aims to understand the structure and dynamics of the Internet ecosystem from an economic perspective, capturing relevant interactions between network business relations, internetwork topology, routing policies, and resulting interdomain traffic flows. On the policy side, we strive to respond to requests from government agencies and policymaking bodies for comments and positions that empirically inform industry tussles and telecommunication policies. We also provide expertise on ethical issues pertinent to information and communication technology research.

Activities

  1. In collaboration with researchers at Georgia Tech, we developed an agent-based model simulating the peer selection process in the Internet at the AS level. This model, GENESIS-CBA, is based on realistic constraints and provider selection mechanisms, with an assumption that ASes act in a myopic and decentralized manner to optimize a cost-related fitness function. We also introduced a new peering scheme, Cost-Benefit-Analysis, which, in contrast to existing peering strategies, gives ASes the ability to analyze the impact of each peering link on their economic fitness. Using this analysis, ASes can engage in only those peering relations that will likely have a positive impact on their fitness. As opposed to analytical game-theoretic models, which focus on proving the existence of equilibria, GENESIS-CBA is a computational model that simulates the network formation process and allows one to actually compute distinct equilibria (i.e., internetworks) and to examine the behavior of (and characteristics common to) oscillatory paths that do not converge.
  2. We designed and implemented ITMgen, a tool for generating synthetic but representative interdomain traffic matrices (ITMs). ITMgen works at the connection level, taking into account the relative sizes of ASes, their popularity with respect to various applications, and the relation between forward and reverse traffic for different application types. In ITMgen - A First-principles Approach to Generating Synthetic Interdomain Traffic Matrices, we described how to realistically parameterize application types and content popularity in the model by combining public sources like Alexa, which capture traffic trends at a macro level, with local traffic sampling (NetFlow, DPI), which provides finer-grained enhancements to the model. We demonstrated that we can synthesize ITMs that match real-world measurements more closely than the current state of the art. The modular design philosophy of ITMgen facilitates integration of refinements to improve the accuracy of our existing implementation.
  3. We studied the Internet peering ecosystem using an online database (PeeringDB) where participating networks contribute information about their peering policies, estimated traffic volumes, and presence at various geographic locations. We have collected daily snapshots of the PeeringDB database since 2011 and used BGP data to cross-validate the representativeness and accuracy of PeeringDB data. We measured correlations between reported network properties: BGP-advertised address space, geographical presence, peering policies, and approximate traffic volume. Our analysis revealed evolutionary trends of the peering ecosystem, including geographic expansion and contraction by players, increases and decreases in traffic volume, and recent shifts toward more restrictive peering (a reversal of a trend).
  4. The 95th percentile billing mechanism has been an industry de facto standard for transit providers for well over a decade. While the simplicity of the scheme makes it attractive as a billing mechanism, dramatic evolution in traffic patterns, associated interconnection practices and industry structure over the last two decades motivates an obvious question: is it still appropriate? We started evaluating the 95th percentile pricing mechanism from the perspective of transit providers, using a decade of traffic statistics from SWITCH (a large research/academic network), and more recent traffic statistics from three Internet Exchange Points. We found that in our data set, heavy-inbound and heavy-hitter networks are able to achieve a lower 95th-to-average ratio than heavy-outbound and moderate-hitter networks, perhaps due to their ability to better manage their traffic profile. The 95th percentile traffic volume also does not necessarily reflect the cost burden to the provider. We developed an alternative metric -- the provision ratio for a customer -- which better captures the costs a given customer imposes on a network.
  5. The dynamic nature of the telecommunications industry, with its rapidly changing technology and industry structure, presents a serious challenge to the theory and practice of regulation, which has a slower time scale and a tendency to embed assumptions about technology and industry into regulatory frameworks. In Platform Models for Sustainable Internet Regulation, we proposed a new model that attempts to capture two durable and persistent features of today's telecommunications ecosystem: the use of layered platforms to implement desired functionality; and interconnection between actors at different platform layers. Using modern theories of multi-sided platforms (MSPs) to focus on key technical and business aspects of today's industry, our MSP-aware layered model explored several recent and impending innovations that have been naively conflated with the global Internet, and illuminated their differences. We also illustrated the potential of our model as a baseline for future research by considering how it can help scope consistent policy discourse of three open questions: specialized services, minimum quality regulations (the "dirt road" problem), and structural separation.
  6. Researchers are faced with time-driven competitive pressures to research and publish, to achieve tenure, and to deliver on grant funding proposals. That ethical considerations can be incongruent with these incentives is neither novel nor unique to information and communication technology (ICT) research. To help ICT researchers understand and preempt or minimize the ethics risks in the lifecycle of their research, in 2012 we published the Menlo Report that summarized a set of basic principles to guide the identification and resolution of ethical issues in research about or involving ICT. In 2013 we published a companion report, Applying Ethical Principles to Information and Communication Technology Research: A Companion to the Menlo Report, which details the Menlo Report principles and applications, and illustrates their implementation in real and synthetic case studies.
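
The 95th percentile billing mechanism and the 95th-to-average ratio discussed in item 4 above can be made concrete with a short Python sketch. It follows the common industry reading of the scheme: sort the traffic-rate samples for the billing period, discard the top 5%, and bill on the highest remaining sample. The function names and the two sample series are hypothetical, and the sketch does not attempt to reproduce the provision ratio metric, which the text does not define.

```python
def percentile_95(samples):
    """Billable rate under 95th percentile billing: the highest sample
    left after discarding the top 5% of samples."""
    ordered = sorted(samples)
    n = len(ordered)
    # Integer form of ceil(0.95 * n) - 1, avoiding floating-point rounding.
    index = (95 * n + 99) // 100 - 1
    return ordered[index]

def ratio_95_to_average(samples):
    """Ratio of the 95th percentile sample to the mean sample."""
    return percentile_95(samples) / (sum(samples) / len(samples))

# Hypothetical traffic rates (Mbps), 100 samples each.
steady_ramp = list(range(1, 101))        # rates of 1, 2, ..., 100
short_bursts = [10] * 96 + [1000] * 4    # bursts in under 5% of samples

for name, series in (("ramp", steady_ramp), ("bursty", short_bursts)):
    print(name, percentile_95(series), round(ratio_95_to_average(series), 3))
```

In the bursty series the four large samples fall entirely within the discarded top 5%, so the billable rate stays at the baseline even though the bursts dominate the average. This is one simple way to see why the 95th percentile volume need not track the load a customer places on a provider.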

Outreach

  1. UCSD undergraduate students Carlos Garibay and Andre Gatorano assisted CAIDA personnel with various tasks in Internet economics research via the Research Experience for Undergraduates (REU) program.
  2. Natalie Larson, a UCSD graduate student, worked with CAIDA researchers on the PeeringDB data analysis.
  3. We hosted intern Vamseedhar Reddyvari (Texas A&M University, College Station, TX), who analyzed the properties of the 95th percentile pricing mechanism.
  4. We organized and hosted the 4th Workshop on Internet Economics (WIE 2013). The workshop report is available.
  5. We co-organized the Cyber-security Research Ethics Dialogue & Strategy Workshop (CREDS 2013) co-located with the 34th IEEE Symposium on Security and Privacy. The workshop report is available.

Publications


Infrastructure Projects


Archipelago (Ark)

Archipelago (Ark) is CAIDA's active measurement infrastructure, which enables large-scale Internet measurements, while reducing the effort needed to develop, deploy and conduct sophisticated experiments.

Activities

  1. We are gradually transitioning from using 1U hardware as Ark monitors to credit-card-sized Raspberry Pi devices that cost approximately $35 each. In 2013, we deployed/replaced 19 monitors (all of them Raspberry Pis) and increased the number of vantage points to 79, including 34 IPv6-capable and 31 Pi-based Ark nodes, deployed in 35 countries.
  2. Upon a suggestion from NSRC we developed a one-page brochure for potential hosting sites explaining the benefits and significance of the Ark project to the research community.
  3. CAIDA now has a web interface to request measurements on the Ark infrastructure, with accounts made available to researchers upon request. This interface allows topology measurements on demand, for example, a request that all (or a given subset of) Ark monitors perform RTT and traceroute measurements to a single given destination.
  4. We continued improving our measurement techniques and analysis methodologies for alias resolution inferences. We released an update to arkutil, a RubyGem containing various utility classes used by the Ark measurement infrastructure and the MIDAR alias resolution system.
  5. We also continued support for the spoofer experiment (collaboration with Robert Beverly, NPS).
  6. We maintain a mailing list of researchers using Ark data and regularly email them with updates and important news about the data.

UCSD Network Telescope

We develop and maintain a passive data collection system known as the UCSD Network Telescope to study security related events on the Internet by monitoring and analyzing unsolicited traffic arriving at a globally routed underutilized /8 network. We are seeking to maximize the research utility of these data by enabling near-real-time data access to vetted researchers and solving the associated new challenges in flexible storage, curation, and privacy-protected sharing of large volumes of data.

Activities

  1. We released version 2.0 of our software suite Corsaro for capture, processing, management, analysis, visualization and reporting on data collected with the UCSD Network Telescope. Although designed primarily for performing large-scale analysis of passive trace data captured by darknets, this software can be used to process any type of passive trace data.
  2. We began sharing the Near-Real-Time Network Telescope Dataset that includes: (i) the most recent 60 days of raw telescope traffic (in pcap format); and (ii) aggregated flow data for all telescope traffic since February 2008 (in Corsaro FlowTuple format).
  3. We released the Industry Evaluation Near-Real-Time Network Telescope Dataset to allow industry researchers and developers to evaluate the utility of the telescope data before they decide whether to sponsor CAIDA for access to additional data. This data set contains 24 compressed pcap files, each containing one hour of data, and 24 compressed Corsaro flow files containing aggregated flow information in Corsaro FlowTuple format.
  4. We began to curate two educational data kits from the telescope data to use as a teaching resource for computer science undergraduate and graduate students. The first kit will include all Telescope data collected in April 2012 and shed light on the effects of Microsoft Patch Tuesday activities. The second kit will include 17 days of data collected in January-February 2011 and demonstrate the possibility of using darknet unsolicited traffic for characterization of large-scale Internet outages. The data in both kits will be fully anonymized and the kits will be publicly downloadable.
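
The pcap files mentioned in items 2 and 3 can be inspected with very little code. The following is a minimal sketch, not CAIDA's tooling: it assumes the classic libpcap file layout (24-byte global header, 16-byte per-packet record header, microsecond timestamps) and assumes that compressed files use gzip. The function names are hypothetical and chosen for illustration only.

```python
import gzip
import io
import struct

PCAP_MAGIC_USEC = 0xA1B2C3D4  # classic pcap magic number (microsecond timestamps)


def summarize_pcap(stream):
    """Return (packet_count, captured_bytes) for a classic pcap byte stream."""
    header = stream.read(24)  # global header is 24 bytes
    if len(header) < 24:
        raise ValueError("truncated pcap global header")
    # The magic number reveals the byte order used by the writer.
    for endian in ("<", ">"):
        (magic,) = struct.unpack(endian + "I", header[:4])
        if magic == PCAP_MAGIC_USEC:
            break
    else:
        raise ValueError("not a classic microsecond-resolution pcap stream")

    packets = 0
    captured = 0
    while True:
        record = stream.read(16)  # ts_sec, ts_usec, incl_len, orig_len
        if len(record) < 16:
            break  # clean end of file (or a truncated trailing record)
        _ts_sec, _ts_usec, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            break  # truncated packet body: stop counting
        packets += 1
        captured += incl_len
    return packets, captured


def summarize_pcap_file(path):
    """Open a plain or gzip-compressed pcap file (assumption: .gz suffix means gzip)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        return summarize_pcap(handle)
```

Calling `summarize_pcap_file` on an hourly file (the path is whatever the data set provides) yields a quick packet count and captured-byte total before committing to a heavier analysis with a tool such as Corsaro.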

Outreach

UCSD undergraduate students Jeffrey Syang and Florence Yu assisted CAIDA personnel with various tasks for network telescope infrastructure development via the Research Experience for Undergraduates (REU) program.


Data Sharing for Security / PREDICT

The Department of Homeland Security project Protected Repository for the Defense of Infrastructure Against Cyber Threats (PREDICT) provides vetted researchers with current network security-related data in a disclosure-controlled manner that respects the security, privacy, legal, and economic concerns of Internet users and network operators. CAIDA supports PREDICT goals as Data Provider and Data Host and also plays an advisory role in developing technical, legal, and practical aspects of PREDICT policies and procedures.

Activities

  1. We collected, hosted, and provided the following current Internet Topology data to PREDICT: (i) Internet Topology measured from Ark Platform (IPv4 Routed /24 Topology, IPv4 Routed /24 DNS Names, IPv6 Topology); and (ii) Internet Topology Data Kits (ITDKs).
  2. We collected, hosted, and provided UCSD Near-Real-Time Network Telescope Data to qualified researchers through PREDICT.
  3. We continued to host and share the legacy Internet topology skitter data as well as various archived Telescope data sets and added a newly curated Patch Tuesday Network Telescope Dataset.
  4. We collected, curated, and released the CAIDA Anonymized Internet Traces 2013 dataset. It contains anonymized passive traffic traces from two high-speed monitors on commercial Internet backbone links and is available to academic and government researchers by request.
  5. To support corporate research, we curated two new industry evaluation datasets: (i) the CAIDA Anonymized Industry Evaluation Internet Traces Dataset; and (ii) the CAIDA Industry Evaluation Near-Real-Time Network Telescope Dataset. These datasets contain samples of the corresponding full data sets and are intended to demonstrate the potential usefulness of CAIDA data for industrial research, thus justifying the cost of CAIDA membership for corporate sponsors.

DatCat: Internet Measurement Data Catalog

We continued development and refinement of the Internet Measurement Data Catalog (IMDC, or DatCat) -- an index of information (metadata) about data sets and their availability under various usage policies. DatCat addresses a significant challenge in network science: reducing the cost of searching for data by organizing metadata about existing Internet data sets into a single repository. We cataloged metadata for several datasets into DatCat including over 40 DHS S&T PREDICT sponsored datasets to demonstrate the potential for DatCat to be used by other projects seeking to index data sets. We also seeded the DatCat Forum with some discussion groups covering general items of interest for DatCat users, dataset specifics, and dataset request issues.


Sustainable data-handling and analysis methodologies for the IRNC networks

The NSF International Research Network Connections (IRNC) program has funded five ProNet (production network) projects to provide network connections linking U.S. research networks with peer networks in other parts of the world, as well as several special projects that primarily addressed measurement and monitoring of operational networks. Our special project supported the IRNC community's measurement efforts by fostering and leading discussion of how best to make IRNC data and statistics available, and by adapting CAIDA measurement technologies to IRNC community needs.

Activities

  1. In January 2013, PI kc claffy presented IRNC-SP: Sustainable data handling and analysis methodologies for the IRNC networks to the NSF IRNC PI Meeting. She also led a discussion of existing measurement capabilities of ProNet sites, other measurement activities funded by the IRNC program, and potential opportunities for collaboration in pursuit of greater empirical visibility into the IRNC infrastructure's operations, usage, and value. We summarized these discussions in a technical report, International Research Network Connections: Usage and Value Measurement.
  2. We worked with IRNC ProNet PIs to define a network topology data format using YAML, and to collect and populate a file with information describing various ProNet links. We produced a prototype interactive visualization that depicts the layer-two links between the IRNC-funded routers at a city granularity. The interface allows the user to select the router nodes or links to display additional information.
  3. We maintained an IRNC Wiki page serving as a collection point for IRNC related activities. (Funding for this project ended in 2013.)

Tools

In the course of our funded research and infrastructure projects, we regularly develop tools for Internet data collection, analysis and visualization, and make these tools available to the community. The following table lists all CAIDA-developed, currently supported tools (note that we do not receive specific funding for tool support and maintenance) and the number of downloads of each tool during 2013.

Tool | Description | Downloads
arkutil | RubyGem containing utility classes used by the Archipelago measurement infrastructure and the MIDAR alias-resolution system. | 112
Autofocus | Internet traffic reports and time-series graphs. | 324
Chart::Graph | A Perl module that provides a programmatic interface to several popular graphing packages. | 228*
CoralReef | Measures and analyzes passive Internet traffic monitor data. | 499
Corsaro | Extensible software suite designed for large-scale analysis of passive trace data captured by darknets, but generic enough to be used with any type of passive trace data. | 246
Cuttlefish | Produces animated graphs showing diurnal and geographical patterns. | 160
dnsstat | DNS traffic measurement utility. | 231
iatmon | Ruby+C+libtrace analysis module that separates one-way traffic into defined subsets. | 108
iffinder | Discovers IP interfaces belonging to the same router. | 349
libsea | Scalable graph file format and graph library. | 182
kapar | Graph-based IP alias resolution. | 175
MIDAR | Monotonic ID-Based Alias Resolution tool that identifies IPv4 addresses belonging to the same router (aliases) and scales up to millions of nodes. | 270
Motu | Dealiases pairs of IPv4 addresses. | 110
mper | Probing engine for conducting network measurements with ICMP, UDP, and TCP probes. | 220
otter | Visualizes arbitrary network data. | 715
plot-latlong | Plots points on geographic maps. | 172
plotpaths | Displays forward traceroute path data. | 81
rb-mperio | RubyGem for writing network measurement scripts in Ruby that use the mper probing engine. | 290
RouterToAsAssignment | Assigns each router from a router-level graph to its Autonomous System (AS). | 216
rv2atoms (including straightenRV) | A tool to analyze and process a Route Views table and compute BGP policy atoms. | 107
scamper | A tool to actively probe the Internet to analyze topology and performance. | 462
sk_analysis_dump | A tool for analysis of traceroute-like topology data. | 162
topostats | Computes various statistics on network topologies. | 184
Walrus | Visualizes large graphs in three-dimensional space. | 1520
* Note: Chart::Graph is also available on CPAN.org. The number shown is direct downloads from caida.org only (statistics from CPAN not available).

Tool Development Activities

  1. We released version 0.13.5 of arkutil on June 24, 2013.
  2. We released Corsaro v2.0.0, which provides new plugin support, a geolocation framework, and full real-time data aggregation support, on November 21, 2013.
  3. We released a new version of kapar that fixed some bugs and added a mechanism for the unique identification of anonymous interfaces.
  4. We updated MIDAR to fix some bugs and added the capability to stop a run at the end of the current step.
  5. We updated Chart::Graph, last updated in 2007, so that its modules properly allow options after specified output types.

Data

Data Collection Statistics

In 2013, CAIDA captured the following raw data. The first number in parentheses shows the actual disk space used to store the compressed data; the second number is its uncompressed size.

  • traceroutes probing IPv4 address space collected by our Ark infrastructure (843.7 GiB/2.6 TiB), and traffic from reverse DNS lookups for discovered IPv4 addresses (11.6 GiB/45.1 GiB)
  • traceroutes probing IPv6 address space collected by a subset of IPv6-enabled Ark monitors (11.0 GiB/51.7 GiB)
  • passive traffic traces from the equinix-chicago and equinix-sanjose monitors connected to Tier-1 ISP backbone links at the Equinix facilities in Chicago, IL, and San Jose, CA (1.7 TiB/4.1 TiB)
  • passive darkspace traffic traces collected by our UCSD Network Telescope (45.0 TiB/110.5 TiB)
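
As a rough worked example of what these figures imply, the sketch below computes the expansion factor (uncompressed size divided by compressed size) for each collection. It uses only the rounded numbers listed above and binary units (1 TiB = 1024 GiB), so the results are approximate; the helper names are hypothetical.

```python
GIB_PER_UNIT = {"GiB": 1.0, "TiB": 1024.0}


def to_gib(value, unit):
    """Convert a (value, unit) size to GiB using binary units."""
    return value * GIB_PER_UNIT[unit]


def expansion_factor(compressed, uncompressed):
    """Uncompressed size divided by compressed size; each argument is (value, unit)."""
    return to_gib(*uncompressed) / to_gib(*compressed)


# (compressed on disk, uncompressed), copied from the list above.
collections_2013 = {
    "Ark IPv4 traceroutes": ((843.7, "GiB"), (2.6, "TiB")),
    "Ark IPv4 reverse DNS": ((11.6, "GiB"), (45.1, "GiB")),
    "Ark IPv6 traceroutes": ((11.0, "GiB"), (51.7, "GiB")),
    "Backbone passive traces": ((1.7, "TiB"), (4.1, "TiB")),
    "Telescope darkspace traces": ((45.0, "TiB"), (110.5, "TiB")),
}

for name, (compressed, uncompressed) in collections_2013.items():
    print(f"{name}: {expansion_factor(compressed, uncompressed):.1f}x")
```

Because the published sizes are rounded to one decimal place, the factors are indicative only, especially for the smaller collections.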

We supported the Day In The Life of the Internet (DITL) data collection campaign on 29 May 2013, and captured one-hour passive traces on both backbone monitors, which we distribute as part of the Anonymized Internet Traces 2013 data set.

We curated the following data sets from the above raw data:

We provided several academic researchers with access to our Near-Real-Time Network Telescope Dataset.

We developed and maintained the AS Rank web site and the related AS Relationships Dataset that use BGP data and empirical analysis algorithms developed by CAIDA researchers to infer business relationships between ASes.

We also continued to maintain and provide access to several legacy data sets that are no longer collected. We list all available data on our CAIDA Data page.

Data Distribution Statistics

  • Publicly Available Data

    These datasets require that users agree to the CAIDA Acceptable Use Policy for public data, but are otherwise freely available. The table lists the number of unique visitors and the total amount of data downloaded in 2013.

Dataset | Unique visitors (IPs) | Data Downloaded *
AS Rank | 12857 | 47.6 GiB
AS Relationships | 270 | 0.7 GiB
IPv4 Routed /24 AS Links | 727 | 37.0 GiB
IPv6 AS Links | 70 | 507.5 MiB
Skitter AS Links (AS Adjacencies) | 267 | 2.1 GiB
Skitter Router Adjacencies | 148 | 313.6 MiB
AS Taxonomy | 132 | 64.7 MiB **
Witty Worm Dataset | 514 | 704.3 MiB
Code-Red Worms Dataset | 78 | 6.7 GiB
Telescope Sipscan Data Supplement | 90 | 16.1 GiB
* We count the volume of data downloaded per unique user per unique file, so if a user downloads a file multiple times, we only count that file once for that user. This methodology significantly underestimates the total volume of data served through our data servers.
** Our AS Taxonomy dataset is mirrored at the Georgia Tech main AS Taxonomy site, so these downloads reflect only a fraction of this data set's popularity.
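
The counting rule in the first footnote (each file counted once per user) can be sketched as follows. This is an illustration under stated assumptions, not CAIDA's accounting code: the log is assumed to be a sequence of hypothetical `(user, filename, size_bytes)` records, and the function names are invented for this example.

```python
def deduplicated_volume(download_log):
    """Sum bytes, counting each (user, file) pair only once."""
    seen = set()
    total = 0
    for user, filename, size_bytes in download_log:
        key = (user, filename)
        if key in seen:
            continue  # repeat download of the same file by the same user
        seen.add(key)
        total += size_bytes
    return total


def raw_volume(download_log):
    """Sum bytes over every download, including repeats."""
    return sum(size_bytes for _user, _filename, size_bytes in download_log)


# Hypothetical log: user "a" fetches file1 twice, which counts only once.
example_log = [
    ("a", "file1", 100),
    ("a", "file1", 100),
    ("b", "file1", 100),
    ("a", "file2", 50),
]
print(deduplicated_volume(example_log), raw_volume(example_log))
```

The gap between the two totals is why the reported figures understate the volume actually served.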
  • Restricted Access Data

    These datasets require that users:

    • be academic or government researchers, or sponsor CAIDA;
    • request an account and provide a brief description of their intended use of the data; and
    • agree to an Acceptable Use Policy.

    The following table shows statistics about data requests and downloads for the restricted CAIDA datasets: the number of requests received, the number of users whose requests were granted, the number of users who actually downloaded data, and the amount of data downloaded by all users in 2013. We received about 149 more requests in 2013 than in 2012, and approved 88 more requests for access to restricted datasets. About 80% of the users granted access actually accessed our web servers to download data.

Dataset | Number of requests received | Number of requests granted | Unique visitors (usernames) | Data Downloaded *
Anonymized Internet Backbone Traces | 390 | 287 | 243 | 31.1 TiB
Backscatter Datasets | 45 | 33 | 32 | 1.3 TiB
Active Topology Trace Datasets | 162 | 118 | 116 | 10.9 TiB
Witty Worm Dataset | 21 | 15 | 13 | 94.7 GiB
DNS Root/gTLD server RTT Dataset | 9 | 7 | 5 | 64.1 MiB
DDoS Attack Dataset | 177 | 120 | 119 | 461.1 GiB
Telescope Datasets | 44 | 29 | 26 | 2.4 TiB
* We count the volume of data downloaded per unique user per unique file, so if a user downloads a file multiple times, we only count that file once for that user. This methodology significantly underestimates the total volume of data served through our data servers.
  • Publications using public and/or restricted CAIDA data (by non-CAIDA authors)

    We know of a total of 122 publications by non-CAIDA authors that used CAIDA data. Some of these papers used more than one dataset. A complete list of all papers can be found on our webpage for Non-CAIDA Publications using CAIDA Data.

Dataset | Number of papers
Anonymized Internet Backbone Traces | 46
DDoS Attack Dataset | 8
Backscatter Datasets | 4
Code-Red Worms Dataset | 1
Witty Worm Dataset | 4
Telescope Datasets | 2
Active Topology Trace Datasets (skitter and Ark) | 26
AS-relationships and AS rank | 39

CAIDA 2013 in Numbers

CAIDA researchers published 12 peer-reviewed publications and 4 reports.

For details, please see CAIDA papers.

CAIDA researchers presented their results and findings at Techs in Paradise (Honolulu, HI), AIMS Workshop (San Diego, CA), SIAM Conference (San Diego, CA), PAM (Hong Kong), Traffic Monitoring and Analysis Workshop (Turin, Italy), IMC (Barcelona, Spain), and CoNEXT (Santa Barbara, CA), as well as at various other workshops, seminars, and program and PI meetings. A complete list of presented materials is available on the CAIDA Presentations page.

CAIDA organized and hosted 3 workshops: Workshop on Active Internet Measurements (AIMS-5), Network Geometry Workshop (NetGeo), 4th Workshop on Internet Economics (WIE 2013); and co-organized the Cyber-security Research Ethics Dialog & Strategy Workshop (CREDS 2013) hosted at the 34th IEEE Symposium on Security and Privacy.

In 2013, our web site www.caida.org attracted 358,334 unique visitors, with an average of 1.94 visits per visitor.

As of the end of December 2013, CAIDA employed 18 staff, 4 postdocs, 2 graduate students, and 6 undergraduate students.

We received $1.97M to support our research activities from the following sources:

[Figure: Allocations by funding source]
Funding Source | Amount ($) | Percentage
NSF | 722,592 | 37%
DARPA | 166,874 | 8%
DHS | 774,995 | 39%
Gift & Members | 312,817 | 16%
Total | 1,977,278 | 100%
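
The percentages in the funding table follow directly from the amounts. As a small worked check, the sketch below recomputes the total and each source's share, rounded to the nearest whole percent; the function name is hypothetical and the amounts are copied from the table.

```python
funding_2013 = {
    "NSF": 722_592,
    "DARPA": 166_874,
    "DHS": 774_995,
    "Gift & Members": 312_817,
}


def percentage_shares(amounts):
    """Return (total, {source: share of total rounded to a whole percent})."""
    total = sum(amounts.values())
    shares = {name: round(100 * value / total) for name, value in amounts.items()}
    return total, shares


total, shares = percentage_shares(funding_2013)
print(f"Total: ${total:,}")
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The same function can be applied to the expense tables; there, rounded shares need not sum to exactly 100% because each row is rounded independently.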

The charts below show CAIDA expenses by Expense Type and Program Area:

[Figure: Expenses by expense type]
Expense Type | Amount ($) | Percentage
Labor | 1,962,545 | 60%
IDC | 1,087,497 | 33%
Supplies & Expenses | 79,874 | 2%
Travel | 58,075 | 2%
Workshop & Visitor Support | 42,384 | 1%
Equipment | 17,677 | 1%
Professional Development | 6,092 | 0%
Total | 3,254,144 | 100%
Labor: Salaries and benefits paid to staff and students.
IDC: Indirect Costs (grant overhead) paid to the University of California, San Diego.
Supplies & Expenses: Computer hardware (costing less than $5000) and software, telephone, Internet, IT services, data storage, machine room, mail, printing, and general office supplies.
Travel: CAIDA personnel trips to conferences, PI meetings, operational meetings, and sites of remote monitor deployment.
Workshop & Visitor Support: Conference room and equipment charges, meals, travel grants.
Equipment: Computer hardware or other equipment costing more than $5000.
Professional Development: CAIDA personnel education, registration fees, international visas, publication charges, meals at internal meetings, etc.
[Figure: Expenses by Program Area]
Program Area | Amount ($) | Percentage
Economic & Policy | 54,458 | 2%
Future Internet | 461,106 | 14%
Topology | 1,093,454 | 34%
Infrastructure | 1,162,599 | 36%
Security | 394,005 | 12%
Outreach | 80,464 | 2%
CAIDA Internal Operations | 8,057 | 0%
Total | 3,254,144 | 100%


Funding Sources

CAIDA thanks our sponsors, members, and collaborators.
