This page describes the results of research and developement efforts funded by contract N66001-08-C-2029 with the DHS Science and Technology Directorate (March 2008 - July 2012) and co-funded by the NSF grant CNS-0958547 (started in March 2010).
We integrated the following six strategic measurement and analysis capabilities to improve DHS' situational awareness of Internet topology structure and behavior:
- a new architecture to support Internet topology measurement
- application of IP alias resolution techniques
- conversion of IP- and router-level graphs to AS-level topology graphs
- AS taxonomy and relationship inference
- geolocation of IP resources
- interactive visualization of large annotated graphs
The objective of these applied research, development, and deployment efforts was to develop the capability to regularly provide richly annotated topology maps of observable Internet infrastructure, as well as to develop a powerful measurement platform capable of performing other types of Internet infrastructure measurement experiments as needed. Our approach has uniquely been able to provide benefits in two areas: (i) improving critical national capabilities in understanding the structure and evolution of our communications infrastructure, based on innovative measurement, analysis and inference techniques; and (ii) fill a recognized need in network science: allowing researchers to execute empirical research in cybersecurity without having to build and maintain the requisite global measurement infrastructure.
1. Archipelago (Ark) Measurement Infrastructure
We have created a powerful and versatile distributed measurement infrastructure Archipelago (Ark) that makes use of measurement nodes located in various networks worldwide and connected via the Internet to a central server located at CAIDA. Co-funded by DHS and NSF, we have grown Ark from 13 monitors in December 2007 to 60 monitors as of September 2012, deployed in 30 countries on six continents. 28 monitors are enabled to conduct both IPv4 and IPv6 measurements. Ark has pioneered new features and functionality of distributed measurement infrastructure, including flexible and efficient measurement and data collection methods. It has proven successful at providing both benefits described above: unprecedented intelligence regarding macroscopic Internet connectivity, and support for several third-party cybersecurity-related global Internet measurement experiments.
Ark is now continuously gathering the largest set of IPv4 and IPv6 topology data made freely available to academic researchers and government agencies via the CAIDA web site as well as via IMPACT. For IPv4 topology, Ark monitors continuously measure IP-level paths to a dynamically generated list of IP addresses covering all /24 prefixes (about 10 million as of July 2012) in routed IPv4 address space. Measurement parallelization allows us to cycle through probing each routed /24 prefix in about two days. From September 2007 through June 2012, we collected more than 17 billion traceroutes (6.7 TB of uncompressed data, 2.1 TB compressed). IPv6-capable monitors conduct continuous probing of BGP-announced IPv6 prefixes (/48 or shorter, nearly 10,000 prefixes as of July 2012). Each Ark monitor probes a single random destination in each prefix; a full probing cycle takes 48 hours.
We maintain an Ark statistics page that shows
- statistics aggregated across all monitors:
- statistics for each individual monitor:
We field-tested Ark several times in support of global measurement experiments that resulted in several conference and journal publications. These experiments included investigation of relative topological coverage of different forward path probing methods, evaluating the efficacy of deployed Internet source address validation filtering, measuring the impact of certain causes of missing hops in traceroute paths, and a comparison of public and commercial geolocation databases. This range of scientific experiments has successfully demonstrated our vision of a metaphorical distributed measurement "operating system" to support empirical Internet science.
Based on our experience with Ark, CAIDA has also made recommendations for designing and operating the next generation of Internet topology measurement platforms.
We developed the following Ark-related software:
- coordination software: Marinda - a distributed tuple space implementation used for coordinating measurements across monitors (included in MIDAR v0.3.0 release)
- probing application software:
- mper - probing engine based on scamper (released)
- rb-mperio - Ruby library for writing Ruby measurement scripts that use mper probing engine (released)
- system-management software: ArkUtil - a library of utilities including software to remotely manage software running on monitors (released)
- data management software:
- ark-collector - Ark service for automatic fault-tolerant downloading of measurement data from monitors
- ScamperDataFeed/ScamperIO - Ruby classes for programmatically controlling scamper from a Ruby script, part of ArkUtil
- researcher-supporting software: Eva - Ruby library for writing efficient event-driven applications (included in MIDAR v0.3.0 release)
- data analysis software:
- rb-asfinder - a Ruby library for fast IP-to-prefix/AS lookups using a Patricia trie (wraps our C++ Patricia trie implementation)
- rb-wartslib - (link not available) - Ruby library for reading and writing scamper warts files (released)
- rb-judy - Ruby wrapper for the Judy array library (included in MIDAR v0.3.0 release)
- topology statistics analysis tool: topostats (released)
- topology-on-demand prototype software: Vela - performing user-specified distributed traceroutes and pings from Ark platform (released)
- prototype services to support clients interacting over the Marinda tuple space:
- asfinder - generic service using rb-asfinder to map IP addresses to prefixes and ASes
- geoloc - generic service for geolocating IP addresses
Ongoing data collections:
- IPv4 Routed /24 (topology probes to each /24, continuously)
- IPv4 Routed /24 DNS Names (DNS resolution for the above)
- IPv6 Topology (topology probes to each routed IPv6 prefix)
- Internet Topology Data Kit (ITDK) (curated IPv4 data)
- AS Links: IPv4 Routed /24 AS Links (AS adjacencies)
- AS Relationships (inferred AS relationships)
- AS Rank (inferred AS ranking by customer cone or degree)
- "Traceroute Probe Method and Forward IP Path Inference", IMC'08, Matthew Luckie, Young Hyun, and Bradley Huffaker.
- "Understanding the efficacy of deployed internet source address validation filtering", IMC'09, Rob Beverly, Arthur Berger, Young Hyun, KC Claffy.
- "Measured impact of crooked traceroute", CCR, Jan 2011 Matthew Luckie, Amogh Dhamdhere, KC Claffy, David Murrell.
- "Efficient Internet Topology Discovery Techniques", Masters Thesis, U. Waikato, Alistair King.
As of January 1, 2013, 124 articles by non-CAIDA researchers cite or use data from Ark (according to our search using Google Scholar).
2. Alias Resolution: Mapping IP Addresses to Routers
A critical step in creating accurate Internet topology maps from traceroute data is mapping IP addresses to routers, a process known as alias resolution. We tested previously existing small-scale experimental methods for mapping IP addresses to routers and re-designed and re-implemented them into more robust alias resolution techniques capable of working on Internet-scale topologies, i.e., millions of addresses.
Our first developed tool, kapar uses traceroutes collected by Ark as an input, breaks the observed IP paths into IP links, and heuristically infers the set of IP addresses that belong to the same router, as well as the set of two or more routers on the same "IP link" (either a point-to-point link, or LAN, or shared medium with multiple attached IP addresses).
Our second developed tool, Monotonic ID-Based Alias Resolution tool (MIDAR), infers aliases based on similarities in IP ID time series produced by different IP addresses. To improve precision and sensitivity of this technique, we developed an extremely precise IP ID comparison test based on monotonicity rather than proximity. MIDAR also integrates multiple probing methods, multiple vantage points, and a novel sliding-window probe scheduling algorithm to ensure scalability to millions of IP addresses. Our experiments showed that MIDAR's approach is effective at minimizing the false positive rate sufficiently to achieve a high positive predictive value at Internet scale. We validated MIDAR against available ground truth and showed that its results are significantly better than was previously achievable with MIDAR's predecessors.
We released the following software:
- alias resolution tool kapar
three versions of MIDAR tool serving different purposes:
- v0.1.0 - a small scale corroboration tool capable of testing a small ( < 200) set of IP addresses using only the corroboration stage of the MIDAR IP ID test, a single probe method, from a single host
- v0.2.0 - a medium-scale resolution system capable of testing a medium-size ( < 40,000) set of IP addresses with the MIDAR IP ID test, with all MIDAR stages and multiple probe methods, from a single host
- v0.3.0 - a large-scale system, capable of testing an Internet-scale (2 million or more) set of IP addresses with the MIDAR IP ID test, with all MIDAR stages and probe methods, from multiple monitor hosts.
We also conducted a rigorous and unprecedented analysis systematically comparing the topologies we derive from Ark measurement data with other topologies inferred from the best available data sources and common inference techniques. We compared topology graphs at three granularities (IP interface, router, and AS) derived from seven different topology data sources (CAIDA's traceroute data, BGP (Route Views and RIPE NCC RIS), IRR data (RIPE's WHOIS registry), iPlane, DIMES, and IRL). We selected the four basic statistical characteristics for this comparison: average node degree, node degree distribution, average neighbor degree, and local clustering as a function of node degree. These metrics are the most popular in the networking literature and also are definitive, since reproducing these metrics is sufficient to capture all essential topological characteristics of Internet AS- and router-level topologies.
As far as we know, this report available online is the most comprehensive comparative study thus far, with published sources of data, metrics, methods, and results. The investment of effort required to gather, curate, interpret, and integrate Internet topology data into a research project is intimidatingly huge. Our analysis can facilitate more informed selection of topology datasets to support specific research or analysis needs, consistent with CAIDA's mission to promote the use of the best available data in Internet research.
- "Internet-scale IP alias resolution techniques", Ken Keys, CCR, January 2010.
- "Toward Topology Dualism: Improving the Accuracy of AS Annotations for Routers", PAM'2010, Bradley Huffaker, Amogh Dhamdhere, Marina Fomenkov and KC Claffy.
- "MERLIN: MEasure the Router Level of the INternet", Conference on Next Generation Internet (NGI'2011), Pascal Mérindol, Benoit Donnet, Jean-Jacques Pansiot, Matthew Luckie and Young Hyun.
- "Internet-Scale IPv4 Alias Resolution with MIDAR: System Architecture", submitted to Transactions on Networking, June 2011. Ken Keys, Young Hyun, Matthew Luckie, KC Claffy.
- "Internet Topology Data Comparison", technical report. May 2012, Bradley Huffaker, Marina Fomenkov, and KC Claffy.
3. Internet Topology Data Kit (ITDK)
Starting in 2010, we applied our state-of-the-art measurement and analysis techniques to collect, analyze, process and release four Internet Topology Data Kit (ITDK) data sets. Each ITDK is produced from active measurements conducted on our Archipelago (Ark) measurement infrastructure. We obtained the raw topology by performing traceroutes to randomly-chosen destinations in each routed /24 BGP prefix using all Ark monitors over a two-week period. The curated data set includes: two versions of router-level topologies (derived using different combinations of alias resolution methods, to minimize false positive and false negatives, respectively); router-to-AS assignments; geographic location of each router; and DNS lookups of all observed IP addresses. We continue to refine data analysis, inferences, and annotation methods, and will be augmenting ITDKs with additional annotations, in particular AS links as inferred from combining the AS assignment and node datasets, and inferred AS relationships for each AS link.
As of November 2012, 6 Ark ITDKs are available.
4. AS Rank
Independent networks (Autonomous Systems, or ASes) engage in voluntary bilateral interconnection ("peering") agreements to provide reachability to each other for some subset of the Internet, and AS granularities. Using BGP routing data from the Route Views Project and RIPE NCC RIS, we implemented multiple data analysis methodologies and developed a procedure to rank Autonomous Systems (AS Rank) based on a measure of their influence in the global routing system. Using WHOIS data, we also construe sibling relationships between ASes belonging to the same organizaiton (ISP). ASes and organizations are ranked by their customer cone size, which is the number of their direct and indirect customer ASes based on inferred business relationships among ASes. (We do not have data to rank ASes or ISPs by traffic, revenue, users, or any other non-topological metric.)
The results of our AS ranking algorithm are represented as an interactive web page which allows users to provide feedback correcting false relationship inferences. We received more ground truth back from ISPs than has ever been integrated into an AS relationship algorithm, and refined our algorithm based on this data as well as user suggestions in follow-up conversations with submitters of ground truth.
We also validated the output of our improved algorithm against our ground truth data set and found our c2p (customer-to-provider) inferences to be 99% correct and our p2p inferences to be 89% correct. Our relationships dataset is a better match against our ground truth data than any previous work. We explain these concepts in detail on the AS Rank Help web site and in a forthcoming publication.
Since geolocation of IP resources is such an essential component of our mapping research (and of much other Internet research), we undertook a systematic quantitative comparison of currently available geolocation service providers. We added depth to previous geolocation studies by analyzing inconsistencies across databases for different geographic regions and organization types.
We analyzed the geolocation results on both country and lat-long granularities. For IP-to-country mapping, we compared an individual database answer for a given IP address with the answer selected as the majority vote across all databases that geolocate this IP address. We found that providers generally agreed on IP-to-country mappings, the level of agreement between an individual database and the majority varying from 99.1% to 94.3%.
To compare databases at a lat-long granularity we used an 80 km threshold for two lat-longs coordinates to be in the same geographic region. Our report describes the process for selecting this threshold, and the centroid-based algorithm for comparing an individual database lat-long results with a majority of responses from the set of databases we evaluated.
While not completely foolproof - the databases could all be converging to the same wrong answers over time, our comparative analysis methodology is based on a reasonable assumption that database providers successfully work toward improving the accuracy of their databases over time. Therefore, in the absence of substantial ground truth, our method offers a systematic way to study and cross-compare the geolocation databases.
In the Supplemental Phase of the project, we intended to re-run the geolocation comparison experiment to include two large additional databases, Quova and Akamai. Unfortunately, upon learning of our comparison project, both providers refused to even sell us access to their products. Therefore, we had to eliminate this task from the Supplemental SOW.
- "Geocompare: a comparison of public and commercial geolocation databases", Network Mapping and Measurement Conference, May 2011. Bradley Huffaker, Marina Fomenkov, and KC Claffy.
6. Multi-granularity topology graph visualizations
We developed an integrated visualization of topological connectivity of a single AS on the PoP/city location and router levels. ISPs maintain points-of-presence (PoPs) across the world to connect with other networks. We do not currently distinguish between different PoPs in the same city, so we infer a city-level graph of routers for each AS (ISP). Starting with our ITDK dataset, which contains annotated router level topology data, we use MaxMind's GeoLite City geolocation database (free) to create a Geographic Graph by mapping a router's interfaces to geographic locations. If a router's interfaces do not map to the same geographic location, we omit its geographic annotation. We do retain "singleton" router nodes (which themselves may represent multiple routers in the same city) for which we saw no observed links to remote (different city) routers in the same AS.
To improve visibility of highly clustered PoPs, we created a Location Graph in which we loosen the geographic mapping seen in the Geographic Graph described above. Nodes in the Location Graph represent PoP locations for a single AS, each location represented by a white circle. Nodes (PoPs) are originally placed in their geographic location, but are then shifted to create a visible distance between locations. A single color is drawn around all locations within the same country, whose name is displayed in the circumscribed area. The darkness of links between locations reflects the smaller of the two nodes' degrees.
We also provide a Location Router-Level Graph which depicts the routers operated by a single AS in a single location, and the locations and ASes connected to those routers. The main location is visualized as a large ivory circle in the middle of the graph. Routers in the main location are represented by rectangles. ASes in the main location are represented by numbered circles. If the routers connect to routers in the same AS, but a different location, that connection is represented by a link to a numbered circle AS in that location.
Finally, we also developed a Relationship Graph that displays providers, peers, and customers of a selected AS. The graph centers on the selected AS, with providers drawn above and customers below. Peer locations depend on their customer cone size relative to the selected AS's customer cone size. The vertical location of each AS in the graph corresponds to its cone size and horizontal location is chosen to minimize overlap.
We have integrated all the visualizations described above into our AS Rank web interface. This is a preliminary step toward creating a comprehensive, operationally useful view of intra- and inter- AS connectivity depicting not only the network topology (the number of neighbors, the types of connecting links, etc.), but including also geographic and economic attributes such as ownership structure, regional presence, financial indicators.
During the course of the contract, CAIDA hosted four annual workshops on Active Internet Measurements (AIMS). The AIMS workshops are intended to advance our understanding of the potential and limitations of active measurement research and infrastructure in the wide-area Internet, and to promote cooperative solutions and coordinated strategies to address future data needs of the network and security research communities. These workshops have fostered interdisciplinary conversation among researchers, operators, and government, focused on analysis of goals, means, and emerging issues in active Internet measurement projects. The first workshop emphasized discussion of existing hardware and software platforms for macroscopic measurement and mapping of Internet properties, in particular those related to cybersecurity. The second workshop included more performance evaluation and data-sharing approaches. In the third workshop we expanded the workshop agenda to include active measurement topics of more recent interest: broadband performance; gauging IPv6 deployment; and measurement activities in international research networks. The fourth workshop focused more narrowly on IPv4 to IPv6 transition issues and on broadband performance measurement, data analysis, and visualization.
- ISMA 2009 AIMS - 1st Workshop on Active Internet Measurements
- ISMA 2010 AIMS-2 - Workshop on Active Internet Measurements
- ISMA 2011 AIMS-3 - Workshop on Active Internet Measurements
- ISMA 2012 AIMS-4 - Workshop on Active Internet Measurements
- "The Workshop on Active Internet Measurements (AIMS) Report", ACM SIGCOMM Computer Communications Review (CCR), Oct 2009.
- "AIMS-2 Workshop on Active Internet Measurements Report", ACM SIGCOMM Computer Communication Review (CCR), Oct 2010.
- "AIMS-3 Workshop on Active Internet Measurements Report", ACM SIGCOMM Computer Communication Review (CCR), July 2011.
- "AIMS-4 Workshop on Active Internet Measurements Report", ACM SIGCOMM Computer Communication Review (CCR), July 2012.