Skip to Content
[CAIDA - Center for Applied Internet Data Analysis logo]
Center for Applied Internet Data Analysis
Internet Mapping: Cartographic Capabilities for Critical Cyberinfrastructure

This page describes the results of research and development efforts funded by contract N66001-12-C-0130 with the DHS Science and Technology Directorate (base period: September 2012 - January 2016) and co-funded by the NSF grant CNS-1513283 (started in August 2015). This contract currently continues through an exercised optional six month final phase ending September 2016.

|   Final Report    Statement of Work     Proposal   |

Final Report

  1. Summary
  2. Archipelago (Ark) Measurement Infrastructure
  3. Increasing the completeness, accuracy, and richness of macroscopic Internet maps
  4. Interactive analysis tools and capabilities
  5. Workshops
  6. Contributions

1.  Summary

The Cartographic Capabilities for Critical Cyberinfrastructure (c4) contract continued research and development efforts started under our previous contract with DHS (Leveraging the Science and Technology of Internet Mapping for Homeland Security). We developed new technologies and analysis capabilities spanning multiple research domains of the Internet measurement field to enable delivery of cybersecurity-relevant annotated maps of critical Internet resources. We improved the completeness, accuracy, and richness of our macroscopic Internet maps, developed a scalable and user-friendly interactive interface to conduct on-demand measurements and query historical traceroute data, and integrated several data types and sources to produce a coherent representation of macroscopic Internet topology at multiple granularities: AS-level, PoP-level, and router-level maps. We also began development of an AS-traceroute tool.

2.  Archipelago (Ark) Measurement Infrastructure

We continued to grow and upgrade our powerful and versatile distributed measurement infrastructure Archipelago (Ark). Ark makes use of measurement nodes located in various networks worldwide that connect via the Internet to a central server located at CAIDA. Co-funded by DHS and NSF, we have more than doubled the number of monitors: from 60 in September 2012 to 132 as of January 2016, deployed in 50 countries on six continents, with 54 monitors enabled to conduct both IPv4 and IPv6 measurements.

We started using the Raspberry Pi platform for the Ark hardware. This move drastically reduced the cost of each node and expanded our delivery options enabling both much cheaper standard shipping to destinations worldwide and distribution through direct contact with colleagues. Moreover, the popularity and portability of miniature Raspberry Pi devices made it easier for hosting sites to donate the hardware and allowed residential sites to host an Ark node. We were also able to improve our coverage of previously underprovisioned geographical regions, such as Africa where we added 12 monitors, and South America where we now have five hosting sites.

We continued to add new features and improve functionality of the Ark platform, implementing flexible and efficient measurement and data collection methods. Ark has proven successful at providing unprecedented intelligence regarding macroscopic Internet connectivity and supporting third-party cybersecurity-related global Internet measurement experiments (listed on the Ark home page). Ark continuously gathers the largest set of IPv4 and IPv6 topology data made available to researchers and government agencies via the CAIDA web site and the DHS S&T IMPACT portal. For IPv4 topology, Ark monitors continuously measure IP-level paths to a dynamically generated list of IP addresses covering all /24 prefixes (about 10.9 million as of January 2016) in the routed IPv4 address space. Measurement parallelization allows us to cycle through this probing in about three days. IPv6-capable monitors conduct continuous probing of BGP-announced IPv6 prefixes (/48 or shorter, nearly 20,367 prefixes as of January 2016). Each Ark monitor probes a single random destination in each prefix; a full probing cycle takes 48 hours.

We maintain an Ark statistics page that shows

During the base period of the project we used Ark to conduct several measurement experiments and published the results in conferences and journals. These experiments included investigation of IPv6 alias resolution via induced fragmentation [1], evaluation of a new algorithm for inferring business relationships between ASes [2], inferring multilateral peering [3], and detection of third-party addresses in traceroute traces [4].

Software Development

We updated the following Ark-related software:

  • coordination software: Marinda - a distributed tuple space implementation used for coordinating measurements across monitors (released in MIDAR v0.6.0)
  • probing application software: rb-mperio - Ruby library for writing Ruby measurement scripts that use mper probing engine
  • system-management software: ArkUtil - a library of utilities including software to remotely manage software running on monitors
  • on-demand topology measurement software: Vela - web interface for performing user-specified distributed traceroutes and pings from Ark platform
We developed the following Ark-related new software:
  • tod - command-line tool for performing on-demand, user-specified distributed traceroutes and pings from Ark platform
  • toq - command-line tool for querying historical Ark topology measurements
  • Dolphin - bulk DNS resolution tool (replacement for HostDB)

Ongoing Data Collections using Ark

  1. IPv4 Routed /24 (topology probes to each /24, continuously)
  2. IPv4 Routed /24 DNS Names (DNS resolution for the above)
  3. AS Links: IPv4 Routed /24 AS Links (AS adjacencies)
  4. IPv6 Topology (topology probes to each routed IPv6 prefix)
  5. IPv6 DNS Names (DNS resolution for above)
  6. IPv6 AS Links Dataset (AS links derived from ongoing traceroute-like measurements that make up our IPv6 Topology Dataset)
  7. Internet Topology Data Kit (ITDK) (curated IPv4 data)
  8. AS Relationships (inferred AS relationships)

3.   Increasing the completeness, accuracy, and richness of macroscopic Internet maps

Identifying and removing false links

During the Applied Research Phase of the project, we studied the impact of false link inferences on router-level, PoP/city-level, and AS-level graphs. We developed techniques to identify and filter out false link inferences by quantifying consensus across different sources of AS-level and traceroute data and resolving knowledge conflicts among different data sets. This methodology relies upon establishing provenance of data used for inferences, and estimating the relative accuracy of data sources using as much ground truth as we could obtain.

We investigated a proposed technique to detect third-party addresses that appear in traceroute traces using the IP timestamp option [4] Routers sometimes respond to traceroute probes with a source address not in the path towards the destination, i.e. an off-path address. By mapping to ASes not in the corresponding BGP path, such third-party addresses cause false inferences of AS-level links and paths. We applied a first-principles approach, rooted in the assumption that subnets between connected routers are often /30 or /31 because routers are often connected with point-to-point links. We inferred if an address in a traceroute path corresponds to the interface on a router that received the packet (the inbound interface) by attempting to infer if its /30 or /31 subnet mate was an alias of the previous hop. We collected traceroutes from 8 Ark monitors to 80K randomly chosen destinations, and found that most observed addresses were configured on the in-bound interface on a point-to-point link connecting two routers, i.e. are on-path. Because the IP timestamp option technique reports more than 70% of these addresses as being off-path, we concluded that it is not reliable at inferring which addresses are off-path or third-party. That means that deriving a technique that accurately infers AS links from traceroute paths remains an important and currently unsolved problem.

We also studied spurious routes in public BGP data [7]. The BGP data is provided voluntarily by network operators who establish BGP sessions with route collectors that record this data. Unfortunately, it is trivial for a single vantage point (VP) to introduce thousands of spurious routes into the collection. We explored the impact these misbehaving VPs can have on AS relationship inference, showing that they introduce thousands of AS links that did not exist, and cause relationship inferences for links that did exist to be corrupted. For example, one of the misbehaving VP we discovered added thousands of spurious routes for nine consecutive months until 8 November 2012. Although it appears that this artifact barely impacts (0.1%) our validation of our AS relationship inferences, this number may be misleading since most of our validation data relies on BGP and RPSL validation of only existing links, rather than asserting the non-existence of links. We have only a few assertions of non-existent routes, all received via our public-facing website that allows operators to provide validation data through our interactive feedback mechanism. When two independent operators corrected some of our inferences, we noticed that the spurious routes all came from the same VP. This event highlights the limitations of even the best available topology data, and provides additional proof that comprehensive ground truth validation from operators is essential to scientific research on Internet topology.

Improving geographical annotations

Determining the physical locations of Internet routers is crucial for characterizing Internet infrastructure and understanding geographic pathways of global routing, as well as for creating Point-of-Presence (POP) level maps. These tasks require good router geolocations. We developed a methodology to infer router locations [6] by extracting and decoding geography-related strings from fully qualified domain names (hostnames). We first compiled an extensive dictionary associating geographic strings (e.g., airport codes) with geophysical locations. We then searched a large set of router hostnames for these strings, assuming each autonomous naming domain uses geographic hints consistently within that domain. We used topology and performance data continually collected by Ark to ascertain whether a given hint appears to co-locate different hostnames in which it is found. Finally, we generalized consistent geolocation hints into domain-specific rule sets thus obtaining a total of 1,711 rules covering 1,398 different domains. We validated the rules using available domain-specific ground truth for six domains. Unlike previous efforts at using DNS information to infer geolocation, which relied on labor-intensive domain-specific manual analysis, our process for inferring the domain specific heuristics is automated, representing a measurable advance in the state-of-the-art of methods for geolocating Internet resources.

We developed CAIDA public DNS Decoding database (DDec) that contains geographic locations, router types, and other information that organizations encode into DNS hostnames (for example, LosAngeles and edge in xe-9-3-0.edge5.LosAngeles1.Level3.net. Combining rules from multiple sources, DRoP [6] and Undns (part of Rocketfuel ), DDec serves as a single location for decoding DNS names. A web interface (BETA version) allows users to decode hostnames, find rules for individual domains, lookup encoding rules by name, or see all the rule sets.

To further improve our router-level view of the Internet topology, we set out to annotate Internet interconnections with robust physical coordinates at the level of a building. This information can help locate points of attacks, congestion, or instability on the Internet. However, like most other aspects of Internet interconnection, its geophysical locus is generally not public; the facility used for a given link must be inferred to construct a macroscopic map of peering. We developed a methodology, called constrained facility search, to infer the physical interconnection facility where an interconnection occurs [12] among all possible candidates. We used publicly available data about the presence of networks at different facilities and traceroute measurements executed from more than 8,500 available measurement servers scattered around the world to identify the technical approach used to establish an interconnection. A key insight of our method is that inference of the technical approach for an interconnection sufficiently constrains the number of candidate facilities such that it is often possible to identify the specific facility where a given interconnection occurs. Validation via private communication with operators confirmed the accuracy of our method, which outperforms heuristics based on naming schemes and IP geolocation. Our study also revealed the multiple roles that routers play at interconnection facilities; in many cases the same router carries out both private interconnections and public peerings, in some cases via multiple Internet exchange points. Our study also shed light on peering engineering strategies used by different types of networks around the globe.

Improving AS-level Internet maps

To increase the completeness of our maps, we developed novel topology discovery techniques. We implemented and validated a method for discovering currently invisible IXP peering links [3] by mining BGP communities used by IXP route servers to implement multilateral peering (MLP), including communities that signal the intent to restrict announcements to a subset of participants at a given IXP. Using route server data juxtaposed with a mapping of BGP community values, we inferred 206K p2p links from 13 large European IXPs, that is, four times more p2p links than what is directly observable in public BGP data repositories (Routeviews and RIPE RIS). The advantages of our technique are threefold. First, it utilizes existing BGP data sources, plus inferences based on BGP community values, and does not require the deployment of additional vantage points nor the acquisition of private data. Second, it requires only a few active queries, facilitating repeatability of the measurements. Finally, it offers a new source of data regarding the dense establishment of MLP at IXPs.

We studied the utility of PeeringDB for improving our understanding of the peering ecosystem [5]. PeeringDB is an online database where participating networks contribute information about their peering policies, traffic volumes and presence at various geographic locations. Established to support the practical needs of operators, this data also provides a valuable source of information to researchers. Using BGP data to cross-validate three years of PeeringDB snapshots, we found that PeeringDB membership is reasonably representative of the Internet transit, content, and access providers in terms of business types and geography of participants, and PeeringDB data is generally up-to-date. We found strong correlations among different measures of network size: BGP-advertised address space, PeeringDB-reported traffic volume, and presence at peering facilities. We also found correlations between these size measures and advertised peering policies.

To annotate nodes in the AS-level Internet graph with AS business types (Transit/Access, Content, and Enterprise), we created an AS classification data set. We split a ground-truth data set from PeeringDB into two parts: a labeled training set and a validation set. We used the former to train a decision-tree machine learning classifier, and used the latter to validate the results, finding the Positive Predictive Value of the classifier at 70%. We then applied the classifier to taxonomize all Internet ASes.

To annotate links in the AS-level Internet graph with AS business relationships, we developed new methods of inferring customer-to-provider and peer-to-peer AS relationships using BGP paths [2]. Unlike previous approaches, our algorithm does not assume the presence (or seek to maximize the number) of valley-free paths, instead relying on three assumptions about the Internet inter-domain structure: (i) an AS enters into a provider relationship to become globally reachable; (ii) there exists a peering clique of ASes at the top of the hierarchy; and (iii) there is no cycle of p2c links. We assembled the largest source of validation data for AS-relationship inferences to date, validating 34.6% of our 126,082 c2p and p2p inferences to be 99.6% and 98.7% accurate, respectively. Using these inferred relationships, we next evaluated three algorithms for inferring each AS's customer cone, defined as the set of ASes an AS can reach using customer links. We demonstrated the utility of our algorithms for studying the rise and fall of large transit providers over the last fifteen years, including recent claims about the flattening of the AS-level topology and the decreasing influence of "tier-1" ASes on the global Internet.

The traditional approach of modeling relationships between ASes abstracts relationship types into three broad categories: transit, peering, and sibling. However, more complicated configurations exist, and understanding them may improve Internet economics routing modeling capability. To further refine our AS-level topology data, we used BGP, traceroute, and geolocation data and extended our AS relationship inference algorithm to infer two types of complex relationships [9]: hybrid relationships, where two ASes have different relationships at different interconnection points, and partial transit relationships, which restrict the scope of a customer relationship to the provider's peers and customers. Using our new algorithm, we found that 4.5% of the 90,272 provider-customer relationships observed in March 2014 were complex, including 1,071 cases of hybrid relationships and 2,955 cases of partial-transit relationships. Because most peering relationships are invisible, we believe these numbers are lower bounds. Using feedback from operators, and relationships encoded in BGP communities and RPSL, we validated 20% and 6.9% of our partial transit and hybrid inferences, respectively. Hybrid relationships are not only established between large transit providers; in 57% of the inferred hybrid transit/peering relationships the customer had a customer cone of fewer than 5 ASes.

4.  Interactive analysis tools and capabilities

AS Rank

We applied our developed techniques and methodologies to integrate multiple diverse types and sources of data (BGP routing data from the Route Views Project and RIPE NCC RIS, WHOIS data, PeeringDB, inferred AS relationships, AS classification, geolocation, etc.) into an updated coherent representation of macroscopic Internet topology at multiple granularities (AS Rank).

We rank ASes by their transit degree or by their customer cone size, which is the number of their direct and indirect customer ASes calculated based on inferred AS business relationships. The customer cone size can also be measured in terms of the number of IPv4 prefixes, or the number of IPv4 addresses. The results of AS ranking are represented as an interactive web page. Users can choose among different ranking metrics and provide feedback correcting false relationship inferences. Via this web site we received more ground truth from ISPs (190 submissions) than we have ever had access to before, and refined the underlying AS relationship algorithm based on this data as well as user suggestions in follow-up conversations with contributors of ground truth.

We also developed a method to infer sibling relationships between ASes belonging to the same organization (ISP) and enabled a ranking by organization, using the same topological metrics as in our AS ranking. Our multifaceted ranking algorithm gauges a measure of AS (or ISP) influence in the global routing system, and represents a test of the data gathering techniques and analysis methodologies developed over the course of this project in a typical operational environment. (Note that we do not have data to rank ASes or ISPs by traffic, revenue, users, or any other non-topological metric.)

The data underlying AS-Rank are available for download in two forms: serial-1 and serial-2. The serial-1 directory contains AS relationships inferred using the method described in "AS Relationships, Customer Cones, and Validation" [2] (published in IMC 2013). In the serial-2 data set, we augment serial-1 data with additional peering links derived from Ark traceroute data, and multilateral peering [3] data. We are working on a third version of AS Relationships, serial-3, that will contain the synthesized, annotated, traceroute-based Internet topology from further expanded number of sources of traceroute data.

Interface for querying historical data

Over eight years, we have captured 18 TB of Ark trace data (41 billion traces). The collection continues to grow by approximately 316 GB (766 million traces) per month, or 9 billion traces per year. Additionally, we have begun capturing "prefix-probing" data where we probe every announced IPv4 BGP prefix (~609k) daily from 37 Ark monitors. This data set grows by approximately 700 million traces per month or 8.4 billion traces per year. Finding and retrieving data of interest from this enormous collection is a challenging task. We developed a topology query system (toq) to allow easier access to and promote the analysis of the available wealth of data. The system provides full access from the command-line enabling researchers to execute queries and write their own analysis and visualization scripts. We also support simplified access via the Vela web interface. This interface presents pre-made queries, analysis, and visualization to help widen the audience and to cater to the long tail of casual users.

The main focus of the querying system is to allow users to explore topological properties of traceroutes. Example queries might include all traces that pass through/reach a set of IP addresses, prefixes, ASes, or countries. We will try to implement commonly desired queries, analysis, and visualization.

The software takes advantage of multiple CPU cores and is scalable and efficient. To improve the performance, we intend to profile the code and rewrite critical portions of the software in C/C++ from its original Python.

Topology probing on-demand

We also continued to develop and improve Vela, our topology-on-demand probing system. CAIDA researcher Young Hyun collaborated with Prof. R. Beverly (Naval Postgraduate School) making use of Vela capabilities to explore more efficient topology probing techniques.

5.  Workshops

During the course of the contract, CAIDA hosted three annual workshops on Active Internet Measurements. The over-arching goal of these workshops is to advance our understanding of the potential and limitations of active measurement research and infrastructures in the wide-area Internet, and to promote cooperative solutions and coordinated strategies to address future data needs of the network and security research communities. These workshops have fostered interdisciplinary conversation among researchers, operators, and government, focused on analysis of goals, means, and emerging issues in active Internet measurement projects.

The AIMS 2013 workshop focused on creating, managing, and analyzing annotations of large longitudinal active Internet measurement data sets. We also dedicated half a day to large-scale active measurement (performance/topology) from mobile/cellular devices. The next AIMS 2014 workshop explored capabilities and opportunities for network measurement in the wireless (mobile and cellular) domain, and research infrastructure to support it. Participants agreed that the conversation was only beginning, and that it is important to build consensus on standard metrics to measure, constructing a road map for wireless measurement research infrastructure and activities for the next decade. Topics discussed at the AIMS 2015 workshop included the current state of Ark and related infrastructure, current and proposed experiments using these infrastructures, and participants' views of relevant challenges and priorities.

Workshop reports are published:

  1. "AIMS-5 Workshop on Active Internet Measurements Report", ACM SIGCOMM CCR, July 2013.
  2. "AIMS-6 Workshop on Active Internet Measurements Report", ACM SIGCOMM CCR, October 2014.
  3. "AIMS-7 Workshop on Active Internet Measurements Report", ACM SIGCOMM CCR, October 2015.

6.  Contributions

Publications

  1. "IPv6 Alias Resolution via Induced Fragmentation", PAM'13, R. Beverly, W. Brinkmeyer, M. Luckie, and J. Rohrer.
  2. "AS Relationships, Customer Cones, and Validation", IMC'13, M. Luckie, B. Huffaker, k. claffy, A. Dhamdhere, and V. Giotsas.
  3. "Inferring Multilateral Peering", CoNEXT, Dec 2013, V. Giotsas, S. Zhou, M. Luckie, and k. claffy.
  4. "A Second Look at Detecting Third-Party Addresses in Traceroute Traces with the IP Timestamp Option", PAM'14 M. Luckie and k. claffy.
  5. "Using PeeringDB to Understand the Peering Ecosystem", SIGCOMM CCR, Apr 2014 A. Lodhi, N. Larson, A. Dhamdhere, C. Dovrolis, and k. claffy.
  6. "DRoP: DNS-based Router Positioning", CCR, Jul 2014, B. Huffaker, M. Fomenkov, and k. claffy.
  7. "Spurious Routes in Public BGP Data", CCR, Jul 2014, M. Luckie.
  8. "Measurement and Analysis of Internet Interconnection and Congestion", TPRC'14, D. Clark, S. Bauer, k. claffy, A. Dhamdhere, B. Huffaker, W. Lehr, and M. Luckie.
  9. "Inferring Complex AS Relationships", IMC'14, V. Giotsas, M. Luckie, B. Huffaker, and k. claffy.
  10. "Challenges in Inferring Internet Interdomain Congestion", IMC'14, M. Luckie, A. Dhamdhere, D. Clark, B. Huffaker, and k. claffy.
  11. "IPv6 AS Relationships, Clique, and Congruence", PAM'15, V. Giotsas, M. Luckie, B. Huffaker, and k. claffy.
  12. "Mapping Peering Interconnections to a Facility", CoNEXT'15, V. Giotsas, G. Smaragdakis, B. Huffaker, M. Luckie, and k. claffy.
  13. "Measuring Interdomain Congestion and its Impact on QoE", Workshop on Tracking Quality of Experience in the Internet, October 2015, A. Dhamdhere, M. Luckie, D. Clark, and k. claffy.

Presentations

Blogs


  Last Modified: Wed May-11-2016 14:49:18 PDT
  Page URL: http://www.caida.org/funding/c4/index.xml