CAIDA researchers have been collecting connectivity and latency data for a wide cross-section of the commodity Internet since 1998 for the IPv4 address space and since 2003 for the IPv6 address space. We use these data to derive maps of the Internet at various granularity levels: IP, router, AS.
The archive of raw IPv4 data and raw IPv6 data, topology measurement tools, and sample analysis code are available to the research community. The AS adjacencies derived daily from our active connectivity measurements are also available.
Topology maps of the Internet are an important tool for characterizing this critical infrastructure and understanding its properties, dynamic behavior, and evolution. They are also crucial for realistic modeling, simulation, and analysis of Internet infrastructure and other large-scale complex networks. These maps can be constructed for different layers (or granularities), e.g., fiber/copper cable, IP address, router, Point-of-Presence (PoP), autonomous system (AS), ISP/organization. Router-level and PoP-level topology maps can inform and calibrate vulnerability assessments. ISP-level topologies, sometimes called AS-level or interdomain routing topologies (although an ISP may own multiple ASes, so an AS-level graph has slightly finer granularity), provide insights into the technical, economic, policy, and security needs of the largely unregulated peering ecosystem.
CAIDA has been conducting measurements of the Internet macroscopic topology since 1998. Our tools, first skitter (1998-2008) and now scamper (2007-present), have been tracking global IP-level connectivity by sending probe packets from a set of source monitors to millions of geographically distributed destinations in the IPv4 address space. Since 2003, we have been continuously probing IPv6 address space as well.
The gathered data:
- characterize macroscopic connectivity and performance of the Internet,
- allow various topological and geographical representations at multiple levels of aggregation granularity,
- provide a valuable input for empirically-based modeling of the Internet behavior and properties,
- improve situational awareness of the critical cyberinfrastructure for government agencies.
We use two sources of data for Macroscopic Topology studies: forward Internet (IP) path information from traceroute-like active measurements and routing data from inter-domain BGP routing tables.
Active measurement data
With co-funding from DHS and NSF, we have created a powerful and versatile distributed measurement infrastructure, Archipelago (Ark), that makes use of measurement nodes located in various networks worldwide and connected via the Internet to a central server at CAIDA. Ark has pioneered new features and functionality of distributed measurement infrastructure, including flexible and efficient measurement and data collection methods. Ark topology datasets (described below), available to academic researchers and government agencies via the CAIDA topology data request web form as well as via PREDICT, provide unprecedented intelligence regarding macroscopic Internet connectivity.
To gather topology data, Ark monitors are continuously running scamper, a powerful and flexible tool that actively probes forward IP paths and round trip times (RTTs) from a host to a list of destinations.
For IPv4 topology, we measure IP-level paths to a dynamically generated list of IP addresses covering all /24 prefixes in routed IPv4 address space. To do this efficiently, we employ a process called team probing, in which we group monitors into teams and dynamically divide the measurement work among team members. This parallelization allows us to cycle through all routed /24s in a reasonable amount of time: about 2-3 days for a team of 22-24 monitors at 100 probes per second. We currently have three teams of monitors active, and each team probes independently.
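The division of work within a team can be illustrated with a minimal sketch. It makes two assumptions that the text does not state: that each /24 is represented by one randomly chosen address, and that prefixes are handed out in simple round-robin order, which stands in for the dynamic allocation Ark actually uses. The function names `random_target` and `team_probe` are hypothetical.

```python
import ipaddress
import random
from collections import deque

def random_target(prefix24, rng):
    """Pick one address inside a /24 (random choice is an assumption here)."""
    net = ipaddress.ip_network(prefix24)
    return str(net.network_address + rng.randrange(net.num_addresses))

def team_probe(prefixes, monitors, rng):
    """Hand out prefixes from a shared queue, one per monitor per round.

    Round-robin is a simplification: a real system would give the next
    prefix to whichever monitor becomes idle first.
    """
    queue = deque(prefixes)
    assigned = {m: [] for m in monitors}
    while queue:
        for m in monitors:
            if not queue:
                break
            assigned[m].append(random_target(queue.popleft(), rng))
    return assigned
```

Because every prefix leaves the shared queue exactly once, each /24 is probed by a single team member per cycle, which is what makes adding monitors to a team shorten the cycle.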
We store the collected traceroute data in individual files organized by scamper host and by day, where a day is defined as the 24-hour period starting at midnight UTC. These files constitute the CAIDA IPv4 Routed /24 Topology Dataset. We have collected billions of traceroutes since scamper probing started in September 2007; the collection continues to grow by about 500 million traceroutes per month.
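As a small illustration of the per-host, per-day organization, the sketch below maps a measurement timestamp to a bucket keyed by monitor and UTC day. The key layout and the function name `daily_file_key` are hypothetical and do not reflect the dataset's actual file naming.

```python
from datetime import datetime, timezone

def daily_file_key(monitor, unix_ts):
    """Return a 'monitor/YYYYMMDD' key, with the day boundary at midnight UTC."""
    day = datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%Y%m%d")
    return f"{monitor}/{day}"
```

Converting with an explicit UTC timezone keeps the day boundary identical for all monitors, regardless of the local time zone of the host running the code.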
We augment the data (Routed /24 Topology Dataset) with DNS names for all intermediate addresses and responding destinations seen in the data. We resolve the names using an in-house bulk DNS lookup service called HostDB that can look up millions of addresses per day.
We also provide the IPv4 Routed /24 AS Links Dataset, which is available for unrestricted public download. It contains Autonomous System (AS) links derived daily from the raw IP paths and represents an AS-level graph of the Internet. This is our most popular dataset, usually downloaded about 40 times per month.
We use our tools kapar and MIDAR to determine which IP addresses collected by Ark traceroutes belong to the same router. This process is called alias resolution. We are working on combining multiple alias-resolution techniques into a unified tool and system for generating router-level topology from the IP Topology Dataset.
Currently, we generate router-level topologies of the Internet every 6 months or so and release them as part of the Macroscopic Internet Topology Data Kits (ITDK).
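Once alias sets are known, collapsing an IP-level graph into a router-level graph is mechanical. The sketch below shows only that final step, using a union-find structure. It does not reproduce how kapar or MIDAR infer aliases, and the function name `build_router_graph` and its input formats are assumptions made for illustration.

```python
def build_router_graph(ip_links, alias_sets):
    """Collapse IP-level links into router-level links.

    ip_links:   iterable of (ip_a, ip_b) pairs seen as adjacent hops
    alias_sets: iterable of sequences of addresses inferred to share a router
    Each router is identified by one representative address.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for aliases in alias_sets:
        aliases = list(aliases)
        for other in aliases[1:]:
            union(aliases[0], other)

    router_links = set()
    for a, b in ip_links:
        ra, rb = find(a), find(b)
        if ra != rb:  # drop links between two interfaces of one router
            router_links.add(frozenset((ra, rb)))
    return router_links
```

Addresses that appear in no alias set remain their own single-interface router, so incomplete alias resolution yields a graph with more nodes than the true router topology rather than an error.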
For IPv6 topology, about half of the Ark monitors that are IPv6-capable also conduct continuous probing of BGP-announced IPv6 prefixes (/48 or shorter), with each monitor probing a single random destination in each prefix. A full probing cycle in the IPv6 address space takes 48 hours. The resulting raw data constitute the IPv6 Topology Dataset.
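The target selection described above can be sketched as follows: keep announced IPv6 prefixes of length /48 or shorter and draw one random address from each. The function name `ipv6_targets` and the input format (a list of prefix strings) are assumptions; how Ark actually generates its destination list is not specified here.

```python
import ipaddress
import random

def ipv6_targets(announced_prefixes, rng, max_len=48):
    """Return {prefix: random address} for IPv6 prefixes of length <= max_len."""
    targets = {}
    for prefix in announced_prefixes:
        net = ipaddress.ip_network(prefix)
        if net.version != 6 or net.prefixlen > max_len:
            continue  # skip IPv4 entries and prefixes longer than /48
        offset = rng.randrange(net.num_addresses)
        targets[prefix] = str(net.network_address + offset)
    return targets
```

A uniformly random address in a large IPv6 prefix will rarely be an active host, which is acceptable for path discovery because the intermediate hops toward the prefix are what the measurement records.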
CAIDA researchers are currently developing methods and algorithms to enable IPv6 alias resolution and to derive AS Links for IPv6 topology.
We obtain routing information from inter-domain BGP routing tables provided by the Route Views project. This project gathers BGP routing perspectives from more than 60 major ISPs worldwide. Each BGP table is a list of AS paths that packets should traverse from a given router to the prefix containing their destination IP address. The AS terminating an AS path for a given prefix in a core routing table is administratively responsible for this prefix and is called the origin AS. We use the combined BGP table to map IP addresses in our IP paths to their origin ASes. As of 2013, the combined table typically has more than 460k globally routable prefixes.
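A minimal sketch of this mapping follows. It takes the last AS in each AS path as the origin, resolves an address by longest-prefix match (an assumption, although it is the conventional rule), and then derives AS links from adjacent hops that map to different ASes. The link rule, the function names, and the linear scan are illustrative only; a production tool would use a radix trie and would need to handle multi-origin prefixes.

```python
import ipaddress

def build_origin_table(routes):
    """routes: iterable of (prefix, as_path); the origin AS is the last AS."""
    return [(ipaddress.ip_network(p), path[-1]) for p, path in routes]

def origin_as(addr, table):
    """Longest-prefix match of addr against the table; None if unrouted."""
    ip = ipaddress.ip_address(addr)
    best = None
    for net, asn in table:
        if ip.version == net.version and ip in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, asn)
    return best[1] if best else None

def as_links(ip_path, table):
    """AS links from consecutive hops; None marks a non-responding hop."""
    links = set()
    prev = None
    for hop in ip_path:
        asn = origin_as(hop, table) if hop else None
        if asn is not None and prev is not None and asn != prev:
            links.add((prev, asn))
        prev = asn  # an unmapped or silent hop breaks the chain
    return links
```

Resetting `prev` on an unmapped or silent hop means the sketch never asserts an AS adjacency across a gap in the path, at the cost of missing some real links.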
Advantages and limitations of the data
CAIDA's multi-year collection of traceroute paths represents one of the most comprehensive archives of macroscopic topology measurements available to the Internet research community. These data are a key input for realistic simulation and modeling research efforts. However, it is important to clearly understand the intrinsic limitations of the topology data obtained with traceroute-like measurements.
- The success of our measurements depends on both the intermediate IP addresses in a path and, albeit to a lesser extent, the target destination returning ICMP responses to our ICMP echo-request probes. ICMP filtering or rate limiting may reduce the completeness of the discovered topology data, so care should be taken during analysis not to infer a lack of connectivity from a lack of data.
- The scamper tool cannot map IP paths behind firewalls or Network Address Translators (NATs).
- Obtaining a complete map of the Internet requires a large number of vantage points probing towards a large number of destinations. Although we have more than 70 vantage points that probe every routed /24 network (more than 10 million as of 2013), our measurements only provide a sampling of the global Internet, and there may be unintended sampling biases intrinsic to measuring the Internet in this way.
- MPLS, a layer 2 technology used by some network providers, may affect the accuracy of the topology obtained by traceroute-based techniques. In particular, MPLS may exaggerate the apparent connectivity of routers at the IP layer.
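The caution about missing responses can be made concrete when extracting links from a single trace. In the sketch below, a hop that did not respond is represented as `None` (an assumed encoding), and a link is recorded only between two consecutive responding hops. Pairs that touch a silent hop are counted as unknown rather than treated as evidence of no connectivity. The function name `links_and_gaps` is hypothetical.

```python
def links_and_gaps(hops):
    """Split consecutive hop pairs into observed links and unknown pairs.

    hops: list of IP address strings in path order, None for no response.
    Returns (links, unknown), where unknown counts adjacent pairs in which
    at least one endpoint did not respond.
    """
    links = []
    unknown = 0
    for a, b in zip(hops, hops[1:]):
        if a is not None and b is not None:
            links.append((a, b))
        else:
            unknown += 1
    return links, unknown
```

Keeping the unknown count alongside the links lets later analysis distinguish a sparsely connected region from one that is merely poorly observed.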
Using publicly available BGP data repositories is a popular method for inferring Internet topology. BGP tables and updates are easy to parse, process and comprehend. However, current BGP data repositories (the most commonly used are Route Views and RIPE RIS) also have limitations that affect the accuracy of topology inferences.
- BGP data reflects a control-plane signal rather than how traffic actually travels toward a destination network.
- Because of the limited number and range of their vantage points, these repositories tend to capture much less of the peripheral (non-core) connectivity (peering) among regional networks.
- Use of static tables does not reveal short-term AS path variations and load balancing; BGP updates can reveal some such phenomena but constitute a noisy signal.
Both traceroute and BGP-based topology mapping methods have strengths and limitations; it is still an open challenge to integrate both types of data to maximize the scope, precision, and accuracy of Internet topology inferences.