Macroscopic Internet Topology Data Kit (ITDK)
The ITDK contains data about connectivity and routing gathered from a large cross-section of the global Internet. This dataset is useful for studying the topology of the Internet at the router-level, among other uses.

ITDK Datasets

Recent ITDKs (less than two years old) are only available for restricted access. ITDKs older than two years are available for public access. See "data availability" section below.

The latest ITDK release, 2017-02, currently consists of

  • two related IPv4 router-level topologies,
  • router-to-AS assignments,
  • geographic location of each router, and
  • DNS lookups of all observed IP addresses.
We plan to expand this ITDK release with other complementary datasets as they become available (more details are below). This ITDK is produced from active measurements conducted on our Archipelago (Ark) measurement infrastructure. For the IPv4 router-level topology, we used a subset of the IPv4 Routed /24 Topology Dataset, which is collected continuously. Specifically, we obtained the raw IPv4 topology by performing traceroutes to randomly chosen destinations in each routed /24 BGP prefix using 121 Ark monitors located in 42 countries from Jan 22 to Feb 7, 2017.

Prior Releases

Router-Level Topologies

The two included IPv4 router-level topologies are generated from the same IP-level topology but differ in the accuracy and completeness of the alias resolution performed to create them. The first topology is derived from aliases resolved with MIDAR and iffinder, which yield the highest confidence aliases with very low false positives. The second topology also uses MIDAR and iffinder but further includes aliases resolved with kapar, which significantly increases the coverage of aliases but at the cost of false positives (which inflate the size of routers and decrease the router count). Researchers should choose the topology to use depending on the relative importance they place on accuracy vs. comprehensiveness of alias resolution. Choose the most accurate alias resolution if uncertain about which to use.

Each router-level topology is provided in two files, one giving the nodes and another giving the links. There are additional files that assign ASes to each node, provide the geographic location of each node, and provide the DNS name of each observed interface.

Nodes File

The nodes file lists the set of interfaces that were inferred to be on each router.

File format: node <node_id>:   <i1>   <i2>   ...   <in>

Each line indicates that a node node_id has interfaces i1 to in. Interface addresses in the IANA reserved multicast space are not real addresses. They were artificially generated to identify potentially unique non-responding interfaces in traceroute paths.

NOTE: ITDK release 2013-04 and earlier used a different address range for these non-real addresses.
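
As a minimal sketch of reading this format, the Python below parses nodes lines into a dictionary. The node IDs, addresses, and the comment line in the sample are made up for illustration, and the function name parse_nodes_line is hypothetical; real files may contain details this sketch does not handle.

```python
def parse_nodes_line(line):
    """Parse 'node <node_id>: <i1> <i2> ...' into (node_id, [interfaces]).

    Returns None for any line that does not start with 'node '.
    """
    line = line.strip()
    if not line.startswith("node "):
        return None  # skip blanks, comments, and anything unexpected
    head, _, rest = line[len("node "):].partition(":")
    return head.strip(), rest.split()


# Hypothetical sample lines, not taken from a real ITDK release.
sample = [
    "# hypothetical comment line",
    "node N1:  10.0.0.1 10.0.0.5",
    "node N2:  192.0.2.7",
]
nodes = dict(p for p in map(parse_nodes_line, sample) if p)
print(nodes)
```

Because IPv4 addresses contain no colon, splitting on the first colon separates the node ID from its interface list.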

Links File

The links file lists the set of routers and router interfaces that were inferred to be sharing each link. Note that these are IP layer links, not physical cables or graph edges. More than two nodes can share the same IP link if the nodes are all connected to the same layer 2 switch (POS, ATM, Ethernet, etc).

File format: link <link_id>:   <N1>:i1   <N2>:i2   [<N3>:[i3]]   ..   [<Nm>:[im]]

Each line indicates that a link link_id connects nodes N1 to Nm. If it is known which router interface is connected to the link, then the interface address is given after the node ID separated by a colon (e.g., "N1:i1"); otherwise, only the node ID is given (e.g., "N1").

By joining the node and link data, one can obtain the known and inferred interfaces of each router. Known interfaces actually appeared in some traceroute path. Inferred interfaces arise when we know that some router N1 connects to a known interface i2 of another router N2, but we never saw an actual interface on the former router. The interfaces on an IP link are typically assigned IP addresses from the same prefix, so we assume that router N1 must have an inferred interface from the same prefix as i2.
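
The following Python is a rough sketch of parsing a links line and spotting the nodes whose interface on the link is not known (the candidates for an inferred interface). The sample line, IDs, and the name parse_links_line are illustrative assumptions, and the sketch assumes IPv4 addresses only.

```python
def parse_links_line(line):
    """Parse 'link <link_id>: <N1>:i1 <N2> ...' into (link_id, members).

    members is a list of (node_id, interface) pairs, where interface is
    None when only the node ID is given.
    """
    line = line.strip()
    if not line.startswith("link "):
        return None
    head, _, rest = line[len("link "):].partition(":")
    members = []
    for token in rest.split():
        node_id, _, iface = token.partition(":")
        members.append((node_id, iface or None))
    return head.strip(), members


# Hypothetical sample: N2 is on the link, but its interface was never observed.
link_id, members = parse_links_line("link L1:  N1:10.0.0.1  N2  N3:10.0.0.3")
unknown = [node_id for node_id, iface in members if iface is None]
print(link_id, members, unknown)
```

Nodes listed in unknown are those for which an interface would have to be inferred from the prefix of the known interfaces on the same link.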

Node-AS File

The node-AS file assigns an AS to each node found in the nodes file. We use our final Election+Degree assignment heuristic to infer the owner AS of each node.

Addresses that belong to the address space of an Internet exchange point (as self-identified in PeeringDB) are excluded from the AS analysis, as we don't consider them to be part of the AS-level topology.

File format: node.AS   <node_id>   <AS>   <method>

Each line indicates that the node node_id is owned/operated by the given AS, as inferred with the given method. There are three inference methods:

  • a router has only a single choice of AS
  • multiple ASes are present on a router, and one AS occurs more frequently than the rest
  • multiple ASes are present on a router, but no AS occurs the most frequently, so the choice is based on AS degree
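
A short Python sketch of reading node-AS lines and tallying how many nodes were assigned by each method is shown here. The method labels method_a and method_b are placeholders, not the strings used in the actual files, and the sample records are invented.

```python
from collections import Counter


def parse_node_as_line(line):
    """Parse 'node.AS <node_id> <AS> <method>'; None for other lines."""
    fields = line.split()
    if len(fields) != 4 or fields[0] != "node.AS":
        return None
    _, node_id, asn, method = fields
    return node_id, asn, method  # the AS is kept as a string


# Hypothetical sample lines with placeholder method labels.
sample = [
    "node.AS N1 64500 method_a",
    "node.AS N2 64501 method_b",
    "node.AS N3 64500 method_a",
]
records = [r for r in map(parse_node_as_line, sample) if r]
owner = {node_id: asn for node_id, asn, _ in records}
by_method = Counter(method for _, _, method in records)
print(owner, by_method)
```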

Hostnames File

The hostnames file contains the hostname for every IP address in the router-level topology for which a successful reverse DNS lookup could be found.

File format: <timestamp>   <IP_address>   <hostname>

Node-Geolocation File

The node-geolocation file contains the geographic location for each node in the nodes file. We use MaxMind's GeoLite City database for the geographic mapping.

File format: node.geo   <node_id>:   <continent>   <country>   <region>   <city>   <latitude>   <longitude>
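
The sketch below parses one node-geolocation line in Python. It assumes the fields after the colon are tab-separated (so that city names containing spaces survive), which is an assumption about the file layout rather than something stated above; the sample line and the name parse_node_geo_line are also illustrative.

```python
def parse_node_geo_line(line):
    """Parse a 'node.geo <node_id>: ...' line, assuming tab-separated fields."""
    head, sep, rest = line.rstrip("\n").partition(":")
    parts = head.split()
    if not sep or len(parts) != 2 or parts[0] != "node.geo":
        return None
    fields = rest.split("\t")
    if fields and fields[0].strip() == "":
        fields = fields[1:]  # drop the empty piece before the first separator
    if len(fields) != 6:
        return None
    continent, country, region, city, lat, lon = fields
    return parts[1], {
        "continent": continent,
        "country": country,
        "region": region,
        "city": city,
        "latitude": float(lat),
        "longitude": float(lon),
    }


# Hypothetical sample line.
node_id, geo = parse_node_geo_line(
    "node.geo N1:\tNA\tUS\tCA\tSan Diego\t32.72\t-117.16"
)
print(node_id, geo["city"], geo["latitude"], geo["longitude"])
```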

Future Work

We are in the process of expanding the ITDK with additional data that will combine router and AS-level views of the Internet topology. The AS link dataset will contain the set of AS links as inferred from combining the AS assignment and node datasets. The AS relationship dataset will contain the business relationship associated with each AS link in the AS link dataset. These datasets will be available in the coming months.

Data Availability

ITDKs older than approximately two years are available as a public dataset. The most recent two years are subject to restricted access.

Acceptable Use Agreement for the restricted data

Access to these data is subject to the terms of the following CAIDA Acceptable Use Agreement (printable version in PDF format)

Acceptable Use Agreement for the public data

Access to these data is subject to the terms of the following CAIDA Acceptable Use Agreement (printable version in PDF format)

When referencing this data (as required by the AUA), please use:

The CAIDA UCSD Internet Topology Data Kit - <release date>,
Also, please report your publication to CAIDA.

Data Access

  • Access the publicly available CAIDA Internet Topology Data Kits (and other topology data)
  • Request Access to the restricted CAIDA Internet Topology Data Kits (and other topology data)

Restricted access is granted to all available Ark-based ITDK releases from the last two years. All older ITDKs are included in the public dataset. Two historical ITDK releases made in 2002 and 2003 are also available as public datasets, though they should be used with caution: they were constructed using completely different procedures and from topology data collected on the now-decommissioned skitter measurement infrastructure.

Internet Topology Data Kit Process

Below we describe the various steps involved in producing the datasets that are part of the ITDK.

Alias resolution

For alias resolution, we rely on several CAIDA tools: iffinder, kapar, MIDAR (recent tech report), and speedtrap. MIDAR (Monotonic ID-based Alias Resolution, a tool we hope to release soon) expands on the IP velocity techniques of RadarGun, while kapar expands the analytical techniques of APAR. We use the traceroute dataset as input to MIDAR and iffinder, which generate output files used as input to kapar. kapar heuristically infers the set of interfaces that belong to the same router, and the set of two or more routers on the same IP link (a construct that represents either a point-to-point link, or LAN or cloud with multiple attached IP addresses). We use iffinder, kapar, and MIDAR to construct IPv4 topologies, and speedtrap to construct IPv6 topologies.

DNS hostnames

We have an in-house bulk DNS lookup service called HostDB that can look up millions of addresses per day. We look up all intermediate addresses and responding destinations seen in the Topology Dataset. Each ITDK contains a list of the successful lookups for each IP address found in the nodes dataset.


IP-to-AS mapping

To assign IP addresses to ASes, we used a publicly available BGP dump provided by Routeviews. BGP (Border Gateway Protocol) is the protocol for exchanging interdomain routing information among ASes in the Internet. A single origin AS typically announces ("originates") each routable prefix via BGP. We perform IP-to-AS mapping by assigning an IP address to the origin AS of the longest matching prefix for that IP address in the BGP tables.
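
Longest-prefix matching can be sketched in a few lines of Python with the standard ipaddress module. The prefix table below is invented (documentation prefixes and private-use AS numbers), and the linear scan is for clarity only; a real pipeline over a full BGP table would more likely use a trie.

```python
import ipaddress


def ip_to_as(ip, prefix_to_origin):
    """Return the origin AS of the longest prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None  # (prefix length, origin AS) of the best match so far
    for prefix, origin in prefix_to_origin.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, origin)
    return best[1] if best else None


# Hypothetical routing table: a /24 and a more specific /25 inside it.
table = {
    "198.51.100.0/24": 64500,
    "198.51.100.128/25": 64501,
}
print(ip_to_as("198.51.100.200", table))  # covered by both; the /25 wins
print(ip_to_as("198.51.100.10", table))   # covered only by the /24
print(ip_to_as("192.0.2.1", table))       # no covering prefix
```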

AS relationships

We used the BGP data to annotate each interdomain link with one of three simplified business relationships -- customer-provider (the customer pays the provider), settlement-free peer (typically no money is exchanged), and sibling (both ASes belong to the same organization) -- using the classification algorithm by Dimitropoulos et al., resulting in what we call the AS relationship dataset.

AS assignment

The goal of the AS assignment process is to determine the AS that owns each router. For each router r, we first create an AS frequency matrix that counts the number of interfaces (known and inferred) from each AS that appears on r. The ASes in this frequency matrix represent the set of possible owner ASes of r. We use the following AS assignment heuristics to assign a router r to an AS.

The Election heuristic assigns router r to the AS with the highest frequency in r's AS frequency matrix. The intuition behind this heuristic is that routers will tend to have more interfaces in the address space of their owner. If two ASes from r's AS frequency matrix have the same count, then Election cannot decide an owner.
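
One possible reading of Election, sketched in Python: the AS frequency matrix for a single router is represented as a dictionary from AS number to interface count (a representation chosen here for illustration), and a tie for the highest count yields no decision.

```python
from collections import Counter


def election(as_counts):
    """Return the AS with the strictly highest count, or None on a tie."""
    ranked = Counter(as_counts).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # two ASes share the highest count: Election cannot decide
    return ranked[0][0]


# Hypothetical frequency data for three routers.
print(election({64500: 3, 64501: 1}))  # clear winner
print(election({64500: 2, 64501: 2}))  # tie
print(election({64500: 1}))            # single-AS router
```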

The Customer heuristic uses the AS relationship dataset to assign relationships to each pair of ASes from r's AS frequency matrix. Customer assigns r to the AS inferred to be a customer of every other AS in r's AS frequency matrix. This heuristic is based on the common practice that customer and provider routers typically interconnect using addresses from the provider's address space. Consequently, a router with interfaces from both the customer and provider address spaces is assigned to the customer.
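
A minimal Python sketch of the Customer rule follows. It assumes the AS relationship dataset has been reduced to a set of (customer, provider) pairs, which is a simplification made for this example; the AS numbers are invented.

```python
def customer(as_counts, customer_of):
    """Return the AS that is a customer of every other AS on the router.

    customer_of is a set of (customer_as, provider_as) pairs. Returns None
    if no such AS exists or if the router has only one candidate AS.
    """
    candidates = list(as_counts)
    for asn in candidates:
        others = [o for o in candidates if o != asn]
        if others and all((asn, o) in customer_of for o in others):
            return asn
    return None


# Hypothetical relationship data: 64501 is a customer of 64500.
relationships = {(64501, 64500)}
print(customer({64500: 2, 64501: 1}, relationships))  # 64501 is the customer
print(customer({64500: 1, 64502: 1}, relationships))  # relationship unknown
```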

For the Degree heuristic, we first generate an AS-level graph by assuming full-mesh connectivity among ASes from each router's AS frequency matrix. We then use this graph to generate an AS degree for each AS. Degree assigns router r to the smallest-degree AS from r's AS frequency matrix, i.e., the AS most likely to be the customer AS, based on similar intuition as the Customer heuristic.
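
The Degree heuristic might be sketched in Python as below. The sketch assumes an AS's degree is its number of distinct neighbors in the full-mesh graph and that a tie for the smallest degree gives no decision; both are interpretations made for this example, and the router data is invented.

```python
from collections import defaultdict
from itertools import combinations


def as_degrees(router_as_sets):
    """Build the full-mesh AS graph from per-router AS sets; return degrees."""
    neighbors = defaultdict(set)
    for as_set in router_as_sets:
        for a, b in combinations(sorted(as_set), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {asn: len(adjacent) for asn, adjacent in neighbors.items()}


def degree(as_counts, degrees):
    """Return the smallest-degree AS on the router, or None on a tie."""
    ranked = sorted(as_counts, key=lambda a: degrees.get(a, 0))
    if not ranked:
        return None
    if len(ranked) > 1 and degrees.get(ranked[0], 0) == degrees.get(ranked[1], 0):
        return None
    return ranked[0]


# Hypothetical routers: 64500 appears alongside three different ASes.
degrees = as_degrees([{64500, 64501}, {64500, 64502}, {64500, 64503}])
print(degrees)
print(degree({64500: 2, 64501: 1}, degrees))  # smaller degree wins
```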

For the Neighbor heuristic, we first determine the set of single-AS routers to which r is connected (its single-AS neighbors). We create a new AS frequency matrix that counts the number of single-AS neighbors of r from each AS. The Neighbor heuristic assigns r to the AS with the largest frequency (most single-AS neighbors), based on the intuition that a router is connected to a larger number of single-AS routers in its owner AS. Neighbor produces an ambiguous assignment when multiple ASes have the same (highest) frequency.
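
As an illustration, the Neighbor rule can be written in Python as follows. The input is assumed to be a list holding, for each router connected to r, the set of ASes seen on that neighbor; this representation and the sample data are assumptions for the sketch.

```python
from collections import Counter


def neighbor(neighbor_as_sets):
    """Return the AS owning the most single-AS neighbors, or None on a tie."""
    # Only neighbors whose interfaces all fall in one AS are counted.
    counts = Counter(next(iter(s)) for s in neighbor_as_sets if len(s) == 1)
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: several ASes share the highest frequency
    return ranked[0][0]


# Hypothetical neighbors of r: two in 64500, one in 64501, one multi-AS.
print(neighbor([{64500}, {64500}, {64501}, {64500, 64501}]))
print(neighbor([{64500}, {64501}]))  # ambiguous
```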

In case one of the previously described primary heuristics is unable to produce an AS assignment, we attempt to break the tie using one of the other heuristics as a tie-breaker. Our evaluation in the paper shows that Neighbor was the best stand-alone heuristic, while Election+Degree was the best combination.
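
A compact Python sketch of an Election+Degree combination follows. It assumes that the Degree tie-breaker chooses only among the ASes tied for the highest count; the paper may define the combination differently. The returned labels "election" and "degree" are names chosen for this example, not the method strings used in the node-AS file.

```python
def election_plus_degree(as_counts, degrees):
    """Apply Election first; break ties with the smallest AS degree.

    Returns (asn, label), or (None, None) for an empty frequency matrix.
    If tied ASes also share the smallest degree, the first one encountered
    is returned, which is an arbitrary choice made by this sketch.
    """
    if not as_counts:
        return None, None
    top = max(as_counts.values())
    tied = [asn for asn, count in as_counts.items() if count == top]
    if len(tied) == 1:
        return tied[0], "election"
    return min(tied, key=lambda asn: degrees.get(asn, 0)), "degree"


# Hypothetical inputs.
print(election_plus_degree({64500: 3, 64501: 1}, {}))
print(election_plus_degree({64500: 2, 64501: 2}, {64500: 10, 64501: 2}))
```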

For further details, please see the paper Toward Topology Dualism: Improving the Accuracy of AS Annotations for Routers.

Geolocation


Method            Within 10 km   Num. geolocated
MaxMind GeoLite   33%            1980
DDec              67%            848
IX                92%            349
A comparison between a ground truth set of router geographic locations and the location inferred by each method. MaxMind provided the most geographic locations, while IX provided the most accuracy.

We use a combination of publicly known Internet eXchange (IX) point information, DDec hostname mapping, and MaxMind's free GeoLite City database to provide the geographic location (at city granularity) of routers in the router-level graph.

We generated an internal IX database containing information combined from: BGP Looking glass database, Wikipedia's list of Internet Exchange Points, PeeringDB, and PCH's IX database. The geographic city names are then mapped to Geoname's locations. This data provides a set of geographic locations and prefixes for each IX. A router is mapped to an IX's location if the router contains at least one interface from the IX's address space and the IX is located in a single city.

We then collect the hostnames for interfaces on routers that are not geolocated to an IX. These hostnames are then mapped to geographic locations using DDec's heuristics.

All remaining routers are geolocated using MaxMind's GeoLite City database. Because this database maps individual IP addresses to locations, we take the following steps to find the location of each router (which by definition has multiple interfaces). We first map each interface on a router to a location. If all interfaces map to the same location, then we assign that location to the router; otherwise, we do not assign any location to the router (that is, the router does not appear in the geolocation file).
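
The all-interfaces-agree rule can be sketched in Python as below. The lookup table stands in for the geolocation database and is invented, and treating an interface with no database entry as a disagreement is an assumption of this sketch.

```python
def router_location(interfaces, ip_to_location):
    """Return the shared location of all interfaces, or None if they differ."""
    locations = {ip_to_location.get(ip) for ip in interfaces}
    if len(locations) == 1:
        (location,) = locations
        return location  # None when no interface could be mapped at all
    return None  # interfaces disagree (or some are unmapped): no assignment


# Hypothetical per-address lookup results.
lookup = {
    "198.51.100.1": ("US", "San Diego"),
    "198.51.100.2": ("US", "San Diego"),
    "203.0.113.9": ("US", "Los Angeles"),
}
print(router_location(["198.51.100.1", "198.51.100.2"], lookup))
print(router_location(["198.51.100.1", "203.0.113.9"], lookup))
```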

To evaluate the accuracy of these methods, we compared each inferred geographic location against a ground truth set: a collection of routers that were within 3 milliseconds of an Atlas monitor whose location is known. MaxMind provided geographic locations for the largest number of routers, 1980, followed by DDec with 848 and IX with 349. Accuracy ran in the opposite order: IX mapped 92% of its routers to within 10 km of the ground truth location, DDec 67%, and MaxMind only 33%.
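
A comparison of this kind could be computed along the following lines in Python. The haversine great-circle formula and the 6371 km Earth radius are choices made for this sketch, not necessarily the distance measure used in the evaluation, and the coordinates are invented.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    radius_km = 6371.0  # mean Earth radius, an approximation
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def fraction_within(pairs, limit_km=10.0):
    """Fraction of (inferred, truth) coordinate pairs within limit_km."""
    hits = sum(
        1 for inferred, truth in pairs
        if haversine_km(*inferred, *truth) <= limit_km
    )
    return hits / len(pairs)


# Hypothetical pairs: one exact match and one that is a degree of latitude off.
pairs = [((0.0, 0.0), (0.0, 0.0)), ((0.0, 0.0), (1.0, 0.0))]
print(fraction_within(pairs))
```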

  Last Modified: Fri Mar-24-2017 15:31:51 PDT