Macroscopic Internet Topology Data Kit (ITDK)

Name: Macroscopic Internet Topology Data Kit
Creator: CAIDA

The ITDK contains data about connectivity and routing gathered from a large cross-section of the global Internet. This dataset is useful for studying the topology of the Internet, among other uses.

ITDK Datasets

This page presents a general summary of the ITDK features and file types. For release-specific filenames, counts, collection dates, and full format details, consult the README packaged with the ITDK release you use.

Releases less than one year old are available only through an approved access request; older releases are publicly available.

ITDK releases can consist of

one or more IPv4 topologies,
one or more IPv6 topologies,
node-to-AS assignments,
geographic locations of nodes,
target lists identifying the IP addresses probed for aliases, and
DNS lookups of IP addresses probed for aliases.

Not all ITDK releases include all of these items. ITDK releases are produced from traceroutes conducted on the Archipelago (Ark) measurement infrastructure. To build IPv4 topologies, we use subsets of the IPv4 Routed /24 Topology Dataset, which contains traceroutes to randomly chosen destinations in each routed /24 BGP prefix. To build IPv6 topologies, we use subsets of the Ark IPv6 Topology Dataset, which contains traceroutes to randomly chosen destinations in each routed BGP prefix (/48 or shorter) as well as to the ::1 address of those prefixes.

Alias Resolution

When extracting IP addresses from traceroute paths for alias resolution (determining which addresses belong to the same system), we included only addresses that appeared as an intermediate hop in some traceroute path, as these are most likely router interface addresses. Each nodes file includes nodes with these router interface addresses, responding destinations, and the Ark monitors, as well as placeholder nodes artificially generated to identify potentially unique non-responding interfaces in traceroute paths. The router interface addresses probed for aliases represent a small subset of the nodes in each nodes file. Most ITDK releases provide a file that identifies the addresses probed for aliases, the name of which is identified in the README for that ITDK release.

For current alias resolution, we rely on several approaches: iffinder, MIDAR, speedtrap, and SNMPv3 probes.

iffinder infers IPv4 aliases if UDP probes to different addresses solicit ICMP port unreachable messages with the same source address.
MIDAR infers IPv4 aliases if IPID time series built from responses solicited from multiple targets indicate those targets derive IPID values from a shared counter.
speedtrap infers IPv6 aliases if IPID time series built from responses solicited from multiple targets indicate those targets derive IPID values from a shared counter.
SNMPv3 responses from probes sent to different addresses can imply aliases if the responses include the same SNMPv3 EngineID, the SNMPv3 response indicates the system was booted at the same time, and the number of times the SNMPv3 engine has restarted is the same.
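
The SNMPv3 rule above amounts to grouping addresses by a three-part key. The following is a minimal sketch of that grouping, not the code used to build the ITDK. The SnmpResponse record, its field names, and the use of exact equality for boot time are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class SnmpResponse(NamedTuple):
    # Hypothetical record layout; the ITDK does not define this structure.
    address: str
    engine_id: str
    boot_time: int      # inferred time of last boot, seconds since the epoch
    engine_boots: int   # number of times the SNMPv3 engine has restarted


def group_snmp_aliases(responses):
    """Group addresses whose responses share engine ID, boot time, and restart count."""
    groups = defaultdict(set)
    for r in responses:
        # Exact equality on boot time is a simplifying assumption.
        groups[(r.engine_id, r.boot_time, r.engine_boots)].add(r.address)
    # Only groups with two or more addresses imply aliases.
    return [sorted(addrs) for addrs in groups.values() if len(addrs) > 1]
```

Responses that share an EngineID but differ in restart count fall into different groups, so they are not reported as aliases.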

Topology files derived from active probing (iffinder, MIDAR, speedtrap, and SNMPv3) have fewer false positives than analytical approaches, but are incomplete because active probing relies on implementation artifacts and on systems being responsive to probes. The name of each topology file in the ITDK identifies the approaches used to build that topology. For example, files named midar-iff used MIDAR and iffinder probing. Similarly, files named speedtrap-snmp used speedtrap and SNMPv3 probing.

Nodes Files

Nodes files list the sets of addresses that are inferred to be on the same system. Each nodes file includes routers, responding destinations, and the Ark monitors, as well as placeholder nodes artificially generated to identify potentially unique non-responding interfaces in traceroute paths. The router interface addresses probed for aliases represent a small subset of the nodes in each nodes file. Most ITDK releases provide a file that identifies the addresses probed for aliases, the name of which is identified in the README for that ITDK release.

File format:

node <node_id>:   <i₁>   <i₂>   ...   <iₙ>

Each line indicates that a node node_id has interfaces i₁ to iₙ. Interface addresses in 224.0.0.0/3 (IANA multicast and reserved space) or 0.0.0.0/8 (IANA reserved for self-identification) did not appear in traceroutes; they were artificially generated to identify potentially unique non-responding interfaces in traceroute paths.
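
As an illustration of the format above, the following sketch parses one nodes-file line and flags placeholder addresses. The function names are hypothetical, and the sketch assumes fields are separated by whitespace; check the README of your release for comment lines and other details before relying on it.

```python
import ipaddress

# Ranges the text identifies as artificially generated placeholder addresses.
PLACEHOLDER_NETS = (
    ipaddress.ip_network("224.0.0.0/3"),
    ipaddress.ip_network("0.0.0.0/8"),
)


def parse_node_line(line):
    """Parse 'node N1:  addr1  addr2 ...' into (node_id, [addresses])."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "node":
        raise ValueError(f"not a node line: {line!r}")
    node_id = fields[1].rstrip(":")
    return node_id, fields[2:]


def is_placeholder(address):
    """Return True if an IPv4 address falls in one of the placeholder ranges."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and any(ip in net for net in PLACEHOLDER_NETS)
```

A caller would typically skip blank lines and any line that does not begin with "node" before calling parse_node_line.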

Links Files

The links files list the set of routers and router interfaces that were inferred to be sharing each link. Note that these are IP layer links, not physical cables or graph edges. More than two nodes can share the same IP link if the nodes are all connected to the same layer 2 switch (POS, ATM, Ethernet, etc.).

File format:

link <link_id>:   <N₁>:i₁   <N₂>:i₂   [<N₃>:[i₃]]   ...   [<Nₘ>:[iₘ]]

Each line indicates that a link link_id connects nodes N₁ to Nₘ. If it is known which interface address is connected to the link, then the interface address is given after the node ID separated by a colon (e.g., “N1:1.2.3.4”); otherwise, only the node ID is given (e.g., “N1”). By joining the node and link data, one can obtain the known and inferred interfaces of each router. Known interfaces actually appeared in some traceroute path. Inferred interfaces arise when we know that some router N₁ connects to a known interface i₂ of another router N₂, but we never saw an actual interface on the former router. The interfaces on an IP link are typically assigned IP addresses from the same prefix, so we assume that router N₁ must have an inferred interface from the same prefix as i₂.
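
A links-file line can be split into its members as in the sketch below. The function name is hypothetical and whitespace-separated fields are an assumption. A member without an address (written as a bare node ID, or a node ID followed by a colon) is returned with None, which marks an interface that must be inferred rather than one seen in a traceroute.

```python
def parse_link_line(line):
    """Parse 'link L1:  N1:addr  N2:addr  N3 ...' into (link_id, members).

    members is a list of (node_id, address) pairs; address is None when the
    interface on that node is not known.
    """
    fields = line.split()
    if len(fields) < 2 or fields[0] != "link":
        raise ValueError(f"not a link line: {line!r}")
    link_id = fields[1].rstrip(":")
    members = []
    for token in fields[2:]:
        # Split on the first colon only, so IPv6 addresses stay intact.
        node_id, _, address = token.partition(":")
        members.append((node_id, address or None))
    return link_id, members
```

Joining these members with the parsed nodes file on node_id gives, for each router, the interfaces that were observed and the links on which an interface is only inferred.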

Interfaces Files

The interfaces file provides additional information about interfaces included in the provided graphs.

File format:

<address> [<node_id>] [<link_id>] [T] [D]

Each optional field may or may not be present. The node_id starts with “N” and identifies the node, or alias set, to which the address belongs. The link_id starts with “L” and identifies the link to which the address is attached, if known.

The “T” flag indicates that the address appeared in at least one traceroute as a transit hop. The “D” flag indicates that the address appeared in at least one traceroute as a responding destination hop. The flags are not mutually exclusive.
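
Because every field after the address is optional, a parser has to classify each token by its form rather than its position. The sketch below does this under the assumptions stated in the text (node IDs start with "N", link IDs start with "L"); the function name and the returned dictionary keys are hypothetical.

```python
def parse_interface_line(line):
    """Parse '<address> [<node_id>] [<link_id>] [T] [D]' into a dict."""
    fields = line.split()
    if not fields:
        raise ValueError("empty interface line")
    record = {
        "address": fields[0],
        "node_id": None,
        "link_id": None,
        "transit": False,      # "T": seen as a transit hop
        "destination": False,  # "D": seen as a responding destination
    }
    for token in fields[1:]:
        # Check the single-letter flags first, then the prefixed IDs.
        if token == "T":
            record["transit"] = True
        elif token == "D":
            record["destination"] = True
        elif token.startswith("N"):
            record["node_id"] = token
        elif token.startswith("L"):
            record["link_id"] = token
        else:
            raise ValueError(f"unexpected field {token!r} in {line!r}")
    return record
```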

Router Files

The routers file contains nodes with at least one IP address considered during alias resolution because the address appeared as an intermediate hop in some traceroute path. Researchers interested solely in the router portion of the topology should use this file.

Each interface appears on a line by itself, annotated with a lowercased DNS name when available. Individual routers are separated by an empty line and are prefaced with their node ID values and AS inference.

Example:

# node2id: 1
# node2as: 64496
192.0.2.1 esr1-ge-5-0-0.jfk2.example.net
192.0.2.10 esr1-ge-5-0-6.jfk2.example.net
192.0.31.60

# node2id: 2
# node2as: 64496
192.0.2.2 esr2-xe-4-0-0.lax.example.net
192.0.2.5 esr2-xe-4-0-1.lax.example.net
192.0.31.8

This is the same file format that Hoiho uses.
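
The example above can be read with a small block-oriented parser. This is a sketch based only on the example shown here; the function name and the dictionary layout are hypothetical, and it assumes that the address and DNS name are separated by whitespace.

```python
def parse_routers(text):
    """Parse a routers file into a list of router dicts.

    Each dict has 'node_id', 'asn', and 'interfaces', where interfaces is a
    list of (address, dns_name) pairs and dns_name is None when absent.
    """
    routers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None  # an empty line ends the current router
            continue
        if current is None:
            current = {"node_id": None, "asn": None, "interfaces": []}
            routers.append(current)
        if line.startswith("#"):
            key, _, value = line.lstrip("# ").partition(":")
            if key == "node2id":
                current["node_id"] = value.strip()
            elif key == "node2as":
                current["asn"] = value.strip()
        else:
            parts = line.split(None, 1)
            name = parts[1] if len(parts) > 1 else None
            current["interfaces"].append((parts[0], name))
    return routers
```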

Node-to-AS Files

Each node-to-AS file assigns an AS to nodes identified by their ID value. Current ITDK releases use bdrmapIT.

File format:

node.AS   <node_id>   <AS>   <heuristic>

Each line indicates that the node node_id is owned/operated by the given AS, as inferred with the given heuristic. bdrmapIT heuristic labels can include:

origins – AS inferred based on the AS announcing the longest matching prefixes for the addresses.
lasthop – AS inferred based on the destination AS of the IP addresses tracerouted.
refinement – AS inferred based on the ASes of surrounding routers.
as-hints – AS inferred based on AS hints embedded in PTR records, checked with bdrmapIT.
unknown – bdrmapIT could not infer an AS for the node.
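
A node-to-AS file can be loaded and summarized by heuristic as in the sketch below. The function names are hypothetical, fields are assumed to be whitespace-separated, and the AS is kept as a string rather than converted to an integer.

```python
from collections import Counter


def parse_node_as_line(line):
    """Parse 'node.AS <node_id> <AS> <heuristic>' into a 3-tuple."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "node.AS":
        raise ValueError(f"not a node.AS line: {line!r}")
    heuristic = fields[3] if len(fields) > 3 else None
    return fields[1], fields[2], heuristic


def count_heuristics(lines):
    """Count how many nodes were assigned by each heuristic label."""
    counts = Counter()
    for line in lines:
        if line.startswith("node.AS"):
            counts[parse_node_as_line(line)[2]] += 1
    return counts
```

Counting labels this way gives a quick view of how much of a topology was annotated by each heuristic, for example how many nodes fall under refinement versus origins.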

Node-to-Geolocation Files

Each node-to-geolocation file contains an inferred geographic location of each node in the nodes file, where possible. Current ITDK geolocation files are provided for IPv4 topologies.

File format:

node.geo   <node_id>:   <continent>   <country>   <region>   <city>   <latitude>   <longitude>   <method>

ITDKs beginning with 2021-03 have the method column; prior ITDKs do not have that column, because they only used MaxMind GeoLite City. Each line indicates that the node with the given node_id has the given geographic location. Columns after the colon are tab-separated. The fields have the following meanings:

continent – a two-letter continent code
- AF – Africa
- AN – Antarctica
- AS – Asia
- EU – Europe
- NA – North America
- OC – Oceania
- SA – South America
country – a two-letter ISO 3166 Country Code.
region – a two or three alphanumeric region code.
city – city or town in ISO-8859-1 encoding (up to 255 characters).
latitude and longitude – signed floating point numbers.
method – the geolocation method that inferred the location
- hoiho – inferred using Hoiho’s rules
- ix – inferred based on the known location of an IXP
- maxmind – inferred using MaxMind
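
The sketch below parses one node.geo line, splitting on tabs so that city names containing spaces stay intact. It assumes a single tab follows the colon and treats the method column as optional, since releases before 2021-03 omit it; the function name and dictionary keys are hypothetical.

```python
def parse_node_geo_line(line):
    """Parse a 'node.geo <node_id>:' line with tab-separated columns."""
    head, sep, rest = line.rstrip("\r\n").partition(":")
    head_fields = head.split()
    if not sep or len(head_fields) != 2 or head_fields[0] != "node.geo":
        raise ValueError(f"not a node.geo line: {line!r}")
    # Assumption: exactly one tab separates the colon from the first column.
    if rest.startswith("\t"):
        rest = rest[1:]
    cols = rest.split("\t")
    cols += [""] * (7 - len(cols))  # pad missing trailing columns
    continent, country, region, city, lat, lon, method = cols[:7]
    return {
        "node_id": head_fields[1],
        "continent": continent,
        "country": country,
        "region": region,
        "city": city,
        "latitude": float(lat) if lat else None,
        "longitude": float(lon) if lon else None,
        "method": method or None,  # None for releases without this column
    }
```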

Hoiho Geolocation Rule Files

The geolocation rules file contains per-suffix geolocation rules inferred with Hoiho. The rules can be used with the Hoiho apply script provided in the Hoiho paper’s data supplement.

Target Address Files

The target address files contain IPv4 or IPv6 addresses, one per line, that identify addresses probed for aliases.

DNS Name Files

The DNS name files contain PTR lookup responses for addresses listed in the corresponding target address file. These lookups are collected close to the alias-resolution runs, which makes them useful for extracting DNS-based ground truth that can be compared with alias-resolution results.

Each line contains three tab-separated entries:

<timestamp>   <IP-address>   <DNS-name>
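
For comparison with alias-resolution results, the PTR names are usually most convenient as an address-to-name map. The sketch below builds one. The function names are hypothetical, the timestamp is kept as a string because its format is not specified here, and names are lowercased to match the routers file convention.

```python
def parse_dns_line(line):
    """Split one DNS name line into (timestamp, address, name)."""
    fields = line.rstrip("\r\n").split("\t", 2)
    if len(fields) != 3:
        raise ValueError(f"expected three tab-separated fields: {line!r}")
    return fields[0], fields[1], fields[2]


def load_ptr_names(lines):
    """Build an address -> lowercased DNS name map; later lines win."""
    names = {}
    for line in lines:
        if not line.strip():
            continue
        _, address, name = parse_dns_line(line)
        names[address] = name.lower()
    return names
```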

Data Availability

Data older than one year is available as a public dataset. You can obtain access using the Public Macroscopic Internet Topology Data From Archipelago User Info Request form.
The most recent year of data is available for use by academic researchers and US government agencies. This data is also available to corporate entities (including corporate researchers) who participate in CAIDA's membership program. Please complete and submit the CAIDA Topology Data Request Form to request access to the most recent data. It usually takes about two to three business days to process your request. Upon approval you will receive an email with instructions on how to download the data you requested. If you have any questions or problems using this form, please contact data-info@caida.org.

Acceptable Use Agreement for the public data

Please read the terms of the CAIDA Acceptable Use Agreement (AUA) for Publicly Accessible Datasets.

When referencing this data (as required by the AUA), please use:

The CAIDA Macroscopic Internet Topology Data Kit - <release dates>,
https://www.caida.org/catalog/datasets/internet-topology-data-kit

You are required to report your publications using this dataset to CAIDA.

Request Data Access

Access the publicly available CAIDA Ark IPv4 Internet Topology Data Kits Dataset (and other topology data)
Request Access to the restricted CAIDA Ark IPv4 Internet Topology Data Kits Dataset