ITDK Datasets
This page describes the historical ITDK releases from 2010 to the present. Each release consists of
- an IPv4 router-level topology,
- an IPv6 router-level topology,
- router-to-AS assignments,
- geographic locations of routers, and
- DNS lookups of all observed IP addresses
ITDK releases were produced from traceroutes conducted on the Archipelago (Ark) measurement infrastructure. We used a subset of the IPv4 Routed /24 Topology Dataset, which contains traceroutes to randomly-chosen destinations in each routed /24 BGP prefix.
Latest release
Prior Releases
- ITDK-2024-02
- ITDK-2023-03
- ITDK-2022-02
- ITDK-2021-03
- ITDK-2020-01 - ITDK-2020-08
- ITDK-2019-01 - ITDK-2019-04
- ITDK-2018-03
- ITDK-2017-02 - ITDK-2017-08
- ITDK-2016-03 - ITDK-2016-09
- ITDK-2015-08
- ITDK-2014-04 - ITDK-2014-12
- ITDK-2013-04 - ITDK-2013-07
- ITDK-2012-07
- ITDK-2011-04 - ITDK-2011-10
- ITDK-2010-01 - ITDK-2010-04 - ITDK-2010-07
- historical ITDK releases 0204 and 0304 from April 2002 and 2003 collected with skitter (use with caution)
Router-Level Topologies
The IPv4 router-level topology is derived from aliases resolved with MIDAR and iffinder, which yield the highest confidence aliases with very low false positives. The IPv6 router-level topology is derived from alias resolution with speedtrap. The router-level topology is provided in two files, one giving the nodes and another giving the links. There are additional files that assign ASes to each node, provide the geographic location of each node, and provide the DNS name of each observed interface.
Nodes File
The nodes file lists the set of interfaces that were inferred to be on each router.
File format:
node <node_id>: <i1> <i2> ... <in>
Each line indicates that a node node_id has interfaces i1 to in. Interface addresses in 224.0.0.0/3 (IANA reserved space for multicast) are not real addresses. They were artificially generated to identify potentially unique non-responding interfaces in traceroute paths.
NOTE: In ITDK release 2013-04 and earlier, we used addresses in 0.0.0.0/8 instead of 224.0.0.0/3 for these non-real addresses.
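A minimal sketch of reading the nodes format described above, assuming one record per line, whitespace-separated interface addresses, and comment lines that begin with "#" (an assumption about the files, not something stated here). The function names and the sample addresses are hypothetical.

```python
import ipaddress

# Placeholder (non-real) address ranges: 224.0.0.0/3 in current releases,
# 0.0.0.0/8 in release 2013-04 and earlier.
PLACEHOLDER_NETS = (
    ipaddress.ip_network("224.0.0.0/3"),
    ipaddress.ip_network("0.0.0.0/8"),
)


def parse_node_line(line):
    """Parse 'node N1:  1.2.3.4 5.6.7.8' into ('N1', ['1.2.3.4', '5.6.7.8'])."""
    # Split on the first colon only, so IPv6 addresses stay intact.
    head, _, rest = line.partition(":")
    keyword, node_id = head.split()
    if keyword != "node":
        raise ValueError(f"not a node line: {line!r}")
    return node_id, rest.split()


def is_placeholder(addr):
    """Return True if addr falls in one of the artificial IPv4 ranges."""
    ip = ipaddress.ip_address(addr)
    if ip.version != 4:
        return False
    return any(ip in net for net in PLACEHOLDER_NETS)


def load_nodes(lines):
    """Build {node_id: [interface, ...]}, skipping blanks and '#' comments."""
    nodes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        node_id, interfaces = parse_node_line(line)
        nodes[node_id] = interfaces
    return nodes
```

Filtering with is_placeholder before counting interfaces avoids treating the artificial addresses as observed ones.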
Links File
The links file lists the set of routers and router interfaces that were inferred to be sharing each link. Note that these are IP-layer links, not physical cables or graph edges. More than two nodes can share the same IP link if the nodes are all connected to the same layer 2 switch (POS, ATM, Ethernet, etc.).
File format:
link <link_id>: <N1>:i1 <N2>:i2 [<N3>:[i3]] .. [<Nm>:[im]]
Each line indicates that a link link_id connects nodes N1 to Nm. If it is known which router interface is connected to the link, then the interface address is given after the node ID separated by a colon (e.g., "N1:1.2.3.4"); otherwise, only the node ID is given (e.g., "N1").
By joining the node and link data, one can obtain the known and inferred interfaces of each router. Known interfaces actually appeared in some traceroute path. Inferred interfaces arise when we know that some router N1 connects to a known interface i2 of another router N2, but we never saw an actual interface on the former router. The interfaces on an IP link are typically assigned IP addresses from the same prefix, so we assume that router N1 must have an inferred interface from the same prefix as i2.
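The join described above can be sketched as follows. This is an illustration under assumptions, not CAIDA's own tooling: it assumes whitespace-separated endpoints, treats an endpoint without an address as one whose interface on that link must be inferred, and uses hypothetical function names and sample data.

```python
def parse_link_line(line):
    """Parse 'link L1:  N1:1.2.3.4 N2' into ('L1', [('N1', '1.2.3.4'), ('N2', None)])."""
    head, _, rest = line.partition(":")
    keyword, link_id = head.split()
    if keyword != "link":
        raise ValueError(f"not a link line: {line!r}")
    endpoints = []
    for token in rest.split():
        # Split on the first colon only, so IPv6 addresses stay intact.
        node_id, _, addr = token.partition(":")
        endpoints.append((node_id, addr or None))
    return link_id, endpoints


def join_nodes_and_links(nodes, links):
    """Combine node and link data.

    nodes: {node_id: [addr, ...]} from the nodes file
    links: {link_id: [(node_id, addr_or_None), ...]} from the links file
    Returns (known, inferred): known maps each node to its observed
    addresses; inferred maps each node to the links on which it has an
    interface that was never observed.
    """
    known = {node_id: set(addrs) for node_id, addrs in nodes.items()}
    inferred = {}
    for link_id, endpoints in links.items():
        for node_id, addr in endpoints:
            if addr is None:
                inferred.setdefault(node_id, []).append(link_id)
            else:
                known.setdefault(node_id, set()).add(addr)
    return known, inferred
```

For a node listed under inferred, the prefix of the other endpoints' addresses on the same link indicates where its unseen interface address would lie.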
Node-AS File
The node-AS file assigns an AS to each node found in the nodes file. We use our final bdrmapIT and Hoiho assignment heuristic to infer the owner AS of each node.
Addresses that belong to the address space of an Internet exchange point (as self-identified in PeeringDB: https://www.peeringdb.com/) are excluded from the AS analysis, as we don't consider them to be part of the AS-level topology.
File format:
node.AS <node_id> <AS> <method>
Each line indicates that the node node_id is owned/operated by the given AS, as inferred with the given method. The method is one of the following:
refinement
- AS inferred based on the ASes of surrounding routers
origins
- AS inferred based on the AS announcing the longest matching prefixes for the router interface IP addresses
unknown
- routers that bdrmapIT could not infer an AS for
lasthop
- bdrmapIT assignment by the ASN of the last hop in the path
as-hints
- bdrmapIT assignment using ASN hints in the hostname extracted with Hoiho rules
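As a rough sketch, the node-AS records can be read and tallied by inference method as shown here. The code assumes four whitespace-separated fields per line and "#" comment lines, and keeps the AS field as a string because its exact form is not specified above; the function names and sample AS numbers are hypothetical.

```python
from collections import Counter


def parse_node_as_line(line):
    """Parse 'node.AS N1 64500 refinement' into ('N1', '64500', 'refinement')."""
    fields = line.split()
    if len(fields) != 4 or fields[0] != "node.AS":
        raise ValueError(f"not a node.AS line: {line!r}")
    _, node_id, asn, method = fields
    return node_id, asn, method


def count_methods(lines):
    """Count how many nodes were assigned by each inference method."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _, _, method = parse_node_as_line(line)
        counts[method] += 1
    return counts
```

A tally like this gives a quick view of how much of a release rests on each method before the assignments are used in further analysis.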
Hostnames File
The hostnames file contains the hostname for every IP address in the router-level topology for which a reverse DNS lookup succeeded.
File format:
<timestamp> <IP_address> <hostname>
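A short sketch of loading the hostnames file into a lookup table, assuming three whitespace-separated fields per line and "#" comment lines. The timestamp is kept as a string since its format is not described here; the function names and the sample record are hypothetical.

```python
def parse_hostname_line(line):
    """Parse '<timestamp> <IP_address> <hostname>' into a 3-tuple of strings."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields: {line!r}")
    timestamp, address, hostname = fields
    return timestamp, address, hostname


def load_hostnames(lines):
    """Build {IP_address: hostname}; a later record overrides an earlier one."""
    names = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _, address, hostname = parse_hostname_line(line)
        names[address] = hostname
    return names
```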
Node-Geolocation File
The node-geolocation file contains the geographic location for each node in the nodes file. We use Hoiho, IXP data, and MaxMind's GeoLite City database, in that order.
File format:
node.geo <node_id>: <continent> <country> <region> <city> <latitude> <longitude> <population> <method>
The method field takes one of the following values:
hoiho
- geolocated by Hoiho from DNS names
ix
- geolocated from the location of the IXP whose address space contains one of the node's IP addresses
maxmind
- geolocated from the node's addresses in MaxMind
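The sketch below parses one node-geolocation record into named fields. It assumes the fields after the colon are tab-separated so that an empty field (for example a missing region or population) keeps its position; that separator is an assumption and should be checked against the release being used. The class name, function name, and sample values are hypothetical.

```python
from typing import NamedTuple, Optional


class NodeGeo(NamedTuple):
    node_id: str
    continent: str
    country: str
    region: str
    city: str
    latitude: Optional[float]
    longitude: Optional[float]
    population: str
    method: str


def _to_float(text):
    """Convert a coordinate field, returning None when it is empty."""
    return float(text) if text else None


def parse_node_geo_line(line):
    """Parse one 'node.geo <node_id>: ...' record (tab-separated, assumed)."""
    head, _, rest = line.rstrip("\n").partition(":")
    keyword, node_id = head.split()
    if keyword != "node.geo":
        raise ValueError(f"not a node.geo line: {line!r}")
    fields = [field.strip() for field in rest.split("\t")]
    # A tab directly after the colon produces an empty leading field.
    if len(fields) == 9 and fields[0] == "":
        fields = fields[1:]
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    continent, country, region, city, lat, lon, population, method = fields
    return NodeGeo(node_id, continent, country, region, city,
                   _to_float(lat), _to_float(lon), population, method)
```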
Data Availability
- Data older than one year is available as a public dataset. You can obtain access using this form.
- The most recent year of data is available for use by academic researchers and US government agencies. This data is also available to corporate entities (including corporate researchers) who participate in CAIDA's membership program. Please complete and submit the online form to request access to the most recent data. It usually takes about two to three business days to process your request. Upon approval, you will receive an email with instructions on how to download the data you requested. If you have any questions or problems using this form, please contact data-info@caida.org.
Acceptable Use Agreement for the public data
Please read the terms of the CAIDA Acceptable Use Agreement (AUA) for Publicly Accessible Datasets.
When referencing this data (as required by the AUA), please use:
The CAIDA Macroscopic Internet Topology Data Kit - <release dates>,
https://www.caida.org/catalog/datasets/internet-topology-data-kit
You are required to report your publications using this dataset to CAIDA.
Request Data Access
- Access the publicly available CAIDA Ark IPv4 Internet Topology Data Kits Dataset (and other topology data)
- Request Access to the restricted CAIDA Ark IPv4 Internet Topology Data Kits Dataset
Note that two historical ITDK releases made in 2002 and 2003 are also available as public datasets. These datasets should be used with caution, as they were constructed using completely different procedures and using topology data collected on the now decommissioned skitter measurement infrastructure.
Internet Topology Data Kit Process
Below we describe the various steps involved in producing the datasets that are part of the ITDK.
Alias resolution
For alias resolution, we rely on several CAIDA tools: iffinder, MIDAR, and speedtrap. MIDAR (Monotonic ID-based Alias Resolution) expands on the IP velocity techniques of RadarGun. We use the traceroute dataset as input to MIDAR and iffinder, which infer the interfaces that belong to the same router and the sets of two or more routers on the same IP link (a construct that represents either a point-to-point link, or a LAN or cloud with multiple attached IP addresses). We use iffinder and MIDAR to construct IPv4 topologies, and speedtrap to construct IPv6 topologies.
DNS hostnames
We used zdns to look up the intermediate addresses seen in the traceroute dataset.
BGP
To assign IP addresses to ASes, we used a publicly available BGP dump provided by Routeviews. BGP (Border Gateway Protocol) is the protocol for exchanging interdomain routing information among ASes in the Internet. A single origin AS typically announces ("originates") each routable prefix via BGP. We perform IP-to-AS mapping by assigning an IP address to the origin AS of the longest matching prefix for that IP address in the BGP tables.
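The longest-matching-prefix rule above can be illustrated with Python's standard ipaddress module. This is only a sketch: it scans a small in-memory table linearly, whereas processing a full BGP table would normally use a prefix trie, and the prefixes and AS numbers shown are documentation values, not real routing data. The function names are hypothetical.

```python
import ipaddress


def build_table(prefix_to_origin):
    """Convert {'198.51.100.0/24': 64497, ...} into (network, origin AS) pairs."""
    return [(ipaddress.ip_network(prefix), asn)
            for prefix, asn in prefix_to_origin.items()]


def ip_to_as(addr, table):
    """Return the origin AS of the longest prefix containing addr, or None."""
    ip = ipaddress.ip_address(addr)
    best_net, best_asn = None, None
    for net, asn in table:
        # Membership is False when the IP versions differ, so one table
        # can hold IPv4 and IPv6 prefixes together.
        if ip in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_asn = net, asn
    return best_asn


# Hypothetical table: a covering /16 and a more specific /24.
table = build_table({
    "198.51.0.0/16": 64496,
    "198.51.100.0/24": 64497,
})
```

With this table, an address inside 198.51.100.0/24 maps to the origin of the /24 rather than the covering /16, and an address matching no prefix maps to no AS.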
AS relationships
We used the BGP data to annotate each interdomain link with one of two simplified business relationships -- customer-provider (the customer pays the provider), or settlement-free peer (typically no money is exchanged) -- using the classification algorithm by Luckie et al., resulting in what we call the AS relationship dataset.
AS assignment
bdrmapIT combines the AS inference heuristics found in both bdrmap and MAP-IT. By synthesis of these two techniques, bdrmapIT is designed to accurately identify interdomain links in a traceroute dataset using a graph refinement strategy. Our algorithm proceeds as follows:
- The first step (Graph Construction) processes the traceroutes, extracting paths and generating a prioritized graph of the interfaces. It annotates the graph using BGP data and AS relationship inferences. This interface graph is used by our heuristics in Step 3 to infer router ownership and interdomain links.
- In the second step (Graph Initialization), we use the paths and alias resolution to annotate routers that always appear at the end of a traceroute. We process these first since we are unable to refine these inferences later.
- Finally, in the third step (Graph Refinement), we use the graph refinement loop to annotate the remaining routers and interfaces using the prioritized interface graph and the path data. After each iteration of the loop we refine the inferences, enabling additional accuracy.
Hoiho uses initial router-to-AS assignment inferences from bdrmapIT, combined with DNS hostnames, to create a set of rules that can extract AS annotations recorded by network operators in hostnames. Those rules are then used to add an additional refinement to bdrmapIT's initial assignment. For further details, please see the papers bdrmap: Inference of Borders Between IP Networks, MAP-IT: Multipass Accurate Passive Inferences from Traceroute, and Learning to Extract and Use ASNs in Hostnames.
Geolocation
We use a combination of publicly known Internet eXchange (IX) point information, Hoiho hostname mapping, and MaxMind's free GeoLite City database to provide the geographic location (at city granularity) of routers in the router-level graph.
We collect the hostnames for interfaces on routers, and then use these hostnames to map routers to geographic locations using Hoiho's heuristics. For further details on Hoiho's geographic inference, please see the Learning to Extract Geographic Information from Internet Router Hostnames paper.
For routers where Hoiho's heuristics do not return an inference, we use CAIDA's Internet eXchange Points (IXPs) database, which combines information from Wikipedia's list of Internet Exchange Points, PeeringDB, and PCH's IX database. The geographic city names are then mapped to GeoNames locations. This data provides a set of geographic locations and prefixes for each IX. A router is mapped to an IX's location if the router contains at least one interface from the IX's address space and the IX is located in a single city.
All remaining routers are geolocated using MaxMind's GeoLite City database. Because this database maps individual IP addresses to locations, we take the following steps to find the location of each router. We first map each interface on a router to a location. If all interfaces map to the same location, then we assign that location to the router; otherwise, we do not assign any location to the router (that is, the router does not appear in the geolocation file).
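The order of precedence and the all-interfaces-agree rule described in this section can be sketched as below. The inputs are assumed to be precomputed by the caller: a Hoiho location (or None), the location of a single-city IX whose address space covers one of the router's interfaces (or None), and one MaxMind location per interface. Treating an interface with no MaxMind location as a disagreement is an assumption; the function names are hypothetical.

```python
def consensus_location(interface_locations):
    """Return the location shared by all interfaces, or None if they differ."""
    distinct = set(interface_locations)
    if len(distinct) == 1:
        # A lone None (no interface could be located) also yields None.
        return next(iter(distinct))
    return None


def geolocate_router(hoiho_location, ix_location, maxmind_locations):
    """Apply the sources in order: Hoiho, then IX, then MaxMind consensus.

    Returns (location, method), or None when no source yields a location,
    in which case the router would be absent from the geolocation file.
    """
    if hoiho_location is not None:
        return hoiho_location, "hoiho"
    if ix_location is not None:
        return ix_location, "ix"
    location = consensus_location(maxmind_locations)
    if location is not None:
        return location, "maxmind"
    return None
```

Locations here can be any hashable value, such as a (city, country) tuple.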
Funding support
Support for the Macroscopic Internet Topology Data Kit project is provided by the Department of Homeland Security (DHS) contract N66001-08-C-2029 Cartographic Capabilities for Critical Cyberinfrastructure, cooperative agreement FA8750-18-2-0049 Advancing Scientific Study of Internet Security Topological Stability, grant award 2015-ST-061-CIRC01, subaward 077083-16369 Quantifying Interdependencies of the Logical/Physical Internet topologies, S&T contract HHSP 233201600012C Science of Internet Security: Technology Experimental Research, S&T contract NBCHC070133 Supporting Research Development of Security Technologies through Network Security Data Collection, and S&T cooperative agreement FA8750-12-2-0326 Supporting Research and Development of Security Technologies through Network and Security Data Collection; and by the National Science Foundation (NSF) grants C-ACCEL OIA-1937165 Knowledge of Internet Structure: Measurement, Epistemology, Technology, CNS-0958547 Internet Laboratory for Empirical Network Science, CNS-1414177 Mapping Interconnection in the Internet: Colocation, Connectivity Congestion, CNS-1513283 Internet Laboratory for Empirical Network Science: Next Phase, CNS-1901517 Strategies for Large-Scale IPv6 Active Mapping, CNS-1925729 Facilitating Advances in Network Topology Analysis, CNS-2120399 Integrated Library for Advancing Network Data Science, OAC-1724853 Integrated Platform for Applied Network Data Analysis, and OAC-2131987 Designing a Global Measurement Infrastructure to Improve Internet Security. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DHS, NSF, or the U.S. Government.