Internet Protocol Address (IP) Geolocation Bibliography

This page presents an annotated bibliography of papers and datasets related to the field of Internet Protocol (IP) address geolocation. Many applications require the association of Internet resources with an accurate geographic label at some granularity. For some applications knowing the country of origin might be sufficient; for others a more precise indication at state, city or zip code granularity, or even a specific latitude/longitude is needed. Below we provide an overview of published literature related to geolocation in an attempt to describe the current state of the art. We conducted this literature search as part of our efforts to compare geolocation tools.

Introduction

IP address geolocation reminds one of the classic bumper sticker, "think globally, act locally." In today's far-reaching Internet, organizations and institutions of all kinds, from corporations to governments, want exactly that: the ability to communicate with the entire world and, at the same time, to develop applications that help them target, limit, and customize their messages, balance resources, and coordinate responses based on the location of the receiver. Organizations accomplish this by using tools and services that translate an IP address or prefix range into a geographic location (country, state, city, zip, geographic latitude/longitude) associated with the address(es). Simple, right?

However, which method(s) work best? Which sources of geolocation services and information return the most reliable locations and at what cost? What is the geographic resolution? Further, if a source provides the geographic location of the owner of an IP address, is this location the same as the location where the device is actually broadcasting and receiving packets? And, if different, can the difference be quantified?

What constitutes a "good" geolocation result? Some numbers: with a total land area of 1.5×10⁸ km² and 195 countries, the average country size on Earth is about 7.7×10⁵ km², or a linear size of 880 km. The surface area of the US is about 10⁷ km². With 50 states, over 3,000 counties and on the order of 43,000 zip codes, the average linear size of a state, county or zip code is about 450, 55 and 15 km, respectively. Looking at another big country, China (about the same size as the USA) has 33 provinces, 333 prefectures, about 3000 counties, and about 42000 townships, giving sizes of 550, 170, 60 and 18 km, respectively. To begin to be useful a geolocation method would at the very least need to be able to pinpoint the correct country, and, in large countries like the USA or China, the correct state or province. Looking at the above numbers this would require geolocation errors of at most a few hundred kilometers. An accuracy measured in tens of kilometers would be required to be effective at a truly local level (county or zip code).
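The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, assuming (as the text appears to) that the "linear size" of a region is the side of a square with the average area; the function name and the rounded inputs are chosen here for illustration only.

```python
import math


def average_linear_size_km(total_area_km2: float, n_units: int) -> float:
    """Side length (km) of a square whose area equals the average unit area."""
    return math.sqrt(total_area_km2 / n_units)


# Rounded inputs quoted in the text above.
WORLD_LAND_KM2 = 1.5e8
US_AREA_KM2 = 1.0e7

for label, area, count in [
    ("country (world average)", WORLD_LAND_KM2, 195),
    ("US state", US_AREA_KM2, 50),
    ("US county", US_AREA_KM2, 3000),
    ("US zip code", US_AREA_KM2, 43000),
]:
    print(f"{label}: about {average_linear_size_km(area, count):.0f} km")
```

Because the inputs are rounded, the results agree with the quoted sizes only to within a few kilometers.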

Useful Definitions

A number of concepts are commonly encountered in the geolocation literature. We define the main ones here.

IP geolocation describes methods of assigning a geographic label to an individual Internet Protocol address (IP).
A Vantage Point (VP) is a measurement infrastructure node with a known geographic location.
A Landmark is a responsive Internet identifier with a known location. Measurements from a VP to a landmark can serve to calibrate measurements to targets whose geographic locations are unknown. Some papers use the term Active Landmarks to refer to points which act as both landmark and vantage point. Often they are part of an infrastructure platform like PlanetLab.
A Target is an Internet identifier whose location will be inferred from a given method. Typically some targets have known geographic locations (ground truth), which researchers can use to evaluate the accuracy of their geolocation methodology.
A Location is a geographic place that geolocation techniques attempt to infer for a given target. Examples include cities and ISP Points of Presence (PoPs).

Not all terms are used in all papers.
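To make the role of ground truth concrete, the sketch below computes the error distance between inferred and known target locations, and the median of those errors, which is the accuracy figure most of the papers below report. It is an illustration only: the function names, the dictionary layout, and the use of the haversine formula with a mean Earth radius are assumptions made here, not a procedure taken from any particular paper.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption: spherical Earth)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def median_error_km(estimates, ground_truth):
    """Median error distance over targets present in both dictionaries.

    Both arguments map a target identifier to a (lat, lon) tuple.
    """
    errors = [
        great_circle_km(*estimates[t], *ground_truth[t])
        for t in ground_truth
        if t in estimates
    ]
    return median(errors)
```

For example, `median_error_km({"t1": (48.85, 2.35)}, {"t1": (48.85, 2.35)})` returns 0.0, since the estimate coincides with the ground truth.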

Geolocation Papers

The tables below contain annotations for papers on the topic of geolocation. We have collected and reviewed papers published between 1996 and 2013, starting with papers from peer-reviewed academic research conferences, and then including papers cited from this initial seeding, as well as follow-up papers written by the same authors. We provide a flexible interactive table that supports selection of relevant attributes from these papers.

The first table emphasizes papers that directly address geolocation methodology, introducing new methods, extensions to previous methods, performance analysis, etc. The second table includes papers that address other geolocation-related issues, including applications of geolocation, and coordinate-based methods for modeling network delays.

Alongside author and publication information, the tables include a number of additional columns.

Paper Type gives a category indication; we use "survey", "analysis", "methodology", "tools", and "other". "Methodology" papers develop specific methods of geolocation; "analysis" papers focus on providing a quantitative foundation for geolocation methods (e.g., by comparing results for several methods); "survey" papers provide an overview of geolocation issues.

Data describes the type of data on which the results claimed in the paper are based. We mention here if the paper describes "ground truth" (authoritative mappings between IP addresses and geographic locations) used to validate geolocation results.

Findings gives a brief description of the main results claimed in the paper.

Measurement Setup gives an indication of the experimental setup (probes, landmarks, targets) used in a geolocation experiment (where appropriate).

Note: This is a wide multi-column table, and horizontal scrolling may be required.

ID	Title	Year	Publication	Authors	Paper Type	Method	Data	Findings	Measurement Setup
P-01	A Means for Expressing Location Information in the Domain Name System	1996	RFC Editor	Davis, C. and Vixie, P. and Goodwin, T. and Dickinson, I.	RFC 1876	Description of DNS LOC records
P-02	GTrace - A Graphical Traceroute Tool	1999	Usenix LISA	Periakaruppan, R. and Nemeth, E.	tool	GUI for displaying traceroute results on geographic map		uses geo-info from DNS LOC, WHOIS (NetGeo), IP-to-location databases, hostname heuristics, combined with RTT-based verification
P-03	Where in the World is netgeo.caida.org?	2000	INET	Moore, D. and Periakaruppan, R. and Donohoe, J. and claffy, k.	tool
P-04	An Investigation of Geographic Mapping Techniques for Internet Hosts	2001	SIGCOMM	Padmanabhan, V.N. and Subramanian, L.	methodology	IP2Geo suite: GeoTrack (traceroute info+host name heuristics), GeoPing (based on similarities in RTT-delay patterns), GeoCluster (IP-to-location database+BGP routing info)	IP-to-location datasets: 41772 Hotmail users at state granularity; 181246 IPs from bCentral web-hosting company at zipcode granularity; 142807 IPs from FooTV at zipcode granularity	GeoTrack most promising with median errors of 28 (well-connected hosts) to few hundred km. Median errors for geolocation on same set of univ hosts: 28 km for GeoCluster; 102 for GeoTrack; 382 for GeoPing	14 landmarks and 365 targets (in US, mostly univ-based)
P-05	Similarity Models for Internet Host Location	2003	ICON	Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B.	analysis	explores various similarity measures for GeoPing-type (see P-04) geolocation	RIPE TTM delay measurements from Dec 2002 to Jan 2003	similarity based on "city-block" distance measure works best	55 landmarks (RIPE TTM hosts)
P-06	Toward a measurement-based geographic location service	2004	Passive and Active Network Measurement Workshop (PAM)	Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B.	analysis	explores various similarity measures for GeoPing-type (see P-04) geolocation	delays from probes to all landmarks, and from probes to target host	sufficient correlation between geographic distance and network delay exists for coarse-grained geolocation; explores mostly distance based similarity measures; median distance error of 314 km	397 landmarks (RIPE TTM and LibWeb servers); 9 probes (NIMI)
P-07	Constraint-based Geolocation of Internet Hosts	2004	IEEE/ACM Transactions on Networking	Gueye, B. and Ziviani, A. and Crovella, M. and Fdida, S.	methodology	multilateration based on geometric constraints (upper limit on target host distance from landmark) derived from delay measurements between landmarks provides location estimate and confidence region	one-way delays between TTM hosts (Dec 2002-Feb 2003); RTT delays between AMP hosts (30 Jan 2003); known landmark locations	median error of 95 km (NLANR, USA) and 22 km (TTM, Europe)	Landmarks: 95 NLANR AMP hosts; 42 RIPE TTM hosts
P-08	Demographic Placement for Internet Host Location	2003	GLOBECOM	Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B.	methodology	develops methodology for efficiently deploying landmarks and probes for delay-based geolocation methods (in particular GeoPing from P-04)		Landmarks are placed using demographic criteria (locations with high user density); probes are placed sparsely at locations with high connectivity
P-09	Improving the accuracy of measurement-based geographic location of Internet hosts	2005	Computer Networks and ISDN Systems	Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B.	methodology	explores several issues related to implementation of GeoPing-type geolocation: correlation between RTT and geographic distance; optimal placements of landmarks and probes; methods for evaluating similarities between delay patterns	delays from LIP6 (Paris, France) to 109 LibWeb hosts in June 2002; delays between 55 RIPE TTM hosts from Dec 2002 to Feb 2003	number and location of landmarks and probes is optimized using a demographic approach; similarity based on "city-block distances" outperforms Euclidean distance model	109 LibWeb webservers; 55 RIPE TTM hosts
P-10	Towards IP geolocation using delay and topology measurements	2006	Internet Measurement Conf (IMC)	Katz-Bassett, E. and John, J.P. and Krishnamurthy, A. and Wetherall, D. and Anderson, T. and Chawathe, Y.	methodology	Develops Topology-based geolocation (TBG): improves on pure CBG based on end-to-end delays by leveraging network topology at the router level and validated external hints; uses global optimization approach to determine router and target locations simultaneously.	PlanetLab hosts are used as landmarks. Geolocation experiments using targets collocated with 11 Abilene PoPs, 22 Sprint PoPs and 128 Univ. hosts	Typically reduces errors by a factor of 3 to 4 compared with CBG	68 PlanetLab landmarks; 128 US Univ. host targets
P-11	Leveraging Buffering Delay Estimation for Geolocation of Internet Hosts	2006	Int Federation for Information Processing Technical Committee 6 (IFIP-TC6) Networking Conf	Gueye, B. and Uhlig, S. and Ziviani, A. and Fdida, S.	methodology	GeoBuD: CBG (P-07) augmented with buffering delay estimates at intermediate routers derived from traceroutes	traceroutes from PlanetLab landmarks for 17 Oct 2005 (US dataset); and 21 Nov 2005 (WE dataset)	incorporating buffering delays at intermediate routers in CBG (P-07) reduces median geolocation error from 228 to 144 km for US dataset and 137 to 100 km for WE dataset	29 US PlanetLab landmarks with 87 AMP targets; 27 WE PlanetLab landmarks with 57 RIPE TTM targets
P-12	IP Geolocation	2007	Internet Measurement seminar	Holzhauer, F.	survey	review of methods, emphasizing CBG (P-07) and TBG (P-10)
P-13	Octant: A Comprehensive Framework for the Geolocalization of Internet Hosts	2007	USENIX Symp on Networked System Design and Implementation (NSDI)	Wong, B. and Stoyanov, I. and Gun Sirer, E.	methodology	uses positive and negative constraints on hosts and intermediate routers; assigns "weights" to handle uncertainty in constraints; uses fictitious "height" to capture last hop delays; uses geometric technique based on Bézier curves that can incorporate extraneous geographic hints	Latency data collected on 1-Feb-2006 and 18-Sep-2006 between landmarks, intermediate routers and targets	Median geolocation error of 35 km	51 PlanetLab hosts; 53 public traceroute servers
P-14	Investigating the Imprecision of IP Block-based Geolocation	2007	Passive and Active Network Measurement Workshop (PAM)	Gueye, B. and Uhlig, S. and Fdida, S.	analysis	uses CBG geolocation to investigate geographic spread of IP addresses in same IP block	CBG (P-07) locations for 18759 IPs in 876 IP blocks between 31 Mar and 19 Apr 2006. IPs are CoralCDN Web clients, linked to IP blocks using database from paper R-05	~ 60% of IP blocks have spread in excess of 200 km	74 PlanetLab landmarks
P-15	Assessing the geographic resolution of exhaustive tabulation for geolocating Internet hosts	2008	Passive and Active Network Measurement Workshop (PAM)	Siwpersad, S.S. and Gueye, B. and Uhlig, S.	analysis	comparison of location estimates from CBG (P-07) with locations from MaxMind and Hexasoft databases	single IP from 41758 MaxMind and 15823 Hexasoft IP blocks are geolocated with CBG	database location for more than 90% of IP blocks lies outside CBG confidence region	39 PlanetLab landmarks
P-16	Internet geolocation: evasion and counterevasion	2009	ACM Computing Surveys	Muir, J. and van Oorschot, P.C.	survey	overview of geolocation methods with general discussion of limitations; discussion of ways adversaries can avoid geolocation; mentions extraction of IP using Java applet, and RTT measurement by HTTP refreshes		no geolocation method is robust (works for all IP addresses, network configs, and against adversarial users); those trying to evade geolocation can complicate the task for locators, but geographic information can leak in many ways
P-17	Mining the Web and the Internet for Accurate IP Address Geolocations	2009	IEEE Conf on Computer Communications (INFOCOM)	Guo, C. and Liu, Y. and Shen, W. and Wang, H.J. and Yu, Q. and Zhang, Y.	methodology	database mining technique (Structon): geographic information from large database of Webpages, combined with a number of heuristics to increase accuracy and coverage	500 million Chinese URLs, augmented with traceroutes, WHOIS and BGP information	87% accuracy at city-level granularity
P-18	Statistical geolocation of Internet hosts	2009	Int Conf on Computer Communications and Networks (ICCCN)	Youn, I. and Mark, B.L. and Richards, D.	methodology	delay-based statistical method: delay-to-distance relationship is expressed in a probability density function; solution by iterative force-directed method	delay measurements between all pairs of landmarks every five minutes for one week	Compared to GeoPing (P-04) and CBG (P-07) median errors are reduced by ~20%; mean errors by ~50% (i.e., significant improvement especially in reducing large errors in GeoPing and CBG)	85 PlanetLab landmarks
P-19	A study of geolocation databases	2010	arXiv cs.NI/1005.5674v3	Shavitt, Y. and Zilberman, N.	survey	statistical analysis of PoP "range of convergence" and deviations of IP and PoP locations within PoP	PoP map of 3800 PoPs (52K IPs) derived from DIMES traceroute measurements in March 2010	vast majority of location info in databases is correct, but also errors in the range of 1000s km	DIMES
P-20	GeoWeight: Internet host geolocation based on a probability model for latency measurements	2010	Australasian Conf on Computer Science (ASCS)	Arif, M.J. and Karunasekera, S. and Kulkarni, S.	methodology	constraint-based (CBG) augmented by a probability model for latency vs. geographic distance	150000 latency measurements PlanetLab landmarks from 23 Sep 2008 to 25 Oct 2008; latencies from landmarks to 80 NA targets	Median geolocation errors of ~44 km compared to > 200 for Octant and > 500 for CBG	50 PlanetLab landmarks and 80 targets in North America
P-21	A model based approach for improving router geolocation	2010	Computer Networks: The Int Journal of Computer and Telecommunications Networking	Laki, S. and Matray, P. and Haga, P. and Csabai, I. and Vattay, G.	methodology	develops detailed path-latency model (separating propagation and per-hop delays); uses global optimization to solve for target locations (similar to P-10)		mean geolocation errors of ~150 km	ETOMIC landmarks and 41 GEANT2 targets; 151 PlanetLab nodes, used as landmarks and targets
P-22	Internet Host geolocation using maximum likelihood estimation technique	2010	IEEE Int Conf on Advanced Information Networking and Applications (AINA)	Arif, M.J. and Karunasekera, S. and Kulkarni, S. and Gunatilaka, A. and Ristic, B.	methodology	delay-based statistical method: delay-to-distance relationship is expressed as a probability density function; solution by MLE method	delay measurements between landmarks from 23 Sep to 25 Oct 2008	median error of 134 km, compared to 216 for Octant (P-13) and 506 km for CBG (P-07) on same dataset	50 NA PlanetLab landmarks and 50 other NA hosts as targets
P-23	Dude, where's that IP? Circumventing measurement-based IP geolocation	2010	Usenix Security Symp	Gill, P. and Ganjali, Y. and Wong, B.	analysis	Simulated "attacks" using PlanetLab testbed to foil delay-based and topology-aware geolocation attempts		topology-aware techniques are more susceptible to tampering than simpler delay-based techniques	50 NA and 30 WE PlanetLab nodes
P-24	A learning-based approach for IP geolocation	2010	PAM	Eriksson, B. and Barford, P. and Sommers, J. and Nowak, R.	methodology	statistical method: expresses relation of distance to delay and hop count as a probability density function; solution by learning-based classification method	iPlane data for 12 Dec 2008 to 8 Jan 2009, supplemented with traceroutes between 375 PlanetLab hosts. Three sets of PlanetLab traceroutes between 11 Dec 2008 and 6 Jan 2009	MaxMind db is used as "ground truth". Results are compared with CBG (P-07): mean error reduces from 519 km for CBG to 408 km for proposed learning-based method	375 NA PlanetLab nodes
P-25	Spotter: A model based active geolocation service	2011	INFOCOM	Laki, S. and Matray, P. and Haga, P. and Sebok, T. and Vattay, G.	methodology	delay-based statistical method: combining spatial probability density function for all landmarks defines estimated region for location of target	uses PlanetLab nodes as targets; also uses 23000 Cogent IP address locations in Europe and US	all landmarks are described by same probabilistic delay-distance model	PlanetLab
P-26	Towards street-level client-independent IP geolocation	2011	Usenix	Wang, Y. and Burgener, D. and Flores, M. and Kuzmanovic, A. and Huang, C.	methodology	combines active measurement approach with an active web-mining technique. Uses CBG for "coarse" geolocation; refines location using "relative network distance" in combination with large number of landmarks located using web-mining technique.	method evaluated using 88 PlanetLab nodes; a set of 72 residential IP addresses; and a third, undisclosed dataset	claims median errors of 1-2 km for the three datasets used.	88 PlanetLab targets
P-27	iPlane Nano: path prediction for peer-to-peer applications	2009	Usenix Symp on Networked systems design and implementation (NSDI)	Madhyastha, H.V. and Katz-Bassett, E. and Anderson, T. and Krishnamurthy, A. and Venkataramani, A.	methodology	provides an atlas of PoP-level paths, with latency and loss-rate predictions between arbitrary hosts on the Internet, by stitching together inferred PoP paths.	iPlane data	iPlane iNano provides PoP-level paths between arbitrary end-hosts with an atlas that is less than 7MB in size and can be updated
P-28	Matchmaking for online games and other latency-sensitive P2P systems	2009	ACM SIGCOMM Computer Communication Review	Agarwal, S. and Lorch, J.R.	methodology	Htrae: places pairs of clients in a network coordinate system to provide client-to-client latency prediction. The network coordinate system is seeded with MaxMind GeoLite coordinates.	3.5 million console-to-console latencies from Halo (Microsoft); GeoLite IP to geographic location (MaxMind)	50% of predictions under 15 ms for Htrae, 24 ms for GeoLite. 95% of Htrae's predictions within 138 ms, and 208 ms for geolocation.	one-time volunteer Htrae deployment on 11 home machines
P-29	IP geolocation databases: unreliable?	2011	ACM SIGCOMM Computer Communication Review	Poese, I. and Uhlig, S. and Kaafar, M.A. and Donnet, B. and Gueye, B.	survey	Compare prefix distributions in 5 geolocation DBs with groundtruth.	Groundtruth: 357 BGP prefixes from large European ISP with city-level location of router advertising subnet inside ISP. Databases of HostIP, IP2Location, InfoDB, Maxmind, Software77	databases are strongly biased to popular countries; db IP blocks use official advertisements of ISP; while some of the ISP address space is geolocated decently (e.g. 20% of Maxmind within 10s of km of groundtruth), in most cases DBs are off by 100s to 1000s of km.
P-30	Geocompare: a comparison of public and commercial geolocation databases	2011	CAIDA Technical Report	Huffaker, B. and Fomenkov, M. and claffy, kc	survey	compares geolocation databases against each other and against a ground truth dataset	RIR, Software77, HostIP, IPligence, Cyscape, MaxMind GeoIP, MaxMind GeoLite, IPInfoDB, and Digital Envoy	databases roughly all agree on country; MaxMind GeoIP and Digital Envoy did best on ground truth. Digital Envoy did best on routers.
P-31	Network measurement based modeling and optimization for IP geolocation	2012	Computer Networks	Dong, Z. and Perera, R.D.W. and Chandramouli, R. and Subbalakshmi, K.P.	methodology	measurement-based geolocation method tested on PlanetLab nodes; k-means clustering of landmark-to-landmark measurements defines distance segments that each fits RTT vs distance using polynomial regression; semidefinite programming method finds optimized location of target host from estimated distances to landmarks.	traceroutes from PlanetLab probes to landmarks for one week in Nov 2010	Best results: 27-32 km in North America; 41-53 in Europe	81 North-American and 90 European PlanetLab probes; 206 PlanetLab landmarks
P-32	A structural approach for PoP geo-location	2012	Computer Networks	Feldman, D. and Shavitt, Y. and Zilberman, N.	methodology	Method to generate PoP-level geographic maps from IP-level graph based on DIMES traceroutes. PoP identification based on structure ('motifs') and a partitioning algorithm that assigns nodes to PoPs; geographic location assigned to PoP from geoloc DBs	DIMES: 56M traceroutes from 2009 Jul and 33M from 2010 Oct. GeoDBs used: MaxMind, IPligence, HostIP, IP2Location, GeoBytes	Comparison with published PoP maps finds that most large PoPs are found, but few small ones. The majority of incorrect links is attributed to database errors.	DIMES, 1308 agents in 49 countries
P-33	Posit: a lightweight approach for IP geolocation	2012	ACM SIGMETRICS Performance Evaluation Review	Eriksson, B. and Barford, P. and Maggs, B. and Nowak, R.	methodology	CBG for geolocation to constrained region; finds the most likely location from a limited pool (i.e., cities) given RTT to target and landmarks in constrained region.	431 Akamai vantage points/targets and an additional 283 targets
P-34	Enhancing the classification accuracy of IP geolocation	2012	Military Communications Conf (MILCOM)	Maziku, H. and Shetty, S. and Han, K. and Rogers, T.	methodology	Machine-learning approach (extension of P-24). Larger set of classifiers include average, median, mode and std dev of delay measurement; hop count; pop. density.	142,937 (23,843 after de-aliasing) router IP addresses from traceroutes between PlanetLab nodes between Jun and Oct 2011.	Results heavily depend on good coverage by landmarks. Median errors vary from 0 (NE US) to ~500 km (N-Central US).	67 well-distributed PlanetLab nodes across US serve as landmarks and probes
P-35	Towards geolocation of millions of IP addresses	2012	ACM Internet measurement conference (IMC)	Hu, Z. and Heidemann, J. and Pradkin, Y.	methodology	select 10 nearest, by RTT, Vantage Points to probe /24 prefix	400 PlanetLab nodes to 25 known landmarks	proves that selecting 10 Vantage Points with lowest RTT values to a given /24 prefix greatly reduces the amount of traffic needed to geolocate without large increases in error
P-36	Using Whois based geolocation and Google maps API for support cybercrime investigations	2013	Recent Advances in Telecommunications and Circuits	Butkovic, A. and Orucevic, F. and Tanovic, A.	methodology
P-37	Topology mapping and geolocating for China's Internet	2013	IEEE Trans. On Parallel and Distributed Systems	Tian, Y. and Dey, R. and Liu, Y. and Ross, K.W.	methodology

ID	Title	Year	Publication	Authors	Paper Type	Method	Data	Findings	Measurement Setup
R-01	Predicting Internet Network Distance with Coordinates-Based Approaches	2002	IEEE Conference on Computer Communications (INFOCOM)	Ng, T. S. E. and Zhang, H.	methodology	develops GNP, a coordinate-based method for estimating minimum RTT using absolute coordinates	distance (minimum RTT) measurements between landmarks and two sets of targets	Euclidean embedding combined with a relative error measurement function works best	19 landmarks (12 in NA; 5 in AP; 2 EU); two target set: 869 global; 127 Abilene-connected
R-02	Virtual Landmarks for the Internet	2003	Internet Measurement Conference (IMC)	Tang, L. and Crovella, M.	methodology, analysis	coordinate-based method for estimating minimum RTT using Euclidean embedding; uses "virtual landmarks" for speed and scalability	seven collections of RTT data	network distances can be described with 7-9 orthogonal vectors; ~90% of distances preserved with relative error <0.5; "virtual landmark" method simpler and faster than nonlinear optimization	NLANR AMP
R-03	On the geographic location of Internet resources	2003	J. on Selected Areas in Communications, Vol. 21., pp. 934-947	Lakhina, A. and Byers, J.W. and Crovella, M. and Matta, I.	analysis	statistical analysis of geographic properties of router topology.	Uses CAIDA Skitter data (26 Dec 2001 to 1 Jan 2002) and Scan Project Mercator data (Aug 1999). Geolocation is done using IxMapper and EdgeScape.	Superlinear relation between router and population density; connection patterns linked to geographic distance. Nr of AS locations correlates with AS degree and AS geographic dispersal.
R-04	Vivaldi: A Decentralized Network Coordinate System	2004	SIGCOMM	Dabek, F. and Cox, R. and Kaashoek, F. and Morris, R.	methodology	Coordinate-based method for predicting communication latencies using 2D Euclidean embedding augmented with a "height" component	RTTs between 192 PlanetLab nodes; and between 1740 DNS servers	Median relative error in RTT prediction of 11%	PlanetLab
R-05	Geographic Locality of IP Prefixes	2005	Internet Measurement Conference (IMC)	Freedman, M. J. and Vutukuru, M. and Feamster, N. and Balakrishnan, H.	analysis	Statistical analysis of geographic properties of IP prefixes within context of implications for routing policies. Uses undns for geolocation (i.e., IP-to-geographic location mapping based on geographic information in DNS names)	170000 IP prefixes from RouteViews from 27-Feb-2005; traceroutes to CoralCDN clients and servers; traceroutes from PlanetLab hosts to 4 IPs per prefix	discontiguous prefixes announced by an AS from a single location are usually due to fragmented allocation by registries; contiguous prefixes announced by an AS from different geographic locations limit opportunities for aggregation of prefixes	25 PlanetLab hosts
R-06	Geolocalization of Proxied Services and its Application to Fast-Flux Hidden Servers	2009	IMC	Castelluccia, C. and Kaafar, M.A. and Manils, P. and Perito, D.	application	application of CBG (P-07) to geolocation of fast-flux hidden servers
R-07	Eyeball ASes: From Geography to Connectivity	2010	Internet Measurement Conference (IMC)	Rasti, A. and Magharei, N. and Rejaie, R. and Willinger, W.	analysis	IP geolocation (48x10⁶ IP addresses from P2P apps) done using GeoIP and IP2Location; used to determine geo- and PoP-level footprints of ASes	PoP info for 45 ASes in NA and EU compiled from online data
R-08	Improving AS Relationship Inference Using PoPs	2013	Traffic Monitoring and Analysis Workshop (TMA 2013)	Neudorfer, L. and Shavitt, Y. and Zilberman, N.	methodology	Method to use PoP level maps to find complex and anomalous AS relationships	29M DIMES traceroutes (May 2012); DIMES IP-to-PoP mapping (May 2012) with 5215 PoPs, 98650 IPs in 2636 AS; CAIDA AS rank data from August 2012 with 119,924 AS pairs.	Discusses several examples of complex and/or anomalous relationships between ASes (different at different ASes)	DIMES

Measurement Infrastructure

The above bibliography references several datasets. These resources are listed here with references back to the papers.

ID	Name	Organization	Description	PaperID
D-01	PlanetLab	Princeton University	Global research network for the development of new network services. Currently over 1000 nodes worldwide	P-10, P-11, P-13, P-14, P-15, P-18, P-20, P-21, P-22, P-23, P-24, P-25, P-26, R-04, R-05, R-06
D-02	iPlane	University of Washington	Scalable service for predictions of Internet path performance for emerging overlay services (incl. access to iPlane datasets)	P-20, P-22, P-24
D-03	TTM	RIPE	Test Traffic Measurement Service (TTM) measures key parameters of the connectivity between points on the Internet	P-05, P-06, P-07, P-09, P-11
D-04	AMP	NLANR	NLANR Active Measurement Project (AMP), active between 1998-2006. Datasets available at RIPE.	P-07, P-11, R-02
D-05	ETOMIC		European Traffic Observatory Measurement Infrastructure (ETOMIC) is a measurement infrastructure, distributed throughout Europe, that is able to carry out active measurements	P-21
D-06	GEANT2		High-bandwidth, academic Internet serving Europe's research and education community	P-21
D-07	DIMES	Tel Aviv Univ.	Distributed scientific research project, aimed at studying structure and topology of the Internet	P-19
D-08	Skitter	CAIDA	Tool for actively probing the Internet for topology and performance analysis. Retired in 2008. Dataset available from CAIDA	R-03

Geographic Information and Geolocation Methods

Current geolocation techniques can be broadly divided into two categories: database-driven (or registry-based; P-25) and measurement-based. This categorization mirrors a similar division in the types of geographic information available for IP geolocation: qualitative data, and numerical (quantitative) data. Both have been present in geolocation efforts from the outset.

The class of quantitative data includes the workhorse of measurement-based geolocation methods: delay measurements from probes to landmarks and targets. A number of publications establish the relationship between Internet delay and geographic distance (P-06, P-09, P-10, P-25) in the presence of obfuscating factors such as circuitous routing, buffering, and other delays (P-11, P-21). Also included in this class is network topology information, typically derived from traceroute measurements. Topology information can be an integral part of a geolocation algorithm (e.g., when intermediate routers to an end target are geolocated alongside the target itself in a global optimization; P-10), but is also used in simpler arguments that relate topological proximity to geographic proximity (e.g., when geolocating the last intermediate router when the real target is unreachable). Hop counts (also derived from traceroutes) are explored in a recent paper (P-24) as another quantitative measure of geographic distance.
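The simplest form of the delay-distance relationship is an upper bound: a packet cannot travel faster than light, so a measured round-trip time limits how far away the other endpoint can be. The sketch below illustrates that bound. The propagation factor of 2/3 (a commonly used approximation for signal speed in optical fiber) and the function names are assumptions made for this illustration; the papers above calibrate the relationship empirically instead.

```python
SPEED_OF_LIGHT_KM_PER_MS = 299.792458  # speed of light in vacuum, km per millisecond
FIBER_FACTOR = 2.0 / 3.0  # assumed ratio of signal speed in fiber to c


def max_distance_km(rtt_ms: float, propagation_factor: float = FIBER_FACTOR) -> float:
    """Upper bound on the distance to a host given a round-trip time in ms."""
    if rtt_ms < 0:
        raise ValueError("RTT must be non-negative")
    one_way_ms = rtt_ms / 2.0  # assumes a symmetric path
    return one_way_ms * SPEED_OF_LIGHT_KM_PER_MS * propagation_factor


def is_plausible(claimed_distance_km: float, rtt_ms: float) -> bool:
    """True if a claimed probe-to-host distance does not violate the bound."""
    return claimed_distance_km <= max_distance_km(rtt_ms)
```

Under these assumptions an RTT of 10 ms bounds the distance at roughly 1000 km. Circuitous routing and queuing only add delay, so the bound stays valid but becomes loose, which is why the real distance is usually much smaller.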

The class of qualitative data includes the usual suspects (WHOIS registry, DNS LOC records, DNS names, BGP router tables), but also databases based on information gathered from the Internet community (either directly through user input, or indirectly, e.g., by parsing large quantities of URLs; P-17). This probably also includes the various types of proprietary databases used in commercial geolocation products. All of these contain geographic information (directly as in DNS LOC records, or indirectly by linking to an organization or AS number) that, if correctly interpreted, provides clues about the geographic location of an IP address or IP address block.

The earliest geolocation attempts, GTrace (P-02; constructed around NetGeo), GeoTrack and GeoCluster (P-04), emphasize qualitative data (primarily WHOIS records and DNS names), but delay (RTT) measurements are already incorporated at this stage. GTrace uses RTT data to validate results using "speed-of-light" arguments; GeoPing (P-04) is purely RTT-based. Since these early attempts, a number of measurement-based algorithms have appeared in the academic literature. The table below provides an overview of the accuracy achieved by the various techniques. In this list only the first three are database-driven; all others (starting with GeoPing) are measurement-based.

GeoPing (P-04) compares "fingerprints" (based on delay measurements from a set of probes) for the target and for landmarks, and selects the location of the landmark with the most similar fingerprint as the target location. As the first measurement-based geolocation method, it appears to be mostly of historical significance at this point. Constraint-based geolocation (CBG; P-07), which uses deterministic geometric constraints derived from delay measurements to constrain the probable location of a target, has set the stage for future development and is the most common "benchmark" against which more recent models are compared.
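The two ideas can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the algorithms as published: the fingerprint comparison uses plain Euclidean distance between delay vectors, and the constraint step merely filters a list of candidate points instead of computing the intersection region of the constraint circles. All function names and sample values are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_fingerprint(target_fp, landmarks):
    """GeoPing-style guess: location of the landmark with the closest delay vector.

    target_fp: RTTs (ms) from each probe to the target.
    landmarks: list of (location, fingerprint) pairs measured from the same probes.
    """
    def gap(fp):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(target_fp, fp)))

    location, _ = min(landmarks, key=lambda lm: gap(lm[1]))
    return location


def satisfies_constraints(candidates, constraints):
    """CBG-style filter: keep candidates within max_km of every probe.

    constraints: list of (probe_location, max_km) pairs derived from delays.
    """
    return [c for c in candidates
            if all(haversine_km(c, probe) <= max_km
                   for probe, max_km in constraints)]


if __name__ == "__main__":
    boston, denver = (42.36, -71.06), (39.74, -104.99)
    new_york, chicago = (40.71, -74.01), (41.88, -87.63)
    # Hypothetical fingerprints from three probes.
    landmarks = [(boston, [10, 28, 80]), (denver, [45, 25, 28])]
    print(nearest_fingerprint([12, 30, 75], landmarks))
    print(satisfies_constraints([boston, denver],
                                [(new_york, 400), (chicago, 1600)]))
```

The contrast is visible in the return values: the fingerprint approach can only ever return a landmark location, while the constraint approach describes a feasible region whose size also indicates how confident the estimate is.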

Subsequent geolocation methods show an increasing sophistication in extracting geographic information, either by supplementing delay measurements with additional data, or by more complex algorithms. Topology-based geolocation (TBG; P-10) introduces topology measurements to simultaneously geolocate intermediate routers and targets. Further refinements include an improved analysis of delay measurements (separating the distance-sensitive propagation delays from other processing delays; P-11, P-21), incorporating database-driven approaches to improve geolocation accuracy (P-10), and integrating hop counts into the geolocation algorithm (P-24). Algorithms also are evolving. The most recent models favor probabilistic approaches, which seem to be a better match to the essentially statistical nature of the relation between geographic distance and delay measurements. GeoWeight (P-20) marks a transition by combining deterministic constraints, similar to CBG, with probability assignments; P-18, P-22, P-24 and P-25 describe delay measurements using probability density functions, and use various statistical methods to build a geolocation algorithm.
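To make the probabilistic idea concrete, the sketch below picks the most likely of several candidate locations given delay measurements from probes at known positions. It is a toy model and not the method of any cited paper: it assumes the RTT is a linear function of distance plus Gaussian noise, and the parameters `base_ms`, `ms_per_km` and `sigma_ms` are hypothetical values chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def log_likelihood(candidate, observations,
                   base_ms=5.0, ms_per_km=0.02, sigma_ms=5.0):
    """Log-likelihood of the observed RTTs if the target were at `candidate`.

    observations: list of (probe_location, rtt_ms) pairs.
    Assumed model: rtt ~ Normal(base_ms + ms_per_km * distance, sigma_ms).
    """
    total = 0.0
    for probe, rtt_ms in observations:
        expected = base_ms + ms_per_km * haversine_km(candidate, probe)
        z = (rtt_ms - expected) / sigma_ms
        total += -0.5 * z * z - math.log(sigma_ms * math.sqrt(2 * math.pi))
    return total


def most_likely(candidates, observations):
    """Candidate location with the highest log-likelihood."""
    return max(candidates, key=lambda c: log_likelihood(c, observations))


if __name__ == "__main__":
    boston, denver = (42.36, -71.06), (39.74, -104.99)
    new_york, chicago = (40.71, -74.01), (41.88, -87.63)
    observations = [(new_york, 11.0), (chicago, 32.0)]  # hypothetical RTTs
    print(most_likely([boston, denver], observations))
```

Unlike a hard geometric constraint, a single unusually slow measurement only lowers the likelihood of a candidate here; it does not rule the candidate out.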

Few detailed descriptions of database-driven techniques exist in the literature. The exceptions are NetGeo (P-03) and Structon (P-20). Not surprisingly, published literature contains little concrete information about algorithms employed in commercial geolocation products. Whether the qualitative input data are web pages, WHOIS registry records, or DNS names, a database-driven geolocation algorithm tends to be a collage of various heuristic arguments, approximations and intelligent guesswork.

Error Distance Matrix

The table below compiles numbers from geolocation experiments described in the above publications, primarily for measurement-based techniques. The column headers indicate a range of median errors in geolocation distance reported in the papers; the values in the columns are the number of experiments that report median errors in the indicated range. Even though direct comparison of these numbers is tricky due to the wide variations in experiment characteristics (different types of targets, different sets of landmarks, etc.), the picture that emerges is that state-of-the-art measurement-based techniques can comfortably geolocate targets with median errors of < 250 km, while some techniques under favorable conditions can approach an accuracy of < 100 km. To put this in context: 1000 km can be roughly viewed as country granularity; 50 km approaches city or zip code granularity.

Method	ID	d < 5 km	5 < d < 50 km	50 < d < 100 km	100 < d < 250 km	250 < d < 500 km	500 < d < 750 km
NetGeo	P-03						1
GeoTrack	P-04			1	2		1
GeoCluster	P-04		1
GeoPing	P-04			1	5	2
CBG	P-07		1	3	6	2	2
TBG	P-10			1	1
GeoBuD	P-11			1	1
Octant	P-13		2		4	2
SG	P-18			1
GeoWeight	P-20		1
Geo-Rh	P-21				1
MLE	P-22				1
Naive Bayes	P-24				1
Spotter	P-25		1	1
Street GeoLoc	P-26	1
Dong et al.	P-31		1
Maziku et al.	P-34	1
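For readers who want to place their own experiment in this table, the following sketch computes the median error distance for a set of (estimated, true) location pairs and maps it to one of the column ranges above. The function names are hypothetical, the great-circle distance assumes a spherical Earth of radius 6371 km, and the treatment of values that fall exactly on a range boundary is an arbitrary choice.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0

# Column ranges of the table, as (lower, upper, label); lower bound inclusive.
BINS = [
    (0, 5, "d < 5 km"),
    (5, 50, "5 < d < 50 km"),
    (50, 100, "50 < d < 100 km"),
    (100, 250, "100 < d < 250 km"),
    (250, 500, "250 < d < 500 km"),
    (500, 750, "500 < d < 750 km"),
]


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def median_error_km(pairs):
    """Median error distance over (estimated, true) location pairs."""
    return median(haversine_km(est, true) for est, true in pairs)


def error_bin(median_km):
    """Label of the table column that a median error falls into."""
    for low, high, label in BINS:
        if low <= median_km < high:
            return label
    return "d >= 750 km"  # beyond the last column of the table


if __name__ == "__main__":
    # Hypothetical experiment: three targets with estimated and true positions.
    pairs = [
        ((42.36, -71.06), (42.36, -71.06)),  # exact hit
        ((40.00, -75.00), (40.50, -75.00)),  # half a degree of latitude off
        ((42.36, -71.06), (40.71, -74.01)),  # Boston reported for New York
    ]
    m = median_error_km(pairs)
    print(f"median error {m:.1f} km -> column '{error_bin(m)}'")
```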

Discussion

A direct comparison between measurement-based and database-driven approaches, or even just between measurement-based algorithms, is tricky at best. A systematic comparison would require a reliable "ground truth" database of IP addresses at known geographic locations, and such a database is difficult to find. In practice, however, the pool of potential test targets at known locations is limited: most recent published experiments select their ground truth from hosts in measurement infrastructures like PlanetLab in North America or Europe. So, even though this is hard to quantify, the ground truth in some published experiments is probably similar. In some papers the same ground truth is used to compare different algorithms (typically CBG is used as a benchmark, which explains the high number of entries for CBG in the above table), providing some insight into comparative performance. Obvious questions remain, though. How representative are results based on a limited number of PlanetLab targets for the Internet as a whole? How much does the accuracy of a method vary from well-connected hosts (routers) to a heterogeneous collection of end hosts? Looking at the above table, the median errors for CBG experiments vary from better than 50 km to more than 500 km (one order of magnitude), presumably reflecting a wide variation in experiment characteristics.

In an average sense the performance of the best geolocation techniques can be quantified reasonably well: the best measurement-based methods have median errors of at most a few hundred km (well within country granularity), with the best results maybe approaching 50 km (city or zip level). Similarly, database-driven techniques also appear to do quite well at the country level, but start running out of steam at the city level. Whether database-driven or measurement-based, all techniques suffer from what might be called an outlier syndrome: they are plagued by outliers with location errors well exceeding 1000 km (or country level). It would seem that for any potential application of geolocation the key question to ask is whether being right most of the time is good enough. If the answer is yes, a secondary question is whether the average accuracy of a selected algorithm is satisfactory.
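The difference between typical accuracy and the outlier syndrome is easy to see when the median is reported alongside the tail of the error distribution. The sketch below does this for a list of error distances; the sample values are invented for illustration, the 1000 km threshold follows the country-level figure used above, and the 90th percentile uses the simple nearest-rank definition.

```python
import math
from statistics import median


def summarize_errors(errors_km, outlier_km=1000.0):
    """Median, nearest-rank 90th percentile, and fraction of outliers."""
    if not errors_km:
        raise ValueError("need at least one error distance")
    ordered = sorted(errors_km)
    rank = math.ceil(0.9 * len(ordered))  # nearest-rank percentile, 1-based
    outliers = sum(1 for e in ordered if e > outlier_km)
    return {
        "median_km": median(ordered),
        "p90_km": ordered[rank - 1],
        "outlier_fraction": outliers / len(ordered),
    }


if __name__ == "__main__":
    # Hypothetical error distances (km) for ten targets.
    errors = [12, 30, 45, 60, 80, 110, 150, 240, 1800, 5200]
    for name, value in summarize_errors(errors).items():
        print(f"{name}: {value}")
```

In this invented sample the median is below 100 km, yet one target in five is misplaced by more than 1000 km. Whether that is acceptable is exactly the "right most of the time" question posed above.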