The contents of this legacy page are no longer maintained nor supported, and are made available only for historical purposes.

Internet Protocol (IP) Address Geolocation Bibliography

This page presents an annotated bibliography of papers and datasets related to the field of Internet Protocol (IP) address geolocation. Many applications require the association of Internet resources with an accurate geographic label at some granularity. For some applications knowing the country of origin might be sufficient; for others a more precise indication at state, city or zip code granularity, or even a specific latitude/longitude, is needed. Below we provide an overview of published literature related to geolocation in an attempt to describe the current state of the art. We conducted this literature search as part of our efforts to compare geolocation tools.

Introduction

IP address geolocation reminds one of the classic bumper sticker, "think globally, act locally." In today's far-reaching Internet, organizations and institutions of all kinds, from corporations to governments, want exactly that: the ability to communicate to the entire world and, at the same time, to develop applications that help them target, limit, and customize their messages, balance resources, and coordinate responses based on the location of the receiver. Organizations accomplish this by using tools and services that translate an IP address or prefix range into a geographic location (country, state, city, zip, geographic latitude/longitude) associated with the address(es). Simple, right?

However, which method(s) work best? Which sources of geolocation services and information return the most reliable locations and at what cost? What is the geographic resolution? Further, if a source provides the geographic location of the owner of an IP address, is this location the same as the location where the device is actually broadcasting and receiving packets? And, if different, can the difference be quantified?

What constitutes a "good" geolocation result? Some numbers: with a total land area of 1.5×10⁸ km² and 195 countries, the average country size on Earth is about 7.7×10⁵ km², or a linear size of 880 km. The surface area of the US is about 10⁷ km². With 50 states, over 3,000 counties and on the order of 43,000 zip codes, the average linear size of a state, county or zip code is about 450, 55 and 15 km, respectively. Looking at another big country, China (about the same size as the USA) has 33 provinces, 333 prefectures, about 3000 counties, and about 42000 townships, giving sizes of 550, 170, 60 and 18 km, respectively. To begin to be useful a geolocation method would at the very least need to be able to pinpoint the correct country, and, in large countries like the USA or China, the correct state or province. Looking at the above numbers this would require geolocation errors of at most a few hundred kilometers. An accuracy measured in tens of kilometers would be required to be effective at a truly local level (county or zip code).
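As a rough check of the figures above, the sketch below recomputes the linear sizes as the square root of the mean area per region. The function and variable names are illustrative only, and the inputs are the rounded areas and counts quoted in the text, so the results agree with the quoted sizes only approximately.

```python
import math

def linear_size_km(total_area_km2: float, n_regions: int) -> float:
    """Side length of a square whose area equals the mean area per region."""
    return math.sqrt(total_area_km2 / n_regions)

# Rounded inputs taken from the paragraph above.
EARTH_LAND_KM2 = 1.5e8
US_AREA_KM2 = 1.0e7

sizes = {
    "country": linear_size_km(EARTH_LAND_KM2, 195),
    "US state": linear_size_km(US_AREA_KM2, 50),
    "US county": linear_size_km(US_AREA_KM2, 3000),
    "US zip code": linear_size_km(US_AREA_KM2, 43000),
}

for name, km in sizes.items():
    print(f"{name}: about {km:.0f} km")
```

Treating each region as a square is of course a simplification; the point is only to set the scale of error that a given granularity can tolerate.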

Useful Definitions

A number of concepts are commonly encountered in the geolocation literature. We define the main ones here.

  • IP geolocation describes methods of assigning a geographic label to an individual Internet Protocol address (IP).
  • A Vantage Point (VP) is a measurement infrastructure node with a known geographic location.
  • A Landmark is a responsive Internet identifier with a known location to which the VP will launch a measurement that can serve to calibrate other measurements to potentially unknown geographic locations. Some papers use the term Active Landmarks to refer to points which act as both landmark and vantage point. Often they are part of an infrastructure platform like PlanetLab.
  • A Target is an Internet identifier whose location will be inferred from a given method. Typically some targets have known geographic locations (ground truth), which researchers can use to evaluate the accuracy of their geolocation methodology.
  • A Location is a geographic place that geolocation techniques attempt to infer for a given target. Examples include cities and ISP Points of Presence (PoPs).

Not all terms are used in all papers.
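To make the role of ground truth concrete, the following sketch computes the error metric most often quoted in the tables below: the median great-circle distance between inferred and known target locations. It is a minimal illustration, not the evaluation code of any cited paper; the function names, the spherical-Earth approximation, and the sample addresses and coordinates are all assumptions made for this example.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def median_error_km(estimates, ground_truth):
    """Median error distance over targets that have both an estimate and a known location."""
    errors = [great_circle_km(*estimates[t], *ground_truth[t])
              for t in ground_truth if t in estimates]
    return median(errors)

# Hypothetical targets (documentation-range addresses) with made-up coordinates.
ground_truth = {
    "192.0.2.1": (48.85, 2.35),
    "192.0.2.2": (52.52, 13.40),
    "192.0.2.3": (41.90, 12.50),
}
estimates = {
    "192.0.2.1": (50.85, 2.35),   # two degrees of latitude off
    "192.0.2.2": (52.52, 13.40),  # exact
    "192.0.2.3": (45.46, 9.19),   # wrong city
}

print(f"median error: {median_error_km(estimates, ground_truth):.1f} km")
```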

Geolocation Papers

The tables below contain annotations for papers on the topic of geolocation. We have collected and reviewed papers published between 1996 and 2010, starting with papers from peer-reviewed academic research conferences, and then including papers cited from this initial seeding, as well as follow-up papers written by the same authors. We provide a flexible interactive table that supports selection of relevant attributes from these papers.

The first table emphasizes papers that directly address geolocation methodology, introducing new methods, extensions to previous methods, performance analysis, etc. The second table includes papers that address other geolocation-related issues, including applications of geolocation, and coordinate-based methods for modeling network delays.

Alongside author and publication information, the tables include a number of additional columns.

Paper Type gives a category indication; we use "survey", "analysis", "methodology", "tools", and "other". "Methodology" papers develop specific methods of geolocation; "analysis" papers focus on providing a quantitative foundation for geolocation methods (e.g., by comparing results for several methods); "survey" papers provide an overview of geolocation issues.

Data describes the type of data on which the results claimed in the paper are based. We mention here if the paper describes "ground truth" (authoritative mappings between IP addresses and geographic locations) used to validate geolocation results.

Findings gives a brief description of the main results claimed in the paper.

Measurement Setup gives an indication of the experimental setup (probes, landmarks, targets) used in a geolocation experiment (where appropriate).

Note: This is a wide multi-column table, and horizontal scrolling may be required.
ID Title Year Publication Authors Paper Type Method Data Findings Measurement Setup PDF
P-01 A Means for Expressing Location Information in the Domain Name System 1996 RFC Editor Davis, C. and Vixie, P. and Goodwin, T. and Dickinson, I. RFC 1876 Description of DNS LOC records
P-02 GTrace - A Graphical Traceroute Tool 1999 Usenix LISA Periakaruppan, R. and Nemeth, E. tool GUI for displaying traceroute results on geographic map uses geo-info from DNS LOC, WHOIS (NetGeo), IP-to-location databases, hostname heuristics, combined with RTT-based verification
P-03 Where in the World is netgeo.caida.org? 2000 INET Moore, D. and Periakaruppan, R. and Donohoe, J. and claffy, k. tool
P-04 An Investigation of Geographic Mapping Techniques for Internet Hosts 2001 SIGCOMM Padmanabhan, V.N. and Subramanian, L. methodology IP2Geo suite: GeoTrack (traceroute info+host name heuristics), GeoPing (based on similarities in RTT-delay patterns), GeoCluster (IP-to-location database+BGP routing info) IP-to-location datasets: 41772 Hotmail users at state granularity; 181246 IPs from bCentral web-hosting company at zipcode granularity; 142807 IPs from FooTV at zipcode granularity GeoTrack most promising with median errors of 28 (well-connected hosts) to few hundred km. Median errors for geolocation on same set of univ hosts: 28 km for GeoCluster; 102 for GeoTrack; 382 for GeoPing 14 landmarks and 365 targets (in US, mostly univ-based)
P-05 Similarity Models for Internet Host Location 2003 ICON Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. analysis explores various similarity measures for GeoPing-type (see P-04) geolocation RIPE TTM delay measurements (Dec 2002 to Jan 2003) similarity based on "city-block" distance measure works best 55 landmarks (RIPE TTM hosts)
P-06 Toward a measurement-based geographic location service 2004 Passive and Active Network Measurement Workshop (PAM) Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. analysis explores various similarity measures for GeoPing-type (see P-04) geolocation delays from probes to all landmarks, and from probes to target host sufficient correlation between geographic distance and network delay exists for coarse-grained geolocation; explores mostly distance based similarity measures; median distance error of 314 km 397 landmarks (RIPE TTM and LibWeb servers); 9 probes (NIMI)
P-07 Constraint-based Geolocation of Internet Hosts 2004 IEEE/ACM Transactions on Networking Gueye, B. and Ziviani, A. and Crovella, M. and Fdida, S. methodology multilateration based on geometric constraints (upper limit on target host distance from landmark) derived from delay measurements between landmarks provides location estimate and confidence region one-way delays between TTM hosts (Dec 2002-Feb 2003); RTT delays between AMP hosts (30 Jan 2003); known landmark locations median error of 95 km (NLANR, USA) and 22 km (TTM, Europe) Landmarks: 95 NLANR AMP hosts; 42 RIPE TTM hosts
P-08 Demographic Placement for Internet Host Location 2003 GLOBECOM Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. methodology develops methodology for efficiently deploying landmarks and probes for delay-based geolocation methods (in particular GeoPing from P-04) Landmarks are placed using demographic criteria (locations with high user density); probes are placed sparsely at locations with high connectivity
P-09 Improving the accuracy of measurement-based geographic location of Internet hosts 2005 Computer Networks and ISDN Systems Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. methodology explores several issues related to implementation of GeoPing-type geolocation: correlation between RTT and geographic distance; optimal placement of landmarks and probes; methods for evaluating similarities between delay patterns delays from LIP6 (Paris, France) to 109 LibWeb hosts in June 2002; delays between 55 RIPE TTM hosts from Dec 2002 to Feb 2003 number and location of landmarks and probes are optimized using a demographic approach; similarity based on "city-block distances" outperforms Euclidean distance model 109 LibWeb webservers; 55 RIPE TTM hosts
P-10 Towards IP geolocation using delay and topology measurements 2006 Internet Measurement Conf (IMC) Katz-Bassett, E. and John, J.P. and Krishnamurthy, A. and Wetherall, D. and Anderson, T. and Chawathe, Y. methodology Develops Topology-based geolocation (TBG): improves on pure CBG based on end-to-end delays by leveraging network topology at the router level and validated external hints; uses global optimization approach to determine router and target locations simultaneously. PlanetLab hosts are used as landmarks. Geolocation experiments using targets collocated with 11 Abilene PoPs, 22 Sprint PoPs and 128 Univ. hosts Typically reduces errors by a factor of 3 to 4 compared with CBG 68 PlanetLab landmarks; 128 US Univ. host targets
P-11 Leveraging Buffering Delay Estimation for Geolocation of Internet Hosts 2006 Int Federation for Information Processing Technical Committee 6 (IFIP-TC6) Networking Conf Gueye, B. and Uhlig, S. and Ziviani, A. and Fdida, S. methodology GeoBuD: CBG (P-07) augmented with buffering delay estimates at intermediate routers derived from traceroutes traceroutes from PlanetLab landmarks for 17 Oct 2005 (US dataset); and 21 Nov 2005 (WE dataset) incorporating buffering delays at intermediate routers in CBG (P-07) reduces median geolocation error from 228 to 144 km for US dataset and 137 to 100 km for WE dataset 29 US PlanetLab landmarks with 87 AMP targets; 27 WE PlanetLab landmarks with 57 RIPE TTM targets
P-12 IP Geolocation 2007 Internet Measurement seminar Holzhauer, F. survey review of methods, emphasizing CBG (P-07) and TBG (P-10)
P-13 Octant: A Comprehensive Framework for the Geolocalization of Internet Hosts 2007 USENIX Symp on Networked System Design and Implementation (NSDI) Wong, B. and Stoyanov, I. and Gun Sirer, E. methodology uses positive and negative constraints on hosts and intermediate routers; assigns "weights" to handle uncertainty in constraints; uses fictitious "height" to capture last hop delays; uses geometric technique based on Bézier curves that can incorporate extraneous geographic hints Latency data collected on 1-Feb-2006 and 18-Sep-2006 between landmarks, intermediate routers and targets Median geolocation error of 35 km 51 PlanetLab hosts; 53 public traceroute servers
P-14 Investigating the Imprecision of IP Block-based Geolocation 2007 Passive and Active Network Measurement Workshop (PAM) Gueye, B. and Uhlig, S. and Fdida, S. analysis uses CBG geolocation to investigate geographic spread of IP addresses in same IP block CBG (P-07) locations for 18759 IPs in 876 IP blocks between 31 Mar and 19 Apr 2006. IPs are CoralCDN Web clients, linked to IP blocks using database from paper R-05 ~ 60% of IP blocks have spread in excess of 200 km 74 PlanetLab landmarks
P-15 Assessing the geographic resolution of exhaustive tabulation for geolocating Internet hosts 2008 Passive and Active Network Measurement Workshop (PAM) Siwpersad, S.S. and Gueye, B. and Uhlig, S. analysis comparison of location estimates from CBG (P-07) with locations from MaxMind and Hexasoft databases single IPs from 41758 MaxMind and 15823 Hexasoft IP blocks are geolocated with CBG database location for more than 90% of IP blocks lies outside CBG confidence region 39 PlanetLab landmarks
P-16 Internet geolocation: evasion and counterevasion 2009 ACM Computing Surveys Muir, J. and van Oorschot, P.C. survey overview of geolocation methods with general discussion of limitations; discussion of ways adversaries can avoid geolocation; mentions extraction of IP using Java applet, and RTT measurement by HTTP refreshes no geolocation method is robust (works for all IP addresses, network configs, and against adversarial users); those trying to evade geolocation can complicate the task for locators, but geographic information can leak in many ways
P-17 Mining the Web and the Internet for Accurate IP Address Geolocations 2009 IEEE Conf on Computer Communications (INFOCOM) Guo, C. and Liu, Y. and Shen, W. and Wang, H.J. and Yu, Q. and Zhang, Y. methodology database mining technique (Structon): geographic information from large database of Webpages, combined with number of heuristics to increase accuracy and coverage 500 million Chinese URLs, augmented with traceroutes, WHOIS and BGP information 87% accuracy at city-level granularity
P-18 Statistical geolocation of Internet hosts 2009 Int Conf on Computer Communications and Networks (ICCCN) Youn, I. and Mark, B.L. and Richards, D. methodology delay-based statistical method: delay-to-distance relationship is expressed in a probability density function; solution by iterative force-directed method delay measurements between all pairs of landmarks every five minutes for one week Compared to GeoPing (P-04) and CBG (P-07) median errors are reduced by ~20%; mean errors by ~50% (i.e., significant improvement especially in reducing large errors in GeoPing and CBG) 85 PlanetLab landmarks
P-19 A study of geolocation databases 2010 arXiv cs.NI/1005.5674v3 Shavitt, Y. and Zilberman, N. survey statistical analysis of PoP "range of convergence" and deviations of IP and PoP locations within PoP PoP map of 3800 PoPs (52K IPs) derived from DIMES traceroute measurements in March 2010 vast majority of location info in databases is correct, but errors in the range of 1000s of km also occur DIMES
P-20 GeoWeight: Internet host geolocation based on a probability model for latency measurements 2010 Australasian Conf on Computer Science (ASCS) Arif, M.J. and Karunasekera, S. and Kulkarni, S. methodology constraint-based (CBG) augmented by a probability model for latency vs. geographic distance 150000 latency measurements PlanetLab landmarks from 23 Sep 2008 to 25 Oct 2008; latencies from landmarks to 80 NA targets Median geolocation errors of ~44 km compared to > 200 for Octant and > 500 for CBG 50 PlanetLab landmarks and 80 targets in North America
P-21 A model based approach for improving router geolocation 2010 Computer Networks: The Int Journal of Computer and Telecommunications Networking Laki, S. and Matray, P. and Haga, P. and Csabai, I. and Vattay, G. methodology develops detailed path-latency model (separating propagation and per-hop delays); uses global optimization to solve for target locations (similar to P-10) mean geolocation errors of ~150 km ETOMIC landmarks and 41 GEANT2 targets; 151 PlanetLab nodes, used as landmarks and targets
P-22 Internet Host geolocation using maximum likelihood estimation technique 2010 IEEE Int Conf on Advanced Information Networking and Applications (AINA) Arif, M.J. and Karunasekera, S. and Kulkarni, S. and Gunatilaka, A. and Ristic, B. methodology delay-based statistical method: delay-to-distance relationship is expressed as a probability density function; solution by MLE method delay measurements between landmarks from 23 Sep to 25 Oct 2008 median error of 134 km, compared to 216 for Octant (P-13) and 506 km for CBG (P-07) on same dataset 50 NA PlanetLab landmarks and 50 other NA hosts as targets
P-23 Dude, where's that IP? Circumventing measurement-based IP geolocation 2010 Usenix Security Symp Gill, P. and Ganjali, Y. and Wong, B. analysis Simulated "attacks" using PlanetLab testbed to foil delay-based and topology-aware geolocation attempts topology-aware techniques are more susceptible to tampering than simpler delay-based techniques 50 NA and 30 WE PlanetLab nodes
P-24 A learning-based approach for IP geolocation 2010 PAM Eriksson, B. and Barford, P. and Sommers, J. and Nowak, R. methodology statistical method: expresses relation of distance to delay and hop count as a probability density function; solution by learning-based classification method iPlane data for 12 Dec 2008 to 8 Jan 2009, supplemented with traceroutes between 375 PlanetLab hosts. Three sets of PlanetLab traceroutes between 11 Dec 2008 and 6 Jan 2009 MaxMind db is used as "ground truth". Results are compared with CBG (P-07): mean error reduces from 519 km for CBG to 408 km for proposed learning-based method 375 NA PlanetLab nodes
P-25 Spotter: A model based active geolocation service 2011 INFOCOM Laki, S. and Matray, P. and Haga, P. and Sebok, T. and Vattay, G. methodology delay-based statistical method: combining spatial probability density function for all landmarks defines estimated region for location of target uses PlanetLab nodes as targets; also uses 23000 Cogent IP address locations in Europe and US all landmarks are described by same probabilistic delay-distance model PlanetLab
P-26 Towards street-level client-independent IP geolocation 2011 Usenix Wang, Y. and Burgener, D. and Flores, M. and Kuzmanovic, A. and Huang, C. methodology combines active measurement approach with an active web-mining technique. Uses CBG for "coarse" geolocation; refines location using "relative network distance" in combination with large number of landmarks located using web-mining technique. method evaluated using 88 PlanetLab nodes; a set of 72 residential IP addresses; and 3rd undisclosed dataset claims median errors of 1-2 km for the three datasets used. 88 PlanetLab targets
P-27 iPlane Nano: path prediction for peer-to-peer applications 2009 Usenix Symp on Networked systems design and implementation (NSDI) Madhyastha, H.V. and Katz-Bassett, E. and Anderson, T. and Krishnamurthy, A. and Venkataramani, A. methodology provides atlas of PoP-level paths, with latency and loss-rate predictions between arbitrary hosts on the Internet, by path stitching across inferred PoP paths. iPlane data iPlane iNano provides PoP-level paths between arbitrary end-hosts with an atlas that is less than 7 MB in size and can be updated
P-28 Matchmaking for online games and other latency-sensitive P2P systems 2009 ACM SIGCOMM Computer Communication Review Agarwal, S. and Lorch, J.R. methodology Htrae: places pairs of clients in a network coordinate system to provide client-to-client latency prediction. They seed their network coordinate system with MaxMind GeoLite coordinates. 3.5 million console-to-console latencies from Halo (Microsoft); GeoLite IP to geographic location (MaxMind) 50% of predictions under 15 ms for Htrae, 24 ms for GeoLite. 95% of Htrae's predictions within 138 ms and 208 for Geolocation. one-time volunteer Htrae deployment on 11 home machines
P-29 IP geolocation databases: unreliable? 2011 ACM SIGCOMM Computer Communication Review Poese, I. and Uhlig, S. and Kaafar, M.A. and Donnet, B. and Gueye, B. survey Compares prefix distributions in 5 geolocation DBs with groundtruth. Groundtruth: 357 BGP prefixes from large European ISP with city-level location of router advertising subnet inside ISP. Databases of HostIP, IP2Location, InfoDB, MaxMind, Software77 databases are strongly biased to popular countries; db IP blocks use official advertisements of ISP; while some of the ISP address space is geolocated decently (e.g. 20% of MaxMind within 10s of km of groundtruth), in most cases DBs are off by 100s to 1000s of km.
P-30 Geocompare: a comparison of public and commercial geolocation databases 2011 CAIDA Technical Report Huffaker, B. and Fomenkov, M. and claffy, kc survey compares geographic databases against each other and a ground truth dataset RIR, Software77, HostIP, IPligence, Cyscape, MaxMind GeoIP, MaxMind GeoLite, IPInfoDB, and Digital Envoy databases roughly all agree on country, MaxMind GeoIP and Digital Envoy did best on ground truth. Digital Envoy did best on routers.
P-31 Network measurement based modeling and optimization for IP geolocation 2012 Computer Networks Dong, Z. and Perera, R.D.W. and Chandramouli, R. and Subbalakshmi, K.P. methodology measurement-based geolocation method tested on PlanetLab nodes; k-means clustering of landmark-to-landmark measurements defines distance segments that each fits RTT vs distance using polynomial regression; semidefinite programming method finds optimized location of target host from estimated distances to landmarks. traceroutes from PlanetLab probes to landmarks for one week in Nov 2010 Best results: 27-32 km in North America; 41-53 in Europe 81 North-American and 90 European PlanetLab probes; 206 PlanetLab landmarks
P-32 A structural approach for PoP geo-location 2012 Computer Networks Feldman, D. and Shavitt, Y. and Zilberman, N. methodology Method to generate PoP-level geographic maps from IP-level graph based on DIMES traceroutes. PoP identification based on structure ('motifs') and a partitioning algorithm that assigns nodes to PoPs; geographic location assigned to PoP from geoloc DBs DIMES: 56M traceroutes from 2009 Jul and 33M from 2010 Oct. GeoDBs used: MaxMind, IPligence, HostIP, IP2Location, GeoBytes Comparison with published PoP maps finds that most large PoPs are found, but few small ones. The majority of incorrect links are attributed to database errors. DIMES, 1308 agents in 49 countries
P-33 Posit: a lightweight approach for IP geolocation 2012 ACM SIGMETRICS Performance Evaluation Review Eriksson, B. and Barford, P. and Maggs, B. and Nowak, R. methodology CBG for geolocation to constrained region; finds the most likely location from limited pool (i.e., city) given RTT to target and landmarks in constrained region. 431 Akamai vantage points/targets and an additional 283 targets
P-34 Enhancing the classification accuracy of IP geolocation 2012 Military Communications Conf (MILCOM) Maziku, H. and Shetty, S. and Han, K. and Rogers, T. methodology Machine-learning approach (extension of P-24). Larger set of classifiers include average, median, mode and std dev of delay measurement; hop count; pop. density. 142,937 (23,843 after de-aliasing) router IP addresses from traceroutes between PlanetLab nodes between Jun and Oct 2011. Results heavily depend on good coverage by landmarks. Median errors vary from 0 (NE US) to ~500 km (N-Central US). 67 well-distributed PlanetLab nodes across US serve as landmarks and probes
P-35 Towards geolocation of millions of IP addresses 2012 ACM Internet measurement conference (IMC) Hu, Z. and Heidemann, J. and Pradkin, Y. methodology select 10 nearest, by RTT, Vantage Points to probe /24 prefix 400 PlanetLab nodes to 25 known landmarks proves that selecting 10 Vantage Points with lowest RTT values to a given /24 prefix greatly reduces the amount of traffic needed to geolocate without large increases in error
P-36 Using Whois based geolocation and Google maps API for support cybercrime investigations 2013 Recent Advances in Telecommunications and Circuits Butkovic, A. and Orucevic, F. and Tanovic, A. methodology
P-37 Topology mapping and geolocating for China's Internet 2013 IEEE Trans. On Parallel and Distributed Systems Tian, Y. and Dey, R. and Liu, Y. and Ross, K.W. methodology

Related Papers

ID Title Year Publication Authors Paper Type Method Data Findings Measurement Setup PDF
R-01 Predicting Internet Network Distance with Coordinates-Based Approaches 2002 IEEE Conference on Computer Communications (INFOCOM) Ng, T. S. E. and Zhang, H. methodology develops GNP, a coordinate-based method for estimating minimum RTT using absolute coordinates distance (minimum RTT) measurements between landmarks and two sets of targets Euclidean embedding combined with a relative error measurement function works best 19 landmarks (12 in NA; 5 in AP; 2 EU); two target sets: 869 global; 127 Abilene-connected
R-02 Virtual Landmarks for the Internet 2003 Internet Measurement Conference (IMC) Tang, L. and Crovella, M. methodology, analysis coordinate-based method for estimating minimum RTT using Euclidean embedding; uses "virtual landmarks" for speed and scalability seven collections of RTT data network distances can be described with 7-9 orthogonal vectors; ~90% of distances preserved with relative error <0.5; "virtual landmark" method simpler and faster than nonlinear optimization NLANR AMP
R-03 On the geographic location of Internet resources 2003 J. on Selected Areas in Communications, Vol. 21., pp. 934-947 Lakhina, A. and Byers, J.W. and Crovella, M. and Matta, I. analysis statistical analysis of geographic properties of router topology. Uses CAIDA Skitter data (26 Dec 2001 to 1 Jan 2002) and Scan Project Mercator data (Aug 1999). Geolocation is done using IxMapper and EdgeScape. Superlinear relation between router and population density; connection patterns linked to geographic distance. Nr of AS locations correlates with AS degree and AS geographic dispersal.
R-04 Vivaldi: A Decentralized Network Coordinate System 2004 SIGCOMM Dabek, F. and Cox, R. and Kaashoek, F. and Morris, R. methodology Coordinate-based method for predicting communication latencies using 2D Euclidean embedding augmented with a "height" component RTTs between 192 PlanetLab nodes; and between 1740 DNS servers Median relative error in RTT prediction of 11% PlanetLab
R-05 Geographic Locality of IP Prefixes 2005 Internet Measurement Conference (IMC) Freedman, M. J. and Vutukuru, M. and Feamster, N. and Balakrishnan, H. analysis Statistical analysis of geographic properties of IP prefixes within context of implications for routing policies. Uses undns for geolocation (i.e., IP-to-geographic location mapping based on geographic information in DNS names) 170000 IP prefixes from RouteViews from 27-Feb-2005; traceroutes to CoralCDN clients and servers; traceroutes from PlanetLab hosts to 4 IPs per prefix discontiguous prefixes announced by AS from single location usually due to fragmented allocation by registries; announcement of contiguous prefixes by AS from different geographic locations limits opportunities for aggregation of prefixes 25 PlanetLab hosts
R-06 Geolocalization of Proxied Services and its Application to Fast-Flux Hidden Servers 2009 IMC Castelluccia, C. and Kaafar, M.A. and Manils, P. and Perito, D. application application of CBG (P-07) to geolocation of fast-flux hidden servers
R-07 Eyeball ASes: From Geography to Connectivity 2010 Internet Measurement Conference (IMC) Rasti, A. and Magharei, N. and Rejaie, R. and Willinger, W. analysis IP geolocation (48×10⁶ IP addresses from P2P apps) done using GeoIP and IP2Location; used to determine geo- and PoP-level footprints of ASes PoP info for 45 ASes in NA and EU compiled from online data
R-08 Improving AS Relationship Inference Using PoPs 2013 Traffic Monitoring and Analysis Workshop (TMA 2013) Neudorfer, L. and Shavitt, Y. and Zilberman, N. methodology Method to use PoP level maps to find complex and anomalous AS relationships 29M DIMES traceroutes (May 2012); DIMES IP-to-PoP mapping (May 2012) with 5215 PoPs, 98650 IPs in 2636 AS; CAIDA AS rank data from August 2012 with 119,924 AS pairs. Discusses several examples of complex and/or anomalous AS relationships (different at different ASes) DIMES

Measurement Infrastructure

The above bibliography references several datasets. These resources are listed here with references back to the papers.

ID | Name | Organization | Description | Paper ID
D-01 | PlanetLab | Princeton University | Global research network for the development of new network services. Currently over 1000 nodes worldwide | P-10, P-11, P-13, P-14, P-15, P-18, P-20, P-21, P-22, P-23, P-24, P-25, P-26, R-04, R-05, R-06
D-02 | iPlane | University of Washington | Scalable service for predictions of Internet path performance for emerging overlay services (incl. access to iPlane datasets) | P-20, P-22, P-24
D-03 | TTM | RIPE | Test Traffic Measurement Service (TTM) measures key parameters of the connectivity between points on the Internet | P-05, P-06, P-07, P-09, P-11
D-04 | AMP | NLANR | NLANR Active Measurement Project (AMP), active between 1998-2006. Datasets available at RIPE. | P-07, P-11, R-02
D-05 | ETOMIC | | European Traffic Observatory Measurement Infrastructure (ETOMIC) is a measurement infrastructure, distributed throughout Europe, that is able to carry out active measurements | P-21
D-06 | GEANT2 | | High-bandwidth, academic Internet serving Europe's research and education community | P-21
D-07 | DIMES | Tel Aviv Univ. | Distributed scientific research project, aimed at studying structure and topology of the Internet | P-19
D-08 | Skitter | CAIDA | Tool for actively probing the Internet for topology and performance analysis. Retired in 2008. Dataset available from CAIDA | R-02

Geographic Information and Geolocation Methods

Current geolocation techniques can be broadly divided into two categories: database-driven (or registry-based; P-25) and measurement-based. This categorization mirrors a similar division in the types of geographic information available for IP geolocation: qualitative data, and numerical (quantitative) data. Both have been present in geolocation efforts from the outset.

The class of quantitative data includes the workhorse of measurement-based geolocation methods: delay measurements from probes to landmarks and targets. A number of publications establish the relationship between Internet delay and geographic distance (P-06, P-09, P-10, P-25) in the presence of obfuscating factors such as circuitous routing and buffering and other delays (P-11, P-21). Also included in this class is network topology information, typically derived from traceroute measurements. Topology information can be an integral part of a geolocation algorithm (e.g., when intermediate routers to an end target are geolocated alongside the target itself in a global optimization; P-10), but is also used in simpler arguments that relate topological proximity to geographic proximity (e.g., when geolocating the last intermediate router when the real target is unreachable). Hop counts (also derived from traceroutes) are explored in a recent paper (P-24) as another quantitative measure of geographic distance.
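The simplest form of the delay-distance relationship is a physical upper bound: a packet cannot travel faster than light, so a measured round-trip time limits how far away the other endpoint can be. The sketch below illustrates that bound only. The function name is hypothetical, and the propagation factor of 2/3 of the vacuum speed of light (a commonly quoted figure for optical fiber) is an assumption of this example rather than a value taken from the papers listed here.

```python
SPEED_OF_LIGHT_KM_PER_MS = 299.792458  # vacuum speed of light, km per millisecond
FIBER_FACTOR = 2.0 / 3.0               # assumed propagation speed in fiber relative to c

def max_distance_km(rtt_ms: float, propagation_factor: float = FIBER_FACTOR) -> float:
    """Upper bound on the one-way distance implied by a round-trip time.

    Routing detours, queuing and processing only add delay, so the real
    distance can be much smaller than this bound but never larger.
    """
    if rtt_ms < 0:
        raise ValueError("RTT must be non-negative")
    one_way_ms = rtt_ms / 2.0
    return one_way_ms * SPEED_OF_LIGHT_KM_PER_MS * propagation_factor

for rtt in (1.0, 10.0, 100.0):
    print(f"RTT {rtt:6.1f} ms -> at most {max_distance_km(rtt):8.1f} km")
```

Because the bound is loose in practice, the cited methods calibrate tighter delay-to-distance relations from measurements between hosts with known locations.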

The class of qualitative data includes the usual suspects (WHOIS registry, DNS LOC records, DNS names, BGP routing tables), but also databases based on information gathered from the Internet community (either directly through user input, or indirectly, e.g. by parsing large quantities of URLs; P-17). This probably also includes the various types of proprietary databases used in commercial geolocation products. All of these contain geographic information (directly as in DNS LOC records, or indirectly by linking to an organization or AS number) that, if correctly interpreted, provides clues about the geographic location of an IP address, or IP address block.

The earliest geolocation attempts, GTrace (P-02; constructed around NetGeo), GeoTrack and GeoCluster (P-04), emphasize qualitative data (primarily WHOIS records and DNS names), but already incorporate delay (round-trip time, RTT) measurements. GTrace uses RTT data to validate results using "speed-of-light" arguments; GeoPing (P-04) is purely RTT-based. Since these early attempts, a number of measurement-based algorithms have appeared in the academic literature. The table below provides an overview of the accuracy achieved by the various techniques. In that table only the first three entries are database-driven; all others (starting with GeoPing) are measurement-based.
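A "speed-of-light" argument of the kind mentioned above can be sketched as a plausibility check: a candidate location is rejected if a signal could not have covered the great-circle distance from the probe within the measured RTT. This is our own minimal illustration, not GTrace's actual code. It uses the vacuum speed of light as a hard limit, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_KM_PER_MS = 299.792458  # speed of light in vacuum, a hard physical limit


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two latitude/longitude points, in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_plausible(probe, candidate, rtt_ms):
    """True if the candidate (lat, lon) could be reached from the probe within rtt_ms."""
    distance = great_circle_km(probe[0], probe[1], candidate[0], candidate[1])
    return distance <= (rtt_ms / 2.0) * MAX_KM_PER_MS
```

For example, a database entry that places a host in Amsterdam is not credible if a probe in San Diego measures a 20 ms RTT to it, whereas a 200 ms RTT leaves the entry unrefuted.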

GeoPing (P-04) builds a delay "fingerprint" for the target and for each landmark from delay measurements taken from a common set of probes, and assigns the target the location of the landmark with the most similar fingerprint. As the first measurement-based geolocation method, it appears to be mostly of historical significance at this point. Constraint-based geolocation (CBG; P-07), which uses deterministic geometric constraints derived from delay measurements to bound the probable location of a target, set the stage for future development and is the most common benchmark against which more recent models are compared.
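The two ideas can be sketched in a few lines each. The code below is a simplified illustration under our own assumptions, not the published algorithms: the fingerprint comparison uses plain Euclidean distance between delay vectors, and the constraint step simply filters a list of candidate points against per-landmark maximum distances, whereas the published CBG method derives those distances from calibrated delay measurements. All function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def geoping_estimate(target_fp, landmark_fps):
    """Return the name of the landmark whose delay fingerprint is closest.

    target_fp: list of RTTs (ms) from each probe to the target.
    landmark_fps: dict mapping landmark name -> RTTs from the same probes.
    """
    return min(landmark_fps, key=lambda name: math.dist(target_fp, landmark_fps[name]))


def cbg_feasible_points(candidates, constraints):
    """Keep the candidate (lat, lon) points that satisfy every constraint.

    constraints: list of ((lat, lon) of landmark, maximum distance in km).
    """
    return [c for c in candidates
            if all(great_circle_km(lm, c) <= max_km for lm, max_km in constraints)]
```

The fingerprint approach can only ever answer with the location of an existing landmark, while the constraint approach yields a region whose size also indicates how confident the estimate is.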

Subsequent geolocation methods show an increasing sophistication in extracting geographic information, either by supplementing delay measurements with additional data, or by using more complex algorithms. Topology-based geolocation (TBG; P-10) introduces topology measurements to simultaneously geolocate intermediate routers and targets. Further refinements include an improved analysis of delay measurements (separating the distance-sensitive propagation delays from other processing delays; P-11, P-21), incorporating database-driven approaches to improve geolocation accuracy (P-10), and integrating hop counts into the geolocation algorithm (P-24). Algorithms are also evolving. The most recent models favor probabilistic approaches, which seem to be a better match to the essentially statistical nature of the relation between geographic distance and delay measurements. GeoWeight (P-20) marks a transition by combining deterministic constraints, similar to CBG, with probability assignments; P-18, P-22, P-24 and P-25 describe delay measurements using probability density functions, and use various statistical methods to build a geolocation algorithm.
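To make the probabilistic idea concrete, the sketch below scores candidate locations by the likelihood of the observed delays. It does not reproduce the model of any cited paper: we assume, purely for illustration, that the RTT from a probe is normally distributed around a linear function of distance. The parameters `base_ms`, `ms_per_km` and `sigma_ms` are hypothetical values that a real method would estimate from landmark measurements.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def log_likelihood(candidate, observations, base_ms=5.0, ms_per_km=0.01, sigma_ms=5.0):
    """Sum of Gaussian log densities of the observed RTTs, given a candidate location.

    observations: list of ((lat, lon) of probe, measured RTT in ms).
    """
    total = 0.0
    for probe, rtt_ms in observations:
        expected = base_ms + ms_per_km * great_circle_km(probe, candidate)
        z = (rtt_ms - expected) / sigma_ms
        total += -0.5 * z * z - math.log(sigma_ms * math.sqrt(2 * math.pi))
    return total


def most_likely(candidates, observations):
    """Return the candidate location with the highest log-likelihood."""
    return max(candidates, key=lambda c: log_likelihood(c, observations))
```

Unlike a hard constraint, a delay that is slightly too large or too small only lowers a candidate's score instead of excluding it outright.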

Few detailed descriptions of database-driven techniques exist in the literature. The exceptions are NetGeo (P-03) and Structon (P-20). Not surprisingly, published literature contains little concrete information about algorithms employed in commercial geolocation products. Whether the qualitative input data are web pages, WHOIS registry records, or DNS names, a database-driven geolocation algorithm tends to be a collage of various heuristic arguments, approximations and intelligent guesswork.

Error Distance Matrix

The table below compiles numbers from geolocation experiments described in the above publications for measurement-based techniques. The column headers indicate a range of median errors in geolocation distance reported in the papers; the values in the columns are the number of experiments that report median errors in the indicated range. Even though direct comparison of these numbers is tricky due to the wide variations in experiment characteristics (different types of targets, different sets of landmarks, etc.), the picture that emerges is that state-of-the-art measurement-based techniques can comfortably geolocate targets with median errors of < 250 km, while some techniques under favorable conditions can approach an accuracy of < 100 km. To put this in context: 1000 km can be roughly viewed as country granularity; 50 km approaches city or zip code granularity.

Median error ranges (column headers): d < 5 km; 5 < d < 50 km; 50 < d < 100 km; 100 < d < 250 km; 250 < d < 500 km; 500 < d < 750 km

Method (ID): number of experiments in each non-empty column, left to right
NetGeo (P-03): 1
GeoTrack (P-04): 1, 2, 1
GeoCluster (P-04): 1
GeoPing (P-04): 1, 5, 2
CBG (P-07): 1, 3, 6, 2, 2
TBG (P-10): 1, 1
GeoBuD (P-11): 1, 1
Octant (P-13): 2, 4, 2
SG (P-18): 1
GeoWeight (P-20): 1
Geo-Rh (P-21): 1
MLE (P-22): 1
Naive Bayes (P-24): 1
Spotter (P-25): 1, 1
Street GeoLoc (P-26): 1
Dong et al. (P-31): 1
Maziku et al. (P-34): 1
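
The binning behind the table can be sketched as follows: for each experiment, take the median of the per-target error distances and find the range that contains it. This is a minimal illustration using the range edges from the column headers. The function name is hypothetical, and because the headers use strict inequalities, assigning a median that falls exactly on an edge to the higher range is our own choice.

```python
import statistics

# Upper edges (km) of the median-error ranges used as column headers in the table.
BIN_EDGES_KM = [5, 50, 100, 250, 500, 750]


def median_error_bin(error_distances_km):
    """Return (median, label), where label names the range containing the median."""
    median = statistics.median(error_distances_km)
    lower = 0
    for upper in BIN_EDGES_KM:
        if median < upper:
            label = f"d < {upper} km" if lower == 0 else f"{lower} < d < {upper} km"
            return median, label
        lower = upper
    # Medians beyond the last column fall outside the table.
    return median, f"d > {BIN_EDGES_KM[-1]} km"
```

An experiment with per-target errors of 12, 80, 95, 140 and 2300 km has a median of 95 km and would be counted in the 50 < d < 100 km column, even though one of its targets is off by more than 2000 km.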

Discussion

A direct comparison between measurement-based and database-driven approaches, or even just between measurement-based algorithms, is tricky at best. A systematic comparison would require a reliable "ground truth" database of IP addresses at known geographic locations, which is difficult to find. In practice, the pool of potential test targets at known locations is limited: most recent published experiments select their ground truth from hosts in measurement infrastructures such as PlanetLab in North America or Europe. So, even though this is hard to quantify, the ground truth in some published experiments is probably similar. In some papers the same ground truth is used to compare different algorithms (typically with CBG as the benchmark, which explains the high number of entries for CBG in the above table), providing some insight into comparative performance. Obvious questions remain, though. How representative are results based on a limited number of PlanetLab targets for the Internet as a whole? How much does the accuracy of a method vary from well-connected hosts (routers) to a heterogeneous collection of end hosts? Looking at the above table, the median errors for CBG experiments vary from better than 50 km to more than 500 km (one order of magnitude), presumably reflecting a wide variation in experiment characteristics.

In an average sense the performance of the best geolocation techniques can be quantified reasonably well: the best measurement-based methods have median errors of at most a few hundred km (well within country granularity), with the best results perhaps approaching 50 km (city or zip code level). Similarly, database-driven techniques also appear to do quite well at the country level, but start running out of steam at the city level. Whether database-driven or measurement-based, all techniques suffer from what might be called an outlier syndrome: all are plagued by outliers with location errors well exceeding 1000 km (or country level). It would seem that for any potential application of geolocation the key question to ask is whether being right most of the time is good enough. If the answer is yes, a secondary question is whether the average accuracy of a selected algorithm is satisfactory.
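
Both questions can be asked of the same set of per-target errors. The sketch below is our own illustration with a hypothetical function name: it reports the median error alongside the fraction of targets whose error exceeds a threshold, which defaults to the 1000 km country-level figure used above.

```python
import statistics


def summarize_errors(error_distances_km, outlier_threshold_km=1000.0):
    """Return the median error and the fraction of errors above the threshold."""
    if not error_distances_km:
        raise ValueError("need at least one error distance")
    median = statistics.median(error_distances_km)
    outliers = sum(1 for e in error_distances_km if e > outlier_threshold_km)
    return {
        "median_km": median,
        "outlier_fraction": outliers / len(error_distances_km),
    }
```

A method with a respectable median can still place a substantial share of targets in the wrong country, so an application that cannot tolerate such misses should look at the second number first.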
