Internet Protocol Address (IP) Geolocation Bibliography
Introduction
IP address geolocation reminds one of the classic bumper sticker, "think globally, act locally." In today's far reaching Internet, organizations and institutions of all kinds from corporations to governments want exactly that, the ability to communicate to the entire world and, at the same time, to develop applications which help them to target, limit, customize their messages, balance resources, and coordinate responses based on the location of the receiver. Organizations accomplish this by using tools and services that translate an IP address or prefix range into a geographic location (country, state, city, zip, geographic latitude/longitude) associated with the address(es). Simple, right?
However, which method(s) work best? Which sources of geolocation services and information return the most reliable locations and at what cost? What is the geographic resolution? Further, if a source provides the geographic location of the owner of an IP address, is this location the same as the location where the device is actually broadcasting and receiving packets? And, if different, can the difference be quantified?
What constitutes a "good" geolocation result? Some numbers: with a total land area of 1.5×10⁸ km² and 195 countries, the average country size on Earth is about 7.7×10⁵ km², or a linear size of 880 km. The surface area of the US is about 10⁷ km². With 50 states, over 3,000 counties and on the order of 43,000 zip codes, the average linear size of a state, county or zip code is about 450, 55 and 15 km, respectively. Looking at another big country, China (about the same size as the USA) has 33 provinces, 333 prefectures, about 3000 counties, and about 42000 townships, giving sizes of 550, 170, 60 and 18 km, respectively. To begin to be useful, a geolocation method would at the very least need to be able to pinpoint the correct country, and, in large countries like the USA or China, the correct state or province. Looking at the above numbers, this would require geolocation errors of at most a few hundred kilometers. An accuracy measured in tens of kilometers would be required to be effective at a truly local level (county or zip code).
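The "linear size" figures above can be reproduced, to within rounding, by treating each region as a square with the average area. The following is a minimal sketch of that arithmetic; the function name `linear_size_km` is introduced here for illustration and the areas are the rounded values quoted in the text.

```python
import math

def linear_size_km(total_area_km2: float, n_regions: int) -> float:
    """Side length of a square whose area equals the average region area."""
    return math.sqrt(total_area_km2 / n_regions)

WORLD_LAND_KM2 = 1.5e8  # total land area quoted in the text
US_AREA_KM2 = 1.0e7     # rounded US surface area quoted in the text

print("country:", round(linear_size_km(WORLD_LAND_KM2, 195)), "km")
for label, count in [("state", 50), ("county", 3000), ("zip code", 43000)]:
    print(f"{label}:", round(linear_size_km(US_AREA_KM2, count)), "km")
```

The results land close to the 880, 450, 55 and 15 km quoted above; the small differences come from rounding in the text.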
Useful Definitions
A number of concepts are commonly encountered in the geolocation literature. We define the main ones here.
- IP geolocation describes methods of assigning a geographic label to an individual Internet Protocol address (IP).
- A Vantage Point (VP) is a measurement infrastructure node with a known geographic location.
- A Landmark is a responsive Internet identifier with a known location to which the VP will launch a measurement that can serve to calibrate other measurements to potentially unknown geographic locations. Some papers use the term Active Landmarks to refer to points which act as both landmark and vantage point. Often they are part of an infrastructure platform like PlanetLab.
- A Target is an Internet identifier whose location will be inferred from a given method. Typically some targets have known geographic locations (ground truth), which researchers can use to evaluate the accuracy of their geolocation methodology.
- A Location is a geographic place that geolocation techniques attempt to infer for a given target. Examples include cities and ISP Points of Presences (PoPs).
Not all terms are used in all papers.
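Most papers in the tables below report accuracy as a median (or mean) error distance between inferred and ground-truth locations of targets. As a rough sketch of how such a figure can be computed, assuming locations are given as (latitude, longitude) pairs in degrees and a spherical Earth; the function names and the dictionary layout are illustrative, not taken from any of the papers.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def median_error_km(estimates, ground_truth):
    """Median error distance over targets present in both dictionaries.

    Both arguments map a target identifier to a (lat, lon) tuple.
    """
    errors = [
        haversine_km(*estimates[target], *ground_truth[target])
        for target in ground_truth
        if target in estimates
    ]
    if not errors:
        raise ValueError("no targets with both an estimate and ground truth")
    return median(errors)
```

Targets without ground truth are simply skipped here; a real evaluation would also report how many targets were covered.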
Geolocation Papers
The tables below contain annotations for papers on the topic of geolocation. We have collected and reviewed papers published between 1996 and 2013, starting with papers from peer-reviewed academic research conferences, and then including papers cited from this initial seeding, as well as follow-up papers written by the same authors. We provide a flexible interactive table that supports selection of relevant attributes from these papers.
The first table emphasizes papers that directly address geolocation methodology, introducing new methods, extensions to previous methods, performance analysis, etc. The second table includes papers that address other geolocation-related issues, including applications of geolocation, and coordinate-based methods for modeling network delays.
Alongside author and publication information, the tables include a number of additional columns.
The papertype gives a category indication; we use "survey", "analysis", "methodology", "tools", and "other". "Methodology" papers develop specific methods of geolocation; "analysis" papers focus on providing a quantitative foundation for geolocation methods (e.g., by comparing results for several methods); "survey" papers provide an overview of geolocation issues.
Data describes the type of data on which the results claimed in the paper are based. We mention here if the paper describes "ground truth" (authoritative mappings between IP addresses and geographic locations) used to validate geolocation results.
Findings gives a brief description of the main results claimed in the paper.
Probes gives an indication of the experimental setup (probes, landmarks, targets) used in a geolocation experiment (where appropriate).
Note: This is a wide multi-column table, and horizontal scrolling may be required.

ID | Title | Year | Publication | Authors | Paper Type | Method | Data | Findings | Measurement Setup | |
---|---|---|---|---|---|---|---|---|---|---|
P-01 | A Means for Expressing Location Information in the Domain Name System | 1996 | RFC Editor | Davis, C. and Vixie, P. and Goodwin, T. and Dickinson, I. | RFC 1876 | Description of DNS LOC records | ||||
P-02 | GTrace - A Graphical Traceroute Tool | 1999 | Usenix LISA | Periakaruppan, R. and Nemeth, E. | tool | GUI for displaying traceroute results on geographic map | uses geo-info from DNS LOC, WHOIS (NetGeo), IP-to-location databases, hostname heuristics, combined with RTT-based verification | |||
P-03 | Where in the World is netgeo.caida.org? | 2000 | INET | Moore, D. and Periakaruppan, R. and Donohoe, J. and claffy, k. | tool | |||||
P-04 | An Investigation of Geographic Mapping Techniques for Internet Hosts | 2001 | SIGCOMM | Padmanabhan, V.N. and Subramanian, L. | methodology | IP2Geo suite: GeoTrack (traceroute info+host name heuristics), GeoPing (based on similarities in RTT-delay patterns), GeoCluster (IP-to-location database+BGP routing info) | IP-to-location datasets: 41772 Hotmail users at state granularity; 181246 IPs from bCentral web-hosting company at zipcode granularity; 142807 IPs from FooTV at zipcode granularity | GeoCluster most promising with median errors of 28 km (well-connected hosts) to a few hundred km. Median errors for geolocation on same set of univ hosts: 28 km for GeoCluster; 102 km for GeoTrack; 382 km for GeoPing | 14 landmarks and 365 targets (in US, mostly univ-based) | |
P-05 | Similarity Models for Internet Host Location | 2003 | ICON | Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. | analysis | explores various similarity measures for GeoPing-type (see P-04) geolocation | RIPE TTM delay measurements from Dec 2002 to Jan 2003 | similarity based on "city-block" distance measure works best | 55 landmarks (RIPE TTM hosts) | |
P-06 | Toward a measurement-based geographic location service | 2004 | Passive and Active Network Measurement Workshop (PAM) | Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. | analysis | explores various similarity measures for GeoPing-type (see P-04) geolocation | delays from probes to all landmarks, and from probes to target host | sufficient correlation between geographic distance and network delay exists for coarse-grained geolocation; explores mostly distance based similarity measures; median distance error of 314 km | 397 landmarks (RIPE TTM and LibWeb servers); 9 probes (NIMI) | |
P-07 | Constraint-based Geolocation of Internet Hosts | 2004 | IEEE/ACM Transactions on Networking | Gueye, B. and Ziviani, A. and Crovella, M. and Fdida, S. | methodology | multilateration based on geometric constraints (upper limit on target host distance from landmark) derived from delay measurements between landmarks provides location estimate and confidence region | one-way delays between TTM hosts (Dec 2002-Feb 2003); RTT delays between AMP hosts (30 Jan 2003); known landmark locations | median error of 95 km (NLANR, USA) and 22 km (TTM, Europe) | Landmarks: 95 NLANR AMP hosts; 42 RIPE TTM hosts | |
P-08 | Demographic Placement for Internet Host Location | 2003 | GLOBECOM | Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. | methodology | develops methodology for efficiently deploying landmarks and probes for delay-based geolocation methods (in particular GeoPing from P-04) | Landmarks are placed using demographic criteria (locations with high user density); probes are placed sparsely at locations with high connectivity | |||
P-09 | Improving the accuracy of measurement-based geographic location of Internet hosts | 2005 | Computer Networks and ISDN Systems | Ziviani, A. and Fdida, S. and de Rezende, J.F. and Duarte, O.C.M.B. | methodology | explores several issues related to implementation of GeoPing-type geolocation: correlation between RTT and geographic distance; optimal placements of landmarks and probes; methods for evaluating similarities between delay patterns | delays from LIP6 (Paris, France) to 109 LibWeb hosts in June 2002; delays between 55 RIPE TTM hosts from Dec 2002 to Feb 2003 | number and location of landmarks and probes is optimized using a demographic approach; similarity based on "city-block distances" outperforms Euclidean distance model | 109 LibWeb webservers; 55 RIPE TTM hosts | |
P-10 | Towards IP geolocation using delay and topology measurements | 2006 | Internet Measurement Conf (IMC) | Katz-Bassett, E. and John, J.P. and Krishnamurthy, A. and Wetherall, D. and Anderson, T. and Chawathe, Y. | methodology | Develops Topology-based geolocation (TBG): improves on pure CBG based on end-to-end delays by leveraging network topology at the router level and validated external hints; uses global optimization approach to determine router and target locations simultaneously. | PlanetLab hosts are used as landmarks. Geolocation experiments using targets collocated with 11 Abilene PoPs, 22 Sprint PoPs and 128 Univ. hosts | Typically reduces errors by a factor of 3 to 4 compared with CBG | 68 PlanetLab landmarks; 128 US Univ. host targets | |
P-11 | Leveraging Buffering Delay Estimation for Geolocation of Internet Hosts | 2006 | Int Federation for Information Processing Technical Committee 6 (IFIP-TC6) Networking Conf | Gueye, B. and Uhlig, S. and Ziviani, A. and Fdida, S. | methodology | GeoBuD: CBG (P-07) augmented with buffering delay estimates at intermediate routers derived from traceroutes | traceroutes from PlanetLab landmarks for 17 Oct 2005 (US dataset); and 21 Nov 2005 (WE dataset) | incorporating buffering delays at intermediate routers in CBG (P-07) reduces median geolocation error from 228 to 144 km for US dataset and 137 to 100 km for WE dataset | 29 US PlanetLab landmarks with 87 AMP targets; 27 WE PlanetLab landmarks with 57 RIPE TTM targets | |
P-12 | IP Geolocation | 2007 | Internet Measurement seminar | Holzhauer, F. | survey | review of methods, emphasizing CBG (P-07) and TBG (P-10) | ||||
P-13 | Octant: A Comprehensive Framework for the Geolocalization of Internet Hosts | 2007 | USENIX Symp on Networked System Design and Implementation (NSDI) | Wong, B. and Stoyanov, I. and Gun Sirer, E. | methodology | uses positive and negative constraints on hosts and intermediate routers; assigns "weights" to handle uncertainty in constraints; uses fictitious "height" to capture last hop delays; uses geometric technique based on Bezier curves that can incorporate extraneous geographic hints | Latency data collected on 1-Feb-2006 and 18-Sep-2006 between landmarks, intermediate routers and targets | Median geolocation error of 35 km | 51 PlanetLab hosts; 53 public traceroute servers | |
P-14 | Investigating the Imprecision of IP Block-based Geolocation | 2007 | Passive and Active Network Measurement Workshop (PAM) | Gueye, B. and Uhlig, S. and Fdida, S. | analysis | uses CBG geolocation to investigate geographic spread of IP addresses in same IP block | CBG (P-07) locations for 18759 IPs in 876 IP blocks between 31 Mar and 19 Apr 2006. IPs are CoralCDN Web clients, linked to IP blocks using database from paper R-05 | ~ 60% of IP blocks have spread in excess of 200 km | 74 PlanetLab landmarks | |
P-15 | Assessing the geographic resolution of exhaustive tabulation for geolocating Internet hosts | 2008 | Passive and Active Network Measurement Workshop (PAM) | Siwpersad, S.S. and Gueye, B. and Uhlig, S. | analysis | comparison of location estimates from CBG (P-07) with locations from MaxMind and Hexasoft databases | single IP from 41758 MaxMind and 15823 Hexasoft IP blocks are geolocated with CBG | database location for more than 90% of IP blocks lies outside CBG confidence region | 39 PlanetLab landmarks | |
P-16 | Internet geolocation: evasion and counterevasion | 2009 | ACM Computing Surveys | Muir, J. and van Oorschot, P.C. | survey | overview of geolocation methods with general discussion of limitations; discussion of ways adversaries can avoid geolocation; mentions extraction of IP using Java applet, and RTT measurement by HTTP refreshes | no geolocation method is robust (works for all IP addresses, network configs, and against adversarial users); those trying to evade geolocation can complicate the task for locators, but geographic information can leak in many ways | |||
P-17 | Mining the Web and the Internet for Accurate IP Address Geolocations | 2009 | IEEE Conf on Computer Communications (INFOCOM) | Guo, C. and Liu, Y. and Shen, W. and Wang, H.J. and Yu, Q. and Zhang, Y. | methodology | database mining technique (Structon): geographic information from large database of Webpages, combined with a number of heuristics to increase accuracy and coverage | 500 million Chinese URLs, augmented with traceroutes, WHOIS and BGP information | 87% accuracy at city-level granularity | ||
P-18 | Statistical geolocation of Internet hosts | 2009 | Int Conf on Computer Communications and Networks (ICCCN) | Youn, I. and Mark, B.L. and Richards, D. | methodology | delay-based statistical method: delay-to-distance relationship is expressed in a probability density function; solution by iterative force-directed method | delay measurements between all pairs of landmarks every five minutes for one week | Compared to GeoPing (P-04) and CBG (P-07) median errors are reduced by ~20%; mean errors by ~50% (i.e., significant improvement especially in reducing large errors in GeoPing and CBG) | 85 PlanetLab landmarks | |
P-19 | A study of geolocation databases | 2010 | arXiv cs.NI/1005.5674v3 | Shavitt, Y. and Zilberman, N. | survey | statistical analysis of PoP "range of convergence" and deviations of IP and PoP locations within PoP | PoP map of 3800 PoPs (52K IPs) derived from DIMES traceroute measurements in March 2010 | vast majority of location info in databases is correct, but also errors in the range of 1000s km | DIMES | |
P-20 | GeoWeight: Internet host geolocation based on a probability model for latency measurements | 2010 | Australasian Conf on Computer Science (ASCS) | Arif, M.J. and Karunasekera, S. and Kulkarni, S. | methodology | constraint-based (CBG) augmented by a probability model for latency vs. geographic distance | 150000 latency measurements PlanetLab landmarks from 23 Sep 2008 to 25 Oct 2008; latencies from landmarks to 80 NA targets | Median geolocation errors of ~44 km compared to > 200 for Octant and > 500 for CBG | 50 PlanetLab landmarks and 80 targets in North America | |
P-21 | A model based approach for improving router geolocation | 2010 | Computer Networks: The Int Journal of Computer and Telecommunications Networking | Laki, S. and Matray, P. and Haga, P. and Csabai, I. and Vattay, G. | methodology | develops detailed path-latency model (separating propagation and per-hop delays); uses global optimization to solve for target locations (similar to P-10) | mean geolocation errors of ~150 km | ETOMIC landmarks and 41 GEANT2 targets; 151 PlanetLab nodes, used as landmarks and targets | ||
P-22 | Internet Host geolocation using maximum likelihood estimation technique | 2010 | IEEE Int Conf on Advanced Information Networking and Applications (AINA) | Arif, M.J. and Karunasekera, S. and Kulkarni, S. and Gunatilaka, A. and Ristic, B. | methodology | delay-based statistical method: delay-to-distance relationship is expressed as a probability density function; solution by MLE method | delay measurements between landmarks from 23 Sep to 25 Oct 2008 | median error of 134 km, compared to 216 for Octant (P-13) and 506 km for CBG (P-07) on same dataset | 50 NA PlanetLab landmarks and 50 other NA hosts as targets | |
P-23 | Dude, where's that IP? Circumventing measurement-based IP geolocation | 2010 | Usenix Security Symp | Gill, P. and Ganjali, Y. and Wong, B. | analysis | Simulated "attacks" using PlanetLab testbed to foil delay-based and topology-aware geolocation attempts | topology-aware techniques are more susceptible to tampering than simpler delay-based techniques | 50 NA and 30 WE PlanetLab nodes | ||
P-24 | A learning-based approach for IP geolocation | 2010 | PAM | Eriksson, B. and Barford, P. and Sommers, J. and Nowak, R. | methodology | statistical method: expresses relation of distance to delay and hop count as a probability density function; solution by learning-based classification method | iPlane data for 12 Dec 2008 to 8 Jan 2009, supplemented with traceroutes between 375 PlanetLab hosts. Three sets of PlanetLab traceroutes between 11 Dec 2008 and 6 Jan 2009 | MaxMind db is used as "ground truth". Results are compared with CBG (P-07): mean error reduces from 519 km for CBG to 408 km for proposed learning-based method | 375 NA PlanetLab nodes | |
P-25 | Spotter: A model based active geolocation service | 2011 | INFOCOM | Laki, S. and Matray, P. and Haga, P. and Sebok, T. and Vattay, G. | methodology | delay-based statistical method: combining spatial probability density function for all landmarks defines estimated region for location of target | uses PlanetLab nodes as targets; also uses 23000 Cogent IP address locations in Europe and US | all landmarks are described by same probabilistic delay-distance model | PlanetLab | |
P-26 | Towards street-level client-independent IP geolocation | 2011 | Usenix | Wang, Y. and Burgener, D. and Flores, M. and Kuzmanovic, A. and Huang, C. | methodology | combines active measurement approach with an active web-mining technique. Uses CBG for "coarse" geolocation; refines location using "relative network distance" in combination with large number of landmarks located using web-mining technique. | method evaluated using 88 PlanetLab nodes; a set of 72 residential IP addresses; and a third, undisclosed dataset | claims median errors of 1-2 km for the three datasets used. | 88 PlanetLab targets | |
P-27 | iPlane Nano: path prediction for peer-to-peer applications | 2009 | Usenix Symp on Networked systems design and implementation (NSDI) | Madhyastha, H.V. and Katz-Bassett, E. and Anderson, T. and Krishnamurthy, A. and Venkataramani, A. | methodology | provides an atlas of PoP-level paths, with latency and loss-rate predictions between arbitrary hosts on the Internet, by path stitching across inferred PoP paths. | iPlane data | iPlane iNano provides PoP-level paths between arbitrary end-hosts with an atlas that is less than 7MB in size and can be updated | ||
P-28 | Matchmaking for online games and other latency-sensitive P2P systems | 2009 | ACM SIGCOMM Computer Communication Review | Agarwal, S. and Lorch, J.R. | methodology | Htrae: places pairs of clients in a network coordinate system to provide client-to-client latency prediction. The network coordinate system is seeded with MaxMind GeoLite coordinates. | 3.5 million console-to-console latencies from Halo (Microsoft); GeoLite IP-to-geographic-location data (MaxMind) | 50% of predictions under 15 ms for Htrae, 24 ms for GeoLite. 95% of Htrae's predictions within 138 ms, versus 208 ms for geolocation. | one-time volunteer Htrae deployment on 11 home machines | |
P-29 | IP geolocation databases: unreliable? | 2011 | ACM SIGCOMM Computer Communication Review | Poese, I. and Uhlig, S. and Kaafar, M.A. and Donnet, B. and Gueye, B. | survey | Compare prefix distributions in 5 geolocation DBs with groundtruth. | Groundtruth: 357 BGP prefixes from large European ISP with city-level location of router advertising subnet inside ISP. Databases of HostIP, IP2Location, InfoDB, Maxmind, Software77 | databases are strongly biased to popular countries; db IP blocks use official advertisements of ISP; while some of the ISP address space is geolocated decently (e.g. 20% of Maxmind within 10s of km of groundtruth), in most cases DBs are off by 100s to 1000s of km. | ||
P-30 | Geocompare: a comparison of public and commercial geolocation databases | 2011 | CAIDA Technical Report | Huffaker, B. and Fomenkov, M. and claffy, kc | survey | compares geolocation databases against each other and against a ground truth dataset | RIR, Software77, HostIP, IPligence, Cyscape, MaxMind GeoIP, MaxMind GeoLite, IPInfoDB, and Digital Envoy | databases roughly all agree on country; MaxMind GeoIP and Digital Envoy did best on ground truth. Digital Envoy did best on routers. | ||
P-31 | Network measurement based modeling and optimization for IP geolocation | 2012 | Computer Networks | Dong, Z. and Perera, R.D.W. and Chandramouli, R. and Subbalakshmi, K.P. | methodology | measurement-based geolocation method tested on PlanetLab nodes; k-means clustering of landmark-to-landmark measurements defines distance segments that each fits RTT vs distance using polynomial regression; semidefinite programming method finds optimized location of target host from estimated distances to landmarks. | traceroutes from PlanetLab probes to landmarks for one week in Nov 2010 | Best results: 27-32 km in North America; 41-53 in Europe | 81 North-American and 90 European PlanetLab probes; 206 PlanetLab landmarks | |
P-32 | A structural approach for PoP geo-location | 2012 | Computer Networks | Feldman, D. and Shavitt, Y. and Zilberman, N. | methodology | Method to generate PoP-level geographic maps from IP-level graph based on DIMES traceroutes. PoP identification based on structure ('motifs') and a partitioning algorithm that assigns nodes to PoPs; geographic location assigned to PoP from geoloc DBs | DIMES: 56M traceroutes from 2009 Jul and 33M from 2010 Oct. GeoDBs used: MaxMind, IPligence, HostIP, IP2Location, GeoBytes | Comparison with published PoP maps finds that most large PoPs are found, but few small ones. The majority of incorrect links is attributed to database errors. | DIMES, 1308 agents in 49 countries | |
P-33 | Posit: a lightweight approach for IP geolocation | 2012 | ACM SIGMETRICS Performance Evaluation Review | Eriksson, B. and Barford, P. and Maggs, B. and Nowak, R. | methodology | CBG for geolocation to constrained region; finds the most likely location from limited pool (i.e., city) given RTT to target and landmarks in constrained region. | 431 Akamai vantage points/targets and an additional 283 targets | |||
P-34 | Enhancing the classification accuracy of IP geolocation | 2012 | Military Communications Conf (MILCOM) | Maziku, H. and Shetty, S. and Han, K. and Rogers, T. | methodology | Machine-learning approach (extension of P-24). Larger set of classifiers include average, median, mode and std dev of delay measurement; hop count; pop. density. | 142,937 (23,843 after de-aliasing) router IP addresses from traceroutes between PlanetLab nodes between Jun and Oct 2011. | Results heavily depend on good coverage by landmarks. Median errors vary from 0 (NE US) to ~500 km (N-Central US). | 67 well-distributed PlanetLab nodes across US serve as landmarks and probes | |
P-35 | Towards geolocation of millions of IP addresses | 2012 | ACM Internet measurement conference (IMC) | Hu, Z. and Heidemann, J. and Pradkin, Y. | methodology | select 10 nearest, by RTT, Vantage Points to probe /24 prefix | 400 PlanetLab nodes to 25 known landmarks | proves that selecting 10 Vantage Points with lowest RTT values to a given /24 prefix greatly reduces the amount of traffic needed to geolocate without large increases in error | ||
P-36 | Using Whois based geolocation and Google maps API for support cybercrime investigations | 2013 | Recent Advances in Telecommunications and Circuits | Butkovic, A. and Orucevic, F. and Tanovic, A. | methodology | |||||
P-37 | Topology mapping and geolocating for China's Internet | 2013 | IEEE Trans. On Parallel and Distributed Systems | Tian, Y. and Dey, R. and Liu, Y. and Ross, K.W. | methodology |
Related Papers
ID | Title | Year | Publication | Authors | Paper Type | Method | Data | Findings | Measurement Setup | |
---|---|---|---|---|---|---|---|---|---|---|
R-01 | Predicting Internet Network Distance with Coordinates-Based Approaches | 2002 | IEEE Conference on Computer Communications (INFOCOM) | Ng, T. S. E. and Zhang, H. | methodology | develops GNP, a coordinate-based method for estimating minimum RTT using absolute coordinates | distance (minimum RTT) measurements between landmarks and two sets of targets | Euclidean embedding combined with a relative error measurement function works best | 19 landmarks (12 in NA; 5 in AP; 2 EU); two target set: 869 global; 127 Abilene-connected | |
R-02 | Virtual Landmarks for the Internet | 2003 | Internet Measurement Conference (IMC) | Tang, L. and Crovella, M. | methodology, analysis | coordinate-based method for estimating minimum RTT using Euclidean embedding; uses "virtual landmarks" for speed and scalability | seven collections of RTT data | network distances can be described with 7-9 orthogonal vectors; ~90% of distances preserved with relative error <0.5; "virtual landmark" method simpler and faster than nonlinear optimization | NLANR AMP | |
R-03 | On the geographic location of Internet resources | 2003 | J. on Selected Areas in Communications, Vol. 21., pp. 934-947 | Lakhina, A. and Byers, J.W. and Crovella, M. and Matta, I. | analysis | statistical analysis of geographic properties of router topology. | Uses CAIDA Skitter data (26 Dec 2001 to 1 Jan 2002) and Scan Project Mercator data (Aug 1999). Geolocation is done using IxMapper and EdgeScape. | Superlinear relation between router and population density; connection patterns linked to geographic distance. Nr of AS locations correlates with AS degree and AS geographic dispersal. | ||
R-04 | Vivaldi: A Decentralized Network Coordinate System | 2004 | SIGCOMM | Dabek, F. and Cox, R. and Kaashoek, F. and Morris, R. | methodology | Coordinate-based method for predicting communication latencies using 2D Euclidean embedding augmented with a "height" component | RTTs between 192 PlanetLab nodes; and between 1740 DNS servers | Median relative error in RTT prediction of 11% | PlanetLab | |
R-05 | Geographic Locality of IP Prefixes | 2005 | Internet Measurement Conference (IMC) | Freedman, M. J. and Vutukuru, M. and Feamster, N. and Balakrishnan, H. | analysis | Statistical analysis of geographic properties of IP prefixes within context of implications for routing policies. Uses undns for geolocation (i.e., IP-to-geographic location mapping based on geographic information in DNS names) | 170000 IP prefixes from RouteViews from 27-Feb-2005; traceroutes to CoralCDN clients and servers; traceroutes from PlanetLab hosts to 4 IPs per prefix | discontiguous prefixes announced by AS from single location usually due to fragmented allocation by registries; contiguous prefixes announced by AS from different geographic locations limit opportunities for aggregation of prefixes | 25 PlanetLab hosts | |
R-06 | Geolocalization of Proxied Services and its Application to Fast-Flux Hidden Servers | 2009 | IMC | Castelluccia, C. and Kaafar, M.A. and Manils, P. and Perito, D. | application | application of CBG (P-07) to geolocation of fast-flux hidden servers | ||||
R-07 | Eyeball ASes: From Geography to Connectivity | 2010 | Internet Measurement Conference (IMC) | Rasti, A. and Magharei, N. and Rejaie, R. and Willinger, W. | analysis | IP geolocation (48×10⁶ IP addresses from P2P apps) done using GeoIP and IP2Location; used to determine geo- and PoP-level footprints of ASes | PoP info for 45 ASes in NA and EU compiled from online data | |||
R-08 | Improving AS Relationship Inference Using PoPs | 2013 | Traffic Monitoring and Analysis Workshop (TMA 2013) | Neudorfer, L. and Shavitt, Y. and Zilberman, N. | methodology | Method to use PoP level maps to find complex and anomalous AS relationships | 29M DIMES traceroutes (May 2012); DIMES IP-to-PoP mapping (May 2012) with 5215 PoPs, 98650 IPs in 2636 AS; CAIDA AS rank data from August 2012 with 119,924 AS pairs. | Discusses several examples of complex and/or anomalous relationships between ASes (different at different ASes) | DIMES |
Measurement Infrastructure
The above bibliography references several datasets. These resources are listed here with references back to the papers.
ID | Name | Organization | Description | PaperID |
---|---|---|---|---|
D-01 | PlanetLab | Princeton University | Global research network for the development of new network services. Currently over 1000 nodes worldwide | P-10, P-11, P-13, P-14, P-15, P-18, P-20, P-21, P-22, P-23, P-24, P-25, P-26, R-04, R-05, R-06 |
D-02 | iPlane | University of Washington | Scalable service for predictions of Internet path performance for emerging overlay services (incl. access to iPlane datasets) | P-20, P-22, P-24 |
D-03 | TTM | RIPE | Test Traffic Measurement Service (TTM) measures key parameters of the connectivity between points on the Internet | P-05, P-06, P-07, P-09, P-11 |
D-04 | AMP | NLANR | NLANR Active Measurement Project (AMP), active between 1998-2006. Datasets available at RIPE. | P-07, P-11, R-02 |
D-05 | ETOMIC | | European Traffic Observatory Measurement Infrastructure (ETOMIC) is a measurement infrastructure, distributed throughout Europe, that is able to carry out active measurements | P-21 |
D-06 | GEANT2 | | High-bandwidth, academic Internet serving Europe's research and education community | P-21 |
D-07 | DIMES | Tel Aviv Univ. | Distributed scientific research project, aimed at studying structure and topology of the Internet | P-19 |
D-08 | Skitter | CAIDA | Tool for actively probing the Internet for topology and performance analysis. Retired in 2008. Dataset available from CAIDA | R-03 |
Geographic Information and Geolocation Methods
Current geolocation techniques can be broadly divided into two categories: database-driven (or registry-based; P-25) and measurement-based. This categorization mirrors a similar division in the types of geographic information available for IP geolocation: qualitative data, and numerical (quantitative) data. Both have been present in geolocation efforts from the outset.
The class of quantitative data includes the workhorse of measurement-based geolocation methods: delay measurements from probes to landmarks and targets. A number of publications establish the relationship between Internet delay and geographic distance (P-06, P-09, P-10, P-25) in the presence of obfuscating factors like circuitous routing, buffering and other delays (P-11, P-21), etc. Also included in this class is network topology information, typically derived from traceroute measurements. Topology information can be an integral part of a geolocation algorithm (e.g., when intermediate routers to an end target are geolocated alongside the target itself in a global optimization; P-10), but is also used in simpler arguments that relate topological proximity to geographic proximity (e.g., when geolocating the last intermediate router when the real target is unreachable). Hop counts (also derived from traceroutes) are explored in a recent paper (P-24) as another quantitative measure of geographic distance.
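The simplest use of a delay measurement is as an upper bound on distance: a packet cannot travel faster than light in the medium, so a round-trip time limits how far away the other endpoint can be. The sketch below illustrates this under the common approximation that signals in fiber propagate at about two thirds of the vacuum speed of light; that factor, and the function names, are assumptions made for illustration and are not taken from any specific paper in the tables.

```python
SPEED_OF_LIGHT_KM_PER_MS = 299.792458  # vacuum speed of light
FIBER_FACTOR = 2.0 / 3.0               # assumed propagation factor in fiber

def max_distance_km(rtt_ms: float, factor: float = FIBER_FACTOR) -> float:
    """Upper bound on the distance to a host, given a round-trip time in ms."""
    if rtt_ms < 0:
        raise ValueError("RTT must be non-negative")
    one_way_ms = rtt_ms / 2.0
    return one_way_ms * SPEED_OF_LIGHT_KM_PER_MS * factor

def is_feasible(claimed_distance_km: float, rtt_ms: float) -> bool:
    """True if a claimed distance does not violate the propagation bound."""
    return claimed_distance_km <= max_distance_km(rtt_ms)

# A 10 ms RTT bounds the distance at roughly 1000 km under these assumptions.
print(round(max_distance_km(10.0)))
```

Because routing detours and queuing only add delay, the true distance is usually well below this bound, which is why the papers above calibrate tighter delay-to-distance relations from landmark measurements.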
The class of qualitative data includes the usual suspects (WHOIS registry, DNS LOC records, DNS names, BGP routing tables), but also databases based on information gathered from the Internet community (either directly through user input, or indirectly, e.g., by parsing large quantities of URLs; P-17). This class probably also includes the various types of proprietary databases used in commercial geolocation products. All of these contain geographic information (directly, as in DNS LOC records, or indirectly, by linking to an organization or AS number) that, if correctly interpreted, provides clues about the geographic location of an IP address or IP address block.
The earliest geolocation attempts, GTrace (P-02; constructed around NetGeo), GeoTrack and GeoCluster (P-04), emphasize qualitative data (primarily WHOIS records and DNS names), but delay (RTT) measurements are already incorporated. GTrace uses RTT data to validate results using "speed-of-light" arguments; GeoPing (P-04) is purely RTT-based. Since these early attempts a number of measurement-based algorithms have appeared in the academic literature. The table below provides an overview of the accuracy achieved by the various techniques. In that table only the first three entries are database-driven; all others (starting with GeoPing) are measurement-based.
GeoPing (P-04) compares "fingerprints" (based on delay measurements from a set of probes) of the target and of the landmarks, and selects the location of the landmark with the most similar fingerprint as the target location. As the first measurement-based geolocation method, it appears to be mostly of historical significance at this point. Constraint-based geolocation (CBG; P-07), which uses deterministic geometric constraints derived from delay measurements to bound the probable location of a target, set the stage for later development and is the most common benchmark against which more recent models are compared.
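The two ideas in this paragraph can be sketched in a few lines. This is a simplified illustration, not the published GeoPing or CBG algorithms: the fingerprint comparison assumes a plain Euclidean distance between delay vectors, and the constraint step assumes a fixed bound of about 100 km per millisecond of RTT rather than a per-landmark calibration. All function names and data are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumption: ~2/3 c in fiber (about 200 km/ms one way), i.e. about 100 km per ms of RTT.
KM_PER_MS_RTT = 100.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_fingerprint(target_fp, landmark_fps):
    """Fingerprint matching: name of the landmark whose delay vector is closest."""
    return min(landmark_fps, key=lambda name: math.dist(target_fp, landmark_fps[name]))


def cbg_estimate(probes, rtts_ms, candidates):
    """Constraint matching: centroid of the candidates inside every probe's bound.

    The centroid is a naive lat/lon average, so it ignores antimeridian wrap-around.
    Returns None when no candidate satisfies all constraints.
    """
    feasible = [c for c in candidates
                if all(haversine_km(p, c) <= rtt * KM_PER_MS_RTT
                       for p, rtt in zip(probes, rtts_ms))]
    if not feasible:
        return None
    lat = sum(c[0] for c in feasible) / len(feasible)
    lon = sum(c[1] for c in feasible) / len(feasible)
    return (lat, lon)


# Hypothetical data, for illustration only.
landmark_fps = {"lm-a": [12.0, 40.0, 85.0], "lm-b": [70.0, 35.0, 9.0]}
print(nearest_fingerprint([14.0, 38.0, 80.0], landmark_fps))

probes = [(0.0, 0.0), (0.0, 10.0)]
grid = [(0.0, float(lon)) for lon in range(0, 31, 5)]
print(cbg_estimate(probes, [8.0, 8.0], grid))
```

The fingerprint approach can only ever return a landmark location, whereas the constraint approach returns a region (summarized here by its centroid) that shrinks as more probes contribute bounds.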
Subsequent geolocation methods show increasing sophistication in extracting geographic information, either by supplementing delay measurements with additional data, or by using more complex algorithms. Topology-based geolocation (TBG; P-10) introduces topology measurements to simultaneously geolocate intermediate routers and targets. Further refinements include an improved analysis of delay measurements (separating the distance-sensitive propagation delays from other processing delays; P-11, P-21), incorporating database-driven approaches to improve geolocation accuracy (P-10), and integrating hop counts into the geolocation algorithm (P-24). Algorithms are evolving as well. The most recent models favor probabilistic approaches, which seem to be a better match to the essentially statistical nature of the relation between geographic distance and delay measurements. GeoWeight (P-20) marks a transition by combining deterministic constraints, similar to CBG, with probability assignments; P-18, P-22, P-24 and P-25 describe delay measurements using probability density functions, and use various statistical methods to build a geolocation algorithm.
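To make the probabilistic idea concrete, the following minimal sketch scores candidate locations by the likelihood of the observed delays. It does not reproduce the method of any cited paper. It assumes, purely for illustration, that the delay from a probe is normally distributed around a linear function of distance; the parameters `a`, `b` and `sigma` are hypothetical values that would have to be fitted to real measurements.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def log_likelihood(candidate, probes, delays_ms, a=5.0, b=0.01, sigma=5.0):
    """Sum of Gaussian log-densities of the observed delays for one candidate.

    Assumed model (hypothetical): delay_ms ~ Normal(a + b * distance_km, sigma).
    """
    total = 0.0
    for probe, delay in zip(probes, delays_ms):
        expected = a + b * haversine_km(probe, candidate)
        total += (-0.5 * ((delay - expected) / sigma) ** 2
                  - math.log(sigma * math.sqrt(2 * math.pi)))
    return total


def most_likely(candidates, probes, delays_ms):
    """Candidate location with the highest log-likelihood."""
    return max(candidates, key=lambda c: log_likelihood(c, probes, delays_ms))


# Hypothetical probes, delays and candidate locations.
probes = [(0.0, 0.0), (0.0, 10.0)]
delays_ms = [10.5, 10.6]
candidates = [(0.0, 5.0), (0.0, 30.0)]
print(most_likely(candidates, probes, delays_ms))
```

Unlike a hard geometric constraint, a likelihood never rules a location out entirely; an unusually large delay merely makes nearby candidates less probable.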
Few detailed descriptions of database-driven techniques exist in the literature. The exceptions are NetGeo (P-03) and Structon (P-20). Not surprisingly, the published literature contains little concrete information about algorithms employed in commercial geolocation products. Whether the qualitative input data are web pages, WHOIS registry records, or DNS names, a database-driven geolocation algorithm tends to be a collage of various heuristic arguments, approximations and intelligent guesswork.
Error Distance Matrix
The table below compiles numbers from geolocation experiments described in the above publications for measurement-based techniques. The column headers indicate a range of median errors in geolocation distance reported in the papers; the values in the columns are the number of experiments that report median errors in the indicated range. Even though direct comparison of these numbers is tricky due to the wide variations in experiment characteristics (different types of targets, different sets of landmarks, etc.), the picture that emerges is that state-of-the-art measurement-based techniques can comfortably geolocate targets with median errors of < 250 km, while some techniques under favorable conditions can approach an accuracy of < 100 km. To put this in context: 1000 km can be roughly viewed as country granularity; 50 km approaches city or zip code granularity.
Method | ID | d < 5 km | 5 < d < 50 km | 50 < d < 100 km | 100 < d < 250 km | 250 < d < 500 km | 500 < d < 750 km |
---|---|---|---|---|---|---|---|
NetGeo | P-03 | ||||||
GeoTrack | P-04 | ||||||
GeoCluster | P-04 | ||||||
GeoPing | P-04 | ||||||
CBG | P-07 | ||||||
TBG | P-10 | ||||||
GeoBuD | P-11 | ||||||
Octant | P-13 | ||||||
SG | P-18 | ||||||
GeoWeight | P-20 | ||||||
Geo-Rh | P-21 | ||||||
MLE | P-22 | ||||||
Naive Bayes | P-24 | ||||||
Spotter | P-25 | ||||||
Street GeoLoc | P-26 | ||||||
Dong et al. | P-31 | ||||||
Maziku et al. | P-34 |
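The classification used in the table above (median error distance, then assignment to a range) can be sketched as follows. The bin edges are taken from the column headers; the function names and the sample coordinates are hypothetical and do not come from any of the cited experiments.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0
# Upper edges (km) of the median-error ranges used as column headers.
BIN_EDGES_KM = [5, 50, 100, 250, 500, 750]


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def median_error_km(estimates, truths):
    """Median distance between estimated and true locations of the targets."""
    return median(haversine_km(e, t) for e, t in zip(estimates, truths))


def error_bin(median_km):
    """Label of the table column that a median error falls into."""
    lower = 0
    for upper in BIN_EDGES_KM:
        if median_km < upper:
            return f"{lower} < d < {upper} km" if lower else f"d < {upper} km"
        lower = upper
    return f"d > {BIN_EDGES_KM[-1]} km"


# Hypothetical experiment: three targets at a known location, three estimates.
truths = [(0.0, 0.0)] * 3
estimates = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
print(error_bin(median_error_km(estimates, truths)))
```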
Discussion
A direct comparison between measurement-based and database-driven approaches, or even just between measurement-based algorithms, is tricky at best. A systematic comparison would require a reliable "ground truth" database of IP addresses at known geographic locations, which is difficult to find. In practice, the pool of potential test targets at known locations is limited: most recent published experiments select their ground truth from hosts in measurement infrastructures like PlanetLab in North America or Europe. So, even though this is hard to quantify, the ground truth in some published experiments is probably similar. In some papers the same ground truth is used to compare different algorithms (typically CBG is used as a benchmark, which explains the high number of entries for CBG in the above table), providing some insight into comparative performance. Obvious questions remain though. How representative are results based on a limited number of PlanetLab targets for the Internet as a whole? How much does the accuracy of a method vary from well-connected hosts (routers) to a heterogeneous collection of end hosts? Looking at the above table, the median errors for CBG experiments vary from better than 50 km to more than 500 km (one order of magnitude), presumably reflecting a wide variation in experiment characteristics.
In an average sense the performance of the best geolocation techniques can be quantified reasonably well: the best measurement-based methods have median errors of at most a few hundred km (well within country granularity), with the best results perhaps approaching 50 km (city or zip level). Similarly, database-driven techniques also appear to do quite well at the country level, but start running out of steam at the city level. Whether database-driven or measurement-based, all techniques suffer from what might be called an outlier syndrome: all are plagued by outliers with location errors well exceeding 1000 km (or country level). It would seem that for any potential application of geolocation the key question to ask is whether being right most of the time is good enough. If the answer is yes, a secondary question is whether the average accuracy of a selected algorithm is satisfactory.
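The distinction between average accuracy and outliers is easy to lose when only a median is reported. The sketch below, using made-up error values and a hypothetical `summarize_errors` helper, reports both the median error and the fraction of targets whose error exceeds the 1000 km country-level threshold mentioned above.

```python
from statistics import median


def summarize_errors(errors_km, outlier_km=1000.0):
    """Median error plus the fraction of targets with errors above outlier_km."""
    errors = list(errors_km)
    if not errors:
        raise ValueError("at least one error distance is required")
    outliers = sum(1 for e in errors if e > outlier_km)
    return {
        "median_km": median(errors),
        "outlier_fraction": outliers / len(errors),
    }


# Hypothetical error distances (km): a good median, but one target is far off.
print(summarize_errors([20.0, 35.0, 50.0, 80.0, 4000.0]))
```

In this invented example the median of 50 km suggests city-level accuracy, yet one target in five is placed in the wrong country, which is exactly the situation the two questions above are meant to expose.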