Where in the World is netgeo.caida.org?
David Moore
(info@caida.org)
Cooperative Association for Internet Data Analysis (CAIDA)
University of California, San Diego
USA
Ram Periakaruppan
(ramanath@cs.colorado.edu)
Cooperative Association for Internet Data Analysis (CAIDA)
University of Colorado, Boulder
USA
Jim Donohoe
(jim.donohoe@computer.org)
Cooperative Association for Internet Data Analysis (CAIDA)
University of California, San Diego
USA
k claffy
(kc@caida.org)
Cooperative Association for Internet Data Analysis (CAIDA)
University of California, San Diego
USA
Table of Contents
- Introduction
- NetGeo Overview
- Internal Databases
- Methods for Location Mapping
- whois
- Hostname
- DNS LOC
- Problems with whois Based Lookups
- Future Work
- Related Work
- Acknowledgments
- Conclusion
- References
- Appendix: whois Record Parsing
Introduction
When your packets travel through the Internet, where exactly do they go? How many users of your web site live in Europe? How will your Internet service be affected by a new trans-pacific cable? To answer these questions you need geographic information. Internet researchers frequently need to map their observed data to specific places. But IP addresses, Autonomous System numbers, and hostnames are values in a logical hierarchy; they contain no geographic information. There is no authoritative database for mapping these identifiers to locations, so several sources of network information must be used, and these sources may be conflicting or incomplete. The large size of the typical data set used in Internet research makes manually mapping many thousands of IP addresses to locations impractical and imprecise; an automated solution is required. In this paper we describe NetGeo, a tool that overcomes these obstacles.
NetGeo is a tool that maps IP addresses, domain names, and Autonomous System (AS) numbers to geographic locations. NetGeo has significant potential to support a variety of tasks: automatic selection of geographically nearby mirror sites; ISP decisions on where to deploy new infrastructure; traffic flow analysis for tariff policy research; regionally based advertising design; etc. NetGeo is currently being used both in a graphical traceroute tool and for studies of connectivity and traffic flow between countries.
NetGeo can be accessed interactively via the web and through Java and Perl APIs. The NetGeo back-end consists of a database and a collection of Perl scripts for address parsing and heuristic analysis of whois records. To reduce the load on whois servers and to improve performance, NetGeo caches geographic information parsed from previous queries.
Prior to the development of NetGeo, Internet geographic information was not easily available. We look forward to many creative uses of this tool as researchers become aware of its availability.
NetGeo Overview
NetGeo is designed to extract and process all available geographic information about a network entity (IP address, domain or machine name, or AS number) to give the most probable latitude and longitude, the source and specificity of the location, and a gauge of the reliability of that result. To achieve that goal, NetGeo must collate location information from multiple sources. Because development of heuristics for integration and comparison of geographic information is ongoing and context-dependent, NetGeo allows direct control over which sources of information are used for localization.
NetGeo currently allows lookups of the following types of identifiers: AS numbers, IP addresses, domain names, and machine names. For each lookup, NetGeo returns a record containing: a location, the data source of the location, the granularity determining the location, and additional meta-data (date of query, date of last whois update, and server used). The location section contains fields for city, state/province/administrative-unit, country, latitude, and longitude. An example record for the domain name caida.org is shown in Figure 1.
NAME: CAIDA.ORG
NUMBER:
CITY: LA JOLLA
STATE: CALIFORNIA
COUNTRY: US
LAT: 32.85
LONG: -117.25
LAT_LONG_GRAN: City
LAST_UPDATED: 11-Jul-98
NIC: INTERNIC
LOOKUP_TYPE: Domain Name
RATING:
DOMAIN_GUESS:
STATUS: OK
Figure 1 - NetGeo result record for whois based lookup of caida.org domain name.
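As a rough illustration of how a client might consume such a record, the sketch below parses a "FIELD: value" text layout into a dictionary. The one-field-per-line layout and the function name parse_record are assumptions made for this example; they are not a description of the NetGeo Perl or Java APIs.

```python
def parse_record(text):
    """Parse 'FIELD: value' lines into a dict; empty values become None."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip() or None
    return record


# Field names and values copied from Figure 1.
EXAMPLE = """\
NAME: CAIDA.ORG
NUMBER:
CITY: LA JOLLA
STATE: CALIFORNIA
COUNTRY: US
LAT: 32.85
LONG: -117.25
LAT_LONG_GRAN: City
LOOKUP_TYPE: Domain Name
STATUS: OK
"""

rec = parse_record(EXAMPLE)
print(rec["CITY"], float(rec["LAT"]), float(rec["LONG"]))
```

Fields that are empty in the record, such as NUMBER for a domain name lookup, come back as None, so a caller can tell "not applicable" apart from an empty string.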
When a user initiates a lookup, NetGeo uses the lookup type requested and the identifier type to determine which mechanism will provide the most accurate result. For some of the mechanisms, NetGeo has all the information it needs to determine a location; for other mechanisms NetGeo must query external databases. Because external lookups consume resources on the remote server (whois, DNS) as well as in NetGeo itself, NetGeo attempts to cache results derived from external queries whenever possible.
Beyond this framework for choosing and issuing remote database queries, NetGeo relies on its internal databases to parse the query results.
Internal Databases
NetGeo uses a relational database consisting of both static and dynamic tables. The static tables, used to recognize location names in whois records, consist of latitude/longitude values for approximately 100,000 cities, administrative divisions (e.g., states, provinces, departments), and countries. Our primary source for this data set was the Getty Institute's Thesaurus of Geographic Names database [GETTY]. The dynamic tables cache derived results to eliminate repetitive external queries and parsing during future NetGeo lookups.
The NetGeo address parser is a knowledge-directed parser -- it can recognize only location names contained in its database. For many locations the NetGeo tables contain several variant spellings, or location names in several languages, e.g., Spain and Espana. Names are stored in the standard 26-letter English alphabet, so España is stored as Espana.
In addition to tables of location names, the NetGeo database also contains several other static tables: U.S. zip codes and their associated locations; names of administrative units; and phone codes with corresponding countries. Names of administrative units are sometimes listed explicitly in addresses, e.g., in Russia many location names carry the suffix "REPUBLIC" or "RESPUBLIKA". The parser uses the table of administrative unit names to isolate and recognize specific location names.
The dynamic tables cache results of previous NetGeo whois queries, including the original query target (e.g., IP address) along with the complete location results. These tables allow NetGeo to respond quickly to frequent, identical queries, thus avoiding unnecessary repeated queries to external servers.
Additionally, NetGeo uses rule files containing regular expressions to extract city names and airport codes from certain classes of machine names, particularly those used in network backbones.
Methods for Location Mapping
Ideally, the latitude and longitude corresponding to any Internet address would be available via a DNS LOC record [RFC1876] and each AS number would have a whois record containing a complete address. In reality, because LOC resource records are not required to make DNS work, few network administrators support them. NetGeo can use DNS LOC records, if available, but primarily resorts to two other localization techniques: whois registration records [RFC954], and specialized hostname rule files. Whois-based results occasionally pinpoint locations other than the exact physical site, but NetGeo is able to obtain acceptable results in most cases.
Additional information about an entity to be mapped provided by the user may increase the accuracy of the results. GTrace [GTRACE], a geographic traceroute tool, is a good example of the use of NetGeo in conjunction with an independent set of heuristic rules to improve accuracy.
whois
NetGeo uses whois to find locations for three types of network identifiers: domain names, AS numbers, and IP addresses. A particular host may be associated with one or more such identifiers. When NetGeo uses whois to localize a host, it is actually finding the addresses or locations of the registered contacts, which may or may not be related to the physical location of the machine. For example, large companies with machines distributed across many states or even countries typically register their domain names with the address of their corporate headquarters, regardless of the actual locations of the machines. Although this practice hinders exact placement of a host, it is useful for analysis based on administrative or political boundaries, for example, linking an IP address with a Swiss company.
For NetGeo, the first step in mapping an Internet identifier to an address is to obtain the whois record for the target entity from a whois server. NetGeo has a list of whois server host names and their requisite query formats. In the simplest case, the desired whois record is found by a single query to one whois server. Some lookups require multiple queries, with intermediate parsing of responses indicating which whois server to query next.
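The query step can be sketched as follows. A whois exchange is simple: open a TCP connection to port 43, send the query terminated by CRLF, and read until the server closes the connection. The helper names (build_request, whois_query, lookup_first_hit) and the found() predicate are hypothetical, and real servers may require server-specific query formats that this sketch does not model.

```python
import socket

WHOIS_PORT = 43  # standard NICNAME/WHOIS port


def build_request(query):
    """A whois request is the query text terminated by CRLF."""
    return query.strip().encode("ascii") + b"\r\n"


def whois_query(server, query, timeout=10.0):
    """Send one query to one whois server and return the raw text reply."""
    chunks = []
    with socket.create_connection((server, WHOIS_PORT), timeout=timeout) as sock:
        sock.sendall(build_request(query))
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1")


def lookup_first_hit(servers, query, found, ask=whois_query):
    """Query servers in order; return (server, reply) for the first reply
    accepted by the caller-supplied found() predicate, else (None, None)."""
    for server in servers:
        reply = ask(server, query)
        if found(reply):
            return server, reply
    return None, None
```

A caller would pass a list of server host names (for example whois.arin.net and whois.ripe.net) and a predicate that recognizes a usable record. The ask parameter exists only so the control flow can be exercised without network access.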
- AS Numbers: AS numbers are the easiest network identifiers to localize because there are relatively few of them, each represented by a unique 16-bit integer. AS number lookups typically require only one query to each of a maximum of three whois servers (ARIN, RIPE, and APNIC). In addition, many lookups with NetGeo have shown that whois records for AS numbers are more likely to have parseable address information than other whois records.
- IP Addresses: The same registrars, ARIN, RIPE, and APNIC, share responsibility for IP address space administration. The sizes and starting addresses of IP address blocks granted to recipient organizations have changed over time, so their current allocation policies [ENTITY-ALLOC] do not always accurately describe IP address blocks held by older organizations. Additionally, recipient organizations may register subdivisions of their address blocks to their customers or other organizations. Because of the potential for these nested blocks, NetGeo must sometimes issue multiple whois requests to find the most appropriate record. Sub-block registration allows determination of a more specific geographic location for a given host.
- Domain Names: Unlike AS numbers and IP address blocks, whois server responsibility for domain names is highly distributed. Although there are regional registries, many top-level country domains have an independent registry and whois server. There are also separate registries for .gov, .mil, and .edu name spaces. The registrar system for .com, .org, and .net domain names recently changed to a distributed model under which there are many registries approved by ICANN. For domain names in these three Top Level Domains (TLDs), NetGeo first queries a single central site (Internic), which provides a pointer to the actual registry containing the requested record.
Once the record is retrieved from the whois server, it needs to be parsed to extract location information. (See Appendix: whois Record Parsing)
Hostname
While NetGeo's whois-based techniques are reasonable for small organizations near the edges of the network, they often fail for large, geographically dispersed organizations for which whois records map all hosts to the organization's registered headquarters. For example, most devices from core backbone (transit) providers, which typically fall under the .net domain, are deployed in places with no relation to the ISP's whois records. Hosts from large .com organizations, like IBM, present similar localization challenges.
In the case of ISP transit backbones, their hostnames often contain geographical information such as a city name/abbreviation or airport code. NetGeo uses rule-based domain parsing files to extract these geographical hints. For example, ALTER.NET (a domain name used by UUNET, a part of MCI/WorldCom) names some of their router interfaces with three letter airport codes as shown below:
193.ATM8-0-0.GW2.EWR1.ALTER.NET (EWR -> Newark, NJ)
190.ATM8-0-0.GW3.BOS1.ALTER.NET (BOS -> Boston, MA)
198.ATM6-0.XR2.SCL1.ALTER.NET (Exception)
199.ATM6-0.XR1.ATL1.ALTER.NET (ATL -> Atlanta, GA)
This technique also works for other TLDs that carry geographical hints. For example, *.almaden.ibm.com hosts are likely at IBM's Almaden Research Center in San Jose, California, rather than in New York as a whois query on ibm.com would suggest.
s/.*?\.([^\.]+)\d\.ALTER\.NET/$1/this,airport.db
scl=santaclara, ca, us
tco=tysonscorner, va, us
nol=neworleans, la, us
Figure 2 - Example of a domain parsing file for ALTER.NET.
Figure 2 shows an example of a NetGeo domain parsing file for ALTER.NET hosts. The file first defines regular expressions, followed by any domain specific exceptions. The user may identify an exception's location by either city or by latitude/longitude value using the format shown below:
exception=city,state,country
          city,country
          L: latitude, longitude
The first line in Figure 2 defines a substitution operation, which when matched against 193.ATM8-0-0.GW2.EWR1.ALTER.NET, would return "EWR". The contents following the last "/" of the first line tell NetGeo what to do with a successful match: in this case to check first for a match in the current file and then for a match in the airport database.
NetGeo checks the domain parsing file first because the naming scheme for a given domain is not always consistent. For example, looking up SCL (obtained from 198.ATM6-0.XR2.SCL1.ALTER.NET) in the airport database would return a location for Santiago de Chile. ALTER.NET uses both standard airport codes and independent, non-standard three-letter abbreviations for US cities (Figure 2 lists three such abbreviations). Additional information in the rule file is required to detect such exceptions.
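The lookup order described above can be sketched in a few lines. This is an illustrative Python rendering of the Figure 2 rule, not NetGeo's Perl implementation: the function name locate is hypothetical, and the AIRPORTS table is a four-entry stand-in for the airport database built only from locations mentioned in the text.

```python
import re

# Python rendering of the substitution pattern from Figure 2: capture the
# label (minus its trailing digit) that immediately precedes .ALTER.NET.
ALTER_NET_RULE = re.compile(r".*?\.([^.]+)\d\.ALTER\.NET$", re.IGNORECASE)

# Domain-specific exceptions from Figure 2; consulted first ("this").
EXCEPTIONS = {
    "SCL": "santaclara, ca, us",
    "TCO": "tysonscorner, va, us",
    "NOL": "neworleans, la, us",
}

# Tiny stand-in for the airport database ("airport.db"); consulted second.
AIRPORTS = {
    "EWR": "Newark, NJ",
    "BOS": "Boston, MA",
    "ATL": "Atlanta, GA",
    "SCL": "Santiago, Chile",
}


def locate(hostname):
    """Return a location hint for an ALTER.NET hostname, or None."""
    match = ALTER_NET_RULE.match(hostname)
    if match is None:
        return None
    code = match.group(1).upper()
    for table in (EXCEPTIONS, AIRPORTS):  # order matters: exceptions win
        if code in table:
            return table[code]
    return None


print(locate("193.ATM8-0-0.GW2.EWR1.ALTER.NET"))
print(locate("198.ATM6-0.XR2.SCL1.ALTER.NET"))
```

Because the exception table is searched before the airport table, SCL resolves to Santa Clara rather than to Santiago de Chile.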
Sometimes ISPs name their hosts with more than one geographical hint. For example, VERIO.NET names some of their hosts in the following format: den0.sjc0.verio.net, which suggests the source and destination of the interface. There is no way to automatically discriminate between sources and destinations; the domain parsing file would have to include this domain-specific information.
The regular expressions present in the domain parsing files are necessary because the existence of over 10,000 valid airport codes [AIRPORT-CODES] makes it impractical, if not impossible, to arbitrarily match three-letter combinations without a large number of false positives. Additional information, such as round trip time to the host, helps eliminate false positives since IP packets cannot travel faster than the speed of light.
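The round trip time check can be made concrete with a short calculation. A packet covers the path twice during one round trip, so a host cannot be farther away than c * RTT / 2. The sketch below uses the speed of light in vacuum, which gives a loose upper bound (signals in real links are slower and paths are not straight lines). The function names are hypothetical.

```python
SPEED_OF_LIGHT_KM_PER_MS = 299.792458  # kilometers per millisecond, in vacuum


def max_distance_km(rtt_ms):
    """Upper bound on the one-way distance to a host with the given RTT."""
    return SPEED_OF_LIGHT_KM_PER_MS * rtt_ms / 2.0


def is_plausible(candidate_distance_km, rtt_ms):
    """True if a candidate location is not ruled out by the measured RTT."""
    return candidate_distance_km <= max_distance_km(rtt_ms)


# With a 10 ms round trip the host lies within roughly 1,500 km, so a
# candidate location 8,000 km from the measurement point is a false positive.
print(round(max_distance_km(10.0), 2))
print(is_plausible(8000.0, 10.0))
```

The test can only reject candidates; a location that passes is merely not excluded.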
An advantage to hostname-based mapping is that one can describe an entire domain as a set of rules without needing whois lookups for every host in the domain. However, this technique will fail for domains that do not use internally consistent naming schemes.
DNS LOC
Both the whois-based and hostname-based mapping rely on the assumption that educated guesses are required in the absence of explicit location information. While RFC 1876 [RFC1876] did define a DNS extension to provide a LOC resource record type that allows administrators to associate latitude and longitude information with entries, it has proven to be of limited use in practice. First, the RFC specifies only the format and interpretation of the new field, without establishing where or at what granularity to use it. Because of this, finding the appropriate LOC resource record may require multiple DNS queries.
More importantly, people just do not use it. NetGeo currently does not use DNS LOC queries by default because their low success rate does not justify the expense of the three or more DNS lookups typically needed to rule out the existence of a valid DNS LOC record.
Since NetGeo provides location information (city, state, country) in addition to latitude and longitude, when NetGeo finds a DNS LOC record, it needs to map the lat/long values to a known location contained in its internal database. Within a given error tolerance, it uses the closest matching lat/long it can find in its database. This approach can have problems: although NetGeo knows latitudes and longitudes of over 93,000 unique city locations, they are not necessarily representative or well-distributed around the world. For locations near country borders, the resulting match may not even belong to the correct country.
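The closest-match step can be sketched with a great-circle distance and a linear scan. This is only an illustration of the idea: the paper does not say which distance measure or tolerance NetGeo uses, so the haversine formula, the 50 km tolerance in the example, and the two-entry city table are assumptions (the La Jolla coordinates come from Figure 1; the Amsterdam coordinates are approximate).

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest_city(lat, lon, cities, tolerance_km):
    """Return the name of the closest city within tolerance_km, or None."""
    best_name, best_dist = None, tolerance_km
    for name, (city_lat, city_lon) in cities.items():
        dist = haversine_km(lat, lon, city_lat, city_lon)
        if dist <= best_dist:
            best_name, best_dist = name, dist
    return best_name


# Illustrative stand-in for the internal city table.
CITIES = {
    "LA JOLLA, CALIFORNIA, US": (32.85, -117.25),
    "AMSTERDAM, NL": (52.37, 4.89),
}

print(nearest_city(32.86, -117.24, CITIES, tolerance_km=50.0))
```

Nothing in this matching consults borders, which is why a point near a frontier can be assigned to a city in the neighboring country.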
Problems with whois Based Lookups
- AS Numbers: Three registrars, ARIN, RIPE, and APNIC, share responsibility for AS number records. Occasionally, not all AS number entries are found on the appropriate server according to current allocation policies [ENTITY-ALLOC]. Thus, queries to all three servers may be necessary to either locate a record for certain AS numbers, or to determine that no such record exists.
- IP Addresses: Like AS numbers, not all IP address block records are found in the correct registry. For historical reasons, the same block may have records in multiple registries, and the records may not agree. While knowledge of current policy could help determine which registry is authoritative, there are cases in which the currently authoritative registry lacked information found in the previously responsible registry. Additionally, this validation method would require close tracking of policies for dividing administration of the IP address space.
Because of the potential for nested allocations of unknown size, it is possible that a lookup of a single IP address does not provide information about nearby addresses. In some cases, nearby addresses map to a different subdivision that would otherwise be obscured (see Figures 3 and 4). If the entire set of IP allocation blocks were available at once (from a database dump [RIPE-DB] [APNIC-DB]) it would be possible to correctly build location tables for all IP address ranges. However, using only standard whois queries allows only two options: store only single IP addresses, rather than ranges; or store larger ranges, with knowledge that some smaller subdivisions may be masked. NetGeo currently stores records for all sub-blocks containing 256 or fewer IP addresses. Single address entries are still necessary for hosts residing in blocks containing more than 256 addresses.
- Domain Names: Unfortunately, there is as yet no standard format for records returned by these diverse registries, and there is a much wider range of formats than there is for AS number and IP address records. Additionally, the current political framework around the .com, .org, and .net space allows arbitrary new registrars to appear at any time with unknown record formats. As discussed in the appendix, accurate location determination in NetGeo depends heavily on the structure of the records. More generic matching algorithms are possible, but such algorithms are likely to run significantly slower and with an increased likelihood of false matches.
Figure 3 shows lookups into separate subdivided blocks in which the entire immediate parent block is completely subdivided. Lookup of X.Y.0.1 (a) yields records for the ranges X.Y.0.0/16 and X.Y.0.0/17. Since X.Y.0.0/17 is the most specific range, that record is parsed for location information. Similarly, a lookup of X.Y.128.1 (b) yields records for both X.Y.0.0/16 and X.Y.128.0/17. The order of lookups (a) and (b) does not matter.
Figure 4 demonstrates the ambiguity in lookups into a partially but incompletely partitioned block. A lookup of X.Y.0.1 (a) yields records for the ranges X.Y.0.0/16 and X.Y.0.0/17. Since X.Y.0.0/17 is the most specific range, its record is used for finding location information. However, a lookup of X.Y.128.1 (b) yields only the record for X.Y.0.0/16. So if (a) is looked up before (b) then both the records for X.Y.0.0/16 and X.Y.0.0/17 would be stored in the database. However, if (b) is looked up before (a), the lookup of (b) would store the record for X.Y.0.0/16 and the subsequent lookup of (a) would match the record for (b) in the database, preventing the query of the registry for the correct record.
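One way to read the caching policy described above (ranges are cached only for blocks of 256 or fewer addresses, larger blocks are remembered per address) is sketched below. The class and method names are hypothetical, the locations are placeholder strings, and 10.1 stands in for the X.Y prefix of Figure 4. This is an interpretation of the stated policy, not NetGeo's actual table layout.

```python
import ipaddress

MAX_CACHED_BLOCK = 256  # largest block size cached as a range


class BlockCache:
    def __init__(self):
        self.blocks = {}  # ip_network -> location, small blocks only
        self.hosts = {}   # ip_address -> location, hosts in large blocks

    def store(self, address, block, location):
        """Cache a result for address, found in the registry record for block."""
        net = ipaddress.ip_network(block)
        if net.num_addresses <= MAX_CACHED_BLOCK:
            self.blocks[net] = location
        else:
            # A large block may hide unseen sub-blocks, so remember only
            # the single address that was actually looked up.
            self.hosts[ipaddress.ip_address(address)] = location

    def find(self, address):
        """Return a cached location, preferring the most specific block."""
        ip = ipaddress.ip_address(address)
        if ip in self.hosts:
            return self.hosts[ip]
        matches = [net for net in self.blocks if ip in net]
        if not matches:
            return None  # cache miss: the caller must query the registry
        return self.blocks[max(matches, key=lambda net: net.prefixlen)]


cache = BlockCache()
# Lookup (b) from Figure 4 happens first and matches only the /16 record.
cache.store("10.1.128.1", "10.1.0.0/16", "headquarters")
print(cache.find("10.1.128.1"))
print(cache.find("10.1.0.1"))
```

Under this policy the earlier lookup of (b) does not mask (a): the /16 is never stored as a range, so the lookup of 10.1.0.1 misses the cache and the registry is queried for the more specific record.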
Future Work
Our ideas for potential uses of NetGeo exceed the resources we have to explore them. Maintaining the current system requires considerable effort since the external databases queried have rapidly changing record formats and policies. NetGeo's parsing scripts must be modified to accommodate every change in any whois server's record format. With the imminent proliferation of ICANN-sanctioned registrars using non-standard formats, we may need to implement parsers that trade performance for generality. Less optimized heuristics would better handle the diversity of output from new servers.
Since IP address blocks and AS numbers are administered by only three registries, it may be possible to store those three databases directly in NetGeo to save significant query traffic. In particular, APNIC and RIPE make their databases publicly available, and we are working with NSI to obtain bulk access to a subset of NetGeo-relevant fields in their database. Although incorporating such databases would require significant architectural changes to NetGeo, they would allow research into sub-allocation behavior -- how registries are further distributing sub-blocks of IP addresses, and the resulting sub-allocation hierarchies.
In its current form, NetGeo bases each lookup on only one piece of network identifying information from the user (IP address, domain or host name, or AS number). Future versions will allow the user to augment NetGeo's decision process with additional identifiers and information for a single entity, providing more accurate localization. For example, to look up a single entity the user might provide: multiple IP addresses (e.g., from the same router), multiple forward DNS names, a set of traceroute paths containing the entity, RTT information, etc.
We would also like to calibrate the accuracy of NetGeo's location determinations. This would involve comparing results provided by multiple techniques with known values. Although verification of hundreds of thousands of mappings presents a challenge, other sources of information, including paths traced by CAIDA's skitter measurement project [SKITTER], may assist in the validation process.
Related Work
Digital Island's TraceWare product [TRACEWARE] maintains a database of IP addresses correlated to country code. It is aimed at supporting customization of web content based on the source of the HTTP query. Their techniques are patent pending and proprietary, so specific relevance to NetGeo's techniques is unknown.
The Pablo research group at UIUC supports an IP to latitude/longitude service [UIUC-IP2LL], using the Internic's whois database. It resolves US sites to their city, Canadian sites to their province, and other non-US sites to the country's capital. This server does not handle the .mil domain.
Other geographic traceroute servers including VisualRoute [VISUALROUTE], WhatRoute [WHATROUTE], GeoBoy [GEOBOY], and NeoTrace [NEOTRACE], use their own techniques for locating IP addresses on geographic maps, generally relying on static mapping databases and whois information.
Uri Raz [RAZ] maintains a web page listing resources to facilitate manual discovery of a host's geographic location. Christopher Davis [DAVIS] maintains a web page to help administrators easily enter LOC records into their DNS configuration files.
Acknowledgments
This work was funded by NSF under ANI-9996248, as part of CAIDA's Internet Atlas project. Additional sponsors include the APNIC, ARIN, NSI, and RIPE NCC registries. The Getty Institute's Thesaurus of Geographic Names was an invaluable source of location information. For all their bug reports and performance complaints, we thank Brad Huffaker and other NetGeo users inside CAIDA. We are especially grateful to Colleen Shannon for support and editing during the writing of this paper. kc claffy's continued insistence on the need for visualizing features of the Internet provided the impetus for this work.
Conclusion
CAIDA developed NetGeo to support several of its Internet topology visualization projects and NetGeo has become an essential component of much of our analysis. Yet our efforts do not begin to realize NetGeo's potential. Several organizations have expressed interest in using NetGeo in selecting optimal mirror sites for content downloads, targeting web-based content to specific user regions, developing tools for mapping networks, and providing invaluable guidance for ISP infrastructure expansion. NetGeo can support investigations of many Internet research questions that have a geographical component, and could be usefully combined with many network visualization and educational tools.
NetGeo offers many features not found in any other known service: resolution down to city granularity, a wide range of whois server sources, caching of recent query results to improve performance, and interactive and API-based public access to the database. We anticipate many novel uses for NetGeo's functionality as knowledge of NetGeo's availability spreads throughout the research community.
The problem of mapping network entities to geographical locations is difficult, requiring many heuristics and the leveraging of as many independent sources of information as possible. The process is made more challenging by changes in formats of information sources, the dynamic nature of the information, and the general growth of the Internet.
Perl and Java client interfaces are available at http://netgeo.caida.org/.
References
[GETTY] The Getty Thesaurus of Geographic Names.
[RFC1876] Davis, C., Vixie, P., Goodwin, T., and I. Dickinson, "A Means for Expressing Location Information in the Domain Name System", RFC 1876, January 1996.
[RFC954] Harrenstien, K., Stahl, M., and E. Feinler, "NICNAME/WHOIS", RFC 954, October 1985.
[GTRACE] Periakaruppan, R., Nemeth, E., "GTrace - A Graphical Traceroute Tool", USENIX LISA'99, November 1999.
[ENTITY-ALLOC] Asia Pacific Network Information Centre, Division of IPv4 Address Space and AS Numbers Among Registries.
[AIRPORT-CODES] Smith, D., Listing of Airport Codes.
[RIPE-DB] Réseaux IP Européens Network Coordination Centre, Available Databases.
[APNIC-DB] Asia Pacific Network Information Centre, Available Databases.
[SKITTER] McRobb, D., Skitter.
[TRACEWARE] TraceWare, Digital Island.
[UIUC-IP2LL] UIUC's IP to Lat/Long Server.
[VISUALROUTE] VisualRoute, Datametrics System Corporation.
[WHATROUTE] WhatRoute, Bryan Christianson.
[GEOBOY] GeoBoy, NDG Software.
[NEOTRACE] NeoTrace, A Graphical Traceroute Tool.
[RAZ] Raz, Uri, Finding a host's geographical location.
[DAVIS] Davis, C., DNS LOC: Geo-enabling the Domain Name System.
Appendix: whois Record Parsing
Address parsing of whois records consists of five steps, with potential repetition of individual steps. NetGeo continues the address parsing process until it recognizes a location, preferably a city and state (or province, region, district, etc.), but sometimes just state or perhaps only country. In rare cases there is no recognizable location information in a whois record, and the parser is not able to determine location at any level of granularity.
We list and then describe in detail the five steps used in address parsing:
(1) Extract address strings from whois record
(2) Search for country indicators
(3) Standardize address strings
(4) Parse address strings, determine location (country, state, city)
(5) Look up latitude and longitude corresponding to the location
NetGeo performs steps (1) and (2) in parallel as it traverses the whois record line-by-line.
When parsing RIPE or APNIC records it is often necessary to iterate over steps (1) through (4), searching for a recognizable address in several blocks of address strings found in the record. NetGeo may parse a block of address strings more than once; if it finds a new country indicator it will try to use this country as a hint when re-parsing a block of address strings. This technique is useful for address blocks which lack a country name or code.
The process used for US addresses containing zip codes is somewhat simpler. During the extraction process, NetGeo searches for a U.S. zip code (a 5- or 9-digit string in the correct context) and compares any candidate it finds to the zip code database. If there is a match, NetGeo can skip the rest of the parsing process and localize the zip code directly.
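A minimal sketch of this shortcut is shown below. The paper does not define "the correct context", so this example assumes one plausible reading: a 5- or 9-digit number at the end of a line, preceded by a two-letter state abbreviation. The one-entry ZIP_TABLE and the function name are placeholders for the zip code database.

```python
import re

# Assumed context: two-letter state abbreviation, then a 5- or 9-digit zip
# code at the end of the line (the 4-digit extension may follow a hyphen).
ZIP_IN_CONTEXT = re.compile(r"\b[A-Z]{2}\s+(\d{5})(?:-?\d{4})?\s*$")

# Stand-in for the zip code table: 5-digit zip -> (city, state, country).
ZIP_TABLE = {"20151": ("CHANTILLY", "VA", "US")}


def find_zip_location(lines, zip_table=ZIP_TABLE):
    """Return the location for the first recognized zip code, or None."""
    for line in lines:
        match = ZIP_IN_CONTEXT.search(line.upper())
        if match and match.group(1) in zip_table:
            return zip_table[match.group(1)]
    return None


address = ["4506 Daly Drive Suite 200", "Chantilly, VA 20151", "US"]
print(find_zip_location(address))
```

Requiring both the textual context and a database match keeps street numbers and phone numbers from being mistaken for zip codes.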
- Extract address strings from whois record
- Search for country indicators
- In RIPE and APNIC records, NetGeo looks for a line labeled "country:"; such a line usually contains the 2-letter ISO 3166 country code. The example RIPE record in section (1) contains a "country:" line and so the parser would recognize the subsequent "NL" country code.
- NetGeo looks for a full country name in the "netname:" field, e.g., "CIP-BELGIUM-REGIONAL". To prevent false matches, NetGeo only accepts full country names and not country codes embedded in netnames.
- NetGeo looks for an international phone code in the record, and map the phone code to a country. In the example shown above, the contact information block contains the phone number "+31 20 535 4444", the phone code "31" corresponds to The Netherlands. For many countries, the area code within the country can further localize the record. In this example the "20" area code indicates that the contact person is in Amsterdam.
- NetGeo looks for an email address in the record with a 2-letter Top Level Domain (TLD); it will use the first email address with a 2-letter TLD from a line labeled "e-mail:", otherwise the first email address with a 2-letter TLD from any type of line. In the example shown above the contact person's email address, "ops@ripe.net", has a generic TLD and so cannot be used to guess the country. In some records the address might be "ops@ripe.nl", in which the 2-letter TLD "nl" localizes the record.
- Standardize address strings.
- Parse address strings, determine location (country, state, city)
- Lookup latitude and longitude corresponding to the location
This step extracts from the record one or more lines that are likely to contain an address. Address extraction from records returned by the ARIN or Internic whois servers is trivial. The address, if present, follows immediately the line containing the registrant's name. Address extraction from records returned by the RIPE and APNIC servers is more difficult; these records frequently contain more than one address and have inconsistent labeling of address lines. Incomplete addresses exist in all registries.
Example: whois record from ARIN
The ARIN whois server (whois.arin.net) returned the following record for a lookup of IP address 192.149.252.22 (the address of the ARIN whois server itself). Lines not used in address parsing have been omitted.
American Registry for Internet Numbers (NETBLK-ARIN-NET) 4506 Daly Drive Suite 200 Chantilly, VA 20151 US Netname: ARIN-NET Netblock: 192.149.252.0 - 192.149.252.255 Coordinator: ARIN IP Team (IP-FIX-ARIN) hostmaster@ARIN.NET 703-227-0660 [14 lines omitted]In this example the address extracted from the record is:
4506 Daly Drive Suite 200 Chantilly, VA 20151
Addresses in ARIN and Internic records are frequently in this format, with city, state, and zip all on one line. When NetGeo detects the city-state-zip line, and the zip code matches a value in the database, NetGeo stops any in-process parsing. Without this identifying line, NetGeo would extract the above lines along with the "US" line.
Example: whois record from RIPE
The RIPE whois server (whois.ripe.net) returned the following record for a lookup of IP address 193.0.0.200 (the address of the RIPE whois server itself). Lines not used in address parsing have been omitted.
% Rights restricted by copyright. See http://www.ripe.net/db/dbcopyright.html inetnum: 193.0.0.0 - 193.0.1.255 netname: RIPE-NCC descr: RIPE Network Coordination Centre descr: Amsterdam, Netherlands country: NL admin-c: DK58 [24 lines omitted] role: RIPE NCC Operations address: Singel 258 address: 1016 AB Amsterdam address: The Netherlands phone: +31 20 535 4444 fax-no: +31 20 535 4445 e-mail: ops@ripe.net [30 lines omitted]
In this example the first set of extracted address strings is:
RIPE Network Coordination Centre
Amsterdam, Netherlands
If the NetGeo parser could not recognize a location from the above address block, it would extract a second address block:
Singel 258
1016 AB Amsterdam
The Netherlands
If possible, NetGeo extracts address strings from the first group of lines labeled "descr:". If there is no such group, or if the strings from that group do not contain recognizable locations, the parser will extract address strings from lines labeled "address:". We prefer data from "descr:" lines since "address:" lines refer to addresses of the contact individuals, rather than the registered entity itself.
APNIC records have a structure similar to RIPE records, so extraction of address strings is similar to the above example.
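The preference for "descr:" lines over "address:" lines in RIPE-style records can be sketched as below. The function names and the `is_recognizable` callback (a stand-in for the location parser) are hypothetical, and taking only the first contiguous group of "address:" lines is a simplifying assumption.

```python
def first_group(record_lines, label):
    """Return the values of the first contiguous run of lines with this label."""
    prefix = label + ":"
    group = []
    for line in record_lines:
        if line.strip().lower().startswith(prefix):
            group.append(line.split(":", 1)[1].strip())
        elif group:
            break  # the first run has ended
    return group

def candidate_address_blocks(record_lines):
    """Candidate blocks in preference order: "descr:" first, then "address:"."""
    blocks = [first_group(record_lines, "descr"),
              first_group(record_lines, "address")]
    return [block for block in blocks if block]

def first_recognized_block(record_lines, is_recognizable):
    """Return the first candidate block the location parser accepts, else None."""
    for block in candidate_address_blocks(record_lines):
        if is_recognizable(block):
            return block
    return None

ripe_record = [
    "inetnum: 193.0.0.0 - 193.0.1.255",
    "netname: RIPE-NCC",
    "descr: RIPE Network Coordination Centre",
    "descr: Amsterdam, Netherlands",
    "country: NL",
    "role: RIPE NCC Operations",
    "address: Singel 258",
    "address: 1016 AB Amsterdam",
    "address: The Netherlands",
    "phone: +31 20 535 4444",
]
for block in candidate_address_blocks(ripe_record):
    print(block)
```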
As NetGeo collects address strings from the record, it tests each for values that might indicate the country. Techniques used in addition to parsing country names or country codes from address strings are:
The characters in address strings are mapped to the standard 26 uppercase ASCII letters, e.g., "España" -> "ESPANA". Strings are split at commas into separate lines, which frequently separates the city name from the state or province name. Tokens containing numerals (assumed to be postal codes) are removed for non-US addresses. For example, the three original address strings:
Periférico Sur 3190
México City, Distrito Federal 01900
Mexico
become the four standardized address strings:
PERIFERICO SUR
MEXICO CITY
DISTRITO FEDERAL
MEXICO
Periods used in abbreviations are removed, e.g., "N.Y." -> "NY". "Saint" appearing as a separate word is mapped to "ST", e.g., "Saint Paul" -> "ST PAUL". These canonical formats allow efficient comparison with names stored in the database.
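The standardization rules above can be approximated in a few lines. This is a sketch under stated assumptions, not the NetGeo implementation: accents are stripped with Unicode decomposition, hyphens and apostrophes are kept so that later matching steps can see them, and the function name `standardize` and its `us_address` flag are invented for illustration.

```python
import re
import unicodedata

def standardize(address_strings, us_address=False):
    """Return canonical uppercase ASCII address strings, split at commas."""
    result = []
    for raw in address_strings:
        for part in raw.split(","):
            # Decompose accented characters and drop the combining marks.
            text = unicodedata.normalize("NFKD", part)
            text = "".join(c for c in text if not unicodedata.combining(c))
            # Uppercase and remove periods used in abbreviations.
            text = text.upper().replace(".", "")
            # Replace any remaining unexpected character with a space.
            text = re.sub(r"[^A-Z0-9 '-]", " ", text)
            tokens = text.split()
            if not us_address:
                # Tokens with numerals are assumed to be postal codes.
                tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
            tokens = ["ST" if t == "SAINT" else t for t in tokens]
            if tokens:
                result.append(" ".join(tokens))
    return result

print(standardize([
    "Periférico Sur 3190",
    "México City, Distrito Federal 01900",
    "Mexico",
]))
```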
Address parsing proceeds from the last line in the block of standardized address strings back to the first line. Parsing continues until a city is found, if possible. The parser attempts to find the country (if not already known), then the state or province, then the city.
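A simplified sketch of this back-to-front order is shown below. It matches whole lines only, which is a deliberate simplification, and the three lookup structures (`countries`, `states`, `cities`) are hypothetical stand-ins for queries against the place-name database.

```python
def parse_back_to_front(lines, countries, states, cities):
    """Scan standardized address lines from last to first.

    countries: set of country names
    states:    dict mapping country -> set of state/province names
    cities:    dict mapping (country, state) -> set of city names
    """
    found = {"country": None, "state": None, "city": None}
    for line in reversed(lines):
        if found["country"] is None and line in countries:
            found["country"] = line
        elif found["state"] is None and line in states.get(found["country"], ()):
            found["state"] = line
        elif line in cities.get((found["country"], found["state"]), ()):
            found["city"] = line
            break  # parsing stops once a city is found
    return found

lines = ["PERIFERICO SUR", "MEXICO CITY", "DISTRITO FEDERAL", "MEXICO"]
countries = {"MEXICO"}
states = {"MEXICO": {"DISTRITO FEDERAL"}}
cities = {("MEXICO", "DISTRITO FEDERAL"): {"MEXICO CITY"}}
print(parse_back_to_front(lines, countries, states, cities))
```

Keying the city table on both country and state reflects the requirement that a candidate name be valid for the location already established.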
After finding a country name or code, the parser attempts to find a state or province name in the line preceding the country, or in the tokens preceding the country on the country line. When searching for a city or state, the parser attempts to find the longest right-anchored collection of tokens that form a valid city or state name for the current country.
For example, an address string may contain a street address, city name and state abbreviation without commas. In this example assume the country has already been determined to be the US and the state is California.
9500 Gilman Drive La Jolla
After standardization this becomes:
GILMAN DRIVE LA JOLLA
The parser tests the strings "GILMAN DRIVE LA JOLLA" and "DRIVE LA JOLLA" before it finds a match in the database with "LA JOLLA". In this example "LA JOLLA" is the longest right-anchored collection of tokens matching a valid city name for the current state and country.
If no collection of right-anchored tokens matches a valid city name, the parser attempts to find the longest left-anchored collection of tokens matching a valid city name. In this example, if "LA JOLLA" had not been recognized, the parser would have continued testing "GILMAN DRIVE LA", then "GILMAN DRIVE", and finally "GILMAN". The search for left-anchored strings is useful for skipping over unrecognized state or country abbreviations. For example, if the standardized string were "LA JOLLA CALIF" the parser could recognize the left-anchored string "LA JOLLA" even if it didn't recognize the non-standard abbreviation "CALIF".
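The two searches can be written as a pair of loops. This is an illustrative sketch: `valid_names` is an assumed in-memory set standing in for the database query restricted to the current country and state, and the function name is hypothetical.

```python
def longest_anchored_match(tokens, valid_names):
    """Return the longest right-anchored match, else the longest
    left-anchored match, else None."""
    n = len(tokens)
    # Right-anchored: drop tokens from the left, longest candidate first.
    for start in range(n):
        candidate = " ".join(tokens[start:])
        if candidate in valid_names:
            return candidate
    # Left-anchored: drop tokens from the right. The full string was
    # already tested in the first loop, so start one token shorter.
    for end in range(n - 1, 0, -1):
        candidate = " ".join(tokens[:end])
        if candidate in valid_names:
            return candidate
    return None

california_cities = {"LA JOLLA", "SAN DIEGO"}
print(longest_anchored_match("GILMAN DRIVE LA JOLLA".split(), california_cities))
print(longest_anchored_match("LA JOLLA CALIF".split(), california_cities))
```

Both calls find "LA JOLLA": the first through the right-anchored search, the second through the left-anchored fallback that skips the unrecognized token "CALIF".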
When searching for longest strings the parser disregards tokens designating the governmental unit, as is common in some international addresses. For example, when parsing Russian addresses the parser disregards the tokens "REPUBLIC" or "RESPUBLIKA" and attempts to match only the name of the republic against the values in the database.
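One way to sketch this filtering step is to drop the designating tokens before matching. The token set below contains only the two tokens named above, and the republic name in the usage line is purely illustrative; the full list NetGeo uses is not given here.

```python
# Only the two tokens named in the text; any fuller list is an assumption.
GOVERNMENT_UNIT_TOKENS = {"REPUBLIC", "RESPUBLIKA"}

def strip_unit_tokens(tokens):
    """Remove tokens that designate a governmental unit, not a place name."""
    return [t for t in tokens if t not in GOVERNMENT_UNIT_TOKENS]

print(strip_unit_tokens("RESPUBLIKA TATARSTAN".split()))
```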
In some situations no match is found in the database because the candidate location name differs slightly in format from the value stored in the database. When candidate strings containing hyphens, apostrophes, or embedded spaces fail to match the database entries, the parser retries the database query using a modified candidate string. For example, the location names "St.-Pierre" and "Saint-Pierre" are stored in the database as "ST-PIERRE". A candidate string such as "St. Pierre" would initially be converted to the standardized form "ST PIERRE", which would fail to match the "ST-PIERRE" entry in the database. The parser would then perform a query using the modified form "ST-PIERRE", which would match the entry in the database.
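The retry can be sketched as a loop over alternative spellings of the candidate. The particular set of variants and the function names are assumptions; only the space-to-hyphen case is confirmed by the example above.

```python
def candidate_variants(name):
    """Yield the name itself, then variants with separators changed."""
    seen = set()
    variants = (
        name,
        name.replace(" ", "-"),   # "ST PIERRE" -> "ST-PIERRE"
        name.replace("-", " "),   # "ST-PIERRE" -> "ST PIERRE"
        name.replace("'", ""),    # drop apostrophes
    )
    for variant in variants:
        if variant not in seen:
            seen.add(variant)
            yield variant

def lookup_with_retry(name, stored_names):
    """Return the stored spelling that matches the candidate, or None."""
    for variant in candidate_variants(name):
        if variant in stored_names:
            return variant
    return None

print(lookup_with_retry("ST PIERRE", {"ST-PIERRE"}))
```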
After successfully parsing an address from a whois record, the lookup of the corresponding latitude and longitude values is straightforward. At this step the city and/or state names have been validated against the place names in the database, so a simple database query returns the latitude and longitude, if known.
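As a rough illustration of this final query, the sketch below uses an in-memory SQLite table. The schema, table name, and sample rows are hypothetical (including the approximate coordinates and the placeholder city "EXAMPLEVILLE"); the paper does not describe the actual database layout.

```python
import sqlite3

def make_demo_db():
    """Build a tiny in-memory place table with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE places (country TEXT, state TEXT, city TEXT, lat REAL, lon REAL)"
    )
    conn.executemany(
        "INSERT INTO places VALUES (?, ?, ?, ?, ?)",
        [
            ("US", "CA", "LA JOLLA", 32.83, -117.27),    # approximate values
            ("US", "CA", "EXAMPLEVILLE", None, None),    # coordinates unknown
        ],
    )
    return conn

def lookup_lat_lon(conn, country, state, city):
    """Return (lat, lon) for a validated place, or None if not known."""
    row = conn.execute(
        # "IS" also matches NULL, so a missing state can be passed as None.
        "SELECT lat, lon FROM places WHERE country IS ? AND state IS ? AND city IS ?",
        (country, state, city),
    ).fetchone()
    if row is None or row[0] is None or row[1] is None:
        return None
    return (row[0], row[1])

conn = make_demo_db()
print(lookup_lat_lon(conn, "US", "CA", "LA JOLLA"))
```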