We use data from WHOIS databases maintained by the five Regional Internet Registries (RIRs: ARIN for North America, LACNIC for South America, RIPE NCC for Europe, AFRINIC for Africa, and APNIC for Asia/Pacific, including Australia) and by two National Internet Registries (NIRs: KRNIC for South Korea and JPNIC for Japan). The WHOIS databases contain a wide range of information intended primarily for network operators. Unfortunately, the databases are manually updated with few requirements for maintaining and changing the registered information in a timely fashion. Still, they represent the most useful and bountiful source of information about ASes at the organizational level. We collect bulk dumps of these databases 3-4 times per year.
Two other groups are also developing AS-to-organization mappings: Packet Clearing House [PCH] and Cai et al. [Cai10c]. Packet Clearing House maintains an AS-to-organization database populated on a voluntary basis. Like us, Cai et al. start with data from the RIR WHOIS databases. First, for each AS in each of the databases, they create a single object. Next, they consider other objects in the RIR database that link to a given AS, and assign fields from those objects to the single object they created for this AS. Finally, they use a machine learning algorithm to analyze similarities between objects and to group ASes into organizations. In contrast, our algorithm (described in detail below) maintains the original database structure and creates different objects for organizations, ASes, and contacts. We then group organizations into families based on their fields and their relationships to other objects.
Step 1: prepare the data for uniform inter-database analysis. Each WHOIS database has its own schema and uses different data formats. To coherently compare data from different databases, we create our own objects for organizations, ASes, and contacts. We then populate fields of these objects by consistently converting the disparate data formats used by different databases to a common representation.
We consider three types of fields in the WHOIS records: links to other records, email addresses, and plain text. We normalize links and email addresses by removing any extraneous comments, splitting out multiple items listed in a single field, and converting them to monocase. For text fields (including names, street addresses, and phone numbers) we remove spaces, punctuation, ASCII art and other uninformative noise; we also convert the text to monocase.
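The normalization rules above can be sketched as follows. This is a minimal illustration rather than our production code: the function names are hypothetical, "monocase" is taken to mean lowercase, and comments are assumed to be either parenthesized or introduced by `#`.

```python
import re


def normalize_link_or_email(raw: str) -> list[str]:
    """Strip comments, split multiple items, and lowercase each one."""
    # Assumption: comments are parenthesized or start with '#'.
    no_comments = re.sub(r"\(.*?\)|#.*", "", raw)
    # Assumption: multiple items are separated by commas, semicolons, or whitespace.
    items = re.split(r"[,;\s]+", no_comments)
    return [item.lower() for item in items if item]


def normalize_text(raw: str) -> str:
    """Keep only letters and digits (drops spaces, punctuation, ASCII art), lowercased."""
    return "".join(ch for ch in raw if ch.isalnum()).lower()


emails = normalize_link_or_email("Admin@Example.NET, noc@example.net (24x7) # call first")
street = normalize_text("*** 123 Main St., Suite #4 ***")
phone = normalize_text("+1 (858) 555-0100")
```

Here `emails` holds two separate lowercase addresses, and the street and phone strings collapse to compact keys that compare equal regardless of spacing or decoration.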
As a precondition to mapping, we need to link each of our AS objects (built from aut-num records, or ASHandle records for ARIN) to an organization object, org_id. AFRINIC and ARIN AS records have an explicit organization link. Although RIPE maintains organization records, its AS records do not provide a link to the corresponding organization records. APNIC and LACNIC do not provide any organization records with their dumps. In these cases, we create an AS object and an organization object for each AS record, link the AS object to the organization object, and populate the organization object with the fields from the AS record.
Several types of records in the WHOIS databases resist straightforward one-to-one transformations into our objects. For example, APNIC registrants frequently put AS block information (rather than just a single AS) into aut-num records with a revealing name (e.g., authority-BLOCK) and a comment describing the record as a delegation of authority to another RIR. ARIN also may insert AS block information instead of a single AS into their ASHandle records using a range of AS numbers and indicating the registry to whom these AS numbers are delegated in the org_id field (rather than the actual organization-customer to whom these AS numbers are assigned). We recognize both of these kinds of records, store this information as special as-block objects, and eventually use it to determine which registry is authoritative for a particular AS in case of conflicting data.
While processing the information in the WHOIS databases and populating the fields in our created objects, we count the number of times each value occurs in each field. If the number of occurrences exceeds a certain threshold, we designate the value as generic and do not use it for subsequent grouping (cf. Step 2 below). Such generic values may include phone numbers and email addresses for organizations that maintain records as a service, generic strings used to hide information (e.g., "Private Customer", "Private Address"), and generic role names that carry no identifying information (e.g., "Admin", "Customer Service"). The threshold for classifying a given value as generic is defined as the lower of 50 or the square root of the total number of its occurrences.
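A minimal sketch of the generic-value test follows. The threshold wording admits more than one reading; this sketch assumes the square root is taken over the total number of values recorded in the field, and the function name and `cap` parameter are hypothetical.

```python
import math
from collections import Counter


def generic_values(field_values: list[str], cap: int = 50) -> set[str]:
    """Return the values that occur too often in one field to be identifying."""
    counts = Counter(field_values)
    # Assumed reading: threshold = min(50, sqrt(total number of entries in this field)).
    threshold = min(cap, math.sqrt(len(field_values)))
    return {value for value, n in counts.items() if n > threshold}


# 100 normalized org_name entries: one placeholder string repeated 12 times.
names = ["privatecustomer"] * 12 + [f"org{i}" for i in range(88)]
generic = generic_values(names)
```

With 100 entries the threshold is 10, so only the placeholder string that appears 12 times is flagged as generic.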
Finally, we discard records we cannot reach by following a series of links starting with an aut-num record.
After creating the initial sets of AS and organization objects, we further check our data for redundancies and inconsistencies. We resolve any duplicate organization, contact, and mntner (maintainer) records. We consider two records duplicates if they differ only in their id, change date, or source fields, and at least one of the remaining fields is nontrivially equal (i.e., the field is not blank, not a link, and not a low-cardinality field such as country). In this case, we delete one of the records and modify all other records that referred to it to refer to the remaining record instead. If this modification yields additional objects with identical fields, we consolidate these objects too, continuing the process until no further consolidation is possible.
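The iterative consolidation can be sketched as a fixed-point loop. The record layout (a dict of normalized string fields keyed by record id) and the field lists `IGNORED`, `TRIVIAL`, and `LINK_FIELDS` are assumptions made for illustration only.

```python
IGNORED = {"changed", "source"}       # fields allowed to differ (the id is the dict key)
TRIVIAL = {"country"}                 # low-cardinality fields (hypothetical list)
LINK_FIELDS = {"admin_c", "tech_c"}   # fields that reference other records (hypothetical)


def signature(rec: dict) -> tuple:
    """All fields that must match for two records to count as duplicates."""
    return tuple(sorted((k, v) for k, v in rec.items() if k not in IGNORED))


def has_nontrivial_field(rec: dict) -> bool:
    """True if at least one non-blank field is neither a link nor low-cardinality."""
    return any(v and k not in IGNORED | TRIVIAL | LINK_FIELDS for k, v in rec.items())


def consolidate(records: dict[str, dict]) -> dict[str, dict]:
    """Merge duplicates and redirect links until nothing changes."""
    records = {rid: dict(rec) for rid, rec in records.items()}
    while True:
        kept, redirect = {}, {}
        for rid in sorted(records):
            rec = records[rid]
            if not has_nontrivial_field(rec):
                continue
            sig = signature(rec)
            if sig in kept:
                redirect[rid] = kept[sig]   # rid duplicates an earlier record
            else:
                kept[sig] = rid
        if not redirect:
            return records
        for rid in redirect:
            del records[rid]
        for rec in records.values():
            for field in LINK_FIELDS:
                if rec.get(field) in redirect:
                    rec[field] = redirect[rec[field]]


merged = consolidate({
    "C1": {"name": "janedoe", "phone": "15550100", "changed": "2010", "source": "RIPE"},
    "C2": {"name": "janedoe", "phone": "15550100", "changed": "2011", "source": "RIPE"},
    "O1": {"org_name": "examplenet", "admin_c": "C1", "country": "US"},
    "O2": {"org_name": "examplenet", "admin_c": "C2", "country": "US"},
})
```

In this example, merging the two contacts makes the two organization records identical on the next pass, so they are merged as well, leaving only `C1` and `O1`.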
If multiple registries contain an aut-num (ASHandle for ARIN) record for the same AS number, we use as-block objects to determine which registry's data is most authoritative; if we fail to find an authoritative registry, we use the object with the most recent change date. If an AS number is within the range of multiple as-block records, we choose the record with the most recent change date.
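The conflict-resolution rule can be sketched as below. The `AsBlock` and `AutNum` classes and their fields are hypothetical stand-ins for our objects, and change dates are assumed to be ISO-formatted strings so that they sort chronologically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AsBlock:
    first: int
    last: int
    registry: str
    changed: str  # assumed ISO date, e.g. "2010-06-01"


@dataclass
class AutNum:
    asn: int
    registry: str
    changed: str


def authoritative_registry(asn: int, blocks: list[AsBlock]) -> Optional[str]:
    """Registry of the most recently changed as-block covering this AS number."""
    covering = [b for b in blocks if b.first <= asn <= b.last]
    if not covering:
        return None
    return max(covering, key=lambda b: b.changed).registry


def pick_record(candidates: list[AutNum], blocks: list[AsBlock]) -> AutNum:
    """Prefer the authoritative registry's record; otherwise the newest record."""
    registry = authoritative_registry(candidates[0].asn, blocks)
    preferred = [r for r in candidates if r.registry == registry]
    return max(preferred or candidates, key=lambda r: r.changed)


blocks = [AsBlock(64496, 64511, "RIPE", "2009-01-01")]
records = [AutNum(64500, "ARIN", "2010-06-01"), AutNum(64500, "RIPE", "2008-03-01")]
chosen = pick_record(records, blocks)
fallback = pick_record(records, [])
```

With the as-block present, the RIPE record wins even though it is older; without any covering as-block, the most recently changed record (ARIN's) is used.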
Table 1 columns: Organization, ASes, Pairs, Families, TP (pairs), FN = Pairs - TP, FP (ASes).
Step 2: grouping objects into families. We group organization and contact objects into families based on commonalities found in their email, name, street, and phone fields. In searching for common values, we ignore generic values defined above since they do not reveal informative ownership relationships.
We treat each non-generic value found in email, name, street, and phone fields in each of our organization or contact objects as a reference to a virtual object whose id equals that value. All (real or virtual) objects linked to each other by eligible fields with non-generic values form a single family.
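One way to realize this grouping is a union-find (disjoint-set) structure in which every object and every non-generic field value is a node. The sketch below is illustrative and not our exact implementation: the function name, the object layout, and the rule that a value matching a real object id links to that object are all assumptions.

```python
from collections import defaultdict


def group_families(objects: dict[str, dict], eligible: set[str], generic: set[str]) -> list[set[str]]:
    """Group object ids whose eligible fields share non-generic values."""
    parent: dict = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for obj_id, fields in objects.items():
        find(("obj", obj_id))  # register the object even if it links to nothing
        for field, value in fields.items():
            if field not in eligible or not value or value in generic:
                continue
            # A value naming a real object links to it; otherwise it is a virtual object.
            target = ("obj", value) if value in objects else ("val", value)
            union(("obj", obj_id), target)

    families = defaultdict(set)
    for node in list(parent):
        if node[0] == "obj":
            families[find(node)].add(node[1])
    return list(families.values())


families = group_families(
    {
        "ORG-A": {"phone": "15550100", "org_name": "examplenet"},
        "ORG-B": {"phone": "15550100", "org_name": "examplenetworks"},
        "ORG-C": {"phone": "15550199", "org_name": "privatecustomer"},
        "ORG-D": {"phone": "15550123", "org_name": "privatecustomer"},
    },
    eligible={"phone", "org_name"},
    generic={"privatecustomer"},
)
```

ORG-A and ORG-B end up in one family through their shared phone number, while ORG-C and ORG-D stay separate because the only value they share is generic.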
We experimented with the data in order to determine which grouping of fields would maximize true positives and minimize false positives. Our final selection includes:
- The best discriminating fields that we always use: aut.org_id, org.admin_c, org.tech_c, org.phone, contact.phone, org.org_name, aut.admin_c, aut.tech_c, aut.owner_c.
- Less effective but still useful fields that yielded the results in Table 1: aut.mnt_by, aut.noc_c, org.notify, org.noc_c, mntner.admin_c, mntner.tech_c.
- Fields excluded from the grouping process since they yielded more false positives than true positives in our data: org.mnt_ref, mntner.mnt_by, aut.abuse_c, org.street, aut.aut_name, contact.email, contact.street, aut.changed_email, aut.notify, contact.contact_name, org.mnt_irt, org.changed_email, org.mnt_by.
Step 3: validation. To validate our AS-to-organization data, we obtained the same ground truth used by Cai et al. [Cai10c], and manually updated it to reflect recent changes. Table 1 summarizes the comparison of our mapping to the ground truth.
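The per-organization counts in Table 1 could be computed along the following lines. The column semantics here are our reading of the headers and should be treated as assumptions: Pairs counts all AS pairs within a ground-truth organization, TP counts the pairs we place in the same family, and FP counts ASes that share one of those families without belonging to the organization. The function name and the `family_of` mapping are hypothetical.

```python
from itertools import combinations


def validate(org_ases: set[int], family_of: dict[int, str]) -> dict[str, int]:
    """Compare one ground-truth organization against inferred families."""
    pairs = list(combinations(sorted(org_ases), 2))
    # A pair is a true positive when both ASes map to the same inferred family.
    tp = sum(
        1 for a, b in pairs
        if a in family_of and b in family_of and family_of[a] == family_of[b]
    )
    families = {family_of[a] for a in org_ases if a in family_of}
    # False positives: outside ASes that we grouped into one of these families.
    fp = sum(1 for asn, fam in family_of.items() if fam in families and asn not in org_ases)
    return {
        "ASes": len(org_ases),
        "Pairs": len(pairs),
        "Families": len(families),
        "TP": tp,
        "FN": len(pairs) - tp,
        "FP": fp,
    }


row = validate(
    {64500, 64501, 64502, 64503},
    {64500: "F1", 64501: "F1", 64502: "F1", 64503: "F2", 64510: "F2"},
)
```

In this example the four ASes form 6 pairs; three ASes share family F1, giving 3 true-positive pairs and 3 false negatives, and AS 64510 is the single false positive.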