Our method uses labeled AS classification data to train a machine-learning classifier that assigns each AS a business type. We start from a ground-truth dataset from PeeringDB (described below) and split it into two parts: a labeled training set and a validation set. We then train a decision-tree classifier on a number of features computed for each AS (listed below).
To train and validate our classification approach, we use ground-truth data from PeeringDB, the largest source of self-reported data about the properties of ASes. From PeeringDB, we extract the self-reported business type of each AS, which is one of "Cable/DSL/ISP", "NSP" (Network Service Provider), "Content", "Education/Research", "Enterprise" and "Non-profit". We combine the "Cable/DSL/ISP" and "NSP" classes into a single class "Transit/Access". We ignore the "Non-profit" category for the purposes of this classification. The labeled ground-truth data thus consists of three classes: "Transit/Access", "Content" and "Enterprise". As PeeringDB under-represents the "Enterprise" category, we manually assemble a set of 500 networks which we determine to be enterprise customers based on their WHOIS records and webpages, and add this set to the labeled classification data.
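The label preparation described above can be sketched roughly as follows. This is a minimal illustration, not CAIDA's actual code: the names `PEERINGDB_TO_CLASS`, `training_label` and `build_labeled_set` are hypothetical, and the AS numbers are reserved documentation ASNs. The text does not say how "Education/Research" entries are handled, so this sketch leaves them out of the labeled set; it also assumes that a PeeringDB label takes precedence if an AS appears in both sources.

```python
from typing import Dict, Iterable, Optional

# Mapping from PeeringDB self-reported types to the three training classes.
# "Non-profit" is ignored per the text; "Education/Research" is left unmapped
# here because the text does not state how it is treated (assumption).
PEERINGDB_TO_CLASS = {
    "Cable/DSL/ISP": "Transit/Access",
    "NSP": "Transit/Access",
    "Content": "Content",
    "Enterprise": "Enterprise",
}


def training_label(peeringdb_type: str) -> Optional[str]:
    """Return the training class, or None if the type is not used for training."""
    return PEERINGDB_TO_CLASS.get(peeringdb_type)


def build_labeled_set(
    peeringdb_types: Dict[int, str],
    manual_enterprise_asns: Iterable[int],
) -> Dict[int, str]:
    """Combine PeeringDB labels with the manually assembled enterprise set."""
    labels: Dict[int, str] = {}
    for asn, reported_type in peeringdb_types.items():
        label = training_label(reported_type)
        if label is not None:
            labels[asn] = label
    # Add manually identified enterprise networks; keep an existing
    # PeeringDB label if the AS is already present (assumption).
    for asn in manual_enterprise_asns:
        labels.setdefault(asn, "Enterprise")
    return labels


if __name__ == "__main__":
    reported = {64496: "NSP", 64497: "Non-profit", 64498: "Content"}
    print(build_labeled_set(reported, [64499]))
```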
We use the following features for each AS in the training and validation set.
1) Customer, provider and peer degrees: We obtain the number of customers, providers and peers (at the AS-level) using CAIDA's AS-rank data.
2) Size of customer cone in number of ASes: We obtain the size of an AS's customer cone using CAIDA's AS-rank data.
3) Size of the IPv4 address space advertised by the AS: We obtain this quantity using BGP routing tables collected from Routeviews.
4) Number of domains from the Alexa top 1 million list hosted by the AS: We obtain the list of top 1 million websites from Alexa, perform DNS lookups on each domain (at CAIDA), and map each returned IP address to the corresponding ASN by longest-prefix matching against a routing table from Routeviews. We then count the number of domains hosted by each AS.
5) Fraction of an AS's advertised space that is seen as active in the UCSD Network Telescope.
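As an illustration of feature 4, the mapping from resolved IP addresses to origin ASes by longest-prefix matching can be sketched as below. This is a simplified sketch under stated assumptions rather than the pipeline actually used: the function names `longest_prefix_asn` and `domains_per_as` are hypothetical, the routing table is assumed to be already parsed into (prefix, origin ASN) pairs, DNS resolution is assumed to have happened beforehand, and a domain is counted once for every AS that hosts at least one of its addresses. The prefixes and ASNs are reserved documentation values.

```python
import ipaddress
from collections import Counter


def longest_prefix_asn(ip, routes):
    """Return the origin ASN of the most specific prefix covering ip, or None."""
    addr = ipaddress.ip_address(ip)
    best_len, best_asn = -1, None
    # Linear scan for clarity; a real routing table would call for a trie.
    for prefix, asn in routes:
        if (addr.version == prefix.version
                and addr in prefix
                and prefix.prefixlen > best_len):
            best_len, best_asn = prefix.prefixlen, asn
    return best_asn


def domains_per_as(domain_ips, routes):
    """Count, for each AS, the domains with at least one address it originates."""
    counts = Counter()
    for ips in domain_ips.values():
        asns = {longest_prefix_asn(ip, routes) for ip in ips}
        asns.discard(None)  # addresses not covered by any prefix
        for asn in asns:
            counts[asn] += 1
    return counts


if __name__ == "__main__":
    routes = [
        (ipaddress.ip_network("192.0.2.0/24"), 64496),
        (ipaddress.ip_network("192.0.2.128/25"), 64497),
        (ipaddress.ip_network("198.51.100.0/24"), 64498),
    ]
    resolved = {
        "example.com": ["192.0.2.200"],
        "example.net": ["192.0.2.10", "198.51.100.5"],
        "example.org": ["203.0.113.1"],
    }
    print(domains_per_as(resolved, routes))
```

The linear scan keeps the example short; with a full table of several hundred thousand prefixes, a radix trie would be the more practical data structure.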
We use half of the ground-truth data to validate the machine-learning classifier. The Positive Predictive Value (PPV) of the classifier is currently 70%.
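For readers unfamiliar with the metric, PPV for a class is the number of ASes correctly assigned to that class divided by the number of ASes the classifier assigned to it. The sketch below shows one way to split labeled data in half and compute per-class PPV on the validation half. It is illustrative only: `split_half` and `ppv_per_class` are hypothetical names, the random split with a fixed seed is an assumption, and the text does not say how the reported 70% figure is aggregated across classes.

```python
import random
from collections import Counter


def split_half(asns, seed=0):
    """Shuffle the labeled ASes and split them into two halves (train, validation)."""
    shuffled = sorted(asns)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def ppv_per_class(true_labels, predicted_labels):
    """PPV per class: correct predictions of the class / all predictions of the class."""
    true_positive = Counter()
    predicted = Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        predicted[guess] += 1
        if truth == guess:
            true_positive[guess] += 1
    return {cls: true_positive[cls] / predicted[cls] for cls in predicted}


if __name__ == "__main__":
    truth = ["Content", "Content", "Enterprise", "Transit/Access", "Enterprise"]
    guess = ["Content", "Enterprise", "Enterprise", "Transit/Access", "Transit/Access"]
    print(ppv_per_class(truth, guess))
```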
AS classification dataset
The AS classification dataset contains the business type associated with each AS.
File format: <AS>|<Source>|<Class>
Values of <Source>:
|CAIDA_class|Classification was inferred by the machine-learning classifier.|
|peerDB_class|Classification was obtained directly from the PeeringDB database.|
Values of <Class>:
|Transit / Access|ASes which were inferred to be transit and/or access providers.|
|Content|ASes which provide content hosting and distribution systems.|
|Enterprise|Various organizations, universities and companies at the network edge that are mostly users, rather than providers, of Internet access, transit or content.|
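A file in the `<AS>|<Source>|<Class>` format can be read with a few lines of Python. The following is a minimal sketch, not an official parser: the names `ASClass` and `parse_as_classification` are hypothetical, it assumes the AS field is a plain integer, and it assumes that blank lines and lines starting with `#` (if any) are comments to be skipped. The sample records use reserved documentation ASNs.

```python
from typing import Iterable, Iterator, NamedTuple


class ASClass(NamedTuple):
    asn: int
    source: str
    cls: str


def parse_as_classification(lines: Iterable[str]) -> Iterator[ASClass]:
    """Yield one record per '<AS>|<Source>|<Class>' line."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment lines (assumed, not documented above).
        if not line or line.startswith("#"):
            continue
        asn, source, cls = line.split("|", 2)
        yield ASClass(int(asn), source.strip(), cls.strip())


if __name__ == "__main__":
    sample = [
        "# format: as|source|type",
        "64496|CAIDA_class|Enterprise",
        "64497|peerDB_class|Content",
    ]
    for record in parse_as_classification(sample):
        print(record.asn, record.source, record.cls)
```

To read an actual download, the same function can be given an open file object, since iterating over a text file yields its lines.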
Acceptable Use Agreement
Access to these data is subject to the terms of the following CAIDA Acceptable Use Agreement
(printable version in PDF format)
When referencing this data (as required by the AUA), please use:
The CAIDA UCSD AS Classification Dataset, <date range used>
Also, please report your publication to CAIDA.
Access the public CAIDA UCSD AS Classification Dataset