AS Classification

This page documents our method for classifying Autonomous Systems (ASes) according to their business type.

January 2021: We are improving the algorithm behind the AS Classification Dataset and have removed download access to this dataset for the time being.

Method

Our method uses labeled AS classification data to train a machine-learning classifier that assigns each AS a business type. We start with a ground-truth dataset from PeeringDB (described in the next section) and split it into two parts: a labeled training set and a validation set. We then train a decision-tree classifier on the training set, using a number of features for each AS (described under "Classifier features").
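The split into training and validation sets can be sketched as follows. This is a minimal illustration, not the production pipeline: the function name `split_labeled`, the seeded random shuffle, and the sample ASNs (taken from the range reserved for documentation) are all assumptions made for the example.

```python
import random

def split_labeled(labeled, seed=0):
    """Shuffle labeled {asn: class} pairs and split them into two halves."""
    items = sorted(labeled.items())      # fixed starting order
    random.Random(seed).shuffle(items)   # seeded, so the split is reproducible
    half = len(items) // 2
    return dict(items[:half]), dict(items[half:])

# Hypothetical labels, for illustration only.
labeled = {
    64496: "Transit/Access",
    64497: "Content",
    64498: "Enterprise",
    64499: "Content",
}
training, validation = split_labeled(labeled)
```

The training half would then be passed, together with the per-AS features, to a decision-tree learner; the validation half is held back to measure how well the trained classifier performs.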

Ground-truth dataset

To train and validate our classification approach, we use ground-truth data from PeeringDB, the largest source of self-reported data about the properties of ASes. From PeeringDB, we extract the self-reported business type of each AS, which is one of "Cable/DSL/ISP", "NSP" (Network Service Provider), "Content", "Education/Research", "Enterprise" and "Non-profit". We combine the "Cable/DSL/ISP" and "NSP" classes into a single class "Transit/Access". We ignore the "Non-profit" category for the purposes of this classification. The labeled ground-truth data thus consists of three classes: "Transit/Access", "Content" and "Enterprise". As PeeringDB under-represents the "Enterprise" category, we manually assemble a set of 500 networks which we determine to be enterprise customers based on their WHOIS records and webpages, and add this set to the labeled classification data.
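The label mapping described above can be written down as a small lookup table. The sketch below is illustrative only. The names `PEERINGDB_TO_CLASS`, `ground_truth_class` and `build_labels` are hypothetical, and two choices are assumptions rather than documented behavior: "Education/Research" is left unmapped because the text does not say how it is handled, and a PeeringDB label takes precedence when an AS also appears in the manually assembled enterprise set.

```python
# PeeringDB business type -> training class, as described in the text.
# "Non-profit" is ignored. "Education/Research" is absent on purpose:
# the text does not state its handling, so this sketch leaves it unlabeled.
PEERINGDB_TO_CLASS = {
    "Cable/DSL/ISP": "Transit/Access",
    "NSP": "Transit/Access",
    "Content": "Content",
    "Enterprise": "Enterprise",
}

def ground_truth_class(peeringdb_type):
    """Return the training class for a PeeringDB type, or None if unused."""
    return PEERINGDB_TO_CLASS.get(peeringdb_type)

def build_labels(peeringdb_types, manual_enterprise_asns):
    """Combine PeeringDB labels with a manually assembled enterprise set."""
    labels = {}
    for asn, pdb_type in peeringdb_types.items():
        cls = ground_truth_class(pdb_type)
        if cls is not None:
            labels[asn] = cls
    for asn in manual_enterprise_asns:
        # Assumption: keep the PeeringDB label if the AS already has one.
        labels.setdefault(asn, "Enterprise")
    return labels

# Hypothetical input, for illustration only.
labels = build_labels(
    {64496: "NSP", 64497: "Content", 64498: "Non-profit", 64499: "Cable/DSL/ISP"},
    [64498, 64500],
)
```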

Classifier features

We use the following features for each AS in the training and validation set.

1) Customer, provider and peer degrees: We obtain the number of customers, providers and peers (at the AS-level) using CAIDA's AS-rank data.

2) Size of customer cone in number of ASes: We obtain the size of an AS's customer cone using CAIDA's AS-rank data.

3) Size of the IPv4 address space advertised by the AS. We obtain this quantity using BGP routing tables collected from Routeviews.

4) Number of domains from the Alexa top 1 million list hosted by the AS. We obtain the list of top 1 million websites from Alexa, perform DNS lookups on each domain (at CAIDA), and map each returned IP address to the corresponding ASN by longest-prefix matching against a routing table from Routeviews. We then count the number of domains hosted by each AS.

5) Fraction of an AS's advertised space that is seen as active in the UCSD Network Telescope.
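Features 3 and 4 both reduce to operations on prefixes, which can be sketched with Python's standard `ipaddress` module. The code below is a simplified illustration under stated assumptions, not the code used to build the dataset: the function names and the sample routing table are hypothetical, the lookup is a linear scan rather than a trie, and a domain that resolves into several ASes is counted once for each of them.

```python
import ipaddress
from collections import Counter

def advertised_ipv4_addresses(prefixes):
    """Count the distinct IPv4 addresses covered by a set of prefixes."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    # collapse_addresses merges overlapping and adjacent prefixes, so an
    # address covered by more than one announcement is counted once.
    return sum(n.num_addresses for n in ipaddress.collapse_addresses(nets))

def longest_prefix_match(ip, table):
    """Return the ASN of the most specific prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best_net, best_asn = None, None
    for net, asn in table.items():
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_asn = net, asn
    return best_asn

def domains_per_as(resolved, table):
    """Count domains per AS from {domain: [ip, ...]} DNS results."""
    counts = Counter()
    for domain, ips in resolved.items():
        asns = {longest_prefix_match(ip, table) for ip in ips}
        asns.discard(None)      # addresses not covered by any prefix
        counts.update(asns)     # assumption: one count per (domain, AS) pair
    return counts

# Hypothetical routing table and DNS results, for illustration only.
table = {
    ipaddress.ip_network("192.0.2.0/24"): 64496,
    ipaddress.ip_network("192.0.2.128/25"): 64497,
    ipaddress.ip_network("198.51.100.0/24"): 64498,
}
resolved = {
    "example.com": ["192.0.2.10", "192.0.2.200"],
    "example.net": ["192.0.2.201"],
    "example.org": ["203.0.113.5"],
}
hosted = domains_per_as(resolved, table)
space = advertised_ipv4_addresses(["192.0.2.0/24", "192.0.2.128/25"])
```

In this sample, 192.0.2.200 falls inside both 192.0.2.0/24 and 192.0.2.128/25, and the more specific /25 determines the ASN. For a full routing table, a radix tree would be a more practical lookup structure than the linear scan shown here.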

Validation

We use half of the ground-truth data to validate the machine-learning classifier. The Positive Predictive Value (PPV) of the classifier is currently 70%.
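For reference, the PPV of a class is the fraction of ASes predicted to be in that class that truly belong to it: PPV = TP / (TP + FP). The sketch below computes it per class on a held-out validation set. The function name `ppv` and the sample labels are hypothetical, and how the per-class values are aggregated into the single figure quoted above is not specified here.

```python
def ppv(true_labels, predicted_labels, cls):
    """Positive predictive value of class cls: TP / (TP + FP).

    Returns None when no AS was predicted to be in cls.
    """
    predicted_positive = [asn for asn, p in predicted_labels.items() if p == cls]
    if not predicted_positive:
        return None
    true_positive = sum(1 for asn in predicted_positive if true_labels.get(asn) == cls)
    return true_positive / len(predicted_positive)

# Hypothetical validation labels and classifier output, for illustration only.
truth = {64496: "Content", 64497: "Content", 64498: "Enterprise", 64499: "Transit/Access"}
predicted = {64496: "Content", 64497: "Enterprise", 64498: "Content", 64499: "Transit/Access"}

content_ppv = ppv(truth, predicted, "Content")
```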

AS classification dataset

The AS classification dataset contains the business type associated with each AS.

File format: <AS>|<Source>|<Class>

Source description
CAIDA_class: Classification was inferred by the machine-learning classifier.
peerDB_class: AS classification was obtained directly from the PeeringDB database.

Class description
Transit/Access: ASes which were inferred to be transit and/or access providers.
Content: ASes which provide content hosting and distribution systems.
Enterprise: Various organizations, universities and companies at the network edge that are mostly users, rather than providers of Internet access, transit or content.
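A file in the `<AS>|<Source>|<Class>` format can be read with a few lines of code. The parser below is a minimal sketch. The function name, the sample records, and the choice to skip blank lines and lines starting with `#` are assumptions, since only the field layout is specified above. The AS field is kept as a string rather than converted to an integer.

```python
def parse_as_classification(lines):
    """Parse '<AS>|<Source>|<Class>' lines into {as: (source, class)}."""
    records = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' comment lines carry no records.
        if not line or line.startswith("#"):
            continue
        asn, source, cls = line.split("|", 2)  # raises ValueError if malformed
        records[asn] = (source, cls)
    return records

# Hypothetical sample content, for illustration only.
sample = [
    "# hypothetical sample",
    "64496|CAIDA_class|Content",
    "64497|peerDB_class|Enterprise",
    "",
]
records = parse_as_classification(sample)
```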

Acceptable Use Agreement

Please read the terms of the CAIDA Acceptable Use Agreement (AUA) for Publicly Accessible Datasets below:

When referencing this data (as required by the AUA), please use:

The CAIDA UCSD AS Classification Dataset, <date range used>
https://www.caida.org/catalog/datasets/as-classification/
You are required to report your publications using this dataset to CAIDA.

Data Access

January 2021: We are improving the algorithm behind the AS Classification Dataset and have removed download access to this dataset for the time being.

Related Objects

See https://catalog.caida.org/dataset/as_classification/ to explore objects related to this document in the CAIDA Resource Catalog.