BGP Community Dictionary Dataset
A BGP Community is an attribute (standardized in RFC 1997 in 1996) that provides meta-information about prefixes announced to customer and peer networks. The attribute is represented by an X:Y pair, where X and Y are two 16-bit values (extended communities, defined in RFC 4360, use eight octets). By convention, the first two octets encode the Autonomous System Number (ASN) of the operator that sets the community. The next two octets encode an arbitrary value that denotes some property relevant to routing policy. Unfortunately, the specific values and semantics of this BGP attribute are not standardized. In addition, the community is a transitive optional attribute, meaning that BGP implementations do not have to recognize it. It is at the network operator's discretion to accept it or pass it on to another AS.
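To make the X:Y layout concrete, the following minimal sketch splits a standard community string into its two 16-bit halves and packs them into the 32-bit value described above. The function names are hypothetical and chosen for illustration only; they are not part of the dataset or of any BGP library.

```python
from typing import Tuple


def split_community(community: str) -> Tuple[int, int]:
    """Split an 'X:Y' community string into (asn, value), both 16-bit."""
    asn_text, value_text = community.split(":")
    asn, value = int(asn_text), int(value_text)
    for part in (asn, value):
        if not 0 <= part <= 0xFFFF:
            raise ValueError(f"{community!r}: each half must fit in 16 bits")
    return asn, value


def pack_community(community: str) -> int:
    """Pack 'X:Y' into one 32-bit integer: X in the high two octets, Y in the low two."""
    asn, value = split_community(community)
    return (asn << 16) | value


def unpack_community(raw: int) -> str:
    """Inverse of pack_community: render a 32-bit integer as 'X:Y'."""
    return f"{raw >> 16}:{raw & 0xFFFF}"


if __name__ == "__main__":
    asn, value = split_community("3491:1007")
    print(asn, value)
    print(unpack_community(pack_community("3491:1007")))
```

Only the first half has a conventional meaning (the ASN of the operator that sets the community); interpreting the second half requires operator-specific documentation, which is what the Dictionary provides.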
This BGP Community Dictionary Dataset focuses on Location-Encoding Ingress Communities, which label the location where a prefix entered the network. The Dictionary represents our best effort to extract meaningful geolocation information encoded by network operators into the Community attributes they set up for their networks.
Operators choose their own schemes to describe ingress location information at various granularities. Many publish their community schemes either in their Internet Routing Registry (IRR) records or on their support web pages. The documentation is written in natural language and lacks a standardized structure and terminology, so parsing it requires significant manual work. To tackle this problem, we developed a web-mining tool that enables automatic compilation of a community dictionary. The tool uses a web scraper to extract text from the remarks sections of IRR records and from ASes' web pages. Next, a text parser analyzes the extracted text using the Natural Language Toolkit (NLTK) to discover infrastructure-related communities. We identify substrings that include community values using regular expression matching and use Stanford's Named Entity Recognizer (NER) to identify named entities, focusing on entities that pertain to locations or infrastructure operators.
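The regular-expression step described above could look roughly like the following sketch. It is not the tool's actual code: the pattern, the function name, and the sample remarks text (which uses the documentation ASN 64500) are all assumptions made for illustration, and the NLTK and Stanford NER stages are not reproduced here.

```python
import re

# Assumed pattern: two decimal numbers of up to five digits joined by a colon,
# not embedded in a longer run of digits, dots, or colons.
COMMUNITY_RE = re.compile(r"(?<![\d:.])(\d{1,5}):(\d{1,5})(?![\d:])")


def find_communities(text):
    """Return (community, source_line) pairs for every X:Y value found in text."""
    found = []
    for line in text.splitlines():
        for match in COMMUNITY_RE.finditer(line):
            asn, value = int(match.group(1)), int(match.group(2))
            # Keep only pairs whose halves fit in 16 bits.
            if asn <= 0xFFFF and value <= 0xFFFF:
                found.append((match.group(0), line.strip()))
    return found


# Hypothetical IRR remarks, invented for this example.
SAMPLE_REMARKS = """\
remarks: 64500:1001 routes learned in Frankfurt
remarks: 64500:1002 routes learned in Miami
remarks: contact the NOC for changes
"""

if __name__ == "__main__":
    for community, line in find_communities(SAMPLE_REMARKS):
        print(community, "<-", line)
```

A pattern this simple also matches unrelated X:Y strings such as clock times, which is one reason a later step is needed to check that the surrounding text actually names a location or infrastructure operator.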
The BGP Community Dictionary can be used as a source of meta-data to interpret/annotate other BGP data, including for inferring topological and geographic locations of outages. For more details please see the papers "Detecting Peering Infrastructure Outages in the Wild" and "Inferring BGP Blackholing Activity in the Internet".
# The format of the file is tab-separated with the following fields:
# column 1: The BGP community
# column 2: The ingress location (location-code,
#           optional: country,region,city,latitude,longitude)
# column 3: The date (YYYY.MM.DD) when this community-to-location mapping was found
# column 4 (optional): Additional human-readable location information
# column 5 (optional): Specific facility
#
# For questions, feedback or corrections please contact: <firstname.lastname@example.org>
10204:33619|xkl-my|2017.01.01
3491:1007|mia-us,US,FL,Miami,-80.29060364,25.79319954|2017.01.01|Miami|NOTA:IXP

Each entry contains the BGP community, the geographic encoding, and the date when it was collected. The first record above is an example of such minimum information: the community 10204:33619 refers to an unknown geographic location xkl-my in Malaysia. Other fields are optional. When possible, the geographic encoding may include the country, region, city, latitude, and longitude. Some records may also include additional (human-readable) location information in Columns 4 and 5, such as the city, facility, etc. In the second example above, the community 3491:1007 maps to the IXP NOTA in the city of Miami, Florida, USA.
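As an illustration of how such records might be read, here is a minimal parsing sketch. It makes several assumptions: the column separator defaults to "|" as in the sample records above (pass sep="\t" for tab-separated files), no field contains a comma or the separator, and the two coordinates are kept in file order without being labeled as latitude or longitude. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CommunityRecord:
    community: str                         # column 1, e.g. "3491:1007"
    location_code: str                     # first item of column 2
    country: Optional[str]
    region: Optional[str]
    city: Optional[str]
    coords: Optional[Tuple[float, float]]  # the two coordinates, in file order
    found: str                             # column 3, YYYY.MM.DD
    description: Optional[str]             # column 4, if present
    facility: Optional[str]                # column 5, if present


def parse_line(line: str, sep: str = "|") -> Optional[CommunityRecord]:
    """Parse one dictionary line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    cols = line.split(sep)
    if len(cols) < 3:
        raise ValueError(f"expected at least 3 columns: {line!r}")
    # Column 2 is comma-separated; pad missing optional items with None.
    loc: List[Optional[str]] = list(cols[1].split(","))
    loc += [None] * (6 - len(loc))
    coords = None
    if loc[4] and loc[5]:
        coords = (float(loc[4]), float(loc[5]))
    return CommunityRecord(
        community=cols[0],
        location_code=loc[0],
        country=loc[1] or None,
        region=loc[2] or None,
        city=loc[3] or None,
        coords=coords,
        found=cols[2],
        description=cols[3] if len(cols) > 3 and cols[3] else None,
        facility=cols[4] if len(cols) > 4 and cols[4] else None,
    )


if __name__ == "__main__":
    print(parse_line("10204:33619|xkl-my|2017.01.01"))
    print(parse_line(
        "3491:1007|mia-us,US,FL,Miami,-80.29060364,25.79319954|2017.01.01|Miami|NOTA:IXP"
    ))
```

Keeping the optional items as None rather than empty strings makes it straightforward to distinguish minimal records, such as the first sample, from fully annotated ones.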
Acceptable Use Agreement
Access to these data is subject to the terms of the following CAIDA Acceptable Use Agreement.
When referencing this data (as required by the AUA), please use:
The CAIDA UCSD BGP Community Dictionary - <dates used>,
You are required to report your publications using this dataset to CAIDA.
Request Data Access
- Request Access to the CAIDA Interconnection Datasets.