The Center for Applied Internet Data Analysis
Goals

CAIDA has facilitated research using data collected by the UCSD Network Telescope since 2001, but there has been no common framework for conducting analysis. Each researcher had to first write code to process the compressed hourly pcap files. Given the complexity and scale of the data, this is not a trivial task.

Corsaro has been designed to help with two problems related to Network Telescope research: data capture and analysis. Because of the volume of traffic received by network telescopes, a capture and analysis tool should minimize the amount of storage used by the data it generates. It must also sustain a high throughput rate so that it can either be used directly on a live interface, or at least keep up with processing of trace files as they are captured by another method. Additionally, Corsaro has been designed to allow researchers to easily develop and test new analysis techniques.

Compression

Corsaro is an interval-based trace processing tool. It is designed to allow plugins to perform analysis on a packet-by-packet basis, the results of which are saved at the end of an interval. This allows plugins the opportunity to heavily aggregate the input data.
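The interval model described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Corsaro's implementation: the function name `aggregate_by_interval`, the `(timestamp, key)` packet representation, and the 60-second interval are all assumptions made for the example.

```python
from collections import Counter


def aggregate_by_interval(packets, interval_len):
    """Yield (interval_start, Counter) pairs from time-ordered packets.

    `packets` is an iterable of (timestamp, key) pairs; both the input
    shape and the counting logic are assumptions for this sketch.
    """
    current_start = None
    counts = Counter()
    for ts, key in packets:
        start = ts - (ts % interval_len)
        if current_start is None:
            current_start = start
        elif start != current_start:
            # Interval boundary: emit the aggregated result, then reset.
            yield current_start, counts
            current_start, counts = start, Counter()
        counts[key] += 1  # the per-packet work
    if current_start is not None:
        yield current_start, counts  # flush the final interval


packets = [(0, "10.0.0.1"), (10, "10.0.0.1"), (59, "10.0.0.2"), (61, "10.0.0.1")]
results = list(aggregate_by_interval(packets, 60))
```

Four input packets become two interval records here. Intervals that contain no packets are simply skipped in this sketch.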

Data generated by trace analysis, even when aggregated to intervals, is often repetitive, which lends itself to being compressed by off-the-shelf compression algorithms such as gzip. To this end, Corsaro leverages the libwandio library (a part of the libtrace library) to provide IO APIs to plugins which transparently handle compression/decompression of files.

In addition to aggregation and byte compression, the Core Plugins are carefully designed so that the output they create is as efficient as possible. The FlowTuple plugin, for example, not only uses a custom, packed binary format, but also sorts its records using a field ordering scheme that has been empirically shown to allow gzip to better compress the data.
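As a rough illustration of combining a packed binary layout, sorted records, and gzip, consider the following Python sketch. The record layout (`RECORD`) is hypothetical and is not the real FlowTuple format. The ordering used is plain lexicographic sorting, not the empirically derived field ordering mentioned above.

```python
import gzip
import struct

# Hypothetical record layout (NOT the real FlowTuple format):
# source IP, destination IP, destination port, packet count,
# all in network byte order with no padding.
RECORD = struct.Struct("!IIHI")


def encode(records):
    """Sort records, pack them into fixed-width binary, and gzip the result."""
    body = b"".join(RECORD.pack(*r) for r in sorted(records))
    return gzip.compress(body)


def decode(blob):
    """Inverse of encode(): return the sorted list of record tuples."""
    body = gzip.decompress(blob)
    return [RECORD.unpack_from(body, off) for off in range(0, len(body), RECORD.size)]


records = [
    (0x0A000002, 0x2C000001, 445, 3),
    (0x0A000001, 0x2C000001, 445, 1),
    (0x0A000001, 0x2C000002, 23, 7),
]
blob = encode(records)
```

Sorting places records with identical leading fields next to each other, which tends to give a dictionary-based compressor longer repeated byte runs to work with. How much this helps depends on the data.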

Speed

Internally, Corsaro uses the libtrace trace processing library. Libtrace is designed explicitly with speed in mind [1]. It makes use of zero-copy behavior to minimize unnecessary copying of data in memory, threaded I/O to allow compression and decompression to be off-loaded onto a different CPU, and caching of header locations and length fields in each packet.

Like libtrace, Corsaro also leverages the libwandio I/O library to provide transparent, threaded I/O to plugins. In this way, each file that is written to by a plugin has its own dedicated thread for doing any needed compression and writing to disk. For example, running Corsaro with only the FlowTuple plugin active will use a total of three threads: one for reading the trace data, one for processing the packets, and one for writing the FlowTuple output to disk.
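The one-writer-thread-per-output-file arrangement can be sketched as follows. This is a Python illustration of the general pattern only. It does not use libwandio, and the class name `ThreadedGzipWriter`, the queue size, and the file name are assumptions made for the example.

```python
import gzip
import os
import queue
import tempfile
import threading


class ThreadedGzipWriter:
    """Hand compression and disk writes to a dedicated background thread."""

    def __init__(self, path):
        self._queue = queue.Queue(maxsize=64)  # bounded, so a slow disk applies back-pressure
        self._file = gzip.open(path, "wb")
        self._thread = threading.Thread(target=self._drain)
        self._thread.start()

    def _drain(self):
        # Runs in the writer thread: compress and write until the sentinel arrives.
        while True:
            chunk = self._queue.get()
            if chunk is None:
                break
            self._file.write(chunk)

    def write(self, data):
        # Called from the packet-processing thread; returns once the data is queued.
        self._queue.put(data)

    def close(self):
        self._queue.put(None)  # sentinel: no more data
        self._thread.join()
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "output.gz")
writer = ThreadedGzipWriter(path)
for i in range(3):
    writer.write(f"record {i}\n".encode())
writer.close()

with gzip.open(path, "rb") as fh:
    contents = fh.read()
```

The processing thread only pays the cost of a queue insertion per write. The compression itself happens on the writer thread.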

In addition to threaded I/O, Corsaro implements a technique called plugin chaining, which allows plugins to pass knowledge gained about a packet on to successive plugins. This not only makes development of new plugins potentially simpler and quicker, it also reduces the amount of re-work that plugins must do. For example, there are plugins planned which will augment each packet with the corresponding ASN and geographic information based on the source address, thus allowing other plugins to leverage this information for further analysis.
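A minimal sketch of the chaining idea, in Python, is shown below. The class names (`PacketState`, `GeoTagger`, `CountryCounter`), the `process_packet` method, and the lookup table are all hypothetical and do not reflect Corsaro's actual plugin API. The point is only that an earlier plugin annotates per-packet state that later plugins can read.

```python
class PacketState:
    """Per-packet scratch space shared along the plugin chain."""

    def __init__(self):
        self.tags = {}


class GeoTagger:
    """Hypothetical annotating plugin: looks up a country for the source address."""

    def __init__(self, table):
        self.table = table

    def process_packet(self, packet, state):
        state.tags["country"] = self.table.get(packet["src"], "??")


class CountryCounter:
    """Downstream plugin that reuses the tag instead of repeating the lookup."""

    def __init__(self):
        self.counts = {}

    def process_packet(self, packet, state):
        country = state.tags.get("country", "??")
        self.counts[country] = self.counts.get(country, 0) + 1


def run_chain(plugins, packets):
    for packet in packets:
        state = PacketState()  # fresh state for every packet
        for plugin in plugins:  # plugins run in chain order
            plugin.process_packet(packet, state)


counter = CountryCounter()
run_chain(
    [GeoTagger({"192.0.2.1": "NZ", "198.51.100.7": "US"}), counter],
    [
        {"src": "192.0.2.1"},
        {"src": "198.51.100.7"},
        {"src": "192.0.2.1"},
        {"src": "203.0.113.9"},
    ],
)
```

After the run, `counter.counts` holds per-country totals even though `CountryCounter` never performed a lookup itself.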

Usability

Corsaro has been designed to allow researchers to more easily perform research using darknet data. To this end, Corsaro has a modular design which enables analysis plugins to be created and used with a minimum of effort. Implementing trace analysis within Corsaro allows researchers to focus simply on the task of analyzing the packets: Corsaro takes care of opening the trace file, delivering interval notifications, and providing high-level I/O functionality.

In addition to the features that Corsaro directly provides to plugins, libtrace provides high-level packet API functions which handle protocol decoding. For example, rather than having to write code to search for the TCP header, a plugin may just use the trace_get_tcp function. These functions not only reduce the amount of (re)work needed to write analysis code, but they also provide well-tested handling of edge cases, such as incomplete packet headers, that could otherwise produce incorrect analysis results [1].
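To give a sense of the edge cases that hand-written header parsing has to cover, here is a small Python sketch that extracts TCP ports from a raw IPv4 packet. It is not libtrace code and makes no claim about how trace_get_tcp is implemented. The function name `get_tcp_ports` and the set of checks are assumptions chosen for illustration.

```python
import struct


def get_tcp_ports(ip_packet):
    """Return (src_port, dst_port) for an IPv4/TCP packet, or None.

    None covers every case where the ports cannot be read safely.
    """
    if len(ip_packet) < 20:
        return None  # truncated IPv4 header
    version_ihl = ip_packet[0]
    if version_ihl >> 4 != 4:
        return None  # not IPv4
    ihl = (version_ihl & 0x0F) * 4  # header length in bytes
    if ihl < 20 or len(ip_packet) < ihl:
        return None  # invalid length field, or options truncated
    if ip_packet[9] != 6:
        return None  # protocol is not TCP
    frag = struct.unpack_from("!H", ip_packet, 6)[0]
    if frag & 0x1FFF:
        return None  # non-first fragment: carries no TCP header
    if len(ip_packet) < ihl + 4:
        return None  # capture ends before the port fields
    return struct.unpack_from("!HH", ip_packet, ihl)


# Build a minimal 20-byte IPv4 header followed by a 20-byte TCP header.
ip_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 40, 0, 0, 64, 6, 0,
    bytes([192, 0, 2, 1]), bytes([198, 51, 100, 7]),
)
tcp_header = struct.pack("!HH", 12345, 445) + bytes(16)
full = ip_header + tcp_header

ports = get_tcp_ports(full)
truncated = get_tcp_ports(full[:22])  # capture cut off inside the TCP header
```

Skipping any one of these checks either raises an error on short packets or silently reads the wrong bytes as port numbers.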

Corsaro also provides several tools which aid with exploratory analysis of both raw trace data, and Corsaro aggregated data. For example, cors-ft-aggregate allows FlowTuple data to be reaggregated using different fields and over different time intervals. See the Tools page for more information.
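Re-aggregation of already aggregated records can be sketched as below. This Python example is not the cors-ft-aggregate implementation and does not read real FlowTuple files. The row layout (dicts with `time`, `packets`, and arbitrary field keys) and the function name `reaggregate` are assumptions.

```python
from collections import defaultdict


def reaggregate(rows, fields, interval_len):
    """Sum packet counts over coarser intervals, grouped by the chosen fields."""
    totals = defaultdict(int)
    for row in rows:
        start = row["time"] - row["time"] % interval_len
        key = (start,) + tuple(row[f] for f in fields)
        totals[key] += row["packets"]
    return dict(totals)


# Hypothetical per-minute records keyed by source address and destination port.
rows = [
    {"time": 0, "src": "192.0.2.1", "dport": 445, "packets": 3},
    {"time": 60, "src": "192.0.2.1", "dport": 23, "packets": 2},
    {"time": 60, "src": "198.51.100.7", "dport": 445, "packets": 5},
    {"time": 3600, "src": "192.0.2.1", "dport": 445, "packets": 1},
]

by_port_hourly = reaggregate(rows, ["dport"], 3600)
```

Here the per-minute records are collapsed to hourly totals per destination port, discarding the source address.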

If researchers have existing trace data, Corsaro can easily be used to generate aggregated data using either the Core Plugins, or a specialized analysis plugin. The Corsaro tool is capable of processing a wide variety of trace formats, including capturing packets from a live interface, and can easily be run from the command line with a minimum of configuration.

Portability

Corsaro uses the GNU Build System (autoconf, automake, etc.) to manage configuration, compilation and installation. This allows Corsaro to be easily built on a wide variety of platforms. It has been tested on FreeBSD, GNU/Linux, MacOSX and Solaris X.

Extensibility

As mentioned in the Usability section, trace analysis logic within Corsaro is separated into a set of plugins. This allows Corsaro to be easily extended to provide new analysis functionality. For information about creating a new plugin, see the Tutorials section of this manual.

In addition to extending Corsaro by creating new plugins for trace analysis, Corsaro can also be used as a library within another application. This allows Corsaro to be driven with packets that another tool captures. In fact, the corsaro tool is simply a light wrapper around the libcorsaro library. For example, Corsaro has been used from within the IATmon tool. This reduces the overhead of reading trace data from disk: each packet is read once, and passed to Corsaro when IATmon finishes with it.

The libcorsaro library can also be used to write software which processes existing Corsaro data, such as the FlowTuple output. This allows researchers to write efficient code to perform further analysis on Corsaro data. For example, the cors-ft-aggregate tool uses libcorsaro to read and reaggregate FlowTuple data.

Reliability

Corsaro has been used extensively by CAIDA for wholesale analysis of the historical data archive for the UCSD Network Telescope. To date, Corsaro has successfully processed over 35,000 hours of pcap data, generating close to 20 TiB of compressed metadata.

The latest version of Corsaro also fully supports capturing data from a live interface, allowing reports on darknet data to be generated in real time. See the corsaro tools documentation for more information.