CAIDA Data Usage Frequently Asked Questions
Suggestions for this FAQ? Please send them to firstname.lastname@example.org. We appreciate your feedback.
An organization can be an Internet Service Provider (ISP), but it can also be another kind of entity, such as "1st Financial Bank." A many-to-many relationship can exist between ASes (autonomous systems) and organizations. CAIDA primarily uses WHOIS information available from Regional and National Internet Registries to infer a mapping from AS numbers to the organizational entities that operate them.
Pcap is a file format used to capture network traffic. It is the format used by the tcpdump program.
A non-exhaustive list of applications that can read pcap files:
- Wireshark (fork of the Ethereal project)
If you want to write your own programs to read pcap files, the following libraries are available:
- libpcap (C)
- CoralReef (C and Perl)
Why are some files compressed using LZO compression instead of the more common DEFLATE (zip, gzip) compression?
For datasets, both the compression ratio and the speed of decompression are key factors in usability. For some large datasets we have chosen LZO over DEFLATE because it decompresses faster. LZO-compressed files can be decompressed using lzop.
For traces where privacy plays an important role (like our high-speed Internet traces taken on a link of a Tier 1 ISP), we use Crypto-PAn anonymization. This anonymization is prefix-preserving, so if two original IP addresses share a k-bit prefix, their anonymized mappings will also share a k-bit prefix. Each dataset is anonymized using a single key, so each original IP address is mapped to the same anonymized IP address across the whole dataset. Only IP addresses are anonymized; all other packet header fields (e.g., port numbers) retain their original value.
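The prefix-preserving property can be illustrated with a toy mapping. The sketch below is not the Crypto-PAn cipher: it swaps in HMAC-SHA256 as the keyed pseudorandom function, and the names `anonymize_ipv4` and `shared_prefix_len` are hypothetical. It only shows why deriving each output bit from the preceding original bits keeps shared prefixes intact.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ipv4(address: str, key: bytes) -> str:
    """Toy prefix-preserving mapping (an illustration, NOT real Crypto-PAn)."""
    original = int(ipaddress.IPv4Address(address))
    result = 0
    for i in range(32):
        # The first i bits of the original address decide whether bit i flips.
        prefix = original >> (32 - i)
        digest = hmac.new(key, f"{i}:{prefix}".encode(), hashlib.sha256).digest()
        flip = digest[0] >> 7  # one pseudorandom bit derived from key and prefix
        bit = (original >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading bits that two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()
```

Two addresses that agree on their first k bits see the same flip decisions for those bits, so their anonymized forms agree on the first k bits as well. Using the same key throughout gives the same mapping for every occurrence of an address.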
How can I extract header info such as Source/Dest IP address and/or TCP ports for each packet in the pcap traces?
Use tcpdump or crl_print_pkt (part of CoralReef) to write the packets in ASCII form to standard output or to a file. Then write your own script in your favorite language (e.g., C, Perl) to extract the desired information.
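As a rough sketch of such a script, the Python below pulls addresses and ports out of tcpdump's text output. It assumes numeric IPv4 output lines of the form `... IP src.sport > dst.dport: ...`, as printed with tcpdump's `-n` option; the exact layout varies with tcpdump version and protocol, so treat the pattern as a starting point. The names `LINE_RE` and `parse_line` are hypothetical.

```python
import re

# Matches e.g. "IP 10.1.2.3.443 > 10.4.5.6.51234:"; ports are optional (ICMP has none).
LINE_RE = re.compile(
    r"\bIP (?P<src>\d+\.\d+\.\d+\.\d+)(?:\.(?P<sport>\d+))? > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)(?:\.(?P<dport>\d+))?:"
)


def parse_line(line: str):
    """Return (src_ip, src_port, dst_ip, dst_port), or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    sport = int(match["sport"]) if match["sport"] else None
    dport = int(match["dport"]) if match["dport"] else None
    return match["src"], sport, match["dst"], dport
```

One way to use it is to pipe tcpdump's output into a script that calls `parse_line` on each line of standard input and skips lines for which it returns `None`.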
You could use a web browser to download individual files from our data servers, but this is tedious for more than a couple of files, and some browsers have problems with large files. A number of tools can automate downloading data files from a web server. A few open source tools that help with non-interactive downloads are:
Typical command-line options for downloading all files in http://www.example.com/data using wget are:
wget -np -m http://www.example.com/data
Explanation of options:
-m : mirroring, i.e., make a copy of the directory structure and files
-np : don't ascend to the parent directory
It is not advisable to use older versions of wget (before 1.10) to download CAIDA data, due to the lack of large file support.
To use wget (version 1.10.2) to download from https://www.example.com/data for user email@example.com with password c00lbean$:
- First create a .wgetrc file in your home directory, make sure it is readable only by you (e.g., on UNIX run chmod 600 .wgetrc), and edit the file to contain the following line:
http-password = c00lbean$
It is advisable to remove the .wgetrc that contains your password after you have finished downloading.
It is also possible to specify the password on the command-line with the --http-password=c00lbean$ command-line option. This would allow others to see your password if they check the process table on your computer, so this is not advisable on multi-user systems.
- Next you can start the actual download using the command:
wget -np -m --http-user=email@example.com --no-check-certificate https://www.example.com/data
You'll need the --no-check-certificate option on some CAIDA servers where self-signed SSL certificates are used.
Yes, wget has the -c option that allows you to continue downloading partially-downloaded files. This works together with the mirroring option -m, with the exception of directory listings that will be refetched when you restart a mirroring download.
Most large data files CAIDA hosts have their MD5 checksum stored in a file md5.md5 in the same directory as the data file. This checksum can be used to verify your data download. An md5check.pl script is provided at http://data.caida.org/scripts/ for checking MD5 checksums of downloaded data files against the md5.md5 file (this requires Perl with the Digest::MD5 module installed, which is included in the core Perl distribution since Perl 5.8.0). Some other utilities to compute MD5 checksums are: md5 (installed on most UNIX versions), openssl, and FastSum (Windows).
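If none of those tools is at hand, a checksum can also be computed with a few lines of Python. This is a minimal sketch using only the standard library; the function names are hypothetical, and because the layout of md5.md5 is not described here, the expected checksum is passed in as a string rather than parsed from that file.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 digest with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks matters for large trace files, which may not fit in memory at once.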
For our restricted datasets we require that the data not be distributed beyond authorized users. This includes preventing unauthorized users on a multi-user system from accessing your data. This can easily be accomplished by changing the access permissions on the data files. For example, on UNIX systems you can restrict permissions to read and write access for the owner only with the command:
chmod 600 <datafile>
Alternatively, you can use the umask command to restrict default file permissions to owner-only access before you start downloading:
umask 077 ; wget -np -m --http-user=email@example.com --no-check-certificate https://www.example.com/data
Also make sure the data is not inadvertently shared over a network resource that is accessible through the Internet (for example, a file share, an FTP server, or a web server), since that violates the Acceptable Use Policy (AUP) of our datasets.
Why is the display of data from the UCSD network telescope realtime monitor delayed by approximately a day?
The display of data from the UCSD network telescope realtime monitor is delayed for security reasons. The delay makes it harder to find evidence of attacks in the data that is captured by the telescope. We do this to minimize the risk of evidence of attacks being used in illegal activities, like extortion.
CAIDA maintains a mailing list to which all announcements about new datasets are sent. If you wish to receive news about CAIDA datasets, subscribe to firstname.lastname@example.org by visiting https://rommie.caida.org/mailman/listinfo/data-announce.
- Give us feedback (email@example.com) on what datasets were useful to you and how you used them.
- Report publications (papers, web pages, class projects, presentations, etc.) using our datasets to us. This information allows CAIDA to justify the time and effort spent on data provisioning to our funding agencies. We maintain a list of publications using CAIDA data, based on feedback from our data users.
- Join CAIDA. Available options include donations of software or equipment, financial contributions to support CAIDA as an organization, and direct sponsorship of a specific project. For more information, see Joining CAIDA and CAIDA Sponsorship Information.
In principle, you should be able to download data at speeds of at least 1 Mbps. Occasionally, however, so many users download at the same time that the resulting load on our data servers reduces download speeds by an order of magnitude or more.
The output of sc_analysis_dump for an Ark warts file identifies paths as incomplete (I) or complete (C) in the 13th column. Most paths are listed as "incomplete". What does this mean?
An Ark trace is marked "incomplete" if a router in the path or the destination does not respond. Some routers do not respond to traceroute with an ICMP time-exceeded message, or they rate limit their responses. The choice of destinations can also lead to a high rate of "incomplete" paths. For example, randomly chosen destinations have a lower probability of responding, since the destination IP address may refer to a host that is not present. The distinction between "complete" and "incomplete" is purely technical, and "incomplete" paths are not less useful than "complete" paths.
The public CAIDA "IPv4 Routed /24 AS Links" dataset provides different and independent AS maps for each measurement cycle and each team individually. Is there a way to combine multiple AS links files into a merged AS topology map?
AS links files from the IPv4 Routed /24 AS Links dataset cannot simply be concatenated. However, there is a Ruby script, merge-warts-aslinks, provided at http://data.caida.org/scripts, that can be used to combine any number of AS links files (including compressed files) from filenames listed on the command line.
RIPE Atlas, like Ark, is a distributed measurement system that does ping and traceroute measurements. How do Ark and Atlas differ?
- RIPE Atlas allows almost anyone to conduct measurements as long as they have credit (by hosting a probe). Access to our on-demand Ark measurements is currently restricted to academic researchers, but no credit is required.
- Ark conducts systematic, large-scale ongoing measurements of the global Internet with the goal of obtaining a broad baseline view of the Internet and its change/evolution over long periods of time. Because of this focus on global coverage, our measurements are not as focused on satisfying immediate operational troubleshooting needs (e.g., why is my network not reachable right now?).
- Ark probes are relatively powerful systems (up to 1 GHz quad-core ARM processors with 1 GB of RAM and 8 GB of flash) running a full Linux distribution. They are used to conduct many other kinds of measurements not currently feasible on RIPE Atlas nodes; e.g., studying congestion at interdomain peering links, the degree and type of header alteration by middleboxes, and the degree of filtering of packets with spoofed source addresses. Researchers can run their software on Ark nodes, which Atlas doesn't allow for policy and technical reasons; for example, we conduct large-scale timing-sensitive alias resolution runs using dozens of probes in concert.
Does "application tuples/second" represent the number of concurrent flows being tracked, the number of unique tuples seen during that second, or the number of new tuples seen during that second?
It is the number of unique tuples seen in an interval, divided by the interval length (default 5 minutes).
The sampling time varies with the length of the time series displayed. For the shortest time series (~1 day) the sampling time is 5 minutes. This means 5 minutes of data are accumulated in counters (e.g., number of unique flows). At the end of the 5 minutes, average rates (e.g., flows/s) over the sampling interval are calculated by dividing the final counter value by the sampling time. For the ~1 week-long, ~1 month-long, and ~1 year-long time series the sampling times are 30 minutes, 2 hours, and 24 hours, respectively. Data collection is done with rrdtool; consult its documentation if you need more detailed information.
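The arithmetic described above can be sketched as follows. The sampling times come from the text; the dictionary keys, the function name, and the counter value in the usage note are illustrative assumptions, not values taken from the monitors.

```python
# Sampling time in seconds for each displayed time-series length (from the text above).
SAMPLING_SECONDS = {
    "day": 5 * 60,
    "week": 30 * 60,
    "month": 2 * 60 * 60,
    "year": 24 * 60 * 60,
}


def average_rate(counter_value: int, series: str) -> float:
    """Average per-second rate: final counter value divided by the sampling time."""
    return counter_value / SAMPLING_SECONDS[series]
```

For instance, a hypothetical counter of 90,000 unique flows accumulated over one 5-minute sample in the ~1 day series gives `average_rate(90_000, "day")`, i.e. 90,000 / 300 flows/s.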
Why is the packet rate for the San Jose monitors lower in the first half of 2012 as compared to 2011, slowly recovering to 2011 levels by the end of 2012?
Toward the end of 2011 the number of 10G links between San Jose and LA was expanded from 4 to 6. This would be expected to result in a (temporary) decrease in load on each of the individual links. Since we are tapping only a single link, we observe this decrease.
We have collected data on the UCSD Network Telescope since 2003. Data collection on the telescope was patchy between 2003 and 2008.
Similar darknets do exist elsewhere on the Internet and have for similar lengths of time. You might check the catalog of the DHS IMPACT project, previously called PREDICT, for other available darknet data.