How to Document a Data Collection

The attention and effort that go into collecting data can make it seem as though the details will be remembered indefinitely, but the continuous demands of research cause even the most significant details to be forgotten over time. This page contains guidelines for documenting Internet data collections so that the necessary information is preserved to support scientifically rigorous, reproducible research.

Why document data collection?

A clear description of the methodology used to collect data is the cornerstone of any research that will be done on that data. Seemingly minor methodological details can seriously influence (or invalidate) any analysis that is subsequently performed on the data. Information on how to collect Internet data can be found in Vern Paxson's "Strategies for Sound Internet Measurement" paper.

This page contains CAIDA's recommendations for metadata that should be recorded and kept alongside Internet data collections. Wireless data collections can be listed in CRAWDAD, a Community Resource for Archiving Wireless Data At Dartmouth.

Metadata

In the list below, very important fields are marked with "***", and important fields are marked with "*". We recommend that information for all applicable fields be recorded at the time that a data collection is performed.

Metadata that must be recorded at collection time

***name

Each data file should have a name that is unique across the collection project. It is often useful for the filename to include the server name, date, and time. We recommend choosing one consistent time zone for all times recorded for a collection project, even if data is collected in multiple time zones. This removes the need to remember which time zone a given date belongs to or whether Daylight Saving Time was in effect. We recommend Coordinated Universal Time (UTC) for data collection timestamps, especially for projects that collect data in more than one time zone.
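As a minimal sketch of this naming advice, the Python function below builds a filename from a server name and a UTC start time. The function name collection_filename, the server name "monitor1", and the layout "server.YYYYMMDD-HHMMSS.UTC.extension" are illustrative assumptions, not a CAIDA convention.

```python
from datetime import datetime, timedelta, timezone


def collection_filename(server: str, start: datetime, extension: str = "pcap") -> str:
    """Build a sortable file name from a server name and the UTC start time.

    The layout used here is a hypothetical example, not a prescribed scheme.
    """
    if start.tzinfo is None:
        # A naive datetime carries no time zone, so it cannot be converted safely.
        raise ValueError("start must be timezone-aware so it can be converted to UTC")
    utc_start = start.astimezone(timezone.utc)
    stamp = utc_start.strftime("%Y%m%d-%H%M%S")
    return f"{server}.{stamp}.UTC.{extension}"


# Example: a collection started at 01:30 local time in a UTC-8 zone.
local_start = datetime(2024, 3, 10, 1, 30, 0, tzinfo=timezone(timedelta(hours=-8)))
print(collection_filename("monitor1", local_start))
# prints "monitor1.20240310-093000.UTC.pcap"
```

Putting the most significant date component first keeps files in chronological order when they are sorted by name.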

***format

The format of the collected data (e.g. "pcap" for data collected with tcpdump).

***start time

The time data collection began. We recommend using the UTC time zone for all data collection timestamps.

***end time

The time data collection ended. If the collection is of long duration or is ongoing, we recommend recording the planned duration and/or the reasons the collection will continue over time. We recommend using the UTC time zone for all data collection timestamps.
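One way to record start and end times unambiguously is to convert them to UTC and write them in ISO 8601 form. The sketch below assumes timezone-aware Python datetimes; the function names and the dictionary keys (start_time, end_time, duration_seconds) are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def utc_iso(moment: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC string."""
    if moment.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def collection_window(start: datetime, end: datetime) -> dict:
    """Return start time, end time, and duration for a finished collection."""
    start_text = utc_iso(start)  # also validates that start is timezone-aware
    end_text = utc_iso(end)
    if end < start:
        raise ValueError("end time precedes start time")
    return {
        "start_time": start_text,
        "end_time": end_text,
        "duration_seconds": (end - start).total_seconds(),
    }


# Example: a one-hour collection started at 23:00 in a UTC-5 zone.
begin = datetime(2024, 1, 1, 23, 0, 0, tzinfo=timezone(timedelta(hours=-5)))
print(collection_window(begin, begin + timedelta(hours=1)))
```

In this example the local date is 1 January while the UTC date is 2 January, which is the kind of ambiguity that recording everything in UTC avoids.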

***creation process

Describe as specifically as possible how the data was collected. For example, "The first 54 bytes of all IPv4 packets traversing the OC48 link between the University of Freedonia and the Internet" will be much more useful than "A passive trace from our network". The creation process should include the specific tools and parameters (including version information) used to collect the data. If a tool is proprietary or not widely used, it can be useful to include a brief description of what the tool does. Other information that we recommend documenting includes:

sampling, e.g. Cisco NetFlow sampling at 1 in 100 packets
aggregation, e.g. flow data aggregated using source IP, destination IP, source port, destination port, and protocol
anonymization, e.g. prefix-preserving anonymization using the CryptoPAN library
time synchronization procedure (if any), e.g. ran ntpdate -q before beginning collection
known limitations of the collection system, e.g. rate-limited to 100Mbps

For passive trace collection, we recommend recording the following parameters:

any packet filter, e.g. "udp port 53"
length of captured packets (snapshot length), e.g. the -s option of tcpdump
description of algorithm used to anonymize IP addresses, if any
a description of any known limitations of the data capture system, such as the maximum packet size that could be collected
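The passive trace parameters above can be captured at the moment the command line is built, so that the documentation cannot drift from what was actually run. The sketch below is one possible approach: it assembles a tcpdump argument list using the -i, -s, and -w options and stores the same values in a dictionary. The function name, the dictionary keys, the interface name "eth0", and the file name are hypothetical, and the tool version is passed in by the caller rather than detected.

```python
import shlex


def describe_capture(interface: str, output_file: str, snaplen: int,
                     packet_filter: str = "", tool_version: str = "unknown") -> dict:
    """Build a tcpdump command line and a matching metadata record."""
    argv = ["tcpdump", "-i", interface, "-s", str(snaplen), "-w", output_file]
    if packet_filter:
        # tcpdump takes the filter expression as its trailing argument.
        argv.append(packet_filter)
    return {
        "tool": "tcpdump",
        "tool_version": tool_version,        # supplied by the caller
        "packet_filter": packet_filter,
        "snapshot_length": snaplen,
        "argv": argv,                        # suitable for subprocess.run(argv)
        "command_line": shlex.join(argv),    # human-readable, shell-quoted form
    }


record = describe_capture("eth0", "trace.pcap", 54, "udp port 53")
print(record["command_line"])
# prints "tcpdump -i eth0 -s 54 -w trace.pcap 'udp port 53'"
```

Keeping both the argument list and the quoted command line means the record can be read by a person and also replayed by a script.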

For data derived from existing data collections, we recommend documenting:

what sources were used and/or combined to create the new data
contact information for the creators of the original data sources
details of the process of data derivation, including which tools and what configuration parameters were used

***creators

A list of names, affiliations, and contact information (at least email addresses) for the people responsible for collecting the data.

***primary contact

The name and email address of a person or team that can answer questions about the data.

*description of any known corruption or anomalies during collection

*platform

The hardware, software, and operating system used to collect the data.

*time zone

Although we recommend that all data collection timestamps be recorded in UTC, it can still be useful to researchers to know the local time zone in which the data was collected.

*geographic location

The geographic source of the data collection, in terms of continent, country, state, province, city, etc.

*network location

Where on the network the data was collected, in terms of hostname, IP address, AS, etc.

*logistic location

The source of the data from an organizational viewpoint, e.g. "X-root DNS server" or "University of Freedonia off-campus link"

clock parameters

accuracy (difference from true UTC)
resolution (smallest measurable time unit)
synchronization method (how was the clock synchronized?)
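The collection-time fields above could be kept in a small machine-readable file stored next to the data. The sketch below assumes a JSON sidecar file and uses hypothetical key names that mirror the field names on this page; it reports which "***" and "*" fields are still empty and refuses to serialize a record that lacks any "***" field.

```python
import json

# Hypothetical key names mirroring the "***" fields recorded at collection time.
VERY_IMPORTANT = ("name", "format", "start_time", "end_time",
                  "creation_process", "creators", "primary_contact")
# Hypothetical key names mirroring the "*" fields.
IMPORTANT = ("known_anomalies", "platform", "time_zone",
             "geographic_location", "network_location", "logistic_location")


def missing_fields(record: dict) -> dict:
    """List the fields that are absent or empty, grouped by importance."""
    return {
        "very_important": [key for key in VERY_IMPORTANT if not record.get(key)],
        "important": [key for key in IMPORTANT if not record.get(key)],
    }


def to_sidecar_json(record: dict) -> str:
    """Serialize the record, rejecting it if any very important field is missing."""
    missing = missing_fields(record)["very_important"]
    if missing:
        raise ValueError("missing very important fields: " + ", ".join(missing))
    return json.dumps(record, indent=2, sort_keys=True)
```

Running missing_fields before a collection starts gives a checklist of what still needs to be written down while the details are fresh.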

Metadata that can be obtained by anyone by post-processing the data

***MD5 hash: Calculating an MD5 hash as soon as possible after data collection provides a baseline for future integrity checks after files are transferred, backed up, restored, etc.
***file size
counts of unique destination and source port numbers: for TCP and UDP
counts of bytes, packets, destination addresses, source addresses, and total addresses: for IPv4 and IPv6
DNS query count: for data containing DNS traffic
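The MD5 hash and file size can be computed with the Python standard library. The sketch below reads the file in chunks so that large traces do not have to fit in memory; the function name file_fingerprint, the chunk size, and the dictionary keys are assumptions made for illustration.

```python
import hashlib
import os


def file_fingerprint(path: str, chunk_size: int = 1 << 20) -> dict:
    """Return the MD5 hex digest and size in bytes of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {"md5": digest.hexdigest(), "file_size": os.path.getsize(path)}
```

Recomputing the fingerprint after a transfer or restore and comparing it with the stored values gives a quick integrity check.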

Related Objects

See https://catalog.caida.org/dataset to explore objects related to this document in the CAIDA Resource Catalog.