How to Document a Data Collection

While the attention and effort necessary for collecting data
make it seem like the details will be remembered long into
the future, the continuous demands of research cause even the
most significant of details to be forgotten over time.
This page contains guidelines for how to document Internet
data collections so that necessary information is preserved
to promote scientifically rigorous, reproducible research.

Why document data collection?
A clear description of the methodology used to collect data is
the cornerstone of any research that will be done on that data.
Seemingly minor methodological details can seriously influence
(or invalidate) any analysis that is subsequently performed on
the data. Information on how to collect Internet data can be
found in Vern Paxson's "Strategies for Sound Internet Measurement" paper.
This page contains CAIDA's recommendations for metadata that
should be recorded and kept alongside Internet data collections.
We encourage any researcher who collects data to list the data
in DatCat, the
Internet Measurement Data Catalog. Wireless data collections can
also be listed in CRAWDAD, a
Community Resource for Archiving Wireless Data At Dartmouth.
Metadata
In the list below, very important fields are marked with "***", and important fields are marked
with "*".
We recommend that information for all applicable fields be recorded
at the time that a data collection is performed.
-
Metadata that must be recorded at collection time
-
-
***name
- Each data file should have a name that is unique across
the collection project. It is often useful to have the
filename include the server name, date, and time. We
recommend choosing one consistent time zone for all times
recorded for a collection project, even if data is collected
in multiple time zones. This removes the need to remember
which time zone a given timestamp belongs to or whether
Daylight Saving Time was in effect. We recommend Coordinated
Universal Time (UTC) for all data collection timestamps,
especially when a project collects data in more than one
time zone.
-
***format
- The format of the collected data (e.g. "pcap" for data collected with tcpdump).
-
***start time
- The time data collection began. We recommend using the UTC time zone for all data collection timestamps.
-
***end time
- The time data collection ended. If the collection is long-running or ongoing, we recommend recording the planned duration and/or the reasons the collection will continue. We recommend using the UTC time zone for all data collection timestamps.
-
***creation process
- Describe as specifically as possible how the data was collected. For example, "The first 54 bytes of all IPv4 packets traversing the OC48 link between the University of Freedonia and the Internet" is much more useful than "A passive trace from our network." The creation process should name the specific tools used to collect the data, including their versions and parameters. If a tool is proprietary or not widely used, it can be useful to include a brief description of what the tool does. Other information that we recommend documenting includes:
- sampling, e.g. Cisco NetFlow sampling at 1 in 100 packets
- aggregation, e.g. flow data aggregated using source IP, destination IP, source port, destination port, and protocol
- anonymization, e.g. prefix-preserving anonymization using the CryptoPAN library
- time synchronization procedure (if any), e.g. ran ntpdate -q before beginning collection
- known limitations of the collection system, e.g. rate-limited to 100 Mbps
For passive trace collection, we recommend recording the following parameters:
- any packet filter, e.g. "udp port 53"
- length of captured packets, e.g. the -s option of tcpdump
- description of the algorithm used to anonymize IP addresses, if any
- a description of any known limitations of the data capture system, such as the maximum packet size that could be collected
For data derived from existing data collections, we recommend documenting:
- what sources were used and/or combined to create the new data
- contact information for the creators of the original data sources
- details of the process of data derivation, including which tools and what configuration parameters were used
-
***creators
- A list of names, affiliations, and contact information
(at least email addresses) for people responsible for
collecting the data
-
***primary contact
- The name and email address of a person or team that can answer
questions about the data
-
*description of any known corruption or anomalies during collection
-
*platform
- The hardware, software, and operating system used to collect the data
-
*time zone
- Although we recommend that all data collection timestamps be recorded in UTC, it can still be useful to researchers to know the local time zone of the collection site.
-
*geographic location
- The geographic source of the data collection, in terms of continent, country, state, province, city, etc.
-
*network location
- Where on the network the data was collected, in terms of hostname, IP address, AS, etc.
-
*logistic location
- The source of the data from an organizational viewpoint, e.g. "X-root DNS server" or "University of Freedonia off-campus link"
-
clock parameters
- accuracy (difference from true UTC)
- resolution (smallest measurable time unit)
- synchronization method (how was the clock synchronized?)
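
The collection-time fields listed above can be kept in a small machine-readable record stored next to each data file. The Python sketch below shows one possible way to do this. It is not a CAIDA tool: the function names, the file-name pattern, the JSON layout, and every sample value (the host name probe1, the example.org addresses, the placeholder tool version) are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def utc_stamp(moment):
    """Format a timezone-aware datetime as a compact UTC string."""
    if moment.tzinfo is None or moment.utcoffset() is None:
        # Naive datetimes are ambiguous, so refuse them rather than guess.
        raise ValueError("timestamp must carry time zone information")
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def data_file_name(server, start, extension):
    """Build a file name from the server name and the UTC start time."""
    return f"{server}.{utc_stamp(start)}.{extension}"


def collection_record(server, start, end, file_format, creation_process,
                      creators, primary_contact, local_time_zone):
    """Gather collection-time metadata into a plain dictionary."""
    return {
        "name": data_file_name(server, start, file_format),
        "format": file_format,
        "start_time": utc_stamp(start),
        "end_time": utc_stamp(end),
        "creation_process": creation_process,
        "creators": creators,
        "primary_contact": primary_contact,
        "time_zone": local_time_zone,
    }


if __name__ == "__main__":
    # All values below are made-up sample data.
    record = collection_record(
        server="probe1",
        start=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
        end=datetime(2024, 1, 2, 4, 4, 5, tzinfo=timezone.utc),
        file_format="pcap",
        creation_process={
            "description": "First 54 bytes of packets matching the filter",
            "tool": "tcpdump",
            "tool_version": "record the reported tool version here",
            "packet_filter": "udp port 53",
            "snap_length_bytes": 54,
        },
        creators=[{"name": "A. Researcher",
                   "affiliation": "University of Freedonia",
                   "email": "researcher@example.org"}],
        primary_contact={"name": "Data team", "email": "data@example.org"},
        local_time_zone="America/Los_Angeles",
    )
    print(json.dumps(record, indent=2))
```

Rejecting timestamps that carry no time zone is a deliberate choice in this sketch: it forces the conversion to UTC to be explicit instead of relying on the local settings of the collection host.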
-
Metadata that can be obtained by anyone by post-processing the data
-
-
***MD5 hash
- Calculating an MD5 hash as soon as possible after data collection provides a baseline for future integrity checks after files are transferred, backed up, restored, etc.
-
***file size
-
counts of unique destination and source port numbers
- for TCP and UDP
-
counts of bytes, packets, destination addresses, source addresses, and total addresses
- for IPv4 and IPv6
-
DNS query count
- for data containing DNS traffic
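
The MD5 hash and file size can be computed in a single pass over a data file and compared later against the recorded baseline. The following Python sketch illustrates the idea using only the standard library. The function names and the chunk size are assumptions chosen for this example, not part of any CAIDA tooling.

```python
import hashlib
import sys


def md5_and_size(path, chunk_size=1 << 20):
    """Return (hex MD5 digest, size in bytes) for the file at path."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so that large trace files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def matches_baseline(path, baseline_md5, baseline_size):
    """Check a file against the hash and size recorded at collection time."""
    digest, size = md5_and_size(path)
    return digest == baseline_md5 and size == baseline_size


if __name__ == "__main__":
    # Print digest, size, and name for each file given on the command line.
    for path in sys.argv[1:]:
        digest, size = md5_and_size(path)
        print(f"{digest}  {size}  {path}")
```

Running the same check after each transfer, backup, or restore shows whether the file still matches what was originally collected.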