Recommendations for future large-scale simultaneous DNS data collections

CAIDA appreciates the effort and resources required to conduct a simultaneous data collection at multiple DNS root server anycast instances. Based on our preliminary analysis of previously collected traces, we offer the following suggestions on how to optimize the research potential of future wide-area data collection experiments.

Recommendations for future collections

  • Researchers tend to want as much data as possible without the datasets getting too unwieldy. It would be ideal to capture all TCP and UDP port 53 data, both queries and responses, from all root instances and gTLDs. Although TCP generally accounts for a small fraction of DNS traffic, having more comprehensive data increases its value to the research community. Query responses, while requiring significantly more disk space, are necessary to answer questions related to DNSSEC workload and DNS response sizes.

  • All measurements should be synchronized to UTC time. This means that the machines where collection occurs must be synchronized with NTP (see the example check after this list). CAIDA will send timestamp probes to root server instances during data collection to test for clock skew.

  • We recommend that measurement schedules cover a 50-hour period (48 hours plus a leading and a trailing hour), preferably mid-week (Tuesday-Wednesday or Wednesday-Thursday). This approach is more likely to capture a continuous 48-hour period, i.e., two full days of data. Researchers will then be able to determine average daily traffic, observe diurnal patterns, and rule out the possibility that a single day is anomalous.

  • Analysis tasks are typically easier if collection files are split on time boundaries rather than on file-size boundaries. In previous years we preferred one-hour-long pcap files; however, we now recommend 10-minute periods, as hour-long traces have grown too large.

  • Additionally, pcap files should start and stop on those time-based boundaries. For example, a 10-minute pcap file should start at 12:00:00 and end at 12:09:59. ISC's dnscap tool will do this automatically.

  • Do not collect responses if local disk and/or bandwidth resources are limited. In the past, we have had a hard time getting data from some collection sites because they either ran out of local disk space or did not have sufficient bandwidth to transfer the pcap files faster than they were generated. You may omit DNS responses from packet captures to make them smaller. We would much rather have (only) all of the queries than some of the queries and some of the responses.
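
As a quick sanity check of clock synchronization before a run, each collection machine can be probed with something like the following (a minimal sketch, assuming the standard NTP utilities are installed; pool.ntp.org is only an example reference server):

ntpq -p                    # confirm the local ntpd has a selected peer (marked "*")
ntpdate -q pool.ntp.org    # query-only check of the current clock offset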

Capturing packets with dnscap

We recommend using dnscap because it will automatically rotate pcap files, listen on multiple network interfaces, and capture only port 53 packets.

Sample usage:

dnscap -i eth0 -t 600 -w ${root}.${instance}

To capture only queries, add "-s i" to the command line:

dnscap -i eth0 -t 600 -w ${root}.${instance} -s i

To capture only queries to specific addresses, add "-z" options to the command line. For example:

dnscap -i eth0 -t 600 -w ${root}.${instance} -s i \
-z 192.5.5.241 -z 2001:500::1035

To automatically compress each pcap file after each interval, use the "-k" option:

dnscap -i eth0 -t 600 -w ${root}.${instance} -s i \
-z 192.5.5.241 -z 2001:500::1035 -k 'gzip -9'

Instead of 'gzip -9' you might want to execute a script that compresses the file and uploads it to a different system.
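
For example, a post-rotation hook along the following lines could be passed to "-k" (a minimal sketch; dnscap runs the command with the closed pcap file name as its argument, and the collector hostname and destination path here are placeholders):

#!/bin/sh
# Hypothetical -k hook: compress the finished pcap file, then copy it to
# a collection host. Replace collector.example.org and /data/ditl/ with
# real values; dnscap supplies the closed file name as $1.
f="$1"
gzip -9 "$f" || exit 1
scp "$f.gz" collector.example.org:/data/ditl/ && rm "$f.gz"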

Capturing packets with tcpdump

If you'd like to use tcpdump, you may want to use our ditl-dnsroot-run script, along with Duane's tcpdump-split program, which together take care of file rotation, error handling, recording some useful metadata, and other issues. However, we realize that many sites already have regular data collection in place or local policies that mandate certain filters. Please keep the following factors in mind when designing your collection process:

  • Be sure to use the -w option of tcpdump to write the raw packets rather than parsing and printing them out as text.
  • Use the -s0 option of tcpdump to capture full packets.
  • We recommend a file naming convention that helps guarantee uniqueness and encapsulates some of the key metadata used for combining datasets, such as ${root}.${instance}.${date}.${time}.pcap.
  • We recommend using one of the following tcpdump filters, listed from most to least preferred (a combined example appears after this list):
    collect TCP and UDP, requests and responses (preferred, but requires the most disk space):
      "host (${hosts}) and port 53"
    collect UDP requests, and TCP requests and responses:
      "(udp and dst host (${hosts}) and dst port 53) or (tcp and host (${hosts}) and port 53)"
    collect TCP and UDP requests:
      "dst host (${hosts}) and dst port 53"
    collect UDP requests only:
      "udp and dst host (${hosts}) and dst port 53"
    Here ${hosts} is a list of DNS server addresses, separated by "or", e.g. "192.5.5.241 or 2001:500::1035".

    That is, if you must drop some types of packets because of limited resources, the best thing to drop is responses, because at least some of the information contained in responses can be reconstructed, given a zone file. Dropping requests of any kind is a last resort: requests are useful for more kinds of research and cannot be reconstructed.
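
Putting these recommendations together, a collection command might look like the following sketch. It assumes a reasonably recent tcpdump (the "-G" option, which rotates output files on time boundaries and expands strftime-style patterns in the "-w" file name, is not available in very old versions) and uses the F-root addresses from the dnscap examples above as placeholder values for ${hosts}:

tcpdump -i eth0 -s 0 \
    -G 600 -w "${root}.${instance}.%Y%m%d.%H%M%S.pcap" \
    'host (192.5.5.241 or 2001:500::1035) and port 53'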

Recommended metadata to collect

To increase the usability of the data, and particularly for indexing in DatCat, please collect additional information about each collection.

We understand that participating sites will use varying tcpdump options, so please always record the specific tcpdump command line used to collect the data. For specific recommendations on what type of metadata to include, refer to CAIDA's web page on How to Document a Data Collection.
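
As a minimal sketch of what such a record might look like, a wrapper could write a small sidecar file next to each trace; the file name and fields below are suggestions only, not a required format:

# Hypothetical sidecar metadata file; adjust the fields to your site.
# Assumes a tcpdump that accepts --version (older builds may not).
cat > "${root}.${instance}.meta.txt" <<EOF
host:     $(hostname)
start:    $(date -u +%Y-%m-%dT%H:%M:%SZ)
tcpdump:  $(tcpdump --version 2>&1 | head -1)
command:  tcpdump -i eth0 -s 0 -w ${root}.${instance}.pcap 'host (192.5.5.241 or 2001:500::1035) and port 53'
EOF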

Possible analyses of the collected data

  1. Time-of-day usage differences:
    We hope to see differences in the instances' diurnal patterns, corresponding to increased user activity during local daytime and evening hours (see the sketch after this list).
  2. Distribution of queries across anycast instances:
    Plot the geographic distribution of clients. Is anycast attracting a geographically local client workload, as expected?
  3. Distribution of queries by gTLD and ccTLD:
    Traffic to all root instances will be dominated by the gTLDs (.com especially), but there will be variations in the sets of ccTLDs requested from different parts of the world. One could graph requests/responses to country codes, by node as well as by instance.
  4. Distribution of response sizes and types:
    Distribution of response sizes, by node as well as by anycast instance. How is it shifting due to DNSSEC, ENUM, etc.?
  5. With TCP request and response data:
    What fraction are genuine DNS requests, versus bogus ones?
  6. Growth in and impact of DNSSEC:
    DNSSEC is not yet deployed at the roots, but this analysis will be relevant to other TLDs, e.g., .se, if data becomes available.
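
As a starting point for the diurnal analysis in item 1, queries in a trace can be binned by UTC hour with standard tools (a sketch only; it assumes the trace contains DNS traffic on UDP port 53, and per-instance local-time analysis would shift the bins by each instance's UTC offset):

tcpdump -tt -nn -r "${root}.${instance}.pcap" 'udp and dst port 53' 2>/dev/null \
  | awk '{ h = int(($1 % 86400) / 3600); n[h]++ }
         END { for (h = 0; h < 24; h++) printf "%02d:00 %8d\n", h, n[h] }'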

Related Objects

See https://catalog.caida.org/paper/2010_understanding_dns_evolution/ to explore objects related to this document in the CAIDA Resource Catalog.