What Researchers Would Like to Learn from the DITL Project: The Top Questions and Data Types

The following questions were contributed by researchers during discussion of the Day in the Life of the Internet (DITL) project at the January 2008 CAIDA/WIDE workshop. The list serves as inspiration for DITL participation; it includes questions that require data not currently included in DITL collections, though we hope such data eventually will be.

A slideset, "Day In The Life of the Internet 2008 Data Collection Event", is available as an overview and summary of the collection event.

A summary of the March 18-19, 2008 Collection Event is also available.

Please send your contributions and comments to ditl-info@caida.org for us to integrate into the list.

To participate in the 2008 Day in the Life of the Internet collection event, please send a message to ditl-info@caida.org with a description of the data you plan to collect and index.

I. Top DITL Questions

A. The Role of Locality in Internet Usage

What are the traffic patterns and connectivity in different geographic regions?
What is the distribution of DNS query subjects by TLD vs. the geographic origin of query sources?
For ISPs appearing in different geographic regions around the world, do peering relationships change depending on the location?
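
As a rough illustration of the TLD-versus-origin question above, the following sketch tallies queried TLDs per source region. It assumes each query has already been labeled with a region by a separate geolocation step that is not shown; the function names, the region labels, and the sample queries are all hypothetical and do not reflect any DITL data format.

```python
from collections import Counter, defaultdict

def tld_of(qname: str) -> str:
    """Return the rightmost label of a DNS query name, lowercased."""
    labels = qname.rstrip(".").lower().split(".")
    # A bare "." (the root) has no labels; report it as "."
    return labels[-1] if labels[-1] else "."

def tld_by_region(queries):
    """Count queried TLDs per source region.

    `queries` is an iterable of (region, qname) pairs. The region label is
    an assumption: it would come from geolocating the query source address.
    """
    counts = defaultdict(Counter)
    for region, qname in queries:
        counts[region][tld_of(qname)] += 1
    return counts

# Hypothetical sample input, for illustration only.
sample = [
    ("EU", "www.example.de."),
    ("EU", "example.com"),
    ("AP", "www.example.jp."),
    ("AP", "example.com."),
    ("AP", "mail.example.jp"),
]
result = tld_by_region(sample)
```

Normalizing each region's counter by its total would give the per-region TLD distribution that the question asks about.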

B. Workload, Traffic, and Performance

What is the mix of application and transport protocols on typical trunks? How has the introduction of P2P applications changed this mix?
Construct and analyze traffic matrices: which ASes are exchanging how much traffic with which others at public IXes and private IXes?
What observable behavior is attributable to botnets?
How can we identify applications (web, VoIP, video, p2p), and estimate their share of traffic?
Do IPv6 traffic characteristics differ from IPv4?
How are flow and packet size distributions changing, including bandwidth symmetry?
Are latency and jitter on the Internet increasing or decreasing?
How can we analyze TCP performance characteristics:
- the penetration of new versions of TCP/IP
- the prevalence of TCP reset flags and TCP retransmissions
- increase in buffer sizes
- application-specific characteristics of TCP flows
- responsiveness of modern applications (games, streaming) to congestion
How different is Internet2 traffic from the real world?
How much web data is unnecessarily uncacheable?
How is R&E traffic different from commercial traffic?
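
The traffic-matrix question above could be approached, in its simplest form, by summing bytes per (source AS, destination AS) pair. The sketch below assumes flow records have already been reduced to (src_as, dst_as, bytes) tuples, for example from NetFlow-style exports with AS numbers attached; that record layout and the function names are assumptions, and the AS numbers come from the range reserved for documentation.

```python
from collections import defaultdict

def traffic_matrix(flows):
    """Sum bytes for each (source AS, destination AS) pair.

    `flows` is an iterable of (src_as, dst_as, nbytes) tuples; this record
    shape is an assumption made for the example.
    """
    matrix = defaultdict(int)
    for src_as, dst_as, nbytes in flows:
        matrix[(src_as, dst_as)] += nbytes
    return dict(matrix)

def top_pairs(matrix, n=3):
    """Return the n AS pairs that exchange the most bytes, largest first."""
    return sorted(matrix.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical flows using documentation-only AS numbers.
flows = [
    (64496, 64497, 1500),
    (64496, 64497, 500),
    (64497, 64496, 40),
    (64498, 64496, 9000),
]
matrix = traffic_matrix(flows)
heaviest = top_pairs(matrix, n=1)
```

The matrix is directional, so the bandwidth-symmetry question can be examined by comparing each pair with its reverse.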

C. DNS

Who is generating invalid traffic to the root servers? Why are the number of queries and the amount of garbage at the roots inversely proportional?
Who is querying in-addr.arpa records for unallocated and unassigned address space? How many of these queries do the roots receive?
What does root server data suggest about trends in IPv6, DNSSEC, DNS packet sizes, prevalence of TCP-based DNS queries?
Can we characterize workload and performance of IDN deployments?
Why are millions of clients querying old IP addresses of roots?
How prevalent are misconfigurations, e.g., lame delegations?

D. Addressing, Topology, and Routing

What is the (distribution of the) distribution of hosts per subnet, and subnets per AS? (intranet topology)
What are the convergence properties of the current routing protocols?
Which ASes control how much of the Internet address space?
What percent of Internet links block ICMP or other probing traffic?
Can we characterize the distribution of hosts hidden behind NAT?
What percentage of users on public wireless networks uses VPNs?
How many four-byte ASes exist?
How much allocated but "unused" IPv4 space remains?

E. Measurement Methodology and Experimental Design

How can we measure host-to-host clock skew and NTP pool drift characteristics?
How can we probe IPv4/IPv6 in a better way?
How many measuring points do we need?
How much storage will be required?
How can I determine whether a cable modem acts as a bridge or a router? Do probes stop at the modem or make it through to a device behind the modem?
How dynamic are dynamic address assignments? What is the distribution of the time that a dynamically assigned address remains assigned to a single customer?
What are appropriate guidelines for measurement, data sharing, and data analysis, to minimize impact on the network and privacy?
How do we evaluate the scalability of a measurement system?
How do we anonymize data while still preserving the maximum utility possible for research?
What can/should we measure from the edge?
What incentives would increase participation in data sharing?
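
One common answer to the anonymization question above is prefix-preserving address anonymization: two addresses that share a k-bit prefix still share a k-bit prefix after anonymization, so subnet structure survives while the real addresses do not. The sketch below illustrates the idea in the spirit of schemes such as Crypto-PAn, but it is not that algorithm and has not been vetted for production use. The function names and the HMAC-based construction are assumptions chosen for brevity.

```python
import hashlib
import hmac
import ipaddress

def anonymize_ipv4(addr: str, key: bytes) -> str:
    """Prefix-preserving anonymization of one IPv4 address (illustrative).

    Each output bit is the input bit XORed with a keyed pseudorandom bit
    that depends only on the more significant input bits before it.
    """
    original = int(ipaddress.IPv4Address(addr))
    result = 0
    for i in range(32):
        # The i most significant bits of the original address (0 when i == 0).
        prefix = original >> (32 - i)
        # Include the prefix length so prefixes of different lengths differ.
        msg = i.to_bytes(1, "big") + prefix.to_bytes(4, "big")
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] & 1
        bit = (original >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))

def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits shared by two IPv4 addresses."""
    x = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - x.bit_length()

key = b"example-key-not-for-real-use"  # hypothetical key
a, b = "192.0.2.1", "192.0.2.200"      # documentation addresses, share 24 bits
anon_a = anonymize_ipv4(a, key)
anon_b = anonymize_ipv4(b, key)
```

With the same key, the mapping is deterministic, so traces anonymized at different times stay comparable; the trade-off is that anyone holding the key can reproduce the mapping.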

F. Social

The following questions pertain to the ever-increasing role of the Internet in modern society. While not immediately answerable with the available or even the expected data, these questions extend the scope of future Internet research efforts and tie the Internet's technical foundation to the core interests of its human users.

What are generational differences in Internet use?
How will high-speed broadband affect consumer behavior?
What is the language distribution of content?
How many people object to government or ISPs sniffing traffic?
How much of Internet infrastructure is under control of organized crime?
How much email is unsolicited spam?

II. Types of Data Needed for the Above Questions

DNS query packet traces and/or logs from various places (roots, TLDs, IN-ADDR.ARPAs, ISP resolvers).
Active DNS measurements such as open resolver surveys.
IPv4/IPv6 topology probing data.
BGP feeds/updates, including simultaneous with topology probes.
Web cache logs.
Anonymized reports, e.g., CoralReef, NetFlow-based.
Large router-level topology (anonymized), with an up/down time log per link.
Consistent macroscopic ping data over years.
Packet traces from the core, the edge, and close to customers, all appropriately anonymized.
Traces collected on end-user machines, e.g., NETI@home.

III. Example Data Access Policies

The following list provides examples of potential data access policies and structures that might allow network researchers to gain access to otherwise unavailable data.

Unrestricted: Anonymized versions without payload publicly available.
Restricted: Access via an Access Agreement that requires the data and analysis to remain on specific servers.
Restricted: Contact via email for access.
Restricted: Access requests accepted for collaborative agreements to share analysis, implementation code, and results.
Restricted: Researchers may submit analysis code for staff to run on data.
Restricted: Available to academic, government and non-profit researchers and members upon request.