DNS-ITR (DNS-OARC) - Proposal: Improving the Integrity of Domain Name System (DNS) Monitoring and Protection

This proposal helps to address National and Homeland Security recommendations by the President's Critical Infrastructure Protection Board to develop a 'cyberspace network operations center (NOC)'. The long-term mission of this proposal - to provide data needed to support DNS research - also has relevance to the real Internet and how it supports economic prosperity and a vibrant civil society. Indeed, the data, models, communications analysis, and simulation functionalities to be provided have the potential to dramatically improve the quality of the lens with which we view the Internet as a whole.

1 Introduction. The Domain Name System: current status, problems, and threats

At its core the DNS is a globally distributed and decentralized database of network identifiers. Its most common use is to resolve host names into Internet addresses. This mapping occurs continually, for example, every time someone visits a web page, sends email, or uses an instant messaging service.

One of the most important properties of the DNS is its use of hierarchical namespaces. This hierarchy is manifest through the familiar "dot" notation used in web site and domain names. For example, in order to reach a machine with the name "not.invisible.net," we must send a query to the DNS server that is responsible, or authoritative, for machines (and/or sub-domains) in the domain "invisible.net." In order to identify this authoritative machine for "invisible.net," we must send a query to the server authoritatively responsible for ".net." Such a server is called a TLD (top-level domain) server. In order to find out where the appropriate TLD server is, we must send a query to one of the 13 root servers. Thus, resolution of domain names relies upon the proper, secure, and efficient operation of the root and TLD name servers. These servers experience heavy load because they are the starting points for DNS clients (applications) when resolving host names. A typical root server receives between 8000 and 10000 queries per second [2]; this load appears to grow linearly with the number of registered domain names [3].
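The root-to-TLD-to-authoritative walk described above can be pictured with a minimal sketch. It assumes a toy in-memory delegation table rather than real network queries; the server names, the address, and the `resolve` helper are all hypothetical.

```python
# Toy delegation data; every server name and address here is hypothetical.
DELEGATIONS = {
    ".": {"net.": "tld-server.example"},
    "net.": {"invisible.net.": "ns.invisible.example"},
}
HOSTS = {"invisible.net.": {"not.invisible.net.": "192.0.2.10"}}


def resolve(name):
    """Walk from the root toward the authoritative zone, recording each hop."""
    labels = name.rstrip(".").split(".")
    zone, path = ".", ["root"]
    # Build successively longer suffixes: "net.", then "invisible.net."
    for i in range(len(labels) - 1, 0, -1):
        child = ".".join(labels[i:]) + "."
        server = DELEGATIONS.get(zone, {}).get(child)
        if server is None:
            break  # no further delegation known; stop at the current zone
        path.append(server)
        zone = child
    answer = HOSTS.get(zone, {}).get(name.rstrip(".") + ".")
    return answer, path
```

Calling `resolve("not.invisible.net")` visits the root, then the TLD server, then the authoritative server, which is why every uncached lookup places load on the top of the hierarchy.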

The DNS infrastructure has another hierarchical aspect. Most organizations operate one or more caching name servers for their users. End systems are configured to send their queries to the local name server, which forwards queries to other servers and caches the answers. Each DNS record has a time-to-live (TTL), which specifies how long it can be legitimately (accurately) cached. This caching allows servers to answer more quickly, reduces latency and network traffic and generally facilitates global scalability beyond that which even its designers anticipated. Flawless behavior of DNS cache implementations is crucially important for the stability of the whole system since they generate almost all of the queries to authoritative servers on the Internet.
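As an illustration of TTL-bounded caching, the sketch below stores each answer with an expiry time and refuses to serve it afterwards. It is a simplified model, not the behavior of any particular name server; the `TtlCache` class and its explicit `now` parameter are assumptions made for clarity and testability.

```python
class TtlCache:
    """Minimal TTL-respecting cache; time is passed in explicitly."""

    def __init__(self):
        self._store = {}  # name -> (answer, expiry time in seconds)

    def put(self, name, answer, ttl, now):
        self._store[name] = (answer, now + ttl)

    def get(self, name, now):
        entry = self._store.get(name)
        if entry is None or now >= entry[1]:
            # Expired or never cached: the caller must query upstream again.
            self._store.pop(name, None)
            return None
        return entry[0]
```

A cache that ignores the expiry check serves stale data, while one that expires entries too eagerly sends avoidable queries to the authoritative servers.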

The most popular DNS implementation in use today (75%) is the Berkeley Internet Name Daemon (BIND) software [4]. Other DNS implementations include nsd [5], djbdns [6], ANS and CNS [7], and Microsoft's DNS software bundled with Windows [8]. Our previous studies indicate serious pathological behavior in the DNS, a substantial portion of which derives from inadequacies in a specific vendor's DNS implementations [9,10,,12]. All implementations have bugs and are sensitive to network behavior, but there has been no systematic effort to diagnose them.

The operational community is also concerned about how the DNS will handle the arrival of new technologies such as IPv6 and DNSSEC (DNS Security extensions) [13]. In particular, both of these new features require significantly larger packets, so resolvers will need to support DNS packets that exceed the originally specified size limits. DNSSEC also generates more DNS lookups in order to retrieve signatures and public keys associated with zone data.

Finally, the DNS has only unsophisticated and limited mechanisms for protecting itself against attacks and malfunctions. The vital nature of servers at the top levels of the DNS hierarchy makes them increasingly attractive targets of both organized and disorganized malice. Today, the global DNS relies primarily on over-provisioning to handle traffic bursts as well as to remain operational in the face of denial of service (DoS) attacks. The most recent technique used to distribute the load is IP anycast: even though there are only 13 root server addresses, there are actually close to 100 physical DNS root servers [2]. As the top of the DNS hierarchy grows more distributed, it will be harder for attackers to disrupt service, but more complex to debug attacks and other anomalies that do occur. The proliferation of exploits and distributed DoS attack technology, in conjunction with impending new burdens on the DNS due to IPv6 and DNSSEC, renders the improvement and protection of this core component of cyberinfrastructure crucial to both national and international security.

Hardening the DNS will be neither simple nor easy: the dynamics are complex and the workloads prodigious. Indeed, the DNS is an ideal example of an area where we lack fundamental understanding of `how failures cascade, how scalability and interoperability among heterogeneous systems can be ensured, how inherent complexity can be managed.'¹ The largest obstacle to sound DNS analysis and research is the lack of relevant data sets. The goal of this project is to fill this gap by deploying instrumentation for data collection, a vehicle for delivering data to researchers, and a framework for sharing lessons learned. In-depth analysis of empirical data will empower new countermeasures to attacks as well as policy and architectural discussions of the future of DNS.

2 DNS Operations Analysis and Research Center (DNS-OARC)

Despite the essential nature of the DNS, long-term research and analysis in support of its performance, stability, and security is extremely sparse. ISC's Operations, Analysis, and Research Center (OARC), launched in October 2003, is attempting to address this situation. It includes as participants root and TLD nameserver operators, ISPs, and leading research institutions. The mission of OARC is to provide a trusted platform for bringing together key operators, implementers, and researchers so they can identify problems, test solutions, share information, and learn together.

2.1 OARC functions

OARC supports five key functions, all critically needed for the stability of the DNS, and all noticeably unsupported by the current political, regulatory, and fiscal landscape of the Internet.

Operational Characterization. As Internet traffic levels continue to grow, the demand on root and other key nameservers will outgrow the current infrastructure: this year's DoS attack traffic levels will become next year's steady state load. OARC will monitor the performance and load of key nameservers and publish statistics on both traffic load and traffic type (including error types).
Incident Response. The OARC provides a forum for the DNS operations community to interact during attacks or other incidents that affect global DNS operations. Stringent confidentiality requirements and secure communications will support sharing of proprietary information on a bilateral basis.
Testing. A testing laboratory containing common DNS implementations and network elements will enable rigorous analysis of fixes, patches, and performance characteristics in both a real-world operational environment and under simulated attack conditions. Understanding of the compatibility and interoperability issues is a prerequisite to finding working solutions.
Analysis. Leading researchers and developers will use the data collected by OARC for long-term analysis of DNS performance and post-mortems of attacks and will promote institutional learning. A centralized data storage allows OARC members to download traces and logs to perform their own analysis.
Outreach. Many problems with the DNS are the result of misconfigurations by end users, vendors, or large corporate networks. Outreach ensures that critical information about the global DNS reaches those who need to know.

2.2 Trust models and operational procedures

A key interest area for OARC, and a key potential contribution, is in the area of formalizing and scaling trust across diverse groups with common operational concerns. Currently, different groups of operators and implementors coordinate to handle existing issues within small, ad hoc groups. For example, root nameserver operators have established an informal but highly effective set of contacts, trust relationships, and procedures for helping each other in the event of shared malfunction or attack. However, to date this process has relied on personal introductions and informal commitments for the needed efforts to maintain shared online resources. This approach can be effective within a small community for specific problems but does not scale to the operators of 250 TLDs, the 5 largest registrars, and the Fortune 1000 companies whose revenues and even assets often acutely depend on the reliability of the DNS. It also does not work in more diverse problem spaces, especially those involving politics, or implementation of preventive measures for problems that have not yet occurred.

The Internet's backbone ISP operators also use a loose network of trust and cooperation that sometimes presents daunting, even insurmountable, obstacles when cooperative real-time response is required across many, often directly competing, corporate entities. Scaling and extending such ad hoc operational trust networks is still a nascent area of operations research. OARC will closely examine existing mechanisms and attempt to provide a more formal technical and social infrastructure to evolve and grow them. Potential results will be relevant far beyond DNS-OARC to other components of cyberspace infrastructure.

2.3 OARC's position in the political landscape

Oversight of the DNS is loosely exercised by the Internet Corporation for Assigned Names and Numbers (ICANN). This non-profit corporation assumed responsibility for the IP address space allocation, protocol parameter assignment, domain name system management, and root server system management functions previously performed under U.S. Government contract by the Internet Assigned Numbers Authority (IANA) and other entities. ICANN's role is to coordinate the technical and operational aspects of these functions.

Figure 1: Location of DNS root servers, including anycast nodes, identified by their one-letter names.

The Root Server System Advisory Committee (RSSAC) advises ICANN on operational issues of relevance to the root name servers. Figure 1 shows the locations of existing root servers around the world, including headquarters of six root servers (C, F, I, J, K, and M) that currently implement anywhere from 2 to 20 anycast nodes² per root [2]. RSSAC has been the primary vehicle by which root name server operators provide technical input to ICANN on policy matters within its purview, such as implications for the DNS roots of the increasing deployment of IPv6.

Another ICANN committee, the Security and Stability Advisory Committee (SSAC), overlaps RSSAC but focuses specifically on the integrity of DNS infrastructure overall. Its scope includes the DNS root and TLD servers and registries, and the consequences of the introduction of new technologies such as DNSSEC. Several root name server operators are members of SSAC, which has also invited CAIDA to participate in some of the committee's discussions. The SSAC will be one of the groups using the results of the work proposed here to analyze the impacts of changes to the infrastructure on the security and stability of the DNS.

Neither ICANN nor its subcommittees have any funding allocated for technical DNS research or analysis. Despite limited resources, over the last four years CAIDA has provided uniquely useful empirical data and unbiased analyses in support of policy discussions within both RSSAC [10] and more recently SSAC [9,11,14,15,12]. In the process CAIDA and some of the root name server operators (including ISC) have deployed both active and passive monitors to collect extensive (but still incomplete) data sets on root servers' connectivity and performance. We propose to extend this data, which has been useful to date primarily for lack of anything else. A more comprehensive effort to instrument and analyze the system will provide both the operational community and policy bodies with far greater insight into both "normal" and "abnormal" DNS behavior.

Over the last year, many TLD servers have deployed anycast to support multiple nodes per IP address, distributed all over the globe. The operational implications of this innovation are not well understood, nor does the analysis of such implications fall under the auspices of any existing organization. Meanwhile, the increasing internationalization of the Internet has inspired the UN, through the ITU, to pursue its own version of involvement, against a colorful backdrop of strictly political and economic sovereignty and governance claims. Technically based, operationally sound input to the process of institutionalizing the DNS infrastructure remains elusive.

OARC strives to mitigate the most imminent technical risks and support the necessary political evolution by providing sound data collection and analysis as the basis for domestic and international policy decisions. Both CAIDA and ISC are neutral, non-competitive players in this arena. While continuing our involvement with ICANN's RSSAC and SSAC committees, we will emphasize an open process. We will document ICANN activities relevant to security and stability of the DNS on our web site and provide a platform for shared data and open debate in the engineering and policy communities. We believe that the proposed approach is the most effective way to evolve standards and practices for robust, reliable network infrastructure. Further, infrastructure providers are as afraid of being the vector for a serious attack on the global Internet as they are of being regulated by those who regard it as their job to prevent or mitigate such an attack; OARC's deliverables will greatly reduce both fears.

2.4 Why our team is most appropriate

CAIDA is recognized as a world leader in Internet measurement and data analysis, and has provided several landmark studies of DNS performance, workload, and topology issues [16]. CAIDA also has long-standing collaboration relationships with many providers and equipment vendors.

ISC is deeply engaged with the operational and engineering communities, including operators of many TLD or root nameservers as well as of large backbone networks and exchanges.

Our team represents a unique combination of talents and facilities necessary to achieve the proposed goals. Moreover, we envision that OARC will not be just a three-year project. On the contrary, NSF will be seeding an effort with lasting impact and relevance to Internet evolution. The research efforts funded under this proposal will either complete or find continued funding, but the measurement infrastructure and supporting software tools will remain in place. Monitoring efforts will continue under ISC, funded by OARC membership fees [2].

3 Measurement infrastructure

The OARC team will design a robust measurement infrastructure that balances the needs for large raw datasets, resilience under temporary outages, and low processing overhead to accompany heavy workloads of busy DNS servers. The design must make it easy to install new capture software in order to adapt to the changing needs of the research community.

As previously mentioned, many root DNS and ccTLD servers will be using anycast technology by the end of 2004 to multiply their coverage and capacity, rendering the measurement problem an order of magnitude more difficult. This development is unsettling given that the top levels of the DNS hierarchy were never instrumented for data collection even before anycast, precluding empirically grounded research of the system (or even `back-of-the-envelope' analysis!).

We intend to instrument each instance of an anycast F-root nameserver (run by ISC) and to capture data locally for both real-time responses to attacks and for later in-depth analysis at the OARC data centers. In addition to a root server, ISC also has a TLD server that handles about 60 TLDs. We will instrument and monitor this TLD server as well.

Capture hardware. Each of the F-root anycast machines is connected to the Internet via a network device with two measurement ports. These ports can passively monitor the DNS traffic in each direction without impacting the root server itself. Ideally each instance of F-root will have a dedicated box receiving data from the measurement ports and processing it locally. This measurement box will be a high-end PC with multiple 100 Mb/s Ethernet network interfaces³, 1 GB of memory, and a 200 GB high-speed local disk.

Part of the data collected locally at each instance of F-root will be summarized and uploaded to a storage array for long term trend analysis. The storage array will have an initial capacity of about 10 terabytes⁴ and be located at the ISC data center. For some studies the data sets will be too large to download and analyze remotely, in which case researchers will be able to install their analysis software on ISC machines at the data center or actually locate their own analysis machines there.

Measurement software. ISC will develop software to manage the data stream, local storage, and data archive. We will refine and extend the collection software as data needs of the research community are better understood. CAIDA already supports tools dnsstat and dnstop for nameserver operators to analyze data on their own servers [17]. More recently we built on our experience with these tools to prototype a more general measurement daemon, dsc (DNS statistics collector), to gather statistics and dump them to stable storage every minute. We have tested this prototype at the San Francisco and Palo Alto instances of F-root and at the ISC's TLD server. It captures the query/response DNS traffic to the server and summarizes it in various ways. The first graph (Figure 2) shows query types and query rate vs. time. Notice the anomaly at 0530 UTC as well as the seemingly random spikes of IPv6 queries (for A6 and AAAA records). Note that native IPv6 support for DNS is blocked on the lack of IPv6 addresses for the top level servers. IANA is evaluating the effects of adding IPv6 glue records to the root zone. Once approved, the various F-root instances will be able to serve DNS data to IPv6 only hosts. Several F-root cities are expected to become substantial IPv6 users as soon as IANA approves IPv6 addresses. OARC will be able to measure the penetration of IPv6 with its measurement infrastructure, not only IPv6 AAAA queries but also the use of IPv6 transport to the instances of F-root.
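The per-minute summarization described above can be pictured with a small sketch. This is not the dsc implementation; it is a minimal illustration that assumes queries have already been parsed into `(timestamp, qtype)` pairs, and the function name `summarize_by_minute` is hypothetical.

```python
from collections import Counter, defaultdict


def summarize_by_minute(queries):
    """Group (unix_timestamp, qtype) pairs into per-minute query-type counts.

    Returns a dict mapping the start of each minute (in seconds) to a
    Counter of query types observed during that minute.
    """
    buckets = defaultdict(Counter)
    for ts, qtype in queries:
        minute_start = int(ts) // 60 * 60
        buckets[minute_start][qtype] += 1
    return dict(buckets)
```

Writing out one small summary per minute keeps local storage needs modest while still exposing short-lived anomalies such as the spike visible in Figure 2.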

Figure 2: Types and rates of DNS queries to f-root's SFO2 node, 7 Feb 2004.

Figure 3: Most popular TLDs and query types sent to f-root's SFO2 node, 7 Feb 2004.

Figure 3 shows the most popular TLDs queried and query types for those domains. Several non-existent TLDs receive an immense number of queries, mostly due to bugs in a popular desktop vendor's caching nameserver (e.g., queries for TLD `41') or misconfigurations (queries for TLD `local'). A similar graph organized by the subnet making the query can help identify broken implementations at the subnet level. Given the many natural variations in query patterns, a baseline not only requires a lot of information to establish but must also be updated regularly.
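A tally of this kind can be sketched as follows. The example assumes query names have already been extracted from the traffic; `KNOWN_TLDS` is a deliberately tiny illustrative subset, not the real root zone, and the helper names are hypothetical.

```python
from collections import Counter

# Illustrative subset only; a real analysis would load the current root zone.
KNOWN_TLDS = {"com", "net", "org", "arpa"}


def tld_of(qname):
    """Return the rightmost label of a query name, lowercased."""
    return qname.rstrip(".").split(".")[-1].lower()


def count_tlds(qnames):
    """Count queries per TLD, separating delegated from non-existent TLDs."""
    valid, bogus = Counter(), Counter()
    for qname in qnames:
        tld = tld_of(qname)
        if tld in KNOWN_TLDS:
            valid[tld] += 1
        else:
            bogus[tld] += 1
    return valid, bogus
```

Sorting the second counter with `bogus.most_common()` surfaces labels such as `local` that should never reach a root server.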

A second piece of measurement software will write a full packet trace to local disk, using a sliding time window. At any time all traffic seen in the last N hours will be available (N < 24). This data will not be archived back at the central repository, but will be available locally for analyzing attacks or anomalies in as close to real-time as possible.
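One way to picture the sliding window is a queue that discards packets older than the window each time a new packet arrives. The sketch below keeps packets in memory for simplicity, whereas the software described here would write to local disk; the `SlidingTrace` class is hypothetical.

```python
from collections import deque


class SlidingTrace:
    """Retain only the packets seen during the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.packets = deque()  # (timestamp, raw_bytes), oldest first

    def add(self, ts, raw):
        """Append a packet and evict everything older than the window."""
        self.packets.append((ts, raw))
        cutoff = ts - self.window
        while self.packets and self.packets[0][0] < cutoff:
            self.packets.popleft()
```

Because eviction happens on every insertion, storage stays bounded by the traffic volume of the window rather than growing with the age of the monitor.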

We will also allow authorized persons to get more detailed information about the DNS activity from their network. For example, if they enter an IP address/network, we would look up the tech/admin contacts in registry (e.g., ARIN/RIPE/APNIC/LACNIC) databases and email the results pertaining to their network to them.

Deployment. A typical monitoring station will cost about $5000. The cost covers high-speed network interfaces, memory, and disk, in a unit small enough (1U) to conserve rack space in busy POPs. (Rack space is expensive and we ask the host organizations to donate that cost.)

We will deploy dedicated measurement machines at the busiest instances of the F-root server in the first year of the grant. At the same time, because F-root nodes are highly over-provisioned, we can safely run the collection and summarization software directly on the less busy instances of F-root for at least the first two years of this project. (We are seeking hardware donations from PC manufacturers to expand our monitoring coverage.) Each year of the grant we will buy 5 additional monitoring stations, install them at the busiest instances of F-root and re-assign the initial monitors to less busy nodes without monitors. This approach ensures the presence of monitoring machines that can keep up with the traffic load at the busiest servers, for the duration of the grant period.

4 Research questions

"This is a dynamic system and it's hard - we don't even know how to formulate the right questions."

- last RSSAC meeting (Minneapolis, MN, 8 November 2003)

In this section we provide examples of questions the community will be poised to answer given the resources provided by the OARC. We list several analysis tasks that CAIDA will undertake, as well as other questions currently important to the operational and research community.

Effects of anycast for a given root. ISC, the operator of F-root, pioneered the use of anycast to replicate root servers for robustness and greater resiliency to attacks. The number of distinctly IP-addressable root servers is limited to 13 due to the 512-byte UDP packet size of a DNS message, but anycast allows many hosts to have the same IP address while located in different places. The routing system then naturally maps F-root's IP addresses to the nearest anycast F-root node, resulting in each anycast instance of F-root developing a basin of attraction for queries sourced nearby. F-root was the first root server to employ anycast and is one of the most ubiquitous, currently replicated at 20 locations around the world [2].
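The relationship between the 512-byte limit and the number of root servers can be illustrated with back-of-the-envelope arithmetic. The sketch assumes the standard DNS wire format with name compression, names of the form `a.root-servers.net`, and one IPv4 glue record per server; these assumptions are ours, and the function name is hypothetical.

```python
HEADER = 12              # fixed DNS header
QUESTION = 1 + 2 + 2     # root name (1 byte) + QTYPE + QCLASS
RR_FIXED = 2 + 2 + 4 + 2  # TYPE + CLASS + TTL + RDLENGTH


def priming_response_size(n_servers):
    """Estimate the size in bytes of a root NS response with IPv4 glue."""
    # First NS target is spelled out in full: a.root-servers.net.
    full_name = (1 + len("a")) + (1 + len("root-servers")) + (1 + len("net")) + 1
    first_ns = 1 + RR_FIXED + full_name
    # Remaining NS targets: one-letter label plus a 2-byte compression pointer.
    other_ns = (1 + RR_FIXED + (1 + 1 + 2)) * (n_servers - 1)
    # Each A record: 2-byte pointer for the owner name plus a 4-byte address.
    glue = (2 + RR_FIXED + 4) * n_servers
    return HEADER + QUESTION + first_ns + other_ns + glue
```

Under these assumptions 13 servers occupy 436 bytes, leaving limited headroom below 512 bytes. The exact ceiling depends on the naming scheme and on how much headroom is reserved, so the sketch only illustrates how the message size constrains the server count.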

CAIDA will use the following approach to measure the DNS effects of removing (via controlled routing announcements) an instance of the anycast F-root:

Collect the following data sets before and after a new anycast node (de-)activates: (1) current DNS client distribution (dnsstat for 24 hours from all F-nodes); (2) RIB from a router where the new node is located; (3) BGP view from RouteViews before and after the (de-)activation.
Compare client distribution among anycast nodes of a given root after the (de-)activation of a new node. How does the workload distribution compare with the AS graph data after instrumenting a new node? How long does it take to stabilize? Is it predictable?
Compare RIB-based and RouteViews-based prediction of how anycast will redistribute load.
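The before-and-after comparison of client distribution in the steps above can be sketched as a change in each node's share of the total load. The node names and counts in the usage example are invented for illustration, and the helper functions are hypothetical.

```python
def node_shares(counts):
    """Convert per-node client (or query) counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {node: n / total for node, n in counts.items()}


def share_shift(before, after):
    """Return the change in load share for every node seen in either snapshot."""
    b, a = node_shares(before), node_shares(after)
    nodes = sorted(set(b) | set(a))
    return {node: a.get(node, 0.0) - b.get(node, 0.0) for node in nodes}
```

For example, `share_shift({"PAO1": 60, "SFO2": 40}, {"PAO1": 50, "SFO2": 30, "NEW1": 20})` reports that the new node attracted a fifth of the load, drawn equally from the two existing nodes. Repeating the calculation over successive snapshots gives a rough view of how long the distribution takes to stabilize.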

Such experiments will allow us to model the anycast root server system over time to determine the performance impact of architectural, software, topology, and policy changes. CAIDA and OARC are committed to working with operators deploying anycast to provide logistic information such as topology information and notification about peering changes for anycast root nodes.

Effects of anycast for a given client. We understand how anycast works in theory. Queries from a given client are routed to the closest anycast node (for some definition of closest). However, there is little certainty of how it works in the wild. How stable is the mapping between clients and anycast nodes? Are routing flaps significant? Can application or operating system software affect anycast node selection? Do all clients within the same /24 hit the same node? /20? /16?

To answer these questions we plan to collect traces at multiple anycast nodes for a random selection of client IP addresses. At the same time, we will also monitor the global routing table from multiple locations. We will look for anomalies such as a single client hitting multiple anycast nodes within a short amount of time, changes in mapping that correspond to routing changes, and changes in mapping that do not correspond to routing changes.
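The first of these anomalies, a single client reaching several anycast nodes within a short time, could be detected along the following lines. The sketch assumes merged observations of the form `(timestamp, client, node)` from all monitored nodes; the function name and the choice of window are hypothetical.

```python
def multi_node_clients(observations, window):
    """Return clients seen at two or more nodes within `window` seconds.

    observations: iterable of (timestamp, client, node) tuples.
    """
    last_seen = {}  # client -> {node: most recent timestamp}
    flagged = set()
    for ts, client, node in sorted(observations):
        nodes = last_seen.setdefault(client, {})
        # Flag the client if a different node saw it recently enough.
        if any(other != node and ts - t <= window for other, t in nodes.items()):
            flagged.add(client)
        nodes[node] = ts
    return flagged
```

Correlating the flagged timestamps with routing-table changes would then separate mapping changes explained by routing from those that are not.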

Identifying oddities in DNS traffic. Our 2002 study identified a few odd DNS client behaviors. One in particular showed that all queries from a client were duplicated and coming from two different source addresses. In one particular case the presence of A queries for IP addresses easily identified the name server as an (unpatched) Windows box. Whether these redundant queries were due to misconfiguration of the DNS, of the routing or network interfaces, or a software bug remains unknown. Taxonomizing and reporting such anomalies could reduce the load on the root servers.

Spurious traffic reaching the root name servers. Organizations operating root DNS servers report loads exceeding 100 million queries per day. Given the design goals of the DNS and what we know about today's Internet, this number is about two orders of magnitude more than expected. In 1999 and 2002, ISC and CAIDA collaborated on studies of the F-root server [11,9] to investigate the sources of this seemingly huge load. ISC provided traces of queries to F-root, and CAIDA classified each query. The 1999 study identified the bogus A query (asking for the IP address of an IP address), which accounted for 14% of the query load and was due to a bug in the Win2K nameserver. We also identified other types of unnecessary queries to root servers, including attempted updates to the root zone by users' desktop name servers. The 2002 study (of a 24-hour trace) divided queries into nine categories. The vast majority of these queries (over 70%) were repeats; only a small percentage were legitimate (see Figure 4). About 12% of queries received by the root server were for nonexistent top-level domains, e.g., ".elvis", ".corp", ".localhost".⁵ The number of bogus A queries was down to 7%, reflecting the deployment of the Microsoft patch for the problem. We also characterized a few of the egregious abusers: clients sending a particularly large number of queries to the root server. We believe that much of the root server misuse occurs because querying agents never receive the replies, due to packet filters, misconfigured firewalls, or routing issues.
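Two of the categories mentioned above, bogus A queries and repeats, can be illustrated with a simplified classifier. This is not the classification scheme used in the studies, which had nine categories; the labels, the function names, and the naive definition of a repeat (same client, name, and type seen earlier in the trace) are assumptions for illustration.

```python
import ipaddress


def is_ip_literal(qname):
    """True if the query name is itself a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(qname.rstrip("."))
        return True
    except ValueError:
        return False


def classify(queries):
    """Label each (client, qname, qtype) query in arrival order."""
    seen = set()
    labels = []
    for client, qname, qtype in queries:
        key = (client, qname.lower(), qtype)
        if qtype == "A" and is_ip_literal(qname):
            labels.append("a-for-a")   # asking for the address of an address
        elif key in seen:
            labels.append("repeat")    # identical query already seen
        else:
            labels.append("other")
        seen.add(key)
    return labels
```

A fuller treatment would take TTLs into account, since a query repeated after the cached answer has expired is legitimate.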

In subsequent work [12] (laboratory simulations), we learned that our initial model was incomplete. The initial model assumed that once a cache learns about the referral for a TLD, it will have no reason to send any further TLD queries to the roots until the TLD NS and/or A record(s) expire. However, we learned that some caching nameservers (BIND9 and DJBDNS) have apparently valid reasons for sending more queries to the roots. When a nameserver A record expires and they need to refresh it, they start at the root. For example, if the A record for NS0.MICROSOFT.COM has expired but the record for the COM authoritative servers has not, BIND9 or DJBDNS send the NS0.MICROSOFT.COM query to the roots anyway. This prevents any cache poisoning that might occur, at the cost of substantially increasing the load on the roots. BIND8 would use the cached COM record and avoid a root query.⁶

For this project, the OARC team will monitor query traffic at F-root on a regular basis. We will also compare F-root statistics with other root servers⁷. We will develop methodologies for automated processing of collected data to identify sudden changes in query patterns, analyze the most egregious abusers, and evaluate effects of implemented improvements.

Figure 4: Categories of illegitimate queries to F-root name server for 24-hours on 4 October 2003.

Understanding Spurious RFC1918 Traffic. In pursuit of a finer-grained understanding of some of the spurious traffic just described, CAIDA analyzed one type of improper machine-generated DNS traffic - attempts to erroneously and incessantly update address-to-hostname mappings for private address space in nameservers at the top of the DNS hierarchy [14]. RFC1918 [18] and other private addresses are permanently unassigned and therefore are not globally reachable (or routable). Networks that use these addresses internally rely on network address translation (NAT) to reach the Internet. Internal RFC1918 addresses should never reach the global Internet; they are meaningless there.
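Recognizing reverse-mapping names that refer to RFC1918 space is straightforward, as the following sketch suggests. It assumes query or update names of the form `d.c.b.a.in-addr.arpa` have already been extracted from the traffic, and the helper names are hypothetical.

```python
import ipaddress

# The three private address blocks defined by RFC1918.
RFC1918_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def reverse_name_to_ip(qname):
    """Convert 'd.c.b.a.in-addr.arpa' to the IPv4 address a.b.c.d, or None."""
    name = qname.rstrip(".").lower()
    suffix = ".in-addr.arpa"
    if not name.endswith(suffix):
        return None
    octets = name[: -len(suffix)].split(".")
    if len(octets) != 4:
        return None
    try:
        return ipaddress.IPv4Address(".".join(reversed(octets)))
    except ValueError:
        return None


def is_rfc1918_reverse(qname):
    """True if the reverse-mapping name refers to a private address."""
    ip = reverse_name_to_ip(qname)
    return ip is not None and any(ip in net for net in RFC1918_NETS)
```

Counting such names per source address over time yields the series on which periodicity analysis can then be performed.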

Our RFC1918 DNS study discovered that a large portion of spurious updates are caused by the default configuration of the DHCP/DNS servers shipped with a specific desktop vendor's operating systems. The vendor's server software sends periodic updates with frequencies that we found with spectral analysis and confirmed via a laboratory experiment and vendor documentation. This (mis)configuration is so widespread that patterns of Internet access by end users are reflected in pulsations within the flow of spurious DNS updates to root name servers! Furthermore, there is no reason to believe that such spurious DNS packets are limited to private networks. Users are unaware that their machines are misbehaving.

Prior to the deployment of dedicated authoritative servers to capture and divert RFC1918 address space in the spring of 2002, millions of desktop machines with private addresses were attempting to update the DNS root servers, which are authoritative for the in-addr.arpa top level domain. Such misdirected default behavior effectively constituted a slowly paced, massive, distributed denial of service (DDoS) attack on the root name server system. We have worked with Microsoft to correct this problem, but such `normal accidents' [19] will grow even more prevalent in our increasingly interconnected cyberinfrastructure. Indeed the current state of desktop software poses a substantial and increasing burden on, if not threat to, the robustness of the global Internet. Dedicated attention to its macroscopic effects on the infrastructure is as scant as it is important.

Automatic identification of one-way queries. Our earlier work has shown that a significant number of bogus queries reaching a root server are due to one-way communication. That is, client-side packet filters, firewalls, unroutable source addresses, or perhaps even saturated links do not allow the answer to a query to reach the client.

We believe that it may be possible to automatically identify clients suffering from this problem. We can develop software that detects such a situation and generates an email message to one or more appropriate contacts (from the whois database) for the offending source address. This technique requires some investigation and refinement because: (1) automatically generated messages may be identified as spam; (2) the source address may be spoofed and not actually originate from the organization listed in whois; (3) this feature may become a target of attacks itself (i.e., to generate a notification message for one's enemy). We can perhaps leverage existing DDoS attack detection tools and techniques to help in these situations. For example, how far back can we trace the source of one-way queries and find out whether or not they are spoofed?
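One possible detection heuristic is to flag clients that keep repeating an identical query in quick succession even though the server answered each time. The sketch below is only one such heuristic under stated assumptions: the log format, the function name, and the thresholds `min_repeats` and `span` are hypothetical and would need tuning against real traces.

```python
from collections import defaultdict


def suspected_one_way(log, min_repeats=5, span=60):
    """Return clients that sent the same query `min_repeats` times in `span` seconds.

    log: iterable of (timestamp, client, qname, qtype) for answered queries.
    """
    times = defaultdict(list)
    for ts, client, qname, qtype in log:
        times[(client, qname.lower(), qtype)].append(ts)

    suspects = set()
    for (client, _qname, _qtype), stamps in times.items():
        stamps.sort()
        # Slide over the sorted timestamps looking for a dense run of repeats.
        for i in range(len(stamps) - min_repeats + 1):
            if stamps[i + min_repeats - 1] - stamps[i] <= span:
                suspects.add(client)
                break
    return suspects
```

Because source addresses may be spoofed, a flagged address is a candidate for further investigation rather than proof of a misconfigured client.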

Does DNSSEC mean more queries? Zones using DNSSEC will incur longer replies [13], suggesting interesting questions: (1) Will there be more queries/replies per transaction? (2) What portion of DNS resolvers are still bound by the historical 512-byte DNS message size limit? (3) How can we encourage administrators to upgrade past this problem? The OARC measurement framework will allow us to monitor the deployment of both DNSSEC and EDNS0 (which also increases the UDP packet size limit for a given DNS conversation) and to track the effects of DNSSEC on the DNS system.

DNS resolver behavior and its effect on the global DNS. Many aspects of resolver behavior affect the roots and other authoritative servers, so this behavior is of deep interest to the root and TLD DNS operators. Realistic simulation of DNS resolver behavior has received remarkably little attention; almost every application on the Internet uses the DNS, but simulation tools such as ns-2 completely ignore it. CAIDA has built simulation tools [12] that can test fielded resolver implementations against empirical traces of DNS queries in a controlled environment. In particular, we seek more complete answers to questions such as the following: (1) How does the choice of DNS caching software for local resolvers affect query load at the higher levels? (2) How do DNS caching implementations spread the query load among a set of higher level DNS servers? We found that resolvers (caching name servers) use quite different approaches to distribute the query load to the upper levels and also show different levels of respect for TTLs.

Simple performance measurements indicate that the root name servers are heavily loaded, which is somewhat surprising since almost all DNS responses are cachable⁸. Additionally, root server responses have long time-to-live parameters (TTLs), normally about 2 days. Once a DNS client or caching nameserver learns the address of the TLD server for the .com domain, it should not need to ask a root server for that information until the TTL expires. However, to guard against cache poisoning, some implementations (BIND9 and DJBDNS) return to the roots when a record's TTL expires.⁹ Observed traffic confirms these extra queries to the roots and by implication to TLD servers as well.¹⁰ We plan to pursue a more in-depth study of both TTL implications and caching name server behavior, enhancing our simulation software to test against workloads taken from traces at the root servers.
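To make the caching argument concrete, the sketch below replays a query trace through an idealized cache that honors a single fixed TTL and counts how many queries would have to go upstream. It is a toy model written for this discussion, not the CAIDA simulation software [12]; the function name and the trace format are assumptions, and real resolvers differ in exactly the ways this section sets out to study.

```python
def count_upstream_queries(trace, ttl):
    """Count cache misses for an idealized cache with a fixed TTL.

    `trace` is an iterable of (timestamp_seconds, name) pairs sorted by time.
    Every miss stands for one query forwarded to an upstream server.
    """
    expires = {}  # name -> time at which the cached answer expires
    misses = 0
    for ts, name in trace:
        if expires.get(name, float("-inf")) <= ts:
            misses += 1
            expires[name] = ts + ttl
    return misses
```

With one lookup of the same name per minute for an hour, a 2-day TTL (172800 seconds) yields a single upstream query, while a 15-minute TTL yields four.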

Measuring DNS Root Servers Performance and Connectivity. Several groups currently observe DNS behavior using active measurement techniques from a small set of endpoints [21,22,23,24]. These measurements are useful, but they do not scale, since they increase the load on all servers they measure and do not reach all instances of anycast nodes. Since mid-2000 CAIDA has developed novel passive techniques for measuring DNS performance [25,26,27]. The advantages of a passive approach are two-fold: measurements characterize actual rather than test DNS traffic; and no extra packets are injected into the network. Open research questions include to what extent active measurements, be they ICMP, UDP, or TCP, can reflect actual DNS (UDP-based) performance, or whether we can design a passive measurement infrastructure that is sufficient to capture most metrics of import. We expect a hybrid approach to be best.

Visualization. Building on CAIDA's experience with visualization tools [17] and database projects [28], CAIDA hopes to develop useful ways to visualize, both topologically and geographically, the upper layers of the DNS, including anycast nodes, as well as authoritative nameservers for approximately 250 ccTLD domains. For topology identification and analysis we will leverage experience from CAIDA's skitter and iffinder [17] as well as U. Washington's rocketfuel [29] tools for efficiently monitoring topology and identifying IP addresses that are part of the same host.

4.1 Related work

The indispensable role of the DNS in Internet functioning and its unparalleled scale prompted multiple studies of DNS performance per se [30,20,31,32], as well as its contribution to overall web performance [33,34,35]. These studies usually involve measurements from a few locations in the Internet topology and focus analysis on the effects of DNS implementation bugs on caching.

Jung et al. [20] investigated caching by DNS resolvers and proposed a cache-driven simulation to model DNS cache behavior. This work is valuable, but is from a limited perspective (two end user sites) and suggests that a much larger set of data collection and analysis including data from strategic locations (e.g., root and gTLD nameservers) is needed to verify any simulation. The authors found that DNS caching server implementations were overly persistent in the face of failures and that a quarter of the queries sent to the root name servers resulted in a negative response, confirming what was shown in [9]. They also performed trace-driven simulations and concluded that the cacheability of NS records is more important for DNS performance than aggressive caching of A records. While this is true for BIND8 (presumably what they were using at the time) it is not true for either BIND9 or DJBDNS. Their conclusions regarding caching seemed to be based on the cache hit rate, but it is the cache miss rate that causes additional root queries. They found that changing the TTL from 24 hours to 15 minutes increased the cache miss rate by a factor of 6.

Liston et al. [31] identified various DNS performance metrics (completion and success rates of resolving names; mean response time for completed lookups; root and gTLD servers favored by sites; distribution of TTLs across names), and studied location-related variations of these metrics. They obtained measurements from 75 different Internet locations in 21 countries. Liston et al. concluded that the greatest performance enhancements could be achieved by reducing the response time of intermediate-level servers rather than top-level root and gTLD servers.

Somegawa et al. [32] examined server selection mechanisms employed by different DNS implementations (reciprocal algorithm in BIND-8, best server in BIND-9, uniform algorithm in djbdns and Windows 2000)¹¹ as a case study for the general problem of best server placement and selection. They used data collected by Cho et al. [36] and simulated effects of different server selection mechanisms. Somegawa et al. found that the reciprocal algorithm is more suitable for the Internet environment than the other two currently implemented algorithms. They also showed that the proper use of server selection algorithms is essential to stability of the DNS service.
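The three selection policies compared in [32] can be sketched as follows. This is an illustration of the general idea only: it assumes the metric is a smoothed round-trip time in milliseconds, and it does not reproduce the actual code of BIND, djbdns, or Windows 2000. The function names and the rtts mapping are hypothetical.

```python
import random


def select_best(rtts):
    """Best-server policy: always pick the server with the lowest metric."""
    return min(rtts, key=rtts.get)


def select_uniform(rtts, rng=random):
    """Uniform policy: pick every server with equal probability."""
    return rng.choice(sorted(rtts))


def select_reciprocal(rtts, rng=random):
    """Reciprocal policy: pick a server with probability proportional
    to the reciprocal of its metric (assumed here to be an RTT > 0)."""
    servers = sorted(rtts)
    weights = [1.0 / rtts[s] for s in servers]
    return rng.choices(servers, weights=weights, k=1)[0]
```

With two servers at 10 ms and 100 ms, the reciprocal policy sends roughly ten out of every eleven queries to the nearer server while still probing the farther one, whereas the best-server policy never queries the farther one at all.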

Other studies have considered the DNS in conjunction with the more general problem of nearest server selection. Shaikh et al. [37] evaluated the effectiveness of DNS-based server selection. They found that DNS-based schemes typically disable client-side caching of name resolution results. This policy has two negative consequences: a) considerable increase of name resolution overhead for the client, especially when the number of embedded objects, e.g., images and advertisements, served from multiple sources increases; b) growth in the number of queries to authoritative DNS servers and in network traffic incurred by these queries. Shaikh et al. propose modifications to the DNS protocol to improve the accuracy of the DNS-based server selection technique.

Cohen and Kaplan [38] propose and evaluate enhancements to current passive caching of DNS data aimed at reducing user-perceived latency due to DNS query time. They state that the proposed "proactive DNS caching" falls within the framework of the current DNS architecture and can be locally deployed. Their premise is that user time is valuable and therefore it is okay to perform more than one automatic (unsolicited) query in order to refresh the cache and avoid a cache miss.

5 Hardening and securing the DNS

It is a common misconception that one can invest, or otherwise prophylactically fund, one's way out of vulnerability to a DDoS attack. In reality it is impossible to make an Internet-connected server 100% secure against a sufficiently widespread DDoS attack because such an attack may involve an arbitrarily large number of compromised (controlled without the owners' permission) Internet hosts programmed to simultaneously flood a victim with traffic. Even if the server is exceptionally capable, the bandwidth to the server is itself another resource under attack, and investing in a bigger pipe is fruitless against an arbitrarily large number of hosts pelting packets at it.¹² Eradicating every vulnerability is impossible, but with OARC we can at least face them with deeper situational awareness.

Communication vehicle. We neither propose nor intend OARC to be a panacea for the woes extant in the current DNS. On the contrary, in the face of clearly superior munitions, effective communication - getting vital information where it needs to be when it needs to be there - is the only worthy countermeasure. OARC's mission embraces this need, as a clearing-house and communication vehicle for DNS operational issues together with technical support for rational, measured deployment of new protective technologies (e.g., anycast) at the top of the DNS hierarchy. OARC will also participate in the operational security mailing list community (including closed mailing lists we either host or attend), and support a community channel for real-time (and post-mortem) interaction among DNS OARC members and researchers in the face of events affecting the infrastructure.¹³

Database of incident profiles. Some anomalies seen by our capture software (for example, tall spikes of IPv6 traffic in Figure 2) lend themselves to automatic detection. We will build a database of incident profiles that can sometimes be detected and dealt with automatically, whether they stem from implementation errors, configuration errors, or attacks.
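As one hedged example of what automatic detection of a traffic spike might look like, the sketch below compares each interval's packet count with the median of the preceding intervals. The function name, the history length, and the multiplication factor are assumptions chosen for illustration; they are not parameters of our capture software, and a deployed detector would need tuning against real traces.

```python
from statistics import median


def find_spikes(counts, history=10, factor=5.0):
    """Return indices of intervals whose count exceeds `factor` times the
    median of the previous `history` intervals.

    `counts` is a list of per-interval packet counts for one traffic class.
    """
    spikes = []
    for i in range(history, len(counts)):
        baseline = median(counts[i - history:i])
        # Use a floor of 1 so that an all-zero history does not flag
        # every nonzero interval as a spike.
        if counts[i] > factor * max(baseline, 1):
            spikes.append(i)
    return spikes
```

Using the median rather than the mean keeps a single earlier spike from inflating the baseline for the intervals that follow it.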

DNSSEC deployment, community support, and tracking. The IETF is standardizing DNSSEC, a public key cryptographic method of securing and authenticating DNS data, so that a site issuing a query will know that the answer came from the correct server and was complete and correct. The current architecture allows DNS queries to be hijacked and answered arbitrarily, without the querier knowing it has happened. DNSSEC will present new and undoubtedly some unanticipated stresses to the DNS system. Packets will be larger, there will be more queries, the EDNS0 protocol will be used, verification will take longer, and editing zones will be harder since cryptographic signing will be involved. All these issues may impact performance of the DNS system. OARC monitors at the F-root servers and multiple top level domain servers will be well-positioned to assess the impact of these new technologies and identify operational problems. They will also allow us to track the deployment of DNSSEC throughout the DNS hierarchy. Eventually DNSSEC, together with the shared secret cryptography available in DNS for server to server communication,¹⁴ will help secure the information integrity of the entire system.

Working with industry to fix software problems. CAIDA routinely discovers DNS implementation errors or default configuration errors during research studies. Our initial DNS measurement work at F-root [9] revealed that the Microsoft DNS server in Windows 2000 contained an error that sent queries for the IP address of an IP address to the root servers. An IP address is syntactically (dot-separated ASCII strings) the same as a hostname and therefore was not recognized as bad input by the underlying software. These resolvers thus queried for TLDs that were numbers, specifically the last byte of the IP address being incorrectly used in place of a hostname. These malformed A queries accounted for up to 14% of root server load at that time.

CAIDA worked with Microsoft engineers to get the problem fixed; the fix was deployed in a Windows Service Pack several months later. A second study at F-root [11] showed that, while still present, this type of bogus address query had been reduced to 7% of the load on the root servers.
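The malformed queries described above are easy to recognize in a trace because the query name is itself an address, so its rightmost label is numeric. The following minimal sketch shows one way to classify them; the function names are hypothetical, and the check covers only dotted-quad IPv4 names, which is an assumption about the form these queries take.

```python
import ipaddress


def is_address_query(qname):
    """True if the queried name is itself a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(qname.rstrip("."))
    except ipaddress.AddressValueError:
        return False
    return True


def has_numeric_tld(qname):
    """True if the rightmost label of the queried name is all ASCII digits."""
    tld = qname.rstrip(".").split(".")[-1]
    return tld.isascii() and tld.isdigit()
```

No legitimate top-level domain is purely numeric, so a query for which has_numeric_tld is true cannot be answered positively by a root server.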

We also contacted Microsoft to help solve the dynamic update problem, in which their software computed the ``enclosing zone'' to notify of the update but did so recursively, so that if a local server did not accept the update, the name server walked up the DNS naming tree to the roots before giving up. New Microsoft name server code in both XP and .NET will limit how far it walks up the DNS tree and will not update TLDs or in-addr.arpa.

OARC's framework will facilitate such channels of communication between researchers and the vendors or implementors of software.
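The bounded walk up the naming tree described above for dynamic updates can be sketched as follows. This is our own illustration of the idea of stopping before TLDs and in-addr.arpa, written under the assumption that candidate zones are formed by stripping one leading label at a time; it is not Microsoft's code, and the function name and the protected tuple are hypothetical.

```python
def candidate_zones(hostname, protected=("in-addr.arpa",)):
    """List the enclosing zones a client might try to update, from most to
    least specific, stopping before any TLD or protected zone."""
    labels = hostname.rstrip(".").lower().split(".")
    zones = []
    for i in range(1, len(labels)):
        zone = ".".join(labels[i:])
        # Never walk as far as a single-label zone (a TLD) or a
        # protected zone such as in-addr.arpa.
        if len(labels) - i < 2 or zone in protected:
            break
        zones.append(zone)
    return zones
```

For the name pc.lab.example.com this yields lab.example.com and example.com, and the walk ends there instead of continuing to com and the root.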

6 Outreach to research and operational community

CAIDA and ISC will both communicate with the research and operational communities through their web sites, conference talks, workshops, and shared student projects.

6.1 Dissemination of results

We will present research and analysis results to the operational community via the OARC web site and at conferences such as IETF, NANOG, and RIPE. We will also publish results to the academic community via conferences and journals. The measurement software developed will be open source, available to anyone in the community wanting to duplicate our measurement infrastructure. The architecture and configuration of the monitors will be published on our web site.

Other sites wishing to monitor their important DNS servers will benefit from our experience and may copy/adapt our infrastructure to their needs. Other researchers will be able to access OARC data by signing an acceptable use agreement or using appropriately anonymized data.

6.2 Workshops

Each year of the project, CAIDA will host a workshop to bring together the operational and the measurement research communities to summarize efforts from the previous year, stimulate new research ideas, and revisit priorities for the measurement infrastructure. These workshops would be in our ISMA series [39]¹⁵ and focus on different analysis topics each year.

6.3 CAIDA and ISC Internships

Both CAIDA and ISC are rich environments for students. The analysis work at CAIDA offers a unique opportunity for students to gain experience with massive (and messy) datasets that require both sound methodology and efficient management of computing resources to analyze. During summers, CAIDA hosts several graduate students from other institutions in order to acquaint Internet researchers with our data sets and to advance mutually beneficial collaboration. We expect to have 2-4 summer students each year of the grant, at least half of them from institutions other than UCSD. CAIDA also encourages sabbatical visits from industry engineers [40]. We will also promote OARC research projects and internships through Dr. Claffy's 2004 fall graduate seminar.

ISC has opportunities for students, especially undergraduates, who want a taste of research in an intense operational environment. ISC has a history of mentoring advanced undergraduates and exposing them to good software engineering habits, robust network design, and real world software problems and solutions.

7 Summary

As noted in the National Strategy to Secure Cyberspace [1], ``the common defense of cyberspace depends on a public-private partnership,'' a framing tenet for DNS-OARC. Indeed, the DNS-OARC must implement robust communication mechanisms for data collection and correlation, analysis tools to assist experts with real-time response to anomalous behavior, and a trust infrastructure. The DNS-OARC explicitly addresses recommendations in [1] to develop a cyberspace network operations center and support a Cyber Warning Information Network (CWIN) in key government and non-government security-related network operation centers.

We expect that DNS operators will participate in a limited, mutually beneficial data exchange if a neutral, independent, well-trusted, technically capable team coordinates data collection and crisis response and assumes responsibility for aggregating and publishing select statistics, vulnerability reports, and results. The proposed effort allows the CAIDA/ISC team to offer necessary services and tools to encrypt sensitive data, process log files, reduce and visualize large data sets, and provide interactive, access-controlled access to customizable reports. The data collection and analysis methods, services and tools resulting from the proposed research, in conjunction with the relationships of trust and influence built through OARC, will be essential to protecting the robustness of the global Internet. OARC will enhance national and international security by establishing and refining mechanisms for emergency response, and developing new technology for prevention, detection, remediation, and attribution of malicious or otherwise misbehaving agents affecting the DNS. At the same time, the proposed scalable trust infrastructure can serve as a scientifically validated model for other cybersecurity initiatives. Finally, this project has extraordinary synergy with several other NSF sponsored projects, offering considerable leverage of invested resources toward this project as well as toward more general cybersecurity goals.

References

[1]: The President's Critical Infrastructure Protection Board, ``National Strategy to Secure Cyberspace,'' Sept. 2002. http://www.whitehouse.gov/pcipb/.
[2]: ISC. http://www.root-servers.org/.
[3]: M. Kosters, ``Massive scale name management: Lessons learned from the .com namespace.'' TWIST 99 Workshop, UC Irvine, http://www.ics.uci.edu/IRUS/twist/twist99/presentations/kosters/kosters.ppt.
[4]: ``BIND website.'' http://www.isc.org/.
[5]: NLnet Labs. https://www.nlnetlabs.nl.
[6]: D. Bernstein, ``djbdns.'' https://cr.yp.to/djbdns.html.
[7]: Nominum Inc. http://www.nominum.com/.
[8]: ``How to install and configure Microsoft DNS server.'' http://support.microsoft.com/default.aspx?scid=KB;EN-US;Q172953&.
[9]: E. Nemeth, k claffy, and N. Brownlee, ``DNS measurements at a root server,'' in IEEE Globecom, November 2001. https://catalog.caida.org/paper/2001_dnsmeasroot.
[10]: CAIDA, ``DNS analysis: Analysis of the DNS root and gTLD nameserver system: status and progress report.'' https://www.caida.org/projects/dns/status.
[11]: Duane Wessels and Marina Fomenkov, ``Wow, That's a lot of packets,'' in Passive and Active Measurement 2003 (PAM2003), Apr 2003. https://catalog.caida.org/paper/2003_dnspackets.
[12]: Duane Wessels, Marina Fomenkov, Nevil Brownlee and kc claffy, ``Measurements and laboratory simulations of the upper DNS hierarchy,'' in Passive and Active Measurement 2004 (PAM2004), Apr 2004. https://catalog.caida.org/paper/2004_dnspam.
[13]: R. Bush, ``The DNS today: Are we overloading the saddlebags on an old horse?,'' Dec 2000. IETF plenary presentation, San Diego, https://www.ietf.org/proceedings/49/slides/PLENARY-3/sld001.htm.
[14]: Andre Broido, Evi Nemeth and kc claffy, ``Spectroscopy of DNS Update Traffic,'' in Sigmetrics 2003, August 2003. https://catalog.caida.org/paper/2003_dnsspectroscopy_full.
[15]: T. Lee, B. Huffaker, M. Fomenkov, kc claffy, ``On the problem of optimization of DNS root servers' placement,'' in Passive and Active Measurement 2003 (PAM2003), Apr 2003. https://catalog.caida.org/paper/2003_dnsplacement.
[16]: CAIDA. https://catalog.caida.org/paper/.
[17]: CAIDA. https://catalog.caida.org/software.
[18]: Yakov Rekhter, et al., ``RFC1918, Address Allocation for Private Internets,'' Feb. 1996. http://www.faqs.org/rfcs/rfc1918.html.
[19]: C. Perrow, Normal Accidents. 1999.
[20]: J. Jung, E. Sit, H. Balakrishnan, and R. Morris, ``DNS performance and the effectiveness of caching.'' Internet Measurement Workshop 2001, http://www.icir.org/vern/imw-2001/imw2001-papers/89.ps.gz.
[21]: A. Kato, ``JPNIC study on DNS misconfiguration.'' IEPG meeting presentation, July 2002, http://www.potaroo.net/iepg/july2002/.
[22]: G. Michaelson, ``More bad DNS.'' IEPG meeting presentation, July 2002, http://www.potaroo.net/iepg/july2002/.
[23]: E. Lewis, ``DNS lameness.'' IEPG meeting presentation, July 2002, http://www.potaroo.net/iepg/july2002/.
[24]: R. Thomas, ``DNS root name server query response.'' daily updated web page, active measurements, http://www.cymru.com/DNS/dns.html.
[25]: N. Brownlee, ``Using NeTraMet for production traffic measurement.'' Intelligent Management Conference (IM2001), May 2001.
[26]: N. Brownlee and I. Ziedins, ``Response time distributions for global name servers.'' PAM2002 workshop, March 2002, https://catalog.caida.org/paper/2002_nsrtd.
[27]: CAIDA, ``Dns Root/gTLD performance plots.'' https://cgi.caida.org/cgi-bin/dns_perf/main.pl.
[28]: B. Huffaker, ``Unified whois database,'' 2004. https://www.caida.org/tools/newwhois/.
[29]: N. Spring and R. Mahajan, ``rocketfuel: an ISP topology mapping engine,'' 2002. http://www.cs.washington.edu/research/networking/rocketfuel/.
[30]: P. Danzig, K. Obraczka, and A. Kumar, ``An analysis of wide-area name server traffic: a study of the Internet Domain Name System,'' in Proc. ACM SIGCOMM, 1992.
[31]: R. Liston, S. Srinivasan, and E. Zegura, ``Diversity in DNS Performance Measures,'' November 2002. ACM Internet Measurement Workshop.
[32]: R. Somegawa, K. Cho, Y. Sekiya, and Y. S., ``The effects of server placement and server selection for Internet services.'' IEICE 2002 (The Institute of Electronics, Information and Communication Engineers, Japan).
[33]: C. Wills and H. Shang, ``The contribution of DNS lookup costs to web object retrieval,'' 2000. Tech. Rep. TR-00-12, Worcester Polytechnic Institute.
[34]: C. Huitema and S. Weerhandi, ``Internet Measurements: the Rising Tide and the DNS Snag,'' in Monterey ITC Workshop, September 2000.
[35]: G. Chandranmenon and G. Varghese, ``Reducing web latency using reference point caching,'' in Proc. IEEE Infocom, 2001.
[36]: K. Cho, A. Kato, Y. Nakamura, R. Somegawa, Y. Sekiya, T. Jinmei, S. Suzuki, and J. Murai, ``A study on the performance of the root name servers,'' 2003. http://mawi.wide.ad.jp/mawi/dnsprobe/.
[37]: A. Shaikh, R. Tewari, and M. Agrawal, ``On the Effectiveness of DNS-based Server Selection.'' Proc. of IEEE INFOCOM 2001.
[38]: E. Cohen and H. Kaplan, ``Proactive Caching of DNS Records: Addressing a Performance Bottleneck,'' in Symposium on Applications and the Internet (SAINT 2001), 2001.
[39]: CAIDA, ``Internet Statistics and Metrics Analysis Workshops.'' https://www.caida.org/workshops/isma/.
[40]: CAIDA, ``Caidabbatical program.'' https://www.caida.org/about/jobs/#caidabbatical.

Footnotes:

¹NSF ITR solicitation, http://www.nsf.gov/pubs/2004/nsf04012/nsf04012.htm.

²The number of anycast root nodes is likely to have grown by the time the panel sees this proposal.

³We will keep interface speed matched to that of F-root nodes.

⁴10TB, RAID 5, with two data access machines connected with NFS over fiberchannel (redundant fiberchannel hubs) and gigabit ethernet IP network interfaces.

⁵Legitimate top-level domains include country codes such as ".au" for Australia, ".jp" for Japan, or ".us" for the United States, as well as generic domains such as ".com", ".net", and ".edu".

⁶In the 2002 study, these queries would have been classified as "Repeated Qname" or possibly "Referral not cached."

⁷In particular, statistics at A-root may differ because this server receives the bulk of Windows default DNS traffic.

⁸Server failure is not cachable.

⁹This finding contradicts the previous study [20] that encouraged sites to set their TTLs to low values (e.g., 5 minutes) and claimed that only TTLs of NS records are important to effective caching.

¹⁰The TTL associated with a DNS record, or with its non-existence for negative caching, interacts with configuration parameters for positive and negative caching on the resolver. A caching name server will trim the TTL received with a DNS response to enforce the relationships: configuration-neg-cache < record-neg-cache < record-pos-cache.

¹¹The reciprocal algorithm selects a server with a probability reciprocal to a certain metric. The best server algorithm chooses a server with the best metric. The uniform algorithm selects each server with uniform probability.

¹²One advertised countermeasure is to use CDN services such as Akamai to handle distribution of server content. This approach applies to certain but not all applications, and brings its own constraints to the situation.

¹³Initially we plan to deploy a jabber server and offer session and sysadmin support.

¹⁴DNS (BIND) has shared secret cryptography (TSIG) for the authentication between master and slave servers so that zone transfers are protected. DNSSEC is for the authenticity of the zone data itself and is public key cryptography.

¹⁵Internet Statistics and Metrics Analysis, https://www.caida.org/programs/isma/.

File translated from TeX by TTH, version 2.78.
On 24 Feb 2004, 20:11.