All those systems were designed for traffic loads that match the rate and complexity of human-generated requests. A high-end workstation, a cluster of these, or a distributed set of clusters can serve the whole population of Internet users. The service, however, can easily be overwhelmed when streams of repeating requests come from devices about as powerful as the servers themselves. This can happen even when each device produces only a trickle of requests, if large numbers of request streams converge on a few servers. Computing power required to process N streams of rate nu requests per second, variations notwithstanding, is close to
In this paper, we analyze spurious machine-generated traffic that pushes the worldwide Domain Name System (DNS) service to and beyond the edge of its performance constraints. Large fractions of this traffic are completely repetitive and periodic. Classifying and identifying phenomena by their frequencies is known to practitioners of natural science as the art of spectroscopy. We introduced the term Network Spectroscopy to refer to all methods of identifying discrete network components such as links, paths, transmission technologies, device models, operating systems and the like by delay, frequency, period and other kinds of spectra. This notion encompasses various approaches based on phenomenology, deductive reasoning or sound theory, and motivated by different applications [Dovrolis] [Nowak] [Katabi & Blake].
Repetitive requests and, more generally, delay quantization, either in the form of strict periodicity or in delays belonging to a set of equispaced values, are present in many other types of Internet traffic data. We found these phenomena especially important for understanding BGP updates [Andre], broadband end-user traffic [Andre & RyanKing], round-trip times (RTTs) and precision of packet timestamps. We are currently preparing a description of these and related questions which can be analyzed in the framework of Internet spectroscopy.
An enterprise that decides to use IP addresses out of the address space defined in this document can do so without any coordination with IANA or an Internet registry.
and continues:
Indirect references to such addresses should be contained within the enterprise. Prominent examples of such references are DNS Resource Records and other information referring to internal private addresses. In particular, Internet service providers should take measures to prevent such leakage.
In this paper, we examine how the above statement is implemented today and observe that millions of DNS packets are sent daily to nameservers outside private nets requesting or containing information on RFC1918 addresses. DNS records for RFC1918 addresses (and thus updates to these records) are legitimate only within the network on which a host with RFC1918 address resides. They should not appear on the public Internet; they are not unique and are not globally routed.
IP addresses are often assigned dynamically using DHCP (Dynamic Host Configuration Protocol) [RFC1541]. When this is done, the requesting host receives an IP address lease valid for a fixed period of time that is guaranteed to be unique in the local context. But the mapping between the hostname and the IP address may have changed since the host last was active on the network and the DNS records for that host may be incorrect. The Internet Software Consortium's DNS software has had the ability to receive dynamic updates of new address assignments since 1996.
Flaws in the Microsoft (and others) software implementations or configurations have caused these update packets using RFC1918 addresses to leak out to the global Internet and arrive at the root servers -- the top of the Internet naming tree. Initially, the root servers refused the updates and logged the error, but as the load increased, separate servers were deployed to handle just the RFC1918 addresses. This has reduced the spurious update load on the root servers significantly.
We examine these attempted updates to try to determine which operating systems are guilty of leaking private names and addresses onto the global Internet and what configuration can be done to alleviate the problem. Our data source is the log files from an RFC1918 authoritative server, in particular the attempts to dynamically update the reverse DNS records (PTR records) that map from an IP address to a hostname. We also see attempts to update the DNS A records, but in much smaller numbers.
An RFC1918 address can appear in DNS packets as either the source address of the packet or as part of the DNS data inside the packet. In the first case, there is no route back to the sending host and the packet cannot be answered at all. In the second case, the sending host has a valid IP address, but the root servers receiving the packet have no interest in the local RFC1918 address mappings.
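The two cases above can be told apart mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the function names and the labels it returns are illustrative assumptions, not part of any nameserver software or of the tools used for this analysis.

```python
import ipaddress

# The three private blocks defined in RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_rfc1918(addr: str) -> bool:
    """Return True if addr lies inside one of the RFC 1918 blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in block for block in RFC1918_BLOCKS)

def classify_update(source_ip: str, updated_ip: str) -> str:
    """Classify an update by where the private address appears.

    The labels are hypothetical names for the two cases in the text.
    """
    if is_rfc1918(source_ip):
        return "private-source"    # case 1: no route back to the sender
    if is_rfc1918(updated_ip):
        return "private-payload"   # case 2: routable sender, private data
    return "public"
```

For example, `classify_update("24.1.2.3", "192.168.0.1")` falls into the second case: the sender is reachable, but the record it tries to update is meaningful only on its own network.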
The ambiguity in the status of RFC1918 addresses (legitimate only within the scope of the local organization) results in DNS software being unable to deny all RFC1918 updates a priori, since this will disrupt operation of internal networks. Software misconfiguration and incorrect default behavior allow local nameservers to send information about hostnames in their RFC1918 address blocks to the root servers.
Dynamic update packets for RFC1918 addresses are generated by DHCP servers on networks where private addresses are used internally (see NANOG discussion [NANOG]). DHCP servers periodically assign and renew hosts' IP addresses on their networks. We have run small-scale (about 500 hosts at a time) attempts to identify the operating system of the DHCP server, but the fact that a fully patched Windows 2K or XP system currently shows up as "unknown" hampers these efforts. At a recent IEPG (Internet Engineering Planning Group) meeting, many other DNS-based misconfigurations were documented [1].
The analysis presented here extends CAIDA's earlier work on measurement, performance and placement of DNS root servers [2] [3] [4] and on the use of private and unrouted addresses [5]. In particular, [2] and [3] discuss the vast extent of DNS misconfiguration that manifests itself in queries reaching the root servers.
In the sections that follow, we describe our data source and then examine the prevalence of update attempts to the various RFC1918 address blocks. We also identify the sources of these update attempts and categorize them by continent, country, and ISP or organization. The log data provides timestamps and we use these to determine periodicity in the updates. Finally, we coalesce all this information to fingerprint the guilty operating systems and suggest configuration changes to ameliorate some of the damage.
The data presented here is obtained from hazel.isc.org, an authoritative server for RFC1918 addresses that is located near F-root in Palo Alto, California. Hazel is part of the AS112 project (http://as112.net). Whenever a nameserver tries to update a root server with data about an RFC1918 address (like 192.168.0.1), it is told the machine at 192.175.48.1 (hazel) has authority for this zone and should be contacted instead. This is called a referral; Hazel is then contacted with the update request. Hazel logs the request and returns an update denied answer to the sending host. The log record includes a timestamp, source IP address, source port, and the RFC1918 zone to be updated. The timestamp has 1 millisecond resolution. This is a finer resolution than that typically used with nameserver software and will allow us to study interarrival times in detail.
Hazel is actually multiple machines on the network 192.175.48.0/24, an IANA reserved address block. All dynamic updates are referred to the machine with host byte .1, while queries go to .6 and .42. The route to the AS112 network, 192.175.48.0/24, is carried by most networks worldwide and by all networks participating in the University of Oregon RouteViews project [RouteViews]. The AS112 netblock is globally reachable. It is an anycast block in the sense that there are several places in the Internet where machines are assigned these three addresses; the routing system chooses the one closest to the sender. All of our measurements are from hazel, the instance of a server at address 192.175.48.1 that is run by Paul Vixie at the Internet Software Consortium (isc.org). Any ISP intending to run servers to confine RFC1918 updates to its own network is encouraged to use those same three IP addresses for its RFC1918 servers [Vixie, NANOG].
Our analysis uses two data sets: one collected May 28 to June 4, 2002 and the other collected from July 4 to July 30, 2002. We have monitored hazel since April, 2002, but operational issues have interfered a bit and our largest continuous stream of data at the time of this writing is 26 days in July. Hazel logs about 1/2 gigabyte every 8 hours.
Recall that the major RFC1918 reverse DNS zones are:
168.192.in-addr.arpa for 192.168.0.0/16
10.in-addr.arpa for 10.0.0.0/8
16-31.172.in-addr.arpa*** for 172.16.0.0/12
*** footnote: The log files contain entries for each of the /16 networks making up the 172.16.0.0/12 block; we aggregate the results over the whole /12 block in the analysis that follows.
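The mapping from a private address to the reverse zone it belongs to can be sketched as follows. This is an illustration only, assuming IPv4 dotted-quad input; `rfc1918_reverse_zone` is a hypothetical helper, while `reverse_pointer` is the standard-library attribute that yields the full PTR name.

```python
import ipaddress

def rfc1918_reverse_zone(addr: str):
    """Return the RFC 1918 reverse zone covering addr, or None if addr is public."""
    ip = ipaddress.ip_address(addr)
    octets = str(ip).split(".")
    if ip in ipaddress.ip_network("10.0.0.0/8"):
        return "10.in-addr.arpa"
    if ip in ipaddress.ip_network("172.16.0.0/12"):
        # One zone per /16, i.e. 16.172.in-addr.arpa through 31.172.in-addr.arpa.
        return f"{octets[1]}.172.in-addr.arpa"
    if ip in ipaddress.ip_network("192.168.0.0/16"):
        return "168.192.in-addr.arpa"
    return None

def ptr_name(addr: str) -> str:
    """Full name of the PTR record for addr (octets reversed, in-addr.arpa suffix)."""
    return ipaddress.ip_address(addr).reverse_pointer
```

An update for 192.168.0.1 thus targets the record `1.0.168.192.in-addr.arpa` inside the zone `168.192.in-addr.arpa`.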
To see the extent of the problem, we computed the number of distinct hosts that were sending at least one DNS update packet toward hazel, our authoritative server for the RFC1918 zones. The count of these hosts over time is shown in Figure 1 below.
Figure 1: Update Attempts over Time. RFC1918 in-addr.arpa reverse zones, July 2002 (26 days), Palo Alto, CA. (x-axis: Time, day in July, 2002)

Figure 1a is necessarily a monotonically increasing function. It
increases steadily over the 26 days of July
and approaches the square root function.
Figure 1b shows this same data, but relative to the number of updates,
rather than time. Both curves show that a very large number of hosts are
contributing to the problem, not simply a few badly broken systems.
The distributions over time and over number of updates are smooth,
indicating many small contributions rather than a few large spikes.
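A curve of distinct hosts against both time and number of updates can be computed in a single pass over the log. The sketch below assumes each log record has already been reduced to a `(timestamp, source_ip)` pair sorted by time; that record layout is an assumption made for illustration, not the actual log format.

```python
def cumulative_distinct_hosts(records):
    """Return (timestamp, updates_so_far, distinct_hosts_so_far) points.

    records: iterable of (timestamp, source_ip), sorted by timestamp.
    A point is emitted each time a previously unseen host appears, so the
    third component is monotonically increasing by construction.
    """
    seen = set()
    curve = []
    for n_updates, (ts, ip) in enumerate(records, start=1):
        if ip not in seen:
            seen.add(ip)
            curve.append((ts, n_updates, len(seen)))
    return curve
```

Plotting the third component against the first gives the view over time, and against the second gives the view over number of updates.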

Figure 2 shows the persistence of the updating hosts over the 26 day period (624 hours) in July, 2002. The x-axis is the duration of updates where duration is measured by looking at the first update from a host and the last update from that same host and computing the update interval. The y-axis is the fraction of updates from hosts that were sending update packets over a particular duration. About 60% of the updates came from hosts that were updating for the whole measurement period.
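The duration measure described above can be sketched as follows, again assuming hypothetical `(timestamp_seconds, source_ip)` records. The helper names are illustrative; the second function returns the share of all updates that came from hosts whose first and last updates are at least `min_duration` seconds apart.

```python
def host_series(records):
    """Map each source IP to (first_timestamp, last_timestamp, update_count)."""
    stats = {}
    for ts, ip in records:
        if ip in stats:
            first, last, count = stats[ip]
            stats[ip] = (min(first, ts), max(last, ts), count + 1)
        else:
            stats[ip] = (ts, ts, 1)
    return stats

def fraction_of_updates_with_duration(stats, min_duration):
    """Fraction of updates sent by hosts whose series lasted >= min_duration."""
    total = sum(count for _, _, count in stats.values())
    if total == 0:
        return 0.0
    long_lived = sum(
        count for first, last, count in stats.values()
        if last - first >= min_duration
    )
    return long_lived / total
```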
Figure 2: Duration of Update Series. RFC1918 in-addr.arpa reverse zones, July 2002 (26 days), Palo Alto, CA.

We now look at the update behavior in more detail separating each of the 3 address blocks in RFC1918 space and also using finer time granularity.
IP addresses in the old ARPAnet range, 10.0.0.0/8, are often used in corporate environments; in particular, many VPNs are numbered in 10-net space. This space is presumably managed by professional system administrators, resulting in fewer instances of address leakage. The 192.168.0.0/16 block is often used by manufacturers of networking gear for home and small office use: NATs, firewalls, DSL "routers", combinations thereof and sundry boxes, the multitude of which escapes classification. These devices either have manufacturer's defaults that assign 192.168.0.0/16 addresses to the LAN computers, or their instruction manuals advise users to set up addresses in that range. 172.16.0.0/12 does not enjoy the same level of popularity and, as expected, has fewer RFC1918 updates. It is used by some universities [6] for internal routing.
We examined the attempted updates per address block during a 3 1/2 day period in our May-June data set. Table 1 below shows the number of attempted updates per RFC1918 address block and Figure 3 plots their distribution over the data collection interval.
Table 1: Number of Updates by DNS Zone
Start: 01-Jun-2002 06:28:35.835
End:   04-Jun-2002 20:58:34.648

DNS zone                 #Updates   Percentage
----------------------------------------------
168.192.in-addr.arpa     35055154   68.3%
10.in-addr.arpa          12391040   24.2%
16.172.in-addr.arpa       3834284    7.5%
Total updates:           51370999

Figure 3: Update Attempts per Minute. RFC1918 in-addr.arpa reverse zones, DNS root server f.root-servers.net.

Figure 3 shows the number of updates arriving per minute for IP addresses in each RFC1918 block. Note that the baselines of each plot are in general agreement with the share of updates for each block given in Table 1 above.
We see a periodic diurnal and weekly pattern which we discuss in detail in the following section. Figure 4 uses per second granularity to detail the 9AM spike on June 3.
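Per-minute and per-second counts of this kind come from binning the logged timestamps. A minimal sketch, assuming timestamps are given as seconds since some epoch (the function name and input format are illustrative):

```python
from collections import Counter

def updates_per_bin(timestamps, bin_seconds=60):
    """Count updates per time bin.

    timestamps: iterable of arrival times in seconds.
    Returns a sorted list of (bin_start_seconds, count); empty bins are omitted.
    """
    counts = Counter(int(ts // bin_seconds) for ts in timestamps)
    return sorted((b * bin_seconds, n) for b, n in counts.items())
```

Calling it with `bin_seconds=60` gives a per-minute series, and with `bin_seconds=1` the per-second granularity used to resolve an individual spike.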

Figure 4: 9AM Spike of DNS Update Attempts. RFC1918 reverse zones, Palo Alto, CA.
Many systems attempt to update the nameserver at 9:00. Due to the lack of clock synchronization, their updates are spread over 6 minutes. This is good news for the nameserver, which has to process 107,000 updates in those 6 minutes, an average of roughly 300 updates per second. Proper clock synchronization would concentrate this load into a much shorter interval and have an overwhelming negative impact on the system. The other whole-hour boundaries where spikes tend to occur show similar results.
Table 2: Update Packets from RFC1918 Sources

RFC1918 block    #IPs   #Updates
--------------------------------
10/8             1554     216408
172.16/12         589      34844
192.168/16       1234     297734
Total            3378     548995

While this is a small fraction (1%) of the total update attempts in the data set, it has slightly different characteristics. The average number of updates per IP address is 162, double that (81 updates/IP) for the whole sample. This is not surprising, because the non-uniqueness of the senders' source addresses causes data from different machines to be coalesced and results in an undercount of the hosts leaking their private addresses onto the Internet. This will not influence our conclusions, because the number of updates contributed by this segment of the host population is so small. If each host in this group were to contribute 80 updates, as in the rest of the sample, the total number of RFC1918-addressed hosts would increase to about 7,000, which is less than 1% of all IP addresses (1.2M) observed.
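The figures in this paragraph follow from simple ratios of the totals quoted in the text. The short check below uses only those published totals; the variable names are illustrative.

```python
# Totals quoted in the text and tables.
rfc1918_updates = 548995    # updates sent from RFC1918 source addresses
rfc1918_ips = 3378          # distinct RFC1918 source addresses
all_updates = 98100157      # all updates in the data set
all_ips = 1204150           # all distinct source addresses

per_ip_private = rfc1918_updates / rfc1918_ips   # about 162 updates per address
per_ip_all = all_updates / all_ips               # about 81 updates per address

# If every private-addressed host behaved like the rest of the sample
# (about 80 updates each), this many hosts would be needed:
estimated_hosts = rfc1918_updates / 80           # just under 7,000
share_of_all_ips = estimated_hosts / all_ips     # well below 1%
```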
Table 3 below shows the RFC1918 IP addresses that sourced the largest numbers of updates over the 3.6 day period of measurement.
Table 3: Most Popular RFC1918 Addresses Seen

IP address       #Updates
-------------------------
192.168.0.186       29196
192.168.206.2       14686
192.168.19.6        11294
192.168.0.1         10866
10.0.0.1            10335
10.0.1.1            10298
10.44.72.110        10148
192.168.0.2          9056
10.191.1.2           7842
192.168.50.31        5679
...

Addresses like 192.168.0.1 or 10.0.0.1 are popular because they are first in their block. Addresses like 192.168.0.186 are likely to be either a misconfigured host spewing lots of update traffic or a default address assigned by a DSL or cable modem manufacturer. The first address from the 172.16 block is number 27 in the list, and contributes 2632 updates. The first 36 addresses in the list all have counts over 2000 and contribute a total of over 200K updates (36.6% of the total.)
In this section we explore the source of the update attempts and try to classify them by their layer 3 attributes such as IP address, port, network prefix and autonomous system (AS).
One way to tackle this problem is to see whether the machines are in address space allocated to end users or to corporations. Traditionally class B space was allocated to universities and medium sized businesses. Many class B allocations happened before allocations in class C space and the upper half of class A space. Figure 5 below shows the distribution of IP addresses that are the source of update packets. The bands of points correspond loosely to IP address allocation policies.

Figure 5: IP Addresses Responsible for DNS Update Attempts. RFC1918 reverse zones, Palo Alto, CA. (x-axis: Source IP address, first byte)

Figure 5 shows the counts of updates originating in IP addresses that have one byte in common (/8s, squares) or two bytes in common (/16s, dots). The largest individual /8 contribution comes from the 24.0.0.0/8 block that is the cable companies' traditional address space. Many newer allocations with first byte between 60 and 68 also belong to broadband end-user connectivity providers. These users have a connection with a large enough bitrate to put multiple computers on an internal network and typically use RFC1918 addresses for numbering their private networks since providers charge extra for real IP addresses. Not being professional system administrators, they are likely to use whatever defaults are provided by the vendors of their operating systems. This suggests an explanation for the prominence of update counts in those blocks.
Class      #IPs   Percent   #Updates   Percent
----------------------------------------------
A        507541      42.2   47263887      48.2
B         65177       5.4    3048519       3.1
C        631432      52.4   47787751      48.7
Tot     1204150     100.0   98100157     100.0

Class B networks are rarely the source of RFC1918 updates
(only 5% of all IP addresses and
3% of all updates come from class B sources).
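A breakdown by traditional address class depends only on the first byte of the source address. A minimal sketch, assuming dotted-quad IPv4 strings and the classic classful boundaries (first byte 0-127 for A, 128-191 for B, 192-223 for C); how the original analysis treated addresses outside these ranges is not stated, so the "other" label is an assumption.

```python
from collections import Counter

def address_class(addr: str) -> str:
    """Return the traditional class (A, B, C) of an IPv4 address."""
    first = int(addr.split(".")[0])
    if first < 128:
        return "A"
    if first < 192:
        return "B"
    if first < 224:
        return "C"
    return "other"   # multicast and reserved space

def class_counts(addresses):
    """Tally a collection of source addresses by class."""
    return Counter(address_class(a) for a in addresses)
```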
The major established registries are APNIC (Asia-Pacific), RIPE (Europe, Middle East, and the former Soviet Union) and ARIN (the rest). As of mid-2002, NICs (Network Information Centers) for Latin America and Africa are in the process of becoming fully operational.
To analyze updates per registry region, we used the tables of allocated address blocks dated April 1, 2002, available from ARIN (ftp://ftp.arin.net/pub/stats/). We converted all IP addresses that attempted to update hazel to their respective countries of origin and continents. With minor exceptions, the blocks in the tables are unique and those few that are common to two registries have the same country information (for details, see [5].)
Before the existence of RIPE and APNIC, ARIN allocated address blocks to Asian and European countries. We included these with the RIPE and APNIC data. Some IP addresses assigned to companies registered in one country and having equipment in another may be misplaced through the use of registries' tables, but their number is very small. (CAIDA geographic studies [7] [8] are usually done with Netgeo or its commercial counterpart, Ixia's IxMapper, which try to resolve these ambiguities.) However, these tools are not universally available, and using a publicly accessible source for address mappings makes our analysis clearer. We compare the accuracy of the two methods of IP to continent identification in a later section; the registry method is sufficient for our analysis.
Table 4 below shows the number of sources (IP addresses) and the number of attempted updates generated by those sources for each registry area.
Table 4: Hosts and Update Attempts by Continent
Total hosts: 1204150
Total updates: 98100157

Region     #Hosts   Percent   #Updates   Percent
------------------------------------------------
America    327616   27.2%     49029151   50.0%
Asia       372974   31.0%     25041172   25.5%
Europe     484227   40.2%     22314423   22.7%
Unknown     19345    1.6%      1541059    1.6%
Total     1204162             97925805

The column sums (1204162 hosts, 97925805 updates) differ slightly from the data set totals quoted above (1204150 hosts, 98100157 updates).

Unknown addresses include RFC1918 sources discussed in the previous section and IP addresses not found in the registries' lists of allocated blocks. At least one matching IP block is missing from the ARIN table, even though it is present in Whois databases and contains IPs with assigned DNS names.
Figure 6, below, shows different regional patterns of diurnal and weekly variation in the flow of RFC1918 updates. It is a mixture of singular spikes and smooth periodic patterns; the spikes are probably automatically generated, for example by many DHCP leases expiring at the same time, while the smooth swells are more likely human-related events such as turning on the computer at the beginning of the day.



As the above plot shows, the spikes coincide with midnight in the time zones of major centers of Internet user population. We checked two such midnights and found that samples taken in the first 6 minutes after midnight on the US East and West Coasts tend to underrepresent companies located outside the US. For example, Telstra (AS 1221) occupies 4th place among update sources in the July 27 nighttime traffic (3% of all IPs). In the midnight EDT and PDT samples it is found in 11th place (2% of all IPs). Swiss IpPlus moves from 7th place to 23rd and 19th place, respectively. American and Canadian ISPs move up at the same time, but this move appears to be only marginally dependent on whether they serve one or both US coasts, and which of them. The most prominent ASes in both the EDT and PDT samples are Pacific Bell and Southwestern Bell, which are closest to the server at ISC, Telus from Canada (serving the West Coast and Ontario regions) and Cablevision from New York state. ASes like Megapath (AS 23215), XO (AS 2828), Earthlink and Bell Advanced (Canada) occupy high positions in both midnight samples as well. In any event, it does not look as if the midnight spike is caused by a few ISPs resetting all their DHCP leases.
Figure 6: Update Attempts per Minute by Registry. RFC1918 in-addr.arpa reverse zones, DNS root server f.root-servers.net.
The large spikes of updates occur near midnight in the various time zones where many Internet users are located. We see four in America an hour apart with the east and west coasts dominating the middle of the country, three in Asia and two in Europe - one in Britain and another in the rest of Western Europe.
The smooth patterns closely resemble the weekly life cycles of individuals in the respective countries. These updates appear to occur at times when people turn on their computers. In particular, update activity in Asia and Europe has a much sharper rise at the beginning of a business day than in America. This may be caused by the larger number of time zones in America, but may also reflect more uniform daily behaviour of people in Asia and Europe.
The weekend in Europe is characterized by much lower Internet use than on weekdays, whereas the activity pattern in Asia is not much different between weekends and weekdays. This may be influenced by countries where Saturday remains a (mandatory or voluntary) working day. A large surge of Asian activity is associated with the onset of Monday's business hours.
As this is the first Monday of the month, this suggests that vacations in some countries may be scheduled on a whole-month basis. We also see two abrupt transients in the European updates on May 29 and early May 30. Those are most likely associated with routing changes that influence which of the anycast servers is closest to the source. BGP routing makes decisions based largely on the length of the AS path to a destination. A change in AS path length, even if it is the result of path prepending (a common practice in traffic shaping), will influence the choice of anycast server at any particular update source.
The frequency of updates per IP source address ranges widely between regions. American sources generate about 150 updates per IP in a week; Asian sources generate 67 and European 46. Assuming that misconfigured DNS servers are equally frequent in different regions, the larger updates-per-IP ratio for America suggests the number of computers on networks behind DHCP servers is larger.
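These per-region ratios can be reproduced directly from the host and update counts in Table 4. The short check below uses only those published counts; the dictionary layout is illustrative.

```python
# (hosts, updates) per registry region, taken from Table 4.
table4 = {
    "America": (327616, 49029151),
    "Asia": (372974, 25041172),
    "Europe": (484227, 22314423),
}

# Average number of update attempts per source IP address in each region.
updates_per_ip = {
    region: updates / hosts for region, (hosts, updates) in table4.items()
}
```

Rounded to whole numbers, the ratios come out at 150, 67 and 46 for America, Asia and Europe respectively, matching the values quoted above.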
Table 5: IxMapper -- Update Attempts by Continent

Region           # Hosts   Percent
----------------------------------
Europe            453415   37.6%
North America     340285   28.3%
Asia              332321   27.6%
Oceania            31687    2.6%
South America       5497    0.46%
Africa              2076    0.17%
Unknown              164    0.01%
Unresolved         38716    3.2%
Total            1204162

If we aggregate the IxMapper data using the registry classifications (Europe, Asia, America and Unknown), we get 30.2% for Asia (Asia + Oceania), 29% for America (North + South America + Africa), and 3.2% unresolved (including RFC1918 source addresses). The per-region percentages of IP addresses inferred from the registry data shown in Table 4 and those from IxMapper agree to within a few percent, except for unresolved addresses. Almost all registered addresses unresolved by IxMapper are in America (167 blocks, 33K addresses). Despite that fact, the IxMapper count of IPs in the Americas and Africa is larger than the registries' count by about 20K. There are 14K addresses which are not present in the April 1 registry tables, but which IxMapper locates in Japan (7K), Italy (2.8K), the US (1.6K) and the rest mostly scattered in Europe. This again points to the incompleteness of some of these registry tables.
We used IxMapper to identify the country where the update attempts originated; the top 10 countries are shown in Table 6 below.
Table 6: IxMapper -- Update Attempts by Country

Country           # Hosts   Percent   Cumul.%
---------------------------------------------
USA                320981      27.5      27.5
China              147776      12.7      40.2
Japan              126960      10.9      51.1
Switzerland         73630       6.3      57.4
United Kingdom      59595       5.1      62.5
Netherlands         58077       5.0      67.5
Germany             53559       4.6      72.1
Austria             50802       4.4      76.5
Spain               46270       4.0      80.5
France              37432       3.2      83.7

The next 10 countries are Australia, Portugal, Italy, Taiwan, Canada, Hong Kong, United Arab Emirates, South Korea, Poland and Belgium. The top 20 countries account for over 95% of the sources of updates.

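44.3% of all updates come from source ports in the range 1024-5000. The upper end of this range forms a sharp edge: 17 times more updates come from port 5000 than from port 5001. The entropy of the source port distribution is 14.9 bits, close to the maximum of 16 bits that a uniform distribution would give.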
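The entropy figure refers to the Shannon entropy of the empirical distribution of source ports. A minimal sketch of that computation, assuming the ports are available as a plain sequence of integers (the function name is illustrative):

```python
import math
from collections import Counter

def shannon_entropy_bits(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A sample in which all 65536 port numbers occur equally often yields the 16-bit maximum, while a sample drawn from a single port yields zero; a value of 14.9 bits therefore indicates a distribution that is broad but not uniform.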
Table 7: Top 20 AS sources of RFC1918 updates

AS#     #Updates   Percent   Cumul.%   AS Name, Country
-------------------------------------------------------
4134     7329178      7.51      7.51   CHINALINK, China
3352     6166266      6.32     13.84   Ibernet (TDE), Spain
7132     4559748      4.67     18.51   SW Bell, US
5673     3271669      3.35     21.86   Pac Bell, US
5676     2936073      3.01     24.87   Pac Bell, US
4813     2765227      2.83     27.71   China Telecom (Guandong)
4812     2644362      2.71     30.42   China Telecom (Shanghai)
852      2176242      2.23     32.65   Telus, Canada
6128     2083593      2.14     34.79   Cablevision, US
2828     1855065      1.90     36.69   XO, US
11427    1753091      1.80     38.49   Road Runner, US
7843     1504131      1.54     40.03   Adelphia, US
4760     1413921      1.45     41.48   Netvigator, Hong Kong
2914     1393102      1.43     42.90   Verio, US
1221     1378306      1.41     44.32   Telstra, AU
11509    1226816      1.26     45.58   Pajo, US
4436     1142608      1.17     46.75   SantaCruz Community I't, US
11426    1135058      1.16     47.91   Road Runner, US
10994    1129898      1.16     49.07   Time Warner, US
2548     1091393      1.12     50.19   Business Internet, US

We see that more than half of the updates come from 20 ASes, which is only 0.6% of the total number of autonomous systems. At that aggregation level, RFC1918 update traffic is clearly dominated by elephants. The largest numbers come from incumbent telecom carriers for their respective regions, and from cable companies. Backbone ISPs produce fewer updates. This is not surprising, since these ISPs cater mostly to medium and large business customers, who often have their own AS number, have fewer but larger networks, and use globally unique addresses. Even when these corporations use RFC1918 space, their networks are more likely to be properly configured. The cable and DSL companies charge for globally unique addresses, which encourages customers to use RFC1918 addresses internally, thus creating more potential for leakage. Countries, such as China, that are relatively late in joining the Internet have trouble getting enough global address space allocated from the registries.
In terms of the IP addresses of the hosts sending the update requests, the bias is even higher. The 20 top ASes contain over 54% of all IP addresses from which updates were sent. See Table 8.
Table 8: Top 20 ASes Updating RFC1918 Zones, by #Hosts

AS     # Hosts   Percent   Cumul.%   Name, Country
--------------------------------------------------
4134     74758    6.2262    6.2262   CHINALINK, China
3352     47647    3.9683    10.195   Ibernet (TDE), Spain
3303     47379    3.9460    14.141   Swisscom IP-plus, Switzerland
7132     46445    3.8682    18.009   SW Bell, US
4713     44828    3.7335    21.742   NTT Communications, JP
5673     41129    3.4254    25.168   Pac Bell, US
4813     40379    3.3630    28.531   China Telecom (Guandong)
5388     37874    3.1544    31.685   Energis Squared, UK
8447     33079    2.7550    34.440   TELEKOM-AT, Austria
3209     26932    2.2430    36.683   Arcor, Germany
4812     26720    2.2254    38.908   China Telecom (Shanghai)
1221     26106    2.1742    41.083   Telstra, AU
5676     25183    2.0974    43.180   Pac Bell, US
3215     23774    1.9800    45.160   France Telecom, France
4355     21949    1.8280    46.988   EarthLink, US
4760     20428    1.7014    48.689   Netvigator, Hong Kong
8737     18674    1.5553    50.245   Planet Media, Netherlands
3462     18094    1.5070    51.752   GSA Data Communications, US
6730     17210    1.4333    53.185   Sunrise, Switzerland
4732     11125    0.92655   54.112   Dion KDDI Japan
Note that the largest update contributors in terms of number of updates have only 9 ASes in common with the largest contributors in terms of the number of hosts sending the updates.
Note that the DNS names discussed here belong to the routed (globally unique) IP addresses from which the updates were sent. The DNS server logs from hazel contain the IP address of the updating host and the RFC1918 zone that the packet attempts to update, but no details on the update payload.
The following DNS names are present in half (54%) of the 222364 source IP addresses observed on May 16, 2002.
$ IPs Fraction ccdf P(X>=x) 17375 7.813765e-02 1.000000e+00 rr.com 14063 6.324315e-02 9.218624e-01 pacbell.net 12918 5.809394e-02 8.586192e-01 nombres.ttd.es 9012 4.052814e-02 8.005253e-01 swbell.net 5622 2.528287e-02 7.599971e-01 optonline.net 4794 2.155925e-02 7.347143e-01 interbusiness.it 4140 1.861812e-02 7.131550e-01 netvigator.com 4081 1.835279e-02 6.945369e-01 tin.it 3878 1.743987e-02 6.761841e-01 pol.co.uk 3850 1.731395e-02 6.587442e-01 highway.telekom.at 3675 1.652696e-02 6.414303e-01 bigpond.net.au 3559 1.600529e-02 6.249033e-01 mindspring.com 3548 1.595582e-02 6.088980e-01 libero.it 3405 1.531273e-02 5.929422e-01 adelphia.net 2785 1.252451e-02 5.776295e-01 arcor-ip.net 2627 1.181396e-02 5.651050e-01 attbi.com 2562 1.152165e-02 5.532910e-01 rima-tde.net 2435 1.095051e-02 5.417694e-01 dialup.online.no 2294 1.031642e-02 5.308188e-01 dial.wxs.nl 1983 8.917810e-03 5.205024e-01 telus.net 1978 8.895325e-03 5.115846e-01 turboline.skynet.be 1864 8.382652e-03 5.026893e-01 snet.net 1782 8.013887e-03 4.943066e-01 megapath.net 1716 7.717077e-03 4.862927e-01 shawcable.net 1657 7.451746e-03 4.785757e-01 dsl.net 1613 7.253872e-03 4.711239e-01 direcpc.com 1488 6.691731e-03 4.638701e-01 cox.netAgain, as in the AS contribution analysis, relatively few (23) second-level domain names account for more than half of the hosts originating updates.
The webpages of these organizations reveal that they are almost exclusively cable and DSL providers.
In addition, DNS names containing one of the words: catv, cable, client, cust, dial, direc, dsl, host, hsia ("high-speed Internet access"), nat, online, pool, port, are present in 113847 (51.2%) of the DNS names of hosts attempting to update the RFC1918 zones.
We examined all 222K pairs of source IP addresses and corresponding DNS names and found that the full DNS name quite often contains a numeric IP address in decimal notation. The values of individual bytes are usually connected by dashes. Hexadecimal addresses are also used, albeit less often.
More than 60% of all DNS names in the data contain 7 or more digits. When dots and dashes are viewed as field separators, 98776, or 44%, of the names contain at least 4 fields of digits, and 114333, or 51.4%, of the names contain at least two fields which are just digits. This indicates that many, if not most, of the DNS names present in RFC1918 updates are generated automatically from IP addresses, or from internal customer IDs.
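The digit counts described above can be illustrated with a short sketch. It is not the script used for the analysis. It assumes that "fields of digits" means fields containing at least one digit, and the function name and the example host names are hypothetical.

```python
import re
import string

ASCII_DIGITS = set(string.digits)

def digit_stats(name):
    """Return (digits, fields_with_digits, all_digit_fields) for one DNS name.

    Dots and dashes are treated as field separators. Counting a field as a
    "field of digits" when it contains at least one digit is an assumption.
    """
    fields = [f for f in re.split(r"[.-]", name) if f]
    digits = sum(ch in ASCII_DIGITS for ch in name)
    fields_with_digits = sum(any(ch in ASCII_DIGITS for ch in f) for f in fields)
    all_digit_fields = sum(all(ch in ASCII_DIGITS for ch in f) for f in fields)
    return digits, fields_with_digits, all_digit_fields

# Hypothetical name with an embedded dotted-quad written with dashes.
print(digit_stats("adsl-64-123-45-67.dsl.example.net"))
```

A name such as the hypothetical one above would be counted in all three groups: 7 or more digits, at least 4 fields with digits, and at least two purely numeric fields.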
106: No response
77: Unknown, not Microsoft Windows
56: Windows 2k. SP1, SP2/Windows XP
47: Windows Based. Open/Net/FreeBSD/DG-UX/HP-UX 10.x etc
33: Novell (FreeBSD 4.3-current(?))
31: Ultrix!HPUX 10.20(?)
16: 3Com SuperStack II Switch SWNBBSI-CF,11.1.0.00S38 | Nokia IPSO
+3.2-2.3.1 releng 783-849 | Ricoh Aficio AP4500 Network Laster Printer |
+Linux 2.0.x/2.2.x/2.4.x | Shiva AccessPort Bridge/Router Software V.2.1.0
11: OpenBSD 2.4-2.5!NetBSD 1.5, 1.4.1, 1.4
10: AIX
5: Windows NTsp4+
4: Windows 95
4: Linux 2.2.x/2.4.5+ kernel
3: Cisco IOS 11.x-12.x
2: Little endian BSDI/NetBSD 1.1.x-1.2.x! MacOS X 1.0-1.2
2: HPUX 10.x
1: Windows ME
1: Unknown Unix (Accuracy dropped) or MacOS X
1: ULTRIX
1: NetBSD
1: Linux kernel 2.2.x! 2.4.x! assumed.
1: IBM OS/390
413 Total
Although we could not resolve the ambiguity in the fourth largest
count (Windows Based. Open/Net/FreeBSD/DG-UX/HP-UX 10.x etc), it
appears that there is no dominant operating system in the set.
Note that IP addresses in this sample were not weighted by their
number of updates. However, when we did that in previous experiments
we got a qualitatively similar picture.
Machines that did not respond were presumably on xDSL or cable modem
connections and had simply been turned off.
Microsoft Windows based platforms were recognized in 66 instances out of 413 (16%). If in addition we assume that about half of the "Windows-Unix" group is in fact Windows, their number increases to 90, or 22%. Furthermore, if the statistics of the non-responding destinations are similar to those of the responding ones, dividing the assumed number of Windows boxes by the number of responses gives about 30% Windows boxes. This gives a rough idea of how many Windows boxes sent these updates. Unix boxes make up a comparable number of systems with the same (factor of 2) degree of uncertainty. Notably missing from the list are Apple systems, but through Mac OS 10.1 they do not do dynamic DNS updates at all.
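The arithmetic behind these percentages can be retraced from the counts in the fingerprint list. The sketch below only restates that calculation; the variable names are ours, and taking 24 as "about half" of the 47 ambiguous responses is an assumption chosen to reproduce the figure of 90.

```python
total = 413                      # fingerprinted addresses
no_response = 106                # hosts that did not answer
windows = 56 + 5 + 4 + 1         # Win2k/XP, NT sp4+, 95, ME
ambiguous = 47                   # "Windows Based. Open/Net/FreeBSD/..." group
windows_assumed = windows + (ambiguous + 1) // 2   # add about half of the ambiguous group

share_recognized = windows / total
share_assumed = windows_assumed / total
share_of_responders = windows_assumed / (total - no_response)

print(f"{share_recognized:.0%} {share_assumed:.0%} {share_of_responders:.0%}")
```

The last ratio comes out just under 30%, which is the rough estimate quoted above.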
Our operating system fingerprinting efforts did not yield a very coherent picture of the sources of DNS update attempts. In section xxx, we describe a laboratory test network that we built to try to understand the sources and regularity of update attempts.
Figure 7 below shows the orders of magnitude of the elephants, mice and workhorses contributions. The dashed vertical lines mark the middle of the distribution, that is half the sources (or updates) lie to the left of the line and half to the right.

Summarizing, for the weekly update log from May 28 to June 4, 2002:
Cases in which the distribution significantly deviates from exponential are rare. (Footnote: In the general scientific context, exponential interarrival distributions represent the simplest model for a flow of events that occur independently, at random, and with a constant average arrival rate.) They occur when large gaps between requests are present with higher frequency. Figure 8 shows the distribution of interarrival times of one million updates between 6:28 and 8:12 AM on Saturday, June 1, 2002. The distribution is very close to exp(-x/6.5), which translates into an average interarrival time of 6.5 milliseconds, or about 154 updates per second.

The distribution shown in Figure 8 deviates from exponential for very small interarrival times, where the probability of packets having a 0 or 1 ms interarrival time is lower than the exponential model predicts. There are also a few longer interarrival times in the range of 100 ms. Figure 9 below compares 21 interarrival time distributions for measurements taken at approximately 8 hour intervals. Most of the distributions are very close to exponential, with only one deviating significantly in the range of interarrival times exceeding 70 ms.


The distribution of interarrival times for the larger dataset (26 days) is very close to an exponential when the interval is less than 0.1 sec. It crosses over to a power function in the range of larger times. The largest interval we saw in 26 days is 64 sec.
To see how many of the update sources are periodic, we analyzed the average update rates for sources present in the 26-day July dataset. The average update rate of a source is the number of its updates minus one, divided by the difference between the timestamps of the last and the first update in its series.
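As an illustration of this definition, the following sketch computes the average rate of one source. It assumes timestamps in seconds and reports the rate in updates per hour; the function name is hypothetical.

```python
def average_update_rate(timestamps):
    """Average update rate of one source, in updates per hour.

    timestamps: update arrival times in seconds (assumed unit).
    Returns None when the rate is undefined (fewer than two updates,
    or all updates carrying the same timestamp).
    """
    if len(timestamps) < 2:
        return None
    span = max(timestamps) - min(timestamps)
    if span <= 0:
        return None
    return (len(timestamps) - 1) / span * 3600.0

# Four updates at 0, 5, 15 and 75 minutes: three intervals in 75 minutes.
print(average_update_rate([0, 300, 900, 4500]))
```

Subtracting one from the update count makes the rate equal to the reciprocal of the mean interarrival time, so a strictly hourly source yields exactly one update per hour regardless of how long it was observed.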
Figure 10 shows the density of updates vs. the update rate with a resolution of 20 bins per decade (a factor of 1.122 between successive bin boundaries). We took only sources whose update series lasted longer than 1 hour. This resulted in the removal of 882,633 sources (1,582,417 updates) leaving 1.45M hosts with 302M updates over 26 days in July 2002. The solid line is updates and the dashed line source IP addresses.
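The logarithmic binning used for this kind of plot can be sketched as follows. This is an illustration rather than the plotting code itself: the function names are hypothetical, rates are assumed to be in updates per hour, and bin k is assumed to cover the half-open range from 10^(k/20) to 10^((k+1)/20).

```python
import math
from collections import Counter

BINS_PER_DECADE = 20   # successive bin edges differ by a factor of 10**(1/20), about 1.122

def rate_bin(rate_per_hour):
    """Index k of the bin [10**(k/20), 10**((k+1)/20)) that contains the rate."""
    return math.floor(BINS_PER_DECADE * math.log10(rate_per_hour))

def bin_edges(k):
    """Lower and upper edge of bin k, in updates per hour."""
    return 10 ** (k / BINS_PER_DECADE), 10 ** ((k + 1) / BINS_PER_DECADE)

def histogram(sources):
    """sources: iterable of (rate_per_hour, update_count) pairs, one per source IP.

    Returns two Counters keyed by bin index: sources per bin and updates per bin.
    """
    by_source, by_update = Counter(), Counter()
    for rate, count in sources:
        k = rate_bin(rate)
        by_source[k] += 1
        by_update[k] += count
    return by_source, by_update

# A source sending 3 updates every 75 minutes averages 2.4 updates per hour.
low, high = bin_edges(rate_bin(2.4))
print(round(low, 2), round(high, 2))
```

With this binning, a rate of one update per hour falls into the bin that starts at 1, and 2.4 updates per hour falls into the bin between roughly 2.24 and 2.51.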

The two large spikes represent periods of 60 minutes and 75 minutes. Five percent of all updates come from sources with average update rates in the range 1-1.122 per hour (60 minute cycle) and 8% from sources with 2.24-2.51 updates per hour. This 8% actually matches a cycle of 3 updates in 75 minutes, that is, 2.4 updates per hour. The next noticeable spike is at twice this rate. It is most likely caused by networks with two hosts in RFC1918 space, for which 6 updates are generated in 75 min.
As the dashed line shows, most of the IPs are sending updates at much lower rates; half of the IP addresses are sending at a rate of 0.09 updates per hour or less. However, half of the updates come at rates of 5 or less per hour. The rates of 1 per hour and 3 per 75 min. account for 6.43 and 3.53% of all observed sources. Neither of these numbers, however, reveals how strict or loose the periodicity is, nor the spacing of updates within a period.
It is difficult to find the precise period of updates because every now and then an update is missing from the series, either because a host is switched off, a DNS packet is lost in the network or some activity on the source network is interfering with updates. Furthermore, often an extra sequence of updates becomes interleaved into the series because another host becomes active on the private network. For that reason we could not use the Fourier transform on update arrivals to extract a period. The lack of coherence in update arrival times would defeat the amplifying properties of the transform. We tested two approaches to finding the update period. Both of them evaluate a binary autocorrelation function. By determining the lag (shifts) at which the autocorrelation is maximal, we find how many updates constitute a period. We then recover the actual (temporal) period from the original interarrival times.
In the first method, we sorted each logfile by the IP address of the source of the update packet, and only used sources with 15 or more updates. (Footnote: An update logfile usually covers about 8 hours and contains up to 5M updates.)
We then computed the sequence of update interarrival times for each source, rounding them to whole minutes. For each shift of 1 to 9 updates, we counted how many of these rounded values coincide with the corresponding values in the shifted sequence. Sources for which some shift matched more than 90% of the rounded inter-update times were classified as periodic. The sum of the minute counts over the lag (shift length) was taken as their period.
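A minimal sketch of this first method is given below. It is a reconstruction from the description above, not the original script, and it rests on several assumptions: timestamps are in seconds, the match fraction is taken over the positions that can be compared at a given shift, the smallest qualifying shift is used, and the period is the most common sum of that many consecutive rounded gaps. The function name is hypothetical.

```python
from collections import Counter

def minute_period(timestamps, max_shift=9, threshold=0.9, min_updates=15):
    """Return the period in minutes of one source, or None if not periodic."""
    if len(timestamps) < min_updates:
        return None
    ts = sorted(timestamps)
    # Interarrival times rounded to whole minutes.
    gaps = [round((b - a) / 60) for a, b in zip(ts, ts[1:])]
    for shift in range(1, max_shift + 1):
        compared = len(gaps) - shift
        if compared <= 0:
            break
        matches = sum(gaps[i] == gaps[i + shift] for i in range(compared))
        if matches > threshold * compared:
            # Sum the minute counts over every window of `shift` consecutive
            # gaps and take the most common total as the period.
            sums = Counter(sum(gaps[i:i + shift]) for i in range(compared + 1))
            return sums.most_common(1)[0][0]
    return None

# Hypothetical source repeating gaps of 5, 10 and 60 minutes six times.
times, t = [0], 0
for _ in range(6):
    for gap_minutes in (5, 10, 60):
        t += gap_minutes * 60
        times.append(t)
print(minute_period(times))
```

Because the comparison uses exact equality of rounded minutes, a single missing update or a gap that rounds the other way breaks several matches at once, which is consistent with the sensitivity to noise discussed below.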

Figure 11 above presents a histogram obtained by that method. In that example, we used a 7.5 hour logfile from early Wednesday May 29, 2002 that contained 4.67M updates and 240K source IPs. Of those, 78K sent 15 or more updates over the duration of the log, of which 32K (40%) were found periodic. Among the periodic sources, 2001 (6%) have a period of 60 minutes, 22333 (70%) a period of 75 minutes and 10% a period of 76 minutes.
In the whole set of 21 logfiles, 38-56% of the sources were periodic. Of these, 6-12% were 60 minute periods, 64-70% 75 minute periods, and 1% the 76 minute period.
This approach discovers a smaller percentage of periodic sources when run over the whole one-week dataset. 314996 sources were found to have 15 or more updates; out of those, 86580 (27.5%) are identified as having one period. Among those IPs, 32456 (38%) have a period of 60 minutes, 37575 (43%) 75 minutes, and 5503 (6.4%) 76 minutes. The significant drop in the fraction of 75 minute periods is most likely caused by occasional missing updates and/or rounding errors in converting interarrival times to whole minutes, both of which destroy the periodicity of minute counts.
As a remedy against these variations, we chose to use a more robust algorithm, which finds the fraction of periodic updates from one source as follows:
0) Take all sources with ten or more updates.
1) Take the sequence of interarrival times expressed as integers in milliseconds.
2) Convert them to base-two logarithms truncated to their integer parts. (Footnote: We add 1 to the truncated integers to disambiguate them from 0, which represents 0 milliseconds.)
3) For each shift of the update sequence by 1, 2, ..., 30 updates, count the number of positions in which truncated logarithms in the original and shifted sequences are equal.
4) Find the lag (shift) at which this overlap is maximal. Discard the source if the maximal count is less than 10% of all its updates.
5) Find the longest contiguous stretch in which every entry equals its shifted counterpart.
6) Extract the interarrival times from the beginning of this longest stretch. Take the sum of these interarrival times as the period.
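The steps above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the code used for the analysis: ties between shifts are broken in favor of the smallest shift, the period is the sum of as many interarrival times as the lag, and the function names are hypothetical. For a non-negative integer, Python's int.bit_length() equals the truncated base-two logarithm plus one, and is 0 for 0 ms, which matches the encoding in step 2 and its footnote.

```python
def log2_key(ms):
    """0 for 0 ms, otherwise floor(log2(ms)) + 1."""
    return int(ms).bit_length()

def robust_period(gaps_ms, max_shift=30, min_updates=10, min_fraction=0.10):
    """Period in milliseconds of one source, or None if it is discarded.

    gaps_ms: integer interarrival times in milliseconds (step 1).
    """
    n_updates = len(gaps_ms) + 1
    if n_updates < min_updates:                       # step 0
        return None
    keys = [log2_key(g) for g in gaps_ms]             # step 2
    best_shift, best_count = None, 0
    for shift in range(1, max_shift + 1):             # step 3
        count = sum(keys[i] == keys[i + shift] for i in range(len(keys) - shift))
        if count > best_count:                        # smallest shift wins ties
            best_shift, best_count = shift, count
    if best_shift is None or best_count < min_fraction * n_updates:   # step 4
        return None
    best_start = best_len = run_start = run_len = 0
    for i in range(len(keys) - best_shift):           # step 5
        if keys[i] == keys[i + best_shift]:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return sum(gaps_ms[best_start:best_start + best_shift])   # step 6

# Hypothetical source with slightly jittered gaps of about 5, 10 and 60 minutes.
print(robust_period([300120, 599800, 3600500] * 5) / 60000.0)
```

Matching truncated logarithms means two gaps agree whenever they fall between the same pair of powers of two, so jitter of a few seconds on a multi-minute gap usually does not break a match, although gaps close to a power-of-two boundary can still land in different bins.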
While this seems involved, it was the only method we found that worked well. The problem is that update data contains interleaved sequences sent on behalf of several local hosts that can join and leave the private network at arbitrary times. This, together with the occasional missing or extra updates, requires a very robust algorithm. Clock skew in the source hosts also contributes to the noise that must be filtered out. That is why we chose to match binary logarithms of the data rather than numeric values, and relaxed the threshold condition for a source's periodicity (matching 10% of the updates as opposed to 90% in the first algorithm).
Figure 12 below shows the number of IP source addresses which send a significant fraction of their updates at periodic intervals and the number of updates produced by these sources. 360710 source IP addresses were included in the analysis; each source contributed at least 10 updates. The largest observed period was 75 hours.

The pattern of the 75-minute update cycle is especially revealing. It usually involves three updates, made at intervals of 5, 10, and 60 minutes. The most likely cause is that an attempted update (at "0" minutes) is repeated after a timeout of 5 minutes and then again after doubling the timeout to 10 minutes, after which the system falls back to a default of 60 minutes.
There is also a strong spike which represents a simpler nameserver behavior in which updates are sent strictly at 60 minute intervals. The most frequent periods and their prevalence are shown in Table 8.5 below.
Table 8.5: Update periods and their prevalence
Period        % Sources   % Updates
----------------------------------------
0 minutes     8%          24%
60 minutes    24%         14%
75 minutes    34%         28%
76 minutes    8%          5%
These and nearby periods account for 3/4 of the sources and updates.
The largest contributions to the computation of periodicity are shown in Table 9 below. The first line is the 3 part period 75 minutes; the second is the single 60 minute period. Later lines are the smaller spikes in Figure 12, which correspond to multiple computers on the private network with 75 minute periods. Note that 82% of the updates are in sequences with periods listed in the table.
Table 9: Update Contributions Determining Periodicity
Updates/Period   Update Count   Percent of Data   Cumulative Percent
---------------------------------------------------------------------
3                2.44132e+07    25.61             25.61
1                2.10346e+07    22.07             47.68
2                8.76354e+06    9.19              56.88
6                8.56021e+06    8.98              65.86
9                4.15423e+06    4.36              70.21
12               3.86912e+06    4.06              74.27
18               3.07479e+06    3.23              77.50
4                3.02154e+06    3.17              80.67
15               1.91918e+06    2.01              82.68
We can also see the periodicity if we look at update rate relative to the port spectrum. In Figure 13 we plot the update rate vs. the largest port number used by a particular host and indicate with the plot symbol the number of updates contributing to any update rate and port range.

Notice the black bands with update rate between 1 and 10 per hour and between 1 and 50. The first corresponds to a TCP stack that uses ports up to 5000 and the second a stack that uses the full port range.
An overwhelming majority of the hosts that are trying to update RFC1918 zones at the AS112 server are connected to the Internet via DSL and cable modem providers. Since these companies serve almost exclusively home-based users and (to a lesser extent) small business customers, we conclude that the bulk of RFC1918 updates originate in home office and small business environments. This is further corroborated by diurnal and weekly variation in the flow of updates, by the prevalence of personal operating systems (such as Windows and Linux), and by the generally small numbers of updates contained in one update period (for each source IP), reflecting the small number of hosts on local LANs getting their addresses from the same DHCP server. We found that the process of update arrivals has three specific timescales. On the timescale of milliseconds, the interarrival time of all updates is close to an exponential distribution, with an average time of 6.5 ms for the May-June data and 8.5 ms for the July data. On the timescale of minutes, individual sources display periodic behavior, with dominant interarrival times of 5 min, 10 min, 30 min and 1 hour. Finally, on the timescale of hours, updates from hosts in different time zones increase by a factor of four over 6 min. intervals immediately following midnight local time; the most prominent of these spikes can be identified with time zones in the US, Western Europe and Pacific Asia.
References
[1] IEPG meeting - July 2002. http://www.potaroo.net/iepg/july2002/
[2] Nevil Brownlee, kc claffy, and Evi Nemeth. DNS Root/gTLD Performance Measurements. Usenix LISA, 2001.
[3] Nevil Brownlee, kc claffy, and Evi Nemeth. DNS Measurements at a Root Server. Globecom 2001.
[4] Marina Fomenkov, kc claffy, Bradley Huffaker, and David Moore. Macroscopic Internet Topology and Performance Measurements From the DNS Root Name Servers. Usenix LISA, 2001.
[5] Andre Broido, kc claffy. Inter-domain routing evolution - Episode II: Dark Space. ARIN IX, April 2002. http://www.caida.org/outreach/presentations/
[6] Brian Kantor, UCSD Network Services, private communication, July 4, 2002.
[7] Bradley Huffaker. Skitter daily summaries. http://www.caida.org/cgi-bin/skitter_summary/main.pl
[8] Bradley Huffaker, Daniel Plummer, David Moore, and kc claffy. Topology discovery by active probing. http://www.caida.org/outreach/papers/2002/SkitterOverview/
[9] Route Views archive. http://archive.routeviews.org/
[10] Andre Broido, Evi Nemeth, kc claffy. Packet arrivals on rate-limited Internet links. CAIDA, Nov.2000 http://www.caida.org/~broido/coral/packarr.html
[11] Constantinos Dovrolis, M.Jain. Bandwidth estimation, 2001.
[12] Dina Katabi, Charles Blake. Inferring congestion sharing and link characteristics from packet interarrival times. MIT LCS Technical Report, 2001.
[13] Mark Coates, Alfred Hero, Robert Nowak, Bin Yu. Internet tomography. IEEE Signal Processing Magazine, May 2002, vol.19, No.3, 47-65.
[NANOG] Discussion of RFC1918 updates. NANOG mailing list, April 2002. www.irbs.net/internet/nanog/0204/0450.html.