For each connection attempt, we recorded the following values:
- calendar date and time,
- time taken to resolve the server's domain name into an IP address,
- connection opening time,
- downloading time,
- the number of bytes downloaded, and
- the number of links (href's) to other domain names contained in the homepage.
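The last item in the list can be illustrated with a short Python sketch. It is not the script used for the measurements: the class and function names are hypothetical, and it assumes that a link counts as pointing to another domain name whenever its host part differs from the name of the server being probed (so relative links and mailto: links are ignored).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalLinkCounter(HTMLParser):
    """Counts href attributes whose host differs from the probed server."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so HREF and href both match.
        for name, value in attrs:
            if name != "href" or not value:
                continue
            host = urlparse(value).hostname  # None for relative links
            if host and host != self.own_host:
                self.count += 1


def count_external_links(html, own_host):
    """Return the number of href links in html that name another host."""
    parser = ExternalLinkCounter(own_host)
    parser.feed(html)
    return parser.count
```

Under this assumption www.example.edu and example.edu are treated as different names; a stricter or looser comparison would change the counts.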
A total of 154,230 individual host names were collected using search engines, access logs of three UCSD HTTP servers, passive monitors and a number of other sources. To make sure that every host runs an HTTP server, names from sources other than search engines were added to the sample only when they contained "www". IP addresses were obtained by running the nslookup command. Addresses with distinct combinations of the first three octets were obtained by sorting with the UNIX command sort -u.
We were able to elicit an HTTP response from 122,346 servers (close to 80% of the whole sample). These servers opened a TCP connection and returned either a homepage or an error message. The timeout for the whole sequence (domain name resolution, opening the TCP stream and data retrieval) was set to 10 sec. Timeouts for individual operations were not used, so some of the connections timed out in the retrieval phase because of excessive time spent on the first two operations. We did not collect statistics that could subdivide the 20% of non-responding servers into different groups, e.g. to find the share of those that opened a TCP connection but failed to produce an HTTP response.
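The effect of a single timeout shared by all three phases can be sketched in Python as follows. This is an illustration under stated assumptions, not the measurement script: Budget and fetch_homepage are hypothetical names, the request is a plain HTTP/1.0 GET on port 80, and the byte count includes the response header.

```python
import socket
import time


class Budget:
    """A single deadline shared by all phases of one connection attempt."""

    def __init__(self, seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + seconds

    def remaining(self):
        """Seconds left before the deadline; raises once it has passed."""
        left = self._deadline - self._clock()
        if left <= 0:
            raise TimeoutError("overall budget exhausted")
        return left


def fetch_homepage(host, total_timeout=10.0, port=80):
    """Return (dns_s, connect_s, download_s, n_bytes) for one attempt."""
    budget = Budget(total_timeout)
    t0 = time.monotonic()
    # Caveat: getaddrinfo takes no timeout, so a slow resolver can overrun
    # the budget; the overrun is only noticed at the next remaining() call.
    info = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    ip = info[0][4][0]
    t1 = time.monotonic()
    sock = socket.create_connection((ip, port), timeout=budget.remaining())
    t2 = time.monotonic()
    n_bytes = 0
    with sock:
        request = f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        while True:
            # Each read may use only whatever the earlier phases left over.
            sock.settimeout(budget.remaining())
            chunk = sock.recv(65536)
            if not chunk:
                break
            n_bytes += len(chunk)  # assumption: header bytes are counted too
    t3 = time.monotonic()
    return t1 - t0, t2 - t1, t3 - t2, n_bytes
```

Because the retrieval phase only receives the time left over by resolution and connection, a slow DNS lookup alone can cause a timeout during download, which is the behavior described above.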
As the server name was resolved to an IP address before each connection was opened, a small fraction of the servers ended up on the same /24s. Some of them even had the same IP addresses. This is due to the presence of names which resolve to multiple IP addresses. Whenever DNS has more than one IP address for a name, it chooses one in a manner that is unpredictable from the client's viewpoint; this is intended for load sharing among the machines with the same name. As a result, of those hosts for which HTTP retrieval could be completed, 121,986 have distinct IP addresses, and 121,153 are on different /24s. As this is over 99% of the total, we felt justified in discarding duplicate /24s using the command sort -u.
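The counting and discarding of duplicate /24s can be sketched in Python. The function names and the (host, ip) record layout are assumptions made for illustration. Unlike sort -u, which keeps whichever line sorts first, this version keeps the first record encountered for each /24.

```python
def prefix24(ip):
    """Return the first three octets of a dotted-quad IPv4 address."""
    return ".".join(ip.split(".")[:3])


def dedupe_by_prefix24(records):
    """Keep the first (host, ip) record seen for each /24."""
    seen = set()
    kept = []
    for host, ip in records:
        key = prefix24(ip)
        if key not in seen:
            seen.add(key)
            kept.append((host, ip))
    return kept
```

The number of distinct IP addresses is simply the size of the set of ip fields, and the number of distinct /24s is the length of the list returned by dedupe_by_prefix24.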
Homepage size distribution
The statistics we collected do not fully represent the size of the pages as seen by a browser, since we retrieved only one page per server, without downloading graphics. (We did not follow redirect links either.) However, they have a number of interesting and instructive properties.
The probability distribution for homepage size does not, as a whole, follow a power, logarithmic or exponential law. It can, though, be approximated by a logarithmic law (that is, a density function of type C/x) on the intervals (0.5,2) and (3,10). However, on the second interval it decreases much faster than on the first. If we were to estimate the chance of finding a homepage over 32K in size by extrapolating the data from the first interval, we might come up with a number close to 0.12. In fact, as the numerical data for this distribution show, it is 0.01275, that is, about 10 times smaller. Note that the ability to make predictions of that type is, generally speaking, the raison d'etre of approximating data by elementary functions.
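To spell out what such an extrapolation involves, suppose the density is C/x on an interval starting at a point a. The symbols f, F and a are notation introduced here for illustration only; C is the constant of the fit.

```latex
f(x) = \frac{C}{x}, \qquad
F(x) = F(a) + \int_a^x \frac{C}{t}\,dt = F(a) + C \ln\frac{x}{a}, \qquad
P(\text{size} > x) = 1 - F(a) - C \ln\frac{x}{a}.
```

Under this law the share of larger pages shrinks only logarithmically in x, so carrying the fit from the first interval out to 32K overestimates the tail whenever the true density falls off faster than C/x beyond that interval.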
Let us now look at the tail of the distribution. An approximation between 32K and 100K shows that in that region the cumulative function is reasonably close to 1/x^3.2 (density C/x^4.2). This results in much faster decay than we could expect from the intervals (0.5,2) and (3,10).
However, one more surprise awaits us as we move to the next portion of the tail. The portion of data over 100K, scarce as it is, appears to follow a power law with exponent -1.5 (density C/x^2.5). If we tried, e.g., to estimate the expected number of files over 320K with the approximation for the interval (32,100) discussed above, we could easily conclude that this number is 0. There are, however, 5 files in the sample which are that large.
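One way to obtain a local exponent such as -3.2 or -1.5 is a least-squares fit of the tail fraction against size on log-log axes, restricted to the interval of interest. The Python sketch below shows that technique; the fitting procedure actually used for the figures above is not stated, and the function name ccdf_slope is hypothetical.

```python
import bisect
import math


def ccdf_slope(sizes, lo, hi):
    """Least-squares slope of log(tail fraction) against log(size).

    Only samples with lo <= size <= hi enter the fit, but the tail
    fraction is always computed against the whole sample.
    """
    data = sorted(sizes)
    n = len(data)
    xs, ys = [], []
    for x in data:
        if not (lo <= x <= hi):
            continue
        # Share of samples strictly larger than x (handles tied sizes).
        frac_above = (n - bisect.bisect_right(data, x)) / n
        if frac_above > 0:
            xs.append(math.log(x))
            ys.append(math.log(frac_above))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx
```

Calling ccdf_slope separately on each interval would give one slope per region. With as few points as there are above 100K, any such estimate is rough.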
Another feature of the CDF is its apparent discontinuity near the size of 456 bytes. In fact, all 1-byte bins between sizes 446 and 467 bytes appear to be overly full compared to adjacent byte counts. This is, however, an artifact of data acquisition: our scripts do not filter out error messages returned by the HTTP server, and most of these are about that long. A reference to a Microsoft server is present in 14,318 returned responses. It may look like
HTTP/1.1 200 OK
Server: Microsoft-IIS/4.0
Date: Wed, 31 May 2000 11:25:44 GMT
It may also contain an error message.
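A filter of the kind our scripts lack could start from the response header. The Python sketch below is hypothetical: the function names are illustrative, and it assumes that error pages carry a 4xx or 5xx status code, which would miss any error text delivered with a 200 status.

```python
def parse_response_head(raw):
    """Split a raw HTTP response into (status_code, headers dict)."""
    lines = raw.replace("\r\n", "\n").split("\n")
    parts = lines[0].split(None, 2)      # e.g. ["HTTP/1.1", "200", "OK"]
    status = int(parts[1])
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break                        # blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def is_error_response(status):
    """Assumption: only 4xx and 5xx status codes mark an error page."""
    return status >= 400
```

Responses could then be grouped by the value of the server field, or dropped from the size statistics when is_error_response returns True.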