( NAE '99 workshop )
We discuss the collection, analysis and visualization of four forms of Internet traffic data: network topology, workload, performance, and routing. Topology data describe network link infrastructure at a variety of `protocol' layers. Workload measurements involve the collection of traffic information from a point within a network, e.g., data collected by a router or switch or by an independent device passively monitoring traffic as it traverses a network link. Performance measurements involve the introduction of traffic into the network for the purpose of monitoring delay between specific end-points. Routing data includes data from Border Gateway Protocol (BGP) routing tables, which reflect the transit relationships between individual Autonomous Systems (ASes) at a given point in time. We describe highlights from these topic areas and their role in the state of Internet measurement and data analysis today.
|Scientific apparatus offers a window to knowledge, but as they grow more elaborate, scientists spend ever more time washing the windows.|
|-- Isaac Asimov|
The infrastructure of the Internet can be considered the cyber equivalent of an ecosystem. The last mile connections from the Internet to homes and businesses are supplied by thousands of capillaries, small and medium sized Internet Service Providers (ISPs), which are in turn interconnected by 'arteries' maintained by transit (backbone) providers. The global infrastructure of the Internet consists of a complex array of telecommunications carriers and providers, a very difficult infrastructure to analyze diagnostically except within the borders of an individual network. Nonetheless, insights into overall health and scalability are critical to the Internet's evolution.
Attempts to adequately track and monitor the Internet were greatly diminished in early 1995 when the National Science Foundation (NSF) relinquished its stewardship role over the Internet. The resulting transition into a competitive industry for Internet services left no framework for the cross-ISP communications needed for engineering or debugging of network performance problems and security incidents. Nor did competitive providers, all operating at fairly low profit margins, and struggling to meet the burgeoning demands of new customers and additional capacity, place a high priority on gathering or analyzing data on their networks. This attitude is strengthened by the general lack of quality measurement or analysis tools to support these endeavors, and the absence of baseline data against which an analyst can compare any results.
As a result, today's Internet industry lacks any ability to evaluate trends, identify performance problems beyond the boundary of a single ISP, or prepare systemically for the growing expectations of its users. Historic or current data about traffic on the Internet infrastructure, maps depicting the structure and topology of this amorphous global entity, or projections about how it is evolving, simply do not exist.
That is not to say that no measurement of the Internet occurs. There are numerous independent activities in the area of end-to-end measurement of the Internet. Typically spawned by end users with an interest in verifying performance of their Internet service, these measurements involve an end host sending active probe traffic out into the network, recording the delay until that packet returns to its source. Unfortunately such traffic measurements involve a large number of parameters that are difficult if not impossible to model independently, and the resulting complexity renders elusive any comparability or useful normalization of the gathered data. There are research groups trying to deploy technology and infrastructure to support more standardized measurement and evaluation of performance and reliability of selected Internet paths, and what specific segments of a given path limit that performance and reliability, but such efforts are slow and have thus far remained unable to meet the needs of any of the user, research, or ISP communities.
In the remainder of this paper we will highlight activities in the four different main areas of Internet measurements: topology and mapping, passive workload measurements, active performance measurements, and routing dynamics. We will conclude with a focus on near-term research priorities and forecast of activities for the next five years.
|in an expanding system, such as a growing organism, freedom to change the pattern of performance is one of the intrinsic properties of the organism itself.|
New connections among core Internet backbones occur hourly, ranging in capacity from T1 copper (1.544 megabits per second) to OC48 fiber optics (2.488 gigabits per second). This physical structure supports a myriad of new technologies and products, including live (or 'streaming') audio and video, distance education, entertainment, telephony and video-conferencing, as well as numerous new and often still evolving communications protocols.
Tracking and visualizing Internet topology in such an environment is challenging at best. A particularly ambitious endeavor is underway at CAIDA, through the recent development of skitter, a tool for dynamically discovering and depicting global Internet topology, in the process also gathering data on performance of specific paths through the Internet.
skitter works using a process somewhat analogous to medical x-ray tomography, a technique where a three-dimensional image is achieved by rotating an x-ray emitter around the subject and measuring the intensity of transmitted rays from each angle, and then reconstructing the resulting two-dimensional images into a three-dimensional object. Geologists rely on similar techniques to build models of seismic activity using cross-section images (slices) of the earth. Data gathered from tomographic scans play an important role in developing models to analyze and predict select phenomena.
CAIDA is currently using skitter to gather infrastructure-wide (global) connectivity information (what's connected to what?), and round trip time (RTT) and path data (how does a packet get from A to B and how long does it take?) for more than 30,000 destination hosts from six source monitors spread throughout the United States, with additional monitors planned for the U.S., Europe, and Asia in 1999. skitter measures the Internet path to a destination by sending a sequence of ICMP packets to the destination host, setting a successively larger 'time to live' (TTL) value in each packet, much as the traceroute utility does. Each intermediate hop along the path between source and destination decrements the TTL value of any packet passing through it; a hop that decrements a TTL to zero must discard the packet, and notifies the source host that it has done so.
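The TTL-based probing described above can be illustrated with a small simulation. This is only a sketch of the general traceroute-style technique, not skitter's implementation: the functions `probe` and `discover_path`, the router names, and the `max_ttl` default are all hypothetical, and no real packets are sent.

```python
def probe(routers, destination, ttl):
    """Simulate one probe and return the name of whoever answers it.

    Every router on the path decrements the TTL. The router that brings
    the TTL to zero discards the packet and reports back to the source.
    """
    for router in routers:
        ttl -= 1
        if ttl == 0:
            return router      # TTL expired here
    return destination         # packet outlived every router on the path


def discover_path(routers, destination, max_ttl=30):
    """Send probes with TTL 1, 2, 3, ... and collect the responders in order."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder = probe(routers, destination, ttl)
        hops.append(responder)
        if responder == destination:
            break              # the destination answered, so the path is complete
    return hops


if __name__ == "__main__":
    # Hypothetical three-router path; names are placeholders.
    print(discover_path(["r1", "r2", "r3"], "dst"))
```

Each probe reveals exactly one hop, so a path of n routers needs n + 1 probes in this simplified model. A real tool must also cope with lost probes and with routers that do not respond, which the simulation ignores.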
Probing paths from multiple sources to a large set of destinations throughout the current IPv4 address space allows both topological and geographical representations of a significant fraction of Internet connectivity, the latter admittedly constrained by the abysmal lack of geographic mapping data for Internet address space. Supporting tools also analyze the frequency and pattern of routing changes (when and how often are alternative paths used between the same two endpoints?).
Analyzing data from tens of thousands of path measurements can identify critical roles that specific backbones, traffic exchange points, and even individual routers (the devices that direct and carry packets through segments of the network) play in transmitting Internet traffic. Figure 1 shows a preliminary two-dimensional visualization of skitter data depicting a macroscopic snapshot of Internet connectivity, with selected backbone ISPs colored separately. The graph reflects 23,000 end destinations, through many more intermediate routers. While visually interesting, the volume of data represented in the figure unfortunately reduces its utility to operators and users alike.
|Figure 1: Prototype two-dimensional image depicting global connectivity among ISPs as viewed from a skitter host. The layout algorithm used in these images was developed by Hal Burch (CMU) in support of Bill Cheswick's (Bell Labs) Internet Mapping Project (http://cheswick.com/ches/map/).|
Note that the particular source monitor will skew the relative prevalence of a given AS in the graph. For example, if a source is a customer of a given ISP, that ISP will appear in most paths from that source, as seen in CERFnet's strong representation in the data from the San Diego source. This also occurs with ASes topologically 'nearby' this primary service provider.
Given the number of nodes and links available for visualization (tens of thousands), the limitations of two-dimensional tools severely constrain useful depiction of the skitter data. We are exploring three-dimensional visualization of topology and RTT performance across segments of the infrastructure, using Tamara Munzner's hypviewer [M98], which uses a hyperbolic layout projected onto a unit sphere. This technique allows one to reduce clutter in the neighborhood of a node (rendering a readable local topology in a huge graph), while still allowing easy navigation of the entire graph. The improved viewability is due in large part to the use of hyperbolic space, but also to the layout algorithm. Figure 3(a) shows initial results for a large skitter dataset of roughly 29,000 destinations on February 18th, 1999. Figure 3(b) depicts a hyperbolic view of a core Internet routing table from the same day, i.e., peering connectivity among over 2,000 autonomous systems. Links are red on their outbound (from nodes) end and blue on their inbound (to nodes) end.
|Figure 3(a): Focus on the neighborhood of 22.214.171.124, but still reflecting large portions of the network elsewhere (in the top and left of the sphere).|
|Figure 3(b): hyperbolic view of BGP routing table data describing connectivity among autonomous systems.|
Hyperbolic viewing tools have tremendous potential as an interactive navigation system for topology and other network data; we have briefly described only two examples due to space limitations.
|Everything you've learned in school as "obvious" becomes less and less obvious as you begin to study the universe. For example, there are no solids in the universe. There's not even a suggestion of a solid. There are no absolute continuums. There are no surfaces. There are no straight lines.|
|-- R. Buckminster Fuller|
Workload measurements require collecting traffic information from a point within a network, e.g., data collected by a router or switch or by an independent device passively monitoring traffic as it traverses a network link. Collection of such data allows for a variety of traffic analyses, e.g., composition of traffic by application, packet size distributions, packet inter-arrival times, performance, path lengths, that contribute to our ability to engineer next generation internetworking equipment and infrastructures. Of particular interest are traffic flow matrices: tables of how much traffic is flowing from a given source to a given destination network, information that turns out to be vital to optimizing engineering decisions relating to route peering and infrastructure investments.
Figure 4 shows a sample matrix of traffic from source to destination Autonomous Systems (ASes). Since an Autonomous System is the unit at which Internet routing relationships are established and negotiated, a traffic matrix at this granularity is of immediate utility to networking engineers trying to optimize topology or route peering decisions. (Peering is the relationship between two autonomous systems that agree to exchange routing information with each other.)
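As a minimal sketch of how such a matrix can be assembled, the following aggregates flow records into a source-AS by destination-AS table of byte counts. The record layout `(src_as, dst_as, bytes)` and the function name `traffic_matrix` are assumptions made for illustration, and the AS numbers in the example come from the private-use range rather than from any measured data.

```python
from collections import defaultdict


def traffic_matrix(flows):
    """Sum bytes per (source AS, destination AS) pair.

    flows: iterable of (src_as, dst_as, nbytes) tuples (assumed layout).
    Returns a nested dict: matrix[src_as][dst_as] -> total bytes.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for src_as, dst_as, nbytes in flows:
        matrix[src_as][dst_as] += nbytes
    # Convert to plain dicts so missing cells raise KeyError instead of
    # silently creating empty entries.
    return {src: dict(row) for src, row in matrix.items()}


if __name__ == "__main__":
    sample = [
        (64512, 64513, 1500),   # illustrative private-use AS numbers
        (64512, 64513, 40),
        (64513, 64512, 576),
    ]
    print(traffic_matrix(sample))
```

The same aggregation works at any other granularity (country, network prefix) once each flow endpoint has been mapped to the desired entity.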
Figure 5 shows a traffic matrix by country, of interest from both a policy and an international commerce perspective. Taken at a United States peering point location, this particular image indicates the use of the United States as an international communications hub, reflected in the presence of traffic from non-U.S. countries to other non-U.S. countries via the U.S. The log scale highlights that such transit traffic is still quite a small fraction of overall traffic, but it is a useful statistic to be able to track. Figure 6 shows the trade balance of IP traffic with the U.S. for several countries: note the U.S. is almost universally a net exporter of IP traffic.
As one other example of relevant workload characteristics, we'll discuss Internet packet sizes. Statistics of packet size distribution and arrival patterns are relevant to designers of network routing and switching equipment, since the cost of switching a packet has both per-packet and per-byte components; having metrics of typical Internet workloads allows designers to optimize hardware and software architectures around relevant benchmarks.
Figure 7a shows the distribution of packet sizes from a 24-hour time period on both directions of the measured trunk. As with graphs from previous years [TMR97], this figure illustrates the predominance of small packets, with peaks at the common sizes of 44, 552, 576, and 1500 bytes. The small packets, 40-44 bytes in length, include TCP acknowledgement segments, TCP control segments such as SYN, FIN, and RST packets, and telnet packets carrying single characters (keystrokes of a telnet session). Many TCP implementations that do not implement Path MTU Discovery use either 512 or 536 bytes as the default Maximum Segment Size (MSS) for nonlocal IP destinations, yielding a 552-byte or 576-byte packet size [STEVENS]. A Maximum Transmission Unit (MTU) size of 1500 bytes is characteristic of Ethernet-attached hosts.
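The relationship between MSS and the observed packet sizes is simple arithmetic: a TCP segment travels with a 20-byte IPv4 header and a 20-byte TCP header when neither carries options. The sketch below assumes exactly that option-free case; the function name `packet_size` is illustrative.

```python
IPV4_HEADER_BYTES = 20   # IPv4 header without options
TCP_HEADER_BYTES = 20    # TCP header without options


def packet_size(mss, tcp_option_bytes=0):
    """IP packet size for a full-sized TCP segment with the given MSS."""
    return mss + IPV4_HEADER_BYTES + TCP_HEADER_BYTES + tcp_option_bytes


if __name__ == "__main__":
    for mss in (512, 536, 1460):
        print(mss, "->", packet_size(mss))
```

Under these assumptions an MSS of 512 gives 552 bytes and 536 gives 576 bytes, matching the peaks above, while a segment with no payload is 40 bytes. A 1500-byte Ethernet MTU correspondingly leaves room for an MSS of 1460 bytes.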
Figure 7b shows the cumulative distribution of packet sizes, and of bytes by the size of packets carrying them. This graph shows that almost 75% of the packets are smaller than 552 bytes, the packet size that results from the typical TCP MSS of 512 bytes. Nearly half of the packets are 40 to 44 bytes in length. Note however that in terms of bytes, the picture is much different. While almost 60% of packets are 44 bytes or less, constituting a total of 7% of the byte volume, over half of the bytes are carried in packets of size 1500 bytes or larger.
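The two curves in such a plot can be computed from a list of observed packet sizes as sketched below. The function name `cumulative_shares` and the input format (one integer size per packet) are assumptions for illustration, and the sample in the example is synthetic, not the measured trunk data.

```python
from collections import Counter


def cumulative_shares(packet_sizes):
    """Return rows of (size, cumulative packet fraction, cumulative byte fraction).

    Each row states which share of all packets, and of all bytes, is
    accounted for by packets of that size or smaller.
    """
    counts = Counter(packet_sizes)
    if not counts:
        return []
    total_packets = sum(counts.values())
    total_bytes = sum(size * n for size, n in counts.items())
    rows = []
    packets_so_far = 0
    bytes_so_far = 0
    for size in sorted(counts):
        packets_so_far += counts[size]
        bytes_so_far += size * counts[size]
        rows.append((size,
                     packets_so_far / total_packets,
                     bytes_so_far / total_bytes))
    return rows


if __name__ == "__main__":
    # Synthetic sample: mostly small packets, a few large ones.
    sample = [40] * 6 + [552] * 2 + [1500] * 2
    for size, pkt_frac, byte_frac in cumulative_shares(sample):
        print(f"{size:5d}  packets<= {pkt_frac:.2f}  bytes<= {byte_frac:.2f}")
```

Even in this toy sample the small packets dominate the packet count while contributing only a small share of the bytes, which is the qualitative pattern the figure shows.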
Analyzing composition of traffic by protocol type is important because some protocols are ``friendlier'', or more responsive to network signals of congestion, than others, and a strong growth in the proportion of unfriendly protocol traffic would have unsalutary implications for the infrastructure. On the Internet, standard implementations of TCP (Transmission Control Protocol) are friendly, while UDP (User Datagram Protocol) implementations are not. Fortunately for the stability of the infrastructure, TCP is the protocol that carries most popular applications known to users today: e.g., web, e-mail, net news.
Using a sample of traffic from MCI's backbone for a week in April 1998, TCP averaged about 95% of the bytes, 90% of the packets, and 80% of the flows (approximately `conversations'). UDP makes up most of the rest of the traffic, with IPv6, encapsulated IP (IP-in-IP), ICMP, and other protocols taking up around 3% of the traffic. For applications that use TCP and UDP, the Web is the dominant application on the link, comprising up to 75% of the bytes and 70% of the packets when client and server traffic are considered together. The sizeable `other' category is spread among a wide range of TCP and UDP port numbers, no one of which represents a significant percentage of the traffic by itself. Among the most common port numbers in this category are 81, 443, 3128, 8000, and 8080, which are all Web-related, indicating that the Web may actually be slightly under-represented in measurements.
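The under-counting effect mentioned above can be illustrated by classifying flows by port number twice: once treating only port 80 as Web traffic, and once also counting the Web-related ports listed in the text. This is a hypothetical sketch, not the analysis code used for the measurements; the flow layout `(src_port, dst_port, bytes)` and the function name `web_byte_share` are assumptions, and the sample flows are invented.

```python
WEB_PORT = 80
# Ports the text identifies as Web-related but counted under 'other'.
WEB_RELATED_PORTS = {81, 443, 3128, 8000, 8080}


def web_byte_share(flows, include_related=False):
    """Fraction of bytes in flows where either endpoint uses a Web port.

    flows: iterable of (src_port, dst_port, nbytes) tuples (assumed layout).
    """
    ports = {WEB_PORT}
    if include_related:
        ports |= WEB_RELATED_PORTS
    total = 0
    web = 0
    for src_port, dst_port, nbytes in flows:
        total += nbytes
        if src_port in ports or dst_port in ports:
            web += nbytes
    return web / total if total else 0.0


if __name__ == "__main__":
    sample = [(1025, 80, 700), (8080, 1026, 100), (1027, 25, 200)]
    print(web_byte_share(sample), web_byte_share(sample, include_related=True))
```

Checking both source and destination ports counts client and server traffic together, as in the figures quoted above. Port-based classification is only a heuristic, since nothing forces an application to use its conventional port.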
In addition to Web traffic, five other applications contribute an appreciable percentage of traffic: DNS, SMTP, FTP (data connections), NNTP, and telnet. SMTP averages about 5% of the bytes and packets, FTP about 5% of bytes and up to 3% of packets. NNTP represents 2% of the bytes, and less than 1% of packets. Finally, telnet accounts for about 1% of the packets, and less than 1% of the bytes, a marked decrease from recent years as alternative interactive protocols (e.g., ssh, kerberos, rlogin) have increased in popularity [CPB93].
The distribution of Internet traffic flow (conversation between two endpoints) sizes, as measured in packets, is heavy-tailed. Our measurements indicate that the majority of flows are still transaction-style, e.g., HTTP, SMTP, DNS, carrying much less traffic than the traditional bulk data transfer-style flows, e.g., ftp-data, nntp. Of particular concern is the effect of the increasing popularity of streaming and other multimedia applications that are much larger, often orders of magnitude, than even the historically `bulky' ones. Several fundamental aspects of the infrastructure, not least of which is the fairly limited resource accounting and pricing models, render this significant and by now expected shift in the distribution of flow sizes rather ominous for the stability of the current framework. Indeed, only more accurate resource consumption and concomitant pricing models will allow progress in growing infrastructure at pace with demand. This direction would be auspicious for the industry anyway, moving away from what is currently a rather randomized economic model that unsurprisingly prevents rational valuation of utility of Internet service and thus maximizing that value for the end user.
We have only provided a few examples of the potential information available via passive workload monitoring tools. Other applications of passive monitoring include: characterizing the potential benefit and optimal configuration of web caches and proxies; identifying and tracking security compromises to one's infrastructure; assessing the elasticity of flows and effectiveness of congestion control algorithms; the extent to which traffic growth is due to additional users versus an increase in per-user traffic; changes in profile of popular protocols and applications; and penetration and impact of emerging technologies and protocols such as multicast or IPv6.
Unfortunately, the state of passive monitoring technology lags significantly behind the underlying switching technology, which poses one of the most formidable obstacles to continued tracking of Internet traffic behavior over the next decade. While routers and switches at OC-192 speeds will emerge from multiple vendors in the competitive marketplace by the end of 1999, there is as yet no commodity solution to TCP/IP monitoring at even OC-3 bandwidths. The solutions that exist now are mostly research efforts with little to no software or hardware support or documentation. Particular areas of passive monitoring in desperate need of attention in the next three years include support for monitoring different link speeds (OC12, OC48, DS3), interface types, and encapsulations/framing; performance testing to assess when monitors fail to keep up with load; flexibility in configuration of what to collect; and improved security and manageability.
|No aphorism is more frequently repeated...than that we must ask Nature a few questions, or ideally, one question at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will best respond to a logically and artfully thought out questionnaire; indeed if we ask her a single question, she will often refuse to answer until some other topic has been discussed.|
|-- Sir Ronald A. Fisher, Perspectives in Medicine and Biology, 1973.|
Performance measurement techniques are often used by network engineers in diagnosing network problems; however, most recently their application has been by network users or researchers in analyzing traffic behavior across specific paths or the performance associated with individual Internet Service Providers (ISPs). A recent development in the industry is the offering of service level agreements (SLAs), contracts to guarantee a specified level of service, subject to cost rebates or other consumer remuneration should measurements suggest that the ISP did not adhere to the SLA. SLAs are rather controversial in the community since there is no standard metric or even measurement methodology for calibrating them. We will focus on tools and techniques for more generic active measurement rather than on the typically proprietary tools used to monitor current or emerging SLAs.
The skitter tool mentioned in the earlier discussion of topology measurement also has a related set of utilities that focus on active performance measurement. We will use these as examples of state-of-the-art active measurement techniques. One such module, called skping, measures at high resolution the end-to-end performance, i.e., delay and jitter, from a skitter source host to a selected destination. Skping sends ICMP echo requests (a technique used by the ping utility) to a destination host and listens for ICMP echo replies, plotting round trip time (RTT) values and packet loss in real time for the last N points, as well as summary RTT information in a candle plot.
Figure 8 shows an example skping delay distribution, with the common heavy-tail characteristic of many Internet end-to-end delay distributions (where many points lie above the lower band of the majority of the data).
|Figure 8(a): skping real-time plotting of measured round trip time (RTT) delay from lancelet.caida.org (in Ann Arbor, MI) to www.ucsd.edu (in San Diego, CA)|
|Figure 8(b): distribution of last 1600 delay values measured in figure 8(a); green vertical line represents the median value.|
|Figure 8(c): distribution of last 1600 delay values measured in figure 8(a) (log scale); green vertical line represents the median value.|
|Figure 8(d): Box & whisker plot of delay values measured in figure 8(a) (log scale). Each candle represents 400 delay values. The blue box delineates the 25th and 75th percentile of those 400 values; the ends of the whiskers delineate the minimum and maximum values. This plot shows a heavy tailed distribution across a fairly long period of time.|
|Figure 9(a): skping running across a path experiencing approximately 10% packet loss sustained over several minutes during a weekday hour in February 1998.|
|Figure 9(b): distribution of last 1600 points measured for skping running in figure 9(a).|
|Figure 9(c): box & whisker plot of skping data measured to www.freebsd.org over time. See figure 8(d) for an explanation of box & whisker plots.|
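The box & whisker ('candle') summaries in these figures reduce a batch of RTT samples to a minimum, quartiles, and a maximum, alongside the packet loss rate. The following is a sketch of that kind of summary, not skping's own code: the function name `candle` and the convention of recording a lost probe as `None` are assumptions, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ from the method used for the figures.

```python
import statistics


def candle(rtts):
    """Summarize one batch of RTT samples (milliseconds).

    rtts: list of numbers, with None standing for a probe that got no reply.
    Requires at least two replies in the batch.
    """
    received = sorted(r for r in rtts if r is not None)
    q1, median, q3 = statistics.quantiles(received, n=4)
    return {
        "min": received[0],      # lower whisker
        "q1": q1,                # bottom of the box (25th percentile)
        "median": median,
        "q3": q3,                # top of the box (75th percentile)
        "max": received[-1],     # upper whisker
        "loss": 1 - len(received) / len(rtts),
    }


if __name__ == "__main__":
    # Invented samples: mostly ~70-77 ms, one outlier, one lost probe.
    sample = [70, 72, 71, 73, None, 250, 74, 75, 76, 77]
    print(candle(sample))
```

A single large outlier leaves the median and the box almost unchanged while stretching the upper whisker, which is how a heavy tail shows up in a candle plot. Slicing a long series into consecutive batches (400 values each in Figure 8(d)) and summarizing each batch gives one candle per batch.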
There are many other active performance measurement efforts undertaken by various players in the Internet community, the most popular of which are typically user-instigated `Internet weather reports', a selection of which are described in Nancy Bachman's http://www.caida.org//tools/taxonomy/performance.xml#weather page. The most important deliverables of most current active monitoring tools focus on either verifying bandwidth or performance stated or implied by vendors and providers, or ascertaining those parameters if the information is not available in the first place. But there are an enormous number of research questions not under concerted investigation at the moment due to the lack of adequate active tools for doing so. Identifying and locating what might be construed as particularly topologically critical pieces of the public infrastructure is one goal that the developers of the skitter platform hope to accomplish. Others include: finding particular periodic cycles or frequency components in performance data; developing a calculus for describing and drawing the difference between two given `snapshots' of network performance; finding the topological `center' of the net; techniques for real-time visualization of routing dynamics; and correlation with passive measurements.
|people make the mistake of talking about `natural laws'. there are no natural laws. there are only temporary habits of nature.|
|-- Robert Green Ingersoll|
The robustness and reliability of the Internet are highly dependent on efficient, stable routing among provider networks. Analysis of real world Internet routing behavior has direct implications for the next generation of networking hardware, software and operational policies. Observations of macroscopic routing dynamics provide insights into:
In the example in figure 9, we had discussed the suspected presence of congestion along the path, but no tools focusing on only the end hosts would suffice for pinpointing the problem at any finer granularity. Conveniently enough, the skitter family of tools has another utility, called sktrace, that pursues hop-by-hop analysis of an entire path from source to destination.
|Figure 10: sktrace data measured to www.freebsd.org. See figure 8(d) for an explanation of box & whisker plots.|
|Figure 11: sktrace scatter plot of RTT data for each hop along the path to www.freebsd.org, also suggesting congestion along the path.|
|Figure 12: sktrace for a route changing in the middle of an sktrace measurement.|
Another skitter tool, skpaths, is effective for highly dynamic paths because it colors the end-to-end delay data point based on the particular path taken for a given measurement. Figure 13 shows an example, where the stable delay values with two predominant independent paths suggest load balancing across multiple paths. In contrast, figure 14 shows much more jitter (variance) in the delay data, strongly suggesting route instability. Data from these and other sites over time suggest that instead of consistent performance, a significant fraction of Internet traffic takes longer than expected to reach its destination, if it ever arrives at all. This characteristic produces a tendency for heavy-tailed distributions of round-trip times on the global Internet.
|Figure 13: skpaths example, where two predominant independent paths with minimum performance variation between them suggests load balancing among routers.|
|Figure 14: skpaths reflecting what is likely route instability, with variations in round trip (RTT) performance ranging from 98 to 163 ms between the same two hosts.|
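The idea behind skpaths, separating delay samples by the path each probe actually took, can be sketched as a simple grouping step. This is an illustration only: the function name `rtt_by_path`, the input layout of `(path, rtt_ms)` pairs, and the hop names are assumptions, not the tool's actual interface.

```python
from collections import defaultdict
import statistics


def rtt_by_path(measurements):
    """Group RTT samples by the hop sequence observed for each probe.

    measurements: iterable of (path, rtt_ms), where path is a sequence of
    hop identifiers (assumed layout).
    Returns {path_tuple: {"count", "median", "spread"}}.
    """
    groups = defaultdict(list)
    for path, rtt in measurements:
        groups[tuple(path)].append(rtt)
    return {
        path: {
            "count": len(rtts),
            "median": statistics.median(rtts),
            "spread": max(rtts) - min(rtts),
        }
        for path, rtts in groups.items()
    }


if __name__ == "__main__":
    sample = [
        (("a", "b", "d"), 98), (("a", "c", "d"), 101),
        (("a", "b", "d"), 99), (("a", "c", "d"), 103),
        (("a", "b", "d"), 100),
    ]
    for path, stats in rtt_by_path(sample).items():
        print(" > ".join(path), stats)
```

A few paths, each with a small spread and similar medians, is the pattern the text associates with load balancing; many distinct paths or a large spread within a path would point toward instability instead.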
Other areas of analysis with strong technical and policy implications: assessing the effectiveness of utilization of the IP address space; extent of asymmetric routing and route instability as a function of service provider and over time; the distribution of traffic by network address prefix lengths; efficiency of usage of BGP routing table space, e.g., via aggregation; favoritism of traffic flow and routing toward a small proportion of the possible addresses/entities; degree of incongruity between unicast and multicast routing; and quantifying effects on connectivity of removal of specific ASes.
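To make the aggregation item concrete, the sketch below merges adjacent address prefixes into the smallest equivalent set using Python's standard `ipaddress` module. It shows only the address arithmetic: whether two prefixes may really be announced as one aggregate in BGP also depends on their origin and routing policy, which this example ignores. The function name `aggregate` is illustrative and the prefixes are documentation addresses, not real routing table entries.

```python
import ipaddress


def aggregate(prefixes):
    """Collapse a list of prefix strings into the fewest covering prefixes."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


if __name__ == "__main__":
    table = ["192.0.2.0/25", "192.0.2.128/25", "198.51.100.0/24"]
    merged = aggregate(table)
    print(len(table), "entries ->", len(merged), "entries:", merged)
```

Comparing the number of entries before and after collapsing gives a rough upper bound on how much table space aggregation could save for a given set of prefixes.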
|Science is not about control. It is about cultivating a perpetual sense of wonder in the face of something that forever grows one step richer and subtler than our latest theory about it. It is about reverence, not mastery.|
|-- Richard Powers, from The Gold Bug Variations|
Each measurement effort provides a new window on the infrastructure for network operators, designers and researchers. But without well-considered, strategically deployed, and collaboratively maintained measurement tools/infrastructure, these windows do not necessarily offer any useful insight. A particular obstacle is the lack of a reasonable knowledge base for mapping IP addresses to more useful analysis entities: autonomous systems (BGP routing granularity), countries, router equipment (multiple IP addresses map to the same router but without any mechanism for deriving the mapping), geographic location information (latitude/longitude coordinates). There are efforts underway to develop prototype databases for canonical mappings; http://www.caida.org/outreach/info/ lists some of them, but their precision, completeness, and concomitant utility will require more concerted community participation.
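One of these mappings, from IP address to autonomous system, is conventionally derived by longest-prefix match against a table of routed prefixes and their origin ASes. The sketch below assumes such a table is already available as `(prefix, asn)` pairs; the function names `build_table` and `lookup_as` are hypothetical, the linear scan is chosen for clarity rather than speed, and the prefixes and AS numbers are documentation and private-use values.

```python
import ipaddress


def build_table(entries):
    """entries: iterable of (prefix string, origin AS number) pairs."""
    return [(ipaddress.ip_network(prefix), asn) for prefix, asn in entries]


def lookup_as(table, address):
    """Return the AS of the most specific prefix containing address, or None."""
    addr = ipaddress.ip_address(address)
    best = None
    for network, asn in table:
        if addr in network:
            # Prefer the longest (most specific) matching prefix.
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, asn)
    return best[1] if best else None


if __name__ == "__main__":
    table = build_table([
        ("192.0.2.0/24", 64512),
        ("192.0.2.128/25", 64513),
    ])
    for ip in ("192.0.2.5", "192.0.2.200", "198.51.100.1"):
        print(ip, "->", lookup_as(table, ip))
```

An address with no covering prefix maps to `None`, which is the case the text warns about: the result is only as complete as the table behind it. A production lookup over a full routing table would normally use a trie rather than a scan.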
Indeed, progress in this field requires both top-down and bottom-up pursuit: application developers must scope out what measurements would allow their software to negotiate performance constraints with the network, and Internet service providers need to participate in deploying and evaluating the utility of measurement technology for their own network design, operation, and cost recovery.
The network research community is in a difficult position between these two groups, hoping to design a framework for windows that are useful. For several years the infrastructure was in such a measurement-deprived state that even deploying any data collection tool at all qualified as ground-breaking work. The current state is quite different: there is plenty of measurement occurring, albeit of questionable quality. The current community imperative is rather for more thoughtful infrastructure-relevant analysis of the data that is collected, in particular correlating among data sources/types, and providing feedback into tool design to improve future data acquisition techniques. Unlike many other fields of engineering, Internet data analysis is no longer justifiable as an isolated activity. The ecosystem under study has grown too large, and is under the auspices of too many independent, uncoordinated entities. Nonetheless, the system is evolving rapidly, and prudence would dictate that the depth and breadth of our understanding of it follow in much closer pursuit.
Acknowledgements. Thanks to Daniel McRobb for help with the sections on performance and routing, and to Nancy Bachman for helpful editing comments. Many thanks to Bill Cheswick and Hal Burch (Lucent/Bell Laboratories) for providing the graph layout code for Figure 1. For more information see http://cheswick.com/ches/map/
kc claffy founded CAIDA, a collaborative organization supporting cooperative efforts among the commercial, government and research communities aimed at promoting a scalable, robust Internet infrastructure. CAIDA is based at the University of California's San Diego Supercomputer Center (SDSC). Support for these efforts is provided by CAIDA members and by the Defense Advanced Research Projects Agency (DARPA), through its Next Generation Internet program, and by the National Science Foundation (NSF). More information is available at http://www.caida.org.