The traces used for this study were collected from the NASA Ames Internet exchange (AIX) in Mountain View, CA [AIX] as part of an NSF/NASA collaborative effort with NLANR/MOAT. They were collected from one of four (now five) OC-3 ATM links that interconnect AIX and MAE-West in San Jose, CA.
Fig 1: A diagram showing the location of the optical splitter used to collect the data. Note that there are currently five links between NASA-Ames and MAE-West. Thanks to Hans-Werner Braun and NLANR/MOAT for use of this figure.
This group of links forms a striped connection between two DEC Gigaswitches, with an aggregate bandwidth approximately equal to that of a single OC-12 link. The Gigaswitches use a proprietary scheduling algorithm for sending packets across this connection, but each packet is sent across an individual link inside an AAL5 PDU. This means the scheduling inside the Gigaswitches happens at the packet level, since all cells from a PDU are sent over the same link.
Consequently, the data we collect from this site is essentially a sub-sample of the traffic traversing the striped connection, selected by the proprietary scheduling algorithm inside the Gigaswitches. This algorithm is approximately round-robin, but also depends on internal load characteristics in the Gigaswitch switching fabric. However, we know that the distribution of packets among the OC-3 links is not entirely uniform, since measurements of the link utilizations show two of them carry approximately twice the traffic of the other two (measured by byte volume, not packet volume) [Feldman98]. We assume that the Gigaswitch scheduling algorithm is independent of the encapsulated protocol, e.g., not dependent upon packet length.
Because of this complex scheduling algorithm, we cannot accurately estimate the number of conversations traversing the monitored links, or the length of these conversations in packets or bytes. Consequently, we characterize the workload observed at AIX only in terms of relative fractions of packets and bytes.
The data collection system is similar to the one used in [Thompson97]. A Coral/OC3mon platform was connected to one link in each direction using optical splitters. The traces we studied were collected as part of NLANR/MOAT's Network Analysis Infrastructure (NAI) project [Braun98]. For each packet that passes the monitor, only the first ATM cell from the AAL5 PDU is captured and written to disk. The first cell contains the first 40 bytes of each packet, which is usually enough to extract the TCP or UDP port numbers from the transport layer headers. However, the monitor does not verify that the entire AAL5 PDU is carried by the link, and so estimates of the data rate carried by the link may be inflated in the presence of cell loss. Six to eight traces were collected each day, usually with a duration of 90 seconds each. The starting time for each trace was set at equal intervals during the 24-hour period, and randomized over a range of an hour at the beginning of each interval.
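The sampling schedule described above can be sketched as follows. This is a minimal illustration and not the scheduler actually used by NLANR/MOAT: the function name `trace_start_times` and its parameters are hypothetical, and we assume the random offset is drawn uniformly from the first hour of each interval.

```python
import random

def trace_start_times(traces_per_day=8, jitter_seconds=3600, rng=None):
    """Return start offsets (seconds after midnight) for one day's traces.

    The day is split into equal intervals; each trace starts at a uniformly
    random offset within the first `jitter_seconds` of its interval
    (the uniform distribution is an assumption, not stated in the text).
    """
    if rng is None:
        rng = random.Random()
    interval = 24 * 3600 / traces_per_day
    if jitter_seconds > interval:
        raise ValueError("jitter window must fit inside one interval")
    return [i * interval + rng.uniform(0, jitter_seconds)
            for i in range(traces_per_day)]
```

With eight traces per day, each interval is three hours long, so every start time falls in the first hour of a three-hour slot.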
After collection, the traces are processed to remove any information that might compromise the privacy of the individuals generating the traffic. This processing masks the source and destination IP addresses, and deletes all data from the IP payload except for a TCP or UDP header (if present), or the ICMP or IGMP type and code fields. If the packet carries enough bytes of IP header options, then the TCP or UDP port numbers may not be present in the first cell of the PDU. In this case, we ignore that packet in subsequent application workload analysis. Since the fraction of packets with IP header options is typically less than 0.003%, this doesn't seriously impact our measurements of the traffic fraction generated by the most popular TCP and UDP applications.
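To illustrate why IP header options can push the port numbers out of the captured data, the sketch below extracts TCP or UDP ports from the first 40 bytes of an IPv4 packet. It is an illustration under stated assumptions, not the processing code used for the traces: the function name `extract_ports` is hypothetical, and non-initial IP fragments (which carry no transport header) are not handled.

```python
import struct

CAPTURED_BYTES = 40  # bytes of each packet available from the first cell (per the text)

def extract_ports(first_bytes):
    """Return (protocol, src_port, dst_port), or None if ports are unavailable."""
    data = first_bytes[:CAPTURED_BYTES]
    if len(data) < 20 or data[0] >> 4 != 4:
        return None                       # not a complete IPv4 base header
    header_len = (data[0] & 0x0F) * 4     # IHL field counts 32-bit words
    protocol = data[9]
    if header_len < 20 or protocol not in (6, 17):   # 6 = TCP, 17 = UDP
        return None
    if header_len + 4 > len(data):
        return None                       # options pushed the ports past the captured bytes
    src_port, dst_port = struct.unpack_from("!HH", data, header_len)
    return protocol, src_port, dst_port
```

Under these assumptions, a packet loses its port numbers only when it carries more than 16 bytes of IP options, since a 36-byte IP header still leaves room for the 4 bytes of source and destination port.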
We used CoralReef [CoralReef] to reduce each raw trace to a set of summary tables that we archived for later analysis. The tables include aggregate numbers such as the number of packets and bytes in the trace as well as distributions of packet lengths and the number of packets and bytes seen for each IP-layer protocol.
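The kind of per-protocol summary described here can be sketched in a few lines. This is not CoralReef's interface; the function `protocol_fractions` and its input format, a sequence of (IP protocol number, packet length in bytes) records, are hypothetical and chosen only for illustration. The output is expressed as relative fractions, in keeping with the sub-sampled nature of the data.

```python
from collections import Counter

def protocol_fractions(packets):
    """Summarize (ip_protocol, length_in_bytes) records as relative fractions.

    Returns {protocol: (packet_fraction, byte_fraction)}.
    """
    pkt_counts, byte_counts = Counter(), Counter()
    for protocol, length in packets:
        pkt_counts[protocol] += 1
        byte_counts[protocol] += length
    total_pkts = sum(pkt_counts.values())
    total_bytes = sum(byte_counts.values())
    if total_pkts == 0 or total_bytes == 0:
        return {}                         # nothing to normalize
    return {p: (pkt_counts[p] / total_pkts, byte_counts[p] / total_bytes)
            for p in pkt_counts}
```

For example, `protocol_fractions([(6, 1500), (6, 500), (17, 2000)])` attributes two thirds of the packets but only half of the bytes to TCP (protocol 6).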
For TCP and UDP, we analyze application usage using pairs of source and destination port numbers. The packet traces available from the NAI archive [Braun98] include only IP and transport layer headers, because all payload data has been removed to protect the privacy of Internet users. Our methodology therefore does not use encapsulated data to identify the application that generated the packets.
In most cases, we have assumed that packets sent between a port number higher than 1023 and a well-known port number at or below 1023 are generated by the protocol associated with the well-known port (e.g., HTTP on port 80). This matches typical end host behavior, in which clients allocate ephemeral ports from the range 1024 to 32767 [Stevens94].
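A minimal sketch of this port-pair rule is shown below. It assumes that well-known ports span 0 through 1023; the `SERVICES` table is a small illustrative subset of IANA registrations rather than the mapping used in the study, and the function name `classify_by_port` is hypothetical.

```python
WELL_KNOWN_MAX = 1023  # assumption: well-known ports are 0..1023

# Illustrative subset of IANA well-known ports; not the study's full table.
SERVICES = {20: "ftp-data", 21: "ftp", 25: "smtp", 53: "dns",
            80: "http", 119: "nntp", 443: "https"}

def classify_by_port(src_port, dst_port):
    """Attribute a packet to the service on its well-known port, if any."""
    low, high = sorted((src_port, dst_port))
    # Exactly one endpoint must be well-known and the other above 1023.
    if low <= WELL_KNOWN_MAX < high:
        return SERVICES.get(low, "other well-known")
    return "unclassified"
```

Because the two ports are sorted first, the rule treats both directions of a conversation the same way: `classify_by_port(34567, 80)` and `classify_by_port(80, 34567)` both yield `"http"`.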
For some of the protocols, we have condensed ranges of port numbers in both the source and destination fields. For example, the RealAudio category in the UDP table includes all traffic with destination ports between 6970 and 7170 inclusive [RealNetworks]. Unfortunately, this range also includes the ports used by AFS, and so we are potentially confusing an unknown amount of AFS traffic with RealAudio. However, the majority of RealAudio traffic appears on UDP ports 6970, 6971, and 6972, none of which are used by AFS. By only considering traffic on UDP ports from this range that are not used by AFS, the amount of RealAudio traffic can be estimated independently from the amount of AFS traffic that may be present as well.
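The separation of RealAudio from AFS can be sketched as a pair of bounds on the RealAudio byte count. The RealAudio range comes from the text; the AFS range of 7000 to 7009 is our assumption, based on the IANA afs3-* registrations, and is not given in the text. The function names and the input format (a mapping from UDP destination port to byte count) are hypothetical.

```python
REALAUDIO_PORTS = range(6970, 7171)  # 6970-7170 inclusive, per the text
AFS_PORTS = range(7000, 7010)        # assumption: IANA afs3-* ports 7000-7009

def classify_udp_dest(dst_port):
    """Label a UDP destination port as RealAudio, ambiguous, or other."""
    if dst_port in AFS_PORTS:
        return "ambiguous"           # could be RealAudio or AFS
    if dst_port in REALAUDIO_PORTS:
        return "RealAudio"
    return "other"

def realaudio_byte_bounds(bytes_by_port):
    """Return (lower, upper) bounds on RealAudio bytes from per-port totals."""
    lower = sum(b for p, b in bytes_by_port.items()
                if classify_udp_dest(p) == "RealAudio")
    ambiguous = sum(b for p, b in bytes_by_port.items()
                    if classify_udp_dest(p) == "ambiguous")
    return lower, lower + ambiguous
```

The lower bound counts only ports that AFS does not use, while the upper bound adds all traffic on the shared ports; the true RealAudio volume lies somewhere between the two.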
We are currently investigating better techniques for differentiating RealAudio and AFS traffic using packet size distributions and packet interarrival patterns, and we hope to be able to conclusively differentiate between the two in the future. A recent analysis of the traffic patterns exhibited by RealAudio traffic [Mena00] has shown several parameters that may be used to differentiate between RealAudio and other protocols. A further study characterizing AFS traffic patterns needs to be undertaken to identify the best metrics to use to separate the two.
For both TCP and UDP traffic, there is a significant fraction of traffic that cannot be mapped to applications using well-known port numbers. Many protocols do not depend on well-known port numbers, but either use a well-known service for negotiating the port numbers used by secondary connections, or use arbitrary but fixed port numbers that are not registered with IANA. The most popular application with negotiated port numbers is passive-mode FTP, in which the server sends the client the port number to use for the data connection over the command channel. Many other protocols behave similarly, such as Napster and Internet telephony applications.
Most online games do not register well-known ports with IANA, but use arbitrary port numbers above 5000. We have collected the port numbers used by several of the popular games and use this information to estimate the fraction of traffic generated by them. Our analysis of online game traffic includes game traffic on the following UDP ports:
| Game | UDP ports |
|---|---|
| Half Life | any to or from 27005 |
| | any to or from 27015 |
| Quake 3: Arena | any to or from 27960 |
| Starcraft | 6112 to 6112 |
| Quake II | any to or from 27901 |
| | any to or from 27910 |
| QuakeWorld | any to or from 27500 |
| | any to or from 27001 |
| Unreal | any to or from 7777 |
Table 1: UDP ports used by Online Games
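The matching rules in Table 1 can be expressed as a small lookup. This is a sketch rather than the analysis code used for the study: the function name `classify_game` is hypothetical, and the choice to check the destination port before the source port when both match different games is our assumption.

```python
# "any to or from" entries: a match on either the source or destination port.
EITHER_PORT = {
    27005: "Half Life", 27015: "Half Life",
    27960: "Quake 3: Arena",
    27901: "Quake II", 27910: "Quake II",
    27500: "QuakeWorld", 27001: "QuakeWorld",
    7777: "Unreal",
}
# "6112 to 6112": source and destination ports must both match.
BOTH_PORTS = {6112: "Starcraft"}

def classify_game(src_port, dst_port):
    """Return the game name for a UDP port pair, or None if no rule matches."""
    if src_port == dst_port and src_port in BOTH_PORTS:
        return BOTH_PORTS[src_port]
    for port in (dst_port, src_port):   # destination checked first (assumption)
        if port in EITHER_PORT:
            return EITHER_PORT[port]
    return None
```

Note that a packet with only one endpoint on port 6112 is left unclassified, since the Starcraft rule requires the port on both sides.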
As is the case with RealAudio and AFS, there are many possibilities for confusion between game traffic and other applications when only port numbers are used to make the classification. We assume that there are no other protocols that preferentially use these same ports, and that applications that ephemerally use these ports contribute equal amounts of traffic across all traffic categories. This assumption carries significant risks, and needs further analysis to fully evaluate its impact on our data.
- [AIX] Ames Internet eXchange, http://aix.arc.nasa.gov/ (No longer available)
- [Braun98] H.-W. Braun. Towards a systemic understanding of the Internet organism: a framework for the creation of a Network Analysis Infrastructure, http://moat.nlanr.net/NAI (No longer available)
- [CoralReef] CoralReef home page, http://www.caida.org/tools/measurement/coralreef .
- [Feldman98] S. Feldman. MAE-West Link Utilization Statistics, http://www.mae.net/~feldman/gigaswitch/ames (No longer available)
- [Mena00] A. Mena and J. Heidemann. An Empirical Study of RealAudio Traffic, IEEE INFOCOM 2000, http://www.isi.edu/~johnh/PAPERS/Mena00a.html .
- [RealNetworks] RealNetworks RealSystem Firewall Support, http://service.real.com/firewall/adminfw.html .
- [Stevens94] W. Richard Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, 1994.
- [Thompson97] K. Thompson, G. Miller, and R. Wilder. Wide Area Internet Traffic Patterns and Characteristics. IEEE Network, Vol. 11 No. 6, pp. 10-23, Nov/Dec 1997. http://dx.doi.org/10.1109/65.642356 .