- PATH CAPACITY
- AVAILABLE BANDWIDTH
Most measurements were taken between wednesday.caida.org (in San Diego, CA) and cruise.ornl.gov (in Oak Ridge, TN). Wednesday is running FreeBSD 4.8-RELEASE, and cruise is running Red Hat Linux release 9 with a 2.4.23 kernel that has the Web100 patch (www.web100.org) geared for high speed links. Both hosts are more than fast enough to handle high speed transfers. Specifically, wednesday has two 1.8GHz Pentium 4 Xeon processors and a PCI-X slot for the NIC, and cruise has four 2.8GHz Pentium 4 Xeon processors. Both hosts have a Gigabit Ethernet interface with an MTU of 9000 bytes. However, only the path from cruise to wednesday supports 9000-byte packets end-to-end. The path MTU of the reverse path, from wednesday to cruise, is limited to 1500 bytes by a switch at a peering link between CENIC and ESnet. If it were not for this switch, the path MTU would be 9000 bytes. The end-to-end capacity from cruise to wednesday is 1 Gbps; the capacity in the opposite direction is 622 Mbps (OC12). However, taking the protocol overhead of Packet over SONET into account, the maximum achievable throughput at the IP layer on the path from wednesday to cruise is 597 Mbps for 1500 byte packets and 600 Mbps for 9000 byte packets. The protocol overhead of Gigabit Ethernet in the opposite direction places a limit of 975 Mbps for 1500 byte packets and 996 Mbps for 9000 byte packets.
The two paths between wednesday and cruise consist exclusively of academic, research, and government networks, including CENIC, ESnet, and Abilene. The paths are highly asymmetric, with one direction crossing Abilene but not ESnet, and the other direction crossing ESnet but not Abilene.
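The Gigabit Ethernet limits quoted above can be reproduced with a short calculation. The following is a minimal sketch, assuming the standard per-frame Ethernet overhead of 38 bytes (preamble and start-of-frame delimiter, MAC header, frame check sequence, and inter-frame gap); the function name is ours, and the Packet over SONET figures are not reproduced here because they depend on framing details not stated in the text.

```python
# Per-frame Gigabit Ethernet overhead in bytes (standard values, assumed here):
# 8 preamble + SFD, 14 MAC header, 4 frame check sequence, 12 inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 8 + 14 + 4 + 12  # 38 bytes

def ip_layer_limit_mbps(line_rate_mbps, ip_packet_bytes,
                        overhead_bytes=ETHERNET_OVERHEAD_BYTES):
    """Maximum IP-layer throughput when packets are sent back to back."""
    return line_rate_mbps * ip_packet_bytes / (ip_packet_bytes + overhead_bytes)

for size in (1500, 9000):
    print(size, round(ip_layer_limit_mbps(1000, size)))
# prints:
# 1500 975
# 9000 996
```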
The following shows the traceroute paths in each direction augmented with the capacity of each link and the interfaces at each hop facing away from the traceroute source. We obtained the link capacities in the following three main ways: (1) by asking a network operator, (2) by extracting capacity clues from router names (e.g., "ge" often means Gigabit Ethernet), and (3) by applying experimental techniques currently under investigation. We had to guess at a few link capacities, and these are noted with a question mark in the following annotated traceroutes.
We obtained the interfaces facing away from each traceroute source by applying several known techniques. First, we performed traceroutes from multiple vantage points to find candidate interfaces and to determine the interface adjacencies (that is, to determine the pairs of interfaces that appear next to each other in actual traceroute paths, in order to confirm the existence of a link between the routers of such interface pairs). Then we matched up the interfaces in the forward and reverse directions with two alias resolution techniques. The first technique is the one used in the Mercator project and by CAIDA's iffinder. We ping an interface A and examine the address, say B, recorded in the response packet. If B is not the same as A, then we know that B is another interface on the same router as A. The second technique is essentially the IP-ID technique used in the University of Washington's Rocketfuel project. We send a specially interleaved sequence of pings to the interfaces A and B and concurrently run tcpdump, capturing the ICMP responses. If the IP ID values in the response packets from A and B increase in an interleaved fashion without duplication and without a great separation in values, then we can be confident that A and B are on the same router (under the assumption that the IP ID is generated from a counter shared by all interfaces on a router).
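The IP-ID test described above can be sketched as follows. This is only an illustration of the idea, not the code we ran: the function name, the input format (interface label and IP ID in capture order), and the `max_gap` threshold of 1000 are all assumptions made for the example.

```python
def likely_same_router(samples, max_gap=1000):
    """Check whether interleaved ping responses look like one shared IP ID counter.

    samples: list of (interface, ip_id) tuples in the order the responses
             were captured, from pings alternated between two interfaces.
    max_gap: hypothetical threshold for "a great separation in values".
    """
    if len({iface for iface, _ in samples}) < 2:
        return False  # need responses from both interfaces
    prev = samples[0][1]
    for _, ip_id in samples[1:]:
        step = (ip_id - prev) % 65536  # IP ID is a 16-bit field; allow wraparound
        if step == 0 or step > max_gap:
            return False  # duplicate value, or jump too large for a shared counter
        prev = ip_id
    return True

# Made-up sample values for illustration only.
shared = [("A", 100), ("B", 103), ("A", 104), ("B", 110)]
separate = [("A", 100), ("B", 40000), ("A", 101), ("B", 40003)]
print(likely_same_router(shared), likely_same_router(separate))
```

A positive result is only as strong as the shared-counter assumption; routers that randomize the IP ID or keep per-interface counters would not be matched by this test.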
|wednesday to cruise:|
|[GIGE (egm == SBC Enhanced GigaMAN)]|
1500 byte MTU -- Mike O'Conner at ESnet]
|[OC12, Packet over SONET -- Susan Hicks at ORNL]|
|11||cruise.ornl.gov (184.108.40.206)||69.828 ms|
|cruise to wednesday:|
discusses sharing the cost of "the OC48 Abilene connection"
-- also confirmed by an experimental technique of ours]
|10||wednesday (220.127.116.11)||69.832 ms|
These paths appear to have been stable over the 6 week period in which we ran the tools discussed here, with one exception. For at least a few days in mid-September, the path from wednesday to cruise changed slightly, taking a different backbone route within ESnet than before. We do not believe this transient path change affected our results in any significant way. We include this transient path below, for completeness (the differing hops are highlighted with asterisks):
|wednesday to cruise:|
|1||piranha.sdsc.edu (18.104.22.168)||3.692 ms|
|2||sdg-hpr--sdsc-sdsc2-ge.cenic.net (22.214.171.124)||0.519 ms|
|3||lax-hpr1--sdg-hpr1-10ge-l3.cenic.net (126.96.36.199)||16.101 ms|
|4||svl-hpr--lax-hpr-10ge.cenic.net (188.8.131.52)||44.063 ms|
|5||lbl2-cenic.es.net (184.108.40.206)||12.718 ms|
|6||snv-lbl-oc48.es.net (220.127.116.11)||14.001 ms|
|*7||chicr1-oc192-snvcr1.es.net (18.104.22.168)||62.028 ms|
|*8||aoacr1-oc192-chicr1.es.net (22.214.171.124)||82.357 ms|
|*9||dccr1-oc48-aoacr1.es.net (126.96.36.199)||86.567 ms|
|10||atlcr1-oc48-dccr1.es.net (188.8.131.52)||102.200 ms|
|11||ornl-atlcr1.es.net (184.108.40.206)||106.830 ms|
|12||orgwy.ornl.gov (220.127.116.11)||83.743 ms|
|13||cruise.ornl.gov (18.104.22.168)||83.493 ms|
We ran pathrate 2.4.0 between wednesday and cruise, using 8988 byte packets (8960 bytes of UDP payload). In the direction from cruise to wednesday, pathrate reported a capacity of 830 to 854 Mbps (a second run produced similar results). This is 14-17% less than the expected 996 Mbps (the maximum achievable IP layer throughput on a Gigabit Ethernet link). In the opposite direction, from wednesday to cruise, pathrate reported a capacity of 410 to 418 Mbps (two other runs produced similar results). This is 30-32% less than the expected 600 Mbps (the maximum achievable IP layer throughput on an OC12 link when using Packet over SONET).
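The percentages above follow directly from the reported ranges and the expected IP-layer limits. A minimal sketch of the arithmetic (the helper name is ours):

```python
def shortfall_percent(measured_mbps, expected_mbps):
    """How far a capacity estimate falls below the expected value, in percent."""
    return 100.0 * (expected_mbps - measured_mbps) / expected_mbps

# cruise to wednesday: 830 to 854 Mbps against an expected 996 Mbps
print(round(shortfall_percent(854, 996)), round(shortfall_percent(830, 996)))  # 14 17
# wednesday to cruise: 410 to 418 Mbps against an expected 600 Mbps
print(round(shortfall_percent(418, 600)), round(shortfall_percent(410, 600)))  # 30 32
```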
Fragmentation may be partly responsible for this noticeable underestimation in the wednesday-to-cruise direction. Along this path, each 8988 byte packet is fragmented to 7 packets of at most 1500 bytes each. Thus, the additional IP header overhead and the dispersion (if any) of these fragments will increase the effective transmission time for a single 8988 byte packet, lowering the capacity estimate. Indeed, pathrate does produce a better estimate when fragmentation is avoided. When run with 1500 byte packets, pathrate reported a capacity of 521 to 576 Mbps, or 4-13% less than expectation. Thus, although, in general, using large packets on high-speed links is strongly recommended, fragmentation appears to undermine some of the advantages of large packets over small packets, at least for making accurate capacity estimates.
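The fragment count and the extra header overhead mentioned above can be worked out as follows. This is a sketch assuming IPv4 with a 20-byte header and no options; the function names are ours.

```python
import math

def fragment_count(packet_bytes, path_mtu, ip_header=20):
    """Number of IPv4 fragments needed to carry one packet over a smaller MTU."""
    payload = packet_bytes - ip_header
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (path_mtu - ip_header) // 8 * 8
    return math.ceil(payload / per_fragment)

def bytes_after_fragmentation(packet_bytes, path_mtu, ip_header=20):
    """Total IP bytes sent once each extra fragment has gained its own header."""
    n = fragment_count(packet_bytes, path_mtu, ip_header)
    return packet_bytes + (n - 1) * ip_header

print(fragment_count(8988, 1500))             # 7
print(bytes_after_fragmentation(8988, 1500))  # 9108
```

The added IP headers amount to only 120 bytes per 8988 byte packet, so the header overhead alone cannot account for the size of the underestimate; any dispersion of the fragments would have to contribute as well.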
We also ran pathrate from 15 skitter monitors, which are distributed across the globe, to cruise. These monitors run various versions of FreeBSD from 3.1 to 4.8 and typically have older generation Intel processors (PII, PIII, and Celerons). They all have Ethernet interfaces configured to 100 Mbps with an MTU of 1500 bytes. The following table provides a summary of the pathrate measurements. All measurements, including the hop distance and RTT, were taken in the direction towards cruise from the skitter monitors.
|skitter monitor||location||hop dist||RTT (ms)||estimated capacity (Mbps)|
|apan-jp||Tokyo||8||161||97 - 98|
|arin||Bethesda, MD||12||42||1.5 - 1.5|
|b-root||Marina del Rey, CA||13||71||98 - 104|
|cam||Cambridge, UK||15||117||95 - 99|
|cdg-rssac||Paris||11||113||94 - 98|
|champagne||Urbana, IL||11||37||94 - 99|
|d-root||College Park, MD||12||22||93 - 100|
|i-root||Stockholm, Sweden||19||134||20 - 21|
|iad||Washington, DC||14||141||93 - 98|
|k-peer||Amsterdam||13||109||92 - 97|
|lhr||London||15||213||92 - 99|
|nrt||Tokyo||12||199||93 - 99|
|uoregon||Eugene, OR||12||69||140 - 153|
|yto||Ottawa, Canada||11||62||96 - 99|
We speculate that many of these paths have an end-to-end capacity of 100 Mbps. If true, then the pathrate results are generally quite good. The lower capacities seen for arin and i-root might be due to rate limiting by an immediate provider. The estimate for uoregon is the only result that is obviously wrong, since the capacity range of 140 to 153 Mbps exceeds the configured speed of the interface card in the uoregon monitor. (Three separate runs of pathrate, over three hours, produced approximately the same capacity range.) However, another pathrate run made from uoregon to cruise two weeks later produced a more reasonable capacity range of 93 to 100 Mbps. Although the path between these hosts did change at one hop during the two weeks, we do not believe the difference in results can be attributed solely to a path change. We suspect some characteristic of the cross traffic in the earlier run might be the real culprit, but this behavior needs further investigation.
We ran the abing sender on wednesday and the abing reflector on cruise. In this arrangement, traffic originates on wednesday and is bounced back to wednesday by cruise. Because of the bidirectional traffic, the abing process running on wednesday is able to determine the available bandwidth and the "dominated bottleneck capacity" (DBC) in both directions at once. The DBC is the capacity of the link that has the least available bandwidth of all links in the path. In general, the DBC does not necessarily correspond to the end-to-end capacity, that is, to the capacity of the slowest link in the path. For instance, a heavily loaded 1 Gbps link can become the "dominated bottleneck" even if the path has 100 Mbps links, so long as the 100 Mbps links have greater available bandwidth than the 1 Gbps link. However, if the links with the least capacity also have the least available bandwidth, then the DBC will correspond to the end-to-end bottleneck capacity. Although not advertised as such, abing may be useful as a capacity estimation tool, under the right conditions, which the paths of our study seem to meet. The paths between wednesday and cruise are a mixture of OC12, Gigabit Ethernet, 10 Gigabit Ethernet, and OC192 links. It is probably very rare for the 10 Gbps links to have less than 1 Gbps of available bandwidth, so most of the time, the DBC will correspond to the bottleneck capacity.
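The distinction between the DBC and the end-to-end capacity can be illustrated with a small example. The link values below are made up to match the scenario described above (a heavily loaded 1 Gbps link alongside a lightly loaded 100 Mbps link); the function names are ours and this is not how abing itself computes the DBC.

```python
def dominated_bottleneck_capacity(links):
    """Capacity of the link with the least available bandwidth.

    links: list of (capacity_mbps, available_mbps) tuples, one per link.
    """
    capacity, _available = min(links, key=lambda link: link[1])
    return capacity

def end_to_end_capacity(links):
    """Capacity of the slowest link in the path."""
    return min(capacity for capacity, _available in links)

# Hypothetical path: a busy 1 Gbps link, a quiet 100 Mbps link, a 10 Gbps link.
path = [(1000, 50), (100, 80), (10000, 6000)]
print(dominated_bottleneck_capacity(path))  # 1000
print(end_to_end_capacity(path))            # 100
```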
One run of abing normally consists of 20 packet pairs. The results in the following table are the averages for 100 trains of 20 packet pairs each, with the sender running on wednesday. All figures are in Mbps, and ABw stands for available bandwidth.
These DBC values are approximately 9% and 13% less than the respective bottlenecks of 600 Mbps (OC12 POS) and 1 Gbps, and they are similar to the capacity estimates made by pathrate. Another run made a few weeks later produced similar DBC values (558 Mbps and 870 Mbps), though the reported available bandwidth differed, suggesting that DBC might be stable enough to be relied upon as a capacity estimate.
We also tried running abing with larger measurement packets, namely 8978 byte packets, in order to gauge the effects of packet size on measurement accuracy. We initially added support for large packets by modifying some constants in the abing source code, but it was later discovered that abing has an undocumented command line option for changing the packet size. Despite the existence of this undocumented option, we are not certain that abing was designed to work with large packets; it may very well be that certain calculations depend upon the default packet size of 1478 bytes (1450 bytes of UDP payload). Also, the abing reflector has a hardcoded packet size of 1478 bytes (1450 bytes of UDP payload), which was not changed, even for runs in which 8978 byte packets were sent to the reflector. We left this limit alone primarily to avoid fragmentation in the path from wednesday to cruise. Despite these caveats, experimenting with large packets seemed worth trying, considering the importance of using large packets on high-speed links.
When set to use large packets, abing failed to produce any output when run from wednesday. We could see that (fragmented) packets were being received by cruise, but the abing reflector on cruise was not echoing the packets. Switching the roles of the two machines--that is, running abing on cruise and the reflector on wednesday--did produce output, which is summarized in the following table, with each value being the average for 100 trains of either 20 or 100 packet pairs (increasing the number of packet pairs per train presumably increases the accuracy of a given measurement). Note, however, that these measurements were taken several weeks after the measurements provided in the previous table, and thus, the available bandwidth reported in these two tables cannot be compared.
|packet size||packet pairs||
Similar results were obtained for two additional runs made more than 15 minutes after the runs reported in the above table. This suggests that the above results are not isolated instances of spurious measurements. However, the results for the wednesday-to-cruise direction should most likely be disregarded, since the reflector is returning 1478 byte packets even for 8978 byte packets it receives. In the opposite direction, the results for 1478 byte packets look clearly wrong, and the results for 8978 byte packets are no better than when 1478 byte packets were used in earlier runs. Modifying the reflector to return 8978 byte packets (which are fragmented in transit to cruise) caused the abing sender on cruise to produce no results at all. The foregoing results do not necessarily mean that abing is unable to support large packets, nor that there is no benefit to using large packets, but clearly, there is more work needed to get abing to support large packets than making a few obvious changes to the source code.
The following table summarizes the results of running pathchirp between wednesday and cruise. The maximum packet size supported by pathchirp is 8228 bytes (8200 bytes of UDP payload). During the 120 seconds set for each run, the number of instantaneous bandwidth measurements produced by pathchirp (shown in the "# results" column) varied according to packet size. The available bandwidth (Mbps) reported in the table is the average value of these instantaneous bandwidth measurements.
|wed to cruise||
|cruise to wed||
These results are quite reasonable considering the known capacity in each direction. There does not appear to be any significant advantage to using large packets over small ones. In fact, runs not included in the above table show that the run-to-run variation is sometimes greater than the difference between using large and small packets, although the available bandwidth reported for large packets is usually a little higher than for small packets.
Every attempt to run pathchirp with large packets failed in the wednesday-to-cruise direction, even for longer experiment durations (set on the command line). Fragmentation may be at fault, though this needs further investigation.
The following table summarizes the results of running pathload between wednesday and cruise.
|wed to cruise||
|cruise to wed||
For 1500 byte packets in the wednesday-to-cruise direction, pathload reported that the available bandwidth was greater than the maximum sending rate that wednesday could achieve (324 Mbps, according to pathload). Sending 8988 byte packets in the same direction produced a range of 409 to 424 Mbps, though with a warning that "actual probing rate [does not equal] desired probing rate." Fragmentation may be at fault, though this needs further investigation.
For 1500 byte packets in the opposite direction, pathload reported an available bandwidth equal to the maximum sending rate it could achieve, 600 Mbps, suggesting that the actual available bandwidth is greater. With 8988 byte packets, however, pathload had no difficulty measuring the available bandwidth, which at 846 Mbps is similar to the results of other tools.
The following table summarizes the results of running spruce between wednesday and cruise. Each value of available bandwidth reported in the table is the average of 10 runs made over several minutes.
|wed to cruise||
|cruise to wed||
Overall, spruce performed poorly in the wednesday-to-cruise direction. With large packets, it always reported an available bandwidth of 0 Mbps. We suspect fragmentation may be at fault. With small packets, it reported an available bandwidth that fluctuated wildly between zero and 114 Mbps (up to 226 Mbps for later runs not reported here), with 7 out of 10 runs reporting 0 Mbps. However, we were able to improve these results by correcting the following mistake. Spruce requires the user to specify the path capacity with the -c command line option, and we had used an incorrect value. At the time of these runs, we believed the path from wednesday to cruise had an end-to-end capacity of 1 Gbps. It was later discovered that the bottleneck capacity is actually OC12, or 622 Mbps, owing to a single peering link in the middle of the path. Re-running spruce with the correct capacity produced a more plausible estimate of 296 Mbps (note that this measurement was taken several weeks after the other spruce measurements reported here; however, later runs confirm that an incorrect capacity value affects spruce in the way reported here).
[Sep 23, 2004]
|wed to cruise||
|cruise to wed||
Spruce performed much better in the cruise-to-wednesday direction. When using the hardcoded packet size of 1500 bytes, spruce reported 516 Mbps for the available bandwidth. Re-running spruce after increasing the packet size to 9000 bytes (by trivially modifying the source code) produced an estimate of 807 Mbps, which is more in line with other tools.
Stab works differently than the other tools discussed here for finding available bandwidth. Rather than reporting the end-to-end available bandwidth, stab provides an estimate of the available bandwidth for each hop in the path. Stab also indicates the probability of each link being a "thin link", which is "a link with less available bandwidth than all links preceding it on the path". The end-to-end available bandwidth is equal to the available bandwidth of the thin link nearest, or at, the destination.
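The thin-link definition quoted above can be made concrete with a short sketch. It applies the definition literally to a list of per-hop available bandwidth values; the sample values are hypothetical, the function name is ours, and stab itself reports probabilities rather than a yes/no answer.

```python
def thin_links(available_mbps):
    """Return the 1-based hops with less available bandwidth than all preceding hops.

    Under this literal reading, hop 1 always qualifies, since it has no
    preceding links to compare against.
    """
    hops = []
    lowest_so_far = float("inf")
    for hop, bandwidth in enumerate(available_mbps, start=1):
        if bandwidth < lowest_so_far:
            hops.append(hop)
            lowest_so_far = bandwidth
    return hops

# Hypothetical per-hop available bandwidth, in Mbps.
per_hop = [950, 980, 990, 900, 1200, 500, 520]
thin = thin_links(per_hop)
print(thin)                   # [1, 4, 6]
print(per_hop[thin[-1] - 1])  # 500, the end-to-end available bandwidth
```

The last thin link found is the one nearest the destination, and its available bandwidth is the minimum over the whole path.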
The following plots summarize the results of running stab between wednesday and cruise.
Figures 1 and 2 show the results in the wednesday-to-cruise direction (further runs produced similar results). Stab sent 1500 byte packets (1472 bytes of UDP payload) at a "maximum probing rate" of 1.5 Gbps and ran for 1 hour, taking 46 measurements of the available bandwidth at each of the 11 hops.[1][2]
In Figure 1, each curve traces the successive estimates of the available bandwidth for a single hop over the duration of the run. The x-axis specifies the time, in minutes, relative to the first estimate output by stab. Although the duration of the run was 1 hour, stab produced output only for the last 40 minutes of the run, presumably because the initial 20 minutes were needed to "warm up" the analysis. Broadly speaking, the available bandwidth values form two bands, an upper band above 850 Mbps and a lower band below 550 Mbps. Although the upper band reaches as high as 1250 Mbps, stab is almost certainly underestimating the available bandwidth of the many 10 Gbps and 2.5 Gbps links (namely, hops 3 to 8) in the path. Therefore, a more sympathetic interpretation of these values is that they are simply indicating that the available bandwidth for some part of the path is equal to or greater than the interface speed at the sending host. The values in the lower band are seemingly easier to explain. They fall exactly in the range of values we would expect considering that OC12 is the bottleneck capacity of the path.
The clear separation of the available bandwidth estimates into these two bands suggests an obvious explanation, that the hops preceding the OC12 bottleneck appear in the upper band and the hops following the bottleneck appear in the lower band. Indeed, Figure 2 supports this hypothesis. Figure 2 shows the available bandwidth estimate vs. the hop distance. Each curve represents the full path from wednesday to cruise and connects together the available bandwidth estimates for all hops in a single round of measurements. We can see that hops 1 to 8 are responsible for the upper band in Figure 1, and these are exactly the hops preceding the bottleneck. Hop 9 is the bottleneck, and Figure 2 shows that hops 9 to 11 are limited by the bottleneck.
Figure 2 also shows, for each hop, the probability of the hop being a thin link. Stab identifies hops 5, 7, and 9 as potential thin links, with hop 9, the bottleneck hop, having near 100% probability in many rounds of measurement. The flagging of hop 7 as a potential thin link is difficult to explain, but there may be an explanation for hop 5. Hop 5 is a peering link between CENIC and ESnet and the location of a switch limited to a maximum of 1500 byte packets. Although no measurement packets were fragmented, it is conceivable that available bandwidth could be lowered slightly by fragmentation at this hop. For instance, extra delays could be imposed on the measurement traffic by the fragmentation of cross traffic at, or near, this switch. These extra delays would most likely be small, but they might be enough to reduce the available bandwidth at hop 5 to less than that of all preceding hops, where packets can be sent at 1 Gbps or faster without any fragmentation. The preceding speculation may very well be wrong, but that does not matter. The important point is the plausibility of hop 5 being a thin link according to what we know about that link and the fact that stab was able to detect this characteristic automatically. On the one hand, the detection of possible thin links at hops 5 and 9 suggests that stab may actually be useful for this purpose. On the other hand, the flagging of hop 7 as a thin link, when there is no apparent reason for it to be so (given our limited knowledge), suggests that false positives might undermine some of this tool's usefulness.
Figures 3 and 4 have the same design as Figures 1 and 2, except the results are now for the cruise-to-wednesday direction. Stab sent 8228 byte packets (8200 bytes of UDP payload) at a "maximum probing rate" of 1.5 Gbps and ran for 1 hour, taking 31 measurements of the available bandwidth at each of the 10 hops. Stab did not output any estimates in the first 26 minutes of the run.
In Figure 3, all hops have an available bandwidth greater than 550 Mbps, except for hop 9, which never exceeds 163 Mbps. The values in the upper band are reasonable for the 1 Gbps links in the path but not reasonable for the four OC192 and 10 GigE links at hops 5 to 8. The top three curves represent hops 3, 2, and 1, in decreasing order, which are presumably all GigE links. Hop 5, the first 10 Gbps link in the path, is the fourth curve and has an average available bandwidth of 715 Mbps. The remaining 10 Gbps links have an average available bandwidth of less than 646 Mbps. Thus, the results for the 10 Gbps links look significantly off, even under the assumption that 1 Gbps is the maximum value that can be reported. The lowest curve represents hop 9, a 1 Gbps peering link between CENIC and SDSC. The average value of 151 Mbps for hop 9 is highly questionable and inconsistent with the available bandwidth measured by other tools (at other times).
Figure 4 shows the available bandwidth estimate vs. hop distance. There are dips at hops 4 and 9. Hop 4 is a peering link between SoX (Southern Crossroads) and Abilene, and we believe, based on some experiments, that this peering happens over an OC48 link employing Packet over SONET. It is puzzling, therefore, to see a dip at hop 4, indicating an average available bandwidth of 641 Mbps. Presuming this dip is significant (it does, at least, appear in multiple runs), we speculate that stab might be reacting to some characteristic of inter-AS peering links that is not shared by links within a single AS. It is interesting to note that hop 9, which also has a prominent dip, is also a peering link between two ASes. We do not believe the available bandwidth estimates at hops 4 and 9 are accurate, but the relative dip in the curve may have some significance, though this will require further investigation.
Figures 5 and 6 show the stab results when 1500 byte packets are sent from cruise to wednesday. Other than the packet size, the stab configuration is the same for the two pairs of plots, Figures 5 and 6 and Figures 3 and 4, and it is worth comparing these two sets of plots to see the effects of packet size on the results. One difference not directly evident in the plots is the number of chirps sent during the 1 hour measurement period: 568 chirps with 8228 byte packets and 3159 chirps with 1500 byte packets. The ratio of the two packet sizes is approximately equal to the inverse of the ratio of the corresponding number of chirps sent. This suggests that the sending bit rate of the measurement traffic is nearly the same at the two packet sizes.
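The two ratios can be checked directly from the figures given above. The sketch below assumes that a chirp contains the same number of packets at both packet sizes, which the text does not state.

```python
large_size, large_chirps = 8228, 568    # bytes per packet, chirps in 1 hour
small_size, small_chirps = 1500, 3159

size_ratio = large_size / small_size      # large packets relative to small
chirp_ratio = small_chirps / large_chirps # small-packet chirps relative to large

print(round(size_ratio, 2), round(chirp_ratio, 2))  # 5.49 5.56

# Relative difference between the two ratios, as a fraction of the size ratio.
difference = abs(chirp_ratio - size_ratio) / size_ratio
print(f"{difference:.1%}")  # 1.4%
```

A difference of this size means the product of packet size and chirp count, and hence the volume of probe traffic under the stated assumption, is within about 1.4% at the two packet sizes.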
Because of a higher chirp sending rate, the warm-up period needed by stab is significantly shorter at the smaller packet size than at the larger size (5 minutes vs. 36 minutes). However, with 1500 byte packets, the available bandwidth estimates have a high degree of short term variability, as a comparison of Figure 3 and Figure 5 shows, and thus some of the advantages of a shorter warm-up period may be negated by the necessity of taking measurements over a long enough period to overcome the apparent noise in the measurements.
The values of the thin-link probabilities are also noticeably different between these two sets of plots, which is not surprising, given that these probabilities are derived in a direct manner from the available bandwidth estimates. With 8228 byte packets, there are two clear spikes in the thin-link probabilities, at hops 4 and 9, and all remaining hops have zero probability. In contrast, with 1500 byte packets, nearly all hops show moderate to high probabilities, to such an extent, indeed, that the signal seems lost in the noise. It does not seem likely that so many hops are actually thin links. Thus, the high variability resulting from using small packets is an impediment to detecting the real thin links in the path.
The negative effects of high variability are mitigated somewhat by averaging the measurements. Figure 7 shows the average available bandwidth at each hop for measurements taken at the two packet sizes. The two curves have the same general shape, differing mostly at hop 4, with the 1500-byte curve lacking the dip at hop 4. However, the available bandwidth measured with 1500 byte packets is consistently lower than with 8228 byte packets, with the two differing by as much as 199 Mbps. The lower values reported with 1500 byte packets are most likely underestimating the true available bandwidth, considering the results obtained with other tools.
Figure 8 shows the average thin-link probability at each hop for both packet sizes. (The peaks at hop 1 appear to be an artifact of the stab implementation and can be ignored.) The averaging operation has sharpened the curve for the 1500 byte packets into a much more reasonable shape than before. Both curves strongly indicate a thin link at hop 9, but the peak at hop 5 in the large-packet curve does not appear in the small-packet curve, except perhaps as the small bump at hop 5. It is unclear which of the two curves is more accurate, but it is clear that there are noticeable differences, suggesting the importance of choosing the right packet size for measurements.
1. Overriding the maximum probing rate with "-u 15000" [15 Gbps] had no effect; stab continued to choose 1.5 Gbps as the upper limit. In fact, stab still chose 1.5 Gbps even with "-u 1000" [1 Gbps]. There appears to be some sort of autotuning algorithm at work.
2. We also used the option "-J 4", which is supposed to deal with interrupt coalescence at the receiver.