Intro and Motivation
End-to-end network application developers and users need better techniques
and tools to estimate and utilize network bandwidth. Meeting attendees are
developing bandwidth estimation techniques and tools to meet this need.
This meeting enabled exchange of information so that individual projects
can identify possibilities for project integration and
collaboration.
First, some definitions:
| capacity | maximum throughput physically possible on a particular link medium |
| end-to-end capacity | minimum capacity on an end-to-end path; corresponds to the capacity of the "narrow link" |
| available bandwidth | maximum unused throughput on an end-to-end path, given current cross-traffic; corresponds to the available bandwidth of the "tight link" |
| bulk transfer capacity (BTC) | maximum throughput that the network can provide to a single, congestion-aware transport layer connection |
| achievable throughput | maximum throughput that an application can obtain depending on the transmission mechanism (e.g., TCP, UDP, ATM, SONET) |
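To make the distinction between the narrow link and the tight link concrete, here is a minimal Python sketch. The link names, capacities, and cross-traffic loads are hypothetical values chosen for illustration, and the sketch assumes that a link's available bandwidth is simply its capacity minus its current load.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str                  # hypothetical label for the link
    capacity_mbps: float       # maximum throughput of the link
    cross_traffic_mbps: float  # load currently carried by the link

    @property
    def available_mbps(self) -> float:
        # Assumption: unused capacity equals capacity minus current load.
        return self.capacity_mbps - self.cross_traffic_mbps


def narrow_link(path):
    """Link with the minimum capacity; it sets the end-to-end capacity."""
    return min(path, key=lambda link: link.capacity_mbps)


def tight_link(path):
    """Link with the minimum available bandwidth; it sets the path's available bandwidth."""
    return min(path, key=lambda link: link.available_mbps)


path = [
    Link("access", 100.0, 10.0),
    Link("backbone", 1000.0, 950.0),
    Link("campus", 155.0, 20.0),
]
print(narrow_link(path).name, narrow_link(path).capacity_mbps)
print(tight_link(path).name, tight_link(path).available_mbps)
```

In this made-up path the narrow link and the tight link are different links, which is why end-to-end capacity and available bandwidth have to be measured separately.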
Bandwidth Estimation Tools Development
| "Link Capacity Tool Developments" |
| Guojun Jin | LBL Distributed Systems Department |
| website: | http://www.itg.lbl.gov/~jin |
| email: | j_guojun@lbl.gov |
Guojun is developing link characteristics tools. He focuses on algorithm
development to quickly and efficiently measure link characteristics. His goal is to optimize
TCP application performance on high speed networks by experimenting with
a number of parallel streams, optimal window size, and other TCP tuning parameter settings.
This work is a joint venture with Berkeley Lab's Self-Configuring Network Monitor (SCNM) and
LBL's Distributed Monitoring Framework
(DMF) projects.
NCS technologies developed as part of these projects include
the data collection and analysis tool called Netlogger.
Guojun's new tool Netest-2 is
unidirectional and makes end-to-end measurements at the application layer.
Netest-2 measures achievable throughput for UDP and TCP applications,
and suggests optimal TCP window size. Netest-2 also flags hardware and system characteristics that
may limit throughput. In addition, Netest-2
recommends how many, if any, parallel TCP streams will optimize "friendly" and
"aggressive" throughput. Friendly streams back off upon detecting
congestion, while aggressive streams do not.
Netest-2 runs on multiple UNIX platforms and has been tested with
multiple NICs at speeds from ADSL to 10 GigE (currently FreeBSD only).
A beta version of Netest-2 was given to CAIDA in August, and is currently
under test.
Guojun plans to develop another new tool, pipechar-2, next year.
The goal of pipechar development is to investigate algorithms for per-hop bandwidth estimation. Guojun's research
will determine whether these algorithms work.
| "The effect of store-and-forward devices on per-hop capacity estimation" |
| Ravi Prasad | U Delaware/Georgia Tech |
| email: | ravi@cc.gatech.edu |
Ravi performed this work with Constantinos Dovrolis of the Georgia Institute
of Technology and with Bruce A. Mah of Packet Design. Their paper
entitled "The effect of layer-2 store-and-forward devices on per-hop
capacity estimation," has been accepted at the
INFOCOM 2003 conference co-sponsored by the IEEE Computer and Communications
Societies, to be held in San Francisco, March 30 - April 1, 2003.
This talk investigates reasons why bandwidth estimation tools pathchar, pchar, and clink give erroneous measurements. All of these tools employ the Variable
Packet Size (VPS) methodology, and depend on measuring serialization delay.
However, layer-2 store-and-forward switches
increase RTT in proportion to the packet size,
which distorts capacity estimates.
Other sources of introduced errors include:
- traffic load
- non-zero queueing delays
- limited clock resolution
- error propagation from the previous hop
- possibly ICMP generation latency
Errors attributable to limitations of using the VPS methodology are significant, and occur consistently across
several repetitions of tool runs.
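The effect of a hidden store-and-forward device can be sketched with a short calculation. VPS tools infer capacity from the slope of delay versus packet size. If a layer-3 hop actually consists of several store-and-forward segments, each segment adds its own serialization delay, so the slope is the sum of the per-segment delays. The following Python sketch assumes an idealized hop with no queueing or cross traffic; the function names and capacities are illustrative only and are not taken from any of the tools.

```python
def per_byte_delay_s(segment_capacities_bps):
    """Serialization delay per byte, summed over the store-and-forward
    segments that make up one layer-3 hop (idealized, no queueing)."""
    return sum(8.0 / capacity for capacity in segment_capacities_bps)


def vps_capacity_estimate_bps(segment_capacities_bps):
    """Capacity a VPS-style tool would infer from the delay-versus-size slope."""
    return 8.0 / per_byte_delay_s(segment_capacities_bps)


# A hop that is a single 100 Mbps link is estimated correctly.
direct = vps_capacity_estimate_bps([100e6])

# The same hop with a hidden 100 Mbps layer-2 switch in the middle.
switched = vps_capacity_estimate_bps([100e6, 100e6])

print(f"direct:   {direct / 1e6:.1f} Mbps")
print(f"switched: {switched / 1e6:.1f} Mbps")
```

Under these assumptions, one hidden switch on a 100 Mbps hop halves the estimate, which is consistent with the consistent underestimation described above.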
| "Multivariate Resource Performance Prediction in the NWS" |
| Martin Swany | UC Santa Barbara |
| website: | http://www.cs.ucsb.edu/~swany/ |
| email: | swany@strat.cs.ucsb.edu |
The NWS architecture consists of CPU, memory, and network sensors that
generate time series data. Time-series data passes through a set of
performance forecasters. NWS uses lightweight probes that do not measure
nominal bandwidth, especially as the bandwidth delay product grows. Recent
work by Primet, Harakaly, and Bonnassieux (INRIA, ENS-Lyon)
presented at CCGrid02 "Experiments of Network Throughput Measurement and
Forecasting Using the Network Weather Service" attempts to relate
NWS and Iperf data. However, attempts at UCSB to replicate this
work were unsuccessful.
While most correlation mechanisms assume normal (Gaussian) distributions,
network traffic parameters do not routinely exhibit them.
NWS employs a novel multivariate prediction technique to forecast target
network performance variables. A cumulative distribution function (CDF)
correlator makes the translation. Martin presents several examples of CDFs:
- MAE - Mean Absolute Error
- MSE - Mean Square Error
- MNEP - Moving Normalized Error Percentage
His examples show that CDF is proving useful
for situations requiring a non-normal distribution.
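One plausible reading of a CDF-based translation is quantile matching: find where a new lightweight-probe value falls in the empirical CDF of past probe values, then report the value at the same quantile in the empirical CDF of the target series. The Python sketch below illustrates that idea only. It is not the NWS implementation, and the function names and sample histories are hypothetical.

```python
import math
from bisect import bisect_right


def quantile_of(value, history):
    """Empirical CDF: fraction of past observations <= value."""
    ordered = sorted(history)
    return bisect_right(ordered, value) / len(ordered)


def value_at_quantile(q, history):
    """Smallest past observation whose empirical CDF is at least q."""
    ordered = sorted(history)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


def cdf_translate(probe_value, probe_history, target_history):
    """Map a probe measurement onto the target series by matching quantiles."""
    return value_at_quantile(quantile_of(probe_value, probe_history), target_history)


# Hypothetical histories: lightweight probes versus long bulk transfers (Mbps).
probe_history = [1.0, 2.0, 3.0, 4.0]
target_history = [10.0, 40.0, 45.0, 90.0]
print(cdf_translate(3.0, probe_history, target_history))
```

Because the mapping uses only ranks, it makes no assumption that either series is normally distributed.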
| "Testing Tools in an International Infrastructure: Interesting Results" |
| Jiri Navratil | SLAC/SCS-Network Monitoring |
| email: | jiri@slac.stanford.edu |
Jiri presented issues in estimating the utilization and cross-traffic of TCP
applications. He showed graphs (see his slides (PDF)) in which the performance of parallel TCP streams, as measured by Iperf, reveals inflection points indicating a significant change in stream rates as the number of streams increases. Jiri attributes this to queueing.
If parallel streams compete on high capacity lines having free capacity, then the throughput distributions of individual streams will be nearly equal.
The inflection point mentioned above shows where Iperf and cross traffic share
the bandwidth equally. Adding parallel streams beyond this point does not improve aggregated throughput.
Several different bandwidth estimation tools were compared:
- pathrate
- pathload
- Iperf
- netest-2
- Incite BWe (Rice University)
- UDPmon
Finally, Jiri presented results of tests using the "Multifractal-cross traffic
estimator" (MF-CT) originally developed at Rice University. The MF-CT is
an active probing tool based on MATLAB code that performs UDP echoes.
MF-CT is intended
for permanent monitoring and prediction of available bandwidth.
Jiri presented several results of cross traffic detection
on high capacity lines between DOE labs and Caltech in order to show
the resolution of MF-CT. Jiri also showed several samples of
discovered bottlenecks attributed to dynamic cross traffic on
intermediate routers
resulting in packet-pair delays. In his graphs, bottlenecks show
up as a superposition of all delays in the path. (See slides 60-69 out of 70.)
Measurement Infrastructures
Les presented measurements from a new simple, robust infrastructure for
continuous persistent monitoring of high speed network and application
performance. This infrastructure minimizes measurement traffic
from ping, traceroute, pipechar, iperf,
and bbcp. Infrastructure tools and methods reduce, analyze, and publicly
report measurements. They may also be used to compare and validate new
measurement or bandwidth estimation tools, and to provide forecasting and
configuration information to Grid and other applications.
IEPM-BW utility/deliverables:
- understand and identify needed resources
- provide access to archival and near real-time data and results
- identify critical changes in performance
IEPM-BW involves 33 sites on the PPDG, GriPHyN, and the
European Data Grid (EDG) networks. World-wide locations include:
Brookhaven, Milan, Rome, ESnet, CERN and others.
Les shows evidence that 10 seconds is sufficient to acquire accurate
iperf measurements, using multiple streams and big windows.
Ping, traceroute, and iperf measurements have been made at
90 min intervals from SLAC to 33 hosts in 8 countries since Dec 2001.
Bbcp memory to memory throughput measurements match
iperf measurements, until hitting disk-to-disk performance limits.
Slide 8 (HTML) in his presentation shows flat horizontal
lines on the Iperf vs file copy disk to disk graph
at over 60Mbits/s. These results imply that bbftp and
bbcp will fail above
100Mbits/s.
Disk performance matters when end-to-end applications measure
bandwidth: if disk speed is less than network speed, there is no need
to measure the network.
Disk performance depends on the disk sub-system, the file system, and caching parameters.
Parallelize disk and server access before attempting to
optimize network performance.
Iperf results match well to web100 goodput.
Even when using big (1Mb) windows, parallel TCP streams (e.g. between 2 and
24) are required before there is any chance of saturating a high speed link.
Typically, throughput increases up to an inflection point (see slide 15)
that appears to be related to window size.
Experimental results on the path between SLAC and ANL indicate
that as long as less than 70%
of the available capacity is utilized, then RTTs are stable.
Pushing beyond this point more than doubles the RTT.
A technique to detect the observed inflection point shows promise
for optimizing application performance.
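As an illustration of what such a detection technique could look like, the Python sketch below reports the stream count after which adding streams no longer improves aggregate throughput by a chosen relative margin. This is a simple threshold heuristic offered as an assumption, not the technique used in IEPM-BW, and the sample numbers are invented.

```python
def find_inflection(stream_counts, throughputs_mbps, min_gain=0.05):
    """Return the last stream count that still gave a relative throughput
    gain of at least min_gain over the previous measurement."""
    for i in range(1, len(stream_counts)):
        previous, current = throughputs_mbps[i - 1], throughputs_mbps[i]
        if current < previous * (1.0 + min_gain):
            return stream_counts[i - 1]
    # Throughput was still improving at the largest stream count tested.
    return stream_counts[-1]


# Hypothetical iperf results: aggregate throughput versus parallel streams.
streams = [1, 2, 4, 8, 16, 24]
throughput = [40.0, 78.0, 150.0, 260.0, 268.0, 265.0]
print(find_inflection(streams, throughput))
```

An application could use the returned stream count as its default degree of parallelism and stay below the point where RTT starts to climb.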
Iperf measurements from SLAC to Caltech (Feb-Mar '02) also correlate well to NetFlow
measurements.
While aggregating all flows related to a single application call is difficult,
IEPM validated iperf measurements against NetFlow bytes/duration throughput measurements at border routers.
Aggregation involves identifying flows having a fixed triplet (src, dst, port)
starting at the same time (±2.5 s) and ending at roughly the same time.
While this algorithm needs tuning (because it misses
some delayed flows), it already shows that Iperf and NetFlow measurements
correlate well. In contrast, bbftp seems to underreport.
NetFlow does not consider window size, so using it for validation only makes
sense after tuning a particular bandwidth estimation application.
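The aggregation rule described above can be sketched in a few lines of Python. The record layout, host names, and the choice to apply the same 2.5 s tolerance to end times are assumptions made for illustration (the source only says flows end at "roughly the same time"). Real NetFlow records carry more fields than shown here.

```python
from collections import namedtuple

# Simplified flow record; times in seconds, nbytes in bytes.
Flow = namedtuple("Flow", "src dst port start end nbytes")


def aggregate_flows(flows, tolerance_s=2.5):
    """Group flows sharing (src, dst, port) whose start and end times lie
    within tolerance_s of the first flow in the group."""
    groups = []
    for flow in sorted(flows, key=lambda f: (f.src, f.dst, f.port, f.start)):
        first = groups[-1][0] if groups else None
        same_call = (
            first is not None
            and (first.src, first.dst, first.port) == (flow.src, flow.dst, flow.port)
            and abs(flow.start - first.start) <= tolerance_s
            and abs(flow.end - first.end) <= tolerance_s
        )
        if same_call:
            groups[-1].append(flow)
        else:
            groups.append([flow])
    return groups


def throughput_mbps(group):
    """Bytes/duration throughput of one aggregated application call."""
    duration = max(f.end for f in group) - min(f.start for f in group)
    return 8 * sum(f.nbytes for f in group) / duration / 1e6


flows = [
    Flow("slac", "caltech", 5001, 0.0, 10.0, 50_000_000),
    Flow("slac", "caltech", 5001, 0.4, 10.2, 50_000_000),
    Flow("slac", "caltech", 5001, 600.0, 610.0, 25_000_000),
]
groups = aggregate_flows(flows)
print([round(throughput_mbps(g), 1) for g in groups])
```

A delayed flow that starts more than 2.5 s late ends up in its own group, which matches the tuning problem noted above.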
Iperf measurements are potentially intrusive on a link. However, it
may be possible for High Energy Physics (HEP)
sites to schedule measurements using
QBone Scavenger Service (QBSS),
which works similarly to the Unix nice command.
Internet2 has QBSS turned on, while ESnet does not,
because the Juniper 80M cards deployed on ESnet do not support QBSS.
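For readers unfamiliar with how a measurement tool would request scavenger treatment, the Python sketch below marks a socket's outgoing packets with the QBSS code point (DSCP 001000, which corresponds to a TOS byte of 0x20). The helper name is hypothetical. Whether the marking has any effect depends on the operating system and on the networks along the path honoring it.

```python
import socket

QBSS_DSCP = 0b001000       # DSCP code point used by the QBone Scavenger Service
QBSS_TOS = QBSS_DSCP << 2  # DSCP occupies the upper six bits of the TOS byte


def mark_scavenger(sock):
    """Ask the OS to mark packets sent on sock with the QBSS code point."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, QBSS_TOS)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe_socket:
        mark_scavenger(probe_socket)
        print(hex(QBSS_TOS))
```

Like a low nice value on Unix, the marking only yields to other traffic when there is contention; on an idle path the measurement runs at full speed.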
Philosophically, there are two different ways to verify tools:
- Publish their algorithms.
- Make the tools available for testing in controlled experiments.
pathload uses a new methodology: Self-Loading Periodic Streams (SLoPS).
With SLoPS, a Sender transmits periodic UDP packet streams to a
Receiver, which timestamps them. pathload analyzes the measured
one way delays (OWDs) by looking for delay variations.
Since several definitions are in use, Const clarifies how he will
use the following terms:
- capacity: maximum possible throughput end-to-end, limited by
  the lowest transmission rate link.
- narrow link: The narrow link is the one in the
  end-to-end path that has the minimum capacity.
- available bandwidth: maximum end-to-end throughput given
  current cross traffic. Available bandwidth is measured for a
  specific time interval and yields a range of variations.
  As the selected averaging interval decreases, the variation
  increases.
- tight link: The tight link is the one having the minimum
  available bandwidth, due to cross-traffic.
Const clarifies the potential advantages of estimating
bandwidth:
- congestion control and TCP: measure Bandwidth-Delay-Product
- streaming apps: adjust encoding rate
- SLA and QoS: monitor path load
- content distribution nets: select best server
- overlay nets: configure overlay routes
- end-to-end admission control: verify that sufficient bandwidth is
available before admitting a new flow
Many current approaches have important limitations. For example,
the INCITE mechanism (Ribeiro et al ITC'00)
of multifractal cross-traffic estimation (also known as Delphi) provides
correct estimation only when queueing occurs at a single link in path.
In other words, the packet is only queued at the tight link in the path.
INCITE underestimates available bandwidth if packet queueing occurs on multiple links in the end-to-end path.
While conventional wisdom suggests that bulk-TCP measurements will
oscillate around the available bandwidth, experimental results
(see slide 11 of 29 (pdf)) show this not to be the case.
The graph on slide 11 shows that the TCP connection gets more than 4 Mbps
of TCP throughput when MRTG reports an available bandwidth of less than 1 Mbps.
At the same time, RTT changes from 200 ms to 350 ms.
Queueing is the most plausible reason for this increase in delay.
Iperf itself makes the available bandwidth drop close to zero, thereby
significantly increasing RTT. It remains an open question whether turning
on the QBone Scavenger Service (QBSS) can alleviate this condition.
SLoPS is aware of queueing issues. When the
Sender sends a periodic stream of K UDP packets
of size L bytes at rate R = L/T (that is, one packet every T = L/R), the
Receiver will experience gradually increasing
one way delays (OWDs) if the rate is greater than the
available bandwidth. This increasing trend occurs because the queue
starts filling up. In contrast, when the rate is less than the available
bandwidth, then OWDs do not increase.
There is a third case: Where rate R is about the same as the available
bandwidth, a "grey region" of OWD measurements occurs.
pathload sends sets of streams until it can bound this "grey region",
using four state variables Gmax Gmin Rmax Rmin, representing
grey region maximum and minimum and rate maximum and minimum.
Pathload creates multiple streams. Rmax is initialized to the first
timestamp, and the other three variables are initialized to zero.
Successive measurements progressively
refine all four variables until Rmax - Rmin
is less than a threshold value set by the user.
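The core idea, probing at a rate R and narrowing the bounds depending on whether one way delays trend upward, can be sketched as a bisection. The Python below is a deliberately simplified illustration and not pathload itself: it uses a toy single-queue fluid model in place of a real network, a plain pairwise comparison in place of pathload's trend statistics, and it omits the grey region (Gmax, Gmin) entirely. All names and parameter values are assumptions.

```python
def one_way_delays(rate_bps, avail_bps, capacity_bps,
                   packets=100, size_bytes=1000, base_delay_s=0.010):
    """OWDs of a periodic stream through one tight link (toy fluid model)."""
    period_s = 8 * size_bytes / rate_bps           # T = L / R
    growth_bps = max(0.0, rate_bps - avail_bps)    # queue growth while probing
    return [base_delay_s + growth_bps * k * period_s / capacity_bps
            for k in range(packets)]


def has_increasing_trend(delays, fraction=0.6):
    """True if most consecutive OWD pairs increase."""
    increases = sum(1 for a, b in zip(delays, delays[1:]) if b > a)
    return increases > fraction * (len(delays) - 1)


def estimate_available_bw(probe, rate_min_bps, rate_max_bps, resolution_bps):
    """Bisect on the probing rate until the bounds are within resolution_bps."""
    while rate_max_bps - rate_min_bps > resolution_bps:
        rate = (rate_min_bps + rate_max_bps) / 2
        if has_increasing_trend(probe(rate)):
            rate_max_bps = rate   # rate exceeds available bandwidth
        else:
            rate_min_bps = rate   # rate is at or below available bandwidth
    return rate_min_bps, rate_max_bps


def probe(rate_bps):
    # Simulated path: 100 Mbps tight link with 40 Mbps available.
    return one_way_delays(rate_bps, avail_bps=40e6, capacity_bps=100e6)


low, high = estimate_available_bw(probe, 0.0, 100e6, resolution_bps=1e6)
print(f"{low / 1e6:.2f} to {high / 1e6:.2f} Mbps")
```

In the simulated path the final bounds bracket the 40 Mbps available bandwidth. On a real path the delay trend near R is noisy, which is the reason pathload tracks a grey region rather than a single crossover point.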
Pathload takes 15s to run against an ns network traffic simulator.
For checking utilization along real paths, 1min SNMP data (e.g. MRTG
granularity) will be sufficient.
Pathload does not affect available bandwidth, or RTT.
Therefore, pathload is not intrusive.
Currently, pathload has measured available bandwidth on links
from 2Mbps to 240Mbps. When verifying this tool on GigEther paths, packet
spacing must be less than 12 microseconds. It appears that
interrupt coalescing may be a serious problem.
UDPmon and TCPstream are
network testing tools developed by Richard.
These tools are in use by members of the European High Energy Physics community
as well as the Grid community.
Documentation and tarballs are available on his website.
UDPmon
UDPmon makes histograms (see slide 7) of individual measurements
of RTTs of UDP Request-Response frames of increasing frame sizes.
UDPmon also sends a controlled stream of regularly spaced UDP frames to
measure UDP throughput. UDPmon is useful for understanding how
hardware affects LAN / MAN / WAN behavior.
In slides 5 and 6, Richard plots UDPmon measured RTTs as
a function of frame size
in order to characterize latency. The slope of this line shows end-to-end
processing, HW, NIC, and link delays while the point where the line intercepts
the y-axis shows just the processing (memory copy, PCI) and hardware latencies.
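The slope and intercept can be obtained with an ordinary least-squares line fit. The Python sketch below shows the calculation on synthetic data; the frame sizes, the 100 us fixed latency, and the 0.025 us/byte slope are made-up inputs, and the function is an illustration rather than part of UDPmon.

```python
def fit_latency(frame_sizes_bytes, rtts_us):
    """Least-squares fit of RTT = intercept + slope * frame_size.
    Returns (slope in us/byte, intercept in us)."""
    n = len(frame_sizes_bytes)
    mean_size = sum(frame_sizes_bytes) / n
    mean_rtt = sum(rtts_us) / n
    covariance = sum((s - mean_size) * (r - mean_rtt)
                     for s, r in zip(frame_sizes_bytes, rtts_us))
    variance = sum((s - mean_size) ** 2 for s in frame_sizes_bytes)
    slope = covariance / variance
    intercept = mean_rtt - slope * mean_size
    return slope, intercept


# Synthetic request-response measurements (noise-free for clarity).
sizes = [64, 256, 512, 1024, 1400]
rtts = [100.0 + 0.025 * size for size in sizes]
slope_us_per_byte, fixed_latency_us = fit_latency(sizes, rtts)
print(slope_us_per_byte, fixed_latency_us)
```

With real measurements, fitting against the minimum RTT observed at each frame size is a common way to reduce the influence of queueing noise on the fit.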
UDPmon latency measurements imply that the speed of the PCI bus impacts
latency:
| GigEther card | CPU | PCI | Expected | Measured |
| SysKonnect SK-9843 | PIII 800MHz | 64bit 33MHz | (PCI=0.00758, GigE=0.008) 0.0236 us/b | 0.0252 us/b |
| Intel Pro/1000 | PIII 800MHz | 64bit 66MHz | (PCI=0.00188, GigE=0.008) 0.0118 us/b | 0.0187 us/b |
TCPstream
TCPstream measures TCP throughput with respect to message size, transmit
spacing, the number of concurrent streams, and the TCP window size.
One TCP control link manages n test streams treating each equally.
TCPstream uses only
one process with no threads and no context switching.
Calculation of times and delay relies on the Pentium CPU
cycle counter, requiring minimal user code. TCPstream is useful for understanding available TCP
throughput of the LAN / MAN / WAN.
Richard presents results of his experiments to quantify PCI activity.
His graphs (see slides 19-23) show that different GigEther cards
generate different amounts of PCI activity. Both tested cards can
drive packets at line speed. However, in the bottom half of slide
22, Richard observes extreme delays and dropped packets when using the
Intel Pro/1000 card. The packet drops are happening on the receiver
side. What could possibly be the cause?
In general, making a read or write call between the user and the system space
takes roughly the same amount of time.
Could a receiver side checksum taking an extra 2.7us
(18.7%) explain experimental throughput results?
920 Mbps expected throughput * (1 - 18.7% delay due to checksum) = 747.9 Mbps measured throughput.
Other possible factors that might explain packet drops:
- NIC related problem (if one NIC drops packets on different CPUs and OSs,
but other NICs do not have such a problem; an unlikely situation).
- Hardware (hw) problem (if a different NIC of the same type works,
then this NIC must be defective).
- Software (sw) problems, due to differences in drivers for different OSes.
- System specific problem, where different NICs have similar problems on a
particular system (hardware problems: bad memory chip, bad PCI controller, etc.;
software problems: should not be an issue for current OSes).
- Combination problem, e.g. where the problem occurs on a certain OS with a
specific NIC driver, only when CPU speed is a limiting factor.
For these tests, Richard could eliminate bad NICs as a cause, so
the checksum offload feature found on Tigon chip based NICs became a prime
candidate for explaining the extra receiver side delay.
While the checksum offload feature is disabled by default, if it is on,
it will take 5.7us on the receiver to calculate the checksums, causing
18.7% more load on the receiver NIC than on the sender.
If turning off the checksum offload feature does not eliminate the
packet drop problem, then one of the following
may be true:
- the CPU is too low powered (not the case with a 370 DLE 800 MHz CPU)
- there is a bad or slow memory sub-system
- problematic O.S.
- problematic NIC driver
This testing example illustrates the complexity of
issues impacting GigEther performance.
Also see the
DataTAG Report to WP7 group for a more detailed
presentation of these results.
Existing bandwidth estimation methods suffer from noise
(hidden L2 switches; cross-traffic; variations in load; head-of-line
blocking; HDLC byte stuffing). So, Andre suggests a paradigm shift:
rather than suppressing or compensating for packet delay noise,
extract information from it!
Andre presented some preliminary ideas about
Internet spectroscopy. He wants to identify discrete
network behavior components such as link capacity.
Andre demonstrated entropy calculation. Identification of some components is potentially
simpler than other measurements and may facilitate bandwidth estimation.
Andre's work builds on ideas developed at Rice University by
Robert Nowak, whose student Ryan King will work with Andre over the summer.
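The notes do not record which quantity the entropy was computed over. As one possible illustration, the Python sketch below computes the Shannon entropy of a histogram of packet delays. The bin width and sample delays are arbitrary assumptions, and this should not be read as Andre's actual method.

```python
import math
from collections import Counter


def delay_entropy_bits(delays_s, bin_width_s):
    """Shannon entropy (in bits) of packet delays binned at bin_width_s."""
    bins = Counter(int(delay // bin_width_s) for delay in delays_s)
    total = len(delays_s)
    return -sum((count / total) * math.log2(count / total)
                for count in bins.values())


# Hypothetical delays clustered around two discrete values.
two_modes = [0.0101] * 4 + [0.0221] * 4
print(delay_entropy_bits(two_modes, bin_width_s=0.001))
```

Delays concentrated on a few discrete values give low entropy, which hints at the kind of discrete component (such as a fixed serialization delay) that a spectroscopy-style analysis would try to isolate.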
| "The Grid Measurement Working Group" |
| Martin Swany | UC Santa Barbara |
| website: | http://www.cs.ucsb.edu/~swany/ |
| email: | swany@strat.cs.ucsb.edu |
Recognizing that an XML vs. LDAP war occurred within the Grid Forum group,
Martin showed us an LDAP gateway that he developed to interface with
all components of NWS. Martin defined a data model where a
call to the NWS API GetNameServer becomes a search for GridDaemon objects of type
nameserver.
This approach requires an Object ID hierarchy. So,
the GridEvent Type hierarchy normalizes grid (net) events.
GridEvent= GridTarget, GridEventType, Timestamp (a candidate key)
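A rough Python sketch of this record follows. Only the three field names come from the line above; the class layout, the dotted event-type string, and the host name are hypothetical. The dataclass is frozen so that the three fields together behave as a key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridEvent:
    target: str       # GridTarget, e.g. a host or path (format is hypothetical)
    event_type: str   # GridEventType, a name in the event type hierarchy
    timestamp: float  # seconds since the epoch

    def key(self):
        """The candidate key: (GridTarget, GridEventType, Timestamp)."""
        return (self.target, self.event_type, self.timestamp)


first = GridEvent("host-a.example.org", "network.bandwidth.tcp", 1030000000.0)
duplicate = GridEvent("host-a.example.org", "network.bandwidth.tcp", 1030000000.0)
later = GridEvent("host-a.example.org", "network.bandwidth.tcp", 1030000060.0)
print(len({first, duplicate, later}))
```

Because equality and hashing are derived from the three key fields, a repeated report of the same event collapses to one entry when events are stored in a set or used as dictionary keys.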
The most relevant Grid area is
Grid Information Systems and Performance.
Recently, a
Grid Monitoring Architecture (GMA) was proposed by Brian Tierney and others.
Another new Grid group,
Discovery and Monitoring
Event Description Working group (DAMED) is organized by
Jennifer Schopf from Argonne. Finally,
the Network Measurement Working Group
(NMWG) is now chaired by Bruce Lowekamp and
Richard Hughes-Jones. Charlie Catlett approved all of these charters.
NMWG references and includes work by the IETF's Internet Protocol
Performance Metrics (IPPM) group.
In addition, the NMWG explicitly intends to leverage CAIDA's
tools taxonomy to clarify measurement metrics and methods.
Richard Hughes-Jones commented that the NMWG and GGF can play
a role in harmonizing related efforts.
WP7 is the
Data Grid Network Work Package 7, a component of the Data Grid Initiative.
To describe the context of WP7, Richard defines the Grid simply as
distributed computing sitting on the network.
The EU DataGrid Project
corresponds to the Particle Physics
Data Grid (PPDG) in the U.S.
The United States funds half of the physical line used in
the EU DataTAG project.
Europe also supports AstroGrid,
which enables astronomers to make image queries.
Several different kinds of tools for measuring one way delay, throughput,
and predicting performance are currently in use.
A non-exhaustive list includes:
PingER; iperf; UDPmon; rTPL (from Univ. of Amsterdam);
GridFTP; RIPE NCC Test Traffic Measurements; and the Network
Weather Service (NWS) prediction engine.
GridFTP works under the Globus
Grid Security Infrastructure (GSI) and therefore requires
the use of authentication certificates.
Each country wants its own certificate authority, but first
a human protocol must be developed to decide how and why to trust any user.
Security requirements impact how measurement tools run, and remain an
ongoing concern.
Interest in achieving Gigabit throughput for Grid applications has
resulted in support for research into QoS and high-throughput in the EU.
Kc mentioned that there are about 6 new companies offering intelligent
route control as an alternative to QoS.
WP7 finds that iperf and UDPmon measures of TCP throughput
correlate well with the NWS predictor.
WP7 currently operates an OC48/POS development net to experiment with using
MPLS for network management and operations.
Testing Methodologies and Environments
California Next Generation Internet (CalNGI) established test centers in Northern
(UCBerkeley) and Southern (SDSC) California. CalNGI emphasizes e-commerce,
recognizing that a majority of Internet "customers" work with web services and
supply chains. Approximately 15-16 commercial developers have received grants,
focusing on network and application performance tests and measurements.
The CalNGI NPRL at SDSC contains commercial test and measurement equipment for
testing 10Mb to 10Gb Ethernet and OC3 to OC192 fiber media.
Specific hardware and software includes:
- Cisco Systems: CiscoWorks; QoS Mgr; VPN/Security Mgr
- HP: OpenView; NetworkNodeMgr; MeasureWare Performance Agent Technology
- NetIQ: Chariot Applications Scanner
- Spirent Communications: Adtech broadband traffic generator/analyzer; SmartBits active
measurement system (with GigE, OC3, OC12, OC48, 10 Gig UNI-PHY PICs).
Spirent has MOUs with SDSC and Internet2.
Spirent gear was used to measure "bandwidth challenge" for SuperComputing.
- Netscout: GigE RMON probes Synthetic Agents; Ngenius Mgmt Station
- Netoptics: regeneration taps (4-port, single; 4 port multi)
- Foundry, Cisco, HP: 10/100/1000 layer 2 and layer 3 switches
- 2 ForceTen GigE switches are on SDSC computer room floor.
Active Projects:
- Bandwidth challenge at SC2002
Recent Projects:
- Internet2 E2E and OC192 deployment testing
- SDSC 10GigExperience II Fall Quarter
- 10Gigabit Ethernet Interoperability at N+I
Limitations:
Majority of manufacturers (except for ForceTen) currently have backplane limitations.
Nate has observed actual 5-7.5 Gb throughput instead of advertised 10Gb.
No manufacturer has mature routing code combined with the 10GigE IEEE 802.3ae
wide-area network physical layer (WAN PHY).
SDSC plans to evolve from OC12 to 10GigE OC192c (depending on Qwest).
There is gear for taking OC192 SONET to WAN PHY. Initially, the TeraGrid
will consist of a 4-5 site network, with 40G (consisting of 4 10G lambdas)
cores in LA and Chicago.
ForceTen has the capacity because it is both a Layer 2 and a Layer 3 device.
Advanced Testing Engineering and Management exists at
SDSC; Ohio U ITEC; NCU ITEC; Mexico City; and British Columbia.
At SDSC, Kevin makes arrangements so that corporations refresh technology
in order to avoid obsolescence.
There is also a lab at Georgia Tech.
We might be able to dial in specific traffic between Georgia Tech and SDSC.
SDSC plans to perform stream tests with 9K jumbo Ethernet frames at 4 sites.
Kevin has already run tests between SDSC and UTexas.
Issues for high capacity paths:
- instruction cache misses
- batched interrupts
- NIC to userspace latency
Research Questions:
How does packet dispersion relate to capacity?
Is dispersion preserved until timestamping? Can one packet take more processing time
than another? What is the minimum dispersion that can be measured?
What if processing takes more time than this dispersion?
The time it takes to process a packet can be substantially
different depending on whether there is an instruction cache miss or not.
When a packet pair arrives, the first packet sees an instruction cache miss
and therefore requires more processing time than the second packet.
To compensate for this, it is a good idea to send three packet probes for
a packet pair measurement and then ignore the first probe measurement.
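A minimal Python sketch of this three-probe variant is shown below. It assumes the usual packet-pair relation (capacity equals packet size divided by dispersion) and uses invented arrival times for 1500-byte packets; the function name is illustrative.

```python
def capacity_from_probe_train(arrival_times_s, packet_size_bytes):
    """Estimate capacity (bits/s) from a three-packet train, ignoring the
    first packet's timestamp because it may include an instruction cache miss."""
    if len(arrival_times_s) != 3:
        raise ValueError("expected exactly three arrival timestamps")
    _, second, third = arrival_times_s
    dispersion_s = third - second
    if dispersion_s <= 0:
        raise ValueError("no measurable dispersion between packets 2 and 3")
    return 8 * packet_size_bytes / dispersion_s


# Hypothetical receiver timestamps; the first gap is distorted,
# the second gap is the 120 us expected on a 100 Mbps link.
arrivals = [0.000000, 0.000090, 0.000210]
estimate_bps = capacity_from_probe_train(arrivals, packet_size_bytes=1500)
print(f"{estimate_bps / 1e6:.1f} Mbps")
```

Using the first gap instead would give 12000 bits / 90 us, or roughly 133 Mbps, which shows how a single distorted timestamp skews a plain packet-pair estimate.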
In addition, a single interrupt may handle more than one packet.
In this case, packet dispersion measurements
could be affected by interrupt coalescence at either the source or destination
host CPU.
If a large series of packets yields no packet dispersion measurement, interrupt coalescence is a likely cause and the packet train size must be increased.
More refined responses are necessary in order to distinguish interrupt
coalescence from random noise in the network, or to
measure the minimum dispersion possible at a particular host node. Refinement
techniques remain a research issue, with a goal of finding
a better way to measure minimum acceptable dispersion.
The current experimental methodology for measuring
minimum acceptable dispersion:
- Calculate user to kernel latency.
- Multiply by "a factor" to arrive at the NIC-to-userspace latency.
- Minimum Acceptable Dispersion = Factor * Kernel_User_Latency
- Kernel_User_Latency = (Gettimeofday -> sendto(pkt) -> recvfrom(pkt) -> Gettimeofday) / 2
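The steps above can be sketched in Python as follows. The sketch sends UDP packets to itself over the loopback interface and halves the mean round trip, using time.perf_counter in place of gettimeofday. The value of the factor is not given in these notes, so the one used here is an arbitrary placeholder, and the function names are hypothetical.

```python
import socket
import time


def kernel_user_latency_s(rounds=1000, payload=b"x" * 64):
    """Half the mean time of timestamp -> sendto -> recvfrom -> timestamp,
    measured on a UDP socket that sends to itself over loopback."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("127.0.0.1", 0))   # let the OS pick a free port
        address = sock.getsockname()
        start = time.perf_counter()
        for _ in range(rounds):
            sock.sendto(payload, address)
            sock.recvfrom(2048)
        elapsed = time.perf_counter() - start
    return elapsed / rounds / 2


def minimum_acceptable_dispersion_s(latency_s, factor):
    """Minimum Acceptable Dispersion = Factor * Kernel_User_Latency."""
    return factor * latency_s


if __name__ == "__main__":
    latency = kernel_user_latency_s()
    # factor=3 is a placeholder; the source does not state the value used.
    print(minimum_acceptable_dispersion_s(latency, factor=3))
```

Dispersion measurements smaller than the returned value would be treated as unreliable on that host, since they are below what the host itself can resolve.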
OS issues become important for high capacity paths.
Due to interrupt coalescence, the actual data transfer rate may be limited by
the hosts as opposed to the network.
Structured, controlled tests were performed on 100Mbps links in the CalNGI lab.
Initial results of bandwidth estimation tool testing show that most existing tools
for measuring end-to-end link capacity return very inaccurate results.
Several explanations were considered, including:
- Generated traffic packets may be too evenly spaced. Repeating the tests with
variable spaced packets still yielded poor results for VPS tools.
- Packets may be getting lost. A NeTraMet passive RTFM monitor confirms that
packets are not getting lost.
Next Steps
The next DOE
SciDAC PI meeting, for all SciDAC projects, is scheduled for
March 10-11, 2003 in the San Francisco Bay area.