are raw bandwidth testbeds worth the investment

Archived MagicPoint presentation slides, compiled into a single PDF document.

2000_extreme0001.pdf (33 slides, 241 KB)

Slide text transcript

Slide 1: are raw bandwidth testbeds worth the investment?

are raw bandwidth testbeds
worth the investment?


Matt Mathis, PSC/NLANR
mathis@psc.edu
www.psc.edu

kc claffy, UCSD/SDSC/CAIDA
kc@caida.org 
www.caida.org

Slide 2: raw b/w network research testbeds

raw b/w network research testbeds 

historically underused,
   starting w/ gigabits in the early 80's

relatively few papers considering the tax dollars invested

hard to identify indirect results
(improvements in real products)

poor utilization even when connected to production campus nets
TCP in the field is lame
 
good PR in some circles

Slide 3: typical non-net-researcher feedback

typical non-net-researcher feedback


basically unhappy 

experience really poor local performance

view testbed as a waste

unhappy to hear about "solved" problems when theirs are not

user complaints can reach Congress

Slide 4: typical testbed results/solns

typical testbed results/solns


mostly single point solutions:
hardware typically not ready for products
non-standard s/w or configs, e.g. larger than std MTU
non-general optimizations
Moore's law advances suggest that the next-generation general solution will overtake the point solution

hero numbers are not cost-effective

Slide 5: persistently unsolved research problems

persistently unsolved research problems

TCP dynamics
we do not understand TCP
simulation of unproven relevance

routing dynamics
instability, load-balancing, propagation, bugs

SLA articulation
we do not have a calculus to describe performance

measurement
we do not (know how to) measure real traffic

none of these are needed (or possible) on a 
raw bandwidth testbed

Slide 6: conclusion

conclusion


another raw bandwidth testbed is not going to help the network research most needed now


the real problem today is not filling yet another faster empty link, but understanding traffic dynamics on real infrastructure

   wasted headroom can never be filled until we understand congestion

Slide 7: part 1 end

part 1 end

Slide 8: What is the Important Problem?

What is the
Important Problem?

Slide 9: The Wizard Gap

The Wizard Gap

Slide 10: The Wizard Gap

The Wizard Gap

(TCP over a long path)

Year    Wizards     Non-wizards   Ratio
1988    1 Mb/s      300 kb/s      3:1
1991    10 Mb/s
1995    100 Mb/s
1999    1 Gb/s      3 Mb/s        300:1

Non-wizards are not happy

More hero numbers are likely to be bad politics

Web100 is attacking this gap
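
One well-known contributor to a gap like this on long paths is that a TCP sender can keep at most one window of data in flight per round trip, so throughput is bounded by window / RTT. The sketch below works through that arithmetic only. The 70 ms RTT and the two window sizes are illustrative assumptions, not figures from the slides, and the function names are hypothetical.

```python
def window_limited_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on TCP throughput when the window is the only limit."""
    return window_bytes * 8 / rtt_seconds

def window_needed_bytes(target_bps, rtt_seconds):
    """Bandwidth-delay product: window required to sustain target_bps."""
    return target_bps * rtt_seconds / 8

RTT = 0.070  # assumed cross-country round-trip time, in seconds

# Assumed untuned window sizes, for illustration only.
for label, window in [("16 KiB", 16 * 1024), ("64 KiB", 64 * 1024)]:
    mbps = window_limited_throughput_bps(window, RTT) / 1e6
    print(f"{label} window over {RTT * 1000:.0f} ms: at most {mbps:.2f} Mb/s")

needed = window_needed_bytes(1e9, RTT)
print(f"1 Gb/s over {RTT * 1000:.0f} ms needs about {needed / 1e6:.2f} MB of window")
```

Under these assumptions an untuned window lands in the low Mb/s range, while a 1 Gb/s flow needs megabytes of window, which is the kind of tuning the slides expect only wizards to do by hand.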

Slide 11: The Web100 project

The Web100 project


Key components:
Better instrumentation within TCP
If TCP is slow, just ask TCP why
Autotune TCP/IP
Require less expertise from the users

See: www.web100.org

Impact
Reduce stack bottlenecks
Indirectly fix paths
Indirectly fix applications

Slide 12: Danger - new load levels

Danger - new load levels


Any single pair of (cheap) workstations can congest any OC-3 link.

Any single pair of (expensive) workstations can congest any OC-12 link.

10+ Tb/s loads in the core?
 	100k users * 100 Mb/s

Pandemic congestion
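
The numbers on this slide are simple arithmetic, sketched below for reference. SONET OC-n runs at n times the 51.84 Mb/s OC-1 rate, and 100k users at 100 Mb/s each add up to 10 terabits per second. The function names are hypothetical.

```python
OC1_MBPS = 51.84  # SONET OC-1 line rate in Mb/s; OC-n runs at n times this

def oc_rate_mbps(n):
    """Line rate of a SONET OC-n link in Mb/s."""
    return n * OC1_MBPS

def aggregate_load_tbps(users, per_user_mbps):
    """Total offered load in Tb/s if every user sends at per_user_mbps."""
    return users * per_user_mbps / 1e6

for n in (3, 12):
    print(f"OC-{n}: {oc_rate_mbps(n):.2f} Mb/s")

print(f"100k users * 100 Mb/s = {aggregate_load_tbps(100_000, 100):.0f} Tb/s")
```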

Slide 13: Why not before Web100?

Why not before Web100?


Lame TCP hides path problems

Lame paths hide TCP problems

For nearly everyone, debugging TCP is a random walk in the dark

Lame TCP + hidden path problems smooth the traffic and limit peak loads

This should change in about 5 years

Slide 14: Problems

Problems


No deployed queue management
No deployed QoS
Pervasive broken link layers
No models for traffic sharing
Shared gigabit problem
No SLA quantification
New load levels

Slide 15: No deployed queue management

No deployed queue management


With drop-tail routers, TCP controls against a full queue

This causes huge delays and/or delay variance

Have observed single tuned flows causing 1.2 s RTT
Zero current users have well tuned flows!
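
A rough way to see how a full drop-tail queue turns into delay: a standing queue of B bytes in front of a link of rate C adds B*8/C seconds. The sketch below applies that formula. The 45 Mb/s link rate and the buffer sizes are assumptions chosen for illustration; the slides give only the observed 1.2 s RTT, which also includes propagation delay.

```python
def queue_delay_seconds(queued_bytes, link_bps):
    """Delay added by a standing queue draining at the link rate."""
    return queued_bytes * 8 / link_bps

def buffer_for_delay_bytes(delay_seconds, link_bps):
    """Queue occupancy that would produce the given delay."""
    return delay_seconds * link_bps / 8

LINK_BPS = 45e6  # assumed bottleneck rate, roughly a T3

# Assumed buffer sizes, for illustration only.
for buf in (64 * 1024, 1_000_000, 6_750_000):
    delay_ms = queue_delay_seconds(buf, LINK_BPS) * 1000
    print(f"{buf:>9} bytes queued -> {delay_ms:8.1f} ms added delay")

# Upper bound: the occupancy needed if all 1.2 s were queueing delay.
print(f"{buffer_for_delay_bytes(1.2, LINK_BPS) / 1e6:.2f} MB for 1.2 s")
```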

Slide 16: No deployed QoS

No deployed QoS


TCP requires queues for correct operation
Web100 will cause queues

Most UDP prefers not to have queues

Why do real time applications work at all today?

Is web100 going to break all delay sensitive applications?

Slide 17: Pervasive broken link layers

Pervasive broken link layers

(Poor behavior under sustained laminar packet flows)

Queueing problems
Insufficient queues
Policing without shaping

Coupling between flows
Channel acquisition in CSMA/CD Ethernet, wireless, etc
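
To make "policing without shaping" concrete, here is a minimal sketch that runs the same burst through a token-bucket policer (excess packets are dropped) and a token-bucket shaper (excess packets are delayed). The token bucket is a common textbook model used here as an assumption; the slides do not say which mechanism the link layers in question used. Rates, burst size, and packet sizes are made up, and the shaper assumes no packet is larger than the bucket.

```python
def police(packets, rate, burst):
    """Token-bucket policer: pass a packet if tokens allow, else drop it."""
    tokens, last = burst, 0.0
    passed, dropped = [], []
    for t, size in packets:
        tokens = min(burst, tokens + rate * (t - last))  # refill since last arrival
        last = t
        if tokens >= size:
            tokens -= size
            passed.append((t, size))
        else:
            dropped.append((t, size))
    return passed, dropped

def shape(packets, rate, burst):
    """Token-bucket shaper: delay each packet until enough tokens exist."""
    tokens, clock = burst, 0.0  # clock = time of the last token update
    departures = []
    for t, size in packets:
        start = max(t, clock)  # FIFO: cannot leave before the previous packet
        tokens = min(burst, tokens + rate * (start - clock))
        if tokens < size:
            start += (size - tokens) / rate  # wait for the missing tokens
            tokens = size
        tokens -= size
        clock = start
        departures.append((start, size))
    return departures

# Four 1000-byte packets arriving back to back at t = 0 (illustrative burst).
burst_packets = [(0.0, 1000)] * 4
RATE, BURST = 1000.0, 1500.0  # bytes per second, bucket depth in bytes

passed, dropped = police(burst_packets, RATE, BURST)
print("policer passed", len(passed), "dropped", len(dropped))
print("shaper departures", [t for t, _ in shape(burst_packets, RATE, BURST)])
```

In this toy case the policer drops most of the burst, which TCP sees as loss, while the shaper delivers every packet at the cost of added delay.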

Slide 18: No models for traffic sharing

No models for traffic sharing

(How do transient flows impact large flows?)

Matt's first TCP question (1991):
 	Half T3 NSF net (22 Mb/s) with 10 Mb/s load
 	Best possible FTP was 5 Mb/s
 	Where was the missing 5 Mb/s?

No theory either
 	(the mice and the elephant problem)

Akin to turbulence

We do not even know the dimensionality of the problem space

Slide 19: Shared gigabit problem

Shared gigabit problem


Can a 500 Mb/s application + 500 Mb/s "background" traffic share a 1 Gb/s link?


Can six 100 Mb/s applications + 500 Mb/s "background" traffic equitably share a 1 Gb/s link?
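
One way to make "equitably" precise is max-min fairness, computed by progressive filling. This is an assumption for illustration; the slides do not pick a fairness definition, and whether real TCP flows converge to such an allocation is exactly the open question. The function name is hypothetical.

```python
def max_min_fair(capacity, demands):
    """Max-min fair allocation of capacity among the given demands."""
    alloc = [0.0] * len(demands)
    remaining = float(capacity)
    # Serve the smallest demands first; each is capped at the current fair share.
    pending = sorted(range(len(demands)), key=lambda i: demands[i])
    while pending:
        share = remaining / len(pending)
        i = pending[0]
        if demands[i] <= share:
            alloc[i] = float(demands[i])  # fully satisfied
            remaining -= demands[i]
            pending.pop(0)
        else:
            for j in pending:  # everyone left splits the remainder equally
                alloc[j] = share
            break
    return alloc

LINK = 1000  # Mb/s

# First question: one 500 Mb/s application plus 500 Mb/s background.
print(max_min_fair(LINK, [500, 500]))

# Second question: six 100 Mb/s applications plus 500 Mb/s background.
print(max_min_fair(LINK, [100] * 6 + [500]))
```

In the second scenario the demands total 1100 Mb/s, so something has to give. Under max-min fairness the six applications keep their 100 Mb/s and the background traffic is squeezed to 400 Mb/s.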

Slide 20: No SLA quantification

No SLA quantification


How would you write a (multi-provider) service level agreement for a commodity service to support specific data rates to a large number of sites?   With specific latency requirements?

Can the next NGI be just common SLA language, say for 500 Mb/s between all R1 universities and research labs?

Slide 21: New load levels

New load levels


Will ubiquitous well tuned TCP crush the net?

Slide 22: The common theme

The common theme


We do not understand...

how traffic interacts with other traffic when the net is full

how traffic interacts with links when the net is full

fully utilized networks

We do not understand congestion!
 	and it has already been much longer than 5 years!

Slide 23: We need a traffic dynamics testbed

We need a traffic dynamics testbed


Study how traffic interacts with other traffic and underlying infrastructure

Slide 24: Which will have the longest impact?

Which will have the longest impact?


Implementing the first 10 Gb/s application?
Getting an application to fill a 1 Gb/s link with 500 Mb/s background traffic?

Solving the second problem will create the demand for industry to solve the first

Slide 25: part 2 end

part 2 end

Slide 26: Traffic Dynamics Testbeds

Traffic Dynamics
Testbeds

Slide 27: A Traffic Dynamics Testbed

A Traffic Dynamics Testbed


To study how traffic interacts with other traffic and lower layers

Requires carefully managed experiments where innocent traffic is routed over research infrastructure

Slide 28: The Basic Tension

The Basic Tension


Is the net for the users

or the network researchers?


Traffic dynamics research requires that the network be acceptable to both

Users want stable, reliable network properties
Researchers want to change things

Slide 29: Special Requirements

Special Requirements

  
Parallel standard "production like" infrastructure
Must offer similar properties and performance
Full capacity Interconnects w/ standard infrastructure
Fast IP routing knife switches to move traffic back and forth between the TB and standard paths.
An AUP that permits (requires) momentary prime time service interruptions.

Slide 30: Normal requirements

Normal requirements


Lots of modularity, patch panels etc
support tinkering with different gear

Easily programmed variable sub-rate on the links
 to create and study bottlenecks

Slide 31: Example Experiment Scenario

Example Experiment Scenario


1. Load custom microcode into the TB
2. Run test applications
3. Re-route (most) I2/NGI/etc traffic over the TB
4. Run test applications + real traffic
5. Adjust link rates down
   (introduces some congestion)
6. Rerun test applications + real traffic
7. Fast emergency cutback to standard paths

Note that step 1 (custom microcode) is more difficult at high rates

Slide 32: Lost vBNS opportunity

Lost vBNS opportunity


Proposed vBNS AUP:  Campuses could use vBNS for whatever they pleased as long as they purchased sufficient commodity connectivity to withstand prime time down time on the vBNS

Slide 33: Conclusion

Conclusion


A testbed that cannot easily support the above experiment is not going to help the network research most needed now


The real problem today is not filling yet another faster empty link, but making full use of existing headroom in the current infrastructure

Related Objects

See https://catalog.caida.org/media/2000_extreme0001/ to explore related objects to this document in the CAIDA Resource Catalog.