Thought experiment: Suppose you are developing a program like FTP. If you want to be sure all of the data got there, how do you go about it?
Sender and receiver must communicate "success". Receiver must ultimately count the data and make sure the right number of bytes arrived.
Similarly, receiver must checksum or otherwise verify correctness.
The end-to-end argument says that if you are going to do something on the end-host anyway, don't replicate that function in the network.
Think "hop-by-hop" vs "end-to-end". This drives the decision of what goes into the transport layer.
IP provides an unreliable datagram service. What additional features are generally needed (and therefore should be provided by a Transport layer?)
TCP Header Format (from RFC 793) 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Source Port | Destination Port | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Sequence Number | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Acknowledgment Number | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Data | |U|A|P|R|S|F| | | Offset| Reserved |R|C|S|S|Y|I| Window | | | |G|K|H|T|N|N| | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Checksum | Urgent Pointer | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Options | Padding | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | data | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ TCP Header Format Note that one tick mark represents one bit position. Figure 3.
Note that we refer to IP Packets and TCP Segments. A TCP segment refers just to the data. Because packets can be fragmented, it is useful to make this distinction. Also segments can be retransmitted; in this case the same segment is repeated in a different packet.
Which duty is each field in the packet format related to?
Port numbers: Multiplexing, connection setup/management.
Sequence numbers: reliability and ordering.
Flag bits: connection management (primarily).
Window: Flow control
Urgent Pointer: control function. Not really used.
From 793 +---------+ ---------\ active OPEN | CLOSED | \ ----------- +---------+<---------\ \ create TCB | ^ \ \ snd SYN passive OPEN | | CLOSE \ \ ------------ | | ---------- \ \ create TCB | | delete TCB \ \ V | \ \ +---------+ CLOSE | \ | LISTEN | ---------- | | +---------+ delete TCB | | rcv SYN | | SEND | | ----------- | | ------- | V +---------+ snd SYN,ACK / \ snd SYN +---------+ | |<----------------- ------------------>| | | SYN | rcv SYN | SYN | | RCVD |<-----------------------------------------------| SENT | | | snd ACK | | | |------------------ -------------------| | +---------+ rcv ACK of SYN \ / rcv SYN,ACK +---------+ | -------------- | | ----------- | x | | snd ACK | V V | CLOSE +---------+ | ------- | ESTAB | | snd FIN +---------+ | CLOSE | | rcv FIN V ------- | | ------- +---------+ snd FIN / \ snd ACK +---------+ | FIN |<----------------- ------------------>| CLOSE | | WAIT-1 |------------------ | WAIT | +---------+ rcv FIN \ +---------+ | rcv ACK of FIN ------- | CLOSE | | -------------- snd ACK | ------- | V x V snd FIN V +---------+ +---------+ +---------+ |FINWAIT-2| | CLOSING | | LAST-ACK| +---------+ +---------+ +---------+ | rcv ACK of FIN | rcv ACK of FIN | | rcv FIN -------------- | Timeout=2MSL -------------- | | ------- x V ------------ x V \ snd ACK +---------+delete TCB +---------+ ------------------------>|TIME WAIT|------------------>| CLOSED | +---------+ +---------+ TCP Connection State Diagram Figure 6.
Normal server connection establishment:
Passive Open -> Listen -> SYN RCVD -> EST
Normal client connection establishment:
Active Open -> SYN SENT -> EST
3 way handshake protects against duplicate or old data in the network. Each host must ACK the attempt to establish the connection.
Similarly, FIN is used to close connections; each side must ACK the attempt to close. TCP permits half-open connections to continue sending in one direction. Don't know of anything that uses this.
The receiver announces a window which specifies the amount of data which can be outstanding in the network at any point in time. This prevents the sender from overrunning the receiver with data if the receiver can't keep up.
If the receiver slows down, the sender automatically slows down as the ACK rate decreases.
Side effect: what is the maximum bandwidth which can be obtained at a given window? Answer: Window/RTT.
Initial TCP performed loss detection through a simple timeout. Go-back-n algorithm went back to the last byte successfully received and started transmitting from there.
What are the problems with this approach?
All of these problems lead to "congestion collapse" in the Internet in 1988. Van Jacobson invented a set of fixes which revolutionized TCP.
Basic idea: Add a congestion window (cwnd). This adds sliding window congestion control to TCP. cwnd controls the amount of data in the network.
First benefit: if the network slows down, queues increase, ACKs arrive more slowly, and the sender naturally slows down. Introduce the idea of ACK clocking.
Second benefit: the sender can adjust cwnd to match the available network capacity.
Question: how do you adjust?
Answer: additive increase, multiplicative decrease. TCP adds one segment to cwnd for each RTT, and divides cwnd by two when there is a loss.
Problem: starting with cwnd=1 might take many RTTs to get to large values of cwnd.
Solution: Slow start by doubling cwnd each RTT.
Questions: How do you identify a loss? (duplicate ACKs). How can you distinguish loss from reordering? (Add a 3 dupacks check). Fast Retransmit...
Duplicate ACKs cause problems with clocking. Can't use snd.una + cwnd to figure out what to send.
Tahoe solution: Identify congestion; retransmit; slowstart up to cwnd/2.
Reno solution: Artificially inflate cwnd until ACK advances. (Fast Recovery).
At this point, show a nam animation of slowstart and congestion avoidance.
Question: Do we still need timeouts at all?
Yes -- could still lose a whole window of data or ACKs.
Discuss adaptive timers and the RTO calculation.
Time one packet per RTT. Compute a smoothed RTT (SRTT), and deviation (SDEV).
Base the timeout on RTO=SRTT+4*SDEV. This adapts the timeout to the connection RTT.
Timer backoff: Successive timeouts are an indication of heavy congestion. Apply an exponential backoff to the timeout (1,2,4,8,16,32,64,64,64,64,64...) for each successive timeout.
(Not enough time for this, probably).
Now for something completely different. Suppose you want to attack a TCP connection. Example, insert data in the middle of a netscape download.
TCP provides some protection by randomizing the ISN. This makes it hard for an outsider to inject data. (Need to sniff the connection first to get the seq numbers, then add the false data).
Another popular attack is the SYN flood. Establish many half-open connections to fill up server memory.
Labs will use a number of popular applications.
Netperf and TTCP run TCP connections and measure their performance.
WebPolygraph runs a web-like workload designed to benchmark web proxies.
tcpdump: collect packet traces.
tcptrace: analyze tcpdump traces.
xplot: plot traces from tcptrace.
This section introduces new developments to address some of the problems found in the first lab.
Try to roll through these solutions in chronological order.
TCP includes an MSS negotiation option. Originally, this value was used if the hosts were on the same network. If going over a WAN, a minimal packet size of 576 bytes was used. This prevented unnecessary fragmentation, but was inefficient.
RFC1191 (Path MTU Discovery) provided a means for hosts to detect the path MTU. (But there are still problems with it due to various blackholing problems...)
Because of the blackholing problems, some OS come with PMTU disabled by default.
Can now scale window by up to 2^14. New max window is 1 GB. (How soon will it be until this runs out?) (With 1500 byte packets, this is a window of 600,000 packets. How long would that take to open?)
Can anyone see problems with this? Haven't changed the exponential smoothing, so RTO varies rapidly. In particular SDEV could become small quickly even if long term RTT fluctuations are occuring.
Next Lab one group devoted to this issue.
NewReno attacks the problem by filling one hole per RTT. That is, a partial ACK is indicative of a new hole, which is immediately retransmitted. The downside of this is that it may take many RTTs to recover from a loss episode, and you must have enough new data around to keep the ACK clock running.
Selective Acknowledgment (SACK) attaches detailed information in SACK options on ACKs. Each SACK block specifies a block of data which has been successfully received. The sender can use this information to quickly retransmit only those segments which have been lost. This enables holes to be filled within roughly one RTT.
Lab 2 exercises use NetBSD kernel. Some require kernel programming, some do not. Quick introduction to the BSD kernel. Copies of Stevens available for reference during the lab.
Q: Who has experience building kernels? Kernel hacking?
NetBSD quick primer:
Kernel source files: (include copies of important sections in the handouts?)
tcp_input uses a "header prediction" algorithm to speed up the code. This algorithm attempts to immediately go to the most common code path by predicting the next header. If successful, tcp_input processing is very short. Non-success cases include bi-directional data, duplicate ACKs, and state-machine transitions.
tcp_input calls tcp_output when ACKs come in which trigger new data. For bulk transfer (where cwnd or receiver window is limiting the transmissions), this is the normal means for calling tcp_output to send a segment.
Would like to better understand how TCP performs under various circumstances.
Simple analysis of slowstart. Go through math.
Simple analysis of 1/sqrt(p). Go through math for periodic loss case.
UMass calculation. Here, we add the following additional "features" to the calculation. (cwnd < 4 MSS results in timeout).
Fairness of TCP. Assume all connections at a bottleneck see equal loss rates. RTT and MSS weight the fairness.
Multiple congested gateways: Connections passing through two gateways get lower performance.
Introduce concept of min-max fairness.
Finally, some new things taking place.
T/TCP: Transaction TCP. Allow sending SYN/DATA/FIN in one packet. State machine is more complex to attempt maintain security guarantees. Goal is to reduce 5 packet transaction to 3 packets.
Rate-halving algorithm: unifies newreno, SACK, and ECN. Single algorithm.
Autotuning: This solves the receiver window problem. Receivers announce a maximum window. Note that receiver socket buffer consumes no memory normally. Sender is in full control using cwnd. To save memory on the sender, sender only allocates socket buffer for 2-4 times cwnd.
Sharing congestion state: Desirable because http likes to open multiple connections. RFC2140 specifies that these connections may share congestion state (cwnd).
IETF Endpoint Congestion Management WG is expanding on this idea. Create a UI for congestion control, and allow TCP and UDP to use the same UI. CM does all of the regulation of packets to and from the network. TCP, and UDP apps, control what data to send.
(I believe that RH is a good algorithm for this, because it separates data transmission from congestion control; exactly what CM needs).
New congestion control algorithms: For inherently rate-based applications where window based TCP can't work, two new approaches are being tested.
Finally, reliable multicast is attempting to apply TCP congestion control principles to multicast. Multicast has the problem that you can't send ACKs for every packet back to the sender, so window-based schemes are hard to use.
One approach being worked on is heirarchical ACKs (HACKs). ACKs are aggregated on the way back up the tree to the sender, allowing for the possibility of using a window based CC scheme.
The other approach is using equation based CC. Here, receivers infrequently send back info on packet loss rates and RTTs. The sender aggregates all of this information together and chooses an appropriate rate.
For both of these schemes the general idea is "go no faster than TCP would go over the worst path in the multicast tree".