TCP_CORK: More than you ever wanted to know | christopher baus.net


TCP_CORK: More than you ever wanted to know

April 06, 2005

I previously mentioned the leakiness of Unix's file metaphor. The leak often becomes a gushing torrent when trying to bump up performance. TCP_CORK is yet another example.

Before I get into the details of TCP_CORK and the problem it addresses, I want to point out that this is a Linux-only option, although variants exist on other *nix flavors, for instance TCP_NOPUSH on FreeBSD and Mac OS X (although from what I read the OS X implementation is buggy). This is one of the unfortunate aspects of modern Unix programming. While most of the APIs are identical between Unix-like OSes, if the functionality isn't specified by POSIX, none of the major *nixes can seem to agree on an implementation.

What are "physical" socket writes?

The root of the abstraction leak derives from the semantics of the write() function when applied to TCP/IP. Historically (and any Unix experts in the crowd feel free to correct me here if this is not accurate) the write() function resulted in a physical, non-buffered write to the device. With TCP/IP the device is a network packet, but the implementors were forced to define a physical write given Unix's file semantics, so a TCP/IP write() was defined as follows:

Any data that has been sent to the kernel with write() is placed into one or more packets and immediately sent onto the wire.

The resulting behavior is what application programmers expected. When they called write() the data would be sent and available to the host on the other side of the wire. But it didn't take long to realize that this resulted in some interesting performance problems, which were addressed by Nagle's algorithm.

Nagle's algorithm

In the early 1980s John Nagle found that the networks at Ford Aerospace were becoming congested with packets containing only a single character's worth of data. Basically, every time a user struck a key in a telnet-like console app an entire packet was put onto the network. As Nagle pointed out, this resulted in about 4000% overhead (the header bytes sent relative to the actual application data). Nagle's solution was simple: wait for the peer to acknowledge the previously sent packet before sending any partial packets. This gives the OS time to coalesce multiple calls to write() from the application into larger packets before forwarding the data to the peer.

Nagle's algorithm is transparent to application developers, and it effectively sticks a fat finger in the abstraction leak. Calls to write() still guarantee that data is delivered to the peer. Nagle also has the side benefit of providing additional rudimentary flow control.
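The rule and the overhead figure can be sketched in a few lines of C. This is only a simplified model and not kernel code: the helper names are hypothetical, and the 40 bytes of combined TCP and IP header per packet is an assumption (the usual size without options) that yields the 4000% figure for a one-byte payload.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Header overhead as a percentage of the payload. With an assumed
 * 40 bytes of TCP/IP headers and 1 byte of data this gives 4000%. */
double overhead_percent(size_t header_bytes, size_t payload_bytes)
{
    assert(payload_bytes > 0);
    return 100.0 * (double)header_bytes / (double)payload_bytes;
}

/* Simplified model of Nagle's rule (hypothetical helper): a full
 * segment may always go out; a partial segment may go out only when
 * no previously sent data is still unacknowledged. */
bool nagle_may_send(size_t pending_bytes, size_t mss, size_t unacked_bytes)
{
    if (pending_bytes == 0)
        return false;            /* nothing to send */
    if (pending_bytes >= mss)
        return true;             /* at least one full segment is ready */
    return unacked_bytes == 0;   /* partial segment: wait for the ACK */
}
```

In this model a single keystroke on an idle connection goes out at once, while a second keystroke typed before the ACK arrives is held back and merged with whatever else the application writes in the meantime.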

Nagle not optimal for streams

While Nagle's algorithm is an excellent compromise for many applications, and it is the default behavior for most TCP/IP implementations including Linux's, it isn't without drawbacks. The Nagle algorithm is most effective if TCP/IP traffic is generated sporadically by user input, not by applications using stream-oriented protocols. It works great for Telnet, but it is less than optimal for HTTP. For example, if an application needs to send 1 1/2 packets of data to complete a message, the second (partial) packet is delayed until an ACK is received for the previous packet, thereby needlessly increasing latency when the application doesn't expect to send more data.

It also requires the peer to process more packets when network latency is low. This can affect the responsiveness of the peer, by causing it to needlessly consume resources.

Unfortunately, as is often the case, the file abstraction must be violated to improve performance. The application must instruct the OS not to send any packets unless they are full, or until the application signals the OS to send all pending data. This is the effect of TCP_CORK.

The application must tell the OS where the boundaries of the application layer messages are. For instance, multiple HTTP messages can be passed on one connection using HTTP pipelining. When a message is complete the application should signal the OS to send any outstanding data. If the application fails to signal the OS at the end of a completed message, the peer will hang waiting for the remainder of the message.

In my HTTP implementation, I use the flush metaphor, which is common with streams but not usually associated with calls to write(), which are supposed to be physical. I set the TCP_CORK option when the socket is created, and then "flush" the socket at message boundaries.

Prefer the gather function writev()

If you need to write multiple buffers that are currently in memory you should prefer the gather function writev() before considering TCP_CORK with multiple calls to write(). This function allows multiple non-contiguous buffers to be written with one system call. The kernel can then coalesce the buffers efficiently into packet structures before writing them to the network. It also reduces the number of system calls required to send the data, and hence improves performance.

This should be combined with either the TCP_NODELAY or the TCP_CORK option. TCP_NODELAY disables the Nagle algorithm and ensures that the data will be written immediately. Using TCP_CORK with writev() will allow the kernel to buffer and align packets between multiple calls to write() or writev(), but you must remember to remove the cork option to write the data as described in the next section.

TCP_NODELAY is set on a socket as follows:

int state = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &state, sizeof(state));
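As a rough sketch of the gather approach, the following hands a header and a body to the kernel with one call to writev(), on a socket where Nagle's algorithm has been disabled with the standard TCP_NODELAY option. The function names disable_nagle() and send_message() are made up for this illustration, and the buffers are assumed to be NUL-terminated strings.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Disable Nagle's algorithm on fd. Returns 0 on success, -1 on error. */
int disable_nagle(int fd)
{
    int state = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &state, sizeof(state));
}

/* Hand a header and a body to the kernel in a single system call.
 * Returns the number of bytes accepted (possibly fewer than the
 * total), or -1 on error. */
ssize_t send_message(int fd, const char *header, const char *body)
{
    struct iovec iov[2];

    assert(header != NULL && body != NULL);
    iov[0].iov_base = (void *)header;   /* writev() only reads the data */
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    return writev(fd, iov, 2);
}
```

The two buffers never have to be copied into one contiguous block in user space, which is the main attraction of writev() over building the message by hand.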

The drawback of writev() is that it is difficult to use with non-blocking I/O, where the function may return before all the data is written. A post-call operation must be performed to determine how much data was written, and to realign the buffers for subsequent calls. This is an area where auxiliary library functionality would help. Also, the behavior of writev() with non-blocking I/O isn't well documented.
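The realignment step might look something like the helper below. It is a minimal sketch, not a library function: iov_advance() is a hypothetical name, and it assumes the caller passes the byte count returned by a successful but short writev().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Skip past `written` bytes in an iovec array after a short writev().
 * Fully sent entries are dropped and the first unsent entry is trimmed
 * in place. Returns the number of entries that still hold data and
 * stores a pointer to the first of them in *iov. */
int iov_advance(struct iovec **iov, int iovcnt, size_t written)
{
    struct iovec *v = *iov;

    assert(iovcnt >= 0);
    while (iovcnt > 0 && written >= v->iov_len) {
        written -= v->iov_len;   /* this buffer went out completely */
        v++;
        iovcnt--;
    }
    if (iovcnt > 0) {
        /* Partially sent buffer: move its start forward. */
        v->iov_base = (char *)v->iov_base + written;
        v->iov_len -= written;
    }
    *iov = v;
    return iovcnt;
}
```

After each short write the caller would wait for the socket to become writable again, then call writev() with the updated pointer and count until the count reaches zero.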

A quick look at the TCP_CORK API

If you need the kernel to align and buffer packet data over the lifespan of buffers (which rules out writev()), then TCP_CORK should be considered. TCP_CORK is set on a socket file descriptor using the setsockopt() function. When the TCP_CORK option is set, only full packets are sent, until the TCP_CORK option is removed. This is important. To ensure all waiting data is sent, the TCP_CORK option MUST be removed. Herein lies the beauty of the Nagle algorithm. It doesn't require any intervention from the application programmer. But once you set TCP_CORK, you have to be prepared to remove it when there is no more data to send. I can't stress this enough, as it is possible that TCP_CORK could cause subtle bugs if the cork isn't pulled at the appropriate times.

To set TCP_CORK use the following:

int state = 1;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));

The cork can be removed and partial packet data sent with:

int state = 0;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));

As I mentioned, I use the flush paradigm, which involves awkwardly removing and reapplying the TCP_CORK option. This can be done as follows:

int state = 0;  /* pull the cork: any partial packet is sent */
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));
state = 1;      /* put the cork back for the next message */
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));
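The set, remove and flush operations can be wrapped in small helpers that also check the return value of setsockopt(). This is a sketch for Linux only, and the names cork_set() and cork_flush() are invented here, not part of any API.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Set (on = 1) or remove (on = 0) TCP_CORK on fd.
 * Returns 0 on success, -1 on error with errno set. */
int cork_set(int fd, int on)
{
    assert(on == 0 || on == 1);
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* "Flush" a corked socket at a message boundary: pull the cork so any
 * partial packet is sent, then put it back for the next message. */
int cork_flush(int fd)
{
    if (cork_set(fd, 0) == -1)
        return -1;
    return cork_set(fd, 1);
}
```

A server built this way would call cork_set(fd, 1) once after creating the socket and cork_flush(fd) after writing the last byte of each message.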

Other solutions

User-mode buffered streams are another solution to the problem. User-mode buffering is implemented as follows: instead of calling write() directly, the application stores data in a write buffer. When the write buffer is full, all data is then sent with a call to write().
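A bare-bones version of such a buffer could look like the following. It is only a sketch under a few assumptions: the descriptor is blocking, the 4096-byte size is arbitrary, and struct wbuf, wbuf_write() and wbuf_flush() are names invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

#define WBUF_SIZE 4096   /* arbitrary size chosen for this sketch */

struct wbuf {
    int    fd;
    size_t used;
    char   data[WBUF_SIZE];
};

/* Write everything in the buffer to fd (assumed to be blocking).
 * Returns 0 on success, -1 on error. */
int wbuf_flush(struct wbuf *b)
{
    size_t off = 0;

    while (off < b->used) {
        ssize_t n = write(b->fd, b->data + off, b->used - off);
        if (n == -1) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: retry */
            return -1;
        }
        off += (size_t)n;
    }
    b->used = 0;
    return 0;
}

/* Copy len bytes into the buffer, calling write() only when it fills up. */
int wbuf_write(struct wbuf *b, const char *src, size_t len)
{
    assert(b->used <= WBUF_SIZE);
    while (len > 0) {
        size_t room = WBUF_SIZE - b->used;
        size_t n = len < room ? len : room;

        memcpy(b->data + b->used, src, n);   /* the extra memory copy */
        b->used += n;
        src += n;
        len -= n;
        if (b->used == WBUF_SIZE && wbuf_flush(b) == -1)
            return -1;
    }
    return 0;
}
```

The application calls wbuf_write() wherever it would have called write(), and wbuf_flush() at the end of each message.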

Even with buffered streams the application must be able to instruct the OS to forward all pending data when the stream has been flushed for optimal performance. The application does not know where packet boundaries reside, hence buffer flushes might not align on packet boundaries. TCP_CORK can pack data more effectively, because it has direct access to the TCP/IP layer.

Also, application buffering requires gratuitous memory copies, which many high performance servers attempt to minimize. Memory bus contention and latency often limit a server's throughput.

If you do use an application buffering and streaming mechanism (as does Apache), I highly recommend applying the TCP_NODELAY socket option, which disables Nagle's algorithm. All calls to write() will then result in immediate transfer of data.