TCP/IP options for high-performance data transmission | TechRepublic

By Guest Contributor | March 26, 2002, 8:00am PST

By Alexander Tormasov and Alexey Kuznetsov

In a previous article, we explained how you can use the sendfile() syscall to reduce the overhead of data transfer from a disk to a network. Now, we're going to cover another aspect of network connection control that can help maximize sendfile() capabilities in real-life situations: setting TCP/IP options to control socket behavior.

TCP/IP data transfer
Data transfer in a TCP/IP network is usually block-based. From a programmer’s point of view, sending data means issuing a series of “send data block” requests. At the system level, an individual block of data can be sent with a write() or sendfile() syscall. At the network level, you will see more data blocks, usually called frames, which are ordered sets of bytes with headers traveling across the wires. What is inside the frame and its header is defined by several protocol layers, from the physical to the application layer of the OSI model.

The length and sequence of network packets are under the control of the programmer because the programmer chooses the most appropriate application protocol to be used in a network connection. Equally important, the programmer must select the way this protocol is implemented in software. The TCP/IP protocol itself has many interoperable implementations, so when two parties are communicating, each could have its own low-level behavior, which is another fact the programmer should be aware of.

Normally, the programmer need not worry about tinkering with the way that the underlying operating system and network stack send and receive network data. The built-in algorithms define the low-level data organization and transmission; however, there are some ways to influence the behavior of these algorithms and gain more control over network connections. For example, if an application protocol uses timeouts and retransmission, the programmer might want to set or obtain the timeout parameters. He or she might also need to increase the size of the send and receive buffers to ensure uninterrupted information flow in the network. The general way to change the conduct of the TCP/IP stack is through so-called TCP/IP options. Let's take a look at how you can use them to optimize data transmission.
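As a rough illustration of the buffer tuning mentioned above, the sketch below requests larger send and receive buffers on a socket and reads back what the kernel actually granted. The helper names set_socket_buffers() and get_send_buffer() are hypothetical, not part of any library, and the sizes passed in are up to the caller. The kernel may round, double, or cap the requested value, so the sketch treats the request as a hint.

```c
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel for the given send and receive
 * buffer sizes (in bytes). Returns 0 on success, -1 on failure. */
int set_socket_buffers(int fd, int send_bytes, int recv_bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &send_bytes, sizeof(send_bytes)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &recv_bytes, sizeof(recv_bytes)) < 0)
        return -1;
    return 0;
}

/* Hypothetical helper: report the send buffer size currently in effect,
 * or -1 on failure. The value can differ from what was requested. */
int get_send_buffer(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &value, &len) < 0)
        return -1;
    return value;
}
```

A caller would typically invoke set_socket_buffers() right after socket() and before connect() or listen(), then log the value returned by get_send_buffer() to see what was actually applied.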

TCP/IP options
There are many options that alter the behavior of the TCP/IP stack. Some of them can have adverse effects on other applications running on the same computer, so they are normally unavailable to ordinary users (other than root). We will concentrate on options that change the operation of an individual connection, or socket in TCP/IP terms.

The ioctl-style getsockopt() and setsockopt() system calls provide the means to control socket behavior. For example, to set the TCP_NODELAY option in Linux, you would write code like that shown in Listing A.
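The following is a minimal sketch of such a call, not the authors' Listing A. It wraps setsockopt() and getsockopt() for TCP_NODELAY in two small helpers whose names, set_nodelay() and get_nodelay(), are made up for this example. It uses the IPPROTO_TCP level, which is the portable spelling across UNIX systems.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn TCP_NODELAY on (on = 1) or off (on = 0) for a TCP socket.
 * Returns 0 on success, -1 on failure with errno set. */
int set_nodelay(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Query TCP_NODELAY: returns 1 if set, 0 if clear, -1 on failure. */
int get_nodelay(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, &len) < 0)
        return -1;
    return on != 0;
}
```

The option is passed as an int holding a boolean value, together with its size, which is the general calling pattern for most socket options.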

Although there are many TCP options to manipulate, we'll focus on just two of them here, TCP_NODELAY and TCP_CORK, which both significantly influence the behavior of a network connection. TCP_NODELAY is implemented on many UNIX systems, but TCP_CORK is Linux-specific and relatively new; it was first implemented in the kernel version 2.4. Other UNIX flavors could have functionally similar options, notably the TCP_NOPUSH option on a BSD-derived system, which is actually one part of the T/TCP implementation.

TCP_NODELAY and TCP_CORK basically control packet “Nagling,” or the automatic concatenation of small packets into bigger frames performed by the Nagle algorithm. John Nagle, after whom this process was named, first implemented it in 1984 as a way to fight network congestion at Ford Aerospace. (See IETF RFC 896 for more details.) The problem he addressed, which RFC 896 calls the small-packet problem and which is closely related to the so-called silly window syndrome, was congestion that occurred simply because widespread terminal applications sent keystrokes one per packet, typically one byte of payload and 40 bytes of header, thus causing 4,000 percent overhead. Nagling became standard and was widely implemented across the Internet. It is now considered a default, but as we'll see, there are situations when turning it off is desirable.
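The 4,000 percent figure is simply the header size divided by the payload size. The small sketch below makes that arithmetic explicit; the function name header_overhead_percent() is made up for this example, and the 1,460-byte payload mentioned afterward is an assumed typical Ethernet segment size, not a number from the article.

```c
/* Header bytes expressed as a percentage of payload bytes.
 * Returns -1.0 if the payload is empty, since the ratio is undefined. */
double header_overhead_percent(unsigned header_bytes, unsigned payload_bytes)
{
    if (payload_bytes == 0)
        return -1.0;
    return 100.0 * (double)header_bytes / (double)payload_bytes;
}
```

For one byte of payload behind a 40-byte header, the function returns 4000.0. With the same header in front of an assumed 1,460-byte payload, the overhead drops below 3 percent, which is the gain that coalescing small writes is after.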

Let's say an application just issued a request to send a small block of data. Now, we could either send the data immediately or wait for more data. Some interactive and client-server applications will benefit greatly if we send the data right away. For example, when we are sending a short request and awaiting a large response, the relative overhead is low compared to the total amount of data transferred, and the response time could be much better if the request is sent immediately. This is achieved by setting the TCP_NODELAY option on the socket, which disables the Nagle algorithm.

Another case involves waiting until we have the maximum amount of data the network can send at once, which benefits the performance of large data transfers, typically in file servers. The Nagle algorithm looks to accommodate these cases. But if you're sending a large amount of data, you could set the TCP_CORK option to disable Nagling in a way that's opposite to how TCP_NODELAY does it. (TCP_CORK and TCP_NODELAY are mutually exclusive.) Let's take a closer look at how this works.

Imagine that the application using sendfile() transfers bulk data. Application protocols usually require sending some information that helps interpret the data first, known as a header. Typically, the header is small, and TCP_NODELAY is set on the socket. The packet with the header will be transmitted immediately and, in some cases (depending on internal packet counters), it could even trigger a request for acknowledgement that this packet was successfully received by the other side. Thus, the transfer of bulk data will be delayed and unnecessary network traffic will be exchanged.

But if we set the TCP_CORK option on the socket, our header packet will be padded with the bulk data, and all the data will be sent automatically in packets of the appropriate size. When finished with the bulk data transfer, it is advisable to “uncork” the connection by unsetting the TCP_CORK option so that any partial frames that are left can go out. Uncorking is just as important as corking.

To sum it up, we recommend setting the TCP_CORK option when you're sure that you will be sending multiple data sets together (such as the header and body of an HTTP response), with no delays between them. This can greatly benefit the performance of WWW, FTP, and file servers, as well as simplify your life. Listing B provides an example.
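To make the cork, send, uncork sequence concrete, here is a minimal Linux-only sketch. It is not the authors' Listing B: the helper names set_cork() and send_header_and_file() are hypothetical, the socket is assumed to be a connected, blocking TCP socket, and error handling is deliberately thin (for instance, EINTR is not retried).

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Set (on = 1) or clear (on = 0) TCP_CORK. Returns 0 or -1. */
int set_cork(int sock, int on)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Hypothetical helper: send a small header followed by the first
 * file_len bytes of file_fd as one corked stream. Returns 0 or -1. */
int send_header_and_file(int sock, const void *hdr, size_t hdr_len,
                         int file_fd, size_t file_len)
{
    const char *p = hdr;
    off_t offset = 0;
    int rc = -1;

    if (set_cork(sock, 1) < 0)      /* hold back partial frames */
        return -1;

    while (hdr_len > 0) {           /* queue the header */
        ssize_t n = write(sock, p, hdr_len);
        if (n <= 0)
            goto uncork;
        p += n;
        hdr_len -= (size_t)n;
    }

    while (file_len > 0) {          /* append the file body, zero-copy */
        ssize_t n = sendfile(sock, file_fd, &offset, file_len);
        if (n <= 0)                 /* error, or file shorter than asked */
            goto uncork;
        file_len -= (size_t)n;
    }
    rc = 0;

uncork:
    if (set_cork(sock, 0) < 0)      /* always uncork to flush the tail */
        rc = -1;
    return rc;
}
```

The uncork step runs on both the success and failure paths, so a partially filled final frame is never left waiting in the kernel.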

Unfortunately, many popular programs do not take these considerations into account. For example, Eric Allman’s sendmail does not set any options on its sockets, although its performance is quite low anyway, so there may be nothing to optimize.

Apache HTTPD, the most popular Web server on the Internet, has the TCP_NODELAY option set on all its sockets, and its performance is regarded as satisfactory by most users. Why? The answer lies in implementation differences. BSD-derived TCP/IP stacks (notably FreeBSD) operate differently in this situation. When a large number of small data blocks is submitted for transmission in TCP_NODELAY mode, a large number of packets will be sent, one per write() call. However, the probability of introducing delays will be much lower because the counters that are responsible for requesting acknowledgements of delivery are byte-oriented and not packet-oriented (as in Linux). Thus, only total size will matter. Whereas Linux asks for acknowledgement after the first packet, FreeBSD will wait for hundreds of packets before doing the same.

In Linux, the effect of TCP_NODELAY could be quite different from what is expected by a developer who is used to BSD-derived TCP/IP stacks, and Apache on Linux performs worse than it could. The same is true for many other applications actively using TCP_NODELAY on Linux.

Get the best of both
Your data transmission needs won't always conform neatly to one option or the other. In that case, you may want to take advantage of a more flexible approach for controlling a network connection: Set TCP_CORK before sending a series of data that should be considered as a single message and set TCP_NODELAY before sending short messages that should be sent immediately.
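One way to sketch this switching in code is a small mode-setting helper that clears one option before setting the other, so the two are never enabled together. The enum send_mode and the function set_send_mode() are hypothetical names introduced for this illustration, and the sketch assumes Linux, since TCP_CORK is not available elsewhere.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical modes: batch several writes into one message, or
 * push each short message out as soon as it is written. */
enum send_mode { SEND_BATCHED, SEND_IMMEDIATE };

static int set_tcp_flag(int fd, int option, int on)
{
    return setsockopt(fd, IPPROTO_TCP, option, &on, sizeof(on));
}

/* Switch the socket between the two modes. The option being left is
 * cleared first; clearing TCP_CORK also flushes any held-back data.
 * Returns 0 on success, -1 on failure. */
int set_send_mode(int fd, enum send_mode mode)
{
    int to_clear = (mode == SEND_BATCHED) ? TCP_NODELAY : TCP_CORK;
    int to_set   = (mode == SEND_BATCHED) ? TCP_CORK : TCP_NODELAY;

    if (set_tcp_flag(fd, to_clear, 0) < 0)
        return -1;
    return set_tcp_flag(fd, to_set, 1);
}
```

A server might call set_send_mode(fd, SEND_BATCHED) before writing a response header and body, then set_send_mode(fd, SEND_IMMEDIATE) before sending a short status or keep-alive message.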

Combined with a zero-copy approach and the sendfile() syscall (as covered in a previous article), this technique could significantly improve total system throughput and decrease CPU load. Our experience in using this combined approach for developing a name-based hosting subsystem for SWsoft’s Virtuozzo technology demonstrates that it is possible to achieve almost 9,000 HTTP requests per second on a 350-MHz Pentium II PC, which was considered practically impossible before. The performance gain is tremendous.