WO2000060825A1

WO2000060825A1 - Connection pass-through to optimize server performance

Info

Publication number: WO2000060825A1
Application number: PCT/US2000/008453
Authority: WO
Inventors: David J. Yates; Anthony D. Amicangioli; Abdelsalam A. Heddaya; William Y. Tao; Sulaiman A. Mirdad; Ian C. Yates; Jeanette P. Fariborz; David E. Dukinfield
Original assignee: Infolibria, Inc.
Priority date: 1999-04-02
Filing date: 2000-03-30
Publication date: 2000-10-12
Also published as: WO2000060825A9; EP1166525A1

Abstract

A technique for deploying a network cache server with a redirecting switch to permit document request messages to be routed to a cache server or transparently bypassed through to the originating home server. The method optimizes the number of objects delivered from the cache by maintaining a list of popular stored objects and providing the list of stored objects to the message redirector. When under a maximum connection load, the message redirector enters a selectivity mode in which, when a new connection request is received, it is not routed up to the cache server.

Description

CONNECTION PASS-THROUGH TO OPTIMIZE SERVER PERFORMANCE

BACKGROUND OF THE INVENTION

Computer network industry analysts and experts agree that the data traffic over large networks and, in particular, the Internet, is presently so heavy that the very nature in the way in which it is possible to use such networks may require fundamental changes . These difficulties are no doubt the result of continued exponential increases in the number of users, as well as in the number of large documents, such as media files, to which these users desire access. As a result of this unprecedented demand in the need for bandwidth and access to networks, Internet Service Providers (ISPs) , backbone providers, and other carriers that provide the physical connections which implement the Internet face correspondingly unprecedented difficulty. This difficulty exists at all levels of network hierarchy, including points of presence (POPs) , central access nodes, network access points, and network exchange points, such as metropolitan area exchanges.

The Internet provides widespread access to content on an equal basis through the use of a client and server communication model. In this structure, certain computers known as "servers" are used to store and provide information. One type of server, known as a host server, provides access to information such as data, text, documents, programs stored in various computer file formats, but generally referred to as a "document." Other computers in the network known as "clients" allow the users to view documents through the use of a computer program known as a browser that requests a copy of the document be sent from host servers down to the client. Documents are typically requested by the client browser program specifying an address which identifies the host server which stores the document . The request is sent over the network to a naming service in order to obtain instructions for how to establish a connection with the host server associated with the address. Once this connection is established, the server retrieves the document from its local disk and transmits the document over network to the client. The connection between the client and host server is then terminated. A given request may require that it pass through a number of routers or "hops" through the Internet on its way from the host server down to the client.

A common solution for the present bottlenecks within the Internet is to deploy higher speed interconnection hardware. Such solutions include the deployment of digital subscriber line (xDSL) and cable modem technology to speed up the access between the end users and points of presence. Gigabit speed routers and optical fiber backbones are also being proposed to alleviate congestion within the network itself. At the server site, server clusters and load balancers are being deployed to assist with the dispatching of Web pages more efficiently.

While all of these solutions provide some expediency, each only solves part of the problem, and none provides a satisfactory solution to the ultimate problem -- the path between the client and server is only as fast or as slow as the slowest link.

As it turns out, much of the traffic on the Internet is redundant in the sense that different users request the same documents from the same servers over and over again. Therefore, it is becoming increasingly apparent that certain techniques, such as distributed content caching, may be deployed to reduce the demand for access to both the servers and to the network routing infrastructure. Distributing content throughout the network, such as through the use of document caches, provides a way to intercept client requests and serve copies of the original document to multiple client locations .

Using a cache, the process for providing document files to the client computers changes from the normal process. In particular, when the client requests the connection, say to a given server, the intermediate cache server may instead be requested to obtain the document . While the document is being transmitted down to the client computer, a copy is stored at the intermediate cache server. Therefore, when another client computer connected to the same network path requests the same content as the first user, rather than requiring the request to travel all the way back to the host server, the request may be served from the local cache server.

By moving popular content closer to the users who want it, distributed content servers may be used to alleviate the congestion at its cause. These distributed cache servers dramatically improve end user response time, decrease backbone and server loading, provide a vehicle for efficient routing of time sensitive traffic.

However, various cache techniques are typically sub-optimal in one way or another. For example, every Web browser has a built in cache that keeps copies of recently viewed content within the client computer itself. If the same content is requested again, the browser retrieves it from its local cache instead of going out to the network. However, when a browser cache services only one end user, content often expires before it can be reused.

A so-called browser redirected cache server may also be deployed to service multiple end users. Such a browser redirected cache sits inside a gateway or other point of presence into the network. End users configure their Web browsers to redirect all HTTP traffic to the cache instead of the locations implied by the Uniform Resource Locators (URLs) . The browser redirected cache server returns the requested Web page if it has a copy. Otherwise, it forwards the request to the originally specified server and saves a copy as the response flows back. Such a proxy server therefore acts as a gatekeeper, receiving all packets destined for the Internet and examining them to determine if it can fulfill requests locally. However, when using such proxy servers, it is typically necessary to configure the client browser, proxy server, routers, or other network infrastructure equipment in order to cause the request messages to be redirected to the proxy server. This provides some configuration management difficulties in that reconfiguration of browsers typically requires administrative overhead on the part of the humans who manage the networks .

To improve the odds of locating desired content without having to traverse the entire Internet, local points of presence can be supported by additional caches placed deeper into the network, such as at peering centers. If the primary cache cannot satisfy a request, it queries a secondary cache which in turn may query a tierciery cache, and so forth. If none of the hierarchy of caches has the desired content, the primary cache ends up returning the original request to the originally requested host . Each of these caching schemes falls short in some way. Forced redirection of HTTP traffic employed by both browser redirected and router redirected caches turns cache servers into single points of failure. If a cache server overloads or malfunctions, access to the network is blocked. Recovery is especially awkward with browser redirected caching since every end user's Web browser has an explicit point to the broken server.

Forced redirection can also have a negative affect on network performance. Even if a browser is topologically closer to the real content server than to a cache server, all HTTP requests detour through the cache and any Web object not in the cache passes through the nearby router or switched twice — one when it travels from the originating server to the cache, and again as the cache forwards it back to the browser and furthermore passing messages from primary to secondary caches and back again at its noticeable latency and ultimately limits the scope of caching in larger networks .

There is presently, therefore, much controversy over the deployment of caches for several reasons. Cache servers are, in particular, notoriously difficult to optimize. In certain configurations, they will quickly become overloaded in that the number of connections that they are expected to maintain with the user locations is more than the processing power can handle. Time spent determining whether to accept connections, cache documents, and/or refuse connections therefore overloads the cache server, which in turn reduces its performance on an exponential basis. In the other situation, the cache servers are underloaded and not enough traffic is routed to them. They therefore represent a large investment of resources to deploy not providing optimum utilization are in effect underloaded.

The network caches themselves are also in many cases limited by the bus speed of the personal computer (PC) in which they are implemented and that processes ability to process IP connections. In today's current practice, all such connections must enter the PC bus and be processed by the local processor, regardless of the ability of the cache device itself to add utility to that connection. For instance, if the network cache is being overloaded by too many requests for connections, the processor in the cache must still look at all new connection requests to determine if the cache server should continue to be servicing that connection. The cache server must also keep such other connections open while it waits for them to close. In addition, network caches typically selectively engage source and destination connections, but this functionality is performed by the content delivery device itself.

SUMMARY OF THE INVENTION

The present invention is a technique for implementing a cache server together with a message redirector for off-loading connection processing functions. The message redirector performs the function of filtering traffic away from the network cache or other content delivery device so that the number of connections for which the network cache cannot contribute utility is minimized. This is done by performing a time-wait functionality and blocking new connections when the cache server is overloaded.

More particularly, the message redirector is a three-logical port, transparent bridge with enhanced features such as filtering and traffic redirection. The message redirector permits the cache server to be transparently installed in-line between routers, switches, and other network backbone infrastructure.

The message redirector includes a bridge filtering logic function which implements a connection pass through feature that provides increased performance for the cache server as measured in the number of objects delivered from the cache versus the number of objects which must be retrieved from elsewhere in the network.

Internal to the cache is maintained a list of stored objects. A cache manager process in the cache server scans this list of stored objects and identifies a subset of number, N, of the most requested objects. An object may be a domain such as a full IP address or may be sub-net mask. In general, when the cache is performing well, it is overloaded such that the number of offered connections is much greater than the number of serviceable connections. The list of popular requested addresses is then sent down to the message redirector from time to time.

During this process, the bridge filter logic in the message redirector looks for a connection request such as for HTTP request which include an SYN message. The associated Internet Protocol (IP) address is compared to the local selective connection table in the message redirector. The SYN request is then routed up to the cache server only if a free connection is available.

The filter logic can also optionally determine if a connection is free based upon dynamic processing conditions in the server, such as file system load, number of active connections, the number of hits or misses being experienced, and the size of the cached objects . BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

Fig. 1 is a block diagram showing an overview of a cache server and message redirector implemented according to the invention and their associated software functionalities .

Fig. 2 is a more detailed view of a three-port transparent message redirector used with the cache server.

Fig. 3 is a flow chart of the steps performed by the filter logic portion of the message redirector.

Fig. 4 is a more detailed view of the HTTP selectivity process in the filter logic.

Fig. 5 is a diagram of a first test configuration.

Fig. 6 is a diagram of a second test configuration.

Fig. 7 is a chart of measured throughput versus offered load.

Fig. 8 is a chart of response time versus throughput .

DETAILED DESCRIPTION OF THE INVENTION

1. Architectural Overview of the Redirecting Cache Server

Fig. 1 shows a block diagram of an exemplary network content delivery device, such as a redirecting network cache 10, and the manner in which it may be implemented to achieve the advantages of the present invention. The network cache 10 is deployed at any of a number of places in a network infrastructure 12. It may be deployed at network access sites, such as points of presence (POPs) , at an Internet Service Provider (ISP) , at ISP peering points, at interchange points in a large scale enterprise network, central offices in a local exchange carrier network, metropolitan area exchanges, or other points in a network through which message traffic is concentrated. In the illustrated embodiment, the network cache 10 is deployed at an intermediate point in the network 12 and is configured to cache Web pages traveling at the request of a Hypertext Transfer Protocol (HTTP) client 14, through a first set of network connections 15, through a Router A 16, to Router B 17, through a second set of network connections 18, to an HTTP server 19. Other content delivery devices may also take advantage of the teachings of this invention, however .

The network cache 10 consists of a message redirector 20 and cache server 22. The message redirector 20 consists of four ports 24-1, 24-2, 24-3, and 24-4, a pair of switches 26-1 and 26-2, and redirector controller 30.

The ports 24-1 and 24-4 provide connections to the network 12 such as from a local area network (LAN) or wide area network (WAN) . The network ports 24-1, 24-4 may, for example, be compliant with Ethernet 10 base T, 100 base T, or other types of physical layer implementations, such as ATM, PPC/SONET, frame relay, or other network protocols. Although in the illustrated embodiment, the ports 24-1 and 24-4 are shown as connected to, respectively, Router A 16, and Router B 17, it should be understood that they may provide connections to other access devices, switches, servers, bridges, and the like.

The other ports 24-2 and 24-3, referred to herein as the server ports, provide a connection for passing message traffic up to and down from the cache server 22. These server ports may also provide typically the same sort of physical layer link as provided for the respective network ports 24-1, and 24-4.

The redirector controller 30 controls the switches 26-1, 26-2 to permit each message to be either routed up to the cache server 22 from either Router A 16 or Router B 17, or permits the message to be passed straight through between Routers A 16 and Router B 17. As discussed in greater detail below, the redirector controller has several processes which accomplish this, including layer spoofing 31, and bridge filter logic 35 which encompasses connection selectivity 32 and connection pass through 33.

The message redirector 20 and cache server 22 cooperate to provide a transparent HTTP object cache for the network 12. In particular, the redirecting network cache 10 monitors HTTP traffic flow between the routers 16 and 17 and stores copies of sufficiently popular Web pages. Subsequent requests for the stored pages, for example from an HTTP client 14, are then retrieved from the cache storage 24 rather than from the originating server 19. This results in a significant reduction in network line utilization and improves user response time by reducing the number of hops between client 14 and server 19 and also by providing multiple server sources for popular Web pages.

The cache server 22 performs the system's core storage functions, such as HTTP object storage and retrieval. The cache server 22 maintains a connection service processes 41 which services active connections; that is, it accepts HTTP requests for active connections and provide the requested objects from the cache server 22 once active. The message redirector 20 provides a fail-safe mechanism for missing critical data links by monitoring the cache server's health and completes the bypassing if a failure occurs. For more information on this particular feature of the message redirector 20, attention is directed to published Patent Cooperation Treaty (PCT) document W099/48262 entitled "Message Redirector with Cut Through Switch for Highly Reliable and Efficient Network Traffic Processor Deployment" filed March 3, 1999 and assigned to InfoLibria, Inc., the assignee of the present application.

The transparent network cache 10 caches and serves HTTP objects without specific reconfiguration of a browser program located at the HTTP client 14. To support this functionality, the design provides a form of link transparency which allows the network cache 10 to participate in HTTP data transfers without advertising itself as a Router 16, 17 or a host 19. By being designed to be transparent relative to Internet Protocol (IP) network state and topology, although network data passes through the server 22 when packets travel between Router A 16 and Router B 17, the routers do not recognize the network cache 10 as an intermediate hop. In fact, for non-HTTP traffic, the network cache 10 behaves as a transparent Ethernet bridge. These behaviors are achieved through two modifications to the standard TCP/IP driver stack. These include address resolution protocol (ARP) and Media Access Control (MAC) layer address spoofing 31 and the addition of a filter logic and HTTP selectivity function 32 to the IP driver layers . The MAC layer spoofing function 31 is provided as follows. When the network cache 10 is operational, network interface cards within the message redirector 20 are reprogrammed to accept packets with the MAC layer addresses of the respective router interfaces. This can be achieved by either setting the interfaces into a so- called promiscuous mode, or by reprogramming the interface cards with the same MAC address that are used on the Routers 16, 17 a form of MAC layer spoofing. For MAC spoofing, port 1 of the cache server 22 is connected to Router A and is therefore reprogrammed with Router B's MAC layer address (MAC_B) . Likewise, the network interface associated with port 2 of the cache server 22 is reprogrammed with Router A's MAC layer address (MAC_A) . In this spoofing mode, the network cache 10 first performs a static configuration of IP by transmitting the ARP and/or reverse ARP (RARP) requests before the message redirector 20 is fully operational.

The address resolution protocol is a standard protocol used to convert an IP address into a physical address. A host wishing to obtain a physical address broadcasts an ARP request out onto the network. The host on the network that has the IP address in the request then replies with its physical hardware address. The protocol is used to operate below the network layer as part of the OSI link layer.

If the promiscuous mode is selected, this consumes additional bus bandwidth and CPU cycles in the cache server 22 in configurations where more than one device is located on a given Ethernet segment. For example, from the cache server's perspective, if two routers are connected to Port 2 (28-2) (e.g., Router BI and Router B2) traffic flows between these routers will be read by the server 22 and will need to be dropped. Another possible approach is to use the server's 22 stock MAC layer address so that Routers A and B will still view each other as the next hop, but the MAC addresses returned on the ARP requests are the server MAC addresses. This approach is not preferred because of the MAC mismatch that occurs when the message redirector 20 switches the cache server 22 offline in the event of a failure. It also does not provide true transparency.

Message redirector 20 supports a Layer Two (L2) routing table as constructed through automatic discovery. To support this, the NICs 51 are initialized in a promiscuous mode, allowing any and all packets to enter the bridge for forwarding . Upon reboot , the bridge will have no Layer Two (L2) MAC address routes in its table. While in this state, the bridge will flood all packets received. In other words, if a packet is received from port 1 from bridge A, it will automatically be forwarded out of port 2 to bridge B regardless of whether the destination device resides on that segment .

As the unit receives and forwards packets, the source address fields of the packets allow the device to add the discovered MAC address to the L2 routing table once each device on the attached segments has "talked" once, the discovery process is complete. Thus, the L2 routing table consists of a MAC address, the port packets destination to that particular device should be forwarded.

The bridge filter (BF) logic function 35 provides a connection selectivity function 32 as well as a dynamic load shedding or connection pass through function 33. The purpose of these functions in general is to bridge packets straight from port 24-1 through to port 24-4 that the server 22 is not processing. Packets that are being processed by the cache server 22 are passed up the stack through one of the ports 24-2 or 24-3. Both filter logic 35 functions of selectivity 32 and connection pass through 33 are described in greater detail below.

A hardware block diagram of the message redirector 20 which implements these features is shown in Fig. 2. It consists of a pair of network interface cards (NIC) 51-1, 52-2, each associated with particular router connection, respectively, Router A 16 or Router B 17. The message redirector 20 is typically implemented in a personal computer. The NIC 51 are thus connected to an internal bus structure 50 through one or more interfaces such as PC industry standard architecture (ISA) or extended ISA (EISA) or interface 52, or PCI interfaces 55 to a central processing unit (CPU) 53. CPU has an associated memory 54. The second PCI interface 55 provides connection through the bus up to the cache server 22. It should be understood that multiple PCI interfaces such as a secondary PCI interface 55-i may be provided to permit a PCI interface to be associated with each of the cache server ports 28-1 and 28-2.

2. HTTP Connection Selectivity (32) Fig. 3 is a more detailed flow chart of the logic provided to perform the connection selectivity function 32. As shown in state 100, after a packet is received, a next state 101 is entered in which the MAC layer address is examined to determine if the packet is a "for us" message. If it is so, then the packet is passed in state 102 to the "for us" driver in the cache server 22. The "for us" functionality can be used to define a logical device driver that allows certain non-HTTP traffic to be received by the message redirector and routed up to the cache server 22. Examples of "for us" traffic include inband SNMP management, inband Telnet sessions for controlling configuration, and FTP downloads that are, for example, software updates. The "for us" driver typically will have a single IP address (IP_F) and MAC address (MAC_F) . The server receives such ports on either packets on either port their first check to see if they are "for us" packets, the information can be checked by looking at either the MAC or IP address on the incoming packets.

If, however, the MAC address does not indicate a "for us" message, then TCP header port number is read in state 103 to determine if it is an HTTP packet. If it is, such as by indicating that the TCP header port number is set equal to 80, then the packet is not type HTTP and it is forwarded, or bridged, out to the other interface in state 104. Thus, for example, if the packet was received in on interface 24-1 from Router A, it is forwarded directly out to Router B on port 24-4. Similarly, if the packet was received from Router B on interface 24-4, it is routed straight out through interface 24-1 to Router A 16.

If, however, the packet is an HTTP packet, then processing proceeds to a state 105. An important consideration at this point relating to HTTP selectivity is the issue of IP fragmentation. Sometimes HTTP packets are fragmented into multiple IP packets and therefore they need to be reassembled before they can be passed up the stack (assuming that the TCP header is present only in the first packet) .

In any event, in state 105, the packet is examined to determine if it includes an SYN packet. Such packets indicate the beginning of a request for a connection for an HTTP object. If the packet is not a SYN packet, then processing proceeds to state 106 in which a TCP connection table maintained in the memory 54 for active TCP connections is examined. If the connection is found in the table in state 107, then the packet is pushed up to the IP layer in the cache server in state 109. If, however, the connection is not found in state 108, then the packet is passed over to the other interface, e.g., the network connection on which it was not received. The manner of maintaining this via a selective connection table 34 will be described in further detail below.

If the packet is a SYN packet, then a state 110 is entered in which a new connection is being requested. In the state 110, it is determined whether or not a new connection can be established. This depends upon whether or not an open connection is available in the server 22 and other factors, as described below in the discussion concerning connection pass through logic 33. If this is not the case, then processing proceeds to a state 114 where the packet is simply bridged out to the other side. If this is the case, then processing will proceed to a state 115. The selective connection state 115 is more particularly shown in Fig. 4.

Connection selectivity is based on a list of IP addresses and sub-net masks. These addresses are yielded by analyzing an HTTP object table as maintained by the cache server 22 and then building a list of the IP addresses of the servers that contain the most popular objects stored in the cache 24. The selective connection table (SCT) generation process 42 executes as part of the cache manager 40. The list referred to herein as the selectivity connection table 34 is periodically generated and downloaded to the message redirector 20. This selective connection table allows the message redirector 20 to hunt for connection requests (SYNs) that have a higher probability of a hit in the cache server 22 given that their destination IP address already has content loaded in the cache server 22. This feature also allows the network cache 10 to effectively shift the optimum cache locality point because it allows the cache server 22 to effectively participate in the need to compare fewer IP addresses.

Sub-net masks and/or complete IP addresses may be stored in the selective connection table 3 . For example, certain sites, such as cnn.com or yahoo.com, have a number of pages associated with them that may rise to the level of being sufficient popular to be maintained in the cache 24. In this instance, rather than maintain the complete four-digit full IP address for each page, together with the sub-net information mask may be provided to indicate more than one page at the site.

In the preferred embodiment, the selectivity policy can be set through two basis parameters; the period or ratio. The selectivity period is a single timer setting that is global to all selective connections. When the message redirector 22 enters a state 110 in which it is determining whether to make a connection, a timer is begun. If a select connection (an SYN with an IP address in the selective connection table) is not found before the timer expires, the selective connection state switches to a non-selective mode. In this non-selective mode, any occurring SYN will be permitted to be routed up to the cache. Thus, in a first mode or selective mode, only SYN requests already have their associated IP addresses and/or sub-net masks stored in the selective connection table are permitted to be routed out to the cache server. In the non-selective mode, the next SYN will be routed up.

In this mode, the system provides an N÷K selective to non-selective behavior. For example, if this is set to one over three for every three non-selective SYNs, the system provides a selective search.

As shown in Fig. 4, the connection selectivity function 32 can be provided in state 110 as follows. In an initial state 111 (which may be entered at any point after message traffic begins to be received) , a selective connection timer 112 is reset. If it is not in the selective mode, then processing can exit from state 110 to prepare the connection to be maintaining in state 115.

If, however, in state 112 the timer is in the selective mode, then the state 113 is entered. In this state 113, it is determined whether or not the IP address of the SYN request is located in the selective connection table. If this is the case, then the connection will be permitted to be maintained and processing will return to state 115. If, however, in state 113, the IP address or sub-net mask is not located in the selective connection table, then processing will proceed to state 114 in which the packet will be bridged to the other interface.

Having available a selective connectivity period, it provides a natural effect of controlling the connection acceptance rate. For example, consider the case where the cache 10 is hunting for selective connections but the population of selective connections is low. In this case, the new connection SYNs allowed to be routed up to the cache server 22 are spaced out at intervals of the selectivity period, t, plus the average SYN arrival . Another important feature of the selectivity time period is that it provides a natural load control mechanism. For example, assume that the number of offered connections, (O_c) , which is the actual number of connections passing through network 12. Also assume that the number of serviceable connections is the number of connections that the cache server 22 can actually service at any point in time (S_c) . In general, the number of offered connections will exceed the number of serviceable connections since the cache server 22 has a finite capacity.

The goal is to obtain a higher hit rate for the cache server 22 as measured in the number of objects delivered from the cache 22 as opposed to the number of objects which must be obtained from routes from the HTTP servers 19. Assuming that the number of offered connections exceeds the number of serviceable connections, by setting the selectivity period to zero, the cache server 22 will attempt to service all of the offered connections. On the other hand, if the selective connection period is set to a relatively high value, such as 100 milliseconds, the cache server 22 will likely service a connection count which is under its maximum capacity and thus spend most of its time hunting for SYNs that are on its selectivity list. Thus, a proper selective period setting should provide an optimum connection load for the cache server 22. To achieve this, the server may preferably use a successive approximation approach by first setting the selectivity period to a predetermined value, such as fifty percent of a known maximum value, and then moving it up and down until the connection load runs just slightly below the maximum period. The selectivity timer is typically set by a function 43 running in the cache server 22. 3. Connection Pass Through (33) Dynamic load shedding, also known as connection pass-through 33, is an important feature for field deployment of the redirecting cache 10 since load spikes that occur on the redirecting cache 10 increase the response time of serving connections. Under extreme load (i.e., when there are a large number of connections) , the response time may increase to a level where it is better to let connections pass through the redirecting cache 10 rather than attempt to continue to serve them.

Dynamic load shedding is needed since it is impossible to determine in advance what "load" a particular number of connections will exert on the redirecting cache 10. This depends on many factors, for example :

the number of active connections (i.e., the number performing data transfers, versus the number in an "opening" or "closing" state) ;

whether or not the connections are processing cache hits or cache misses; and

the size of the objects being served.

Thus, it is unreasonable to expect a static parameter (e.g., a simple measure of the minimum number of physical maximum serviceable connections in the cache server 22) to be sufficient to shed load appropriately.

There are three design goals for dynamic load shedding : 1) Always shed load when the cache server 22 is overloaded;

2) Very rarely shed load when the cache is not overloaded; and

3) Prioritize HTTP work over ICP work.

The main design issue is getting a control loop that a) senses load and then b) makes the right decision about whether or not to shed load.

The simplest unit of load to measure for the redirecting cache 10 is a connection. However, some connections are more expensive to service than others. For example, servicing a miss requires all the work that is done for a hit (assuming a fill-on-miss cache fill policy) and, in addition, requires writing content to disk. Thus, misses are more expensive than hits. This is seen in performance tests which show operations per second (OPS) numbers for all-hit tests that are a factor of 3 or more higher than all -miss tests. In other words, bandwidth to the cache disk array becomes a bottleneck for a miss intensive workload.

If the redirecting cache 10 can know (or guess) in advance whether or not a connection is a hit or a miss, it could choose to shed misses over hits. As an alternative, the design can shed shedding fills (i.e., the writing of HTTP objects to disk) during peak load. This is referred to as Fill Load Shedding.

The parameters which are preferred to measure in order to determine connection load are the number of "opening" (i.e., soon to be active) connections and the number of active connections (those performing data transfer) . The kernel httpd data structures track both opening connections and connections doing "useful work" in the cache. Useful work includes all data transfer for HTTP objects as well as query processing for ICP. Useful work does not include HTTP connections that are in a closing state (e.g., TCP TIME_WAIT) .

In a preferred embodiment, the metric uses a sum of opening and active connections as a control parameter to decide whether or not to invoke connection pass-through. This is referred to as Connection Load Shedding. Since evaluating whether or not to accept a new connection is an inexpensive, constant-time procedure, it can be performed by the filter logic for every potential connection.

To implement Connection Load Shedding, the kernel httpd data structures are maintained with a single function exported to the bridge filter logic 35 module. This function returns the capacity of the cache server code as follows:

int il_accept_new_connection( u_short dst_port /* dest TCP port in network byte order

capacity = available slots in HTTP (dst_port) listen queue + available idle threads in cache server return capacity;

If the capacity is zero, the connection pass through logic 33 turns away the new connection. If the capacity is greater than zero, the connection pass through logic 33 may accept the new connection. Note however that this check is independent of the offered connections parameter (or its associated check in the pass through logic 33) .

In other words, a new connection is accepted if and only if there is space in the SCT table 34 and capacity in the cache server code to accept and service a new connection. Thus, the offered connections parameter still controls the maximum number of HTTP connections that the cache server can concurrently service.

To implement Fill Load Shedding, the cache server will be configured with a parameter called MaxConcurrentFills . If the number of concurrent miss connections being serviced is above this value, subsequent miss processing will bypass filling the cache until the number of miss connection drops below this value again.

To prioritize HTTP work over ICP work, the cache server 10 is configured with an additional parameter, namely MaxConcurrentlcpQueries . Queries that are received when there are more than this many ICP queries being serviced are dropped. This does not violate the ICP specification since the protocol itself is designed to run over UDP and therefore deal with intermittent communication problems (e.g., lost packets) between peers .

There is one additional detail addressed in connection load shedding: when to accept or refuse proxy HTTP connections. This design proposes refusing proxy connections whenever the cache is disabled. To implement proxy connection load shedding, a single function is exported by the bridge filter logic 35 to a connection accept system call processing in the cache server 22 . This functions returns whether or not the cache is enabled as follows :

int il_cache_enabled ( void )

if BF . of f ered_connections > 0 return TRUE ; else return FALSE ;

The bridge filter logic 35 thus provides a connection selectivity also based on a type of applied "backpressure." The need for this functionality is due to the fact that the cache server 10 will most likely be a performance-bound by number of connections, file system performance, and MIPS. The current consensus is that the best item to focus on to control system overloading is the number total number of connections allowed by the server. Tuning the system to provide the proper connection count should allow ultimate control of all other system resources such as MIPS and memory consumption. We then assume that the number of Peak Offered Connections (Pc) on the link between Routers A 16 and B 17 is greater than the Maximum Server Connections (Mc) that the server can support. Hence,

Pc > Mc .

Another important item to consider is that places reliability of the cache server 10 is one of the highest priorities from a marketing perspective. If the bridge filter logic 35 passes all HTTP connection requests up to the IP layer for service, the result will be lost connections , sluggish performance , or worse , all of which are unacceptable design results .

The solution to this problem is connection selectivity based on backpressure in order to allow connections to " pass- through . " The basic idea is to have a finite number of connection obj ects that can be bound to atomic TCP connection transactions , allow only selected connection flows to be passed up the IP stack, and (hopefully) quickly turn away the connections that cannot be processed . This is preferably implemented as part of the determination of whether or not a free connection is available in state 110 of the bridge filter logic 35 (Fig . 3 ) . This works somewhat like a memory management system where a free list (pool , heap, . . . ) of connections is maintained in driver space . Each connection obj ect maintains a unique connection identifier (source-destination IP address and source-destination TCP port numbers) plus a variable that tracks the connection ' s state .

The specific logic that implements connection pass -through in step 110 on every connection (CONN) accepted by the network cache may be as follows :

IF number-of -miss -connections GT miss -connections -threshold THEN

PASS -THROUGH ( CONN ) ELSE IF number-of -opening-connections GT opening-connections -threshold THEN

PAS S - THROUGH ( CONN ) ELSE IF number-of -total -connections GT total -connections -threshold THEN

P ASS - THROUGH ( CONN ) ELSE

ACCEPT ( CONN )

This means this algorithm has three thresholds which are chosen to provide the client the same expected response time . These include a miss-connections threshold, an opening-connection threshold, and a total connection threshold.

At least four other parameters are needed to make this work :

^p _hιt ⁼ probability that a connection is a hit P,_1EB = 1 - Phit = probability that a connection is a miss T_hlt = response time given that a connection is a hit ^τ _mιsε = reponse time given that a connection is a miss

For the purpose of explanation, we assume that the miss-connection-threshold is unity . In practice this is either assumed and therefore set statically, or measured dynamically either by the network cache itself or by external device (s) . Thus ,

Opening-connections -threshold = 1 + P_mιss * F (T_hlt / T_mιss) total -connections -threshold = 1 + P_hlt * G (T_mss / T_hlt)

where F() and G() are normalizing functions.

There are a number of other possible bottlenecks that will be the limiting factor on system performance and therefore translate to a reduced connection count through the backpressure mechanism. A less dynamic approach is also possible. One can select a constant maximum connection count that will maintain performance in any and all load scenarios but this will likely result in performance well under possible maximums . The other possibility is a dynamic feedback mechanism that will allow the system to adjust to different load situations. However, the possibility of instability exists if the feedback mechanism is not responsive enough to respond to changing conditions quickly. For example, suppose we detect the amount of non-HTTP traffic on the line and increase the line utilization by increasing connection counts until the line is nearly fully utilized. We then suffer a sudden increase in non-http traffic but are unable to detect it and reduce the connection count before we are faced with packet loss. Every possible feedback/detection system has this potential hazard and should be considered based on system response time (time constants) as well as our ability to dampen the response by adding extra buffers to absorb bursts and sudden changes in traffic mixture profiles .

The possibilities for load detection therefore include :

* MIPS consumption: Measure (perhaps) the idle processes run time, find the minimum idle 'run' time that provides the greatest bits/second of hit count . This should be checked against varying traffic mixtures (HTTP (LH) vs non-HTTP (LO) ) as it is likely to be non-linear. If behavior is non-linear, we can maximize performance by varying the connection count as a function of mixture.

* Memory/Mbuf Consumption: Measure the average Mbuf population at IP output queues. If buffers rise above a given high water mark, we can decrease the selective connection population until the system stabilizes.

* Line Utilization: We should try to support a mechanism that supports throttling based on line utilization. If the cache is operating on a highly utilized line, we should reduce the amount of traffic created by the DynaCache appropriately.

* File System performance: The performance of the file system may be a major bottleneck in the server's ability to deliver data. Thus, for certain conditions measuring the file system may limit the number of connections that can be serviced.

We may find that in practice only one or two of the above pressure points will need to be detected and applied to the backpressure mechanism given that many of the parameters are related. For example, if line utilization is high with non-HTTP traffic, we will likely see an increase in Mbuf population and be able to ignore direct line utilization detection. (In fact this is a better possibility because it will naturally have better response time characteristics than direct line utilization detection.)

4. Performance Benchmark

Rigorous benchmark testing shows the cache architecture described herein as a high performance network-grade caching product . In this independently-conducted test, the cache server 10 performed at Fast Ethernet line speed, and readily scaled up to meet full-duplex OC-3 line speeds. The tests illustrated that: a. The redirecting cache server 10 is robust under continuous heavy load of millions of requests per hour when millions of objects are already cached, dramatically exceeding the demands of most ISP and enterprise networks. b. Sustained peak throughput meets line speed requirements for Fast Ethernet, and scales easily to line speed of full-duplex OC-3 (using a cluster) . This translates into 3400 operations per second, which is multiple times faster than any previously documented cache performance, and more than fast enough for the leading national ISPs. c. Response time for cache hits remains in the tens to low-lOOs of milliseconds throughout its entire performance range, with no degradation of performance as the load increases. In the case of a cache miss, the redirecting cache server 10 adds a negligible delay to the no-cache response time. d. The bandwidth delivered to Web clients is expanded by 22-24%, resulting in a payback, based on bandwidth savings alone, of 5 to 9 months. Other benefits, such as speeding up the web or increasing the number of customers, shorten this payback period.

The redirecting cache server 10 also exhibits other desirable characteristics:

1. Fail safety - The redirecting cache server 10 is much less likely to fail than other caches because it protects itself far better from overload. But if it does, it is the only caching solution that can automatically remove itself from the line of traffic, and does so virtually instantaneously. 2. Manageability - The redirecting cache server 10 is unique in being managed as a single transparent caching unit. Its management interface is accessible remotely via out-of-band and serial line interfaces, and through an interface indistinguishable from other network devices. The redirecting cache server 10 can also collect detailed usage and traffic data that can be used for customer billing and network performance optimization.

3. Content friendliness - Redirecting cache server 10 preserves the relationship between content-providers and users by forwarding hit and cookie information in real-time, and simultaneously checking freshness .

4. Platform Flexibility - Redirecting cache server 10 provides a platform for additional functionality and services, such as the delivery of high-quality multimedia content.

In the rest of this report, we explain the test metrics, tools and methods in comparison to other caching performance benchmarks, and analyze testing results .

The nascent web caching industry has been beset with conflicting vendor claims, using a multitude of definitions and testing methodologies. This effort included the production of a so-called PolyGraph benchmark. PolyGraph consists of a load-generator, a measurement test-bed, and testing methodology for evaluating web caches. PolyGraph combines the best elements of current practice and available research in the field, in a tool that simultaneously features high performance and great accuracy in modeling Internet characteristics .

PolyGraph is a very sophisticated test environment, which simulates both web clients and web servers. The test is designed to load the cache under test to the maximum by submitting requests at an intense and "bursty" rate, with high levels of concurrency, and by varying object sizes to mimic real objects in the field. In doing so, PolyGraph comes the closest to emulating typical high HTTP traffic patterns.

In comparison to other tools developed by caching vendors, or by researchers, PolyGraph excels in:

* Generating load in a variable, bursty fashion, just like Internet loads.

* Modeling large and variable Internet delays.

* Accurately accounting for other major phenomena such as object size variability, popularity skew, and request locality.

* Robust testing of key aspects of cache design, such as indexing and disk caching.

* Sound testing methodology that generates consistent, reproducible and predictive results.

* Web server benchmarks such as SPECWeb96, WebStone, and WebBench not only lack any support for testing web caches, but they also omit modeling many key attributes of real web traffic.

PolyGraph clients submit requests at a given average rate, with intense bursts driven by a Poisson probability distribution. The burstiness of the request rate results in load being higher than twice the average for at least 40% of the time. Stated differently, the PolyGraph request rate has a large standard deviation (which equals the average) . It is common in statistical analysis to consider the sum of the average and the standard deviation as a measure of the sustained peak, which, in PolyGraph ' s case, will occur 40% of the time. PolyGraph servers provide server side delay, again with some burstiness driven by a normal probability distribution, which is intended to model Internet and server delays. While this statistical model is not totally accurate, it is much more accurate and cache-stressing, than any other known benchmark. The rules of the benchmark require that caches also implement the TCP TIME_WAIT state for at least 60 seconds, which is consistent with best current practice in guaranteeing robustness of the underlying TCP connections against network misbehavior. Unfortunately, it is known that many cache vendors reduce this parameter drastically, in order to achieve higher performance .

The PolyGraph test-bed produces the following raw measurements of cache performance. For each, we specify (in italics) the metric we derive from it, and why.

1. Average request and response rates in HTTP operations per second (OPS) . We translate both of these metrics to sustained peak rates expressed in megabits per second by accounting for the bits required to transport the objects. This is more relevant to caching deployed as part of the bandwidth-providing network infrastructure .

2. Concurrent connections can be calculated by multiplying the response rate in OPS by 60, which is the length of time (in seconds) a connection is tracked by the cache. This assumes that the cache has honestly complied with the TCP TIME_WAIT requirement mentioned above .

3. Response time, which we present as measured, but also breakdown to its key constituents: hit response time, and latency-the delay added by the cache in processing the miss. The latter metrics characterize the cache's speed contribution independently of hit ratio and Internet delay.

4. Hit ratio. We use hit ratio to determine the bandwidth expansion factor delivered by the cache, which is one of the primary parameters in determining the payback period (how long a cache takes to pays back for its price, based on bandwidth savings alone) .

Before we delve more deeply in these aspects of performance and how they were measured, we remind the reader one more time that performance is only one attribute of many when considering deployment of caching products into complex networks. For ISP and carrier type applications, other potentially even more significant considerations include reliability, manageability, content -friendliness and serviceability.

These metrics are the correct ones, for several reasons. First of all, these load and performance metrics correspond closely to the best known traditional metrics of computer and network system performance . Second, the metrics support four fundamental requirements :

1. Load must be representative: real -world phenomena should be modeled. 2. Load must stress the caches under test : the cache must be forced to exercise all its major components (disk, RAM, CPU, NIC) .

3. Performance must be comprehensive: the cache cannot, for example, provide fast response at the cost of chewing up more bandwidth (or vice-versa) , or sacrifice TCP robustness for performance.

4. Performance must be meaningful to the end-user: performance affects end-user experience, cost, and perceived reliability (e.g., overload is tantamount to failure) .

All of these requirements are met by PolyGraph, to the maximum extent allowed by current knowledge and practice in the field. No other benchmark comes close in terms of addressing the complex web of issues involved.

Two models of the redirecting cache server 10 were submitted for testing. System 1 was a redirecting cache server 10 prototype, model IL-20OX- 14 as manufactured by InfoLibria, Inc. The 200 series is a robust, high availability configuration, designed for carrier-class applications. The cache was configured with 512 MB memory, fourteen hot-swappable 9 GB cache drives, two system/log drives with mirrors (all hot-swappable) , a message redirector 20, and redundant system power supplies. The tested configuration used three 4 port Ethernet switches to concentrate the three pairs of PolyGraph client/servers. See Fig. 5.

System 2 was a cluster of four redirecting cache server 10 prototype models IL-100-7. Each cache was configured with 512 MB memory, seven 9GB cache drives, two system/log drives, redirector 22, and a single system power supply. In this configuration, four 4 -port switches were used to concentrate the eight pairs of PolyGraph clients, while a 16 port L2 switch was used to handle the eight PolyGraph servers. See Fig. 6.

The PolyGraph tests were set up in three distinct steps as described below. A full description of these test can be found at the IRCache web site, http : //bakeoff . ircache . net/doc/tests . html .

1. No-proxy. This control experiment measures the capabilities of the test-bed itself, so that it does not end up being a performance bottleneck. Any serious testing must begin with this type of step.

2. Filling the cache. Before performance measurement begins the cache is filled to capacity. It must be able to do so within a set maximum time. This forces the cache under test to perform garbage collection under testing conditions, and exposes weaknesses in object indexing and lookup performance. No other benchmark requires this step, to our knowledge.

3. PolyMix. This is a sequence of ten nearly back-to-back hour-long runs. Runs differ in average request rate only. Otherwise, all runs subject the cache under test to a bursty load with the given average, a 55% hit ratio. Of the responses returned by PolyGraph servers, 80% are cacheable . The server also adds a delay, normally distributed, with an average of 3 seconds and standard deviation of 1.5 sec.

The graphs in Figs. 7 and 8 summarize the test results for the two systems submitted. Request and response rates are shown in megabits per second, and displayed in terms of their sustained peak values, in order to be able to draw conclusions about redirecting cache server 10 capacity. We calculate sustained peak request and response rates based on the following facts:

1. The standard deviation of the (Poisson distributed) request rate is the same as the mean, which means that Poisson is a highly variable, or bursty, distribution .

2. For 40% of the time, the request rate exceeds twice the mean (see the appendix for the derivation of this fact from the Poisson probability distribution function) .

3. In practice, the response rate is clipped at the level of maximum resource utilization. At best, this resource is the Fast Ethernet itself.

Conversely, it is incorrect to use the average alone to represent bake-off results, since the request distribution has large variance. For example, the median is larger than the mean by 44%. It is also quite common in statistics to use standard deviation as a measure of the "spread" of the values of a random variable.

The throughput graph in Fig. 7 shows that IL-200X-14 is able to keep up with HTTP loads up to Fast Ethernet line speeds without any trouble, and that the cluster of four IL-100-7 exceeds the line speed of a full-duplex OC-3. Ideal performance on this graph would be a straight line with slope 45 degrees, which is matched nearly perfectly by redirecting cache server's 10 measured performance.

The primary use of web caches is to bring content closer to users, expanding the bandwidth delivered to them beyond that which the network backbone provides, and speeding up their access to content. The more requested hits per second a cache can serve, the less bandwidth will be demanded.

The response time is the second most important characteristics of caches. Ideally, a network cache would have near- zero response time for hits, and near- zero latency on response time for misses. The cache should exhibit fast response time for every request even as the number of request per second increase . In practice, when the load is low, the response time for the cache is very fast. As the cache's load is increased, and cache is required to handle a higher number of requests per second, translating to performing a higher number of calculations and I/O's to the disk, the response time will increase. The PolyGraph had a cut off of a maximum of 6 seconds response time. If a cache reached six seconds, it failed the test. Not only is the response time important from a users perspective, but network operators need to have predicable, well behaved devices in their networks . Unpredictable behavior from a device, could produce a cascading effect on the rest of the network which could make the whole network unstable for a period of time.

The response time graph shown in Fig. 8 represents the aggregate average of hit and miss response times across the entire performance range. When redirecting cache server 10 senses the possibility of a significant increase in response time based on a user-configurable parameter, it starts passing some connections through transparently. This enables users of redirecting cache server 10 to guarantee that the cache is speeding up their network at a predictable level, by configuring redirecting cache server 10 to serve objects only at a set response time, and to pass-through all other connections .

More specifically, redirecting cache server 10s exhibit an average hit response time of about 50 milliseconds up to about 50% utilization of the link on which they are deployed, and never more than a few hundred milliseconds. Latency (in the case of a miss) ranges from negligible at or below 50% line utilization, to at most 25% at maximum line utilization of 100%.

Under network loads similar to those submitted by PolyGraph, redirecting cache server 10 pays back its purchaser, just from bandwidth savings alone, in 6-9 months. We compute this, based on an assumption of a monthly cost of $1,000 per megabit per second of bandwidth, taking care to apply only the savings obtained at peak sustained throughput. Relative to peak sustained throughput, redirecting cache server 10 expands the bandwidth delivered to clients by 22-24% above and beyond what it consumes in terms of upstream bandwidth.

5. Relationship Between Average and Sustained Peak

Here we show that the sustained peak rate is twice the mean (or average) for a Poisson-distributed random variable. The inter-arrival times of requests generated by PolyClient obey a Poisson distribution, when the average request rate is given by " --req__rate" argument. Let T be the inter-arrival time random variable, and R = l/T be the arrival rate random variable. The probability distribution function (PDF) of inter-arrival time T is given by Pr [T<t ] = 1 - exp ( - t *L) ,

where L is the average request rate. The standard deviation of T is known to be l/L, and, hence, that of R is L, i.e., the standard deviation of R equals the mean.

This is key, and means that the request rate is quite bursty. By inverting T in the above equation, we obtain the probability that the request rate R exceeds a certain value

Pr [R>r] = 1 - exp(-L/r)

Therefore, the probability that R is greater than twice its mean, is

Pr[R>2L] = 1 - exp(-L/2L)= 1 - exp(0.5) = 0.4

While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

CLAIMSWhat is claimed is:

1. A method for servicing requests for document objects in a networked computer environment where certain document server computers store the document objects and other client computers originate the requests for the document objects, the method comprising the steps of : storing, at a local cache server, copies of the document objects for which original document objects reside on document servers located at other nodes in the network; interposing a message redirector between the local cache server and the network, the message redirector having at least three communication ports, such that two ports are network ports connected to network routers, and such that a third port is a local server port connected to the local cache server, the message redirector thus interconnecting the local cache server into the network between at least one client computer and at least one server computer; and processing messages in the message redirector, at least one message so processed containing a request for a document object located at one of the document servers, the messages received on at least one of the network ports, and routing a message to the server port only if the local cache server can process the request, and otherwise passing through the message from the network port on which it was received to the other network port .

. A method as in claim 1 additionally comprising the step of: setting a network address of the local server port so that the local cache server is a transparent bridging device such that the client and document server computers do not recognize the local cache server as a network node through which message traffic must flow.

3. A method as in claim 2 wherein the step of setting a network address additionally comprises the step of: setting a Media Access Control (MAC) layer address such that the local server port appears as a MAC layer address of a router associated with one of the network ports.

4. A method as in claim 1 wherein the document objects are web pages, the cache server is a web page server, and the step of processing messages additionally comprises the step of: routing only HyperText Transfer Protocol (HTTP) type message traffic to the local cache server.

5. A method as in claim 1 additionally comprising the step of, in the message redirector: maintaining a selective connection table that includes a list of document object identifiers for document objects stored in the local cache, and associated Internet Protocol (IP) addresses for the document servers that store the corresponding original document objects.

6. A method as in claim 5 wherein the step of processing messages further comprises the step of : comparing an IP address contained in a request message to IP addresses in the selective connection table, to determine whether to route the request message to the local cache server port or to route the request message to the network port.

7. A method as in claim 5 additionally comprising the step of : sending information from the local cache server to the message redirector concerning the document objects stored in the cache server, so that the selective connection table can be kept updated as contents of the local cache server change .

8. A method as in claim 5 additionally comprising the step of, in the message redirector: maintaining an active connection table that includes a list of active connections between client computers and document servers that are presently being serviced by the local cache server.

9. A method as in claim 8 additionally comprising the steps of : examining a message type field in messages received from the network ports; and if the request message contains a packet indicating that the message is a request to create a new connection between a client computer and a document server computer, then routing the request message to the local cache server port .

0. A method as in claim 9 additionally comprising the step of : if the request message does not contain a packet indicating that the message is a request to create a new connection, then comparing the request message to the list of active connections in the active connection table, to determine whether to route the request message to the local cache server port or to a network port .

11. A method as in 9 additionally comprising the steps of: maintaining a selective connection table that includes a list of document object identifiers for document objects stored in the local cache, and associated Internet Protocol (IP) addresses for document servers that store the corresponding original document objects; and routing the connection request message to the local cache server port if an IP address in the connection request message is contained in the selective connection table.

12. A method as in claim 11 additionally comprising the step of: routing the connection request message to the local cache server port only if the selective connection table contains space to accommodate a new entry, and if the local cache server indicates that it has capacity to service a new connection.

13. A method as in claim 12 wherein the local cache server is not considered to have capacity to service a new connection if a number-of-miss- connections is greater than a miss-connection- threshold, if number-of-opening-connections-in- process is greater than an opening-connections- threshold, or if number-of-total-connections is greater than a total-connections-threshold.