WO2007138250A2 - Computer system with lock-protected queues for sending and receiving data - Google Patents


Info

Publication number: WO2007138250A2
Application number: PCT/GB2007/001821
Authority: WO (WIPO, PCT)
Prior art keywords: queue, lock, application, computer system, payload
Other languages: French (fr)
Other versions: WO2007138250A3 (en)
Inventor: David James Riddoch
Original assignee: Solarflare Communications Incorporated
Priority claimed from GB0610506A, GB0613556A, GB0613975A and GB0614220A.
Application filed by Solarflare Communications Incorporated.
Publication of WO2007138250A2 and of WO2007138250A3.

Classifications

    • G06F 9/544 (Interprogram communication: buffers, shared memory, pipes)
    • G06F 9/526 (Program synchronisation: mutual exclusion algorithms)
    • H04L 49/90 (Packet switching elements: buffering arrangements)
    • H04L 49/901 (Buffering arrangements using storage descriptors, e.g. read or write pointers)
    • H04L 49/9047 (Buffering arrangements including multiple buffers, e.g. buffer pools)
    • H04L 49/9089 (Reactions to storage capacity overflow: replacing packets in a storage arrangement, e.g. pushout)
    • H04L 49/9094 (Arrangements for simultaneous transmit and receive, e.g. simultaneous reading/writing from/to the storage element)
    • G06F 2209/542 (Indexing scheme relating to interprogram communication: intercept)
    • G06F 2209/548 (Indexing scheme relating to interprogram communication: queue)

Definitions

  • the present application relates to a computer system capable of running a plurality of processes, especially a computer system which is connected to a network, and discloses four distinct inventive concepts which are described below in Sections A to D of the description.
  • Claims 1 to 13 relate to the description in Section A
  • claims 14 to 32 relate to the description in Section B
  • claims 33 to 59 relate to the description in Section C
  • claims 60 to 72 relate to the description in Section D.
  • figures 1 to 8 relate to the description in Section A
  • figures 9 to 12 relate to the description in Section B
  • figures 13 to 15 relate to the description in Section C
  • figures 16 to 22 relate to the description in Section D.
  • Embodiments of each of the inventions described herein may include any one or more of the features described in relation to the other inventions.
  • the present invention relates to a computer system capable of running a plurality of processes, especially a computer system which is connected to a network.
  • Computer systems often operate in a network.
  • the shared lock is prone to a problem called lock contention.
  • a lock becomes contended when a process tries to obtain a lock that is already held by another process.
  • the overhead associated with lock contention reduces system performance and is best avoided.
  • In operating system kernels, it is common to use spinlocks. According to the spinlock model, a process repeatedly tries to obtain the lock until it succeeds. This works well when locks are held for short periods of time.
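  • By way of illustration only, the spinlock model just described might be sketched in C with C11 atomics as follows (this sketch is illustrative and is not taken from the patent):

        #include <stdatomic.h>

        /* A minimal spinlock sketch using C11 atomics. */
        typedef struct { atomic_int held; } spinlock_t;

        static void spin_lock(spinlock_t *l)
        {
            int expected = 0;
            /* Keep retrying the 0 -> 1 swap until it succeeds. */
            while (!atomic_compare_exchange_weak(&l->held, &expected, 1))
                expected = 0;  /* CAS overwrote 'expected' with the current value */
        }

        static void spin_unlock(spinlock_t *l)
        {
            atomic_store(&l->held, 0);
        }
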
  • this model does not work well at the user level, for example, when the contending processes are threads of user applications where the lock may be held for a considerable period of time.
  • the normal approach in this case, is to put the contending process to sleep until the current lock-holding process releases the lock.
  • This model is normally referred to as blocking.
  • An object of the present invention is to reduce the lock contention overhead when queueing items in preparation for subsequent sending over a network.
  • the present invention may provide a computer system which is capable of running a plurality of concurrent processes, the system being operable: to establish a first queue in which items related to data for sending over the network are enqueued, and to which access is governed by a lock; when a first of said processes is denied access to the first queue by the lock, to enqueue the items into a second queue to which access is not governed by the lock; and to arrange for the items in the second queue to be dequeued with the items in the first queue.
  • the present invention avoids the above-mentioned overheads associated with the spinlock or blocking lock contention handling models. Further, by arranging it such that items on the second queue are handled together with items in the first queue, the present invention ensures that the items in the second queue are processed in a timely fashion.
  • the system is operable to integrate items from the second queue into the first queue.
  • the items in the second queue will be dequeued by the system as though they had been enqueued in the first queue from the beginning. Integration may be achieved by linking the first queue and the second queue together.
  • the items in the second queue may be dequeued from the second queue and moved to the first queue.
  • the second queue comprises a data structure facilitating access by concurrent processes. In this way, the integrity of the second queue can be maintained even after it has been integrated into the first queue and might be subject to concurrent manipulation by the first process enqueueing items and another process dequeueing items from the first queue.
  • the second queue comprises a linked list having a head, to which items are added and from which they are removed by atomic instructions. In other embodiments, the second queue may comprise a circular buffer having an input pointer pointing to where items are entered into the buffer, and an output pointer pointing to where items are removed from the buffer, wherein the input and output pointers are prevented from crossing one another.
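  • For illustration, a minimal sketch in C of such a lock-free enqueue at the head of the second queue, using a compare-and-swap loop (the type and function names are assumptions, not the patent's):

        #include <stdatomic.h>
        #include <stddef.h>

        struct item {
            struct item *next;
            /* ...descriptor of the data to send (illustrative)... */
        };

        /* Enqueue at the head of the second queue with a CAS loop, so no
         * lock is needed and a concurrent reader never sees a torn update. */
        static void prequeue_push(_Atomic(struct item *) *head, struct item *it)
        {
            struct item *old = atomic_load(head);
            do {
                it->next = old;  /* link behind the current head */
            } while (!atomic_compare_exchange_weak(head, &old, it));
        }
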
  • Linked lists are suitable structures by which to implement the first and second queues, and preferably the first queue is linked to the second queue by arranging for a pointer of the first queue to point to the second queue, whereby the first and second queues form a single linked list structure.
  • the system comprises means for registering the existence of the second queue, and is further operable, after completing the second queue, to register the existence of the second queue with the registering means if the lock is held by a said process other than the first process.
  • This registration provides the mechanism through which the need to process the second queue can be communicated, and thus delegated, by the first process to a said process other than the first process. However, if, after completing the second queue, it turns out that the lock is no longer being held and is grabbed by the first process, the first process may itself dequeue for sending the items in the second queue. In such a case, no registration of the second queue takes place.
  • the lock is a single unit of memory which is manipulated atomically.
  • the unit of memory can be a single word, or multiple words, if the processor architecture supports atomic manipulation of multiple words.
  • the registering means may include bits of the lock. This is advantageous because it enables lock manipulation and the determination of whether there exist any second queues for processing to be carried out in the same operation.
  • In those bits of the lock allocated to the registering means, the head of a linked list may be stored.
  • The linked list may comprise items, each item referring to a socket which has formed a said second queue.
  • the system is further operable when a process is about to release a lock to check whether there are any second queues to be dequeued for sending.
  • This may be achieved by the registering means where the existence of second queues is logged.
  • the operation of checking for the existence of any registered second queues should be atomic with respect to releasing the lock, i.e. it should not be possible to release the lock if there is a registered second queue.
  • Said items can include the data itself, especially when the data volume is small, but preferably comprise a pointer to a buffer in, for example, an application's address space where the data is actually held. This saves the overhead of moving the data around during enqueueing.
  • the present invention may provide a computer program for sending data to the network interface of a computer system which is capable of running a plurality of concurrent processes, the computer program being operable to establish a first queue in which items related to data for sending over the network interface are enqueued, and to which access is governed by a lock; and when a said process is denied access to the first queue by the lock, to enqueue the items for sending into a second queue to which access is not governed by said lock, wherein the computer program is further operable to arrange for the items in the second queue to be dequeued with the items in the first queue.
  • the present invention may provide a data carrier bearing the above-mentioned computer program.
  • the present invention may provide a computer system running a plurality of processes, comprising a first queue in which items related to data for sending over a network are enqueued; a lock by which access to the first queue is governed; and a second queue to which access is not governed by the lock; wherein the system is operating such that when a said process is denied access to the first queue by the lock, the data item for sending is enqueued in the second queue, and items in the second queue are dequeued with items in the first queue.
  • Figure 1 shows a hardware overview of an embodiment of the invention
  • Figure 2 shows an overview of various portions of state and an associated lock in an embodiment of the invention;
  • Figures 3(a) and 3(b) show algorithms in accordance with an embodiment of the invention;
  • Figures 4 to 6 show an overview of various data structures established by an embodiment of the invention operating according to the algorithms of Figures 3(a) and 3(b);
  • Figure 7 shows the bit structure of the netif lock
  • Figure 8 shows the structure of a queue formed in accordance with an embodiment of the invention.
  • a computer system 10 in the form of a personal computer comprises a central processing unit 15 and a memory 20, and is connected to a network 35 by an Ethernet connection 30.
  • the Ethernet connection is managed by a network interface card 25 which supports the physical and hardware requirements of the Ethernet system.
  • the physical hardware implementation need not be that of a card: for instance, it could be in the form of an integrated circuit and connector mounted directly on a motherboard.
  • FIG. 2 shows the system state when a number of application execution threads 40, 42, 44 have been initiated, and a number of TCP or UDP sockets 46, 48 have been opened up by the applications to communicate with the network interface card 25. Because the sockets 46, 48 can be accessed by more than one thread, some of the state for each socket is protected by a socket-specific lock 52, 54 known as a sock-lock. The sock-locks protect aspects of the operation of the sockets which are independent of other parts of the system. In the drawings, these portions of state are represented diagrammatically by the regions 56, 58. Also, each socket has portions of state all of which are protected by a single shared lock, hereinafter referred to as the network interface or netif lock 60. In the drawings, these portions of state are represented diagrammatically by the regions 62, 64. In the embodiment, the network interface lock 60 also protects, as the name suggests, various portions of state related to access to the network interface.
  • the netif lock is implemented as a single word in memory so that it may be manipulated by atomic instructions.
  • Some bits 60a are used to indicate whether it is locked or not.
  • Some bits 60b are used to indicate whether it is contended (i.e. whether other threads are waiting for the lock).
  • Some bits 60c are used to request special actions when the netif is unlocked e.g. via callbacks, as described in the applicant's co-pending patent application GB0504987.9, which is incorporated herein by reference.
  • a set of bits 60d are used to implement a list or register of deferred sockets which is described in more detail below.
  • the thread 44 allocates buffers for the data to be sent in the application address space.
  • the data to be sent need not be at the user level.
  • it fills those buffers with data for sending. It will be noted that steps 110, 112 are independent of other parts of the system and so there is no need to obtain access by means of a lock.
  • the thread 44 attempts to grab the netif lock 60 using an atomic compare-and-swap (CAS) operation.
  • the CAS operation compares the bits 60a of the netif lock which are indicative of whether it is locked or not with a predetermined set of bits which represent the lock not being held; if the compared bits are the same, then the bits are swapped for another set of bits indicative of the netif lock being held by the thread 44. Else, if the compared bits are different, indicating the netif lock is already held by another thread, no swap operation is performed.
  • the use of a netif lock comprising only one word and manipulated by an atomic operation guarantees that only one thread can hold the netif lock at one time.
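  • A sketch in C of such a one-word try-lock is given below; the bit mask is an assumed layout, since the text does not specify bit positions:

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdint.h>

        /* Assumed layout of the one-word netif lock (illustrative only). */
        #define NETIF_LOCKED  0x1u   /* bits 60a: is the lock held? */

        static bool netif_trylock(_Atomic uint32_t *lock)
        {
            uint32_t old = atomic_load(lock);
            if (old & NETIF_LOCKED)
                return false;  /* already held by another thread: no swap */
            /* Swap in the 'locked' bit only if the word is still unchanged. */
            return atomic_compare_exchange_strong(lock, &old, old | NETIF_LOCKED);
        }
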
  • If the thread 44 found the netif lock 60 in an unheld condition and took possession of it at step 114, it moves on to step 116, where it enqueues the items in the send queue 70 in the socket 48 which it is using.
  • Each queue item 72a-d comprises a pointer field 74 which points to the next item in the queue, a field 75 indicating the length of the data for sending, and a field 76 indicating its start address in the application address space 77.
  • The use of IOVEC pointers means that the data itself, the volume of which might be quite high, need not be moved while the send queue is being formed.
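  • The item layout of Figure 8 might be sketched in C as follows (field names are illustrative, not the patent's):

        #include <stddef.h>

        /* Sketch of the send-queue item of Figure 8: a next pointer
         * (field 74), a length (field 75) and the data's start address
         * in the application address space (field 76). */
        struct send_item {
            struct send_item *next;  /* field 74: next item in the queue */
            size_t            len;   /* field 75: length of the data     */
            void             *base;  /* field 76: start address of data  */
        };
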
  • At step 124, the thread obtains the sock-lock 54 and establishes a send prequeue 85 as illustrated in Figure 5.
  • the send prequeue 85 comprises a send prequeue head pointer 86 and can be the same basic structure as the send queue, but differs in that access to the send prequeue is governed by a sock-lock, in this particular case the sock-lock 54 associated with the socket 48.
  • enqueueing of the data items may be accomplished by an IOVEC pointer structure as shown in Figure 8.
  • Although the send prequeue 85 is generally protected by a sock-lock, it is possible that, as will be described later, it will be dequeued by a netif-lock-holding thread which will ignore the sock-lock.
  • the location to which send prequeue head pointer 86 points is manipulated using an atomic instruction, which means that a thread enqueueing data items into the queue need not synchronize with a thread dequeueing data items from the queue.
  • the thread 44, at step 126, performs an atomic CAS operation on the bits 60a. If the netif lock 60 was dropped during the formation of the send prequeue 85, then it is grabbed and the socket 48 is not registered as a deferred socket.
  • the send prequeue is integrated into the send queue and the sock-lock is released.
  • the integration is carried out by transferring the linked list structure itself into the send queue, i.e. by removing items from the send prequeue 85 and transferring those items to the send queue 70.
  • the pointer at the end of the send queue is simply made equal to the send prequeue head pointer 86, whereby the send queue and the send prequeues are effectively concatenated.
  • the send queue head pointer 71 is simply made equal to the send prequeue head pointer 86.
  • the send queue, including the linked send prequeue, is dequeued and transmitted onto the network. Because, at step 130, no deference is paid to any sock-lock, it is essential, as mentioned previously, that the head of the send prequeue is manipulated atomically, because while items are being dequeued and sent over the network, another thread could be enqueueing more items.
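  • The splicing of the prequeue into the send queue might be sketched in C as below. Note that this is a simplified variant which atomically detaches the whole prequeue, whereas the embodiment described above instead points the send queue's tail at the live prequeue head; struct send_item is as sketched earlier:

        #include <stdatomic.h>
        #include <stddef.h>

        /* Splice the prequeue into the send queue (simplified sketch). */
        static void splice_prequeue(struct send_item **sq_head,
                                    struct send_item **sq_tail,
                                    _Atomic(struct send_item *) *pq_head)
        {
            /* Detach atomically: concurrent enqueuers start a new list. */
            struct send_item *pq = atomic_exchange(pq_head, NULL);
            if (pq == NULL)
                return;
            if (*sq_head == NULL)
                *sq_head = pq;          /* empty send queue: adopt prequeue */
            else
                (*sq_tail)->next = pq;  /* otherwise concatenate at the tail */
            while (pq->next != NULL)    /* walk to the new end of the queue  */
                pq = pq->next;
            *sq_tail = pq;
        }
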
  • the socket 48 is registered as a deferred socket.
  • the register or list of deferred sockets is constructed as a linked list 90 comprising a head item formed by the bits 60d of the netif lock 60 and linked items 91a, 91b.
  • Each item 60d, 91a, 91b comprises a pointer pointing to a socket which has formed a send prequeue which was not able to be immediately sent.
  • only sockets 46, 48 are present and, therefore, the socket 48 is registered as the first and head item in the register 90 in bits 60d.
  • the socket 48 would be registered in one of the linked items 91a, 91b. In this case, the thread 44, having obtained the sock-lock 54 at step 124, still holds it. Holding the sock-lock before registering the socket as deferred is important as it ensures that the socket registration takes place only once. For this reason, in other implementations, where the send prequeue is not protected by a sock-lock, it is necessary to grab the sock-lock just before registration. After registration, the thread continues with other tasks. The fact that, as here, the socket is registered as a deferred socket only when the netif lock is being held by another thread is a necessary condition for this embodiment of the invention to operate properly.
  • the thread 44 need not wait in limbo until the netif lock is available again, but the act of registering the socket delegates the handling of the send prequeue to the current lock holding thread, i.e. thread 40. This is because a thread is never allowed to drop the netif lock when there are sockets registered as deferred. So after the thread 40 has dequeued its send queue 80 and sent the data over the network, it makes a check for any sockets which have been registered as deferred.
  • Figure 3(b) shows the algorithm which is used whenever a thread wants to drop the netif lock.
  • the thread performs, by means of a single atomic instruction, a comparison between the bits 60c and bits 60d to check whether they are all set such that there are no callbacks registered to be performed amongst the bits 60c, and there are no sockets registered in the deferred sockets list 90. If there are no callbacks or deferred sockets registered, the netif lock 60 is dropped (still step 140).
  • Otherwise, the thread enters a slow path (step 144), and checks individually which of the bits 60c, 60d indicate that action is required and attends to the actions which need to be done before attempting again to drop the netif lock at step 140.
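  • A sketch in C of this unlock path follows; the masks and the slow-path helper are assumptions made for illustration:

        #include <stdatomic.h>
        #include <stdint.h>

        /* Assumed masks; the text gives no concrete bit positions. */
        #define NETIF_LOCKED   0x1u   /* bits 60a */
        #define NETIF_CONTEND  0x2u   /* bits 60b */
        #define NETIF_ACTIONS  (~(NETIF_LOCKED | NETIF_CONTEND)) /* 60c + 60d */

        void netif_slow_path(_Atomic uint32_t *lock);  /* step 144, hypothetical */

        /* Figure 3(b), sketched: the lock may only be dropped if, in the
         * same atomic step, no callbacks and no deferred sockets are
         * registered. */
        static void netif_unlock(_Atomic uint32_t *lock)
        {
            for (;;) {
                uint32_t old = atomic_load(lock);
                if ((old & NETIF_ACTIONS) == 0) {
                    /* Fast path (step 140): clear the lock bit atomically. */
                    if (atomic_compare_exchange_strong(lock, &old,
                                                       old & ~NETIF_LOCKED))
                        return;
                } else {
                    netif_slow_path(lock);  /* attend to callbacks/prequeues */
                }
            }
        }
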
  • the socket 48 is registered as a deferred socket, and so the thread 40 transfers the prequeue 85 into its send queue 70. Again, this is done simply by making the send queue head pointer 71 equal to the send prequeue head pointer 86, as shown in Figure 6.
  • the send queue 70 may be dequeued and the data sent to the network 30 according to the algorithms of the transport protocol.
  • the network interface lock 60 has been used as the single shared lock protecting the send queues 70, 80, but in other embodiments, another shared lock unconnected with the role of protecting access to the network interface and which may be dedicated to protecting the send queues may be used instead. It will be understood by the skilled person that, as long as the role of protecting the send queues is provided, it is not material to the invention whether the shared lock is also used to protect access to any other shared resources.
  • the present invention relates to a computer system having a network interface and which is capable of running a plurality of concurrent processes.
  • the application When data is received from the network interface, it can be delivered to a receive buffer of a destination application using a variety of receive models.
  • a received packet is de-multiplexed to an appropriate socket and the payload is delivered to the receive queue for that socket. Then, at some later time, the application transfers the payload to its receive buffer. In the case where the application requires the payload before it has been received, the application may choose to block until it arrives.
  • the application may pre-allocate an application receive buffer to which subsequently received payload should be delivered. Normally, if a receive buffer has been allocated, then received payload is delivered directly to the receive buffer. If a receive buffer has not been allocated, then the payload is enqueued on the receive queue. Often, an application will allocate many receive buffers for incoming payload, and so descriptors for the allocated receive buffers are stored in a queue.
  • the state of a socket may be directly manipulated by processes operating in multiple address spaces, including the address spaces of the operating system kernel and one or more others for user-level processes.
  • the process which is handling the receipt of an incoming packet and operating in one context may not be able to access the application receive buffers which may reside in another address space.
  • the present invention may provide a computer system having a network interface and capable of running a plurality of concurrent processes, the system being arranged to
  • the system being operable when a said process, holding the first lock, processes incoming payload to
  • a first-lock-holding process may fail to take possession of the second lock and so be prevented from loading the payload directly into an available application receive buffer, but, by setting the control flag it ensures that another process holding the second lock is signaled that this work needs to be carried out.
  • the payload can nonetheless be enqueued without delay on the first queue.
  • a process, on taking possession of only the second lock, is empowered to dequeue the payload from the first queue and transfer it to an application receive buffer.
  • the incoming payload can be enqueued on the first queue by default. Alternatively, the payload is enqueued on the first queue only when no application receive buffer descriptor is specified in the second queue or the said process fails to obtain the second lock.
  • the system is further operable such that the another said process, in response to the control flag being set and when holding the second lock, dequeues payload from the first queue and transfers it to an application receive buffer specified in the second queue.
  • the said another process which is signaled by the control flag and which then goes on to dequeue the payload from the first queue may be the same process which initially set the control flag. For example, this might happen when the process initially tries to grab the second lock, but fails, sets the control flag and goes on to perform a series of further operations. Then, just before releasing the first lock, it tries one final time to grab the second lock. If this time it is successful, because during the performance of the further operations the second lock was dropped by another process, the process will be able to take care of dequeueing the payload from the first queue itself.
  • the attempt to obtain the second lock, and upon failing, setting the control flag, is performed by an atomic instruction, for example, a compare-and-swap instruction.
  • bits implementing the second lock and the control flag reside in the same word of memory.
  • the present invention may provide a computer system having a network interface and running a plurality of concurrent processes, the system
  • the system determines from the second queue whether an application receive buffer is available for the payload; and if an application receive buffer is available, attempts to take possession of the second lock, and if the attempt fails, sets a control flag as a signal to another said process.
  • the present invention may provide a computer program for a computer system having a network interface and which is capable of running a plurality of concurrent processes, the computer program being operable
  • the present invention may provide a data carrier for the above computer program.
  • the present invention may provide a computer system having a network interface and capable of running a plurality of concurrent processes in a plurality of address spaces, the system being arranged to establish a receive queue structure comprising at least one pointer to an application receive buffer, the system being operable when a said process processes incoming payload to
  • a process is able to ensure that payload destined for an application receive buffer reaches that destination despite the fact that the pointer to the application receive buffer is valid only in a different address space.
  • the system is operable, upon determining that said application receive buffer is not accessible in the current address space, to make a system call to the kernel context in order to load the payload into the application receive buffer.
  • any address space can be accessed provided the page tables for the address space are known.
  • information about the address space in which the pointer(s) is valid is stored either together with the pointer(s) or in the state of a socket with which the application receive buffer is associated.
  • a reference to the page tables for the socket is stored in a kernel-private portion of the socket state.
  • a pointer may not successfully resolve to an application receive buffer even though it is valid. This may happen because the memory has been paged out to disk, or because a physical memory page has not yet been allocated.
  • the system is further operable, if a process fails to address an application receive buffer, to enqueue the payload in the receive queue structure and set a control flag as a signal to another process that some payload in the receive queue structure needs to be moved to an application receive buffer.
  • the system is operable, upon determining that said application receive buffer is not accessible in the current address space, to arrange for a thread in the appropriate address space to run in order that the payload may be loaded into the application receive buffer.
  • this may be achieved by scheduling an Asynchronous Procedure Call (APC).
  • Said receive queue structure preferably comprises a first receive queue on which received payload can be enqueued, and a second receive queue on which descriptors for application receive buffers can be enqueued.
  • the present invention may provide a computer system having a network interface and running a plurality of concurrent processes in a plurality of address spaces, the system establishing a receive queue structure comprising at least one pointer to an application receive buffer, wherein, when a said process processes incoming payload, the system identifies from the receive queue structure an application receive buffer for the payload; determines whether said application receive buffer is accessible in the current address space; and, if it is not, arranges for loading of the payload into the application receive buffer to take place in a context in which said application receive buffer is accessible.
  • the present invention may provide a computer program for a computer system having a network interface which is capable of running a plurality of concurrent processes in a plurality of address spaces, the program being arranged to establish a receive queue structure comprising at least one pointer to an application receive buffer, the computer program being operable when a said process processes incoming payload to
  • the present invention may provide a data carrier for the above computer program.
  • Figure 9 shows an overview of hardware suitable for performing the invention.
  • Figure 10 shows an overview of various portions of state in a first embodiment of the invention
  • Figure 11 shows an algorithm in accordance with the invention.
  • Figure 12 shows an overview of various portions of state in a second embodiment of the invention.
  • a computer system 10 in the form of a personal computer comprises a central processing unit 15 and a memory 20, and is connected to a network 35 by an Ethernet connection 30.
  • the Ethernet connection is managed by a network interface card 25 which supports the physical and hardware requirements of the Ethernet system.
  • the physical hardware implementation need not be that of a card: for instance, it could be in the form of an integrated circuit and connector mounted directly on a motherboard.
  • Figure 10 shows the system state of a first embodiment of the invention.
  • an application execution thread 40 has been initiated and a TCP socket 50 has been opened up enabling the application to communicate via the network interface card 25 over the network.
  • Processing of data at the TCP layer involves two linked list queue structures: a receive queue (RQ) 70 and an asynchronous receive queue (ARQ) 80, both of which are first-in first-out (FIFO).
  • the RQ 70 comprises a plurality of items 71 in which each item 71 references a block of data after TCP processing.
  • Each item 71 comprises a pointer portion 71a which points to the start of the data block, and block-length portion 71b giving the length of the block.
  • the memory region where blocks of data are stored in buffers after TCP processing is designated 95.
  • the ARQ 80 comprises a plurality of items 81 in which each item 81 references an application receive buffer which the application 40 has pre-allocated for incoming data, for example, when an application invokes an asynchronous (or overlapped) receive request.
  • Each item 81 consists of an application receive buffer descriptor comprising a pointer portion 81a which points to the start of the buffer and a buffer-length portion 81b defining the length of the buffer.
  • the memory region which may be allocated for buffers for incoming data is designated 42. In other embodiments, queue structures other than linked lists may be used.
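  • The RQ and ARQ items might be sketched in C as follows (field names are illustrative, not the patent's):

        #include <stddef.h>

        /* Item 71 on the RQ references a post-TCP-processing data block. */
        struct rq_item {
            struct rq_item *next;
            void           *block;      /* 71a: start of the data block */
            size_t          block_len;  /* 71b: length of the block     */
        };

        /* Item 81 on the ARQ is an application receive buffer descriptor. */
        struct arq_item {
            struct arq_item *next;
            void            *buf;       /* 81a: start of the app buffer */
            size_t           buf_len;   /* 81b: length of the buffer    */
        };
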
  • the first lock 62 is a shared lock which is widely used to protect various portions of the system state.
  • this lock is referred to as the network interface lock or netif lock.
  • the second lock 52 is a lock which is dedicated to protecting certain portions of the state of the socket 50.
  • this lock is referred to as a sock-lock.
  • the right to put an item 71 onto the RQ 70 is governed by the netif lock 62 and this right is denoted in Figure 10 by the arrow 62-P.
  • the right to remove/get an item from the RQ 70 is governed by the sock-lock 52 and this right is denoted in Figure 10 by the arrow 52-G. Any process which enqueues an item on the RQ 70 has first to take possession of the netif lock 62, whereas a process which dequeues an item has first to take possession of the sock-lock 52.
  • the right to both put an item 71 onto the ARQ 80 and the right to remove/get an item from the ARQ 80 are governed by the sock-lock 52 and these rights are denoted in Figure 10 by the arrows 52-P and 52-G, respectively.
  • any application which seeks to enqueue an application receive buffer descriptor on the ARQ 80 has first to take possession of the sock-lock 52, and similarly any process which dequeues an item has also first to take possession of the sock-lock 52.
  • a drain bit 58 is included within the socket state. From time to time, the system makes a check to see whether there is a receive event which is ready for processing and queueing. If there is, then the algorithm shown in Figure 11 is carried out. In this case, it is assumed that a process 68 acts as the receive process and that it, at this point, is already in possession of the netif lock 62.
  • the process 68 can be a user-level process, a kernel thread or an interrupt service routine.
  • the TCP layer protocol processing and de-multiplexing is carried out, and the post TCP layer processing payload/data block 93 is stored in a memory space 95.
  • the data block 93 is enqueued on the RQ 70 by adding an item 71 onto the RQ 70 which references the data block 93.
  • a check is made to see whether the ARQ 80 contains any items 81. If there are buffer descriptors in the ARQ 80, this means that the application 40 has allocated some buffers for incoming data, and so the receive process 68 tries, at step 106, to grab the sock-lock 52. If there are no buffer descriptors in the ARQ 80, this means that there are no application receive buffers yet allocated, and so the receive process 68, having already deposited the payload in the RQ 70, moves on to further tasks or finishes as the case may be.
  • If the receive process 68 succeeds in taking possession of the sock-lock 52 without blocking, it, at step 108, performs a so-called 'drain down' operation, in which data blocks referenced in the RQ 70 (and actually stored in the memory region 95) are transferred to the buffers listed in the ARQ 80, i.e. to the memory region 42.
  • the single action of taking possession of the sock-lock 52 empowers the receive process to invoke the drain down operation, which requires dequeueing rights for both the RQ 70 and ARQ 80. Filled buffers are either removed from the ARQ 80 or their buffer-length portion 81b adjusted. At the end of the drain down operation, notification may be made to the application 40 that the operation has occurred.
  • If the receive process 68 fails to take possession of the sock-lock 52, then, at step 110, the drain bit 58 is set. This is the instant shown in Figure 10, as the application process thread 40 holds the sock-lock 52.
  • the attempt to grab the sock-lock 52 at step 106, and the setting of the drain bit 58 at step 110, are atomic.
  • a word of memory can contain bits serving as the sock-lock 52 and a bit serving as the drain bit 58, and the steps 108, 110 can be performed using an atomic compare-and-swap operation.
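  • By way of illustration, such a combined operation might be sketched in C as follows (the bit positions within the word are assumptions):

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdint.h>

        /* Assumed bit positions within the shared word (illustrative). */
        #define SOCK_LOCKED  0x1u
        #define DRAIN_BIT    0x2u

        /* Steps 106/110 in a single atomic step: take the sock-lock if it
         * is free, otherwise set the drain bit so that the current holder
         * will perform the drain down.  Returns true if the lock was
         * obtained. */
        static bool sock_trylock_or_set_drain(_Atomic uint32_t *word)
        {
            uint32_t old = atomic_load(word);
            for (;;) {
                uint32_t desired = (old & SOCK_LOCKED) ? (old | DRAIN_BIT)
                                                       : (old | SOCK_LOCKED);
                if (atomic_compare_exchange_weak(word, &old, desired))
                    return (old & SOCK_LOCKED) == 0;
            }
        }
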
  • the process 68 then goes about other processing actions, and the set drain bit 58 serves as a signal to another process that the socket 50 needs attention, specifically that a drain down operation needs to be performed.
  • This technique of delegating a required action from one process to another was described in the applicant's co- pending patent application GB0504987.9, which is incorporated herein by reference.
  • the check, at step 104, to determine whether the ARQ 80 is empty or not, can be carried out before incoming payload is enqueued in the RQ 70 (at step 102). Thus, when the ARQ 80 is non-empty the RQ 70 can be completely bypassed.
  • Figure 12 shows the system state of a second embodiment of the invention operating on a Linux™ operating system in a multiple address space environment, including a kernel context and at least one user-level context.
  • the second embodiment is substantially the same as the first embodiment, except that it includes features, discussed hereinafter, to handle the multiple address space environment.
  • the RQ 70 and ARQ 80 reside in shared memory that is directly accessible in at least two, and possibly all, of the multiple address spaces.
  • an address space is associated with the socket.
  • Each address space is allocated an address space tag which uniquely corresponds to a single address space.
  • the address space tag for the socket is stored in the shared socket state and is designated by reference numeral 56.
  • a reference to the page tables for the address space associated with the socket 50 is stored in kernel-private buffer 60 rather than in the shared socket state to ensure that it cannot be corrupted by user-level processes as that would be a hazard to system security.
  • the second embodiment operates similarly to the first embodiment and essentially performs the Figure 11 algorithm. However, before performing step 108, where the receive process 68 is required to write to the application receive buffer 42, the process 68 compares the address space tag 56 of the socket with that of the current address space. If they do not match (and the process is not executing in the kernel context), then the process 68 cannot address the application receive buffer 42 because the pointer portion 81a of the application receive buffer descriptor will only validly resolve to the correct address within the same address space.
  • any address space for any process can be accessed provided the page tables for the address space are known. Therefore, the task of loading the application receive buffer 42 is passed to a kernel context routine which, using the page tables in the kernel-private buffer 60 and standard operating system routines, is able to resolve the relevant pointer 81a to the correct address.
  • a pointer may not successfully resolve to a buffer even though it is valid. This may happen because the memory has been paged out to disk, or because a physical memory page has not yet been allocated. In such circumstances, it is not possible for the process 68 to access the buffer. Instead, the data block is added to the RQ 70 and the system ensures that the application thread 40 (or another thread using the same address space) is awoken, and subsequently performs a drain-down operation. The normal paging mechanisms of the operating system will make the application receive buffer available in this case.
  • the present invention relates to a computer system operating a user-level stack.
  • an application running on a computer system communicates over a network by opening up a socket.
  • the socket connects the application to remote entities by means of a network protocol, for example, TCP/IP.
  • the application can send and receive TCP/IP messages by invoking the operating system's networking functionality via system calls which cause the messages to be transported across the network. System calls cause the CPU to switch to a privileged level and start executing routines in the operating system.
  • An alternative approach is to use an architecture in which at least some of the networking functionality, including the stack implementation, is performed at the user level, for example, as described in the applicant's co-pending PCT applications WO 2004/079981 and WO 2005/104475.
  • Some I/O synchronisation mechanisms involve the application directly interrogating sockets, which it specifies, to obtain I/O status information for the specified sockets.
  • the interrogation is performed by operating system routines which are invoked by a system call.
  • the application specifies the sockets of interest to it in the argument of the system call.
  • I/O synchronisation mechanisms involve the use of I/O synchronisation objects, such as, for example, I/O completion ports in Windows™.
  • I/O synchronisation objects can be associated with one or more sockets and later interrogated by the application to provide I/O status information on the overlapped I/O operations on the associated socket(s).
  • the creation, updating and interrogation of the I/O synchronisation objects is performed by operating system routines which are invoked by a system call.
  • the application specifies the I/O synchronisation object of interest to it in the argument of the system call.
  • I/O synchronisation mechanisms are used with a user-level-stack architecture.
  • Because the operating system is not operating the stack itself, it is blind, in some systems, to the data traffic passing through a particular socket.
  • the user-level stack has the responsibility of keeping the I/O synchronisation object updated as appropriate.
  • this updating is performed by a system call and is, therefore, expensive and should not be performed when it is not necessary.
  • interrupts are generated by incoming events in order to allow prompt updating of the stack.
  • an interrupt incurs a particularly heavy overhead and so interrupts may be selectively enabled. While selective enablement of interrupts is beneficial in terms of overall system performance, at any given instant, there is a danger that an application requesting I/O status information from an I/O synchronisation mechanism via a system call may be given a misleading result.
  • the present invention may provide a computer system capable of operating in a network and being arranged to establish a user-level stack, the system, when an application associates an I/O synchronisation object with a socket, being responsive to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
  • the association of an I/O synchronisation object with the socket is recorded in its user-level state.
  • the system is configured to direct said system call made by the application to a user-level routine, in which the recording of the said association may take place.
  • I/O synchronisation object comprises an I/O completion port
  • the system call CreateIoCompletionPort()
  • the system call may also serve to associate the I/O completion port with another file object.
  • the I/O synchronisation object may be created by one system call and associated with a socket or other file object by a separate system call.
  • the present invention may provide a computer system capable of operating in a network and being arranged to establish a user-level stack, the system being configured to direct a system call, made by an application, requesting I/O status information to a user-level routine which is operable to update a user-level stack;
  • Configuring the system to direct an application's system call to a user-level routine provides an opportunity for the user-level routine to update a user-level stack before going on to make a system call mimicking or duplicating that made by the application.
  • this aspect of the invention can reduce the likelihood that misleading information will be returned to the application.
  • the user-level routine updates a user-level stack which has relevance to the request from the application.
  • This is particularly advantageous in systems with more than one stack, since it may be beneficial to perform updates on only those one or more stacks which have relevance to the information requested by the application.
  • the relevance of a particular stack to the request from the application can be ascertained from the sockets specified in the application's request.
  • the I/O synchronisation mechanisms involve I/O synchronisation objects
  • the relevance of a particular stack can be ascertained from the one or more sockets with which the I/O synchronisation object is associated.
  • the user-level routine updates all the associated user-level stacks.
  • the user-level routine can update all the user-level stacks regardless of their number and relevance to the request made by the application.
  • the nature of the I/O status information may vary.
  • the I/O status information comprises event-based information, for example, information about I/O operations which have completed
  • the I/O status information may comprise state-based information, for example, information about whether or not the socket has received data available for reading.
  • the I/O synchronisation mechanism comprises an I/O synchronisation object associated with a said user-level stack.
  • I/O synchronisation object comprises an I/O completion port
  • the system call GetQueuedCompletionStatus() returns a list of completed I/O operations.
  • the user-level routine is operable, based on certain operating conditions, to make a determination as to whether it is currently opportune to update the user-level stack, and to update the user-level stack only when it is determined to be opportune.
  • the system call may be made without updating the stack.
  • the determination may include a check as to whether there is any data awaiting processing by the user-level stack. In some cases there may not be, and so there is no point in arranging for the stack to be specially updated.
  • the system may comprise a lock that governs the right to update the user-level stack, wherein the determination may include a check as to whether the lock is locked. It is preferred that, if the lock cannot be obtained without blocking, meaning that it is already locked, i.e. held by another process thread, then similarly the system call should be made without updating the stack. In this case, it is possible that the process which is currently holding the lock may attend to the updating of the stack. It is desirable in a user-level stack to avoid the heavy overhead incurred by interrupts. However, it is at times desirable to enable interrupts because a user-level process may not be available to update the stack itself, for example, because it is blocked. Accordingly, it is preferred that the interrupts for the user-level stack are selectively enabled.
  • a flag may be used to store the enablement status of the interrupts for the user-level stack.
  • said interrupts are not enabled when, during the stack updating, a process thread was awoken. It is advantageous not to enable interrupts in this case, since that thread will probably take care of updating the stack.
  • said interrupts are not enabled, if the lock was locked. Again, another process thread may well take care of updating the stack.
  • said interrupts are not enabled, if the said system call made by the application was non-blocking.
  • the determination of whether it is opportune to update the stack includes a check as to whether the interrupts are enabled. If they are enabled, then the stack is not updated.
  • the system is configured to direct the system call made by the application to the user-level routine using a dll interception mechanism which is discussed hereinafter.
  • the present invention may provide a computer system operating in a network and providing a user-level stack, the system, when an application associates an I/O synchronisation object with a socket, responding to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
  • the present invention may provide a computer program for a computer system capable of operating in a network and being arranged to establish a user-level stack, the program, when an application associates an I/O synchronisation object with a socket, being responsive to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user- level process.
  • the present invention may provide a data carrier bearing the above computer program.
  • the present invention may provide a computer system operating in a network and providing a user-level stack, the system directing a system call, made by an application, requesting I/O status information to a user-level routine which
  • the present invention may provide a computer program for a computer system capable of operating in a network and being arranged to establish a user-level stack, the system being configured to direct a system call, made by an application, requesting I/O status information to the program which runs at the user level and which is operable to
  • the present invention may provide a data carrier bearing the above computer program.
  • the invention may provide a method for use in a computer system capable of operating in a network and being arranged to establish a user-level stack, the method comprising, when an application associates an I/O synchronisation object with a socket, responding to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
  • the invention may provide a method for use in a computer system operating in a network and providing a user-level stack, the method comprising directing a system call, made by an application, requesting I/O status information to a user-level routine which
  • Figure 13 shows the basic architecture of a computer system operating a user-level stack;
  • Figure 14 illustrates the operation of a computer system in accordance with an embodiment of the invention having created an I/O completion port
  • Figure 15 shows a routine in accordance with an aspect of the invention.
  • A basic example of an architecture of a computer system 10 operating the Windows™ operating system and providing networking functionality at the user level is shown in Figure 13.
  • the system 10 comprises the operating system 12, the user/application space 14 and the network interface hardware 16.
  • When an application 18a, 18b wants to request a networking operation, it does so via a user-mode library using the Winsock API.
  • the user-mode library comprises a Winsock API implementation 22 and a Winsock service provider or WSP 24.
  • Windows™ is supplied with a default WSP, the Microsoft™ TCP/IP WSP or MS WSP. This WSP does very little other than map the networking operations requested by an application onto the corresponding operating system networking call interface and then invoke the appropriate operating system routine.
  • the WSP 24 provides the networking functionality, including the implementation of the stack 30.
  • an application 18a, 18b requires a networking-related operation to be performed, it invokes a command, say send() or receive() supported by the Winsock API, which is then carried out by the WSP 24.
  • the WSP 24 can also make use of existing operating system networking functionality to the extent that it needs to.
  • Figure 14 shows the computer system 10 where, for clarity of illustration, the user- mode library has been omitted.
  • the stack 30 has been illustrated as comprising a receive path 30R and a transmit path 30T. In the situation in Figure 14, the application 18a has opened up a socket 32 for communication with the network.
  • a lock 40 governs the right to update both paths 30R, 30T of the stack 30.
  • the system 10 may also comprise other stacks (not shown), each of which is protected by its own lock.
  • the application has chosen to set up an I/O completion port 35 for the socket 32.
  • the I/O completion port 35 is shown as being associated with only the socket 32. In other embodiments, it may be associated with more than one socket and other system objects.
  • the I/O completion port 35 serves as a repository in which the details of completed overlapped I/O operations made from the socket 32 are stored as a list. Details of various types of I/O operations are stored including, for example, transmitting, receiving, connecting, disconnecting, and accepting new connections. It will be noted that by virtue of being associated with the socket 32, the I/O completion port 35 inherently becomes associated with the stack 30 which services the socket 32.
  • the completion port is set up by a CreateIoCompletionPort() system call by the application 18a.
  • this system call does not pass through the WSP 24, and so in the normal course of events, the CreateIoCompletionPort() function would be looked up from a function table maintained in the application. From this table, a pointer corresponding to the CreateIoCompletionPort() function would be identified and the operating system code referenced by the pointer invoked.
  • the system is configured to direct the CreateIoCompletionPort() system call to a user-level dll (dynamic link library) function, denoted by the reference numeral 45 in the drawings. This configuration is achieved by replacing the original pointer for the CreateIoCompletionPort() system call with a pointer to the user-level dll 45.
  • the dll 45 may be thought, from the perspective of the application 18a, to be intercepting its original system call.
  • the prefix intercept_ will hereinafter be used for the dll function name.
  • intercept_CreateIoCompletionPort() is operable to ascertain the socket for which an I/O completion port has been requested, i.e.
  • the application 18a may from time to time initiate I/O requests on the socket 32. Depending on the state of the system, these I/O requests may complete immediately, or at a later time as a result of, for example, the processing of network events in the event queue 31a.
  • the stack 30 will be notified of the completion and make a check in the user-level state of the socket 32, specifically the completion port indicator 33, and determine whether the socket 32 has an associated I/O completion port 35. If there is an associated I/O completion port 35, then a system call is made to update the I/O completion port 35 to that effect. If there is no associated I/O completion port 35, then, no system call is made.
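  • A sketch in C of the interception wrapper described above; the record_completion_port() helper is hypothetical and stands in for updating the completion port indicator 33 in the socket's user-level state:

        #include <windows.h>

        /* Pointer to the original routine, saved when its function-table
         * entry was replaced by the dll interception described above. */
        typedef HANDLE (WINAPI *create_iocp_fn)(HANDLE, HANDLE, ULONG_PTR, DWORD);
        static create_iocp_fn real_CreateIoCompletionPort;

        /* Hypothetical helper: records the socket/port association in the
         * socket's user-level state (completion port indicator 33). */
        void record_completion_port(HANDLE sock, HANDLE port);

        HANDLE WINAPI intercept_CreateIoCompletionPort(HANDLE file,
                HANDLE existing_port, ULONG_PTR key, DWORD nthreads)
        {
            /* Forward to the operating system first, so the port handle
             * (newly created or pre-existing) is known. */
            HANDLE port = real_CreateIoCompletionPort(file, existing_port,
                                                      key, nthreads);
            if (port != NULL)
                record_completion_port(file, port);
            return port;
        }
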
  • Protocol processing stacks implemented in an operating system tend to be interrupt driven. This means no matter what the relevant application is doing the stack will always be prioritised when network events occur (e.g. when data is received from the network and is passed onto an event queue) because the protocol processing software is invoked in response to such events by means of interrupts.
  • user-level architectures, which involve certain processing tasks being driven by a user process rather than by an operating system, can suffer from the following disadvantage: if the driving process is blocking, waiting for control of the CPU, or performing a non-data-transfer task, then the user-level stack may not be given control of the CPU.
  • the application 18a makes the GetQueuedCompletionStatus() system call.
  • This system call returns a list of completed overlapped I/O operations for the sockets associated with the I/O completion port. In this case, only the socket 32 is associated with the I/O completion port 35.
  • the GetQueuedCompletionStatus() system call by an application is made with a timeout argument set to timeout, e.g. GetQueuedCompletionStatus(..., timeout).
  • a flag, do_enable_interrupts, signalling whether to enable interrupts, is set to true. True means that interrupts should be set and false means that they should not be set.
  • at step 54, a test is made to determine whether the user-level stack 30 needs updating. If it needs updating, then, at step 56, an attempt is made to grab the lock 40 without blocking. If possession of the lock 40 is taken, then the situation is opportune to update the stack.
  • at step 58, the stack is updated at the user level.
  • at step 60, the lock 40 is released.
  • at step 62, a test is made to determine whether, during stack updating at step 58, a process thread was awoken, and if a process thread was awoken, do_enable_interrupts is set to false.
  • at step 64, a system call, GetQueuedCompletionStatus(..., 0), is made. It will be noted that the timeout argument is set to zero, meaning the system call is non-blocking and will return immediately.
  • at step 66, it is determined whether any data was returned by the system call, or whether the original timeout argument supplied by the calling application, i.e. timeout, was zero.
  • if timeout was zero or some data was returned by the system call, then a return is made to the calling application at step 67. This is because, if the timeout argument was set to zero, the application wanted a non-blocking response; and, if some data was returned, then this should be promptly reported to the application.
  • at step 54, if the stack did not need updating, then the routine goes straight to step 64, where the non-blocking system call GetQueuedCompletionStatus(..., 0) is made.
  • at step 56, if the attempt to grab the lock 40 without blocking failed, then, at step 68, the do_enable_interrupts flag is set to false and the routine 50 goes on to step 64, where the non-blocking system call GetQueuedCompletionStatus(..., 0) is made.
  • the do_enable_interrupts flag is inspected, and if it is true and interrupts for the stack 30 are not already enabled, then they are enabled.
  • the system call GetQueuedCompletionStatus(..., timeout) is made, and when this returns, the routine 50 then returns to the calling application.
  • the routine may return to the beginning at step 52, and continue spinning in a loop until either some data is returned by GetQueuedCompletionStatus() or a short spin period has elapsed, whereafter the routine continues at step 72.
  • the advantage of this approach is that interrupts are not enabled if any I/O operations complete within the spin period.
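Gathering steps 52 to 72 together, routine 50 might be sketched in C as follows. The stack and lock helpers (stack_needs_update(), trylock_stack() and so on) are hypothetical stand-ins for the user-level stack 30 and lock 40, whose implementations the description does not give; only GetQueuedCompletionStatus() is the genuine Windows call, and the sketch omits the optional spinning variant.

```c
#include <windows.h>
#include <stdbool.h>

/* Placeholder helpers for the user-level stack 30 and lock 40. */
static bool stack_needs_update(void)  { return false; }
static bool trylock_stack(void)       { return true;  }  /* lock 40, non-blocking */
static void unlock_stack(void)        {}
static void update_stack(void)        { /* step 58: user-level update */ }
static bool woke_a_thread(void)       { return false; }
static bool interrupts_enabled(void)  { return false; }
static void enable_interrupts(void)   {}

BOOL Intercept_GetQueuedCompletionStatus(HANDLE port, LPDWORD nbytes,
                                         PULONG_PTR key, LPOVERLAPPED *ov,
                                         DWORD timeout)
{
    bool do_enable_interrupts = true;               /* step 52 */

    if (stack_needs_update()) {                     /* step 54 */
        if (trylock_stack()) {                      /* step 56 */
            update_stack();                         /* step 58 */
            unlock_stack();                         /* step 60 */
            if (woke_a_thread())                    /* step 62 */
                do_enable_interrupts = false;
        } else {
            do_enable_interrupts = false;           /* step 68 */
        }
    }

    /* step 64: probe the completion port without blocking */
    BOOL got = GetQueuedCompletionStatus(port, nbytes, key, ov, 0);

    /* steps 66-67: return at once if data arrived or the caller
     * asked for a non-blocking call */
    if (got || timeout == 0)
        return got;

    if (do_enable_interrupts && !interrupts_enabled())
        enable_interrupts();

    /* step 72: block with the caller's original timeout */
    return GetQueuedCompletionStatus(port, nbytes, key, ov, timeout);
}
```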
  • intercept_GetQueuedCompletionStatus() 50 invokes a stack update if operating conditions, like, for example, the availability of the lock 40 and the state of the receive event queues 31a, 31b, 31c, make it advantageous to do so, whereby the list returned to the calling application 18a should be up-to-date.
  • although the embodiments have been described with reference to a particular I/O synchronisation object supported by Windows, namely an I/O completion port, the invention may be implemented on any other suitable operating system.
  • I/O synchronisation objects supported by other operating systems include "kqueues", "epoll", and "realtime signal queues".
  • the invention may be implemented using an I/O synchronisation mechanism which does not use an I/O synchronisation object, for example, "poll" and "select".
  • the present invention relates to a computer system having a network interface and which is capable of running a plurality of concurrent processes.
  • when data is received from the network interface, it can be delivered to a receive buffer of a destination application using a variety of receive models.
  • according to the conventional synchronous receive model, a received packet is de-multiplexed to an appropriate socket and the payload is delivered to the receive queue for that socket. Then, at some later time, the application transfers the payload to its receive buffer. In the case where the application requires the payload before it has been received, the application may choose to block until it arrives.
  • according to the conventional asynchronous receive model, the application may pre-allocate an application receive buffer to which subsequently received payload should be delivered. Normally, if a receive buffer has been allocated, then received payload is delivered directly to the receive buffer. If a receive buffer has not been allocated, then the payload is enqueued on the receive queue. Often, an application will allocate many receive buffers for incoming payload, and so descriptors for the allocated receive buffers are stored in a queue.
  • the present invention may provide a computer system having a network interface and capable of running a plurality of concurrent processes, the system being arranged to
  • the above-defined lock regime is advantageous in that an application process operating according to the asynchronous receive model when allocating a new application receive buffer for incoming payload is able by virtue of taking possession of the second lock, when the second queue is empty, to transfer payload from the first queue to the new application receive buffer without further having to obtain possession of the first lock.
  • the above-defined lock regime is also advantageous in that an application process operating according to the synchronous model is able by virtue of taking possession of the second lock, when the second queue is empty, to transfer payload from the first queue to a new application receive buffer specified by the application process without further having to take possession of the first lock.
  • the above-defined lock regime is also advantageous in that when a process, holding the first lock, is processing incoming payload, it is able either, when the second queue is empty, to enqueue the incoming payload on the first queue, or, when the second queue is not empty, to perform a drain down operation as described later, without further having to obtain possession of the second lock.
  • the system when an application process, holding the second lock, has a new descriptor for enqueueing on the second queue, is operable, when the second queue is empty, to dequeue payload from the first queue and transfer it to the application receive buffer specified by the said new descriptor.
  • the system when an application process seeks to receive incoming payload and takes possession of the second lock, is operable to transfer payload from the first queue to a new application receive buffer specified by the application receive process.
  • the system when a process, holding the first lock, processes incoming payload is operable, when the second queue is empty, to enqueue the incoming payload on the first queue, and, when the second queue is not empty, to transfer payload from the first queue to a buffer specified by a descriptor in the second queue.
  • payload transferred from the first queue to a buffer may from time to time be said incoming payload when said incoming payload is deposited in the first queue without first determining the condition of the second queue.
  • the incoming payload, when the second queue is not empty may completely bypass the first queue.
  • the right to enqueue on the second queue is governed by the second lock.
  • an application process operating according to the asynchronous receive model is able, by virtue of taking possession of the second lock, when the second queue is empty, to not only transfer payload from the first queue to the new application receive buffer, but also add the descriptor corresponding to the new buffer to the second queue without having to obtain possession of the first lock.
  • adding the descriptor corresponding to the new application receive buffer takes place when either the first queue is empty or the second queue is non-empty.
  • the system may be operable to perform a drain down operation in which, while the first and second queues are not empty, payload is transferred from the first queue to buffers specified in the second queue.
  • payload is transferred until there is no more payload in the first queue or there are no more buffers specified in the second queue, whichever comes first.
  • One situation is after the application process has caused the new buffer descriptor to be added to the second queue, but during that operation, payload has been added to the first queue, whereby it has become non-empty. At this point, an attempt is made to take possession of the first lock. Once the first lock is obtained, the drain down operation is carried out.
  • Another situation is when the process responsible for delivering the incoming payload to a socket finds that the second queue is not empty. This also presents an opportunity for draining from the first queue if it is needed.
  • the present invention may provide a computer system having a network interface and running a plurality of concurrent processes, the system
  • the present invention may provide a computer program for a computer system having a network interface and capable of running a plurality of concurrent processes, the program being arranged to
  • the present invention may provide a data carrier bearing the above program.
  • Figure 16 shows an overview of hardware suitable for performing the invention;
  • Figure 17 shows an overview of various portions of state of a socket in an embodiment of the invention;
  • Figure 18 shows a first lock configuration applied to the Figure 17 embodiment;
  • Figure 19 shows a second lock configuration applied to the Figure 17 embodiment;
  • Figure 20 shows an algorithm for delivering a packet to the socket in accordance with the invention;
  • Figure 21 shows an algorithm according to an asynchronous receive model of operation; and
  • Figure 22 shows an algorithm according to a synchronous receive model of operation.
  • a computer system 10 in the form of a personal computer (PC) comprises a central processing unit 15, a memory 20 which is connected to a network 35 by an Ethernet connection 30.
  • the Ethernet connection is managed by a network interface card 25 which supports the physical and hardware requirements of the Ethernet system.
  • although referred to as a card, the physical hardware implementation need not be that of a card: for instance, it could be in the form of an integrated circuit and connector mounted directly on a motherboard.
  • Figure 17 shows the system state of an embodiment of the invention.
  • an application process thread 40 has been initiated and a TCP socket 50 has been opened up enabling the application to communicate via the network interface card 25 over the network.
  • Processing of data at the TCP layer involves two linked list queue structures: a receive queue (RQ) 70 and an asynchronous receive queue (ARQ) 80, both of which are first-in first-out (FIFO).
  • the RQ 70 comprises a plurality of items 71 in which each item 71 references a block of data after TCP processing.
  • Each item 71 comprises a pointer portion 71a which points to the start of the data block, and a block-length portion 71b giving the length of the block.
  • the memory region where blocks of data are stored in buffers after TCP processing is designated 95.
  • the ARQ 80 comprises a plurality of items 81 in which each item 81 references an application receive buffer which the application 40 has pre-allocated for incoming data, for example, when an application makes an asynchronous receive request.
  • Each item 81 consists of an application receive buffer descriptor comprising a pointer portion 81a which points to the start of the buffer and a buffer-length portion 81b defining the length of the buffer.
  • the memory region which may be allocated for buffers for incoming data is designated 42. In other embodiments, queue structures other than linked lists may be used.
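For concreteness, the item layouts just described might be rendered in C as below; the field names are invented for the sketch, and only the pointer/length pairing and the list linkage come from the text.

```c
#include <stddef.h>

struct rq_item {               /* item 71 on the receive queue RQ 70    */
    struct rq_item *next;      /* linked-list link                      */
    void  *block;              /* 71a: start of post-TCP data block     */
    size_t block_len;          /* 71b: length of the block (region 95)  */
};

struct arq_item {              /* item 81 on the asynchronous receive   */
    struct arq_item *next;     /* queue ARQ 80                          */
    void  *buf;                /* 81a: start of pre-allocated buffer    */
    size_t buf_len;            /* 81b: length of the buffer (region 42) */
};
```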
  • the first lock 62 is a shared lock which is widely used to protect various portions of the system state, including, for example, portions of state of other (unshown) sockets.
  • this lock is referred to as the network interface lock or netif lock.
  • the second lock 52 is a lock which is dedicated to protecting certain portions of the state of the specific socket 50.
  • this lock is referred to as the sock-lock 52.
  • Figure 18 represents the lock configuration when the ARQ 80 contains no items 81.
  • the symbol 0 is used to represent an empty condition.
  • the right to put an item 71 onto the RQ 70 is governed by the netif lock 62 and this right is denoted in Figure 18 by the arrow 62-P.
  • the right to remove/get an item from the RQ 70 is governed by the sock-lock 52 and this right is denoted in Figure 18 by the arrow 52-G. Any process which enqueues an item on the RQ 70 has first to take possession of the netif lock 62, whereas a process, which dequeues an item has first to take possession of the sock-lock 52.
  • the right to put an item 81 onto the ARQ 80 is governed by the sock-lock 52, and this right is denoted in Figure 18 by the arrow 52-P.
  • any application which seeks to enqueue an application receive buffer descriptor on the ARQ 80 has first to take possession of the sock-lock 52.
  • as the ARQ 80 is empty in Figure 18, no right to dequeue has been illustrated.
  • Figure 19 represents the lock configuration when the ARQ 80 contains at least one item 81.
  • the rights both to put an item 71 onto the RQ 70 and to remove/get an item 71 from the RQ 70 are governed by the netif lock 62 and these rights are denoted in Figure 19 by the arrows 62-P and 62-G. Any process which enqueues an item on the RQ 70 has first to take possession of the netif lock 62, and any process which seeks to dequeue an item has likewise first to take possession of the netif lock 62.
  • the right to put an item 81 onto the ARQ 80 is governed by the sock-lock 52, and the right to remove/get an item from the ARQ 80 is governed by the netif lock 62 and these rights are denoted in Figure 19 by the arrows 52-P and 62-G, respectively.
  • any application which seeks to enqueue an application receive buffer descriptor on the ARQ 80 has first to take possession of the sock-lock 52, and any process which dequeues an item has first to take possession of the netif lock 62. From time to time, the system makes a check to see whether there is a received packet which is ready for processing and queueing.
  • the algorithm 100 for delivering the received packet to the socket 50, as shown in Figure 20, is invoked/performed.
  • it is assumed that a process 68 is awakened and acts as the receive process, and that it, at this point, has already taken possession of the netif lock 62.
  • at step 100, the TCP layer protocol processing and demultiplexing is carried out, and the post-TCP-layer-processing payload/data block 93 is stored in a memory space 95.
  • at step 102, the data block 93 is enqueued on the RQ 70 by adding an item 71 onto the RQ 70 which references the data block 93.
  • at step 104, a check is made to see whether the ARQ 80 contains any items 81. If there are buffer descriptors in the ARQ 80, this means that the Figure 19 lock configuration is valid and the process 68, being already in possession of the netif lock 62, has dequeue rights for both the RQ 70 and the ARQ 80. Therefore, without having to attempt to take possession of another lock, the process 68, at step 106, is able to drain down payload in the RQ 70 into the buffers referenced by descriptors in the ARQ 80 to the extent that buffers have been pre-allocated, i.e. until either the RQ 70 or the ARQ 80 becomes empty. Filled buffers are either removed from the ARQ 80 or their buffer-length portion 81b adjusted.
  • the check, at step 104, to determine whether the ARQ 80 is empty or not can be carried out before incoming payload is enqueued in the RQ 70 (at step 102).
  • when the ARQ 80 is non-empty and the RQ 70 is empty, the RQ 70 can be completely bypassed.
  • the netif lock 62 is a shared lock, and thus, according to the described lock regime, it governs the enqueueing of payload not only onto the RQ 70 of the socket 50, but may do so too for the other, unshown sockets. In such a case, it will be appreciated that many sockets can be serviced while incurring the overhead of obtaining the netif lock only once.
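A hedged C sketch of algorithm 100 follows. The structures reuse the invented field names from the earlier sketch, queue heads stand in for full FIFO bookkeeping, and the partial-fill adjustments are one plausible reading of how the buffer-length portion 81b is "adjusted".

```c
#include <stddef.h>
#include <string.h>

struct rq_item  { struct rq_item *next;  void *block; size_t block_len; };
struct arq_item { struct arq_item *next; void *buf;   size_t buf_len;   };

struct tcp_sock {                /* socket 50; heads only, for brevity */
    struct rq_item  *rq;         /* RQ 70  */
    struct arq_item *arq;        /* ARQ 80 */
};

/* Caller is the receive process 68, which already holds the netif
 * lock 62; block 93 has passed TCP processing (step 100) and been
 * wrapped in a fresh item 71 by the caller. */
void deliver_packet(struct tcp_sock *s, struct rq_item *item)
{
    item->next = s->rq;                  /* step 102: enqueue on the RQ */
    s->rq = item;                        /* (FIFO ordering elided)      */

    /* step 104: a non-empty ARQ means Figure 19 applies, and the netif
     * lock already confers dequeue rights on both queues, so drain
     * without taking any further lock. */
    while (s->arq && s->rq) {            /* step 106: drain down        */
        struct rq_item  *p = s->rq;
        struct arq_item *b = s->arq;
        size_t n = p->block_len < b->buf_len ? p->block_len : b->buf_len;

        memcpy(b->buf, p->block, n);     /* payload into the app buffer */

        if (n == p->block_len) s->rq = p->next;       /* block consumed */
        else { p->block = (char *)p->block + n; p->block_len -= n; }

        if (n == b->buf_len) s->arq = b->next;        /* buffer filled  */
        else { b->buf = (char *)b->buf + n; b->buf_len -= n; } /* 81b adjusted */
    }
}
```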
  • when the application process thread 40 wants to operate according to the asynchronous receive model, as mentioned earlier, and has a new receive buffer 43 for allocation, it performs/invokes the asynchronous receive algorithm 108 shown in Figure 21.
  • the descriptor for the new receive buffer 43 may be specified as the argument of a call, like async_recv(), which runs the algorithm 108.
  • at step 110, the application 40 takes possession of the sock-lock 52 in order to obtain the right to enqueue the descriptor corresponding to the new receive buffer onto the ARQ 80. It will be noted that, at this point, no test is first required to determine the condition of the ARQ 80, as in both the possible Figure 18 and Figure 19 lock configurations the right to enqueue on the ARQ 80 is governed by the sock-lock 52.
  • a test is performed to determine whether the ARQ 80 is empty. If the ARQ is empty, then the Figure 18 lock configuration applies, whereby the application 40, which already holds the sock-lock 52, thus already has the right to dequeue from the RQ 70.
  • at step 114, a check is made to determine whether there is any payload/data in the RQ 70. If there is, at step 116, it is transferred from the RQ 70 to the new buffer 43. In this case, it will be noted that a descriptor corresponding to the new buffer 43 is not put onto the ARQ 80 at any time. Furthermore, it will be appreciated that performance of step 116, in terms of access to the RQ 70 and the ARQ 80, requires only the right to dequeue from the RQ 70. This means that, in this branch of the algorithm, with the Figure 18 lock configuration being valid, only possession of the sock-lock, which was already taken at step 110, is required. At step 130, the sock-lock 52 is dropped.
  • at step 118, a descriptor 44 corresponding to the new buffer 43 is enqueued on the ARQ 80. With the application 40 holding the sock-lock 52, this is the situation shown in Figure 19.
  • at step 120, a check is made to determine whether, during the performance of step 118, further payload has been enqueued on the RQ 70, i.e. whether the RQ 70 is still empty. If it is still empty, then the sock-lock is dropped at step 130. In this case, the new descriptor 44 has been enqueued on the ARQ, but no further work needed to be performed before dropping the sock-lock.
  • at step 122, the netif lock 62 is grabbed. This lock is needed to dequeue items from the RQ 70, because, in this branch of the algorithm, the Figure 19 lock configuration applies.
  • the payload in the RQ 70 is drained down, i.e. transferred to application receive buffers specified by the descriptors in the ARQ 80, to the extent permitted by the availability of buffers or, in other words, until either the RQ or the ARQ becomes empty.
  • the netif lock 62 is dropped. Obtaining the shared netif lock at step 122 might tend to result in blocking, but entering this part of the algorithm is not a common occurrence.
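The branches just described can be summarised in the following C sketch of algorithm 108; every helper is a placeholder for queue and lock code the text describes only in prose, and the step numbering follows the description where given.

```c
#include <stdbool.h>
#include <stddef.h>

struct sock;                                  /* socket 50 state */

/* Placeholder helpers standing in for real queue and lock code. */
static void sock_lock(struct sock *s)    { (void)s; }  /* take sock-lock 52 */
static void sock_unlock(struct sock *s)  { (void)s; }
static void netif_lock(struct sock *s)   { (void)s; }  /* take netif lock 62 */
static void netif_unlock(struct sock *s) { (void)s; }
static bool arq_empty(struct sock *s)    { (void)s; return true;  }
static bool rq_empty(struct sock *s)     { (void)s; return true;  }
static void rq_to_buffer(struct sock *s, void *buf, size_t len)
                                         { (void)s; (void)buf; (void)len; }
static void arq_put(struct sock *s, void *buf, size_t len)
                                         { (void)s; (void)buf; (void)len; }
static void drain_down(struct sock *s)   { (void)s; }  /* RQ -> ARQ buffers */

void async_recv(struct sock *s, void *buf, size_t len)  /* algorithm 108 */
{
    sock_lock(s);                        /* step 110 */

    if (arq_empty(s) && !rq_empty(s)) {  /* Figure 18 branch, step 114 */
        rq_to_buffer(s, buf, len);       /* step 116: direct transfer; no
                                            descriptor ever joins the ARQ */
        sock_unlock(s);                  /* step 130 */
        return;
    }

    arq_put(s, buf, len);                /* step 118: Figure 19 now applies */

    if (!rq_empty(s)) {                  /* step 120: payload arrived meanwhile */
        netif_lock(s);                   /* step 122 */
        drain_down(s);                   /* move RQ payload into ARQ buffers */
        netif_unlock(s);
    }
    sock_unlock(s);                      /* step 130 */
}
```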
  • when the application process thread 40 wants to operate according to the synchronous receive model, as mentioned earlier, it invokes/performs the following synchronous receive algorithm 140, specifying a new application receive buffer.
  • the descriptor for the new receive buffer 43 may be specified as the argument of a call, like, for example, recv(), which runs the algorithm 140.
  • the sock-lock 52 is grabbed.
  • a check is made, at step 144, on the condition of the ARQ 80 and if the ARQ 80 is non-empty, then, regardless of the condition of the RQ 70, blocking occurs at step 146.
  • at step 148, a check is made on the condition of the RQ 70 and, if it is non-empty, payload is transferred from the RQ 70 to the new application receive buffer. On the other hand, if the RQ 70 is empty, then blocking occurs at step 152.
  • in a variation, the condition of the ARQ 80 could be checked before grabbing the sock-lock 52, and if it is non-empty, blocking occurs. However, in this variation, after grabbing the sock-lock 52, it is still necessary to check the condition of the ARQ 80 again in order to verify that another thread has not enqueued a buffer descriptor on the ARQ 80 in the meantime. In other embodiments, and depending on the way the application process thread calls the synchronous receive algorithm, instead of blocking, an error can be returned immediately to the application process.
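For comparison, algorithm 140 might be sketched as follows, again with placeholder helpers; blocking is represented by a single hypothetical block_for_payload() call, and the error-returning variant mentioned above is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

struct sock;

/* Placeholder helpers as before. */
static void sock_lock(struct sock *s)        { (void)s; }
static void sock_unlock(struct sock *s)      { (void)s; }
static bool arq_empty(struct sock *s)        { (void)s; return true;  }
static bool rq_empty(struct sock *s)         { (void)s; return false; }
static void rq_to_buffer(struct sock *s, void *buf, size_t len)
                                             { (void)s; (void)buf; (void)len; }
static void block_for_payload(struct sock *s){ (void)s; }  /* steps 146/152 */

void sync_recv(struct sock *s, void *buf, size_t len)  /* algorithm 140 */
{
    sock_lock(s);                    /* the sock-lock 52 is grabbed */

    if (!arq_empty(s)) {             /* step 144 */
        block_for_payload(s);        /* step 146: ARQ non-empty, so block
                                        regardless of the RQ's condition */
    } else if (!rq_empty(s)) {       /* step 148 */
        rq_to_buffer(s, buf, len);   /* RQ payload into the new buffer */
    } else {
        block_for_payload(s);        /* step 152: nothing to deliver yet */
    }
    sock_unlock(s);
}
```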
  • the algorithms 100, 108, 140 are typically operating system routines which, in the examples described, are called by the process 68 and the application process 40. Although, in terms of implementation at the code level, locks are picked up/taken and released within those routines, possession of the lock is said to reside with the higher-level calling processes 68, 40 on whose behalf the routines are working.

Abstract

A computer system which is capable of running a plurality of concurrent processes, the system being operable to establish a first queue in which items related to data for sending over the network are enqueued, and to which access is governed by a lock; and when a first of said processes is denied access to the first queue by the lock, to enqueue the items into a second queue to which access is not governed by the lock, and to arrange for the items in the second queue to be dequeued with items in the first queue.

Description

COMPUTER SYSTEM
The present application relates to a computer system capable of running a plurality of processes, especially a computer system which is connected to a network, and discloses four distinct inventive concepts which are described below in Sections A to D of the description.
Claims 1 to 13 relate to the description in Section A, claims 14 to 32 relate to the description in Section B, claims 33 to 59 relate to the description in Section C, and claims 60 to 72 relate to the description in Section D.
In the appended drawings, figures 1 to 8 relate to the description in Section A, figures 9 to 12 relate to the description in Section B, figures 13 to 15 relate to the description in Section C, and figures 16 to 22 relate to the description in Section D.
Embodiments of each of the inventions described herein may include any one or more of the features described in relation to the other inventions.
Where reference numerals are used in a Section of the description they refer only to the figures that relate to the description in that Section.
SECTION A SEND PRE-QUEUE
The present invention relates to a computer system capable of running a plurality of processes, especially a computer system which is connected to a network.
In normal use, computer systems run a plurality of concurrent processes, some of which relate to the operation of the operating system, and some of which relate to the operation of applications. These processes often need to access common portions of state at the same time. To avert the problems which would obviously arise if the processes were allowed unrestricted concurrent access, locks are employed.
Computer systems often operate in a network. In order to keep the software architecture simple, it is often advantageous to use a single, shared lock to govern access to the send queues on which data items are enqueued before sending over the network via the network interface. However, with many processes running concurrently on the computer system and potentially seeking to send data, the shared lock is prone to a problem called lock contention.
A lock becomes contended when a process tries to obtain the lock that is held by another process. The overhead associated with lock contention reduces system performance and is always best avoided.
In operating system kernels, it is common to use spinlocks. According to the spinlock model, a process repeatedly tries to obtain the lock until it succeeds. This works well when locks are held for short periods of time.
However, this model does not work well at the user level, for example, when the contending processes are threads of user applications where the lock may be held for a considerable period of time. The normal approach, in this case, is to put the contending process to sleep until the current lock-holding process releases the lock. This model is normally referred to as blocking.
An object of the present invention is to reduce the lock contention overhead when queueing items in preparation for subsequent sending over a network.
With this in mind, according to one aspect, the present invention may provide a computer system which is capable of running a plurality of concurrent processes, the system being operable: to establish a first queue in which items related to data for sending over the network are enqueued and to which access is governed by a lock; when a first of said processes is denied access to the first queue by the lock, to enqueue the items into a second queue to which access is not governed by the lock; and to arrange for the items in the second queue to be dequeued with the items in the first queue.
Thus, even though the lock may well be highly contended, the provision of a second queue to which access is not governed by the lock enables a first process, although locked out from the first queue, to nonetheless enqueue items for sending subsequently. In this way, the present invention avoids the above-mentioned overheads associated with the spinlock or blocking lock contention handling models. Further, by arranging it such that items on the second queue are handled together with items in the first queue, the present invention ensures that the items in the second queue are processed in a timely fashion.
Preferably, the system is operable to integrate items from the second queue into the first queue. In this way, the items in the second queue will be dequeued by the system as though they were from the beginning enqueued as first queue items. Integration may be achieved by linking the first queue and the second queue together. Alternatively, the items in the second queue may be dequeued from the second queue and moved to the first queue. Preferably, the second queue comprises a data structure facilitating access by concurrent processes. In this way, the integrity of the second queue can be maintained even after it has been integrated into the first queue and might be subject to concurrent manipulation by the first process enqueueing items and another process dequeueing items from the first queue. The latter process might possibly only respect said lock, for example, a shared lock, and may be oblivious to any other locks, for example, a socket lock, which may govern the portion of state where the second queue resides. In a preferred embodiment, the second queue comprises a linked list having a head, to which items are added and from which they are removed by atomic instructions; in other embodiments, the second queue may comprise a circular buffer having an input pointer pointing to where items are entered into the buffer, and an output pointer pointing to where items are removed from the buffer, wherein the input and output pointers are prevented from crossing one another.
Linked lists are suitable structures by which to implement the first and second queues, and preferably the first queue is linked to the second queue by arranging for a pointer of the first queue to point to the second queue, whereby the first and second queues form a single linked list structure.
Preferably, the system comprises means for registering the existence of the second queue, and is further operable after completing the second queue to register the existence of the second queue with the registering means if the lock is held by a said process other than the first process. This registration provides the mechanism through which the need to process the second queue can be communicated, and thus delegated, by the first process to a said process other than the first process. However, if after completing the second queue, it turns out that the lock is no longer being held and is grabbed by the first process, the first process may itself dequeue for sending the items in the second queue. In such a case, no registration of the second queue takes place. Preferably, the lock is a single unit of memory which is manipulated atomically. Manipulating the lock by an atomic operation guarantees that only one process can hold the lock at one time. The unit of memory can be a single word, or multiple words, if the processor architecture supports atomic manipulation of multiple words.
The registering means may include bits of the lock. This is advantageous because it enables lock manipulation and the determination of whether there exist any second queues for processing to be carried out in the same operation. In those bits of the lock allocated to the registering means, the head of a linked list may be stored. The linked list may comprise items, each item referring to a socket which has formed a said second queue.
Preferably, the system is further operable, when a process is about to release a lock, to check whether there are any second queues to be dequeued for sending. This may be achieved by the registering means where the existence of second queues is logged. The operation of checking for the existence of any registered second queues should be atomic with respect to releasing the lock, i.e. it should not be possible to release the lock if there is a registered second queue.
Said items can include the data itself, especially when the data volume is small, but preferably comprise a pointer to a buffer in, for example, an application's address space where the data is actually held. This saves the overhead of moving the data around during enqueueing.
According to a further aspect, the present invention may provide a computer program for sending data to the network interface of a computer system which is capable of running a plurality of concurrent processes, the computer program being operable to establish a first queue in which items related to data for sending over the network interface are enqueued, and to which access is governed by a lock; and when a said process is denied access to the first queue by the lock, to enqueue the items for sending into a second queue to which access is not governed by said lock, wherein the computer program is further operable to arrange for the items in the second queue to be dequeued with the items in the first queue.
According to a further aspect, the present invention may provide a data carrier bearing the above-mentioned computer program.
According to a "further aspect, the present invention may provide a computer, system running a plurality of processes, comprising a first queue in which items related to data for sending over a network are enqueued; a lock by which access", to the first qϋέu^ is governed; a second queue to which access is not governed by; the lock; wherein'ihe system is operating such that when a said process is denied1 access' to the first5 queue by the lock, the data item for sending is enqueued in the second queue, and items in the second queue are dequeued with items in the first queue. - " .
Exemplary embodiments of the invention are described herein with reference to. the accompanying drawings, in which:
Figure 1 shows a hardware overview of an embodiment of the invention;
Figure- 2 shows" an overview of Various portions of state and associated lock in an embodiment of the invention;
Figures 3(a) and 3(b) show algorithms in accordance with an embodiment of the invention;
Figures 4 to 6 show an overview of various data structures established by an embodiment of the invention operating according to the algorithms of Figures 3(a), (b);
Figure 7 shows the bit structure of the netif lock; and Figure 8 shows the structure of a queue formed in accordance with an embodiment of the invention.
Referring to Figure 1, a computer system 10 in the form of a personal computer (PC) comprises a central processing unit 15, a memory 20 which is connected to a network 35 by an Ethernet connection 30. The Ethernet connection is managed by a network interface card 25 which supports the physical and hardware requirements of the Ethernet system. Although referred to as a card, the physical hardware implementation need not be that of a card: for instance, it could be in the form of an integrated circuit and connector mounted directly on a motherboard.
Figure 2 shows the system state when a number of application execution threads 40, 42, 44 have been initiated, and a number of TCP or UDP sockets 46, 48 have been opened up by the applications to communicate with the network interface card 25. Because the sockets 46, 48 can be accessed by more than one thread, some of the state for each socket is protected by a socket-specific lock 52, 54 known as a sock-lock. The sock-locks protect aspects of the operation of the sockets which are independent of other parts of the system. In the drawings, these portions of state are represented diagrammatically by the regions 56, 58. Also, each socket has portions of state all of which are protected by a single shared lock, hereinafter referred to as the network interface or netif lock 60. In the drawings, these portions of state are represented diagrammatically by the regions 62, 64. In the embodiment, the network interface lock 60 also protects, as the name suggests, various portions of state related to access to the network interface.
Referring to Figure 7, the netif lock is implemented as a single word in memory so that it may be manipulated by atomic instructions. Some bits 60a are used to indicate whether it is locked or not. Some bits 60b are used to indicate whether it is contended (i.e. whether other threads are waiting for the lock). Some bits 60c are used to request special actions when the netif is unlocked, e.g. via callbacks, as described in the applicant's co-pending patent application GB0504987.9, which is incorporated herein by reference. And a set of bits 60d are used to implement a list or register of deferred sockets, which is described in more detail below.
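One plausible way to lay out such a word in C is shown below; the patent does not fix the widths or positions of the bit fields, so the assignments here are assumptions for illustration only.

```c
#include <stdint.h>

typedef uint64_t netif_word;     /* the single-word netif lock 60 */

#define NETIF_LOCKED     ((netif_word)1 << 0)    /* bits 60a: held?     */
#define NETIF_CONTENDED  ((netif_word)1 << 1)    /* bits 60b: waiters?  */
#define NETIF_CALLBACKS  ((netif_word)0x3 << 2)  /* bits 60c: callbacks */
#define NETIF_DEFERRED_SHIFT 4                   /* bits 60d start here */

/* bits 60d hold the head of the deferred-sockets register 90, e.g. an
 * index identifying the first socket that formed a send prequeue. */
static inline unsigned deferred_sockets_head(netif_word w)
{
    return (unsigned)(w >> NETIF_DEFERRED_SHIFT);
}
```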
When a currently running application thread, say thread 44, seeks to send data items through an associated open socket, it follows the algorithm in Figure 3(a).
First, the thread 44, at step 110, allocates buffers for the data to be sent in the application address space. In other embodiments, the data to be sent need not be at the user level. Next, at step 112, it fills those buffers with data for sending. It will be noted that steps 110, 112 are independent of other parts of the system and so there is no need to obtain access by means of a lock. The thread 44, then at step 114, attempts to grab the netif lock 60 using an atomic compare-and-swap (CAS) operation. With a single machine instruction, the CAS operation compares the bits 60a of the netif lock which are indicative of whether it is locked or not with a predetermined set of bits which represent the lock not being held; if the compared bits are the same, then the bits are swapped for another set of bits indicative of the netif lock being held by the thread 44. Else, if the compared bits are different, indicating the netif lock is already held by another thread, no swap operation is performed. The use of a netif lock comprising only one word and manipulated by an atomic operation guarantees that only one thread can hold the netif lock at one time.
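The step-114 lock grab might look like this in C11 atomics; the bit position of the held flag is the assumed one from the previous sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NETIF_LOCKED ((uint64_t)1)   /* the "held" bit among bits 60a */

/* One CAS either takes the unheld lock or leaves it untouched. */
static bool netif_trylock(_Atomic uint64_t *lock)
{
    /* expect the current word with the held bit clear ... */
    uint64_t expected = atomic_load(lock) & ~NETIF_LOCKED;
    /* ... and swap in the same word with the held bit set; if another
     * thread holds the lock the comparison fails and nothing is swapped */
    return atomic_compare_exchange_strong(lock, &expected,
                                          expected | NETIF_LOCKED);
}
```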
If the thread 44 found the netif lock 60 in an unheld condition and took possession of it, at step 114, it moves onto step 116 where it enqueues the items in the send queue 70 in the socket 48 which it is using.
This is the situation shown in Figure 4, in which four queue items 72a-d in a linked list data structure form the send queue 70. A send queue head item pointer 71 points to the head item of the queue. Unless the amount of data is small, the data itself is not contained in the queue items as such. Rather, the data is enqueued using an IOVEC (input output vector) pointer structure. This is more clearly shown in Figure 8. Each queue item 72a-d comprises a pointer field 74 which points to the next item in the queue, a field 75 indicating the length of the data for sending, and a field 76 indicating its start address in the application address space 77. The use of this or other types of IOVEC pointers means that the data itself, the volume of which might be quite high, need not be moved while the send queue is being formed. Once the queue is complete, at step 118, the dequeueing and sending of data items from the send queue 70 via the network interface card 25 over the network 35 proceeds according to the transport protocol used by the socket 48.
On the other hand, if the thread 44 found the netif lock 60 in a locked condition at step 114, for example, if it is held by, say, the thread 40, which may have been in the process of transmitting a large amount of data from its send queue 80, having a head item pointer 81, over the network when it was interrupted and put to sleep, then step 124 takes place. As mentioned above, the conventional prior art system, faced with a situation like this where it was denied access to the send queue, typically adopted either a spinlock or blocked model of operation. In accordance with an aspect of this invention, this embodiment, at step 124, obtains the sock-lock 54 and establishes a send prequeue 85 as illustrated in Figure 5. As the sock-locks are socket specific, and rarely need to be accessed by different threads at the same time, they are rarely contended in practice, and so this means that the thread 44 is likely to be able to enqueue the data items in the send prequeue without any blocking delay. The send prequeue 85 comprises a send prequeue head pointer 86 and can be the same basic structure as the send queue, but differs in that access to the send prequeue is governed by a sock-lock, in this particular case the sock-lock 54 associated with the socket 48. As with the send queue 70, enqueueing of the data items may be accomplished by an IOVEC pointer structure as shown in Figure 8. Although the send prequeue 85 is protected by a sock-lock generally, it is possible that, as will be described later, it will be dequeued by a netif-lock-holding thread which will ignore the sock-lock. Thus, the location to which the send prequeue head pointer 86 points is manipulated using an atomic instruction, which means that a thread enqueueing data items into the queue need not synchronize with a thread dequeueing data items from the queue. The thread 44, at step 126, operating on the bits 60a, performs an atomic CAS operation. If the netif lock 60 was dropped during the formation of the send prequeue 85, then it is grabbed and the socket 48 is not registered as a deferred socket. Thereafter, at step 128, the send prequeue is integrated into the send queue and the sock-lock is released. In a preferred embodiment, the integration is carried out by transferring the linked list structure itself into the send queue, i.e. by removing items from the send prequeue 85 and transferring those items to the send queue 70. Alternatively, the pointer at the end of the send queue is simply made equal to the send prequeue head pointer 86, whereby the send queue and the send prequeue are effectively concatenated. In the example shown in Figure 5, there are no data items in the send queue and so the send queue head pointer 71 is simply made equal to the send prequeue head pointer 86. Thereafter, at step 130, the send queue including the linked send prequeue is dequeued and transmitted onto the network. Because, at step 130, no deference is paid to any sock-lock, it is essential, as mentioned previously, that the head of the send prequeue is manipulated atomically, because while items are being dequeued and sent over the network, another thread could be enqueueing more items.
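The atomic manipulation of the send prequeue head might be sketched as follows; the item layout and names are invented, and the push is shown at the head for brevity, leaving FIFO bookkeeping aside.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Item layout follows the IOVEC style of Figure 8; names are invented. */
struct pq_item { struct pq_item *next; size_t len; const void *start; };

/* Push onto the send prequeue 85 by atomically swinging the head
 * pointer 86; an enqueueing thread therefore never has to synchronise
 * with a netif-lock-holding thread that is concurrently dequeueing. */
static void prequeue_push(_Atomic(struct pq_item *) *head86,
                          struct pq_item *it)
{
    struct pq_item *old = atomic_load(head86);
    do {
        it->next = old;     /* link ahead of the current head */
    } while (!atomic_compare_exchange_weak(head86, &old, it));
}
```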
On the other hand, if the netif lock 60 was not dropped during the formation of the send prequeue 85, and is still being held by another thread, say thread 40, then the socket 48 is registered as a deferred socket. Referring to Figure 7, the register or list of deferred sockets is constructed as a linked list 90 comprising a head item formed by the bits 60d of the netif lock 60 and linked items 91a, 91b. Each item 60d, 91a, 91b comprises a pointer pointing to a socket which has formed a send prequeue which was not able to be immediately sent. In the particular example under discussion, only sockets 46, 48 are present and, therefore, the socket 48 is registered as the first and head item in the register 90 in bits 60d. If other sockets were already registered as deferred, then the socket 48 would be registered in one of the linked items 91a, 91b. In this case, the thread 44 having obtained the sock-lock 54, at step 124, still holds it. Holding the sock-lock before registering the socket as deferred is important as it ensures that the socket registration takes place only once. For this reason, in other implementations, where the send prequeue is not protected by a sock-lock, it is necessary to grab the sock-lock just before registration. After registration, the thread continues with other tasks. The fact that, as here, the socket is registered as a deferred socket only when the netif lock is being held by another thread is a necessary condition for this embodiment of the invention to operate properly.
By virtue of the socket 48 being registered as a deferred socket with the bits 60d of the netif lock, the thread 44 need not wait in limbo until the netif lock is available again, but the act of registering the socket delegates the handling of the send prequeue to the current lock-holding thread, i.e. thread 40. This is because a thread is never allowed to drop the netif lock when there are sockets registered as deferred. So after the thread 40 has dequeued its send queue 80 and sent the data over the network, it makes a check for any sockets which have been registered as deferred.
Figure 3(b) shows the algorithm which is used whenever a thread wants to drop the netif lock. At step 140, the thread performs, by means of a single atomic instruction, a comparison between the bits 60c and the bits 60d to check whether they are all set such that there are no callbacks registered to be performed amongst the bits 60c, and there are no sockets registered in the deferred sockets list 90. If there are no callbacks or deferred sockets registered, the netif lock 60 is dropped (still step 140). On the other hand, if there are callbacks or deferred sockets registered, then the thread enters a slow path (step 144), and checks individually which of the bits 60c, 60d indicate that action is required and attends to the actions which need to be done before attempting again to drop the netif lock at step 140. In this particular example, the socket 48 is registered as a deferred socket, and so the thread 40 transfers the prequeue 85 into its send queue 70. Again, this is done simply by making the send queue head pointer 71 equal to the send prequeue head pointer 86, as shown in Figure 6. The send queue 70 may be dequeued and the data sent to the network 35 according to the algorithms of the transport protocol. In other examples, there could be a whole series of actions which would need to be done before finally dropping the netif lock. For example, there could be a number of sockets in the deferred sockets list which would need, at this point, to be integrated into the associated send queue for subsequent dequeueing and sending over the network according to the algorithms of the relevant transport protocol.
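The fast path of Figure 3(b) can be expressed as one more compare-and-swap; as before, the bit layout is assumed rather than specified, and the slow path of step 144 is left to the caller.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NETIF_LOCKED ((uint64_t)1)   /* bits 60a (assumed position) */

/* Step 140 as a single atomic comparison-and-release: the drop succeeds
 * only when the word holds nothing but the held bit, i.e. no callback
 * bits (60c) and no deferred sockets (60d) are set. On failure the
 * thread must take the slow path of step 144 and attend to the pending
 * work before trying again. */
static bool netif_try_drop(_Atomic uint64_t *lock)
{
    uint64_t expected = NETIF_LOCKED;    /* held, nothing pending */
    return atomic_compare_exchange_strong(lock, &expected, 0);
}
```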
In the above-described embodiments, the network interface lock 60 has been used as the single shared lock protecting the send queues 70,80, but in other embodiments, another shared lock unconnected with the role of protecting access to the network interface and which may be dedicated to protecting the send queues may be used instead. It will be understood by the skilled person that, as long as the role of protecting the send queues is provided, it is not material to the invention whether the shared lock is also used to protect access to any other shared resources.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
SECTION B ASYNC RECEIVE QUEUE
The present invention relates to a computer system having a network interface and which is capable of running a plurality of concurrent processes.
When data is received from the network interface, it can be delivered to a receive buffer of a destination application using a variety of receive models. According to the conventional synchronous receive model, a received packet is de-multiplexed to an appropriate socket and the payload is delivered to the receive queue for that socket. Then, at some later time, the application transfers the payload to its receive buffer. In the case where the application requires the payload before it has been received, the application may choose to block until it arrives. According to the conventional asynchronous receive model, the application may pre-allocate an application receive buffer to which subsequently received payload should be delivered. Normally, if a receive buffer has been allocated, then received payload is delivered directly to the receive buffer. If a receive buffer has not been allocated, then the payload is enqueued on the receive queue. Often, an application will allocate many receive buffers for incoming payload, and so descriptors for the allocated receive buffers are stored in a queue.
In a practical setting, the implementation of these receive models may be problematic for a number of reasons.
First, in a practical system in which a plurality of concurrent processes are running, and in which the operating system may be required to support an application employing the synchronous receive model or the asynchronous receive model, or both models simultaneously, it is necessary to protect the integrity of the queues. One technique for protecting commonly accessible portions of state is to use locks. Intrinsic to the use of locks is establishing a lock regime which achieves a favourable trade-off between the desirable protective effect which they afford, and the drag that they have on system performance due to lock contention.
Second, in some systems, the state of a socket may be directly manipulated by processes operating in multiple address spaces, including the address spaces of the operating system kernel and one or more others for user-level processes. As a result, the possibility arises that the process which is handling the receipt of an incoming packet and operating in one context may not be able to access the application receive buffers which may reside in another address space.
With the foregoing in mind, according to one aspect, the present invention may provide a computer system having a network interface and capable of running a plurality of concurrent processes, the system being arranged to
establish a first queue on which received payload can be enqueued, and on which the right to enqueue is governed by a first lock and from which the right to dequeue is governed by a second lock;
establish a second queue on which descriptors for application receive buffers can be enqueued and from which the right to dequeue is governed by the second lock;
the system being operable when a said process, holding the first lock, processes incoming payload to
determine, from the second queue, whether an application receive buffer is available for the payload;
and if an application receive buffer is available, attempt to take possession of the second lock, and if the attempt fails, set a control flag as a signal to another said process. Thus, in accordance with this aspect of the invention, a first-lock-holding process may fail to take possession of the second lock and so be prevented from loading the payload directly into an available application receive buffer, but, by setting the control flag, it ensures that another process holding the second lock is signaled that this work needs to be carried out. As the right to enqueue on the first queue is governed by the first lock, the payload can nonetheless be enqueued without delay on the first queue. Further, as the rights to dequeue from the first and second queues are governed by the second lock, a process, on taking possession of only the second lock, is empowered to dequeue the payload from the first queue, and transfer it to an application receive buffer.
The incoming payload can be enqueued on the first queue by default. Alternatively, the payload is enqueued on the first queue only when no application receive buffer descriptor is specified in the second queue or the said process fails to obtain the second lock.
Preferably, the system is further operable such that the another said process, in response to the control flag being set and when holding the second lock, dequeues payload from the first queue and transfers it to an application receive buffer specified in the second queue.
In some cases, the said another process which is signaled by the control flag and which then goes on to dequeue the payload from the first queue may be the same process which initially set the control flag. For example, this might happen when the process initially tries to grab the second lock, but fails, sets the control flag and goes on to perform a series of further operations. Then, just before releasing the first lock, it tries one final time to grab the second lock. If this time it is successful, because during the performance of the further operations the second lock was dropped by another process, the process will be able to take care of dequeueing the payload from the first queue itself. Preferably, the attempt to obtain the second lock, and upon failing, setting the control flag is performed by an atomic instruction, for example, a compare-and-swap instruction. Preferably, bits implementing the second lock and the control flag reside in the same word of memory.
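Because the second lock and the control flag share a word, the "try to lock, else signal" step collapses into a single CAS loop, as this hedged sketch shows; the two bit positions are assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SL_LOCKED ((uint32_t)1 << 0)    /* second (sock) lock bit   */
#define SL_FLAG   ((uint32_t)1 << 1)    /* control flag / drain bit */

/* Returns true if the lock was taken; otherwise the flag is now set,
 * signalling the current holder that payload awaits transfer. */
static bool trylock_or_set_flag(_Atomic uint32_t *word)
{
    uint32_t old = atomic_load(word);
    for (;;) {
        /* if unheld, take the lock; if held, raise the flag instead */
        uint32_t want = (old & SL_LOCKED) ? (old | SL_FLAG)
                                          : (old | SL_LOCKED);
        /* on failure, 'old' is refreshed and the loop retries */
        if (atomic_compare_exchange_weak(word, &old, want))
            return !(old & SL_LOCKED);
    }
}
```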
According to a further aspect, the present invention may provide a computer system having a network interface and running a plurality of concurrent processes, the system
providing a first receive queue on which received payload is enqueued, and on which the right to enqueue is governed by a first lock and from which the right to dequeue is governed by a second lock;
providing a second receive queue on which descriptors for application receive buffers are enqueued and from which the right to dequeue is governed by the second lock;
wherein when a said process, holding the first lock, processes incoming payload, the system determines from the second queue whether an application receive buffer is available for the payload; and if an application receive buffer is available, attempts to take possession of the second lock, and if the attempt fails, sets a control flag as a signal to another said process.
According to a further aspect, the present invention may provide a computer program for a computer system having a network interface and which is capable of running a plurality of concurrent processes, the computer program being operable
to establish a first receive queue on which received payload can be enqueued, and on which the right to enqueue is governed by a first lock and from which the right to dequeue items is governed by a second lock; to establish a second receive queue on which descriptors for application receive buffers can be enqueued and from which the right to dequeue is governed by the second lock; and
when a said process, holding the first lock, processes incoming payload to
determine from the second queue whether an application receive buffer is available for the payload;
and if an application receive buffer is available, attempt to take possession of the second lock, and if the attempt fails, set a control flag as a signal to another said process.
According to a further aspect, the present invention may provide a data carrier for the above computer program.
According to a further aspect of the invention, the present invention may provide a computer system having a network interface and capable of running a plurality of concurrent processes in a plurality of address spaces, the system being arranged to establish a receive queue structure comprising at least one pointer to an application receive buffer, the system being operable when a said process processes incoming payload to
identify from the receive queue structure an application receive buffer for the payload;
determine whether said application receive buffer is accessible in the current address space; and if it is not, arrange for loading of the payload into the application receive buffer to take place in a context in which said application receive buffer is accessible. Thus, in accordance with this aspect of the invention, a process is able to ensure that payload destined for an application receive buffer reaches that destination despite the fact that the pointer to the application receive buffer is valid only in a different address space.
In one preferred embodiment, the system is operable, upon determining that said application receive buffer is not accessible in the current address space, to make a system call to the kernel context in order to load the payload into the application receive buffer. In some systems, for example those running on Linux™, in the kernel context any address space can be accessed provided the page tables for the address space are known.
Preferably, information about the address space in which the pointer(s) is valid is stored either together with the pointer(s) or in the state of a socket with which the application receive buffer is associated. Preferably, a reference to the page tables for the socket is stored in a kernel-private portion of the socket state.
Due to the paging mechanisms of the kernel, a pointer may not successfully resolve to an application receive buffer even though it is valid. This may happen because the memory has been paged out to disk, or because a physical memory page has not yet been allocated. In order to deal with this situation, preferably, the system is further operable, if a process fails to address an application receive buffer, to enqueue the payload in the receive queue structure and set a control flag as a signal to another process that some payload in the receive queue structure needs to be moved to an application receive buffer.
In another preferred embodiment, the system is operable, upon determining that said application receive buffer is not accessible in the current address space, to arrange for a thread in the appropriate address space to run in order that the payload may be loaded into the application receive buffer. As an example, in the Windows™ operating system, this may be achieved by scheduling an Asynchronous Procedure Call (APC). Said receive queue structure preferably comprises a first receive queue on which received payload can be enqueued, and a second receive queue on which descriptors for application receive buffers can be enqueued.
According to a further aspect of the invention, the present invention may provide a computer system having a network interface and running a plurality of concurrent processes in a plurality of address spaces, the system establishing a receive queue structure comprising at least one pointer to an application receive buffer, wherein, when a said process processes incoming payload, the system identifies from the receive queue structure an application receive buffer for the payload; determines whether said application receive buffer is accessible in the current address space; and if it is not, arranges for loading of the payload into the application receive buffer to take place in a context in which said application receive buffer is accessible.
According to a further aspect of the invention, the present invention may provide a computer program for a computer system having a network interface which is capable of running a plurality of concurrent processes in a plurality of address spaces, the program being arranged to establish a receive queue structure comprising at least one pointer to an application receive buffer, the computer program being operable when a said process processes incoming payload to
identify from the receive queue structure an application receive buffer for the payload;
determine whether said application receive buffer is accessible in the current address space; and if it is not, arrange for loading of the payload into the application receive buffer to take place in a context in which said application receive buffer is accessible.

According to a further aspect, the present invention may provide a data carrier for the above computer program.
Exemplary embodiments of the invention are described herein with reference to the accompanying drawings, in which:
Figure 9 shows an overview of hardware suitable for performing the invention;
Figure 10 shows an overview of various portions of state in a first embodiment of the invention;
Figure 11 shows an algorithm in accordance with the invention; and
Figure 12 shows an overview of various portions of state in a second embodiment of the invention.
Referring to Figure 9, a computer system 10 in the form of a personal computer (PC) comprises a central processing unit 15 and a memory 20, and is connected to a network 35 by an Ethernet connection 30. The Ethernet connection is managed by a network interface card 25 which supports the physical and hardware requirements of the Ethernet system. Although referred to as a card, the physical hardware implementation need not be that of a card: for instance, it could be in the form of an integrated circuit and connector mounted directly on a motherboard.
Figure 10 shows the system state of a first embodiment of the invention. In the state shown in Figure 10, an application execution thread 40 has been initiated and a TCP socket 50 has been opened up enabling the application to communicate via the network interface card 25 over the network. Processing of data at the TCP layer involves two linked list queue structures: a receive queue (RQ) 70 and an asynchronous receive queue (ARQ) 80, both of which are first-in first-out (FIFO). The RQ 70 comprises a plurality of items 71 in which each item 71 references a block of data after TCP processing. Each item 71 comprises a pointer portion 71a which points to the start of the data block, and block-length portion 71b giving the length of the block. The memory region where blocks of data are stored in buffers after TCP processing is designated 95. The ARQ 80 comprises a plurality of items 81 in which each item 81 references an application receive buffer which the application 40 has pre-allocated for incoming data, for example, when an application invokes an asynchronous (or overlapped) receive request. Each item 81 consists of an application receive buffer descriptor comprising a pointer portion 81a which points to the start of the buffer and a buffer-length portion 81b defining the length of the buffer. The memory region which may be allocated for buffers for incoming data is designated 42. In other embodiments, queue structures other than linked lists may be used.
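The two queue structures might be rendered in C along the following lines. This is a minimal orientation sketch only; the type and field names are the present author's assumptions and do not come from the specification.

#include <stddef.h>

/* Minimal sketch of the two FIFO linked-list queues of Figure 10.
 * All names are illustrative assumptions. */

struct rq_item {             /* item 71 on the receive queue (RQ 70) */
    struct rq_item *next;
    void   *data;            /* pointer portion 71a: start of data block */
    size_t  len;             /* block-length portion 71b */
};

struct arq_item {            /* item 81 on the asynchronous receive queue (ARQ 80) */
    struct arq_item *next;
    void   *buf;             /* pointer portion 81a: start of application buffer */
    size_t  space;           /* buffer-length portion 81b */
};

struct fifo {                /* enqueue at the tail, dequeue from the head */
    void *head;
    void *tail;
};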
In order to protect the integrity of the RQ 70 and ARQ 80, locks 52, 62 are employed. The first lock 62 is a shared lock which is widely used to protect various portions of the system state. Hereinafter, this lock is referred to as the network interface lock or netif lock. The second lock 52 is a lock which is dedicated to protecting certain portions of the state of the socket 50. Hereinafter, this lock is referred to as a sock-lock. In this embodiment of the invention, the right to put an item 71 onto the RQ 70 is governed by the netif lock 62 and this right is denoted in Figure 10 by the arrow 62-P. The right to remove/get an item from the RQ 70 is governed by the sock-lock 52 and this right is denoted in Figure 10 by the arrow 52-G. Any process which enqueues an item on the RQ 70 has first to take possession of the netif lock 62, whereas a process which dequeues an item has first to take possession of the sock-lock 52. For the ARQ 80, the right to both put an item 71 onto the ARQ 80 and the right to remove/get an item from the ARQ 80 are governed by the sock-lock 52 and these rights are denoted in Figure 10 by the arrows 52-P and 52-G, respectively. Thus, any application which seeks to enqueue an application receive buffer descriptor on the ARQ 80 has first to take possession of the sock-lock 52, and similarly any process which dequeues an item has also first to take possession of the sock-lock 52. Also included within the socket state is a drain bit 58.

From time to time, the system makes a check to see whether there is a receive event which is ready for processing and queueing. If there is, then the algorithm shown in Figure 11 is carried out. In this case, it is assumed that a process 68 acts as the receive process and that it, at this point, is already in possession of the netif lock 62. The process 68 can be a user-level process, a kernel thread or an interrupt service routine. At step 100, the TCP layer protocol processing and de-multiplexing is carried out, and the post TCP layer processing payload/data block 93 is stored in a memory space 95. At step 102, the data block 93 is enqueued on the RQ 70 by adding an item 71 onto the RQ 70 which references the data block 93. Next, at step 104, a check is made to see whether the ARQ 80 contains any items 81. If there are buffer descriptors in the ARQ 80, this means that the application 40 has allocated some buffers for incoming data, and so the receive process 68 tries, at step 106, to grab the sock-lock 52. If there are no buffer descriptors in the ARQ 80, this means that there are no application receive buffers yet allocated, and so the receive process 68, having already deposited the payload in the RQ 70, moves on to further tasks or finishes as the case may be.
If the receive process 68 succeeds in taking possession of the sock-lock 52 without blocking, it, at step 108, performs a so-called 'drain down' operation, in which data blocks referenced in the RQ 70 (and actually stored in the memory region 95) are transferred to the buffers listed in the ARQ 80, i.e. to the memory region 42. It is a significant advantage of this embodiment of the invention that the single action of taking possession of the sock-lock 52 empowers the receive process to invoke the drain down operation, which requires dequeueing rights for both the RQ 70 and ARQ 80. Filled buffers are either removed from the ARQ 80 or their buffer-length portion 81b adjusted. At the end of the drain down operation, notification may be made to the application 40 that the operation has occurred.
If the receive process 68 fails to take possession of the sock-lock 52, then, at step 110, the drain bit 58 is set. This is the instant shown in Figure 10, as the application process thread 40 holds the sock-lock 52. Preferably, the attempt to grab the sock-lock 52 at step 106 and the setting of the drain bit 58 at step 110 are atomic. In one implementation, a word of memory can contain bits serving as the sock-lock 52 and a bit serving as the drain bit 58, and the steps 106, 110 can be performed using an atomic compare-and-swap operation. The process 68 then goes about other processing actions, and the set drain bit 58 serves as a signal to another process that the socket 50 needs attention, specifically that a drain down operation needs to be performed. This technique of delegating a required action from one process to another was described in the applicant's co-pending patent application GB0504987.9, which is incorporated herein by reference.
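The single-word layout just described might be sketched as follows using C11 atomics. The bit assignments and the function name are assumptions made for illustration; a real implementation would have to fit the system's actual lock word.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SOCK_LOCK_BIT 0x1u   /* assumed bit serving as the sock-lock 52 */
#define DRAIN_BIT     0x2u   /* assumed bit serving as the drain bit 58 */

/* Atomically either take the sock-lock (step 106) or, if it is already
 * held, set the drain bit (step 110), in one compare-and-swap loop. */
static bool sock_trylock_or_set_drain(_Atomic uint32_t *word)
{
    uint32_t old = atomic_load(word);
    for (;;) {
        uint32_t desired = (old & SOCK_LOCK_BIT) ? (old | DRAIN_BIT)
                                                 : (old | SOCK_LOCK_BIT);
        if (atomic_compare_exchange_weak(word, &old, desired))
            return !(old & SOCK_LOCK_BIT);   /* true: lock acquired */
        /* CAS failed: 'old' was reloaded with the fresh value; retry. */
    }
}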
It is normal and advantageous that the application process 40, just before releasing the sock-lock 52, will notice that the drain bit 58 has been set and perform the drain down operation before releasing the sock-lock 52. However, if that does not happen, a subsequent process which takes possession of the sock-lock 52 will notice that the drain bit 58 has been set and take care of the drain down operation before releasing the sock-lock 52.
In other embodiments, the check, at step 104, to determine whether the ARQ 80 is empty or not, can be carried out before incoming payload is enqueued in the RQ 70 (at step 102). Thus, when the ARQ 80 is non-empty, the RQ 70 can be completely bypassed.
Figure 12 shows the system state of a second embodiment of the invention operating on a Linux™ operating system in a multiple address space environment, including a kernel context and at least one user-level context. The second embodiment is substantially the same as the first embodiment, except that it includes features, discussed hereinafter, to handle the multiple address space environment. In this embodiment, the RQ 70 and ARQ 80 reside in shared memory that is directly accessible in at least two, and possibly all, of the multiple address spaces. Referring to Figure 12, some time after opening up a socket, an address space is associated with the socket. Each address space is allocated an address space tag which uniquely corresponds to a single address space. The address space tag for the socket is stored in the shared socket state and is designated by reference numeral 56. A reference to the page tables for the address space associated with the socket 50 is stored in a kernel-private buffer 60 rather than in the shared socket state, to ensure that it cannot be corrupted by user-level processes, as that would be a hazard to system security.

The second embodiment operates similarly to the first embodiment and essentially performs the Figure 11 algorithm. However, before performing step 108, where the receive process 68 is required to write to the application receive buffer 42, the process 68 compares the address space tag 56 of the socket with that of the current address space. If they do not match (and the process is not executing in the kernel context), then the process 68 cannot address the application receive buffer 42 because the pointer portion 81a of the application receive buffer descriptor will only validly resolve to the correct address within the same address space. However, in the Linux kernel context, any address space for any process can be accessed provided the page tables for the address space are known. Therefore, the task of loading the application receive buffer 42 is passed to a kernel context routine which, using the page tables in the kernel-private buffer 60 and standard operating system routines, is able to resolve the relevant pointer 81a to the correct address.
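A sketch of the address-space check follows, under the assumption of helper routines for obtaining the current tag and entering the kernel path; none of these names come from the specification.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed helpers standing in for mechanisms the text leaves abstract. */
extern uint32_t current_aspace_tag(void);
extern bool     in_kernel_context(void);
extern int      copy_to_app_buffer(void *dst, const void *src, size_t len);
extern int      kernel_drain_call(int sock_id, const void *src, size_t len);

struct shared_sock_state {
    int      sock_id;
    uint32_t aspace_tag;   /* tag 56: address space owning the buffer */
    void    *app_buf;      /* pointer 81a, valid only in that space */
};

int deliver_to_app_buffer(struct shared_sock_state *s,
                          const void *payload, size_t len)
{
    if (s->aspace_tag == current_aspace_tag() || in_kernel_context())
        /* Pointer 81a resolves here: write directly into the buffer. */
        return copy_to_app_buffer(s->app_buf, payload, len);

    /* Wrong address space: pass the task to a kernel-context routine,
     * which can resolve 81a via the page tables in buffer 60. */
    return kernel_drain_call(s->sock_id, payload, len);
}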
Due to the paging mechanisms of the kernel, a pointer may not successfully resolve to a buffer even though it is valid. This may happen because the memory has been paged out to disk, or because a physical memory page has not yet been allocated. In such circumstances, it is not possible for the process 68 to access the buffer. Instead, the data block is added to the RQ 70 and the system ensures that the application thread 40 (or another thread using the same address space) is awoken, and subsequently performs a drain-down operation. The normal paging mechanisms of the operating system will make the application receive buffer available in this case.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
SECTION C
I/O COMPLETION PORT
The present invention relates to a computer system operating a user-level stack.
Conventionally, an application running on a computer system communicates over a network by opening up a socket. The socket connects the application to remote entities by means of a network protocol, for example, TCP/IP. The application can send and receive TCP/IP messages by invoking the operating system's networking functionality via system calls which cause the messages to be transported across the network. System calls cause the CPU to switch to a privileged level and start executing routines in the operating system.
An alternative approach is to use an architecture in which at least some of the networking functionality, including the stack implementation, is performed at the user level, for example, as described in the applicant's co-pending PCT applications WO 2004/079981 and WO 2005/104475.
Applications which require high performance networking typically elect to use an overlapped (asynchronous) I/O mode of operation or non-blocking mode of operation together with I/O synchronisation mechanisms.
Some I/O synchronisation mechanisms involve the application directly interrogating sockets, which it specifies, to obtain I/O status information for the specified sockets. The interrogation is performed by operating system routines which are invoked by a system call. The application specifies the sockets of interest to it in the argument of the system call.
Some I/O synchronisation mechanisms involve the use of I/O synchronisation objects, such as, for example, I/O completion ports in Windows™. I/O synchronisation objects can be associated with one or more sockets and later interrogated by the application to provide I/O status information on the overlapped I/O operations on the associated socket(s). The creation, updating and interrogation of the I/O synchronisation objects is performed by operating system routines which are invoked by a system call. The application specifies the I/O synchronisation object of interest to it in the argument of the system call.
Some problems can arise when I/O synchronisation mechanisms are used with a user-level stack architecture.
First, as the operating system is not operating the stack itself, it is blind, in some systems, to the data traffic passing through a particular socket. In order to prevent the operating system returning an inaccurate result should an application request status information from an I/O synchronisation object via a system call, the user-level stack has the responsibility of keeping the I/O synchronisation object updated as appropriate. However, this updating is performed by a system call and is, therefore, expensive and should not be performed when it is not necessary. However, at the user level, in some systems, there is no way to discern whether or not the application has set up an I/O synchronisation object for a particular socket.
Second, in a typical stack implementation, interrupts are generated by incoming events in order to allow prompt updating of the stack. In a user-level stack, an interrupt incurs a particularly heavy overhead and so interrupts may be selectively enabled. While selective enablement of interrupts is beneficial in terms of overall system performance, at any given instant, there is a danger that an application requesting I/O status information from an I/O synchronisation mechanism via a system call may be given a misleading result.
With the foregoing in mind, according to one aspect, the present invention may provide a computer system capable of operating in a network and being arranged to establish a user-level stack, the system, when an application associates an I/O synchronisation object with a socket, being responsive to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
By so recording the association of an I/O synchronisation object with the socket, this information about the socket becomes available at the user level. As a result, when the user-level stack associated with the socket has information which should be used to update an I/O synchronisation object, a check can be made to see whether an associated I/O synchronisation object exists, and only in the case that it does exist is the system call made to effect the update to the I/O synchronisation object. In this way, the overhead of pointless system calls is avoided.
Preferably, the association of an I/O synchronisation object with the socket is recorded in its user-level state.
Preferably, the system is configured to direct said system call made by the application to a user-level routine, in which the recording of the said association may take place.
In one embodiment, the I/O synchronisation object comprises an I/O completion port, and the system call, CreateIoCompletionPort(), serves to create an I/O completion port and/or associate it with a socket. The system call may also serve to associate the I/O completion port with another file object. In other embodiments, the I/O synchronisation object may be created by one system call and associated with a socket or other file object by a separate system call.
According to a further aspect of the invention, the present invention may provide a computer system capable of operating in a network and being arranged to establish a user-level stack, the system being configured to direct a system call, made by an application, requesting I/O status information to a user-level routine which is operable to update a user-level stack;
make a system call to request the information previously requested by the application; and
return the result of the system call to the application.
Configuring the system to direct an application's system call to a user-level routine provides an opportunity for the user-level routine to update a user-level stack before going on to make a system call mimicking or duplicating that made by the application. In providing this opportunity to update the user-level stack, this aspect of the invention can reduce the likelihood that misleading information will be returned to the application.
Preferably, the user-level routine updates a user-level stack which has relevance to the request from the application. This is particularly advantageous in systems with more than one stack, since it may be beneficial to perform updates on only those one or more stacks which have relevance to the information requested by the application. According to some I/O synchronisation mechanisms, the relevance of a particular stack to the request from the application can be ascertained from the sockets specified in the application's request. When the I/O synchronisation mechanisms involve I/O synchronisation objects, the relevance of a particular stack can be ascertained from the one or more sockets with which the I/O synchronisation object is associated. When the I/O synchronisation object is associated with the sockets of more than one user-level stack, preferably, the user-level routine updates all the associated user-level stacks. In other embodiments of the invention, the user-level routine can update all the user-level stacks regardless of their number and relevance to the request made by the application.
Depending on the kind of I/O synchronisation mechanism, the nature of the I/O status information may vary. In one embodiment, the I/O status information comprises event-based information, for example, information about I/O operations which have completed. In other embodiments, the I/O status information may comprise state-based information, for example, information about whether or not the socket has received data available for reading.
Preferably, the I/O synchronisation mechanism comprises an I/O synchronisation object associated with a said user-level stack. An example of this is when the I/O synchronisation object comprises an I/O completion port, and the system call, GetQueuedCompletionStatus(), returns a list of completed I/O operations.
It is preferred that the user-level routine is operable, based on certain operating conditions, to make a determination as to whether it is currently opportune to update the user-level stack, and to update the user-level stack only when it is determined to be opportune.
If it is determined that it is currently not opportune to update the stack, then the system call may be made without updating the stack.
For example, the determination may include a check as to whether there is any data awaiting processing by the user-level stack. In some cases there may not be, and so there is no point in arranging for the stack to be specially updated.
Another example relates to the lock which often governs the right to update the stack in order to prevent multiple processes from accessing the stack concurrently. The lock must be obtained before a stack update can be performed. Accordingly, the system may comprise a lock that governs the right to update the user-level stack, wherein the determination may include a check as to whether the lock is locked. It is preferred that if the lock cannot be obtained without blocking, meaning that it is already locked, i.e. held by another process thread, then similarly the system call should be made without updating the stack. In this case, it is possible that the process which is currently holding the lock may attend to the updating of the stack.

It is desirable in a user-level stack to avoid the heavy overhead incurred by interrupts. However, it is at times desirable to enable interrupts because a user-level process may not be available to update the stack itself, for example, because it is blocked. Accordingly, it is preferred that the interrupts for the user-level stack are selectively enabled.
A flag may be used to store the enablement status of the interrupts for the user-level stack. An advantage of this approach is that, when interrupts are enabled, this state can be determined and so no attempt will be made to enable interrupts when they are already enabled.
Preferably, said interrupts are not enabled when, during the stack updating, a process thread was awoken. It is advantageous not to enable interrupts when, during the updating of the stack, a process thread was awoken, since that thread will probably take care of updating the stack.
Preferably, said interrupts are not enabled, if the lock was locked. Again, another process thread may well take care of updating the stack.
Furthermore, it is preferable that said interrupts are not enabled, if the said system call made by the application was non-blocking.
Preferably, the determination of whether it is opportune to update the stack includes a check as to whether the interrupts are enabled. If they are enabled, then the stack is not updated.
Preferably, the system is configured to direct the system call made by the application to the user-level routine using a dll interception mechanism which is discussed hereinafter.

According to a further aspect, the present invention may provide a computer system operating in a network and providing a user-level stack, the system, when an application associates an I/O synchronisation object with a socket, responding to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
According to a further aspect, the present invention may provide a computer program for a computer system capable of operating in a network and being arranged to establish a user-level stack, the program, when an application associates an I/O synchronisation object with a socket, being responsive to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user- level process.
According to a further aspect, the present invention may provide a data carrier bearing the above computer program.
According to a further aspect, the present invention may provide a computer system operating in a network and providing a user-level stack, the system directing a system call, made by an application, requesting I/O status information to a user-level routine which
updates a user-level stack;
makes a system call to request the information previously requested by the application; and
returns the result of the system call to the application.
According to a further aspect, the present invention may provide a computer program for a computer system capable of operating in a network and being arranged to establish a user-level stack, the system being configured to direct a system call, made by an application, requesting I/O status information to the program which runs at the user level and which is operable to
update a user-level stack;
make a system call to request the information previously requested by the application; and
return the result of the system call to the application.
According to a further aspect, the present invention may provide a data carrier bearing the above computer program.
According to a further aspect, the invention may provide a method for use in a computer system capable of operating in a network and being arranged to establish a user-level stack, the method comprising, when an application associates an I/O synchronisation object with a socket, responding to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
According to a yet further aspect, the invention may provide a method for use in a computer system operating in a network and providing a user-level stack, the method comprising directing a system call, made by an application, requesting I/O status information to a user-level routine which
updates a user-level stack;
makes a system call to request the information previously requested by the application; and returns the result of the system call to the application.
Exemplary embodiments of the invention are hereinafter described with reference to the accompanying drawings, in which
Figure 13 shows the basic architecture of a computer system operating a user- level stack;
Figure 14 illustrates the operation of a computer system in accordance with an embodiment of the invention having created an I/O completion port; and
Figure 15 shows a routine in accordance with an aspect of the invention.
By way of introductory background, a basic example of an architecture of a computer system 10 running the Windows™ operating system and providing networking functionality at the user level is shown in Figure 13. The system 10 comprises the operating system 12, the user/application space 14 and the network interface hardware 16. When an application 18a, 18b wants to request a networking operation, it does so via a user-mode library using the Winsock API. In Windows terminology, the user-mode library comprises a Winsock API implementation 22 and a Winsock service provider or WSP 24. Windows™ is supplied with a default WSP, the Microsoft™ TCP/IP WSP or MS WSP. This WSP does very little other than map the networking operations requested by an application onto the corresponding operating system networking call interface and then invoke the appropriate operating system routine. However, in the system 10, the WSP 24 provides the networking functionality, including the implementation of the stack 30. Thus, when an application 18a, 18b requires a networking-related operation to be performed, it invokes a command, say send() or receive(), supported by the Winsock API, which is then carried out by the WSP 24. Of course, the WSP 24 can also make use of existing operating system networking functionality to the extent that it needs to.

Figure 14 shows the computer system 10 where, for clarity of illustration, the user-mode library has been omitted. Also, the stack 30 has been illustrated as comprising a receive path 30R and a transmit path 30T. In the situation in Figure 14, the application 18a has opened up a socket 32 for communication with the network. Using data reception as an example, when new data is received from the network interface 16, it is passed to an incoming event queue selected according to the destination of the data. From the appropriate event queue the data can be protocol processed and validated by the receive path 30R of the stack 30. The receive path 30R handles the data from three event queues 31a, 31b, 31c. Thereafter, the data can then be passed to a receive queue, e.g. the receive queue 34, managed by the socket of the respective application process to which the data pertains. If the data in one event queue relates to more than one socket, it is demultiplexed onto the appropriate sockets. When a stack processes data from an event queue and delivers it to a socket, this process may be referred to as updating the stack. During data transmission, an analogous process takes place on the transmit path 30T of the stack 30, as is well known in the art. A lock 40 governs the right to update both paths 30R, 30T of the stack 30. The system 10 may also comprise other stacks (not shown), each of which is protected by its own lock.
Although it is not compulsory to do so upon opening the socket 32, in Figure 14, the application has chosen to set up an I/O completion port 35 for the socket 32. In this example, the I/O completion port 35 is shown as being associated with only the socket 32. In other embodiments, it may be associated with more than one socket and other system objects. The I/O completion port 35 serves as a repository in which the details of completed overlapped I/O operations made from the socket 32 are stored as a list. Details of various types of I/O operations are stored including, for example, transmitting, receiving, connecting, disconnecting, and accepting new connections. It will be noted that, by virtue of being associated with the socket 32, the I/O completion port 35 inherently becomes associated with the stack 30 which services the socket 32. The completion port is set up by a CreateIoCompletionPort() system call by the application 18a. However, this system call does not pass through the WSP 24, and so in the normal course of events, the CreateIoCompletionPort() function would be looked up from a function table maintained in the application. From this table, a pointer corresponding to the CreateIoCompletionPort() function would be identified and the operating system code referenced by the pointer invoked. However, in this embodiment, the system is configured to direct the CreateIoCompletionPort() system call to a user-level dll (dynamic link library) function, denoted by the reference numeral 45 in the drawings. This configuration is achieved by pre-replacing, for example, during the initialisation of the application, or when the application first seeks to use the user-level networking functionality, the original pointer for the CreateIoCompletionPort() system call with a pointer to the user-level dll 45. By configuring the system in this way, the dll 45 may be thought, from the perspective of the application 18a, to be intercepting its original system call. To conveniently distinguish here between the original system call and the corresponding dll function, the prefix intercept_ will hereinafter be used for the dll function name. Intercept_CreateIoCompletionPort() is operable to ascertain the socket for which an I/O completion port has been requested, i.e. read the argument of the CreateIoCompletionPort() system call, and to record that information, possibly as a single bit, in the user-level state. This information is referred to as the completion port indicator and denoted by 33 in Figure 14. Intercept_CreateIoCompletionPort() then goes on to make the system call CreateIoCompletionPort(), which then does the work of actually initialising the I/O completion port object 35 and associating it with the socket 32. The dll 45 is able to call CreateIoCompletionPort() because, like an application, it maintains its own function table but, in this case, the pointer corresponding to the CreateIoCompletionPort() function still points to the appropriate operating system code.
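The interception mechanism might look roughly like the following in C. The Windows types are stubbed out and the helper names are assumptions, so this is a sketch of the pointer-swap idea rather than a working Windows implementation.

/* Stubs for the Windows types, to keep the sketch self-contained. */
typedef void              *HANDLE;
typedef unsigned long      DWORD;
typedef unsigned long long ULONG_PTR;

typedef HANDLE (*create_iocp_fn)(HANDLE file, HANDLE existing_port,
                                 ULONG_PTR key, DWORD n_threads);

/* Assumed helper that sets the completion port indicator 33 in the
 * user-level state of the socket. */
extern void set_completion_port_indicator(HANDLE sock);

/* The dll's own saved pointer still references the operating system code. */
static create_iocp_fn real_CreateIoCompletionPort;

static HANDLE Intercept_CreateIoCompletionPort(HANDLE sock, HANDLE port,
                                               ULONG_PTR key, DWORD n_threads)
{
    /* Record, in user-level state, that this socket has a completion port. */
    set_completion_port_indicator(sock);
    /* Then let the operating system do the real work. */
    return real_CreateIoCompletionPort(sock, port, key, n_threads);
}

/* During application initialisation: save the original table entry and
 * replace it so the application's call lands on the intercept. */
void patch_function_table(create_iocp_fn *app_table_entry)
{
    real_CreateIoCompletionPort = *app_table_entry;
    *app_table_entry = Intercept_CreateIoCompletionPort;
}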
After setting up the I/O completion port 35, the application 18a may from time to time initiate I/O requests on the socket 32. Depending on the state of the system, these I/O requests may complete immediately, or at a later time as a result of, for example, the processing of network events in the event queue 31a. When a requested I/O operation for the socket 32 has completed, the stack 30 will be notified of the completion and make a check in the user-level state of the socket 32, specifically the completion port indicator 33, and determine whether the socket 32 has an associated I/O completion port 35. If there is an associated I/O completion port 35, then a system call is made to update the I/O completion port 35 to that effect. If there is no associated I/O completion port 35, then no system call is made.
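In outline, the check made on completion could be as simple as the following sketch, in which the structure and helper names are assumed:

#include <stdbool.h>

struct user_sock_state {
    bool has_completion_port;   /* completion port indicator 33 */
    /* ... remaining user-level socket state ... */
};

struct completion;              /* details of a completed I/O operation */
extern void post_completion_to_port(struct user_sock_state *s,
                                    struct completion *c);  /* system call */

/* Called by the stack 30 when an I/O request on the socket completes. */
void on_io_complete(struct user_sock_state *s, struct completion *c)
{
    if (!s->has_completion_port)
        return;                       /* no port: skip the system call */
    post_completion_to_port(s, c);    /* update the I/O completion port 35 */
}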
It is a significant advantage of this embodiment of the invention that, in the case that there is no associated I/O completion port 35, no system call is made. This is a substantial reduction in processing overhead in comparison with an implementation in which the user-level networking functionality has no way of knowing whether an I/O completion port for a given socket exists and blindly makes a system call every time an I/O operation completes.
Protocol processing stacks implemented in an operating system tend to be interrupt driven. This means that no matter what the relevant application is doing, the stack will always be prioritised when network events occur (e.g. when data is received from the network and is passed onto an event queue) because the protocol processing software is invoked in response to such events by means of interrupts. On the other hand, user-level architectures, which involve certain processing tasks being driven by a user process rather than by an operating system, can suffer from the following disadvantage. If the driving process is blocking, waiting for control of the CPU, or performing a non-data-transfer task, then the user-level stack may not be given control of the CPU. In the case of reception of data from the network, this can result in data being caused to wait in the event queues, for example, event queues 31a, 31b, 31c, in an unprocessed state for an extended time. In the case of transmission of data onto the network, it can delay the pushing of packets from sockets' buffers onto the link and cause the link to become idle. Thus, it will be appreciated that there is a propensity for processing backlogs to accumulate for the stack 30. One extreme solution to the problem of the user-level stack falling behind is to generate an interrupt every time new data is received. However, because interrupts have a heavy associated overhead, this is not desirable. Instead, it is preferred that interrupts are selectively enabled. While selective enablement of interrupts is beneficial in terms of overall system performance, at any given instant, there is a danger that an application requesting status information from an I/O completion port via a system call may be given a misleading result.
In order to request information from the I/O completion port 35, the application 18a makes the GetQueuedCompletionStatus() system call. This system call returns a list of completed overlapped I/O operations for the sockets associated with the I/O completion port. In this case, only the socket 32 is associated with the I/O completion port 35. There are various arguments for GetQueuedCompletionStatus(), some of which will not be discussed further here, and also an argument related to a timeout period. When the timeout is zero, GetQueuedCompletionStatus() immediately returns with a list of the available data. When the timeout is non-zero, the call blocks until at least one I/O operation completes, or the timeout period elapses, whichever is the sooner.
To cope with the danger that a naked GetQueuedCompletionStatus() system call might well return a misleading result because of the possible processing backlog in the user-level stack 30, whereby data could, for example, be sitting in the event queues 31a, 31b, 31c unprocessed, an intercepting dll function, intercept_GetQueuedCompletionStatus(), is used, employing the same mechanism discussed above in relation to CreateIoCompletionPort(), since the GetQueuedCompletionStatus() system call also does not pass through the WSP 24.

Figure 15 shows a routine 50 carrying out intercept_GetQueuedCompletionStatus(). It is assumed that the initial GetQueuedCompletionStatus() system call by an application is made with a timeout argument set to timeout, e.g. GetQueuedCompletionStatus(..., timeout). At step 52, a flag, do_enable_interrupts, to signal whether to enable interrupts is set to true. True means that interrupts should be set and false means that they should not be set. Next, at step 54, a test is made to determine whether the user-level stack 30 needs updating. If it needs updating, then, at step 56, an attempt is made to grab the lock 40 without blocking. If possession of the lock 40 is taken, then the situation is opportune to update the stack. At step 58, the stack is updated at the user level. At step 60, the lock 40 is released. At step 62, a test is made to determine whether, during stack updating at step 58, a process thread was awoken, and if a process thread was awoken, do_enable_interrupts is set to false. Next, at step 64, a system call, GetQueuedCompletionStatus(..., 0), is made. It will be noted that the timeout argument is set to zero, meaning the system call is non-blocking and will return immediately. Next, at step 66, it is determined whether any data was returned by the system call, or whether the original timeout argument supplied by the calling application, i.e. timeout, was zero. If timeout was zero or some data was returned by the system call, then a return is made to the calling application at step 67. This is because if the timeout argument was set to zero, then the application wanted a non-blocking response. And, if some data was returned, then this should be promptly reported to the application.
At step 54, if the stack did not need updating, then the routine goes straight to step 64, where the non-blocking system call GetQueuedCompletionStatus(..., 0) is made.
At step 56, if the attempt to grab the lock 40 without blocking failed, then, at step 68, the do_enable_interrupts flag is set to false and the routine 50 goes on to step 64, where the non-blocking system call GetQueuedCompletionStatus(..., 0) is made.
At step 72, the do_enable_interrupts flag is inspected, and if it is true and interrupts for the stack 30 are not already enabled, then they are enabled. At step 74, the system call GetQueuedCompletionStatus(..., timeout) is made, and when this returns, the routine 50 then returns to the calling application.
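Routine 50 can be summarised in C roughly as below. Every helper (stack_needs_update, trylock, gqcs, and so on) is an assumed name wrapping the corresponding operation in the text, with gqcs() standing in for the GetQueuedCompletionStatus() system call; this is a sketch of the control flow of Figure 15, not a definitive implementation.

#include <stdbool.h>

extern bool stack_needs_update(void);
extern bool trylock(void);                 /* non-blocking grab of lock 40 */
extern void update_stack(void);
extern void unlock(void);
extern bool thread_was_woken(void);
extern bool interrupts_enabled(void);
extern void enable_interrupts(void);
extern int  gqcs(int timeout_ms);          /* wraps GetQueuedCompletionStatus();
                                              returns >0 if completions listed */

int intercept_GetQueuedCompletionStatus(int timeout_ms)
{
    bool do_enable_interrupts = true;          /* step 52 */

    if (stack_needs_update()) {                /* step 54 */
        if (trylock()) {                       /* step 56 */
            update_stack();                    /* step 58 */
            unlock();                          /* step 60 */
            if (thread_was_woken())            /* step 62 */
                do_enable_interrupts = false;
        } else {
            do_enable_interrupts = false;      /* step 68 */
        }
    }

    int result = gqcs(0);                      /* step 64: non-blocking */
    if (result > 0 || timeout_ms == 0)         /* step 66 */
        return result;                         /* step 67 */

    if (do_enable_interrupts && !interrupts_enabled())
        enable_interrupts();                   /* step 72 */
    return gqcs(timeout_ms);                   /* step 74: may block */
}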
In another embodiment, before step 72, the routine may return to the beginning at step 52, and continue spinning in a loop until either some data is returned by GetQueuedCompletionStatus() or a short spin period has elapsed, whereafter the routine continues at step 72. The advantage of this approach is that interrupts are not enabled if any I/O operations complete within the spin period.
Thus, it will be appreciated that intercept_GetQueuedCompletionStatus() 50 invokes a stack update if operating conditions, like, for example, the availability of the lock 40 and the state of the receive event queues 31a, 31b, 31c, make it advantageous to do so, whereby the list returned to the calling application 18a should be up-to-date.
Although the exemplary embodiments of the invention have been discussed purely in terms of the Windows™ operating system and a specific I/O synchronisation object supported by Windows, namely an I/O completion port, the invention may be implemented on any other suitable operating system. Examples of other I/O synchronisation objects supported by other operating systems include "kqueues", "epoll", and "realtime signal queues". Furthermore, the invention may be implemented using an I/O synchronisation mechanism which does not use an I/O synchronisation object, for example, "poll" and "select".
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
SECTION D
ASYNC RECEIVE QUEUE II
The present invention relates to a computer system having a network interface and which is capable of running a plurality of concurrent processes.
When data is received from the network interface, it can be delivered to a receive buffer of a destination application using a variety of receive models. According to the conventional synchronous receive model, a received packet is de-multiplexed to an appropriate socket and the payload is delivered to the receive queue for that socket. Then, at some later time, the application transfers the payload to its receive buffer. In the case where the application requires the payload before it has been received, the application may choose to block until it arrives. According to the conventional asynchronous receive model, the application may pre-allocate an application receive buffer to which subsequently received payload should be delivered. Normally, if a receive buffer has been allocated, then received payload is delivered directly to the receive buffer. If a receive buffer has not been allocated, then the payload is enqueued on the receive queue. Often, an application will allocate many receive buffers for incoming payload, and so descriptors for the allocated receive buffers are stored in a queue.
In a practical system in which a plurality of concurrent processes are running, and in which the operating system may be required to support an application employing the synchronous receive model or the asynchronous receive model, or both models simultaneously, it is necessary to protect the integrity of the queues. One technique for protecting commonly accessible portions of state is to use locks. Intrinsic to the use of locks is establishing a lock regime which achieves a favourable trade-off between the desirable protective effect which they afford, and the drag that they have on system performance due to lock contention. With the foregoing in mind, according to one aspect, the present invention may provide a computer system having a network interface and capable of running a plurality of concurrent processes, the system being arranged to
establish a first queue on which received payload can be enqueued, and on which the right to enqueue is governed by a first lock;
establish a second queue on which descriptors for application receive buffers can be enqueued and from which the right to dequeue is governed by the first lock;
wherein the right to dequeue from the first queue is governed by a second lock when the second queue is empty, and by the first lock when the second queue is not empty.
Thus, in accordance with this aspect of the invention, the above-defined lock regime is advantageous in that an application process operating according to the asynchronous receive model when allocating a new application receive buffer for incoming payload is able by virtue of taking possession of the second lock, when the second queue is empty, to transfer payload from the first queue to the new application receive buffer without further having to obtain possession of the first lock. The above-defined lock regime is also advantageous in that an application process operating according to the synchronous model is able by virtue of taking possession of the second lock, when the second queue is empty, to transfer payload from the first queue to a new application receive buffer specified by the application process without further having to take possession of the first lock. The above-defined lock regime is also advantageous in that when a process, holding the first lock, is processing incoming payload, it is able either, when the second queue is empty, to enqueue the incoming payload on the first queue, or, when the second queue is not empty, to perform a drain down operation as described later, without further having to obtain possession of the second lock. Preferably, the system, when an application process, holding the second lock, has a new descriptor for enqueueing on the second queue, is operable, when the second queue is empty, to dequeue payload from the first queue and transfer it to the application receive buffer specified by the said new descriptor.
Preferably, the system, when an application process seeks to receive incoming payload and takes possession of the second lock, is operable to transfer payload from the first queue to a new application receive buffer specified by the application receive process.
Preferably, the system, when a process, holding the first lock, processes incoming payload is operable, when the second queue is empty, to enqueue the incoming payload on the first queue, and, when the second queue is not empty, to transfer payload from the first queue to a buffer specified by a descriptor in the second queue. In one embodiment, payload transferred from the first queue to a buffer may from time to time be said incoming payload when said incoming payload is deposited in the first queue without first determining the condition of the second queue. In other embodiments, the incoming payload, when the second queue is not empty, may completely bypass the first queue.
Preferably, the right to enqueue on the second queue is governed by the second lock. This is advantageous in that an application process operating according to the asynchronous receive model is able, by virtue of taking possession of the second lock, when the second queue is empty, to not only transfer payload from the first queue to the new application receive buffer, but also add the descriptor corresponding to the new buffer to the second queue without having to obtain possession of the first lock. Preferably, adding the descriptor corresponding to the new application receive buffer takes place when either the first queue is empty or the second queue is non-empty.
In various situations, the system may be operable to perform a drain down operation in which, while the first and second queues are not empty, payload is transferred from the first queue to buffers specified in the second queue. In other words, payload is transferred until there is no more payload in the first queue or there are no more buffers specified in the second queue, whichever comes first. One situation is after the application process has caused the new buffer descriptor to be added to the second queue, but during that operation, payload has been added to the first queue, whereby it has become non-empty. At this point, an attempt is made to take possession of the first lock. Once the first lock is obtained, the drain down operation is carried out. Another situation is when the process responsible for delivering the incoming payload to a socket finds that the second queue is not empty. This also presents an opportunity for draining from the first queue if it is needed.
According to a further aspect, the present invention may provide a computer system having a network interface and running a plurality of concurrent processes, the system
providing a first queue on which received payload can be enqueued, and on which the right to enqueue is governed by a first lock;
providing a second queue on which descriptors for application receive buffers can be enqueued and from which the right to dequeue is governed by the first lock;
wherein the right to dequeue from the first queue is governed by a second lock when the second queue is empty, and by the first lock when the second queue is not empty.
According to a further aspect, the present invention may provide a computer program for a computer system having a network interface and capable of running a plurality of concurrent processes, the program being arranged to
establish a first queue on which received payload can be enqueued, and on which the right to enqueue is governed by a first lock;
establish a second queue on which descriptors for application receive buffers can be enqueued and from which the right to dequeue is governed by the first lock;
wherein the right to dequeue from the first queue is governed by a second lock when the second queue is empty, and by the first lock when the second queue is not empty.
According to a further aspect, the present invention may provide a data carrier bearing the above program.
Exemplary embodiments of the invention are described herein with reference to the accompanying drawings, in which:
Figure 16 shows an overview of hardware suitable for performing the invention;
Figure 17 shows an overview of various portions of state of a socket in an embodiment of the invention;
Figure 18 shows a first lock configuration applied to the Figure 17 embodiment;
Figure 19 shows a second lock configuration applied to the Figure 17 embodiment;
Figure 20 shows an algorithm for delivering a packet to the socket in accordance with the invention;
Figure 21 shows an algorithm according to an asynchronous receive model of operation; and
Figure 22 shows an algorithm according to a synchronous receive model of operation.

Referring to Figure 16, a computer system 10 in the form of a personal computer (PC) comprises a central processing unit 15 and a memory 20, and is connected to a network 35 by an Ethernet connection 30. The Ethernet connection is managed by a network interface card 25 which supports the physical and hardware requirements of the Ethernet system. Although referred to as a card, the physical hardware implementation need not be that of a card: for instance, it could be in the form of an integrated circuit and connector mounted directly on a motherboard.
Figure 17 shows the system state of an embodiment of the invention. In the state shown in Figure 17, an application process thread 40 has been initiated and a TCP socket 50 has been opened up enabling the application to communicate via the network interface card 25 over the network. Processing of data at the TCP layer involves two linked list queue structures: a receive queue (RQ) 70 and an asynchronous receive queue (ARQ) 80, both of which are first-in first-out (FIFO). The RQ 70 comprises a plurality of items 71 in which each item 71 references a block of data after TCP processing. Each item 71 comprises a pointer portion 71a which points to the start of the data block, and block-length portion 71b giving the length of the block. The memory region where blocks of data are stored in buffers after TCP processing is designated 95. The ARQ 80 comprises a plurality of items 81 in which each item 81 references an application receive buffer which the application 40 has pre-allocated for incoming data, for example, when an application makes an asynchronous receive request. Each item 81 consists of an application receive buffer descriptor comprising a pointer portion 81a which points to the start of the buffer and a buffer-length portion 81b defining the length of the buffer. The memory region which may be allocated for buffers for incoming data is designated 42. In other embodiments, queue structures other than linked lists may be used.
In order to protect the integrity of the RQ 70 and ARQ 80, locks 52, 62 (shown in Figures 18 and 19) are employed. The first lock 62 is a shared lock which is widely used to protect various portions of the system state, including, for example, portions of state of other (unshown) sockets. Hereinafter, this lock is referred to as the network interface lock or netif lock. The second lock 52 is a lock which is dedicated to protecting certain portions of the state of the specific socket 50. Hereinafter, this lock is referred to as the sock-lock 52.
The configuration of the locks varies according to the condition of the ARQ 80. Figure 18 represents the lock configuration when the ARQ 80 contains no items 81. In the drawings, the symbol 0 is used to represent an empty condition. The right to put an item 71 onto the RQ 70 is governed by the netif lock 62 and this right is denoted in Figure 18 by the arrow 62-P. The right to remove/get an item from the RQ 70 is governed by the sock-lock 52 and this right is denoted in Figure 18 by the arrow 52-G. Any process which enqueues an item on the RQ 70 has first to take possession of the netif lock 62, whereas a process which dequeues an item has first to take possession of the sock-lock 52. For the ARQ 80, the right to put an item 81 onto the ARQ 80 is governed by the sock-lock 52, and this right is denoted in Figure 18 by the arrow 52-P. Thus, any application which seeks to enqueue an application receive buffer descriptor on the ARQ 80 has first to take possession of the sock-lock 52. As, by definition, the ARQ 80 is empty in Figure 18, no right to dequeue has been illustrated. Figure 19 represents the lock configuration when the ARQ 80 contains at least one item 81. The rights both to put an item 71 onto the RQ 70 and to remove/get an item 71 from the RQ 70 are governed by the netif lock 62 and these rights are denoted in Figure 19 by the arrows 62-P and 62-G. Any process which enqueues an item on the RQ 70 has first to take possession of the netif lock 62, and any process which seeks to dequeue an item has likewise first to take possession of the netif lock 62. For the ARQ 80, the right to put an item 81 onto the ARQ 80 is governed by the sock-lock 52, and the right to remove/get an item from the ARQ 80 is governed by the netif lock 62, and these rights are denoted in Figure 19 by the arrows 52-P and 62-G, respectively. Thus, any application which seeks to enqueue an application receive buffer descriptor on the ARQ 80 has first to take possession of the sock-lock 52, and any process which dequeues an item has first to take possession of the netif lock 62.

From time to time, the system makes a check to see whether there is a received packet which is ready for processing and queueing. If there is, then the algorithm 100 for delivering the received packet to the socket 50, as shown in Figure 20, is invoked/performed. In this case, it is assumed that a process 68 is awakened and acts as the receive process, and that it, at this point, has already taken possession of the netif lock 62. At step 101, the TCP layer protocol processing and demultiplexing is carried out, and the post TCP layer processing payload/data block 93 is stored in a memory space 95. At step 102, the data block 93 is enqueued on the RQ 70 by adding an item 71 onto the RQ 70 which references the data block 93. Next, at step 104, a check is made to see whether the ARQ 80 contains any items 81. If there are buffer descriptors in the ARQ 80, this means that the Figure 19 lock configuration is valid and the process 68, being already in possession of the netif lock 62, has dequeue rights for both the RQ 70 and the ARQ 80. Therefore, without having to attempt to take possession of another lock, the process 68, at step 106, is able to drain down payload in the RQ 70 into the buffers referenced by descriptors in the ARQ 80 to the extent that the buffers have been pre-allocated, i.e. until either the RQ 70 or the ARQ 80 becomes empty. Filled buffers are either removed from the ARQ 80 or their buffer-length portion 81b adjusted.
If there are no buffer descriptors in the ARQ 80, this means that the Figure 18 lock configuration is valid. With the receive process 68 holding the netif lock 62, this is the situation shown in Figure 18. But, as no application receive buffers have yet been allocated, the receive process 68, having already deposited the payload in the RQ 70, moves on to further tasks or goes to sleep as the case may be.
In other embodiments, the check at step 104, to determine whether the ARQ 80 is empty or not, can be carried out before incoming payload is enqueued in the RQ 70 (at step 102). Thus, when the ARQ 80 is non-empty and the RQ 70 is empty, the RQ 70 can be completely bypassed. As will be recalled, the netif lock 62 is a shared lock, and thus, according to the described lock regime, it governs the enqueueing of payload not only to the RQ 70 of the socket 50, but may do so too for the other, unshown, sockets. In such a case, it will be appreciated that many sockets can be serviced while incurring the overhead of obtaining the netif lock only once.
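Using the structures sketched above, the base flow of algorithm 100 might be rendered as follows. The helper rq_enqueue() and the item-recycling details are assumptions of this sketch rather than elements of the specification, and the protocol processing of step 101 is elided.

    #include <string.h>   /* memcpy */

    void rq_enqueue(sock_t *s, void *payload, size_t len);  /* assumed helper */

    /* Step 106: move payload from the RQ into ARQ buffers until one of
     * the two queues empties. The caller must hold whichever lock(s)
     * confer dequeue rights on both queues -- the netif lock 62 in the
     * Figure 19 configuration. */
    static void drain_down(sock_t *s)
    {
        while (s->rq_head && s->arq_head) {
            rq_item_t  *p = s->rq_head;
            arq_item_t *b = s->arq_head;
            size_t n = p->len < b->buf_len ? p->len : b->buf_len;

            memcpy(b->buf, p->payload, n);
            /* filled buffers are removed from the ARQ; partially filled
             * ones have their length portion 81b adjusted */
            if ((b->buf_len -= n) == 0)
                s->arq_head = b->next;
            else
                b->buf += n;
            if ((p->len -= n) == 0)
                s->rq_head = p->next;             /* item recycling elided */
            else
                p->payload = (char *)p->payload + n;
        }
    }

    /* Algorithm 100 (Figure 20): the receive process 68 already holds
     * the netif lock 62 when this runs. */
    static void deliver_packet(sock_t *s, void *payload, size_t len)
    {
        /* step 101: TCP processing/demultiplexing produced data block 93 */
        rq_enqueue(s, payload, len);          /* step 102: item 71 onto RQ 70 */
        if (s->arq_head != NULL)              /* step 104: any items 81? */
            drain_down(s);                    /* step 106: no further lock needed */
        /* otherwise (Figure 18) the payload simply waits on the RQ; in the
         * variant described above, the ARQ test precedes step 102 so that
         * the RQ can be bypassed entirely when a buffer is already waiting */
    }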
When" the application process thread 40 wants to operate according to the asynchronous receive model, as mentioned earlier, and has a new receive buffer 43 for allocation, it performs/invokes the asynchronous receive algorithm 108 shown in . Figure .21. The descriptor for the hew receive buffer 43 may be specified as the argument of a call, like async_recv(), which runs the /algorithm 108. ; .V .
First, at step 110, the application 40 takes possession of the sock-lock 52 in order to obtain the right to enqueue the descriptor corresponding to the new receive buffer onto the ARQ 80. It will be noted that, at this point, no test is first required to determine the condition of the ARQ 80, as in both the possible Figure 18 and Figure 19 lock configurations the right to enqueue on the ARQ 80 is governed by the sock-lock 52. Next, at step 112, a test is performed to determine whether the ARQ 80 is empty. If the ARQ is empty, then the Figure 18 lock configuration applies, whereby the application 40, which already holds the sock-lock 52, thus already has the right to dequeue from the RQ 70. Then, at step 114, a check is made to determine whether there is any payload/data in the RQ 70. If there is, at step 116, it is transferred from the RQ 70 to the new buffer 43. In this case, it will be noted that a descriptor corresponding to the new buffer 43 is not put onto the ARQ 80 at any time. Furthermore, it will be appreciated that performance of step 116, in terms of access to the RQ 70 and the ARQ 80, requires only the right to dequeue from the RQ 70. This means that in this branch of the algorithm, with the Figure 18 lock configuration being valid, only possession of the sock-lock, which was already taken at step 110, is required. At step 130, the sock-lock 52 is dropped. On the other hand, if either the ARQ 80 is non-empty or the RQ 70 is empty, then, at step 118, a descriptor 44 corresponding to the new buffer 43 is enqueued on the ARQ 80. With the application 40 holding the sock-lock 52, this is the situation shown in Figure 19. At step 120, a check is made to determine whether, during the performance of step 118, further payload has been enqueued on the RQ 70, i.e. whether the RQ 70 is still empty. If it is still empty, then the sock-lock is dropped at step 130. In this case, the new descriptor 44 has been enqueued on the ARQ, but no further work needed to be performed before dropping the sock-lock.
However, if it turns out that further payload had been enqueued on the RQ 70, i.e. the RQ 70 is now non-empty, then, at step 122, the netif lock 62 is grabbed. This lock is needed to dequeue items from the RQ 70 because, in this branch of the algorithm, the Figure 19 lock configuration applies. At step 124, the payload in the RQ 70 is drained down, i.e. transferred to the application receive buffers specified by the descriptors in the ARQ 80 to the extent permitted by the availability of buffers, or, in other words, until either the RQ or the ARQ becomes empty. At step 126, the netif lock 62 is dropped. Obtaining the shared netif lock at step 122 might tend to result in blocking, but entering this part of the algorithm is a relatively uncommon occurrence.
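A minimal sketch of the asynchronous receive algorithm 108 follows, under the same assumptions as above; copy_rq_to_buffer() and arq_enqueue() are hypothetical helpers standing in for the transfer of step 116 and the enqueue of step 118.

    void arq_enqueue(sock_t *s, char *buf, size_t buf_len);       /* assumed */
    void copy_rq_to_buffer(sock_t *s, char *buf, size_t buf_len); /* assumed */

    /* Algorithm 108 (Figure 21): the application offers a new buffer 43. */
    static void async_recv(sock_t *s, char *buf, size_t buf_len)
    {
        pthread_mutex_lock(&s->sock_lock);               /* step 110 */

        if (s->arq_head == NULL && s->rq_head != NULL) {
            /* steps 112-116: Figure 18 applies, so the sock-lock alone
             * confers RQ dequeue rights; the new buffer's descriptor
             * never touches the ARQ */
            copy_rq_to_buffer(s, buf, buf_len);
        } else {
            arq_enqueue(s, buf, buf_len);                /* step 118 */
            if (s->rq_head != NULL) {                    /* step 120: payload
                                                          * arrived meanwhile? */
                pthread_mutex_lock(&s->netif->netif_lock);    /* step 122 */
                drain_down(s);                                /* step 124 */
                pthread_mutex_unlock(&s->netif->netif_lock);  /* step 126 */
            }
        }
        pthread_mutex_unlock(&s->sock_lock);             /* step 130 */
    }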
When the application process thread 40 wants to operate according to the synchronous receive model, as mentioned earlier, it invokes/performs the following synchronous receive algorithm 140, specifying a new application receive buffer. The descriptor for the new receive buffer 43 may be specified as the argument of a call like, for example, recv(), which runs the algorithm 140. First, at step 142, the sock-lock 52 is grabbed. Then, a check is made, at step 144, on the condition of the ARQ 80, and if the ARQ 80 is non-empty, then, regardless of the condition of the RQ 70, blocking occurs at step 146. The rationale for blocking even though there may be payload in the RQ 70 is that, as the ARQ 80 is non-empty, the payload will be there only temporarily, and so no special steps need to be taken. At step 148, a check is made on the condition of the RQ 70 and, if it is non-empty, then payload is transferred from the RQ 70 to the new application receive buffer. On the other hand, if the RQ 70 is empty, then blocking occurs at step 152.
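The synchronous algorithm 140 might look like the following sketch. The wait helpers are assumptions: a real implementation would release the sock-lock while blocked (for example via a condition variable), a detail the figures do not spell out.

    void wait_for_arq_drained(sock_t *s);     /* assumed blocking helpers */
    void wait_for_payload(sock_t *s);

    /* Algorithm 140: synchronous recv() with a new application buffer. */
    static void sync_recv(sock_t *s, char *buf, size_t buf_len)
    {
        pthread_mutex_lock(&s->sock_lock);        /* step 142 */
        if (s->arq_head != NULL)                  /* step 144 */
            wait_for_arq_drained(s);              /* step 146: any RQ payload
                                                   * is only there temporarily */
        if (s->rq_head != NULL)                   /* step 148 */
            copy_rq_to_buffer(s, buf, buf_len);   /* transfer to the new buffer */
        else
            wait_for_payload(s);                  /* step 152 */
        pthread_mutex_unlock(&s->sock_lock);
    }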
As a variation on this embodiment, the condition of the ARQ 80 could be checked before grabbing the sock-lock 52, and if it is non-empty, blocking would occur. However, in this variation, after grabbing the sock-lock 52, it is still necessary to check the condition of the ARQ 80 again in order to verify that another thread has not enqueued a buffer descriptor on the ARQ 80 in the meantime. In other embodiments, and depending on the way the application process thread calls the synchronous receive algorithm, instead of blocking, an error can be returned immediately to the application process.
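This variation amounts to a double-checked test, as in the following sketch; the unlocked first read is shown plainly for brevity, though a real implementation would use an atomic load.

    /* Variant of algorithm 140: test the ARQ before grabbing the sock-lock,
     * then re-test under the lock, since another thread may have enqueued
     * a buffer descriptor in the meantime. */
    static void sync_recv_early_check(sock_t *s, char *buf, size_t buf_len)
    {
        if (s->arq_head != NULL)            /* early, unlocked check */
            wait_for_arq_drained(s);
        pthread_mutex_lock(&s->sock_lock);
        if (s->arq_head != NULL)            /* mandatory re-check under the lock */
            wait_for_arq_drained(s);
        if (s->rq_head != NULL)
            copy_rq_to_buffer(s, buf, buf_len);
        else
            wait_for_payload(s);
        pthread_mutex_unlock(&s->sock_lock);
    }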
It will be noted that it is an advantage of the described lock regime that the whole of the recv() routine 140 can be performed with just the sock-lock 52, and often so can the async_recv() routine 108 (apart from the relatively rare case in the branch of the algorithm containing steps 122, 124 and 126).
The algorithms 100, 108 and 140 are typically operating system routines which are, in the examples described, called by the process 68 and the application process 40. Although, in terms of implementation at the code level, locks are picked up/taken and released within those routines, possession of the lock is said to reside with the higher-level calling processes 68, 40 on whose behalf the routines are working.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. A computer system which is capable of running a plurality of concurrent processes, the system being operable to establish a first queue in which items related to data for sending over the network are enqueued, and to which access is governed by a lock; when a first of said processes is denied access to the first queue by the lock, to enqueue the items into a second queue to which access is not governed by the lock; and to arrange for the items in the second queue to be dequeued with the items in the first queue.
2. A system as in Claim 1, operable to integrate items from the second queue into the first queue.
3. A system as in Claim 2, wherein integration takes place by dequeueing items from the second queue and moving them into the first queue.
4. A system as in any preceding claim, wherein the second queue comprises a data structure facilitating access by concurrent processes.
5. A system as in Claim 4, wherein the second queue comprises a linked list structure having a head to which items can be added and from which items can be removed using an atomic instruction.
6. A system as in any preceding claim, further comprising means for registering the existence of the second queue, the system being operable after completing the second queue to register the existence of the second queue with the registering means if the lock is held by a said process other than the first process.
7. A system as in any preceding claim, wherein the lock is a single unit of memory which is manipulated atomically.
8. A system as in Claim 7 when dependent on Claim 6, wherein the registering means includes bits of the lock.
9. A system as in any of Claims 6 to 8, operable when a said process is about to release the lock to check for the existence of registered second queues with the registering means.
10. A system as in Claim 2, wherein integration takes place by linking the first queue to the second queue by arranging for a pointer from the first queue to point to the second queue, whereby the first and second queues can form a single linked structure.
11. A computer program for sending data to the network interface of a computer system which is capable of running a plurality of concurrent processes, the computer program being operable to establish a first queue in which items related to data for sending over the network interface are enqueued, and to which access is governed by a lock; and when a said process is denied access to the first queue by the lock, to enqueue the items for sending into a second queue to which access is not governed by said lock, wherein the computer program is further operable to arrange for the items in the second queue to be dequeued with the items in the first queue.
12. A data carrier bearing the computer program of Claim 11.
13. A computer system running a plurality of processes, comprising: a first queue in which items related to data for sending over a network are enqueued; a lock by which access to the first queue is governed; a second queue to which access is not governed by the lock; wherein the system is operating such that when a said process is denied access to the first queue by the lock, the data item for sending is enqueued in the second queue, and items in the second queue are dequeued with items in the first queue.
14. A computer system having a network interface and capable of running a plurality of concurrent processes, the system being arranged to: establish a first queue on which received payload can be enqueued, and on which the right to enqueue is governed by a first lock and from which the right to dequeue is governed by a second lock; establish a second queue on which descriptors for application receive buffers can be enqueued and from which the right to dequeue is governed by the second lock; the system being operable when a said process, holding the first lock, processes incoming payload to determine from the second queue whether an application receive buffer is available for the payload; and if an application receive buffer is available, attempt to take possession of the second lock, and if the attempt fails, set a control flag as a signal to another said process.
15. A computer system as in Claim 14, further operable to enqueue payload on the first queue regardless of whether an application receive buffer descriptor is specified in the second queue.
16. A computer system as in Claim 14, further operable, only when there is no application receive buffer descriptor specified in the second queue for the payload, or the said process fails to obtain the second lock, to enqueue the payload on the first queue.
17. A computer system as in any of claims 14 to 16, further operable such that the another said process in response to the control flag being set and when holding the second lock dequeues payload from the first queue, and transfers it to an application receive buffer specified in the second queue.
18. A computer system as in any of claims 14 to 17, wherein the attempt to obtain the second lock, and upon failing, setting the control flag is performed by an atomic instruction.
19. A computer system as in any of claims 14 to 18, wherein bits implementing the second lock and the control flag reside in the same word of memory.
20. A computer system having a network interface and capable of running a plurality of concurrent processes in a plurality of address spaces, the system being arranged to establish a receive queue structure comprising at least one pointer to an application receive buffer, the system being operable when a said process processes incoming payload to: identify from the receive queue structure an application receive buffer for the payload; determine whether said application receive buffer is accessible in the current address space; and if it is not, arrange for loading of the payload into the application receive buffer to take place in a context in which said application receive buffer is accessible.
21. A computer system as in Claim 20, wherein the system is operable, upon determining that said application receive buffer is not accessible in the current address space, to make a system call in order to load the payload into the application receive buffer.
22. A computer system as in Claim 20, wherein the system is operable, upon determining that said application receive buffer is not accessible in the current address space, to arrange for a thread in the appropriate address space to run in order that the payload may be loaded into the application receive buffer.
23. A computer system as in any of claims 20 to 22, wherein information about the address space in which the pointer(s) is valid is stored in the shared state of a socket with which the application receive buffer is associated.
24. A computer system as in Claim 21, wherein a reference to the page tables for the socket is stored in a kernel-private portion of the socket state.
25. A computer system as in any of Claims 20 to 24, wherein the system is further operable, if a said process fails to address an application receive buffer, to enqueue the payload in the receive queue structure.
26. A computer system as in any of Claims 20 to 25, wherein said receive queue structure comprises a first receive queue on which received payload can be enqueued, and a second receive queue on which descriptors for application receive buffers can be enqueued.
27. A computer system having a network interface and running a plurality of concurrent processes, the system: providing a first receive queue on which received payload is enqueued, and on which the right to enqueue is governed by a first lock and from which the right to dequeue is governed by a second lock; providing a second receive queue on which descriptors for application receive buffers are enqueued and from which the right to dequeue is governed by the second lock; wherein when a said process, holding the first lock, processes incoming payload, the system determines from the second queue whether an application receive buffer is available for the payload; and if an application receive buffer is available, attempts to take possession of the second lock, and if the attempt fails, sets a control flag as a signal to another said process.
28. A computer program for a computer system having a network interface which is capable of running a plurality of concurrent processes, the computer program being operable to establish a first receive queue on which received payload can be enqueued, and on which the right to enqueue is governed by a first lock and from which the right to dequeue items is governed by a second lock; to establish a second receive queue on which descriptors for application receive buffers can be enqueued and from which the right to dequeue is governed by the second lock; and when a said process, holding the first lock, processes incoming payload, to determine from the second queue whether an application receive buffer is available for the payload; and if an application receive buffer is available, attempt to take possession of the second lock, and if the attempt fails, set a control flag as a signal to another said process.
29. A data carrier bearing the computer program of Claim 28.
30. A computer system having a network interface and running a plurality of concurrent processes in a plurality of address spaces, the system establishing a receive queue structure comprising at least one pointer to an application receive buffer, wherein, when a said process processes incoming payload, the system identifies from the receive queue structure an application receive buffer for the payload; determines whether said application receive buffer is accessible in the current address space; and, if it is not, arranges for loading of the payload into the application receive buffer to take place in a context in which said application receive buffer is accessible.
31. A computer program for a computer system having a network interface which is capable of running a plurality of concurrent processes in a plurality of address spaces, the program being arranged to establish a receive queue structure comprising at least one pointer to an application receive buffer, the computer program being operable when a said process processes incoming payload to: identify from the receive queue structure an application receive buffer for the payload; determine whether said application receive buffer is accessible in the current address space; and if it is not, arrange for loading of the payload into the application receive buffer to take place in a context in which said application receive buffer is accessible.
32. A data carrier bearing the computer program of Claim 31.
33. A computer system capable of operating in a network and being arranged to establish a user-level stack, the system, when an application associates an I/O synchronisation object with a socket, being responsive to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
34. The computer system as in claim 33, wherein the association of the I/O synchronisation object with the socket is recorded in its user-level state.
35. The computer system as in claim 33 or 34, wherein the system is configured to direct said system call made by the application to a user-level routine.
36. The computer system as in any of claims 33 to 35, wherein the I/O synchronisation object comprises an I/O completion port, and the system call, CreateIoCompletionPort(), serves to create an I/O completion port and/or associate it with a socket.
37. A computer system capable of operating in a network and being arranged to establish a user-level stack, the system being configured to direct a system call, made by an application, requesting I/O status information to a user-level routine which is operable to update a user-level stack; make a system call to request the information previously requested by the application; and return the result of the system call to the application.
38. The computer system as in Claim 37, wherein the user-level routine is operable to update a user-level stack which has relevance to the request from the application.
39. The computer system as in Claim 37 or 38, further comprising an I/O synchronisation object associated with a said user-level stack.
40. The computer system as in any of Claims 37 to 39, wherein the I/O status information comprises event-based information.
41. The computer system as in Claim 39 or 40, wherein the I/O synchronisation object comprises an I/O completion port, and the system call, GetQueuedCompletionStatus(), returns a list of completed I/O operations.
42. The computer system as in any of Claims 37 to 41, wherein the I/O status information comprises state-based information.
43. The computer system as in any of Claims 37 to 42, wherein the user-level routine is operable, based on certain operating conditions, to make a determination as to whether it is currently opportune to update the user-level stack, and to update the user-level stack only when it is determined to be opportune.
44. The computer system as in Claim 43, wherein said determination includes a check as to whether there is any data awaiting processing by the user-level stack.
45. The computer system as in Claim 43 or 44, further comprising a lock that governs the right to update the user-level stack, wherein said determination includes a check as to whether the lock is locked.
46. The computer system as in any of Claims 37 to 45, wherein interrupts for the user-level stack are selectively enabled.
47. The computer system as in Claim 46 when dependent on any of Claims 43 to 45, wherein said determination includes a check as to whether said interrupts are enabled.
48. The computer system as in Claim 46 or 47, wherein said interrupts are not enabled when, during the stack updating, a process thread was awoken.
49. The computer system as in any of Claims 46 to 48, wherein said interrupts are not enabled if the said system call made by the application was non-blocking.
50. The computer system as in any of Claims 46 to 49 when dependent on Claim 45, wherein said interrupts are not enabled if the lock was locked.
51. The computer system as in any of Claims 36 to 50, wherein the system is configured to direct the system call made by the application to the user-level routine by a dll interception mechanism.
52. A computer system operating in a network and providing a user-level stack, the system, when an application associates an I/O synchronisation object with a socket, responding to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
53. A computer program for a computer system capable of operating in a network and being arranged to establish a user-level stack, the program, when an application associates an I/O synchronisation object with a socket, being responsive to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
54. A data carrier bearing the computer program of Claim 53.
55. A computer system operating in a network and providing a user-level stack, the system directing a system call, made by an application, requesting I/O status information to a user-level routine which updates a user-level stack; makes a system call to request the information previously requested by the application; and returns the result of the system call to the application.
56. A computer program for a computer system capable of operating in a network and being arranged to establish a user-level stack, the system being configured to direct a system call, made by an application, requesting I/O status information to the program which runs at the user level and which is operable to update a user-level stack; make a system call to request the information previously requested by the application; and return the result of the system call to the application.
57. A data carrier bearing the computer program of Claim 56.
58. A method for use in a computer system capable of operating in a network and being arranged to establish a user-level stack, the method comprising, when an application associates an I/O synchronisation object with a socket, responding to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user- level process.
59. A method for use in a computer system operating in a network and providing a user-level stack, the method comprising directing a system call, made by an application, requesting I/O status information to a user-level routine which updates a user-level stack; makes a system call to request the information previously requested by the application; and returns the result of the system call to the application.
60. A computer system having a network interface and capable of running a plurality of concurrent processes, the system being arranged to establish a first queue on which received payload can be enqueued, and on which the right to enqueue is governed by a first lock; establish a second queue on which descriptors for application receive buffers can be enqueued and from which the right to dequeue is governed by the first lock; wherein the right to dequeue from the first queue is governed by a second lock when the second queue is empty, and by the first lock when the second queue is not empty.
61. The computer system as in Claim 60, wherein the system, when an application process, holding the second lock, has a new descriptor for enqueueing on the second queue, is operable, when the second queue is empty, to dequeue payload from the first queue and transfer it to the application receive buffer specified by the said new descriptor.
62. The computer system as in claim 60 or 61, wherein the system, when an application process seeks to receive incoming payload and takes possession of the second lock, is operable to transfer payload from the first queue to a new application receive buffer specified by the application receive process.
63. The computer system as in any of claims 60 to 62, wherein the system, when a process holding the first lock processes incoming payload, is operable, when the second queue is empty, to enqueue the incoming payload on the first queue, and, when the second queue is not empty, to transfer payload from the first queue to a buffer specified by a descriptor in the second queue.
64. The computer system as in any of claims 60 to 63, wherein the right to enqueue on the second queue is governed by the second lock.
65. The computer system as in Claim 60, wherein the system, when a process holding the first lock processes incoming payload, is operable to enqueue the incoming payload on the first queue and, when the second queue is not empty, to transfer payload from the first queue to at least one buffer specified by at least one descriptor in the second queue.
66. The computer system as in Claim 60, wherein the system, when a process holding the first lock processes incoming payload, is operable, when the second queue is not empty and the first queue is empty, to transfer the incoming payload to at least one buffer specified by at least one descriptor in the second queue, and, when the second queue is empty or the first queue is not empty, to enqueue the incoming payload on the first queue.
67. The computer system as in any of claims 60 to 66, further operable to perform a drain down operation in which, while neither the first queue nor the second queue is empty, payload is transferred from the first queue to buffers specified in the second queue.
68. The computer system as in Claim 67, when dependent on Claim 61, further operable to perform said drain down operation after a new buffer descriptor has been added to the second queue and the first queue has become non-empty, and after obtaining the first lock.
69. The computer system as in Claim 67, when dependent on Claim 63, further operable to perform said drain down operation when the second queue is not empty.
70. A computer system comprising a network interface and capable of running a plurality of concurrent processes, the system providing a first queue on which received payload can be enqueued, and on which the right to enqueue is governed by a first lock; providing a second queue on which descriptors for application receive buffers can be enqueued and from which the right to dequeue is governed by the first lock; wherein the right to dequeue from the first queue is governed by a second lock when the second queue is empty, and by the first lock when the second queue is not empty.
71. A computer program for a computer system having a network interface and capable of running a plurality of concurrent processes, the program being arranged to establish a first queue on which received payload can be enqueued, and on which the right to enqueue is governed by a first lock; establish a second queue on which descriptors for application receive buffers can be enqueued and from which the right to dequeue is governed by the first lock; wherein the right to dequeue from the first queue is governed by a second lock when the second queue is empty, and by the first lock when the second queue is not empty.
72. A data carrier bearing the computer program of Claim 71.
PCT/GB2007/001821 2006-05-25 2007-05-18 Computer system with lock-protected queues for sending and receiving data WO2007138250A2 (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
GB0610506A GB0610506D0 (en) 2006-05-25 2006-05-25 Computer system
GB0610506.8 2006-05-25
GB0613556.0 2006-07-07
GB0613556A GB0613556D0 (en) 2006-07-07 2006-07-07 Computer system
GB0613975A GB0613975D0 (en) 2006-07-13 2006-07-13 Computer System
GB0613975.2 2006-07-13
GB0614220.2 2006-07-17
GB0614220A GB0614220D0 (en) 2006-07-17 2006-07-17 Computer system

Publications (2)

Publication Number Publication Date
WO2007138250A2 true WO2007138250A2 (en) 2007-12-06
WO2007138250A3 WO2007138250A3 (en) 2008-01-17

Family

ID=38426542

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/001821 WO2007138250A2 (en) 2006-05-25 2007-05-18 Computer system with lock-protected queues for sending and receiving data

Country Status (1)

Country Link
WO (1) WO2007138250A2 (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0466339A2 (en) * 1990-07-13 1992-01-15 International Business Machines Corporation A method of passing task messages in a data processing system
US5758184A (en) * 1995-04-24 1998-05-26 Microsoft Corporation System for performing asynchronous file operations requested by runnable threads by processing completion messages with different queue thread and checking for completion by runnable threads
US5951706A (en) * 1997-06-30 1999-09-14 International Business Machines Corporation Method of independent simultaneous queueing of message descriptors
US6651146B1 (en) * 2000-02-24 2003-11-18 International Business Machines Corporation Method and apparatus for managing access contention to a linear list without the use of locks
EP1213892A2 (en) * 2000-12-05 2002-06-12 Microsoft Corporation System and method for implementing a client side HTTP stack
US20020174258A1 (en) * 2001-05-18 2002-11-21 Dale Michele Zampetti System and method for providing non-blocking shared structures
WO2003055157A1 (en) * 2001-12-19 2003-07-03 Inrange Technologies Corporation Deferred queuing in a buffered switch
US20040031044A1 (en) * 2002-08-08 2004-02-12 Jones Richard A. Method for increasing performance of select and poll applications without recompilation
WO2005018179A1 (en) * 2003-08-07 2005-02-24 Intel Corporation Method, system, and article of manufacture for utilizing host memory from an offload adapter

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FINKEL R A: "An Operating Systems Vade Mecum, CONCURRENCY" OPERATING SYSTEMS VADE MECUM, ENGLEWOOD CLIFFS, PRENTICE HALL, US, 1989, pages 274-313, XP002266962 *
KNESTRICK, C C: "Lunar: A User-Level Stack Library for Network Emulation" THESIS, 24 February 2004 (2004-02-24), pages I-VII,1-58, XP002457631 *
LEA D: "Concurrent Programming in Java: Design Principles and Patterns, Second Edition" PRENTICE HALL, 25 October 1999 (1999-10-25), pages 1-18, XP002457351 ISBN: 0-201-31009-0 *
MICHAEL M M ET AL: "SIMPLE, FAST, AND PRACTICAL NON-BLOCKING AND BLOCKING CONCURRENT QUEUE ALGORITHMS" PROCEEDINGS OF THE 15TH ANNUAL SYMPOSIUM ON PRINCIPLES OF DISTRIBUTED COMPUTING. PHILADELPHIA, MAY 23 - 26, 1996, PROCEEDINGS OF THE ANNUAL SYMPOSIUM ON PRINCIPLES OF DISTRIBUTED COMPUTING (PODC), NEW YORK, ACM, US, vol. SYMP. 15, 23 May 1996 (1996-05-23), pages 267-275, XP000681051 ISBN: 0-89791-800-2 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8346975B2 (en) 2009-03-30 2013-01-01 International Business Machines Corporation Serialized access to an I/O adapter through atomic operation
EP2770430A1 (en) * 2013-02-25 2014-08-27 Texas Instruments France System and method for scheduling atomic jobs in a multi-core processor to avoid processing jobs of the same atomicity in parallel
US10409655B2 (en) * 2014-03-31 2019-09-10 Solarflare Communications, Inc. Ordered event notification
US11321150B2 (en) 2014-03-31 2022-05-03 Xilinx, Inc. Ordered event notification
JP2017117448A (en) * 2015-12-26 2017-06-29 インテル コーポレイション Application-level network queueing

Also Published As

Publication number Publication date
WO2007138250A3 (en) 2008-01-17

Similar Documents

Publication Publication Date Title
EP2645674B1 (en) Interrupt management
EP1891787B1 (en) Data processing system
US5448698A (en) Inter-processor communication system in which messages are stored at locations specified by the sender
US9747134B2 (en) RDMA (remote direct memory access) data transfer in a virtual environment
US6070189A (en) Signaling communication events in a computer network
Li et al. SocksDirect: Datacenter sockets can be fast and compatible
Buzzard et al. An implementation of the Hamlyn sender-managed interface architecture
US6038604A (en) Method and apparatus for efficient communications using active messages
KR102011949B1 (en) System and method for providing and managing message queues for multinode applications in a middleware machine environment
US20180375782A1 (en) Data buffering
JP5956565B2 (en) System and method for providing a messaging application program interface
EP0444376A1 (en) Mechanism for passing messages between several processors coupled through a shared intelligent memory
CA2536037A1 (en) Fast and memory protected asynchronous message scheme in a multi-process and multi-thread environment
WO2002031672A2 (en) Method and apparatus for interprocessor communication and peripheral sharing
US20120272248A1 (en) Managing queues in an asynchronous messaging system
US9069592B2 (en) Generic transport layer mechanism for firmware communication
WO2007138250A2 (en) Computer system with lock- protected queues for sending and receiving data
EP2383658B1 (en) Queue depth management for communication between host and peripheral device
US20070079077A1 (en) System, method, and computer program product for shared memory queue
US7130936B1 (en) System, methods, and computer program product for shared memory queue
US7853713B2 (en) Communication interface device and communication method
US9948533B2 (en) Interrupt management
US20020174258A1 (en) System and method for providing non-blocking shared structures
Heinlein et al. Integrating multiple communication paradigms in high performance multiprocessors
JP2022012656A (en) Parallel distributed calculation system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07732843

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07732843

Country of ref document: EP

Kind code of ref document: A2