WO1999036851A1 - Scalable single system image operating software architecture for a multi-processing computer system - Google Patents
- Publication number: WO1999036851A1 (PCT/US1998/025586)
- Authority: WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
Definitions
- the present invention relates, in general, to the field of multiprocessing computer systems. More particularly, the present invention relates to a scalable single system image ("S3I") operating software architecture for a multi-processing computer system.
- S3I scalable single system image
- the connection of more than one homogeneous processor to a single, monolithic central memory is denominated as multi-processing.
- hardware and software limitations have historically limited the total number of physical processors that can access a single memory efficiently. These limitations have made it difficult to maintain computational efficiency as the number of processors in a computer system increases, reducing the overall scalability of the system. With the advent of faster and ever less expensive microprocessors, large processor-count systems are becoming a hardware reality.
- Hardware advances have allowed interconnect networks to multiplex hundreds of processors to a single memory with a minimal decrease in efficiency. However, because of performance issues relating to software locking primitives in large configurations, the operating software scalability required to efficiently accommodate large numbers of processors has continued to elude system architects.
- MPP massively parallel processing
- API/ABI application programming interface/application binary interface
- the computational processors have no input/output ("I/O") devices mapped directly onto them, while the service processors can control attached devices.
- a single operating system software image presents a single-system application programming interface to application programs across all of the processors, and a communication mechanism between the computational and service processors allows multiple requests to the service processors and fast, asynchronous interrupt responses to each request.
- a computational scheduler executes on each computational processor and provides the interface to the service processors where the operating system software executes.
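The request path described above (an application thread submits a request through the computational scheduler into a service-processor queue and later receives an asynchronous interrupt response) can be sketched as a shared queue with tickets. This is an illustrative model only; the names `ServiceQueue`, `submit`, and `service_one` are invented for the sketch and do not appear in the patent.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    comp_id: int   # originating computational processor
    syscall: str   # requested operating-software service
    ticket: int    # lets the asynchronous response be matched later


class ServiceQueue:
    """Toy model of the shared-memory request queue a computational
    scheduler uses to reach the service processors. Multiple requests
    may be outstanding at once; each gets a ticket for its response."""

    def __init__(self) -> None:
        self.pending: deque[Request] = deque()
        self.next_ticket = 0

    def submit(self, comp_id: int, syscall: str) -> int:
        """Computational side: enqueue a request and return its ticket.
        The caller may then suspend or accept other work."""
        ticket = self.next_ticket
        self.next_ticket += 1
        self.pending.append(Request(comp_id, syscall, ticket))
        return ticket

    def service_one(self) -> tuple[int, str]:
        """Service side: drain one request and return the (ticket,
        result) pair that models the asynchronous interrupt response."""
        req = self.pending.popleft()
        return req.ticket, f"done:{req.syscall}@cp{req.comp_id}"
```

The ticket is what makes the exchange asynchronous: the computational scheduler need not block the processor between `submit` and the matching response.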
- the S3I architecture also improves cache performance by reducing cache conflicts. These conflicts are reduced because the operating software no longer forces application data from the cache during the process of servicing application requests. This performance improvement becomes more important as processor speed relative to memory latency increases, as the latest generation of multi-processors demonstrates.
- the computer system comprises a first plurality of service processors functioning in conjunction with the operating software, the service processors handling all input/output functions for the computer system.
- a second plurality of computational processors functions in conjunction with a computational scheduler, with the operating software and the computational scheduler providing a communication medium between the service and computational processors.
- the computer system comprises a first plurality of service processors and a second plurality of computational processors.
- Each of the service processors functions in conjunction with the operating software and handles all input/output functionality for the computer system.
- Each of the computational processors functions in conjunction with a computational scheduler, with the operating software and the computational scheduler providing a communication medium between the service and computational processors.
- Figs. 1A and 1B are a functional block system overview illustrating a computer system in accordance with an embodiment of the present invention comprising between 1 and 16 segments coupled together by a like number of trunk lines, each segment containing a number of computational and service processors in addition to memory and a crossbar switch assembly;
- Fig. 2 is a simplified functional block diagram of the interconnect strategy for the computer system of Figs. 1A and 1B;
- Fig. 3 is a simplified functional block diagram of the computer system of Figs. 1 and 2 illustrating a 16-segment system comprising 256 computational processors and 64 service processors interfacing to the computer program application software through a scalable single system image ("S3I") in accordance with the present invention;
- Fig. 4 is a more detailed block diagram of a single-segment computer system corresponding to a portion of the system of Fig. 3, illustrating the interface of the computer program application software through a common API/ABI to the computational processors through the computational scheduler and to the service processors through the operating software, which provides an interrupt response to the computational scheduler.
- the exemplary computer system 10 comprises, in pertinent part, any number of interconnected segments 12₀ through 12₁₅, although the principles of the present invention are likewise applicable to any scalable system employing large numbers of processors.
- the various segments 12₀ through 12₁₅ are coupled through a number of trunk lines 14₀ through 14₁₅ as will be more fully described hereinafter.
- Each of the segments 12 comprises a number of functionally differentiated processing elements in the form of service processors 16₀ through 16₃ (service processor 16₀ functions additionally as a master boot device) and computational processors 18₀ through 18₁₅.
- the service processors 16 are coupled to a number of peripheral component interconnect ("PCI") interface modules 20, and in the embodiment shown, each service processor is coupled to two such modules 20 to enable the service processors 16 to carry out all of the I/O functionality of the segment 12.
- PCI peripheral component interconnect
- the computer system 10 further includes a serial-to-PCI interface 22 for coupling a system console 24 to at least one of the segments 12 of the computer system 10.
- the system console 24 is operational for enabling a user of the computer system 10 to download boot information to the computer system 10, configure devices, monitor status, and perform diagnostic functions. Regardless of how many segments 12 are configured in the computer system 10, only one system console 24 is required.
- the boot device 26 (for example, a JAZ® removable disk computer mass storage device available from Iomega Corporation, Roy, UT) is also coupled to the master boot service processor 16₀ through one of the PCI modules 20.
- the PCI modules 20 coupled to service processors 16₁ through 16₃ are utilized to couple the segment 12 to all other peripheral devices such as, for example, disk arrays 28₀ through 28₅, any one or more of which may be replaced by, for example, an Ethernet connection.
- the computer system 10 comprises sophisticated hardware assembled from commodity-based building blocks, with some enhancements to accommodate the uniqueness of high-performance computing ("HPC").
- the base unit for the computer system 10 is a segment 12.
- Each segment 12 contains computational and service processor 18, 16 elements, memory, power supplies, and a crossbar switch assembly.
- the computer system 10 is "scalable" in that an end user can configure a system that consists of from 1 to 16 interconnected segments 12.
- Each segment 12 contains 20 total processors: sixteen computational processors 18 and four service processors 16.
- the computational processors 18 may reside on an individual assembly that contains four processors (e.g. the Deschutes™ microprocessor available from Intel Corporation, Santa Clara, CA) and eight interface chips (i.e. two per computational processor 18).
- Each computational processor 18 has an internal processor clock rate greater than 300 MHz and a system clock speed greater than 100 MHz, and the interface chips provide the connection between the computational processors 18 and the memory switches that connect to memory as will be described and shown in greater detail hereafter.
- the service processors 16 may be contained on a service processor assembly, which is responsible for all input and output for the computer system 10.
- Each of the service processor assemblies contains a processor (the same type as the computational processor 18), two interface chips, two 1 Mbyte I/O buffers, and two bi-directional PCI buses.
- Each PCI bus has a single connector. All I/O ports have DMA capability with equal priority to processors.
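The scaling arithmetic implied by the preceding bullets (1 to 16 segments, each with sixteen computational and four service processors) is easy to check with a small sketch. The function name and returned fields are illustrative only, assuming nothing beyond the counts stated in the text.

```python
def system_config(segments: int) -> dict:
    """Totals for a 1..16-segment system as described: each segment
    contains sixteen computational and four service processors."""
    if not 1 <= segments <= 16:
        raise ValueError("the described system scales from 1 to 16 segments")
    computational = 16 * segments
    service = 4 * segments
    return {
        "computational": computational,
        "service": service,
        "total_processors": computational + service,
    }
```

A full sixteen-segment system thus yields the 256 computational, 64 service, and 320 total processors cited later in the description.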
- the PCI modules 20 serve dual purposes, depending upon the service processor 16 with which they are used.
- the PCI connectors on the master boot service processor 16₀ are used to connect to the boot device 26 and the system console 24.
- the PCI modules 20 on the regular service processors 16₁ through 16₃ are used for all other peripherals.
- Some of the supported PCI-based interconnects include small computer systems interface (“SCSI”), fiber distributed data interface (“FDDI”), high performance parallel interface (“HIPPI”) and others.
- SCSI small computer systems interface
- FDDI fiber distributed data interface
- HIPPI high performance parallel interface
- In FIG. 2, the interconnect strategy for the computer system 10 of Figs. 1A and 1B is shown in greater detail in an implementation employing sixteen segments 12₀ through 12₁₅ interconnected by means of sixteen trunk lines 14₀ through 14₁₅.
- a number of memory banks 50₀ through 50₁₅, each allocated to a respective one of the computational processors 18₀ through 18₁₅ (resulting in sixteen memory banks 50 per segment 12 and two hundred fifty-six memory banks 50 in total for a sixteen-segment 12 computer system 10), form a portion of the computer system 10 and are respectively coupled to the trunk lines 14₀ through 14₁₅ through a like number of memory switches 52₀ through 52₁₅.
- the memory utilized in the memory banks 50₀ through 50₁₅ may be synchronous static random access memory ("SSRAM") or other suitable high-speed memory devices.
- SSRAM synchronous static random access memory
- each of the segments 12₀ through 12₁₅ includes, for example, twenty processors (four service processors 16₀ through 16₃ and sixteen computational processors 18₀ through 18₁₅) coupled to the trunk lines 14₀ through 14₁₅ through a corresponding one of a like number of processor switches 54₀ through 54₁₉.
- Each segment 12 interconnects to all other segments 12 through the crossbar switch.
- the computer system 10 crossbar switch technology enables segments 12 to have uniform memory access times across segment boundaries, as well as within the individual segment 12. It also enables the computer system 10 to employ a single memory access protocol for all the memory in the system.
- the crossbar switch may utilize high-speed Field Programmable Gate Arrays ("FPGAs") to provide interconnect paths between memory and the processors, regardless of where the processors and memory are physically located. This crossbar switch interconnects every segment 12 and enables the processors and memory located in different segments 12 to communicate with a uniform latency.
- each crossbar switch has a 1-clock latency per tier, which includes reconfiguration time. For a sixteen-segment 12 computer system 10 utilizing three hundred and twenty processors 16, 18, only two crossbar tiers are required.
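The two-tier figure above can be sanity-checked with a rough model: with switches of radix r, roughly ceil(log base r of N) tiers suffice to connect N endpoints. The radix value used below is an assumption chosen purely for illustration; the patent states only the result (two tiers, one clock each, for 320 processors).

```python
import math


def crossbar_tiers(endpoints: int, radix: int) -> int:
    """Rough tier count for a multi-tier crossbar: each tier of
    radix-`radix` switches multiplies reach by `radix`, so the number
    of tiers grows logarithmically with the endpoint count. The radix
    is an assumption; the patent does not state the switch radix."""
    return max(1, math.ceil(math.log(endpoints, radix)))


def interconnect_latency_clocks(endpoints: int, radix: int) -> int:
    # The text gives 1 clock of latency per tier, including
    # reconfiguration time, so latency equals the tier count.
    return crossbar_tiers(endpoints, radix)
```

With an assumed radix of 20 (one segment's worth of processors), 320 endpoints indeed need two tiers, matching the text.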
- the computer system 10 may preferably utilize SSRAM for the memory banks 50 since it presents a component cycle time of 6 nanoseconds.
- Each memory bank 50 supports from 64 to 256 Mbytes of memory.
- Each computational processor 18 supports one memory bank 50, with each memory bank 50 being 256 bits wide, plus 32 parity bits for a total width of 288 bits.
- the memory bank 50 size may be designed to match the cache line size, resulting in a single bank access for a full cache line. Read and write memory error correction may be provided by completing parity checks on address and data packets.
- the parity check for address packets may be the same for both read and write functions wherein new and old parity bits are compared to determine whether or not the memory read or write should continue or abort.
- a parity check may be done on each of the data packets arriving in memory.
- Each of these data packets has an 8-bit parity code appended to it.
- a new 8-bit parity code is generated for the data packet and the old and new parity codes are compared.
- the comparison results in one of two types of codes: single bit error (“SBE") or double-bit or multi-bit error (“DBE").
- SBE single bit error
- DBE double-bit or multi-bit error
- the single-bit error may be corrected on the data packet before it is entered in memory.
- In the case of a double-bit or multi-bit error, the data packet is not written to memory; the error is reported back to the processor, which retries the data packet reference.
- when a memory "read" occurs, each of the data packets read from memory generates an 8-bit parity code. This parity code is forwarded with the data to the processor.
- the processor performs single error correction and double error detection (“SECDED") on each data packet.
- SECDED single error correction and double error detection
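The write-path check described above (compare the old and new 8-bit parity codes, correct a single-bit error before the write, reject and retry on a double- or multi-bit error) can be modeled simply. This is a deliberately simplified sketch: real SECDED decoding computes a Hamming syndrome, whereas here the error class is merely illustrated by counting differing bits between the two codes, and the function name is invented.

```python
def classify_packet(stored_code: int, recomputed_code: int) -> str:
    """Classify a data packet's parity comparison as OK, single-bit
    error ("SBE"), or double/multi-bit error ("DBE"). Both arguments
    are 8-bit parity codes; the class is taken from how many bits of
    the two codes disagree (an illustration, not true SECDED)."""
    diff = (stored_code ^ recomputed_code) & 0xFF
    errors = bin(diff).count("1")
    if errors == 0:
        return "OK"   # codes match: the write proceeds unchanged
    if errors == 1:
        return "SBE"  # corrected before the packet enters memory
    return "DBE"      # not written; the processor retries the reference
```

On the read path the same classification is done inside the processor, which performs the SECDED check on each arriving packet.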
- In FIG. 3, a simplified illustration of a sixteen-segment 12 computer system 10 is shown comprising a total of sixty-four service processors 16 and two hundred fifty-six computational processors 18 for a total of three hundred twenty processors.
- the service processors 16 and computational processors 18 interface to the computer program application software by means of the scalable single system image ("S3I") layer 60 as will be more fully described hereinafter.
- the service processors 16 handle all I/O operations as previously described, as well as the running of the computer system 10 operating system.
- the S3I layer 60 resides on top of the application 62 and comprises a common API/ABI layer 64 as well as a computational scheduler 66 and operating software 68 layers.
- the computational scheduler 66 interfaces to the computational processors 18₀ through 18₁₅ while the operating software 68 interfaces to the service processors 16₀ through 16₃.
- the operating software 68 provides an interrupt response signal 70 to the computational scheduler 66 to control the operation of the computational processors 18 as shown.
- a number of memory communication buffers 72 receive data from the various computational processors 18₀ through 18₁₅ and, in turn, supply data to the service processors 16₀ through 16₃.
- the preferred implementation of the scalable single system image architecture of the present invention is on a multiprocessor computer system 10 with uniform memory access across common, shared memory comprising a plurality of memory banks 50₀ through 50ₙ.
- processor subsystems may be partitioned into two groups: those which have I/O connectivity, i.e. the service processors 16₀ through 16ₙ, and those which have no I/O connectivity, i.e. the computational processors 18₀ through 18ₙ.
- the S3I utilizes a software environment consisting of service and computational processors 16, 18.
- a single copy of the operating system software 68 resides across all processors 16 within the service partition.
- Separate computational schedulers 66 exist in each computational processor 18.
- This software model guarantees a global resource sharing paradigm, in conjunction with a strong "single system image”.
- Highly scalable threads of execution must be present in both the operating system software 68 and user application 62 software design model in conjunction with a high degree of software "multithreading" for efficient utilization of this architecture.
- user applications 62 are able to initiate and terminate multiple threads of execution in application user space, allowing further elimination of operating system software 68 overhead and increased scalability as the number of physical processors increases.
- a user application 62 makes requests of the system through the normal system software mechanism.
- the application 62 has no awareness of whether the application 62 is executing on a service processor 16 or a computational processor 18. If the request is executed on a service processor 16, the request follows the normal operating system path directly into the operating system software 68 for processing.
- If the request is executed on a computational processor 18, the request is processed by the computational scheduler 66.
- the thread making the request is placed on the run queue of the service processor 16 and the computational processor 18 issues a request to the service processor 16 to examine the queues.
- the operating software 68 executing in the service processor 16 examines the request queue and processes the request as if it had originated on the service processor 16.
- the requesting computational processor 18 is either suspended until interrupt acknowledge or placed into the general scheduling tables maintained by the operating software 68 for dispatching of additional work.
- the service processor 16 acknowledges the original request by queuing an application thread for execution on a computational processor 18 and restoring the original application context, which places the application 62 back into execution.
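The requesting thread's states in this exchange (suspended until the interrupt acknowledge, or rescheduled onto other work, then running again once the acknowledge restores its context) can be sketched as a tiny state machine. The enum and function names are illustrative, not from the patent.

```python
from enum import Enum


class ThreadState(Enum):
    RUNNING = "running"          # executing application code
    SUSPENDED = "suspended"      # awaiting the interrupt acknowledge
    RESCHEDULED = "rescheduled"  # handed to the general scheduling tables


def after_request(other_work_available: bool) -> ThreadState:
    """Per the passage above, after issuing a request the
    computational processor either suspends until the interrupt
    acknowledge or takes additional work from the operating
    software's scheduling tables."""
    return (ThreadState.RESCHEDULED if other_work_available
            else ThreadState.SUSPENDED)


def on_interrupt_ack(state: ThreadState) -> ThreadState:
    # The service processor's acknowledge restores the original
    # application context, placing the thread back into execution.
    return ThreadState.RUNNING
```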
- any physical processor 16 in the service partition will be able to execute within the operating system 68 simultaneously.
- Critical data regions may be locked utilizing standard locking primitives currently found in the underlying hardware.
- the base component of the software environment is the operating system software 68.
- the computer system 10 may use, in a preferred embodiment, an enhanced version of the SunSoft® Solaris® 2.6 operating system available from Sun Microsystems, Inc., Palo Alto, CA, which is modified to achieve better performance across multiple computational and service processors 18, 16 by limiting the operating system to execute only in the service processors 16.
- this technique is further accomplished by utilizing a computational scheduler 66 to communicate operating system requests and scheduling information between the service and computational processors 16, 18.
- a single copy of the operating system software 68 executes in all service processors 16, while separate computational schedulers 66 reside in each computational processor 18.
- This software model provides for global resource sharing in conjunction with a strong scalable single system image.
- the computer system 10 of the present invention provides for highly scalable threads of execution in both the operating system software 68 and user application software 62, in conjunction with a high degree of software "multithreading". While there have been described above the principles of the present invention in conjunction with a specific computer architecture, any number of service and/or computational processors may be utilized and it is to be clearly understood that the foregoing description is made only by way of example and not as a limitation to the scope of the invention.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP98962876A EP1064597A1 (en) | 1998-01-20 | 1998-12-03 | Scalable single system image operating software architecture for a multi-processing computer system |
JP2000540495A JP2002509311A (en) | 1998-01-20 | 1998-12-03 | Scalable single-system image operating software for multi-processing computer systems |
CA002317132A CA2317132A1 (en) | 1998-01-20 | 1998-12-03 | Scalable single system image operating software architecture for a multi-processing computer system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US887198A | 1998-01-20 | 1998-01-20 | |
US09/008,871 | 1998-01-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1999036851A1 true WO1999036851A1 (en) | 1999-07-22 |
Family
ID=21734176
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1998/025586 WO1999036851A1 (en) | 1998-01-20 | 1998-12-03 | Scalable single system image operating software architecture for a multi-processing computer system |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1064597A1 (en) |
JP (1) | JP2002509311A (en) |
CA (1) | CA2317132A1 (en) |
WO (1) | WO1999036851A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1834237A1 (en) * | 2004-12-30 | 2007-09-19 | Koninklijke Philips Electronics N.V. | Data processing arrangement |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5109512A (en) * | 1990-05-31 | 1992-04-28 | International Business Machines Corporation | Process for dispatching tasks among multiple information processors |
US5325526A (en) * | 1992-05-12 | 1994-06-28 | Intel Corporation | Task scheduling in a multicomputer system |
US5675795A (en) * | 1993-04-26 | 1997-10-07 | International Business Machines Corporation | Boot architecture for microkernel-based systems |
Application Events (1998)
- 1998-12-03: WO application PCT/US1998/025586 filed as WO1999036851A1 (application discontinued)
- 1998-12-03: EP application EP98962876A filed as EP1064597A1 (withdrawn)
- 1998-12-03: JP application JP2000540495A filed as JP2002509311A (withdrawn)
- 1998-12-03: CA application CA002317132A filed as CA2317132A1 (abandoned)
Also Published As
Publication number | Publication date |
---|---|
EP1064597A1 (en) | 2001-01-03 |
CA2317132A1 (en) | 1999-07-22 |
JP2002509311A (en) | 2002-03-26 |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1. Designated state(s): CA JP MX |
AL | Designated countries for regional patents | Kind code of ref document: A1. Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
ENP | Entry into the national phase | Ref document number: 2317132. Country of ref document: CA. Kind code of ref document: A. Format of ref document f/p: F |
WWE | Wipo information: entry into national phase | Ref document number: 1998962876. Country of ref document: EP |
ENP | Entry into the national phase | Ref country code: JP. Ref document number: 2000 540495. Kind code of ref document: A. Format of ref document f/p: F |
WWP | Wipo information: published in national office | Ref document number: 1998962876. Country of ref document: EP |
WWW | Wipo information: withdrawn in national office | Ref document number: 1998962876. Country of ref document: EP |