WO1999036851A1 - Scalable single system image operating software architecture for a multi-processing computer system - Google Patents
- Publication number: WO1999036851A1 (PCT/US1998/025586)
- Authority: WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
Definitions
- the present invention relates, in general, to the field of multiprocessing computer systems. More particularly, the present invention relates to a scalable single system image ("S3I") operating software architecture for a multi-processing computer system.
- S3I scalable single system image
- the connection of more than one homogeneous processor to a single, monolithic central memory is denominated as multi-processing.
- hardware and software limitations have historically limited the total number of physical processors that can access a single memory efficiently. These limitations have made it difficult to maintain computational efficiency as the number of processors in a computer system increases, reducing the overall scalability of the system. With the advent of faster and ever less expensive microprocessors, large processor-count systems are becoming a hardware reality.
- Hardware advances have allowed interconnect networks to multiplex hundreds of processors to a single memory with a minimal decrease in efficiency. However, because of performance issues relating to software locking primitives in large configurations, the operating software scalability required to efficiently accommodate large numbers of processors has continued to elude system architects.
- MPP massively parallel processing
- API/ABI application programming interface/application binary interface
- the computational processors have no input/output ("I/O") devices mapped directly onto them, while the service processors can control attached devices.
- a single operating system software image presents a single-system application programming interface to application programs across all of the processors, and a communication mechanism between the computational and service processors allows multiple requests to the service processors and fast, asynchronous interrupt responses to each request.
- a computational scheduler executes on each computational processor and provides the interface to the service processors where the operating system software executes.
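The request path described above (an application thread submits a request through the computational scheduler into a service-processor queue and later receives an asynchronous interrupt response) can be sketched as a shared queue with tickets. This is an illustrative model only; the names `ServiceQueue`, `submit`, and `service_one` are invented for the sketch and do not appear in the patent.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    comp_id: int   # originating computational processor
    syscall: str   # requested operating-software service
    ticket: int    # lets the asynchronous response be matched later


class ServiceQueue:
    """Toy model of the shared-memory request queue a computational
    scheduler uses to reach the service processors. Multiple requests
    may be outstanding at once; each gets a ticket for its response."""

    def __init__(self) -> None:
        self.pending: deque[Request] = deque()
        self.next_ticket = 0

    def submit(self, comp_id: int, syscall: str) -> int:
        """Computational side: enqueue a request and return its ticket.
        The caller may then suspend or accept other work."""
        ticket = self.next_ticket
        self.next_ticket += 1
        self.pending.append(Request(comp_id, syscall, ticket))
        return ticket

    def service_one(self) -> tuple[int, str]:
        """Service side: drain one request and return the (ticket,
        result) pair that models the asynchronous interrupt response."""
        req = self.pending.popleft()
        return req.ticket, f"done:{req.syscall}@cp{req.comp_id}"
```

The ticket is what makes the exchange asynchronous: the computational scheduler need not block the processor between `submit` and the matching response.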
- the S3I architecture also improves cache performance by reducing cache conflicts. These conflicts are reduced because the operating software no longer forces application data from the cache during the process of servicing application requests. This performance improvement becomes more important as processor speed relative to memory latency increases, as the latest generation of multi-processors demonstrates.
- the computer system comprises a first plurality of service processors functioning in conjunction with the operating software, the service processors handling all input/output functions for the computer system.
- a second plurality of computational processors functions in conjunction with a computational scheduler, with the operating software and the computational scheduler providing a communication medium between the service and computational processors.
- the computer system comprises a first plurality of service processors and a second plurality of computational processors.
- Each of the service processors functions in conjunction with the operating software and handles all input/output functionality for the computer system.
- Each of the computational processors functions in conjunction with a computational scheduler, with the operating software and the computational scheduler providing a communication medium between the service and computational processors.
- Figs. 1A and 1B are a functional block system overview illustrating a computer system in accordance with an embodiment of the present invention comprising between 1 and 16 segments coupled together by a like number of trunk lines, each segment containing a number of computational and service processors in addition to memory and a crossbar switch assembly;
- Fig. 2 is a simplified functional block diagram of the interconnect strategy for the computer system of Figs. 1A and 1B;
- Fig. 3 is a simplified functional block diagram of the computer system of Figs. 1 and 2 illustrating a 16-segment system comprising 256 computational processors and 64 service processors interfacing to the computer program application software through a scalable single system image ("S3I") in accordance with the present invention;
- Fig. 4 is a more detailed block diagram of a single-segment computer system corresponding to a portion of the system of Fig. 3, illustrating the interface of the computer program application software through a common API/ABI to the computational processors through the computational scheduler and to the service processors through the operating software, which provides an interrupt response to the computational scheduler.
- the exemplary computer system 10 comprises, in pertinent part, any number of interconnected segments 12₀ through 12₁₅, although the principles of the present invention are likewise applicable to any scalable system employing large numbers of processors.
- the various segments 12₀ through 12₁₅ are coupled through a number of trunk lines 14₀ through 14₁₅ as will be more fully described hereinafter.
- Each of the segments 12 comprises a number of functionally differentiated processing elements in the form of service processors 16₀ through 16₃ (service processor 16₀ functions additionally as a master boot device) and computational processors 18₀ through 18₁₅.
- the service processors 16 are coupled to a number of peripheral component interconnect ("PCI") interface modules 20, and in the embodiment shown, each service processor is coupled to two such modules 20 to enable the service processors 16 to carry out all of the I/O functionality of the segment 12.
- PCI peripheral component interconnect
- the computer system 10 further includes a serial-to-PCI interface 22 for coupling a system console 24 to at least one of the segments 12 of the computer system 10.
- the system console 24 is operational for enabling a user of the computer system 10 to download boot information to the computer system 10, configure devices, monitor status, and perform diagnostic functions. Regardless of how many segments 12 are configured in the computer system 10, only one system console 24 is required.
- the boot device 26 (for example, a JAZ® removable disk computer mass storage device available from Iomega Corporation, Roy, UT) is also coupled to the master boot service processor 16₀ through one of the PCI modules 20.
- the PCI modules 20 coupled to service processors 16₁ through 16₃ are utilized to couple the segment 12 to all other peripheral devices such as, for example, disk arrays 28₀ through 28₅, any one or more of which may be replaced by, for example, an Ethernet connection.
- the computer system 10 comprises sophisticated hardware assembled from commodity-based building blocks, with some enhancements to accommodate the uniqueness of high-performance computing ("HPC").
- the base unit for the computer system 10 is a segment 12.
- Each segment 12 contains computational and service processor 18, 16 elements, memory, power supplies, and a crossbar switch assembly.
- the computer system 10 is "scalable" in that an end user can configure a system that consists of from 1 to 16 interconnected segments 12.
- Each segment 12 contains 20 total processors: sixteen computational processors 18 and four service processors 16.
- the computational processors 18 may reside on an individual assembly that contains four processors (e.g. the Deschutes™ microprocessor available from Intel Corporation, Santa Clara, CA) and eight interface chips (i.e. two per computational processor 18).
- Each computational processor 18 has an internal processor clock rate greater than 300 MHz and a system clock speed greater than 100 MHz, and the interface chips provide the connection between the computational processors 18 and the memory switches that connect to memory as will be described and shown in greater detail hereafter.
- the service processors 16 may be contained on a service processor assembly, which is responsible for all input and output for the computer system 10.
- Each of the service processor assemblies contains a processor (the same type as the computational processor 18), two interface chips, two 1 Mbyte I/O buffers, and two bi-directional PCI buses.
- Each PCI bus has a single connector. All I/O ports have DMA capability with equal priority to processors.
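The scaling arithmetic implied by the preceding bullets (1 to 16 segments, each with sixteen computational and four service processors) is easy to check with a small sketch. The function name and returned fields are illustrative only, assuming nothing beyond the counts stated in the text.

```python
def system_config(segments: int) -> dict:
    """Totals for a 1..16-segment system as described: each segment
    contains sixteen computational and four service processors."""
    if not 1 <= segments <= 16:
        raise ValueError("the described system scales from 1 to 16 segments")
    computational = 16 * segments
    service = 4 * segments
    return {
        "computational": computational,
        "service": service,
        "total_processors": computational + service,
    }
```

A full sixteen-segment system thus yields the 256 computational, 64 service, and 320 total processors cited later in the description.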
- the PCI modules 20 serve dual purposes, depending upon the service processor 16 with which they are used.
- the PCI connectors on the master boot service processor 16₀ are used to connect to the boot device 26 and the system console 24.
- the PCI modules 20 on the regular service processors 16₁ through 16₃ are used for all other peripherals.
- Some of the supported PCI-based interconnects include small computer systems interface (“SCSI”), fiber distributed data interface (“FDDI”), high performance parallel interface (“HIPPI”) and others.
- SCSI small computer systems interface
- FDDI fiber distributed data interface
- HIPPI high performance parallel interface
- In FIG. 2, the interconnect strategy for the computer system 10 of Figs. 1A and 1B is shown in greater detail in an implementation employing sixteen segments 12₀ through 12₁₅ interconnected by means of sixteen trunk lines 14₀ through 14₁₅.
- a number of memory banks 50₀ through 50₁₅, each allocated to a respective one of the computational processors 18₀ through 18₁₅ (resulting in sixteen memory banks 50 per segment 12 and two hundred fifty-six memory banks 50 in total for a sixteen-segment 12 computer system 10), form a portion of the computer system 10 and are respectively coupled to the trunk lines 14₀ through 14₁₅ through a like number of memory switches 52₀ through 52₁₅.
- the memory utilized in the memory banks 50₀ through 50₁₅ may be synchronous static random access memory ("SSRAM") or other suitable high-speed memory devices.
- SSRAM synchronous static random access memory
- each of the segments 12₀ through 12₁₅ includes, for example, twenty processors (four service processors 16₀ through 16₃ and sixteen computational processors 18₀ through 18₁₅) coupled to the trunk lines 14₀ through 14₁₅ through a corresponding one of a like number of processor switches 54₀ through 54₁₉.
- Each segment 12 interconnects to all other segments 12 through the crossbar switch.
- the computer system 10 crossbar switch technology enables segments 12 to have uniform memory access times across segment boundaries, as well as within the individual segment 12. It also enables the computer system 10 to employ a single memory access protocol for all the memory in the system.
- the crossbar switch may utilize high-speed Field Programmable Gate Arrays ("FPGAs") to provide interconnect paths between memory and the processors, regardless of where the processors and memory are physically located. This crossbar switch interconnects every segment 12 and enables the processors and memory located in different segments 12 to communicate with a uniform latency.
- each crossbar switch has a 1-clock latency per tier, which includes reconfiguration time. For a sixteen-segment 12 computer system 10 utilizing three hundred and twenty processors 16, 18, only two crossbar tiers are required.
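The two-tier figure above can be sanity-checked with a rough model: with switches of radix r, roughly ceil(log base r of N) tiers suffice to connect N endpoints. The radix value used below is an assumption chosen purely for illustration; the patent states only the result (two tiers, one clock each, for 320 processors).

```python
import math


def crossbar_tiers(endpoints: int, radix: int) -> int:
    """Rough tier count for a multi-tier crossbar: each tier of
    radix-`radix` switches multiplies reach by `radix`, so the number
    of tiers grows logarithmically with the endpoint count. The radix
    is an assumption; the patent does not state the switch radix."""
    return max(1, math.ceil(math.log(endpoints, radix)))


def interconnect_latency_clocks(endpoints: int, radix: int) -> int:
    # The text gives 1 clock of latency per tier, including
    # reconfiguration time, so latency equals the tier count.
    return crossbar_tiers(endpoints, radix)
```

With an assumed radix of 20 (one segment's worth of processors), 320 endpoints indeed need two tiers, matching the text.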
- the computer system 10 may preferably utilize SSRAM for the memory banks 50 since it presents a component cycle time of 6 nanoseconds.
- Each memory bank 50 supports from 64 to 256 Mbytes of memory.
- Each computational processor 18 supports one memory bank 50, with each memory bank 50 being 256 bits wide, plus 32 parity bits for a total width of 288 bits.
- the memory bank 50 size may be designed to match the cache line size, resulting in a single bank access for a full cache line. Read and write memory error correction may be provided by completing parity checks on address and data packets.
- the parity check for address packets may be the same for both read and write functions wherein new and old parity bits are compared to determine whether or not the memory read or write should continue or abort.
- a parity check may be done on each of the data packets arriving in memory.
- Each of these data packets has an 8-bit parity code appended to it.
- a new 8-bit parity code is generated for the data packet and the old and new parity codes are compared.
- the comparison results in one of two types of codes: single bit error (“SBE") or double-bit or multi-bit error (“DBE").
- SBE single bit error
- DBE double-bit or multi-bit error
- the single-bit error may be corrected on the data packet before it is entered in memory.
- In the case of a double-bit or multi-bit error, the data packet is not written to memory; the error is reported back to the processor, which retries the data packet reference.
- when a memory "read" occurs, each of the data packets read from memory generates an 8-bit parity code. This parity code is forwarded with the data to the processor.
- the processor performs single error correction and double error detection (“SECDED") on each data packet.
- SECDED single error correction and double error detection
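The write-path check described above (compare the old and new 8-bit parity codes, correct a single-bit error before the write, reject and retry on a double- or multi-bit error) can be modeled simply. This is a deliberately simplified sketch: real SECDED decoding computes a Hamming syndrome, whereas here the error class is merely illustrated by counting differing bits between the two codes, and the function name is invented.

```python
def classify_packet(stored_code: int, recomputed_code: int) -> str:
    """Classify a data packet's parity comparison as OK, single-bit
    error ("SBE"), or double/multi-bit error ("DBE"). Both arguments
    are 8-bit parity codes; the class is taken from how many bits of
    the two codes disagree (an illustration, not true SECDED)."""
    diff = (stored_code ^ recomputed_code) & 0xFF
    errors = bin(diff).count("1")
    if errors == 0:
        return "OK"   # codes match: the write proceeds unchanged
    if errors == 1:
        return "SBE"  # corrected before the packet enters memory
    return "DBE"      # not written; the processor retries the reference
```

On the read path the same classification is done inside the processor, which performs the SECDED check on each arriving packet.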
- In FIG. 3, a simplified illustration of a sixteen-segment 12 computer system 10 is shown comprising a total of sixty-four service processors 16 and two hundred fifty-six computational processors 18 for a total of three hundred twenty processors.
- the service processors 16 and computational processors 18 interface to the computer program application software by means of the scalable single system image ("S3I") layer 60 as will be more fully described hereinafter.
- the service processors 16 handle all I/O operations as previously described, as well as the running of the computer system 10 operating system.
- the S3I layer 60 resides on top of the application 62 and comprises a common API/ABI layer 64 as well as a computational scheduler 66 and operating software 68 layers.
- the computational scheduler 66 interfaces to the computational processors 18₀ through 18₁₅ while the operating software 68 interfaces to the service processors 16₀ through 16₃.
- the operating software 68 provides an interrupt response signal 70 to the computational scheduler 66 to control the operation of the computational processors 18 as shown.
- a number of memory communication buffers 72 receive data from the various computational processors 18₀ through 18₁₅ and, in turn, supply data to the service processors 16₀ through 16₃.
- the preferred implementation of the scalable single system image architecture of the present invention is on a multiprocessor computer system 10 with uniform memory access across common, shared memory comprising a plurality of memory banks 50₀ through 50ₙ.
- processor subsystems may be partitioned into two groups: those which have I/O connectivity, i.e. the service processors 16₀ through 16ₙ, and those which have no I/O connectivity, i.e. the computational processors 18₀ through 18ₙ.
- the S3I utilizes a software environment consisting of service and computational processors 16, 18.
- a single copy of the operating system software 68 resides across all processors 16 within the service partition.
- Separate computational schedulers 66 exist in each computational processor 18.
- This software model guarantees a global resource sharing paradigm, in conjunction with a strong "single system image”.
- Highly scalable threads of execution must be present in both the operating system software 68 and user application 62 software design model in conjunction with a high degree of software "multithreading" for efficient utilization of this architecture.
- user applications 62 are able to initiate and terminate multiple threads of execution in application user space, allowing further elimination of operating system software 68 overhead and increased scalability as the number of physical processors increases.
- a user application 62 makes requests of the system through the normal system software mechanism.
- the application 62 has no awareness of whether the application 62 is executing on a service processor 16 or a computational processor 18. If the request is executed on a service processor 16, the request follows the normal operating system path directly into the operating system software 68 for processing.
- If the request is executed on a computational processor 18, the request is processed by the computational scheduler 66.
- the thread making the request is placed on the run queue of the service processor 16 and the computational processor 18 issues a request to the service processor 16 to examine the queues.
- the operating software 68 executing in the service processor 16 examines the request queue and processes the request as if it had originated on the service processor 16.
- the requesting computational processor 18 is either suspended until interrupt acknowledge or placed into the general scheduling tables maintained by the operating software 68 for dispatching of additional work.
- the service processor 16 acknowledges the original request by queuing an application thread for execution on a computational processor 18 and restoring the original application context, which places the application 62 back into execution.
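The requesting thread's states in this exchange (suspended until the interrupt acknowledge, or rescheduled onto other work, then running again once the acknowledge restores its context) can be sketched as a tiny state machine. The enum and function names are illustrative, not from the patent.

```python
from enum import Enum


class ThreadState(Enum):
    RUNNING = "running"          # executing application code
    SUSPENDED = "suspended"      # awaiting the interrupt acknowledge
    RESCHEDULED = "rescheduled"  # handed to the general scheduling tables


def after_request(other_work_available: bool) -> ThreadState:
    """Per the passage above, after issuing a request the
    computational processor either suspends until the interrupt
    acknowledge or takes additional work from the operating
    software's scheduling tables."""
    return (ThreadState.RESCHEDULED if other_work_available
            else ThreadState.SUSPENDED)


def on_interrupt_ack(state: ThreadState) -> ThreadState:
    # The service processor's acknowledge restores the original
    # application context, placing the thread back into execution.
    return ThreadState.RUNNING
```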
- any physical processor 16 in the service partition will be able to execute within the operating system 68 simultaneously.
- Critical data regions may be locked utilizing standard locking primitives currently found in the underlying hardware.
- the base component of the software environment is the operating system software 68.
- the computer system 10 may use, in a preferred embodiment, an enhanced version of the SunSoft® Solaris® 2.6 operating system available from Sun Microsystems, Inc., Palo Alto, CA, which is modified to achieve better performance across multiple computational and service processors 18, 16 by limiting the operating system to execute only in the service processors 16.
- this technique is further accomplished by utilizing a computational scheduler 66 to communicate operating system requests and scheduling information between the service and computational processors 16, 18.
- a single copy of the operating system software 68 executes in all service processors 16, while separate computational schedulers 66 reside in each computational processor 18.
- This software model provides for global resource sharing in conjunction with a strong scalable single system image.
- the computer system 10 of the present invention provides for highly scalable threads of execution in both the operating system software 68 and user application software 62, in conjunction with a high degree of software "multithreading". While there have been described above the principles of the present invention in conjunction with a specific computer architecture, any number of service and/or computational processors may be utilized and it is to be clearly understood that the foregoing description is made only by way of example and not as a limitation to the scope of the invention.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP98962876A EP1064597A1 (en) | 1998-01-20 | 1998-12-03 | Scalable single system image operating software architecture for a multi-processing computer system |
JP2000540495A JP2002509311A (en) | 1998-01-20 | 1998-12-03 | Scalable single-system image operating software for multi-processing computer systems |
CA002317132A CA2317132A1 (en) | 1998-01-20 | 1998-12-03 | Scalable single system image operating software architecture for a multi-processing computer system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US887198A | 1998-01-20 | 1998-01-20 | |
US09/008,871 | 1998-01-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1999036851A1 true WO1999036851A1 (en) | 1999-07-22 |
Family
ID=21734176
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1998/025586 WO1999036851A1 (en) | 1998-01-20 | 1998-12-03 | Scalable single system image operating software architecture for a multi-processing computer system |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1064597A1 (en) |
JP (1) | JP2002509311A (en) |
CA (1) | CA2317132A1 (en) |
WO (1) | WO1999036851A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1834237A1 (en) * | 2004-12-30 | 2007-09-19 | Koninklijke Philips Electronics N.V. | Data processing arrangement |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5109512A (en) * | 1990-05-31 | 1992-04-28 | International Business Machines Corporation | Process for dispatching tasks among multiple information processors |
US5325526A (en) * | 1992-05-12 | 1994-06-28 | Intel Corporation | Task scheduling in a multicomputer system |
US5675795A (en) * | 1993-04-26 | 1997-10-07 | International Business Machines Corporation | Boot architecture for microkernel-based systems |
Application Events (1998)
- 1998-12-03: WO application PCT/US1998/025586 filed as WO1999036851A1 (application discontinued)
- 1998-12-03: EP application EP98962876A filed as EP1064597A1 (withdrawn)
- 1998-12-03: JP application JP2000540495A filed as JP2002509311A (withdrawn)
- 1998-12-03: CA application CA002317132A filed as CA2317132A1 (abandoned)
Also Published As
Publication number | Publication date |
---|---|
EP1064597A1 (en) | 2001-01-03 |
CA2317132A1 (en) | 1999-07-22 |
JP2002509311A (en) | 2002-03-26 |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1. Designated state(s): CA JP MX |
AL | Designated countries for regional patents | Kind code of ref document: A1. Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
ENP | Entry into the national phase | Ref document number: 2317132. Country of ref document: CA. Kind code of ref document: A. Format of ref document f/p: F |
WWE | Wipo information: entry into national phase | Ref document number: 1998962876. Country of ref document: EP |
ENP | Entry into the national phase | Ref country code: JP. Ref document number: 2000 540495. Kind code of ref document: A. Format of ref document f/p: F |
WWP | Wipo information: published in national office | Ref document number: 1998962876. Country of ref document: EP |
WWW | Wipo information: withdrawn in national office | Ref document number: 1998962876. Country of ref document: EP |