Total message size = M, number of colors = C
Message size for color i = M(i)
Block size for color i = B(i)
Bytes received for color i = R(i) = 0
Bytes sent for color i = S(i) = 0
Unsent bytes for color i = U(i) = R(i) - S(i)
Agree on counter ids
For each counter with S(i) < M(i):
    Read counter i, determine R(i), U(i) = R(i) - S(i)
    Send next U(i) bytes
    S(i) = S(i) + U(i)
    U(i) = 0
    If M(i) - S(i) < B(i), B(i) = M(i) - S(i)
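A minimal executable sketch of the per-color send loop above (Python is used purely for illustration; the hardware byte counters are modeled by a `read_counter` callback, and the outer repeat-until-all-colors-complete loop that the pseudocode leaves implicit is made explicit):

```python
# Simulation of the per-color send loop sketched above. Hardware DMA
# reception counters are modeled by read_counter(i); in a real system
# R(i) would be read from a DMA counter register.

def send_all_colors(M, B, read_counter):
    """M: per-color message sizes, B: per-color block sizes,
    read_counter(i): bytes received so far for color i."""
    C = len(M)
    S = [0] * C                       # S(i): bytes sent per color
    while any(S[i] < M[i] for i in range(C)):
        for i in range(C):
            if S[i] >= M[i]:
                continue              # color i fully sent
            R_i = read_counter(i)     # R(i): bytes received for color i
            U_i = R_i - S[i]          # U(i): received but not yet sent
            if U_i > 0:
                # "Send" the next U(i) bytes (stands in for injecting
                # a DMA message descriptor in real hardware).
                S[i] += U_i
            # Shrink the final block if fewer than B(i) bytes remain.
            if M[i] - S[i] < B[i]:
                B[i] = M[i] - S[i]
    return S

# Example: 2 colors whose bytes have all already been received.
sent = send_all_colors([10, 6], [4, 4], lambda i: [10, 6][i])
```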
OPTIMIZED COLLECTIVES USING A DMA ON A PARALLEL COMPUTER

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
This invention was made with Government support under Contract No. B554331 awarded by the Department of Energy. The Government has certain rights in this invention.
CROSS-REFERENCE TO RELATED APPLICATIONS
The present invention is related to the following commonly-owned, co-pending United States Patent Applications filed on even date herewith, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein. U.S. patent application Ser. No. 11/768,777, for "A SHARED PERFORMANCE MONITOR IN A MULTIPROCESSOR SYSTEM"; U.S. patent application Ser. No. 11/768,781, for "DMA SHARED BYTE COUNTERS IN A PARALLEL COMPUTER"; U.S. patent application Ser. No. 11/768,784, for "MULTIPLE NODE REMOTE MESSAGING"; U.S. patent application Ser. No. 11/768,697, for "A METHOD AND APPARATUS OF PREFETCHING STREAMS OF VARYING PREFETCH DEPTH"; U.S. patent application Ser. No. 11/768,532, for "PROGRAMMABLE PARTITIONING FOR HIGH-PERFORMANCE COHERENCE DOMAINS IN A MULTIPROCESSOR SYSTEM"; U.S. patent application Ser. No. 11/768,857, for "METHOD AND APPARATUS FOR SINGLE-STEPPING COHERENCE EVENTS IN A MULTIPROCESSOR SYSTEM UNDER SOFTWARE CONTROL"; U.S. patent application Ser. No. 11/768,547, for "INSERTION OF COHERENCE EVENTS INTO A MULTIPROCESSOR COHERENCE PROTOCOL"; U.S. patent application Ser. No. 11/768,791, for "METHOD AND APPARATUS TO DEBUG AN INTEGRATED CIRCUIT CHIP VIA SYNCHRONOUS CLOCK STOP AND SCAN"; U.S. patent application Ser. No. 11/768,795, for "DMA ENGINE FOR REPEATING COMMUNICATION PATTERNS"; U.S. patent application Ser. No. 11/768,799, for "METHOD AND APPARATUS FOR A CHOOSE-TWO MULTI-QUEUE ARBITER"; U.S. patent application Ser. No. 11/768,800, for "METHOD AND APPARATUS FOR EFFICIENTLY TRACKING QUEUE ENTRIES RELATIVE TO A TIMESTAMP"; U.S. patent application Ser. No. 11/768,572, for "BAD DATA PACKET CAPTURE DEVICE"; U.S. patent application Ser. No. 11/768,593, for "EXTENDED WRITE COMBINING USING A WRITE CONTINUATION HINT FLAG"; U.S. patent application Ser. No. 11/768,805, for "A SYSTEM AND METHOD FOR PROGRAMMABLE BANK SELECTION FOR BANKED MEMORY SUBSYSTEMS"; U.S. patent application Ser. No. 11/768,905, for "AN ULTRASCALABLE PETAFLOP PARALLEL SUPERCOMPUTER"; U.S. patent application Ser. No. 11/768,810, for "SDRAM DDR DATA EYE MONITOR METHOD AND APPARATUS"; U.S. patent application Ser. No. 11/768,812, for "A CONFIGURABLE MEMORY SYSTEM AND METHOD FOR PROVIDING ATOMIC COUNTING OPERATIONS IN A MEMORY DEVICE"; U.S. patent application Ser. No. 11/768,559, for "ERROR CORRECTING CODE WITH CHIP KILL CAPABILITY AND POWER SAVING ENHANCEMENT"; U.S. patent application Ser. No. 11/768,552, for "STATIC POWER REDUCTION FOR MIDPOINT-TERMINATED BUSSES"; U.S. patent application Ser. No. 11/768,527, for "COMBINED GROUP ECC PROTECTION AND SUBGROUP PARITY PROTECTION"; U.S. patent application Ser. No. 11/768,669, for "A MECHANISM TO SUPPORT GENERIC COLLECTIVE COMMUNICATION ACROSS A VARIETY OF PROGRAMMING MODELS"; U.S. patent application Ser. No. 11/768,813, for "MESSAGE PASSING WITH A LIMITED NUMBER OF DMA BYTE COUNTERS"; U.S. patent application Ser. No. 11/768,619, for "ASYNCRONOUS BROADCAST FOR ORDERED DELIVERY BETWEEN COMPUTE NODES IN A PARALLEL COMPUTING SYSTEM WHERE PACKET HEADER SPACE IS LIMITED"; U.S. patent application Ser. No. 11/768,682, for "HARDWARE PACKET PACING USING A DMA IN A PARALLEL COMPUTER"; and U.S. patent application Ser. No. 11/768,752, for "POWER THROTTLING OF COLLECTIONS OF COMPUTING ELEMENTS".
FIELD OF THE INVENTION
The present disclosure generally relates to supercomputer systems and architectures and, particularly, to optimizing the performance of collective communication operations using a DMA on a parallel computer.
BACKGROUND OF THE INVENTION
Collective communication operations involve several, if not all, processes at a time. Collective communication operations such as MPI (Message Passing Interface) broadcast, which broadcasts data to all the processes in the communicator, and MPI allreduce, which performs reduction operations, are important communication patterns that can often limit the performance and scalability of applications. Thus it is desirable to get the best possible performance from such operations.
BlueGene/L systems, massively parallel computers, break up a long broadcast into several shorter broadcasts. The message is broken up into disjoint submessages, called colors, and the submessages are sent in such a way that different colors use different links on the 3D (three-dimensional) torus. In this way, a single broadcast in one dimension of a torus could theoretically achieve 2 links' worth of bandwidth (with 2 colors), a 2-dimensional broadcast could achieve 4 links' worth of bandwidth, and a 3-dimensional broadcast could achieve 6 links' worth of bandwidth. On those systems, however, there is no DMA engine; instead, processors are responsible for injecting and receiving each packet. Accordingly, what is desirable is a method and system that can utilize features of a DMA engine and network so as to achieve high-throughput large-message collectives. It is also desirable to have a method and system that utilizes those features to realize low-latency small-message collectives.
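To make the color decomposition concrete, the following sketch splits a message into disjoint color submessages (a simple contiguous split is assumed here for illustration; the actual assignment of data to colors and links is implementation-specific):

```python
def split_into_colors(message: bytes, num_colors: int):
    """Split a message into disjoint contiguous submessages (colors).
    Each color can then be routed over a different torus link, so a
    1-D broadcast with 2 colors can use 2 links at once."""
    n = len(message)
    base, rem = divmod(n, num_colors)
    colors, off = [], 0
    for c in range(num_colors):
        size = base + (1 if c < rem else 0)  # spread the remainder
        colors.append(message[off:off + size])
        off += size
    return colors

# Example: a 10-byte message split into 2 colors of 5 bytes each.
parts = split_into_colors(b"abcdefghij", 2)
```

Because the submessages are disjoint, concatenating the colors in order recovers the original message.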
BRIEF SUMMARY OF THE INVENTION
A method and system for optimizing collective operations using a direct memory access controller on a parallel computer are provided. A method for optimizing collective operations using a direct memory access controller on a parallel computer, in one aspect, may comprise establishing a byte counter associated with a direct memory access controller for each submessage in a message. A byte counter includes at least a base address of memory and a byte count associated with a submessage. The method may also comprise monitoring the byte counter associated with a submessage to determine whether at least a block of data of the submessage has been received.
The block of data has a predetermined size. The method may further include processing the block when said block has been fully received, and continuing the monitoring and processing steps until all blocks in all submessages in the message have been processed.
A system for optimizing collective operations using a direct memory access controller on a parallel computer, in one aspect, may comprise one or more processors in a node and memory in the node. The memory includes at least an injection fifo and a receive buffer. A direct memory access controller in the node includes at least a byte counter for each submessage of a message. A byte counter includes at least a base address in memory for storing the associated submessage and a counter value. The direct memory access controller is operable to update the counter value as a result of receiving one or more bytes of the associated submessage into the node. One or more processors are operable to monitor the counter value, and when a predetermined number of bytes of the submessage is received, the one or more processors are further operable to process a block of data comprising the received predetermined number of bytes of the submessage.
Further features as well as the structure and operation of various embodiments are described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a parallel computer with multiple nodes, a DMA on each node, and an interconnection network.
FIG. 2 illustrates a general structure of DMA byte counters in one embodiment of the present disclosure.
FIG. 3 illustrates a reception buffer for a broadcast split into multiple colors in one embodiment of the present disclosure.
FIG. 4 is a flow diagram illustrating a method for a long broadcast in one embodiment of the present disclosure.
FIG. 5 shows a receive buffer for a short reduction with slots allocated for each processor in one embodiment of the present disclosure.
DETAILED DESCRIPTION

A method and system are disclosed that provide high-throughput large-message and low-latency small-message collective operations on a parallel machine with a DMA (Direct Memory Access) engine. In computer architecture designs of massively parallel computers or supercomputers such as BlueGene/P, jointly developed by International Business Machines Corporation and other institutions, there is a DMA engine that is integrated onto the same chip as the processors, cache memory, memory controller and network logic. Briefly, a DMA engine allows access to system memory independently of a central processing unit.
The DMA in one embodiment has multiple byte counters that count the number of bytes injected into the network and received from the network. Each counter includes a base address addressing a location in memory. Each counter also includes a byte count of a packet or message injected or received. A packet contains a destination, a counter identifier (id) and an offset. The offset in the packet specifies the memory position, relative to the base address contained in the counter identified by the counter id, where the data of the packet is stored. A DMA unit may have, for example, 256 reception and 256 injection counters. This number may vary depending on design choice; thus a DMA unit may have any number of counters. A message descriptor is placed into an injection fifo (first-in-first-out). A processor in a node creates message descriptors, for example, based on the parameters or attributes of an application call such as an MPI call. This message descriptor specifies or contains information associated with the injection counter id, the number of bytes, the offset of the send buffer from the injection counter base address, a destination, a reception counter id, and an offset of the receive buffer from the reception counter base address. The destination specification or attribute may include a broadcast bit. In BlueGene/L and P systems, packets can be broadcast down a single dimension of a three-dimensional torus. Several one-dimensional broadcasts may be performed to broadcast in multiple dimensions.
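As a sketch, the counters, packet placement, and message descriptors described above can be modeled in software as follows (field names such as `counter_id` and `send_offset` are assumptions for illustration, not the actual BlueGene/P hardware layout):

```python
from dataclasses import dataclass

class ByteCounter:
    """Models a DMA reception counter: a base address plus a byte count."""
    def __init__(self, base_address, byte_count):
        self.base = base_address      # location in memory
        self.count = byte_count       # decrements as bytes arrive

def store_packet(memory, counters, counter_id, offset, payload):
    """Place a packet's payload at base + offset, as the DMA would,
    and decrement the identified reception counter."""
    ctr = counters[counter_id]
    addr = ctr.base + offset          # offset is relative to the base
    memory[addr:addr + len(payload)] = payload
    ctr.count -= len(payload)
    return addr

@dataclass
class MessageDescriptor:
    """Software model of an injection-fifo message descriptor, carrying
    the fields enumerated in the text."""
    injection_counter_id: int
    num_bytes: int
    send_offset: int                  # from the injection counter base
    destination: int
    broadcast: bool                   # broadcast bit in the destination
    reception_counter_id: int
    recv_offset: int                  # from the reception counter base

# Example: a packet carrying 4 bytes for counter 7 lands at 16 + 4 = 20.
memory = bytearray(64)
counters = {7: ByteCounter(base_address=16, byte_count=1024)}
addr = store_packet(memory, counters, counter_id=7, offset=4, payload=b"data")
```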
The present disclosure describes a method and system in one embodiment in shared memory mode, in which the processors on the same node run one or more threads in the same application process and have a common memory address space. Thus each processor can access all of the memory in this common address space. One of the processors is assigned to each color in an arbitrary manner; thus a single processor could handle all colors, or different processors could each handle a different color. For simplicity of explanation, the description herein assumes that a single processor is handling all colors, but one skilled in the art will appreciate variations in which different processors handle different colors. Thus, the method and system of the present disclosure are not limited to a single processor handling all colors.
In one embodiment of the present disclosure, a DMA engine and a plurality of counters, for instance, one reception counter per color and one injection counter per color, may achieve theoretical peaks. In the method and system of the present disclosure in one embodiment, a core or a processor monitors the byte counters that track the number of received bytes. If a node is required to send the data corresponding to a particular color to another node or set of nodes, then when a sufficient number of bytes is received for that color, a processor on the node injects a message descriptor into a DMA injection fifo, thereby initiating transfer of the bytes out of that node. The message descriptor in one embodiment includes the injection counter identifier (id) that identifies the counter having the base address of a memory location, the offset from that base address where the data to send is stored, and the size of the data to send. Messages in different injection fifos can result in data flowing out of different links from the node. In this way, all colors can be both received and sent at the same time at every node in the network. For example, a node may receive and send on links at the same time.
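The monitor-and-forward behavior described above can be sketched as follows (a software model: `bytes_received` stands in for polling a hardware reception byte counter, and `inject_send` for placing a descriptor into a DMA injection fifo):

```python
def forward_color(total_bytes, block_size, bytes_received, inject_send):
    """Poll the reception byte counter for one color and forward data
    downstream as soon as at least a block (or the final tail) is ready.
    bytes_received(): polls the counter; inject_send(offset, nbytes):
    models injecting a send descriptor into an injection fifo."""
    sent = 0
    while sent < total_bytes:
        received = min(bytes_received(), total_bytes)
        ready = received - sent       # bytes received but not yet sent
        if ready >= block_size or (ready > 0 and sent + ready == total_bytes):
            inject_send(sent, ready)  # initiate transfer out of the node
            sent += ready
    return sent

# Example: all 10 bytes of this color have already arrived, so a
# single send covers the whole submessage.
log = []
n = forward_color(10, 4, lambda: 10, lambda off, nb: log.append((off, nb)))
```

When bytes trickle in, the same loop forwards them block by block, which is what lets every node receive and send a color concurrently.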
Broadcast messages may be classified as long or short, for example, based on performance measurements associated with communicating those messages. For a long broadcast, software pre-arranges injection and reception counter ids for each color, and an offset from the reception counters. In one embodiment, counter ids and offsets for each color are common among the nodes, although they do not need to be in all cases. For instance, if there is no hardware broadcast, they need not be common. Counter ids are put into the packets as part of data that describes those packets, for instance, in a header or other portion of the packet. Assuming the byte counters decrement upon reception, all nodes program the byte counters to a suitably large number, for instance, one that is at least as large as the message length. The source of the broadcast injects a message descriptor into an injection fifo. The DMA engine takes that descriptor, for example, from the injection fifo, puts the information from the descriptor into each packet and injects individual packets of the message into the network.
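The counter programming for a long broadcast can be sketched as follows (assuming, as the text does, counters that decrement on reception and are preloaded with a value at least as large as the message length):

```python
LARGE = 1 << 20                       # suitably large initial value

class ReceptionCounter:
    """Models a decrement-on-reception DMA byte counter."""
    def __init__(self):
        self.value = LARGE            # programmed before the broadcast
    def on_packet(self, nbytes):
        self.value -= nbytes          # DMA decrements as packets arrive
    def bytes_received(self):
        # Amount decremented so far tells how much of the message is in.
        return LARGE - self.value

# Example: three packets of a 640-byte broadcast arrive at a node.
ctr = ReceptionCounter()
for pkt_bytes in (256, 256, 128):
    ctr.on_packet(pkt_bytes)
message_len = 640
done = ctr.bytes_received() == message_len
```

A monitoring processor detects completion (or partial progress, for block-wise processing) purely from how far the counter has decremented.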