WO2000011547A1 - Processing element with special application for branch functions

Publication number
WO2000011547A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor
instructions
instruction
branch
information
Application number
PCT/US1999/019197
Other languages
French (fr)
Other versions
WO2000011547A9 (en
Inventor
Rajit Manohar
Alain Martin
Original Assignee
California Institute Of Technology
Application filed by California Institute Of Technology
Priority to AU56865/99A (AU5686599A)
Priority to EP99943848A (EP1105793A4)
Publication of WO2000011547A1
Publication of WO2000011547A9

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/43Checking; Contextual analysis
    • G06F8/433Dependency analysis; Data or control flow analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/445Exploiting fine grain parallelism, i.e. parallelism at instruction level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3804Instruction prefetching for branches, e.g. hedging, branch folding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3808Instruction prefetching for instruction reuse, e.g. trace cache, branch target cache
    • G06F9/381Loop buffering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824Operand accessing
    • G06F9/3826Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
    • G06F9/3828Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage with global bypass, e.g. between pipelines, between clusters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824Operand accessing
    • G06F9/383Operand prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3842Speculative instruction execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units

Abstract

A processor system is formed from a branch processor (fig. 1, unit 110) and a main data processor (fig. 1, unit 120). The main data processor operates like a conventional processor. The branch processor determines the program's branches and supplies control-flow information that is usually figured out speculatively. A synchronizer is occasionally used to synchronize the branch processor and the data processor through a feedback channel (fig. 1, unit 122).

Description

PROCESSING ELEMENT WITH SPECIAL APPLICATION FOR BRANCH FUNCTIONS
Cross Reference To Related Applications
This application claims the benefit of U.S. Provisional Application No. 60/097,515, filed August 21, 1998.
The present application describes a processor architecture which uses independent instruction streams: one for the main processor, which decodes and performs the instructions forming the actual computation, and another for the branch processor, which determines the sequence of program counter values used to fetch instructions for the main processor.
Background
It is desirable in many processor systems to execute instructions as quickly as possible. One bottleneck in the execution of such instructions is the process of computing the program counter values that determine which instruction gets executed next. After determining which instructions get executed in which order, the specified instructions are fetched from memory. In a traditional architecture, each instruction has information about the sequence of program counters that constitute the program. However, one cannot use this information to directly compute which program counter is to be generated next. This determination has typically been done by examining the preceding instruction, which might be a branch.
Traditional branch delays are introduced into the design to alleviate this problem. Hardware can use branch prediction techniques to "guess" which instruction will be fetched next. However, when this prediction fails, the hardware cancels the result of the prediction.
Branches can be found in many places in programs. Examples include subroutine calls, loops and if statements. Fixed-length loops and subroutine calls make it possible to predict how the branches behave when the program is compiled.
Summary
The present application describes a processor architecture that provides additional information that has an application in determining branch information. This processor uses two separate instruction streams. A main or data processor instruction stream includes the instruction information. The branch processor instruction stream determines the program flow.
According to one aspect, an asynchronous processor is described which carries out this function. Another aspect teaches a synchronous design style.
Brief Description of the Drawings
These and other aspects will be described in detail with reference to the accompanying drawings, wherein:
Figure 1 shows a block diagram of a basic branch processor-modified processing system.
Description of the Embodiments
A processor that does not have a branch delay slot has a number of different instruction types. Different instructions are used to compute different functions. Each instruction has implicit control flow information based on its content. An instruction such as
pc: addu r1, r2, r3
implicitly encodes that the next program counter value is the current program counter ("pc") +4. An instruction such as
pc: bne r1, r3, L
encodes that the next program counter is either pc+4 or L, depending on whether or not registers r1 and r3 are equal. A processor computes the sequence of program counter values. However, existing instruction sets encode this information very inefficiently from the density point of view. Most of the time, an instruction needs to be examined simply to determine that the next instruction to be executed is at pc+4. Consider a simple FORTRAN 77 loop:
      do 10 I = 1, 100
         c(I) = a(I) + b(I)
10    continue                                   ... (1)
This loop would be compiled into a number of instructions, with a single branch at the end of the loop. Most of the instructions would increment the program counter by 4; i.e. "pc:=pc+4". It can be statically determined, by looking at the loop, that the sequence of instructions that implement the body of the loop will be executed 100 times. However, this information is not encoded in the instruction set. The present application defines a new separate and independent instruction sequence that encodes the sequence of program counter (PC) values. A program such as the one shown in (1) is compiled into two instruction streams. A first instruction stream determines the computations to be performed. The second instruction stream determines the flow of control.
The basic layout is shown in Figure 1. A "data processor" 120 executes the first instruction stream, which determines the computation to be performed. This is a traditional processor operation.
The second instruction stream includes branch processor instructions. These are executed on a separate processor, the branch processor 110. Conceptually, this second instruction stream is a sequence of instructions that computes, or includes information to compute, the sequence of program counter values. The program counter information is sent to memory over memory interface 102. The sequence of instructions 104 is responsively received from memory. These instructions are passed as instruction stream 106 to the data processor 120, which is shown as a traditional processor with an instruction decoder and instruction-executing registers.
The branch processor 110 also receives specified feedback from the data processor 120. The feedback includes information such as synchronization information from a sync register 124. This information is used by certain types of instructions to enable the data processor 120 to control, and provide information to, the branch processor 110. For example, some feedback from the data processor 120 is obtained when executing code that has conditional branches. The feedback channel 122 can also occasionally synchronize the two processors.
Control flow in a program normally follows a call/return pattern. A hardware stack in the branch processor is used for storing program counter values. However, there are times when the control flow information is only available at run time.
A special instruction in the main data processor, called "send!", is defined to allow executing programs when the control flow is only available at run time. The send! instruction sends a data value from the data processor 120 to the branch processor 110 via the synchronization channel 122. Branch processor instructions that read a value from this channel have a "?" appended to their names. The instructions are described in the following, where addr refers to the address of instructions to be executed on the data processor, and baddr refers to addresses of branch processor instructions.
Block fetch instructions are introduced to compress control flow information within basic blocks. The instruction fetch addr, N means "fetch and execute N instructions that begin at address addr." This is possible when the number N of instructions that will be executed sequentially can be determined statically. The control-flow information for the whole block is compressed into this single instruction.
This instruction can be used to implement "straight-line" microcode. A sequential stream of instructions can implement a complex task without increasing code size significantly. The single fetch instruction can result in a smaller instruction cache footprint for a program in the case when common code can be shared among different parts of the program.
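The effect of a block fetch can be illustrated with a minimal Python sketch. It is not part of the described hardware: the function name expand_fetch and the stride parameter are assumptions made for illustration (a stride of 4 matches the pc+4 convention used earlier, a stride of 1 matches word-addressed instruction memory).

```python
def expand_fetch(addr, n, stride=1):
    """Return the program counter values implied by 'fetch addr, N'.

    stride is an assumption of this sketch: 1 for word-addressed
    instruction memory, 4 for byte-addressed 32-bit instructions.
    """
    return [addr + k * stride for k in range(n)]


# One branch processor instruction stands in for N sequential PC values.
print(expand_fetch(0x100, 5, stride=4))
```

A single fetch therefore replaces N implicit "pc:=pc+4" steps that a conventional instruction fetch would have to discover one instruction at a time.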
Looping constructs are implemented without significant overhead, using the following two instructions:
push baddr, N
dec
The push instruction stores the pair (baddr, N) on the hardware stack. Branch processor execution continues with the next instruction.
The "dec" instruction examines the pair (baddr, N) stored on the top of the stack, and decrements N. If the result is zero (or negative), the stack is popped; otherwise, the branch processor begins execution at address baddr. For example, the code corresponding to a loop that executes a sequence of 15 instructions 10 times would be:
push A, 10;
A: fetch addr, 15; dec
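The loop above can be traced with a short Python sketch. This is an illustration under stated assumptions, not the hardware implementation: run_loop is a hypothetical helper, the stack is a Python list of [baddr, N] pairs, and instruction addresses advance by one.

```python
def run_loop(body_addr, body_len, count):
    """Simulate:  push A, count;   A: fetch body_addr, body_len;   dec"""
    stack = [["A", count]]              # push A, count
    pcs = []                            # PC values sent to the data processor
    while True:
        # A: fetch body_addr, body_len
        pcs.extend(range(body_addr, body_addr + body_len))
        # dec: decrement N; pop when it reaches zero, otherwise go back to A
        stack[-1][1] -= 1
        if stack[-1][1] <= 0:
            stack.pop()
            break
    return pcs, stack


pcs, stack = run_loop(body_addr=200, body_len=15, count=10)
print(len(pcs), stack)
```

With a count of 10 and a body of 15 instructions, the sketch issues 150 program counter values and leaves the stack empty, without any feedback from the data processor.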
The number of iterations in a loop is not always known at compile time. The following instruction is used to permit the execution of loops with iteration counts determined at run time:
pushN? baddr
This instruction receives the next data value from the synchronization channel and uses it as the loop count N; other than that, it behaves like a push instruction.
When breaking out of a loop, the hardware stack still has state information in it which needs to be destroyed. The pop instruction can explicitly pop the top of the hardware stack.
Function calls can be implemented with the call instruction. The instruction call baddr pushes (nextpc, 1) onto the branch processor stack, where nextpc is the program counter address immediately following the call, and then transfers control to baddr. Returning from a function is implemented by a "ret" instruction, which jumps to the address on the top of the stack and pops the stack.
A function call to an address determined at run time may occur when executing a function determined by looking at a function pointer stored in a table, or in the case of dynamic dispatch of methods in object-oriented languages. The call? instruction reads the address to branch to from the synchronization channel. It otherwise behaves like a call.
The push and pop instructions can be used to implement control flow in loops. To handle arbitrary branches, goto instructions of two flavors are introduced:
goto baddr
goto?
The first instruction unconditionally changes the branch processor execution address to baddr. The second instruction reads the address to branch to from the synchronization channel.
When control flow depends on computation in the data processor, the synchronization channel is used to determine the direction of the branch. The if? instruction is used for this purpose.
if? baddr
The instruction reads a value from the synchronization channel. It continues execution at address baddr if the value received is non-negative. Otherwise, execution continues with the next branch processor instruction. Performance is maximized if the matching send! is executed early in the data processor. Therefore, programs that have short sequences of instructions interspersed with conditional branches that depend on computation just performed would not be executed efficiently. In such cases, predicated execution, which executes instructions conditionally, could be used to improve performance.
A block of instructions is predicated using the instruction fetch? addr, N. If the value received from the data processor is non-negative, then the block of N instructions stored at address addr is executed; otherwise, the instruction behaves like a no-op (nop).
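The difference between if? and fetch? can be sketched in Python as follows. The helper names if_q and fetch_q are hypothetical, and the synchronization channel is modelled as a deque that already holds the values sent by send!; in hardware a receive on an empty channel would block, whereas popleft on an empty deque raises IndexError.

```python
from collections import deque


def if_q(sync, bpc_next, baddr):
    """if? baddr: return the next branch processor address."""
    x = sync.popleft()                  # receive from the synchronization channel
    return baddr if x >= 0 else bpc_next


def fetch_q(sync, addr, n):
    """fetch? addr, N: return the PC values issued ([] means it acted as a nop)."""
    x = sync.popleft()
    return list(range(addr, addr + n)) if x >= 0 else []


sync = deque([3, -1, 0, -7])            # values the data processor sent with send!
print(if_q(sync, bpc_next=11, baddr=40))    # 3 is non-negative: continue at 40
print(if_q(sync, bpc_next=11, baddr=40))    # -1 is negative: fall through to 11
print(fetch_q(sync, addr=500, n=3))         # 0 is non-negative: block is fetched
print(fetch_q(sync, addr=500, n=3))         # -7 is negative: behaves like a nop
```

The predicated form changes only which data processor instructions are fetched; the branch processor's own execution address is unaffected.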
Table 1 Instruction-set summary
Table 1 shows a summary of the new instructions.
The following provides examples showing how code is generated for the branch processor. This code is generated, for example, in a compiler.
Embodiment 1 - code that has a control flow that can be determined when the program is compiled.
Consider the following FORTRAN program fragment:
do 10 i = 1, 100
   c(i) = a(i) + b(i)
10 continue
Compiling this code using f2c and the GNU C compiler for an R3000 processor results in the following assembly code.
E: r2:=1; i:=r2; r8:=c-4; r7:=a-4; r6:=b-4;
L: r2:=i; r5:=r2+1; r2:=r2*4; r3:=r2+r7; r4:=r2+r6; r3:=mem[r3]; r4:=mem[r4]; r2:=r2+r8;
i:=r5; r5:=(r5<101); r3:=r3+r4; mem[r2]:=r3; if r5 goto L
In a branch processor architecture, the underlined instructions shown above (the loop test r5:=(r5<101) and the branch if r5 goto L) are deleted from the data processor code. The following special branch processor code is generated:
fetch E, 5;
push L1, 100;
L1: fetch L, 11;
dec
In this example, the branch processor does not synchronize with the data processor because the control flow can be determined when the program is compiled.
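The instruction counts for this embodiment can be worked out with a few lines of Python. The constants are read off the listings above (5 instructions in block E, 13 in block L before the loop test and branch are removed, 11 afterwards); the variable names are illustrative only.

```python
SETUP = 5             # instructions in block E
BODY_ORIGINAL = 13    # instructions in block L, including the loop test and branch
BODY_STRIPPED = 11    # instructions in block L as fetched by "fetch L, 11"
ITERATIONS = 100

# Data processor instructions executed with and without a branch processor.
original = SETUP + BODY_ORIGINAL * ITERATIONS
stripped = SETUP + BODY_STRIPPED * ITERATIONS

# Branch processor instructions executed: fetch E and push once,
# then "fetch L" and "dec" once per iteration.
branch_instrs = 2 + 2 * ITERATIONS

print(original, stripped, branch_instrs)
```

Under these counts the data processor executes 1105 instructions instead of 1305, while the branch processor executes 202 instructions of its own.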
Embodiment 2 - the same program with a modification that permits the program to exit the loop early.
do 10 i = 1, 100
   c(i) = a(i) + b(i)
   if (c(i) .ge. 0) goto 11     /* This code allows the loop to exit early */
10 continue
11 ...
The compiled version of this program is shown below.
E: r2:=1; i:=r2; r8:=c-4; r7:=a-4; r6:=b-4;
L: r5:=i; r3:=r5*4; r2:=r3+r7; r4:=r3+r6; r2:=mem[r2]; r4:=mem[r4]; r3:=r3+r8; r2:=r2+r4; mem[r3]:=r2; if r2>=0 goto M: send! r2
r2:=r5+1; i:=r2; r2:=(r2<101); if r2 goto L
M:
In a branch processor architecture, the underlined instructions are deleted. In one case, since the branch is conditional, it is replaced by the send! instruction shown. The additional branch processor code would be:
fetch E, 5;
push L1, 100;
L1: fetch L, 10;
if? B;
fetch P, 2;
dec;
push L1, 1;
B: pop
An examination of the generated code reveals that the send! r2 can be reordered with two instructions to obtain:
r2:=mem[r2]; r4:=mem[r4]; r2:=r2+r4; send! r2; r3:=r3+r8; mem[r3]:=r2
This transformation improves performance because the send! action occurs earlier.
An important feature arises from the two streams of instructions being executed concurrently. Certain instructions synchronize the branch processor with the data processor. Because the two instruction streams are separate, a mismatch between them can cause deadlock, and exceptions and context switching need special handling.
DEADLOCK
The architecture includes instructions that synchronize the data processor 120 with the branch processor 110. Incorrect code could deadlock the hardware based on lack of synchronization. According to the present system, a deadlock detector is provided. The deadlock detector detects conditions which indicate deadlock, and responds thereto. Two expected sources of deadlock are:
• A receive on the synchronization channel is blocked because there is no matching send and the channel is empty;
• A send on the synchronization channel is blocked because the channel is full and there is no matching receive.
Every send! instruction must be fetched before the corresponding receive is executed in the branch processor. Therefore, the first case can only be caused by an incorrect program. This possibility can be avoided in the compiler.
The second case could occur if multiple sends have been dispatched in advance, causing the synchronization channel to become full before any receives could be executed. This case can also be prevented by a compiler. The compiler must keep track of the number of outstanding send! operations at any point in the program, and ensure that the number of pending send operations does not exceed the hardware limit.
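The bookkeeping the compiler needs can be sketched in Python. This is a simplified illustration: it assumes the compiler has a single linear trace of send and receive events, whereas a real compiler would have to consider every control-flow path. The function names and the event encoding are hypothetical.

```python
def max_outstanding(events):
    """Return the peak number of pending sends in a scheduled event trace.

    events is a sequence of the strings "send" and "recv", in program order.
    """
    pending = peak = 0
    for e in events:
        if e == "send":
            pending += 1
            peak = max(peak, pending)
        elif e == "recv":
            if pending == 0:
                # First deadlock case: a receive with no matching send.
                raise ValueError("receive with no matching send")
            pending -= 1
        else:
            raise ValueError(f"unknown event: {e!r}")
    return peak


def fits_channel(events, capacity):
    """True if the trace never exceeds the hardware channel capacity."""
    return max_outstanding(events) <= capacity


trace = ["send", "send", "recv", "send", "send", "recv", "recv", "recv"]
print(max_outstanding(trace), fits_channel(trace, capacity=2))
```

For the trace shown, three sends are outstanding at the peak, so a channel with two slots would not be sufficient and the compiler would have to reschedule.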
Although both cases of deadlock can be prevented using appropriate compilation techniques, errors can still occur. It might also be desirable to execute arbitrary programs on the hardware without causing the processor to deadlock. Deadlock can be detected by using a timing assumption or by running a deadlock detection program. Simple timing assumptions include assuming that the processor has deadlocked if instructions have not been decoded for a long interval, e.g. a microsecond. A simple termination detection algorithm could also be executed to detect deadlock. In the latter case, only the two ends of the synchronization channel need to be involved in the termination detection algorithm, along with counters to detect that there are no data values in transit from the branch processor to the data processor.
If a receive action is blocked forever, this implies that the code being executed on the branch processor is erroneous. In this case, execution of the exception handler should be started. If a send! action is blocked forever, this could imply that the code generated by the compiler is erroneous. However, if the compiler cannot predetermine the sequence of send! actions, the processor might deadlock when the synchronization channel fills. In this case, graceful recovery is possible by permitting the program to continue execution while draining the values stored in the synchronization channel.
To permit program execution in the presence of blocked send! instructions, it is possible to save (and restore) the values stored in the synchronization channel to memory. Therefore, a mechanism should be included that will memory-map the synchronization channel, and treat the hardware queue as an optimized implementation of this queue. With such an implementation, a blocked send! action will cause execution to fail only when a process exhausts the virtual memory on the system (or, alternatively, exceeds its resource limits).
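One way to picture a memory-backed synchronization channel is the following Python sketch. It is an assumption-laden illustration, not the described mechanism: the class name SpillingChannel, the hw_slots parameter, and the use of a second deque to stand in for the memory-mapped portion of the queue are all hypothetical.

```python
from collections import deque


class SpillingChannel:
    """FIFO channel with a small hardware queue that overflows into memory."""

    def __init__(self, hw_slots):
        self.hw_slots = hw_slots
        self.hw = deque()       # the fast hardware queue
        self.spill = deque()    # stands in for the memory-mapped part of the queue

    def send(self, value):
        # A send that would block in hardware is spilled to memory instead.
        if len(self.hw) < self.hw_slots and not self.spill:
            self.hw.append(value)
        else:
            self.spill.append(value)

    def recv(self):
        value = self.hw.popleft()
        if self.spill:
            # Refill the hardware queue from memory, preserving FIFO order.
            self.hw.append(self.spill.popleft())
        return value


ch = SpillingChannel(hw_slots=2)
for v in range(1, 6):
    ch.send(v)
print([ch.recv() for _ in range(5)])
```

The hardware queue always holds the oldest values, so receives see the same order they would see with an unbounded channel, and a send fails only when the backing memory is exhausted.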
The processor architecture just proposed has state stored in both the data processor and the branch processor. The processor should also include the capability of storing the entire state to memory. The state of the data processor can be saved to and restored from memory in the same way as in traditional processors. The state of the branch processor is stored in the contents of the branch processor stack and the contents of the synchronization channel between the data processor and branch processor. The hardware stack as well as the synchronization channel are memory-mapped. Therefore, the state of these parts of the branch processor can be saved and restored using load and store instructions from the data processor. Since a mechanism for saving and restoring the state of the processor is described, context switching can be used. Exceptions can be handled using a conventional handling mechanism. The send! instructions can be treated as instructions that modify the state of the processor. In addition, if an exception is encountered in the middle of a block fetch instruction, execution should be restored from the middle of the block. Therefore, the branch processor should keep track of pending block fetch instructions, allowing them to be restarted after an exception is handled.
Exceptions that occur in the branch processor itself include items such as address translation errors and stack underflow. These can be handled by sending them to the data processor with a special bit set to indicate a branch processor exception. The instruction is executed as a nop in the data processor, and raises an exception in the usual way. Since the writeback unit in the data processor handles branch processor exceptions, the exceptions can be handled in program order.
The interaction between the branch processor and the data processor occurs via two channels: PC, the channel on which program counter values are sent to the data processor, and SYNC, the channel used to read data values from the data processor.
The program for the branch processor is shown below. Variable bpc is the program counter for the branch processor instructions, and S is a stack. A stack element has an addr field and an N field. Three stack operations are used. Top (S) is the top element of stack S; Push (S, addr, N) pushes the pair (addr,N) onto the stack and returns a new stack; Pop(S) deletes the top element of stack S and returns the new stack.
Overflow and underflow detection are omitted from the program for the sake of clarity.
bpc, S := init_bpc, ∅;
*[ (i, addr, N), bpc := bmem[bpc], bpc+1;
   [ recv(i) → SYNC?x [] ¬recv(i) → skip ];
   [ fetch(i)   → *[ N>0 → PC!addr; addr, N := addr+1, N-1 ]
   [] push(i)   → S := Push(S, addr, N)
   [] dec(i)    → [ Top(S).N>1 → bpc := Top(S).addr; Top(S).N := Top(S).N-1
                  [] Top(S).N≤1 → S := Pop(S)
                  ]
   [] pushN?(i) → S := Push(S, addr, x)
   [] pop(i)    → S := Pop(S)
   [] call(i)   → S := Push(S, bpc, 1); bpc := addr
   [] call?(i)  → S := Push(S, bpc, 1); bpc := x
   [] ret(i)    → bpc := Top(S).addr; S := Pop(S)
   [] goto(i)   → bpc := addr
   [] goto?(i)  → bpc := x
   [] if?(i)    → [ x≥0 → bpc := addr [] x<0 → skip ]
   [] fetch?(i) → [ x≥0 → *[ N>0 → PC!addr; addr, N := addr+1, N-1 ]
                  [] x<0 → skip
                  ]
   [] else      → skip
   ]
 ]
Since program counter values are computed by the branch processor, the data processor reads the PC channel to determine which instruction should be executed next. The high-level CHP for the data processor is shown below.
*[ IF:   PC?pc;
   MEM:  i := imem[pc];
   DE:   id := decode(i);
   EXEC: "read operands";
         [ send!(i) → SYNC!"data"
         [] else    → "execute instruction"; "write results"
         ]
 ]
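The branch processor program described above can be approximated by a small Python interpreter. This is a sketch under stated assumptions rather than the CHP semantics in full: the function name run_branch_processor is hypothetical, branch memory is a dict from branch address to an (op, addr, n) tuple, the synchronization channel is a deque pre-filled with the values the data processor would send, running past the last instruction halts, and a step limit guards against runaway programs.

```python
from collections import deque


def run_branch_processor(bmem, sync, start=0, limit=10_000):
    """Interpret branch processor instructions; return (PC values, final stack)."""
    bpc, stack, pcs = start, [], []
    for _ in range(limit):
        if bpc not in bmem:              # assumption: running off the end halts
            break
        op, addr, n = bmem[bpc]
        bpc += 1
        # Instructions whose name ends in "?" first receive a value from SYNC.
        x = sync.popleft() if op.endswith("?") else None
        if op == "fetch" or (op == "fetch?" and x >= 0):
            pcs.extend(range(addr, addr + n))
        elif op == "push":
            stack.append([addr, n])
        elif op == "pushN?":
            stack.append([addr, x])
        elif op == "dec":
            if stack[-1][1] > 1:
                stack[-1][1] -= 1
                bpc = stack[-1][0]
            else:
                stack.pop()
        elif op == "pop":
            stack.pop()
        elif op in ("call", "call?"):
            stack.append([bpc, 1])
            bpc = addr if op == "call" else x
        elif op == "ret":
            bpc = stack.pop()[0]
        elif op == "goto":
            bpc = addr
        elif op == "goto?":
            bpc = x
        elif op == "if?" and x >= 0:
            bpc = addr
        # Anything else (including a failed if? or fetch?) acts as a nop.
    return pcs, stack


# Embodiment 1: fetch E, 5; push L1, 100; L1: fetch L, 11; dec
# Block E is placed at address 100 and block L at address 105 (arbitrary choices).
program = {
    0: ("fetch", 100, 5),
    1: ("push", 2, 100),
    2: ("fetch", 105, 11),
    3: ("dec", None, None),
}
pcs, stack = run_branch_processor(program, deque())
print(len(pcs), stack)
```

Running the Embodiment 1 program issues 5 + 100 × 11 = 1105 program counter values and leaves the stack empty, without consuming anything from the synchronization channel.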
5.6.1 Program-Counter Computation
The branch processor can be compared to the instruction fetch in a standard processor. A simplified version of the instruction fetch for an asynchronous processor, the MiniMIPS, is shown below. The channel SYNC corresponds to the channel from the core of the processor that is used to communicate register values and immediate values to the instruction fetch. An additional COND channel is used, on which condition codes for branches are sent to the instruction fetch.
PC!init_pc; pc := init_pc;
*[ I?i; pc := pc+1;
   [ i="nextpc"  → skip
   [] i="jump"   → SYNC?x; pc := x
   [] i="branch" → SYNC?x; COND?c; ...
   ];
   PC!pc
 ]
The branch processor computes program counter values earlier than this simple instruction fetch because the communication I?i, which synchronizes the instruction fetch with the rest of the data processor, is not done on every instruction. Further, the branch processor only synchronizes with the data processor when the synchronization becomes necessary.
In the example of a simple loop, all synchronization is eliminated. This permits the branch processor to fetch instructions without feedback from the data processor. The branch processor program is more complex than the simple instruction fetch because it has more instructions to decode. This decoding overhead is small when compared to the overhead in accessing branch processor memory by the "(i, addr, N) := bmem[bpc]" statement. Accessing a large on-chip cache has a latency that is approximately equal to the cycle time τ of the processor. This is the additional overhead encountered when using a branch processor.
The slowest possible execution of the branch processor architecture corresponds to the case when the last PC! communication in the branch processor fetches a send! instruction, and the next branch processor instruction is either pushN?, call?, goto?, if?, or fetch?. In this case, the branch processor waits for the send! instruction to be fetched, decoded, and executed. Assume the time taken to fetch, decode, and execute the send! instruction is τ0. The branch processor overhead can be analyzed for each potentially slow instruction.
• pushN?. Since the value of bpc is not data-dependent on the value received on SYNC, the branch processor can continue execution without actually having to wait for the data on SYNC to arrive.
• call? and goto?. The branch processor waits for the data on SYNC to arrive, before it can fetch the next branch processor instruction. The next data processor instruction has an additional data processor latency of τ0+τ seconds.
• if?. If the value received on SYNC is negative, the stall is τ0 seconds because the next branch processor instruction can be speculatively read from branch processor memory. If the value received is non-negative, the branch processor will have to read a new value from branch processor memory incurring an additional data processor latency of τ0+τ seconds.
• fetch?. If the value received on SYNC is negative, the stall is τ0 seconds. If the value received on SYNC is positive, the stall is τ0 seconds because the next program counter values are available immediately.
To summarize, it is expected that, in the worst case, the branch processor stalls for τ0+τ seconds for taken branches and for goto? and call? instructions, and for τ0 seconds for branches that are not taken and for fetch? instructions. In a non-speculative traditional microprocessor, the latency of fetching the branch instruction and executing the instruction (~τ0) is typically avoided by the introduction of ⌈τ0/τ⌉ branch delay slots. A standard instruction set is directly translated to branch processor code by replacing branches by send! instructions. Each send! instruction is followed by the ⌈τ0/τ⌉ instructions that correspond to the branch delay slots. Therefore, the only additional stall that the branch processor encounters is τ. This would be completely hidden if the original architecture had an additional branch delay slot.
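The worst-case figures above can be tabulated with a small model. This Python sketch is a hedged illustration, not part of the disclosure: the function names are hypothetical, pushN? is modeled as a zero stall on the reading that the branch processor does not have to wait for SYNC, and the delay slots are assumed to hide exactly the τ0 portion of each stall.

```python
import math


def worst_case_stall(instr, tau0, tau, taken=False):
    """Worst-case branch processor stall (same time unit as tau0 and tau).

    tau0: time to fetch, decode, and execute the preceding send! instruction.
    tau:  processor cycle time (approximate branch memory access latency).
    taken: only meaningful for if?, True when the branch is taken.
    """
    if instr == "pushN?":
        return 0.0                      # modeled as no wait on SYNC (interpretation)
    if instr in ("call?", "goto?"):
        return tau0 + tau               # must wait for SYNC, then read branch memory
    if instr == "if?":
        return tau0 + tau if taken else tau0
    if instr == "fetch?":
        return tau0                     # next program counter values available immediately
    raise ValueError(f"not a synchronizing instruction: {instr!r}")


def delay_slots(tau0, tau):
    """Number of branch delay slots needed to cover tau0: ceil(tau0 / tau)."""
    return math.ceil(tau0 / tau)


def stall_after_delay_slots(instr, tau0, tau, taken=False):
    """Residual stall, assuming the delay slots hide the tau0 portion."""
    return max(0.0, worst_case_stall(instr, tau0, tau, taken) - tau0)
```

For example, with τ0 = 3 and τ = 1, a goto? stalls for 4 time units without delay slots, three delay slots are needed, and the residual stall equals τ.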
5.6.2. Memory Access
The branch processor architecture requires an additional memory read for branch processor instructions. This memory read is not synchronized with the memory reads for data processor instructions or for data memory.
However, most modern processors have a single off-chip memory. Therefore, the instruction memory bandwidth required of the off-chip memory can increase. Data processor instructions often do not contain information about which instruction has to be executed next. Therefore, common code can be shared without any code replication. Only the branch processor fetch instruction for the block of shared code needs to be replicated. Therefore, the branch processor could improve instruction cache performance by reducing cache misses in the instruction cache.
To maximize instruction sharing (and, incidentally, maximize branch processor code size), each unique data processor opcode would be stored once. This implies that an upper bound on the number of instructions required to be stored in the instruction cache is given by the number of distinct instructions in the program.
The inventors collected instruction count statistics for 267 executables that were compiled using the GNU C compiler for an R3000-based DECstation. It was found that the number of distinct opcodes grows at a rate that is less than linear in the size of the executable.
Figure imgf000028_0001
Table 2. Percentage of programs with 100% cache hits.
Table 2 shows the percentage of programs that would completely fit in an instruction cache depending on whether total instructions or the number of unique instructions in the program are counted. In a branch processor architecture, most programs would fit in a typical instruction cache (8K words). Therefore, the number of instruction cache misses in the data processor is reduced. At the same time, the cache misses for the branch processor are increased. The number of cache misses for the branch processor can be bounded by the number of cache misses for the original instruction set, since each ordinary instruction is translated into at most one branch processor instruction. Therefore, the additional memory bandwidth requirements for a branch processor can be reduced significantly by sharing instructions from the data processor, but at a performance cost. This conservative analysis shows that introducing a branch processor will not have a large impact on the instruction memory bandwidth required by the processor.
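The two ways of counting instructions can be illustrated with a short Python sketch. The function names and the representation of a program as a list of instruction words are hypothetical; only the 8K-word cache size comes from the text. It compares the total instruction count with the number of distinct instruction words, which is the upper bound on instruction cache contents under maximal sharing.

```python
CACHE_WORDS = 8 * 1024  # "typical instruction cache (8K words)" from the text


def cache_footprints(program_words):
    """Return (total, unique) instruction counts for a program.

    program_words: sequence of instruction encodings (for example, 32-bit ints).
    """
    total = len(program_words)
    unique = len(set(program_words))   # each distinct instruction stored once
    return total, unique


def fits_in_cache(program_words, cache_words=CACHE_WORDS):
    """Report whether the program fits entirely in the cache under each count."""
    total, unique = cache_footprints(program_words)
    return {"total": total <= cache_words, "unique": unique <= cache_words}
```

A synthetic program of 10,000 instruction words with 3,000 distinct encodings, such as `[i % 3000 for i in range(10000)]`, does not fit when total instructions are counted but does fit when only unique instructions are stored.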
Branch prediction and prefetching techniques attempt to improve performance by predicting what the program will execute.
Incorporating branch prediction into this architecture corresponds to guessing the value being sent on the feedback channel for if? instructions. Since simple loops no longer contribute branch instructions, the effectiveness of branch prediction will be decreased because the cases which can be easily predicted (loops) are no longer present.
Prefetch instructions attempt to hide the latency of cache misses by dispatching reads to the caches before the data value is actually needed. These prefetch instructions can be inserted into the instruction stream of both the branch processor (for instruction cache prefetches) and the data processor (for data cache prefetches).
Instructions that support software-controlled speculation can be introduced to improve the performance of the branch processor architecture. The instruction sfetch addr,N means "fetch and speculatively execute N instructions that begin at address addr." These instructions are fetched from memory and dispatched to the data processor. The commit instruction informs the data processor if the last speculatively executed block should be permitted to modify the state of the processor. Therefore, the sequence "sfetch addr,N; commit true" is equivalent to "fetch addr,N". The sequence "sfetch addr,N; commit false" is equivalent to a skip. Speculative execution is used to begin execution of a block of code before knowing whether it should be executed. The condition under which the code should be permitted to execute is computed in the data processor, and sent back to the branch processor via a send! instruction. Often, this information determines which of "commit true" or "commit false" should be executed. To optimize this case, the sfetch? instruction is used. "sfetch? addr,N" behaves like sfetch. In addition, it receives a value from the data processor and uses this value to determine which commit instruction should be executed.
This is equivalent to the following branch processor code: sfetch addr,N; if? A; commit false; goto B; A: commit true; B:
The instructions for supporting software-controlled speculative execution are summarized in Table 3.
Figure imgf000031_0001
Table 3. Instructions supporting speculative execution.
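The semantics of sfetch, commit, and sfetch? described above can be sketched as a toy model. Everything in this Python example is an illustrative assumption: the class `DataProcessor`, the modeling of an instruction block as a list of (register, value) writes, and the boolean `cond` standing in for the value received from the data processor. It shows only that speculative results are buffered until a commit decides whether they reach the processor state.

```python
class DataProcessor:
    """Toy data processor whose instructions are plain register writes."""

    def __init__(self):
        self.state = {}       # committed processor state: register -> value
        self.pending = None   # writes of the last speculatively executed block

    def fetch(self, block):
        """fetch addr,N: execute the block and modify state directly."""
        for reg, value in block:
            self.state[reg] = value

    def sfetch(self, block):
        """sfetch addr,N: execute the block speculatively, buffering its writes."""
        self.pending = {}
        for reg, value in block:
            self.pending[reg] = value

    def commit(self, ok):
        """commit true/false: apply or discard the last speculative block."""
        if ok and self.pending is not None:
            self.state.update(self.pending)
        self.pending = None

    def sfetch_q(self, block, cond):
        """sfetch? addr,N: sfetch, then commit according to the received value.

        cond is assumed to be True exactly when the equivalent "if? A"
        would branch to the "commit true" label.
        """
        self.sfetch(block)
        self.commit(cond)
```

In this model, `sfetch` followed by `commit(True)` leaves the same state as `fetch`, and `sfetch` followed by `commit(False)` leaves the state unchanged, matching the two equivalences stated in the text.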
Existing compilation techniques can be used to generate code for the branch processor. In one method, a standard instruction set is translated directly into branch processor instructions by replacing conditional branches with send! and if? pairs, and using fetch instructions to dispatch instructions within a basic block. Both fixed length and variable length loops can be detected by modern compilation systems. Most programming languages have constructs for simple iterated loops, simplifying the problem of loop detection. Therefore, a compiler can generate push instructions for loops. In addition, subroutine calls and returns are explicit in the language. Therefore, these instructions can be easily generated by standard compilation systems. Indeed, the branch processor instruction set is easier to map to because the call and return semantics are provided by the hardware directly. Peephole optimization can be used to move a send! instruction before any other instructions in the data processor that it depends on. Recall that issuing send! instructions early improves the performance of the branch processor architecture.
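The direct translation method can be sketched for a toy intermediate form. This Python example is a minimal illustration under stated assumptions: the function `translate`, the basic-block dictionaries, the textual syntax "send! reg" and "label: fetch addr,N", and the omission of branch delay slots are all hypothetical choices rather than the encoding used by the disclosure. Each basic block becomes one fetch instruction in the branch processor stream, and each conditional branch becomes a send! in the data processor stream paired with an if? in the branch processor stream.

```python
def translate(blocks):
    """Split basic blocks into a branch processor stream and a data processor stream.

    blocks: list of dicts with keys
      "label":  name of the basic block
      "instrs": list of data processor instructions (strings)
      "branch": None, or (reg, target_label) for a conditional branch on reg
    Returns (branch_stream, data_stream) as lists of strings.
    """
    branch_stream = []
    data_stream = []
    for block in blocks:
        addr = len(data_stream)            # address of the block in data memory
        body = list(block["instrs"])
        branch = block.get("branch")
        if branch is not None:
            reg, target = branch
            body.append(f"send! {reg}")    # conditional branch replaced by send!
        data_stream.extend(body)
        # One fetch dispatches the whole basic block, including the send!.
        branch_stream.append(f"{block['label']}: fetch {addr},{len(body)}")
        if branch is not None:
            branch_stream.append(f"if? {target}")   # paired with the send!
    return branch_stream, data_stream
```

Because the data processor stream carries no branch targets, a block that is reached from several places needs only additional fetch instructions in the branch processor stream, not a second copy of its data processor instructions.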
Loop unrolling and loop peeling are transformations used to improve the performance of programs. Both transformations replicate the body of the loop in order to statically determine the direction of some of the branches in the loop body. Observe that such program transformations replicate code just in the branch processor; streams of instructions in the data processor can be re-used because they no longer encode any control flow information. This implies that we will not worsen instruction cache performance by applying such transformations.
Fetch instructions provide a simple interface for implementing microcode. A sequence of instructions stored at fixed addresses in memory can be used to create complex "instructions" of the form fetch addr,N. The effect of executing these instructions is to execute the sequence of instructions stored at the specified memory address, providing the same effect as an architecture that included programmable microcode.
Other modifications are contemplated to exist within this disclosure. Such modifications are intended to be encompassed within the following claims.

Claims

What is claimed is:
1. A processor system, comprising: a first processor, which receives a stream of information including information on branching within a program being executed, and determines branching information and obtains instructions from memory based on said program; and a data processor, receiving said instructions from memory and carrying out an operation based thereon.
2. A system as in claim 1 wherein said stream of information includes a first instruction that is defined to allow executing programs when control flow is only available at run time.
3. A system as in claim 1 further comprising a synchronization mechanism between said data processor and said first processor, allowing communication between said data processor and said first processor.
4. A system as in claim 3 further comprising a special instruction which indicates that control flow will only be available at run time, and which commands said data processor to send a value to the first processor via said synchronization mechanism.
5. A system as in claim 4 wherein said line is also used for synchronization.
6. A system as in claim 1 wherein said instructions include a block fetch instruction.
7. A system as in claim 6 wherein said block fetch instruction includes an address of instructions and a length of instruction sets.
8. A system as in claim 1 wherein said instructions include a looping instruction which stores a branch address, a number of instructions after said branch address, and a number of times of execution.
9. A system as in claim 4 further comprising a loop instruction which specifies a branch address and uses a data value on the synchronization mechanism as a number of loop counter.
10. A system as in claim 4 wherein said special instruction includes a function call to an address determined at run time.
11. A system as in claim 4 wherein a branch-to address is not known prior to a specified time, and read via said synchronization mechanism.
12. A system as in claim 4 wherein control flow depends on said operation in the data processor and said control flow is communicated to said first processor via said synchronization mechanism.
13. A method of operating a processor, comprising: operating a data processor which carries out instructions which are applied thereto; operating a separate branch processor which determines information on branching of a program that uses said instructions; providing a synchronization channel between said data processor and said branch processor; operating said branch processor and said data processor separately when control flow can be determined at compilation time; and when control flow can only be determined at run time, providing an instruction which provides synchronization data from said data processor to said branch processor at said run time.
14. A method as in claim 13 further comprising detecting at least one condition which indicates a deadlock between said branch processor and said data processor, further comprising a deadlock removal mechanism.
15. A method as in claim 13 wherein said control flow includes a special instruction that commands said synchronization data at said run time.
16. A method as in claim 15 wherein said instructions are handled as instructions that modify states of the processor for exception purposes.
17. A method as in claim 13 further comprising predicting what the program will execute next, to improve performance of the program.
18. A method as in claim 17 wherein said predicting comprises guessing a value on the synchronization channel.
19. A method as in claim 13 further comprising an instruction that supports software controlled speculation of values on said synchronization channel.
20. A method as in claim 19 wherein said software controlled speculation instructs the data processor to carry out said action based on a condition.
21. A method as in claim 13 further comprising a speculative execution comprising beginning execution of a block of code prior to knowing whether the block will be executed.
22. A method of operating a processor, comprising: forming a first stream of information indicating a number of iteration counts, and a second stream of information indicating instructions to be carried out at said iteration counts; and determining instructions based on said first stream of information, and using said determined instructions to produce said second stream of information.
23. A method as in claim 22, wherein said first and second streams are instructions.
24. A method as in claim 23, wherein said first stream includes an instruction which indicates that a function of said instruction can only be determined at run time.
25. A method as in claim 22, further comprising executing said first stream of information in a branch processor which determines information on branching of the instructions and said second stream of information in a data processor; and providing a synchronization channel between said data processor and said branching processor.
26. A method as in claim 25, further comprising operating said branch processor and said data processor separately when control flow can be determined at compilation time; and when control flow can only be determined at run time, providing an instruction which provides synchronization data from said data processor to said branch processor at said compilation time.
27. A method as in claim 22, wherein said streams of information include a loop instruction used to permit the execution of loops with iteration counts determined at run time.
28. A method as in claim 22, further comprising sending information about data at run time from said second stream to said first stream.
29. A method as in claim 28, further comprising a speculative execution comprising beginning execution of a block of code prior to knowing whether the block will be executed.
30. A method as in claim 22, further comprising synchronizing the first and second streams during run time, when control flow cannot be determined when the program is compiled.
31. A method of compiling a program, comprising: obtaining instructions; determining branch parts of said instructions and producing a first stream of compiled information based thereon; and determining data processing parts of said instructions and producing a second stream of compiled information based thereon.
32. A method as in claim 31, further comprising: determining branch parts that cannot be determined until run time; and signalling said parts.
33. A method as in claim 32, wherein said signalling comprises replacing conditional branches with special instructions that indicate operation at run time.
PCT/US1999/019197 1998-08-21 1999-08-20 Processing element with special application for branch functions WO2000011547A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU56865/99A AU5686599A (en) 1998-08-21 1999-08-20 Processing element with special application for branch functions
EP99943848A EP1105793A4 (en) 1998-08-21 1999-08-20 Processing element with special application for branch functions

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US9751598P 1998-08-21 1998-08-21
US60/097,515 1998-08-21

Publications (2)

Publication Number Publication Date
WO2000011547A1 true WO2000011547A1 (en) 2000-03-02
WO2000011547A9 WO2000011547A9 (en) 2000-08-10

Family

ID=22263771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/019197 WO2000011547A1 (en) 1998-08-21 1999-08-20 Processing element with special application for branch functions

Country Status (3)

Country Link
EP (1) EP1105793A4 (en)
AU (1) AU5686599A (en)
WO (1) WO2000011547A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4338661A (en) * 1979-05-21 1982-07-06 Motorola, Inc. Conditional branch unit for microprogrammed data processor
US5689720A (en) * 1991-07-08 1997-11-18 Seiko Epson Corporation High-performance superscalar-based computer system with out-of-order instruction execution
US5781752A (en) * 1996-12-26 1998-07-14 Wisconsin Alumni Research Foundation Table based data speculation circuit for parallel processing computer

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3137117B2 (en) * 1987-03-27 2001-02-19 将容 曽和 High-speed processing computer
AU3437293A (en) * 1993-01-06 1994-08-15 3Do Company, The Digital signal processor architecture
US5485629A (en) * 1993-01-22 1996-01-16 Intel Corporation Method and apparatus for executing control flow instructions in a control flow pipeline in parallel with arithmetic instructions being executed in arithmetic pipelines
DE69428504T2 (en) * 1993-11-30 2002-05-16 Texas Instruments Inc Three-input arithmetic logic unit with drum rotation circuit

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1105793A4 *

Also Published As

Publication number Publication date
AU5686599A (en) 2000-03-14
EP1105793A1 (en) 2001-06-13
WO2000011547A9 (en) 2000-08-10
EP1105793A4 (en) 2007-07-25

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: C2

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW SD SL SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGE 1/1, DRAWINGS, REPLACED BY A NEW PAGE 1/1; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

WWE Wipo information: entry into national phase

Ref document number: 1999943848

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1999943848

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642