WO2000011547A1

WO2000011547A1 - Processing element with special application for branch functions

Info

Publication number: WO2000011547A1
Application number: PCT/US1999/019197
Authority: WO
Inventors: Rajit Manohar; Alain Martin
Original assignee: California Institute Of Technology
Priority date: 1998-08-21
Filing date: 1999-08-20
Publication date: 2000-03-02
Also published as: AU5686599A; EP1105793A1; WO2000011547A9; EP1105793A4

Abstract

A processor system is formed from a branch processor (Fig. 1, unit 110) and a main data processor (Fig. 1, unit 120). The main data processor operates like a conventional processor. The branch processor determines the branch information that is usually determined speculatively. A synchronizer is used occasionally to synchronize the branch processor and the data processor through a feedback channel (Fig. 1, unit 122).

Description

PROCESSING ELEMENT WITH SPECIAL APPLICATION FOR BRANCH FUNCTIONS

Cross Reference To Related Applications

This application claims the benefit of U.S. Provisional Application No. 60/097,515, filed August 21, 1998.

The present application describes a processor architecture which uses independent instruction streams: one for the main processor, which decodes and performs the instructions forming the actual computation, and another for the branch processor, which determines the sequence of program counter values used to fetch instructions for the main processor.

Background

It is desirable in many processor systems to execute instructions as quickly as possible. One bottleneck in the execution of such instructions is the process of computing the program counter values that determine which instruction gets executed next. After determining which instructions get executed in which order, the specified instructions are fetched from memory. In a traditional architecture, each instruction has information about the sequence of program counters that constitute the program. However, one cannot use this information to directly compute which program counter is to be generated next. This determination has typically been done by examining the preceding instruction, which might be a branch.

Traditionally, branch delays are introduced into the design to alleviate this problem. Hardware can also use branch prediction techniques to "guess" which instruction will be fetched next. However, when this prediction fails, the hardware must cancel the result of the prediction.

Branches can be found in many places in programs. Examples include branches for subroutine calls, loops, and if statements. Fixed-length loops and subroutine calls make it possible to predict how the branches behave when the program is compiled.

Summary

The present application describes a processor architecture that provides additional information that has an application in determining branch information. This processor uses two separate instruction streams. A main or data processor instruction stream includes the instruction information. The branch processor instruction stream determines the program flow.

According to one aspect, an asynchronous processor is described which carries out this function. Another aspect teaches a synchronous design style.

Brief Description of the Drawings

These and other aspects will be described in detail with reference to the accompanying drawings, wherein:

Figure 1 shows a block diagram of a basic branch processor-modified processing system.

Description of the Embodiments

A processor that does not have a branch delay slot has a number of different instruction types. Different instructions are used to compute different functions. Each instruction has implicit control flow information based on its content. An instruction such as

pc: addu r1, r2, r3

implicitly encodes that the next program counter value is the current program counter ("pc") +4. An instruction such as

pc: bne r1, r3, L

encodes that the next program counter is either pc+4 or L, depending on whether or not registers r1 and r3 are equal. A processor computes the sequence of program counter values. However, existing instruction sets encode this information very inefficiently from the density point of view. Most of the time, an instruction needs to be examined simply to determine that the next instruction to be executed is at pc+4. Consider a simple FORTRAN 77 loop:

do 10 I = 2, 100
   c(I) = a(I) + b(I)
10 continue                ... (1)

This loop would be compiled into a number of instructions, with a single branch at the end of the loop. Most of the instructions would increment the program counter by 4; i.e. "pc:=pc+4". It can be statically determined, by looking at the loop, that the sequence of instructions that implement the body of the loop will be executed 100 times. However, this information is not encoded in the instruction set. The present application defines a new separate and independent instruction sequence that encodes the sequence of program counter (PC) values. A program such as the one shown in (1) is compiled into two instruction streams. A first instruction stream determines the computations to be performed. The second instruction stream determines the flow of control.

The basic layout is shown in Figure 1. A "data processor" executes the first instruction stream that determines the computation to be performed. This is a traditional processor operation.

The second instruction stream includes branch processor instructions. These are executed on a separate processor, the branch processor 110. Conceptually, this second instruction stream is a sequence of instructions that computes, or includes information to compute, the sequence of program counter values. The program counter information constitutes memory interface 102 that is sent to memory. The sequence of instructions 104 is responsively received from memory. These instructions are passed as instruction stream 106 to the data processor 120, which is shown as a traditional processor with an instruction decoder and instruction-executing registers.

The branch processor 110 also receives specified feedback from the data processor 120. The feedback indicates information such as synchronization information from a sync register 124. This information is used by certain types of instructions to enable information from the data processor 120 to control, and provide information to, the branch processor 110. For example, some feedback from the data processor 120 is obtained when executing code that has conditional branches. The feedback channel 122 can also occasionally synchronize the two processors.

Control flow in a program normally follows a call/return pattern. A hardware stack in the branch processor is used for storing program counter values. However, there are times when the control flow information is only available at run time.

A special instruction in the main data processor called "send!" is defined to allow executing programs when the control flow is only available at run time. The send! instruction sends a data value from the data processor 120 to the branch processor 110 via the synchronization channel 122. Branch processor instructions that read values from this channel have a "?" appended thereto. The instructions are described in the following. The operand addr refers to the address of instructions to be executed on the data processor, and baddr refers to addresses for branch processor instructions.

Block fetch instructions are introduced to compress control flow information within basic blocks. The instruction fetch addr, N means "fetch and execute N instructions that begin at address addr." This enables the static determination of the number N of instructions that will be executed sequentially. The control-flow information is compressed using this single instruction.

This instruction can be used to implement "straight-line" microcode. A sequential stream of instructions can implement a complex task without increasing code size significantly. The single fetch instruction can result in a smaller instruction cache footprint for a program in the case when common code can be shared among different parts of the program.

Looping constructs are implemented without significant overhead, using the following two instructions:

push baddr, N

dec

The push instruction stores the pair (baddr, N) on the hardware stack. Branch processor execution continues with the next instruction.

The "dec" instruction examines the pair (baddr, N) stored on the top of the stack, and decrements N. If the result is zero (or negative) , the stack is popped; otherwise, the branch processor begins execution at address baddr. For example, the code corresponding to a loop that executes a sequence of 15 instructions 10 times would be:

push A, 10;

A: fetch addr, 15; dec
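The loop example above can be traced with a small interpreter. The following Python sketch is illustrative only: the function name run_branch_program, the tuple encoding of instructions, the use of list indices as branch processor addresses, and the unit address step are all assumptions made for this sketch, not part of the described hardware.

```python
def run_branch_program(program):
    """Run a tiny branch-processor program and return the PC trace.

    `program` is a list of tuples (a hypothetical encoding):
      ("push", baddr, n), ("fetch", addr, n), ("dec",)
    Branch addresses are indices into `program`.
    """
    stack = []   # hardware stack of [baddr, count] pairs
    trace = []   # program counter values sent to the data processor
    bpc = 0      # branch processor program counter
    while bpc < len(program):
        instr = program[bpc]
        bpc += 1
        op = instr[0]
        if op == "push":
            stack.append([instr[1], instr[2]])
        elif op == "fetch":
            addr, n = instr[1], instr[2]
            trace.extend(range(addr, addr + n))
        elif op == "dec":
            stack[-1][1] -= 1
            if stack[-1][1] <= 0:
                stack.pop()          # loop finished: fall through
            else:
                bpc = stack[-1][0]   # loop again from baddr
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return trace


# push A, 10;  A: fetch addr, 15;  dec   (A is index 1, addr is 1000 here)
LOOP = [("push", 1, 10), ("fetch", 1000, 15), ("dec",)]
trace = run_branch_program(LOOP)
```

Under these assumptions the trace contains 150 program counter values, i.e. the 15-instruction block dispatched 10 times, without the data processor ever executing a branch.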

The number of iterations in a loop is not always known at compile time. The following instruction is used to permit the execution of loops with iteration counts determined at run time:

pushN? baddr

This instruction receives the next data value from the synchronization channel and uses it as the loop count N; other than that, it behaves like a push instruction.

When breaking out of a loop, the hardware stack still has state information in it which needs to be destroyed. The pop instruction can explicitly pop the top of the hardware stack.

Function calls can be implemented with the call instruction. call baddr pushes (nextpc, 1 ) onto the branch processor stack, where nextpc is the program counter address immediately following the call, and then transfers control to baddr. Returning from a function is implemented by a "ret" instruction, that jumps to the address on the top of the stack and pops the stack.
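A minimal sketch of the call and ret behavior described above, in Python. The instruction encoding, the function name run_calls, and the "halt" pseudo-instruction that ends the run are hypothetical conveniences for this sketch; the description does not define a halt instruction.

```python
def run_calls(program):
    """Trace a branch-processor program that uses fetch, call and ret.

    Branch addresses are indices into `program` (an assumption).
    ("halt",) is a hypothetical marker that stops the simulation.
    """
    stack, trace, bpc = [], [], 0
    while True:
        op, *args = program[bpc]
        bpc += 1
        if op == "fetch":
            addr, n = args
            trace.extend(range(addr, addr + n))
        elif op == "call":
            stack.append((bpc, 1))     # (nextpc, 1): return address
            bpc = args[0]
        elif op == "ret":
            bpc, _count = stack.pop()  # jump to top of stack and pop
        elif op == "halt":
            return trace
        else:
            raise ValueError(f"unknown instruction {op!r}")


PROGRAM = [
    ("fetch", 100, 2),   # 0: caller code
    ("call", 4),         # 1: call the function at index 4
    ("fetch", 200, 1),   # 2: caller continues here after ret
    ("halt",),           # 3
    ("fetch", 500, 3),   # 4: function body
    ("ret",),            # 5
]
call_trace = run_calls(PROGRAM)
```

The return address is pushed with a count of 1, matching the (nextpc, 1) pair in the text, so the same stack serves both loops and calls.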

A function call to an address determined at run time may occur when executing a function determined by looking at a function pointer stored in a table, or in the case of dynamic dispatch of methods in object-oriented languages. The call? instruction reads the address to branch to from the synchronization channel. It otherwise behaves like a call.

The push and pop instructions can be used to implement control flow in loops. To handle arbitrary branches, goto instructions of two flavors are introduced:

goto baddr

goto?

The first instruction unconditionally changes the branch processor execution address to baddr. The second instruction reads the address to branch to from the synchronization channel.

When control flow depends on computation in the data processor, the synchronization channel is used to determine the direction of the branch. The if? instruction is used for this purpose.

if? baddr

The instruction reads a value from the synchronization channel. It continues execution at address baddr if the value received is non-negative. Otherwise, execution continues with the next branch processor instruction. Performance is maximized if the matching send! is executed early in the data processor. Therefore, programs that have short sequences of instructions that are interspersed with conditional branches and depend on computation just performed would not be executed efficiently. In such cases, predicated execution, which executes the instructions conditionally, could be used to improve performance.

A block of instructions is predicated using the instruction fetch? addr, N. If the value received from the data processor is non-negative, then the block of N instructions stored at address addr is executed; otherwise, the instruction behaves like a no-op (nop).
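The difference between if? and fetch? can be illustrated with a short Python sketch. It assumes that the values sent by send! are already waiting in the synchronization channel (real hardware would block on an empty channel), and the names run_with_sync and the tuple encoding are hypothetical.

```python
from collections import deque


def run_with_sync(program, sync_values):
    """Trace fetch, if? and fetch? using values queued on the sync channel.

    Simplification: all sync values are supplied up front, so a receive
    never blocks. Branch addresses are indices into `program`.
    """
    sync = deque(sync_values)
    trace, bpc = [], 0
    while bpc < len(program):
        op, *args = program[bpc]
        bpc += 1
        if op == "fetch":
            addr, n = args
            trace.extend(range(addr, addr + n))
        elif op == "if?":
            if sync.popleft() >= 0:      # non-negative: branch taken
                bpc = args[0]
        elif op == "fetch?":
            addr, n = args
            if sync.popleft() >= 0:      # non-negative: block executes
                trace.extend(range(addr, addr + n))
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return trace


PROGRAM = [
    ("fetch?", 100, 2),  # 0: predicated block
    ("if?", 3),          # 1: conditional branch to index 3
    ("fetch", 200, 1),   # 2: skipped when the branch is taken
    ("fetch", 300, 1),   # 3
]
```

Here fetch? either dispatches its block or acts as a nop without changing the branch processor address, whereas if? changes which branch processor instruction runs next.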

Table 1 Instruction-set summary

Table 1 shows a summary of the new instructions.

The following provides examples showing how code is generated for the branch processor. This code is generated, for example, in a compiler.

Embodiment 1 - code that has a control flow that can be determined when the program is compiled.

Consider the following FORTRAN program fragment:

do 10 i = 1, 100
   c(i) = a(i)+b(i)
10 continue

Compiling this piece of code using f2c and the GNU C compiler for an R3000 processor results in the following assembly code.

E: r2:=1; i:=r2; r8:=c-4; r7:=a-4; r6:=b-4;

L: r2:=i; r5:=r2+1; r2:=r2*4; r3:=r2+r7; r4:=r2+r6; r3:=mem[r3]; r4:=mem[r4]; r2:=r2+r8;

i:=r5; r5:=(r5<101); r3:=r3+r4; mem[r2]:=r3; if r5 goto L

In a branch processor architecture, the instructions that implement the loop test and branch (r5:=(r5<101) and if r5 goto L, underlined in the original) are deleted from the data processor code. The following special branch processor code is generated:

fetch E, 5; push L1, 100; L1: fetch L, 11; dec

In this example, the branch processor does not synchronize with the data processor because the control flow can be determined when the program is compiled.

Embodiment 2 - the same program with a modification that permits the program to exit the loop early.

do 10 i = 1, 100
   c(i) = a(i)+b(i)
   if (c(i) .ge. 0) goto 11    /* This code allows the loop to exit early */
10 continue
11 ...

The compiled version of this program is shown below.

E: r2:=1; i:=r2; r8:=c-4; r7:=a-4; r6:=b-4;

L: r5:=i; r3:=r5*4; r2:=r3+r7; r4:=r3+r6; r2:=mem[r2]; r4:=mem[r4]; r3:=r3+r8; r2:=r2+r4; mem[r3]:=r2; if r2>=0 goto M; send! r2;

r2:=r5+1; i:=r2; r2:=(r2<101); if r2 goto L

M:

In a branch processor architecture, the instructions that implement control flow (if r2>=0 goto M, r2:=(r2<101), and if r2 goto L, underlined in the original) are deleted. In one case, since the branch is conditional, it is replaced by the send! instruction shown. The additional branch processor code would be:

fetch E, 5; push L1, 100; L1: fetch L, 10; if? B; fetch P, 2; dec; push L1, 1;

B: pop

An examination of the code generated reveals that the send! r2 can be reordered with two instructions to obtain:

r2:=mem[r2]; r4:=mem[r4]; r2:=r2+r4; send! r2; r3:=r3+r8; mem[r3]:=r2

This transformation improves performance because the send! action occurs earlier.

An important consideration arises from the two streams of instructions being executed concurrently. Certain instructions synchronize the branch processor with the data processor. Since the two instruction streams are separate, care is needed in handling deadlock, exceptions, and context switching.

DEADLOCK

The architecture includes instructions that synchronize the data processor 120 with the branch processor 110. Incorrect code could deadlock the hardware based on lack of synchronization. According to the present system, a deadlock detector is provided. The deadlock detector detects conditions which indicate deadlock, and responds thereto. At least two expected sources of deadlock include

• A receive on the synchronization channel is blocked because there is no matching send and the channel is empty;

• A send on the synchronization channel is blocked because the channel is full and there is no matching receive.

Every send! instruction must be fetched before the corresponding receive is executed in the branch processor. Therefore, the first case can only be caused by an incorrect program. This possibility can be avoided in the compiler.

The second case could occur if multiple sends have been dispatched in advance, causing the synchronization channel to become full before any receives could be executed. This case can also be prevented by a compiler. The compiler must keep track of the number of outstanding send! operations at any point in the program, and ensure that the number of pending send operations does not exceed the hardware limit.
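The bookkeeping described above can be sketched as a simple counter. The Python function below is a simplified illustration, not the compiler algorithm of the disclosure: it assumes the compiler has already linearized the send! and receive actions along one control-flow path into a single sequence, and the name check_sync_balance and the "send"/"recv" event labels are hypothetical.

```python
def check_sync_balance(events, capacity):
    """Check one linearized path of send/receive actions.

    `events` is a sequence of "send" / "recv" strings in the order they are
    assumed to occur; `capacity` is the hardware limit of the sync channel.
    Returns None if the path is safe, otherwise a short diagnostic string.
    """
    outstanding = 0  # sends issued but not yet received
    for index, event in enumerate(events):
        if event == "send":
            outstanding += 1
            if outstanding > capacity:
                return f"send at position {index} would overflow the channel"
        elif event == "recv":
            if outstanding == 0:
                return f"receive at position {index} has no matching send"
            outstanding -= 1
        else:
            raise ValueError(f"unknown event {event!r}")
    return None
```

The first diagnostic corresponds to the full-channel case and the second to the empty-channel case listed above. A real compiler would have to apply such a check across all paths through the program.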

Although both cases of deadlock can be prevented using appropriate compilation techniques, errors can still occur. It might also be desirable to execute arbitrary programs on the hardware without causing the processor to deadlock. Deadlock can be detected by using a timing assumption or by running a deadlock detection program. Simple timing assumptions include assuming that the processor has deadlocked if instructions have not been decoded for a long interval, e.g. a microsecond. We could also execute a simple termination detection algorithm to detect deadlock. In the latter case, we only have to involve the two ends of the synchronization channel in the termination detection algorithm, along with counters to detect that there are no data values in transit from the branch processor to the data processor.

If a receive action is blocked forever, this implies that the code being executed on the branch processor is erroneous. In this case, execution of the exception handler should be started. If a send! action is blocked forever, this could imply that the code generated by the compiler is erroneous; however, if the compiler cannot predetermine the sequence of send! actions, the processor might deadlock when the synchronization channel fills. In this case, graceful recovery is possible by permitting the program to continue execution while draining the values stored in the synchronization channel.

To permit program execution in the presence of blocked send! instructions, it is possible to save (and restore) the values stored in the synchronization channel to memory. Therefore, a mechanism should be included that will memory-map the synchronization channel, and treat the hardware queue as an optimized implementation of this queue. With such an implementation, a blocked send! action will cause execution to fail only when a process exhausts the virtual memory on the system (or, alternatively, exceeds its resource limits).
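One way to picture a memory-backed synchronization channel is a small fixed-size hardware queue with an overflow area in memory. The Python sketch below is an assumption-laden model, not the disclosed mechanism: the class name SpillingChannel, the two-queue layout, and raising LookupError where hardware would block are all choices made for illustration.

```python
from collections import deque


class SpillingChannel:
    """FIFO channel: a bounded 'hardware' queue plus a spill area in memory."""

    def __init__(self, hardware_slots):
        if hardware_slots < 1:
            raise ValueError("need at least one hardware slot")
        self.hardware_slots = hardware_slots
        self.hardware = deque()  # models the fast hardware queue
        self.spill = deque()     # models values saved to memory

    def send(self, value):
        # Use a hardware slot only if nothing older is waiting in memory,
        # so that FIFO order is preserved.
        if not self.spill and len(self.hardware) < self.hardware_slots:
            self.hardware.append(value)
        else:
            self.spill.append(value)

    def recv(self):
        if not self.hardware:
            # Real hardware would block here; the sketch raises instead.
            raise LookupError("receive would block: channel is empty")
        value = self.hardware.popleft()
        if self.spill:
            # Refill the freed hardware slot from memory.
            self.hardware.append(self.spill.popleft())
        return value
```

In this model a send never blocks on a full hardware queue; it only fails if the memory backing the spill area is exhausted, which mirrors the resource-limit condition described above.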

The processor architecture just proposed has state stored in both the data processor and the branch processor. The processor should also include the capability of storing the entire state to memory. The state of the data processor can be saved to and restored from memory in the same way as in traditional processors. The state of the branch processor is stored in the contents of the branch processor stack and the contents of the synchronization channel between the data processor and branch processor. The hardware stack as well as the synchronization channel can be memory-mapped. Therefore, the state of these parts of the branch processor can be saved and restored using load and store instructions from the data processor. Since a mechanism for saving and restoring the state of the processor is described, context switching can be used. Exceptions can be handled using a conventional handling mechanism. The send! instructions can be treated as instructions that modify the state of the processor. In addition, if an exception is encountered in the middle of a block fetch instruction, execution should be restored from the middle of the block. Therefore, the branch processor should keep track of pending block fetch instructions, allowing them to be restarted after an exception is handled.

Exceptions that occur in the branch processor itself include items such as address translation errors and stack underflow. These can be handled by sending them to the data processor as an instruction with a special bit set to indicate a branch processor exception. The instruction is executed as a nop in the data processor, and raises an exception in the usual way. Since the writeback unit in the data processor handles branch processor exceptions, the exceptions can be handled in program order.

The interaction between the branch processor and the data processor occurs via two channels: PC, the channel on which program counter values are sent to the data processor, and SYNC, the channel used to read data values from the data processor.

The program for the branch processor is shown below. Variable bpc is the program counter for the branch processor instructions, and S is a stack. A stack element has an addr field and an N field. Three stack operations are used. Top (S) is the top element of stack S; Push (S, addr, N) pushes the pair (addr,N) onto the stack and returns a new stack; Pop(S) deletes the top element of stack S and returns the new stack.

Overflow and underflow detection are deleted from the program for the sake of clarity.

bpc, S := init_bpc, ∅;
*[ (i, addr, N), bpc := bmem[bpc], bpc+1;
   [ recv(i) → SYNC?x [] ¬recv(i) → skip ];
   [ fetch(i)   → *[ N>0 → PC!addr; addr, N := addr+1, N-1 ]
   [] push(i)   → S := Push(S, addr, N)
   [] dec(i)    → [ Top(S).N>1 → bpc := Top(S).addr; Top(S).N := Top(S).N-1
                  [] Top(S).N≤1 → S := Pop(S)
                  ]
   [] pushN?(i) → S := Push(S, addr, x)
   [] pop(i)    → S := Pop(S)
   [] call(i)   → S := Push(S, bpc, 1); bpc := addr
   [] call?(i)  → S := Push(S, bpc, 1); bpc := x
   [] ret(i)    → bpc := Top(S).addr; S := Pop(S)
   [] goto(i)   → bpc := addr
   [] goto?(i)  → bpc := x
   [] if?(i)    → [ x≥0 → bpc := addr [] x<0 → skip ]
   [] fetch?(i) → [ x≥0 → *[ N>0 → PC!addr; addr, N := addr+1, N-1 ]
                  [] x<0 → skip
                  ]
   [] else      → skip
   ]
 ]

Since program counter values are computed by the branch processor, the data processor reads the PC channel to determine which instruction should be executed next. The high-level CHP for the data processor is shown below.

*[ IF:   PC?pc;
   MEM:  i := imem[pc];
   DE:   id := decode(i);
   EXEC: "read operands";
         [ send!(i) → SYNC!"data"
         [] else    → "execute instruction"; "write results"
         ]
 ]

5.6.1. Program-Counter Computation

The branch processor can be compared to the instruction fetch in a standard processor. A simplified version of the instruction fetch for an asynchronous processor, the MiniMIPS, is shown below. The channel SYNC corresponds to the channel from the core of the processor that is used to communicate register values and immediate values to the instruction fetch. An additional COND channel is used on which condition codes for branches are sent to the instruction fetch.

PC!init_pc; pc := init_pc;
*[ I?i; pc := pc+1;
   [ i="nextpc"  → skip
   [] i="jump"   → SYNC?x; pc := x
   [] i="branch" → SYNC?x; COND?c;
   ];
   PC!pc
 ]

The branch processor computes program counter values earlier than this simple instruction fetch because the communication I?i , which synchronizes the instruction fetch with the rest of the data processor, is not done on every instruction. Further, the branch processor only synchronizes with the data processor when the synchronization becomes necessary.

In the example of a simple loop, all synchronization is eliminated. This permits the branch processor to fetch instructions without feedback from the data processor. The branch processor program is more complex than the simple instruction fetch because it has more instructions to decode. This decoding overhead is small when compared to the overhead in accessing branch processor memory by the " (i, addr, N) : =bmem [bpc] " statement. Accessing a large on-chip cache has a latency that is approximately equal to the cycle time τ of the processor. This is the additional overhead encountered when using a branch processor.

The slowest possible execution of the branch processor architecture corresponds to the case when the last PC! communication in the branch processor fetches a send! instruction, and the next branch processor instruction is either pushN?, call?, goto?, if?, or fetch?. In this case, the branch processor waits for the send! instruction to be fetched, decoded, and executed. Assume the time taken to fetch, decode, and execute the send! instruction is τ₀. The branch processor overhead can be analyzed for each potentially slow instruction.

• pushN?. Since the value of bpc is not data-dependent on the value received on SYNC, the branch processor can continue execution without actually having to wait for the data on SYNC to arrive.

• call? and goto?. The branch processor waits for the data on SYNC to arrive, before it can fetch the next branch processor instruction. The next data processor instruction has an additional data processor latency of τ₀+τ seconds.

• if?. If the value received on SYNC is negative, the stall is τ₀ seconds because the next branch processor instruction can be speculatively read from branch processor memory. If the value received is non-negative, the branch processor will have to read a new value from branch processor memory incurring an additional data processor latency of τ₀+τ seconds.

• fetch?. If the value received on SYNC is negative, the stall is τ₀ seconds. If the value received on SYNC is non-negative, the stall is also τ₀ seconds because the next program counter values are available immediately.

To summarize, it is expected that, in the worst case, the branch processor stalls for τ₀+τ seconds for branches that are taken and for goto? and call? instructions, and for τ₀ seconds for branches that are not taken and for fetch? instructions. In a non-speculative traditional microprocessor, the latency of fetching the branch instruction and executing the instruction (~τ₀) is typically avoided by the introduction of ⌈τ₀/τ⌉ branch delay slots. A standard instruction set is directly translated to branch processor code by replacing branches by send! instructions. Each send! instruction is followed by ⌈τ₀/τ⌉ instructions that correspond to the branch delay slots. Therefore, the only additional stall that the branch processor encounters is τ. This would be completely hidden if the original architecture had an additional branch delay slot.
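The worst-case stalls listed above can be collected into a small lookup. This Python sketch only restates the figures given in the text; the function names are hypothetical, and treating pushN? as a zero stall is an assumption based on the statement that the branch processor can continue without waiting for the data on SYNC.

```python
import math


def worst_case_stall(instruction, tau0, tau, taken=False):
    """Worst-case branch processor stall, in the same time unit as tau0/tau.

    tau0: time to fetch, decode and execute the send! instruction.
    tau:  latency of reading branch processor memory (about one cycle).
    taken: only meaningful for if? (value received was non-negative).
    """
    if instruction in ("goto?", "call?"):
        return tau0 + tau
    if instruction == "if?":
        return tau0 + tau if taken else tau0
    if instruction == "fetch?":
        return tau0
    if instruction == "pushN?":
        return 0.0  # assumption: bpc does not depend on the received value
    raise ValueError(f"not a synchronizing instruction: {instruction!r}")


def branch_delay_slots(tau0, tau):
    """Number of delay slots that would hide tau0, i.e. ceil(tau0 / tau)."""
    return math.ceil(tau0 / tau)
```

For example, with τ₀ equal to three cycle times, a taken if? stalls for four cycle times in the worst case, of which three would be covered by delay slots in the translated code.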

5.6.2. Memory Access

The branch processor requires an additional memory read for branch processor instructions. This memory read is not synchronized with the memory reads for data processor instructions or with the data memory.

However, most modern processors have a single off-chip memory. Therefore, the instruction memory bandwidth requirements to the off-chip memory can increase. Data processor instructions often do not contain information about which instruction has to be executed next. Therefore, common code can be shared without any code replication; only the branch processor fetch instruction for the block of shared code needs to be replicated. As a result, the branch processor could improve instruction cache performance by reducing cache misses in the instruction cache.

To maximize instruction sharing (and, incidentally, maximize branch processor code size) , each unique data processor opcode would be stored once. This implies that an upper bound on the number of instructions required to be stored in the instruction cache is given by the number of distinct instructions in the program.
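The bound described above amounts to comparing the total instruction count with the number of distinct instruction words. The Python sketch below illustrates that comparison; the function names are hypothetical, the input is assumed to be a list of encoded instruction words, and the default cache size of 8K words is the typical figure quoted in this description.

```python
def instruction_footprint(instruction_words):
    """Return (total, distinct) instruction counts for a program."""
    return len(instruction_words), len(set(instruction_words))


def fits_in_cache(instruction_words, cache_words=8 * 1024, share=True):
    """Would the program fit entirely in the instruction cache?

    share=True counts each distinct instruction once (maximal sharing);
    share=False counts every instruction, as in a conventional layout.
    """
    total, distinct = instruction_footprint(instruction_words)
    return (distinct if share else total) <= cache_words


# Toy program: five instructions, three of them distinct.
toy = [0x10, 0x20, 0x10, 0x30, 0x20]
toy_counts = instruction_footprint(toy)
```

Maximal sharing trades data processor cache footprint for a larger branch processor program, since each shared instruction needs its own fetch.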

The inventors collected instruction count statistics for 267 executables that were compiled using the GNU C compiler for an R3000-based DECstation. It was found that the number of distinct opcodes grows at a rate that is less than linear in the size of the executable.

Table 2. Percentage of programs with 100% cache hits.

Table 2 shows the percentage of programs that would completely fit in an instruction cache depending on whether total instructions or the number of unique instructions in the program are counted. In a branch processor architecture, most programs would fit in a typical instruction cache (8K words). Therefore, the number of instruction cache misses in the data processor is reduced. At the same time, the cache misses for the branch processor are increased. The number of cache misses for the branch processor can be bounded by the number of cache misses for the original instruction set, since each ordinary instruction is translated into at most one branch processor instruction. Therefore, the additional memory bandwidth requirements for a branch processor can be reduced significantly by sharing instructions from the data processor, but at a performance cost. This conservative analysis shows that introducing a branch processor will not have a large impact on the instruction memory bandwidth required by the processor.

Branch prediction and prefetching techniques attempt to improve performance by predicting what the program will execute.

Incorporating branch prediction into this architecture corresponds to guessing the value being sent on the feedback channel for if? instructions. Since simple loops no longer contribute branch instructions, the effectiveness of branch prediction will be decreased because the cases which can be easily predicted (loops) are no longer present.

Prefetch instructions attempt to hide the latency of cache misses by dispatching reads to the caches before the data value is actually needed. These prefetch instructions can be inserted into the instruction stream of both the branch processor (for instruction cache prefetches) and the data processor (for data cache prefetches) .

Instructions that support software-controlled speculation can be introduced to improve the performance of the branch processor architecture. The instruction sfetch addr,N means "fetch and speculatively execute N instructions that begin at address addr." These instructions are fetched from memory and dispatched to the data processor. The commit instruction informs the data processor whether the last speculatively executed block should be permitted to modify the state of the processor. Therefore, the sequence "sfetch addr,N; commit true" is equivalent to "fetch addr,N". The sequence "sfetch addr,N; commit false" is equivalent to a skip. Speculative execution is used to begin execution of a block of code before knowing whether it should be executed. The condition under which the code should be permitted to execute is computed in the data processor, and sent back to the branch processor via a send! instruction. Often, this information determines which of "commit true" or "commit false" should be executed. To optimize this case, the sfetch? instruction is used. "sfetch? addr,N" behaves like sfetch. In addition, it receives a value from the data processor and uses this value to determine which commit instruction should be executed.

This is equivalent to the following branch processor code:

sfetch addr,N; if? A; commit false; goto B; A: commit true; B:
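The effect of sfetch? on processor state can be modelled as running the block on a shadow copy and keeping the result only on commit. The Python sketch below is a behavioral illustration under several assumptions: data processor state is a dictionary of registers, each instruction is a callable that mutates that dictionary, and the name sfetch_q is hypothetical. It says nothing about how the hardware would buffer speculative results.

```python
def sfetch_q(state, block, sync_value):
    """Model `sfetch? addr,N` followed by the implied commit.

    state:      dict of register values (left unmodified by this function).
    block:      list of callables, each mutating a register dict.
    sync_value: value received from the data processor; non-negative
                selects `commit true`, negative selects `commit false`.
    """
    shadow = dict(state)          # speculative copy of the state
    for instruction in block:
        instruction(shadow)       # speculative execution
    if sync_value >= 0:
        return shadow             # commit true: changes become visible
    return dict(state)            # commit false: equivalent to a skip


def increment_r1(regs):
    regs["r1"] = regs["r1"] + 1


initial = {"r1": 1}
committed = sfetch_q(initial, [increment_r1], 0)
discarded = sfetch_q(initial, [increment_r1], -1)
```

In the described architecture the block would start executing before the condition is known; the sketch takes the value as an argument only to show the two possible outcomes.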

The instructions for supporting software-controlled speculative execution are summarized in Table 3.

Table 3. Instructions supporting speculative execution.

Existing compilation techniques can be used to generate code for the branch processor. In one method, a standard instruction set is translated directly into branch processor instructions by replacing conditional branches with send! and if? pairs, and using fetch instructions to dispatch instructions within a basic block. Both fixed-length and variable-length loops can be detected by modern compilation systems. Most programming languages have constructs for simple iterated loops, simplifying the problem of loop detection. Therefore, a compiler can generate push instructions for loops. In addition, subroutine calls and returns are explicit in the language. Therefore, these instructions can be easily generated by standard compilation systems. Indeed, the branch processor instruction set is easier to map to because the call and return semantics are provided by the hardware directly. Peephole optimization can be used to move a send! instruction earlier in the data processor instruction stream, ahead of instructions that it does not depend on. Recall that an early send! instruction will improve the performance of the branch processor architecture.

Loop unrolling and loop peeling are transformations used to improve the performance of programs. Both transformations replicate the body of the loop in order to statically determine the direction of some of the branches in the loop body. Observe that such program transformations replicate code just in the branch processor; streams of instructions in the data processor can be re-used because they no longer encode any control flow information. This implies that we will not worsen instruction cache performance by applying such transformations.

Fetch instructions provide a simple interface for implementing microcode. A sequence of instructions stored at fixed addresses in memory can be used to create complex "instructions" of the form fetch addr,N. The effect of executing these instructions is to execute the sequence of instructions stored at the specified memory address, providing the same effect as an architecture that included programmable microcode.

Other modifications are contemplated to exist within this disclosure. Such modifications are intended to be encompassed within the following claims.

Claims

What is claimed is:

1. A processor system, comprising: a first processor, which receives a stream of information including information on branching within a program being executed, and determines branching information and obtains instructions from memory based on said program; and a data processor, receiving said instructions from memory and carrying out an operation based thereon.

2. A system as in claim 1 wherein said stream of information includes a first instruction that is defined to allow executing programs when control flow is only available at run time.

3. A system as in claim 1 further comprising a synchronization mechanism between said data processor and said first processor, allowing communication between said data processor and said first processor.

4. A system as in claim 3 further comprising a special instruction which indicates that control flow will only be available at run time, and which commands said data processor to send a value to the first processor via said synchronization mechanism.

5. A system as in claim 4 wherein said line is also used for synchronization.

6. A system as in claim 1 wherein said instructions include a block fetch instruction.

7. A system as in claim 6 wherein said block fetch instruction includes an address of instructions and a length of instruction sets.

8. A system as in claim 1 wherein said instructions include a looping instruction which stores a branch address, a number of instructions after said branch address, and a number of times of execution.

9. A system as in claim 4 further comprising a loop instruction which specifies a branch address and uses a data value on the synchronization mechanism as a loop count.

10. A system as in claim 4 wherein said special instruction includes a function call to an address determined at run time.

11. A system as in claim 4 wherein a branch-to address is not known prior to a specified time, and read via said synchronization mechanism.

12. A system as in claim 4 wherein control flow depends on said operation in the data processor and said control flow is communicated to said first processor via said synchronization mechanism.

13. A method of operating a processor, comprising: operating a data processor which carries out instructions which are applied thereto; operating a separate branch processor which determines information on branching of a program that uses said instructions; providing a synchronization channel between said data processor and said branch processor; operating said branch processor and said data processor separately when control flow can be determined at compilation time; and when control flow can only be determined at run time, providing an instruction which provides synchronization data from said data processor to said branch processor at said run time.

14. A method as in claim 13 further comprising detecting at least one condition which indicates a deadlock between said branch processor and said data processor, further comprising a deadlock removal mechanism.

15. A method as in claim 13 wherein said control flow includes a special instruction that commands said synchronization data at said run time.

16. A method as in claim 15 wherein said instructions are handled as instructions that modify states of the processor for exception purposes.

17. A method as in claim 13 further comprising predicting what the program will execute next, to improve performance of the program.

18. A method as in claim 17 wherein said predicting comprises guessing a value on the synchronization channel.

19. A method as in claim 13 further comprising an instruction that supports software controlled speculation of values on said synchronization channel.

20. A method as in claim 19 wherein said software controlled speculation instructs the data processor to carry out said action based on a condition.

21. A method as in claim 13 further comprising a speculative execution comprising beginning execution of a block of code prior to knowing whether the block will be executed.

22. A method of operating a processor, comprising: forming a first stream of information indicating a number of iteration counts, and a second stream of information indicating instructions to be carried out at said iteration counts; and determining instructions based on said first stream of information, and using said determined instructions to produce said second stream of information.

23. A method as in claim 22, wherein said first and second streams are instructions.

24. A method as in claim 23, wherein said first stream includes an instruction which indicates that a function of said instruction can only be determined at run time.

25. A method as in claim 22, further comprising executing said first stream of information in a branch processor which determines information on branching of the instructions and said second stream of information in a data processor; and providing a synchronization channel between said data processor and said branching processor.

26. A method as in claim 25, further comprising operating said branch processor and said data processor separately when control flow can be determined at compilation time; and when control flow can only be determined at run time, providing an instruction which provides synchronization data from said data processor to said branch processor at said compilation time.

27. A method as in claim 22, wherein said streams of information include a loop instruction that is used to permit the execution of loops with iteration counts determined at run time.

28. A method as in claim 22, further comprising sending information about data at run time from said second stream to said first stream.

29. A method as in claim 28, further comprising a speculative execution comprising beginning execution of a block of code prior to knowing whether the block will be executed.

30. A method as in claim 22, further comprising synchronizing the first and second streams during run time, when control flow cannot be determined when the program is compiled.

31. A method of compiling a program, comprising: obtaining instructions; determining branch parts of said instructions and producing a first stream of compiled information based thereon; and determining data processing parts of said instructions and producing a second stream of compiled information based thereon.

32. A method as in claim 31, further comprising: determining branch parts that cannot be determined until run time; and signalling said parts.

33. A method as in claim 32, wherein said signalling comprises replacing conditional branches with special instructions that indicate operation at run time.