PROCESSOR ARCHITECTURE FOR SPECULATED VALUES
Technical field
The present invention pertains to the field of processors and in particular to that portion of this field relating to super-scalar processors that are designed to utilize speculation.
Background of the invention
In modern processors, such as the Alpha 21264, Sun SPARC or Intel Pentium III, several different techniques are used to permit the execution of several instructions simultaneously. These techniques are often referred to under the umbrella term ILP (Instruction-Level Parallelism). The objective of these techniques is to reduce instruction dependency and to identify instructions that can be executed in parallel in order to achieve higher performance from the processor.
One type of dependency investigated in this respect is data dependency. Data dependency arises when an instruction uses a value that has been generated by a previous instruction. This forces the processor to execute instructions in a specific order. In a modern processor, it can take several clock cycles to read in a value from an external memory. This means that instructions that are dependent upon the value to be read cannot be executed immediately; instead the processor idles until the value has been read in from the external memory. Various methods of reducing the effect of this type of data dependency have been developed. One method is to shorten access time by using cache memory, which provides shorter access time than external memory.
However, as processors have improved, the clock frequency of processors has increased faster than the access time of cache memory has decreased, which naturally implies that the access time as calculated in clock cycles has increased. This situation,
together with the increasingly deeper pipeline structure of the processor, has prompted the development of additional techniques to reduce the effects of data dependency. One known technique is to speculate (i.e., guess) the value to be read in from, for example, the external memory or the cache memory. The execution of the instructions can thus proceed without delay by using a speculated value, i.e., a guessed value. When the value is finally read in from the memory, it is compared with the speculated value. If the read-in value is equal to the speculated value, the operation has succeeded. If the read-in value is not equal to the speculated value, portions of the execution must be repeated using the current (correct) read-in value. It may seem that it would be difficult to successfully speculate values but in practice it has been shown that there are strategies for speculation that can lead to significant efficiency gains.
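The check-and-replay principle described above can be illustrated with a minimal conceptual sketch in Python. This is not the patented hardware; the function name and structure are ours, and the "slow" memory access is modelled as a callable so the comparison step stands out:

```python
# A minimal sketch (not the patented design) of value speculation:
# execution proceeds with a guessed value; when the real value
# arrives it is compared with the guess, and a mismatch means that
# parts of the execution must be replayed with the correct value.

def speculate_load(guess, actual_load):
    """Run with a guessed value, then verify against the real load.

    Returns (correct_value, replay_needed).
    """
    actual = actual_load()    # the slow memory access completes later
    if actual == guess:
        return actual, False  # speculation succeeded
    return actual, True       # speculation failed: replay required

value, replay = speculate_load(0, lambda: 0)    # guess matches: no replay
value2, replay2 = speculate_load(0, lambda: 7)  # guess wrong: replay with 7
```

The sketch deliberately leaves open *where* the replay restarts, which is exactly the point on which the invention departs from the prior art.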
In speculation, the address of an instruction that follows the instruction that requested reading of the value from memory is saved; alternatively, the address of the instruction that requested reading is saved. When a speculation error has occurred (i.e., the value read differs from the speculated value), execution restarts at the saved address, now using the read-in value instead of the speculated value. This means that the results of a number of instructions must be flushed, and the execution of these instructions was therefore entirely in vain. Prior art document WO2000/42503 shows a processor pipeline comprising a fetch stage, a decode stage, a queue, an execution stage and a retirement stage. The processor pipeline includes a channel to send an indication that a given instruction encountered an execution problem. In one embodiment, the processor replays by copying a replay pointer into a tail pointer. In another embodiment, partial pipeline replays may be performed in response to execution problems. According to the latter embodiment, the rows of a mask indicate columns of micro-ops that encountered execution problems or that potentially depend on micro-ops that encountered execution problems.
Summary of the invention
The present invention primarily addresses the problem of making the execution of instructions in a super-scalar processor that utilizes value speculation more efficient.
Briefly, the problem stated above is solved with an improved processor that makes it possible to avoid unnecessarily re-executing instructions when restarting as a result of an erroneous value speculation.
A primary intention of the invention is thus to make execution of instructions in a processor more efficient, and the invention comprises a processor with which this intention is achieved. In more detail, the problem stated above is solved with a processor that comprises means for speculating values of variables that are used in conjunction with execution of instructions in the processor. The processor also comprises means for determining, for each of the speculated values, whether there is a first instruction that is dependent upon the speculated value. Moreover, the processor comprises means for determining if the speculation of a value has failed and means for restarting execution from a specified instruction in response to the detection of an incorrectly speculated value. In this case, the restart is made from the first instruction that is dependent upon the speculated value for which speculation has failed.
The speculated value is not normally used by the very first instruction that follows the instruction that gave rise to the speculation. In fact a large number of instructions that are independent of the speculated value may be executed ahead of the first instruction that is dependent upon the speculated value. The instructions that are executed independently of the
speculated value are thus correctly executed. If the speculated value proves to be incorrect, the execution will, according to the invention, restart from the first instruction that is dependent upon the speculated value. This means that the instructions that have been executed independently of the speculated value will not be re-executed. Consequently, the previously correctly executed instructions will not be re-executed unnecessarily upon a restart, thus achieving more efficient execution. A primary advantage of the present invention is thus that execution of instructions in the processor is performed more efficiently because unnecessary re-execution of instructions that have already been correctly executed is avoided. The invention will now be described in more detail using preferred embodiments and with reference to the attached drawings.
Brief description of figures
Figure 1 shows a block diagram of a processor according to known technique,
figure 2 shows a code example,
figure 3 shows a block diagram of an instruction window that is complemented with a state machine and a table according to the invention,
figure 4 shows a preferred embodiment of the table,
figure 5 shows a flow diagram that describes a preferred method for processing of an incoming instruction by the state machine,
figure 6 shows a block diagram of an embodiment of a processor according to the invention,
figure 7 shows a detail of the preferred processor according to the invention,
figures 8 and 9 show exemplary states in the processor at a first point in time, and
figures 10 and 11 show exemplary states in the processor at a second point in time.
Description of preferred embodiments of the invention
Figure 1 shows a block diagram of a super-scalar processor (1) according to known techniques. Processor (1) comprises at least one pipeline with a number of functional units, sometimes designated as steps. The steps in the pipeline can be said to work on the conveyor-belt principle. The steps in the pipeline load input data that is processed and placed in an output buffer, which normally in turn functions as an input buffer to a subsequent step. This means that the various steps can work in parallel, thus increasing the execution efficiency of processor
(1) because each item of input data does not need to pass through the entire pipeline before the next item of input data begins to be processed. In Figure 1, the pipeline comprises the following steps: an instruction load step (3) with a program counter (3a); a decoding step (5); an operand load step (7); an execution step (9) with a data load unit (13); and a write step (11). Instruction load step (3) is connected (for example, via a digital bus) to a program memory (17) in which instructions are stored. Instruction load step (3) loads the instructions from the program memory (17). Program counter (3a) points to an address in program memory (17) from which the instruction is to be loaded. When one of the instructions has been loaded from program memory (17), program counter (3a) is incremented to point to the next address from which the next instruction is to be loaded. Decoding step (5) decodes, i.e., interprets, the loaded instructions. For example, decoding step (5) determines whether the loaded instructions use operands, i.e., the values that are stored in a register unit (15) in processor (1). Register unit (15) comprises one or more registers with register locations where operands can
be stored. Operand load step (7) loads from register unit (15) the operands that are needed for executing the decoded instructions. Execution step (9) executes the loaded instructions, utilizing the loaded operands where needed. In a super-scalar processor, execution step (9) can be divided into a number of smaller units that are specialized for executing certain types of instructions. Data load unit (13) loads values from a data memory (19), which can be a cache memory or an external memory, for example. Write step (11) writes values to the register locations in register unit (15); these may be values that are loaded by data load unit (13) or values that are the result of the execution of arithmetic operations in execution step (9). The write step also writes to data memory (19).
Figure 2 shows an example of code. The code example begins with a load instruction having address IA. The load instruction means that a value for a variable x is to be loaded to a selected register location in register unit (15) from a specified address (FFFF0000) in data memory (19). The load instruction is followed by n-1 instructions that are not dependent upon the value of variable x, i.e., they do not use the value of variable x as an operand. Thereafter comes an instruction having address IA+n which involves an arithmetic operation that uses the value of variable x. In the code example in Figure 2, a new variable y is created by adding one (1) to the value of variable x. Thereafter, additional instructions follow having addresses IA+n+1, IA+n+2 and so on. It can, as mentioned previously, take a relatively long time to load the value of variable x from memory (19), which is why this value can be speculated to speed up the operations of processor (1). Write step (11) in such a case is normally designed to write a speculative value (i.e., a guessed value that has been generated, for example, by data load unit (13)) for variable x at the pertinent register location in register unit (15). When the correct value of variable x is finally read in from data memory (19), it is compared with the speculated value. If the correct value of variable x is not the
same as the speculated value (speculation error), a restart of the execution is made from the instruction with address IA+1. Consequently, the result of previously executed instructions is flushed, and this applies even to the instructions having addresses IA+1 to IA+n-1 that did not use the value of variable x and that therefore cannot have been influenced by the speculation error. There is thus an efficiency gain to be made if the already correctly executed instructions having addresses IA+1 to IA+n-1 do not need to be executed a second time, and the present invention is primarily intended to make such an efficiency gain possible.
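The size of this gain for the code of Figure 2 can be made concrete with a small illustrative calculation (ours, not from the patent). After the load at address IA there are n-1 independent instructions at IA+1 to IA+n-1 and the first dependent instruction at IA+n:

```python
# Hypothetical illustration of the efficiency gain for Figure 2:
# a load at IA, n-1 independent instructions, and the first
# dependent instruction at IA+n.

def replayed_instructions(n, restart_at_first_dependent):
    """Count how many of the n instructions IA+1..IA+n are replayed."""
    if restart_at_first_dependent:
        # Per the invention: only the dependent instruction IA+n
        # (and anything after it) is replayed.
        return 1
    # Conventional restart from IA+1: all n instructions replay,
    # including the n-1 that never used the speculated value.
    return n

saved = replayed_instructions(10, False) - replayed_instructions(10, True)
# with n = 10, nine already-correct instructions avoid re-execution
```

The larger n is, i.e., the further the first dependent instruction lies from the load, the larger the saving.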
In a super-scalar processor, there is always an instruction window that keeps order among the instructions that are in various stages of execution in the processor. One task of the instruction window is to keep track of the interdependencies of instructions. The instruction window thus has access to information on where the desired values from data memory (19) will be stored in register unit (15) . The instruction window also has access to a mechanism to keep track of which registers in register unit (15) the instructions are dependent upon.
Figure 3 shows, very schematically, a block diagram of an instruction window (20) for a super-scalar processor according to the present invention. Instruction window (20) receives incoming instructions and schedules the received instructions. Instruction window (20) is complemented with a state machine (21) and a table (23). State machine (21) is connected to both register unit (15) and table (23). The state machine builds up and utilizes table (23) in performing its tasks, as described in more detail further on. Figure 4 shows a preferred embodiment of table (23). Table (23) contains identification of variables (first column), in normal practice pointers to the register locations in register unit (15) in which the variables' values are stored. The example shown in Figure 4 is associated with the code example in Figure 2 and
thus identifies the variables x and y. Table (23) also has tags (second column) that indicate whether the variables are subject to speculation or not. In the example shown in Figure 4, the variable x is subject to speculation but the variable y is not. For the variables (x) that are subject to speculation, table (23) includes, in this example, the associated speculated value (third column). Alternatively, the speculated values can instead be stored in register unit (15). In this example, the speculated value that is associated with the variable x is equal to zero (0). Furthermore, table (23) includes, for each variable that is subject to speculation, an identification (fourth column) of an address in program memory (17) that contains the first instruction that is dependent upon the value of the current variable. In this example, the address IA+n is thus stored in table (23) and associated with the variable x. The addresses that are stored in table (23) are restart addresses that are used when restarting the execution as a result of speculation errors, which is described in more detail further on.
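The four columns of table (23) described above can be modelled as a simple record per variable. The field names below are illustrative, not from the patent, and the x/y rows mirror the Figure 4 example:

```python
# A sketch of table (23): one row per variable, with a pointer to a
# register location, a speculation tag, the speculated value, and the
# restart address of the first dependent instruction.  Field names
# and register names ("r1", "r2") are ours.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TableRow:
    register: str                    # col 1: register location pointer
    speculated: bool                 # col 2: speculation tag
    speculated_value: Optional[int]  # col 3: only set if speculated
    restart_address: Optional[str]   # col 4: first dependent instruction

table = {
    "x": TableRow("r1", True, 0, "IA+n"),    # x: speculated as 0
    "y": TableRow("r2", False, None, None),  # y: not speculated
}
```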
Figure 5 shows a flow diagram that describes a preferred embodiment of how state machine (21) processes an incoming instruction. The process is applied to each incoming instruction.
The process in Figure 5 is initiated, after a start (31), with block (33) at which it is determined if the incoming instruction requires speculation of a variable's value. Typically, it is a variable with a value that is the result of the instruction. For example, the instruction can be a load instruction that will load a value of a variable from data memory (19). But other types of instructions can also involve speculation. For example, the value of a variable that is the result of an instruction that performs an arithmetic operation can be subject to speculation, i.e., the processor does not wait for the result of the arithmetic operation but instead guesses the result to save time.
If at block (33) it is determined that the incoming instruction requires a speculation, the process continues with block (35). At block (35), a speculated value is generated that is associated with the variable having a value determined by the incoming instruction. The speculated value is stored in table (23). Moreover, a pointer to this speculated variable is stored in table (23) together with a tag that indicates that this variable is speculated.
Block (35) is followed by several blocks (37a-37e). In a preferred embodiment, blocks (37a-37e) are not executed by state machine (21), but rather by some other part of the processor, write step (11), for example. Alternatively, blocks (37a-37e) can also be executed by state machine (21).
Block (37a) indicates that a correct value of the speculated variable is being awaited. When the correct value has been obtained, the process continues with block (37b), in which it is determined whether the speculated value is in agreement with the correct value obtained. If the speculated value agrees with the obtained correct value, the speculation has succeeded and no additional measures need be taken. In this case, block (37b) is followed by a stop (39). If the speculated value does not agree with the correct value obtained, the speculation has failed (speculation error) and block (37b) is followed by block (37c). Block (37c) replaces the speculated value with the obtained correct value. Block (37c) is followed by block (37d), in which it is determined whether a restart address for the speculated variable is stored in table (23). If a restart address is stored, the process continues with block (37e), in which a restart of the execution is made from the instruction that is associated with the stored restart address. If at block (37d) it is instead determined that no restart address is stored, no operation is performed and block (37d) is followed by stop (39).
However, if at block (33) it is determined that the incoming instruction does not involve a speculation, the process continues after block (33) with block (41). The process also continues with block (41) after block (35). Block (41) and subsequent blocks are executed simultaneously with blocks (37a-37e). At block (41), it is determined if the incoming instruction is dependent upon an earlier speculation, i.e., if the incoming instruction uses one or more of the variables that are tagged as speculated in table (23). If at block (41) it is determined that the incoming instruction is not dependent upon a previous speculation, the process continues with block (43), in which the incoming instruction is scheduled for execution.
If at block (41) it is instead determined that the incoming instruction is dependent upon a previous speculation, the process continues with block (45). At block (45) it is determined whether any previous instruction was dependent upon the same previous speculation as the incoming instruction. This is done by state machine (21) consulting table (23) to see if an address to the first instruction that is dependent upon the prior speculation is already stored in table (23). If no such address is stored in table (23), then the incoming instruction is the first instruction that is dependent upon this prior speculation and the process therefore continues with block (47), which means that the address of the incoming instruction is saved in table (23) as the address of the first instruction that is dependent upon the prior speculation. If the incoming instruction is dependent upon several prior speculations, block (45) is naturally executed and, when needed, block (47) is also executed for each such speculation. Block (47) is followed by block (49), which means that the incoming instruction is executed in normal sequence utilizing the previously speculated value or values, as required. If at block (45) it is determined that a previous instruction is dependent upon the same previous speculation, block (47) is naturally not executed for this speculation. The process ends after block (43) or, where appropriate, after block (49), as indicated by stop (39).
In the process in Figure 5, the variable values that are associated with the results of the incoming instructions are speculated. However, this process can also be used, with the necessary modifications, to speculate variable values that constitute operands for the incoming instructions.
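The Figure 5 flow for one incoming instruction can be sketched as follows. This is a simplified software model of the state machine's bookkeeping, not the hardware itself; the function signature and the dictionary layout of table (23) are ours:

```python
# A simplified sketch of the Figure 5 flow for one incoming
# instruction: tag new speculations (blocks 33/35) and, for the first
# instruction that depends on a speculated variable, record its
# address as the restart address (blocks 41/45/47).

def process_instruction(table, address, result_var=None,
                        speculates=False, guess=None, operands=()):
    # Blocks 33/35: a new speculation creates a table entry with the
    # guessed value and, as yet, no restart address.
    if speculates:
        table[result_var] = {"value": guess, "restart": None}
    # Blocks 41/45/47: only the *first* instruction that uses a
    # speculated variable has its address saved as the restart address.
    for var in operands:
        entry = table.get(var)
        if entry is not None and entry["restart"] is None:
            entry["restart"] = address
    return table

t = {}
process_instruction(t, "IA", result_var="x", speculates=True, guess=0)
process_instruction(t, "IA+1")                     # independent: no change
process_instruction(t, "IA+n", operands=("x",))    # first dependent
process_instruction(t, "IA+n+1", operands=("x",))  # restart stays IA+n
```

Run against the Figure 2 code, the sketch leaves "IA+n" as the restart address for x, exactly the entry shown in Figure 4.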
Figure 6 shows a block diagram of an embodiment of a processor (1.1) according to the invention. Processor (1.1) is for the most part built up in the same way as processor (1) in Figure 1. Processor (1.1), however, is complemented with state machine (21), which in this embodiment is connected to operand load step (7), which also includes instruction window (20) (not shown in Figure 6). State machine (21) is also connected to register unit (15) and to table (23), which is included in processor (1.1). Table (23) is also connected to write step (11). When write step (11) writes a (correct) value for a variable in register unit (15), write step (11) is designed to consult table (23) to see if the value of the variable has been speculated or not. If the value is speculated, write step (11) then compares the (correct) value with the speculated value in table (23). If this indicates that the speculation has failed, write step (11) retrieves from table (23) the stored address for the first instruction that is dependent upon the incorrectly speculated variable. The retrieved address is sent to instruction load step (3). Program counter (3a) is reset to the loaded address and execution now restarts beginning at the instruction associated with the loaded address. In contrast to known techniques, restart thus occurs from the first instruction that is dependent upon the variable having an incorrectly speculated value. This enables processor (1.1) to operate more efficiently because the instructions that precede the first instruction that is dependent upon the speculated variable, but that are not themselves dependent upon the speculated value, are not re-executed unnecessarily. Processor (1.1) functions both with central tables for processing instructions (so-called scoreboards) and with decentralized tables in the form of so-called reservation stations.
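The write-step check just described can be sketched as a small function: on writing the correct value, compare it with the table's speculated value and, on a mismatch, hand the stored restart address back for the program counter. The dictionary layout and function name are illustrative assumptions, not the patented circuit:

```python
# A sketch of the write-step behaviour of processor (1.1): compare
# the correct value with the speculated one in table (23) and return
# the restart address on a mis-speculation (None otherwise).

def write_back(table, var, correct_value):
    entry = table.get(var)
    if entry is None or not entry["speculated"]:
        return None                     # no speculation: nothing to check
    if entry["value"] == correct_value:
        return None                     # speculation succeeded
    entry["value"] = correct_value      # replace the guess (block 37c)
    return entry["restart"]             # program counter resets to this

table = {"x": {"speculated": True, "value": 0, "restart": "IA+n"}}
restart = write_back(table, "x", 7)     # mis-speculation: restart address
ok = write_back(table, "x", 7)          # value now correct: no restart
```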
Fig. 7 shows an exemplary embodiment of the super-scalar pipeline processor according to the invention. As previously shown, the processor comprises an instruction load step 3 in which instructions are fetched from the list of code shown in fig. 2, a decoding step 5, an operand load step 7, an execution step 9 and a write step 11. The operand load step 7 comprises a plurality of operand load units, in this example C1 - C6, in which instructions, some of which may use speculated values, are stored in a queue awaiting processing. The execution step 9 comprises a corresponding plurality of execution units, designated E1 - E6, each executing the instructions stored in the operand load unit with the corresponding number. By way of example, execution unit E1 is a store operator, E2 is a sum operator, E3 is a subtract operator, E4 and E5 are multiplication or division operators and E6 is a read operator, i.e., data load unit 13. The processor according to the invention distributes instructions to the operand load units C1 to C6 such that each instruction is placed in an appropriate operand load unit, that is, placed in a queue with which a fitting type of execution unit is associated. Whenever more than one operand load unit can be selected, instructions are preferably allocated to the operand load unit with the shortest queue.
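The shortest-queue dispatch policy mentioned above can be sketched as follows. Unit names C1 - C6 follow fig. 7; the instruction names and the choice of eligible units are invented for the example:

```python
# A sketch of the dispatch policy: among the operand load units whose
# execution unit can handle the instruction, pick the one with the
# shortest queue.

def dispatch(queues, eligible_units, instruction):
    target = min(eligible_units, key=lambda u: len(queues[u]))
    queues[target].append(instruction)
    return target

queues = {u: [] for u in ("C1", "C2", "C3", "C4", "C5", "C6")}
queues["C4"] = ["mul1", "mul2"]   # C4 (multiplication) already has a queue
# C4 and C5 both handle multiplications; C5 has the shorter queue.
unit = dispatch(queues, ("C4", "C5"), "mul3")
```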
Fig. 8 shows an example of instruction queues in operand load units C1 - C6 waiting to be processed in execution step 9 in the processor shown in fig. 7 at a given time. In fig. 8, speculated values are underlined, and values that correspond to the first subsequent instruction in the queue that depends on a speculated value have been marked in bold letters. It should be noted that a dependent instruction might not necessarily be dependent on an instruction stored in the same operand load unit. In the present example, instruction 22 is dependent on instruction 6.
Fig. 9 shows a buffer associated with write stage 11 at the same time as fig. 8. The executed values from execution step 9 are
communicated to the buffer and loaded into predetermined positions therein, allocated (reserved) at earlier pipeline steps. The right-hand side of the buffer (the dotted area) comprises instructions that are sequentially written into register locations in register unit 15. When instructions are written into register unit 15, they are committed and can no longer be altered. The example shows instructions 3 and 4 having been written. The buffer moreover comprises executed values (represented by a number) and non-executed values (represented by blank fields). Some of the executed values are dependent on speculated values, as in the example instruction 21. The buffer is filled in the order in which the execution units finish execution of the respective instructions, but emptied in sequential order as mentioned above. When an instruction whose value is speculated is executed, the correct value is communicated to the buffer. In the present example according to figs. 8 and 9, the instructions 5, 6, 9, 14, 16 and 33 are about to be loaded into the respective execution units.
At a later point in time, when instruction 6, whose value was speculated, is executed, the speculated value is in this instance found to be erroneous and is therefore replaced with the true value 6. The table shown in fig. 4 is updated.
Figs. 10 and 11 show, in analogy to figs. 8 and 9, the exemplary situation after the value 5 and the speculated value 6 are executed and the value 5 and the true value 6 are communicated to the buffer and subsequently committed. According to the invention, the first instruction which is dependent on the mis-speculated value 6 and subsequent instructions are deleted or flushed from the operand load step 7, the execution step 9 and the buffer associated with the write stage 11. Instructions are deleted to the extent that instructions are present in those steps. Consequently, instruction 22 and above are deleted, while instructions 21 and lower are retained. A new instruction 22', based on the correct value for instruction 6, is fetched from instruction load step 3 and distributed, by way of example, now to operand load unit C5. Other new instructions 33', 34' and 42' are also distributed into operand load units C4, C5 and C6.
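The selective flush of figs. 10 and 11 amounts to removing, from every queue, the first dependent instruction and everything after it, while retaining what came before. A sketch, with invented queue contents (instruction numbers only loosely modelled on the figures):

```python
# A sketch of the selective flush: the first instruction dependent on
# the mis-speculated value (here 22) and all later instructions are
# deleted from every operand load unit; earlier ones are retained.

def flush_from(queues, first_dependent):
    for unit, q in queues.items():
        queues[unit] = [i for i in q if i < first_dependent]
    return queues

queues = {"C1": [5, 21, 30], "C2": [14, 22], "C3": [9, 33]}
flush_from(queues, 22)
# 22, 30 and 33 are deleted; 5, 9, 14 and 21 are retained
```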
By retaining those instructions that precede the first instruction dependent on a mis-speculated value, the risk that write step 11 waits in vain for reserved but non-executed values is minimised. Write step 11 can thereby commit instructions more speedily, thus enhancing the average efficiency of the processor.