WO2003009134A1 - Value prediction in a processor for providing speculative execution - Google Patents
- Publication number
- WO2003009134A1 (PCT/SE2002/000298)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- value
- prediction
- information
- decision
- cache
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3861—Recovery, e.g. branch miss-prediction, exception handling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
- G06F9/3832—Value prediction for operands; operand history buffers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3838—Dependency mechanisms, e.g. register scoreboarding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3842—Speculative instruction execution
Definitions
- the present invention relates in general to computer systems, and more specifically to the design and functioning of a processor.
- Today, one of the major limiting factors in processor performance is memory speed. In order to fully benefit from the processor's speed, the processor must receive data to operate on from data memory at a speed that keeps the processor busy continuously. This problem can be attacked by supplying a smaller and faster memory, called a cache, close to the processor.
- the cache reduces the delay associated with memory access by storing subsets of the memory data that can be quickly read and modified by the processor.
- U.S. Patent no. 5,781,752 discloses a processor provided with means for speculative execution.
- the processor has a data speculation circuit comprising a prediction threshold detector and a prediction table.
- the prediction table stores prediction counters that reflect the historical rate of mis-speculation for an instruction.
- the prediction threshold detector prevents data speculation for instructions having a prediction counter within a predetermined range.
- the load value prediction unit comprises a load classification table for deciding which predictions are likely to be correct.
- the load classification table includes counters for load instructions. The counters indicate the success rate of previous predictions and are incremented for correct predictions and decremented otherwise. Based on the value of the counter a load instruction is classified as unpredictable, predictable or constant. Speculative execution is prevented for load instructions that are classified as unpredictable.
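- The counter-based classification described above can be sketched as follows. This is a minimal illustration, not the cited design: the counter width and the class boundaries (`COUNTER_MAX`, `PREDICTABLE_MIN`, `CONSTANT_MIN`) are assumptions, since the text gives no concrete values.

```python
from dataclasses import dataclass

# Hypothetical parameters; the text does not state counter width or boundaries.
COUNTER_MAX = 3        # 2-bit saturating counter
PREDICTABLE_MIN = 2    # counter >= 2: "predictable"
CONSTANT_MIN = 3       # counter >= 3: "constant"


@dataclass
class LoadCounter:
    value: int = 0

    def update(self, prediction_correct: bool) -> None:
        # Increment for a correct prediction, decrement otherwise (saturating).
        if prediction_correct:
            self.value = min(self.value + 1, COUNTER_MAX)
        else:
            self.value = max(self.value - 1, 0)

    def classify(self) -> str:
        if self.value >= CONSTANT_MIN:
            return "constant"
        if self.value >= PREDICTABLE_MIN:
            return "predictable"
        return "unpredictable"

    def may_speculate(self) -> bool:
        # Speculation is prevented for loads classified as unpredictable.
        return self.classify() != "unpredictable"
```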
- Calder B. et al., "Selective Value Prediction", Proceedings of the 26th International Symposium on Computer Architecture, May 1999, describes techniques for selectively performing value prediction.
- One such technique is instruction filtering, which filters which instructions put values into the value prediction table. Filtering techniques that are discussed include filtering based on instruction type and giving priority to instructions belonging to the data dependence path in the processor's active instruction window.
- Some of the methods for selective value speculation described above base the decision of whether or not to use a predicted value on instruction dependency predictions, which means that a prediction of what the possible gain may be is weighed into the decision. These methods make it possible to avoid speculative execution in some cases where it is unwise due to the possible gain being very low. However, all of the described methods of this type are rather complex and they all use predictions relating to dependency instead of true dependency information, which means that there is a degree of uncertainty involved.
- An object of the present invention is to provide a processing unit and a method, which include relatively simple means for deciding when to execute speculatively and wherein the decision to execute speculatively is based on criteria that allow for improved management of the risks involved, as compared with the prior art wherein the decision is based merely on an estimation of the likelihood of correct prediction.
- The object of the present invention is achieved by means of a processing unit as claimed in claims 1 and 12 and by means of a method as claimed in claims 21 and 29.
- The present invention allows for improved risk management by basing the decision whether or not to execute speculatively on information associated with the estimated time gain of execution based on a correct prediction. With the prior art methods, the processing unit may be exposed to the negative impact of mis-prediction even when the gain for a correct prediction is very small.
- the present invention makes it possible to make wiser decisions by means of taking the gain for a correct decision into account. If the cost for mis-prediction is high and the gain for correct prediction is low it is probably wise not to speculate even if the likelihood of a correct prediction is high.
- the present invention makes it possible to avoid speculative execution in situations where it might seem unwise to speculate but where speculative execution undoubtedly would take place if the decision was based merely on the likelihood of a correct prediction, as in the prior art methods described above.
- the decision regarding whether or not to execute speculatively is based on information regarding whether a cache hit or a cache miss is detected in connection with a load instruction.
- a cache hit implies that the true value corresponding to the instruction will be available shortly since the value was found in a very fast memory. If, on the other hand, a cache miss is detected it is a sign that the value must be loaded from a slower memory and that it might take a while until the true value is available. Therefore, according to the present invention, a cache hit is a factor that weighs against speculative execution, since it implies a small performance gain for a correct prediction.
- a cache hit prediction based on the historic likelihood of detecting a cache hit or miss for a value of a certain load instruction, is used as a factor in the speculation decision, instead of the actual detected cache hit or miss.
- The decision regarding whether or not to execute speculatively is based on information regarding the true dependency depth of the load instruction, i.e. the number of instructions that are dependent on the load. If the number of dependent instructions is low it might, depending on the processor architecture, be possible to hide the latency of the load with other instructions that are independent of the load. If this is possible the gain of a correct prediction for the load will be small or none at all.
- the dependency depth of a certain load instruction is therefore, according to an embodiment of the present invention, used as a factor in the decision regarding whether to execute the load instruction speculatively or not.
- a predicted dependency depth is used as a factor in the decision instead of the true dependency depth.
- An advantage of the present invention is that it makes it possible to improve the performance of processing units since the invention makes it possible to avoid speculative execution when the gain for a correct value prediction is too small to motivate taking the risk of mis-prediction.
- the cost of recovery due to misprediction is fairly large and it would therefore be unwise to expose the processor to the risk of a recovery when the performance gain involved in the value speculation is small.
- the present invention makes it possible to restrict value speculation to when the potential gain is significant.
- a further advantage of the present invention is that since it makes it possible to avoid speculative execution when the performance gain is small, the cost of recovery becomes less critical. It is possible to allow more costly recovery since the invention enables restricting speculative execution to cases where the estimated performance gain of correct prediction is considerably larger than the recovery cost.
- Another advantage of an embodiment of the invention is that it reduces the need for storage of value prediction history statistics as will be explained in greater detail below.
- Yet another advantage of the embodiment of the invention is that the information relating to the possible time gain of a correct decision, which is used in the decision of whether or not to execute speculatively, is information that relates to the next execution of the instruction for which speculative execution is an option and not historic information relating to earlier executions of the same instruction.
- the confidence in making a correct decision regarding speculative execution is improved according to this embodiment.
- Fig. 1 is a schematic block diagram of selected parts of a processor that is adapted according to the present invention.
- Figs. 2a-d are time diagrams that illustrate the possible gain and loss involved for speculative execution in cases where a cache hit occurs and in cases where a cache miss occurs.
- Fig. 3 is a block diagram of a value prediction unit according to an embodiment of the present invention, where arrows indicate the steps involved in an embodiment of a method according to the present invention.
- Fig. 4 is a block diagram that shows a more detailed illustration of a buffer unit of a value prediction unit according to an embodiment of the present invention.
- Fig. 5 is a schematic block diagram of a SID structure according to an embodiment of the present invention and a reorder buffer (ROB).
- Figs. 6a-d are schematic block diagrams that illustrate an example of how dependency depth is registered and utilized according to an embodiment of the present invention.
- FIG. 1 shows a schematic block diagram of selected parts of a processor 1 in which the present invention can be implemented.
- a dispatch unit 2 is shown, which determines what instructions are to be executed next and distributes the instructions between a number of reservation stations 3 for forwarding to execution units 4.
- data may have to be fetched from memory and supplied to the execution units 4.
- Fetching of data from memory is handled by address calculation unit (ACU) 5 and data fetch unit 6, which output the fetched data to the reservation stations 3.
- the data fetch unit is able to load data from memory units such as a cache or other types of slower memory units (not shown).
- the processor 1 is an out-of-order processor, i.e. a processor that allows instructions to be executed out of program order.
- Out-of-order execution is supported by a Reorder buffer (ROB) 7, which buffers results until they can be written to a register file in program order.
- the reservation stations 3, the execution units 4 and the ROB 7 constitute the execution engine 8 of the processor 1.
- the processor 1 further comprises a value prediction unit (VPU) 9, which enables predictions of values to be fetched from memory.
- The predictions can be used for speculative execution. If speculative execution is to be performed the value prediction unit 9 produces a prediction P that is presented to the appropriate reservation station 3 before the true value V is received from the data fetch unit 6. The execution is carried out based on the prediction P. Flags are set to indicate that results based on the prediction P are speculative. When the true value V is received from memory the data fetch unit 6 sends this value to the VPU 9, which uses this value to check if the prediction P was correct. If the prediction P was correct (i.e. P = V), the VPU 9 sends a signal s2 to the execution engine 8 that the prediction P was correct and that speculative flags can be cleared. If the prediction P was incorrect (i.e. P ≠ V), the VPU 9 sends a flush order s3 to the execution engine 8, which causes all results based on the incorrect prediction to be flushed out from the ROB 7 and restarts execution based on the true value V.
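- The verification step can be sketched as below. The list-of-dicts reorder buffer and its "speculative" key are hypothetical stand-ins for the ROB 7, and the sketch assumes only one outstanding speculation, so that every flagged entry depends on the same prediction P.

```python
def verify_prediction(predicted, true_value, rob):
    """Compare prediction P with true value V and act on a toy reorder buffer.

    `rob` is a list of dicts with a boolean "speculative" key (hypothetical
    layout). Returns "s2" (clear speculative flags) or "s3" (flush order).
    """
    if predicted == true_value:
        # Correct prediction: results become non-speculative.
        for entry in rob:
            entry["speculative"] = False
        return "s2"
    # Mis-prediction: discard every result that was based on P.
    # Re-execution with the true value V is left to the caller.
    rob[:] = [entry for entry in rob if not entry["speculative"]]
    return "s3"
```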
- When a load instruction (i.e. an instruction to fetch data from memory) is to be executed, the VPU 9 receives an index signal s1 from the dispatch unit 2, which is used to index (via some hash function) the prediction P. As described above, a decision of whether or not to use the prediction P for speculative execution is often made based on the historical success rate of earlier predictions for the same value.
- The VPU 9 may for this purpose store a counter, or some other type of history data, which is associated with a particular value and which reflects the success rate of earlier predictions for the particular value.
- The index signal s1 from the dispatch unit 2 is used to retrieve the appropriate counter or history data. As mentioned above there are several known methods for keeping track of the past success rate of predictions and basing the decision whether to speculate or not on this success rate.
- If the estimated gain of speculative execution based on a prediction that later turns out to be correct is small, it might be better not to speculate and instead wait for the true value V.
- The estimated gain of execution based on a correct prediction for a load instruction is used as a speculation decision criterion.
- a cache hit or a cache miss gives an indication of what the gain from speculative execution might be for a particular load instruction.
- a cache hit or miss is therefore taken into consideration when the decision whether to speculate or not is made.
- a cache miss is an indication of large load latency since it indicates that the value was not found in the cache but has to be loaded from a slower memory.
- a cache hit indicates that the true value will be available in this or the next few cycles since loads from the cache can be performed quickly. It is thus more advantageous to speculate when a cache miss is detected than when a cache hit is detected.
- Figures 2a-d give an illustration of the possible gain and loss involved for speculative execution in cases where a cache hit occurs and in cases where a cache miss occurs.
- Figure 2a illustrates a time line t for a case where a cache miss occurs and the prediction was incorrect.
- The speculative execution starts at time a0'.
- At time a1 the true value is received, which shows that the prediction was incorrect and causes a restart.
- The restart is finished at time a0, where non-speculative execution begins based on the true value.
- Figure 2b illustrates a time line t for a case where a cache miss occurs and the prediction was correct.
- the speculative execution starts at time b0'.
- At time b1 the true value is received, which shows that the prediction was correct.
- Execution can continue from time b1 without having to restart and re-execute what was executed between time b0' and time b1.
- Figure 2c illustrates a time line t for a case where a cache hit occurs and the prediction was incorrect.
- The speculative execution starts at time c0'.
- At time c1 the true value is received, which shows that the prediction was incorrect and causes a restart.
- The restart is finished at time c0, where non-speculative execution begins based on the true value.
- Figure 2d illustrates a time line t for a case where a cache hit occurs and the prediction was correct.
- The speculative execution starts at time d0'.
- At time d1 the true value is received, which shows that the prediction was correct.
- Execution can continue from time d1 without having to restart and re-execute what was executed between time d0' and time d1.
- the dotted areas in figs. 2a and 2c indicate the execution time loss due to restart.
- the dashed areas indicate the execution time gain due to correct value speculation.
- A cache miss indicates a possible gain that is large compared to the possible loss.
- A cache hit indicates a possible gain that is small compared to the possible loss.
- The idea of the present invention is to expose the processor to the risk of the restart penalty (a1-a0 or c1-c0) only when the potential performance gain is large, as it is at a cache miss.
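- The trade-off shown in Figs. 2a-d can be made concrete with a simple expected-value calculation. This is an illustrative model rather than part of the described method: the function name, the probability of a correct prediction and all cycle counts are assumed numbers chosen only to show the effect.

```python
def expected_benefit(p_correct, gain_cycles, restart_penalty_cycles):
    """Expected cycles saved by speculating on one load.

    gain_cycles: time gained when the prediction is correct.
    restart_penalty_cycles: time lost to the restart after a mis-prediction.
    """
    return p_correct * gain_cycles - (1.0 - p_correct) * restart_penalty_cycles


# Assumed figures: 80% prediction accuracy, 10-cycle restart penalty,
# 100 cycles saved on a cache miss versus 2 cycles saved on a cache hit.
benefit_on_miss = expected_benefit(0.8, gain_cycles=100, restart_penalty_cycles=10)
benefit_on_hit = expected_benefit(0.8, gain_cycles=2, restart_penalty_cycles=10)
```

- Under these assumptions the expected benefit is clearly positive for the cache miss and negative for the cache hit, even though the prediction accuracy is the same in both cases.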
- a cache hit signal s4 is input to the VPU 9 as shown in Fig. 1.
- The cache hit signal s4 includes cache hit/miss information 14 that indicates whether a cache hit or a cache miss was detected.
- a detected cache hit is used as a speculation inhibit signal, such that the VPU 9 is prevented from presenting speculative data to the reservation stations 3 when the cache hit signal s4 indicates a cache hit.
- the cache hit/miss information 14 and history data related to the success rate of previous predictions are weighted and combined to form a decision value that indicates whether or not speculative execution should take place.
- the cache hit/miss information 14 is an added criterion, which according to the invention is used as a factor in the speculation decision scheme. Many alternative decision schemes that take cache hit/miss information into consideration are possible as will be explained in greater detail below.
- Fig. 3 shows an implementation of the VPU 9.
- the VPU 9 comprises a value cache 10 and a buffer unit 11.
- In the value cache 10, values and information related to the values are stored.
- the values that are stored in the value cache 10 are not the true values V but predictions P of the true values V.
- Each prediction P corresponds to a particular load instruction.
- the predictions are associated with an identity code, which is used to identify the prediction that corresponds to a particular load instruction.
- History data such as counter data, related to earlier predictions is also stored in the value cache 10.
- The dispatch unit 2 sends the index signal s1 to the value cache 10.
- The index signal s1 contains an index, which for instance is a hashing of the load instruction's address or a hashing of the data address, and which helps the value cache 10 to identify the prediction P associated with the load instruction to be executed.
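- A minimal sketch of such an indexed lookup follows. The table size, the XOR-folding hash and the use of the full load address as identity code are all assumptions made for illustration; the text only says that the index is some hash of the instruction or data address.

```python
NUM_ENTRIES = 256  # hypothetical table size (power of two)


def index_from_pc(load_pc: int) -> int:
    """Hash a load instruction address into a table index (assumed scheme)."""
    word = load_pc >> 2  # assume 4-byte aligned instructions
    return (word ^ (word >> 8)) & (NUM_ENTRIES - 1)


# index -> (identity code, prediction P, history counter); toy value cache
value_cache = {}


def lookup(load_pc: int):
    """Return (prediction, counter) for this load, or None if no entry matches."""
    entry = value_cache.get(index_from_pc(load_pc))
    if entry is None or entry[0] != load_pc:
        return None  # empty slot, or the slot belongs to another load
    return entry[1], entry[2]
```

- The identity check matters because two different loads can hash to the same index; without it, one load could be handed the other's prediction.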
- the value cache 10 delivers the prediction P and the history data associated with the prediction to the buffer unit 11 (signals s51 and s52).
- the buffer unit 11 has decision logic 12 that based on predetermined rules for value prediction decides whether or not to speculate. According to an embodiment of the present invention the decision logic 12 receives both history data and the cache hit signal s4 as input.
- If the decision logic 12 decides to speculate, the prediction P is delivered (signal s61) to the execution engine 8 of the processor together with a flag (signal s62) indicating the speculative nature of the value.
- the true value V is delivered from the memory system and received in the buffer unit 11.
- the true value is compared with the prediction P kept in the buffer unit. If the prediction P was correct the clear speculative flag order s2 is sent to the execution engine 8. If the prediction P was incorrect the flush order s3 is sent instead.
- the buffer unit 11 also updates the history data and predicted value if needed and sends these (signals s71 and s72) to the value cache 10 for storage so that the updated data can be used in subsequent predictions.
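- One possible update policy is sketched below. The text leaves the policy open, so the last-value replacement and the saturating confidence counter used here are assumptions, as are the dictionary keys and `counter_max`.

```python
def update_entry(entry, true_value, counter_max=3):
    """Update one value-cache entry after the true value V has arrived.

    `entry` is a dict with "prediction" and "counter" keys (hypothetical
    layout). Returns True if the stored prediction was correct.
    """
    correct = entry["prediction"] == true_value
    if correct:
        entry["counter"] = min(entry["counter"] + 1, counter_max)
    else:
        entry["counter"] = max(entry["counter"] - 1, 0)
        # Last-value policy (assumption): predict the most recent true value.
        entry["prediction"] = true_value
    return correct
```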
- the decision logic 12 is incorporated in the buffer unit 11.
- Fig. 4 shows a more detailed illustration of the buffer unit 11.
- the buffer unit has a table 13 in which identity codes ID, predictions P and history data H are stored for ongoing predictions.
- the identity code ID enables the processor to keep track of the different parts of an instruction, which during execution is scattered all over the processor 1.
- The identity code ensures that the right pair of true value V and prediction P is compared.
- In part f of the decision logic 12, a yes/no decision is made whether to predict or not based on the history data H and the cache hit signal s4.
- The part f of the logic may base its decision on any of a wide variety of rules.
- the rule may for instance be to predict if the cache hit signal s4 indicates a cache miss and a counter C of the history data H is above a predetermined threshold.
- Another example of a rule is to give weight factors to the cache hit signal s4 and the history data H and base the decision on the result of a combination of the weighted information.
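- The two example rules can be sketched as follows. The threshold, the weights and the decision level are illustrative assumptions; the text names the rules but gives no values.

```python
def rule_threshold(cache_hit: bool, counter: int, threshold: int = 2) -> bool:
    """Predict only on a cache miss with counter C above the threshold."""
    return (not cache_hit) and counter > threshold


def rule_weighted(cache_hit: bool, counter: int, counter_max: int = 3,
                  w_miss: float = 0.6, w_history: float = 0.4,
                  decision_level: float = 0.7) -> bool:
    """Combine weighted cache information and history into one decision value."""
    miss_score = 0.0 if cache_hit else 1.0
    history_score = counter / counter_max
    decision_value = w_miss * miss_score + w_history * history_score
    return decision_value >= decision_level
```

- With the assumed weights, a cache hit can never reach the decision level on its own, while a cache miss still needs some supporting history before speculation is allowed.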
- the cache hit signal s4 is received before the true value V is delivered from the memory system, but there is still some waiting time involved until the cache hit signal s4 is received.
- An alternative embodiment of the present invention is to use a cache hit prediction s4P instead of the actual cache hit signal s4 in the decision described above. This will speed up the prediction decision at the cost of some uncertainty regarding the expected cache hit. If the cache hit prediction s4P is used instead of the actual cache hit signal s4, history data H is not only stored in respect of value predictions P but also in respect of cache hit predictions s4P. In Fig. 4, the cache hit signal s4 and the cache hit prediction s4P are shown as dashed input signals to the part f of the decision logic 12. This is to indicate the above mentioned alternatives of either using the actual cache hit signal s4 or the cache hit prediction s4P, which may be included in the history data H.
- In part g of the decision logic 12, the true value V is compared to the value prediction P to decide if the prediction P was correct.
- the outcome of this decision will, as explained above, either be the flush order s3 or the clear speculative flag order s2.
- the part g of the logic is also responsible for updating the value predictions P and history data H if necessary.
- the cache hit signal s4 can, depending on the implementation, be input to the g part and used as a factor to calculate counter values or saved separately for later use in the part f of the logic. How history data H is stored and updated is the subject of many research reports, and since it is not the core of the invention it is not discussed in any greater detail herein.
- The embodiments of the present invention discussed above were described in the context of an out-of-order processor 1 equipped with a reorder buffer 7.
- The present invention is however not dependent on any particular processor architecture but can be adapted to other architectures.
- the present invention is also applicable with different types of memory configurations.
- the present invention may for instance be used in multi-level cache systems where many different prediction schemes are possible depending on in which level a cache hit is indicated.
- the rule may for instance be to inhibit prediction if a cache hit occurs in the first level of the cache or to inhibit prediction if there is a cache hit in any of the levels of the cache.
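- The two example rules for a multi-level cache can be written as simple predicates. Representing the cache outcome as the number of the level that hit, or `None` for a miss in all levels, is an assumption of this sketch.

```python
from typing import Optional


def inhibit_on_first_level_hit(hit_level: Optional[int]) -> bool:
    """Inhibit prediction only when the hit occurs in the first cache level."""
    return hit_level == 1


def inhibit_on_any_hit(hit_level: Optional[int]) -> bool:
    """Inhibit prediction when any cache level reports a hit."""
    return hit_level is not None
```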
- a cache miss is directly indicated by the address. This is the case in for instance static caches, i.e.
- a virtual memory machine is a processor wherein address calculation involves translating a virtual address into a physical address.
- the address calculation is generally supported by a translation look-aside buffer (TLB) 24.
- If the processor in Fig. 1 is assumed to be a virtual memory machine, the TLB 24 would typically reside in the Address Calculation Unit (ACU) 5 as an address translation cache.
- A miss in this cache would mean a load of page data 25 to get the physical address from memory before the request to fetch data can be sent to memory, i.e. it adds latency before the fetch to load the value can be sent to memory.
- a signal 26 of a cache hit or cache miss from the TLB may be used as a factor in the decision of whether or not to execute speculatively in the same way as a cache hit or miss signal in a value cache.
- the invention can be used together with confidence prediction data based on earlier predictions, as described above, or without.
- the different cache level hits could be combined with different confidence threshold values to produce an optimal decision based on current prediction confidence.
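- One way to combine cache level hits with confidence thresholds is a small lookup table, sketched below. The level names and threshold values are hypothetical; the idea is only that a slower source of the value justifies speculating at a lower confidence.

```python
# Hypothetical minimum confidence counter needed to speculate, per outcome.
# None means "never speculate" for that outcome.
MIN_CONFIDENCE = {
    "L1": None,   # value arrives almost immediately: gain too small
    "L2": 3,      # moderate latency: require high confidence
    "miss": 1,    # long latency: modest confidence is enough
}


def should_speculate(outcome: str, counter: int) -> bool:
    """Decide on speculation from the cache outcome and the confidence counter."""
    threshold = MIN_CONFIDENCE[outcome]
    if threshold is None:
        return False
    return counter >= threshold
```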
- the candidates for value prediction are chosen only among the values for which cache misses occur. There is then no need to store prediction statistics when cache hits are detected, thereby reducing the need for storage for value prediction history statistics.
- an embodiment of the present invention uses information regarding the dependency depth (i.e. information regarding the length of the dependency chain) to decide whether to predict a value or not.
- The decision is to only perform speculative execution based on the prediction P when the number of dependent instructions on the speculated load value is sufficiently large to motivate the risk of speculation.
- the length of the dependency chain that is suitable to qualify for value prediction depends on the size of the instruction window subjected to out-of-order execution.
- The dependency depth information may either be a prediction of the dependency depth based on dependency chain history or the "current" dependency depth that relates to the load instruction to be executed.
- The advantage of using a prediction of the dependency depth is that it is fairly simple, since the current depth may be difficult, and in many processor architectures impossible, to derive.
- a disadvantage of using a prediction is that the dependency depth may be mis-predicted which means that a certain degree of uncertainty is involved.
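- A depth-based decision using a predicted dependency depth could look like the sketch below. Using the most recently observed depth as the prediction is an assumption, and `min_depth` is a tuning parameter that would depend on the size of the instruction window.

```python
class DepthHistory:
    """Remembers the last observed dependency depth per load (assumed policy)."""

    def __init__(self):
        self.last_depth = {}

    def record(self, load_id, depth: int) -> None:
        self.last_depth[load_id] = depth

    def predict(self, load_id, default: int = 0) -> int:
        # Loads never seen before get the default depth.
        return self.last_depth.get(load_id, default)


def speculate_on_depth(predicted_depth: int, min_depth: int) -> bool:
    """Speculate only when enough instructions depend on the loaded value."""
    return predicted_depth >= min_depth
```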
- The structure used to retain out-of-order execution data, such as the reorder buffer 7, stores a speculative identity field (SID) instead of a simple speculative flag.
- the size of the SID must accommodate the number of simultaneously active speculations.
- a storage structure is indexed, which builds a value dependence chain during execution. When the speculative execution for a predicted value is finished, the depth of the value dependence chain indexed in the corresponding SID is stored to be used to influence future decisions whether to speculate or not. If other types of history data H are stored also, for use in the speculation decision, the dependency depth information may be stored together with such other history data.
- Fig. 5 shows a schematic illustration of a SID structure 15 with vectors SID0, SID1, SID2, ..., SIDn, alongside the reorder buffer (ROB) 7.
- the ROB 7 has an index called a sequence number SN.
- the sequence number is an identity code, which enables the processor to keep track of the different parts of the instruction during execution.
- the SID structure 15 is illustrated alongside the ROB 7 since the SID and the ROB are subject to the same stepping orders. However, when implemented the SID 15 does not have to be placed close to the ROB 7, but can instead e.g. be placed close to the reservation stations 3.
- When the prediction P is output from the VPU 9, the prediction is assigned a SID number corresponding to a SID vector.
- When the prediction P is stored in the reorder buffer 7, a bit is set in the assigned SID vector which corresponds to the sequence number SN of the load instruction that was predicted. Thus the bit that is set is uniquely defined by the SID number and the sequence number SN.
- When the prediction P is used to execute a subsequent instruction, the result is stored in the ROB 7 and assigned another sequence number SN.
- The sequence number of the load instruction is then called a source sequence number and the sequence number of the result of the subsequent instruction is called a destination sequence number.
- The bit in the SID vector that corresponds to the destination sequence number is also set to indicate a speculative data dependence.
- When the prediction P is verified as correct or incorrect, the speculative state is to be cleared and the SID vector that corresponds to the verified prediction is cleared so that it can be used for other predictions.
- The number of bits that are set in the vector is sent to the VPU 9. This number is the dependency depth D of the speculated load instruction and it is stored in the VPU 9 to form a basis for dependency depth predictions for future decisions whether to speculate or not with respect to the load instruction.
- The SID vectors may each be associated with compute logic for computing the dependency depth.
- Each SID vector may have a depth register 20 and an adder 21, which increments the depth register 20 for each new SID vector bit assigned.
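- The bookkeeping described above can be sketched in software as follows. This is only an illustrative model, not the hardware of the invention: the class name `SidVector` and the methods `mark` and `release` are hypothetical, and the depth register 20 and adder 21 are modelled as a plain integer counter.

```python
class SidVector:
    """Illustrative model of one SID vector: one bit per ROB entry plus a depth register."""

    def __init__(self, rob_size):
        self.bits = [False] * rob_size  # one bit per sequence number SN
        self.depth = 0                  # stands in for depth register 20

    def mark(self, sn):
        """Set the bit for sequence number sn; count it only if the bit is newly set."""
        if not self.bits[sn]:
            self.bits[sn] = True
            self.depth += 1             # stands in for adder 21

    def release(self):
        """Clear the vector once the prediction is verified and return the recorded depth."""
        depth = self.depth
        self.bits = [False] * len(self.bits)
        self.depth = 0
        return depth
```

- In this sketch, the value returned by `release` plays the role of the dependency depth D that is sent to the VPU 9 when the vector is cleared.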
- Figs. 6a-e illustrate how the dependency depth of load instructions can be registered and stored for use in future speculation decisions, as mentioned above.
- Figs. 6a-e illustrate the SID vectors SID0 and SID1 of the processor alongside a list of sequence numbers SN0-SN4 that correspond to the five entries in the reorder buffer.
- A pseudo assembly code 16 for the example processor is also illustrated in Figs. 6a-e.
- An arrow 17 shows the point of execution in the code throughout the figures.
- Fig. 6a illustrates that a load instruction has been executed speculatively and a value reg1 has been predicted.
- The bit that is set in the SID vector SID0 indicates the speculative execution of the load instruction.
- Fig. 6b illustrates that a subsequent add instruction has been executed, where the predicted value reg1 was used.
- The value reg1 was retrieved from the reorder buffer together with its sequence number SN0.
- An indication of a speculation dependency was found in vector SID0 for sequence number SN0, which means that the speculation dependency also exists for the result of the add instruction, reg2, associated with sequence number SN1.
- A bit is set in vector SID0 in the position that corresponds to sequence number SN1.
- The next instruction that is executed is another load instruction, for a value reg4, as shown in Fig. 6c.
- This load instruction is also subject to value prediction.
- The SID vector SID1 is used to keep track of the dependency of the speculative value reg4, as shown in the figure.
- The next instruction, a multiplication, depends on the speculative value reg4. This is detected when the speculative value reg4 is delivered from the reorder buffer together with the sequence number SN2. Since a bit is set for sequence number SN2 in vector SID1, a bit is also set in SID1 for sequence number SN3, as illustrated in Fig. 6d.
- The result reg5 of the multiplication instruction is stored in the reorder buffer and associated with sequence number SN3.
- The last instruction that is illustrated in this example is a subtract instruction.
- Fig. 6e shows the status of the SID vectors after this instruction has been executed. The subtract instruction depends on two values, reg5 and reg2, and thereby on both of the earlier predictions.
- The speculative bit set for SN3 in the vector SID1 is detected, and the SN4 entry in the vector SID1 is set.
- The speculative bit for SN1 that is set in vector SID0 is detected in the reservation station, and the bit in SID0 that corresponds to sequence number SN4 is set.
- Each of the vectors SID0 and SID1 is associated with a depth register that is incremented during the speculative execution.
- When the true value for reg1 is delivered, the depth register associated with the vector SID0 will contain a number, which is taken as the dependency depth for this prediction.
- The dependency depth is delivered into the value cache 10 when the vector SID0 is cleared.
- The dependency depth is stored and used to produce the dependency depth prediction, which is later used as a factor in future decisions on whether to speculate.
- The result of the load instruction can be discarded from the reorder buffer and written to the register file.
- The instructions that are marked as speculation dependent will be cleared as speculative as the associated SID vector is released (i.e. cleared of bits). The released SID vector is free to be used for subsequent value speculations.
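- The walk-through of Figs. 6a-e can be replayed with a small simulation. The sketch below assumes only the dependencies named in the text (the add uses reg1, the multiplication uses reg4, the subtract uses reg5 and reg2); the tuple layout of `PROGRAM` and the function names are hypothetical, and each SID vector is modelled as an integer bitmask over sequence numbers SN0-SN4.

```python
# Each entry: (sequence number, SID allocated if the load is predicted else None,
#              sequence numbers of the source values read from the ROB)
PROGRAM = [
    (0, 0, []),         # load reg1, predicted, tracked in SID0
    (1, None, [0]),     # add producing reg2, reads reg1 (SN0)
    (2, 1, []),         # load reg4, predicted, tracked in SID1
    (3, None, [2]),     # multiply producing reg5, reads reg4 (SN2)
    (4, None, [3, 1]),  # subtract, reads reg5 (SN3) and reg2 (SN1)
]


def run_example(program=PROGRAM, num_sids=2):
    """Return the SID bitmasks after executing the program in order."""
    sid = [0] * num_sids
    for sn, new_sid, sources in program:
        if new_sid is not None:
            sid[new_sid] |= 1 << sn          # the predicted load marks its own entry
        for k in range(num_sids):
            # A set bit for any source SN propagates to the destination SN.
            if any((sid[k] >> src) & 1 for src in sources):
                sid[k] |= 1 << sn
    return sid


def depth(mask):
    """Dependency depth = number of bits set in the SID vector."""
    return bin(mask).count("1")
```

- Under these assumptions, SID0 ends with bits set for SN0, SN1 and SN4, and SID1 with bits set for SN2, SN3 and SN4, so both predictions have a dependency depth of three when counted as the number of set bits.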
- A trace processor in which this scheme may be implemented is illustrated in "Processor Architectures", Ch. 5.2, page 228, J. Silc, B. Robic, T. Ungerer, Springer Verlag, ISBN 3-540-64798-8.
- This trace processor includes a fill unit, which is where the trace is constructed. Augmenting this unit with dependency depth extraction logic will enable each delivered load instruction to carry with it the number of instructions dependent on the load value (in the trace). Thereby, when a trace and its load instructions are delivered to the issue step, each load instruction's dependency depth is also delivered. The actual dependency depth of a load instruction is thus delivered together with the load instruction to the issue step if the trace is executed as constructed.
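- One way such dependency depth extraction could work is sketched below. This is an assumption-laden illustration, not the fill unit logic itself: the trace is modelled as a list of hypothetical `(opcode, destination, sources)` tuples, the register names reg3, reg6 and reg7 are invented for the example, and the depth counts only the instructions that depend on the load, not the load itself.

```python
def trace_dependency_depths(trace):
    """Map each load's index in the trace to the number of later dependent instructions."""
    depths = {}
    for i, (op, dest, _sources) in enumerate(trace):
        if op != "load":
            continue
        tainted = {dest}  # registers currently holding values derived from this load
        count = 0
        for _later_op, later_dest, later_sources in trace[i + 1:]:
            if tainted & set(later_sources):
                count += 1
                tainted.add(later_dest)      # the result is derived from the load
            else:
                tainted.discard(later_dest)  # overwritten by an independent value
        depths[i] = count
    return depths


EXAMPLE_TRACE = [
    ("load", "reg1", []),
    ("add", "reg2", ["reg1", "reg3"]),
    ("load", "reg4", []),
    ("mul", "reg5", ["reg4", "reg6"]),
    ("sub", "reg7", ["reg5", "reg2"]),
]
```

- For `EXAMPLE_TRACE`, each load has two dependent instructions, so each would be delivered to the issue step with a depth of two under this counting convention.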
- The different embodiments of the invention focus on different indicators: the number of instructions dependent on a load instruction, and cache hit or cache miss.
- The above-mentioned prior art methods for selective value speculation based on instruction dependency predictions capture past dynamic behavior and are rather complex.
- These prior art methods suffer from the interaction of cache fetch actions.
- A long dynamic dependency chain might not be long the next time around.
- The contents of a cache might be different from time to time.
- The embodiments of the present invention are rather simple, but may still be very reliable. Many additional advantages may be derived from combinations of the "basic" embodiments of the present invention. If, for instance, the cache hit scheme is combined with the dependency depth prediction scheme, it will only be decided to base execution on a value prediction when the load latency is long and the instruction window contains a large number of instructions dependent on the value to be loaded. The combination will add dynamics to the dependency depth prediction scheme and static use-information to the cache hit scheme. It will also use actual memory latency information, not just predicted information.
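- The combined scheme can be expressed as a simple predicate. The sketch below is a minimal illustration; the function name `should_speculate` and the parameter `depth_threshold` are hypothetical, and the threshold value is an arbitrary placeholder rather than a figure taken from this description.

```python
def should_speculate(cache_hit, dependency_depth, depth_threshold=2):
    """Speculate only on a cache miss (long load latency) with enough dependent instructions.

    cache_hit: actual hit/miss signal from the memory system for this load.
    dependency_depth: predicted or current number of instructions depending on the load.
    depth_threshold: assumed tuning parameter, not specified by the source.
    """
    long_latency = not cache_hit
    many_dependents = dependency_depth >= depth_threshold
    return long_latency and many_dependents
```

- With the placeholder threshold, a load that misses in the cache and has three dependent instructions would be speculated on, whereas the same load hitting in the cache would not.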
- The "basic" embodiments of the present invention may be sorted into different classes.
- One way of classifying the embodiments is according to which part of the processor the information used in the speculation decision originates from.
- Information relating to cache hit or miss signals originates from the memory system, and information regarding dependency depth originates from the execution engine.
- Another way of classifying the schemes is according to the point in time when the information to be used is collected.
- The embodiments above that use predictions based on historical data use information from the past, while an unpredicted cache hit or miss signal or a current dependency depth is information from the present.
- Each class of schemes has its strengths and weaknesses. By creating schemes that are combinations of different classes, the strengths associated with one class may be used to counteract the weaknesses of other classes.
- The major advantage of the present invention is that value prediction can be avoided in situations where it might seem unwise due to the risk involved.
- The present invention makes it possible to make an informed speculation decision by basing the decision not only on the success rate of previous predictions, but also on the estimated gain from a correct prediction. Avoiding value prediction when the estimated gain of a correct prediction is small compared to the risk involved improves the performance of the processor.
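- The trade-off between estimated gain and risk can be illustrated with a simple expected-value model. This cost model is an assumption made here for illustration and is not stated in the description: `success_rate` stands for the observed accuracy of earlier predictions, `gain_cycles` for the cycles saved by a correct prediction, and `recovery_cycles` for the cycles lost when a misprediction must be recovered from.

```python
def expected_benefit(success_rate, gain_cycles, recovery_cycles):
    """Expected cycles saved per speculation under the assumed linear cost model."""
    return success_rate * gain_cycles - (1.0 - success_rate) * recovery_cycles


def worth_speculating(success_rate, gain_cycles, recovery_cycles):
    """Speculate only when the expected benefit is positive."""
    return expected_benefit(success_rate, gain_cycles, recovery_cycles) > 0
```

- Under this model, a predictor that is right 90 % of the time is still not worth using when a correct prediction saves only 2 cycles and a misprediction costs 30, since 0.9 × 2 is less than 0.1 × 30.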
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP02712589A EP1421477A1 (en) | 2001-07-19 | 2002-02-21 | Value prediction in a processor for providing speculative execution |
US10/484,195 US7243218B2 (en) | 2001-07-19 | 2002-02-21 | Method and processing unit for selective value prediction using data cache hit/miss information and/or dependency depth information |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SE0102564A SE0102564D0 (en) | 2001-07-19 | 2001-07-19 | Arrangement and method in computor system |
SE0102564-2 | 2001-07-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2003009134A1 (en) | 2003-01-30 |
Family
ID=20284893
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/SE2002/000298 WO2003009134A1 (en) | 2001-07-19 | 2002-02-21 | Value prediction in a processor for providing speculative execution |
Country Status (4)
Country | Link |
---|---|
US (1) | US7243218B2 (en) |
EP (1) | EP1421477A1 (en) |
SE (1) | SE0102564D0 (en) |
WO (1) | WO2003009134A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007099421A2 (en) * | 2006-02-28 | 2007-09-07 | Nokia Corporation | Cache feature in electronic devices |
GB2461902A (en) * | 2008-07-16 | 2010-01-20 | Advanced Risc Mach Ltd | Method for improving the performance of a processor by tuning the thread speculate mechanisms of the processor. |
WO2021223879A1 (en) * | 2020-05-08 | 2021-11-11 | Huawei Technologies Co., Ltd. | Processing device for a parallel computing system and method for performing collective operations |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040154010A1 (en) * | 2003-01-31 | 2004-08-05 | Pedro Marcuello | Control-quasi-independent-points guided speculative multithreading |
US7219185B2 (en) * | 2004-04-22 | 2007-05-15 | International Business Machines Corporation | Apparatus and method for selecting instructions for execution based on bank prediction of a multi-bank cache |
US7747841B2 (en) * | 2005-09-26 | 2010-06-29 | Cornell Research Foundation, Inc. | Method and apparatus for early load retirement in a processor system |
US7856548B1 (en) * | 2006-12-26 | 2010-12-21 | Oracle America, Inc. | Prediction of data values read from memory by a microprocessor using a dynamic confidence threshold |
US7788473B1 (en) * | 2006-12-26 | 2010-08-31 | Oracle America, Inc. | Prediction of data values read from memory by a microprocessor using the storage destination of a load operation |
TWI334571B (en) * | 2007-02-16 | 2010-12-11 | Via Tech Inc | Program instruction rearrangement methods |
US8683129B2 (en) * | 2010-10-21 | 2014-03-25 | Oracle International Corporation | Using speculative cache requests to reduce cache miss delays |
US9690623B2 (en) * | 2015-11-06 | 2017-06-27 | International Business Machines Corporation | Regulating hardware speculative processing around a transaction |
US10324727B2 (en) * | 2016-08-17 | 2019-06-18 | Arm Limited | Memory dependence prediction |
US10296465B2 (en) * | 2016-11-29 | 2019-05-21 | Board Of Regents, The University Of Texas System | Processor using a level 3 translation lookaside buffer implemented in off-chip or die-stacked dynamic random-access memory |
US10282129B1 (en) * | 2017-10-24 | 2019-05-07 | Bottomline Technologies (De), Inc. | Tenant aware, variable length, deduplication of stored data |
US10620962B2 (en) * | 2018-07-02 | 2020-04-14 | Arm Limited | Appratus and method for using predicted result values |
US10990393B1 (en) | 2019-10-21 | 2021-04-27 | Advanced Micro Devices, Inc. | Address-based filtering for load/store speculation |
US11366668B1 (en) * | 2020-12-08 | 2022-06-21 | Arm Limited | Method and apparatus for comparing predicated load value with masked load value |
KR20240023850A (en) * | 2022-08-16 | 2024-02-23 | 삼성전자주식회사 | Memory device, memory system and method for operating memory system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5781752A (en) * | 1996-12-26 | 1998-07-14 | Wisconsin Alumni Research Foundation | Table based data speculation circuit for parallel processing computer |
US5996060A (en) * | 1997-09-25 | 1999-11-30 | Technion Research And Development Foundation Ltd. | System and method for concurrent processing |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH07506921A (en) | 1992-03-06 | 1995-07-27 | ランバス・インコーポレーテッド | Cache prefetching to minimize main memory access time and cache memory size in computer systems |
US5933860A (en) * | 1995-02-10 | 1999-08-03 | Digital Equipment Corporation | Multiprobe instruction cache with instruction-based probe hint generation and training whereby the cache bank or way to be accessed next is predicted |
US5966544A (en) | 1996-11-13 | 1999-10-12 | Intel Corporation | Data speculatable processor having reply architecture |
JP3550092B2 (en) | 1998-12-10 | 2004-08-04 | 富士通株式会社 | Cache device and control method |
US6487639B1 (en) * | 1999-01-19 | 2002-11-26 | International Business Machines Corporation | Data cache miss lookaside buffer and method thereof |
2001
- 2001-07-19 SE SE0102564A, patent SE0102564D0 (en), status: unknown
2002
- 2002-02-21 US US10/484,195, patent US7243218B2 (en), status: not active, Expired - Fee Related
- 2002-02-21 EP EP02712589A, patent EP1421477A1 (en), status: not active, Withdrawn
- 2002-02-21 WO PCT/SE2002/000298, patent WO2003009134A1 (en), status: not active, Application Discontinuation
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007099421A2 (en) * | 2006-02-28 | 2007-09-07 | Nokia Corporation | Cache feature in electronic devices |
WO2007099421A3 (en) * | 2006-02-28 | 2007-12-06 | Nokia Corp | Cache feature in electronic devices |
GB2461902A (en) * | 2008-07-16 | 2010-01-20 | Advanced Risc Mach Ltd | Method for improving the performance of a processor by tuning the thread speculate mechanisms of the processor. |
GB2461902B (en) * | 2008-07-16 | 2012-07-11 | Advanced Risc Mach Ltd | A Method and apparatus for tuning a processor to improve its performance |
US9870230B2 (en) | 2008-07-16 | 2018-01-16 | Arm Limited | Method and apparatus for tuning a processor to improve its performance |
WO2021223879A1 (en) * | 2020-05-08 | 2021-11-11 | Huawei Technologies Co., Ltd. | Processing device for a parallel computing system and method for performing collective operations |
Also Published As
Publication number | Publication date |
---|---|
SE0102564D0 (en) | 2001-07-19 |
EP1421477A1 (en) | 2004-05-26 |
US7243218B2 (en) | 2007-07-10 |
US20040199752A1 (en) | 2004-10-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10409605B2 (en) | System and method for using a branch mis-prediction buffer | |
US7243218B2 (en) | Method and processing unit for selective value prediction using data cache hit/miss information and/or dependency depth information | |
US7870369B1 (en) | Abort prioritization in a trace-based processor | |
JP5137948B2 (en) | Storage of local and global branch prediction information | |
US6697932B1 (en) | System and method for early resolution of low confidence branches and safe data cache accesses | |
US7032101B2 (en) | Method and apparatus for prioritized instruction issue queue in a processor | |
US7747841B2 (en) | Method and apparatus for early load retirement in a processor system | |
US5860017A (en) | Processor and method for speculatively executing instructions from multiple instruction streams indicated by a branch instruction | |
US11099850B2 (en) | Branch prediction circuitry comprising a return address prediction structure and a branch target buffer structure | |
US10817298B2 (en) | Shortcut path for a branch target buffer | |
US7844807B2 (en) | Branch target address cache storing direct predictions | |
US10620962B2 (en) | Appratus and method for using predicted result values | |
KR20210058812A (en) | Apparatus and method of prediction of source operand values, and optimization processing of instructions | |
CN110402434B (en) | Cache miss thread balancing | |
US20040225866A1 (en) | Branch prediction in a data processing system | |
US10922082B2 (en) | Branch predictor | |
US10430342B2 (en) | Optimizing thread selection at fetch, select, and commit stages of processor core pipeline | |
US6738897B1 (en) | Incorporating local branch history when predicting multiple conditional branch outcomes | |
US10732980B2 (en) | Apparatus and method for controlling use of a register cache | |
US9858075B2 (en) | Run-time code parallelization with independent speculative committing of instructions per segment | |
US7783863B1 (en) | Graceful degradation in a trace-based processor | |
US7343481B2 (en) | Branch prediction in a data processing system utilizing a cache of previous static predictions | |
JP2024055031A (en) | Processing device and processing method | |
US6948055B1 (en) | Accuracy of multiple branch prediction schemes | |
Prémillieu | Microarchitecture exploration of control flow reconvergence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG US. Kind code of ref document: A1; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
| | WWE | Wipo information: entry into national phase | Ref document number: 10484195; Country of ref document: US |
| | REEP | Request for entry into the european phase | Ref document number: 2002712589; Country of ref document: EP |
| | WWE | Wipo information: entry into national phase | Ref document number: 2002712589; Country of ref document: EP |
| | WWP | Wipo information: published in national office | Ref document number: 2002712589; Country of ref document: EP |
| | REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
| | NENP | Non-entry into the national phase | Ref country code: JP |
| | WWW | Wipo information: withdrawn in national office | Ref document number: JP |