TRANSITIVE SUPPRESSION OF INSTRUCTION REPLAY
1. Field of the Invention
This invention is related to processors and, more particularly, to instruction replay mechanisms in processors.
2. Description of the Related Art
Managing power consumption in processors is increasingly becoming a priority. In many systems, the power supply is at least sometimes a battery or other stored-charge supply. Maximizing battery life in such systems is often a key selling feature. Additionally, even in systems that have effectively limitless power (e.g. systems plugged into a wall outlet), the challenges of cooling the processors and other circuits in the system may be reduced if power consumption can be reduced in the processors.
Some processors implement replay, in which an instruction (or instruction operation) is issued for execution and, during execution, a condition is detected that causes the instruction to be reissued at a later time. Instructions can also be replayed if a preceding instruction is replayed (particularly if they depend on the replayed instruction). If an instruction is replayed due to a condition that may take some time to clear, it is likely that the instruction will be issued and replayed repeatedly until the condition is cleared. The power consumed in issuing the instruction, only to have it replayed, is wasted.
Furthermore, performance is impacted since the replayed instructions occupy issue slots that could otherwise be occupied by instructions that would not be replayed. This can lead to power/performance variability on a workload-specific basis, which is undesirable. Still further, extensive replay scenarios complicate verification of the processor, increasing the likelihood that bugs will pass into the fabricated design.
SUMMARY

In one embodiment, a processor comprises one or more execution resources configured to execute instruction operations and a scheduler coupled to the execution resources. The scheduler is configured to maintain an ancestor tracking vector (ATV) corresponding to each given instruction operation in the scheduler, wherein the ATV identifies instruction operations which can cause the given instruction operation to replay. The scheduler is configured to set the ATV of the given instruction operation to a null value in response to the given instruction operation being dispatched to the scheduler, and is configured to create the ATV of the given instruction operation dynamically as source operands of the given instruction operation are resolved.
In one implementation, the scheduler comprises a buffer comprising a plurality of entries, wherein each entry of the plurality of entries is configured to store one or more source tags corresponding to source operands of a different instruction operation in the scheduler. The scheduler also comprises an ATV buffer comprising a second plurality of entries, wherein each entry of the second plurality of entries is configured to store an ATV corresponding to a given instruction operation in the scheduler. The ATV identifies instruction operations which can cause the given instruction operation to replay. Coupled to each entry of the second plurality of entries, logic is configured to set the ATV of the given instruction operation to a null value in response to the given instruction operation being dispatched to the scheduler, and is configured to dynamically create the ATV of the given instruction operation as source operands of the given instruction operation are resolved.
In an embodiment, a method comprises dispatching an instruction operation to a scheduler; setting an ancestor tracking vector (ATV) corresponding to the instruction operation to a null value responsive to the dispatching; and dynamically updating the ATV with an ATV corresponding to an executed instruction operation if the executed instruction operation resolves a source operand of the instruction operation.
In another embodiment, a processor comprises one or more execution resources configured to execute instruction operations; a scheduler coupled to the execution resources; and an ATV assignment unit. The scheduler is configured to maintain an ATV corresponding to each given instruction operation in the scheduler, wherein the ATV identifies instruction operations which can cause the given instruction operation to replay within a replay window. The ATV assignment unit is configured to assign an ATV token to an executing instruction operation that can originate a replay chain. The ATV token uniquely identifies the instruction operation with regard to other instruction operations within the replay window that can originate a replay chain.
BRIEF DESCRIPTION OF THE DRAWINGS
The following detailed description makes reference to the accompanying drawings, which are now briefly described.
FIG. 1 is a block diagram of one embodiment of a processor.
FIG. 2 is a pipeline diagram illustrating a portion of one embodiment of a pipeline.
FIG. 3 is a table illustrating various events in one embodiment of a processor and one embodiment of a result from those events.
FIG. 4 is an example of several instructions and the generation of ancestor tracking vectors (ATVs) for the instructions.
FIG. 5 is a flowchart illustrating one embodiment of ATV generation and use.
FIG. 6 is a block diagram of one embodiment of a computer system.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
DETAILED DESCRIPTION OF EMBODIMENTS
Turning now to FIG. 1, a block diagram of one embodiment of a portion of a processor 10 is shown. In the illustrated embodiment, the processor 10 includes an instruction cache 12, a fetch/decode unit 14, a scheduler 16, a physical register file (PRF) 18, an execution unit (EXU) 20, an address generation unit (AGU) 22, a data cache 24, an ancestor tracking vector (ATV) assign unit 26, and an ATV register 28. The instruction cache 12 is coupled to the fetch/decode unit 14, which is coupled to the scheduler 16. The scheduler 16 is further coupled to the register file 18, the EXU 20, the AGU 22, and the data cache 24. The AGU 22 is coupled to the data cache 24 and the ATV register 28, which is further coupled to the ATV assign unit 26.
In the illustrated embodiment, the scheduler 16 comprises a source buffer 30, an ATV buffer 32, ATV qualifying logic 34, and pick logic 36. The source buffer 30 is coupled to the ATV buffer 32, the ATV qualifying logic 34, and the pick logic 36. The source buffer 30 comprises a plurality of entries such as entry 38 and corresponding per entry logic 40 coupled thereto. The ATV buffer 32 is coupled to the ATV qualifying logic 34 and the pick logic 36, and the ATV buffer 32 comprises a plurality of entries such as entry 42 and corresponding per entry logic 44 coupled thereto.
The scheduler 16 may be configured to maintain an ATV for each instruction operation in the scheduler. The ATV for a given instruction operation identifies preceding instruction operations in the scheduler which can directly cause replay and on which the given instruction operation depends, either directly or indirectly, for a source operand. Instruction operations which can directly cause replay include instruction operations which can experience data misspeculation, for example. Load instruction operations (or more briefly, "loads") can experience data misspeculation. For example, loads may be speculated to hit in the data cache 24, and dependent instruction operations may be scheduled presuming that the load data will be available at a clock cycle consistent with a cache hit. Data may be forwarded from the data cache 24 prior to detecting the hit, in some embodiments, which may allow data to propagate to subsequent instruction operations that are indirectly dependent on the loads through the intermediate instruction operations that use the load result and generate an inaccurate result themselves. Other conditions besides a cache miss may cause data misspeculation as well, as described in more detail below. Instruction operations which can directly cause replay may also be referred to as instruction operations which can originate a replay chain. A replay chain may be a set of instruction operations that replay, directly or indirectly, due to the same event (such as a data misspeculation for a load). For example, instruction operations that are directly or indirectly dependent on the load data may be part of the replay chain.
The ATV for each instruction operation may be set to a null value, indicating no preceding instructions which can cause replay, when the instruction operation is dispatched into the scheduler to await scheduling and issuance. The ATV may be dynamically generated as instruction operations are scheduled and dependencies for source operands are resolved. The ATV may thus be made small compared to the number of instructions that may be in the processor pipeline. That is, the ATV may be sized to cover those instruction operations that can directly cause a replay to occur (e.g. loads) and that can be in the pipeline between the point in the pipeline at which the instruction operation indicates to the scheduler that dependent instruction operations can be scheduled (e.g. via a broadcast of a tag that identifies the destination register of the instruction operation) and the point in the pipeline at which the replay event (e.g. data misspeculation) is signaled. Since the ATV is relatively small, the hardware cost may be relatively small, and the hardware may be more power efficient than may be possible with a larger ATV.
Furthermore, the ATV may be transitive. That is, once a given load is resolved (either misspeculated or not), the ATVs may be updated to remove that load's representation in the ATV. If the load is replayed, the ATV may be updated again to reflect the load (and in fact the ATV token assigned to the load may be different for the replay). Thus, the complicated bookkeeping that is often associated with tagging loads with a fixed ATV token for their entire lifetime up to retirement may be avoided, in some embodiments. While various embodiments may track any instruction operation that can directly cause a replay in this fashion, the remainder of the discussion will use loads as an example.
The ATV may be used to suppress requests for scheduling by instructions that are dependent on a load that has bad status (e.g. data misspeculation has occurred), thus preventing replay of those operations until the previous load executes correctly. Thus, power may be conserved and performance may be improved by scheduling instructions which have a higher probability of not replaying, in some embodiments.
Generally, the ATV may comprise one indication for each possible load that can be in flight between the tag broadcast stage and the status broadcast stage, at which replay events are identified by broadcasting status of the load. In one embodiment, each indication in the ATV of a given instruction operation may be a bit that indicates, when set, that the given instruction operation is directly or indirectly dependent on the load that is assigned to that bit in the ATV. When the bit is clear, the given instruction operation is not dependent on the load, the dependency has not yet been detected, or the dependency has been resolved via the status broadcast of the load. Thus, the ATV may be a bit vector in such an embodiment. The null value of the ATV may be the value which indicates no dependencies on instruction operations which can replay. Thus, for the bit vector example, a bit vector with the bits all set to zero may be the null value. This bit vector will be used as an example for the embodiments described herein, although other embodiments may use the opposite meanings for the set and clear states of the bit or other indications.
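For illustration only, the bit-vector form of the ATV described above can be modeled as shown below. This is a minimal C sketch and not part of the disclosure; the four-bit width, the type name, and the helper function are assumptions chosen for the example.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative width: one bit per load that can be in flight between the
 * tag broadcast stage and the status broadcast stage (assumed to be four). */
#define ATV_WIDTH 4

typedef uint8_t atv_t;          /* one bit per trackable load              */
#define ATV_NULL ((atv_t)0)     /* null value: no replay-capable ancestors */

/* True if the stored ATV indicates a direct or indirect dependence on the
 * load identified by the given one-hot ATV token. */
static bool atv_depends_on(atv_t stored_atv, atv_t one_hot_token) {
    return (stored_atv & one_hot_token) != 0;
}
```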
The ATV assign unit 26 may be configured to assign ATV tokens to instruction operations that can directly cause replay (e.g. loads). The ATV token may uniquely identify the corresponding load within the ATV. For a bit vector as mentioned above, the ATV token may be a vector of equal length to the ATV, and may be one-hot encoded. Each load may be assigned a different one-hot token. Since ATVs are maintained transitively, the association of a given load and a given ATV token ends when the status of the load is broadcast. Thus, tokens may automatically be recycled. The ATV assign unit 26 may detect that a load has been scheduled and issued to the AGU 22, and may assign the ATV token in the ATV register 28 to the load. The ATV assign unit 26 may then cause the ATV register 28 to update to the next ATV token. For example, the ATV register 28 may be initialized to all binary zeros except a binary one in the least significant bit. Each time an ATV token is assigned, the ATV assign unit 26 may trigger the ATV register 28 to left shift by one bit, creating the next token. The most significant bit of the ATV register 28 wraps around to the least significant bit to automatically reuse the first ATV token after the last ATV token is assigned.
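The token assignment and the left-shift-with-wrap-around behavior of the ATV register 28 described above can be sketched as follows; the four-bit width and the names are illustrative assumptions.

```c
#include <stdint.h>

#define ATV_WIDTH 4
typedef uint8_t atv_t;

/* Models the ATV register 28: initialized one-hot in the least significant
 * bit and rotated left by one each time a token is handed out, so tokens
 * recycle automatically after the last one is assigned. */
static atv_t atv_reg = 0x1;

/* Called when a load is scheduled and issued to the AGU: returns the token
 * assigned to that load and advances the register to the next token. */
static atv_t assign_atv_token(void) {
    atv_t token = atv_reg;
    /* Left shift by one; the MSB wraps around to the LSB. */
    atv_reg = (atv_t)(((atv_reg << 1) | (atv_reg >> (ATV_WIDTH - 1)))
                      & ((1u << ATV_WIDTH) - 1));
    return token;
}
```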
The general flow of instructions/instruction operations in the processor 10 will next be described, to provide context for the details of one embodiment of the scheduler 16. The fetch/ decode unit 14 may fetch instructions from the instruction cache 12 and decode them into instruction operations for the scheduler 16. The fetch/decode unit 14 may implement branch prediction to speculatively fetch down a given path in the code being executed. In some embodiments, the processor 10 may implement register renaming to rename the architectural registers to the physical registers in the register file 18. If so, the fetch/decode unit 14 may perform the renaming also.
The scheduler 16 receives the instruction operations dispatched by the fetch/decode unit 14, and may monitor source operands of a given instruction operation to determine when it can be scheduled. The scheduler 16 may schedule the instruction operation, but may retain the instruction operation
in case a replay event is detected. Generally, replay may comprise any mechanism which, in response to a replay event that indicates that the instruction may not have produced a correct result in execution, permits that instruction operation to be re-executed without refetching the instruction (and subsequent instructions in program order) from the instruction cache and/or memory. The scheduler 16 may be a centralized buffer which schedules all instructions, or may be distributed to execution resources (e.g. reservation stations). Scheduled instruction operations are transmitted to the EXU 20 or the AGU 22, in this embodiment.
The EXU 20 may comprise circuitry to execute arithmetic, logic, shift, and other non-memory operations. Specifically, in one embodiment, the EXU 20 may be configured to execute integer operations. Floating point operations may be executed in a floating point unit (not shown). The EXU 20 may receive source operands from the register file 18, the operation to execute from the scheduler 16, and the ATV of the operation from the scheduler 16 as well. As mentioned previously, operand forwarding may also be supported via an operand forwarding network (not shown). The EXU 20 may broadcast the tag of the instruction operation (which identifies the destination of the instruction operation in the register file 18 and thus can be compared to the source operands) to the scheduler 16 so that dependent operations may be scheduled and may receive the execution result. Additionally, the EXU 20 may broadcast the ATV of the operation to the scheduler 16 so that the ATVs of dependent operations may be updated. Similarly, the data cache 24 may broadcast tags and ATVs of memory operations being executed ("Broadcast ATVs" in FIG. 1 from both the EXU 20 and the AGU 22). The AGU 22 may receive operands and the memory operation, and may generate the address of the memory location accessed by the load/store operation. The address is provided to the data cache 24 for access.
The data cache 24 is configured to determine if a load operation hits in the cache, and is configured to transmit status indicating whether the data speculation that was performed to forward the data for the operation was correct. The status may indicate bad (data speculation incorrect) or good (data speculation correct). Additionally, the status ATV may be broadcast with the status ("Status, Status ATVs" in FIG. 1). The status ATV may be the ATV token assigned to the load (one-hot encoded). Data speculation may be incorrect if the load misses in the cache, or if translation is enabled and a translation lookaside buffer (TLB) miss is detected. Additionally, data speculation may be incorrect if the load hits a store in a store queue (shown in the data cache block 24 in FIG. 1, although the store queue may be physically separate from the data cache 24) and the store data cannot be forwarded to satisfy the load. For example, the store data may not have been provided yet, or the store may not update all of the bytes accessed by the load (and thus some bytes from the store queue and some bytes from the cache or memory are needed to complete the load).
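For illustration, the replay conditions listed above can be summarized in a single predicate. The structure and flag names below are hypothetical; the sketch simply restates the conditions in this paragraph (cache miss, TLB miss with translation enabled, and a store queue hit that cannot be forwarded).

```c
#include <stdbool.h>

/* Illustrative summary of the outcomes produced by the data cache 24, the
 * TLB, and the store queue for an executing load. */
struct load_exec_result {
    bool cache_hit;            /* load hit in the data cache               */
    bool translation_enabled;  /* address translation in use               */
    bool tlb_hit;              /* TLB hit (meaningful only if enabled)     */
    bool hits_store_queue;     /* load address overlaps a queued store     */
    bool store_forwardable;    /* store data complete and forwardable      */
};

/* Returns true if the data speculation was correct ("good" status), false
 * if any of the listed conditions makes the speculation incorrect ("bad"). */
static bool load_status_good(const struct load_exec_result *r) {
    if (!r->cache_hit)
        return false;                               /* cache miss           */
    if (r->translation_enabled && !r->tlb_hit)
        return false;                               /* TLB miss             */
    if (r->hits_store_queue && !r->store_forwardable)
        return false;                               /* non-forwardable hit  */
    return true;
}
```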
In the illustrated embodiment, the scheduler includes the source buffer 30 to store the source register addresses for the source operands of each instruction operation and the ATV buffer 32 to store the corresponding ATVs. That is, each instruction operation in the scheduler 16 may be assigned an entry in the source buffer 30 and the corresponding entry in the ATV buffer 32. An additional buffer may store other information, such as the instruction operation itself, or that information may also be stored in the source buffer 30.
An exemplary entry 38 is shown in the source buffer 30, and may include one or more source register addresses (e.g. up to four source addresses for a given instruction operation, labeled SRC1 to SRC4, although other embodiments may have more or fewer source operands per instruction operation). Additionally, a matched-previously (MP) bit may be maintained for each source operand, indicating that the source has previously matched a tag and thus is resolved. Once a given instruction operation's source operands have all been resolved, the instruction operation may request scheduling. The per entry logic 40 may detect that the instruction operation in entry 38 has resolved its sources and may generate a request to schedule (e.g. Raw_Req[i] in FIG. 1, for entry 38). More particularly, in one embodiment, the source register address fields in the entry may comprise content addressable memory (CAM), and a match may be detected using the CAM to compare between a tag broadcast from the execution resources and the stored register address. The per entry logic may detect that all source operands are resolved to make the request. The MP bit may also be set when the match is detected. If an instruction operation has been scheduled, the picked ("P") bit may be set to prevent subsequent requests for that instruction operation. Thus, a request may be made if all source operands have been resolved and the instruction operation has not been previously picked. The per entry logic 40 may be replicated for each entry in the source buffer 30.
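The behavior of the per entry logic 40 described above can be sketched in C as follows; an equality compare stands in for the CAM match, and the field and function names are illustrative assumptions rather than elements of this disclosure.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_SRCS 4                  /* SRC1..SRC4 in the exemplary entry 38 */

struct src_entry {
    uint8_t src_tag[MAX_SRCS];      /* source register addresses            */
    bool    src_valid[MAX_SRCS];    /* source actually used by this op      */
    bool    mp[MAX_SRCS];           /* matched-previously: source resolved  */
    bool    picked;                 /* "P" bit: op already scheduled        */
};

/* Tag broadcast: the CAM compare is modeled as an equality check. Sets the
 * MP bit for any source whose register address matches the broadcast tag,
 * and returns true if any source matched (the Tag_Match[i] signal). */
static bool entry_tag_broadcast(struct src_entry *e, uint8_t broadcast_tag) {
    bool matched = false;
    for (int s = 0; s < MAX_SRCS; s++) {
        if (e->src_valid[s] && e->src_tag[s] == broadcast_tag) {
            e->mp[s] = true;
            matched = true;
        }
    }
    return matched;
}

/* Raw_Req[i]: request scheduling once every used source has resolved and
 * the instruction operation has not already been picked. */
static bool entry_raw_request(const struct src_entry *e) {
    if (e->picked)
        return false;
    for (int s = 0; s < MAX_SRCS; s++) {
        if (e->src_valid[s] && !e->mp[s])
            return false;
    }
    return true;
}
```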
The request from each entry of the source buffer 30 is shown as the Raw_Req[0 . . . n] signal, for an n+1 entry scheduler 16. That is, an n+1 entry scheduler 16 may include n+1 entries similar to entry 38 in the source buffer 30, and n+1 entries similar to the entry 42 in the ATV buffer 32. The source buffer 30 may output a tag match signal for each entry (Tag_Match[0 . . . n]) indicating that a tag match has been detected. The ATV buffer 32 may receive the tag match signals to update ATVs in the ATV buffer 32 with the broadcast ATVs. The broadcast ATVs are provided by the execution resources at the same time the tag broadcast occurs. Each entry that is matched by the broadcast tag is updated to include the broadcast ATV (e.g. the broadcast ATV may be logically ORed with the ATV stored in the entry). In this fashion, the ATV of a given instruction operation may be dynamically generated as each source operand of that given instruction operation resolves. Generally, a source operand may be resolved if the source operand is known to be available or predicted to be available prior to the instruction operation that has the source operand reaching execution. For example, a source operand may be resolved if it is stored in the register file 18, will be stored in the register file 18 prior to a register file read, and/or is available for forwarding in the pipeline (e.g. at the input to the EXU 20).
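The ATV update on a tag broadcast can be sketched as follows: each entry whose tag match signal asserts ORs the broadcast ATV into its stored ATV, so the vector is built up incrementally as dependencies resolve. The entry count and names are assumptions for the example.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_ENTRIES 8               /* n+1 scheduler entries (assumed)      */
typedef uint8_t atv_t;

static atv_t atv_buffer[NUM_ENTRIES];   /* one ATV per scheduler entry      */

/* Tag broadcast handling in the ATV buffer 32: tag_match[i] is the per-entry
 * Tag_Match signal from the source buffer 30, and broadcast_atv is the ATV
 * transmitted by the execution resource along with its tag. */
static void atv_tag_broadcast(const bool tag_match[NUM_ENTRIES],
                              atv_t broadcast_atv) {
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (tag_match[i]) {
            /* Accumulate ancestors transitively: the producer's ATV becomes
             * part of the consumer's ATV. */
            atv_buffer[i] |= broadcast_atv;
        }
    }
}
```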
The request signals from the source buffer 30 are qualified by the request qualify logic 34. The request qualify logic 34 may be essentially a bitwise logical AND of the raw request signals and corresponding kill signals. In the illustrated embodiment, the kill signals (Kill[0 . . . n]) are asserted to suppress the corresponding request, and thus the inverse of the kill signal is ANDed. Other embodiments may generate the kill signal active low, and no inversion may be needed.
The ATV buffer 32 may include per entry logic 44 to generate the kill signals (and to update the ATVs). To generate the kill signals, the ATV buffer 32 may receive the status broadcast and status ATV (which may be the ATV token assigned to the load). The per entry logic 44 may compare the received status ATV to the ATV in the corresponding entry 42. If the status ATV is represented in the stored ATV and the status is bad (data misspeculation), the per entry logic 44 may assert the kill signal (Kill[i]) for that entry.
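A sketch of the status broadcast handling described here, combined with the request qualification from the preceding paragraph: on a bad status, any entry whose stored ATV contains the status ATV token has its request killed, and in either case the token is removed from the stored ATVs, which is what allows tokens to be recycled. Names and sizes are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_ENTRIES 8               /* n+1 scheduler entries (assumed)      */
typedef uint8_t atv_t;

static atv_t atv_buffer[NUM_ENTRIES];

/* Status broadcast from the data cache: status_atv is the one-hot token of
 * the load whose speculation outcome is now known. Fills in kill[i] and
 * removes the token from every stored ATV, resolving that dependency. */
static void atv_status_broadcast(atv_t status_atv, bool status_bad,
                                 bool kill[NUM_ENTRIES]) {
    for (int i = 0; i < NUM_ENTRIES; i++) {
        kill[i] = status_bad && (atv_buffer[i] & status_atv) != 0;
        /* Good or bad, the token's lifetime ends at status broadcast; a
         * replayed load will be assigned a fresh token later. */
        atv_buffer[i] &= (atv_t)~status_atv;
    }
}

/* Request qualify logic: the final request is the raw request gated by the
 * (active-high) kill signal. */
static bool qualified_request(bool raw_req, bool kill) {
    return raw_req && !kill;
}
```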
In addition to suppressing the request for an instruction operation if the ATV matches the status ATV, the scheduler 16 may use the kill signal to set the picked bit in the corresponding entry 38. The picked bit may prevent scheduling of the