US20010029573A1 - Set-associative cache-management method with parallel read and serial read pipelined with serial write


Info

Publication number
US20010029573A1
Authority
US
United States
Prior art keywords
cycle
cache
write
processor
data
Prior art date
Legal status
Granted
Application number
US09/835,215
Other versions
US6385700B2 (en)
Inventor
Mark Johnson
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to US09/835,215
Publication of US20010029573A1
Application granted
Publication of US6385700B2
Anticipated expiration
Status: Expired - Fee Related

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 - Addressing or allocation; Relocation
    • G06F 12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0864 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using pseudo-associative means, e.g. set-associative or hashing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 - Addressing or allocation; Relocation
    • G06F 12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0844 - Multiple simultaneous or quasi-simultaneous cache accessing
    • G06F 12/0855 - Overlapped cache accessing, e.g. pipeline
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/10 - Providing a specific technical effect
    • G06F 2212/1028 - Power efficiency
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A set-associative cache-management method combines one-cycle reads and two-cycle pipelined writes. The one-cycle reads involve accessing data from multiple sets in parallel before a tag match is determined. Once a tag match is determined, it is used to select the one of the accessed cache memory locations to be coupled to the processor for the read operation. The two-cycle write involves finding a match in a first cycle and performing the write in the second cycle. During the write, the first stage of the write pipeline is available to begin another write operation. Also, the first stage of the pipeline can be used to begin a two-cycle read operation, which results in a power saving relative to the one-cycle read operation. Due to the pipeline, there is no time penalty involved in the two-cycle read performed after the pipelined write. Also, instead of a wait, a no-op can be executed in the first stage of the write pipeline while the second stage of the pipeline is fulfilling a write request.

Description

    BACKGROUND OF THE INVENTION
  • The present invention relates to computers and, more particularly, to a method for managing a set-associative cache. A major objective of the present invention is to reduce average cache access times. [0001]
  • Much of modern progress is associated with the increasing prevalence of computers. In a conventional computer architecture, a data processor manipulates data in accordance with program instructions. The data and instructions are read from, written to, and stored in the computer's “main” memory. Typically, main memory is in the form of random-access memory (RAM) modules. [0002]
  • A processor accesses main memory by asserting an address associated with a memory location. For example, a 32-bit address can select any one of up to 2^32 address locations. In this example, each location holds eight bits, i.e., one “byte” of data, arranged in “words” of four bytes each, arranged in “lines” of four words each. In other words, there are 2^30 word locations and 2^28 line locations. [0003]
  • Accessing main memory tends to be much faster than accessing disk and tape-based memories; nonetheless, main memory accesses can leave a processor idling while it waits for a request to be fulfilled. To minimize such latencies, cache memories intercept processor requests to main memory and attempt to fulfill them faster than main memory can. [0004]
  • To fulfill processor requests to main memory, caches must contain copies of data stored in main memory. In part to optimize access times, a cache is typically much less capacious than main memory. Accordingly, it can represent only a small fraction of main memory contents at any given time. To optimize the performance gain achievable by a cache, this small fraction must be carefully selected. [0005]
  • In the event of a cache “miss”, when a request cannot be fulfilled by the cache, the cache fetches the entire main-memory line that includes the memory location requested by the processor. The entire line is stored in the cache because a processor is relatively likely to request data from locations near those it has requested recently. Where the line is stored depends on the type of cache. [0006]
  • A fully associative cache can store the fetched line in any cache storage location. The fully associative cache stores not only the data in the line, but also stores the line-address (the most-significant 28 bits) of the address as a “tag” in association with the line of data. The next time the processor asserts a main-memory address, the cache compares that address with all the tags stored in the cache. If a match is found, the requested data is provided to the processor from the cache. [0007]
  • There are two problems with a fully associative cache. The first is that the tags consume a relatively large percentage of cache capacity, which is limited to ensure high-speed accesses. The second problem is that every cache memory location must be checked to determine whether there is a tag that matches a requested address. Such an exhaustive match checking process can be time-consuming, making it hard to achieve the access speed gains desired of a cache. [0008]
  • In a direct-mapped cache, each cache storage location is given an index which, for example, might correspond to the least-significant line-address bits. In the 32-bit example above, a six-bit index might correspond to address bits 23-28. A restriction is imposed that a line fetched from main memory can only be stored at the cache location with an index that matches bits 23-28 of the requested address. Since those six bits are known, only the first 22 bits are needed as a tag. Thus, less cache capacity is devoted to tags. Also, when the processor asserts an address, only one cache location (the one with an index matching the corresponding bits of the asserted address) needs to be examined to determine whether or not the request can be fulfilled from the cache. [0009]
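    As a rough illustration of this 22/6/2/2 split, the C sketch below extracts the tag, index, word-select, and byte-select fields from a 32-bit address. The bit numbering above counts from the most-significant end, so the "first 22 bits" are the high-order bits in this sketch; the macro and function names are illustrative.

        #include <stdint.h>
        #include <stdio.h>

        /* Field widths from the example: 22 + 6 + 2 + 2 = 32 bits. */
        #define TAG_BITS   22
        #define INDEX_BITS  6
        #define WORD_BITS   2
        #define BYTE_BITS   2

        static uint32_t addr_tag(uint32_t a)   { return a >> (INDEX_BITS + WORD_BITS + BYTE_BITS); }
        static uint32_t addr_index(uint32_t a) { return (a >> (WORD_BITS + BYTE_BITS)) & ((1u << INDEX_BITS) - 1u); }
        static uint32_t addr_word(uint32_t a)  { return (a >> BYTE_BITS) & ((1u << WORD_BITS) - 1u); }
        static uint32_t addr_byte(uint32_t a)  { return a & ((1u << BYTE_BITS) - 1u); }

        int main(void)
        {
            uint32_t a = 0x12345678u;   /* arbitrary example address */
            printf("tag=0x%06x index=%u word=%u byte=%u\n",
                   (unsigned)addr_tag(a), (unsigned)addr_index(a),
                   (unsigned)addr_word(a), (unsigned)addr_byte(a));
            return 0;
        }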
  • The problem with a direct-mapped cache is that when a line is stored in the cache, it must overwrite any data stored at that location. If the data overwritten is data that would be likely to be called in the near term, this overwriting diminishes the effectiveness of the cache. A direct-mapped cache does not provide the flexibility to choose which data is to be overwritten to make room for new data. [0010]
  • In a set-associative cache, the memory is divided into two or more direct-mapped sets. Each index is associated with one memory location in each set. Thus, in a four-way set associative cache, there are four cache locations with the same index, and thus, four choices of locations to overwrite when a line is stored in the cache. This allows more optimal replacement strategies than are available for direct-mapped caches. Still, the number of locations that must be checked, e.g., one per set, to determine whether a requested location is represented in the cache is quite limited, and the number of bits that need to be compared is reduced by the length of the index. Thus, set-associative caches provide an attractive compromise that combines some of the replacement strategy flexibility of a fully associative cache with much of the speed advantage of a direct-mapped cache. [0011]
  • When a set-associative cache receives an address from a processor, it determines the relevant cache locations by selecting the cache locations with an index that matches the corresponding address bits. The tags stored at the cache locations corresponding to that index are checked for a match. If there is a match, the least-significant address bits are checked for the word location (or fraction thereof) within the line stored at the match location. The contents at that location are then accessed and transmitted to the processor. [0012]
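    A minimal C sketch of this per-index lookup in a four-way set-associative cache follows, using the 22-bit tag and six-bit index of the running example; the structure and function names here are illustrative.

        #include <stdbool.h>
        #include <stdint.h>

        #define NUM_SETS        4     /* four-way set-associative */
        #define NUM_INDEXES    64     /* six-bit index */
        #define WORDS_PER_LINE  4     /* one line = four 32-bit words */

        typedef struct {
            bool     valid;
            uint32_t tag;                     /* 22 significant bits */
            uint32_t data[WORDS_PER_LINE];
        } line_t;

        typedef struct {
            line_t sets[NUM_SETS][NUM_INDEXES];
        } cache_t;

        /* Compare the tag against the one location per set that shares the index.
         * Returns the matching set number, or -1 on a miss. */
        static int tag_lookup(const cache_t *c, uint32_t tag, uint32_t index)
        {
            for (int s = 0; s < NUM_SETS; s++) {
                const line_t *l = &c->sets[s][index];
                if (l->valid && l->tag == tag)
                    return s;
            }
            return -1;
        }

        int main(void)
        {
            static cache_t c;                 /* zero-initialized: every line invalid */
            c.sets[2][17].valid = true;
            c.sets[2][17].tag   = 0x0ABCDEu;
            c.sets[2][17].data[0] = 42u;

            return tag_lookup(&c, 0x0ABCDEu, 17) == 2 ? 0 : 1;   /* expect a hit in set 2 */
        }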
  • In the case of a read operation, the cache access can be hastened by starting the data access before a match is determined. While checking the relevant tags for a match, the appropriate data locations within each set having the appropriate index are accessed. By the time a match is determined, data from all four sets are ready for transmission. The match is used, e.g., as the control input to a multiplexer, to select the data actually transmitted. If there is no match, none of the data is transmitted. [0013]
  • The read operation is much faster since the data is accessed at the same time as the match operation is conducted rather than after. For example, a parallel “tag-and-data” read operation might consume only one memory cycle, while a serial “tag-then-data” read operation might require two cycles. Alternatively, if the serial read operation consumed only one cycle, the parallel read operation would permit a shorter cycle, allowing for more processor operations per unit of time. [0014]
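    The contrast can be seen by counting data-array accesses, which serve here as a stand-in for the power cost discussed in the next paragraph. The following C sketch is a simplified behavioral model, not the hardware datapath, and all names in it are illustrative.

        #include <stdint.h>
        #include <stdio.h>

        #define NUM_SETS 4

        /* Toy model of one index: a tag and one data word per set. */
        static const uint32_t tags[NUM_SETS]  = { 0x11, 0x22, 0x33, 0x44 };
        static const uint32_t words[NUM_SETS] = { 100, 200, 300, 400 };

        /* Parallel "tag-and-data" read: every set's data array is read while the
         * tags are compared; the tag match then steers a multiplexer.  One cycle,
         * NUM_SETS data-array accesses. */
        static int parallel_read(uint32_t tag, uint32_t *out, int *data_accesses)
        {
            uint32_t candidates[NUM_SETS];
            for (int s = 0; s < NUM_SETS; s++) {
                candidates[s] = words[s];          /* speculative data access */
                (*data_accesses)++;
            }
            for (int s = 0; s < NUM_SETS; s++)
                if (tags[s] == tag) { *out = candidates[s]; return s; }
            return -1;
        }

        /* Serial "tag-then-data" read: compare tags first, then access only the
         * matching set's data array.  Two cycles, at most one data-array access. */
        static int serial_read(uint32_t tag, uint32_t *out, int *data_accesses)
        {
            for (int s = 0; s < NUM_SETS; s++)
                if (tags[s] == tag) {
                    (*data_accesses)++;
                    *out = words[s];               /* single data access, in the second cycle */
                    return s;
                }
            return -1;
        }

        int main(void)
        {
            uint32_t v;
            int par = 0, ser = 0;
            parallel_read(0x33, &v, &par);
            serial_read(0x33, &v, &ser);
            printf("data-array accesses: parallel=%d, serial=%d\n", par, ser);   /* 4 vs 1 */
            return 0;
        }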
  • The gains of the parallel tag-and-data reads are not without some cost. The data accesses that do not provide the requested data consume additional power that can tax power sources and dissipate extra heat. The heat can fatigue, impair, and damage the incorporating integrated circuit and proximal components. Accordingly, larger batteries or power supplies and more substantial heat removal provisions may be required. [0015]
  • Nonetheless, such provisions are generally well worth the speed advantages of the parallel tag-and-data read accesses. A comparable approach to hastening write operations is desired. Unfortunately, the parallel tag-and-data approach is not applied to write operations since parallel data access would involve overwriting data that should be preserved. Accordingly, in the context of a system using parallel reads, the write operations have become a more salient limit to performance. What is needed is a cache management method in which write operation times more closely match those achieved using parallel tag-and-data reads. [0016]
  • SUMMARY OF THE INVENTION
  • The present invention provides a cache-management method using pipelined cache writes in conjunction with parallel tag-and-data reads. Thus, the cache can accept a second write request while writing the data from the previous write request. While each write operation takes place over two cycles, a series of write operations consumes much less than two cycles per write on average. [0017]
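    As a rough accounting, a burst of N back-to-back pipelined writes occupies N + 1 cycles (one tag-check cycle per write plus a single trailing data cycle), so the average cost of (N + 1)/N cycles per write approaches one cycle per write as the burst lengthens.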
  • There are three possible types of processor events that can follow a pipelined write operation on the next cycle. The first type is a write operation, in which case the pipelining obtains the expected performance benefit. The second type is a “no-operation”, neither a read nor a write. In this case, the pipelining allows the no-op to be executed during the second stage of a preceding write operation. [0018]
  • The third type is a read operation. If a parallel read were to follow a pipelined write, the read could not begin until the write was completed. Thus, a wait might be inserted in the first stage of the write pipeline during the second cycle of the write. In this case, the speed advantage of pipelining is not realized. [0019]
  • Surprisingly, a pipelined write permits a power savings for an immediately following read operation. Instead of waiting until the second cycle of the write operation is completed to begin a parallel read, a serial read can be started in the second cycle of the write operation. The serial read can be completed in the first cycle after completion of the write operation, i.e., by the time a one-cycle parallel read operation would have been completed. Hence, the power savings associated with a serial read are achieved with none of the time penalty normally associated with serial reads. Thus, the invention provides both for faster average write access rates and for reduced power consumption. These and other features and advantages of the invention are apparent from the description below with reference to the following drawing.[0020]
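    The cycle accounting summarized above can be sketched as follows in C, assuming every access hits and that a read issued while a write is completing runs serially and is followed by one wait cycle (the behavior of the preferred embodiment described below); the request stream in the sketch reproduces the six-cycle sequence of FIG. 1.

        #include <stdio.h>

        typedef enum { READ, WRITE } op_t;

        int main(void)
        {
            /* Request stream matching FIG. 1: read, write, write, read, read. */
            const op_t stream[] = { READ, WRITE, WRITE, READ, READ };
            const int n = (int)(sizeof stream / sizeof stream[0]);

            int cycle = 1;
            int write_completing = 0;   /* does a write's data cycle occupy the current cycle? */

            for (int i = 0; i < n; i++) {
                if (stream[i] == WRITE) {
                    printf("cycle %d: write issued (tag check); data written in cycle %d\n",
                           cycle, cycle + 1);
                    write_completing = 1;   /* the next request overlaps this write's data cycle */
                    cycle += 1;
                } else if (write_completing) {
                    printf("cycle %d: serial read issued (tag check); data delivered in cycle %d; "
                           "cycle %d is a wait\n", cycle, cycle + 1, cycle + 1);
                    write_completing = 0;
                    cycle += 2;             /* the data cycle doubles as the wait cycle */
                } else {
                    printf("cycle %d: parallel read issued and completed\n", cycle);
                    cycle += 1;
                }
            }
            return 0;
        }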
  • BRIEF DESCRIPTION OF THE FIGURE
  • FIG. 1 is a composite schematic of the method of the invention and a computer system in which the method is implemented. In FIG. 1, method M1 is illustrated as a series of six processor (read and write) request cycles. Requests that are fulfilled within the cycle in which they are made are indicated by a horizontal arrow; requests that are fulfilled in the cycle following the request are indicated by a downward sloping arrow. [0021]
  • DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • In accordance with the present invention, a computer system AP1 comprises a data processor 10, a memory 12, and a cache 20, as shown in FIG. 1. Data processor 10 issues requests along a processor address bus ADP, which includes address lines, a read-write control line, and a memory request line. Data transfers between cache 20 and processor 10 take place along processor data bus DTP. In addition, cache 20 can issue wait requests to processor 10 along processor wait signal line WTP. Similarly, cache 20 can issue requests to memory 12 via memory address bus ADM. Data transfers between cache 20 and memory 12 are along memory data bus DTM. Memory 12 can issue wait requests to cache 20 via memory wait signal line WTM. [0022]
  • Cache 20 comprises a processor interface 21, a memory interface 23, a cache controller 25, a read output multiplexer 27, and cache memory 30. Cache memory 30 includes four sets S1, S2, S3, and S4. Set S1 includes 64 memory locations, each with an associated six-bit index. Each memory location stores a line of data and an associated 22-bit tag. Each line of data holds four 32-bit words of data. Cache sets S2, S3, and S4 are similar and use the same six-bit indexes. [0023]
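    Taken together, this geometry amounts to 4 sets × 64 indexes × (4 words × 4 bytes) = 4096 bytes of cached data, plus 4 × 64 × 22 = 5632 bits (704 bytes) of tag storage; these totals are computed from the figures above rather than stated in the description.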
  • Method M1 of the invention is depicted as a series of processor request cycles in FIG. 1. While a specific series of cycles is described, it encompasses the situations most relevant to the invention. By “processor request cycle” is meant a version of a memory cycle phase-shifted to begin with a processor asserting a request (assuming that one is made). [0024]
  • In processor request cycle 1, processor 10 asserts a main-memory address and requests that data be read from the corresponding main-memory location. Processor interface 21 of cache 20 intercepts this request and forwards it to cache controller 25, which selects an index based on bits 23-28 of the asserted address. Controller 25 compares the 22 most significant bits with the tags stored at the four cache memory locations (one in each set S1, S2, S3, and S4) sharing the determined index. Concurrently, controller 25 accesses the data at each of these locations so that they are respectively available at the four inputs of multiplexer 27. If a tag match is found in one of the sets, controller 25 selects the corresponding multiplexer input so that the data at the indexed memory location of that set is output from multiplexer 27 to processor 10. [0025]
  • If there is no match, no input is selected. Cache controller 25 controls memory interface 23 so that the line of main memory 12 containing the requested data is fetched. The line is stored at the indexed location of one of the four sets. For example, the fetched line might replace the line at the index that was least recently used and thus least likely to be used again in the near future. Thus, the fetched data is stored at the location selected for overwrite, while the first 22 bits of the address used in the request are stored as the tag for that data. [0026]
  • If there is no match, data must be fetched from main memory; thus, the read operation may consume several cycles. However, if there is a match, the parallel tag-and-data read is completed in one cycle. This is indicated by the horizontal arrow pointing to processor data bus DTP. Processor 10 is permitted to make another request in the next processor cycle. [0027]
  • In processor request cycle 2, processor 10 makes a write request. In cycle 2, cache 20 receives the address, checks the index bits, and compares tags at the four cache locations matching the index bits. If there is no match, the data is written directly into memory. (In this case, a conventional “write-around” mode is employed; however, the invention is compatible with other modes of handling write misses.) If there is a match, the word transmitted by the processor is written to the location determined by the index, the set with the matching tag, and the word position indicated by bits 29 and 30 of the write address. This writing does not occur during processor request cycle 2, but during succeeding processor cycle 3, as indicated by the downward sloping arrow that points both toward processor data bus DTP and processor cycle 3. [0028]
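    A behavioral sketch of this two-cycle write in C follows, with a one-entry stage register holding a write between its tag cycle and its data cycle, and with misses falling through to main memory as in the write-around mode mentioned above; the types and function names are illustrative.

        #include <stdbool.h>
        #include <stdint.h>

        #define NUM_SETS        4
        #define NUM_INDEXES    64
        #define WORDS_PER_LINE  4

        typedef struct { bool valid; uint32_t tag; uint32_t data[WORDS_PER_LINE]; } line_t;
        typedef struct { line_t sets[NUM_SETS][NUM_INDEXES]; } cache_t;

        /* One-entry stage register: a write that matched in its tag cycle
         * waits here until its data cycle in the following processor cycle. */
        typedef struct {
            bool     pending;
            int      set;
            uint32_t index, word, value;
        } write_stage_t;

        /* Cycle N: tag check.  On a hit, latch the write for the next cycle;
         * on a miss, the word would be written around the cache to main memory. */
        static void write_tag_cycle(cache_t *c, write_stage_t *st,
                                    uint32_t tag, uint32_t index, uint32_t word, uint32_t value)
        {
            for (int s = 0; s < NUM_SETS; s++) {
                line_t *l = &c->sets[s][index];
                if (l->valid && l->tag == tag) {
                    st->pending = true;
                    st->set = s; st->index = index; st->word = word; st->value = value;
                    return;
                }
            }
            st->pending = false;   /* miss: write-around to main memory (not modeled here) */
        }

        /* Cycle N+1: perform the latched write while a new request can be accepted. */
        static void write_data_cycle(cache_t *c, write_stage_t *st)
        {
            if (st->pending) {
                c->sets[st->set][st->index].data[st->word] = st->value;
                st->pending = false;
            }
        }

        int main(void)
        {
            static cache_t c;
            static write_stage_t st;
            c.sets[1][5].valid = true;
            c.sets[1][5].tag   = 0x000123u;

            write_tag_cycle(&c, &st, 0x000123u, 5, 2, 0xDEADBEEFu);  /* cycle N   */
            write_data_cycle(&c, &st);                               /* cycle N+1 */
            return c.sets[1][5].data[2] == 0xDEADBEEFu ? 0 : 1;
        }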
  • Since writes are pipelined, processor 10 can make a second write request during cycle 3 while data corresponding to the first write request is being written. During this same cycle, the tag portion of the address asserted in the second write request is compared with the tags stored at the four cache-set memory locations having an index equal to the index bits of the requested address. As with the first write request, the actual writing is delayed one cycle, as indicated by a downward sloping arrow. [0029]
  • The invention provides for further successive pipelined write operations. Eventually, however, the write series ends either with a no-operation or a read operation. The no-op just allows the previous write operation to complete. The more interesting case, in which a read request immediately follows a write request, is explored with respect to processor request cycle 4. [0030]
  • Processor 10 makes a second read request at processor request cycle 4 while the write requested during cycle 3 is being completed. Note that a parallel tag-and-data read is not possible within cycle 4 because of the ongoing write operation. To provide for such a parallel read operation, a wait would have to be inserted so that the parallel operations would be performed at cycle 5 instead of cycle 4. [0031]
  • In accordance with a refinement of the method of the invention, a read immediately following a pipelined write operation is performed serially. Specifically, the tag matching is performed in the same processor request cycle in which the request is made, while data is accessed and transmitted in the following cycle. Only data from a set having a tag match is provided to an input of multiplexer 27. (If there is no match, no data is accessed from cache memory 30.) In this case, a match is detected during processor request cycle 4, and data is read at cycle 5, as indicated by the downward sloping arrow in the row corresponding to cycle 4. [0032]
  • In addition, during request cycle 4, cache 20 issues a wait request to processor 10 along processor wait signal line WTP. In response, processor 10 delays any pending request so that no request is made during processor cycle 5. Since no request is made during cycle 5, there is no arrow indicating when a cycle-5 request is fulfilled; instead, a circle indicates that a wait has nothing to fulfill. [0033]
  • Processor 10 makes a third read request in cycle 6. Since there is nothing in the pipeline during cycle 6, a parallel tag-and-data read is completed in cycle 6, as indicated by the horizontal arrow in the row corresponding to cycle 6. Subsequent reads would also be one-cycle parallel reads. The case of a write following a parallel read is addressed in the discussion concerning cycles 1 and 2. [0034]
  • An alternative embodiment of the invention uses 2-cycle pipelined reads to save power whenever there is no time penalty involved in doing so. Thus, any read immediately following a pipelined read or write operation is pipelined. One-cycle parallel reads are used only after other parallel reads, no-ops, or cache misses. In this embodiment, no wait request is issued as in cycle 4 above. Thus, in cycle 5, a third read request can be made and then completed in cycle 6. [0035]
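    The difference between this alternative and the preferred embodiment reduces to a small policy choice, sketched below in C as two functions that pick the read mode and decide whether a wait is needed given what occupied the previous cycle; the enum and function names are invented for the sketch.

        #include <stdio.h>

        typedef enum { PREV_NOP_OR_MISS, PREV_PARALLEL_READ, PREV_SERIAL_READ, PREV_WRITE } prev_t;
        typedef enum { READ_PARALLEL, READ_SERIAL } read_mode_t;

        /* Preferred embodiment: a read issued while a write is completing is
         * serial and is followed by one wait cycle; all other reads are parallel. */
        static read_mode_t preferred_policy(prev_t prev, int *wait_after)
        {
            if (prev == PREV_WRITE) { *wait_after = 1; return READ_SERIAL; }
            *wait_after = 0;
            return READ_PARALLEL;
        }

        /* Alternative embodiment: any read immediately following a pipelined read
         * or write is itself pipelined (serial) and no wait is issued; parallel
         * reads are used only after parallel reads, no-ops, or misses. */
        static read_mode_t alternative_policy(prev_t prev, int *wait_after)
        {
            *wait_after = 0;
            return (prev == PREV_WRITE || prev == PREV_SERIAL_READ) ? READ_SERIAL : READ_PARALLEL;
        }

        int main(void)
        {
            int w1 = 0, w2 = 0;
            read_mode_t m1 = preferred_policy(PREV_WRITE, &w1);
            read_mode_t m2 = alternative_policy(PREV_SERIAL_READ, &w2);
            printf("preferred after write: %s, wait=%d\n",
                   m1 == READ_SERIAL ? "serial" : "parallel", w1);
            printf("alternative after serial read: %s, wait=%d\n",
                   m2 == READ_SERIAL ? "serial" : "parallel", w2);
            return 0;
        }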
  • In method M1 of FIG. 1, modified so that the fourth cycle involves execution of a no-op, the no-op can be executed in the first stage of the write pipeline while the write requested in the third cycle is completed. The less desirable alternative would be to issue a wait during the fourth cycle and execute the no-op during a fifth cycle. In general, the invention provides for withholding a “wait” that would conventionally be associated with the second cycle of a write operation until there is a resource available that can absorb it without accumulating an access latency. [0036]
  • As indicated above, the invention provides the greatest performance enhancement in cases where write operations occur frequently in series. In a conventional system with a single cache used for both instructions and data, such circumstances can be infrequent due to the number of instruction fetches, which are all reads. However, in a Harvard architecture, with separate data and instruction paths, the invention can be used to great advantage on the data cache. These and other variations upon and modifications to the described embodiments are provided for by the present invention, the scope of which is defined by the following claims. [0037]

Claims (10)

What is claimed is:
1. A cache-management method for a system including a processor, main memory, and a set-associative cache having plural sets of cache locations containing copies of data stored in said main memory, said method comprising the steps of:
during a first request cycle in which said processor issues a first request to read first data from a first main-memory location, providing said first data to said processor from a first set of said plural sets;
during a second request cycle in which said processor issues a second request to write second data to a second main-memory location, and said cache determines a second set of said plural sets in which said second main-memory location is represented;
during a third request cycle during which said processor issues a third request to write third data to a third main-memory location, said cache determines a third set of said plural sets in which said third main-memory location is represented, and said cache writes said second data to said second set; and
during a fourth request cycle in which said processor issues a fourth request to read fourth data from a fourth main-memory location, during which said cache writes said third data to said third set and determines a fourth set of said plural sets in which said fourth main-memory location is represented; and
during a fifth request cycle, providing said fourth data to said processor from said fourth set.
2. A method as recited in claim 1 wherein said second set is the same as said first set.
3. A method as recited in claim 1 wherein, during said fourth request cycle, said processor issues a fourth request to read fourth data from a fourth main-memory location, and said cache determines a fourth set of said plural sets in which said fourth main memory location is represented.
4. A method as recited in claim 3 wherein during said fourth cycle, said cache issues a “wait” request to said processor.
5. A method as recited in claim 3 further comprising a fifth request cycle during which said processor does not issue a request to read or write from a main memory location, and said cache provides said fourth data to said processor from said fourth set.
6. A method as recited in claim 5 wherein said processor issues a no-op during said fifth request cycle.
7. A method as recited in claim 1 wherein, during said first processor cycle, said cache accesses data in all of said plural sets, said cache not transmitting accessed data to said processor other than from said first set.
8. A cache-management method comprising:
a parallel read;
a serial write; and
a serial read pipelined with said serial write.
9. A computer system comprising:
a processor;
main memory; and
a cache, including
means for executing a parallel read;
means for executing a serial write; and
means for executing a serial read pipelined with said serial write.
10. A method as recited in claim 8 wherein said serial write is a second serial write of a pair of serial writes, said pair also including a first serial write preceding and pipelined with said second serial write.
US09/835,215 1999-06-21 2001-04-13 Set-associative cache-management method with parallel read and serial read pipelined with serial write Expired - Fee Related US6385700B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/835,215 US6385700B2 (en) 1999-06-21 2001-04-13 Set-associative cache-management method with parallel read and serial read pipelined with serial write

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US33690499A 1999-06-21 1999-06-21
US09/835,215 US6385700B2 (en) 1999-06-21 2001-04-13 Set-associative cache-management method with parallel read and serial read pipelined with serial write

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US33690499A Continuation 1999-06-21 1999-06-21

Publications (2)

Publication Number Publication Date
US20010029573A1 2001-10-11
US6385700B2 (en) 2002-05-07

Family

ID=23318201

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/835,215 Expired - Fee Related US6385700B2 (en) 1999-06-21 2001-04-13 Set-associative cache-management method with parallel read and serial read pipelined with serial write

Country Status (1)

Country Link
US (1) US6385700B2 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7251710B1 (en) 2004-01-12 2007-07-31 Advanced Micro Devices, Inc. Cache memory subsystem including a fixed latency R/W pipeline
US20080109606A1 (en) * 2006-11-03 2008-05-08 Lataille Norbert Bernard Eugen Cache logic, data processing apparatus including cache logic, and a method of operating cache logic
US20100306471A1 (en) * 2009-05-28 2010-12-02 International Business Machines Corporation D-cache line use history based done bit based on successful prefetchable counter
US20100306472A1 (en) * 2009-05-28 2010-12-02 International Business Machines Corporation I-cache line use history based done bit based on successful prefetchable counter
US20100306473A1 (en) * 2009-05-28 2010-12-02 International Business Machines Corporation Cache line use history based done bit modification to d-cache replacement scheme
US20100306474A1 (en) * 2009-05-28 2010-12-02 International Business Machines Corporation Cache line use history based done bit modification to i-cache replacement scheme
WO2012092717A1 (en) * 2011-01-07 2012-07-12 Mediatek Inc. Apparatuses and methods for hybrid automatic repeat request (harq) buffering optimization
WO2021180186A1 (en) 2020-03-13 2021-09-16 Shenzhen GOODIX Technology Co., Ltd. Low area cache memory
US20240004792A1 (en) * 2022-06-29 2024-01-04 Ampere Computing Llc Data l2 cache with split access

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7688824B2 (en) * 2001-07-11 2010-03-30 Broadcom Corporation Method, system, and computer program product for suppression index reuse and packet classification for payload header suppression
US7275112B1 (en) * 2001-08-08 2007-09-25 Pasternak Solutions Llc Efficient serialization of bursty out-of-order results

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4899275A (en) * 1985-02-22 1990-02-06 Intergraph Corporation Cache-MMU system
US6321307B1 (en) * 1997-12-31 2001-11-20 Compaq Computer Corporation Computer system and method employing speculative snooping for optimizing performance

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7251710B1 (en) 2004-01-12 2007-07-31 Advanced Micro Devices, Inc. Cache memory subsystem including a fixed latency R/W pipeline
US7856532B2 (en) * 2006-11-03 2010-12-21 Arm Limited Cache logic, data processing apparatus including cache logic, and a method of operating cache logic
US20080109606A1 (en) * 2006-11-03 2008-05-08 Lataille Norbert Bernard Eugen Cache logic, data processing apparatus including cache logic, and a method of operating cache logic
US8171224B2 (en) 2009-05-28 2012-05-01 International Business Machines Corporation D-cache line use history based done bit based on successful prefetchable counter
US8291169B2 (en) 2009-05-28 2012-10-16 International Business Machines Corporation Cache line use history based done bit modification to D-cache replacement scheme
US20100306474A1 (en) * 2009-05-28 2010-12-02 International Business Machines Corporation Cache line use history based done bit modification to i-cache replacement scheme
US20100306472A1 (en) * 2009-05-28 2010-12-02 International Business Machines Corporation I-cache line use history based done bit based on successful prefetchable counter
US8140760B2 (en) * 2009-05-28 2012-03-20 International Business Machines Corporation I-cache line use history based done bit based on successful prefetchable counter
US20100306471A1 (en) * 2009-05-28 2010-12-02 International Business Machines Corporation D-cache line use history based done bit based on successful prefetchable counter
US8429350B2 (en) 2009-05-28 2013-04-23 International Business Machines Corporation Cache line use history based done bit modification to D-cache replacement scheme
US20100306473A1 (en) * 2009-05-28 2010-12-02 International Business Machines Corporation Cache line use history based done bit modification to d-cache replacement scheme
US8332587B2 (en) 2009-05-28 2012-12-11 International Business Machines Corporation Cache line use history based done bit modification to I-cache replacement scheme
WO2012092717A1 (en) * 2011-01-07 2012-07-12 Mediatek Inc. Apparatuses and methods for hybrid automatic repeat request (harq) buffering optimization
US20130272192A1 (en) * 2011-01-07 2013-10-17 Mediatek Inc. Apparatuses and Methods for Hybrid Automatic Repeat Request (HARQ) Buffering Optimization
WO2021180186A1 (en) 2020-03-13 2021-09-16 Shenzhen GOODIX Technology Co., Ltd. Low area cache memory
EP3977294A4 (en) * 2020-03-13 2022-07-20 Shenzhen Goodix Technology Co., Ltd. Low area cache memory
US11544199B2 (en) 2020-03-13 2023-01-03 Shenzhen GOODIX Technology Co., Ltd. Multi-way cache memory access
US20240004792A1 (en) * 2022-06-29 2024-01-04 Ampere Computing Llc Data l2 cache with split access

Also Published As

Publication number Publication date
US6385700B2 (en) 2002-05-07

Similar Documents

Publication Publication Date Title
US6321321B1 (en) Set-associative cache-management method with parallel and single-set sequential reads
US6425075B1 (en) Branch prediction device with two levels of branch prediction cache
US7447868B2 (en) Using vector processors to accelerate cache lookups
US6212602B1 (en) Cache tag caching
US5511175A (en) Method an apparatus for store-into-instruction-stream detection and maintaining branch prediction cache consistency
US7430642B2 (en) System and method for unified cache access using sequential instruction information
EP1550032B1 (en) Method and apparatus for thread-based memory access in a multithreaded processor
US6976126B2 (en) Accessing data values in a cache
US6782454B1 (en) System and method for pre-fetching for pointer linked data structures
US20010052060A1 (en) Buffering system bus for external-memory access
US7765360B2 (en) Performing useful computations while waiting for a line in a system with a software implemented cache
US20030200404A1 (en) N-way set-associative external cache with standard DDR memory devices
US20070250667A1 (en) Pseudo-lru virtual counter for a locking cache
US6385700B2 (en) Set-associative cache-management method with parallel read and serial read pipelined with serial write
US7260674B2 (en) Programmable parallel lookup memory
US7761665B2 (en) Handling of cache accesses in a data processing apparatus
US6629206B1 (en) Set-associative cache-management using parallel reads and serial reads initiated during a wait state
JP3498673B2 (en) Storage device
US6718439B1 (en) Cache memory and method of operation
US20090172296A1 (en) Cache Memory System and Cache Memory Control Method
US5619673A (en) Virtual access cache protection bits handling method and apparatus
US20090063773A1 (en) Technique to enable store forwarding during long latency instruction execution
JPH08263371A (en) Apparatus and method for generation of copy-backed address in cache
JP3295728B2 (en) Update circuit of pipeline cache memory
JP3221409B2 (en) Cache control system, readout method therefor, and recording medium recording control program therefor

Legal Events

Date Code Title Description
REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20060507