WO2001088719A2 - Speed cache having separate arbitration for second-level tag and data cache RAMs

Speed cache having separate arbitration for second-level tag and data cache RAMs

Info

Publication number
WO2001088719A2
Authority
WO
WIPO (PCT)
Prior art keywords
cache, tag, memory, data, address
Application number
PCT/US2001/013269
Other languages
French (fr)
Other versions
WO2001088719A3 (en)
Inventor
Rajasekhar Cherabuddi
Original Assignee
Sun Microsystems, Inc.
Application filed by Sun Microsystems, Inc.
Priority to AU2001257238A1
Publication of WO2001088719A2 publication Critical patent/WO2001088719A2/en
Publication of WO2001088719A3 publication Critical patent/WO2001088719A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0844 Multiple simultaneous or quasi-simultaneous cache accessing
    • G06F 12/0855 Overlapped cache accessing, e.g. pipeline
    • G06F 12/0857 Overlapped cache accessing, e.g. pipeline by multiple requestors
    • G06F 12/0893 Caches characterised by their organisation or structure
    • G06F 12/0897 Caches characterised by their organisation or structure with two or more cache hierarchy levels
    • G06F 12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F 12/0815 Cache consistency protocols
    • G06F 12/0831 Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means


Abstract

A cache system for use in computer systems has a tag memory (412), a data memory (416), and a cache control unit (401). The tag memory (412) and data memory (416) are provided with separate address lines. The cache control unit (401) has a first arbitration unit (405) for arbitrating access to the tag memory (412) and a second arbitration unit (406) for arbitrating access to the data memory (416). Providing separate arbitration units (405, 406) for the tag memory (412) and the data memory (416) allows access of the data memory (416) for a following cycle of a multicycle cache-line read while the tag memory (412) is accessed by a snoop controller (402).

Description

SPEED CACHE HAVING SEPARATE ARBITRATION FOR SECOND-LEVEL TAG AND DATA CACHE RAMS
The invention pertains to the field of cache memories for high-speed data processing units, including those having multiple levels of cache. In particular, the invention relates to arbitration hardware for the tag and data memories of a cache memory. While the invention is of particular applicability to second-level off-chip cache, it may be implemented on-chip.
BACKGROUND OF THE INVENTION
Many processor CPU chips available today, including all high-performance processors, provide for at least one level of cache memory. Of these, many provide for multiple levels of cache memory, with at least one level of cache memory located on the CPU chip. CPU chips having cache memory, or provisions to support cache memory, are produced by companies including SUN Microsystems, Intel, Motorola, AMD, and others.
Memory systems come in a variety of sizes, speeds, and access limitations. Disk systems may have from dozens to thousands of gigabytes, but access is slow. Main memory systems built of DRAM chips typically range from tens to a few thousand megabytes, and can be accessed in times of forty to a few hundred nanoseconds.
Modern processors are capable of digesting instructions at rates that greatly exceed the bandwidth of the main memory systems to which they are coupled. One or more levels of cache memory are often inserted between a modern processor and a main memory system to provide higher memory bandwidth to the processor. Each level of cache is a memory subsystem of smaller size, and higher speed, than the next higher level of cache or main memory of a system; inserted between a processor or low level cache and the main memory or higher level cache; that takes advantage of locality of addressing and repetition of addressing to provide fast response.
As modern processor chips soar to the GHz performance level, they invite ever larger and faster cache memory systems to feed them.
Many cache systems are N-way, set-associative cache systems. These cache systems have tag memory subsystems and data memory subsystems. In these cache memory systems, when a memory location at a particular main-memory address is to be read, a cache-line address is derived from the main-memory address. The cache-line address is typically presented to the tag memory and to the data memory, and a read operation is done on both memories. The tag memory contents are read and at least part of them is compared to at least part of the main-memory address to determine whether any part of the data read from the cache data memory corresponds to data at the desired main-memory address. If the tag indicates that the desired data is in the cache data memory, that data is presented to the processor and next lower-level cache; if not, then the read operation is passed up to the next higher-level cache or main memory. N-way, set-associative caches perform N such comparisons of tag memory contents to the desired data address simultaneously. One-way caches are common in off-chip cache memories. Typically, a tag memory contains status information as well as address information. This status information may include "dirty" flags that indicate whether information in the cache has been written to but not yet updated in higher-level memory, and "valid" flags indicating that information in the cache is valid.
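As a concrete illustration of the lookup just described, the sketch below models a small N-way set-associative tag array in Python. This is an editorial sketch, not part of the patent; all names and parameters (LINE_SIZE, NUM_SETS, N_WAYS, TagEntry, lookup) are invented for the example.

```python
# Minimal model of an N-way set-associative tag lookup.
# All names and parameter values are illustrative, not from the patent.

LINE_SIZE = 64        # bytes per cache line (assumed)
NUM_SETS = 256        # number of sets in the cache (assumed)
N_WAYS = 2            # associativity; hardware does all N compares in parallel

class TagEntry:
    def __init__(self):
        self.tag = None
        self.valid = False
        self.dirty = False   # written, but not yet updated in higher-level memory

tags = [[TagEntry() for _ in range(N_WAYS)] for _ in range(NUM_SETS)]

def split_address(addr):
    """Derive the cache-line (set) index and tag from a main-memory address."""
    line_addr = addr // LINE_SIZE
    return line_addr % NUM_SETS, line_addr // NUM_SETS

def lookup(addr):
    """Return the hitting way, or None on a miss."""
    index, tag = split_address(addr)
    for way in range(N_WAYS):
        entry = tags[index][way]
        if entry.valid and entry.tag == tag:
            return way
    return None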
A cache "hit" occurs whenever a memory access to the cache occurs and the cache system finds, through inspecting its tag memory, that the requested data is present and valid in the cache.
Memory systems in modern computing systems often have more than one device demanding access to them. There may be two or more processors, each needing access to memory. Each processor may have one or more associated, often integrated, co-processors that have the ability to fetch or store information from or to memory. Peripheral devices, including disk controllers, network interfaces, and video controllers, often have DMA (Direct Memory Access) or bus-mastering capability, such that they can transfer blocks of information to or from memory without direct processor involvement with each byte. When there are multiple processors in a system, there is often a portion of memory allocated to coordinating and communicating between the processors. For at least this portion of memory, there is a need to ensure that data written from a first processor is read correctly by a second processor. In particular, it is necessary to ensure that the second processor does not fetch data from its own cache instead of fetching the data written by the first processor. This is known as enforcing data coherency between the processors.
Many ways of enforcing data coherency are known; these include preventing caching of all references to this portion of memory; flushing or emptying an entire cache before reading from this portion of memory, coupled with operating in writethrough mode; and "snooping" caches. Of these, cache "snooping" is often preferred in high-performance systems because it causes less interference with operation than does frequent cache flushing. Cache snoop operations are typically of two types, invalidating and non-invalidating snoops. Both types of snoop operations require access to the tag portion of a cache, but many snoop operations do not need access to the data portion of the cache. An invalidating snoop operation may read the tag memory of a cache to determine if a particular memory address is present in a cache, and if that address is present in the cache the snoop operation may then mark that cache line as not valid. The invalidating snoop operation thus forces any following read operation to return data from higher-level cache or main memory in place of data that might have been previously present in the cache. Non-invalidating snoops will read the tag memory, but will not access data memory unless a particular memory address was found in the tag memory.
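Building on the tag model sketched above, the following example shows how the two snoop types might behave; the snoop function and its arguments are invented for illustration, not taken from the patent.

```python
# Sketch of invalidating vs. non-invalidating snoops, reusing tags/lookup/
# split_address from the previous sketch. Names are illustrative.

def snoop(addr, invalidate):
    """Both snoop types read the tag memory; only a hit touches anything more."""
    way = lookup(addr)                   # tag-memory access only
    if way is None:
        return False                     # not cached: data memory never accessed
    if invalidate:
        index, _ = split_address(addr)
        tags[index][way].valid = False   # later reads must go to higher-level memory
    return True
```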
Cache memory systems may be of the writeback or writethrough type. Whenever a processor writes data to a cache memory of the writethrough type, data is written to the cache and to the next higher level of memory before the processor is allowed to continue operation. Whenever a processor writes data to a cache memory of the writeback type, data is stored to the cache memory and marked "dirty", and the processor is allowed to continue execution. Later, the "dirty" data is written to the next higher level of memory, which may be the main memory. The writing of "dirty" data to higher level memory may be done by a cache write controller that requires access to the cache. Peripherals having DMA capability may be integrated onto a CPU chip along with coprocessors. Each such peripheral or coprocessor may require access to the cache memory, along with the main processor of the CPU chip.
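The two write policies might be contrasted as in the sketch below; the Line and NextLevel classes are invented stand-ins for illustration, not the patent's hardware.

```python
# Illustrative contrast of writethrough vs. writeback; all names are assumed.

class Line:
    def __init__(self):
        self.data = None
        self.dirty = False

class NextLevel:
    """Stand-in for higher-level cache or main memory."""
    def write(self, data):
        pass   # would update DRAM or the next cache level

def write_through(line, data, next_level):
    """The processor waits until both the cache and the next level are updated."""
    line.data = data
    next_level.write(data)   # completes before the processor may continue

def write_back(line, data):
    """The processor continues at once; the dirty line is flushed later."""
    line.data = data
    line.dirty = True        # a cache write controller writes this back later

line = Line()
write_back(line, data=0xCAFE)
assert line.dirty            # awaiting writeback to higher-level memory
```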
When multiple devices, such as two or more processors, snoop units, cache write controllers, peripherals, or coprocessors, may simultaneously request access to a memory system - cache or otherwise - that can not perform all possible requested accesses simultaneously, cache arbitration hardware is provided to control access to the memory system. In a typical cache memory system, or level of a multilevel cache system, with cache arbitration hardware, a single arbitrator allocates access to the entire cache system or level. Further, in most typical set-associative cache memory systems there is a single cache address in use at any one time, common to both the tag memory and the data memory. With cache memory systems of this typical type, the cache can not perform a snoop access to tag memory at the same time as it performs a data access to cache data memory.
Cache memory systems typically handle data as a cache line, which is a unit of data that is often a multiple of a processor word length. Each cache line in the cache data memory is associated with at least one entry in the tag memory; therefore, long cache lines permit use of smaller tag memories than are required with short cache lines, assuming a constant cache data memory size. Cache lines are typically fetched from higher-level cache or main memory in an indivisible operation, which may involve multiple cycles.
Many low-level cache systems have a cache line size that is greater than the number of chip pins available for reading data from off-chip higher-level cache data memories. When a cache miss in the low-level cache occurs, but data is found in the higher-level cache, these low-level cache systems typically read several successive cache data memory words from the higher-level cache. The several successive cache data memory words of the higher-level cache typically comprise part or all of a cache line of the higher-level cache. This is a multicycle cache line read.
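For example, with assumed figures (a 64-byte line and a 16-byte off-chip data path; both numbers are illustrative, not from the patent), the line read takes four cycles:

```python
# Back-of-the-envelope cycle count for a multicycle cache-line read.
LINE_BYTES = 64    # assumed cache-line size
BUS_BYTES = 16     # assumed width of the off-chip data path, in bytes

cycles = LINE_BYTES // BUS_BYTES
print(f"{cycles} successive data-RAM words per cache-line read")   # -> 4
```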
With modern superscalar processors in multiprocessor systems, overall instruction execution speed is often limited by the rate - or bandwidth - at which instructions and data can be fetched from memory and cache. It is therefore desirable to improve cache subsystem bandwidth.
SUMMARY OF THE INVENTION
An N-way set-associative cache has a tag memory and a data memory; the preferred embodiment is 1-way set associative. The cache is accessible from several devices of a system, such as a processor, a writeback controller, and a snoop unit, as is known in the art. The cache performs multicycle cache line reads to its data memory.
The cache has separate arbitration hardware for the tag memory and for the data memory, and a separate address bus to the tag and data memories. This permits the cache to simultaneously execute tag-only operations, such as snoop read operations to the tag memory, and data-only operations, such as later cycles of multicycle cache line reads. Because the cache can perform some operations in parallel, its overall bandwidth is greater than that of typical caches.
BRIEF DESCRIPTION OF THE DRAWINGS
The aforementioned and other features and objects of the present invention and the manner of attaining them will become more apparent and the invention itself will be best understood by reference to the following description of a preferred embodiment taken in conjunction with the accompanying drawings, wherein:
Figure 1 is a block diagram of a prior art multiprocessor computer system having multiple levels of writeback cache, the cache accessible from multiple functional elements and having a single arbiter; Figure 2, a block diagram of a level of prior-art cache memory for a computer system, the cache accessible from multiple functional elements and having a single arbiter;
Figure 3, a timing diagram of a prior-art cache memory performing a multicycle read and a snoop; Figure 4, a block diagram of a cache controller of the present invention having separate tag and data arbiters and address hash blocks;
Figure 5, a timing diagram of a cache controller of the present invention performing a multicycle read and a snoop; and
Figure 6, a block diagram of an alternative embodiment of a cache controller of the present invention having an address latch at the data memory and separate tag and data arbiters.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In a typical computer system, there is generally a system bus 100 for communicating between a processor assembly 101 and other devices of the system. Among the other devices are typically a network interface 102 and a peripheral interface 103 for communicating with disk, tape, CD, and DVD drives 104, which may be of the bus-mastering ultra-wide SCSI type. It is expected that a fibre channel peripheral interface could be utilized in future systems in place of ultra-wide SCSI as peripheral interface 103.
Such computer systems usually also incorporate a main memory 108, a video subsystem 109, keyboard, mouse, printer, and other I/O interfaces 110, and may incorporate one or more additional processor assemblies 111. Systems having four or more processor assemblies 101 and 111 are becoming increasingly common.
Each processor assembly 101 typically incorporates at least one processor 115 and may incorporate one or more coprocessors 116, which may be a combination of one or more floating point, matrix computation, or graphics coprocessors. Typically, the processor 115 and coprocessors 116 address their memory read and write operations to a first-level cache 117, which may in turn redirect these operations to a second-level cache 118, which may in turn redirect these operations to a bus controller 119, which in turn directs these operations to main memory 108 or to an I/O device such as devices 102, 103, 109, or 110.
The first-level cache 117 typically has data RAM 125 for holding data, and tag RAM 126 for holding status and associated memory address information for each cache line. There may be a snoop controller 127 and, if writeback caching is implemented, a writeback controller 128. Since multiple blocks, including the processor block 115 and the snoop controller 127, may access the cache, an arbiter 130 determines which block has access to the cache for any given cycle, and cache controller 131 coordinates operation of the entire cache.
First-level cache 117 is now usually integrated onto the same integrated circuit as the processor 115. Most first-level caches are two- to four-way set-associative caches.
The second-level cache 118 has an architecture similar to that of the first-level cache 117. It also has a tag RAM 135, data RAM 136, snoop controller 137, writeback controller 138, arbiter 139, and cache controller 140. Most second-level caches are single-way set-associative caches, having tag RAM 135 and data RAM 136 located on one or more integrated circuits separate from the processor 115, although fully on-chip implementations are known. On-chip implementations may be two- to four-way set-associative caches.
There may be additional levels of cache, similar to the second-level cache as herein described, interposed between the second-level cache 118 and the memory bus controller 119. In general, each successive level of cache is larger than the lower, or closer to the processor, levels. In typical second-level cache systems, there is a single arbiter 200 (Figure 2), the arbiter forming a part of the second-level cache controller 201. Arbiter 200 determines which of several blocks of the system, whether it be the second-level snoop controller 205, a second-level writeback controller 206, or the first-level cache controller, has access to the second-level cache at any particular instant of time. The selected block provides a target memory address to a single address hash unit 207 of the cache controller 201, which provides a single cache address 210 to both the tag RAM 208 and the data RAM 209. While most address hash units 207 simply select several low-order address bits of the target memory address onto the cache address, more complex transformations are known.
Typically, the tag RAM 208 includes one or more fields (two fields in a two-way cache, four fields in a four-way cache) that correspond to several bits of the target memory address. These tag-memory bits are compared to some or all bits of the target address by tag-address comparator 215 in determining whether a particular operation scores a cache hit. References not scoring a cache hit are passed on to a memory controller 216. The tag RAM 208 and data RAM 209 are often, but need not be, located outside a chip boundary 219 from the cache controller 201.
In a typical cache of prior-art design, in a first cycle 300 (Figure 3), a snoop address for a tag-only snoop operation may appear on the cache address bus 210. In most cache systems, at least some pipelining occurs at the memory; hence in a second cycle 301, tag information associated with the snoop operation appears on input/output lines of the tag RAM 208, while address information for a following access may appear on the cache address lines 210. While a tag-only address may appear in the last cycle of a multicycle cache-line access, such as the last cycle 305 of a processor data request, no overlap is possible of cache tag information on input/output lines of the tag RAM 208 with data information on input/output lines of the data RAM 209. In a second-level cache memory of the present invention, second-level cache controller 400 (Figure 4) incorporates a second-level snoop controller 402 and writeback controller 403. Snoop controller 402, writeback controller 403, and any first-level cache controller (not shown) pass requests for combined data and tag operations to separate tag arbiter 405 and data arbiter 406. They pass requests for tag-only operations, such as a snoop-and-invalidate operation, to the tag arbiter 405, but not to data arbiter 406. A first address hash unit 410 processes a target memory address from a block selected by the tag arbiter 405 and provides a first cache address 411 to the tag RAM 412. A second address hash unit 415 processes a target address, from a block selected by the data arbiter 406, into a second cache address 417 for the data RAM 416. Data RAM 416 is designed such that reading an entire cache line requires a multicycle access.
Tag arbiter 405 is expected to grant access to the tag RAM 412 to such units as the second-level snoop controller 402 for tag-only accesses during the second and following cycles of multicycle cache line accesses to the data RAM 416, including multicycle cache line accesses associated with memory references passed to the second-level cache from the first-level cache controller, or multicycle cache line accesses associated with insertion of data through next-level interface 420 from the memory controller 421.
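To make the dual-arbiter structure concrete, the following Python sketch models arbiters 405 and 406 as independent per-cycle grant units. This is an editorial illustration, not the patented hardware; the Arbiter class, requestor names, and the priority scheme are all assumptions.

```python
# Sketch of separate tag and data arbiters (cf. 405/406 in Figure 4).
# Class, requestor names, and priorities are invented for illustration.

class Arbiter:
    def __init__(self, name):
        self.name = name
        self.requests = []            # (priority, requestor) pairs

    def request(self, requestor, priority=0):
        self.requests.append((priority, requestor))

    def grant(self):
        """Grant one access per cycle to the highest-priority requestor."""
        if not self.requests:
            return None
        self.requests.sort(reverse=True)
        return self.requests.pop(0)[1]

tag_arbiter = Arbiter("tag")      # drives the tag RAM address bus (411)
data_arbiter = Arbiter("data")    # drives the data RAM address bus (417)

# A combined tag-and-data operation requests both arbiters; a tag-only
# snoop requests only the tag arbiter, leaving the data RAM untouched.
data_arbiter.request("L1-miss line read", priority=1)
tag_arbiter.request("L1-miss line read", priority=1)
tag_arbiter.request("snoop", priority=0)

print(tag_arbiter.grant(), "/", data_arbiter.grant())  # both go to the line read
# Next cycle: the snoop wins the tag RAM while the line read still owns the data RAM.
print(tag_arbiter.grant(), "/", data_arbiter.grant())
```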
With the cache of the present invention, a multicycle cache line read address for writeback of data from a lower-level cache to a higher-level cache may appear in a first cycle 500 (Figure 5), with associated tag information and data in the following cycle 501. An address for a tag-only operation may appear on the cache tag address lines during cycle 501, such that a tag-only access of the tag memory may occur during the second 502 or a following cycle of the multicycle cache line access; this may be a snoop access as illustrated. Similarly, a processor access, or an access to fill a line of a lower-level cache, may provide a tag and data address in a cycle 510, with tag and data returned in a following cycle 511. The remaining data of a multicycle cache-line read follows in subsequent cycles 512. During the remaining cycles 512 of the cache line read, the tag memory is available for any necessary tag-only access 514. There is therefore an overlap between data RAM and tag RAM accesses for tag-only and data-plus-tag operations that is not possible with a conventional cache, and overall cache bandwidth is improved.
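The overlap of Figure 5 can be illustrated with a toy cycle schedule; the four-cycle line read and the placement of the snoop in cycle 1 are assumed values for this sketch, not the patent's actual timing.

```python
# Toy cycle-by-cycle view of the Figure 5 overlap (cycle numbering assumed).
# A 4-cycle line read occupies the data RAM; the tag RAM frees up after cycle 0.
for cycle in range(4):
    tag_op = ("line-read tag check" if cycle == 0
              else "snoop (tag-only)" if cycle == 1
              else "idle/available")
    data_op = f"line-read word {cycle}"
    print(f"cycle {cycle}: tag RAM = {tag_op:20s} data RAM = {data_op}")
```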
In an alternative embodiment of the present invention, there is a tag RAM 600 (Figure 6) fed by a common tag/data cache address bus 601. The common cache address bus 601 feeds through a cache address register 602 to a cache data RAM 603. Cache data RAM 603 is connected to the cache controller 605 through a data in/out bus 606 and read/write control lines 607, as are known in the art. A pair of low-order address lines 610, used to address sequential cache words of a multicycle cache line access, enter the data RAM from a counter 611 of the cache controller 605.
In operation, a tag/data access arbiter 615 of the cache controller 605 allocates accesses among requests from a processor or lower-level cache port 616, a snoop controller 617, a writeback controller 618, and any other ports that may access cache at this level (not shown). Preference in arbitration is given to accesses by a port, such as the processor or lower-level cache port 616, or the writeback controller 618, that requests a multicycle data access.
If a multicycle data access is granted, the appropriate cache address is generated by an address hash unit 620 and placed on cache address lines 601, together with a first-address indication on the low-order address lines 610. This cache address is also latched into the address register 602, which may be built of latch cells or of D-flip-flop cells. Any tag access required with the multicycle data access may occur in parallel with the first cycle of the multicycle data access.
During the next cycle, the first data cycle of the multicycle cache line access, cache data is transferred on the data in/out bus 606.
During the first cycle of the multicycle cache line access, the arbiter 615 can not grant access for another multicycle data access, since the data RAM is busy, but it can grant access for a tag-only access such as may be required by snoop controller 617 or writeback controller 618. It is known that writeback controller 618 can be implemented in several ways, including a tag-scanning version that searches tags for "dirty" status, which would require such tag-only accesses as well as multicycle data accesses, and a queued writeback version that maintains a list of addresses needing writeback in a separate memory.
Should arbiter 615 grant access for a tag-only access while data RAM 603 is busy with following cycles of a multicycle cache-line read, the appropriate cache address for the tag-only access is generated by the address hash unit 620 and placed on the cache address lines 601. This address is used by the tag RAM for reading or writing cache tag information on cache tag lines 621 in the next cycle; while the data RAM 603 is still using the cache address latched in the register 602. In this way, cache bandwidth is improved because tag-only accesses may execute in parallel with the later cycles of the multicycle cache line accesses.
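A behavioral sketch of this latched-address arrangement follows; the LatchedDataRAM class, its method names, and the four-word line are invented for illustration and merely mimic the roles of register 602 and counter 611.

```python
# Sketch of the single-address-bus variant of Figure 6: the data RAM latches
# the line address (register 602) and a counter (611) steps through the words,
# freeing the shared address bus for tag-only accesses on later cycles.
# All class and signal names are assumptions for this illustration.

class LatchedDataRAM:
    def __init__(self, words_per_line=4):
        self.latched_addr = None      # models address register 602
        self.counter = 0              # models low-order word counter 611
        self.words_per_line = words_per_line

    def start_line_access(self, cache_addr):
        self.latched_addr = cache_addr   # captured from the shared bus (601)
        self.counter = 0

    def busy(self):
        return self.latched_addr is not None

    def step(self):
        """One data cycle: the RAM uses its latched address, not the shared bus."""
        word = (self.latched_addr, self.counter)
        self.counter += 1
        if self.counter == self.words_per_line:
            self.latched_addr = None     # line access complete
        return word

ram = LatchedDataRAM()
ram.start_line_access(cache_addr=0x1A0)  # cycle 0: bus carries the address, then latched
while ram.busy():
    ram.step()                            # bus is free for tag-only accesses meanwhile
```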
It is expected that a cache controller according to Figure 6 may, but need not, have the address register 602 integrated with the data RAM 603 chips external to the processor. Further, the counter 611 may be integrated with the data RAM 603 chips, or may be integrated on the chip having the remainder of the cache controller 605. If this is done with a data memory having multiple integrated circuits, the counter and address register must be duplicated in each such data memory integrated circuit.
In yet another embodiment of the present invention, processor tag accesses are performed if the tag memory is available when they become pending, even if the data memory is not yet available. This embodiment allows those processor operations that miss in the cache to be identified and directed to higher levels of memory while late cycles of unrelated multicycle cache line read or write operations are in progress. Those processor operations that score hits are queued for multicycle cache-line data accesses when the data memory becomes available. This embodiment is of particular advantage with high-performance processors capable of out-of-order execution.
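A sketch of this tag-first policy follows, reusing the lookup model from the first sketch; the queue and function names are invented for illustration and are not the patent's terminology.

```python
# Sketch of the tag-first embodiment: tag checks proceed whenever the tag RAM
# is free; hits are queued for the data RAM, misses go straight to higher memory.
from collections import deque

pending_data_accesses = deque()

def tag_first_access(addr, data_ram_busy):
    way = lookup(addr)                          # tag RAM usable even if data RAM busy
    if way is None:
        return "forward to higher-level memory" # miss identified early
    if data_ram_busy:
        pending_data_accesses.append((addr, way))  # serviced when data RAM frees up
        return "hit queued"
    return "hit serviced"
```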
It is expected that the multiple-arbiter cache memory of the present invention may also be implemented as a third-level cache, or as a first-level cache. It is expected to be particularly beneficial for first-level cache implementations. Further, some benefit in cache memory bandwidth may be obtained with a cache having multiple arbitrators as described herein even if the data RAM is of the type that does not require a multi-cycle read to read a cache line.
While there have been described above the principles of the present invention in conjunction with specific embodiments thereof, it is to be clearly understood that the foregoing description is made only by way of example and not as a limitation to the scope of the invention. Particularly, it is recognized that the teachings of the foregoing disclosure will suggest other modifications to those persons skilled in the relevant art. Such modifications may involve other features which are already known per se and which may be used instead of or in addition to features already described herein. Although claims have been formulated in this application to particular combinations of features, it should be understood that the scope of the disclosure herein also includes any novel feature or any novel combination of features disclosed either explicitly or implicitly or any generalization or modification thereof which would be apparent to persons skilled in the relevant art, whether or not such relates to the same invention as presently claimed in any claim and whether or not it mitigates any or all of the same technical problems as confronted by the present invention. The applicants hereby reserve the right to formulate new claims to such features and/or combinations of such features during the prosecution of the present application or of any further application derived therefrom.

Claims

What is claimed is:
1. A cache system for use in computer systems, the cache system comprising: a tag memory having an address input and a tag output; a data memory having an address input and a data output; a cache control unit, the cache control unit further comprising: a first arbitration unit for arbitrating access to the tag memory; a second arbitration unit for arbitrating access to the data memory; a tag comparator for comparing an output of the tag memory to at least some bits of a memory address.
2. The cache system of Claim 1, wherein the cache control unit has a first address output coupled to the address input of the tag memory, and a second address output coupled to the address input of the data memory.
3. The cache system of Claim 2, wherein the cache control unit is part of a first integrated circuit, and at least part of the data memory is part of a second integrated circuit.
4. The cache system of Claim 3, wherein the tag memory is part of the first integrated circuit.
5. The cache system of Claim 3, wherein at least part of the tag memory is part of a third integrated circuit.
6. The cache system of Claim 3, wherein the cache control unit further comprises a cache writeback controller, wherein the first arbitration unit arbitrates accesses to the tag memory among a first group of elements comprising a processor port of the cache memory and the cache writeback controller, and wherein the second arbitration unit arbitrates accesses to the data memory between a second group of elements comprising the processor port of the cache memory and the cache writeback controller.
7. The cache system of Claim 6, wherein the cache control unit further comprises a snoop access controller, and wherein the first group of elements for which the first arbitration unit arbitrates access to the tag memory further comprises the snoop access controller.
8. The cache system of Claim 7, wherein the cache controller performs a multicycle read from the data memory when the tag comparator indicates that a cache hit has occurred.
9. A cache system for use in computer systems, the cache system comprising: a tag memory having an address input and a tag output; a data memory having an address input and a data output, the address input having a plurality of lines; an address register having an address input and coupled to at least a plurality of lines of the address input of the data memory; a cache address bus having a plurality of lines, a plurality of lines of the cache address bus being coupled to the address input of the address register and a plurality of lines of the cache address bus being coupled to the address input of the tag memory; a cache control unit, the cache control unit further comprising: an arbitration unit for arbitrating access to the cache; and a tag comparator for comparing an output of the tag memory to at least some bits of a memory address; and wherein the arbitration unit of the cache control unit arbitrates between tag-and-data and tag-only access requests, and wherein the arbitration unit of the cache control unit is configured to allow access of the tag RAM for a tag-only access while the data RAM is busy with a data-only cycle of a multicycle cache-line access.
10. The cache system of Claim 9, wherein the address register is integrated onto each integrated circuit comprising part of the data memory.
11. A cache controller comprising an arbiter, wherein the arbiter arbitrates between tag-and-data access requests and tag-only access requests, and wherein the arbiter is configured to allow access of the tag RAM for a tag-only access while the data RAM is completing a data-only cycle of a multicycle cache-line access.
12. A computer system comprising: at least one processor; at least one lower-level cache coupled to the at least one processor; at least one upper-level cache coupled to process references from the at least one processor that miss in the lower-level cache, and to direct references that miss in the upper-level cache to a higher level of memory; wherein the upper-level cache further comprises a cache controller, at least one cache address bus, a data memory coupled to a cache address bus of the at least one cache address bus, and a tag memory coupled to a cache address bus of the at least one cache address bus; wherein the data memory of the upper-level cache performs multicycle cache-line read operations to fill cache lines of the lower-level cache when a miss occurs in the lower-level cache and a hit occurs in the upper-level cache; and wherein the tag memory of the upper-level cache is capable of being accessed for a tag operation unrelated to, but simultaneously with, a multicycle cache line read operation of the data memory of the upper-level cache.
13. The computer system of Claim 12, wherein the data memory of the higher-level cache is coupled to a cache address bus of the at least one cache address bus through a register for holding a plurality of bits of a cache address during second and following cycles of a multicycle cache line read operation.
14. The computer system of Claim 12, wherein the tag operation unrelated to, but performed simultaneously with, a multicycle cache line read operation is a snoop operation.
15. The computer system of Claim 12, wherein the tag operation unrelated to, but performed simultaneously with, a multicycle cache line read operation is a tag operation of a processor access.
PCT/US2001/013269 2000-05-17 2001-04-24 Speed cache having separate arbitration for second-level tag and data cache rams WO2001088719A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001257238A AU2001257238A1 (en) 2000-05-17 2001-04-24 Speed cache having separate arbitration for second-level tag and data cache rams

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US57261100A 2000-05-17 2000-05-17
US09/572,611 2000-05-17

Publications (2)

Publication Number Publication Date
WO2001088719A2 (en) 2001-11-22
WO2001088719A3 WO2001088719A3 (en) 2002-03-07

Family

ID=24288605

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/013269 WO2001088719A2 (en) 2000-05-17 2001-04-24 Speed cache having separate arbitration for second-level tag and data cache rams

Country Status (2)

Country Link
AU (1) AU2001257238A1 (en)
WO (1) WO2001088719A2 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5692152A (en) * 1994-06-29 1997-11-25 Exponential Technology, Inc. Master-slave cache system with de-coupled data and tag pipelines and loop-back
US5809537A (en) * 1995-12-08 1998-09-15 International Business Machines Corp. Method and system for simultaneous processing of snoop and cache operations

Also Published As

Publication number Publication date
AU2001257238A1 (en) 2001-11-26
WO2001088719A3 (en) 2002-03-07

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP