US20060048120A1 - Fine-grained software-directed data prefetching using integrated high-level and low-level code analysis optimizations

Info

Publication number
US20060048120A1
Authority
US
United States
Prior art keywords
stream, instructions, loop, streams, level
Prior art date
Legal status
Granted
Application number
US10/926,595
Other versions
US7669194B2 (en)
Inventor
Roch Archambault
Robert Blainey
Yaoqing Gao
Allan Martin
James McInnes
Francis O'Connell
Current Assignee
International Business Machines Corp
Original Assignee
International Business Machines Corp
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to US10/926,595 priority Critical patent/US7669194B2/en
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ARCHAMBAULT, ROCH GEORGES; MCINNES, JAMES LAWRENCE; GAO, YAOGING; MARTIN, ALLAN RUSSELL; O'CONNELL, FRANCIS PATRICK; BLAINEY, ROBERT JAMES
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. CORRECTION TO REEL AND FRAME 015148/0670. Assignors: ARCHAMBAULT, ROCH GEORGES; MCINNES, JAMES LAWRENCE; GAO, YAOQING; MARTIN, ALLAN RUSSELL; O'CONNELL, PATRICK FRANCIS; BLAINEY, ROBERT JAMES
Publication of US20060048120A1 publication Critical patent/US20060048120A1/en
Priority to US12/644,756 priority patent/US8413127B2/en
Application granted granted Critical
Publication of US7669194B2 publication Critical patent/US7669194B2/en
Current status: Expired - Fee Related (adjusted expiration)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/40: Transformation of program code
    • G06F8/41: Compilation
    • G06F8/44: Encoding
    • G06F8/443: Optimisation
    • G06F8/4441: Reducing the execution time required by the program code
    • G06F8/4442: Reducing the number of cache misses; Data prefetching


Abstract

A mechanism for minimizing effective memory latency without unnecessary cost through fine-grained software-directed data prefetching using integrated high-level and low-level code analysis and optimizations is provided. The mechanism identifies and classifies streams, identifies data that is most likely to incur a cache miss, exploits effective hardware prefetching to determine the proper number of streams to be prefetched, exploits effective data prefetching on different types of streams in order to eliminate redundant prefetching and avoid cache pollution, and uses high-level transformations with integrated lower level cost analysis in the instruction scheduler to schedule prefetch instructions effectively.

Description

    BACKGROUND OF THE INVENTION
  • 1. Technical Field
  • The present invention relates to a method of minimizing effective memory latency without unnecessary cost. In particular, the present invention relates to fine-grained software-directed data prefetching using integrated high-level and low-level code analysis and optimizations.
  • 2. Description of Related Art
  • In conventional computing systems, prefetching is a well-known technique for tolerating memory access latency, which can adversely affect the performance of applications on modern processors. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. Much of the recent work in the area of prefetching has focused on three dimensions of prefetching effectiveness: timeliness, accuracy, and overhead. Timeliness is the placement of prefetches such that the latency to memory is effectively hidden. Accuracy is prefetching data that will actually be used by the program before it is needed, while avoiding prefetches that will not be used and merely pollute the caches. Overhead is the cost of the resources consumed by the prefetch instructions themselves.
  • Data prefetching can be accomplished by software alone, by hardware alone, or by a combination of the two. Software prefetching relies on compile-time analysis to insert and schedule prefetch, or touch, instructions within user programs, but the prefetch instructions themselves involve some overhead. Hardware-based prefetching employs special hardware which monitors the storage reference patterns of the application in an attempt to infer prefetching opportunities. It has no instruction overhead, but it is often less accurate than software prefetching because it speculates on future memory accesses without the benefit of compile-time information. The combination of software and hardware prefetching is designed to take advantage of compile-time program information to direct the hardware prefetcher while incurring as little software overhead as possible.
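  • As a concrete illustration (not taken from the patent itself), compiler-inserted software prefetching commonly looks like the following sketch, here written with GCC's __builtin_prefetch intrinsic; the prefetch distance AHEAD is an assumed tuning parameter.
    #include <stddef.h>

    #define AHEAD 16   /* assumed prefetch distance, in elements */

    /* Sum an array while prefetching the data AHEAD iterations early. */
    double sum_with_prefetch(const double *a, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + AHEAD < n)
                __builtin_prefetch(&a[i + AHEAD], /*rw=*/0, /*locality=*/3);
            sum += a[i];
        }
        return sum;
    }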
  • The IBM Power4 and Power5 systems have storage hierarchies consisting of three levels of cache and the memory subsystem: on-chip L1 and L2 caches and an off-chip L3 cache. They employ hardware data prefetching to identify and automatically prefetch streams without any assistance from software. Still, there are shortcomings associated with hardware prefetching. Hardware prefetching does not begin immediately, since it takes several cache misses before a stream is identified. Additionally, hardware supports only a limited number of streams to prefetch; if there are more concurrent streams than the hardware supports, a replacement algorithm is employed, and the hardware may not prefetch the most profitable streams. Furthermore, hardware may prefetch more data than necessary, since it does not know a priori where the end of a stream is.
  • SUMMARY OF THE INVENTION
  • The present invention provides a mechanism for minimizing effective memory latency without unnecessary cost through fine-grained software-directed data prefetching using integrated high-level and low-level code analysis and optimizations. The mechanism identifies and classifies streams based on reuse analysis and dependence analysis. The mechanism makes use of the information from high-level loop transformations, data remapping, and work data-set analysis to identify which data is most likely to incur a cache miss. The mechanism exploits effective hardware prefetching through high-level loop transformations, including locality and reuse analysis, to determine the proper number of streams. The mechanism exploits effective data prefetching on different types of streams, based on compiler static analysis and dynamic profiling information, in order to eliminate redundant prefetching and avoid cache pollution. The mechanism uses high-level transformations with integrated lower level cost analysis in the instruction scheduler to schedule prefetch instructions effectively.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
  • FIG. 1 is a pictorial representation of a data processing system in which the present invention may be implemented in accordance with a preferred embodiment of the present invention;
  • FIG. 2 is a block diagram of a data processing system in which the present invention may be implemented;
  • FIG. 3 is a diagram illustrating an exemplary implementation of components in accordance with the present invention;
  • FIG. 4 is a high-level flow diagram illustrating the operation of data prefetching in accordance with a preferred embodiment of the present invention;
  • FIG. 5 is a flow diagram illustrating the operation of the stream identification process in accordance with a preferred embodiment of the present invention;
  • FIG. 6 is a flow diagram illustrating the operation of the stream classification process in accordance with a preferred embodiment of the present invention;
  • FIG. 7 is a flow diagram illustrating the operation of the stream selection process in accordance with a preferred embodiment of the present invention; and
  • FIG. 8 is a flow diagram illustrating the operation of the prefetching and directive insertion in accordance with a preferred embodiment of the present invention.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
  • With reference now to the figures and in particular with reference to FIG. 1, a pictorial representation of a data processing system in which the present invention may be implemented is depicted in accordance with a preferred embodiment of the present invention. A computer 100 is depicted which includes system unit 102, video display terminal 104, keyboard 106, storage devices 108, which may include floppy drives and other types of permanent and removable storage media, and mouse 110. Additional input devices may be included with personal computer 100, such as, for example, a joystick, touchpad, touch screen, trackball, microphone, and the like. Computer 100 can be implemented using any suitable computer, such as an IBM eServer computer or IntelliStation® computer, which are products of International Business Machines Corporation, located in Armonk, N.Y. Although the depicted representation shows a computer, other embodiments of the present invention may be implemented in other types of data processing systems, such as a network computer. Computer 100 also preferably includes a graphical user interface (GUI) that may be implemented by means of systems software residing in computer readable media in operation within computer 100.
  • With reference now to FIG. 2, a block diagram of a data processing system is shown in which the present invention may be implemented. Data processing system 200 is an example of a computer, such as computer 100 in FIG. 1, in which code or instructions implementing the processes of the present invention may be located. Data processing system 200 employs a peripheral component interconnect (PCI) local bus architecture. Although the depicted example employs a PCI bus, other bus architectures such as Accelerated Graphics Port (AGP) and Industry Standard Architecture (ISA) may be used. Processor 202 and main memory 204 are connected to PCI local bus 206 through PCI bridge 208. PCI bridge 208 also may include an integrated memory controller and cache memory for processor 202. Additional connections to PCI local bus 206 may be made through direct component interconnection or through add-in connectors.
  • In the depicted example, local area network (LAN) adapter 210, small computer system interface SCSI host bus adapter 212, and expansion bus interface 214 are connected to PCI local bus 206 by direct component connection. In contrast, audio adapter 216, graphics adapter 218, and audio/video adapter 219 are connected to PCI local bus 206 by add-in boards inserted into expansion slots. Expansion bus interface 214 provides a connection for a keyboard and mouse adapter 220, modem 222, and additional memory 224. SCSI host bus adapter 212 provides a connection for hard disk drive 226, tape drive 228, and CD-ROM drive 230. Typical PCI local bus implementations will support three or four PCI expansion slots or add-in connectors.
  • An operating system runs on processor 202 and is used to coordinate and provide control of various components within data processing system 200 in FIG. 2. The operating system may be a commercially available operating system such as Windows XP™, which is available from Microsoft Corporation. An object-oriented programming system, such as the Java™ programming system, may run in conjunction with the operating system and provide calls to the operating system from Java™ programs or applications executing on data processing system 200. "JAVA" is a trademark of Sun Microsystems, Inc. Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 226, and may be loaded into main memory 204 for execution by processor 202.
  • Those of ordinary skill in the art will appreciate that the hardware in FIG. 2 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash read-only memory (ROM), equivalent nonvolatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in FIG. 2. Also, the processes of the present invention may be applied to a multiprocessor data processing system.
  • For example, data processing system 200, if optionally configured as a network computer, may not include SCSI host bus adapter 212, hard disk drive 226, tape drive 228, and CD-ROM 230. In that case, the computer, to be properly called a client computer, includes some type of network communication interface, such as LAN adapter 210, modem 222, or the like. As another example, data processing system 200 may be a stand-alone system configured to be bootable without relying on some type of network communication interface, whether or not data processing system 200 comprises some type of network communication interface. As a further example, data processing system 200 may be a personal digital assistant (PDA), which is configured with ROM and/or flash ROM to provide non-volatile memory for storing operating system files and/or user-generated data.
  • The depicted example in FIG. 2 and above-described examples are not meant to imply architectural limitations. For example, data processing system 200 also may be a notebook computer or hand held computer in addition to taking the form of a PDA. Data processing system 200 also may be a kiosk or a Web appliance.
  • The processes of the present invention are performed by processor 202 using computer implemented instructions, which may be located in a memory such as, for example, main memory 204, memory 224, or in one or more peripheral devices 226-230.
  • The present invention provides a mechanism for minimizing effective memory latency without unnecessary cost through fine-grained software-directed data prefetching using integrated high-level and low-level code analysis and optimizations. The mechanism identifies and classifies streams based on reuse analysis and dependence analysis. The mechanism makes use of the information from high-level loop transformations, data remapping, and work data-set analysis to identify which data is most likely to incur a cache miss. The mechanism exploits effective hardware prefetching through high-level loop transformations, including locality and reuse analysis, to determine the proper number of streams. The mechanism exploits effective data prefetching on different types of streams, based on compiler static analysis and dynamic profiling information, in order to eliminate redundant prefetching and avoid cache pollution. The mechanism uses high-level transformations with integrated lower level cost analysis in the instruction scheduler to schedule prefetch instructions effectively.
  • Turning now to FIG. 3, a diagram illustrating an exemplary implementation of components 202, 204 and 208 in FIG. 2 is depicted in accordance with the present invention. As shown in FIG. 3, in this illustrative example, processor 202 and main memory 204 in FIG. 2 may be implemented as processor 300 and main memory 310 in FIG. 3. However, PCI bridge 208 in FIG. 2 may include two or more levels of cache memory. In this example, level 1 cache 304, level 2 cache 306, and level 3 cache 308 are depicted. Level 1 cache 304 may be a fast memory with a small size, such as 64 kilobytes. Level 1 cache 304 is sometimes referred to as a "primary cache." This cache is located between the processor, such as processor 300, and level 2 cache 306. Depending on the implementation, level 1 cache 304 may be integrated on the same integrated circuit as processor 300. Level 1 cache 304 is also more expensive than level 2 cache 306 because of its faster access speed.
  • Level 2 cache 306, a secondary cache, is generally larger and slower than level 1 cache 304. Level 2 cache 306 is generally located between level 1 cache 304 and main memory 310. When cache misses occur in level 1 cache 304, processor 300 may attempt to retrieve data from level 2 cache 306 prior to searching for the data in main memory 310. Unlike level 1 cache 304, level 2 cache 306 is often located external to the integrated circuit of processor 300, although, depending on the implementation, level 2 cache 306 may be integrated on the same integrated circuit as processor 300. Level 2 cache 306 may also be cheaper to produce than level 1 cache 304 because of its slower access speed. In addition to level 1 and level 2 caches, other levels may also be added to PCI bridge 208 in FIG. 2, for example, level 3 cache 308, which may be even larger in size than level 2 cache 306 and may have a slower access time.
  • Turning now to FIG. 4, a high-level flow diagram 400 illustrating the operation of data prefetching is depicted in accordance with a preferred embodiment of the present invention. First, high-level loop transformation and data remapping are performed for locality optimization (block 402). Locality optimization may be either spatial or temporal. Spatial locality means that if a memory location is accessed, then a nearby location will most likely be accessed in the near future. Temporal locality means that if a memory location is accessed, then it will most likely be accessed again in the near future. Various types of high-level loop transformations may be utilized in performing the locality optimization, such as loop fusion, loop unimodular transformation, loop distribution, outer and inner loop unrolling, loop tiling, and temporal vector optimization, though other types of high-level loop transformations may be utilized. An illustrative tiling sketch follows.
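  • As an illustrative sketch (not taken from the patent), loop tiling improves temporal locality by working on cache-sized blocks; the tile size B below is an assumed tuning parameter chosen so a tile fits in cache, and N is assumed divisible by B.
    #define N 1024
    #define B 64    /* assumed tile size */

    /* An untiled transpose reads b with stride N and evicts lines quickly;
       the tiled version keeps a B x B working set resident in cache. */
    void transpose_tiled(double a[N][N], double b[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        b[j][i] = a[i][j];
    }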
  • The information related to each loop is recorded in a loop table with an entry corresponding to each loop. Next, inter-loop analysis and work data-set analysis are performed to identify data access relationships between loops and to estimate the data-set size for each loop nest (block 404). Loop selection is performed to select profitable loops to produce a candidate loop list (block 406). In this step, a profitable loop is selected based on static and dynamic profile information. That is, the loops executed most frequently and the loops with large data-set sizes, where cache misses are most likely to happen, are selected (a minimal sketch of this selection follows). The candidate loop list is then checked to see if there are candidate loops within the list (block 408), and if there is a loop in the candidate loop list, a loop is selected from the candidate loop list for processing.
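  • The following is a minimal sketch of such profile-guided loop selection; the structure fields, thresholds, and scoring rule are assumptions for illustration, not the patent's specification.
    typedef struct {
        long exec_count;      /* dynamic profile: times the loop body ran */
        long data_set_bytes;  /* estimated working set from analysis      */
        int  is_candidate;
    } loop_info_t;

    /* Mark loops whose frequency and working set suggest cache misses. */
    void select_candidate_loops(loop_info_t *loops, int n,
                                long min_count, long cache_bytes) {
        for (int i = 0; i < n; i++)
            loops[i].is_candidate =
                loops[i].exec_count >= min_count &&
                loops[i].data_set_bytes > cache_bytes;
    }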
  • All memory references in the loop are then gathered, and data dependency analysis and reuse analysis are used to identify unique streams within the loop (block 410). Stream classification is then performed, classifying the streams into load streams, store streams, indexed streams, and strided streams, though more or fewer types of streams may be used depending on the implementation (block 412). As the streams are classified into stream types, they are entered into a stream table. Based on static and dynamic profile information, the streams are classified as finite or infinite streams. A selection of the most profitable streams is performed, and those streams are marked as protected until the number of protected streams reaches the number of streams supported by the hardware. The most profitable streams are identified based on high-level loop transformation guided information, such as temporal vector optimization and loop tiling, work data-set analysis to find the earliest point in a program at which the stream may be prefetched, and off-line learning by gathering the runtime hardware performance counters (block 414).
  • A high-level loop cost estimate is performed to calculate the loop body cost and to estimate how far ahead data should be prefetched (block 416). Prefetch instruction insertion and annotation are then performed (block 418). In this step, proper prefetch control instructions are inserted at an optimal location based on stream types. Directives are also inserted by the high-level optimizations to guide the low-level optimizations in making later adjustments. Finally, redundancy elimination is performed (block 420): based on high-level global analysis, redundant prefetch instructions can be eliminated if the data is most likely already in the cache (a hedged sketch of this step follows).
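  • The following sketch illustrates one simple form such redundancy elimination could take; the line size and the flat address-list representation are assumptions, and a real compiler would work over its intermediate representation rather than raw addresses.
    #define LINE 128   /* assumed cache line size in bytes */

    /* Drop prefetches whose target cache line is already covered by an
       earlier prefetch in the list; returns the number kept. */
    int eliminate_redundant(long addr[], int n) {
        int kept = 0;
        for (int i = 0; i < n; i++) {
            int covered = 0;
            for (int j = 0; j < kept; j++) {
                if (addr[i] / LINE == addr[j] / LINE) {
                    covered = 1;
                    break;
                }
            }
            if (!covered)
                addr[kept++] = addr[i];   /* keep this prefetch */
        }
        return kept;
    }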
  • From block 420, the process returns to block 408. The candidate loop list is checked to see if there are still candidate loops within the list (block 408). If so, the process starts again with block 410; otherwise the process proceeds to block 422. Low-level traditional optimizations are performed on the streams (block 422). Low-level optimizations that may be utilized are commoning, value numbering, and reassociation, though other types of low-level optimizations may be used depending on the implementation. Finally, an instruction scheduler adjusts the prefetch instructions based on the high-level inserted directives and a precise low-level loop cost calculation (block 424). This allows prefetch instructions to be moved sufficiently far in advance of the use of their data through software pipelining and instruction scheduling; when sufficient software pipelining is not possible, the prefetch address is adjusted to fetch cache lines sufficiently far in advance.
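  • The distance adjustment in block 424 amounts to simple arithmetic over the loop cost; a hedged sketch follows, where the cycle counts are assumed inputs rather than values given by the patent. With, say, a 400-cycle miss penalty and a 25-cycle loop body, the prefetch would be placed 16 iterations ahead.
    /* How many iterations ahead to prefetch so a line arrives in time:
       distance = ceil(miss_penalty / loop_body_cycles), at least 1. */
    int prefetch_distance(int miss_penalty_cycles, int loop_body_cycles) {
        if (loop_body_cycles <= 0)
            return 1;
        int d = (miss_penalty_cycles + loop_body_cycles - 1) / loop_body_cycles;
        return d > 0 ? d : 1;
    }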
  • Turning now to FIG. 5, a flow diagram 500 illustrating the method of the stream identification process of block 410 in FIG. 4 is depicted in accordance with a preferred embodiment of the present invention. A stream is a sequence of addresses that depends on the inner-loop induction variable with a constant stride that is less than the L1 data cache line size. All the memory references of the loop identified at block 408 in FIG. 4 are gathered into a loop reference list (block 502). For each memory reference in the loop reference list, a check is performed to see if the memory reference is a stream reference that can be represented in a canonical subscript form (block 504). The distance between memory references is then computed, and all of the unique streams are gathered into a stream list based upon data dependency and reuse analysis (block 506). Reuse analysis attempts to discover those instances of array accesses that refer to the same memory line. Data dependency is the relation between statements of the program that access the same memory locations. A sketch of this grouping step follows.
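  • The following sketch illustrates the grouping test under stated assumptions: the canonical subscript form is modeled as base + i * stride + offset, and the line size is assumed; this representation is for illustration only.
    #define L1_LINE_BYTES 128   /* assumed L1 data cache line size */

    /* Canonical subscript form: address = base + i * stride + offset. */
    typedef struct {
        const void *base;     /* base symbol of the reference            */
        long        stride;   /* bytes advanced per inner-loop iteration */
        long        offset;   /* constant byte offset                    */
    } mem_ref_t;

    /* Two references belong to the same stream if they share a base and
       stride and their constant offsets fall within one cache line. */
    int same_stream(const mem_ref_t *x, const mem_ref_t *y) {
        long dist = x->offset - y->offset;
        if (dist < 0)
            dist = -dist;
        return x->base == y->base &&
               x->stride == y->stride &&
               dist < L1_LINE_BYTES;
    }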
  • Turning now to FIG. 6, a flow diagram 600 illustrating the method of the stream classification process of block 412 in FIG. 4 is depicted in accordance with a preferred embodiment of the present invention. Based on memory access patterns, each of the streams identified in the stream identification process and stored in the stream list is classified into a stream type (block 602). The stream type classifications include load streams, store streams, indexed streams, regular strided streams, and irregular strided streams, though more or fewer classifications may be used depending on the implementation. A stream is a load stream if it includes at least one load (e.g., b and c in Example 1); otherwise it is a store stream (e.g., a in Example 1).
    double a[N], b[N], c[N];
    for (i=0; i<N; i++) {
      a[i] = b[i]*c[i];
    }
  • EXAMPLE 1 Load/Store Stream
  • A stream is called an indexed stream if it is accessed indirectly through another load stream (e.g., b in Example 2).
    int a[N];
    double b[N];
    for (i=0; i<N; i++) {
      ...= ... b[ a[i] + 8 ];
    }
  • EXAMPLE 2 Indexed Stream
  • A stream is called a strided stream if its stride is either unknown or a constant larger than the L1 data cache line size (an illustrative sketch follows). Based on static analysis and dynamic profile information, the stream length is estimated and streams are marked as having limited or unlimited length (block 604). As the streams are classified into stream types, they are entered into a stream table.
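  • For illustration (a sketch, not taken from the patent), a constant stride larger than a cache line means every access lands on a new line, so each iteration is a potential miss; the sizes below are assumptions.
    #define N      (1 << 20)
    #define STRIDE 32            /* 32 doubles = 256 bytes, assumed larger
                                    than one L1 cache line */
    double a[N];

    /* Each access touches a new cache line; a compiler would classify
       the accesses to `a` here as a (regular) strided stream. */
    double sum_strided(void) {
        double sum = 0.0;
        for (int i = 0; i < N; i += STRIDE)
            sum += a[i];
        return sum;
    }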
  • Turning now to FIG. 7, a flow diagram 700 illustrating the method of the stream selection process of block 414 in FIG. 4 is depicted in accordance with a preferred embodiment of the present invention. The most profitable streams are marked as protected until the number of protected streams reaches the number of hardware-protected streams (block 702), based upon the high-level transformations, static analysis information, and dynamic profile information gathered in blocks 402, 404, 406, 410, and 412 in FIG. 4.
  • Turning now to FIG. 8, a flow diagram 800 illustrating the method of the prefetching and directive insertion of block 418 in FIG. 4 is depicted in accordance with a preferred embodiment of the present invention. A stream is obtained from the stream list and checked for stream type (block 802). A determination is made as to whether the stream type is a load stream (block 804). If so, the process continues to block 814, where load stream prefetching is performed. When the number of streams in a loop is less than that supported by hardware, prefetch instructions are placed in the loop pre-head for all identified streams to reduce hardware startup time, and all the streams are marked as protected to avoid performance degradation from unexpected address conflicts. Furthermore, to prevent cache pollution, streams are marked as limited or unlimited based on their lengths.
  • In Example 3, if the length of a stream is less than 1024 cache lines, the stream is marked as a protected, limited-length stream.
    _protected_stream_set(FORWARD, a, 1);
    _protected_stream_count(N/16, 1);
    _protected_stream_set(FORWARD, b, 2);
    _protected_stream_count(N/16, 2);
    _eieio( );
    _protected_stream_go( );
    for (i=0; i<N; i++) {
      c[i] = c[i] + a[i] * b[i];
    }
  • EXAMPLE 3 Length of a Stream Less Than 1024 Cache Lines
  • In Example 4, if the length of a stream is equal to or larger than 1024 cache lines, the stream is marked as a protected, unlimited stream.
    _protected_unlimited_stream_set_go(FORWARD, a, 1);
    _protected_unlimited_stream_set_go(FORWARD, b, 2);
    for (i=0; i<N; i++) {
      c[i] = c[i] + a[i] * b[i];
    }
    _protected_stream_stop_all( );
  • EXAMPLE 4 Length of a Stream Equal to or Larger Than 1024 Cache Lines
  • In Example 5, two short streams can be promoted into a single stream, and the leading stream is marked as a protected, unlimited-length stream, if the two contiguously allocated streams are accessed in consecutive separate loops.
    struct stream_t {
      double a[N];
      double b[N];
    } p;
    _protected_unlimited_stream_set_go(FORWARD, p.a, 1);
    for (i=0; i<N; i++) {
       ...= ... p.a[i];
    }
    for (i=0; i<N; i++) {
       ...= ...p.b[i];
    }
    _protected_stream_stop_all( );
  • EXAMPLE 5 Two Short Streams can be Promoted into a Single Stream
  • Example 6 depicts a loop from routine resid( ) in spec2000fp/mgrid. The compiler analysis identifies ten load streams in the loop. Furthermore, some of the streams continue across iterations, and thus no stream stop instruction is inserted, so that prefetching continues across iterations.
     DO 600 I3=2,N−1
     DO 600 I2=2,N−1
     DO 600 I1=2,N−1
    600 R(I1,I2,I3)=V(I1,I2,I3)
     >  −A(0)*(U(I1,I2,I3))
     >  −A(1)*(U(I1−1,I2,I3)+U(I1+1,I2,I3)
     >    + U(I1,I2−1,I3)+U(I1,I2+1,I3)
     >    + U(I1,I2,I3−1)+U(I1,I2,I3+1))
     >  −A(2)*(U(I1−1,I2−1,I3)+U(I1+1,I2−1,I3)
     >    + U(I1−1,I2+1,I3)+U(I1+1,I2+1,I3)
     >    + U(I1,I2−1,I3−1)+U(I1,I2+1,I3−1)
     >    + U(I1,I2−1,I3+1)+U(I1,I2+1,I3+1)
     >    + U(I1−1,I2,I3−1)+U(I1−1,I2,I3+1)
     >    + U(I1+1,I2,I3−1)+U(I1+1,I2,I3+1))
     >  −A(3)*(U(I1−1,I2−1,I3−1)+U(I1+1,I2−1,I3−1)
     >    + U(I1−1,I2+1,I3−1)+U(I1+1,I2+1,I3−1)
     >    + U(I1−1,I2−1,I3+1)+U(I1+1,I2−1,I3+1)
     >    + U(I1−1,I2+1,I3+1)+U(I1+1,I2+1,I3+1))
     C
  • EXAMPLE 6 Multiple Load Streams
  • In most cases, loop distribution will try to split a loop whose number of streams is greater than that supported by hardware, as shown in Example 7. But for a loop with more streams than the 8 supported by hardware, two ways of doing effective data prefetching are exploited. One is to unroll or strip-mine the inner loop so that one cache line is loaded for each stream and a cache line prefetch is inserted ahead in the loop body, which allows software pipelining and instruction scheduling to move the prefetch instruction sufficiently far in advance. The other is to unroll the inner loop by some factor and initiate prefetching of 8 protected streams in the loop pre-head. For the remaining streams, a cache line touch is inserted, or pseudo data prefetching is performed by directing the software pipeliner and instruction scheduler to pre-load data from the next cache line into a register. This is illustrated in Example 7, as represented by the temporary variable:
    double b[N], temp;
    for (i=0; i<N; i+=m) {
      temp = b[i+m-1];   /* load from the next cache line */
      ... = b[i];
      ... = b[i+1];
        ...
      ... = b[i+m-2];
    }
  • EXAMPLE 7 More Streams than Supported by Hardware
  • When the load prefetching is complete for the stream, the process moves to block 812. In this step, a redundant prefetch elimination process is performed to eliminate redundant prefetches based on the information gathered during blocks 402, 404, 406, 410, 412, 414, and 416 in FIG. 4.
  • Returning to block 804, if the stream type is not a load stream, the process moves to block 806. A determination is made as to whether the stream type is a store stream (block 806). If so, the process continues to block 816, where store stream prefetching is performed. When the store prefetching is complete for the stream, the process moves to block 812, and a redundant prefetch elimination process is performed to eliminate redundant prefetches based on the information gathered during blocks 402, 404, 406, 410, 412, 414, and 416 in FIG. 4.
  • Returning to block 806, if the stream type is not a store stream, the process moves to block 808. A determination is made as to whether the stream type is an indexed stream (block 808). If so, the process continues to block 818, where indexed stream prefetching is performed. Indexed stream prefetching initiates after an indexed stream b[a[i]] is identified. At this point, a cache line touch is inserted in the loop body to execute a prefetch ahead of time, based on the total cycles in the loop body and the L1 cache miss penalty. To be more precise, the high-level optimizer actually inserts a pseudo cache line touch and lets the instruction scheduler in the low-level optimizer determine exactly how far in advance the indexed stream should be touched, as shown in Example 8.
    _protected_stream_set(FORWARD, a, 1);
    _protected_stream_count(N/16, 1);
    _eieio( );
    _protected_stream_go( );
    for (i=0; i<N; i++) {
      _dcbt( &b[ a[i+ahead] ] );
      ... = ...b[ a[i] ];
    }
  • EXAMPLE 8 Prefetching Indexed Stream b
  • When the indexed prefetching is complete for the stream, the process moves to block 812. In this step, a redundant prefetch elimination process is performed to eliminate redundant prefetches based on the information gathered during blocks 402, 404, 406, 410, 412, 414, and 416 in FIG. 4.
  • Returning to block 808, if the stream type is not an indexed stream, the process moves to block 810. A determination is made as to whether the stream type is a strided stream (block 810). If so, the process continues to block 820, where strided stream prefetching is performed. Strided stream prefetching is similar to indexed stream prefetching in that a pseudo cache line touch is inserted in the loop body. As shown in Example 9, node_t is a large structure whose size is bigger than an L1 cache line. Since root points to an array of node_t, the compiler can determine the constant stride, and a dcbt can be inserted ahead of time.
     typedef struct node
     {
      long number;
      char *ident;
      struct node *pred, *child, *sibling, *sibling_prev;
      long depth;
      long orientation;
      struct arc *basic_arc;
      struct arc *firstout, *firstin;
      cost_t potential;
      flow_t flow;
      size_t mark;
      long time;
     } node_t;
     for ( node = root, stop = net->stop_nodes;
           node < (node_t *)stop; node++ ) {
      /* node is a node_t*, so cast to char* before adding a byte offset */
      _dcbt( (char *)node + sizeof(node_t) * ahead );
      node->mark = node->depth * node->number;
     }
  • EXAMPLE 9 Pseudo Cache Line Touch Inserted into a Strided Stream
  • For irregular strided streams, extended dynamic profile information gathered from the runtime hardware performance counters can guide the compiler to place touch instructions ahead of the irregular data accesses that incur data misses. The low-level analysis may further determine that prefetches are redundant, either because they are not sufficiently ahead of the load or because the address is covered by a previous prefetch instruction. In Example 10, pointer-chasing code usually has irregular behavior, but in some cases it shows a regular stride pattern at run time. Based on the dynamic profile information, a touch instruction can be inserted to perform prefetching.
    struct node {
      struct node *next;
      Element element1, element2, ...., elementN;
    };
    struct node *first_node, *current_node;
    ....
    while ( current_node != NULL ) {
      /* touch insertion; cast so the profiled stride is a byte offset */
      _dcbt( (char *)current_node + stride );
      /* code to process current node */
      ...
      /* load next node */
      current_node = current_node->next;
    }
  • EXAMPLE 10 Irregular Stride Stream Touching
  • When the strided prefetching is complete for the stream, the process moves to block 812. In this step, a redundant prefetch elimination process is performed to eliminate redundant prefetches based on the information gathered during blocks 402, 404, 406, 410, 412, 414, and 416 in FIG. 4. Returning to block 810, if the stream type is not a strided stream, the process returns to block 802 and the stream list is updated with an error indicating the stream type as undefined.
  • In summary, the present invention provides a mechanism for minimizing effective memory latency without unnecessary cost through fine-grained software-directed data prefetching using integrated high-level and low-level code analysis and optimizations. The mechanism identifies and classifies streams based on reuse analysis and dependence analysis. The mechanism makes use of the information from high-level loop transformations, data remapping, and work data-set analysis to identify which data is most likely to incur a cache miss. The mechanism exploits effective hardware prefetching through high-level loop transformations, including locality and reuse analysis, to determine the proper number of streams. The mechanism exploits effective data prefetching on different types of streams, based on compiler static analysis and dynamic profiling information, in order to eliminate redundant prefetching and avoid cache pollution. The mechanism uses high-level transformations with integrated lower level cost analysis in the instruction scheduler to schedule prefetch instructions effectively.
  • It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media, such as a floppy disk, a hard disk drive, a RAM, CD-ROMs, DVD-ROMs, and transmission-type media, such as digital and analog communications links, wired or wireless communications links using transmission forms, such as, for example, radio frequency and light wave transmissions. The computer readable media may take the form of coded formats that are decoded for actual use in a particular data processing system.
  • The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (30)

1. A method in a data processing system for minimizing effective memory latency, the method comprising:
analyzing a portion of code that contains one or more loops;
identifying at least one candidate loop within the one or more loops for prefetch efficiency optimization; and
inserting prefetch control instructions and directives to optimize the at least one candidate loop.
2. The method of claim 1, wherein analyzing the portion of code includes at least one of inter-loop analysis and work data set analysis.
3. The method of claim 1, further comprising the steps of:
performing at least one of data dependency analysis and reuse analysis on the at least one candidate loop.
4. The method of claim 1, wherein the at least one candidate loop generates one or more streams of instructions, the method comprising:
classifying the one or more streams of instructions for a given candidate loop within the at least one candidate loop.
5. The method of claim 1, wherein identifying at least one candidate loop is based on static and dynamic profile information, wherein the static and dynamic profile information is based upon frequency of candidate loop execution and data size of the candidate loop.
6. The method of claim 1, wherein classifying the one or more streams of instructions classifies the streams into at least one of a load stream, a store stream, an indexed stream and a strided stream.
7. The method of claim 4, further comprising the steps of:
identifying at least one stream within the one or more streams of instructions to form a profitable stream, wherein the at least one stream is identified by performing at least one of guided information high-level loop optimizations, work data-set analysis and runtime hardware performance.
8. The method of claim 7, further comprising the steps of:
marking the profitable stream as protected; and
estimating a prefetching distance for the profitable stream.
9. The method of claim 1, further comprising the steps of:
eliminating redundant prefetch control instructions;
performing low-level optimizations; and
adjusting the prefetch control instructions.
10. The method of claim 9, wherein the adjustments of the prefetch control instructions are based on the inserted directives and the low-level optimizations.
11. A data processing system for minimizing effective memory latency, the apparatus comprising:
analyzing means for analyzing a portion of code that contains one or more loops;
identifying means for identifying at least one candidate loop within the one or more loops for prefetch efficiency optimization; and
inserting means for inserting prefetch control instructions and directives to optimize the at least one candidate loop.
12. The apparatus of claim 11, wherein analyzing the portion of code includes at least one of inter-loop analysis and work data set analysis.
13. The apparatus of claim 11, further comprising:
performing means for performing at least one of data dependency analysis and reuse analysis on the at least one candidate loop.
14. The apparatus of claim 11, wherein the at least one candidate loop generates one or more streams of instructions, comprising:
classifying means for classifying the one or more streams of instructions for a given candidate loop within the at least one candidate loop.
15. The apparatus of claim 11, wherein identifying at least one candidate loop is based on static and dynamic profile information, wherein the static and dynamic profile information is based upon frequency of candidate loop execution and data size of the candidate loop.
16. The apparatus of claim 11, wherein classifying the one or more streams of instructions classifies the streams into at least one of a load stream, a store stream, an indexed stream and a strided stream.
17. The apparatus of claim 14, further comprising:
identifying means for identifying at least one stream within the one or more streams of instructions to form a profitable stream, wherein the at least one stream is identified by performing at least one of guided information high-level loop optimizations, work data-set analysis and runtime hardware performance.
18. The apparatus of claim 17, further comprising:
marking means for marking the profitable stream as protected; and
estimating means for estimating a prefetching distance for the profitable stream.
19. The apparatus of claim 11, further comprising:
eliminating means for eliminating redundant prefetch control instructions;
performing means for performing low-level optimizations; and
adjusting means for adjusting the prefetch control instructions.
20. The apparatus of claim 19, wherein the adjustments of the prefetch control instructions are based on the inserted directives and the low-level optimizations.
21. A computer program product in a computer readable medium for minimizing effective memory latency, the computer program product comprising:
instructions for analyzing a portion of code that contains one or more loops;
instructions for identifying at least one candidate loop within the one or more loops for prefetch efficiency optimization; and
instructions for inserting prefetch control instructions and directives to optimize the at least one candidate loop.
22. The computer program product of claim 21, wherein the instructions for analyzing the portion of code include at least one of inter-loop analysis and work data set analysis.
23. The computer program product of claim 21, further comprising:
instructions for performing at least one of data dependency analysis and reuse analysis on the at least one candidate loop.
24. The computer program product of claim 21, wherein the at least one candidate loop generates one or more streams of instructions, the computer program product comprising:
instructions for classifying the one or more streams of instructions for a given candidate loop within the at least one candidate loop.
25. The computer program product of claim 21, wherein the instructions for identifying at least one candidate loop are based on static and dynamic profile information, wherein the static and dynamic profile information is based upon frequency of candidate loop execution and data size of the candidate loop.
26. The computer program product of claim 21, wherein the instructions for classifying the one or more streams of instructions classify the streams into at least one of a load stream, a store stream, an indexed stream and a strided stream.
27. The computer program product of claim 24, further comprising:
instructions for identifying at least one stream within the one or more streams of instructions to form a profitable stream, wherein the at least one stream is identified by performing at least one of guided information high-level loop optimizations, work data-set analysis and runtime hardware performance.
28. The computer program product of claim 27, further comprising:
instructions for marking the profitable stream as protected; and
instructions for estimating a prefetching distance for the profitable stream.
29. The computer program product of claim 21, further comprising:
instructions for eliminating redundant prefetch control instructions;
instructions for performing low-level optimizations; and
instructions for adjusting the prefetch control instructions.
30. The computer program product of claim 29, wherein the adjustments of the prefetch control instructions are based on the inserted directives and the low-level optimizations.
US10/926,595 2004-08-26 2004-08-26 Fine-grained software-directed data prefetching using integrated high-level and low-level code analysis optimizations Expired - Fee Related US7669194B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US10/926,595 US7669194B2 (en) 2004-08-26 2004-08-26 Fine-grained software-directed data prefetching using integrated high-level and low-level code analysis optimizations
US12/644,756 US8413127B2 (en) 2004-08-26 2009-12-22 Fine-grained software-directed data prefetching using integrated high-level and low-level code analysis optimizations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/926,595 US7669194B2 (en) 2004-08-26 2004-08-26 Fine-grained software-directed data prefetching using integrated high-level and low-level code analysis optimizations

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/644,756 Continuation US8413127B2 (en) 2004-08-26 2009-12-22 Fine-grained software-directed data prefetching using integrated high-level and low-level code analysis optimizations

Publications (2)

Publication Number Publication Date
US20060048120A1 true US20060048120A1 (en) 2006-03-02
US7669194B2 US7669194B2 (en) 2010-02-23

Family

ID=35944977

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/926,595 Expired - Fee Related US7669194B2 (en) 2004-08-26 2004-08-26 Fine-grained software-directed data prefetching using integrated high-level and low-level code analysis optimizations
US12/644,756 Expired - Fee Related US8413127B2 (en) 2004-08-26 2009-12-22 Fine-grained software-directed data prefetching using integrated high-level and low-level code analysis optimizations

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/644,756 Expired - Fee Related US8413127B2 (en) 2004-08-26 2009-12-22 Fine-grained software-directed data prefetching using integrated high-level and low-level code analysis optimizations

Country Status (1)

Country Link
US (2) US7669194B2 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070022412A1 (en) * 2005-03-16 2007-01-25 Tirumalai Partha P Method and apparatus for software scouting regions of a program
US20070022422A1 (en) * 2005-03-16 2007-01-25 Tirumalai Partha P Facilitating communication and synchronization between main and scout threads
US20070061787A1 (en) * 2005-09-14 2007-03-15 Microsoft Corporation Code compilation management service
US20070079288A1 (en) * 2005-09-30 2007-04-05 Chad Willwerth System and method for capturing filtered execution history of executable program code
US20070294482A1 (en) * 2006-06-15 2007-12-20 P.A. Semi, Inc. Prefetch unit
US20080127131A1 (en) * 2006-09-13 2008-05-29 Yaoqing Gao Software solution for cooperative memory-side and processor-side data prefetching
US20080229028A1 (en) * 2007-03-15 2008-09-18 Gheorghe Calin Cascaval Uniform external and internal interfaces for delinquent memory operations to facilitate cache optimization
US20090104871A1 (en) * 2007-10-17 2009-04-23 Beom Seok Cho Broadcast reception mobile terminal
US20090249316A1 (en) * 2008-03-28 2009-10-01 International Business Machines Corporation Combining static and dynamic compilation to remove delinquent loads
US20090307674A1 (en) * 2008-06-04 2009-12-10 Ng John L Improving data locality and parallelism by code replication and array contraction
US20100217891A1 (en) * 2009-02-23 2010-08-26 International Business Machines Corporation Document Source Debugger
US20140344795A1 (en) * 2013-05-17 2014-11-20 Fujitsu Limited Computer-readable recording medium, compiling method, and information processing apparatus
US20150154101A1 (en) * 2013-12-04 2015-06-04 International Business Machines Corporation Tuning business software for a specific business environment
US20150212804A1 (en) * 2014-01-29 2015-07-30 Fujitsu Limited Loop distribution detection program and loop distribution detection method
JP2015219652A (en) * 2014-05-15 2015-12-07 富士通株式会社 Compile program, compile method, and compile device
US20170199822A1 (en) * 2013-08-19 2017-07-13 Intel Corporation Systems and methods for acquiring data for loads at different access times from hierarchical sources using a load queue as a temporary storage buffer and completing the load early
WO2022179553A1 (en) * 2021-02-25 2022-09-01 Huawei Technologies Co.,Ltd. Methods and systems for nested stream prefetching for general purpose central processing units

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8453135B2 (en) * 2010-03-11 2013-05-28 Freescale Semiconductor, Inc. Computation reuse for loops with irregular accesses
WO2012031165A2 (en) * 2010-09-02 2012-03-08 Zaretsky, Howard System and method of cost oriented software profiling
US9063749B2 (en) * 2011-05-27 2015-06-23 Qualcomm Incorporated Hardware support for hashtables in dynamic languages
WO2013101121A1 (en) * 2011-12-29 2013-07-04 Intel Corporation Managed instruction cache prefetching
US9043579B2 (en) 2012-01-10 2015-05-26 International Business Machines Corporation Prefetch optimizer measuring execution time of instruction sequence cycling through each selectable hardware prefetch depth and cycling through disabling each software prefetch instruction of an instruction sequence of interest
US9235511B2 (en) 2013-05-01 2016-01-12 Globalfoundries Inc. Software performance by identifying and pre-loading data pages
KR102070136B1 (en) 2013-05-03 2020-01-28 삼성전자주식회사 Cache-control apparatus for prefetch and method for prefetch using the cache-control apparatus
US9417882B2 (en) 2013-12-23 2016-08-16 International Business Machines Corporation Load synchronization with streaming thread cohorts
US9772824B2 (en) 2015-03-25 2017-09-26 International Business Machines Corporation Program structure-based blocking
US11169925B2 (en) 2015-08-25 2021-11-09 Samsung Electronics Co., Ltd. Capturing temporal store streams into CPU caches by dynamically varying store streaming thresholds
US9535696B1 (en) * 2016-01-04 2017-01-03 International Business Machines Corporation Instruction to cancel outstanding cache prefetches
US9898268B2 (en) 2016-07-20 2018-02-20 International Business Machines Corporation Enhanced local commoning
US10649777B2 (en) 2018-05-14 2020-05-12 International Business Machines Corporation Hardware-based data prefetching based on loop-unrolled instructions


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6301641B1 (en) * 1997-02-27 2001-10-09 U.S. Philips Corporation Method for reducing the frequency of cache misses in a computer
US6249845B1 (en) * 1998-08-19 2001-06-19 International Business Machines Corporation Method for supporting cache control instructions within a coherency granule
US6820250B2 (en) * 1999-06-07 2004-11-16 Intel Corporation Mechanism for software pipelining loop nests
US6453389B1 (en) * 1999-06-25 2002-09-17 Hewlett-Packard Company Optimizing computer performance by using data compression principles to minimize a loss function
US20030005419A1 (en) * 1999-10-12 2003-01-02 John Samuel Pieper Insertion of prefetch instructions into computer program code
US6571318B1 (en) * 2001-03-02 2003-05-27 Advanced Micro Devices, Inc. Stride based prefetcher with confidence counter and dynamic prefetch-ahead mechanism
US20030079089A1 (en) * 2001-10-18 2003-04-24 International Business Machines Corporation Programmable data prefetch pacing
US20030225996A1 (en) * 2002-05-30 2003-12-04 Hewlett-Packard Company Prefetch insertion by correlation of cache misses and previously executed instructions
US20060059311A1 (en) * 2002-11-22 2006-03-16 Van De Waerdt Jan-Willem Using a cache miss pattern to address a stride prediction table
US20040154019A1 (en) * 2003-01-31 2004-08-05 Aamodt Tor M. Methods and apparatus for generating speculative helper thread spawn-target points
US20050223175A1 (en) * 2004-04-06 2005-10-06 International Business Machines Corporation Memory prefetch method and system
US7168070B2 (en) * 2004-05-25 2007-01-23 International Business Machines Corporation Aggregate bandwidth through management using insertion of reset instructions for cache-to-cache data transfer

US20150154101A1 (en) * 2013-12-04 2015-06-04 International Business Machines Corporation Tuning business software for a specific business environment
US20150212804A1 (en) * 2014-01-29 2015-07-30 Fujitsu Limited Loop distribution detection program and loop distribution detection method
US9182960B2 (en) * 2014-01-29 2015-11-10 Fujitsu Limited Loop distribution detection program and loop distribution detection method
JP2015219652A (en) * 2014-05-15 2015-12-07 富士通株式会社 Compile program, compile method, and compile device
WO2022179553A1 (en) * 2021-02-25 2022-09-01 Huawei Technologies Co.,Ltd. Methods and systems for nested stream prefetching for general purpose central processing units
US11740906B2 (en) 2021-02-25 2023-08-29 Huawei Technologies Co., Ltd. Methods and systems for nested stream prefetching for general purpose central processing units

Also Published As

Publication number Publication date
US20100095271A1 (en) 2010-04-15
US8413127B2 (en) 2013-04-02
US7669194B2 (en) 2010-02-23

Similar Documents

Publication Publication Date Title
US8413127B2 (en) Fine-grained software-directed data prefetching using integrated high-level and low-level code analysis optimizations
US8490065B2 (en) Method and apparatus for software-assisted data cache and prefetch control
US7467377B2 (en) Methods and apparatus for compiler managed first cache bypassing
US9798528B2 (en) Software solution for cooperative memory-side and processor-side data prefetching
US7681015B2 (en) Generating and comparing memory access ranges for speculative throughput computing
US20040093591A1 (en) Method and apparatus prefetching indexed array references
US7421540B2 (en) Method, apparatus, and program to efficiently calculate cache prefetching patterns for loops
US8886887B2 (en) Uniform external and internal interfaces for delinquent memory operations to facilitate cache optimization
US20060048121A1 (en) Method and apparatus for a generic language interface to apply loop optimization transformations
US7168070B2 (en) Aggregate bandwidth through management using insertion of reset instructions for cache-to-cache data transfer
US6968429B2 (en) Method and apparatus for controlling line eviction in a cache
US20030084433A1 (en) Profile-guided stride prefetching
US7577947B2 (en) Methods and apparatus to dynamically insert prefetch instructions based on garbage collector analysis and layout of objects
US7234136B2 (en) Method and apparatus for selecting references for prefetching in an optimizing compiler
US7389385B2 (en) Methods and apparatus to dynamically insert prefetch instructions based on compiler and garbage collector analysis
US7257810B2 (en) Method and apparatus for inserting prefetch instructions in an optimizing compiler
US8359435B2 (en) Optimization of software instruction cache by line re-ordering
US20070283105A1 (en) Method and system for identifying multi-block indirect memory access chains
WO2002029564A2 (en) System and method for insertion of prefetch instructions by a compiler
Reinman et al. Classifying load and store instructions for memory renaming
Reinman et al. Profile guided load marking for memory renaming
Zhang et al. Whole Execution Traces and their use in Debugging
EP4248321A1 (en) An apparatus and method for performing enhanced pointer chasing prefetcher
Smolens et al. Sarastro: a Hot Data Stream Detection Mechanism for a Java Virtual Machine

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ARCHAMBAULT, ROCH GEORGES;BLAINEY, ROBERT JAMES;GAO, YAOGING;AND OTHERS;REEL/FRAME:015148/0670;SIGNING DATES FROM 20040820 TO 20040824

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ARCHAMBAULT, ROCH GEORGES;BLAINEY, ROBERT JAMES;GAO, YAOGING;AND OTHERS;SIGNING DATES FROM 20040820 TO 20040824;REEL/FRAME:015148/0670

AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: CORRECTION TO REEL AND FRAME 015148/ 0670;ASSIGNORS:ARCHAMBAULT, ROCH GEORGES;BLAINEY, ROBERT JAMES;GAO, YAOQING;AND OTHERS;SIGNING DATES FROM 20040820 TO 20040824;REEL/FRAME:016121/0861

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: CORRECTION TO REEL AND FRAME 015148/ 0670;ASSIGNORS:ARCHAMBAULT, ROCH GEORGES;BLAINEY, ROBERT JAMES;GAO, YAOQING;AND OTHERS;REEL/FRAME:016121/0861;SIGNING DATES FROM 20040820 TO 20040824

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20140223