US20080209407A1 - Processor and compiler for decoding an instruction and executing the instruction with conditional execution flags - Google Patents

Info

Publication number
US20080209407A1
Authority
US
United States
Prior art keywords
instruction
loop
instructions
conditional
intermediate codes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/109,707
Inventor
Hazuki Okabayashi
Tetsuya Tanaka
Taketo Heishi
Hajime Ogawa
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to US12/109,707
Publication of US20080209407A1
Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30072 Arrangements for executing specific machine instructions to perform conditional operations, e.g. using predicates or guards
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/447 Target code generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32 Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/322 Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G06F9/325 Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for loops, e.g. loop detection or loop counter

Definitions

  • the present invention relates to a processor such as a DSP (Digital Signal Processor) and a CPU (Central Processing Unit), as well as to a compiler that generates instructions executed by such a processor. More particularly, the present invention relates to a processor and a compiler which are suitable for performing signal processing for sounds, images and others.
  • processors are increasingly required to be capable of high-speed media processing represented by sound and image signal processing.
  • In response to this requirement, there exist processors that support SIMD (Single Instruction Multiple Data) instructions, such as the Pentium®/Pentium® III/Pentium 4® MMX/SSE/SSE2 and others produced by the Intel Corporation of the United States.
  • The MMX Pentium, for example, is capable of performing the same operation in one instruction on a maximum of eight integers stored in a 64-bit-long MMX register.
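  • As an illustration of the "one instruction, eight integers" idea, the following is a minimal Python sketch of a packed addition on eight 8-bit lanes held in one 64-bit word. The function name `packed_add_u8` and the choice of wraparound (rather than saturating) arithmetic are assumptions made for this example; they are not taken from the patent or from any Intel documentation.

```python
def packed_add_u8(a: int, b: int) -> int:
    """Add eight unsigned 8-bit lanes packed into 64-bit words.

    Each lane wraps around independently (assumption for this sketch),
    so a carry out of one lane never spills into its neighbour.
    """
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = (a >> shift) & 0xFF          # extract lane from first operand
        y = (b >> shift) & 0xFF          # extract lane from second operand
        result |= ((x + y) & 0xFF) << shift
    return result


total = packed_add_u8(0x0102030405060708, 0x1010101010101010)
wrapped = packed_add_u8(0x00000000000000FF, 0x0000000000000001)
```

    The second call shows the lane isolation: the lowest lane overflows to zero without disturbing the lane above it.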
  • Such existing processors realize high-speed processing by utilizing software pipelining, as described in the following: Mitsuru Ikei, IA-64 Processor Basic Course (IA-64 Processor Kihon Koza), Tokyo: Ohmsha Ltd., 1999. FIG. 4.32 p. 129.
  • a large-scale circuit means that the amount of power consumed by the processor becomes large.
  • the present invention has been conceived in view of the above circumstances, and it is an object of the present invention to provide a processor whose circuitry scale is small and which is capable of performing loop processing at a high speed while consuming a low amount of power.
  • the processor is a processor for decoding an instruction and executing said decoded instruction.
  • the processor comprises: a flag register in which a plurality of conditional execution flags are stored, where the plurality of conditional execution flags are used as predicates for conditional execution instructions; a decoding unit operable to decode an instruction; and an execution unit operable to execute the instruction decoded by the decoding unit.
  • an iteration of a loop to be executed terminates in the execution unit, based on a value of one of the plurality of conditional execution flags for an epilog phase in the loop in a case where the loop is unrolled into the conditional execution instructions by means of software pipelining.
  • the flag register may further store a loop flag which is used to judge whether or not the iteration has terminated, and the execution unit may set, to the loop flag, the value of the one of the plurality of conditional execution flags for the epilog phase.
  • the execution unit sets, to the loop flag in one cycle later in the epilog phase, the value of the conditional execution flag for a conditional execution instruction to be executed in an (N-2)th pipeline stage (where N is an integer greater than or equal to 3), in a case where the number of stages in the software pipelining is N and the stages are counted up each time processing in the epilog phase finishes.
  • the processor according to the above configuration may further comprise an instruction buffer for temporarily storing the instruction decoded by the decoding unit, and in such processor, the decoding unit may be configured not to read out one of the conditional execution instructions from the instruction buffer until the loop terminates, when judging that the conditional execution instruction should not be executed based on the value of the one of the plurality of conditional execution flags for the epilog phase.
  • Once a conditional execution instruction stops being executed in the epilog phase, that conditional execution instruction will not be executed again in the software pipelining until the loop processing ends. Accordingly, there is no need to read out the conditional execution instruction from the corresponding instruction buffer, which makes it possible for the processor to consume a small amount of power.
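  • The relationship between pipeline stages, predicates, and the prolog/epilog phases can be sketched generically as follows. This is a textbook-style model of predicated software pipelining, not the jloop/settar mechanism of the present processor; the function name `software_pipeline` and the representation of predicates as Python booleans are assumptions for this example.

```python
def software_pipeline(num_stages: int, iterations: int) -> list[list[bool]]:
    """Return, for each cycle, one predicate per pipeline stage.

    Stage s works on iteration (cycle - s); its predicate is true only
    while that iteration exists. The loop ends once every predicate is false.
    """
    schedule = []
    cycle = 0
    while True:
        preds = [0 <= cycle - s < iterations for s in range(num_stages)]
        if not any(preds):
            break                      # all stages drained: loop terminates
        schedule.append(preds)
        cycle += 1
    return schedule


schedule = software_pipeline(num_stages=3, iterations=4)
```

    In this model the first rows (where later stages are still false) correspond to the prolog phase and the last rows (where earlier stages have gone false) to the epilog phase. A stage whose predicate has become false in the epilog never becomes true again, which is the property the text relies on to avoid further reads from the instruction buffer.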
  • the compiler is a compiler for translating a source program into a machine language program for a processor which is capable of executing instructions in parallel.
  • the compiler comprises: a parser unit for parsing the source program; an intermediate code conversion unit for converting the parsed source program into intermediate codes; an optimization unit for optimizing the intermediate codes; and a code generation unit for converting the optimized intermediate codes into machine language instructions.
  • the processor stores a plurality of flags which are used as predicates for conditional execution instructions, and the optimization unit, when the intermediate codes include a loop, places an instruction that is to be executed immediately before the loop in a prolog phase of the loop, in a case where said loop is unrolled by means of software pipelining.
  • an instruction to be executed immediately before a loop is placed in the prolog phase in the case where such loop is unrolled by means of software pipelining. Accordingly, it becomes possible to reduce the number of empty stages in the software pipelining, and therefore to execute a program at a high speed. Furthermore, it also becomes possible to reduce the amount of power consumption of a processor that executes a program compiled by this compiler.
  • the compiler is a compiler for translating a source program into a machine language program for a processor which is capable of executing instructions in parallel.
  • the compiler comprises: a parser unit for parsing the source program; an intermediate code conversion unit for converting the parsed source program into intermediate codes; an optimization unit for optimizing the intermediate codes; and a code generation unit for converting the optimized intermediate codes into machine language instructions.
  • the processor stores a plurality of flags which are used as predicates for conditional execution instructions
  • the optimization unit, when the intermediate codes include a conditional branch instruction, assigns the plurality of conditional execution flags so that the conditional execution flag used as a predicate for a conditional execution instruction in a case where the condition indicated by said conditional branch instruction is met differs from the conditional execution flag used as a predicate for a conditional execution instruction in a case where the condition is not met.
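  • The flag assignment described above can be pictured as if-conversion with two complementary predicates. The sketch below is a minimal illustration under stated assumptions: the flag names `c0` and `c1` and the function `abs_diff_predicated` are hypothetical, and Python `if` statements stand in for instructions guarded by a predicate flag.

```python
def abs_diff_predicated(a: int, b: int) -> int:
    """Compute |a - b| without a branch-style if/else.

    A single comparison sets two different flags: c0 guards the
    "condition met" instruction, c1 guards the "condition not met" one.
    """
    c0 = a >= b          # flag for the case where the condition is met
    c1 = not c0          # a different flag for the case where it is not met
    result = 0
    # Both guarded instructions are issued; only the one whose flag is
    # true has any effect.
    if c0:
        result = a - b
    if c1:
        result = b - a
    return result
```

    Because the two guarded instructions use different flags, they do not depend on each other and could in principle be scheduled in the same cycle on a processor that executes instructions in parallel.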
  • FIG. 1 is a schematic block diagram showing a processor according to the present invention
  • FIG. 2 is a schematic diagram showing arithmetic and logic/comparison operation units of the processor
  • FIG. 4 is a block diagram showing a configuration of a converter of the processor
  • FIG. 5 is a block diagram showing a configuration of a divider of the processor
  • FIG. 6 is a block diagram showing a configuration of a multiplication/sum of products operation unit of the processor
  • FIG. 7 is a block diagram showing a configuration of an instruction control unit of the processor
  • FIG. 8 is a diagram showing a configuration of general-purpose registers (R0-R31) of the processor
  • FIG. 9 is a diagram showing a configuration of a link register (LR) of the processor.
  • FIG. 10 is a diagram showing a configuration of a branch register (TAR) of the processor
  • FIG. 11 is a diagram showing a configuration of a program status register (PSR) of the processor
  • FIG. 12 is a diagram showing a configuration of a conditional flag register (CFR) of the processor
  • FIG. 13 is a diagram showing a configuration of accumulators (M0, M1) of the processor
  • FIG. 14 is a diagram showing a configuration of a program counter (PC) of the processor
  • FIG. 15 is a diagram showing a configuration of a PC save register (IPC) of the processor
  • FIG. 16 is a diagram showing a configuration of a PSR save register (IPSR) of the processor
  • FIG. 17 is a timing diagram showing a pipeline behavior of the processor
  • FIG. 18 is a timing diagram showing each pipeline behavior when instructions are executed by the processor
  • FIG. 19 is a diagram showing a parallel behavior of the processor
  • FIG. 20A is a diagram showing a format of a 16-bit instruction executed by the processor
  • FIG. 20B is a diagram showing a format of a 32-bit instruction executed by the processor
  • FIGS. 21A and 21B are diagrams explaining instructions belonging to a category “ALUadd (addition) system”;
  • FIGS. 22A and 22B are diagrams explaining instructions belonging to a category “ALUsub (subtraction) system”;
  • FIGS. 23A and 23B are diagrams explaining instructions belonging to a category “ALUlogic (logical operation) system and others”;
  • FIGS. 24A and 24B are diagrams explaining instructions belonging to a category “CMP (comparison operation) system”;
  • FIGS. 25A and 25B are diagrams explaining instructions belonging to a category “mul (multiplication) system”
  • FIGS. 26A and 26B are diagrams explaining instructions belonging to a category “mac (sum of products operation) system”;
  • FIGS. 27A and 27B are diagrams explaining instructions belonging to a category “msu (difference of products) system”;
  • FIGS. 28A and 28B are diagrams explaining instructions belonging to a category “MEMld (load from memory) system”;
  • FIGS. 29A and 29B are diagrams explaining instructions belonging to a category “MEMstore (store in memory) system”;
  • FIG. 30 is a diagram explaining instructions belonging to a category “BRA (branch) system”
  • FIGS. 31A and 31B are diagrams explaining instructions belonging to a category “BSasl (arithmetic barrel shift) system and others”;
  • FIGS. 32A and 32B are diagrams explaining instructions belonging to a category “BSlsr (logical barrel shift) system and others”;
  • FIG. 33A is a diagram explaining instructions belonging to a category “CNVvaln (arithmetic conversion) system”;
  • FIGS. 34A and 34B are diagrams explaining instructions belonging to a category “CNV (general conversion) system”;
  • FIG. 35 is a diagram explaining instructions belonging to a category “SATvlpk (saturation processing) system”;
  • FIGS. 36A and 36B are diagrams explaining instructions belonging to a category “ETC (et cetera) system”;
  • FIG. 37 is a diagram explaining a detailed behavior of the processor when executing Instruction “jloop C6, Cm, TAR, Ra”;
  • FIG. 38 is a diagram explaining a detailed behavior of the processor when executing Instruction “settar C6, Cm, D9”;
  • FIG. 40 is a diagram showing a source program written in the C language
  • FIG. 41 is a diagram showing an example machine language program to be generated by using Instruction jloop and Instruction settar according to the present embodiment
  • FIG. 42 is a diagram explaining a detailed behavior of the processor when executing Instruction “jloop C6, C2:C4, TAR, Ra”;
  • FIG. 43 is a diagram explaining a detailed behavior of the processor when executing Instruction “settar C6, C2:C4, D9”;
  • FIG. 48 is a diagram explaining a detailed behavior of the processor when executing Instruction “settar C6, C1:C4, D9”;
  • FIG. 49 is a diagram showing a source program written in the C language
  • FIG. 50 is a diagram showing an example machine language program to be generated by using Instruction jloop and Instruction settar according to the present embodiment
  • FIG. 53 is a diagram showing a behavior of 4-stage software pipelining in which instructions to be executed before and after the loop are incorporated respectively into a prolog phase and an epilog phase;
  • FIG. 56 is a diagram showing a behavior of an existing processor using 4-stage software pipelining.
  • FIG. 1 is a schematic block diagram showing the processor according to the present invention.
  • the processor 1 is comprised of an instruction control unit 10, a decoding unit 20, a register file 30, an operation unit 40, an I/F (interface) unit 50, an instruction memory unit 60, a data memory unit 70, an extended register unit 80, and an I/O (Input/Output) interface unit 90.
  • the operation unit 40 includes arithmetic and logic/comparison operation units 41-43 and 48, a multiplication/sum of products operation unit 44, a barrel shifter 45, a divider 46, and a converter 47 for performing operations of SIMD instructions.
  • the multiplication/sum of products operation unit 44 is capable of performing accumulation which results in a maximum of a 65-bit operation result, without lowering bit precision.
  • the multiplication/sum of products operation unit 44 is also capable of executing SIMD instructions as in the case of the arithmetic and logic/comparison operation units 41-43 and 48.
  • the processor 1 is capable of parallel execution of an arithmetic and logic/comparison operation instruction on a maximum of four data elements.
  • FIG. 2 is a schematic diagram showing the arithmetic and logic/comparison operation units 41-43 and 48.
  • Each of the arithmetic and logic/comparison operation units 41-43 and 48 is made up of an ALU (Arithmetic and Logical Unit) 41a, a saturation processing unit 41b, and a flag unit 41c.
  • the ALU 41a includes an arithmetic operation unit (AU), a logical operation unit (LU), a comparator (CMP), and a TST.
  • the bit widths of operation data to be supported by the ALU 41a are 8 bits (when using four operation units in parallel), 16 bits (when using two operation units in parallel) and 32 bits (when using one operation unit to process 32-bit data).
  • the flag unit 41c and the like detect an overflow and generate a conditional flag.
  • For a result of each of the operation units, the comparator and the TST, an arithmetic shift right, saturation by the saturation processing unit 41b, the detection of maximum/minimum values, and absolute value generation processing are performed.
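  • The combination of selectable lane widths and saturation can be sketched as follows. This is a minimal model, assuming signed lanes packed into a 32-bit word; the helper names `saturate`, `to_signed` and `sat_add_lanes` are hypothetical and do not correspond to instruction names of the processor.

```python
def saturate(value: int, bits: int) -> int:
    """Clamp value to the signed range representable in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def sat_add_lanes(a: int, b: int, lane_bits: int) -> int:
    """Saturating add of 32-bit words split into 8-, 16- or 32-bit lanes."""
    mask = (1 << lane_bits) - 1
    out = 0
    for i in range(32 // lane_bits):      # 4, 2 or 1 lanes
        shift = i * lane_bits
        x = to_signed(a >> shift, lane_bits)
        y = to_signed(b >> shift, lane_bits)
        out |= (saturate(x + y, lane_bits) & mask) << shift
    return out
```

    With `lane_bits=16`, for example, the upper lane of `0x7FFF0001 + 0x00010001` clamps at `0x7FFF` while the lower lane adds normally.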
  • FIG. 3 is a block diagram showing the configuration of the barrel shifter 45.
  • the barrel shifter 45 is made up of selectors 45a and 45b, a higher bit barrel shifter 45c, a lower bit barrel shifter 45d, and a saturation processing unit 45e.
  • This barrel shifter 45 executes an arithmetic shift of data (shift in the 2's complement number system) or a logical shift of data (unsigned shift). Usually, 32-bit or 64-bit data is inputted to and outputted from the barrel shifter 45.
  • the shift amount for data stored in the register 30a or 30b is specified by another register or by an immediate value.
  • the barrel shifter 45 performs an arithmetic or logical shift of input data within a range of 63 bits to the left and 63 bits to the right, and outputs data of the same bit length as that of the input data.
  • the barrel shifter 45 is also capable of shifting 8-, 16-, 32-, or 64-bit data in response to a SIMD instruction.
  • the barrel shifter 45 can shift four pieces of 8-bit data in parallel.
  • An arithmetic shift, which is a shift in the 2's complement number system, is performed for decimal point alignment at the time of addition and subtraction, for multiplication by powers of 2 (the 1st power of 2, the 2nd power of 2, the -1st power of 2, the -2nd power of 2) and for other purposes.
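  • The difference between the two kinds of right shift can be made concrete with a small sketch on 32-bit values. The function names `lsr32` and `asr32` are chosen for this example only; the code models the arithmetic, not the internal structure of the barrel shifter 45.

```python
MASK32 = 0xFFFFFFFF


def lsr32(x: int, n: int) -> int:
    """Logical shift right: vacated high bits are filled with 0."""
    return (x & MASK32) >> n


def asr32(x: int, n: int) -> int:
    """Arithmetic shift right: vacated high bits copy the sign bit."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32               # reinterpret as a negative number
    return (x >> n) & MASK32       # Python's >> on negative ints is arithmetic
```

    Shifting the two's complement value -8 (`0xFFFFFFF8`) right by one gives -4 (`0xFFFFFFFC`) with `asr32`, i.e. a multiplication by the -1st power of 2, whereas `lsr32` yields the large positive value `0x7FFFFFFC`.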
  • FIG. 4 is a block diagram showing the configuration of the converter 47.
  • the converter 47 includes a saturation block (SAT) 47a, a BSEQ block 47b, an MSKGEN block 47c, a VSUMB block 47d, a BCNT block 47e, and an IL block 47f.
  • the saturation block (SAT) 47a performs saturation processing on input data. By having two blocks for performing saturation processing on 32-bit data, the saturation block (SAT) 47a supports a SIMD instruction executed on two data elements in parallel.
  • the BSEQ block 47b counts consecutive 0s or 1s from the MSB (Most Significant Bit).
  • the MSKGEN block 47c outputs a specified bit segment as 1, while outputting the others as 0.
  • the IL block 47f divides the input data into specified bit widths, and outputs a value that results from exchanging the positions of data blocks.
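  • Two of the converter functions lend themselves to a short sketch. The code below is one plausible reading of the BSEQ and MSKGEN descriptions; the exact operand encoding and result conventions of the actual blocks are not given in this excerpt, so the function names `bseq` and `mskgen`, their parameters, and the inclusive bit-range convention are all assumptions.

```python
def bseq(x: int, width: int = 32) -> int:
    """Count how many consecutive bits, starting at the MSB, equal the MSB."""
    x &= (1 << width) - 1
    msb = (x >> (width - 1)) & 1
    count = 0
    for pos in range(width - 1, -1, -1):
        if (x >> pos) & 1 != msb:
            break
        count += 1
    return count


def mskgen(high: int, low: int) -> int:
    """Return a mask with bits low..high (inclusive) set to 1, others 0."""
    return ((1 << (high - low + 1)) - 1) << low
```

    For example, `bseq(0x00FFFFFF)` counts eight leading zeros and `mskgen(7, 4)` produces `0xF0`.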
  • FIG. 5 is a block diagram showing the configuration of the divider 46.
  • With a dividend being 64 bits and a divisor being 32 bits, the divider 46 outputs 32-bit data as a quotient and 32-bit data as a modulo. Obtaining the quotient and the modulo takes 34 cycles.
  • the divider 46 can handle both signed and unsigned data. Note, however, that the signed/unsigned setting is common to the dividend and the divisor.
  • the divider 46 is also capable of outputting an overflow flag and a division-by-zero flag.
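  • The input/output behavior of such a divider can be modelled as below. This is only a functional sketch: truncation toward zero for signed division, the values returned on division by zero, and the function name `divide` are assumptions for this example, and the 34-cycle timing is not modelled.

```python
def divide(dividend: int, divisor: int, signed: bool = True):
    """Return (quotient, remainder, overflow_flag, div_by_zero_flag).

    The quotient is expected to fit in 32 bits; if it does not,
    the overflow flag is raised.
    """
    if divisor == 0:
        return 0, 0, False, True
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient                  # truncate toward zero
    remainder = dividend - quotient * divisor
    if signed:
        lo, hi = -(1 << 31), (1 << 31) - 1
    else:
        lo, hi = 0, (1 << 32) - 1
    overflow = not (lo <= quotient <= hi)
    return quotient, remainder, overflow, False
```

    A 64-bit dividend such as `1 << 40` divided by 1 does not fit in a 32-bit quotient, so the sketch reports an overflow in that case.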
  • FIG. 6 is a block diagram showing the configuration of the multiplication/sum of products operation unit 44.
  • the multiplication/sum of products operation unit 44, which is made up of two 32-bit multipliers (MUL) 44a and 44b, three 64-bit adders (Adder) 44c-44e, a selector 44f and a saturation processing unit (Saturation) 44g, performs the following multiplications and sums of products:
  • FIG. 7 is a block diagram showing the configuration of the instruction control unit 10.
  • the instruction control unit 10, which is made up of an instruction cache 10a, an address management unit 10b, instruction buffers 10c-10e and 10h, a jump buffer 10f, and a rotation unit (rotation) 10g, issues instructions at ordinary times and at branch points.
  • the instruction control unit 10 supports the maximum number of parallel instruction execution.
  • the instruction control unit 10 stores, in advance, a branch target instruction into the jump buffer 10f and stores a branch target address into the below-described TAR register before performing a branch (settar instruction).
  • the instruction control unit 10 performs the branch by using the branch target address stored in the TAR register and the branch target instruction stored in the jump buffer 10f.
  • This instruction description indicates that only an instruction “mov” shall be executed.
  • the instruction control unit 10 identifies an issue group and sends the identified issue group to the decoding unit 20.
  • the decoding unit 20 decodes the instructions in the issue group, and controls resources required for executing such instructions.
  • Table 1 below lists a set of registers of the processor 1.
  • Register name | Bit width | No. of registers | Usage
    R0-R31 | 32 bits | 32 | General-purpose registers. Used as data memory pointer, data storage at the time of operation instruction, and the like.
    TAR | 32 bits | 1 | Branch register. Used as branch address storage at branch point.
    LR | 32 bits | 1 | Link register.
    SVR | 16 bits | 2 | Save register. Used for saving conditional flag (CFR) and various modes.
    M0-M1 (MH0:ML0-MH1:ML1) | 64 bits | 2 | Operation registers. Used as data storage when operation instruction is executed.
  • Table 2 below lists a set of flags (flags managed in a conditional flag register and the like described later) of the processor 1.
  • FIG. 8 is a diagram showing the configuration of the general-purpose registers (R0-R31) 30a.
  • the general-purpose registers (R0-R31) 30a are a group of 32-bit registers that constitute an integral part of the context of a task to be executed and that store data or addresses. Note that the general-purpose registers R30 and R31 are used by hardware as a global pointer and a stack pointer, respectively.
  • FIG. 9 is a diagram showing the configuration of a link register (LR) 30c.
  • the processor 1 also has a save register (SVR) which is not illustrated in FIG. 9.
  • the link register (LR) 30c is a 32-bit register in which a return address at the time of a function call is stored.
  • the save register (SVR) is a 16-bit register for saving a conditional flag (CFR.CF) of the conditional flag register at the time of a function call.
  • the link register (LR) 30c is also used for the purpose of increasing the speed of loops, as in the case of a branch register (TAR) to be explained later.
  • 0 is always read out from the low 1 bit of the link register (LR) 30c, and 0 must be written to the low 1 bit of the link register (LR) 30c at the time of writing.
  • when executing “call (brl, jmpl)” instructions, the processor 1 saves a return address into the link register (LR) 30c and saves a conditional flag (CFR.CF) into the save register (SVR).
  • the processor 1 fetches the return address (branch destination address) from the link register (LR) 30c, and restores a program counter (PC).
  • the processor 1 fetches the branch destination address (return address) from the link register (LR) 30c, and stores (restores) the branch destination address into the program counter (PC).
  • the processor 1 fetches the conditional flag from the save register (SVR) so as to store (restore) the conditional flag into a conditional flag area CFR.CF in the conditional flag register (CFR) 32.
  • FIG. 10 is a diagram showing the configuration of the branch register (TAR) 30d.
  • the branch register (TAR) 30d is a 32-bit register in which a branch target address is stored, and which is used mainly for the purpose of increasing the speed of loops. 0 is always read out from the low 1 bit of the branch register (TAR) 30d, and 0 must be written to the low 1 bit of the branch register (TAR) 30d at the time of writing.
  • the processor 1 fetches a branch target address from the branch register (TAR) 30d, and stores the branch target address in the program counter (PC).
  • a branch penalty will be 0.
  • An increased loop speed can be achieved by storing the top address of a loop in the branch register (TAR) 30d.
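  • The "low 1 bit always reads as 0" rule shared by the link register and the branch register can be modelled with a very small class. The class name `AddressRegister` and the decision to enforce the rule by masking on write are assumptions for this sketch; the text itself only states that software must write 0 to that bit and that 0 is read back.

```python
class AddressRegister:
    """32-bit address register whose lowest bit always reads as 0."""

    def __init__(self) -> None:
        self._value = 0

    def write(self, address: int) -> None:
        # Keep 32 bits and force the lowest bit to 0 (assumed behaviour).
        self._value = address & 0xFFFFFFFE

    def read(self) -> int:
        return self._value


tar = AddressRegister()
tar.write(0x00001003)      # e.g. the top address of a loop, with stray low bit
loop_top = tar.read()
```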
  • FIG. 11 is a diagram showing the configuration of a program status register (PSR) 31.
  • the program status register (PSR) 31, which constitutes an integral part of the context of a task to be executed, is a 32-bit register in which the following processor status information is stored:
  • Bit SWE indicates whether the switching of VMP (Virtual Multi-Processor) to LP (Logical Processor) is enabled or disabled. “0” indicates that switching to LP is disabled and “1” indicates that switching to LP is enabled.
  • Bit FXP indicates a fixed point mode. “0” indicates mode 0 and “1” indicates mode 1.
  • Bit IH is an interrupt processing flag indicating whether or not maskable interrupt processing is ongoing. “1” indicates that there is an ongoing interrupt processing and “0” indicates that there is no ongoing interrupt processing. “1” is automatically set on the occurrence of an interrupt. This flag is used to make a distinction of which one of interrupt processing and program processing is taking place at a point in the program to which the processor returns in response to a “rti” instruction.
  • Bit LPIE3 indicates whether LP-specific interrupt 3 is enabled or disabled. “1” indicates that an interrupt is enabled and “0” indicates that an interrupt is disabled.
  • Bit LPIE2 indicates whether LP-specific interrupt 2 is enabled or disabled. “1” indicates that an interrupt is enabled and “0” indicates that an interrupt is disabled.
  • Bit LPIE1 indicates whether LP-specific interrupt 1 is enabled or disabled. “1” indicates that an interrupt is enabled and “0” indicates that an interrupt is disabled.
  • Bit AEE indicates whether a misalignment exception is enabled or disabled. “1” indicates that a misalignment exception is enabled and “0” indicates that a misalignment exception is disabled.
  • Bit IE indicates whether a level interrupt is enabled or disabled. “1” indicates that a level interrupt is enabled and “0” indicates a level interrupt is disabled.
  • IM[0] denotes a mask of level 0
  • IM[1] denotes a mask of level 1
  • IM[2] denotes a mask of level 2
  • IM[3] denotes a mask of level 3
  • IM[4] denotes a mask of level 4
  • IM[5] denotes a mask of level 5
  • IM[6] denotes a mask of level 6
  • IM[7] denotes a mask of level 7.
  • reserved indicates a reserved bit. 0 is always read out from “reserved”. 0 must be written to “reserved” at the time of writing.
  • FIG. 12 is a diagram showing the configuration of the conditional flag register (CFR) 32.
  • the conditional flag register (CFR) 32, which constitutes an integral part of the context of a task to be executed, is a 32-bit register made up of conditional flags, operation flags, vector conditional flags, an operation instruction bit position specification field, and a SIMD data alignment information field.
  • Bit BPO [4:0] indicates a bit position. It is used in an instruction that requires a bit position specification.
  • Bits VC0-VC3 are vector conditional flags. Starting from a byte on the LSB (Least Significant Bit) side or a half word through to the MSB side, each corresponds to a flag ranging from VC0 through VC3.
  • Bit OVS is an overflow flag (summary). It is set on the detection of saturation and overflow. If not detected, a value before the execution of the instruction is retained. Clearing of this flag needs to be carried out by software.
  • Bit CAS is a carry flag (summary). It is set when a carry occurs under an “addc” instruction, or when a borrow occurs under a “subc” instruction. If there is no occurrence of a carry under an “addc” instruction or a borrow under a “subc” instruction, a value before the execution of the instruction is retained as the Bit CAS. Clearing of this flag needs to be carried out by software.
  • Bits C0-C7 are conditional flags.
  • the value of the flag C7 is always 1.
  • a reflection of a FALSE condition (writing of 0) made to the flag C7 is ignored.
  • reserved indicates a reserved bit. 0 is always read out from “reserved”. 0 must be written to “reserved” at the time of writing.
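  • Two properties of the conditional flag register described above, the hard-wired flag C7 and the "summary" (sticky) overflow flag OVS, can be sketched as follows. The class and method names are hypothetical, and the model ignores the bit layout of the register.

```python
class ConditionalFlagRegister:
    """Toy model of C0-C7 and the sticky overflow summary flag OVS."""

    def __init__(self) -> None:
        self.c = [False] * 7 + [True]   # C7 always holds 1
        self.ovs = False

    def set_flag(self, index: int, value: bool) -> None:
        if index == 7:
            return                      # writes to C7 are ignored
        self.c[index] = bool(value)

    def record_overflow(self, overflowed: bool) -> None:
        # Set on detection; otherwise the previous value is retained.
        self.ovs = self.ovs or overflowed

    def clear_ovs(self) -> None:
        self.ovs = False                # clearing is left to software


cfr = ConditionalFlagRegister()
cfr.set_flag(7, False)                  # attempt to write 0 to C7
cfr.record_overflow(True)
cfr.record_overflow(False)              # later instruction without overflow
```

    Since C7 is always true, an instruction predicated on C7 behaves as an unconditional instruction in this model.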
  • FIGS. 13(a) and (b) are diagrams showing the configuration of accumulators (M0, M1) 30b.
  • Such accumulators (M0, M1) 30b, which constitute an integral part of the context of a task to be executed, are made up of 32-bit registers MH0-MH1 (registers for multiply and divide/sum of products (the higher 32 bits)) shown in (a) in FIG. 13 and 32-bit registers ML0-ML1 (registers for multiply and divide/sum of products (the lower 32 bits)) shown in (b) in FIG. 13.
  • the registers MH0-MH1 are used for storing the higher 32 bits of an operation result at the time of a multiply instruction, and as the higher 32 bits of the accumulators at the time of a sum of products instruction. Moreover, the registers MH0-MH1 can be used in combination with the general-purpose registers in the case where a bit stream is handled. Meanwhile, the registers ML0-ML1 are used for storing the lower 32 bits of an operation result at the time of a multiply instruction, and as the lower 32 bits of the accumulators at the time of a sum of products instruction.
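  • The split of a 64-bit result into a higher and a lower 32-bit register can be illustrated with a short sketch. The function names `split_accumulator` and `join_accumulator` are chosen for this example; the unsigned treatment of the product is an assumption.

```python
def split_accumulator(value: int) -> tuple[int, int]:
    """Split a 64-bit value into (MH, ML): higher and lower 32 bits."""
    value &= (1 << 64) - 1
    return value >> 32, value & 0xFFFFFFFF


def join_accumulator(mh: int, ml: int) -> int:
    """Recombine the two 32-bit halves into one 64-bit value."""
    return ((mh & 0xFFFFFFFF) << 32) | (ml & 0xFFFFFFFF)


product = 0x12345678 * 0x9ABCDEF0       # 32-bit x 32-bit multiply
mh0, ml0 = split_accumulator(product)
```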
  • FIG. 14 is a diagram showing the configuration of a program counter (PC) 33.
  • This program counter (PC) 33, which constitutes an integral part of the context of a task to be executed, is a 32-bit counter that holds the address of an instruction being executed. “0” is always stored in the low 1 bit of the program counter (PC) 33.
  • FIG. 15 is a diagram showing the configuration of a PC save register (IPC) 34.
  • This PC save register (IPC) 34, which constitutes an integral part of the context of a task to be executed, is a 32-bit register. “0” is always read out from the low 1 bit of the PC save register (IPC) 34. “0” must be written to the low 1 bit of the PC save register (IPC) 34 at the time of writing.
  • FIG. 16 is a diagram showing the configuration of a PSR save register (IPSR) 35.
  • This PSR save register (IPSR) 35, which constitutes an integral part of the context of a task to be executed, is a 32-bit register for saving the program status register (PSR) 31. 0 is always read out from a part in the PSR save register (IPSR) 35 corresponding to a reserved bit in the program status register (PSR) 31, and 0 must be written to that part at the time of writing.
  • A linear memory space with a capacity of 4 GB is divided into 32 segments, and an instruction SRAM (Static RAM) and a data SRAM are allocated to 128-MB segments.
  • A target block to be accessed is set in a SAR (SRAM Area Register).
  • A direct access is made to the instruction SRAM/data SRAM when the accessed address falls within a segment set in the SAR, whereas an access request is issued to a bus controller (BUC) when the address does not fall within a segment set in the SAR.
  • An on chip memory (OCM), an external memory, an external device, an I/O port and others are connected to the BUC.
  • The processor 1 is capable of reading/writing data from and to these devices.
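  • The address routing described above can be sketched as follows. This is only an illustration: it assumes that the 32 segments are numbered by the top five bits of a 32-bit address and that the SAR is modeled as a single segment number. The names `segment_of` and `route_access` are hypothetical and do not appear in the specification.

```python
ADDRESS_BITS = 32
SEGMENT_COUNT = 32
SEGMENT_SIZE = (1 << ADDRESS_BITS) // SEGMENT_COUNT  # 4 GB / 32 = 128 MB


def segment_of(address):
    """Return the index (0-31) of the 128-MB segment containing the address."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address outside the 4-GB linear space")
    return address // SEGMENT_SIZE


def route_access(address, sar_segment):
    """Decide where an access goes; the SAR is modeled as one segment number (assumption)."""
    if segment_of(address) == sar_segment:
        return "SRAM"  # direct access to the instruction SRAM/data SRAM
    return "BUC"       # access request issued to the bus controller
```

  • Under these assumptions, an address in the segment selected by the SAR is served directly, and every other address is forwarded to the BUC and the devices connected to it.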
  • FIG. 17 is a timing diagram showing the pipeline behavior of the processor 1 .
  • the pipeline of the processor 1 basically consists of the following five stages: instruction fetch; instruction assignment (dispatch); decode; execution; and writing.
  • FIG. 18 is a timing diagram showing each stage of the pipeline behavior of the processor 1 at the time of executing an instruction.
  • In the instruction fetch stage, an access is made to an instruction memory which is indicated by an address specified by the program counter (PC) 33, and the instruction is transferred to the instruction buffers 10c-10e and 10h, and the like.
  • In the instruction assignment stage, the output of branch target address information in response to a branch instruction, the output of an input register control signal, and the assignment of a variable length instruction are carried out, which is followed by the transfer of the instruction to an instruction register (IR).
  • In the decode stage, the instruction stored in the IR is inputted to the decoding unit 20, from which an operation unit control signal and a memory access signal are outputted.
  • In the execution stage, an operation is executed and the result of the operation is outputted either to the data memory or the general-purpose registers (R0-R31) 30a.
  • In the writing stage, a value obtained as a result of data transfer, and the operation results are stored in the general-purpose registers.
  • the VLIW architecture of the processor 1 allows parallel execution of the above processing on a maximum of four data elements. Therefore, the processor 1 performs parallel execution as shown in FIG. 18 at the timing shown in FIG. 19 .
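  • The overlap of the five pipeline stages can be sketched as follows. This is a simplified illustration, assuming that one instruction group enters the pipeline per cycle and that no stalls occur; the function names and the short stage labels are hypothetical.

```python
# Stage order taken from the five stages listed in the text.
STAGES = ["fetch", "dispatch", "decode", "execute", "write"]


def stage_at(issue_cycle, cycle):
    """Return the stage occupied in `cycle` by an instruction fetched in `issue_cycle`."""
    offset = cycle - issue_cycle
    if 0 <= offset < len(STAGES):
        return STAGES[offset]
    return None  # not yet fetched, or already written back


def occupancy(issue_cycles, cycle):
    """Map instruction index -> stage for all instructions in flight in `cycle`."""
    result = {}
    for index, start in enumerate(issue_cycles):
        stage = stage_at(start, cycle)
        if stage is not None:
            result[index] = stage
    return result
```

  • For instance, `occupancy([0, 1, 2], 2)` reports three instructions in the decode, dispatch and fetch stages at the same time, which is the kind of overlap a pipeline timing diagram depicts.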
  • Tables 3-5 list categorized instructions to be executed by the processor 1 .
  • “Operation units” in the above tables refer to operation units used in the respective instructions. More specifically, “A” denotes an ALU instruction, “B” denotes a branch instruction, “C” denotes a conversion instruction, “DIV” denotes a divide instruction, “DBGM” denotes a debug instruction, “M” denotes a memory access instruction, “S 1 ” and “S 2 ” denote a shift instruction, and “X 1 ” and “X 2 ” denote a multiply instruction.
  • FIG. 20A is a diagram showing the format of a 16-bit instruction executed by the processor 1.
  • FIG. 20B is a diagram showing the format of a 32-bit instruction executed by the processor 1 .
  • “E” is an end bit (boundary of parallel execution); “F” is a format bit (00, 01, 10: 16-bit instruction format, 11: 32-bit instruction format); “P” is a predicate (execution condition: one of the eight conditional flags C0-C7 is specified); “OP” is an operation code field; “R” is a register field; “I” is an immediate value field; and “D” is a displacement field.
  • Predicates, which are flags for controlling whether or not to execute an instruction based on values of the conditional flags C0-C7, serve as a technique that allows instructions to be selectively executed without using a branch instruction and therefore speeds up processing.
  • For example, when the conditional flag C0 indicating a predicate in an instruction is 1, the instruction being assigned the conditional flag C0 is executed, whereas when the conditional flag C0 is 0, such instruction is not executed.
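  • Predicated execution as described above can be sketched as follows. This is a minimal model rather than the processor's actual datapath: the instruction tuple layout, the register names and the function `run_predicated` are hypothetical.

```python
import operator


def run_predicated(instructions, flags, registers):
    """Execute each (predicate, dest, op, src_a, src_b) only when its conditional flag is 1."""
    for predicate, dest, op, src_a, src_b in instructions:
        if flags[predicate] != 1:
            continue  # flag is 0: the instruction is not executed, and no branch is needed
        registers[dest] = op(registers[src_a], registers[src_b])
    return registers


flags = [0] * 8          # conditional flags C0-C7
flags[0] = 1             # only C0 is set
registers = {"r0": 5, "r1": 7, "r2": 0, "r3": 0}
program = [
    (0, "r2", operator.add, "r0", "r1"),  # predicated on C0: executed
    (1, "r3", operator.sub, "r0", "r1"),  # predicated on C1: skipped
]
run_predicated(program, flags, registers)
```

  • Both instructions occupy straight-line code; which of them takes effect is decided solely by the flag values.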
  • FIGS. 21A-36B are diagrams explaining an outlined functionality of the instructions executed by the processor 1. More specifically, FIGS. 21A and 21B explain instructions belonging to the category “ALUadd (addition) system”; FIGS. 22A and 22B explain instructions belonging to the category “ALUsub (subtraction) system”; FIGS. 23A and 23B explain instructions belonging to the category “ALUlogic (logical operation) system and others”; FIGS. 24A and 24B explain instructions belonging to the category “CMP (comparison operation) system”; FIGS. 25A and 25B explain instructions belonging to the category “mul (multiplication) system”; FIGS. 26A and 26B explain instructions belonging to the category “mac (sum of products operation) system”;
  • FIGS. 27A and 27B explain instructions belonging to the category “msu (difference of products) system”;
  • FIGS. 28A and 28B explain instructions belonging to the category “MEMld (load from memory) system”;
  • FIGS. 29A and 29B explain instructions belonging to the category “MEMstore (store in memory) system”;
  • FIG. 30 explains instructions belonging to the category “BRA (branch) system”;
  • FIGS. 31A and 31B explain instructions belonging to the category “BSasl (arithmetic barrel shift) system and others”;
  • FIGS. 32A and 32B explain instructions belonging to the category “BSlsr (logical barrel shift) system and others”;
  • FIG. 33 explains instructions belonging to the category “CNVvaln (arithmetic conversion) system”;
  • FIGS. 34A and 34B explain instructions belonging to the category “CNV (general conversion) system”; FIG. 35 explains instructions belonging to the category “SATvlpk (saturation processing) system”; and FIGS. 36A and 36B explain instructions belonging to the category “ETC (et cetera) system”.
  • “SIMD” indicates the type of an instruction (distinction between SISD (SINGLE) and SIMD); “Size” indicates the size of an individual operand to be an operation target; “Instruction” indicates the operation code of an instruction; “Operand” indicates the operands of an instruction; “CFR” indicates a change in the conditional flag register; “PSR” indicates a change in the processor status register; “Typical behavior” indicates the overview of a behavior; “Operation unit” indicates an operation unit to be used; and “3116” indicates the size of an instruction.
  • Instruction jloop is an instruction for performing a branch and setting conditional flags (predicates, here) in a loop. For example, when the instruction “jloop C6, Cm, TAR, Ra” is given, the processor 1 behaves as follows, by using the address management unit 10b and others: (i) sets the conditional flag Cm to 1; (ii) sets the conditional flag C6 to 0 when the value held in the register Ra is smaller than 0; (iii) adds −1 to the value held in the register Ra and stores the result into the register Ra; and (iv) branches to an address specified by the branch register (TAR) 30d.
  • When not filled with a branch target instruction, the jump buffer 10f (branch instruction buffer) will be filled with a branch target instruction.
  • A detailed behavior is as shown in FIG. 37.
  • Instruction settar is an instruction for storing a branch target address into the branch register (TAR) 30d, and setting conditional flags (predicates, here). For example, when the instruction “settar C6, Cm, D9” is given, the processor 1 behaves as follows, by using the address management unit 10b and others: (i) stores an address that results from adding the value held in the program counter (PC) 33 and a displacement value (D9) into the branch register (TAR) 30d; (ii) fetches the instruction corresponding to such address and stores the instruction into the jump buffer 10f (branch instruction buffer); and (iii) sets the conditional flag C6 to 1 and the conditional flag Cm to 0. A detailed behavior is as shown in FIG. 38.
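  • The register and flag updates listed for the 2-stage forms “settar C6, Cm, D9” and “jloop C6, Cm, TAR, Ra” can be sketched as state transitions. This is a simplified model under stated assumptions: the steps are applied in the order in which they are listed (so Ra is tested before it is decremented), the jump buffer is not modeled, and the `Core` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    pc: int = 0
    tar: int = 0
    ra: int = 0
    flags: list = field(default_factory=lambda: [0] * 8)  # conditional flags C0-C7


def settar(core, m, d9):
    """settar C6, Cm, D9: load TAR and initialize the loop predicates."""
    core.tar = core.pc + d9   # (i) TAR <- PC + displacement
    # (ii) prefetching the branch target into the jump buffer is not modeled
    core.flags[6] = 1         # (iii) C6 <- 1
    core.flags[m] = 0         #       Cm <- 0


def jloop(core, m):
    """jloop C6, Cm, TAR, Ra: update the predicates, count down, and branch."""
    core.flags[m] = 1         # (i) Cm <- 1
    if core.ra < 0:           # (ii) C6 <- 0 when Ra < 0 (tested before the decrement: assumption)
        core.flags[6] = 0
    core.ra -= 1              # (iii) Ra <- Ra + (-1)
    core.pc = core.tar        # (iv) branch to the address held in TAR


core = Core(pc=0x100, ra=0)
settar(core, 4, 0x10)         # Cm is C4 in this illustration
jloop(core, 4)
```

  • After `settar`, only the instructions predicated on C6 are enabled; the first `jloop` then enables the instructions predicated on Cm as well, and a later `jloop` that finds Ra below 0 clears C6.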
  • Prolog/epilog removal is intended to visually remove the prolog phase and the epilog phase by treating the instructions in the prolog phase and the epilog phase as conditional execution instructions to be performed in accordance with predicates.
  • The conditional flags C6 and C4 are illustrated as predicates for an epilog instruction (Stage 2) and a prolog instruction (Stage 1), respectively.
  • When the above-described jloop and settar instructions are used in a source program written in the C language shown in FIG. 40, a compiler generates a machine language program shown in FIG. 41 by means of prolog/epilog removal software pipelining.
  • The processor 1 is capable of executing the following instructions which are applicable not only to 2-stage software pipelining, but also to 3-stage software pipelining: Instruction “jloop C6, C2:C4, TAR, Ra” and Instruction “settar C6, C2:C4, D9”.
  • These instructions “jloop C6, C2:C4, TAR, Ra” and “settar C6, C2:C4, D9” are equivalent to instructions in which the register Cm in the above-described 2-stage instructions “jloop C6, Cm, TAR, Ra” and “settar C6, Cm, D9” is extended to the registers C2, C3 and C4.
  • When the instruction “jloop C6, C2:C4, TAR, Ra” is decoded, the processor 1 behaves as follows, by using the address management unit 10b and others: (i) sets the conditional flag C4 to 0 when the value held in the register Ra is smaller than 0; (ii) moves the value of the conditional flag C3 to the conditional flag C2 and moves the value of the conditional flag C4 to the conditional flags C3 and C6; (iii) adds −1 to the register Ra and stores the result into the register Ra; and (iv) branches to an address specified by the branch register (TAR) 30d.
  • When not filled with a branch target instruction, the jump buffer 10f (branch instruction buffer) will be filled with a branch target instruction.
  • A detailed behavior is as shown in FIG. 42.
  • When the instruction “settar C6, C2:C4, D9” is decoded, the processor 1 behaves as follows, by using the address management unit 10b and others: (i) stores, into the branch register (TAR) 30d, an address that results from adding the value held in the program counter (PC) 33 and a displacement value (D9); (ii) fetches the instruction corresponding to such address and stores the instruction into the jump buffer 10f (branch instruction buffer); and (iii) sets the conditional flags C4 and C6 to 1 and the conditional flags C2 and C3 to 0. A detailed behavior is as shown in FIG. 43.
  • FIGS. 44(a) and (b) show the role of the conditional flags in the above 3-stage instructions “jloop C6, C2:C4, TAR, Ra” and “settar C6, C2:C4, D9”.
  • The conditional flags C2, C3 and C4 serve as predicates for Stage 3, Stage 2 and Stage 1, respectively.
  • FIG. 44(b) is a diagram showing how instruction execution is carried out when moving flags in such a case.
  • When the above-described jloop and settar instructions shown respectively in FIGS. 42 and 43 are used in a source program written in the C language shown in FIG. 45, a compiler generates a machine language program shown in FIG. 46 by means of epilog removal software pipelining.
  • The processor 1 is also capable of executing the following instructions which are applicable to 4-stage software pipelining: Instruction “jloop C6, C1:C4, TAR, Ra” and Instruction “settar C6, C1:C4, D9”.
  • When the instruction “jloop C6, C1:C4, TAR, Ra” is decoded, the processor 1 behaves as follows, by using the address management unit 10b and others: (i) sets the conditional flag C4 to 0 when the value held in the register Ra is smaller than 0; (ii) moves the value of the conditional flag C2 to the conditional flag C1, moves the value of the conditional flag C3 to the conditional flag C2, and moves the value of the conditional flag C4 to the conditional flags C3 and C6; (iii) adds −1 to the register Ra and stores the result into the register Ra; and (iv) branches to an address specified by the branch register (TAR) 30d. When not filled with a branch target instruction, the jump buffer 10f will be filled with a branch target instruction. A detailed behavior is as shown in FIG. 47.
  • Instruction settar is an instruction for storing a branch target address into the branch register (TAR) 30 d as well as for setting conditional flags (predicates, here).
  • When the instruction “settar C6, C1:C4, D9” is decoded, the processor 1 behaves as follows, by using the address management unit 10b and others: (i) stores an address that results from adding the value held in the program counter (PC) 33 and a displacement value (D9) into the branch register (TAR) 30d; (ii) fetches the instruction corresponding to such address and stores the instruction into the jump buffer 10f (branch instruction buffer); and (iii) sets the conditional flags C4 and C6 to 1 and the conditional flags C1, C2 and C3 to 0. A detailed behavior is as shown in FIG. 48.
  • When the above-described jloop and settar instructions shown respectively in FIGS. 47 and 48 are used in a source program written in the C language shown in FIG. 49, a compiler generates a machine language program shown in FIG. 50 by means of epilog removal software pipelining.
  • FIG. 51 is a diagram showing the behavior to be performed in 4-stage software pipelining that uses jloop and settar instructions shown respectively in FIGS. 47 and 48 .
  • The conditional flags C1-C4 are used as predicates, each of which indicates whether or not to execute an instruction.
  • Instructions A, B, C, and D are instructions to be executed in the first, second, third, and fourth stages in the software pipelining, respectively.
  • The instructions A, B, C, and D are associated with the conditional flags C4, C3, C2, and C1, respectively.
  • Instruction jloop is associated with the conditional flag C6.
  • FIG. 52 is a diagram for explaining an example method of setting the conditional flag C 6 for the Instruction jloop shown in FIG. 47 .
  • This method utilizes the following characteristic: in the case where the number of software pipelining stages is “N” (where “N” is an integer greater than or equal to 3) when a loop to be executed is unrolled into conditional execution instructions by means of software pipelining, the loop ends in the cycle following the one in which the conditional flag corresponding to the conditional execution instruction to be executed in the (N−2)th pipeline stage in the epilog phase becomes 0.
  • The conditional flag C6 is always set to 1.
  • In the epilog phase, the value of the conditional flag C3, being the conditional flag corresponding to the conditional execution instruction to be executed in the (N−2)th stage in the software pipelining, is set to the conditional flag C6 one cycle later.
  • Once a conditional flag becomes 0 in the epilog phase, the conditional execution instruction corresponding to the conditional flag in question is not to be executed until the end of the loop.
  • For example, the value of the conditional flag C4 becomes 0 in the fifth cycle, and remains 0 until the seventh cycle, in which the loop ends. Therefore, the instruction A that corresponds to the conditional flag C4 is not executed from the fifth cycle to the seventh cycle.
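  • The flag timing described above for 4-stage software pipelining can be traced with a small simulation. This is a sketch under explicit assumptions, not a model of the actual circuit: the flags move one stage per cycle (C4 to C3 to C2 to C1), C4 drops once the requested number of iterations has started (the register Ra is not modeled), and C6 is held at 1 until the epilog phase and then takes the value C3 had one cycle earlier. The function name `trace_four_stage_loop` is hypothetical.

```python
def trace_four_stage_loop(iterations):
    """Return one dict of flag values per loop cycle, from the first cycle to the last."""
    if iterations < 1:
        raise ValueError("at least one iteration is required")
    # State after settar, per the text: C4 = C6 = 1 and C1 = C2 = C3 = 0.
    c4, c3, c2, c1, c6 = 1, 0, 0, 0, 1
    started = 1  # iterations whose first stage (instruction A) has been issued
    trace = []
    while True:
        trace.append({"C4": c4, "C3": c3, "C2": c2, "C1": c1, "C6": c6})
        if c6 == 0:
            break  # jloop is predicated on C6, so no branch is taken and the loop ends
        # Assumption: C6 stays 1 until the epilog (C4 == 0), then follows C3 one cycle later.
        next_c6 = c3 if c4 == 0 else 1
        # Assumption: C4 drops once the requested number of iterations has started.
        next_c4 = 1 if (c4 == 1 and started < iterations) else 0
        c1, c2, c3 = c2, c3, c4  # C2 -> C1, C3 -> C2, C4 -> C3
        c4, c6 = next_c4, next_c6
        started += next_c4
    return trace


for cycle, flags in enumerate(trace_four_stage_loop(4), start=1):
    print(cycle, flags)
```

  • With four iterations, this model reproduces the timing stated above: C4 becomes 0 in the fifth cycle and the loop ends in the seventh cycle, in which only the instruction D associated with C1 is still executed.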
  • Control may be performed so that no instruction will be read out, until the loop processing ends, from the instruction buffer 10c (10d, 10e, and 10h) in which the instruction corresponding to such conditional flag is stored.
  • The decoding unit 20 may read out only the number of a conditional flag from the corresponding instruction buffer 10c (10d, 10e, and 10h), and check the value of the conditional flag based on such read-out number, so that the decoding unit 20 will not read out instructions from the instruction buffer 10c (10d, 10e, and 10h) when the value of the conditional flag is 0.
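  • The read-out control described above can be sketched as follows. This is an illustrative model only: each instruction buffer entry is represented as a pair of a conditional flag number and an instruction, and the function name `decode_cycle` and the flag values used are hypothetical.

```python
def decode_cycle(buffers, flags):
    """Read out only the instructions whose conditional flag is 1.

    buffers: list of (flag_number, instruction) pairs, one per instruction buffer.
    flags:   mapping from flag number to its current value (0 or 1).
    Returns the issued instructions and the number of full read-outs avoided.
    """
    issued = []
    skipped = 0
    for flag_number, instruction in buffers:
        # Only the flag number is inspected first.
        if flags[flag_number] == 0:
            skipped += 1  # the instruction itself is not read out from its buffer
            continue
        issued.append(instruction)
    return issued, skipped


# Instructions A-D predicated on C4-C1, with illustrative epilog-phase flag values.
buffers = [(4, "A"), (3, "B"), (2, "C"), (1, "D")]
epilog_flags = {4: 0, 3: 0, 2: 1, 1: 1}
issued, skipped = decode_cycle(buffers, epilog_flags)
```

  • In this example, two of the four buffers are left untouched, which illustrates where the power saving comes from.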
  • Instructions to be executed before and after the loop may be placed respectively in the prolog and epilog phases for execution.
  • The conditional flag C5 is assigned to an instruction X to be executed immediately before the loop and to an instruction Y to be executed immediately after the loop, so as to have such instructions X and Y executed in empty stages in the prolog and epilog phases, respectively. Accordingly, it becomes possible to reduce the number of empty stages in the prolog and epilog phases.
  • Different conditional flags shall be used for a conditional execution instruction to be executed when the condition is true and for a conditional execution instruction to be executed when the condition is false, so that the value of each conditional flag can be changed depending on a condition.
  • FIG. 54 is a diagram for explaining another example method of setting the conditional flag C 6 for the Instruction jloop shown in FIG. 47 .
  • This method utilizes the following characteristic: in the case where the number of software pipelining stages is “N” (where “N” is an integer greater than or equal to 2) when a loop to be executed is unrolled into conditional execution instructions by means of software pipelining, the loop ends in the same cycle as the one in which the conditional flag corresponding to the conditional execution instruction to be executed in the (N−1)th pipeline stage in the epilog phase becomes 0.
  • The conditional flag C6 is always set to 1.
  • In the epilog phase, the value of the conditional flag C2, being the conditional flag corresponding to the conditional execution instruction to be executed in the (N−1)th stage in the software pipelining, is set to the conditional flag C6 in the same cycle.
  • FIG. 55 is a diagram for explaining another example method of setting the conditional flag C 6 for the Instruction jloop shown in FIG. 47 .
  • This method utilizes the following characteristic: in the case where the number of software pipelining stages is “N” (where “N” is an integer greater than or equal to 4) when a loop to be executed is unrolled into conditional execution instructions by means of software pipelining, the loop ends in the cycle which is two cycles after the cycle in which the conditional flag corresponding to the conditional execution instruction to be executed in the (N−3)th pipeline stage in the epilog phase becomes 0.
  • The conditional flag C6 is always set to 1.
  • In the epilog phase, the value of the conditional flag C4, being the conditional flag corresponding to the conditional execution instruction to be executed in the (N−3)th stage in the software pipelining, is set to the conditional flag C6 two cycles later.
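  • The three methods of setting the conditional flag C6 can be related by a short timing argument. The following is a sketch under the assumptions that the pipeline advances by one stage per loop cycle and that the loop body is iterated I times; the symbols I, T and d_s are introduced here for illustration only.

```latex
% Stage s (1 <= s <= N) executes in cycles s, s+1, ..., s+I-1,
% so its conditional flag first becomes 0 in cycle d_s:
\[ d_s = s + I \]
% The last stage (s = N) executes its final iteration in cycle T, in which the loop ends:
\[ T = N + I - 1 \]
% For the stage s = N - k, the loop therefore ends k - 1 cycles after the flag becomes 0:
\[ T - d_{N-k} = (N + I - 1) - (N - k + I) = k - 1 \]
% k = 1: same cycle,       (N-1)th stage
% k = 2: one cycle later,  (N-2)th stage
% k = 3: two cycles later, (N-3)th stage
```

  • With N = 4 and I = 4, this gives d_1 = 5 and T = 7, which agrees with the conditional flag C4 becoming 0 in the fifth cycle and the loop ending in the seventh cycle.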
  • A machine language instruction with the above-described characteristics is generated by a compiler, where such compiler comprises: a parser step of parsing a source program; an intermediate code conversion step of converting the parsed source program into intermediate codes; an optimization step of optimizing the intermediate codes; and a code generation step of converting the optimized intermediate codes into machine language instructions.
  • A conditional flag for a loop is set by the use of a conditional flag for the epilog phase of software pipelining. Accordingly, there is no need to use special hardware resources such as a counter in order to judge whether or not loop processing has terminated, and it becomes possible to prevent the circuitry scale from becoming large. This contributes to a reduction in the power consumption of the processor.
  • Once a conditional execution instruction stops being executed in the epilog phase, such conditional execution instruction will not be executed in the software pipelining until the loop processing ends. Accordingly, there is no need to read out such a conditional execution instruction from the corresponding instruction buffer until the loop processing ends, which leads to a reduction in the power consumption of the processor.
  • According to the processor of the present invention, it is possible to provide a processor whose circuitry scale is small and which is capable of high-speed loop execution while consuming a small amount of power.
  • the processor according to the present invention is capable of executing instructions while consuming only a small amount of power. It is therefore possible for the processor to be employed as a core processor to be commonly used in a mobile phone, mobile AV device, digital television, DVD and others. Thus, the processor according to the present invention is extremely useful in the present age in which the advent of high-performance and cost effective multimedia apparatuses is desired.

Abstract

The present invention provides a processor which has a small-scale circuit and is capable of executing loop processing at a high speed while consuming a small amount of power. When the processor decodes an instruction “jloop C6,C1:C4,TAR,Ra”, the processor (i) sets a conditional flag C4 to 0 when the value of a register Ra is smaller than 0, (ii) moves the value of a conditional flag C2 to a conditional flag C1, moves the value of a conditional flag C3 to the conditional flag C2, and moves the value of the conditional flag C4 to the conditional flags C3 and C6, (iii) adds −1 to the register Ra and stores the result into the register Ra, and (iv) branches to an address specified by a branch register (TAR). When not filled with a branch target instruction, the jump buffer will be filled with a branch target instruction.

Description

  • This is a Rule 1.53(b) Divisional of Ser. No. 10/805,381, filed Mar. 22, 2004.
  • BACKGROUND OF THE INVENTION
  • (1) Field of the Invention
  • The present invention relates to a processor such as a DSP (Digital Signal Processor) and a CPU (Central Processing Unit), as well as to a compiler that generates instructions executed by such a processor. More particularly, the present invention relates to a processor and a compiler which are suitable for performing signal processing for sounds, images and others.
  • (2) Description of the Related Art
  • With the development in multimedia technologies, processors are increasingly required to be capable of high-speed media processing represented by sound and image signal processing. As existing processors responding to such requirement, there exist Pentium®/Pentium® III/Pentium 4® MMX/SSE/SSE2 and others produced by the Intel Corporation of the United States supporting SIMD (Single Instruction Multiple Data) instructions. Of these processors, MMX Pentium, for example, is capable of performing the same operations in one instruction on a maximum of eight integers stored in a 64-bit-long MMX register.
  • Such existing processors realize high-speed processing by utilizing software pipelining, as described in the following: Mitsuru Ikei, IA-64 Processor Basic Course (IA-64 Processor Kihon Koza), Tokyo: Ohmsha Ltd., 1999. FIG. 4.32 p. 129.
  • FIG. 56 is a diagram showing the operation of an existing processor using 4-stage software pipelining. In order to implement software pipelining, predicate flags used for predicates that indicate whether or not instructions should be executed are stored in a predicate register. In addition to this, the number of execution times until processing of the prolog phase in the software pipelining ends is stored in the loop counter, whereas the number of execution times until processing of the epilog phase in the software pipelining ends is stored in the epilog counter.
  • However, the above-described existing processor manages the loop counter, the epilog counter and the predicate register as individual hardware resources. Therefore, such processor is required to be equipped with many resources, which results in large-scale circuits.
  • Moreover, a large-scale circuit means that the amount of power consumed by the processor becomes large.
  • SUMMARY OF THE INVENTION
  • The present invention has been conceived in view of the above circumstances, and it is an object of the present invention to provide a processor whose circuitry scale is small and which is capable of performing loop processing at a high speed while consuming a low amount of power.
  • In order to achieve the above object, the processor according to the present invention is a processor for decoding an instruction and executing said decoded instruction. The processor comprises: a flag register in which a plurality of conditional execution flags are stored, where the plurality of conditional execution flags are used as predicates for conditional execution instructions; a decoding unit operable to decode an instruction; and an execution unit operable to execute the instruction decoded by the decoding unit. When the instruction decoded by the decoding unit is a loop instruction, an iteration of a loop to be executed terminates in the execution unit, based on a value of one of the plurality of conditional execution flags for an epilog phase in the loop in a case where the loop is unrolled into the conditional execution instructions by means of software pipelining.
  • As described above, a judgment is made as to whether or not the loop iteration has terminated, based on a conditional execution flag in the epilog phase in the case where such loop is unrolled into conditional execution instructions by means of software pipelining. Accordingly, there is no need to use special hardware resources such as a counter in order to judge whether or not the loop processing has terminated, and it becomes possible to prevent the circuitry scale from becoming large. This contributes to a reduction in the power consumption of the processor.
  • Moreover, the flag register may further store a loop flag which is used to judge whether or not the iteration has terminated, and the execution unit may set, to the loop flag, the value of the one of the plurality of conditional execution flags for the epilog phase. For example, the execution unit sets, to the loop flag in one cycle later in the epilog phase, the value of the conditional execution flag for a conditional execution instruction to be executed in an (N−2)th pipeline stage (where N is an integer greater than or equal to 3), in a case where the number of stages in the software pipelining is N and the stages are counted up each time processing in the epilog phase finishes.
  • As described above, a judgment is made as to whether or not the loop has terminated by use of the value of a conditional execution flag that is specified according to which stage of the software pipelining such conditional execution flag belongs to. Accordingly, there is no need to use special hardware resources such as a counter in order to judge whether or not the loop processing has terminated, and it becomes possible to prevent the circuitry scale from becoming large, regardless of how many stages are contained in the software pipelining. This contributes to a reduction in the power consumption of the processor.
  • Also, the processor according to the above configuration may further comprise an instruction buffer for temporarily storing the instruction decoded by the decoding unit, and in such processor, the decoding unit may be configured not to read out one of the conditional execution instructions from the instruction buffer until the loop terminates, when judging that the conditional execution instruction should not be executed based on the value of the one of the plurality of conditional execution flags for the epilog phase.
  • As described above, once a conditional execution instruction stops being executed in the epilog phase, the conditional execution instruction will not be executed in the software pipelining until the loop processing ends. Accordingly, there is no need to read out the conditional execution instruction from the corresponding instruction buffer, which makes it possible for the processor to consume a small amount of power.
  • Meanwhile, the compiler according to another aspect of the present invention is a compiler for translating a source program into a machine language program for a processor which is capable of executing instructions in parallel. The compiler comprises: a parser unit for parsing the source program; an intermediate code conversion unit for converting the parsed source program into intermediate codes; an optimization unit for optimizing the intermediate codes; and a code generation unit for converting the optimized intermediate codes into machine language instructions. The processor stores a plurality of flags which are used as predicates for conditional execution instructions, and the optimization unit, when the intermediate codes include a loop, places an instruction that is to be executed immediately before the loop in a prolog phase in the loop in a case where said loop is unrolled by means of software pipelining.
  • As described above, an instruction to be executed immediately before a loop is placed in the prolog phase in the case where such loop is unrolled by means of software pipelining. Accordingly, it becomes possible to reduce the number of empty stages in the software pipelining, and therefore to execute a program at a high speed. Furthermore, it also becomes possible to reduce the amount of power consumption of a processor that executes a program compiled by this compiler.
  • Also, the compiler according to another aspect of the present invention is a compiler for translating a source program into a machine language program for a processor which is capable of executing instructions in parallel. The compiler comprises: a parser unit for parsing the source program; an intermediate code conversion unit for converting the parsed source program into intermediate codes; an optimization unit for optimizing the intermediate codes; and a code generation unit for converting the optimized intermediate codes into machine language instructions. The processor stores a plurality of flags which are used as predicates for conditional execution instructions, and the optimization unit, when the intermediate codes include a conditional branch instruction, assigns the plurality of conditional execution flags so that a conditional execution flag which is used as a predicate for a conditional execution instruction in a case where a condition indicated by said conditional branch instruction is met, becomes different from a conditional execution flag used as a predicate for a conditional execution instruction in a case where the condition is not met.
  • As described above, even when an instruction to be executed when a predetermined condition is met and an instruction to be executed when the condition is not met are different as in the case of an if-else statement in the C language, for example, different flags to be used as predicates shall be associated with the respective instructions. Accordingly, it becomes possible to implement processing which is equivalent to a conditional branch instruction, simply by changing flag values. Since it is possible to realize a conditional branch instruction through such simple processing, it becomes possible to reduce the amount of power consumed by a processor that executes a program compiled by this compiler.
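  • The flag assignment described above for an if-else statement can be sketched as follows. This is an illustration under stated assumptions rather than the compiler's actual algorithm: a single comparison is assumed to set one flag to the result of the condition and a second flag to its complement, the flags C0 and C1 are chosen arbitrarily, and the names `if_convert` and `execute` are hypothetical.

```python
def if_convert(condition, then_ops, else_ops, true_flag=0, false_flag=1):
    """Turn `if (condition) then_ops else else_ops` into predicated instructions."""
    program = [("cmp", condition, true_flag, false_flag)]
    program += [("op", true_flag, op) for op in then_ops]    # predicate: condition met
    program += [("op", false_flag, op) for op in else_ops]   # predicate: condition not met
    return program


def execute(program, env):
    """Run the predicated program; no branch is taken at any point."""
    flags = [0] * 8  # conditional flags C0-C7
    for item in program:
        if item[0] == "cmp":
            _, condition, true_flag, false_flag = item
            met = bool(condition(env))
            # Assumption: the comparison sets complementary values in the two flags.
            flags[true_flag], flags[false_flag] = int(met), int(not met)
        else:
            _, flag, op = item
            if flags[flag] == 1:
                op(env)
    return env


# if (x > 0) y = x; else y = -x;
program = if_convert(
    lambda env: env["x"] > 0,
    [lambda env: env.update(y=env["x"])],
    [lambda env: env.update(y=-env["x"])],
)
result = execute(program, {"x": -3, "y": 0})
```

  • Both arms of the if-else statement are emitted as straight-line code, and changing the two flag values selects which arm takes effect.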
  • Note that not only is it possible to embody the present invention as a processor that executes the above characteristic instructions and a compiler that generates such characteristic instructions, but also as an operation processing method to be applied on plural data elements, and as a program that includes the characteristic instructions. In addition, it should also be noted that such program can be distributed via a recording medium such as CD-ROM (Compact Disc-Read Only Memory) and a transmission medium such as the Internet.
  • As further information about the technical background to this application, Japanese Patent application No. 2003-081132, filed on Mar. 24, 2003, is incorporated herein by reference.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • These and other objects, advantages and features of the invention will become apparent from the following description thereof when taken in conjunction with the accompanying drawings that illustrate a specific embodiment of the invention. In the Drawings:
  • FIG. 1 is a schematic block diagram showing a processor according to the present invention;
  • FIG. 2 is a schematic diagram showing arithmetic and logic/comparison operation units of the processor;
  • FIG. 3 is a block diagram showing a configuration of a barrel shifter of the processor;
  • FIG. 4 is a block diagram showing a configuration of a converter of the processor;
  • FIG. 5 is a block diagram showing a configuration of a divider of the processor;
  • FIG. 6 is a block diagram showing a configuration of a multiplication/sum of products operation unit of the processor;
  • FIG. 7 is a block diagram showing a configuration of an instruction control unit of the processor;
  • FIG. 8 is a diagram showing a configuration of general-purpose registers (R0-R31) of the processor;
  • FIG. 9 is a diagram showing a configuration of a link register (LR) of the processor;
  • FIG. 10 is a diagram showing a configuration of a branch register (TAR) of the processor;
  • FIG. 11 is a diagram showing a configuration of a program status register (PSR) of the processor;
  • FIG. 12 is a diagram showing a configuration of a conditional flag register (CFR) of the processor;
  • FIG. 13 is a diagram showing a configuration of accumulators (M0, M1) of the processor;
  • FIG. 14 is a diagram showing a configuration of a program counter (PC) of the processor;
  • FIG. 15 is a diagram showing a configuration of a PC save register (IPC) of the processor;
  • FIG. 16 is a diagram showing a configuration of a PSR save register (IPSR) of the processor;
  • FIG. 17 is a timing diagram showing a pipeline behavior of the processor;
  • FIG. 18 is a timing diagram showing each pipeline behavior when instructions are executed by the processor;
  • FIG. 19 is a diagram showing a parallel behavior of the processor;
  • FIG. 20A is a diagram showing a format of a 16-bit instruction executed by the processor;
  • FIG. 20B is a diagram showing a format of a 32-bit instruction executed by the processor;
  • FIGS. 21A and 21B are diagrams explaining instructions belonging to a category “ALUadd (addition) system”;
  • FIGS. 22A and 22B are diagrams explaining instructions belonging to a category “ALUsub (subtraction) system”;
  • FIGS. 23A and 23B are diagrams explaining instructions belonging to a category “ALUlogic (logical operation) system and others”;
  • FIGS. 24A and 24B are diagrams explaining instructions belonging to a category “CMP (comparison operation) system”;
  • FIGS. 25A and 25B are diagrams explaining instructions belonging to a category “mul (multiplication) system”;
  • FIGS. 26A and 26B are diagrams explaining instructions belonging to a category “mac (sum of products operation) system”;
  • FIGS. 27A and 27B are diagrams explaining instructions belonging to a category “msu (difference of products) system”;
  • FIGS. 28A and 28B are diagrams explaining instructions belonging to a category “MEMld (load from memory) system”;
  • FIGS. 29A and 29B are diagrams explaining instructions belonging to a category “MEMstore (store in memory) system”;
  • FIG. 30 is a diagram explaining instructions belonging to a category “BRA (branch) system”;
  • FIGS. 31A and 31B are diagrams explaining instructions belonging to a category “BSasl (arithmetic barrel shift) system and others”;
  • FIGS. 32A and 32B are diagrams explaining instructions belonging to a category “BSlsr (logical barrel shift) system and others”;
  • FIG. 33A is a diagram explaining instructions belonging to a category “CNVvaln (arithmetic conversion) system”;
  • FIGS. 34A and 34B are diagrams explaining instructions belonging to a category “CNV (general conversion) system”;
  • FIG. 35 is a diagram explaining instructions belonging to a category “SATvlpk (saturation processing) system”;
  • FIGS. 36A and 36B are diagrams explaining instructions belonging to a category “ETC (et cetera) system”;
  • FIG. 37 is a diagram explaining a detailed behavior of the processor when executing Instruction “jloop C6, Cm, TAR, Ra”;
  • FIG. 38 is a diagram explaining a detailed behavior of the processor when executing Instruction “settar C6, Cm, D9”;
  • FIG. 39 is a diagram showing prolog/epilog removal 2-stage software pipelining;
  • FIG. 40 is a diagram showing a source program written in the C language;
  • FIG. 41 is a diagram showing an example machine language program to be generated by using Instruction jloop and Instruction settar according to the present embodiment;
  • FIG. 42 is a diagram explaining a detailed behavior of the processor when executing Instruction “jloop C6, C2: C4, TAR, Ra”;
  • FIG. 43 is a diagram explaining a detailed behavior of the processor when executing Instruction “settar C6, C2: C4, D9”;
  • FIG. 44 is a diagram showing prolog/epilog removal 3-stage software pipelining;
  • FIG. 45 is a diagram showing a source program written in the C language;
  • FIG. 46 is a diagram showing an example machine language program to be generated by using Instruction jloop and Instruction settar according to the present embodiment;
  • FIG. 47 is a diagram explaining a detailed behavior of the processor when executing Instruction “jloop C6, C1: C4, TAR, Ra”;
  • FIG. 48 is a diagram explaining a detailed behavior of the processor when executing Instruction “settar C6, C1: C4, D9”;
  • FIG. 49 is a diagram showing a source program written in the C language;
  • FIG. 50 is a diagram showing an example machine language program to be generated by using Instruction jloop and Instruction settar according to the present embodiment;
  • FIG. 51 is a diagram showing a behavior to be performed in 4-stage software pipelining that uses the jloop and settar instructions shown respectively in FIGS. 47 and 48;
  • FIG. 52 is a diagram explaining an example method of setting a conditional flag C6 for Instruction jloop shown in FIG. 47;
  • FIG. 53 is a diagram showing a behavior of 4-stage software pipelining in which instructions to be executed before and after the loop are incorporated respectively into a prolog phase and an epilog phase;
  • FIG. 54 is a diagram explaining another example method of setting the conditional flag C6 for Instruction jloop shown in FIG. 47;
  • FIG. 55 is a diagram explaining further another example method of setting the conditional flag C6 for Instruction jloop shown in FIG. 47; and
  • FIG. 56 is a diagram showing a behavior of an existing processor using 4-stage software pipelining.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
  • An explanation is given for the architecture of the processor according to the present invention. The processor of the present invention is a general-purpose processor developed for the field of AV (Audio Visual) media signal processing technology, and instructions issued in this processor offer a higher degree of parallelism than in ordinary microcomputers. By being used as a core common to mobile phones, mobile AV devices, digital televisions, DVDs (Digital Versatile Discs) and others, the processor can improve software reusability. Furthermore, this processor allows multiple high-performance media processes to be performed with high cost effectiveness, and provides a development environment for high-level languages intended for improving development efficiency.
  • FIG. 1 is a schematic block diagram showing the processor according to the present invention. The processor 1 is comprised of an instruction control unit 10, a decoding unit 20, a register file 30, an operation unit 40, an I/F (interface) unit 50, an instruction memory unit 60, a data memory unit 70, an extended register unit 80, and an I/O (Input/Output) interface unit 90.
  • The operation unit 40 includes arithmetic and logic/comparison operation units 41-43 and 48, a multiplication/sum of products operation unit 44, a barrel shifter 45, a divider 46, and a converter 47 for performing operations of SIMD instructions. The multiplication/sum of products operation unit 44 is capable of performing accumulation which results in a maximum of a 65-bit operation result, without lowering bit precision. The multiplication/sum of products operation unit 44 is also capable of executing SIMD instructions as in the case of the arithmetic and logic/comparison operation units 41-43 and 48. Furthermore, the processor 1 is capable of parallel execution of an arithmetic and logic/comparison operation instruction on a maximum of four data elements.
  • FIG. 2 is a schematic diagram showing the arithmetic and logic/comparison operation units 41-43 and 48. Each of the arithmetic and logic/comparison operation units 41-43 and 48 is made up of an ALU (Arithmetic and Logical Unit) 41 a, a saturation processing unit 41 b, and a flag unit 41 c. The ALU 41 a includes an arithmetic operation unit (AU), a logical operation unit (LU), a comparator (CMP), and a TST. The bit widths of operation data to be supported by the ALU 41 a are 8 bits (when using four operation units in parallel), 16 bits (when using two operation units in parallel) and 32 bits (when using one operation unit to process 32-bit data). For a result of an arithmetic operation, the flag unit 41 c and the like detects an overflow and generates a conditional flag. For a result of each of the operation units, the comparator and the TST, an arithmetic shift right, saturation by the saturation processing unit 41 b, the detection of maximum/minimum values, and absolute value generation processing are performed.
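  • As a minimal sketch of the 8-bit, 16-bit and 32-bit operation widths described above, a 32-bit word can be modeled as four, two or one independent lanes. The function name is hypothetical, and wrap-around (non-saturating) addition is assumed for simplicity; saturation by the saturation processing unit 41 b is not modeled.

```python
def simd_add(x, y, lane_bits):
    """Add two 32-bit words lane by lane; carries do not cross lane boundaries."""
    assert lane_bits in (8, 16, 32)
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane = ((x >> shift) & mask) + ((y >> shift) & mask)
        result |= (lane & mask) << shift   # discard the carry out of the lane
    return result
```

  • The same pair of operands therefore yields different results depending on the selected lane width, because a carry out of one lane is never propagated into the next.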
  • FIG. 3 is a block diagram showing the configuration of the barrel shifter 45. The barrel shifter 45 is made up of selectors 45 a and 45 b, a higher bit barrel shifter 45 c, a lower bit barrel shifter 45 d, and a saturation processing unit 45 e. This barrel shifter 45 executes an arithmetic shift of data (shift in the 2's complement number system) or a logical shift of data (unsigned shift). Usually, 32-bit or 64-bit data is inputted to and outputted from the barrel shifter 45. The amount of shifting data stored in the register 30 a or 30 b is specified by another register or according to its immediate value. The barrel shifter 45 performs an arithmetic or logical shift of input data in the range of left 63 bits and right 63 bits, and outputs data of the same bit length as that of the input data.
  • The barrel shifter 45 is also capable of shifting 8-, 16-, 32-, or 64-bit data in response to a SIMD instruction. For example, the barrel shifter 45 can shift four pieces of 8-bit data in parallel.
  • An arithmetic shift, which is a shift in the 2's complement number system, is performed for decimal point alignment at the time of addition and subtraction, for multiplication of powers of 2 (the 1st power of 2, the 2nd power of 2, the −1st power of 2, the −2nd power of 2) and other purposes.
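  • The difference between an arithmetic shift and a logical shift can be illustrated with the following sketch for 32-bit right shifts. The function names are hypothetical, and the model covers only the 32-bit, non-SIMD case.

```python
def lsr32(value, amount):
    """Logical shift right: vacated high bits are filled with 0."""
    return (value & 0xFFFFFFFF) >> amount


def asr32(value, amount):
    """Arithmetic shift right: vacated high bits copy the sign bit."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        value -= 1 << 32           # reinterpret as a negative 2's complement number
    return (value >> amount) & 0xFFFFFFFF
```

  • For example, shifting the bit pattern of −8 right by one bit gives the pattern of −4 under the arithmetic shift, which corresponds to multiplication by the −1st power of 2, whereas the logical shift gives a large positive value.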
  • FIG. 4 is a block diagram showing the configuration of the converter 47. The converter 47 includes a saturation block (SAT) 47 a, a BSEQ block 47 b, an MSKGEN block 47 c, a VSUMB block 47 d, a BCNT block 47 e, and an IL block 47 f.
  • The saturation block (SAT) 47 a performs saturation processing on input data. By having two blocks for performing saturation processing on 32-bit data, the saturation block (SAT) 47 a supports a SIMD instruction executed on two data elements in parallel.
  • The BSEQ block 47 b counts consecutive 0s or 1s from the MSB (Most Significant Bit).
  • The MSKGEN block 47 c outputs a specified bit segment as 1, while outputting the others as 0.
  • The VSUMB block 47 d divides the input data into specified bit widths, and outputs their total sum.
  • The BCNT block 47 e counts the number of bits in the input data specified as 1.
  • The IL block 47 f divides the input data into specified bit widths, and outputs a value that results from exchanging the positions of data blocks.
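  • The functions of the blocks 47 b to 47 f can be sketched as follows. This is a behavioral model under stated assumptions only: the function names, the parameterization of bit segments and field widths, and the choice of reversing the field order as the position exchange of the IL block are all hypothetical, since the exact operand formats are not given in this part of the description.

```python
def bseq(value, width=32):
    """Count the run of identical bits (0s or 1s) starting at the MSB."""
    msb = (value >> (width - 1)) & 1
    count = 0
    for pos in range(width - 1, -1, -1):
        if (value >> pos) & 1 != msb:
            break
        count += 1
    return count


def mskgen(low, high):
    """Return a word with bits low..high (inclusive) set to 1 and the others 0."""
    return ((1 << (high - low + 1)) - 1) << low


def vsumb(value, field_bits=8, width=32):
    """Split the word into fields of field_bits and return their total sum."""
    mask = (1 << field_bits) - 1
    return sum((value >> s) & mask for s in range(0, width, field_bits))


def bcnt(value):
    """Count the bits that are 1."""
    return bin(value).count("1")


def il(value, field_bits=16, width=32):
    """Reverse the order of the fields (one possible exchange of positions)."""
    mask = (1 << field_bits) - 1
    fields = [(value >> s) & mask for s in range(0, width, field_bits)]
    out = 0
    for field in fields:               # lowest field ends up in the highest position
        out = (out << field_bits) | field
    return out
```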
  • FIG. 5 is a block diagram showing the configuration of the divider 46. With a dividend being 64 bits and a divisor being 32 bits, the divider 46 outputs 32-bit data as a quotient and a modulo, respectively. 34 cycles are involved for obtaining a quotient and a modulo. The divider 46 can handle both signed and unsigned data. Note, however, that the dividend and the divisor must share the same signedness, i.e. both are handled as signed or both as unsigned. The divider 46 is also capable of outputting an overflow flag and a 0 division flag.
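  • A minimal sketch of the input/output behavior of the divider 46 for unsigned data is given below. The function name is hypothetical, and the values returned for the quotient and the modulo when a flag is raised are placeholders chosen for this model; the description does not specify them. The 34-cycle latency is not modeled.

```python
def divide_unsigned(dividend, divisor):
    """Return (quotient, modulo, overflow, zero_division) for 64-bit / 32-bit."""
    assert 0 <= dividend < (1 << 64) and 0 <= divisor < (1 << 32)
    if divisor == 0:
        return 0, 0, False, True           # placeholder results in this model
    quotient, modulo = divmod(dividend, divisor)
    overflow = quotient >= (1 << 32)        # quotient does not fit in 32 bits
    return quotient & 0xFFFFFFFF, modulo, overflow, False
```

  • Because the quotient is limited to 32 bits while the dividend is 64 bits wide, an overflow is possible whenever the divisor is small relative to the dividend, which is why an overflow flag is needed in addition to the 0 division flag.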
  • FIG. 6 is a block diagram showing the configuration of the multiplication/sum of products operation unit 44. The multiplication/sum of products operation unit 44, which is made up of two 32-bit multipliers (MUL) 44 a and 44 b, three 64-bit adders (Adder) 44 c-44 e, a selector 44 f and a saturation processing unit (Saturation) 44 g, performs the following multiplications and sums of products:
      • Multiplication, sum of products, and difference of products on signed 32×32-bit data;
      • Multiplication on signed 32×32-bit data;
      • Multiplication, sum of products, and difference of products on two signed 16×16-bit data in parallel; and
      • Multiplication, sum of products, and difference of products on two 32×16-bit signed data in parallel.
  • The above operations are performed on data in integer and fixed point format (h1, h2, w1, and w2). Also, the results of these operations are rounded and saturated.
  • FIG. 7 is a block diagram showing the configuration of the instruction control unit 10. The instruction control unit 10, which is made up of an instruction cache 10 a, an address management unit 10 b, instruction buffers 10 c-10 e and 10 h, a jump buffer 10 f, and a rotation unit (rotation) 10 g, issues instructions at ordinary times and at branch points. By having four 128-bit instruction buffers (the instruction buffers 10 c-10 e and 10 h), the instruction control unit 10 supports the maximum number of parallel instruction execution. Regarding branch processing, the instruction control unit 10 stores, in advance, a branch target instruction into the jump buffer 10 f and stores a branch target address into the below-described TAR register before performing a branch (settar instruction). Thus, the instruction control unit 10 performs the branch by using the branch target address stored in the TAR register and the branch target instruction stored in the jump buffer 10 f.
  • Note that the processor 1 is a processor with a VLIW architecture. The VLIW architecture is an architecture that allows a plurality of instructions (e.g. load, store, operation, and branch) to be stored in a single instruction word, and allows such instructions to be executed all at once. If a programmer describes a set of instructions which can be executed in parallel as a single issue group, it is possible for such issue group to be processed in parallel. In this specification, the delimiter of an issue group is indicated by “;;”. Notational examples are described below.
  • Example 1
  • mov r1, 0x23;;
  • This instruction description indicates that only an instruction “mov” shall be executed.
  • Example 2
  • mov r1, 0x38
  • add r0, r1, r2
  • sub r3, r1, r2;;
  • These instruction descriptions indicate that three instructions of “mov”, “add” and “sub” shall be executed in parallel.
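  • The issue group notation of Examples 1 and 2 can be illustrated with the following sketch, which splits assembly text into issue groups at the delimiter “;;”. The function name is hypothetical and the sketch performs no validation of the instructions themselves.

```python
def split_issue_groups(source):
    """Split assembly text into issue groups terminated by ';;'."""
    groups = []
    for chunk in source.split(";;"):
        instructions = [line.strip() for line in chunk.splitlines() if line.strip()]
        if instructions:
            groups.append(instructions)
    return groups


program = """
mov r1, 0x23;;
mov r1, 0x38
add r0, r1, r2
sub r3, r1, r2;;
"""
groups = split_issue_groups(program)
```

  • Here `groups` holds two issue groups: the first contains the single instruction of Example 1, and the second contains the three instructions of Example 2 that are to be executed in parallel.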
  • The instruction control unit 10 identifies an issue group and sends the identified issue group to the decoding unit 20. The decoding unit 20 decodes the instructions in the issue group, and controls resources required for executing such instructions.
  • Next, an explanation is given for registers included in the processor 1.
  • Table 1 below lists a set of registers of the processor 1.
  • TABLE 1
    Register name | Bit width | No. of registers | Usage
    R0-R31 | 32 bits | 32 | General-purpose registers. Used as data memory pointer, data storage at the time of operation instruction, and the like.
    TAR | 32 bits | 1 | Branch register. Used as branch address storage at branch point.
    LR | 32 bits | 1 | Link register.
    SVR | 16 bits | 2 | Save register. Used for saving conditional flag (CFR) and various modes.
    M0-M1 (MH0:ML0-MH1:ML1) | 64 bits | 2 | Operation registers. Used as data storage when operation instruction is executed.
  • Table 2 below lists a set of flags (flags managed in a conditional flag register and the like described later) of the processor 1.
  • TABLE 2
    Flag name | Bit width | No. of flags | Usage
    C0-C7 | 1 | 8 | Conditional flags. Indicate if condition is true or false.
    VC0-VC3 | 1 | 4 | Conditional flags for media processing extension instruction. Indicate if condition is true or false.
    OVS | 1 | 1 | Overflow flag. Detects overflow at the time of operation.
    CAS | 1 | 1 | Carry flag. Detects carry at the time of operation.
    BPO | 5 | 1 | Specifies bit position. Specifies bit positions to be processed when mask processing instruction is executed.
    ALN | 2 | 1 | Specified byte alignment.
    FXP | 1 | 1 | Fixed point arithmetic mode.
    UDR | 32 | 1 | Undefined register.
  • FIG. 8 is a diagram showing the configuration of the general-purpose registers (R0-R31) 30 a. The general-purpose registers (R0-R31) 30 a are a group of 32-bit registers that constitute an integral part of the context of a task to be executed and that store data or addresses. Note that the general-purpose registers R30 and R31 are used by hardware as a global pointer and a stack pointer, respectively.
  • FIG. 9 is a diagram showing the configuration of a link register (LR) 30 c. In connection with this link register (LR) 30 c, the processor 1 also has a save register (SVR) which is not illustrated in FIG. 9. The link register (LR) 30 c is a 32-bit register in which a return address at the time of a function call is stored. Note that the save register (SVR) is a 16-bit register for saving a conditional flag (CFR.CF) of the conditional flag register at the time of a function call. The link register (LR) 30 c is also used for the purpose of increasing the speed of loops, as in the case of a branch register (TAR) to be explained later. 0 is always read out from the low 1 bit of the link register (LR) 30 c, and 0 must be written to the low 1 bit of the link register (LR) 30 c at the time of writing.
  • For example, when executing “call (brl, jmpl)” instructions, the processor 1 saves a return address into the link register (LR) 30 c and saves a conditional flag (CFR.CF) into the save register (SVR). When executing a “jmp” instruction, the processor 1 fetches the return address (branch destination address) from the link register (LR) 30 c, and restores a program counter (PC). Furthermore, when executing a “ret (jmpr)” instruction, the processor 1 fetches the branch destination address (return address) from the link register (LR) 30 c, and stores (restores) the branch destination address into the program counter (PC). Moreover, the processor 1 fetches the conditional flag from the save register (SVR) so as to store (restore) the conditional flag into a conditional flag area CFR.CF in the conditional flag register (CFR) 32.
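  • The saving and restoring performed on a function call and return can be sketched as follows. The class and attribute names are hypothetical, and only a single level of call is modeled.

```python
class CallState:
    """Toy model of PC, LR, SVR and the CFR.CF field during call and return."""

    def __init__(self):
        self.pc = 0        # program counter (PC)
        self.lr = 0        # link register (LR)
        self.svr = 0       # save register (SVR)
        self.cfr_cf = 0    # conditional flag area CFR.CF

    def call(self, target, return_address):
        self.lr = return_address & ~1   # the low 1 bit of LR is always 0
        self.svr = self.cfr_cf          # save the conditional flags
        self.pc = target

    def ret(self):
        self.pc = self.lr               # restore the return address into PC
        self.cfr_cf = self.svr          # restore the conditional flags
```

  • In this model, conditional flags modified by the called function do not affect the caller, since the values saved in the save register (SVR) at the time of the call are written back on return.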
  • FIG. 10 is a diagram showing the configuration of the branch register (TAR) 30 d. The branch register (TAR) 30 d is a 32-bit register in which a branch target address is stored, and which is used mainly for the purpose of increasing the speed of loops. 0 is always read out from the low 1 bit of the branch register (TAR) 30 d, and 0 must be written to the low 1 bit of the branch register (TAR) 30 d at the time of writing.
  • For example, when executing “jmp” and “jloop” instructions, the processor 1 fetches a branch target address from the branch register (TAR) 30 d, and stores the branch target address in the program counter (PC). When the instruction indicated by the address stored in the branch register (TAR) 30 d is stored in a branch instruction buffer, a branch penalty will be 0. An increased loop speed can be achieved by storing the top address of a loop in the branch register (TAR) 30 d.
  • FIG. 11 is a diagram showing the configuration of a program status register (PSR) 31. The program status register (PSR) 31, which constitutes an integral part of the context of a task to be executed, is a 32-bit register in which the following processor status information is stored:
  • Bit SWE: indicates whether the switching of VMP (Virtual Multi-Processor) to LP (Logical Processor) is enabled or disabled. “0” indicates that switching to LP is disabled and “1” indicates that switching to LP is enabled.
  • Bit FXP: indicates a fixed point mode. “0” indicates mode 0 and “1” indicates mode 1.
  • Bit IH: is an interrupt processing flag indicating whether or not maskable interrupt processing is ongoing. “1” indicates that there is an ongoing interrupt processing and “0” indicates that there is no ongoing interrupt processing. “1” is automatically set on the occurrence of an interrupt. This flag is used to make a distinction of which one of interrupt processing and program processing is taking place at a point in the program to which the processor returns in response to a “rti” instruction.
  • Bit EH: is a flag indicating whether or not an error or an NMI is being processed. “0” indicates that error processing or NMI interrupt processing is not ongoing and “1” indicates that error processing or NMI interrupt processing is ongoing. This flag is masked if an asynchronous error or an NMI occurs when EH=1. Meanwhile, when VMP is enabled, plate switching of VMP is masked.
  • Bit PL [1:0]: indicates a privilege level. “00” indicates the privilege level 0, i.e. the processor abstraction level, “01” indicates the privilege level 1 (non-settable), “10” indicates the privilege level 2, i.e. the system program level, and “11” indicates the privilege level 3, i.e. the user program level.
  • Bit LPIE3: indicates whether LP-specific interrupt 3 is enabled or disabled. “1” indicates that an interrupt is enabled and “0” indicates that an interrupt is disabled.
  • Bit LPIE2: indicates whether LP-specific interrupt 2 is enabled or disabled. “1” indicates that an interrupt is enabled and “0” indicates that an interrupt is disabled.
  • Bit LPIE1: indicates whether LP-specific interrupt 1 is enabled or disabled. “1” indicates that an interrupt is enabled and “0” indicates that an interrupt is disabled.
  • Bit LPIE0: indicates whether LP-specific interrupt 0 is enabled or disabled. “1” indicates that an interrupt is enabled and “0” indicates that an interrupt is disabled.
  • Bit AEE: indicates whether a misalignment exception is enabled or disabled. “1” indicates that a misalignment exception is enabled and “0” indicates that a misalignment exception is disabled.
  • Bit IE: indicates whether a level interrupt is enabled or disabled. “1” indicates that a level interrupt is enabled and “0” indicates that a level interrupt is disabled.
  • Bit IM [7:0]: indicates an interrupt mask, and ranges from levels 0-7, each being able to be masked at its own level. Level 0 is the highest level. Of the interrupt requests which are not masked by any IMs, only the interrupt request with the highest level is accepted by the processor 1. When the interrupt request is accepted, levels below the level of the accepted interrupt request are automatically masked by hardware. IM[0] denotes a mask of level 0, IM[1] denotes a mask of level 1, IM[2] denotes a mask of level 2, IM[3] denotes a mask of level 3, IM[4] denotes a mask of level 4, IM[5] denotes a mask of level 5, IM[6] denotes a mask of level 6, and IM[7] denotes a mask of level 7.
  • reserved: indicates a reserved bit. 0 is always read out from “reserved”. 0 must be written to “reserved” at the time of writing.
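  • The acceptance rule described for Bit IM [7:0] can be sketched as follows. The function name is hypothetical, and it is assumed for this sketch that a bit value of 1 in IM[n] means that level n is masked; the description above does not state the polarity of the mask bits.

```python
def accept_interrupt(pending_levels, im):
    """Return (accepted_level, new_im); accepted_level is None if all are masked."""
    candidates = [lv for lv in pending_levels if not (im >> lv) & 1]
    if not candidates:
        return None, im
    accepted = min(candidates)               # level 0 is the highest level
    for lower in range(accepted + 1, 8):     # mask every level below the accepted one
        im |= 1 << lower
    return accepted, im
```

  • For example, with requests pending at levels 2, 5 and 6 and only level 2 masked, level 5 is accepted and levels 6 and 7 are then masked automatically.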
  • FIG. 12 is a diagram showing the configuration of the conditional flag register (CFR) 32. The conditional flag register (CFR) 32, which constitutes an integral part of the context of a task to be executed, is a 32-bit register made up of conditional flags, operation flags, vector conditional flags, an operation instruction bit position specification field, and a SIMD data alignment information field.
  • Bit ALN [1:0]: indicates an alignment mode. An alignment mode of “valnvc” instruction is set.
  • Bit BPO [4:0]: indicates a bit position. It is used in an instruction that requires a bit position specification.
  • Bit VC0-VC3: are vector conditional flags. Starting from a byte on the LSB (Least Significant Bit) side or a half word through to the MSB side, each corresponds to a flag ranging from VC0 through VC3.
  • Bit OVS: is an overflow flag (summary). It is set on the detection of saturation and overflow. If not detected, a value before the execution of the instruction is retained. Clearing of this flag needs to be carried out by software.
  • Bit CAS: is a carry flag (summary). It is set when a carry occurs under an “addc” instruction, or when a borrow occurs under a “subc” instruction. If there is no occurrence of a carry under an “addc” instruction or a borrow under a “subc” instruction, a value before the execution of the instruction is retained as the Bit CAS. Clearing of this flag needs to be carried out by software.
  • Bit C0-C7: are conditional flags. The value of the flag C7 is always 1. A reflection of a FALSE condition (writing of 0) made to the flag C7 is ignored.
  • reserved: indicates a reserved bit. 0 is always read out from “reserved”. 0 must be written to “reserved” at the time of writing.
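  • Two properties of the flags described above, namely that the flag C7 always reads as 1 and that the flag OVS retains its value until cleared by software, can be modeled with the following sketch. The class and method names are hypothetical, and bit positions within the conditional flag register (CFR) 32 are not modeled.

```python
class ConditionFlags:
    """Toy model of C0-C7 and the OVS summary flag."""

    def __init__(self):
        self.c = [False] * 7 + [True]   # C0-C6 are writable, C7 is fixed at 1
        self.ovs = False                # overflow flag (summary)

    def write_c(self, index, value):
        if index == 7:
            return                      # a write to C7 has no effect in this model
        self.c[index] = bool(value)

    def record_overflow(self, detected):
        # OVS is set on detection and otherwise keeps its previous value.
        self.ovs = self.ovs or detected

    def clear_ovs(self):
        self.ovs = False                # clearing is carried out by software
```

  • Because C7 is always 1, an instruction predicated on C7 is in effect executed unconditionally.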
  • FIGS. 13(a) and (b) are diagrams showing the configuration of accumulators (M0, M1) 30 b. Such accumulators (M0, M1) 30 b, which constitute an integral part of the context of a task to be executed, are made up of a 32-bit register MH0-MH1 (register for multiply and divide/sum of products (the higher 32 bits)) shown in (a) in FIG. 13 and a 32-bit register ML0-ML1 (register for multiply and divide/sum of products (the lower 32 bits)) shown in (b) in FIG. 13.
  • The register MH0-MH1 is used for storing the higher 32 bits of an operation result at the time of a multiply instruction, whereas the register MH0-MH1 is used as the higher 32 bits of the accumulators at the time of a sum of products instruction. Moreover, the register MH0-MH1 can be used in combination with the general-purpose registers in the case where a bit stream is handled. Meanwhile, the register ML0-ML1 is used for storing the lower 32 bits of an operation result at the time of a multiply instruction, whereas the register ML0-ML1 is used as the lower 32 bits of the accumulators at the time of a sum of products instruction.
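  • The division of a 64-bit multiplication result into a higher 32-bit part and a lower 32-bit part can be sketched as follows. The function name is hypothetical, and a plain signed integer multiplication is assumed; fixed point modes, rounding and saturation are not modeled.

```python
def split_product(a, b):
    """Multiply two signed 32-bit values and split the 64-bit result into (MH, ML)."""
    product = (a * b) & 0xFFFFFFFFFFFFFFFF   # 64-bit 2's complement bit pattern
    mh = product >> 32                        # higher 32 bits
    ml = product & 0xFFFFFFFF                 # lower 32 bits
    return mh, ml
```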
  • FIG. 14 is a diagram showing the configuration of a program counter (PC) 33. This program counter (PC) 33, which constitutes an integral part of the context of a task to be executed, is a 32-bit counter that holds the address of an instruction being executed. “0” is always stored in the low 1 bit of the program counter (PC) 33.
  • FIG. 15 is a diagram showing the configuration of a PC save register (IPC) 34. This PC save register (IPC) 34, which constitutes an integral part of the context of a task to be executed, is a 32-bit register. “0” is always read out from the low 1 bit of the PC save register (IPC) 34. “0” must be written to the low 1 bit of the PC save register (IPC) 34 at the time of writing.
  • FIG. 16 is a diagram showing the configuration of a PSR save register (IPSR) 35. This PSR save register (IPSR) 35, which constitutes an integral part of the context of a task to be executed, is a 32-bit register for saving the program status register (PSR) 31. 0 is always read out from a part in the PSR save register (IPSR) 35 corresponding to a reserved bit in the program status register (PSR) 31, and 0 must be written to that part at the time of writing.
  • Next, an explanation is given for the memory space of the processor 1. In the processor 1, a linear memory space with a capacity of 4 GB is divided into 32 segments, and an instruction SRAM (Static RAM) and a data SRAM are allocated to 128-MB segments. With a 128-MB segment serving as one block, a target block to be accessed is set in a SAR (SRAM Area Register). A direct access is made to the instruction SRAM/data SRAM when the accessed address is a segment set in the SAR, but an access request shall be issued to a bus controller (BUC) when such address is not a segment set in the SAR. An on chip memory (OCM), an external memory, an external device, an I/O port and others are connected to the BUC. The processor 1 is capable of reading/writing data from and to these devices.
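  • Since 4 GB divided into 32 segments gives 128 MB (2 to the 27th power bytes) per segment, the segment of an address is given by its upper 5 bits. The following sketch illustrates the resulting access routing. The function names are hypothetical, and the SAR is modeled simply as a set of segment numbers, because its actual encoding is not described here.

```python
SEGMENT_SHIFT = 27   # 128 MB = 2**27 bytes per segment


def segment_of(address):
    """Return the segment number (0-31) of a 32-bit address."""
    return (address & 0xFFFFFFFF) >> SEGMENT_SHIFT


def route_access(address, sar_segments):
    """Return 'SRAM' for a segment set in the SAR, otherwise 'BUC'."""
    return "SRAM" if segment_of(address) in sar_segments else "BUC"
```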
  • FIG. 17 is a timing diagram showing the pipeline behavior of the processor 1. As illustrated in FIG. 17, the pipeline of the processor 1 basically consists of the following five stages: instruction fetch; instruction assignment (dispatch); decode; execution; and writing.
  • FIG. 18 is a timing diagram showing each stage of the pipeline behavior of the processor 1 at the time of executing an instruction. In the instruction fetch stage, an access is made to an instruction memory which is indicated by an address specified by the program counter (PC) 33, and the instruction is transferred to the instruction buffers 10 c-10 e and 10 h, and the like. In the instruction assignment stage, the output of branch target address information in response to a branch instruction, the output of an input register control signal, and the assignment of a variable length instruction are carried out, which is followed by the transfer of the instruction to an instruction register (IR). In the decode stage, the instruction stored in the IR is inputted to the decoding unit 20, from which an operation unit control signal and a memory access signal are outputted. In the execution stage, an operation is executed and the result of the operation is outputted either to the data memory or the general-purpose registers (R0-R31) 30 a. In the writing stage, a value obtained as a result of data transfer, and the operation results are stored in the general-purpose registers.
  • The VLIW architecture of the processor 1 allows parallel execution of the above processing on a maximum of four data elements. Therefore, the processor 1 performs parallel execution as shown in FIG. 18 at the timing shown in FIG. 19.
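  • The overlap of the five pipeline stages can be sketched with the following timing model. The names are hypothetical, and it is assumed for this sketch that one issue group enters the pipeline per cycle and that no stall or branch penalty occurs.

```python
STAGES = ["fetch", "dispatch", "decode", "execute", "write"]


def pipeline_schedule(num_groups):
    """Map each cycle to the (issue group, stage) pairs active in that cycle."""
    schedule = {}
    for group in range(num_groups):
        for offset, stage in enumerate(STAGES):
            schedule.setdefault(group + offset, []).append((group, stage))
    return schedule
```

  • Under these assumptions, a sequence of n issue groups completes in n + 4 cycles, and from the fifth cycle onward all five stages are occupied as long as new issue groups keep arriving.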
  • Next, an explanation is given for a set of instructions executed by the processor 1 with the above configuration.
  • Tables 3-5 list categorized instructions to be executed by the processor 1.
  • TABLE 3
    Category | Operation unit | Instruction operation code
    Memory move instruction (load) | M | ld,ldh,ldhu,ldb,ldbu,ldp,ldhp,ldbp,ldbh,ldbuh,ldbhp,ldbuhp
    Memory move instruction (store) | M | st,sth,stb,stp,sthp,stbp,stbh,stbhp
    Memory move instruction (others) | M | dpref,ldstb
    External register move instruction | M | rd,rde,wt,wte
    Branch instruction | B | br,brl,call,jmp,jmpl,jmpr,ret,jmpf,jloop,setbb,setlr,settar
    Software interrupt instruction | B | rti,pi0,pi0l,pi1,pi1l,pi2,pi2l,pi3,pi3l,pi4,pi4l,pi5,pi5l,pi6,pi6l,pi7,pi7l,sc0,sc1,sc2,sc3,sc4,sc5,sc6,sc7
    VMP/interrupt control instruction | B | intd,inte,vmpsleep,vmpsus,vmpswd,vmpswe,vmpwait
    Arithmetic operation instruction | A | abs,absvh,absvw,add,addarvw,addc,addmsk,adds,addsr,addu,addvh,addvw,neg,negvh,negvw,rsub,s1add,s2add,sub,subc,submsk,subs,subvh,subvw,max,min
    Logical operation instruction | A | and,andn,or,sethi,xor,not
    Compare instruction | A | cmpCC,cmpCCa,cmpCCn,cmpCCo,tstn,tstna,tstnn,tstno,tstz,tstza,tstzn,tstzo
    Move instruction | A | mov,movcf,mvclcas,mvclovs,setlo,vcchk
    NOP instruction | A | nop
    Shift instruction 1 | S1 | asl,aslvh,aslvw,asr,asrvh,asrvw,lsl,lsr,rol,ror
    Shift instruction 2 | S2 | aslp,aslpvw,asrp,asrpvw,lslp,lsrp
  • TABLE 4
    Operation
    Category unit Instruction operation code
    Extract instruction S2 ext,extb,extbu,exth,exthu,extr,extru,extu
    Mask instruction C msk,mskgen
    Saturation C sat12,sat9,satb,satbu,sath,satw
    instruction
    Conversion C valn,valn1,valn2,valn3,valnvc1,valnvc2,
    instruction valnvc3,valnvc4,vhpkb,vhpkh,vhunpkb,
    vhunpkh,vintlhb,vintlhh,vintllb,vintllh,
    vlpkb,vlpkbu,vlpkh,vlpkhu,vlunpkb,
    vlunpkbu,vlunpkh,vlunpkhu,vstovb,vstovh,
    vunpk1,vunpk2,vxchngh,vexth
    Bit count instruction C bcnt1,bseq,bseq0,bseq1
    Others C byterev,extw,mskbrvb,mskbrvh,rndvh,
    movp
    Multiply instruction 1 X1 fmulhh,fmulhhr,fmulhw,fmulhww,hmul,
    lmul
    Multiply instruction 2 X2 fmulww,mul,mulu
    Sum of products X1 fmachh,fmachhr,fmachw,fmachww,hmac,
    instruction 1 lmac
    Sum of products X2 fmacww,mac
    instruction 2
    Difference of X1 fmsuhh,fmsuhhr,fmsuhw,fmsuww,hmsu,
    products instruction 1 lmsu
    Difference of X2 fmsuww,msu
    products instruction 2
    Divide instruction DIV div,divu
    Debugger instruction DBGM dbgm0,dbgm1,dbgm2,dbgm3
  • TABLE 5
    Operation
    Category unit Instruction operation code
    SIMD arithmetic A vabshvh,vaddb,vaddh,vaddhvc,vaddhvh,
    operation instruction vaddrhvc,vaddsb,vaddsh,vaddsrb,vaddsrh,
    vasubb,vcchk,vhaddh,vhaddhvh,
    vhsubh,vhsubhvh,vladdh,vladdhvh,vlsubh,
    vlsubhvh,vnegb,vnegh,vneghvh,vsaddb,
    vsaddh,vsgnh,vsrsubb,vsrsubh,vssubb,
    vssubh,vsubb,vsubh,vsubhvh,vsubsh,
    vsumh,vsumh2,vsumrh2,vxaddh,
    vxaddhvh,vxsubh,vxsubhvh,
    vmaxb,vmaxh,vminb,vminh,vmovt,vsel
    SIMD compare A vcmpeqb,vcmpeqh,vcmpgeb,vcmpgeh,
    instruction vcmpgtb,vcmpgth,vcmpleb,vcmpleh,vcmpltb,
    vcmplth,vcmpneb,vcmpneh,
    vscmpeqb,vscmpeqh,vscmpgeb,vscmpgeh,
    vscmpgtb,vscmpgth,vscmpleb,vscmpleh,
    vscmpltb,vscmplth,vscmpneb,vscmpneh
    SIMD shift S1 vaslb,vaslh,vaslvh,vasrb,vasrh,vasrvh,
    instruction 1 vlslb,vlslh,vlsrb,vlsrh,vrolb,vrolh,vrorb,
    vrorh
    SIMD shift S2 vasl,vaslvw,vasr,vasrvw,vlsl,vlsr
    instruction 2
    SIMD saturation C vsath,vsath12,vsath8,vsath8u,vsath9
    instruction
    Other SIMD C vabssumb,vrndvh
    instruction
    SIMD multiply X2 vfmulh,vfmulhr,vfmulw,vhfmulh,vhfmulhr,
    instruction vhfmulw,vhmul,vlfmulh,vlfmulhr,vlfmulw,
    vlmul,vmul,vpfmulhww,vxfmulh,
    vxfmulhr,vxfmulw,vxmul
    SIMD sum of X2 vfmach,vfmachr,vfmacw,vhfmach,vhfmachr,
    products instruction vhfmacw,vhmac,vlfmach,vlfmachr,
    vlfmacw,vlmac,vmac,vpfmachww,vxfmach,
    vxfmachr,vxfmacw,vxmac
    SIMD difference of X2 vfmsuh,vfmsuw,vhfmsuh,vhfmsuw,vhmsu,
    products instruction vlfmsuh,vlfmsuw,vlmsu,vmsu,vxfmsuh,
    vxfmsuw,vxmsu
  • Note that “Operation units” in the above tables refer to operation units used in the respective instructions. More specifically, “A” denotes an ALU instruction, “B” denotes a branch instruction, “C” denotes a conversion instruction, “DIV” denotes a divide instruction, “DBGM” denotes a debug instruction, “M” denotes a memory access instruction, “S1” and “S2” denote a shift instruction, and “X1” and “X2” denote a multiply instruction.
  • FIG. 20A is a diagram showing the format of a 16-bit instruction executed by the processor 1, and FIG. 20B is a diagram showing the format of a 32-bit instruction executed by the processor 1.
  • The following describes the meaning of the acronyms used in the diagrams: “E” is an end bit (boundary of parallel execution); “F” is a format bit (00, 01, 10: 16-bit instruction format, 11: 32-bit instruction format); “P” is a predicate (execution condition: one of the eight conditional flags C0-C7 is specified); “OP” is an operation code field; “R” is a register field; “I” is an immediate value field; and “D” is a displacement field. Note that the “E” field is unique to VLIW, and an instruction corresponding to E=0 is executed in parallel with the next instruction. In other words, the “E” field realizes VLIWs whose degree of parallelism is variable. Furthermore, predicates, which are flags for controlling whether or not to execute an instruction based on the values of the conditional flags C0-C7, allow instructions to be selectively executed without using a branch instruction and therefore speed up processing.
  • For example, when an instruction specifies the conditional flag C0 as its predicate, the instruction is executed if the conditional flag C0 is 1 and is not executed if the conditional flag C0 is 0.
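  • The predicate behavior described above can be sketched in C as follows. This is only an illustrative model, not the processor's implementation: the `CondFlags` structure and the `predicated_add` helper are hypothetical names introduced here, and an addition is used merely as a stand-in for any predicated instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the eight conditional flags C0..C7. */
typedef struct {
    bool c[8];
} CondFlags;

/* Models "[Cp] add dst, a, b": the addition takes effect only when the
 * predicate flag Cp is 1; otherwise the destination keeps its old value.
 * The sum is computed in unsigned arithmetic so it wraps instead of
 * overflowing. */
static int32_t predicated_add(const CondFlags *f, unsigned p,
                              int32_t dst, int32_t a, int32_t b)
{
    if (f->c[p & 7u]) {
        return (int32_t)((uint32_t)a + (uint32_t)b);
    }
    return dst; /* instruction suppressed, no branch needed */
}
```

  • With C0 set to 1 the helper returns the sum; with C0 set to 0 it returns the previous destination value, which mirrors an instruction that is fetched but has no effect.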
  • FIGS. 21A-36B are diagrams explaining an outlined functionality of the instructions executed by the processor 1. More specifically, FIGS. 21A and 21B explain instructions belonging to the category “ALUadd (addition) system”; FIGS. 22A and 22B explain instructions belonging to the category “ALUsub (subtraction) system”; FIGS. 23A and 23B explain instructions belonging to the category “ALUlogic (logical operation) system and others”; FIGS. 24A and 24B explain instructions belonging to the category “CMP (comparison operation) system”; FIGS. 25A and 25B explain instructions belonging to the category “mul (multiplication) system”; FIGS. 26A and 26B explain instructions belonging to the category “mac (sum of products operation) system”; FIGS. 27A and 27B explain instructions belonging to the category “msu (difference of products) system”; FIGS. 28A and 28B explain instructions belonging to the category “MEMld (load from memory) system”; FIGS. 29A and 29B explain instructions belonging to the category “MEMstore (store in memory) system”; FIG. 30 explains instructions belonging to the category “BRA (branch) system”; FIGS. 31A and 31B explain instructions belonging to the category “BSasl (arithmetic barrel shift) system and others”; FIGS. 32A and 32B explain instructions belonging to the category “BSlsr (logical barrel shift) system and others”; FIG. 33 explains instructions belonging to the category “CNVvaln (arithmetic conversion) system”; FIGS. 34A and 34B explain instructions belonging to the category “CNV (general conversion) system”; FIG. 35 explains instructions belonging to the category “SATvlpk (saturation processing) system”; and FIGS. 36A and 36B explain instructions belonging to the category “ETC (et cetera) system”.
  • The following describes the meaning of each item in these diagrams: “SIMD” indicates the type of an instruction (distinction between SISD (SINGLE) and SIMD); “Size” indicates the size of an individual operand to be an operation target; “Instruction” indicates the operation code of an instruction; “Operand” indicates the operands of an instruction; “CFR” indicates a change in the conditional flag register; “PSR” indicates a change in the processor status register; “Typical behavior” indicates the overview of a behavior; “Operation unit” indicates an operation unit to be used; and “3116” indicates the size of an instruction.
  • Next, the behavior of the processor 1 when executing some of the characteristic instructions is explained. Note that Tables 6-10 describe the meaning of each symbol used to explain the instructions.
  • TABLE 6
    Symbol Meaning
    X[i] Bit number i of X
    X[i:j] Bit number j to bit number i of X
    X:Y Concatenated X and Y
    {n{X}} n repetitions of X
    sextM(X,N) Sign-extend X from N bit width to M bit width.
    Default of M is 32.
    Default of N is all possible bit widths of X.
    uextM(X,N) Zero-extend X from N bit width to M bit width.
    Default of M is 32.
    Default of N is all possible bit widths of X.
    smul(X,Y) Signed multiplication X * Y
    umul(X,Y) Unsigned multiplication X * Y
    sdiv(X,Y) Integer part in quotient of signed division X / Y
    smod(X,Y) Modulo with the same sign as dividend.
    udiv(X,Y) Quotient of unsigned division X / Y
    umod(X,Y) Modulo
    abs(X) Absolute value
    bseq(X,Y) for (i=0; i<32; i++) {
    if (X[31−i] != Y) break;
    }
    result = i;
    bcnt(X,Y) S = 0;
    for (i=0; i<32; i++) {
    if (X[i] == Y) S++;
    }
    result = S;
    max(X,Y) result = (X > Y)? X : Y;
    min(X,Y) result = (X < Y)? X : Y;
    tstz(X,Y) (X & Y) == 0
    tstn(X,Y) (X & Y) != 0
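  • A few of the Table 6 helper operations can be written as ordinary C functions, as in the minimal sketch below. It assumes 32-bit operands and that Y in bseq and bcnt is a single bit value (0 or 1); the function names follow the table, while `sext32` is a hypothetical name for sextM with M fixed at 32. Note that tstz and tstn need explicit parentheses in C, because `==` and `!=` bind more tightly than `&`.

```c
#include <stdint.h>

/* bseq(X,Y): number of consecutive bits equal to Y, counted from bit 31 down. */
static int bseq(uint32_t x, unsigned y)
{
    int i;
    for (i = 0; i < 32; i++) {
        if (((x >> (31 - i)) & 1u) != (y & 1u)) break;
    }
    return i;
}

/* bcnt(X,Y): number of bits of X that are equal to Y. */
static int bcnt(uint32_t x, unsigned y)
{
    int s = 0;
    for (int i = 0; i < 32; i++) {
        if (((x >> i) & 1u) == (y & 1u)) s++;
    }
    return s;
}

/* tstz / tstn: test whether the masked bits are all zero / not all zero. */
static int tstz(uint32_t x, uint32_t y) { return (x & y) == 0; }
static int tstn(uint32_t x, uint32_t y) { return (x & y) != 0; }

/* sext32(X,N): sign-extend the low N bits of X to 32 bits (1 <= N <= 32). */
static int32_t sext32(uint32_t x, unsigned n)
{
    uint32_t sign = 1u << (n - 1);
    uint32_t mask = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    x &= mask;
    return (int32_t)((x ^ sign) - sign);
}
```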
  • TABLE 7
    Symbol Meaning
    Ra Ra[31:0] Register numbered a (0 <= a <= 31)
    Ra+1 R(a+1)[31:0] Register numbered a+1 (0 <= a <= 30)
    Rb Rb[31:0] Register numbered b (0 <= b <= 31)
    Rb+1 R(b+1)[31:0] Register numbered b+1 (0 <= b <= 30)
    Rc Rc[31:0] Register numbered c (0 <= c <= 31)
    Rc+1 R(c+1)[31:0] Register numbered c+1 (0 <= c <= 30)
    Ra2 Ra2[31:0] Register numbered a2 (0 <= a2 <= 15)
    Ra2+1 R(a2+1)[31:0] Register numbered a2+1 (0 <= a2 <= 14)
    Rb2 Rb2[31:0] Register numbered b2 (0 <= b2 <= 15)
    Rb2+1 R(b2+1)[31:0] Register numbered b2+1 (0 <= b2 <= 14)
    Rc2 Rc2[31:0] Register numbered c2 (0 <= c2 <= 15)
    Rc2+1 R(c2+1)[31:0] Register numbered c2+1 (0 <= c2 <= 14)
    Ra3 Ra3[31:0] Register numbered a3 (0 <= a3 <= 7)
    Ra3+1 R(a3+1)[31:0] Register numbered a3+1 (0 <= a3 <= 6)
    Rb3 Rb3[31:0] Register numbered b3 (0 <= b3 <= 7)
    Rb3+1 R(b3+1)[31:0] Register numbered b3+1 (0 <= b3 <= 6)
    Rc3 Rc3[31:0] Register numbered c3 (0 <= c3 <= 7)
    Rc3+1 R(c3+1)[31:0] Register numbered c3+1 (0 <= c3 <= 6)
    Rx Rx[31:0] Register numbered x (0 <= x <= 3)
  • TABLE 8
    Symbol Meaning
    + Addition
    - Subtraction
    & Logical AND
    | Logical OR
    ! Logical NOT
    << Logical shift left (arithmetic shift left)
    >> Arithmetic shift right
    >>> Logical shift right
    ^ Exclusive OR
    ~ Logical NOT
    == Equal
    != Not equal
    > Greater than
    Signed (regard left- and right-part MSBs as sign)
    >= Greater than or equal to
    Signed (regard left- and right-part MSBs as sign)
    >(u) Greater than
    Unsigned (do not regard left- and right-part MSBs as sign)
    >=(u) Greater than or equal to
    Unsigned (do not regard left- and right-part MSBs as sign)
    < Less than
    Signed (regard left- and right-part MSBs as sign)
    <= Less than or equal to
    Signed (regard left- and right-part MSBs as sign)
    <(u) Less than
    Unsigned (do not regard left- and right-part MSBs as sign)
    <=(u) Less than or equal to
    Unsigned (do not regard left- and right-part MSBs as sign)
  • TABLE 9
    Symbol Meaning
    D(addr) Double word data corresponding to address “addr” in Memory
    W(addr) Word data corresponding to address “addr” in Memory
    H(addr) Half data corresponding to address “addr” in Memory
    B(addr) Byte data corresponding to address “addr” in Memory
    B(addr,bus_lock) Access byte data corresponding to address “addr” in Memory,
    and lock used bus concurrently
    (unlockable bus shall not be locked)
    B(addr,bus_unlock) Access byte data corresponding to address “addr” in Memory,
    and unlock used bus concurrently
    (unlock shall be ignored for unlockable bus and
    bus which has not been locked)
    EREG(num) Extended register numbered “num”
    EREG_ERR To be 1 if error occurs when immediately previous access
    is made to extended register.
    To be 0, when there was no error.
    <− Write result
    => Synonym of instruction (translated by assembler)
    reg#(Ra) Register number of general-purpose register Ra(5-bit value)
    0x Prefix of hexadecimal numbers
    0b Prefix of binary numbers
    tmp Temporary variable
    UD Undefined value (value which is implementation-dependent
    or which varies dynamically)
    Dn Displacement value
    (n is a natural number indicating the number of bits)
    In Immediate value
    (n is a natural number indicating the number of bits)
  • TABLE 10
    ◯Explanation for syntax
    if (condition) {
    Executed when condition is met;
    } else {
    Executed when condition is not met;
    }
    Executed when condition A is met, if (condition A); * Not executed
    when condition A is not met
    for (Expression1;Expression2;Expression3) *Same as C language
    (Expression1)? Expression2:Expression3 *Same as C language
    ◯Explanation for terms
    The following explains terms used for explanations:
    Integer multiplication Multiplication defined as “smul”
    Fixed point multiplication
    Arithmetic shift left is performed after integer operation. When
    PSR.FXP is 0, the amount of shift is 1 bit, and when PSR.FXP is 1,
    2 bits.
    SIMD operation straight / cross / high / low / pair
    The higher 16 bits and lower 16 bits of half word vector data
    are RH and RL, respectively. In the case of operations performed
    between Ra register and Rb register, each operation is defined as
    follows:
    straight Operation is performed between RHa and RHb, and
    RLa and RLb
    cross Operation is performed between RHa and RLb, and
    RLa and RHb
    high Operation is performed between RHa and RHb, and
    RLa and RHb
    low Operation is performed between RHa and RLb, and
    RLa and RLb
    pair Operation is performed between RH and RHb, and
    RH and RLb (RH is 32-bit data)
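  • The straight, cross, high and low pairings defined in Table 10 can be illustrated with a small C sketch. It uses a wrapping 16-bit addition as the example operation and assumes that the result of the RHa operation goes to the upper half of the destination and the result of the RLa operation to the lower half; the table itself does not state where results are stored, so that placement, the enum, and the function names are assumptions made for illustration. The pair form is omitted.

```c
#include <stdint.h>

typedef enum { STRAIGHT, CROSS, HIGH, LOW } SimdMode;

static uint16_t hi16(uint32_t r) { return (uint16_t)(r >> 16); }
static uint16_t lo16(uint32_t r) { return (uint16_t)(r & 0xFFFFu); }

/* Adds two half-word vectors, choosing which half of Rb is paired with
 * RHa and which with RLa according to the mode. */
static uint32_t simd_addh(SimdMode mode, uint32_t ra, uint32_t rb)
{
    uint16_t with_high; /* Rb half paired with RHa */
    uint16_t with_low;  /* Rb half paired with RLa */

    switch (mode) {
    case STRAIGHT: with_high = hi16(rb); with_low = lo16(rb); break;
    case CROSS:    with_high = lo16(rb); with_low = hi16(rb); break;
    case HIGH:     with_high = hi16(rb); with_low = hi16(rb); break;
    default:       with_high = lo16(rb); with_low = lo16(rb); break; /* LOW */
    }

    uint16_t rh = (uint16_t)(hi16(ra) + with_high); /* wraps at 16 bits */
    uint16_t rl = (uint16_t)(lo16(ra) + with_low);
    return ((uint32_t)rh << 16) | rl;
}
```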
  • [Instruction jloop, settar]
  • Instruction jloop is an instruction for performing a branch and setting conditional flags (predicates, here) in a loop. For example, when
  • jloop C6, Cm, TAR, Ra
  • is executed, the processor 1 behaves as follows, using the address management unit 10 b and others: (i) sets the conditional flag Cm to 1; (ii) sets the conditional flag C6 to 0 when the value held in the register Ra is smaller than 0; (iii) adds −1 to the value held in the register Ra and stores the result into the register Ra; and (iv) branches to the address specified by the branch register (TAR) 30 d. When the jump buffer 10 f (branch instruction buffer) is not already filled with the branch target instruction, it is filled with the branch target instruction. The detailed behavior is shown in FIG. 37.
  • Meanwhile, Instruction settar is an instruction for storing a branch target address into the branch register (TAR) 30 d, and setting conditional flags (predicates, here). For example, when
  • settar C6, Cm, D9
  • is executed, the processor 1 behaves as follows, using the address management unit 10 b and others: (i) stores the address that results from adding the value held in the program counter (PC) 33 and a displacement value (D9) into the branch register (TAR) 30 d; (ii) fetches the instruction corresponding to such address and stores the instruction into the jump buffer 10 f (branch instruction buffer); and (iii) sets the conditional flag C6 to 1 and the conditional flag Cm to 0. The detailed behavior is shown in FIG. 38.
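  • The 2-stage forms of settar and jloop described above can be modeled as two small C functions. This is a sketch under stated assumptions rather than the behavior defined in FIGS. 37 and 38, which are not reproduced here: the `Cpu` structure and its field names are hypothetical, D9 is treated as a plain byte offset, the steps are applied in the order in which the text lists them, and the jump buffer is reduced to a record of which address it holds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register model; field names are illustrative only. */
typedef struct {
    bool     c[8];           /* conditional flags C0..C7 */
    int32_t  r[32];          /* general-purpose registers */
    uint32_t pc;             /* program counter (PC) */
    uint32_t tar;            /* branch register (TAR) */
    uint32_t jump_buf_addr;  /* address whose instruction the jump buffer holds */
    bool     jump_buf_valid;
} Cpu;

/* settar C6, Cm, D9 */
static void settar_2stage(Cpu *cpu, unsigned m, int32_t d9)
{
    cpu->tar = cpu->pc + (uint32_t)d9;  /* (i)   TAR <- PC + D9 */
    cpu->jump_buf_addr = cpu->tar;      /* (ii)  prefetch the branch target */
    cpu->jump_buf_valid = true;
    cpu->c[6] = true;                   /* (iii) C6 <- 1, Cm <- 0 */
    cpu->c[m & 7u] = false;
}

/* jloop C6, Cm, TAR, Ra */
static void jloop_2stage(Cpu *cpu, unsigned m, unsigned a)
{
    cpu->c[m & 7u] = true;              /* (i)   Cm <- 1 */
    if (cpu->r[a & 31u] < 0) {          /* (ii)  C6 <- 0 when Ra < 0 */
        cpu->c[6] = false;
    }
    cpu->r[a & 31u] -= 1;               /* (iii) Ra <- Ra + (-1) */
    if (!cpu->jump_buf_valid || cpu->jump_buf_addr != cpu->tar) {
        cpu->jump_buf_addr = cpu->tar;  /* fill the jump buffer if needed */
        cpu->jump_buf_valid = true;
    }
    cpu->pc = cpu->tar;                 /* (iv)  branch to TAR */
}
```

  • In this model, calling settar once before the loop and jloop once per pass reproduces the flag changes listed in the text: Cm is 0 on the first pass and 1 afterwards, while C6 stays 1 until the counter register has gone negative.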
  • These instructions jloop and settar, which are usually used in pairs, are effective for increasing the speed of a loop in prolog/epilog removal software pipelining. Note that software pipelining, which is a technique used by a compiler to increase a loop speed, allows an efficient parallel execution of a plurality of instructions by converting a loop structure into a prolog phase, a kernel phase and an epilog phase, and by overlapping each iteration with the previous and following iterations in the kernel phase.
  • As shown in FIG. 39, “prolog/epilog removal” is intended to apparently remove the prolog phase and epilog phase by treating the instructions of the prolog phase and the epilog phase as conditional execution instructions that are performed in accordance with predicates. In the prolog/epilog removal 2-stage software pipelining shown in FIG. 39, the conditional flags C6 and C4 are illustrated as predicates for an epilog instruction (Stage 2) and a prolog instruction (Stage 1), respectively.
  • For example, when the above-described jloop and settar instructions are used in a source program written in the C language shown in FIG. 40, a compiler generates a machine language program shown in FIG. 41 by means of prolog/epilog removal software pipelining.
  • As indicated by the loop part in such a machine language program (Label L00023-Instruction jloop), the conditional flag C4 is set by Instruction jloop and reset by Instruction settar. Accordingly, no special instructions are needed for such processing, which enables the loop execution to end in two cycles.
  • Note that the processor 1 is capable of executing the following instructions which are applicable not only to 2-stage software pipelining, but also to 3-stage software pipelining: Instruction “jloop C6, C2: C4, TAR, Ra” and Instruction “settar C6, C2: C4, D9”. These instructions “jloop C6, C2: C4, TAR, Ra” and “settar C6, C2: C4, D9” are equivalent to instructions in which the register Cm in the above-described 2-stage instructions “jloop C6, Cm, TAR, Ra” and “settar C6, Cm, D9” is extended to the registers C2, C3 and C4.
  • To put it another way, when
  • jloop C6, C2: C4, TAR, Ra
  • is executed, the processor 1 behaves as follows, using the address management unit 10 b and others: (i) sets the conditional flag C4 to 0 when the value held in the register Ra is smaller than 0; (ii) moves the value of the conditional flag C3 to the conditional flag C2 and moves the value of the conditional flag C4 to the conditional flags C3 and C6; (iii) adds −1 to the register Ra and stores the result into the register Ra; and (iv) branches to the address specified by the branch register (TAR) 30 d. When the jump buffer 10 f (branch instruction buffer) is not already filled with the branch target instruction, it is filled with the branch target instruction. The detailed behavior is shown in FIG. 42.
  • Also, when
  • settar C6, C2: C4, D9
  • is executed, the processor 1 behaves as follows, using the address management unit 10 b and others: (i) stores, into the branch register (TAR) 30 d, the address that results from adding the value held in the program counter (PC) 33 and a displacement value (D9); (ii) fetches the instruction corresponding to such address and stores the instruction into the jump buffer 10 f (branch instruction buffer); and (iii) sets the conditional flags C4 and C6 to 1 and the conditional flags C2 and C3 to 0. The detailed behavior is shown in FIG. 43.
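  • To see how the 3-stage forms of settar and jloop drive a software-pipelined loop, the following C sketch simulates the flag movement for a loop of n iterations. Several points are assumptions and are not taken from FIGS. 42 and 43: all flag moves in jloop read the values the flags held before the instruction, jloop itself is predicated by C6 so that the loop is left when C6 is 0, the counter register is initialized to n − 2, and n is at least 1. The names `Cpu3`, `settar_3stage`, `jloop_3stage` and `run_3stage` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     c[8];   /* conditional flags C0..C7 */
    int32_t  ra;     /* loop counter register Ra */
    uint32_t pc;
    uint32_t tar;
} Cpu3;

/* settar C6, C2:C4, D9 (jump buffer prefetch is not modeled) */
static void settar_3stage(Cpu3 *p, int32_t d9)
{
    p->tar = p->pc + (uint32_t)d9;
    p->c[4] = true;  p->c[6] = true;
    p->c[2] = false; p->c[3] = false;
}

/* jloop C6, C2:C4, TAR, Ra (assumed: moves use the pre-instruction values) */
static void jloop_3stage(Cpu3 *p)
{
    bool old_c3 = p->c[3];
    bool old_c4 = p->c[4];
    if (p->ra < 0) p->c[4] = false; /* (i)   stop issuing stage 1 */
    p->c[2] = old_c3;               /* (ii)  C2 <- C3 */
    p->c[3] = old_c4;               /*       C3 <- C4 */
    p->c[6] = old_c4;               /*       C6 <- C4 */
    p->ra -= 1;                     /* (iii) Ra <- Ra - 1 */
    p->pc = p->tar;                 /* (iv)  branch */
}

/* Runs the pipelined loop; exec[0..2] count how often the stage 1, 2 and 3
 * instructions (predicated by C4, C3 and C2) execute. Returns the number
 * of passes through the loop body. */
static int run_3stage(int n, int exec[3])
{
    Cpu3 cpu = {0};
    int passes = 0;

    cpu.ra = n - 2;                 /* assumed initial counter value */
    settar_3stage(&cpu, 0);
    exec[0] = exec[1] = exec[2] = 0;

    for (;;) {
        passes++;
        if (cpu.c[4]) exec[0]++;
        if (cpu.c[3]) exec[1]++;
        if (cpu.c[2]) exec[2]++;
        if (!cpu.c[6]) break;       /* jloop suppressed: leave the loop */
        jloop_3stage(&cpu);
    }
    return passes;
}
```

  • Under these assumptions each stage executes exactly n times and the loop body is traversed n + 2 times, which is the pattern expected of a 3-stage pipeline whose prolog and epilog have been folded into the kernel.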
  • FIGS. 44( a) and (b) show the role of the conditional flags in the above 3-stage instructions “jloop C6, C2: C4, TAR, Ra” and “settar C6, C2: C4, D9”. As shown in (a) in FIG. 44, in prolog/epilog removal 3-stage software pipelining, the conditional flags C2, C3 and C4 serve as predicates for Stage 3, Stage 2 and Stage 1, respectively. FIG. 44( b) is a diagram showing how instruction execution is carried out when moving flags in such a case.
  • For example, when the above-described jloop and settar instructions shown respectively in FIGS. 42 and 43 are used in a source program written in the C language shown in FIG. 45, a compiler generates a machine language program shown in FIG. 46 by means of epilog removal software pipelining.
  • Note that the processor 1 is also capable of executing the following instructions which are applicable to 4-stage software pipelining: Instruction “jloop C6, C1: C4, TAR, Ra” and Instruction “settar C6, C1: C4, D9”.
  • To put it another way, when
  • jloop C6, C1: C4, TAR, Ra
  • is executed, the processor 1 behaves as follows, using the address management unit 10 b and others: (i) sets the conditional flag C4 to 0 when the value held in the register Ra is smaller than 0; (ii) moves the value of the conditional flag C2 to the conditional flag C1, moves the value of the conditional flag C3 to the conditional flag C2, and moves the value of the conditional flag C4 to the conditional flags C3 and C6; (iii) adds −1 to the register Ra and stores the result into the register Ra; and (iv) branches to the address specified by the branch register (TAR) 30 d. When the jump buffer 10 f is not already filled with the branch target instruction, it is filled with the branch target instruction. The detailed behavior is shown in FIG. 47.
  • Meanwhile, Instruction settar is an instruction for storing a branch target address into the branch register (TAR) 30 d as well as for setting conditional flags (predicates, here).
  • For example, when
  • settar C6, C1: C4, D9
  • is executed, the processor 1 behaves as follows, using the address management unit 10 b and others: (i) stores the address that results from adding the value held in the program counter (PC) 33 and a displacement value (D9) into the branch register (TAR) 30 d; (ii) fetches the instruction corresponding to such address and stores the instruction into the jump buffer 10 f (branch instruction buffer); and (iii) sets the conditional flags C4 and C6 to 1 and the conditional flags C1, C2 and C3 to 0. The detailed behavior is shown in FIG. 48.
  • For example, when the above-described jloop and settar instructions shown respectively in FIGS. 47 and 48 are used in a source program written in the C language shown in FIG. 49, a compiler generates a machine language program shown in FIG. 50 by means of epilog removal software pipelining.
  • FIG. 51 is a diagram showing the behavior to be performed in 4-stage software pipelining that uses jloop and settar instructions shown respectively in FIGS. 47 and 48.
  • In order to implement 4-stage software pipelining, the conditional flags C1-C4 are used as predicates, each of which indicates whether or not to execute an instruction. Instructions A, B, C, and D are instructions to be executed in the first, second, third, and fourth stages in the software pipelining, respectively. Furthermore, the instructions A, B, C, and D are associated with the conditional flags C4, C3, C2, and C1, respectively. Also, Instruction jloop is associated with the conditional flag C6.
  • FIG. 52 is a diagram for explaining an example method of setting the conditional flag C6 for the Instruction jloop shown in FIG. 47. This method utilizes the following characteristic: in the case where the number of software pipelining stages is “N” (where “N” is an integer greater than or equal to 3) when a loop to be executed is unrolled into conditional execution instructions by means of software pipelining, the loop ends in the cycle following the one in which the conditional flag corresponding to the conditional execution instruction to be executed in the (N−2)th pipeline stage becomes 0 in the epilog phase.
  • Therefore, in the loop processing, (i) the value of the conditional flag C6 is always set to 1 during the prolog phase and kernel phase, (ii) the value of the conditional flag C3 (the conditional flag corresponding to the conditional execution instruction to be executed in the (N−2)th stage of the software pipelining) is monitored once the epilog phase is entered, and (iii) the value of the conditional flag C3 is copied to the conditional flag C6 one cycle later. With the above configuration, the conditional flag C6 assigned to Instruction jloop is set to 0 at the end of the loop processing, making it possible for the processor 1 to exit from the loop. For example, in the machine language program shown in FIG. 50, when the value of the conditional flag C6 becomes 0, Instruction “jloop C6, C1: C4, TAR, R4” is not executed; instead, Instruction “ret” placed immediately after it is executed, which allows the processor 1 to exit from the loop.
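  • The C6 setting method described for FIG. 52 can be checked with a cycle-by-cycle C sketch of a 4-stage pipeline. The model is an assumption-laden illustration and not the circuit of the embodiment: the iteration counter `issued` stands in for the register Ra test, the epilog phase is taken to begin when C4 becomes 0, all flags are updated simultaneously at the loop-back, and n is assumed to be at least 2 so that C3 is already 1 when the epilog phase starts. `LoopStats` and `run_4stage` are hypothetical names.

```c
#include <stdbool.h>

typedef struct {
    int cycles;   /* number of cycles until the loop is left */
    int exec[4];  /* executions of instructions A, B, C, D */
} LoopStats;

/* Simulates a 4-stage software-pipelined loop of n iterations (n >= 2).
 * A, B, C, D are predicated by C4, C3, C2, C1; jloop is predicated by C6. */
static LoopStats run_4stage(int n)
{
    /* state after settar: C4 = C6 = 1, C1 = C2 = C3 = 0 */
    bool c1 = false, c2 = false, c3 = false, c4 = true, c6 = true;
    int issued = 0;
    LoopStats s = {0, {0, 0, 0, 0}};

    for (;;) {
        s.cycles++;
        if (c4) { s.exec[0]++; issued++; }
        if (c3) s.exec[1]++;
        if (c2) s.exec[2]++;
        if (c1) s.exec[3]++;

        if (!c6) break;                     /* jloop suppressed, "ret" runs */

        /* loop-back: compute every new flag from the old values */
        bool next_c4 = c4 && (issued < n);
        bool in_epilog = !next_c4;
        bool next_c6 = in_epilog ? c3 : true; /* C6 follows C3 one cycle later */

        c1 = c2;
        c2 = c3;
        c3 = c4;
        c4 = next_c4;
        c6 = next_c6;
    }
    return s;
}
```

  • For n = 4 this model leaves the loop in the seventh cycle with C4 becoming 0 in the fifth cycle, and each of the instructions A to D executes four times.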
  • Note that, as shown in FIG. 51, when the value of a certain conditional flag becomes 0 in the epilog phase, the value of such conditional flag remains 0 until the loop processing ends. This means that the conditional execution instruction corresponding to the conditional flag in question is not executed again before the end of the loop. For example, when the value of the conditional flag C4 becomes 0 in the fifth cycle, the value of such conditional flag C4 remains 0 until the seventh cycle, in which the loop ends. Therefore, the instruction A that corresponds to the conditional flag C4 is not executed from the fifth cycle to the seventh cycle.
  • Thus, when a conditional flag becomes 0 in the epilog phase, a control may be performed so that no instruction will be read out, until the loop processing ends, from the instruction buffer 10 c (10 d, 10 e, and 10 h) in which the instruction corresponding to such conditional flag is stored.
  • Meanwhile, a part of each instruction indicates the number of a conditional flag. Accordingly, the decoding unit 20 may read out only the number of a conditional flag from the corresponding instruction buffer 10 c (10 d, 10 e, and 10 h), and check the value of the conditional flag based on such read-out number, so that the decoding unit 20 will not read out instructions from the instruction buffer 10 c (10 d, 10 e, and 10 h) when the value of the conditional flag is 0.
  • Furthermore, as shown in FIG. 53, instructions to be executed before and after the loop may be placed respectively in the prolog and epilog phases for execution. For example, the conditional flag C5 is assigned to an instruction X to be executed immediately before the loop and to an instruction Y to be executed immediately after the loop, so as to have such instructions X and Y executed in empty stages in the prolog and epilog phases, respectively. Accordingly, it becomes possible to reduce the number of empty stages in the prolog and epilog phases.
  • Moreover, in the case where different instructions are executed depending on whether or not a predetermined condition is true, as in the case of an if-else statement in the C language, different conditional flags shall be used for a conditional execution instruction to be executed when the condition is true and for a conditional execution instruction to be executed when the condition is false, so that the value of each conditional flag can be changed depending on a condition. Through such simple processing, it becomes possible to realize a conditional branch instruction.
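  • The use of two different conditional flags for the two arms of an if-else statement can be illustrated with the following C sketch. The C `if` statements stand in for predicated instructions, and the choice of C0 and C1 as the flags, as well as the function name, are assumptions made only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Source form:  if (x > y) r = x - y; else r = y - x;
 * Predicated form: one compare sets two complementary flags, and each arm
 * becomes a conditional execution instruction guarded by its own flag. */
static int32_t abs_diff_predicated(int32_t x, int32_t y)
{
    bool c0 = (x > y);  /* flag for the arm taken when the condition is true */
    bool c1 = !c0;      /* flag for the arm taken when the condition is false */
    int32_t r = 0;

    if (c0) r = x - y;  /* [C0] sub r, x, y */
    if (c1) r = y - x;  /* [C1] sub r, y, x */
    return r;
}
```

  • Both guarded instructions are always issued, but only the one whose flag is 1 takes effect, so the straight-line sequence produces the same result as the branching form.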
  • Also, the method of setting the conditional flag C6 described below may be used instead of the method of setting the conditional flag C6 for the jloop instruction shown in FIG. 52. FIG. 54 is a diagram for explaining another example method of setting the conditional flag C6 for the Instruction jloop shown in FIG. 47. This method utilizes the following characteristic: in the case where the number of software pipelining stages is “N” (where “N” is an integer greater than or equal to 2) when a loop to be executed is unrolled into conditional execution instructions by means of software pipelining, the loop ends in the same cycle as the one in which the conditional flag corresponding to the conditional execution instruction to be executed in the (N−1)th pipeline stage becomes 0 in the epilog phase.
  • Therefore, in the prolog phase and kernel phase in the loop processing, (i) the value of the conditional flag C6 is always set to 1, (ii) the value of the conditional flag C2 (being a conditional flag corresponding to the conditional execution instruction to be executed in the (N−1)th stage in the software pipelining) is monitored from when the epilog phase is entered, and (iii) the value of the conditional flag C2 is set to the conditional flag C6 within the same cycle. With the above configuration, the conditional flag C6 assigned to the Instruction jloop is set to 0 at the end of the loop processing, making it possible for the processor 1 to exit from the loop.
  • Furthermore, the method of setting the conditional flag C6 described below may also be used. FIG. 55 is a diagram for explaining another example method of setting the conditional flag C6 for the Instruction jloop shown in FIG. 47. This method utilizes the following characteristic: in the case where the number of software pipelining stages is “N” (where “N” is an integer greater than or equal to 4) when a loop to be executed is unrolled into conditional execution instructions by means of software pipelining, the loop ends two cycles after the cycle in which the conditional flag corresponding to the conditional execution instruction to be executed in the (N−3)th pipeline stage becomes 0 in the epilog phase.
  • Therefore, in the loop processing, (i) the value of the conditional flag C6 is always set to 1 during the prolog phase and kernel phase, (ii) the value of the conditional flag C4 (the conditional flag corresponding to the conditional execution instruction to be executed in the (N−3)th stage of the software pipelining) is monitored once the epilog phase is entered, and (iii) the value of the conditional flag C4 is copied to the conditional flag C6 two cycles later. With the above configuration, the conditional flag C6 assigned to the Instruction jloop is set to 0 at the end of the loop processing, making it possible for the processor 1 to exit from the loop.
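  • The three ways of deriving the conditional flag C6 (from C3 one cycle later, from C2 in the same cycle, and from C4 two cycles later) can be compared with a small closed-form C model of a 4-stage pipeline. The model is an illustration under assumptions: cycles are numbered from 1, the flag of stage s is taken to be 1 in cycles s to s + n − 1 for a loop of n iterations, the epilog phase is taken to start in cycle n + 1, and n is at least 2. The enum and function names are hypothetical.

```c
#include <stdbool.h>

/* Flag of pipeline stage `stage` (1 = C4, 2 = C3, 3 = C2, 4 = C1) in a
 * given cycle, for a loop of n iterations. */
static bool stage_active(int stage, int cycle, int n)
{
    return cycle >= stage && cycle <= stage + n - 1;
}

typedef enum {
    FROM_C3_ONE_CYCLE_LATER,   /* method of FIG. 52 */
    FROM_C2_SAME_CYCLE,        /* method of FIG. 54 */
    FROM_C4_TWO_CYCLES_LATER   /* method of FIG. 55 */
} C6Method;

/* Value of C6 in a given cycle under the chosen method. */
static bool c6_value(C6Method m, int cycle, int n)
{
    if (cycle <= n) return true;  /* prolog and kernel: C6 is always 1 */
    switch (m) {
    case FROM_C3_ONE_CYCLE_LATER: return stage_active(2, cycle - 1, n);
    case FROM_C2_SAME_CYCLE:      return stage_active(3, cycle, n);
    default:                      return stage_active(1, cycle - 2, n);
    }
}

/* First cycle in which C6 is 0, i.e. the cycle in which the loop ends. */
static int exit_cycle(C6Method m, int n)
{
    int cycle = 1;
    while (c6_value(m, cycle, n)) cycle++;
    return cycle;
}
```

  • In this model all three methods end the loop in cycle n + 3, for example in the seventh cycle when n = 4, which suggests that they are interchangeable as far as the exit timing is concerned.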
  • Note that software pipelining up to four stages has been explained in the present embodiment, but the present invention is also applicable to software pipelining containing five or more stages. It is possible to achieve such a configuration by increasing the number of conditional flags used as predicates.
  • A machine language instruction with the above-described characteristics is generated by a compiler that performs: a parser step of parsing a source program; an intermediate code conversion step of converting the parsed source program into intermediate codes; an optimization step of optimizing the intermediate codes; and a code generation step of converting the optimized intermediate codes into machine language instructions.
  • As described above, according to the present embodiment, a conditional flag for a loop is set by the use of a conditional flag for the epilog phase of software pipelining. Accordingly, there is no need to use special hardware resources such as a counter in order to judge whether or not loop processing has terminated, and it becomes possible to prevent the circuitry scale from becoming large. This contributes to a reduction in the power consumption of the processor.
  • Moreover, when a conditional execution instruction stops being executed in the epilog phase, such conditional execution instruction will not be executed in the software pipelining until the loop processing ends. Accordingly, there is no need to read out such a conditional execution instruction from the corresponding instruction buffer until the loop processing ends, which leads to a reduction in the power consumption of the processor.
  • Furthermore, by placing instructions to be executed before and after a loop in the prolog phase and the epilog phase, respectively, it becomes possible to reduce the number of empty stages in software pipelining, and therefore to execute a program at a high speed. This results in a reduction in the power consumption of the processor.
  • As is obvious from the above description, according to the processor of the present invention, it is possible to provide a processor whose circuitry scale is small and which is capable of high-speed loop execution while consuming a small amount of power.
  • Furthermore, according to the present invention, it is possible to provide a compiler which is capable of generating machine language instructions that enable the processor to consume only a small amount of power.
  • As described above, the processor according to the present invention is capable of executing instructions while consuming only a small amount of power. It is therefore possible for the processor to be employed as a core processor to be commonly used in a mobile phone, mobile AV device, digital television, DVD and others. Thus, the processor according to the present invention is extremely useful in the present age in which the advent of high-performance and cost effective multimedia apparatuses is desired.

Claims (9)

1. A compiler apparatus that translates a source program into a machine language program for a processor capable of executing instructions in parallel, comprising:
a parser unit operable to parse the source program;
an intermediate code conversion unit operable to convert the parsed source program into intermediate codes;
an optimization unit operable to optimize the intermediate codes; and
a code generation unit operable to convert the optimized intermediate codes into machine language instructions,
wherein the processor stores a plurality of flags used as predicates for conditional execution instructions, and
the optimization unit, when the intermediate codes include a loop, places an instruction in a prolog phase in the loop in a case where said loop is unrolled by means of software pipelining, the instruction being to be executed immediately before the loop.
2. A compiler apparatus that translates a source program into a machine language program for a processor capable of executing instructions in parallel, comprising:
a parser unit operable to parse the source program;
an intermediate code conversion unit operable to convert the parsed source program into intermediate codes;
an optimization unit operable to optimize the intermediate codes; and
a code generation unit operable to convert the optimized intermediate codes into machine language instructions,
wherein the processor stores a plurality of flags used as predicates for conditional execution instructions, and
the optimization unit, when the intermediate codes include a loop, places an instruction in an epilog phase in the loop in a case where said loop is unrolled by means of software pipelining, the instruction being to be executed immediately after the loop.
3. A compiler apparatus that translates a source program into a machine language program for a processor capable of executing instructions in parallel, comprising:
a parser unit operable to parse the source program;
an intermediate code conversion unit operable to convert the parsed source program into intermediate codes;
an optimization unit operable to optimize the intermediate codes; and
a code generation unit operable to convert the optimized intermediate codes into machine language instructions,
wherein the processor stores a plurality of flags used as predicates for conditional execution instructions, and
the optimization unit, when the intermediate codes include a conditional branch instruction, assigns the plurality of conditional execution flags so that a conditional execution flag used as a predicate for a conditional execution instruction in a case where a condition indicated by said conditional branch instruction is met, becomes different from a conditional execution flag used as a predicate for a conditional execution instruction in a case where said condition is not met.
4. A compilation method for translating a source program into a machine language program for a processor capable of executing instructions in parallel, comprising:
a parser step of parsing the source program;
an intermediate code conversion step of converting the parsed source program into intermediate codes;
an optimization step of optimizing the intermediate codes; and
a code generation step of converting the optimized intermediate codes into machine language instructions,
wherein the processor stores a plurality of flags used as predicates for conditional execution instructions, and
in the optimization step, when the intermediate codes include a loop, an instruction is placed in a prolog phase in the loop in a case where said loop is unrolled by means of software pipelining, the instruction being to be executed immediately before the loop.
5. A compilation method for translating a source program into a machine language program for a processor capable of executing instructions in parallel, comprising:
a parser step of parsing the source program;
an intermediate code conversion step of converting the parsed source program into intermediate codes;
an optimization step of optimizing the intermediate codes; and
a code generation step of converting the optimized intermediate codes into machine language instructions,
wherein the processor stores a plurality of flags used as predicates for conditional execution instructions, and
in the optimization step, when the intermediate codes include a loop, an instruction is placed in an epilog phase in the loop in a case where said loop is unrolled by means of software pipelining, the instruction being to be executed immediately after the loop.
6. A compilation method for translating a source program into a machine language program for a processor capable of executing instructions in parallel, comprising:
a parser step of parsing the source program;
an intermediate code conversion step of converting the parsed source program into intermediate codes;
an optimization step of optimizing the intermediate codes; and
a code generation step of converting the optimized intermediate codes into machine language instructions,
wherein the processor stores a plurality of flags used as predicates for conditional execution instructions, and
in the optimization step, when the intermediate codes include a conditional branch instruction, the plurality of conditional execution flags are assigned so that a conditional execution flag used as a predicate for a conditional execution instruction in a case where a condition indicated by said conditional branch instruction is met, becomes different from a conditional execution flag used as a predicate for a conditional execution instruction in a case where said condition is not met.
7. A compiler for translating a source program into a machine language program for a processor capable of executing instructions in parallel, comprising:
a parser step of parsing the source program;
an intermediate code conversion step of converting the parsed source program into intermediate codes;
an optimization step of optimizing the intermediate codes; and
a code generation step of converting the optimized intermediate codes into machine language instructions,
wherein the processor stores a plurality of flags used as predicates for conditional execution instructions, and
in the optimization step, when the intermediate codes include a loop, an instruction is placed in a prolog phase in the loop in a case where said loop is unrolled by means of software pipelining, the instruction being to be executed immediately before the loop.
8. A compiler for translating a source program into a machine language program for a processor capable of executing instructions in parallel, comprising:
a parser step of parsing the source program;
an intermediate code conversion step of converting the parsed source program into intermediate codes;
an optimization step of optimizing the intermediate codes; and
a code generation step of converting the optimized intermediate codes into machine language instructions,
wherein the processor stores a plurality of flags used as predicates for conditional execution instructions, and
in the optimization step, when the intermediate codes include a loop, an instruction is placed in an epilog phase in the loop in a case where said loop is unrolled by means of software pipelining, the instruction being to be executed immediately after the loop.
9. A compiler for translating a source program into a machine language program for a processor capable of executing instructions in parallel, comprising:
a parser step of parsing the source program;
an intermediate code conversion step of converting the parsed source program into intermediate codes;
an optimization step of optimizing the intermediate codes; and
a code generation step of converting the optimized intermediate codes into machine language instructions,
wherein the processor stores a plurality of flags used as predicates for conditional execution instructions, and
in the optimization step, when the intermediate codes include a conditional branch instruction, the plurality of conditional execution flags are assigned so that a conditional execution flag used as a predicate for a conditional execution instruction in a case where a condition indicated by said conditional branch instruction is met, becomes different from a conditional execution flag used as a predicate for a conditional execution instruction in a case where said condition is not met.
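By way of illustration only, the two optimizations recited in the claims above may be sketched in Python as follows. This is a minimal model under stated assumptions, not the claimed implementation: the flag names `C0` and `C1`, the helper names `if_convert`, `run`, `sub`, and `sum_of_squares_pipelined`, and the two-stage pipeline are all hypothetical and chosen for the example. The first part replaces a conditional branch with predicated instructions, using a different flag for the "condition met" path than for the "condition not met" path. The second part models a software-pipelined loop in which an instruction that would otherwise run immediately before the loop (here, clearing an accumulator) is placed in the prolog phase of the loop, in a slot that is idle on the first pass.

```python
def sub(dst, x, y):
    """Build an operation that computes regs[dst] = regs[x] - regs[y]."""
    def op(regs):
        regs[dst] = regs[x] - regs[y]
    return op


def if_convert(cond, then_ops, else_ops, flag_met="C0", flag_not_met="C1"):
    """Replace a conditional branch with predicated instructions.

    The flag names are illustrative assumptions. The two paths must be
    guarded by different flags so that exactly one of them takes effect.
    """
    if flag_met == flag_not_met:
        raise ValueError("the two paths must use different flags")
    code = [("cmp", cond, flag_met, flag_not_met)]
    code += [("exec", op, flag_met) for op in then_ops]
    code += [("exec", op, flag_not_met) for op in else_ops]
    return code


def run(code, regs):
    """Execute straight-line predicated code; there are no branches."""
    flags = {}
    for instr in code:
        if instr[0] == "cmp":
            _, cond, flag_met, flag_not_met = instr
            met = bool(cond(regs))
            flags[flag_met] = met          # predicate for the "met" path
            flags[flag_not_met] = not met  # predicate for the other path
        else:
            _, op, pred = instr
            if flags[pred]:                # instruction is a no-op if flag is clear
                op(regs)
    return regs


def sum_of_squares_pipelined(data):
    """Two-stage software pipeline: stage 1 loads, stage 2 accumulates.

    The loop runs len(data) + 1 passes. On pass 0 (the prolog phase) the
    accumulate slot has nothing to do, so the pre-loop instruction
    `acc = 0` is placed there, guarded by a prolog predicate.
    """
    n = len(data)
    loaded = None
    for t in range(n + 1):
        p_prolog = (t == 0)   # true only in the prolog phase
        p_acc = (t >= 1)      # stage 2 is active from the second pass on
        p_load = (t < n)      # stage 1 is idle on the final (drain) pass
        if p_prolog:
            acc = 0           # instruction hoisted from before the loop
        if p_acc:
            acc += loaded * loaded   # uses the value loaded on the previous pass
        if p_load:
            loaded = data[t]
    return acc


# Absolute difference without a branch: d = a - b if a > b else b - a.
abs_diff = if_convert(lambda r: r["a"] > r["b"],
                      [sub("d", "a", "b")],
                      [sub("d", "b", "a")])
```

For example, `run(abs_diff, {"a": 3, "b": 5})["d"]` evaluates to 2, and `sum_of_squares_pipelined([1, 2, 3])` evaluates to 14. In this model the predicated instructions of both paths are always issued, and the complementary flags decide which ones take effect.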
US12/109,707 2003-03-24 2008-04-25 Processor and compiler for decoding an instruction and executing the instruction with conditional execution flags Abandoned US20080209407A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/109,707 US20080209407A1 (en) 2003-03-24 2008-04-25 Processor and compiler for decoding an instruction and executing the instruction with conditional execution flags

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2003-081132 2003-03-24
JP2003081132A JP3974063B2 (en) 2003-03-24 2003-03-24 Processor and compiler
US10/805,381 US7380112B2 (en) 2003-03-24 2004-03-22 Processor and compiler for decoding an instruction and executing the decoded instruction with conditional execution flags
US12/109,707 US20080209407A1 (en) 2003-03-24 2008-04-25 Processor and compiler for decoding an instruction and executing the instruction with conditional execution flags

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/805,381 Division US7380112B2 (en) 2003-03-24 2004-03-22 Processor and compiler for decoding an instruction and executing the decoded instruction with conditional execution flags

Publications (1)

Publication Number Publication Date
US20080209407A1 true US20080209407A1 (en) 2008-08-28

Family

ID=32821431

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/805,381 Expired - Fee Related US7380112B2 (en) 2003-03-24 2004-03-22 Processor and compiler for decoding an instruction and executing the decoded instruction with conditional execution flags
US12/109,707 Abandoned US20080209407A1 (en) 2003-03-24 2008-04-25 Processor and compiler for decoding an instruction and executing the instruction with conditional execution flags

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US10/805,381 Expired - Fee Related US7380112B2 (en) 2003-03-24 2004-03-22 Processor and compiler for decoding an instruction and executing the decoded instruction with conditional execution flags

Country Status (4)

Country Link
US (2) US7380112B2 (en)
EP (1) EP1462933A3 (en)
JP (1) JP3974063B2 (en)
CN (1) CN1302380C (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110246295A1 (en) * 2010-04-05 2011-10-06 Yahoo! Inc. Fast networked based advertisement selection
US8413151B1 (en) 2007-12-19 2013-04-02 Nvidia Corporation Selective thread spawning within a multi-threaded processing system
US20130246856A1 (en) * 2012-03-16 2013-09-19 Samsung Electronics Co., Ltd. Verification supporting apparatus and verification supporting method of reconfigurable processor
US8615770B1 (en) 2008-08-29 2013-12-24 Nvidia Corporation System and method for dynamically spawning thread blocks within multi-threaded processing systems
US8959497B1 (en) * 2008-08-29 2015-02-17 Nvidia Corporation System and method for dynamically spawning thread blocks within multi-threaded processing systems
US9038042B2 (en) 2012-06-29 2015-05-19 Analog Devices, Inc. Staged loop instructions

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005149297A (en) * 2003-11-18 2005-06-09 Renesas Technology Corp Processor and assembler thereof
US20060101256A1 (en) * 2004-10-20 2006-05-11 Dwyer Michael K Looping instructions for a single instruction, multiple data execution engine
US7437537B2 (en) * 2005-02-17 2008-10-14 Qualcomm Incorporated Methods and apparatus for predicting unaligned memory access
US7669042B2 (en) * 2005-02-17 2010-02-23 Samsung Electronics Co., Ltd. Pipeline controller for context-based operation reconfigurable instruction set processor
US7991984B2 (en) * 2005-02-17 2011-08-02 Samsung Electronics Co., Ltd. System and method for executing loops in a processor
US20060190700A1 (en) * 2005-02-22 2006-08-24 International Business Machines Corporation Handling permanent and transient errors using a SIMD unit
US20080229074A1 (en) * 2006-06-19 2008-09-18 International Business Machines Corporation Design Structure for Localized Control Caching Resulting in Power Efficient Control Logic
US20070294519A1 (en) * 2006-06-19 2007-12-20 Miller Laura F Localized Control Caching Resulting In Power Efficient Control Logic
CN102944803B (en) * 2006-06-30 2015-06-24 英特尔公司 Leakage power estimation
JP4159586B2 (en) * 2006-08-03 2008-10-01 エヌイーシーコンピュータテクノ株式会社 Information processing apparatus and information processing speed-up method
US8380966B2 (en) * 2006-11-15 2013-02-19 Qualcomm Incorporated Method and system for instruction stuffing operations during non-intrusive digital signal processor debugging
US8341604B2 (en) * 2006-11-15 2012-12-25 Qualcomm Incorporated Embedded trace macrocell for enhanced digital signal processor debugging operations
US8370806B2 (en) 2006-11-15 2013-02-05 Qualcomm Incorporated Non-intrusive, thread-selective, debugging method and system for a multi-thread digital signal processor
US8533530B2 (en) * 2006-11-15 2013-09-10 Qualcomm Incorporated Method and system for trusted/untrusted digital signal processor debugging operations
US8484516B2 (en) * 2007-04-11 2013-07-09 Qualcomm Incorporated Inter-thread trace alignment method and system for a multi-threaded processor
JP5043560B2 (en) * 2007-08-24 2012-10-10 パナソニック株式会社 Program execution control device
JP5193624B2 (en) * 2008-02-19 2013-05-08 ルネサスエレクトロニクス株式会社 Data processor
US8131984B2 (en) * 2009-02-12 2012-03-06 Via Technologies, Inc. Pipelined microprocessor with fast conditional branch instructions based on static serializing instruction state
US20110055303A1 (en) * 2009-09-03 2011-03-03 Azuray Technologies, Inc. Function Generator
EP2872966A1 (en) * 2012-07-12 2015-05-20 Dual Aperture International Co. Ltd. Gesture-based user interface
US11048513B2 (en) * 2013-07-15 2021-06-29 Texas Instruments Incorporated Entering protected pipeline mode with clearing
CN103942035B (en) * 2014-04-11 2017-08-29 华为技术有限公司 Method, compiler and the instruction processing unit of process instruction
GB2551548B (en) * 2016-06-22 2019-05-08 Advanced Risc Mach Ltd Register restoring branch instruction
CN107229446A (en) * 2017-04-26 2017-10-03 深圳市创成微电子有限公司 A kind of audio data processor
CN111045729A (en) * 2018-10-12 2020-04-21 上海寒武纪信息科技有限公司 Operation method, device and related product
US20220100514A1 (en) * 2020-09-26 2022-03-31 Intel Corporation Loop support extensions
CN113946539B (en) * 2021-10-09 2024-02-13 深圳市创成微电子有限公司 DSP processor and processing method of circulation jump instruction thereof

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6016399A (en) * 1996-03-28 2000-01-18 Intel Corporation Software pipelining a hyperblock loop
US6044222A (en) * 1997-06-23 2000-03-28 International Business Machines Corporation System, method, and program product for loop instruction scheduling hardware lookahead
US6192515B1 (en) * 1998-07-17 2001-02-20 Intel Corporation Method for software pipelining nested loops
US6289443B1 (en) * 1998-01-28 2001-09-11 Texas Instruments Incorporated Self-priming loop execution for loop prolog instruction
US20020091996A1 (en) * 2000-06-13 2002-07-11 Siroyan Limited Predicated execution of instructions in processors
US6449713B1 (en) * 1998-11-18 2002-09-10 Compaq Information Technologies Group, L.P. Implementation of a conditional move instruction in an out-of-order processor
US6567895B2 (en) * 2000-05-31 2003-05-20 Texas Instruments Incorporated Loop cache memory and cache controller for pipelined microprocessors
US20030120905A1 (en) * 2001-12-20 2003-06-26 Stotzer Eric J. Apparatus and method for executing a nested loop program with a software pipeline loop procedure in a digital signal processor
US6629238B1 (en) * 1999-12-29 2003-09-30 Intel Corporation Predicate controlled software pipelined loop processing with prediction of predicate writing and value prediction for use in subsequent iteration

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6408433B1 (en) * 1999-04-23 2002-06-18 Sun Microsystems, Inc. Method and apparatus for building calling convention prolog and epilog code using a register allocator

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8413151B1 (en) 2007-12-19 2013-04-02 Nvidia Corporation Selective thread spawning within a multi-threaded processing system
US8615770B1 (en) 2008-08-29 2013-12-24 Nvidia Corporation System and method for dynamically spawning thread blocks within multi-threaded processing systems
US8959497B1 (en) * 2008-08-29 2015-02-17 Nvidia Corporation System and method for dynamically spawning thread blocks within multi-threaded processing systems
US20110246295A1 (en) * 2010-04-05 2011-10-06 Yahoo! Inc. Fast networked based advertisement selection
US8903736B2 (en) * 2010-04-05 2014-12-02 Yahoo! Inc. Fast networked based advertisement selection
US20130246856A1 (en) * 2012-03-16 2013-09-19 Samsung Electronics Co., Ltd. Verification supporting apparatus and verification supporting method of reconfigurable processor
US9087152B2 (en) * 2012-03-16 2015-07-21 Samsung Electronics Co., Ltd. Verification supporting apparatus and verification supporting method of reconfigurable processor
US9038042B2 (en) 2012-06-29 2015-05-19 Analog Devices, Inc. Staged loop instructions

Also Published As

Publication number Publication date
CN1302380C (en) 2007-02-28
EP1462933A3 (en) 2008-01-23
CN1532693A (en) 2004-09-29
JP3974063B2 (en) 2007-09-12
EP1462933A2 (en) 2004-09-29
US7380112B2 (en) 2008-05-27
US20040193859A1 (en) 2004-09-30
JP2004288016A (en) 2004-10-14

Similar Documents

Publication Publication Date Title
US7380112B2 (en) Processor and compiler for decoding an instruction and executing the decoded instruction with conditional execution flags
US7594099B2 (en) Processor executing SIMD instructions
US8151254B2 (en) Compiler, compiler apparatus and compilation method
US7185176B2 (en) Processor executing SIMD instructions
US7386844B2 (en) Compiler apparatus and method of optimizing a source program by reducing a hamming distance between two instructions
US7937559B1 (en) System and method for generating a configurable processor supporting a user-defined plurality of instruction sizes
US7376812B1 (en) Vector co-processor for configurable and extensible processor architecture
US20070011441A1 (en) Method and system for data-driven runtime alignment operation
US7698696B2 (en) Compiler apparatus with flexible optimization
US7346881B2 (en) Method and apparatus for adding advanced instructions in an extensible processor architecture
US20110040822A1 (en) Complex Matrix Multiplication Operations with Data Pre-Conditioning in a High Performance Computing Architecture
US8136105B2 (en) Method to exploit superword-level parallelism using semi-isomorphic packing
JP2006338684A (en) Processor
JP2007102821A (en) Processor and compiler
Jeroen van Straten ρ-VEX user manual

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION