US20060039555A1 - Method and system for performing permutations using permutation instructions based on butterfly networks - Google Patents


Publication number
US20060039555A1
Authority
US
United States
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/180,269
Inventor
Ruby Lee
Xiao Yang
Manish Vachharajani
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to US11/180,269
Publication of US20060039555A1
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032 Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30018 Bit or string instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/34 Bits, or blocks of bits, of the telegraphic message being interchanged in time

Definitions

  • The present invention relates to a method and system for performing arbitrary permutations of a sequence of bits in a programmable processor by determining a permutation instruction based on butterfly networks.
  • Secure information processing typically includes authentication of users and host machines, confidentiality of messages sent over public networks, and assurances that messages, programs and data have not been maliciously changed.
  • Conventional solutions have provided security functions by using different security protocols employing different cryptographic algorithms, such as public key, symmetric key and hash algorithms.
  • For encrypting large amounts of data, symmetric key cryptography algorithms have been used, see Bruce Schneier, “Applied Cryptography”, 2nd Ed., John Wiley & Sons, Inc., 1996. These algorithms use the same secret key to encrypt and decrypt a given message, and encryption and decryption have the same computational complexity.
  • the cryptographic techniques of “confusion” and “diffusion” are synergistically employed. “Confusion” obscures the relationship between the plaintext (original message) and the ciphertext (encrypted message), for example, through substitution of arbitrary bits for bits in the plaintext.
  • “Diffusion” spreads the redundancy of the plaintext over the ciphertext, for example through permutation of the bits of the plaintext block. Such bit-level permutations have the drawback of being slow when implemented with conventional instructions available in microprocessors and other programmable processors.
  • Bit-level permutations are particularly difficult for processors, and have been avoided in the design of new cryptography algorithms, where it is desired to have fast software implementations, for example in the Advanced Encryption Standard, as described in NIST, “Announcing Request for Candidate Algorithm Nominations for the Advanced Encryption Standard (AES)”, http://csrc.nist.gov/encryption/aes/pre-round1/aes_9709.htm. Since conventional microprocessors are word-oriented, performing bit-level permutations is difficult and tedious. Every bit has to be extracted from the source register, moved to its new location in the destination register, and combined with the bits that have already been moved.
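The extract, move and combine procedure described above can be sketched as follows. This is a minimal illustration rather than code from the patent; the function name `permute_bits_naive`, the `src` mapping and the choice of bit 0 as the least significant bit are all assumptions made for the example.

```python
def permute_bits_naive(x, src, n=64):
    """Return y where bit j of y equals bit src[j] of x (bit 0 = least significant).

    Hypothetical helper: it mirrors the three steps a word-oriented processor
    needs for every single bit, which is why the approach is slow.
    """
    y = 0
    for j in range(n):
        bit = (x >> src[j]) & 1   # extract the bit from the source word
        y |= bit << j             # move it to position j and merge it in
    return y


# Example: reverse the bit order of an 8-bit value.
reverse8 = [7 - j for j in range(8)]
print(bin(permute_bits_naive(0b00000110, reverse8, n=8)))  # 0b1100000
```

Each loop iteration corresponds to several conventional instructions (shift, mask, shift, OR), so an n-bit permutation costs on the order of 4n instructions when done this way.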
  • Each entry has zeros in all positions, except the 8 bit positions to which the selected 8 bits in the source are permuted.
  • the results are combined with 7 OR instructions to get the final permutation.
  • 8 instructions are needed to extract the index for the LOAD instruction, for a total of 23 instructions.
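A table-lookup permutation of the kind these counts describe (eight loads, seven ORs, eight index extractions) might look like the sketch below. It assumes a 64-bit word split into eight bytes, one 256-entry table per byte, and bit 0 as the least significant bit; `build_tables`, `permute_by_lookup` and the `src` mapping are hypothetical names, not taken from the source.

```python
def build_tables(src):
    """Build 8 tables of 256 entries for a fixed 64-bit permutation.

    src[j] is the source bit index that lands at destination bit j.
    Every entry is zero except in the (at most 8) destination positions
    fed by the byte that indexes the table.
    """
    tables = [[0] * 256 for _ in range(8)]
    for j, i in enumerate(src):
        byte, bit = divmod(i, 8)
        for v in range(256):
            if (v >> bit) & 1:
                tables[byte][v] |= 1 << j
    return tables


def permute_by_lookup(x, tables):
    """Eight index extractions, eight table loads, seven ORs."""
    result = tables[0][x & 0xFF]
    for b in range(1, 8):
        result |= tables[b][(x >> (8 * b)) & 0xFF]
    return result


# Example: rotate a 64-bit word left by one position.
rotate_left_1 = [(j - 1) % 64 for j in range(64)]
tables = build_tables(rotate_left_1)
print(hex(permute_by_lookup(0x8000000000000001, tables)))  # 0x3
```

The cost that does not show up in the instruction count is memory: each distinct permutation needs its own 8 x 256 table of 64-bit entries.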
  • Permutations are a requirement for fast processing of digital multimedia information, using subword-parallel instructions, more commonly known as multimedia instructions, as described in Ruby Lee, “Accelerating Multimedia with Enhanced Micro-processors”, IEEE Micro, Vol. 15, No. 2, pp. 22-32, April 1995, and Ruby Lee, “Subword Parallelism in MAX-2”, IEEE Micro, Vol. 16, No. 4, pp. 51-59, August 1996.
  • Microprocessor Instruction Set Architecture (ISA) extensions use these subword-parallel instructions for fast multimedia information processing. With subwords packed into 64-bit words, it is often necessary to rearrange the subwords within the word. However, such subword permutation instructions are not provided by many of the conventional multimedia ISA extensions.
  • MIX and PERMUTE instructions have been implemented in the MAX-2 extension to Precision Architecture RISC (PA-RISC) processor, see Ruby Lee, “Subword Parallelism in MAX-2”, IEEE Micro, Vol. 16, No. 4, pp. 51-59, August 1996.
  • the MAX-2 general-purpose PERMUTE instruction can do any permutation, with and without repetitions, of the subwords packed in a register. However, it is only defined for 16-bit subwords.
  • MIX and MUX instructions have been implemented in the IA-64 architecture; these are extensions of the MIX and PERMUTE instructions of MAX-2, see Intel Corporation, “IA-64 Application Developers' Architecture Guide”, Intel Corporation, May 1999.
  • The IA-64 uses the MUX instruction, which is a fully general permute instruction for 16-bit subwords, with five new permute byte variants.
  • A VPERM instruction has been used in an AltiVec extension to the PowerPC™ available from IBM Corporation, Armonk, N.Y., see Motorola Corporation, “‘AltiVec Extensions to PowerPC’ Instruction Set Architecture Specification”, Motorola Corporation, May 1998.
  • The AltiVec VPERM instruction extends the general permutation capabilities of MAX-2's PERMUTE instruction to 8-bit subwords selected from two 128-bit source registers, into a single 128-bit destination register.
  • VPERM has to use another 128-bit register to hold the permutation control bits, making it a very expensive instruction with three source registers and one destination register, all 128 bits wide.
  • the present invention provides permutation instructions which can be used in software executed in a programmable processor for solving permutation problems in both cryptography and multimedia. For fast cryptography, bit-level permutations are used, whereas for multimedia, permutations on subwords of typically 8 bits or 16 bits are used. Permutation instructions of the present invention can be used to provide any arbitrary permutation of sixty-four 1-bit subwords in a 64-bit processor, i.e., a processor with 64-bit words, registers and datapaths, for use in fast cryptography. The permutation instructions of the present invention can also be used for permuting subwords greater than 1 bit in size, for use in fast multimedia processing.
  • the permutation instructions and underlying functional unit can permute thirty-two 2-bit subwords, sixteen 4-bit subwords, eight 8-bit subwords, four 16-bit subwords, or two 32-bit subwords.
  • the permutation instructions of the present invention can be added as new instructions to the Instruction Set Architecture of a conventional microprocessor, or they can be used in the design of new processors or coprocessors to be efficient for both cryptography and multimedia software.
  • the method for performing permutations is by constructing a Benes interconnection network. This is done by executing a certain number of stages of the Benes network with permute instructions.
  • the permute instructions are performed by a circuit comprising Benes network stages.
  • Intermediate sequences of bits are defined that an initial sequence of bits from a source register are transformed into. Each intermediate sequence of bits is used as input to a subsequent permutation instruction.
  • Permutation instructions are determined for permuting the initial source sequence of bits into one or more intermediate sequence of bits until a desired sequence is obtained.
  • the intermediate sequences of bits are determined by configuration bits.
  • The permutation instructions form a permutation instruction sequence. At most lg n permutation instructions are used in the permutation instruction sequence.
  • Multibit subwords are permuted by eliminating pass-throughs in the Benes network.
  • The method and system are scaled for performing permutations of 2n bits in which subwords are packed into two or more registers. In this embodiment, at most 4 lg n + 2 instructions are used to permute 2n bits using n-bit words.
  • FIG. 1 is a schematic diagram of a system for implementing permutation instructions in accordance with an embodiment of the present invention.
  • FIG. 2 is a flow diagram of a method for determining permutation instruction sequence to achieve a desired permutation in accordance with an embodiment of the present invention.
  • FIG. 3A is a schematic diagram of an 8-input Benes network.
  • FIG. 3B is a schematic diagram of an implementation of a CROSS instruction in accordance with an embodiment of the present invention.
  • FIG. 3C is a schematic diagram of a layout of a CROSS instruction in accordance with an embodiment of the present invention.
  • FIG. 4A is a flow diagram of a method for implementing a CROSS instruction sequence to do an arbitrary permutation.
  • FIG. 4B is a schematic diagram for obtaining configuration bits for an 8-input Benes network based on hierarchical partitioning into subnets.
  • FIG. 5 is a schematic diagram of a Benes network configured for a given permutation.
  • FIG. 6 is a flow diagram of a method for permutations of multi-bit subwords in accordance with an embodiment of the present invention.
  • FIG. 7A is a schematic diagram of a Benes network configured for a multi-bit permutation including pass through stages.
  • FIG. 7B is a schematic diagram of the Benes network of FIG. 7A after elimination of pass through stages.
  • FIG. 8A is a flow diagram of a method for 2n-bit permutations in accordance with an embodiment of the present invention.
  • FIG. 8B is a schematic diagram of an implementation of the method shown in FIG. 8A .
  • FIG. 9A is a schematic diagram of a circuit implementation of CROSS instructions for an individual node.
  • FIG. 9B is a schematic diagram of a circuit implementation of CROSS instructions for an 8-bit implementation.
  • FIG. 10A is a high-level schematic diagram of a circuit implementation for CROSS instructions in accordance with an embodiment of the present invention.
  • FIG. 10B is a high-level schematic diagram of a circuit implementation for CROSS instructions in accordance with an alternate embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a circuit implementation of an 8×8 crossbar for comparison with the circuit implementation of OMFLIP instructions.
  • FIG. 12A is a schematic diagram of a system for implementing permutation instructions in accordance with an alternate embodiment of the present invention.
  • FIG. 12B is a schematic diagram of a system for implementing permutation instructions in accordance with another alternate embodiment of the present invention.
  • FIG. 1 illustrates a schematic diagram of a system for implementing efficient permutation instructions 10 in accordance with the teachings of the present invention.
  • Register file 12 includes source register 11a, source register 11b and destination register 11c.
  • System 10 can provide bit-level permutations of all n bits of any register in register file 12.
  • Source register values to be permuted 13 from source register 11a and configuration bits 15 from source register 11b are applied over datapaths to permutation functional unit 14.
  • Source register values to be permuted 13 can be a sequence of bits or a sequence of subwords.
  • Permutation functional unit 14 generates permutation result 16.
  • Permutation result 16 can be an intermediate result if additional permutations are performed by permutation functional unit 14.
  • Arithmetic logic unit (ALU) 17 and shifter 18 receive source register values 13 from source register 11a and source register values 15 from source register 11b and generate a respective ALU result 20 and a shifter result 21 over a data path to destination register 11c.
  • System 10 can be implemented in any programmable processor, for example, a conventional microprocessor, digital signal processor (DSP), cryptographic processor, multimedia processor and can be used in developing processors or coprocessors for providing cryptography and multimedia operations.
  • FIG. 2 is a flow diagram of a method of determining permutation instruction sequences for permutations 22 .
  • the determined permutation instruction sequences can be performed in permutation functional unit 14 .
  • intermediate states are defined that an initial sequence of bits from a source register are to be transformed into.
  • the final state is the desired permutation of the initial sequence of bits.
  • control configuration bits are defined for transforming the initial sequence into the first intermediate state and subsequent intermediate states until transformation into the final state.
  • a Benes network can be used to perform permutations of n bits with edge-disjoint paths using intermediate states.
  • the Benes network can be formed by connecting two butterfly networks of the same size back-to-back.
  • An example of an 8-input Benes network is shown in FIG. 3A .
  • An n-input Benes network can be broken into 2 lg n stages, of which lg n are distinct.
  • The number of nodes in each stage is n.
  • a node is defined as a point in the network where the path selection for an input takes place.
  • In each stage of a butterfly network for every input, there is another input that shares the same two outputs with it.
  • Such pairs of inputs can be referred to as “conflict inputs” and their corresponding outputs can be referred to as “conflict outputs”.
  • the distances between conflict pairs in one stage of the Benes network are the same. The distances between conflict pairs are different in different stages.
  • basic operations are defined corresponding to one stage of the butterfly network.
  • One basic operation is that done by one stage of a butterfly network.
  • A basic operation is specified by a parameter m, where 2^m is the distance between conflict pairs for the corresponding stage.
  • A basic operation uses n/2 configuration bits to set up the connections in the corresponding stage and move the n input bits to the output. Accordingly, for permuting the contents in an n-bit register, the n configuration bits for two basic operations can be packed into one configuration register, allowing two basic operations to be packed into a single instruction. Since an n-input Benes network has lg n distinct stages, there are lg n different basic operations.
  • Bits from the source register are moved to the result register based on the configuration bits. In an embodiment of the present invention, if the configuration bit for a pair of conflict inputs is 0, the bits from the two conflict inputs go through non-crossing paths to the outputs. If the configuration bit for a pair of conflict inputs is 1, the bits from the two conflict inputs go through crossing paths to the outputs.
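The basic operation just described can be modeled in a few lines. The sketch below is illustrative only: it assumes positions are numbered from 0 at the left, that the configuration bits are listed in the order of the first (leftmost) node of each conflict pair, and the names `conflict_pairs` and `basic_op` are invented for the example.

```python
def conflict_pairs(n, m):
    """Index pairs (i, i + 2**m) that share outputs in the stage with parameter m."""
    d = 1 << m
    return [(i, i + d) for i in range(n) if (i & d) == 0]


def basic_op(seq, m, bits):
    """Apply one butterfly stage: a 1 bit crosses a conflict pair, a 0 bit passes it straight."""
    out = list(seq)
    for (i, j), bit in zip(conflict_pairs(len(seq), m), bits):
        if bit:
            out[i], out[j] = out[j], out[i]
    return out


print(conflict_pairs(8, 2))                             # [(0, 4), (1, 5), (2, 6), (3, 7)]
print("".join(basic_op("abcdefgh", 2, [1, 0, 0, 1])))   # ebchafgd
```

With n = 8 and m = 2 the pairs are four positions apart, so n/2 = 4 configuration bits fully describe the stage.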
  • The instruction format for the permutation instruction can be defined as: CROSS,m1,m2 R1,R2,R3
  • m1 and m2 are the parameters that specify the two basic operations to be used
  • R1 is a reference to a source register which contains the subwords to be permuted
  • R2 is a reference to a configuration register that holds the configuration bits for the two basic operations
  • R3 is a reference to a destination register where the permuted subwords are placed.
  • R1, R2 and R3 refer to any registers Ri, Rj and Rk, where i, j and k can be all different or two or more of i, j and k can be the same. Alternatively, R3 can be omitted and the permuted subwords are placed in register R1.
  • A CROSS instruction performs two basic operations on the source according to the contents of the configuration register and the values of m1 and m2.
  • The first basic operation can be determined by the value of m1.
  • The first basic operation moves the bits in source register R1 to an intermediate result, based on the left half of the configuration bits held in configuration register R2.
  • The second basic operation can be determined by the value of m2.
  • The second basic operation moves the bits in the intermediate result to the destination register R3, according to the right half of the configuration bits in register R2.
  • Pseudo code for the CROSS instruction is shown in Table 1.
  • The CROSS instruction can permute sixty-four 1-bit subwords in a 64-bit processor for use in, for example, encryption and decryption processing using software.
  • The CROSS instruction can also permute multi-bit subwords as described below, for example, thirty-two 2-bit subwords, sixteen 4-bit subwords, eight 8-bit subwords, four 16-bit subwords or two 32-bit subwords in a 64-bit processor, for use, for example, in multimedia processing.
  • FIG. 3B illustrates an example of operation of a CROSS instruction.
  • The source sequence of bits consists of 8 bits: bit a, bit b, bit c, bit d, bit e, bit f, bit g and bit h.
  • The CROSS instruction is CROSS,2,1 R1,R2,R3, wherein the source sequence of bits in register R1 is referred to by abcdefgh, the control bits of R2 are 10011010 and the destination sequence of bits received in register R3 is cbehgfad.
  • Each of bit positions 30a-30h in source register R1 acts as an input node to this Benes network: node 30a receives bit a, node 30b receives bit b, node 30c receives bit c, node 30d receives bit d, node 30e receives bit e, node 30f receives bit f, node 30g receives bit g and node 30h receives bit h.
  • Each node 30a-30h has two outputs 31a and 31b.
  • Outputs 31a and 31b for each of nodes 30a-30h are each directed to one node in the set of nodes 32a-32h. For example, output 31a of node 30a is directed to node 32a and output 31b of node 30a is directed to node 32e.
  • Output 31a of node 30e is directed to node 32a and output 31b of node 30e is directed to node 32e.
  • Node 30a and node 30e are conflict inputs and respective nodes 32a and 32e receive conflict outputs.
  • Node 30b and node 30f are conflict inputs and respective nodes 32b and 32f receive conflict outputs.
  • Node 30c and node 30g are conflict inputs and respective nodes 32c and 32g receive conflict outputs.
  • Node 30d and node 30h are conflict inputs and respective nodes 32d and 32h receive conflict outputs.
  • The left half of the configuration bits of R2 is applied to each pair of conflict outputs and is represented in the first node of each pair of conflict outputs. Accordingly, configuration bit 34a is applied to nodes 32a and 32e, configuration bit 34b is applied to nodes 32b and 32f, configuration bit 34c is applied to nodes 32c and 32g and configuration bit 34d is applied to nodes 32d and 32h.
  • Node 30a and node 30e have crossing paths to nodes 32a and 32e since configuration bit 34a is 1.
  • Node 30b and node 30f have non-crossing paths to nodes 32b and 32f since configuration bit 34b is 0.
  • Node 30c and node 30g have non-crossing paths to nodes 32c and 32g since configuration bit 34c is 0.
  • Node 30d and node 30h have crossing paths to nodes 32d and 32h since configuration bit 34d is 1.
  • The intermediate sequence of bits is ebchafgd.
  • Each of nodes 32a-32h has two outputs 35a and 35b.
  • Outputs 35a and 35b are each directed to one node in the set of nodes 36a-36h.
  • Output 35a of node 32a is directed to node 36a and output 35b of node 32a is directed to node 36c.
  • Output 35a of node 32c is directed to node 36a and output 35b of node 32c is directed to node 36c.
  • Node 32a and node 32c receive conflict inputs and respective nodes 36a and 36c receive conflict outputs.
  • Conflict outputs are also received at the respective pairs of nodes 36b and 36d, nodes 36e and 36g, and nodes 36f and 36h.
  • The right half of the configuration bits of R2 is applied to each pair of conflict outputs.
  • Configuration bit 34e is applied to nodes 36a and 36c.
  • Configuration bit 34f is applied to nodes 36b and 36d.
  • Configuration bit 34g is applied to nodes 36e and 36g.
  • Configuration bit 34h is applied to nodes 36f and 36h.
  • Node 32a and node 32c have crossing paths to nodes 36a and 36c since configuration bit 34e is 1.
  • Node 32b and node 32d have non-crossing paths to nodes 36b and 36d since configuration bit 34f is 0.
  • Node 32e and node 32g have crossing paths to nodes 36e and 36g since configuration bit 34g is 1.
  • Node 32f and node 32h have non-crossing paths to nodes 36f and 36h since configuration bit 34h is 0.
  • the result sequence of bits is cbehgfad.
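The FIG. 3B walk-through can be checked with a small register-level model of the instruction. This is a sketch under stated assumptions, not the pseudo code of Table 1: registers are Python integers, position 0 of a sequence is taken to be the most significant bit, the left half of the configuration register holds the bits for the first basic operation, and the names `basic_op_int` and `cross` are hypothetical.

```python
def basic_op_int(x, m, cfg, n):
    """One butterfly stage on an n-bit integer; cfg holds n/2 bits, leftmost pair first."""
    d = 1 << m
    k = n // 2 - 1                       # config bit index for the first conflict pair
    for i in range(n):
        if i & d:                        # i is the second node of a pair, skip it
            continue
        if (cfg >> k) & 1:
            hi, lo = n - 1 - i, n - 1 - (i + d)   # bit numbers (position 0 = MSB)
            if ((x >> hi) ^ (x >> lo)) & 1:       # swap only if the two bits differ
                x ^= (1 << hi) | (1 << lo)
        k -= 1
    return x


def cross(m1, m2, r1, r2, n=64):
    """Model of CROSS,m1,m2 R1,R2,R3: returns the value written to R3."""
    half = n // 2
    left, right = r2 >> half, r2 & ((1 << half) - 1)
    return basic_op_int(basic_op_int(r1, m1, left, n), m2, right, n)


# Trace each letter of "abcdefgh" through CROSS,2,1 with control bits 10011010
# by sending a single 1 bit from that letter's position.
out = [None] * 8
for i, ch in enumerate("abcdefgh"):
    y = cross(2, 1, 1 << (7 - i), 0b10011010, n=8)
    out[7 - (y.bit_length() - 1)] = ch
print("".join(out))  # cbehgfad
```

Tracing one-hot inputs recovers where every source position lands, which is a convenient way to validate a configuration register before using it on real data.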
  • FIG. 3C shows one embodiment of the encoding of the CROSS instruction 39 for use in a programmable processor.
  • The instruction may also contain other fields.
  • The relative locations of the fields in an instruction are arbitrary and may be varied without violating the spirit of the invention.
  • A method for implementing CROSS instructions to do arbitrary permutations is shown in FIG. 4A.
  • a Benes network configuration is determined for the desired permutation.
  • a Benes network can be configured as described in X. Yang, M. Vachharajani and R. B. Lee, “Fast Subword Permutation Instructions Based on Butterfly Networks”, Proceedings of SPIE, Media Processor 2000, pp. 80-86, January 2000, herein incorporated by reference.
  • FIG. 4B illustrates the following steps for configuring a Benes network:
  • “Inputs” and “outputs” refer to the inputs and outputs of current Benes network. Starting from the first input that is not configured, referred to as “current input”, set the “end input” to be the conflict input of the “current input”. If all “inputs” have already been configured, go to Step 4.
  • FIG. 4B illustrates the above steps for permutation (a- - - h) to (h- - - a- - ), where “-” means do-not-care.
  • the first input that is not configured is node 151 , which contains value a.
  • node 151 contains value a.
  • node 152 has the value a.
  • The output that has the value a is node 153; we mark it as “output (current input)”.
  • We connect node 153 to subnet 156 which is the same subnet as node 151 is connected to.
  • the conflict output of node 153 is node 154 , which contains value h.
  • node 154 is connected to subnet 157 that is not 156 .
  • the input that contains value h is node 155 , we mark it as “input (current output)” and connect it to subnet 157 as well. Since node 155 is different from “end input”, or node 152 .
  • the configured Benes network is broken into pairs of stages.
  • a CROSS instruction is assigned for each pair of stages.
  • the first CROSS instruction takes the original input.
  • each CROSS instruction uses the output from the last CROSS instruction as input and produces input for the next CROSS instruction.
  • The last CROSS instruction generates the final permutation. Accordingly, since there are 2 lg n stages in an n-input Benes network, all possible permutations can be performed for subwords in an n-bit register using lg n CROSS instructions.
  • A Benes network configured for the permutation (abcdefgh) → (fabcedhg) is shown in FIG. 5.
  • Configuration bits are determined for each node. These configuration bits are the contents of the configuration registers R2, R3 and R4. The configuration bits are read from left to right through nodes from left to right.
  • the Benes network is broken into stages 55 a - 55 c , by performing block 53 .
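One way to obtain configuration bits for an arbitrary permutation is sketched below. It uses the textbook recursive "looping" (two-coloring) route for a Benes network built from two back-to-back butterflies; it is not claimed to reproduce the exact procedure of FIG. 4B or the cited SPIE paper, and because Benes configurations are not unique the bits it produces may differ from those in FIG. 5. All function names are hypothetical, and stages are described by their pair distance d = 2^m.

```python
def basic_op(seq, d, bits):
    """One stage with pair distance d; a 1 bit swaps the pair (i, i + d)."""
    out = list(seq)
    firsts = [i for i in range(len(seq)) if (i & d) == 0]
    for i, bit in zip(firsts, bits):
        if bit:
            out[i], out[i + d] = out[i + d], out[i]
    return out


def benes_config(src):
    """src[j] = input index wanted at output j. Returns 2*lg(n) stages as (d, bits)."""
    n = len(src)
    if n == 1:
        return []
    h = n // 2
    dst = [0] * n
    for j, i in enumerate(src):
        dst[i] = j
    side = [None] * n                    # 0 = upper subnet, 1 = lower subnet
    for start in range(n):
        i = start
        while side[i] is None:           # walk one cycle of conflict constraints
            side[i] = 0
            k = src[dst[i] ^ h]          # input whose output conflicts with i's output
            side[k] = 1
            i = k ^ h                    # conflict input of k must go the other way
    in_bits = [side[p] for p in range(h)]
    out_bits = [side[src[q]] for q in range(h)]
    upper, lower = [0] * h, [0] * h
    for q in range(h):
        a, b = src[q], src[q + h]
        if side[a] == 1:
            a, b = b, a
        upper[q], lower[q] = a % h, b % h
    inner = [(d, u + l) for (d, u), (_, l)
             in zip(benes_config(upper), benes_config(lower))]
    return [(h, in_bits)] + inner + [(h, out_bits)]


def apply_stages(stages, seq):
    for d, bits in stages:
        seq = basic_op(seq, d, bits)
    return seq


def to_cross_instructions(stages):
    """Pair consecutive stages into (m1, m2, config bits) triples."""
    return [(d1.bit_length() - 1, d2.bit_length() - 1, b1 + b2)
            for (d1, b1), (d2, b2) in zip(stages[0::2], stages[1::2])]


src = ["abcdefgh".index(c) for c in "fabcedhg"]
stages = benes_config(src)
print("".join(apply_stages(stages, "abcdefgh")))   # fabcedhg
print(len(to_cross_instructions(stages)))          # 3
```

For n = 8 the routine yields six stages with distances 4, 2, 1, 1, 2, 4, which pair into lg n = 3 CROSS instructions, matching the count stated above.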
  • A schematic diagram of a method for permuting multi-bit subwords 60 is shown in FIG. 6 in which each subword contains more than one bit.
  • Multi-bit subwords can be represented as k-bit subword permutation.
  • Block 61 is identical to block 51 in FIG. 4A .
  • In block 62, a determination is made for eliminating pass-through stages. For many permutations, some stages of the Benes network can be configured as pass-throughs. This is true even for some permutations that are not subword permutations. Because the bypassing connections only serve to copy the inputs to the outputs, these stages can be removed before the assignment of the CROSS instructions. For example, if 2k stages are removed, there will be k fewer instructions.
  • An example of an implementation of method 60 is shown in FIGS.
  • FIG. 7A illustrates the configuration of an 8-input Benes network for a 2-bit permutation of (a1a2b1b2c1c2d1d2) → (c1c2b1b2d1d2a1a2), in which the middle 2 stages of the Benes network copy the input bits to their output without any change of order as determined from block 62.
  • the middle stages are eliminated from the configured Benes network as shown in FIG. 7B .
  • the instructions are assigned to the remaining stages without affecting the result.
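Pass-through elimination can be illustrated with a short sketch. The stage list below is one hand-built configuration for the 2-bit subword permutation of FIG. 7A and is not necessarily the configuration drawn in the figure; padding an odd leftover stage with an all-zero partner is an assumption of this example, as are the names `assign_cross` and `run`.

```python
def basic_op(seq, m, bits):
    """One stage with pair distance 2**m; a 1 bit swaps the pair."""
    d = 1 << m
    out = list(seq)
    firsts = [i for i in range(len(seq)) if (i & d) == 0]
    for i, bit in zip(firsts, bits):
        if bit:
            out[i], out[i + d] = out[i + d], out[i]
    return out


def assign_cross(stages):
    """Drop all-zero (pass-through) stages, then pair the rest into CROSS instructions."""
    active = [(m, bits) for m, bits in stages if any(bits)]
    if len(active) % 2:                  # assumption: pad an odd count with a pass-through
        m_last, bits_last = active[-1]
        active.append((m_last, [0] * len(bits_last)))
    return [(active[i][0], active[i + 1][0], active[i][1] + active[i + 1][1])
            for i in range(0, len(active), 2)]


def run(instrs, seq):
    for m1, m2, bits in instrs:
        half = len(bits) // 2
        seq = basic_op(basic_op(seq, m1, bits[:half]), m2, bits[half:])
    return seq


# (m, configuration bits) for the six stages of an 8-input network.
stages = [(2, [0, 0, 1, 1]), (1, [0, 0, 0, 0]), (0, [0, 0, 0, 0]),
          (0, [0, 0, 0, 0]), (1, [1, 1, 0, 0]), (2, [1, 1, 1, 1])]
symbols = ["a1", "a2", "b1", "b2", "c1", "c2", "d1", "d2"]
instrs = assign_cross(stages)
print(len(instrs))                  # 2
print(" ".join(run(instrs, symbols)))  # c1 c2 b1 b2 d1 d2 a1 a2
```

Because both bits of a 2-bit subword always move together, the distance-1 stages never need to cross, which is what makes them removable.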
  • the CROSS instruction can be used to permute subwords packed into more than one register. If a register is n bits, two registers are 2n bits.
  • the CROSS instructions can be used for 2n-bit permutations by using an instruction such as the SHIFT PAIR instruction in PA-RISC, as described in Ruby Lee, “Precision Architecture”, IEEE Computer, Vol. 22, No. 1, pp. 78-91, January 1989, and Ruby Lee, Michael Mahon, Dale Morris, “Pathlength Reduction Features in the PA-RISC Architecture”, Proceedings of IEEE Compcon, pp. 129-135, Feb. 24-28, 1992, San Francisco, Calif., hereby incorporated by reference into this application.
  • the SHIFT PAIR instruction can process operands that cross word boundaries.
  • FIGS. 8A and 8B illustrate an example of performing 2n-bit permutations using SHIFT PAIR and CROSS instructions.
  • Source registers R1 and R2 store the bits to be permuted, and the results are put in destination registers referred to by R3 and R4.
  • The bits of the source registers R1 and R2 are divided into two groups using two CROSS instruction sequences.
  • One CROSS instruction sequence is for R1 and one CROSS instruction sequence is for R2.
  • For R1, the bits going to register R3 are put into a left group and the bits going to R4 into a right group.
  • For R2, the bits going to register R4 are put into a left group, and the bits going to register R3 are put into a right group.
  • Register R1 is divided into left group 75a and right group 75b as shown in FIG. 8B.
  • Register R2 is divided into left group 77a and right group 77b.
  • Register R3 includes the bits of left group 75a and right group 77b, and register R4 includes the bits of right group 75b and left group 77a.
  • With R3 and R4 considered as separate n-bit words, n-bit permutations are performed on register R3 and register R4.
  • Each of R3 and R4 can use up to lg n instructions. In total, excluding the instructions needed for loading control bits, 4 lg n + 2 instructions are needed to do a 2n-bit permutation. Accordingly, with 64-bit registers, a 128-bit permutation can be performed with 26 instructions.
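The three phases of the 2n-bit method and its instruction bound can be simulated as follows. This sketch makes several assumptions: registers are lists of distinct symbols, each n-bit rearrangement is done with ordinary list operations standing in for a CROSS sequence of at most lg n instructions, and SHIFT PAIR is modeled as taking an n-symbol window from the concatenation of two registers. The names `permute_2n` and `instruction_bound` are hypothetical.

```python
def permute_2n(r1, r2, want3, want4):
    """Permute the 2n symbols in (r1, r2) into the layouts want3 and want4."""
    n = len(r1)
    to3 = set(want3)
    # Phase 1: one CROSS sequence per source register groups its symbols.
    r1g = [s for s in r1 if s in to3] + [s for s in r1 if s not in to3]
    r2g = [s for s in r2 if s not in to3] + [s for s in r2 if s in to3]
    k = sum(s in to3 for s in r1)        # size of R1's left group
    # Phase 2: two SHIFT PAIR style window extractions exchange the groups.
    r3 = (r2g + r1g)[k:k + n]            # R2's right group, then R1's left group
    r4 = (r1g + r2g)[k:k + n]            # R1's right group, then R2's left group
    # Phase 3: one n-bit permutation per destination register.
    r3 = sorted(r3, key=want3.index)
    r4 = sorted(r4, key=want4.index)
    return r3, r4


def instruction_bound(n):
    """4 lg n + 2: two grouping sequences, two shifts, two final sequences."""
    lg_n = n.bit_length() - 1
    return 4 * lg_n + 2


r3, r4 = permute_2n(list("abcd"), list("efgh"), list("hacf"), list("bgde"))
print("".join(r3), "".join(r4))   # hacf bgde
print(instruction_bound(64))       # 26
```

The bound decomposes as lg n instructions for each of the four CROSS sequences plus the two shift instructions, which gives 4 x 6 + 2 = 26 for 64-bit registers.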
  • FIGS. 9A and 9B illustrate schematic diagrams of a circuit implementation for CROSS instruction corresponding to the high level diagram of 100 as shown in FIG. 10A , for an individual node 80 and 8-bit implementation 90 .
  • the CROSS instruction can be implemented by implementing at the circuit level a Benes network.
  • An n-input Benes network has 2n lg n switch points.
  • The control logic selects the two stages for the two basic operations based on the values of m1 and m2. Because the Benes network has two of each butterfly stage, stages can always be selected for all possible m1 and m2.
  • The left and right halves of R2 are used to configure the two stages selected, and all the other stages are configured as pass-throughs.
  • The source R1 is put through the configured network, and the result R3 is obtained.
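The stage selection performed by the control logic can be sketched as a simple mapping. The physical stage order assumed here (parameters lg n - 1 down to 0 for the first butterfly, then 0 up to lg n - 1 for the second) and the policy of always taking m1 from the first butterfly and m2 from the second are assumptions for illustration; the source only states that a valid selection always exists. `select_stages` is a hypothetical name.

```python
def select_stages(m1, m2, left, right, lg_n):
    """Return (m, config bits) for every physical stage of a full Benes circuit.

    Only the two selected stages receive configuration bits; every other
    stage is set to all zeros, i.e. configured as a pass-through.
    """
    half = len(left)
    params = list(range(lg_n - 1, -1, -1)) + list(range(lg_n))
    first = lg_n - 1 - m1        # stage with parameter m1 in the first butterfly
    second = lg_n + m2           # stage with parameter m2 in the second butterfly
    config = [[0] * half for _ in params]
    config[first] = list(left)
    config[second] = list(right)
    return list(zip(params, config))


for m, bits in select_stages(2, 1, [1, 0, 0, 1], [1, 0, 1, 0], lg_n=3):
    print(m, bits)
```

Since the first selected stage always lies in the first butterfly and the second in the second, the data meets them in the required order for any pair (m1, m2), including m1 = m2.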
  • FIG. 10A illustrates one embodiment of a high-level schematic diagram of a circuit implementation 100 for CROSS instructions for an 8 bit system.
  • the circuit implementation implements the entire Benes network.
  • The control logic selects the proper two stages for the two basic operations based on the parameters m1 and m2. Thereafter, the CROSS instruction configures the two selected stages according to the left half and right half of the configuration register R2.
  • the stages that are not used are configured as pass-throughs.
  • FIG. 10B illustrates another embodiment of a high-level schematic diagram of a circuit implementation 110 for CROSS instructions.
  • The circuit implementation implements two identical stages. Each stage comprises all the connections of all the stages of a butterfly network.
  • When executing a CROSS instruction, the control logic selects the proper two sets of connections for the two basic operations based on the parameters m1 and m2. Thereafter, the CROSS instruction configures the two selected sets of connections according to the left half and right half of the configuration register R2.
  • Two or more different butterfly stages are combined in one stage of the implementation.
  • FIG. 12A illustrates an alternate embodiment of the invention, in which a single permute instruction can perform more than two Benes stages.
  • Register file 112 includes three read ports, 111 a, 111 b and 111 c.
  • Two registers 111 b and 111 c can be used to send configuration bits 115 and 122 to permutation unit 114.
  • System 100 allows four Benes stages to be performed in one permute instruction. This allows any arbitrary permutation of n bits to be performed in an instruction sequence of (2 lg n)/4, or (lg n)/2, instructions.
  • This can be extended to sending more configuration bits with each permute instruction, thus performing more Benes stages per instruction, and reducing the number of instructions in the instruction sequence needed for any arbitrary permutation of n bits.
  • The minimum number of instructions, namely one instruction, is achieved by sending lg n registers of configuration bits together with the one register of n bits to be permuted in the permute instruction. In general, any arbitrary permutation of n bits can be performed in an instruction sequence of (2 lg n)/m instructions, where m is the number of network stages performed by one permutation instruction.
  • FIG. 12B illustrates an alternate embodiment of the invention, in which the permutation result can be temporarily stored in permutation functional unit 214 .
  • Bits of intermediate permutation result 216 are stored in memory location 222 of permutation functional unit 214 after the generation of intermediate permutation result 216.
  • The source bits can be used from memory location 222 instead of being fetched from register file 212.
  • Both of source registers 211 a and 211 b are used for configuration bits in a permutation instruction. Accordingly, the desired permutation can be performed in fewer instructions.
  • n lg n configuration bits are stored in the memory 222, rather than read from the register 211 b (or from the registers 111 b and 111 c in FIG. 12A).
  • The n-bit value 213 to be permuted is read from register 211 a and sent to the permutation functional unit 214.
  • This embodiment is useful if the same n-bit permutation is repeated many times for different n-bit values. The sequence of permutation instructions needed to perform this n-bit permutation is reduced to one instruction.
  • The CROSS instruction, in any of the above-described embodiments, can be used by itself, rather than in a sequence of instructions.
  • A single CROSS instruction generates a subset of all possible permutations.
  • A permutation performed by a single CROSS instruction can be reversed by reversing the order of the stages used in the CROSS instruction, with the configuration bits for each stage being the same as for the original permutation.
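  • By way of illustration only, the reversal property can be sketched in software. The sketch below assumes that each stage swaps the positions of a conflict pair when its configuration bit is 1, that position 0 is the leftmost bit, and that registers are modeled as character strings; the names stage, cross and inverse_cross_args are hypothetical and do not appear in the specification. Because each stage consists of disjoint swaps, each stage is its own inverse, so applying the same stages in reverse order restores the source.

```python
def stage(bits, m, config):
    """Apply one butterfly stage: swap each conflict pair whose config bit is '1'."""
    out = list(bits)
    dist = 1 << m  # distance between conflict pairs is 2^m
    pairs = [(s + i, s + i + dist)
             for s in range(0, len(out), 2 * dist)
             for i in range(dist)]
    for (left, right), c in zip(pairs, config):
        if c == "1":
            out[left], out[right] = out[right], out[left]
    return "".join(out)


def cross(src, m1, m2, r2):
    """Software model of CROSS,m1,m2: left half of r2 drives stage m1, right half stage m2."""
    half = len(r2) // 2
    return stage(stage(src, m1, r2[:half]), m2, r2[half:])


def inverse_cross_args(m1, m2, r2):
    """Reverse the stage order; each stage keeps its own configuration bits."""
    half = len(r2) // 2
    return m2, m1, r2[half:] + r2[:half]


permuted = cross("abcdefgh", 2, 1, "10011010")
restored = cross(permuted, *inverse_cross_args(2, 1, "10011010"))
print(permuted, restored)
```

  • In this model the inverse of CROSS,2,1 with configuration bits 10011010 is CROSS,1,2 with the two halves of the configuration register exchanged, that is, 10101001.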
  • Transistors 2n lg n ⁇
  • The 2n horizontal tracks come from the 2 input lines in each node.
  • The number of horizontal tracks is composed of two parts: n/2 configuration lines per stage for the 2 lg n stages, and the number of data tracks needed between adjacent stages, which is 2 ⁇ (2n ⁇ 2) in total.
  • The transistor count covers the AND gate and pass transistor at each cross point.
  • An alternative implementation of the crossbar is to provide a negated signal for each control signal so that no inverters are needed before the AND gates. Then the horizontal track count becomes n + 2n lg n and the transistor count becomes n^2 (1 + 2 lg n). This implementation may yield a larger size due to the greater number of vertical tracks used.
  • Table 3 shows a comparison of the number of instructions needed for permutations of a 64-bit word with different subword sizes for method 10 using CROSS instructions and a method using conventional instruction set architectures (ISAs).

Abstract

The present invention provides permutation instructions which can be used in software executed in a programmable processor for solving permutation problems in cryptography, multimedia and other applications. The permute instructions are based on a Benes network comprising two butterfly networks of the same size connected back-to-back. Intermediate sequences of bits are defined into which an initial sequence of bits from a source register is transformed. Each intermediate sequence of bits is used as input to a subsequent permutation instruction. Permutation instructions are determined for permuting the initial source sequence of bits into one or more intermediate sequences of bits until a desired sequence is obtained. The intermediate sequences of bits are determined by configuration bits. The permutation instructions form a permutation instruction sequence of at least one instruction. At most (2 lg r)/m permutation instructions are used in the permutation instruction sequence, where r is the number of k-bit subwords to be permuted, and m is the number of network stages executed in one instruction. The permutation instructions can be used to permute k-bit subwords packed into an n-bit word, where k can be 1, 2, . . . , or n bits, and k*r=n.

Description

    BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • The present invention relates to a method and system for performing arbitrary permutations of a sequence of bits in a programmable processor by determining a permutation instruction based on butterfly networks.
  • 2. Description of the Related Art
  • The need for secure information processing has increased with the increasing use of the public internet and wireless communications in e-commerce, e-business and personal use. Typical use of the internet is not secure. Secure information processing typically includes authentication of users and host machines, confidentiality of messages sent over public networks, and assurances that messages, programs and data have not been maliciously changed. Conventional solutions have provided security functions by using different security protocols employing different cryptographic algorithms, such as public key, symmetric key and hash algorithms.
  • For encrypting large amounts of data, symmetric key cryptography algorithms have been used, see Bruce Schneier, “Applied Cryptography”, 2nd Ed., John Wiley & Sons, Inc., 1996. These algorithms use the same secret key to encrypt and decrypt a given message, and encryption and decryption have the same computational complexity. In symmetric key algorithms, the cryptographic techniques of “confusion” and “diffusion” are synergistically employed. “Confusion” obscures the relationship between the plaintext (original message) and the ciphertext (encrypted message), for example, through substitution of arbitrary bits for bits in the plaintext. “Diffusion” spreads the redundancy of the plaintext over the ciphertext, for example through permutation of the bits of the plaintext block. Such bit-level permutations have the drawback of being slow when implemented with conventional instructions available in microprocessors and other programmable processors.
  • Bit-level permutations are particularly difficult for processors, and have been avoided in the design of new cryptography algorithms where fast software implementations are desired, for example in the Advanced Encryption Standard, as described in NIST, “Announcing Request for Candidate Algorithm Nominations for the Advanced Encryption Standard (AES)”, http://csrc.nist.gov/encryption/aes/pre-round1/aes9709.htm. Since conventional microprocessors are word-oriented, performing bit-level permutations is difficult and tedious. Every bit has to be extracted from the source register, moved to its new location in the destination register, and combined with the bits that have already been moved. This requires 4 instructions per bit (mask generation, AND, SHIFT, OR), and 4n instructions to perform an arbitrary permutation of n bits. Conventional microprocessors, for example Precision Architecture (PA-RISC), have been described to provide more powerful bit-manipulation capabilities using EXTRACT and DEPOSIT instructions, which can essentially perform the four operations required for each bit in 2 instructions (EXTRACT, DEPOSIT), resulting in 2n instructions for any arbitrary permutation of n bits, see Ruby Lee, “Precision Architecture”, IEEE Computer, Vol. 22, No. 1, pp. 78-91, January 1989. Accordingly, an arbitrary 64-bit permutation could take 128 or 256 instructions on this type of conventional microprocessor. Pre-defined permutations with some regular patterns have been implemented in fewer instructions, for example, the permutations in DES, as described in Bruce Schneier, “Applied Cryptography”, 2nd Ed., John Wiley & Sons, Inc., 1996.
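  • For illustration, the conventional bit-by-bit approach can be sketched as follows. The sketch assumes that bit 0 is the least significant bit and that the desired permutation is given as a list dest_of, where dest_of[i] is the destination position of source bit i; the function name and this representation are hypothetical. Each loop iteration corresponds to the four operations named above (mask generation, AND, SHIFT, OR), which is why the count grows as 4n.

```python
def permute_bits_naive(word, dest_of, n=64):
    """Move bit i of `word` to position dest_of[i], one bit at a time."""
    result = 0
    for i in range(n):
        mask = 1 << i                          # mask generation
        bit = word & mask                      # AND: isolate the source bit
        shift = dest_of[i] - i
        if shift >= 0:                         # SHIFT: move it to its destination
            moved = bit << shift
        else:
            moved = bit >> -shift
        result |= moved                        # OR: combine with bits already moved
    return result


reverse8 = [7 - i for i in range(8)]           # hypothetical example: reverse 8 bits
reversed_value = permute_bits_naive(0b00001011, reverse8, n=8)
```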
  • Conventional techniques have also used table lookup methods to implement fixed permutations. To achieve a fixed permutation of n input bits with one table lookup, a table with 2^n entries is used, with each entry being n bits. For a 64-bit permutation, this type of table lookup would use 2^67 bytes (2^64 entries of 8 bytes each), which is clearly infeasible. Alternatively, the table can be broken up into smaller tables, and several table lookup operations could be used. For example, a 64-bit permutation could be implemented by permuting 8 consecutive bits at a time, then combining these 8 intermediate permutations into a final permutation. This method requires 8 tables, each with 256 entries, each entry being 64 bits. Each entry has zeros in all positions, except the 8 bit positions to which the selected 8 bits in the source are permuted. After the eight table lookups done by 8 LOAD instructions, the results are combined with 7 OR instructions to get the final permutation. In addition, 8 instructions are needed to extract the index for the LOAD instruction, for a total of 23 instructions. The memory requirement is 8*256*8 bytes = 16 kilobytes for the eight tables. Although 23 instructions is less than the 128 or 256 instructions used in the previous method, the actual execution time can be much longer due to cache miss penalties or memory access latencies. For example, if half of the 8 LOAD instructions miss in the cache, and each cache miss takes 50 cycles to fetch the missing cache line from main memory, the actual execution time is more than 4*50=200 cycles. Accordingly, this method can take longer than the previously described 128 cycles using EXTRACT and DEPOSIT. This method also has the drawback of a memory requirement of 16 kilobytes for the tables.
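  • The eight-table variant described above can be sketched as follows, purely as an illustration. The sketch assumes that bit 0 is the least significant bit, that table t is indexed by source byte t, and that the permutation is supplied as a list dest_of giving the destination of each source bit; the function names and the rotate-by-one example permutation are hypothetical.

```python
def build_tables(dest_of):
    """Build eight 256-entry tables for a fixed 64-bit permutation."""
    tables = []
    for t in range(8):                          # one table per source byte
        table = []
        for byte in range(256):
            entry = 0                           # zeros except the permuted bits
            for b in range(8):
                if (byte >> b) & 1:
                    entry |= 1 << dest_of[8 * t + b]
            table.append(entry)
        tables.append(table)
    return tables


def permute_by_lookup(word, tables):
    """Permute a 64-bit word with eight table lookups combined by OR."""
    result = 0
    for t in range(8):
        index = (word >> (8 * t)) & 0xFF        # extract the table index
        result |= tables[t][index]              # table lookup (LOAD), then OR
    return result


rotate_left_1 = [(i + 1) % 64 for i in range(64)]   # hypothetical example permutation
tables = build_tables(rotate_left_1)
table_bytes = sum(len(t) for t in tables) * 8        # 8 bytes per 64-bit entry
```

  • With these assumptions table_bytes evaluates to 8*256*8 = 16384, matching the 16 kilobyte figure given above.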
  • Permutations are a requirement for fast processing of digital multimedia information, using subword-parallel instructions, more commonly known as multimedia instructions, as described in Ruby Lee, “Accelerating Multimedia with Enhanced Micro-processors”, IEEE Micro, Vol. 15, No. 2, pp. 22-32, April 1995, and Ruby Lee, “Subword Parallelism in MAX-2”, IEEE Micro, Vol. 16, No. 4, pp. 51-59, August 1996. Microprocessor Instruction Set Architecture (ISA) uses these subword parallel instructions for fast multimedia information processing. With subwords packed into 64-bit words, it is often necessary to rearrange the subwords within the word. However, such subword permutation instructions are not provided by many of the conventional multimedia ISA extensions.
  • A few microprocessor architectures have subword rearrangement instructions. MIX and PERMUTE instructions have been implemented in the MAX-2 extension to Precision Architecture RISC (PA-RISC) processor, see Ruby Lee, “Subword Parallelism in MAX-2”, IEEE Micro, Vol. 16, No. 4, pp. 51-59, August 1996. The MAX-2 general-purpose PERMUTE instruction can do any permutation, with and without repetitions, of the subwords packed in a register. However, it is only defined for 16-bit subwords. MIX and MUX instructions have been implemented in the IA-64 architectures, which are extensions to the MIX and PERMUTE instructions of MAX-2, see Intel Corporation, “IA-64 Application Developers' Architecture Guide”, Intel Corporation, May 1999. The IA-64 uses the MUX instruction, which is a fully general permute instruction for 16-bit subwords, with five new permute byte variants. A VPERM instruction has been used in an AltiVec extension to the Power PC™ available from IBM Corporation, Armonk, N.Y., see Motorola Corporation, “‘AltiVec Extensions to PowerPC’ Instruction Set Architecture Specification”, Motorola Corporation, May 1998. The AltiVec VPERM instruction extends the general permutation capabilities of MAX-2's PERMUTE instruction to 8-bit subwords selected from two 128-bit source registers, into a single 128-bit destination register. Since there are 32 such subwords from which 16 are selected, this requires 16*lg 32 = 80 bits for specifying the desired permutation. This means that VPERM has to use another 128-bit register to hold the permutation control bits, making it a very expensive instruction with three source registers and one destination register, all 128 bits wide.
  • It is desirable to provide significantly faster and more economical ways to perform arbitrary permutations of n bits, without any need for table storage, which can be used for encrypting large amounts of data for confidentiality or privacy.
  • SUMMARY OF THE INVENTION
  • The present invention provides permutation instructions which can be used in software executed in a programmable processor for solving permutation problems in both cryptography and multimedia. For fast cryptography, bit-level permutations are used, whereas for multimedia, permutations on subwords of typically 8 bits or 16 bits are used. Permutation instructions of the present invention can be used to provide any arbitrary permutation of sixty-four 1-bit subwords in a 64-bit processor, i.e., a processor with 64-bit words, registers and datapaths, for use in fast cryptography. The permutation instructions of the present invention can also be used for permuting subwords greater than 1 bit in size, for use in fast multimedia processing. For example, in addition to being able to permute sixty-four 1-bit subwords in a register, the permutation instructions and underlying functional unit can permute thirty-two 2-bit subwords, sixteen 4-bit subwords, eight 8-bit subwords, four 16-bit subwords, or two 32-bit subwords. The permutation instructions of the present invention can be added as new instructions to the Instruction Set Architecture of a conventional microprocessor, or they can be used in the design of new processors or coprocessors to be efficient for both cryptography and multimedia software.
  • Permutations are performed by constructing a Benes interconnection network. This is done by executing a certain number of stages of the Benes network with permute instructions. The permute instructions are performed by a circuit comprising Benes network stages. Intermediate sequences of bits are defined into which an initial sequence of bits from a source register is transformed. Each intermediate sequence of bits is used as input to a subsequent permutation instruction. Permutation instructions are determined for permuting the initial source sequence of bits into one or more intermediate sequences of bits until a desired sequence is obtained. The intermediate sequences of bits are determined by configuration bits. The permutation instructions form a permutation instruction sequence. At most lg n permutation instructions are used in the permutation instruction sequence, where lg n denotes the base-2 logarithm of n.
  • In an embodiment of the present invention, multibit subwords are permuted by eliminating pass-throughs in the Benes network. In a further embodiment of the invention, the method and system are scaled for performing permutations of 2n bits in which subwords are packed into two or more registers. In this embodiment, at most 4 lg n + 2 instructions are used to permute 2n bits using n-bit words.
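  • As a worked illustration of the instruction counts quoted in this specification, the following sketch evaluates the bounds for 64-bit words. The function names are hypothetical, and rounding up when the number of stages per instruction does not divide 2 lg n evenly is an assumption of the sketch rather than a statement of the specification.

```python
def lg(n):
    """Base-2 logarithm of a power of two, as an exact integer."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


def permute_instruction_count(n, stages_per_instruction=2):
    """Upper bound for an n-bit permutation: 2 lg n stages, m stages per instruction."""
    m = stages_per_instruction
    return (2 * lg(n) + m - 1) // m            # ceiling division (assumption)


def double_width_instruction_count(n):
    """Upper bound for a 2n-bit permutation using n-bit words: 4 lg n + 2."""
    return 4 * lg(n) + 2


n = 64
counts = {
    "mask/AND/SHIFT/OR (4n)": 4 * n,
    "EXTRACT/DEPOSIT (2n)": 2 * n,
    "permute, n bits (lg n)": permute_instruction_count(n),
    "permute, 2n bits (4 lg n + 2)": double_width_instruction_count(n),
}
for method, count in counts.items():
    print(f"{method}: {count}")
```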
  • For a better understanding of the present invention, reference may be made to the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a schematic diagram of a system for implementing permutation instructions in accordance with an embodiment of the present invention.
  • FIG. 2 is a flow diagram of a method for determining a permutation instruction sequence to achieve a desired permutation in accordance with an embodiment of the present invention.
  • FIG. 3A is a schematic diagram of an 8-input Benes network.
  • FIG. 3B is a schematic diagram of an implementation of a CROSS instruction in accordance with an embodiment of the present invention.
  • FIG. 3C is a schematic diagram of a layout of a CROSS instruction in accordance with an embodiment of the present invention.
  • FIG. 4A is a flow diagram of a method for implementing a CROSS instruction sequence to do an arbitrary permutation.
  • FIG. 4B is a schematic diagram for obtaining configuration bits for an 8-input Benes network based on hierarchical partitioning into subnets.
  • FIG. 5 is a schematic diagram of a Benes network configured for a given permutation.
  • FIG. 6 is a flow diagram of a method for permutations of multi-bit subwords in accordance with an embodiment of the present invention.
  • FIG. 7A is a schematic diagram of a Benes network configured for a multi-bit permutation including pass through stages.
  • FIG. 7B is a schematic diagram of the Benes network of FIG. 7A after elimination of pass through stages.
  • FIG. 8A is a flow diagram of a method for 2n-bit permutations in accordance with an embodiment of the present invention.
  • FIG. 8B is a schematic diagram of an implementation of the method shown in FIG. 8A.
  • FIG. 9A is a schematic diagram of a circuit implementation of CROSS instructions for an individual node.
  • FIG. 9B is a schematic diagram of a circuit implementation of CROSS instructions for an 8-bit implementation.
  • FIG. 10A is a high-level schematic diagram of a circuit implementation for CROSS instructions in accordance with an embodiment of the present invention.
  • FIG. 10B is a high-level schematic diagram of a circuit implementation for CROSS instructions in accordance with an alternate embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a circuit implementation of an 8×8 crossbar for comparison with the circuit implementation of OMFLIP instructions.
  • FIG. 12A is a schematic diagram of a system for implementing permutation instructions in accordance with an alternate embodiment of the present invention.
  • FIG. 12B is a schematic diagram of a system for implementing permutation instructions in accordance with another alternate embodiment of the present invention.
  • DETAILED DESCRIPTION
  • Reference will now be made in greater detail to a preferred embodiment of the invention, an example of which is illustrated in the accompanying drawings. Wherever possible, the same reference numerals will be used throughout the drawings and the description to refer to the same or like parts.
  • FIG. 1 illustrates a schematic diagram of a system for implementing efficient permutation instructions 10 in accordance with the teachings of the present invention. Register file 12 includes source register 11 a, source register 11 b and destination register 11 c. System 10 can provide bit-level permutations of all n bits of any register in register file 12. The same solution can be applied to different subword sizes of 2^i bits, for i=0, 1, 2, . . . , m, where n=2^m bits. For a fixed word size of n bits, and 1-bit subwords, there are n subwords to be permuted. Source register values to be permuted 13 from source register 11 a and configuration bits 15 from source register 11 b are applied over datapaths to permutation functional unit 14. Source register values to be permuted 13 can be a sequence of bits or a sequence of subwords. Permutation functional unit 14 generates permutation result 16. Permutation result 16 can be an intermediate result if additional permutations are performed by permutation functional unit 14. For other instructions, arithmetic logic unit (ALU) 17 and shifter 18 receive source register values 13 from source register 11 a and source register values 15 from source register 11 b and generate a respective ALU result 20 and a shifter result 21 over a data path to destination register 11 c. System 10 can be implemented in any programmable processor, for example, a conventional microprocessor, digital signal processor (DSP), cryptographic processor or multimedia processor, and can be used in developing processors or coprocessors for providing cryptography and multimedia operations.
  • FIG. 2 is a flow diagram of a method of determining permutation instruction sequences for permutations 22. The determined permutation instruction sequences can be performed in permutation functional unit 14. In block 23, intermediate states are defined that an initial sequence of bits from a source register are to be transformed into. The final state is the desired permutation of the initial sequence of bits. In block 24, control configuration bits are defined for transforming the initial sequence into the first intermediate state and subsequent intermediate states until transformation into the final state.
  • A Benes network can be used to perform permutations of n bits with edge-disjoint paths using intermediate states. The Benes network can be formed by connecting two butterfly networks of the same size back-to-back. An example of an 8-input Benes network is shown in FIG. 3A.
  • An n-input Benes network can be broken into 2 lg n stages, lg n of which are distinct. The number of nodes in each stage is n. A node is defined as a point in the network where the path selection for an input takes place. In each stage of a butterfly network, for every input, there is another input that shares the same two outputs with it. Such pairs of inputs can be referred to as “conflict inputs” and their corresponding outputs can be referred to as “conflict outputs”. The distances between conflict pairs in one stage of the Benes network are the same. The distances between conflict pairs are different in different stages.
  • In the implementation of method 22 in a Benes network, a basic operation is defined as the operation done by one stage of a butterfly network. A basic operation is specified by a parameter m, where 2^m is the distance between conflict pairs for the corresponding stage. A basic operation uses n/2 configuration bits to set up the connections in the corresponding stage and move the n input bits to the output. Accordingly, for permuting the contents of an n-bit register, the n configuration bits for two basic operations can be packed into one configuration register, allowing two basic operations to be packed into a single instruction. Since an n-input Benes network has lg n distinct stages, there are lg n different basic operations. Bits from the source register are moved to the result register based on the configuration bits. In an embodiment of the present invention, if the configuration bit for a pair of conflict inputs is 0, the bits from the two conflict inputs go through non-crossing paths to the outputs. If the configuration bit for a pair of conflict inputs is 1, the bits from the two conflict inputs go through crossing paths to the outputs.
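  • To make the role of the parameter m concrete, the following sketch lists the conflict pairs of each basic operation for an 8-input network. It assumes that positions are numbered from 0 at the left and that the pairs are enumerated group by group; the function name conflict_pairs is hypothetical. Each basic operation yields n/2 pairs, one per configuration bit, and there are lg n distinct values of m.

```python
def conflict_pairs(n, m):
    """Positions paired by the basic operation with parameter m (distance 2^m)."""
    dist = 1 << m
    return [(s + i, s + i + dist)
            for s in range(0, n, 2 * dist)     # each group spans 2 * dist positions
            for i in range(dist)]


n = 8
for m in range(n.bit_length() - 1):            # lg n distinct basic operations
    pairs = conflict_pairs(n, m)
    print(m, len(pairs), pairs)                # n/2 pairs, one configuration bit each
```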
  • In a preferred embodiment of the invention, the instruction format for the permutation instruction can be defined as:
    CROSS,m1,m2 R1,R2,R3
  • wherein m1 and m2 are the parameters that specify the two basic operations to be used, R1 is a reference to a source register which contains the subwords to be permuted, R2 is a reference to a configuration register that holds the configuration bits for the two basic operations and R3 is a reference to a destination register where the permuted subwords are placed. R1, R2 and R3 refer to any registers Ri, Rj and Rk where i, j and k can be all different or two or more of i, j and k can be the same. Alternately, R3 can be omitted and the permuted subwords are placed in register R1. A CROSS instruction performs two basic operations on the source according to the contents of the configuration register and the values of m1 and m2. The first basic operation is determined by the value of m1. The first basic operation moves the bits in source register R1 to an intermediate result, based on the left half of the configuration bits held in configuration register R2. The second basic operation is determined by the value of m2. The second basic operation moves the bits in the intermediate result to the destination register R3, according to the right half of the configuration bits in register R2. Pseudo code for the CROSS instruction is shown in Table 1.
    TABLE 1
    CROSS,m1,m2 R1,R2,R3
        R3 = R1;
        j = 0;
        dist = 1 << m1;
        for (s = 0; s < n; s += (dist * 2))
            for (i = 0; i < dist; i++)
                if (R2[j++] == 1)
                    swap(R3[s+i], R3[s+i+dist]);
        dist = 1 << m2;
        for (s = 0; s < n; s += (dist * 2))
            for (i = 0; i < dist; i++)
                if (R2[j++] == 1)
                    swap(R3[s+i], R3[s+i+dist]);

    The CROSS instruction can be added to the Instruction Set Architecture of conventional microprocessors, digital signal processors (DSPs), cryptographic processors, multimedia processors, media processors and programmable Systems-on-a-Chip (SOCs), and can be used in developing processors or coprocessors for providing cryptography and multimedia operations. In particular, the CROSS instruction can permute sixty-four 1-bit subwords in a 64-bit processor for use in, for example, encryption and decryption processing using software. The CROSS instruction can also permute multi-bit subwords as described below, for example, thirty-two 2-bit subwords, sixteen 4-bit subwords, eight 8-bit subwords, four 16-bit subwords or two 32-bit subwords in a 64-bit processor, for use, for example, in multimedia processing.
  • FIG. 3B illustrates an example of operation of a CROSS instruction. The source sequence of bits consists of 8 bits: bit a, bit b, bit c, bit d, bit e, bit f, bit g and bit h. The CROSS instruction is CROSS,2,1 R1,R2,R3 wherein the source sequence of bits in register R1 is referred to by abcdefgh, the control bits of R2 are 10011010 and the destination sequence of bits received in register R3 is cbehgfad. Each of bit positions 30 a-30 h in source register R1 acts as an input node to this Benes network: node 30 a receives bit a, node 30 b receives bit b, node 30 c receives bit c, node 30 d receives bit d, node 30 e receives bit e, node 30 f receives bit f, node 30 g receives bit g and node 30 h receives bit h.
  • Each node 30 a-30 h has two outputs 31 a and 31 b. Outputs 31 a and 31 b for each of nodes 30 a-30 h are configured such that the distance between conflict pairs is 4 as specified by m1=2. Outputs 31 a and 31 b for each of nodes 30 a-30 h are each directed to one node in set of nodes 32 a-32 h. For example, output 31 a of node 30 a is directed to node 32 a and output 31 b of node 30 a is directed to node 32 e. Output 31 a of node 30 e is directed to node 32 a and output 31 b of node 30 e is directed to node 32 e. Accordingly, node 30 a and node 30 e are conflict inputs and respective nodes 32 a and 32 e receive conflict outputs. Similarly, node 30 b and node 30 f are conflict inputs and respective nodes 32 b and 32 f receive conflict outputs. Node 30 c and node 30 g are conflict inputs and respective nodes 32 c and 32 g receive conflict outputs. Node 30 d and 30 h are conflict inputs and respective nodes 32 d and 32 h receive conflict outputs.
  • The left half of the configuration bits of R2 is applied to the pairs of conflict outputs, each bit being represented in the first node of its pair of conflict outputs. Accordingly, configuration bit 34 a is applied to nodes 32 a and 32 e, configuration bit 34 b is applied to nodes 32 b and 32 f, configuration bit 34 c is applied to nodes 32 c and 32 g and configuration bit 34 d is applied to nodes 32 d and 32 h.
  • During the first basic operation, node 30 a and node 30 e have crossing paths to nodes 32 a and 32 e since the configuration bit 34 a is 1. Node 30 b and node 30 f have non-crossing paths to nodes 32 b and 32 f since configuration bit 34 b is 0. Node 30 c and node 30 g have non-crossing paths to nodes 32 c and 32 g since configuration bit 34 c is 0. Node 30 d and node 30 h have crossing paths to nodes 32 d and 32 h since configuration bit 34 d is 1. After the first basic operation, the intermediate sequence of bits is ebchafgd.
  • Each of nodes 32 a-32 h has two outputs 35 a and 35 b. Outputs 35 a and 35 b for each of nodes 32 a-32 h are configured such that the distance between conflict pairs is 2 as specified by m2=1. Outputs 35 a and 35 b are each directed to one node in set of nodes 36 a-36 h. For example, output 35 a of node 32 a is directed to node 36 a and output 35 b of node 32 a is directed to node 36 c. Similarly, output 35 a of node 32 c is directed to node 36 a and output 35 b of node 32 c is directed to node 36 c. Accordingly, node 32 a and node 32 c are conflict inputs and respective nodes 36 a and 36 c receive conflict outputs. Conflict outputs are also received at the respective pairs of nodes 36 b and 36 d, nodes 36 e and 36 g, and nodes 36 f and 36 h. The right half of the configuration bits of R2 is applied to the pairs of conflict outputs. Accordingly, configuration bit 34 e is applied to nodes 36 a and 36 c, configuration bit 34 f is applied to nodes 36 b and 36 d, configuration bit 34 g is applied to nodes 36 e and 36 g and configuration bit 34 h is applied to nodes 36 f and 36 h.
  • During the second basic operation, node 32 a and node 32 c have crossing paths to nodes 36 a and 36 c since configuration bit 34 e is 1. Node 32 b and node 32 d have non-crossing paths to nodes 36 b and 36 d since configuration bit 34 f is 0. Node 32 e and node 32 g have crossing paths to nodes 36 e and 36 g since configuration bit 34 g is 1. Node 32 f and node 32 h have non-crossing paths to node 36 f and node 36 h since configuration bit 34 h is 0. After the second basic operation, the result sequence of bits is cbehgfad.
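  • The FIG. 3B example can be reproduced with a short software model that follows the loop structure of Table 1. The model is a sketch only: it assumes that registers are represented as character strings with position 0 at the left, that a configuration bit of 1 swaps the two positions of a conflict pair, and that the swap is indexed by the inner loop counter i; the names basic_operation and cross are hypothetical.

```python
def basic_operation(bits, m, config):
    """One butterfly stage with conflict distance 2^m, driven by n/2 config bits."""
    out = list(bits)
    dist = 1 << m
    j = 0                                       # index of the next configuration bit
    for s in range(0, len(out), 2 * dist):
        for i in range(dist):
            if config[j] == "1":                # 1 = crossing paths
                out[s + i], out[s + i + dist] = out[s + i + dist], out[s + i]
            j += 1
    return "".join(out)


def cross(src, m1, m2, r2):
    """Software model of CROSS,m1,m2 R1,R2,R3; returns the contents of R3."""
    half = len(r2) // 2
    intermediate = basic_operation(src, m1, r2[:half])   # left half of R2
    return basic_operation(intermediate, m2, r2[half:])  # right half of R2


intermediate = basic_operation("abcdefgh", 2, "1001")
result = cross("abcdefgh", 2, 1, "10011010")
print(intermediate, result)
```

  • Under these assumptions the intermediate sequence is ebchafgd and the result is cbehgfad, in agreement with the sequences stated for FIG. 3B.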
  • FIG. 3C shows one embodiment of the encoding of the CROSS instruction 39 for use in a programmable processor. The instruction may also contain other fields. As will be understood by persons of ordinary skill in the art, the relative locations of the fields in an instruction are arbitrary and may be varied without violating the spirit of the invention.
  • A method for implementing CROSS instructions to do arbitrary permutations is shown in FIG. 4A. In block 51, a Benes network configuration is determined for the desired permutation. A Benes network can be configured as described in X. Yang, M. Vachharajani and R. B. Lee, “Fast Subword Permutation Instructions Based on Butterfly Networks”, Proceedings of SPIE, Media Processor 2000, pp. 80-86, January 2000, herein incorporated by reference. FIG. 4B illustrates the following steps for configuring a Benes network:
  • 1. “Inputs” and “outputs” refer to the inputs and outputs of the current Benes network. Starting from the first input that is not configured, referred to as “current input”, set the “end input” to be the conflict input of the “current input”. If all “inputs” have already been configured, go to Step 4.
  • 2a. Connect “current input” to the sub-network “sub1” that is on the same side as “current input”. Connect the output that has the same value as “current input”, to sub1 and call it “output (current input)”. Set “current output” to the conflict output of “output (current input)” and go to Step 3.
  • 2b. Connect “current input” to the sub-network “sub1” such that “sub1” is not “sub2”. Connect the output that has the same value as “current input”, to sub1 and call it “output (current input)”. Set “current output” to the conflict output of “output (current input)”.
  • 3. Connect “current output” to sub-network “sub2” such that “sub2” is not “sub1”. Also connect the input that has the same value as “current output”, call it “input (current output)”, to “sub2”. If “input (current output)” is the same as “end input”, go back to Step 1. Otherwise set “current input” to the conflict input of “input (current output)” and go to Step 2b.
  • 4. At this point, all the “inputs” and “outputs” have been connected to the two sub-networks. If the configuration of the two sub-networks is trivial, i.e. n=2, the configuration is done. Otherwise for each sub-network, treat it as a full Benes network and repeat the steps beginning at Step 1.
  • FIG. 4B illustrates the above steps for permutation (a- - - h) to (h- - - a- - ), where “-” means do-not-care. Starting from an unconfigured Benes network 150, the first input that is not configured is node 151, which contains value a. We mark node 151 as “current input” and its conflict input, node 152, as “end input”. We then connect node 151 to the subnet 156 that is on the same side as node 151. The output that has the value a is node 153; we mark it as “output (current input)”. We connect node 153 to subnet 156, which is the same subnet that node 151 is connected to. The conflict output of node 153 is node 154, which contains value h. We refer to node 154 as “current output”. Node 154 is connected to subnet 157, which is not subnet 156. The input that contains value h is node 155; we mark it as “input (current output)” and connect it to subnet 157 as well. Since node 155 is different from the “end input”, node 152, we set “current input” to the conflict input of node 155, which is node 158, and repeat the above steps. This process terminates when all the inputs and outputs of Benes network 150 are configured. Thereafter, for each of subnets 156 and 157, we treat it as a full Benes network and apply the whole process to it until the whole Benes network 150 is configured.
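The configuration steps above can be sketched as a short recursive routine. The following is an illustrative Python sketch under stated assumptions, not the implementation of the embodiment: the names `configure_benes`, `apply_benes` and `butterfly_stage` are hypothetical, the permutation is given as a complete list `dest` (no do-not-care positions), configuration bits are ordered by the lower node of each conflict pair, and in the trivial 2-input case the swap is placed in the second of the two middle stages (an arbitrary choice). With these assumptions the sketch yields, for the permutation (abcdefgh)→(fabcedhg), the same stage bits as the register values quoted for FIG. 5.

```python
def butterfly_stage(seq, distance, config):
    """Swap elements `distance` apart for every pair whose config bit is 1."""
    out = list(seq)
    lows = [i for i in range(len(out)) if (i // distance) % 2 == 0]
    for low, bit in zip(lows, config):
        if bit:
            out[low], out[low + distance] = out[low + distance], out[low]
    return out


def configure_benes(dest):
    """dest[i] = output position of input i. Returns 2*lg(n) stages of config bits."""
    n = len(dest)
    if n == 2:
        # Trivial network: first stage passes through, second stage swaps if needed.
        return [[0], [int(dest[0] == 1)]]
    h = n // 2                      # conflict partner of position x is x ^ h
    src = [0] * n                   # src[j] = input that must reach output j
    for i, d in enumerate(dest):
        src[d] = i
    subnet = [None] * n             # sub-network (0 or 1) assigned to each input
    for start in range(n):          # Step 1: first input not yet configured
        cur = start
        side = 0 if start < h else 1
        while subnet[cur] is None:
            subnet[cur] = side                  # Step 2: input and its output use sub1
            partner = src[dest[cur] ^ h]        # Step 3: input feeding the conflict output
            subnet[partner] = 1 - side          # ... is connected to sub2
            cur = partner ^ h                   # continue with its conflict input
    first = [subnet[p] for p in range(h)]       # cross when upper input goes to subnet 1
    last = [subnet[src[q]] for q in range(h)]   # cross when upper output comes from subnet 1
    sub_dest = [[0] * h, [0] * h]
    for i in range(n):
        sub_dest[subnet[i]][i % h] = dest[i] % h
    upper = configure_benes(sub_dest[0])        # Step 4: recurse on both sub-networks
    lower = configure_benes(sub_dest[1])
    middle = [u + l for u, l in zip(upper, lower)]
    return [first] + middle + [last]


def apply_benes(seq, stages):
    """Run `seq` through stages with pair distances n/2, ..., 1, 1, ..., n/2."""
    n = len(seq)
    lg = n.bit_length() - 1
    distances = [n >> (k + 1) for k in range(lg)]
    distances += distances[::-1]
    out = list(seq)
    for d, cfg in zip(distances, stages):
        out = butterfly_stage(out, d, cfg)
    return out


source, target = "abcdefgh", "fabcedhg"
dest = [target.index(ch) for ch in source]
stages = configure_benes(dest)
print("".join(apply_benes(source, stages)))  # fabcedhg
```

The inner loop walks each chain of conflict inputs and conflict outputs exactly once, so the routine assigns every input to a sub-network before it recurses.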
  • In block 53 of FIG. 4A, the configured Benes network is broken into pairs of stages. In block 54, a CROSS instruction is assigned for each pair of stages. The first CROSS instruction takes the original input. Thereafter, each CROSS instruction uses the output from the last CROSS instruction as input and produces input for the next CROSS instruction. The last CROSS instruction generates the final permutation. Accordingly, since there are 2 lg n stages in an n-input Benes network, all possible permutations can be performed for subwords in an n-bit register using lg n CROSS instructions.
  • For example, a Benes network configured for the permutation (abcdefgh)→(fabcedhg) is shown in FIG. 5. Configuration bits are determined for each node. These configuration bits are the contents of the configuration registers R2, R3 and R4. The configuration bits are read from left to right through nodes from left to right. The Benes network is broken into stages 55 a-55 c, by performing block 53. Performing block 54, the CROSS instruction CROSS,2,1 R1, R2, R1 is assigned to stage 55 a with the configuration bits of R2=01010001. CROSS instruction CROSS,0,0 R1, R3, R1 is assigned to stage 55 b with the configuration bits of R3=00001101. CROSS instruction CROSS,1,2 R1, R4, R1 is assigned to stage 55 c with the configuration bits of R4=00000010.
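The three-instruction sequence of this example can be traced with a small software emulation. The sketch below is hedged accordingly: the function names `butterfly_stage` and `cross` are hypothetical, the pair distance of a stage is assumed to be 2 to the power of m (consistent with m2=1 giving a distance of 2), and the left and right halves of the configuration register are assumed to map to conflict pairs in left-to-right order. Under these assumptions the quoted register values do produce the stated permutation.

```python
def butterfly_stage(bits, distance, config):
    """Swap each pair of elements `distance` apart whose configuration bit is 1."""
    out = list(bits)
    lows = [i for i in range(len(out)) if (i // distance) % 2 == 0]
    for low, bit in zip(lows, config):
        if bit:
            out[low], out[low + distance] = out[low + distance], out[low]
    return "".join(out)


def cross(m1, m2, source, config):
    """Emulate CROSS,m1,m2: two basic operations configured by the two halves of `config`."""
    cfg = [int(c) for c in config]
    half = len(cfg) // 2
    intermediate = butterfly_stage(source, 1 << m1, cfg[:half])   # left half of register
    return butterfly_stage(intermediate, 1 << m2, cfg[half:])     # right half of register


r1 = "abcdefgh"
program = [(2, 1, "01010001"),   # CROSS,2,1 R1, R2, R1
           (0, 0, "00001101"),   # CROSS,0,0 R1, R3, R1
           (1, 2, "00000010")]   # CROSS,1,2 R1, R4, R1
for m1, m2, cfg in program:
    r1 = cross(m1, m2, r1, cfg)
print(r1)  # fabcedhg
```

The intermediate values in this trace are afchedgb after the first instruction and fahcedbg after the second.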
  • A schematic diagram of a method for permuting multi-bit subwords 60 is shown in FIG. 6 in which each subword contains more than one bit. Multi-bit subwords can be represented as k-bit subword permutation. Block 61 is identical to block 51 in FIG. 4A. In block 62, a determination is made for eliminating pass-through stages. For many permutations, some stages of the Benes network can be configured as pass-throughs. This is true even for some permutations that are not subword permutations. Because the bypassing connections only serve to copy the inputs to the outputs, these stages can be removed before the assignment of the CROSS instructions. For example, if 2k stages are removed, there will be k fewer instructions. An example of an implementation of method 60 is shown in FIGS. 7A and 7B. FIG. 7A illustrates the configuration of an 8-input Benes network for a 2-bit permutation of (a1a2b1b2c1c2d1d2)→(c1c2b1b2d1d2a1a2) in which the middle 2 stages of the Benes network copy the input bits to their output without any change of order as determined from block 62. The middle stages are eliminated from the configured Benes network as shown in FIG. 7B. In block 63, the instructions are assigned to the remaining stages without affecting the result. In general, when permuting k-bit subwords in an n-bit word, the middle 2 lg k stages of the Benes network are configured as pass-throughs. Some other stages may be configured as pass-throughs and thus can be removed as well. Accordingly, when permuting k-bit subwords in an n-bit word, the maximum number of instructions needed becomes lg n − lg k = lg(n/k) = lg r, where r is the number of subwords in a word.
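The bound lg(n/k) can be checked with a short calculation. The helper below is only an illustrative sketch; the name `max_cross_instructions` is hypothetical and n and k are assumed to be powers of two.

```python
def max_cross_instructions(n, k=1):
    """Upper bound lg(n/k) on CROSS instructions for k-bit subwords in an n-bit word."""
    r = n // k                   # number of subwords in a word
    return r.bit_length() - 1    # lg r, valid when r is a power of two


for k in (1, 2, 4, 8, 16, 32):
    print(k, max_cross_instructions(64, k))
```

For a 64-bit word this gives 6, 5, 4, 3, 2 and 1 instructions for subword sizes of 1, 2, 4, 8, 16 and 32 bits.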
  • The CROSS instruction can be used to permute subwords packed into more than one register. If a register is n bits, two registers are 2n bits. The CROSS instructions can be used for 2n-bit permutations by using an instruction such as the SHIFT PAIR instruction in PA-RISC, as described in Ruby Lee, “Precision Architecture”, IEEE Computer, Vol. 22, No. 1, pp. 78-91, January 1989, and Ruby Lee, Michael Mahon, Dale Morris, “Pathlength Reduction Features in the PA-RISC Architecture”, Proceedings of IEEE Compcon, pp. 129-135, Feb. 24-28, 1992, San Francisco, Calif., hereby incorporated by reference into this application. The SHIFT PAIR instruction can process operands that cross word boundaries. This instruction concatenates two source registers to form a double-word value, then extracts any contiguous single-word value. FIGS. 8A and 8B illustrate an example of performing 2n-bit permutations using SHIFT PAIR and CROSS instructions. In this example, source registers R1 and R2 store the bits to be permuted and the results are put in destination register referred to by R3 or R4.
  • In block 70, the bits of the source registers R1 and R2 are divided into two groups using two CROSS instruction sequences. One CROSS instruction sequence is for R1 and one CROSS instruction sequence is for R2. For example, for R1, the bits going to register R3 are put into a left group and the bits going to R4 into the right group. In R2 the bits going to register R4 are put into a left group, and the bits going to register R3 are put into a right group. After performing block 70, register R1 is divided into left group 75 a and right group 75 b as shown in FIG. 8B. Register R2 is divided into left group 77 a and right group 77 b.
  • In block 71, using two SHIFT PAIR instructions, all bits going to register R3 are put into R3 and all bits going to register R4 are put into R4. After the implementation of block 71, register R3 includes the bits of left group 75 a and right group 77 b and register R4 includes the bits of right group 75 b and left group 77 a. In block 72, considering R3 and R4 as separate n-bit words, n-bit permutations are performed on register R3 and register R4. Each of R3 and R4 can use up to lg n instructions. In total, excluding the instructions needed for loading control bits, 4 lg n + 2 instructions are needed to do a 2n-bit permutation. Accordingly, with 64-bit registers, a 128-bit permutation can be performed with 26 instructions.
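The regrouping performed in block 71 can be pictured with a small sketch. The code below is an assumption-laden illustration, not PA-RISC semantics in detail: `shift_pair` simply models "concatenate two registers and extract a contiguous single-word value", registers are represented as strings, and `t` (the number of bits of R1 destined for R3) is a hypothetical parameter. The order of the two groups inside R3 and R4 is not important here, since block 72 permutes each register afterwards.

```python
def shift_pair(first, second, offset):
    """Concatenate two n-element registers and extract n contiguous elements."""
    n = len(first)
    return (first + second)[offset:offset + n]


def regroup(r1, r2, t):
    """r1 = [t bits for R3 | n-t bits for R4], r2 = [t bits for R4 | n-t bits for R3]."""
    r3 = shift_pair(r2, r1, t)   # right group of R2 followed by left group of R1
    r4 = shift_pair(r1, r2, t)   # right group of R1 followed by left group of R2
    return r3, r4


def double_word_instruction_count(n):
    """2 lg n (block 70) + 2 SHIFT PAIR (block 71) + 2 lg n (block 72)."""
    lg = n.bit_length() - 1
    return 4 * lg + 2


# Upper-case letters mark bits destined for R3, lower-case letters bits destined for R4.
print(regroup("AAAbbbbb", "cccDDDDD", 3))   # ('DDDDDAAA', 'bbbbbccc')
print(double_word_instruction_count(64))    # 26
```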
  • FIGS. 9A and 9B illustrate schematic diagrams of a circuit implementation for CROSS instruction corresponding to the high level diagram of 100 as shown in FIG. 10A, for an individual node 80 and 8-bit implementation 90. The CROSS instruction can be implemented by implementing at the circuit level a Benes network. An n-input Benes network has 2n lg n switch points. When executing a CROSS,m1,m2 R1, R2, R3 instruction, the control logic selects the two stages for the two basic operations based on the value of m1 and m2. Because the Benes network has two of each butterfly stage, stages can always be selected for all possible m1 and m2. The left and right half of R2 are used to configure the two stages selected and all the other stages are configured as pass-throughs. The source R1 is put through the configured network, and the result R3 is obtained. The method of the present invention can do arbitrary bit permutations of a 64-bit word with a maximum of lg 64 = 6 CROSS instructions. For 2-bit subwords, at most lg(64/2) = 5 instructions are needed and for 4-bit subwords, at most lg(64/4) = 4 instructions are needed.
  • FIG. 10A illustrates one embodiment of a high-level schematic diagram of a circuit implementation 100 for CROSS instructions for an 8 bit system. The circuit implementation implements the entire Benes network. When executing a CROSS instruction, the control logic selects the proper two stages for the two basic operations based on the parameters m1 and m2. Thereafter, the CROSS instruction configures the two selected stages according to the left half and right half of the configuration register R2. The stages that are not used are configured as pass-throughs. FIG. 10B illustrates another embodiment of a high-level schematic diagram of a circuit implementation 110 for CROSS instructions. The circuit implementation implements two identical stages. Each stage comprises all the connections of all the stages of a butterfly network. When executing a CROSS instruction, the control logic selects the proper two sets of connections for the two basic operations based on the parameters m1 and m2. Thereafter, the CROSS instruction configures the two selected sets of connections according to the left half and right half of the configuration register R2.
  • In another embodiment of the invention, two or more different butterfly stages are combined in one stage of the implementation.
  • FIG. 12A illustrates an alternate embodiment of the invention, in which a single permute instruction can perform more than two Benes stages. In system 100 register file 112 includes three read ports, 111 a, 111 b, 111 c. Two registers 111 b and 111 c can be used to send configuration bits 115 and 122 to permutation unit 114. Accordingly, system 100 allows four Benes stages to be performed in one permute instruction. This allows any arbitrary permutation of n bits to be performed in an instruction sequence of (2 lg n)/4, or (lg n)/2, instructions. As is understood by one of ordinary skill in the art, this can be extended to sending more configuration bits with each permute instruction, thus performing more Benes stages per instruction, and reducing the number of instructions in the instruction sequence needed for any arbitrary permutation of n bits. The minimum, a sequence of one instruction, is achieved by sending lg n registers of configuration bits together with the one register of n bits to be permuted in the permute instruction. Accordingly, this allows any arbitrary permutation of n bits to be performed in an instruction sequence of (2 lg n)/m instructions, where m is the number of network stages performed by one permutation instruction.
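The trade-off between stages per instruction and sequence length can be tabulated with a short sketch. The helper name `permute_sequence_length` is hypothetical, and two points are assumptions rather than statements of the embodiment: the result is rounded up when m does not divide 2 lg n, and each n-bit configuration register is taken to configure two stages of n/2 bits each.

```python
from math import ceil


def permute_sequence_length(n, m):
    """Return (instructions, configuration registers per instruction) for m stages per instruction."""
    lg = n.bit_length() - 1
    instructions = ceil(2 * lg / m)     # (2 lg n)/m, rounded up (assumption)
    config_registers = ceil(m / 2)      # one n-bit register configures two stages
    return instructions, config_registers


for m in (2, 4, 12):
    print(m, permute_sequence_length(64, m))
```

For a 64-bit word, m=2 corresponds to the CROSS instruction (6 instructions, 1 configuration register), m=4 to the embodiment of FIG. 12A (3 instructions, 2 configuration registers), and m=12 to a single instruction supplied with lg 64 = 6 configuration registers.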
  • FIG. 12B illustrates an alternate embodiment of the invention, in which the permutation result can be temporarily stored in permutation functional unit 214. In system 200, bits of intermediate permutation result 216 are stored in memory location 222 of permutation functional unit 214 after the generation of intermediate permutation result 216. In a subsequent execution of a permutation instruction, the source bits can be used from memory location 222 instead of being fetched from register file 212. During the subsequent execution, both of source registers 211 a and 211 b are used for configuration bits in a permutation instruction. Accordingly, the desired permutation can be performed in fewer instructions.
  • In an alternate embodiment using system 200, all of the n lg n configuration bits are stored in the memory 222, rather than read from the register 211 b (or from the registers 111 b and 111 d in FIG. 12A). The n-bit value 213 to be permuted is read from register 211 a and sent to the permutation functional unit 214. This embodiment is useful if the same n-bit permutation is repeated many times for different n-bit values. The sequence of permutation instructions needed to perform this n-bit permutation is reduced to one instruction.
  • In an alternate embodiment using system 200 of FIG. 12B, only (n−1) lg n configuration bits are stored in memory 222. This allows a small subset of n-bit permutations to be performed in one instruction, by reading n configuration bits 215 from register 211 b and sending this with the n-bit value 213 from register 211 a to permutation unit 214.
  • The CROSS instruction, in any of the above described embodiments, can be used by itself, rather than in a sequence of instructions. The CROSS instruction generates a subset of all possible permutations.
  • A permutation performed by a single CROSS instruction can be reversed by reversing the order of the stages used in the CROSS instruction with the configuration bits for each stage being the same as for the original permutation. For example, the permutation achieved by CROSS,2,1 R1, R2, R1, where R2=10000101 can be reversed by doing CROSS,1,2 R1, R3, R1, where R3=01011000.
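The reversal rule can be sketched as follows. The names `cross` and `invert_cross` are hypothetical, and the emulation assumes a pair distance of 2 to the power of m and left-to-right assignment of configuration bits to conflict pairs. Because each basic operation only swaps pairs, applying the two stages in reverse order with the same bits undoes the permutation.

```python
def butterfly_stage(bits, distance, config):
    """Swap each pair of elements `distance` apart whose configuration bit is 1."""
    out = list(bits)
    lows = [i for i in range(len(out)) if (i // distance) % 2 == 0]
    for low, bit in zip(lows, config):
        if bit:
            out[low], out[low + distance] = out[low + distance], out[low]
    return "".join(out)


def cross(m1, m2, source, config):
    """Emulate CROSS,m1,m2 with an 8-character configuration string."""
    cfg = [int(c) for c in config]
    half = len(cfg) // 2
    return butterfly_stage(butterfly_stage(source, 1 << m1, cfg[:half]), 1 << m2, cfg[half:])


def invert_cross(m1, m2, config):
    """Swap the stage order and the two halves of the configuration bits."""
    half = len(config) // 2
    return m2, m1, config[half:] + config[:half]


print(invert_cross(2, 1, "10000101"))   # (1, 2, '01011000')
permuted = cross(2, 1, "abcdefgh", "10000101")
print(cross(*invert_cross(2, 1, "10000101")[:2], permuted, "01011000"))  # abcdefgh
```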
  • Horizontal and vertical track counts and transistor counts have been calculated for a circuit implementation of CROSS instruction based on the Benes network of the present invention and are compared to a circuit implementation of a cross bar network for 8-bit and 64-bit permutations in Table 2. The numbers in Table 2 are computed as follows:
  • For the CROSS instruction implementation, the following relationships are used:
    Vertical Tracks $= 2n$
    Horizontal Tracks $= 2\lg n \times \frac{n}{2} + 2 \times (2n - 2) = n\lg n + 4n - 4$
    Transistors $= 2n\lg n \times 4 = 8n\lg n$
  • The 2n vertical tracks come from the 2 input lines in each node. The number of horizontal tracks is composed of two parts: n/2 configuration lines per stage for the 2 lg n stages, and the number of data tracks needed between adjacent stages, which is 2×(2n−2) in total. The 8n lg n transistors come from 4 transistors in each cell for 2n lg n cells.
  • For implementation of an 8-input crossbar network as shown in FIG. 11,
    Vertical Tracks $= n$
    Horizontal Tracks $= n \times (1 + \lg n) = n + n\lg n$
    Transistors $= n \times \left( n + \sum_{i=0}^{\lg n} \binom{\lg n}{i} (2\lg n + 2i) \right) = O(n^2 \lg n) > 3n^2 \lg n$
    The vertical tracks consist of the n input data lines. The horizontal tracks consist of the n output data lines and the lg n configuration lines for each output data line. The transistor count covers the AND gate and pass transistor at each cross point. An alternative implementation of the crossbar is to provide a negated signal for each control signal so that no inverters before the AND gates are needed. Then the horizontal track count becomes n + 2n lg n and the transistor count becomes n²(1 + 2 lg n). This implementation may yield a larger size due to more vertical tracks used.
  • From these equations, it is shown that when n is large, the CROSS instructions yield the smaller size. As shown in Table 2, the CROSS circuit implementation yields a much smaller transistor count and reasonable track counts for permutations of 64 bits. Accordingly, it yields a more area-efficient implementation. Control logic circuits for generating the configuration signals, which are more complex for the crossbar than for CROSS, were not counted.
    TABLE 2
                                        Vertical tracks  Horizontal tracks            Transistors
    8-bit permutations   Benes (cross)  16 (16 data)     52 (28 data, 24 control)     192
                         Crossbar       8 (8 data)       32 (8 data, 24 control)      640
    64-bit permutations  Benes (cross)  128 (128 data)   636 (252 data, 384 control)  3072
                         Crossbar       64 (64 data)     448 (64 data, 384 control)   >73728
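The entries of Table 2 follow directly from the track and transistor relationships given above. The sketch below evaluates them for n=8 and n=64; the function names `benes_counts` and `crossbar_counts` are hypothetical, n is assumed to be a power of two, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def benes_counts(n):
    """(vertical tracks, horizontal tracks, transistors) for the CROSS implementation."""
    lg = n.bit_length() - 1
    vertical = 2 * n
    horizontal = 2 * lg * (n // 2) + 2 * (2 * n - 2)   # control lines + data tracks
    transistors = 4 * (2 * n * lg)                     # 4 transistors per cell
    return vertical, horizontal, transistors


def crossbar_counts(n):
    """(vertical tracks, horizontal tracks, transistors) for the crossbar."""
    lg = n.bit_length() - 1
    vertical = n
    horizontal = n * (1 + lg)                          # data lines + control lines
    per_output = n + sum(comb(lg, i) * (2 * lg + 2 * i) for i in range(lg + 1))
    return vertical, horizontal, n * per_output


for n in (8, 64):
    print(n, benes_counts(n), crossbar_counts(n))
```

For n=64 the crossbar expression evaluates to 77824 transistors, which is consistent with the lower bound of 3n² lg n = 73728 listed in the table.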
  • Table 3 shows a comparison of the number of instructions needed for permutations of a 64-bit word with different subword sizes for method 10 using CROSS instructions and a method using conventional instruction set architectures (ISAs).
    TABLE 3
    Subword size  Num of subwords  Max (a) num of  Existing
    in bits       in register      CROSS           ISAs
    1             64               6               30 (b)
    2             32               5               30 (b)
    4             16               4               30 (b)
    8             8                3               1 (c)(d)
    16            4                2               1 (a)
    32            2                1               1 (a)

    (a) The maximum number here is lg n.
    (b) Instruction counts using table lookup methods; actual cycle counts will be larger due to cache misses.
    (c) Using subword permutation instructions.
    (d) Only VPERM in AltiVec is able to do this in one instruction.
  • It is to be understood that the above-described embodiments are illustrative of only a few of the many possible specific embodiments which can represent applications of the principles of the invention. Numerous and varied other arrangements can be readily devised in accordance with these principles by those skilled in the art without departing from the spirit and scope of the invention.

Claims (13)

1-65. (canceled)
66. A system of performing an arbitrary permutation of a source sequence of bits in a programmable processor comprising:
means for defining an intermediate sequence of bits that said source sequence of bits is transformed into using butterfly network stages and inverse butterfly network stages;
means for determining a permutation instruction for transforming said source sequence of bits into one or more intermediate sequence of bits until a desired sequence of bits is obtained,
wherein each intermediate sequence of bits is used as input to the subsequent permutation instruction and the determined permutation instructions form a permutation instruction sequence and configuration bits are used in said permutation instruction for determining movement of said source sequence of bits in said source register to said intermediate sequence of bits or movement of said intermediate sequence of bits into a destination register or a source register; and
means for storing said configuration bits and means for retrieving said stored configuration bits for use in said permutation instruction.
67. A system of performing an arbitrary permutation of a source sequence of bits in a programmable processor comprising:
means for defining an intermediate sequence of bits that said source sequence of bits is transformed into using Benes network stages and inverse butterfly network stages;
means for determining a permutation instruction for transforming said source sequence of bits into one or more intermediate sequence of bits until a desired sequence of bits is obtained,
wherein each intermediate sequence of bits is used as input to the subsequent permutation instruction and the determined permutation instructions form a permutation instruction sequence and configuration bits are used in said permutation instruction for determining movement of said source sequence of bits in said source register to said intermediate sequence of bits or movement of said intermediate sequence of bits into a destination register or a source register; and
means for storing said configuration bits and means for retrieving said stored configuration bits for use in said permutation instruction.
68. A method of performing an arbitrary permutation of a source sequence of bits in a programmable processor comprising the steps of:
a. defining an intermediate sequence of bits that said source sequence of bits is transformed into using one or more network stages selected from the group consisting of Benes network stages, butterfly network stages, and inverse network stages; and
b. determining one or more permutation instructions for transforming said source sequence of bits into said intermediate sequence of bits, wherein configuration bits are used in said one or more permutation instructions for determining movement of said source sequence of bits in a source register to said intermediate sequence of bits or movement of said intermediate sequence of bits into a destination register or a second intermediate sequence of bits.
69. The method of claim 68 further comprising the steps of:
repeating steps a. and b. using said determined intermediate sequence of bits from step b. as said source sequence of bits in step a. until a desired sequence of bits is obtained, the determined permutation instructions form a permutation instruction sequence.
70. The method of claim 69 wherein said one or more permutation instructions can perform more than two said Benes stages.
71. The method of claim 68 further comprising the steps of:
c. storing said configuration bits; and
d. retrieving said stored configuration bits.
72. The method of claim 71 further comprising the steps of:
determining a subsequent permutation instruction using said retrieved configuration of bits.
73. The method of claim 70 further comprising the steps of:
d. storing a portion of said configuration bits; and
e. retrieving said stored portions of said configuration bits.
74. The method of claim 73 further comprising the steps of:
determining a subsequent permutation instruction using said retrieved configuration portion of said configuration bits.
75. The method of claim 69 wherein said method performs lg(n) of said network stages in one instruction.
76. The method of claim 69 wherein said method performs 2 lg(n) network stages in one instruction.
77. The method of claim 69 wherein said configuration bits are obtained from a register file.
US11/180,269 2000-05-05 2005-07-13 Method and system for performing permutations using permutation instructions based on butterfly networks Abandoned US20060039555A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/180,269 US20060039555A1 (en) 2000-05-05 2005-07-13 Method and system for performing permutations using permutation instructions based on butterfly networks

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US20224500P 2000-05-05 2000-05-05
US09/850,237 US6922472B2 (en) 2000-05-05 2001-05-07 Method and system for performing permutations using permutation instructions based on butterfly networks
US11/180,269 US20060039555A1 (en) 2000-05-05 2005-07-13 Method and system for performing permutations using permutation instructions based on butterfly networks

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/850,237 Continuation US6922472B2 (en) 2000-05-05 2001-05-07 Method and system for performing permutations using permutation instructions based on butterfly networks

Publications (1)

Publication Number Publication Date
US20060039555A1 true US20060039555A1 (en) 2006-02-23

Family

ID=26897495

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/850,237 Expired - Lifetime US6922472B2 (en) 2000-05-05 2001-05-07 Method and system for performing permutations using permutation instructions based on butterfly networks
US11/180,269 Abandoned US20060039555A1 (en) 2000-05-05 2005-07-13 Method and system for performing permutations using permutation instructions based on butterfly networks

Country Status (1)

Country Link
US (2) US6922472B2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070022353A1 (en) * 2005-07-07 2007-01-25 Yan-Xiu Zheng Utilizing variable-length inputs in an inter-sequence permutation turbo code system
US20090168801A1 (en) * 2006-04-28 2009-07-02 National Chiao Tung University Butterfly network for permutation or de-permutation utilized by channel algorithm
US20090187746A1 (en) * 2008-01-22 2009-07-23 Arm Limited Apparatus and method for performing permutation operations on data
US20090217133A1 (en) * 2005-07-07 2009-08-27 Industrial Technology Research Institute (Itri) Inter-sequence permutation turbo code system and operation methods thereof
US20100198177A1 (en) * 2009-02-02 2010-08-05 Kimberly-Clark Worldwide, Inc. Absorbent articles containing a multifunctional gel
US8677123B1 (en) * 2005-05-26 2014-03-18 Trustwave Holdings, Inc. Method for accelerating security and management operations on data segments

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0983557B1 (en) * 1998-03-18 2019-10-02 Koninklijke Philips N.V. Data processing device for executing in parallel additions and subtractions on packed data
US6996234B2 (en) * 2001-02-02 2006-02-07 Asier Technology Corporation Data decryption methodology
US20040086114A1 (en) * 2002-11-06 2004-05-06 Sun Microsystems, Inc. System and method for implementing DES permutation functions
WO2004045135A1 (en) * 2002-11-06 2004-05-27 Sun Microsystems, Inc. System and method for implementing des encryption
US20040086116A1 (en) * 2002-11-06 2004-05-06 Sun Microsystems, Inc. System and method for implementing DES round functions
US7424597B2 (en) * 2003-03-31 2008-09-09 Hewlett-Packard Development Company, L.P. Variable reordering (Mux) instructions for parallel table lookups from registers
US7730292B2 (en) * 2003-03-31 2010-06-01 Hewlett-Packard Development Company, L.P. Parallel subword instructions for directing results to selected subword locations of data processor result register
US7925891B2 (en) * 2003-04-18 2011-04-12 Via Technologies, Inc. Apparatus and method for employing cryptographic functions to generate a message digest
US7320063B1 (en) * 2005-02-04 2008-01-15 Sun Microsystems, Inc. Synchronization primitives for flexible scheduling of functional unit operations
US7711955B1 (en) 2004-09-13 2010-05-04 Oracle America, Inc. Apparatus and method for cryptographic key expansion
US7620821B1 (en) * 2004-09-13 2009-11-17 Sun Microsystems, Inc. Processor including general-purpose and cryptographic functionality in which cryptographic operations are visible to user-specified software
US8285766B2 (en) * 2007-05-23 2012-10-09 The Trustees Of Princeton University Microprocessor shifter circuits utilizing butterfly and inverse butterfly routing circuits, and control circuits therefor
RU2446445C1 (en) * 2010-10-29 2012-03-27 Государственное образовательное учреждение высшего профессионального образования "Саратовский государственный университет им. Н.Г. Чернышевского" Apparatus for generating reverse transpositions of information stored on computer
US9432180B2 (en) * 2011-06-03 2016-08-30 Harris Corporation Method and system for a programmable parallel computation and data manipulation accelerator
WO2013006030A1 (en) 2011-07-06 2013-01-10 Mimos Berhad Apparatus and method for performing parallel bits distribution with bi-delta network
RU2488161C1 (en) * 2011-11-14 2013-07-20 Федеральное Государственное Бюджетное Образовательное Учреждение Высшего Профессионального Образования "Саратовский Государственный Университет Имени Н.Г. Чернышевского" Device for swapping and shifting of data bits in microprocessors
US9342479B2 (en) * 2012-08-23 2016-05-17 Qualcomm Incorporated Systems and methods of data extraction in a vector processor
US9606803B2 (en) * 2013-07-15 2017-03-28 Texas Instruments Incorporated Highly integrated scalable, flexible DSP megamodule architecture
EP3001306A1 (en) * 2014-09-25 2016-03-30 Intel Corporation Bit group interleave processors, methods, systems, and instructions
EP3001307B1 (en) * 2014-09-25 2019-11-13 Intel Corporation Bit shuffle processors, methods, systems, and instructions
US10459723B2 (en) * 2015-07-20 2019-10-29 Qualcomm Incorporated SIMD instructions for multi-stage cube networks
US11269636B2 (en) * 2019-05-27 2022-03-08 Texas Instruments Incorporated Look-up table write
US11334356B2 (en) * 2019-06-29 2022-05-17 Intel Corporation Apparatuses, methods, and systems for a user defined formatting instruction to configure multicast Benes network circuitry

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3796830A (en) * 1971-11-02 1974-03-12 Ibm Recirculating block cipher cryptographic system
US3962539A (en) * 1975-02-24 1976-06-08 International Business Machines Corporation Product block cipher system for data security
US4275265A (en) * 1978-10-02 1981-06-23 Wisconsin Alumni Research Foundation Complete substitution permutation enciphering and deciphering circuit
US4972481A (en) * 1986-06-09 1990-11-20 Datakonsult I Malmo Ab Ciphering and deciphering device
US5001753A (en) * 1987-03-06 1991-03-19 U.S. Philips Corporation Crytographic system and process and its application
US5483541A (en) * 1993-09-13 1996-01-09 Trw Inc. Permuted interleaver
US5495476A (en) * 1995-01-26 1996-02-27 International Business Machines Corporation Parallel algorithm to set up benes switch; trading bandwidth for set up time
US5546393A (en) * 1994-06-20 1996-08-13 M E T Asynchronous transfer mode data cell routing device for a reverse omega network
US5673321A (en) * 1995-06-29 1997-09-30 Hewlett-Packard Company Efficient selection and mixing of multiple sub-word items packed into two or more computer words
US5768493A (en) * 1994-11-08 1998-06-16 International Businees Machines Corporation Algorithm for fault tolerant routing in benes networks
US5940389A (en) * 1997-05-12 1999-08-17 Computer And Communication Research Laboratories Enhanced partially self-routing algorithm for controller Benes networks
US5956405A (en) * 1997-01-17 1999-09-21 Microsoft Corporation Implementation efficient encryption and message authentication
US6081896A (en) * 1997-09-02 2000-06-27 Motorola, Inc. Cryptographic processing system with programmable function units and method
US6119224A (en) * 1998-06-25 2000-09-12 International Business Machines Corporation Fast shift amount decode for VMX shift and vperm instructions
US6195026B1 (en) * 1998-09-14 2001-02-27 Intel Corporation MMX optimized data packing methodology for zero run length and variable length entropy encoding
US6275587B1 (en) * 1998-06-30 2001-08-14 Adobe Systems Incorporated Secure data encoder and decoder
US6381690B1 (en) * 1995-08-01 2002-04-30 Hewlett-Packard Company Processor for performing subword permutations and combinations
US6446198B1 (en) * 1999-09-30 2002-09-03 Apple Computer, Inc. Vectorized table lookup
US6629115B1 (en) * 1999-10-01 2003-09-30 Hitachi, Ltd. Method and apparatus for manipulating vectored data
US6970433B1 (en) * 1996-04-29 2005-11-29 Tellabs Operations, Inc. Multichannel ring and star networks with limited channel conversion

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8677123B1 (en) * 2005-05-26 2014-03-18 Trustwave Holdings, Inc. Method for accelerating security and management operations on data segments
US20070022353A1 (en) * 2005-07-07 2007-01-25 Yan-Xiu Zheng Utilizing variable-length inputs in an inter-sequence permutation turbo code system
US20090217133A1 (en) * 2005-07-07 2009-08-27 Industrial Technology Research Institute (Itri) Inter-sequence permutation turbo code system and operation methods thereof
US7797615B2 (en) 2005-07-07 2010-09-14 Acer Incorporated Utilizing variable-length inputs in an inter-sequence permutation turbo code system
US8769371B2 (en) 2005-07-07 2014-07-01 Industrial Technology Research Institute Inter-sequence permutation turbo code system and operation methods thereof
US20090168801A1 (en) * 2006-04-28 2009-07-02 National Chiao Tung University Butterfly network for permutation or de-permutation utilized by channel algorithm
US7856579B2 (en) 2006-04-28 2010-12-21 Industrial Technology Research Institute Network for permutation or de-permutation utilized by channel coding algorithm
US20090187746A1 (en) * 2008-01-22 2009-07-23 Arm Limited Apparatus and method for performing permutation operations on data
US8423752B2 (en) * 2008-01-22 2013-04-16 Arm Limited Apparatus and method for performing permutation operations in which the ordering of one of a first group and a second group of data elements is preserved and the ordering of the other group of data elements is changed
US20100198177A1 (en) * 2009-02-02 2010-08-05 Kimberly-Clark Worldwide, Inc. Absorbent articles containing a multifunctional gel

Also Published As

Publication number Publication date
US20020031220A1 (en) 2002-03-14
US6922472B2 (en) 2005-07-26

Similar Documents

Publication Publication Date Title
US6922472B2 (en) Method and system for performing permutations using permutation instructions based on butterfly networks
US6952478B2 (en) Method and system for performing permutations using permutation instructions based on modified omega and flip stages
US7519795B2 (en) Method and system for performing permutations with bit permutation instructions
Lee et al. Efficient permutation instructions for fast software cryptography
ES2805125T3 (en) Flexible architecture and instructions for Advanced Encryption Standard (AES)
US9134953B2 (en) Microprocessor Shifter Circuits Utilizing Butterfly and Inverse Butterfly Routing Circuits, and Control Circuits Therefor
US6862354B1 (en) Stream cipher encryption method and apparatus that can efficiently seek to arbitrary locations in a key stream
US8189792B2 (en) Method and apparatus for performing cryptographic operations
US8913740B2 (en) Method and apparatus for generating an Advanced Encryption Standard (AES) key schedule
Yang et al. Fast subword permutation instructions using omega and flip network stages
US20040120518A1 (en) Matrix multiplication for cryptographic processing
US8943297B2 (en) Parallel read functional unit for microprocessors
Shi Bit permutation instructions: Architecture, implementation, and cryptographic properties
Shi et al. Arbitrary bit permutations in one or two cycles
Jindal et al. Modified RC4 variants and their performance analysis
US20100329450A1 (en) Instructions for performing data encryption standard (des) computations using general-purpose registers
US11943332B2 (en) Low depth AES SBox architecture for area-constraint hardware
Hilewitz Advanced bit manipulation instructions: architecture, implementation and applications
Hilewitz et al. Advanced bit manipulation instruction set architecture
Lee et al. Cryptography efficient permutation instructions for fast software
Kolay et al. PERMS: A bit permutation instruction for accelerating software cryptography
Kivilinna Block ciphers: fast implementations on x86-64 architecture
Gaubatz et al. Leveraging the multiprocessing capabilities of modern network processors for cryptographic acceleration
Smyth et al. Reconfigurable cryptographic RISC microprocessor
Pirpilidis et al. A 4-bit Architecture of SEED Block Cipher for IoT Applications

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION