WO2003014948A1

WO2003014948A1 - System architecture of a high bit rate switch module between functional units in a system on a chip

Info

Publication number: WO2003014948A1
Application number: PCT/IL2002/000629
Authority: WO
Inventors: Yehiel Engel; David Ivancovsky
Original assignee: Broadlight Ltd.
Priority date: 2001-08-07
Filing date: 2002-07-31
Publication date: 2003-02-20
Also published as: IL144789A0

Abstract

The invention relates to a switch for a high bit rate transaction of data between master and slave devices in a system-on-a-chip (SOC) including a set of busses allowing maximum parallelism in said switch, a crossbar unit used to transfer data bursts from one of said devices to another of said devices, and connections to at least two said master devices which can originate read and write transactions of said data with any other master or slave and at least two said slave devices which can respond to read and write transactions of said data with any other master or slave. The switch transactions may be partial transfers or burst transfers. The switch crossbar may further comprise a byte enable mechanism for handling gaps in said write transactions of said data. The invention also describes a method for a high data rate transaction.

Description

SYSTEM ARCHITECTURE OF A HIGH BIT RATE SWITCH MODULE BETWEEN FUNCTIONAL UNITS IN A SYSTEM ON A CHIP

FIELD OF THE INVENTION

The present invention generally relates to systems on a chip. More specifically the present invention relates to an improved rate of transferring data using a switch module on an integrated system on a chip.

BACKGROUND OF THE INVENTION

Early telecommunication systems consisted of a general-purpose central processing unit (CPU) connected to multiple interface cards via a shared bus. All data arriving at the system through an interface card traversed the shared bus to the processor, which processed the data and made forwarding decisions. The data was then sent over the shared bus a second time to a destination interface card for transmission. As the demand for speed increased, the shared bus architecture eventually became untenable. New systems began to emerge with a switch architecture that replaced the shared bus. Switching architecture eliminated the interconnect bottleneck, because data could be moved at speeds many orders of magnitude faster than CPUs.

As the demand for speed continued to increase, a new technology was introduced: the System On Chip (SOC), which integrates a CPU and its peripherals on one chip. As with the former systems, the first SOC architecture was based on a CPU connected to the peripherals via a shared bus.

Thus, it would be desirable to provide more speed by implementing an integrated switch architecture to replace the bus.

SUMMARY OF THE INVENTION

Accordingly, it is a principal object of the present invention to overcome the limitations of existing systems on a chip (SOC), and to provide an integrated SOC that allows increased speed of reading by a master device from a slave device.

It is a further object of the present invention to provide optimization of the arbitration phase for transactions.

The present invention relates to a switch module that is used to interconnect the various functional units on a chip. For purposes of the present invention, there are 2 types of units:

A master is a unit that controls another unit, and which can originate read and write transactions. A master has the material to be written, so there is no wait for a write; and:

A slave is a unit that is controlled by another unit, and which can respond to read and write transactions. When the master reads from a slave, the slave may be occupied, so there is a time dependency. The present invention breaks the dependency, because the master can proceed with another transaction while waiting for the slave from which reading is to take place to be free for the read.

Each master can communicate with each slave in the system. Each transaction is independent, and the switch enables simultaneous transactions.

The switch module is based on a crossbar unit used to transfer data bursts from one unit, i.e., master or slave, to another. The switch is implemented as a set of busses allowing maximum parallelism in the device. The crossbar implementation performs data transfers without any sampling of the data within a particular cycle. It supports both partial transfers and burst transfers. The crossbar mechanism is based on simple transactions with only a few control signals.

The present invention describes a switch for a high bit rate transaction of data between master and slave devices in a system-on-a-chip (SOC) including a set of busses allowing maximum parallelism in said switch, a crossbar unit used to transfer data bursts from one of said devices to another of said devices, and connections to at least two said master devices which can originate read and write transactions of said data with any other master or slave and at least two said slave devices which can respond to read and write transactions of said data with any other master or slave. The switch transactions may be partial transfers or burst transfers. The switch crossbar further comprises a byte enable mechanism for handling gaps in said write transactions of said data.

The present invention describes a method for a high data rate transaction, wherein each said transaction is selected from the group consisting of write, read request and read response, and said transaction being between at least two master devices and at least two slave devices connected to the crossbar of a switch in a system-on-a-chip (SOC) comprising: asserting a request by an initiator among one of said at least two master devices towards said crossbar for a target device among said at least two master devices and at least two slave devices in an arbitration phase of at least two cycles; a first transferring of an address and command from said initiator to said crossbar in an address phase; performing an arbitration by said crossbar by deciding which of said at least two masters next communicates with said target; and a second transferring of said data in burst mode from said initiator to said target with zero wait states, such that each transaction is independent, and when said master performs simultaneous transactions, each said transaction occurs in a single cycle.

The method further comprises defining the maximum latency for said initiator and said target. The method may be such that said first transferring comprises only the command in a read response transaction, and may further comprise initiating a read by one of said master devices from one of said slave devices, and if said slave is occupied, proceeding by said master to another transaction.

Other features and advantages of the invention will become apparent from the following drawings and description.

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of the invention with regard to the embodiments thereof, reference is made to the accompanying drawings, in which like numerals designate corresponding elements or sections throughout, and in which:

Fig. 1 is a schematic block diagram illustrating the switch module used to interconnect various modules, in accordance with an exemplary embodiment of the present invention;

Fig. 2 is a write timing diagram for a write of 12 words in a single burst, in accordance with an exemplary embodiment of the present invention;

Fig. 3 is a write timing diagram for 2 writes, a write of 3 words followed by a write of 1 word, in accordance with an exemplary embodiment of the present invention;

Fig. 4 is a timing diagram for a write of 3 words followed by a read request, in accordance with an exemplary embodiment of the present invention;

Fig. 5 is a timing diagram for a Read response transaction with 12 words of data, in accordance with an exemplary embodiment of the present invention;

Fig. 6 is a schematic block diagram of the internal structure of the switch of Fig. 1, in accordance with an exemplary embodiment of the present invention;

Fig. 7 is a schematic illustration of a time window with 8 slices, according to an exemplary embodiment of the present invention; and

Fig. 8 is a block diagram illustrating the arbiter state machine, according to an exemplary embodiment of the present invention.

It will be appreciated that the embodiments described as follows are cited by way of example, and that the present invention is not limited to what is particularly shown and described. Rather, the scope of the present invention, as defined by appended claims, includes both combinations and sub-combinations of the various features described, as well as variations and modifications thereof, which would occur to persons skilled in the art upon reading the descriptions, and which are not disclosed in the prior art.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

Fig. 1 is a schematic block diagram illustrating the switch module 110 used to interconnect various modules in a system-on-a-chip, in accordance with an exemplary embodiment of the present invention. The switch module 110 is used to interconnect the various functional units in a communication system-on-a-chip 100. The crossbar unit of switch module 110 is used to transfer data bursts from one unit (master or slave) to another. Switch 110 is implemented as a set of busses allowing maximum parallelism in the device. Each master can communicate with each slave in the system.

The crossbar unit performs data transfers without any sampling of the data, i.e., in one cycle. It supports both partial transfers of 1-3 bytes and burst transfers of up to 16 words of 32 bits per word, for a maximum of 64 bytes. The crossbar includes a byte enable (BE) mechanism for gap handling in write mode (not used in read). The crossbar mechanism is based on simple, steady-state transactions and only a few control signals, to simplify the implementation and reduce the overhead cycles.

Each transaction is composed of 2 or 3 phases:

an arbitration phase, which takes a minimum of 2 cycles but can take more, since the target device may be busy (e.g., in the middle of a transaction from another master); in this phase the master asserts a request to the crossbar; a valid address and bus command are also mandatory in this phase; the crossbar performs the arbitration, when possible, and decides which contending master will communicate with the target;

an address phase, in which an address and a command, or only a command in a read response transaction, are transferred; this phase takes one cycle; and

a data phase, in which data is transferred in burst mode from the transaction initiator to the target device with zero wait states; the data phase is mandatory in write transactions and in a non-empty read response; in a read request and an empty read response no data phase occurs.

There are 3 transaction types:

a write transaction, in which the master asserts a request plus a valid address and a command to the crossbar; after receiving the arbitration via an acknowledge (ACK) signal, an address phase occurs wherein request (REQ) and acknowledge (ACK) are both asserted towards the initiator and the target; then a data burst takes place with zero wait states between consecutive data phases;

a read request transaction, which is the first part of a split read transaction; in this part, which includes only the arbitration and address phases, the master initiates a request with a valid address and command; the target samples the address and command, and then ACK's the master, thereby completing the first part of a read transaction; the target then needs to initiate a read response transaction towards the initiator later on, when the data is valid in its internal buffers; all read transactions are split transactions; and

a read response transaction, which is initiated when the target has the required data; in the address phase no address is transferred, and instead of an address the command is sent on the 16 least significant bits (LSB's); the target of the transaction is defined using the ID field of the command, which is sent by the initiator in the read request transaction and returned by the target; no wait state is allowed between consecutive data phases.

The protocol implementation is based on an arbiter for each device (master or slave).

The arbitration phase uses a unique algorithm that enables the definition of the maximum latency for each unit. It is based on two types of mechanisms. The first type of arbitration is a time division multiple access (TDMA) approach that is used to define time priorities. It is based on a window with N slices. The content of each slice is allocated to a specific master when the switch is configured. A counter is used to select the current slice in a cyclic manner. The counter is incremented for each arbitration phase that occurs for its owner. When a new request occurs, the window content number is compared with the request and ID of all requesting agents. If a match is found, then this agent is selected. If the agent that has priority for this window is not requesting, then the second mechanism, which is simple round robin arbitration, is used. The TDMA approach allows definition of the amount of bandwidth (BW) allocated for each requestor and the period between 2 consecutive requests (maximum latency).

Transaction duration: suppose N corresponds to the length, in words, of a burst of data being transmitted. Assuming only the minimum time of two cycles for the arbitration phase, we need 2+1+N = N+3 cycles for a write transaction. For a read transaction, at least 3 cycles are needed for the read request transaction and 2+1+N cycles for the read response transaction, or N+6 for a total read transaction. For a 64-byte transaction (N = 16) the switch needs 19/22 cycles for the write/read respectively.
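The cycle counts above can be checked with a short Python sketch. The function and constant names are illustrative only and do not appear in the specification; the sketch assumes the minimum 2-cycle arbitration phase and the 1-cycle address phase stated in the text.

```python
ARBITRATION_CYCLES = 2  # minimum arbitration phase, as stated in the text
ADDRESS_CYCLES = 1      # address phase


def write_cycles(n_words: int) -> int:
    """Cycles for a write: arbitration + address + N data cycles."""
    return ARBITRATION_CYCLES + ADDRESS_CYCLES + n_words


def read_cycles(n_words: int) -> int:
    """Cycles for a split read: read request (no data) plus read response."""
    read_request = ARBITRATION_CYCLES + ADDRESS_CYCLES
    read_response = ARBITRATION_CYCLES + ADDRESS_CYCLES + n_words
    return read_request + read_response


if __name__ == "__main__":
    n = 64 // 4  # a 64-byte transaction on a 32-bit (4-byte) bus is 16 words
    print(write_cycles(n), read_cycles(n))  # 19 22
```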

Since each target is accessed independently of the others, the targets may achieve maximum parallelism. Also, the write and read bursts may operate for the same target simultaneously. This means that for 125 MHz and 4 slaves active at the same time, approximately 125 MHz * 4 * 6.4 bytes/cycle = 3.2 Gbyte/second is achievable (in 20 cycles we transfer 128 bytes, or 6.4 bytes/cycle, with a 32-bit data bus). This rate can be multiplied by using a 64-bit data bus. In the communication chip 100, for example, there are several master units that can initiate crossbar read and write transactions, and several slave units that receive crossbar request transactions and initiate read response transactions.
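The aggregate rate quoted above follows from simple arithmetic, sketched here in Python. The function name and parameter names are hypothetical; the figures (125 MHz, 4 slaves, 128 bytes in 20 cycles) are taken from the text.

```python
def aggregate_rate_bytes_per_s(clock_hz: int, active_slaves: int,
                               bytes_moved: int, cycles: int) -> float:
    """Aggregate throughput when every active slave moves bytes_moved
    bytes in the given number of cycles, all in parallel."""
    return clock_hz * active_slaves * bytes_moved / cycles


if __name__ == "__main__":
    rate = aggregate_rate_bytes_per_s(125_000_000, 4, 128, 20)
    print(rate / 1e9, "Gbyte/second")  # 3.2 Gbyte/second
```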

Pipeline: In order to overcome the overhead cycles due to the arbitration and address phases, a pipeline mechanism is implemented. This mechanism allows a new arbitration to start while the switch is still in the data phase. After the data phase is complete, the switch already has the new arbitration and can go directly to the address phase. Using this method reduces the number of cycles needed for a write/read transaction to 17/20 respectively.

The unit selection for master units: master 1: 154; master 2: 142; master 3: 144; and master 4: 120.

The unit selection for slave units: slave 1: 164; slave 2: 162; slave 3: 180; and slave 4: 170.

The following definitions and assumptions pertain to switch module 110 in an exemplary embodiment of a system-on-a-chip: each slave supports up to 8 masters; each master supports up to 8 slaves; the size of a transaction is given in bytes and reflects the last byte to be written or read, but gaps are controlled only by the BE and are used only in write; for example, a size of 4 bytes may appear with BE = 0110 (0 = active); BE has no meaning in read; the data bus width is 32 bits; the maximum transaction size is 64 bytes; the arbitration phase takes a minimum of 2 cycles; the address phase takes 1 cycle; the data phase for N words takes N cycles; an agent asserts its ACK signal (either for request transactions or for read response transactions) as default behavior; it can de-assert its ACK only if it cannot accept new transactions; the de-assertion should occur one cycle after the address phase of a transaction; the de-assertion should be limited in time (configurable); an interrupt should be set if an agent de-asserts its ACK for more than the permitted time; a master always asserts its rd_response_ack if it is not in a transaction; and a master preferably should have sufficient space to receive the data.
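The byte enable convention in the assumptions above (BE = 0110, 0 = active) can be illustrated with a minimal Python sketch. The function `apply_write` and the choice to map the leftmost BE bit to the first byte are assumptions made for illustration; the text does not specify the bit ordering, and for the symmetric pattern 0110 either ordering gives the same result.

```python
def apply_write(memory: bytearray, offset: int, data: bytes, be: str) -> None:
    """Write only the bytes whose BE bit is '0' (active-low, per the text).

    Assumption: the leftmost character of `be` controls data[0].
    """
    if len(be) != len(data):
        raise ValueError("one BE bit is required per data byte")
    for i, (bit, value) in enumerate(zip(be, data)):
        if bit == "0":  # 0 = active: this byte lane is written
            memory[offset + i] = value


if __name__ == "__main__":
    mem = bytearray(b"\xff" * 4)
    # Size 4 with BE = 0110: the first and last bytes are written, and the
    # two middle bytes form a gap that leaves memory untouched.
    apply_write(mem, 0, bytes([0x11, 0x22, 0x33, 0x44]), "0110")
    print(mem.hex())  # 11ffff44
```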

Information about the master unit's signals is summarized in Table I:

TABLE I

Signal       | Driver    | Definition
MST_REQ      | Master    | Master request
SW_ACK       | Cross-bar | Acknowledge to the master. This is a one-cycle strobe that causes the de-assertion of the request (unless there is another request and the current request was a read)
MST_AD       | Master    | Address/write data bus
MST_CBE      | Master    | Command byte enable bus
SW_RDRS_REQ  | Cross-bar | Read response request indication
MST_RDRS_ACK | Master    | Acknowledge for the read response indication
SW_RDATA     | Cross-bar | Read data bus

Information about the slave unit's signals is summarized in Table II:

TABLE II

Signal       | Driver    | Definition
SW_REQ       | Cross-bar | Request toward the slave unit from the switch
SLV_ACK      | Slave     | Acknowledge to the switch. Set if the slave is idle.
SW_AD        | Cross-bar | Address/write data bus
SW_CBE       | Cross-bar | Command byte enable bus
SLV_RDRS_REQ | Slave     | Response indication
SW_RDRS_ACK  | Cross-bar | Acknowledge for the read response indication
SLV_RDATA    | Slave     | Read data bus

Command bus content

In order to allow the arbitration, the initiator needs to send a command to the slave. This command includes information about the request type, size and other parameters and status indications.

Table III includes the mandatory fields of a command:

TABLE

Figs. 2 through 5 are some examples for crossbar transactions using timing diagrams. The topmost scale in each diagram represents a measure of time 210. Beneath measure of time 210 is a scale of the cycle count 220.

Refer now to Figure 2, a write timing diagram for a write of 12 words in a single burst 200, in accordance with an exemplary embodiment of the present invention. Timing explanation:

At cycles 1, 2: Arbitration phase: the master asserts a Request 230. The target ACK is one, which means that the target is ready to accept new transactions. Arbitration takes place in the second cycle, and on the next clock rising edge a decision signal is sampled.

At cycle 3: Address phase: The switch asserts a Request signal (FF output) toward the target 250. Acknowledge is asserted by the switch (FF output) towards the master 240. This is the address phase where address 270 and command 280 are transferred to the target. These parameters are sampled on the clock rising edge. The slave asserts its ACK 260 as default behavior.

At cycles 4-15: Data phase: Data is transferred from master to target (one data word per cycle). Note: at cycle 13 the CNT 295 is less than 4 (meaning that 3 more data phases remain); at the next cycle the state machine 290 goes to IDLE, which allows a new arbitration during the last 2 data phase cycles and thereby improves performance.

Referring now to Fig. 3, a write timing diagram for 2 writes, a write of 3 words followed by a write of 1 word 300, in accordance with an exemplary embodiment of the present invention. Timing explanation:

At cycles 1, 2: Arbitration phase: Both M1 310 and M2 315 assert a request. Request 1 wins the arbitration. M2 continues to assert its request, since a request may not be withdrawn. The level of MST_REQ2 is seen to continue.

At cycle 3: Address phase: The switch asserts Request (FF output) toward the target 330. Acknowledge is asserted by the switch (FF output) towards the master 320. This is the address phase where address 350 and command 360 are transferred to the target. These parameters will be sampled on the clock rising edge. The slave asserts its ACK 340 as default behavior.

At cycle 4-6: First data phase: Data is being transferred from master 1 to target, as seen by the countdown of CNT 395.

At cycle 5 the state machine 390 goes to IDLE and starts the next arbitration.

At cycle 7: Address phase of the new transaction, where address 355 and command 365 are transferred to the target. The switch asserts a second Request (FF output) toward the target 330. Acknowledge is asserted by the switch (FF output) towards the master 325.

At cycle 8: Data phase of the second transaction, as seen in the second countdown of CNT 395.

Refer now to Fig. 4, a timing diagram for a write of 3 words followed by a read request 400, in accordance with an exemplary embodiment of the present invention. Timing explanation:

Cycles 1 - 6 - same as previous figure as follows:

At cycles 1, 2: Both M1 410 and M2 415 assert a request. Request 1 wins the arbitration. M2 continues to assert its request, since a request may not be withdrawn. The level of MST_REQ2 is seen to continue.

At cycle 3: The switch asserts Request (FF output) toward the target 430. Acknowledge is asserted by the switch (FF output) towards the master 420. This is the address phase where address 450 and command 460 are transferred to the target. These parameters will be sampled on the clock rising edge. The slave asserts its ACK 440 as default behavior.

At cycles 4-6: Data is being transferred from master 1 to target, as seen by the countdown of CNT 495.

At cycle 5 the state machine 490 goes to IDLE and starts the next arbitration. At cycle 5 and 6 the next arbitration takes place.

Cycle 7 - address phase of read request where address 455 and command 465 are transferred to the target. The switch asserts a second Request (FF output) toward the target 430. Acknowledge is asserted by the switch (FF output) towards the master 425.

Refer now to Fig. 5, a timing diagram for a read response transaction with 12 words of data 500, in accordance with an exemplary embodiment of the present invention. Timing explanation:

At cycles 1, 2: The target asserts read response request 520 together with command 525 on the data bus. The arbitration decision takes place, and control signals will be sampled on the next clock rising edge.

At cycle 3: Address phase: read response request 540 and acknowledge 530 are both asserted.

At cycles 4-15: Consecutive data 595 are transferred (one per cycle). Master read response acknowledge 550 occurs at cycles 4, 5.

Note: state machine 590 goes to IDLE at cycle 14, allowing hidden arbitration during the last 2 cycles of the data phase.

Ordering: Ordering may introduce many difficulties in bus and switch systems. Problems may arise from posted writes (i.e., the target puts the data in a temporary buffer and only later transfers it to its destination) and from split reads (i.e., 2 requests that were sent one after the other can generate 2 read responses in the opposite order). Concerning ordering, switch masters and slaves should fulfill the following rules: between two agents that initiate a transaction no order relationship is defined; both units operate completely asynchronously; two write transactions to the same device that are initiated by the same master are performed in order (FIFO); for two write transactions from the same master to different devices, the master issues both writes on the switch in order, but they can be performed at their destinations out of order; if order is important, the initiator should implement a mechanism that performs the first write operation followed by a read operation, and only after the read result has been received does the initiator issue the second write; two read transactions to the same device that are initiated by the same master are performed and sent back in order; and two read transactions initiated by the same master to different targets may be performed out of order, since the first target may be a slow device and the second a fast one; the initiator back-end unit may nevertheless receive the data in order using control logic in the switch interface, whose responsibility it is to use a FIFO mechanism for transactions.
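One way the FIFO mechanism in the switch interface could restore issue order for read responses that return out of order is sketched below in Python. This is an illustration under stated assumptions, not the implementation of the specification: the class name `ReadReorderBuffer` is hypothetical, and the sketch assumes that each outstanding read request carries a unique ID that the target returns in its read response.

```python
from collections import deque
from typing import Any, Deque, Dict, List


class ReadReorderBuffer:
    """Delivers read responses to the back-end in request issue order."""

    def __init__(self) -> None:
        self._pending: Deque[int] = deque()  # request IDs in issue order
        self._arrived: Dict[int, Any] = {}   # early responses, keyed by ID

    def issue(self, req_id: int) -> None:
        """Record that a read request with this ID was sent."""
        self._pending.append(req_id)

    def receive(self, req_id: int, data: Any) -> List[Any]:
        """Store a response and return all data now deliverable in order."""
        self._arrived[req_id] = data
        ready = []
        while self._pending and self._pending[0] in self._arrived:
            ready.append(self._arrived.pop(self._pending.popleft()))
        return ready


if __name__ == "__main__":
    buf = ReadReorderBuffer()
    buf.issue(1)  # read from a slow target
    buf.issue(2)  # read from a fast target
    print(buf.receive(2, "fast data"))  # [] (held until request 1 returns)
    print(buf.receive(1, "slow data"))  # ['slow data', 'fast data']
```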

Size of read response and empty response: In a read response the size may be one of the following: the requested size, if the target includes a buffer with that amount of data; a size that is less than the requested size, in the case of end-of-packet events that cause the buffer to be less than full; and zero size, i.e., an empty packet that is sent if the target does not have any data at all.
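The three cases above can be modeled, as a simplification, by taking the smaller of the requested size and the amount of data the target has buffered. The Python function below is a hypothetical illustration; it also reports whether a data phase occurs, since an empty read response has none.

```python
from typing import Tuple


def read_response(requested_bytes: int, buffered_bytes: int) -> Tuple[int, bool]:
    """Return (response size in bytes, whether a data phase occurs)."""
    if requested_bytes < 0 or buffered_bytes < 0:
        raise ValueError("sizes must be non-negative")
    size = min(requested_bytes, buffered_bytes)
    return size, size > 0


if __name__ == "__main__":
    print(read_response(64, 64))  # (64, True): full requested size
    print(read_response(64, 20))  # (20, True): short response, e.g. end of packet
    print(read_response(64, 0))   # (0, False): empty response, no data phase
```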

The switch hardware is based on 2 parts: first, there is a data path that is responsible for data transfers from the initiator to the target; and second, there is an embodiment of the arbitration schemes, wherein the switch supports 2 modes of arbitration: TDMA-based arbitration with a window of 16 arbitration cycles; and a simple round robin mechanism used when the TDMA does not have a hit.

Refer now to Fig. 6, a schematic block diagram of the internal structure 600 of the switch of Fig. 1, in accordance with an exemplary embodiment of the present invention. The data path for a slave MUX's the AD and CBE busses from the various masters onto one AD and CBE bus towards the slave. For a master, only the RDATA busses are MUX'ed.

The operation of internal structure 600 is based on 5 blocks: qualifier module 630, which is used to filter the requests so that only relevant and "good" requests enter the arbiters (a request needs to be qualified as to whether its destination ID is directed to this agent); TDMA arbiter 610; round robin arbiter 620; output stage 640, which merges the outputs of both arbiters; and control block 650 logic.

TDMA arbiter 610 is used to define time priorities, and it is based on a window with 16 slices. Each slice is allocated to a specific master by software (SW). A modulo-16 counter is used to select the current slice, and the counter is incremented during each arbitration phase that occurs for its owner. When a new request occurs, the window content number is compared against the request and the ID of all requesting agents. If a match is found, then this agent is selected. If the agent that has priority for this window is not requesting, then the result of simple round robin arbitration 620 is selected.

The TDMA approach allows the amount of bandwidth (BW) allocated to each requestor to be defined in units of 1/16 of the total BW, together with the period between 2 consecutive requests.

In the following example, for simplicity, consider a window with only 8 slices instead of 16. Software is used to allocate the window scheme shown in fig. 7.

Refer now to Fig. 7, a schematic illustration of a time window with 8 slices, according to an exemplary embodiment of the present invention. A modulo-8 counter is incremented each time that an arbitration phase occurs at that arbiter. If for cycle #2 720 all agents are asking for service, then agent #3 793 is selected in this period, and so on: agent #1 791 in cycles 1, 3, 5 and 7 per reference blocks 710, 730, 750 and 770 respectively; agent #2 792 in cycle 6 760; and agent #4 794 in cycles 0 and 4 per reference blocks 700 and 740 respectively. According to this scheme, if all agents are asking for service all of the time, agent #1 791 achieves 50% of the BW, agent #4 794 achieves 25%, and agents #2 792 and #3 793 each achieve 12.5%. The maximum wait time of agent #1 791 is one arbitration period. The maximum wait time of agent #4 794 is 3 arbitration periods. Thus, using this type of arbitration allows calculation of the BW and the maximum latency per agent.
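The bandwidth and latency figures of this example can be reproduced from the slice allocation with a short Python sketch. The names `WINDOW`, `bandwidth_share` and `max_wait` are illustrative, and the sketch rests on two assumptions: all agents request service all of the time, and the maximum wait is counted as the largest number of consecutive slices owned by other agents.

```python
from collections import Counter
from typing import Dict, List

# Slice owners for slices 0..7, following the allocation described for Fig. 7.
WINDOW: List[int] = [4, 1, 3, 1, 4, 1, 2, 1]


def bandwidth_share(window: List[int]) -> Dict[int, float]:
    """Fraction of arbitration slices owned by each agent."""
    counts = Counter(window)
    return {agent: counts[agent] / len(window) for agent in sorted(counts)}


def max_wait(window: List[int], agent: int) -> int:
    """Largest run of slices owned by others between two slices of `agent`."""
    n = len(window)
    slots = [i for i, owner in enumerate(window) if owner == agent]
    if not slots:
        raise ValueError("agent owns no slice in this window")
    # Cyclic distance from each owned slice to the next owned slice.
    gaps = [((slots[(k + 1) % len(slots)] - slots[k]) % n) or n
            for k in range(len(slots))]
    return max(gaps) - 1


if __name__ == "__main__":
    print(bandwidth_share(WINDOW))  # {1: 0.5, 2: 0.125, 3: 0.125, 4: 0.25}
    print(max_wait(WINDOW, 1), max_wait(WINDOW, 4))  # 1 3
```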

If no hit is found in TDMA arbiter 610, the output of the round robin arbitration is selected. The round robin is based on information about the least recently selected agent, and defines the priority accordingly.

The following example illustrates the round robin: Suppose there are 4 agents: 0, 1, 2 and 3. The first priority is 0, then 1, then 2 and then 3. Now suppose that all agents ask for service. Agent 0 wins, and then priority changes at the end of the address phase to 1, then 2, then 3 and then to 0, and so on.
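The rotation in this example can be sketched in Python as follows. The class `RoundRobinArbiter` is a hypothetical model, not the hardware described: it simply gives the highest priority to the agent that follows the most recent winner.

```python
from typing import Optional, Set


class RoundRobinArbiter:
    """Rotating-priority arbiter: after a grant, priority moves to winner + 1."""

    def __init__(self, n_agents: int) -> None:
        self.n_agents = n_agents
        self.first = 0  # agent that currently has the highest priority

    def grant(self, requesting: Set[int]) -> Optional[int]:
        """Return the winning agent, or None if nobody is requesting."""
        for offset in range(self.n_agents):
            agent = (self.first + offset) % self.n_agents
            if agent in requesting:
                self.first = (agent + 1) % self.n_agents
                return agent
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(4)
    everyone = {0, 1, 2, 3}
    print([arbiter.grant(everyone) for _ in range(5)])  # [0, 1, 2, 3, 0]
```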

The output stage MUX's the TDMA selected ID and the Round robin selected ID and chooses the appropriate winner of the arbitration (using a fixed priority scheme - TDMA has the highest priority).
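A minimal Python sketch of this fixed-priority merge is given below. The function names are hypothetical, the round robin choice is passed in as an already computed value, and the 8-slice window reuses the allocation described for Fig. 7.

```python
from typing import Optional, Sequence, Set


def tdma_select(window: Sequence[int], slice_index: int,
                requesting: Set[int]) -> Optional[int]:
    """Return the owner of the current slice if it is requesting, else None."""
    owner = window[slice_index % len(window)]
    return owner if owner in requesting else None


def output_stage(tdma_id: Optional[int], rr_id: Optional[int]) -> Optional[int]:
    """Fixed-priority merge: a TDMA hit always beats the round robin choice."""
    return tdma_id if tdma_id is not None else rr_id


if __name__ == "__main__":
    window = [4, 1, 3, 1, 4, 1, 2, 1]  # slice owners for slices 0..7
    requesting = {1, 3}
    # Slice 2 belongs to agent 3, which is requesting, so the TDMA hit wins.
    print(output_stage(tdma_select(window, 2, requesting), rr_id=1))  # 3
    # Slice 6 belongs to agent 2, which is idle, so round robin is used.
    print(output_stage(tdma_select(window, 6, requesting), rr_id=1))  # 1
```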

Refer now to Fig. 8, a schematic block diagram illustrating the arbiter state machine 800, according to an exemplary embodiment of the present invention. The state machine is based on 4 states: the three active phases described hereinabove with reference to Fig. 1, namely arbitration phase 810, address phase 820 and data phase 830; and an idle phase 840.

Machine 800 is implemented using flip-flops (FF) as a Finite State Machine (FSM). An FSM is a "machine" that is evaluated every cycle to control an object. An FSM generally looks at the current state, plus the input provided to it, and thereby determines a new state. Fig. 8 shows the state machine for a slave unit. For a master unit the transitions are driven by the active read response request, active master read response acknowledge and empty bit signals instead of active request 850, slave acknowledge 860 and write/read bit 870. In addition to state machine 800, the switch uses a count-down counter (CNT) 880 that is loaded with the transaction size (in words of 32 bits) and is decremented each cycle when in data phase 830.
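A simplified Python model of such a four-state machine with a count-down counter is sketched below. The transition conditions are assumptions made for illustration, since the exact transitions of Fig. 8 are not spelled out in the text; in particular, the sketch does not model the minimum 2-cycle arbitration or the early return to IDLE that allows hidden arbitration.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    ARBITRATION = auto()
    ADDRESS = auto()
    DATA = auto()


class ArbiterFsm:
    """Slave-side sketch driven by request, slave ACK and write/read bit."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self.cnt = 0  # count-down counter, in 32-bit words

    def step(self, request: bool, slave_ack: bool,
             is_write: bool, size_words: int) -> State:
        """Advance one clock cycle and return the new state."""
        if self.state is State.IDLE:
            if request:
                self.state = State.ARBITRATION
        elif self.state is State.ARBITRATION:
            if slave_ack:  # the target is ready to accept the transaction
                self.state = State.ADDRESS
        elif self.state is State.ADDRESS:
            if is_write and size_words > 0:
                self.cnt = size_words  # load CNT with the transaction size
                self.state = State.DATA
            else:
                self.state = State.IDLE  # a read request has no data phase
        elif self.state is State.DATA:
            self.cnt -= 1  # one word is transferred per cycle
            if self.cnt == 0:
                self.state = State.IDLE
        return self.state


if __name__ == "__main__":
    fsm = ArbiterFsm()
    trace = [fsm.step(True, True, True, 3).name for _ in range(6)]
    print(trace)
    # ['ARBITRATION', 'ADDRESS', 'DATA', 'DATA', 'DATA', 'IDLE']
```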

Claims

I claim:

1. A switch for a high bit rate transaction of data between master and slave devices in a system-on-a-chip (SOC) comprising: a set of busses allowing maximum parallelism in said switch; a crossbar unit used to transfer data bursts from one of said devices to another of said devices; and connections to: at least two said master devices which can originate read and write transactions of said data with any other master or slave; and at least two said slave devices which can respond to read and write transactions of said data with any other master or slave.

2. The switch according to claim 1, wherein said transactions are partial transfers.

3. The switch according to claim 1, wherein said transactions are burst transfers.

4. The switch according to claim 1, wherein said crossbar further comprises a byte enable mechanism for handling gaps in said write transactions of said data.

5. A method for a high data rate transaction, wherein each said transaction is selected from the group consisting of write, read request and read response, and said transaction being between at least two master devices and at least two slave devices connected to the crossbar of a switch in a system-on-a-chip (SOC) comprising: asserting a request by an initiator among one of said at least two master devices towards said crossbar for a target device among said at least two master devices and at least two slave devices in an arbitration phase of at least two cycles; a first transferring of an address and command from said initiator to said crossbar in an address phase; performing an arbitration by said crossbar by deciding which of said at least two masters next communicates with said target; and a second transferring of said data in burst mode from said initiator to said target with zero wait states, such that each transaction is independent, and when said master performs simultaneous transactions, each said transaction occurs in a single cycle.

6. The method according to claim 5, wherein said asserting further comprises defining the maximum latency for said initiator and said target.

7. The method according to claim 5, wherein said first transferring comprises only the command in a read response transaction.

8. The method according to claim 5, and further comprising initiating a read by one of said master devices from one of said slave devices, and if said slave is occupied, proceeding by said master to another transaction.