[Drawing text from the flow charts: threads communicate their current state to the packet dispatcher via a thread mailbox or thread message; a packet is received from the network into the packet receiver buffer; the dispatcher assigns the packet payload to a space in DRAM and assigns the packet a packet sequence number.]
PROCESSING A DATA PACKET
Networks enable computers and other devices to exchange data such as e-mail messages, web pages, audio, video, and so forth. To send data across a network, a sending device typically constructs a collection of packets. In networks, individual packets store some portion of the data being sent. A receiver can reassemble the data into its original form after receiving the packets.
A packet traveling across a network may make many "hops" to intermediate network devices before reaching its final destination. A packet includes data being sent and information used to deliver the packet. This information is often stored in the packet's "payload" and "header(s)", respectively. The header(s) may include information for a number of different communication protocols that define the information that should be stored in a packet. Different protocols may operate at different layers. For example, a low level layer generally known as the "link layer" coordinates transmission of data over physical connections. A higher level layer generally known as the "network layer" handles routing, switching, and other tasks that determine how to move a packet forward through a network.
Many different hardware and software schemes have been developed to handle packets. For example, some designs use software to program a general purpose CPU (Central Processing Unit) to process packets. Other designs, such as those using ASICs (application-specific integrated circuits), feature dedicated, "hard-wired" approaches. Field programmable processors enable software programmers to quickly reprogram network processor operations.
DESCRIPTION OF DRAWINGS
FIG. 1 is a block diagram of a communication system employing a hardware-based multithreaded processor.
FIG. 2 is a block diagram of a microengine unit employed in the hardware-based multithreaded processor of FIG. 1.
FIG. 3 is a diagram of the processing of a packet.
FIG. 4 is a flow chart of the processing of a packet.
FIG. 5 is a flow chart of the initial handling and storing of packet information prior to processing by the threads.
Referring to FIG. 1, a communication system 10 includes a parallel, hardware-based multithreaded processor 12. The hardware-based multithreaded processor 12 is coupled to a bus such as a Peripheral Component Interconnect (PCI) bus 14, a memory system 16 and a second bus 18. The system 10 is especially useful for tasks that can be broken into parallel subtasks. Specifically, the hardware-based multithreaded processor 12 is useful for tasks that are bandwidth oriented rather than latency oriented. The hardware-based multithreaded processor 12 has multiple microengines 22, each with multiple hardware controlled program threads that can be simultaneously active and independently work on a task. A program thread is an independent program that runs a series of instructions. From the program's point of view, a program thread is the information needed to serve one individual user or a particular service request.
The hardware-based multithreaded processor 12 also includes a central controller 20 that assists in loading microcode control for other resources of the hardware-based multithreaded processor 12 and performs other general purpose computer type tasks such as handling protocols, exceptions, and extra support for packet processing where the microengines pass the packets off for more detailed processing such as in boundary conditions. In one embodiment, the processor 20 is a Strong Arm® (Arm is a trademark of ARM Limited, United Kingdom) based architecture. The general purpose microprocessor 20 has an operating system. Through the operating system the processor 20 can call functions to operate on microengines 22a-22f. The processor 20 can use any supported operating system, preferably a real time operating system. For the core processor implemented as a Strong Arm architecture, operating systems such as Microsoft NT real-time, VXWorks and μCUS, a freeware operating system available over the Internet, can be used.
The hardware-based multithreaded processor 12 also includes a plurality of microengines 22a-22f. Microengines 22a-22f each maintain a plurality of program counters in hardware and states associated with the program counters. Effectively, a corresponding plurality of sets of program threads can be simultaneously active on each of the microengines 22a-22f while only one is actually operating at one time.
In one embodiment, there are six microengines 22a-22f, each having capabilities for processing four hardware program threads. The six microengines 22a-22f operate with shared resources including memory system 16 and bus interfaces 24 and 28. The memory system 16 includes a Synchronous Dynamic Random Access Memory (SDRAM) controller 26a and a Static Random Access Memory (SRAM) controller 26b. SDRAM memory 16a and SDRAM controller 26a are typically used for processing large volumes of data, e.g., processing of network payloads from network packets. The SRAM controller 26b and SRAM memory 16b are used in a networking implementation for low latency, fast access tasks, e.g., accessing look-up tables, memory for the core processor 20, and so forth.
Hardware context swapping enables other contexts with unique program counters to execute in the same microengine. Hardware context swapping also synchronizes completion of tasks. For example, two program threads could request the same shared resource, e.g., SRAM. Each one of these separate units, e.g., the FBUS interface 28, the SRAM controller 26b, and the SDRAM controller 26a, reports back a flag signaling completion of an operation when it completes a requested task from one of the microengine program thread contexts. When the flag is received by the microengine, the microengine can determine which program thread to turn on.
As a network processor, the hardware-based multithreaded processor 12 interfaces to network devices such as a media access controller device, e.g., a 10/100BaseT Octal MAC 13a or a Gigabit Ethernet device 13b coupled to communication ports or other physical layer devices. In general, as a network processor, the hardware-based multithreaded processor 12 can interface to any type of communication device or interface that receives/sends large amounts of data. The network processor can be included in a router 10 that, in a networking application, routes network packets amongst devices 13a, 13b in a parallel manner. With the hardware-based multithreaded processor 12, each network packet can be independently processed.
The processor 12 includes a bus interface 28 that couples the processor to the second bus 18. Bus interface 28 in one embodiment couples the processor 12 to the so-called FBUS 18 (FIFO bus). The FBUS interface 28 is responsible for controlling and interfacing the processor 12 to the FBUS 18. The FBUS 18 is a 64-bit wide FIFO bus, used to interface to Media Access Controller (MAC) devices. The processor 12 includes a second interface, e.g., a PCI bus interface 24 that couples other system components that reside on the PCI bus 14 to the processor 12. The units are coupled to one or more internal buses. The internal buses are dual, 32-bit buses (e.g., one bus for read and one for write). The hardware-based multithreaded processor 12 also is constructed such that the sum of the bandwidths of the internal buses in the processor 12 exceeds the bandwidth of external buses coupled to the processor 12. The processor 12 includes an internal core processor bus 32, e.g., an ASB bus (Advanced System Bus) that couples the processor core 20 to the memory controllers 26a, 26b and to an ASB translator 30 described below. The ASB bus is a subset of the so-called AMBA bus that is used with the Strong Arm processor core. The processor 12 also includes a private bus 34 that couples the microengine units to SRAM controller 26b, ASB translator 30 and FBUS interface 28. A memory bus 38 couples the memory controllers 26a, 26b to the bus interfaces 24 and 28 and memory system 16 including flashrom 16c used for boot operations and so forth.
Each of the microengines 22a-22f includes an arbiter that examines flags to determine the available program threads to be operated upon. The program threads of the microengines 22a-22f can access the SDRAM controller 26a, SRAM controller 26b or FBUS interface 28. The SDRAM controller 26a and SRAM controller 26b each include a plurality of queues to store outstanding memory reference requests. The queues either maintain order of memory references or arrange memory references to optimize memory bandwidth.
Although microengines 22 can use the register set to exchange data, a scratchpad or shared memory is also provided to permit microengines to write data out to the memory for other microengines to read. The scratchpad is coupled to bus 34.
Referring to FIG. 2, an exemplary one of the microengines 22a-22f, e.g., microengine 22f, is shown. The microengine includes a control store 70 which, in one implementation, includes a RAM of, here, 1,024 words of 32 bits. The RAM stores a microprogram that is loadable by the core processor 20. The microengine 22f also includes controller logic 72. The controller logic includes an instruction decoder 73 and program counter (PC) units 72a-72d. The four micro program counters 72a-72d are maintained in hardware. The microengine 22f also includes context event switching logic 74. Context event logic 74 receives messages (e.g., SEQ_#_EVENT_RESPONSE; FBI_EVENT_RESPONSE; SRAM_EVENT_RESPONSE; SDRAM_EVENT_RESPONSE; and ASB_EVENT_RESPONSE) from each one of the shared resources, e.g., SRAM 26b, SDRAM 26a, or processor core 20, control and status registers, and so forth. These messages provide information on whether a requested task has completed. Based on whether or not a task requested by a program thread has completed and signaled completion, the program thread needs to wait for that completion signal, and if the program thread is enabled to operate, then the program thread is placed on an available program thread list (not shown).
In addition to event signals that are local to an executing program thread, the microengines 22 employ signaling states that are global. With signaling states, an executing program thread can broadcast a signal state to the microengines 22. The program threads in the microengines can branch on these signaling states. These signaling states can be used to determine availability of a resource or whether a resource is due for servicing.
The context event logic 74 has arbitration for the program threads. In one embodiment, the arbitration is a round robin mechanism. Other techniques could be used including priority queuing or weighted fair queuing. The microengine 22f also includes an execution box (EBOX) data path 76 that includes an arithmetic logic unit 76a and general purpose register set 76b. The arithmetic logic unit 76a performs arithmetic and logic operations as well as shift operations. The register set 76b has a relatively large number of general purpose registers. In this implementation there are 64 general purpose registers in a first bank, Bank A, and 64 in a second bank, Bank B. The general purpose registers are windowed so that they are relatively and absolutely addressable.
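The round robin arbitration mentioned above can be illustrated with a short sketch. It is not taken from the specification: the function name `round_robin_pick`, the list-of-flags representation of the available program thread list, and the choice of Python are all assumptions made for illustration only.

```python
def round_robin_pick(ready, last):
    """Pick the next ready context after `last`, wrapping around.

    `ready` is a list of booleans, one per hardware context (hypothetical
    stand-in for the available program thread list). `last` is the index of
    the context that ran most recently. Returns None if nothing is ready.
    """
    n = len(ready)
    for offset in range(1, n + 1):
        ctx = (last + offset) % n
        if ready[ctx]:
            return ctx
    return None


# Four contexts per microengine, as in the embodiment described above.
# Context 0 just ran; contexts 2 and 3 have received completion signals.
next_ctx = round_robin_pick([False, False, True, True], last=0)
```

Starting the scan one position past the most recently run context is what keeps any single context from monopolizing the microengine; a priority or weighted scheme would replace only the scan order.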
The microengine 22f also includes a write transfer register stack 78 and a read transfer register stack 80. These registers are also windowed so that they are relatively and absolutely addressable. Write transfer register stack 78 is where write data to a resource is located. Similarly, read transfer register stack 80 is for return data from a shared resource. Subsequent to or concurrent with data arrival, an event signal from the respective shared resource, e.g., the SRAM controller 26b, SDRAM controller 26a or core processor 20, will be provided to context event arbiter 74, which will then alert the program thread that the data is available or has been sent. Both transfer register banks 78 and 80 are connected to the execution box (EBOX) 76 through a data path. In one implementation, the read transfer register has 64 registers and the write transfer register has 64 registers.
Each microengine 22a-22f supports multi-threaded execution of multiple contexts. One reason for this is to allow one program thread to start executing just after another program thread issues a memory reference and must wait until that reference completes before doing more work. This behavior maintains efficient hardware execution of the microengines because memory latency is significant.
Special techniques, such as inter-thread communications to communicate status and a thread_done register to provide a global program thread communication scheme, are used for packet processing. The thread_done register can be implemented as a control and status register. Network operations are implemented in the network processor using a plurality of program threads, e.g., contexts, to process network packets. For example, scheduler program threads could be executed in one of the microprogram engines, e.g., 22a, whereas processing program threads could execute in the remaining engines, e.g., 22b-22f. The program threads (processing or scheduling program threads) use inter-thread communications to communicate status.
Program threads are assigned specific tasks such as receive and transmit scheduling, receive processing, and transmit processing. Task assignment and task completion are communicated between program threads through inter-thread signaling, registers with specialized read and write characteristics, e.g., the thread_done register, SRAM 16b, and data stored in the internal scratchpad memory resulting from operations such as bit set and bit clear.
Referring to FIG. 3, the packet dispatcher 302 resides on a processor inside the network processor and requests packets from the network interface. The packet dispatcher 302 is notified when a packet segment (e.g., 128 bytes) has been received by a packet receiver buffer 304. The packet dispatcher 302 moves the packet segment payload into DRAM 306. The packet dispatcher 302 stores packet reassembly state information to reassemble the packet. As successive segments are received for a packet, the dispatcher 302 uses the state information to direct and assemble the segments in space allocated in DRAM 306 by the packet dispatcher 302.
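As an illustration of the segment reassembly just described, the following sketch keeps per-packet reassembly state and copies each segment payload into space reserved in a simulated DRAM. The class name `SegmentReassembler`, the per-port state key, the `end_of_packet` flag, the fixed-size reservation, and the bump allocator that never frees space are all assumptions introduced here; the specification does not say how space is allocated or how the end of a packet is detected.

```python
class SegmentReassembler:
    """Hypothetical model of the dispatcher's reassembly bookkeeping."""

    def __init__(self, dram_bytes, max_packet_bytes):
        self.dram = bytearray(dram_bytes)      # stand-in for DRAM 306
        self.max_packet_bytes = max_packet_bytes
        self.next_base = 0                     # simple bump allocator (assumption)
        self.state = {}                        # port -> (base address, bytes so far)

    def on_segment(self, port, payload, end_of_packet):
        """Store one segment; return (base, length) once the packet is complete."""
        if port not in self.state:
            # First segment of a packet: reserve space for the whole packet.
            if self.next_base + self.max_packet_bytes > len(self.dram):
                raise MemoryError("no DRAM space left in this sketch")
            self.state[port] = (self.next_base, 0)
            self.next_base += self.max_packet_bytes
        base, used = self.state[port]
        if used + len(payload) > self.max_packet_bytes:
            raise ValueError("packet exceeds reserved space")
        # Append the segment directly after the segments already stored.
        self.dram[base + used:base + used + len(payload)] = payload
        used += len(payload)
        if end_of_packet:
            del self.state[port]
            return base, used
        self.state[port] = (base, used)
        return None


reasm = SegmentReassembler(dram_bytes=4096, max_packet_bytes=1024)
reasm.on_segment(0, b"a" * 128, end_of_packet=False)   # first 128-byte segment
done = reasm.on_segment(0, b"b" * 5, end_of_packet=True)
```

The returned base address and length correspond to the kind of packet state information (for example, the payload's address in DRAM) that the dispatcher later hands to a thread.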
Each packet received is assigned a sequence number, in ascending order. The sequence number allows the packets to be dequeued in the order they were received. Each sequence number corresponds to a slot in a ring in memory called an Asynchronous Insert Synchronous Remove (AISR) ring 308. When a thread 310 in the pool of threads has taken its assigned packet and finished processing the packet, the thread 310 sends the processed packet to DRAM 306. The thread also signals completion of the processed packet to the indexed location in the AISR 308, based on the packet's sequence number. This ensures that the results are stored in ascending addresses by order of packet arrival. The reorder dequeue 312 reads the AISR 308 in ascending order, checking to see if packet information has been assigned to the slot. The reorder dequeue 312 will continue checking the slot in the AISR 308 until packet information is found in the slot. The system provides First In First Out (FIFO) dequeuing while efficiently processing packets out of order.
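The AISR behavior described above can be sketched as follows. This is a minimal model under stated assumptions, not the patented implementation: the class name `AISRRing`, the modulo mapping from sequence number to slot, and the use of `None` to mark an empty slot are all choices made for this example.

```python
class AISRRing:
    """Asynchronous insert, synchronous remove ring (illustrative model)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size   # None marks a slot with no packet info yet
        self.next_seq = 0            # next sequence number the dequeuer expects

    def insert(self, seq, packet_info):
        """Called by a thread, in any order, when it finishes a packet."""
        idx = seq % self.size        # assumed mapping of sequence number to slot
        if self.slots[idx] is not None:
            raise RuntimeError("slot still occupied; ring too small")
        self.slots[idx] = packet_info

    def remove_ready(self):
        """Called by the reorder dequeue: drain slots strictly in sequence order."""
        out = []
        while self.slots[self.next_seq % self.size] is not None:
            idx = self.next_seq % self.size
            out.append(self.slots[idx])
            self.slots[idx] = None
            self.next_seq += 1
        return out


ring = AISRRing(8)
ring.insert(2, "pkt2")               # packet 2 finishes first
ring.insert(0, "pkt0")
first = ring.remove_ready()          # only pkt0; slot 1 is still empty
ring.insert(1, "pkt1")
second = ring.remove_ready()         # pkt1, then the already waiting pkt2
```

The dequeuer stalls at the first empty slot rather than skipping it, which is what preserves arrival order at the output even though threads complete in arbitrary order.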
When a packet is received, the dispatcher 302 assigns the packet to a thread 310 in the pool of threads. Each thread in the pool makes itself available by signaling the dispatcher via either a thread mailbox 314 or a message CSR 316. Each thread 310 has a memory that allows the thread to work on a presently assigned packet and store the next assigned packet in memory. The thread 310 communicates its memory and processing availability and location of the thread to the packet dispatcher 302. The dispatcher 302 communicates select packet state information back to the assigned threads. The packet state information can include, for example, the packet payload's address in DRAM 306 and the sequence number.
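The two-entry thread memory described above, holding the packet being processed and the next assigned packet, might be modeled as in the sketch below. The class `ThreadMemory` and its method names are hypothetical, and the packet state is reduced to a dictionary with a DRAM address and a sequence number, the two example fields the text names.

```python
class ThreadMemory:
    """Hypothetical per-thread memory: one packet in progress, one queued."""

    def __init__(self):
        self.current = None   # packet state being processed now
        self.pending = None   # next assigned packet, not yet started

    def has_room(self):
        """True when the dispatcher may assign another packet to this thread."""
        return self.current is None or self.pending is None

    def assign(self, packet_state):
        if self.current is None:
            self.current = packet_state
        elif self.pending is None:
            self.pending = packet_state
        else:
            raise RuntimeError("thread memory full")

    def finish_current(self):
        """Return the finished packet state and promote the queued one."""
        done = self.current
        self.current, self.pending = self.pending, None
        return done


mem = ThreadMemory()
mem.assign({"dram_addr": 0x1000, "seq": 7})
mem.assign({"dram_addr": 0x1400, "seq": 8})   # queued while seq 7 is in progress
room_when_full = mem.has_room()
finished = mem.finish_current()
```

Queuing one packet ahead lets a thread start its next packet immediately on completion instead of waiting for the dispatcher to notice that it has become free.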
There are multiple methods by which the thread 310 can communicate its availability and the packet dispatcher 302 can assign a packet to that thread 310. A thread 310 can communicate its availability through a Control and Status Register (CSR) 316. Each thread can write to a few bits of the CSR 316. The packet dispatcher 302 can read and clear the CSR 316, thus providing the status of many threads at one time. Alternatively, the dispatcher 302 and threads 310 can communicate via "mailboxes" 314. The thread 310 can signal its availability by flagging or placing an identifier in the mailbox 314. The dispatcher polls each thread mailbox until it identifies an available thread. The dispatcher 302 can write the packet state information to the mailbox 314 for the available thread.
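The CSR-based signaling described above can be sketched as a shared bit field with a read-and-clear operation. The sketch assumes one availability bit per thread (the text says only "a few bits") and uses a lock to stand in for the atomicity that hardware would provide; `MessageCSR` and `available_threads` are hypothetical names. The mailbox alternative is not shown.

```python
import threading


class MessageCSR:
    """Hypothetical model of a message CSR with one availability bit per thread."""

    def __init__(self):
        self._bits = 0
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def set_available(self, thread_id):
        """Called by a thread to flag itself as available."""
        with self._lock:
            self._bits |= 1 << thread_id

    def read_and_clear(self):
        """Called by the dispatcher: fetch every flag at once and reset them."""
        with self._lock:
            bits, self._bits = self._bits, 0
        return bits


def available_threads(bits):
    """Decode a CSR value into a list of thread ids, lowest id first."""
    ids = []
    thread_id = 0
    while bits:
        if bits & 1:
            ids.append(thread_id)
        bits >>= 1
        thread_id += 1
    return ids


csr = MessageCSR()
csr.set_available(3)
csr.set_available(0)
ready = available_threads(csr.read_and_clear())
```

A single read gives the dispatcher the status of many threads at once, whereas the mailbox approach requires polling each mailbox in turn.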
The threads 310 in the pool can finish their assignment at any time. Some will take a long time, probing deep into the packet header. Others will finish early. Once the thread 310 is finished processing the packet, the thread sends the packet information to the AISR ring 308 in the location of the sequence number given to the packet during initial processing. The thread 310 is now available to process the next packet and signals its availability to the packet dispatcher 302. The reorder dequeue 312 cycles through the AISR ring 308 and dequeues the packets to the network based on the order the packets were received.
A backlog (or bottleneck) can result when the microengine receives an above-average number of packets that require in-depth processing. If the dispatcher 302 receives a new data packet from the network at a time when all the threads 310 are processing assigned data packets, then the dispatcher 302 is forced to drop the new packet, leave the packet in the packet receiver buffer 304 or find temporary storage for it. The dispatcher 302 has a memory 318. Similar to the AISR ring 308 discussed earlier, the dispatcher memory 318 is a ring that allows the dispatcher 302 to assign packet state information to a slot in the memory ring. The dispatcher 302 continues assigning newly enqueued packet state information sequentially to the slots of the memory ring 318. When threads 310 in the pool of threads become available, the dispatcher 302 assigns packet information starting with the oldest saved slot and sequentially assigns packets to the memory of newly available threads 310.
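The dispatcher memory ring described above behaves like a bounded first-in, first-out buffer. The following sketch assumes a head index plus a count to track occupancy; the class `DispatcherRing` and its method names are illustrative and do not come from the specification.

```python
class DispatcherRing:
    """Hypothetical fixed-size ring holding backlogged packet state."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0     # index of the oldest saved slot
        self.count = 0    # number of occupied slots

    def is_full(self):
        return self.count == len(self.slots)

    def enqueue(self, packet_state):
        """Save state in the next sequential slot; False means the ring is full."""
        if self.is_full():
            return False
        tail = (self.head + self.count) % len(self.slots)
        self.slots[tail] = packet_state
        self.count += 1
        return True

    def dequeue_oldest(self):
        """Hand the oldest saved packet state to a newly available thread."""
        if self.count == 0:
            return None
        packet_state = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return packet_state


backlog = DispatcherRing(2)
accepted = [backlog.enqueue(s) for s in ("seq0", "seq1", "seq2")]
oldest = backlog.dequeue_oldest()
```

A `False` return from `enqueue` is the point at which the overflow options discussed in the following paragraphs (a backup ring, scratch memory, or DRAM) would come into play.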
If the backlog continues to the extent that all the slots of the dispatcher memory ring 318 are filled, in one embodiment the dispatcher starts to assign slots to a backup memory ring 320. This process is similar to the process of assigning and retrieving slot information from the memory ring 318. The difference is that the backup ring can use memory that would normally be allocated to other resources when there is no need for the backup ring. In another embodiment, the primary dispatcher memory ring 318 is made larger in order to handle the largest bottleneck of packet processing.
In one embodiment, the dispatcher 302 can use the microengine scratch memory 322 to store packet information. If a packet-processing bottleneck causes all the slots in the dispatcher memory 318 to become filled, the dispatcher 302 can assign packet information to the microengine scratch memory 322. Once the bottleneck is relieved, the dispatcher 302 assigns the packet information in the scratch memory 322 to the available thread memory 310. The dispatcher 302 can also assign packet information to the DRAM 306 if the dispatcher memory 318 and the scratch memory 322 are filled due to the bottleneck. The dispatcher 302 can also assign packet information to the DRAM 306 if the dispatcher memory 318 is filled and the scratch memory 322 is filled with other data assigned to scratch memory by the microengine processor. The process provides for efficient storage of packet information during bottlenecks while restraining the use of DRAM 306 bandwidth and other memory resources of the microengine.
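The tiered overflow policy described above (dispatcher memory first, then scratch memory, then DRAM) can be sketched as follows. The class `TieredBacklog`, the tier names, and the capacities are hypothetical, packet information is reduced to a bare sequence number, and DRAM is treated as unbounded for simplicity. The rule used to pick the next packet for a thread (the oldest sequence number across all tiers) is an assumption; the text does not specify the drain order between tiers.

```python
from collections import deque


class TieredBacklog:
    """Hypothetical model: spill from dispatcher ring to scratch to DRAM."""

    def __init__(self, ring_slots, scratch_slots):
        # (name, queue, capacity); capacity None means unbounded in this sketch.
        self.tiers = [
            ("dispatcher_ring", deque(), ring_slots),
            ("scratch", deque(), scratch_slots),
            ("dram", deque(), None),
        ]

    def store(self, seq):
        """Place packet info in the first tier with a free slot; return its name."""
        for name, queue, capacity in self.tiers:
            if capacity is None or len(queue) < capacity:
                queue.append(seq)
                return name

    def next_for_thread(self):
        """Return the oldest stored sequence number, or None if nothing is stored.

        Each tier is filled in arrival order, so its head is its oldest entry.
        """
        candidates = [queue for _, queue, _ in self.tiers if queue]
        if not candidates:
            return None
        oldest = min(candidates, key=lambda queue: queue[0])
        return oldest.popleft()


tiers = TieredBacklog(ring_slots=2, scratch_slots=1)
placement = [tiers.store(seq) for seq in range(4)]
first_out = tiers.next_for_thread()
```

Filling the cheapest tier first is what restrains DRAM bandwidth use: DRAM is touched only when both smaller memories have no free slot.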
Referring to FIG. 4, the flowchart shows the processing of data packets 400 by the microengine. The data packet is received from the network into the receiver buffer 402. The dispatcher gives the data packet a packet sequence number and assigns a location in memory for the thread information 404. The sequence number allows the packets to be processed by the threads in an order independent of the order in which the packets will be dequeued back to the network or general processor. The threads independently communicate to the packet dispatcher regarding their available state 406. A thread 408 in the pool can make itself available even when it is busy processing a packet. The thread 408 stores the packet it is processing and stores the next packet intended for processing by the thread. This allows each thread 408 to handle two packets at a time. Once the dispatcher determines an available location in a thread 408, the packet dispatcher assigns the packet information to the memory of the available thread 416. If the dispatcher determines that there are no available threads at that time 408, the packet dispatcher stores the packet information temporarily in memory 410. The packet dispatcher continues to receive packets, process the packets (e.g., assign a sequence number, a storage location, and determine reassembly information), and store the packet information in the next sequential memory slot 412.
Once the dispatcher determines a thread is available 414, the dispatcher sends the packet information into the available thread's local memory 416. The thread processes the packet and then sends the packet information to the AISR ring in memory based on the sequence number in the packet information 420. The reorder dequeue sequentially pulls the packet information from the ring and sends the packet to the packet's future destination 422. In the case of a router, the packet would be sent onto the network to the next router on the packet's path to its final destination.
Referring to FIG. 5, the dispatcher determines the most efficient location to store the packet information 500. By storing the packet information in a variety of locations, the dispatcher can efficiently use the microengine's memory and handle overflow produced by a bottleneck in thread processing. The packet is initially received into the receiver buffer 502. The dispatcher assigns the packet payload a location in memory and a sequence number 504. The dispatcher deter