CN101040489B - Network device architecture for consolidating input/output and reducing latency - Google Patents

Network device architecture for consolidating input/output and reducing latency

Info

Publication number
CN101040489B
Authority
CN
China
Prior art keywords
frame
buffer
virtual lane
received frame
packet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200580034646.0A
Other languages
Chinese (zh)
Other versions
CN101040489A (en)
Inventor
Silvano Gai
Thomas Edsall
Davide Bergamasco
Dinesh Dutt
Flavio Bonomi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cisco Technology Inc
Original Assignee
Cisco Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 11/094,877 (US7830793B2)
Application filed by Cisco Technology Inc
Publication of CN101040489A
Application granted
Publication of CN101040489B
Legal status: Active
Anticipated expiration

Abstract

The present invention provides methods and devices for implementing a Low Latency Ethernet ("LLE") solution, also referred to herein as a Data Center Ethernet ("DCE") solution, which simplifies the connectivity of data centers and provides a high bandwidth, low latency network for carrying Ethernet and storage traffic. Some aspects of the invention involve transforming FC frames into a format suitable for transport on an Ethernet. Some preferred implementations of the invention implement multiple virtual lanes ("VLs") in a single physical connection of a data center or similar network. Some VLs are "drop" VLs, with Ethernet-like behavior, and others are "no drop" lanes with FC-like behavior. Some preferred implementations of the invention provide guaranteed bandwidth based on credits and VLs. Active buffer management allows for both high reliability and low latency while using small frame buffers. Preferably, the rules for active buffer management are different for drop and no drop VLs.

Description

Network device architecture for consolidating input/output and reducing latency
Cross-Reference to Related Applications
This application claims priority to U.S. Provisional Application No. 60/621,396 (attorney docket No. CISCP404P), entitled "FC Over Ethernet" and filed on October 22, 2004, and to U.S. Patent Application No. 11/094,877 (attorney docket No. CISCP417), entitled "Network Device Architecture For Consolidating Input/Output And Reducing Latency" and filed on March 30, 2005, the entire contents of which are incorporated herein by reference.
Background
Fig. 1 shows a simplified version of a data center of the general type that an enterprise requiring high availability and network storage (e.g., a financial institution) might use. Data center 100 includes redundant Ethernet switches with redundant connections for high availability. Data center 100 is connected via firewall 115 to network 105, which may be, for example, an enterprise intranet, a DMZ and/or the Internet. Ethernet is well suited for TCP/IP traffic between clients (e.g., remote clients 180 and 185) and a data center.
Within data center 105 there are many network devices. For example, servers are typically arranged in racks having a standard form factor (e.g., one "rack unit" may be 19" wide and about 1.25" thick). A "rack unit" or "U" is an Electronic Industries Alliance (or more commonly "EIA") standard measuring unit for rack-mount type equipment. This term has become more prevalent in recent times due to the proliferation of rack-mount products appearing in a wide range of commercial, industrial and military markets. A "rack unit" is equal to 1.75" in height. To calculate the internal usable space of a rack enclosure, simply multiply the total number of rack units by 1.75". For example, a 44U rack enclosure would have 77" of internal usable space (44 x 1.75). Each rack of a data center may have, e.g., about 40 servers. A data center may have thousands of servers, or even more. Recently, some vendors have introduced "blade servers," which allow even higher-density packing of servers (on the order of 60 to 80 servers per rack).
However, as the number of network devices in a data center increases, connectivity becomes increasingly complex and expensive. At a minimum, the servers, switches, etc. of data center 105 will typically be connected via Ethernet. For high availability, there will be at least 2 Ethernet connections, as shown in Fig. 1.
Moreover, it is not desirable for servers to include large amounts of storage capacity. For this reason and other reasons, it has become increasingly common for an enterprise network to include connectivity with storage devices such as storage array 150. Historically, storage traffic has been implemented over SCSI (Small Computer System Interface) and/or FC (Fibre Channel).
In the mid-1990's, SCSI traffic could only travel short distances. The topic of key interest at the time was how to make SCSI go "outside the box." Greater speed, as always, was desired. At the time, Ethernet was moving from 10 Mb/s to 100 Mb/s. Some envisioned a future speed of up to 1 Gb/s, but many viewed this as approaching a physical limit. With 10 Mb/s Ethernet, there were the issues of half-duplex operation and of collisions. Ethernet was regarded as somewhat unreliable, in part because packets could be lost and because there could be collisions. (Although the terms "packet" and "frame" may have somewhat different meanings as normally used by those of skill in the art, the terms will be used interchangeably herein.)
FC was considered an attractive and reliable option for storage applications, because under the FC protocol frames are not intentionally dropped and because FC could run at 1 Gb/s. However, by 2004, both Ethernet and FC had reached speeds of 10 Gb/s. In addition, Ethernet had evolved to the full-duplex, collision-free stage. Accordingly, FC no longer had a speed advantage as compared to Ethernet. However, congestion in a switch may cause Ethernet packets to be dropped, and this is an undesirable feature as far as storage traffic is concerned.
During the first years of the 21st century, a significant amount of work went into the development of iSCSI in order to implement SCSI over TCP/IP networks. Although these efforts met with some success, iSCSI has not become very popular: iSCSI has about 1%-2% of the storage network market, as compared to approximately 98%-99% for FC.
One reason is that the iSCSI stack is somewhat complex as compared to the FC stack. Referring to Fig. 7A, it may be seen that iSCSI stack 700 requires 5 layers: Ethernet layer 705, IP layer 710, TCP layer 715, iSCSI layer 720 and SCSI layer 725. TCP layer 715 is a necessary part of the stack because Ethernet layer 705 may lose packets, but SCSI layer 725 does not tolerate lost packets. TCP layer 715 provides SCSI layer 725 with reliable packet transmission. However, TCP layer 715 is a difficult protocol to implement at speeds of 1 to 10 Gb/s. In contrast, because FC does not lose frames, there is no need for a layer such as TCP to compensate for lost frames. Therefore, as shown in Fig. 7B, FC stack 750 is simpler, requiring only FC layer 755, FCP layer 760 and SCSI layer 765.
Accordingly, the FC protocol is normally used for communication between servers on a network and storage devices such as storage array 150. Therefore, data center 105 includes FC switches 140 and 145, provided by Cisco Systems, Inc. in this example, for communication between servers 110 and storage array 150.
1RU and blade servers are very popular because they are relatively inexpensive, powerful, standardized and can run any of the most popular operating systems. It is well known that in recent years the cost of a typical server has been decreasing while its performance level has been increasing. Because of the relatively low cost of servers and the potential problems that can arise from having more than one type of software application run on one server, each server is typically dedicated to a particular application. The large number of applications that run on a typical enterprise network continues to increase the number of servers in a network.
However, because of the complexity of maintaining various types of connectivity (e.g., Ethernet and FC connectivity) with each server, wherein each type of connectivity is preferably redundant for high availability, the cost of connectivity for a server is becoming higher than the cost of the server itself. For example, a single FC interface for a server may cost as much as the server itself. A server's connection with an Ethernet is typically made via a network interface card ("NIC") and its connection with an FC network is made with a host bus adaptor ("HBA").
The roles of devices in an FC network and an Ethernet network are somewhat different as regards network traffic, mainly because packets are routinely dropped in response to congestion in a TCP/IP network, whereas frames are not intentionally dropped in an FC network. Accordingly, FC will sometimes be referred to herein as one example of a "no drop" network, whereas Ethernet will be referred to as one manifestation of a "drop" network. When packets are dropped on a TCP/IP network, the system will recover quickly, e.g., in a few hundred microseconds. However, the protocols of an FC network are generally based upon the assumption that frames will not be dropped. Therefore, when frames are dropped on an FC network, the system does not recover quickly and SCSI may take minutes to recover.
Currently, a port of an Ethernet switch may buffer a packet for up to about 100 milliseconds before dropping it. Because 10 Gb/s Ethernet has been implemented, each port of an Ethernet switch would need approximately 100 MB of RAM in order to buffer packets for 100 milliseconds. This would be prohibitively expensive.
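The figure above follows from a simple rate-times-time calculation; exact arithmetic gives 125 MB, which the text rounds to roughly 100 MB:

```python
# Back-of-the-envelope check of the per-port buffering figure:
# at 10 Gb/s line rate, holding packets for 100 ms requires
# about a gigabit of buffer memory per port.

LINE_RATE_BPS = 10e9  # 10 Gb/s Ethernet
HOLD_TIME_S = 0.100   # 100 ms before a packet may be dropped

buffer_bits = LINE_RATE_BPS * HOLD_TIME_S
buffer_mb = buffer_bits / 8 / 1e6  # decimal megabytes

print(f"per-port buffer: {buffer_mb:.0f} MB")  # → per-port buffer: 125 MB
```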
For some enterprises, it is desirable to "cluster" more than one server, as indicated by the dashed line around servers S2 and S3 in Fig. 1. Clustering allows a number of servers to be viewed as a single server. For clustering, it is desirable to perform remote direct memory access ("RDMA"), wherein the contents of one virtual memory space (which may be scattered among many physical memory spaces) can be copied to another virtual memory space without CPU intervention. The RDMA should be performed with very low latency. In some enterprise networks there is a third kind of network, dedicated to clustering servers, as indicated by switch 175. This may be, for example, a "Myrinet," a "Quadrix" or an "Infiniband" network.
Accordingly, the clustering of servers adds a further complicating factor to data center networks. However, unlike Quadrix and Myrinet, Infiniband allows for the possibility of clustering and offers the possibility of simplifying data center networks. Infiniband network devices are relatively inexpensive, mainly because they use small buffer spaces, copper media and simple forwarding schemes.
However, Infiniband networks have several shortcomings. For example, there is currently only one source of components for Infiniband switches. In addition, Infiniband has not been proven to work properly in the context of, e.g., a large enterprise's data center. For example, there are no known implementations of Infiniband routers for interconnecting Infiniband subnets. While there are gateways between Infiniband and Fibre Channel, and between Infiniband and Ethernet, it is very improbable that Ethernet will be removed from data centers. This also means that hosts would need not only an Infiniband connection, but also an Ethernet connection.
Therefore, even if a large enterprise were willing to ignore the foregoing shortcomings and change to an Infiniband-based system, the enterprise would need to keep its legacy data center network (e.g., as shown in Fig. 1) installed and working while the Infiniband-based system was being tested. Accordingly, the cost of an Infiniband-based system would not be an alternative cost, but an additional cost.
There is a great need for simplifying data center networks in a manner that allows an evolutionary change from existing data center networks. An ideal system would provide an evolutionary system for consolidating server I/O at low cost while providing both low latency and high speed.
Summary of the Invention
The present invention provides methods and devices for implementing a Low Latency Ethernet ("LLE") solution, also referred to herein as a Data Center Ethernet ("DCE") solution, which simplifies the connectivity of data centers and provides a high bandwidth, low latency network for carrying Ethernet and storage traffic. Some aspects of the invention involve transforming FC frames into a format suitable for transport on an Ethernet.
Some preferred implementations of the invention implement multiple virtual lanes ("VLs," also referred to herein as virtual links) in a single physical connection of a data center or similar network. Some VLs are "drop" VLs, with Ethernet-like behavior, and others are "no drop" lanes with FC-like behavior. Some implementations provide intermediate behaviors between "drop" and "no drop." Some such implementations are "delayed drop," wherein frames are not immediately dropped when a buffer fills; instead, there is an upstream "push back" for a limited time (e.g., on the order of milliseconds) before a frame is dropped.
VLs may be implemented, in part, by tagging frames. Because each VL may have its own credits, each VL may be treated independently. The performance of each VL may even be determined according to the credits assigned to the VL and the rate at which the credits are replenished. To allow for richer topologies and to allow better management of frames inside a switch, TTL information and a frame length field may be added to a frame. There may also be encoded information regarding congestion, so that a source may receive an explicit message to slow down.
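The tagging just described can be sketched as follows. This is a minimal illustration, not the frame format of the patent's Fig. 4: the field names, widths and 4-byte packing are assumptions made for the example.

```python
import struct

# Illustrative encoding of the extra per-frame DCE fields described
# above: a virtual-lane tag, TTL, frame length and a congestion bit.

def pack_dce_fields(vl: int, ttl: int, frame_len: int, congested: bool) -> bytes:
    """Pack the extra per-frame fields into 4 bytes (network byte order)."""
    if not 0 <= vl < 16:
        raise ValueError("2 to 16 VLs are typical; the tag fits in 4 bits here")
    flags = (vl << 1) | int(congested)  # 4-bit VL tag plus congestion bit
    return struct.pack("!BBH", flags, ttl, frame_len)

def unpack_dce_fields(raw: bytes):
    flags, ttl, frame_len = struct.unpack("!BBH", raw)
    return (flags >> 1) & 0xF, ttl, frame_len, bool(flags & 1)

# Round-trip check: tag a frame for VL 3 with TTL 64 and length 1518.
hdr = pack_dce_fields(vl=3, ttl=64, frame_len=1518, congested=False)
assert unpack_dce_fields(hdr) == (3, 64, 1518, False)
```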
Some preferred implementations of the invention provide guaranteed bandwidth based on credits and VLs. Different guaranteed bandwidths, which may change over time, can be assigned to different VLs. Preferably, a VL will keep the same characteristics (e.g., it will remain a drop or a no drop lane), but the bandwidth of the VL may be changed dynamically depending on the time of day, the tasks to be accomplished, and so on.
Active buffer management allows for both high reliability and low latency while using small frame buffers, even with a 10 Gb/s Ethernet. Preferably, the rules for active buffer management are different for different types of VLs, e.g., for drop and no drop VLs. Some embodiments of the invention are implemented using copper media rather than fiber optics. Given all of these attributes, I/O consolidation may be achieved in a competitive, relatively inexpensive manner.
Some aspects of the invention provide a method of processing more than one type of network traffic in a single network device. The method includes the following steps: receiving first and second frames on a physical link that is logically partitioned into a plurality of virtual lanes; partitioning a buffer of the network device into a first buffer space for first frames received on a first virtual lane and a second buffer space for second frames received on a second virtual lane; storing the first frame in the first buffer space; storing the second frame in the second buffer space; applying a first set of rules to the first frame, wherein the first set of rules causes the first frame to be dropped in response to latency; and applying a second set of rules to the second frame, wherein the second set of rules does not cause the second frame to be dropped in response to latency. The method may also include the step of assigning a guaranteed minimum buffer space to each virtual lane.
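The two rule sets above can be sketched as follows, under assumed class names and capacities: each virtual lane owns a slice of the buffer, and only the drop lane discards frames under pressure, while the no-drop lane signals push-back instead.

```python
# Minimal sketch of per-lane rule application: one buffer partition per
# virtual lane, with "drop" rules on one lane and "no drop" rules on
# the other. Capacities and names are illustrative assumptions.

class VlBuffer:
    def __init__(self, capacity_bytes: int, no_drop: bool):
        self.capacity = capacity_bytes
        self.no_drop = no_drop
        self.frames = []
        self.used = 0

    def receive(self, frame: bytes) -> str:
        if self.used + len(frame) > self.capacity:
            if self.no_drop:
                return "pushback"  # no-drop rules: never drop for latency
            return "dropped"       # drop rules: discard under pressure
        self.frames.append(frame)
        self.used += len(frame)
        return "stored"

# Partition one physical buffer between a drop VL and a no-drop VL.
drop_vl = VlBuffer(capacity_bytes=3000, no_drop=False)
nodrop_vl = VlBuffer(capacity_bytes=3000, no_drop=True)

assert drop_vl.receive(b"x" * 1500) == "stored"
assert drop_vl.receive(b"x" * 1500) == "stored"
assert drop_vl.receive(b"x" * 1500) == "dropped"   # over capacity: drop
assert nodrop_vl.receive(b"x" * 1500) == "stored"
nodrop_vl.used = nodrop_vl.capacity
assert nodrop_vl.receive(b"x" * 1) == "pushback"   # over capacity: flow control
```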
The first frame may be an Ethernet frame, for example an extended Ethernet frame as described herein. In some implementations, the first set of rules causes the first frame to be dropped in response to latency, whereas the second set of rules does not cause the second frame to be dropped in response to latency. However, the second set of rules may cause a frame to be dropped in order to avoid deadlock. The first set of rules and/or the second set of rules may cause an explicit congestion notification to be transmitted from the network device in response to latency. The explicit congestion notification may be transmitted to a source device or to an edge device, and may be transmitted via a data frame or a control frame.
Flow control on "no drop" and "delayed drop" VLs may be implemented by using buffer-to-buffer crediting schemes and/or PAUSE frames, in any convenient combination. For example, some implementations use a buffer-to-buffer crediting scheme within a network device and use PAUSE frames for flow control on the links. Accordingly, the second set of rules may include implementing a buffer-to-buffer crediting scheme for the second frames. The buffer-to-buffer crediting scheme includes crediting according to frame size, and may be implemented within the network device and/or on a network link. Within the network device, the buffer-to-buffer credits may be managed by an arbiter. If a buffer-to-buffer crediting scheme is used both within the network device and on a link, the credits managed within the network device are preferably not the same credits that are managed on the link.
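A minimal sketch of crediting "according to frame size," assuming a fixed buffer unit per credit (the unit size and credit counts are illustrative, not from the patent): the receiver grants credits in buffer units, a sender consumes credits in proportion to frame length and must stall when credits run out, and credits are returned as the receive buffers drain.

```python
# Sketch of a buffer-to-buffer crediting scheme for a no-drop VL.

CREDIT_UNIT = 512  # bytes represented by one credit (assumed)

class CreditedLink:
    def __init__(self, credits: int):
        self.credits = credits

    def can_send(self, frame_len: int) -> bool:
        return self._cost(frame_len) <= self.credits

    def send(self, frame_len: int) -> None:
        cost = self._cost(frame_len)
        if cost > self.credits:
            raise RuntimeError("no-drop VL: sender must wait for credits")
        self.credits -= cost

    def replenish(self, n: int) -> None:
        self.credits += n  # receiver returns credits as its buffers drain

    @staticmethod
    def _cost(frame_len: int) -> int:
        return -(-frame_len // CREDIT_UNIT)  # ceiling division

link = CreditedLink(credits=4)
link.send(1500)                  # costs ceil(1500/512) = 3 credits
assert link.credits == 1
assert not link.can_send(1500)   # would cost 3, only 1 left: sender stalls
link.replenish(2)
assert link.can_send(1500)
```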
The partitioning step may include allocating the buffer according to guaranteed minimum bandwidth allocation, buffer occupancy, time of day, traffic load, congestion, tasks known to require large bandwidth and maximum bandwidth allocation. The first and second frames may be stored in virtual output queues ("VOQs"). Each VOQ may be associated with a destination port/virtual lane pair. The method may include the step of performing buffer management in response to VOQ length, per-virtual-lane buffer occupancy, overall buffer occupancy and the age of a packet, wherein the age of a packet is the difference between the time at which the packet entered the buffer and the current time.
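The VOQ bookkeeping can be sketched as follows; the (destination port, virtual lane) keying and the packet-age definition are from the text, while the age threshold and names are assumptions:

```python
from collections import deque

# Sketch of VOQ bookkeeping: one queue per (destination port, virtual
# lane) pair, with each entry timestamped so a packet's "age"
# (current time minus enqueue time) can drive buffer management.

MAX_AGE = 0.5  # seconds before a queued packet is considered aged (assumed)

class VoqBuffer:
    def __init__(self):
        self.voqs = {}  # (dest_port, vl) -> deque of (enqueue_time, frame)

    def enqueue(self, dest_port, vl, frame, now):
        self.voqs.setdefault((dest_port, vl), deque()).append((now, frame))

    def voq_length(self, dest_port, vl):
        return len(self.voqs.get((dest_port, vl), ()))

    def aged_out(self, now):
        """Packets whose age exceeds MAX_AGE, one candidate trigger for
        buffer-management action on a drop VL."""
        return [
            (key, frame)
            for key, q in self.voqs.items()
            for t, frame in q
            if now - t > MAX_AGE
        ]

buf = VoqBuffer()
buf.enqueue(dest_port=1, vl=0, frame=b"a", now=0.0)
buf.enqueue(dest_port=1, vl=0, frame=b"b", now=0.4)
assert buf.voq_length(1, 0) == 2
assert [f for _, f in buf.aged_out(now=0.6)] == [b"a"]  # age 0.6 exceeds 0.5
```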
Some embodiments of the invention provide a network device that includes a plurality of ports configured for receiving frames on a plurality of physical links, and a plurality of line cards. Each port is in communication with one of the plurality of line cards. Each line card is configured to perform the following steps: receive frames from one of the plurality of ports; identify first frames received on a first virtual lane and second frames received on a second virtual lane; partition a buffer of the line card into a first buffer space for the first frames and a second buffer space for the second frames; store the first frames in the first buffer space; store the second frames in the second buffer space; apply a first set of rules to the first frames, wherein the first set of rules causes first frames to be dropped in response to latency; and apply a second set of rules to the second frames, wherein the second set of rules does not cause second frames to be dropped in response to latency.
Other implementations of the invention provide a method of carrying more than one type of traffic on a single network device. The method includes the following steps: identifying first frames received on a first virtual lane and second frames received on a second virtual lane; dynamically partitioning a buffer of the network device into a first buffer space for the first frames and a second buffer space for the second frames; storing the first frames in a first VOQ of the first buffer space; storing the second frames in a second VOQ of the second buffer space; applying a first set of rules to the first frames, wherein the first set of rules causes first frames to be dropped in response to latency; and applying a second set of rules to the second frames, wherein the second set of rules does not cause second frames to be dropped in response to latency. The buffer may be dynamically partitioned according to factors such as overall buffer occupancy, per-virtual-lane buffer occupancy, time of day, traffic load, congestion, guaranteed minimum bandwidth allocation, tasks known to require large bandwidth and maximum bandwidth allocation.
Still other embodiments of the invention provide a network device that includes a plurality of ports configured for receiving frames on a plurality of physical links, and a plurality of line cards. Each port is in communication with one of the plurality of line cards. Each line card is configured to perform the following steps: identify first frames received on a first virtual lane and second frames received on a second virtual lane; dynamically partition a buffer of the network device into a first buffer space for the first frames and a second buffer space for the second frames; store the first frames in a first virtual output queue ("VOQ") of the first buffer space; store the second frames in a second VOQ of the second buffer space; apply a first set of rules to the first frames, wherein the first set of rules causes first frames to be dropped in response to latency; and apply a second set of rules to the second frames, wherein the second set of rules does not cause second frames to be dropped in response to latency. The buffer may be dynamically partitioned according to factors such as overall buffer occupancy, per-virtual-lane buffer occupancy, time of day, traffic load, congestion, guaranteed minimum bandwidth allocation, tasks known to require large bandwidth and maximum bandwidth allocation. The methods described herein may be implemented and/or embodied in various ways, including via hardware, software and the like.
Brief Description of the Drawings
The invention may best be understood by reference to the following description taken in conjunction with the accompanying drawings, which illustrate specific implementations of the present invention.
Fig. 1 is a simplified network diagram that illustrates a data center.
Fig. 2 is a simplified network diagram that illustrates a data center according to one embodiment of the invention.
Fig. 3 is a block diagram that illustrates multiple VLs implemented across a single physical link.
Fig. 4 illustrates one format of an Ethernet frame that carries additional fields for implementing DCE according to some implementations of the invention.
Fig. 5 illustrates one format of a link management frame according to some implementations of the invention.
Fig. 6A is a network diagram that illustrates a simplified credit-based method of the invention.
Fig. 6B is a table that illustrates a crediting method of the invention.
Fig. 6C is a flow chart that outlines one exemplary method for initializing a link according to the invention.
Fig. 7A illustrates an iSCSI stack.
Fig. 7B illustrates a stack for implementing SCSI over FC.
Fig. 8 illustrates a stack for implementing SCSI over DCE according to some aspects of the invention.
Figs. 9A and 9B illustrate methods for implementing FC over Ethernet according to some aspects of the invention.
Fig. 10 is a simplified network diagram for implementing FC over Ethernet according to some aspects of the invention.
Fig. 11 is a simplified network diagram for aggregating DCE switches according to some aspects of the invention.
Fig. 12 illustrates the architecture of a DCE switch according to some embodiments of the invention.
Fig. 13 is a block diagram that illustrates buffer management per VL according to some implementations of the invention.
Fig. 14 is a network diagram that illustrates some types of explicit congestion notification according to the invention.
Fig. 15 is a block diagram that illustrates buffer management per VL according to some implementations of the invention.
Fig. 16 is a graph that illustrates a probabilistic drop function according to some aspects of the invention.
Fig. 17 is a graph that illustrates exemplary occupancy of a VL buffer over time.
Fig. 18 is a graph that illustrates a probabilistic drop function according to other aspects of the invention.
Fig. 19 illustrates a network device that may be configured to perform some methods of the invention.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention, including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. Moreover, numerous specific details are set forth below in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to obscure the present invention.
The present invention provides methods and devices for simplifying the connectivity of data centers and for providing a high bandwidth, low latency network for carrying Ethernet and storage traffic. Some preferred implementations of the invention implement multiple VLs in a single physical connection of a data center or similar network. Preferably, buffer-to-buffer credits are maintained for each VL. Some VLs are "drop" VLs, with Ethernet-like behavior; others are "no drop" lanes with FC-like behavior.
Some implementations provide intermediate behaviors between "drop" and "no drop." Some such implementations are "delayed drop," wherein frames are not immediately dropped when a buffer fills; instead, there is an upstream "push back" for a limited time (e.g., on the order of milliseconds) before a frame is dropped. Delayed drop implementations are useful for managing transient congestion.
Preferably, a congestion control scheme is implemented at layer 2. Some preferred implementations of the invention provide guaranteed bandwidth based on credits and VLs. An alternative to the use of credits is the use of the standard IEEE 802.3 PAUSE frame per VL in order to implement the "no drop" or "delayed drop" VLs. The IEEE 802.3 standard is hereby incorporated by reference for all purposes. For example, Annex 31B of the IEEE 802.3ae-2002 standard, entitled "MAC Control PAUSE Operation," is specifically incorporated by reference. It will also be understood that the invention could be implemented without VLs, but in that case the entire link would exhibit a "drop," "delayed drop" or "no drop" behavior.
Preferred implementations support a negotiation mechanism, for example one such as is specified by IEEE 802.1x, which is hereby incorporated by reference. The negotiation mechanism may, for example, determine whether a host device supports LLE and, if so, allow the host to receive VL and credit information, e.g., how many VLs are supported, whether the VLs use credits or pauses, if credits, how many credits, what the behavior of each individual VL is, etc.
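The negotiation could look roughly like the following; the message shape and field names are assumptions for illustration (the actual mechanism would be along the lines of IEEE 802.1x):

```python
# Sketch of LLE link negotiation: learn whether the peer supports LLE
# and, per VL, whether flow control uses credits or pauses, how many
# credits, and the drop behavior. Legacy peers fall back to one
# implicit drop VL, as described in the text.

def negotiate(peer_caps: dict) -> dict:
    if not peer_caps.get("lle"):
        # Legacy peer: map all traffic to a single implicit VL with drop behavior.
        return {"vls": [{"mode": "drop", "flow_control": "none"}]}
    vls = []
    for vl in peer_caps["vls"]:
        cfg = {"mode": vl["mode"], "flow_control": vl["flow_control"]}
        if vl["flow_control"] == "credit":
            cfg["credits"] = vl["credits"]
        vls.append(cfg)
    return {"vls": vls}

legacy = negotiate({"lle": False})
assert len(legacy["vls"]) == 1 and legacy["vls"][0]["mode"] == "drop"

lle = negotiate({"lle": True, "vls": [
    {"mode": "drop", "flow_control": "pause"},
    {"mode": "no_drop", "flow_control": "credit", "credits": 8},
]})
assert lle["vls"][1]["credits"] == 8
```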
Active buffer management allows for both high reliability and low latency while using small frame buffers. Preferably, the rules for active buffer management are different for drop and no drop VLs.
Some implementations of the invention support an efficient RDMA protocol that is particularly useful for clustering implementations. In some implementations of the invention, a network interface card ("NIC") implements RDMA for clustering applications and also implements a reliable transport for RDMA. Some aspects of the invention are implemented via user APIs from the User Direct Access Programming Library ("uDAPL"). The uDAPL defines a set of user APIs for all RDMA-capable transports and is hereby incorporated by reference.
Fig. 2 is the simplification network diagram of an example that the LLE solution of the connectedness that is used for reduced data center 200 is shown.Data center 200 comprises LLE switch 240, and it has the router two 60 that is used for via fire compartment wall 215 and the TCP/IP network 205 and the connectedness of main process equipment 280 and 285.The architecture of exemplary LLE switch is here set forth in detail.Preferably, LLE switch of the present invention can move the 10Gb/s Ethernet, and has less relatively frame buffer.Some preferred LLE switches are only supported layer 2 feature.
Though LLE switch of the present invention can utilize optical fiber and optical transceiver to realize that some preferred LLE utilize the copper connectedness to realize, so that reduce cost.Some such implementations are to realize that according to the IEEE 802.3ak standard of proposing this standard is called as 10Base-CX4, and is by reference that it is incorporated, is used for all purposes here.The inventor expects that other implementations will use emerging standard IEEE P802.3an (10GBASE-T), and is also by reference that it is incorporated here, is used for all purposes.
Server 210 also is connected with LLE switch 245, and LLE switch 245 comprises and is used for the FC gateway 270 of communicating by letter with disk array 250.FC gateway 270 is realized FC (will be described in greater detail) here on Ethernet, thereby eliminated the independent FC and the needs of ethernet network are arranged in data center 200.Gateway 270 can be the such equipment of MDS 900 IP storage services module such as Cisco Systems, and this equipment has been configured and has been useful on the software of carrying out certain methods of the present invention.Ethernet traffic is carried in the data center 200 by unprocessed form.Why can be that it can also carry FC on the Ethernet (FC over Ethernet) and RDMA except original Ethernet because LLE is the expansion of Ethernet like this.
Fig. 3 shows two switches 305 and 310 connected by physical link 315. In general, the behavior of switches 305 and 310 is subject to IEEE 802.1, and the behavior of physical link 315 is subject to IEEE 802.3. Broadly speaking, the present invention provides two general behaviors of LLE switches, plus a range of intermediate behaviors. The first general behavior is "drop" behavior, which is similar to that of Ethernet. The second general behavior is "no drop" behavior, which is similar to that of FC. The present invention also provides intermediate behaviors between the "drop" and "no drop" behaviors, including but not limited to the "delayed drop" behavior described elsewhere herein.
In order to implement both kinds of behavior on the same physical link 315, the present invention provides methods and devices for implementing VLs. VLs are a way of partitioning a physical link into multiple logical entities such that traffic on one VL is unaffected by traffic on other VLs. This is accomplished by maintaining a separate buffer (or a separate portion of a physical buffer) for each VL. For example, one VL may be used to carry control-plane traffic and other high-priority traffic without being blocked by low-priority bulk traffic on another VL. VLANs may be grouped into different VLs so that traffic in one group of VLANs can proceed without being blocked by traffic on other VLANs.
In the example depicted in Fig. 3, switches 305 and 310 effectively provide 4 VLs on physical link 315. Here, VLs 320 and 325 are drop VLs, and VLs 330 and 335 are no-drop VLs. In order to implement both "drop" behavior and "no drop" behavior simultaneously, at least one VL must be assigned to each kind of behavior, i.e., 2 in total. (In theory, there could be only one VL, temporarily assigned to each kind of behavior in turn, but such an implementation is not preferred.) In order to support legacy devices and/or other devices lacking LLE functionality, preferred implementations of the present invention support links with no VL, with all traffic of such a link mapped to a single VL of the LLE port. From a network management perspective, it is preferable to have 2 to 16 VLs, although more can be implemented.
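The per-VL isolation described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the patented implementation: the class name, the lane count, the per-lane slot capacity, and the FIFO-per-lane structure are all choices made for clarity here.

```python
from collections import deque

class VirtualLanePort:
    """Toy model of one LLE port: one independent FIFO per virtual lane,
    so congestion on one VL cannot block traffic queued on another VL."""

    def __init__(self, num_vls=4, no_drop_vls=(2, 3), capacity=8):
        self.queues = {vl: deque() for vl in range(num_vls)}
        self.no_drop_vls = set(no_drop_vls)   # FC-like lanes
        self.capacity = capacity              # per-VL buffer slots

    def enqueue(self, vl, frame):
        """Returns True if the frame was buffered. On overflow, a drop VL
        discards the frame (Ethernet-like); a no-drop VL refuses it,
        modeling back-pressure toward the sender (FC-like)."""
        q = self.queues[vl]
        if len(q) < self.capacity:
            q.append(frame)
            return True
        return False

port = VirtualLanePort()
# Fill drop VL 0 past capacity; no-drop VL 3 is unaffected by VL 0's congestion.
results = [port.enqueue(0, f"bulk-{i}") for i in range(10)]
ok_on_vl3 = port.enqueue(3, "storage-frame")
```

The key property, and the reason for separate per-VL buffers, is visible in the last line: VL 3 accepts its frame even though VL 0 has overflowed.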
Preferably, a link is dynamically partitioned into VLs, because static partitioning is less flexible. In some preferred implementations of the present invention, dynamic partitioning is achieved on a packet-by-packet (or frame-by-frame) basis, for example by adding an extension header. The present invention encompasses a variety of formats for such a header. In some implementations of the present invention, two types of frames are sent on a DCE link: data frames and link management frames.
Although Figs. 4 and 5 show, respectively, formats of an Ethernet data frame and a link management frame used to implement some aspects of the present invention, other implementations of the present invention provide frames having more or fewer fields, fields in a different order, or other variants. Fields 405 and 410 of Fig. 4 are the standard Ethernet fields for the destination address and source address of the frame, respectively. Similarly, protocol type field 430, payload 435 and CRC field 440 may be those of a standard Ethernet frame.
Protocol type field 420, however, indicates that the fields that follow are those of DCE header 425. If present, the DCE header is preferably as close as possible to the beginning of the frame, because this makes it easy to parse in hardware. The DCE header may be carried in Ethernet data frames, as shown in Fig. 4, as well as in link management frames (see the corresponding description of Fig. 5). This header is preferably stripped by the MAC and need not be stored in the frame buffers. In some implementations of the present invention, a continuous stream of link management frames is generated when there is no data traffic, or when regular frames cannot be sent due to lack of credits.
Most of the information carried in the DCE header relates to the Ethernet frame that contains the DCE header. However, some fields, such as the buffer credit field, are used to replenish credits for traffic flowing in the opposite direction. In this example, the buffer credit field is carried only by frames having the long DCE header. If a solution uses pause frames rather than credits, the credit fields may not be needed.
TTL field 445 indicates a time to live, which is a number decremented each time frame 400 is forwarded. Normally, a layer 2 network does not need a TTL field. Ethernet uses a spanning tree topology, which is very conservative. Spanning tree imposes restrictions on the active topology and allows only one path for a packet from one switch to another.
In preferred implementations of the present invention, this restriction on the active topology is not followed. Instead, multiple paths are preferably active at the same time, for example via a link-state protocol such as OSPF (Open Shortest Path First) or IS-IS (Intermediate System to Intermediate System). However, link-state protocols are known to cause transient loops during topology reconfiguration. Using a TTL or a similar feature ensures that transient loops do not become a major problem. Therefore, in preferred implementations of the present invention, a TTL is encoded in the frame, effectively implementing a link-state protocol at layer 2. As an alternative to using a link-state protocol, some implementations of the present invention use multiple spanning trees rooted at different LLE switches and obtain similar behavior.
Field 450 identifies the VL of frame 400. Identifying the VL according to field 450 allows a device to assign the frame to the appropriate VL and to apply different rules for different VLs. As described in detail elsewhere herein, the rules will differ according to various criteria, for example whether the VL is a drop or a no-drop VL, whether the VL has guaranteed bandwidth, whether there is currently congestion on the VL, and other factors.
ECN (explicit congestion notification) field 455 is used to indicate that the buffer (or the portion of a buffer allocated to this VL) is filling up, so that for the indicated VL the source should slow its transmission rate. In preferred implementations of the present invention, at least some host devices in the network can understand the ECN information and will apply a shaper, a/k/a a rate limiter, to the indicated VL. Explicit congestion notification can take place in at least two general ways. In one method, a packet is sent for the express purpose of conveying the ECN. In another method, the notification is "piggy-backed" on a packet that would have been transmitted anyway.
As described elsewhere, explicit congestion notifications may be sent to the source or to an edge device. ECNs can originate in various devices of the DCE network, including end devices and core devices. As discussed in detail in the switch architecture section below, congestion notification and the responses to it are an important part of controlling congestion while maintaining small buffer sizes.
Some implementations of the present invention allow an ECN to be sent upstream from the originating device, and/or allow an ECN to be sent downstream and then back upstream. For example, ECN field 455 may include a forward ECN portion ("FECN") and a backward ECN portion ("BECN"). When a switch port experiences congestion, it can set a bit in the FECN portion and forward the frame normally. Upon receiving a frame with the FECN bit set, an end station sets the BECN bit and sends the frame back to the source. The source receives the frame, detects that the BECN bit has been set, and decreases the traffic it is injecting into the network, at least for the indicated VL.
Frame credit field 465 is used to indicate the number of credits that should be allocated for frame 400. There are many possible ways of implementing such a system within the scope of the present invention. The simplest solution is to count one credit per packet or frame. From a buffer management perspective, this may not be the best solution: if a buffer is reserved for a single credit and each packet uses one credit, then an entire buffer is reserved for a single packet. Even if the buffers are only as large as the expected full-size frame, this credit-counting scheme often results in very low utilization of each buffer, because many frames will be smaller than the maximum size. For example, if a full-size frame is 9 KB and all buffers are 9 KB, but the average frame size is 1500 bytes, typically only about 1/6 of each buffer is used.
A better solution is to count credits according to frame size. Although a credit could be counted for, e.g., a single byte, in practice it is preferable to use a larger unit, for example 64B, 128B, 256B, 512B, 1024B, etc. For example, if credits are in units of 512B, the aforementioned average 1500-byte frame would require 3 credits. If such an implementation of the present invention transmits that frame, frame credit field 465 will indicate that the frame requires 3 credits.
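The size-based credit accounting described above reduces to a ceiling division, which can be illustrated with a short computation. The 512-byte unit matches the example in the text; the function name is an assumption of this sketch, and the patent equally allows other unit sizes (64B through 1024B).

```python
def frame_credits(frame_len_bytes, credit_unit=512):
    """Number of credits a frame consumes when one credit corresponds
    to credit_unit bytes of buffer space (ceiling division)."""
    return -(-frame_len_bytes // credit_unit)

# The 1500-byte average frame from the text needs 3 credits at 512B/credit,
# whereas per-frame crediting would pin an entire 9 KB buffer for it.
avg_frame = frame_credits(1500)    # 3 credits
jumbo_frame = frame_credits(9000)  # 18 credits
```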
Counting credits according to frame size allows buffer space to be used more efficiently. Knowing the size of a packet not only indicates how much buffer space will be needed, but also indicates when the packet can be removed from the buffer. This is especially important if, for example, the internal transfer rate of the switch differs from the rate at which data arrives at a switch port.
This example provides a longer version and a shorter version of the DCE header. Long header field 460 indicates whether the DCE header is the long version or the short version. In this implementation, all data frames contain at least the short header, which includes the TTL, VL, ECN and frame credit information in fields 445, 450, 455 and 465, respectively. A data frame may include the long header if, in addition to the information present in the short header, it needs to carry the credit information associated with each VL. In this example, there are 8 VLs and 8 corresponding fields for indicating the buffer credits of each VL. The use of both short and long DCE headers reduces the overhead of carrying credit information in every frame.
Some embodiments of the present invention cause a link management frame ("LMF") to be sent to announce credit information when there are no data frames to send. An LMF may also be used to carry buffer credits from the receiver, or to carry the frame credits sent by the sender. An LMF should be sent uncredited (frame credit = 0), because it is preferably consumed by the port and not forwarded. An LMF may be sent periodically and/or in response to predetermined conditions, for example after every 10 MB of payload has been transmitted by data frames.
Fig. 5 shows the example according to the LMF form of implementations more of the present invention.LMF 500 starts from the 6B Ethernet field 510 and 520 of standard, and they are respectively applied for the destination-address and the source address of frame.After 530 indications of protocol type head is DCE head 540, and this DCE head is short DCE head (for example long header fields=0) in this example.The VL of DCE head 540, TTL, ECN and the frame credits field person of being sent out are set to zero and the person of being received ignores.Therefore, LMF can be identified by following characteristic: Protocol Type=DCE Header and Long Header=0 and Frame Credit=0.
Recipient's buffer credit of field 550 indicative of active VL.In this example, there are 8 movable VL, the buffer credit of therefore indicating each movable VL by field 551 to 558.Similarly, the buffer credit of field 560 indication transmitting apparatus, the frame credit of therefore indicating each movable VL by field 561 to 568.
LMF 500 does not comprise any payload.If necessary, just as in this example, LMF 500 is filled field 570 and is filled into 64 bytes, to create the ethernet frame of legal minimal size.LMF 500 ends at the Ethernet crc field 580 of standard.
In general, buffer of the present invention counts scheme to buffer credit and realizes according to following two rules: (1) sender is sending this frame during more than or equal to the required credit number of the frame that will send from recipient's credit number; And (2) recipient sends credit to the sender when it can accept extra frame.As stated, utilize any among Frame or the LMF can replenish credit.Only when existing number to equal the credit of frame length (length of getting rid of the DCE head) at least, just allowing port is specific VL transmit frame.
If it is use pause frame rather than credit, then regular like the application class.The sender sends this frame when the frame person of not being received suspends.The recipient sends pause frame to the sender in the time can't accepting extra frame.
It below is the simplification example that transfer of data and credit are replenished.Fig. 6 A shows the Frame 605 that sends to switch A from switch b, and it has short DCE head.After the 605 arrival switch As that divide into groups, it will be stored in the storage space 608 of buffer 610.Owing to have some to be consumed in the memory of buffer 610, so the available credit of switch b will have corresponding minimizing.Similarly, when Frame 615 (also having short DCE head) by when switch A sends to switch b, Frame 615 will consume the storage space 618 of buffer 620, thus the credit that switch A can be used will correspondingly reduce.
But after frame 605 and 615 had been forwarded, the corresponding memory space will be available in the buffer of transmit leg switch.Sometime, for example periodically or as required, available once more this fact of this buffer space should be transmitted to the equipment of the link other end.Frame and LMF with long DCE head are used to replenish credit.If do not replenish credit, then can use short DCE head.Though some implementations transmit all and all use long DCE head, the efficient of this implementation is so not high, and this is because do not consumed the bandwidth that exceeds the quata for not comprising the packets of information of replenishing about credit.
Fig. 6 B shows an example of credit Signalling method of the present invention.Traditional credit signaling schemes 650 announcement recipients hope the new credit returned.For example, at moment t4, the recipient hopes to return 5 credits, therefore is worth 5 and is carried in the frame.At moment t5, the recipient has no credit and will return, and therefore is worth 0 and is carried in the frame.If in moment t4 LOF, then five credits are lost.
The credit value of DCE scheme 660 announcement accumulations.In other words, the new credit that each announcement will be returned is added to total digital-to-analogue m (for 8, m is 256) of the credit of before having returned.For example, at moment t3, the total number of credits that begins to return from link initialization is 3; At moment t4,, therefore be added to 3, and in frame, send 85 owing to need return 5 credits.At moment t5, need not return credit, thereby send 8 once more.If in moment t4 LOF, have no credit so and lose, because comprise identical information at moment t5 frame.
According to a kind of exemplary implementation of the present invention, recipient DCE switch is safeguarded following information (wherein the VL indication information is safeguarded to each tunnel):
● the modulus counter of BufCrd [VL]-increase progressively by the credit number that can send;
● the byte number that BytesFromLastLongDCE-has sent since last long DCE head;
● the byte number that BytesFromLastLMF-has sent since last LMF;
● MaxIntBetLongDCE-is at the largest interval that sends between the long DCE head;
● MaxIntBetLMF-is at the largest interval that sends between the LMF; And
● the modulus counter that FrameRx-increases progressively by the FrameCredit field of received frame.
Send the DCE switch ports themselves and safeguard following information:
● the last estimated value of LastBufCrd [VL]-recipient's BufCrd [VL] variable; And
● the modulus counter of FrameCrd [VL]-increase progressively by the credit number that is used for transmit frame.
When link establishment, the network equipment of each end of link will be consulted the existence of DCE head.If head does not exist, then the network equipment for example will make link can carry out the standard ethernet operation simply.If head does not exist, then the network equipment will be launched the characteristic of the DCE link of some aspects according to the present invention.
Fig. 6 C illustrates according to implementations more of the present invention flow chart of initialization DCE link how.The step that those of skill in the art recognize that method 680 (the same with additive method described herein) need not carried out by indicated order, and does not carry out by indicated order in some cases.In addition, some implementations of these methods comprise than indicated more or less step.
In step 661, the physical link between two switch ports themselves is set up, and in step 663, first divides into groups is received.In step 665, confirm whether (by recipient's port) this grouping has the DCE head.If no, then make this link can transmit the standard ethernet flow.If this grouping has the DCE head, then the port execution in step is being the DCE link with this link configuration.In step 671, all array zero clearings that recipient and sender will be relevant with the flow on the link.In step 673, the value of MaxIntBetLongDCE is initialized to the value of configuration, and in step 675, MaxIntBetLMF is initialized to the value of configuration.
In step 677, two DCE side mouths preferably exchange the available credit information of each VL through transmission LMF.If certain VL is not used, then its available credit is declared to be 0.In step 679, make link can transmit DCE, and comprise that the conventional DCE flow of Frame can send according to method described herein on this link.
In order correctly to work existing under the situation of single LOF, the maximum number of the credit that the DCE of preferred implementation requires to announce in the frame from Restoration Mechanism less than maximum can the announcement value 1/2.In some implementations of short DCE head, each credits field is 8, promptly equals 256 value.Thereby, in single frame, can announce the most nearly 127 extra credits.The maximum of 127 credits is reasonably, because worst case is by a lot of minimal size frame on the direction and the single huge frame representative on the rightabout.In the huge image duration of transmitting 9KB, the maximum number of minimal size frame is about 9220B/84B=110 credit (supposing the IPG of maximum trasfer unit and 20 bytes of 9200 bytes and leading).
If a plurality of continuous LOFs, then the LMF restoration methods can " be repaired " link.A this LMF restoration methods is based on following viewpoint, promptly in some implementations, is 16 by the internal counter of the port maintenance of DCE switch, but in order to save bandwidth, has only lower 8 in long DCE head, to send.If do not have successive frame lose this mode work get fine, as previously mentioned.When link experiences a plurality of continuous mistake, long DCE head maybe be no longer can coincidence counter, but this has realized through whole 16 LMF that comprises all counters.8 extra positions have allowed to recover many 256 times mistake, i.e. 512 continuous mistakes altogether.Preferably, before running into this situation, link is declared as and can not works and be reset.
In order to realize the low Ethernet system that postpones, must consider the flow of at least 3 kinds of general types.These types are IP network flow, storage flow and cluster flow.Like top detailed description, LLE provides the characteristic of the similar FC that is suitable for for example storing flow for " nothing abandons " VL." nothing abandons " VL can lost packets/frame, and can provide according to simple stack for example shown in Figure 8.Have only the last FC of little " sheet " LLE (FC over LLE) 810 to be between LLE layer 805 and the FC the 2nd layer (815).Layer 815,820 is identical with those of FC stack 750 with 825.Therefore, the storage application that operated in the past on the FC may operate on the LLE now.
The mapping of FC frame FC (FC over Ethernet) frame to the Ethernet of going up an exemplary implementation of FC layer 810 according to LLE is described referring now to Fig. 9 A, 9B and 10.Fig. 9 A is the simple version of FC frame.FC frame 900 comprises SOF 905 and EOF 910; They are orderly assemble of symbol; Not only be used to limit the border of frame 900, also being used to pass on kind, frame such as frame is beginning or finishing of sequence (one group of FC frame), and frame is normally or improper and so on information.In these symbols at least some are illegal " code is (code violation) in violation of rules and regulations " symbols.FC frame 900 also comprises the destination FC id field 920 and the payload 925 of 24 915,24 of source FC id fields.
A target of the present invention is on Ethernet, to pass on the stored information that comprises in the FC frame (for example the FC frame 900).Figure 10 shows of the present invention a kind of implementation of the LLE that is used for passing on this storage flow.Network 1000 comprises LLE cloud 1005, and equipment 1010,1015 and 1020 is attached to this LLE cloud.LLE cloud 1005 comprises a plurality of LLE switches 1030, and its example architecture is other local discussion in this article.Equipment 1010,1015 and 1020 can be main process equipment, server, switch or the like.Storage gateway 1050 links to each other LLE cloud 1005 with memory device 1075.From the purpose of mobile storage flow, network 100 can be configured to serve as the FC network.Therefore, equipment 1010,1015 and 1020 port have its oneself FC ID respectively, and the port of memory device 1075 has FC ID.
For equipment 1010,1015 and 1020 and memory device 1075 between mobile storage flow (comprising frame 900) efficiently, preferred implementations more of the present invention will be from the information mapping of the field of FC frame 900 to divide into groups 950 respective field of LLE.LLE divides into groups 950 to comprise organization id field 965 and the device id field 970 of SOF 955, destination MAC field, organization id field 975 and device id field 980, protocol type field 985, field 990 and the payload 995 of source MAC field.
Preferably, field 965,970,975 and 980 all is 24 bit fields, meets conventional Ethernet protocol.Therefore, in implementations more of the present invention, the content of the destination FC id field of FC frame 900 is mapped in field 965 or 970, preferably is mapped to field 970.Similarly, the content of the source FC id field of FC frame 900 is mapped in field 975 or 980, preferably is mapped to field 980.Preferably, the content of destination FC id field of FC frame 900 915 and source FC id field 920 is mapped to divide into groups 950 field 970 and 980 of LLE respectively, because sanctified by usagely, IEEE is the single many device codes of code assignment of organizing.This mapping function for example can be carried out by storage gateway 1050.
Therefore, the FC frame partly can be realized through buying with the corresponding organization unique identifier of a group equipment code (" OUI ") code to IEEE to the mapping that LLE divides into groups.In a this example, current assignee Cisco Systems has paid the registration charges of OUI, and OUI is assigned to " FC on the Ethernet ".The storage gateway (for example storage gateway 1050) of configuration places field 965 and 975 with OUI according to this aspect of the invention; Copy 24 contents of destination FC id field 915 to 24 bit fields 970, and copy 24 contents of source FC id field 920 to 24 bit fields 980.Storage gateway inserts the code of FC on the indication Ethernet in protocol type field 985, and copies the content of payload 925 to payload field 995.
Because above-mentioned mapping need clearly not assigned MAC Address on storage networking.Yet because mapping, the version of deriving with algorithm of destination and source FC ID has been coded in the appropriate section of LLE frame, and these appropriate sections will be assigned to destination and source MAC in conventional Ethernet divides into groups.Through just look like these fields are contents that MAC Address field that kind is utilized these fields, can be on the LLE network route storage flow amount.
SOF field 905 comprises orderly assemble of symbol with EOF field 910, and some of them (for example being used to indicate those of beginning and end of FC frame) are the symbols that keeps, and these symbols are called as " illegally " or " code violation " symbol sometimes.If one of these symbols are copied into certain field (for example field 990) in the LLE grouping 950, then this symbol will cause mistake, for example should stop at this symbol place through indication LLE grouping 950.But, must be retained by these symbol information conveyed, because it has indicated the kind of FC frame, frame is beginning of sequence or finishes, and other important informations.
Therefore, preferred implementation of the present invention provides the another kind of mapping function that illegal symbol is converted to legal symbol.These legal symbols can be inserted in the interior section of LLE grouping 950 subsequently.In a this implementation, be placed in the field 990 through the symbol of changing.Field 990 does not need very big; In some implementations, its length is merely 1 or 2 byte.
In order to allow to connect the realization of (cut-through) exchange, field 990 can be divided into two independent fields.For example, a field can be positioned at frame and begin the place, and another can be positioned at the other end of frame.
Preceding method just is used for an example of the various technology in the ethernet frame of expansion that the FC frame is encapsulated in.Additive method comprises any mapping easily, for example comprises that { S_ID} derives tlv triple { VLAN, DST MAC Addr, Src MAC Addr} for VSAN, D_ID from tlv triple.
Above-mentioned mapping and symbol transition process have produced the LLE grouping, and for example LLE divides into groups 950, and it allows to go to or is forwarded to endpoint node equipment 1010,1015 and 1020 from the storage flow based on the memory device 1075 of FC via LLE cloud 1005.Mapping and symbol transition process for example can moved by on the frame basis by storage gateway 1050.
Therefore, the invention provides the illustrative methods that is used in the ingress edge place of FC-Ethernet cloud is encapsulated in the FC frame ethernet frame of expansion.Similar approach of the present invention provides the inverse process of carrying out at the outlet edge place of Ethernet-FC cloud.The FC frame can come out by deblocking from the expansion ethernet frame, on the FC network, transmits then.
Some such methods comprise these steps: receive ethernet frame (for example encapsulating by mode described herein); With the destination content map of the first of the destination MAC field of ethernet frame destination FC id field to the FC frame; The source contents of the second portion of the source MAC field of ethernet frame is mapped to the source FC id field of FC frame; The legal symbol transition of ethernet frame is become illegal symbol; Illegal symbol is inserted in the selected field of FC frame; With the payload content map of the payload field of ethernet frame to FC frame payload field; And on the FC network, transmit the FC frame.
Need not keep state information about frame.Therefore, processed frame promptly is for example with the rate processing frame of 40Gb/s.Endpoint node can be used based on SCSI operation storage, because the SCSI layer 825 that can see LLE stack 800 shown in Figure 8 is used in storage.Be different from via the switch that is exclusively used in the FC flow (FC switch 140 and 145 for example shown in Figure 1) and transmit the storage flow, this FC switch can be by 1030 replacements of LLE switch.
In addition, the function of LLE switch has allowed the powerful managerial flexibility in space.With reference to Figure 11, in a kind of Managed Solution, each in the LLE switch 1130 of LLE cloud 1105 can be regarded as independent FC switch.Perhaps, some in the LLE switch 1130 or all can be gathered together, and be regarded as the FC switch from administrative purposes.For example, from the network management purpose,, formed virtual FC switch 1140 through all the LLE switches in the LLE cloud 1105 are regarded as single FC switch.The all of the port of individual LLE switch 1130 for example can be regarded as the port of virtual FC switch 140.Perhaps, can assemble more a spot of LLE switch 1130.For example, 3 LLE switches are gathered together and are gathered together to form virtual FC switch 1165 to form 1160,4 LLE switches of virtual FC switch.Network manager can be through considering that individual LLE switch has how many ports or the like and decides how many switches of gathering.Through each LLE switch is regarded as a FC switch, perhaps, can realize the control plane function of FC, for example subregion (zoning), DNS, FSPF and other functions through a plurality of LLE switches being gathered into a virtual FC switch.
In addition, same LLE cloud 1105 can be supported many virtual networks.VLAN (" VLAN ") is as known in the art, is used to the network based on Ethernet that provides virtual.The United States Patent(USP) No. 5,742,604 that is entitled as " Interswitch Link Mechanism for Connecting High-Performance NetworkSwitches " has been described related system, and is by reference that it is incorporated here.This assignee's various patent applications; Comprise the U.S. Patent application No.10/034 that is entitled as " Methods And Apparatus For Encapsulating A Frame For Transmission In AStorage Area Network " that submits December 26 calendar year 2001; 160, provide to be used to the method and apparatus of realizing virtual storage area network (" VSAN ") based on the network of FC.Here by reference that this application is incorporated fully.Because the LLE network can support ethernet traffic can support the FC flow again, implementations more of the present invention have realized on same physics LLE cloud, forming virtual network for FC and ethernet traffic.
Figure 12 illustrates the sketch map of the simplification architecture of DCE switch 1200 according to an embodiment of the invention.DCE switch 1200 comprises N Line cards, and each Line cards is characterised in that entrance side (or input) 1205 and outlet side (or output) 1225.Line cards entrance side 1205 is connected to Line cards outlet side 1225 via switching fabric 1250, and this switching fabric comprises crossbar switch in this example.
In this implementation, all carry out buffering in the input and output side.Also possibly realize other architectures, for example have those of input buffer, output buffer and shared storage.Therefore; In the incoming line an outpost of the tax office 1205 each comprises at least one buffer 1210; And each in the output line an outpost of the tax office 1225 comprises at least one buffer 1230; Said buffer can be the buffer of any convenient type known in the art, and is for example outside based on the buffer of DRAM or the buffer based on SRAM on the sheet.Buffer 1210 for example is used for input buffering, so that temporary transient preservation grouping when wait output line an outpost of the tax office place has enough buffers to can be used for storing the grouping that will send via switching fabric 1250.Buffer 1230 for example is used for output buffering, so that temporarily when wait has enough credit to be used for sending to the grouping of another DCE switch preserve the one or more grouping that is received from incoming line an outpost of the tax office 1205.
It should be noted that while credits may be used internally and externally to the switch, there need not be a one-to-one mapping between internal and external credits. Moreover, pause frames may be used internally or externally. For example, any of the four possible combinations PAUSE-PAUSE, PAUSE-CREDITS, CREDITS-PAUSE and CREDITS-CREDITS can produce viable solutions.
DCE switch 1200 includes some form of credit mechanism for exerting flow control. This flow control mechanism can exert back pressure on buffers 1210 when an output queue of one of buffers 1230 reaches its maximum capacity. For example, prior to sending a frame, one of the input line cards 1205 may request a credit from arbiter 1240 (which may be, for example, a separate chip located at a central location or a set of chips distributed among the output line cards) before sending a frame from input queue 1215 to output queue 1235. Preferably, the request indicates the size of the frame, e.g., according to the frame credit field of the DCE header. Arbiter 1240 will determine whether output queue 1235 can accept the frame (i.e., whether output buffer 1230 has enough space to hold the frame). If so, the credit request will be granted and arbiter 1240 will send a credit grant to input queue 1215. However, if output queue 1235 is too full, the request will be denied and no credits will be sent to input queue 1215.
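The arbiter's grant/deny decision can be sketched roughly as follows. This is a minimal illustration only; the class and method names are assumptions, not taken from the patent, and a real arbiter would track many output queues and line cards:

```python
class Arbiter:
    """Grants a credit to an input queue only when the output buffer
    can hold the whole frame; otherwise the request is denied."""

    def __init__(self, output_capacity_bytes):
        self.capacity = output_capacity_bytes
        self.occupied = 0

    def request_credit(self, frame_size_bytes):
        # Grant only if the output queue has room for the entire frame,
        # whose size the request indicates (cf. the frame credit field).
        if self.occupied + frame_size_bytes <= self.capacity:
            self.occupied += frame_size_bytes
            return True   # credit granted: input line card may send
        return False      # output queue too full: no credit issued

    def frame_departed(self, frame_size_bytes):
        # When the output line card transmits the frame, space is reclaimed.
        self.occupied -= frame_size_bytes
```

A denied request simply leaves the frame in the input queue, which is how back pressure arises in this scheme.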
As discussed elsewhere herein, DCE switch 1200 needs to be able to support the "drop," "no drop" and intermediate behaviors required of virtual lanes. The "no drop" functionality is enabled, in part, by applying internally to the DCE switch some type of credit mechanism similar to that described above. Externally, the "no drop" functionality can be implemented according to the previously described buffer-to-buffer credit mechanism or pause frames. For example, if one of input line cards 1205 is experiencing back pressure from one or more output line cards 1225 via the internal credit mechanism, the line card can propagate that back pressure externally, in the upstream direction, via an FC-like buffer-to-buffer credit system.
Preferably, the same chip (e.g., the same ASIC) that provides the "no drop" and intermediate functionality will also provide the "drop" functionality of a classical Ethernet switch. Although these tasks could be allocated to different chips, providing drop, no-drop and intermediate functionality on the same chip allows DCE switches to be provided at a much lower cost.
Each DCE packet contains information indicating the virtual lane to which the DCE packet belongs, for example in the DCE header described elsewhere herein. DCE switch 1200 handles each DCE packet according to whether the VL to which the DCE packet has been assigned is a drop VL or a no-drop VL.
Figure 13 shows an example of partitioning a buffer into VLs. In this example, 4 VLs have been assigned. VL 1305 and VL 1310 are drop VLs. VL 1315 and VL 1320 are no-drop VLs. In this example, input buffer 1300 has a specific area assigned for each VL: VL 1305 is assigned to buffer space 1325, VL 1310 is assigned to buffer space 1330, VL 1315 is assigned to buffer space 1335 and VL 1320 is assigned to buffer space 1340. The traffic on VL 1305 and VL 1310 is managed much like conventional Ethernet traffic, in part according to the operation of buffer spaces 1325 and 1330. Similarly, the no-drop characteristic of VLs 1315 and 1320 is implemented according to a buffer-to-buffer credit flow control scheme enabled only for buffer spaces 1335 and 1340.
In some implementations, the amount of buffer space assigned to a VL can be dynamically assigned according to criteria such as buffer occupancy, time of day, traffic load/congestion, tasks guaranteed a minimum bandwidth allocation, known tasks requiring a larger bandwidth, maximum bandwidth allocation, and the like. Preferably, fairness principles will be applied to prevent one VL from obtaining an excessive amount of buffer space.
Within each buffer space, data are organized in data structures that are logical queues, known as virtual output queues (VOQs), associated with destinations. ("A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches," Adisak Mekkittikul and Nick McKeown, Computer Systems Laboratory, Stanford University (InfoCom 1998), and the references cited therein describe relevant methods for implementing VOQs, and are hereby incorporated by reference.) The destination is preferably a destination port/virtual lane pair. Using a VOQ scheme avoids the head-of-line blocking at the input line card that would otherwise be caused when an output port is blocked and/or when another virtual lane of the destination output port is blocked.
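A VOQ arrangement keyed by destination port/virtual lane pair can be sketched as follows (a minimal illustration; the function names and frame labels are assumptions). Because each (port, VL) pair gets its own logical queue, a blocked pair never holds up frames headed elsewhere:

```python
from collections import defaultdict, deque

# One logical queue per (destination port, virtual lane) pair.
voqs = defaultdict(deque)

def enqueue(frame, dest_port, vl):
    """Place a frame in the VOQ for its destination port/VL pair."""
    voqs[(dest_port, vl)].append(frame)

enqueue("frame-a", dest_port=3, vl=1)
enqueue("frame-b", dest_port=3, vl=2)  # same port, different VL: separate queue
enqueue("frame-c", dest_port=7, vl=1)
```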
In some implementations, VOQs are not shared between VLs. In other implementations, VOQs may be shared among drop VLs or among no-drop VLs. However, a VOQ should not be shared between a drop VL and a no-drop VL. In some embodiments a VOQ is associated with a single buffer, but in other embodiments a VOQ may be implemented by more than one buffer.
The buffers of a DCE switch can implement various types of active queue management. Some preferred embodiments of DCE switch buffers provide at least 4 basic types of active queue management: flow control; dropping (for drop VLs) or marking (for no-drop VLs) for congestion avoidance purposes; dropping to avoid deadlock in no-drop VLs; and dropping for latency control.
Flow control for a DCE network has at least two basic manifestations, one implemented within the DCE switch and the other implemented on the links of the network. One flow control manifestation is buffer-to-buffer credit-based flow control, which is mainly used to implement "no drop" or "delayed drop" VLs. As noted above, pause frames and the like can also be used to implement flow control for "no drop" or "delayed drop" VLs. Any convenient combination of credits and pause frames, whether within the DCE switch or on the links, can be used to implement flow control. It is important to note that in a preferred embodiment, the credits managed within the DCE switch differ from the credits managed on the links. Some preferred embodiments use pause frames on the links and credits within the DCE switch.
Another flow control manifestation of some preferred implementations includes explicit upstream congestion notification to other devices in the network. This explicit upstream congestion notification can be implemented, for example, via the explicit congestion notification ("ECN") field of the DCE header, as described elsewhere herein.
Figure 14 shows DCE network 1405, including edge DCE switches 1410, 1415, 1425 and 1430 and core DCE switch 1420. In this instance, buffer 1450 of core DCE switch 1420 is implementing 3 types of flow control. One type is buffer-to-buffer flow control indication 1451, which is conveyed by the grant (or denial) of buffer-to-buffer credits between buffer 1450 and buffer 1460 of edge DCE switch 1410.
Buffer 1450 also sends 2 ECNs, 1451 and 1452, which are implemented via the ECN field of the DCE header of DCE packets. ECN 1451 may be regarded as a core-to-edge notification, because it is sent by core device 1420 and received by buffer 1460 of edge DCE switch 1410. ECN 1452 may be regarded as a core-to-end notification, because it is sent by core device 1420 and received by NIC card 1465 of end node 1440.
In some implementations of the invention, ECNs are generated by sampling packets stored in a buffer that is experiencing congestion. The ECN is sent to the source of the sampled packet by setting the ECN's destination address equal to the source address of the sampled packet. The edge device will learn whether the source supports DCE ECN, as does end node 1440, or does not, as is the case for end node 1435. In the latter case, edge device 1410 will terminate the ECN and implement the appropriate action.
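The sampling scheme just described might look like the sketch below. The function and field names are assumptions for illustration; only the address swap (the ECN's destination is set to the sampled packet's source) comes directly from the text:

```python
import random

def maybe_generate_ecn(buffered_packets, sample_probability):
    """On congestion, sample a buffered packet with some probability and
    build an ECN addressed back to that packet's source."""
    if random.random() >= sample_probability:
        return None
    sampled = random.choice(buffered_packets)
    # Destination of the notification := source of the sampled packet,
    # so the ECN travels upstream toward the traffic's origin.
    return {"type": "ECN", "dst": sampled["src"]}
```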
Active queue management (AQM) will be performed in response to various criteria, including but not limited to buffer occupancy (e.g., per VL), the queue length of each VOQ and the age of packets within a VOQ. For the sake of simplicity, when discussing AQM it will generally be assumed that VOQs are not shared between VLs.
Some examples of AQM according to the invention will now be described with reference to Figure 15. Figure 15 shows buffer usage at a particular moment in time. At that moment, portion 1505 of physical buffer 1500 has been allocated to drop VLs and portion 1510 has been allocated to no-drop VLs. As described elsewhere herein, the amounts of buffer 1500 allocated to drop VLs and no-drop VLs may change over time. Of portion 1505, which is allocated to drop VLs, portion 1520 is currently in use and portion 1515 is currently not in use.
Within portions 1505 and 1510 there are numerous VOQs, including VOQs 1525, 1530 and 1535. In this example, a threshold VOQ length L has been established. The lengths of VOQs 1525 and 1535 are greater than L, whereas the length of VOQ 1530 is less than L. A long VOQ indicates downstream congestion. Active queue management preferably prevents any single VOQ from becoming too large, because otherwise downstream congestion affecting one VOQ will adversely affect traffic bound for other destinations.
The age of a packet within a VOQ is another criterion used for AQM. In preferred implementations, a packet is time-stamped when it enters a buffer and is enqueued in the appropriate VOQ. Thus, packet 1540 receives timestamp 1545 upon arriving at buffer 1500 and is placed in a VOQ according to its destination and its VL designation. As noted elsewhere, the VL designation will indicate whether drop or no-drop behavior applies. In this example, the header of packet 1540 indicates that packet 1540 will be transmitted on a drop VL and has a destination corresponding to that of VOQ 1525, so packet 1540 is placed in VOQ 1525.
By comparing the time of timestamp 1545 with the current time, the age of packet 1540 can be determined at a later moment. In this context, "age" refers only to the time the packet has spent in the switch, not the time spent in some other part of the network. Nonetheless, conditions in other parts of the network may be inferred from a packet's age. For example, if a packet's age becomes relatively large, this condition indicates that the path to the packet's destination is experiencing congestion.
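The time-stamping and age computation described above amount to the following minimal sketch (class and attribute names are assumptions for illustration):

```python
import time

class StampedPacket:
    """Packet time-stamped on entry to the buffer, per the scheme above."""

    def __init__(self, payload):
        self.payload = payload
        self.arrival = time.monotonic()  # timestamp taken at enqueue time

    def age(self, now=None):
        """Time spent in this switch (not elsewhere in the network)."""
        if now is None:
            now = time.monotonic()
        return now - self.arrival
```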
In preferred implementations, packets whose age exceeds a predetermined age will be dropped. If, when ages are determined, several packets in a VOQ are found to exceed the predetermined age threshold, multiple drops may be performed.
In some preferred implementations, there are separate age limits for latency control (T_L) and for deadlock avoidance (T_D). The action taken when a packet's age reaches T_L preferably depends on whether the packet is being transmitted on a drop VL or a no-drop VL. For traffic on no-drop lanes, data integrity is more important than latency. Therefore, in some implementations of the invention, when the age of a packet in a no-drop VL exceeds T_L, the packet is not dropped; instead, another action is taken. For example, in some such implementations the packet may be marked and/or an upstream congestion notification may be triggered. For packets in a drop VL, latency control is relatively more important, so more aggressive action is appropriate when a packet's age exceeds T_L. For example, a probabilistic drop function may be applied to the packet.
Graph 1600 of Figure 16 provides some examples of probabilistic drop functions. According to drop functions 1605, 1610 and 1615, once the age of a packet exceeds T_CO, the latency cutoff threshold, the probability that it will be intentionally dropped increases from 0% toward 100% as the packet's age approaches T_L, depending on the function. Drop function 1620 is a step function whose probability of intentional drop is 0% until T_L is reached. When the age of a packet reaches T_L, drop functions 1605, 1610, 1615 and 1620 all reach a 100% probability of intentional drop. Although T_CO, T_L and T_D may be any convenient times, in some implementations of the invention T_CO is on the order of tens of microseconds, T_L is on the order of a few milliseconds to tens of milliseconds, and T_D is on the order of hundreds of milliseconds, e.g., 500 milliseconds.
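A linear ramp like function 1605 of Figure 16 can be sketched as follows. The specific threshold values and the linear shape are example assumptions consistent with the orders of magnitude given above; the figure's other functions would use different curves between T_CO and T_L:

```python
T_CO = 50e-6   # latency cutoff: tens of microseconds (example value)
T_L = 10e-3    # latency-control limit: ~10 milliseconds (example value)

def drop_probability(age):
    """Probability of intentional drop as a function of packet age:
    0% at or below T_CO, rising linearly to 100% at T_L."""
    if age <= T_CO:
        return 0.0
    if age >= T_L:
        return 1.0
    return (age - T_CO) / (T_L - T_CO)
```

A step function like 1620 would instead return 0.0 for all ages below T_L and 1.0 at T_L.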
If the age of a packet in either a drop VL or a no-drop VL exceeds T_D, the packet will be dropped. In preferred implementations, T_D for no-drop VLs is larger than T_D for drop VLs. In some implementations, T_L and/or T_D may also depend in part on the bandwidth of the VL on which a packet is transmitted and on the number of VOQs simultaneously transmitting packets to that VL.
For no-drop VLs, a probability function similar to those shown in Figure 16 can be used to trigger upstream congestion notifications, or to set the Congestion Experienced (CE) bit in the headers of TCP packets belonging to connections that support TCP ECN.
In some implementations, whether a packet is dropped, whether an upstream congestion notification is sent and whether the CE bit of a TCP packet is marked depend not only on the age of the packet but also on the length of the VOQ in which the packet is placed. If that length is above a threshold L_max, the AQM action is taken; otherwise, the AQM action will be performed on the first packet dequeued from a VOQ whose length exceeds the threshold L_max.
Use of Per-VL Buffer Occupancy
As shown in Figure 15, a buffer is partitioned into VLs. For the portion of a buffer allocated to drop VLs (e.g., portion 1505 of buffer 1500), packets will be dropped if at any given time the occupancy of a VL is greater than a predetermined maximum value. In some implementations, an average occupancy of the VL is computed and maintained. Based on this average occupancy, AQM actions may be taken. For example, for portion 1510, which is associated with no-drop VLs, a DCE ECN will be triggered instead of the packet drop that would occur for portion 1505, which is associated with drop VLs.
Figure 17 shows graph 1700 of VL occupancy B(VL) (vertical axis) over a period of time (horizontal axis). Here, B_T is a threshold for B(VL). In some implementations of the invention, some packets in a VL will be dropped when B(VL) is determined to have reached B_T. The actual value of B(VL) over time is indicated by curve 1750, but B(VL) is determined only at times t1 through tN. In this example, packets will be dropped at points 1705, 1710 and 1715, which correspond to times t2, t3 and t6. Packets may be dropped according to their age (e.g., oldest first), their size, the QoS of the packet's virtual network, randomly, according to a drop function, or otherwise.
In addition (or as an alternative), active queue management actions may be taken when an average, weighted average or the like of B(VL) meets or exceeds B_T. Such an average can be computed according to various methods, for example by summing the determined B(VL) values and dividing by the number of determinations. Some implementations use a weighting function, for example one that gives greater weight to more recent samples. Any type of weighting function known in the art may be used.
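One weighting function that favors recent samples is an exponentially weighted moving average; this is a common choice offered here purely as an example, since the patent does not prescribe any particular function, and the smoothing factor and sample values below are illustrative assumptions:

```python
def ewma(samples, alpha=0.5):
    """Exponentially weighted moving average: higher alpha gives
    greater weight to more recent samples."""
    avg = samples[0]
    for s in samples[1:]:
        avg = alpha * s + (1 - alpha) * avg
    return avg

B_T = 100                                 # occupancy threshold
occupancy_samples = [40, 60, 120, 150]    # B(VL) determined at t1..t4
take_aqm_action = ewma(occupancy_samples) >= B_T
```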
The active queue management action taken may be, for example, sending an ECN and/or applying a probabilistic drop function, such as one similar to those shown in Figure 18. In this example, the horizontal axis of graph 1800 is the average value of B(VL). When the average is below first value 1805, the probability of intentionally dropping a packet is 0%. When the average meets or exceeds second value 1810, the probability of intentionally dropping a packet is 100%. Any convenient function may be applied to intermediate values, whether similar to functions 1815, 1820 and 1825 or otherwise.
Referring again to Figure 15, it is apparent that the lengths of VOQs 1525 and 1535 exceed the predetermined length L. In some implementations of the invention, this condition triggers an active queue management response, for example sending one or more ECNs. Preferably, the packets contained in buffer 1500 will indicate whether their source can respond to an ECN. If the sender of a packet cannot respond to ECNs, this condition may trigger a probabilistic drop function or simply a drop. VOQ 1535 is not only longer than predetermined length L1 but also longer than predetermined length L2. According to some implementations of the invention, this condition triggers a packet drop. Some implementations of the invention use average VOQ length as a criterion for triggering an active queue management response, but because this requires a large amount of computation it is not preferred.
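The graduated responses to VOQ length described above might be combined as in the sketch below. The thresholds L1 and L2 come from the text; the function name and the string tags for each response are illustrative assumptions:

```python
def voq_length_action(voq_len, l1, l2, sender_supports_ecn=True):
    """Graduated AQM response to VOQ length: nothing below L1, an ECN
    (or probabilistic drop for ECN-incapable senders) above L1, and an
    outright drop above the higher threshold L2."""
    if voq_len > l2:
        return "drop"
    if voq_len > l1:
        return "send_ecn" if sender_supports_ecn else "probabilistic_drop"
    return "none"
```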
It is desirable to have multiple criteria for triggering AQM actions. For example, although providing a response to VOQ length is useful, this measure alone may not suffice for a DCE switch having roughly 1 to 2 MB of buffer space per port. For a given buffer there may be thousands of active VOQs, but storage space for only on the order of 10^3 packets, perhaps fewer. A condition can therefore arise in which no individual VOQ contains enough packets to trigger any AQM response, yet a VL has run out of space.
Queue Management for No-Drop VLs
In preferred implementations of the invention, the main difference between active queue management for drop VLs and for no-drop VLs is that criteria which would trigger a packet drop for a drop VL will instead cause a DCE ECN to be sent, or a TCP CE bit to be marked, for a no-drop VL. For example, conditions that would trigger a probabilistic packet drop for a drop VL will generally cause a probabilistic ECN to be sent to an upstream edge device or end (host) device. Credit-based schemes are based not on where a packet is going but on where it came from. Upstream congestion notification therefore helps to provide fairness in buffer use, and helps to avoid the deadlocks that could arise if credit-based flow control were the only method of flow control used for no-drop VLs.
For example, when per-VL buffer occupancy is used as a criterion, packets preferably will not be dropped merely because the per-VL buffer occupancy has met or exceeded a threshold. Instead, for example, packets will be marked or ECNs will be sent. Similarly, some type of average per-VL buffer occupancy can still be computed and a probability function applied, but the basic actions taken will be marking and/or sending ECNs. Packets will not be dropped.
However, even for no-drop VLs, packets will be dropped in response to, for example, blocking or deadlock conditions indicated by packet ages exceeding a threshold, as described elsewhere herein. Some implementations of the invention also allow packets of a no-drop VL to be dropped in response to latency conditions. This will depend on the importance attached to latency for the particular no-drop VL. Some such implementations apply a probabilistic drop algorithm. For example, some cluster applications may place a higher value on the latency factor than storage applications do. Data integrity is still important for cluster applications, but it may be advantageous to reduce latency by sacrificing a degree of data integrity. In some implementations, a higher value of T_L (the latency control threshold) can be used for no-drop lanes than the corresponding value used for drop lanes.
Figure 19 shows an example of a network device that can be configured to implement some methods of the present invention. Network device 1960 includes a master central processing unit (CPU) 1962, interfaces 1968 and a bus 1967 (e.g., a PCI bus). Generally, interfaces 1968 include ports 1969 appropriate for communication with the appropriate media. In some embodiments, one or more of interfaces 1968 includes at least one independent processor 1974 and, in some instances, volatile RAM. The independent processors 1974 may be, for example, ASICs or any other appropriate processors. According to some such embodiments, these independent processors 1974 perform at least some of the functions of the logic described herein. In some embodiments, one or more of interfaces 1968 control such communications-intensive tasks as media control and management. By providing separate processors for the communications-intensive tasks, interfaces 1968 allow master microprocessor 1962 to efficiently perform other functions such as routing computations, network diagnostics, security functions, etc.
Interfaces 1968 are typically provided as interface cards (sometimes referred to as "line cards"). Generally, interfaces 1968 control the sending and receiving of data packets over the network and sometimes support other peripherals used with network device 1960. Among the interfaces that may be provided are Fibre Channel ("FC") interfaces, Ethernet interfaces, frame relay interfaces, cable interfaces, DSL interfaces, token ring interfaces, and the like. In addition, various very high-speed interfaces may be provided, such as fast Ethernet interfaces, Gigabit Ethernet interfaces, ATM interfaces, HSSI interfaces, POS interfaces, FDDI interfaces, ASI interfaces, DHEI interfaces, and the like.
When acting under the control of appropriate software or firmware, in some implementations of the invention CPU 1962 may be responsible for implementing specific functions associated with the functions of a desired network device. According to some embodiments, CPU 1962 accomplishes all these functions under the control of software including an operating system (e.g., Linux, VxWorks, etc.) and any appropriate applications software.
CPU 1962 may include one or more processors 1963, such as a processor from the Motorola family of microprocessors or the MIPS family of microprocessors. In an alternative embodiment, processor 1963 is specially designed hardware for controlling the operations of network device 1960. In a specific embodiment, a memory 1961 (such as non-volatile RAM and/or ROM) also forms part of CPU 1962. However, there are many different ways in which memory could be coupled to the system. Memory block 1961 may be used for a variety of purposes, for example for caching and/or storing data, programming instructions, etc.
Regardless of the network device's configuration, it may employ one or more memories or memory modules (such as memory block 1965) configured to store data, program instructions for general-purpose network operations and/or other information relating to the functionality of the techniques described herein. The program instructions may control the operation of an operating system and/or of one or more applications, for example.
Because such information and program instructions may be employed to implement the systems/methods described herein, the present invention relates to machine-readable media that include program instructions, state information, etc. for performing the various operations described herein. Examples of machine-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks and magnetic tape; optical media such as CD-ROM disks; magneto-optical media; and hardware devices specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). The invention may also be embodied in a carrier wave traveling over an appropriate medium such as airwaves, optical lines, electric lines, etc. Examples of program instructions include both machine code, such as that produced by a compiler, and files containing higher-level code that may be executed by a computer using an interpreter.
Although the system shown in Figure 19 illustrates one specific network device of the present invention, it is by no means the only network device architecture on which the present invention can be implemented. For example, an architecture having a single processor that handles communications as well as routing computations, etc., is often used. Further, other types of interfaces and media could also be used with the network device. The communication path between interfaces/line cards may be bus based (as shown in Figure 19) or switch fabric based (e.g., a crossbar).
While the invention has been particularly shown and described with reference to specific embodiments thereof, it will be understood by those skilled in the art that changes may be made in the form and details of the disclosed embodiments without departing from the spirit or scope of the invention. For example, some implementations of the invention allow a VL to be converted from a drop VL to a delayed-drop VL or a no-drop VL. Accordingly, the examples described herein are not intended to limit the invention. It is therefore intended that the appended claims be interpreted to include all variations, equivalents, changes and modifications that fall within the true spirit and scope of the present invention.

Claims (22)

1. A method for handling more than one type of network traffic in a single network device, the method comprising:
partitioning a buffer of the network device into a first buffer space and a second buffer space, the first buffer space for storing frames received on a first virtual lane of a physical link of the network device and the second buffer space for storing frames received on a second virtual lane of that physical link of the network device;
receiving a plurality of frames on that physical link of the network device, wherein each frame indicates the first virtual lane or the second virtual lane based on information, contained in a header of the frame, that indicates the virtual lane to which the frame belongs; and
for each received frame, applying a first set of rules or a second set of rules to the received frame according to whether the received frame specifies the first virtual lane or the second virtual lane, respectively, wherein the first set of rules causes the received frame to be dropped or stored in the first buffer space depending on whether the first buffer space is filled to a predetermined amount, and wherein the second set of rules forbids dropping the received frame in response to latency and causes the received frame to be stored in the second buffer space.
2. The method of claim 1, further comprising the step of assigning a guaranteed minimum buffer space to each virtual lane.
3. the method for claim 1, wherein said received frame comprise ethernet frame of specifying said first tunnel and the fiber channel frame of specifying said second tunnel.
4. the method for claim 1, wherein said first group of rule make ecn (explicit congestion notification) sent from the said network equipment in response to the delay of said received frame.
5. the method for claim 1, wherein said second group of rule make ecn (explicit congestion notification) sent from the said network equipment in response to the delay of said received frame.
6. the method for claim 1, wherein said partiting step comprise according to one or more factors of from the factor group that is made up of following factor, selecting divides buffers: buffer occupation rate, the time in one day, flow load, congested, guarantee that the task and the maximum bandwidth of minimum bandwidth allocation, the known big bandwidth of requirement distribute.
7. the method for claim 1, wherein said second group of rule makes frame be dropped to avoid deadlock.
8. the method for claim 1, wherein said first group of rule abandons function in response to postponing applied probability property.
9. the method for claim 1, wherein said second group of rule are included as said second frame and realize that buffer counts scheme to buffer credit.
10. the method for claim 1 also comprises said first frame and said second frame are stored in the step in the VOQ, and wherein each VOQ and a destination port/tunnel are to being associated.
11. the method for claim 1 also comprises the step in response to every tunnel buffer occupation rate and at least one the execution buffer management in the age of dividing into groups, the moment the when age of wherein dividing into groups is grouping entering buffer and the difference of current time.
12. The method of claim 4, wherein the explicit congestion notification is sent to one of a source device or an edge device.
13. The method of claim 4, wherein the explicit congestion notification is sent via one of a data frame or a control frame.
14. The method of claim 9, wherein the buffer-to-buffer credit accounting scheme comprises counting credits according to frame size.
15. The method of claim 9, wherein the buffer-to-buffer credits are managed by an arbiter.
16. The method of claim 9, wherein the second set of rules comprises implementing the buffer-to-buffer credit accounting scheme within the network device and using pause frames on the second virtual lane to perform flow control.
17. The method of claim 10, further comprising the step of performing buffer management in response to at least one of VOQ length, per-virtual-lane buffer occupancy, overall buffer occupancy and packet age, wherein packet age is the difference between the current time and the time at which a packet entered the buffer.
18. A network device, comprising:
means for partitioning a buffer of the network device into a first buffer space and a second buffer space, the first buffer space for storing frames received on a first virtual lane of a physical link of the network device and the second buffer space for storing frames received on a second virtual lane of that physical link of the network device;
Be used for a plurality of frames are received the device of this physical link of the said network equipment, wherein the information of the tunnel that belongs to of this frame of indication of comprising in the head based on this frame respectively of each frame and indicate first tunnel or second tunnel; And
Be used for for each received frame; Based on this received frame is to specify said first tunnel or specify said second tunnel to come to use respectively to this received frame the device of first group of rule or second group of rule; Whether wherein said first group of rule is filled scheduled volume and makes this received frame be dropped or be stored in said first buffer space based on said first buffer space respectively, and wherein said second group of rule forbidden abandoning in response to delay this received frame and made this received frame be stored in said second buffer space.
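The two rule sets of claim 18 correspond to the "drop" (Ethernet-like) and "no-drop" (FC-like) virtual lanes described in the abstract. The sketch below, with assumed names and an assumed threshold, shows the per-frame decision: drop-lane frames are discarded once the first buffer space reaches the predetermined fill level, while no-drop-lane frames are only ever delayed.

```python
from collections import deque

DROP_VL, NO_DROP_VL = 1, 2
DROP_THRESHOLD = 4  # "predetermined amount" of buffer fill (assumed value)

drop_buf = deque()      # first buffer space (drop virtual lane)
no_drop_buf = deque()   # second buffer space (no-drop virtual lane)

def handle_frame(frame, vl):
    if vl == DROP_VL:
        # First set of rules: drop when the first buffer space is
        # filled to the predetermined amount, otherwise store.
        if len(drop_buf) >= DROP_THRESHOLD:
            return "dropped"
        drop_buf.append(frame)
        return "stored"
    # Second set of rules: never drop; the frame waits in the second
    # buffer space (delay instead of loss).
    no_drop_buf.append(frame)
    return "stored"
```

In the patent's scheme the no-drop lane avoids overflow not by discarding but via the credit counting and pause-frame flow control of claims 9 and 16, which throttle the sender before the buffer fills.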
19. A network device, comprising:
a plurality of ports configured to receive frames on a plurality of physical links; and
a plurality of line cards, each line card in communication with one of said plurality of ports and configured to perform the following steps:
receiving frames on a first of said physical links via said plurality of ports, wherein each frame indicates a first virtual lane or a second virtual lane based on information in a header of the frame identifying the virtual lane to which the frame belongs;
identifying a received frame as a first frame received on the first virtual lane of the first physical link or a second frame received on the second virtual lane of the first physical link;
partitioning a buffer of the line card into a first buffer space for storing identified first frames and a second buffer space for storing identified second frames; and
for each received frame, applying a first set of rules or a second set of rules to the received frame based on whether the received frame designates said first virtual lane or said second virtual lane, respectively, wherein said first set of rules causes the received frame to be dropped or stored in said first buffer space depending on whether said first buffer space is filled to a predetermined amount, and wherein said second set of rules prohibits dropping the received frame in favor of delaying it and causes the received frame to be stored in said second buffer space.
20. A method for carrying more than one type of traffic on a single network device, the method comprising:
identifying a received frame as a first frame received on a first virtual lane or a second frame received on a second virtual lane, based on information in a header of the received frame identifying the virtual lane to which the received frame belongs;
dynamically partitioning a buffer of said network device into a first buffer space and a second buffer space, said first buffer space having a first virtual output queue (VOQ) for storing said first frame and said second buffer space having a second virtual output queue (VOQ) for storing said second frame, wherein said buffer is dynamically partitioned according to one or more factors selected from the group of factors consisting of: overall buffer occupancy, per-virtual-lane buffer occupancy, time of day, traffic load, congestion, guaranteed minimum bandwidth allocations, known tasks requiring large bandwidth, and maximum bandwidth allocations; and
for each received frame, applying a first set of rules or a second set of rules to the received frame based on whether the received frame is identified as being on said first virtual lane or said second virtual lane, respectively, wherein said first set of rules causes the received frame to be dropped or stored in said first VOQ depending on whether said first buffer space is filled to a predetermined amount, and wherein said second set of rules prohibits dropping the received frame in favor of delaying it and causes the received frame to be stored in said second VOQ.
21. A network device, comprising:
means for identifying a received frame as a first frame received on a first virtual lane or a second frame received on a second virtual lane, based on information in a header of the received frame identifying the virtual lane to which the received frame belongs;
means for dynamically partitioning a buffer of said network device into a first buffer space and a second buffer space, said first buffer space having a first virtual output queue (VOQ) for storing said first frame and said second buffer space having a second virtual output queue (VOQ) for storing said second frame, wherein said buffer is dynamically partitioned according to one or more factors selected from the group of factors consisting of: overall buffer occupancy, per-virtual-lane buffer occupancy, time of day, traffic load, congestion, guaranteed minimum bandwidth allocations, known tasks requiring large bandwidth, and maximum bandwidth allocations; and
means for applying, for each received frame, a first set of rules or a second set of rules to the received frame based on whether the received frame is identified as being on said first virtual lane or said second virtual lane, respectively, wherein said first set of rules causes the received frame to be dropped or stored in said first VOQ depending on whether said first buffer space is filled to a predetermined amount, and wherein said second set of rules prohibits dropping the received frame in favor of delaying it and causes the received frame to be stored in said second VOQ.
22. A network device, comprising:
a plurality of ports configured to receive frames on a plurality of physical links; and
a plurality of line cards, each line card in communication with one of said plurality of ports and configured to perform the following steps:
identifying a received frame as a first frame received on a first virtual lane or a second frame received on a second virtual lane, based on information in a header of the received frame identifying the virtual lane to which the received frame belongs;
dynamically partitioning a buffer of the network device into a first buffer space and a second buffer space, said first buffer space having a first virtual output queue (VOQ) for storing said first frame and said second buffer space having a second virtual output queue (VOQ) for storing said second frame, wherein said buffer is dynamically partitioned according to one or more factors selected from the group of factors consisting of: overall buffer occupancy, per-virtual-lane buffer occupancy, time of day, traffic load, congestion, guaranteed minimum bandwidth allocations, known tasks requiring large bandwidth, and maximum bandwidth allocations; and
for each received frame, applying a first set of rules or a second set of rules to the received frame based on whether the received frame is identified as being on said first virtual lane or said second virtual lane, respectively, wherein said first set of rules causes the received frame to be dropped or stored in said first VOQ depending on whether said first buffer space is filled to a predetermined amount, and wherein said second set of rules prohibits dropping the received frame in favor of delaying it and causes the received frame to be stored in said second VOQ.
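The dynamic partitioning recited in claims 20 through 22 can be sketched as recomputing the split between the two buffer spaces from observed factors. The weighting below is an invented illustration (the claims enumerate the factors but prescribe no formula), and all parameter names are assumptions.

```python
def partition_buffer(total_cells, overall_occ, no_drop_occ, congestion,
                     min_no_drop_frac=0.25, max_no_drop_frac=0.75):
    """Dynamically split a buffer between the drop and no-drop lanes.

    overall_occ, no_drop_occ, congestion are normalized to [0, 1].
    The bounds model guaranteed-minimum and maximum bandwidth
    allocations from the claims' factor list (illustrative only).
    """
    # Give the no-drop lane more space as its occupancy or the observed
    # congestion rises, clamped between the guaranteed min and max share.
    pressure = max(no_drop_occ, congestion)
    frac = min_no_drop_frac + (max_no_drop_frac - min_no_drop_frac) * pressure
    no_drop_cells = int(total_cells * frac)
    return total_cells - no_drop_cells, no_drop_cells
```

Other factors in the claims, such as time of day or known high-bandwidth tasks, would enter the same computation as additional inputs; the key point is that the partition is recomputed at runtime rather than fixed at configuration time.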
CN200580034646.0A 2004-10-22 2005-10-13 Network device architecture for consolidating input/output and reducing latency Active CN101040489B (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US62139604P 2004-10-22 2004-10-22
US60/621,396 2004-10-22
US11/094,877 US7830793B2 (en) 2004-10-22 2005-03-30 Network device architecture for consolidating input/output and reducing latency
US11/094,877 2005-03-30
PCT/US2005/037239 WO2006057730A2 (en) 2004-10-22 2005-10-13 Network device architecture for consolidating input/output and reducing latency

Publications (2)

Publication Number Publication Date
CN101040489A CN101040489A (en) 2007-09-19
CN101040489B true CN101040489B (en) 2012-12-05

Family

ID=38809008

Family Applications (4)

Application Number Title Priority Date Filing Date
CN200580034646.0A Active CN101040489B (en) 2004-10-22 2005-10-13 Network device architecture for consolidating input/output and reducing latency
CN200580034647.5A Active CN101040471B (en) 2004-10-22 2005-10-14 Ethernet extension for the data center
CN 200580035946 Active CN100555969C (en) 2004-10-22 2005-10-17 Fibre Channel over Ethernet
CN200580034955.8A Active CN101129027B (en) 2004-10-22 2005-10-18 Forwarding table reduction and multipath network forwarding

Family Applications After (3)

Application Number Title Priority Date Filing Date
CN200580034647.5A Active CN101040471B (en) 2004-10-22 2005-10-14 Ethernet extension for the data center
CN 200580035946 Active CN100555969C (en) 2004-10-22 2005-10-17 Fibre Channel over Ethernet
CN200580034955.8A Active CN101129027B (en) 2004-10-22 2005-10-18 Forwarding table reduction and multipath network forwarding

Country Status (1)

Country Link
CN (4) CN101040489B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7821939B2 (en) * 2007-09-26 2010-10-26 International Business Machines Corporation Method, system, and computer program product for adaptive congestion control on virtual lanes for data center ethernet architecture
CN101184098B (en) * 2007-12-11 2011-11-02 华为技术有限公司 Data transmission method and transmission apparatus
US8355345B2 (en) * 2009-08-04 2013-01-15 International Business Machines Corporation Apparatus, system, and method for establishing point to point connections in FCOE
CN101656721B * 2009-08-27 2012-08-08 杭州华三通信技术有限公司 Virtual link discovery control method and Fibre Channel over Ethernet protocol system
CN102045248B (en) 2009-10-19 2012-05-23 杭州华三通信技术有限公司 Virtual link discovery control method and Ethernet fiber channel protocol system
EP2489172B1 (en) 2010-05-28 2020-03-25 Huawei Technologies Co., Ltd. Virtual layer 2 and mechanism to make it scalable
CA2804141C (en) * 2010-06-29 2017-10-31 Huawei Technologies Co., Ltd. Asymmetric network address encapsulation
CN102377661A (en) * 2010-08-24 2012-03-14 鸿富锦精密工业(深圳)有限公司 Blade server and method for building shortest blade transmission path in blade server
US8917722B2 (en) * 2011-06-02 2014-12-23 International Business Machines Corporation Fibre channel forwarder fabric login sequence
CN102347955A (en) * 2011-11-01 2012-02-08 杭州依赛通信有限公司 Reliable data transmission protocol based on virtual channels
US20140153443A1 (en) * 2012-11-30 2014-06-05 International Business Machines Corporation Per-Address Spanning Tree Networks
US9160678B2 (en) * 2013-04-15 2015-10-13 International Business Machines Corporation Flow control credits for priority in lossless ethernet
US9479457B2 (en) 2014-03-31 2016-10-25 Juniper Networks, Inc. High-performance, scalable and drop-free data center switch fabric
US9703743B2 (en) * 2014-03-31 2017-07-11 Juniper Networks, Inc. PCIe-based host network accelerators (HNAS) for data center overlay network
CN104301229B (en) * 2014-09-26 2016-05-04 深圳市腾讯计算机系统有限公司 Data packet forwarding method, route table generating method and device
CN104767606B (en) * 2015-03-19 2018-10-19 华为技术有限公司 Data synchronization unit and method
US10243840B2 (en) 2017-03-01 2019-03-26 Juniper Networks, Inc. Network interface card switching for virtual networks
JP6743771B2 (en) * 2017-06-23 2020-08-19 株式会社デンソー Network switch
CN108965171B (en) * 2018-07-19 2020-11-20 重庆邮电大学 Industrial wireless WIA-PA network and time sensitive network conversion method and device
CN112737995B (en) * 2020-12-16 2022-11-22 北京东土科技股份有限公司 Method, device and equipment for processing Ethernet frame and storage medium
CN113872863B (en) * 2021-08-25 2023-04-18 优刻得科技股份有限公司 Path searching method and device
CN115580586A (en) * 2022-11-25 2023-01-06 成都成电光信科技股份有限公司 FC switch output queue construction method based on system on chip

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020141427A1 (en) * 2001-03-29 2002-10-03 Mcalpine Gary L. Method and apparatus for a traffic optimizing multi-stage switch fabric network
US20030037127A1 (en) * 2001-02-13 2003-02-20 Confluence Networks, Inc. Silicon-based storage virtualization
US20030061379A1 (en) * 2001-09-27 2003-03-27 International Business Machines Corporation End node partitioning using virtualization
US20030169690A1 (en) * 2002-03-05 2003-09-11 James A. Mott System and method for separating communication traffic
US20030195983A1 (en) * 1999-05-24 2003-10-16 Krause Michael R. Network congestion management using aggressive timers
US20040100980A1 (en) * 2002-11-26 2004-05-27 Jacobs Mick R. Apparatus and method for distributing buffer status information in a switching fabric
US20040120332A1 (en) * 2002-12-24 2004-06-24 Ariel Hendel System and method for sharing a resource among multiple queues

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5920566A (en) * 1997-06-30 1999-07-06 Sun Microsystems, Inc. Routing in a multi-layer distributed network element
US5974467A (en) * 1997-08-29 1999-10-26 Extreme Networks Protocol for communicating data between packet forwarding devices via an intermediate network interconnect device
US6556541B1 (en) * 1999-01-11 2003-04-29 Hewlett-Packard Development Company, L.P. MAC address learning and propagation in load balancing switch protocols
CN1104800C (en) * 1999-10-27 2003-04-02 华为技术有限公司 Dual-table controlled data frame forwarding method


Also Published As

Publication number Publication date
CN101040471A (en) 2007-09-19
CN101044717A (en) 2007-09-26
CN100555969C (en) 2009-10-28
CN101129027B (en) 2011-09-14
CN101040471B (en) 2012-01-11
CN101129027A (en) 2008-02-20
CN101040489A (en) 2007-09-19

Similar Documents

Publication Publication Date Title
CN101040489B (en) Network device architecture for consolidating input/output and reducing latency
US9246834B2 (en) Fibre channel over ethernet
EP1803257B1 (en) Network device architecture for consolidating input/output and reducing latency
EP1803240B1 (en) Ethernet extension for the data center
EP2002584B1 (en) Fibre channel over ethernet
CN114731337A (en) System and method for supporting target groups for congestion control in private architectures in high performance computing environments
US8531968B2 (en) Low cost implementation for a device utilizing look ahead congestion management
US8625427B1 (en) Multi-path switching with edge-to-edge flow control
CN111201757A (en) Network access node virtual structure dynamically configured on underlying network
CN103957156A (en) Method of data delivery across a network
US6621829B1 (en) Method and apparatus for the prioritization of control plane traffic in a router

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant