WO2012154751A1 - Flexible radix switching network - Google Patents

Flexible radix switching network

Info

Publication number
WO2012154751A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
switches
switch
ports
data
Prior art date
Application number
PCT/US2012/036960
Other languages
French (fr)
Inventor
Ratko V. Tomic
Christopher John Williams
Leigh Richard TURNER
Reed Graham LEWIS
Original Assignee
Infinetics Technologies, Inc.
Priority date
Filing date
Publication date
Application filed by Infinetics Technologies, Inc. filed Critical Infinetics Technologies, Inc.
Priority to EP12723560.4A priority Critical patent/EP2708000B1/en
Priority to CA2872831A priority patent/CA2872831C/en
Publication of WO2012154751A1 publication Critical patent/WO2012154751A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00 Data switching networks
    • H04L 12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L 12/46 Interconnection of networks
    • H04L 12/4604 LAN interconnection over a backbone network, e.g. Internet, Frame Relay
    • H04L 12/462 LAN interconnection over a bridge based backbone
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/08 Configuration management of networks or network elements
    • H04L 41/0803 Configuration setting
    • H04L 41/0806 Configuration setting for initial configuration or provisioning, e.g. plug-and-play
    • H04L 45/00 Routing or path finding of packets in data switching networks
    • H04L 45/02 Topology update or discovery
    • H04L 45/06 Deflection routing, e.g. hot-potato routing
    • H04L 49/00 Packet switching elements
    • H04L 49/10 Packet switching elements characterised by the switching fabric construction
    • H04L 49/113 Arrangements for redundant switching, e.g. using parallel planes
    • H04L 49/118 Address processing within a device, e.g. using internal ID or tags for routing within a switch
    • H04L 49/15 Interconnection of switching modules
    • H04L 49/1553 Interconnection of ATM switching modules, e.g. ATM switching fabrics
    • H04L 49/35 Switches specially adapted for specific applications
    • H04L 49/356 Switches specially adapted for specific applications for storage area networks

Definitions

  • the invention relates generally to the interconnection of nodes in a network. More specifically, the invention relates to interconnected nodes of a communication network that provide some combination of computation and/or data storage and provide an efficient interchange of data packets between the nodes.
  • the network can be defined by a base network structure that can be optimized by selectively defining and connecting long hops between sections of the network, for example, to reduce the network diameter.
  • the invention provides a system and method for creating a cost- effective way of connecting together a very large number of producers and consumers of data streams.
  • One real world analogy is the methods for constructing roadway networks that allow drivers to get from a starting location to a destination while satisfying real world constraints such as 1) a ceiling on the amount of tax people are willing to pay to fund roadway construction; 2) a desire to maximize speed of travel subject to safety constraints; and 3) a desire to avoid traffic jams at peak travel times of day.
  • the cars are similar to the data sent over a computer network, and the starting locations and destinations represent the host computers connected to the network.
  • the constraints translate directly into cost, speed and congestion constraints in computer networks.
  • both the number of possible connections between nodes and the number of different ways of making those connections grow faster than the number of nodes.
  • a set of 6 nodes can have more than twice as many alternative ways to connect the nodes as a set of 3 nodes.
  • the possible number of connections between the nodes can vary from, on the low side, the number of nodes (N) minus 1, for destinations connected, for example, along a single line as shown in Fig. 1C, to N(N−1)/2 connections as shown in Fig. 1F, where every single node has a direct connection to every other node.
  • Another measure of the performance of a network is the diameter of the network, which refers to how many connections need to be traveled in order to get from any one destination to another.
  • its economy in the number of connections (3) is offset by the consequence that the only path, from one end of the network to the other, requires travel across three connections, thus slowing the journey.
  • in Fig. 1F, the large number of connections results in every destination being only one connection away from any other, permitting more rapid travel.
  • the two networks shown in Figs. 1C and 1F can also have very different behavior at peak traffic times. Assuming that each connection can support the same rate of traffic flow, the two end point nodes of the network shown in Fig. 1C will be affected if there is a lot of traffic traveling between the two nodes in the middle of the line. Conversely, in the network shown in Fig. 1F, since there is an individual connection between every possible combination of nodes, traffic flowing between two nodes is not affected at all by traffic flowing between a different pair of nodes.
  • Another difficulty arises in the construction of computer networks: it is difficult to have a large number of connections converging on a single point, such as shown in Fig. 1F.
  • In a computer data center, the devices that allow multiple connections to converge are called switches. These switches typically have physical limitations on the number of connections or ports: for example, around 50 ports for inexpensive switches, approaching 500 ports for more modern, expensive switches. This means that for a fully-meshed network like that shown in Fig. 1F, where delays and congestion are minimized, no more than, for example, 499 destination hosts could be connected together.
  • the present invention allows for the design of networks that can include a very large number of connections and a high level of complexity of the switches that manage those connections, while providing very high immunity from the congestion that limits the ability of all nodes to communicate with each other at maximum speed, no matter how other nodes are using the network.
  • Some embodiments of the invention include a method for constructing networks that can be within 5-10% of the theoretical maximum for data throughput across networks with multiple simultaneously communicating hosts, a highly prevalent use case in modern data centers.
  • methods for constructing highly ordered networks of hosts and switches are disclosed that make maximum use of available switch hardware and interconnection wiring.
  • the basic approach can include the following: selecting a symmetrical network base design, such as, a hypercube, a star, or another member of the Cayley graph family;
  • the entire network can be operated and managed as a single switch.
  • the network can have 2 to 5 times greater bisection bandwidth than with conventional network architectures that use the same number of component switches and ports.
  • the invention also includes flexible methods for constructing physical embodiments of the networks using commercially available switches, and methods for efficiently, accurately and economically interconnecting (wiring) the switches together to form a high performance network having improved packet handling.
  • Figures 1 A - IF show sample network layouts.
  • Figures 2A - 2C show symmetrical network structures according to some embodiments of the invention.
  • Figures 3A and 3B show an example of topological routing.
  • Figure 4A shows an order 3 hypercube and Figure 4B shows an order 3 hypercube with shortcuts added.
  • Figure 5 illustrates a typical large data center layer 2 network architecture.
  • Figure 6 illustrates hypercube notation and construction.
  • Figure 7 illustrates partitioning between topology and external ports.
  • Figure 8 illustrates packet non-blocking with 4 switches and 8 paths.
  • Figure 9 illustrates a network bisection according to some embodiments of the invention.
  • Figure 10 illustrates an 8 node network with long hops added.
  • Figures 11 - 15 are charts comparing long hop networks with alternative network configurations.
  • Figure 16 illustrates data center available bandwidth and cost for 4x external/topology port ratio.
  • Figure 17 illustrates data center available bandwidth and cost for 1x external/topology port ratio.
  • Figure 18 illustrates the reduction in average and maximum hops.
  • Figure 19 illustrates optimized wiring pattern using port dimension mapping according to an embodiment of the invention.
  • Figure 20 illustrates the integrated super switch architecture across an entire data center according to an embodiment of the invention.
  • Figure 21 illustrates a network architecture showing a flexible radix switch fabric according to an embodiment of the invention.
  • Figure 22 illustrates the flow of a data packet from an ingress switch through a network according to an embodiment of the present invention.
  • Figure 23 illustrates various network logical topographies according to an embodiment of the present invention.
  • Figure 24 illustrates a network architecture according to one embodiment of the invention.
  • Figure 25 illustrates a system including a Data Factory according to some embodiments of the invention.
  • Figure 26 illustrates a system interconnecting a control plane executive according to some embodiments of the invention.
  • the present invention is directed to methods and systems for designing large networks, and to the resulting large networks.
  • a way of connecting large numbers of nodes consisting of some combination of computation and data storage, and providing improved behaviors and features.
  • These behaviors and features can include: a) a practically unlimited number of nodes, b) throughput which scales nearly linearly with the number of nodes, without bottlenecks or throughput restriction, c) simple incremental expansion, where increasing the number of nodes requires only a proportional increase in the number of switching components while maintaining the throughput per node, d) maximized parallel multipath use of available node interconnection paths to increase node-to-node bandwidth, e) Long hop topology enhancements which can simultaneously minimize latency (average and maximum path lengths) and maximize throughput at any given number of nodes, f) a unified and scalable control plane, g) a unified management plane, h) simple connectivity: nodes connected to an interconnection fabric do not need to have any knowledge of the topology or routing.
  • the nodes can represent servers or hosts and network switches in a networked data center, and the interconnections represent the physical network cables connecting the servers to network switches, and the network switches to each other.
  • the nodes can represent geographically separated clusters of processing or data storage centers and the network switches that connect them over a wide area network.
  • the interconnections in this case can be the long distance data transfer links between the geographically separated data centers.
  • component switches can be used as building blocks, wherein the component switches are not managed by data center administrators as individual switches. Instead, switches can be managed indirectly via the higher level parameters characterizing collective behavior of the network, such as latency (maximum and average shortest path lengths), bisection (bottleneck capacity), all-to-all capacity, and aggregate throughput.
  • Internal management software can be used to translate selected values for these collective parameters into the internal configuration options for the individual switches and if necessary into rewiring instructions for data center technicians. This approach makes management and monitoring scalable.
  • a method of designing an improved network includes modifying a basic hypercube network structure in order to optimize latency and bandwidth across the entire network. Similar techniques can be used to optimize latency and bandwidth across other Cayley graph symmetrical networks such as star, pancake and truncated hypercube networks.
  • a symmetrical network is one that, from the perspective of a source or a destination, looks the same no matter where you are in the network, and which allows some powerful methods to be applied both for developing routing methods for moving traffic through the network and for adding short cuts to improve throughput and reduce congestion.
  • One commonly known symmetrical network structure is based on the structure of a hypercube.
  • the hypercube structured network can include a set of destinations organized as the corners of a cube, such as shown in Fig. 2A.
  • the structure shown in Fig. 2A is known as an order 3 hypercube, based on each destination having three connections to neighboring destinations. To generate a higher order hypercube, copy the original hypercube and connect all the destinations in the first hypercube with the corresponding destination in the copy as shown in Fig. 2B.
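The doubling construction above is easy to express in code. The following is an illustrative sketch only (not from the patent), assuming nodes are labeled by integers whose binary digits are the cube coordinates:

```python
# Build an order-d hypercube by repeated doubling: copy the order-(d-1)
# cube and connect each node to its counterpart in the copy.
def hypercube_links(d):
    links = set()   # order-0 cube: a single node, no links
    n = 1
    for _ in range(d):
        # the copy occupies labels [n, 2n)
        links |= {(a + n, b + n) for (a, b) in links}
        # connect every original node to its counterpart in the copy
        links |= {(a, a + n) for a in range(n)}
        n *= 2
    return links

cube3 = hypercube_links(3)   # the order 3 cube of Fig. 2A
assert len(cube3) == 12      # 8 corners, 3 connections each: 8*3/2 = 12 links
```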
  • Hypercubes are just one form of symmetrical network.
  • Another form of symmetrical network is the star graph shown in Fig. 2C.
  • There are many other types of symmetrical networks, known formally as Cayley graphs, that can be used as a basis on which to apply the methods of the invention.
  • topological routing can be used to route messages through the symmetrical network.
  • Topological routing can include a method for delivering messages from a source node to a destination node through a series of intermediate locations or nodes where the destination address on the message describes how to direct the message through the network.
  • a simple analogy is the choice of method for labeling streets and numbering houses in a city. In some planned areas such as Manhattan, addresses not only describe a destination location, "425 17th Street", but also describe how to get there from a starting point. If it is known that house numbers are allocated 100 per block, and the starting location is 315 19th Street, it can be determined that the route includes going across one block and down two streets to get to the destination.
  • traveling from N 200 W 2nd Street to N 100 E 1st Street can include going east 3 blocks and south one block.
  • Fig. 3B has roads that are not laid out in any regular pattern, and the names of the streets have no pattern either. This "plan" requires a "map" to determine how to get from one place to another.
  • Topological addressing is important in large networks because it means that a large map does not have to be both generated and then consulted at each step along the way of sending a message to a destination. Generating a map is time consuming and consumes a lot of computing resources, and storing a map at every step along the way between destinations consumes a lot of memory storage resources and requires considerable computation to look up the correct direction on the map each time a message needs to be sent on its way towards its destination.
  • the small maps required by topological addressing are not just a matter of theoretical concern. Present day data centers have to take drastic, performance impacting measures to keep their networks divided into small enough segments that the switches that control the forwarding of data packets do not get overwhelmed with building a map for the large number of destinations for which traffic flows through each switch.
  • the performance of these symmetrical networks can be greatly improved by the select placement of "short cuts" or long hops according to the invention.
  • the long hops can simultaneously reduce the distance between destinations and improve the available bandwidth for simultaneous communication.
  • Fig. 4A shows a basic order 3 hypercube, where the maximum distance of three links between destination nodes occurs at the opposite corners.
  • adding shortcuts across all three corners as shown in Fig. 4B reduces the distance between the destinations that used to have the worst case distance of three to a distance of one link.
  • this method can be applied to hypercubes of higher order with many more destinations.
  • a method for identifying select long hops in higher order hypercube networks and symmetric networks can include determining a generator matrix using linear error correcting codes to identify potential long hops within the network.
  • Figure 5 shows a diagram of a typical commercial data center.
  • networks according to the invention can be expanded (increasing the number of host computer ports) practically, without limit or performance penalty.
  • the expansion can be flexible, using commodity switches having a variable radix.
  • for switches which can be upgraded from an initial configuration with a smaller radix to a configuration with a higher radix, the maximum radix is fixed in advance at no more than a few hundred ports.
  • the 'radix multiplier' switching fabric for the maximum configuration is hardwired in the switch design.
  • a typical commercial switch such as the Arista 7500 can be expanded to 384 ports by adding up to 8 line cards, each providing 48 ports; but the switching fabric gluing the 8 separate 48 port switches into one 384 port switch is rigidly fixed by the design and it is even included in the basic unit.
  • the networks constructed according to some embodiments of the invention have no upper limit on the maximum number of ports they can provide. And this holds for an initial network design as well as for any subsequent expansion of the same network.
  • the upper limit for simple expansion without performance penalty is 2^(R−1) component switches. Since typical R is at least 48, even this conditional limit of 2^47 ≈ 1.4×10^14 on the radix expansion is already far larger than the number of ports in the entire internet, let alone in any existing or contemplated data center.
  • data center layer 2 networks are typically operated and managed as networks of individual switches where each switch requires individual installation, configuration, monitoring and management.
  • the data center network can be operated and managed as a single switch. This allows the invention to optimize all aspects of performance and costs (of switching fabric, cabling, operation and management) to a far greater degree than existing solutions.
  • networks according to some embodiments of the invention can provide improved performance over any existing data center Layer 2 networks, on the order of 2 to 5 times greater bisection bandwidth than conventional network architectures that use the same number of component switches and ports.
  • the invention also describes novel and flexible methods for realizing physical embodiments of the network systems described, both in the area of wiring switches together efficiently, accurately and economically, as well as ways to use existing functionality in commercial switches to improve packet handling.
  • Hypercubes can be characterized by their number of dimensions, d.
  • a d-cube can be a d-dimensional binary cube (or Hamming cube, hypercube graph) with network switches as its nodes, using d ports per switch for the d connections per node.
  • Each switch can have some number of ports dedicated to interconnecting switches and the remainder dedicated to connecting hosts.
  • a concise binary d-bit notation for nodes (and node labels) of a d-cube can be used.
  • the d-cube coordinates of the switches can be used as their physical MAC addresses, and the optimal routing becomes very simple. Routing can be done entirely locally, within each switch, using only O(log(N)) resources (where N is the maximum number of switches).
  • when the total number of switches N_s in the network is not an exact power of 2, the d-cubes can be truncated so that for any accessible M the relation M < N_s holds, where the bit string M is interpreted as an integer (instead of M < 2^d, which is used for a complete d-cube).
  • instead of the O(N) size forwarding table and an O(N) routing tree, the switches only need one number N_s and their own MAC address to forward frames along the shortest paths.
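As an illustration of this local forwarding rule, here is a minimal sketch (it assumes, consistently with the d-bit notation above, that port i of each switch connects along hypercube dimension i):

```python
# Local shortest-path forwarding: the bits set in dst XOR current are exactly
# the dimensions still to be traversed; flipping any one of them is a hop
# along a shortest path.
def next_port(current, dst):
    jump = current ^ dst            # the "jump vector"
    if jump == 0:
        return None                 # arrived: deliver on the local E-port
    return (jump & -jump).bit_length() - 1   # lowest differing dimension

# Route 000 -> 110 in an order-3 cube: two hops, along dimensions 1 then 2.
hops, node = [], 0b000
while (p := next_port(node, 0b110)) is not None:
    hops.append(p)
    node ^= 1 << p
assert hops == [1, 2] and node == 0b110
```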
  • one useful parameter of the hypercubic network topology is the port distribution, or ratio of the internal topology ports (T-ports) used to interconnect switches to the external ports (E-ports) that the network uses to connect to hosts (servers and routers).
  • each E-port can be simultaneously a source and a sink (destination) of data packets.
  • the evaluation of the NB property of a network can depend on the specific meaning of "sending data" as defined by the queuing model. Based on the kind of "sending data", there can be two forms of NB: Circuit NB (NB-C) and Packet NB (NB-P).
  • each source X can send a continuous stream at its full port capacity to its destination Y.
  • each source can send one frame to its destination Y.
  • there is a path connecting each XY pair.
  • the difference in these paths for the two forms of NB is that for NB-C each XY path has to have all its hops reserved exclusively for its XY pair at all times, while for NB-P, the XY path needs to reserve a hop only for the packet forwarding step in which the XY frame is using it.
  • NB-C is a stronger requirement, i.e. if a network is NB-C then it is also NB-P.
  • Fig. 8 shows the 8 paths with their properties discernable by splitting the diagram into (a) and (b) parts, but the two are actually running on the same switches and lines simultaneously.
  • the short arrows with numbers show the direction of the frame hop and the switching step/phase at which it takes place. It is evident that at no stage of the switching, which lasts 3 hops, is any link required to carry 2 or more frames in the same direction (these are duplex lines, hence 2 frames can share a link in opposite directions); hence NB-P holds for this instance.
  • Not all paths are the shortest ones possible (e.g. the path X1→Y1, which took 3 hops although the shortest path is 1 hop, the same one as the path X2→Y2).
  • each switch receives d frames from its d E-ports. If there were just one frame per switch instead of d, the regular hypercube routing could solve the problem, since there are no conflicts between multiple frames targeting the same port of the same switch. Since each switch also has exactly d T-ports, if each switch sends d frames, one frame to each port in any order, in the next stage each switch again has exactly d frames (received via its d T-ports), without collisions or frame drops so far.
  • In order to assure finite time delivery, each switch must pick, out of the maximum d frames it can have in each stage, the frame closest to its destination (the one with the lowest Hamming weight of its jump vector Dst⊕Current) and send it to the correct port. The remaining d−1 frames (at most; there may be fewer) are sent on the remaining d−1 ports applying the same rule (the closest one gets highest priority, etc.). Hence after this step is done on each of the N switches, there are at least N frames (the N frames selected as closest) that have advanced along their shortest paths.
  • load balancing can be performed locally at each switch.
  • the switch can select the next hop along a different d-cube dimension than the last one sent, if one is available. Since for any two points with distance (shortest path) L there are L! alternative paths of equal length L, there are plenty of alternatives to avoid congestion, as the sketch below illustrates.
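A short sketch (illustrative only, not from the patent) makes the L! count concrete: every ordering of the L differing dimensions is a distinct equal-length path, giving the load balancer its choice of dimension at each hop:

```python
from itertools import permutations

# Enumerate all shortest paths between two hypercube nodes: one path per
# ordering of the dimensions in which the two labels differ.
def shortest_paths(src, dst):
    diff = src ^ dst
    dims = [i for i in range(diff.bit_length()) if diff >> i & 1]
    for order in permutations(dims):        # L! orderings
        path, node = [src], src
        for d in order:
            node ^= 1 << d
            path.append(node)
        yield path

assert len(list(shortest_paths(0b000, 0b111))) == 6   # L = 3, 3! = 6 paths
```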
  • symmetrical networks with long hop shortcuts are used to achieve high performance in the network, however additional forwarding management can be used to optimize the network and achieve higher levels of performance. As the size of the network (number of hosts) becomes large, it is useful to optimize the forwarding processes to improve network performance.
  • each switch can maintain a single fixed-size forwarding table (of size O(N)) and a network connection matrix (of size O(N·R), where R is the switch radix and N the number of switches).
  • the network can be divided into a hierarchy of clusters, which for performance reasons align with the actual network connectivity.
  • the 1st level clusters contain R nodes (switches) each, while each higher level cluster contains R sub-clusters of the previous lower level.
  • each node belongs to exactly one 1st level cluster, which belongs to exactly one 2nd level cluster, etc.
  • each node is labeled via two decimal digits, e.g. node 3.5 is the node with index 3 in the cluster with index 5.
  • if node 3.5 needs to forward to some node 2.8, all that 3.5 needs to know is how to forward to a single node in cluster 8, as long as each node within cluster 8 knows how to forward within its own cluster.
  • nodes have more than a single destination forwarding address.
  • the array T_1[R] contains the ports which node F needs to use to forward to each of the R nodes in its own 1st level cluster.
  • This forwarding is not assumed to be a single hop, so the control algorithm can seek to minimize the number of hops when constructing these tables.
  • a convenient topology, such as the hypercube type, makes this task trivial, since each such forwarding step is a single hop to the right cluster.
  • the control algorithm can harmonize node and cluster indexing with port numbers so that no forwarding tables are needed at all.
  • the array T_2 contains the ports F needs for forwarding to a single node in each of the R 2nd level clusters belonging to the same 3rd level cluster as node F; T_3 contains the ports F needs for forwarding to a single node in each of the R 3rd level clusters belonging to the same 4th level cluster as F; ... and finally T_m contains the ports F needs to use to forward to a single node in each of the R m-th level clusters belonging to the same (m+1)-th level cluster (which is a single cluster containing the whole network). In accordance with some embodiments of the invention, forwarding can be accomplished as follows.
  • the implementation of this technique can involve the creation of hierarchical addresses. Since the forwarding to clusters at levels > 1 involves approximation (a potential loss of information, and potentially sub-optimal forwarding), for the method to forward efficiently it can be beneficial to a) reduce the number of levels m to the minimum needed to fit the forwarding tables into the CAMs (content addressable memories) and b) reduce the forwarding approximation error for m > 1 by selecting the formal clustering used in the construction of the network hierarchy to match as closely as possible the actual topological clustering of the network.
  • Forwarding efficiency can be improved by reducing the number of levels m to the minimum needed to fit the forwarding tables into the CAMs.
  • the conventional CAM tables can be used. The difference from the conventional use is that instead of learning the MAC addresses, which introduce additional approximation and forwarding inaccuracy, the firmware can program the static forwarding tables directly with the hierarchical tables.
  • m is the lowest value satisfying the inequality m·N^(1/m) ≤ C.
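For illustration, choosing m can be a direct search; a minimal sketch follows (the example values are assumptions, not from the patent):

```python
# Smallest number of hierarchy levels m whose per-switch forwarding table
# size m * N**(1/m) fits into a CAM with capacity C entries.
def min_levels(N, C):
    m = 1
    while m * N ** (1.0 / m) > C:
        m += 1
    return m

# e.g. one million switches and room for 4096 CAM entries need only 2 levels,
# since 2 * 1_000_000 ** (1/2) = 2000 <= 4096.
assert min_levels(1_000_000, 4096) == 2
```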
  • the hypercubes of dimension d are intrinsically clustered into lower level hypercubes corresponding to partition of d into m parts.
  • the following clustering algorithm performs well in practice and can be used for general topologies:
  • a node which is the farthest from the existing complete clusters is picked as the seed for the next cluster (the first pick, when there are no other clusters, is arbitrary).
  • the neighbor x with the max value of the V(x) score is then assigned to the cluster.
  • the cluster growth stops when there are no more nodes or when the cluster target size is reached (whichever comes first). When no more unassigned nodes are available the clustering layer is complete.
  • the next layer clusters are constructed by using the previous lower layer clusters as the input to this same algorithm.
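A sketch of one pass of this greedy clustering is given below. It is illustrative only; in particular the score V(x) is not specified in the excerpt above, so as an assumption a candidate is scored by how many links it has into the growing cluster:

```python
from collections import deque

def cluster_layer(adj, target_size):
    """One clustering layer. adj maps node -> set of neighbors; for the next
    layer, feed in the cluster-level graph produced from this one."""
    unassigned, clusters = set(adj), []
    while unassigned:
        seed = farthest_unassigned(adj, clusters, unassigned)
        cluster = {seed}
        unassigned.discard(seed)
        while len(cluster) < target_size:
            frontier = {x for c in cluster for x in adj[c]} & unassigned
            if not frontier:
                break
            # assumed score V(x): number of links into the current cluster
            best = max(frontier, key=lambda x: len(adj[x] & cluster))
            cluster.add(best)
            unassigned.discard(best)
        clusters.append(cluster)
    return clusters

def farthest_unassigned(adj, clusters, unassigned):
    """BFS outward from all completed clusters; the last unassigned node
    reached is the farthest. The first seed (no clusters yet) is arbitrary."""
    done = {x for c in clusters for x in c}
    if not done:
        return next(iter(unassigned))
    seen, queue, last = set(done), deque(done), next(iter(unassigned))
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
                if nb in unassigned:
                    last = nb
    return last
```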
  • For regular networks (graphs), those in which all nodes have the same number of topological links per node m (i.e. m is the node degree), it follows that the total number of topological ports is P_T = n·m.
  • the network capacity or throughput is commonly characterized via the bisection (bandwidth), which is defined in the following manner: the network is partitioned into two equal subsets (an equipartition) S_1 + S_2 so that each subset contains n/2 nodes (within ±1 for odd n). The total number of links connecting S_1 and S_2 is called a cut for the partition S_1 + S_2.
  • Bisection B is defined as the smallest cut (min-cut) over all possible equipartitions S_1 + S_2 of the network.
  • Bisection is thus an absolute measure of the network bottleneck throughput.
  • a related, commonly used relative throughput measure is the network oversubscription β, defined by considering the P/2 free ports in each min-cut half, S_1 and S_2, with each port sending and receiving at its maximum capacity to/from the ports in the opposite half.
  • the maximum traffic that can be sent in each direction this way without overloading the network is B link (port) capacities, since that is how many links the bisection has between the halves. Any additional demand that the free ports are capable of generating is thus considered an "oversubscription" of the network.
  • the oversubscription β is defined as the ratio: β = (P/2) / B.
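A worked example of this definition (values chosen for illustration only):

```python
# Oversubscription: P/2 free ports per min-cut half, but only B link
# capacities crossing between the halves.
def oversubscription(P, B):
    return (P / 2) / B

assert oversubscription(128, 32) == 2.0   # 128 external ports, bisection 32:
                                          # the network is 2:1 oversubscribed
```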
  • the latency, average or maximum (diameter), is another property that is often a target of optimization.
  • the improvements in latency are less sensitive to the distinction between the optimal and approximate solutions, with typical advantage factors of only 1.2-1.5. Accordingly, greater optimization can be achieved in LH networks by optimizing the bisection than by optimizing the network to improve latency.
  • the present invention is directed to Long Hop networks and methods of creating Long Hop networks.
  • the description provides illustrative examples of methods for constructing a Long Hop network in accordance with the invention.
  • one function of a Long Hop network is to create a network interconnecting a number of computer hosts to transfer data between computer hosts connected to the network.
  • the data can be transferred simultaneously and with specified constraints on the rate of data transmission and the components (e.g., switches and switch interconnect wiring) used to build the network.
  • a Long Hop network includes any symmetrical network whose topography can be represented by a Cayley graph, where the corresponding Cayley graphs have generators corresponding to the columns of Error Correcting Code (ECC) generator matrices G (or their isometric equivalents; instead of G one can also use the equivalent components of the parity check matrix H).
  • the Long Hop networks in accordance with some embodiments of the invention can have performance (bisection in units of n/2) within 90% of the lower bounds of the related ECC, as described by the Gilbert-Varshamov bound theorem.
  • Long Hop networks will include networks having 128 or more switches (e.g., dimension 7 hypercube or greater) and/or direct networks.
  • Long Hop networks can include networks having the number of interconnections m not equal to d, d+1, ..., d+d−1 and m not equal to n−1, n−2.
  • the wiring pattern for connecting the switches of the network can be determined from a generator matrix that is produced from the error correcting code that corresponds to the hypercube dimension and the number of required interconnections determined as function of the oversubscription ratio.
  • central processing units (CPUs)
  • data transfer channels within integrated circuits or within larger hardware systems such as backplanes and buses.
  • the Long Hop network can include a plurality of network switches and a number of network cables connecting ports on the network switches to ports on other network switches or to host computers.
  • Each cable connects either a host computer to a network switch or a network switch to another network switch.
  • the data flow through a cable can be bidirectional, allowing data to be sent simultaneously in both directions.
  • the rate of data transfer can be limited by the switch or host to which the cable is connected.
  • the data flow through the cable can be uni-directional.
  • the rate of data transfer can be limited only by the physical capabilities of the cable media (e.g., the construction of the cable).
  • the cable can be any medium capable of transferring data, including metal wires, fiber optic cable, and wired and wireless electromagnetic radiation (e.g., radio frequency signals and light signals).
  • different types of cable can be used in the same Long Hop network.
  • each switch has a number of ports and each port can be connected via a cable to another switch or to a host.
  • at least some ports can be capable of sending and receiving data, and at least some ports can have a maximum data rate (bits per second) that it can send or receive.
  • Some switches can have ports that all have the same maximum data rate, and other switches can have groups of ports with different data rates or different maximum data transfer rates for sending or receiving.
  • all switches can have the same number of ports, and all ports can have the same send and receive maximum data transfer rate.
  • at least some of the switches in a Long Hop network can have different numbers of ports, and at least some of the ports can have different maximum data transfer rates.
  • Switches can receive data and send data on all their ports simultaneously.
  • a switch can be thought of as similar to a rail yard where incoming train cars on multiple tracks can be sent onward on different tracks by using a series of devices that control which track among several options a car continues onto.
  • the Long Hop network is constructed of switches and cables. Data is transferred between a host computer or a switch and another switch over a cable. The data received from a sending host computer enters a switch, which can then forward the data either directly to a receiving host computer or to another switch, which in turn decides whether to continue forwarding the data to another switch or directly to a host computer connected to the switch.
  • all switches in the network can be both connected to other switches and to hosts. In accordance with other embodiments of the invention, there can be interior switches that only send and receive to other switches and not to hosts as well.
  • the Long Hop network can include a plurality of host computers.
  • a host computer can be any device that sends and/or receives data to or from a Switch over a Cable.
  • host computers can be considered the source and/or destination of the data transferred through the network, but not considered to be a direct part of the Long Hop network being constructed.
  • host computers cannot send or receive data faster than the maximum data transfer rate of the Switch Port to which they are connected.
  • At least some of following factors can influence the construction of the network.
  • the factors can include 1) the number of Hosts that must be connected; 2) the number of switches available, 3) the number of ports on each switch; 4) the maximum data transfer rate for switch ports; and 5) the sum total rate of simultaneous data transmission by all hosts.
  • Other factors, such as the desired level of fault tolerance and redundancy, can also be factors in the construction of a Long Hop network.
  • the desired characteristics of the Long Hop network can limit the combinations of the above factors used in the construction of a Long Hop network that can actually be built. For example, it is not possible to connect more hosts to a network than the total number of switches multiplied by the number of ports per switch minus the number of ports used to interconnect switches (see the sketch after this list). As one of ordinary skill would appreciate, a number of different approaches can be used to design a network depending on the desired outcome.
  • given switches with a given maximum data transfer rate and ports per switch, how many switches are needed and how should they be connected in order to allow all hosts to send and receive simultaneously at 50% of their maximum data transfer rate?
  • given a number of switches with a given number of ports and maximum data transfer rate, how much data can be simultaneously transferred across the network, and what switch connection pattern(s) support that performance?
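The host-count bound mentioned above is simple arithmetic; a sketch (the function name and sample radix are illustrative assumptions):

```python
# With N switches of radix R, T of whose ports are used as topology ports,
# at most N * (R - T) ports remain for hosts.
def max_hosts(num_switches, radix, topology_ports):
    return num_switches * (radix - topology_ports)

assert max_hosts(16, 48, 7) == 16 * 41   # cf. the 16-switch, 7 T-port example below
```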
  • the Long Hop network includes 16 switches and uses up to 7 ports per switch for network interconnections (between switches). As one of ordinary skill will appreciate, any number of switches can be selected, and the number of ports for network interconnection can be selected in accordance with the desired parameters and performance of the Long Hop network.
  • the method includes determining how to wire the switches (or change the wiring of an existing network of switches) and the relationship between the number of attached servers per switch and the oversubscription ratio.
  • the ports on each switch can be allocated to one of two purposes, external connections (e.g., for connecting the network to external devices including host computers, servers and external routers or switches that serve as sources and destinations within the network), and topological or internal connections.
  • An external network connection is a connection between a switch and a source or destination device that enables data to enter the network from a source or exit the network to a destination.
  • a topological or internal network connection is a connection between the network switches that form the network (e.g., that enables data to be transferred across the network).
  • the oversubscription ratio can be determined as the ratio between the total number of host connections (or, more generally, external ports) and the bisection (given as the number of links crossing the min-cut partition).
  • an oversubscription ratio of 1 indicates that in all cases, all hosts can simultaneously send at the maximum data transfer rate of the switch port.
  • an oversubscription ratio of 2 indicates that the network can only support a sum total of all host traffic equal to half of the maximum data transfer rate of all host switch ports.
  • an oversubscription ratio of 0.5 indicates that the network has twice the capacity required to support maximum host traffic, which provides a level of failure resilience such that if one or more switches or connections between switches fails, the network will still be able to support the full traffic volume generated by hosts.
  • the base network can be an n-dimensional hypercube.
  • the base network can be another symmetrical network such as a star, a pancake and other Cayley graphs based network structure.
  • an n-dimensional hypercube can be selected as a function of the desired number of switches and interconnect ports.
  • a generator matrix is produced for the linear error correcting code that matches the underlying hypercube dimension and the number of required interconnections between switches as determined by the network oversubscription ratio.
  • the generator matrix can be produced by retrieving it from one of the publicly available lists, such as the one maintained by the MinT project (http://mint.sbg.ac.at/index.php).
  • the generator matrix can be produced using a computer algebra system such as the Magma package, for example, via a command entered into the Magma calculator (http://magma.maths.usyd.edu.au/calc/).
  • a linear error correcting code generator matrix can be converted into a wiring pattern matrix by rotating the matrix counterclockwise 90 degrees, for example, as shown in Table 4.9.
  • each switch has 7 ports connected to other switches and 16 total switches corresponding to an LH augmented dimension 4 hypercube.
  • Generators h1 through h7 correspond to the original columns from the rotated [G_{4,7}] matrix that can be used to determine how the switches are connected to each other by cables.
  • the 16 switches can be labeled with binary addresses 0000, 0001, through 1111. The switches can be connected to each other using the 7 ports assigned for this purpose, labeled h1 through h7, by performing the following procedure for each of the sixteen switches.
  • This wiring procedure describes how to place the connections to send from a source switch to a destination switch, so for each connection from a source switch to a destination switch there is also a connection from the destination switch back to the source switch.
  • a single bi-directional cable is used for each pair of connections.
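The wiring procedure reduces to XOR-ing each switch address with each generator. The sketch below is illustrative; the hop values shown come from one standard form of the [7,4] Hamming code generator matrix and may differ from the patent's Table 4.9:

```python
# Connect switch s, for each generator h_i, to switch s XOR h_i. Because
# (s ^ h) ^ h == s, the reverse connection lands on the same port number
# of the peer switch, so one duplex cable serves each pair.
def wiring(num_switches, hops):
    cables = set()
    for s in range(num_switches):
        for h in hops:
            cables.add(frozenset((s, s ^ h)))   # unordered pair: one cable
    return cables

# Columns of a [7,4] Hamming generator matrix, read as 4-bit integers (assumed):
hops = [0b0001, 0b0010, 0b0100, 0b1000, 0b0111, 0b1011, 0b1101]
assert len(wiring(16, hops)) == 16 * 7 // 2   # 56 bidirectional cables
```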
  • the LH networks are direct networks constructed using general Cayley graphs Cay(G_n, S_m) for the topology of the switching network.
  • Node labels and group operation table
  • the nodes v_i are labeled using l-tuples in an alphabet of size q: v_i ∈ G_n.
  • the 2-digit entries have digits which are from alphabet ⁇ 0,1,2 ⁇ .
  • the n rows and n columns are labeled using 2-digit node labels.
  • each row r and column c contains all n group elements, but in a unique order.
  • Generator set S_m contains m "hops" h_1, h_2, ... h_m (they are also elements of the group G_n in Cay(G_n, S_m)), which can be viewed as the labels of the m nodes to which the "root" node v_0 = 0 is connected.
  • eq. (4.1) defines T(a) for any element a (or vertex) of the group G_n. Since the right hand side expression in eq. (4.1) is symmetric in i and j, it follows that T(a) is a symmetric matrix, hence it has a real, complete eigenbasis.
  • Fig. 10 shows the resulting 8-node network (folded 3-cube, FQ3).
  • Actions (bitwise XOR) of the 4 generators T(a), a ∈ {001, 010, 100, 111}_bin, on the node 000 are indicated by the arrows pointing to the target vertex. All other links are shown without arrows.
  • T(a)T(b) = T(a⊕b) (4.5)
  • the T(a) matrices are a representation of the group G_n, and it follows from eq. (4.6) that they commute with each other. Since via eq. (4.2) [A] is the sum of the T(a) matrices, [A] commutes with all T(a) matrices as well. Therefore, since they are all also symmetric matrices, the entire set {[A], T(a) ∀a} has a common eigenbasis (via result (M4) in section 2.F). The next sequence of equations shows that the Walsh functions, viewed as n-dimensional vectors, form such an eigenbasis.
  • the eigenvalues for [A] are obtained by applying eq. (4.9) to the expansion of [A] via the T(a), eq. (4.2).
  • An equipartition X can be represented by an n-dimensional vector X = (x_0, x_1, ... x_{n−1}).
  • the cut value C(X) for a given partition X is obtained as the count of links which cross between nodes in S_1 and S_2.
  • Such links can be easily identified via E and the adjacency matrix [A], since [A]_{ij} is 1 iff nodes i and j are connected and 0 if they are not connected.
  • Bisection B is computed as the minimum cut C(X) over all X ∈ E, which via eq. (4.14) yields eq. (4.15).
  • the Rayleigh-Ritz eqs. (2.45)-(2.46) do not directly apply to the min{} and max{} expressions in eq. (4.15). Namely, the latter extrema are constrained to the set E of equipartitions, which is a proper subset of the full vector space V_n to which Rayleigh-Ritz applies.
  • the M_E ≡ max{} in eq. (4.16) can be smaller than the M_V ≡ max{} computed by eq. (2.46), since the result M_V can be a vector from V_n which doesn't belong to E (the set containing only the equipartitions).
  • M_E is analogous to the "tallest programmer in the world" while M_V is analogous to the "tallest person in the world." Since the set of "all persons in the world" (analogous to V_n) includes as a proper subset the set of "all programmers in the world" (analogous to E), the tallest programmer may be shorter than the tallest person (e.g. the latter might be a non-programmer). Hence in the general case the relation between the two extrema is M_E ≤ M_V. The equality holds only if at least one solution from M_V also belongs to M_E, or in the analogy, if at least one of the "tallest persons in the world" is also a programmer. Otherwise, the strict inequality M_E < M_V holds.
  • Subspace V_0 is a one dimensional space spanned by the single 'vector of all ones' |1⟩.
  • V_E is the (n−1) dimensional orthogonal complement of V_0 within V_n, i.e. V_E is spanned by some basis of n−1 vectors which are orthogonal to |1⟩.
  • V_E is spanned by the remaining orthogonal set of n−1 Walsh functions |U_k⟩, k = 1..n−1.
  • For convenience the latter subset of Walsh functions is labeled as the set Φ below; the eigenbasis used for [A] in eq. (4.22) is this set of Walsh functions.
  • the inner loop in (4.31) executes m times and the outer loop (n−1) times, yielding a total of ~m·n steps for all n−1 values of k.
  • [W̄_n] denotes the bitwise complement of matrix [W_n].
  • the left and right sub-matrices [W_n] are the same, suggesting that after computing in eq. (4.29) the partial sums of W_k(h_s) over h_s < n and k < n (the upper left quadrant of W_2n), the remaining n partial sums for k > n (the top right quadrant of W_2n) can be copied from the computed left half.
  • the left and right quadrants of the lower sub-matrices are complements of each other, which replaces the above copying method with subtraction from a constant and copying (the constant is the number of hops h_s ≥ n, i.e. the h_s in the lower half of the W_2n matrix).
  • the B computation consists of finding the largest element in the set {λ_k} of n−1 elements.
  • From the orthogonality and completeness of the n vectors |U_k⟩, i.e. ⟨U_j|U_k⟩ = n·δ_jk, an important property of the set Φ follows.
  • eq. (4.42) also defines a quantity b, which is the bisection in units of n/2.
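Under the spectral machinery above, the whole computation is small enough to sketch. This is illustrative only; it assumes the binary case, where the eigenvalue for Walsh index k is the sum over hops of (−1) raised to the parity of popcount(k AND h_s):

```python
# Bisection of a binary Cayley graph with n = 2**d nodes and hop list S_m,
# in units of n/2: b = (m - max_{k>0} lambda_k) / 2, where
# lambda_k = sum_s (-1)**parity(k & h_s).
def bisection_b(d, hops):
    n, m = 2 ** d, len(hops)
    lam = lambda k: sum(-1 if bin(k & h).count("1") % 2 else 1 for h in hops)
    return (m - max(lam(k) for k in range(1, n))) / 2

assert bisection_b(3, [1, 2, 4]) == 1.0      # plain 3-cube: bisection n/2
assert bisection_b(3, [1, 2, 4, 7]) == 2.0   # folded 3-cube of Fig. 10: doubled
```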
  • the worst case computational complexity of the B optimization is thus O((n·log(n))^m), which is polynomial in n; hence, at least in principle, it is a computationally tractable problem as n increases.
  • the actual exponent would be (m − log(n) − 1), not m, since the Cayley graphs are highly symmetrical and one would not have to search over the symmetrically equivalent subsets S_m.
  • m is typically a hardware characteristic of the network components, such as switches, which usually don't get replaced often as the network size n increases.
  • Eq. (4.43) also expresses W_k(j) in terms of the parity function F(x).
  • the set of vectors |V(k)⟩ obtained via eq. (4.47) when k runs through all possible integers 0..n−1 is a d-dimensional vector space, a linear span (a subspace of the m-tuples vector space F^m), which is denoted Λ(d,m,q).
  • Hamming weight can be used in some embodiments of the invention; any other weight, such as the Lee weight, which would correspond to other Cayley graph groups G_n and generator sets S_m, can also be used.
  • the [7,4,3]_2 generator matrix on the left side was rotated 90° counter-clockwise, and the resulting 7 rows of 4 digits are the binary values for the 7 generators h_s (also shown in hex) of the 16 node Cayley graph.
  • the methods of determining the bisection B can be implemented using a computer program or set of computer programs organized to perform the various steps described herein.
  • the computer can include one or more processors and associated memory, including volatile and non-volatile memory to store the programs and data.
  • a conventional IBM compatible computer running the Windows or Linux operating system, or an Apple computer system, can be used, and the programs can be written, for example, in the C programming language.
  • Non-binary codes
  • the linear codes with q > 2 generate hyper-torus/mesh type networks of extent q when the metric of the code is the Lee distance.
  • the networks are of generalized hypercube/flattened butterfly type [3].
  • Walsh functions readily generalize to other groups besides the cyclic group Z_2 used here (cf. [23]).
  • a simple generalization to base q > 2 for the groups Z_q^n, for any integer q, is based on defining the function values via the q-th primitive root of unity ω.
  • the non-binary Walsh functions U_{q,k} can also be used to define graph partitions into f parts, where f is any divisor of q (including q). For even q, this allows for efficient computation of the bisection.
  • the generators T(a) and the adjacency matrix [A] are computed via the general eqs. (4.1), (4.2), where the ⊕ operator is G(q) addition (mod q).
  • the basic algorithm attempts replacement of typically 1 or 2 generators h_s ∈ S_m, and for each new configuration it evaluates (incrementally) the target utility function, such as diameter, average distance or max-cut (or some hierarchy of these, used for tie-breaking rules).
  • the number of simultaneous replacements r depends on n, m and the available computing resources. Namely, there are ~n^r possible simultaneous deletions and insertions (assuming the "best" deletion is followed by the "best" insertion).
  • the utility function also uses indirect measures (analogous to sub-goals) as a tie-breaking selection criterion.
  • the bisection b can be maintained fixed for all replacements (e.g. if bisection is the highest valued objective), or one can allow b to drop by some value, if the secondary gains are sufficiently valuable.
  • This is a specialized domain of network parameters where the 2-layer Fat Tree (FT-2) networks are currently used, since they achieve a yield of E = R/3 external ports/switch, which is the maximum mathematically possible for the worst case traffic patterns.
  • Table 4.10(a) shows the non-diagonalized hops after step (i).
  • LH networks are useful for building modular switches, networks on a chip in multi-core or multi-processor systems, flash memory/storage network designs, or generally any of the applications requiring very high bisection from a small number of high radix components and where FT-2 (two level Fat Tree) is presently used. In all such cases, LH-HD will achieve the same bisections at a lower latency and lower cost for Gb/s of throughput.
  • each added pattern C must have at least 2 ones. Namely, there are a total of 2^L distinct bit patterns of length L. Among all 2^L possible L-bit patterns, 1 pattern has 0 ones (00..0) and L patterns have a single one. Removing these two types, with 0 ones or a single one, leaves 2^L − (L+1) L-bit patterns with two or more ones, which is the left hand side of eq. (4.60). Any subset of d distinct patterns out of these 2^L − (L+1) remaining patterns can be chosen for the above augmentation.
  • Table 4.12 shows the values L (number of hops added to a d-cube) satisfying eq. (4.60) for dimensions d of practical interest.
  • a special case when XOR-ing the hop list is the case in which the resulting hop h_{m+1} happens to come out as 0 (which is an invalid hop value, a self-link of node 0 to itself). In such a case, it is always possible to perform a single hop substitution in the original list S_m which will produce a new list with the same b value but a non-zero value for the list XOR result h_{m+1}.
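A small sketch of this augmentation check (illustrative only):

```python
from functools import reduce
from operator import xor

# The candidate extra hop is the XOR of the existing hop list; a zero result
# is a self-link of node 0 and therefore invalid.
def augment_hop(hops):
    h = reduce(xor, hops)
    return h if h != 0 else None   # None: substitute one hop in S_m and retry

assert augment_hop([0b001, 0b010, 0b100]) == 0b111   # 3-cube -> folded 3-cube
assert augment_hop([0b011, 0b101, 0b110]) is None    # degenerate: XOR is 0
```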
  • Each solution record contains, among others, the value m, the bisection b and the hop list h_1, h_2, ... h_m.
  • the LH constructor scans the record sets D_d for given P, R and β.
  • if the requirement is "at least P ports", then the constraint P(β, n) − P ≥ 0 is imposed for the admissible comparisons.
  • the requirements can also prioritize the two criteria via weights for each (e.g. weighting the two error terms 0.7 and 0.3 in the total error).
  • Given n such sets of links, L(0), L(1), ..., L(n−1), the complete wiring for the network is specified.
  • the examples below illustrate the described construction procedure.
  • Table 4.15 shows the complete connection map for the network of 32 switches, stacked in a 32-row rack one below the other, labeled in the leftmost column "Sw" as 0, 1, ... 1F (in hex).
  • Switch 5 is outlined, with connections shown for its ports #1, #2, ... #9 to switches (in hex) 04, 07, 01, 0D, 15, 0B, 0A, 11 and 1C. These 9 numbers are computed by XOR-ing 5 with the 9 generators (row 0): 01, 02, 04, 08, 10, 0E, 0F, 14, 19. The free ports are #10, #11 and #12.
  • the outlined switch "5:" indicates on its port #2 a connection to switch 7 (the encircled number 07 in the row 5:).
  • in the row for switch 7: there is an encircled number 05 at its port #2 (column #2), which refers back to this same connection between switch 5 and switch 7 via port #2 on each switch.
  • the same pattern can be observed between any pair of connected switches and ports.
  • the first 8 links are regular 8-cube links (powers of 2), while the remaining 10 are LH augmentation links.
  • the table also shows that each switch has 6 free ports: #19, #20, ... #24.
  • the LH solutions database was used to compare LH networks against several leading alternatives from industry and research across a broader spectrum of parameters.
  • the resulting spreadsheet charts are shown in Figures 11 - 15.
  • the metrics used for evaluation were the Ports/Switch yield (ratio P/n, higher is better) and the cable consumption as Cables/Port (ratio: number of topological cables / P, lower is better).
  • the alternative networks were set up to generate some number of ports P using switches of radix R, which are the optimal parameter values for the given alternative network (each network type has its own "natural" parameter values at which it produces the most efficient networks). Only then was the LH network constructed to match the given number of external ports P using switches of radix R (as a rule, these are not the optimal or "natural" parameters for LH networks).
  • the Ports/Switch chart for each alternative network shows the Ports/Switch yields for the LH network and the alternative network, along with the ratio LH/alternative with numbers on the right axis (e.g. a ratio of 3 means that LH yields 3 times more Ports/Switch than the alternative).
  • the Ports/Switch for the LH network yielding the same total number of ports P is shown, along with the ratio LH/HC, which shows (on the right axis scale) that LH produces 2.6 to 5.8 times greater Ports/Switch yield than the hypercube; hence it uses 2.6-5.8 times fewer switches than HC to produce the same number of ports P as HC at the same throughput.
  • Fig. 11 similarly shows the Cables/Port consumption for HC and LH, and the ratio HC/LH of the two (right axis scale), showing that LH consumes 3.5 to 7 times fewer cables to produce the same number of ports P as HC at the same throughput.
  • the remaining charts in Figs. 12 - 14 show the same type of comparisons for the other four alternatives.

Performance Measurement
  • C is a single IPA switch port capacity (2 × <Port Bit Rate> for duplex ports).
  • Bisection B is the smallest total capacity of links connecting two halves of the network (i.e. it's the minimum for all possible network cuts into halves).
  • although Eq. (7) doesn't yield a closed-form expression for N, it does allow computation of the number of IPA switches N needed to get some target number of total network ports P at a given oversubscription, knowing the radix R of the switches being used.
  • to within a log(log(P)) error margin, the N above grows as N ~ P·log(P), which is an unavoidable mathematical limit on the performance of larger switches combined from N smaller switches at a fixed oversubscription.
  • the slight log(N) nonlinearity at a fixed oversubscription can be seen in the price per port - while N increased by a factor of 128K, the price per 10G port increased only 3.4 times (i.e. the cost per 10G port grew over 38,000 times more slowly than the network size and capacity, which is why the slight non-linearity can be ignored in practice).
  • the switch forwarding port can be computed on the fly via simple hardware performing a few bitwise logical operations on the destination address field, without any expensive and slow forwarding Content Addressable Memory (CAM) tables being required.
  • trunking (or link aggregation in the IEEE 802.1AX standard, or Cisco's commercial EtherChannel product) amounts to cloning the link between two switches, resulting in multiple parallel links between the two switches using additional pairs of ports.
  • the invention shows a better version of trunking for increasing the bisection with a fixed number of switches.
  • the procedure is basically the opposite of the approach used for traditional trunking.
  • B is picked such that it is the farthest switch from A. Since the invention's topologies maintain uniform bisection across the network, any target switch will be equally good from the bisection perspective, which is not true for conventional trees or fat trees.
  • picking the farthest switch B also maximally reduces the longest and the average hop counts across the network. For example, with a hypercube topology, the farthest switch from any switch A is the switch B which is on the long diagonal from A. Adding that one link to A cuts its longest path by half, and reduces the average path by at least 1 hop. A numerical sketch of this effect follows the next two notes.
  • Figure 18 shows the reductions in the maximum and average hops due to adding from 1 to 20 long hops.
  • the LH column shows hex bitmasks of the long hops, i.e. the indices of the farthest switches chosen.
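By way of a non-limiting illustration (the code and the specific numbers below are supplied for exposition, not taken from Figure 18), a breadth-first search over a d-cube augmented with long-hop generators shows the reduction in maximum and average hops directly:

```python
# Estimate max/average hops in a d-cube with extra XOR "long hop" generators.
from collections import deque

def hop_stats(d, long_hops=()):
    """BFS from node 0; by vertex symmetry this covers every source node."""
    gens = [1 << i for i in range(d)] + list(long_hops)
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for g in gens:
            v = u ^ g
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    values = dist.values()
    return max(values), sum(values) / len(values)

print(hop_stats(8))          # (8, 4.0)    -- plain 8-cube
print(hop_stats(8, [0xFF]))  # (4, ~3.27)  -- long diagonal halves the diameter
```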
  • the cables and corresponding port connectors in the same column are color coded using matching colors (properties (b) and (c) make such coding possible), and the cables are of the minimum length necessary in each vertical column; this port-dimension mapping makes the wiring of a rack of switches easy to learn, easy to connect and virtually error-proof (any errors can be spotted at a glance).
  • the total length of cables is also the minimum possible (requiring no slack), with the fewest distinct cable lengths allowed by the topology.
  • the shortening and uniformity of cables reduces the power needed to drive the signals between the ports, a factor identified as having commercial relevance in industry research.
  • the 64 switches are line cards mounted in a rack one below the other and they are depicted as 64 separate rows 0, 1, 2,...63.
  • switch pairs connected to each other on their port #2 are 4 rows apart; e.g. switch (row) 0 connects on its port #2 to switch 4 on its port #2, and they use the orange:2 wire (the color of port #2). This connection is shown as the top orange:2 arc connecting numbers 4 and 0.
  • instead of individually connecting the 32×6 = 192 wires of H(64), two prewired containers are used and only 32 wires now connect between them in a simple 1, 2, 3, ... order. The job is made even easier with a bundled, thick cable carrying these 32 lines and a larger connector on each box, thus requiring only one cable to be connected.
  • switching cost ≈ $16 per VM. For large setups a single frame may be used, where any newly added container can just be snapped into the frame (without any cables); the frame has built-in frame-based connectors (with all the inter-container thick cabling prewired inside the frame base).
  • in order to reduce the number of circuit layers, the connection order must be reversed, changing all wire intersections into nestings and allowing for single-layer wiring.
  • the resulting hypercube is just another one among the alternate labelings.
  • the above manual wiring scheme can also be used to build a network that has a number of switches N which is not a power of 2 (and thus cannot form a conventional hypercube).
  • rows #32 and #33 start the 6th dimension (port #5, long cyan wires), but with only two of the 32 cyan lines connected on port #5 (the two connect port #5 in rows 0↔32 and 1↔33 for the 2 new switches #32 and #33).
  • the first 5 ports #0-#4 of the two new switches have no switches to go to, since those haven't been filled in yet (they will come later in rows 34-63).
  • #32:3↔#40:3 and #40:5↔#8:5 are short-circuited via #32:3↔#8:5 etc., resulting in full (with natural forwarding) 6-D connectivity for the new switches and their neighbors.
  • the general technique is to first construct the correct links for the target topology (e.g. hypercube), including the non-existent nodes. Then one extends all shortest paths containing the non-existent nodes until they reach existent nodes on both ends. The existent nodes terminating such "virtual" shortest paths (made of non-existent nodes on the inner links) are connected directly, using the available ports (reserved on existent nodes for connections with as-yet non-existent ones). A sketch of this repair appears below.
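By way of a non-limiting illustration (this sketch is one possible reading of the procedure; the node counts and names are ours):

```python
# Short-circuit "virtual" shortest paths through missing nodes of a
# truncated d-cube (nodes 0..N-1 exist, N..2**d-1 do not).
from collections import deque

def bitcount(x):
    return bin(x).count("1")

def truncated_cube_links(d, N):
    dims = [1 << i for i in range(d)]
    links = set()
    for u in range(N):
        for g in dims:
            v = u ^ g
            if v < N:
                links.add((min(u, v), max(u, v)))    # ordinary cube link
                continue
            # v is missing: extend shortest paths through missing nodes
            # until existing nodes are reached, then link them to u.
            frontier, seen = deque([v]), {v}
            while frontier:
                w = frontier.popleft()
                for g2 in dims:
                    x = w ^ g2
                    if bitcount(u ^ x) != bitcount(u ^ w) + 1:
                        continue                     # keep the paths shortest
                    if x < N:
                        links.add((min(u, x), max(u, x)))
                    elif x not in seen:
                        seen.add(x)
                        frontier.append(x)
    return links

# 34 switches of a 6-cube: node #40 is missing, so #32 links directly to #8,
# matching the #32:3 <-> #8:5 short circuit described above.
print((8, 32) in truncated_cube_links(6, 34))  # True
```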
  • C-Switches: building large, software-controlled super-connectors
  • a C-Switch forwards packets statically, where the settings for the network of crossbar connections within the C-Switch can be provided by an external program at initialization time. Without any need for high-speed dynamic forwarding and buffering of data packets, the amount of hardware or power used by a C-Switch is several orders of magnitude smaller than that of a standard switch with the same number of ports.
  • the individual connectors, or per-switch bundles of, for example, 48 individual circuit cables brought in via trunked thick cables and plugged into a large single connector, plug into the C-Switch's panel, which can cover 3-5 sides of the C-Switch container.
  • Any desired topology can be selected via an operator using software to select from a library of topologies or topology modules or topology elements.
  • C-Switches can be modular, meaning that a single C-Switch module can combine several hundred to several thousand connectors, and the modules can be connected via a single cable or a few cables (or fiber links), depending on the internal switching mechanism used by the C-Switch.
  • the inter-module cabling can be done via the cabling built into the frame where the connections can be established indirectly, by snapping a new module into the frame.
  • a variety of technologies can provide the functionality of a C-Switch, ranging from telephony-style crossbar switches, to arrays of stripped-down, primitive hub or bridge elements, to nanotech optical switches and ASIC/FPGA techniques. Since the internal distances within a C-Switch are several orders of magnitude smaller than standard Ethernet connections, it is useful (for heat & power reduction) that the incoming signal power be downscaled by a similar factor before entering the crossbar logic (the signals can be amplified back to the required levels on the output from the crossbar logic).
  • power reduction may not be necessary where optical signals are switched via piezo-electrically controlled nano-mirrors or other purely optical/photonic techniques such as DLP normally used for projection screens, where such down/up-scaling is implicit in the transceivers.
  • the internal topology of the C-Switch can be multi-staged, since the complexity of a single, flat crossbar grows as O(X^2) for X external ports.
  • each small crossbar of radix 3p has a circuit complexity (number of cross points) of O(9p^2).
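To illustrate why staging pays off, the following comparison uses the textbook strictly non-blocking three-stage Clos formula (supplied here for context; the C-Switch's own staging may differ):

```python
# Crosspoints: a flat X-port crossbar needs X*X; a 3-stage Clos with ingress
# switches of size n x m (m = 2n - 1 for strict non-blocking) needs
# 2*r*n*m + m*r*r crosspoints, where r = X / n.
def flat(x):
    return x * x

def clos3(x):
    best = None
    for n in range(1, x + 1):
        if x % n:
            continue
        r, m = x // n, 2 * n - 1
        crosspoints = 2 * r * n * m + m * r * r
        best = crosspoints if best is None else min(best, crosspoints)
    return best

for x in (64, 512, 4096):
    print(x, flat(x), clos3(x))
# The Clos total grows roughly as O(X**1.5) instead of O(X**2).
```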
  • the traffic patterns in a data center are generally not uniform all-to-all traffic. Instead, smaller clusters of servers and storage elements often work together on a common task (e.g. servers and storage belonging to the same client in a server farm).
  • the integrated control plane of the current invention allows traffic to be monitored, these types of traffic clusters to be identified, and the C-Switch to be reprogrammed so that the nodes within a cluster become topologically closer within the enhanced hypercube of Ethernet switches.
  • the C-Switch is used in this new division of labor between the dynamic switching network of the Layer 2 switches and the crossbar network within the C-Switch, which offloads and increases the capacity of the more expensive network (switches) by means of the less expensive network (crossbars).
  • This is a similar kind of streamlining of the switching network by C-Switch that layer 2 switching networks perform relative to the more expensive router/layer 3 networks. In both cases, a lower level, more primitive and less expensive form of switching takes over some of the work of the more expensive form of switching.
  • the switches are numerically labeled in a hierarchical manner tailored to the packaging and placement system used, allowing technicians to quickly locate the physical switch.
  • a wiring program displays the wiring instructions in terms of the visible numbers on the switches (containers, racks, boxes, rooms) and ports. The program seeks to optimize localization/clustering of the wiring steps, so that all that is needed in one location is grouped together and need not be revisited.
  • Front panels of the C-Box provide rows of connectors for each switch (with ~10-20 connectors per switch), with numbered rows and columns for simple, by-the-numbers wiring of entire rows of rack switches and hosts.
  • the C-Box is as easy to hook up as, and functions exactly like, the C-Switch.
  • Diagnostic software connected to the network can test the topology and connections, then indicates which cables are not connected properly and what corrective actions need to be taken.
  • Figure 20 shows an embodiment of the invention applied to a complete data center.
  • the particular details of this diagram are illustrative only, and those skilled in the art will see that many other combinations of data center components with various attributes, such as number of ports and port speed, may also be used and connected in various topologies.
  • the cables (vertical arrows) are coded by capacity and named according to their roles: S(erver)-Lines from servers to TORs or transceivers, U(plink)-Lines from edge to network ports, and T(opology)-Lines interconnecting the network switches.
  • the internal switching fabric of the network consists of a variable number of common off-the-shelf (COTS) switches with firmware extensions, connected via the Topology Panel (ITP).
  • the ITP block may merely symbolize a prescribed pattern of direct connections between ports (by-the-numbers wiring), or it can be realized as a prewired connector panel or as a programmable crossbar switch.
  • the network spanned by the T-Lines is the network backbone.
  • the encircled "A" above the top-of-rack (TOR) switches represents fabric aggregation for parts of the TOR fabric which reduces the TOR inefficiencies.
  • the MMC and the Control Plane Executive (CPX) can cooperate to observe and analyze the traffic patterns between virtual machine instances. Upon discovering a high volume of data communication between two virtual machine instances separated by a large number of physical network hops, the MMC and/or CPX can issue instructions to the virtual machine supervisor that result in one or more virtual machine instances being moved to physical servers separated by a smaller number of network hops, or by network hops that are less used by competing network communication. This function both optimizes the latency between the virtual machines and releases some network links for use by other communicating entities. A sketch of this placement rule appears below.
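A simplified sketch of this placement rule follows (all names, thresholds and data shapes are assumptions for exposition):

```python
# Move chatty VM pairs closer: flag pairs with high traffic volume that
# are separated by many physical hops, as observed by the MMC/CPX.
def placement_actions(traffic_bytes, hop_distance,
                      volume_threshold=10**9, hop_threshold=3):
    """traffic_bytes and hop_distance: {(vm_a, vm_b): value}."""
    actions = []
    for pair, volume in traffic_bytes.items():
        if volume > volume_threshold and hop_distance.get(pair, 0) > hop_threshold:
            actions.append(("migrate-closer",) + pair)
    return actions

print(placement_actions({("vm1", "vm2"): 10**12},
                        {("vm1", "vm2"): 6}))
# [('migrate-closer', 'vm1', 'vm2')]
```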
  • the most commonly used layer 3 (or higher) reliable communication protocols such as TCP and HTTP, which have large communication overheads and non-optimal behaviors in data center environments, can be substantially optimized in managed data center networks with a unified control plane such as in the current invention.
  • the optimization consists of replacing the conventional multi-step sequence of protocol operations (such as the three-way handshake and later ACKs in TCP, or large repetitive request/reply headers in HTTP), which have source and destination addresses within the data center, with streamlined, reliable Layer 2 virtual circuits managed by the central control plane, where such circuits fit naturally into the flow-level traffic control.
  • this approach also allows for better, direct implementation of the QoS attributes of the connections (e.g. via reservation of the appropriate network capacity for the circuit).
  • the network-wide circuit allocation provides an additional mechanism for global anticipatory traffic management and load balancing that operates temporally ahead of the traffic, in contrast to reactive load balancing.
  • This approach of tightly integrating with the underlying network traffic management is a considerable advance over current methods of improving layer 3+ protocol performance by locally "spoofing" remote responses without visibility into the network behavior between the spoofing appliances at the network end points.
  • the virtualized connections cooperate with the Layer 2 flow control, allowing for congestion/fault triggered buffering to occur at the source of the data (the server memory), where the data is already buffered for transmission, instead of consuming additional and far more expensive and more limited fast frame buffers in the switches.
  • This offloading of the switch frame buffers further improves the effective network capacity, allowing switches to handle much greater fluctuations of the remaining traffic without having to drop frames.
  • The FRS Control Plane (FRS-CP) makes use of the advanced routing and traffic management capabilities of the Infinetics Super Switch (ISS) architecture. It can also be used to control conventional switches, although some of the capabilities for Quality of Service control and congestion control may be limited.
  • FRS-CP provides:
  • FRS-CP can include a central control system that connects directly to all the switches in the network, which may be replicated for redundancy and failover. Each switch can run an identical set of services that discover network topology and forward data packets.
  • Switches can be divided into three types based upon their role in the network, as shown in Figure 24:
  • ARP and broadcast squelching: when a specific machine attempts to locate another machine on the network in a classic network, it sends out a broadcast ARP (a sort of "where are you?" message), which will be transmitted across the entire network. This message needs to be sent to every machine on every segment, which significantly lowers the throughput capacity of the network. We keep a master list (distributed to every switch) of every host on the network, so that any host can find any other host immediately. Any other broadcast-type packets which would have been sent across the entire network are also blocked. (** See CPX Controller / Data Factory.) A sketch of the local ARP reply appears below.
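A minimal sketch of the squelching idea (the list contents and names are illustrative assumptions):

```python
# The ingress switch answers ARP locally from the CPX-distributed master
# list instead of flooding the request across the network.
MASTER_HOSTS = {"10.0.0.7": "52:54:00:aa:bb:cc"}  # IP -> MAC, pushed by CPX

def handle_arp_request(target_ip):
    mac = MASTER_HOSTS.get(target_ip)
    if mac is not None:
        return ("reply-locally", mac)   # no broadcast ever leaves the switch
    return ("squelch", None)            # unknown targets are blocked, not flooded

print(handle_arp_request("10.0.0.7"))   # ('reply-locally', '52:54:00:aa:bb:cc')
```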
  • Fig. 25 shows a system according to one embodiment of the invention.
  • the Data Factory component can be used to establish the behavior of the IPA
  • the Control Plane Executive uses the data stored in the data factory to configure the network and to set up services such as security and quality guarantees. Management consoles access this component to modify system behavior and retrieve real time network status.
  • the Data Factory communicates with the Control Plane Executive (CPX) through a service interface using a communication mechanism such as Thrift or JSON, as shown in Fig. 26.
  • Any form of encryption can be supported.
  • a public key encryption system can be used.
  • the UBM can provide some or all of the following functions:
  • a UBM entry can describe a name for an organization or a specific service.
  • a UBM entry could be a company name like ReedCO which would contain all the machines that the company ReedCO would use in the data center.
  • a UBM entry can also be used to describe a service available in that data center.
  • a UBM entry has the following attributes:
  • Port(s) - these are the port(s) that are allowed to the specified machines. If there are no ports, then this is a container Node which means it is used to store a list of allowed machines.
  • a flag can be provided in or associated with the
  • a machine table contains at least the following information:
  • the Universal Boundary Manager service can provide membership services, security services and QoS. There can be two or more types of UBM groups:
  • a transparent group can be used as an entry point into the IPA Eco-System.
  • UBM Interfaces can be determined by port number - e.g. Port 80. This type of group can be used to handle legacy IP applications such as Mail and associated Web Services. Since a Web Service can be tied to an IP port, limited security (at the Port Level) and QoS attributes (such as Load Balancing) can be attributes of the UBM structure.
  • An opaque group can have all the attributes of the transparent group, but allows for the extension of pure IPA security, signaling (switch layer) and the ability to provide guaranteed QoS.
  • the major extensions to the Opaque group can include the security attributes along with the guaranteed QoS attributes.
  • Multiple opaque or visible groups can be defined from this core set of attributes.
  • the firewall can be a network- wide mechanism to pre-authorize data flows from host to host. Since every host on the network must be previously configured by the network administrator before it can be used, no host can
  • the ingress switch, where a data packet from a host first arrives in the network, can use the following rules to determine whether the data packet will be admitted to the network, as shown in Figure 22:

Forward Path Rules
  • the CPX computer is the Control Plane Executive, which controls all switches and receives data from and sends data to the switches. This data is what is necessary to route data, firewall info, etc. It also controls the ICP (Integrated Control Plane) module, which determines topology, and controls the IFX (Firmware extensions) instances which are installed on every switch and hypervisor.
  • CPX connects to the Data Factory to read all of the configuration data necessary to make the entire network work. It also writes both log data and current configuration data to the Data Factory for presentation to users.
  • This module controls each instance of IFX on each switch, and takes the neighbor data from each IFX instance and generates cluster data, which is then sent back to each IFX instance on each switch.
  • Triplets (which contain the Host IP Address, Switch ID, and MAC address of the host) are generated by the Host Detector that runs on each switch. The detected triplets are sent through the Host Controller to the CPX controller. First the triplet's data is validated to make sure that the host MAC address (and IP address, if defined) is a valid one. Once validated, the triplet is enabled in the network.
  • the hosts can be forced to validate themselves using various standard methods such as 802.1X.
  • the triplets can be sent to the Data Factory for permanent storage, and are also sent to other switches that have previously requested that triplet.
  • the sends are timed out, so that if a switch has not requested a specific triplet for a specific time, the CPX will not automatically send it if it changes again unless the ICP requests it.
  • the host controller sends a request for the triplet associated with the specific IP address.
  • the CPX looks up that triplet and sends it to the IFX which in turn sends it to the KLM module so that the KLM can route data.
  • Firewall rules and Quality of Service (QOS) data travel along the same route as triplets.
  • a switch always receives all the firewall rules involving hosts that are connected to that switch, so that quick decisions can be made by the KLM module. If a firewall rule changes, it is sent to the IFX, which sends it to the KLM module. In cases where there are firewall rules with schedules or other "trigger points", the firewall rules are sent to the IFX and the IFX sends them to the KLM module at the appropriate time.
  • monitoring data flows from the KLM (or some other module) to IFX, and then to CPX, which sends it to the Data Factory.
  • CPX controls ICP, which then controls each instance of IFX on each switch, telling it to send "discover" packets and return neighbor topology data to ICP. All this data is stored in the Data Factory for permanent storage and for presentation to users. This topology data is used by IFX to generate routes. When link states change, the IFX module notifies ICP, and a new routing table will be generated by IFX. Initially IFX will reroute the data around the affected path.
  • CPX reads the following data from the Data Factory:
  • Topology information - links between switches including metadata about each link
  • Triplets from switches for hosts: these will be written whenever a new host comes online or a host goes away. They can happen anywhere from once every few seconds to much more often as hosts come online. There needs to be some sort of acknowledgement that the specific host being added already exists in the UBM so that we can route to that host. If the host does not exist, we need to flag that host's information so that the user can see that an undefined host has been activated on the network, and allow the user to add it to the UBM.
  • Multi-server data: all the servers of an equivalent type.
  • the following services can run on all switches in the network.
  • This module runs on each switch and is responsible for determining the topology of its neighbors. It sends data back to the ICP module about its local physical connectivity, and also receives topology data from ICP. It supports multiple simultaneous logical network topologies, including n-cube, butterfly, torus, etc., as shown in Figure 23. It uses a raw Ethernet frame to probe the devices attached to this switch only. It also takes the topology data and the cluster data from ICP and calculates forwarding tables.
  • This module runs on each hypervisor and interacts with the Hypervisor/KLM module to control the KLM. Flow data on how many bytes of data flow from this hypervisor to various destinations is accepted by this module and used to calculate forwarding tables.
  • This can include a Linux kernel loadable module (KLM) that implements the Data plane. It can be controlled by the Switch Controller.
  • the inputs to this module are:
  • the KLM can route packets from hosts to either other hosts, or to outside the network if needed (and allowed by rules). All packets sent across the "backbone" can be encrypted, if privacy is required.
  • the KLM switch module can have access to caches of the following data: triplets (they map IPv4 addresses into (Egress Switch ID, host Ethernet Address) pairs); routes (they define the outbound interfaces and the next-hop Ethernet Address to use to reach a given Egress Switch); and firewall rules (they define which IPv4 flows are legal, and how much bandwidth they may utilize).
  • the KLM can eavesdrop on all IP traffic that flows from VM instances
  • the KLM switch module can intercept (STEAL) the traffic and determine whether firewall rules classify the corresponding flow as legal. If it is illegal, the packet is dropped. If the flow is legal and its destination is local to the hypervisor, it is made to obey QoS rules and delivered. If the flow is legal and exogenous, the local triplet cache is consulted with the destination IP address as an index. If a triplet exists, it determines the Egress Switch ID (which is just a six-byte Ethernet address). If a route also exists to the Egress Switch, then the packet will be forwarded with the destination switch's Topological MAC address put into the Ethernet frame.
  • the KLM can use a dedicated Ethernet frame type to make it impossible for any backbone switch or rogue host to send a received frame up its protocol stack.
  • when a frame arrives at a hypervisor, it can be intercepted by the kernel's protocol handler (functionality inside the KLM) for the Ethernet frame type defined.
  • the protocol handler can examine the IP datagram, extract the destination IP address, and then index into its triplet cache to extract the Ethernet address of the local VM. If no triplet exists, the frame can be dropped.
  • the socket buffer's protocol type can be switched from 0xbee5 to 0x0800, and the packet can be made to obey QoS rules before it is queued for transmission to the local host. A sketch of the KLM ingress decision path appears below.
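By way of a non-limiting illustration, the ingress decision path described above can be sketched as follows (all names, data shapes and the Python form itself are assumptions for exposition; the actual KLM is a kernel module):

```python
# Classify one intercepted IP packet per the rules above: firewall check,
# local delivery, triplet lookup, then route lookup toward the egress switch.
IPA_ETHERTYPE = 0xBEE5  # dedicated frame type mentioned in the text

def classify(dst_ip, flow_legal, local_hosts, triplets, routes):
    """Return an action tuple for one intercepted IP packet."""
    if not flow_legal:
        return ("drop", "illegal flow")
    if dst_ip in local_hosts:
        return ("deliver-local", "obey QoS, hand to local VM")
    triplet = triplets.get(dst_ip)          # (Egress Switch ID, host MAC)
    if triplet is None:
        return ("query-cpx", dst_ip)        # ask CPX for the missing triplet
    egress_switch, _host_mac = triplet
    route = routes.get(egress_switch)       # (interface, next-hop MAC)
    if route is None:
        return ("drop", "no route to egress switch")
    iface, next_hop = route
    # Forward with the destination switch's topological MAC in the frame.
    return ("forward", iface, next_hop, egress_switch, IPA_ETHERTYPE)

print(classify("10.0.0.7", True, set(),
               triplets={"10.0.0.7": ("02:00:00:00:00:2a", "52:54:00:aa:bb:cc")},
               routes={"02:00:00:00:00:2a": ("eth1", "02:00:00:00:00:04")}))
```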
  • the KLM can use IFXS, for example, as its method to talk with CPX to access the data factory.
  • Figure 24 shows a typical use case where switching systems according to various embodiments of the invention can be used within a data center.
  • Figure 15 shows one embodiment of the invention where the FRS is used alone to provide an ultra-high bisection bandwidth connection between multiple CPU cores and a large array of flash memory modules.
  • the prior art approach for having CPU cores transfer data to and from flash memory treats the flash memory modules as an emulated disk drive where data is transferred serially from a single "location".
  • the invention allows large numbers of CPUs, or other consumers or generators of data, to communicate in parallel with multiple different flash memory storage modules.
  • the ISS network can be designed using the physical constraints of the various methods that semiconductor devices are packaged and interconnected. This embodiment results in a network that has a different connection pattern than would be used in a data center, but still provides extremely high bisection bandwidth for the available physical connections within and between semiconductor devices and modules.
  • a detailed description of Long Hop networks is provided in the attached Appendix A, which is hereby incorporated by reference.
  • nodes may implement any combination of storage, processing or message forwarding functions, and the nodes within a network may be of different types with different behaviors and types of information exchanged with other nodes in the network or devices connected to the network.
  • Hypercubic (HC) networks are a class of "direct networks" using the Cartesian product construction recipe. This class includes plain hypercube variants (BCube, MDCube), Folded Hypercube (FC), Flattened Butterfly (FB), HyperX (HX), hyper-mesh, hyper-torus, Dragonfly (DF), etc.
  • while the HC networks are overall the more economical of the two types, providing the same capacity for random traffic as fat trees (FT) with fewer switches and fewer cables, the FT is more economical on the worst-case traffic, specifically on the task of routing the worst-case 1-1 pairs permutation.
  • the Long Hop (LH) networks stand above this dichotomy by being simultaneously the most optimal for the common random traffic and for the worst case traffic.
  • the LH optimality is a result of a new approach to network construction which is fundamentally different from the techniques used to construct all the leading alternatives. Namely, while the alternative techniques build the network via simple mechanical, repetitive design patterns which are not directly related to network performance metrics such as throughput, the LH networks are constructed via an exact combinatorial optimization of the target metrics.
  • the LH construction method optimizes over the highly symmetrical and, from a practical perspective, most desirable subset of general networks, Cayley graphs [11].
  • the LH networks are optimal regarding throughput and latency within that domain, practical to compute and discover, simple and economical to wire and troubleshoot and highly efficient in routing and forwarding resources (“self-routing" networks).
  • ∀a - an iterator or a set defined by the statement "for all a"
  • [a, b) - half-open interval containing all x satisfying a ≤ x < b
  • [a, b] - closed interval containing all x satisfying a ≤ x ≤ b
  • V = V1 ⊕ V2 - vector space V is the direct sum of vector spaces V1 and V2
  • Hamming distance Δ(X, Y) between n-tuples X and Y is the number of positions i where x_i ≠ y_i.
  • Δ(X, Y) = |X ^ Y|, i.e. the Hamming weight of X ^ Y.
  • Cyclic group Z_n - the set of integers {0, 1, ..., n-1} with integer addition modulo n as the group operation. Note that the Z_2 group operation is equivalent to a single-bit XOR operation.
  • the same symbol Z_n is also used for the commutative ring with integer addition and multiplication performed mod n.
  • Z_q^n also denotes the commutative ring of n-tuples in which the Z_q operations (integer +, * mod q) are done component-wise.
  • the table entry in row Y and column X is the result of the bitwise X ^ Y operation.
  • Dirac notation (also called “bra-ket” notation, [13]) is a mnemonic notation which encapsulates common matrix operations and properties in a streamlined, visually intuitive form.
  • Matrix [A_r,c] (also: [A] or just A) is a rectangular table with r rows and c columns of "matrix elements". The element in the i-th row and j-th column of a matrix [A] is denoted as [A]_ij.
  • kets |...> are always column vectors and bras <...| are always row vectors. Due to the associativity of matrix products, these "object type rules" are valid however many other matrix or vector factors may be inside and outside of the selected sub-product of a given type. Also, the "resolution of identity" sums Σ_k |k><k| act as the identity matrix.
  • the Walsh function U_k, for 0 ≤ k < n, is defined as the k-th row of H_n.
  • the k-th column of H_n is also equal to U_k (H_n is symmetric).
  • the row and column forms of U_k(x) can also be used as the n-dimensional bra/ket (row/column) vectors <U_k| and |U_k>.
  • the exponent Σ_i k_i·x_i mod 2 in eq. (2.5) uses the binary digits k_i and x_i of the d-bit integers k and x.
  • if the sum is even, U_k(x) is 1; if the sum is odd, U_k(x) is -1.
  • the second equality in eq. (2.5) expresses the same result via a parity function of k & x, where k & x is a bitwise AND of the integers k and x.
  • e.g. U_14(15) = -1, from the table in Fig. 1.
  • the LH network computations mostly use the binary (also called Boolean) form of U_k and H_n, denoted respectively as W_k and [W_n].
  • the binary form is obtained from the algebraic form via the mappings 1 → 0 and -1 → 1. Denoting algebraic values as a and binary values as b, the translations between the two are b = (1 - a)/2 and a = 1 - 2b.
  • in bit-string form one can perform bitwise Boolean operations on the W_k as length-n bit strings.
  • their XOR property will be useful for the LH computations: W_j ^ W_k = W_(j^k), i.e. the bitwise XOR of two rows yields the row whose index is the XOR of their indices.
  • Table 2.3 shows the binary form of the Hadamard (also called Walsh) matrix [W_32], obtained via the mapping of eq. (2.8) from H_32 in Table 2.2 (binary 0's are shown as '-').
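As a non-limiting illustration (supplied here, not part of the original disclosure; standard Sylvester-Hadamard ordering is assumed), the binary Walsh rows and their XOR property can be checked in a few lines:

```python
# Binary Walsh rows W_k in parity form; verifies the XOR property
# W_j xor W_k == W_(j xor k) used by the LH computations.
def parity(x):
    return bin(x).count("1") & 1

def walsh_row(k, n):
    """W_k as a length-n list of bits; n a power of 2 (Table 2.3 uses n=32)."""
    return [parity(k & x) for x in range(n)]

n = 8
for j in range(n):
    for k in range(n):
        xor = [a ^ b for a, b in zip(walsh_row(j, n), walsh_row(k, n))]
        assert xor == walsh_row(j ^ k, n)
print("XOR property holds for all pairs")
```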

Abstract

A system and method for interconnecting nodes and routing data packets in high radix networks includes constructing or redefining a network structure to provide improved performance. Computation and data storage nodes are connected to a network of switching nodes that provide near optimum bandwidth and latency for networks of any size. Specialized interconnection patterns and addressing methods ensure reliable data delivery in very large networks with high data traffic volume.

Description

FLEXIBLE RADIX SWITCHING NETWORK
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims any and all benefits as provided by law, including benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application Nos. 61/483,686 and 61/483,687, both filed on May 8, 2011, both of which are hereby incorporated by reference in their entirety.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH
[0002] Not Applicable
REFERENCE TO MICROFICHE APPENDIX
[0003] Not Applicable
FIELD OF THE INVENTION
[0004] The invention relates generally to the interconnection of nodes in a network. More specifically, the invention relates to interconnected nodes of a communication network that provide some combination of computation and/or data storage and provide an efficient interchange of data packets between the
interconnected nodes. The network can be defined by a base network structure that can be optimized by selectively defining and connecting long hops between sections of the network, for example, to reduce the network diameter.
[0005] The invention provides a system and method for creating a cost- effective way of connecting together a very large number of producers and consumers of data streams.
BACKGROUND OF THE INVENTION
[0006] One real world analogy is the methods for constructing roadway networks that allow drivers to get from a starting location to a destination while satisfying real world constraints such as 1) a ceiling on the amount of tax people are willing to pay to fund roadway construction; 2) a desire to maximize speed of travel subject to safety constraints; and 3) a desire to avoid traffic jams at peak travel times of day.
[0007] In this analogy, the cars are similar to the data sent over a computer network and the starting locations and destinations represent the host computers connected to the network. The constraints translate directly into cost, speed and congestion constraints in computer networks.
[0008] The basic quality of solving these types of problems is that they get much harder to solve efficiently as the number of starting locations and destinations (e.g., host computers) increases. As a starting point, consider how three destinations can be connected together. Figs. 1A and 1B show that there are really only two alternatives. The number of ways to connect the destinations together grows along with the number of destinations; for example, with four destinations, some of the possible methods of connection are shown in Figs. 1C, 1D, 1E and 1F.
[0009] As can be seen from the figures, both the number of connections between nodes and the number of different ways of making those connections grow faster than the number of nodes. For example, a set of 6 nodes can have more than twice as many alternative ways to connect the nodes as a set of 3 nodes. Also, the possible number of connections between the nodes can vary from, on the low side, the number of nodes (N) minus 1 for destinations connected, for example, along a single line as shown in Fig. 1C, to N(N-1)/2 connections as shown in Fig. 1F, where every single node has a direct connection to every other node.
[0010] Another measure of the performance of a network is the diameter of the network, which refers to how many connections need to be traveled in order to get from any one destination to another. In the network shown in Fig. 1C, its economy in the number of connections (3) is offset by the consequence that the only path from one end of the network to the other requires travel across three connections, thus slowing the journey. On the other hand, as shown in Fig. 1F, the large number of connections results in every destination being only one connection away from any other, permitting more rapid travel.
[0011] The two networks shown in Figs. 1C and 1F can also have very different behavior at peak traffic times. Assuming that each connection can support the same rate of traffic flow, the two end point nodes of the network shown in Fig. 1C will be affected if there is a lot of traffic traveling between the two nodes in the middle of the line. Conversely, in the network shown in Fig. 1F, since there is an individual connection between every possible combination of nodes, traffic flowing between two nodes is not affected at all by traffic flowing between a different pair of nodes.
[0012] Another difficulty arises in the construction of computer networks: it is difficult to have a large number of connections converging on a single point, such as shown in Fig. 1F. In a computer data center, the devices that allow multiple connections to converge are called switches. These switches that allow multiple connections to converge typically have physical limitations on the number of connections or ports, for example, around 50 ports for inexpensive switches, approaching 500 ports for more modern, expensive switches. This means that for a fully-meshed network like that shown in Fig. 1F, where delays and congestion are minimized, no more than, for example, 499 destination hosts could be connected together.
SUMMARY OF THE INVENTION
[0013] The sample network layouts shown in Figs. 1A - 1F, 2A - 2C, and in fact all other network layouts conceived to date, suffer from a fundamental tradeoff between the cost and difficulty of building the network, and the ability of the network to support high traffic rates. The present invention allows for the design of networks that can include a very large number of connections and a high level of complexity of the switches that manage those connections, while providing very high immunity from the congestion that limits the ability of all nodes to communicate with each other at maximum speed, no matter how other nodes are using the network.
[0014] The emergence of "cloud computing", supported by huge data centers where hundreds of thousands of computers all connected to one network provide economies of scale and thereby reduced costs, has stressed the ability of current network designs to provide a reliable and cost effective way of allowing data to be exchanged between the computers.
[0015] A number of approaches have been tried by both academia and industry, but to date, all the approaches fall short of theoretical limits by a factor of 2 to 5 times. Some embodiments of the invention include a method for constructing networks that can be within 5-10% of the theoretical maximum for data throughput across networks with multiple simultaneously communicating hosts, a highly prevalent use case in modern data centers.
[0016] In accordance with some embodiments of the invention, methods for constructing highly ordered networks of hosts and switches are disclosed that make maximum use of available switch hardware and interconnection wiring. The basic approach can include the following: selecting a symmetrical network base design, such as a hypercube, a star, or another member of the Cayley graph family;
developing an appropriate topological routing method that simplifies data packet forwarding; and adding short cuts or long hops to the base symmetrical network to reduce the network diameter.
[0017] The regularity of symmetrical networks makes them well suited for topological addressing schemes.
[0018] It is one of the objects of the present invention to provide an improved network design that can be expanded greatly without performance penalty.
[0019] It is another object of the present invention to provide an improved network design that allows the network to be more easily operated and managed. In some embodiments, the entire network can be operated and managed as a single switch.
[0020] It is another object of the present invention to provide an improved network design that provides improved network performance. In some embodiments, the network can have 2 to 5 times greater bisection bandwidth than with conventional network architectures that use the same number of component switches and ports.
[0021] The invention also includes flexible methods for constructing physical embodiments of the networks using commercially available switches, and methods for efficiently, accurately and economically interconnecting (wiring) the switches together to form a high performance network having improved packet handling.
Description of Drawings
[0022] Figures 1 A - IF show sample network layouts.
[0023] Figures 2A - 2C show symmetrical network structures according to some embodiments of the invention.
[0024] Figures 3A and 3B show an example of topological routing.
[0025] Figure 4A shows an order 3 hypercube and Figure 4B shows an order 3 hypercube with shortcuts added.
[0026] Figure 5 illustrates a typical large data center layer 2 network architecture.
[0027] Figure 6 illustrates hypercube notation and construction.
[0028] Figure 7 illustrates partitioning between topology and external ports.
[0029] Figure 8 illustrates packet non-blocking with 4 switches and 8 paths.
[0030] Figure 9 illustrates a network bisection according to some
embodiments of the invention.
[0031] Figure 10 illustrates an 8 node network with long hops added.
[0032] Figures 11 - 15 are charts comparing long hop networks with alternative network configurations.
[0033] Figure 16 illustrates data center available bandwidth and cost for 4x external/topology port ratio.
[0034] Figure 17 illustrates data center available bandwidth and cost for 1x external/topology port ratio.
[0035] Figure 18 illustrates the reduction in average and maximum hops.
[0036] Figure 19 illustrates optimized wiring pattern using port dimension mapping according to an embodiment of the invention.
[0037] Figure 20 illustrates the integrated super switch architecture across an entire data center according to an embodiment of the invention.
[0038] Figure 21 illustrates a network architecture showing a flexible radix switch fabric according to an embodiment of the invention.
[0039] Figure 22 illustrates the flow of a data packet from an ingress switch through a network according to an embodiment of the present invention.
[0040] Figure 23 illustrates various network logical topographies according to an embodiment of the present invention.
[0041] Figure 24 illustrates a network architecture according to one embodiment of the invention.
[0042] Figure 25 illustrates a system including a Data Factory according to some embodiments of the invention.
[0043] Figure 26 illustrates a system interconnecting a control plane executive
(CPX) according to some embodiments of the invention.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
[0044] The present invention is directed to methods and systems for designing large networks, and to the resulting large networks. In accordance with some embodiments of the invention, a way of connecting large numbers of nodes, consisting of some combination of computation and data storage, is provided that offers improved behaviors and features. These behaviors and features can include: a) practically unlimited number of nodes, b) throughput which scales nearly linearly with the number of nodes, without bottlenecks or throughput restriction, c) simple incremental expansion where increasing the number of nodes requires only a proportional increase in the number of switching components, while maintaining the throughput per node, d) maximized parallel multipath use of available node interconnection paths to increase node-to-node bandwidth, e) Long Hop topology enhancements which can simultaneously minimize latency (average and maximum path lengths) and maximize throughput at any given number of nodes, f) a unified and scalable control plane, g) a unified management plane, h) simple connectivity - nodes connected to an interconnection fabric do not need to have any knowledge of topology or connection patterns, i) streamlined interconnection path design - dense interconnections can be between physically near nodes, combined with a reduced number of interconnections between physically distant nodes, resulting in simple interconnection or wiring.
[0045] In one embodiment of the invention, the nodes can represent servers or hosts and network switches in a networked data center, and the interconnections represent the physical network cables connecting the servers to network switches, and the network switches to each other.
[0046] In another embodiment of the invention, the nodes can represent geographically separated clusters of processing or data storage centers and the network switches that connect them over a wide area network. The interconnections in this case can be the long distance data transfer links between the geographically separated data centers.
[0047] Those skilled in the art will realize that the described invention can be applied to many other systems where computation or data storage nodes require high bandwidth interconnection, such as central processing units in a massively parallel supercomputer or other multiple CPU or multi-core CPU processing arrays.
[0048] In accordance with some embodiments of the invention, component switches can be used as building blocks, wherein the component switches are not managed by data center administrators as individual switches. Instead, switches can be managed indirectly via the higher level parameters characterizing collective behavior of the network, such as latency (maximum and average shortest path lengths), bisection (bottleneck capacity), all-to-all capacity, aggregate
oversubscription, ratio of external and topological ports, reliable transport behavior, etc. Internal management software can be used to translate selected values for these collective parameters into the internal configuration options for the individual switches and if necessary into rewiring instructions for data center technicians. This approach makes management and monitoring scalable.
[0049] Hypercubes and their variants have attracted a great deal of attention within the parallel and supercomputing fields, and recently for data center architectures as well, due to their highly efficient communications, high fault tolerance and reliable diagnostics, lack of bottlenecks, simple routing & processing logistics, and simple, regular construction. In accordance with some embodiments of the invention, a method of designing an improved network includes modifying a basic hypercube network structure in order to optimize latency and bandwidth across the entire network. Similar techniques can be used to optimize latency and bandwidth across other Cayley graph symmetrical networks such as star, pancake and truncated hypercube networks.
[0050] A symmetrical network is one that, from the perspective of a source or a destination looks the same no matter where you are in the network and which allows some powerful methods to be applied for developing both routing methods for moving traffic through the network and for adding short cuts to improve throughput and reduce congestion. One commonly known symmetrical network structure is based on the structure of a hypercube. The hypercube structured network can include a set of destinations organized as the corners of a cube, such as shown in Fig. 2A. The structure shown in Fig. 2A is known as an order 3 hypercube, based on each destination having three connections to neighboring destinations. To generate a higher order hypercube, copy the original hypercube and connect all the destinations in the first hypercube with the corresponding destination in the copy as shown in Fig. 2B.
[0051] Hypercubes are just one form of symmetrical network. Another form of symmetrical network is the star graph shown in Fig. 2C. There are many other types of symmetrical networks, known formally as Cayley graphs that can be used as a basis on which to apply the methods of the invention.
[0052] In accordance with some embodiments of the present invention, topological routing can be used to route messages through the symmetrical network. Topological routing can include a method for delivering messages from a source node to a destination node through a series of intermediate locations or nodes, where the destination address on the message describes how to direct the message through the network. A simple analogy is the choice of method for labeling streets and numbering houses in a city. In some planned areas such as Manhattan, addresses not only describe a destination location, "425 17th Street", but also describe how to get there from a starting point. If it is known that house numbers are allocated 100 per block, and the starting location is 315 19th Street, it can be determined that the route includes going across one block and down two streets to get to the destination.
Similarly, for the organization shown in Fig. 3A, traveling from N 200 W. 2nd Street to N 100 E 1st Street can include going east 3 blocks and south one block.
[0053] In contrast, a typical unplanned town like Concord, MA, shown in Fig.
3B, has roads that are not laid out in any regular pattern, and the names for streets have no pattern either. This "plan" requires a "map" to determine how to get from one place to another.
[0054] Topological addressing is important in large networks because it means that a large map does not have to be both generated and then consulted at each step along the way of sending a message to a destination. Generating a map is time consuming and consumes a lot of computing resources, and storing a map at every step along the way between destinations consumes a lot of memory storage resources and requires considerable computation to look up the correct direction on the map each time a message needs to be sent on its way towards its destination. The small maps required by topological addressing are not just a matter of theoretical concern. Present day data centers have to take drastic, performance impacting measures to keep their networks divided into small enough segments that the switches that control the forwarding of data packets do not get overwhelmed with building a map for the large number of destinations for which traffic flows through each switch.
[0055] The regularity of symmetrical networks makes them excellent candidates for having topological addressing schemes applied to them, just as a regular, basically symmetrical, arrangement of streets allows addresses to provide implied directions for getting to them.
[0056] In accordance with some embodiments of the invention, the performance of these symmetrical networks can be greatly improved by the select placement of "short cuts" or long hops according to the invention. The long hops can simultaneously reduce the distance between destinations and improve the available bandwidth for simultaneous communication. For example, Fig. 4A shows a basic order 3 hypercube, where the maximum distance of three links between destination nodes occurs at the opposite corners. In accordance with some embodiments of the invention, adding shortcuts across all three corners as shown in Fig. 4B reduces the distance between the destinations that used to have the worst case distance of three to a distance of one link.
[0057] In accordance with some embodiments of the invention, this method can be applied to hypercubes of higher order with many more destinations. In accordance with some of the embodiments of the invention, a method for identifying select long hops in higher order hypercube networks and symmetric networks can include determining a generator matrix using linear error correcting codes to identify potential long hops within the network.
[0058] Figure 5 shows a diagram of a typical commercial data center. Figure
5 also shows the typical port oversubscription ratios, and hence bottlenecks, at each level (core, aggregation, and edge) of the network, that result from the traditional approaches to building data centers. In addition, none of these approaches work well as the number of devices connected to the network increase exponentially, as has happened as a result of adoption of highly centralized data centers with large numbers of host computers or servers at a single location.
[0059] All real world network implementations are limited by the physical constraints of constructing switches and wiring them together. With the limitations of conventional wiring techniques, one of the parameters that can be adjusted to improve network performance is to increase the number of ports per network switch, which allows that group of ports to exchange data with very high throughput within the single physical device. Problems then arise in maintaining that high throughput when groups of switches have to be assembled in order to connect a large number of servers together. Switch manufacturers have been able to increase the number of ports per switch into the several hundreds (e.g., 500), and some new architectures claim the ability to create switch arrays that have several thousand ports. However, that is two to three orders of magnitude less than the number of servers in large data centers. The number of switch ports is referred to as the "radix" of the switch.
[0060] In accordance with some embodiments of the invention, one difference between networks according to the invention and the prior art is that networks according to the invention can be expanded (increasing the number of host computer ports) practically without limit or performance penalty. The expansion can be flexible, using commodity switches having a variable radix. Although there are presently switches which can be upgraded from an initial configuration with a smaller radix to a configuration with a higher radix, the latter maximum radix is fixed in advance to at most a few hundred ports. Further, the 'radix multiplier' switching fabric for the maximum configuration is hardwired in the switch design. For example, a typical commercial switch such as the Arista 7500 can be expanded to 384 ports by adding up to 8 line cards, each providing 48 ports; but the switching fabric gluing the 8 separate 48 port switches into one 384 port switch is rigidly fixed by the design and is even included in the basic unit. In contrast, the networks constructed according to some embodiments of the invention have no upper limit on the maximum number of ports they can provide. And this holds for an initial network design as well as any subsequent expansion of the same network. In accordance with some embodiments of the invention, for any given type of switch having radix R, the upper limit for simple expansion without performance penalty is 2^(R-1) component switches. Since a typical R is at least 48, even this conditional limit of 2^47 ≈ 1.4·10^14 on the radix expansion is already far larger than the number of ports in the entire internet, let alone in any existing or contemplated data center.
[0061] Another difference between networks according to some embodiments of the invention and prior art data centers is that data center layer 2 networks are typically operated and managed as networks of individual switches where each switch requires individual installation, configuration, monitoring and management. In accordance with some embodiments of the invention, the data center network can be operated and managed as a single switch. This allows the invention to optimize all aspects of performance and costs (of switching fabric, cabling, operation and management) to a far greater degree than existing solutions.
[0062] In addition, networks according to some embodiments of the invention can provide improved performance over any existing data center Layer 2 networks, on the order of 2 to 5 times greater bisection bandwidth than conventional network architectures that use the same number of component switches and ports.
[0063] The invention also describes novel and flexible methods for realizing physical embodiments of the network systems described, both in the area of wiring switches together efficiently, accurately and economically, as well as ways to use existing functionality in commercial switches to improve packet handling.
[0064] Hypercubes can be characterized by their number of dimensions, d. To construct a (d+1)-cube, take two d-cubes and connect all 2^d corresponding nodes between them, as shown in Figure 6 for the transitions d: 0 → 1 → 2 → 3 (red lines indicate added links joining two d-cubes).
[0065] For purposes of illustrating one embodiment of the invention, a d-cube can be a d-dimensional binary cube (or Hamming cube, hypercube graph) with network switches as its nodes, using d ports per switch for the d connections per node. By convention, coordinate values for nodes can be 0 or 1, e.g. a 2-cube has nodes at (x, y) = (0,0), (0,1), (1,0), (1,1), or written concisely as binary 2-bit strings: 00, 01, 10 and 11.
[0066] Each switch can have some number of ports dedicated to interconnecting switches, and hosts can be connected to some or all of the remaining ports not used to interconnect switches. Since the maximum number of switches N in a d-cube is N=2^d, the dimensions d of interest for typical commercial scalable data center applications can include, for example, d = 10..16, i.e. d-cubes with 1K-64K switches, which corresponds to a range of 20K-1280K physical hosts (computers or servers), assuming a typical subscription of 20 hosts per switch.
[0067] In accordance with some embodiments of the invention, a concise binary d-bit notation for the nodes (and node labels) of a d-cube can be used. The hops, defined as the difference vectors between directly connected nodes, can be d-bit strings with a single bit=1 and (d-1) bits=0. The jump (difference vector) between any two nodes S1 and S2 can be: J12 = S1 ^ S2 (^ is a bitwise XOR), and the minimum number of hops (distance, or the shortest path) L between them is the Hamming weight (count of 1's) of the jump J12, i.e. L ≡ L(J12) = |J12|. There are exactly L! distinct shortest paths of equal length L between any two nodes S1 and S2 at distance L. The diameter D (maximum shortest path over all node pairs) of a d-cube is D = log2(N) = d hops, which is also realized for each node. For any node S, its bitwise complement (~S) is at the maximum distance D from S. The average number of hops between two nodes is d/2 and the bisection (minimum number of links to cut in order to split a d-cube into 2 equal halves) is N/2.
[0068] In accordance with some embodiments of the invention, the d-cube coordinates of the switches (d-bit strings with d ~ 10..16) can be used as their physical MAC addresses, and the optimal routing becomes very simple. Routing can be done entirely locally, within each switch, using only O(log(N)) resources (where N is the maximum number of switches). When a frame with destination Mdst arrives at a switch M, the switch M computes J = M ^ Mdst, and if J=0, then the switch M is the destination. Otherwise it selects the next hop h corresponding to any bit = 1 in J, which will bring the frame one hop closer to Mdst, since the next node after the hop, Mnxt = M ^ h, will have one less bit = 1, hence one less hop, in its jump vector to Mdst, which is Jnxt = Mnxt ^ Mdst.
[0069] In accordance with some embodiments of the invention, the total number of switches Ns in the network is not an exact power of 2, so in this case, the d-cubes can be truncated so that for any accessible M the relation M < Ns holds, where the bit string M is interpreted as an integer (instead of M < 2^d, which is used for a complete d-cube). Hence, instead of the O(N) size forwarding table and an O(N) routing tree, the switches only need one number Ns and their own MAC address to forward frames along the shortest paths.
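To make the routing rule of paragraphs [0068]-[0069] concrete, the following is a minimal C sketch of next-hop selection on a (possibly truncated) d-cube. The function name and the convention of trying the lowest set bit of the jump vector first are illustrative assumptions; any bit = 1 in J that keeps the next node below Ns is an equally valid choice.

    #include <stdint.h>

    /* Sketch of local d-cube routing (paragraphs [0068]-[0069]).
     * m   = address of the current switch (d-bit string as integer)
     * dst = destination switch address
     * ns  = total number of switches (truncated cube: valid nodes are < ns)
     * Returns the address of the next-hop switch, or m if already delivered.
     */
    uint32_t next_hop(uint32_t m, uint32_t dst, uint32_t ns)
    {
        uint32_t j = m ^ dst;              /* jump vector; weight = distance */
        if (j == 0) return m;              /* frame has arrived */
        while (j) {
            uint32_t h = j & (~j + 1u);    /* lowest set bit = one hop */
            uint32_t nxt = m ^ h;          /* cross that dimension */
            if (nxt < ns) return nxt;      /* stay inside the truncated cube */
            j ^= h;                        /* try the next candidate dimension */
        }
        return m;                          /* no admissible hop found */
    }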
[0070] In accordance with some embodiments of the invention, one useful parameter of the hypercubic network topology is the port distribution, or ratio of the internal topology ports (T-ports) used to interconnect switches and the external ports (E-ports) that the network uses to connect the network to hosts (servers and routers). Networks built according to some embodiments of the invention can use a fixed ratio: λ ≡ E/T (E=#E-ports, T=#T-ports) for all IPA switches. In accordance with one embodiment, the ratio is λ=1 (ports are split evenly between E and T), as shown in Figure 7 for d=2.
[0071] For a hypercube of dimension d there are m ≡ d·2^d total T-ports and the same number of E-ports, d ports per switch for either type. Since the ports are duplex, each E-port can be simultaneously a source and a sink (destination) of data packets. Hence, there are m sources X1, X2,... Xm and m destinations Y1, Y2,... Ym. The non-blocking (NB) property of a network or a switch can usually be defined via performance on the 'permutation task': each source Xi (i=1..m) is sending data to a distinct destination Yj (where j = π_m[i] and π_m is a permutation of m elements), and if these m transmissions can occur without collisions/blocking, for all m! permutations of the Ys, the network is NB. The evaluation of the NB property of a network can depend on the specific meaning of "sending data" as defined by the queuing model. Based on the kinds of "sending data", there can be two forms of NB, Circuit NB (NB-C) and Packet NB (NB-P). For NB-C, each source X can send a continuous stream at its full port capacity to its destination Y. For NB-P, each source can send one frame to its destination Y. In both cases, for NB to hold, for any π(Y) there must exist a set of m paths (sequences of hops), each path connecting its XY pair. The difference in these paths for the two forms of NB is that for NB-C each XY path has to have all its hops reserved exclusively for its XY pair at all times, while for NB-P, the XY path needs to reserve a hop only for the packet forwarding step in which the XY frame is using it. Hence NB-C is a stronger requirement, i.e. if a network is NB-C then it is also NB-P.
[0072] In accordance with some embodiments of the invention, a hypercube network with λ=1 has the Packet Non-Blocking property. This is self-evident for d=1, where there are only 2 switches, two ports per switch, one T-port and one E-port. In this case m=2, hence there are only 2!=2 sets of XY pairing instances to consider: I1 = [X1→Y1, X2→Y2] and I2 = [X1→Y2, X2→Y1]. The set of m=2 paths for I1 is: {(X1 0 Y1), (X2 1 Y2)}, each taking 0 hops to reach its destination (i.e. there were no hops between switches, since the entire switching function in each path was done internally within the switch). The paths are shown as (X S1 S2 ... Sk Y), where the Si sequence specifies the switches visited by the frame in each hop from X; such a path requires k-1 hops between the switches (X and Y are not switches but ports on S1 and Sk respectively). For the pairing I2, the two paths are {(X1 0 1 Y2), (X2 1 0 Y1)}, each 1 hop long. Since there were no collisions in either instance I1 or I2, the d=1 network is NB-P. For the next size hypercube, d=2, m=8 and there are 8! (40320) XY pairings, so we will look at just one instance (selected to maximize the demands over the same links) and show the selection of the m=8 collision free paths, before proving the general case.
[0073] Fig. 8 shows the 8 paths, with their properties made discernable by splitting the diagram into (a) and (b) parts, although the two are actually running on the same switches and lines simultaneously. The short arrows with numbers show the direction of the frame hop and the switching step/phase at which it takes place. It is evident that at no stage of the switching, which lasts 3 hops, is any link required to carry 2 or more frames in the same direction (these are duplex lines, hence 2 frames can share a link in opposite directions), hence NB-P holds for this instance. Not all paths are the shortest ones possible (e.g. one XY path took 3 hops, although its shortest path is 1 hop, the same single hop used by another pair's path).
[0074] To prove that in the general case all m = d·N frames sent by X1, ...Xm can be delivered to the proper destinations, in a finite time and without collisions or dropped frames, the following routing algorithm can be used. In the initial state, when m frames are injected by the sources into the network, each switch receives d frames from its d E-ports. If there were just one frame per switch instead of d, the regular hypercube routing could solve the problem, since there would be no conflicts between multiple frames targeting the same port of the same switch. Since each switch also has exactly d T-ports, if each switch sends d frames, one frame to each port in any order, in the next stage each switch again has exactly d frames (received via its d T-ports), without collisions or frame drops so far. While such 'routing' can go on forever without collisions/frame drops, it does not guarantee delivery. In order to assure finite time delivery, each switch must pick, out of the maximum d frames it can have in each stage, the frame closest to its destination (the one with the lowest Hamming weight of its jump vector Dst ^ Current) and send it to the correct port. The remaining d-1 frames (at most; there may be fewer) are sent on the remaining d-1 ports applying the same rule (the closest one gets highest priority, etc). Hence after this step is done on each of the N switches, there are at least N frames (the N "winners" on the N switches) which are now closer by 1 hop to their destinations, i.e. which are now at most d-1 hops away from their destinations (since the maximum hop distance on a hypercube is d). After k such steps, there will be at least N frames which are at most d-k hops away from their destinations. Since the maximum distance on a hypercube is d hops, in at most d steps from the start at least N frames are delivered to their destinations, and there are no collisions/drops. Since the total number of frames to deliver is d·N, the above sequence of steps need not be repeated more than d times, therefore all frames are delivered in at most d^2 steps after the start. QED.
[0075] In accordance with some embodiments of the invention, load balancing can be performed locally at each switch. For each arriving frame, the switch can select the next hop along a different d-cube dimension than the last one sent, if one is available. Since for any two points with distance (shortest path) L there are L! alternative paths of equal length L, there are plenty of alternatives to avoid congestion, especially if aided by a central control and management system with a global picture of traffic flows.
[0076] Much of this look-ahead at the packet traffic flow and density at adjacent nodes, required to decide which among the equally good alternatives to pick, can be done completely locally between switches with a suitable lightweight one-hop (or few hops) self-terminating (time to live set to 1 or 2) broadcast through all ports, notifying neighbors about a switch's load. The information packet broadcast in such a manner by a switch M can also combine its knowledge about other neighbors (with their weight/significance scaled down geometrically, e.g. by a factor 1/d for each neighbor). The division of labor between this local behavior of switches and a central control and management system can be that switching for short distance and near time regions is controlled by the switches and that switching for long distance and long time behavior is controlled by the central control and management system.
[0077] In accordance with some embodiments of the invention, symmetrical networks with long hop shortcuts are used to achieve high performance in the network; however, additional forwarding management can be used to optimize the network and achieve higher levels of performance. As the size of the network (number of hosts) becomes large, it is useful to optimize the forwarding processes to improve network performance.
[0078] One reason for current data center scaling problems is the non-scalable nature of the forwarding tables used in current switches. These tables grow as O(N^2), where N is the number of edge devices (hosts) connected to the network. For large networks, this quickly leads to forwarding tables that cannot be economically supported with current hardware, leading to various measures to control forwarding table size by segmenting networks, which leads to further consequences and sub-optimal network behavior.
[0079] In accordance with some embodiments of the invention, each switch can maintain a single fixed-size forwarding table (of size O(N)) and a network connection matrix (of size O(N·R), where R is the switch radix and N the number of switches). The scalable layer 2 topology and forwarding tables maintained by the switches can be based on hierarchical labeling and corresponding hierarchical forwarding behavior of the switches, which require only m·N^(1/m) table entries for the m-level hierarchy (where m is a small integer parameter, typically m = 2 or 3).
[0080] In accordance with one embodiment of the invention, the network can be divided into a hierarchy of clusters, which for performance reasons align with the actual network connectivity. The 1st level clusters contain R nodes (switches) each, while each higher level cluster contains R sub-clusters of the previous lower level. Hence, each node belongs to exactly one 1st level cluster, which belongs to exactly one 2nd level cluster, etc. The number of levels m needed for a network with N nodes and a given R is then determined from the relations R^(m-1) < N ≤ R^m, i.e. m = ⌈log(N)/log(R)⌉. The forwarding identifier (FID or Forwarding ID) of a node consists of m separate fields (digits of the node ordinal 0..N-1 expressed in radix R), FID = F1.F2...Fm, where F1 specifies the node index (number 0..R-1) within its 1st level cluster, F2 the index of the node's 1st level cluster within its second level cluster, etc.
[0081] For example, in an N=100 node network and selecting R=10, each node is labeled via two decimal digits, e.g. a node 3.5 is a node with index 3 in a cluster with index 5. In this embodiment, if node 3.5 needs to forward to some node 2.8, all that 3.5 needs to know is how to forward to a single node in cluster 8, as long as each node within cluster 8 knows how to forward within its own cluster. For multi-path topologies, nodes have more than a single destination forwarding address. Put generally: each node needs to know how to forward to 9 nodes in its own cluster and to a single node in each of the other 9 clusters, hence it needs tables with only 2*9=18 elements (instead of the 99 elements that conventional forwarding uses).
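As an illustrative sketch (not the patent's firmware), the hierarchical labeling of paragraphs [0080]-[0081] can be computed as follows; the function names and the digit ordering (F1 as the least significant base-R digit of the node ordinal) are assumptions consistent with the example above.

    #include <math.h>
    #include <stdio.h>

    /* Number of hierarchy levels m for N nodes and radix R: the smallest m
     * with N <= R^m, i.e. m = ceil(log(N)/log(R)) (paragraph [0080]). */
    int levels(int n_nodes, int r) {
        return (int)ceil(log((double)n_nodes) / log((double)r));
    }

    /* Split a node ordinal into its m FID digits F1..Fm (base-R digits);
     * fid[0] holds F1, the index within the 1st level cluster. */
    void make_fid(int ordinal, int r, int m, int *fid) {
        for (int i = 0; i < m; ++i) { fid[i] = ordinal % r; ordinal /= r; }
    }

    int main(void) {
        int fid[2];
        int m = levels(100, 10);        /* N=100, R=10 -> m=2 */
        make_fid(53, 10, m, fid);       /* node ordinal 53 */
        printf("m=%d FID=%d.%d\n", m, fid[0], fid[1]); /* prints m=2 FID=3.5 */
        return 0;
    }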
[0082] In accordance with some embodiments of the invention, the forwarding tables for each node can consist of m arrays Ti, i=1..m, each of size R elements (the elements are forwarding ports). For example, for R=16 and a network with N=64*1024 switches (corresponding to a network with 20*N=1280*1024 hosts), the forwarding tables in each switch consist of 4 arrays, each of size 16 elements, totaling 64 elements.
[0083] For any node F with FID(F) = F1.F2...Fm, the array T1[R] contains the ports which F needs to use to forward to each of the R nodes in its own 1st level cluster. This forwarding is not assumed to be a single hop, so the control algorithm can seek to minimize the number of hops when constructing these tables. A convenient topology, such as the hypercube type, makes this task trivial since each such forwarding step is a single hop to the right cluster. In accordance with some embodiments of the invention, in the hypercube network, the control algorithm can harmonize node and cluster indexing with port numbers so that no forwarding tables are needed at all. The array T2 contains the ports F needs for forwarding to a single node in each of the R 2nd level clusters belonging to the same third level cluster as node F; T3 contains the ports F needs for forwarding to a single node in each of the R 3rd level clusters belonging to the same 4th level cluster as F,... and finally Tm contains the ports F needs to use to forward to a single node in each of the R m-th level clusters belonging to the same (m+1)-th level cluster (which is a single cluster containing the whole network).

[0084] In accordance with some embodiments of the invention, forwarding can be accomplished as follows. A node F with FID(F) = F1.F2...Fm receiving a frame with final destination FID(Z) = Z1.Z2...Zm determines the index i = 1..m of the highest 'digit' Zi that differs from its own corresponding 'digit' Fi and forwards the frame to the port Ti[Zi]. The receiving node G then has (from the construction of tables Ti) for its i-th digit the value Gi = Zi. Hence, repeating the procedure, node G determines the index j < i of the highest digit Zj differing from its corresponding Gj and forwards to port Tj[Zj]. The index strictly decreases with each such step until it reaches 1, at which point the node is performing the final forwarding within its own cluster.
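The forwarding rule of paragraph [0084] amounts to a scan for the highest differing FID digit followed by a table lookup. Below is a minimal C sketch, assuming the digit layout of the previous sketch (fid[0] = F1) and a fixed R=16; the names are illustrative, not the patent's implementation.

    /* Sketch of hierarchical forwarding (paragraph [0084]).
     * fid_self, fid_dst: FID digits F1..Fm / Z1..Zm (index 0 holds F1/Z1).
     * tables[i][digit]:  forwarding port arrays T1..Tm (index 0 holds T1).
     * Returns the output port, or -1 if the frame is already delivered.
     */
    int forward_port(const int *fid_self, const int *fid_dst,
                     int m, int tables[][16])
    {
        for (int i = m - 1; i >= 0; --i)      /* highest digit first */
            if (fid_dst[i] != fid_self[i])
                return tables[i][fid_dst[i]]; /* forward via Ti[Zi] */
        return -1;                            /* all digits match: local delivery */
    }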
[0085] In accordance with some embodiments of the invention, the implementation of this technique can involve the creation of hierarchical addresses. Since the forwarding to clusters at levels > 1 involves approximation (a potential loss of information, and potentially sub-optimal forwarding), for the method to forward efficiently it can be beneficial to a) reduce the number of levels m to the minimum needed to fit the forwarding tables into the CAMs (content addressable memories) and b) reduce the forwarding approximation error for m > 1 by selecting the formal clustering used in the construction of the network hierarchy to match as closely as possible the actual topological clustering of the network.
[0086] Forwarding efficiency can be improved by reducing the number of levels m to the minimum needed to fit the forwarding tables into the CAMs. In situations where one can modify only the switch firmware, but not the forwarding hardware, to implement the hierarchical forwarding logic, the conventional CAM tables can be used. The difference from the conventional use is that instead of learning the MAC addresses, which introduces additional approximation and forwarding inaccuracy, the firmware can program the static forwarding tables directly with the hierarchical tables.
[0087] Since m levels reduce the size of the tables from N to m·N^(1/m) entries (e.g. m=2 reduces the tables from N entries to 2·√N entries), a 2-3 level hierarchy may be sufficient to fit the resulting tables in a C ~ 16K entries CAM memory (e.g. m=2, C ~ 16K allows 2·8K entries, or N = 64·10^6 nodes). Generally, m is the lowest value satisfying the inequality: m·N^(1/m) ≤ C.

[0088] In order to reduce the forwarding approximation error for m > 1, the formal clustering used in the construction of the hierarchy should match as closely as possible the actual topological clustering of the network. For the enhanced hypercube topologies used by the invention, optimum clustering is possible since hypercubes are a clustered topology with m = log(N). In practice, where minimum m is preferred, the hypercubes of dimension d are intrinsically clustered into lower level hypercubes corresponding to a partition of d into m parts. E.g. the partition d = a+b corresponds to 2^a clusters (hypercube of dim=a) of size 2^b each (hypercubes of dim=b). The following clustering algorithm (see also the sketch after the next paragraph) performs well in practice and can be used for general topologies:
[0089] A node which is the farthest node from the existing complete clusters is picked as the seed for the next cluster (the first pick, when there are no other clusters, is arbitrary). The new cluster is grown by adding to it one of the unassigned nearest neighbors x based on the scoring function: V(x) = #i - #e, where #i is the number of intra-cluster links and #e is the number of extra-cluster links in the cluster resulting from adding node x to it. The neighbor x with the maximum value of the V(x) score is then assigned to the cluster. The cluster growth stops when there are no more nodes or when the cluster target size is reached (whichever comes first). When no more unassigned nodes are available the clustering layer is complete. The next layer clusters are constructed by using the previous lower layer clusters as the input to this same algorithm.
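A minimal C sketch of one growth step of this greedy clustering, assuming the graph is given as an adjacency matrix; the function name and the scoring via the change of V(x) = #i - #e when x is added (links from x into the cluster become intra-cluster and stop being extra-cluster) are illustrative choices, not the patent's implementation.

    /* One growth step of the greedy clustering of paragraph [0089].
     * adj[i][j] != 0 iff nodes i and j are linked; in_cluster[i] marks members
     * of the cluster being grown; assigned[i] marks nodes already used by any
     * cluster. Returns the best unassigned neighbor, or -1 if none exists. */
    int best_neighbor(int n, const char adj[][64],
                      const char *in_cluster, const char *assigned)
    {
        int best = -1, best_v = 0;
        for (int x = 0; x < n; ++x) {
            if (assigned[x] || in_cluster[x]) continue;
            int li = 0, le = 0;            /* x's links into / out of cluster */
            for (int j = 0; j < n; ++j) {
                if (!adj[x][j]) continue;
                if (in_cluster[j]) ++li; else ++le;
            }
            if (li == 0) continue;         /* only actual neighbors qualify */
            int v = 2 * li - le;           /* change of V = #i - #e if x joins */
            if (best < 0 || v > best_v) { best = x; best_v = v; }
        }
        return best;
    }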
[0090] In accordance with some embodiments of the invention, networks can be considered to include n "switches" (or nodes) of radix (number of ports per switch) Ri for the i-th switch, where i=1..n. The network thus has a total of PT = Σi Ri ports. Some number of ports PI is used for internal connections between switches ("topological ports"), leaving P = PT - PI ports free ("external ports"), available for use by servers, routers, storage, etc. The number of cables C used by the internal connections is C = PI/2. For regular networks (graphs), those in which all nodes have the same number of topological links per node m (i.e. m is the node degree), it follows that PI = n·m.
[0091] The network capacity or throughput is commonly characterized via the bisection (bandwidth), which is defined in the following manner: the network is partitioned into two equal subsets (an equipartition) S1 + S2 so that each subset contains n/2 nodes (within ±1 for odd n). The total number of links connecting S1 and S2 is called a cut for the partition S1+S2. Bisection B is defined as the smallest cut (min-cut) over all possible equipartitions S1+S2 of the network. Fig. 9 illustrates this definition on an 8 node network with B=2.
[0092] Bisection is thus an absolute measure of the network bottleneck throughput. A related commonly used relative throughput measure is the network oversubscription φ, defined by considering the P/2 free ports in each min-cut half, S1 and S2, with each port sending and receiving at its maximum capacity to/from the ports in the opposite half. The maximum traffic that can be sent in each direction this way without overloading the network is B link (port) capacities, since that is how many links the bisection has between the halves. Any additional demand that the free ports are capable of generating is thus considered to be an "oversubscription" of the network. Hence, the oversubscription φ is defined as the ratio:

φ = (P/2) / B      (3.1)
[0093] The performance comparisons between network topologies, such as [1]-[5], [9]-[10], typically use non-oversubscribed networks (φ=1) and compare the costs in terms of the number of switches n of common radix R and the number of internal cables C used in order to obtain a given target number of free ports P. Via eq. (3.1), that is equivalent to comparing the costs n and C needed to obtain a common target bisection B.
[0094] Therefore, the fundamental underlying problem is how to maximize B given the number of switches n, each using some number of topological ports per switch m (node degree). This in turn breaks down into two sub-problems:
(i) Compute the bisection B for a given network
(ii) Modify/select the links which maximize B computed via (i)
[0095] For general networks (graphs), both sub-problems are computationally intractable, i.e. NP-complete problems. Of the two tasks, (i) is the 'easier' one, since (ii) requires multiple evaluations of (i) as the algorithm (ii) iterates/searches for the optimum B. Task (i) involves finding the graph equipartition H0+H1 which has the minimum number of links between the two halves; the general case would have to examine every possible equipartition H0+H1, in each case count the links between the two, then pick the one with the lowest count. Since there are C(n, n/2) ≈ 2^n/√(πn/2) ways to split the set of n nodes into two equal halves, the exact brute force solution has exponential complexity. The problem with approximate bisection algorithms is the poor solution quality as network size increases: polynomial complexity bisection algorithms applied to general graphs cannot guarantee finding an approximate cut even to within merely a constant factor of the actual minimum cut as n increases. And without an accurate enough measure of network throughput, the subtask (ii) cannot even begin to optimize the links.
[0096] An additional problem with (ii) becomes apparent even for small networks, such as those with a few dozen nodes, for which one can compute the exact B via brute force and also compute the optimum solution by examining all combinations of the links. Namely, a greedy approach for solving (ii) successively computes B for all possible additions of the next link, then picks the link which produces the largest increment of B among all possible additions. That procedure continues until the target number of links per node is reached. Numerical experiments on small networks show that in order to get the optimum network in the step m → m+1 links per node, one often needs to replace one or more existing links as well, namely links which were required for the optimum at previous smaller values of m.
[0097] In addition to bandwidth optimization for a given number of switches and cables, the latency, average or maximum (diameter), is another property that is often a target of optimization. Unlike the B optimization, where an optimum solution dramatically reduces network costs, yielding ~2-5 times fewer switches and cables compared to conventional and approximate solutions, the improvements in latency are less sensitive to the distinction between the optimal and approximate solutions, with typical advantage factors of only 1.2-1.5. Accordingly, greater optimization can be achieved in LH networks by optimizing the bisection than by optimizing the network to improve latency.
[0098] The present invention is directed to Long Hop networks and methods of creating Long Hop networks. The description provides illustrative examples of methods for constructing a Long Hop network in accordance with the invention. In accordance with one embodiment, one function of a Long Hop network is to create a network interconnecting a number of computer hosts to transfer data between the computer hosts connected to the network. In accordance with some embodiments, the data can be transferred simultaneously and with specified constraints on the rate of data transmission and the components (e.g., switches and switch interconnect wiring) used to build the network.

[0099] In accordance with the invention, a Long Hop network includes any symmetrical network whose topology can be represented by a Cayley graph, where the corresponding Cayley graphs have generators corresponding to the columns of Error Correcting Code (ECC) generator matrices G (or their isometric equivalents; also, instead of G one can use the equivalent components of the parity check matrix H). In addition, the Long Hop networks in accordance with some embodiments of the invention can have performance (bisection in units of n/2) within 90% of the lower bounds of the related ECC, as described by the Gilbert-Varshamov bound theorem. In accordance with some embodiments of the invention, Long Hop networks will include networks having 128 or more switches (e.g., dimension 7 hypercube or greater) and/or direct networks. In accordance with some embodiments of the invention, Long Hop networks can include networks having the number of interconnections m not equal to d, d+1, ..., d+d-1 and m not equal to n-1, n-2. In accordance with some embodiments of the invention, the wiring pattern for connecting the switches of the network can be determined from a generator matrix that is produced from the error correcting code that corresponds to the hypercube dimension and the number of required interconnections determined as a function of the oversubscription ratio.
[00100] In other embodiments of the invention, similar methods can be used to create networks for interconnecting central processing units (CPUs) as is typically used in supercomputers, as well as to interconnect data transfer channels within integrated circuits or within larger hardware systems such as backplanes and buses.
[00101] In accordance with some embodiments of the invention, the Long Hop network can include a plurality of network switches and a number of network cables connecting ports on the network switches to ports on other network switches or to host computers.
[00102] Each cable connects either a host computer to a network switch or a network switch to another network switch. In accordance with some embodiments of the invention, the data flow through a cable can be bidirectional, allowing data to be sent simultaneously in both directions. In accordance with some embodiments of the invention, the rate of data transfer can be limited by the switch or host to which the cable is connected. In accordance with other embodiments of the invention, the data flow through the cable can be uni-directional. In accordance with other embodiments of the invention, the rate of data transfer can be limited only by the physical capabilities of the physical cable media (e.g., the construction of the cable). In accordance with some embodiments, the cable can be any medium capable of transferring data, including metal wires, fiber optic cable, and wired and wireless electromagnetic radiation (e.g., radio frequency signals and light signals). In accordance with some embodiments, different types of cable can be used in the same Long Hop network.
[00103] In accordance with some embodiments of the invention, each switch has a number of ports and each port can be connected via a cable to another switch or to a host. In accordance with some embodiments of the invention, at least some ports can be capable of sending and receiving data, and at least some ports can have a maximum data rate (bits per second) that they can send or receive. Some switches can have ports that all have the same maximum data rate, and other switches can have groups of ports with different data rates or different maximum data transfer rates for sending or receiving. In accordance with some embodiments, all switches can have the same number of ports, and all ports can have the same send and receive maximum data transfer rate. In accordance with other embodiments of the invention, at least some of the switches in a Long Hop network can have different numbers of ports, and at least some of the ports can have different maximum data transfer rates.
[00104] The purpose of a switch is to receive data on one of its ports and to send that data out on another port based on the content of the packet header fields. Switches can receive data and send data on all their ports simultaneously. A switch can be thought of as similar to a rail yard, where incoming train cars on multiple tracks can be sent onward on different tracks by using a series of devices that control which track among several options a car continues onto.
[00105] In accordance with some embodiments of the invention, the Long Hop network is constructed of switches and cables. Data is transferred between a host computer or a switch and another switch over a cable. The data received from a sending host computer enters a switch, which can then forward the data either directly to a receiving host computer or to another switch, which in turn decides whether to continue forwarding the data to another switch or directly to a host computer connected to the switch. In accordance with some embodiments of the invention, all switches in the network can be connected both to other switches and to hosts. In accordance with other embodiments of the invention, there can be interior switches that only send and receive to other switches and not to hosts. [00106] In accordance with some embodiments, the Long Hop network can include a plurality of host computers. A host computer can be any device that sends and/or receives data to or from a switch over a cable. In accordance with some embodiments of the invention, host computers can be considered the source and/or destination of the data transferred through the network, but are not considered to be a direct part of the Long Hop network being constructed. In accordance with some embodiments of the invention, host computers cannot send or receive data faster than the maximum data transfer rate of the switch port to which they are connected.
[00107] In accordance with some embodiments of the invention, at least some of the following factors can influence the construction of the network. The factors can include 1) the number of hosts that must be connected; 2) the number of switches available; 3) the number of ports on each switch; 4) the maximum data transfer rate for switch ports; and 5) the sum total rate of simultaneous data transmission by all hosts. Other factors, such as the desired level of fault tolerance and redundancy, can also be a factor in the construction of a Long Hop network.
[00108] In accordance with some embodiments of the invention, the desired characteristics of the Long Hop network can limit the combinations of the above factors used in the construction of a Long Hop network that can actually be built. For example, it is not possible to connect more hosts to a network than the total number of switches multiplied by the number of ports per switch minus the number of ports used to interconnect switches. As one of ordinary skill would appreciate, a number of different approaches can be used to design a network depending on the desired outcome. For example: for a specified number of hosts, switches with a given maximum data transfer rate, and ports per switch, how many switches are needed and how should they be connected in order to allow all hosts to send and receive simultaneously at 50% of their maximum data transfer rate? Alternatively: for a specified number of hosts and a number of switches with a given number of ports and maximum data transfer rate, how much data can be simultaneously transferred across the network and what switch connection pattern(s) support that performance?
[00109] For purposes of illustration, the following description explains how to construct a Long Hop network according to some embodiments of the invention. In this embodiment, the Long Hop network includes 16 switches and uses up to 7 ports per switch for network interconnections (between switches). As one of ordinary skill will appreciate, any number of switches can be selected, and the number of ports for network interconnection can be selected in accordance with the desired parameters and performance of the Long Hop network.
[00110] In accordance with some embodiments of the invention, the method includes determining how to wire the switches (or change the wiring of an existing network of switches) and the relationship between the number of attached servers per switch and the oversubscription ratio.
[00111] In accordance with some embodiments of the invention, the ports on each switch can be allocated to one of two purposes: external connections (e.g., for connecting the network to external devices including host computers, servers and external routers or switches that serve as sources and destinations within the network), and topological or internal connections. An external network connection is a connection between a switch and a source or destination device that enables data to enter the network from a source or exit the network to a destination. A topological or internal network connection is a connection between the network switches that form the network (e.g., that enables data to be transferred across the network).
[00112] In accordance with some embodiments of the invention, the oversubscription ratio can be determined as the ratio between the total number of host connections (or more generally, external ports) and the bisection (given as the number of links crossing the min-cut partition). In accordance with some embodiments of the invention, an oversubscription ratio of 1 indicates that in all cases, all hosts can simultaneously send at the maximum data transfer rate of the switch port. In accordance with some embodiments of the invention, an oversubscription ratio of 2 indicates that the network can only support a sum total of all host traffic equal to half of the maximum data transfer rate of all host switch ports. In accordance with some embodiments of the invention, an oversubscription ratio of 0.5 indicates that the network has twice the capacity required to support maximum host traffic, which provides a level of failure resilience such that if one or more switches or connections between switches fails, the network will still be able to support the full traffic volume generated by hosts.
[00113] In accordance with some embodiments of the invention, the base network can be an n-dimensional hypercube. In accordance with other embodiments of the invention, the base network can be another symmetrical network such as a star, a pancake, or another Cayley graph based network structure. In accordance with some embodiments of the invention, an n-dimensional hypercube can be selected as a function of the desired number of switches and interconnect ports.
[00114] In accordance with some embodiments of the invention, a generator matrix is produced for the linear error correcting code that matches the underlying hypercube dimension and the number of required interconnections between switches as determined by the network oversubscription ratio. In accordance with some embodiments of the invention, the generator matrix can be produced by retrieving it from one of the publicly available lists, such as the one maintained by the MinT project (http://mint.sbg.ac.at/index.php). In accordance with other embodiments of the invention, the generator matrix can be produced using a computer algebra system such as the Magma package (available from http://magma.maths.usyd.edu.au/magma/). For example, in the Magma package, a command entered into the Magma calculator (http://magma.maths.usyd.edu.au/calc/):

C:=BKLC(GF(2),7,4); C;

produces as output the generator matrix for the binary linear code [7,4,3]:
[7, 4, 3] Linear Code over GF(2)
Generator matrix:
[1 0 0 0 0 1 1]
[0 1 0 0 1 0 1]
[0 0 1 0 1 1 0]
[0 0 0 1 1 1 1]
[00115] In accordance with some embodiments of the invention, a linear error correcting code generator matrix can be converted into a wiring pattern matrix by rotating the matrix counterclockwise 90 degrees, for example, as shown in Table 4.9.
[00116] In the illustrative example shown in Table 4.9, each switch has 7 ports connected to other switches, and there are 16 total switches, corresponding to an LH augmented dimension 4 hypercube. Generators h1 through h7 correspond to the original columns from the rotated [G4,7] matrix and can be used to determine how the switches are connected to each other by cables. [00117] In accordance with some embodiments of the invention, the 16 switches can be labeled with binary addresses, 0000, 0001, through 1111. The switches can be connected to each other using the 7 ports assigned for this purpose, labeled h1 through h7, by performing the following procedure for each of the sixteen switches. For example, connect a cable between each source switch network port (1-7) and the same port number on the destination switch whose number is determined by performing an exclusive or logical operation between the source switch number and the value of the Cayley graph generator h1 to h7 (column 2 in the table below) for that network port number.
[00118] For example, to determine how to connect the 7 wires going from switch number 3 (binary 0011), take each graph generator (the number in the 2nd column) and exclusive or (XOR) it with 0011 (the source switch number), which results in the "Destination switch number" in the connection map (the XOR of columns 2 and 3 yields column 4).

[Connection map table not reproduced: for each network port 1-7, column 2 holds the generator h1..h7, column 3 the source switch number 0011, and column 4 the destination switch number, i.e. the generator XOR 0011.]
[00119] This wiring procedure describes how to place the connections to send from a source switch to a destination switch, so for each connection from a source switch to a destination switch, there is also a connection from the destination switch back to the source switch. As a practical matter, in this embodiment, a single bi-directional cable is used for each pair of connections.
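As a sketch of this wiring procedure, the short C program below prints the full connection map for all 16 switches. The generator values are an assumption derived by rotating the [7,4,3] generator matrix above counterclockwise 90 degrees (reading its columns right to left, top matrix row as the most significant bit); the actual Table 4.9 values may order the bits differently.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed hop values h1..h7 from the rotated [G4,7] matrix columns. */
        const unsigned h[7] = { 0xD, 0xB, 0x7, 0x1, 0x2, 0x4, 0x8 };

        for (unsigned src = 0; src < 16; ++src)        /* switches 0000..1111 */
            for (int port = 0; port < 7; ++port) {
                unsigned dst = src ^ h[port];          /* Cayley graph edge */
                printf("switch %2u port %d -> switch %2u port %d\n",
                       src, port + 1, dst, port + 1);  /* same port on dst */
            }
        return 0;
    }

Each link appears twice in the printout (once from each endpoint); per paragraph [00119], each such pair is realized by a single bi-directional cable.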
Construction of Long Hop Networks

[00120] The LH networks are direct networks constructed using general Cayley graphs Cay(Gn, Sm) for the topology of the switching network. The preferred embodiment for LH networks belongs to the most general hypercubic-like networks, with a uniform number of external (E) and topological (m) ports per switch (where E+m=R='switch radix'), which retain the vertex and edge symmetries of the regular d-cube Qd. The resulting LH network with n=2^d switches in that case is a Cayley graph of type Cay(Z2^d, Sm) with n-1 > m > d+1 (these restrictions on m exclude well known networks such as the d-cube Qd which has m = d, the folded d-cube FQd with m = d+1, as well as the fully meshed networks with m = n-1 and m = n-2). It will become evident that the construction method shown on the Z2^d example applies directly to the general group Zq^d with q > 2. For q > 2, the resulting Cay(Zq^d, Sm) is the most general LH type construction of d-dimensional hyper-torus-like or flattened butterfly-like networks of extent q (which is equivalent to a hyper-mesh-like network with cyclic boundary conditions). The preferred embodiment will use q = 2, since Z2^d is the most optimal choice from a practical perspective due to the shortest latency (average and max), highest symmetry, simplest forwarding and routing, simplest job partitioning (e.g. for multi-processor clusters), and the easiest and most economical wiring in the Zq^d class.
[00121] Following the overall task breakdown in section 3, the LH construction proceeds in two main phases:
(i) Constructing a method for efficient computation of the exact bisection B
(ii) Computing the optimal set of m links (hops) Sm per node maximizing this B
[00122] For the sake of clarity, the main phases are split further into smaller subtasks, each described in the sections that follow.
Generators and Adjacency Matrix
[00123] A network built on a Cay(Zq^d, Sm) graph has n = q^d vertices (syn. nodes), and for q = 2, which is the preferred embodiment, n = 2^d nodes. These n nodes make up the n element vertex set V = {v0, v1, ..., v(n-1)}. We are using 0-based subscripts since we need to do modular arithmetic with them.
Node labels and group operation table

[00124] The nodes vi are labeled using d-tuples in an alphabet of size q: v ≡ i ∈ {0,1,... n-1} expressed as d-digit integers in base q. The group operation, denoted as ⊕, is not the same as integer addition mod n; rather, it is component-wise addition modulo q done on the d components separately. For q = 2, this is equivalent to a bitwise XOR operation between the d-tuples, as illustrated in Table 2.1 (Appendix A), which shows the full Z2^d group operation table for d = 4.
[00125] Table 4.1 illustrates the analogous Zq^d group operation table for d=2 and q=3, hence there are n=3^2=9 group elements and the operation table has n×n = 9×9 = 81 entries. The 2-digit entries have digits from the alphabet {0,1,2}. The n rows and n columns are labeled using the 2-digit node labels. The table entry at row r and column c contains the result of r⊕c (component-wise addition mod q=3). For example, the 3rd row, labeled 02, and the 6th column, labeled 12, yield the table entry 02⊕12 = (0+1)%3,(2+2)%3 = 11.
 ⊕ | 00 01 02 10 11 12 20 21 22
 00| 00 01 02 10 11 12 20 21 22
 01| 01 02 00 11 12 10 21 22 20
 02| 02 00 01 12 10 11 22 20 21
 10| 10 11 12 20 21 22 00 01 02
 11| 11 12 10 21 22 20 01 02 00
 12| 12 10 11 22 20 21 02 00 01
 20| 20 21 22 00 01 02 10 11 12
 21| 21 22 20 01 02 00 11 12 10
 22| 22 20 21 02 00 01 12 10 11

Table 4.1
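A minimal C sketch of the group operation ⊕ (component-wise addition mod q on d digits), which reproduces the Table 4.1 entries for q=3, d=2 and reduces to bitwise XOR for q=2; the function name is an assumption.

    /* Component-wise addition mod q of two d-digit base-q labels a and b,
     * both given as integers (paragraphs [00124]-[00125]). For q=2 this
     * computes exactly a XOR b. */
    unsigned group_add(unsigned a, unsigned b, unsigned q, unsigned d)
    {
        unsigned r = 0, place = 1;
        for (unsigned i = 0; i < d; ++i) {
            r += ((a % q + b % q) % q) * place;  /* current digits, mod q */
            a /= q; b /= q; place *= q;          /* move to the next digit */
        }
        return r;
    }

For example, group_add(2, 5, 3, 2), i.e. labels 02 and 12 in base 3, returns 4, i.e. the label 11, matching the Table 4.1 entry above.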
[00126] It can be noted in Table 4.1 for Z3^2, and in Table 2.1 (Appendix A) for Z2^4, that each row r and each column c contains all n group elements, but in a unique order. The 0th row and 0th column contain the unmodified r and c values, since the identity element is I = 0. Both tables are symmetrical since the operation r⊕c = c⊕r is symmetrical (which is a characteristic of the abelian groups Zq^d used in the example).
Construction of adjacency matrix [A]
[00127] The generator set Sm contains m "hops" h1, h2,... hm (they are also elements of the group Gn in Cay(Gn, Sm)), which can be viewed as the labels of the m nodes to which the "root" node v0 ≡ 0 is connected. Hence, the row r=0 of the adjacency matrix [A] has m ones, at columns A(0,h) for the m hops h ∈ Sm, and 0 elsewhere. Similarly, the column c=0 has m ones at rows A(h,0) for the m hops h ∈ Sm and 0 elsewhere. In the general case, a row r=y has m ones at columns A(y, y⊕h) for h ∈ Sm and 0 elsewhere; similarly, a column c=x has m ones at rows A(x⊕h, x) for h ∈ Sm and 0 elsewhere. Denoting the contribution of a single generator h ∈ Sm to the adjacency matrix [A] as a matrix T(h), these conclusions can be written more compactly via Iverson brackets and the bitwise OR operator '|' as:

T(a)ij ≡ [i⊕a = j] | [j⊕a = i],  a ∈ Gn      (4.1)

[A] = Σ_{h∈Sm} T(h) = Σ_{s=1..m} T(hs)      (4.2)
[00128] Note that eq. (4.1) defines T(a) for any element a (or vertex) of the group Gn. Since the right hand side expression in eq. (4.1) is symmetric in i and j, it follows that T(a) is a symmetric matrix, hence it has a real, complete eigenbasis:

T(a)ij = T(a)ji      (4.3)

[00129] For the group Gn = Z2^d, the group operator ⊕ becomes the regular XOR '^' operation, simplifying eq. (4.1) to:

T(a)ij = [i^a = j] = [i^j = a]      (4.4)
[00130] Table 4.2 illustrates the T(a) matrices for q=2, d=3, n=8 and all group elements a = 0..7. For a given a=0..7, value 1 is placed on row r and column c iff r^c = a, and 0 otherwise (0s are shown as '-').
Table 4.2
[00131] Table 4.3 (a) shows the 8x8 adjacency matrix [A] obtained for the generator set S4 ≡ {1, 2, 4, 7}hex ≡ {001, 010, 100, 111}bin by adding the 4 generators from Table 4.2: [A] = T(1)+T(2)+T(4)+T(7), via eq. (4.2). For pattern clarity, values 0 are shown as '-'. Table 4.3 (b) shows the indices of the 4 generators (1, 2, 3, 4) which contributed a 1 to a given element of [A] in Table 4.3 (a).
- 1 1 - 1 - - 1 - 1 2 - 3 - - 4
1 - - 1 - 1 1 - 1 - - 2 - 3 4 -
1 - - 1 - 1 1 - 2 - - 1 - 4 3 -
- 1 1 - 1 - - 1 - 2 1 - 4 - - 3
1 - - 1 - 1 1 - 3 - - 4 - 1 2 -
- 1 1 - 1 - - 1 - 3 4 - 1 - - 2
- 1 1 - 1 - - 1 - 4 3 - 2 - - 1
1 - - 1 - 1 1 - 4 - - 3 - 2 1 -
(a) (b)
Table 4.3

[00132] Fig. 10 shows the resulting 8-node network (folded 3-cube, FQ3). Actions (bitwise XOR) of the 4 generators a ∈ {001, 010, 100, 111}bin on the node 000 are indicated by the arrows pointing to the target vertex. All other links are shown without arrows. The total number of links is C = n·m/2 = 8·4/2 = 16, which can be observed directly in the figure.
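A minimal C sketch that builds [A] for Cay(Z2^d, Sm) per eqs. (4.2) and (4.4); with n=8 and hops {1,2,4,7} it reproduces Table 4.3 (a). The function name and the fixed 8-column array are illustrative assumptions.

    /* Build the n x n adjacency matrix of Cay(Z2^d, Sm) per eqs. (4.2),(4.4):
     * a[i][j] = 1 iff i XOR j equals one of the m generators. */
    void build_adjacency(int n, const unsigned *hops, int m, char a[][8])
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                a[i][j] = 0;
                for (int s = 0; s < m; ++s)
                    if ((unsigned)(i ^ j) == hops[s]) a[i][j] = 1;
            }
    }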
Eigenvectors of T(a) and [A]
[00133] To solve the eigen-problem of [A], a couple of additional properties of T(a) are derived from eq. (4.4) (using x^x=0 and x^y=y^x):

(T(a)T(b))ij = Σ_{k=0..n-1} T(a)ik T(b)kj = Σ_{k=0..n-1} [k = i^a][k = j^b] =
             = [i^a = j^b] = [i^j = a^b] = T(a^b)ij   =>

T(a)T(b) = T(a^b)      (4.5)

T(a)T(b) = T(a^b) = T(b^a) = T(b)T(a)      (4.6)
[00134] Eq. (4.5) shows that the T(a) matrices are a representation of the group Gn, and eq. (4.6) shows that they commute with each other. Since, via eq. (4.2), [A] is the sum of the T(a) matrices, [A] commutes with all T(a) matrices as well. Therefore, since they are all also symmetric matrices, the entire set {[A], T(a) ∀a} has a common eigenbasis (via result (M4) in section 2.F). The next sequence of equations shows that the Walsh functions, viewed as n-dimensional vectors |Uk>, are the eigenvectors of the T(a) matrices. Using eq. (4.4) for the matrix elements of T(a), the action of T(a) on the Walsh ket vector |Uk> yields for the i-th component of the resulting vector:

(T(a)|Uk>)i = Σ_{j=0..n-1} T(a)ij Uk(j) = Σ_{j=0..n-1} [j = i^a] Uk(j) = Uk(i^a)      (4.7)
[00135] The result Uk(i^a) is transformed via eq. (2.5) for the general function values Uk(x):

Uk(i^a) = (-1)^P(k&(i^a)) = (-1)^(P(k&i) + P(k&a)) = Uk(a)·Uk(i) = Uk(a)·(|Uk>)i      (4.8)

[00136] Collecting all n components of the left side of eq. (4.7) and the right side of eq. (4.8) yields in vector form:
T(a)|Uk> = Uk(a)|Uk> (4.9)
[00137] Hence, the orthogonal basis set {|Uk>, k=0..n-1} is the common eigenbasis for all T(a) matrices and for the adjacency matrix [A]. The n eigenvalues of T(a) are the Walsh function values Uk(a), k=0..n-1. The eigenvalues of [A] are obtained by applying eq. (4.9) to the expansion of [A] via T(h), eq. (4.2):

[A]|Uk> = λk|Uk>      (4.10)

where: λk ≡ Σ_{s=1..m} Uk(hs)      (4.11)
[00138] Since U0(x) = 1 for all x, the eigenvalue λ0 of [A] for the eigenvector |U0> is:

λ0 = m ≥ λk      (4.12)

[00139] From eq. (4.11) it also follows that λ0 ≥ λk for k=1..n-1, since the sum in eq. (4.11) may contain one or more negative addends Uk(hs) = -1 for k>0, while for the k=0 case all addends are equal to +1.
Computing Bisection
Cuts from adjacency matrix and partition vector
[00140] The bisection B is computed by finding the minimum cut C(X) in the set E={X} of all possible equipartitions X=S1+S2 of the set of n vertices. An equipartition X can be represented by an n-dimensional vector |X> ∈ Vn containing n/2 values +1 selecting the nodes of group S1, and n/2 values -1 selecting the nodes of group S2. Since the cut value of a given equipartition X does not depend on the particular +1/-1 labeling convention (e.g. changing the sign of all elements xi defines the same graph partition), all vectors |X> will by convention have the 1st component set to 1, and only the remaining n-1 components need to be varied (permuted) to obtain all possible distinct equipartitions from E. Hence, the equipartition set E consists of all vectors X = (x0, x1,... x(n-1)), where xi = ±1 and Σ_{i=0..n-1} xi = 0.
[00141] The cut value C(X) for a given partition X = (x0, x1,... x(n-1)) is obtained as the count of links which cross between nodes in S1 and S2. Such links can be easily identified via E and the adjacency matrix [A], since [A]ij is 1 iff nodes i and j are connected and 0 if they are not connected. The group membership of a node i is stored in the component xi of the partition X. Therefore, the links (i,j) that are counted have [A]ij=1, i.e. nodes i and j must be connected, and they must be in opposite partitions, i.e. xi ≠ xj. Recalling that xi and xj have values +1 or -1, the condition "xi ≠ xj" is equivalent to "xi·xj = -1". To express that condition as a contribution +1 when xi ≠ xj and a contribution 0 when xi = xj, the expression (1 - xi·xj)/2 is constructed, which yields precisely the desired contributions +1 and 0 for any xi, xj = ±1. Hence, the values added to the link count can be written as Cij ≡ (1 - xi·xj)·[A]ij/2, since Cij=1 iff nodes i and j are connected ([A]ij=1) and they are in different groups (xi·xj = -1). Otherwise Cij is 0, thus adding no contribution to C(X).
[00142] A counting detail that needs a bit of care arises when adding the Cij terms for all i,j=0..n-1. Namely, if the contribution of e.g. C3,5 for nodes 3 and 5 is 1, because [A]3,5=1 (3,5 linked), x3=-1 and x5=+1, then the same link will also contribute via the C5,3 term, since [A]5,3=1 and x5·x3 = x3·x5. Hence the sum of Cij over all i,j=0..n-1 counts the contribution of each link twice. Therefore, to compute the cut value C(X) for some partition X, the sum of the Cij terms must be divided by 2. Noting also that for any vector X∈E, <X|X> = Σ_i xi^2 = n and Σ_{i,j}[A]ij = n·m, yields for the cut C(X):

C(X) = (1/2)·Σ_{i,j} Cij = (1/4)·(Σ_{i,j} [A]ij - Σ_{i,j} xi·[A]ij·xj) =
     = (n/4)·(m - <X|A|X>/<X|X>)      (4.14)
[00143] To illustrate the operation of formula (4.14), Table 4.5 shows the adjacency matrix [A] for Cay(Z2^4, S5), which reproduces FQ4 (folded 4-cube), with d=4, n=2^d=2^4=16 nodes and m=5 links per node, produced by the generator set S5={1, 2, 4, 8, F}hex={0001, 0010, 0100, 1000, 1111}bin. The row and column headers show the sign pattern of the example partition X=(1,1,1,1, -1,-1,-1,-1, 1,1,1,1, -1,-1,-1,-1), and the shaded areas indicate the blocks of [A] in which eq. (4.14) counts ones: elements of [A] where row r and column c have opposite signs of the X components xr and xc. The cut is computed as C(X) = 1/2·(sum of ones in shaded blocks) = 1/2·(4·8) = 16, which is the correct B for FQ4. Note that the zeros (they don't contribute to C(X)) in the matrix [A] are shown as the '-' symbol.

[Table 4.5: the 16×16 adjacency matrix of FQ4, with the partition sign pattern as row and column headers; not reproduced.]

Table 4.5
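A minimal C sketch of the cut count underlying eq. (4.14): it counts links whose endpoints carry opposite signs and, per paragraph [00142], halves the double count. The names and the fixed 16-column array are assumptions.

    /* Cut value C(X) of partition x[] (+1/-1 per node) on adjacency a[][],
     * per eq. (4.14): each crossing link is seen twice in the double sum. */
    int cut_value(int n, const char a[][16], const int *x)
    {
        int twice_cut = 0;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                if (a[i][j] && x[i] != x[j]) ++twice_cut;
        return twice_cut / 2;
    }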
Finding the minimum cut (bisection)
[00144] Bisection B is computed as the minimum cut C(X) over all X∈E, which via eq. (4.14) yields:

B = min_{X∈E} C(X) = (n/4)·(m - ME)      (4.15)

where: ME ≡ max_{X∈E} { <X|A|X>/<X|X> }      (4.16)
[00145] Despite the apparent similarity of the max{} term ME in eq. (4.16) to the max{} term MV in eq. (2.46), the Rayleigh-Ritz eqs. (2.45)-(2.46) do not directly apply to the min{} and max{} expressions in eq. (4.15). Namely, the latter extrema are constrained to the set E of equipartitions, which is a proper subset of the full vector space Vn to which Rayleigh-Ritz applies. The ME ≡ max{} in eq. (4.16) can be smaller than the MV max{} computed by eq. (2.46), since the result MV can be attained by a vector from Vn which doesn't belong to E (the set containing only the equipartition vectors X), i.e. if MV is solved only by some vectors Y which do not consist of exactly n/2 elements +1 and n/2 elements -1.
[00146] As an illustration of the problem, ME is analogous to the "tallest programmer in the world" while MV is analogous to the "tallest person in the world." Since the set of "all persons in the world" (analogous to Vn) includes as a proper subset the set of "all programmers in the world" (analogous to E), the tallest programmer may be shorter than the tallest person (e.g. the latter might be a non-programmer). Hence, in the general case the relation between the two extrema is ME ≤ MV. The equality holds only if at least one solution from MV belongs also to ME, or in the analogy, if at least one person among the "tallest persons in the world" is also a programmer. Otherwise, the strict inequality ME < MV holds.
[00147] In order to evaluate ME ≡ max{} in eq. (4.16), the n-dimensional vector space Vn (the space to which the vectors |X> belong) is decomposed into a direct sum of two mutually orthogonal subspaces:

Vn = V0 ⊕ VE  (direct sum of subspaces)      (4.17)

[00148] Subspace V0 is a one dimensional space spanned by a single 'vector of all ones' <1|, defined as:

<1| ≡ (1,1,1,...,1)      (4.18)

while VE is the (n-1) dimensional orthogonal complement of V0 within Vn, i.e. VE is spanned by some basis of n-1 vectors which are orthogonal to <1|. Using eq. (2.6) for the Walsh function U0(x), it follows:

<1| ≡ (1,1,1,...,1) = <U0|      (4.19)
[00149] Hence, VE is spanned by the remaining orthogonal set of n-1 Walsh functions |Uk>, k=1..n-1. For convenience, this subset of Walsh functions is labeled as the set Φ below:

Φ ≡ {|Uk>: k=1..n-1}      (4.20)
[00150] Since all vectors X∈E contain n/2 components equal to +1 and n/2 components equal to -1, then via (4.18):

<1|X> = Σ_{i=0..n-1} xi = 0,  ∀X ∈ E      (4.21)

i.e. <1| is orthogonal to all equipartition vectors X from E, hence the entire set E is a proper subset of VE (which is the set of all vectors ∈ Vn orthogonal to <1|). Using ME in eq. (4.16) and eq. (2.46) results in:

ME ≤ MV = max_{Y∈VE} { <Y|A|Y>/<Y|Y> }      (4.22)

[00151] The MV in eq. (4.22) is solved by an eigenvector |Y> of [A], for which [A]|Y> = λ|Y>, since:

<Y|A|Y>/<Y|Y> = λ·<Y|Y>/<Y|Y> = λ      (4.23)
[00152] Recalling, via eq. (4.10), that the eigenbasis of the adjacency matrix [A] in eq. (4.22) is the set of Walsh functions |Uk>, and that VE, in which the MV = max{} is searched for, is spanned by the n-1 Walsh functions |Uk> ∈ Φ, it follows that the eigenvector |Y> of [A] in eq. (4.23) can be selected to be one of these n-1 Walsh functions from Φ (since they form a complete eigenbasis of [A] in VE), i.e.:

|Y> ∈ Φ = {|Uk>: k=1..n-1}      (4.24)
[00153] The equality in (4.22) holds iff at least one solution |Y> ∈ VE is also a vector from the set E. In terms of the earlier analogy, this can be stated as: in the statement "the tallest programmer" ≤ "the tallest person", the equality holds iff at least one among the "tallest persons" happens to be a "programmer."
[00154] Since |Y> is one of the Walsh functions from Φ, and since all |Uk> ∈ Φ have, via eqs. (2.5) and (2.7), exactly n/2 components equal to +1 and n/2 components equal to -1, |Y> belongs to the set E. Hence the exact solution for ME in eq. (4.22) is the Walsh function |Uk> ∈ Φ with the largest eigenvalue λk. Returning to the original bisection eq. (4.15), where ME is the second term, it follows that B is solved exactly by this same solution |Y> = |Uk> ∈ Φ. Combining eq. (4.15) with the equality case for ME in eq. (4.22) yields:

B = n·m/4 - (n/4)·ME = (n/4)·(m - λt) = (n/4)·(m - max_{k=1..n-1}{λk})      (4.25)
[00155] Therefore, the computation of B is reduced to evaluating the n-1 eigenvalues λk of [A] for k=1..n-1 and finding t ≡ (k with the largest λk), i.e. a t such that λt ≥ λk for k=1..n-1. The corresponding Walsh function Ut provides the equipartition which achieves this bisection B (the exact minimum cut). The evaluation of λk in eq. (4.25) can be written in terms of the m generators hs ∈ Sm via eq. (4.11) as:

λk = Σ_{s=1..m} Uk(hs)      (4.26)
[00156] Although the function values Uk(x) above can be computed via eq. (2.5) as Uk(x) = (-1)^P(k&x), due to the parallelism of binary operations on a regular CPU it is computationally more efficient to use the binary form of the Walsh functions, Wk(x). The binary <-> algebraic translations in eqs. (2.8) can be rewritten in vector form for Uk and Wk, with the aid of the definition of |1> from eq. (4.18), as:

|Wk> ≡ (1/2)·(|1> - |Uk>)      (4.27)

|Uk> = |1> - 2·|Wk>      (4.28)

[00157] Hence, the B formula (4.26) can be written in terms of Wk via (4.28) and the Wk formula eq. (2.10) as:

λk = Σ_{s=1..m} Uk(hs) = Σ_{s=1..m} (1 - 2·Wk(hs)) = m - 2·Σ_{s=1..m} P(k&hs)   =>

B = (n/2) · min_{k=1..n-1} Σ_{s=1..m} P(k&hs)      (4.29)
[00158] The final expression in (4.29) is particularly convenient since for each k=1..n-1 it merely adds the parities of the bitwise AND terms (k & hs) for all m Cayley graph generators hs ∈ Sm. The parity function P(x) in eq. (4.29) can be computed efficiently via a short C function ([14] p. 42) as follows:
    //-- Parity for 32-bit integers                               (4.30)
    inline int Parity(unsigned int x)
    {
        x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2;
        return (x ^ (x >> 1)) & 1;   // fold the remaining 2 bits
    }

[00159] Using the P(x) implementation Parity(x), the entire computation of B via eq. (4.29) can be done by a small C function Bisection(n,hops[],m), as shown in code (4.31).
int Bisection(int n, int *ha, int m)                     (4.31)
{
    int cut, b, i, k;            // n=2^d is # of nodes, m=# of hops
    for (b=n, k=1; k<n; ++k)     // Loop through all n-1 Wk() functions
    {                            // (initial min cut b=n is out of range since m<n)
        for (cut=i=0; i<m; ++i)        // Loop through all m hops ha[i]:
            cut += Parity(ha[i] & k);  // +1 if hop ha[i] is cut by Wk
        if (cut < b) b = cut;    // Update min cut if cut < old min cut
    }
    return b;                    // Return bisection (min cut) in units n/2
}
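As a quick check of code (4.31), the following minimal sketch (added here for illustration, not part of the original text) computes b for the folded 3-cube of Table 4.8, i.e. n=8 nodes with the generator set {1, 2, 4, 7}; per the b=Δ mapping discussed below it should print b=2.

#include <stdio.h>

/* assumes Parity() and Bisection() from codes (4.30) and (4.31) above */

int main(void)
{
    int hops[] = { 1, 2, 4, 7 };       // 3-cube links + the long diagonal
    int b = Bisection(8, hops, 4);     // relative bisection, in units of n/2
    printf("b = %d, B = %d links\n", b, b * 8 / 2);   // expect b=2, B=8
    return 0;
}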
[00160] The inner loop in (4.31) executes m times and the outer loop (n−1) times, yielding a total of ~ m·n steps. Hence, for the n−1 values of k, the total computational complexity of B is ~ O(m·n).
"Symmetry Optimization" of B computation
[00161] A significant further speedup can be obtained by taking full advantage of the symmetries of the Walsh functions Wk, particularly evident in the recursive definition of the Hadamard matrix Hn in eq. (2.1). The corresponding recursion for the binary Walsh matrix [Wn] can be written as:

[W2n] = ( [Wn]  [Wn]  )
        ( [Wn]  [W̄n] )     (4.32)

where [W̄n] denotes the bitwise complement of the matrix [Wn]. For example, in the upper half of [W2n] the left and right sub-matrices [Wn] are the same, suggesting that after computing in eq. (4.29) the partial sums of Wk(hs) over hs<n and k<n (upper left quadrant of W2n), the remaining n partial sums for k ≥ n (top right quadrant of W2n) can be copied from the computed left half. Similarly, in the lower half of [W2n] the left and right quadrant sub-matrices are complements of each other, which replaces the above copying with subtraction from a constant and copying (the constant is the number of hops hs ≥ n, i.e. the hs in the lower half of the W2n matrix). The net result of these two computational short-circuits is a reduction of the original computation in half. Since the computations inside the halves Wn are of the same type as those just described for W2n, applying the same symmetry method recursively log(n) times to the halved matrices generated in each stage reduces the net complexity of the computation of B from the earlier O(m·n²) to O(m·n·log(n)), i.e. the gain is a speedup factor of n/log(n) over the original method of eq. (4.29).
"Fast Walsh Transform Optimization" of B computation
[00162] An analogue of the above 'halving' optimization of the B computation can be formulated for the algebraic form of the Walsh functions Uk by defining a function f(x) for x=0,1,... n−1 as:

f(x) ≡ 1 if x ∈ Sm, else 0 (4.33)

where 0 ≤ x < n and Sm = {h1, h2, ... hm} is the set of m graph generators. Hence, f(x) is 1 when x is equal to one of the generators hs ∈ Sm and 0 elsewhere. This function can be viewed as a vector |f⟩,

|f⟩ ≡ (f(0), f(1), ... f(n−1))
Recalling the computation of the adjacency matrix [A] via eq. (4.2), the vector |f⟩ can also be recognized as the 0-th column of [A], i.e. f(i) = [A]0,i. With this notation, the eq. (4.26) for B becomes:
B = (n/4)·(m − max{⟨Uk|f⟩: k=1..n−1}) = (n/4)·(m − max{Fk: k=1..n−1}) (4.34)

where: Fk ≡ ⟨Uk|f⟩ (4.35)
[00163] Therefore, the B computation consists of finding the largest element in the set {Fk} of n−1 elements. Using the orthogonality and completeness of the n vectors |Uk⟩, ⟨Uj|Uk⟩ = n·δj,k from eq. (2.3), an important property of the set {Fk} follows:

Σ(k=0..n−1) Fk·|Uk⟩ = Σ(k=0..n−1) |Uk⟩⟨Uk|f⟩ = (Σ(k=0..n−1) |Uk⟩⟨Uk|)·|f⟩ = n·|f⟩ (4.36)
[00164] The eqs. (4.35),(4.36) can be recognized as the Walsh transform ([14] chap. 23) of the function f(x), with the n coefficients Fk as the transform coefficients. Hence, the evaluation of all n coefficients Fk, which in the direct (4.35) computation requires O(n²) steps, can be done via the Fast Walsh Transform (FWT) in O(n·log(n)). Note that the FWT will produce all n coefficients Fk, including F0, even though F0 is not needed, i.e. according to eq. (4.34) we still look for the max{} in the set {F1, F2, ... Fn−1}. Since each step involves the addition of m points, the net complexity of the B computation via (4.34) using the FWT is O(m·n·log(n)), which is the same as the "symmetry optimization" result in the previous section.
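As an illustration of this FWT route (a minimal sketch, not the patent's implementation; the in-place butterfly below is the standard Walsh-Hadamard transform), the coefficients Fk of eq. (4.35) can be produced from the f(x) of eq. (4.33) and then fed into eq. (4.34), restated in units of n/2 as b = (m − max{Fk})/2:

#include <stdlib.h>

void FWT(int *F, int n)                   // in-place transform, n = power of 2
{
    for (int len = 1; len < n; len <<= 1)        // log2(n) butterfly stages
        for (int i = 0; i < n; i += 2 * len)
            for (int j = i; j < i + len; ++j)
            {
                int a = F[j], b = F[j + len];
                F[j] = a + b;                    // (+) branch of Uk
                F[j + len] = a - b;              // (-) branch of Uk
            }
}

int BisectionFWT(int n, int *hops, int m) // b via eq. (4.34)
{
    int *F = calloc(n, sizeof(int));
    for (int s = 0; s < m; ++s) F[hops[s]] = 1;  // f(x) of eq. (4.33)
    FWT(F, n);                                   // now F[k] = <Uk|f> = Fk
    int maxF = -m;
    for (int k = 1; k < n; ++k)                  // F0 = m is skipped
        if (F[k] > maxF) maxF = F[k];
    free(F);
    return (m - maxF) / 2;                       // b = (m - max{Fk})/2
}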
[00165] Although both methods above achieve a speedup by a factor n/log(n) over the direct use of eqs. (4.26) and (4.29), a far greater saving has already occurred in the original eq. (4.26). Namely, eq. (4.26) computes B by computing only the n−1 cuts for the equipartitions Uk ∈ Φ, instead of computing the cuts for all equipartitions in the set E of all possible equipartitions. The size of the full set E of "all possible equipartitions" is (the factor ½ is due to the convention that all partitions in E have +1 as the 1st component):

|E| = ½·C(n, n/2) (4.37)
[00166] To appreciate the savings by eq. (4.26) alone, consider a very small network of merely n=32 nodes. To obtain the exact B for this network the LH method needs to compute n−1 = 31 cuts, while exact enumeration would need to compute |E| = 0.5·C(32,16) = 300,540,195 cuts, i.e. a 9,694,845 times greater number of cuts. Further, this ratio via eq. (4.37) grows exponentially in the size of the network n, nearly doubling for each new node added.
Optimizing Bisection
[00167] With the two O(m·n·log(n)) complexity methods for the computation of the bisection B for a given set of generators Sm described in the previous sections, the next task is the optimization of the generator set Sm = {h1, h2, ... hm}, i.e. finding the Sm with the largest B. The individual hops hs are constrained to n−1 values: 1, 2, ... n−1 (0 is eliminated since no node is connected to itself), i.e. Sm is an m element subset of the integer sequence 1..n−1. For convenience, the set of all m-subsets of the integer sequence 1..n−1 is labeled as follows:
Ω(n, m) ≡ {Sm: Sm = {h1, h2, ... , hm} and 0 < hs < n} (4.40)

|Ω| ≡ |Ω(n, m)| = C(n−1, m) = O(n^m) (4.41)
[00168] With this notation and using the binary formula for B, eq. (4.29), the B optimization task is:

B = (n/2)·b,   b ≡ max{ min{ Σ(s=1..m) Wk(hs): k=1..n−1 }: Sm ∈ Ω(n, m) } (4.42)
[00169] For convenience, eq. (4.42) also defines the quantity b, which is the bisection in units of n/2. The worst case computational complexity of the B optimization is thus O((m·n·log(n))·n^m), which is polynomial in n; hence, at least in principle, it is a computationally tractable problem as n increases. (The actual exponent would be (m − log(n) − 1), not m, since the Cayley graphs are highly symmetrical and one would not have to search over the symmetrically equivalent subsets Sm.) Note that m is typically a hardware characteristic of the network components, such as switches, which usually don't get replaced often as the network size n increases.
[00170] Since for large enough n even a low power polynomial can render an 'in principle tractable' problem practically intractable, approximate methods for the max{} part of the computation (4.42) would be used in practice. Particularly attractive for this purpose would be the genetic algorithms and simulated annealing techniques used in [12] (albeit there for the task of computing B, which the methods of this invention solve efficiently and exactly). Some of the earlier implementations of this invention have used fast greedy algorithms, which work fairly well. The 'preferred embodiment' of the invention, described next, does not perform any such direct optimization of eq. (4.42), but uses a more effective method instead.
Bisection B optimization via EC Codes
[00171] In order to describe this method, the inner-most term within the nested max{min{}} expression in eq. (4.42) is identified and examined in more detail. For convenience this term, which has the meaning of a cut for a partition defined via the pattern of ones in the Walsh function Wk(x), is labeled as:

Ck ≡ Σ(s=1..m) Wk(hs) = Σ(s=1..m) P(k&hs) (4.43)
[00172] Eq. (4.43) also expresses Wk(x) in terms of the parity function P(x) via eq. (2.10). The parity function for some d-bit integer x = (x_{d−1} ... x1 x0)binary is defined as:

P(x) ≡ (Σ(μ=0..d−1) xμ) mod 2 = x0 ⊕ x1 ⊕ ... ⊕ x_{d−1} (4.44)
[00173] The last expression in eq. (4.44) shows that P(x) ≡ P(x_{d−1} ... x1 x0) is a "linear combination", in terms of the field GF(2), of the field elements provided in the argument. The eq. (4.43) contains a modified argument of the type P(k&h), for h ∈ Sm, which can be reinterpreted as: the ones of the integer k select a subset of bits from the d-bit integer h, then P(x) performs the linear combination of the selected subset of bits of h. For example, if k=11dec=1011bin then the action of W1011(h) ≡ P(1011&h) is to compute the linear combination of bits 0, 1 and 3 of h (bit numbering is zero based, from low/right to high/left significance). Since eq. (4.43) performs the above "linear combination via ones in k" action of Wk on a series of d-bit integers hs, s=1..m, the "action" on such a series of integers can be interpreted as a parallel linear combination on the bit-columns of the list of hs, as shown in Table 4.6 for k=1011 and W1011 acting on a set of generators {0001, 0010, 0100, 1101}. The 3 bit-columns V3, V1 and V0 selected by the ones in k are combined via XOR into the resulting bit-column V: |V3⟩⊕|V1⟩⊕|V0⟩ = |V⟩.
[Table 4.6 (image): the bit-columns V3, V2, V1, V0 of the generators {0001, 0010, 0100, 1101}; the columns V3, V1, V0 selected by k=1011 XOR into the column |V⟩ = (1, 1, 0, 0), of Hamming weight 2]
Table 4.6

[00174] Therefore, the action of a Wk on the generator set Sm = {h1, h2, ... hm} can be seen as a "linear combination" of the length-m columns of digits (the columns selected by the ones in k of Wk) formed by the m generators hs. If instead of the Z_2^d used in the example of Table 4.6 there was a more general Cayley graph group, such as Z_q^d, instead of the bit-columns there would have been length-m columns made of digits in an alphabet of size q (i.e. integers 0..q−1), and the XOR would have been replaced with the appropriate GF(q) field arithmetic, e.g. addition modulo q on m-tuples for Z_q^d, as illustrated in an earlier example in Table 4.1. The construction of the column vectors |Vμ⟩ of Table 4.6 can be expressed more precisely via an m×d matrix [Rm,d] defined as:
[Rm,d] ≡ (⟨h1|; ⟨h2|; ... ⟨hm|) ≡ (|V_{d−1}⟩, |V_{d−2}⟩, ... |V0⟩) (4.45)

where: (|Vμ⟩)s ≡ h_{s,μ} = (⟨hs|)μ for μ = 0..d−1, s = 1..m (4.46)
[00175] Hence the m rows of the matrix [Rm,d] are the m generators ⟨hs| ∈ Sm and its d columns are the d column vectors |Vμ⟩. The above 'linear combination of columns via the ones in k' becomes, in this notation:
|V(k)⟩ ≡ Σ(μ=0..d−1) kμ·|Vμ⟩, where k ≡ Σ(μ=0..d−1) kμ·2^μ (4.47)

where the linear combination of the kμ·|Vμ⟩ is performed in GF(q), i.e. mod q on each component of the m-tuples kμ·|Vμ⟩. The sum computing the cut Ck in eq. (4.43) is then simply the sum (without mod q) of all components of the vector |V(k)⟩ from eq. (4.47). Recalling the definition of the Hamming weight ⟨V⟩ as the number of non-zero digits, this cut Ck is recognizable as the Hamming weight of the vector |V(k)⟩:
Ck = ⟨V(k)⟩ (4.48)
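The following small sketch (added for illustration, not from the original text) reproduces the Table 4.6 computation for q=2: it XORs the bit-columns |Vμ⟩ of the generator list selected by the ones in k and counts the ones of the resulting column, i.e. it evaluates Ck = ⟨V(k)⟩ of eq. (4.48):

#include <stdio.h>

int CutViaColumns(int k, const int *h, int m, int d)
{
    int weight = 0;
    for (int s = 0; s < m; ++s)          // row s of [Rm,d] is generator h[s]
    {
        int bit = 0;
        for (int mu = 0; mu < d; ++mu)   // XOR the bit-columns selected by k
            if ((k >> mu) & 1)
                bit ^= (h[s] >> mu) & 1;
        weight += bit;                   // component s of |V(k)>
    }
    return weight;                       // Ck = <V(k)>, eq. (4.48)
}

int main(void)
{
    int h[] = { 0x1, 0x2, 0x4, 0xD };    // the generator list of Table 4.6
    printf("C_1011 = %d\n", CutViaColumns(0xB, h, 4, 4));  // prints 2
    return 0;
}

For k=1011 it prints 2, matching the sum of parities P(k&hs) of eq. (4.43).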
[00176] The next step is to propagate the new "linear combination" interpretation of the Wk action back one more level, to the original optimization problem in eq. (4.42), in which the cut Ck was only the innermost term. The min{} block of eq. (4.42) seeks the minimum value of Ck for all k=1..n−1. The set of n vectors |V(k)⟩ obtained via eq. (4.47) when k runs through all possible integers 0..n−1 is a d-dimensional vector space, a linear span (a subspace of the m-tuples vector space Vm), which is denoted as S(d,m,q):

S(d, m, q) ≡ {|V(k)⟩: k = 0..n−1} (4.49)
[00177] Therefore, the min{} level optimization in eq. (4.42) computing the bisection b seeks a non-zero vector |V(k)⟩ from the linear span S(d,m,q) with the smallest Hamming weight ⟨V(k)⟩:

b = min{⟨V(k)⟩: (V(k) ∈ S(d, m, q)) and (V(k) ≠ 0)} (4.50)
[00178] While the Hamming weight is used in some embodiments of the invention, any other weight, such as the Lee weight, which would correspond to other Cayley graph groups Gn and generator sets Sm, can also be used.
[00179] But b in eq. (4.50) is precisely the definition, eq. (2.25), of the minimum weight wmin in the codeword space (linear span) S(_k,_n,q) of non-zero codewords Y. Note: In order to avoid a mix up in notation between the two fields, the overlapping symbols [n, k], which have a different meaning in ECC, will in this section carry an underscore prefix, i.e. the linear code [n, k] is relabeled as [_n, _k].
[00180] The mapping between the ECC quantities and the LH quantities is then: wmin ⇔ b, _k ⇔ d, _n ⇔ m, and the _k vectors ⟨g| spanning the linear space S(_k,_n,q) of _n-tuples and constituting the code generator matrix [G] (eq. (2.20)) ⇔ the d columns |Vμ⟩ for μ=0..d−1 spanning the linear space S(d,m,q) of m-tuples (the digit-columns of the generator list). Since, via eq. (2.26), the minimum weight of the code wmin is the same as the minimum distance Δ between the codewords Y, it follows that the bisection b is also the same quantity as the ECC Δ (even numerically). Table 4.7 lists some of the elements of this mapping.
[Table 4.7 (image): translation map between the linear code [_n, _k, Δ]q quantities and the LH network Cay(Z_q^d, Sm) quantities, e.g. _n ⇔ m, _k ⇔ d, Δ ⇔ b]
Table 4.7
[00181] The optimization of a linear code [_n, _k, Δ] that maximizes Δ is thus the same optimization as the outermost level of the LH optimization, the max{} level in eq. (4.42), which seeks the Cayley graph generator set Sm with the largest bisection b. Other than differences in labeling conventions, both optimizations seek the d-dimensional subspace S(d,m,q) of some vector space Vm which maximizes the minimum non-zero weight wmin of the subspace S. The two problems are mathematically one and the same.
[00182] Therefore, the vast numbers of good/optimal linear ECC codes computed over the last six decades (such as the EC code tables [17] and [22]) are immediately available as good/optimal solutions for the b optimization problem of the LH networks, such as eq. (4.42) for the Cayley graph group Gn=Z_2^d. Similarly, any techniques, algorithms and computer programs (e.g. the MAGMA ECC module, http://magma.maths.usyd.edu.au/magma/handbook/linear_codes_over_fini) used for constructing and combining good/optimum linear EC codes, such as quadratic residue codes, Goppa, Justesen, BCH, cyclic codes, Reed-Muller codes, ... [15], [16], via the translation Table 4.7, automatically become techniques and algorithms for constructing good/optimum LH networks.
[00183] As an illustration of the above translation procedure, a simple parity check EC code [4,3,2]2 with generator matrix [G3,4] is shown in Table 4.8. The codeword has 1 parity bit followed by 3 message bits and is capable of detecting all single bit errors. The translation to the optimum network shown on the right is obtained by rotating the 3×4 generator matrix [G3,4] by 90° counter-clockwise. The obtained block of 4 rows with 3 bits per row is interpreted as 4 generators hs, each 3 bits wide, for the Cay(Z_2^3, S4) graph. The resulting network thus has d=3, n=2^3=8 nodes and m=4 links/node. The actual network is the folded 3-cube shown within an earlier example.

[Table 4.8 (image): the [4,3,2]2 generator matrix [G3,4] and its 90° rotation into the 4 generators of Cay(Z_2^3, S4)]
Table 4.8

[00184] A slightly larger and denser network, using the EC code [7,4,3]2 from Table 2.4 (Appendix A), is converted into an optimum solution, a graph Cay(Z_2^4, S7), with d=4, n=16 nodes and m=7 links/node, as shown in Table 4.9.
[Table 4.9 (image): the 7 generators h1 = 0001 = 1, ..., h7 = 1011 = B of Cay(Z_2^4, S7), obtained by rotating the generator matrix [G4,7] of the [7,4,3]2 code]
Table 4.9

[00185] The 4 row, 7 column generator matrix [G4,7] of the linear EC code [7,4,3]2 on the left side was rotated 90° counter-clockwise, and the resulting 7 rows of 4 digits are the binary values of the 7 generators hs (also shown in hex) of the 16 node Cayley graph. The resulting n=16 node network has relative bisection (in n/2 units) b=Δ=3 and absolute bisection (in # of links) of B = b·n/2 = 3·16/2 = 24 links. Since the network is a non-planar 4-dimensional cube with a total of n·m/2 = 16·7/2 = 56 links, it is not drawn.
[00186] The above examples are captured by the following simple, direct translation recipe (a code sketch of the q=2 case follows the list):

EC code [_n, _k, Δ]q → LH Cay(Z_q^d, Sm) (4.45)

(i) Take the EC code generator matrix [G_k,_n] and rotate it 90° (in either direction; the direction of rotation merely selects the order of generators in the list, which is an arbitrary convention).

(ii) The result is an m=_n row by d=_k column matrix [Rm,d] of GF(q)-digits 0..q−1.

(iii) Read the m rows of d-digit numbers in base q from [Rm,d] as the m generators hs ∈ Sm ⊂ Z_q^d.

(iv) Compute the Cayley graph LH = Cay(Z_q^d, Sm) from the obtained generators.

(v) LH: n=q^d nodes, m links/node, bisection: relative b=Δ, absolute B=Δ·n/2 links.
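A compact sketch of recipe (4.45) for the binary case q=2 (an illustration under assumed conventions, not the patent's code: row i of [G] is stored as an _n-bit mask whose bit j is the entry in column j; any consistent convention works, per step (i)):

/* Rotate the _k x _n generator matrix [G] of a binary EC code into
   m=_n generators of d=_k bits each: hops[j] collects column j of [G]. */
void CodeToHops(const unsigned *G, int k_, int n_, unsigned *hops)
{
    for (int j = 0; j < n_; ++j)          // one generator per code column
    {
        unsigned h = 0;
        for (int i = 0; i < k_; ++i)      // bit i of the hop = row i, col j
            h |= ((G[i] >> j) & 1u) << i;
        hops[j] = h;                      // hops[j] is a d=_k bit integer
    }
}

Applied to the [7,4,3]2 generator matrix of Table 4.9, this emits the 7 four-bit generators of the n=16 node network.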
[00187] The methods of determining the bisection B can be implemented using a computer program or set of computer programs organized to perform the various steps described herein. The computer can include one or more processors and associated memory, including volatile and non-volatile memory, to store the programs and data. For example, a conventional IBM compatible computer running the Windows or Linux operating system or an Apple computer system can be used, and the programs can be written, for example, in the C programming language.
Implementation Notes
N-1. Equivalent LH networks
[00188] The order of elements in a generator set Sm = {h1, h2, ... hm} is clearly a matter of convention, and network performance characteristics don't depend on the particular ordering. Similarly, the subspace S(d,m,q) of the column vectors can be generated using any linearly independent set of d vectors from S(d,m,q) instead of the original subset {|Vμ⟩}. All these transformations of a given network yield equivalent networks, differing only in labeling convention but all with the same distribution of cuts (including min-cut and max-cut) and the same network path distribution (e.g. same average and max paths). This equivalence is used to compute specific generators optimized for some other objective, beyond the cuts and paths. Some of these other objectives are listed in the notes below.
N-2. Minimum change network expansion
[00189] During expansion of the network, it is useful that the next larger network is produced with the minimum change from the previous configuration, e.g. requiring the fewest cables to be reconnected to other switches or ports. The equivalence transforms of N-1 are used to "morph" the two configurations, initial and final, toward each other, using the number of differing links in Sm as the cost function being minimized. Techniques and algorithms of "Compressive Sensing" [CS] (see [20]) are particularly useful as a source of efficient "morphing" algorithms.
N-3. Diagonalization
[00190] It is often useful, especially in physical wiring, discovery and routing, to have a network in which the (usually first) d hops from Sm are powers of q. This property of the generator set Sm corresponds to a systematic generator matrix [G_k,_n] for linear codes and can be recognized by the presence of the identity matrix Id within [G_k,_n] (possibly with permuted columns). The two previous examples, Tables 4.8 and 4.9, were of this type (the digits of the Id sub-matrix were in bold).
[00191] A simple, efficient method for computing a "systematic generator" from a non-systematic one is to select for each column c = 0..d−1 a row r(c) = 1..m that contains a digit 1 in column c. If row r(c) doesn't contain any other ones, then we have one column with the desired property (the hop h_{r(c)} is a power of 2). If there are any other columns, such as c', which contain ones in row r(c), the column Vc is XOR-ed into these columns Vc', clearing the excessive ones in row r(c). Finally, when there is a single 1 in row r(c) and column c, the hop h_{r(c)} is swapped into position c of the hop list, so that the resulting matrix contains the generator 2^c in that position. The process is repeated for the remaining columns c < d.
[00192] The number of XOR operations between columns needed to reduce some row r(c) to a single 1 in column c is ⟨h_{r(c)}⟩ − 1, i.e. one less than the weight of that hop. Therefore, to reduce the number of required XORs (the columns are m bits long, which can be much larger than the machine word), for each new c to diagonalize the algorithm picks the row which has the smallest weight, min{⟨h_{r(c)}⟩}.
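A sketch of this N-3 procedure (an illustrative implementation, not verbatim from the text), with the matrix [Rm,d] kept implicitly: bit c of hops[s] is the entry of column Vc in row s, so "XOR column Vc into column Vc'" toggles bit c' of every hop whose bit c is set. Only column operations are used, which per N-1 yield an equivalent network:

int Weight(unsigned x)                   // Hamming weight of a hop
{
    int w = 0;
    for (; x; x >>= 1) w += x & 1;
    return w;
}

void Diagonalize(unsigned *hops, int m, int d)   // hops[0..m-1], d bits each
{
    for (int c = 0; c < d; ++c)
    {
        int r = -1;                      // pick the lightest row with 1 in c
        for (int s = 0; s < m; ++s)
            if (((hops[s] >> c) & 1) &&
                (r < 0 || Weight(hops[s]) < Weight(hops[r])))
                r = s;
        if (r < 0) continue;             // no 1 anywhere in column c
        for (int c2 = 0; c2 < d; ++c2)   // clear the other ones in row r:
            if (c2 != c && ((hops[r] >> c2) & 1))
                for (int s = 0; s < m; ++s)        // column Vc2 ^= Vc
                    hops[s] ^= ((hops[s] >> c) & 1u) << c2;
        unsigned t = hops[c]; hops[c] = hops[r]; hops[r] = t;  // hop -> slot c
    }
}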
N-4. Digital or (t,m,s) nets (or designs, orthogonal arrays)
[00193] This research field is closely related to the design of optimal linear codes [_n,_k,Δ]q (cf. [21], [22]). The basic problem in the field of 'digital nets' is to find a distribution of points on an s-dimensional hypercubic (fish-)net with a "binary intervals" layout of 'net eyes' (or generally analogous b-ary intervals via powers of any base b, not only for b=2) which places the same number of points into each net eye. There is a mapping between (t,_m,s)b digital nets and [_n,_k]q codes via the identities: _n=s, _k=s−_m, q=b. A large database of optimal (t,_m,s) nets, which includes the linear code translations, is available via a web site [22]. Therefore, the solutions, algorithms and computer programs for constructing good/optimal (t,_m,s) nets are immediately portable to the construction of good/optimal LH networks via this mapping, followed by the [_n,_k,Δ]q → LH mapping in Table 4.7.
N-5. Non-binary codes

[00194] The linear codes with q>2 generate hyper-torus/-mesh types of networks of extent q when the Δ metric of the code is the Lee distance. When the Hamming distance is used for q>2 codes, the networks are of the generalized hypercube/flattened butterfly type [3]. For q=2, which is the binary code, the two types of distance metrics are one and the same.
N-6. Non-binary Walsh functions
[00195] Walsh functions readily generalize to other groups besides the cyclic group Z2 used here (cf. [23]). A simple generalization to base q>2 for the groups Z_q^d, for any integer q, is based on defining the function values via the q-th primitive root of unity ω:

U_{q,k}(x) = ω^(Σμ kμ·xμ) for x, k < n ≡ q^d (4.50), where: ω ≡ e^(2πi/q) (4.51)

with kμ, xμ being the base-q digits of k and x.

[00196] For q=2, eq. (4.51) yields ω=(−1), which reduces U_{q,k}(x) from eq. (4.50) to the regular Walsh functions Uk(x), eq. (2.5). The q discrete values of U_{q,k}(x) can also be mapped into integers in the [0,q) interval to obtain integer-valued Walsh functions W_{q,k}(x) (the analogue of the binary form Wk(x)), which is useful for efficient computer implementation, via a mapping analogous to the binary case, e.g. via the mapping a = ω^k for integers k=0..q−1, where k: integer, a: algebraic value, as in eq. (2.8), where this same mapping (expressed differently) was used for q=2.
[00197] The non-binary Walsh functions U_{q,k} can also be used to define graph partitions into f parts, where f is any divisor of q (including q). For even q, this allows for efficient computation of the bisection. The method is a direct generalization of the binary case: the q distinct function values of U_{q,k}(x) define partition arrays Xk[x] = U_{q,k}(x) containing n=q^d elements indexed by x=0..n−1. Each of the q values of U_{q,k}(x) indicates that node x belongs to one of the q parts. The partitions Xk for k=1..n−1 are examined and the cuts computed using the adjacency matrix [A] for the Cay(Z_q^d, Sm) graph, as in eq. (4.14) for q=2. The generators T(a) and the adjacency matrix [A] are computed via the general eqs. (4.1),(4.2), where the ⊕ operator is GF(q) addition (mod q).
[00198] The algorithmic speed optimizations via the "symmetry optimization" and the "Fast Walsh Transform optimization" apply here as well (see [14] pp. 465-468 on fast transforms for multi-valued Walsh functions).
N-7. Secondary Optimizations

[00199] Once the optimum solution for (4.42) is obtained (via ECC, digital nets, or via direct optimization), secondary optimizations, such as seeking the minimum diameter (max distance), the minimum average distance or the largest max-cut, can be performed on the solution via local, greedy algorithms. Such algorithms were used in the construction of our solutions database, where each set of parameters (d, m, q) has alternate solutions optimized for some other criteria (usually diameter, then average distance).
[00200] The basic algorithm attempts the replacement of typically 1 or 2 generators hs ∈ Sm, and for each new configuration it evaluates (incrementally) the target utility function, such as diameter, average distance or max-cut (or some hierarchy of these, used for tie-breaking rules). The number r of simultaneous replacements depends on n, m and the available computing resources, since there are ~ n^r possible simultaneous deletions and insertions (assuming the "best" deletion is followed by the "best" insertion). The utility function also uses indirect measures (analogous to sub-goals) as a tie-breaking selection criterion, e.g. when minimizing diameter it was found that an effective indirect measure is the number of nodes #F in the farthest (from node 0) group of nodes. The indirect objective in this case would be to minimize the #F of such nodes whenever the examined change (swap of 1 or 2 generators) leaves the diameter unchanged.
[00201] In addition to incremental updates to the networks after each evaluated generator replacement, these algorithms rely on the vertex symmetry of Cayley graphs to further reduce computations. E.g. all distance tables are only maintained and updated for the n−1 distances from node 0 ("root"), since the table is the same for all nodes (with a mere permutation of indices, obtainable via the T(a) representation of Gn if needed).
[00202] Depending on network application, the bisection b can be maintained fixed for all replacements (e.g. if bisection is the highest valued objective), or one can allow b to drop by some value, if the secondary gains are sufficiently valuable.
[00203] After generating and evaluating all replacements to a given depth (e.g. replacement of 1 or 2 generators), the "best" one is picked (according to the utility/cost function) and the replacement is performed. Then the outer iteration loop continues, examining another set of replacements seeking the best one, etc., until no more improvements to the utility/cost function can be obtained in the last iteration.

Specialized solutions

[00204] This section describes several optimum LH solutions with particularly useful parameters or simple construction patterns.
S-1. High Density LH Networks for modular switches (LH-HD)
[00205] This is a special case of LH networks with high topological link density, suitable for combining a smaller number of high radix switches into a single large radix modular switch. This is a specialized domain of network parameters where the 2-layer Fat Tree (FT-2) networks are currently used, since they achieve a yield of E=R/3 external ports/switch, which is the maximum mathematically possible for the worst case traffic patterns. The 'high density' LH networks (LH-HD) match the FT-2 in this optimum E=R/3 external ports/switch yield for the worst case traffic patterns, while achieving substantially lower average latency and cost per Gb/s of throughput on random or 'benign' (non-worst case) traffic.
[00206] In our preferred embodiment using the Cay(Z_2^d, Sm) graph, the network size is n=2^d switches and the number of links per node m is one of the numbers n/2, n/2+n/4, n/2+n/4+n/8, ..., n/2+n/4+n/8+...+1; the optimum m generators for LH-HD are then constructed as follows (a code sketch of step (i) follows the two steps):

(i) h1=n−1, h2=n−2, h3=n−3, ... hm=n−m

(ii) Optionally diagonalize and sort Sm via procedure (N-3). (Of course, there are a large number of equivalent configurations obtained via the equivalence transforms of N-1.)
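A trivial sketch of step (i) (illustration only, not from the original text); step (ii) can reuse the N-3 diagonalization sketch above:

/* Step (i) of the LH-HD construction: the m largest hop values. */
void BuildLHHD(int n, int m, int *hops)   // e.g. n=64, m=32 for Table 4.10(a)
{
    for (int s = 0; s < m; ++s)
        hops[s] = n - 1 - s;              // h1=n-1, h2=n-2, ... hm=n-m
}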
[00207] The resulting bisection is:

[equation (image lost)]

and the average hops is 2 − m/n. The largest LH-HD, m = n/2+n/4+n/8+...+1 = n−1, has b=n/2 and corresponds to a fully meshed network.
[00208] Table 4.10 shows an example of LH-HD generators for n=2^6=64 nodes and m=32 hops/node, with the hops shown in hex and binary (binary 0s are shown as the '.' character). Table 4.10(a) shows the non-diagonalized hops after step (i), and Table 4.10(b) shows the equivalent network with m=32 hops after diagonalization in step (ii) and sorting. Other possible LH-HD m values for the same n=64 node network are m=32+16=48, m=48+8=56, m=56+4=60, m=60+2=62 and m=62+1=63 hops.
[00209] Additional modified LH-HD networks are obtained from any of the above LH-HD networks via removal of any one or two generators, which yields networks LH-HD1 with m1 = m−1 and LH-HD2 with m2 = m−2 generators. Their respective bisections are b1 = b−1 and b2 = b−2. These two modified networks may be useful when an additional one or two server ports are needed on each switch compared to the unmodified LH-HD network.
[00210] These three types of high density LH networks are useful for building modular switches, networks on a chip in multi-core or multi-processor systems, flash memory/storage network designs, or generally any of the applications requiring very high bisection from a small number of high radix components, where FT-2 (two level Fat Tree) is presently used. In all such cases, LH-HD will achieve the same bisections at lower latency and lower cost per Gb/s of throughput.
      (a)                    (b)
 1. 3F 111111        1.  1 .....1
 2. 3E 11111.        2.  2 ....1.
 3. 3D 1111.1        3.  4 ...1..
 4. 3C 1111..        4.  8 ..1...
 5. 3B 111.11        5. 10 .1....
 6. 3A 111.1.        6. 20 1.....
 7. 39 111..1        7.  7 ...111
 8. 38 111...        8.  B ..1.11
 9. 37 11.111        9.  D ..11.1
10. 36 11.11.       10.  E ..111.
11. 35 11.1.1       11. 13 .1..11
12. 34 11.1..       12. 15 .1.1.1
13. 33 11..11       13. 16 .1.11.
14. 32 11..1.       14. 19 .11..1
15. 31 11...1       15. 1A .11.1.
16. 30 11....       16. 1C .111..
17. 2F 1.1111       17. 1F .11111
18. 2E 1.111.       18. 23 1...11
19. 2D 1.11.1       19. 25 1..1.1
20. 2C 1.11..       20. 26 1..11.
21. 2B 1.1.11       21. 29 1.1..1
22. 2A 1.1.1.       22. 2A 1.1.1.
23. 29 1.1..1       23. 2C 1.11..
24. 28 1.1...       24. 2F 1.1111
25. 27 1..111       25. 31 11...1
26. 26 1..11.       26. 32 11..1.
27. 25 1..1.1       27. 34 11.1..
28. 24 1..1..       28. 37 11.111
29. 23 1...11       29. 38 111...
30. 22 1...1.       30. 3B 111.11
31. 21 1....1       31. 3D 1111.1
32. 20 1.....       32. 3E 11111.
Table 4.10

S-2. Low Density LH networks with b=3
[00211] This subset of LH networks is characterized by comparatively low link density and low bisection b=3, i.e. B=3·n/2 links. They are constructed as a direct augmentation of regular hypercubic networks, which have bisection b=1. The method is illustrated in Table 4.11 using augmentation of the 4-cube.
[Table 4.11 (image): the 4 hops h1..h4 of the regular 4-cube (a 4×4 box) with 3 additional hops h5, h6, h7 appended in the form of 4 bit-columns C1..C4 of L=3 bits]
Table 4.11
[00212] The d=4 hops h1, h2, h3 and h4 for the regular 4-cube are enclosed in the 4×4 box at the top. The augmentation consists of 3 additional hops h5, h6 and h7, added in the form of 4 columns C1, C2, C3 and C4, where each column Cμ (μ=1..d) has a length of L=3 bits. The resulting network has n=16 nodes with 7 links per node and is identical to the earlier example in Table 4.9 with b=3, obtained there via translation of a [7,4,3]2 EC code into the LH network. The general direct construction of a b=3 LH network from a d-cube is done by appending d columns Cμ (μ=1..d) of length L bits, such that each bit column has at least 2 ones and L is the smallest integer satisfying the inequality:
2^L − L − 1 ≥ d (4.60)
[00213] The condition in eq. (4.60) expresses the requirement that the d columns Cμ must each have at least 2 ones. Namely, there are a total of 2^L distinct bit patterns of length L. Among all 2^L possible L-bit patterns, 1 pattern has no ones (00..0) and L patterns have a single one. Removing these two types, with 0 or a single one, leaves 2^L−(L+1) L-bit patterns with two or more ones, which is the left hand side of eq. (4.60). Any subset of d distinct patterns out of these 2^L−(L+1) remaining patterns can be chosen for the above augmentation. Table 4.12 shows the values L (the number of hops added to a d-cube) satisfying eq. (4.60) for dimensions d of practical interest.
[Table 4.12 (image): the smallest L satisfying eq. (4.60), e.g. L=3 for d ≤ 4, L=4 for d ≤ 11, L=5 for d ≤ 26, L=6 for d ≤ 57]
Table 4.12
S-3. Augmentation of LH networks with b=odd integer
[00214] This is a very simple, yet optimal, augmentation of an LH network which has m links per node and bisection b = odd integer into an LH network with bisection b1=b+1 and m1=m+1 links per node. The method is illustrated in Table 4.14 using the augmented 4-cube (d=4, n=16 nodes) with m=7 links per node and bisection b=3, which was used in the earlier examples in Tables 4.9 and 4.11.
[Table 4.14 (image): the seven hops h1..h7 of the augmented 4-cube XOR-ed bitwise, column by column, yielding the augmenting hop h8 → 1101]
Table 4.14
[00215] A single augmenting link h8 = h1⊕h2⊕...⊕h7 (the bitwise XOR of the list) is added to the network, which increases the bisection from b=3 to b1=4, i.e. it increases the absolute bisection B by n/2 = 16/2 = 8 links. The general method for Cay(Z_2^d, Sm) with b = odd integer consists of adding the link h_{m+1} = h1⊕h2⊕...⊕hm (the bitwise XOR of the previous m hops) to the generator set Sm. The resulting LH network Cay(Z_2^d, Sm+1) has bisection b1=b+1.

[00216] The only case which requires additional computation, beyond merely XOR-ing the hop list, is the case in which the resulting hop h_{m+1} happens to come out as 0 (which is an invalid hop value, a self-link of node 0 to itself). In such a case it is always possible to perform a single hop substitution in the original list Sm which produces a new list with the same b value but a non-zero value for the list XOR result h_{m+1}.
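A minimal sketch of this S-3 step (added for illustration, not from the original text; the hop-substitution fallback for the h_{m+1}=0 case is left to the caller):

/* Append h_{m+1} = h1 ^ h2 ^ ... ^ hm to the hop list; returns the new hop
   count, or m unchanged if the XOR comes out 0 (an invalid self-link). */
int AugmentOddB(int *hops, int m)
{
    int x = 0;
    for (int s = 0; s < m; ++s) x ^= hops[s];
    if (x == 0) return m;       // caller must first substitute one hop
    hops[m] = x;                // bisection rises from b (odd) to b+1
    return m + 1;
}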
LH construction for a target network
[00217] In practice, one would often need to construct a network satisfying requirements expressed in terms of some target number of external ports P with oversubscription φ, obtained using switches of radix R. The resulting construction computes the number n of radix-R switches needed, as well as the list for detailed wiring between the switches. For concreteness, each radix-R switch is assumed to have R ports labeled port #1, #2, ... #R. Each switch is connected to m other switches using ports #1, #2, ... #m (these are topological ports or links), leaving E ≡ R−m ports, #m+1, #m+2, ... #R, as "external ports" per switch, available to the network users for servers, routers, storage, etc. Hence, the requirement of having a total of P external ports is expressed in terms of E and the number of switches n as:
E = P/n (4.70)
[00218] The oversubscription eq. (3.1) is then expressed via the definition of the bisection b in eq. (4.42) as:

φ = E/b (4.71)
[00219] The illustrative construction below uses non-oversubscribed networks, φ=1, simplifying eq. (4.71) to:

E = b = R − m (4.72)

i.e. for non-oversubscribed networks, the number of external ports/switch E must be equal to the relative bisection b (the bisection in units of n/2), or equivalently, the number of links/switch is m = R − b.
[00220] In order to find the appropriate n=2^d and m parameters, the LH solutions database, obtained by translating optimum EC code tables [17] and [22] via recipe (4.45), groups solutions by network dimension d into record sets Dd, where d=3,4,... 24. These dimensions cover the range of network sizes n=2^d that are of practical interest, from n = 2^3 = 8 to n = 2^24 = 16 million switches. Each record set Dd contains solution records for m = d, d+1, ... mmax links/switch, where the present database has mmax=256 links/switch. Each solution record contains, among others, the value m, the bisection b and the hop list h1, h2, ... hm.
[00221] For given P, R and φ, the LH constructor scans the record sets Dd for d=3,4,... and within each set inspects the records for m=d, d+1, ..., computing for each (d,m) record the values E(d,m)=R−m ports/switch, total ports P(d,m) = n·E(d,m) = 2^d·(R−m) and oversubscription φ(d,m)=E(d,m)/b (the value b is in each (d,m) record). The relative errors δP = |P(d,m)−P|/P and δφ = |φ(d,m)−φ|/φ are computed and the best match (the record (d,m) with the lowest combined error) is selected as the solution to use. If the requirement is "at least P ports" then the constraint P(d,m)−P ≥ 0 is imposed on the admissible comparisons. The requirements can also prioritize δP and δφ via weights for each (e.g. 0.7·δP + 0.3·δφ for the total error). After finding the best matching (d,m) record, the hop list h1, h2, ... hm is retrieved from the record and the set of links L(v) is computed for each node v = 0, 1, ... n−1 as: L(v) = {v⊕hs for s=1..m}. Given the n such sets of links, L(0), L(1), ..., L(n−1), the complete wiring for the network is specified. The examples below illustrate the described construction procedure.
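A condensed sketch of this constructor scan (the LHRecord layout is an illustrative assumption, since the patent does not give the database schema; the 0.7/0.3 weights are the example weights from the text):

#include <math.h>

typedef struct { int d, m, b; int hops[256]; } LHRecord;   // assumed layout

double MatchError(const LHRecord *r, int R, double P, double phi)
{
    double Pr   = ldexp((double)(R - r->m), r->d);   // P(d,m) = 2^d * (R-m)
    double phir = (double)(R - r->m) / r->b;         // phi(d,m) = E/b
    return 0.7 * fabs(Pr - P) / P + 0.3 * fabs(phir - phi) / phi;
}

void NodeLinks(const LHRecord *r, int v, int *links) // L(v) = { v XOR hs }
{
    for (int s = 0; s < r->m; ++s)
        links[s] = v ^ r->hops[s];                   // one target per port
}

The record minimizing MatchError() over all (d,m) is selected; NodeLinks() then yields the wiring rows of the connection tables (Tables 4.15, 4.16) below.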
Example 1. Small network with P=96 ports at φ=1, using switches with radix R=12
[00222] The LH database search finds the exact match (δP=0, δφ=0) for the record d=5, m=9, hence requiring n=2^d=2^5=32 switches of radix R=12. The bisection is b=3 and the hop list (in hex) for the record is: S9 = {1, 2, 4, 8, 10, E, F, 14, 19}hex. The number of external ports per switch, E=b=3, combined with m=9 topological ports/switch, results in a radix of R=3+9=12 total ports/switch as specified. The total number of external ports is P = E·n = 3·32 = 96 as required. The diameter (max hops) for the network is D=3 hops, and the average hops (latency) is Avg=1.6875 hops. Table 4.15 shows the complete connection map for the network of 32 switches, stacked in a 32-row rack one below the other, labeled in the leftmost column "Sw" as 0, 1, ... 1F (in hex). Switch 5 is outlined, with connections shown for its ports #1, #2, ... #9 to switches (in hex) 04, 07, 01, 0D, 15, 0B, 0A, 11 and 1C. These 9 numbers are computed by XOR-ing 5 with the 9 generators (row 0): 01, 02, 04, 08, 10, 0E, 0F, 14, 19. The free ports are #10, #11 and #12.
Sw/Pt: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
  0:   01 02 04 08 10 0E 0F 14 19  **  **  **
  1:   00 03 05 09 11 0F 0E 15 18  **  **  **
  2:   03 00 06 0A 12 0C 0D 16 1B  **  **  **
  3:   02 01 07 0B 13 0D 0C 17 1A  **  **  **
  4:   05 06 00 0C 14 0A 0B 10 1D  **  **  **
  5:   04 07 01 0D 15 0B 0A 11 1C  **  **  **
  6:   07 04 02 0E 16 08 09 12 1F  **  **  **
  7:   06 05 03 0F 17 09 08 13 1E  **  **  **
  8:   09 0A 0C 00 18 06 07 1C 11  **  **  **
  9:   08 0B 0D 01 19 07 06 1D 10  **  **  **
  A:   0B 08 0E 02 1A 04 05 1E 13  **  **  **
  B:   0A 09 0F 03 1B 05 04 1F 12  **  **  **
  C:   0D 0E 08 04 1C 02 03 18 15  **  **  **
  D:   0C 0F 09 05 1D 03 02 19 14  **  **  **
  E:   0F 0C 0A 06 1E 00 01 1A 17  **  **  **
  F:   0E 0D 0B 07 1F 01 00 1B 16  **  **  **
 10:   11 12 14 18 00 1E 1F 04 09  **  **  **
 11:   10 13 15 19 01 1F 1E 05 08  **  **  **
 12:   13 10 16 1A 02 1C 1D 06 0B  **  **  **
 13:   12 11 17 1B 03 1D 1C 07 0A  **  **  **
 14:   15 16 10 1C 04 1A 1B 00 0D  **  **  **
 15:   14 17 11 1D 05 1B 1A 01 0C  **  **  **
 16:   17 14 12 1E 06 18 19 02 0F  **  **  **
 17:   16 15 13 1F 07 19 18 03 0E  **  **  **
 18:   19 1A 1C 10 08 16 17 0C 01  **  **  **
 19:   18 1B 1D 11 09 17 16 0D 00  **  **  **
 1A:   1B 18 1E 12 0A 14 15 0E 03  **  **  **
 1B:   1A 19 1F 13 0B 15 14 0F 02  **  **  **
 1C:   1D 1E 18 14 0C 12 13 08 05  **  **  **
 1D:   1C 1F 19 15 0D 13 12 09 04  **  **  **
 1E:   1F 1C 1A 16 0E 10 11 0A 07  **  **  **
 1F:   1E 1D 1B 17 0F 11 10 0B 06  **  **  **
Table 4.15
[00223] To illustrate the interpretation of the links via these numbers, the outlined switch "5:" indicates on its port #2 a connection to switch 7 (the number 07 in row 5:). In row 7:, labeled switch "7:", there is a number 05 at its port #2 (column #2), which refers back to this same connection between switch 5 and switch 7, via port #2 on each switch. The same pattern can be observed between any pair of connected switches and ports.
Example 2. Small network with P=1536 (1.5K) ports at φ=1, using switches with radix R=24

[00224] The LH solutions database search finds an exact match for d=8, n=256 switches of radix R=24 and m=18 topological ports/switch. The diameter (max hops) of the network is D=3 hops, and the average latency is Avg=2.2851562 hops. The bisection is b=6, thus providing E=6 free ports per switch at φ=1. The total number of ports provided is E·n = 6·256 = 1536 as required. The set of 18 generators is: S18 = {01, 02, 04, 08, 10, 20, 40, 80, 1A, 2D, 47, 78, 7E, 8E, 9D, B2, D1, FB}hex. Note that the first 8 links are regular 8-cube links (powers of 2), while the remaining 10 are LH augmentation links. These generators specify the target switches (as indices 00..FF hex) connected to switch 00 via ports #1, #2, ... #18 (switches on both ends of a link use the same port number for their mutual connection). To compute the 18 links (to 18 target switches) for some other switch x ≠ 00, one simply XORs x with the 18 generators. Table 4.16 shows the connection table for only the first 16 switches of the resulting network, illustrating this computation of the links. For example, switch 1 (row '1:') has on its port #4 the target switch 09, which is computed as 1⊕8=9, where 8 is the generator in row '0:' for port #4. Checking then switch 9 (row '9:'), on its port #4 is switch 01 (since 9⊕8=1), i.e. switches 1 and 9 are connected via port #4 on each. The table also shows that each switch has 6 ports, #19, #20, ... #24, free.
Sw/Pt: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24
  0:   01 02 04 08 10 20 40 80 1A  2D  47  78  7E  8E  9D  B2  D1  FB  **  **  **  **  **  **
  1:   00 03 05 09 11 21 41 81 1B  2C  46  79  7F  8F  9C  B3  D0  FA  **  **  **  **  **  **
  2:   03 00 06 0A 12 22 42 82 18  2F  45  7A  7C  8C  9F  B0  D3  F9  **  **  **  **  **  **
  3:   02 01 07 0B 13 23 43 83 19  2E  44  7B  7D  8D  9E  B1  D2  F8  **  **  **  **  **  **
  4:   05 06 00 0C 14 24 44 84 1E  29  43  7C  7A  8A  99  B6  D5  FF  **  **  **  **  **  **
  5:   04 07 01 0D 15 25 45 85 1F  28  42  7D  7B  8B  98  B7  D4  FE  **  **  **  **  **  **
  6:   07 04 02 0E 16 26 46 86 1C  2B  41  7E  78  88  9B  B4  D7  FD  **  **  **  **  **  **
  7:   06 05 03 0F 17 27 47 87 1D  2A  40  7F  79  89  9A  B5  D6  FC  **  **  **  **  **  **
  8:   09 0A 0C 00 18 28 48 88 12  25  4F  70  76  86  95  BA  D9  F3  **  **  **  **  **  **
  9:   08 0B 0D 01 19 29 49 89 13  24  4E  71  77  87  94  BB  D8  F2  **  **  **  **  **  **
  A:   0B 08 0E 02 1A 2A 4A 8A 10  27  4D  72  74  84  97  B8  DB  F1  **  **  **  **  **  **
  B:   0A 09 0F 03 1B 2B 4B 8B 11  26  4C  73  75  85  96  B9  DA  F0  **  **  **  **  **  **
  C:   0D 0E 08 04 1C 2C 4C 8C 16  21  4B  74  72  82  91  BE  DD  F7  **  **  **  **  **  **
  D:   0C 0F 09 05 1D 2D 4D 8D 17  20  4A  75  73  83  90  BF  DC  F6  **  **  **  **  **  **
  E:   0F 0C 0A 06 1E 2E 4E 8E 14  23  49  76  70  80  93  BC  DF  F5  **  **  **  **  **  **
  F:   0E 0D 0B 07 1F 2F 4F 8F 15  22  48  77  71  81  92  BD  DE  F4  **  **  **  **  **  **
 10:   ...
Table 4.16
Example 3. Large network with P=655,360 (640K) ports at φ=1, using switches with radix R=48.
[00225] The database lookup finds the exact match using d=16, n=2^16 = 65,536 = 64K switches of radix R=48. Each switch uses m=38 ports for connections with other switches, leaving E=48−38=10 ports/switch free, yielding a total of P = E·n = 10·64K = 640K available ports as required. The bisection is b=10, resulting in φ=E/b=1. The list of m=38 generators S38 = {h1, h2, ... h38} is shown in Table 4.17 in hex and binary. The 38 links for some switch x (where x: 0..FFFF) are computed as S38(x) ≡ {x⊕h1, x⊕h2, ... x⊕h38}. The diameter (max hops) of the network is D=5 hops, and the average latency is Avg=4.061691 hops.
 1.    1 ...............1
 2.    2 ..............1.
 3.    4 .............1..
 4.    8 ............1...
 5.   10 ...........1....
 6.   20 ..........1.....
 7.   40 .........1......
 8.   80 ........1.......
 9.  100 .......1........
10.  200 ......1.........
11.  400 .....1..........
12.  800 ....1...........
13. 1000 ...1............
14. 2000 ..1.............
15. 4000 .1..............
16. 8000 1...............
17.  6F2 .....11.1111..1.
18. 1BD6 ...11.1111.1.11.
19. 1F3D ...11111..1111.1
20. 3D72 ..1111.1.111..1.
21. 6B64 .11.1.11.11..1..
22. 775C .111.111.1.111..
23. 893A 1...1..1..111.1.
24. 8B81 1...1.111......1
25. 9914 1..11..1...1.1..
26. A4C2 1.1..1..11....1.
27. A750 1.1..111.1.1....
28. B70E 1.11.111....111.
29. BFF1 1.1111111111...1
30. C57D 11...1.1.11111.1
31. D6A6 11.1.11.1.1..11.
32. D1CA 11.1...111..1.1.
33. E6B5 111..11.1.11.1.1
34. EAB9 111.1.1.1.111..1
35. F2E8 1111..1.111.1...
36. F313 1111..11...1..11
37. F9BF 11111..11.111111
38. FC31 111111....11...1
Table 4.17
LH performance comparisons
[00226] The LH solutions database was used to compare LH networks against several leading alternatives from industry and research across a broad spectrum of parameters. The resulting spreadsheet charts are shown in Figures 11 - 15. The metrics used for evaluation were the Ports/Switch yield (ratio P/n, higher is better) and the cables consumption as Cables/Port (ratio: # of topological cables/P, lower is better). In order to maximize the fairness of the comparisons, the alternative networks were set up to generate some number of ports P using switches of radix R, where these are the optimal parameter values for the given alternative network (each network type has its own "natural" parameter values at which it produces the most efficient networks). Only then was the LH network constructed to match the given number of external ports P using switches of radix R (as a rule, these are not the optimal or "natural" parameters for LH networks).
[00227] In Figs. 11 - 15, the Ports/Switch chart for each alternative network shows the Ports/Switch yields for the LH network and the alternative network, along with the ratio LH/alternative with numbers on the right axis (e.g. a ratio 3 means that LH yields 3 times more Ports/Switch than the alternative). The second chart for each alternative network shows the Cables/Port consumption for the LH and the alternative, along with the ratio alternative/LH on the right axis (e.g. a ratio 3 means that LH consumes 3 times fewer cables per port produced than the alternative). All networks are non-oversubscribed, i.e. φ=1.
[00228] For example, the Ports/Switch chart in Fig. 11 shows the yield for the hypercube (HC), for network sizes from n=2^8 to 2^24 switches of radix R=64. The Ports/Switch for the LH network yielding the same total number of ports P is shown, along with the ratio LH/HC, which shows (on the right axis scale) that LH produces a 2.6 to 5.8 times greater Ports/Switch yield than the hypercube; hence it uses 2.6-5.8 times fewer switches than HC to produce the same number of ports P as HC at the same throughput. The second chart in Fig. 11 similarly shows the Cables/Port consumption for HC and LH, and the ratio HC/LH of the two (right axis scale), showing that LH consumes 3.5 to 7 times fewer cables to produce the same number of ports P as HC at the same throughput. The remaining charts in Figs. 12 - 15 show the same type of comparisons for the other four alternatives.

Performance Measurement
[00229] It is desirable to maximize λ since λ quantifies the external port yield of each switch. Namely, if each switch's port count (radix) is R, then R=E+T (where E is the number of external ports and T the number of topological ports) and the E-port yield per IPA port is: Yield ≡ E/R = λ/(λ+1), i.e. increasing λ increases the Yield. But increasing λ for a given N also lowers the bisection for that N; hence in practical applications, data center administrators need to select a balance of Yield vs. bisection and N suitable for the usage patterns in the data center. The centralized control and management software provides modeling tools for such evaluations.
[00230] Denoting the number of external ports and topology ports per switch as E and T, the radix (number of ports) R of a switch is R=E+T. The topology ports in turn consist of the d ports needed to connect a d-dimensional hypercube HCd and of h long hop ports used for trunking, so T=d+h. If the number of switches is N, then N=2^d or d=log(N), where log(x) is the logarithm base 2 of x, i.e. log(x) = ln(x)/ln(2) ≈ 1.443·ln(x). In order to formally relate the invention's long hops to the terminology used with conventional trunking (where each of the d HCd cables is replaced with q cables, a trunk quantum), define q ≡ T/d, i.e. T=q·d. Hence q and h are related as q = 1+h/d and h = d·(q−1). Using the ratio λ ≡ E/T, E and T are expressed as T=R/(1+λ) and E=λ·R/(1+λ). Restating the bisection formula:

B ≡ B(N) = (N/2)·q·C = (N/2)·(1+h/d)·C (5)
[00231] Where C is a single IPA switch port capacity (2×⟨Port Bit Rate⟩ for duplex ports). Bisection B is the smallest total capacity of links connecting two halves of the network (i.e. it is the minimum over all possible network cuts into halves). Considering two network halves with N/2 switches each and E external ports per switch, there are E·N/2 external ports in each half. If these two sets of external ports were to transmit to each other at full port capacity C, the total capacity needed to support this is E·(N/2)·C. Since bisection limits the worst case capacity between halves to B, the oversubscription φ is defined as the ratio between the capacity needed, E·(N/2)·C, and the capacity available for the job via B:
φ ≡ E·(N/2)·C / B = E/q = λ·d = λ·log(N) (6)

[00232] Eq. (6) shows in what ratio λ=E/T the ports must be divided in order to obtain oversubscription φ using N switches: λ = φ/log(N). The quantity most often of interest is the total number of external ports provided by the network, P = N·E, which in terms of the other quantities typically given as constraints (φ, N and radix R), and recalling that E = λ·R/(1+λ), is then:

P = φ·R·N / (φ + log(N)) (7)
[00233] Although Eq. (7) doesn't yield a closed form expression for N, it does allow computation of the number of IPA switches N needed to get some target number of total network ports P at IB-oversubscription φ, knowing the radix R of the switches being used. Qualitatively, the number of total network ports P increases slightly slower than linearly in N (when φ is kept fixed) due to the denominator D ≡ (φ+log(N)), which also increases with N. Its effect diminishes as N increases (or if φ is large or grows with N), since doubling of N increments D by +1 (which is only ~5% for N=64K and φ=4). Within the log(log(P)) error margin, the N above grows as N ~ P·log(P), which is an unavoidable mathematical limit on the performance of larger switches combined from N smaller switches at fixed φ.
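Since P in Eq. (7) is increasing in N at fixed φ and R, the implied N can be found by a simple upward scan over d=log(N); a minimal sketch (added for illustration, not from the original text):

#include <stdio.h>

int SwitchesNeeded(double P_target, double R, double phi)
{
    for (int d = 1; d <= 30; ++d)             // N = 2^d candidate sizes
    {
        double N = (double)(1 << d);
        double P = phi * R * N / (phi + d);   // Eq. (7), with log2(N) = d
        if (P >= P_target) return 1 << d;     // smallest adequate N
    }
    return -1;                                // outside the modeled range
}

int main(void)
{
    // e.g. 1M total ports from radix-48 switches at phi=4
    printf("N = %d switches\n", SwitchesNeeded(1e6, 48, 4));
    return 0;
}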
[00234] Figure 16 (computed for the commercially available Pronto 3780 switch) shows the resulting network capacity based on a simple un-optimized configuration (for the lowest commonly used fixed IB-oversubscription φ=4; other values of interest, φ=1, 2, 5, 10, 15 and 20, are shown later). The slight log(N) nonlinearity when using fixed φ can be seen in the price per port: while N increased by a factor of 128K, the price per 10G port increased only 3.4 times (i.e. the cost per 10G port grew over 38,000 times slower than the network size and capacity, which is why the slight non-linearity can be ignored in practice). If instead of using a fixed φ a fixed λ (E/T ratio) is used, then via φ ≡ φ(N,λ) = λ·log(N), the port Eq. (7) becomes linear in N:

P = λ·R·N / (1+λ)

i.e. we get a fixed cost and power per port as N grows. In this case the tradeoff is that it is φ which now grows as λ·log(N) as N grows. Recalling that typical aggregate oversubscriptions on core switches and routers are ~200+ in the current data centers, log(N) is quite moderate in comparison. The network bandwidth properties for λ=1 are shown in Figure 17, where the cost per 10G port remains fixed at $500 (or $104 per 1G port) and power at 14.6W. Results for some values of λ ≠ 1 are shown later.
Elimination of CAM Tables
[00235] By using mathematically convenient topologies, such as an enhanced hypercube connection pattern or its hierarchical variants, the switch forwarding port can be computed on the fly via simple hardware performing a few bitwise logical operations on the destination address field, without any expensive and slow forwarding Content Addressable Memory (CAM) tables being required. Hence, for customized switches, price and power use advantages can be gained by removing CAM hardware entirely.
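For illustration, a sketch of one possible on-the-fly rule (an assumption: plain dimension-order hypercube routing; the patent's exact Jump Vector circuit is not reproduced here):

int ForwardPort(unsigned cur, unsigned dst)   // switch addresses = node IDs
{
    unsigned jump = cur ^ dst;    // dimensions still to cross ("Jump Vector")
    if (jump == 0) return -1;     // arrived: deliver on a local/external port
    return __builtin_ctz(jump);   // port # = lowest differing dimension
}

In hardware this is a single XOR followed by a priority encoder, which is the kind of small bitwise logic the paragraph above refers to.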
Exception and Fault Handling Using CAM
[00236] Although the most favorable embodiments of the invention can eliminate CAMs completely, a much smaller (by at least 3 orders of magnitude) CAM can still be useful to maintain forwarding exceptions arising from faults or congestion. Since the enhanced hypercubic topology allows for forwarding via simple, small logic circuits (in the ideal, exception free case), the only complication arises when some port P is faulty, due to a fault at the port or failure/congestion at the nearest neighbor switch connected to it. Since the number of such exceptions is limited by the radix R of the switch, the necessary exception table needs space for at most R small entries (typical R=24..128, entry size 5-7 bits). A match of a computed output port with an entry in the reduced CAM overrides the routine forwarding decision based on the Jump Vector computed by the logic circuit. Such a tiny table can be implemented in the substantially reduced residual CAMs, or even within the address decoding logic used in the forwarding port computation. This exception table can also be used to override the routine forwarding decisions for local and global traffic management and load balancing.
[00237] In order to increase the pipe capacity along overloaded paths, while under the tree topology constraints, the conventional data center solution is trunking(or link aggregation in the IEEE 802.1 AX standard, or Cisco's commercial EtherChannel product), which amounts to cloning the link between two switches, resulting in multiple parallel links between the two switches using additional pairs of ports. The invention shows a better version of trunking for increasing the bisection with a fixed number of switches.
[00238] With the invention, this problem arises when the number of switches in a network is fixed for some reason, so bisection cannot be increased by increasing N. Generally, this restriction arises when the building block switches are a smaller number of high radix switches (such as the Arista 7500) rather than the larger number of low radix switches that allow the desirable high bisection bandwidth as provided by the invention. Data centers making use of the invention can use conventional trunking by building hypercubes using multiple parallel cables per hypercube dimension. While that will increase the bisection as it does for regular tree based data center networks, there are better approaches that can be used.
The procedure is basically the opposite of the approach used for traditional trunking. When adding a link from some switch A, instead of picking the target switch B from those closest to A, B is picked such that it is the farthest switch from A. Since the invention's topologies maintain uniform bisection across the network, any target switch will be equally good from the bisection perspective, which is not true for conventional trees or fat trees. By taking advantage of this uniformity, picking the farthest switch B also maximally reduces the longest and the average hop counts across the network. For example, with a hypercube topology, the farthest switch from any switch A is the switch B which is on the long diagonal from A. Adding that one link to A cuts its longest path in half and reduces the average path by at least 1 hop. When the long hops are added uniformly to all switches (hence N/2 wires are added per new long hop), the resulting topology is called an enhanced hypercube. Figure 18 shows the reductions in the maximum and average hops due to adding from 1 to 20 long hops. In Figure 18, LH shows the hex bitmasks of the long hops, i.e. the indices of the farthest switches chosen.

[00239] The table was obtained by a simple 'brute force' counting and updating of distance tables as the new long hops were added. At each stage, the farthest node from the origin is used as a new link (a variety of tiebreaking rules were explored to provide a pick when multiple 'farthest' nodes are equally far, which is a common occurrence). After each link is added, the distance table is updated. For Dim=4, N=16, adding long hops beyond 11 doesn't have an effect, since the small network becomes fully meshed (when the total number of links is N−1), hence all distances become 1 hop.
Optimizing Wiring Using Port Dimension Mapping
[00240] In some embodiments of the invention, with systems implemented via a set of switches in a data center (e.g. available as line cards in a rack), wiring such dense networks can easily become very complex, error prone and inefficient. With (d!)^N topologically equally correct mappings between ports and dimensions for a d-dimensional hypercube using N=2^d switches, d ports per switch, there are lots of ways to create an unmanageable, error prone, wasteful tangle. The invention optimizes the mapping between the ports and HC/FB dimensions using the following rules:
(i) The same dimensions are mapped to the same ports on all switches.

(ii) Consecutive dimensions (0, 1, ... d−1) are mapped onto consecutive ports (a, a+1, ... a+d−1).

The resulting wiring pattern shown in Figure 19 has the following advantages over a general topologically correct mapping:

a) All cables belonging to the same dimension have the same length.

b) All cables have the same port number on both ends (cables run strictly vertically).

c) All cables in the same vertical column (dimension) have the same length.
[00241] Provided the cables and corresponding port connectors in the same column are color coded using matching colors (properties (b) and (c) make such coding possible), and the cables are of the minimum length necessary in each vertical column, this port-dimension mapping makes the wiring of a rack of switches easy to learn, easy to connect and virtually error proof (any errors can be spotted at a glance). The total length of cables is also the minimum possible (requiring no slack) and it has the fewest number of distinct cable lengths allowed by the topology. In addition to economizing the quantity and complexity of the wiring, the shortening and uniformity of cables reduces the power needed to drive the signals between the ports, a factor identified as having commercial relevance in industry research.
Details of Connecting 64=2^6 switches → 6-D hypercube
[00242] In Figure 19, the column headers show 6 color coded port numbers: 0=red, 1=blue, 2=orange, 3=purple, 4=green and 5=cyan. The 64 switches are line cards mounted in a rack one below the other, and they are depicted as 64 separate rows 0, 1, 2, ... 63. The 6 ports/switch used for wiring these switches into a 6-D hypercube line up into 6 columns (the wire colors match the port colors in each column).
[00243] The 6 numbers inside a row #k show the 6 switches connected to the 6 ports of switch #k. E.g. row #7 shows that switch #7 is connected to switches #6, 5, 3, 15, 23, 39 on its ports 0, 1, 2,... 5. Picking now, say, port (column) #4 for switch (row) #7: it connects on port 4 to switch #23. Looking down at switch (row) #23, its port (column) #4 connects back to switch #7, i.e. switch 7 and switch 23 are connected via each other's port #4. This simple rule - two switches always connect to each other on the same port # - holds generally for hypercubes, and it leads to the proposed port and cable color coding scheme. E.g. green:4 cables connect green ports #4 on some pair of switches, red:0 cables connect red ports #0 on some other pair of switches, blue:1 cables connect blue ports #1, etc.
[00244] The wiring pattern is just as simple. All wires of the same color have the same length L=2^(port #), e.g. an orange:2 wire (always connecting ports #2, orange:2 ports) has length 2^2=4, green:4 2^4=16, red:0 2^0=1, etc. Hence switch pairs connected to each other on their port #2 are 4 rows apart, e.g. switch (row) 0 connects on its port #2 to switch 4 on its port #2 and they use an orange:2 wire (the color of port #2). This connection is shown as the top orange:2 arc connecting numbers 4 and 0. The next orange:2 (port #2) wire starts at the next unconnected row, which is row #1 (switch #1), and connects to row 1+4=5 (switch #5), and so on until the first row already connected on port #2 is reached, which is row #4 (Step 1-4). At that point the 8 top rows are connected on port #2. Then proceed down to the next row with a free port #2, which is row 8. That port #2 is now connected with the port #2 down 4 rows, i.e. with row 8+4=12, which is shown with the orange:2 wire linking numbers 12 and 8. Now the next two rows (orange:2 arc connecting numbers 13 and 9), etc., until column (port) #2 is connected on all switches. Then follows purple:3 port #3, using purple:3 wires 2^3=8 slots long, and the same procedure repeats, except with longer wires... etc.
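The port-to-row pairing of [00243]-[00244] can be generated mechanically. A minimal sketch (the function name and output format are mine) that emits the full by-the-numbers wiring list for the Figure 19 pattern:

```python
d = 6                                    # 6-D hypercube: 64 switches

def wiring_list(d):
    cables = []
    for port in range(d):                # one color / column per dimension
        span = 1 << port                 # cable length in rack slots: 2^port
        for row in range(1 << d):
            if not row & span:           # bit 'port' clear: upper end of cable
                cables.append((port, row, row ^ span))
    return cables

for port, top, bottom in wiring_list(d)[:5]:
    print(f"port #{port}: row {top} <-> row {bottom} (length {1 << port})")
```

Running it confirms, for example, that port #2 pairs row 0 with row 4 using a length-4 (orange:2) cable, exactly as described above.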
Containers for Prewired Internal Topology
[00245] The above wiring of a 64-switch hypercube H(64) is not difficult, since errors are unlikely: starting at the top row and going down, any new wire can go into just one port of the matching color. Still, the pattern above suggests a simple way to design easily connectable, internally prewired containers, which eliminate much of the tedium and expense of this kind of dense manual wiring.
[00246] Consider the above H(64) as being composed of two prewired H(32) boxes A and B (separated by the dotted horizontal line at 32/32). The first 5 dimensions, ports 0,1,...4, of each H(32) are already fully wired, and the only missing connections are the 32 wires connecting ports #5 on the 32 switches of one container to those of the other, in a perfectly orderly manner (row 0 of container A to row 0 of container B, row 1 from A to row 1 from B,... etc). Hence, instead of wiring 32x6=192 wires for H(64), only the 32 wires between the two prewired containers now need connecting, in a simple 1,2,3... order. The job is made even easier with a bundled, thick cable carrying these 32 lines and a larger connector on each box, thus requiring only one cable to be connected.
[00247] Looking further at the wiring relation between port #4 and port #5, it is obvious that these thick cables (each carrying e.g. 64 or 128 Cat 5 cables) follow the exact pattern of ports #1 and #2, except with cable bundles and big connectors instead of single Cat 5 cables and individual ports. Hence, if one had a row of internally prewired (e.g. via ASIC) 128-switch containers (e.g. one rack 64RU tall, 2 line cards per slot), each container having 8 color coded big connectors lined up vertically on its back panel, matching color thick cables may be used that repeat the exact wiring pattern above between these 2^8=256 containers (except it runs horizontally) to create a network with 2^(7+8) = 32K IPA switches (for only $393 million), providing 786,432 x 10G ports (1 port per 10G virtual server with 32 virtual machines (VMs), totaling 25,165,824 VMs; i.e. switching cost < $16 per VM). For large setups a single frame may be used, where any newly added container can just be snapped into the frame (without any cables), the frame having built-in connectors (with all the inter-container thick cabling prewired inside the frame base).

[00248] The ultimate streamlining of the wiring (and of a lot more) is achieved by using "merchant silicon", where all such dense wiring, along with the connectors and their supporting hardware on the switch, is replaced with ASICs tying together the bare switching fabric chips. This approach not only eliminates the wiring problem, but also massively reduces the hardware costs and power consumption.
[00249] For ASIC wiring of the Figure 19 pattern, in order to reduce the number of circuit layers the connection order must be reversed, changing all wire intersections into nestings and allowing for single layer wiring. The resulting hypercube is just another one among the alternate labelings.
Non-Power-of-2 Networks
[00250] The above manual wiring scheme can also be used to build a network with a number of switches N which is not a power of 2 (and thus cannot form a conventional hypercube). Consider the case of a network that has 32 switches (d=5, using ports #0..#4, rows 0..31), to which two more switches, (rows) #32 and #33, are to be added. This starts the 6th dimension (port #5, long cyan wires), but with only two of the 32 cyan lines connected on port #5 (the two connecting port #5 in rows 0↔32 and 1↔33 for the 2 new switches #32 and #33). The first 5 ports #0-#4 of the two new switches have no switches to go to, since those rows haven't been filled in yet (they will come later in rows 34-63).
[00251] The problem with such partial wiring is that it severely restricts forwarding to and from the new switches (just 1 link instead of 6 links), along with reduced bandwidth and fragility (due to single points of failure). This problem can be eliminated by using port (column) #4 of the first new switch (row) #32. The port #32:4 normally connects (via a green wire going down to row #48) to switch:port #48:4, but switch #48 isn't there yet. Switch #48 also connects on port #5 (via the dotted cyan wire) back to the existing switch #16:5. Thus, there are two broken links #32:4↔#48:4 and #48:5↔#16:5, with the missing switch #48 in the middle. Therefore, the two ends at existing switches can be connected directly to each other, i.e. #32:4↔#16:5, as shown by the top dotted green wire (which happens to be just the right length, too). Later, when switch #48 is finally added, the shortcut (green dotted wire going up) moves down to #48:4, while #16:5, which becomes free as well (after moving the green wire down), now connects to #48:5 (dotted cyan wire). The same maneuver applies to switch #33, as shown with the second green dotted wire. Analogous shortcuts follow for the lower ports of #32 and #33, e.g. the broken pairs #32:3↔#40:3 and #40:5↔#8:5 are short-circuited via #32:3↔#8:5, etc., resulting in full (with natural forwarding) 6-D connectivity for the new switches and their neighbors. The general technique is to first construct the correct links for the target topology (e.g. hypercube), including the non-existent nodes. Then one extends all shortest paths containing the non-existent nodes until they reach existent nodes on both ends. The existent nodes terminating such "virtual" shortest paths (made of non-existent nodes on the inner links) are connected directly, using the available ports (reserved on existent nodes for connections with as yet non-existent ones).
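A hedged sketch of this shortcut rule, covering only the length-2 virtual paths used in the example of [00251] (the helper names are mine; longer virtual paths would need a search through several missing nodes):

```python
def shortcut_links(d, present):
    """Shortcuts for a partial hypercube: switches 0..present-1 exist out of
    a full 2^d design. Virtual 2-hop paths through a missing node are
    contracted: their two existing endpoints get cabled to each other."""
    exists = lambda v: v < present
    shortcuts = []
    for start in range(present):
        for mu in range(d):
            mid = start ^ (1 << mu)
            if exists(mid):
                continue                     # real link, no shortcut needed
            for nu in range(d):              # step off the missing node again
                end = mid ^ (1 << nu)
                if nu != mu and exists(end) and start < end:
                    shortcuts.append((start, mid, end))
    return shortcuts

# 34 switches of a 64-switch (d=6) design, as in the example of [00251]:
for start, missing, end in shortcut_links(6, 34)[:4]:
    print(f"#{start} <-> #{end}  (through missing #{missing})")
```

For this input the sketch recovers, among others, the #16↔#32 shortcut through the missing switch #48 described above.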
Programmable Connector Panel
[00252] Another approach according to embodiments of the invention for interconnecting switches can include building large, software controlled super-connectors ("C-Switches"), where any desired connections between the physical connectors can be made under software control.
[00253] Unlike a standard switch, which forwards packets dynamically based on the destination address in the packet frame header, a C-Switch forwards packets statically, where the settings for the network of crossbar connections within the C-Switch can be provided by an external program at initialization time. Without any need for high speed dynamic forwarding and buffering of data packets, the amount of hardware and power used by a C-Switch is several orders of magnitude smaller than that of a standard switch with the same number of ports.
[00254] The individual connectors (or per-switch bundles of, for example, 48 individual circuit cables brought in via trunked thick cables and plugged into a single large connector) plug into the C-Switch's panel (which can cover 3-5 sides of the C-Switch container), which can include a matrix containing hundreds or thousands of receptacles. Beyond the simple external physical connection, everything else can be done via software controls. Any desired topology can be selected by an operator using software to choose from a library of topologies, topology modules or topology elements.
[00255] To facilitate physical placement and heat management, C-Switches can be modular, meaning that a single C-Switch module can combine several hundred to several thousand connectors, and the modules can be connected via single or few cables (or fiber links), depending on the internal switching mechanism used by the C- Switch. In such a modular implementation, the inter-module cabling can be done via the cabling built into the frame where the connections can be established indirectly, by snapping a new module into the frame.
[00256] There is a great variety of possible ways to implement the core functionality of a C-Switch, ranging from telephony style crossbar switches, to arrays of stripped down, primitive hub or bridge elements, to nanotech optical switches and ASIC/FPGA techniques. Since the internal distances within a C-Switch are several orders of magnitude smaller than standard Ethernet connections, it is useful (for heat & power reduction) that the incoming signal power be downscaled by a similar factor before entering the crossbar logic (the signals can be amplified back to the required levels on output from the crossbar logic). In other embodiments, for example using MEMS based devices, power reduction may not be necessary where optical signals are switched via piezo-electrically controlled nano-mirrors or other purely optical/photonic techniques such as DLP normally used for projection screens, where such down/up-scaling is implicit in the transceivers.
[00257] The internal topology of the C-Switch can be multi-staged, since the complexity of a single, flat crossbar grows as O(X^2) for X external ports. For example, a rearrangeably non-blocking hypercubic topology requires a hypercube of dimension d connecting N=2^d smaller crossbars, where d is twice the number of external ports p per smaller crossbar, i.e. d=2p. Hence each small crossbar of radix 3p has a circuit complexity (number of cross points) of O(9p^2). The number of external ports X=Np=2^(2p)·p determines the value p needed for a given X in implicit form, where approximately p ≈ ½·lg(X) + O(log(log(X))). Hence, the number of small crossbars is N=2^d ≈ X/log(X). With a small crossbar radix of 72, the C-Switch hardware scales to 2^24 ≈ 16 million ports.
[00258] This kind of software controlled multi-connector has a much wider applicability than data centers, or even than Ethernet LANs, since cabling and connectors are a major problem in many other settings and at much smaller scales of connectivity.

Use of C-Switches for Layer 2 Network Optimization
[00259] The traffic patterns in a data center are generally not uniform all-to-all traffic. Instead, smaller clusters of servers and storage elements often work together on a common task (e.g. servers and storage belonging to the same client in a server farm). The integrated control plane of the current invention allows traffic to be monitored, these types of traffic clusters to be identified, and the C-Switch to be reprogrammed so that the nodes within a cluster become topologically closer within the enhanced hypercube of Ethernet switches. By reducing the path lengths of the more frequent traffic patterns or flows using a C-Switch, the load on the switching network is reduced, since fewer switching operations are needed on average from ingress to egress, hence increasing capacity. The C-Switch is used in this new division of labor between the dynamic switching network of the Layer 2 switches and the crossbar network within the C-Switch, which offloads and increases the capacity of the more expensive network (switches) via the less expensive network (crossbars). This is a similar kind of streamlining of the switching network by the C-Switch to that which layer 2 switching networks perform relative to the more expensive router/layer 3 networks. In both cases, a lower level, more primitive and less expensive form of switching takes over some of the work of the more expensive form of switching.
Wiring Improvements
[00260] Although the d-cube wiring is highly regular and can be performed mechanically (a la weaving), the 'long hops' do complicate the simple pattern enough to make it error prone for brute force manual wiring. Since this problem is shared by many other desirable topologies, a general solution is desirable to make networks built according to the invention practical in the commercial world.
Computer assisted manual wiring
[00261] In this method, the switches are numerically labeled in a hierarchical manner tailored to the packaging and placement system used, allowing technicians to quickly locate the physical switch. A wiring program displays the wiring instructions in terms of the visible numbers on the switches (containers, racks, boxes, rooms) and ports. The program seeks to optimize localization/clustering of the wiring steps, so that all that is needed in one location is grouped together and need not be revisited.

C-Box - Prewired crossbar for fixed topologies
[00262] This is a more attainable, lower tech variation of the C-Switch, in the form of a connector box with prewired topologies, such as enhanced hypercubes, within a certain range of sizes. Front panels of the C-Box provide rows of connectors for each switch (with ~10-20 connectors per switch), with numbered rows and columns for simple, by-the-numbers wiring of entire rows of rack switches and hosts.
[00263] A C-Box is as easy to hook up as, and functions exactly like, the C-Switch (e.g. with a built in processor and a unified control plane per box), except that the topology is fixed. As with the C-Switch, multiple C-Boxes can be connected via thick cables to form a larger network.
Automated wiring verification
[00264] This facility is useful for the manual wiring methods described above. Diagnostic software connected to the network can test the topology and connections, and then indicate which cables are not connected properly and what corrective actions need to be taken.
Data Center Application
[00265] Figure 20 shows an embodiment of the invention applied to a complete data center. The particular details of this diagram are illustrative only, and those skilled in the art will see that many other combinations of data center components with various attributes, such as number of ports and port speed, may also be used and connected in various topologies. The cables (vertical arrows) are coded by capacity and named according to their roles: S(erver)-Lines from servers to TORs or transceivers, U(plink)-Lines from edge to network ports, T(opology)-Lines internal to the network (aggregate switching fabric via scalable topology & forwarding), and W(AN)-Lines to routers/L3. The only long lines, thus posing cabling bulk problems, are the U-Lines, but these already exist in a standard data center. The internal switching fabric of the network consists of the fabric from a variable number of common off-the-shelf (COTS) switches with firmware extensions, connected via the Topology Panel (ITP). Depending on the size and complexity of the topology (which depends on the type of data center), the ITP block may merely symbolize a prescribed pattern of direct connections between ports (by-the-numbers wiring), or it can be realized as a prewired connector panel or as a programmable crossbar switch.
[00266] The network spanned by the T-Lines is the network backbone. The encircled "A" above the top-of-rack (TOR) switches represents fabric aggregation for parts of the TOR fabric which reduces the TOR inefficiencies.
[00267] The control and management software, MMC (Management, Monitoring and Control module), CPX (Control Plane Executive) and IDF (Data Factory), can run on one or more servers connected to the network switching fabric.
Virtual Machine Motion
[00268] In a data center using virtual machine instances, the MMC and CPX can cooperate to observe and analyze the traffic patterns between virtual machine instances. Upon discovering a high volume of data communication between two virtual machine instances separated by a large number of physical network hops, the MMC and/or CPX can issue instructions to the virtual machine supervisor that result in one or more virtual machine instances being moved to physical servers separated by a smaller number of network hops, or by network hops that are less used by competing network communication. This function both optimizes the latency between the virtual machines and releases some network links for use by other communicating entities.
Layer 3+ Protocol Performance Improvement
[00269] The most commonly used layer 3 (or higher) reliable communication protocols, such as TCP and HTTP, which have large communication overheads and non-optimal behaviors in data center environments, can be substantially optimized in managed data center networks with a unified control plane such as in the current invention.
[00270] The optimization consists of replacing the conventional multi-step sequence of protocol operations (such as the three way handshake and later ACKs in TCP, or large repetitive request/reply headers in HTTP) which have source and destination addresses within the data center, with streamlined, reliable Layer 2 virtual circuits managed by the central control plane, where such circuits fit naturally into the flow-level traffic control. In addition to reducing communication overhead (number of frames sent, or frame sizes via removal of repetitive, large headers) and short-circuiting the slow error detection and recovery (the problem known as "TCP incast performance collapse"), this approach also allows for a better, direct implementation of the QoS attributes of the connections (e.g. via reservation of the appropriate network capacity for the circuit). The network-wide circuit allocation provides an additional mechanism for global anticipatory traffic management and load balancing that operates temporally ahead of the traffic, in contrast to reactive load balancing. This approach of tightly integrating with the underlying network traffic management is a considerable advance over current methods of improving layer 3+ protocol performance by locally "spoofing" remote responses without visibility into the network behavior between the spoofing appliances at the network end points.
[00271] Further, by operating in the network stacks/hypervisor, the virtualized connections cooperate with the Layer 2 flow control, allowing for congestion/fault triggered buffering to occur at the source of the data (the server memory), where the data is already buffered for transmission, instead of consuming additional and far more expensive and more limited fast frame buffers in the switches. This offloading of the switch frame buffers further improves the effective network capacity, allowing switches to handle much greater fluctuations of the remaining traffic without having to drop frames.
Flexible Radix Switch Control Plane
Control Plane Capabilities
[00272] The FRS Control Plane (FRS-CP) makes use of the advanced routing and traffic management capabilities of the Infinetics Super Switch (ISS) architecture. It can also be used to control conventional switches, although some of the capabilities for Quality of Service control and congestion control may be limited.
FRS-CP provides:
Performance
• Controls the flat fully meshed layer 2 substrate/fabric to maximize effective throughput to near physical limits
• Self-configuring, self-balancing, self-healing dynamic networks
• Device and service level bandwidth optimization and QoS guarantees

Management

• Unified logical management framework for all networked devices
• Hierarchical group-based management to reduce large network complexity
• Autonomic, self-healing traffic flow management
Security
• Single point of authentication for all points of attachment and services at origin
• Group-based networked device isolation throughout physical and virtualized networks
Cost Savings
• Far less network infrastructure required; substantial savings on capital expenditures, power, and payroll
• Subsumes the functionality of other monolithic appliances such as load balancers, NATs, firewalls
Control Plane Architecture
[00273] FRS-CP can include a central control system that connects directly to all the switches in the network, and which may be replicated for redundancy and failover. Each switch can run an identical set of services that discover network topology and forward data packets.
[00274] Switches can be divided into three types based upon their role in the network, as shown in Figure 24:
• Ingress switches
• Fabric switches
• Egress switches
[00275] ARP and broadcast squelching. When a specific machine attempts to locate another machine in a classic network, it sends out a broadcast ARP (a sort of "where are you" message), which is transmitted across the entire network. This message needs to be sent to every machine on every segment of the network, which significantly lowers the throughput capacity of the network. We keep a master list (distributed to every switch) of every host on the network, so that any host can find any other host immediately. Any other broadcast type packets, which would otherwise have been sent across the entire network, are also blocked. (** See CPX Controller / Data Factory.)
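A minimal sketch of this squelching decision (all names and data shapes here are illustrative assumptions, not the actual switch firmware):

```python
# Switch-local view of the distributed master host list: instead of flooding
# "who-has" ARP queries network-wide, the switch answers from the list and
# drops other broadcast frames outright.
master_hosts = {
    # IP address  -> (egress switch ID, host MAC): the "triplet" data
    "10.0.0.7": ("switch-23", "02:00:00:00:17:07"),
    "10.0.0.9": ("switch-41", "02:00:00:00:29:09"),
}

def handle_frame(dst_mac, ethertype, arp_target_ip=None):
    """Return the action for a frame arriving at the ingress switch."""
    BROADCAST = "ff:ff:ff:ff:ff:ff"
    ARP = 0x0806
    if ethertype == ARP and arp_target_ip is not None:
        hit = master_hosts.get(arp_target_ip)
        if hit:
            return f"reply locally: {arp_target_ip} is-at {hit[1]}"
        return "drop (unknown host, nothing to flood)"
    if dst_mac == BROADCAST:
        return "drop (broadcast squelched)"
    return "forward"

print(handle_frame("ff:ff:ff:ff:ff:ff", 0x0806, "10.0.0.7"))
print(handle_frame("ff:ff:ff:ff:ff:ff", 0x0800))
```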
Overview
Data Factory (IDF)
[00276] Fig. 25 shows a system according to one embodiment of the invention.
The Data Factory component can be used to establish the behavior of the IPA Network. The Control Plane Executive (CPX) uses the data stored in the data factory to configure the network and to set up services such as security and quality guarantees. Management consoles access this component to modify system behavior and retrieve real time network status.
Control Plane Executive (CPX)
[00277] The Data Factory communicates with the Control Plane Executive (CPX) through a service interface using a communication mechanism such as Thrift or JSON, as shown in Fig. 26. Any form of encryption can be supported. In accordance with some embodiments of the invention, a public key encryption system can be used.
Universal Boundary Manager (UBM)
[00278] In accordance with some embodiments of the invention, the UBM can provide some or all of the following functions:
• Abstracts the physical network to a unified and hierarchical logical group with rights-based inheritance for security and QoS parameters
• Controls visibility of hosts and services
• Provides a single "Firewall" around perimeter of entire layer 2 network managing routing decisions for quality assurance and security enforcement for network access
• Scriptable policy management based upon time-of-day, congestion and application type
• Data stored in the Data Factory, and read by CPX for distribution to the switches.

[00279] A UBM entry can describe a name for an organization or a specific service. A UBM entry could be a company name like ReedCO, which would contain all the machines that the company ReedCO would use in the data center. A UBM entry can also be used to describe a service available in that data center. A UBM entry has the following attributes:
• Name of node
• DNS Name of this node (for DNS lookup)
• Port(s) - these are the port(s) that are allowed to the specified machines. If there are no ports, then this is a container Node which means it is used to store a list of allowed machines.
• QOS information
• Parent Node. Each parent can have multiple child Nodes, but each child can only have one parent Node.
• Allow Public Access
[00280] To allow external access, a flag can be provided in or associated with the Node definition that indicates that this Node can be accessed by anybody without restrictions. So a typical company with a Database server, Backup Database server, WWW server, and Backup server could look like the following:
• COMPCO (Lists all four computers, but no public access)
• DB (lists just the Database server)
• BACKUPDB (lists just the backup database server)
• BACKUP (Lists just the backup server)
• WWW (Lists just the WWW server, but allow public connections)
A machine table contains at least the following information:
• MAC Address
• IP Address (If the machine is defined as static)
• Description of machine

[00281] The firewall rules that are necessary to allow dataflow across the network can be created from this table. Only flows that are allowed will be sent to the KLM.
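To illustrate [00280]-[00281], the following sketch derives pre-authorized flow rules from UBM entries plus the machine table; the field names and the grouping logic are assumptions for illustration, not the patent's actual schema:

```python
ubm = {
    "COMPCO": {"parent": None,     "ports": [],     "public": False},
    "DB":     {"parent": "COMPCO", "ports": [5432], "public": False},
    "WWW":    {"parent": "COMPCO", "ports": [80],   "public": True},
}
machines = {  # MAC -> (static IP, UBM node, description)
    "02:00:00:00:00:01": ("10.1.0.1", "DB",  "database server"),
    "02:00:00:00:00:02": ("10.1.0.2", "WWW", "web server"),
}

def allowed_flows():
    """Hosts under the same container may reach each other's service ports;
    nodes flagged public also accept connections from any source."""
    members, rules = {}, []
    for mac, (ip, node, _desc) in machines.items():
        members.setdefault(ubm[node]["parent"] or node, []).append((ip, node))
    for group, hosts in members.items():
        for src_ip, _ in hosts:
            for dst_ip, dst_node in hosts:
                for port in ubm[dst_node]["ports"]:
                    if src_ip != dst_ip:
                        rules.append((src_ip, dst_ip, port))
    for mac, (ip, node, _desc) in machines.items():
        if ubm[node]["public"]:
            for port in ubm[node]["ports"]:
                rules.append(("any", ip, port))
    return rules

for rule in allowed_flows():
    print("ALLOW %s -> %s : %s" % rule)
```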
UBM Service
[00282] The Universal Boundary Manager service can provide membership services, security services and QoS. There can be two or more types of UBM groups:
Transparent UBM Group
[00283] A transparent group can be used as an entry point into the IPA Eco-System. It can be visible and allow standard IP traffic to flow over its interface - UBM Interfaces can be determined by port number, e.g. Port 80. This type of group can be used to handle legacy IP applications such as Mail and associated Web Services. Since a Web Service can be tied to an IP port, limited security (at the Port Level) and QoS attributes (such as Load Balancing) can be attributes of the UBM structure.
• QoS Lite
• Explicit Congestion Control Notification
Opaque UBM Group

[00284] An opaque group can have all the attributes of the Transparent group, but allows for the extension of pure IPA security, signaling (switch layer) and the ability to provide guaranteed QoS.
• Hidden - group Members only know about group Members
• Membership Driven
• Secure (Utilizing Public Key Security or Lattice based cryptography)
• Polymorphic Membership Model (The rise of Inheritance)
• Pure IPA
• Guaranteed QoS based upon proprietary meshed network
• Signaling
[00285] The major extensions to the Opaque group can include the security attributes along with the guaranteed QoS attributes. Multiple opaque or visible groups can be defined from this core set of attributes.
Firewall
[00286] The firewall can be a network-wide mechanism to pre-authorize data flows from host to host. Since every host on the network must be configured by the network administrator before it can be used, no host can successfully transmit or receive data unless it has been authorized in the network. Furthermore, because of the built in security model applied to all devices connected to the network, hosts can only communicate with other authorized hosts. There is no way a rogue host can successfully communicate with any unauthorized host. The data defined in the UBM can control all access to hosts. The KLM loaded into each Hypervisor can provide this functionality. Alternatively, this functionality can be provided on each switch for each attached physical host.
[00287] The ingress switch, where a data packet from a host first arrives in the network, can use the following rules to determine whether the data packet will be admitted to the network, as shown in Figure 22:

Forward Path Rules
Ingress Switch

I. Is H2 using the correct Ethernet Address? (Drop point 1)
   I. Use source IP address to fetch triplet, compare addresses
II. Can H2 send to H1 on the given destination port? (Drop point 2)
   I. Use UBM group rules.
III. Send packet to S1
IV. Create "reverse" rule for H1->H2 for given source Port
   I. Time stamp and age out rule.

Egress Switch

I. Can H2 send to H1 on the given destination port? (Drop point 3)
II. Create "reverse" rule for H1->H2 for given source Port
   I. Time stamp and age out rule.
III. Send packet to H1

Reverse Path Rules

Ingress Switch

I. Is H1 using the correct Ethernet Address? (Drop point 4)
   I. Use source IP # to fetch triplet, compare MAC #s
II. Can H1 send to H2 on the given destination port? (Drop point 5)
   I. Use UBM group information
III. Send encapsulated packet to S2

Egress Switch

I. Can H2 send to H1 on the given destination port? (Drop point 6)
   I. Use reverse rule.
II. Send packet to H1
[00288] This is the opposite of the way traditional firewalls work, where data is allowed to enter the network from any source, traverses the network, and is prevented from reaching a destination host only once the data packet has nearly reached its intended destination. Dropping unauthorized flows at the ingress switch instead significantly lowers "backbone" traffic on the network.
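A sketch of the forward-path ingress checks (drop points 1 and 2); the data shapes are invented for illustration, with H2 denoting the sender and H1 the destination, as in Figure 22:

```python
triplets = {"10.2.0.2": ("S2", "02:00:00:00:00:22")}   # IP -> (switch, MAC)
ubm_allows = {("10.2.0.2", "10.2.0.1", 443)}           # (src, dst, port)
reverse_rules = {}                                      # aged out elsewhere

def ingress_admit(src_ip, src_mac, dst_ip, dst_port, now):
    switch, mac = triplets.get(src_ip, (None, None))
    if mac != src_mac:                      # drop point 1: spoofed source
        return "drop: bad source Ethernet address"
    if (src_ip, dst_ip, dst_port) not in ubm_allows:
        return "drop: UBM group rules forbid flow"      # drop point 2
    # create the time-stamped "reverse" rule for the H1 -> H2 replies
    reverse_rules[(dst_ip, src_ip)] = now
    return "forward to destination switch"

print(ingress_admit("10.2.0.2", "02:00:00:00:00:22", "10.2.0.1", 443, now=0.0))
```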
Central Services
Data Factory
[00289] This is the starting point for full control of the network. All static and dynamic data is stored here, and a user interface is used to view and modify this data.
CPX Controller
[00290] The CPX computer is the Control Plane Executive, which controls all switches and receives data from and sends data to the switches. This data is what is necessary for routing, firewall info, etc. It also controls the ICP (Integrated Control Plane) module, which determines topology, and controls the IFX (Firmware extensions), which are installed on every switch and hypervisor.
CPX connects to the Data Factory to read all of the configuration data necessary to make the entire network work. It also writes both log data and current configuration data to the Data Factory for presentation to users.
ICP (Integrated Control Plane)
[00291] This module controls each instance of IFX on each switch, and takes that neighbor data from each IFX instance and generates cluster data which is then sent back to each IFX instance on each switch.
CPX Interaction with ICP
The types of data that will flow through CPX for the data plane are:
• Triplets
• Firewall Rules/QoS Data
• Topology Information
• Logging Data
[00292] Triplets (which contain the Host IP Address, Switch ID, and MAC address of the host) are generated by the Host detector that runs on each switch. The detected triplets are sent through the Host Controller to the CPX controller. First the triplet's data is validated to make sure that this host MAC address (and IP address, if defined) is a valid one. Once validated, the triplet is enabled in the network.
Optionally, before a host's triplet is added to the database, the host can be forced to validate itself using various standard methods such as 802.1x.
[00293] The triplets can be sent to the Data Factory for permanent storage, and are also sent to other switches that have previously requested that triplet. The sends will be timed out, so that if a switch has not requested a specific triplet for a specific time, the CPX will not automatically send it if it changes again unless the ICP requests it.
[00294] When a switch needs to route data to a host that it does not have a triplet for, the host controller sends a request for the triplet associated with the specific IP address. The CPX looks up that triplet and sends it to the IFX which in turn sends it to the KLM module so that the KLM can route data.
[00295] Firewall rules and Quality of Service (QOS) data travel along the same route as triplets. A switch always receives all the firewall rules involving hosts that are connected to that switch so that quick decisions can be made by the KLM module. If a firewall rule changes, then it is sent to the IFX which sends it to the KLM module. In cases where there are firewall rules with schedules or other "trigger points", the firewall rules are sent to the IFX and IFX sends them to the KLM module at the appropriate time.
[00296] Logging Data, such as data sent/received, errors, etc., is sent from the KLM (or some other module) to IFX, and then to CPX, which sends it to the Data Factory.
ICP Interaction with IFX on Switches
[00297] CPX controls ICP, which in turn controls each instance of IFX on each switch, telling it to send "discover" packets and return neighbor topology data to ICP. All this data is stored in the Data Factory for permanent storage and for presentation to users. This topology data is used by IFX to generate routes. When link states change, the IFX module notifies ICP, and a new routing table is generated by IFX. Initially IFX will reroute the data around the affected path.
CPX Interaction with Data Factory
CPX reads the following data from the Data Factory:
• Host information to validate the host being allowed, including authorization keys, etc.
• Firewall Rules and QoS for inter-host interaction
• Triplets that have been previously deposited into the Factory
• Routing and topology data
CPX writes the following data into the Data Factory:
• Triplet information determined by host detectors
• Topology and routing data determined by CPX and IFX
• Log information about changes in network infrastructure, including routing, host, and other data
ICP Data Factory
The following information is needed by ICP.

[00298] These reads can happen at a very high rate upon startup, and can recur slowly on a regular basis thereafter:

• Switch Information. The key value will be either the MAC or IP address; the data returned will be the information necessary to calculate topology and identify switches.
• Topology information previously written by CPX. This will be used as "hints" to restart routing, for example in case of a failed switch.
• Routing information necessary to route data between switches. This will need to be updated on all affected switches whenever the ICP updates the Data Factory.
The following information will be written by ICP.
[00299] These writes can happen on a very regular basis (e.g., at least once per second, and can occur more often), but they can be buffered and delayed if need be. The data will not be read on a regular basis, except at startup, but will need to be updated on all other switches. Of course the data will be read by the user for network status monitoring.
• Switch Status - Current Status of each switch, including port status
• Topology information - links between switches including metadata about each link
• Routing information. Calculated "best" routes between switches
ICP Data needed for Switches
The following information will be written by the switches
• Triplets from switches for hosts. These will be written whenever a new host comes online, or a host goes away. They can happen anywhere from once every few seconds to much more often as hosts come online. There needs to be some sort of acknowledgement that the specific host being added already exists in the UBM so that it can be routed to. If the host does not exist, its information needs to be flagged so that the user can see that an undefined host has been activated on the network, and the user can add it to the UBM.
The following information will be read by the switches.
[00300] All of these reads can occur as fast as possible. Any slowness in these reads may slow down the data path.
• Triplets for hosts. This can happen quite often, and needs to be as fast as possible.
• UBM data that supplies all the data necessary to create the firewall/QOS rules, multi-server data, and everything else necessary to route to that host.
• The data that will be delivered to the switches from the UBM is:
Firewall Rules with QOS information
Multi-server data. This is all the servers of an equivalent type.
Switch Services
The following services can run on all switches in the network.
IFX (Firmware extensions)
[00301] This module runs on each switch and is responsible for determining the topology of its neighbors. It sends data back to the ICP module about its local physical connectivity, and also receives topology data from ICP. It supports multiple simultaneous logical network topologies, including n-cube, butterfly, torus, etc., as shown in Figure 23. It uses a raw Ethernet frame to probe only the devices attached to this switch. It also takes the topology data and the cluster data from ICP and calculates forwarding tables.
IFXS (Firmware extensions for Servers)
[00302] This module runs on each hypervisor and interacts with the Hypervisor/KLM module to control the KLM. Flow data describing how many bytes of data flow from this hypervisor to various destinations is accepted by this module and used to calculate forwarding tables.
Hypervisor Controller
This can include a Linux kernel loadable module (KLM) that implements the Data plane. It can be controlled by the Switch Controller.
The input to this module are:
• Triplets from this and other switches
• Firewall Rules and QoS Associated data
• Routes from IFX
[00303] The KLM can route packets from hosts to either other hosts, or to outside the network if needed (and allowed by rules). All packets sent across the "backbone" can be encrypted, if privacy is required.
[00304] The KLM switch module can have access to caches of the following data: triplets (they map IPv4 addresses into (Egress Switch ID, host Ethernet Address) pairs); routes (they define the outbound interfaces and next hop Ethernet Address to use to reach a given Egress Switch); and firewall rules (they define which IPv4 flows are legal, and how much bandwidth they may utilize).
[00305] The KLM can eavesdrop on all IP traffic that flows from VM instances (that are supported by the local hypervisor). It can, for example, use functionality (defined in the Linux netfilter library) to STEAL, DROP, or ACCEPT individual IP datagrams that are transmitted by any VM.

[00306] When a datagram is transmitted by a VM, the KLM switch can intercept (STEAL) it and determine whether the firewall rules classify the corresponding flow as legal. If it is illegal, the packet is dropped. If the flow is legal and its destination is local to the hypervisor, it is made to obey QoS rules and delivered. If the flow is legal and exogenous, the local triplet cache is consulted with the destination IP address as an index. If a triplet exists, it determines the Egress Switch ID (which is just a six-byte Ethernet address). If a route also exists to the Egress switch, then the packet is forwarded with the destination switch Topological MAC address put into the Ethernet frame.
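The transmit-path decision of [00306] can be summarized in code. This is a sketch with invented cache contents, not the actual KLM implementation:

```python
firewall  = {("10.3.0.1", "10.3.9.9", 80): "legal"}
triplet   = {"10.3.9.9": ("egress-mac-aa", "02:00:00:00:09:09")}
routes    = {"egress-mac-aa": ("eth1", "next-hop-mac-17")}
local_vms = set()                     # IPs served by this hypervisor

def on_vm_transmit(src_ip, dst_ip, dst_port):
    """STEAL the datagram, then: firewall check, triplet lookup, route lookup."""
    if firewall.get((src_ip, dst_ip, dst_port)) != "legal":
        return "DROP (illegal flow)"
    if dst_ip in local_vms:
        return "deliver locally (after QoS shaping)"
    trip = triplet.get(dst_ip)
    if trip is None:
        return "hold, request triplet from CPX"
    egress_switch, _host_mac = trip
    route = routes.get(egress_switch)
    if route is None:
        return "hold, request route to egress switch"
    iface, next_hop = route
    return f"forward on {iface} via {next_hop}, dst MAC = {egress_switch}"

print(on_vm_transmit("10.3.0.1", "10.3.9.9", 80))
```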
[00307] The KLM can use a dedicated Ethernet frame type to make it impossible for any backbone switch or rogue host to send a received frame up its protocol stack.
[00308] When a frame arrives at a hypervisor, it can be intercepted by its kernel's protocol handler (functionality inside the KLM) for the dedicated Ethernet frame type defined above. The protocol handler can examine the IP datagram, extract the destination IP address, and then index it into its triplet cache to extract the Ethernet address of the local VM. If no triplet exists, the frame can be dropped. The socket buffer's protocol type can be switched from 0xbee5 to 0x0800, and the packet can be made to obey QoS rules before it is queued for transmission to the local host.
[00309] The KLM can use IFXS, for example, as its method to talk with CPX to access the data factory.
Examples
[00310] Figure 24 shows a typical use case where switching systems according to various embodiments of the invention can be used within a data center.
[00311] Figure 15 shows one embodiment of the invention where the FRS is used alone to provide an ultra-high bisection bandwidth connection between multiple CPU cores and a large array of flash memory modules. The prior art approach for having CPU cores transfer data to and from flash memory treats the flash memory modules as an emulated disk drive, where data is transferred serially from a single "location". The invention allows large numbers of CPUs, or other consumers or generators of data, to communicate in parallel with multiple different flash memory storage modules. In this embodiment of the invention, the ISS network can be designed using the physical constraints of the various methods by which semiconductor devices are packaged and interconnected. This embodiment results in a network that has a different connection pattern than would be used in a data center, but still provides extremely high bisection bandwidth for the available physical connections within and between semiconductor devices and modules.
[00312] Additional supporting information relating to the construction of Long Hop networks is provided in attached Appendix A, which is hereby incorporated by reference.
[00313] Those skilled in the art will realize that the methods of the invention may be used to develop networks that interconnect devices or nodes with arbitrary functionality and with arbitrary types of information being exchanged between the nodes. For example, nodes may implement any combination of storage, processing or message forwarding functions, and the nodes within a network may be of different types with different behaviors and types of information exchanged with other nodes in the network or devices connected to the network.
1. Introduction
Rapid proliferation of large Data Center and storage networks in recent years has spurred a great deal of interest from industry and academia in the optimization of network topologies [1]-[12]. The urgency of these efforts is further motivated by the inefficiencies and costs of the presently deployed large Data Center networks, which are largely based on the non-scalable tree topology.
There are two main types of network topologies proposed as scalable alternatives to the non- scalable tree topology of the conventional Data Center:
• Fat Tree (FT) (syn. folded Clos) based networks, a class of "indirect networks"
• Hypercubic (HC) networks, a class of "direct networks" using Cartesian product construction recipe. This class includes plain hypercube variants (BCube, MDCube), Folded Hypercube (FC), Flattened Butterfly (FB), HyperX (HX), hyper-mesh, hyper- torus, Dragonfly (DF),... etc.
While the HC networks are overall the more economical of the two types, providing the same capacity for random traffic as FT with fewer switches and fewer cables, the FT is more economical on the worst case traffic, specifically on the task of routing the worst case 1-1 pairs permutation.
The Long Hop (LH) networks stand above this dichotomy by being simultaneously the most optimal for the common random traffic and for the worst case traffic. The LH optimality is a result of a new approach to network construction which is fundamentally different from the techniques used to construct all the leading alternatives. Namely, while the alternative techniques build the network via simple mechanical, repetitive design patterns which are not directly related to network performance metrics such as throughput, the LH networks are constructed via an exact combinatorial optimization of the target metrics.
Although there have been some previous attempts to optimize the network throughput directly, such as the "entangled networks" described in [2] and [12], these techniques sought to optimize general random networks. Since such optimization is computationally intractable for general graphs (it is an NP-complete problem), the computations of both the network performance and the search for its improvements are by necessity very approximate (simulated annealing), and still they become prohibitively expensive as the network size increases beyond a few thousand nodes. For example, the largest computed size in [12] had 2000 nodes. Further, since the resulting approximate solutions have variable node degree and random connectivity, appearing to a network technician as massive, incoherent tangles of wires without any pattern or logic, the "entangled networks" are in practice virtually impossible to wire and troubleshoot. Finally, the node degree irregularity and the complete lack of symmetry of such networks compound their impracticality due to complicated, resource hungry routing algorithms and forwarding tables.
In contrast, the LH construction method optimizes the highly symmetrical and, from a practical perspective, the most desirable subset of general networks, the Cayley graphs [11]. As a result of this more focused and more careful identification of the target domain, the LH networks are optimal regarding throughput and latency within that domain, practical to compute and discover, simple and economical to wire and troubleshoot, and highly efficient in routing and forwarding resources ("self-routing" networks).
2. Mathematical Tools and Notation

• A ≡ B    equality defining expression A via expression B (tautology)
• A ⇔ B    expression or statement "A is equivalent to B"
• A ⇒ B    "A implies B"
• ∀a       iterator or a set defined by the statement "for all a"
• iff      "if and only if"
• |S|      sets: size of set S (number of elements in S); numbers: absolute value of S
• ⌊a⌋      floor(a): the largest integer ≤ a
• F^N      N-dimensional vector space (over some implicit field F_q)
• S(k,n,q) k-dimensional subspace of F^n (linear span) over field F_q
• (x|y)    scalar (dot) product of real vectors x and y: (x|y) ≡ Σ_{i=1}^{n} x_i·y_i
• ||x||    norm (length) of vector x: ||x|| ≡ √(x|x)
• a..b     integer sequence a, a+1, ..., b for some integers a ≤ b
• [a, b)   half-open interval: contains all x satisfying a ≤ x < b
• [a, b]   closed interval: contains all x satisfying a ≤ x ≤ b
• {a1, a2, a3}  set of elements a1, a2 and a3
• {x: E(x)}     set of elements x for which Boolean expression E(x) is true
• min_E{set}    minimum element of a {set} under condition E; analogously for max_E{set}
• a % b    "a mod b" or "a modulo b" (remainder in integer division a/b)
• The bitwise operations below act on bit strings separately in each bit position:
• ~a or ā  NOT a (bitwise complement, toggles each bit of a)
• a & b    bitwise AND (bitwise a·b)
• a | b    bitwise OR (bitwise a + b − a·b)
• a ^ b    XOR, exclusive OR (bitwise: (a + b) mod 2, also a + b − 2·a·b)
• a ⊕ b    modular addition in ring (Z_q)^d: component-wise (a + b) mod q
• a ⊖ b    synonym for a ⊕ (−b); for q=2: a ⊖ b = a ⊕ b = a ^ b (bitwise XOR)
• V = V1 ⊕ V2   vector space V is the direct sum of vector spaces V1 and V2
• A·B = B·A     objects (matrices, group elements, etc.) commute for the operation '·'
• [E]      Iverson bracket (E is a Boolean expression): E true (false) ⇒ [E] ≡ 1 (0)
• δ_ij     Kronecker delta: δ_ij ≡ [i=j], i.e. δ_ij is 1 if i=j and 0 if i≠j
• δ_i      Dirac integer delta: δ_i ≡ δ_{i,0}, i.e. δ_i is 1 if i=0 and 0 if i≠0
• B = A^T  matrix B is the transpose of matrix A, i.e. elements B_ij = A_ji
• A ⊗ B    Kronecker product of matrices A and B
• A^⊗n     Kronecker n-th power of matrix A: A^⊗n ≡ A ⊗ A ⊗ ··· ⊗ A (n times)
• A×B      Cartesian product of sets or groups A and B
• A^×n     Cartesian n-th power of a set or group A
• C(n,k)   binomial coefficient C(n,k) ≡ n!/[k!(n−k)!]
• O(N)     Big O notation; characterizes growth rate and complexity
Binary expansion of a d-bit integer X:

X ≡ Σ_{μ=0}^{d−1} x_μ·2^μ

where x_μ is the "μ-th bit of X" (bits x_μ have values 0 or 1). The bit-string form of the binary expansion of integer X is denoted as: X = x_{d−1} ... x_1 x_0.

Parity of a d-bit integer X = x_{d−1} ... x_1 x_0 is: P(X) ≡ (x_0 + x_1 + ... + x_{d−1}) mod 2 = x_0 ^ x_1 ^ ... ^ x_{d−1}.

Hamming weight w(X) or Δ(X) of an n-tuple X ≡ x_1 x_2 ... x_n, where x_i ∈ [0,q), is the number of non-zero symbols in X. Hamming distance Δ(X,Y) between n-tuples X and Y is the number of positions i where x_i ≠ y_i. For vectors X and Y this is equivalent to Δ(X,Y) = w(X−Y) = Δ(X−Y), i.e. to the Hamming weight of (X−Y). For binary strings this yields Δ(X,Y) = w(X^Y), i.e. the Hamming weight of X^Y.
Lee distance is: Λ(X,Y) ≡ Σ_{i=1}^{n} min(|x_i − y_i|, q − |x_i − y_i|).

Lee weight is: Λ(X) ≡ Λ(X, 0).
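For concreteness, small helpers mirroring these definitions (the example tuples and the value of q are arbitrary choices, not from the text):

```python
def hamming_weight(x):
    """Number of non-zero symbols in tuple x."""
    return sum(1 for s in x if s != 0)

def hamming_distance(x, y):
    """Number of positions where x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def lee_distance(x, y, q):
    """Sum of min(|x_i - y_i|, q - |x_i - y_i|) over all positions."""
    return sum(min(abs(a - b), q - abs(a - b)) for a, b in zip(x, y))

x, y, q = (0, 3, 1), (0, 1, 2), 4
print(hamming_weight(x), hamming_distance(x, y), lee_distance(x, y, q))
# -> 2 2 3   (Lee: min(2, 2)=2 at position 2 and min(1, 3)=1 at position 3)
```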
Binary intervals (or binary tiles) are intervals of size 2^k (for k = 1, 2, ...) such that each "tile" of size 2^k starts on an integer multiple of 2^k, e.g. the intervals [m·2^k, (m+1)·2^k) for any integer m are "binary intervals" of size 2^k.
Cyclic group Z_n: the set of integers {0, 1, ..., n−1} with integer addition modulo n as the group operation. Note that the Z_2 group operation is equivalent to a single bit XOR operation (1^0 = 0^1 = 1, 0^0 = 1^1 = 0). The same symbol Z_n is also used for the commutative ring with integer addition and multiplication performed mod n.

Product group Z_q^d ≡ Z_q × Z_q × ··· × Z_q (d times): extension of Z_q into a d-tuple. As with Z_n, Z_q^d also denotes a commutative ring in which the Z_q operations (integer +, · mod q) are done component-wise.

Finite Dyadic group D_d of order n=2^d is the abelian group consisting of all d-bit integers 0..n−1 using bitwise XOR (^) as the group operation. Notes: (i) for n=2^d and d ≥ 2, Z_n ≠ D_d; (ii) D_d is an instance of Z_2^d.
Y^X   0 1 2 3 4 5 6 7 8 9 A B C D E F
 0:   0 1 2 3 4 5 6 7 8 9 A B C D E F
 1:   1 0 3 2 5 4 7 6 9 8 B A D C F E
 2:   2 3 0 1 6 7 4 5 A B 8 9 E F C D
 3:   3 2 1 0 7 6 5 4 B A 9 8 F E D C
 4:   4 5 6 7 0 1 2 3 C D E F 8 9 A B
 5:   5 4 7 6 1 0 3 2 D C F E 9 8 B A
 6:   6 7 4 5 2 3 0 1 E F C D A B 8 9
 7:   7 6 5 4 3 2 1 0 F E D C B A 9 8
 8:   8 9 A B C D E F 0 1 2 3 4 5 6 7
 9:   9 8 B A D C F E 1 0 3 2 5 4 7 6
 A:   A B 8 9 E F C D 2 3 0 1 6 7 4 5
 B:   B A 9 8 F E D C 3 2 1 0 7 6 5 4
 C:   C D E F 8 9 A B 4 5 6 7 0 1 2 3
 D:   D C F E 9 8 B A 5 4 7 6 1 0 3 2
 E:   E F C D A B 8 9 6 7 4 5 2 3 0 1
 F:   F E D C B A 9 8 7 6 5 4 3 2 1 0
Table 2.1 illustrates the group operation table for group D4 with n = 24=\6 elements 0, 1, 2, ... F (all numbers are in base 16). Table entry in row Y and column X is the result of bitwise XAY operation.
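As a quick aside (a sketch, not part of the original text), Table 2.1 can be regenerated in two lines, since the D_4 group operation is simply bitwise XOR of the 4-bit row and column indices:

```python
# Each table entry in row Y and column X is X XOR Y, printed in base 16.
for y in range(16):
    print(" ".join(format(x ^ y, "X") for x in range(16)))
```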
A. Matrices and Vectors in Dirac Notation
Dirac notation (also called "bra-ket" notation, [13]) is a mnemonic notation which encapsulates common matrix operations and properties in a streamlined, visually intuitive form.
Matrix [A_{r×c}] (also: [A] or just A) is a rectangular table with r rows and c columns of "matrix elements". The element in the i-th row and j-th column of a matrix [A] is denoted as [A]_ij. The identity matrix n×n is denoted as I_n or I. Matrices with r = 1 or c = 1, row or column vectors, are denoted as follows:
• Row vector (bra): <x| ≡ (x_1 x_2 ··· x_c)
• Column vector (ket): |y> ≡ (y_1 y_2 ··· y_r)^T
• Inner (scalar) product: <x|y> ≡ Σ_i x_i·y_i
• Outer product: |y><x| = "matrix" with elements [|y><x|]_ij = y_i·x_j
• Translation bra ↔ ket for a real matrix A: |v> = A|u> ⇔ <v| = <u|A^T
• i-th "canonical basis" bra vector: <e_i| ≡ (0_1 0_2 ··· 1_i 0_{i+1} ··· 0_n)
• General "orthonormal basis" [B] ≡ {|b_i>: i = 1..n}: <b_i|b_j> = δ_ij
• Orthogonal matrix U: U·U^T = U^T·U = I
• Projector (matrix) onto the i-th canonical axis: P_i ≡ |e_i><e_i|
• Projector (matrix) onto any normalized (<u|u> = 1) vector |u>: P_u = |u><u|
• Component (vector) of <X| along axis <e_i|: <e_i|X>
• "Resolution of identity" in any basis [B]: Σ_{i=1}^{n} |b_i><b_i| = I

The above examples illustrate the rationale for Dirac notation: product expressions with two "pointy" ends such as <...> are always scalars (numbers), while products with two flat ends |...>...<...| are always matrices. Mixed-end products (those with one pointy and one flat end) such as <...| or |...> are always row or column vectors. Due to the associativity of matrix products, these "object type rules" are valid however many other matrix or vector factors may be inside or outside of the selected sub-product of a given type. Also, the "resolution of identity" sums Σ|b_i><b_i| can be freely inserted between any two adjacent bars ('flat ends') within a large product, further aiding in the breakup of longer chains of matrices into scalars. Such rules of thumb often suggest, purely visually, quick, mistake-proof simplifications, e.g. any scalars spotted as a ...<...>... pattern can be immediately factored out.
B. Hadamard Matrices and Walsh Functions
Hadamard matrix H_n (or H) is a square n×n matrix defined by the equation H_n·H_n^T = n·I_n. Of interest here are the Sylvester type H_n matrices, characterized by the size constraint n ≡ 2^d. Under this constraint the H_n matrices can be constructed recursively (equivalently, via Kronecker powers of H_2) as follows [14]:
H_2 = [[1, 1], [1, −1]],   H_{2n} = H_2 ⊗ H_n = [[H_n, H_n], [H_n, −H_n]]   (2.1)
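A sketch of this Sylvester recursion via Kronecker products (assuming numpy is available):

```python
import numpy as np

def hadamard(n):
    """Sylvester-type Hadamard matrix H_n for n = 2^d, per eq. (2.1)."""
    H2 = np.array([[1, 1], [1, -1]])
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.kron(H2, H)                  # H_{2m} = H_2 (x) H_m
    return H

H8 = hadamard(8)
assert (H8 @ H8.T == 8 * np.eye(8)).all()   # defining equation H H^T = n I
assert (H8 == H8.T).all()                   # symmetry, eq. (2.2) below
```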
The pattern of H_32 (d=5) is shown in Table 2.2, with '−1' elements shown as '-' and coordinates in base 16.
[Table 2.2 body omitted: the 32×32 Sylvester-type Hadamard matrix H_32, with +1 entries printed as '1', −1 entries as '-', and row/column indices 00-1F in base 16.]

Table 2.2
From the construction eq. (2.1) of H_n (where n ≡ 2^d) it follows that H_n is a symmetric matrix:

Symmetry: [H_n]_ij = [H_n]_ji   (2.2)
Walsh function U_k(x), for k, x = 0..n−1, is defined as the k-th row of H_n. By virtue of the H_n symmetry, eq. (2.2), the k-th column of H_n is also equal to U_k(x). The row and column forms of U_k(x) can also be used as the n-dimensional bra/ket row/column vectors <U_k| and |U_k>. Some properties of U_k(x) are:

Orthogonality: <U_j|U_k> = n·δ_jk   (2.3)

Symmetry: U_k(x) = U_x(k)   (2.4)

Function values: U_k(x) = (−1)^(Σ_{μ=0}^{d−1} k_μ·x_μ) = (−1)^P(k&x)   (2.5)

U_0(x) = 1, ∀x   (2.6)

Σ_{x=0}^{n−1} U_k(x) = 0 for k = 1..n−1   (2.7)
The exponent Σ_{μ=0}^{d−1} k_μ·x_μ in eq. (2.5) uses the binary digits k_μ and x_μ of the d-bit integers k and x. When this sum is an even number U_k(x) is 1, and when the sum is an odd number U_k(x) is −1. The second equality in eq. (2.5) expresses the same result via the parity function P(k&x), where k&x is a bitwise AND of the integers k and x. For example, U_14(15) = −1 from Table 2.2. Binary forms for k and x are: k = 14 = 01110 and x = 15 = 01111. The sum in the exponent is Σ_μ k_μ·x_μ = 0·0+1·1+1·1+1·1+0·1 = 3 ⇒ U_14(15) = (−1)^3 = −1. The parity approach uses k & x = 01110 & 01111 = 01110, yielding exponent P(01110) = 0^1^1^1^0 = 1 and U_14(15) = (−1)^1 = −1, i.e. the same result as the one obtained via the sum formula.
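The worked example is easy to check in code. A sketch implementing eq. (2.5) through its parity form:

```python
def parity(v):
    """P(v): XOR of all bits of v."""
    p = 0
    while v:
        p ^= v & 1
        v >>= 1
    return p

def U(k, x):
    """Walsh function value, eq. (2.5): (-1)^P(k & x)."""
    return -1 if parity(k & x) else 1

print(U(14, 15))   # -> -1, matching the worked example above
```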
For efficiency, the LH network computations mostly use the binary (also called Boolean) form of U_k and H_n, denoted respectively as W_k and [W_n]. When both forms are used in the same context, the U_k and H_n forms are referred to as algebraic forms. The binary form is obtained from the algebraic form via the mappings 1 → 0 and −1 → 1. Denoting algebraic values as a and binary values as b, the translations between the two are:

b = (1 − a)/2 and a = 1 − 2b   (2.8)
The symmetry eq. (2.4) and function values eq. (2.5) become, for the binary form W_k(x):

Symmetry: W_k(x) = W_x(k)   (2.9)

Function values: W_k(x) = (Σ_{μ=0}^{d−1} k_μ·x_μ) mod 2 = P(k&x)   (2.10)
Binary Walsh functions W_k(x) are often treated as length-n bit strings, which for k = 1..n-1 have exactly n/2 zeros and n/2 ones. In the bit string form one can perform bitwise Boolean operations on the W_k as length-n bit strings. Their XOR property will be useful for the LH computations:

W_j ⊕ W_k = W_{j⊕k}   (2.11)

i.e. the set {W_k} ≡ {W_k : k = 0..n-1} is closed with respect to the bitwise XOR operation (denoted ⊕), and it forms a group of n-bit strings isomorphic to the dyadic group D_d of their indices k (d-bit strings).
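The closure property of eq. (2.11), together with the algebraic-to-binary mapping of eq. (2.8), is easy to check numerically; a sketch reusing hadamard() from the first snippet:

```python
d = 5
n = 2 ** d
W = (1 - hadamard(d)) // 2          # eq. (2.8): b = (1 - a)/2 maps H_n to [W_n]

# eq. (2.11): XOR of rows W_j and W_k equals row W_{j XOR k}, for all j, k
for j in range(n):
    for k in range(n):
        assert ((W[j] ^ W[k]) == W[j ^ k]).all()
```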
Table 2.3 below shows the binary form of the Hadamard (also called Walsh) matrix [W_32], obtained via the mapping eq. (2.8) from H_32 in Table 2.2 (binary 0's are shown as '-').
[Table 2.3: the 32×32 binary Walsh matrix [W_32], rows labeled 0:00 through 31:1F, columns indexed 00-1F (hexadecimal); binary 1's are printed as '1' and 0's as '-'.]

Table 2.3
C. Error Correcting Codes

Error correcting coding (ECC) covers a large variety of techniques for adding redundancy to messages in order to detect or correct errors in the decoding phase. Of interest for the LH network construction are the linear EC codes, which are the most developed and, in practice, the most important type of ECC [15], [16].
A message X is a sequence of k symbols x_1, ..., x_k from an alphabet A of size q ≥ 2, i.e. the x_i can be taken to be integers with values in the interval [0, q). An EC code for X is a codeword Y, which is a sequence y_1, y_2, ..., y_n of n > k symbols from A.* The encoding procedure translates all messages from some set {X} of all possible messages into codewords from some set {Y}. For block codes the sizes of the sets {X} and {Y} are q^k, i.e. messages are arbitrary k-symbol sequences. The excess symbols n-k > 0 in Y represent coding redundancy or "check bits" that support detection or correction of errors during decoding of Y into X.
For ECC algorithmic purposes, the set A is augmented with additional mathematical structure, beyond merely that of a bare set of q elements. The common augmentation is to consider the symbols x_i and y_i to be elements of a Galois field GF(q), where q = p^m for some prime p and some integer m ≥ 1 (this condition on q is necessary in order to augment a bare set A into a finite field F_q). Codewords Y are then a subset of all n-tuples F_q^n over the field GF(q). The GF(q) field arithmetic (i.e. the + and scalar ·) for the n-tuples is done component-wise, i.e. F_q^n is the n-dimensional vector space V_n ≡ F_q^n over GF(q).
Linear EC codes are a special case of the above n-tuple F_q^n structure of codewords, in which the set {Y} of all codewords is a k-dimensional vector subspace (or span) §(k,n,q) of V_n. Hence, if two n-tuples Y_1 and Y_2 are codewords, then any linear combination a·Y_1 + b·Y_2 (with a, b ∈ GF(q)) is also a codeword. The number of distinct codewords Y in §(k,n,q) is |§(k,n,q)| = q^k.

This linear code is denoted in the ECC convention as an [n,k]_q code, or just an [n,k] code when q is understood from the context or otherwise unimportant.
A particular [n,k] code can be defined by specifying k linearly independent n-dimensional row vectors ⟨g_i| = (g_{i,1}, ..., g_{i,n}) for i = 1..k, which are used to define the k×n "generator matrix" [G] of the [n,k] code as follows ([16] p. 84):

* More generally, the message X and the codeword Y can use different alphabets, but this generality merely complicates the exposition without adding anything useful for the LH construction.
$$[G] = \begin{bmatrix} \langle g_1| \\ \langle g_2| \\ \vdots \\ \langle g_k| \end{bmatrix} = \begin{bmatrix} g_{1,1} & g_{1,2} & \cdots & g_{1,n} \\ \vdots & & & \vdots \\ g_{k,1} & g_{k,2} & \cdots & g_{k,n} \end{bmatrix} \quad (2.20)$$
Encoding of a message X ≡ ⟨X| ≡ (x_1, x_2, ..., x_k) into the codeword Y ≡ ⟨Y| ≡ (y_1, y_2, ..., y_n) is:

$$\langle Y| = \langle X|\,[G] = \sum_{i=1}^{k} x_i \langle g_i| \quad (2.21)$$
The individual component (symbol) y_s (where s = 1..n) of the codeword Y is then, via eqs. (2.20)-(2.21):

$$y_s \equiv \langle Y | e_s \rangle = \sum_{i=1}^{k} x_i \langle g_i | e_s \rangle = \sum_{i=1}^{k} x_i \, g_{i,s} \quad (2.22)$$
The k×n matrix [G_{k,n}] is called a systematic generator iff the original message X = x_1, ..., x_k occurs as a substring of the output codeword Y. The systematic generators [G] contain a k×k identity matrix as a sub-matrix of [G], i.e. [G] typically has the form [I_k | A_{k,n-k}] or [A_{k,n-k} | I_k], yielding the unmodified substring X as a prefix or a suffix of Y, which simplifies the encoding and decoding operations. The remaining n-k symbols of Y are then called parity check symbols.
The choice of the vectors ⟨g_i| used to construct [G] depends on the type of errors that the [n,k] code is supposed to detect or correct. For the most common assumption in ECC theory, independent random errors in the symbols of the codeword Y, the best choices of ⟨g_i| are those that maximize the minimum Hamming distance Δ(Y_1, Y_2) over all pairs (Y_1, Y_2) of codewords. Defining the minimum codeword distance via:

Δ = min{Δ(Y_1, Y_2) : Y_1, Y_2 ∈ §(k,n,q) and Y_1 ≠ Y_2}   (2.24)

the [n,k]_q code is often denoted as an [n,k,Δ]_q or [n,k,Δ] code. The optimum choice of the vectors ⟨g_i| maximizes Δ for given n, k and q. Tables of optimum and near-optimum [n,k,Δ]_q codes have been computed over decades for wide ranges of the free parameters n, k and q (e.g. see the web repository [17]).
Table 2.4 ([16] p. 34) illustrates the optimum [7,4,3]_2 code, i.e. a systematic binary code with n = 7 bit codewords, each containing 3 parity check bits and k = 4 message bits (appearing as a suffix in the codeword Y), with minimum distance Δ = 3, thus capable of correcting all 1-bit errors and detecting all 2-bit errors.

[Table 2.4: the 16 codewords of the systematic [7,4,3]_2 code, pairing each 4-bit message X with its 7-bit codeword Y.]

Table 2.4
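Since the exact generator matrix of Table 2.4 is not reproduced above, the sketch below (ours) uses one standard systematic [7,4,3]_2 generator in the [A_{k,n-k} | I_k] form, which suffices to illustrate the encoding of eq. (2.21) over GF(2):

```python
import numpy as np

# A systematic [7,4,3]_2 generator in [A | I_4] form (an assumed choice; the
# particular [G] of Table 2.4 may order rows/columns differently).
G = np.array([[1, 1, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 0],
              [1, 1, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 0, 0, 1]])

def encode(x):
    """Eq. (2.21) over GF(2): <Y| = <X|[G], arithmetic mod 2."""
    return np.array(x) @ G % 2

y = encode([1, 0, 1, 1])
print(y)     # [1 0 0 1 0 1 1]: the 4 message bits appear as the suffix of Y
```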
A quantity closely related to Δ, and of importance for the LH construction, is the minimum nonzero codeword weight w_min, defined via the Hamming weight w(Y) (the number of non-zero symbols in Y) as follows:

w_min = min{w(Y) : Y ∈ §(k,n,q) and Y ≠ 0}   (2.25)

The property of w_min of interest here (cf. Theorem 3.1, p. 83 in [16]) is that for any linear code [n,k,Δ]_q:

w_min = Δ   (2.26)

Hence, the construction of optimal [n,k,Δ]_q codes (maximizing Δ) is the problem of finding a k-dimensional subspace §(k,n,q) of the n-dimensional space F_q^n which maximizes w_min. Note also that since any set of k linearly independent vectors ⟨g_i| (a basis) from §(k,n,q) generates (spans) the same space §(k,n,q) of q^k vectors Y, the quantities w_min and Δ are independent of the choice of the basis {⟨g_i| : i = 1..k}. Namely, by virtue of the uniqueness of the expansion of all q^k vectors Y ∈ §(k,n,q) in any basis and the pigeonhole principle, a change of basis merely permutes the mapping X → Y, retaining exactly the same set of q^k vectors of §(k,n,q).
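Eq. (2.26) can be checked directly for the small code above by enumerating all q^k = 2^4 = 16 codewords (a sketch reusing G and encode() from the previous snippet):

```python
import itertools

# Eq. (2.25): minimum Hamming weight over all nonzero codewords
codewords = [encode(x) for x in itertools.product([0, 1], repeat=4)]
w_min = min(int(y.sum()) for y in codewords if y.any())
print(w_min)     # 3, matching Delta = 3 for the [7,4,3]_2 code per eq. (2.26)
```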
D. Graphs: Terms and Notation

• Γ(V,E)   Graph Γ with vertices V = {v_1, v_2, ..., v_n} and edges E = {ε_1, ε_2, ..., ε_E}
• degree of v   Number of edges (links) connected to node v
• Γ_1 □ Γ_2   Cartesian product of graphs Γ_1 and Γ_2 (syn. "product graph")
• Γ^□n   (Cartesian) n-th power of graph Γ
• ε_k = (v_i, v_j)   Edge ε_k connects vertices v_i and v_j
• v_i ~ v_j   Vertices v_i and v_j are connected
• v_i ≁ v_j   Vertices v_i and v_j are not connected
• [A]   Adjacency matrix of a graph: [A]_ij ≡ A(i,j) ≡ [v_i ~ v_j]: 1 if v_i ~ v_j, 0 if v_i ≁ v_j. The number of ones in a row r (or column c) is the degree of node r (or c)
• A(i,j) = A(j,i)   Symmetry property of [A] (for undirected graphs)
• C_n   Cycle graph: a ring with n vertices (syn. n-ring)
• P_n   Path graph: an n-ring with one link broken, i.e. a line with n vertices (syn.
• Q_d   d-dimensional hypercube (syn. d-cube): (P_2)^□d = P_2 □ P_2 □ ... □ P_2 (d times)
• FQ_d   Folded d-cube: a d-cube with an extra link on each long diagonal (see Table 4.4)
Cayley graph Cay(G_n, S_m), where: G_n is a group with n elements {g_1 ≡ I_0, g_2, ..., g_n}, and S_m, called the generator set, is a subset of G_n with m elements, S_m = {h_1, h_2, ..., h_m}, such that (cf. [18] chap. 5):

(i) for any h ∈ S_m, also h^(-1) ∈ S_m (i.e. S_m contains the inverse of each of its elements)

(ii) S_m does not contain the identity element g_1 (denoted I_0) of G_n *

Construction: The vertex set V of Cay(G_n, S_m) is V ≡ {g_1, g_2, ..., g_n} and the edge set is E ≡ {(g_i, g_i·h_s), ∀ i, s}. In words, each vertex g_i is connected to the m vertices g_i·h_s for s = 1..m.

The generating elements h_s are called here "hops", since for the identity element g_1 ≡ I_0 (the "root node") their group action is precisely the single-hop transition from the root node g_1 to its 1-hop neighbors h_1, h_2, ..., h_m ∈ V(G_n).

* The requirement for the inverse h^(-1) to be in S_m applies to undirected Cayley graphs, not to directed graphs. The exclusion of the identity from S_m applies to graphs that have no self-loops of a node to itself (i.e. a vertex v ~ v). These restrictions are not essential but mere conveniences of the 'preferred embodiment'.
The construction of Q_3 = Cay(D_3, S_3) is illustrated in Fig. 10. The group is the 8-element dyadic group D_3, and the 3 generators h_1 = 001, h_2 = 010 and h_3 = 100 are shown with arrows indicating the group action (XOR of node labels with the generators; all labels are in binary) on the vertex v_1 = 000. The resulting graph is a 3-cube.
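A minimal sketch of the same construction (ours; the group operation of the dyadic group is bitwise XOR, so every generator is its own inverse and condition (i) holds automatically):

```python
def cayley_edges(n_elems, generators):
    """Undirected edge set of Cay(G, S) for a dyadic group: g ~ g XOR h."""
    return {frozenset((g, g ^ h)) for g in range(n_elems) for h in generators}

# Q_3 = Cay(D_3, {001, 010, 100}): 8 vertices, 12 edges, 3-regular (the 3-cube)
edges = cayley_edges(8, [0b001, 0b010, 0b100])
assert len(edges) == 12
assert all(sum(v in e for e in edges) == 3 for v in range(8))
```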
E. Properties of Matrices

This section lists several results about matrices (cf. [19]) needed in the LH construction. All matrices below are assumed to be real (rather than complex valued) matrices.

(M1) A square n×n real matrix A is called a normal matrix ([19] p. 100) iff it satisfies the relation:

$$A A^T = A^T A \quad (2.40)$$

This implies that any symmetric (real) matrix S is a normal matrix (since S = S^T, hence S S^T = S^2 = S^T S).
(M2) Any real, symmetric n×n matrix [S] has n real eigenvalues λ_i (i = 1..n) and the n corresponding orthonormal eigenvectors |v_i⟩ for i = 1..n (cf. [19] p. 101):

$$[S]\,|v_i\rangle = \lambda_i |v_i\rangle \quad \text{for } i = 1..n \quad (2.41)$$

$$\langle v_i | v_j \rangle = \delta_{i,j} \quad (2.42)$$
(M3) Since the set {|v_i⟩ : i = 1..n} is a complete orthonormal set of vectors (a basis in V_n), any [S] from (M2) can be diagonalized via an orthogonal n×n matrix [U] (an orthogonal matrix is defined via the condition [U][U]^T = I_n), which can be constructed as follows (applying eqs. (2.41)-(2.42)):

$$[U] \equiv \big[\,|v_1\rangle \;\; |v_2\rangle \;\cdots\; |v_n\rangle\,\big] \quad (2.43)$$

$$[U]^T [S] [U] = \sum_{i=1}^{n} \lambda_i \, |e_i\rangle\langle e_i| \quad (2.44)$$

The final sum in (2.44) is a diagonalized form of [S], with the λ_i's along the main diagonal and 0's elsewhere.
(M4) A set of m symmetric, pairwise commuting matrices F_m ≡ {S_r : S_r S_t = S_t S_r for t, r = 1..m} is called a commuting family (cf. [19] p. 51). For each commuting family F_m there is an orthonormal set of n vectors (an eigenbasis in V_n) {|v_i⟩} which are simultaneously eigenvectors of all S_r ∈ F_m (cf. [19] p. 52).
(M5) Labeling the n eigenvalues of the symmetric matrix S from (M2) as λ_min ≡ λ_1 ≤ λ_2 ≤ ... ≤ λ_n ≡ λ_max, the following equalities hold (Rayleigh-Ritz theorem, [19] p. 176):

$$\lambda_{max} = \max\left\{ \frac{\langle X|S|X\rangle}{\langle X|X\rangle} : |X\rangle \in V_n \text{ and } |X\rangle \neq 0 \right\} \quad (2.45)$$

$$\lambda_{min} = \min\left\{ \frac{\langle X|S|X\rangle}{\langle X|X\rangle} : |X\rangle \in V_n \text{ and } |X\rangle \neq 0 \right\} \quad (2.46)$$
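A small numeric check of (M2)-(M5) (ours; NumPy's eigh returns the eigenvalues of a symmetric matrix in ascending order, with orthonormal eigenvectors as columns):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
S = (A + A.T) / 2                               # a real symmetric matrix

lam, U = np.linalg.eigh(S)
assert np.allclose(U @ U.T, np.eye(6))          # [U][U]^T = I_n, so [U] is orthogonal
assert np.allclose(U.T @ S @ U, np.diag(lam))   # the diagonalization of eq. (2.44)

x = rng.standard_normal(6)
rq = (x @ S @ x) / (x @ x)                      # Rayleigh quotient <X|S|X> / <X|X>
assert lam[0] <= rq <= lam[-1]                  # bounded per eqs. (2.45)-(2.46)
```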
References
1. Taming the Flying Cable Monster: A Topology Design and Optimization Framework for Data-Center Networks
J. Mudigonda, P. Yalagandula, J. C. Mogul (HP), (slides)
Proc. USENIX ATC'11, June 2011, pp. 101-114
2. Network Topology Analysis
D. S. Lee, J. L. Kalb (Sandia National Laboratories)
Sandia Report SAND2008-0069, Jan 2008
3. Flattened butterfly: a cost-efficient topology for high-radix networks
J. Kim, W. J. Dally, D. Abts (Stanford-Google)
Proc. ISCA'07, May 2007, pp. 126-137
High-Radix Interconnection Networks
J. Kim, PhD thesis, Stanford University, 2008
4. High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities
D. Abts, J. Kim (Stanford-Google)
Synthesis Lectures on Computer Architecture #14, Morgan & Claypool Publishers, 2011
5. Energy Proportional Datacenter Networks
D. Abts, M. Marty, P. Wells, P. Klausler, H. Liu (Google)
Proc. ISCA'10, June 2010, pp. 338-347
6. BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers
C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, S. Lu (Microsoft)
Proc. SIGCOMM, 2009, pp. 63-74
7. MDCube: A High Performance Network Structure for Modular Data Center
Interconnection
H. Wu, G. Lu, D. Li, C. Guo, Y. Zhang (Microsoft)
Proc. SIGCOMM, 2009, pp. 25-36
8. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers
C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, S. Lu (Microsoft)
Proc. SIGCOMM, 2008, pp. 75-86
9. HyperX: Topology, Routing, and Packaging of Efficient Large-Scale Networks
J. H. Ahn, N. Binkert, M. McLaren, A. Davis, R. S. Schreiber (HP)
Proc. SC'09, Nov 2009, pp. 14-20
10. Technology-Driven, Highly-Scalable Dragonfly Topology
J. Kim, W. J. Dally, S. Scott, D. Abts (Stanford-Google)
Proc. ISCA'08, 2008, pp. 77-88
11. A Group-Theoretic Model for Symmetric Interconnection Networks
S. B. Akers, B. Krishnamurthy
IEEE Transactions on Computers, pp. 555-566, April 1989
12. Optimal network topologies: Expanders, Cages, Ramanujan graphs, Entangled networks and all that
L. Donetti, F. Neri, M. A. Munoz
May 2006, arXiv:cond-mat/0605565v2 [cond-mat.other]
url: http://arxiv.org/abs/cond-mat/0605565
13. Bra-Ket notation
Wikipedia article (includes a link to the full text of Berkeley lecture notes), March 2012
url: http://en.wikipedia.org/wiki/Bra-ket_notation
url: http://bohr.physics.berkeley.edu/classes/221/1112/notes/hilbert.pdf
14. Matters Computational (Algorithms for Programmers)
Jörg Arndt
(c) 2011, Springer, ISBN 978-3-642-14763-0
Dec 2010 online edition, url: http://www.jjj.de/fxt/fxtbook.pdf
15. The Theory of Error-Correcting Codes
F. J. MacWilliams, N. J. A. Sloane
(c) 1977 by North-Holland Publishing Co., ISBN 0-444-85009-0
16. Error Correction Coding, Mathematical Methods and Algorithms
T. K. Moon
(c) 2005 by John Wiley & Sons, Inc., ISBN 0-471-64800-0
17. Code Tables
A. E. Brouwer, M. Grassl
Web repository, 2012, url: http://www.codetables.de/
18. Representation Theory of Finite Groups
B. Steinberg
(c) 2011, Springer, ISBN 978-1-4614-0775-1
19. Matrix Analysis
R. A. Horn, C. R. Johnson
(c) 1985 Cambridge Univ. Press, 1990 edition, ISBN 0-521-30586-1
20. Compressive Sensing Resources
C. S. web repository by Rice university
http://dsp.rice.edu/cs
21. Ordered Orthogonal Arrays and Where to Find Them
R. Schürer
University of Salzburg, PhD thesis, 2006
url: http://mint.sbg.ac.at/rudi/projects/corrected_diss.pdf
22. MinT Database (Digital Nets, Orthogonal Arrays and Linear Codes)
W. C. Schmid, R. Schürer
url: http://mint.sbg.ac.at/index.php
23. Walsh Transforms, Balanced Sum Theorems and Partition Coefficients over Multary Alphabets
M. T. Iglesias, A. Verschoren, B. Naudts, C. Vidal
Proc. GECCO '05 (Genetic and Evolutionary Computation), 2005

Claims
1. A method of constructing a network for the transfer of data from a source device to a destination device, the method comprising:
selecting a base symmetric network structure, wherein the topology of the base symmetric network structure substantially corresponds to a Cayley graph;
defining at least one of:
a number of source and destination devices to be connected to the network,
a number of switches to be used in the network,
a number of ports per switch, and
an oversubscription characteristic of the network;
determining a generator matrix as a function of at least one of:
the number of source and destination devices to be connected to the network,
the number of switches to be used in the network,
the number of ports per switch, and
the oversubscription characteristic of the network;
determining a wiring pattern for interconnecting each of the switches as a function of the generator matrix; and
interconnecting the switches of the network with interconnecting wires according to the wiring pattern.
2. The method according to claim 1 wherein the base network structure substantially corresponds to a hypercube having a dimension d.
3. The method according to claim 2 wherein the generator matrix is determined as a function of the number of interconnections between switches of the network and the dimension, d, of the hypercube.
4. The method according to claim 1 wherein the generator matrix is an error correcting code (ECC) generating matrix and the wiring pattern is determined by rotating the error correcting code generating matrix.
5. The method according to claim 4, wherein the error correcting code generating matrix is rotated counterclockwise.
6. The method according to claim 1 wherein the oversubscription characteristic of the network is determined as a function of a number of ports defined for connection to source computers and destination computers and a bisection of the network.
7. The method according to claim 6 wherein the bisection is determined as a function of a Walsh function.
8. The method according to claim 7 wherein the bisection is determined by constructing primary equipartitions defined by patterns of 1's and 0's in a Walsh function.
9. The method according to claim 7 wherein the bisection is determined by constructing primary equipartitions defined by the sign pattern in an algebraic Walsh function.
10. The method according to claim 1 wherein the generator matrix is an error correcting code (ECC) generating matrix derived from digital (t,m,s) nets parameters and the wiring pattern is determined by rotating the error correcting code generating matrix.
11. The method according to claim 4 wherein ECC distance metrics are constructed using a Lee distance.
12. The method according to claim 4 wherein ECC distance metrics are constructed using a Hamming distance.
13. A network constructed according to the method of claim 1.
14. A network constructed by connecting a plurality of switches, the network comprising a defined number of switches, each switch being connected to at least one other switch by an internal switch connection and the network including a defined number of internal switch connections;
the switches being arranged in a symmetric network structure, wherein the topology of the base symmetric network structure substantially corresponds to a Cayley graph;
the switches being interconnected according to a wiring pattern, the wiring pattern being determined as a function of a generator matrix, wherein the generator matrix is determined as a function of the number of internal switch connections.
15. A network according to claim 14 wherein the base network structure substantially corresponds to a hypercube having a dimension d.
16. A network according to claim 14 wherein the generator matrix is determined as a function of the number of internal switch connections and the dimension, d, of the hypercube.
17. A network according to claim 14 wherein the generator matrix is an error correcting code generating matrix and the wiring pattern is determined by rotating the error correcting code generating matrix.
18. A network according to claim 17, wherein the error correcting code generating matrix is rotated counterclockwise.
19. A network according to claim 14, wherein the generator matrix is determined as a function of at least one of:
a number of source and destination devices to be connected to the network, the number of switches used in the network,
a number of ports per switch, and
an oversubscription characteristic of the network.
20. A network according to claim 19 wherein the oversubscription characteristic of the network is determined as a function of the number of ports defined for connection to source and destination devices and a bisection of the network.
PCT/US2012/036960 2011-05-08 2012-05-08 Flexible radix switching network WO2012154751A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP12723560.4A EP2708000B1 (en) 2011-05-08 2012-05-08 Flexible radix switching network
CA2872831A CA2872831C (en) 2011-05-08 2012-05-08 Flexible radix switch

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201161483687P 2011-05-08 2011-05-08
US201161483686P 2011-05-08 2011-05-08
US61/483,686 2011-05-08
US61/483,687 2011-05-08

Publications (1)

Publication Number Publication Date
WO2012154751A1 true WO2012154751A1 (en) 2012-11-15

Family

ID=46149727

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2012/036960 WO2012154751A1 (en) 2011-05-08 2012-05-08 Flexible radix switching network

Country Status (4)

Country Link
US (1) US8830873B2 (en)
EP (1) EP2708000B1 (en)
CA (1) CA2872831C (en)
WO (1) WO2012154751A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156282A (en) * 2014-08-15 2014-11-19 上海斐讯数据通信技术有限公司 System image file backup system and method
WO2015180559A1 (en) * 2014-05-26 2015-12-03 华为技术有限公司 Fault detection method and apparatus for service chain
US9363204B2 (en) 2013-04-22 2016-06-07 Nant Holdings Ip, Llc Harmonized control planes, systems and methods
US9678800B2 (en) 2014-01-30 2017-06-13 International Business Machines Corporation Optimum design method for configuration of servers in a data center environment
CN110719170A (en) * 2019-08-30 2020-01-21 南京航空航天大学 Bit-level image encryption method based on compressed sensing and optimized coupling mapping grid
US11233712B2 (en) * 2016-07-22 2022-01-25 Intel Corporation Technologies for data center multi-zone cabling

Families Citing this family (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8300798B1 (en) 2006-04-03 2012-10-30 Wai Wu Intelligent communication routing system and method
US9008510B1 (en) * 2011-05-12 2015-04-14 Google Inc. Implementation of a large-scale multi-stage non-blocking optical circuit switch
JP5790312B2 (en) * 2011-08-25 2015-10-07 富士通株式会社 COMMUNICATION METHOD, COMMUNICATION DEVICE, AND COMMUNICATION PROGRAM
EP2759100B1 (en) * 2011-10-26 2015-03-04 International Business Machines Corporation Optimising data transmission in a hypercube network
US9515920B2 (en) * 2012-04-20 2016-12-06 Futurewei Technologies, Inc. Name-based neighbor discovery and multi-hop service discovery in information-centric networks
US9473422B1 (en) * 2012-05-09 2016-10-18 Google Inc. Multi-stage switching topology
US8983816B2 (en) * 2012-06-18 2015-03-17 International Business Machines Corporation Efficient evaluation of network robustness with a graph
US9184999B1 (en) 2013-03-15 2015-11-10 Google Inc. Logical topology in a dynamic data center network
EP3480956B1 (en) * 2013-03-15 2021-01-06 The Regents of the University of California Network architectures for boundary-less hierarchical interconnects
US9246760B1 (en) 2013-05-29 2016-01-26 Google Inc. System and method for reducing throughput loss responsive to network expansion
CN103516613A (en) * 2013-09-25 2014-01-15 汉柏科技有限公司 Quick message forwarding method
US9548960B2 (en) 2013-10-06 2017-01-17 Mellanox Technologies Ltd. Simplified packet routing
US9166692B1 (en) * 2014-01-28 2015-10-20 Google Inc. Network fabric reconfiguration
US10218538B1 (en) * 2014-01-30 2019-02-26 Google Llc Hybrid Clos-multidimensional topology for data center networks
NO2776466T3 (en) 2014-02-13 2018-01-20
US9729473B2 (en) 2014-06-23 2017-08-08 Mellanox Technologies, Ltd. Network high availability using temporary re-routing
US9806994B2 (en) 2014-06-24 2017-10-31 Mellanox Technologies, Ltd. Routing via multiple paths with efficient traffic distribution
CN105337866B (en) * 2014-06-30 2019-09-20 华为技术有限公司 A kind of flow switching method and device
US9699067B2 (en) 2014-07-22 2017-07-04 Mellanox Technologies, Ltd. Dragonfly plus: communication over bipartite node groups connected by a mesh network
US9690734B2 (en) * 2014-09-10 2017-06-27 Arjun Kapoor Quasi-optimized interconnection network for, and method of, interconnecting nodes in large-scale, parallel systems
CN105704180B (en) * 2014-11-27 2019-02-26 英业达科技有限公司 The configuration method and its system of data center network
GB2529736B (en) 2014-12-24 2017-11-22 Airties Kablosuz Iletism Sanayi Ve Disticaret As Mesh islands
EP3975429A1 (en) 2015-02-22 2022-03-30 Flex Logix Technologies, Inc. Mixed-radix and/or mixed-mode switch matrix architecture and integrated circuit
US9894005B2 (en) 2015-03-31 2018-02-13 Mellanox Technologies, Ltd. Adaptive routing controlled by source node
US9973435B2 (en) 2015-12-16 2018-05-15 Mellanox Technologies Tlv Ltd. Loopback-free adaptive routing
US9699078B1 (en) * 2015-12-29 2017-07-04 International Business Machines Corporation Multi-planed unified switching topologies
KR102389028B1 (en) * 2016-01-04 2022-04-22 한국전자통신연구원 Apparatus and method for high speed data transfer between virtual desktop
US9893950B2 (en) * 2016-01-27 2018-02-13 International Business Machines Corporation Switch-connected HyperX network
US10819621B2 (en) 2016-02-23 2020-10-27 Mellanox Technologies Tlv Ltd. Unicast forwarding of adaptive-routing notifications
US10225153B2 (en) * 2016-04-18 2019-03-05 International Business Machines Corporation Node discovery mechanisms in a switchless network
US10225185B2 (en) 2016-04-18 2019-03-05 International Business Machines Corporation Configuration mechanisms in a switchless network
US10218601B2 (en) 2016-04-18 2019-02-26 International Business Machines Corporation Method, system, and computer program product for configuring an attribute for propagating management datagrams in a switchless network
US10178029B2 (en) 2016-05-11 2019-01-08 Mellanox Technologies Tlv Ltd. Forwarding of adaptive routing notifications
JP6623939B2 (en) * 2016-06-06 2019-12-25 富士通株式会社 Information processing apparatus, communication procedure determination method, and communication program
US9780948B1 (en) * 2016-06-15 2017-10-03 ISARA Corporation Generating integers for cryptographic protocols
CN106126315A (en) * 2016-06-17 2016-11-16 广东工业大学 A kind of virtual machine distribution method in the data center of minimization communication delay
US10681131B2 (en) 2016-08-29 2020-06-09 Vmware, Inc. Source network address translation detection and dynamic tunnel creation
US10225103B2 (en) 2016-08-29 2019-03-05 Vmware, Inc. Method and system for selecting tunnels to send network traffic through
CN109952744B (en) 2016-09-26 2021-12-14 河谷控股Ip有限责任公司 Method and equipment for providing virtual circuit in cloud network
CN106533777B (en) * 2016-11-29 2018-08-10 广东工业大学 Method and system are determined based on the intelligent transformer substation information flow path of matrix ranks
US10263883B2 (en) * 2016-12-14 2019-04-16 International Business Machines Corporation Data flow configuration in hybrid system of silicon and micro-electro-mechanical-switch (MEMS) elements
US10200294B2 (en) 2016-12-22 2019-02-05 Mellanox Technologies Tlv Ltd. Adaptive routing based on flow-control credits
US10614055B2 (en) * 2016-12-29 2020-04-07 EMC IP Holding Company LLC Method and system for tree management of trees under multi-version concurrency control
JP6834771B2 (en) * 2017-05-19 2021-02-24 富士通株式会社 Communication device and communication method
US10862755B2 (en) * 2017-06-30 2020-12-08 Oracle International Corporation High-performance data repartitioning for cloud-scale clusters
CN109327409B (en) * 2017-07-31 2020-09-18 华为技术有限公司 Data center network DCN, method for transmitting flow in DCN and switch
US10931637B2 (en) * 2017-09-15 2021-02-23 Palo Alto Networks, Inc. Outbound/inbound lateral traffic punting based on process risk
US10855656B2 (en) * 2017-09-15 2020-12-01 Palo Alto Networks, Inc. Fine-grained firewall policy enforcement using session app ID and endpoint process ID correlation
FR3076142A1 (en) * 2017-12-21 2019-06-28 Bull Sas METHOD AND SERVER OF TOPOLOGICAL ADDRESS ALLOCATION TO NETWORK SWITCHES, COMPUTER PROGRAM AND CLUSTER OF CORRESPONDING SERVERS
US10809926B2 (en) 2018-02-05 2020-10-20 Microsoft Technology Licensing, Llc Server system
CN110139325B (en) * 2018-02-09 2021-08-13 华为技术有限公司 Network parameter tuning method and device
US10644995B2 (en) 2018-02-14 2020-05-05 Mellanox Technologies Tlv Ltd. Adaptive routing in a box
US11005724B1 (en) 2019-01-06 2021-05-11 Mellanox Technologies, Ltd. Network topology having minimal number of long connections among groups of network elements
US11184245B2 (en) 2020-03-06 2021-11-23 International Business Machines Corporation Configuring computing nodes in a three-dimensional mesh topology
US10812264B1 (en) * 2020-04-30 2020-10-20 ISARA Corporation Traversing a zigzag path tree topology in a supersingular isogeny-based cryptosystem
US11948077B2 (en) * 2020-07-02 2024-04-02 Dell Products L.P. Network fabric analysis
US11575594B2 (en) 2020-09-10 2023-02-07 Mellanox Technologies, Ltd. Deadlock-free rerouting for resolving local link failures using detour paths
US11411911B2 (en) 2020-10-26 2022-08-09 Mellanox Technologies, Ltd. Routing across multiple subnetworks using address mapping
US11870682B2 (en) 2021-06-22 2024-01-09 Mellanox Technologies, Ltd. Deadlock-free local rerouting for handling multiple local link failures in hierarchical network topologies
US11765103B2 (en) 2021-12-01 2023-09-19 Mellanox Technologies, Ltd. Large-scale network with high port utilization

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5844887A (en) * 1995-11-30 1998-12-01 Scorpio Communications Ltd. ATM switching fabric

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH03500104A (en) * 1988-06-20 1991-01-10 United States of America Interconnect network
US5684959A (en) * 1995-04-19 1997-11-04 Hewlett-Packard Company Method for determining topology of a network
US6046988A (en) * 1995-11-16 2000-04-04 Loran Network Systems Llc Method of determining the topology of a network of objects
EP0861546B1 (en) * 1995-11-16 2004-04-07 Loran Network Systems, L.L.C. Method of determining the topology of a network of objects
US5793975A (en) * 1996-03-01 1998-08-11 Bay Networks Group, Inc. Ethernet topology change notification and nearest neighbor determination
US6697338B1 (en) * 1999-10-28 2004-02-24 Lucent Technologies Inc. Determination of physical topology of a communication network
JP4163023B2 (en) * 2003-02-28 2008-10-08 三菱電機株式会社 Parity check matrix generation method and parity check matrix generation apparatus
US7369513B1 (en) * 2003-05-16 2008-05-06 Cisco Technology, Inc. Method and apparatus for determining a network topology based on Spanning-tree-Algorithm-designated ports
CN1771684B (en) * 2003-05-28 2011-01-26 三菱电机株式会社 Re-transmission control method and communication device
US9109904B2 (en) * 2007-06-28 2015-08-18 Apple Inc. Integration of map services and user applications in a mobile device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5844887A (en) * 1995-11-30 1998-12-01 Scorpio Communications Ltd. ATM switching fabric

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JOHN D. DIXON: "Groups with a Cayley graph isomorphic to a hypercube", BULLETIN OF THE AUSTRALIAN MATHEMATICAL SOCIETY, vol. 55, no. 03, 1 June 1997 (1997-06-01), pages 385, XP055033996, ISSN: 0004-9727, DOI: 10.1017/S0004972700034055 *
LAKSHMIVARAHAN S ET AL: "Ring, torus and hypercube architectures/algorithms for parallel computing", PARALLEL COMPUTING, ELSEVIER PUBLISHERS, AMSTERDAM, NL, vol. 25, no. 13-14, 1 December 1999 (1999-12-01), pages 1877 - 1906, XP004363665, ISSN: 0167-8191, DOI: 10.1016/S0167-8191(99)00069-1 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9363204B2 (en) 2013-04-22 2016-06-07 Nant Holdings Ip, Llc Harmonized control planes, systems and methods
US10110509B2 (en) 2013-04-22 2018-10-23 Nant Holdings Ip, Llc Harmonized control planes, systems and methods
US10924427B2 (en) 2013-04-22 2021-02-16 Nant Holdings Ip, Llc Harmonized control planes, systems and methods
US9678800B2 (en) 2014-01-30 2017-06-13 International Business Machines Corporation Optimum design method for configuration of servers in a data center environment
WO2015180559A1 (en) * 2014-05-26 2015-12-03 华为技术有限公司 Fault detection method and apparatus for service chain
US10181989B2 (en) 2014-05-26 2019-01-15 Huawei Technologies Co., Ltd. Service chain fault detection method and apparatus
US11032174B2 (en) 2014-05-26 2021-06-08 Huawei Technologies Co., Ltd. Service chain fault detection method and apparatus
US11831526B2 (en) 2014-05-26 2023-11-28 Huawei Technologies Co., Ltd. Service chain fault detection method and apparatus
CN104156282A (en) * 2014-08-15 2014-11-19 上海斐讯数据通信技术有限公司 System image file backup system and method
US11233712B2 (en) * 2016-07-22 2022-01-25 Intel Corporation Technologies for data center multi-zone cabling
CN110719170A (en) * 2019-08-30 2020-01-21 南京航空航天大学 Bit-level image encryption method based on compressed sensing and optimized coupling mapping grid

Also Published As

Publication number Publication date
US20130083701A1 (en) 2013-04-04
CA2872831C (en) 2019-10-29
US8830873B2 (en) 2014-09-09
EP2708000B1 (en) 2020-03-25
CA2872831A1 (en) 2012-11-15
EP2708000A1 (en) 2014-03-19

Similar Documents

Publication Publication Date Title
EP2708000B1 (en) Flexible radix switching network
CA2831607C (en) Network transpose box and switch operation based on backplane ethernet
AU2011305638B2 (en) Transpose box based network scaling
Lebiednik et al. A survey and evaluation of data center network topologies
Wang et al. NovaCube: A low latency Torus-based network architecture for data centers
JP2015512584A (en) Packet flow interconnect fabric
WO2012040237A1 (en) Transpose boxes for network interconnection
Li et al. GBC3: A versatile cube-based server-centric network for data centers
US20110202682A1 (en) Network structure for data center unit interconnection
Dominicini et al. Polka: Polynomial key-based architecture for source routing in network fabrics
Camarero et al. Random folded Clos topologies for datacenter networks
Peñaranda et al. The k-ary n-direct s-indirect family of topologies for large-scale interconnection networks
Zhang et al. Space‐memory‐memory Clos‐network switches with in‐sequence service
Sharma et al. A comprehensive survey on data center network architectures
Castillo A comprehensive DCell network topology model for a data center
Tomic Optimal networks from error correcting codes
Li et al. Permutation generation for routing in BCube connected crossbars
CA2982147A1 (en) Direct interconnect gateway
US10218538B1 (en) Hybrid Clos-multidimensional topology for data center networks
Ashok Kumar et al. Simple, efficient location‐based routing for data center network using IP address hierarchy
Moraveji et al. Multispanning tree zone-ordered label-based routing algorithms for irregular networks
Hosomi et al. Dual-plane isomorphic hypercube network
Li et al. ABCCC: An advanced cube based network for data centers
Huang et al. SCautz: a high performance and fault-tolerant datacenter network for modular datacenters
Tomic Network Throughput Optimization via Error Correcting Codes

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12723560

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2012723560

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2872831

Country of ref document: CA