CN1327669C - Grouped parallel inputting/outputting service method for telecommunication - Google Patents


Info

Publication number: CN1327669C
Application number: CNB2004100232541A (CN200410023254A)
Authority: CN (China)
Prior art keywords: data, communication, ION, node, disk
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN1585380A
Inventors: 卢凯, 迟万庆, 冯华
Current assignee: National University of Defense Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: National University of Defense Technology
Priority date: (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Application filed by National University of Defense Technology
Priority to CNB2004100232541A
Publication of CN1585380A
Application granted; publication of CN1327669C
Expired - Fee Related; anticipated expiration


Abstract

The present invention discloses a communication-directed grouped parallel input/output service method. Addressing the shortcomings of existing I/O service methods, the invention solves the technical problems of poor access performance for small discrete data, lack of support for the MPMD programming model, and the large number and size of generated messages, and provides a communication-directed grouped parallel I/O service method with higher I/O service performance. The technical scheme adds a communication-group structure and a major-node structure to the high-performance computer architecture; on that basis, the synchronizing effect of the collective I/O (CIO) operation interface yields larger disk access bandwidth, read and write operations are handled separately, and both use a grouped-packing message transmission mode. The invention reduces the number and size of communication messages, masks forwarding overhead, and parallelizes communication with disk operation; it thereby effectively relieves the communication bottleneck, shortens I/O access service time, and improves the overall service performance of a parallel file system.

Description

Communication-directed grouped parallel I/O service method
Technical field
The present invention relates to I/O technology in the computer field, in particular to parallel I/O service methods for massively parallel processing systems.
Background technology
Because of the imbalance between the development of microprocessor technology and that of disk I/O technology, I/O has become one of the main bottlenecks limiting the overall performance of present massively parallel computer systems, and parallel I/O is an important technical means of alleviating this bottleneck. A massively parallel computer system distributes files across the disks of multiple I/O nodes through a parallel file system, and provides high-performance parallel I/O service through many I/O channels and parallelized disk I/O service.
Current mainstream parallel I/O subsystem structures can be divided into centralized and distributed parallel I/O structures. Nodes in a high-performance computer system can be divided by their main function into processing nodes (PN) and I/O nodes (ION). A PN is mainly responsible for computation; user applications run on it. An ION is equipped with disks and is mainly responsible for handling the I/O requests of the PNs. In the centralized parallel I/O structure, PNs and IONs have different node structures: a PN has no disk and is responsible only for computation, while an ION has disks and handles only I/O tasks; IONs and PNs communicate over a high-speed interconnect. In the distributed I/O structure, IONs and PNs have the same node structure and their functions are not strictly divided: every processor can both run user applications and attach disks to handle I/O accesses.
Although a parallel I/O subsystem provides enormous raw I/O access bandwidth, user I/O access patterns vary widely in practice, and the actual I/O bandwidth users obtain under different application environments can differ greatly from the raw bandwidth the system provides. Studying I/O service methods suited to different application environments is therefore the key to improving the service performance of a parallel I/O system.
There are currently three main parallel I/O service methods: the traditional cache method, the two-phase service method, and disk-directed parallel I/O.
1. The most traditional parallel I/O service method is the cache method commonly used in file systems. An I/O request first probes the in-memory caches of the compute nodes and I/O nodes; if the data the user wants is buffered in the cache it is returned immediately, otherwise the disk is accessed. Because scientific computing generally issues large I/O requests, the in-memory cache hit rate of a parallel file system is often low and the I/O service performance is poor.
2. In the paper "Improving Parallel I/O Performance using Two-Phase Access Strategy", published in Computer Architecture News in November 1993, J. del Rosario proposed the two-phase service method, exploiting the strong regularity of I/O accesses in parallel scientific computing applications. It divides the I/O service process into two phases: a sequential I/O phase at the I/O nodes and a data redistribution phase among the compute nodes. The two-phase method exploits the high speed of sequential disk access to obtain larger disk bandwidth, but it wastes valuable compute-node memory and easily suffers congestion during the redistribution phase. The method suits only large collective I/O (Collective I/O, CIO) operations.
3. In the technical report "Disk-directed I/O for MIMD Multiprocessors", published in July 1994, David Kotz proposed the disk-directed parallel I/O service method. Targeting large-scale parallel scientific computing applications, whose I/O accesses are strongly synchronized and whose requests are large, it schedules requests according to the physical distribution of the data on disk and obtains higher disk bandwidth. Parallel file systems adopting disk-directed I/O include PANDA, developed by Y. Chen et al., and Galley, developed by Nils Nieuwejaar. But disk-directed parallel I/O still has many limitations, chiefly:
a) Disk-directed I/O provides large-bandwidth service only for large accesses and improves the latency of small accesses little. Since reducing application execution time is the basic goal, the method cannot satisfy both targets of I/O service, low latency and large bandwidth, at once.
b) Disk-directed I/O and its operation interface provide no optimization for different files or different access patterns; they support only the single-program multiple-data (SPMD) programming model and cannot support the multiple-program multiple-data (MPMD) model.
c) Disk-directed I/O produces a large message volume. Disk performance has improved substantially, and for near-contiguous data the disk access speed approaches the actual user-visible performance of the high-speed interconnect. When a user accesses small discrete data, the large number of network messages produced by disk-directed I/O makes the network transmission overhead far exceed the disk access overhead, and network communication becomes the new bottleneck.
Summary of the invention
The technical problem to be solved by the invention is the above limitations of existing parallel I/O service methods. Based on the performance of present disks and communication systems, it proposes a communication-directed grouped parallel I/O service method (Communication Directed Group Input Output, CDGIO) that effectively improves access performance for small discrete data; it provides optimization for different files and different access patterns, supporting both the SPMD and the MPMD programming model; and, using multithreading and a transmission mode that packs many small messages into grouped bundles, it overcomes to a certain extent the communication bottleneck of disk-directed parallel I/O and provides higher overall I/O service performance.
The technical scheme of the invention is communication-directed: it balances disk performance against communication bandwidth and targets the coexistence of large contiguous accesses and small scattered accesses in parallel scientific computing. A communication-group structure and a major-node structure are designed into the high-performance computer architecture, and a communication-directed grouped-packing transmission mechanism reduces message count and communication overhead, so that disk access largely covers communication, reducing total I/O service time and yielding better I/O service performance.
On top of the PN and ION nodes of the high-performance computer architecture, a communication-group structure and a major-node structure are added. A communication group is a set of PNs; within a group the PNs communicate in a tightly coupled mode with higher transmission bandwidth and lower transmission delay, while communication between groups uses a looser mode. Intra-group bandwidth and delay are better than those between groups, and communication within one group does not disturb message transmission inside other groups. The number of communication groups is the square root of the total number of processing nodes. Each communication group has a group leader node, normally the PN with the smallest number in the group, though any PN may be chosen flexibly. The group leader runs a receiving thread and an unpack/forward thread, and is responsible for communication between the whole group and the IONs and for data redistribution. The major node of the invention may be any PN; it runs the processing thread of the collective I/O (CIO, Collective I/O) operation interface. This thread synchronizes all I/O requests participating in a CIO operation. When each PN issues a request through the CIO interface, all requests are first sent to the major node. Once the requests of all PNs participating in the CIO operation have arrived, the major node merges them into one big packet and broadcasts it to all IONs. Each ION that finishes serving all its I/O requests sends a completion signal to the major node. After the completion signals of all IONs have arrived, the major node sends a completion signal to each PN, and each PN concludes the current CIO operation.
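The major node's synchronization step above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation; the class and method names are assumptions.

```python
class MajorNode:
    """Sketch of the major node's CIO request synchronization
    (names are assumptions, not from the patent)."""

    def __init__(self, n_participants):
        self.n = n_participants
        self.pending = {}  # pn_id -> request, for the current CIO operation

    def submit(self, pn_id, request):
        """A PN posts its request through the CIO interface; returns None
        while the group is incomplete, else the merged broadcast packet."""
        self.pending[pn_id] = request
        if len(self.pending) < self.n:
            return None  # still waiting for other participating PNs
        merged = [self.pending[i] for i in sorted(self.pending)]
        self.pending = {}  # ready for the next CIO operation
        return merged  # one big packet, broadcast to every ION
```

The point of the design is that the IONs see one merged packet per CIO operation rather than one message per PN, which is what later enables disk-order scheduling across all requests at once.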
Based on the communication-group and major-node structures, the invention obtains larger disk access bandwidth through the synchronizing effect of the CIO operation interface, handles read and write operations separately, and uses the grouped-packing message transmission mode in both to reduce the large number of small messages that existing parallel I/O service methods generate for small discrete data accesses, thereby avoiding the network communication bottleneck. The read operation proceeds as follows:
1. Each PN sends its read request to the major node.
2. The major node checks whether the requests of all PNs belonging to the same CIO operation group have arrived. Once the read requests of all participating PNs have reached the major node, the CIO interface synchronizing thread of the major node merges them into one big packet and broadcasts it to all IONs.
3. Each ION queries its local metadata server and schedules every read request according to the file metadata and the physical distribution of the data on disk provided by that server, accessing the disk data in the order of its on-disk layout to obtain optimal access performance.
4. Using a multithreaded pipelined mode, each ION packs the data it has read per communication group and sends it to the group leaders. Each ION opens two data buffers for pipelined sending; the disk I/O thread reads data from disk into the idle buffer according to the scheduled request sequence, and the sending thread decides, from the number of PNs covered by the data in the buffer, whether to use the grouped mode. When the covered-PN count exceeds a threshold τ, the grouped mode is used; otherwise the ION packs only the discrete data belonging to the same PN and sends it directly to that PN. The threshold τ is determined by the ratio of communication overhead to disk overhead: t_disk is the time for the disk to read one buffer, t_net_bandwidth is the transmission bandwidth of the high-speed interconnect, s_blk is the file data block size, and t_net_setup is the setup time of one packet.
τ = (t_disk − s_blk / t_net_bandwidth) / t_net_setup
Because the transmission bandwidth of the high-speed interconnect is large, the network transmission term in the formula above can be neglected, and τ can be approximated by:
τ ≈ t_disk / t_net_setup
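The two formulas can be evaluated directly. In the sketch below, the 8 KB block, 12 MB/s disk bandwidth, and 20 µs setup delay are taken from the Fig. 6 discussion later in the text; the 100 MB/s interconnect bandwidth is an assumption needed only for the exact formula.

```python
def compute_tau(t_disk, s_blk, net_bandwidth, t_net_setup):
    """Exact and approximate covered-PN thresholds from the two formulas:
    grouping pays off once per-block message setup time would exceed the
    time the disk needs to read the next buffer."""
    exact = (t_disk - s_blk / net_bandwidth) / t_net_setup
    approx = t_disk / t_net_setup
    return exact, approx

t_disk = 8 * 1024 / (12 * 1024 * 1024)  # ~0.65 ms to read one 8 KB block
exact, approx = compute_tau(t_disk, 8 * 1024, 100 * 1024 * 1024, 20e-6)
# the approximation comes out just above 32, consistent with the
# tau = 32 used in the Fig. 6 experiment
```

Note that the approximation is always an overestimate of the exact value, since dropping the s_blk/t_net_bandwidth term only enlarges the numerator.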
When data are distributed at fine granularity, one buffer often covers more PNs than the threshold τ; message sending overhead then exceeds the disk overhead and disk operation cannot cover communication. When the number of PNs covered by the data exceeds τ, the grouped sending mode is used: each ION packs the data belonging to the same communication group into one message bag and sends it to the group leader. When the number of PNs covered by the data in the buffer is below τ, disk operation can cover communication and no grouping is performed, i.e. each PN is its own group and each ION packs and sends only the data belonging to the same PN. When multiple IONs send data to multiple PNs simultaneously in the same order, receive conflicts are likely at the receivers. To eliminate conflicts, the invention uses a circulating pipelined message mode when each ION sends to the group leaders. Because CIO operations are synchronous, all IONs finish reading at almost the same moment and send to the group leaders at once; if they all followed the same order, conflicts would be likely. In the circulating mode, each ION starts from the receiver matching its own sequence number: for example, ION1 sends to PN1, PN2, PN3 in turn, while ION2 cycles starting from PN2, then PN3, PN4. Because the starting nodes differ, this avoids the situation in which all IONs pick PN1 as their first target and, since PN1 can receive only one message (ION1's) at a time, ION2 must wait for PN1 to finish receiving ION1's data before it can begin sending:
4.1 In the grouped sending mode, the group leader uses multithreaded pipelining to receive and forward the data belonging to the other members of its group: it opens two data buffers, one for receiving data and one for unpacking and sending; when the receiving thread has received a block, it notifies the unpack/forward thread to unpack it; the unpack/forward thread determines from the destination information in the data header whether a segment is local data or data of another PN, writes local data directly into the user buffer, and forwards non-local data to its destination node; the destination PN receives and unpacks the data in the same way as the group leader.
4.2 In the non-grouped sending mode, the ION sends data directly to each PN; each PN uses a double-buffer structure to receive and unpack data in parallel with multithreaded pipelining, and there is no forwarding step.
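The circulating send order described in step 4 can be written down as a schedule. This is a sketch of the ordering rule only (one-based ION/PN numbering replaced by zero-based indices for brevity); receivers here may be PNs or group leaders.

```python
def send_schedule(n_ion, n_pn):
    """Circulating pipelined send order: ION i starts at receiver i
    (mod n_pn) and cycles, so synchronized IONs never target the same
    receiver at the same step (provided n_ion <= n_pn)."""
    return {ion: [(ion + step) % n_pn for step in range(n_pn)]
            for ion in range(n_ion)}
```

For 3 IONs and 4 receivers, ION 0 sends to 0, 1, 2, 3 while ION 1 sends to 1, 2, 3, 0: at every step the targets are pairwise distinct, which is exactly the conflict-free property the text claims for the staggered starting points.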
The invention likewise uses the grouped-packing message transmission mode for write requests; the procedure is:
1) Each PN participating in the CIO operation sends its write request to the major node for synchronization.
2) Once the write requests of all participants in the CIO operation have arrived, the major node merges the write requests of all PNs and broadcasts them to all IONs.
3) Each ION determines, from the file distribution information provided by its local metadata server, which PNs' data will be written to this ION.
4) The ION broadcasts this data metadata to all PNs.
5) When the number of PNs participating in this CIO operation exceeds the threshold τ, the grouped-packing transmission mode is used; when it is below τ, data are sent directly:
i. In the grouped mode, the data each PN will write are sent as a whole to the group leader of its group. Before sending, each PN uses two-thread, double-buffered parallel transmission to pack the discrete data scattered in its user buffer. The group leader receives and forwards the data of the other group members with multithreaded pipelining: it opens two data buffers, one for receiving data and one for unpacking and sending; when the receiving thread has received a block, it notifies the unpack/forward thread to unpack; the unpack/forward thread unpacks the data according to the destination information in the headers, repacks it by destination ION, and forwards it to the destination IONs.
ii. In the direct mode, each PN packs the data belonging to the same ION together and sends it to that destination ION.
6) In either mode, the ION uses double buffering to receive, unpack, and schedule data concurrently, and finally writes it to disk in one pass.
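The group leader's repacking step on the write path (step 5.i) amounts to regrouping incoming per-PN bundles by destination ION. A minimal sketch, with hypothetical names and data shapes; the real system works on buffers and headers, not Python tuples.

```python
def repack_by_ion(bundles, ion_of_block):
    """Group-leader repacking on the write path (sketch): unpack each
    member PN's bundle and regroup its blocks into one outgoing packed
    message per destination ION.

    bundles      -- list of (pn_id, [(block_no, payload), ...])
    ion_of_block -- maps a file block number to its ION, per the
                    broadcast file distribution metadata (step 4)
    """
    outgoing = {}
    for pn_id, blocks in bundles:
        for blk_no, payload in blocks:
            dest = ion_of_block(blk_no)
            outgoing.setdefault(dest, []).append((pn_id, blk_no, payload))
    return outgoing  # one packed message per destination ION
```

The effect is that each ION receives one message per communication group instead of one per writing PN, mirroring the read-path grouping.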
Compared with other existing parallel I/O methods, the invention achieves the following technical effects:
1. Packing reduces the number and size of communication messages. In the non-grouped mode, packing sends the discontinuous data segments of a user space in one packet, reducing the message count; in the grouped mode, packing per communication group the data belonging to many PNs reduces the number of messages one data block must send serially, markedly cutting communication setup time.
2. Multithreaded pipelining, with double-buffered concurrent disk I/O and network transmission, effectively covers the forwarding overhead and parallelizes communication with disk operation.
3. Grouped and non-grouped modes are flexibly unified, and I/O requests are scheduled from the distribution information of the data on disk; this effectively resolves the communication bottleneck, reduces I/O access service time, and improves the overall service performance of the parallel file system.
4. The invention merges and packs the discrete data belonging to the same PN before transmission, reducing communication volume and curing the defect of disk-directed parallel I/O that discrete data segments belonging to the same PN produce multiple network messages.
5. Based on the principle of trading space for time, the invention uses double buffering and a small set of PN nodes acting as communication group leaders to provide forwarding service, obtaining higher I/O service performance at the cost of some kilobytes of memory on the PN nodes.
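Effect 1 can be quantified for the per-buffer setup cost. A sketch under the patent's own group-count rule (√P groups for P processing nodes); the concrete numbers are illustrative, not measurements.

```python
import math

def per_buffer_setup(n_pn, t_setup, grouped):
    """Messages per data buffer and their total setup cost: one message
    per covered PN in direct mode, one per communication group
    (sqrt(n_pn) groups, as the patent prescribes) in grouped mode."""
    n_msgs = math.isqrt(n_pn) if grouped else n_pn
    return n_msgs, n_msgs * t_setup

# 64 covered PNs at a 20 us setup delay:
# direct mode  -> 64 messages, 1.28 ms of setup per buffer
# grouped mode ->  8 messages, 0.16 ms of setup per buffer
```

The setup cost thus shrinks from P·t_setup to √P·t_setup per buffer, which is why the grouped mode keeps disk operation able to cover communication even as the PN count grows.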
Description of drawings
Fig. 1 is traditional high-performance computer system parallel I/O system construction drawing;
Fig. 2 is the system construction drawing of parallel I of the present invention/O method of servicing;
Fig. 3 is read operation service procedure figure of the present invention;
Fig. 4 is write operation service procedure figure of the present invention;
Fig. 5 is that pipeline system message circulation of the present invention sends schematic diagram;
Fig. 6 is the present invention with the performance test comparison diagram towards the parallel I/O method of servicing of disk.
Embodiment
Fig. 1 shows the parallel I/O system structure of a current mainstream high-performance computer system. Nodes are divided by main function into processing nodes (PN) and I/O nodes (ION). A PN is mainly responsible for computation and runs user applications; it sends I/O requests to the IONs. An ION is equipped with disks and is mainly responsible for handling the PNs' I/O requests. By the relation of ION to PN, the parallel I/O structure is divided into the centralized structure (Fig. 1(a)) and the distributed structure (Fig. 1(b)). In the centralized structure a PN has no disk and performs only computation, while an ION has disks, handles I/O tasks exclusively, and runs no user applications; IONs and PNs communicate over a high-speed interconnect, and the IONs service the I/O requests the PNs send. In the distributed structure the functions of ION and PN are not strictly divided: every processor can both run user applications and attach disks to handle I/O accesses.
Fig. 2 shows the system structure of the parallel I/O service method of the invention. On top of the PN and ION nodes, the invention adds the communication-group structure and the major-node structure. A communication group is a set of PNs; by design, intra-group communication bandwidth and delay are better than those between groups, and communication within one group does not disturb message transmission inside other groups. The PNs inside a dashed frame in the figure form one communication group; within a group the PNs use a tightly coupled communication mode with higher transmission bandwidth and lower transmission delay, while inter-group communication uses a looser mode. The number of communication groups is the square root of the total PN count. Each communication group has a group leader, served by the PN with the smallest number in the group; the black PN in the figure is a group leader. The group leader runs a receiving thread and an unpack/forward thread and is responsible for communication between the whole group and the IONs and for data redistribution. The white PN in the figure is the major node; it runs the CIO operation synchronizing thread and is responsible for synchronizing CIO requests. When each PN issues a request through the CIO interface, all requests are first sent to the major node, which synchronizes them. Once the requests of all PNs participating in the CIO operation have arrived, the major node merges them into one big packet and broadcasts it to all IONs. Each ION that finishes serving all its I/O requests sends a completion signal to the major node. After the completion signals of all IONs have arrived, the major node sends a completion signal to each PN, and each PN concludes the current CIO operation.
Fig. 3 illustrates the main service flow of the read operation of the invention. The read process is as follows:
1) Each PN participating in the CIO operation sends its read request to the major node;
2) The major node checks whether the requests of all PNs belonging to the same CIO operation have arrived; once the read requests of all participating PNs have reached the major node, its CIO interface synchronizing thread merges them into one big packet and broadcasts it to all IONs;
3) Each ION queries its local metadata server and schedules every read request according to the file metadata and the physical on-disk distribution of the data that the server provides, accessing the data in the order of its on-disk layout to obtain optimal disk I/O performance;
4) Using multithreaded pipelining, each ION packs the data it has read per communication group and sends it to the group leaders, opening two data buffers for pipelined sending: the disk I/O thread reads data from disk into the idle buffer according to the scheduled request sequence, and the sending thread decides, from whether the number of PNs covered by the data in the buffer exceeds the threshold τ, whether to use the grouped mode. When the covered-PN count exceeds τ, the invention uses the grouped sending mode, and each ION packs the data belonging to the same communication group into one message bag sent to the group leader; when the covered-PN count is below τ, disk operation can cover communication, and the invention performs no grouping, i.e. each PN is its own group.
5) In the grouped sending mode of the invention, the group leader uses multithreaded pipelining to receive and forward the data belonging to the other group members: it opens two data buffers, one for receiving data and one for unpacking and sending; when the receiving thread has received a block, it notifies the unpack/forward thread to unpack the data; the unpack/forward thread determines from the destination information in the data header whether a segment is local data or another PN's, writes local data directly into the user buffer, and forwards non-local data to its destination node; the destination PN receives and unpacks data in the same way as the group leader.
6) In the non-grouped sending mode of the invention, the ION sends data directly to each PN; each PN uses a double-buffer structure to receive and unpack data in parallel with multithreaded pipelining, and there is no forwarding step.
Fig. 4 illustrates the main service flow of the write operation of the invention. The write operation is the inverse of the read operation and likewise uses the grouped-packing technique. The concrete method is as follows:
1) Each PN participating in the CIO operation sends its write request to the major node for synchronization.
2) Once the write requests of all participants in the CIO operation have arrived, the major node merges the write requests and broadcasts them to all IONs.
3) Each ION determines, from the file distribution information provided by its local metadata server, which PNs' data will be written to this ION.
4) The ION broadcasts this data metadata to all PNs.
5) When the number of PNs participating in this CIO operation exceeds the threshold τ, the invention uses the grouped-packing transmission mode; when it is below τ, the invention sends data directly:
i. In the grouped mode, the data each PN will write are sent as a whole to the group leader of its group. Before sending, each PN uses two-thread, double-buffered parallel transmission to pack the discrete data scattered in its user buffer. The group leader receives and forwards the data of the other group members with multithreaded pipelining: it opens two data buffers, one for receiving data and one for unpacking and sending; when the receiving thread has received a block, it notifies the unpack/forward thread to unpack; the unpack/forward thread unpacks the data according to the destination information in the headers, repacks it by destination ION, and forwards it to the destination IONs.
ii. In the direct mode, each PN packs the data belonging to the same ION together and sends it to that destination ION.
6) In either mode, the ION uses double buffering to receive, unpack, and schedule data concurrently, and finally writes it to disk in one pass.
Fig. 5 illustrates the round-robin pipelined message transmission mechanism of the present invention. When the number of PNs covered by the data exceeds the threshold, the present invention uses the grouped send mode to reduce message traffic, and uses a round-robin send order when the IONs send data to the group leaders. The left side of the figure shows the number of the ION sending each message; each box represents one message, and the number in the box is the destination PN of that message. Because of the synchronous nature of the CIO operation, all IONs finish reading their data at almost the same moment and send messages to the communication group leaders. If multiple IONs then send to multiple PNs in the same order at the same time, receive conflicts at the receivers are very likely. As shown in Fig. 5(a), when every ION takes PN1 as its first destination, PN1 can receive only one message at a time (ION1's), so ION2 must wait until ION1 has finished sending to PN1 before it can begin. With the round-robin pipelined message mode of the present invention, shown in Fig. 5(b), each ION starts its sends at a different PN number: ION1 starts from PN1, ION2 from PN2, and so on. Message sends then essentially never conflict, so in Fig. 5(b) ION2 can send to PN2 at the same time as ION1 sends to PN1. The larger the messages being sent, the more time the round-robin send order saves.
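The staggered order of Fig. 5(b) is easy to state as a schedule: at step s, ION i targets PN (i + s) mod n. The sketch below (a hypothetical helper, using 0-based numbering where the patent numbers PNs from 1) shows why no two IONs ever hit the same PN in the same step, provided there are at least as many PNs as IONs.

```python
# Round-robin pipelined send schedule of Fig. 5(b): ION i begins its send
# sequence at PN i, so each step targets pairwise-distinct receivers.
# 0-based numbering is an assumption of this sketch (the patent uses 1-based).

def send_schedule(num_ion, num_pn):
    """Return schedule[step][ion] = index of the PN that ION `ion`
    sends to at time `step`."""
    return [[(ion + step) % num_pn for ion in range(num_ion)]
            for step in range(num_pn)]


# With 3 IONs and 4 PNs, step 0 targets PN0, PN1, PN2 — all distinct,
# unlike the naive order of Fig. 5(a) where every ION first targets PN0.
```

Because the offsets differ by the ION index, the targets within one step are distinct modulo the PN count, which eliminates the receive conflicts described above.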
Fig. 6 compares the bandwidth of the CDGIO method of the present invention against a disk-oriented parallel I/O service method when one ION reads 3 MB of data. The data is distributed at fine granularity among the PNs with a distribution unit of 8 bytes, the MPI communication setup delay is 20 us, and the disk peak bandwidth is 12 MB/s. Under this distribution, one 8 KB data block generates 1024 small messages, so the disk-oriented parallel I/O service method suffers excessive communication overhead and poor system I/O performance. In Fig. 6 the abscissa is the number of PN nodes and the ordinate is the I/O bandwidth. Once the number of PNs covered by one data block exceeds the threshold of τ = 32 nodes, the disk I/O operations can no longer cover the communication overhead, and the performance of the disk-oriented parallel I/O service method drops rapidly. When the message count for one data block is below the threshold, the present invention uses the same direct transmission mechanism as the disk-oriented method, so the two perform comparably. When the message count exceeds the threshold τ, the present invention adopts grouped transmission, reducing the transmission overhead generated per data block; the disk operations of the IONs can then effectively cover the communication, preventing a communication performance bottleneck. Fig. 6 shows that, because CDGIO uses grouping to reduce message traffic, the disk performance is fully exploited in the CDGIO algorithm even as the PN count grows, reaching nearly 10 MB/s with system performance remaining essentially constant, whereas the disk-oriented parallel I/O service method, limited by its communication bottleneck, cannot exploit the system's disk performance and its overall I/O performance falls off quickly.
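The figures quoted above can be checked with simple arithmetic: under 8-byte fine-grained striping, the per-message setup latency alone dwarfs the disk read time for an 8 KB block, which is exactly why unpacked transmission becomes the bottleneck. The sketch below only restates the numbers given in the text.

```python
# Back-of-envelope check of the quoted figures: 8-byte stripes, 20 us MPI
# setup delay, 12 MB/s disk peak bandwidth, one 8 KB data block.

BLOCK = 8 * 1024          # one 8 KB data block, bytes
STRIPE = 8                # fine-grained distribution unit, bytes
T_SETUP = 20e-6           # MPI message setup latency, seconds
DISK_BW = 12e6            # disk peak bandwidth, bytes/second

num_msgs = BLOCK // STRIPE            # 1024 small messages per block
t_setup_total = num_msgs * T_SETUP    # ~20.5 ms of pure setup latency
t_disk = BLOCK / DISK_BW              # ~0.68 ms to read the block

# Setup latency alone is roughly 30x the disk read time, so without
# packing, disk I/O cannot possibly hide the communication cost.
```

This ~30x gap is what the grouped-packing mode eliminates: one packed message per group replaces the 1024 stripe-sized messages.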
The present invention is communication-oriented: it balances the relationship between disk performance and communication bandwidth and exploits the characteristic of parallel scientific computing that large centralized accesses and small scattered accesses coexist. It proposes the communication-group and major-node structures and adopts a two-level grouped transmission strategy; through communication-oriented packed data transmission it refines the bottleneck component, reduces message traffic, and lowers communication overhead, so that disk access covers the communication well. This shortens the overall I/O execution time and yields better I/O service performance. Evaluation shows that the present invention is the more efficient parallel I/O service method under current disk and high-speed interconnect performance conditions.
The present invention has been implemented on a high-performance server developed by the National University of Defense Technology running the Digital UNIX operating system. The present invention adopts a threading mechanism and is designed on the Digital UNIX thread library. The invention is not, however, limited to any particular hardware platform or operating system: the service method can easily be ported to other environments, such as Linux, FreeBSD, and the Galaxy Kylin (KYLIN) operating system, and is therefore widely applicable.

Claims (3)

1. A communication-oriented grouped parallel I/O service method, characterized in that, on the basis of the PN nodes and ION nodes of a high-performance computer architecture, a communication-group structure and a major-node structure are added; based on the communication-group and major-node structures, a larger disk access bandwidth is obtained through the synchronizing effect of the CIO operation interface; read operations and write operations are handled separately, and both use the grouped-packing message transmission mode:
1.1 The read operation proceeds as follows:
1.1.1 Each PN node sends its read request to the major node;
1.1.2 The major node checks whether all PN requests belonging to the same CIO operation group have arrived; once the read requests of all PNs participating in the CIO operation have reached the major node, the CIO interface synchronization thread of the major node accumulates the read requests of the PNs into one large packet and broadcasts it to all IONs;
1.1.3 Each ION node queries the metadata server of that node and schedules each read request according to the file metadata of the node and the physical on-disk distribution information provided by the metadata server, accessing the disk data in the order in which it is laid out on disk;
1.1.4 Each ION node uses a multithreaded pipelined approach to pack the data it has read, one packet per communication group, and send it to the communication group leader. Each ION opens two data buffers for pipelined sending: the disk I/O thread reads data from disk into the idle data buffer according to the scheduled request sequence, and the sending thread decides, from the number of PNs covered by the data in the buffer, whether to send to the PN nodes in grouped mode: grouped mode is used when the covered number exceeds the threshold τ; when it does not exceed τ, the ION simply packs the data belonging to the same PN and sends it to that PN directly. The threshold τ is determined by the ratio of communication overhead to disk overhead: let t_disk be the time for the disk to read one block buffer, t_net_bandwidth the transmission bandwidth of the high-speed interconnect network, s_blk the file data block size, and t_net_setup the communication setup time of one packet; then
τ = (t_disk − s_blk / t_net_bandwidth) / t_net_setup
Because the transmission bandwidth of the high-speed interconnect is large, the above formula can be further simplified to
τ ≈ t_disk / t_net_setup
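Plugging in the example figures from the description (8 KB block, 12 MB/s disk, 20 us setup) reproduces the τ = 32 threshold used in the evaluation. The interconnect bandwidth of 100 MB/s below is an assumption of this sketch; the other values come from the text.

```python
# Numeric sketch of the threshold formula. NET_BW (100 MB/s) is an
# assumed value for the high-speed interconnect; S_BLK, DISK_BW and
# T_SETUP come from the example in the description.

S_BLK = 8 * 1024      # file data block size s_blk, bytes
DISK_BW = 12e6        # disk bandwidth, bytes/s
NET_BW = 100e6        # assumed interconnect bandwidth t_net_bandwidth
T_SETUP = 20e-6       # per-packet setup time t_net_setup, seconds

t_disk = S_BLK / DISK_BW                       # time to read one buffer
tau = (t_disk - S_BLK / NET_BW) / T_SETUP      # exact formula
tau_approx = t_disk / T_SETUP                  # dropping the transfer term

# tau is about 30 and tau_approx about 34, consistent with the tau = 32
# threshold quoted in the evaluation; the approximation holds because the
# transfer term S_BLK/NET_BW is small relative to t_disk.
```

The simplification is safe precisely when the interconnect is fast: the neglected term shifts τ by only a few messages.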
When the data is distributed at fine granularity and the number of PNs covered by the data exceeds the threshold τ, the grouped send mode is used: each ION packs the data belonging to the same communication group into one message and sends it to that group's communication group leader. When the number of PNs covered by the data in the buffer is below the threshold τ, the disk operations can cover the communication, so no grouping is performed: each PN forms its own group, and each ION packs and sends only the data belonging to a single PN;
1.1.4.1 In the grouped send mode, the communication group leader uses a multithreaded pipelined technique to receive and forward the data belonging to the other members of its group: the group leader opens two data buffers, one for receiving data and one for unpacking and sending; after the receiving thread has received a block of data, it notifies the unpack-and-send thread; the unpack-and-send thread determines, from the destination information in the data header, whether each data segment is local data or belongs to another PN, writing local data directly into the user buffer and sending non-local data on to its destination node; the destination PN node receives and unpacks the data in the same manner as the communication group leader;
1.1.4.2 In the non-grouped send mode, the ION sends data directly to each PN node; each PN node uses a double-buffer structure, receiving and unpacking data in parallel with multithreaded pipelining, and there is no forwarding step in this case;
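The double-buffer pipeline of step 1.1.4 — a disk thread filling one buffer while a send thread drains the other — can be sketched with two threads and a two-slot queue. This is a minimal illustration under assumed names (`pipeline`, `read_block`, `send_block`); the patent's implementation uses the Digital UNIX thread library rather than Python.

```python
import queue
import threading

# Minimal double-buffering sketch: at most two buffers are in flight, so
# the disk thread reads block i+1 while the send thread transmits block i,
# overlapping disk I/O with communication.


def pipeline(blocks, read_block, send_block):
    """read_block(i) -> data for block i; send_block(data) delivers it."""
    slots = queue.Queue(maxsize=2)      # the two data buffers

    def disk_thread():
        for i in range(blocks):
            slots.put(read_block(i))    # blocks when both buffers are busy
        slots.put(None)                 # end-of-stream marker

    t = threading.Thread(target=disk_thread)
    t.start()
    while (data := slots.get()) is not None:
        send_block(data)                # overlaps with the next disk read
    t.join()
```

The bounded queue is what enforces the two-buffer discipline: the disk thread stalls only when both buffers hold unsent data, exactly the back-pressure behaviour the description implies.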
1.2 The write operation likewise uses the grouped-packing message transmission mode, and proceeds as follows:
1.2.1 Each PN node participating in the CIO operation sends its write request to the major node for synchronization;
1.2.2 After the write requests of all participants in the CIO operation have arrived, the major node merges the write requests of the PNs and broadcasts the result to all IONs;
1.2.3 Each ION node determines, from the file distribution information provided by its local metadata server, which PNs' data will be written to this ION;
1.2.4 Each ION broadcasts the above data metadata information to all PN nodes;
1.2.5 When the number of PNs participating in this CIO operation exceeds the threshold τ, the grouped-packing transmission mode is adopted; when the number of PNs is below the threshold τ, direct sending is used:
1.2.5.1 In grouped mode, each PN sends all the data it will write as a single whole to the communication group leader of its group; during transmission, each PN uses a two-thread, double-buffered parallel transmission technique to pack the scattered data in its user buffer before sending, and the communication group leader uses a multithreaded pipelined technique to receive and forward the data belonging to the other members of its group: the group leader opens two data buffers, one for receiving data and one for unpacking and sending; after the receiving thread has received a block of data, it notifies the unpack-and-send thread; the unpack-and-send thread unpacks the data according to the destination information in the data header, repacks it by destination ION, and forwards it to that ION;
1.2.5.2 In direct mode, each PN packs the data belonging to the same ION together and sends it to that destination ION;
1.2.6 In either grouped or direct mode, each ION node uses double buffering to concurrently receive, unpack, and schedule the data, then writes it to disk in a single pass.
2. The communication-oriented grouped parallel I/O service method of claim 1, characterized in that the communication group is a set of PNs, designed so that PNs within a group communicate in a tightly coupled mode while communication between groups uses a loosely coupled mode; the intra-group communication bandwidth and latency are better than the bandwidth and latency between groups, and communication within one group does not disturb message transmission inside other communication groups; the number of communication groups is the square root of the total number of processing nodes; each communication group has one communication group leader node, which may be any PN node, on which a receiving thread and an unpack/forward thread run, responsible for all communication between the group and the IONs and for the redistribution of data; the major node may likewise be any PN node, on which the cluster I/O (CIO) operation-interface processing thread runs; this thread synchronizes all I/O requests participating in a CIO operation: when each PN issues a request through the CIO operation interface, all requests are first sent to the major node; after the requests of all PNs participating in the CIO operation have arrived, the major node accumulates them into one large packet and broadcasts it to all IONs; each ION that finishes serving all its I/O requests sends a finish signal to the major node, and after the finish signals of all IONs have arrived, the major node sends a finish signal to each PN, whereupon each PN node concludes the current CIO operation.
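The major-node protocol of claim 2 is a gather–merge–broadcast barrier. The sketch below models it as a single function rather than a distributed system; all names (`cio_sync`, the dict packet format, the "done" signals) are illustrative assumptions.

```python
# Sketch of the major-node CIO synchronization of claim 2, modeled
# sequentially: gather one request per PN, merge into one packet,
# broadcast to every ION, then release the PNs only after all IONs finish.


def cio_sync(pn_requests, ion_count):
    """pn_requests: one I/O request per PN in the CIO operation group."""
    # step 1: all PN requests arrive; accumulate them into one big packet
    merged = {"requests": list(pn_requests)}
    # step 2: broadcast the merged packet to every ION
    ion_inbox = [merged for _ in range(ion_count)]
    # step 3: each ION serves its requests and reports completion
    finished = sum(1 for _ in range(ion_count))
    # step 4: only when every ION has signalled completion does the
    # major node send each PN its finish signal
    if finished != ion_count:
        raise RuntimeError("CIO operation incomplete")
    return ion_inbox, ["done"] * len(pn_requests)
```

The key property is the double barrier: PNs block until all requests are merged, and again until all IONs report completion, which is what makes the CIO operation collective.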
3. The communication-oriented grouped parallel I/O service method of claim 1, characterized in that, in the read operation, a round-robin pipelined message mode is used when the IONs send data to the communication group leaders, namely each ION starts sending from the PN whose number matches the ION's own sequence number.
CNB2004100232541A 2004-05-28 2004-05-28 Grouped parallel inputting/outputting service method for telecommunication Expired - Fee Related CN1327669C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100232541A CN1327669C (en) 2004-05-28 2004-05-28 Grouped parallel inputting/outputting service method for telecommunication


Publications (2)

Publication Number Publication Date
CN1585380A CN1585380A (en) 2005-02-23
CN1327669C true CN1327669C (en) 2007-07-18

Family

ID=34600753

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100232541A Expired - Fee Related CN1327669C (en) 2004-05-28 2004-05-28 Grouped parallel inputting/outputting service method for telecommunication

Country Status (1)

Country Link
CN (1) CN1327669C (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101662414B (en) * 2008-08-30 2011-09-14 成都市华为赛门铁克科技有限公司 Method, system and device for processing data access
CN102981912B (en) * 2012-11-06 2015-05-20 无锡江南计算技术研究所 Method and system for resource distribution
CN105786447A (en) * 2014-12-26 2016-07-20 乐视网信息技术(北京)股份有限公司 Method and apparatus for processing data by server and server

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6016315A (en) * 1997-04-30 2000-01-18 Vlsi Technology, Inc. Virtual contiguous FIFO for combining multiple data packets into a single contiguous stream
US6065087A (en) * 1998-05-21 2000-05-16 Hewlett-Packard Company Architecture for a high-performance network/bus multiplexer interconnecting a network and a bus that transport data using multiple protocols
CN1431808A (en) * 2003-01-27 2003-07-23 西安电子科技大学 Large capacity and expandable packet switching network structure


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
并行文件系统中的最优化服务关系研究 (Research on the Optimal Service Relationship in Parallel File Systems), Shang Linfeng, Lu Kai, Luo Yu, Computer Science and Engineering, Vol. 23, No. 2, 2001 *

Also Published As

Publication number Publication date
CN1585380A (en) 2005-02-23

Similar Documents

Publication Publication Date Title
CN101873253A (en) Buffered crossbar switch system
CN107046510B (en) Node suitable for distributed computing system and system composed of nodes
CN103365726B (en) A kind of method for managing resource towards GPU cluster and system
CN102298539A (en) Method and system for scheduling shared resources subjected to distributed parallel treatment
May et al. Transputer and Routers: components for concurrent machines
CN102299843A (en) Network data processing method based on graphic processing unit (GPU) and buffer area, and system thereof
CN110995598A (en) Variable-length message data processing method and scheduling device
CN105426260A (en) Distributed system supported transparent interprocess communication system and method
CN1327669C (en) Grouped parallel inputting/outputting service method for telecommunication
CN1501640A (en) Method and system for transmitting Ethernet data using multiple E1 lines
CN105049372A (en) Method of expanding message middleware throughput and system thereof
JP2007532052A (en) Scalable network for management of computing and data storage
CN106909624B (en) Real-time sequencing optimization method for mass data
CN103176850A (en) Electric system network cluster task allocation method based on load balancing
Vaidya et al. LAPSES: A recipe for high performance adaptive router design
CN115391053B (en) Online service method and device based on CPU and GPU hybrid calculation
CN101188556A (en) Efficient multicast forward method based on sliding window in share memory switching structure
Wang et al. A BSP-based parallel iterative processing system with multiple partition strategies for big graphs
Woodside et al. Alternative software architectures for parallel protocol execution with synchronous IPC
Qin et al. Fault tolerant storage and data access optimization in data center networks
Ravindran et al. On topology and bisection bandwidth of hierarchical-ring networks for shared-memory multiprocessors
Zhu et al. A full distributed web crawler based on structured network
Wu et al. A practical packet reordering mechanism with flow granularity for parallelism exploiting in network processors
Qi et al. MapReduce-based data stream processing over large history data
CN1506860A (en) Resource locating system under network environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070718

Termination date: 20130528