CN1327669C - Grouped parallel inputting/outputting service method for telecommunication - Google Patents


Info

Publication number: CN1327669C
Application number: CNB2004100232541A (CN200410023254A)
Authority: CN (China)
Prior art keywords: data, communication, ION, node, disk
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN1585380A
Inventors: 卢凯, 迟万庆, 冯华
Current assignee: National University of Defense Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: National University of Defense Technology
Priority date: (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Application filed by National University of Defense Technology
Priority to CNB2004100232541A
Publication of CN1585380A
Application granted; publication of CN1327669C
Expired - Fee Related; anticipated expiration


Abstract

The present invention discloses a communication-directed grouped parallel input/output service method. Addressing the shortcomings of existing I/O service methods, the invention solves the technical problems of poor access performance for small discrete data, lack of support for the MPMD programming model, and the large number and size of generated messages, and provides a communication-directed grouped parallel I/O service method with higher I/O service performance. The technical scheme adds a communication-group structure and a major-node structure to the high-performance computer architecture; on that basis, the synchronizing effect of the collective I/O (CIO) operation interface yields larger disk access bandwidth, read and write operations are handled separately, and both use a grouped-packing message transmission mode. The invention reduces the number and size of communication messages, masks forwarding overhead, and parallelizes communication with disk operation; it thereby effectively relieves the communication bottleneck, shortens I/O access service time, and improves the overall service performance of a parallel file system.

Description

Communication-directed grouped parallel I/O service method
Technical field
The present invention relates to I/O technology in the computer field, in particular to parallel I/O service methods for massively parallel processing systems.
Background technology
Because of the imbalance between the development of microprocessor technology and that of disk I/O technology, I/O has become one of the main bottlenecks limiting the overall performance of present massively parallel computer systems, and parallel I/O is an important technical means of alleviating this bottleneck. A massively parallel computer system distributes files across the disks of multiple I/O nodes through a parallel file system, and provides high-performance parallel I/O service through many I/O channels and parallelized disk I/O service.
Current mainstream parallel I/O subsystem structures can be divided into centralized and distributed parallel I/O structures. Nodes in a high-performance computer system can be divided by their main function into processing nodes (PN) and I/O nodes (ION). A PN is mainly responsible for computation; user applications run on it. An ION is equipped with disks and is mainly responsible for handling the I/O requests of the PNs. In the centralized parallel I/O structure, PNs and IONs have different node structures: a PN has no disk and is responsible only for computation, while an ION has disks and handles only I/O tasks; IONs and PNs communicate over a high-speed interconnect. In the distributed I/O structure, IONs and PNs have the same node structure and their functions are not strictly divided: every processor can both run user applications and attach disks to handle I/O accesses.
Although a parallel I/O subsystem provides enormous raw I/O access bandwidth, user I/O access patterns vary widely in practice, and the actual I/O bandwidth users obtain under different application environments can differ greatly from the raw bandwidth the system provides. Studying I/O service methods suited to different application environments is therefore the key to improving the service performance of a parallel I/O system.
There are currently three main parallel I/O service methods: the traditional cache method, the two-phase service method, and disk-directed parallel I/O.
1. The most traditional parallel I/O service method is the cache method commonly used in file systems. An I/O request first probes the in-memory caches of the compute nodes and I/O nodes; if the data the user wants is buffered in the cache it is returned immediately, otherwise the disk is accessed. Because scientific computing generally issues large I/O requests, the in-memory cache hit rate of a parallel file system is often low and the I/O service performance is poor.
2. In the paper "Improving Parallel I/O Performance using Two-Phase Access Strategy", published in Computer Architecture News in November 1993, J. del Rosario proposed the two-phase service method, exploiting the strong regularity of I/O accesses in parallel scientific computing applications. It divides the I/O service process into two phases: a sequential I/O phase at the I/O nodes and a data redistribution phase among the compute nodes. The two-phase method exploits the high speed of sequential disk access to obtain larger disk bandwidth, but it wastes valuable compute-node memory and easily suffers congestion during the redistribution phase. The method suits only large collective I/O (Collective I/O, CIO) operations.
3. In the technical report "Disk-directed I/O for MIMD Multiprocessors", published in July 1994, David Kotz proposed the disk-directed parallel I/O service method. Targeting large-scale parallel scientific computing applications, whose I/O accesses are strongly synchronized and whose requests are large, it schedules requests according to the physical distribution of the data on disk and obtains higher disk bandwidth. Parallel file systems adopting disk-directed I/O include PANDA, developed by Y. Chen et al., and Galley, developed by Nils Nieuwejaar. But disk-directed parallel I/O still has many limitations, chiefly:
a) Disk-directed I/O provides large-bandwidth service only for large accesses and improves the latency of small accesses little. Since reducing application execution time is the basic goal, the method cannot satisfy both targets of I/O service, low latency and large bandwidth, at once.
b) Disk-directed I/O and its operation interface provide no optimization for different files or different access patterns; they support only the single-program multiple-data (SPMD) programming model and cannot support the multiple-program multiple-data (MPMD) model.
c) Disk-directed I/O produces a large message volume. Disk performance has improved substantially, and for near-contiguous data the disk access speed approaches the actual user-visible performance of the high-speed interconnect. When a user accesses small discrete data, the large number of network messages produced by disk-directed I/O makes the network transmission overhead far exceed the disk access overhead, and network communication becomes the new bottleneck.
Summary of the invention
The technical problem to be solved by the invention is the above limitations of existing parallel I/O service methods. Based on the performance of present disks and communication systems, it proposes a communication-directed grouped parallel I/O service method (Communication Directed Group Input Output, CDGIO) that effectively improves access performance for small discrete data; it provides optimization for different files and different access patterns, supporting both the SPMD and the MPMD programming model; and, using multithreading and a transmission mode that packs many small messages into grouped bundles, it overcomes to a certain extent the communication bottleneck of disk-directed parallel I/O and provides higher overall I/O service performance.
The technical scheme of the invention is communication-directed: it balances disk performance against communication bandwidth and targets the coexistence of large contiguous accesses and small scattered accesses in parallel scientific computing. A communication-group structure and a major-node structure are designed into the high-performance computer architecture, and a communication-directed grouped-packing transmission mechanism reduces message count and communication overhead, so that disk access largely covers communication, reducing total I/O service time and yielding better I/O service performance.
On top of the PN and ION nodes of the high-performance computer architecture, a communication-group structure and a major-node structure are added. A communication group is a set of PNs; within a group the PNs communicate in a tightly coupled mode with higher transmission bandwidth and lower transmission delay, while communication between groups uses a looser mode. Intra-group bandwidth and delay are better than those between groups, and communication within one group does not disturb message transmission inside other groups. The number of communication groups is the square root of the total number of processing nodes. Each communication group has a group leader node, normally the PN with the smallest number in the group, though any PN may be chosen flexibly. The group leader runs a receiving thread and an unpack/forward thread, and is responsible for communication between the whole group and the IONs and for data redistribution. The major node of the invention may be any PN; it runs the processing thread of the collective I/O (CIO, Collective I/O) operation interface. This thread synchronizes all I/O requests participating in a CIO operation. When each PN issues a request through the CIO interface, all requests are first sent to the major node. Once the requests of all PNs participating in the CIO operation have arrived, the major node merges them into one big packet and broadcasts it to all IONs. Each ION that finishes serving all its I/O requests sends a completion signal to the major node. After the completion signals of all IONs have arrived, the major node sends a completion signal to each PN, and each PN concludes the current CIO operation.
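The major node's synchronization step above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation; the class and method names are assumptions.

```python
class MajorNode:
    """Sketch of the major node's CIO request synchronization
    (names are assumptions, not from the patent)."""

    def __init__(self, n_participants):
        self.n = n_participants
        self.pending = {}  # pn_id -> request, for the current CIO operation

    def submit(self, pn_id, request):
        """A PN posts its request through the CIO interface; returns None
        while the group is incomplete, else the merged broadcast packet."""
        self.pending[pn_id] = request
        if len(self.pending) < self.n:
            return None  # still waiting for other participating PNs
        merged = [self.pending[i] for i in sorted(self.pending)]
        self.pending = {}  # ready for the next CIO operation
        return merged  # one big packet, broadcast to every ION
```

The point of the design is that the IONs see one merged packet per CIO operation rather than one message per PN, which is what later enables disk-order scheduling across all requests at once.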
Based on the communication-group and major-node structures, the invention obtains larger disk access bandwidth through the synchronizing effect of the CIO operation interface, handles read and write operations separately, and uses the grouped-packing message transmission mode in both to reduce the large number of small messages that existing parallel I/O service methods generate for small discrete data accesses, thereby avoiding the network communication bottleneck. The read operation proceeds as follows:
1. Each PN sends its read request to the major node.
2. The major node checks whether the requests of all PNs belonging to the same CIO operation group have arrived. Once the read requests of all participating PNs have reached the major node, the CIO interface synchronizing thread of the major node merges them into one big packet and broadcasts it to all IONs.
3. Each ION queries its local metadata server and schedules every read request according to the file metadata and the physical distribution of the data on disk provided by that server, accessing the disk data in the order of its on-disk layout to obtain optimal access performance.
4. Using a multithreaded pipelined mode, each ION packs the data it has read per communication group and sends it to the group leaders. Each ION opens two data buffers for pipelined sending; the disk I/O thread reads data from disk into the idle buffer according to the scheduled request sequence, and the sending thread decides, from the number of PNs covered by the data in the buffer, whether to use the grouped mode. When the covered-PN count exceeds a threshold τ, the grouped mode is used; otherwise the ION packs only the discrete data belonging to the same PN and sends it directly to that PN. The threshold τ is determined by the ratio of communication overhead to disk overhead: t_disk is the time for the disk to read one buffer, t_net_bandwidth is the transmission bandwidth of the high-speed interconnect, s_blk is the file data block size, and t_net_setup is the setup time of one packet.
τ = (t_disk − s_blk / t_net_bandwidth) / t_net_setup
Because the transmission bandwidth of the high-speed interconnect is large, the network transmission term in the formula above can be neglected, and τ can be approximated by:
τ ≈ t_disk / t_net_setup
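The two formulas can be evaluated directly. In the sketch below, the 8 KB block, 12 MB/s disk bandwidth, and 20 µs setup delay are taken from the Fig. 6 discussion later in the text; the 100 MB/s interconnect bandwidth is an assumption needed only for the exact formula.

```python
def compute_tau(t_disk, s_blk, net_bandwidth, t_net_setup):
    """Exact and approximate covered-PN thresholds from the two formulas:
    grouping pays off once per-block message setup time would exceed the
    time the disk needs to read the next buffer."""
    exact = (t_disk - s_blk / net_bandwidth) / t_net_setup
    approx = t_disk / t_net_setup
    return exact, approx

t_disk = 8 * 1024 / (12 * 1024 * 1024)  # ~0.65 ms to read one 8 KB block
exact, approx = compute_tau(t_disk, 8 * 1024, 100 * 1024 * 1024, 20e-6)
# the approximation comes out just above 32, consistent with the
# tau = 32 used in the Fig. 6 experiment
```

Note that the approximation is always an overestimate of the exact value, since dropping the s_blk/t_net_bandwidth term only enlarges the numerator.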
When data are distributed at fine granularity, one buffer often covers more PNs than the threshold τ; message sending overhead then exceeds the disk overhead and disk operation cannot cover communication. When the number of PNs covered by the data exceeds τ, the grouped sending mode is used: each ION packs the data belonging to the same communication group into one message bag and sends it to the group leader. When the number of PNs covered by the data in the buffer is below τ, disk operation can cover communication and no grouping is performed, i.e. each PN is its own group and each ION packs and sends only the data belonging to the same PN. When multiple IONs send data to multiple PNs simultaneously in the same order, receive conflicts are likely at the receivers. To eliminate conflicts, the invention uses a circulating pipelined message mode when each ION sends to the group leaders. Because CIO operations are synchronous, all IONs finish reading at almost the same moment and send to the group leaders at once; if they all followed the same order, conflicts would be likely. In the circulating mode, each ION starts from the receiver matching its own sequence number: for example, ION1 sends to PN1, PN2, PN3 in turn, while ION2 cycles starting from PN2, then PN3, PN4. Because the starting nodes differ, this avoids the situation in which all IONs pick PN1 as their first target and, since PN1 can receive only one message (ION1's) at a time, ION2 must wait for PN1 to finish receiving ION1's data before it can begin sending:
4.1 In the grouped sending mode, the group leader uses multithreaded pipelining to receive and forward the data belonging to the other members of its group: it opens two data buffers, one for receiving data and one for unpacking and sending; when the receiving thread has received a block, it notifies the unpack/forward thread to unpack it; the unpack/forward thread determines from the destination information in the data header whether a segment is local data or data of another PN, writes local data directly into the user buffer, and forwards non-local data to its destination node; the destination PN receives and unpacks the data in the same way as the group leader.
4.2 In the non-grouped sending mode, the ION sends data directly to each PN; each PN uses a double-buffer structure to receive and unpack data in parallel with multithreaded pipelining, and there is no forwarding step.
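The circulating send order described in step 4 can be written down as a schedule. This is a sketch of the ordering rule only (one-based ION/PN numbering replaced by zero-based indices for brevity); receivers here may be PNs or group leaders.

```python
def send_schedule(n_ion, n_pn):
    """Circulating pipelined send order: ION i starts at receiver i
    (mod n_pn) and cycles, so synchronized IONs never target the same
    receiver at the same step (provided n_ion <= n_pn)."""
    return {ion: [(ion + step) % n_pn for step in range(n_pn)]
            for ion in range(n_ion)}
```

For 3 IONs and 4 receivers, ION 0 sends to 0, 1, 2, 3 while ION 1 sends to 1, 2, 3, 0: at every step the targets are pairwise distinct, which is exactly the conflict-free property the text claims for the staggered starting points.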
The invention likewise uses the grouped-packing message transmission mode for write requests; the procedure is:
1) Each PN participating in the CIO operation sends its write request to the major node for synchronization.
2) Once the write requests of all participants in the CIO operation have arrived, the major node merges the write requests of all PNs and broadcasts them to all IONs.
3) Each ION determines, from the file distribution information provided by its local metadata server, which PNs' data will be written to this ION.
4) The ION broadcasts this data metadata to all PNs.
5) When the number of PNs participating in this CIO operation exceeds the threshold τ, the grouped-packing transmission mode is used; when it is below τ, data are sent directly:
i. In the grouped mode, the data each PN will write are sent as a whole to the group leader of its group. Before sending, each PN uses two-thread, double-buffered parallel transmission to pack the discrete data scattered in its user buffer. The group leader receives and forwards the data of the other group members with multithreaded pipelining: it opens two data buffers, one for receiving data and one for unpacking and sending; when the receiving thread has received a block, it notifies the unpack/forward thread to unpack; the unpack/forward thread unpacks the data according to the destination information in the headers, repacks it by destination ION, and forwards it to the destination IONs.
ii. In the direct mode, each PN packs the data belonging to the same ION together and sends it to that destination ION.
6) In either mode, the ION uses double buffering to receive, unpack, and schedule data concurrently, and finally writes it to disk in one pass.
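The group leader's repacking step on the write path (step 5.i) amounts to regrouping incoming per-PN bundles by destination ION. A minimal sketch, with hypothetical names and data shapes; the real system works on buffers and headers, not Python tuples.

```python
def repack_by_ion(bundles, ion_of_block):
    """Group-leader repacking on the write path (sketch): unpack each
    member PN's bundle and regroup its blocks into one outgoing packed
    message per destination ION.

    bundles      -- list of (pn_id, [(block_no, payload), ...])
    ion_of_block -- maps a file block number to its ION, per the
                    broadcast file distribution metadata (step 4)
    """
    outgoing = {}
    for pn_id, blocks in bundles:
        for blk_no, payload in blocks:
            dest = ion_of_block(blk_no)
            outgoing.setdefault(dest, []).append((pn_id, blk_no, payload))
    return outgoing  # one packed message per destination ION
```

The effect is that each ION receives one message per communication group instead of one per writing PN, mirroring the read-path grouping.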
Compared with other existing parallel I/O methods, the invention achieves the following technical effects:
1. Packing reduces the number and size of communication messages. In the non-grouped mode, packing sends the discontinuous data segments of a user space in one packet, reducing the message count; in the grouped mode, packing per communication group the data belonging to many PNs reduces the number of messages one data block must send serially, markedly cutting communication setup time.
2. Multithreaded pipelining, with double-buffered concurrent disk I/O and network transmission, effectively covers the forwarding overhead and parallelizes communication with disk operation.
3. Grouped and non-grouped modes are flexibly unified, and I/O requests are scheduled from the distribution information of the data on disk; this effectively resolves the communication bottleneck, reduces I/O access service time, and improves the overall service performance of the parallel file system.
4. The invention merges and packs the discrete data belonging to the same PN before transmission, reducing communication volume and curing the defect of disk-directed parallel I/O that discrete data segments belonging to the same PN produce multiple network messages.
5. Based on the principle of trading space for time, the invention uses double buffering and a small set of PN nodes acting as communication group leaders to provide forwarding service, obtaining higher I/O service performance at the cost of some kilobytes of memory on the PN nodes.
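Effect 1 can be quantified for the per-buffer setup cost. A sketch under the patent's own group-count rule (√P groups for P processing nodes); the concrete numbers are illustrative, not measurements.

```python
import math

def per_buffer_setup(n_pn, t_setup, grouped):
    """Messages per data buffer and their total setup cost: one message
    per covered PN in direct mode, one per communication group
    (sqrt(n_pn) groups, as the patent prescribes) in grouped mode."""
    n_msgs = math.isqrt(n_pn) if grouped else n_pn
    return n_msgs, n_msgs * t_setup

# 64 covered PNs at a 20 us setup delay:
# direct mode  -> 64 messages, 1.28 ms of setup per buffer
# grouped mode ->  8 messages, 0.16 ms of setup per buffer
```

The setup cost thus shrinks from P·t_setup to √P·t_setup per buffer, which is why the grouped mode keeps disk operation able to cover communication even as the PN count grows.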
Description of drawings
Fig. 1 is traditional high-performance computer system parallel I/O system construction drawing;
Fig. 2 is the system construction drawing of parallel I of the present invention/O method of servicing;
Fig. 3 is read operation service procedure figure of the present invention;
Fig. 4 is write operation service procedure figure of the present invention;
Fig. 5 is that pipeline system message circulation of the present invention sends schematic diagram;
Fig. 6 is the present invention with the performance test comparison diagram towards the parallel I/O method of servicing of disk.
Embodiment
Fig. 1 shows the parallel I/O system structure of a current mainstream high-performance computer system. Nodes are divided by main function into processing nodes (PN) and I/O nodes (ION). A PN is mainly responsible for computation and runs user applications; it sends I/O requests to the IONs. An ION is equipped with disks and is mainly responsible for handling the PNs' I/O requests. By the relation of ION to PN, the parallel I/O structure is divided into the centralized structure (Fig. 1(a)) and the distributed structure (Fig. 1(b)). In the centralized structure a PN has no disk and performs only computation, while an ION has disks, handles I/O tasks exclusively, and runs no user applications; IONs and PNs communicate over a high-speed interconnect, and the IONs service the I/O requests the PNs send. In the distributed structure the functions of ION and PN are not strictly divided: every processor can both run user applications and attach disks to handle I/O accesses.
Fig. 2 shows the system structure of the parallel I/O service method of the invention. On top of the PN and ION nodes, the invention adds the communication-group structure and the major-node structure. A communication group is a set of PNs; by design, intra-group communication bandwidth and delay are better than those between groups, and communication within one group does not disturb message transmission inside other groups. The PNs inside a dashed frame in the figure form one communication group; within a group the PNs use a tightly coupled communication mode with higher transmission bandwidth and lower transmission delay, while inter-group communication uses a looser mode. The number of communication groups is the square root of the total PN count. Each communication group has a group leader, served by the PN with the smallest number in the group; the black PN in the figure is a group leader. The group leader runs a receiving thread and an unpack/forward thread and is responsible for communication between the whole group and the IONs and for data redistribution. The white PN in the figure is the major node; it runs the CIO operation synchronizing thread and is responsible for synchronizing CIO requests. When each PN issues a request through the CIO interface, all requests are first sent to the major node, which synchronizes them. Once the requests of all PNs participating in the CIO operation have arrived, the major node merges them into one big packet and broadcasts it to all IONs. Each ION that finishes serving all its I/O requests sends a completion signal to the major node. After the completion signals of all IONs have arrived, the major node sends a completion signal to each PN, and each PN concludes the current CIO operation.
Fig. 3 illustrates the main service flow of the read operation of the invention. The read process is as follows:
1) Each PN participating in the CIO operation sends its read request to the major node;
2) The major node checks whether the requests of all PNs belonging to the same CIO operation have arrived; once the read requests of all participating PNs have reached the major node, its CIO interface synchronizing thread merges them into one big packet and broadcasts it to all IONs;
3) Each ION queries its local metadata server and schedules every read request according to the file metadata and the physical on-disk distribution of the data that the server provides, accessing the data in the order of its on-disk layout to obtain optimal disk I/O performance;
4) Using multithreaded pipelining, each ION packs the data it has read per communication group and sends it to the group leaders, opening two data buffers for pipelined sending: the disk I/O thread reads data from disk into the idle buffer according to the scheduled request sequence, and the sending thread decides, from whether the number of PNs covered by the data in the buffer exceeds the threshold τ, whether to use the grouped mode. When the covered-PN count exceeds τ, the invention uses the grouped sending mode, and each ION packs the data belonging to the same communication group into one message bag sent to the group leader; when the covered-PN count is below τ, disk operation can cover communication, and the invention performs no grouping, i.e. each PN is its own group.
5) In the grouped sending mode of the invention, the group leader uses multithreaded pipelining to receive and forward the data belonging to the other group members: it opens two data buffers, one for receiving data and one for unpacking and sending; when the receiving thread has received a block, it notifies the unpack/forward thread to unpack the data; the unpack/forward thread determines from the destination information in the data header whether a segment is local data or another PN's, writes local data directly into the user buffer, and forwards non-local data to its destination node; the destination PN receives and unpacks data in the same way as the group leader.
6) In the non-grouped sending mode of the invention, the ION sends data directly to each PN; each PN uses a double-buffer structure to receive and unpack data in parallel with multithreaded pipelining, and there is no forwarding step.
Fig. 4 illustrates the main service flow of the write operation of the invention. The write operation is the inverse of the read operation and likewise uses the grouped-packing technique. The concrete method is as follows:
1) Each PN participating in the CIO operation sends its write request to the major node for synchronization.
2) Once the write requests of all participants in the CIO operation have arrived, the major node merges the write requests and broadcasts them to all IONs.
3) Each ION determines, from the file distribution information provided by its local metadata server, which PNs' data will be written to this ION.
4) The ION broadcasts this data metadata to all PNs.
5) When the number of PNs participating in this CIO operation exceeds the threshold τ, the invention uses the grouped-packing transmission mode; when it is below τ, the invention sends data directly:
i. In the grouped mode, the data each PN will write are sent as a whole to the group leader of its group. Before sending, each PN uses two-thread, double-buffered parallel transmission to pack the discrete data scattered in its user buffer. The group leader receives and forwards the data of the other group members with multithreaded pipelining: it opens two data buffers, one for receiving data and one for unpacking and sending; when the receiving thread has received a block, it notifies the unpack/forward thread to unpack; the unpack/forward thread unpacks the data according to the destination information in the headers, repacks it by destination ION, and forwards it to the destination IONs.
ii. In the direct mode, each PN packs the data belonging to the same ION together and sends it to that destination ION.
6) In either mode, the ION uses double buffering to receive, unpack, and schedule data concurrently, and finally writes it to disk in one pass.
Fig. 5 illustrates the round-robin pipelined message transmission mechanism of the present invention. When the number of PNs covered by the data exceeds the threshold, the present invention uses the grouped send mode to reduce message traffic, and uses a round-robin send order when the IONs send data to the group leaders. The left side of the figure shows the number of the ION sending each message; each box represents one message, and the number in the box is the destination PN of that message. Because of the synchronous nature of the CIO operation, all IONs finish reading their data at almost the same moment and send messages to the communication group leaders. If multiple IONs then send to multiple PNs in the same order at the same time, receive conflicts at the receivers are very likely. As shown in Fig. 5(a), when every ION takes PN1 as its first destination, PN1 can receive only one message at a time (ION1's), so ION2 must wait until ION1 has finished sending to PN1 before it can begin. With the round-robin pipelined message mode of the present invention, shown in Fig. 5(b), each ION starts its sends at a different PN number: ION1 starts from PN1, ION2 from PN2, and so on. Message sends then essentially never conflict, so in Fig. 5(b) ION2 can send to PN2 at the same time as ION1 sends to PN1. The larger the messages being sent, the more time the round-robin send order saves.
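The staggered order of Fig. 5(b) is easy to state as a schedule: at step s, ION i targets PN (i + s) mod n. The sketch below (a hypothetical helper, using 0-based numbering where the patent numbers PNs from 1) shows why no two IONs ever hit the same PN in the same step, provided there are at least as many PNs as IONs.

```python
# Round-robin pipelined send schedule of Fig. 5(b): ION i begins its send
# sequence at PN i, so each step targets pairwise-distinct receivers.
# 0-based numbering is an assumption of this sketch (the patent uses 1-based).

def send_schedule(num_ion, num_pn):
    """Return schedule[step][ion] = index of the PN that ION `ion`
    sends to at time `step`."""
    return [[(ion + step) % num_pn for ion in range(num_ion)]
            for step in range(num_pn)]


# With 3 IONs and 4 PNs, step 0 targets PN0, PN1, PN2 — all distinct,
# unlike the naive order of Fig. 5(a) where every ION first targets PN0.
```

Because the offsets differ by the ION index, the targets within one step are distinct modulo the PN count, which eliminates the receive conflicts described above.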
Fig. 6 compares the bandwidth of the CDGIO method of the present invention against a disk-oriented parallel I/O service method when one ION reads 3 MB of data. The data is distributed at fine granularity among the PNs with a distribution unit of 8 bytes, the MPI communication setup delay is 20 us, and the disk peak bandwidth is 12 MB/s. Under this distribution, one 8 KB data block generates 1024 small messages, so the disk-oriented parallel I/O service method suffers excessive communication overhead and poor system I/O performance. In Fig. 6 the abscissa is the number of PN nodes and the ordinate is the I/O bandwidth. Once the number of PNs covered by one data block exceeds the threshold of τ = 32 nodes, the disk I/O operations can no longer cover the communication overhead, and the performance of the disk-oriented parallel I/O service method drops rapidly. When the message count for one data block is below the threshold, the present invention uses the same direct transmission mechanism as the disk-oriented method, so the two perform comparably. When the message count exceeds the threshold τ, the present invention adopts grouped transmission, reducing the transmission overhead generated per data block; the disk operations of the IONs can then effectively cover the communication, preventing a communication performance bottleneck. Fig. 6 shows that, because CDGIO uses grouping to reduce message traffic, the disk performance is fully exploited in the CDGIO algorithm even as the PN count grows, reaching nearly 10 MB/s with system performance remaining essentially constant, whereas the disk-oriented parallel I/O service method, limited by its communication bottleneck, cannot exploit the system's disk performance and its overall I/O performance falls off quickly.
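The figures quoted above can be checked with simple arithmetic: under 8-byte fine-grained striping, the per-message setup latency alone dwarfs the disk read time for an 8 KB block, which is exactly why unpacked transmission becomes the bottleneck. The sketch below only restates the numbers given in the text.

```python
# Back-of-envelope check of the quoted figures: 8-byte stripes, 20 us MPI
# setup delay, 12 MB/s disk peak bandwidth, one 8 KB data block.

BLOCK = 8 * 1024          # one 8 KB data block, bytes
STRIPE = 8                # fine-grained distribution unit, bytes
T_SETUP = 20e-6           # MPI message setup latency, seconds
DISK_BW = 12e6            # disk peak bandwidth, bytes/second

num_msgs = BLOCK // STRIPE            # 1024 small messages per block
t_setup_total = num_msgs * T_SETUP    # ~20.5 ms of pure setup latency
t_disk = BLOCK / DISK_BW              # ~0.68 ms to read the block

# Setup latency alone is roughly 30x the disk read time, so without
# packing, disk I/O cannot possibly hide the communication cost.
```

This ~30x gap is what the grouped-packing mode eliminates: one packed message per group replaces the 1024 stripe-sized messages.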
The present invention is communication-oriented: it balances the relationship between disk performance and communication bandwidth and exploits the characteristic of parallel scientific computing that large centralized accesses and small scattered accesses coexist. It proposes the communication-group and major-node structures and adopts a two-level grouped transmission strategy; through communication-oriented packed data transmission it refines the bottleneck component, reduces message traffic, and lowers communication overhead, so that disk access covers the communication well. This shortens the overall I/O execution time and yields better I/O service performance. Evaluation shows that the present invention is the more efficient parallel I/O service method under current disk and high-speed interconnect performance conditions.
The present invention has been implemented on a high-performance server developed by the National University of Defense Technology running the Digital UNIX operating system. The present invention adopts a threading mechanism and is designed on the Digital UNIX thread library. The invention is not, however, limited to any particular hardware platform or operating system: the service method can easily be ported to other environments, such as Linux, FreeBSD, and the Galaxy Kylin (KYLIN) operating system, and is therefore widely applicable.

Claims (3)

1. A communication-oriented grouped parallel I/O service method, characterized in that, on the basis of the PN nodes and ION nodes of a high-performance computer architecture, a communication-group structure and a major-node structure are added; based on the communication-group and major-node structures, a larger disk access bandwidth is obtained through the synchronizing effect of the CIO operation interface; read operations and write operations are handled separately, and both use the grouped-packing message transmission mode:
1.1 The read operation proceeds as follows:
1.1.1 Each PN node sends its read request to the major node;
1.1.2 The major node checks whether all PN requests belonging to the same CIO operation group have arrived; once the read requests of all PNs participating in the CIO operation have reached the major node, the CIO interface synchronization thread of the major node accumulates the read requests of the PNs into one large packet and broadcasts it to all IONs;
1.1.3 Each ION node queries the metadata server of that node and schedules each read request according to the file metadata of the node and the physical on-disk distribution information provided by the metadata server, accessing the disk data in the order in which it is laid out on disk;
1.1.4 Each ION node uses a multithreaded pipelined approach to pack the data it has read, one packet per communication group, and send it to the communication group leader. Each ION opens two data buffers for pipelined sending: the disk I/O thread reads data from disk into the idle data buffer according to the scheduled request sequence, and the sending thread decides, from the number of PNs covered by the data in the buffer, whether to send to the PN nodes in grouped mode: grouped mode is used when the covered number exceeds the threshold τ; when it does not exceed τ, the ION simply packs the data belonging to the same PN and sends it to that PN directly. The threshold τ is determined by the ratio of communication overhead to disk overhead: let t_disk be the time for the disk to read one block buffer, t_net_bandwidth the transmission bandwidth of the high-speed interconnect network, s_blk the file data block size, and t_net_setup the communication setup time of one packet; then
τ = (t_disk − s_blk / t_net_bandwidth) / t_net_setup
Because the transmission bandwidth of the high-speed interconnect is large, the above formula can be further simplified to
τ ≈ t_disk / t_net_setup
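Plugging in the example figures from the description (8 KB block, 12 MB/s disk, 20 us setup) reproduces the τ = 32 threshold used in the evaluation. The interconnect bandwidth of 100 MB/s below is an assumption of this sketch; the other values come from the text.

```python
# Numeric sketch of the threshold formula. NET_BW (100 MB/s) is an
# assumed value for the high-speed interconnect; S_BLK, DISK_BW and
# T_SETUP come from the example in the description.

S_BLK = 8 * 1024      # file data block size s_blk, bytes
DISK_BW = 12e6        # disk bandwidth, bytes/s
NET_BW = 100e6        # assumed interconnect bandwidth t_net_bandwidth
T_SETUP = 20e-6       # per-packet setup time t_net_setup, seconds

t_disk = S_BLK / DISK_BW                       # time to read one buffer
tau = (t_disk - S_BLK / NET_BW) / T_SETUP      # exact formula
tau_approx = t_disk / T_SETUP                  # dropping the transfer term

# tau is about 30 and tau_approx about 34, consistent with the tau = 32
# threshold quoted in the evaluation; the approximation holds because the
# transfer term S_BLK/NET_BW is small relative to t_disk.
```

The simplification is safe precisely when the interconnect is fast: the neglected term shifts τ by only a few messages.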
When the data is distributed at fine granularity and the number of PNs covered by the data exceeds the threshold τ, the grouped send mode is used: each ION packs the data belonging to the same communication group into one message and sends it to that group's communication group leader. When the number of PNs covered by the data in the buffer is below the threshold τ, the disk operations can cover the communication, so no grouping is performed: each PN forms its own group, and each ION packs and sends only the data belonging to a single PN;
1.1.4.1 In the grouped send mode, the communication group leader uses a multithreaded pipelined technique to receive and forward the data belonging to the other members of its group: the group leader opens two data buffers, one for receiving data and one for unpacking and sending; after the receiving thread has received a block of data, it notifies the unpack-and-send thread; the unpack-and-send thread determines, from the destination information in the data header, whether each data segment is local data or belongs to another PN, writing local data directly into the user buffer and sending non-local data on to its destination node; the destination PN node receives and unpacks the data in the same manner as the communication group leader;
1.1.4.2 In the non-grouped send mode, the ION sends data directly to each PN node; each PN node uses a double-buffer structure, receiving and unpacking data in parallel with multithreaded pipelining, and there is no forwarding step in this case;
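The double-buffer pipeline of step 1.1.4 — a disk thread filling one buffer while a send thread drains the other — can be sketched with two threads and a two-slot queue. This is a minimal illustration under assumed names (`pipeline`, `read_block`, `send_block`); the patent's implementation uses the Digital UNIX thread library rather than Python.

```python
import queue
import threading

# Minimal double-buffering sketch: at most two buffers are in flight, so
# the disk thread reads block i+1 while the send thread transmits block i,
# overlapping disk I/O with communication.


def pipeline(blocks, read_block, send_block):
    """read_block(i) -> data for block i; send_block(data) delivers it."""
    slots = queue.Queue(maxsize=2)      # the two data buffers

    def disk_thread():
        for i in range(blocks):
            slots.put(read_block(i))    # blocks when both buffers are busy
        slots.put(None)                 # end-of-stream marker

    t = threading.Thread(target=disk_thread)
    t.start()
    while (data := slots.get()) is not None:
        send_block(data)                # overlaps with the next disk read
    t.join()
```

The bounded queue is what enforces the two-buffer discipline: the disk thread stalls only when both buffers hold unsent data, exactly the back-pressure behaviour the description implies.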
1.2 The write operation likewise uses the grouped-packing message transmission mode, and proceeds as follows:
1.2.1 Each PN node participating in the CIO operation sends its write request to the major node for synchronization;
1.2.2 After the write requests of all participants in the CIO operation have arrived, the major node merges the write requests of the PNs and broadcasts the result to all IONs;
1.2.3 Each ION node determines, from the file distribution information provided by its local metadata server, which PNs' data will be written to this ION;
1.2.4 Each ION broadcasts the above data metadata information to all PN nodes;
1.2.5 When the number of PNs participating in this CIO operation exceeds the threshold τ, the grouped-packing transmission mode is adopted; when the number of PNs is below the threshold τ, direct sending is used:
1.2.5.1 In grouped mode, each PN sends all the data it will write as a single whole to the communication group leader of its group; during transmission, each PN uses a two-thread, double-buffered parallel transmission technique to pack the scattered data in its user buffer before sending, and the communication group leader uses a multithreaded pipelined technique to receive and forward the data belonging to the other members of its group: the group leader opens two data buffers, one for receiving data and one for unpacking and sending; after the receiving thread has received a block of data, it notifies the unpack-and-send thread; the unpack-and-send thread unpacks the data according to the destination information in the data header, repacks it by destination ION, and forwards it to that ION;
1.2.5.2 In direct mode, each PN packs the data belonging to the same ION together and sends it to that destination ION;
1.2.6 In either grouped or direct mode, each ION node uses double buffering to concurrently receive, unpack, and schedule the data, then writes it to disk in a single pass.
2. The communication-oriented grouped parallel I/O service method of claim 1, characterized in that the communication group is a set of PNs, designed so that PNs within a group communicate in a tightly coupled mode while communication between groups uses a loosely coupled mode; the intra-group communication bandwidth and latency are better than the bandwidth and latency between groups, and communication within one group does not disturb message transmission inside other communication groups; the number of communication groups is the square root of the total number of processing nodes; each communication group has one communication group leader node, which may be any PN node, on which a receiving thread and an unpack/forward thread run, responsible for all communication between the group and the IONs and for the redistribution of data; the major node may likewise be any PN node, on which the cluster I/O (CIO) operation-interface processing thread runs; this thread synchronizes all I/O requests participating in a CIO operation: when each PN issues a request through the CIO operation interface, all requests are first sent to the major node; after the requests of all PNs participating in the CIO operation have arrived, the major node accumulates them into one large packet and broadcasts it to all IONs; each ION that finishes serving all its I/O requests sends a finish signal to the major node, and after the finish signals of all IONs have arrived, the major node sends a finish signal to each PN, whereupon each PN node concludes the current CIO operation.
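The major-node protocol of claim 2 is a gather–merge–broadcast barrier. The sketch below models it as a single function rather than a distributed system; all names (`cio_sync`, the dict packet format, the "done" signals) are illustrative assumptions.

```python
# Sketch of the major-node CIO synchronization of claim 2, modeled
# sequentially: gather one request per PN, merge into one packet,
# broadcast to every ION, then release the PNs only after all IONs finish.


def cio_sync(pn_requests, ion_count):
    """pn_requests: one I/O request per PN in the CIO operation group."""
    # step 1: all PN requests arrive; accumulate them into one big packet
    merged = {"requests": list(pn_requests)}
    # step 2: broadcast the merged packet to every ION
    ion_inbox = [merged for _ in range(ion_count)]
    # step 3: each ION serves its requests and reports completion
    finished = sum(1 for _ in range(ion_count))
    # step 4: only when every ION has signalled completion does the
    # major node send each PN its finish signal
    if finished != ion_count:
        raise RuntimeError("CIO operation incomplete")
    return ion_inbox, ["done"] * len(pn_requests)
```

The key property is the double barrier: PNs block until all requests are merged, and again until all IONs report completion, which is what makes the CIO operation collective.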
3. The communication-oriented grouped parallel I/O service method of claim 1, characterized in that, in the read operation, a round-robin pipelined message mode is used when the IONs send data to the communication group leaders, namely each ION starts sending from the PN whose number matches the ION's own sequence number.
CNB2004100232541A 2004-05-28 2004-05-28 Grouped parallel inputting/outputting service method for telecommunication Expired - Fee Related CN1327669C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100232541A CN1327669C (en) 2004-05-28 2004-05-28 Grouped parallel inputting/outputting service method for telecommunication


Publications (2)

Publication Number Publication Date
CN1585380A CN1585380A (en) 2005-02-23
CN1327669C true CN1327669C (en) 2007-07-18

Family

ID=34600753

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100232541A Expired - Fee Related CN1327669C (en) 2004-05-28 2004-05-28 Grouped parallel inputting/outputting service method for telecommunication

Country Status (1)

Country Link
CN (1) CN1327669C (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101662414B (en) * 2008-08-30 2011-09-14 成都市华为赛门铁克科技有限公司 Method, system and device for processing data access
CN102981912B (en) * 2012-11-06 2015-05-20 无锡江南计算技术研究所 Method and system for resource distribution
CN105786447A (en) * 2014-12-26 2016-07-20 乐视网信息技术(北京)股份有限公司 Method and apparatus for processing data by server and server

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6016315A (en) * 1997-04-30 2000-01-18 Vlsi Technology, Inc. Virtual contiguous FIFO for combining multiple data packets into a single contiguous stream
US6065087A (en) * 1998-05-21 2000-05-16 Hewlett-Packard Company Architecture for a high-performance network/bus multiplexer interconnecting a network and a bus that transport data using multiple protocols
CN1431808A (en) * 2003-01-27 2003-07-23 西安电子科技大学 Large capacity and expandable packet switching network structure


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
并行文件系统中的最优化服务关系研究 (Research on the Optimal Service Relationship in Parallel File Systems), Shang Linfeng, Lu Kai, Luo Yu, Computer Science and Engineering, Vol. 23, No. 2, 2001 *

Also Published As

Publication number Publication date
CN1585380A (en) 2005-02-23

Similar Documents

Publication Publication Date Title
CN101873253A (en) Buffered crossbar switch system
CN107046510B (en) Node suitable for distributed computing system and system composed of nodes
CN103365726B (en) A kind of method for managing resource towards GPU cluster and system
CN102298539A (en) Method and system for scheduling shared resources subjected to distributed parallel treatment
May et al. Transputer and Routers: components for concurrent machines
CN102299843A (en) Network data processing method based on graphic processing unit (GPU) and buffer area, and system thereof
CN110995598A (en) Variable-length message data processing method and scheduling device
CN105426260A (en) Distributed system supported transparent interprocess communication system and method
CN1327669C (en) Grouped parallel inputting/outputting service method for telecommunication
CN1501640A (en) Method and system for transmitting Ethernet data using multiple E1 lines
CN105049372A (en) Method of expanding message middleware throughput and system thereof
JP2007532052A (en) Scalable network for management of computing and data storage
CN106909624B (en) Real-time sequencing optimization method for mass data
CN103176850A (en) Electric system network cluster task allocation method based on load balancing
Vaidya et al. LAPSES: A recipe for high performance adaptive router design
CN115391053B (en) Online service method and device based on CPU and GPU hybrid calculation
CN101188556A (en) Efficient multicast forward method based on sliding window in share memory switching structure
Wang et al. A BSP-based parallel iterative processing system with multiple partition strategies for big graphs
Woodside et al. Alternative software architectures for parallel protocol execution with synchronous IPC
Qin et al. Fault tolerant storage and data access optimization in data center networks
Ravindran et al. On topology and bisection bandwidth of hierarchical-ring networks for shared-memory multiprocessors
Zhu et al. A full distributed web crawler based on structured network
Wu et al. A practical packet reordering mechanism with flow granularity for parallelism exploiting in network processors
Qi et al. MapReduce-based data stream processing over large history data
CN1506860A (en) Resource locating system under network environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070718

Termination date: 20130528