CN103189867A - Duplicated data search method and equipment - Google Patents

Duplicated data search method and equipment

Info

Publication number
CN103189867A
Authority
CN
China
Prior art keywords
cryptographic hash
packet
deblocking
hash
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012800019897A
Other languages
Chinese (zh)
Other versions
CN103189867B (en)
Inventor
覃强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN103189867A
Application granted
Publication of CN103189867B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G06F16/152 File search processing using file content signatures, e.g. hash values

Abstract

One embodiment of the invention provides a duplicated data search method and equipment. The method comprises partitioning received data to obtain at least two data blocks; grouping the at least two data blocks to obtain at least one data packet; performing a similarity hash calculation on each data packet to obtain a hash value of the data packet; and obtaining, from a hash value storage table, a first hash value whose similarity to the hash value of the data packet is greater than or equal to a first similarity threshold. If the similarity between the hash value of the data packet and the first hash value is greater than or equal to a second similarity threshold, duplicate block retrieval is performed on the data blocks in the data packet. The technical scheme improves the efficiency of duplicate block queries and the overall performance of data de-duplication.

Description

Duplicated data search method and equipment
Technical field
The present invention relates to storage technology, and in particular to a duplicated data search method and equipment.
Background
Data de-duplication is a data reduction technique intended to reduce the storage capacity used in a storage system or the amount of data transmitted over a network; it is widely used in data backup and wide area network data transmission. The de-duplication process is as follows: the input data is divided into blocks and a hash value is calculated for each block; the single-instance store is searched with the calculated hash value to determine whether the block is a duplicate block; if it is, the block and its hash value are not stored in the single-instance store, which achieves the data reduction.
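The conventional process described above can be sketched as follows. This is a minimal illustration, not the patented method: the fixed 4096-byte block size, the use of SHA-1, the in-memory dictionary standing in for the single-instance store, and the name `deduplicate` are all assumptions made for the example.

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep each distinct block once.

    Returns (store, recipe): `store` stands in for the single-instance store
    and maps a block hash to the block; `recipe` lists the block hashes in
    input order so the original data can be rebuilt.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha1(block).hexdigest()
        if digest not in store:
            # Not a duplicate block: store the block under its hash value.
            store[digest] = block
        recipe.append(digest)
    return store, recipe
```

In a real system the lookup `digest not in store` is the step that touches the single-instance store on disk, which is the cost discussed in the next paragraph.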
The single-instance store is usually large and cannot be held entirely in memory, so it is usually placed on disk. Determining whether a block is a duplicate block therefore requires frequent disk access; because disk access is slow, duplicate block queries are inefficient, which degrades the overall performance of data de-duplication.
Summary of the invention
Embodiments of the invention provide a duplicated data search method and equipment, to improve the efficiency of duplicate block queries and the overall performance of data de-duplication.
In a first aspect, a duplicated data search method is provided, comprising:
Partitioning received data to obtain at least two data blocks;
Grouping the at least two data blocks to obtain at least one data packet, each data packet comprising at least one data block;
For a first data packet of the at least one data packet, performing a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtaining, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, wherein the hash value storage table stores the correspondence between the hash value of a second data packet stored in a data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet;
If the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold, performing duplicate block retrieval on the data blocks in the first data packet.
In a first possible implementation of the first aspect, the duplicated data search method further comprises: if the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, storing the data blocks in the first data packet and the hash values of those data blocks in the data storage space, and storing the correspondence between the hash value of the first data packet and the first data packet in the hash value storage table.
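The two-threshold decision described above can be sketched as follows. This is a hedged illustration only: similarity is measured here as the number of matching bits, the table is a plain dictionary from packet hash to a packet identifier, and the names `matching_bits` and `handle_packet` are hypothetical. Treating "no candidate found" as a new packet is also an assumption, since the text only specifies the case where the similarity falls below the second threshold.

```python
def matching_bits(a: int, b: int, bits: int = 64) -> int:
    """Number of bit positions (out of `bits`) at which a and b agree."""
    return bits - bin(a ^ b).count("1")


def handle_packet(packet_hash, packet_id, table,
                  first_threshold, second_threshold, bits=64):
    """Decide whether to run duplicate block retrieval or store the packet.

    `table` maps the similarity hash of each stored packet to its identifier.
    Returns ("retrieve", matched_packet_id) or ("store", packet_id).
    """
    # Step 1: candidates whose similarity reaches the first threshold.
    candidates = [h for h in table
                  if matching_bits(packet_hash, h, bits) >= first_threshold]
    # Step 2: run duplicate block retrieval only if the second threshold is met.
    for h in candidates:
        if matching_bits(packet_hash, h, bits) >= second_threshold:
            return "retrieve", table[h]
    # Otherwise record the packet as new (the caller also stores its blocks).
    table[packet_hash] = packet_id
    return "store", packet_id
```

For example, with 8-bit hashes, a table containing `0b11110000` and thresholds 6 and 7, the hash `0b11110001` agrees in 7 positions and triggers retrieval, while `0b00001111` agrees in none and is stored as a new packet.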
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, grouping the at least two data blocks to obtain at least one data packet comprises: forming to-be-partitioned hash data from the hash values of the at least two data blocks; partitioning the to-be-partitioned hash data with a partitioning algorithm, using the length of the hash value of any one data block as the sliding step, to obtain at least one hash value partition; and taking the data blocks that correspond to the hash values in the same hash value partition as one data packet.
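One way to read this grouping step is as content-defined partitioning applied to the sequence of block hashes, with the window advancing one whole hash at a time. The sketch below follows that reading. The boundary test (last four bytes of the hash modulo `divisor`), the `max_blocks` cap and the name `group_blocks` are assumptions for illustration; the text does not specify which partitioning algorithm is used.

```python
def group_blocks(block_hashes, divisor: int = 4, max_blocks: int = 16):
    """Group data blocks by partitioning the sequence of their hash values.

    `block_hashes` is a list of equal-length byte strings, one per data block.
    The window moves forward by exactly one hash (the sliding step), and a
    partition ends where the assumed boundary test fires or when the
    partition reaches `max_blocks` hashes.
    Returns a list of packets, each a list of data block indices.
    """
    packets = []
    current = []
    for index, digest in enumerate(block_hashes):
        current.append(index)
        at_boundary = int.from_bytes(digest[-4:], "big") % divisor == 0
        if at_boundary or len(current) >= max_blocks:
            packets.append(current)
            current = []
    if current:
        # Trailing blocks that never hit a boundary form the last packet.
        packets.append(current)
    return packets
```

Because boundaries depend only on the hash values, the same run of blocks tends to fall into the same packet even when the surrounding data shifts, which is what makes packet-level similarity comparison meaningful.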
With reference to the first aspect or the first or second possible implementation of the first aspect, in a third possible implementation of the first aspect, performing the similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet comprises: performing a hash operation on each data block in the first data packet to obtain the hash value of each data block; replacing each 0 in the hash value of each data block with -1; adding the hash values of all data blocks in the first data packet bit position by bit position; mapping each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0; and taking the resulting binary value as the hash value of the first data packet.
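The similarity hash operation described above can be sketched directly. The 64-bit width, the use of truncated SHA-1 as the per-block hash, and the function names are assumptions for the example; the bitwise voting itself follows the steps in the text.

```python
import hashlib

HASH_BITS = 64  # assumed width of each block hash


def block_hash(block: bytes) -> int:
    """Per-block hash: the first 8 bytes of SHA-1, read as an integer (assumed)."""
    return int.from_bytes(hashlib.sha1(block).digest()[:8], "big")


def packet_similarity_hash(block_hashes, bits: int = HASH_BITS) -> int:
    """Combine the block hashes of one packet into a single similarity hash."""
    sums = [0] * bits
    for h in block_hashes:
        for i in range(bits):
            # A 1 bit contributes +1; a 0 bit is replaced with -1.
            sums[i] += 1 if (h >> i) & 1 else -1
    result = 0
    for i, total in enumerate(sums):
        if total > 0:
            # Positions with a positive sum map to 1; all others stay 0.
            result |= 1 << i
    return result
```

As a worked example with 4-bit hashes 1100, 1010 and 1001: the leftmost position sums to +3 and maps to 1, while each of the other three positions sums to -1 and maps to 0, so the packet hash is 1000.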
With reference to the first aspect or any of the first to third possible implementations of the first aspect, in a fourth possible implementation of the first aspect, the data storage space comprises a plurality of storage areas, and the hash value storage table further stores the correspondence between the hash value of the second data packet and the number of the storage area in which the second data packet is located;
Performing duplicate block retrieval on the data blocks in the first data packet comprises: obtaining, from the hash value storage table, the number n of the storage area corresponding to the first hash value, where n is an integer greater than or equal to 0; loading the data blocks in the storage area numbered n, together with their hash values, into memory; and comparing each data block in the first data packet whose hash value also appears in the storage area numbered n with the data block stored there, to complete the duplicate block retrieval on the data blocks in the first data packet.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the method further comprises: when loading the data blocks in the storage area numbered n and their hash values into memory, also loading the data blocks in the storage area numbered (n+1) and their hash values into memory;
In that case, comparing the data blocks in the first data packet whose hash values also appear in the storage area numbered n, to complete the duplicate block retrieval on the data blocks in the first data packet, comprises: comparing each data block in the first data packet whose hash value also appears in the storage areas numbered n and (n+1) with the data block stored there, to complete the duplicate block retrieval on the data blocks in the first data packet.
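The storage-area lookup and the optional prefetch of area (n+1) can be sketched as below. The nested dictionary standing in for on-disk storage areas, the string block hashes and the names `load_areas` and `find_duplicate_blocks` are assumptions for illustration.

```python
def load_areas(disk, n: int, prefetch_next: bool = True):
    """Load storage area n (and optionally n+1) into an in-memory cache.

    `disk` maps a storage area number to a dict of {block hash: block bytes}.
    """
    cache = dict(disk.get(n, {}))
    if prefetch_next:
        # Fifth implementation: also bring in the neighbouring area n+1.
        cache.update(disk.get(n + 1, {}))
    return cache


def find_duplicate_blocks(packet_blocks, cache):
    """Return the hashes of packet blocks that duplicate a cached block.

    `packet_blocks` maps block hash to block bytes for the first data packet.
    Only blocks whose hash appears in the cache are compared byte for byte.
    """
    duplicates = []
    for digest, block in packet_blocks.items():
        stored = cache.get(digest)
        if stored is not None and stored == block:
            duplicates.append(digest)
    return duplicates
```

All comparisons run against the in-memory cache, so a packet costs one or two storage-area loads rather than one disk lookup per block.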
With reference to the first aspect or any of the first to fifth possible implementations of the first aspect, in a sixth possible implementation of the first aspect, obtaining, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold comprises: obtaining, from the hash value storage table, a hash value for which the number of bits that match the hash value of the first data packet at corresponding positions is greater than or equal to a preset number, as the first hash value.
With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, obtaining, from the hash value storage table, a hash value for which the number of matching bits at corresponding positions is greater than or equal to the preset number as the first hash value comprises: obtaining the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and taking a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.
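The two formulations are equivalent for hash values of a fixed width k: the number of matching bits equals k minus the Hamming distance, so "at least m matching bits" is the same test as "Hamming distance at most k - m". A minimal sketch of the Hamming distance filter, with hypothetical function names:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which a and b differ."""
    return bin(a ^ b).count("1")


def find_candidates(packet_hash: int, table_hashes, max_distance: int):
    """Return the stored hashes within `max_distance` of `packet_hash`."""
    return [h for h in table_hashes
            if hamming_distance(packet_hash, h) <= max_distance]
```

For example, `hamming_distance(0b1011, 0b1101)` is 2, so with a threshold of 2 the stored hash `0b1101` qualifies as a first hash value for packet hash `0b1011`, while `0b0100` (distance 4) does not.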
In a second aspect, duplicated data search equipment is provided, comprising:
A partition acquisition module, configured to partition received data to obtain at least two data blocks;
A grouping acquisition module, configured to group the at least two data blocks obtained by the partition acquisition module to obtain at least one data packet, each data packet comprising at least one data block;
A hash calculation module, configured to, for a first data packet of the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, wherein the hash value storage table stores the correspondence between the hash value of a second data packet stored in a data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet;
A duplicate retrieval module, configured to perform duplicate block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold.
In a first possible implementation of the second aspect, the duplicated data search equipment further comprises: a storage module, configured to, when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks in the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet in the hash value storage table.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the grouping acquisition module is specifically configured to form to-be-partitioned hash data from the hash value of each of the at least two data blocks, partition the to-be-partitioned hash data with a partitioning algorithm, using the length of the hash value of any one data block as the sliding step, to obtain at least one hash value partition, and take the data blocks that correspond to the hash values in the same hash value partition as one data packet.
With reference to the second aspect or the first or second possible implementation of the second aspect, in a third possible implementation of the second aspect, the hash calculation module being configured to perform the similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet comprises:
The hash calculation module is specifically configured to perform a hash operation on each data block in the first data packet to obtain the hash value of each data block, replace each 0 in the hash value of each data block with -1, add the hash values of all data blocks in the first data packet bit position by bit position, map each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0, and take the resulting binary value as the hash value of the first data packet.
With reference to the second aspect or any of the first to third possible implementations of the second aspect, in a fourth possible implementation of the second aspect, the data storage space comprises a plurality of storage areas, and the hash value storage table further stores the correspondence between the hash value of the second data packet and the number of the storage area in which the second data packet is located;
The duplicate retrieval module is specifically configured to obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, where n is an integer greater than or equal to 0; load the data blocks in the storage area numbered n, together with their hash values, into memory; and compare each data block in the first data packet whose hash value also appears in the storage area numbered n with the data block stored there, to complete the duplicate block retrieval on the data blocks in the first data packet.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the duplicate retrieval module is further configured to, when loading the data blocks in the storage area numbered n and their hash values into memory, also load the data blocks in the storage area numbered (n+1) and their hash values into memory;
In that case, the duplicate retrieval module is specifically configured to compare each data block in the first data packet whose hash value also appears in the storage areas numbered n and (n+1) with the data block stored there, to complete the duplicate block retrieval on the data blocks in the first data packet.
With reference to the second aspect or any of the first to fifth possible implementations of the second aspect, in a sixth possible implementation of the second aspect, the hash calculation module being configured to obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold comprises: the hash calculation module is specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits that match the hash value of the first data packet at corresponding positions is greater than or equal to a preset number, as the first hash value.
With reference to the sixth possible implementation of the second aspect, in a seventh possible implementation of the second aspect, the hash calculation module is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and take a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.
In a third aspect, duplicated data search equipment is provided, comprising a processor, a communication interface, a memory and a bus, wherein the processor, the communication interface and the memory communicate with one another over the bus;
The communication interface is configured to receive data;
The processor is configured to execute a program;
The memory is configured to store the program;
Wherein the program is configured to: partition the data received by the communication interface to obtain at least two data blocks; group the at least two data blocks to obtain at least one data packet, each data packet comprising at least one data block; for a first data packet of the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, wherein the hash value storage table stores the correspondence between the hash value of a second data packet stored in a data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet; and, if the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold, perform duplicate block retrieval on the data blocks in the first data packet.
In a first possible implementation of the third aspect, the program is further configured to, when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks in the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet in the hash value storage table.
With reference to the third aspect or the first possible implementation of the third aspect, in a second possible implementation of the third aspect, the program being configured to group the at least two data blocks to obtain at least one data packet comprises: the program is specifically configured to form to-be-partitioned hash data from the hash value of each of the at least two data blocks, partition the to-be-partitioned hash data with a partitioning algorithm, using the length of the hash value of any one data block as the sliding step, to obtain at least one hash value partition, and take the data blocks that correspond to the hash values in the same hash value partition as one data packet.
With reference to the third aspect or the first or second possible implementation of the third aspect, in a third possible implementation of the third aspect, the program being configured to perform the similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet comprises: the program is specifically configured to perform a hash operation on each data block in the first data packet to obtain the hash value of each data block, replace each 0 in the hash value of each data block with -1, add the hash values of all data blocks in the first data packet bit position by bit position, map each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0, and take the resulting binary value as the hash value of the first data packet.
With reference to the third aspect or any of the first to third possible implementations of the third aspect, in a fourth possible implementation of the third aspect, the data storage space comprises a plurality of storage areas, and the hash value storage table further stores the correspondence between the hash value of the second data packet and the number of the storage area in which the second data packet is located;
The program performing duplicate block retrieval on the data blocks in the first data packet comprises: the program is specifically configured to obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, where n is an integer greater than or equal to 0; load the data blocks in the storage area numbered n, together with their hash values, into memory; and compare each data block in the first data packet whose hash value also appears in the storage area numbered n with the data block stored there, to complete the duplicate block retrieval on the data blocks in the first data packet.
With reference to the fourth possible implementation of the third aspect, in a fifth possible implementation of the third aspect, the program is further configured to, when loading the data blocks in the storage area numbered n and their hash values into memory, also load the data blocks in the storage area numbered (n+1) and their hash values into memory;
In that case, the program is specifically configured to compare each data block in the first data packet whose hash value also appears in the storage areas numbered n and (n+1) with the data block stored there, to complete the duplicate block retrieval on the data blocks in the first data packet.
With reference to the third aspect or any of the first to fifth possible implementations of the third aspect, in a sixth possible implementation of the third aspect, the program being configured to obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold comprises: the program is specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits that match the hash value of the first data packet at corresponding positions is greater than or equal to a preset number, as the first hash value.
With reference to the sixth possible implementation of the third aspect, in a seventh possible implementation of the third aspect, the program is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and take a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.
Fourth aspect provides a kind of computer program, comprises computer-readable recording medium, is used for the storage program, and described program comprises:
The piecemeal acquiring unit is used for that the data that receive are carried out piecemeal and handles, and obtains at least two data piecemeals;
The grouping acquiring unit, described at least two the data piecemeals that are used for described piecemeal acquiring unit is got access to divide into groups, and obtain at least one packet, and each packet comprises at least one deblocking;
The Hash calculation unit, be used for first packet at described at least one packet, deblocking in described first packet is carried out the similarity Hash operation, obtain the cryptographic hash of described first packet, obtain in the cryptographic hash storage list and the cryptographic hash similarity of described first packet first cryptographic hash more than or equal to the first default similarity threshold, store the cryptographic hash that is stored in second packet in the data space and the corresponding relation of described second packet in the described cryptographic hash storage list, the cryptographic hash of described second packet is to carry out the similarity Hash operation according to the deblocking in described second packet to obtain; Described first packet is any one packet in described at least one packet;
The repeated retrieval unit is used for, the deblocking in described first packet being carried out repeatable block retrieving during more than or equal to default second similarity threshold in the similarity of the cryptographic hash of described first packet and described first cryptographic hash.
In first kind of fourth aspect possible implementation, described program also comprises: storage unit, be used in the similarity of the cryptographic hash of described first packet and described first cryptographic hash during less than described second similarity threshold, the cryptographic hash of the deblocking in the deblocking in described first packet and described first packet is stored in the described data space, and the cryptographic hash of described first packet and the corresponding relation of described first packet are stored in the described cryptographic hash storage list.
In conjunction with first kind of fourth aspect or fourth aspect possible implementation, in second kind of fourth aspect possible implementation, described grouping acquiring unit specifically constitutes for the cryptographic hash by described two each deblockings of data piecemeal at least treats piecemeal Hash data, length with the cryptographic hash of any described deblocking is sliding step, adopt block algorithm that the described piecemeal Hash data for the treatment of are carried out piecemeal and handled, obtain at least one cryptographic hash piecemeal, will belong to the deblocking of cryptographic hash correspondence of same cryptographic hash piecemeal as a described packet.
Second kind of possible implementation in conjunction with first kind of fourth aspect or fourth aspect possible implementation or fourth aspect, in the third possible implementation of fourth aspect, described Hash calculation unit is used for the deblocking in described first packet is carried out the similarity Hash operation, the cryptographic hash of obtaining described first packet comprises: described Hash calculation unit specifically is used for each deblocking in described first packet is carried out Hash operation, obtain the cryptographic hash of each deblocking in described first packet, in the cryptographic hash of each deblocking in described first packet 0 replaced with-1, corresponding position addition with the cryptographic hash of all deblockings in described first packet, with addition greater than 0 the position be mapped as 1, with addition be less than or equal to 0 the position be mapped as 0, the binary numeral of acquisition is as the cryptographic hash of described first packet.
With reference to the fourth aspect, or the first, second, or third possible implementation of the fourth aspect, in a fourth possible implementation of the fourth aspect, the data storage space comprises a plurality of storage areas; the hash value storage table further stores the correspondence between the hash value of the second data group and the number of the storage area in which the second data group is located;
The duplicate search unit is specifically configured to: obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and load the data chunks and the data chunk hash values in the storage area numbered n into memory, where n is an integer greater than or equal to 0; and compare the data chunks in the first data group whose hash values are identical to hash values in the storage area numbered n, so as to complete the duplicate block search on the data chunks in the first data group.
With reference to the fourth possible implementation of the fourth aspect, in a fifth possible implementation of the fourth aspect, the duplicate search unit is further configured to, when loading the data chunks and the data chunk hash values in the storage area numbered n into memory, also load the data chunks and the data chunk hash values in the storage area numbered (n+1) into memory;
The duplicate search unit being specifically configured to compare the data chunks in the first data group whose hash values are identical to hash values in the storage area numbered n, so as to complete the duplicate block search on the data chunks in the first data group, comprises: the duplicate search unit is specifically configured to compare the data chunks in the first data group whose hash values are identical to hash values in the storage areas numbered n and (n+1), so as to complete the duplicate block search on the data chunks in the first data group.
With reference to the fourth aspect, or the first, second, third, fourth, or fifth possible implementation of the fourth aspect, in a sixth possible implementation of the fourth aspect, the hash calculation unit being configured to obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data group is greater than or equal to a preset first similarity threshold comprises: the hash calculation unit is specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of matching bits at corresponding positions with the hash value of the first data group is greater than or equal to a preset number.
With reference to the sixth possible implementation of the fourth aspect, in a seventh possible implementation of the fourth aspect, the hash calculation unit being specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of matching bits at corresponding positions with the hash value of the first data group is greater than or equal to the preset number comprises: the hash calculation unit is specifically configured to obtain the Hamming distance between the hash value of the first data group and each hash value in the hash value storage table, and take, as the first hash value, a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
In the duplicate data search method and device provided by the embodiments of the present invention, received data is first divided into chunks and the chunks are then grouped. A similarity hash operation is performed on the data chunks in a data group to obtain the hash value of the data group. Next, a first hash value is obtained from the hash value storage table, which stores the hash values of the data groups already stored in the data storage space; the first hash value is one whose similarity to the hash value of the data group is greater than or equal to a preset first similarity threshold. It is then determined whether the similarity between the hash value of the data group and the first hash value is greater than or equal to a preset second similarity threshold. If so, the data chunks in the data group are very likely to be duplicate blocks, and a duplicate block search is performed on them. Because the hash value storage table being queried stores the correspondence between the data groups already stored in the data storage space and their hash values, and the number of data groups is relatively small, querying the table is efficient. In addition, performing the duplicate block search on a per-group basis reduces the number of duplicate block searches, that is, the number of interactions with the disk, which helps improve duplicate block search efficiency and thereby the overall performance of data deduplication.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a duplicate data search method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a similarity hash operation process provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a duplicate data search device provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a duplicate data search device provided by another embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a duplicate data search device provided by yet another embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a computer program product provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a duplicate data search method provided by an embodiment of the present invention. As shown in Fig. 1, the method of this embodiment comprises:
Step 101: Perform chunking on received data to obtain at least two data chunks.
The method of this embodiment may be executed by a duplicate data search device. In terms of implementation form, the device may be any device with computing capability, for example a server or computer in a data backup environment, or a terminal, gateway, or base station in a wide area network data transmission scenario.
After receiving data to be stored, the duplicate data search device first divides the data into chunks, obtaining at least two data chunks. Optionally, the device may use a chunking algorithm for this, for example but not limited to the fixed-sized partition (FSP) algorithm, the content-defined chunking (CDC) algorithm, or the sliding block algorithm. The size of a data chunk depends on the chunking algorithm adopted and on practical application requirements; the embodiments of the present invention do not limit its specific value. The process of chunking data with these algorithms belongs to the prior art and is not described in detail here.
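As an illustration of the simplest of the chunking options named above, the following is a minimal sketch of fixed-size chunking. The function name fixed_size_chunks and the chunk size used in the usage line are assumptions made for this example; the source does not prescribe a chunk size.

```python
def fixed_size_chunks(data: bytes, chunk_size: int) -> list:
    """Split data into consecutive chunks of chunk_size bytes.

    The last chunk may be shorter. chunk_size is an illustrative
    parameter; the source leaves the chunk size to the application.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


# Usage: ten bytes split into chunks of four bytes.
chunks = fixed_size_chunks(b"abcdefghij", 4)
```

Fixed-size chunking is cheap but sensitive to insertions that shift later boundaries, which is one reason content-defined chunking is listed as an alternative.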
Step 102: Group the at least two data chunks to obtain at least one data group, where each data group comprises at least one data chunk.
After chunking the data into data chunks, the duplicate data search device groups the resulting data chunks to obtain data groups; the number of data groups may be less than the number of data chunks. Grouping essentially assigns the data chunks to different data groups, and it can be done in several ways. For example, the device may divide the data chunks in sequence on the principle that each data group contains the same number of data chunks, forming at least one data group.
As another example, the duplicate data search device may apply a chunking algorithm a second time, to the data chunks already obtained, to produce at least one data group. This implementation comprises: forming to-be-chunked hash data from the hash values of the at least two data chunks; and, using the length of the hash value of any one of the at least two data chunks as the sliding step (the hash values of all data chunks have the same length), applying a chunking algorithm to the to-be-chunked hash data to obtain at least one hash value chunk. The sliding step is the minimum distance moved in one slide over the to-be-chunked hash data, and a hash value chunk may be obtained after one or more slides. Because the sliding step used by the chunking algorithm is measured in units of the hash value length, every hash value chunk consists of one or more complete hash values. If the sliding distance needed to obtain a hash value chunk is several sliding steps (that is, several slides), the hash value chunk consists of several hash values; if it is one sliding step (that is, a single slide), the hash value chunk consists of one hash value. After the hash value chunks are obtained, the data chunks corresponding to the hash values that belong to the same hash value chunk are taken as one data group, which yields at least one data group. With this grouping approach, the end position of every data group coincides with the end position of a data chunk, so the groups are divided more accurately. The process of applying a chunking algorithm to the to-be-chunked hash data is similar to that of existing chunking algorithms and is not repeated here. Forming the to-be-chunked hash data from the hash values of the at least two data chunks comprises: calculating the hash value of each of the at least two data chunks, and concatenating these hash values to form the to-be-chunked hash data.
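The grouping approach above can be sketched as follows. This is a hedged illustration, not the patent's algorithm: the use of SHA-1 for the per-chunk hash, the boundary rule (a group ends when the current hash value, read as an integer, is divisible by divisor), and the max_group cap are all assumptions chosen only to show a window that advances by exactly one hash length per slide.

```python
import hashlib

HASH_LEN = 20  # length in bytes of a SHA-1 digest (assumed chunk hash)


def group_chunks(chunks, divisor=4, max_group=8):
    """Group consecutive data chunks by chunking their concatenated hashes.

    The sliding step equals HASH_LEN, so every hash value chunk, and
    therefore every data group, ends on a data chunk boundary.
    """
    hashes = [hashlib.sha1(c).digest() for c in chunks]
    stream = b"".join(hashes)  # the to-be-chunked hash data
    groups, current = [], []
    for index, pos in enumerate(range(0, len(stream), HASH_LEN)):
        window = stream[pos:pos + HASH_LEN]  # exactly one complete hash value
        current.append(chunks[index])
        # Hypothetical content-defined boundary rule on the hash value.
        at_boundary = int.from_bytes(window, "big") % divisor == 0
        if at_boundary or len(current) == max_group:
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # trailing chunks form the last group
    return groups
```

Because the boundary depends only on the hash content, the same run of data chunks tends to be grouped the same way wherever it appears in the input, and the groups always contain consecutive data chunks.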
The data chunks in each data group are consecutive; that is, each data group consists of consecutive data chunks.
The number of data chunks may be the same in every data group or may differ between data groups. The number of data chunks in a data group depends on the practical application, and the embodiments of the present invention do not limit its specific value.
After the grouping described above, the duplicate block search can be performed on a per-group basis, which reduces the number of duplicate block searches and the number of interactions with the disk, and helps improve duplicate block search efficiency.
Step 103: For a first data group among the at least one data group, perform a similarity hash operation (similarity hash, or simhash) on the data chunks in the first data group to obtain the hash value of the first data group; obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data group is greater than or equal to a preset first similarity threshold; and if the similarity between the hash value of the first data group and the first hash value is greater than or equal to a preset second similarity threshold, perform a duplicate block search on the data chunks in the first data group.
Because every data group is processed in the same way, this embodiment takes any one of the data groups as an example and, for ease of distinction, refers to it as the first data group; that is, the first data group may be any one of the at least one data group obtained above. The hash value storage table stores the correspondence between the hash value of each second data group currently stored in the data storage space and that second data group. For ease of distinction and description, a data group currently stored in the data storage space is referred to as a second data group. The hash values of the second data groups in the hash value storage table are calculated in the same way as the hash value of the first data group in this embodiment, that is, by performing the similarity hash operation on the data chunks in the second data group. In addition, the data blocks corresponding to these hash values do not duplicate one another; that is, the data chunks in a second data group have been determined not to be duplicate blocks. The data storage space is the storage space used to store data chunks, and may be a hard disk, a magnetic disk, or the like.
Optionally, the hash value storage table stores the hash values of the data groups already stored in the data storage space, and a data group in this embodiment is composed of data chunks, so the number of data groups is not greater than the number of data chunks. When the number of data groups is less than the number of data chunks, the hash value storage table of this embodiment is much smaller than a hash table that stores an entry for every data chunk, so it can be kept in memory. This helps improve the efficiency of querying the hash value storage table and further improves duplicate block search efficiency. The hash value storage table is not limited to being stored in memory; it may also be stored on a disk or another storage device, but storing it in memory is preferred. The hash value storage table of this embodiment may be implemented as a sparse hash table, but is not limited thereto.
After obtaining the data groups, the duplicate data search device processes each data group in the same way. Taking the first data group as an example, the device processes it as follows:
First, a similarity hash operation is performed on the data chunks in the first data group to obtain the hash value of the first data group. The principle of similarity hashing is that the more similar two pieces of data are, the more similar the hash values calculated for them, and vice versa. A similarity hash operation is an operation under which more similar data yields more similar hash values. For example, one method of performing the similarity hash operation on the first data group comprises: performing a hash operation on each data chunk in the first data group to obtain the hash value of each data chunk; representing the hash value of each data chunk in binary form and converting each bit of it, which in a specific implementation may mean replacing every bit whose value is 0 with -1 and leaving every bit whose value is 1 unchanged; and then accumulating the converted hash values, which in a specific implementation may mean adding the converted hash values position by position, mapping each position whose sum is greater than 0 to 1, and mapping each position whose sum is less than or equal to 0 to 0. The binary value obtained in this way is the hash value of the first data group. A preferred implementation of the similarity hash operation is described with reference to Fig. 2. As shown in Fig. 2, the first data group comprises n data chunks, namely the first data chunk through the n-th data chunk. A hash operation is performed on each data chunk to obtain a hash value in binary form. Fig. 2 shows that the binary hash values of the first data chunk, the second data chunk, and the n-th data chunk are 100110, 110000, and 001001, respectively. Each 0 in the binary hash value of each data chunk is replaced with -1, so the replaced hash values of the first data chunk, the second data chunk, and the n-th data chunk are (1, -1, -1, 1, 1, -1), (1, 1, -1, -1, -1, -1), and (-1, -1, 1, -1, -1, 1), respectively. The corresponding positions of the replaced hash values of the n data chunks are then added in turn, finally giving the result 13, 18, -22, -5, -2, 5. Values greater than 0 in this result are mapped to 1 and values less than or equal to 0 are mapped to 0, giving the binary value 110001, which is the hash value of the first data group.
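The replace, add, and map steps described above can be sketched in a few lines. This is an illustrative sketch: the function names column_sums and bits_from_sums are assumptions, and hash values are passed as bit strings for readability. Note that the usage lines feed in only the three hash values printed for Fig. 2, so their result differs from the figure's 110001, which comes from all n data chunks.

```python
def column_sums(chunk_hashes):
    """Add the hash values position by position, counting 0 bits as -1."""
    width = len(chunk_hashes[0])
    sums = [0] * width
    for bits in chunk_hashes:
        if len(bits) != width:
            raise ValueError("all hash values must have the same length")
        for i, bit in enumerate(bits):
            sums[i] += 1 if bit == "1" else -1
    return sums


def bits_from_sums(sums):
    """Map sums greater than 0 to 1 and sums less than or equal to 0 to 0."""
    return "".join("1" if s > 0 else "0" for s in sums)


def group_simhash(chunk_hashes):
    """Similarity hash of a data group from its data chunk hash values."""
    return bits_from_sums(column_sums(chunk_hashes))


# The per-position sums reported for Fig. 2 map to the stated group hash.
fig2_hash = bits_from_sums([13, 18, -22, -5, -2, 5])  # "110001"

# Three hash values alone give sums [1, -1, -1, -1, -1, -1].
three_chunk_hash = group_simhash(["100110", "110000", "001001"])  # "100000"
```

Because each output bit is a majority vote over the data chunks in the group, changing a few data chunks flips only a few bits of the group hash, which is what makes bitwise comparison of group hashes meaningful.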
Besides the similarity hash operation described above, this embodiment may also use another similarity hash operation, for example a perceptual hash algorithm, to perform the similarity hash operation on the data chunks in the first data group. The principle of perceptual hashing is to generate a "fingerprint" string for each picture and then compare the fingerprints of different pictures; the more similar the fingerprints, the more similar the pictures. Applied to the duplicate data search method of this embodiment, the principle is to calculate one hash value for each data group and then compare the hash values of different data groups: the more similar two hash values are, the more data blocks in the two data groups are likely to be duplicates (that is, the more similar the two data groups are).
By introducing the similarity hash operation, this embodiment makes full use of the property that the more similar two hash values are, the more similar the corresponding data groups are. Comparing the calculated hash value of a data group with the hash values of the data groups that already exist indicates, to some extent, how likely the data chunks in this data group are to duplicate data chunks already stored in the data storage space. The more similar the calculated hash value is to the hash value of an existing data group, the more likely the data chunks in this data group are duplicates. If it is determined from the hash value of the data group that a duplicate block search is needed, the data chunks in the data group are very likely to be duplicate blocks, so performing the duplicate block search at that point improves its performance. The comparison procedure below illustrates how the method of this embodiment improves the performance of the duplicate block search.
Then, after calculating the hash value of the first data group, the duplicate data search device compares it with each hash value in the hash value storage table and obtains the hash values whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold; these are referred to as first hash values. Optionally, in a specific implementation, if several hash values meet the preset first similarity threshold, all of them may be obtained, and each of them is a first hash value; if only one hash value meets the preset first similarity threshold, that hash value is taken as the first hash value, that is, a single first hash value is obtained. Preferably, the hash value in the hash value storage table that is most similar to the hash value of the first data group may be taken as the first hash value, but this is not a limitation. One implementation of obtaining a hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold is as follows: the duplicate data search device obtains, as the first hash value, a hash value in the hash value storage table for which the number of matching bits at corresponding positions with the hash value of the first data group is greater than or equal to a preset number. In this implementation, the number of matching bits at corresponding positions of two hash values characterizes their similarity: the more matching bits, the more similar the two hash values, and vice versa. The preset number here is equivalent to the preset first similarity threshold. Further, one way of obtaining such a hash value comprises: the duplicate data search device obtains the Hamming distance between the hash value of the first data group and each hash value in the hash value storage table, and takes, as the first hash value, a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold. The Hamming distance between the hash value of the first data group and a hash value in the hash value storage table reflects, to some extent, the degree of duplication between the first data group and the second data group corresponding to that hash value. The smaller the Hamming distance (that is, the more matching bits), the higher the degree of duplication between the first data group and the corresponding second data group. Besides the Hamming distance, other parameters that can represent the similarity of two hash values may also be used. The preset Hamming distance threshold here is equivalent to the preset number above.
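A minimal sketch of the Hamming-distance selection described above follows. Group hash values are represented as Python integers, and the names hamming_distance, find_first_hashes, and max_distance are assumptions for this example. For hash values of w bits, requiring at least k matching bits is the same as requiring a Hamming distance of at most w - k, which is the equivalence between the preset number and the preset Hamming distance threshold.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two hash values differ."""
    return bin(a ^ b).count("1")


def find_first_hashes(group_hash: int, table_hashes, max_distance: int):
    """Return every stored hash within max_distance of group_hash."""
    return [h for h in table_hashes
            if hamming_distance(group_hash, h) <= max_distance]


def closest_hash(group_hash: int, table_hashes):
    """Return the stored hash most similar to group_hash, or None."""
    return min(table_hashes,
               key=lambda h: hamming_distance(group_hash, h),
               default=None)


# Usage with 6-bit hash values: only the middle entry is far away.
stored = [0b110011, 0b001110, 0b110001]
candidates = find_first_hashes(0b110001, stored, max_distance=1)
```

A linear scan is used here only for clarity; since the table holds one entry per data group rather than per data chunk, it is comparatively small, as the text notes.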
Then, the duplicate data search device compares the similarity between the hash value of the first data group and the first hash value with the preset second similarity threshold, to determine whether a duplicate block search needs to be performed on the first data group. If the similarity is greater than or equal to the second similarity threshold, the degree of duplication between the first data group and the second data group corresponding to the first hash value is high, and it can be concluded that the two contain many duplicate blocks; therefore, a duplicate block search needs to be performed on the first data group. Optionally, if the number of matching bits at corresponding positions of two hash values is used to characterize their similarity, the second similarity threshold may be a matching-bit-count threshold. Accordingly, comparing the similarity between the hash value of the first data group and the first hash value with the preset second similarity threshold may be: the duplicate data search device determines whether the number of matching bits at corresponding positions of the hash value of the first data group and the first hash value is greater than or equal to a preset matching-bit-count threshold.
It should be noted that the second similarity threshold is greater than or equal to the first similarity threshold.
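The two-threshold decision can be sketched as below, with similarity measured as the number of matching bits. This is a hedged illustration: the function names, the choice of the most similar candidate as the first hash value (the text calls this preferred, not required), and the behaviour of returning None when no candidate exists are assumptions of this example.

```python
def matching_bits(a: int, b: int, width: int) -> int:
    """Number of positions at which two width-bit hash values agree."""
    mask = (1 << width) - 1
    return width - bin((a ^ b) & mask).count("1")


def select_search_target(group_hash, table_hashes, width,
                         first_threshold, second_threshold):
    """Return the first hash value to search against, or None.

    None means no duplicate block search is performed for this group.
    """
    if second_threshold < first_threshold:
        raise ValueError("second threshold must not be below the first")
    candidates = [h for h in table_hashes
                  if matching_bits(group_hash, h, width) >= first_threshold]
    if not candidates:
        return None
    best = max(candidates, key=lambda h: matching_bits(group_hash, h, width))
    if matching_bits(group_hash, best, width) >= second_threshold:
        return best
    return None


# Usage with 6-bit hash values: 0b110011 matches 0b110001 in 5 positions.
target = select_search_target(0b110001, [0b110011, 0b111111],
                              width=6, first_threshold=3, second_threshold=5)
```

The first threshold acts as a coarse filter over the table, while the second decides whether the disk-bound duplicate block search is worth performing at all.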
Optionally, the data storage space comprises a plurality of storage areas, each storage area has a number, and the storage areas are used one after another in ascending order of number. Correspondingly, in addition to the correspondence between the hash value of the second data group and the second data group, the hash value storage table also records the correspondence between the hash value of the second data group and the number of the storage area in which that second data group is located. On this basis, the duplicate block search on the first data group may proceed as follows: the duplicate data search device obtains, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and loads the data chunks and the data chunk hash values in the storage area numbered n into memory, where n is an integer greater than or equal to 0; it then compares the data chunks in the first data group whose hash values are identical to hash values in the storage area numbered n, so as to complete the duplicate block search on the data chunks in the first data group.
Optionally, when the data chunks and the data chunk hash values in the storage area numbered n are loaded into memory, the data chunks and the data chunk hash values in the storage area numbered (n+1) are also loaded into memory. On this basis, the comparison described above becomes: the data chunks in the first data group whose hash values are identical to hash values in the storage areas numbered n and (n+1) are compared, so as to complete the duplicate block search on the data chunks in the first data group.
This comparison may proceed as follows: first, the hash value of each data chunk in the first data group is compared with the hash values in the storage areas numbered n and (n+1), to find the hash values in the first data group that also appear in those storage areas; for ease of description, these identical hash values are referred to as second hash values. Then, the data chunk corresponding to each second hash value in the first data group is compared with the data chunk corresponding to that second hash value in the storage areas numbered n and (n+1), so as to complete the duplicate block search on the data chunks in the first data group.
Because the storage areas are used one after another in ascending order of number, the storage area numbered (n+1) is the storage area that follows the storage area numbered n; that is, once the storage area numbered n has been filled, data continues to be written to the storage area numbered (n+1). The data received next is very likely to have duplicates in the storage area that follows the one corresponding to the first hash value (that is, the storage area numbered (n+1)). Loading the contents of the storage area corresponding to the first hash value (the storage area numbered n) and of the following storage area into memory in one pass therefore helps improve the efficiency of the subsequent duplicate block search, and in turn the overall duplicate block search efficiency.
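The search over the storage areas numbered n and (n+1) can be sketched as follows. The in-memory model is an assumption for this example: storage_areas maps an area number to a dictionary from data chunk hash to data chunk, SHA-1 stands in for the unspecified chunk hash, and the function name find_duplicates is hypothetical.

```python
import hashlib


def find_duplicates(group, storage_areas, n):
    """Split a data group into duplicate chunks and new chunks.

    Areas n and n+1 are loaded together, then each chunk whose hash
    appears there is compared byte for byte with the stored chunk.
    """
    loaded = {}
    for number in (n, n + 1):
        # A missing area (for example, n is the last one) loads nothing.
        loaded.update(storage_areas.get(number, {}))

    duplicates, new_chunks = [], []
    for chunk in group:
        digest = hashlib.sha1(chunk).digest()
        stored = loaded.get(digest)
        # Equal hash values select a candidate; the data decides.
        if stored is not None and stored == chunk:
            duplicates.append(chunk)
        else:
            new_chunks.append(chunk)
    return duplicates, new_chunks
```

Comparing the chunk contents after the hash values match guards against hash collisions, at the cost of reading data that has in any case already been loaded into memory.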
It should be noted that this embodiment uses separate storage areas to store the data chunks and the data chunk hash values, but is not limited thereto. A preferred way of partitioning the storage is: data chunks are stored together in one storage area in the order in which they are received; once that storage area is full, subsequently received data chunks are stored in the next storage area. Each storage area is a section of storage space with a certain size, for example but not limited to 64 MB.
Each storage area stores both data chunks and data chunk hash values; the specific storage layout is not limited. A preferred layout is: the storage area is divided into two parts. One part is the data segment area, which stores the data chunks. The other part is the metadata area, which stores the metadata corresponding to the data chunks in the data segment area. The metadata includes information such as the hash value of each data chunk, the length of each data chunk, the length of the data segment, and some check codes. The duplicate data search process of the present invention mainly uses the data chunk hash values in the metadata.
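The storage layout described above can be modelled roughly as follows. This is a simplified in-memory sketch under stated assumptions: the class names ChunkMeta, StorageArea, and DataStore are hypothetical, the offset field is added for illustration, and check codes are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkMeta:
    chunk_hash: bytes  # hash value of the data chunk
    offset: int        # position in the data segment (illustrative field)
    length: int        # length of the data chunk


@dataclass
class StorageArea:
    number: int
    capacity: int = 64 * 1024 * 1024  # 64 MB, the example size in the text
    data_segment: bytearray = field(default_factory=bytearray)
    metadata: list = field(default_factory=list)

    def has_room(self, chunk: bytes) -> bool:
        return len(self.data_segment) + len(chunk) <= self.capacity

    def append(self, chunk: bytes, chunk_hash: bytes) -> None:
        self.metadata.append(
            ChunkMeta(chunk_hash, len(self.data_segment), len(chunk)))
        self.data_segment.extend(chunk)


class DataStore:
    """Fills storage areas in ascending order of number."""

    def __init__(self, area_capacity: int = 64 * 1024 * 1024) -> None:
        self.area_capacity = area_capacity
        self.areas = [StorageArea(0, area_capacity)]

    def store(self, chunk: bytes, chunk_hash: bytes) -> int:
        """Store a chunk and return the number of the area used."""
        if len(chunk) > self.area_capacity:
            raise ValueError("chunk is larger than a storage area")
        current = self.areas[-1]
        if not current.has_room(chunk):
            # The current area is full; continue in the next number.
            current = StorageArea(current.number + 1, self.area_capacity)
            self.areas.append(current)
        current.append(chunk, chunk_hash)
        return current.number
```

Keeping the hash values in a separate metadata list means a duplicate search can scan the hashes of an area without touching the data segment until a hash value actually matches.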
Optionally, if the similarity between the first hash value and the hash value of the first data group obtained by the similarity hash operation in step 103 is less than the preset second similarity threshold, the degree of duplication between the first data group and the second data group corresponding to the first hash value is not high, and it can be concluded that the two contain no duplicate blocks or very few. For example, only one or two data chunks in the first data group might duplicate data chunks in the second data group corresponding to the first hash value. To improve overall performance, the data chunks in the first data group may be treated as new data; that is, no duplicate block search is performed and they are stored directly in the data storage space. Further, if the data storage space comprises a plurality of storage areas, the duplicate data search device may store the data chunks in the first data group and their hash values directly in the storage area currently in use.
Therefore, the repeating data search method that present embodiment provides, to the data elder generation piecemeal that receives, grouping again, deblocking in the data grouping is carried out the similarity Hash operation, obtain the cryptographic hash of packet, obtain in the cryptographic hash that stores each packet in the data space into of storing in the cryptographic hash of packet and the cryptographic hash storage list similarity then more than or equal to first cryptographic hash of default first similarity threshold, whether the cryptographic hash of judgment data grouping and the similarity of first cryptographic hash be more than or equal to the second default similarity threshold, if greater than, illustrate that the deblocking in this packet is repeatable block to a great extent, then it is carried out the repeatable block retrieval, because what store in the inquiry cryptographic hash storage list is to have stored the cryptographic hash of the packet in the data space and the corresponding relation of packet into, and the quantity of packet is less relatively, so the efficient of inquiry cryptographic hash storage list is higher, and carry out the number of times that the repeatable block retrieval has reduced the repeatable block retrieval based on packet, namely reduced the number of times mutual with disk, be conducive to improve the repeatable block search efficiency, thereby improved the overall performance of data de-duplication technology.
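The two-threshold decision summarized above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: hash values are modeled as Python integers of a fixed bit width, similarity is assumed to be the number of matching bit positions, picking the most similar candidate is an assumption, and the names `similarity` and `decide` are hypothetical.

```python
def similarity(a: int, b: int, width: int) -> int:
    """Number of bit positions (out of `width`) on which two hash values agree."""
    return width - bin(a ^ b).count("1")


def decide(group_hash, hash_table, width, first_threshold, second_threshold):
    """Return ("search_duplicates", first_hash) or ("store_as_new", None).

    hash_table is any iterable of stored group hash values (assumed shape).
    """
    candidates = [h for h in hash_table
                  if similarity(group_hash, h, width) >= first_threshold]
    if not candidates:
        # Nothing similar enough: treat the group as new data.
        return ("store_as_new", None)
    # Assumption: use the most similar candidate as the first hash value.
    first_hash = max(candidates, key=lambda h: similarity(group_hash, h, width))
    if similarity(group_hash, first_hash, width) >= second_threshold:
        return ("search_duplicates", first_hash)
    return ("store_as_new", None)
```

For example, with the 8-bit table `[0b11110000, 0b00001111]`, the group hash `0b11110001` agrees with the first entry on 7 bits, so thresholds of 5 and 7 lead to a duplicate-block search, while raising the second threshold to 8 causes the group to be stored as new data.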
Fig. 3 is a schematic structural diagram of a duplicate data search device provided by an embodiment of the present invention. The device of this embodiment may be implemented as any equipment with computing and storage capability, for example a server or computer in a data backup environment, or a terminal, gateway, or base station in a wide-area-network data transmission scenario; the embodiments of the present invention do not limit the specific implementation of the duplicate data search device. As shown in Fig. 3, the device of this embodiment includes a block acquisition module 31, a group acquisition module 32, a hash calculation module 33, and a duplicate search module 34.
The block acquisition module 31 is configured to divide the received data into blocks, obtaining at least two data blocks.
The group acquisition module 32 is connected to the block acquisition module 31 and is configured to group the at least two data blocks obtained by the block acquisition module 31, obtaining at least one data group, each of which includes at least one data block.
The hash calculation module 33 is connected to the group acquisition module 32 and is configured, for a first data group among the at least one data group obtained by the group acquisition module 32, to perform a similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group, and to obtain from the hash value storage table a first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold. The hash value storage table stores the correspondence between the hash value of a second data group already stored in the data storage space and that second data group; the hash value of the second data group is obtained by performing the similarity hash operation on the data blocks in the second data group. The first data group is any one of the at least one data group.
The duplicate search module 34 is connected to the hash calculation module 33 and is configured to perform a duplicate-block search on the data blocks in the first data group when the similarity between the hash value of the first data group and the first hash value obtained by the hash calculation module 33 is greater than or equal to the preset second similarity threshold.
In an optional embodiment, as shown in Fig. 4, the duplicate data search device of this embodiment further includes a storage module 35. The storage module 35 is connected to the hash calculation module 33 and is configured, when the similarity between the hash value of the first data group and the first hash value obtained by the hash calculation module 33 is less than the second similarity threshold, to store the data blocks in the first data group and the hash values of those data blocks in the data storage space, and to store the correspondence between the hash value of the first data group and the first data group in the hash value storage table.
Note that the hash calculation module 33, the duplicate search module 34, and the storage module 35 perform the same operations on every data group.
In an optional embodiment, the group acquisition module 32 may specifically be configured to form to-be-chunked hash data from the hash values of the at least two data blocks obtained by the block acquisition module 31, to chunk this hash data with a chunking algorithm whose sliding step is the length of one data block hash value, obtaining at least one hash value chunk, and to take the data blocks corresponding to the hash values in the same hash value chunk as one data group, thereby obtaining at least one data group.
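One way to picture this grouping step is sketched below. The text only says that a chunking algorithm slides over the sequence of block hash values one hash at a time; the boundary rule used here (a hash value divisible by `divisor`, plus a maximum group size) is an assumption chosen for illustration, as are the function name `group_blocks`, its parameters, and the use of SHA-1.

```python
import hashlib


def group_blocks(blocks, divisor=4, max_group=16):
    """Split a list of data blocks into data groups.

    The sequence of block hashes is scanned one hash per step (the sliding
    step equals one hash length). A group boundary is declared after a block
    whose hash is divisible by `divisor`, or when the group reaches
    `max_group` blocks. Both rules are illustrative assumptions.
    """
    groups, current = [], []
    for block in blocks:
        current.append(block)
        h = int.from_bytes(hashlib.sha1(block).digest(), "big")
        if h % divisor == 0 or len(current) >= max_group:
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # trailing blocks form the last group
    return groups
```

Because the boundaries depend only on block hash values, identical runs of blocks tend to fall into identical groups regardless of where they appear in the input, which is what makes group-level hashes comparable across data streams.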
In an optional embodiment, the hash calculation module 33 performs the similarity hash operation on the data blocks in the first data group and obtains the hash value of the first data group as follows: it performs a hash operation on each data block in the first data group to obtain the hash value of each data block; replaces every 0 in each of these hash values with -1; adds the hash values of all data blocks in the first data group position by position; maps each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0; and takes the resulting binary value as the hash value of the first data group.
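The bit-voting procedure described above can be written out directly. The sketch below assumes SHA-1 (160 bits) as the per-block hash, which the text does not specify, and uses hypothetical function names; hash values are represented as lists of bits for readability.

```python
import hashlib

HASH_BITS = 160  # SHA-1 width; the choice of SHA-1 is an assumption


def block_hash_bits(block: bytes):
    """Hash one data block and return its bits, most significant first."""
    value = int.from_bytes(hashlib.sha1(block).digest(), "big")
    return [(value >> (HASH_BITS - 1 - i)) & 1 for i in range(HASH_BITS)]


def group_hash_from_bits(bit_rows):
    """Combine per-block hash bits into one group hash.

    Each 0 counts as -1 and each 1 as +1; the columns are summed, and a
    column maps to 1 if its sum is greater than 0, otherwise to 0.
    """
    width = len(bit_rows[0])
    sums = [0] * width
    for row in bit_rows:
        for i, bit in enumerate(row):
            sums[i] += 1 if bit else -1
    return [1 if s > 0 else 0 for s in sums]


def group_hash(blocks):
    """Similarity hash of a data group given its data blocks."""
    return group_hash_from_bits([block_hash_bits(b) for b in blocks])
```

As a small worked case, the three 4-bit rows `1010`, `1100`, and `1001` give column sums of 3, -1, -1, and -1, so the group hash is `1000`. A tie (sum of exactly 0) maps to 0, as the text specifies.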
In an optional embodiment, the data storage space includes multiple storage regions; correspondingly, the hash value storage table also stores the correspondence between the hash value of the second data group and the number of the storage region in which the second data group resides. On this basis, the duplicate search module 34 may specifically be configured to obtain from the hash value storage table the number n of the storage region corresponding to the first hash value, where n is an integer greater than or equal to 0, and to load the data blocks and data block hash values in the storage region numbered n into memory. It then compares the data blocks in the first data group with the data blocks in the storage region numbered n that have the same hash value, thereby completing the duplicate-block search for the data blocks in the first data group.
In an optional embodiment, when loading the data blocks and data block hash values in the storage region numbered n into memory, the duplicate search module 34 also loads the data blocks and data block hash values in the storage region numbered (n+1) into memory. In this case, the duplicate search module 34 compares the data blocks in the first data group with the data blocks in the storage regions numbered n and (n+1) that have the same hash value, thereby completing the duplicate-block search for the data blocks in the first data group.
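The region-based search, including the preloading of region n+1, can be sketched as follows. This is an illustration only: storage regions are modeled as an in-memory dictionary standing in for disk, SHA-1 is an assumed block hash, and the names `block_hash`, `load_region`, and `find_duplicates` are hypothetical.

```python
import hashlib


def block_hash(block: bytes) -> bytes:
    # SHA-1 is an assumption; the text does not name the block hash function.
    return hashlib.sha1(block).digest()


def load_region(regions, number):
    """Return {block_hash: block} for one region; `regions` stands in for disk."""
    return regions.get(number, {})


def find_duplicates(group_blocks, regions, n):
    """Split a data group into (duplicates, new_blocks).

    Regions n and n+1 are loaded together, then each block of the group is
    compared byte for byte with the stored block that has the same hash.
    """
    cache = {}
    cache.update(load_region(regions, n))
    cache.update(load_region(regions, n + 1))
    duplicates, new_blocks = [], []
    for block in group_blocks:
        stored = cache.get(block_hash(block))
        if stored is not None and stored == block:
            duplicates.append(block)
        else:
            new_blocks.append(block)
    return duplicates, new_blocks
```

Loading region n+1 alongside region n reflects the storage order described earlier: blocks are written in arrival order, so a group that begins near the end of one region is likely to continue into the next.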
In an optional embodiment, the hash calculation module 33 obtains from the hash value storage table the first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold as follows: it takes as the first hash value a hash value in the hash value storage table for which the number of matching bits at corresponding positions with the hash value of the first data group is greater than or equal to a preset number.
More specifically, the hash calculation module 33 may obtain the Hamming distance between the hash value of the first data group and each hash value in the hash value storage table, and take as the first hash value a hash value in the table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
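The Hamming distance lookup can be sketched as below. Hash values are modeled as Python integers, the table is assumed to map each stored group hash to a storage region number, and returning the closest qualifying entry is an assumption (the text only requires the distance to be within the threshold). The function names are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def find_first_hash(group_hash, hash_table, max_distance):
    """Return the stored hash closest to group_hash within max_distance, else None.

    hash_table: dict mapping stored group hash -> region number (assumed shape).
    """
    best, best_distance = None, None
    for stored in hash_table:
        d = hamming_distance(group_hash, stored)
        if d <= max_distance and (best_distance is None or d < best_distance):
            best, best_distance = stored, d
    return best
```

For hash values of width w, a Hamming distance of at most d is the same condition as at least w - d matching bit positions, so this lookup and the "number of matching bits" formulation describe the same test.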
The functional modules of the duplicate data search device provided by this embodiment of the present invention can be used to carry out the duplicate data search method shown in Fig. 1. Their specific working principles are not repeated here; see the description of the method embodiment for details.
The duplicate data search device of this embodiment first divides the received data into data blocks and then groups the blocks. It performs a similarity hash operation on the data blocks in a data group to obtain the hash value of the data group, and then obtains, from the hash values of the stored data groups recorded in the hash value storage table, a first hash value whose similarity to the hash value of the data group is greater than or equal to the preset first similarity threshold. It then determines whether the similarity between the hash value of the data group and the first hash value is greater than or equal to the preset second similarity threshold. If so, the data blocks in the data group are largely duplicate blocks, and a duplicate-block search is performed on them. The hash value storage table records the correspondence between the hash values of the data groups already stored in the data storage space and those data groups, and the number of data groups is relatively small, so querying the table is efficient. Performing the duplicate-block search per data group also reduces the number of searches, that is, the number of disk interactions, which improves duplicate-block search efficiency and thus the overall performance of data deduplication.
Fig. 5 is a schematic structural diagram of a duplicate data search device provided by a further embodiment of the present invention. The device of this embodiment may be implemented as any equipment with computing and storage capability, for example a server or computer in a data backup environment, or a terminal, gateway, or base station in a wide-area-network data transmission scenario; the embodiments of the present invention do not limit the specific implementation of the duplicate data search device. As shown in Fig. 5, the duplicate data search device of this embodiment includes:
A processor 51, a communication interface (Communications Interface) 53, a memory 52, and a bus. The processor 51, the memory 52, and the communication interface 53 are connected by the bus and communicate with one another over it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, it is drawn as a single thick line in Fig. 5, but this does not mean there is only one bus or one type of bus. Specifically:
The communication interface 53 is configured to receive data.
The processor 51 is configured to execute a program. Specifically, the program may include program code, and the program code includes computer operation instructions.
The processor 51 may be a central processing unit (CPU), an application-specific integrated circuit (Application Specific Integrated Circuit, hereinafter ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
The memory 52 is configured to store the program. The memory 52 may include high-speed RAM and may also include non-volatile memory, for example at least one disk storage device.
The program may specifically be used to: divide the data received by the communication interface 53 into blocks, obtaining at least two data blocks; group the at least two data blocks, obtaining at least one data group, each of which includes at least one data block; for a first data group among the at least one data group, perform a similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group, and obtain from the hash value storage table a first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold, where the hash value storage table stores the correspondence between the hash value of a second data group already stored in the data storage space and that second data group, the hash value of the second data group is obtained by performing the similarity hash operation on the data blocks in the second data group, and the first data group is any one of the at least one data group; and, if the similarity between the hash value of the first data group and the first hash value is greater than or equal to the preset second similarity threshold, perform a duplicate-block search on the data blocks in the first data group.
In an optional embodiment, the program stored in the memory 52 is also used, when the similarity between the hash value of the first data group and the first hash value is less than the second similarity threshold, to store the data blocks in the first data group and the hash values of those data blocks in the data storage space, and to store the correspondence between the hash value of the first data group and the first data group in the hash value storage table.
In an optional embodiment, the program stored in the memory 52 groups the at least two data blocks and obtains at least one data group as follows: it forms to-be-chunked hash data from the hash values of the at least two data blocks; chunks this hash data with a chunking algorithm whose sliding step is the length of the hash value of any one data block, obtaining at least one hash value chunk; and takes the data blocks corresponding to the hash values in the same hash value chunk as one data group.
In an optional embodiment, the program stored in the memory 52 performs the similarity hash operation on the data blocks in the first data group and obtains the hash value of the first data group as follows: it performs a hash operation on each data block in the first data group to obtain the hash value of each data block; replaces every 0 in each of these hash values with -1; adds the hash values of all data blocks in the first data group position by position; maps each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0; and takes the resulting binary value as the hash value of the first data group.
In an optional embodiment, the data storage space includes multiple storage regions, and the hash value storage table also stores the correspondence between the hash value of the second data group and the number of the storage region in which the second data group resides. On this basis, the program stored in the memory 52 performs the duplicate-block search on the data blocks in the first data group as follows: it obtains from the hash value storage table the number n of the storage region corresponding to the first hash value, where n is an integer greater than or equal to 0; loads the data blocks and data block hash values in the storage region numbered n into memory; and compares the data blocks in the first data group with the data blocks in the storage region numbered n that have the same hash value, thereby completing the duplicate-block search for the data blocks in the first data group.
Optionally, when loading the data blocks and data block hash values in the storage region numbered n into memory, the program stored in the memory 52 also loads the data blocks and data block hash values in the storage region numbered (n+1) into memory. In this case, the program compares the data blocks in the first data group with the data blocks in the storage regions numbered n and (n+1) that have the same hash value, thereby completing the duplicate-block search for the data blocks in the first data group.
In an optional embodiment, the program stored in the memory 52 obtains from the hash value storage table the first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold as follows: it takes as the first hash value a hash value in the hash value storage table for which the number of matching bits at corresponding positions with the hash value of the first data group is greater than or equal to a preset number.
In an optional embodiment, the program stored in the memory 52 does so by obtaining the Hamming distance between the hash value of the first data group and each hash value in the hash value storage table, and taking as the first hash value a hash value in the table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
The duplicate data search device provided by this embodiment of the present invention can be used to carry out the duplicate data search method shown in Fig. 1. Its specific working principle is not repeated here; see the description of the method embodiment for details.
The duplicate data search device of this embodiment first divides the received data into data blocks and then groups the blocks. It performs a similarity hash operation on the data blocks in a data group to obtain the hash value of the data group, and then obtains, from the hash values of the stored data groups recorded in the hash value storage table, a first hash value whose similarity to the hash value of the data group is greater than or equal to the preset first similarity threshold. It then determines whether the similarity between the hash value of the data group and the first hash value is greater than or equal to the preset second similarity threshold. If so, the data blocks in the data group are largely duplicate blocks, and a duplicate-block search is performed on them. The hash value storage table records the correspondence between the hash values of the data groups already stored in the data storage space and those data groups, and the number of data groups is relatively small, so querying the table is efficient. Performing the duplicate-block search per data group also reduces the number of searches, that is, the number of disk interactions, which improves duplicate-block search efficiency and thus the overall performance of data deduplication.
An embodiment of the present invention provides a computer program product that includes a computer-readable storage medium for storing a program. As shown in Fig. 6, the program includes:
A block acquisition unit 81, configured to divide the received data into blocks, obtaining at least two data blocks.
A group acquisition unit 82, connected to the block acquisition unit 81 and configured to group the at least two data blocks obtained by the block acquisition unit 81, obtaining at least one data group, each of which includes at least one data block.
A hash calculation unit 83, connected to the group acquisition unit 82 and configured, for a first data group among the at least one data group obtained by the group acquisition unit 82, to perform a similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group, and to obtain from the hash value storage table a first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold. The hash value storage table stores the correspondence between the hash value of a second data group already stored in the data storage space and that second data group; the hash value of the second data group is obtained by performing the similarity hash operation on the data blocks in the second data group. The first data group is any one of the at least one data group.
A duplicate search unit 84, connected to the hash calculation unit 83 and configured to perform a duplicate-block search on the data blocks in the first data group when the similarity between the hash value of the first data group and the first hash value is greater than or equal to the preset second similarity threshold.
In an optional embodiment, as shown in Fig. 6, this embodiment further includes a storage unit 85. The storage unit 85 is connected to the hash calculation unit 83 and is configured, when the similarity between the hash value of the first data group and the first hash value obtained by the hash calculation unit 83 is less than the second similarity threshold, to store the data blocks in the first data group and the hash values of those data blocks in the data storage space, and to store the correspondence between the hash value of the first data group and the first data group in the hash value storage table.
Note that the hash calculation unit 83, the duplicate search unit 84, and the storage unit 85 perform the same operations on every data group.
In an optional embodiment, the group acquisition unit 82 may specifically be configured to form to-be-chunked hash data from the hash values of the at least two data blocks obtained by the block acquisition unit 81, to chunk this hash data with a chunking algorithm whose sliding step is the length of one data block hash value, obtaining at least one hash value chunk, and to take the data blocks corresponding to the hash values in the same hash value chunk as one data group, thereby obtaining at least one data group.
In an optional embodiment, the hash calculation unit 83 performs the similarity hash operation on the data blocks in the first data group and obtains the hash value of the first data group as follows: it performs a hash operation on each data block in the first data group to obtain the hash value of each data block; replaces every 0 in each of these hash values with -1; adds the hash values of all data blocks in the first data group position by position; maps each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0; and takes the resulting binary value as the hash value of the first data group.
In an optional embodiment, the data storage space includes multiple storage regions; correspondingly, the hash value storage table also stores the correspondence between the hash value of the second data group and the number of the storage region in which the second data group resides. On this basis, the duplicate search unit 84 may specifically be configured to obtain from the hash value storage table the number n of the storage region corresponding to the first hash value, where n is an integer greater than or equal to 0, and to load the data blocks and data block hash values in the storage region numbered n into memory. It then compares the data blocks in the first data group with the data blocks in the storage region numbered n that have the same hash value, thereby completing the duplicate-block search for the data blocks in the first data group.
In an optional embodiment, when loading the data blocks and data block hash values in the storage region numbered n into memory, the duplicate search unit 84 also loads the data blocks and data block hash values in the storage region numbered (n+1) into memory. In this case, the duplicate search unit 84 compares the data blocks in the first data group with the data blocks in the storage regions numbered n and (n+1) that have the same hash value, thereby completing the duplicate-block search for the data blocks in the first data group.
In an optional embodiment, the hash calculation unit 83 obtains from the hash value storage table the first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold as follows: it takes as the first hash value a hash value in the hash value storage table for which the number of matching bits at corresponding positions with the hash value of the first data group is greater than or equal to a preset number.
More specifically, the hash calculation unit 83 may obtain the Hamming distance between the hash value of the first data group and each hash value in the hash value storage table, and take as the first hash value a hash value in the table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
The duplicate data search device provided by this embodiment of the present invention can be used to carry out the duplicate data search method shown in Fig. 1. Its specific working principle is not repeated here; see the description of the method embodiment for details.
The duplicate data search device of this embodiment first divides the received data into data blocks and then groups the blocks. It performs a similarity hash operation on the data blocks in a data group to obtain the hash value of the data group, and then obtains, from the hash values of the stored data groups recorded in the hash value storage table, a first hash value whose similarity to the hash value of the data group is greater than or equal to the preset first similarity threshold. It then determines whether the similarity between the hash value of the data group and the first hash value is greater than or equal to the preset second similarity threshold. If so, the data blocks in the data group are largely duplicate blocks, and a duplicate-block search is performed on them. The hash value storage table records the correspondence between the hash values of the data groups already stored in the data storage space and those data groups, and the number of data groups is relatively small, so querying the table is efficient. Performing the duplicate-block search per data group also reduces the number of searches, that is, the number of disk interactions, which improves duplicate-block search efficiency and thus the overall performance of data deduplication.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments serve only to illustrate the technical solutions of the present invention and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, and some or all of their technical features may be replaced by equivalents, without such modifications or replacements causing the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (32)

1. A duplicate data retrieval method, characterized by comprising:
Performing chunking on received data to obtain at least two data blocks;
Grouping the at least two data blocks to obtain at least one data group, each data group comprising at least one data block;
For a first data group in the at least one data group, performing a similarity hash operation on the data blocks in the first data group to obtain a hash value of the first data group, and obtaining, from a hash value storage table, a first hash value whose similarity to the hash value of the first data group is greater than or equal to a preset first similarity threshold, wherein the hash value storage table stores a correspondence between the hash value of a second data group stored in a data storage space and the second data group, and the hash value of the second data group is obtained by performing the similarity hash operation on the data blocks in the second data group; the first data group is any one of the at least one data group;
If the similarity between the hash value of the first data group and the first hash value is greater than or equal to a preset second similarity threshold, performing duplicate block retrieval on the data blocks in the first data group.
2. The duplicate data retrieval method according to claim 1, characterized by further comprising:
If the similarity between the hash value of the first data group and the first hash value is less than the second similarity threshold, storing the data blocks in the first data group and the hash values of those data blocks in the data storage space, and storing the correspondence between the hash value of the first data group and the first data group in the hash value storage table.
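The two-threshold decision in claims 1 and 2 can be illustrated with a minimal Python sketch. Everything named here (`HashTable`, `handle_group`, the 64-bit hash width, and measuring similarity as the fraction of agreeing bits) is an assumption made for illustration; the claims do not prescribe these details. The sketch also assumes that a group with no candidate above the first threshold is simply stored, a case the claims leave open.

```python
from dataclasses import dataclass, field

HASH_BITS = 64  # assumed width of a group hash value


def similarity(a: int, b: int, bits: int = HASH_BITS) -> float:
    """Fraction of bit positions on which two group hash values agree."""
    return 1.0 - bin((a ^ b) & ((1 << bits) - 1)).count("1") / bits


@dataclass
class HashTable:
    """Hypothetical hash value storage table: group hash -> group identifier."""
    entries: dict = field(default_factory=dict)

    def most_similar(self, h: int, first_threshold: float):
        """Return the closest stored hash if it meets the first threshold."""
        best = max(self.entries, key=lambda s: similarity(h, s), default=None)
        if best is not None and similarity(h, best) >= first_threshold:
            return best
        return None


def handle_group(group_id, group_hash, table, first_threshold, second_threshold):
    """Return 'dedup' when block-level retrieval should run, else 'store'."""
    candidate = table.most_similar(group_hash, first_threshold)
    if candidate is not None and similarity(group_hash, candidate) >= second_threshold:
        return "dedup"  # claim 1: run duplicate block retrieval
    # Claim 2 branch (and, by assumption, the no-candidate case): store the group.
    table.entries[group_hash] = group_id
    return "store"
```

In this sketch the first threshold only selects a candidate, while the second threshold decides whether the more expensive block-level comparison is worth running.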
3. repeating data search method according to claim 1 and 2 is characterized in that, described at least two data piecemeals are divided into groups, and obtains at least one packet and comprises:
Constituted by the cryptographic hash of each deblocking in described at least two data piecemeals and to treat piecemeal Hash data; Length with the cryptographic hash of any described deblocking is sliding step, adopts block algorithm that the described piecemeal Hash data for the treatment of are carried out piecemeal and handled, and obtains at least one cryptographic hash piecemeal;
To belong to the deblocking of cryptographic hash correspondence of same cryptographic hash piecemeal as a described packet.
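One way to read the grouping step of claim 3 is sketched below in Python. The claim does not name the chunking algorithm, so the boundary rule used here (a boundary wherever the last four bytes of a block hash are divisible by `divisor`), the SHA-1 block hash, and the `max_group` cap are all assumptions chosen only to make the sketch concrete.

```python
import hashlib


def group_blocks(blocks, divisor=4, max_group=16):
    """Group data blocks by chunking the concatenation of their hash values."""
    if not blocks:
        return []
    digests = [hashlib.sha1(b).digest() for b in blocks]
    step = len(digests[0])        # sliding step = length of one block hash
    stream = b"".join(digests)    # the to-be-chunked hash data

    groups, current = [], []
    for i, offset in enumerate(range(0, len(stream), step)):
        window = stream[offset:offset + step]
        current.append(blocks[i])
        # Assumed content-defined boundary test on the current hash value.
        boundary = int.from_bytes(window[-4:], "big") % divisor == 0
        if boundary or len(current) >= max_group:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups
```

Because the window advances by exactly one hash length, a chunk boundary can never split a block hash, so every hash value chunk maps to a whole number of data blocks.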
4. The duplicate data retrieval method according to any one of claims 1-3, characterized in that performing the similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group comprises:
Performing a hash operation on each data block in the first data group to obtain the hash value of each data block in the first data group;
Replacing each 0 in the hash value of each data block in the first data group with -1, adding the corresponding bits of the hash values of all the data blocks in the first data group, mapping each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0, and taking the resulting binary value as the hash value of the first data group.
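The bitwise voting described in claim 4 can be sketched as follows. The function names, the 64-bit width, and the use of a truncated SHA-1 digest as the per-block hash are assumptions for illustration only; the claim does not fix a hash function or length.

```python
import hashlib


def block_hash(block: bytes, bits: int = 64) -> int:
    """Assumed per-block hash: the first bits/8 bytes of SHA-1 as an integer."""
    return int.from_bytes(hashlib.sha1(block).digest()[: bits // 8], "big")


def group_hash(block_hashes, bits: int = 64) -> int:
    """Combine block hashes into one group hash by per-bit voting."""
    totals = [0] * bits
    for h in block_hashes:
        for i in range(bits):
            # A 1 bit contributes +1; a 0 bit is treated as -1.
            totals[i] += 1 if (h >> i) & 1 else -1
    result = 0
    for i, total in enumerate(totals):
        if total > 0:          # sums <= 0 (including ties) map to 0
            result |= 1 << i
    return result
```

For example, with 3-bit hashes 110, 100 and 101, the per-bit sums are +3, -1 and -1 from the highest bit to the lowest, so the group hash is 100.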
5. The duplicate data retrieval method according to any one of claims 1-4, characterized in that the data storage space comprises a plurality of storage regions; the hash value storage table further stores a correspondence between the hash value of the second data group and the number of the storage region in which the second data group is located;
Performing duplicate block retrieval on the data blocks in the first data group comprises:
Obtaining, from the hash value storage table, the number n of the storage region corresponding to the first hash value, and loading the data blocks in the storage region numbered n and the hash values of those data blocks into memory, wherein n is an integer greater than or equal to 0;
Comparing the data blocks in the first data group against the data blocks with identical hash values in the storage region numbered n, to complete the duplicate block retrieval on the data blocks in the first data group.
6. The duplicate data retrieval method according to claim 5, characterized in that the method further comprises:
When loading the data blocks in the storage region numbered n and the hash values of those data blocks into memory, also loading the data blocks in the storage region numbered (n+1) and the hash values of those data blocks into memory;
Comparing the data blocks in the first data group against the data blocks with identical hash values in the storage region numbered n, to complete the duplicate block retrieval on the data blocks in the first data group, comprises:
Comparing the data blocks in the first data group against the data blocks with identical hash values in the storage regions numbered n and (n+1), to complete the duplicate block retrieval on the data blocks in the first data group.
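The region-based lookup of claims 5 and 6 can be sketched as below. The in-memory layout (`regions` as a mapping from region number to a dictionary of block hash to block bytes), the SHA-1 block hash, and the function name `find_duplicates` are hypothetical; a real system would read the regions from disk.

```python
import hashlib


def find_duplicates(group_blocks, regions, n, prefetch_next=True):
    """Split a group's blocks into (duplicates, unique) using regions n and n+1."""
    # "Load" region n into memory (claim 5).
    loaded = dict(regions.get(n, {}))
    if prefetch_next:
        # Also load region n+1 at the same time (claim 6).
        loaded.update(regions.get(n + 1, {}))

    duplicates, unique = [], []
    for block in group_blocks:
        stored = loaded.get(hashlib.sha1(block).digest())
        # A hash match is confirmed by comparing the block contents.
        if stored is not None and stored == block:
            duplicates.append(block)
        else:
            unique.append(block)
    return duplicates, unique
```

Loading region n+1 together with region n presumably pays off when a similar group was written across two adjacent regions, since it avoids a second lookup for the blocks that spilled over.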
7. The duplicate data retrieval method according to any one of claims 1-6, characterized in that obtaining, from the hash value storage table, the first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold comprises:
Obtaining, from the hash value storage table, a hash value for which the number of bits matching the corresponding positions of the hash value of the first data group is greater than or equal to a preset number, as the first hash value.
8. The duplicate data retrieval method according to claim 7, characterized in that obtaining, from the hash value storage table, a hash value for which the number of bits matching the corresponding positions of the hash value of the first data group is greater than or equal to the preset number, as the first hash value, comprises:
Obtaining the Hamming distance between the hash value of the first data group and each hash value in the hash value storage table, and taking a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.
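The Hamming distance test in claim 8 is equivalent to the matching-bit test in claim 7: for hash values of a fixed width, the number of matching bits equals the width minus the Hamming distance. A minimal sketch, with hypothetical function names and a plain linear scan over the table:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def candidates_within(query: int, stored_hashes, max_distance: int):
    """Stored hash values whose distance to the query is within the threshold."""
    return [h for h in stored_hashes if hamming_distance(query, h) <= max_distance]
```

For instance, 1010 and 0110 differ in two positions, so with a threshold of 2 the stored value 0110 qualifies as a first hash value for the query 1010, while 0101 (distance 4) does not.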
9. A duplicate data retrieval device, characterized by comprising:
A chunk obtaining module, configured to perform chunking on received data to obtain at least two data blocks;
A group obtaining module, configured to group the at least two data blocks obtained by the chunk obtaining module to obtain at least one data group, each data group comprising at least one data block;
A hash calculation module, configured to, for a first data group in the at least one data group, perform a similarity hash operation on the data blocks in the first data group to obtain a hash value of the first data group, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data group is greater than or equal to a preset first similarity threshold, wherein the hash value storage table stores a correspondence between the hash value of a second data group stored in a data storage space and the second data group, and the hash value of the second data group is obtained by performing the similarity hash operation on the data blocks in the second data group; the first data group is any one of the at least one data group;
A duplicate retrieval module, configured to perform duplicate block retrieval on the data blocks in the first data group when the similarity between the hash value of the first data group and the first hash value is greater than or equal to a preset second similarity threshold.
10. The duplicate data retrieval device according to claim 9, characterized by further comprising:
A storage module, configured to, when the similarity between the hash value of the first data group and the first hash value is less than the second similarity threshold, store the data blocks in the first data group and the hash values of those data blocks in the data storage space, and store the correspondence between the hash value of the first data group and the first data group in the hash value storage table.
11. The duplicate data retrieval device according to claim 9 or 10, characterized in that the group obtaining module is specifically configured to form to-be-chunked hash data from the hash values of the data blocks in the at least two data blocks, apply a chunking algorithm to the to-be-chunked hash data using the length of the hash value of any one of the data blocks as the sliding step to obtain at least one hash value chunk, and take the data blocks corresponding to the hash values that belong to the same hash value chunk as one data group.
12. The duplicate data retrieval device according to any one of claims 9-11, characterized in that the hash calculation module being configured to perform the similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group comprises:
The hash calculation module is specifically configured to perform a hash operation on each data block in the first data group to obtain the hash value of each data block in the first data group, replace each 0 in the hash value of each data block in the first data group with -1, add the corresponding bits of the hash values of all the data blocks in the first data group, map each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0, and take the resulting binary value as the hash value of the first data group.
13. The duplicate data retrieval device according to any one of claims 9-12, characterized in that the data storage space comprises a plurality of storage regions; the hash value storage table further stores a correspondence between the hash value of the second data group and the number of the storage region in which the second data group is located;
The duplicate retrieval module is specifically configured to obtain, from the hash value storage table, the number n of the storage region corresponding to the first hash value, and load the data blocks in the storage region numbered n and the hash values of those data blocks into memory, wherein n is an integer greater than or equal to 0; and to compare the data blocks in the first data group against the data blocks with identical hash values in the storage region numbered n, to complete the duplicate block retrieval on the data blocks in the first data group.
14. The duplicate data retrieval device according to any one of claims 9-13, characterized in that the duplicate retrieval module is further configured to, when loading the data blocks in the storage region numbered n and the hash values of those data blocks into memory, also load the data blocks in the storage region numbered (n+1) and the hash values of those data blocks into memory;
The duplicate retrieval module being specifically configured to compare the data blocks in the first data group against the data blocks with identical hash values in the storage region numbered n, to complete the duplicate block retrieval on the data blocks in the first data group, comprises:
The duplicate retrieval module is specifically configured to compare the data blocks in the first data group against the data blocks with identical hash values in the storage regions numbered n and (n+1), to complete the duplicate block retrieval on the data blocks in the first data group.
15. The duplicate data retrieval device according to any one of claims 9-14, characterized in that the hash calculation module being configured to obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold comprises:
The hash calculation module is specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits matching the corresponding positions of the hash value of the first data group is greater than or equal to a preset number, as the first hash value.
16. The duplicate data retrieval device according to claim 15, characterized in that the hash calculation module being specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits matching the corresponding positions of the hash value of the first data group is greater than or equal to the preset number, as the first hash value, comprises:
The hash calculation module is specifically configured to obtain the Hamming distance between the hash value of the first data group and each hash value in the hash value storage table, and take a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.
17. A duplicate data retrieval device, characterized by comprising a processor, a communication interface, a memory, and a bus, wherein the processor, the communication interface, and the memory communicate with one another through the bus;
The communication interface is configured to receive data;
The processor is configured to execute a program;
The memory is configured to store the program;
Wherein the program is configured to: perform chunking on the data received by the communication interface to obtain at least two data blocks; group the at least two data blocks to obtain at least one data group, each data group comprising at least one data block; for a first data group in the at least one data group, perform a similarity hash operation on the data blocks in the first data group to obtain a hash value of the first data group, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data group is greater than or equal to a preset first similarity threshold, wherein the hash value storage table stores a correspondence between the hash value of a second data group stored in a data storage space and the second data group, and the hash value of the second data group is obtained by performing the similarity hash operation on the data blocks in the second data group; the first data group is any one of the at least one data group; and, if the similarity between the hash value of the first data group and the first hash value is greater than or equal to a preset second similarity threshold, perform duplicate block retrieval on the data blocks in the first data group.
18. The duplicate data retrieval device according to claim 17, characterized in that the program is further configured to, when the similarity between the hash value of the first data group and the first hash value is less than the second similarity threshold, store the data blocks in the first data group and the hash values of those data blocks in the data storage space, and store the correspondence between the hash value of the first data group and the first data group in the hash value storage table.
19. The duplicate data retrieval device according to claim 17 or 18, characterized in that the program being configured to group the at least two data blocks to obtain at least one data group comprises:
The program is specifically configured to form to-be-chunked hash data from the hash values of the data blocks in the at least two data blocks, apply a chunking algorithm to the to-be-chunked hash data using the length of the hash value of any one of the data blocks as the sliding step to obtain at least one hash value chunk, and take the data blocks corresponding to the hash values that belong to the same hash value chunk as one data group.
20. The duplicate data retrieval device according to any one of claims 17-19, characterized in that the program being configured to perform the similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group comprises:
The program is specifically configured to perform a hash operation on each data block in the first data group to obtain the hash value of each data block in the first data group, replace each 0 in the hash value of each data block in the first data group with -1, add the corresponding bits of the hash values of all the data blocks in the first data group, map each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0, and take the resulting binary value as the hash value of the first data group.
21. The duplicate data retrieval device according to any one of claims 17-20, characterized in that the data storage space comprises a plurality of storage regions; the hash value storage table further stores a correspondence between the hash value of the second data group and the number of the storage region in which the second data group is located;
The program being configured to perform duplicate block retrieval on the data blocks in the first data group comprises:
The program is specifically configured to obtain, from the hash value storage table, the number n of the storage region corresponding to the first hash value, and load the data blocks in the storage region numbered n and the hash values of those data blocks into memory, wherein n is an integer greater than or equal to 0; and to compare the data blocks in the first data group against the data blocks with identical hash values in the storage region numbered n, to complete the duplicate block retrieval on the data blocks in the first data group.
22. The duplicate data retrieval device according to claim 21, characterized in that the program is further configured to, when loading the data blocks in the storage region numbered n and the hash values of those data blocks into memory, also load the data blocks in the storage region numbered (n+1) and the hash values of those data blocks into memory;
The program being specifically configured to compare the data blocks in the first data group against the data blocks with identical hash values in the storage region numbered n, to complete the duplicate block retrieval on the data blocks in the first data group, comprises:
The program is specifically configured to compare the data blocks in the first data group against the data blocks with identical hash values in the storage regions numbered n and (n+1), to complete the duplicate block retrieval on the data blocks in the first data group.
23. The duplicate data retrieval device according to any one of claims 17-22, characterized in that the program being configured to obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold comprises:
The program is specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits matching the corresponding positions of the hash value of the first data group is greater than or equal to a preset number, as the first hash value.
24. The duplicate data retrieval device according to claim 23, characterized in that the program being specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits matching the corresponding positions of the hash value of the first data group is greater than or equal to the preset number, as the first hash value, comprises:
The program is specifically configured to obtain the Hamming distance between the hash value of the first data group and each hash value in the hash value storage table, and take a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.
25. A computer program product, characterized by comprising a computer-readable storage medium for storing a program, the program comprising:
A chunk obtaining unit, configured to perform chunking on received data to obtain at least two data blocks;
A group obtaining unit, configured to group the at least two data blocks obtained by the chunk obtaining unit to obtain at least one data group, each data group comprising at least one data block;
A hash calculation unit, configured to, for a first data group in the at least one data group, perform a similarity hash operation on the data blocks in the first data group to obtain a hash value of the first data group, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data group is greater than or equal to a preset first similarity threshold, wherein the hash value storage table stores a correspondence between the hash value of a second data group stored in a data storage space and the second data group, and the hash value of the second data group is obtained by performing the similarity hash operation on the data blocks in the second data group; the first data group is any one of the at least one data group;
A duplicate retrieval unit, configured to perform duplicate block retrieval on the data blocks in the first data group when the similarity between the hash value of the first data group and the first hash value is greater than or equal to a preset second similarity threshold.
26. The computer program product according to claim 25, characterized in that the program further comprises:
A storage unit, configured to, when the similarity between the hash value of the first data group and the first hash value is less than the second similarity threshold, store the data blocks in the first data group and the hash values of those data blocks in the data storage space, and store the correspondence between the hash value of the first data group and the first data group in the hash value storage table.
27. The computer program product according to claim 25 or 26, characterized in that the group obtaining unit is specifically configured to form to-be-chunked hash data from the hash values of the data blocks in the at least two data blocks, apply a chunking algorithm to the to-be-chunked hash data using the length of the hash value of any one of the data blocks as the sliding step to obtain at least one hash value chunk, and take the data blocks corresponding to the hash values that belong to the same hash value chunk as one data group.
28. The computer program product according to any one of claims 25-27, characterized in that the hash calculation unit being configured to perform the similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group comprises:
The hash calculation unit is specifically configured to perform a hash operation on each data block in the first data group to obtain the hash value of each data block in the first data group, replace each 0 in the hash value of each data block in the first data group with -1, add the corresponding bits of the hash values of all the data blocks in the first data group, map each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0, and take the resulting binary value as the hash value of the first data group.
29. The computer program product according to any one of claims 25-28, characterized in that the data storage space comprises a plurality of storage regions; the hash value storage table further stores a correspondence between the hash value of the second data group and the number of the storage region in which the second data group is located;
The duplicate retrieval unit is specifically configured to obtain, from the hash value storage table, the number n of the storage region corresponding to the first hash value, and load the data blocks in the storage region numbered n and the hash values of those data blocks into memory, wherein n is an integer greater than or equal to 0; and to compare the data blocks in the first data group against the data blocks with identical hash values in the storage region numbered n, to complete the duplicate block retrieval on the data blocks in the first data group.
30. The computer program product according to any one of claims 25-29, characterized in that the duplicate retrieval unit is further configured to, when loading the data blocks in the storage region numbered n and the hash values of those data blocks into memory, also load the data blocks in the storage region numbered (n+1) and the hash values of those data blocks into memory;
The duplicate retrieval unit being specifically configured to compare the data blocks in the first data group against the data blocks with identical hash values in the storage region numbered n, to complete the duplicate block retrieval on the data blocks in the first data group, comprises:
The duplicate retrieval unit is specifically configured to compare the data blocks in the first data group against the data blocks with identical hash values in the storage regions numbered n and (n+1), to complete the duplicate block retrieval on the data blocks in the first data group.
31. The computer program product according to any one of claims 25-30, characterized in that the hash calculation unit being configured to obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold comprises:
The hash calculation unit is specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits matching the corresponding positions of the hash value of the first data group is greater than or equal to a preset number, as the first hash value.
32. The computer program product according to claim 31, characterized in that the hash calculation unit being specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits matching the corresponding positions of the hash value of the first data group is greater than or equal to the preset number, as the first hash value, comprises:
The hash calculation unit is specifically configured to obtain the Hamming distance between the hash value of the first data group and each hash value in the hash value storage table, and take a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.
CN201280001989.7A 2012-10-30 2012-10-30 Repeating data search method and equipment Expired - Fee Related CN103189867B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/083740 WO2014067063A1 (en) 2012-10-30 2012-10-30 Duplicate data retrieval method and device

Publications (2)

Publication Number Publication Date
CN103189867A true CN103189867A (en) 2013-07-03
CN103189867B CN103189867B (en) 2016-05-25

Family

ID=48679810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201280001989.7A Expired - Fee Related CN103189867B (en) 2012-10-30 2012-10-30 Repeating data search method and equipment

Country Status (2)

Country Link
CN (1) CN103189867B (en)
WO (1) WO2014067063A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014067063A1 (en) * 2012-10-30 2014-05-08 华为技术有限公司 Duplicate data retrieval method and device
WO2015042909A1 (en) * 2013-09-29 2015-04-02 华为技术有限公司 Data processing method, system and client
WO2015089728A1 (en) * 2013-12-17 2015-06-25 华为技术有限公司 Repeated data processing method, device, storage controller and storage node
CN105843859A (en) * 2016-03-17 2016-08-10 华为技术有限公司 Data processing method, device and equipment
CN106873964A (en) * 2016-12-23 2017-06-20 浙江工业大学 A kind of improved SimHash detection method of code similarities
CN107644081A (en) * 2017-09-21 2018-01-30 锐捷网络股份有限公司 Data duplicate removal method and device
CN108763270A (en) * 2018-04-07 2018-11-06 长沙开雅电子科技有限公司 A kind of data de-duplication Hash table Realization of Storing
CN108875062A (en) * 2018-06-26 2018-11-23 北京奇艺世纪科技有限公司 A kind of determination method and device repeating video
CN109670153A (en) * 2018-12-21 2019-04-23 北京城市网邻信息技术有限公司 A kind of determination method, apparatus, storage medium and the terminal of similar model
CN110134544A (en) * 2018-02-08 2019-08-16 广东亿迅科技有限公司 The method and its system of datamation backup
CN110909019A (en) * 2019-11-14 2020-03-24 湖南赛吉智慧城市建设管理有限公司 Big data duplicate checking method and device, computer equipment and storage medium
CN113472609A (en) * 2020-05-25 2021-10-01 汪永强 Data repeated transmission marking system for wireless communication
CN114064621A (en) * 2021-10-28 2022-02-18 江苏未至科技股份有限公司 Method for judging repeated data
CN114817230A (en) * 2022-06-29 2022-07-29 深圳市乐易网络股份有限公司 Data stream filtering method and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202212A (en) * 2016-06-28 2016-12-07 微梦创科网络科技(中国)有限公司 A kind of method and system realizing data fractionation based on data server cluster

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060116989A1 (en) * 2004-11-30 2006-06-01 Srikanth Bellamkonda Efficient data aggregation operations using hash tables
CN101887457A (en) * 2010-07-02 2010-11-17 杭州电子科技大学 Content-based copy image detection method
US20110029491A1 (en) * 2009-07-29 2011-02-03 International Business Machines Corporation Dynamically detecting near-duplicate documents
CN102622365A (en) * 2011-01-28 2012-08-01 北京百度网讯科技有限公司 Judging system and judging method for web page repeating

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101882141A (en) * 2009-05-08 2010-11-10 北京众志和达信息技术有限公司 Method and system for implementing repeated data deletion
CN102467572B (en) * 2010-11-17 2013-10-02 英业达股份有限公司 Data block inquiring method for supporting data de-duplication program
US9110936B2 (en) * 2010-12-28 2015-08-18 Microsoft Technology Licensing, Llc Using index partitioning and reconciliation for data deduplication
GB2477607B (en) * 2011-01-17 2011-12-28 Quantum Corp Sampling based data de-duplication
WO2014067063A1 (en) * 2012-10-30 2014-05-08 华为技术有限公司 Duplicate data retrieval method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
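Duan Fei (段飞): "Research and Implementation of Similar Web Page Identification Algorithms", China Master's Theses Full-text Database, Information Science and Technology *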

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014067063A1 (en) * 2012-10-30 2014-05-08 华为技术有限公司 Duplicate data retrieval method and device
US10210186B2 (en) 2013-09-29 2019-02-19 Huawei Technologies Co., Ltd. Data processing method and system and client
WO2015042909A1 (en) * 2013-09-29 2015-04-02 华为技术有限公司 Data processing method, system and client
CN104823184A (en) * 2013-09-29 2015-08-05 华为技术有限公司 Data processing method, system and client
CN104823184B (en) * 2013-09-29 2016-11-09 华为技术有限公司 Data processing method, system and client
US11163734B2 (en) 2013-09-29 2021-11-02 Huawei Technologies Co., Ltd. Data processing method and system and client
WO2015089728A1 (en) * 2013-12-17 2015-06-25 华为技术有限公司 Repeated data processing method, device, storage controller and storage node
CN105843859A (en) * 2016-03-17 2016-08-10 华为技术有限公司 Data processing method, device and equipment
WO2017157038A1 (en) * 2016-03-17 2017-09-21 华为技术有限公司 Data processing method, apparatus and equipment
CN105843859B (en) * 2016-03-17 2019-05-24 华为技术有限公司 Data processing method, device and equipment
CN106873964A (en) * 2016-12-23 2017-06-20 浙江工业大学 Improved SimHash method for detecting code similarity
CN107644081A (en) * 2017-09-21 2018-01-30 锐捷网络股份有限公司 Data duplicate removal method and device
CN110134544A (en) * 2018-02-08 2019-08-16 广东亿迅科技有限公司 Method and system for automated data backup
CN108763270A (en) * 2018-04-07 2018-11-06 长沙开雅电子科技有限公司 Storage implementation method for a data de-duplication hash table
CN108875062A (en) * 2018-06-26 2018-11-23 北京奇艺世纪科技有限公司 Method and device for determining duplicate videos
CN109670153A (en) * 2018-12-21 2019-04-23 北京城市网邻信息技术有限公司 Method and device for determining similar posts, storage medium and terminal
CN109670153B (en) * 2018-12-21 2023-11-17 北京城市网邻信息技术有限公司 Method and device for determining similar posts, storage medium and terminal
CN110909019A (en) * 2019-11-14 2020-03-24 湖南赛吉智慧城市建设管理有限公司 Big data duplicate checking method and device, computer equipment and storage medium
CN110909019B (en) * 2019-11-14 2022-04-08 湖南赛吉智慧城市建设管理有限公司 Big data duplicate checking method and device, computer equipment and storage medium
CN113472609A (en) * 2020-05-25 2021-10-01 汪永强 Data repeated transmission marking system for wireless communication
CN113472609B (en) * 2020-05-25 2024-03-19 汪永强 Data repeated sending marking system for wireless communication
CN114064621A (en) * 2021-10-28 2022-02-18 江苏未至科技股份有限公司 Method for judging repeated data
CN114064621B (en) * 2021-10-28 2022-07-15 江苏未至科技股份有限公司 Method for judging repeated data
CN114817230A (en) * 2022-06-29 2022-07-29 深圳市乐易网络股份有限公司 Data stream filtering method and system

Also Published As

Publication number Publication date
WO2014067063A1 (en) 2014-05-08
CN103189867B (en) 2016-05-25

Similar Documents

Publication Publication Date Title
CN103189867A (en) Duplicated data search method and equipment
CN102968498B (en) Data processing method and device
KR101657561B1 (en) Data processing method and apparatus in cluster system
CN101855620B (en) Data processing apparatus and method of processing data
CN103858125B (en) Repeated data processing method, device, storage controller and storage node
CN105468642A (en) Data storage method and apparatus
KR102509913B1 (en) Method and apparatus for maximized dedupable memory
CN103067525A (en) Cloud storage data backup method based on characteristic codes
CN104081378B (en) Method and system for making full use of parallel processors for data processing
CN104112011B (en) Method and device for extracting mass data
CN106326475A (en) High-efficiency static hash table implementation method and system
CN101751475B (en) Method for compressing section records and device therefor
CN103514210A (en) Method and device for processing small files
CN104823184A (en) Data processing method, system and client
CN105138281A (en) Physical disk sharing method and apparatus
CN110058969A (en) Data reconstruction method and device
CN107506310A (en) Address lookup and keyword storage method and equipment
CN106648991A (en) Duplicated data deletion method in data recovery system
CN103930890B (en) Data processing method, device and deduplication processor
CN103092886B (en) Implementation method, apparatus and system for data query operations
CN105446982A (en) Data storage system management method and device
CN108563649B (en) Offline duplicate removal method based on GlusterFS distributed file system
CN110221778A (en) Hotel data processing method, system, storage medium and electronic equipment
CN105117403A (en) Log data fragmentation and query method and apparatus
CN116383290B (en) Data generalization and analysis method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160525

Termination date: 20191030