CN100576207C - Method for removing duplicate objects based on metadata - Google Patents

Publication number
CN100576207C
Authority
CN
China
Prior art keywords
metadata
typing
treatment
current
record
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 200710106024
Other languages
Chinese (zh)
Other versions
CN101286156A (en
Inventor
高飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Leade Technology Development Co., Ltd.
Beijing Founder Apabi Technology Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Apabi Technology Co Ltd
Priority to CN 200710106024
Publication of CN101286156A
Application granted
Publication of CN100576207C
Current legal status: Active
Anticipated expiration

Abstract

The invention discloses a method for removing duplicate objects based on metadata. It relates to the field of metadata cleaning and addresses the heavy workload of existing approaches to removing duplicate data. The invention first standardizes the metadata to be entered. During comparison, the comparison scope is narrowed, which reduces the workload and improves efficiency: from the records in the data set, those whose publisher field matches that of the metadata to be entered are selected, and within the selected records the ISBN, title, author, publisher, publication time, and price fields are chosen as the comparison scope. A weighted similarity comparison function is used to calculate the similarity value between the attribute values of corresponding fields in the metadata to be entered and in the data set. Each field's similarity value is multiplied by its weight, and the products are summed to give a composite similarity value, which is compared with a preset threshold. If the composite similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates.

Description

Method for removing duplicate objects based on metadata
Technical Field
The present invention relates to a data cleaning method, and in particular to a method for removing duplicate objects from a data set.
Background Art
In the information society, information can be divided into two broad classes. One class can be represented by data or a uniform structure, such as numbers and symbols; we call this structured data. The other class cannot be represented by numbers or a uniform structure, such as text, images, sound, and web pages; we call this unstructured data. Structured data belongs to unstructured data and is a special case of it.
A structured data type is a user-defined data type that contains a number of non-atomic elements. More precisely, such data types can be decomposed: their elements can be used individually or, where appropriate, as a single independent unit.
In the library and information science community, metadata is defined as structured data that provides information about information resources or data; it is a structured description of information resources. Its role is to describe the features and attributes of the information resource or data itself and to specify the organization of digital information. It serves functions such as locating, discovering, proving, assessing, and selecting.
Without a good data environment there can be no satisfactory mining result, but real-world data is generally dirty, incomplete, and inconsistent. Data preprocessing can create such an environment.
At present, with the development of networks, the volume of all kinds of metadata is growing sharply. Because the newly added metadata is of uneven quality and comes from many sources, it contains a large amount of duplicate data, which causes considerable trouble for the services built on it. If metadata duplicates are handled badly, the business logic built on top runs into problems and may incur losses. For example, on a book-selling website, if the bibliographic data contains many duplicates, users may not know which item to order. In the past this kind of problem was usually resolved by manual judgment, but as data volumes keep growing and accumulating, the manpower required rises sharply as well. How to detect duplicates in large amounts of metadata has therefore become a pressing problem.
The industry has long studied duplicate detection for unstructured network data; algorithms abound, and all kinds of search engines now use them. Metadata, however, is structured data that carries semantics, and it demands stricter criteria and higher accuracy in duplicate detection. Existing duplicate-detection schemes for unstructured data therefore cannot fully meet the requirements for metadata. In addition, the exact-match duplicate-detection schemes usually applied to databases are even less suited to an environment such as metadata, where part of the data may itself be erroneous.
Summary of the Invention
The invention provides a metadata-based method for removing duplicate objects that can accurately identify duplicate data and remove it.
The invention adopts the following technical solution. The method for removing duplicate objects based on metadata comprises the following steps:
1) standardize the current metadata to be entered and judge whether it is good-quality metadata to be entered;
2) compare the good-quality metadata to be entered with each record in the data set, and judge whether the data set contains a record that duplicates the metadata to be entered;
3) if there is a duplicate record, keep the better-quality record of the two in the data set.
The current metadata to be entered comprises at least the following fields: International Standard Book Number (ISBN), title, author, publisher, publication time, and price.
The ISBN consists of 10 digits made up of four parts, namely the group number, the publisher number, the title number, and the check digit, joined by "-"; the publisher number is the code of the publisher.
The step "standardize the current metadata to be entered" comprises the following steps:
1) judge whether the ISBN of the current metadata to be entered contains non-digit characters; if so, delete them and keep the current metadata to be entered;
2) judge whether the ISBN of the current metadata to be entered consists of 10 digits; if not, there are two cases: if the ISBN has fewer than 8 digits, discard the current metadata to be entered; if the ISBN has more than 10 digits, delete the digits after the 10th and keep the current metadata to be entered;
3) verify whether the ISBN of the current metadata to be entered is correct;
4) if the ISBN of the current metadata to be entered is correct, then verify whether its publisher is correct;
If the publisher of the current metadata to be entered is correct, the current metadata to be entered is the "good-quality metadata to be entered".
The method of "verifying whether the ISBN of the current metadata to be entered is correct" is: multiply the 1st to 9th digits of the ISBN by the nine numbers 10 down to 2, in order, and add the check digit to the sum of the products; if the result is divisible by 11, the ISBN is correct.
The method of "verifying whether the publisher of the current metadata to be entered is correct" is:
extract the publisher number from the standardized ISBN and use it to verify whether the publisher of the current metadata to be entered is correct;
if the publisher number corresponds to the publisher of the current metadata to be entered, the publisher is correct;
if the publisher number does not correspond to the publisher of the current metadata to be entered, the publisher is incorrect.
The step "standardize the current metadata to be entered" also comprises normalizing the publication time and the price to real numbers.
When the data set is empty, steps 2) and 3) are specifically:
2) the data set contains no record that duplicates the metadata to be entered;
3) store the good-quality metadata to be entered in the data set.
When the data set is not empty, step 2) comprises:
21) narrow the scope of records in the data set that are compared with the metadata to be entered;
22) within the scope defined in step 21), use a weighted similarity comparison function to calculate the similarity value between the attribute values of corresponding fields in the metadata to be entered and in the data set;
23) multiply each field's similarity value by its weight and sum the products to obtain a composite similarity value;
24) compare the composite similarity value with a preset threshold; if the composite similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates; if it is less than the threshold, they are not duplicates.
Step 21) is specifically:
211) from the records of the data set, select those whose publisher field is the same as that of the metadata to be entered, as the comparison scope;
212) within the selected records, choose the ISBN, title, author, publisher, publication time, and price fields as the comparison scope.
The weighted similarity comparison functions of step 22) include an integer similarity comparison function, a string similarity comparison function, and a real-number similarity comparison function.
The invention standardizes the metadata to be entered (dirty data) so that it is free of obvious formal errors; at that point the metadata is of reasonably good quality. The good-quality metadata to be entered is compared with each record in the data set to judge whether the data set contains a record that duplicates it. During comparison, the comparison scope is narrowed, which reduces the workload and improves efficiency: from the thousands of records in the data set, those whose publisher field matches that of the metadata to be entered are selected, and within them the ISBN, title, author, publisher, publication time, and price fields are chosen as the comparison scope. Similarity comparison functions calculate the similarity values between records in the data set and the metadata to be entered, and a weight training function calculates the field weights. Each field's similarity value is multiplied by its weight, and the products are summed to obtain a composite similarity value, which is compared with a preset threshold. If the composite similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates; if it is less than the threshold, they are not.
Brief Description of the Drawings
Fig. 1 is a flowchart of the method of the present invention for removing duplicate objects based on metadata;
Fig. 2 is a flowchart of comparing the good-quality metadata to be entered with each record in the data set in the present invention.
Detailed Description
To address the heavy workload of removing dirty data in the existing field of metadata cleaning, the invention provides a method for removing duplicate objects based on metadata. With reference to Fig. 1, it comprises the following steps:
1) standardize the current metadata to be entered and judge whether it is good-quality metadata to be entered;
2) compare the good-quality metadata to be entered with each record in the data set, and judge whether the data set contains a record that duplicates the metadata to be entered;
3) if there is a duplicate record, keep the better-quality record of the two in the data set.
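The three steps above can be sketched as a small driver function. This is only an illustration under stated assumptions: the names `ingest`, `normalize`, `duplicates`, and `pick_better` are hypothetical, and the text does not say how the better-quality record of a duplicate pair is chosen, so that decision is passed in as a callable.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]  # one metadata record: field name -> attribute value


def ingest(
    incoming: Record,
    data_set: List[Record],
    normalize: Callable[[Record], Optional[Record]],
    duplicates: Callable[[Record, Record], bool],
    pick_better: Callable[[Record, Record], Record],
) -> None:
    """Apply steps 1) to 3) to one incoming record, updating data_set in place."""
    # Step 1: standardize; None means the record was discarded as unusable.
    cleaned = normalize(incoming)
    if cleaned is None:
        return
    # Step 2: look for an existing record that duplicates the cleaned one.
    for index, existing in enumerate(data_set):
        if duplicates(cleaned, existing):
            # Step 3: of the two duplicates, keep the better-quality record.
            data_set[index] = pick_better(cleaned, existing)
            return
    # No duplicate found (this also covers an empty data set): store the record.
    data_set.append(cleaned)
```

Passing the three decisions in as callables keeps the control flow visible while leaving the standardization rules and the similarity test to be supplied separately.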
The information about a book online comprises a large amount of metadata, much of which is dirty, that is, poor-quality data. For example: title: Romance of the Three Kingdoms; ISBN: ISBN7-305-01568-7; publisher number: 305; publisher: Baihua Publishing House; publication time: June 9, 1988; language: Chinese; place of publication: Nanjing; author: Luo Guanzhong; responsible editor: Cao Xueqin; current price: 109,90 yuan; edition: 1st edition, September 1996, 3rd printing, May 1988; and so on. In the metadata above, the part before each colon is a field and the part after it is an attribute value. Together this information forms one record in the data set. In this record all the attribute values are correct, and it is called good-quality data. In practice, attribute values in metadata records are often wrong. Taking the Romance of the Three Kingdoms record as the example again: title: Romance of the Three Kingdoms; ISBN: ISBN8-305-01548-7; publisher number: 306; publisher: Huabai Publishing House; publication time: February 30, 1988; language: Chinese; place of publication: Nanjing; author: Luo Guanzhong; responsible editor: Cao Xueqin; current price: 109,908 yuan; edition: 1st edition, September 1996, 3rd printing, May 1988; and so on. In this record the attribute values of the ISBN, publisher number, publisher, publication time, responsible editor, current price, and other fields are wrong. Such data is called dirty data or poor-quality data.
Only good-quality metadata should be entered into the data set, and poor-quality metadata should be cleaned out. At present, the quality of metadata to be entered is always judged manually at entry time, which is inefficient and applies no uniform standard.
I. Poor-quality metadata must first be standardized before entry:
1) Standardize the ISBN:
The copyright page of every regularly published book carries an ISBN, the abbreviation of International Standard Book Number. It consists of 10 digits made up of four parts, namely the group number, the publisher number, the title number, and the check digit, joined by "-", for example ISBN7-305-01568-7. The group number identifies the country or language area; China's is 7. The publisher number is the code of the publisher; it is assigned by the national ISBN center and may have 1 to 7 digits. The title number is the number the publisher gives to each publication. The check digit is the last digit of the ISBN and can be used to check whether the ISBN is correct: multiply digits 1 to 9 of the ISBN by the nine numbers 10 down to 2, in order, and add the check digit to the sum of the products; if the result is divisible by 11, the ISBN is correct.
Steps 1 and 2 below verify the formal correctness of the ISBN. Every ISBN must meet these formal requirements before the correctness of the ISBN itself can be verified:
1. Judge whether the ISBN of the current metadata to be entered contains non-digit characters; if so, delete them and keep the current metadata to be entered;
2. Judge whether the ISBN of the current metadata to be entered consists of 10 digits; if not, there are two cases: if the ISBN has fewer than 8 digits, discard the current metadata to be entered; if the ISBN has more than 10 digits, delete the digits after the 10th and keep the current metadata to be entered;
3. Multiply the 1st to 9th digits of the ISBN by the nine numbers 10 down to 2, in order, and add the check digit to the sum of the products; if the result is divisible by 11, the ISBN is correct. Take the Romance of the Three Kingdoms record again. For ISBN7-305-01568-7 the sum is 7*10+3*9+0*8+5*7+0*6+1*5+5*4+6*3+8*2+7=198, and 198/11=18, so the sum is divisible by 11 and the ISBN is correct. For ISBN8-305-01548-7 the sum is 8*10+3*9+0*8+5*7+0*6+1*5+5*4+4*3+8*2+7=202, and 202/11=18 remainder 4, so the sum is not divisible by 11 and the ISBN is incorrect.
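The formal clean-up and the check-digit test can be sketched as follows. The function names are hypothetical. The sketch follows the text in treating the number as purely numeric, so an ISBN-10 whose check character is X is not handled, and because the text does not say what happens to 8- or 9-digit numbers, the sketch assumes they are kept and then fail the check.

```python
from typing import Optional

ASCII_DIGITS = "0123456789"


def normalize_isbn(raw: str) -> Optional[str]:
    """Steps 1 and 2: formal clean-up; None means the record should be discarded."""
    digits = "".join(ch for ch in raw if ch in ASCII_DIGITS)  # step 1: drop non-digits
    if len(digits) < 8:
        return None        # step 2: fewer than 8 digits, discard the record
    return digits[:10]     # step 2: drop anything after the 10th digit


def isbn10_checksum_ok(digits: str) -> bool:
    """Step 3: weights 10 down to 2 on digits 1-9, plus the check digit, divisible by 11."""
    if len(digits) != 10:
        return False  # assumption: shorter numbers are kept but fail the check
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    return (total + int(digits[9])) % 11 == 0
```

With the two example numbers from the text, `isbn10_checksum_ok(normalize_isbn("ISBN7-305-01568-7"))` is true and `isbn10_checksum_ok(normalize_isbn("ISBN8-305-01548-7"))` is false.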
2) Standardize the publisher:
1. Judge whether the ISBN of the current metadata to be entered is in string form; if it contains characters of other forms, delete them and keep the current metadata to be entered;
2. Extract the publisher number from the standardized ISBN and use it to verify whether the publisher of the current metadata to be entered is correct;
If the publisher number corresponds to the publisher of the current metadata to be entered, the publisher is correct;
If the publisher number does not correspond to the publisher of the current metadata to be entered, the publisher is incorrect.
The publisher number is the code of the publisher; it is assigned by the national ISBN center and may have 1 to 7 digits. For example, from ISBN7-305-01568-7 the publisher number 305 is extracted, and the corresponding publisher is found to be Baihua Publishing House. If the publisher in the metadata to be entered is Baihua Publishing House, the publisher of the current metadata to be entered is correct.
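A minimal sketch of the publisher check, assuming the hyphenated form of the ISBN is still available so that the publisher number can be read off as the second part. The lookup table `PUBLISHER_REGISTRY` and both function names are hypothetical; a real system would consult whatever registry of publisher numbers it has access to.

```python
# Hypothetical lookup table: publisher number -> registered publisher name.
PUBLISHER_REGISTRY = {"305": "Baihua Publishing House"}


def publisher_number(hyphenated_isbn: str) -> str:
    """Return the second of the four hyphen-separated parts of the ISBN."""
    text = hyphenated_isbn.strip()
    if text.upper().startswith("ISBN"):
        text = text[4:].strip()
    parts = text.split("-")
    if len(parts) != 4:
        raise ValueError("expected group-publisher-title-check")
    return parts[1]


def publisher_matches(hyphenated_isbn: str, publisher: str) -> bool:
    """True when the registry maps the ISBN's publisher number to this publisher."""
    expected = PUBLISHER_REGISTRY.get(publisher_number(hyphenated_isbn))
    return expected is not None and expected == publisher.strip()
```

An unknown publisher number is treated here as a mismatch, which is one possible reading of "no corresponding relationship".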
3) Normalize the title and author to character strings. If digits or characters of other kinds appear in them, remove those characters and keep the metadata. For example, if the metadata to be entered has author "Luo 9 Guanzhong" or author "Luo Guan, zhong", standardization deletes the "9" and the "," and keeps author "Luo Guanzhong" for later processing.
4) Normalize the publication time and price to real numbers. If Chinese characters or characters of other kinds appear in them, remove those characters and keep the metadata. For example, if the metadata to be entered has publication time "1988-6f-9" or "198水8-6-9", standardization removes the "f" and the "水" and keeps publication time 1988-6-9 for later processing.
5) Standardize the responsible editor, current price, edition, synopsis, classification, subject terms, and so on.
After standardization the dirty data is free of obvious formal errors, and at that point the metadata is of reasonably good quality.
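The character-level clean-up of items 3) and 4) can be sketched like this. The function names are hypothetical, and the exact set of characters to keep is an assumption: letters (including Chinese characters) and spaces for names, digits and "-" for dates, digits and "." for prices. The text does not specify how a cleaned date is encoded as a real number, so the sketch stops at the cleaned string.

```python
import re


def clean_name(value: str) -> str:
    """Title/author: keep letters and spaces, drop digits and punctuation."""
    kept = "".join(ch for ch in value if ch.isalpha() or ch == " ")
    return " ".join(kept.split())  # collapse spaces left behind by deletions


def clean_date(value: str) -> str:
    """Publication time: keep only digits and the '-' separator."""
    return re.sub(r"[^0-9-]", "", value)


def clean_price(value: str) -> float:
    """Price: keep digits and the decimal point, then convert to a real number.

    Raises ValueError if nothing numeric remains after cleaning.
    """
    return float(re.sub(r"[^0-9.]", "", value))
```

For instance, `clean_date("1988-6f-9")` and `clean_date("198水8-6-9")` both give `"1988-6-9"`.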
II. Compare the good-quality metadata to be entered with each record in the data set, and judge whether the data set contains a record that duplicates the metadata to be entered.
Two cases are distinguished according to how many records the data set holds: 1) the data set is empty; and 2) the data set is not empty.
1) When the data set is empty, enter the metadata to be entered directly into the data set;
2) When the data set is not empty, it already holds some records; with reference to Fig. 2, entry proceeds in the following steps:
a) Narrow the scope of records in the data set that are compared with the metadata to be entered.
The data set holds thousands of records of book information from all periods. Before a piece of metadata to be entered is added, those thousands of records must be searched for a record that duplicates it. To reduce the workload and improve efficiency, the scope of records in the data set that are compared with the metadata to be entered is narrowed. The specific measures are:
a1. From the records of the data set, select those whose publisher field is the same as that of the metadata to be entered, as the comparison scope.
Many of the thousands of records in the data set come from the same publisher. During comparison, the records whose publisher field matches that of the metadata to be entered are extracted as the comparison scope.
For example, when the metadata of Romance of the Three Kingdoms is to be entered into the data set, its publisher is Baihua Publishing House. The records in the data set whose publisher field has the value Baihua Publishing House are extracted as the comparison scope.
a2. Within the selected records, choose the ISBN, title, author, publisher, publication time, and price fields as the comparison scope.
To reduce the workload and improve efficiency further, the comparison scope is narrowed again within the records that share the publisher: only the ISBN, title, author, publisher, publication time, and price fields are compared.
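The two narrowing measures amount to a row filter followed by a column projection. The sketch below assumes each record is a dictionary and uses hypothetical field keys (`isbn`, `title`, `author`, `publisher`, `pub_date`, `price`); the function name `candidates` is likewise made up for illustration.

```python
from typing import Dict, List

# Hypothetical keys for the six compared fields.
COMPARE_FIELDS = ("isbn", "title", "author", "publisher", "pub_date", "price")


def candidates(incoming: Dict, data_set: List[Dict]) -> List[Dict]:
    """Measures a1 and a2: same-publisher records, reduced to the compared fields."""
    # a1: keep only records whose publisher equals the incoming record's publisher.
    same_publisher = [
        r for r in data_set if r.get("publisher") == incoming.get("publisher")
    ]
    # a2: keep only the six fields that take part in the comparison.
    return [{key: r.get(key) for key in COMPARE_FIELDS} for r in same_publisher]
```

The exact-match filter on the publisher only works well after the publisher field has been standardized, which is why standardization comes first.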
b) Within the scope defined in step a), use the weighted similarity comparison function $f(r_1, r_2) = f'(r'_1, r'_2) - \alpha\,(1 - f'(r''_1, r''_2))$, with $f' \in [0, 1]$, to calculate the similarity value between the attribute values of corresponding fields in the metadata to be entered and in the data set. Here $f'$ is a similarity comparison function from the prior art; $r_1, r_2$ are the attribute values of the corresponding field (ISBN, title, author, publisher, publication time, or price) in the metadata to be entered and in the data set; $r'_1, r'_2$ are the parts of the attribute values that remain after the ignorable words are removed; $r''_1, r''_2$ are the parts of the attribute values that retain only the weight words; and $\alpha$ is a weight obtained by a training algorithm. When there are no weight words, $f(r_1, r_2) = f'(r'_1, r'_2)$. For example, for the publisher field, in the attribute values "Tsinghua University Press" and "Peking University Press" the words "University Press" can be treated as ignorable words, since they contribute little to comparing this field; only "Tsinghua" and "Peking" are compared, and these are $r'_1, r'_2$. For the title field, in the attribute value "Romance of the Three Kingdoms (Volumes I and II)", "Volumes I and II" is the weight word, that is, $r''_1, r''_2$.
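A minimal sketch of the weighted comparison for one string field. Several things here are assumptions made for illustration: the prior-art function f' is stood in for by `difflib.SequenceMatcher.ratio()`, attribute values are split into words on whitespace, and the ignorable-word and weight-word lists are supplied by the caller. The names `base_similarity` and `weighted_similarity` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Set


def base_similarity(a: str, b: str) -> float:
    """Stand-in for f': a string similarity value in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def weighted_similarity(
    r1: str, r2: str, ignore_words: Set[str], weight_words: Set[str], alpha: float
) -> float:
    """f(r1, r2) = f'(r1', r2') - alpha * (1 - f'(r1'', r2''))."""
    words1, words2 = r1.split(), r2.split()
    core1 = " ".join(w for w in words1 if w not in ignore_words)   # r1'
    core2 = " ".join(w for w in words2 if w not in ignore_words)   # r2'
    heavy1 = " ".join(w for w in words1 if w in weight_words)      # r1''
    heavy2 = " ".join(w for w in words2 if w in weight_words)      # r2''
    score = base_similarity(core1, core2)
    if heavy1 or heavy2:
        # The penalty term applies only when weight words are present.
        score -= alpha * (1.0 - base_similarity(heavy1, heavy2))
    return score
```

The penalty term lowers the score when two values differ in a weight word, such as a volume marker, even though the rest of the string is nearly identical.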
The similarity comparison functions include an integer similarity comparison function, a string similarity comparison function, and a real-number similarity comparison function:
the ISBN field comparison function returns 1 if the ISBNs are equal and 0 otherwise;
the title field comparison function is the string similarity value of the words obtained by word segmentation;
the author field comparison function is the string similarity value of the words obtained by word segmentation;
the publication time comparison function uses a relative difference function to obtain the similarity value;
the price comparison function uses a relative difference function to obtain the similarity value.
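The per-field functions listed above might look like the following. Only the ISBN rule is fully specified in the text; the other two forms are assumptions chosen for illustration: a Jaccard overlap of whitespace-separated tokens for segmented strings, and 1 - |a - b| / max(|a|, |b|) for the relative difference. All function names are hypothetical.

```python
def isbn_similarity(a: str, b: str) -> float:
    """1 if the two ISBNs are equal, otherwise 0."""
    return 1.0 if a == b else 0.0


def token_similarity(a: str, b: str) -> float:
    """Assumed form: Jaccard overlap of whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty values are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def relative_difference_similarity(a: float, b: float) -> float:
    """Assumed form: 1 - |a - b| / max(|a|, |b|), clamped to [0, 1]."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # both values are zero, hence identical
    return max(0.0, 1.0 - abs(a - b) / largest)
```

Each function returns a value in [0, 1], so the results can be combined by a weighted sum without rescaling.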
c) Use the composite similarity function $F(R_1, R_2) = \alpha_0 + \sum_{i=1}^{n} \alpha_i f_i(R_1, R_2)$ to calculate the composite similarity value of the metadata to be entered, where $\alpha_0$ is the threshold, $\alpha_i$ are the weights, $R_1, R_2$ are the metadata records, and $f_i(R_1, R_2)$ is the weighted similarity comparison function for field $i$ of $R_1$ and $R_2$;
d) Compare the composite similarity value with a preset threshold. If the composite similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates; if it is less than the threshold, they are not duplicates.
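The final decision can be sketched as a weighted sum followed by a threshold test. The sketch keeps the threshold as a separate argument rather than folding it into the sum, and the function names, field keys, weights, and threshold in the usage note are illustrative values only, not trained ones.

```python
from typing import Dict


def composite_similarity(field_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Multiply each field's similarity value by its weight and sum the products."""
    return sum(weights[name] * score for name, score in field_scores.items())


def is_duplicate(
    field_scores: Dict[str, float], weights: Dict[str, float], threshold: float
) -> bool:
    """Duplicate when the composite similarity value is not less than the threshold."""
    return composite_similarity(field_scores, weights) >= threshold
```

For example, with weights `{"isbn": 0.5, "title": 0.25, "author": 0.25}` and scores `{"isbn": 1.0, "title": 1.0, "author": 0.5}`, the composite value is 0.875, so the pair counts as a duplicate at a threshold of 0.875 but not at 0.9.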
The invention standardizes the metadata to be entered (dirty data) so that it is free of obvious formal errors; at that point the metadata is of reasonably good quality. The good-quality metadata to be entered is compared with each record in the data set to judge whether the data set contains a record that duplicates it. During comparison, the comparison scope is narrowed, which reduces the workload and improves efficiency: from the thousands of records in the data set, those whose publisher field matches that of the metadata to be entered are selected, and within them the ISBN, title, author, publisher, publication time, and price fields are chosen as the comparison scope. Similarity comparison functions calculate the similarity values between records in the data set and the metadata to be entered, and a weight training function calculates the field weights. Each field's similarity value is multiplied by its weight, and the products are summed to obtain a composite similarity value, which is compared with a preset threshold. If the composite similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates; if it is less than the threshold, they are not.

Claims (8)

1. A method for removing duplicate objects based on metadata, characterized in that it comprises the following steps:
1) standardize the current metadata to be entered and judge whether it is good-quality metadata to be entered, the good-quality metadata to be entered being metadata to be entered that has no formal errors;
when the data set is not empty,
2) compare the good-quality metadata to be entered with each record in the data set, and judge whether the data set contains a record that duplicates the metadata to be entered;
3) if there is a duplicate record, keep the better-quality record of the two in the data set;
step 2) further comprises the following steps:
21) narrow the scope of records in the data set that are compared with the metadata to be entered;
22) within the scope defined in step 21), use a weighted similarity comparison function to calculate the similarity value between the attribute values of corresponding fields in the metadata to be entered and in the data set;
23) multiply each field's similarity value by its weight and sum the products to obtain a composite similarity value;
24) compare the composite similarity value with a preset threshold; if the composite similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates; if it is less than the threshold, they are not duplicates.
2, the method based on metadata removal repeating objects according to claim 1 is characterized in that the described current metadata of typing for the treatment of comprises following field at least: International Standard Book Number, title, author, publishing house, publication time, price field.
3, the method for removing repeating objects based on metadata according to claim 2, it is characterized in that, described International Standard Book Number is made up of 10 bit digital, this 10 bit digital is made up of group number, publisher number, punctuation marks used to enclose the title, verification number this four part, use "-" to link to each other therebetween, publisher number is the code name of publishing house.
4. The method for removing duplicate objects based on metadata according to claim 1, characterized in that said "normalizing the current metadata to be entered" comprises the following steps:
1) determining whether the ISBN of the current metadata to be entered contains non-digit characters; if it does, deleting the non-digit characters and keeping the current metadata to be entered;
2) determining whether the ISBN of the current metadata to be entered consists of 10 digits; if the ISBN does not have 10 digits, handling two cases: if the ISBN has fewer than 8 digits, discarding the current metadata to be entered; if the ISBN has more than 10 digits, deleting the digits after the 10th and keeping the current metadata to be entered;
3) verifying whether the ISBN of the current metadata to be entered is correct;
4) if the ISBN of the current metadata to be entered is correct, further verifying whether the publisher of the current metadata to be entered is correct;
if the publisher of the current metadata to be entered is correct, the current metadata to be entered is said "better-quality metadata to be entered".
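Steps 1) and 2) of this claim can be sketched as follows. The function name and the choice to return `None` for a discarded record are assumptions. The claim does not say what happens to an ISBN of 8 or 9 digits, so the sketch passes it through unchanged. Note also that a real ISBN-10 may end in the check character "X", which step 1) as claimed would delete; the sketch follows the claim.

```python
import re
from typing import Optional


def normalize_isbn(raw: str) -> Optional[str]:
    """Return the cleaned ISBN digits, or None if the record should be discarded."""
    # Step 1: delete every non-digit character (hyphens, spaces, letters).
    digits = re.sub(r"[^0-9]", "", raw)
    # Step 2, first case: fewer than 8 digits, discard the record.
    if len(digits) < 8:
        return None
    # Step 2, second case: more than 10 digits, drop everything after the 10th.
    if len(digits) > 10:
        digits = digits[:10]
    # 8 or 9 digits: not covered by the claim, returned unchanged (assumption).
    return digits
```

For example, `normalize_isbn("7-301-04815-7")` returns `"7301048157"`, and `normalize_isbn("123-45")` returns `None`.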
5. The method for removing duplicate objects based on metadata according to claim 4, characterized in that the method of said "verifying whether the ISBN of the current metadata to be entered is correct" is: the 1st to 9th digits of the ISBN are multiplied, in order, by the nine numbers 10 down to 2; the check digit is added to the sum of these products; if the result is exactly divisible by 11, the ISBN is correct;
the method of said "verifying whether the publisher of the current metadata to be entered is correct" is:
extracting the publisher number from the normalized ISBN and using it to verify whether the publisher of the current metadata to be entered is correct;
if the publisher number corresponds to the publisher of the current metadata to be entered, the publisher of the current metadata to be entered is correct;
if the publisher number does not correspond to the publisher of the current metadata to be entered, the publisher of the current metadata to be entered is incorrect.
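The two checks in this claim can be sketched as below. The checksum follows the rule stated in the claim. The publisher check is only an illustration: the lookup table, the publisher names in it, the one-digit group number, and the prefix-matching strategy are all assumptions, because the claim does not say how the publisher number is located inside a hyphen-free ISBN.

```python
def isbn10_checksum_ok(isbn: str) -> bool:
    """Digits 1-9 times weights 10..2, plus the check digit, must be divisible by 11."""
    if len(isbn) != 10 or not (isbn.isascii() and isbn.isdigit()):
        return False
    weighted = sum(int(d) * w for d, w in zip(isbn[:9], range(10, 1, -1)))
    return (weighted + int(isbn[9])) % 11 == 0


# Hypothetical table mapping publisher numbers to publisher names (made-up entries).
PUBLISHER_BY_NUMBER = {
    "301": "Example University Press",
    "5053": "Example Technical Press",
}


def publisher_matches(isbn: str, publisher: str,
                      table: dict = PUBLISHER_BY_NUMBER,
                      group_len: int = 1) -> bool:
    """True if some prefix after the group number maps to the given publisher."""
    rest = isbn[group_len:9]  # publisher number followed by title number
    # The publisher number has variable length, so try each possible prefix.
    for n in range(1, len(rest)):
        if table.get(rest[:n]) == publisher:
            return True
    return False
```

For the digits `7301048157` the weighted sum is 169, and 169 + 7 = 176 = 16 × 11, so the checksum passes.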
6. The method for removing duplicate objects based on metadata according to claim 4, characterized in that said "normalizing the current metadata to be entered" further comprises: normalizing the publication time and the price to real numbers.
7. The method for removing duplicate objects based on metadata according to claim 1, characterized in that said step 21) is specifically:
211) from the records of the data set, selecting the records whose publisher field is identical to that of the metadata to be entered, as the comparison range;
212) in the selected records, selecting the ISBN, title, author, publisher, publication time, and price fields as the comparison range.
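Steps 211) and 212) amount to a row filter followed by a column projection. A minimal sketch, assuming records are plain dictionaries and using hypothetical field names:

```python
# Hypothetical field names for the six fields listed in step 212).
COMPARE_FIELDS = ("isbn", "title", "author", "publisher", "pub_time", "price")


def narrow_scope(candidate: dict, data_set: list) -> list:
    """Keep only records from the same publisher, reduced to the compared fields."""
    # Step 211: select records whose publisher equals the candidate's publisher.
    same_publisher = [
        record for record in data_set
        if record.get("publisher") == candidate.get("publisher")
    ]
    # Step 212: keep only the six fields that take part in the comparison.
    return [
        {field: record.get(field) for field in COMPARE_FIELDS}
        for record in same_publisher
    ]
```

Filtering on publisher first means the weighted comparison runs against a small subset of the data set rather than every record, which is the workload reduction the method aims at.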
8. The method for removing duplicate objects based on metadata according to claim 1, characterized in that the weighted similarity comparison functions in step 22) comprise: an integer similarity comparison function, a string similarity comparison function, and a real-number similarity comparison function.
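The claim names three kinds of similarity function but does not define them here. The sketch below shows one plausible shape for each, returning a value in [0, 1]. All three definitions are assumptions made for illustration: exact match for integers, the `difflib` matching ratio for strings, and relative difference for real numbers.

```python
from difflib import SequenceMatcher


def integer_similarity(a: int, b: int) -> float:
    """Assumed definition: integers are either identical (1.0) or not (0.0)."""
    return 1.0 if a == b else 0.0


def string_similarity(a: str, b: str) -> float:
    """Assumed definition: ratio of matching characters as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def real_similarity(a: float, b: float) -> float:
    """Assumed definition: 1 minus the relative difference, clamped at 0."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # both values are zero, so they are identical
    return max(0.0, 1.0 - abs(a - b) / largest)
```

Under these assumed definitions, prices of 20.0 and 25.0 have a similarity of about 0.8, and each result can be multiplied by its field weight as described in step 23).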
CN 200710106024 2007-05-29 2007-05-29 Method for removing duplicate objects based on metadata Active CN100576207C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200710106024 CN100576207C (en) 2007-05-29 2007-05-29 Method for removing duplicate objects based on metadata

Publications (2)

Publication Number Publication Date
CN101286156A CN101286156A (en) 2008-10-15
CN100576207C true CN100576207C (en) 2009-12-30

Family

ID=40058367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200710106024 Active CN100576207C (en) Method for removing duplicate objects based on metadata 2007-05-29 2007-05-29

Country Status (1)

Country Link
CN (1) CN100576207C (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102236635A (en) * 2010-04-22 2011-11-09 上海百果信息科技有限公司 Method for realizing multi-system information association by capturing and comparing key elements
CN102609418B (en) * 2011-01-21 2015-02-04 北京世纪读秀技术有限公司 Data quality grade judging method
CN102609419B (en) * 2011-01-21 2015-02-18 北京世纪读秀技术有限公司 Similar data de-duplication method
US9223511B2 (en) 2011-04-08 2015-12-29 Micron Technology, Inc. Data deduplication
CN102325347A (en) * 2011-09-14 2012-01-18 中兴通讯股份有限公司 Transport stream template coupling method in LTE system and apparatus thereof
US9489133B2 (en) 2011-11-30 2016-11-08 International Business Machines Corporation Optimizing migration/copy of de-duplicated data
CN103166917B (en) * 2011-12-12 2016-02-10 阿里巴巴集团控股有限公司 Network equipment personal identification method and system
CN103257961B (en) * 2012-02-15 2016-08-10 北大方正集团有限公司 Bibliography disappear weight method, Apparatus and system
CN103425711B (en) * 2012-05-25 2017-08-25 株式会社理光 Object value alignment schemes based on many object instances
CN103729369B (en) * 2012-10-15 2017-06-13 金蝶软件(中国)有限公司 The method and device of automatically processing coexisting orders
US20150032609A1 (en) * 2013-07-29 2015-01-29 International Business Machines Corporation Correlation of data sets using determined data types
CN103473654A (en) * 2013-09-23 2013-12-25 国家电网公司 Asset data cleaning auxiliary method and system for electric ERP system
CN104899408A (en) * 2014-03-05 2015-09-09 孙宝文 Interesting item set acquisition method and device
CN105205107A (en) * 2015-08-27 2015-12-30 湖南人文科技学院 Internet of Things data similarity processing method
CN106528705A (en) * 2016-10-26 2017-03-22 桂林电子科技大学 Repeated record detection method and system based on RBF neural network
CN108153793A (en) * 2016-12-02 2018-06-12 航天星图科技(北京)有限公司 A kind of original data processing method
CN106649650B (en) * 2016-12-10 2020-08-18 宁波财经学院 Bidirectional matching method for demand information
CN107203686B (en) * 2017-03-31 2021-04-20 苏州艾隆信息技术有限公司 Medicine information difference processing method and system
CN107870991A (en) * 2017-10-27 2018-04-03 湖南纬度信息科技有限公司 A kind of similarity calculating method and computer-readable recording medium of paper metadata
CN109034199B (en) * 2018-06-25 2022-02-01 泰康保险集团股份有限公司 Data processing method and device, storage medium and electronic equipment
CN109446190B (en) * 2018-11-07 2022-11-01 湖北省标准化与质量研究院 Data processing method of standard metadata
CN110941598A (en) * 2019-12-02 2020-03-31 北京锐安科技有限公司 Data deduplication method, device, terminal and storage medium
CN111158666B (en) * 2019-12-27 2023-07-04 北京百度网讯科技有限公司 Entity normalization processing method, device, equipment and storage medium
CN112069510B (en) * 2020-07-24 2024-01-30 北京思特奇信息技术股份有限公司 Data encryption and duplication elimination method
CN115829143A (en) * 2022-12-15 2023-03-21 广东慧航天唯科技有限公司 Water environment treatment prediction system and method based on time-space data cleaning technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: LIDE TECHNOLOGY DEVELOPMENT CO., LTD.

Free format text: FORMER OWNER: PEKING UNIVERSITY FOUNDER GROUP CORP.

Effective date: 20120823

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100871 HAIDIAN, BEIJING TO: 409000 QIANJIANG, CHONGQING

TR01 Transfer of patent right

Effective date of registration: 20120823

Address after: 409000 Zhengyang Industrial Park, Chongqing

Patentee after: Leade Technology Development Co., Ltd.

Patentee after: Beijing Founder Apabi Technology Co., Ltd.

Address before: 100871 5th Floor, Founder Building, No. 298 Chengfu Road, Haidian District, Beijing

Patentee before: Peking University Founder Group Co., Ltd.

Patentee before: Beijing Founder Apabi Technology Co., Ltd.