
CN100576207C - Method for removing duplicate objects based on metadata

Info

Publication number: CN100576207C
Application number: CN 200710106024
Authority: CN
Inventors: 高飞
Original assignee: Peking University Founder Group Co Ltd; Beijing Founder Apabi Technology Co Ltd
Current assignee: Leade Technology Development Co., Ltd.; Beijing Founder Apabi Technology Co Ltd
Priority date: 2007-05-29
Filing date: 2007-05-29
Publication date: 2009-12-30
Anticipated expiration: 2027-05-29
Also published as: CN101286156A

Abstract

The invention discloses a metadata-based method for removing duplicate objects. It relates to the field of metadata cleaning and addresses the heavy workload of existing duplicate-removal approaches. The invention first normalizes the metadata to be entered. During comparison, the workload is reduced and efficiency improved by narrowing the comparison range: from the records in the data set, those whose publisher field is the same as that of the metadata to be entered are selected; within the selected records, the ISBN, title, author, publisher, publication time, and price fields are chosen as the comparison range. A weighted similarity comparison function is used to calculate the similarity between the attribute values of corresponding fields in the metadata to be entered and in the data set; each field similarity is multiplied by its weight, and the products are summed to obtain a composite similarity value; the composite similarity value is compared with a predetermined threshold; if the composite similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates.

Description

Method for removing duplicate objects based on metadata

Technical field

The present invention relates to a data cleaning method, and in particular to a method for removing duplicate objects from a data set.

Background technology

In the information society, information can be divided into two broad classes. One class can be represented by data or by a unified structure; we call it structured data, for example numbers and symbols. The other class cannot be represented by numbers or a unified structure, for example text, images, sound, and web pages; we call it unstructured data. Structured data belongs to unstructured data and is a special case of it.

A structured data type is a user-defined data type that contains a number of non-atomic elements. More precisely, such data types can be decomposed: their elements can be used separately or, where appropriate, together as a single independent unit.

In the library and information science community, metadata is defined as structured data that provides information about information resources or data, that is, a structured description of information resources. Its role is to describe the features and attributes of the information resources or data themselves and to prescribe the organization of digital information; it supports functions such as location, discovery, proof, evaluation, and selection.

Without a good data environment there can be no satisfactory mining results, yet real-world data is generally dirty, incomplete, and inconsistent. Data preprocessing can create such an environment.

At present, with the development of networks, the quantity of all kinds of metadata is increasing sharply. Because the added metadata is of uneven quality and comes from many sources, it contains a large amount of duplicate data, which causes considerable trouble for the business built on top of it. If duplicate metadata is handled badly, the business logic based on it runs into problems and losses can follow. On a book-selling website, for example, if there are many duplicate bibliographic records, users may not know which one to order. In the past, this kind of problem was usually solved by manual judgment, but as the data volume keeps growing and accumulating, the manpower required also rises sharply. How to detect and handle duplicates in a large amount of metadata has therefore become a primary problem.

Duplicate detection for unstructured network data has long been studied extensively in the industry; algorithms abound and are used in all kinds of search engines today. Metadata, however, is structured data that carries semantics, and it demands stricter criteria and higher accuracy in duplicate detection. Existing duplicate-detection schemes for unstructured data therefore cannot fully meet the requirements of metadata. In addition, the exact-match duplicate-detection schemes commonly applied to databases are ill suited to an environment such as metadata, where part of the data may itself be erroneous.

Summary of the invention

The invention provides a metadata-based method for removing duplicate objects that can accurately identify duplicate data and remove it.

The invention adopts the following technical solution. The metadata-based method for removing duplicate objects comprises the following steps:

1) normalize the current metadata to be entered and judge whether it is good-quality metadata to be entered;

2) compare the good-quality metadata to be entered with each record in the data set and judge whether the data set contains a record that duplicates the metadata to be entered;

3) if there is a duplicate record, choose the better-quality one of the two as the record kept in the data set.

The current metadata to be entered comprises at least the following fields: ISBN (International Standard Book Number), title, author, publisher, publication time, and price.

The ISBN consists of 10 digits made up of four parts: group number, publisher number, title number, and check digit, joined by "-". The publisher number is the code of the publisher.

Normalizing the current metadata to be entered comprises the following steps:

1) judge whether the ISBN of the current metadata to be entered contains non-digit characters; if it does, delete those characters and keep the metadata;

2) judge whether the ISBN of the current metadata to be entered consists of 10 digits; if it does not, two cases are handled: if the ISBN has fewer than 8 digits, discard the metadata; if the ISBN has more than 10 digits, delete the digits after the 10th and keep the metadata;

3) verify whether the ISBN of the current metadata to be entered is correct;

4) if the ISBN of the current metadata to be entered is correct, verify whether its publisher is correct;

If the publisher of the current metadata to be entered is correct, the current metadata to be entered is "good-quality metadata to be entered".
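The formal part of these steps (steps 1 and 2) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `normalize_isbn` is hypothetical, and because the text does not say what happens to an ISBN with 8 or 9 digits, the sketch simply passes such values through so that the later correctness check can reject them.

```python
import re
from typing import Optional


def normalize_isbn(raw: str) -> Optional[str]:
    """Apply the formal ISBN checks; return None when the record is discarded.

    Hypothetical helper for illustration only.
    """
    # Step 1: delete every non-digit character (prefix "ISBN", hyphens, typos).
    digits = re.sub(r"[^0-9]", "", raw)

    # Step 2, first case: fewer than 8 digits, discard the metadata.
    if len(digits) < 8:
        return None

    # Step 2, second case: more than 10 digits, drop everything after the 10th.
    if len(digits) > 10:
        digits = digits[:10]

    # Assumption: 8- or 9-digit values are not covered by the text; they are
    # returned unchanged and left for the later verification step to reject.
    return digits


print(normalize_isbn("ISBN7-305-01568-7"))  # 7305015687
```
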

The ISBN of the current metadata to be entered is verified as follows: the 1st through 9th digits of the ISBN are multiplied in order by the nine numbers 10 down to 2; the check digit is added to the sum of these products; if the result is divisible by 11, the ISBN is correct.
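Written as a formula, with $d_1, \dots, d_{10}$ denoting the ten digits of the ISBN (a notation introduced here only for illustration, with $d_{10}$ the check digit), the condition reads:

```latex
\sum_{i=1}^{9} (11 - i)\, d_i + d_{10} \equiv 0 \pmod{11}
```

The factor $(11 - i)$ runs from 10 for the first digit down to 2 for the ninth, matching the multipliers named in the text.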

The publisher of the current metadata to be entered is verified as follows:

The publisher number is extracted from the normalized ISBN and used to verify whether the publisher of the current metadata to be entered is correct;

If the publisher number corresponds to the publisher of the current metadata to be entered, the publisher is correct;

If the publisher number does not correspond to the publisher of the current metadata to be entered, the publisher is incorrect.
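A minimal sketch of this correspondence check is given below. Everything named here is an assumption made for illustration: the registry `PUBLISHER_REGISTRY` stands in for whatever table maps publisher numbers to publishers, its single entry (number 305) is a placeholder, and the publisher number is read from the hyphenated form of the ISBN because its length varies and cannot be recovered from the bare digits alone.

```python
from typing import Dict, Optional

# Hypothetical lookup table; a real one would come from the ISBN agency's data.
PUBLISHER_REGISTRY: Dict[str, str] = {"305": "Baihua Publishing House"}


def publisher_number(hyphenated_isbn: str) -> Optional[str]:
    """Return the second of the four hyphen-separated ISBN parts, if present."""
    body = hyphenated_isbn.upper().replace("ISBN", "").strip()
    parts = body.split("-")
    return parts[1] if len(parts) == 4 else None


def publisher_is_correct(hyphenated_isbn: str, publisher: str,
                         registry: Dict[str, str] = PUBLISHER_REGISTRY) -> bool:
    """True when the publisher number maps to the stated publisher."""
    number = publisher_number(hyphenated_isbn)
    return number is not None and registry.get(number) == publisher.strip()
```

A record whose publisher name does not match the registry entry for its publisher number fails the check and is therefore not treated as good-quality metadata.
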

Normalizing the current metadata to be entered further comprises converting the publication time and the price to real numbers.

When the data set is empty, steps 2) and 3) are specifically:

2) the data set contains no record that duplicates the metadata to be entered;

3) the good-quality metadata to be entered is stored in the data set.

When the data set is not empty, step 2) comprises:

21) narrow the range of records in the data set that are compared with the metadata to be entered;

22) within the range defined in step 21), use weighted similarity comparison functions to calculate the similarity between the attribute values of corresponding fields in the metadata to be entered and in the data set;

23) multiply each field similarity by its weight and sum the products to obtain a composite similarity value;

24) compare the composite similarity value with a predetermined threshold; if the composite similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates; if the composite similarity value is less than the threshold, they are not duplicates.

Step 21) is specifically:

211) from the records in the data set, select those whose publisher field is the same as that of the metadata to be entered, as the comparison range;

212) within the selected records, choose the ISBN, title, author, publisher, publication time, and price fields as the comparison range.
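The two narrowing steps can be sketched as a filter followed by a projection. The record layout (plain dictionaries) and the names `COMPARISON_FIELDS` and `comparison_range` are assumptions made for this illustration; the text does not prescribe a storage format.

```python
from typing import Dict, List

# The six fields named in step 212); the key spellings are an assumption.
COMPARISON_FIELDS = ("isbn", "title", "author", "publisher",
                     "publication_time", "price")


def comparison_range(candidate: Dict, data_set: List[Dict]) -> List[Dict]:
    """Keep records with the candidate's publisher, then keep only six fields."""
    # Step 211): restrict to records whose publisher field is identical.
    same_publisher = [record for record in data_set
                      if record.get("publisher") == candidate.get("publisher")]
    # Step 212): project each remaining record onto the comparison fields.
    return [{field: record.get(field) for field in COMPARISON_FIELDS}
            for record in same_publisher]
```

Filtering on an exact publisher match only works because the publisher field has already been verified during normalization; otherwise a misspelled publisher would hide a true duplicate.
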

The weighted similarity comparison functions in step 22) comprise an integer similarity comparison function, a string similarity comparison function, and a real-number similarity comparison function.

The invention normalizes the metadata to be entered (dirty data) so that it has no obvious formal errors; at that point the metadata is of reasonably good quality. The good-quality metadata to be entered is compared with each record in the data set to judge whether the data set contains a record that duplicates it. During comparison, the workload is reduced and efficiency improved by narrowing the comparison range: from the thousands of records in the data set, those whose publisher field is the same as that of the metadata to be entered are selected as the comparison range; within the selected records, the ISBN, title, author, publisher, publication time, and price fields are chosen as the comparison range. Similarity comparison functions are used to calculate the similarity between the records in the data set and the metadata to be entered, and a weight-training function is used to calculate the field weights; each field similarity is multiplied by its weight, and the products are summed to obtain a composite similarity value; the composite similarity value is compared with a predetermined threshold; if the composite similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates; if it is less than the threshold, they are not duplicates.

Description of drawings

Fig. 1 is a flowchart of the metadata-based method for removing duplicate objects according to the invention;

Fig. 2 is a flowchart of comparing the good-quality metadata to be entered with each record in the data set according to the invention.

Embodiment

To address the heavy workload of removing dirty data in the existing metadata cleaning field, the invention provides a metadata-based method for removing duplicate objects. With reference to Fig. 1, it comprises the following steps:

The information about a book online contains a large amount of metadata, and much of this metadata is dirty data, that is, data of poor quality. For example: Title: The Romance of the Three Kingdoms; ISBN: ISBN7-305-01568-7; Publisher number: 305; Publisher: Baihua (Hundred Flowers) Publishing House; Publication time: June 9, 1988; Language: Chinese; Place of publication: Nanjing; Author: Luo Guanzhong; Responsible editor: Cao Xueqin; Current price: 109,90 yuan; Edition: 1st edition, September 1996; 3rd printing, May 1988; and so on. In the metadata above, the part before each colon is a field and the part after it is the attribute value. This information forms one record in the data set. In this record all attribute values are correct, so it is called good-quality data. In practice, the attribute values in metadata records are often wrong. Taking the same record for The Romance of the Three Kingdoms as an example: Title: The Romance of the Three Kingdoms; ISBN: ISBN8-305-01548-7; Publisher number: 306; Publisher: Huabai Publishing House (characters transposed); Publication time: February 30, 1988; Language: Chinese; Place of publication: Nanjing; Author: Luo Guanzhong; Responsible editor: Cao Xueqin; Current price: 109,908 yuan; Edition: 1st edition, September 1996; 3rd printing, May 1988; and so on. In this record, errors appear in the attribute values of the ISBN, publisher number, publisher, publication time, responsible editor, current price, and other fields. Such data is called dirty data, or data of poor quality.

Good-quality metadata should be entered into the data set, and poor-quality metadata should be cleaned out. At present, the quality of metadata to be entered is always judged manually at entry time, which is inefficient and lacks a uniform standard.

I. Poor-quality metadata must first be normalized before entry:

1) Normalize the ISBN:

The copyright page of every regular publication generally carries an ISBN, which is the abbreviation of International Standard Book Number. It consists of 10 digits made up of four parts: group number, publisher number, title number, and check digit, joined by "-", for example ISBN7-305-01568-7. The group number identifies the country or language; China's is 7. The publisher number is the code of the publisher; it is set and assigned by the national ISBN center and may be 1 to 7 digits long. The title number is the number the publisher gives to each of its publications. The check digit is the last digit of the ISBN and can be used to verify whether the ISBN is correct: digits 1 to 9 of the ISBN are multiplied in order by the nine numbers 10 down to 2, the check digit is added to the sum of these products, and if the result is divisible by 11, the ISBN is correct.

Steps 1 and 2 below verify the formal correctness of the ISBN. Every ISBN must meet these formal requirements before the correctness of the ISBN itself can be verified:

1. Judge whether the ISBN of the current metadata to be entered contains non-digit characters; if it does, delete those characters and keep the metadata;

2. Judge whether the ISBN of the current metadata to be entered consists of 10 digits; if it does not, two cases are handled: if the ISBN has fewer than 8 digits, discard the metadata; if the ISBN has more than 10 digits, delete the digits after the 10th and keep the metadata;

3. Multiply the 1st through 9th digits of the ISBN in order by the nine numbers 10 down to 2 and add the check digit to the sum of these products; if the result is divisible by 11, the ISBN is correct. Take again the record for The Romance of the Three Kingdoms. For ISBN7-305-01568-7 the calculation is 7*10+3*9+0*8+5*7+0*6+1*5+5*4+6*3+8*2+7=198, and 198/11=18 with no remainder, so this ISBN is correct. For ISBN8-305-01548-7 the calculation is 8*10+3*9+0*8+5*7+0*6+1*5+5*4+4*3+8*2+7=202, and 202/11=18 with remainder 4, so the sum is not divisible by 11 and this ISBN is incorrect.
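The check in step 3 can be sketched in a few lines of Python and run against the two example ISBNs. The function names are hypothetical. Following the text, the sketch accepts digits only; the ISBN-10 standard also allows "X" (value 10) as the check character, which this digits-only procedure does not cover.

```python
def isbn10_weighted_sum(digits: str) -> int:
    """Sum of digit 1..9 times 10..2, plus the check digit (digit 10)."""
    products = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    return products + int(digits[9])


def isbn10_is_valid(digits: str) -> bool:
    """True when the string is 10 ASCII digits and the sum is divisible by 11."""
    if len(digits) != 10 or any(c not in "0123456789" for c in digits):
        return False
    return isbn10_weighted_sum(digits) % 11 == 0


print(isbn10_weighted_sum("7305015687"), isbn10_is_valid("7305015687"))  # 198 True
print(isbn10_weighted_sum("8305015487"), isbn10_is_valid("8305015487"))  # 202 False
```
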

2) Normalize the publisher:

1. Judge whether the ISBN of the current metadata to be entered is in character-string form; if characters of other types are present, delete them and keep the metadata;

2. Extract the publisher number from the normalized ISBN and use it to verify whether the publisher of the current metadata to be entered is correct;

The publisher number is the code of the publisher; it is set and assigned by the national ISBN center and may be 1 to 7 digits long. For example, from ISBN7-305-01568-7 the publisher number 305 is extracted, and the corresponding publisher is then found to be Baihua (Hundred Flowers) Publishing House. If the publisher in the metadata to be entered is Baihua Publishing House, the publisher of the current metadata to be entered is correct.

3) Normalize the title and author to character strings. If digits or characters of other types occur in them, remove those characters and keep the metadata. For example, if the metadata to be entered has the author "Luo 9 Guanzhong" or "Luo Guan,zhong,", the "9" and the "," are deleted during normalization, and the author "Luo Guanzhong" is kept for subsequent processing.

4) Normalize the publication time and price to real numbers. If Chinese characters or characters of other types occur in them, remove those characters and keep the metadata. For example, if the metadata to be entered has the publication time "1988-6f-9" or "198水8-6-9" (with a stray Chinese character), the "f" and the "水" are removed during normalization, and the publication time 1988-6-9 is kept for subsequent processing.
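The character clean-up described for text fields and for numeric fields can be sketched as below. The function names, the exact character classes, and the choice of a day ordinal as the "real number" for a publication time are all assumptions made for illustration; the text only says that stray characters are removed and that times and prices become real numbers.

```python
import re
from datetime import date


def clean_text_field(raw: str) -> str:
    """Remove digits and commas from a title or author, then tidy spaces."""
    without_noise = re.sub(r"[0-9,，]", "", raw)
    return re.sub(r"\s+", " ", without_noise).strip()


def clean_numeric_field(raw: str) -> str:
    """Keep only digits and the separators '-' and '.'."""
    return re.sub(r"[^0-9.\-]", "", raw)


def publication_time_as_number(cleaned: str) -> float:
    """Assumed encoding: proleptic Gregorian day ordinal of a Y-M-D date.

    date() raises ValueError for impossible dates such as 1988-2-30.
    """
    year, month, day = (int(part) for part in cleaned.split("-"))
    return float(date(year, month, day).toordinal())


print(clean_text_field("Luo 9 Guanzhong"))    # Luo Guanzhong
print(clean_numeric_field("198水8-6-9"))       # 1988-6-9
```
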

5) Normalize the responsible editor, current price, edition, synopsis, classification, subject terms, and so on.

After normalization, the dirty data no longer has obvious formal errors, and the metadata is then of reasonably good quality.

II. Compare the good-quality metadata to be entered with each record in the data set and judge whether the data set contains a record that duplicates the metadata to be entered.

Two cases are distinguished according to the number of records in the data set: 1) the data set is empty; and 2) the data set is not empty.

1) When the data set is empty, the metadata to be entered is entered directly into the data set;

2) When the data set is not empty, it already contains some records; with reference to Fig. 2, entry proceeds in the following steps:

a) Narrow the range of records in the data set that are compared with the metadata to be entered;

The data set holds thousands upon thousands of records of books of all kinds and periods. When metadata is to be entered into the data set, those records must be searched for one that duplicates it. To reduce the workload and improve efficiency, the range of records in the data set that are compared with the metadata to be entered needs to be narrowed. The specific measures are:

a1. From the records in the data set, select those whose publisher field is the same as that of the metadata to be entered, as the comparison range;

Many of the thousands of records in the data set were published by the same publisher. During comparison, the records whose publisher field is the same as that of the metadata to be entered are extracted as the comparison range.

For example, when the metadata for The Romance of the Three Kingdoms is to be entered into the data set, its publisher is Baihua (Hundred Flowers) Publishing House. The records in the data set whose publisher attribute value is Baihua Publishing House are extracted as the comparison range.

a2. Within the selected records, choose the ISBN, title, author, publisher, publication time, and price fields as the comparison range.

To reduce the workload and improve efficiency, the comparison range is narrowed further within the records that share the same publisher: only the ISBN, title, author, publisher, publication time, and price fields are chosen as the comparison range.

b) Within the range defined in step a), use the weighted similarity comparison function

$f(r_1, r_2) = f'(r'_1, r'_2) - \alpha\,(1 - f'(r''_1, r''_2)), \quad f' \in [0, 1]$

to calculate the similarity between the attribute values of corresponding fields in the metadata to be entered and in the data set. Here $f'$ is a similarity comparison function from the prior art; $r_1, r_2$ are the attribute values of the corresponding field (ISBN, title, author, publisher, publication time, price) in the metadata to be entered and in the data set; $r'_1, r'_2$ are what remains of the attribute values after the ignore words are removed; $r''_1, r''_2$ are the parts of the attribute values that consist only of weight words; and $\alpha$ is a weight obtained by a training algorithm. When there are no weight words, $f(r_1, r_2) = f'(r'_1, r'_2)$. For example, for the publisher field, in the attribute values "Tsinghua University Press" and "Peking University Press" the words "University Press" can be regarded as ignore words, because they contribute little to comparing this field; only "Tsinghua" and "Peking" are compared, and these are $r'_1, r'_2$. For the title attribute value "The Romance of the Three Kingdoms (Volumes 1 and 2)", the volume designation is the weight word, giving $r''_1, r''_2$.
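A minimal sketch of this weighted comparison for string fields follows. Several things are assumptions rather than the patent's choices: `difflib.SequenceMatcher.ratio()` stands in for the unspecified prior-art function f′, ignore words and weight words are passed in as plain tuples, the default α of 0.5 is a placeholder for the trained weight, and the English weight words "Part One" and "Part Two" are illustrative.

```python
from difflib import SequenceMatcher
from typing import Sequence


def base_similarity(a: str, b: str) -> float:
    """Stand-in for f': a string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def strip_words(value: str, words: Sequence[str]) -> str:
    """r': the attribute value with every ignore word removed."""
    for word in words:
        value = value.replace(word, "")
    return value.strip()


def keep_words(value: str, words: Sequence[str]) -> str:
    """r'': only the weight words that occur in the attribute value."""
    return "".join(word for word in words if word in value)


def weighted_similarity(r1: str, r2: str,
                        ignore_words: Sequence[str] = (),
                        weight_words: Sequence[str] = (),
                        alpha: float = 0.5) -> float:
    """f(r1, r2) = f'(r1', r2') - alpha * (1 - f'(r1'', r2''))."""
    core = base_similarity(strip_words(r1, ignore_words),
                           strip_words(r2, ignore_words))
    w1, w2 = keep_words(r1, weight_words), keep_words(r2, weight_words)
    if not w1 and not w2:          # no weight words: f reduces to f'(r1', r2')
        return core
    return core - alpha * (1.0 - base_similarity(w1, w2))
```

With these definitions, two titles that differ only in their weight word (for example "Part One" against "Part Two") score lower than their raw string similarity, which is the penalty the α term is meant to apply.
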

The similarity comparison functions comprise an integer similarity comparison function, a string similarity comparison function, and a real-number similarity comparison function.

The comparison function for the ISBN field returns 1 if the ISBNs are equal and 0 otherwise;

The comparison function for the title field is the string similarity of the words obtained by word segmentation;

The comparison function for the author field is the string similarity of the words obtained by word segmentation;

The comparison function for the publication time uses a relative-difference function to obtain the similarity value;

The comparison function for the price uses a relative-difference function to obtain the similarity value;
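The per-field comparison functions can be sketched as follows. Only the ISBN rule is fully specified in the text; the other two are assumptions chosen for illustration: the relative-difference function is taken to be 1 − |x − y| / max(|x|, |y|) for non-negative values, and the segmented-word similarity is approximated by the Jaccard overlap of whitespace-separated tokens (Chinese text would need a real word segmenter).

```python
def isbn_similarity(a: str, b: str) -> float:
    """1 when the two ISBNs are equal, otherwise 0 (as stated in the text)."""
    return 1.0 if a == b else 0.0


def relative_difference_similarity(x: float, y: float) -> float:
    """Assumed form: 1 - |x - y| / max(|x|, |y|), for non-negative inputs."""
    if x == y:
        return 1.0
    return 1.0 - abs(x - y) / max(abs(x), abs(y))


def token_similarity(a: str, b: str) -> float:
    """Assumed form: Jaccard overlap of whitespace tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(relative_difference_similarity(100.0, 50.0))  # 0.5
```
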

c) Use the composite similarity function

$F(R_1, R_2) = \alpha_0 + \sum_{i=1}^{n} \alpha_i f_i(R_1, R_2),$

where $\alpha_0$ is the threshold, $\alpha_i$ is the weight of field $i$, $R_1, R_2$ are the metadata records, and $f_i(R_1, R_2)$ is the weighted similarity comparison function for field $i$ of $R_1$ and $R_2$, to calculate the composite similarity value of the metadata to be entered;

d) Compare the composite similarity value with a predetermined threshold; if the composite similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates; if the composite similarity value is less than the threshold, they are not duplicates.
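Steps c) and d) together amount to a weighted sum followed by a threshold test, sketched below. The function names and the sample weights are hypothetical, and because the text does not spell out how the constant term α₀ relates to the predetermined threshold, the sketch keeps them as two separate parameters.

```python
from typing import Dict


def composite_similarity(field_similarities: Dict[str, float],
                         weights: Dict[str, float],
                         alpha0: float = 0.0) -> float:
    """F = alpha0 + sum over fields of weight_i * similarity_i."""
    return alpha0 + sum(weights[name] * value
                        for name, value in field_similarities.items())


def is_duplicate(field_similarities: Dict[str, float],
                 weights: Dict[str, float],
                 threshold: float,
                 alpha0: float = 0.0) -> bool:
    """Duplicate when the composite value is not less than the threshold."""
    return composite_similarity(field_similarities, weights, alpha0) >= threshold


# Illustrative weights only; in the method they come from a training step.
weights = {"isbn": 0.5, "title": 0.25, "author": 0.25}
similarities = {"isbn": 1.0, "title": 1.0, "author": 0.0}
print(composite_similarity(similarities, weights))  # 0.75
```

Note that the comparison is "not less than", so a composite value exactly equal to the threshold counts as a duplicate.
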

Claims

1. A method for removing duplicate objects based on metadata, characterized by comprising the following steps:

1) normalize the current metadata to be entered and judge whether it is good-quality metadata to be entered, the good-quality metadata to be entered being metadata to be entered that has no formal errors;

when the data set is not empty,

3) if there is a duplicate record, choose the better-quality one of the two as the record kept in the data set;

wherein step 2) further comprises the following steps:

2. The method for removing duplicate objects based on metadata according to claim 1, characterized in that the current metadata to be entered comprises at least the following fields: ISBN (International Standard Book Number), title, author, publisher, publication time, and price.

3. The method for removing duplicate objects based on metadata according to claim 2, characterized in that the ISBN consists of 10 digits made up of four parts: group number, publisher number, title number, and check digit, joined by "-", the publisher number being the code of the publisher.

4. The method for removing duplicate objects based on metadata according to claim 1, characterized in that normalizing the current metadata to be entered comprises the following steps:

2) judge whether the ISBN of the current metadata to be entered consists of 10 digits; if it does not, two cases are handled: if the ISBN has fewer than 8 digits, discard the metadata; if the ISBN has more than 10 digits, delete the digits after the 10th and keep the metadata;

5. The method for removing duplicate objects based on metadata according to claim 4, characterized in that the ISBN of the current metadata to be entered is verified as follows: the 1st through 9th digits of the ISBN are multiplied in order by the nine numbers 10 down to 2; the check digit is added to the sum of these products; if the result is divisible by 11, the ISBN is correct;

6. The method for removing duplicate objects based on metadata according to claim 4, characterized in that normalizing the current metadata to be entered comprises converting the publication time and the price to real numbers.

7. The method for removing duplicate objects based on metadata according to claim 1, characterized in that step 21) is specifically:

8. The method for removing duplicate objects based on metadata according to claim 1, characterized in that the weighted similarity comparison functions in step 22) comprise an integer similarity comparison function, a string similarity comparison function, and a real-number similarity comparison function.