CN102298632B - Character string similarity computing method and device and material classification method and device - Google Patents


Info

Publication number
CN102298632B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110262493.2A
Other languages
Chinese (zh)
Other versions
CN102298632A (en
Inventor
韩建国
巩军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenhua Group Corp Ltd
Original Assignee
Shenhua Group Corp Ltd

Abstract

The invention discloses a character string similarity computing method and device and a material classification method and device. The similarity computing method comprises the following steps: computing an initial similarity between a character string X and a character string $d_i$; obtaining the longest common prefix and the longest common suffix of X and $d_i$; determining the weight of the longest common prefix and the weight of the longest common suffix; and computing the similarity between X and $d_i$ from the initial similarity and the two weights. Addressing the characteristics of Chinese material names, the invention thus provides a Chinese character string similarity computing method oriented to material classification (the dynamic weight method). The method dynamically estimates the prefix and suffix weights of each material name string, so that names of materials of the same category obtain a higher similarity and the accuracy of automatic material classification is improved.

Description

Character string similarity computing method and device and material classification method and device
Technical field
The present invention relates to the fields of similarity computation and material classification, and in particular to a character string similarity computing method and device based on dynamic weights of string prefixes and suffixes, and to a material classification method and device.
Background art
Mature automatic text classification techniques currently include neural networks (Neural Net, NNet), support vector machines (Support Vector Machine, SVM), naive Bayes (Naive Bayes, NB), k-nearest neighbors (k nearest neighbor, k-NN) and linear least squares fit (Linear Least Squares Fit, LLSF). Applying any of these methods to material classification requires solving the problem of computing the similarity between materials. Unlike the settings of conventional automatic text classification, the names and descriptions of materials in an enterprise are usually very short, so text similarity methods based on word frequency cannot meet the needs of material classification. The similarity between materials therefore has to be computed by other means, such as character string similarity.
For character string similarity, mature theories and models have been established for English and are widely used. Researchers in statistics, databases and artificial intelligence have each proposed different similarity measures from the perspective of their own field. In matching tests on various kinds of names, Jaro-Winkler and Monge-Elkan performed best among these methods and are well suited to matching personal names, place names and organization names. Later researchers found that approximate string matching depends strongly on the language, and proposed improved algorithms tailored to particular languages: for example, Piskorski proposed an improved algorithm for the Polish language, and Arehart et al. studied the similarity of Roman-character strings. Research on Chinese character string similarity has also produced many results and practical applications. For example, Li Honglian et al. proposed a similarity algorithm suited to speech recognition; Zhou Faguo et al. proposed a sentence similarity computing method for online question answering systems; and Zhang Chengzhi proposed a similarity computing method that integrates multi-level features such as literal, semantic and statistical association features.
Although text classification and character string similarity have been studied extensively, no research has specifically addressed the classification of enterprise materials in a Chinese-language environment. The names and descriptions of materials in an enterprise have characteristics of their own, so a new technique is needed that measures the similarity of material names and descriptions accurately and thereby improves classification accuracy.
Summary of the invention
The object of the invention is to provide a character string similarity computing method and device based on dynamic weights of string prefixes and suffixes, and a material classification method and device. The methods and devices give material names of the same category a higher similarity and thereby improve the accuracy of automatic material classification.
To achieve the above object, the invention provides a character string similarity computing method comprising: computing an initial similarity Sim between a character string X and a character string $d_i$, where $d_i$ is a character string in category $C_j$ of a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories C, n is the number of categories, and each category contains a plurality of character strings; obtaining the longest common prefix $\mathrm{Prefix}_{\mathrm{MaxCommon}}$ and the longest common suffix $\mathrm{Suffix}_{\mathrm{MaxCommon}}$ of X and $d_i$; determining the weight $PW(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common prefix and the weight $SW(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common suffix; and computing the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between X and $d_i$ by the formula

$$\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) = \mathrm{Sim} + \theta \cdot PW_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim}) + (1 - \theta) \cdot SW_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim})$$

where $PW_{\mathrm{MaxCommon}}$ and $SW_{\mathrm{MaxCommon}}$ denote the two weights just determined and θ is a fusion coefficient greater than 0 and less than 1.
The invention further provides a material classification method comprising: using the above similarity computing method to compute the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between the material name X of a material to be classified and the material name $d_i$ of each material in a plurality of material categories; taking the K material names with the highest similarity to form a set KNN; scoring each candidate category $C_j$ of the material to be classified according to the categories to which those K material names belong, by the formula

$$p(X, C_j) = \sum_{d_i \in KNN} \mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) \cdot y(d_i, C_j)$$

where $y(d_i, C_j)$ is a category attribute function; and determining, from the scoring result, the category to which the material to be classified belongs.
Correspondingly, the invention also provides a character string similarity computing device comprising: an initial similarity calculation device (10) for computing an initial similarity Sim between a character string X and a character string $d_i$, where $d_i$ is a character string in category $C_j$ of a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories C, n is the number of categories, and each category contains a plurality of character strings; a common prefix and suffix acquisition device (20) for obtaining the longest common prefix $\mathrm{Prefix}_{\mathrm{MaxCommon}}$ and the longest common suffix $\mathrm{Suffix}_{\mathrm{MaxCommon}}$ of X and $d_i$; a weight determining device (30) for determining the weight $PW(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common prefix and the weight $SW(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common suffix; and a similarity calculation device (40) for computing the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between X and $d_i$ by the formula

$$\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) = \mathrm{Sim} + \theta \cdot PW_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim}) + (1 - \theta) \cdot SW_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim})$$

where θ is a fusion coefficient greater than 0 and less than 1.
Correspondingly, the invention also provides a material classification device comprising: the above character string similarity computing device (100) for computing the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between the material name X of a material to be classified and the material name $d_i$ of each material in a plurality of material categories; a maximum-similarity set determining device (200) for taking the K material names with the highest similarity to form a set KNN; a scoring device (300) for scoring each candidate category $C_j$ of the material to be classified according to the categories to which those K material names belong, by the formula

$$p(X, C_j) = \sum_{d_i \in KNN} \mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) \cdot y(d_i, C_j)$$

where $y(d_i, C_j)$ is a category attribute function; and a category determining device (400) for determining, from the scoring result, the category to which the material to be classified belongs.
With the above technical solution, the invention provides, for the characteristics of Chinese material names, a Chinese character string similarity computing method oriented to material classification, namely the dynamic weight method (DynamicWeight). It dynamically estimates the prefix and suffix weights of material name strings, so that material names of the same category have a higher similarity and the accuracy of automatic material classification is improved.
Other features and advantages of the invention are described in detail in the detailed description below.
Brief description of the drawings
The accompanying drawings provide a further understanding of the invention and form part of the specification. Together with the embodiments below they serve to explain the invention, but they do not limit it. In the drawings:
Fig. 1 is a flowchart of the character string similarity computing method based on dynamic prefix and suffix weights provided by the invention;
Fig. 2 is a flowchart of prefix and suffix weight estimation;
Fig. 3 is a flowchart of the material classification provided by the invention;
Fig. 4 compares the precision of Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into 9 major categories;
Fig. 5 compares the precision of Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into 60 major categories, 660 medium categories and 3940 minor categories;
Fig. 6 compares the recall of Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into the 3940 minor categories;
Fig. 7 is a block diagram of the character string similarity computing device based on dynamic prefix and suffix weights provided by the invention; and
Fig. 8 is a block diagram of the material classification device.
Description of reference numerals
10 initial similarity calculation device
20 common prefix and suffix acquisition device
30 weight determining device
40 similarity calculation device
100 character string similarity computing device
200 maximum-similarity set determining device
300 scoring device
400 category determining device
Detailed description of the embodiments
Specific embodiments of the invention are described in detail below with reference to the accompanying drawings. It should be understood that the embodiments described here only illustrate and explain the invention and do not limit it.
Fig. 1 is a flowchart of the character string similarity computing method based on dynamic prefix and suffix weights provided by the invention. As shown in Fig. 1, the method comprises: computing an initial similarity Sim between a character string X and a character string $d_i$, where $d_i$ is a character string in category $C_j$ of a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories C, n is the number of categories, and each category contains a plurality of character strings; obtaining the longest common prefix $\mathrm{Prefix}_{\mathrm{MaxCommon}}$ and the longest common suffix $\mathrm{Suffix}_{\mathrm{MaxCommon}}$ of X and $d_i$; determining the weight $PW(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common prefix and the weight $SW(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common suffix; and computing the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between X and $d_i$ by the formula

$$\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) = \mathrm{Sim} + \theta \cdot PW_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim}) + (1 - \theta) \cdot SW_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim})$$

where θ is a fusion coefficient greater than 0 and less than 1. This coefficient sets the relative influence of the prefix weight and the suffix weight on the similarity. It can generally be set to 0.5, in which case the prefix and suffix weights influence the similarity equally.
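The combination formula can be sketched in a few lines of Python. This is only an illustration of the formula as stated; the function name `sim_dynamic_weight` and its argument names are hypothetical and do not come from the patent.

```python
def sim_dynamic_weight(sim: float, pw: float, sw: float, theta: float = 0.5) -> float:
    """Sim + theta * PW * (1 - Sim) + (1 - theta) * SW * (1 - Sim).

    sim   -- initial similarity between X and d_i
    pw    -- weight of the longest common prefix for the category of d_i
    sw    -- weight of the longest common suffix for the category of d_i
    theta -- fusion coefficient, strictly between 0 and 1
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must be greater than 0 and less than 1")
    gap = 1.0 - sim  # the part of the similarity scale still available
    return sim + theta * pw * gap + (1.0 - theta) * sw * gap


unchanged = sim_dynamic_weight(0.6, 0.0, 0.0)  # zero weights leave Sim as it is
saturated = sim_dynamic_weight(0.6, 1.0, 1.0)  # weights of 1 close the whole gap
```

Because both weights multiply the remaining gap (1 - Sim), the result never drops below the initial similarity and never exceeds 1 as long as the weights lie between 0 and 1.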
The initial similarity may be computed as:

$$\mathrm{Sim} = \frac{1}{3}\left(\frac{m}{|\mathrm{length}(X)|} + \frac{m}{|\mathrm{length}(d_i)|} + \frac{m - t}{m}\right)$$

where m is the number of matching characters between X and $d_i$, length(X) and length($d_i$) are the numbers of characters in X and $d_i$ respectively, and t is the number of transpositions, that is, half the number of matching characters that appear in a different order in the two strings. For example, every character of MARTHA matches a character of MARHTA, but T and H must be swapped to turn MARTHA into MARHTA. T and H are therefore the matching characters in a different order, and t = 2/2 = 1. The similarity given by this formula is the Jaro similarity. The invention is not limited to it; any other formula that yields a comparable similarity may be used instead.
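A minimal Python sketch of the Jaro similarity follows. The text does not say how far apart two equal characters may be and still count as matching, so the sketch assumes the matching window of the standard Jaro definition (half the longer length, minus one); the function name `jaro` is likewise an assumption.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: (m/len(s1) + m/len(s2) + (m - t)/m) / 3."""
    if not s1 or not s2:
        return 0.0
    # Assumed: the usual Jaro matching window, not spelled out in the text.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)   # characters of s2 already matched
    matched1 = []              # matched characters in s1 order
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    matched2 = [s2[j] for j in range(len(s2)) if used[j]]  # same matches in s2 order
    # t is half the number of matched characters that are out of order.
    t = sum(a != b for a, b in zip(matched1, matched2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


martha = jaro("MARTHA", "MARHTA")  # m = 6, t = 1
```

For the MARTHA and MARHTA example, m = 6 and t = 1, so the formula gives (1 + 1 + 5/6)/3, roughly 0.944.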
The weights $PW(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ and $SW(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ may be computed as:

$$PW(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j) = \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}}) = \frac{N(\mathrm{Category} = C_j, \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}})}{N(\mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}})}$$

$$SW(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j) = \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}}) = \frac{N(\mathrm{Category} = C_j, \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}})}{N(\mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}})}$$

where $N(\mathrm{Category} = C_j, \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}})$ is the number of character strings in the set $\{C_1, C_2, \ldots, C_n\}$ that have the prefix $\mathrm{Prefix}_{\mathrm{MaxCommon}}$ and belong to category $C_j$; $N(\mathrm{Category} = C_j, \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}})$ is the number of character strings in the set that have the suffix $\mathrm{Suffix}_{\mathrm{MaxCommon}}$ and belong to category $C_j$; $N(\mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}})$ is the number of character strings in the set that have the prefix $\mathrm{Prefix}_{\mathrm{MaxCommon}}$; and $N(\mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}})$ is the number of character strings in the set that have the suffix $\mathrm{Suffix}_{\mathrm{MaxCommon}}$.
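The two frequency ratios can be illustrated with a short Python sketch that counts directly over a labelled training list. The function names, the English toy names and the choice to return 0 for a prefix or suffix that never occurs are all assumptions made for illustration.

```python
def prefix_weight(prefix, category, training):
    """N(Category = category, Prefix = prefix) / N(Prefix = prefix)."""
    matching = [cat for name, cat in training if name.startswith(prefix)]
    # Assumed: an unseen prefix gets weight 0 (the text does not cover this case).
    return matching.count(category) / len(matching) if matching else 0.0


def suffix_weight(suffix, category, training):
    """N(Category = category, Suffix = suffix) / N(Suffix = suffix)."""
    matching = [cat for name, cat in training if name.endswith(suffix)]
    return matching.count(category) / len(matching) if matching else 0.0


# Hypothetical training set of (name, category) pairs.
training = [
    ("steel window", "hardware"),
    ("steel door", "hardware"),
    ("steel wire", "metal"),
    ("wood door", "timber"),
]
pw = prefix_weight("steel", "hardware", training)  # 2 of the 3 names starting with "steel"
sw = suffix_weight("door", "hardware", training)   # 1 of the 2 names ending with "door"
```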
Preferably, the weights $PW(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ and $SW(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ are smoothed as follows:

$$PW(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j) = \alpha \cdot \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}}) + \frac{1 - \alpha}{n}$$

$$SW(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j) = \beta \cdot \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}}) + \frac{1 - \beta}{n}$$

where α and β are fusion coefficients greater than 0 and less than 1. They set how strongly the frequency with which the prefix or suffix occurs in the category influences the weight. They can generally be set to 0.9, which gives that frequency a large influence on the weight.
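As a small illustration of the smoothing step, the following Python sketch applies the formula to a raw frequency. The function name `smooth` and the sample frequencies are hypothetical.

```python
def smooth(freq: float, n_categories: int, coeff: float = 0.9) -> float:
    """coeff * freq + (1 - coeff) / n, with coeff playing the role of alpha or beta."""
    if not 0.0 < coeff < 1.0:
        raise ValueError("coefficient must be greater than 0 and less than 1")
    return coeff * freq + (1.0 - coeff) / n_categories


# Hypothetical raw frequencies of one suffix over n = 4 categories (they sum to 1).
raw = [0.5, 0.5, 0.0, 0.0]
smoothed = [smooth(f, len(raw)) for f in raw]
```

Smoothing gives every category a small non-zero weight, (1 - coeff)/n, even when the prefix or suffix was never seen in it, and smoothed weights still sum to 1 whenever the raw frequencies do.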
Fig. 2 is a flowchart of prefix and suffix weight estimation. As shown in Fig. 2, for each character string of each category in the set $\{C_1, C_2, \ldots, C_n\}$ (a string may contain both Chinese characters and Western characters, and each Western character is treated like a Chinese character), substrings are taken successively from front to back: the start position of the substring is the first character of the string, and the end position advances one character at a time from the first position until it reaches the end of the string. Each such substring, denoted $\mathrm{Prefix}_i$, is a prefix. Likewise, substrings are taken successively from back to front: the end position of the substring is the end of the string, and the start position moves one character at a time from the end of the string until it reaches the head of the string. Each such substring, denoted $\mathrm{Suffix}_i$, is a suffix. Using the weight formulas above (optionally with smoothing), a weight is then estimated for every possible prefix and suffix of every string of every category in the set; that is, the frequency with which each possible prefix or suffix occurs in each category is computed and may then be smoothed. The resulting weights form a prefix and suffix weight data table. When the weights $PW(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ and $SW(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common prefix and longest common suffix of X and $d_i$ are needed, they can be looked up directly in this table.
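The table-building procedure can be sketched in Python as follows. This is one possible reading of the description, not the patent's implementation: the names `affixes` and `build_weight_tables`, the dictionary layout keyed by (affix, category) and the toy training data are all assumptions. Python strings are indexed by character, so Chinese and Western characters are treated alike, as the text requires.

```python
from collections import Counter, defaultdict


def affixes(name):
    """All prefixes (shortest first) and all suffixes (shortest first) of a name."""
    prefixes = [name[:k] for k in range(1, len(name) + 1)]
    suffixes = [name[-k:] for k in range(1, len(name) + 1)]
    return prefixes, suffixes


def build_weight_tables(training, alpha=0.9, beta=0.9):
    """Return two dicts mapping (affix, category) to a smoothed weight."""
    categories = sorted({cat for _, cat in training})
    n = len(categories)
    prefix_counts = defaultdict(Counter)  # prefix -> category -> number of names
    suffix_counts = defaultdict(Counter)
    for name, cat in training:
        prefixes, suffixes = affixes(name)
        for p in prefixes:
            prefix_counts[p][cat] += 1
        for s in suffixes:
            suffix_counts[s][cat] += 1

    def to_weights(counts, coeff):
        table = {}
        for affix, per_cat in counts.items():
            total = sum(per_cat.values())
            for cat in categories:
                # Smoothed frequency: coeff * Freq + (1 - coeff) / n
                table[(affix, cat)] = coeff * per_cat[cat] / total + (1 - coeff) / n
        return table

    return to_weights(prefix_counts, alpha), to_weights(suffix_counts, beta)


# Hypothetical training data: (name, category) pairs.
training = [("abc", "A"), ("abd", "A"), ("abz", "B"), ("xbc", "B")]
pw_table, sw_table = build_weight_tables(training)
```

Precomputing the table moves the counting cost to training time; at comparison time a weight is a single dictionary lookup such as `pw_table[("ab", "A")]`.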
Fig. 3 is a flowchart of the material classification provided by the invention. As shown in Fig. 3, the invention further provides a material classification method that uses the above character string similarity computing method. The method comprises: using the above similarity computing method to compute the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between the material name X of a material to be classified and the material name $d_i$ of each material in a plurality of material categories; taking the K material names with the highest similarity to form a set KNN (K-Nearest Neighbour); scoring each candidate category $C_j$ of the material to be classified according to the categories to which those K material names belong, by the formula

$$p(X, C_j) = \sum_{d_i \in KNN} \mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) \cdot y(d_i, C_j)$$

where $y(d_i, C_j)$ is a category attribute function; and determining, from the scoring result, the category to which the material to be classified belongs. In this material classification method, the "character string" of the similarity computing method is the material name of a material.
The category attribute function $y(d_i, C_j)$ takes the value 1 if $d_i$ belongs to category $C_j$ and 0 otherwise. The invention is not limited to these values; other values may be used as long as the scores still distinguish the categories. For example, the function may take the value 0.9 if $d_i$ belongs to category $C_j$ and 0 otherwise.
The scoring formula gives the likelihood that the material to be classified belongs to each of the categories of the K most similar material names (that is, to each candidate category $C_j$), and the category of the material can then be determined from this likelihood. Some materials may belong to several candidate categories, so a threshold is set: the scoring results are sorted and the one or more highest-scoring categories are taken, the number of categories being determined by the threshold. In the general case the threshold is 1, and the $C_j$ with the largest $p(X, C_j)$ is taken as the category of the material to be classified.
The beneficial effects of the material classification method of the invention are illustrated below with a specific embodiment. Following the three parts of the technical content, the implementation is also divided into three stages.
Stage 1: Weight estimation
Step 1: Define the material categories. This example uses seven categories common in material classification: common iron, non-ferrous metals and processed materials, architectural hardware, non-metallic materials, textiles and other light industrial goods, timber and timber products, and daily-use electrical appliances.
Step 2: Add a number of material names to each category. For example, the material names added to the "architectural hardware" category include: aluminum alloy casement window, aluminum alloy screen, aluminum alloy window, safety lockset kit, flexible wire rope circuit lock, six-hole lockset, portable lock box, wire rope lockset, butterfly valve lockset, safety shutdown device lock sleeve, switch lockset, lockset hanging plate, and so on. These materials and their categories form the training data set (that is, the set $\{C_1, C_2, \ldots, C_n\}$ above).
Step 3: For each material name in the training data set, list its prefixes and suffixes. The prefixes and suffixes are character-level substrings of the Chinese name, so their English renderings are only approximate. For example, the Chinese name of "aluminum alloy screen" has five characters; its prefix set is {aluminum, aluminum-al, aluminum alloy, aluminum alloy sc, aluminum alloy screen} and its suffix set is {window, screen window, -loy screen window, alloy screen window, aluminum alloy screen}.
Step 4: Compute the frequency with which each prefix and each suffix occurs in each category.
Step 5: Smooth the computed frequencies. Taking the suffixes "window" and "plate" as examples, Table 1 shows the frequency with which each occurs in each category and the weight obtained after smoothing.
Table 1: Frequencies and weights of the suffixes "window" and "plate" in each category
Stage 2: Similarity computation
In this stage, "aluminum alloy sliding window" is the material name of the material to be classified, and "aluminum alloy casement window" is the material name of a material in the training set.
Step 1: Compute the Jaro similarity of "aluminum alloy sliding window" and "aluminum alloy casement window". Both strings have 6 characters (in Chinese), the number of matching characters is 4, and the number of transpositions is 0, so the Jaro formula gives a similarity of 0.778.
Step 2: Compare "aluminum alloy sliding window" with "aluminum alloy casement window". The longest common prefix is "aluminum alloy" and the longest common suffix is "window".
Step 3: Because "aluminum alloy casement window" belongs to the architectural hardware category, look up the corresponding weight estimates in the trained prefix and suffix weight table: PW(aluminum alloy, architectural hardware) = 0.587 and SW(window, architectural hardware) = 0.529.
Step 4: Compute the $\mathrm{Sim}_{\mathrm{DynamicWeight}}$ similarity of "aluminum alloy sliding window" and "aluminum alloy casement window" with the fusion coefficient θ = 0.5. The resulting similarity is 0.902.
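The four steps of this stage can be retraced in Python. The two Chinese strings below are assumed back-translations of the material names (the translated text gives only English renderings), and the helper names are hypothetical; the counts m = 4 and t = 0 and the weights 0.587 and 0.529 are taken from the steps above.

```python
import os


def longest_common_prefix(a: str, b: str) -> str:
    # os.path.commonprefix compares character by character, not path components.
    return os.path.commonprefix([a, b])


def longest_common_suffix(a: str, b: str) -> str:
    # Reverse both strings, take the common prefix, and reverse it back.
    return os.path.commonprefix([a[::-1], b[::-1]])[::-1]


# Assumed Chinese originals of the two six-character names.
x = "铝合金推拉窗"  # material to be classified ("aluminum alloy sliding window")
d = "铝合金平开窗"  # training name ("aluminum alloy casement window")

prefix = longest_common_prefix(x, d)
suffix = longest_common_suffix(x, d)

# Step 1: Jaro similarity with m = 4 matching characters and t = 0 transpositions.
m, t = 4, 0
sim = (m / len(x) + m / len(d) + (m - t) / m) / 3

# Steps 3 and 4: weights from the text, fusion coefficient theta = 0.5.
pw, sw, theta = 0.587, 0.529, 0.5
result = sim + theta * pw * (1 - sim) + (1 - theta) * sw * (1 - sim)
```

Rounded to three decimals, `sim` is 0.778 and `result` is 0.902, in agreement with steps 1 and 4.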
Stage 3: Automatic classification
Step 1: Input the material name of the material to be classified, "aluminum alloy sliding window".
Step 2: Compute the similarity between "aluminum alloy sliding window" and each material name in the training set using the method of Stage 2.
Step 3: Find the k material names most similar to "aluminum alloy sliding window". Here k is set to 5. The five most similar material names, their similarities and their categories are listed in Table 2.
No. | Material name | Similarity | Category
1 | Aluminum alloy sliding window material | 0.973 | Non-ferrous metals and processed materials
2 | Aluminum alloy screen | 0.921 | Architectural hardware
3 | Aluminum alloy casement window | 0.902 | Architectural hardware
4 | Aluminum alloy window | 0.902 | Architectural hardware
5 | Aluminum alloy wire | 0.823 | Non-ferrous metals and processed materials
Table 2: The five material names in the training set most similar to "aluminum alloy sliding window"
Step 4: From the categories of the five most similar materials, compute p(aluminum alloy sliding window, architectural hardware) = 2.725 and p(aluminum alloy sliding window, non-ferrous metals and processed materials) = 1.875.
Step 5: Of the computed scores, p(aluminum alloy sliding window, architectural hardware) is the largest. When the number of output results is 1, "aluminum alloy sliding window" is automatically assigned to the "architectural hardware" category.
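The scoring and selection steps can be sketched in Python using the five rows of Table 2 as input. The function names and the `top_n` parameter (standing in for the threshold on the number of output categories) are hypothetical, and the category attribute function is assumed to be 1 for the neighbour's own category and 0 otherwise.

```python
from collections import defaultdict


def score_categories(neighbors):
    """p(X, C_j): sum of the similarities of the neighbours that belong to C_j."""
    scores = defaultdict(float)
    for similarity, category in neighbors:
        scores[category] += similarity  # y(d_i, C_j) = 1 only for d_i's own category
    return dict(scores)


def classify(candidates, k=5, top_n=1):
    """Keep the k most similar names, score their categories, return the top_n."""
    knn = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    ranked = sorted(score_categories(knn).items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# (similarity, category) pairs taken from Table 2.
table2 = [
    (0.973, "non-ferrous metals and processed materials"),
    (0.921, "architectural hardware"),
    (0.902, "architectural hardware"),
    (0.902, "architectural hardware"),
    (0.823, "non-ferrous metals and processed materials"),
]
best_category, best_score = classify(table2)[0]
```

With these inputs the sketch gives 2.725 for architectural hardware, matching step 4. For the other category, summing the two Table 2 similarities gives 1.796 rather than the 1.875 quoted in step 4; the text does not say where the difference comes from, and the ranking is unaffected.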
The classification performance of the invention is shown in Figs. 4, 5 and 6. To demonstrate the effectiveness of the proposed dynamic-weight string similarity for material classification, it is compared with two other similarity measures, Jaro-Winkler and Monge-Elkan. Material classification is evaluated from two aspects, accuracy and completeness, and the main metrics are precision and recall. Precision is computed as: Precision = number of correct categories in the output / number of categories in the output. Recall is computed as: Recall = number of correct categories in the output / number of categories the material should be assigned to. Precision measures how much of the output is correct; recall measures how much of the correct classification result is covered by the output.
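The two metrics can be written as a short Python function for a single material. The function name and the sample categories are hypothetical.

```python
def precision_recall(output, expected):
    """Precision and recall of the output categories for one material."""
    output, expected = set(output), set(expected)
    correct = len(output & expected)  # correct categories in the output
    precision = correct / len(output) if output else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical case: two categories are output, one of which is correct.
precision, recall = precision_recall(
    output=["architectural hardware", "timber and timber products"],
    expected=["architectural hardware"],
)
```

In this case precision is 0.5 and recall is 1.0, which shows the trade-off behind the threshold: outputting more categories can raise recall while lowering precision.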
Fig. 4 compares the precision of Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into 9 major categories. As Fig. 4 shows, all three string similarity methods are effective for the automatic classification of Chinese material names, and the precision obtained with the DynamicWeight method is higher than that of Jaro-Winkler and Monge-Elkan in every category.
Fig. 5 compares the precision of Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into 60 major categories, 660 medium categories and 3940 minor categories. As Fig. 5 shows, the precision of every method drops markedly as the classification granularity becomes finer, but the DynamicWeight method has the highest precision at all three category levels.
Fig. 6 compares the recall of Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into the 3940 minor categories. As Fig. 6 shows, the recall of automatic classification rises markedly as the number of output results (that is, the threshold) increases, and the curve of the DynamicWeight method stays above the other two throughout, which shows that the method is stable.
The advantage of the invention is that it provides a character string similarity computing method designed for Chinese material names, based on the actual characteristics of Chinese material classification. The method estimates, by training, the weights of material name prefixes and suffixes in each material category. When material names are compared, these weights raise the similarity between materials of the same kind and thereby improve the accuracy of automatic material classification.
Fig. 7 is a block diagram of the character string similarity computing device based on dynamic prefix and suffix weights provided by the invention. Correspondingly, as shown in Fig. 7, the invention also provides a character string similarity computing device comprising: an initial similarity calculation device 10 for computing an initial similarity Sim between a character string X and a character string $d_i$, where $d_i$ is a character string in category $C_j$ of a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories C, n is the number of categories, and each category contains a plurality of character strings; a common prefix and suffix acquisition device 20 for obtaining the longest common prefix $\mathrm{Prefix}_{\mathrm{MaxCommon}}$ and the longest common suffix $\mathrm{Suffix}_{\mathrm{MaxCommon}}$ of X and $d_i$; a weight determining device 30 for determining the weight $PW(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common prefix and the weight $SW(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common suffix; and a similarity calculation device 40 for computing the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between X and $d_i$ by the formula

$$\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) = \mathrm{Sim} + \theta \cdot PW_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim}) + (1 - \theta) \cdot SW_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim})$$

where θ is a fusion coefficient greater than 0 and less than 1.
The initial similarity may be computed according to the following formula:
$$\mathrm{Sim} = \frac{1}{3}\left(\frac{m}{|\mathrm{length}(X)|} + \frac{m}{|\mathrm{length}(d_i)|} + \frac{m - t}{m}\right),$$
where $m$ is the number of matching characters between the string $X$ and the string $d_i$, $\mathrm{length}(X)$ and $\mathrm{length}(d_i)$ are the numbers of characters in the string $X$ and the string $d_i$ respectively, and $t$ is the number of times a character position changes while the string $X$ and the string $d_i$ are matched.
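The formula above has the same shape as the well-known Jaro similarity. The sketch below follows the usual Jaro conventions, which are assumptions here because this passage does not spell them out: two characters match only if they are equal and lie within a window of `max(len(x), len(d)) // 2 - 1` positions of each other, and `t` is half the number of matched characters that appear in a different order. The function name `jaro_similarity` is illustrative.

```python
def jaro_similarity(x: str, d: str) -> float:
    """Initial similarity Sim = 1/3 * (m/|x| + m/|d| + (m - t)/m)."""
    if not x or not d:
        return 0.0
    # Assumed matching window, as in the standard Jaro definition.
    window = max(max(len(x), len(d)) // 2 - 1, 0)
    x_matched = [False] * len(x)
    d_matched = [False] * len(d)
    m = 0
    for i, ch in enumerate(x):
        lo = max(0, i - window)
        hi = min(len(d), i + window + 1)
        for j in range(lo, hi):
            if not d_matched[j] and d[j] == ch:
                x_matched[i] = d_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Compare the matched characters in order; each out-of-order pair
    # contributes one half to t.
    xs = [c for c, flag in zip(x, x_matched) if flag]
    ds = [c for c, flag in zip(d, d_matched) if flag]
    t = sum(a != b for a, b in zip(xs, ds)) / 2
    return (m / len(x) + m / len(d) + (m - t) / m) / 3
```

For the classic pair `"MARTHA"` and `"MARHTA"` all six characters match and one pair is swapped, so $m = 6$, $t = 1$, and the result is $(1 + 1 + 5/6)/3 = 17/18$.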
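The weights $\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ and $\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ may be computed according to the following formulas:
$$\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j) = \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}})$$
$$\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j) = \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}})$$
where $\mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}})$ is the probability that a string whose prefix is $\mathrm{Prefix}_{\mathrm{MaxCommon}}$ appears in category $C_j$, and $\mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}})$ is the probability that a string whose suffix is $\mathrm{Suffix}_{\mathrm{MaxCommon}}$ appears in category $C_j$.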
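One way to estimate these frequencies is by counting over a labeled training set. The sketch below is a hedged illustration: the `(name, category)` list layout, the function names, the sample data, and the choice to return 0 for an empty or unseen prefix or suffix are all assumptions, not details taken from the patent.

```python
from typing import List, Tuple

# Hypothetical training layout: (material name, category) pairs.
Training = List[Tuple[str, str]]


def prefix_weight(prefix: str, category: str, training: Training) -> float:
    """Estimate Freq(Category = category | Prefix = prefix) by counting."""
    if not prefix:
        return 0.0  # assumption: an empty common prefix carries no evidence
    hits = [c for name, c in training if name.startswith(prefix)]
    return hits.count(category) / len(hits) if hits else 0.0


def suffix_weight(suffix: str, category: str, training: Training) -> float:
    """Estimate Freq(Category = category | Suffix = suffix) by counting."""
    if not suffix:
        return 0.0  # assumption: an empty common suffix carries no evidence
    hits = [c for name, c in training if name.endswith(suffix)]
    return hits.count(category) / len(hits) if hits else 0.0


# Made-up sample data for illustration only.
TRAINING: Training = [
    ("无缝钢管", "steel"),     # seamless steel pipe
    ("无缝管", "steel"),       # seamless pipe
    ("无缝内衣", "clothing"),  # seamless underwear
    ("焊接钢管", "steel"),     # welded steel pipe
]
```

In this sample, two of the three names starting with "无缝" are in the `steel` category, so the prefix weight is 2/3, while every name ending in "钢管" is in `steel`, so the suffix weight is 1. This shows how a suffix can be stronger evidence of category than a prefix.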
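Alternatively, the weights $\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ and $\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ may be computed according to the following formulas:
$$\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j) = \alpha \cdot \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}}) + \frac{1 - \alpha}{n}$$
$$\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j) = \beta \cdot \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}}) + \frac{1 - \beta}{n}$$
where $\alpha$ and $\beta$ are merge coefficients greater than 0 and less than 1.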
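The second pair of formulas interpolates the estimated frequency with the uniform value $1/n$, so a prefix or suffix that was never observed in a category still receives a small non-zero weight. A minimal sketch, with `smoothed_weight` as a hypothetical helper name and the coefficient passed in as `coeff` (standing for $\alpha$ or $\beta$):

```python
def smoothed_weight(freq: float, n_categories: int, coeff: float) -> float:
    """Return coeff * freq + (1 - coeff) / n_categories.

    freq is the estimated Freq(Category = C_j | Prefix or Suffix = ...),
    n_categories is n, and coeff plays the role of alpha or beta.
    """
    if not 0.0 < coeff < 1.0:
        raise ValueError("coeff must lie strictly between 0 and 1")
    if n_categories < 1:
        raise ValueError("n_categories must be at least 1")
    return coeff * freq + (1.0 - coeff) / n_categories
```

For example, with $n = 4$ and a coefficient of 0.8, an unseen prefix (frequency 0) gets weight 0.05, and a prefix seen only in $C_j$ (frequency 1) gets weight 0.85.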
The specific details and beneficial effects of this string similarity computing device based on dynamic weights of string prefixes and suffixes are the same as those described above for the corresponding string similarity computing method, and are not repeated here.
Fig. 8 is a block diagram of the material classification device. Correspondingly, as shown in Fig. 8, the invention also provides a material classification device comprising: the above string similarity computing device 100 for computing the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between the material name $X$ of a material to be classified and the material name $d_i$ of each material in a plurality of material categories; a maximum-similarity set determining device 200 for taking the $K$ material names with the highest similarity to form a set $\mathrm{KNN}$; a scoring device 300 for scoring each candidate category $C_j$ of the material to be classified according to the categories to which the $K$ most similar material names belong, using the following scoring formula:
$$p(X, C_j) = \sum_{d_i \in \mathrm{KNN}} \mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) \cdot y(d_i, C_j),$$
where $y(d_i, C_j)$ is a category attribute function; and a category determining device 400 for determining, according to the scoring result, the category to which the material to be classified belongs.
The category determining device 400 may take the $C_j$ with the largest $p(X, C_j)$ as the category to which the material to be classified belongs.
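The scoring and selection performed by devices 200, 300, and 400 amount to a similarity-weighted K-nearest-neighbour vote. The following sketch assumes that the category attribute function $y(d_i, C_j)$ is an indicator equal to 1 when $d_i$ belongs to $C_j$ and 0 otherwise. The function name `classify_knn`, the pluggable `similarity` argument, and the default `k` are illustrative assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def classify_knn(
    x: str,
    training: List[Tuple[str, str]],          # hypothetical (name, category) pairs
    similarity: Callable[[str, str], float],  # e.g. a Sim_DynamicWeight function
    k: int = 3,
) -> Tuple[str, Dict[str, float]]:
    """Return the best-scoring category for x and the score of each candidate."""
    if not training:
        raise ValueError("training set must not be empty")
    # Rank all training names by similarity to x and keep the top k (set KNN).
    ranked = sorted(
        ((similarity(x, name), category) for name, category in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    knn = ranked[:k]
    # p(X, C_j) = sum of similarities of the neighbours that belong to C_j.
    scores: Dict[str, float] = defaultdict(float)
    for sim, category in knn:
        scores[category] += sim
    best = max(scores, key=lambda c: scores[c])
    return best, dict(scores)
```

Any similarity function with the signature `(str, str) -> float` can be passed in, which keeps the voting step independent of how the dynamic-weight similarity is computed.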
The specific details and beneficial effects of this material classification device are the same as those described above for the material classification method, and are not repeated here.
The preferred embodiments of the invention have been described in detail above with reference to the accompanying drawings. However, the invention is not limited to the specific details of those embodiments. Within the scope of the technical concept of the invention, various simple modifications may be made to its technical solutions, and these simple modifications all fall within the scope of protection of the invention.
It should further be noted that the specific technical features described in the above embodiments may be combined in any suitable manner, provided they do not contradict one another. To avoid unnecessary repetition, the various possible combinations are not described separately.
In addition, the various embodiments of the invention may also be combined arbitrarily. As long as such a combination does not depart from the idea of the invention, it should likewise be regarded as content disclosed by the invention.

Claims (10)

1. A material classification method, comprising:
computing the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between the material name $X$ of a material to be classified and the material name $d_i$ of each material in a plurality of material categories;
taking the $K$ material names with the highest similarity to form a set $\mathrm{KNN}$;
scoring each candidate category $C_j$ of the material to be classified according to the categories to which the $K$ most similar material names belong, using the following scoring formula: $p(X, C_j) = \sum_{d_i \in \mathrm{KNN}} \mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) \cdot y(d_i, C_j)$, where $y(d_i, C_j)$ is a category attribute function; and
determining, according to the scoring result, the category to which the material to be classified belongs,
wherein computing the similarity comprises:
computing the initial similarity $\mathrm{Sim}$ between the material name $X$ and the material name $d_i$ of each material in the plurality of material categories, where $d_i$ is a material name in a category $C_j$ of a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories $C$, $n$ is the number of categories $C$, and each category contains a plurality of material names;
obtaining the longest common prefix $\mathrm{Prefix}_{\mathrm{MaxCommon}}$ and the longest common suffix $\mathrm{Suffix}_{\mathrm{MaxCommon}}$ between the material name $X$ and the material name $d_i$;
determining the weight $\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common prefix and the weight $\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common suffix; and
computing the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between the material name $X$ and the material name $d_i$ according to the following formula: $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) = \mathrm{Sim} + \theta \cdot \mathrm{PW}_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim}) + (1 - \theta) \cdot \mathrm{SW}_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim})$, where $\theta$ is a merge coefficient greater than 0 and less than 1.
2. The method according to claim 1, characterized in that the initial similarity $\mathrm{Sim}$ is computed according to the following formula:
$\mathrm{Sim} = \frac{1}{3}\left(\frac{m}{|\mathrm{length}(X)|} + \frac{m}{|\mathrm{length}(d_i)|} + \frac{m - t}{m}\right)$, where $m$ is the number of matching characters between the material name $X$ and the material name $d_i$, $\mathrm{length}(X)$ and $\mathrm{length}(d_i)$ are the numbers of characters in the material name $X$ and the material name $d_i$ respectively, and $t$ is the number of times a character position changes while the material name $X$ and the material name $d_i$ are matched.
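3. The method according to claim 1, characterized in that the weights $\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ and $\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ are computed according to the following formulas:
$\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j) = \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}})$
$\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j) = \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}})$
where $\mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}})$ is the probability that a material name whose prefix is $\mathrm{Prefix}_{\mathrm{MaxCommon}}$ appears in category $C_j$, and $\mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}})$ is the probability that a material name whose suffix is $\mathrm{Suffix}_{\mathrm{MaxCommon}}$ appears in category $C_j$.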
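4. The method according to claim 1, characterized in that the weights $\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ and $\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ are computed according to the following formulas:
$\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j) = \alpha \cdot \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}}) + \frac{1 - \alpha}{n}$
$\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j) = \beta \cdot \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}}) + \frac{1 - \beta}{n}$
where $\alpha$ and $\beta$ are merge coefficients greater than 0 and less than 1.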
5. The method according to claim 1, characterized in that determining the category to which the material to be classified belongs comprises: taking the $C_j$ with the largest $p(X, C_j)$ as the category to which the material to be classified belongs.
6. A material classification device, comprising:
a string similarity computing device (100) for computing the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between the material name $X$ of a material to be classified and the material name $d_i$ of each material in a plurality of material categories;
a maximum-similarity set determining device (200) for taking the $K$ material names with the highest similarity to form a set $\mathrm{KNN}$;
a scoring device (300) for scoring each candidate category $C_j$ of the material to be classified according to the categories to which the $K$ most similar material names belong, using the following scoring formula: $p(X, C_j) = \sum_{d_i \in \mathrm{KNN}} \mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) \cdot y(d_i, C_j)$, where $y(d_i, C_j)$ is a category attribute function; and
a category determining device (400) for determining, according to the scoring result, the category to which the material to be classified belongs,
wherein the string similarity computing device (100) comprises:
an initial similarity computing device (10) for computing the initial similarity $\mathrm{Sim}$ between the material name $X$ and the material name $d_i$, where $d_i$ is a material name in a category $C_j$ of a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories $C$, $n$ is the number of categories $C$, and each category contains a plurality of material names;
a common prefix and suffix acquisition device (20) for obtaining the longest common prefix $\mathrm{Prefix}_{\mathrm{MaxCommon}}$ and the longest common suffix $\mathrm{Suffix}_{\mathrm{MaxCommon}}$ between the material name $X$ and the material name $d_i$;
a weight determining device (30) for determining the weight $\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common prefix and the weight $\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ of the longest common suffix; and
a similarity computing device (40) for computing the similarity $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i)$ between the material name $X$ and the material name $d_i$ according to the following formula: $\mathrm{Sim}_{\mathrm{DynamicWeight}}(X, d_i) = \mathrm{Sim} + \theta \cdot \mathrm{PW}_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim}) + (1 - \theta) \cdot \mathrm{SW}_{\mathrm{MaxCommon}} \cdot (1 - \mathrm{Sim})$, where $\theta$ is a merge coefficient greater than 0 and less than 1.
7. The material classification device according to claim 6, characterized in that the initial similarity $\mathrm{Sim}$ is computed according to the following formula:
$\mathrm{Sim} = \frac{1}{3}\left(\frac{m}{|\mathrm{length}(X)|} + \frac{m}{|\mathrm{length}(d_i)|} + \frac{m - t}{m}\right)$, where $m$ is the number of matching characters between the material name $X$ and the material name $d_i$, $\mathrm{length}(X)$ and $\mathrm{length}(d_i)$ are the numbers of characters in the material name $X$ and the material name $d_i$ respectively, and $t$ is the number of times a character position changes while the material name $X$ and the material name $d_i$ are matched.
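8. The material classification device according to claim 6, characterized in that the weights $\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ and $\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ are computed according to the following formulas:
$\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j) = \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}})$
$\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j) = \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}})$
where $\mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}})$ is the probability that a material name whose prefix is $\mathrm{Prefix}_{\mathrm{MaxCommon}}$ appears in category $C_j$, and $\mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}})$ is the probability that a material name whose suffix is $\mathrm{Suffix}_{\mathrm{MaxCommon}}$ appears in category $C_j$.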
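9. The material classification device according to claim 6, characterized in that the weights $\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j)$ and $\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j)$ are computed according to the following formulas:
$\mathrm{PW}(\mathrm{Prefix}_{\mathrm{MaxCommon}}, C_j) = \alpha \cdot \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Prefix} = \mathrm{Prefix}_{\mathrm{MaxCommon}}) + \frac{1 - \alpha}{n}$
$\mathrm{SW}(\mathrm{Suffix}_{\mathrm{MaxCommon}}, C_j) = \beta \cdot \mathrm{Freq}(\mathrm{Category} = C_j \mid \mathrm{Suffix} = \mathrm{Suffix}_{\mathrm{MaxCommon}}) + \frac{1 - \beta}{n}$
where $\alpha$ and $\beta$ are merge coefficients greater than 0 and less than 1.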
10. The material classification device according to claim 6, characterized in that the category determining device (400) takes the $C_j$ with the largest $p(X, C_j)$ as the category to which the material to be classified belongs.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110262493.2A CN102298632B (en) 2011-09-06 2011-09-06 Character string similarity computing method and device and material classification method and device

Publications (2)

Publication Number Publication Date
CN102298632A CN102298632A (en) 2011-12-28
CN102298632B true CN102298632B (en) 2014-10-29

Family

ID=45359046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110262493.2A Active CN102298632B (en) 2011-09-06 2011-09-06 Character string similarity computing method and device and material classification method and device

Country Status (1)

Country Link
CN (1) CN102298632B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815197B (en) * 2015-11-27 2020-07-31 北京国双科技有限公司 Text similarity determination method and device
CN106919663A (en) * 2017-02-14 2017-07-04 华北电力大学 Character string matching method in the multi-source heterogeneous data fusion of power regulation system
CN107357779B (en) * 2017-06-27 2018-10-02 北京神州泰岳软件股份有限公司 A kind of method and device obtaining organization names
CN109284422B (en) * 2018-08-31 2019-12-27 成都信息工程大学 Construction method of universal character string similarity measurement framework
CN109299112B (en) * 2018-11-15 2020-01-17 北京百度网讯科技有限公司 Method and apparatus for processing data
CN110827931A (en) * 2020-01-13 2020-02-21 四川大学华西医院 Method and device for managing clinical terms and readable storage medium
CN112100381B (en) * 2020-09-22 2022-05-17 福建天晴在线互动科技有限公司 Method and system for quantizing text similarity
CN114548883A (en) * 2022-04-25 2022-05-27 创思(广州)电子科技有限公司 Vegetable wholesale quantity checking system for performing autonomous checking according to wholesale orders

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6137911A (en) * 1997-06-16 2000-10-24 The Dialog Corporation Plc Test classification system and method
CN101976270A (en) * 2010-11-29 2011-02-16 南京师范大学 Uncertain reasoning-based text hierarchy classification method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346394A (en) * 2013-08-02 2015-02-11 中国人民大学 Similarity measurement method based on massive text data
CN104346394B (en) * 2013-08-02 2018-12-21 中国人民大学 A kind of method for measuring similarity based on mass text data

Also Published As

Publication number Publication date
CN102298632A (en) 2011-12-28

Similar Documents

Publication Publication Date Title
CN102298632B (en) Character string similarity computing method and device and material classification method and device
CN105243152B (en) A kind of automaticabstracting based on graph model
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN106202042A (en) A kind of keyword abstraction method based on figure
CN103514255B (en) A kind of collaborative filtering recommending method based on project stratigraphic classification
CN107122352A (en) A kind of method of the extracting keywords based on K MEANS, WORD2VEC
CN107608999A (en) A kind of Question Classification method suitable for automatically request-answering system
CN106124175A (en) A kind of compressor valve method for diagnosing faults based on Bayesian network
CN103778227A (en) Method for screening useful images from retrieved images
CN108595655A (en) A kind of abnormal user detection method of dialogue-based characteristic similarity fuzzy clustering
CN104750844A (en) Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts
CN105975596A (en) Query expansion method and system of search engine
CN108647736A (en) A kind of image classification method based on perception loss and matching attention mechanism
CN104020845B (en) Acceleration transducer placement-unrelated movement recognition method based on shapelet characteristic
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN105868178A (en) Multi-document automatic abstract generation method based on phrase subject modeling
KR101091834B1 (en) Method and apparatus for test question selection and achievement assessment
CN103455609A (en) New kernel function Luke kernel-based patent document similarity detection method
CN106203534A (en) A kind of cost-sensitive Software Defects Predict Methods based on Boosting
CN110188192A (en) A kind of multitask network struction and multiple dimensioned charge law article unified prediction
CN110008323A (en) A kind of the problem of semi-supervised learning combination integrated study, equivalence sentenced method for distinguishing
CN105955975A (en) Knowledge recommendation method for academic literature
CN108052625A (en) A kind of entity sophisticated category method
CN103823880A (en) Attribute weight-based method for calculating similarity between detection mechanisms
CN110232185A (en) Towards financial industry software test knowledge based map semantic similarity calculation method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant