CN102298632B - Character string similarity computing method and device and material classification method and device - Google Patents
- Publication number: CN102298632B
- Application number: CN201110262493.2A
- Authority: CN (China)
- Prior art keywords: goods, maxcommon, materials, prefix, suffix
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a string similarity calculation method and device and a material classification method and device. The similarity calculation method comprises the following steps: calculating an initial similarity between a string X and a string di; obtaining the longest common prefix and the longest common suffix of the string X and the string di; determining the weight of the longest common prefix and the weight of the longest common suffix; and calculating the similarity between the string X and the string di. With this technical solution, the invention provides a Chinese string similarity calculation method oriented to material classification (the dynamic weight method) that targets the characteristics of Chinese material names. The method dynamically estimates the weights of the prefixes and suffixes of material name strings, so that names of materials of the same category obtain a higher similarity, which improves the accuracy of automatic material classification.
Description
Technical field
The present invention relates to the fields of similarity calculation and material classification, and in particular to a string similarity calculation method and device based on dynamic weights of string prefixes and suffixes, and to a material classification method and device.
Background technology
Relatively mature automatic text classification techniques currently include neural networks (Neural Net, NNet), support vector machines (Support Vector Machine, SVM), naive Bayes (Naive Bayes, NB), k-nearest neighbor (k-NN) and linear least squares fit (Linear Least Squares Fit, LLSF). Applying any of these methods to material classification requires solving the problem of calculating the similarity between materials. Unlike the setting of traditional automatic text classification, the names and descriptions of materials in an enterprise are usually brief, so text similarity methods based on word frequency cannot meet the needs of material classification. The similarity between materials therefore has to be calculated by other means, such as string similarity.
For string similarity, mature theories and models have been established for English and are widely used. Researchers in statistics, databases and artificial intelligence have each proposed different similarity calculation methods from the perspective of their own fields. In matching tests on various kinds of names, Jaro-Winkler and Monge-Elkan performed best among these methods and are well suited to matching personal names, place names and organization names. Later researchers found that approximate string matching depends strongly on the language and proposed improved algorithms tailored to particular languages: for example, Piskorski proposed an improved algorithm for Polish, and Arehart et al. studied the similarity of Roman-character strings. Research on Chinese string similarity has also produced many results and practical applications: Li Honglian et al. proposed a similarity algorithm suited to speech recognition; Zhou Faguo et al. proposed a sentence similarity calculation method for online question answering systems; and Zhang Chengzhi proposed a similarity calculation method that integrates multiple layers of features, including literal, semantic and statistical association features.
Although research on text classification and string similarity has produced many results, no research has specifically addressed the classification of enterprise materials in a Chinese-language environment. The names and descriptions of materials in an enterprise have their own characteristics, so a new technique is needed that measures the similarity of material names and descriptions accurately and thereby improves classification accuracy.
Summary of the invention
The object of the invention is to provide a string similarity calculation method and device based on dynamic weights of string prefixes and suffixes, and a material classification method and device. The method and device give material names of the same category a higher similarity and thereby improve the accuracy of automatic material classification.
To achieve this object, the invention provides a string similarity calculation method, comprising: calculating an initial similarity $Sim$ between a string $X$ and a string $d_i$, where $d_i$ is a string belonging to a category $C_j$ of a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories $C$, $n$ is the number of categories, and each category contains a plurality of strings; obtaining the longest common prefix $Prefix_{maxCommon}$ and the longest common suffix $Suffix_{maxCommon}$ of $X$ and $d_i$; determining the weight $PW(Prefix_{maxCommon}, C_j)$ of the longest common prefix and the weight $SW(Suffix_{maxCommon}, C_j)$ of the longest common suffix; and calculating the similarity $Sim_{dynamicWeight}(X, d_i)$ between $X$ and $d_i$ by the following formula:
$$Sim_{dynamicWeight}(X, d_i) = Sim + \theta \cdot PW_{maxCommon} \cdot (1 - Sim) + (1 - \theta) \cdot SW_{maxCommon} \cdot (1 - Sim)$$
where $PW_{maxCommon}$ and $SW_{maxCommon}$ denote the two weights above and $\theta$ is a fusion coefficient greater than 0 and less than 1.
The invention further provides a material classification method, comprising: using the above similarity calculation method to calculate the similarity $Sim_{dynamicWeight}(X, d_i)$ between the material name $X$ of a material to be classified and the material name $d_i$ of each material in a plurality of material categories; taking the $K$ material names with the highest similarity to form a set $KNN$; scoring each candidate category $C_j$ of the material to be classified according to the categories to which these $K$ material names belong, using the following scoring formula:
$$p(X, C_j) = \sum_{d_i \in KNN} Sim_{dynamicWeight}(X, d_i) \cdot y(d_i, C_j)$$
where $y(d_i, C_j)$ is the category attribute function; and determining the category of the material to be classified according to the scores.
Correspondingly, the invention also provides a string similarity calculation device, comprising: an initial similarity calculation device (10) for calculating an initial similarity $Sim$ between a string $X$ and a string $d_i$, where $d_i$ is a string belonging to a category $C_j$ of a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories $C$, $n$ is the number of categories, and each category contains a plurality of strings; a common prefix and suffix acquisition device (20) for obtaining the longest common prefix $Prefix_{maxCommon}$ and the longest common suffix $Suffix_{maxCommon}$ of $X$ and $d_i$; a weight determination device (30) for determining the weight $PW(Prefix_{maxCommon}, C_j)$ of the longest common prefix and the weight $SW(Suffix_{maxCommon}, C_j)$ of the longest common suffix; and a similarity calculation device (40) for calculating the similarity $Sim_{dynamicWeight}(X, d_i)$ between $X$ and $d_i$ by the following formula:
$$Sim_{dynamicWeight}(X, d_i) = Sim + \theta \cdot PW_{maxCommon} \cdot (1 - Sim) + (1 - \theta) \cdot SW_{maxCommon} \cdot (1 - Sim)$$
where $PW_{maxCommon}$ and $SW_{maxCommon}$ denote the two weights above and $\theta$ is a fusion coefficient greater than 0 and less than 1.
Correspondingly, the invention also provides a material classification device, comprising: the above string similarity calculation device (100) for calculating the similarity $Sim_{dynamicWeight}(X, d_i)$ between the material name $X$ of a material to be classified and the material name $d_i$ of each material in a plurality of material categories; a most-similar set determination device (200) for taking the $K$ material names with the highest similarity to form a set $KNN$; a scoring device (300) for scoring each candidate category $C_j$ of the material to be classified according to the categories to which these $K$ material names belong, using the following scoring formula:
$$p(X, C_j) = \sum_{d_i \in KNN} Sim_{dynamicWeight}(X, d_i) \cdot y(d_i, C_j)$$
where $y(d_i, C_j)$ is the category attribute function; and a category determination device (400) for determining the category of the material to be classified according to the scores.
Through the above technical solution, the invention provides, for the characteristics of Chinese material names, a Chinese string similarity calculation method oriented to material classification, namely the dynamic weight method (DynamicWeight). It dynamically estimates the weights of the prefixes and suffixes of material name strings so that material names of the same category obtain a higher similarity, which improves the accuracy of automatic material classification.
Other features and advantages of the invention are described in detail in the embodiments that follow.
Description of the drawings
The accompanying drawings provide a further understanding of the invention and form part of the specification. Together with the embodiments below they serve to explain the invention, but they do not limit it. In the drawings:
Fig. 1 is a flowchart of the string similarity calculation method based on dynamic weights of string prefixes and suffixes provided by the invention;
Fig. 2 is a flowchart of prefix and suffix weight estimation;
Fig. 3 is a flowchart of the material classification provided by the invention;
Fig. 4 compares the accuracy rates obtained with Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into 9 major classes;
Fig. 5 compares the accuracy rates obtained with Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into 60 major classes, 660 middle classes and 3940 small classes;
Fig. 6 compares the recall rates obtained with Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into 3940 small classes;
Fig. 7 is a block diagram of the string similarity calculation device based on dynamic weights of string prefixes and suffixes provided by the invention; and
Fig. 8 is a block diagram of the material classification device.
Description of reference numerals
10 initial similarity calculation device
20 common prefix and suffix acquisition device
30 weight determination device
40 similarity calculation device
100 string similarity calculation device
200 most-similar set determination device
300 scoring device
400 category determination device
Embodiments
Specific embodiments of the invention are described in detail below with reference to the drawings. It should be understood that the embodiments described here serve only to describe and explain the invention and do not limit it.
Fig. 1 is a flowchart of the string similarity calculation method based on dynamic weights of string prefixes and suffixes provided by the invention. As shown in Fig. 1, the invention provides a string similarity calculation method, comprising: calculating an initial similarity $Sim$ between a string $X$ and a string $d_i$, where $d_i$ is a string belonging to a category $C_j$ of a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories $C$, $n$ is the number of categories, and each category contains a plurality of strings; obtaining the longest common prefix $Prefix_{maxCommon}$ and the longest common suffix $Suffix_{maxCommon}$ of $X$ and $d_i$; determining the weight $PW(Prefix_{maxCommon}, C_j)$ of the longest common prefix and the weight $SW(Suffix_{maxCommon}, C_j)$ of the longest common suffix; and calculating the similarity $Sim_{dynamicWeight}(X, d_i)$ between $X$ and $d_i$ by the following formula:
$$Sim_{dynamicWeight}(X, d_i) = Sim + \theta \cdot PW_{maxCommon} \cdot (1 - Sim) + (1 - \theta) \cdot SW_{maxCommon} \cdot (1 - Sim)$$
where $PW_{maxCommon}$ and $SW_{maxCommon}$ denote the two weights above and $\theta$ is a fusion coefficient greater than 0 and less than 1. This coefficient sets how strongly the prefix weight and the suffix weight each influence the similarity. It can generally be set to 0.5, in which case the prefix and suffix weights influence the similarity equally.
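The combination formula above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `dynamic_weight_similarity` and its parameter names are hypothetical, and the input values in the usage line are illustrative rather than taken from the patent.

```python
def dynamic_weight_similarity(sim: float, pw: float, sw: float,
                              theta: float = 0.5) -> float:
    """Combine an initial similarity with a prefix weight and a suffix weight.

    sim   : initial similarity between X and d_i (for example Jaro)
    pw    : weight of the longest common prefix in the category of d_i
    sw    : weight of the longest common suffix in the category of d_i
    theta : fusion coefficient, strictly between 0 and 1
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    # Each weight closes part of the remaining gap (1 - sim) to a perfect score.
    return sim + theta * pw * (1.0 - sim) + (1.0 - theta) * sw * (1.0 - sim)


# Illustrative inputs only.
print(round(dynamic_weight_similarity(0.6, 0.8, 0.4), 3))  # 0.84
```

Because the formula equals Sim + (1 - Sim) * (θ * PW + (1 - θ) * SW), the result never falls below the initial similarity and never exceeds 1 as long as both weights lie in [0, 1].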
The initial similarity may be calculated as:
$$Sim = \frac{1}{3}\left(\frac{m}{|length(X)|} + \frac{m}{|length(d_i)|} + \frac{m - t}{m}\right)$$
where $m$ is the number of matching characters between $X$ and $d_i$, $length(X)$ and $length(d_i)$ are the numbers of characters in $X$ and $d_i$ respectively, and $t$ is the number of transpositions, that is, half the number of matching characters that appear in a different order in the two strings. For example, all characters of MARTHA match those of MARHTA, but T and H must be swapped to turn MARTHA into MARHTA. T and H are therefore the matching characters that appear in a different order, and $t = 2/2 = 1$. The similarity given by this formula is the Jaro similarity. The invention is of course not limited to it; other formulas that yield a comparable similarity may also be used.
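A minimal Python sketch of the Jaro similarity described above follows. The text does not say how far apart two equal characters may be and still count as matching; the sketch assumes the matching window commonly used for Jaro, half the longer length minus one. The function name `jaro_similarity` is hypothetical.

```python
def jaro_similarity(x: str, d: str) -> float:
    """Jaro similarity: (m/len(x) + m/len(d) + (m - t)/m) / 3."""
    if not x or not d:
        return 0.0
    # Assumed matching window (usual Jaro convention, not stated in the text).
    window = max(max(len(x), len(d)) // 2 - 1, 0)
    x_matched = [False] * len(x)
    d_matched = [False] * len(d)
    m = 0
    for i, ch in enumerate(x):
        lo = max(0, i - window)
        hi = min(len(d), i + window + 1)
        for j in range(lo, hi):
            if not d_matched[j] and d[j] == ch:
                x_matched[i] = d_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that appear in a different order.
    xs = [c for c, ok in zip(x, x_matched) if ok]
    ds = [c for c, ok in zip(d, d_matched) if ok]
    t = sum(a != b for a, b in zip(xs, ds)) / 2
    return (m / len(x) + m / len(d) + (m - t) / m) / 3


print(round(jaro_similarity("MARTHA", "MARHTA"), 3))  # 0.944
```

For MARTHA and MARHTA the sketch finds m = 6 and t = 1, in line with the example in the text, so the similarity is (1 + 1 + 5/6) / 3.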
The weights $PW(Prefix_{maxCommon}, C_j)$ and $SW(Suffix_{maxCommon}, C_j)$ may be calculated as:
$$PW(Prefix_{maxCommon}, C_j) = Freq(Category = C_j \mid Prefix = Prefix_{maxCommon}) = \frac{N(Category = C_j, Prefix = Prefix_{maxCommon})}{N(Prefix = Prefix_{maxCommon})}$$
$$SW(Suffix_{maxCommon}, C_j) = Freq(Category = C_j \mid Suffix = Suffix_{maxCommon}) = \frac{N(Category = C_j, Suffix = Suffix_{maxCommon})}{N(Suffix = Suffix_{maxCommon})}$$
where $N(Category = C_j, Prefix = Prefix_{maxCommon})$ is the number of strings in the set $\{C_1, C_2, \ldots, C_n\}$ that have the prefix $Prefix_{maxCommon}$ and belong to category $C_j$; $N(Category = C_j, Suffix = Suffix_{maxCommon})$ is the number of strings in the set that have the suffix $Suffix_{maxCommon}$ and belong to category $C_j$; $N(Prefix = Prefix_{maxCommon})$ is the number of strings in the set that have the prefix $Prefix_{maxCommon}$; and $N(Suffix = Suffix_{maxCommon})$ is the number of strings in the set that have the suffix $Suffix_{maxCommon}$.
Preferably, the weights $PW(Prefix_{maxCommon}, C_j)$ and $SW(Suffix_{maxCommon}, C_j)$ are smoothed as follows:
$$PW(Prefix_{maxCommon}, C_j) = \alpha \cdot Freq(Category = C_j \mid Prefix = Prefix_{maxCommon}) + \frac{1 - \alpha}{n}$$
$$SW(Suffix_{maxCommon}, C_j) = \beta \cdot Freq(Category = C_j \mid Suffix = Suffix_{maxCommon}) + \frac{1 - \beta}{n}$$
where $\alpha$ and $\beta$ are fusion coefficients greater than 0 and less than 1. They set how strongly the probability with which a prefix or a suffix occurs in a category influences the corresponding weight. Both can generally be set to 0.9, which gives that probability a large influence on the weight.
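As a small worked sketch of the smoothing step, the function below mixes an observed frequency with the uniform value 1/n. The name `smoothed_weight`, its parameters and the counts in the usage line are hypothetical and purely illustrative.

```python
def smoothed_weight(n_affix_in_class: int, n_affix: int, n_classes: int,
                    alpha: float = 0.9) -> float:
    """Return alpha * Freq(class | affix) + (1 - alpha) / n_classes.

    n_affix_in_class : strings with this prefix (or suffix) in the class
    n_affix          : strings with this prefix (or suffix) in all classes
    n_classes        : number of classes n
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    freq = n_affix_in_class / n_affix if n_affix else 0.0
    return alpha * freq + (1.0 - alpha) / n_classes


# Illustrative counts: 3 of 5 strings with the affix fall in the class, 7 classes.
print(round(smoothed_weight(3, 5, 7), 3))  # 0.554
```

The second term keeps the weight above zero even for a prefix or suffix that never occurs in a class, so an unseen combination does not cancel the contribution entirely.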
Fig. 2 is a flowchart of prefix and suffix weight estimation. As shown in Fig. 2, substrings are taken successively from front to back for each string of each category in the set $\{C_1, C_2, \ldots, C_n\}$ (a string may contain Chinese and Western characters, and each Western character is treated like a Chinese character). The start position of the substring is the first character of the string, and the end position advances one character at a time from the first position to the end of the string. Each such substring, denoted $Prefix_i$, is a prefix. Likewise, substrings are taken successively from back to front for each string of each category: the end position of the substring is the last character of the string, and the start position moves one character at a time from the end of the string back to its head. Each such substring, denoted $Suffix_i$, is a suffix. Using the weight formulas above (optionally with smoothing), a weight is then estimated for every possible prefix and suffix of every string of every category in the set; that is, the probability with which each possible prefix or suffix occurs in each category is calculated and may afterwards be smoothed. The resulting weights form a prefix and suffix weight table. When the weights $PW(Prefix_{maxCommon}, C_j)$ and $SW(Suffix_{maxCommon}, C_j)$ of the longest common prefix $Prefix_{maxCommon}$ and the longest common suffix $Suffix_{maxCommon}$ of $X$ and $d_i$ are needed, they can be looked up directly in this table.
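The table-building procedure of Fig. 2 can be sketched as follows. This is a sketch under assumptions, not the patent's implementation: the training set is assumed to be a dictionary from category to a list of names, the function names are hypothetical, and the toy data uses Latin letters in place of material names.

```python
from collections import Counter


def prefixes(s: str) -> list:
    """All prefixes of s, from the first character up to the whole string."""
    return [s[:k] for k in range(1, len(s) + 1)]


def suffixes(s: str) -> list:
    """All suffixes of s, from the last character up to the whole string."""
    return [s[-k:] for k in range(1, len(s) + 1)]


def build_weight_tables(training: dict, alpha: float = 0.9, beta: float = 0.9):
    """Return (PW, SW), each a dict keyed by (affix, category)."""
    n = len(training)  # number of categories
    tables = []
    for affixes_of, coef in ((prefixes, alpha), (suffixes, beta)):
        per_class = Counter()  # strings with this affix in this category
        total = Counter()      # strings with this affix in any category
        for category, names in training.items():
            for name in names:
                for affix in affixes_of(name):
                    per_class[affix, category] += 1
                    total[affix] += 1
        # Smoothed weight: coef * frequency + (1 - coef) / n.
        tables.append({
            key: coef * count / total[key[0]] + (1 - coef) / n
            for key, count in per_class.items()
        })
    return tables[0], tables[1]


# Toy training set (hypothetical).
pw, sw = build_weight_tables({"hardware": ["abc", "abd"], "metal": ["abx"]})
print(round(pw["ab", "hardware"], 2))  # 0.65
```

Only combinations that actually occur are stored. For a missing (affix, category) pair, a lookup can fall back to the smoothing floor (1 - α) / n or (1 - β) / n.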
Fig. 3 is a flowchart of the material classification provided by the invention. As shown in Fig. 3, the invention further provides a material classification method that uses the above string similarity calculation method, comprising: using the above similarity calculation method to calculate the similarity $Sim_{dynamicWeight}(X, d_i)$ between the material name $X$ of a material to be classified and the material name $d_i$ of each material in a plurality of material categories; taking the $K$ material names with the highest similarity to form a set $KNN$ (K-Nearest Neighbour); scoring each candidate category $C_j$ of the material to be classified according to the categories to which these $K$ material names belong, using the following scoring formula:
$$p(X, C_j) = \sum_{d_i \in KNN} Sim_{dynamicWeight}(X, d_i) \cdot y(d_i, C_j)$$
where $y(d_i, C_j)$ is the category attribute function; and determining the category of the material to be classified according to the scores. In this material classification method, the "string" referred to in the above similarity calculation method is the material name of a material.
Here $y(d_i, C_j)$ is the category attribute function: if $d_i$ belongs to category $C_j$, the value of the function is 1, and otherwise it is 0. The invention is of course not limited to these values; other values may be used as long as the scores still distinguish the categories. For example, the value of the function may be 0.9 if $d_i$ belongs to category $C_j$ and 0 otherwise.
The scoring formula gives the likelihood that the material to be classified belongs to each of the categories of the $K$ most similar material names (that is, the candidate categories $C_j$), and the category of the material to be classified can then be determined from these likelihoods. Some materials to be classified may fall into several candidate categories, so a threshold is set: the scores are sorted and the one or more highest-scoring categories are output, the number of categories being determined by the threshold. In the usual case the threshold is 1, and the $C_j$ with the largest $p(X, C_j)$ is taken as the category of the material to be classified.
The beneficial effects of the material classification method of the invention are illustrated below with a specific embodiment. Following the three stages of the technical content, the implementation steps are likewise divided into three stages.
Stage 1: weight estimation
Step 1: in the weight estimation stage, the material categories are defined first. This example uses seven categories commonly used in material classification: ordinary steel; non-ferrous metals and processed materials; architectural hardware; non-metallic materials; textiles and other light industrial goods; timber and timber products; and household electrical appliances.
Step 2: a number of material names are then added to each category. For example, the material names added to the "architectural hardware" category include: aluminum alloy casement window, aluminium alloy screen window, aluminium alloy window, safety lockset kit, flexible wire rope circuit lock, six-hole lockset, portable lock box, wire rope lockset, butterfly valve lockset, safety shutdown device lock sleeve, switch lockset, lockset hanging plate, and so on. These materials and their categories serve as the training data set (that is, the set $\{C_1, C_2, \ldots, C_n\}$ above).
Step 3: for each material name in the training data set, list its prefixes and suffixes, taken character by character from the Chinese name. For example, the material name "aluminium alloy screen window" (铝合金纱窗, five characters) has the prefix set {铝, 铝合, 铝合金, 铝合金纱, 铝合金纱窗} and the suffix set {窗, 纱窗, 金纱窗, 合金纱窗, 铝合金纱窗}.
Step 4: count the probability with which each prefix and each suffix occurs in each category.
Step 5: smooth the counted probabilities. Taking the suffixes "window" and "plate" as examples, Table 1 shows the probability with which they occur in each category and the weights obtained after smoothing.
Table 1: Probabilities and weights of the suffixes "window" and "plate" in each category
Stage 2: similarity calculation
In this stage, "aluminum alloy slide window" is the material name of the material to be classified, and "aluminum alloy casement window" is the material name of a material in the training set.
Step 1: calculate the Jaro similarity of "aluminum alloy slide window" and "aluminum alloy casement window". Both strings have a length of 6 characters, the number of matching characters is 4, and the number of transpositions is 0, so the Jaro formula gives a similarity of (4/6 + 4/6 + 4/4) / 3 ≈ 0.778.
Step 2: compare "aluminum alloy slide window" and "aluminum alloy casement window". The longest common prefix is "aluminium alloy" and the longest common suffix is "window".
Step 3: since "aluminum alloy casement window" belongs to the architectural hardware category, look up the corresponding weight estimates in the trained prefix and suffix weight table: PW(aluminium alloy, architectural hardware) = 0.587 and SW(window, architectural hardware) = 0.529.
Step 4: calculate the $Sim_{dynamicWeight}$ similarity of "aluminum alloy slide window" and "aluminum alloy casement window" with the fusion coefficient θ = 0.5. The resulting similarity is 0.778 + 0.5 × 0.587 × (1 − 0.778) + 0.5 × 0.529 × (1 − 0.778) ≈ 0.902.
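Steps 2 to 4 of this stage can be sketched in Python as follows. The two Chinese strings are assumed originals of "aluminum alloy slide window" and "aluminum alloy casement window" (six characters each, consistent with the lengths stated in Step 1); the function names are hypothetical; the initial similarity and the two weights are the values reported in Steps 1 and 3.

```python
def longest_common_prefix(x: str, d: str) -> str:
    """Longest string that both x and d start with."""
    k = 0
    while k < min(len(x), len(d)) and x[k] == d[k]:
        k += 1
    return x[:k]


def longest_common_suffix(x: str, d: str) -> str:
    """Longest string that both x and d end with."""
    k = 0
    while k < min(len(x), len(d)) and x[-1 - k] == d[-1 - k]:
        k += 1
    return x[len(x) - k:]


# Assumed Chinese originals of the two material names.
x = "铝合金推拉窗"  # aluminum alloy slide window
d = "铝合金平开窗"  # aluminum alloy casement window

print(longest_common_prefix(x, d))  # 铝合金 (aluminium alloy)
print(longest_common_suffix(x, d))  # 窗 (window)

# Values reported in the text: Jaro similarity and the two looked-up weights.
sim, pw, sw, theta = 0.778, 0.587, 0.529, 0.5
result = sim + theta * pw * (1 - sim) + (1 - theta) * sw * (1 - sim)
print(round(result, 3))  # 0.902
```

Note that for identical strings the common prefix and the common suffix are both the whole string; the text does not discuss this case, so the sketch does not treat it specially.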
Stage 3: automatic classification
Step 1: input the material name of the material to be classified, "aluminum alloy slide window".
Step 2: calculate the similarity between "aluminum alloy slide window" and each material name in the training set, using the method of Stage 2.
Step 3: find the K material names most similar to "aluminum alloy slide window". Here K is set to 5; the 5 most similar material names, their similarities and their categories are listed in Table 2.
No. | Material name | Similarity | Category |
1 | Aluminum alloy slide window material | 0.973 | Non-ferrous metals and processed materials |
2 | Aluminium alloy screen window | 0.921 | Architectural hardware |
3 | Aluminum alloy casement window | 0.902 | Architectural hardware |
4 | Aluminium alloy window | 0.902 | Architectural hardware |
5 | Aluminium alloy wire | 0.823 | Non-ferrous metals and processed materials |
Table 2: The 5 material names in the training set most similar to "aluminum alloy slide window"
Step 4: from the categories of the 5 most similar materials, calculate p(aluminum alloy slide window, architectural hardware) = 2.725 and p(aluminum alloy slide window, non-ferrous metals and processed materials) = 1.875.
Step 5: according to these results, p(aluminum alloy slide window, architectural hardware) is the largest. When the number of output results is 1, "aluminum alloy slide window" is automatically assigned to the "architectural hardware" category.
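Steps 3 to 5 can be sketched as a similarity-weighted vote over the K nearest neighbours. The sketch assumes that the score of a category is the sum of the similarities of the neighbours that belong to it (category attribute function equal to 1 or 0), which reproduces the score 2.725 reported for architectural hardware. The function names and the shortened category labels are hypothetical.

```python
from collections import defaultdict


def score_classes(neighbours):
    """neighbours: iterable of (similarity, category) for the K nearest names."""
    scores = defaultdict(float)
    for similarity, category in neighbours:
        scores[category] += similarity  # y(d_i, C_j) = 1 for the own category
    return dict(scores)


def top_classes(scores, n_results=1):
    """Return the n_results highest-scoring categories, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n_results]


# Similarities and categories from Table 2.
table_2 = [
    (0.973, "non-ferrous metals"),
    (0.921, "architectural hardware"),
    (0.902, "architectural hardware"),
    (0.902, "architectural hardware"),
    (0.823, "non-ferrous metals"),
]
scores = score_classes(table_2)
print(round(scores["architectural hardware"], 3))  # 2.725
print(top_classes(scores))  # ['architectural hardware']
```

Under this assumption the Table 2 similarities for the non-ferrous category sum to 1.796 rather than the 1.875 reported in Step 4; the ranking, and therefore the assigned category, is the same either way.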
The classification performance of the invention is shown in Fig. 4, Fig. 5 and Fig. 6. To illustrate the effectiveness of the proposed dynamic-weight string similarity for material classification, it is compared with the Jaro-Winkler and Monge-Elkan similarity calculation methods. Material classification is evaluated in terms of both correctness and completeness, and the main indicators are the accuracy rate (Precision) and the recall rate (Recall). The accuracy rate is calculated as: Precision = number of correct categories in the output / number of categories in the output. The recall rate is calculated as: Recall = number of correct categories in the output / number of categories the material should be assigned to. The accuracy rate measures how much of the output is correct; the recall rate measures how much of the correct classification is covered by the output.
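The two indicators can be computed for a single classified material as in the sketch below. The function name `precision_recall` and the example categories are hypothetical; the text does not say how the per-material values are aggregated over a test set, so no aggregation is shown.

```python
def precision_recall(output, expected):
    """Precision and recall of the output categories for one material.

    output   : categories returned by the classifier
    expected : categories the material should be assigned to
    """
    output, expected = set(output), set(expected)
    correct = len(output & expected)
    precision = correct / len(output) if output else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Illustrative case: two categories are output, one of them is correct.
print(precision_recall(
    ["architectural hardware", "non-ferrous metals"],
    ["architectural hardware"],
))  # (0.5, 1.0)
```

The example shows the trade-off discussed for Fig. 6: outputting more categories can only keep or raise recall, while it may lower precision.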
Fig. 4 compares the accuracy rates obtained with Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into 9 major classes. As Fig. 4 shows, all three string similarity methods classify Chinese material names effectively, and the DynamicWeight method achieves a higher classification accuracy than the Jaro-Winkler and Monge-Elkan methods in every category.
Fig. 5 compares the accuracy rates obtained with Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into 60 major classes, 660 middle classes and 3940 small classes. As Fig. 5 shows, the classification accuracy of every method declines markedly as the classification granularity becomes finer, but the DynamicWeight method has the highest accuracy at all three category levels.
Fig. 6 compares the recall rates obtained with Jaro-Winkler, Monge-Elkan and the DynamicWeight method of the invention when classifying into 3940 small classes. As Fig. 6 shows, the recall rate of automatic classification improves markedly as the number of output results (that is, the threshold) increases, and the curve of the DynamicWeight method lies above those of the other two methods throughout, which shows that the method is stable.
The advantage of the invention is that it provides a string similarity calculation method designed for Chinese material names, based on the actual characteristics of Chinese material classification. Through training, the method estimates the weights of the prefixes and suffixes of material names in each material category. When material names are compared, these weights raise the similarity between materials of the same category and thereby improve the accuracy of automatic material classification.
Fig. 7 is a block diagram of the string similarity calculation device based on dynamic weights of string prefixes and suffixes provided by the invention. Correspondingly, as shown in Fig. 7, the invention also provides a string similarity calculation device, comprising: an initial similarity calculation device 10 for calculating an initial similarity $Sim$ between a string $X$ and a string $d_i$, where $d_i$ is a string belonging to a category $C_j$ of a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories $C$, $n$ is the number of categories, and each category contains a plurality of strings; a common prefix and suffix acquisition device 20 for obtaining the longest common prefix $Prefix_{maxCommon}$ and the longest common suffix $Suffix_{maxCommon}$ of $X$ and $d_i$; a weight determination device 30 for determining the weight $PW(Prefix_{maxCommon}, C_j)$ of the longest common prefix and the weight $SW(Suffix_{maxCommon}, C_j)$ of the longest common suffix; and a similarity calculation device 40 for calculating the similarity $Sim_{dynamicWeight}(X, d_i)$ between $X$ and $d_i$ by the following formula:
$$Sim_{dynamicWeight}(X, d_i) = Sim + \theta \cdot PW_{maxCommon} \cdot (1 - Sim) + (1 - \theta) \cdot SW_{maxCommon} \cdot (1 - Sim)$$
where $PW_{maxCommon}$ and $SW_{maxCommon}$ denote the two weights above and $\theta$ is a fusion coefficient greater than 0 and less than 1.
The initial similarity may be calculated as:
$$Sim = \frac{1}{3}\left(\frac{m}{|length(X)|} + \frac{m}{|length(d_i)|} + \frac{m - t}{m}\right)$$
where $m$ is the number of matching characters between $X$ and $d_i$, $length(X)$ and $length(d_i)$ are the numbers of characters in $X$ and $d_i$ respectively, and $t$ is the number of transpositions, that is, half the number of matching characters that appear in a different order in the two strings.
The weights $PW(Prefix_{maxCommon}, C_j)$ and $SW(Suffix_{maxCommon}, C_j)$ may be calculated by the following formulas:
$$PW(Prefix_{maxCommon}, C_j) = Freq(Category = C_j \mid Prefix = Prefix_{maxCommon})$$
$$SW(Suffix_{maxCommon}, C_j) = Freq(Category = C_j \mid Suffix = Suffix_{maxCommon})$$
where $Freq(Category = C_j \mid Prefix = Prefix_{maxCommon})$ denotes the probability that a character string whose prefix is $Prefix_{maxCommon}$ appears in category $C_j$, and $Freq(Category = C_j \mid Suffix = Suffix_{maxCommon})$ denotes the probability that a character string whose suffix is $Suffix_{maxCommon}$ appears in category $C_j$.
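One possible way to estimate these conditional frequencies is by counting over a set of already classified names, sketched below in Python. The corpus layout (a dictionary from category to list of names), the function names, and the choice to return 0 for an empty or unseen prefix or suffix are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Corpus = Dict[str, List[str]]  # hypothetical layout: category -> list of names


def _conditional_frequency(matches: Callable[[str], bool],
                           category: str, corpus: Corpus) -> float:
    """Share of matching names that belong to `category`; 0.0 if none match."""
    total = sum(1 for names in corpus.values() for name in names if matches(name))
    if total == 0:
        return 0.0
    in_category = sum(1 for name in corpus.get(category, []) if matches(name))
    return in_category / total


def prefix_weight(prefix: str, category: str, corpus: Corpus) -> float:
    """Estimate PW = Freq(Category = category | Prefix = prefix)."""
    if not prefix:
        return 0.0  # assumption: an empty common prefix carries no evidence
    return _conditional_frequency(lambda name: name.startswith(prefix),
                                  category, corpus)


def suffix_weight(suffix: str, category: str, corpus: Corpus) -> float:
    """Estimate SW = Freq(Category = category | Suffix = suffix)."""
    if not suffix:
        return 0.0  # assumption: an empty common suffix carries no evidence
    return _conditional_frequency(lambda name: name.endswith(suffix),
                                  category, corpus)


# Illustrative data only.
EXAMPLE_CORPUS: Corpus = {
    "pipes": ["steel pipe", "copper pipe", "steel elbow"],
    "wires": ["copper wire", "steel wire"],
}
```

In the example corpus, three names start with "steel" and two of them are in "pipes", so `prefix_weight("steel", "pipes", EXAMPLE_CORPUS)` is 2/3, whereas both names ending in " pipe" are in "pipes", so the corresponding suffix weight is 1.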
Alternatively, the weights $PW(Prefix_{maxCommon}, C_j)$ and $SW(Suffix_{maxCommon}, C_j)$ may be calculated by the following formulas:
$$PW(Prefix_{maxCommon}, C_j) = \alpha \cdot Freq(Category = C_j \mid Prefix = Prefix_{maxCommon}) + \frac{1 - \alpha}{n}$$
$$SW(Suffix_{maxCommon}, C_j) = \beta \cdot Freq(Category = C_j \mid Suffix = Suffix_{maxCommon}) + \frac{1 - \beta}{n}$$
where $\alpha$ and $\beta$ are merge coefficients greater than 0 and less than 1.
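The effect of the $(1 - \alpha)/n$ and $(1 - \beta)/n$ terms can be seen in a short Python sketch. The function name and the example coefficient are hypothetical; the same function serves for both the prefix weight (with $\alpha$) and the suffix weight (with $\beta$).

```python
def smoothed_weight(freq: float, n_categories: int, coeff: float) -> float:
    """coeff * freq + (1 - coeff) / n, with 0 < coeff < 1 and n >= 1 categories."""
    if not 0.0 < coeff < 1.0:
        raise ValueError("coeff must lie strictly between 0 and 1")
    if n_categories < 1:
        raise ValueError("n_categories must be at least 1")
    return coeff * freq + (1.0 - coeff) / n_categories
```

With $n = 4$ categories and a coefficient of 0.8, a frequency of 1 yields a weight of 0.85 and a frequency of 0 yields 0.05, so a prefix or suffix never observed in a category still receives a small non-zero weight. If the frequencies sum to 1 over all $n$ categories, the smoothed weights also sum to 1.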
The specific details and beneficial effects of this character string similarity computing device based on dynamic prefix and suffix weights are the same as those described above for the character string similarity computing method based on dynamic prefix and suffix weights, and are not repeated here.
Fig. 8 is a block diagram of the material classification device. Correspondingly, as shown in Fig. 8, the present invention also provides a material classification device, comprising: the above character string similarity computing device 100, for calculating the similarity $Sim_{dynamicWeight}(X, d_i)$ between the material name $X$ of a material to be classified and the material name $d_i$ of each material in a plurality of material categories; a maximum-similarity set determining device 200, for taking the K material names with the greatest similarity to form a set KNN; a scoring device 300, for scoring each candidate category $C_j$ of the material to be classified according to the categories to which the K most similar material names belong, the scoring formula being as follows:
where $y(d_i, C_j)$ is the category attribute function; and a category determining device 400, for determining, according to the scoring result, the category to which the material to be classified belongs.
The category determining device 400 may take the $C_j$ with the largest $p(X, C_j)$ as the category to which the material to be classified belongs.
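The classification flow described above (take the K most similar names, score each candidate category, pick the highest score) can be sketched in Python. The scoring formula is not reproduced in this text, so the sketch assumes a similarity-weighted vote, $p(X, C_j) = \sum_{d_i \in KNN} Sim_{dynamicWeight}(X, d_i) \cdot y(d_i, C_j)$, with $y(d_i, C_j) = 1$ when $d_i$ belongs to $C_j$ and 0 otherwise. The function name, its parameters, and the default `k` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def classify(x: str,
             labelled: List[Tuple[str, str]],
             similarity: Callable[[str, str, str], float],
             k: int = 3) -> str:
    """Return the category with the highest score among the k most similar names.

    labelled   -- (name, category) pairs of already classified materials
    similarity -- callable (x, name, category) -> similarity value
    """
    if not labelled:
        raise ValueError("labelled must not be empty")
    scored = [(similarity(x, name, category), category)
              for name, category in labelled]
    # Keep the k names with the greatest similarity (the set KNN).
    knn = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Assumed scoring rule: sum the similarities per category, i.e.
    # y(d_i, C_j) is 1 for the category of d_i and 0 for every other one.
    p: Dict[str, float] = defaultdict(float)
    for sim, category in knn:
        p[category] += sim
    return max(p, key=p.get)
```

The similarity callable receives the category of each labelled name because the prefix and suffix weights depend on the category $C_j$ to which $d_i$ belongs.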
The specific details and beneficial effects of this material classification device are the same as those described above for the material classification method, and are not repeated here.
The preferred embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the specific details of the above embodiments; within the scope of the technical concept of the present invention, various simple modifications may be made to the technical solution of the present invention, and all such simple modifications fall within the protection scope of the present invention.
It should also be noted that the specific technical features described in the above embodiments may, where not contradictory, be combined in any suitable manner. To avoid unnecessary repetition, the present invention does not separately describe the various possible combinations.
In addition, the various embodiments of the present invention may also be combined arbitrarily; provided that such a combination does not depart from the idea of the present invention, it should likewise be regarded as content disclosed by the present invention.
Claims (10)
1. A material classification method, comprising:
calculating the similarity $Sim_{dynamicWeight}(X, d_i)$ between the material name $X$ of a material to be classified and the material name $d_i$ of each material in a plurality of material categories;
taking the K material names with the greatest similarity to form a set KNN;
scoring each candidate category $C_j$ of the material to be classified according to the categories to which the K most similar material names belong, the scoring formula being as follows:
where $y(d_i, C_j)$ is the category attribute function; and
determining, according to the scoring result, the category to which the material to be classified belongs,
wherein calculating the similarity comprises:
calculating an initial similarity $Sim$ between the material name $X$ and the material name $d_i$ of each material in the plurality of material categories, where the material name $d_i$ is a material name in a category $C_j$ belonging to a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories $C$, $n$ is the number of categories $C$, and each category contains a plurality of material names;
obtaining the longest common prefix $Prefix_{maxCommon}$ and the longest common suffix $Suffix_{maxCommon}$ between material name $X$ and material name $d_i$;
determining the weight $PW(Prefix_{maxCommon}, C_j)$ of the longest common prefix $Prefix_{maxCommon}$ and the weight $SW(Suffix_{maxCommon}, C_j)$ of the longest common suffix $Suffix_{maxCommon}$; and
calculating the similarity $Sim_{dynamicWeight}(X, d_i)$ between material name $X$ and material name $d_i$ by the following formula:
$$Sim_{dynamicWeight}(X, d_i) = Sim + \theta \cdot PW_{maxCommon} \cdot (1 - Sim) + (1 - \theta) \cdot SW_{maxCommon} \cdot (1 - Sim),$$
where $\theta$ is a merge coefficient greater than 0 and less than 1.
2. The method according to claim 1, characterized in that the initial similarity $Sim$ is calculated by the following formula:
$$Sim = \frac{1}{3}\left(\frac{m}{|length(X)|} + \frac{m}{|length(d_i)|} + \frac{m - t}{m}\right),$$
where $m$ is the number of characters matched between material name $X$ and material name $d_i$, $length(X)$ and $length(d_i)$ denote the numbers of characters in material name $X$ and material name $d_i$ respectively, and $t$ is the number of character position changes made in the process of matching material name $X$ with material name $d_i$.
3. The method according to claim 1, characterized in that the weights $PW(Prefix_{maxCommon}, C_j)$ and $SW(Suffix_{maxCommon}, C_j)$ are calculated by the following formulas:
$$PW(Prefix_{maxCommon}, C_j) = Freq(Category = C_j \mid Prefix = Prefix_{maxCommon})$$
$$SW(Suffix_{maxCommon}, C_j) = Freq(Category = C_j \mid Suffix = Suffix_{maxCommon})$$
where $Freq(Category = C_j \mid Prefix = Prefix_{maxCommon})$ denotes the probability that a material name whose prefix is $Prefix_{maxCommon}$ appears in category $C_j$, and $Freq(Category = C_j \mid Suffix = Suffix_{maxCommon})$ denotes the probability that a material name whose suffix is $Suffix_{maxCommon}$ appears in category $C_j$.
4. The method according to claim 1, characterized in that the weights $PW(Prefix_{maxCommon}, C_j)$ and $SW(Suffix_{maxCommon}, C_j)$ are calculated by the following formulas:
$$PW(Prefix_{maxCommon}, C_j) = \alpha \cdot Freq(Category = C_j \mid Prefix = Prefix_{maxCommon}) + \frac{1 - \alpha}{n}$$
$$SW(Suffix_{maxCommon}, C_j) = \beta \cdot Freq(Category = C_j \mid Suffix = Suffix_{maxCommon}) + \frac{1 - \beta}{n}$$
where $\alpha$ and $\beta$ are merge coefficients greater than 0 and less than 1.
5. The method according to claim 1, characterized in that determining the category to which the material to be classified belongs comprises: taking the $C_j$ with the largest $p(X, C_j)$ as the category to which the material to be classified belongs.
6. A material classification device, comprising:
a character string similarity computing device (100), for calculating the similarity $Sim_{dynamicWeight}(X, d_i)$ between the material name $X$ of a material to be classified and the material name $d_i$ of each material in a plurality of material categories;
a maximum-similarity set determining device (200), for taking the K material names with the greatest similarity to form a set KNN;
a scoring device (300), for scoring each candidate category $C_j$ of the material to be classified according to the categories to which the K most similar material names belong, the scoring formula being as follows:
where $y(d_i, C_j)$ is the category attribute function; and
a category determining device (400), for determining, according to the scoring result, the category to which the material to be classified belongs,
wherein the character string similarity computing device (100) comprises:
an initial similarity computing device (10), for calculating an initial similarity $Sim$ between material name $X$ and material name $d_i$, where the material name $d_i$ is a material name in a category $C_j$ belonging to a set $\{C_1, C_2, \ldots, C_n\}$, the set contains a plurality of categories $C$, $n$ is the number of categories $C$, and each category contains a plurality of material names;
a common prefix and suffix acquisition device (20), for obtaining the longest common prefix $Prefix_{maxCommon}$ and the longest common suffix $Suffix_{maxCommon}$ between material name $X$ and material name $d_i$;
a weight determining device (30), for determining the weight $PW(Prefix_{maxCommon}, C_j)$ of the longest common prefix $Prefix_{maxCommon}$ and the weight $SW(Suffix_{maxCommon}, C_j)$ of the longest common suffix $Suffix_{maxCommon}$; and
a similarity computing device (40), for calculating the similarity $Sim_{dynamicWeight}(X, d_i)$ between material name $X$ and material name $d_i$ by the following formula:
$$Sim_{dynamicWeight}(X, d_i) = Sim + \theta \cdot PW_{maxCommon} \cdot (1 - Sim) + (1 - \theta) \cdot SW_{maxCommon} \cdot (1 - Sim),$$
where $\theta$ is a merge coefficient greater than 0 and less than 1.
7. The material classification device according to claim 6, characterized in that the initial similarity $Sim$ is calculated by the following formula:
$$Sim = \frac{1}{3}\left(\frac{m}{|length(X)|} + \frac{m}{|length(d_i)|} + \frac{m - t}{m}\right),$$
where $m$ is the number of characters matched between material name $X$ and material name $d_i$, $length(X)$ and $length(d_i)$ denote the numbers of characters in material name $X$ and material name $d_i$ respectively, and $t$ is the number of character position changes made in the process of matching material name $X$ with material name $d_i$.
8. The material classification device according to claim 6, characterized in that the weights $PW(Prefix_{maxCommon}, C_j)$ and $SW(Suffix_{maxCommon}, C_j)$ are calculated by the following formulas:
$$PW(Prefix_{maxCommon}, C_j) = Freq(Category = C_j \mid Prefix = Prefix_{maxCommon})$$
$$SW(Suffix_{maxCommon}, C_j) = Freq(Category = C_j \mid Suffix = Suffix_{maxCommon})$$
where $Freq(Category = C_j \mid Prefix = Prefix_{maxCommon})$ denotes the probability that a material name whose prefix is $Prefix_{maxCommon}$ appears in category $C_j$, and $Freq(Category = C_j \mid Suffix = Suffix_{maxCommon})$ denotes the probability that a material name whose suffix is $Suffix_{maxCommon}$ appears in category $C_j$.
9. The material classification device according to claim 6, characterized in that the weights $PW(Prefix_{maxCommon}, C_j)$ and $SW(Suffix_{maxCommon}, C_j)$ are calculated by the following formulas:
$$PW(Prefix_{maxCommon}, C_j) = \alpha \cdot Freq(Category = C_j \mid Prefix = Prefix_{maxCommon}) + \frac{1 - \alpha}{n}$$
$$SW(Suffix_{maxCommon}, C_j) = \beta \cdot Freq(Category = C_j \mid Suffix = Suffix_{maxCommon}) + \frac{1 - \beta}{n}$$
where $\alpha$ and $\beta$ are merge coefficients greater than 0 and less than 1.
10. The material classification device according to claim 6, characterized in that the category determining device (400) takes the $C_j$ with the largest $p(X, C_j)$ as the category to which the material to be classified belongs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110262493.2A CN102298632B (en) | 2011-09-06 | 2011-09-06 | Character string similarity computing method and device and material classification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102298632A CN102298632A (en) | 2011-12-28 |
CN102298632B true CN102298632B (en) | 2014-10-29 |
Family
ID=45359046
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110262493.2A Active CN102298632B (en) | 2011-09-06 | 2011-09-06 | Character string similarity computing method and device and material classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102298632B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106815197B (en) * | 2015-11-27 | 2020-07-31 | 北京国双科技有限公司 | Text similarity determination method and device |
CN106919663A (en) * | 2017-02-14 | 2017-07-04 | 华北电力大学 | Character string matching method in the multi-source heterogeneous data fusion of power regulation system |
CN107357779B (en) * | 2017-06-27 | 2018-10-02 | 北京神州泰岳软件股份有限公司 | A kind of method and device obtaining organization names |
CN109284422B (en) * | 2018-08-31 | 2019-12-27 | 成都信息工程大学 | Construction method of universal character string similarity measurement framework |
CN109299112B (en) * | 2018-11-15 | 2020-01-17 | 北京百度网讯科技有限公司 | Method and apparatus for processing data |
CN110827931A (en) * | 2020-01-13 | 2020-02-21 | 四川大学华西医院 | Method and device for managing clinical terms and readable storage medium |
CN112100381B (en) * | 2020-09-22 | 2022-05-17 | 福建天晴在线互动科技有限公司 | Method and system for quantizing text similarity |
CN114548883A (en) * | 2022-04-25 | 2022-05-27 | 创思(广州)电子科技有限公司 | Vegetable wholesale quantity checking system for performing autonomous checking according to wholesale orders |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6137911A (en) * | 1997-06-16 | 2000-10-24 | The Dialog Corporation Plc | Test classification system and method |
CN101976270A (en) * | 2010-11-29 | 2011-02-16 | 南京师范大学 | Uncertain reasoning-based text hierarchy classification method and device |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104346394A (en) * | 2013-08-02 | 2015-02-11 | 中国人民大学 | Similarity measurement method based on massive text data |
CN104346394B (en) * | 2013-08-02 | 2018-12-21 | 中国人民大学 | A kind of method for measuring similarity based on mass text data |
Also Published As
Publication number | Publication date |
---|---|
CN102298632A (en) | 2011-12-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |