CN103034627A - Method and device for calculating sentence similarity and method and device for machine translation - Google Patents


Info

Publication number
CN103034627A
Authority
CN
China
Prior art keywords
sentence
word
difference
difference word
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103035225A
Other languages
Chinese (zh)
Other versions
CN103034627B (en)
Inventor
刘占一
吴华
王海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110303522.5A
Publication of CN103034627A
Application granted
Publication of CN103034627B
Current legal status: Active
Anticipated expiration

Abstract

The invention provides a method and device for calculating sentence similarity and a method and device for machine translation. The method for calculating sentence similarity comprises: comparing a first sentence with a second sentence to determine difference word pairs; scoring each difference word using the collocation probabilities between that difference word and the other words in the first or second sentence that contains it, where the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability of two words in that model is derived from their co-occurrence counts in a preset corpus; using the scores of the difference words in each difference word pair to score the pair; and using the scores of the difference word pairs to determine the similarity of the first sentence and the second sentence. The method and device reflect the degree of matching between two sentences more accurately, thereby improving the quality of applications such as machine translation.

Description

Method and device for calculating sentence similarity and method and device for machine translation
[Technical field]
The present invention relates to the field of computer technology, and in particular to a method and device for calculating sentence similarity and a method and device for machine translation.
[Background art]
Sentence similarity calculation is of great practical value in fields such as question retrieval, bilingual example sentence retrieval, machine translation and document summarization. Whether the chosen similarity calculation method accurately reflects how similar two sentences are is key to the quality of these applications.
Take machine translation as an example. Machine translation techniques commonly use preprocessed bilingual example sentences as the main translation resource, and generate the final translation by editing a similar example sentence that matches the sentence to be translated. Specifically, the process includes the following steps:
1) Search the translation example library for a similar example sentence that matches the sentence to be translated.
For example, the sentence to be translated is: This is a pencil.
The similar example sentence is: That is a pen.
2) Identify the difference words between the sentence to be translated and the similar example sentence.
"This" and "That" are difference words, and "pencil" and "pen" are difference words.
3) Take the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments.
That is, the translations of "This" and "pencil" serve as candidate translation fragments.
4) In the translation of the similar example sentence, replace the translations of the difference words in the similar example sentence with the candidate translation fragments to obtain the translation of the sentence to be translated.
Starting from the translation of the similar example sentence "That is a pen", the translation of "That" is replaced with the translation of "This", and the translation of "pen" is replaced with the translation of "pencil", which yields the translation of "This is a pencil".
As this machine translation process shows, how the similar example sentence is selected is a key factor in translation quality.
Existing sentence similarity calculation usually computes the edit distance between sentences. The edit distance is the minimum number of operations needed to transform one sentence into the other, where the operations may include insertion, deletion and replacement. The smaller the edit distance between two sentences, the higher their similarity is judged to be. This approach, however, has certain defects.
For example, suppose the sentence to be translated is: Can I take a picture of the painting?
The similar example sentence selected by edit distance is: Can I take a picture of the car?
The translation formed from this similar example sentence is: "Can I clap a photo for this oil painting?"
If instead the sentence "Can we take a photo of the painting?" is used as the similar example sentence for the sentence to be translated, the translation formed is: "Can I clap a photo for this width of cloth oil painting?"
As can be seen, although the edit distance between "Can we take a photo of the painting?" and the sentence to be translated is greater than the edit distance between "Can I take a picture of the car?" and the sentence to be translated, the former is in fact more similar to the sentence to be translated, and the translation it yields is of higher quality.
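The edit distances in this example can be checked with a short sketch. It assumes a word-level Levenshtein distance with unit cost for insertion, deletion and replacement, and whitespace tokenization; the function name `edit_distance` is illustrative and not taken from the patent.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, wa in enumerate(a, start=1):
        curr = [i]                          # deleting the first i words of a
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # replacement or match
        prev = curr
    return prev[-1]

query = "Can I take a picture of the painting".split()
car = "Can I take a picture of the car".split()
photo = "Can we take a photo of the painting".split()

print(edit_distance(query, car))    # one replacement: painting -> car
print(edit_distance(query, photo))  # two replacements: I -> we, picture -> photo
```

Under these assumptions the "car" sentence is closer by edit distance even though the "photo" sentence is the better example, which is the defect described above.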
This problem arises because the relationship between the difference words of the two sentences is not considered when sentence similarity is calculated. Some have proposed using a synonym dictionary to take the similarity between difference words into account. In many applications, however, including the machine translation application above, the collocation relation between a difference word and its context matters more to the similarity calculation than semantics: it reflects the degree of matching between two sentences more accurately and has a greater effect on the quality of the application.
[Summary of the invention]
The invention provides a method and device for calculating sentence similarity and a method and device for machine translation, so as to reflect the degree of matching between two sentences more accurately and thereby improve the quality of applications such as machine translation.
The specific technical solution is as follows:
A method for calculating sentence similarity, the method comprising:
A. comparing a first sentence with a second sentence to determine difference word pairs;
B. scoring each difference word using the collocation probabilities between the difference word of a difference word pair and the other words in the first sentence or second sentence that contains it, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is derived from the co-occurrence counts of the two words in a preset corpus;
C. determining the score of each difference word pair using the scores of the difference words in the pair;
D. determining the similarity of the first sentence and the second sentence using the scores of the difference word pairs.
Specifically, in step B, each difference word is scored according to the following formula:
$$r(w_i, E) = \frac{\sum_{w_i \in E,\ w_j \in E,\ w_i \neq w_j} r(w_i, w_j)}{m}$$
where $r(w_i, E)$ is the score of difference word $w_i$, $E$ is the first sentence or second sentence containing $w_i$, $w_j$ ranges over the words of $E$ other than $w_i$, $r(w_i, w_j)$ is the collocation probability of $w_i$ and $w_j$, and $m$ is the number of words in $E$.
In step C, each difference word pair is scored according to one of the following formulas:
$$S(w, \tilde{w}) = r(w, E1)^{\alpha_1} \cdot r(\tilde{w}, E2)^{\alpha_2}$$
or
$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E1) + \beta_2 \cdot r(\tilde{w}, E2)$$
where $S(w, \tilde{w})$ is the score of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E1)$ is the score of difference word $w$ in the first sentence E1, $r(\tilde{w}, E2)$ is the score of difference word $\tilde{w}$ in the second sentence E2, and $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ are preset weight parameters.
Further, the method also comprises: determining the feature vectors of the two difference words in a difference word pair, and calculating the similarity distance of the two difference words from their feature vectors;
in which case step C further uses the similarity distance of the two difference words when determining the score of the difference word pair.
The feature vector of a difference word is determined as follows:
the collocation probability model is queried, and the words whose collocation probability with the difference word reaches a preset collocation probability threshold form the feature vector of that difference word.
Specifically, the similarity distance of the two difference words is calculated according to the following formula:
$$dist(w, \tilde{w}) = A - Cosine(F(w), F(\tilde{w}))$$
where $dist(w, \tilde{w})$ is the similarity distance between difference words $w$ and $\tilde{w}$, $A$ is a preset positive number, $F(w)$ is the feature vector of difference word $w$, $F(\tilde{w})$ is the feature vector of difference word $\tilde{w}$, and $Cosine(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.
In this case, in step C, each difference word pair is scored according to one of the following formulas:
$$S(w, \tilde{w}) = r(w, E1)^{\alpha_1} \cdot r(\tilde{w}, E2)^{\alpha_2} \cdot dist(w, \tilde{w})^{\alpha_3}$$
or
$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E1) + \beta_2 \cdot r(\tilde{w}, E2) + \beta_3 \cdot dist(w, \tilde{w})$$
where $S(w, \tilde{w})$ is the score of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E1)$ is the score of difference word $w$ in the first sentence E1, $r(\tilde{w}, E2)$ is the score of difference word $\tilde{w}$ in the second sentence E2, $dist(w, \tilde{w})$ is the similarity distance between $w$ and $\tilde{w}$, and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, $\beta_2$ and $\beta_3$ are preset weight parameters.
A method for machine translation, the method comprising:
S1. calculating, by the above method for calculating sentence similarity, the similarity between the sentence to be translated and the sentences in a preset example sentence library;
S2. selecting the N sentences with the highest similarity as the similar example sentences of the sentence to be translated, N being a preset positive integer;
S3. obtaining the translation of the sentence to be translated using the translations of the similar example sentences.
Step S1 specifically comprises:
S11. determining the sentences in the example sentence library whose edit distance to the sentence to be translated satisfies a preset requirement;
S12. calculating, by the above method for calculating sentence similarity, the similarity between the sentence to be translated and the sentences determined in step S11.
Step S3 specifically comprises:
S31. identifying the difference words between the sentence to be translated and the similar example sentence;
S32. taking the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments;
S33. in the translation of the similar example sentence, replacing the translations of the corresponding difference words in the similar example sentence with the candidate translation fragments to obtain the translation of the sentence to be translated.
Preferably, the method for machine translation also comprises: when displaying the translation of the sentence to be translated, displaying the similar example sentence used and the scores of the difference word pairs between that similar example sentence and the sentence to be translated.
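Steps S11, S12 and S2 amount to a filter-then-rank retrieval, which can be sketched as follows. This is a minimal sketch, not the patented implementation: `select_similar_examples`, `distance_fn`, `similarity_fn`, `max_distance` and `top_n` are hypothetical names, and the "preset requirement" on edit distance is assumed here to be a simple upper bound.

```python
def select_similar_examples(query, examples, distance_fn, similarity_fn,
                            max_distance, top_n):
    """Pre-filter examples by edit distance, then rank by sentence similarity.

    distance_fn(query, example)   -> edit distance (assumed: smaller is closer)
    similarity_fn(query, example) -> similarity score (larger is more similar)
    """
    # S11: keep only examples whose edit distance meets the (assumed) bound.
    candidates = [ex for ex in examples
                  if distance_fn(query, ex) <= max_distance]
    # S12 + S2: score the survivors and keep the top N by similarity.
    ranked = sorted(candidates,
                    key=lambda ex: similarity_fn(query, ex),
                    reverse=True)
    return ranked[:top_n]
```

The pre-filter keeps the more expensive collocation-based similarity from being computed against every sentence in the example sentence library.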
A device for calculating sentence similarity, the device comprising:
a sentence comparison unit, configured to compare a first sentence with a second sentence and determine difference word pairs;
a difference word scoring unit, configured to score each difference word using the collocation probabilities between the difference word of a difference word pair and the other words in the first sentence or second sentence that contains it, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is derived from the co-occurrence counts of the two words in a preset corpus;
a difference word pair scoring unit, configured to determine the score of each difference word pair using the scores of the difference words in the pair;
a similarity determining unit, configured to determine the similarity of the first sentence and the second sentence using the scores of the difference word pairs.
Specifically, the difference word scoring unit scores each difference word according to the following formula:
$$r(w_i, E) = \frac{\sum_{w_i \in E,\ w_j \in E,\ w_i \neq w_j} r(w_i, w_j)}{m}$$
where $r(w_i, E)$ is the score of difference word $w_i$, $E$ is the first sentence or second sentence containing $w_i$, $w_j$ ranges over the words of $E$ other than $w_i$, $r(w_i, w_j)$ is the collocation probability of $w_i$ and $w_j$, and $m$ is the number of words in $E$.
In this case, the difference word pair scoring unit scores each difference word pair according to one of the following formulas:
$$S(w, \tilde{w}) = r(w, E1)^{\alpha_1} \cdot r(\tilde{w}, E2)^{\alpha_2}$$
or
$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E1) + \beta_2 \cdot r(\tilde{w}, E2)$$
where $S(w, \tilde{w})$ is the score of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E1)$ is the score of difference word $w$ in the first sentence E1, $r(\tilde{w}, E2)$ is the score of difference word $\tilde{w}$ in the second sentence E2, and $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ are preset weight parameters.
In another embodiment, the device also comprises: a similarity distance determining unit, configured to determine the feature vectors of the two difference words in a difference word pair and to calculate the similarity distance of the two difference words from their feature vectors;
the difference word pair scoring unit then further uses the similarity distance of the two difference words when determining the score of the difference word pair.
The similarity distance determining unit queries the collocation probability model, and the words whose collocation probability with a difference word reaches a preset collocation probability threshold form the feature vector of that difference word.
The similarity distance determining unit calculates the similarity distance of the two difference words according to the following formula:
$$dist(w, \tilde{w}) = A - Cosine(F(w), F(\tilde{w}))$$
where $dist(w, \tilde{w})$ is the similarity distance between difference words $w$ and $\tilde{w}$, $A$ is a preset positive number, $F(w)$ is the feature vector of difference word $w$, $F(\tilde{w})$ is the feature vector of difference word $\tilde{w}$, and $Cosine(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.
In this case, the difference word pair scoring unit scores each difference word pair according to one of the following formulas:
$$S(w, \tilde{w}) = r(w, E1)^{\alpha_1} \cdot r(\tilde{w}, E2)^{\alpha_2} \cdot dist(w, \tilde{w})^{\alpha_3}$$
or
$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E1) + \beta_2 \cdot r(\tilde{w}, E2) + \beta_3 \cdot dist(w, \tilde{w})$$
where $S(w, \tilde{w})$ is the score of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E1)$ is the score of difference word $w$ in the first sentence E1, $r(\tilde{w}, E2)$ is the score of difference word $\tilde{w}$ in the second sentence E2, $dist(w, \tilde{w})$ is the similarity distance between $w$ and $\tilde{w}$, and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, $\beta_2$ and $\beta_3$ are preset weight parameters.
A device for machine translation, the device comprising:
the above device for calculating sentence similarity, configured to calculate the similarity between the sentence to be translated and the sentences in a preset example sentence library;
a similar example sentence selection unit, configured to select the N sentences with the highest similarity as the similar example sentences of the sentence to be translated, N being a preset positive integer;
a translation forming unit, configured to obtain the translation of the sentence to be translated using the translations of the similar example sentences.
Further, the device for machine translation also comprises: a preliminary selection unit, configured to determine the sentences in the example sentence library whose edit distance to the sentence to be translated satisfies a preset requirement;
the device for calculating sentence similarity then calculates the similarity between the sentence to be translated and the sentences determined by the preliminary selection unit.
The translation forming unit specifically comprises:
a difference word identification subunit, configured to identify the difference words between the sentence to be translated and the similar example sentence;
a fragment construction subunit, configured to take the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments;
a translation forming subunit, configured to replace, in the translation of the similar example sentence, the translations of the corresponding difference words in the similar example sentence with the candidate translation fragments to obtain the translation of the sentence to be translated.
Preferably, the device for machine translation also comprises: a display unit, configured to display, together with the translation of the sentence to be translated, the similar example sentence used and the scores of the difference word pairs between that similar example sentence and the sentence to be translated.
As can be seen from the above technical solutions, the method and device provided by the invention incorporate word-to-word collocation probabilities into the calculation of sentence similarity: difference word pairs are scored on the basis of the collocation probabilities between each difference word and the other words in the sentence that contains it, and the degree of difference between the sentences is then calculated. Compared with the prior art, this reflects the degree of matching between sentences more accurately and thereby improves the quality of applications such as machine translation.
[Description of drawings]
Fig. 1 is a flowchart of the method for calculating sentence similarity provided by embodiment one of the present invention;
Fig. 2 is a flowchart of the method for calculating sentence similarity provided by embodiment two of the present invention;
Fig. 3 is a flowchart of the method for machine translation provided by embodiment three of the present invention;
Fig. 4 is an example of a translation display provided by embodiment three of the present invention;
Fig. 5 is a structural diagram of the device for calculating sentence similarity provided by embodiment four of the present invention;
Fig. 6 is a structural diagram of the machine translation device provided by embodiment five of the present invention.
[Detailed description of the embodiments]
To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.
The similarity calculation method provided by the present invention is described below through embodiment one and embodiment two. Both embodiments calculate the similarity between sentence E1 and sentence E2, which can be chosen according to the specific application. For example, when applied to question retrieval, sentence E1 can be the query entered by the user and sentence E2 an existing question in the question database; when applied to machine translation, sentence E1 can be the sentence to be translated and sentence E2 a sentence in the example sentence library used for translation; and so on.
Embodiment one
Fig. 1 is a flowchart of the method for calculating sentence similarity provided by embodiment one of the present invention. As shown in Fig. 1, the method may comprise the following steps:
Step 101: compare sentence E1 with sentence E2 to determine difference word pairs.
The embodiments of the present invention build on basic text processing of the sentences, such as word segmentation and alignment; since this is prior art, it is not described further here.
The words in sentence E1 and sentence E2 are compared, and the words that differ are determined and form difference word pairs. For example:
Sentence E1 is: Can I take a picture of the painting?
Sentence E2 is: Can we take a photo of the painting?
The difference word pairs are then: the pair formed by "I" and "we", and the pair formed by "picture" and "photo".
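Step 101 can be sketched with the Python standard library. The patent treats segmentation and alignment as prior art and does not prescribe an algorithm, so the use of `difflib.SequenceMatcher` for alignment, the regular-expression tokenizer and the function names below are all assumptions made for illustration. Only equal-length replaced spans are paired; insertions and deletions are ignored in this sketch.

```python
import re
from difflib import SequenceMatcher

def tokenize(sentence):
    """Split an English sentence into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z']+", sentence)

def difference_word_pairs(e1, e2):
    """Return (word_in_e1, word_in_e2) pairs where the aligned sentences differ."""
    t1, t2 = tokenize(e1), tokenize(e2)
    pairs = []
    matcher = SequenceMatcher(a=t1, b=t2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Pair words position by position inside equal-length replaced spans.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(t1[i1:i2], t2[j1:j2]))
    return pairs

pairs = difference_word_pairs("Can I take a picture of the painting?",
                              "Can we take a photo of the painting?")
print(pairs)  # [('I', 'we'), ('picture', 'photo')]
```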
Step 102: score each difference word using the collocation probabilities between the difference word of a difference word pair and the other words in the sentence E1 or E2 that contains it, where the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is derived from the co-occurrence counts of the two words in a preset corpus.
The collocation probability model is built in advance by counting how often pairs of words co-occur in the preset corpus and deriving word-to-word collocation probabilities from those counts. For example, for machine translation, the preset corpus can be the corpus used by the machine translation system. Counting the co-occurrences of "take" and "picture" gives the collocation probability of "take" and "picture", which is stored in the collocation probability model; counting the co-occurrences of "take" and "photo" gives the collocation probability of "take" and "photo", which is likewise stored in the model; and so on. The larger the collocation probability between two words, the stronger the dependence between them.
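A minimal sketch of building such a model follows. The patent does not specify how counts are turned into probabilities, so two assumptions are made here: co-occurrence means appearing in the same sentence, and each pair count is normalized by the total count over all pairs. The names `build_collocation_model` and `collocation_prob` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_collocation_model(corpus):
    """Map each unordered word pair to a probability estimated from counts.

    Assumption: two words co-occur when they appear in the same sentence,
    and the probability is the pair count divided by the total pair count.
    """
    pair_counts = Counter()
    for sentence in corpus:
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):   # (a, b) with a < b
            pair_counts[(a, b)] += 1
    total = sum(pair_counts.values())
    return {pair: n / total for pair, n in pair_counts.items()}

def collocation_prob(model, a, b):
    """Look up r(a, b); pairs never seen in the corpus get probability 0."""
    key = tuple(sorted((a.lower(), b.lower())))
    return model.get(key, 0.0)

corpus = ["take a picture", "take a photo", "take a picture"]
model = build_collocation_model(corpus)
print(collocation_prob(model, "take", "picture"))
print(collocation_prob(model, "take", "photo"))
```

In this toy corpus "take" collocates more strongly with "picture" than with "photo" simply because that pair co-occurs more often.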
A word in a sentence is not an isolated unit: each word has some collocation relation with the other words in the sentence, and this relation reflects both how strongly the word depends on its context and the risk of editing it. To score a difference word, the collocation probabilities between the difference word and each of the other words in its sentence are obtained and then combined into the score of the difference word. For example, the score $r(w_i, E)$ of difference word $w_i$ can be obtained with the following formula:
$$r(w_i, E) = \frac{\sum_{w_i \in E,\ w_j \in E,\ w_i \neq w_j} r(w_i, w_j)}{m} \qquad (1)$$
where $E$ is the sentence containing difference word $w_i$, which can be sentence E1 or sentence E2 above, $w_j$ ranges over the words of $E$ other than $w_i$, $r(w_i, w_j)$ is the collocation probability of $w_i$ and $w_j$, obtained by querying the collocation probability model, and $m$ is the number of words in $E$.
Taking sentence E1 as an example, the collocation probabilities of the difference word "picture" with "Can", "I", "take", "a", "of", "the" and "painting" are obtained, $m$ is 8, and substituting these values into formula (1) gives the score of the difference word "picture".
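Formula (1) can be sketched directly. The collocation probabilities in `TOY_PROBS` are invented for illustration and do not come from the patent or any real corpus; `score_word` and `toy_prob` are hypothetical names.

```python
# Invented collocation probabilities for illustration only.
TOY_PROBS = {
    frozenset(("picture", "take")): 0.4,
    frozenset(("picture", "a")): 0.2,
    frozenset(("picture", "of")): 0.2,
}

def toy_prob(a, b):
    """Stand-in for querying the collocation probability model."""
    return TOY_PROBS.get(frozenset((a.lower(), b.lower())), 0.0)

def score_word(word, sentence_words, prob):
    """Formula (1): sum of r(word, other) over the other words, divided by m."""
    m = len(sentence_words)                   # number of words in the sentence
    total = sum(prob(word, other)
                for other in sentence_words
                if other != word)
    return total / m

e1 = "Can I take a picture of the painting".split()   # m = 8
score = score_word("picture", e1, toy_prob)
print(score)  # (0.4 + 0.2 + 0.2) / 8, about 0.1
```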
Step 103: determine the score of each difference word pair using the scores of the difference words in the pair.
Once the score of each difference word in sentence E1 and sentence E2 has been determined, each difference word pair can be scored by combining the scores of its two difference words, for example according to formula (2) or formula (3):
$$S(w, \tilde{w}) = r(w, E1)^{\alpha_1} \cdot r(\tilde{w}, E2)^{\alpha_2} \qquad (2)$$
$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E1) + \beta_2 \cdot r(\tilde{w}, E2) \qquad (3)$$
where $S(w, \tilde{w})$ is the score of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E1)$ is the score of difference word $w$ in sentence E1, $r(\tilde{w}, E2)$ is the score of difference word $\tilde{w}$ in sentence E2, and $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ are preset weight parameters. These parameters can typically be set to numbers between -1 and 1; $\alpha_1$ and $\alpha_2$ are usually both positive or both negative, and $\beta_1$ and $\beta_2$ are usually both positive or both negative.
For example, to calculate the score of the difference word pair formed by "picture" and "photo", formula (1) is first used to calculate the score of "picture" and the score of "photo"; these are then substituted into formula (2) or (3) to obtain the score of the pair.
Step 104: determine the similarity of sentence E1 and sentence E2 using the scores of the difference word pairs.
In this step, the scores of all difference word pairs between sentence E1 and sentence E2 are combined, for example by summing them, to determine the similarity of sentence E1 and sentence E2. With the scoring described in embodiment one, the higher the combined value of the scores, the more similar the two sentences are in terms of collocation relations, and the higher their degree of matching.
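Steps 103 and 104 can be sketched together. The default weights below are arbitrary illustrative values within the range the text mentions, not values given by the patent, and all function names are hypothetical.

```python
def pair_score_product(r1, r2, alpha1=1.0, alpha2=1.0):
    """Formula (2): weighted product of the two difference word scores.

    Note: a zero score combined with a negative exponent is undefined.
    """
    return (r1 ** alpha1) * (r2 ** alpha2)

def pair_score_sum(r1, r2, beta1=0.5, beta2=0.5):
    """Formula (3): weighted sum of the two difference word scores."""
    return beta1 * r1 + beta2 * r2

def sentence_similarity(pair_scores):
    """Step 104: combine the pair scores, here by summing them."""
    return sum(pair_scores)

# r1 and r2 would come from formula (1); these values are invented.
s_product = pair_score_product(0.25, 0.5)
s_sum = pair_score_sum(0.25, 0.5)
print(s_product, s_sum, sentence_similarity([s_product, s_sum]))
```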
Embodiment two
Fig. 2 is a flowchart of the method for calculating sentence similarity provided by embodiment two of the present invention. As shown in Fig. 2, the method may comprise the following steps:
Step 201 is the same as step 101 in embodiment one.
Step 202 is the same as step 102 in embodiment one.
Step 203: determine the feature vectors of the two difference words in a difference word pair, and use the feature vectors to calculate the similarity distance of the two difference words.
Embodiment two further considers how similar the difference words are in a specific corpus; this similarity is expressed through the distance between the feature vectors of the two difference words in a pair.
The feature vector of a difference word can be formed from the words that have a high collocation probability with it. Specifically, the collocation probability model is queried, and the words whose collocation probability with the difference word reaches a preset collocation probability threshold form the feature vector of that difference word.
Taking the difference word "picture" as an example, suppose querying the collocation probability model shows that the collocation probabilities of "take", "draw", "of", "gallery" and so on with "picture" reach the preset threshold; the words "take", "draw", "of", "gallery" and so on then form the feature vector of "picture". The feature vector of the difference word "photo" can be determined in the same way.
The similarity distance of the two difference words in a pair can be calculated from the cosine of the angle between their feature vectors, for example with the following formula:
$$dist(w, \tilde{w}) = A - Cosine(F(w), F(\tilde{w})) \qquad (4)$$
where $dist(w, \tilde{w})$ is the similarity distance between difference words $w$ and $\tilde{w}$, $A$ is a preset positive number, $F(w)$ is the feature vector of difference word $w$, $F(\tilde{w})$ is the feature vector of difference word $\tilde{w}$, and $Cosine(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.
The cosine can be computed with any of several concrete formulas from the prior art; taking one of them as an example gives formula (5):
[Formula (5): image not reproduced in the text]
Because the collocations and collocation probabilities in the collocation probability model are obtained by statistics over a specific training corpus, this step can effectively describe how similar the two difference words are on that corpus.
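Step 203 can be sketched as follows. The text says only that the words above the threshold form the feature vector, so weighting each component by its collocation probability is an assumption, as is the choice A = 1. The standard cosine formula is used here and is not claimed to be the patent's formula (5). The model contents are invented, and `feature_vector`, `cosine` and `similarity_distance` are hypothetical names.

```python
import math

def feature_vector(word, model, threshold):
    """Collect words whose collocation probability with `word` reaches threshold.

    `model` maps (word_a, word_b) tuples to a collocation probability.
    Assumption: each component is weighted by that probability.
    """
    vec = {}
    for (a, b), p in model.items():
        if p >= threshold:
            if a == word:
                vec[b] = p
            elif b == word:
                vec[a] = p
    return vec

def cosine(f1, f2):
    """Standard cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(v * f2[k] for k, v in f1.items() if k in f2)
    n1 = math.sqrt(sum(v * v for v in f1.values()))
    n2 = math.sqrt(sum(v * v for v in f2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0                      # no shared evidence for an empty vector
    return dot / (n1 * n2)

def similarity_distance(f1, f2, a=1.0):
    """Formula (4): dist = A - Cosine(F(w), F(w~)), with A assumed to be 1."""
    return a - cosine(f1, f2)

# Invented probabilities for illustration only.
toy_model = {("picture", "take"): 0.4, ("draw", "picture"): 0.3,
             ("photo", "take"): 0.35, ("car", "drive"): 0.5,
             ("picture", "the"): 0.01}
f_picture = feature_vector("picture", toy_model, 0.1)
f_photo = feature_vector("photo", toy_model, 0.1)
f_car = feature_vector("car", toy_model, 0.1)
print(similarity_distance(f_picture, f_photo))  # about 0.2: shared context "take"
print(similarity_distance(f_picture, f_car))    # 1.0: no shared context words
```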
Step 204: the marking result of each difference word of utilization variance word centering and the similarity distance of two difference words, determine the marking that the difference word is right.
Embodiment Two differs from Embodiment One in that the similarity distance of the difference words is additionally considered when scoring a difference word pair; that is, both the similarity distance between the two difference words of the pair and the editing risk are considered. For example, the following formula (6) or (7) can be used to score a difference word pair:
$S(w, \tilde{w}) = r(w, E1)^{\alpha_1} * r(\tilde{w}, E2)^{\alpha_2} * dist(w, \tilde{w})^{\alpha_3}$    (6)
$S(w, \tilde{w}) = \beta_1 * r(w, E1) + \beta_2 * r(\tilde{w}, E2) + \beta_3 * dist(w, \tilde{w})$    (7)
where $S(w, \tilde{w})$ is the scoring result of the difference word pair formed by the difference words $w$ and $\tilde{w}$, $r(w, E1)$ is the scoring result of the difference word $w$ in sentence E1, $r(\tilde{w}, E2)$ is the scoring result of the difference word $\tilde{w}$ in sentence E2, $dist(w, \tilde{w})$ is the similarity distance between $w$ and $\tilde{w}$, and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, $\beta_2$ and $\beta_3$ are preset weight parameters. These parameters can usually be set to numbers between -1 and 1; $\alpha_1$ and $\alpha_2$ are usually chosen to be both positive or both negative, and $\beta_1$ and $\beta_2$ are likewise usually chosen to be both positive or both negative.
For example, when calculating the scoring result of the difference word pair formed by "picture" and "photo", formula (1) is first used to calculate the scoring result of "picture" and the scoring result of "photo"; substituting these into formula (2) or (3) gives the scoring result of the difference word pair formed by "picture" and "photo". The similarity distance between "picture" and "photo" is calculated using formula (5), and formula (6) or (7) is then used to obtain the scoring result of the difference word pair formed by "picture" and "photo".
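The two pair-scoring variants of this embodiment might be sketched in Python as below. The sketch assumes that the weights in the product form act as exponents, as in formula (6), and the default weight values and the numbers in the usage lines are placeholders invented for illustration, not values given in the text.

```python
def pair_score_product(r1, r2, dist, a1=1.0, a2=1.0, a3=1.0):
    """Product form, formula (6): r(w,E1)^a1 * r(w~,E2)^a2 * dist(w,w~)^a3."""
    return (r1 ** a1) * (r2 ** a2) * (dist ** a3)

def pair_score_linear(r1, r2, dist, b1=0.5, b2=0.5, b3=0.5):
    """Linear form, formula (7): b1*r(w,E1) + b2*r(w~,E2) + b3*dist(w,w~)."""
    return b1 * r1 + b2 * r2 + b3 * dist

# Hypothetical word scores and similarity distance for "picture" / "photo".
r_picture, r_photo, d = 0.4, 0.2, 0.5
print(pair_score_product(r_picture, r_photo, d))
print(pair_score_linear(r_picture, r_photo, d, b3=-0.5))
```

In the product form, a negative exponent applied to a zero score or distance raises `ZeroDivisionError` in Python, so an implementation along these lines would likely need smoothing or a floor on its inputs.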
Step 205 is the same as step 104 in Embodiment One.
The above two embodiments are described using English sentences as examples, but are not limited to English sentences; they are equally applicable to calculating sentence similarity in other languages, such as Chinese.
The sentence similarity calculated by the above two embodiments can be used in fields such as question retrieval, bilingual example sentence retrieval, machine translation and document summarization. Its use in machine translation is described below through Embodiment Three.
Embodiment three,
Fig. 3 is a flowchart of the machine translation method provided by Embodiment Three of the present invention. As shown in Fig. 3, the method may comprise the following steps:
Step 301: calculate the similarity between the sentence to be translated and the sentences in a preset example sentence library.
In this step, the method described in Embodiment One or Embodiment Two can be used to calculate the similarity between the sentence to be translated and the sentences in the example sentence library, in preparation for the subsequent selection of similar example sentences.
Because the number of example sentences in the library is very large, calculating the similarity between every sentence in the library and the sentence to be translated one by one in the manner of Embodiment One or Embodiment Two would be inefficient. To improve efficiency, the edit distance between each example sentence in the library and the sentence to be translated can be calculated first, the sentences in the library whose edit distance to the sentence to be translated satisfies a preset requirement are determined, and the similarity is then calculated only between each of the determined sentences and the sentence to be translated. For example, sentences whose edit distance is less than a preset threshold can be selected, or the M sentences ranked first by edit distance can be selected, where M is a preset positive integer.
The edit distance is determined by the minimum number of operations needed to transform one sentence into another; the operations may include insertion, deletion, replacement, etc. Because the calculation of edit distance is prior art, it is not repeated here.
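The pre-selection by edit distance can be sketched in Python as follows. This is a minimal sketch assuming that the edit distance is computed over whitespace-separated words (the text does not say whether words or characters are the unit) and that "ranked first" means the smallest distances; `prefilter` is a hypothetical helper name.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here, lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # replacement, or match at no cost
        prev = cur
    return prev[-1]

def prefilter(query, examples, m):
    """Keep the m example sentences with the smallest edit distance to the query."""
    q = query.split()
    return sorted(examples, key=lambda s: edit_distance(q, s.split()))[:m]

library = [
    "Can we take a photo of the painting",
    "Can I take a picture of the car",
    "Where is the nearest station",
]
candidates = prefilter("Can I take a picture of the painting", library, m=2)
print(candidates)
```

Only the sentences in `candidates` would then be passed to the more expensive collocation-based similarity calculation.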
Step 302: select the N sentences with the highest similarity as similar example sentences of the sentence to be translated, where N is a preset positive integer.
Similar example sentences selected by the similarity calculation of Embodiment One or Embodiment Two take into account the collocation relationship between the difference words and the other words in the sentence, that is, the editing risk of the difference words; Embodiment Two further takes into account the similarity distance between the difference words. Sentences with a higher degree of matching are thus selected as similar example sentences for generating the translation, which improves translation quality.
In a preferred embodiment, the single sentence with the highest similarity can be selected as the similar example sentence.
For the sentence to be translated "Can I take a picture of the painting", compare the example sentences "Can we take a photo of the painting" and "Can I take a picture of the car". For the difference word pairs "I" and "we", and "picture" and "photo", the similarity distance of the difference words and their collocation probabilities with the other words in the sentence are both relatively large, whereas for the difference word pair "painting" and "car" the similarity distance and the collocation probabilities with the other words in the sentence are relatively small. Therefore, compared with "Can I take a picture of the car", the sentence "Can we take a photo of the painting" has a higher similarity to the sentence to be translated, and "Can we take a photo of the painting" can be chosen as the similar example sentence.
Step 303: use the translation of the similar example sentence to obtain the translation of the sentence to be translated.
After the similar example sentence is determined, the translation of the sentence to be translated can be generated according to the following steps:
Identify the difference words between the sentence to be translated and the similar example sentence; take the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments; in the translation of the similar example sentence, replace the translations of the corresponding difference words of the similar example sentence with the candidate translation fragments to obtain the translation of the sentence to be translated. This part is the same as the prior art and is not described further.
For example, the difference words identified between the similar example sentence "Can we take a photo of the painting" and the sentence to be translated "Can I take a picture of the painting" are "we" and "I", and "photo" and "picture". The translation of "I" ("I") and the translation of "picture" ("photograph") serve as candidate translation fragments. The translation of the similar example sentence is "we can take a photo of this oil painting"; replacing the translations of the difference words in it with the candidate translation fragments yields the translation of the sentence to be translated, "I can take a photograph of this oil painting".
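The replacement step can be sketched in Python as below. This is a deliberately simplified sketch: it assumes a one-to-one word lexicon (the hypothetical `lexicon` dictionary) and whitespace-separated target tokens, and it uses placeholder target tokens such as `T(we)` instead of real target-language words. A real system would rely on word alignment rather than on dictionary lookup.

```python
def translate_by_replacement(example_translation, difference_pairs, lexicon):
    """Replace translations of the example's difference words with candidate fragments.

    difference_pairs: (word in similar example sentence, word in sentence to translate).
    lexicon: hypothetical mapping from a source word to its target-language translation.
    """
    tokens = example_translation.split()
    for example_word, source_word in difference_pairs:
        old = lexicon[example_word]   # translation currently in the example translation
        new = lexicon[source_word]    # candidate translation fragment
        tokens = [new if t == old else t for t in tokens]
    return " ".join(tokens)

# Placeholder target-language tokens; T(x) stands for "the translation of x".
lexicon = {"we": "T(we)", "I": "T(I)", "photo": "T(photo)", "picture": "T(picture)"}
example_translation = "T(we) T(can) T(take) T(photo) T(painting)"
pairs = [("we", "I"), ("photo", "picture")]
result = translate_by_replacement(example_translation, pairs, lexicon)
print(result)
```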
When the translation of the sentence to be translated is displayed, the similar example sentence can also be displayed, and the scoring results of the difference word pairs between the similar example sentence and the sentence to be translated can further be displayed. When displaying the scoring result of a difference word pair, a preset correspondence between scoring results and confidence levels can be used; for example, confidence is divided into three levels (high, medium and low) according to the scoring result. The confidence level corresponding to the scoring result of the difference word pair is then determined and displayed.
As shown in Fig. 4, the sentence to be translated, the similar example sentence, the translation of the similar example sentence and the translation of the sentence to be translated are displayed; the difference words of the similar example sentence and the sentence to be translated can be highlighted, and the candidate translation fragments are highlighted as well. The confidence of the difference word pairs is shown on the right side at the same time. The highlighting manner is not limited to that shown in Fig. 4.
The above is a detailed description of the method provided by the present invention; the device for calculating sentence similarity provided by the present invention is described below through Embodiment Four.
Embodiment four,
Fig. 5 is a structural diagram of the device for calculating sentence similarity provided by Embodiment Four of the present invention. As shown in Fig. 5, the device may comprise: a sentence comparison unit 501, a difference word scoring unit 502, a difference word pair scoring unit 503 and a similarity determining unit 504.
The sentence comparison unit 501 compares sentence E1 with sentence E2 and determines the difference word pairs.
In effect, the words in sentence E1 and sentence E2 are compared, and the differing words are determined and form difference word pairs.
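One way the comparison could be sketched in Python is with the standard library's `difflib.SequenceMatcher`, treating each aligned replacement as a difference word pair. The alignment strategy is an assumption of this sketch (the text does not specify how the two sentences are aligned), and `difference_word_pairs` is a hypothetical helper name.

```python
import difflib

def difference_word_pairs(sentence1, sentence2):
    """Return (word in sentence1, word in sentence2) pairs where the sentences differ."""
    words1, words2 = sentence1.split(), sentence2.split()
    matcher = difflib.SequenceMatcher(None, words1, words2, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # Pair replaced words position by position; if the two replaced
            # spans differ in length, zip drops the unmatched extra words.
            pairs.extend(zip(words1[i1:i2], words2[j1:j2]))
    return pairs

pairs = difference_word_pairs("Can I take a picture of the painting",
                              "Can we take a photo of the painting")
print(pairs)
```

Pure insertions and deletions are ignored here, since they have no counterpart word with which to form a pair.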
The difference word scoring unit 502 scores each difference word by using the collocation probabilities between the difference word of a difference word pair and the other words in the sentence E1 or sentence E2 in which it is located, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics of the co-occurrence count of the two words in a preset corpus.
The collocation probability model is formed in advance: the co-occurrence counts between words in the preset corpus are collected, and the collocation probabilities between words obtained from them form the collocation probability model. The larger the collocation probability between words, the more strongly it reflects a word's dependence on its context in the sentence and hence its editing risk. When scoring each difference word, the collocation probabilities between the difference word and each of the other words in its sentence can be obtained and then integrated to give the scoring result of the difference word.
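A collocation probability model built from co-occurrence counts might be sketched in Python as follows. The text states only that the probabilities are derived from co-occurrence counts in a preset corpus; counting co-occurrence at sentence level and normalizing by the number of sentences containing the first word are assumptions of this sketch, and the three-sentence corpus is invented for illustration.

```python
from collections import Counter
from itertools import combinations

def build_collocation_model(corpus):
    """corpus: iterable of tokenized sentences. Returns {(w1, w2): probability}."""
    pair_counts = Counter()
    word_counts = Counter()
    for sentence in corpus:
        words = set(sentence)          # count each word once per sentence
        word_counts.update(words)
        for w1, w2 in combinations(sorted(words), 2):
            pair_counts[(w1, w2)] += 1
            pair_counts[(w2, w1)] += 1
    # Assumed normalization: share of the sentences containing w1 that also contain w2.
    return {(w1, w2): c / word_counts[w1] for (w1, w2), c in pair_counts.items()}

toy_corpus = [
    ["take", "a", "picture"],
    ["take", "a", "photo"],
    ["draw", "a", "picture"],
]
model = build_collocation_model(toy_corpus)
print(model[("picture", "take")])
```

Under this normalization the model is asymmetric: the entry for ("a", "take") generally differs from the entry for ("take", "a").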
For example, the difference word scoring unit 502 can score each difference word according to the following formula:
[Formula: image not reproduced in the text]
where $r(w_i, E)$ is the scoring result of the difference word $w_i$; $E$ is the sentence containing $w_i$, which may be sentence E1 or sentence E2; $w_j$ denotes the other words in $E$ besides $w_i$; $r(w_i, w_j)$ is the collocation probability of $w_i$ and $w_j$; and $m$ is the number of words in $E$.
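Since the scoring formula itself is not legible in the text, the following Python sketch shows only one plausible way of integrating the collocation probabilities: averaging r(w_i, w_j) over the other words of the sentence. The averaging is an assumption of this sketch and may differ from the actual formula; the probability values are invented for illustration.

```python
def score_difference_word(word, sentence, colloc_prob):
    """Average collocation probability of `word` with the other words in `sentence`.

    Assumption: the collocation probabilities are integrated by a simple mean;
    word pairs missing from the model contribute 0.
    """
    others = [w for w in sentence if w != word]
    if not others:
        return 0.0
    total = sum(colloc_prob.get((word, w), 0.0) for w in others)
    return total / len(others)

# Hypothetical collocation probabilities.
colloc_prob = {("picture", "take"): 0.6, ("picture", "a"): 0.2}
score = score_difference_word("picture", ["take", "a", "picture"], colloc_prob)
print(score)
```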
The difference word pair scoring unit 503 uses the scoring results of the difference words in a difference word pair to determine the score of the difference word pair.
The similarity determining unit 504 uses the scoring results of the difference word pairs to determine the similarity between sentence E1 and sentence E2. That is, the scoring results of all difference word pairs between sentence E1 and sentence E2 are integrated, for example by summing the scoring results of all the difference word pairs, to determine the similarity between sentence E1 and sentence E2.
The difference word pair scoring unit 503 can score difference word pairs in two ways, corresponding to method Embodiment One and Embodiment Two respectively, as follows:
First way: the difference word pair scoring unit 503 can score a difference word pair according to the following formula:
$S(w, \tilde{w}) = r(w, E1)^{\alpha_1} * r(\tilde{w}, E2)^{\alpha_2}$; or, $S(w, \tilde{w}) = \beta_1 * r(w, E1) + \beta_2 * r(\tilde{w}, E2)$.
where $S(w, \tilde{w})$ is the scoring result of the difference word pair formed by the difference words $w$ and $\tilde{w}$, $r(w, E1)$ is the scoring result of the difference word $w$ in sentence E1, $r(\tilde{w}, E2)$ is the scoring result of the difference word $\tilde{w}$ in sentence E2, and $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ are preset weight parameters.
Second way: as shown in Fig. 5, the device further comprises a similarity distance determining unit 505, configured to determine the feature vectors of the two difference words in a difference word pair and to use those feature vectors to calculate the similarity distance between the two difference words.
In this case, the difference word pair scoring unit 503 further uses the similarity distance between the two difference words of the pair when determining the score of the difference word pair.
Specifically, in the second way, the similarity distance determining unit 505 can query the collocation probability model and form the feature vector of a difference word from the words whose collocation probability with that difference word reaches a preset collocation probability threshold.
When calculating the similarity distance between the two difference words, the similarity distance determining unit 505 can use the following formula:
$dist(w, \tilde{w}) = A - Cosine(F(w), F(\tilde{w}))$
where $dist(w, \tilde{w})$ is the similarity distance between the difference words $w$ and $\tilde{w}$, $A$ is a preset positive number, $F(w)$ is the feature vector of the difference word $w$, $F(\tilde{w})$ is the feature vector of the difference word $\tilde{w}$, and $Cosine(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.
The cosine of the angle can be computed with any of several concrete formulas from the prior art; taking one of them as an example gives formula (5) in Embodiment Two.
In the second way, the difference word pair scoring unit 503 can score a difference word pair according to the following formula:
$S(w, \tilde{w}) = r(w, E1)^{\alpha_1} * r(\tilde{w}, E2)^{\alpha_2} * dist(w, \tilde{w})^{\alpha_3}$; or,
$S(w, \tilde{w}) = \beta_1 * r(w, E1) + \beta_2 * r(\tilde{w}, E2) + \beta_3 * dist(w, \tilde{w})$.
where $S(w, \tilde{w})$ is the scoring result of the difference word pair formed by the difference words $w$ and $\tilde{w}$, $r(w, E1)$ is the scoring result of the difference word $w$ in sentence E1, $r(\tilde{w}, E2)$ is the scoring result of the difference word $\tilde{w}$ in sentence E2, $dist(w, \tilde{w})$ is the similarity distance between $w$ and $\tilde{w}$, and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, $\beta_2$ and $\beta_3$ are preset weight parameters. These parameters can usually be set to numbers between -1 and 1; $\alpha_1$ and $\alpha_2$ are usually chosen to be both positive or both negative, and $\beta_1$ and $\beta_2$ are likewise usually chosen to be both positive or both negative.
Embodiment five,
Fig. 6 is a structural diagram of the machine translation device provided by Embodiment Five of the present invention. As shown in Fig. 6, the device may comprise: a device 600 for calculating sentence similarity, a similar example sentence selecting unit 610 and a translation forming unit 620.
The device 600 for calculating sentence similarity calculates the similarity between the sentence to be translated and the sentences in a preset example sentence library; its structure may be as shown in Fig. 5.
The similar example sentence selecting unit 610 selects the N sentences with the highest similarity as similar example sentences of the sentence to be translated, where N is a preset positive integer. In a preferred embodiment, N is usually 1.
The translation forming unit 620 uses the translation of the similar example sentence to obtain the translation of the sentence to be translated.
Because the number of example sentences in the library is very large, it would be inefficient for the device 600 for calculating sentence similarity to calculate the similarity with the sentence to be translated for every sentence in the library one by one. To improve efficiency, the machine translation device may further comprise an initial selection unit 630, configured to determine the sentences in the example sentence library whose edit distance to the sentence to be translated satisfies a preset requirement. For example, sentences whose edit distance is less than a preset threshold can be selected, or the M sentences ranked first by edit distance can be selected, where M is a preset positive integer.
The edit distance is determined by the minimum number of operations needed to transform one sentence into another; the operations may include insertion, deletion, replacement, etc. Because the calculation of edit distance is prior art, it is not repeated here.
Correspondingly, the device 600 for calculating sentence similarity only needs to calculate the similarity between the sentence to be translated and the sentences determined by the initial selection unit 630.
The translation forming unit 620 may specifically comprise: a difference word identifying subunit 621, a fragment constructing subunit 622 and a translation forming subunit 623.
The difference word identifying subunit 621 identifies the difference words between the sentence to be translated and the similar example sentence.
The fragment constructing subunit 622 takes the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments.
The translation forming subunit 623 is configured to replace, in the translation of the similar example sentence, the translations of the corresponding difference words of the similar example sentence with the candidate translation fragments, obtaining the translation of the sentence to be translated.
The machine translation device may further comprise a display unit 640, configured to display, when the translation of the sentence to be translated is displayed, the similar example sentence used and the scoring results of the difference word pairs between that similar example sentence and the sentence to be translated.
When the scoring results are displayed, a preset correspondence between scoring results and confidence levels can be used; for example, confidence is divided into three levels (high, medium and low) according to the scoring result. The confidence level corresponding to the scoring result of the difference word pair is then determined and displayed.
As a preferred display scheme, the sentence to be translated, the similar example sentence, the translation of the similar example sentence and the translation of the sentence to be translated can be displayed; the difference words of the similar example sentence and the sentence to be translated can be highlighted, and the candidate translation fragments are highlighted as well, as shown in Fig. 4, with the confidence of the difference word pairs shown on the right side at the same time.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (22)

1. A method for calculating sentence similarity, characterized in that the method comprises:
A. comparing a first sentence with a second sentence to determine difference word pairs;
B. scoring each difference word by using the collocation probabilities between the difference word of a difference word pair and the other words in the first sentence or second sentence in which it is located, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics of the co-occurrence count of the two words in a preset corpus;
C. determining the score of each difference word pair by using the scoring results of the difference words in the pair;
D. determining the similarity between the first sentence and the second sentence by using the scoring results of the difference word pairs.
2. The method according to claim 1, characterized in that, in step B, each difference word is scored according to the following formula:
[Formula: image not reproduced in the text]
where $r(w_i, E)$ is the scoring result of the difference word $w_i$, $E$ is the first sentence or second sentence containing $w_i$, $w_j$ denotes the other words in $E$ besides $w_i$, $r(w_i, w_j)$ is the collocation probability of $w_i$ and $w_j$, and $m$ is the number of words in $E$.
3. method according to claim 1 and 2 is characterized in that, in described step C, is the difference word to marking according to following formula:
S ( w , w ~ ) = r ( w , E 1 ) α 1 * r ( w ~ , E 2 ) α 2 ; Perhaps, S ( w , w ~ ) = β 1 * r ( w , E 1 ) + β 2 * r ( w ~ , E 2 ) ;
Wherein,
Figure FDA0000097117300000014
For by difference word w and
Figure FDA0000097117300000015
The right marking result of difference word who consists of, r (w, E1) is the marking result of the difference word w among the first sentence E1,
Figure FDA0000097117300000016
It is the difference word among the second sentence E2
Figure FDA0000097117300000017
The marking result, α 1, α 2, β 1 and β 2 are default weighting parameter.
4. The method according to claim 1, characterized in that the method further comprises: determining the feature vectors of the two difference words in a difference word pair, and using the feature vectors of the two difference words to calculate the similarity distance between the two difference words;
when the score of the difference word pair is determined in step C, the similarity distance between the two difference words of the pair is further used.
5. The method according to claim 4, characterized in that the feature vector of a difference word is determined as follows:
querying the collocation probability model, and forming the feature vector of the difference word from the words whose collocation probability with the difference word reaches a preset collocation probability threshold.
6. The method according to claim 4, characterized in that the similarity distance between the two difference words is calculated according to the following formula:
$dist(w, \tilde{w}) = A - Cosine(F(w), F(\tilde{w}))$
where $dist(w, \tilde{w})$ is the similarity distance between the difference words $w$ and $\tilde{w}$, $A$ is a preset positive number, $F(w)$ is the feature vector of the difference word $w$, $F(\tilde{w})$ is the feature vector of the difference word $\tilde{w}$, and $Cosine(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.
7. The method according to claim 4, 5 or 6, characterized in that, in step C, a difference word pair is scored according to the following formula:
$S(w, \tilde{w}) = r(w, E1)^{\alpha_1} * r(\tilde{w}, E2)^{\alpha_2} * dist(w, \tilde{w})^{\alpha_3}$; or,
$S(w, \tilde{w}) = \beta_1 * r(w, E1) + \beta_2 * r(\tilde{w}, E2) + \beta_3 * dist(w, \tilde{w})$;
where $S(w, \tilde{w})$ is the scoring result of the difference word pair formed by the difference words $w$ and $\tilde{w}$, $r(w, E1)$ is the scoring result of the difference word $w$ in the first sentence E1, $r(\tilde{w}, E2)$ is the scoring result of the difference word $\tilde{w}$ in the second sentence E2, $dist(w, \tilde{w})$ is the similarity distance between $w$ and $\tilde{w}$, and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, $\beta_2$ and $\beta_3$ are preset weight parameters.
8. A machine translation method, characterized in that the machine translation method comprises:
S1. calculating, by the method of claim 1, the similarity between the sentence to be translated and the sentences in a preset example sentence library;
S2. selecting the N sentences with the highest similarity as similar example sentences of the sentence to be translated, where N is a preset positive integer;
S3. using the translation of the similar example sentence to obtain the translation of the sentence to be translated.
9. The machine translation method according to claim 8, characterized in that step S1 specifically comprises:
S11. determining the sentences in the example sentence library whose edit distance to the sentence to be translated satisfies a preset requirement;
S12. calculating, by the method of claim 1, the similarity between the sentence to be translated and the sentences determined in step S11.
10. The machine translation method according to claim 8, characterized in that step S3 specifically comprises:
S31. identifying the difference words between the sentence to be translated and the similar example sentence;
S32. taking the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments;
S33. in the translation of the similar example sentence, replacing the translations of the corresponding difference words of the similar example sentence with the candidate translation fragments to obtain the translation of the sentence to be translated.
11. The machine translation method according to claim 8, characterized in that the machine translation method further comprises: when displaying the translation of the sentence to be translated, displaying the similar example sentence used and the scoring results of the difference word pairs between that similar example sentence and the sentence to be translated.
12. A device for calculating sentence similarity, characterized in that the device comprises:
a sentence comparison unit, configured to compare a first sentence with a second sentence to determine difference word pairs;
a difference word scoring unit, configured to score each difference word by using the collocation probabilities between the difference word of a difference word pair and the other words in the first sentence or second sentence in which it is located, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics of the co-occurrence count of the two words in a preset corpus;
a difference word pair scoring unit, configured to determine the score of each difference word pair by using the scoring results of the difference words in the pair;
a similarity determining unit, configured to determine the similarity between the first sentence and the second sentence by using the scoring results of the difference word pairs.
13. The device according to claim 12, characterized in that the difference word scoring unit scores each difference word according to the following formula:
[Formula: image not reproduced in the text]
where $r(w_i, E)$ is the scoring result of the difference word $w_i$, $E$ is the first sentence or second sentence containing $w_i$, $w_j$ denotes the other words in $E$ besides $w_i$, $r(w_i, w_j)$ is the collocation probability of $w_i$ and $w_j$, and $m$ is the number of words in $E$.
14. The device according to claim 12 or 13, characterized in that the difference word pair scoring unit scores a difference word pair according to the following formula:
$S(w, \tilde{w}) = r(w, E1)^{\alpha_1} * r(\tilde{w}, E2)^{\alpha_2}$; or, $S(w, \tilde{w}) = \beta_1 * r(w, E1) + \beta_2 * r(\tilde{w}, E2)$;
where $S(w, \tilde{w})$ is the scoring result of the difference word pair formed by the difference words $w$ and $\tilde{w}$, $r(w, E1)$ is the scoring result of the difference word $w$ in the first sentence E1, $r(\tilde{w}, E2)$ is the scoring result of the difference word $\tilde{w}$ in the second sentence E2, and $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ are preset weight parameters.
15. The device according to claim 12, characterized in that the device further comprises: a similarity distance determining unit, configured to determine the feature vectors of the two difference words in a difference word pair and to use the feature vectors of the two difference words to calculate the similarity distance between the two difference words;
the difference word pair scoring unit further uses the similarity distance between the two difference words of the pair when determining the score of the difference word pair.
16. The device according to claim 15, characterized in that the similarity distance determining unit queries the collocation probability model and forms the feature vector of a difference word from the words whose collocation probability with the difference word reaches a preset collocation probability threshold.
17. The device according to claim 15, characterized in that the similarity distance determining unit calculates the similarity distance between the two difference words according to the following formula:
$dist(w, \tilde{w}) = A - Cosine(F(w), F(\tilde{w}))$
where $dist(w, \tilde{w})$ is the similarity distance between the difference words $w$ and $\tilde{w}$, $A$ is a preset positive number, $F(w)$ is the feature vector of the difference word $w$, $F(\tilde{w})$ is the feature vector of the difference word $\tilde{w}$, and $Cosine(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.
18. The device according to claim 15, 16 or 17, characterized in that the difference word pair scoring unit scores a difference word pair according to the following formula:
$S(w, \tilde{w}) = r(w, E1)^{\alpha_1} * r(\tilde{w}, E2)^{\alpha_2} * dist(w, \tilde{w})^{\alpha_3}$; or,
$S(w, \tilde{w}) = \beta_1 * r(w, E1) + \beta_2 * r(\tilde{w}, E2) + \beta_3 * dist(w, \tilde{w})$;
where $S(w, \tilde{w})$ is the scoring result of the difference word pair formed by the difference words $w$ and $\tilde{w}$, $r(w, E1)$ is the scoring result of the difference word $w$ in the first sentence E1, $r(\tilde{w}, E2)$ is the scoring result of the difference word $\tilde{w}$ in the second sentence E2, $dist(w, \tilde{w})$ is the similarity distance between $w$ and $\tilde{w}$, and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, $\beta_2$ and $\beta_3$ are preset weight parameters.
19. the device of a mechanical translation is characterized in that, the device of this mechanical translation comprises:
The device of calculating sentence similarity as claimed in claim 12 is for the similarity of the example sentence storehouse sentence that calculates sentence to be translated and preset;
Similar example sentence selected cell is used for selecting similarity to come the sentence of top n as the similar example sentence of described sentence to be translated, and N is default positive integer;
Translation forms the unit, obtains the translation of described sentence to be translated for the translation that utilizes described similar example sentence.
20. the device of mechanical translation according to claim 19 is characterized in that, the device of this mechanical translation also comprises: the initial option unit is used for determining that the editing distance between described example sentence storehouse and the described sentence to be translated satisfies the sentence of preset requirement;
The device of described calculating sentence similarity calculates the similarity between the sentence of sentence to be translated and described initial option unit determining.
21. The machine translation device according to claim 19, characterized in that the translation forming unit specifically comprises:
A difference word identification subunit, used for identifying the difference words between the sentence to be translated and the similar example sentence;
A fragment construction subunit, used for taking the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments;
A translation forming subunit, used for replacing, in the translation of the similar example sentence, the translations of the corresponding difference words of the similar example sentence with the candidate translation fragments, to obtain the translation of the sentence to be translated.
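The three subunits of claim 21 can be sketched on a toy English-to-French example. This rests on several assumptions not found in the claims: both sentences have the same number of words, difference words are found by position, and a hypothetical one-to-one `lexicon` supplies the translation of each difference word. All names are illustrative.

```python
from typing import Dict, List, Tuple


def difference_word_pairs(source: str, example: str) -> List[Tuple[str, str]]:
    """Identify (source word, example word) pairs that differ by position."""
    s, e = source.split(), example.split()
    if len(s) != len(e):
        raise ValueError("this sketch only handles same-length sentences")
    return [(sw, ew) for sw, ew in zip(s, e) if sw != ew]


def translate_by_replacement(
    source: str,
    example: str,
    example_translation: str,
    lexicon: Dict[str, str],
) -> str:
    """Swap difference-word translations inside the example's translation."""
    words = example_translation.split()
    for src_word, ex_word in difference_word_pairs(source, example):
        candidate = lexicon[src_word]  # candidate translation fragment
        old = lexicon[ex_word]         # translation of the example's word
        words = [candidate if w == old else w for w in words]
    return " ".join(words)


# Hypothetical toy lexicon and example pair.
lexicon = {"red": "rouges", "green": "vertes"}
result = translate_by_replacement(
    "I like red apples",
    "I like green apples",
    "J'aime les pommes vertes",
    lexicon,
)
```

A real system would need word alignment between the example sentence and its translation rather than a flat lexicon lookup; the sketch only shows the order of the three steps.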
22. The machine translation device according to claim 19, characterized in that the machine translation device further comprises: a display unit, used for displaying, while the translation of the sentence to be translated is displayed, the similar example sentence used and the marking result of each difference word pair between that similar example sentence and the sentence to be translated.
CN201110303522.5A 2011-10-09 2011-10-09 Method and device for calculating sentence similarity and method and device for machine translation Active CN103034627B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110303522.5A CN103034627B (en) 2011-10-09 2011-10-09 Method and device for calculating sentence similarity and method and device for machine translation

Publications (2)

Publication Number Publication Date
CN103034627A (en) 2013-04-10
CN103034627B (en) 2016-05-25

Family

ID=48021531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110303522.5A Active CN103034627B (en) 2011-10-09 2011-10-09 Method and device for calculating sentence similarity and method and device for machine translation

Country Status (1)

Country Link
CN (1) CN103034627B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5408410A (en) * 1992-04-17 1995-04-18 Hitachi, Ltd. Method of and an apparatus for automatically evaluating machine translation system through comparison of their translation results with human translated sentences
JPH0950434A (en) * 1995-08-10 1997-02-18 Brother Ind Ltd Japanese analysis method
JP2002351872A (en) * 2001-05-22 2002-12-06 Nippon Telegr & Teleph Corp <Ntt> Natural-language translation candidate selection method, device, program, and recording medium storing therein the program
CN101122909A (en) * 2006-08-10 2008-02-13 株式会社日立制作所 Text message indexing unit and text message indexing method
CN101271451A (en) * 2007-03-20 2008-09-24 株式会社东芝 Computer aided translation method and device
CN101271452A (en) * 2007-03-21 2008-09-24 株式会社东芝 Method and device for generating version and machine translation
JP4528818B2 (en) * 2007-09-27 2010-08-25 株式会社東芝 Machine translation apparatus and machine translation program
CN101667176A (en) * 2008-09-01 2010-03-10 株式会社东芝 Method and system for counting machine translation based on phrases
CN102135957A (en) * 2010-01-22 2011-07-27 阿里巴巴集团控股有限公司 Clause translating method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MYOUNG-CHEOL KIM et al.: "A comparison of collocation-based similarity measures in query expansion", Information Processing and Management *
WANYIN LI et al.: "Similarity Based Chinese Synonym Collocation Extraction", Computational Linguistics and Chinese Language Processing *
刘占一 et al.: "An Improved Hierarchical Phrase Based Machine Translation Model", 2010 International Conference on Circuit and Signal Processing *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572789A (en) * 2013-10-29 2015-04-29 北大方正集团有限公司 Text sequencing method and equipment
CN105095188A (en) * 2015-08-14 2015-11-25 北京京东尚科信息技术有限公司 Sentence similarity computing method and device
CN105095188B (en) * 2015-08-14 2018-02-16 北京京东尚科信息技术有限公司 Sentence similarity computational methods and device
CN106991181A (en) * 2017-04-07 2017-07-28 广州视源电子科技股份有限公司 Method and device for extracting spoken sentences
CN106991181B (en) * 2017-04-07 2020-04-21 广州视源电子科技股份有限公司 Method and device for extracting spoken sentences
CN109145289A (en) * 2018-07-19 2019-01-04 昆明理工大学 Lao-Chinese bilingual sentence similarity calculation method based on an improved relation vector model
CN110348010B (en) * 2019-06-21 2023-06-02 北京小米智能科技有限公司 Synonymous phrase acquisition method and apparatus
CN110348010A (en) * 2019-06-21 2019-10-18 北京小米智能科技有限公司 Synonymous phrase acquisition method and apparatus
CN110750977A (en) * 2019-10-23 2020-02-04 支付宝(杭州)信息技术有限公司 Text similarity calculation method and system
CN110750977B (en) * 2019-10-23 2023-06-02 支付宝(杭州)信息技术有限公司 Text similarity calculation method and system
US11557284B2 (en) 2020-01-03 2023-01-17 International Business Machines Corporation Cognitive analysis for speech recognition using multi-language vector representations
CN111597826B (en) * 2020-05-15 2021-10-01 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation
CN111597826A (en) * 2020-05-15 2020-08-28 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation
CN113408304A (en) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 Text translation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN103034627B (en) 2016-05-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant