CN103034627A - Method and device for calculating sentence similarity and method and device for machine translation

Info

Publication number: CN103034627A
Application number: CN2011103035225A
Authority: CN
Inventors: 刘占一; 吴华; 王海峰
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2011-10-09
Filing date: 2011-10-09
Publication date: 2013-04-10
Anticipated expiration: 2031-10-09
Also published as: CN103034627B

Abstract

The invention provides a method and a device for calculating sentence similarity and a method and a device for machine translation. The method for calculating sentence similarity comprises: comparing a first sentence with a second sentence to determine difference word pairs; scoring each difference word in a difference word pair using the collocation probability between that difference word and the other words in the first or second sentence that contains it, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the model is obtained by counting the co-occurrences of the two words in a preset corpus; scoring each difference word pair using the scores of its difference words; and determining the similarity of the first sentence and the second sentence from the scores of the difference word pairs. The method and the device reflect the matching degree between two sentences more accurately, thereby improving the quality of applications such as machine translation.

Description

Method and apparatus for calculating sentence similarity, and method and apparatus for machine translation

[technical field]

The present invention relates to the field of computer technology, and in particular to a method and apparatus for calculating sentence similarity and a method and apparatus for machine translation.

[background technology]

Sentence similarity calculation is of great practical value in fields such as question retrieval, bilingual example sentence retrieval, machine translation and document summarization. Which similarity calculation method is adopted, and how accurately it reflects the similarity between two sentences, is key to the quality of these applications.

Take machine translation as an example. Machine translation methods commonly use preprocessed bilingual example sentences as the main translation resource and generate the final translation by editing a similar example sentence that matches the sentence to be translated. Specifically, this comprises the following steps:

1) Search the translation example base for a similar example sentence that matches the sentence to be translated.

For example, the sentence to be translated is: This is a pencil.

The similar example sentence is: That is a pen.

2) Identify the difference words between the sentence to be translated and the similar example sentence.

"This" and "That" are difference words, and "pencil" and "pen" are difference words.

3) Take the translations of the difference words in the sentence to be translated as candidate translation fragments.

That is, the translations "this" and "pencil" serve as the candidate translation fragments.

4) In the translation of the similar example sentence, replace the translations of the example sentence's difference words with the candidate translation fragments to obtain the translation of the sentence to be translated.

The translation of the similar example sentence is "that is a pen"; replacing "that" with "this" and "pen" with "pencil" gives the translation of the sentence to be translated: "this is a pencil".
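The four steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name `translate_by_example`, the position-by-position alignment of the two source sentences, and the toy word-level `lexicon` are assumptions made for the example and are not specified by the text.

```python
def translate_by_example(query, example_src, example_tgt, lexicon):
    """Edit the example's translation so that it fits the query.

    query, example_src: token lists, assumed aligned position by position.
    example_tgt: token list holding the translation of example_src.
    lexicon: hypothetical word-level dictionary {source word: target word}.
    """
    result = list(example_tgt)
    for q_word, ex_word in zip(query, example_src):
        if q_word == ex_word:
            continue  # step 2: not a difference word
        fragment = lexicon[q_word]  # step 3: candidate translation fragment
        old = lexicon[ex_word]      # translation of the example's difference word
        # Step 4: substitute the fragment into the example's translation.
        result = [fragment if token == old else token for token in result]
    return result


# Toy data; the target side is written as English glosses, as in the text.
lexicon = {"this": "this", "that": "that", "pencil": "pencil", "pen": "pen"}
print(translate_by_example(
    ["this", "is", "a", "pencil"],
    ["that", "is", "a", "pen"],
    ["that", "is", "a", "pen"],
    lexicon,
))  # ['this', 'is', 'a', 'pencil']
```

A real system would need a proper word alignment between the example sentence and its translation; the token-equality lookup here only works for this toy case.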

As the machine translation process above shows, how the similar example sentence is selected is a key factor in translation quality.

Existing sentence similarity calculation usually computes the edit distance between sentences. The edit distance is the minimum number of operations needed to transform one sentence into the other, where the operations can include insertion, deletion and substitution. The smaller the edit distance between two sentences, the higher their similarity is taken to be. This approach, however, has certain defects.

For example, suppose the sentence to be translated is: Can I take a picture of the painting?

The similar example sentence selected by edit distance is: Can I take a picture of the car?

The translation formed from this similar example sentence is, rendered in English: "Can I take a photo of this oil painting?"

If instead the sentence "Can we take a photo of the painting?" is used as the similar example sentence, the translation formed is, rendered in English: "Can I take a photo of this oil painting?", using the classifier appropriate to a painting.

As can be seen, although the edit distance between "Can we take a photo of the painting" and the sentence to be translated is greater than the edit distance between "Can I take a picture of the car" and the sentence to be translated, the former is more similar to the sentence to be translated, and the translation it yields is therefore of higher quality.
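The edit distances in this example can be checked with a small word-level Levenshtein routine. The sketch below assumes whitespace tokenization and unit cost for insertion, deletion and substitution; the function name `edit_distance` is chosen for illustration.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


query = "can i take a picture of the painting".split()
print(edit_distance(query, "can i take a picture of the car".split()))      # 1
print(edit_distance(query, "can we take a photo of the painting".split()))  # 2
```

By edit distance alone, the "car" sentence ranks ahead of the "photo" sentence, which is the defect described above.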

This problem arises because the relationship between the difference words of the two sentences is not considered when sentence similarity is calculated. Although it has been proposed to use a synonym dictionary to account for the similarity between difference words, in many applications, such as the machine translation application above, the collocation relationship between a difference word and its context matters more to the similarity calculation than semantics does: it reflects the matching degree between two sentences more accurately and has a greater effect on the quality of the application.

[summary of the invention]

The invention provides a method and apparatus for calculating sentence similarity and a method and apparatus for machine translation, so as to reflect the matching degree between two sentences more accurately and thereby improve the quality of applications such as machine translation.

The specific technical solution is as follows:

A method for calculating sentence similarity, the method comprising:

A. comparing a first sentence with a second sentence to determine difference word pairs;

B. scoring each difference word using the collocation probability between the difference word in a difference word pair and the other words in the first sentence or second sentence that contains it, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics on the number of co-occurrences of the two words in a preset corpus;

C. determining the score of each difference word pair using the scores of the difference words in the pair;

D. determining the similarity of the first sentence and the second sentence using the scores of the difference word pairs.

Specifically, in step B, each difference word is scored according to the following formula:

$$r(w_i, E) = \frac{\sum_{w_i \in E,\ w_j \in E,\ w_i \neq w_j} r(w_i, w_j)}{m},$$

where $r(w_i, E)$ is the score of difference word $w_i$, $E$ is the first sentence or second sentence containing $w_i$, $w_j$ is any word in $E$ other than $w_i$, $r(w_i, w_j)$ is the collocation probability of $w_i$ and $w_j$, and $m$ is the number of words in $E$.

In step C, each difference word pair is scored according to the following formula:

$$S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2};$$

or

$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2);$$

where $S(w, \tilde{w})$ is the score of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E_1)$ is the score of difference word $w$ in the first sentence $E_1$, $r(\tilde{w}, E_2)$ is the score of difference word $\tilde{w}$ in the second sentence $E_2$, and $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ are preset weight parameters.

Further, the method also comprises: determining the feature vectors of the two difference words in a difference word pair, and using the feature vectors to calculate the similarity distance between the two difference words;

when the score of the difference word pair is determined in step C, the similarity distance between the two difference words in the pair is additionally used.

The feature vector of a difference word is determined as follows:

the collocation probability model is queried, and the words whose collocation probability with the difference word reaches a preset collocation probability threshold constitute the feature vector of that difference word.

Specifically, the similarity distance between the two difference words is calculated according to the following formula:

$$\mathrm{dist}(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})),$$

where $\mathrm{dist}(w, \tilde{w})$ is the similarity distance between difference words $w$ and $\tilde{w}$, $A$ is a preset positive number, $F(w)$ is the feature vector of difference word $w$, $F(\tilde{w})$ is the feature vector of difference word $\tilde{w}$, and $\mathrm{Cosine}(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.

In this case, in step C, the difference word pair is scored according to the following formula:

$$S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot \mathrm{dist}(w, \tilde{w})^{\alpha_3};$$

or

$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot \mathrm{dist}(w, \tilde{w});$$

where $S(w, \tilde{w})$ is the score of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E_1)$ is the score of difference word $w$ in the first sentence $E_1$, $r(\tilde{w}, E_2)$ is the score of difference word $\tilde{w}$ in the second sentence $E_2$, $\mathrm{dist}(w, \tilde{w})$ is the similarity distance between $w$ and $\tilde{w}$, and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, $\beta_2$ and $\beta_3$ are preset weight parameters.

A machine translation method, comprising:

S1. calculating the similarity between a sentence to be translated and the sentences in a preset example sentence base, using the above method for calculating sentence similarity;

S2. selecting the N sentences with the highest similarity as similar example sentences of the sentence to be translated, N being a preset positive integer;

S3. obtaining the translation of the sentence to be translated using the translation of the similar example sentence.

Step S1 specifically comprises:

S11. determining the sentences in the example sentence base whose edit distance to the sentence to be translated satisfies a preset requirement;

S12. calculating the similarity between the sentence to be translated and the sentences determined in step S11, using the above method for calculating sentence similarity.

Step S3 specifically comprises:

S31. identifying the difference words between the sentence to be translated and the similar example sentence;

S32. taking the translations of the difference words in the sentence to be translated as candidate translation fragments;

S33. in the translation of the similar example sentence, replacing the translations of the corresponding difference words with the candidate translation fragments to obtain the translation of the sentence to be translated.

Preferably, the machine translation method also comprises: when displaying the translation of the sentence to be translated, displaying the similar example sentence used and the scores of the difference word pairs between that similar example sentence and the sentence to be translated.

A device for calculating sentence similarity, the device comprising:

a sentence comparison unit, configured to compare a first sentence with a second sentence and determine difference word pairs;

a difference word scoring unit, configured to score each difference word using the collocation probability between the difference word in a difference word pair and the other words in the first sentence or second sentence that contains it, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics on the number of co-occurrences of the two words in a preset corpus;

a difference word pair scoring unit, configured to determine the score of each difference word pair using the scores of the difference words in the pair;

a similarity determining unit, configured to determine the similarity of the first sentence and the second sentence using the scores of the difference word pairs.

Specifically, the difference word scoring unit scores each difference word according to the following formula:

$$r(w_i, E) = \frac{\sum_{w_i \in E,\ w_j \in E,\ w_i \neq w_j} r(w_i, w_j)}{m},$$

In this case, the difference word pair scoring unit scores each difference word pair according to the following formula:

$$S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2};$$

or

$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2);$$

where $S(w, \tilde{w})$ is the score of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E_1)$ is the score of difference word $w$ in the first sentence $E_1$, $r(\tilde{w}, E_2)$ is the score of difference word $\tilde{w}$ in the second sentence $E_2$, and $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ are preset weight parameters.

In another embodiment, the device also comprises: a similarity distance determining unit, configured to determine the feature vectors of the two difference words in a difference word pair and to use the feature vectors to calculate the similarity distance between the two difference words;

when determining the score of a difference word pair, the difference word pair scoring unit additionally uses the similarity distance between the two difference words in the pair.

The similarity distance determining unit queries the collocation probability model, and the words whose collocation probability with a difference word reaches a preset collocation probability threshold constitute the feature vector of that difference word.

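The similarity distance determining unit calculates the similarity distance between the two difference words according to the following formula:

$$\mathrm{dist}(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})),$$

where $\mathrm{dist}(w, \tilde{w})$ is the similarity distance between difference words $w$ and $\tilde{w}$, $A$ is a preset positive number, $F(w)$ is the feature vector of difference word $w$, $F(\tilde{w})$ is the feature vector of difference word $\tilde{w}$, and $\mathrm{Cosine}(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.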

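In this case, the difference word pair scoring unit scores each difference word pair according to the following formula:

$$S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot \mathrm{dist}(w, \tilde{w})^{\alpha_3};$$

or

$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot \mathrm{dist}(w, \tilde{w});$$

where $S(w, \tilde{w})$ is the score of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E_1)$ is the score of difference word $w$ in the first sentence $E_1$, $r(\tilde{w}, E_2)$ is the score of difference word $\tilde{w}$ in the second sentence $E_2$, $\mathrm{dist}(w, \tilde{w})$ is the similarity distance between $w$ and $\tilde{w}$, and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, $\beta_2$ and $\beta_3$ are preset weight parameters.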

A machine translation device, comprising:

the above device for calculating sentence similarity, configured to calculate the similarity between a sentence to be translated and the sentences in a preset example sentence base;

a similar example sentence selecting unit, configured to select the N sentences with the highest similarity as similar example sentences of the sentence to be translated, N being a preset positive integer;

a translation forming unit, configured to obtain the translation of the sentence to be translated using the translation of the similar example sentence.

Further, the machine translation device also comprises: an initial selection unit, configured to determine the sentences in the example sentence base whose edit distance to the sentence to be translated satisfies a preset requirement;

the device for calculating sentence similarity calculates the similarity between the sentence to be translated and the sentences determined by the initial selection unit.

The translation forming unit specifically comprises:

a difference word identifying subunit, configured to identify the difference words between the sentence to be translated and the similar example sentence;

a fragment constructing subunit, configured to take the translations of the difference words in the sentence to be translated as candidate translation fragments;

a translation forming subunit, configured to replace, in the translation of the similar example sentence, the translations of the corresponding difference words with the candidate translation fragments to obtain the translation of the sentence to be translated.

Preferably, the machine translation device also comprises: a display unit, configured to display, along with the translation of the sentence to be translated, the similar example sentence used and the scores of the difference word pairs between that similar example sentence and the sentence to be translated.

As the above technical solutions show, the method and apparatus provided by the invention incorporate word-to-word collocation probabilities into the calculation of sentence similarity: difference word pairs are scored on the basis of the collocation probability between each difference word and the other words in the sentence that contains it, and the degree of difference between the sentences is then calculated. Compared with the prior art, this reflects the matching degree between sentences more accurately and thereby improves the quality of applications such as machine translation.

[description of drawings]

Fig. 1 is a flowchart of the method for calculating sentence similarity provided by embodiment one of the present invention;

Fig. 2 is a flowchart of the method for calculating sentence similarity provided by embodiment two of the present invention;

Fig. 3 is a flowchart of the machine translation method provided by embodiment three of the present invention;

Fig. 4 is an example of the translation display provided by embodiment three of the present invention;

Fig. 5 is a structural diagram of the device for calculating sentence similarity provided by embodiment four of the present invention;

Fig. 6 is a structural diagram of the machine translation device provided by embodiment five of the present invention.

[embodiment]

To make the purpose, technical solutions and advantages of the present invention clearer, the invention is described below with reference to the drawings and specific embodiments.

The similarity calculation method provided by the invention is described below through embodiment one and embodiment two. Both embodiments calculate the similarity between a sentence E1 and a sentence E2, which can be chosen according to the specific application. For example, in question retrieval, sentence E1 can be the query entered by the user and sentence E2 an existing question in the question database; in machine translation, sentence E1 can be the sentence to be translated and sentence E2 a sentence in the example sentence base used for translation; and so on.

Embodiment one

Fig. 1 is a flowchart of the method for calculating sentence similarity provided by embodiment one of the present invention. As shown in Fig. 1, the method may comprise the following steps:

Step 101: compare sentence E1 with sentence E2 and determine the difference word pairs.

The embodiments of the invention build on basic text processing of the sentences, such as word segmentation and alignment; since this is prior art, it is not described further here.

The words in sentence E1 and sentence E2 are compared, and the words that differ form difference word pairs. For example:

Sentence E1 is: Can I take a picture of the painting?

Sentence E2 is: Can we take a photo of the painting?

The difference word pairs are then the pair formed by "I" and "we" and the pair formed by "picture" and "photo".
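Step 101 can be sketched with Python's standard `difflib` module. This is an illustrative sketch, not the alignment procedure of the text (which is left to the prior art): it assumes whitespace tokenization and only pairs up substituted spans of equal length; the function name `difference_word_pairs` is chosen for the example.

```python
from difflib import SequenceMatcher


def difference_word_pairs(e1, e2):
    """Return (word in e1, word in e2) pairs where the aligned sentences differ."""
    pairs = []
    matcher = SequenceMatcher(a=e1, b=e2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only substitutions of equal length give one-to-one difference word pairs.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(e1[i1:i2], e2[j1:j2]))
    return pairs


e1 = "can i take a picture of the painting".split()
e2 = "can we take a photo of the painting".split()
print(difference_word_pairs(e1, e2))  # [('i', 'we'), ('picture', 'photo')]
```

Insertions, deletions and unequal-length substitutions are ignored here; how to treat them is a design decision the text does not address.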

Step 102: score each difference word using the collocation probability between the difference word in a difference word pair and the other words in the sentence E1 or E2 that contains it, where the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the model is obtained from statistics on the number of co-occurrences of the two words in a preset corpus.

The collocation probability model is built in advance: counting how often pairs of words co-occur in the preset corpus gives the collocation probability of each pair. For example, for machine translation the preset corpus can be the corpus used by the machine translation system. Counting the co-occurrences of "take" and "picture" gives the collocation probability of "take" and "picture", which is stored in the model; counting the co-occurrences of "take" and "photo" gives the collocation probability of "take" and "photo", which is likewise stored; and so on. The larger the collocation probability between two words, the stronger the dependence between them.
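A minimal sketch of building such a model is shown below. The text does not say how co-occurrence counts are turned into probabilities, so the normalization used here (the fraction of corpus sentences in which both words appear) is an assumption, as are the function names `build_collocation_model` and `collocation_prob` and the toy corpus.

```python
from collections import Counter
from itertools import combinations


def build_collocation_model(corpus):
    """corpus: iterable of token lists. Returns {frozenset({a, b}): probability}."""
    counts = Counter()
    n_sentences = 0
    for tokens in corpus:
        n_sentences += 1
        # Count each unordered word pair once per sentence.
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[frozenset((a, b))] += 1
    # Assumed normalization: share of sentences containing both words.
    return {pair: count / n_sentences for pair, count in counts.items()}


def collocation_prob(model, a, b):
    """Query the model; unseen pairs get probability 0.0."""
    return model.get(frozenset((a, b)), 0.0)


corpus = [
    ["take", "a", "picture"],
    ["take", "a", "photo"],
    ["draw", "a", "picture"],
    ["take", "the", "bus"],
]
model = build_collocation_model(corpus)
print(collocation_prob(model, "take", "picture"))  # 0.25
print(collocation_prob(model, "take", "a"))        # 0.5
```

Other estimates, such as windowed counts or association measures, would fit the same interface.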

A word in a sentence is not an isolated individual: each word has, to a greater or lesser degree, a collocation relationship with the other words in the sentence, and this relationship reflects how strongly the word depends on its context, that is, the risk of editing it. To score a difference word, the collocation probabilities between the difference word and each of the other words in its sentence are obtained and then combined into the score of the difference word. For example, the score $r(w_i, E)$ of difference word $w_i$ can be obtained with the following formula:

$$r(w_i, E) = \frac{\sum_{w_i \in E,\ w_j \in E,\ w_i \neq w_j} r(w_i, w_j)}{m} \tag{1}$$

where $E$ is the sentence containing difference word $w_i$, which can be sentence E1 or sentence E2 above, $w_j$ is any word in $E$ other than $w_i$, $r(w_i, w_j)$ is the collocation probability of $w_i$ and $w_j$, obtained by querying the collocation probability model, and $m$ is the number of words in $E$.

Taking sentence E1 as an example, the collocation probabilities of the difference word "picture" with "can", "I", "take", "a", "of", "the" and "painting" are obtained, $m$ is 8, and substituting these into formula (1) gives the score of the difference word "picture".
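Formula (1) translates directly into code. In the sketch below the collocation probabilities in `table` are made-up toy values, and the names `score_word` and `prob` are chosen for illustration.

```python
def score_word(word, sentence, prob):
    """Formula (1): sum of r(word, w_j) over the other words of the sentence, divided by m."""
    m = len(sentence)  # number of words in the sentence
    total = sum(prob(word, other) for other in sentence if other != word)
    return total / m


# Toy collocation probabilities (hypothetical values); unlisted pairs count as 0.
table = {
    frozenset(("picture", "take")): 0.5,
    frozenset(("picture", "a")): 0.25,
    frozenset(("picture", "painting")): 0.25,
}


def prob(a, b):
    return table.get(frozenset((a, b)), 0.0)


e1 = "can i take a picture of the painting".split()
print(score_word("picture", e1, prob))  # (0.5 + 0.25 + 0.25) / 8 = 0.125
```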

Step 103: determine the score of each difference word pair using the scores of the difference words in the pair.

Once the score of every difference word in sentence E1 and sentence E2 has been determined, each difference word pair can be scored by combining the scores of its two difference words, for example according to formula (2) or formula (3):

$$S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \tag{2}$$

$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) \tag{3}$$

where $S(w, \tilde{w})$ is the score of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E_1)$ is the score of difference word $w$ in sentence E1, $r(\tilde{w}, E_2)$ is the score of difference word $\tilde{w}$ in sentence E2, and $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ are preset weight parameters. These parameters are usually set to values between -1 and 1; $\alpha_1$ and $\alpha_2$ are usually chosen to be both positive or both negative, and likewise $\beta_1$ and $\beta_2$.

For example, to calculate the score of the difference word pair formed by "picture" and "photo", formula (1) is first used to calculate the score of "picture" and the score of "photo", and these are substituted into formula (2) or (3) to obtain the score of the pair.
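The two combination rules can be sketched as follows. The input scores and the weight values are toy numbers chosen for illustration; the text only states that the weights are preset and usually lie between -1 and 1.

```python
def pair_score_product(r1, r2, alpha1, alpha2):
    """Formula (2): weighted product of the two difference word scores."""
    return (r1 ** alpha1) * (r2 ** alpha2)


def pair_score_linear(r1, r2, beta1, beta2):
    """Formula (3): weighted sum of the two difference word scores."""
    return beta1 * r1 + beta2 * r2


# Hypothetical scores of "picture" in E1 and "photo" in E2 from formula (1).
r_picture, r_photo = 0.125, 0.25
print(pair_score_product(r_picture, r_photo, 1.0, 1.0))  # 0.03125
print(pair_score_linear(r_picture, r_photo, 0.5, 0.5))   # 0.1875
```

With negative exponents in the product form, a word score of zero would cause a division by zero, so an implementation would need smoothing or a guard for that case.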

Step 104: determine the similarity of sentence E1 and sentence E2 using the scores of the difference word pairs.

In this step, the scores of all difference word pairs between sentence E1 and sentence E2 are combined, for example by summing them, to determine the similarity of the two sentences. With the scoring of embodiment one, the higher the final combined value of the difference word scores, the more similar the two sentences are in terms of collocation relationships, and the higher their matching degree.
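Steps 101 to 104 can be put together in a short end-to-end sketch. It assumes the two sentences are already aligned position by position, uses the linear combination of formula (3) with hypothetical weights of 0.5, and relies on made-up collocation probabilities in `table`; none of these values come from the text.

```python
def score_word(word, sentence, prob):
    """Formula (1)."""
    return sum(prob(word, o) for o in sentence if o != word) / len(sentence)


def similarity(e1, e2, prob, beta1=0.5, beta2=0.5):
    # Step 101: difference word pairs (assumes position-by-position alignment).
    pairs = [(w, v) for w, v in zip(e1, e2) if w != v]
    # Steps 102-103: score each word with formula (1), each pair with formula (3).
    scores = [beta1 * score_word(w, e1, prob) + beta2 * score_word(v, e2, prob)
              for w, v in pairs]
    # Step 104: combine the pair scores by summing them.
    return sum(scores)


# Toy collocation probabilities (hypothetical values); unlisted pairs count as 0.
table = {
    frozenset(("take", "picture")): 0.5,
    frozenset(("take", "photo")): 0.5,
    frozenset(("take", "i")): 0.25,
    frozenset(("take", "we")): 0.25,
    frozenset(("painting", "picture")): 0.25,
    frozenset(("painting", "photo")): 0.25,
}


def prob(a, b):
    return table.get(frozenset((a, b)), 0.0)


query = "can i take a picture of the painting".split()
car = "can i take a picture of the car".split()
photo = "can we take a photo of the painting".split()
print(similarity(query, photo, prob))  # 0.125
print(similarity(query, car, prob))    # 0.015625
```

With these toy values the "photo" sentence scores higher than the "car" sentence even though its edit distance to the query is larger.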

Embodiment two

Fig. 2 is a flowchart of the method for calculating sentence similarity provided by embodiment two of the present invention. As shown in Fig. 2, the method may comprise the following steps:

Step 201 is the same as step 101 in embodiment one.

Step 202 is the same as step 102 in embodiment one.

Step 203: determine the feature vectors of the two difference words in a difference word pair, and use the feature vectors to calculate the similarity distance between the two difference words.

Embodiment two additionally considers how similar the difference words are within a specific corpus; this similarity is expressed by the distance between the feature vectors of the two difference words in a pair.

The feature vector of a difference word can consist of the words that have a high collocation probability with it. Specifically, the collocation probability model is queried, and the words whose collocation probability with the difference word reaches a preset collocation probability threshold constitute the feature vector of that difference word.

Taking the difference word "picture" as an example, if querying the collocation probability model shows that the collocation probabilities of "take", "draw", "of", "gallery" and so on with "picture" reach the preset threshold, then the words "take", "draw", "of", "gallery" and so on constitute the feature vector of "picture". The feature vector of the difference word "photo" can be determined in the same way.
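The threshold-based construction of a feature vector can be sketched as below. The text only says that the qualifying words constitute the feature vector; representing it as a mapping from each collocate to its collocation probability is an assumption, as are the toy model values and the name `feature_vector`.

```python
def feature_vector(word, model, threshold):
    """Collect the collocates of `word` whose probability reaches `threshold`.

    model: {frozenset({a, b}): collocation probability}.
    Returns {collocate: probability}; using the probability as the weight is an assumption.
    """
    features = {}
    for pair, p in model.items():
        if len(pair) == 2 and word in pair and p >= threshold:
            (other,) = pair - {word}
            features[other] = p
    return features


# Toy model with hypothetical probabilities.
model = {
    frozenset(("picture", "take")): 0.5,
    frozenset(("picture", "draw")): 0.25,
    frozenset(("picture", "of")): 0.25,
    frozenset(("picture", "gallery")): 0.125,
    frozenset(("picture", "bus")): 0.0625,
    frozenset(("photo", "take")): 0.5,
}
print(sorted(feature_vector("picture", model, threshold=0.1)))
# ['draw', 'gallery', 'of', 'take']
```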

When calculating the similarity distance between the two difference words of a pair, the cosine of the angle between their feature vectors can be used. For example, the following formula can be adopted:

$$\mathrm{dist}(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})) \tag{4}$$

where $\mathrm{dist}(w, \tilde{w})$ is the similarity distance between difference words $w$ and $\tilde{w}$, $A$ is a preset positive number, $F(w)$ is the feature vector of difference word $w$, $F(\tilde{w})$ is the feature vector of difference word $\tilde{w}$, and $\mathrm{Cosine}(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.

The cosine of the angle can be computed with any of several existing formulas; substituting one of them, as an example, yields formula (5).

Because the collocations and collocation probabilities in the collocation probability model are obtained by training on a specific corpus, this step effectively describes how similar the two difference words are on that corpus.
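Formula (4) can be sketched with the standard cosine definition, which is an assumption here and may differ from the specific formula the text refers to. Feature vectors are represented as sparse dictionaries with hypothetical weights, and `A` defaults to 1.0 purely for illustration.

```python
import math


def cosine(f1, f2):
    """Standard cosine between two sparse vectors given as {feature: weight}."""
    dot = sum(weight * f2.get(feature, 0.0) for feature, weight in f1.items())
    norm1 = math.sqrt(sum(v * v for v in f1.values()))
    norm2 = math.sqrt(sum(v * v for v in f2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm1 * norm2)


def similarity_distance(f1, f2, A=1.0):
    """Formula (4): dist = A - Cosine(F(w), F(w~))."""
    return A - cosine(f1, f2)


# Hypothetical feature vectors for "picture" and "photo".
f_picture = {"take": 0.5, "draw": 0.25, "of": 0.25, "gallery": 0.125}
f_photo = {"take": 0.5, "of": 0.25, "album": 0.125}
print(similarity_distance(f_picture, f_photo))
```

With non-negative weights the cosine lies between 0 and 1, so with `A = 1.0` the distance is 0 for identical directions and 1 for vectors that share no features.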

Step 204: determine the score of each difference word pair using the scores of the difference words in the pair and the similarity distance between the two difference words.

Embodiment two differs from embodiment one in that the similarity distance between the difference words is additionally considered when a difference word pair is scored; that is, both the similarity distance between the two difference words of the pair and the editing risk are taken into account. For example, formula (6) or (7) can be used to score a difference word pair:

$$S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot \mathrm{dist}(w, \tilde{w})^{\alpha_3} \tag{6}$$

$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot \mathrm{dist}(w, \tilde{w}) \tag{7}$$

where $S(w, \tilde{w})$ is the score of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E_1)$ is the score of difference word $w$ in sentence E1, $r(\tilde{w}, E_2)$ is the score of difference word $\tilde{w}$ in sentence E2, $\mathrm{dist}(w, \tilde{w})$ is the similarity distance between $w$ and $\tilde{w}$, and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, $\beta_2$ and $\beta_3$ are preset weight parameters. These parameters are usually set to values between -1 and 1; $\alpha_1$ and $\alpha_2$ are usually chosen to be both positive or both negative, and likewise $\beta_1$ and $\beta_2$.

For example, to calculate the score of the difference word pair formed by "picture" and "photo", formula (1) is first used to calculate the score of "picture" and the score of "photo", formula (5) is used to calculate the similarity distance between "picture" and "photo", and formula (6) or (7) then gives the score of the pair.
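Formulas (6) and (7) extend the pair score with the similarity distance. In the sketch below the word scores, the distance and the weights are toy values chosen for illustration; the text only states that the weights are preset.

```python
def pair_score_product(r1, r2, dist, alpha1, alpha2, alpha3):
    """Formula (6): weighted product of the two word scores and the similarity distance."""
    return (r1 ** alpha1) * (r2 ** alpha2) * (dist ** alpha3)


def pair_score_linear(r1, r2, dist, beta1, beta2, beta3):
    """Formula (7): weighted sum of the two word scores and the similarity distance."""
    return beta1 * r1 + beta2 * r2 + beta3 * dist


# Hypothetical inputs: scores from formula (1) and a distance from formula (4).
r_picture, r_photo, d = 0.125, 0.25, 0.5
print(pair_score_product(r_picture, r_photo, d, 1.0, 1.0, 1.0))  # 0.015625
print(pair_score_linear(r_picture, r_photo, d, 0.5, 0.5, 0.5))   # 0.4375
```

The sign chosen for the third weight determines whether a larger distance raises or lowers the pair score.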

Step 205 is the same as step 104 in embodiment one.

The two embodiments above are described using English sentences as examples, but they are not limited to English and apply equally to calculating sentence similarity in other languages, such as Chinese.

The sentence similarity calculated by the two embodiments above can be used in fields such as question retrieval, bilingual example sentence retrieval, machine translation and document summarization. Embodiment three below describes its use in machine translation.

Embodiment three

Fig. 3 is a flowchart of the machine translation method provided by embodiment three of the present invention. As shown in Fig. 3, the method may comprise the following steps:

Step 301: calculate the similarity between the sentence to be translated and the sentences in a preset example sentence base.

In this step, the method of embodiment one or embodiment two can be used to calculate the similarity between the sentence to be translated and the sentences in the example sentence base, in preparation for selecting similar example sentences.

Because the example sentence base contains a very large number of example sentences, calculating the similarity between every sentence in the base and the sentence to be translated with the method of embodiment one or embodiment two would be inefficient. To improve efficiency, the edit distance between each example sentence and the sentence to be translated can be calculated first, the sentences whose edit distance satisfies a preset requirement determined, and the similarity then calculated only between those sentences and the sentence to be translated. For example, the sentences whose edit distance is below a preset threshold can be selected, or the top M sentences ranked by edit distance, M being a preset positive integer.
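The pre-selection by edit distance can be sketched as follows. The sketch assumes that the top M sentences are those with the smallest edit distance, uses a word-level Levenshtein distance with unit costs, and the names `edit_distance` and `prefilter` as well as the toy example base are chosen for illustration.

```python
import heapq


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def prefilter(query, examples, m):
    """Keep the m example sentences with the smallest edit distance to the query."""
    return heapq.nsmallest(m, examples, key=lambda ex: edit_distance(query, ex))


query = "can i take a picture of the painting".split()
examples = [
    "where is the station".split(),
    "can we take a photo of the painting".split(),
    "can i take a picture of the car".split(),
]
for sentence in prefilter(query, examples, m=2):
    print(" ".join(sentence))
# can i take a picture of the car
# can we take a photo of the painting
```

Only the sentences that survive this filter would then be scored with the more expensive collocation-based similarity.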

Editing distance determines that by be transformed into the needed minimal action number of another sentence from a sentence described operation can comprise: insertion, deletion or replacement etc. because the account form of editing distance is prior art, do not repeat them here.
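The pre-selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: it assumes a word-level edit distance over whitespace-separated tokens, reads "top M" as the M smallest distances, and the names `edit_distance`, `prefilter`, `max_distance` and `top_m` are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement or match
        prev = curr
    return prev[-1]


def prefilter(query, examples, max_distance=None, top_m=None):
    """Keep the examples whose edit distance to `query` meets the preset requirement.

    Assumption: sentences are tokenized by whitespace, and a smaller
    distance ranks higher.
    """
    q = query.split()
    scored = [(edit_distance(q, ex.split()), ex) for ex in examples]
    if max_distance is not None:
        scored = [(d, ex) for d, ex in scored if d < max_distance]
    scored.sort(key=lambda pair: pair[0])
    if top_m is not None:
        scored = scored[:top_m]
    return [ex for _, ex in scored]
```

Only the sentences returned by `prefilter` would then be passed to the more expensive similarity calculation.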

Step 302: Select the N sentences with the highest similarity as the similar example sentences of the sentence to be translated, where N is a preset positive integer.

Similar example sentences selected by the similarity calculation of Embodiment One or Embodiment Two take into account the collocation relationship between each difference word and the other words in its sentence, that is, the editing risk of the difference word; Embodiment Two further takes into account the similarity distance between the difference words. Sentences with a higher degree of matching are therefore selected as similar example sentences and used to generate the translation, which improves translation quality.

As a preferred embodiment, the single sentence with the highest similarity may be selected as the similar example sentence.

For example, for the sentence to be translated "Can I take a picture of the painting", consider the candidate sentences "Can we take a photo of the painting" and "Can I take a picture of the car". For the difference word pairs "I"/"we" and "picture"/"photo", the similarity distance between the difference words and their collocation probabilities with the other words in the sentences are both relatively large, whereas for the difference word pair "painting"/"car" the similarity distance and the collocation probabilities with the other words in the sentences are relatively small. Therefore, compared with "Can I take a picture of the car", the sentence "Can we take a photo of the painting" has a higher similarity to the sentence to be translated, and "Can we take a photo of the painting" can be chosen as the similar example sentence.
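Step 302 amounts to a top-N selection over similarity scores. A minimal sketch, assuming a caller-supplied `similarity` function that stands in for the calculation of Embodiment One or Embodiment Two (the function and parameter names are hypothetical):

```python
import heapq


def select_similar_examples(query, examples, similarity, n=1):
    """Return the n examples with the highest similarity to `query`, best first.

    `similarity` is any callable taking (query, example) and returning a number;
    n=1 corresponds to the preferred embodiment.
    """
    return heapq.nlargest(n, examples, key=lambda ex: similarity(query, ex))
```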

Step 303: Obtain the translation of the sentence to be translated using the translation of the similar example sentence.

After the similar example sentence is determined, the translation of the sentence to be translated may be generated according to the following steps:

Identify the difference words between the sentence to be translated and the similar example sentence; take the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments; and, in the translation of the similar example sentence, replace the translations of the corresponding difference words of the similar example sentence with the candidate translation fragments to obtain the translation of the sentence to be translated. This part is the same as the prior art and is not described further.

For example, the difference words identified between the similar example sentence "Can we take a photo of the painting" and the sentence to be translated "Can I take a picture of the painting" are "we" and "I", and "photo" and "picture". The translation of "I" (glossed "I") and the translation of "picture" (glossed "photograph") are taken as candidate translation fragments. The translation of the similar example sentence is, in gloss, "we can take a photo of this oil painting"; replacing the translations of the difference words in the similar example sentence with the candidate translation fragments gives the translation of the sentence to be translated, in gloss "I can take a photograph of this oil painting".
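The substitution step can be sketched as follows. This is an illustration under strong assumptions, not the prior-art procedure the text refers to: the two sentences are assumed to have the same number of words so that difference words can be found position by position, `lexicon` is a hypothetical word-to-translation dictionary, and the Chinese strings in the example are illustrative and not taken from the source.

```python
def translate_by_example(query, example, example_translation, lexicon):
    """Build a translation of `query` by editing the translation of `example`.

    Assumption: `query` and `example` differ only by word substitutions,
    and `lexicon` maps every difference word to its translation.
    """
    q_words, e_words = query.split(), example.split()
    if len(q_words) != len(e_words):
        raise ValueError("this sketch only handles word-for-word substitutions")
    translation = example_translation
    for q_word, e_word in zip(q_words, e_words):
        if q_word != e_word:
            # The candidate translation fragment replaces the translation
            # of the corresponding difference word of the example sentence.
            translation = translation.replace(lexicon[e_word], lexicon[q_word], 1)
    return translation


if __name__ == "__main__":
    lexicon = {"we": "我们", "I": "我", "photo": "照片", "picture": "相片"}
    print(translate_by_example(
        "Can I take a picture of the painting",
        "Can we take a photo of the painting",
        "我们可以为这幅油画拍一张照片",
        lexicon,
    ))
```

Plain string replacement is used only to keep the sketch short; a real system would rely on word alignment between the example sentence and its translation.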

When the translation of the sentence to be translated is displayed, the similar example sentence may also be displayed, and the scoring result of each difference word pair between the similar example sentence and the sentence to be translated may further be displayed. When displaying the scoring result of a difference word pair, the confidence level corresponding to that scoring result may be determined according to a preset correspondence between scoring results and confidence levels (for example, the confidence level may be divided into three grades, high, medium, and low, according to the scoring result), and that confidence level is then displayed.
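The correspondence between scoring results and confidence levels could be as simple as two cut-off values. In the sketch below the thresholds 0.3 and 0.7 are purely hypothetical; the text only states that the correspondence is preset.

```python
def confidence_level(score, low_threshold=0.3, high_threshold=0.7):
    """Map the score of a difference word pair to one of three confidence grades.

    The threshold values are illustrative assumptions, not values from the text.
    """
    if score >= high_threshold:
        return "high"
    if score >= low_threshold:
        return "medium"
    return "low"
```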

As shown in Fig. 4, the sentence to be translated, the similar example sentence, the translation of the similar example sentence, and the translation of the sentence to be translated are displayed; the difference words of the similar example sentence and the sentence to be translated may be highlighted, and the candidate translation fragments are also highlighted. The confidence level of each difference word pair is displayed on the right at the same time. The highlighting is not limited to the manner shown in Fig. 4.

The above is a detailed description of the method provided by the present invention. The device for calculating sentence similarity provided by the present invention is described below through Embodiment Four.

Embodiment Four

Fig. 5 is a structural diagram of the device for calculating sentence similarity provided by Embodiment Four of the present invention. As shown in Fig. 5, the device may include: a sentence comparison unit 501, a difference word scoring unit 502, a difference word pair scoring unit 503, and a similarity determining unit 504.

The sentence comparison unit 501 compares sentence E1 with sentence E2 to determine the difference word pairs.

In essence, the words in sentence E1 and sentence E2 are compared, and the words that differ are determined and form difference word pairs.
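One way to obtain difference word pairs is to align the two word sequences and keep the positions where one word was replaced by another. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment; this choice, the whitespace tokenization, and the function name `difference_word_pairs` are assumptions made for illustration, since the text does not prescribe an alignment method.

```python
from difflib import SequenceMatcher


def difference_word_pairs(e1, e2):
    """Return (word in E1, word in E2) pairs for aligned words that differ."""
    w1, w2 = e1.split(), e2.split()
    pairs = []
    matcher = SequenceMatcher(a=w1, b=w2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only substitutions of equal length, so that words pair up one to one.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(w1[i1:i2], w2[j1:j2]))
    return pairs
```

For the sentences used in Embodiment Three, "Can I take a picture of the painting" and "Can we take a photo of the painting", this yields the pairs ("I", "we") and ("picture", "photo").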

The difference word scoring unit 502 scores each difference word using the collocation probabilities between the difference word of a difference word pair and the other words in the sentence E1 or E2 that contains it. The collocation probability between two words is obtained by querying a collocation probability model, in which the collocation probability between two words is obtained from statistics of the co-occurrence counts of the two words in a preset corpus.

The collocation probability model is built in advance by counting the co-occurrences between words in the preset corpus, from which the collocation probabilities between words are obtained. The larger the collocation probability between words, the more strongly it reflects a word's dependence on its context in the sentence and its editing risk. When scoring each difference word, the collocation probabilities between the difference word and the other words in its sentence may be obtained separately, and the obtained collocation probabilities are combined to give the scoring result of the difference word.

For example, the difference word scoring unit 502 may score each difference word according to the following formula:

where r(w_i, E) is the scoring result of difference word w_i; E is the sentence containing difference word w_i, which may be sentence E1 or sentence E2; w_j is a word in E other than w_i; r(w_i, w_j) is the collocation probability of w_i and w_j; and m is the number of words in E.
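The scoring of a single difference word can be illustrated as follows. The formula itself is not reproduced in the text above, so this sketch assumes one plausible way of combining the collocation probabilities, namely averaging r(w_i, w_j) over the other words of the sentence. The averaging, the dictionary-based `collocation` model keyed by unordered word pairs, and the default of 0.0 for unseen pairs are all assumptions.

```python
def word_score(word, sentence_words, collocation):
    """Score a difference word from its collocation probabilities with the
    other words of its sentence (assumed here to be a simple average).

    `collocation` maps frozenset({word_a, word_b}) to a probability.
    """
    others = [w for w in sentence_words if w != word]
    if not others:
        return 0.0
    total = sum(collocation.get(frozenset((word, other)), 0.0) for other in others)
    return total / len(others)
```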

The difference word pair scoring unit 503 determines the score of a difference word pair using the scoring results of the difference words in that pair.

The similarity determining unit 504 determines the similarity between sentence E1 and sentence E2 using the scoring results of the difference word pairs. That is, the scoring results of all difference word pairs between sentence E1 and sentence E2 are combined, for example by summing them, to determine the similarity between sentence E1 and sentence E2.

The difference word pair scoring unit 503 may score a difference word pair in two ways, corresponding to method Embodiment One and Embodiment Two respectively, as follows:

First way: the difference word pair scoring unit 503 may score a difference word pair according to the following formula:

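S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2};

or,

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2).

where S(w, \tilde{w}) is the score of the difference word pair formed by difference words w and \tilde{w}, r(w, E_1) is the scoring result of difference word w in sentence E1, r(\tilde{w}, E_2) is the scoring result of difference word \tilde{w} in sentence E2, and \alpha_1, \alpha_2, \beta_1 and \beta_2 are preset weight parameters.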
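The two forms of the first way, together with the summation that the similarity determining unit 504 may use, can be written directly in Python. The default weight values are hypothetical placeholders; the text only says that the weights are preset.

```python
def pair_score_product(r1, r2, alpha1=1.0, alpha2=1.0):
    """S = r(w, E1)^alpha1 * r(w~, E2)^alpha2."""
    return (r1 ** alpha1) * (r2 ** alpha2)


def pair_score_weighted_sum(r1, r2, beta1=0.5, beta2=0.5):
    """S = beta1 * r(w, E1) + beta2 * r(w~, E2)."""
    return beta1 * r1 + beta2 * r2


def sentence_similarity(pair_scores):
    """Combine the scores of all difference word pairs by summing them."""
    return sum(pair_scores)
```

Here `r1` and `r2` are the scoring results of the two difference words in their respective sentences.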

Second way: as shown in Fig. 5, the device further includes a similarity distance determining unit 505, configured to determine the feature vectors of the two difference words in a difference word pair and to calculate the similarity distance between the two difference words using those feature vectors.

In this case, when determining the score of a difference word pair, the difference word pair scoring unit 503 further uses the similarity distance between the two difference words in the pair.

Specifically, in the second way, the similarity distance determining unit 505 may query the collocation probability model and form the feature vector of a difference word from the words whose collocation probability with that difference word reaches a preset collocation probability threshold.

The similarity distance determining unit 505 may calculate the similarity distance between two difference words according to the following formula:

dist(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})),

where F(w) and F(\tilde{w}) are the feature vectors of difference word w and difference word \tilde{w} respectively, and \mathrm{Cosine}(F(w), F(\tilde{w})) is the cosine of the angle between F(w) and F(\tilde{w}).

The cosine of the angle may be calculated using any of several specific formulas in the prior art; one example is formula (5) in Embodiment Two.
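The feature vector construction and the similarity distance can be sketched as follows. Several details are assumptions rather than statements of the text: feature vectors are stored as sparse dictionaries whose weights are the collocation probabilities themselves, the standard cosine formula is used in place of formula (5), the value 1.0 for A is a placeholder, and the collocation model is a dictionary keyed by unordered word pairs.

```python
import math


def feature_vector(word, collocation, threshold):
    """Collect the words whose collocation probability with `word` reaches `threshold`."""
    vec = {}
    for pair, prob in collocation.items():
        others = pair - {word}
        if word in pair and len(others) == 1 and prob >= threshold:
            (other,) = others
            vec[other] = prob
    return vec


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_distance(u, v, a=1.0):
    """dist = A - Cosine(F(w), F(w~)); the value of A is an assumed placeholder."""
    return a - cosine(u, v)
```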

In the second way, the difference word pair scoring unit 503 may score a difference word pair according to the following formula:

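S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot dist(w, \tilde{w})^{\alpha_3};

or,

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot dist(w, \tilde{w}).

where S(w, \tilde{w}) is the score of the difference word pair formed by difference words w and \tilde{w}, r(w, E_1) is the scoring result of difference word w in sentence E1, r(\tilde{w}, E_2) is the scoring result of difference word \tilde{w} in sentence E2, dist(w, \tilde{w}) is the similarity distance between difference word w and difference word \tilde{w}, and \alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2 and \beta_3 are preset weight parameters.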
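The second way differs from the first only by the additional similarity distance term. A minimal sketch, with hypothetical default weights of 1.0:

```python
def pair_score_product_with_distance(r1, r2, dist, alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """S = r(w, E1)^alpha1 * r(w~, E2)^alpha2 * dist(w, w~)^alpha3."""
    return (r1 ** alpha1) * (r2 ** alpha2) * (dist ** alpha3)


def pair_score_sum_with_distance(r1, r2, dist, beta1=1.0, beta2=1.0, beta3=1.0):
    """S = beta1 * r(w, E1) + beta2 * r(w~, E2) + beta3 * dist(w, w~)."""
    return beta1 * r1 + beta2 * r2 + beta3 * dist
```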

Embodiment Five

Fig. 6 is a structural diagram of the machine translation device provided by Embodiment Five of the present invention. As shown in Fig. 6, the device may include: a device 600 for calculating sentence similarity, a similar example sentence selection unit 610, and a translation forming unit 620.

The device 600 for calculating sentence similarity calculates the similarity between the sentence to be translated and the sentences in a preset example sentence library; its structure may be as shown in Fig. 5.

The similar example sentence selection unit 610 selects the N sentences with the highest similarity as the similar example sentences of the sentence to be translated, where N is a preset positive integer. As a preferred embodiment, N is usually 1.

The translation forming unit 620 obtains the translation of the sentence to be translated using the translation of the similar example sentence.

Because the example sentence library contains a very large number of example sentences, having the device 600 for calculating sentence similarity calculate the similarity to the sentence to be translated for every sentence in the library one by one would be inefficient. To improve efficiency, the machine translation device may further include an initial selection unit 630, configured to determine the sentences in the example sentence library whose edit distance to the sentence to be translated satisfies a preset requirement. For example, the sentences whose edit distance is less than a preset threshold may be selected, or the sentences ranked in the top M by edit distance may be selected, where M is a preset positive integer.

Correspondingly, the device 600 for calculating sentence similarity only needs to calculate the similarity between the sentence to be translated and the sentences determined by the initial selection unit 630.

The translation forming unit 620 may specifically include: a difference word identification subunit 621, a fragment construction subunit 622, and a translation forming subunit 623.

The difference word identification subunit 621 identifies the difference words between the sentence to be translated and the similar example sentence.

The fragment construction subunit 622 takes the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments.

The translation forming subunit 623 is configured to replace, in the translation of the similar example sentence, the translations of the corresponding difference words of the similar example sentence with the candidate translation fragments, to obtain the translation of the sentence to be translated.

The machine translation device may further include a display unit 640, configured to display, when the translation of the sentence to be translated is displayed, the similar example sentence used and the scoring result of each difference word pair between that similar example sentence and the sentence to be translated.

When the scoring result is displayed, the confidence level corresponding to the scoring result of a difference word pair may be determined according to a preset correspondence between scoring results and confidence levels (for example, the confidence level may be divided into three grades, high, medium, and low, according to the scoring result), and that confidence level is then displayed.

As a preferred display scheme, the sentence to be translated, the similar example sentence, the translation of the similar example sentence, and the translation of the sentence to be translated may be displayed, where the difference words of the similar example sentence and the sentence to be translated may be highlighted and the candidate translation fragments are also highlighted, as shown in Fig. 4; the confidence level of each difference word pair is displayed on the right at the same time.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for calculating sentence similarity, characterized in that the method comprises:

2. The method according to claim 1, characterized in that, in step B, each difference word is scored according to the following formula:

3. The method according to claim 1 or 2, characterized in that, in step C, a difference word pair is scored according to the following formula:

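S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2};

or,

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2);

where S(w, \tilde{w}) is the score of the difference word pair formed by difference words w and \tilde{w}, r(w, E_1) is the scoring result of difference word w in the first sentence E1, r(\tilde{w}, E_2) is the scoring result of difference word \tilde{w} in the second sentence E2, and \alpha_1, \alpha_2, \beta_1 and \beta_2 are preset weight parameters.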

4. The method according to claim 1, characterized in that the method further comprises: determining the feature vectors of the two difference words in a difference word pair, and calculating the similarity distance between the two difference words using the feature vectors of the two difference words;

5. The method according to claim 4, characterized in that the feature vector of a difference word is determined as follows:

6. The method according to claim 4, characterized in that the similarity distance between the two difference words is calculated according to the following formula:

dist(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})),

where F(w) and F(\tilde{w}) are the feature vectors of difference word w and difference word \tilde{w} respectively, and \mathrm{Cosine}(F(w), F(\tilde{w})) is the cosine of the angle between F(w) and F(\tilde{w}).

7. The method according to claim 4, 5 or 6, characterized in that, in step C, a difference word pair is scored according to the following formula:

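S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot dist(w, \tilde{w})^{\alpha_3};

or,

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot dist(w, \tilde{w});

where S(w, \tilde{w}) is the score of the difference word pair formed by difference words w and \tilde{w}, r(w, E_1) is the scoring result of difference word w in the first sentence E1, r(\tilde{w}, E_2) is the scoring result of difference word \tilde{w} in the second sentence E2, dist(w, \tilde{w}) is the similarity distance between difference word w and difference word \tilde{w}, and \alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2 and \beta_3 are preset weight parameters.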

8. A machine translation method, characterized in that the machine translation method comprises:

S1. calculating, using the method according to claim 1, the similarity between a sentence to be translated and the sentences in a preset example sentence library;

9. The machine translation method according to claim 8, characterized in that step S1 specifically comprises:

S12. calculating, using the method according to claim 1, the similarity between the sentence to be translated and the sentences determined in step S11.

10. The machine translation method according to claim 8, characterized in that step S3 specifically comprises:

11. The machine translation method according to claim 8, characterized in that the machine translation method further comprises: when displaying the translation of the sentence to be translated, displaying the similar example sentence used and the scoring result of each difference word pair between that similar example sentence and the sentence to be translated.

12. A device for calculating sentence similarity, characterized in that the device comprises:

13. The device according to claim 12, characterized in that the difference word scoring unit scores each difference word according to the following formula:

14. The device according to claim 12 or 13, characterized in that the difference word pair scoring unit scores a difference word pair according to the following formula:

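S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2};

or,

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2);

where S(w, \tilde{w}) is the score of the difference word pair formed by difference words w and \tilde{w}, r(w, E_1) is the scoring result of difference word w in the first sentence E1, r(\tilde{w}, E_2) is the scoring result of difference word \tilde{w} in the second sentence E2, and \alpha_1, \alpha_2, \beta_1 and \beta_2 are preset weight parameters.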

15. The device according to claim 12, characterized in that the device further comprises: a similarity distance determining unit, configured to determine the feature vectors of the two difference words in a difference word pair and to calculate the similarity distance between the two difference words using the feature vectors of the two difference words;

16. The device according to claim 15, characterized in that the similarity distance determining unit queries the collocation probability model and forms the feature vector of a difference word from the words whose collocation probability with that difference word reaches a preset collocation probability threshold.

17. The device according to claim 15, characterized in that the similarity distance determining unit calculates the similarity distance between the two difference words according to the following formula:

dist(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})),

where F(w) and F(\tilde{w}) are the feature vectors of difference word w and difference word \tilde{w} respectively, and \mathrm{Cosine}(F(w), F(\tilde{w})) is the cosine of the angle between F(w) and F(\tilde{w}).

18. The device according to claim 15, 16 or 17, characterized in that the difference word pair scoring unit scores a difference word pair according to the following formula:

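S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot dist(w, \tilde{w})^{\alpha_3};

or,

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot dist(w, \tilde{w});

where S(w, \tilde{w}) is the score of the difference word pair formed by difference words w and \tilde{w}, r(w, E_1) is the scoring result of difference word w in the first sentence E1, r(\tilde{w}, E_2) is the scoring result of difference word \tilde{w} in the second sentence E2, dist(w, \tilde{w}) is the similarity distance between difference word w and difference word \tilde{w}, and \alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2 and \beta_3 are preset weight parameters.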

19. A machine translation device, characterized in that the machine translation device comprises:

the device for calculating sentence similarity according to claim 12, configured to calculate the similarity between a sentence to be translated and the sentences in a preset example sentence library;

20. The machine translation device according to claim 19, characterized in that the machine translation device further comprises: an initial selection unit, configured to determine the sentences in the example sentence library whose edit distance to the sentence to be translated satisfies a preset requirement;

21. The machine translation device according to claim 19, characterized in that the translation forming unit specifically comprises:

22. The machine translation device according to claim 19, characterized in that the machine translation device further comprises: a display unit, configured to display, when the translation of the sentence to be translated is displayed, the similar example sentence used and the scoring result of each difference word pair between that similar example sentence and the sentence to be translated.