CN101286161B

CN101286161B - Intelligent Chinese request-answering system based on concept

Info

Publication number: CN101286161B
Application number: CN2008100478554A
Authority: CN
Inventors: 张茂元; 邹春燕; 杨付全; 卢正鼎; 赵冰心; 余毅; 刘明
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2008-05-28
Filing date: 2008-05-28
Publication date: 2010-10-06
Anticipated expiration: 2028-05-28
Also published as: CN101286161A

Abstract

The invention discloses a concept-based Chinese question answering system that mainly comprises a data server, a question preprocessing module, a candidate question set extraction module, and a question similarity calculation module. The system extracts keywords from the question entered by the user and expands them with synonyms, so that it understands the question better at the concept level and improves the recall of retrieval. It also provides a concept-based method for computing Chinese sentence similarity from three aspects (word form, word order, and length), which improves retrieval precision. In addition, the system uses an efficient retrieval technique to extract the candidate question set quickly, computes the similarity between each candidate and the user's question, sorts the candidates by similarity, and returns the sorted questions and their answers to the user. As a result, the system understands the user's question more precisely at the concept level and finds accurate answers. Experiments show that the system achieves high recall and precision.

Description

A concept-based intelligent Chinese question answering system

Technical field

The invention belongs to the field of information retrieval, and specifically relates to a concept-based question answering retrieval system. Such a question answering system improves on conventional information retrieval systems and is an advanced form of information retrieval: it answers questions that users pose in natural language with accurate, concise language.

Background technology

In the 21st century, people have fully entered the information age, and the demand for online information grows daily. However, the inherent scale, heterogeneity, distribution, and dynamics of the network, together with the large amount of unorganized and invalid data on the Web, reduce the efficiency with which people can use its abundant information resources, producing the phenomenon of "information overload". In recent years, the rapid development of networks and information technology, combined with people's desire to obtain information more quickly, has driven the development of automatic question answering technology. A growing number of companies and research institutions take part in research on automatic question answering; well-known participants include Microsoft, IBM, MIT, and the University of Zurich. The well-known Text Retrieval Conference (TREC) in the United States set up a QA Track in 1999, providing an evaluation platform for question answering systems. Some relatively mature question answering systems have been developed abroad. In China, a number of universities and research institutions also study automatic question answering, including the Computer Department of the Chinese Academy of Sciences, Harbin Institute of Technology, Fudan University, Beijing Institute of Technology, and the Hong Kong University of Science and Technology. On the whole, however, few research institutions work on Chinese automatic question answering, and there is essentially no mature Chinese natural language question answering system.

A question answering system is a computer program that can answer questions that a computer user enters in natural language. Question answering combines natural language processing, information retrieval, and knowledge representation, and is increasingly a focus of research worldwide. It lets the user ask in natural language and returns a concise, accurate answer rather than a list of related web pages. Compared with traditional search engines that rely on keyword matching, a question answering system therefore better satisfies the user's search needs, locates the required answer more accurately, and is convenient, fast, and efficient.

The human-machine interface, accuracy, and real-time performance are the three major research and development goals of a Chinese natural language question answering system, and accuracy is the primary one. To achieve it, the user's question must be processed correctly, including word segmentation and part-of-speech tagging, synonym expansion, named entity tagging, syntactic analysis, and answer type tagging. For a question answering system based on a library of frequently asked questions, computing the similarity between the user's question and the questions in the library is the core of the system, and the accuracy and efficiency of that computation determine the accuracy and efficiency of the whole system.

Summary of the invention

The object of the present invention is to provide a concept-based intelligent Chinese question answering system with high recall and precision.

The concept-based intelligent Chinese question answering system provided by the invention comprises a data server, an input module, and a display module, and is characterized in that it further comprises a question preprocessing module, a candidate question set extraction module, and a question similarity calculation module;

The data server stores the corpus, the index database, the XML documents, and the question base;

The input module receives the question entered by the user, checks that the question is well-formed, and submits correctly formatted questions to the question preprocessing module;

The question preprocessing module receives the question passed on by the input module, calls the knowledge base and the rule base in the data server to preprocess it, and passes the result to both the candidate question set module and the question similarity calculation module;

The candidate question set extraction module quickly extracts a candidate question set based on the preprocessing result provided by the question preprocessing module, supplying the question similarity calculation module with the objects to compare;

The question similarity calculation module computes the similarity between the query question and each question in the candidate question set. The Chinese sentence similarity calculation expands the keyword string of the query question with synonyms and, using the expanded result, calls the word-form similarity method, then the word-order similarity method and the sentence-length similarity method, to compute the word-form similarity, the word-order similarity, and the sentence-length similarity respectively; the three are then weighted to obtain the final similarity of the questions;

The word-form similarity method computes the word-form similarity Simword according to formula (I):

$$\mathrm{Simword}(S_1, S_2) = \frac{2\,\bigl(\lambda_1 \cdot \mathrm{SameWord}(S_1, S_2) + \lambda_2 \cdot \mathrm{SimWord}(S_1, S_2)\bigr)}{\mathrm{Len}(S_1) + \mathrm{Len}(S_2)} \qquad \text{(I)}$$

In the formula, S1 and S2 are two sentences, SameWord(S1, S2) is the number of identical words contained in S1 and S2, SimWord(S1, S2) is the number of synonyms contained in S1 and S2, and λ1 and λ2 represent the importance of SameWord(S1, S2) and SimWord(S1, S2) respectively;

The word-order similarity method computes the word-order similarity Simord according to formula (II):

$$\mathrm{Simord}(S_1, S_2) = \begin{cases} 1 - \dfrac{\mathrm{RevOrd}(S_1, S_2)}{\lvert \lambda_1 \cdot \mathrm{OnceSameWord}(S_1, S_2) + \lambda_2 \cdot \mathrm{OnceSimWord}(S_1, S_2) \rvert - 1} & \text{otherwise} \\ 1 & \lvert \lambda_1 \cdot \mathrm{OnceSameWord}(S_1, S_2) + \lambda_2 \cdot \mathrm{OnceSimWord}(S_1, S_2) \rvert = 1 \\ 0 & \lvert \lambda_1 \cdot \mathrm{OnceSameWord}(S_1, S_2) + \lambda_2 \cdot \mathrm{OnceSimWord}(S_1, S_2) \rvert = 0 \end{cases} \qquad \text{(II)}$$

In the formula, S1 and S2 are two sentences, OnceSameWord(S1, S2) is the set of identical words that occur only once in S1 and S2, OnceSimWord(S1, S2) is the set of synonyms that occur only once in S1 and S2, Pfirst(S1, S2) is the vector formed by the position numbers in S1 of the words in OnceSameWord(S1, S2) and OnceSimWord(S1, S2), Psecond(S1, S2) is the vector obtained by reordering the components of Pfirst(S1, S2) according to the order of the corresponding words in S2, and RevOrd(S1, S2) is the number of inversions between adjacent components of Psecond(S1, S2);

The sentence-length similarity method computes the sentence-length similarity Simlen according to formula (III):

$$\mathrm{Simlen}(S_1, S_2) = 1 - \frac{\operatorname{abs}\bigl(\mathrm{Len}(S_1) - \mathrm{Len}(S_2)\bigr)}{\mathrm{Len}(S_1) + \mathrm{Len}(S_2)} \qquad \text{(III)}$$

Len(S1) and Len(S2) represent the lengths of sentences S1 and S2 respectively, and abs denotes the absolute value;

The display module, according to the result of the question similarity calculation module, returns the answers and related information for the corresponding questions in the question base to the user who submitted the query question.

The system of the invention understands the user's Chinese question at the concept level, expands the keywords in the question with synonyms, and supports retrieval with questions expressed in natural language, which improves the recall of the question answering system. The system also considers the word form, word order, and length of the question together, which improves the precision of question retrieval. In addition, the system uses an efficient retrieval technique to extract a candidate question set quickly from the question base, computes the similarity between each candidate question and the user's question, sorts the candidate set by similarity, and returns the sorted questions and their answers to the user. These measures ensure that a concise, accurate answer is returned quickly. The system was developed and implemented with precision, retrieval efficiency, and recall as its main indicators, in line with the requirements of accuracy and real-time performance. Experimental results show that it achieves the expected effect. Specifically, the invention has the following advantages:

(1) High precision: using natural language processing techniques, the system processes the keywords of the query question at the concept level. Exploiting the fact that synonyms express the same concept in a sentence, it expands the keyword string of the query question with synonyms and computes the word-form similarity, then combines it with the word-order and sentence-length similarities to compute the overall question similarity. This yields highly accurate matching between the original query question and the questions preselected from the question base, so that accurate results are retrieved quickly and the user's retrieval needs are met.

(2) High retrieval efficiency: the system adopts an efficient information retrieval technique to extract the candidate question set quickly. It uses the keyword strings of the questions as index terms to build a compact index database, and the index is organized as an inverted list, which greatly improves retrieval efficiency. The retrieval module can therefore extract the preselected question set quickly, improving the efficiency of the whole system.

(3) High recall: the system understands the user's Chinese question at the concept level and expands the keywords in the question with synonyms, which broadens the semantic information of the submitted query question. It supports retrieval with questions expressed in natural language and makes the candidate question set more accurate. This improves the recall of the candidate question set and, in turn, the recall of the question answering system, helping the user obtain correct results.

Description of drawings

Fig. 1 is the architecture diagram of the concept-based intelligent Chinese question answering system of the invention.

Fig. 2 is a schematic diagram of the module structure of the concept-based Chinese question answering system of the invention.

Fig. 3 is the flowchart of the question preprocessing module.

Fig. 4 is the flowchart of the retrieval module.

Fig. 5 is the flowchart of the candidate question set module.

Fig. 6 is the flowchart of the sentence similarity calculation.

Fig. 7 is the flowchart of the display module.

Embodiment

The invention is described in further detail below with reference to the accompanying drawings and examples.

As shown in Fig. 1, the concept-based intelligent Chinese question answering system provided by the invention comprises a data server 100, an input module 200, a question preprocessing module 300, a candidate question set extraction module 400, a question similarity calculation module 500, and a display module 600.

The data server 100 stores the corpus, the index database, the XML documents, and the question base. It provides knowledge and rule support for the question preprocessing module 300, and provides the index and the retrieval objects for the candidate question set extraction module 400.

The input module 200 receives the question entered by the user and checks that it is well-formed, ensuring that correctly formatted questions are submitted to the question preprocessing module 300.

The question similarity calculation module 500 uses the concept-based Chinese sentence similarity algorithm designed for the system to compute the similarity between the query question and each question in the candidate question set. The Chinese sentence similarity calculation expands the keyword string of the query question with synonyms and, using the expanded result, calls the word-form similarity method, then the word-order similarity method and the sentence-length similarity method, to compute the word-form similarity, the word-order similarity, and the sentence-length similarity respectively. The three are then weighted to obtain the final similarity of the questions.

The question preprocessing module 300 receives the question passed on by the input module 200 and calls the knowledge base and the rule base in the data server 100 to preprocess it, including operations such as Chinese word segmentation, part-of-speech tagging, and keyword extraction. It passes the result to both the candidate question set extraction module 400 and the question similarity calculation module 500.

The candidate question set extraction module comprises an index module, a retrieval module, and a candidate question set module. It quickly extracts the candidate question set (the set of questions related to the query question), supplying the question similarity calculation module with the objects to compare.

The display module 600, according to the result of the question similarity calculation module 500, returns the answers and related information for the corresponding questions in the question base to the user who submitted the query question.

The data server 100, the question preprocessing module 300, the candidate question set extraction module 400, and the sentence similarity calculation module 500 are each described in further detail below by way of example.

As shown in Fig. 2 (schematic diagram of the module structure of the concept-based Chinese question answering system):

The data server 100 stores the corpus, which comprises a knowledge base 110 and a rule base 120, as well as the index database 130, the XML documents 140, and the question base 150. It provides knowledge and rule support for the question preprocessing module 300; it also provides the index source for the index module 410 and the retrieval objects for the candidate question set module 430.

The corpus holds basic resources that carry linguistic knowledge in computer-readable form. It consists of language material that actually occurred in real language use and that has been processed (analyzed and handled).

The knowledge base comprises a concept synonym expansion knowledge base, a dictionary, and a dictionary knowledge base. The rule base comprises a part-of-speech rule base and a sentence-component rule base.

The question preprocessing module 300 receives the question passed on by the input module 200 and calls the knowledge base 110 and the rule base 120 to preprocess it, including operations such as Chinese word segmentation, part-of-speech tagging, and keyword extraction. It passes the result to both the candidate question set extraction module 400 and the question similarity calculation module 500.

As shown in Fig. 3, the question preprocessing module 300 first performs lexical analysis on the user's query question, using the Chinese word segmentation module 310 and the part-of-speech tagging module 320. The keyword extraction module 330 then extracts keywords according to rules on the importance of each part of speech in a sentence (nouns, verbs, pronouns, and adjectives are usually the most important) and filters out stop words using a stop-word list. The extracted keywords are then expanded using the concept expansion knowledge base 110 (generated from the shareware thesaurus "Tongyici Cilin", a Chinese synonym dictionary). The preprocessing module 300 thus produces a set of intermediate results that meet the requirements;

The processing flow of the question preprocessing module 300 is: (1) input the question; (2) check the format of the question, and if the format is incorrect, return to (1); (3) perform Chinese word segmentation and part-of-speech tagging on the question; (4) call the stop-word list and apply the sentence-component importance rules to analyze the question for keyword extraction; (5) extract the keywords of the question; (6) output the keyword string.

Chinese word segmentation module 310: this module segments words with the reverse maximum matching method, using the dictionary knowledge base as its language resource. Suppose the longest entry in the dictionary contains i Chinese characters. The first i characters of the current string in the text being processed are taken as the matching field and looked up in the dictionary. If the dictionary contains such an i-character word, the match succeeds and the matching field is cut off as a word. If no such word is found, the match fails; the last character is removed from the matching field and the remaining characters are matched again as the new matching field. This continues until a match succeeds.

Let the longest word in the dictionary consist of MaxNum characters, and let the sentence length, i.e. the number of characters in the sentence, be Len. The array S[N-1] stores a sentence of length N; i, j, k, and position are variables; wik denotes the segmentation unit formed by S[i] to S[wik+i]; dik is the attribute of the segmentation unit denoted by wik, such as its position in the dictionary and its part of speech; the function match(S[i], S[i+j]) determines whether the string S[i]～S[i+j] is a word in the dictionary.

The flow of the Chinese word segmentation module 310 is as follows: 1) input the sentence, call the dictionary knowledge base, and start dictionary matching from the end of the sentence; if matching is finished, go to 3). 2) Determine whether the string S[i]～S[i+j] runs past the end of the sentence and whether it is a word in the dictionary; if it is a word, cut the matching field off as a word; if no such word is found in the dictionary, the match fails, the last character is removed from the matching field, and the remaining characters are matched again as the new matching field; return to 1). 3) Output the segmentation result.
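
The matching loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `backward_max_match`, the toy dictionary, and the single-character fallback for text that matches no dictionary entry are all hypothetical. Because the flow starts matching from the end of the sentence, the sketch shrinks a failed candidate from its left edge, which is the usual form of reverse maximum matching; the description above speaks of removing the last character instead.

```python
def backward_max_match(sentence, dictionary, max_len=None):
    """Segment `sentence` by reverse (backward) maximum matching.

    `dictionary` is any container of known words. `max_len` plays the role
    of MaxNum; if omitted it is derived from the longest dictionary entry.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    words = []
    end = len(sentence)
    while end > 0:
        # Try the longest candidate that ends at `end`, then shorten it.
        for size in range(min(max_len, end), 0, -1):
            candidate = sentence[end - size:end]
            # Assumption: a single character is always accepted as a word.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from the sentence tail forward
    return words


toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(backward_max_match("研究生命起源", toy_dictionary))
```

With this toy dictionary the sentence is cut into 研究 / 生命 / 起源, whereas matching from the front would first cut off 研究生.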

Part-of-speech tagging module 320: using the result of the Chinese word segmentation module 310, this module calls the part-of-speech rule base to tag the segmented words, determining the most suitable part-of-speech tag for each word according to its context in the sentence.

The flow is as follows: 1) Take a word string Span from the segmentation result. For each word in the string, look it up in the part-of-speech rule base. If it is found, take out all part-of-speech tags of the word and record them in the array Tags[i][j], where i is the index of the word and j is the index of the tag, and record the number of occurrences of the word with each tag in the array Freqs[i][j]. If it is not found, assign the open-class tag to the word, record it in Tags[i][j], and set Freqs[i][j] to 1. 2) For each possible part-of-speech tag of each word in the string, (1) compute the accumulated value of the tag and (2) record the best predecessor tag of the tag. Once the tag of the last word in the string has been determined, take out the best predecessor tag of each word in turn to obtain the tagging result. Reinitialize the word-string working data in preparation for tagging the next word string, and return to 1).
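
Step 2) reads like a Viterbi-style dynamic program: keep an accumulated score per candidate tag, remember the best predecessor, and backtrack from the last word. The sketch below illustrates that idea only. The text does not say how the accumulated value is computed, so the log-frequency scoring, the `transition` table of tag-to-tag probabilities, the floor value for unseen transitions, and the tag names (`n`, `v`, `r`) are all assumptions.

```python
import math


def tag_words(words, lexicon, transition, open_tag="n", floor=1e-6):
    """Pick one tag per word with a Viterbi-style search.

    lexicon:    word -> {tag: frequency}   (stands in for Tags / Freqs)
    transition: (previous_tag, tag) -> probability   (hypothetical table)
    """
    if not words:
        return []
    # Unknown words receive the open-class tag with frequency 1.
    options = [lexicon.get(word, {open_tag: 1}) for word in words]
    score = [{} for _ in words]  # best accumulated log-score per tag
    back = [{} for _ in words]   # best predecessor tag per tag
    for i, tags in enumerate(options):
        total = sum(tags.values())
        for tag, freq in tags.items():
            emit = math.log(freq / total)
            if i == 0:
                score[i][tag], back[i][tag] = emit, None
                continue
            prev, best = max(
                ((p, s + math.log(transition.get((p, tag), floor)))
                 for p, s in score[i - 1].items()),
                key=lambda pair: pair[1],
            )
            score[i][tag], back[i][tag] = best + emit, prev
    # Backtrack from the best tag of the last word.
    tag = max(score[-1], key=score[-1].get)
    result = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = back[i][tag]
        result.append(tag)
    result.reverse()
    return list(zip(words, result))
```

A word with several candidate tags, such as a word that can be a noun or a verb, is resolved by whichever tag path scores best given its neighbours.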

Keyword extraction module 330 extracts keywords according to rules on the importance of each part of speech in a sentence (nouns, verbs, pronouns, and adjectives are usually the most important) and filters out stop words using a stop-word list. Let S be a sentence, w any word in S, and S' the keyword sequence of S. The flow is as follows: 1) take a word w from S and look it up in the stop-word list; if w is not a stop word, go to 2); if all words have been taken, go to 4); 2) call the sentence-component rule base to determine whether w is a noun, pronoun, verb, or adjective, and if so, extract w; 3) read the next word and return to 1); 4) form the keyword sequence S' from all keywords extracted from S, and return S'.
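
The filtering flow above can be sketched as a single pass over the tagged words. This is a minimal sketch: the function name `extract_keywords`, the tag abbreviations (`n` noun, `r` pronoun, `v` verb, `a` adjective), and the sample stop-word list are assumptions made for illustration; the patent does not name a tag set.

```python
# Assumed tag abbreviations for noun, pronoun, verb, and adjective.
CONTENT_TAGS = frozenset({"n", "r", "v", "a"})


def extract_keywords(tagged_words, stop_words, content_tags=CONTENT_TAGS):
    """Return the keyword sequence S' of a tagged sentence S.

    tagged_words: list of (word, tag) pairs from the tagging step.
    stop_words:   set of words to discard regardless of their tag.
    """
    keywords = []
    for word, tag in tagged_words:
        if word in stop_words:
            continue  # step 1: stop words are filtered out
        if tag in content_tags:
            keywords.append(word)  # step 2: keep content-bearing words
    return keywords


tagged = [("如何", "r"), ("安装", "v"), ("的", "u"), ("打印机", "n"), ("驱动", "n")]
print(extract_keywords(tagged, stop_words={"的", "如何"}))
```

Word order is preserved so that the later word-order similarity can still use keyword positions.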

The candidate question set extraction module comprises an index module 410, a retrieval module 420, and a candidate question set module 430. It extracts the candidate question set quickly and supplies the question similarity calculation module with the objects to compare.

The purpose of candidate question retrieval is to confine complex subsequent processing, such as similarity calculation, to the relatively small scope of the candidate question set, which requires efficient retrieval. The candidate question set is a relatively small subset of loosely related questions that is quickly drawn from the large question collection, so this function can be implemented with information retrieval techniques. On the one hand, an efficient retrieval technique can be chosen so that retrieval efficiency is high; on the other hand, the module is easy to improve and upgrade and is highly portable.

By retrieving efficiently and quickly locating similar questions in the question base, the candidate question set extraction module 400 supplies the sentence similarity calculation module 500 with its set of questions, and therefore plays a very important role.

The index module 410 builds the index database 130 from the content of the question base (stored as XML) provided by the data server 100. The keyword string items in the XML are used as index terms, and the index database 130 is built from the index terms and the associated document information. As the question base 150 is updated, the index is built incrementally and the index database 130 is updated.
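
An inverted list of the kind described here maps each index term to the questions that contain it. The sketch below is one possible in-memory form, assuming the keyword strings have already been read out of the XML; the function names `build_index` and `add_question` and the use of integer question IDs are hypothetical.

```python
def add_question(index, question_id, keywords):
    """Incrementally index one question (used when the question base grows)."""
    for keyword in keywords:
        # Each index term points to the set of question IDs that contain it.
        index.setdefault(keyword, set()).add(question_id)


def build_index(question_keywords):
    """Build an inverted index from {question_id: [keyword, ...]}."""
    index = {}
    for question_id, keywords in question_keywords.items():
        add_question(index, question_id, keywords)
    return index


index = build_index({1: ["打印机", "驱动"], 2: ["打印机", "卡纸"]})
add_question(index, 3, ["驱动", "下载"])  # incremental update
print(sorted(index["驱动"]))
```

Because `add_question` only touches the postings of the new question's keywords, an update does not require rebuilding the whole index.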

The retrieval module 420 works on data exported from the question base 150 and stored in the XML documents 140, and uses the index database 130 to search the XML documents 140 quickly.

As shown in Fig. 4, the processing flow of the retrieval module 420 is: (1) input the keyword string of the query question and use it as the search terms; (2) call the index database to carry out retrieval; (3) determine whether the keyword string is empty; if it is empty, return to (1), otherwise go to (4); (4) search and return the ID numbers of the questions related to the keyword string; (5) output the question ID numbers.
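
The retrieval flow can be sketched as a lookup against an inverted index held as a plain dictionary. Two points are assumptions rather than statements from the text: a question counts as "related" if it shares at least one keyword with the query (a union of postings), and an empty keyword string is reported with an exception instead of looping back to the input step. The name `retrieve_candidate_ids` is hypothetical.

```python
def retrieve_candidate_ids(keywords, index):
    """Return sorted IDs of questions sharing at least one keyword.

    index: dict mapping keyword -> set of question IDs (an inverted list).
    """
    if not keywords:
        # Step (3): an empty keyword string cannot be searched.
        raise ValueError("keyword string is empty")
    hits = set()
    for keyword in keywords:
        hits |= index.get(keyword, set())  # step (4): collect related IDs
    return sorted(hits)  # step (5): output the question IDs


index = {"打印机": {1, 2}, "驱动": {1, 3}}
print(retrieve_candidate_ids(["打印机", "驱动"], index))
```

Taking the union keeps recall high at this stage; the similarity calculation that follows is what ranks the candidates.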

The candidate question set module 430 submits the intermediate result provided by the question preprocessing module 300 to the retrieval module 420 as the search term string. It calls the retrieval module 420 to search the XML documents 140, parses the XML documents 140, and obtains the ID numbers of the corresponding questions in the question base 150.

As shown in Fig. 5, the processing flow of this module is: (1) input the ID numbers of the retrieved questions; (2) look up the corresponding questions in the question base; (3) determine whether a question corresponding to the ID exists, and if not, return to (2); (4) output the keyword strings of the questions in the question set.

The sentence similarity calculation module 500 computes the similarity between the query question and the questions in the candidate question set, which directly affects the retrieval result. It is a core module of the question answering system.

As shown in Fig. 6, this module uses the concept-based Chinese sentence similarity method designed for the system to compute the similarity between the query question and each question in the candidate question set. The keyword-string synonym expansion module 510 expands the keyword string of the query question; using the expanded result, the module calls the word-form similarity calculation module 530, and then the word-order similarity calculation module 520 and the sentence-length similarity calculation module 540, to obtain the word-form similarity, the word-order similarity, and the sentence-length similarity respectively. The sentence similarity calculation submodule 550 is then called to weight the three and obtain the similarity of the questions.

The processing flow is: (1) input the keyword strings of the query question and of the questions in the preselected question set (obtained from the candidate question set extraction module 400); (2) call the concept expansion knowledge base, expand the keyword string of the query question with synonyms, and compute the word-form similarity; (3) count the identical words in the two keyword strings and compute the sentence-length similarity; (4) determine the word order of the query keywords relative to the same keywords in the candidate question, and compute the word-order similarity; (5) weight the similarity results of (2), (3), and (4), compute the question similarity, and output it.

The internal modules of the sentence similarity calculation module 500 are explained in detail below.

As shown in Fig. 2, the sentence similarity calculation module 500 comprises the keyword-string synonym expansion module 510, the word-form similarity calculation module 530, the word-order similarity calculation module 520, the sentence-length similarity calculation module 540, and the sentence similarity calculation submodule 550.

Before the functions and implementation steps of each module are described, the relevant background is introduced as follows:

Introduction of related concepts:

(1) Definition 1: word-form (morphological) similarity reflects how similar two sentences are in form, and is measured by the number of identical words or synonyms that the two sentences contain. Let S1 and S2 be two sentences; their word-form similarity is:

$$\mathrm{Simword}(S_1, S_2) = \frac{2\,\bigl(\lambda_1 \cdot \mathrm{SameWord}(S_1, S_2) + \lambda_2 \cdot \mathrm{SimWord}(S_1, S_2)\bigr)}{\mathrm{Len}(S_1) + \mathrm{Len}(S_2)} \qquad (1.1)$$

In the formula, SameWord(S1, S2) is the number of identical words contained in S1 and S2, SimWord(S1, S2) is the number of synonyms contained in S1 and S2, and λ1 and λ2 represent the importance of SameWord(S1, S2) and SimWord(S1, S2) respectively. When a word occurs a different number of times in S1 and S2, the smaller count is used; Len(S) is the number of words in sentence S. Meaning: the more identical words or synonyms two sentences share, the more similar they are;
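
Formula (1.1) can be illustrated with a short Python function. This is a sketch under assumptions: sentences are given as lists of already segmented words, `synonyms` is a hypothetical mapping from a word to its set of synonyms, each word is paired with at most one synonym in the other sentence, and both weights default to 1 because the text does not give values for them.

```python
from collections import Counter


def sim_word(s1, s2, synonyms, lam_same=1.0, lam_syn=1.0):
    """Word-form similarity of two word lists, following formula (1.1)."""
    if not s1 and not s2:
        return 0.0  # assumption: two empty sentences are not compared
    c1, c2 = Counter(s1), Counter(s2)
    common = c1 & c2                  # multiset intersection: smaller count wins
    same = sum(common.values())       # SameWord(S1, S2)
    rest1 = (c1 - common).elements()  # words of S1 not matched exactly
    rest2 = c2 - common               # unmatched words of S2, with counts
    syn = 0                           # SimWord(S1, S2)
    for word in rest1:
        for candidate in synonyms.get(word, ()):
            if rest2[candidate] > 0:
                rest2[candidate] -= 1  # each word of S2 is paired only once
                syn += 1
                break
    return 2.0 * (lam_same * same + lam_syn * syn) / (len(s1) + len(s2))


synonyms = {"电脑": {"计算机"}, "无法": {"不能"}}
print(sim_word(["电脑", "无法", "开机"], ["计算机", "不能", "开机"], synonyms))
```

In this example one word is identical and two are synonyms, so with both weights equal to 1 the two three-word sentences score 2 × 3 / 6 = 1.0; without the synonym table the score would drop to 2 × 1 / 6.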

(2) Definition 2: word-order similarity reflects how similar the positional relations of the identical words or synonyms in two sentences are, and is measured by the number of inversions between adjacent identical words or synonyms in the two sentences. Let S1 and S2 be two sentences, OnceSameWord(S1, S2) the set of identical words that occur only once in S1 and S2, OnceSimWord(S1, S2) the set of synonyms that occur only once in S1 and S2, Pfirst(S1, S2) the vector formed by the position numbers in S1 of the words in OnceSameWord(S1, S2) and OnceSimWord(S1, S2), Psecond(S1, S2) the vector obtained by reordering the components of Pfirst(S1, S2) according to the order of the corresponding words in S2, and RevOrd(S1, S2) the number of inversions between adjacent components of Psecond(S1, S2) (the total number of inversions relative to the standard ordering). The word-order similarity of S1 and S2 is then:

$$\mathrm{Simord}(S_1, S_2) = \begin{cases} 1 - \dfrac{\mathrm{RevOrd}(S_1, S_2)}{\lvert \lambda_1 \cdot \mathrm{OnceSameWord}(S_1, S_2) + \lambda_2 \cdot \mathrm{OnceSimWord}(S_1, S_2) \rvert - 1} & \text{otherwise} \\ 1 & \lvert \lambda_1 \cdot \mathrm{OnceSameWord}(S_1, S_2) + \lambda_2 \cdot \mathrm{OnceSimWord}(S_1, S_2) \rvert = 1 \\ 0 & \lvert \lambda_1 \cdot \mathrm{OnceSameWord}(S_1, S_2) + \lambda_2 \cdot \mathrm{OnceSimWord}(S_1, S_2) \rvert = 0 \end{cases} \qquad (1.2)$$

The advantage of defining word-order similarity in this way is that a sentence in which a clause or phrase has been moved a long distance as a whole remains very similar to the original sentence. The measure is also fast to compute: the algorithm complexity is O(m), where m = |OnceWord(S1, S2)|;
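
The adjacent-inversion count in formula (1.2) can be sketched as follows. To stay short, the sketch makes simplifying assumptions: only identical words are considered (synonyms would first have to be mapped to a common form), both weights are taken as 1 so that the denominator is simply the number of shared once-only words minus 1, and positions are zero-based. The function name `sim_order` is hypothetical.

```python
def sim_order(s1, s2):
    """Word-order similarity of two word lists, following formula (1.2)."""
    # Words that occur exactly once in each sentence, in S1 order.
    once = [w for w in s1 if s1.count(w) == 1 and s2.count(w) == 1]
    m = len(once)
    if m == 0:
        return 0.0
    if m == 1:
        return 1.0
    # Psecond: positions in S1, listed in the order the words occur in S2.
    p_second = [s1.index(w) for w in sorted(once, key=s2.index)]
    # RevOrd: adjacent pairs that are out of order.
    rev_ord = sum(1 for a, b in zip(p_second, p_second[1:]) if a > b)
    return 1.0 - rev_ord / (m - 1)


print(sim_order(["A", "B", "C", "D"], ["C", "D", "A", "B"]))
```

Moving the block "C D" to the front produces only one adjacent inversion, so the score stays at 1 - 1/3, which illustrates why a long-distance move of a whole phrase costs little under this definition.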

(3) Definition 3: sentence-length similarity. Len(S1) and Len(S2) represent the lengths of sentences S1 and S2 respectively, i.e. the number of words in each sentence. The sentence-length similarity Simlen(S1, S2) is determined by formula (1.3):

$$\mathrm{Simlen}(S_1, S_2) = 1 - \frac{\operatorname{abs}\bigl(\mathrm{Len}(S_1) - \mathrm{Len}(S_2)\bigr)}{\mathrm{Len}(S_1) + \mathrm{Len}(S_2)} \qquad (1.3)$$

It follows easily that Simlen(S1, S2) ∈ [0, 1]. Meaning: the closer the lengths of two sentences, the more similar they are. For example, if Len(S1) = 11 and Len(S2) = 8, then Simlen(S1, S2) ≈ 0.84;
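
The worked figure can be checked directly from formula (1.3): 1 - |11 - 8| / (11 + 8) = 1 - 3/19 ≈ 0.84. A minimal Python version is shown below; the function name `sim_len` and the handling of two empty sentences are assumptions.

```python
def sim_len(len1, len2):
    """Sentence-length similarity from two word counts, following formula (1.3)."""
    if len1 + len2 == 0:
        return 1.0  # assumption: two empty sentences have equal length
    return 1.0 - abs(len1 - len2) / (len1 + len2)


print(round(sim_len(11, 8), 2))  # 0.84
```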

(4) Definition 4: sentence similarity reflects the degree of similarity between two sentences. It is usually a value between 0 and 1, where 0 means not similar at all, 1 means completely similar, and a larger value means the two sentences are more similar. The final similarity Sim(S1, S2) of sentences S1 and S2 is determined by formula (1.4):

$$\mathrm{Sim}(S_1, S_2) = \lambda_1 \cdot \mathrm{Simword}(S_1, S_2) + \lambda_2 \cdot \mathrm{Simorder}(S_1, S_2) + \lambda_3 \cdot \mathrm{Simlen}(S_1, S_2) \qquad (1.4)$$

Here λ1, λ2, and λ3 are constants satisfying λ1 + λ2 + λ3 = 1, so clearly Sim(S1, S2) ∈ [0, 1]. Word-form similarity plays the main role in sentence similarity, while sentence-length similarity and word-order similarity play secondary roles, so the values should satisfy λ1 ≫ λ2, λ3. In the formula, Simword(S1, S2) is the word-form similarity of S1 and S2, Simorder(S1, S2) is their word-order similarity, and Simlen(S1, S2) is their sentence-length similarity. Determined by experiment, λ1 = 0.9, λ2 = 0.05, and λ3 = 0.05.
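
The weighting in formula (1.4), followed by the sort that orders the candidates for display, can be sketched as below. The weights are the experimental values quoted above; the function names `combined_similarity` and `rank_candidates` and the representation of candidates as (question ID, score) pairs are assumptions.

```python
WEIGHTS = (0.9, 0.05, 0.05)  # λ1, λ2, λ3 as reported in the text


def combined_similarity(word_sim, order_sim, length_sim, weights=WEIGHTS):
    """Weighted sum of the three similarities, following formula (1.4)."""
    lam1, lam2, lam3 = weights
    if abs(lam1 + lam2 + lam3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return lam1 * word_sim + lam2 * order_sim + lam3 * length_sim


def rank_candidates(scored):
    """Sort (question_id, similarity) pairs, most similar first."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


scores = [
    ("q1", combined_similarity(0.4, 1.0, 0.9)),
    ("q2", combined_similarity(1.0, 0.5, 0.84)),
]
print(rank_candidates(scores))
```

Because λ1 dominates, a candidate with a high word-form similarity outranks one that only matches in word order and length.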

The functions and implementation steps of the keyword-string synonym expansion module 510, the word-form similarity calculation module 530, the word-order similarity calculation module 520 and the word-length similarity calculation module 540 are as follows:

The keyword-string synonym expansion module 510 mainly performs synonym expansion on the input keyword string. The specific implementation steps are as follows: 1) input the keyword string keywords1 of the retrieval question sentence and the keyword string keywords2 of the candidate question set question sentence; 2) call the concept expansion knowledge base to perform synonym concept expansion on keywords1; the result of expanding keywords1 is stored in the character string extendkeywords, completing the synonym expansion.
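The expansion step can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the dictionary CONCEPT_KB is a hypothetical stand-in for the concept expansion knowledge base, its entries are invented for illustration, and keyword strings are assumed to be represented as lists of words.

```python
# Hypothetical stand-in for the concept expansion knowledge base:
# each keyword maps to a set of synonyms (entries are illustrative only).
CONCEPT_KB = {
    "computer": {"PC", "machine"},
    "buy": {"purchase"},
}


def expand_keywords(keywords1, kb=CONCEPT_KB):
    """Return keywords1 followed by every synonym found in kb, without duplicates."""
    extendkeywords = list(keywords1)
    seen = set(keywords1)
    for word in keywords1:
        # sorted() only makes the output order deterministic.
        for synonym in sorted(kb.get(word, ())):
            if synonym not in seen:
                seen.add(synonym)
                extendkeywords.append(synonym)
    return extendkeywords
```

Keeping the original keywords at the front of extendkeywords means a candidate question can match either on the word itself or on any of its synonyms.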

The word-form similarity calculation module 530 mainly calculates the word-form similarity of two sentences, which reflects how similar the two sentences are in word form and is measured by the number of identical words or synonyms the two sentences contain. The specific implementation steps are as follows: 1) the keyword-string synonym expansion module 510 passes in the keyword string keywords1 of the retrieval question sentence, the keyword string keywords2 of the candidate question set question sentence, and the character string extendkeywords obtained by expanding keywords1; 2) calculate the number of keywords wordsNum1 in keywords1 and the number of keywords wordsNum2 in keywords2; 3) calculate the number samenum of keywords common to extendkeywords and keywords2; 4) substitute into the formula 2.0*samenum/(wordsNum1+wordsNum2) to calculate the word-form similarity simword;
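The four steps above can be sketched in Python as follows. This is an illustrative sketch only: keyword strings are assumed to be lists of words, and the handling of two empty keyword lists is an assumption that the text does not specify.

```python
def word_form_similarity(keywords1, keywords2, extendkeywords):
    """Compute simword = 2.0 * samenum / (wordsNum1 + wordsNum2)."""
    words_num1 = len(keywords1)
    words_num2 = len(keywords2)
    if words_num1 + words_num2 == 0:
        # Assumption: with no keywords at all, treat the sentences as dissimilar.
        return 0.0
    expanded = set(extendkeywords)
    # samenum: keywords of the candidate question that also appear in the
    # expanded keyword list of the retrieval question.
    samenum = sum(1 for word in keywords2 if word in expanded)
    return 2.0 * samenum / (words_num1 + words_num2)
```

For example, with keywords1 = ["buy", "computer"], extendkeywords = ["buy", "computer", "purchase", "PC"] and keywords2 = ["purchase", "computer", "price"], two candidate keywords match, giving 2.0*2/(2+3) = 0.8.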

The word-order similarity calculation module 520 mainly calculates the word-order similarity of two sentences, which reflects how similar the positional relationships of the identical words or synonyms contained in the two sentences are, and is measured by the number of adjacent-order inversions of those identical words or synonyms. The specific implementation steps are as follows: 1) the keyword-string synonym expansion module 510 passes in the keyword string keywords1 of the retrieval question sentence and the keyword string keywords2 of the candidate question set question sentence; 2) calculate the non-repeated keywords common to keywords1 and keywords2 and store them in the array oncesimwords; 3) calculate Pfirst(keywords1, keywords2), the vector formed by the position numbers in keywords1 of the words in oncesimwords; 4) calculate Psecond(keywords1, keywords2), the vector generated by sorting the components of Pfirst(keywords1, keywords2) according to the order of the corresponding words in keywords2; 5) calculate revord, the number of inversions between adjacent components of Psecond(keywords1, keywords2) (the total count of inversions relative to the standard ordering); 6) substitute into the formula 1-1.0*revord/(samenum-1) to calculate the word-order similarity simorder;
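A minimal Python sketch of these steps is given below under several assumptions: keyword strings are lists of words, "non-repeated common keywords" is read as words that occur exactly once in each list, the denominator uses the number of such words (following the structure of formula (1.2)), and the special cases for one or zero common words follow formula (1.2). Synonym matching is omitted for brevity.

```python
def word_order_similarity(keywords1, keywords2):
    """Compute simorder from adjacent inversions of the shared keywords."""
    # Words occurring exactly once in both lists (assumed reading of the text).
    oncesimwords = [
        w for w in keywords1
        if keywords1.count(w) == 1 and keywords2.count(w) == 1
    ]
    count = len(oncesimwords)
    if count == 0:
        return 0.0  # no shared word: third case of formula (1.2)
    if count == 1:
        return 1.0  # a single shared word: second case of formula (1.2)
    # Pfirst: position in keywords1 of each shared word.
    pfirst = {w: keywords1.index(w) for w in oncesimwords}
    # Psecond: the same positions, listed in the order the words appear in keywords2.
    psecond = [pfirst[w] for w in keywords2 if w in pfirst]
    # revord: number of adjacent pairs that are out of order.
    revord = sum(1 for a, b in zip(psecond, psecond[1:]) if a > b)
    return 1.0 - 1.0 * revord / (count - 1)
```

Because only adjacent pairs are compared, the computation is a single pass over Psecond, which matches the linear complexity claimed for the word-order measure once the shared words are known.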

The word-length similarity calculation module 540 mainly calculates the word-length similarity of two sentences, which reflects how similar the two sentences are in the number of words they contain and is measured by comparing those word counts. The specific implementation steps are as follows: 1) the keyword-string synonym expansion module 510 passes in the keyword string keywords1 of the retrieval question sentence and the keyword string keywords2 of the candidate question set question sentence; 2) calculate the number of keywords in keywords1 and store it in the integer variable wordsNum1; calculate the number of keywords in keywords2 and store it in the integer variable wordsNum2; 3) calculate the difference distince between the numbers of keywords in keywords1 and keywords2; 4) substitute into the formula 1.0-1.0*distince/(wordsNum1+wordsNum2) to calculate the word-length similarity simlen;
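The steps of this module can be sketched in Python as follows, taking formula (1.3) as the reference so that the numerator is the absolute difference in keyword counts. Keyword strings are assumed to be lists of words, and the value returned for two empty lists is an assumption not covered by the text.

```python
def word_length_similarity(keywords1, keywords2):
    """Compute simlen = 1 - |wordsNum1 - wordsNum2| / (wordsNum1 + wordsNum2)."""
    words_num1 = len(keywords1)
    words_num2 = len(keywords2)
    if words_num1 + words_num2 == 0:
        # Assumption: two empty keyword lists have equal length, so return 1.0.
        return 1.0
    # "distince" keeps the identifier used in the description.
    distince = abs(words_num1 - words_num2)
    return 1.0 - 1.0 * distince / (words_num1 + words_num2)
```

With 11 and 8 keywords, as in the example for formula (1.3), the function evaluates 1 - 3/19.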

The sentence similarity calculation submodule 550 combines the three measures according to how much the word-form similarity, word-order similarity and word-length similarity each contribute to sentence similarity; word-form similarity is the most closely related to the semantics of the sentence and has the highest importance. Good importance coefficients are obtained by experiment. The similarity of the question sentences is obtained by weighting the word-form similarity, word-order similarity and word-length similarity with these importance coefficients. The specific implementation steps are as follows: 1) the word-form similarity calculation module 530, the word-order similarity calculation module 520 and the word-length similarity calculation module 540 pass in the word-form similarity, word-order similarity and word-length similarity respectively; 2) substitute into the formula λ1*simword + λ2*simorder + λ3*simlen to calculate the sentence similarity similary; 3) output the sentence similarity similary.
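The weighting step can be sketched in Python as follows. The default weights are the experimentally chosen values 0.9, 0.05 and 0.05 given for formula (1.4); the function name and the check that the weights sum to 1 are illustrative additions rather than part of the disclosed system.

```python
def sentence_similarity(simword, simorder, simlen, weights=(0.9, 0.05, 0.05)):
    """Combine the three similarities as lam1*simword + lam2*simorder + lam3*simlen."""
    lam1, lam2, lam3 = weights
    # The coefficients are required to sum to 1 so the result stays in [0, 1].
    if abs(lam1 + lam2 + lam3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # "similary" keeps the identifier used in the description.
    similary = lam1 * simword + lam2 * simorder + lam3 * simlen
    return similary
```

For instance, simword = 0.8, simorder = 1.0 and simlen = 0.8 give 0.72 + 0.05 + 0.04 = 0.81, which shows how strongly the word-form term dominates the result.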

Claims

1. An intelligent Chinese question answering system based on concepts, comprising a data server (100), an input module (200) and a display module (600), characterized in that it further comprises a question preprocessing module (300), a candidate question set extraction module (400) and a question sentence similarity calculation module (500);

The data server (100) is used to store the corpus, the index database, the XML documents and the question base;

The input module (200) is used to receive the question entered by the user, check that the input question sentence is well-formed, and submit correctly formatted question sentences to the question preprocessing module (300);

The question preprocessing module (300) is used to receive the question sentence passed by the input module (200), call the knowledge base and rule base in the data server (100) to preprocess it, and pass the processed result to the candidate question set extraction module (400) and the question sentence similarity calculation module (500) respectively;

The candidate question set extraction module (400) is used to rapidly extract the candidate question set from the preprocessing result provided by the question preprocessing module (300), providing the objects of calculation for the question sentence similarity calculation module (500);

The question sentence similarity calculation module (500) is used to compute the similarity between the retrieval question sentence and the question sentences of the candidate question set; the Chinese sentence similarity calculation performs synonym expansion on the keyword string of the retrieval question sentence, uses the expansion result to call the word-form similarity calculation method, and further calls the word-order similarity calculation method and the word-length similarity calculation method, computing the word-form similarity, word-order similarity and word-length similarity respectively; the three are then weighted to compute the final similarity of the question sentences;

Simword(S1, S2) = 2*(λ1*SameWord(S1, S2) + λ2*SimWord(S1, S2)) / (Len(S1) + Len(S2))    formula (I)

\mathrm{Simord}(S_{1}, S_{2}) = \begin{cases} 1 - \dfrac{\mathrm{RevOrd}(S_{1}, S_{2})}{|\lambda_{1} \cdot \mathrm{OnceSameWord}(S_{1}, S_{2}) + \lambda_{2} \cdot \mathrm{OnceSimWord}(S_{1}, S_{2})| - 1} & \text{otherwise} \\ 1 & |\lambda_{1} \cdot \mathrm{OnceSameWord}(S_{1}, S_{2}) + \lambda_{2} \cdot \mathrm{OnceSimWord}(S_{1}, S_{2})| = 1 \\ 0 & |\lambda_{1} \cdot \mathrm{OnceSameWord}(S_{1}, S_{2}) + \lambda_{2} \cdot \mathrm{OnceSimWord}(S_{1}, S_{2})| = 0 \end{cases} \qquad \text{formula (II)}

Simlen(S1, S2) = 1 - abs(Len(S1) - Len(S2)) / (Len(S1) + Len(S2))    formula (III)

The display module (600) returns to the user who submitted the retrieval question sentence, according to the result of the question sentence similarity calculation module (500), the corresponding answers in the question base and the related information.

2. The intelligent Chinese question answering system based on concepts according to claim 1, characterized in that the question preprocessing module (300) comprises a Chinese word segmentation module (310), a part-of-speech tagging module (320) and a keyword extraction module (330);

The Chinese word segmentation module (310) adopts the reverse maximum matching method, uses the dictionary knowledge base as its language resource, and matches the text being processed against the entries in the dictionary to obtain the Chinese word segmentation;

The part-of-speech tagging module (320), working from the result of the Chinese word segmentation module (310), calls the part-of-speech rule base to tag the segmented words with parts of speech, determining the most suitable part-of-speech tag for each word in the sentence according to the contextual information in the sentence;

The keyword extraction module (330) extracts keywords according to rules on the importance of each part of speech in the sentence and filters out stop words using a stop-word list, obtaining the keyword string.

3. The intelligent Chinese question answering system based on concepts according to claim 1, characterized in that the candidate question set extraction module (400) comprises an index module (410), a retrieval module (420) and a candidate question set module (430);

The index module (410) is used to build and update the index database from the question base content provided by the data server (100);

The retrieval module (420) uses the index database (130) to retrieve the XML documents rapidly;

The candidate question set module (430) submits the intermediate processing result provided by the question preprocessing module (300) to the retrieval module (420) as the search term string, calls the retrieval module (420) to retrieve the XML documents, and parses the XML files to obtain the ID numbers of the corresponding questions in the question base.

4. The intelligent Chinese question answering system based on concepts according to claim 1, 2 or 3, characterized in that the question sentence similarity calculation module (500) comprises a keyword-string synonym expansion module (510), a word-order similarity calculation module (520), a word-form similarity calculation module (530), a word-length similarity calculation module (540) and a sentence similarity calculation submodule (550);

The keyword-string synonym expansion module (510) is used to perform synonym expansion on the input keyword string and pass the result to the word-form similarity calculation module (530);

The word-form similarity calculation module (530) performs the word-form similarity calculation on the received expanded keyword string, obtains the degree of similarity in word form of the two sentences from the number of identical words or synonyms they contain, and passes it to the sentence similarity calculation submodule (550);

The word-order similarity calculation module (520) receives the keyword string provided by the question preprocessing module (300), calculates the word-order similarity of the two sentences from the degree of similarity in positional relationship of the identical words or synonyms contained in the two sentences and from the number of adjacent-order inversions of those identical words or synonyms, and passes it to the sentence similarity calculation submodule (550);

The word-length similarity calculation module (540) receives the keyword string provided by the question preprocessing module (300), calculates the word-length similarity of the two sentences from the number of words contained in each of the two sentences and the degree of similarity between those numbers, and passes it to the sentence similarity calculation submodule (550);

The sentence similarity calculation submodule (550) performs a weighted calculation on the word-form similarity, word-order similarity and word-length similarity obtained, yielding the similarity of the question sentences.