CN100476800C - Method and system for cutting index participle - Google Patents

Publication number
CN100476800C
Authority
CN
China
Prior art keywords
character
participle
unit
stream
english
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB2007101230513A
Other languages
Chinese (zh)
Other versions
CN101071420A (en
Inventor
王启明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CNB2007101230513A
Publication of CN101071420A
Application granted
Publication of CN100476800C

Abstract

The invention discloses a method of index word segmentation, comprising the following steps: reading a character stream; recognizing the character stream to determine Chinese characters, English characters or digits, and unrecognizable characters; comparing the determined Chinese characters, English characters or digits with a pre-built dictionary tree to determine the matching segments; performing ASCII wildcard (general fuzzy) matching on the English characters or digits to determine the segments that are English strings or digit strings; sorting the matching segments, the English-string or digit-string segments, and the unrecognizable characters in the order of the character stream; and dividing the character stream according to that order. The invention also discloses a system for index word segmentation. The method and system simultaneously address segmentation accuracy, a certain amount of redundant words, and single-character segmentation, and so improve the user experience.

Description

A method and system for index word segmentation
Technical field
The present invention relates to the field of information indexing, and in particular to a method and system for index word segmentation.
Background art
Information retrieval systems are increasingly widespread, ranging from large web search engines to small application-specific retrieval systems. Whenever Chinese text has to be processed, an information retrieval system runs into the problem of word segmentation.
There are many word segmentation algorithms. N-gram segmentation is a mechanical method that needs no dictionary and is easy to implement, but its output is highly redundant and it cannot handle single-character words.
Bigram (binary) segmentation emits every pair of adjacent characters in a sentence and builds an inverted index over them. For example, the sentence glossed as "from the above implementation steps" yields segments glossed as "from last", "above-mentioned", "state real", "realize", "existing step", "step", and so on. Several of these, such as "state real" and "existing step", have no practical meaning. The method also cannot produce single-character segments and cannot segment English.
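As an illustration of the bigram scheme described above, the following minimal sketch emits every pair of adjacent characters. The function name `bigram_segments` and the sample string are illustrative assumptions, not taken from the patent.

```python
def bigram_segments(text: str) -> list[str]:
    """Return every pair of adjacent characters in `text`, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical sample input; a four-character string yields three pairs.
print(bigram_segments("中华人民"))  # ['中华', '华人', '人民']
```

No single-character segment is ever produced, and a pair such as 华人 is emitted whether or not it is a meaningful word in context, which is the redundancy the text criticizes.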
Maximum matching segmentation matches words on a longest-word-first basis. For example, the same sentence may be divided into "from", "above-mentioned", and "implementation steps". The method produces few segments, but they are not necessarily the shortest and not necessarily accurate. Because it yields no redundant words, recall may drop, which gives a poor experience in some applications.
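The recall problem of longest-word-first matching can be seen in a small sketch. This is a generic forward maximum matching routine written for illustration; the function name, the set-based dictionary, and the sample words are assumptions rather than anything specified in the patent.

```python
def forward_max_match(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-word-first segmentation with no overlapping output."""
    max_len = max((len(w) for w in dictionary), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                out.append(piece)
                i += size
                break
    return out


words = {"中华", "中华人", "人民"}  # hypothetical dictionary
print(forward_max_match("中华人民", words))  # ['中华人', '民']
```

Both 中华 and 人民 are in the dictionary, yet neither appears in the output, so a query for either word would miss this text.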
Segmentation based on statistics or semantic analysis must resolve ambiguity. Its results are fairly accurate but not necessarily comprehensive. It is also laborious to implement, and its complex analysis inevitably reduces segmentation efficiency to some degree, so it is not suitable for small application-specific retrieval systems.
Summary of the invention
An object of the present invention is to provide a method of index word segmentation that simultaneously achieves accurate segmentation, a certain amount of redundant words, and single-character segmentation, thereby improving the user experience.
A further object of the present invention is to provide a system for index word segmentation that likewise achieves accurate segmentation, a certain amount of redundant words, and single-character segmentation, thereby improving the user experience.
To solve the above technical problem, an embodiment of the invention provides a method of index word segmentation, comprising the following steps:
Reading a character stream;
Recognizing the character stream to determine Chinese characters, English characters or digits, and unrecognizable characters;
Comparing the determined Chinese characters, English characters or digits with a pre-built dictionary tree to determine the matching segments;
Performing ASCII wildcard matching (general fuzzy matching) on the English characters or digits to determine the segments that are English strings or digit strings;
Sorting the matching segments, the English-string or digit-string segments, and the unrecognizable characters in character-stream order;
Dividing the character stream according to the sorted segment order and the length of each segment and unrecognizable character.
Preferably, the dictionary tree is a pre-built trie data structure.
Preferably, the dictionary tree is a pre-built binary-stream dictionary structure.
Preferably, after the character stream is recognized, it is stored in an internal character buffer.
Preferably, before the character stream is stored in the internal character buffer, its characters are normalized to a uniform form.
Preferably, after the Chinese characters, English characters or digits, and unrecognizable characters are determined, punctuation marks are removed from the character stream.
Preferably, meaningless single characters are removed when the dictionary tree is built.
Preferably, after the character stream is divided according to the sorted segment order and the length of each segment and unrecognizable character, the method further comprises:
Periodically counting the frequency of received keywords;
Adding keywords whose frequency exceeds a predetermined value to the dictionary tree.
An embodiment of the invention provides a system for index word segmentation, comprising:
A reading unit, configured to read a character stream;
A character stream recognition unit, configured to recognize the character stream read by the reading unit and determine Chinese characters, English characters or digits, and unrecognizable characters;
A dictionary tree unit, which is a data structure unit storing a pre-built dictionary tree of words and phrases;
A comparing unit, configured to compare the Chinese characters, English characters or digits determined by the character stream recognition unit with the dictionary tree pre-built in the dictionary tree unit, and determine the matching segments;
A general fuzzy matching unit, configured to perform ASCII general fuzzy matching on the English characters or digits after comparison by the comparing unit, and determine the segments that are English strings or digit strings;
A segment management unit, configured to sort the segments determined by the comparing unit and the general fuzzy matching unit, together with the unrecognizable characters determined by the character stream recognition unit, in the order of the character stream read by the reading unit, and to record the length of each segment and unrecognizable character;
A segment division unit, configured to divide the character stream read by the reading unit according to the segment order recorded by the segment management unit and the length of each segment and unrecognizable character.
An embodiment of the invention also provides a system for index word segmentation, comprising:
A reading unit, configured to read a character stream;
A character stream recognition unit, configured to recognize the character stream read by the reading unit and determine Chinese characters, English characters or digits, and unrecognizable characters;
An internal character buffer unit, configured to store the character stream recognized by the character stream recognition unit;
A dictionary tree unit, which is a data structure unit storing a pre-built dictionary tree of words and phrases;
A comparing unit, configured to compare the Chinese characters, English characters or digits determined by the character stream recognition unit with the dictionary tree pre-built in the dictionary tree unit, and determine the matching segments;
A general fuzzy matching unit, configured to perform ASCII general fuzzy matching on the English characters or digits after comparison by the comparing unit, and determine the segments that are English strings or digit strings;
A segment management unit, configured to sort the segments determined by the comparing unit and the general fuzzy matching unit, together with the unrecognizable characters determined by the character stream recognition unit, in the order of the character stream stored in the internal character buffer unit, and to record the length of each segment and unrecognizable character;
A segment division unit, configured to divide the character stream stored in the internal character buffer unit according to the segment order recorded by the segment management unit and the length of each segment and unrecognizable character;
A dictionary adaptation unit, configured to count keyword frequencies through a pre-built statistics module and add keywords whose frequency exceeds a predetermined value to the dictionary tree unit.
The method of index word segmentation of the present invention simultaneously achieves accurate segmentation, a certain amount of redundant words, and single-character segmentation, and improves the user experience.
The method of the embodiment comprises: reading a character stream; recognizing the characters in the stream to determine Chinese characters, English characters or digits, and unrecognizable characters; comparing the determined Chinese characters, English characters or digits with a pre-built dictionary tree to determine the matching segments; performing ASCII wildcard matching on the English characters or digits to determine the segments that are English strings or digit strings; sorting the matching segments, the English-string or digit-string segments, and the unrecognizable characters by their order in the character stream; and dividing the character stream according to that order.
Because all Chinese characters, English characters and digits are compared with the pre-built dictionary tree before segments are divided, invalid words and phrases are avoided while a suitable amount of redundant words is retained. When a single character can stand as a word or has practical meaning, it is handled like any other entry in the dictionary tree, so a single character can be output as a segment. In addition, the invention adds an ASCII wildcard matching step, which effectively determines the segments that are English strings and digit strings.
The method of index word segmentation of the present invention therefore simultaneously achieves accurate segmentation, a certain amount of redundant words, and single-character segmentation, and improves the user experience.
Description of drawings
Fig. 1 is a flow chart of an embodiment of the method of the invention;
Fig. 2 is a schematic diagram of the trie structure of the invention;
Fig. 3 is a schematic diagram of the binary-stream dictionary tree structure of the invention;
Fig. 4 is a structural diagram of the first index word segmentation system of the invention;
Fig. 5 is a structural diagram of the second index word segmentation system of the invention.
Embodiments
The invention provides a method of index word segmentation that simultaneously achieves accurate segmentation, a certain amount of redundant words, and single-character segmentation, thereby improving the user experience.
To help those skilled in the art better understand the scheme of the invention, the invention is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, which is a flow chart of an embodiment of the method of the invention.
S10: Read a character stream.
The character stream that is read may contain Chinese characters, English characters, digits, and unrecognizable characters.
S20: Recognize the character stream and determine Chinese characters, English characters or digits, and unrecognizable characters.
Character recognition is performed on the character stream read in step S10 to determine whether each character is a Chinese character, an English character, a digit, or an unrecognizable character.
Because every character in the stream is classified, text in multiple character sets can be segmented easily.
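Step S20 can be pictured as a per-character classifier. The sketch below is only one possible reading: it assumes the stream has already been decoded to Unicode and treats the CJK Unified Ideographs block as "Chinese", whereas the patent does not say how recognition is implemented. The names `CharClass` and `classify` are hypothetical.

```python
from enum import Enum


class CharClass(Enum):
    CHINESE = "chinese"
    ENGLISH = "english"
    DIGIT = "digit"
    UNKNOWN = "unknown"   # the "unrecognizable character" of step S20


def classify(ch: str) -> CharClass:
    """Classify one character (assumption: Unicode input, basic CJK block only)."""
    if "\u4e00" <= ch <= "\u9fff":
        return CharClass.CHINESE
    if ch.isascii() and ch.isalpha():
        return CharClass.ENGLISH
    if ch.isascii() and ch.isdigit():
        return CharClass.DIGIT
    return CharClass.UNKNOWN


print([classify(c).value for c in "中a1!"])
# ['chinese', 'english', 'digit', 'unknown']
```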
S30: Compare the determined Chinese characters, English characters or digits with the pre-built dictionary tree and determine the matching segments.
The dictionary tree may be a collection of commonly used Chinese words or phrases, and may also contain words that mix English and Chinese characters.
The Chinese characters, English characters or digits determined in step S20 are compared with the pre-built dictionary tree to determine the matching segments, and an inverted index is built for the string corresponding to each segment. Inverted indexing is prior art and is not described in detail here.
When a character stream is segmented, an inverted index can be built for the following strings:
1. All substrings of the character stream that are present in the dictionary tree.
2. Each single Chinese character, which gives unigram (single-character) segmentation.
3. English words or English strings composed entirely of English characters.
4. Digit strings composed entirely of digit characters.
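The four kinds of indexed strings listed above can be collected with a short brute-force sketch. It substitutes a plain Python set for the dictionary tree and a regular expression for the ASCII matching step, so it illustrates what is indexed rather than how the patent finds it. The function name `index_strings` and the sample input are assumptions.

```python
import re


def index_strings(text: str, dictionary: set[str]) -> list[tuple[int, str]]:
    """Return (offset, string) pairs for the four kinds of indexed strings."""
    max_len = max((len(w) for w in dictionary), default=0)
    found = set()
    for i in range(len(text)):
        # 1. Every substring that is a dictionary entry.
        for size in range(1, min(max_len, len(text) - i) + 1):
            if text[i:i + size] in dictionary:
                found.add((i, text[i:i + size]))
        # 2. Every single Chinese character (assumption: basic CJK block).
        if "\u4e00" <= text[i] <= "\u9fff":
            found.add((i, text[i]))
    # 3 and 4. Maximal runs of ASCII letters, and maximal runs of ASCII digits.
    for m in re.finditer(r"[A-Za-z]+|[0-9]+", text):
        found.add((m.start(), m.group()))
    # Order by position in the stream, shorter strings first at the same offset.
    return sorted(found, key=lambda s: (s[0], len(s[1])))


print(index_strings("中华2007", {"中华"}))
# [(0, '中'), (0, '中华'), (1, '华'), (2, '2007')]
```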
S40: Perform ASCII general fuzzy matching on the English characters or digits and determine the segments that are English strings or digit strings.
ASCII general fuzzy matching preserves the integrity of segments: when an English string or digit string forms a word by itself, it can be marked as one segment.
Suppose the character stream to be segmented is "useU盘" and the dictionary tree contains the word "U盘" (USB flash disk). When the wildcard scan meets the first English character "u", it scans forward to the last English character of the string, obtains the complete English string "useU" from the character stream, and marks it as an English-string segment.
S50: Sort the matching segments, the English-string or digit-string segments, and the unrecognizable characters in character-stream order.
While the segments of the whole character stream are being ordered, Chinese segments are already ordered by the position of their characters in the stream. When an English string or digit string is encountered, however, the generated segments may be out of order or even repeated; in that case they are ordered by the position of their characters in the character stream. The segments finally obtained are thus all arranged in the order in which they occur in the original character stream, and no separate sorting module needs to be called.
S60: Divide the character stream according to the segment order and the length of each segment and unrecognizable character.
Since step S50 has put the segments in order, the character stream only needs to be divided according to that order and the length of each segment and unrecognizable character.
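Steps S50 and S60 amount to keeping a list of (offset, length) records in stream order and slicing the stream with them. The sketch below is a simplification: it calls Python's built-in `sorted` for brevity, whereas the text indicates that the segments can be kept in order as they are produced. The `Segment` record and the function name are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    offset: int   # position of the segment's first character in the stream
    length: int   # number of characters in the segment


def order_and_cut(stream: str, segments: list[Segment]) -> list[str]:
    """Drop repeats, order by (offset, length), then slice the stream."""
    ordered = sorted(set(segments))   # tuples compare by offset, then length
    return [stream[s.offset:s.offset + s.length] for s in ordered]


# Out-of-order and repeated records, as may arise around an English string.
records = [Segment(3, 2), Segment(0, 4), Segment(0, 3), Segment(0, 4)]
print(order_and_cut("useU盘", records))  # ['use', 'useU', 'U盘']
```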
Because all Chinese characters, English strings and digits are compared with the pre-built dictionary tree before segments are divided, invalid words and phrases are avoided while a suitable amount of redundant words is retained. When a single character can stand as a word or has practical meaning, it is handled like any other entry in the dictionary tree, so a single character can be output as a segment. In addition, the invention adds an ASCII general fuzzy matching step, which effectively determines the segments that are English strings and digit strings. The method of index word segmentation of the present invention therefore simultaneously achieves accurate segmentation, a certain amount of redundant words, and single-character segmentation, and improves the user experience.
Existing methods such as bigram and trigram segmentation need no dictionary tree, but they generate many words, many of them meaningless, and they cannot segment mixed Chinese-English words. The method of this embodiment uses a dictionary tree; because segmentation is based on the dictionary tree, meaningless words are effectively avoided.
According to real data, the final index produced by the method of this embodiment is about twice the size of a bigram index, but this is because single Chinese characters are also segmented, which provides unigram segmentation. In a preferred scheme, meaningless single characters, or single characters that never need to be searched, are removed when the dictionary tree is built.
In terms of efficiency, the method of this embodiment is slightly slower than n-gram segmentation because of the added dictionary-tree lookup. At search time, however, words in the dictionary tree are segmented along the shortest path, which significantly reduces the number of index lookups and greatly improves search performance.
Compared with maximum matching, the method of this embodiment performs full segmentation, so recall can reach 100%, which is more comprehensive than maximum matching. This is sufficient for small-scale information retrieval.
Segmentation with the method of this embodiment is fairly fast. In a trial run under Linux, with ordinary web pages as the corpus and a large segmentation dictionary tree of more than 200,000 words occupying about 80 MB of memory, the method processed 3 to 4 MB of data per second.
Building an index involves many stages, of which segmentation is the first. Relative to the other stages, processing 3 to 4 MB per second is fast, because stages such as index merging require hard disk I/O and are comparatively slow. The segmentation method of the invention therefore does not become a bottleneck in the indexing process.
Taking the character stream "useU盘" as an example, the segmentation process is as follows.
Every character is looked up. The first English character, "u", is marked as the start of a segment of this English string.
The character after "u" is "s". "us" is not found in the dictionary, so this position is not the end of a segment and nothing is output.
Scanning continues in the same way. At the second "U", the word "U盘" is found in the dictionary, which shows that this position is also the start of a segment.
The scan then backtracks: the text from the previously marked start, the first "u", up to the current position forms the segment "use", which is output. The current position is marked as the start of a segment, and scanning continues.
The final output is: "use", "useU", "U盘".
General fuzzy matching of digit strings is analogous to the process for English strings and is not described in detail here.
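The walk-through above, for the stream "useU" followed by 盘 ("useU dish", with the dictionary word "U盘", USB flash disk), can be reproduced with the following sketch. It is one interpretation of the described scan, not the patent's implementation: it uses a set in place of the dictionary tree, handles only runs of ASCII letters, and the function name is hypothetical.

```python
def ascii_run_segments(text: str, start: int, dictionary: set[str]) -> list[str]:
    """Segments for the run of ASCII letters that begins at text[start]."""
    end = start
    while end < len(text) and text[end].isascii() and text[end].isalpha():
        end += 1
    max_len = max((len(w) for w in dictionary), default=0)

    segments = {(start, text[start:end])}   # the complete run, e.g. "useU"
    mark = start                            # marked start of the pending segment
    for pos in range(start, end):
        # Dictionary words that begin at this position (may extend past the run).
        words = [text[pos:pos + n] for n in range(1, max_len + 1)
                 if text[pos:pos + n] in dictionary]
        if words:
            if pos > mark:
                # Backtrack: output the text between the mark and this position.
                segments.add((mark, text[mark:pos]))
            segments.update((pos, w) for w in words)
            mark = pos                      # this position starts a new segment
    ordered = sorted(segments, key=lambda t: (t[0], len(t[1])))
    return [s for _, s in ordered]


print(ascii_run_segments("useU盘", 0, {"U盘"}))  # ['use', 'useU', 'U盘']
```

Under this reading, a digit string would be handled by the same loop with `isdigit` in place of `isalpha`.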
A preferred implementation of the method of the invention comprises the following steps:
S10: Read a character stream.
S20: Recognize the character stream and determine Chinese characters, English characters or digits, and unrecognizable characters.
S21: Store the recognized character stream in an internal character buffer.
Before the character stream is stored in the internal character buffer, its characters may also be normalized to a uniform form.
S30: Compare the determined Chinese characters, English characters or digits with the pre-built dictionary tree and determine the matching segments.
S40: Perform ASCII general fuzzy matching on the English characters or digits and determine the segments that are English strings or digit strings.
S50: Sort the matching segments, the English-string or digit-string segments, and the unrecognizable characters in character-stream order.
S60: Divide the character stream stored in the internal character buffer according to the segment order and the length of each segment and unrecognizable character.
Step S60 may be followed by:
S61: Periodically count the frequency of received keywords.
The frequency of received keywords can be counted periodically by a pre-built statistics module. The module counts keyword frequencies and may be any existing general-purpose module for counting user keywords. How such a module works is common knowledge and is not described here.
S62: Add keywords whose frequency exceeds a predetermined value to the dictionary tree.
Depending on their industry or specific needs, operators can periodically count the keywords that users enter and add those whose frequency exceeds a predetermined value to the dictionary tree. The predetermined value can be set as needed.
The high-frequency keywords can be added to the dictionary tree and the index then rebuilt. The dictionary tree thereby becomes adaptive, which greatly improves retrieval speed.
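Steps S61 and S62 can be sketched with a frequency counter. The patent leaves the statistics module unspecified, so the use of `collections.Counter`, the set-based dictionary, the function name, and the sample query log below are all illustrative assumptions.

```python
from collections import Counter


def promote_frequent_keywords(query_log: list[str],
                              dictionary: set[str],
                              threshold: int) -> set[str]:
    """Add keywords seen more than `threshold` times; return the new entries."""
    counts = Counter(query_log)
    promoted = {kw for kw, n in counts.items()
                if n > threshold and kw not in dictionary}
    dictionary |= promoted   # in-place update; the index would then be rebuilt
    return promoted


words = {"中华"}
log = ["U盘", "U盘", "U盘", "rare"]   # hypothetical keyword log
print(promote_frequent_keywords(log, words, threshold=2))  # {'U盘'}
```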
The dictionary tree of this embodiment may be a pre-built map or hash_map (hash table) data structure. With a map or hash_map, however, the data must be looked up twice, and there is no way to know at which character a segment ends, so the whole sentence has to be traversed to find the segments.
Preferably, the dictionary tree is a pre-built trie. A trie meets the requirements well: it can find several words in a single lookup pass and easily locates where each segment ends. Many mature trie models already exist and are not repeated here.
Referring to Fig. 2, which is a schematic diagram of the trie structure of the invention.
Each box in Fig. 2 represents a node corresponding to one Chinese character or other character. A box marked with "*" indicates that the string formed by the path from the root node to that node is a word or phrase in the dictionary tree. The number in each box is the number of that node.
For example, take the recognized character stream glossed as "middle Chinese is great", whose first four characters are 中, 华, 人 and 是. Traversal starts from 中. When 是 fails to match, two words have been obtained: 中华 ("China") and 中华人 ("middle Chinese"). Because no word beginning with 中 continues past 人, there is no need to scan further. Lookup in a trie-based dictionary tree is therefore efficient.
The string is scanned from front to back to find all dictionary words that start at the first position, as follows:
A. The first character is 中, which matches node 1 in the trie.
B. The second character is 华. Among the child nodes of node 1 (中), node 2 matches.
C. The third character is 人. Among the child nodes of node 2 (华), node 3 matches.
D. The fourth character is 是. No child node of node 3 (人) represents 是, so the lookup exits.
Starting from 中, the two segments 中华 and 中华人 are thus found, because node 2 (华) and node 3 (人) carry the "*" end-of-word mark in the dictionary. The paths from node 1 to nodes 2 and 3, written 1-2 and 1-2-3, represent the segments 中华 and 中华人 respectively.
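The lookup traced in steps A to D can be written as a small trie. This is a generic dictionary-of-children trie given as a sketch; the class and method names are assumptions, and the `is_word` flag plays the role of the "*" mark in Fig. 2. The sample words are 中华 ("China") and 中华人 ("middle Chinese").

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # maps a character to the child node
        self.is_word = False   # plays the role of the "*" mark in Fig. 2


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def words_starting_at(self, text: str, start: int) -> list[str]:
        """All dictionary words that begin at text[start], found in one pass."""
        node, found = self.root, []
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:          # mismatch: stop scanning
                break
            if node.is_word:          # end-of-word mark reached
                found.append(text[start:i + 1])
        return found


trie = Trie(["中华", "中华人"])
print(trie.words_starting_at("中华人是", 0))  # ['中华', '中华人']
```

A single descent yields both words and stops at the first mismatch, which is the advantage over a hash table noted in the text.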
As the lookup above shows, in the worst case the number of matching steps equals the length of the longest word in the dictionary tree. In practice the average length is small and there are no very long words, so the lookup can be regarded as a constant-time algorithm with complexity O(1).
This O(1) complexity follows from the algorithm itself, which is mainly a sequential scan that recognizes characters and looks them up in the dictionary tree. As described above, looking up a string in the dictionary tree takes a number of steps equal to the length of the longest word starting at that position. Because word lengths in a real dictionary tree are bounded, the average number of steps is a constant c, typically c < 5. When English or digit characters are encountered, backtracking is needed to find the segments, but segments that begin with an English string are rare, special words, so their effect on the overall complexity is negligible.
Character recognition is also constant-time per character, so for a character stream of length n the overall complexity is O(k × n), which is linear. The constant factor k averages about 10.
In a preferred mode, two character buffers are kept: one stores the input character stream, and the other stores the processed output character stream together with the position and length of each segment. In practice, long sentences without delimiters are rare, so the buffer capacity can be fixed. When a buffer cannot hold the input, the segments already in the buffer are returned first, the buffer is refilled, and segmentation continues. Space complexity expresses the memory a program needs as a function of the input size, here the sentence length; in this case the function is a fixed value that does not grow with sentence length. Space complexity is therefore constant and independent of the length of the character stream.
The character buffer capacity is preferably 1024 bit; other values can of course be set as needed.
The dictionary tree of the invention can also be a binary-stream dictionary tree, which keeps the number of child nodes of each node small. In a binary-stream dictionary tree, child nodes are located directly by array index. English character strings can also be stored in the binary-stream dictionary tree, which makes it possible to segment mixed Chinese-English words.
Fig. 3 is a schematic diagram of the structure of the binary-stream dictionary tree of the invention.
Each node of the binary-stream dictionary tree does not represent a Chinese character but a group of several bits split from the binary data.
Suppose each group is 4 bits. A Chinese character occupies two bytes and therefore needs 4 nodes; an ASCII character needs 2 nodes.
For example, the binary form of the character "中" is 0xd0d6. Split into 4-bit groups from the low-order bits to the high-order bits, it yields the four data blocks 6, d, 0 and d, so the tree for the character "中" can be represented in the form shown in Fig. 3.
"6" corresponds to the first node 31, the first "d" to the second node 32, "0" to the third node 33, and the second "d" to the fourth node 34.
When the method of the invention uses the binary-stream dictionary tree, the character string that is read is treated as a binary stream.
For example, to look up "中华人", whose binary format is 0xd0d6 0xaabb 0xcbc8, the string is converted into the 4-bit digit stream 6 d 0 d b b a a 8 c b c. During the lookup the digits are matched in turn: 6 matches the first node 31, the first d matches the second node 32, 0 matches the third node 33, and the second d matches the fourth node 34. The fourth node 34 carries an end-of-word mark, so 6 d 0 d forms one word in this lookup; restoring it to the binary format 0xd0d6 gives the word "中". Each node thus has at most 16 child nodes, because 4 bits can represent only 16 distinct values, which greatly saves space. The binary-stream dictionary tree is an improved implementation of the trie.
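A minimal sketch of such a 4-bit dictionary tree is given below. It assumes the text is encoded as GBK, under which the first character of the example occupies the bytes D6 D0, and it emits the low 4 bits of each byte before the high 4 bits, which reproduces the low-to-high order of the example. The names `nibbles`, `Node` and `NibbleTrie` and the two dictionary entries are hypothetical and are not taken from the patent.

```python
from typing import Iterator, List, Optional


def nibbles(text: str, encoding: str = "gbk") -> Iterator[int]:
    """Yield the 4-bit groups of the encoded text, low nibble of each byte first."""
    for byte in text.encode(encoding):
        yield byte & 0x0F
        yield byte >> 4


class Node:
    __slots__ = ("children", "is_word")

    def __init__(self) -> None:
        # A child is located directly by array index; 4 bits give at most 16 children.
        self.children: List[Optional["Node"]] = [None] * 16
        self.is_word = False  # end-of-word mark


class NibbleTrie:
    def __init__(self) -> None:
        self.root = Node()

    def insert(self, word: str) -> None:
        node = self.root
        for n in nibbles(word):
            if node.children[n] is None:
                node.children[n] = Node()
            node = node.children[n]
        node.is_word = True

    def prefixes(self, text: str, encoding: str = "gbk") -> List[str]:
        """Return every stored word that is a prefix of `text`."""
        data = text.encode(encoding)
        node = self.root
        found: List[str] = []
        for i, byte in enumerate(data):
            for n in (byte & 0x0F, byte >> 4):
                node = node.children[n]
                if node is None:
                    return found
            if node.is_word:
                # Restore the matched bytes to text.
                found.append(data[: i + 1].decode(encoding))
        return found


trie = NibbleTrie()
trie.insert("中")    # hypothetical dictionary entries
trie.insert("中华")
print([format(n, "x") for n in nibbles("中")])  # ['6', 'd', '0', 'd']
print(trie.prefixes("中华人"))                   # ['中', '中华']
```

An ASCII character is a single byte in this encoding, so it occupies two nodes, and English strings can share the same tree with Chinese words.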
During index word segmentation, the character string is scanned from front to back, punctuation marks can be filtered out, and every word contained in the string is looked up in the dictionary tree and output. A string consisting entirely of English characters, or a string consisting of digits, is treated as one word and is also output as a segmented word. Index word segmentation therefore produces a certain amount of redundant words.
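The scan described above can be illustrated with a short sketch. For brevity it uses a plain character-level trie built from nested dictionaries rather than the binary-stream tree, and the names `build_trie`, `index_segment` and `END`, as well as the sample word list, are assumptions made for the example. Characters that start no dictionary word, including punctuation, are simply skipped here.

```python
from typing import Dict, Iterable, List

END = "$end"  # hypothetical end-of-word key; never equals a single character


def build_trie(words: Iterable[str]) -> Dict:
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def index_segment(text: str, trie: Dict) -> List[str]:
    """Scan front to back and output every dictionary word plus ASCII letter or digit runs."""
    tokens: List[str] = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isascii() and ch.isalnum():
            # A run made only of letters, or only of digits, counts as one word.
            is_digit = ch.isdigit()
            j = i
            while (j < n and text[j].isascii() and text[j].isalnum()
                   and text[j].isdigit() == is_digit):
                j += 1
            tokens.append(text[i:j])
            i = j
            continue
        # Output every dictionary word that starts at position i.
        node, j = trie, i
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                tokens.append(text[i:j])
        i += 1
    return tokens


words = ["中华", "中华人民共和国", "人民", "共和国", "国"]  # sample dictionary
print(index_segment("中华人民共和国, QQ2007", build_trie(words)))
# ['中华', '中华人民共和国', '人民', '共和国', '国', 'QQ', '2007']
```

The overlapping outputs in this run are the redundant words that index word segmentation is meant to retain.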
The invention also provides an index word segmentation system that simultaneously addresses accurate segmentation, a certain amount of redundant words, and single-character words, improving the user experience.
Fig. 4 is a structural diagram of the first index word segmentation system of the invention.
The first index word segmentation system of the invention comprises:
A reading unit 41, configured to read a character stream.
A character stream recognition unit 42, configured to recognize the character stream read by the reading unit 41 and determine Chinese characters, English characters or digits, and unrecognizable characters.
Because the character stream read by the reading unit 41 may contain Chinese characters, English characters or digits, and unrecognizable characters, it must be recognized before it can be divided into words.
A dictionary tree unit 43, a data structure unit that stores in advance the dictionary tree of words and phrases.
The dictionary tree unit 43 is a key unit; it effectively reduces unnecessary redundant words.
A comparing unit 44, configured to compare the Chinese characters, English characters or digits determined by the character stream recognition unit 42 with the dictionary tree built in advance in the dictionary tree unit 43, and to determine the matching segmented words.
A general fuzzy matching unit 45, configured to perform general ASCII fuzzy matching on the English characters or digits after comparison by the comparing unit 44, and to determine the segmented words formed by English character strings or digit strings.
A segmentation management unit 46, configured to sort the segmented words determined by the comparing unit 44 and the general fuzzy matching unit 45, together with the unrecognizable characters determined by the character stream recognition unit 42, in the order of the character stream read by the reading unit 41, and to record the length of each segmented word and unrecognizable character.
A segmentation division unit 47, configured to divide the character stream read by the reading unit 41 according to the segmented-word order recorded by the segmentation management unit 46 and the length of each segmented word and unrecognizable character.
In the system of this embodiment, the comparing unit 44 compares the Chinese characters, English characters or digits determined by the character stream recognition unit 42 with the dictionary tree built in advance in the dictionary tree unit 43 and determines the matching segmented words. This effectively avoids invalid words or phrases while preserving a certain amount of redundant words. When a single character can serve as a word or has practical meaning, the dictionary tree unit 43 can treat it as a normal word, so a single character can be output as a segmented word. The system also adds the general fuzzy matching unit 45, which performs general ASCII fuzzy matching on the English characters or digits after comparison by the comparing unit 44 and effectively determines the segmented words formed by English character strings and digit strings. The first index word segmentation system of the invention therefore simultaneously addresses accurate segmentation, a certain amount of redundant words, and single-character words, improving the user experience.
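How the recognition, management and division steps might fit together is sketched below. Everything here is a simplified assumption: `classify` recognizes Chinese characters by a single Unicode range, the dictionary matches are passed in as ready-made (offset, length) pairs standing in for the output of the comparing unit, and the function names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

Record = Tuple[int, int]  # (offset in the character stream, length)


def classify(ch: str) -> str:
    """Rough character recognition; the Unicode range is an assumption of this sketch."""
    if "\u4e00" <= ch <= "\u9fff":
        return "chinese"
    if ch.isascii() and ch.isalpha():
        return "english"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "unknown"


def runs_and_unknowns(text: str) -> List[Record]:
    """Record English or digit runs as single words and unrecognizable characters one by one."""
    records: List[Record] = []
    offset = 0
    for kind, group in groupby(text, key=classify):
        length = len(list(group))
        if kind in ("english", "digit"):
            records.append((offset, length))
        elif kind == "unknown":
            records.extend((offset + k, 1) for k in range(length))
        offset += length
    return records


def divide(text: str, dictionary_matches: List[Record]) -> List[str]:
    """Sort all records in stream order, then cut the stream by offset and length."""
    records = sorted(set(dictionary_matches) | set(runs_and_unknowns(text)))
    return [text[offset:offset + length] for offset, length in records]


print(divide("腾讯QQ2007!", [(0, 2)]))  # ['腾讯', 'QQ', '2007', '!']
```

Keeping only offsets and lengths until the final step means the original character stream is never copied during sorting.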
Fig. 5 is a structural diagram of the second index word segmentation system of the invention.
Compared with the first index word segmentation system, the second index word segmentation system of the invention adds an internal character buffer unit 48 and a dictionary adaptive unit 49.
A reading unit 41, configured to read a character stream.
A character stream recognition unit 42, configured to recognize the character stream read by the reading unit 41 and determine Chinese characters, English characters or digits, and unrecognizable characters.
The internal character buffer unit 48, configured to store the character stream recognized by the character stream recognition unit 42 and to provide the text to be segmented to the segmentation division unit 47.
The character stream recognition unit 42 can of course also unify the characters of the recognized character stream before sending it to the internal character buffer unit 48 for storage.
A dictionary tree unit 43, a data structure unit that stores in advance the dictionary tree of words and phrases.
A comparing unit 44, configured to compare the Chinese characters, English characters or digits determined by the character stream recognition unit 42 with the dictionary tree built in advance in the dictionary tree unit 43, and to determine the matching segmented words.
A general fuzzy matching unit 45, configured to perform general ASCII fuzzy matching on the English characters or digits after comparison by the comparing unit 44, and to determine the segmented words formed by English character strings or digit strings.
A segmentation management unit 46, configured to sort the segmented words determined by the comparing unit 44 and the general fuzzy matching unit 45, together with the unrecognizable characters determined by the character stream recognition unit 42, in the order of the character stream stored in the internal character buffer unit 48, and to record the length of each segmented word and unrecognizable character.
The segmentation division unit 47, configured to divide the character stream stored in the internal character buffer unit 48 according to the segmented-word order recorded by the segmentation management unit 46 and the length of each segmented word and unrecognizable character.
A dictionary adaptive unit 49, configured to count the frequency of occurrence of keywords through a statistical module 50 built in advance, and to add keywords whose frequency of occurrence is higher than a predetermined value to the dictionary tree unit 43.
The statistical module 50, built in advance, counts keyword frequencies. It can be any existing general-purpose module for counting user keywords. The working principle of such a module is common knowledge and is not described in detail here.
The dictionary adaptive unit 49 performs the adaptive operation according to the keyword frequencies counted by the statistical module 50. When the frequency of a keyword counted by the statistical module 50 is greater than the predetermined value, the dictionary adaptive unit 49 adds that keyword to the dictionary tree unit 43.
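The adaptive step can be illustrated with a few lines of code. This is a sketch under assumptions: the dictionary is modelled as a plain set of words instead of a dictionary tree, the statistical module is replaced by `collections.Counter`, and the function name `adapt_dictionary` and the sample keywords are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def adapt_dictionary(dictionary: Set[str],
                     keywords: Iterable[str],
                     threshold: int) -> Set[str]:
    """Add keywords seen more than `threshold` times; return the newly added words."""
    counts = Counter(keywords)  # stands in for the statistical module
    added = {word for word, count in counts.items()
             if count > threshold and word not in dictionary}
    dictionary |= added  # updates the caller's set in place
    return added


dictionary = {"腾讯"}
added = adapt_dictionary(dictionary, ["微信", "微信", "微信", "qq"], threshold=2)
print(added)               # {'微信'}
print(sorted(dictionary))  # ['微信', '腾讯']
```

Run periodically over the keywords received since the last pass, this keeps the dictionary in step with what users actually search for.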
In the second index word segmentation system of this embodiment, the comparing unit 44 compares the Chinese characters, English characters or digits determined by the character stream recognition unit 42 with the dictionary tree built in advance in the dictionary tree unit 43 and determines the matching segmented words. This effectively avoids invalid words or phrases while preserving a certain amount of redundant words. When a single character can serve as a word or has practical meaning, the dictionary tree unit 43 can treat it as a normal word, so a single character can be output as a segmented word. The second system also has the general fuzzy matching unit 45, which performs general ASCII fuzzy matching on the English characters or digits after comparison by the comparing unit 44 and effectively determines the segmented words formed by English character strings and digit strings. The second index word segmentation system of the invention therefore simultaneously addresses accurate segmentation, a certain amount of redundant words, and single-character words, improving the user experience.
The above is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make further improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also fall within the scope of protection of the invention.

Claims (10)

1. A method of index word segmentation, characterized by comprising the following steps:
Reading a character stream;
Recognizing the character stream to determine Chinese characters, English characters or digits, and unrecognizable characters;
Comparing the determined Chinese characters, English characters or digits with a dictionary tree built in advance to determine the matching segmented words;
Performing general ASCII fuzzy matching on the English characters or digits to determine the segmented words formed by English character strings or digit strings;
Sorting the matching segmented words, the segmented words formed by English character strings or digit strings, and the unrecognizable characters in the order of the character stream;
Dividing the character stream according to the order of the sorted segmented words and the length of each segmented word and unrecognizable character.
2. The method of index word segmentation according to claim 1, characterized in that the dictionary tree is a trie character tree data structure built in advance.
3. The method of index word segmentation according to claim 1, characterized in that the dictionary tree is a binary-stream dictionary tree structure built in advance.
4. The method of index word segmentation according to any one of claims 1 to 3, characterized in that, after the character stream is recognized, the character stream is stored in an internal character buffer.
5. The method of index word segmentation according to claim 4, characterized in that, before the character stream is stored in the internal character buffer, character unification is performed on the character stream.
6. The method of index word segmentation according to claim 5, characterized in that, after the Chinese characters, English characters or digits, and unrecognizable characters are determined, the punctuation marks in the character stream are removed.
7. The method of index word segmentation according to any one of claims 1 to 3, characterized in that meaningless single characters are removed when the dictionary tree is built in advance.
8. The method of index word segmentation according to any one of claims 1 to 3, characterized by further comprising, after the character stream is divided according to the order of the sorted segmented words and the length of each segmented word and unrecognizable character:
Periodically counting the frequency of the keywords received;
Adding keywords whose frequency is higher than a predetermined value to the dictionary tree.
9. A system of index word segmentation, characterized in that the system comprises:
A reading unit, configured to read a character stream;
A character stream recognition unit, configured to recognize the character stream read by the reading unit and determine Chinese characters, English characters or digits, and unrecognizable characters;
A dictionary tree unit, a data structure unit that stores in advance the dictionary tree of words and phrases;
A comparing unit, configured to compare the Chinese characters, English characters or digits determined by the character stream recognition unit with the dictionary tree built in advance in the dictionary tree unit, and to determine the matching segmented words;
A general fuzzy matching unit, configured to perform general ASCII fuzzy matching on the English characters or digits after comparison by the comparing unit, and to determine the segmented words formed by English character strings or digit strings;
A segmentation management unit, configured to sort the segmented words determined by the comparing unit and the general fuzzy matching unit, together with the unrecognizable characters determined by the character stream recognition unit, in the order of the character stream read by the reading unit, and to record the length of each segmented word and unrecognizable character;
A segmentation division unit, configured to divide the character stream read by the reading unit according to the segmented-word order recorded by the segmentation management unit and the length of each segmented word and unrecognizable character.
10. A system of index word segmentation, characterized in that the system comprises:
A reading unit, configured to read a character stream;
A character stream recognition unit, configured to recognize the character stream read by the reading unit and determine Chinese characters, English characters or digits, and unrecognizable characters;
An internal character buffer unit, configured to store the character stream recognized by the character stream recognition unit;
A dictionary tree unit, a data structure unit that stores in advance the dictionary tree of words and phrases;
A comparing unit, configured to compare the Chinese characters, English characters or digits determined by the character stream recognition unit with the dictionary tree built in advance in the dictionary tree unit, and to determine the matching segmented words;
A general fuzzy matching unit, configured to perform general ASCII fuzzy matching on the English characters or digits after comparison by the comparing unit, and to determine the segmented words formed by English character strings or digit strings;
A segmentation management unit, configured to sort the segmented words determined by the comparing unit and the general fuzzy matching unit, together with the unrecognizable characters determined by the character stream recognition unit, in the order of the character stream stored in the internal character buffer unit, and to record the length of each segmented word and unrecognizable character;
A segmentation division unit, configured to divide the character stream stored in the internal character buffer unit according to the segmented-word order recorded by the segmentation management unit and the length of each segmented word and unrecognizable character;
A dictionary adaptive unit, configured to count the frequency of occurrence of keywords through a statistical module built in advance, and to add keywords whose frequency of occurrence is higher than a predetermined value to the dictionary tree unit.
CNB2007101230513A 2007-06-22 2007-06-22 Method and system for cutting index participle Active CN100476800C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007101230513A CN100476800C (en) 2007-06-22 2007-06-22 Method and system for cutting index participle


Publications (2)

Publication Number Publication Date
CN101071420A CN101071420A (en) 2007-11-14
CN100476800C true CN100476800C (en) 2009-04-08

Family

ID=38898647

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101230513A Active CN100476800C (en) 2007-06-22 2007-06-22 Method and system for cutting index participle

Country Status (1)

Country Link
CN (1) CN100476800C (en)




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20151222

Address after: The South Road in Guangdong province Shenzhen city Fiyta building 518057 floor 5-10 Nanshan District high tech Zone

Patentee after: Shenzhen Tencent Computer System Co., Ltd.

Address before: 2, 518044, East 410 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.