CN103927330A - Method and device for determining characters with similar forms in search engine - Google Patents


Info

Publication number
CN103927330A
CN103927330A CN201410104483.XA CN201410104483A CN103927330A CN 103927330 A CN103927330 A CN 103927330A CN 201410104483 A CN201410104483 A CN 201410104483A CN 103927330 A CN103927330 A CN 103927330A
Authority
CN
China
Prior art keywords
word
nearly
coded string
search engine
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410104483.XA
Other languages
Chinese (zh)
Inventor
项碧波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201410104483.XA
Publication of CN103927330A
Priority to PCT/CN2014/094933 (WO2015139497A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

An embodiment of the invention discloses a method and a device for determining characters with similar forms in a search engine. The method includes: determining a first character and a second character to be verified in the search engine; acquiring a first coded string of the first character and a second coded string of the second character according to a preset rule; calculating the coding distance between the first coded string and the second coded string; judging the first character and the second character to be characters with similar forms when the coding distance is smaller than a preset distance threshold; and establishing, in the search engine, a similar-form mapping relation between the first character and the second character. The embodiment thereby determines whether the first character and the second character are similar in form, improves the webpage recognition efficiency of the search engine, and extends the functions of the search engine.

Description

Method and device for determining similar-form characters in a search engine
Technical field
The present invention relates to the technical field of language and character information processing, and in particular to a method for determining similar-form characters in a search engine, a method for providing error correction of Chinese search keywords, a device for determining similar-form characters in a search engine, and a device for providing error correction of Chinese search keywords.
Background
With the rapid development of the Internet and the diversification of network applications, the amount of information online has increased sharply.
In many settings, users need to enter text in order to exchange information. For example, they enter keywords into a search engine to find webpages, or type messages in an instant-messaging tool to communicate with other users.
Written languages contain similar-form characters, that is, characters whose structures resemble one another. Characters are entered through various encoding schemes, such as Wubi (five-stroke) encoding and Pinyin encoding. When entering text with such a scheme, a user can easily mistype one similar-form character for another and then has to re-enter the text, which is tedious and wastes system resources.
Take Wubi as an example. Whether Wubi input is accurate depends on how careful the user is and how well the user knows the characters. Wrong characters are frequently entered, either through careless operation or because the user's own understanding of the character is wrong. For example, a newspaper headline that should have read "random press horn is not penalized to call for redressing a grievance" was printed as "disorderly press loud-speaker do not penalized call for redressing a grievance".
As another example, suppose a user wants to search for webpages about the historical figure Xiang Yu (项羽) but mistypes 项 as 顶. Because 项 and 顶 look very much alike, the user may well enter 顶羽 without noticing and ask the search engine for webpages related to 顶羽.
On the one hand, the results of the mistaken search differ greatly from what the user expected, so the user experience is poor and client and search-engine resources are wasted. On the other hand, to obtain the webpages of interest the user must enter the keyword again, and the search engine must again search, compare, and filter a massive amount of information to find results relevant to the keyword. This makes the operation more cumbersome, costs the user time, greatly increases the load on the search engine, and consumes more client and search-engine resources.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for determining similar-form characters in a search engine, a method for providing error correction of Chinese search keywords, and corresponding devices, which overcome the above problems or at least partly solve them.
According to one aspect of the present invention, a method for determining similar-form characters in a search engine is provided, comprising:
determining a first character and a second character to be verified in the search engine;
obtaining a first coded string of the first character and a second coded string of the second character according to a preset rule;
calculating the coding distance between the first coded string and the second coded string;
when the coding distance is less than a preset distance threshold, judging the first character and the second character to be similar-form characters of each other; and
establishing, in the search engine, a similar-form mapping relation between the first character and the second character.
Optionally, the preset rule comprises a preset encoding rule, and the step of obtaining the first coded string of the first character and the second coded string of the second character comprises:
calculating, according to the preset encoding rule, the first coded string corresponding to the first character;
calculating, according to the same encoding rule, the second coded string corresponding to the second character;
wherein the preset encoding rule comprises the Wubi (five-stroke) encoding rule.
Optionally, the method further comprises:
outputting the mutually similar first and second characters and the similar-form mapping relation to a designated font database.
According to a further aspect of the present invention, a method for providing keyword error correction in search is provided, comprising:
receiving a search request, the search request comprising a search keyword;
when error-correction processing of the search keyword finds an error, rewriting the search keyword with a similar-form character that matches it; and
searching with the rewritten search keyword to obtain search result data matching the rewritten search keyword.
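The rewriting step above can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes that an error is detected when the keyword is absent from a vocabulary of known terms, and that a similar-form mapping is already available. The names `correct_keyword`, `VOCABULARY`, and `SIMILAR_MAP` are hypothetical, and the sample data reuses the 项羽 / 顶羽 case from the background section.

```python
def correct_keyword(keyword, vocabulary, similar_map):
    """Return a rewritten keyword, or the original if no rewrite is found.

    Assumption: a keyword is treated as erroneous when it is not in
    `vocabulary`; the description does not specify how errors are detected.
    """
    if keyword in vocabulary:
        return keyword
    # Try replacing one character at a time with a similar-form character.
    for i, ch in enumerate(keyword):
        for candidate in similar_map.get(ch, ()):
            rewritten = keyword[:i] + candidate + keyword[i + 1:]
            if rewritten in vocabulary:
                return rewritten
    return keyword


# Hypothetical sample data for illustration only.
VOCABULARY = {"项羽"}
SIMILAR_MAP = {"顶": {"项"}, "项": {"顶"}}

print(correct_keyword("顶羽", VOCABULARY, SIMILAR_MAP))
```

The search would then be run with the returned keyword, and a keyword that cannot be repaired is passed through unchanged.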
Optionally, the similar-form characters are obtained as follows:
determining, in the search engine, a first character and a second character to be verified as similar-form characters;
obtaining a first coded string of the first character and a second coded string of the second character according to a preset rule;
calculating the coding distance between the first coded string and the second coded string;
when the coding distance is less than a preset distance threshold, judging the first character and the second character to be similar-form characters of each other; and
establishing, in the search engine, a similar-form mapping relation between the first character and the second character.
Optionally, the preset rule comprises a preset encoding rule, and the step of obtaining the first coded string of the first character and the second coded string of the second character according to the preset rule comprises:
calculating, according to the preset encoding rule, the first coded string corresponding to the first character;
calculating, according to the same encoding rule, the second coded string corresponding to the second character;
wherein the preset encoding rule comprises the Wubi (five-stroke) encoding rule.
Optionally, the similar-form characters corresponding to the first character in the font database are further obtained as follows:
looking up the first input keys corresponding to the first coded string;
looking up the second input keys corresponding to the second coded string;
calculating the key distance between the first input keys and the second input keys;
configuring a weight for the coding distance according to the key distance;
in which case the step of judging the first character and the second character to be similar-form characters when the coding distance is less than the preset distance threshold becomes:
when the weighted coding distance is less than the preset distance threshold, judging the first character and the second character to be similar-form characters of each other.
Optionally, the key distance is inversely proportional to the weight.
Optionally, the method further comprises:
generating a search results page from the search result data.
Optionally, the method further comprises:
presenting, in the search results page, a notice that the search keyword has been corrected.
According to a further aspect of the present invention, a device for determining similar-form characters in a search engine is provided, comprising:
a character determination module, adapted to determine a first character and a second character to be verified in the search engine;
a code acquisition module, adapted to obtain a first coded string of the first character and a second coded string of the second character according to a preset rule;
a coding distance calculation module, adapted to calculate the coding distance between the first coded string and the second coded string;
a similar-form judgment module, adapted to judge, when the coding distance is less than a preset distance threshold, that the first character and the second character are similar-form characters of each other; and
a mapping relation module, adapted to establish, in the search engine, a similar-form mapping relation between the first character and the second character.
Optionally, the preset rule comprises a preset encoding rule, and the code acquisition module is further adapted to:
calculate, according to the preset encoding rule, the first coded string corresponding to the first character;
calculate, according to the same encoding rule, the second coded string corresponding to the second character;
wherein the preset encoding rule comprises the Wubi (five-stroke) encoding rule.
Optionally, the device further comprises:
an output module, adapted to output the mutually similar first and second characters and the similar-form mapping relation to a designated font database.
According to a further aspect of the present invention, a device for providing keyword error correction in search is provided, comprising:
a receiving unit, adapted to receive a search request, the search request comprising a search keyword;
a rewriting unit, adapted to rewrite the search keyword with a matching similar-form character when error-correction processing of the search keyword finds an error; and
a search unit, adapted to search with the rewritten search keyword and obtain search result data matching the rewritten search keyword.
Optionally, the similar-form characters are obtained by calling the following modules:
a character determination module, adapted to determine a first character and a second character to be verified in the search engine;
a code acquisition module, adapted to obtain a first coded string of the first character and a second coded string of the second character according to a preset rule;
a coding distance calculation module, adapted to calculate the coding distance between the first coded string and the second coded string;
a similar-form judgment module, adapted to judge, when the coding distance is less than a preset distance threshold, that the first character and the second character are similar-form characters of each other; and
a mapping relation module, adapted to establish, in the search engine, a similar-form mapping relation between the first character and the second character.
Optionally, the preset rule comprises a preset encoding rule, and the code acquisition module is further adapted to:
calculate, according to the preset encoding rule, the first coded string corresponding to the first character;
calculate, according to the same encoding rule, the second coded string corresponding to the second character;
wherein the preset encoding rule comprises the Wubi (five-stroke) encoding rule.
Optionally, the similar-form characters are further obtained by calling the following modules:
a first lookup module, adapted to look up the first input keys corresponding to the first coded string;
a second lookup module, adapted to look up the second input keys corresponding to the second coded string;
a key distance calculation module, adapted to calculate the key distance between the first input keys and the second input keys;
a weight configuration module, adapted to configure a weight for the coding distance according to the key distance;
and the similar-form judgment module is further adapted to:
judge, when the weighted coding distance is less than the preset distance threshold, that the first character and the second character are similar-form characters of each other.
Optionally, the key distance is inversely proportional to the weight.
Optionally, the device further comprises:
a generation unit, adapted to generate a search results page from the search result data.
Optionally, the device further comprises:
a notification unit, adapted to present, in the search results page, a notice that the search keyword has been corrected.
By calculating, in the search engine, the coding distance between the first coded string of the first character and the second coded string of the second character, the embodiment of the present invention determines whether the first character and the second character are similar-form characters, improves the webpage recognition efficiency of the search engine, and extends the functions of the search engine.
The embodiment of the present invention performs error correction on the search keyword and rewrites it with a matching similar-form character, so as to obtain search result data matching the rewritten keyword. On the one hand, the rewritten keyword brings the search results closer to what the user originally expected, which improves the user experience, reduces the waste of client and search-engine resources, and improves search efficiency. On the other hand, the user no longer has to enter the keyword again to obtain the webpages of interest, and the search engine no longer has to repeat the search, comparison, and filtering of massive information to find results relevant to the keyword. This makes operation more convenient, saves the user time, and further reduces the resource consumption of the client and the search engine.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flow chart of the steps of a method embodiment for determining similar-form characters in a search engine according to an embodiment of the invention;
Fig. 2 shows a flow chart of the steps of a method embodiment for providing keyword error correction in search according to an embodiment of the invention;
Fig. 3 shows a structural block diagram of a device embodiment for determining similar-form characters in a search engine according to an embodiment of the invention; and
Fig. 4 shows a structural block diagram of a device embodiment for providing keyword error correction in search according to an embodiment of the invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed completely to those skilled in the art.
With reference to Fig. 1, a flow chart of the steps of a method embodiment for determining similar-form characters in a search engine according to one embodiment of the invention is shown. The method can comprise the following steps:
Step 101, determining a first character and a second character to be verified in the search engine;
The processing flow of a search engine can generally be divided into two parts: front-end handling of user requests, and back-end data preparation.
One, front-end handling of a user request can comprise:
1. Input: the user enters a keyword;
2. Query analysis: the search engine segments the keyword into words;
3. Retrieval: according to the segmentation result, the relevant set of webpages is found in the index built in advance;
4. Ranking: the candidate webpages are ranked along dimensions such as content relevance and timeliness;
5. Display: the ranked webpages are presented to the user.
Two, back-end data preparation can comprise:
1. Webpage crawling: a crawler follows the links between webpages to fetch and store webpages from the Internet;
2. Index building: the stored webpages are analysed, their titles and body text are segmented into words, and an inverted index is built from the segmentation result for use in front-end retrieval.
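The index-building and retrieval steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the whitespace tokenizer stands in for a real word segmenter, the page data is made up, and the function names `build_inverted_index` and `retrieve` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(pages, tokenize):
    """Map each token to the set of page ids whose title or body contains it."""
    index = defaultdict(set)
    for page_id, (title, body) in pages.items():
        for token in tokenize(title) + tokenize(body):
            index[token].add(page_id)
    return index


def retrieve(index, query, tokenize):
    """Return the ids of pages that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


def whitespace_tokenize(text):
    # Hypothetical stand-in for a Chinese word segmenter.
    return text.split()


# Made-up pages: id -> (title, body).
pages = {
    1: ("Xiang Yu biography", "a page about Xiang Yu"),
    2: ("keyboard layouts", "the QWERTY layout has 26 letter keys"),
}
index = build_inverted_index(pages, whitespace_tokenize)
print(retrieve(index, "Xiang Yu", whitespace_tokenize))
```

Ranking and display are left out; the sketch covers only the mapping from segmented words to candidate pages.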
The webpages fetched by the crawler can be stored in a webpage database. Because the webpages contain a large amount of text, this webpage database can also be called a corpus.
In a specific implementation, the first character and the second character can be extracted from the corpus and checked to determine whether they are similar-form characters.
In an optional example of the embodiment of the present invention, the first character and the second character can be Chinese characters.
Step 102, obtaining a first coded string of the first character and a second coded string of the second character according to a preset rule;
Characters have specific structural features. Encoding characters according to these features yields an input method with which characters can be entered on an electronic device. For example, the first and second characters can be entered with a Pinyin input method, the Wubi input method, a stroke input method, and so on.
Accordingly, under different encoding rules the first and second characters correspond to different first and second coded strings. For example, the coded string of 侧 ("side") is "ce" under the Pinyin input method and "wmjh" under the Wubi input method.
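The dependence of the coded string on the encoding rule can be sketched as a table lookup. The tables below are toy data holding only the codes quoted in this description for the characters 侧 ("side") and 测 ("survey"); the names `CODE_TABLES` and `coded_string` are hypothetical, and a real system would load a complete code table for each input method.

```python
# Toy code tables keyed by encoding rule. Only entries quoted in the
# description are included; everything else is missing on purpose.
CODE_TABLES = {
    "pinyin": {"侧": "ce"},
    "wubi": {"侧": "wmjh", "测": "imjh"},
}


def coded_string(char, rule="wubi"):
    """Return the coded string of `char` under `rule`, or None if unknown."""
    return CODE_TABLES[rule].get(char)


print(coded_string("侧", "pinyin"), coded_string("侧", "wubi"))
```

Returning `None` for unknown characters lets a caller skip characters that the chosen table does not cover.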
In a preferred embodiment of the present invention, the preset rule can comprise a preset encoding rule, and step 102 can comprise the following sub-steps:
Sub-step S11, calculating, according to the preset encoding rule, the first coded string corresponding to the first character;
Sub-step S12, calculating, according to the same encoding rule, the second coded string corresponding to the second character;
wherein the preset encoding rule can comprise the Wubi (five-stroke) encoding rule.
Chinese characters are composed of strokes or components. To input them, a character can be broken down into a number of the most commonly used basic units, called roots. A root may be a component of a Chinese character, part of a dictionary radical, or even a single stroke.
When roots form a Chinese character, the positional relationship between them falls into four structural classes: single, scattered, connected, and crossed. Single means that a root is a character by itself; this covers key-name roots and roots that are themselves characters, such as 口 and 木. Scattered means that the roots forming the character keep a certain distance from one another, as in 汉 and 湘. Connected means that a root is joined to a single stroke; for example, 丿 connected to 目 forms 自. Crossed means that the character is formed by several roots intersecting or overlapping; for example, 申 is formed by 日 crossed with 丨.
Wubi is short for the five-stroke input method, a shape-based code input method. The root is the basic unit of Wubi: Chinese characters are encoded according to their strokes and glyph features, the roots are classified according to certain rules, and the roots are then distributed over the keyboard as the basic units for entering characters.
Specifically, Wubi divides the strokes of Chinese characters into five regions: horizontal (including rising strokes), vertical, left-falling, right-falling (including dots), and turning. The roots, or code elements, are distributed according to certain rules over 25 letter keys, that is, a standard QWERTY keyboard without the Z key.
When a Chinese character is entered with Wubi, the user presses in turn the keys corresponding to its roots, following the writing order and structure of the character, which forms a coded string. The system then uses this coded string to retrieve the desired character from the Wubi character library.
It should be noted that in Wubi, although identification codes keep the rate of duplicate codes (identical coded strings) low for single characters, the duplicate-code rate for phrases is higher. Wubi therefore generally does not use a large phrase dictionary, so as to avoid too many duplicate codes. Instead, it is particularly suited to single-character input, where it achieves high input efficiency.
Step 103, calculating the coding distance between the first coded string and the second coded string;
Calculating the coding distance between the first coded string and the second coded string identifies how similar the two strings are.
In a preferred example of the embodiment of the present invention, the coding distance can comprise the edit distance. The edit distance, also called the Levenshtein distance, is the minimum number of edit operations required to convert one of two strings (for example, the first coded string and the second coded string) into the other.
In practice, the permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
For example, converting the string "kitten" into the string "sitting" requires a minimum of three operations:
1. sitten (k → s): replace the character "k" with the character "s";
2. sittin (e → i): replace the character "e" with the character "i";
3. sitting (→ g): insert the character "g" at the end of the string "sittin".
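The edit distance described above can be computed with the standard dynamic-programming recurrence. The sketch below is one common formulation rather than code from this description; the function name `edit_distance` is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    substitutions, insertions, and deletions that turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3, matching the worked example
```

Only two rows of the table are kept at a time, which is sufficient because Wubi coded strings are at most four letters long and only the final distance is needed.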
Step 104, when the coding distance is less than the preset distance threshold, judging the first character and the second character to be similar-form characters of each other.
Similar-form characters are characters with similar glyph structures, which are easily confused in use. For example, 己, 已, and 巳 are similar-form characters of one another.
In Wubi, roots are generally grouped in blocks: roots that are identical or similar in shape to a given stroke or component are concentrated on one key or on adjacent keys. For example, in one version of Wubi the roots on the H key include 目, 上, 卜, 止, 虎头 (the "tiger head" component), and 具.
Because similar-form characters have similar glyph structures, the roots that compose them are also similar.
When a single character is entered with Wubi, apart from a small number of key-name roots and roots that are themselves characters, the character usually has to be split into roots by applying splitting rules to its features. If the split yields more than four roots, the first, second, third, and last roots are taken to input the character.
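The root-selection rule in the preceding paragraph can be sketched as follows. The functions `select_roots` and `roots_to_code` and the table `ROOT_TO_KEY` are hypothetical; the three key assignments shown are taken from the "imjh" example in this description, and identification codes (such as the trailing "h" in "imjh") are deliberately not modelled.

```python
def select_roots(roots):
    """Pick the roots that are keyed in: all of them when there are at most
    four, otherwise the first, second, third, and last."""
    if len(roots) <= 4:
        return list(roots)
    return list(roots[:3]) + [roots[-1]]


def roots_to_code(roots, root_to_key):
    """Concatenate the keys of the selected roots (no identification code)."""
    return "".join(root_to_key[r] for r in select_roots(roots))


# Partial, illustrative root-to-key table.
ROOT_TO_KEY = {"氵": "i", "贝": "m", "刂": "j"}

print(roots_to_code(["氵", "贝", "刂"], ROOT_TO_KEY))
```

Splitting a character into its roots is assumed to have been done beforehand by applying the splitting rules.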
For example, the splitting rules can include: follow the writing order; prefer larger roots; favour the intuitive split; connect rather than cross where possible; and separate rather than connect where possible.
The strokes and radicals that compose characters generally follow usage rules, which can include position rules and writing rules. For example, the single-person radical 亻 and the double-person radical 彳 generally sit at the far left of a character and are written first, as in 你, 亿, 很, and 往.
These usage rules allow Chinese characters to be divided into single-component characters (characters made of strokes, or equivalently of a single root, such as "upper", "lower", "day", and "month") and compound characters (characters made of several roots, such as "hang", "stop", "get", and "bright").
Specifically, the structures of Chinese characters can be divided into:
(1) top-bottom structure: think, askew, emit, anticipate, pacify, entirely;
(2) top-middle-bottom structure: grass, sudden and violent, meaning, unexpectedly, competing;
(3) left-right structure: good, canopy and, honeybee, beach, past, bright;
(4) left-middle-right structure: thank, set, fall, remove, slash, whip, debate;
(5) fully enclosed structure: enclose, prisoner, tired, field, because of, state, consolidate;
(6) semi-enclosed structure: bag, district, sudden strain of a muscle, this, sentence, letter, wind;
(7) interlocking structure: shocking, million, non-;
(8) triangular structure: product, gloomy, Nie, crystalline substance, of heap of stone, prosperous, spark.
Therefore, in Wubi, because the roots resemble the strokes and radicals of Chinese characters, and the Wubi splitting rules resemble the structure and writing rules of Chinese characters, splitting similar-form characters into roots yields identical or similar coded strings. For example, 测 ("survey") and 侧 ("side") are similar-form characters. 测 consists of three components, which are also Wubi roots, namely 氵, 贝, and 刂, and its coded string is "imjh". 侧 likewise consists of three components that are also roots, namely 亻, 贝, and 刂, and its coded string is "wmjh". Clearly, "imjh" and "wmjh" are very similar.
Accordingly, the coding distance is calculated between the first coded string of the first character and the second coded string of the second character. When it is less than the preset distance threshold, the similarity is high and the two characters can be regarded as similar-form characters. Conversely, when the coding distance is greater than or equal to the preset distance threshold, the similarity is low and the two characters can be regarded as not similar in form.
For example, because a Chinese character has a coded string of at most four letters in Wubi, the preset distance threshold can be set to 2. For the characters 候 ("time") and 侯 ("marquis"), applying the Wubi encoding rule gives the coded string "whnd" for 候 and "wntd" for 侯. The coding distance between "whnd" and "wntd" is 1, which is less than the distance threshold of 2, so 候 and 侯 can be judged to be similar-form characters of each other.
Step 105, establishing, in the search engine, the similar-form mapping relation between the first character and the second character.
In a specific implementation, a font database can be set up in the search engine for each character to collect the similar-form characters of that character and the corresponding similar-form mapping relations.
It should be noted that the similar-form mapping relation can be mutual. For example, the mapping from the first character to the second character can be "first character → second character", and the mapping from the second character to the first character can be "second character → first character".
The embodiment of the present invention by calculating the coding distance between the first coded string of the first word and the second coded string of the second word in search engine, realized the whether each other judgement of nearly word form of the first word and the second word, improve the webpage recognition efficiency of search engine, increased the function of search engine.
In a preferred embodiment of the present invention, the method can further comprise the following step:
Step 106: output the first character and the second character that are similar-form characters of each other, together with the similar-form mapping relation, to a specified font database.
Applying the embodiment of the present invention, all characters in a corpus can be traversed to find the similar-form characters of the current character, and the similar-form characters found and their mapping relations are used to generate the font database of the current character.
For example, the font database of the first character stores one or more similar-form characters and mapping relations, such as "first character ------ second character, third character, fourth character"; the font database of the second character stores one or more similar-form characters and mapping relations, such as "second character ------ first character, fifth character, sixth character".
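A minimal sketch of this traversal, assuming a small in-memory code table and Levenshtein distance as the coding distance, might look like the following. The names `CODE_TABLE`, `edit_distance` and `build_font_database` are hypothetical, the English labels stand in for the actual characters, and the entry "other" with code "kkkk" is invented purely to show a non-match.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical code table. "survey"/"side" use the codes quoted in the text;
# "other" is an invented entry that should not match anything.
CODE_TABLE = {"survey": "imjh", "side": "wmjh", "other": "kkkk"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as the assumed coding distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_font_database(code_table: dict, threshold: int = 2) -> dict:
    """Compare every pair of characters and record mutual mapping relations."""
    database = defaultdict(set)
    for (c1, code1), (c2, code2) in combinations(code_table.items(), 2):
        if edit_distance(code1, code2) < threshold:
            database[c1].add(c2)  # first character ------ second character
            database[c2].add(c1)  # second character ------ first character
    return dict(database)


print(build_font_database(CODE_TABLE))
```

Comparing all pairs is quadratic in the number of characters; a production system would presumably index the code strings first, but that is outside what the text describes.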
Referring to Fig. 2, a flow chart of the steps of an embodiment of a method for providing keyword error correction in search according to one embodiment of the present invention is shown. The method can comprise the following steps:
Step 201: receive a search request, the search request comprising a search keyword.
A search request can refer to an instruction, sent by a user, to search with a certain search keyword. For example, the user can send a search request through the web page of a search engine or through a search plug-in. When the user enters a search keyword in the search box of the search engine and clicks or presses the Enter key, a search request is received; likewise, when the user enters a search keyword in the input box of a search plug-in and clicks or presses the Enter key, a search request is received.
Step 202: when error correction processing of the search keyword finds an error, rewrite the search keyword with the similar-form character that matches the search keyword.
In a specific implementation, natural language processing (NLP) techniques can be used to perform error correction on the search keyword.
Error correction can generally be split into two subtasks:
1. Spelling error detection: according to the type of error, errors can be divided into non-word errors and real-word errors. A non-word error is one where the misspelled word is itself not a valid word, such as "giraffe" miswritten as "graffe". A real-word error is one where the misspelled word is still a valid word, such as "there" misspelled as "three" (similar form), "peace" misspelled as "piece" (homophone), or "two" misspelled as "too" (homophone). In a specific implementation, spelling errors can be handled based on, for example, the noisy channel model;
2. Spelling error correction: to correct the search keyword, the words can be checked for errors, for example errors between adjacent words, between adjacent words and characters, and between adjacent characters. The similar-form character that best matches the erroneous character is then looked up in the font database through the similar-form mapping relations, and the search keyword is rewritten accordingly.
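The rewriting part of this step can be illustrated with a deliberately simplified sketch. It assumes that error detection is reduced to a vocabulary membership test (the noisy channel model mentioned above is not implemented), and that Latin letters stand in for Chinese characters. The function name `rewrite_keyword` and the sample data are hypothetical.

```python
def rewrite_keyword(keyword: str, vocabulary: set, similar_forms: dict) -> str:
    """Replace one character with a similar-form character if that yields a known term."""
    if keyword in vocabulary:
        return keyword  # no error detected
    for i, ch in enumerate(keyword):
        # sorted() only makes the candidate order deterministic
        for candidate in sorted(similar_forms.get(ch, ())):
            rewritten = keyword[:i] + candidate + keyword[i + 1:]
            if rewritten in vocabulary:
                return rewritten
    return keyword  # nothing better found; keep the original keyword


# Hypothetical data: "q" and "g" play the role of two similar-form characters.
similar_forms = {"q": {"g"}, "g": {"q"}}
vocabulary = {"dog"}
print(rewrite_keyword("doq", vocabulary, similar_forms))
```

Returning the original keyword when no candidate matches keeps the search usable even when the mapping relations have no suitable entry.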
In a preferred embodiment of the present invention, the similar-form characters can be obtained in the following manner:
Sub-step S21: determine the first character and the second character that are input into the search engine and are to be verified as similar-form characters;
Sub-step S22: obtain the first code string of the first character and the second code string of the second character according to a preset rule;
In a preferred example of the embodiment of the present invention, the preset rule can comprise a preset coding rule, and sub-step S22 can further comprise the following sub-steps:
Sub-step S221: calculate the first code string corresponding to the first character according to the preset coding rule;
Sub-step S222: calculate the second code string corresponding to the second character according to the coding rule;
The preset coding rule can comprise the Wubi (five-stroke) coding rule.
Sub-step S23: calculate the coding distance between the first code string and the second code string;
Sub-step S24: when the coding distance is less than the preset distance threshold, judge that the first character and the second character are similar-form characters of each other;
Sub-step S25: establish, in the search engine, the similar-form mapping relation between the first character and the second character.
It should be noted that, in the embodiment of the present invention, sub-steps S21 to S25 are substantially similar to method embodiment 1, so they are described only briefly; for the relevant details, refer to the description of method embodiment 1, which is not repeated here.
In a preferred embodiment of the present invention, the similar-form characters can be obtained in the following manner:
Sub-step S31: determine the first character and the second character that are input into the search engine and are to be verified as similar-form characters;
Sub-step S32: obtain the first code string of the first character and the second code string of the second character according to a preset rule;
Sub-step S33: calculate the coding distance between the first code string and the second code string;
Sub-step S34: look up the first input keys corresponding to the first code string;
Sub-step S35: look up the second input keys corresponding to the second code string;
Sub-step S36: calculate the key distance between the first input keys and the second input keys;
Sub-step S37: configure a weight for the coding distance according to the key distance;
Sub-step S38: when the weighted coding distance is less than the preset distance threshold, judge that the first character and the second character are similar-form characters of each other;
Sub-step S39: establish, in the search engine, the similar-form mapping relation between the first character and the second character.
In the embodiment of the present invention, the key distance between the first input key and the second input key can be the physical distance between the keys on the keyboard.
In QWERTY touch typing, the left index finger controls keys R, T, F, G, V and B; the left middle finger controls E, D and C; the left ring finger controls W, S and X; the left little finger controls Q, A and Z; the right index finger controls Y, U, H, J, N and M; the right middle finger controls I and K; the right ring finger controls O and L; and the right little finger controls P. Keys F and J generally have a raised bump and serve as positioning keys.
Because of the positioning keys, when a finger presses a key that it does not control, for example the left index finger pressing E, the finger has to stretch further, which users generally find clearly uncomfortable, so the probability of this kind of mistyping is very small. Conversely, the probability of mistyping among the keys controlled by the same finger is relatively large; for example, when the left index finger presses R, it can easily hit T by mistake.
Therefore, the key distance can be inversely proportional to the weight. Optionally, for input keys controlled by the same finger, a weight coefficient can be applied to reduce the weight, so that the coding distance between the first character and the second character is smaller and their similarity is higher, reflecting the relatively large probability of mistyping.
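One possible reading of the same-finger weighting is sketched below. It is an assumption-laden illustration, not the method as claimed: only the same-finger coefficient is modeled (physical key distance is not), the coefficient value 0.5 is invented, the underlying metric is assumed to be Levenshtein distance, and the names `FINGER_KEYS`, `substitution_cost` and `weighted_coding_distance` are hypothetical. The finger assignment follows the QWERTY fingering listed above.

```python
# Finger-to-key assignment as listed in the text (lower-case code letters).
FINGER_KEYS = {
    "left_index": "rtfgvb", "left_middle": "edc",
    "left_ring": "wsx", "left_little": "qaz",
    "right_index": "yuhjnm", "right_middle": "ik",
    "right_ring": "ol", "right_little": "p",
}
KEY_TO_FINGER = {key: finger
                 for finger, keys in FINGER_KEYS.items() for key in keys}


def substitution_cost(a: str, b: str, same_finger_coeff: float = 0.5) -> float:
    """Substituting keys of the same finger costs less (assumed coefficient)."""
    if a == b:
        return 0.0
    finger_a, finger_b = KEY_TO_FINGER.get(a), KEY_TO_FINGER.get(b)
    if finger_a is not None and finger_a == finger_b:
        return same_finger_coeff
    return 1.0


def weighted_coding_distance(code_a: str, code_b: str,
                             same_finger_coeff: float = 0.5) -> float:
    """Levenshtein distance with finger-aware substitution costs."""
    prev = [float(j) for j in range(len(code_b) + 1)]
    for i, ca in enumerate(code_a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(code_b, start=1):
            curr.append(min(
                prev[j] + 1.0,       # deletion
                curr[j - 1] + 1.0,   # insertion
                prev[j - 1] + substitution_cost(ca, cb, same_finger_coeff)))
        prev = curr
    return prev[-1]


# "r" and "t" share the left index finger; "i" and "w" do not share a finger.
print(weighted_coding_distance("wrjh", "wtjh"))
print(weighted_coding_distance("imjh", "wmjh"))
```

With this weighting, code strings that differ only by keys under the same finger fall further below the threshold than strings that differ by keys under different fingers.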
Step 203: search with the rewritten search keyword to obtain search result data matching the rewritten search keyword.
After the search keyword has been rewritten, the network information can be retrieved and matched by means such as full-text indexing or directory indexing.
In a preferred embodiment of the present invention, the method can further comprise the following step:
Step 204: generate a search results page according to the search result data.
The search engine searches its database. If network information matching the content requested by the user is found, the relevance and ranking of each web page are generally calculated from the matching degree of the keyword in the network information, the position and frequency of its occurrences, link quality and so on, and the links to this network information are then returned to the user in order of relevance.
In a preferred embodiment of the present invention, the method can further comprise the following step:
Step 205: display, in the search results page, a prompt that the search keyword has been corrected.
In a specific implementation, the embodiment of the present invention can present the prompt in any form. For example, the information that the search keyword has been corrected can be shown below the input box of the search engine; to make the prompt more noticeable, the character before correction and the character after correction can also be marked in different colors. The embodiment of the present invention places no limitation on this.
The embodiment of the present invention performs error correction on the search keyword and rewrites it with the similar-form character that matches it, so as to obtain search result data matching the rewritten search keyword. On the one hand, the rewritten search keyword brings the search results closer to what the user originally intended, which improves the user experience, reduces wasted resources on the client and the search engine, and improves search efficiency. On the other hand, the user no longer needs to enter a keyword and search again to obtain the web page information of interest, and the search engine no longer needs to repeat the searching, comparison and screening of massive amounts of information to obtain information relevant to the search keyword. This makes operation more convenient for the user, reduces the time the user spends, and further reduces the resource consumption of the client and the search engine.
For simplicity of description, the method embodiments are expressed as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Referring to Fig. 3, a structural block diagram of an embodiment of a device for determining similar-form characters in a search engine according to one embodiment of the present invention is shown. The device can comprise the following modules:
Character determination module 301, adapted to determine the first character and the second character that are input into the search engine and are to be verified;
Code acquisition module 302, adapted to obtain the first code string of the first character and the second code string of the second character according to a preset rule;
Coding distance calculation module 303, adapted to calculate the coding distance between the first code string and the second code string;
Similar-form character determination module 304, adapted to judge, when the coding distance is less than the preset distance threshold, that the first character and the second character are similar-form characters of each other;
Mapping relation determination module 305, adapted to establish, in the search engine, the similar-form mapping relation between the first character and the second character.
In a preferred embodiment of the present invention, the preset rule can comprise a preset coding rule, and the code acquisition module can be further adapted to:
Calculate the first code string corresponding to the first character according to the preset coding rule;
Calculate the second code string corresponding to the second character according to the coding rule;
The preset coding rule comprises the Wubi (five-stroke) coding rule.
In a preferred embodiment of the present invention, the device can further comprise the following module:
Output module, adapted to output the first character and the second character that are similar-form characters of each other, together with the similar-form mapping relation, to a specified font database.
Referring to Fig. 4, a structural block diagram of an embodiment of a device for providing keyword error correction in search according to one embodiment of the present invention is shown. The device can comprise the following units:
Receiving unit 401, adapted to receive a search request, the search request comprising a search keyword;
Rewriting unit 402, adapted to rewrite the search keyword with the similar-form character that matches the search keyword when error correction processing of the search keyword finds an error;
Search unit 403, adapted to search with the rewritten search keyword to obtain search result data matching the rewritten search keyword.
In a preferred embodiment of the present invention, the similar-form characters can be obtained by calling the following modules:
Character determination module, adapted to determine the first character and the second character that are input into the search engine and are to be verified;
Code acquisition module, adapted to obtain the first code string of the first character and the second code string of the second character according to a preset rule;
Coding distance calculation module, adapted to calculate the coding distance between the first code string and the second code string;
Similar-form character determination module, adapted to judge, when the coding distance is less than the preset distance threshold, that the first character and the second character are similar-form characters of each other;
Mapping relation determination module, adapted to establish, in the search engine, the similar-form mapping relation between the first character and the second character.
In a preferred embodiment of the present invention, the preset rule can comprise a preset coding rule, and the code acquisition module is further adapted to:
Calculate the first code string corresponding to the first character according to the preset coding rule;
Calculate the second code string corresponding to the second character according to the coding rule;
The preset coding rule comprises the Wubi (five-stroke) coding rule.
In a preferred embodiment of the present invention, the similar-form characters can also be obtained by calling the following modules:
First lookup module, adapted to look up the first input keys corresponding to the first code string;
Second lookup module, adapted to look up the second input keys corresponding to the second code string;
Key distance calculation module, adapted to calculate the key distance between the first input keys and the second input keys;
Weight configuration module, adapted to configure a weight for the coding distance according to the key distance;
The similar-form character determination module can be further adapted to:
Judge, when the weighted coding distance is less than the preset distance threshold, that the first character and the second character are similar-form characters of each other.
In a preferred embodiment of the present invention, the key distance can be inversely proportional to the weight.
In a preferred embodiment of the present invention, the device can further comprise the following unit:
Generation unit, adapted to generate a search results page according to the search result data.
In a preferred embodiment of the present invention, the device can further comprise the following unit:
Prompt unit, adapted to display, in the search results page, a prompt that the search keyword has been corrected.
Because the device embodiments are substantially similar to the method embodiments, they are described only briefly; for the relevant details, refer to the description of the method embodiments.
The algorithms and displays provided here are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems can also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. In addition, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described here can be implemented in various programming languages, and that the description above of a specific language is given to disclose the best mode of the present invention.
In the specification provided here, numerous specific details are described. However, it can be understood that embodiments of the present invention can be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present invention the features of the present invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components in an embodiment can be combined into one module, unit or component, and can furthermore be divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed can be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) can be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
In addition, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.
The various component embodiments of the present invention can be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components of the device for determining similar-form characters in a search engine and the device for providing keyword error correction in search according to embodiments of the present invention. The present invention can also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention can be stored on a computer-readable medium, or can take the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words can be interpreted as names.

Claims (10)

1. A method for determining similar-form characters in a search engine, comprising:
Determining the first character and the second character that are input into the search engine and are to be verified;
Obtaining the first code string of the first character and the second code string of the second character according to a preset rule;
Calculating the coding distance between the first code string and the second code string;
When the coding distance is less than a preset distance threshold, judging that the first character and the second character are similar-form characters of each other;
Establishing, in the search engine, the similar-form mapping relation between the first character and the second character.
2. The method as claimed in claim 1, characterized in that the preset rule comprises a preset coding rule, and the step of obtaining the first code string of the first character and the second code string of the second character comprises:
Calculating the first code string corresponding to the first character according to the preset coding rule;
Calculating the second code string corresponding to the second character according to the coding rule;
Wherein the preset coding rule comprises the Wubi (five-stroke) coding rule.
3. The method as claimed in claim 1 or 2, characterized in that it further comprises:
Outputting the first character and the second character that are similar-form characters of each other, together with the similar-form mapping relation, to a specified font database.
4. A method for providing keyword error correction in search, comprising:
Receiving a search request, the search request comprising a search keyword;
When error correction processing of the search keyword finds an error, rewriting the search keyword with the similar-form character that matches the search keyword;
Searching with the rewritten search keyword to obtain search result data matching the rewritten search keyword.
5. The method as claimed in claim 4, characterized in that the similar-form characters are obtained in the following manner:
Determining the first character and the second character that are input into the search engine and are to be verified as similar-form characters;
Obtaining the first code string of the first character and the second code string of the second character according to a preset rule;
Calculating the coding distance between the first code string and the second code string;
When the coding distance is less than a preset distance threshold, judging that the first character and the second character are similar-form characters of each other;
Establishing, in the search engine, the similar-form mapping relation between the first character and the second character.
6. A device for determining similar-form characters in a search engine, comprising:
Character determination module, adapted to determine the first character and the second character that are input into the search engine and are to be verified;
Code acquisition module, adapted to obtain the first code string of the first character and the second code string of the second character according to a preset rule;
Coding distance calculation module, adapted to calculate the coding distance between the first code string and the second code string;
Similar-form character determination module, adapted to judge, when the coding distance is less than a preset distance threshold, that the first character and the second character are similar-form characters of each other;
Mapping relation determination module, adapted to establish, in the search engine, the similar-form mapping relation between the first character and the second character.
7. The device as claimed in claim 6, characterized in that the preset rule comprises a preset coding rule, and the code acquisition module is further adapted to:
Calculate the first code string corresponding to the first character according to the preset coding rule;
Calculate the second code string corresponding to the second character according to the coding rule;
Wherein the preset coding rule comprises the Wubi (five-stroke) coding rule.
8. The device as claimed in claim 6 or 7, characterized in that it further comprises:
Output module, adapted to output the first character and the second character that are similar-form characters of each other, together with the similar-form mapping relation, to a specified font database.
9. A device for providing keyword error correction in search, comprising:
Receiving unit, adapted to receive a search request, the search request comprising a search keyword;
Rewriting unit, adapted to rewrite the search keyword with the similar-form character that matches the search keyword when error correction processing of the search keyword finds an error;
Search unit, adapted to search with the rewritten search keyword to obtain search result data matching the rewritten search keyword.
10. The device as claimed in claim 9, characterized in that the similar-form characters are obtained by calling the following modules:
Character determination module, adapted to determine the first character and the second character that are input into the search engine and are to be verified;
Code acquisition module, adapted to obtain the first code string of the first character and the second code string of the second character according to a preset rule;
Coding distance calculation module, adapted to calculate the coding distance between the first code string and the second code string;
Similar-form character determination module, adapted to judge, when the coding distance is less than a preset distance threshold, that the first character and the second character are similar-form characters of each other;
Mapping relation determination module, adapted to establish, in the search engine, the similar-form mapping relation between the first character and the second character.
CN201410104483.XA 2014-03-19 2014-03-19 Method and device for determining characters with similar forms in search engine Pending CN103927330A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410104483.XA CN103927330A (en) 2014-03-19 2014-03-19 Method and device for determining characters with similar forms in search engine
PCT/CN2014/094933 WO2015139497A1 (en) 2014-03-19 2014-12-25 Method and apparatus for determining similar characters in search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410104483.XA CN103927330A (en) 2014-03-19 2014-03-19 Method and device for determining characters with similar forms in search engine

Publications (1)

Publication Number Publication Date
CN103927330A true CN103927330A (en) 2014-07-16

Family

ID=51145551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410104483.XA Pending CN103927330A (en) 2014-03-19 2014-03-19 Method and device for determining characters with similar forms in search engine

Country Status (1)

Country Link
CN (1) CN103927330A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156454A (en) * 2014-08-18 2014-11-19 腾讯科技(深圳)有限公司 Search term correcting method and device
WO2015139497A1 (en) * 2014-03-19 2015-09-24 北京奇虎科技有限公司 Method and apparatus for determining similar characters in search engine
CN106919614A (en) * 2015-12-28 2017-07-04 中国移动通信集团公司 A kind of information processing method and device
CN106980620A (en) * 2016-01-18 2017-07-25 阿里巴巴集团控股有限公司 A kind of method and device matched to Chinese character string
CN109344387A (en) * 2018-08-01 2019-02-15 北京奇艺世纪科技有限公司 The generation method of nearly word form dictionary, device and nearly word form error correction method, device
CN110019760A (en) * 2017-11-02 2019-07-16 中移(杭州)信息技术有限公司 A kind of processing method and processing device of text information
CN110032920A (en) * 2018-11-27 2019-07-19 阿里巴巴集团控股有限公司 Text region matching process, equipment and device
CN110097002A (en) * 2019-04-30 2019-08-06 北京达佳互联信息技术有限公司 Nearly word form determines method, apparatus, computer equipment and storage medium
CN110866188A (en) * 2019-11-14 2020-03-06 拉扎斯网络科技(上海)有限公司 Information processing method, information processing device, electronic equipment and computer readable storage medium
CN111259262A (en) * 2020-01-13 2020-06-09 上海极链网络科技有限公司 Information retrieval method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101206673A (en) * 2007-12-25 2008-06-25 北京科文书业信息技术有限公司 Intelligent error correcting system and method in network searching process
CN101288046A (en) * 2005-08-11 2008-10-15 亚马逊技术有限公司 Identifying alternative spellings of search strings by analyzing self-corrective searching behaviors of users
US20100306229A1 (en) * 2009-06-01 2010-12-02 Aol Inc. Systems and Methods for Improved Web Searching

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015139497A1 (en) * 2014-03-19 2015-09-24 北京奇虎科技有限公司 Method and apparatus for determining similar characters in search engine
CN104156454A (en) * 2014-08-18 2014-11-19 腾讯科技(深圳)有限公司 Search term correcting method and device
CN104156454B (en) * 2014-08-18 2018-09-18 腾讯科技(深圳)有限公司 Search term correcting method and device
CN106919614A (en) * 2015-12-28 2017-07-04 中国移动通信集团公司 Information processing method and device
CN106980620A (en) * 2016-01-18 2017-07-25 阿里巴巴集团控股有限公司 Method and device for matching Chinese character strings
CN106980620B (en) * 2016-01-18 2020-07-31 阿里巴巴集团控股有限公司 Method and device for matching Chinese character strings
CN110019760A (en) * 2017-11-02 2019-07-16 中移(杭州)信息技术有限公司 Text information processing method and device
CN110019760B (en) * 2017-11-02 2022-05-06 中移(杭州)信息技术有限公司 Text information processing method and system
CN109344387A (en) * 2018-08-01 2019-02-15 北京奇艺世纪科技有限公司 Method and device for generating a similar-form character dictionary, and method and device for correcting similar-form character errors
CN109344387B (en) * 2018-08-01 2023-12-19 北京奇艺世纪科技有限公司 Method and device for generating a similar-form character dictionary, and method and device for correcting similar-form character errors
CN110032920A (en) * 2018-11-27 2019-07-19 阿里巴巴集团控股有限公司 Text region matching process, equipment and device
CN110097002A (en) * 2019-04-30 2019-08-06 北京达佳互联信息技术有限公司 Method, apparatus, computer device and storage medium for determining similar-form characters
CN110866188A (en) * 2019-11-14 2020-03-06 拉扎斯网络科技(上海)有限公司 Information processing method, information processing device, electronic equipment and computer readable storage medium
CN111259262A (en) * 2020-01-13 2020-06-09 上海极链网络科技有限公司 Information retrieval method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN103927329B (en) Instant search method and system
CN103927330A (en) Method and device for determining characters with similar forms in search engine
US11416679B2 (en) System and method for inputting text into electronic devices
US20210132792A1 (en) System and method for inputting text into electronic devices
CN109960726B (en) Text classification model construction method, device, terminal and storage medium
US10402493B2 (en) System and method for inputting text into electronic devices
CN106537370B (en) Method and system for robust tagging of named entities in the presence of source and translation errors
CN105094368B (en) Control method and control device for frequency-adjusted ranking of input method candidates
CN102449579B (en) All-in-one Chinese character input method
CN100517301C (en) Systems and methods for improved spell checking
Kothari et al. SMS based interface for FAQ retrieval
CN104077275A (en) Method and device for performing word segmentation based on context
CN106708929B (en) Video program searching method and device
CN112527999A (en) Extraction type intelligent question and answer method and system introducing agricultural field knowledge
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
WO2015139497A1 (en) Method and apparatus for determining similar characters in search engine
CN111194457A (en) Patent evaluation determination method, patent evaluation determination device, and patent evaluation determination program
CN101308512A (en) Mutual translation pair extraction method and device based on web page
CN106570196B (en) Video program searching method and device
CN112597768B (en) Text auditing method, device, electronic equipment, storage medium and program product
Khan et al. A clustering framework for lexical normalization of Roman Urdu
Tapsai et al. TLS-ART: Thai language segmentation by automatic ranking trie
van Cranenburgh Rich statistical parsing and literary language
Zhang et al. A New Machine-Learning Extracting Approach to Construct a Knowledge Base: A Case Study on Global Stromatolites over Geological Time
JP7047825B2 (en) Search device, search method, search program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140716