CN100501745C - Convenient method and system for electronic text-processing and searching - Google Patents

Convenient method and system for electronic text-processing and searching

Info

Publication number
CN100501745C
Authority
CN
China
Prior art keywords
abutting connection
text
speech section
keyword
subclass
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB200710164298XA
Other languages
Chinese (zh)
Other versions
CN101201841A (en)
Inventor
刘二中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CNB200710164298XA
Publication of CN101201841A
Application granted
Publication of CN100501745C
Legal status: Expired - Fee Related

Abstract

The invention relates to a computer-implemented method and system for processing a plurality of electronic texts containing the same keyword. The method includes the following steps: obtaining a plurality of electronic texts containing the same keyword; specifying the number of words contained in an adjacent word segment and the way the adjacent word segment is intercepted; and, for some or all of the texts, according to whether the adjacent word segment of the keyword in a text's content is the same as in other texts, classifying that text and the other texts into the same or different subclasses or classes, which are then subject to the same or different processing. The mass of results returned by a keyword search can thus be organized into multilevel subset systems, catalogues, or example sequences whose core contents are neither repeated nor lost. The invention can therefore help users narrow the search range rapidly and rigorously and obtain the expected search results completely and accurately.

Description

Convenient method and system for electronic text processing and retrieval
(1) Technical field
The present invention relates to technology for processing and retrieving electronic texts by computers and search engines.
(2) Background art
In recent decades, computer database retrieval technology has developed greatly. In particular, advances in network technologies such as the World Wide Web have brought the scale of the databases that people can share to astronomical figures. To help users find the information or files they need, classification or catalogue retrieval systems appeared. Such systems suit mature classification fields that people know well, but across the broad field of mass information they are difficult both to build and to master and use.
Search engine technology, with keyword search at its core, has brought convenience to users. A search system built around a search engine generally resides on one or more servers or other computer devices. It consists of a text (page) library, a text index library, an index builder that analyzes the texts in the text library to produce the text index, a query processor that accepts queries and generates search results, and other parts; it is often accompanied by a data acquisition server that collects texts from the Internet or other information sources and adds them to the text library. Such a system receives a user's keyword query through the interactive interface on a client computer and a communication network or line, searches the text index library or the text library, analyzes the relevance between the keyword request and the texts, obtains and ranks the relevant results, and returns them to the interactive interface via the communication network or line. This kind of search system is convenient and fast to use, but the number of index entries in the returned results is still huge, and they are hard to examine one by one.
Techniques have also been developed that determine relevance by comparing the keyword with the anchor text describing links that point to the relevant text, but these still do not fully satisfy searchers. To place the query results potentially most valuable to the user as near the front as possible, U.S. Patent No. 6,285,999 proposed ranking search results by analyzing the hyperlink structure of web pages (Page link). This technique surpassed other ranking techniques, was adopted by Google, and achieved unprecedented success.
However, this and other ranking techniques only improve the efficiency of keyword search in a statistical sense; they cannot guarantee that the results each individual user wants appear at the front of a huge index list. For example, searching for the word "Bu Lin" on the Google Chinese website returns nearly 300,000 index entries. There is still no guarantee that the desired content can be found near the front without omission, so the search is neither rigorous nor convenient. Meanwhile, before reaching the desired information, the user has no choice but to read irrelevant items whose main content repeats again and again.
To solve this problem, people have tried for more than ten years to develop new search engine techniques, for example: the "priority-ranked list according to importance" of U.S. Patent No. 6,421,675; the "dynamic object table formed from the history of the user's queries" of U.S. Patent No. 6,256,633; the "sharing query information with other users" of Chinese patent CN1151457; and the "measuring the similarity of electronic texts" of U.S. Patent No. 6,990,628. These techniques have some advantages, but their effect is very limited.
The technique of U.S. Patent No. 7,089,236 performs semantic analysis on the keyword submitted by the user and presents the different possible meanings on the interactive interface, helping the user narrow the search range. A similar technique, Chinese patent application No. 200510081867.5, distributes the keyword search results of a search engine using web page classification information. The problem with both techniques is that they first require a very complex and huge classification database that cannot be fully accurate, and it is very difficult for a machine to judge which meaning or category of a keyword a given page or text belongs to, so reliability is low. The different meanings or categories of a keyword are likely to overlap and, even more likely, to leave gaps. If more classification levels are added, the overlap causes the storage space occupied to explode. Moreover, a user doing keyword search in an unfamiliar field finds it hard to grasp the many meanings or categories accurately. All of this seriously hampers improvements in search efficiency.
Therefore, there is an urgent need for a keyword search engine technique that is both rigorous and efficient and that helps the user narrow the range to be examined effectively, even repeatedly. The boundaries between the different ranges should be clear and easy to judge, with neither overlap nor gaps, so that the user reaches the expected results much faster while the rigor of the search is guaranteed. This has remained an unsolved worldwide problem for many years.
(3) Summary of the invention
The purpose of the present invention is to provide a technique for processing and retrieving (searching) electronic texts by a computer or search engine, so that a user who performs a keyword search and faces a mass of search results can narrow the search range rapidly, rigorously, and repeatedly, or reject various kinds of irrelevant or duplicate information, and obtain the desired results accurately with few omissions.
One aspect of the present invention provides a computer-implemented method for processing and retrieving a plurality of electronic texts containing the same keyword, comprising:
Step (1): obtaining a plurality of electronic texts containing the same keyword;
Step (2): specifying the number of words contained in an adjacent word segment, or the way the adjacent word segment is intercepted;
Step (3): processing some or all of the electronic texts according to whether the adjacent word segments of the keyword in their content are identical or different;
Step (4): displaying the processing result on an interactive interface;
The processing comprises any of the following:
First processing: electronic texts with identical adjacent word segments have the same distribution position or storage mode, and electronic texts with different adjacent word segments have different distribution positions or storage modes;
Second processing: electronic texts with identical adjacent word segments are assigned to the same subclass or given the same subclass label, and electronic texts with different adjacent word segments are assigned to different subclasses or given different subclass labels;
Third processing: the indexes of electronic texts with identical adjacent word segments are given the same label or index entry, and the indexes of electronic texts with different adjacent word segments are given different labels or index entries;
Fourth processing: electronic texts with identical adjacent word segments are given the same arrangement, and electronic texts with different adjacent word segments are given different arrangements;
Fifth processing: electronic texts with identical adjacent word segments are given the same display mode or position on the interactive interface, and electronic texts with different adjacent word segments are given different display modes or positions on the interactive interface;
Sixth processing: electronic texts with identical adjacent word segments are assigned to the same subclass, electronic texts with different adjacent word segments are assigned to different subclasses, and at least some of the subclasses each have one or more adjacent word segments or electronic texts that are combined or ranked across subclasses;
Seventh processing: compiling a catalogue or sequence of one or more levels, which reflects the parallel or sequential relationships among the different adjacent word segments or indirect adjacent word segments of the same keyword in the electronic texts, or among the sentences or summary examples containing those segments;
The electronic texts are electronic files or their summaries, indexes, bibliographic records, or titles; they may also be web pages, or any other electronic information content, including databases, works, dictionaries, handbooks, or patent documents.
The adjacent word segment or indirect adjacent word segment may precede or follow the keyword. It is generally a segment made up of one or more words, characters, or even roots in the content of the electronic text and, where needed, may also include certain characters such as abbreviation letters or punctuation marks.
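Steps (1) to (3) with the second processing can be illustrated by a minimal sketch. It assumes whitespace-separated words and uses only the first occurrence of the keyword; the function names `adjacent_segment` and `group_by_adjacent_segment` and the sample texts are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

def adjacent_segment(text, keyword, n_words=1, side="after"):
    """Return the n_words words right after (or before) the first
    occurrence of keyword, or None if the keyword is absent."""
    words = text.split()
    if keyword not in words:
        return None
    i = words.index(keyword)
    if side == "after":
        segment = words[i + 1 : i + 1 + n_words]
    else:
        segment = words[max(0, i - n_words) : i]
    return " ".join(segment)

def group_by_adjacent_segment(texts, keyword, n_words=1, side="after"):
    """Put texts whose adjacent word segments are identical into the
    same subclass; each text lands in exactly one subclass."""
    subclasses = defaultdict(list)
    for text_id, text in texts.items():
        segment = adjacent_segment(text, keyword, n_words, side)
        if segment is not None:
            subclasses[segment].append(text_id)
    return dict(subclasses)

# Hypothetical sample texts that all contain the keyword "boolean".
texts = {
    "t1": "boolean algebra is used in logic",
    "t2": "boolean algebra underlies circuits",
    "t3": "boolean search narrows results",
}
groups = group_by_adjacent_segment(texts, "boolean")
```

Because every text has exactly one adjacent word segment under a fixed interception rule, the subclasses in this sketch cannot overlap and no text is left unclassified.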
The benefit of this method for retrieval is evident. A user interested in a particular adjacent word segment of the keyword can easily obtain the class of all electronic texts that contain that segment; otherwise, the user can just as easily skip those texts.
The key point of the invention is that the content adjacent to a keyword is what most likely determines the specific meaning, reference, scope, or direction of the keyword in that electronic text, and this is what the searcher cares about most. Moreover, if the method is applied appropriately, it can completely avoid the "overlap and gaps between the contents of different classes or subclasses" that other classification-based retrieval methods find hard to avoid, a phenomenon that ultimately makes a multi-level subclass system unworkable. This is why the search effectiveness of the method or system of the invention is markedly improved.
In the second processing, the method of processing and retrieval may further comprise:
compiling a catalogue or sequence of one or more levels, which reflects the parallel or sequential relationships among the different adjacent word segments or indirect adjacent word segments of the same keyword in the electronic texts, or among the sentences or summary examples containing those segments;
and compiling, distributing, storing, or displaying, according to their parallel or sequential relationships, the identical adjacent word segments (or identical indirect adjacent word segments, or sentences or summary examples containing them) that belong respectively to one or more different subclasses of the electronic texts, or to a plurality of subclasses one or more levels below those subclasses;
wherein the identical adjacent word segments, identical indirect adjacent word segments, or sentences or summary examples containing them are placed side by side across subclasses or within a subclass.
The method of processing and retrieval may further comprise the following step: for different electronic texts belonging to one or more of the same first-level or higher-level subclasses, processing some or all of the electronic texts, according to whether their further adjacent word segments (the segments adjacent to the shared keyword together with its adjacent word segment) are identical or different, in any of the following ways:
Eighth processing: electronic texts with identical further adjacent word segments have the same distribution position or storage mode, and electronic texts with different further adjacent word segments have different distribution positions or storage modes;
Ninth processing: electronic texts with identical further adjacent word segments are assigned to the same next-level subclass or given the same subclass label, and electronic texts with different further adjacent word segments are assigned to different next-level subclasses or given different subclass labels;
Tenth processing: the indexes of electronic texts with identical further adjacent word segments are given the same label or index entry, and the indexes of electronic texts with different further adjacent word segments are given different labels or index entries;
Eleventh processing: electronic texts with identical further adjacent word segments are given the same arrangement, and electronic texts with different further adjacent word segments are given different arrangements;
Twelfth processing: electronic texts with identical further adjacent word segments are given the same display mode or position on the interactive interface, and electronic texts with different further adjacent word segments are given different display modes or positions on the interactive interface;
Thirteenth processing: electronic texts with identical further adjacent word segments are assigned to the same next-level subclass, electronic texts with different further adjacent word segments are assigned to different next-level subclasses, and at least some of the subclasses each have one or more adjacent word segments or electronic texts that are combined or ranked across subclasses;
Fourteenth processing: compiling a catalogue or sequence of one or more levels, which reflects the parallel or sequential relationships among the different further adjacent word segments or indirect adjacent word segments of the same keyword and its adjacent word segment in the electronic texts, or among the sentences or summary examples containing those segments.
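The repeated subdivision described above can be pictured as a tree in which each level corresponds to the next word after the keyword. The following is a rough sketch under simplifying assumptions: one word per level, whitespace tokenization, first occurrence of the keyword only, and a reserved key `"_ids"` (a hypothetical convention) holding the text identifiers at each leaf.

```python
def following_words(text, keyword, depth):
    """Return up to `depth` words that follow the first occurrence
    of keyword, or None if the keyword is absent."""
    words = text.split()
    if keyword not in words:
        return None
    i = words.index(keyword)
    return words[i + 1 : i + 1 + depth]

def build_segment_tree(texts, keyword, depth=2):
    """Nest subclasses level by level: level 1 is the adjacent word,
    level 2 the word adjacent to keyword + level-1 word, and so on."""
    tree = {}
    for text_id, text in texts.items():
        path = following_words(text, keyword, depth)
        if path is None:
            continue
        node = tree
        for word in path:
            node = node.setdefault(word, {})
        node.setdefault("_ids", []).append(text_id)
    return tree

# Hypothetical sample texts.
texts = {
    "t1": "boolean algebra is used in logic",
    "t2": "boolean algebra underlies circuits",
    "t3": "boolean search narrows results",
}
tree = build_segment_tree(texts, "boolean", depth=2)
```

Here `tree["algebra"]` is the first-level subclass shared by `t1` and `t2`, and its children `"is"` and `"underlies"` are the next-level subclasses that separate them.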
The method of processing and retrieval may successively merge or split adjacent word segments in order to reduce or increase the number of subclass levels.
Where necessary, when judging whether the adjacent word segments of a keyword, or the further adjacent word segments of those segments, are identical or different, the method may ignore differences in the prefixes, suffixes, punctuation, or spacing of adjacent words; or the presence, absence, or difference of auxiliary words, numerals, measure words, adjectives, or adverbs; or the presence, absence, or difference of articles or conjunctions.
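One way to realize this tolerant comparison is to normalize each segment before testing for equality. The sketch below is only an illustration: it ignores letter case, surrounding punctuation, extra spaces, and a small hypothetical stop list of English articles and conjunctions. The patent does not prescribe a particular word list, and handling prefixes, suffixes, or measure words would need language-specific tools that are not shown here.

```python
import string

# Hypothetical list of words whose presence or absence is ignored.
IGNORED_WORDS = {"a", "an", "the", "and", "or"}

def normalize_segment(segment):
    """Lower-case the segment, strip punctuation around each word,
    collapse spacing, and drop ignored articles and conjunctions."""
    cleaned = []
    for word in segment.lower().split():
        word = word.strip(string.punctuation)
        if word and word not in IGNORED_WORDS:
            cleaned.append(word)
    return " ".join(cleaned)

def same_segment(a, b):
    """Treat two segments as identical if their normalized forms match."""
    return normalize_segment(a) == normalize_segment(b)
```

For example, `same_segment("the Algebra,", "algebra")` holds under these rules, so both texts would fall into the same subclass.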
In the method of processing and retrieval, when the keyword consists of several separate words, the adjacent word segment may be the adjacent word segment of one of those words (for example, the first one) or of several of them.
In the method of processing and retrieval, the number of words or characters contained in an adjacent word segment, or the interception mode or specific content of the adjacent word segment, may be predetermined, set by default, or selected by the user.
In the method, when determining the number of words or characters in an adjacent word segment or indirect adjacent word segment, the presence, absence, or difference of the prefixes, suffixes, auxiliary words, numerals, measure words, punctuation, spaces, adjectives, or adverbs of one or more words may be ignored.
Where needed, if in the catalogue or sequence an adjacent word segment or indirect adjacent word segment of the keyword has only one kind of adjacent word segment at the next level or at several levels below, that segment may be distributed, stored, or displayed together with its next-level or lower-level adjacent word segments at its original position.
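One possible reading of this rule is that a catalogue node with a single child is merged with that child, so that the chain is shown as one entry at the parent's position. The sketch below follows that reading; the nested-dictionary catalogue format and the function name `collapse_chains` are assumptions made for illustration.

```python
def collapse_chains(tree):
    """Merge every node that has exactly one child with that child,
    so single-branch chains appear as one catalogue entry."""
    collapsed = {}
    for segment, children in tree.items():
        children = collapse_chains(children)
        while len(children) == 1:
            # Exactly one next-level segment: join it to the parent.
            (child, grandchildren), = children.items()
            segment = segment + " " + child
            children = grandchildren
        collapsed[segment] = children
    return collapsed

# Hypothetical catalogue of adjacent word segments for one keyword.
catalogue = {
    "algebra": {"of": {"sets": {}, "logic": {}}},
    "search": {"engine": {"results": {}}},
}
compact = collapse_chains(catalogue)
```

In this example the branch `search > engine > results` has no alternatives at any level, so it is shown as the single entry `search engine results`.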
The step of the method for described processing and retrieval may further include in (4):
In e-text or catalogue or statement or summary example or at keyword that they comprised or in abutting connection with the speech section or indirectly near the speech section, can have its corresponding number of subsets side by side or subordinate's number of subsets or related term or in abutting connection with the speech section or indirectly in abutting connection with the number of subsets arranged side by side of speech section place subclass or indirectly in abutting connection with speech section contained subordinate's number of subsets or the prompting of textual data purpose.
Method of the present invention may further include step (5), and promptly the inquiry indicates the literal in catalogue or the sequence or figure or symbol on interactive interface, determines or launches or the link related content.
The seventh processing of the method of processing and retrieval may further include:
compiling a sequence of a plurality of electronic texts, or portions of electronic texts, containing the same keyword, such that the multi-word adjacent word segments of the keyword contained in the sequence, or in each electronic text or portion in the sequence, are all different. In other words, among the electronic texts or portions that share an identical adjacent word segment, only one or a few are kept as representatives.
In this way, a representative sequence for the keyword, whose number of entries is smaller by about half an order of magnitude, can replace the original mass of information.
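A representative sequence of this kind can be sketched as a de-duplication pass over an already ranked result list, keeping the first text seen for each distinct adjacent word segment. The assumptions here are whitespace tokenization, one representative per segment, and a hypothetical function name `representative_sequence`.

```python
def representative_sequence(ranked_texts, keyword, n_words=2):
    """Keep only the first text for each distinct n_words-long
    adjacent word segment, preserving the original ranking order."""
    representatives = {}
    for text_id, text in ranked_texts:
        words = text.split()
        if keyword not in words:
            continue
        i = words.index(keyword)
        segment = tuple(words[i + 1 : i + 1 + n_words])
        if segment not in representatives:
            representatives[segment] = text_id
    # Dictionaries preserve insertion order, so ranking is kept.
    return list(representatives.values())

# Hypothetical ranked results, best first.
ranked = [
    ("t1", "boolean algebra is used in logic"),
    ("t2", "boolean algebra is taught early"),
    ("t3", "boolean search narrows results"),
]
representatives = representative_sequence(ranked, "boolean")
```

Texts `t1` and `t2` share the segment "algebra is", so only the higher-ranked `t1` stays in the sequence.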
The method of processing and retrieval of the present invention may further include step (6):
comparing the different adjacent word segments of the same keyword in the electronic texts for similarity, and assigning a plurality of different adjacent word segments that meet a given similarity requirement with one another to the same similar subclass;
or assigning a plurality of different adjacent word segments that do not meet the similarity requirement with one another to different similar subclasses;
or compiling a plurality of different adjacent word segments that do not meet the similarity requirement with one another into a sequence or catalogue of mutually dissimilar adjacent word segments, using the content common to the elements of a similar subclass as the name or label of that similar subclass, or listing it in a sequence or catalogue of similar-subclass names.
The similarity comparison or similarity requirement described here can take many forms. Where needed, it may be specified that the similarity requirement includes at least a requirement on the number or proportion of identical characters, words, or phrases contained in the different adjacent word segments.
Where needed, a subclass containing an adjacent word segment may be assigned to the similar subclass in which that segment falls, or made a subordinate subclass of it; alternatively, the electronic texts, or portions of them, that contain the adjacent word segment may be assigned to the similar subclass in which that segment falls.
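As one illustration of a similarity requirement based on the proportion of shared words, the sketch below uses the Jaccard ratio of word sets with a threshold and a simple greedy grouping against the first member of each group. Both the measure and the grouping strategy are assumptions chosen for brevity; the text above leaves the concrete comparison open.

```python
def word_overlap(a, b):
    """Proportion of shared words: |A & B| / |A | B| over word sets."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def similar_subclasses(segments, threshold=0.5):
    """Greedily place each segment into the first group whose first
    member is similar enough; otherwise start a new group."""
    groups = []
    for segment in segments:
        for group in groups:
            if word_overlap(segment, group[0]) >= threshold:
                group.append(segment)
                break
        else:
            groups.append([segment])
    return groups

# Hypothetical adjacent word segments of one keyword.
segments = ["algebra of sets", "algebra of logic", "search engine"]
similar = similar_subclasses(segments)
```

The first two segments share two of four distinct words, which meets the 0.5 threshold, so they form one similar subclass; their common content "algebra of" could then serve as the subclass name.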
The method of the present invention may further include step (7), namely:
The specific ranking position of parallel subclasses, of parallel adjacent word segments or indirect adjacent word segments, of electronic texts, of parallel sentences or summary examples, or of items in the representative sequence depends partly or wholly on one or more of the following factors:
First factor: the page link value, click rate, or keyword occurrence rate of the electronic text, or of the electronic text containing the adjacent word segment, indirect adjacent word segment, sentence, summary example, or information item;
Second factor: the number of subordinate subclasses or subordinate electronic texts of the subclass, the click rate of the subclass, or the average page link value of the electronic texts in the subclass;
Third factor: for the subclass containing the adjacent word segment, indirect adjacent word segment, electronic text, sentence, example sentence, or summary example, the number of its subordinate subclasses or subordinate electronic texts, its click rate, or the average page link value of its electronic texts;
Fourth factor: the page link value of the electronic text with the highest page link value in the subclass, or of another example electronic text;
Fifth factor: the click rate or keyword occurrence rate of the electronic text in the subclass with the highest click rate or keyword occurrence rate, or of another example electronic text;
Sixth factor: the rank of the electronic text, or of the relevant electronic texts in the associated subclass, in the search results of other search websites or retrieval systems;
Seventh factor: the amount of the relevant payment or bid made by a sponsor for the electronic text, adjacent word segment, or indirect adjacent word segment;
Eighth factor: the alphabetical order of the spelling or pinyin, or the stroke order, of the words in the adjacent word segment;
Ninth factor: the rating of the electronic text's source website, organization, or person;
Tenth factor: the order in which the electronic texts were indexed, or how recent they are;
Eleventh factor: whether the electronic texts belong to the same subclass at a given level;
Twelfth factor: the value of an objective function that depends on one or more variables, some or all of which represent one or more of the factors above.
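The twelfth factor can be illustrated with a weighted sum as the objective function. The sketch below is a hedged example only: the field names `link_value`, `click_rate`, and `text_count`, the weights, and all numbers are invented for illustration, and the text above does not fix the form of the function.

```python
def objective(subclass, weights):
    """Weighted sum of the ranking factors named in `weights`."""
    return sum(weights[name] * subclass[name] for name in weights)

# Hypothetical subclasses with made-up factor values.
subclasses = [
    {"segment": "algebra", "link_value": 0.8, "click_rate": 0.10, "text_count": 2},
    {"segment": "search", "link_value": 0.5, "click_rate": 0.40, "text_count": 1},
]
# Hypothetical weights: click rate counts double, text count only a little.
weights = {"link_value": 1.0, "click_rate": 2.0, "text_count": 0.1}

# Highest objective value first.
ranked = sorted(subclasses, key=lambda s: objective(s, weights), reverse=True)
order = [s["segment"] for s in ranked]
```

With these weights the "search" subclass scores about 1.4 and the "algebra" subclass about 1.2, so "search" is listed first.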
The method of processing and retrieval of the present invention may further include step (8), namely:
adding or removing further keywords that must or must not be present, or adding or removing restrictions on time, region, language, or scope, to obtain a more refined or a broader result.
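The keyword part of step (8) amounts to filtering a result set by words that must or must not occur. A minimal sketch follows, assuming whitespace tokenization and a hypothetical function name `refine`; restrictions on time, region, or language would need metadata that is not modeled here.

```python
def refine(texts, required=(), excluded=()):
    """Keep texts that contain every required keyword and none of
    the excluded keywords."""
    result = {}
    for text_id, text in texts.items():
        words = set(text.split())
        has_required = all(keyword in words for keyword in required)
        has_excluded = any(keyword in words for keyword in excluded)
        if has_required and not has_excluded:
            result[text_id] = text
    return result

# Hypothetical sample texts.
texts = {
    "t1": "boolean algebra is used in logic",
    "t2": "boolean algebra underlies circuits",
    "t3": "boolean search narrows results",
}
narrowed = refine(texts, required=("boolean", "algebra"), excluded=("circuits",))
```

Dropping a required or excluded keyword from the call widens the result again, which corresponds to the "broader result" mentioned above.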
Another aspect of the present invention is a computer data system, comprising a storage device, that can use the method of processing and retrieval described above, characterized in that some or all of the keyword indexes, electronic text summaries, or electronic texts contained in the data of the storage device, or in a part of that data, are distributed as follows:
Indexes, text summaries, or texts whose electronic text summary or electronic text contains the same keyword, and whose adjacent word segments of that keyword are identical or different, are located in the distribution regions of the same or different subclasses, respectively, of the set for that keyword.
Where needed, index data that lie in the same subclass, and whose electronic text summary or electronic text contains the same expanded key statement with identical or different adjacent word segments of that statement, may be located in the distribution regions of the same or different subclasses, respectively, one or more levels lower within that subclass.
The present invention may also include a tree catalogue of adjacent word segments, which reflects the sequential relationships among the adjacent word segments of a keyword at different levels in the electronic texts, electronic text summaries, or bibliographic records that contain the keyword.
Another aspect of the present invention is another computer data system comprising a storage device, characterized in that the data structure of some or all of the keyword indexes contained in the data of the storage device, or in a part of that data, comprises at least:
a keyword field;
one or more adjacent word segment fields, formed by mapping, in their original order, a predetermined number of successively adjacent word segments of the keyword, level by level, from the content of the corresponding electronic text or from its summary, namely: adjacent word segment 1, adjacent word segment 2, ... adjacent word segment N;
an ID field of the corresponding electronic text, or an ID field of its related information;
wherein the ID field means an address field;
and, if necessary, a summary field or title field of the corresponding electronic text that contains the keyword.
The system may allow the keyword field and one or more of the adjacent word segment fields to be combined according to the search rules, with the number of combined fields increased or decreased, in order to search for the corresponding indexes or content by mapping.
Obviously, the minimum value of N may be set to 1, in which case the corresponding index has only one adjacent word segment.
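The index record described above can be sketched as a small data class, with a lookup that combines the keyword field with a variable number of leading adjacent word segment fields. The class name `IndexEntry`, the function `lookup`, and the sample entries are hypothetical, and a linear scan stands in for whatever storage layout a real system would use.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class IndexEntry:
    keyword: str                       # keyword field
    adjacent_segments: Tuple[str, ...] # adjacent word segments 1..N, in order
    text_id: str                       # ID (address) field of the text
    summary: Optional[str] = None      # optional summary or title field

def lookup(index, keyword, segments=()):
    """Return IDs of entries matching the keyword and the given
    leading adjacent word segments (zero or more of them)."""
    k = len(segments)
    return [
        entry.text_id
        for entry in index
        if entry.keyword == keyword
        and entry.adjacent_segments[:k] == tuple(segments)
    ]

# Hypothetical index with N = 2 adjacent word segments per entry.
index = [
    IndexEntry("boolean", ("algebra", "is"), "t1"),
    IndexEntry("boolean", ("algebra", "underlies"), "t2"),
    IndexEntry("boolean", ("search", "narrows"), "t3"),
]
all_hits = lookup(index, "boolean")
algebra_hits = lookup(index, "boolean", ("algebra",))
```

Adding one more segment to the query, as in `lookup(index, "boolean", ("algebra", "is"))`, narrows the result to a single text, while removing segments widens it.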
Another aspect of the present invention provides a search engine system that can use the method of electronic text processing and retrieval described above, the system comprising:
a server 5, coupled via a communication network 4 or line to the client computer 3 on which the interactive interface 2 resides;
the server 5 comprising a search engine 8,
the search engine comprising a database 9 and a query processor 11;
wherein the database stores keyword indexes, and the query processor queries the database according to the keyword request submitted by the user and provides the list of relevant results found to the interactive interface 2;
characterized in that:
the database also stores electronic text summaries or electronic texts containing the keywords, the electronic text summaries being either included in or separate from the keyword index;
the query processor or search engine comprises a keyword expansion component 10, which can perform one or more expansion operations on the keyword being queried or to be queried, the expansion operation comprising: taking the keyword together with each adjacent word segment with which it occurs in the content or summaries of the electronic texts containing the keyword as a distinct expanded key statement, and listing these statements, or listing the different adjacent word segments of the keyword, for the user to select via the interactive interface; or retrieving, compiling, or arranging the different indexes, electronic text summaries, or electronic texts that contain identical or different expanded key statements, for the user to select via the interactive interface.
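A single expansion operation of the kind attributed to the keyword expansion component can be sketched as collecting every "keyword plus adjacent word segment" combination found in the texts and listing the distinct combinations with their frequencies. The function name `expand_keyword` is hypothetical, whitespace tokenization is assumed, and the count is of keyword occurrences rather than of texts.

```python
from collections import Counter

def expand_keyword(texts, keyword, n_words=1):
    """List distinct expanded key statements (keyword followed by
    its n_words-long adjacent segment), most frequent first."""
    counts = Counter()
    for text in texts:
        words = text.split()
        for i, word in enumerate(words):
            if word != keyword:
                continue
            segment = words[i + 1 : i + 1 + n_words]
            if segment:  # skip a keyword at the very end of a text
                counts[keyword + " " + " ".join(segment)] += 1
    return counts.most_common()

# Hypothetical sample texts.
texts = [
    "boolean algebra is used in logic",
    "boolean algebra underlies circuits",
    "boolean search narrows results",
]
statements = expand_keyword(texts, "boolean")
```

Calling the function again with a chosen statement's words and a larger `n_words` would correspond to a second expansion, for example `expand_keyword(texts, "boolean", n_words=2)`.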
The search engine system may store a tree directory of adjacent word segments that reflects the sequential relationship between the adjacent word segments at different levels of the keyword in the electronic texts, electronic text abstracts or bibliographic records sharing the same keyword, or a tree directory (Fig. 8) that reflects the sequential relationship between the key sentences produced by expanding the keyword to different levels.
The tree directory can in fact also reflect the parallel relationship among adjacent word segments or key sentences of the same level.
The search engine system may further comprise a graphical user interface. The graphical interface includes dialog boxes or selection boxes to receive the user's choice of operating mode; it also includes clickable text, symbols or graphics representing key sentences, adjacent word segments, statements, paragraphs, operation commands or options, so that the user can add further query information.
The search engine system described above may be an Internet search service system serving Internet users, or a standalone computerized information-base retrieval system. The server 5 is a computer storage and processing device; it may be a single machine, a group of machines, or a distributed configuration. The client computer 3 may be a PC, a workstation or another computing device, equipped with a suitable browser when needed.
The search engine system may further provide that the search engine comprises an index builder 13, which analyses the electronic texts in the database, or the electronic texts that a data acquisition server 12 attached to the search engine obtains from the Internet 4 or other information sources, and generates and stores for each electronic text an index comprising at least a keyword field, an adjacent word segment field and a text ID field.
When needed, the number of words in each adjacent word segment may simply be fixed here, for example at one word.
At the position of each adjacent word segment in the tree directory of adjacent word segments, or at the position of each key sentence in the tree directory reflecting the sequential relationship between the key sentences of different expansion levels, the number of subsets that follow it or the number of documents it contains may also be displayed.
Another aspect of the present invention is a search method usable in the described search engine system, comprising the following steps:
Step A: receive the user's keyword query request via the interactive interface;
Step B: query the database according to the keyword query request;
Step C: take the keyword as it appears in an electronic text or electronic text abstract containing it, together with its adjacent word segment, as a key sentence;
wherein the number of words or characters in the adjacent word segment, or the way the segment is cut off, is predetermined by the search engine system, or accepted by default or selected by the user, or determined by a symbol, character, word, font, colour or space at the end of the segment, or determined by the position and manner of the user's cursor indication in a selection field presented by the interactive interface or on a page containing a specific index, electronic text abstract or electronic text;
Step D: from the adjacent word segments or key sentences of Step C, collate the distinct adjacent word segments or the distinct key sentences;
Step E: generate the search results from the adjacent word segments or key sentences obtained, that is, retrieve or arrange the different indexes, electronic text abstracts, electronic texts or bibliographic records that contain the same or different adjacent word segments or key sentences, for the user to select via the interactive interface.
Steps A to E of the search method may be carried out by the search engine system in advance or at query time.
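Steps B to E can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the texts are assumed to be already split into word tokens, and the names `find_keyword`, `group_by_key_sentence`, `before` and `after` are hypothetical.

```python
from collections import defaultdict

def find_keyword(tokens, keyword):
    """Return every start position at which the keyword (a token list) occurs."""
    k = len(keyword)
    return [i for i in range(len(tokens) - k + 1) if tokens[i:i + k] == keyword]

def group_by_key_sentence(database, keyword, before=1, after=1):
    """Map each distinct key sentence to the IDs of the texts that contain it."""
    groups = defaultdict(list)
    for text_id, tokens in database.items():          # Step B: scan the database
        for i in find_keyword(tokens, keyword):
            end = i + len(keyword)
            # Step C: keyword plus its adjacent word segments forms a key sentence
            words = tokens[max(0, i - before):i] + keyword + tokens[end:end + after]
            sentence = " ".join(words)
            if text_id not in groups[sentence]:       # Step D: collate distinct sentences
                groups[sentence].append(text_id)
    return dict(groups)                               # Step E: grouped results

database = {
    1: "a professional search engine company".split(),
    2: "the professional search engine company grew".split(),
    3: "a free search engine service".split(),
}
results = group_by_key_sentence(database, ["search", "engine"])
```

Here texts 1 and 2 fall under the same key sentence and text 3 under a different one, so the user chooses between two entries rather than three.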
The search method can be used in an Internet search engine system, or in a local or standalone computerized information-base retrieval system, such as a digital library system or a digital document archive retrieval system.
The search method may further comprise:
Step F: take the key sentence as it appears in an electronic text or electronic text abstract containing it, together with its adjacent word segment, or take the key sentence together with its adjacent word segment, as an expanded key sentence;
wherein the number of words or characters in the adjacent word segment, or the way the segment is cut off, or its specific content, is predetermined by the search engine system, or accepted by default or selected by the user;
Step G: from the adjacent word segments or key sentences of Step F, collate the distinct adjacent word segments or the distinct expanded key sentences;
Step H: generate the search results from the adjacent word segments or expanded key sentences obtained in Step G, that is, retrieve, arrange or separately store the different indexes, electronic text abstracts, electronic texts or bibliographic records that contain the same or different adjacent word segments or expanded key sentences, for the user to select via the interactive interface;
wherein Steps A to H may be carried out by the search engine system in advance or at query time.
In this way, the originally huge preliminary result of a keyword search is subdivided again and again into a system of subsets, which makes selection easier for the user.
When needed, the search method may further comprise a grouping step:
that is, the various key sentences, adjacent word segments, indexes, electronic text abstracts or electronic texts that contain the same keyword, or the various expanded key sentences, adjacent word segments, indexes, electronic text abstracts or electronic texts that contain the same original key sentence, are each grouped and arranged or displayed as a directory or sequence, in which only one or a few key sentences, indexes, abstracts or texts are taken for each adjacent word segment.
A grouping of this kind is unique or representative to a degree, so the user can make a choice after reading only a small amount of information on the interactive interface.
When the key sentences we select reach a certain length, the core content of the resulting grouped sequence of indexes, abstracts or bibliographic records will be essentially free of both repetition and omission.
With the search method, the data of some or all keyword indexes, electronic text abstracts or electronic texts can also be distributed and stored in the same or different subset areas, or in the same or different lower-level subset areas, according to whether the keywords, key sentences or expanded key sentences they contain are the same or different.
At keyword query time, the data of the corresponding key sentence, keyword index, text abstract or text can then be extracted or supplied directly.
The search method may further comprise:
analysing the texts or abstracts in the database, or the electronic texts that a data acquisition server attached to the search engine obtains from the Internet, and generating and storing a corresponding index for each electronic text, the index comprising a keyword field, an adjacent word segment field and an electronic text ID field.
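One possible shape for such an index entry is sketched below. The class `IndexEntry`, the function `build_index` and the single-token keyword vocabulary are illustrative assumptions; the source only states that the index holds a keyword field, an adjacent word segment field and a text ID field.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexEntry:
    keyword: str       # keyword field
    adjacent: tuple    # adjacent word segment field (words following the keyword)
    text_id: int       # electronic text ID field

def build_index(texts, vocabulary, after=1):
    """Create one entry per keyword occurrence in the tokenized texts."""
    entries = []
    for text_id, tokens in texts.items():
        for i, token in enumerate(tokens):
            if token in vocabulary:
                segment = tuple(tokens[i + 1:i + 1 + after])
                entries.append(IndexEntry(token, segment, text_id))
    return entries

index = build_index({7: ["fast", "engine", "design"], 9: ["engine", "repair"]}, {"engine"})
```

With `after=1` each entry carries a one-word adjacent segment, which corresponds to the simplest case where N is 1.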
The search method may further comprise an arranging step:
that is, compiling, for use at query time, a tree directory (Fig. 8) that reflects the sequential or parallel relationships between the adjacent word segments at different levels of a keyword in the electronic texts or electronic text abstracts sharing that keyword, or a tree directory that reflects the sequential or parallel relationships between the key sentences produced by expanding the keyword to different levels.
The search method may further comprise a selection step:
that is, the search engine system determines the corresponding key sentence from the user's cursor indication on an electronic text, electronic text abstract, key sentence or adjacent word segment directory on a page of the interactive interface, or in a selection field or box. It then displays a directory of the various expanded key sentences, or of the adjacent word segments, indexes, electronic text abstracts or electronic texts, that correspond to this key sentence; or displays the corresponding indexes, abstracts or texts in sorted order; or performs a removal step for the key sentence so determined, in which the entries, indexes, abstracts or texts on this page or on several other pages that contain the key sentence are removed or moved.
When needed, the search method may further comprise an ignore step:
that is, while the user browses on the interactive interface a sequence of indexes, electronic text abstracts, bibliographic records or electronic texts that contain the original keyword or the original key sentence, the user's current position in the sequence is determined from the pages viewed or the operations performed on them. If it is determined that, within a certain range before this current position, the indexes, abstracts or texts containing a given key sentence, or the key sentence itself, have never been opened or followed, or have gone unopened a certain number of consecutive times, and have not been clicked or marked for retention either, a removal step is performed for this key sentence, in which the entries, indexes, abstracts or texts on this page or on several other pages that contain the key sentence are removed or moved.
In this way, files of a kind that the user has long ignored during reading are moved or removed from the sequence of similar file information that follows, which reduces the burden of excessive useless information.
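The ignore step can be approximated by the sketch below. It is one reading of the passage, with assumptions labelled: the result list is a sequence of `(entry_id, key_sentence)` pairs, `opened` is the set of entries the user opened, a key sentence counts as ignored once its current run of unopened entries before the reading position reaches `threshold`, and ignored entries are moved to the end rather than deleted. All names are hypothetical.

```python
def apply_ignore_step(results, opened, position, threshold=3):
    """Move not-yet-read entries of ignored key sentences to the end of the list."""
    streak = {}
    for entry_id, sentence in results[:position]:      # entries already passed
        if entry_id in opened:
            streak[sentence] = 0                       # opening resets the run
        else:
            streak[sentence] = streak.get(sentence, 0) + 1
    ignored = {s for s, n in streak.items() if n >= threshold}
    upcoming = results[position:]
    kept = [r for r in upcoming if r[1] not in ignored]
    moved = [r for r in upcoming if r[1] in ignored]
    return results[:position] + kept + moved

results = [(1, "A"), (2, "B"), (3, "A"), (4, "A"), (5, "B"), (6, "A"), (7, "C")]
reordered = apply_ignore_step(results, opened={2}, position=4)
```

In this example the user passed three entries with key sentence "A" without opening any, so the remaining "A" entry is moved behind the others.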
The processing, retrieval and search technique of the present invention, which centres on keywords and their adjacent words, divides and progressively narrows the result range of a same-keyword search with dictionary-like rigour and with a convenience clearly exceeding the prior art. It can also condense the often millions of network information items sharing a keyword into a representative information sequence whose item count is 2 to 3 orders of magnitude smaller, in which the core content of each item (the content formed by the few words adjacent to the keyword) is neither repeated nor omitted. It will therefore better meet the long-standing and pressing needs of the many users of information retrieval and search.
(4) Description of drawings
Fig. 1 is a schematic example of specifying the number of words in an adjacent word segment or the way the segment is cut off.
Fig. 2 is a schematic example of a tree directory of the different adjacent word segments, or the corresponding subsets, of the same keyword.
Fig. 3 is a schematic example of different similarity subsets of the same keyword and their subordinate adjacent-word-segment subsets.
Fig. 4 is a structural block diagram of an embodiment of the search system according to the present invention.
Fig. 5 is a schematic diagram of key sentence generation in an embodiment of the present invention.
Fig. 6 is a schematic diagram of another way of generating key sentences in an embodiment of the present invention.
Fig. 7 is an example flow chart of user operations at the interactive interface in an embodiment of the present invention.
Fig. 8 is a schematic diagram of a tree directory of adjacent word segments, displayed in an embodiment of the present invention, that reflects the sequential relationship between the adjacent words at different levels of a keyword.
Fig. 9 is a workflow diagram of the search engine in an embodiment of the present invention.
Fig. 10 is a partial screen view of the results displayed after a cursor click (selection operation) during a search in an embodiment of the present invention.
Fig. 11 is an example flow chart of an embodiment of the processing and retrieval method of the present invention.
(5) Embodiments
The invention is further described below with reference to the accompanying drawings, building on the preceding "Summary of the invention".
The computer-implemented method provided by the invention for processing and retrieving a plurality of electronic texts containing the same keyword comprises, by way of example, four steps:
Step 1: obtain a plurality of electronic texts containing the same keyword from a computer, a database or the Internet. A text may be an electronic file, document or web page, or its abstract, index, bibliographic record or title, or any digitized content from a database, a work, a dictionary, a handbook or a patent document.
Step 2: specify the number of words in the adjacent word segment of the keyword in the text, or the way the segment is cut off:
The adjacent word segment is generally a directly adjacent word segment, and where necessary may be an indirectly adjacent one. A directly adjacent word segment is one with no intervening text between it and the keyword in the original electronic text; an indirectly adjacent word segment is one separated from the keyword by a small amount of text. A large gap will noticeably impair the effectiveness of the method.
The adjacent word segment may precede or follow the keyword. It generally consists of one or more words, characters or even word roots from the text, and when needed may also include certain other characters, such as abbreviation letters or punctuation marks.
The number of words or characters in the adjacent word segment, or the way the segment is cut off, or its specific content, may be predetermined by the computer system, or agreed to, accepted by default or selected by the user, or determined by the position and manner of the user's cursor indication in a selection field presented by the interactive interface or on a page containing a specific index, text abstract, text or related content.
Fig. 1, Fig. 5 and Fig. 6 give several examples of specifying the number of words in an adjacent word segment or the way the segment is cut off. In the example of Fig. 1 the keyword is "search engine". There, 101 denotes the mode "take the 2 content words before the keyword"; 102 denotes the mode "take the 2 content words before and the 2 content words after the keyword"; 103 denotes the mode "take the 2 content words after the keyword"; 104 denotes the mode "take the content words before the first comma or full stop after the keyword"; 105 denotes the mode "take the words before the first comma or full stop that lies at least 2 words after the keyword".
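The five cut-off modes of Fig. 1 can be illustrated with the sketch below. It assumes tokenized English text in which punctuation marks are separate tokens, and it stands in for "content words" with a small hypothetical function-word list; the helper names are not from the source.

```python
FUNCTION_WORDS = {"a", "an", "the", "of", "for", "to", "and"}  # hypothetical stop list
PUNCTUATION = {",", "."}

def content(words):
    """Keep only content words (drop function words and punctuation)."""
    return [w for w in words if w not in FUNCTION_WORDS and w not in PUNCTUATION]

def before_n(tokens, start, n):
    """Mode 101: the n content words before the keyword starting at `start`."""
    words = content(tokens[:start])
    return words[max(0, len(words) - n):]

def after_n(tokens, end, n):
    """Mode 103: the n content words after the keyword ending at `end`."""
    return content(tokens[end:])[:n]

def until_punct(tokens, end, min_words=0):
    """Modes 104/105: words after the keyword up to the first comma or full
    stop that lies at least `min_words` words away from the keyword."""
    words = []
    for token in tokens[end:]:
        if token in PUNCTUATION:
            if len(words) >= min_words:
                break
            continue  # punctuation too close to the keyword is skipped
        words.append(token)
    return words

tokens = "we built a fast search engine for the legal archive , then sold it .".split()
start, end = 4, 6                                   # keyword "search engine"
mode_101 = before_n(tokens, start, 2)               # ['built', 'fast']
mode_103 = after_n(tokens, end, 2)                  # ['legal', 'archive']
mode_102 = mode_101 + mode_103
mode_104 = content(until_punct(tokens, end))        # ['legal', 'archive']
tokens_b = "search engine , frankly , beats a catalogue .".split()
mode_105 = until_punct(tokens_b, 2, min_words=2)
```

Mode 105 differs from mode 104 only in that a comma or full stop fewer than 2 words from the keyword does not end the segment.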
Where necessary, the length of a word segment may be counted while omitting or ignoring the presence, absence or variation of certain prefixes or suffixes, function words, auxiliary words, numerals, measure words, non-content words, punctuation marks or spaces (see embodiment A below), and even of adjectives or adverbs.
When the keyword used in retrieval consists of several separable words, the adjacent word segment mentioned above may be that of one of these words (for example the first) or the adjacent word segments of each of several of them. In the latter case, deciding whether the adjacent word segments of the keyword in different texts are the same may require comparing the adjacent word segments of the different parts of the keyword separately.
When the same keyword occurs several times in a text, the adjacent content of just one occurrence may be considered, or the text may be split appropriately and treated as several texts. The latter is better suited to retrieval over long texts.
Step 3: according to whether the adjacent word segment of the keyword in each of some or all of the electronic texts is the same as or different from that in the other electronic texts, assign the text and the other texts to the same or different subsets or classes, or process them in correspondingly the same or different ways (see embodiment A below).
In general, "the same" means that two word segments are exactly identical. Where necessary, however, the judgement of whether two word segments are the same may omit or ignore the presence, absence or variation of certain prefixes or suffixes, function words, auxiliary words, numerals, measure words, non-content words, punctuation marks or spaces, and even of adjectives or adverbs.
For example, under a loose standard applied where needed, "strength of science is very powerful" and "science power is very powerful" may be regarded as two identical adjacent word segments.
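A loose comparison of this kind might be sketched as follows. The function-word list, the lower-casing and the names `normalize` and `same_segment` are assumptions chosen for an English illustration; the source does not prescribe a particular normalization, and its own example depends on the wording of the original language.

```python
FUNCTION_WORDS = {"the", "a", "an", "of"}   # hypothetical list of ignorable words
PUNCTUATION = {",", ".", ";", ":"}

def normalize(segment):
    """Drop function words and punctuation and ignore letter case."""
    return tuple(
        w.lower() for w in segment
        if w.lower() not in FUNCTION_WORDS and w not in PUNCTUATION
    )

def same_segment(a, b, loose=False):
    """Strict mode needs identical segments; loose mode compares normalized ones."""
    if loose:
        return normalize(a) == normalize(b)
    return list(a) == list(b)

first = ["the", "latest", "technology"]
second = ["latest", "technology", ","]
```

Under the strict standard `first` and `second` differ, while under the loose standard they fall into the same subset.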
Once the electronic texts have been classified by whether the adjacent word segment of the keyword in each is the same as or different from that in the others, the user can, according to interest in a particular adjacent word segment of the keyword, directly obtain or skip all the texts in the class containing that kind of adjacent word segment.
The correspondingly same or different processing may include: giving the electronic texts the same or different distribution locations or storage modes; assigning them the same or different subset labels; giving their indexes the same or different marks or index entries; giving them the same or different arrangement; giving them the same or different display modes or positions on the interactive interface; letting at least some subsets each contribute one or more adjacent word segments or texts to a cross-subset combination or ordering; or compiling a one-level or multi-level directory or sequence that reflects the parallel or sequential relationships among the different adjacent word segments or indirectly adjacent word segments of the same keyword in the electronic texts, or among statements or abstract examples containing these segments (see Fig. 2 and Fig. 8).
Step 4: display the results on the interactive interface, or display them as the query requires. For the manner of display, see embodiment A, Fig. 10, Fig. 11 and the related description.
For different electronic texts that belong to the same first-level or higher-level subset, that is, different electronic texts whose content contains the same keyword and the same adjacent word segment, some or all of them can be further divided into the same or different next-level or multi-level subsets of that subset, or processed in correspondingly the same or different ways, according to whether a further adjacent word segment of the keyword together with its adjacent word segment is the same or different.
As described above, the correspondingly same or different processing here may likewise include: giving the electronic texts the same or different distribution locations or storage modes; assigning them the same or different subset labels; giving their indexes the same or different marks or index entries; giving them the same or different arrangement; giving them the same or different display modes or positions on the interactive interface; letting at least some subsets each contribute one or more adjacent word segments or electronic texts to a cross-subset combination or ordering; or compiling a one-level or multi-level directory or sequence that reflects the parallel or sequential relationships among the different adjacent word segments or indirectly adjacent word segments of the same keyword, or among statements or abstract examples containing these segments, or displaying it on the interactive interface (see embodiment A below and Fig. 10).
In effect, by expanding the adjacent word segment of the original keyword and comparing whether the expanded parts are the same, an original subset sharing one adjacent word segment is further divided into several next-level subsets. If needed, this can continue until the user obtains a satisfactory result. This is another advantage of the method.
For example, we divide a plurality of electronic texts whose keyword is "search engine" into subsets by adjacent word segment. The first adjacent word segment is taken in the mode "1 word before the keyword + 1 word after", which yields a number of subsets. Among them is the subset whose adjacent word segment is "professional K company" (K here denotes the keyword "search engine"), which contains 185 texts. If we then divide these 185 texts according to whether the second adjacent word segment, formed by the 3 words following "K company", is the same, we obtain 13 second-level subsets with second adjacent word segments such as "through professional technology" and "strives to open markets". If we go on to divide the texts of the second-level subset containing "through professional technology" by the third adjacent word segment (the 2 content words that follow), we obtain several third-level subsets (see Fig. 2 and Fig. 8).
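The level-by-level subdivision in this example can be sketched as a nested dictionary. The sketch assumes tokenized texts, considers only the first occurrence of the keyword, and counts plain words rather than content words at every level; `build_tree` and its parameters are hypothetical names.

```python
def find_first(tokens, keyword):
    """Return the first start position of the keyword (a token list), or -1."""
    k = len(keyword)
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] == keyword:
            return i
    return -1

def build_tree(texts, keyword, before, after_lengths):
    """Nest texts by successive adjacent word segments; K marks the keyword."""
    tree = {}
    for text_id, tokens in texts.items():
        i = find_first(tokens, keyword)
        if i < 0:
            continue
        pos, level_nodes = i + len(keyword), tree
        for level, n in enumerate(after_lengths):
            segment = tokens[pos:pos + n]
            pos += n
            if level == 0:                      # first segment also takes preceding words
                segment = tokens[max(0, i - before):i] + ["K"] + segment
            elif not segment:                   # text too short for a deeper level
                break
            node = level_nodes.setdefault(" ".join(segment), {"ids": [], "children": {}})
            node["ids"].append(text_id)
            level_nodes = node["children"]
    return tree

texts = {
    1: "a professional search engine company through professional technology serves clients".split(),
    2: "the professional search engine company through professional technology wins awards".split(),
    3: "one professional search engine company strives to open markets".split(),
    4: "a free search engine service".split(),
}
tree = build_tree(texts, ["search", "engine"], before=1, after_lengths=[1, 3, 2])
```

The subset "professional K company" then holds texts 1 to 3 and splits into two second-level subsets, one of which splits again at the third level.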
The processing and retrieval method of the invention also allows, for example, successive adjacent word segments to be merged or split so as to reduce or increase the number of subset levels. For the electronic texts with the keyword "search engine" above, if we take the first adjacent word segment from the outset in the mode "1 word before the keyword + 4 words after", the number of first-level subsets obtained should equal the sum of the numbers of second-level subsets produced by the earlier division. The result is similar, but there are fewer subset levels.
In fact, for the same large body of electronic texts, the longer the adjacent word segment of the keyword is set, the more subsets are obtained and the fewer texts each contains; conversely, the shorter the segment, the fewer subsets are obtained and the more texts each contains.
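The relationship between segment length, number of levels and number of subsets can be checked on a toy corpus. The sketch below assumes tokenized texts and plain word counts; `segments` and `count_subsets` are hypothetical helpers.

```python
def segments(tokens, keyword, before, lengths):
    """Return the preceding segment and the successive following segments
    for the first occurrence of the keyword, or None if it is absent."""
    k = len(keyword)
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] == keyword:
            parts, pos = [tuple(tokens[max(0, i - before):i])], i + k
            for n in lengths:
                parts.append(tuple(tokens[pos:pos + n]))
                pos += n
            return tuple(parts)
    return None

def count_subsets(texts, keyword, before, lengths):
    """Count the distinct lowest-level subsets for the given segment lengths."""
    keys = {segments(t, keyword, before, lengths) for t in texts}
    keys.discard(None)
    return len(keys)

corpus = [
    "a professional search engine company through professional technology".split(),
    "the professional search engine company strives to open markets".split(),
    "one professional search engine company through professional technology again".split(),
    "a free search engine service for every user".split(),
]
kw = ["search", "engine"]
short = count_subsets(corpus, kw, 1, [1])        # 1 word before + 1 after
two_level = count_subsets(corpus, kw, 1, [1, 3]) # then 3 more words as a second level
merged = count_subsets(corpus, kw, 1, [4])       # 1 word before + 4 after, one level
```

On this corpus the merged one-level division produces as many subsets as the two-level division has second-level subsets in total, and the shorter segment produces fewer subsets.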
Faced with the many subsets, or their subordinate subsets, produced by the above method, we can for ease of consultation compile a one-level or multi-level directory, tree directory or sequence that reflects the parallel or sequential relationships among the different adjacent word segments or indirectly adjacent word segments of the same keyword in the electronic texts, or among statements (example sentences) or abstract examples containing these segments.
When compiling such a directory, tree directory or sequence, we can take the shared adjacent word segment or indirectly adjacent word segment of each of one or more different subsets of the electronic texts, or a statement (example sentence) or abstract example containing it, together with the shared adjacent word segment, indirectly adjacent word segment, statement, example sentence or abstract example of each of the subsets at the next or further levels below them, and arrange, distribute, store or display them according to their parallel or subordinate relationships. The word segments, statements, example sentences or abstract examples may be placed side by side across subsets.
The directories shown in Fig. 2 and Fig. 8 are two examples. Fig. 2 shows a one-level or multi-level directory or tree directory of the parallel or sequential relationships among the different adjacent word segments, and the adjacent word segments of those segments, for the electronic texts with the keyword "search engine" above. (The example of Fig. 8 is explained later.)
In the tree directory example of Fig. 2, the keyword is "search engine", represented by the symbol "K". The first adjacent word segment is taken in the mode "1 word before the keyword + 1 word after", the second adjacent word segment is the 3 words that follow it, and the third adjacent word segment is the 2 content words that follow that.
If, when reading the directory of different adjacent word segments, we find it hard to grasp the core content of the texts, we will want to see more of the content surrounding each adjacent word segment. We may therefore need to compile a derived sequence of the one-level or multi-level directory or tree directory of the different adjacent word segments of the keyword, in which any adjacent word segment of the original directory or tree directory can be supplemented with, or replaced by, more content containing that segment.
Such content may be, for example, a statement (example sentence), abstract example, bibliographic record or representative text containing the adjacent word segment. The keyword and the adjacent word segment within the statement (example sentence), abstract example or representative electronic text may be given a font, colour or other feature that distinguishes them from the rest; the word segments, statements (example sentences), abstract examples or representative electronic texts may be placed side by side across subsets.
In fact, each subset or subordinate subset can be represented by its adjacent word segment or by a statement (example sentence) or abstract example containing it. The representative content of many subsets can then be arranged on a limited interactive interface as a directory or sequence, and the user can select a subset of interest and the texts it contains by clicking its representative content. For example, clicking the segment "Google" that follows "respectively by Yahoo" under "different K technology 672" in the directory of Fig. 2 yields the list of all texts, or the related content, of the subset containing the word segment "different K technology respectively by Yahoo Google". Clicking the adjacent word segment within the related content of a corresponding sequence or directory gives the same result.
In fact, the technique lets the user indicate text, graphics or symbols in a directory, sequence or other content on the interactive interface, for example by clicking with the cursor, in order to select, expand or follow a link to the related content.
The method can be made more convenient still. For example, for an adjacent word segment or indirectly adjacent word segment of the keyword in the directory, tree directory or sequence, if it has only one kind of adjacent word segment at the next or a further level, the segment can be distributed, stored or displayed at its original position together with that next-level or further-level adjacent word segment.
For the user's convenience, the technique may also allow a prompt to appear in the electronic texts, directories, statements (example sentences) or abstract examples, or next to the keyword, adjacent word segment or indirectly adjacent word segment they contain, giving the number of parallel subsets, the number of subordinate subsets, or the number of texts of the subset in which the word or segment lies (as in Fig. 2).
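Merging a segment with its only next-level segment can be sketched as follows, assuming the directory is held as a nested dictionary that maps each segment label to its child directory. The representation and the name `collapse` are assumptions made for illustration.

```python
def collapse(directory):
    """Join every segment that has exactly one child with that child."""
    merged = {}
    for label, children in directory.items():
        while len(children) == 1:                   # a single next-level segment
            (child_label, grandchildren), = children.items()
            label = f"{label} {child_label}"        # show both at the original position
            children = grandchildren
        merged[label] = collapse(children)
    return merged

directory = {
    "different K technology": {
        "respectively by Yahoo": {"Google": {}},
        "is mature": {},
    },
    "professional K company": {
        "through professional technology": {"serves clients": {}},
    },
}
compact = collapse(directory)
```

Branches with several alternatives are left as they are, while chains with no alternatives shrink to a single line, which saves space on the interactive interface.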
The method of processing of the present invention and retrieval can also comprise:
Can also further utilize disposal route of the present invention, for example layout contains the sequence of a plurality of e-texts or the e-text partial content of same keyword, they contain by a plurality of speech form different or different basically in abutting connection with the speech section.Can think, wherein contain identical described e-text or e-text partial content and have only one or more as the representative in the sequence in abutting connection with the speech section.
The partial content of an e-text may be a summary, index, bibliographic entry, example sentence, phrase, or the like that contains the same keyword.
Put another way, the multi-word (two or more words) adjacent word segments of the keyword in the e-texts or partial contents of this representative sequence are different or substantially different from one another. Several words generally reflect the core content adjacent to the keyword better than a single word does.
In this way the method can readily condense millions of network information items that share a keyword into a representative sequence for that keyword with two to three orders of magnitude fewer entries, while the core content of each item (the content formed by the few words nearest the keyword) is neither repeated nor omitted. This is also a very effective way of refining the core content of web pages, and a marked improvement over prior-art methods that can only reject mirror pages.
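A representative sequence of this kind can be sketched in a few lines. The sketch below assumes whitespace-separated words, a single-word keyword, and an adjacent word segment taken as the words immediately following the keyword; the function names and sample texts are hypothetical.

```python
def adjacent_segment(words, keyword, length):
    """Return the `length` words that follow the first occurrence of `keyword`."""
    start = words.index(keyword) + 1
    return tuple(words[start:start + length])


def representative_sequence(texts, keyword, length=3):
    """Keep the first text seen for each distinct adjacent word segment."""
    representatives = {}
    for text in texts:
        words = text.split()
        if keyword not in words:
            continue  # only texts containing the keyword take part
        representatives.setdefault(adjacent_segment(words, keyword, length), text)
    return list(representatives.values())


texts = [
    "a search engine ranks web pages quickly",
    "the search engine ranks web pages by links",
    "our search engine indexes local files",
]
# The first two texts share the segment ("ranks", "web", "pages"),
# so only the first of them is kept as the representative.
print(representative_sequence(texts, "engine"))
```

Because each distinct segment contributes exactly one entry, the core content next to the keyword appears once and none is dropped.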
If the catalogue of distinct adjacent word segments for a keyword, or the so-called representative information sequence, still contains too many entries, the distinct adjacent word segments of the keyword across the e-texts can additionally be compared for similarity. Distinct segments that meet a given similarity requirement with one another are grouped into the same similar subclass; segments that do not meet it are placed in different similar subclasses, or are compiled into a sequence or catalogue of mutually dissimilar adjacent word segments. The content common to all elements of a similar subclass can serve as the title or label of that subclass, or can be listed in a sequence or catalogue of similar-subclass names.
Many kinds of similarity comparison or similarity requirement are possible. Where needed, it can be stipulated that the similarity requirement includes at least a requirement on the number or proportion of identical characters, words, or phrases that the distinct adjacent word segments contain.
For instance, for a sequence or catalogue of distinct adjacent word segments that are 4 words long (with or without function words omitted), the similarity requirement may be that distinct segments share all 4 words, or at least 3 of them, with the word order not necessarily the same. The similarity requirement can be preset by the system or chosen by the user. In this example the 4 or 3 shared words, separated by punctuation, can serve as the title or catalogue entry of the corresponding similar subclass.
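The "at least 3 of 4 words in common, in any order" requirement can be illustrated with a short sketch. The text does not fix a grouping procedure, so the greedy strategy here (compare each segment with the first member of each existing group) is an assumption, as are the function names and the sample segments.

```python
def similar_subsets(segments, min_shared=3):
    """Greedily group segments that share at least `min_shared` words,
    ignoring order, with the first member of an existing group."""
    groups = []  # each item: (word set of the first member, member segments)
    for segment in segments:
        words = set(segment)
        for seed_words, members in groups:
            if len(seed_words & words) >= min_shared:
                members.append(segment)
                break
        else:
            groups.append((words, [segment]))
    return [members for _, members in groups]


def subset_title(members):
    """Title a similar subset by the words common to all of its members."""
    common = set(members[0]).intersection(*map(set, members[1:]))
    return ", ".join(sorted(common))


segments = [
    ("X", "Y", "Z", "A"),
    ("Y", "X", "Z", "B"),
    ("P", "Q", "R", "S"),
    ("Z", "Y", "X", "A"),
]
groups = similar_subsets(segments)
print([subset_title(group) for group in groups])
```

Since similarity of this kind is not transitive, a different grouping strategy could produce different subsets; the sketch only shows one plausible reading.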
Fig. 3 shows another example of similar subclasses, formed by similarity comparison on top of a sequence of distinct adjacent word segments of the keyword (each taken as the 3 words following it). The system-preset similarity requirement for these segments is that they "must contain the same 3 words, in any order". The keyword of every e-text in this example is "search engine", represented by the symbol "K". The title of each similar subclass is shown as the words its segments have in common, each represented here by a capital letter. The distinct adjacent word segments of the first similar subclass in Fig. 3 all contain the three words X, Y, and Z. Because the shared words occur in different orders, the segments of one similar subclass can form different first adjacent word segments, which make up the next-level subclasses of that similar subclass.
Where needed, the subclass containing a given adjacent or indirectly adjacent word segment can also be assigned to the similar subclass to which that segment belongs, or the e-texts or partial contents containing a given adjacent word segment can be placed among the lower-level subclasses of the similar subclass to which that segment belongs.
Evidently, similar subclasses are compiled on top of the original catalogue or sequence of adjacent word segments, so the number of similar subclasses, or the length of their catalogue, is markedly smaller than that of the original catalogue or sequence. From the titles of the similar subclasses in the catalogue the user can more easily see the main components of the relevant adjacent word segments (several independent, parallel words) and, if interested, open a similar subclass to obtain information on each of its lower-level subclasses.
A more efficient arrangement is to fix the number of words in the keyword's adjacent word segment at a value from 2 to 10, for example 6. Processing then yields a sequence or catalogue of distinct adjacent word segments; if it is too long (for example, more than several hundred entries), similarity comparison can be applied to obtain a catalogue or sequence of distinct similar subclasses (for example, reduced to a few dozen entries). This makes querying very convenient.
The entries of the catalogues or sequences obtained by this method may, for example, be ordered randomly, or by known ranking techniques. Where needed, the position of a parallel subclass, a parallel adjacent or indirectly adjacent word segment, or a parallel text, sentence, example sentence, summary example, or representative-sequence item can depend partly or wholly on one or more of the following factors:
the page link value, click-through rate, or keyword occurrence rate of the e-text, or of the e-text containing the segment, sentence, example sentence, summary example, or item;
or the number of lower-level subclasses or e-texts of the subclass, the click-through rate of the subclass, or the mean page link value of its e-texts;
or the same quantities for the subclass that contains the segment, e-text, sentence, example sentence, summary example, or item;
or the page link value of the e-text with the highest page link value in the subclass, or of another example e-text;
or the click-through rate or keyword occurrence rate of the text in the subclass with the highest click-through rate or keyword occurrence rate, or of another example e-text;
or the rank of the related e-text, among the related texts or within the related subset, in the search results of other search sites or retrieval systems;
or the payment or bid made by the sponsor of the related e-text or related word segment;
or the spelling, alphabetical or pinyin order, or stroke count of the words in the adjacent word segment;
or the rating of the text's source website, organization, or author;
or the time at which the related text was indexed, or its recency;
or whether the items belong to the same subclass at a given level. Where needed, the position can be determined by the value of an objective function that depends on one or more variables, some or all of which represent one or more of the factors listed above.
For example, an objective function value can be written $F(x_1, x_2, \ldots, x_n)$,
and one may take, for example, $F(x_1, x_2, \ldots, x_n) = F_1(x_1) + F_2(x_2) + \cdots + F_n(x_n)$;
where $x_1, x_2, \ldots, x_n$ are one or more of the factors (variables) mentioned in the summary of the invention above as determining the sort position, or other factors. Because the prior art (for example, patent US6285999) contains many specific treatments, they are not described further here.
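As an illustration of the additive form, the sketch below evaluates an objective value as a sum of per-factor component functions and sorts entries by it. The three component functions, their weights, and the sample values are hypothetical; the text does not prescribe any of them.

```python
def additive_objective(variables, components):
    """Evaluate F(x_1, ..., x_n) = F_1(x_1) + ... + F_n(x_n)."""
    return sum(f(x) for f, x in zip(components, variables))


# Hypothetical component functions for three factors:
# page link value, click count, and bid.
components = [
    lambda link_value: 2 * link_value,
    lambda clicks: clicks,
    lambda bid: bid // 2,
]

# Hypothetical entries: name -> (page link value, clicks, bid).
entries = {"A": (3, 10, 0), "B": (5, 1, 4), "C": (1, 2, 20)}

ranking = sorted(
    entries,
    key=lambda name: additive_objective(entries[name], components),
    reverse=True,
)
print(ranking)
```

Keeping each factor in its own component function makes it easy to add, drop, or reweight a factor without touching the others.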
Where needed, the processing and retrieval method also allows an existing procedure or result to be refined or broadened by adding or removing further keywords that must or must not be present, or by adding or removing restrictions or requirements on time, region, language, or other type or scope.
For example, the contents of a subclass obtained by comparing adjacent word segments under a loose requirement (for example, ignoring differences in function words) can be compared again under a strict requirement (for example, not ignoring function words), in order to divide next-level subclasses or obtain a more detailed adjacent-segment catalogue or the corresponding information; the reverse operation is also possible.
Adding or removing a keyword (such as "China"), or changing a restriction on time (for example, from within one year to within half a year or two years), region (Hebei, Baoding, or North China), language (English or Western languages), other type (article or toy), or scope (boy, children, or people), conveniently narrows or widens the search range.
Another aspect of the invention is a computer data system comprising a storage device, in which the data structure of some or all of the keyword indexes of the related e-texts held in the storage device, or in a data portion of it, comprises at least:
a keyword segment;
one or more adjacent word segments, formed by mapping, in their original order, the successive levels of adjacent word segments (each of a predetermined number of words) next to the keyword in the e-text content or e-text summary, namely: adjacent word segment 1, adjacent word segment 2, ..., adjacent word segment N;
a corresponding text ID segment, or the ID segment of its related information (the ID segment being an address segment);
where necessary, a summary segment or title segment of the e-text that contains the keyword.
In general, a keyword index is built so that a search or retrieval system can carry out keyword retrieval conveniently, and a single e-text usually has indexes for several different keywords so that it can be retrieved by any of them. As an example of the invention, the index data structure of one text for the keyword "Yangtze River" (Changjiang) is as follows:
Figure C200710164298D00281
With such a data structure, whether the search engine is given "Yangtze River", the longer search phrase "Yangtze River basin", or the still longer "Yangtze River basin hydropower", it can reach this index easily and then find the text through its address, which aids concrete implementation of the invention. That is, the system can search for the corresponding index or content by matching the keyword segment together with one or more adjacent word segments, combining more or fewer segments as the search requires.
For instance, if each adjacent word segment in the index is one word long, then once the query keyword and each adjacent word are determined, the computer can easily obtain the indexes whose keyword segment and adjacent word segments all meet the search requirement.
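The index record and the lookup by keyword plus leading adjacent word segments can be sketched as follows. The field names, the `lookup` function, the linear scan, and the sample records (including the text IDs) are assumptions for illustration; the actual index layout appears only in the figure above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    keyword: str      # keyword segment
    adjacent: tuple   # adjacent word segments 1..N, in original order
    text_id: str      # ID (address) segment of the text
    summary: str = ""  # optional summary or title segment


def lookup(index, keyword, segments=()):
    """Return entries whose keyword matches and whose leading adjacent
    word segments equal `segments` (an empty tuple matches every entry)."""
    wanted = tuple(segments)
    return [
        entry for entry in index
        if entry.keyword == keyword and entry.adjacent[:len(wanted)] == wanted
    ]


# Hypothetical index records with one-word adjacent segments.
index = [
    IndexEntry("Yangtze River", ("basin", "hydropower", "resources"), "doc-1"),
    IndexEntry("Yangtze River", ("basin", "flood", "control"), "doc-2"),
    IndexEntry("Yangtze River", ("bridge", "at", "Nanjing"), "doc-3"),
]

print([e.text_id for e in lookup(index, "Yangtze River")])
print([e.text_id for e in lookup(index, "Yangtze River", ("basin",))])
print([e.text_id for e in lookup(index, "Yangtze River", ("basin", "hydropower"))])
```

Each additional adjacent segment in the query narrows the match, which mirrors lengthening the search phrase from "Yangtze River" to "Yangtze River basin hydropower".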
The address may be a database text address, a web page address, or another kind of address.
The computer data system may also be a search engine system (see Embodiment B below).
The invention may also be a computer data system comprising a storage device, in which some or all of the keyword indexes, text summaries, or texts held in the storage device, or in a data portion of it, are distributed as follows:
Index, summary, or text data whose e-text summaries or e-texts contain the same keyword are located in the same or in different subclass regions of that keyword's set, according to whether the keyword's adjacent word segment is the same or different.
Where needed, index data that lie in the same subclass, and whose e-text summaries or e-texts contain the same extended key sentence (the keyword together with one or more levels of adjacent word segments), are located in the same or in different regions of the next one or more lower-level subclasses of that subclass, according to whether the adjacent word segment of that sentence is the same or different.
For example, the indexes of all e-texts or partial e-text contents (such as summaries, bibliographic entries, sentences, or paragraphs) that share a keyword, together where needed with the indexes of the keyword's adjacent-segment catalogue tables (or subclass catalogue tables), multi-level adjacent-segment tree catalogue tables (or multi-level subclass catalogue tables), and the corresponding example-sentence or summary-example sequence tables, can be wholly or partly concentrated, or arranged contiguously, in a central storage region corresponding to that keyword (see Embodiment A below).
The data structure of each index described here comprises at least the address segment of the indexed storage object (a text, catalogue table, sequence table, and so on).
When this keyword is queried, the related indexes can then be accessed conveniently or contiguously, the address or number in the address segment (ID segment) of each index obtained, and the related catalogue, text, or other content accessed, extracted, or displayed.
Similarly, the indexes of the e-texts or partial e-text contents that share a keyword, together where needed with the indexes of further next-level or multi-level adjacent-segment catalogue tables, tree catalogue tables, or the corresponding example-sentence or summary-example sequence tables, can be wholly or partly concentrated, or arranged contiguously, in separate central storage regions, one for each distinct adjacent word segment of the keyword.
The computer data system may be a search engine system; it can then more conveniently query, process, or provide to the user the data of the subclass, and of its one or more lower-level subclasses, that relate to the queried keyword and adjacent word segment.
The concrete flow by which a computer system carries out the method of processing and retrieving e-texts that contain the same keyword can be illustrated by the examples of Fig. 11 and of Figs. 7, 9, and 10 (including Embodiments A, B, C, and so on). "Text" below always means e-text. In the example of Fig. 11, the computer processing equipment starts (61) and receives the keyword query submitted by the user (62), obtaining a large number of texts that contain the keyword. According to a preset value or the user's specification, it determines the number or range of words in the keyword's adjacent word segment (for example, 5 content words) (63), compares and classifies the adjacent word segments of that range from the different texts (64), and divides the texts into subclasses whose adjacent word segments are identical (65). On this basis the subclasses obtained can be divided again (66), for example into next-level subclasses according to whether the next-level adjacent word segments are the same, or by a stricter comparison of adjacent word segments. A representative sequence, a sequence of distinct adjacent word segments, or a corresponding catalogue can also be arranged (67), including labeling each subclass with its number of texts and sorting appropriately (71), and shown on the interface (70) for the user to select from, expanding the related subclass or displaying the related texts (72). If these sequences or catalogues have too many entries, the adjacent word segments of the entries can be compared for similarity, dividing them into similar subclasses (68) or arranging a sequence or catalogue of distinct similar contents (69), which makes browsing easier. When the user finds content of interest, a further click (70) expands the related subclass or more detailed content (72). (The examples of Figs. 7, 9, and 10 are explained below.)
Embodiment A, shown in Fig. 4, is an example of a computer data system that can carry out the e-text processing and retrieval method of the invention and provide search by extended key sentences: an Internet search engine system. ("Text" in this and the following embodiments always means e-text.) It comprises a search engine 8 on a server 5 that has a memory 6 and a processor 7; the search engine 8 is connected through the communication network 4 of the Internet to a client computer 3 that has an interactive interface 2. The search engine 8 has a database 9, a requestor 11, and a keyword expansion component (or module) 10, and is connected to a data acquisition unit 12 and an index constructor 13.
The data acquisition unit 12 collects texts from the Internet or other information sources and adds them to the text library of database 9; the index constructor 13 analyzes the texts in the text library to obtain text indexes, which it supplies to the keyword index library of database 9.
Each index that the index constructor 13 obtains from its analysis of a text comprises a keyword segment, 6 single-word adjacent word segments, the ID segment of the corresponding text, a text title segment, and a text summary segment. The search engine can thus find the required text index from the desired keyword segment, or from that segment together with one or more desired single-word adjacent word segments, obtain the title segment, summary segment, or ID segment of the text, and link to the original text when needed. The indexes of the keyword index library are distributed by multi-level subclass according to whether the adjacent word segments at each level are the same or different, to ease retrieval or extraction. The corresponding adjacent-segment catalogue, adjacent-segment tree catalogue (Fig. 8), and key-sentence catalogue are also stored in advance.
The client application on the client computer 3 of Embodiment A, a browser (Microsoft Internet Explorer), lets user 1 retrieve HTML documents (including Web forms) from server 5 over communication network 4. The interactive interface 2 on client computer 3 lets user 1 interact with the retrieved Web forms using a monitor, keyboard, or mouse, submit search requests, make selections, and receive search results.
An important question for the search method of the invention is how adjacent word segments are selected (or how the keyword and adjacent word segments are combined), that is, how key sentences are generated. In the example of Embodiment A shown in Fig. 5, key sentences are formed by extending from keyword 21 toward the text that follows it in the summary, adding one adjacent word segment (here a single word) at a time. Here 22 is the level-1 key sentence, 23 the level-2 key sentence, 24 the level-3 key sentence, and 25 the level-4 key sentence.
Fig. 6 shows another way of generating key sentences, that of Embodiment B. Its level-1 adjacent word segment lies before keyword 21, and its level-2 and further adjacent word segments lie after the keyword. Here 22 is the level-1 key sentence, 23 the level-2 key sentence, 24 the level-3 key sentence, and 25 the level-4 key sentence. This mode, which takes account of both sides of the keyword, seems better suited to searching Western-language documents. The length (number of words) of the adjacent word segment at each level can be fixed in advance, chosen by the user at search time, or set by system default.
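The two generation modes of Figs. 5 and 6 can be contrasted in a short sketch, assuming one-word adjacent segments and a tokenized summary. The function names and the sample sentence are hypothetical; if the keyword is the first word, the second function simply has no preceding word to add.

```python
def key_sentences_following(words, pos, levels):
    """Fig. 5 style: level k is the keyword plus the k words that follow it."""
    return [" ".join(words[pos:pos + 1 + k]) for k in range(1, levels + 1)]


def key_sentences_both_sides(words, pos, levels):
    """Fig. 6 style: level 1 adds the word before the keyword,
    and each later level adds one more word after it."""
    start = max(pos - 1, 0)
    return [" ".join(words[start:pos + k]) for k in range(1, levels + 1)]


words = "the fast search engine indexes new web pages".split()
pos = words.index("engine")  # position of the keyword in the summary

print(key_sentences_following(words, pos, 4))
print(key_sentences_both_sides(words, pos, 4))
```

In the second mode the level-1 key sentence already disambiguates by the preceding word (for example an adjective or modifier), which is one reason it may fit Western-language word order better.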
In other, more extreme embodiments, key sentences at each level can also be formed by repeatedly extending adjacent word segments from the keyword toward the text that precedes it.
For keyword searches that allow several separate words, one word should be chosen as the core keyword and combined with its adjacent word segments to form key sentences at each level; all of these key sentences carry the remaining, separable keywords. Alternatively, adjacent word segments can be added one at a time, in a desired order, next to each word or segment of the multi-word keyword to form key sentences at each level.
When counting the words of an adjacent word segment and comparing whether adjacent word segments are identical, the system of Embodiment A may choose not to count function words, measure words, punctuation, spaces, and the like, merging them into the adjacent content word. Corresponding specific rules can be laid down for Western languages. In other embodiments, adjectives, adverbs, and the like may even be left uncounted when counting words and comparing segments.
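A loose comparison that leaves function words uncounted can be sketched as below. The sketch simply drops function words and punctuation rather than merging them into the neighboring content word, and the stop list, function name, and sample sentences are illustrative assumptions.

```python
# Illustrative stop list; a real system would use a fuller, language-specific one.
FUNCTION_WORDS = frozenset({"the", "a", "an", "of", "by", "and", "to", "in"})


def content_segment(words, pos, length, skip=FUNCTION_WORDS):
    """Return the first `length` words after position `pos`,
    not counting words in `skip` or bare punctuation."""
    segment = []
    for word in words[pos + 1:]:
        token = word.strip(".,;:!?").lower()
        if not token or token in skip:
            continue
        segment.append(token)
        if len(segment) == length:
            break
    return tuple(segment)


first = "search engine of the company indexes pages".split()
second = "search engine by a company indexes the pages".split()

# Loose comparison: function words are not counted, so the segments match.
print(content_segment(first, 1, 3) == content_segment(second, 1, 3))
# Strict comparison: every word counts, so the segments differ.
print(content_segment(first, 1, 3, skip=frozenset())
      == content_segment(second, 1, 3, skip=frozenset()))
```

Switching the `skip` set between a stop list and an empty set corresponds to moving between the loose and strict comparison requirements.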
In Embodiment A, the requestor 11 verifies the query request of user 1, queries database 9 according to the keyword requested, and compiles the related data found into a result list to be offered to the interactive interface. The keyword expansion component 10 supplements requestor 11: where needed it temporarily stores or processes the key sentences at each level corresponding to the keyword, the corresponding example sentences, the adjacent-segment tree catalogue (see Fig. 8), and so on, to meet the needs of subsequent search or display. If these contents do not yet exist in database 9 or in keyword expansion component 10, the component builds them from the keyword query data of requestor 11.
In fact this is easy to achieve, and various methods can be used. For example, whether afterwards or in advance, and whether for a possible keyword or one actually submitted, the keyword expansion component 10 of Embodiment A (or a computer or other search system) can take any index or file (for example, the first) from the sequence of indexes or files containing the keyword, examine the keyword and the adjacent word or phrase, that is, the adjacent word segment of predetermined length, and store them as the first key sentence. It then takes a second index or file and checks whether the word or phrase adjacent to its keyword is the same as that of the first: if different, it is stored in turn; if the same, it is discarded. The third index or file is then checked against the first two, and so on, giving a set of mutually distinct key sentences. If, during this comparison, the indexes or files that contain the same key sentence are also arranged into groups, the subsets are formed at the same time; otherwise, requestor 11 can retrieve the index or file sequence with each key sentence as the criterion to obtain each corresponding subset. If the sequence of each index or file subset is then searched in the same way for the various level-2 adjacent word segments, the level-2 key sentences and the corresponding lower-level subsets are obtained, and so on. If one (for example, the first) or several indexes or summaries are selected from each subset as example sentences, the required catalogue and example-sentence sequence are obtained and the grouping operation is complete.
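The grouping procedure just described can be sketched as a recursive split by adjacent word segment, one level at a time. The input format (a text ID paired with its ordered adjacent segments), the function names, and the sample data are assumptions; entries that have no segment at a given level are simply left out of that level in this sketch.

```python
from collections import defaultdict


def split_into_subsets(entries, level):
    """Group (text_id, segments) pairs by their adjacent word segment at `level`."""
    subsets = defaultdict(list)
    for text_id, segments in entries:
        if level < len(segments):
            subsets[segments[level]].append((text_id, segments))
    return dict(subsets)


def build_catalogue(entries, max_level, level=0):
    """Return a nested dict: segment -> lower-level subsets, with text IDs at the leaves."""
    if level >= max_level:
        return [text_id for text_id, _ in entries]
    return {
        segment: build_catalogue(members, max_level, level + 1)
        for segment, members in split_into_subsets(entries, level).items()
    }


# Hypothetical index entries for one keyword: (text ID, adjacent segments 1..N).
entries = [
    ("t1", ("basin", "hydropower")),
    ("t2", ("basin", "flood")),
    ("t3", ("bridge", "traffic")),
    ("t4", ("basin", "hydropower")),
]
catalogue = build_catalogue(entries, max_level=2)
print(catalogue)

# One example text per lowest-level subset, here simply the first one found.
examples = {
    (seg1, seg2): ids[0]
    for seg1, lower in catalogue.items()
    for seg2, ids in lower.items()
}
print(examples)
```

Grouping while comparing, as done here, corresponds to the first of the two options in the text; the alternative of re-querying the index once per key sentence would give the same subsets.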
Thus, whether afterwards or in advance, and whether for possible keywords or for keywords actually submitted, the method can process the related texts in the same way to ease querying.
In essence, the adjacent word segment of the original keyword is extended and the extended parts are compared for identity, so that a subset whose members share the original adjacent word segment is further divided into several next-level subsets.
The entries of the catalogue and of the example-sentence sequence can be ordered, for example, by the value of an objective function. The value for an entry is that of the text with the largest objective function value in the entry's subset, and equals the sum of that text's page link value and its recent click-through rate. The example sentence can be quoted from the text with the largest objective function value in the subset.
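This ordering rule can be sketched as follows. The dictionary fields `page_link_value`, `recent_clicks`, and `summary`, the function names, and the sample values are hypothetical stand-ins for the quantities named in the text.

```python
def text_score(text):
    """Objective value of one text: page link value plus recent click-through rate."""
    return text["page_link_value"] + text["recent_clicks"]


def rank_catalogue(subsets):
    """Score each subset by its best text and sort entries by that score, highest first.

    Each result is (segment, score, example sentence taken from the best text).
    """
    ranked = []
    for segment, texts in subsets.items():
        best = max(texts, key=text_score)
        ranked.append((segment, text_score(best), best["summary"]))
    return sorted(ranked, key=lambda entry: entry[1], reverse=True)


# Hypothetical subsets keyed by adjacent word segment.
subsets = {
    "bridge": [
        {"summary": "bridge traffic", "page_link_value": 8, "recent_clicks": 1},
    ],
    "basin": [
        {"summary": "basin flood control", "page_link_value": 4, "recent_clicks": 3},
        {"summary": "basin hydropower", "page_link_value": 6, "recent_clicks": 5},
    ],
}
print(rank_catalogue(subsets))
```

Using the best text of each subset both for the score and for the example sentence means the entry a user sees first is backed by the text most likely to be wanted.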
In other embodiments, the order of a sequence of catalogue entries, example sentences, bibliographic entries, summaries, or similar items can be determined by the size of an objective function value $F(x_1, x_2, \ldots, x_n)$.
For texts that carry advertising content, the objective function value can equal the corresponding bid.
Because the prior art contains many specific methods for ordering texts, they are not described in detail here.
The keyword index of Embodiment A can be distributed by subclass without taking up more storage space than other existing keyword index libraries, which is one of its notable advantages.
In another embodiment, Embodiment B, the keyword index library is not distributed by subclass. Because its index data structure contains a keyword item and several adjacent word segment items, its requestor 11 can use a key sentence, formed from the keyword segment combined with one or more adjacent word segments, to search out and display directly the indexes that belong to each corresponding subset. Embodiment B needs only to compile the tree catalogue of adjacent word segments or key sentences, and may even leave the original conventional keyword index database unchanged.
Of course, the lower-level subclasses of an existing subclass can also be obtained in a more general way:
Where needed, an adjacent-segment tree catalogue like that of Fig. 8 can show the precedence relations among the keyword's adjacent word segments at different levels; displaying it on screen helps the user grasp the overall state of each subclass, or of the subclasses at every level, and adopt a better search strategy. The figure omits the number of texts in each subclass. The keyword is "Bu Lin"; adjacent word segments 1, 2, 3, and 4 each consist of a single content word, and each also represents the adjacent word segment common to the subclass at its level.
Embodiment A can perform a selection operation: the system determines the corresponding key sentence from the position of the user's cursor on a text, summary, catalogue, or selection bar on a page of the interactive interface, and then either performs a grouping operation on the extended key sentences, extended adjacent word segments, indexes, or text summaries that correspond to that key sentence; or sorts and displays the corresponding indexes, text summaries, or texts; or performs a removal operation, rejecting or moving the entries, indexes, text summaries, or texts on that page or on other pages that contain the key sentence.
Fig. 10 shows part of the screen during a search in one embodiment of the invention, in which a cursor click generates the displayed result (that is, a selection and grouping operation).
Search box 51 is for entering the keyword ("Bu Lin" in this example), and 52 offers two options for the click operation, 'click to expand' and 'click to reject'; 'click to expand' is selected here. The object clicked is the summary 55 shown in content column 53. When the user reads the content about "Boll index" and finds it interesting, the user points cursor 54 at the character "mark" and clicks. "Boll index", running from "Bu Lin" to "mark", then becomes the new key sentence, and the grouping operation lists several further-extended adjacent word segments or their example sentences 56.
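The 'click to expand' step can be sketched as two small functions: one forms the new key sentence from the keyword through the clicked word, and one collects the distinct segments that follow that key sentence in other summaries. The sketch assumes word-level tokens, one-word segments, and expansion only toward the text after the keyword; all names and sample summaries are hypothetical.

```python
def key_sentence_from_click(words, keyword_pos, clicked_pos):
    """New key sentence: every word from the keyword through the clicked word."""
    if clicked_pos < keyword_pos:
        raise ValueError("this sketch only expands toward the text after the keyword")
    return tuple(words[keyword_pos:clicked_pos + 1])


def further_segments(summaries, key_sentence):
    """Distinct next words that follow `key_sentence` in the summaries, in first-seen order."""
    n = len(key_sentence)
    found = []
    for words in summaries:
        for i in range(len(words) - n):
            if tuple(words[i:i + n]) == key_sentence:
                if words[i + n] not in found:
                    found.append(words[i + n])
                break  # use only the first occurrence in each summary
    return found


summaries = [s.split() for s in (
    "search engine index size grows",
    "search engine index update schedule",
    "search engine ranking signals",
    "search engine index size limits",
)]

# The keyword "engine" is at position 1; the user clicks "index" at position 2.
sentence = key_sentence_from_click(summaries[0], 1, 2)
print(sentence)
print(further_segments(summaries, sentence))
```

The list returned by `further_segments` plays the role of the further-extended adjacent word segments shown at 56.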
The search method of Embodiment A also includes an ignore operation: while the user browses a sequence of indexes, text summaries, or texts that contain the original key sentence, the system records or analyzes operations on the page of interactive interface 2 (such as page turning), or data on "attention clicks" or "ignore clicks" made on entries or content on the page, and removes the key sentences, and their related indexes and summaries, that have been consistently ignored or have received no attention within a certain reading time or space.
In the system of embodiment A, after user 1 proposed keywords by interactive interface 2 and requires, requestor 11 can be as requested inquired about and the related data the results list that inquires is offered interactive interface 2 at described database; If user 1 wishes expanded keyword, keyword expansion parts 10 will generate corresponding key sentence, and extract or provide desired data by requestor 11 search.
The workflow of search engine 8 (comprising requestor 11 and keyword expansion component 10) is illustrated in Fig. 9:
The system starts at module 41 and checks whether there is a keyword search request (42); if there is none, it returns (48). If there is one, module 43 checks whether a keyword expansion operation is requested. If not, step 44 provides an ordinary display of the search result sequence. If so, step 45 asks user 1 for the requirements through a prompt box on the screen of interactive interface 2. The system then performs the corresponding operation, provides the corresponding information, and continues at module 46 to ask for the selections and requirements of user 1. After this has been repeated several times, module 47 provides the corresponding search information, and, according to the wishes of user 1, the system either returns via module 48 or ends at 49.
The corresponding operating flow of user 1 at the interactive interface of search engine 8 is shown in Fig. 7:
After interactive interface 2 is opened and work starts (31), a keyword is selected (32); the user can then browse in the ordinary way (34) or select expanded search (33). If (33) is selected, that is, the expanded key sentence search technique is used, the user clicks to select a suitable mode of operation, for example choosing with the cursor the length (the number of words contained) of the keyword's first adjacent word segment. If the length is short, there are fewer kinds of corresponding key sentence (fewer subsets), but the content of each subset is varied and disordered; if the length is long, there are more kinds of corresponding key sentence (more subsets), and the core content of each subset is more uniform or concentrated.
Evidently, when the selected key sentence reaches 5 to 6 words in length, the resulting grouped sequence of unique indexes or summaries becomes a "refined sequence" whose core contents are essentially neither repeated nor omitted, while the total number of files may fall by several orders of magnitude.
When a long key sentence is selected, the number of unique index entries at the first level can be large. The system allows a click operation to change, and suitably reduce, the number of adjacent words or adjacent word segments in the key sentence, which can markedly reduce the number of unique indexes, key sentences, summaries or example sentences at the first level or at the current level.
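The trade-off between segment length and number of subsets can be illustrated with a minimal sketch. The function name `group_by_adjacent_segment` and the sample texts are hypothetical, and the segment is taken here as the words immediately following the first occurrence of the keyword, which is only one of the cut-off modes the method allows.

```python
from collections import defaultdict


def group_by_adjacent_segment(texts, keyword, length):
    """Map each adjacent word segment of `length` words to its text ids."""
    groups = defaultdict(list)
    for text_id, text in texts.items():
        words = text.split()
        if keyword in words:
            i = words.index(keyword)  # first occurrence only
            segment = tuple(words[i + 1:i + 1 + length])
            groups[segment].append(text_id)
    return dict(groups)


# Hypothetical texts that all contain the keyword "search".
texts = {
    1: "search engine index design",
    2: "search engine index size",
    3: "search engine ranking signals",
    4: "search box layout",
}
short_groups = group_by_adjacent_segment(texts, "search", length=1)
long_groups = group_by_adjacent_segment(texts, "search", length=2)
```

With a one-word segment the four texts fall into two broad subsets; with a two-word segment they fall into three narrower ones, so shortening the segment reduces the number of first-level entries.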
If the user forgoes choosing the length of the keyword's first adjacent word segment and the other options, the system automatically performs the grouping operation with its original settings, for example with an adjacent word segment of one word or two words at each level, and presents the result (35). User 1 can then select 37 to open a linked text in the result directly, or, per 36, select a suitable expanded key sentence from the results shown on the screen (see Figure 10) and obtain the further search result shown by module 38 (content such as the next-level subset catalog).
At this point user 1 can still select 40 to open a linked text directly, or select 39 to continue choosing a further expanded key sentence, and so on, until returning (301).
Expanding the key sentence step by step in this way narrows the search range step by step and locks onto the search target quickly and effectively.
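The step-by-step narrowing can be pictured as walking down a multi-level catalog of adjacent word segments. The sketch below builds such a catalog as a nested dictionary; the structure, the name `build_tree` and the sample texts are assumptions for illustration only.

```python
def build_tree(texts, keyword, levels, length=1):
    """Build a nested catalog: segment -> {"ids": [...], "children": {...}}.

    Level 1 holds the segment right after the keyword, level 2 the
    segment after that, and so on.
    """
    tree = {}
    for text_id, text in texts.items():
        words = text.split()
        if keyword not in words:
            continue
        pos = words.index(keyword) + 1
        node = tree
        for _ in range(levels):
            segment = tuple(words[pos:pos + length])
            if not segment:
                break  # the text ends before this level
            entry = node.setdefault(segment, {"ids": [], "children": {}})
            entry["ids"].append(text_id)
            node = entry["children"]
            pos += length
    return tree


# Hypothetical texts that all contain the keyword "search".
texts = {
    1: "search engine index design",
    2: "search engine index size",
    3: "search engine ranking signals",
    4: "search box layout",
}
tree = build_tree(texts, "search", levels=2)
engine = tree[("engine",)]
```

Selecting "engine" at the first level leaves three texts; selecting "index" beneath it leaves two, which mirrors the successive choices 36 and 39 in Fig. 7.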
In embodiment A, and of course in other embodiments of the method of the present invention, the system can record or accumulate, over a given period, the number of clicks made by one, some or all inquirers on the content associated with the various adjacent word segments or key sentences of each keyword, and a corresponding statistics module can be provided when needed.
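A statistics module of this kind might look like the following sketch. The class name `ClickStats`, the numeric timestamps and the inquirer identifiers are hypothetical; the text only requires that clicks be counted per keyword and adjacent word segment, for one, some or all inquirers, over a period.

```python
from collections import Counter


class ClickStats:
    """Accumulates clicks per (keyword, adjacent word segment) pair."""

    def __init__(self):
        self._events = []  # (timestamp, inquirer_id, keyword, segment)

    def record(self, timestamp, inquirer_id, keyword, segment):
        self._events.append((timestamp, inquirer_id, keyword, segment))

    def counts(self, start, end, inquirer_ids=None):
        """Count clicks with start <= timestamp < end.

        inquirer_ids=None counts all inquirers; otherwise only those given.
        """
        totals = Counter()
        for timestamp, inquirer_id, keyword, segment in self._events:
            if not (start <= timestamp < end):
                continue
            if inquirer_ids is not None and inquirer_id not in inquirer_ids:
                continue
            totals[(keyword, segment)] += 1
        return totals


# Hypothetical click log.
stats = ClickStats()
stats.record(10, "u1", "search", "engine")
stats.record(20, "u2", "search", "engine")
stats.record(30, "u1", "search", "box")
stats.record(99, "u1", "search", "box")
everyone = stats.counts(0, 50)
only_u1 = stats.counts(0, 50, inquirer_ids={"u1"})
```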
In Embodiment C, the key sentence search technique described above is combined with existing keyword search technology: when ordering the indexes within a subset, or when selecting each example sentence in the grouping operation, the ordering or position of the associated documents in the search results of a prior-art search system is respected or retained. In other words, the technology of the present invention includes, on top of the basic method and basic structure described above, the use of prior-art search ranking principles or methods. Except where otherwise specified, Embodiment B and Embodiment C are essentially the same as embodiment A.
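A minimal sketch of how an existing ranking could be retained inside the subsets is given below. It assumes the prior-art engine supplies a rank position per text (`prior_rank`, 0 being best) and that every subset is non-empty; ordering the subsets by their best-ranked member is a design choice of this sketch, not something the text prescribes.

```python
def order_subsets(subsets, prior_rank):
    """Order texts inside each subset, and the subsets, by a prior ranking.

    subsets: dict mapping an adjacent word segment to a non-empty list
             of text ids.
    prior_rank: dict mapping a text id to its position in the existing
                engine's results (0 = best).
    Returns a list of (segment, ordered ids, example id).
    """
    unranked = len(prior_rank)  # texts without a prior rank sort last

    def rank_of(text_id):
        return prior_rank.get(text_id, unranked)

    ordered = []
    for segment, ids in subsets.items():
        ids_by_rank = sorted(ids, key=rank_of)
        # The best-ranked text supplies the subset's example sentence.
        ordered.append((segment, ids_by_rank, ids_by_rank[0]))
    ordered.sort(key=lambda item: rank_of(item[2]))
    return ordered


# Hypothetical subsets and prior ranking.
subsets = {"engine": [1, 2, 3], "box": [4]}
prior_rank = {4: 0, 3: 1, 1: 2, 2: 3}
result = order_subsets(subsets, prior_rank)
```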
The technical features given in the above embodiments are all illustrative; the technical features of an embodiment can be used independently, and they are not to be used to limit the scope of the present invention.

Claims (31)

1. A computer-implemented method for processing and retrieving a plurality of electronic texts containing the same keyword, comprising:
Step (1): obtaining a plurality of electronic texts containing the same keyword;
Step (2): specifying the number of words contained in an adjacent word segment, or the way the adjacent word segment is extracted;
Step (3): processing the electronic texts accordingly, depending on whether the adjacent word segments of the keyword in the content of some or all of the electronic texts are identical or different;
Step (4): displaying the processing result on an interactive interface;
The corresponding processing comprises any of the following processing modes:
First processing: electronic texts with identical adjacent word segments have the same distribution position or storage mode, and electronic texts with different adjacent word segments have different distribution positions or storage modes;
Second processing: electronic texts with identical adjacent word segments are divided into the same subclass or given the same subclass mark, and electronic texts with different adjacent word segments are divided into different subclasses or given different subclass marks;
Third processing: the indexes of electronic texts with identical adjacent word segments are given the same mark or index entry, and the indexes of electronic texts with different adjacent word segments are given different marks or index entries;
Fourth processing: electronic texts with identical adjacent word segments are given the same arrangement mode, and electronic texts with different adjacent word segments are given different arrangement modes;
Fifth processing: electronic texts with identical adjacent word segments are given the same display mode or position on the interactive interface, and electronic texts with different adjacent word segments are given different display modes or positions on the interactive interface;
Sixth processing: electronic texts with identical adjacent word segments are divided into the same subclass, electronic texts with different adjacent word segments are divided into different subclasses, and at least some of the subclasses each have one or more adjacent word segments or electronic texts combined or ordered across subclasses;
Seventh processing: arranging a catalog or sequence of one or more levels, which reflects the parallel or sequential relationship among the different adjacent word segments or indirectly adjacent word segments of the same keyword of the electronic texts, or reflects the parallel or sequential relationship among the statements or summary examples that contain these different adjacent word segments or indirectly adjacent word segments;
The electronic text is an electronic file, or its summary, index, question record or title.
2. The method for processing and retrieval according to claim 1, wherein the electronic file is a web page.
3. The method for processing and retrieval according to claim 1, wherein the second processing further comprises:
arranging a catalog or sequence of one or more levels, which reflects the parallel or sequential relationship among the different adjacent word segments or indirectly adjacent word segments of the same keyword of the electronic texts, or reflects the parallel or sequential relationship among the statements or summary examples that contain these different adjacent word segments or indirectly adjacent word segments;
and arranging, distributing, storing or displaying, according to their parallel or sequential relationship, the identical adjacent word segments or identical indirectly adjacent word segments of each of one or more different subclasses containing the electronic texts, or the statements or summary examples containing them, or those of each of a plurality of subclasses at the next level or lower levels of that subclass or those subclasses;
wherein the identical adjacent word segments, the identical indirectly adjacent word segments, or the statements or summary examples containing them, are placed side by side across subclasses or within a subclass.
4. The method for processing and retrieval according to claim 1, further comprising the following step:
For different electronic texts belonging to one or more of the same first-level or higher-level subclasses, processing some or all of the electronic texts by any of the following processing modes, depending on whether the other adjacent word segments, adjoining the same keyword and its adjacent word segment contained in the electronic texts, are identical or different:
Eighth processing: electronic texts whose other adjacent word segments are identical have the same distribution position or storage mode, and electronic texts whose other adjacent word segments are different have different distribution positions or storage modes;
Ninth processing: electronic texts whose other adjacent word segments are identical are divided into the same next-level subclass or given the same subclass mark, and electronic texts whose other adjacent word segments are different are divided into different next-level subclasses or given different subclass marks;
Tenth processing: the indexes of electronic texts whose other adjacent word segments are identical are given the same mark or index entry, and the indexes of electronic texts whose other adjacent word segments are different are given different marks or index entries;
Eleventh processing: electronic texts whose other adjacent word segments are identical are given the same arrangement mode, and electronic texts whose other adjacent word segments are different are given different arrangement modes;
Twelfth processing: electronic texts whose other adjacent word segments are identical are given the same display mode or position on the interactive interface, and electronic texts whose other adjacent word segments are different are given different display modes or positions on the interactive interface;
Thirteenth processing: electronic texts whose other adjacent word segments are identical are divided into the same next-level subclass, electronic texts whose other adjacent word segments are different are divided into different next-level subclasses, and at least some of the subclasses each have one or more adjacent word segments or electronic texts combined or ordered across subclasses;
Fourteenth processing: arranging a catalog or sequence of one or more levels, which reflects the parallel or sequential relationship among the other different adjacent word segments or indirectly adjacent word segments adjoining the same keyword and its adjacent word segment in the electronic texts, or reflects the parallel or sequential relationship among the statements or summary examples that contain these different adjacent word segments or indirectly adjacent word segments.
5. The method for processing and retrieval according to claim 1, 3 or 4, wherein adjacent word segments are successively merged or split so as to reduce or increase the number of subclass levels.
6. The method for processing and retrieval according to claim 1 or 3, wherein, when judging whether the adjacent word segments of the keyword, or the other adjacent word segments adjoining those adjacent word segments, are identical or different, differences in the prefix or suffix of an adjacent word, or in punctuation or spaces, are ignored; or the presence, absence or difference of auxiliary words, numerals, measure words, adjectives or adverbs is ignored; or the presence, absence or difference of articles or conjunctions is ignored.
7. The method for processing and retrieval according to claim 1, 3 or 4, wherein, when the keyword consists of a plurality of separate words, the adjacent word segment refers to the adjacent word segment of one of those words or of several of them.
8. The method for processing and retrieval according to claim 1, 3 or 4, wherein the number of words or characters contained in an adjacent word segment, or the cut-off mode or specific content of the adjacent word segment, is predetermined, or accepted by default or selected by the inquirer.
9. The method for processing and retrieval according to claim 8, comprising: when determining the number of words or characters of an adjacent word segment or indirectly adjacent word segment, ignoring the presence, absence or difference of the prefix or suffix of one or more words, or of auxiliary words, numerals, measure words, punctuation, spaces, adjectives or adverbs.
10. The method for processing and retrieval according to claim 1 or 3, wherein, if an adjacent word segment or indirectly adjacent word segment of the keyword in the catalog or sequence has only one kind of adjacent word segment at its next level or lower levels, that adjacent word segment or indirectly adjacent word segment is distributed, stored or displayed at its original position together with its next-level or lower-level adjacent word segments.
11. The method for processing and retrieval according to claim 1, 3 or 4, wherein step (4) further comprises:
presenting, in or near an electronic text, catalog, statement or summary example, or near the keyword, adjacent word segment or indirectly adjacent word segment that it contains, a prompt giving the corresponding number of parallel subclasses or of lower-level subclasses, or the related words, or the number of parallel subclasses of the subclass in which the adjacent word segment or indirectly adjacent word segment lies, or the number of lower-level subclasses or of texts contained under the indirectly adjacent word segment.
12. The method for processing and retrieval according to claim 1, 3 or 4, further comprising step (5): the inquirer indicates text, graphics or symbols in the catalog or sequence on the interactive interface in order to select, expand or link to the related content.
13. The method for processing and retrieval according to claim 1, wherein the seventh processing further comprises:
arranging a sequence of a plurality of electronic texts, or parts of electronic texts, containing the same keyword, wherein the sequence contains different adjacent word segments each made up of a plurality of words, or the multi-word adjacent word segments of the keyword contained in each electronic text or part of an electronic text in the sequence are different.
14. The method for processing and retrieval according to claim 1, further comprising step (6):
comparing the different adjacent word segments of the same keyword of the electronic texts for similarity, and dividing a plurality of different adjacent word segments that meet a given similarity requirement with one another into the same similar subclass;
or dividing a plurality of different adjacent word segments that do not meet a given similarity requirement with one another into different similar subclasses;
or arranging a plurality of different adjacent word segments that do not meet a given similarity requirement with one another into a sequence or catalog of mutually dissimilar adjacent word segments, using the content common to the elements of a similar subclass as the title or mark of that similar subclass, or listing it in a sequence or catalog of similar-subclass names.
15. The method for processing and retrieval according to claim 14, wherein the given similarity requirement comprises at least a requirement on the number or proportion of identical characters, words, phrases or symbols contained in the different adjacent word segments.
16. The method for processing and retrieval according to claim 14, wherein a subclass containing an adjacent word segment or indirectly adjacent word segment is classed as a lower-level subclass of the similar subclass in which that adjacent word segment lies, or an electronic text, or part of one, containing an adjacent word segment is assigned to the similar subclass in which that adjacent word segment lies.
17. The method for processing and retrieval according to claim 1, 3 or 4, further comprising step (7), namely:
the specific ranking position of parallel subclasses, parallel adjacent word segments or indirectly adjacent word segments, electronic texts, parallel statements or summary examples, or representative sequence information, depends partly or wholly on one or more of the following factors:
First factor: the magnitude of the page link value, the level of the click rate, or the level of the keyword occurrence rate, of the electronic text, or of the electronic text in which the adjacent word segment, indirectly adjacent word segment, statement, summary example or information is located;
Second factor: the number of lower-level subclasses or lower-level electronic texts of the subclass, the level of the click rate of the subclass, or the magnitude of the mean page link value of the electronic texts of the subclass;
Third factor: the number of lower-level subclasses or lower-level electronic texts of the subclass in which the adjacent word segment, indirectly adjacent word segment, electronic text, statement, example sentence or summary example is located, the level of the click rate of that subclass, or the magnitude of the mean page link value of its electronic texts;
Fourth factor: the magnitude of the page link value of the electronic text with the highest page link value in the subclass, or of another example electronic text;
Fifth factor: the click rate or keyword occurrence rate of the electronic text with the highest click rate or highest keyword occurrence rate in the subclass, or of another example electronic text;
Sixth factor: the ranking of the electronic text, or of the associated electronic texts in the relevant subclass, in the search results of other search websites or retrieval systems;
Seventh factor: the level of the relevant payment or bid of the investor in the electronic text, adjacent word segment or indirectly adjacent word segment;
Eighth factor: the alphabetical order of the spelling or pinyin, or the stroke order, of the words of the adjacent word segment;
Ninth factor: the rating of the source website, organization or person of the electronic text;
Tenth factor: the order in time in which the electronic texts were included, or how recent they are;
Eleventh factor: whether the electronic texts belong to the same subclass at a given level;
Twelfth factor: the value of an objective function that depends on one or more variables, some or all of which respectively represent one or more of the factors above.
18. The method for processing and retrieval according to claim 1, 3 or 4, further comprising step (8), namely:
adding or removing further keywords that must or must not be present, or adding or removing restrictions on time, region, language or scope, to obtain further refined results or broader results.
19. A computer data system comprising a storage device and using the method for processing and retrieval of claim 1, characterized in that the data of some or all of the keyword indexes, electronic text summaries or electronic texts held in the storage device are distributed as follows:
the data of indexes, electronic text summaries or electronic texts that contain the same keyword and whose adjacent word segments for that keyword are identical or different are located in the distribution areas of the same or different subclasses, respectively, of the set for that keyword.
20. A search engine system using the method for processing and retrieval of claim 1, the search engine system comprising:
a server, coupled via a communication network or line to a client computer on which the interactive interface is located;
the server comprising a search engine;
the search engine comprising a database and a requestor;
wherein the database stores a keyword index, and the requestor queries the database according to the keyword request submitted by the inquirer and supplies the list of matching results to the interactive interface;
characterized in that:
the database also stores electronic text summaries or electronic texts containing keywords, the electronic text summaries being either included or not included in the keyword index;
the requestor or search engine comprises a keyword expansion component, which can perform one or more expansion operations on a keyword that is being queried or is to be queried, the expansion operation comprising: taking the keyword as it occurs in the content or summaries of the electronic texts containing it, together with its adjacent word segment, as the various different expanded key sentences and listing them, or listing the different adjacent word segments of the keyword, for the inquirer to select via the interactive interface; or retrieving, arranging or ordering the different indexes, electronic text summaries or electronic texts that contain identical or different expanded key sentences, for the inquirer to select via the interactive interface.
21. The search engine system according to claim 20, wherein
a tree catalog of adjacent word segments is stored, reflecting the sequential relationship among the different levels of adjacent word segments of the keyword in the electronic texts, electronic text summaries or question records that have the same keyword, or a tree catalog is stored reflecting the sequential relationship among the key sentences of the different levels of expansion of the keyword.
22. The search engine system according to claim 20,
further comprising a graphical user interactive interface, which comprises a dialog box or selection box to receive the inquirer's choice of operating mode or pattern; the graphical interactive interface also comprises clickable text, symbols or graphics for key sentences, adjacent word segments, statements, paragraphs, operating instructions or selections, so that the inquirer can add further query information.
23. A search method for the search engine system of claim 20, comprising the following steps:
Step A: receiving the inquirer's keyword query request via the interactive interface;
Step B: querying the database according to the keyword query request;
Step C: taking the keyword as it occurs in the content or summaries of the electronic texts containing it, together with its adjacent word segment, as a key sentence;
wherein the number of words or characters contained in the adjacent word segment, or the cut-off mode of the adjacent word segment, is predetermined by the search engine system, or accepted by default or selected by the inquirer; or is determined by a symbol, character, word, font, color or space at the end of the adjacent word segment; or is determined by the position and manner of the cursor indication made by the inquirer in a selection bar presented by the interactive interface or on a page containing the electronic text summaries or electronic texts of a specific index;
Step D: deriving and collating, from the adjacent word segments or key sentences of step C, the mutually different adjacent word segments or mutually different key sentences;
Step E: generating search results from the adjacent word segments or key sentences obtained, that is, retrieving or arranging the different indexes, electronic text summaries, electronic texts or question records that contain the identical or different adjacent word segments or key sentences, for the inquirer to select via the interactive interface.
24. The search method according to claim 23, wherein steps A-E are performed by the search engine system in advance or at query time.
25. The search method according to claim 23, further comprising:
Step F: taking the key sentence as it occurs in the content or summaries of the electronic texts containing it, together with its adjacent word segment, or taking the key sentence together with its adjacent word segment, as an expanded key sentence;
wherein the number of words or characters contained in the adjacent word segment, or the cut-off mode or specific content of the adjacent word segment, is predetermined by the search engine system, or accepted by default or selected by the inquirer;
Step G: deriving and collating, from the adjacent word segments or key sentences of step F, the mutually different adjacent word segments or mutually different expanded key sentences;
Step H: generating search results from the adjacent word segments or expanded key sentences obtained in step G, that is, retrieving, arranging or storing separately the different indexes, electronic text summaries, electronic texts or question records that contain the identical or different adjacent word segments or expanded key sentences, for the inquirer to select via the interactive interface;
wherein steps A-H are performed by the search engine system in advance or at query time.
26. The search method according to claim 23, further comprising a grouping step:
grouping and arranging or displaying, each in the form of a catalog or sequence, the various key sentences, adjacent word segments, indexes, electronic text summaries or electronic texts that contain the same keyword, or the various expanded key sentences, adjacent word segments, indexes, electronic text summaries or electronic texts that contain the same original key sentence, wherein for each adjacent word segment only one or a few of the key sentences, indexes, electronic text summaries or electronic texts in which it occurs are included.
27. The search method according to claim 23, wherein
the data of some or all of the keyword indexes, electronic text summaries or electronic texts are stored in the same or different subset areas, or in the same or different lower-level subset areas, according to whether the keywords, key sentences or expanded key sentences they contain are the same or different.
28. The search method according to claim 23, wherein
the texts or summaries in the database, or the electronic texts that a data acquisition server attached to the search engine obtains from the internet, are analyzed, and a corresponding index of the electronic texts is generated and stored, the index comprising a keyword field, an adjacent word segment field and an electronic text ID field.
29. The search method according to claim 23, further comprising an arranging step:
arranging a tree catalog that reflects the sequential or parallel relationships among the different levels of adjacent word segments of the keyword in the electronic texts or electronic text summaries having the same keyword, or a tree catalog that reflects the sequential or parallel relationships among the key sentences of the different levels of expansion of the keyword, for use at query time.
30. The search method according to claim 23, 24 or 25, further comprising a selection step:
the search engine system determines the corresponding key sentence according to the inquirer's cursor indication on an electronic text, electronic text summary or key sentence on a page of the interactive interface, or on an adjacent word segment catalog, or in a selection bar or box; and displays as a catalog the various expanded key sentences, expanded adjacent word segments, indexes, electronic text summaries or electronic texts corresponding to that key sentence; or displays the corresponding indexes, electronic text summaries or electronic texts in sorted order; or performs a removal step on the key sentence so determined, the removal step rejecting, or moving to another position, the entries, indexes, electronic text summaries or electronic texts containing that key sentence on the current page or on other pages.
31. The search method according to claim 23, further comprising an ignore step:
while the inquirer browses, on the interactive interface, a sequence of indexes, electronic text summaries, question records or electronic texts containing the original keyword or the original key sentence, the current position reached by the inquirer in that index or electronic text sequence is determined from the inquirer's operations on the pages; if it is determined that the indexes, electronic text summaries or electronic texts containing a key sentence, or the key sentence itself, lying within a certain range before that current position have never, or for a certain number of consecutive times, been opened or followed as links, and have not been clicked or flagged to be kept, a removal step is performed for that key sentence, the removal step rejecting, or moving to another position, the entries, indexes, electronic text summaries or electronic texts containing that key sentence on the current page or on other pages.
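As an illustration of the comparison rules in claims 6, 14 and 15, the following sketch first normalizes adjacent word segments by ignoring punctuation and a small list of articles and conjunctions, and then groups segments whose proportion of shared words meets a threshold. The stop list, the ratio definition, the threshold of 0.5 and all names are assumptions of this sketch; the claims leave these choices open.

```python
import string

# Assumed list of articles and conjunctions to ignore.
IGNORED_WORDS = {"a", "an", "the", "and", "or"}


def normalize(segment):
    """Drop surrounding punctuation and ignored words from a segment."""
    cleaned = (word.strip(string.punctuation) for word in segment)
    return tuple(w for w in cleaned if w and w not in IGNORED_WORDS)


def shared_ratio(a, b):
    """Proportion of shared words, relative to the shorter segment."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def group_similar(segments, threshold=0.5):
    """Greedily place each segment in the first group it is similar to."""
    groups = []
    for segment in segments:
        norm = normalize(segment)
        for group in groups:
            if all(shared_ratio(norm, normalize(other)) >= threshold
                   for other in group):
                group.append(segment)
                break
        else:
            groups.append([segment])
    return groups


# Hypothetical adjacent word segments of one keyword.
segments = [
    ("index", "design"),
    ("the", "index", "size"),
    ("ranking", "signals"),
]
similar_groups = group_similar(segments)
```

The first two segments share the word "index" and land in one similar subclass, whose common content could serve as its title; the third forms a subclass of its own.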
CNB200710164298XA 2007-02-15 2007-10-24 Convenient method and system for electronic text-processing and searching Expired - Fee Related CN100501745C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB200710164298XA CN100501745C (en) 2007-02-15 2007-10-24 Convenient method and system for electronic text-processing and searching

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
CN200710079309.4 2007-02-15
CN200710079309 2007-02-15
CN200710087104.0 2007-03-21
CN200710147578.X 2007-08-28
CNB200710164298XA CN100501745C (en) 2007-02-15 2007-10-24 Convenient method and system for electronic text-processing and searching

Publications (2)

Publication Number Publication Date
CN101201841A CN101201841A (en) 2008-06-18
CN100501745C true CN100501745C (en) 2009-06-17

Family

ID=39517010

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200710164298XA Expired - Fee Related CN100501745C (en) 2007-02-15 2007-10-24 Convenient method and system for electronic text-processing and searching

Country Status (1)

Country Link
CN (1) CN100501745C (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9870392B2 (en) 2010-12-31 2018-01-16 Yan Xiao Retrieval method and system
US9116984B2 (en) * 2011-06-28 2015-08-25 Microsoft Technology Licensing, Llc Summarization of conversation threads
CN103136274A (en) * 2011-12-02 2013-06-05 北大方正集团有限公司 Date retrieval method and device used for content resource data base
CN103185596A (en) * 2011-12-30 2013-07-03 上海博泰悦臻电子设备制造有限公司 Interest point searching method and interest point searching device
CN102819601B (en) * 2012-08-15 2015-07-01 中国联合网络通信集团有限公司 Information retrieval method and information retrieval equipment
CN104216934B (en) * 2013-09-29 2018-02-13 北大方正集团有限公司 A kind of Knowledge Extraction Method and system
CN104050158B (en) * 2014-06-27 2017-05-17 吴涛军 Automatic quotation extraction method and device with semantic integrity kept
EP3324305A4 (en) * 2015-07-13 2018-12-05 Teijin Limited Information processing apparatus, information processing method, and computer program
CN108268438B (en) * 2016-12-30 2021-10-22 腾讯科技(深圳)有限公司 Page content extraction method and device and client
CN107168991B (en) * 2017-03-28 2020-12-04 北京三快在线科技有限公司 Search result display method and device
CN107544962A (en) * 2017-09-07 2018-01-05 电子科技大学 Social media text query extended method based on Similar Text feedback
CN109145016A (en) * 2018-09-10 2019-01-04 合肥科讯金服科技有限公司 A kind of finance internet big data searching system
CN111444413B (en) * 2020-04-08 2023-05-12 作业不凡(北京)教育科技有限公司 Data query method and device and computing equipment

Also Published As

Publication number Publication date
CN101201841A (en) 2008-06-18

Similar Documents

Publication Publication Date Title
CN100501745C (en) Convenient method and system for electronic text-processing and searching
US9323827B2 (en) Identifying key terms related to similar passages
US8122032B2 (en) Identifying and linking similar passages in a digital text corpus
CN100375090C (en) Retrieving matching documents by queries in any national language
CN101501630B (en) Method for ranking computerized search result list and its database search engine
US8145632B2 (en) Systems and methods of identifying chunks within multiple documents
CN101520786B (en) Method for realizing input method dictionary and input method system
US20130013616A1 (en) Systems and Methods for Natural Language Searching of Structured Data
US20010049674A1 (en) Methods and systems for enabling efficient employment recruiting
US20070219986A1 (en) Method and apparatus for extracting terms based on a displayed text
CN101063975A (en) Method and system for electronic text-processing and searching
US7024405B2 (en) Method and apparatus for improved internet searching
CN101246484A (en) Electric text similarity processing method and system convenient for query
CN102945237A (en) Suggesting and refining user input based on original user input
CN1503163A (en) International information search and deivery system providing search results personalized to a particular natural language
CN1871605A (en) System and method for question-reply type document search
CN101727447A (en) Generation method and device of regular expression based on URL
CN1487452A (en) System for carrying out universal search management in one or more networks
WO2010014082A1 (en) Method and apparatus for relating datasets by using semantic vectors and keyword analyses
CN102200975A (en) Vertical search engine system and method using semantic analysis
US20090119283A1 (en) System and Method of Improving and Enhancing Electronic File Searching
CN102831131A (en) Method and device for establishing labeling webpage linguistic corpus
US8924421B2 (en) Systems and methods of refining chunks identified within multiple documents
CN103136356A (en) Processing method for search engine end-user to input prompt messages of reference documents
WO2008098467A1 (en) Convenient method and system of electric text processing and retrieve

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090617

Termination date: 20131024