WO2012143839A1 - A computerized system and a method for processing and building search strings - Google Patents

A computerized system and a method for processing and building search strings

Info

Publication number
WO2012143839A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
words
signature
database
text
Prior art date
Application number
PCT/IB2012/051870
Other languages
French (fr)
Inventor
Abraham Carel GREYLING
Original Assignee
Greyling Abraham Carel
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Greyling Abraham Carel
Publication of WO2012143839A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • This invention relates to query processing, and more specifically relates to the semantic analysis of search query strings to generate multiple alternative strings to facilitate improved computerized search.
  • Most internet search engines use text-based input search queries. To return accurate search results, the search engine must be able to apply some form of language interpretation to the search string entered by a user. The search engine must also apply language interpretation when it indexes web pages or other documents, so that the search string can be matched to web pages by a ranking algorithm that only delivers the most relevant results to a user.
  • The "Semantic Web" refers to a structure for the Internet in which machine-readable data (or meta-data) is available that tells a computer unambiguously what a web page, a document or a topic is about. This meta-data enables computers to understand the meaning of information directly, without the interpretation problems that plague current search engines.
  • Currently, certain defined domains - for example, airline booking systems - operate in this way.
  • Thus the term "JFK" in an airline booking system refers only to John F Kennedy International airport in New York, not to the former US president or other terms that may have these three letters as their acronym.
  • Some search engines, such as Bing™, identify categories based on the search terms, and a user is able to filter out irrelevant results by only selecting certain categories.
  • Thus a search for "chicken" might identify categories of "animals" and "recipes" and allow the user to filter so as to only search within one of the two categories.
  • the goal of the Internet itself being semantic has not yet been realized, despite ongoing efforts to index and associate concepts on the Internet.
  • the main problem is the complexity of the task involved in performing such identification and association on the open Internet, which requires a huge structured database to be built.
  • each unique word identified in the text is stored in the database and is associated with a number of fields which each represent other words which were found to occur adjacent to that word in the text, each field also including a frequency sub-field which indicates how frequently that other word was found to occur adjacent to the associated word;
  • processing the word relationship database so as to determine a forward signature and reverse signature for each word, the forward signature including a ranked list of the words that were found to come after that word in the text, and the reverse signature including a ranked list of the words that were found to come before that word in the text;
  • Still further features of the invention provide for the method to include an additional step of, immediately after extracting text and before forming a word relationship database, parsing the text into sentence portions which start and end with sentence delimiters.
  • Still further features of the invention provide for inputting the multiple alternate search strings into a search engine simultaneously or in rapid succession and comparing the results of each separate search so as to rank the overall results and present those results which were obtained in the greatest number of separate searches as the most relevant search results.
  • the language of the text is preferably identified so that separate word relationship databases and signature databases can be built for each separate language.
  • the invention extends to a system for processing an input search string and building multiple alternative search strings, comprising:
  • a processor in the form of a server which is able to access a multitude of web pages or other documents through the Internet and extract text, the text including words;
  • each unique word in the word relationship database being associated with a number of fields which each represent other words which were found to occur adjacent to that word in the text, each field also including a frequency sub-field which indicates how frequently that other word was found to occur adjacent to the associated word;
  • a signature database coupled to the server, the signature database being formed by the server processing the word relationship database so as to determine a forward signature and reverse signature for each word, the forward signature including a ranked list of the words that were found to come after that word in the text, and the reverse signature including a ranked list of the words that were found to come before that word in the text, and combining the forward and reverse signatures of each word to form an ambidextrous signature for each word that is stored in the signature database;
  • the server to be configured to input the multiple alternative search strings into a search engine simultaneously or in rapid succession, and to compare the results of each separate search so as to rank the overall results and present those results which are obtained in the greatest number of separate searches as the most relevant search results.
  • Figure 1 is a flowchart that illustrates the overall steps performed in obtaining improved search results where multiple alternative search strings are generated according to the method of the invention;
  • Figure 2 is a schematic diagram showing the system for generating multiple alternative search strings according to the invention;
  • Figure 3 is a flowchart that illustrates the steps performed in creating and updating a word relationship database;
  • Figure 4 is a flowchart that illustrates the steps performed in creating a signature database based on the word relationship database; and
  • Figure 5 is a flowchart that illustrates the steps performed in using the signatures in the signature database to create multiple alternative search strings according to the method of the invention.
  • Figure 1 is a flowchart that illustrates the overall steps performed and results obtained by the method and system of the invention.
  • a search string is input into the system of the invention.
  • multiple alternative search strings are generated, with each search string being semantically similar to the original search string and containing correct grammar.
  • the original search string and each alternative search string are then input into a search engine at the next stage (24).
  • the search engine may be any computerized search engine, including a web based internet search engine that facilitates keyword searching.
  • the results of each internet search obtained using the search strings are obtained at the next stage (26). These results are then combined at the next stage (28) so as to identify the most relevant search results and output those results at stage (30).
  • Figure 2 illustrates a system (100) which enables the multiple alternative search strings to be generated, which was stage (22) in Figure 1.
  • the system includes a processor in the form of a server (102) which is able to access a multitude of web pages (104) or other documents through the Internet (106) by means of web crawling programs (not shown).
  • the server is also coupled to a word relationship database (108) and a signature database (110).
  • the word relationship database is built from content obtained from the Internet by the web crawling programs, and the word relationship database is used to create the signature database as will be explained below.
  • the first stage in the method of the invention is to create a word relationship database where every word in a particular language is associated with words adjacent to it.
  • a flow chart illustrating the steps to create and update a word relationship database is shown in Figure 3.
  • "The Semantic Web is a technology waiting to be actualized. Application areas are experiencing intensified interest due to the rapid growth in the use of the Web. Information content technologies (such as search engines) are constantly being improved, with the hope of the actualization of powerful search technologies."
  • The following ASCII characters are generally regarded as sentence delimiters:
  • the delimiters in the text are "(", ")" and ".". The text can therefore be parsed into the following sentence portions: a) The Semantic Web is a technology waiting to be actualized b) Application areas are experiencing intensified interest due to the rapid growth in the use of the Web c) Information content technologies d) such as search engines e) are constantly being improved, with the hope of the actualization of powerful search technologies
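The parsing step can be illustrated with a minimal sketch. It assumes the delimiter set is only "(", ")" and "." (the full delimiter list is not reproduced here), and `split_sentence_portions` is a hypothetical helper name, not one taken from the specification.

```python
import re

# Assumed delimiter set for this example only; a real list would be longer.
DELIMITERS = "()."

def split_sentence_portions(text, delimiters=DELIMITERS):
    """Split text at every delimiter character and drop empty portions."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

text = (
    "The Semantic Web is a technology waiting to be actualized. "
    "Application areas are experiencing intensified interest due to the "
    "rapid growth in the use of the Web. Information content technologies "
    "(such as search engines) are constantly being improved, with the hope "
    "of the actualization of powerful search technologies."
)
portions = split_sentence_portions(text)
for label, portion in zip("abcde", portions):
    print(f"{label}) {portion}")
```

With these three delimiters the example text yields five portions, the last of which contains the phrase "actualization of".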
  • a word relationship database can then be formed which shows how adjacent words are related to each other in the body of text analyzed by the web spiders.
  • the word relationship database is formed as a two-dimensional matrix in which each row represents a particular word (the "row word"), and has a number of row fields that represent specific words that were found to occur after the row word in the body of text that was analyzed.
  • Each row field also includes an indication of the frequency, or number of times, that the word was found to occur after the row word in the body of content. This can schematically be illustrated as follows:
  • <RowWord1> <WordAfter1>, <Freq1>
  • <RowWord2> <WordAfter1>, <Freq1>
  • <RowWord3> <WordAfter1>, <Freq1>
  • Each word is assigned a unique reference number, and within each row the row fields are ranked according to frequency. For example, consider a very small portion of the two-dimensional matrix, containing only the words that fall alphabetically between "actuality" and "actualizes". From sentence portion (e) above, only the word "of" was found to follow the word "actualization". In sentence portion (a) no word was found after "actualized". If the only text input into the two-dimensional matrix were the sentence portions (a) - (e) above, the matrix portion might look as follows:
  • Each row represents a unique word ("a row word") and the rows are alphabetically sorted with incrementing reference numbers, words 501-504 ("actuality" - "actualizes").
  • Each row word has a number of row fields after it.
  • word 501 "actuality"
  • For words 503 and 504 there are only two row fields following each of these words.
  • Each row field includes two items of information, the reference number of a word that was found to come after it in the body of searched text, and a frequency number which shows the number of times that referenced word was found to come after the row word in the body of searched text.
  • This extract from the word relationship database is, of course, greatly simplified for illustrative purposes.
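A rough sketch of how such a matrix could be held in memory is given here. It is an illustration under stated assumptions: words are stored as plain strings rather than reference numbers, adjacency is not counted across sentence portions, and `build_word_relationships` is a hypothetical name.

```python
from collections import Counter, defaultdict

def build_word_relationships(sentence_portions):
    """Map each row word to a Counter of the words found directly after it."""
    following = defaultdict(Counter)
    for portion in sentence_portions:
        words = portion.lower().split()
        # Pair every word with its immediate successor in the same portion.
        for row_word, next_word in zip(words, words[1:]):
            following[row_word][next_word] += 1
    return following

portions = ["the web is growing", "the web is large", "the use of the web"]
db = build_word_relationships(portions)

# Row fields for "the", ranked by frequency as in the matrix description.
print(db["the"].most_common())
```

`Counter.most_common()` returns the row fields already ranked by frequency, which matches the ordering of row fields described in the text.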
  • the word relationship database increases in size with the frequency numbers growing rapidly and the number of row fields also growing, although not as quickly.
  • a number of techniques can be employed, such as techniques that gradually reduce the frequency fields so that only those field words that are frequently incremented will develop large frequencies.
  • Algorithms for periodically discarding the row fields that have very low frequencies can also be used so as to keep the number of row fields in check, in addition to algorithms that compress the matrix density (the number of row fields multiplied by their frequencies).
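One possible form of these housekeeping techniques is sketched here. The decay factor, the pruning threshold and the function name `decay_and_prune` are assumptions chosen for illustration; the text does not specify particular values or algorithms.

```python
from collections import Counter

def decay_and_prune(db, decay=0.5, min_freq=1.0):
    """Scale every frequency by `decay`, then drop row fields below `min_freq`."""
    pruned = {}
    for row_word, fields in db.items():
        kept = {w: f * decay for w, f in fields.items() if f * decay >= min_freq}
        if kept:  # rows with no surviving fields are dropped entirely
            pruned[row_word] = Counter(kept)
    return pruned

db = {
    "the": Counter({"web": 8, "use": 1}),
    "rare": Counter({"word": 1}),
}
smaller = decay_and_prune(db)
print(smaller)
```

Only field words that keep being incremented between decay passes retain large frequencies, while fields seen once or twice fall below the threshold and are discarded.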
  • Once populated with content from a large number of web pages and other documents, the word relationship database provides an accurate view of the relationship that each word has to the words that come after it in a particular language (such as English), provided of course that the bulk of the content accessed by the web spiders is not garbled or meaningless, which it should not be if ordinary content on the Internet is being accessed.
  • a signature database is created that is based on the word relationship database.
  • Figure 4 illustrates the steps to create a signature database based on the word relationship database.
  • the words in the row field of the word relationship database only indicate words that come after the row word.
  • Each row can therefore be thought of as a signature for the words that follow the row word, where the signature tells you the relationship of the row word to other words following it, ranked according to popularity.
  • the word relationship database is queried to obtain the "reverse signature" of every word, i.e. an indication of the popularity of words that precede the word of interest. This can be done by searching the entire word relationship database for every instance where the word of interest appears in a row field, and identifying the row word associated with that row field as the preceding word.
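The reverse lookup described here can be sketched as a scan over every row. This is an illustration only: the database is assumed to be a mapping from row word to a `Counter` of following words, and `reverse_signature` is a hypothetical name.

```python
from collections import Counter

def reverse_signature(db, word):
    """Rank the row words in whose row fields `word` appears, most frequent first."""
    before = Counter()
    for row_word, fields in db.items():
        if word in fields:
            # `row_word` was found immediately before `word` this many times.
            before[row_word] = fields[word]
    return [w for w, _ in before.most_common()]

db = {
    "hard": Counter({"work": 5, "drive": 2}),
    "at": Counter({"work": 3, "home": 4}),
    "work": Counter({"on": 6}),
}
sig = reverse_signature(db, "work")
print(sig)
```

A full scan like this is linear in the size of the database for each word, which is one reason a separate, precomputed signature database is useful.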
  • the forward and reverse signatures of each word are combined into an "ambidextrous signature".
  • the information about whether the field word came before or after the word of interest is discarded, and the number of times each field word came before or after the word is also discarded, while nevertheless maintaining a ranking based on the number of times each field word came before or after the word.
  • the "forward signature" of "work" given at (4) and the "reverse signature" of "work" given at (6) are combined into the following "ambidextrous signature":
  • the ambidextrous signature (7) is therefore a word relationship signature which shows which words are contextually close to the word "work", in that those words often appear adjacent to the word "work" (either before or after) in the English language.
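A minimal sketch of the combining step follows. It assumes that when a field word occurs both before and after the word of interest its two frequencies are added, which the text does not state explicitly; the example frequencies and the name `ambidextrous_signature` are likewise hypothetical.

```python
from collections import Counter

def ambidextrous_signature(forward, reverse):
    """Merge two {word: frequency} mappings into one ranked list of words."""
    combined = Counter(forward) + Counter(reverse)  # assumption: frequencies add
    # Direction and counts are discarded; only the ranking survives.
    return [w for w, _ in combined.most_common()]

forward = {"on": 6, "with": 4, "hard": 2}  # words found after "work"
reverse = {"hard": 5, "at": 3}             # words found before "work"
sig = ambidextrous_signature(forward, reverse)
print(sig)
```

The result is a plain ordered list, so the stored signature carries neither the direction nor the raw counts, only the relative popularity of neighbouring words.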
  • word relationship signatures only reflect the relationship of specific words to those words that come immediately before or after them, not to more distant word relationships.
  • word relationship database and signature database are two-dimensional matrices, rather than 3-, 4- or higher-order matrices. This simplicity is important because it keeps the size of the word relationship database and signature database manageable and makes it very scalable.
  • the word relationship signatures in the signature database are used to create multiple alternative search strings that are semantically similar to an input search string and grammatically correct. This is the step that was indicated broadly by stage (22) in Figure 1 and which will now be described in detail. The various stages involved in generating the multiple alternative search strings are illustrated in Figure 5.
  • popular words are removed from the input search string.
  • Popular words are identified as those words with a total frequency in the entire word relationship database that is higher than a predetermined threshold - in other words, those words that appear very commonly in the total body of text accessed by the web crawling programs.
  • the search string "Where can I get cool spring water?".
  • the words "where", "can", "I" and "get" will likely be identified as popular words, with the remaining words "cool spring water" being non-popular words.
  • the non-popular words are linked in two-word groups from left to right with the last word of any preceding two-word group forming the first word of the next two-word group. In this case, there are two two-word groups, namely "cool spring” and "spring water”.
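The filtering and linking steps can be sketched as follows. The total frequencies and the threshold are invented for illustration, and `two_word_groups` is a hypothetical helper; real totals would be read from the word relationship database.

```python
def two_word_groups(search_string, total_freq, threshold):
    """Drop popular words, then link the remaining words left to right in pairs."""
    words = [w.strip("?.,!").lower() for w in search_string.split()]
    # Words unknown to the database are treated as non-popular.
    rare = [w for w in words if total_freq.get(w, 0) <= threshold]
    # The second word of each pair becomes the first word of the next pair.
    return list(zip(rare, rare[1:]))

# Hypothetical total frequencies for the example search string.
total_freq = {"where": 900, "can": 800, "i": 950, "get": 700,
              "cool": 40, "spring": 60, "water": 90}
groups = two_word_groups("Where can I get cool spring water?", total_freq, 500)
print(groups)
```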
  • each two word group is analyzed as follows: the reverse signature of the first word and the forward signature of the second word are obtained. Then, at stage (66), the forward and reverse group signatures are combined into a single ambidextrous "word-group" signature.
  • the ambidextrous word relationship signature in (10) gives the forward and reverse relationship of the two words "cool spring" in combination, as if they were a single word.
  • the signature database is searched to look for close signature matches for the ambidextrous "word group" signature (10).
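The construction of a word-group signature can be sketched under the same assumptions as for single words: the database maps each row word to a `Counter` of following words, and frequencies from the two directions are simply added before ranking. The toy data and the name `group_signature` are hypothetical.

```python
from collections import Counter

def group_signature(db, first, second):
    """Words seen before `first` plus words seen after `second`, ranked."""
    # Reverse signature of the first word: rows whose fields contain it.
    before_first = Counter({row: f[first] for row, f in db.items() if first in f})
    # Forward signature of the second word: its own row fields.
    after_second = Counter(db.get(second, {}))
    combined = before_first + after_second  # assumption: frequencies add
    return [w for w, _ in combined.most_common()]

db = {
    "a": Counter({"cool": 4, "spring": 1}),
    "drink": Counter({"cool": 2}),
    "cool": Counter({"spring": 3}),
    "spring": Counter({"water": 5, "day": 3}),
}
sig = group_signature(db, "cool", "spring")
print(sig)
```

Words between the pair are ignored: only what precedes "cool" and what follows "spring" contributes, so the pair is treated as if it were a single word.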
  • This comparison can be done in various ways. One way is to calculate a matching score between the signature (10) and each of the signatures in the signature database by an algorithm that looks for matches between the fields of the signature (10) and the fields of each of the signatures in the signature database.
  • Decreasing weighting factors can be allocated to each of the fields with the signature so that matches between fields that are further to the right count less than matches between fields that are further left.
  • the algorithm can also allocate a higher weighting factor if the word in the signature database that includes matching fields is not a common word, as these words give more information than common words such as prepositions and conjunctions.
  • the word or words that have the highest weighting factor are then identified as the words that are semantically similar to the two-word group.
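One way such a matching score could be computed is sketched here. The geometric decay of the position weights and the halving of scores for common words are assumptions; the text only requires that weights decrease from left to right and that uncommon words count for more. All names and data are hypothetical.

```python
def matching_score(query_sig, candidate_sig, decay=0.8):
    """Sum position weights for every field the two signatures share."""
    weight = {w: decay ** i for i, w in enumerate(candidate_sig)}
    # A match counts less the further right it sits in either signature.
    return sum(decay ** i * weight[w]
               for i, w in enumerate(query_sig) if w in weight)

def closest_words(query_sig, signature_db, common_words=frozenset(), penalty=0.5):
    """Rank database words by score, down-weighting common words."""
    scores = {}
    for word, sig in signature_db.items():
        score = matching_score(query_sig, sig)
        scores[word] = score * (penalty if word in common_words else 1.0)
    return sorted(scores, key=scores.get, reverse=True)

query = ["water", "a", "day", "drink"]
signature_db = {
    "refreshing": ["a", "water", "drink"],
    "the": ["a", "day", "of"],
    "engine": ["oil", "the"],
}
ranked = closest_words(query, signature_db, common_words={"the"})
print(ranked)
```

Scaling down the scores of common words is equivalent, for ranking purposes, to giving uncommon words a higher weighting factor.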
  • stages (64) to (70) are then repeated for each of the other two-word groups in the search string, which in this example is the second two-word group, "spring water".
  • one or more other words are identified that are semantically similar to "spring water".
  • Combining the results of both iterations yields a number of two word strings that are each semantically similar to "cool spring water". For example, if one of the words identified as semantically similar to "cool spring" was "refreshing" and one of the words identified as semantically similar to "spring water" was "liquid", then "refreshing liquid" would be identified as semantically similar to "cool spring water".
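Combining the per-group results amounts to taking every combination of one substitute per two-word group. A small sketch, in which the second substitute "chilled" is invented for illustration and `combine_alternatives` is a hypothetical name:

```python
from itertools import product

def combine_alternatives(per_group_words):
    """Join one substitute word from each two-word group, in group order."""
    return [" ".join(combo) for combo in product(*per_group_words)]

similar_to_cool_spring = ["refreshing", "chilled"]  # "chilled" is hypothetical
similar_to_spring_water = ["liquid"]
alternatives = combine_alternatives(
    [similar_to_cool_spring, similar_to_spring_water]
)
print(alternatives)
```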
  • the invention includes additional steps by means of which grammatically incorrect alternative strings can be excluded. To do this, the substituted words are first substituted back into the original search string at stage (76). Then, at stage (78), each substituted word is analyzed within the original string to see whether the words preceding it and following it are words that are associated with the substituted word by a predefined degree.
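The adjacency check can be sketched as two lookups in the word relationship database. The threshold `min_freq` stands in for the "predefined degree" and its value is an assumption; `fits_grammatically` and the toy data are hypothetical.

```python
def fits_grammatically(db, previous, candidate, following, min_freq=1):
    """True if `candidate` was seen after `previous` and before `following`."""
    # A missing neighbour (start or end of the string) is not checked.
    ok_before = previous is None or db.get(previous, {}).get(candidate, 0) >= min_freq
    ok_after = following is None or db.get(candidate, {}).get(following, 0) >= min_freq
    return ok_before and ok_after

# Toy word relationship database: row word -> {following word: frequency}.
db = {
    "get": {"refreshing": 3, "liquid": 1},
    "refreshing": {"liquid": 2},
}
print(fits_grammatically(db, "get", "refreshing", "liquid"))  # accepted
print(fits_grammatically(db, "get", "liquid", "refreshing"))  # rejected
```

The second call fails because "liquid" was never observed immediately before "refreshing", so a string with the words in that order would be excluded.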
  • the most relevant documents or web pages are then presented to the user first. It will be appreciated that from the perspective of the user of a search engine the invention described above is completely hidden and is carried out in the background. The user interacts with the search engine in exactly the same way as before - by typing in a search string - and the search engine generates the alternative search strings and identifies the most relevant documents to present to the user.
  • the applicant has found that the invention leads to a marked improvement in the quality of the results that are presented to a user. Irrelevant search results are excluded far more often than with existing search engines and complex sentence structures can be handled with more precision. Because multiple alternative search strings are generated based on the search string, the applicant has found that it is no longer necessary to substitute different words or attempt to re-write search strings with different sentence structures in an attempt to locate relevant results. This leads to increased user satisfaction and quicker location of relevant search results.
  • the system of the invention requires no human input to categorize and index content, nor does it have to be programmed with complex morphological or grammatical rules or built-in dictionaries.
  • the invention provides a completely autonomous and extremely scalable system that is able to build a contextual language model of any contextual language so that search strings can be interpreted more accurately by search engines, so as to deliver more relevant and targeted search results without the need to categorize or index existing content.
  • Although the invention may be applied in web-based search engines, it can also be applied in the enterprise search market, where companies search their own internal documents and information.
  • any of the software components or functions described in this specification may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C++ or Perl using, for example, conventional or object-oriented techniques.
  • the software code may be stored as a series of instructions, or commands on a computer readable medium, such as a random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive, or an optical medium such as a CD-ROM.
  • Any such computer readable medium may reside on or within a single computational apparatus, and may be present on or within different computational apparatuses within a system or network.

Abstract

A method is provided for processing an input search string and building multiple alternative search strings to improve computerized search. The method includes extracting text from web pages, forming a word-relationship database in which every unique word is associated with fields which represent other words found to occur adjacent to that word, processing the word relationship database so as to determine a forward and reverse signature for each word, and combining the forward and reverse signatures to form a signature database. Two-word groups in the input search string are linked and the forward and reverse signatures for each two-word group obtained. These signatures are compared with the signature database to find single words that have signatures that closely match the signature of the two-word group, and those words identified as alternative words that are semantically similar to the two-word group, so as to generate alternate search strings.

Description

A COMPUTERIZED SYSTEM AND A METHOD FOR PROCESSING AND BUILDING SEARCH STRINGS
FIELD OF THE INVENTION
This invention relates to query processing, and more specifically relates to the semantic analysis of search query strings to generate multiple alternative strings to facilitate improved computerized search.
BACKGROUND TO THE INVENTION
Enabling computers to understand language remains one of the hardest problems in artificial intelligence. Language is highly contextual. Often the same words have different meanings in different contexts and small differences in sentence structure can lead to totally different meanings. At the same time, a great number of different sentence structures can have the same meaning.
Most internet search engines use text-based input search queries. To return accurate search results, the search engine must be able to apply some form of language interpretation to the search string entered by a user. The search engine must also apply language interpretation when it indexes web pages or other documents, so that the search string can be matched to web pages by a ranking algorithm that only delivers the most relevant results to a user.
One simple form of language interpretation is to analyze each word in the search string and to also search for the synonyms for certain of the words. Thus the word "picture" may have the synonym "photo" so that if a user searches for "picture of Grand Canyon", the search engine must also return results for web pages in which the words, "photo of Grand Canyon" appear. However, merely applying synonyms can also lead to wrong results. For example, if a user searches for "history of motion pictures" then the word "pictures" must not be substituted with "photos" because the string "history of motion photos" is meaningless. As another example, if a user were to search for "HP wide screen monitor" and the search engine were also to substitute the synonym "detector" for "monitor", and "shutter" for "screen", completely irrelevant results would be delivered. From the above it is clear that a search engine also needs to be able to perform contextual analysis so that it knows, for example, that the string "HP wide screen monitor" has nothing to do with shutters or detectors and that the term "motion photo" is not the same as "motion picture".
As a further illustration of this problem, even words which are normally interchangeable can lead to totally different meanings when used in different contexts. A search for "arm reduction" probably has to do with cosmetic surgery whereas "arms reduction" relates to reducing stockpiles of weaponry. When longer sentences are involved, the permutations become exponentially more complex. Even the most advanced search engines available today, such as the Google™ search engine, are generally not able to accurately interpret longer search queries so as to deliver meaningful results. For example, a search on Google™ for "Software companies founded before 1990 with a current turnover of more than $100 million" yields a list of largely irrelevant references, even though the search query is perfectly clear to a human and the information is doubtless available on the Internet. Because existing search engines rely primarily on keywords rather than the context of words, most people have learned through experience to write search queries in what has been dubbed "caveman speak", where, for example, a user wanting to know about popular seafood restaurants in Seattle might search for "seafood Seattle" rather than "Where can I find a good seafood restaurant in Seattle?". Existing search engines are not able to properly analyze complex contextual meanings created by combining words into a sentence structure.
The "Semantic Web" refers to a structure for the Internet in which machine-readable data (or meta-data) is available that tells a computer unambiguously what a web page, a document or a topic is about. This meta-data enables computers to understand the meaning of information directly, without the interpretation problems that plague current search engines. Currently, certain defined domains - for example, airline booking systems - operate in this way. Thus the term "JFK" in an airline booking system refers only to John F Kennedy International airport in New York, not to the former US president or other terms that may have these three letters as their acronym.
Some search engines, such as Bing™, identify categories based on the search terms, and a user is able to filter out irrelevant results by only selecting certain categories. Thus a search for "chicken" might identify categories of "animals" and "recipes" and allow the user to filter so as to only search within one of the two categories. However, the goal of the Internet itself being semantic has not yet been realized, despite ongoing efforts to index and associate concepts on the Internet. The main problem is the enormity of the task involved in performing such identification and association on the open Internet, which requires a huge structured database to be built.
It would be advantageous to have a completely autonomous system that is able to build a contextual language model so that search strings can be interpreted more accurately by search engines, so as to deliver more relevant and targeted search results without the need to categorize or index existing content.
SUMMARY OF THE INVENTION
In accordance with the invention there is provided a method for processing an input search string and building multiple alternative search strings, the method comprising:
extracting text from a multitude of electronically accessible documents or web pages, the text including words;
forming a word relationship database in which each unique word identified in the text is stored in the database and is associated with a number of fields which each represent other words which were found to occur adjacent to that word in the text, each field also including a frequency sub-field which indicates how frequently that other word was found to occur adjacent to the associated word;
processing the word relationship database so as to determine a forward signature and reverse signature for each word, the forward signature including a ranked list of the words that were found to come after that word in the text, and the reverse signature including a ranked list of the words that were found to come before that word in the text;
combining the forward and reverse signatures of each word to form an ambidextrous signature for each word, and storing the ambidextrous signatures in a signature database;
optionally removing popular words from the input search string, being those words with a total frequency higher than a predetermined threshold;
linking the remaining words of the input search string into two-word groups from left to right with the second word of any preceding two-word group forming the first word of the next two-word group;
for each two-word group, carrying out the following steps:
querying the word relationship database to determine the reverse signature of the first word and the forward signature of the second word;
combining the forward and the reverse signatures obtained in the previous step into an ambidextrous signature representative of that two-word group;
comparing the ambidextrous signature of the previous step with the signature database to find the closest matches;
identifying the words in the signature database with the closest signature matches as alternative words that are semantically similar to the two-word group;
substituting one or more of the two-word groups in the input search string with the identified alternative words;
analyzing whether each substituted word fits grammatically into the input search string by querying the word relationship database to see whether the word preceding and the word following the substituted word in the input search string are words that are associated with the substituted word to a predefined extent; and
if the substituted word or words do fit grammatically, identifying the string with the substituted words as an alternative search string that is both semantically similar to the original search string and grammatically correct.
Further features of the invention provide for the text to be extracted from the web pages or documents by autonomous web crawling programs; for the word relationship database to be continually updated as more and more text is extracted; and for the signature database to be periodically rebuilt.
Still further features of the invention provide for the method to include an additional step of, immediately after extracting text and before forming a word relationship database, parsing the text into sentence portions which start and end with sentence delimiters.
Yet further features of the invention provide for techniques to be employed that keep the growth of the word relationship database in check, for example by gradually reducing the frequency sub-fields so that only those words that are frequently incremented will develop large frequencies, or by periodically discarding words with low frequency sub-fields.
Further features of the invention provide for the ambidextrous signature of the two-word group to be matched with ambidextrous signatures in the signature database by calculating a matching score that looks for matches between the fields of the signatures and applies decreasing weighting factors to fields that are associated with lower frequency sub-fields.
Still further features of the invention provide for inputting the multiple alternate search strings into a search engine simultaneously or in rapid succession and comparing the results of each separate search so as to rank the overall results and present those results which were obtained in the greatest number of separate searches as the most relevant search results.
The language of the text is preferably identified so that separate word relationship databases and signature databases can be built for each separate language.
The invention extends to a system for processing an input search string and building multiple alternative search strings, comprising:
a processor in the form of a server which is able to access a multitude of web pages or other documents through the Internet and extract text, the text including words;
a word relationship database coupled to the processor and having each unique word identified in the text stored thereon, each unique word in the word relationship database being associated with a number of fields which each represent other words which were found to occur adjacent to that word in the text, each field also including a frequency sub-field which indicates how frequently that other word was found to occur adjacent to the associated word;
a signature database coupled to the server, the signature database being formed by the server processing the word relationship database so as to determine a forward signature and reverse signature for each word, the forward signature including a ranked list of the words that were found to come after that word in the text, and the reverse signature including a ranked list of the words that were found to come before that word in the text, and combining the forward and reverse signatures of each word to form an ambidextrous signature for each word that is stored in the signature database;
and computer software stored on the processor and configured to enable the processor to carry out the following:
optionally removing popular words from the input search string, being those words with a total frequency higher than a predetermined threshold;
linking the remaining words of the input search string into two-word groups from left to right with the second word of any preceding two-word group forming the first word of the next two-word group;
for each two-word group, carrying out the following:
querying the word relationship database to determine the reverse signature of the first word and the forward signature of the second word;
combining the forward and the reverse signatures obtained in the previous step into an ambidextrous signature representative of that two-word group;
comparing the ambidextrous signature of the previous step with the signature database to find the closest matches; identifying the words in the signature database with the closest signature matches as alternative words that are semantically similar to the two-word group;
substituting one or more of the two-word groups in the input search string with the identified alternative words;
analyzing whether each substituted word fits grammatically into the input search string by querying the word relationship database to see whether the word preceding and the word following the substituted word in the input search string are words that are associated with the substituted word to a predefined extent; and
if the substituted word or words do fit grammatically, identifying the string with the substituted words as an alternative search string that is both semantically similar to the original search string and grammatically correct.
Further features of the invention provide for the server to be configured to input the multiple alternative search strings into a search engine simultaneously or in rapid succession, and to compare the results of each separate search so as to rank the overall results and present those results which are obtained in the greatest number of separate searches as the most relevant search results.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will now be described, by way of example only with reference to the accompanying representations in which:
Figure 1 is a flowchart that illustrates the overall steps performed in obtaining improved search results where multiple alternative search strings are generated according to the method of the invention;
Figure 2 is a schematic diagram showing the system for generating multiple alternative search strings according to the invention;
Figure 3 is a flowchart that illustrates the steps performed in creating and updating a word relationship database;
Figure 4 is a flowchart that illustrates the steps performed in creating a signature database based on the word relationship database; and
Figure 5 is a flowchart that illustrates the steps performed in using the signatures in the signature database to create multiple alternative search strings according to the method of the invention.
DETAILED DESCRIPTION WITH REFERENCE TO THE DRAWINGS
Figure 1 is a flowchart that illustrates the overall steps performed and results obtained by the method and system of the invention. As provided by the invention, at a first stage (20), a search string is input into the system of the invention. At a next stage (22), multiple alternative search strings are generated, with each search string being semantically similar to the original search string and containing correct grammar. The original search string and each alternative search string are then input into a search engine at the next stage (24). The search engine may be any computerized search engine, including a web based internet search engine that facilitates keyword searching. The results of each internet search obtained using the search strings are obtained at the next stage (26). These results are then combined at the next stage (28) so as to identify the most relevant search results and output those results at stage (30).
Figure 2 illustrates a system (100) which enables the multiple alternative search strings to be generated, which was stage (22) in Figure 1. The system includes a processor in the form of a server (102) which is able to access a multitude of web pages (104) or other documents through the Internet (106) by means of web crawling programs (not shown). The server is also coupled to a word relationship database (108) and a signature database (110). The word relationship database is built from content obtained from the Internet by the web crawling programs, and the word relationship database is used to create the signature database as will be explained below. These databases can then be used to obtain multiple alternative search strings for a given input search string according to the method of the invention which will be explained in detail below.
A. Creating and Updating a Word Relationship Database
The first stage in the method of the invention is to create a word relationship database where every word in a particular language is associated with words adjacent to it. A flow chart illustrating the steps to create and update a word relationship database is shown in Figure 3.
At a first stage (40), software programs commonly referred to as "spiders" or "robots", automatically access content on thousands or millions of webpages on the Internet and, from each webpage, extract text by accessing the underlying HTML code. This text need not be in English, although it would be advantageous to identify the language of specific sites or domains so that separate word relationship databases can be built for different languages. For the purposes of this description, it will be assumed that English text is being analyzed and input into the word relationship database. Exactly the same methods can be applied to any other language in which context plays an important part in meaning - i.e. where the meaning conveyed by words depends on the words surrounding them.
An example of a small portion of text extracted by a software spider from a given website might be:
"The Semantic Web is a technology waiting to be actualized. Application areas are experiencing intensified interest due to the rapid growth in the use of the Web. Information content technologies (such as search engines) are constantly being improved, with the hope of the actualization of powerful search technologies."
At a next stage (42), the text is then parsed into sentence portions which start and end with sentence delimiters. The following ASCII characters are generally regarded as sentence delimiters:
Table 1: Common Sentence Delimiters (table reproduced as an image in the original)
Using the example text above, the delimiters in the text are ".", "(", ")" and ",". The text can therefore be parsed into the following sentence portions:
a) The Semantic Web is a technology waiting to be actualized
b) Application areas are experiencing intensified interest due to the rapid growth in the use of the Web
c) Information content technologies
d) such as search engines
e) are constantly being improved
f) with the hope of the actualization of powerful search technologies
Using the parsed sentence portions, a word relationship database can then be formed which shows how adjacent words are related to each other in the body of text analyzed by the web spiders. At stage (44), the word relationship database is formed as a two-dimensional matrix in which each row represents a particular word (the "row word"), and has a number of row fields that represent specific words that were found to occur after the row word in the body of text that was analyzed. Each row field also includes an indication of the frequency, or number of times, that the word was found to occur after the row word in the body of content. This can schematically be illustrated as follows:
<RowWord1>: <WordAfter1>, <Freq1> | <WordAfter2>, <Freq2> ...
<RowWord2>: <WordAfter1>, <Freq1> | <WordAfter2>, <Freq2> ...
<RowWord3>: <WordAfter1>, <Freq1> | <WordAfter2>, <Freq2> ...
Each word is assigned a unique reference number, and within each row the row fields are ranked according to frequency. For example, consider a very small portion of the two-dimensional matrix, namely only the words that fall alphabetically between "actuality" and "actualizes". From sentence portion (f) above, only the word "of" was found to follow the word "actualization". In sentence portion (a), no word was found after "actualized". If the only text input into the two-dimensional matrix were the sentence portions (a) - (f) above, the matrix portion might look as follows:
502("actualization"): 3488("of"), 1
503("actualized"):          (1)
The word "of", which has reference number 3488, was found to come after the word "actualization" (reference number 502) only once. Although the word "actualized" (reference number 503) was found, no words after it were found. Stages (40), (42) and (44) are then cycled through repeatedly as more and more text is extracted from the Internet, parsed into sentence portions, and added to the word relationship database.
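Stages (42) and (44) can be sketched in a few lines of Python. This is a minimal illustration rather than the applicant's implementation: the delimiter set is an assumption (Table 1 is not reproduced here), words are keyed by their text instead of by reference numbers, and the function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumed delimiter set; the actual contents of Table 1 are not available.
SENTENCE_DELIMITERS = r'[.,;:!?()"]'

def parse_sentence_portions(text):
    """Split text into sentence portions at the assumed delimiters."""
    portions = re.split(SENTENCE_DELIMITERS, text)
    return [p.strip() for p in portions if p.strip()]

def build_word_relationships(portions):
    """Map each row word to a Counter of the words found directly after it."""
    rows = defaultdict(Counter)
    for portion in portions:
        words = portion.lower().split()
        for word in words:
            rows[word]  # ensure a row exists even if nothing follows the word
        for row_word, word_after in zip(words, words[1:]):
            rows[row_word][word_after] += 1
    return rows

text = ("The Semantic Web is a technology waiting to be actualized. "
        "Information content technologies (such as search engines) are "
        "constantly being improved, with the hope of the actualization "
        "of powerful search technologies.")
rows = build_word_relationships(parse_sentence_portions(text))
print(rows["actualization"].most_common())  # [('of', 1)]
print(rows["actualized"].most_common())     # []
```

Because adjacency is only counted inside a sentence portion, "actualized" ends up with an empty row, which corresponds to row 503 in (1).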
Consider how the word relationship database might look after several thousand or hundreds of thousands of words have been parsed into sentence portions and input into the matrix. The words that follow alphabetically between "actuality" and "actualizes" might then have the following entries:
501("actuality"): 6638("in"), 13 | 662("at"), 8 | 465("about"), 6 | 3488("of"), 6 | 1392("for"), 5
502("actualization"): 2227("towards"), 2 | 3488("of"), 1 | 465("about"), 1
503("actualized"): 49731("work"), 7 | 1392("for"), 4
504("actualizes"): 663("anything"), 4 | 3811("something"), 3 (2)
Each row represents a unique word ("a row word") and the rows are alphabetically sorted with incrementing reference numbers, words 501-504 ("actuality" - "actualizes"). Each row word has a number of row fields after it. In the case of word 501 ("actuality"), there is an array of 5 row fields following it, whereas for words 503 and 504 ("actualized" and "actualizes") there are only 2 row fields following each of these words. Each row field includes two items of information: the reference number of a word that was found to come after the row word in the body of searched text, and a frequency number which shows the number of times that referenced word was found to come after the row word in the body of searched text. In this case, the word "in" (6638) was found to come after the row word "actuality" (501) the most (13 times), whereas the word "for" (1392) came after "actuality" (501) only 5 times. The phrase "actuality in" was therefore more popular than the phrase "actuality for" in the body of searched text. In the case of row word 503 ("actualized"), the only words found to follow it were "work" (49731) and "for" (1392). The row fields are sorted according to the frequency with which the words are found, from the highest frequency on the left to the lowest frequency on the right.
It will be appreciated that the row words and field words have been written between inverted commas to aid understanding. In the actual machine-readable word relationship database, only the reference numbers will be used. The portion of the matrix in (2) above therefore looks as follows:
501: 6638, 13 | 662, 8 | 465, 6 | 3488, 6 | 1392, 5
502: 2227, 2 | 3488, 1 | 465, 1
503: 49731, 7 | 1392, 4
504: 663, 4 | 3811, 3 (3)
This extract from the word relationship database is, of course, greatly simplified for illustrative purposes. As more text is searched and indexed by the web crawling programs, the word relationship database increases in size with the frequency numbers growing rapidly and the number of row fields also growing, although not as quickly. To keep this growth in check, a number of techniques can be employed, such as techniques that gradually reduce the frequency fields so that only those field words that are frequently incremented will develop large frequencies. Algorithms for periodically discarding the row fields that have very low frequencies can also be used so as to keep the number of row fields in check, in addition to algorithms that compress the matrix density (the number of row fields multiplied by their frequencies).
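One way to picture the growth-control techniques mentioned above is a periodic pass that scales every frequency down and then drops row fields that fall below a floor. The sketch below is an assumption about how such a pass might look; the description does not specify a decay factor, a threshold, or a function such as `decay_and_prune`.

```python
def decay_and_prune(rows, decay=0.9, min_frequency=1.0):
    """Scale every frequency by `decay`, then drop fields below `min_frequency`.

    Both parameters are illustrative assumptions, not values from the text.
    """
    pruned = {}
    for row_word, fields in rows.items():
        pruned[row_word] = {
            word: freq * decay
            for word, freq in fields.items()
            if freq * decay >= min_frequency
        }
    return pruned

rows = {
    "actuality": {"in": 13, "at": 8, "about": 6, "of": 6, "for": 5},
    "actualization": {"towards": 2, "of": 1, "about": 1},
}
smaller = decay_and_prune(rows, decay=0.5, min_frequency=1.0)
print(smaller["actualization"])  # {'towards': 1.0}
```

Fields that keep being incremented between passes survive the decay, while fields seen only once or twice eventually disappear.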
Once populated with content from a large number of web pages and other documents, the word relationship database provides an accurate view of the relationship that each word has to the words that come after it in a particular language (such as English), provided of course that the bulk of the content accessed by the web spiders is not garbled or meaningless, which it should not be if ordinary content on the Internet is being accessed. One would therefore expect the relationship between certain two-word groups to be strong, such as "founding fathers", while there would be a very weak (or non-existent) relationship between other, even very similar, two-word groups such as "found fathers".
B. Creating a Signature Database based on the Word Relationship Database
Next, according to the method of the invention, a signature database is created that is based on the word relationship database. Figure 4 illustrates the steps to create a signature database based on the word relationship database.
As explained in the previous section, the words in the row field of the word relationship database only indicate words that come after the row word. Each row can therefore be thought of as a signature for the words that follow the row word, where the signature tells you the relationship of the row word to other words following it, ranked according to popularity.
At a first stage (50) in Figure 4, the word relationship database is queried to obtain the "reverse signature" of every word, i.e. an indication of the popularity of words that precede the word of interest. This can be done by searching the entire word relationship database for every instance where the word of interest appears in a row field, and identifying the row word associated with that row field as the preceding word.
For example, assume that the word "work" has the following row in the word relationship database:
54731("Work"): 2284("on"), 15 | 4433("hard"), 12 | 5333("towards"), 5 (4)
To find out what the "reverse signature" of "work" is, the entire word relationship database is searched to find every instance where "work" appears in any one of the row fields. In this example, assume that only the following rows are found in which the word "work" appears in the row fields:
503("actualized"): 49731("work"), 7 | 1392("for"), 4
4332("hard"): 2290("life"), 12 | 49731("work"), 10 | 902("to") (5)
27332("start"): 9221("living"), 14 | 49731("work"), 13 | 3323("button"), 9
Three rows were found with "work" in the row fields. It will be appreciated that the row words of these three rows ("actualized", "hard" and "start") represent the words that were found to come before the word "work", and that the number of times they were found to come before work is given by the frequency field in each row field in which "work" appears.
Therefore, the word "actualized" came before "work" 7 times, "hard" came before work 10 times and "start" came before work 13 times. The reverse signature of "work" can therefore be constructed as:
54731("Work"): 27332("start"), 13 | 4332("hard"), 10 | 503("actualized"), 7 (6)
In the same way, the "reverse signature" of each word in the word relationship database can be obtained.
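The reverse-signature lookup of stage (50) amounts to inverting the forward rows. The following is a minimal sketch under the assumption that the word relationship database is held as a dictionary of dictionaries keyed by word text; the "to" field of row (5) is left out because no frequency is given for it.

```python
def reverse_signature(rows, word):
    """Collect every row word whose row fields contain `word`, ranked by frequency."""
    preceding = [(row_word, fields[word])
                 for row_word, fields in rows.items() if word in fields]
    return sorted(preceding, key=lambda pair: pair[1], reverse=True)

rows = {
    "actualized": {"work": 7, "for": 4},
    "hard": {"life": 12, "work": 10},
    "start": {"living": 14, "work": 13, "button": 9},
    "work": {"on": 15, "hard": 12, "towards": 5},
}
print(reverse_signature(rows, "work"))
# [('start', 13), ('hard', 10), ('actualized', 7)]
```

A full scan per word is shown for clarity only; a real system would more likely build all reverse signatures in a single pass over the rows.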
At the next stage (52), the forward and reverse signatures of each word are combined into an "ambidextrous signature". To do this, the information about whether the field word came before or after the word of interest is discarded, and the number of times each field word came before or after the word is also discarded, while nevertheless maintaining a ranking based on the number of times each field word came before or after the word. As an example, the "forward signature" of "work" given at (4) and the "reverse" signature of "work" given at (6) are combined into the following "ambidextrous signature":
54731("Work"): 4433("hard") | 2284("on") | 27332("start") | 503("actualized") | 5333("towards") (7)
It will be noted that the frequency fields no longer appear, but that the row fields are still ranked according to the original frequencies in the forward and reverse signatures. The reason that "hard" appears first is that it appeared in both the forward and reverse signatures, and the frequencies of both fields (12 and 10 respectively) were added together, yielding a frequency of 22 which is greater than the next highest frequency of 15 for the word "on". The ambidextrous signature (7) is therefore a word relationship signature which shows which words are contextually close to the word "work", in that those words often appear adjacent to the word "work" (either before or after) in the English language.
By processing the word relationship database, word relationship signatures like the one in (7) are then created for every word in the word relationship database. These word relationship signatures are saved in a second database, called the signature database, at stage (54). The signature database was illustrated at (110) in Figure 2.
As the word relationship database is continually updated with more and more content obtained by the web crawling programs, and adapted by the various algorithms to control its growth, the signature database is similarly automatically updated by being periodically rebuilt (stage (56)), with stages (50) to (54) being repeated. Over time, therefore, words which previously may have had no relationship to each other, such as "Lady" and "Gaga", may become strongly associated with each other if lots of content on web pages starts appearing with those words adjacent to each other. In this way, the word relationship signatures in the signature database change to reflect language as it is commonly written and used.
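The combination at stage (52) can be illustrated as follows. The sketch assumes signatures are held as word-to-frequency dictionaries and uses a hypothetical helper name; it sums the frequencies of words that occur in both signatures, ranks on the totals, and then discards the numbers.

```python
from collections import Counter

def ambidextrous_signature(forward, reverse):
    """Sum forward and reverse frequencies per word, then keep only the ranking."""
    combined = Counter(forward)
    combined.update(reverse)  # adds frequencies for words present in both
    return [word for word, _ in combined.most_common()]

forward = {"on": 15, "hard": 12, "towards": 5}        # signature (4)
reverse = {"start": 13, "hard": 10, "actualized": 7}  # signature (6)
print(ambidextrous_signature(forward, reverse))
# ['hard', 'on', 'start', 'actualized', 'towards']
```

The output reproduces the order of signature (7), with "hard" first because its combined frequency is 22.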
It will be appreciated that the word relationship signatures only reflect the relationship of specific words to those words that come immediately before or after them, not to more distant word relationships. Using only a two-word relationship means the word relationship database and signature database are two-dimensional matrices, rather than 3-, 4- or higher-order matrices. This simplicity is important because it keeps the size of the word relationship database and signature database manageable and makes it very scalable.
C. Using the Signatures in the Signature Database to Create Multiple Alternative Search Strings
Finally, according to the method of the invention, the word relationship signatures in the signature database are used to create multiple alternative search strings that are semantically similar to an input search string and grammatically correct. This is the step that was indicated broadly by stage (22) in Figure 1 and which will now be described in detail. The various stages involved in generating the multiple alternative search strings are illustrated in Figure 5.
At a first stage (60), popular words are removed from the input search string. Popular words are identified as those words with a total frequency in the entire word relationship database that is higher than a predetermined threshold - in other words, those words that appear very commonly in the total body of text accessed by the web crawling programs. As an example, consider the search string "Where can I get cool spring water?". The words "where", "can", "I" and "get" will likely be identified as popular words, with the remaining words "cool spring water" being non-popular words. Next, at stage (62), the non-popular words are linked in two-word groups from left to right with the last word of any preceding two-word group forming the first word of the next two-word group. In this case, there are two two-word groups, namely "cool spring" and "spring water".
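Stages (60) and (62) can be sketched as a filter followed by an overlapping pairing. The total frequencies and the threshold below are invented for illustration; the description only states that a predetermined threshold is used.

```python
def remove_popular_words(words, total_frequency, threshold):
    """Drop words whose total frequency exceeds the threshold."""
    return [w for w in words if total_frequency.get(w, 0) <= threshold]

def two_word_groups(words):
    """Link words left to right so each group starts with the previous group's last word."""
    return list(zip(words, words[1:]))

query = "where can i get cool spring water".split()
# Hypothetical totals; real values would come from the word relationship database.
total_frequency = {"where": 9000, "can": 12000, "i": 20000, "get": 8000,
                   "cool": 300, "spring": 450, "water": 700}

remaining = remove_popular_words(query, total_frequency, threshold=5000)
print(remaining)                   # ['cool', 'spring', 'water']
print(two_word_groups(remaining))  # [('cool', 'spring'), ('spring', 'water')]
```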
Next, at stage (64), each two word group is analyzed as follows: the reverse signature of the first word and the forward signature of the second word are obtained. Then, at stage (66), the forward and reverse group signatures are combined into a single ambidextrous "word-group" signature.
For example, if the forward signature of "spring" in the word relationship database is the following:
42551("spring"): 2211("day"), 21 | 53342("was"), 15 | 3321("morning"), 4 (8)
and the reverse signature of "cool" is the following:
1221("cool"): 49923("very"), 19 | 3221("stay"), 13 | 9219("really"), 8 (9)
then the ambidextrous signature of "cool spring" could be the following:
("cool spring"): 2211("day") | 49923("very") | 53342("was") | 3221("stay") | 9219("really") | 3321("morning") (10)
Importantly, the ambidextrous word relationship signature in (10) gives the forward and reverse relationship of the two words "cool spring" in combination, as if they were a single word. Next, at stage (68), the signature database is searched to look for close signature matches for the ambidextrous "word group" signature (10). By comparing the signature in (10) to the word signature database and looking for close matches, single words can be found that are semantically similar to the two-word group, "cool spring".
This comparison can be done in various ways. One way is to calculate a matching score between the signature (10) and each of the signatures in the signature database by an algorithm that looks for matches between the fields of the signature (10) and the fields of each of the signatures in the signature database. Decreasing weighting factors can be allocated to each of the fields within the signature so that matches between fields that are further to the right count less than matches between fields that are further left. The algorithm can also allocate a higher weighting factor if the word in the signature database that includes matching fields is not a common word, as these words give more information than common words such as prepositions and conjunctions. At stage (70), the word or words that have the highest matching score are then identified as the words that are semantically similar to the two-word group.
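Stages (64) to (70) might be sketched as follows. The description leaves the scoring algorithm open, so the position weight 1/((i+1)(j+1)) used here is purely an assumption chosen to make matches further to the right count less; the extra weighting for uncommon words is omitted, and the contents of `signature_db` are invented.

```python
def group_signature(reverse_first, forward_second):
    """Merge the first word's reverse signature with the second word's forward signature."""
    merged = dict(reverse_first)
    for word, freq in forward_second.items():
        merged[word] = merged.get(word, 0) + freq
    return sorted(merged, key=merged.get, reverse=True)

def matching_score(query_sig, candidate_sig):
    """Sum assumed position weights over fields shared by both signatures."""
    score = 0.0
    for i, word in enumerate(query_sig):
        if word in candidate_sig:
            j = candidate_sig.index(word)
            score += 1.0 / ((i + 1) * (j + 1))
    return score

def closest_matches(query_sig, signature_db, top_n=3):
    """Return up to top_n words whose signatures score highest against query_sig."""
    scored = [(matching_score(query_sig, sig), word)
              for word, sig in signature_db.items()]
    scored = [(s, w) for s, w in scored if s > 0]
    return [w for _, w in sorted(scored, reverse=True)[:top_n]]

reverse_cool = {"very": 19, "stay": 13, "really": 8}   # signature (9)
forward_spring = {"day": 21, "was": 15, "morning": 4}  # signature (8)
cool_spring = group_signature(reverse_cool, forward_spring)
print(cool_spring)  # ['day', 'very', 'was', 'stay', 'really', 'morning']

# Hypothetical signature database entries.
signature_db = {
    "refreshing": ["very", "day", "drink", "stay"],
    "season": ["day", "was", "morning", "rainy"],
    "gearbox": ["oil", "ratio"],
}
print(closest_matches(cool_spring, signature_db))  # ['season', 'refreshing']
```

With this invented data "season" scores highest, which is the kind of semantically plausible but contextually wrong candidate that the grammatical check at stage (78) is meant to filter out.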
As shown at stage (72), stages (64) to (70) are then repeated for each of the other two-word groups in the search string, which in this example is the second two-word group, "spring water". In this way, one or more other words are identified that are semantically similar to "spring water". Combining the results of both iterations yields a number of two word strings that are each semantically similar to "cool spring water". For example, if one of the words identified as semantically similar to "cool spring" was "refreshing" and one of the words identified as semantically similar to "spring water" was "liquid", then "refreshing liquid" would be identified as semantically similar to "cool spring water".
Using the substitute word or words for "cool spring" and "spring water", and repeating the procedure with the substitute two words (e.g. "refreshing liquid"), it is possible to repeat stages (64)-(70) to find individual words that are semantically similar to the three words, "cool spring water". In this example, the single word "juice" could, for example, be identified as semantically similar to "refreshing liquid". It will be appreciated that, by repeating the substitution procedure in stages (64) - (70) a number of times, it is possible to obtain multiple alternate words for the extracted non-popular words, as shown at stage (74). The alternate words can be a string that has any number of words fewer than the extracted non-popular words. For example, if 5 non-popular words were extracted, then alternate word strings of 4, 3, 2 or 1 word(s) can be generated. In the case of the three-word string, "cool spring water", the following alternatives could perhaps have been generated:
- "refreshing water"
- "cool spring liquid"
- "refreshing liquid"
- "juice" (11)
While the method described above enables the extracted non-popular portion of the search string to be substituted with semantically similar words, it does not necessarily follow that the semantically similar words will be grammatically correct when substituted back into the original search string. For example, in the search string, "Where can I get cool spring water?", if the word "season" is identified as semantically similar to the two words "cool spring", substituting "season" into the original string yields the phrase, "Where can I get season water?" which clearly is not grammatically correct. In this case, the meaning is also not as originally intended because of the multiple meanings of the word "spring". In most cases, the applicant has found that where the substituted words yield a sentence that is grammatically incorrect, the meaning of the alternative string is different from the intended meaning of the original string, but where the substituted words yield a sentence that is grammatically correct, the meaning is generally consistent with the original meaning. To overcome the problem of grammatically incorrect alternative search strings being generated, the invention includes additional steps by means of which grammatically incorrect alternative strings can be excluded. To do this, the substituted words are first substituted back into the original search string at stage (76). Then, at stage (78), each substituted word is analyzed within the original string to see whether the words preceding it and following it are words that are associated with the substituted word by a predefined degree. This is done by looking up the word in the word relationship database and checking whether the word following it appears within the list of row fields with more than a predetermined frequency. Using the reverse signature of that word, a check is also made to see whether the word preceding it appears within the list of row fields with more than a predetermined frequency. 
Only if both the preceding and following words appear within the row fields with more than a predetermined frequency is the word regarded as fitting grammatically within the string; otherwise it is rejected at stage (80).
For example, in the case of the alternative string, "Where can I get season water", it is very unlikely that "get" will appear within the list of words that commonly precede "season" or that "water" will appear within the list of words that commonly follow "season". This alternative string will therefore be rejected as grammatically incorrect.
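The grammatical check of stages (78) and (80) reduces to two lookups in the word relationship database. In the sketch below the frequencies and the threshold are invented, and the database is again assumed to be a dictionary of forward rows. Looking up the substituted word in the preceding word's row gives the same number as consulting the substituted word's reverse signature, since the reverse signature is derived from those rows.

```python
def fits_grammatically(rows, preceding, substitute, following, min_frequency=5):
    """Accept only if both neighbours are associated with the substitute strongly enough."""
    follows_ok = rows.get(substitute, {}).get(following, 0) > min_frequency
    precedes_ok = rows.get(preceding, {}).get(substitute, 0) > min_frequency
    return follows_ok and precedes_ok

# Hypothetical forward rows: row word -> {word after: frequency}.
rows = {
    "get": {"a": 300, "fresh": 40, "season": 1},
    "fresh": {"air": 80, "water": 55},
    "season": {"ticket": 30, "finale": 25},
}
print(fits_grammatically(rows, "get", "fresh", "water"))   # True
print(fits_grammatically(rows, "get", "season", "water"))  # False
```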
If the word "fresh" is identified as semantically similar to "cool spring", the string, "Where can I get fresh water?" would be checked for grammatical correctness by seeing whether the word "get" commonly precedes "fresh" and whether "water" commonly follows "fresh". In both cases, the answer will be in the affirmative and, at stage (82), the string "Where can I get fresh water?" will be identified as an alternative string for "Where can I get cool spring water?". Once multiple alternative strings have been generated, they can simultaneously or in very rapid succession be input into a search engine as shown in Figure 1 and the results compared. The documents or web pages that are found to be relevant in the results of numerous alternative strings can then be identified as more relevant than those documents or web pages which are only found to be relevant in the results of one or two alternative search strings. The most relevant documents or web pages are then presented to the user first. It will be appreciated that from the perspective of the user of a search engine the invention described above is completely hidden and is carried out in the background. The user interacts with the search engine in exactly the same way as before - by typing in a search string - and the search engine generates the alternative search strings and identifies the most relevant documents to present to the user.
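The comparison of results across the separate searches can be pictured as a simple vote count. This sketch assumes each search returns a list of document identifiers; the identifiers and the helper name are hypothetical, and no particular search engine API is implied.

```python
from collections import Counter

def rank_by_agreement(results_per_search):
    """Rank documents by the number of separate searches that returned them."""
    counts = Counter()
    for results in results_per_search:
        counts.update(set(results))  # count each document once per search
    return [doc for doc, _ in counts.most_common()]

results_per_search = [
    ["doc_a", "doc_b"],           # results for the original search string
    ["doc_b", "doc_c"],           # results for one alternative string
    ["doc_b", "doc_a", "doc_b"],  # results for another alternative string
]
print(rank_by_agreement(results_per_search))  # ['doc_b', 'doc_a', 'doc_c']
```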
The applicant has found that the invention leads to a marked improvement in the quality of the results that are presented to a user. Irrelevant search results are excluded far more often than with existing search engines and complex sentence structures can be handled with more precision. Because multiple alternative search strings are generated based on the search string, the applicant has found that it is no longer necessary to substitute different words or attempt to re-write search strings with different sentence structures in an attempt to locate relevant results. This leads to increased user satisfaction and quicker location of relevant search results.
The system of the invention requires no human input to categorize and index content, nor does it have to be programmed with complex morphological or grammatical rules or built-in dictionaries. The invention provides a completely autonomous and extremely scalable system that is able to build a contextual language model of any contextual language so that search strings can be interpreted more accurately by search engines, so as to deliver more relevant and targeted search results without the need to categorize or index existing content.
While it is envisaged that the invention may be applied in web based search engines, it can also be applied in the enterprise search market where companies search their own internal documents and information.
The above description is illustrative and is not restrictive. Many variations of the invention will become apparent to those skilled in the art upon review of the disclosure. The scope of the invention should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the claims along with their full scope or equivalents.
Any of the software components or functions described in this specification may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C++ or Perl using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions, or commands on a computer readable medium, such as a random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive, or an optical medium such as a CD-ROM. Any such computer readable medium may reside on or within a single computational apparatus, and may be present on or within different computational apparatuses within a system or network.

Claims

A method for processing an input search string and building multiple alternative search strings, the method comprising:
extracting text from a multitude of electronically accessible documents or web pages, the text including words;
forming a word relationship database in which each unique word identified in the text is stored in the database and is associated with a number of fields which each represent other words which were found to occur adjacent to that word in the text, each field also including a frequency sub-field which indicates how frequently that other word was found to occur adjacent to the associated word;
processing the word relationship database so as to determine a forward signature and reverse signature for each word, the forward signature including a ranked list of the words that were found to come after that word in the text, and the reverse signature including a ranked list of the words that were found to come before that word in the text;
combining the forward and reverse signatures of each word to form an ambidextrous signature for each word, and storing the ambidextrous signatures in a signature database;
optionally removing popular words from the input search string, being those words with a total frequency higher than a predetermined threshold;
linking the remaining words of the input search string into two-word groups from left to right with the second word of any preceding two-word group forming the first word of the next two-word group;
for each two-word group, carrying out the following steps:
querying the word relationship database to determine the reverse signature of the first word and the forward signature of the second word;
combining the forward and the reverse signatures obtained in the previous step into an ambidextrous signature representative of that two-word group;
comparing the ambidextrous signature of the previous step with the signature database to find the closest matches;
identifying the words in the signature database with the closest signature matches as alternative words that are semantically similar to the two-word group;
substituting one or more of the two-word groups in the input search string with the identified alternative words;
analyzing whether each substituted word fits grammatically into the input search string by querying the word relationship database to see whether the word preceding and the word following the substituted word in the input search string are words that are associated with the substituted word to a predefined extent; and
if the substituted word or words do fit grammatically, identifying the string with the substituted words as an alternative search string that is both semantically similar to the original search string and grammatically correct.
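Purely as a non-limiting illustration, the word relationship database, the forward and reverse signatures, and the two-word groups recited above might be sketched in Python as follows. All function names, the whitespace tokenization, and the signature length of five are assumptions made for this sketch; the in-memory dictionaries stand in for the databases and are not the applicant's implementation.

```python
from collections import defaultdict, Counter

def build_relationships(sentences):
    """Return (after, before): for each word, counts of the words found
    directly after it and directly before it in the text."""
    after = defaultdict(Counter)
    before = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()  # simplistic tokenization (assumption)
        for left, right in zip(words, words[1:]):
            after[left][right] += 1
            before[right][left] += 1
    return after, before

def signature(counter, size=5):
    """Ranked list of the most frequent neighbours; ties broken alphabetically."""
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]

def two_word_groups(words):
    """Overlapping pairs: the second word of each group starts the next group."""
    return list(zip(words, words[1:]))

def group_signature(group, after, before, size=5):
    """Reverse signature of the first word combined with the forward
    signature of the second word of a two-word group."""
    first, second = group
    return {
        "reverse": signature(before.get(first, Counter()), size),
        "forward": signature(after.get(second, Counter()), size),
    }

after, before = build_relationships([
    "where can i get fresh water",
    "where can i get cool spring water",
])
print(two_word_groups(["cool", "spring", "water"]))
print(group_signature(("cool", "spring"), after, before))
```

On this two-sentence toy corpus the group "cool spring" and the single word "fresh" are both preceded by "get" and followed by "water", which is the kind of agreement the signature comparison relies on.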
The method as claimed in claim 1, wherein the text is extracted from the web pages or documents by autonomous web crawling programs.
The method as claimed in claim 1 or 2, wherein the word relationship database is continually updated as more and more text is extracted, and the signature database is periodically rebuilt.
The method as claimed in any of the preceding claims, wherein the method includes an additional step of, immediately after extracting text and before forming a word relationship database, parsing the text into sentence portions which start and end with sentence delimiters.
The method as claimed in any of the preceding claims, wherein techniques are employed that keep the growth of the word relationship database in check.
The method as claimed in claim 5, wherein a technique employed is that the values of the frequency sub-fields are gradually reduced so that only those words that are frequently incremented will develop large frequencies.
The method as claimed in claim 5, wherein the technique employed is that words with low frequency fields are periodically discarded.
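As an illustrative sketch only, the two growth-limiting techniques recited above (gradual reduction of the frequency sub-fields and periodic discarding of low-frequency words) might look as follows in Python. The names decay and prune, the decay factor, and the minimum frequency are hypothetical values chosen for this sketch and are not taken from the specification.

```python
def decay(relationships, factor=0.9):
    """Gradually reduce every frequency sub-field by a constant factor."""
    for neighbours in relationships.values():
        for word in neighbours:
            neighbours[word] *= factor

def prune(relationships, minimum=1.0):
    """Discard neighbour fields below the minimum frequency, and drop
    any word that is left with no fields at all."""
    for word in list(relationships):
        kept = {w: f for w, f in relationships[word].items() if f >= minimum}
        if kept:
            relationships[word] = kept
        else:
            del relationships[word]

db = {"get": {"fresh": 10.0, "xyzzy": 1.0}, "rare": {"word": 1.0}}
decay(db, factor=0.5)
prune(db, minimum=1.0)
print(db)
```

Only associations that are incremented often enough to outpace the decay survive repeated cycles of this kind.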
The method as claimed in any of the preceding claims, wherein the ambidextrous signature of the two-word group is matched with ambidextrous signatures in the signature database by calculating a matching score that looks for matches between the fields of the signatures and applies decreasing weighting factors to fields that are associated with lower frequency sub-fields.
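One possible form of such a matching score is sketched below in Python, by way of example only. The sketch assumes that each signature is a flat ranked list of words ordered from most to least frequent, and it uses a reciprocal-rank weight of 1/(rank + 1) as the decreasing weighting factor; both choices, and the names match_score and closest_matches, are assumptions of this sketch rather than details given in the specification.

```python
def match_score(sig_a, sig_b):
    """Sum, over words present in both signatures, of the product of their
    rank-based weights, so that lower-ranked fields contribute less."""
    weights_b = {word: 1.0 / (rank + 1) for rank, word in enumerate(sig_b)}
    return sum(
        weights_b.get(word, 0.0) / (rank + 1)
        for rank, word in enumerate(sig_a)
    )

def closest_matches(query_sig, signature_db, top=3):
    """Return up to `top` words whose signatures score highest against the query."""
    scored = [(match_score(query_sig, sig), word) for word, sig in signature_db.items()]
    scored = [(score, word) for score, word in scored if score > 0]
    scored.sort(key=lambda sw: (-sw[0], sw[1]))
    return [word for _, word in scored[:top]]

signature_db = {
    "fresh": ["get", "water"],
    "cold": ["water", "get"],
    "car": ["drive"],
}
print(closest_matches(["get", "water"], signature_db))
```

A match in the highest-ranked fields of both signatures therefore counts for more than a match between fields that sit lower in either ranking.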
The method as claimed in any of the preceding claims, wherein the method includes the steps of inputting the multiple alternate search strings into a search engine simultaneously or in rapid succession and comparing the results of each separate search so as to rank the overall results and present those results which were obtained in the greatest number of separate searches as the most relevant search results.
The method as claimed in any of the preceding claims, wherein the language of the text is identified so that separate word relationship databases and signature databases can be built for each separate language.
A system for processing an input search string and building multiple alternative search strings, comprising:
a processor in the form of a server which is able to access a multitude of web pages or other documents through the Internet and extract text, the text including words;
a word relationship database coupled to the processor and having each unique word identified in the text stored thereon, each unique word in the word relationship database being associated with a number of fields which each represent other words which were found to occur adjacent to that word in the text, each field also including a frequency sub-field which indicates how frequently that other word was found to occur adjacent to the associated word;
a signature database coupled to the server, the signature database being formed by the server processing the word relationship database so as to determine a forward signature and reverse signature for each word, the forward signature including a ranked list of the words that were found to come after that word in the text, and the reverse signature including a ranked list of the words that were found to come before that word in the text, and combining the forward and reverse signatures of each word to form an ambidextrous signature for each word that is stored in the signature database;
and computer software stored on the processor and configured to enable the processor to carry out the following:
optionally removing popular words from the input search string, being those words with a total frequency higher than a predetermined threshold;
linking the remaining words of the input search string into two-word groups from left to right with the second word of any preceding two-word group forming the first word of the next two-word group;
for each two-word group, carrying out the following:
querying the word relationship database to determine the reverse signature of the first word and the forward signature of the second word;
combining the forward and the reverse signatures obtained in the previous step into an ambidextrous signature representative of that two-word group;
comparing the ambidextrous signature of the previous step with the signature database to find the closest matches;
identifying the words in the signature database with the closest signature matches as alternative words that are semantically similar to the two-word group;
substituting one or more of the two-word groups in the input search string with the identified alternative words;
analyzing whether each substituted word fits grammatically into the input search string by querying the word relationship database to see whether the word preceding and the word following the substituted word in the input search string are words that are associated with the substituted word to a predefined extent; and
if the substituted word or words do fit grammatically, identifying the string with the substituted words as an alternative search string that is both semantically similar to the original search string and grammatically correct.
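The grammatical-fit analysis recited in the final steps above might, for illustration only, be sketched in Python as follows. The function name fits_grammatically, the nested-dictionary stand-in for the word relationship database, and the use of a minimum adjacency count as the "predefined extent" are all assumptions of this sketch.

```python
def fits_grammatically(words, index, after, min_count=1):
    """Check whether words[index] is sufficiently associated with its neighbours.

    after: mapping word -> {following word: frequency}, a hypothetical
    stand-in for the word relationship database.
    """
    word = words[index]
    # The preceding word must have been seen before the substituted word
    if index > 0 and after.get(words[index - 1], {}).get(word, 0) < min_count:
        return False
    # The following word must have been seen after the substituted word
    if index + 1 < len(words) and after.get(word, {}).get(words[index + 1], 0) < min_count:
        return False
    return True

after = {"get": {"fresh": 3}, "fresh": {"water": 5}}
candidate = ["where", "can", "i", "get", "fresh", "water"]
print(fits_grammatically(candidate, 4, after))
```

With these toy counts the substituted word "fresh" passes because "get" has been seen before it and "water" after it, whereas a word with no recorded association to either neighbour would be rejected.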
The system as claimed in claim 11, wherein the server is configured to input the multiple alternative search strings into a search engine simultaneously or in rapid succession, and to compare the results of each separate search so as to rank the overall results and present those results which are obtained in the greatest number of separate searches as the most relevant search results.
PCT/IB2012/051870 2011-04-19 2012-04-16 A computerized system and a method for processing and building search strings WO2012143839A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201161476917P 2011-04-19 2011-04-19
US61/476,917 2011-04-19

Publications (1)

Publication Number Publication Date
WO2012143839A1 (en) 2012-10-26

Family

ID=47041106

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2012/051870 WO2012143839A1 (en) 2011-04-19 2012-04-16 A computerized system and a method for processing and building search strings

Country Status (1)

Country Link
WO (1) WO2012143839A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
US20030171910A1 (en) * 2001-03-16 2003-09-11 Eli Abir Word association method and apparatus
US6850937B1 (en) * 1999-08-25 2005-02-01 Hitachi, Ltd. Word importance calculation method, document retrieving interface, word dictionary making method
US20060253427A1 (en) * 2005-05-04 2006-11-09 Jun Wu Suggesting and refining user input based on original user input

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853742A (en) * 2012-11-29 2014-06-11 北大方正集团有限公司 Retrieval device, terminal and retrieval method
CN110678860A (en) * 2017-03-13 2020-01-10 里德爱思唯尔股份有限公司雷克萨斯尼克萨斯分公司 System and method for word-by-word text mining
CN110678860B (en) * 2017-03-13 2023-06-09 里德爱思唯尔股份有限公司雷克萨斯尼克萨斯分公司 System and method for word-by-word text mining
US11475015B2 (en) 2020-11-20 2022-10-18 Coupang Corp. Systems and method for generating search terms
CN114897576A (en) * 2022-05-05 2022-08-12 深圳市极客智能科技有限公司 Commodity pushing method based on data analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12774098

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12774098

Country of ref document: EP

Kind code of ref document: A1