WO2001090930A1 - Indexing and searching ideographic characters on a networked system of computers - Google Patents

Indexing and searching ideographic characters on a networked system of computers

Publication number
WO2001090930A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
characters
stored information
string
occurrence
Prior art date
Application number
PCT/AU2001/000612
Other languages
French (fr)
Inventor
Phillip Andre Bertolus
James Michael Jelbart
Grant Timothy Lewis
Original Assignee
Web Wombat Pty Ltd
Priority date
Filing date
Publication date
Application filed by Web Wombat Pty Ltd
Priority to AU2001259949A (patent AU2001259949B2)
Priority to CA002409199A (patent CA2409199A1)
Priority to AU5994901A (patent AU5994901A)
Publication of WO2001090930A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G06F40/129 Handling non-Latin characters, e.g. kana-to-kanji conversion

Abstract

The system and method allow the retrieval, indexing and searching of information stored on computers connected by a communications network, where that information comprises ideographic, logographic or pictographic characters, which are encoded using two bytes per character. The binary value which encodes a particular character contained in a given document is converted into hexadecimal text format, which is then prefixed with a predetermined marker character to indicate that it is the hexadecimal value of a double-byte character. That value is then added to a sequential string of such values for each of such characters in that document. The marker characters are then removed from this string, leaving a series of alphanumeric characters separated at set intervals by blank spaces. Each set of characters demarcated by a blank space is then indexed as if it were a standard word such as an English word, albeit a meaningless one. A unique index entry is created for each such word and phrase (up to a predetermined combination of such words) which the search engine encounters, and incorporates positional data which points to the location on a networked system of computers of each occurrence of that particular word or phrase which the search engine has encountered. Search queries are then met by retrieving the positional data associated with each character or sequence of characters contained in the search query to determine whether any occurrence of those characters which has been encountered by the search engine meets the criteria of the user.

Description

INDEXING AND SEARCHING IDEOGRAPHIC CHARACTERS ON A NETWORKED SYSTEM OF COMPUTERS
A. Field of the Invention
The present invention relates generally to computer systems and methods for retrieving and indexing information on a network and, more particularly, to systems and methods used for retrieving and indexing information represented by ideographic characters on a networked system of computers such as the Internet.
B. Description of the Related Art
Since its inception over 30 years ago in the United States, the Internet has remained predominantly Western in its content, with more than 80% of top-level Internet hosts and roughly 80% of Internet traffic using the English language. A survey of several hundred million web pages which was conducted in 1999 found that around 72% were in English, followed by Japanese with 7%, German with 5%, then French, Chinese and Spanish, each with between 1 and 2% (Geoffrey Nunberg, 'Will the Internet always Speak English,' The American Prospect vol. 11 no. 10, March 27-April 10, 2000). However, it is estimated that by 2003, non-English-speaking Internet users will exceed English-speaking users. In line with this projected growth, the amount of information on the Internet which is expressed using major Asian languages, such as Chinese, Japanese and Korean, is expanding rapidly.
There are, however, inherent obstacles against the use of Asian languages on the Internet and computers in general, particularly those Asian languages such as Chinese, Japanese and Korean, which use characters to represent information as opposed to the Romanised script used by Western languages such as English.
The characters used in the Chinese, Japanese and Korean languages are described variously as ideographic, logographic, and pictographic, with each term having a slightly different linguistic connotation. For ease of reference, 'ideographic' will be used throughout this application as an umbrella term which encompasses ideographic, logographic and pictographic characters.
The problem with using ideographic characters on computers lies in the way computer systems interpret, manipulate and display language which is comprehensible to users. Each discrete character used by a particular language is assigned a unique numerical character code, and it is that character code which the computer stores in binary form. To display the character in a form comprehensible to users, the computer then consults a table and finds the graphical representation (called a glyph) which corresponds with that particular character code, and it is that glyph which is displayed to the user, on a computer monitor for example.
This process of assigning a unique value to each character is easy with English, for example, as there are only 52 upper-case and lower-case letters comprising the English alphabet. In the case of Chinese and Japanese, however, there are over 40,000 possible characters. The character set for Chinese is therefore several orders of magnitude greater than the English character set, and accordingly a larger range of values is required to provide a unique numerical representation for each character. At present, this representation is made more difficult by the existence of a plurality of different systems for encoding those characters into numerical values.
Most English characters are encoded using a system called ASCII (American Standard Code for Information Interchange), which uses 7 binary digits ('bits') to represent each character. Each bit can take the value 1 or 0, so a 7-bit number can have 128 (2^7) possible values. As there are only 52 upper and lower-case letters in the English language this is more than sufficient for encoding English text. For languages such as Chinese, however, 16 bits are required to provide an adequate code space to represent each character with a unique value. The use of 16 bits allows 65,536 (2^16) possible values, and characters encoded in this way are often referred to as double-byte characters, because 8 bits equal one byte. Commonly used double-byte encoding systems for ideographic characters include GB 2312-80 (Chinese), Big 5 (Chinese), EUC (Japanese) and Shift-JIS (Japanese).
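The code-space arithmetic above can be checked with a short Python sketch. It is illustrative only: the sample character and the helper name encoded_length are assumptions introduced here, and the codec names are those of Python's standard library rather than anything specified in this application.

```python
# Code-space sizes quoted in the text.
ascii_values = 2 ** 7          # 7-bit ASCII
double_byte_values = 2 ** 16   # 16-bit (double-byte) code space


def encoded_length(text: str, codec: str) -> int:
    """Return how many bytes `text` occupies in the given encoding."""
    return len(text.encode(codec))


# U+4E2D is a common Chinese character, chosen here only as an example.
sample = "\u4e2d"

print(ascii_values, double_byte_values)
print(encoded_length("A", "ascii"))
print(encoded_length(sample, "big5"), encoded_length(sample, "gb2312"))
```

A Latin letter occupies one byte in ASCII, while the sample character occupies two bytes in both Big 5 and GB 2312-80.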
An additional problem with the languages that use ideographic characters is the difficulty of segmenting text comprised of ideographic characters into meaningful units such as words and phrases. In conducting a typical search, a user wants to find documents which contain particular words or phrases. In most languages, discrete words are made identifiable through the use of separator characters such as a comma, full stop or space between groupings of characters. In languages such as Japanese and Chinese, however, such separator characters are generally not used.
The grammatical structure of the Chinese language, in particular, relies heavily on context for determining the meaning of individual characters. Native speakers use their knowledge of word meaning and context to figure out where the word boundaries are. Any given Chinese character is a meaningful unit in itself, but when used in a particular context or in combination with other characters it can assume a totally different meaning. In a string of Chinese characters, it is therefore often difficult to tell whether a character is being used in conjunction with adjacent characters to form a longer 'word' or whether it is being used as a word or grammatical particle in itself. This acquired skill is very difficult for a computer to perform, so rather than attempt a semantic analysis of a given string of text, 'workaround' techniques are needed which approximate the same results but can be performed easily by a computer.
The traditional technique for indexing and searching information represented using ideographic characters is to create separate index entries for each possible meaningful unit. Given the string of three ideographic characters 'abc,' for example, 'a' in itself could be a word, as could be 'b', 'c', 'ab', 'bc' and 'abc'. Traditional indexing methods, such as that described in U.S. Patent No. 6,021,409 entitled 'Method for parsing, indexing and searching world-wide-web pages' to Digital Equipment Corporation, would create separate index entries for each one of these possibilities, which it describes as 'indexable words.' If a user were then to search for the word 'abc', the search engine could then go directly to the index entry for 'abc' to determine where that term had occurred.
When confronted with double-byte character values, traditional search engines either index those characters in their double-byte form (often with special 'escape sequences' of characters to denote that the following indexed value is that of a double-byte character) or translate the character using a dictionary look-up, and index the English translation. These methods are cumbersome and either require that a separate index be created for double-byte characters, or place undue demands on storage space and computational resources, which are magnified as the index database grows larger. These demands then serve as an obstacle against the creation of extensive and up-to-date databases.
Based on the foregoing, there is a need for a system that efficiently collects and indexes stored information represented by ideographic characters on networks such as the Internet, and which is capable of integrating that information into existing indexes where it can be efficiently searched to produce meaningful and relevant results in a timely manner.
SUMMARY OF THE INVENTION
According to the invention in a first aspect, there is provided a method for indexing stored information that partially or wholly consists of encoded ideographic, logographic or pictographic characters, comprising the steps of: creating an index entry for each individual character contained in the stored information using a search engine; adding to the index entry for each individual character a pointer which indicates the location of each occurrence of that character which the search engine has encountered; creating an index entry for each sequential string of characters, up to a predetermined length, contained in the stored information using a search engine; and adding to the index entry for each individual sequence of characters a pointer which indicates the location of each occurrence of that sequence which the search engine has encountered.
The stored information may be a plurality of pages on a networked system of computers, and the characters may be characters used to represent the Chinese language.
Preferably, the characters making up the stored information are encoded using at least two bytes per character (preferably exactly two bytes per character). In a preferred form, the index entry comprises a unique word and positional data indicating the location of each occurrence of that word. Said word may comprise a string of alphanumeric characters, which string may be encoded using the ASCII (American Standard Code for Information Interchange) encoding system, and in a preferred form the string of alphanumeric characters represents the original multiple-byte binary value of an ideographic, logographic or pictographic character expressed in hexadecimal text format.
Preferably, the positional data indicates each respective location which the search engine has encountered where the unique word that is the subject of that index entry occurs on a networked system of computers.
According to the invention in a second aspect, there is provided a method for indexing stored information comprising the steps of: retrieving stored information comprising character information which is encoded using two bytes to represent each character; converting the numerical values used to encode each character into hexadecimal text format to produce a hex value; adding a predetermined marker character to the beginning of each hex value to produce a marked character value; merging the marked character values into a single string of characters in the same sequential order in which they occurred in the stored information; replacing each instance of the marker character with a blank space; and adding each set of characters demarcated by a blank space to an index along with a pointer to the location at which that set of characters occurred.
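The steps of the second aspect can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: it assumes the input consists purely of double-byte characters, uses a tilde as the marker, and represents a pointer as a (location, position) pair. The names MARKER, to_marked_string and index_document are hypothetical.

```python
from collections import defaultdict

MARKER = "~"  # assumed marker; the text only requires a predetermined character


def to_marked_string(data: bytes) -> str:
    """Convert double-byte data to hex text, one marked value per character."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    pairs = (data[i:i + 2] for i in range(0, len(data), 2))
    return "".join(MARKER + pair.hex() for pair in pairs)


def index_document(index, location: str, data: bytes) -> None:
    """Replace markers with spaces, then record each word with its location."""
    spaced = to_marked_string(data).replace(MARKER, " ")
    for position, word in enumerate(spaced.split()):
        index[word].append((location, position))


index = defaultdict(list)
index_document(index, "doc-1", bytes.fromhex("b971b8a3"))
print(to_marked_string(bytes.fromhex("b971b8a3")))
print(dict(index))
```

Real documents in encodings such as Big 5 usually mix single-byte and double-byte characters, so a practical version would need to detect lead bytes before pairing; that detail is left out here.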
According to the invention in a third aspect, there is provided a method for searching and retrieving stored information, comprising the steps of: receiving a search query from a user comprising a number of ideographic, logographic, or pictographic characters encoded using two bytes to represent each character; converting the numerical values used to encode each character in the search query into hexadecimal text format to produce a hex value; adding a predetermined marker character to the beginning of each hex value to produce a marked character value; merging the marked character values into a single string of characters in the same sequential order in which they occurred in the search query; replacing each instance of the marker character with a blank space, and searching the index for each occurrence of the single string of characters.
According to the invention in a fourth aspect, there is provided a method for searching and retrieving stored information, comprising the steps of: receiving a search query from a user comprising a number of ideographic, logographic, or pictographic characters encoded using two bytes to represent each character; converting the numerical values used to encode each character in the search query into hexadecimal text format to produce a hex value; adding a predetermined marker character to the beginning of each hex value to produce a marked character value; merging the marked character values into a single string of characters in the same sequential order in which they occurred in the search query; replacing each instance of the marker character with a blank space, searching the index for each occurrence of each character contained in the search query; examining the positional data describing the location of each occurrence of each individual character contained in the search query; and determining whether the positional data indicates that any of the character occurrences contained in the index match the character comprising the search query.
According to the invention in a fifth aspect, there is provided a method for searching and retrieving stored information, comprising the steps of: receiving a search query from a user comprising more than one ideographic, logographic, or pictographic characters encoded using two bytes per character; converting the numerical values used to encode each character in the search query into hexadecimal text format to produce a hex value; adding a predetermined marker character to the beginning of each hex value to produce a marked character value; merging the marked character values into a single string of characters in the same sequential order in which they occurred in the search query; replacing each instance of the marker character with a blank space; searching the index for each occurrence of the sequence of characters which comprise the sequence of characters contained in the search query. Preferably, the method includes the steps of: searching the index for each occurrence of each individual character contained in the search query; examining the positional data describing the location of each indexed occurrence of each individual character contained in the search query; and determining whether the positional data indicates that any of the character occurrences contained in the index match the character string comprising the search query.
According to the invention in a sixth aspect, there is provided a computer-readable medium containing instructions for performing a method for indexing stored information that partially or wholly consists of encoded ideographic, logographic or pictographic characters, the method comprising the steps of: creating an index entry for each individual character contained in the stored information using a search engine; adding to the index entry for each individual character a pointer which indicates the location of each occurrence of that character which the search engine has encountered; creating an index entry for each individual sequence of characters, up to a predetermined length, contained in the stored information using a search engine; and adding to the index entry for each individual sequence of characters a pointer which indicates the location of each occurrence of that sequence which the search engine has encountered.
According to the invention in a seventh aspect, there is provided a system for indexing stored information that partially or wholly consists of encoded ideographic, logographic or pictographic characters, comprising: means for, including a search engine, creating an index entry for each individual character contained in the stored information; means for adding to the index entry for each individual character a pointer which indicates the location of each occurrence of that character which the search engine has encountered; means for, including a search engine, creating an index entry for each individual sequence of characters, up to a predetermined length, contained in the stored information using a search engine; and means for adding to the index entry for each individual sequence of characters a pointer which indicates the location of each occurrence of that sequence which the search engine has encountered.
According to the invention in an eighth aspect, there is provided a system for indexing stored information, comprising: a spider for retrieving a document containing a string of characters and for converting the numerical values used to encode each character contained in the document into hexadecimal text format to produce a hex value, and for adding a predetermined marker character to the beginning of each hex value to produce a marked character value, wherein the spider is also used for merging the marked character values into a single string; a storage device for storing the single string; an indexer for replacing each instance of the marker character with a blank space and for adding each word separated by the blank space to an index database, and for adding positional data specifying the location of each word in the document.
In broad terms, then, the present invention provides a method and system for retrieving, indexing and searching information, which is represented by ideographic characters, and which is stored on a network of computers such as the Internet. Regardless of the form of encoding used to encode ideographic characters, the ideographic characters are converted into a format which allows double-byte characters to be indexed in the same fashion as are words in languages such as English, allowing double-byte characters to be incorporated into existing search engine databases. Preferably, the double-byte characters are converted into hexadecimal values consisting of multiple ASCII characters prefaced by a marker character, such as a tilde. Subsequently, the marker character is replaced by a space producing a string of 'words' separated by spaces. These strings of words (in the so-called 'GBY' format), although unintelligible in themselves, can then be readily indexed in the same fashion as are English or other western words and phrases. The marker character is used during the results display following a search query, the marked sequences being converted back into their original ideographic double-byte characters and hence appearing in their correct relative positions on the user's computer screen.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an implementation of the invention and, together with the description, serve to explain the advantages and principles of the invention. In the drawings:
Figure 1 is an illustration of a computer network for practicing methods and systems consistent with the present invention;
Figure 2 is a diagram illustrating the conversion of the double-byte value of each ideographic character into the form in which it is indexed consistent with the present invention;
Figure 3 is a diagram depicting a process for retrieving character information from the Internet, storing it in an index, and then searching in response to user queries consistent with the present invention;
Figure 4 is a diagram illustrating a method for generating word lists for languages that use ideographic characters according to the present invention;
Figure 5 is a diagram illustrating a further example of a method for generating word lists and indexing ideographic character information consistent with the present invention; and
Figure 6 illustrates one embodiment of a process for handling a search query consistent with the present invention.
DETAILED DESCRIPTION
The following detailed description of the invention refers to the accompanying drawings. Although the description includes exemplary implementations, other implementations are possible, and changes may be made to the implementations described without departing from the spirit and scope of the invention. It is therefore to be understood that the following detailed description does not limit the invention in any way. Wherever possible, the same reference numbers will be used throughout the drawings and the following description to refer to the same or like parts. As described in more detail below, the present invention employs a method of conversion of the numerical values which are used to encode ideographic characters into a format which allows the value for each character to be indexed as if it were a normal word comprising simple ASCII (American Standard Code for Information Interchange) characters.
As further detailed below, in one form of the present invention, the system only indexes each individual character, rather than indexing each combination of the individual ideographic characters which could form meaningful units in a given string of characters. In response to a search query, the present invention then uses the positional data stored for each individual character comprising that search query to determine whether the required combination of those characters occurs in the documents that have been indexed. Because significantly fewer index entries are required, the storage requirements of the search engine are therefore reduced. Additionally, the demands on the computational resources of the search engine are lowered, because there are fewer index entries for it to search. Despite having fewer index entries, however, the present invention still allows the search engine to cover the full range of combinations of characters through use of the positional data for each individual character.
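One way to picture a search that relies only on per-character positional data is the Python sketch below. It is an assumption-laden illustration: the index layout (token, then document, then list of positions), the single letters standing in for ideographic characters, and the function names build_char_index and find_sequence are all introduced here and are not taken from this application.

```python
from collections import defaultdict


def build_char_index(docs):
    """Map each token to {doc_id: [positions]}; one entry per individual character."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok][doc_id].append(pos)
    return index


def find_sequence(index, query):
    """Return (doc_id, start) pairs where the query tokens occur consecutively."""
    if not query or any(tok not in index for tok in query):
        return []
    hits = []
    for doc_id, starts in index[query[0]].items():
        for start in starts:
            # Token k of the query must sit exactly k positions after the start.
            if all(start + k in index[tok].get(doc_id, ())
                   for k, tok in enumerate(query)):
                hits.append((doc_id, start))
    return hits


docs = {"d1": ["a", "b", "c", "d"], "d2": ["c", "a", "b"]}
index = build_char_index(docs)
print(len(index))
print(find_sequence(index, ["b", "c"]))
print(find_sequence(index, ["a", "b"]))
```

The index holds only four entries, yet multi-character queries can still be answered by checking that the stored positions are adjacent.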
One embodiment of the present invention is a method for processing information which is retrieved from computers connected to a communications network. The basic arrangement of a computer network for practicing methods and systems consistent with the present invention is shown in Figure 1.
The computer network includes a central computer 10, a remote computer 20, and a plurality of pages of information 50 and 60 distributively stored on one or more computer systems 30 and 40 which are to be searched. All of the computers in Figure 1 are connected, either directly or indirectly, via a communications network 70. One skilled in the art will appreciate that even though Figure 1 for sake of convenience depicts only two computer systems 30 and 40 with stored information as part of the computer network, millions of computers may be part of the computer network. In one embodiment, the communications network 70 is the Internet, a Transmission Control Protocol / Internet Protocol ('TCP/IP') based network, and the computers are connected to communication network 70 using technology in common use. In other embodiments of the present invention, communications network 70 is any device or a combination of devices that allows computers 10, 30, 40 to communicate with each other. For example, communications network 70 can be a local area network, an Intranet, dedicated point-to-point communication lines, or a wireless transmission network. Furthermore, communications network 70 might take a different form for different pairs of computers. For example, central computer 10 might communicate to a computer system 30 via the Internet, and computer system 30 might communicate to remote computer 20 via a local area network.
Figure 2 illustrates the conversion of a numerical value used to encode a character, into a format which allows it to be indexed as if it were a normal 'word' comprising simple ASCII characters.
Document 200 contains Chinese characters 210 and 220. In this example characters 210 and 220 are represented using traditional Chinese script and are encoded in the encoding standard known as 'Big 5', which encodes each Chinese character as a double-byte binary representation. The binary double-byte representations of Chinese characters 210 and 220 are 230 and 240 respectively. The double-byte values 230 and 240 are each then converted into hexadecimal format consisting of four ASCII characters. The hexadecimal values for characters 210 and 220 are each then prefaced by a marker character, in this case a tilde (~), to produce marked values 250 and 260. The marker character preceding each converted value indicates to the indexing program that the following four ASCII characters represent one ideographic character expressed in hexadecimal format. One skilled in the art will appreciate that even though a tilde is used as the marker character, other symbols or means may be used as the marker character.
The marked values 250 and 260 are then combined to form a single string 270 of ASCII characters. The marker characters are then removed from string 270 to produce string 280, in which the groups of ASCII characters 'b971' and 'b8a3' are separated by blank spaces. String 280 can now be treated as if it were a string consisting of two normal English 'words,' albeit meaningless ones, with each word demarcated by a blank space. At this stage, string 280 is designated as being in so-called 'Gobbledegook' (GBY) format. GBY format allows values that represent ideographic characters to be indexed and searched as if they were conventional words, such as English words, segmented by spaces.
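The Figure 2 example can be traced in a few lines of Python, including the reverse step from GBY words back to double-byte values that a results display would need. This is a hedged sketch: the function name gby_to_bytes is hypothetical, and the tilde marker follows the example in the text.

```python
def gby_to_bytes(gby: str) -> bytes:
    """Turn space-separated four-digit hex words back into double-byte values."""
    return b"".join(bytes.fromhex(word) for word in gby.split())


marked = "~b971~b8a3"            # marked values merged into one string
gby = marked.replace("~", " ")   # markers become blank spaces
raw = gby_to_bytes(gby)          # original double-byte values recovered

print(gby.split())
print(raw.hex())
```

The recovered bytes could then be passed to a decoder for the document's encoding (Big 5 in the Figure 2 example) in order to display the original characters.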
The GBY string 280 is then indexed in a conventional manner, with each discrete 'word' and each sequence or 'phrase' of words, up to a predetermined length, having its own entry in index 290. In the example of Figure 2, each time the search engine encounters another occurrence of character 210 it will add a pointer to the location of that occurrence to the index entry for the GBY format 'word' which corresponds to character 210. Similarly, each time the search engine encounters another occurrence of character combination 210; 220 it will add a pointer to the location of that occurrence to the index entry for the GBY format 'word' combination which corresponds to characters 210; 220.
Figure 3 represents diagrammatically a process by which information represented using ideographic characters is retrieved from the Internet, stored in an index, then searched in response to user queries.
Spider 320 retrieves a page of information 300 from a location on Internet 310 specified by a particular URL (Uniform Resource Locator). In this example page 300 contains Chinese text encoded using a double-byte encoding system, such as GB 2312-80 or Big 5. Data 315 retrieved by spider 320 is converted by the spider 320 into hexadecimal format and the hexadecimal values for each character are prefaced by a tilde and then merged into a single string of ASCII characters 325. String 325 is then stored in storage 330 before being sent to indexer 340. Indexer 340 removes all the tildes from string 325, to produce string 345. Each GBY format 'word' is then added to the index database 350, along with the positional data specifying where that 'word' (with its underlying character) occurred. Each sequential string of GBY format 'words', up to a predetermined string length, is also added to the index database 350, along with the positional data specifying where that string or 'phrase' occurred.
The index database 350 is then accessed by search engine 370 in response to search queries 365 from the user 360. Search engine 370 retrieves and then ranks each occurrence 380 of the term(s) comprising the search query 365, and then sends the search results 385 to user 360.
In Figure 4, the method by which index entries are generated from strings of ideographic characters is illustrated in further detail. Ideographic characters are indexed by creating a separate index entry not only for each individual character, but also for each possible combination of characters in a given string, up to a predetermined maximum string length. For example, document 400 contains a string of four characters 'a', 'b', 'c' and 'd.' From that string of four characters, this indexing method would produce index entries 450 comprising word list 460 which contains the ten discrete combinations: 'a', 'b', 'c', 'd', 'ab', 'bc', 'cd', 'abc', 'bcd', and 'abcd'; and positional data 470 which points to each occurrence of each of the entries in word list 460. Because of the semantic structure of languages which use ideographic characters, it is possible that each of the ten combinations of characters identified in this example represent discrete meaningful units, so each combination is indexed as a separate entry. For example, if the string 'abcd' was a sentence of four Chinese characters, it is possible that 'a' could be the subject, 'b' could be the verb, and 'cd' could be the object. Alternatively, 'a' could be the verb, 'bc' could be the name of a town, and 'd' could be a grammatical particle signifying that the action indicated by the verb has been completed, or converting the phrase into interrogative form. It is therefore difficult to segment strings of ideographic characters to determine which components of that string constitute meaningful units in any given context. This indexing method addresses this difficulty by indexing all possible combinations of characters which could constitute meaningful units in a given string, creating an exhaustive word list as part of the index.
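The exhaustive word list of Figure 4 can be reproduced with a short Python sketch. It reads 'combination' as a contiguous run of characters, which matches the ten entries listed in the text; the function name word_list and the max_len parameter are assumptions introduced for illustration.

```python
def word_list(chars, max_len):
    """List every contiguous run of 1..max_len characters, shortest first."""
    n = len(chars)
    return ["".join(chars[i:i + k])
            for k in range(1, max_len + 1)
            for i in range(n - k + 1)]


entries = word_list(list("abcd"), max_len=4)
print(entries)
print(len(entries))
```

For a string of n characters with no length limit, this produces n(n+1)/2 entries, which is ten when n is four. Lowering max_len shortens the list; for instance, a limit of two leaves seven entries for the same string.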
If a user then searches for the word 'bcd', for example, the search engine would simply go directly to the entry for 'bcd' in its word list, look at the positional data which points to instances where the search engine has come across 'bcd', rank those results, and return the search result to the user. In one form of the system, a dictionary of known meaningful terms is used to filter out meaningless terms and phrases from this exhaustive word list, thus helping to reduce its size and therefore also reducing search times. This filtering may be done by way of a statistical process, deleting those terms and phrases which do not appear to correlate with combinations of characters otherwise encountered by the indexer.
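A minimal sketch of the direct lookup and of the optional dictionary filter is given below. The index contents, the set `known_terms` and both function names are hypothetical; the statistical variant of the filter is not shown.

```python
# Hypothetical exhaustive index for the string 'abcd' in document 1,
# with positional data stored as (document ID, start position).
exhaustive_index = {
    "a": [(1, 1)], "b": [(1, 2)], "c": [(1, 3)], "d": [(1, 4)],
    "ab": [(1, 1)], "bc": [(1, 2)], "cd": [(1, 3)],
    "abc": [(1, 1)], "bcd": [(1, 2)], "abcd": [(1, 1)],
}

def lookup(index, query):
    """Go directly to the entry for the query; no segmentation is needed."""
    return index.get(query, [])

def filter_by_dictionary(index, known_terms):
    """Keep only the entries that appear in a dictionary of meaningful terms."""
    return {term: postings for term, postings in index.items()
            if term in known_terms}

# Assumed dictionary: the single characters plus two multi-character terms.
known_terms = {"a", "b", "c", "d", "bc", "bcd"}
filtered_index = filter_by_dictionary(exhaustive_index, known_terms)
```

In this sketch the filtered index still answers a query for 'bcd', while entries such as 'abc' that are absent from the assumed dictionary are no longer stored.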
This method allows for relatively fast searches, yet for any given string of characters the number of possible combinations of those characters is relatively large, and indexing each combination in addition to indexing the individual characters can place heavy demands on storage space and computational resources. The aggregation of the (ostensibly meaningless) GBY words into (ostensibly meaningless) phrases, as described above, can also be applied to English and other western language words and terms, along with the optional filtering steps if desired, as the phrase aggregation is not dependent on the underlying double-byte engine.
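The storage demand can be put in rough numbers. Assuming a string of n characters and a maximum indexed length of m characters (the symbols n, m and N are introduced here for illustration only), the number of indexed occurrences is:

```latex
N(n, m) = \sum_{k=1}^{\min(m,\,n)} (n - k + 1),
\qquad
N(n, m) = \frac{n(n+1)}{2} \quad \text{when } m \ge n .
```

For the four-character example, N(4, 4) = 4 + 3 + 2 + 1 = 10, which agrees with the ten combinations of Figure 4, whereas indexing the individual characters alone yields only n = 4 occurrences.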
Figure 5 illustrates one form of the manner in which the method of the present invention may generate index entries from information represented using ideographic characters.
Instead of indexing each possible combination of the characters contained in a given string, the system may simply index the individual characters, along with the positional data for each character.
For example, document 500 contains a string of four characters 'a', 'b', 'c' and 'd'. From that string of four characters, the present invention would produce index entries 550 comprising a word list 560 of the four characters 'a', 'b', 'c', 'd', and positional data 570 which points to each occurrence of each of the entries in the word list 560.
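The per-character indexing of Figure 5 might be sketched as follows. The function name and the use of (document ID, position) pairs with 1-based positions are assumptions for this sketch; the pair format mirrors the positional data described for Figure 6.

```python
from collections import defaultdict

def build_character_index(documents):
    """Index each individual character with the position of every occurrence."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, char in enumerate(text, start=1):
            # Positional data is stored as (document ID, 1-based position).
            index[char].append((doc_id, position))
    return dict(index)

# The four-character string of the example, held in a hypothetical document 1.
index = build_character_index({1: "abcd"})
```

Here the word list contains only four entries, and a search for 'c' reduces to reading the single entry `index["c"]`.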
If a user were then to search for the character 'c', it would be a trivial matter of scanning the index for character 'c' and, in this case, returning a hit for document 1. The closer to the beginning of the page that 'c' appears, the higher the ranking that page will receive.
If a user were to search for a word comprised of more than one character, such as 'bcd', then the search engine would scan the index for each of the characters 'b', 'c' and 'd', and then examine the positional data associated with each character to determine whether they occurred in proximity to one another. This process is set out in more detail in Figure 6, which illustrates the manner in which a search query is processed in this particular form of the present invention. A user enters search query 600 comprising one or more ideographic characters. In the example shown in Figure 6, search query 600 consists of the word or phrase 'bcd', where 'b', 'c' and 'd' are ideographic characters. The invention searches the index 610 for each of the individual characters which comprise search query 600 and retrieves the positional data 620 which points to each occurrence of those characters which the search engine has indexed. As an example of this approach, the positional data 620 may take the form: (document ID, word position). For example, (1,3) would indicate the third word in document number 1. Many alternative methods for recording positional data may of course be used.
Having retrieved the positional data 620 for each character in search query 600, the system then looks to see if there are any instances where the three characters comprising search query 600 are adjacent to each other on the same document. In the example, documents 1, 4 and 7 each contain all of the characters 'b', 'c' and 'd' which comprise search query 600. In document 4, however, the characters are adjacent to one another (in positions 13, 14 and 15) and are in the same order as specified in search query 600; this constitutes a perfect match for search query 600 and in this case will be returned to the user as the highest ranking search result 630.
If there is more than one result where the characters comprising search query 600 are adjacent to one another, the highest ranking will be assigned to those results in which 'b', 'c' and 'd' are closest to the beginning of the page. For example, if characters 'b', 'c' and 'd' also occurred in positions (10,1) (10,2) and (10,3) respectively, then that result would rank higher than (4,13) (4,14) and (4,15), as it occurs closer to the beginning of the page.
If there are several search results 630 where the characters comprising search query 600 are in the same position on the page, for example (10,1) (10,2) (10,3) and (11,1) (11,2) (11,3), then the number of additional occurrences of those characters on the page is also taken into account to differentiate between the results for ranking purposes. For example, if 'bcd' subsequently occurred six times on page 10, but only once on page 11, then page 10 would be ranked higher than page 11, notwithstanding that 'bcd' is the first term to occur on both pages. The greater the number of occurrences of the characters comprising search query 600 on a page, the more likely it is that that page will be of relevance to the user. One skilled in the art will appreciate that other strategies, such as statistical techniques, may be used to determine relevance to the user, and therefore the ranking of returned results.
The foregoing description of an implementation of the invention has been presented for purposes of illustration and description only. It is not exhaustive and does not limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. For example, one embodiment described includes a method for indexing and searching Chinese characters. However, other embodiments include other languages which use ideographic, pictographic or logographic characters to represent information, such as Japanese, Korean and Vietnamese. The examples given above refer to the indexing and searching of ideographic characters encoded using the Big 5 double-byte encoding system for traditional Chinese characters. Ideographic characters may of course be encoded by means of alternative encoding systems, such as 'GB 2312-80', 'Unicode' and 'HZ'.
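The adjacency test on the positional data might be sketched as follows. The function name is an assumption, and the positions given for documents 1 and 7 are invented for illustration, since the specification states positions only for document 4.

```python
def find_adjacent_matches(index, query):
    """Return (document ID, start position) wherever the query characters
    occur consecutively and in the order given in the query."""
    if not query:
        return []
    # One set of (document ID, position) pairs per query character.
    postings = [set(index.get(char, [])) for char in query]
    matches = []
    for doc_id, start in sorted(postings[0]):
        if all((doc_id, start + offset) in postings[offset]
               for offset in range(1, len(query))):
            matches.append((doc_id, start))
    return matches

# Hypothetical positional data: documents 1, 4 and 7 each contain 'b', 'c'
# and 'd', but only document 4 has them adjacent and in order (13, 14, 15).
index = {
    "b": [(1, 2), (4, 13), (7, 30)],
    "c": [(1, 9), (4, 14), (7, 5)],
    "d": [(1, 20), (4, 15), (7, 6)],
}
matches = find_adjacent_matches(index, "bcd")
```

With this data only document 4 is reported, starting at position 13, which corresponds to the perfect match described above.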
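One way to sketch the two ranking criteria described above (earliest position first, then number of occurrences) is shown below. The function name and the sample match positions are assumptions; the additional positions on pages 10 and 11 are invented so that the term occurs six further times on page 10 and once more on page 11.

```python
from collections import defaultdict

def rank_documents(matches):
    """Order document IDs by earliest match position, breaking ties in
    favour of the document with more occurrences of the query."""
    positions_by_doc = defaultdict(list)
    for doc_id, start in matches:
        positions_by_doc[doc_id].append(start)
    # Sort key: earliest position ascending, then occurrence count descending.
    return sorted(
        positions_by_doc,
        key=lambda doc_id: (min(positions_by_doc[doc_id]),
                            -len(positions_by_doc[doc_id])),
    )

# (document ID, start position) of each adjacent match for 'bcd'.
matches = [
    (4, 13),
    (10, 1), (10, 20), (10, 35), (10, 50), (10, 65), (10, 80), (10, 95),
    (11, 1), (11, 30),
]
ranking = rank_documents(matches)
```

Pages 10 and 11 both begin with the term and so precede document 4; page 10 is placed ahead of page 11 because it contains more occurrences. Other relevance strategies would replace the sort key.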
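The practical effect of the choice of encoding can be sketched as follows: the same characters yield different two-byte values under different encoding systems, so the stored information and the search query would need to be expressed in the same encoding before their values are compared. The helper name `hex_values` and the sample string are assumptions for this sketch, UTF-16 (big-endian) is used here as one two-byte form of Unicode, and HZ is omitted.

```python
def hex_values(text, encoding):
    """Return the upper-case hexadecimal text of each character's encoded bytes."""
    return [ch.encode(encoding).hex().upper() for ch in text]

sample = "中文"  # two Chinese characters used purely as sample input

big5_values = hex_values(sample, "big5")          # traditional Chinese, double-byte
gb_values = hex_values(sample, "gb2312")          # GB 2312-80
unicode_values = hex_values(sample, "utf-16-be")  # Unicode, two bytes per character here
```

An index built from Big 5 values would therefore not match a query converted from GB 2312-80 values unless one of them is first transcoded.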

Claims

1. A method for indexing stored information that partially or wholly consists of encoded ideographic, logographic or pictographic characters, comprising the steps of: creating an index entry for each individual character contained in the stored information using a search engine; adding to the index entry for each individual character a pointer which indicates the location of each occurrence of that character which the search engine has encountered; creating an index entry for each sequential string of characters, up to a predetermined length, contained in the stored information using a search engine; and adding to the index entry for each individual sequence of characters a pointer which indicates the location of each occurrence of that sequence which the search engine has encountered.
2. The method of claim 1, wherein the stored information is a plurality of pages on a networked system of computers.
3. The method of claim 1 or claim 2, wherein the characters are characters used to represent the Chinese language.
4. The method of any preceding claim, wherein the characters are encoded using at least two bytes per character.
5. The method of any preceding claim, wherein the index entry comprises a unique word and positional data indicating the location of each occurrence of that word.
6. The method of claim 5, wherein the word comprises a string of alphanumeric characters.
7. The method of claim 6, wherein the string of alphanumeric characters is encoded using the ASCII (American Standard Code for Information Interchange) encoding system.
8. The method of claim 6 or claim 7, wherein the positional data indicates each respective location which the search engine has encountered where the unique word that is the subject of that index entry occurs on a networked system of computers.
9. The method of claim 6 or claim 7, wherein the string of alphanumeric characters represents the original multiple-byte binary value of an ideographic, logographic or pictographic character expressed in hexadecimal text format.
10. A method for indexing stored information comprising the steps of: retrieving stored information comprising character information which is encoded using two bytes to represent each character; converting the numerical values used to encode each character into hexadecimal text format to produce a hex value; adding a predetermined marker character to the beginning of each hex value to produce a marked character value; merging the marked character values into a single string of characters in the same sequential order in which they occurred in the stored information; replacing each instance of the marker character with a blank space; and adding each set of characters demarcated by a blank space to an index along with a pointer to the location at which that set of characters occurred.
11. The method of claim 10, wherein the stored information is stored as a plurality of pages on a networked system of computers.
12. The method of claim 10 or claim 11, wherein the character information partially or wholly consists of encoded ideographic, logographic or pictographic characters such as Chinese characters.
13. A method for searching and retrieving stored information, comprising the steps of: receiving a search query from a user comprising a number of ideographic, logographic, or pictographic characters encoded using two bytes to represent each character; converting the numerical values used to encode each character in the search query into hexadecimal text format to produce a hex value; adding a predetermined marker character to the beginning of each hex value to produce a marked character value; merging the marked character values into a single string of characters in the same sequential order in which they occurred in the search query; replacing each instance of the marker character with a blank space, and searching the index for each occurrence of the single string of characters.
14. A method for searching and retrieving stored information, comprising the steps of: receiving a search query from a user comprising a number of ideographic, logographic, or pictographic characters encoded using two bytes to represent each character; converting the numerical values used to encode each character in the search query into hexadecimal text format to produce a hex value; adding a predetermined marker character to the beginning of each hex value to produce a marked character value; merging the marked character values into a single string of characters in the same sequential order in which they occurred in the search query; replacing each instance of the marker character with a blank space, searching the index for each occurrence of each character contained in the search query; examining the positional data describing the location of each occurrence of each individual character contained in the search query; and determining whether the positional data indicates that any of the character occurrences contained in the index match the character comprising the search query.
15. The method of claim 13 or claim 14, wherein the stored information is a plurality of pages on a networked system of computers.
16. A method for searching and retrieving stored information, comprising the steps of: receiving a search query from a user comprising more than one ideographic, logographic, or pictographic characters encoded using two bytes per character; converting the numerical values used to encode each character in the search query into hexadecimal text format to produce a hex value; adding a predetermined marker character to the beginning of each hex value to produce a marked character value; merging the marked character values into a single string of characters in the same sequential order in which they occurred in the search query; replacing each instance of the marker character with a blank space; searching the index for each occurrence of the sequence of characters which comprise the sequence of characters contained in the search query.
17. The method of claim 16, including the step of: searching the index for each occurrence of each individual character contained in the search query; examining the positional data describing the location of each indexed occurrence of each individual character contained in the search query; and determining whether the positional data indicates that any of the character occurrences contained in the index match the character string comprising the search query.
18. The method of claim 16 or claim 17, wherein the stored information is a plurality of pages on a networked system of computers.
19. A computer-readable medium containing instructions for performing a method for indexing stored information that partially or wholly consists of encoded ideographic, logographic or pictographic characters, the method comprising the steps of: creating an index entry for each individual character contained in the stored information using a search engine; adding to the index entry for each individual character a pointer which indicates the location of each occurrence of that character which the search engine has encountered; creating an index entry for each individual sequence of characters, up to a predetermined length, contained in the stored information using a search engine; and adding to the index entry for each individual sequence of characters a pointer which indicates the location of each occurrence of that sequence which the search engine has encountered.
20. The computer-readable medium of claim 19, wherein the stored information is stored as a plurality of pages on a networked system of computers.
21. The computer-readable medium of claim 19 or claim 20, wherein the characters are characters used to represent the Chinese language.
22. The computer-readable medium of any one of claims 19 to 21, wherein the characters are encoded using two bytes per character.
23. The computer-readable medium of any one of claims 19 to 22, wherein the stored information is stored in at least one storage device.
24. The computer-readable medium of any one of claims 19 to 23, wherein the index entry comprises a unique word or phrase and positional data indicating the location of each occurrence of that word or phrase.
25. The computer-readable medium of claim 24, wherein the word or phrase comprises a string of alphanumeric characters.
26. The computer-readable medium of claim 25, wherein the string of alphanumeric characters is encoded using the ASCII (American Standard Code for Information Interchange) encoding system.
27. The computer-readable medium of claim 24, wherein the positional data indicates each respective location which the search engine has encountered where the unique word or phrase that is the subject of that index entry occurs on a networked system of computers.
28. The computer-readable medium of claim 25, wherein the string of alphanumeric characters represents the original multiple-byte binary value of an ideographic, logographic or pictographic character, or sequence of characters, expressed in hexadecimal text format.
29. A system for indexing stored information that partially or wholly consists of encoded ideographic, logographic or pictographic characters, comprising: means for, including a search engine, creating an index entry for each individual character contained in the stored information; means for adding to the index entry for each individual character a pointer which indicates the location of each occurrence of that character which the search engine has encountered; means for, including a search engine, creating an index entry for each individual sequence of characters, up to a predetermined length, contained in the stored information using a search engine; and means for adding to the index entry for each individual sequence of characters a pointer which indicates the location of each occurrence of that sequence which the search engine has encountered.
30. A system for indexing stored information, comprising: a spider for retrieving a document containing a string of characters and for converting the numerical values used to encode each character contained in the document into hexadecimal text format to produce a hex value, and for adding a predetermined marker character to the beginning of each hex value to produce a marked character value, wherein the spider is also used for merging the marked character values into a single string; a storage device for storing the single string; an indexer for replacing each instance of the marker character with a blank space and for adding each word separated by the blank space to an index database, and for adding positional data specifying the location of each word in the document.
PCT/AU2001/000612 2000-05-24 2001-05-24 Indexing and searching ideographic characters on a networked system of computers WO2001090930A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
AU2001259949A AU2001259949B2 (en) 2000-05-24 2001-05-24 Indexing and searching ideographic characters on a networked system of computers
CA002409199A CA2409199A1 (en) 2000-05-24 2001-05-24 Indexing and searching ideographic characters on a networked system of computers
AU5994901A AU5994901A (en) 2000-05-24 2001-05-24 Indexing and searching ideographic characters on a networked system of computers

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AUPQ7730 2000-05-24
AUPQ7730A AUPQ773000A0 (en) 2000-05-24 2000-05-24 Indexing and searching ideographic characters on the internet

Publications (1)

Publication Number Publication Date
WO2001090930A1 (en) 2001-11-29

Family

ID=3821807

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2001/000612 WO2001090930A1 (en) 2000-05-24 2001-05-24 Indexing and searching ideographic characters on a networked system of computers

Country Status (3)

Country Link
AU (1) AUPQ773000A0 (en)
CA (1) CA2409199A1 (en)
WO (1) WO2001090930A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5375235A (en) * 1991-11-05 1994-12-20 Northern Telecom Limited Method of indexing keywords for searching in a database recorded on an information recording medium
US5544352A (en) * 1993-06-14 1996-08-06 Libertech, Inc. Method and apparatus for indexing, searching and displaying data
JPH10207882A (en) * 1997-01-22 1998-08-07 Toshiba Corp Document preparation device and japanese syllabary-to-chinese character converting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DATABASE WPI Derwent World Patents Index; Class T01, AN 1998-485650/42 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742450A (en) * 2021-08-30 2021-12-03 中信百信银行股份有限公司 User data grade label falling method and device, electronic equipment and storage medium
CN113742450B (en) * 2021-08-30 2023-05-30 中信百信银行股份有限公司 Method, device, electronic equipment and storage medium for user data grade falling label
CN113821539A (en) * 2021-09-07 2021-12-21 丰图科技(深圳)有限公司 Region query method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CA2409199A1 (en) 2001-11-29
AUPQ773000A0 (en) 2000-06-15

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2409199

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 2001259949

Country of ref document: AU

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP