CA1282496C - Database system for parallel processor - Google Patents

Database system for parallel processor

Info

Publication number
CA1282496C
Authority
CA
Canada
Prior art keywords
processor
word
point value
total point
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CA000539813A
Other languages
French (fr)
Inventor
Brewster Kahle
Craig W. Stanfill
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thinking Machines Corp
Original Assignee
Thinking Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thinking Machines Corp filed Critical Thinking Machines Corp
Application granted granted Critical
Publication of CA1282496C publication Critical patent/CA1282496C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9014 Indexing; Data structures therefor; Storage structures hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G06F16/3326 Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90339 Query processing by using parallel associative memories or content-addressable memories
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access

Abstract

A method is disclosed for using a single instruction multiple data (SIMD) computer to search large databases in parallel. The words of each document are stored by surrogate coding in tables in one or more of the processors of the SIMD computer. To determine which documents of the database contain a word that is the subject of a query, a query is broadcast from a central computer to all the processors and the query operations are simultaneously performed on the documents stored in each processor. The results of the query are then returned to the central computer. After all the search words have been broadcast to the processors and point values accumulated as appropriate, the point values associated with each document are reported to the central computer. The documents with the largest point values are then ascertained and their identification is provided to the user.

Description

DATABASE SYSTEM FOR PARALLEL PROCESSORS

Cross Reference to Related Applications
Related applications are Canadian Patent Application No. 455,008 filed May 24, 1984 for PARALLEL PROCESSOR, now Canadian Patent 1,212,480, issued October 7, 1986; Canadian Patent Application No. 510,359 filed May 29, 1986 for METHOD AND APPARATUS FOR INTERCONNECTING PROCESSORS IN A HYPER-DIMENSIONAL ARRAY, now Canadian Patent 1,256,581, issued June 27, 1989; Canadian Patent Application No. 528,855 filed Feb. 23, 1987 for METHOD OF SIMULATING ADDITIONAL PROCESSING IN A SIMD PARALLEL PROCESSOR ARRAY; and U.S. Patent No. 4,598,400 issued July 1, 1986 for METHOD AND APPARATUS FOR ROUTING MESSAGE PACKETS.

Background of the Invention
This relates to the searching of large databases and in particular to the searching of large databases in which the search strategies are executed in parallel.
Today it has become increasingly popular to store information such as articles from newswires and newspapers, abstracts and articles from journals and other print media, encyclopedias and bibliographies, on large databases for computerized search and retrieval. For convenience of reference, each group of related information will be referred to as a document regardless of its format or original physical embodiment. The methods used in searching large databases have been limited by the sequential computers available to perform the search. Ideally a search method should have a high recall and precision. Recall is the proportion of relevant documents in the entire database which are retrieved. Precision is the proportion of retrieved documents which are relevant. Exhaustive search methods provide high recall and precision. The basic problem is that an exhaustive search may take a very long time. Therefore, non-exhaustive methods are used.
The usual method of organizing a database is a technique called "inverting the database". See G. James, Document Databases (Van Nostrand Reinhold Company 1985); C. J. van Rijsbergen, Information Retrieval, p. 72 (Butterworths, 2d ed. 1979). Each document is assigned a unique document number. The words in the documents (excluding trivial words such as "a" and "the") are tagged with the document number and placed in an alphabetical index. To locate all documents containing a given word, the index is searched for that word, and a set of document numbers is returned. Alternatively, the words of each document may be stored by surrogate coding in which each word is represented by a hash code in a table of hash codes and a word search is performed by searching for the presence of the hash code associated with the word of interest.
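The inverted-index organization described above can be sketched as follows. This is only a minimal illustration, assuming whitespace tokenization and a two-word list of trivial words; the name `build_inverted_index` and the sample documents are hypothetical and are not taken from the cited references.

```python
from collections import defaultdict

# Trivial words left out of the index (illustrative list, not from the source).
STOP_WORDS = {"a", "the"}

def build_inverted_index(documents):
    """Map each non-trivial word to the set of document numbers containing it."""
    index = defaultdict(set)
    for doc_number, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_number)
    return index

# Hypothetical sample data keyed by document number.
documents = {
    1: "the parallel processor searches a database",
    2: "a database of newspaper articles",
}
index = build_inverted_index(documents)
```

Looking up `index["database"]` returns the set of document numbers for that word, which is the operation the text describes as searching the index.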
To search for documents containing more than one word, a boolean search strategy is typically used on the inverted index. A boolean search is a search which achieves its results by logical comparisons of the query with the documents. Commercial application of this technique requires rooms full of disk drives and large mainframe computers. The response time is often quite slow depending on the complexity of the query because the search through the index and the logical comparisons are executed sequentially. Such systems are limited in the quality of the search they provide and are found to be clumsy to use. There is a tradeoff between recall and precision which limits the quality of boolean searches on large databases. Searching a database for documents containing a single word may lead to low recall, because there is no guarantee that all relevant documents will use that word. In addition, it is likely that a large number of irrelevant documents will be retrieved, leading to low precision. Searching for several words aggravates these problems. If the searcher looks for any of several words (a disjunctive query), recall improves but precision goes down. If the searcher looks for documents containing all of several words (a conjunctive query), precision improves but recall suffers. For a large database, this means that the searcher may have to choose between missing important information or searching through thousands of irrelevant documents.
There are additional problems with the viability of using boolean queries for full text search. First, the user is playing a guessing game, trying to guess which words the authors of the documents he is interested in might have used. Second, even if he guesses the words, he has to figure out which connectives to use to avoid getting too much or too little data. This often involves several iterations as the user debugs his query. Finally, the syntax of boolean queries is complex, making the system difficult to learn.
A second search strategy employs a variant on boolean queries referred to as "simple queries". See C. J. van Rijsbergen, Information Retrieval, p. 160 (Butterworths, 2d ed. 1979). In this search strategy a query consists of a set of words, each of which is assigned a point value. Every document in the database is scored by adding up the point values for the words it contains. The result of this query is a set of documents, ordered by their total point values.
Simple queries are comparable to boolean queries in the quality of the search they support. For example, if the user looks only at the documents which have a positive score, he is essentially looking at the results of a disjunctive query, and can expect high recall but low precision. Conversely, if he looks only at the documents which have the maximum possible score, he is essentially looking at the results of a conjunctive query, and can expect high precision but low recall. An advantage of simple queries is that, between these two extremes, there are regions of intermediate recall and precision. In addition, they are easier to use than boolean queries. The user does not need to decide which connectives to use as there are none. The user does not need to learn a complex query language, as the query consists of a list of words.
However, searching with simple queries, like searching with boolean queries, remains a guessing game. An additional problem is determining where to set the threshold in the point value of responses from the query in order to limit the number of retrieved documents to a manageable amount.
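The scoring rule for simple queries can be illustrated with a short sketch. It assumes whitespace tokenization and a query given as a word-to-point-value mapping; the names `score_documents` and `rank`, the sample documents and the point values are all hypothetical.

```python
def score_documents(documents, query):
    """Score each document by summing the point values of the query words it contains."""
    scores = {}
    for doc_number, text in documents.items():
        words = set(text.lower().split())
        scores[doc_number] = sum(points for word, points in query.items() if word in words)
    return scores

def rank(scores):
    """Return (doc_number, score) pairs with a positive score, highest score first."""
    hits = [(doc, score) for doc, score in scores.items() if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical sample data and point values.
documents = {
    1: "parallel processor searches large database",
    2: "database of newspaper articles",
    3: "weather report",
}
query = {"parallel": 3, "database": 2, "processor": 1}
ranking = rank(score_documents(documents, query))
```

Here document 1 contains all three query words and scores 6, document 2 scores 2, and document 3 scores zero and is dropped. Keeping every positive score corresponds to the disjunctive extreme; raising the cutoff moves toward the conjunctive one.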
Another search strategy is relevance feedback. In this strategy simple queries are constructed from the texts of documents judged to be relevant. See G. Salton, The SMART Retrieval System - Experiments in Automatic Document Processing, p. 313 (Prentice-Hall 1971); C. J. van Rijsbergen, Information Retrieval, p. 105 (Butterworths, 2d ed. 1979).
First, a search method is used to locate a small set of possibly relevant documents. The user then scans these documents, and marks any which he considers obviously relevant as good and any which he considers obviously irrelevant as bad. The text of the marked documents is then scanned for appropriate search words, and a query is constructed from these words. The more good documents a word occurs in, the greater its importance in the new query and therefore the higher the score assigned to that word. The new query may contain hundreds of terms. This query is then applied to the database in the same fashion as a simple query. Relevance feedback leads to both high precision and high recall due to the large number of words employed in the search process. One word taken by itself conveys little information; but several hundred words together convey a great deal. Only highly relevant documents will use a high proportion of this set of several hundred items. However, the only way to implement such a query is by an exhaustive search which is impracticable on the serial mainframe systems currently in use for database retrieval systems.

Summary of the Invention
The present invention relates to the use of a massively parallel processor for document search and retrieval. The system as presented is sufficiently fast to permit the application of exhaustive search methods not previously feasible for large databases.
In the preferred embodiment of the invention the document data is stored and the searches are implemented on a single instruction multiple data (SIMD) computer which makes it possible to perform thousands of operations in parallel.
One such SIMD computer on which the invention has been performed is the Connection Machine (Reg. TM) Computer made by the present assignee, Thinking Machines, Inc. of Cambridge, Massachusetts. This computer is described more fully in Canadian Patent 1,212,480. In the embodiment of the Connection Machine Computer on which the invention has been practiced, the computer comprises 65,536 relatively small identical processors which are interconnected in a sixteen-dimensional hypercube network.
The words of each document are stored by surrogate coding in tables in one or more of the processors of the SIMD computer. To determine which documents of the database contain a word that is the subject of a query, a query is broadcast from a central computer to all the processors and the query operations are simultaneously performed on the documents stored in each processor. The results of the query are then returned to the central computer.
Because of the enormous parallel processing capability of an SIMD computer such as the Connection Machine Computer, simple query and relevance feedback search strategies using large numbers of search words and exhaustive or near exhaustive search strategies are now practical.

Scoring of the results of such searches is done in parallel at each processor. For example, each processor which stores a hash value associated with a word that is the subject of a query can be directed to accumulate a point value for that word. After all the search words have been broadcast to the processors and point values accumulated as appropriate, the point values associated with each document are reported to the central computer. The documents with the largest point values are then ascertained and their identification is provided to the user.

Description of the Drawings
These and other objects, features and elements of the invention will be more readily apparent from the following Description of the Preferred Embodiment of the Invention in which:
Figs. 1 and 2 depict in schematic form details of a SIMD processor preferably used in the practice of the invention;
Fig. 3 is a block diagram of the search and retrieval process; and
Fig. 4 is a block diagram of the query forming process.

Detailed Description of the Preferred Embodiments
In the system of the invention, a single instruction multiple data (SIMD) computer such as the Connection Machine Computer is preferably used. This computer is described in detail in Canadian Patent 1,212,480.
As shown in Fig. 1A of that patent which is reproduced in Fig. 1, the computer system comprises a central computer 10, a microcontroller 20, an array 30 of parallel processing integrated circuits 35, a data source 40, a first buffer and multiplexer/demultiplexer 50, first, second, third and fourth bidirectional bus control circuits 60, 65, 70, 75, a second buffer and multiplexer/demultiplexer 80, and a data sink 90. Central computer 10 may be a suitably programmed commercially available computer such as a Symbolics 3600-series LISP Machine. The database to be searched is stored as described below in the memories of individual processor/memories 36 in integrated circuits 35.

Microcontroller 20 is an instruction sequencer of conventional design for generating a sequence of instructions that are applied to array 30 by means of a thirty-two bit parallel bus 22. Microcontroller 20 receives from array 30 a signal on line 26. This signal is a general purpose or GLOBAL signal that can be used for data output and status information. Bus 22 and line 26 are connected in parallel to each IC 35. As a result, signals from microcontroller 20 are applied simultaneously to each IC 35 in array 30 and the signal applied to microcontroller 20 on line 26 is formed by combining the signal outputs from all of ICs 35 of the array.
In the embodiment of the Connection Machine Computer used in the practice of the present invention, array 30 contains 4096 (=2^12) identical ICs 35; and each IC 35 contains 16 (=2^4) identical processor/memories 36. Thus, the entire array 30 contains 65,536 (=2^16) identical processor/memories 36.
Processor/memories 36 are organized and interconnected in two geometries: a conventional two-dimensional grid pattern and a 16-dimension hypercube network. In the grid pattern, the processor/memories are organized in a rectangular array and connected to their four nearest neighbors in the array. The sides of this array and the four neighbors are identified as NORTH, EAST, SOUTH and WEST. The hypercube network allows processors to communicate by exchanging packets of information. This configuration is realized by organizing and interconnecting the ICs 35 in the form of a Boolean n-cube of sixteen dimensions. Each IC 35 is provided with logic circuitry to control the routing of messages through such an interconnection network; and within each IC, bus connections are provided to every processor/memory so that each processor/memory can communicate with any other by sending a signal through at most sixteen communication lines.
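The "at most sixteen communication lines" bound follows from the geometry of a Boolean n-cube. The sketch below assumes the usual addressing convention, in which each node has a 16-bit address and two nodes are directly linked when their addresses differ in exactly one bit. The function names are hypothetical, and the sketch says nothing about how the routing logic in each IC actually chooses a path.

```python
DIMENSIONS = 16  # sixteen-dimensional Boolean n-cube, per the text

def neighbors(address):
    """Addresses one hop away: flip exactly one of the sixteen address bits."""
    return [address ^ (1 << bit) for bit in range(DIMENSIONS)]

def hop_distance(source, destination):
    """Minimum number of links between two nodes: the Hamming distance of their addresses."""
    return bin(source ^ destination).count("1")
```

Because two 16-bit addresses can differ in at most sixteen bit positions, `hop_distance` never exceeds sixteen under this assumption.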
An illustrative processor/memory 36 is disclosed in greater detail in Fig. 2 which is the same as Fig. 7A of Canadian Patent 1,212,480. As shown in Fig. 2, the processor/memory comprises random access memory (RAM) 250, arithmetic logic unit (ALU) 280 and flag controller 290.
The ALU operates on data from three sources, two registers in the RAM and one flag input, and produces two outputs, a sum output that is written into one of the RAM registers and a carry output that is made available to certain registers in the flag controller as well as to certain other processor/memories.
The inputs to RAM 250 are busses 152, 154, 156, 158, a sum output line 285 from ALU 280, the message packet input line 122 from communication interface unit (CIU) 180 of Fig. 6B of Canadian Patent 1,212,480 and a WRITE ENABLE line 298 from flag controller 290. The outputs from RAM 250 are lines 256, 257. The signals on lines 256, 257 are obtained from the same column of two different registers in RAM 250, one of which is designated Register A and the other Register B. Busses 152, 154, 156, 158 address these registers and the columns therein in accordance with the instruction words from microcontroller 20. Illustratively, RAM 250 has a memory capacity of 4096 bits.
Flag controller 290 is an array of eight one-bit D-type flip-flops 292, a two-out-of-sixteen selector 294 and some logic gates. The inputs to flip-flops 292 are a carry output signal from ALU 280, a WRITE ENABLE signal on line 298 from selector 294, and the eight lines of bus 172 from programmable logic array (PLA) 150 of Fig. 6B of Canadian Patent 1,212,480. Lines 172 are address lines each of which is connected to a different one of flip-flops 292 to select the one flip-flop into which a flag bit is to be written.
The outputs of flip-flops 292 are applied to selector 294.
The inputs to selector 294 are up to sixteen flag signal lines 295, eight of which are from flip-flops 292, and the sixteen lines each of busses 174, 176. Again, lines 174 and 176 are address lines which select one of the flag signal lines for output or further processing. Selector 294 provides outputs on lines 296 and 297 that are whichever flags have been selected by address lines 174 and 176, respectively.
ALU 280 comprises a one-out-of-eight decoder 282, a sum output selector 284 and a carry output selector 286.
As detailed in Canadian Patent 1,212,480, this enables it to produce sum and carry outputs for many functions including ADD, logical OR and logical AND. ALU 280 operates on three bits at a time, two on lines 256, 257 from Registers A and B in RAM 250 and one on line 296 from flag controller 290.
The ALU has two outputs: a sum on line 285 that is written into Register A of RAM 250 and a carry on line 287 that may be written into a flag register 292 and applied to the inputs of the other processor/memories 36 to which this processor/memory is connected.
The words of the document are stored in a table format originally developed for spelling correction dictionaries called surrogate coding. See Dodds, "Reducing Dictionary Size by Using a Hashing Technique", Communications of the ACM, Vol. 25, No. 6, pp. 368-370 (June 1982); Nix, "Experience With a Space Efficient Way to Store a Dictionary", Communications of the ACM, Vol. 24, No. 5, pp. 297-298 (May, 1981); Peterson, "Computer Programs for Detecting and Correcting Spelling Errors", Communications of the ACM, Vol. 23, No. 12, pp. 676-687 (December, 1980). Although the tables may be of any size, it is preferred that the table be 512 or 1024 bits long.
The 4096 bits of RAM allows for six tables of 512 bits or three tables of 1024 bits with the remaining memory used as scratch memory.
To store the words of a document in the table, the table is first initialized to zero. A fixed number of independent hash codes -- ten, in the presently preferred embodiment -- are generated for each significant word in the document. Each code corresponds to a position in the table. For example, for a table of 512 bits, each code would be between zero and 511. For each of the hash codes generated for a word, the corresponding binary bit at that address in the table is set to one. To minimize data storage requirements, trivial words such as "a" and "the" are not included in the table. In addition, a text indexer preferably is used which picks out noun phrases in a document to input into the table. This allows for a three to one compression of the document.
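The table-building procedure can be sketched in a few lines. The text does not say which hash functions are used, so the salted SHA-256 scheme below is purely an assumption chosen to give ten roughly independent codes per word. The names `hash_codes` and `build_table` are hypothetical, and the noun-phrase text indexer is not modeled.

```python
import hashlib

TABLE_BITS = 512   # table length from the text
NUM_HASHES = 10    # ten hash codes per word, per the preferred embodiment
STOP_WORDS = {"a", "the"}  # trivial words excluded from the table

def hash_codes(word):
    """Return ten table positions (0..511) for a word.

    Assumption: salted SHA-256 stands in for the unspecified hash functions.
    """
    codes = []
    for salt in range(NUM_HASHES):
        digest = hashlib.sha256(f"{salt}:{word}".encode("utf-8")).digest()
        codes.append(int.from_bytes(digest[:4], "big") % TABLE_BITS)
    return codes

def build_table(words):
    """Build one table, held in a Python int used as a 512-bit vector."""
    table = 0  # the table is first initialized to zero
    for word in words:
        if word in STOP_WORDS:
            continue
        for code in hash_codes(word):
            table |= 1 << code  # set the bit at each hash address to one
    return table
```

A table built from n significant words has at most 10n bits set, and fewer whenever two codes collide.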
Each document of the database is stored in this fashion in one or more tables in one or more processor/memories of the SIMD computer. If a document contains more than the maximum number of words that should be stored in a table, additional tables are used. For example, if a ninety word document is stored in the database, and thirty words are contained in each table, a total of three tables are used. The set of tables which contain a single document is called a chain. Preferably each of the tables in a chain is located in a physically different processor, and each table is located in the same portion of its respective memory. Alternatively, all the tables could be located in the same physical processor.
To probe for the presence of a word in the documents stored in the tables of the processor/memories, the corresponding bits of the ten hash codes for that word are checked in each table of the processor/memories. If any of the ten bits in a table are zero, the word is definitely absent from that table, yielding a negative response. If all ten bits are one, then the word is probably present, yielding a positive response. Although this algorithm does not generate false negatives, there is a possibility of false positives. The probability in this algorithm of a false positive is dependent on the number of words contained in each table as well as the size of the table and the number of bits which are set for each word. In a table of 512 bits with ten bits set for each word, the probability of a false positive can be shown to be about one in a million for a table containing fifteen words, one in a hundred thousand for a table of twenty words and thirty in a hundred thousand for a table of thirty words. For optimal system performance, it seems preferable to limit the table to about fifteen to thirty words.
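The false-positive figures quoted above can be checked with a short calculation. The sketch assumes that every hash code is uniformly and independently distributed over the table, which is the standard approximation for this kind of bit-table; the text does not state how its figures were derived, and the function name is hypothetical.

```python
def false_positive_probability(words_in_table, table_bits=512, bits_per_word=10):
    """Chance that all probed bits are set for a word that is not in the table.

    Assumes uniform, independent hash codes (an idealization).
    """
    bits_written = words_in_table * bits_per_word
    # Probability that a given bit is still zero after all writes.
    still_zero = (1.0 - 1.0 / table_bits) ** bits_written
    fraction_set = 1.0 - still_zero
    # A false positive needs every one of the probed bits to be set.
    return fraction_set ** bits_per_word

for n in (15, 20, 30):
    print(n, false_positive_probability(n))
```

Under this assumption the results come out at roughly one in a million for fifteen words, roughly one in a hundred thousand for twenty, and roughly thirty in a hundred thousand for thirty, in line with the figures in the text.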
Figs. 3 and 4 illustrate the exhaustive search strategy used in the system of the invention. To permit user access to the documents from a computer terminal, the documents are stored in full text in the central computer and a display terminal is provided for accessing this database. The significant words of each document are then hash coded and the hash codes are then stored in one or more tables in the memories of the processor/memories of the SIMD computer. As indicated above, trivial words are ignored and storage typically is limited to noun phrases. To begin a search, the user selects at least one word and preferably several that delineate the subject of interest. When first executing a query as indicated by the left hand side of Fig. 4, the user enters these words into the central computer and may also assign point values to each word reflecting his estimate of the significance of the word in the search. The central computer then examines the full text of the documents stored in the central computer for the presence of one or more of those words, computes point value scores for the documents, if applicable, and identifies to the user the documents selected by this process. As indicated by the righthand side of Fig. 4, the user then examines the documents and informs the central computer which documents are relevant or "good" and which are irrelevant or "bad". The computer then examines these documents to locate appropriate search words and formulates a query from these words. The computer also assigns to these words a point value basing its valuation, for example, on the number of good documents in which the word appears. Other parameters, such as the frequency of occurrence of a particular word (for example, words which do not appear frequently in a document would have a higher point value assigned to the word), whether a word occurs in the title or headlines, and whether the user has made an explicit note of the word, may also be used in constructing the simple query. The resulting query may contain hundreds of words.
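The query-forming step on the right-hand side of Fig. 4 can be illustrated with a minimal sketch that uses only the first valuation mentioned above: a word's point value is the number of good documents in which it appears. The handling of documents marked bad and the other parameters (word frequency, title occurrence, explicit user notes) are left out because the text does not give formulas for them. The name `form_query`, the stop list and the sample documents are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"a", "the", "of", "in"}  # illustrative list of trivial words

def form_query(good_documents):
    """Build a {word: point value} query from the documents marked good.

    Point value = number of good documents containing the word.
    """
    counts = Counter()
    for text in good_documents:
        # Use a set so a word counts once per document, however often it recurs.
        counts.update(set(text.lower().split()) - STOP_WORDS)
    return dict(counts)

# Hypothetical documents the user has marked as good.
good = [
    "parallel search of the database",
    "parallel processor database",
]
query = form_query(good)
```

With these two documents, "parallel" and "database" each receive a point value of 2 and the remaining significant words receive 1.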
This query is then used to search the hash tables stored in the memories of the SIMD computer. For each word in the query, the central computer determines from a look-up table the values or addresses in the hash table where the ten bits of the corresponding hash code are stored. It then instructs each processor/memory to read the bit at each of those addresses. Each bit that is read is used to set a flag and this flag is then ANDed together with the next bit that is read to determine the next value of the flag. If any of the hash code bits has a value of zero, the flag becomes zero, and the test fails, indicating that the word in question is not stored in that table. If the test succeeds, the flag is a one, the word is assumed to be in the document represented by that table and the point value associated with that word is awarded to that document and accumulated with any other point values associated with other words in the document.

Advantageously, communications from the individual processor/memories to the central computer can be minimized by storing the point values for each document at the individual processor/memories until completion of the entire query search. Point values are accumulated after each word is tested by broadcasting to all the processor/memories an instruction to accumulate the point value if the flag bit is one.
If a document is divided among multiple tables, the values in each table are passed to the first table in the chain where they are accumulated.
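The probe, flag and accumulate steps can be put together in a small serial simulation. The inner loops over `processors` stand in for operations that the SIMD machine performs on all processor/memories at once. The `Processor` class, the `chain_head` field, the salted SHA-256 hash and the sample data are assumptions made for illustration and do not describe the actual hardware or microcode.

```python
import hashlib

TABLE_BITS, NUM_HASHES = 512, 10

def hash_codes(word):
    # Assumption: salted SHA-256 stands in for the unspecified hash functions.
    return [
        int.from_bytes(hashlib.sha256(f"{salt}:{word}".encode("utf-8")).digest()[:4], "big")
        % TABLE_BITS
        for salt in range(NUM_HASHES)
    ]

class Processor:
    """One simulated processor/memory: a hash table, a flag and a point total."""
    def __init__(self, words, chain_head):
        self.table = 0
        for word in words:
            for code in hash_codes(word):
                self.table |= 1 << code
        self.chain_head = chain_head  # index of the first table in this document's chain
        self.flag = 0
        self.points = 0

def run_query(processors, query):
    """Broadcast each query word, accumulate points locally, then sum along chains."""
    for word, value in query.items():
        addresses = hash_codes(word)          # determined once by the central computer
        for p in processors:                  # simultaneous on the real machine
            p.flag = 1
            for address in addresses:
                p.flag &= (p.table >> address) & 1   # AND each bit read into the flag
        for p in processors:                  # broadcast: accumulate if the flag is one
            if p.flag:
                p.points += value
    totals = {}
    for p in processors:                      # pass each table's value to its chain head
        totals[p.chain_head] = totals.get(p.chain_head, 0) + p.points
    return totals

# Hypothetical data: processors 0 and 1 hold one two-table document, processor 2 another.
processors = [
    Processor(["parallel", "processor"], chain_head=0),
    Processor(["database", "search"], chain_head=0),
    Processor(["weather", "report"], chain_head=2),
]
totals = run_query(processors, {"parallel": 3, "database": 2, "weather": 1})
```

Barring a false positive, which is extremely unlikely with only two words per table, the document whose chain starts at processor 0 totals 5 points and the document at processor 2 totals 1. Only these per-document totals need to be reported to the central computer.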
Documents having the largest point values may be identified by sorting the point values stored in the processor/memories in order of magnitude. Computer programs for such a sort are well known in the art and their adaptation to a SIMD computer will be apparent from the foregoing description. Alternatively, the point values can be tested to identify the maximum point values and the identification of the documents associated with a series of such maxima can be extracted from the processor/memories.
To perform such a test, the central computer simultaneously tests the most significant bit of the point values stored in each of the processor/memories. This is readily done if the point values are all stored at the same addresses in all the processor/memories. The test is in the form of an instruction to set a first flag and ignore further parts of the test if the most significant bit is zero and to produce an output on the GLOBAL signal line 26 of Fig. 1 if the most significant bit is a one. If no output is received on line 26 from any processor/memory, the central computer resets all the flags and their processor/memories and begins the test anew with the next most significant bit.
If an output is received, the central processor enters the next cycle of the test and issues an instruction to those processor/memories still being tested to set a second flag and ignore further parts of the test if the test of the next most significant bit is zero and to produce an output on the GLOBAL signal line if that bit is a one. If no output is received on the GLOBAL line, the second flags of the processor/memories that were shut down during that cycle of the test are reset, those processor/memories are reactivated and the third most significant bit is tested.
If, however, an output is received, the first flags are set in those processor/memories whose second flags were set; and the central processor enters the next cycle of the test. It then tests the third most significant bit and so on. As a result of this process, the processor/memory having the maximum point value is isolated in the array and the document associated with that point value is identified. The point value associated with that document is then set to zero and the process is repeated to find the document having the next highest point value; and so on.
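The bit-by-bit maximum search can be sketched as follows. The list `active` plays the role of the first and second flags (a processor that has been shut down is marked False), and `any(ones)` plays the role of an output on the GLOBAL line. The function names, the tie-breaking rule and the fixed bit width are assumptions made for illustration.

```python
def find_maximum(point_values, width):
    """Return (index, value) of a maximum point value, testing bits from the MSB down.

    At each bit, if any still-active processor holds a one, processors holding
    a zero drop out; if none holds a one, nobody is eliminated at that bit.
    """
    active = [True] * len(point_values)
    for bit in range(width - 1, -1, -1):
        ones = [
            active[i] and bool((point_values[i] >> bit) & 1)
            for i in range(len(point_values))
        ]
        if any(ones):        # an output was received on the GLOBAL line
            active = ones    # processors with a zero at this bit are shut down
    winner = active.index(True)  # ties: take the first remaining processor (assumption)
    return winner, point_values[winner]

def top_documents(point_values, width, count):
    """Repeat the search, zeroing each winner, to list the `count` best documents."""
    values = list(point_values)
    best = []
    for _ in range(count):
        index, value = find_maximum(values, width)
        best.append((index, value))
        values[index] = 0    # set the winner's point value to zero and search again
    return best
```

Each search takes a number of broadcast steps equal to the bit width of the point values, regardless of how many processors hold documents, which is what makes the approach attractive on a SIMD machine.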
Since a large number of documents may be retrieved using the simple query search strategy, it is preferred that some means be used to limit the number of retrieved documents to a manageable amount. Preferably, this is done by retrieving the best documents (i.e., those with the highest point values) first and stopping when enough documents have been retrieved. Alternatively, a threshold point value can be established so that if the total point value of the document is below the threshold point value, the document is not retrieved.
It has been found that by utilizing the system of the invention the user can obtain both high recall and precision. Exhaustive searches which were not practicable can now be used. Since the searches are executed in parallel the search time is extremely short. For example, a simple query search of 200 terms on a 112 Megabyte database is executed in 60 milliseconds.
As will be apparent to those skilled in the art, numerous modifications may be made within the scope of the above described invention. While the invention has been described in terms of a parallel processing implementation of a combination of simple queries and relevance feedback search strategies, other search strategies such as boolean search strategies and other exhaustive search strategies may be used.


Claims (15)

1. A process for searching for relevant documents in a database comprising the steps of:
a) forming a database by storing for each of a plurality of documents at least one table of hash codes representing words in the document, the table(s) that represent the words in each different document being stored in a different digital data processor, each hash code comprising information at a plurality of bit locations;
b) forming a query having at least one word and a point value of relevance assigned to each word;
c) testing if the word in the query is in the database by:
1) determining the bit locations in the table at which the hash code corresponding to the queried word is stored; and
2) simultaneously testing in each of the processors the bit locations corresponding to the queried word;
d) adding at each digital data processor the point value associated with the queried word to a total point value for the document if the hash code is found at all the bit locations corresponding to the queried word that are tested in that processor; and
e) providing identification of those documents in the database with high total point values.

2. The process of claim 1 wherein the step of forming a query comprises:
a) identifying a group of relevant documents;
b) determining a frequency at which the words in the documents occur among all the relevant documents; and
c) generating a query based on the frequency of occurrence of words in the relevant documents.
3. The process of claim 1 wherein the step of adding the point value associated with a queried word to the total point value of a document comprises:
a) setting a flag in the processor to indicate the presence of the queried word in the table;
b) communicating simultaneously from the central computer to each of the processors a command to add the point value specified to the total point value of the table if the flag in the processor is set.
4. The process according to claim 1 wherein a document is represented by a chain of tables, each of which stores a representation of a portion of the document, each chain of tables comprising a first table and one or more successive tables, the step of adding the point value associated with the queried word to the total point value of the document comprising the steps of:
a) communicating the total point value of the portion of the document represented by each successive table in the chain to the preceding table in the chain; and
b) accumulating the total point value of all the tables in the first table of the chain.
5. The process according to claim 1 wherein the storing of document data in the processors comprises the steps of:
a) initializing to zero a table of bits in a memory of each processor;
b) generating a plurality of independent hash codes for each word with values in an address range of the table; and
c) for each hash code, setting a bit at an address in the table corresponding to each hash code value.
6. The process of claim 1 wherein the step of providing the identification of those documents with high total point values comprises identifying the documents in numeric order of total point value, said process comprising:
a) testing the most significant bit of the point value stored in memory in each processor;
b) setting a flag in the processor if the most significant bit is zero;
c) repeating steps a) and b), if necessary, until a non-zero bit is identified; and
d) successively testing bits of the point values of lesser significance in those processors where a non-zero bit is identified until a document is identified with the highest total point value.
7. The process of claim 1 wherein the step of providing the identification of those documents with high total point values comprises identifying the documents in numeric order of total point value, said process comprising:
a) testing the most significant bit of the point values stored in memory in each processor;
b) setting a flag in the processor if the most significant bit is zero;
c) testing the next most significant bit of the point values stored in memory at each processor where a flag is not set; and
d) setting a second flag in the processor if the next most significant bit is zero.
8. The process of claim 1 wherein the step of providing the identification of documents with high total point values comprises:
a) successively testing at each digital data processor bits of the total point values associated with each document beginning with the most significant bit and continuing to test bits of lesser significance until the highest total point value is identified; and
b) repeating step (a) of this claim on the total point values remaining after the highest total point value is identified in the previous execution of step (a).
9. The method of claim 1 wherein different point values are assigned to at least some different words.
10. A process for searching in a database comprising the steps of:
a) forming a database by storing in each of a plurality of digital data processors at least one table of hash codes, the hash codes in each table representing a group of related words;
b) forming a query having at least one word and a point value of relevance assigned to each word;
c) testing if the word in the query is in the database by:
1) determining the bit locations in the table at which the hash code corresponding to the queried word is stored; and
2) simultaneously testing in each of the processors the bit locations corresponding to the queried word;
d) at each digital data processor, adding the point value associated with the queried word to a total point value for that group of related words if the hash code is found at all the bit locations tested in the table; and
e) providing identification of those groups of related words with high total point values.
11. The process of claim 10 wherein the step of adding the point value of a queried word to the total point value for a group of related words comprises:
a) setting a flag in the processor to indicate the presence of the queried word in the table;
b) communicating simultaneously from a central computer to each of the processors a command to add the point value of the queried word to the total point value for the group of related words if the flag in the processor is set.
12. The process of claim 10 wherein the step of providing the identification of those groups of related words with high total point values comprises:
a) successively testing the bits of the total point values associated with each group of related words beginning with the most significant bit and continuing to test bits of lesser significance until the highest total point value is identified; and
b) repeating step (a) on the total point values remaining after the highest total point value is identified in the previous execution of step (a).
13. The process of claim 10 wherein the step of providing the identification of those groups of related words with high total point values comprises identifying the groups in numeric order of total point value, said process comprising:
a) testing the most significant bit of the point values stored in memory in each processor;
b) setting a flag in the processor if the most significant bit is not set;
c) testing the next most significant bit of the point values stored in memory at each processor where a flag is not set;
d) setting a second flag in the processor if the next most significant bit is not set.
14. The method of claim 10 wherein different point values are assigned to at least some different words.
15. A process for searching in a database comprising the steps of:
a) forming a database by storing in each of a plurality of digital data processors at least one table of hash codes, the hash codes in each table representing a group of related words;
b) forming a query having at least one word;
c) testing for the presence of the queried word in the database by:
1) determining the bit locations in the table at which the hash code corresponding to the queried word is stored; and
2) simultaneously testing in each of the processors the bit locations corresponding to the queried word; and
d) scoring each group of related words, a score for a group of related words being increased if the hash code is found at all the bit locations tested in the table.
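For readers who want a concrete picture of the table construction and scoring steps recited in the claims, the following Python sketch may help. It is an assumption-laden illustration rather than the claimed implementation: the table size `TABLE_BITS`, the number of hash codes `NUM_HASHES`, the use of SHA-256 from Python's `hashlib` to derive independent hash codes, and all function names are choices made for the example. The loop over tables stands in for the test that each processor would perform simultaneously.

```python
import hashlib

TABLE_BITS = 512  # assumed size of each bit table
NUM_HASHES = 3    # assumed number of independent hash codes per word


def bit_locations(word):
    """Return the bit locations that represent a word in a table."""
    locations = []
    for k in range(NUM_HASHES):
        # Salting with k yields a different hash code for each k.
        digest = hashlib.sha256(f"{k}:{word}".encode("utf-8")).digest()
        locations.append(int.from_bytes(digest[:4], "big") % TABLE_BITS)
    return locations


def build_table(words):
    """Initialize a table to zero and set the bits for every word."""
    table = [0] * TABLE_BITS
    for word in words:
        for location in bit_locations(word):
            table[location] = 1
    return table


def score_documents(tables, query):
    """Add each query word's point value where all of its bits are set.

    tables: one bit table per document (one per processor in the claims).
    query: mapping from word to its point value.
    """
    totals = [0] * len(tables)
    for word, points in query.items():
        locations = bit_locations(word)  # determined once per queried word
        for doc, table in enumerate(tables):  # models the parallel test
            if all(table[location] for location in locations):
                totals[doc] += points
    return totals


tables = [build_table(["parallel", "processor", "database"]),
          build_table(["cooking", "recipes"])]
print(score_documents(tables, {"parallel": 3, "database": 2}))
```

A word that was stored in a table is always found, so the first document above always receives the full 5 points. A table of this kind can also report a word that was never stored if other words happen to set the same bits, so a score may occasionally be higher than the true word overlap; larger tables make this less likely.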
CA000539813A 1986-06-25 1987-06-16 Database system for parallel processor Expired - Fee Related CA1282496C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US06/878,532 US4870568A (en) 1986-06-25 1986-06-25 Method for searching a database system including parallel processors
US878,532 1986-06-25

Publications (1)

Publication Number Publication Date
CA1282496C true CA1282496C (en) 1991-04-02

Family

ID=25372216

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000539813A Expired - Fee Related CA1282496C (en) 1986-06-25 1987-06-16 Database system for parallel processor

Country Status (5)

Country Link
US (1) US4870568A (en)
EP (1) EP0251594B1 (en)
JP (1) JPH083817B2 (en)
CA (1) CA1282496C (en)
DE (1) DE3750492T2 (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4152762A (en) * 1976-03-03 1979-05-01 Operating Systems, Inc. Associative crosspoint processor system
US4118788A (en) * 1977-03-07 1978-10-03 Bell Telephone Laboratories, Incorporated Associative information retrieval
US4255796A (en) * 1978-02-14 1981-03-10 Bell Telephone Laboratories, Incorporated Associative information retrieval continuously guided by search status feedback
US4358824A (en) * 1979-12-28 1982-11-09 International Business Machines Corporation Office correspondence storage and retrieval system
US4445171A (en) * 1981-04-01 1984-04-24 Teradata Corporation Data processing systems and methods
US4468728A (en) * 1981-06-25 1984-08-28 At&T Bell Laboratories Data structure and search method for a data base management system
US4464650A (en) * 1981-08-10 1984-08-07 Sperry Corporation Apparatus and method for compressing data signals and restoring the compressed data signals
US4495566A (en) * 1981-09-30 1985-01-22 System Development Corporation Method and means using digital data processing means for locating representations in a stored textual data base
US4451901A (en) * 1982-01-21 1984-05-29 General Electric Company High speed search system
US4554631A (en) * 1983-07-13 1985-11-19 At&T Bell Laboratories Keyword search automatic limiting method

Also Published As

Publication number Publication date
US4870568A (en) 1989-09-26
JPS6359626A (en) 1988-03-15
EP0251594B1 (en) 1994-09-07
DE3750492D1 (en) 1994-10-13
JPH083817B2 (en) 1996-01-17
EP0251594A2 (en) 1988-01-07
DE3750492T2 (en) 1995-03-16
EP0251594A3 (en) 1991-10-09

Legal Events

Date Code Title Description
MKLA Lapsed