WO2003001403A1 - Cross-idea association database creation - Google Patents

Info

Publication number
WO2003001403A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
document
words
ofthe
state
Prior art date
Application number
PCT/US2002/019587
Other languages
French (fr)
Inventor
Eli Abir
Original Assignee
Eli Abir
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Eli Abir
Priority to KR10-2003-7016595A (patent KR20040007741A)
Priority to CA002447229A (patent CA2447229A1)
Priority to EP02744486A (patent EP1397754A4)
Priority to JP2003507722A (patent JP2004531832A)
Priority to IL15874902A (patent IL158749A0)
Priority to EA200400059A (patent EA006182B1)
Publication of WO2003001403A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/44Statistical methods, e.g. probability models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/45Example-based machine translation; Alignment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Definitions

  • This invention relates to a method and apparatus for creating a cross-idea association database for converting, manipulating, and/or translating information from one state to a second state.
  • the two states represent word languages (e.g., English, Hebrew, Chinese, etc.) such that the present invention creates a cross-language database correlating words and phrases in one language to their translation counterparts in a second language.
  • the present invention creates a database by examining documents in the two languages and creating a database of translations for each word or phrase in both languages.
  • the present invention need not be limited to language translation.
  • the present invention allows a user to create a database of ideas, and associate those ideas to other, differing ideas in a hierarchical manner. Thus, ideas are associated with other ideas and rated according to the frequency of the occurrence. The specific weight given to the occurrence frequency, and the use applied to the database thus created, can vary depending upon the user's requirements.
  • the present invention will operate to create foreign language translations of words and strings of words in the English language.
  • the present invention will return a ranking of associations to those words (or strings of words); e.g., the word occurring the most often will be the foreign language equivalent of the word (in English), given a large enough sample size.
  • the present invention will also return other foreign language associations with the English word, and the user may manipulate those associations as desired.
  • the word "mountain," when operated on according to the present invention, may return a list of foreign language words in the language being examined.
  • the present invention is an automated association database creator.
  • the strongest associations represent “translations” in one sense, but other frequent (but weaker) associations represent ideas that are closely related to the idea being examined.
  • the purpose of the present invention is to develop a database of associations of words and phrases (strings of words) between one language and a second language. In general, the method involves examining and operating on two documents, each containing text which represents the same concept or content, but in two different languages.
  • the method and apparatus of the present invention are utilized such that a database is created with associations across the two languages - translations, or more specifically, possible associations for words and phrases.
  • the translation and other relevant associations for words and phrases between the two languages become stronger, i.e. more frequent, as more documents are examined and operated on by the present invention, such that by operation on a large enough "sample" of documents the most common (and, in one sense, the correct) associations become apparent and the method and apparatus can be utilized for translation purposes.
  • the preferred embodiment of the present invention utilizes a computing device such as a personal computer system of the type readily available in the prior art. However, the method and apparatus of the present invention do not need to use such a computing device and can readily be accomplished by other means, including manual creation of the cross-associations.
  • the method by which successive documents are examined to enlarge the "sample" of documents and create the cross-association database is varied - the documents can be set up for analysis and manipulation manually, by automatic feeding (such as automatic paper loaders as known in the prior art), or by using search techniques on the Internet to automatically seek out the related documents.
  • the present invention may be utilized on a common computer system having at least a display means, an input method, an output method, and a processor.
  • the display means can be any of those readily available in the prior art, such as cathode ray terminals, liquid crystal displays, flat panel displays, and the like.
  • the processor means also can be any of those readily available and used in a computing environment such that the means is supplied to allow the computer to operate to perform the present invention.
  • an input method is utilized to allow the input of the documents for the purposes of building the cross-association database; as described above the specific input method can vary depending on the needs of the user.
  • the documents are examined for the purpose of building the database.
  • the creation process begins using the methods and/or apparatus described herein.
  • Document A is in language A
  • Document B is in language B.
  • the documents have the following text:
  • the first step in the present invention is to calculate a word range to determine the approximate associations for any given word or phrase. Since a word-for-word translation is not appropriate (i.e., word 1 in document A most likely will not exist as the literal translation of word 1 in document B), the database creation technique of the present invention tests each word in the first language against a range of words in the second language. This range thus is developed by examining the two documents, and is used to compare the words, phrases, or other word strings in the SECOND document against the words, phrases, or other word strings in the FIRST document. That is, a range of words (or phrases, or word strings) in the second document is applied as a possible match against any one word (or phrase, or word string) in the first document. By testing against a range, the database creation technique establishes a number of second language words that may equate and translate to the first language words.
  • the value of the range is, ultimately, user defined.
  • Various techniques can be used to determine the value of the range, including common statistical techniques such as the derivation of a bell curve based on the number of words in a document. With a statistical technique such as a bell curve, the range at the beginning and end of the document will be smaller than the range in the middle of the document.
  • a bell-shaped frequency for the range words allows reasonable extrapolation of possible word translations, whether it is derived according to the number of words in a document or according to the percentage of coverage of number of words desired.
  • the value of the range may depend on the number of words in the two documents. If the word counts of the two documents are equal, any value may be given. Applying statistical techniques, a bell curve may be created such that the range is a lower number of words at the beginning of the document, the highest number of words in the middle of the document, and a lower number of words at the end of the document.
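The bell-shaped range described above can be sketched as follows. This is a minimal illustration, not the patent's procedure: the function name `word_range`, the Gaussian shape, and the curve width (a quarter of the document length) are all assumptions chosen so that a ten-word document gets a range of one near its ends and two near its middle.

```python
import math


def word_range(position: int, doc_length: int, max_range: int, min_range: int = 1) -> int:
    """Return a bell-shaped range for a 1-based word position.

    The range is smallest near the start and end of the document and
    largest near the middle. The Gaussian shape and the width
    (doc_length / 4) are illustrative assumptions, not values from the text.
    """
    midpoint = (doc_length + 1) / 2
    sigma = doc_length / 4
    weight = math.exp(-((position - midpoint) ** 2) / (2 * sigma ** 2))
    # Never let the range fall below the minimum, even far from the midpoint.
    return max(min_range, round(max_range * weight))


# Range for every position of a ten-word document with a peak range of two.
ranges = [word_range(p, doc_length=10, max_range=2) for p in range(1, 11)]
print(ranges)
```

Because the text leaves the range user defined, any other curve (or a constant) could be substituted for the Gaussian without changing the rest of the procedure.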
  • a ratio may be used to correctly position the range. For example, if document A has 75 words and document B has 100 words, the ratio between the two documents is 3:4.
  • the mid-point of document A is word position 37 (or 38); however, using this mid-point (word position 37 or 38) as the placement for the largest value of the range (if determined according to a bell curve technique) in document B is not effective, since this position (word position 37 or 38) is not the midpoint of document B.
  • the point of maximum application of the range value in document B may be determined by the ratio of words between the two documents, by manual placement in the mid-point of document B, or by other techniques.
  • association frequencies for each possible translation.
  • the database creation technique of the present invention returns a possible set of words in the second-language document that translate to the word in the first document.
  • the possible set of words will be narrowed and an association frequency will be developed that will assist in the determination of the potential translation.
  • the present invention will create association frequencies for a word (or phrase, or word string) in one language to that same word (or phrase, or word string) in a second language.
  • the cross-language association database creation technique will return higher and higher association frequencies for any one word, phrase or word string.
  • the highest association frequency after a large enough sample is reviewed results in a translation; of course, the ultimate point where the association frequency is deemed to be an accurate translation is user defined and subject to other interpretive translation techniques (such as those described in Provisional Application No. 60/276,107, entitled "Method and Apparatus for Content Manipulation" filed on March 16, 2001 and incorporated herein by reference).
  • association frequencies could result for the Spanish equivalent to the English word "friend": "gato" - 25%; "burro" - 15%; and "amigo" - 60%.
  • the present invention operation will increase the association frequency for "amigo" and decrease the association frequencies for "gato" and "burro."
  • the association frequency will reach a level such that a translation is deemed to have occurred such that the word "friend" in English translates to "amigo" in Spanish.
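As a rough illustration of how association frequencies might shift as more documents are examined, the sketch below turns raw co-occurrence counts into percentages. The starting counts reproduce the "friend" example above; the second batch of counts and the function name `association_frequencies` are hypothetical.

```python
from collections import Counter


def association_frequencies(counts: Counter) -> dict[str, float]:
    """Convert raw association counts into percentages, strongest first."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {candidate: 100.0 * n / total for candidate, n in counts.most_common()}


# Counts chosen to reproduce the percentages in the "friend" example.
friend = Counter({"amigo": 60, "gato": 25, "burro": 15})
before = association_frequencies(friend)

# Hypothetical counts from additional document pairs: "amigo" keeps
# co-occurring with "friend", the accidental associations mostly do not.
friend.update({"amigo": 90, "gato": 5, "burro": 5})
after = association_frequencies(friend)

print(before)
print(after)
```

The threshold at which the top association is treated as a translation is left to the user, as the text notes, so it is not modeled here.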
  • the invention tests not only words but phrases, or strings of words (multiple words).
  • the database creation technique of the present invention analyzes a two-word word string, then three-word word string, and so on in an incremental manner. This technique makes possible the translation of phrases or word strings in one language into one word in another, as often occurs.
  • the analysis stops when all positions for the word or word string have been analyzed, if the number of words (or word strings) is greater than one. If a word only occurs once in a document, the process immediately proceeds to increment a word and return a word string. When a word string only occurs once, the process cycles back to the second word in the document, where the analysis cycle occurs again as described above.
  • the incrementing, testing and return process occurs in a similar manner for word strings.
  • the number of occurrences for any phrase is examined, phrases are returned based on the range, and a database is created of possible translations for that phrase.
  • the present invention can operate in such a manner so as to analyze word strings that depend on the correct positioning of words (in that word string), and can operate in such a manner so as to account for grammatical idiosyncrasies such as phrasing, style, or abbreviations.
  • the present invention can accommodate different variations that occur in documents where subsets of words occur within larger word strings. For example, proper names are sometimes presented complete (as in "John Doe"), abbreviated by first or surname ("John" or "Doe"), or abbreviated by another manner ("Mr. Doe").
  • the present invention accounts for these patterns by recognizing, through the analysis, the existence of these patterns in the association database, and manipulating the frequency return. Since the present invention will most likely return more individual word returns than word string returns (i.e., more returns for the first or surnames rather than the full name word string "John Doe"), because the words that make up a word string will necessarily be counted individually as well as part of the phrase, a change in ranking may be utilized.
  • Step 1 a range is determined.
  • the range may be user defined or may be approximated by a variety of methods.
  • the word count of the two documents is approximately equal (ten words in document A, eight words in document B); a range value of three (thirty percent of the words in document A) may provide the best results.
  • a range value of three may provide the best results.
  • the range will be one at the beginning and end of the document, and two in the middle.
  • the range (or the method used to determine the range) may be entirely user defined.
  • the range will vary from one word, to two words, to one word as the database creation technique of the present invention is utilized.
  • Step 2 the first word in document A is examined and tested against document A to determine the number of occurrences of that word in the document.
  • the first word in document A is X: X occurs three times in document A, at positions 1, 4, and 9.
  • the position numbers of a word, phrase, or other word string are simply a notation of the number of times that word, phrase, or word string is present in the document, and the location of that word, phrase, or word string in the document relative to other words.
  • the position numbers correspond to the number of words in a document, ignoring punctuation - for example, if a document has ten words in it, and the word "king" appears twice, the position numbers of the word "king" are merely the places (out of ten words) where the word appears. Because word X occurs more than once in the document, the process proceeds to the next step. If word X only occurred once, then that word would be skipped and the process expanded to the next word string (or phrase) and the creation process continued.
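The position numbering described above can be sketched as a small helper that ignores punctuation and reports 1-based positions for a word or word string. The sample sentence and the ten-token `doc_a` are hypothetical; `doc_a` is only chosen to be consistent with the positions cited in the steps (X at 1, 4, and 9; "Y Z" at 2 and 7), since the example documents themselves are not reproduced in this text.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def positions(tokens: list[str], phrase: list[str]) -> list[int]:
    """Return the 1-based positions where `phrase` starts in `tokens`."""
    n = len(phrase)
    return [i + 1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


# Hypothetical sentence: punctuation does not consume a position.
king_positions = positions(tokenize("The king is dead. Long live the king!"), ["king"])

# Hypothetical document A, consistent with the positions cited in the steps.
doc_a = "X Y Z X A Z Y Z X B".split()
x_positions = positions(doc_a, ["X"])
xy_positions = positions(doc_a, ["X", "Y"])
yz_positions = positions(doc_a, ["Y", "Z"])

# A word or word string is analyzed further only if it occurs more than once.
print(king_positions, x_positions, xy_positions, yz_positions)
```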
  • Step 3 Possible second language translations for first language word X at position 1 are returned: applying that range to document B yields words at positions 1 and 2 (1 +/- 1) in document B: AA and BB (located at positions 1 and 2 in document B). All possible combinations of this word are returned as a potential translation for X: AA, BB, and AA BB (as a word string combination). The word string combination is returned as a possible match to accommodate the fact that a word in one language may equate to a phrase in the second language.
  • X1: the first occurrence of word X
  • Step 4 The next position of word X is analyzed. This word (X2) occurs at position 4. Since position 4 is near the center of the document, the range (as determined above) will be two words. Possible translations are returned by looking at word 4 in document B and applying the range (2) - hence, two words before word 4 and two words after word 4 are returned. Thus, words at positions 4 +/- 2 are returned, or at positions 2, 3, 4, 5, and 6. These positions correspond to words BB, CC, AA, EE, and FF in document B.
  • X2 returns BB, CC, AA, EE, FF, BB CC, BB CC AA, BB CC AA EE, BB CC AA EE FF, CC AA, CC AA EE, CC AA EE FF, AA EE, AA EE FF, and EE FF as associations.
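Steps 3 and 4 return every contiguous word string inside the range window. A minimal sketch of that enumeration follows; the function name `window_candidates` is hypothetical, and `doc_b` is reconstructed from the positions cited in the steps (positions 7 and 8 are inferred from the later Y Z returns and should be read as an assumption).

```python
def window_candidates(doc_b: list[str], center: int, radius: int) -> list[str]:
    """Return every contiguous word string within `radius` words of `center`.

    Positions are 1-based; the window is clipped to the document bounds.
    """
    lo = max(1, center - radius)
    hi = min(len(doc_b), center + radius)
    window = doc_b[lo - 1:hi]
    candidates = []
    for start in range(len(window)):
        for end in range(start + 1, len(window) + 1):
            candidates.append(" ".join(window[start:end]))
    return candidates


# Document B as implied by the cited positions (an assumption, see above).
doc_b = "AA BB CC AA EE FF GG CC".split()

x1_returns = window_candidates(doc_b, center=1, radius=1)  # positions 1-2
x2_returns = window_candidates(doc_b, center=4, radius=2)  # positions 2-6
print(x1_returns)
print(x2_returns)
```

A window of five words yields fifteen contiguous strings, which agrees with the fifteen associations listed for X2.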
  • Step 5 The returns of the first occurrence of X (position 1) are compared to the returns of the second occurrence of X (position 4) and matches are determined. In this case the associations for X1 and X2 are compared, and the matches in the two documents provided. Note that identical returns (or word occurrences or word strings) in the overlap between the two ranges can be reduced to a single occurrence.
  • the word at position 2 is BB; this is returned both for the first occurrence of X (when operated on by the range) and the second occurrence of X (when operated on by the range). Because this same word position is returned for both X1 and X2, the word is counted as one occurrence. If, however, the same word is returned but not in an overlap area (i.e., the same word position is not returned for both X1 and X2, but the results happen to return the identical word), then the word is counted twice. In this case the return for word X is AA, since that word (AA) occurs in both association returns for X1 and X2. Note that the other word that occurs in both association returns is BB; however, as described above, since that word is the same position (and hence the same word) reached by the operation of the range on the first and second occurrences of X, the word can be disregarded.
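One way to read the overlap rule in Step 5 is to remember where each returned word string starts in document B, and to accept a string as a match only when the two occurrences reach it at different positions. The sketch below takes that reading; the helper names are hypothetical and `doc_b` is the same reconstructed document assumed from the cited positions.

```python
def candidates_with_positions(doc_b: list[str], center: int, radius: int) -> dict[str, set[int]]:
    """Map each contiguous word string in the window to its 1-based start positions."""
    lo = max(1, center - radius)
    hi = min(len(doc_b), center + radius)
    found: dict[str, set[int]] = {}
    for start in range(lo, hi + 1):
        for end in range(start, hi + 1):
            phrase = " ".join(doc_b[start - 1:end])
            found.setdefault(phrase, set()).add(start)
    return found


def matches(first: dict[str, set[int]], second: dict[str, set[int]]) -> list[str]:
    """Word strings returned for both occurrences at differing positions.

    A string reached only at the same position by both windows lies in the
    overlap area and is treated as a single occurrence, so it is not a match.
    """
    return sorted(
        phrase
        for phrase in first.keys() & second.keys()
        if any(a != b for a in first[phrase] for b in second[phrase])
    )


doc_b = "AA BB CC AA EE FF GG CC".split()  # assumed from the cited positions
x1 = candidates_with_positions(doc_b, center=1, radius=1)
x2 = candidates_with_positions(doc_b, center=4, radius=2)
print(matches(x1, x2))
```

Here BB is common to both windows but only at position 2, so it is disregarded, while AA is reached at position 1 by the first window and position 4 by the second.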
  • Step 6 The next position of word X (position 9) (X3) is analyzed.
  • Step 8 Because no more occurrences of word X occur, the process is incremented by a word and a word string (or phrase) is tested. In this case the word string examined is "X Y", the first two words in document A. The same technique described in steps 2-7 is applied to this phrase.
  • Step 9 By looking at document A, we see that there is only one occurrence of the word string X Y. At this point the incrementing process stops and no database creation occurs. Because an end-point has been reached, the next word is examined (this process occurs whenever no matches occur for a word string); in this case the word in position 2 of document A is "Y".
  • Step 10 Applying the process of steps 2-7 for the word "Y" yields the following:
  • Step 11 End of range incrementation: Because the only possible match for word Y (word CC) occurs at the end of the range for the first occurrence of Y (CC occurred at position 3 in document B), the range is incremented by 1 at the first occurrence to return positions 1, 2, 3, and 4: AA, BB, CC, and AA; or the following forward permutations: AA, BB, CC, AA BB, AA BB CC, AA BB CC AA, BB CC, BB CC AA, and CC AA. Applying this result still yields CC as a possible translation for Y. Note that the range was incremented because the returned match was at the end of the range for the first occurrence (the base occurrence for word "Y"); whenever this pattern occurs an end of range incrementation will occur as a sub-step (or alternative step) to ensure completeness.
  • Step 12 Since no more occurrences of "Y" exist in document A, the analysis increments one word in document A and the word string "Y Z" is examined (the next word after word Y). Incrementing to the next string (Y Z) and repeating the process yields the following:
  • Word string Y Z occurs twice in document A: position 2 and 7;
  • Possibilities for Y Z at the first occurrence are AA, BB, CC, AA BB, AA BB CC, BB CC;
  • Possibilities for Y Z at the second occurrence are EE, FF, GG, CC, EE FF, EE FF GG, EE FF GG CC, FF GG, FF GG CC, and GG CC;
  • Extending the range yields the following for Y Z: AA, BB, CC, AA BB, AA BB CC, AA BB CC AA, BB CC, BB CC AA, and CC AA.
  • Step 13 Since no more occurrences of "Y Z" exist in document A, the analysis increments one word in document A and the word string "Y Z X" is examined (the next word after word Z at position 3 in document A). Incrementing to the next phrase (Y Z X) and repeating the process (Y Z X occurs twice in document A) yields the following:
  • Permutations are EE, FF, GG, CC, EE FF, EE FF GG, EE FF GG CC, FF GG, FF GG CC, and GG CC.
  • Step 14 Incrementing to the next word string (Y Z X A) finds only one occurrence; therefore the word string database creation is completed and the next word is examined: Z (position 3 in document A).
  • Step 15 Applying the steps described above for Z, which occurs 3 times in document A, yields the following:
  • Returns for Z1 are: AA, BB, CC, AA, EE, AA BB, AA BB CC, AA BB CC AA, AA BB CC AA EE, BB CC, BB CC AA, BB CC AA EE, CC AA, CC AA EE, and AA EE;
  • Step 16 Incrementing to the next word string yields the word string Z X, which occurs twice in document A. Applying the steps described above for Z X yields the following:
  • Returns for Z X1 are: BB, CC, AA, EE, FF, BB CC, BB CC AA, BB CC AA EE, BB CC AA EE FF, CC AA, CC AA EE, CC AA EE FF, AA EE, AA EE FF, and EE FF.
  • Step 17 Incrementing, the next phrase is Z X A, which occurs only once, so the next word (X) in document A is examined.
  • Step 18 Word X has already been examined in the first position. However, the second position of word X, relative to the other document, has not been examined for possible returns for word X. Thus word X (in the second position) is now operated on as in the first occurrence of word X, going forward in the document:
  • Step 19 Incrementing to the next word string (since no more occurrences of
  • Step 20 Applying the process described above for the second occurrence of word Z yields the following:
  • Step 21 Incrementing by one word yields the word string Z X; this word string does not occur in any more (forward) positions in document A, so the process begins anew at the next word in document A - "X". Word X does not occur in any more (forward) positions of document A, so the process begins anew. However, the end of document A has been reached and the analysis stops.
  • Step 22 The final association frequency is tabulated combining all the results from above. There is insufficient data to return results for other words and phrases in document A. Note that many possible associations occur for word CC in document B, as either an individual word or a word string in document A. As more document pairs are examined containing word CC in language B, the association frequencies will become statistically more reliable such that a word (or, possibly a word string) will exist as the translation for word CC. In another embodiment, the database creation technique of the present invention may be utilized in a variety of ways to create the cross-language associations.
  • the database may be created by simply matching every word and word string (or phrase) occurring in document A with a range of words in document B (using the range techniques described above), without comparison to multiple occurrences of the word, and without range incrementing techniques.
  • This method utilizes the principle of cross-language association to create the database in a different manner than that described above.
  • the word count in each document is established to create an appropriate ratio.
  • the ratio is used for comparative range positioning, as described below. In this example, document A has twenty words, while document B has fifteen words, for a ratio of 4:3. Thus, every four words of document A equate to three words of document B.
  • a segment of words is established for the word strings, or phrases, to be examined.
  • This segment can be determined according to common language rules; e.g., a segment can be a sentence or paragraph.
  • the length of the segment is user defined and can be any fragment of word strings desired.
  • the segments will correspond to the sentences in each respective document, although larger segments are usually more effective than single sentences to create the associations of the present invention because there exists a larger base of potential associations to fill the database.
  • Positions of words are determined by their respective word count location in any document. Using the example, the positions of the word "the" are one, five, nine, and fifteen (the first, fifth, ninth, and fifteenth words in the document).
  • the target words are determined by using the word ratio to determine the relative point in document B, and applying the range to that word position in document B (the range is user defined as described in the first embodiment).
  • the relative position of a word in document B is determined by applying the ratio calculated above.
  • the word "the" occurs in the first, fifth, ninth, and fifteenth positions of document A. These positions correspond to relative positions 1, 4, 7, and 11 of document B.
  • Frequency range applied to preceding and following words in document B equals word positions 1-3 in Document B. This determination occurs by taking the position +/- the frequency range, or 1 +/- 2, or -1 through 3. Ignoring the negative and null positions returns a word position result of 1-3 in Document B.
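The ratio-based positioning and the clipped range can be sketched as below. The text does not state a rounding rule, so rounding halves up is an assumption; it happens to reproduce the relative positions 1, 4, 7, and 11 given for "the". Both function names are hypothetical.

```python
def relative_position(pos_a: int, len_a: int, len_b: int) -> int:
    """Map a 1-based position in document A to a relative position in document B.

    Scales by the word-count ratio len_b / len_a and rounds halves up using
    integer arithmetic (the rounding rule is an assumption).
    """
    return max(1, (2 * pos_a * len_b + len_a) // (2 * len_a))


def clipped_window(center: int, radius: int, doc_length: int) -> tuple[int, int]:
    """Return the inclusive position range center +/- radius, ignoring
    negative, null, and past-the-end positions."""
    return max(1, center - radius), min(doc_length, center + radius)


# Document A has twenty words and document B fifteen, a ratio of 4:3.
the_positions_a = [1, 5, 9, 15]
the_positions_b = [relative_position(p, len_a=20, len_b=15) for p in the_positions_a]

# Applying a range of two around relative position 1 in document B.
first_window = clipped_window(center=1, radius=2, doc_length=15)
print(the_positions_b, first_window)
```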
  • Matches are AAA (twice), BB and CCC.
  • the present invention increments the number of words examined by one. In the first example the word examined was "the" (the first word in Document A). Incrementing, the next word string to be analyzed is "the sky".
  • the process is then incremented a word and the process repeated for "the sky is.” This process yields as a potential match only the first occurrence AAA BB CCC, as there are no other occurrences.
  • the end of the first segment has been reached, defined by the user as that indicated by punctuation in Document A.
  • the next step is to take the SECOND word in the first segment and continue the iterative process described above - in the example the analysis would include “sky,” “sky is,” and “sky is blue” yielding the following as matches: “sky” occurs in positions 2 and 10 in Document A; which yields 2 and 7 as relative positions in Document B; which yields AAA BB CCC AAA as the first match and EEE DDDD AAA BB and FFF as the second match; which yields AAA and BB as possible associations to be stored in the database.
  • The next incremented word in segment one is "blue," which yields AAA BB CCC AAA and EEE as possible associations to be stored in the database.
  • the analysis is now up to the end of segment one.
  • the next segment is the sentence "the grass is green.” Since "the” has already been analyzed the next word portion to be analyzed is “the grass,” followed by "the grass is”, “the grass is green”, “grass”, “grass is”, “grass is green”, and "green.”
  • the sky includes clouds and stars
  • segment sentences (“Went to school today. She walked to the school on the street.") can be analyzed by extending the segment to incorporate the person ("she") into the first sentence when the present invention acts to translate languages.
  • these two embodiments are representative of the technique used to create associations.
  • the techniques of the present invention need not be limited to language translation; in a broad sense, the techniques will apply to any two embodiments of the same idea that may be associated, for at its essence foreign language translation merely exists as a paired association with one idea (the word or phrase).
  • the present invention may be applied to associating data, sound, music, video, or any wide ranging concept that exists as an idea, including ideas that can represent any sensory (sound, sight, smell, etc.) experiences. All that is required is that the present invention analyze two embodiments (in language translation, the embodiments are documents; for music, the embodiments might be digital representations of a music score and sound frequencies denoting the same composition, and the like).
  • an embodiment of the present invention loads, by either mechanical, electrical, or other means, certain associations into the database.
  • For example, it is possible to load the database with foreign language equivalents of the English words it, his, her, an, a, of, or any common words, to create the association database more accurately, more efficiently, and with a faster resolution.
  • the present invention would automatically return the foreign language equivalents of certain words loaded into the database.
  • This embodiment allows the association database creation technique ofthe present invention to accommodate common words that may skew the analysis
  • an embodiment can utilize common associations to create and recognize word patterns. For example, it is possible to load associations into the database (e.g., "President" for "Clinton") such that the association database accommodates situations where the text means President Clinton, but only the word "president" is utilized as an abbreviation.
  • the cross-language association exists in its broad sense as a cross-idea association technique for creating a database of possible associations, the results may be manipulated when an association is established.
  • each "idea” is assigned an association to an electromagnetic wave (tone), it will be possible to create an "electromagnetic association” ofthe idea.
  • data in the form of an idea
  • data can be manipulated into electromagnetic waves and transferred at once over conventional telecommunications infrastructure.
  • that machine will synthesize the waves into separate components and, given the associations, present the individual ideas that were represented by the electromagnetic associations.

Abstract

A method and apparatus for creating a cross-idea association database. The cross-idea database correlates words and phrases in one language, corresponding to information in one state, to words and phrases in a second language, corresponding to information in a second state. The method includes receiving content expressed in a first state and receiving content expressed in a second state, analyzing the content expressed in said first state with said content expressed in said second state. Analyzing involves utilizing segments of content expressed in a first state and segments of content expressed in said second state. The method also includes creating an association database of the content in said first state as related to said content in said second state.

Description

CROSS-IDEA ASSOCIATION DATABASE CREATION
Statement of the Invention
This invention relates to a method and apparatus for creating a cross-idea association database for converting, manipulating, and/or translating information from one state to a second state. In one embodiment ofthe present invention, the two states represent word languages (e.g., English, Hebrew, Chinese, etc.) such that the present invention creates a cross-language database correlating words and phrases in one language to their translation counterparts in a second language. In this example, the present invention creates a database by examining documents in the two languages and creating a database of translations for each word or phrase in both languages. However, the present mvention need not be limited to language translation. The present invention allows a user to create a database of ideas, and associate those ideas to other, differing ideas in a hierarchical manner. Thus, ideas are associated with other ideas and rated according to the frequency ofthe occurrence. The specific weight given to the occurrence frequency, and the use applied to the database thus created, can vary depending upon the user's requirements.
For example, in the context of converting text from one language to another, the present invention will operate to create foreign language translations of words and strings of words in the English language. The present invention will return a ranking of associations to those words (or strings of words); e.g., the word occurring the most often will be the foreign language equivalent of the word (in English), given a large enough sample size. However, the present invention will also return other foreign language associations with the English word, and the user may manipulate those associations as desired. For example, the word "mountain," when operated on according to the present invention, may return a list of foreign language words in the language being examined. The foreign language equivalent of the word "mountain" will most likely be ranked the highest; however, the present invention will return other foreign language words associated with "mountain," such as "snow" or "ski." These words, which may or may not be ranked lower than the translation of "mountain," can be manipulated as desired by the user. Thus, the present invention is an automated association database creator. The strongest associations represent "translations" in one sense, but other frequent (but weaker) associations represent ideas that are closely related to the idea being examined. The purpose of the present invention is to develop a database of associations of words and phrases (strings of words) between one language and a second language. In general, the method involves examining and operating on two documents, each containing text which represents the same concept or content, but in two different languages. The method and apparatus of the present invention is utilized such that a database is created with associations across the two languages - translations, or more specifically, possible associations for words and phrases.
The translation and other relevant associations for words and phrases between the two languages become stronger, i.e. more frequent, as more documents are examined and operated on by the present invention, such that by operation on a large enough "sample" of documents the most common (and, in one sense, the correct) associations become apparent and the method and apparatus can be utilized for translation purposes. The preferred embodiment of the present invention utilizes a computing device such as a personal computer system of the type readily available in the prior art. However, the method and apparatus of the present invention does not need to use such a computing device and can readily be accomplished by other means, including manual creation of the cross-associations. The method by which successive documents are examined to enlarge the "sample" of documents and create the cross-association database is varied - the documents can be set up for analysis and manipulation manually, by automatic feeding (such as automatic paper loaders as known in the prior art), or by using search techniques on the Internet to automatically seek out the related documents.
Note that in the following discussion the term "documents" is used interchangeably to refer generally to the pair of items (books, articles, letters, and the like) which represent the same concept or content, but for the fact that one is in one language and the other is in a second language. In addition, whenever the present invention is deemed to operate on a word, it is clear that the same technique will work on phrases or other word strings, and is not limited to just one word.
The provisional application incorporates by reference Provisional Application No. 60/276,107 entitled "Method and Apparatus for Content Manipulation," filed on March 16, 2001. (Attached hereto).
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION
A preferred embodiment of the present invention will now be described. The present invention may be utilized on a common computer system having at least a display means, an input method, an output method, and a processor. The display means can be any of those readily available in the prior art, such as cathode ray terminals, liquid crystal displays, flat panel displays, and the like. The processor means also can be any of those readily available and used in a computing environment such that the means is supplied to allow the computer to operate to perform the present invention. Finally, an input method is utilized to allow the input of the documents for the purposes of building the cross-association database; as described above the specific input method can vary depending on the needs of the user.
According to the present invention, the documents are examined for the purpose of building the database. After document input (again, of the pair of documents representing the same text in two different languages), the creation process begins using the methods and/or apparatus described herein.
For illustrative purposes, assume that the documents contain the same words (or, in a general sense, idea) in two different languages. Document A is in language A, Document B is in language B. The documents have the following text:
Figure imgf000005_0001
The first step in the present invention is to calculate a word range to determine the approximate associations for any given word or phrase. Since a word-for-word translation is not appropriate (i.e., word 1 in document A most likely will not exist as the literal translation of word 1 in document B), the database creation technique of the present invention tests each word in the first language against a range of words in the second language. This range thus is developed by examining the two documents, and is used to compare the words, phrases, or other word strings in the SECOND document against the words, phrases, or other word strings in the FIRST document. That is, a range of words (or phrases, or word strings) in the second document is applied as a possible match against any one word (or phrase, or word string) in the first document. By testing against a range, the database creation technique establishes a number of second language words that may equate and translate to the first language words.
The value of the range is, ultimately, user defined. Various techniques can be used to determine the value of the range, including common statistical techniques such as the derivation of a bell curve based on the number of words in a document. With a statistical technique such as a bell curve, the range at the beginning and end of the document will be smaller than the range in the middle of the document. A bell-shaped frequency for the range words allows reasonable extrapolation of possible word translations, whether it is derived according to the number of words in a document or according to the percentage of coverage of number of words desired. Other methods to calculate the range exist, such as a "stair" technique of having the range exist at one level for a certain percentage of words, a second higher level for another percentage of words, and a third level equal to the first level for the last percentage of words. Again, the range is user defined or established according to other possible parameters.
The value of the range may depend on the number of words in the two documents. If the word counts of the two documents are equal, any value may be given. Applying statistical techniques, a bell curve may be created such that the range is a lower number of words at the beginning of the document, the highest number of words in the middle of the document, and a lower number of words at the end of the document.
If the number of words in the two documents is not equal, then a ratio may be used to correctly position the range. For example, if document A has 75 words and document B has 100 words, the ratio between the two documents is 3:4. The mid-point of document A is word position 37 (or 38); however, using this mid-point (word position 37 or 38) as the placement for the largest value of the range (if determined according to a bell curve technique) in document B is not effective, since this position (word position 37 or 38) is not the midpoint of document B. Instead, the point of maximum application of the range value in document B may be determined by the ratio of words between the two documents, by manual placement in the mid-point of document B, or by other techniques.
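The text leaves the exact shape of the range schedule to the user, so the following is only a minimal sketch of one bell-shaped possibility. The function name `bell_range`, the Gaussian weighting, the spread of one quarter of the document length, and the default minimum and maximum values are all assumptions made for illustration and are not taken from the specification.

```python
import math

def bell_range(position, n_words, min_range=1, max_range=2, peak=None):
    """Hypothetical range schedule: smallest near the ends, largest at `peak`.

    `position` and `peak` are 1-based word positions. When `peak` is omitted,
    the middle of the document is used.
    """
    if peak is None:
        peak = (n_words + 1) / 2
    sigma = n_words / 4  # assumed spread of the bell curve
    weight = math.exp(-((position - peak) ** 2) / (2 * sigma ** 2))
    return min_range + round((max_range - min_range) * weight)

# Peak placement by ratio for unequal documents (75 words versus 100 words):
# the middle of document A (position 37.5) scales to position 50 in document B.
peak_in_b = 37.5 * 100 / 75
ranges_for_ten_words = [bell_range(p, 10) for p in range(1, 11)]
```

Passing `peak=peak_in_b` with `n_words=100` places the widest range at the middle of the longer document rather than at position 37 or 38. A "stair" schedule could replace the Gaussian weight with fixed levels per percentage band.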
At the heart of the present invention is the creation of association frequencies for each possible translation. By looking at the position of a word in the document, and applying the range as described above, the database creation technique of the present invention returns a possible set of words in the second-language document that translate to the word in the first document. As the database creation technique of the present invention is utilized, the possible set of words will be narrowed and an association frequency will be developed that will assist in the determination of the potential translation. Thus, after examining a pair of documents, the present invention will create association frequencies for a word (or phrase, or word string) in one language to that same word (or phrase, or word string) in a second language. After a number of document pairs are examined according to the present invention (and thus a large sample created), the cross-language association database creation technique will return higher and higher association frequencies for any one word, phrase or word string. The highest association frequency after a large enough sample is reviewed results in a translation; of course, the ultimate point where the association frequency is deemed to be an accurate translation is user defined and subject to other interpretive translation techniques (such as those described in Provisional Application No. 60/276,107, entitled "Method and Apparatus for Content Manipulation" filed on March 16, 2001 and incorporated herein by reference). For example, after examining a number of documents the following association frequencies could result for the Spanish equivalent to the English word "friend": "gato" - 25%; "burro" - 15%; and "amigo" - 60%. As more document pairs are examined, the present invention operation will increase the association frequency for "amigo" and decrease the association frequencies for "gato" and "burro."
At a user-defined point, the association frequency will reach a level such that a translation is deemed to have occurred, i.e., the word "friend" in English translates to "amigo" in Spanish.
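The "friend" example can be illustrated with a short sketch that turns raw association counts into frequencies and applies a user-defined acceptance level. The function names, the raw counts (chosen so that they reproduce the 25%, 15% and 60% figures), and the threshold values are assumptions for illustration only.

```python
def association_frequencies(counts):
    """Convert raw association counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}

def accepted_translation(frequencies, threshold):
    """Return the strongest association if it reaches the user-defined level."""
    if not frequencies:
        return None
    best = max(frequencies, key=frequencies.get)
    return best if frequencies[best] >= threshold else None

# Hypothetical counts for the English word "friend".
friend_counts = {"gato": 25, "burro": 15, "amigo": 60}
friend_frequencies = association_frequencies(friend_counts)
```

With a threshold of 0.5 the sketch accepts "amigo"; with a stricter threshold it returns `None`, which corresponds to waiting for more document pairs.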
As indicated above, the invention tests not only words but phrases, or strings of words (multiple words). After a single word is analyzed, the database creation technique of the present invention analyzes a two-word word string, then three-word word string, and so on in an incremental manner. This technique makes possible the translation of phrases or word strings in one language into one word in another, as often occurs. The analysis stops when all positions for the word or word string have been analyzed, if the number of words (or word strings) is greater than one. If a word only occurs once in a document, the process immediately proceeds to increment a word and return a word string. When a word string only occurs once, the process cycles back to the second word in the document, where the analysis cycle occurs again as described above. Note that variations on this process may occur to accommodate situations where a word only occurs once in the two documents to be examined. For example, if the word only occurs once in a document, a variation of the present invention would allow the analysis to operate on other documents to search for relevant words or word strings. In a sense, any number of documents can be aggregated and treated as one single document for the operation of the present invention. In addition, as another embodiment it is possible to work on the entire document to accommodate situations where a word only occurs once.
The incrementing, testing and return process occurs in a similar manner for word strings. Thus, the number of occurrences for any phrase is examined, phrases are returned based on the range, and a database is created of possible translations for that phrase.
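The incremental growth of word strings can be sketched as follows. The ten-word `DOC_A` is an assumption: it is reconstructed from the word positions quoted in the step-by-step example later in this description, not copied from the figure. The helper names are hypothetical, and stopping as soon as a string occurs only once is one reading of the procedure.

```python
# Assumed reconstruction of document A from the positions quoted in the example.
DOC_A = "X Y Z X A B Y Z X Z".split()

def occurrences(doc, phrase):
    """1-based start positions at which the word list `phrase` occurs in `doc`."""
    n = len(phrase)
    return [i + 1 for i in range(len(doc) - n + 1) if doc[i:i + n] == phrase]

def repeated_strings_from(doc, start):
    """Grow the word string beginning at 1-based `start` one word at a time,
    keeping each string that occurs more than once and stopping at the first
    string that occurs only once."""
    found = []
    length = 1
    while start - 1 + length <= len(doc):
        phrase = doc[start - 1:start - 1 + length]
        positions = occurrences(doc, phrase)
        if len(positions) < 2:
            break
        found.append((" ".join(phrase), positions))
        length += 1
    return found

strings_from_second_word = repeated_strings_from(DOC_A, 2)
```

Starting at the second word, the sketch keeps "Y", "Y Z" and "Y Z X" and stops at "Y Z X A", which occurs only once.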
In addition, the present invention can operate in such a manner so as to analyze word strings that depend on the correct positioning of words (in that word string), and can operate in such a manner so as to account for grammatical idiosyncrasies such as phrasing, style, or abbreviations.
The present invention can accommodate different variations that occur in documents where subsets of words occur within larger word strings. For example, proper names are sometimes presented complete (as in "John Doe"), abbreviated by first or surname ("John" or "Doe"), or abbreviated by another manner ("Mr. Doe"). The present invention accounts for these patterns by recognizing, through the analysis, the existence of these patterns in the association database, and manipulating the frequency return. Since the present invention will most likely return more individual word returns than word string returns (i.e., more returns for the first or surnames rather than the full name word string "John Doe"), because the words that make up a word string will necessarily be counted individually as well as part of the phrase, a change in ranking may be utilized. For example, in any document the name "John Doe" might occur one hundred times, while "John" by itself might occur one hundred-twenty times, and "Doe" by itself might occur one hundred-ten times. The normal translation return (according to the present invention) will rank "John" higher than "Doe," and both of those words higher than the word string "John Doe" - all when attempting to analyze the word string "John Doe." By operation through subtraction of the number of the larger word string occurrences from the returns for the subset (or individual returns) the proper ordering may be accomplished (although, of course, other methods may be utilized to obtain a similar result). Thus, subtracting one hundred (the number of occurrences for "John Doe"), from one-hundred twenty (the number of occurrences for the word "John"), the corrected return for "John" is twenty. Applying this analysis yields one-hundred as the number of occurrences for the word string "John Doe" (when analyzing and attempting to translate this word string), twenty for the word "John," and ten for the word "Doe," thus creating the proper associations.
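The subtraction in the "John Doe" example can be sketched as below. The function names are hypothetical, and the sketch is a simplification: it subtracts the count of every longer string containing the phrase, which matches the example but would over-subtract if several nested levels (for instance "A", "A B" and "A B C") were all present.

```python
def contains(longer, shorter):
    """True if the words of `shorter` appear contiguously inside `longer`."""
    lw, sw = longer.split(), shorter.split()
    if len(lw) <= len(sw):
        return False
    return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))

def corrected_counts(raw):
    """Subtract from each phrase the counts of the longer strings containing it."""
    corrected = {}
    for phrase, count in raw.items():
        inside_longer = sum(c for other, c in raw.items() if contains(other, phrase))
        corrected[phrase] = count - inside_longer
    return corrected

# Counts taken from the example in the text.
raw_counts = {"John Doe": 100, "John": 120, "Doe": 110}
ranked = sorted(corrected_counts(raw_counts).items(), key=lambda kv: kv[1], reverse=True)
```

After correction the full name ranks first, followed by "John" and then "Doe", which is the ordering the text describes as proper.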
An embodiment of the present invention will now be described using the two documents described above as an example - the table is re-created as follows:
Figure imgf000010_0001
Using the two documents listed above (A, the first language and B, the second language), the following steps occur for the database creation technique.
Step 1. First, a range is determined. As indicated, the range may be user defined or may be approximated by a variety of methods. The word count of the two documents is approximately equal (ten words in document A, eight words in document B); a range value of three (thirty percent of the words in document A) may provide the best results. In this example, to approximate a bell curve the range will be one at the beginning and end of the document, and two in the middle. However, as indicated, the range (or the method used to determine the range) may be entirely user defined.
Thus, for this example the range will vary from one word, to two words, to one word as the database creation technique of the present invention is utilized.
Step 2. Next, the first word in document A is examined and tested against document A to determine the number of occurrences of that word in the document. In this example the first word in document A is X: X occurs three times in document A, at positions 1, 4, and 9. The position numbers of a word, phrase, or other word string are simply a notation of the number of times that word, phrase, or word string is present in the document, and the location of that word, phrase, or word string in the document relative to other words. Thus, the position numbers correspond to the number of words in a document, ignoring punctuation - for example, if a document has ten words in it, and the word "king" appears twice, the position numbers of the word "king" are merely the places (out of ten words) where the word appears. Because word X occurs more than once in the document, the process proceeds to the next step. If word X only occurred once, then that word would be skipped and the process expanded to the next word string (or phrase) and the creation process continued.
Step 3. Possible second language translations for first language word X at position 1 are returned: applying that range to document B yields words at positions 1 and 2 (1 +/- 1) in document B: AA and BB (located at positions 1 and 2 in document B). All possible combinations of these words are returned as a potential translation for X: AA, BB, and AA BB (as a word string combination). The word string combination is returned as a possible match to accommodate the fact that a word in one language may equate to a phrase in the second language. Thus, X1 (the first occurrence of word X) returns AA, BB, and AA BB as associations.
Step 4. The next position of word X is analyzed. This word (X2) occurs at position 4. Since position 4 is near the center of the document, the range (as determined above) will be two words. Possible translations are returned by looking at word 4 in document B and applying the range (2) - hence, two words before word 4 and two words after word 4 are returned. Thus, words at positions 4 +/- 2 are returned, or at positions 2, 3, 4, 5, and 6. These positions correspond to words BB, CC, AA, EE, and FF in document B. All forward permutations of these words (and their combined word strings) are considered: BB, CC, AA, EE, FF, BB CC, BB CC AA, BB CC AA EE, BB CC AA EE FF, CC AA, CC AA EE, CC AA EE FF, AA EE, AA EE FF, and EE FF. Thus, X2 returns BB, CC, AA, EE, FF, BB CC, BB CC AA, BB CC AA EE, BB CC AA EE FF, CC AA, CC AA EE, CC AA EE FF, AA EE, AA EE FF, and EE FF as associations.
Step 5. The returns of the first occurrence of X (position 1) are compared to the returns of the second occurrence of X (position 4) and matches are determined. In this case the associations for X1 and X2 are compared, and the matches in the two documents provided. Note that identical returns (or word occurrences or word strings) in the overlap between the two ranges can be reduced to a single occurrence. For example, in this example the word at position 2 is BB; this is returned both for the first occurrence of X (when operated on by the range) and the second occurrence of X (when operated on by the range). Because this same word position is returned for both X1 and X2, the word is counted as one occurrence. If, however, the same word is returned but not in an overlap area (i.e., the same word position is not returned for both X1 and X2, but the results happen to return the identical word), then the word is counted twice. In this case the return for word X is AA, since that word (AA) occurs in both association returns for X1 and X2.
Note that the other word that occurs in both association returns is BB; however, as described above, since that word is the same position (and hence the same word) reached by the operation of the range on the first and second occurrences of X, the word can be disregarded.
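Steps 3 through 5 can be sketched as follows. The eight-word `DOC_B` is an assumption, reconstructed from the positions and words quoted in the steps rather than copied from the figure, and the function names are hypothetical. A match is kept only when the same text is returned from different positions, which is one reading of the overlap rule described above.

```python
# Assumed reconstruction of document B from the words quoted in the steps.
DOC_B = "AA BB CC AA EE FF GG CC".split()

def window(center, rng, doc_len):
    """1-based positions center +/- rng, truncated to the document."""
    return [p for p in range(center - rng, center + rng + 1) if 1 <= p <= doc_len]

def candidates(doc, positions):
    """All forward (contiguous) word strings in the window, as (start, end, text)."""
    out = []
    for i, start in enumerate(positions):
        for end in positions[i:]:
            out.append((start, end, " ".join(doc[start - 1:end])))
    return out

def matches(first, second):
    """Texts returned for both occurrences, ignoring returns that are the very
    same word positions reached from both ranges."""
    return {t1 for (s1, e1, t1) in first for (s2, e2, t2) in second
            if t1 == t2 and (s1, e1) != (s2, e2)}

x1 = candidates(DOC_B, window(1, 1, len(DOC_B)))  # X at position 1, range 1
x2 = candidates(DOC_B, window(4, 2, len(DOC_B)))  # X at position 4, range 2
x_matches = matches(x1, x2)
```

Under these assumptions `x1` holds AA, BB and AA BB, `x2` holds the fifteen strings listed in Step 4, and the only surviving match is AA, because BB is reached at position 2 from both ranges.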
Step 6. The next position of word X (position 9) (X3) is analyzed.
Applying a range of 1 (near the end of the document) returns values at the following positions of document B: 8, 9, and 10. Since document B has only 8 positions, the results are truncated and only word position 8 is returned as possible values for X: CC.
Comparing to the first return for X (X1) returns no matches. Thus, since no match occurs, the value returned for X3 - here CC - is disregarded and an associative match is not provided.
Step 7. The next position of word X is analyzed; however, there are no more occurrences of word X in document A. At this point an association frequency is established for word X and the following database has been created as possible translations for X: AA. Thus, at this point there is an association of X to AA.
Step 8. Because no more occurrences of word X occur, the process is incremented by a word and a word string (or phrase) is tested. In this case the word string examined is "X Y", the first two words in document A. The same technique described in steps 2-7 is applied to this phrase.
Step 9. By looking at document A, we see that there is only one occurrence of the word string X Y. At this point the incrementing process stops and no database creation occurs. Because an end-point has been reached, the next word is examined (this process occurs whenever no matches occur for a word string); in this case the word in position 2 of document A is "Y".
Step 10. Applying the process of steps 2-7 for the word "Y" yields the following:
• Two occurrences of word Y (positions 2 and 7) exist, so the database creation process continues (again, if Y only occurred once in document A, then Y would not be examined);
• The range at position 2 is 1 word;
• Application of range to document B (position 2, the location of the first occurrence of word Y) returns results at position 1, 2, and 3 in document B;
• The corresponding foreign language words in those returned positions are: AA, BB, and CC;
• Applying forward-permutations yields the following possibilities for Y1: AA, BB, CC, AA BB, AA BB CC, and BB CC;
• The next position of Y is analyzed (position 7);
• The range at position 7 is 2 words;
• Application of that range to document B (position 7) returns results at positions 5, 6, 7, and 8: EE FF GG and CC;
• All forward permutations yield the following possibilities for Y2: EE, FF, GG, CC, EE FF, EE FF GG, EE FF GG CC, FF GG, FF GG CC, and GG CC;
• Matching results from Y1 returns CC as the only match;
• Combining matches for Y1 and Y2 yields CC as an association frequency for Y, at one value.
Step 11. End of range incrementation: Because the only possible match for word Y (word CC) occurs at the end of the range for the first occurrence of Y (CC occurred at position 3 in document B), the range is incremented by 1 at the first occurrence to return positions 1, 2, 3, and 4: AA, BB, CC, and AA; or the following forward permutations: AA, BB, CC, AA BB, AA BB CC, AA BB CC AA, BB CC, BB CC AA, and CC AA. Applying this result still yields CC as a possible translation for Y. Note that the range was incremented because the returned match was at the end of the range for the first occurrence (the base occurrence for word "Y"); whenever this pattern occurs an end of range incrementation will occur as a sub-step (or alternative step) to ensure completeness.
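The end of range incrementation in Step 11 can be sketched as below. `DOC_B` is again an assumed reconstruction from the words quoted in the steps, the function names are hypothetical, and the sketch extends only the trailing edge of the window by one word, which is how the example for word Y reads.

```python
# Assumed reconstruction of document B from the words quoted in the steps.
DOC_B = "AA BB CC AA EE FF GG CC".split()

def window(center, before, after, doc_len):
    """1-based positions from center - before to center + after, truncated."""
    return [p for p in range(center - before, center + after + 1) if 1 <= p <= doc_len]

def window_with_edge_extension(center, rng, match_positions, doc_len):
    """Extend the window one word forward when a match lies on its last position."""
    positions = window(center, rng, rng, doc_len)
    if positions and positions[-1] in match_positions and positions[-1] < doc_len:
        positions = window(center, rng, rng + 1, doc_len)
    return positions

# Word Y at position 2 with range 1; the match CC sits at position 3.
extended = window_with_edge_extension(2, 1, {3}, len(DOC_B))
extended_words = [DOC_B[p - 1] for p in extended]
```

For word Y the window grows from positions 1-3 to positions 1-4, adding the second AA, and the match on CC is unchanged.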
Step 12. Since no more occurrences of "Y" exist in document A, the analysis increments one word in document A and the word string "Y Z" is examined (the next word after word Y). Incrementing to the next string (Y Z) and repeating the process yields the following:
• Word string Y Z occurs twice in document A: position 2 and 7;
• Possibilities for Y Z at the first occurrence (Y Z1) are AA, BB, CC, AA BB, AA BB CC, BB CC;
• Possibilities for Y Z at the second occurrence (Y Z2) are EE, FF, GG, CC, EE FF, EE FF GG, EE FF GG CC, FF GG, FF GG CC, and GG CC;
• Matches and combination yields CC as a possible translation for word string Y Z;
• Extending the range (the end of range incrementation) yields the following for Y Z: AA, BB, CC, AA BB, AA BB CC, AA BB CC AA, BB CC, BB CC AA, and CC AA.
• Applying the results still yields CC as an association frequency for word string Y Z.
Step 13. Since no more occurrences of "Y Z" exist in document A, the analysis increments one word in document A and the word string "Y Z X" is examined (the next word after word Z at position 3 in document A). Incrementing to the next phrase (Y Z X) and repeating the process (Y Z X occurs twice in document A) yields the following:
• Here the range is 2, since the mid-point of the phrase occurs closer to the midpoint of the document;
• Returns for first occurrence of Y Z X are at positions 2, 3, 4, and 5;
• Permutations are BB, CC, AA, EE, BB CC, BB CC AA, BB CC AA EE, CC AA, CC AA EE, and AA EE;
• Returns for second occurrence of Y Z X are at positions 5, 6, 7, and 8;
• Permutations are EE, FF, GG, CC, EE FF, EE FF GG, EE FF GG CC, FF GG, FF GG CC, and GG CC.
• Comparing the two yields CC as an association frequency for word string Y Z X; again, note that the return of EE as a possible association is disregarded because it occurs in both instances as the same word (i.e., at the same position).
Step 14. Incrementing to the next word string (Y Z X A) finds only one occurrence; therefore the word string database creation is completed and the next word is examined: Z (position 3 in document A).
Step 15. Applying the steps described above for Z, which occurs 3 times in document A, yields the following:
• Returns for Z1 are: AA, BB, CC, AA, EE, AA BB, AA BB CC, AA BB CC AA, AA BB CC AA EE, BB CC, BB CC AA, BB CC AA EE, CC AA, CC AA EE, and AA EE;
• Returns for Z2 are: FF, GG, CC, FF GG, FF GG CC, and GG CC;
• Comparing Z1 and Z2 yields CC as a possible match;
• Returns for Z3 and comparing with Z1 yields CC as an association frequency for word Z.
Step 16. Incrementing to the next word string yields the word string Z X, which occurs twice in document A. Applying the steps described above for Z X yields the following:
• Returns for Z X1 are: BB, CC, AA, EE, FF, BB CC, BB CC AA, BB CC AA EE, BB CC AA EE FF, CC AA, CC AA EE, CC AA EE FF, AA EE, AA EE FF, and EE FF.
• Returns for Z X2 are: FF, GG, CC, FF GG, FF GG CC, and GG CC;
• Comparing yields the association of word string Z X to CC.
• Returns for Z X and comparison yields CC as an association frequency for word string Z X.
Step 17. Incrementing, the next phrase is Z X A, which occurs only once, so the next word (X) in document A is examined.
Step 18. Word X has already been examined in the first position. However, the second position of word X, relative to the other document, has not been examined for possible returns for word X. Thus word X (in the second position) is now operated on as in the first occurrence of word X, going forward in the document:
• Returns for X at position 4 yield: BB, CC, AA, EE, FF, BB CC, BB CC AA, BB CC AA EE, BB CC AA EE FF, CC AA, CC AA EE, CC AA EE FF, AA EE, AA EE FF, and EE FF.
• Returns for X at position 9 yield: GG, CC, and GG CC.
• Comparison of the results of position 9 compared to results for position 4 yields CC as a possible match for word X.
• Returns for X and comparison yield CC as an association frequency for word X.
Step 19. Incrementing to the next word string (since no more occurrences of X occur for comparison to the second occurrence of X) yields the word string X A; however, this word string does not occur more than once in document A so the process turns to examine the next word (A). Word "A" only occurs once in document A, so incrementation occurs - not to the next word string, since word "A" only occurred once, but to the next word in document A - "B". Word "B" only occurs once in document A, so the next word (Y) is examined. Word "Y" does not occur in any other positions higher than position 7 in document A, so next word (Z) is examined. Word "Z" occurs at two more positions in document A - position 8 and 10.
Step 20. Applying the process described above for the second occurrence of word Z yields the following:
• Returns for Z at position 8 yields: FF, GG, CC, FF GG, FF GG CC, and GG CC;
• Returns for Z at position 10 yields: CC;
• Comparing results of position 10 to position 8 yields no matches for word Z. Again, word CC is returned as a possible match; however, since CC represents the same word position reached by analyzing Z at position 8 and Z at position 10, the match is disregarded.
Step 21. Incrementing by one word yields the word string Z X; this word string does not occur in any more (forward) positions in document A, so the process begins anew at the next word in document A - "X". Word X does not occur in any more (forward) positions of document A, so the process begins anew. However, the end of document A has been reached and the analysis stops.
Step 22. The final association frequency is tabulated combining all the results from above. There is insufficient data to return results for other words and phrases in document A. Note that many possible associations occur for word CC in document B, as either an individual word or a word string in document A. As more document pairs are examined containing word CC in language B, the association frequencies will become statistically more reliable such that a word (or, possibly a word string) will exist as the translation for word CC.
In another embodiment, the database creation technique of the present invention may be utilized in a variety of ways to create the cross-language associations. For example, the database may be created by simply matching every word and word string (or phrase) occurring in document A with a range of words in document B (using the range techniques described above), without comparison to multiple occurrences of the word, and without range incrementing techniques. This method utilizes the principle of cross-language association to create the database in a different manner than that described above.
As an example of this embodiment, consider the following example of two documents that represent the same concept or content, but in different languages:
Figure imgf000020_0001
As a first step of this embodiment, the word count in each document is established to create an appropriate ratio. The ratio is used for comparative range positioning, as described below. In this example, document A has twenty words, while document B has fifteen words, for a ratio of 4:3. Thus, every four words of document A equate to three words of document B.
As a next step, a segment of words is established for the word strings, or phrases, to be examined. This segment can be determined according to common language rules; e.g., a segment can be a sentence or paragraph. However, the length of the segment is user defined and can be any fragment of word strings desired. For this example, the segments will correspond to the sentences in each respective document, although larger segments are usually more effective than single sentences to create the associations of the present invention because there exists a larger base of potential associations to fill the database.
As a next step, examine the first word in the first segment — here the first word of the first segment ("the sky is blue") is "the."
As a next step, determine the positions of all occurrences of this first word in document A. Positions of words are determined by their respective word count location in any document. Using the example, the positions of the word "the" are one, five, nine, and fifteen (the first, fifth, ninth, and fifteenth words in the document).
As a next step, determine the target words that relate to the first word examined. The target words are determined by using the word ratio to determine the relative point in document B, and applying the range to that word position in document B (the range is user defined as described in the first embodiment). The relative position of a word in document B is determined by applying the ratio calculated above. In the example, the word "the" occurs in the first, fifth, ninth, and fifteenth positions of document A. These positions correspond to relative positions 1, 4, 7, and 11 of document B. This calculation occurs by taking the position in document A, establishing a ratio (by simple math multiplying by the fraction of words in document B to document A, or by 3/4) and applying that ratio: 1 (document A) x 3/4 = 3/4, which is 1 (rounded up); 5 (document A) x 3/4 = 3 3/4, which is 4 (rounded up) (document B); 9 (document A) x 3/4 = 6 3/4, which is 7 (document B, rounded up); 15 (document A) x 3/4 = 11 1/4, which is 11 (document B, rounded down).
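The ratio calculation can be sketched as below. The function name `relative_position` is hypothetical. Rounding to the nearest whole position, with halves rounded up and the result clamped to the document, is an assumption that reproduces the four values quoted in the text; the specification does not state a general rounding rule.

```python
from fractions import Fraction

def relative_position(pos_a, len_a, len_b):
    """Map a 1-based position in document A to a position in document B
    by the word-count ratio, rounding to the nearest position."""
    exact = Fraction(pos_a * len_b, len_a)
    nearest = int(exact + Fraction(1, 2))  # round half up for positive values
    return max(1, min(len_b, nearest))

# Positions of "the" in the twenty-word document A, mapped into the
# fifteen-word document B.
mapped = [relative_position(p, 20, 15) for p in (1, 5, 9, 15)]
```

Exact fractions are used instead of floating point so that values such as 3 3/4 and 11 1/4 are rounded without representation error.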
Applying the above, for the first word to be examined ("the") yields the following:
• Position in Document A = 1;
• Relative position in Document B = 1;
• Frequency range applied to preceding and following words in document B equals word positions 1-3 in Document B. This determination occurs by taking the position +/- the frequency range, or 1 +/- 2, or -1 through 3. Ignoring the negative and null positions returns a word position result of 1-3 in Document B.
• Applying that frequency range to Document B yields words at positions 1, 2, and 3 in Document B, or the following: AAA, BB, and CCC.
Thus, the first occurrence of the word "the" in Document A yields the words AAA, BB, and CCC in Document B.
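The range step can be illustrated with a short Python sketch. The word list `DOCUMENT_B` is an assumption: it is reconstructed from the results quoted in this example (AAA BB CCC AAA EEE DDDD AAA BB FFF GGGG HHH) and may not be the complete document. The function name `target_words` is hypothetical.

```python
# Assumed opening words of Document B, reconstructed from the quoted results.
DOCUMENT_B = ["AAA", "BB", "CCC", "AAA", "EEE", "DDDD",
              "AAA", "BB", "FFF", "GGGG", "HHH"]


def target_words(doc_b, relative_pos, freq_range):
    """Return the words of doc_b within relative_pos +/- freq_range (1-based)."""
    # Negative and null positions are ignored, as are positions past the end.
    first = max(1, relative_pos - freq_range)
    last = min(len(doc_b), relative_pos + freq_range)
    return doc_b[first - 1:last]


print(target_words(DOCUMENT_B, 1, 2))  # ['AAA', 'BB', 'CCC']
```

With a relative position of 1 and a range of 2, the window -1 through 3 is clipped to positions 1 through 3, which gives the three words stated above.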
As a next step, advance to the next occurrence of the word "the" in document A and apply the previous procedures:
• Position in Document A = 4.
• Relative position in Document B = 3.
• Frequency range (+/- 2) with relative position 3 yields words in positions 1, 2, 3, 4, and 5 of Document B: AAA BB CCC AAA EEE.
Then, determine if the target words for the second position match the target words for the first position:
• Results from first search are: AAA BB CCC.
• Results from second search are: AAA BB CCC AAA EEE.
• Matches are AAA (twice), BB and CCC.
These matches are stored in a memory device for possible associations with the word "the".
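The comparison of two search results can be sketched as follows. This is one possible reading of the matching rule, not a definitive one: every word of the later result that also occurs in the first result counts as a match, which is what makes AAA count twice. The function name `matches` is hypothetical.

```python
from collections import Counter


def matches(first_result, later_result):
    """Count the words of later_result that also occur in first_result."""
    seen = set(first_result)
    return Counter(word for word in later_result if word in seen)


first = ["AAA", "BB", "CCC"]
second = ["AAA", "BB", "CCC", "AAA", "EEE"]

print(dict(matches(first, second)))  # {'AAA': 2, 'BB': 1, 'CCC': 1}
```

The counts could then be kept in memory as candidate associations for the word being examined.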
The process continues and repeats for the next occurrence of the word "the": the results from the third occurrence of the word "the" yield CCC AAA EEE DDDD AAA; the matches are AAA (twice) and CCC; the matches are stored in a memory device for possible associations.
The process repeats for all other occurrences of the word "the." The results of this analysis return AAA BB FFF GGGG HHH, and AAA and BB as possible associations.
As a next step, the present invention increments the number of words examined by one. In the first example the word examined was "the" (the first word in Document A). Incrementing, the next word string to be analyzed is "the sky".
Repeating the steps above for the word string: "The sky" appears in positions 1 and 9 (utilize the first word in the phrase as the marker for the positions). The relative positions in Document B are 1 and 6. Applying the frequency range to the relative positions yields: AAA, BB, and CCC for the first position; and AAA EEE DDDD AAA BB for the second position. Comparing the two results for two word phrases yields AAA BB as possible associations to be stored in the database.
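Locating a word string by the position of its first word can be sketched as below. The contents of `DOCUMENT_A` are an assumption: the list is built from the four segments quoted in this example, lower-cased and with punctuation removed. The function name `phrase_positions` is hypothetical.

```python
# Assumed Document A, assembled from the four segments quoted in the example.
DOCUMENT_A = ("the sky is blue the grass is green "
              "the sky includes clouds and stars "
              "the grass dies in the winter").split()


def phrase_positions(words, phrase):
    """Return the 1-based positions at which phrase starts in words."""
    target = phrase.lower().split()
    n = len(target)
    # The first word of the phrase is the marker for each position.
    return [i + 1 for i in range(len(words) - n + 1)
            if words[i:i + n] == target]


print(phrase_positions(DOCUMENT_A, "the sky"))  # [1, 9]
```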
The process is then incremented a word and the process repeated for "the sky is." This process yields as a potential match only the first occurrence AAA BB CCC, as there are no other occurrences.
Repeating the process for the phrase "the sky is blue" there is only one occurrence having AAA BB CCC as possible associations to be stored in the database.
As a next step, the end of the first segment has been reached, defined by the user as that indicated by punctuation in Document A. The next step is to take the SECOND word in the first segment and continue the iterative process described above. In the example the analysis would include "sky," "sky is," and "sky is blue," yielding the following as matches: "sky" occurs in positions 2 and 10 in Document A, which yields 2 and 7 as relative positions in Document B, which yields AAA BB CCC AAA as the first match and EEE DDDD AAA BB FFF as the second match, which yields AAA and BB as possible associations to be stored in the database.
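The steps for "sky" can be put together in a small sketch that stores the result in a dictionary standing in for the association database. Several things here are assumptions: the relative positions 2 and 7 are taken as given from the text rather than recomputed, `DOCUMENT_B` is reconstructed from the quoted results, the possible associations are taken to be the distinct words common to every occurrence's window, and all function names are hypothetical.

```python
# Assumed opening words of Document B, reconstructed from the quoted results.
DOCUMENT_B = ["AAA", "BB", "CCC", "AAA", "EEE", "DDDD",
              "AAA", "BB", "FFF", "GGGG", "HHH"]


def window(doc_b, relative_pos, freq_range):
    """Words of doc_b within relative_pos +/- freq_range (1-based, clipped)."""
    first = max(1, relative_pos - freq_range)
    last = min(len(doc_b), relative_pos + freq_range)
    return doc_b[first - 1:last]


def possible_associations(doc_b, relative_positions, freq_range):
    """Distinct words that appear in the window of every occurrence."""
    results = [window(doc_b, pos, freq_range) for pos in relative_positions]
    first, later = results[0], results[1:]
    common = []
    for word in first:
        if word not in common and all(word in result for result in later):
            common.append(word)
    return common


database = {}  # stand-in for the association database
database["sky"] = possible_associations(DOCUMENT_B, [2, 7], 2)
print(database)  # {'sky': ['AAA', 'BB']}
```

When a word string occurs only once, there is nothing to compare against, so this sketch keeps every distinct word of the single window as a possible association.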
Repeating the process for "sky is" yields only one result: AAA BB CCC AAA; repeating the process for "sky is blue" yields AAA BB CCC AAA.
The next incremented word in segment one returns "is" and "is blue": repeating the process for "is" and "is blue" yields as matches AAA BB CCC and AAA and CCC AAA EEE DDDD and AAA; with AAA and CCC as possible associations to be stored in the database.
The next incremented word in segment one is "blue" which yields AAA BB CCC AAA and EEE as possible associations to be stored in the database. The analysis is now up to the end of segment one. The next segment is the sentence "the grass is green." Since "the" has already been analyzed the next word portion to be analyzed is "the grass," followed by "the grass is", "the grass is green", "grass", "grass is", "grass is green", and "green."
The process continues with the next segment ("the sky includes clouds and stars") with the first analysis acting on "the sky includes", "the sky includes clouds", "the sky includes clouds and", "the sky includes clouds and stars", "sky includes", "sky includes clouds", "sky includes clouds and", "sky includes clouds and stars", "includes", "includes clouds", "includes clouds and", "includes clouds and stars", "clouds", "clouds and", "clouds and stars", "and", "and stars", "stars".
Finally, the process continues with the next segment ("the grass dies in the winter") with the analysis on "the grass dies", "the grass dies in", "the grass dies in the", "the grass dies in the winter", "grass dies", "grass dies in", "grass dies in the", "grass dies in the winter", "dies", "dies in", "dies in the", "dies in the winter", "in", "in the", "in the winter", "the winter", and "winter".
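The order in which word strings are taken up can be sketched with a generator that grows each string one word at a time, then advances the starting word, and skips any string already analyzed in an earlier segment. This is an illustrative reading of the iteration, and the function name `word_strings` is hypothetical.

```python
def word_strings(segments):
    """Yield each word string once, in the order the analysis would visit it."""
    seen = set()
    for segment in segments:
        words = segment.split()
        for start in range(len(words)):
            # Grow the string by one word at a time up to the segment end.
            for end in range(start + 1, len(words) + 1):
                phrase = " ".join(words[start:end])
                if phrase not in seen:  # skip strings already analyzed
                    seen.add(phrase)
                    yield phrase


first_segment = list(word_strings(["the sky is blue"]))
print(first_segment)
# ['the', 'the sky', 'the sky is', 'the sky is blue',
#  'sky', 'sky is', 'sky is blue', 'is', 'is blue', 'blue']
```

Passing several segments in order, for example `word_strings(["the sky is blue", "the sky includes clouds and stars"])`, continues the same enumeration while leaving out "the", "the sky", and "sky", which were already covered by the first segment.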
Note that it is possible to have the segments extended for analysis, as described above wherein the segments need not be limited to sentences or paragraphs. Fragment sentences ("Went to school today. She walked to the school on the street.") can be analyzed by extending the segment to incorporate the person ("she") into the first sentence when the present invention acts to translate languages.
As demonstrated, these two embodiments are representative of the technique used to create associations. The techniques of the present invention need not be limited to language translation; in a broad sense, the techniques will apply to any two embodiments of the same idea that may be associated, for at its essence foreign language translation merely exists as a paired association with one idea (the word or phrase). Thus, the present invention may be applied to associating data, sound, music, video, or any wide ranging concept that exists as an idea, including ideas that can represent any sensory (sound, sight, smell, etc.) experiences. All that is required is that the present invention analyze two embodiments (in language translation, the embodiments are documents; for music, the embodiments might be digital representations of a music score and sound frequencies denoting the same composition, and the like).
In addition, note that it is also possible to have an embodiment of the present invention that loads, by either mechanical, electrical, or other means, certain associations into the database. For example, it is possible to load the database with foreign language equivalents of the English words it, his, her, an, a, of, or any common words, to create the association database more accurately, more efficiently, and with a faster resolution. Thus, with this embodiment the present invention would automatically return the foreign language equivalents of certain words loaded into the database. This embodiment allows the association database creation technique of the present invention to accommodate common words that may skew the analysis.
In addition, an embodiment can utilize common associations to create and recognize word patterns. For example, it is possible to load associations into the database (e.g., "President" for "Clinton") such that the association database accommodates situations where the text means President Clinton, but only the word "president" is utilized as an abbreviation. Given that the cross-language association exists in its broad sense as a cross-idea association technique for creating a database of possible associations, the results may be manipulated when an association is established. Thus, for example, if each "idea" is assigned an association to an electromagnetic wave (tone), it will be possible to create an "electromagnetic association" of the idea. Once a given number of ideas have been encoded with corresponding electromagnetic associations, data (in the form of an idea) can be manipulated into electromagnetic waves and transferred at once over conventional telecommunications infrastructure. When the electromagnetic waves reach the destination machine, that machine will synthesize the waves into separate components and, given the associations, present the individual ideas that were represented by the electromagnetic associations.
As will be understood by those skilled in the art, many changes in the apparatus and methods described above may be made by the skilled practitioner without departing from the spirit and scope of the invention.

Claims

I claim:
1. A method for associating content comprising the steps of: receiving content expressed in a first state; receiving content expressed in a second state; analyzing said content expressed in said first state with said content expressed in said second state, wherein said analyzing utilizes segments of content expressed in a first state and segments of content expressed in said second state; and creating an association database of said content in said first state as related to said content in said second state.
2. A computer system for associating content, comprising: a computing device that receives content expressed in a first state, and that receives content expressed in a second state;
wherein said computing device analyzes said content expressed in said first state with said content expressed in said second state utilizing segments of content expressed in a first state and segments of content expressed in said second state; and wherein said computing device creates an association database of said content in said first state as related to said content in said second state.
PCT/US2002/019587 2001-06-21 2002-06-21 Cross-idea association database creation WO2003001403A1 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
KR10-2003-7016595A KR20040007741A (en) 2001-06-21 2002-06-21 Cross-idea association database creation
CA002447229A CA2447229A1 (en) 2001-06-21 2002-06-21 Cross-idea association database creation
EP02744486A EP1397754A4 (en) 2001-06-21 2002-06-21 Cross-idea association database creation
JP2003507722A JP2004531832A (en) 2001-06-21 2002-06-21 Generating an association database between concepts
IL15874902A IL158749A0 (en) 2001-06-21 2002-06-21 Cross-idea association database creation
EA200400059A EA006182B1 (en) 2001-06-21 2002-06-21 Cross-idea association database creation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US29947201P 2001-06-21 2001-06-21
US60/299,472 2001-06-21

Publications (1)

Publication Number Publication Date
WO2003001403A1 (en) 2003-01-03

Family

ID=23154946

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/019587 WO2003001403A1 (en) 2001-06-21 2002-06-21 Cross-idea association database creation

Country Status (9)

Country Link
EP (1) EP1397754A4 (en)
JP (1) JP2004531832A (en)
KR (1) KR20040007741A (en)
CN (1) CN1520558A (en)
CA (1) CA2447229A1 (en)
EA (1) EA006182B1 (en)
IL (1) IL158749A0 (en)
WO (1) WO2003001403A1 (en)
ZA (1) ZA200309843B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1786954B (en) * 2005-12-20 2010-05-05 无敌科技(西安)有限公司 Method and system for integrated inquiry of multi language and multi text

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0728819A (en) * 1993-07-07 1995-01-31 Kokusai Denshin Denwa Co Ltd <Kdd> Automatic bilingual dictionary preparing system
US5907821A (en) * 1995-11-06 1999-05-25 Hitachi, Ltd. Method of computer-based automatic extraction of translation pairs of words from a bilingual text

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3408291B2 (en) * 1993-09-20 2003-05-19 株式会社東芝 Dictionary creation support device
DE69837979T2 (en) * 1997-06-27 2008-03-06 International Business Machines Corp. System for extracting multilingual terminology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1397754A4 *

Also Published As

Publication number Publication date
EP1397754A4 (en) 2006-05-10
ZA200309843B (en) 2005-01-19
CN1520558A (en) 2004-08-11
EA200400059A1 (en) 2004-04-29
KR20040007741A (en) 2004-01-24
JP2004531832A (en) 2004-10-14
IL158749A0 (en) 2004-05-12
EA006182B1 (en) 2005-10-27
EP1397754A1 (en) 2004-03-17
CA2447229A1 (en) 2003-01-03

Legal Events

Date Code Title Description
AK Designated states; Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW
AL Designated countries for regional patents; Kind code of ref document: A1; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG
121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase; Ref document number: 2002345728, Country of ref document: AU
WWE Wipo information: entry into national phase; Ref document number: 158749, Country of ref document: IL
WWE Wipo information: entry into national phase; Ref document number: 2002744486, Country of ref document: EP; Ref document number: 2447229, Country of ref document: CA
WWE Wipo information: entry into national phase; Ref document number: 2003/09843, Country of ref document: ZA; Ref document number: 200309843, Country of ref document: ZA; Ref document number: 1020037016595, Country of ref document: KR
WWE Wipo information: entry into national phase; Ref document number: 028125363, Country of ref document: CN; Ref document number: 2003507722, Country of ref document: JP; Ref document number: 2003/02228, Country of ref document: TR
WWE Wipo information: entry into national phase; Ref document number: 200400059, Country of ref document: EA
ENP Entry into the national phase; Ref document number: 2004106598, Country of ref document: RU, Kind code of ref document: A; Ref document number: 2004106597, Country of ref document: RU, Kind code of ref document: A
WWP Wipo information: published in national office; Ref document number: 2002744486, Country of ref document: EP
REG Reference to national code; Ref country code: DE; Ref legal event code: 8642
WWW Wipo information: withdrawn in national office; Ref document number: 2002744486, Country of ref document: EP