WO2008032037A1 - Method and system for filtering and searching data using word frequencies - Google Patents


Info

Publication number
WO2008032037A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
user
web
profile
words
Prior art date
Application number
PCT/GB2007/003418
Other languages
French (fr)
Inventor
Richard J. Stevens
Original Assignee
Stevens Richard J
Priority date
Filing date
Publication date
Application filed by Stevens Richard J filed Critical Stevens Richard J
Publication of WO2008032037A1 publication Critical patent/WO2008032037A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/335 - Filtering based on additional data, e.g. user or group profiles
    • G06F16/337 - Profile generation, learning or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation

Abstract

A method of indexing documents and of characterizing a computer user by detecting the frequency of usage of words in a language, and of comparing the relevance of documents to a search query by comparing the frequency of usage of words in the document to the words in the search query.

Description

METHOD AND SYSTEM FOR FILTERING AND SEARCHING DATA USING WORD
FREQUENCIES.
BACKGROUND AND SUMMARY OF THE INVENTION
The invention can make the user's interaction with the Internet, also known as the Web, and other media more effective. A user-controlled profile is generated automatically (though the user can add to it). The user controls the use and display of the profile, and can reveal parts of it for specific uses. When searching or using the Internet, irrelevant information is suppressed, while the user's interests can be emphasized, displayed and used efficiently. The system, called KnowingMe, acts as the intermediary between the user's interests and the mass of information and systems of the World Wide Web, filtering out information irrelevant to the user, finding useful information and maximizing the return for the user. The system then actively matches the user's needs with Web information and programs, acting in the user's interests and revealing only what the user wants to reveal about himself/herself.
The prior art shows that attempts have been made to use automated methods to analyze documents in order to determine meaning, or to take search queries and produce additional search terms based on the words in the query. For example, the following U.S. Patents and Patent Applications (all of which are hereby incorporated by reference for all that they teach) show a variety of approaches, but not the use of word-usage frequencies: 5642502; 5694592; 5893092; 5926811; 5983221; 6028605; 6189002; 6453315; 6598047; 6654740; 6687696; 6741981; 6816857; 6823333; 7076493; 7080068; 7092870; 20020107853; 20040034652; 20040220944; 20060106767
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1: Shows relative use of words, or word frequency, as a graph.
Figure 2: Shows a schematic of the WordCone data structure.
Figure 3: Shows creation of a word frequency curve as a filter.
Figure 4: Shows the application applied to an Internet search.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
1. User profile analysis
1.1. Automatic Derivation of user profile from 'vocabulary'
The user's interests and context (gender, age, nationality, location) are determined by examining the words that the user writes and views, relative to average usage.
The size of a user's vocabulary is estimated by examining the sample of words that the user has written or actively viewed while using the Web. This is demonstrated by Hunter Diack in the article "Test your own Wordpower," (1975) Paladin 586 08233 6, which is incorporated herein by reference. This can cover both viewing (e.g. looking at Web pages) and writing (e.g. Word documents or searches in Search Engines). The larger the vocabulary in the pages read, the larger the user's 'vocabulary' is assumed to be. The word 'vocabulary' used in this discussion is related to, but different from, traditional definitions of vocabulary. The user vocabulary is derived from the information written, the level of interest in Web sites, the location of the user cursor on the Web page, and the time spent on sites by the user. If the user writes text on the computer (e.g. in Word), these texts are analyzed and weighted differently. A person's vocabulary is a good measure of key marketing indicators such as age and education, and is a better predictor of income than IQ.
The user's words are sorted by the average frequency of word occurrence across the whole Web. A person with a larger vocabulary will use more unusual words. The average frequency of word usage for all users across the Web is determined by scanning a wide sample of Web pages and gathering data from many users. For example, the word "the" typically is the most common word, perhaps 6-8% of the total number of words. Other words may occur at a rate of 1 in 1,000 or 1 in 100,000. This provides a list of words ordered by frequency of occurrence across the Web.
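As a rough sketch of this ordering step, a frequency-ordered word list can be built from a sample of page texts. The tiny corpus and the whitespace tokenization below are illustrative assumptions only; a real implementation would also need the stemming and punctuation handling described elsewhere in this specification.

```python
from collections import Counter

def frequency_ordered_words(corpus_texts):
    """Count word occurrences across a sample of page texts and return
    (word, relative_frequency) pairs ordered from most to least common."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(text.lower().split())
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common()]

# A toy two-page "Web sample": "the" dominates, as the text predicts.
ranked = frequency_ordered_words(["the cat sat on the mat",
                                  "the dog chased the cat"])
```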
The size of the user vocabulary can then be estimated by comparison with stereotypical vocabularies of similar size. For example, a user with a vocabulary of 20,000 words will know almost all of the most common 15,000 words in the language, but relatively few after the most common 30,000 words. The distribution profile of the user vocabulary versus frequency of occurrence gives a rapid indication of vocabulary size. An example of such a distribution profile is shown in Fig. 1. This approach allows the vocabulary size to be estimated from a relatively small sample from a user. Typical vocabulary size is measured in "stem" words — for example, 'exercise', 'exercises', 'exercising', and so on are counted as only one word — the stem of all of these words is 'exercise'. Usually nouns carry the most significance.
The user's interests are then determined by identifying which words in the user's lexicon are used preferentially when compared to the average word frequency for a user of similar size vocabulary - both for overuse and underuse of words. The result is an automatically generated set of words which indicate the user's interests and "anti-interests" compared to stereotypical vocabularies. For example, a user may use the words 'requirements', 'history', 'football' 50 times more than an average user who has the same size vocabulary. The same person may use the words 'rap', 'McDonalds' 4 times less than the typical user. The net result for an individual user may be two groups of words, perhaps 50-3000 in total, which the person uses in a significantly different way to the average user.
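A minimal sketch of the overuse/underuse comparison, assuming the user's and baseline vocabularies have been flattened to word-to-frequency maps. The function names and the significance threshold of 4 are hypothetical (the text's own examples use ratios of 50 and 4).

```python
def preferential_words(user_freq, baseline_freq, threshold=4.0):
    """Split a user's vocabulary into over-used and under-used words by
    comparing each word's frequency to the baseline for a comparable
    vocabulary size. `threshold` is the ratio taken as significant."""
    overused, underused = [], []
    for word, f_user in user_freq.items():
        f_base = baseline_freq.get(word)
        if not f_base:
            continue
        ratio = f_user / f_base
        if ratio >= threshold:
            overused.append(word)
        elif ratio <= 1.0 / threshold:
            underused.append(word)
    return overused, underused

# 'football' is used 50x the baseline rate, 'rap' at a quarter of it.
over, under = preferential_words(
    {"football": 0.05, "rap": 0.001, "the": 0.07},   # user frequencies
    {"football": 0.001, "rap": 0.004, "the": 0.07})  # stereotype baseline
```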
Preferential use of phrases can be found by examining combinations of the words that are used preferentially. Phrases are even more specific to user interests than words. Phrases may be represented by a string of keywords to avoid having minor differences count as two different phrases.
Additional keywords can be generated later by seeing clusters of keywords, once they are positioned on the thesaurus. For example, gaps can be filled in, or headings derived that summarize a cluster of words.
However, we can also estimate individual characteristics, such as age, gender, and location, through the user's vocabulary. Such individual user characteristics can be differentiated through keywords selected to optimally discriminate between groups. The keywords are derived by comparing the vocabulary of a stereotypical group (e.g. young) with
another (old), and detecting those words that show the largest differential use between the groups. For example, the words "baseball", "elevator", "Congress", and "Seattle" may (statistically) be used many times more by a US citizen than a UK citizen. If we check the usage of those keywords, we can estimate whether the user is American or British. Of course, an individual person may have a vocabulary reflecting interests in both cultures. Hence the result is not binary, and a user may be 75% American and 25% British in his/her interests.
The word "pension", and "retirement" may be used many times more by an older person than a young one. A person may be heavily interested in sport, but specifically not interested in ice-hockey or baseball. A relatively small number of keywords can derive a parameter value for the particular characteristic. For example, the usage of a group of 10-30 keywords compared to average use may be used to estimate the level of interest in golf, or the age of a user.
Actions made by the user can add to the profile. For example, the knowledge that the user buys golf equipment can be used for the user profile.
Processing can take place as a background task while the user is working, while the compilation tasks are performed at the server side.
A user profile is constructed from the user's interests, characteristics, and context. The system performs two steps:
1) Deriving the user's word preference profile automatically by comparison of the user's vocabulary with average and stereotypical users with similar size vocabularies. The profile is generated with the user's knowledge, and the user controls the use and the display of the profile (although the user may not be able to alter much of the information). Hence the profile is 'automatically generated' and 'user-controlled', although the user may also add some elements manually.
2) Deriving individual characteristics of the user by comparing usage of keywords that are known to characterize groups with similar-sized vocabularies.
1.2. Clustering of user interests
The words used in web pages, documents or any other data object on the Web are organized into a structure so that closely related words are, generally speaking, statistically close together. An initial data structure called a WordCone can be created for the whole Web, which represents a hierarchy. The WordCone is a layered hierarchy, like a layered tree structure but wrapped around itself so that it does not have a beginning or end. However, it can be sliced through from top to bottom to produce a standard layered hierarchy.
This can be done because, on average, related words will tend to appear more closely in Web text than unrelated words. The relative separation of words is deduced by how closely the pairs appear together statistically across text on the Web. Distance is represented by, among other things, the number of words apart in the text, or relative frequency of both words occurring in the same paragraph. Other metrics for distance may be used. Because of the association of words that are close together, this process produces a set of words ordered by similarity (like a thesaurus or a Dewey Decimal Classification for books). The words in the WordCone are then arranged into a complete set with minimum or close to minimum separation of the total set of words. This task is performed on the server side. Statistically, words that appear close to one another have associated meaning over a large sample. At the leaf level there may be about 100,000 words, ordered by similarity. This index also has metrics for the actual usage and relative usage of words.
The general techniques of structuring words into such linear thesauri are prior art. For example, Roget's Thesaurus of English Words and Phrases by Peter Mark Roget, incorporated herein by reference, presents a thesaurus in a linear fashion as a book. Practitioners of ordinary skill will recognize that access to the WordCone can be controlled by means of encryption. The code that can decrypt and use it to create search queries, to filter search results or to send specific profile information to search engines can also include an identifier, allowing the search engine itself to recognize that the source of the query was WordCone or the source of the WordCone application. In this manner, a credit for the search referral can be accrued to the WordCone account associated with the identifier.
The hierarchy for the WordCone word index is generated by a recursive technique of looking for words which summarize the group beneath them in the hierarchy. For example, "sport" would summarize "baseball", "football", "rugby" etc. at the level above. The structure is actually a hierarchical cone (hence WordCone), rather than a tree structure. The WordCone structure is not itself a thesaurus or a hierarchy, but can easily be converted into these when needed. The cone can be cut vertically at any point to convert it to a tree structure at a convenient boundary. The leaf level of the WordCone is a word index of closely related subject areas. That is, the proximate words at the leaf nodes are related by subject matter. The WordCone data structure can be constructed by assembling data elements containing a word and its associated pointers to other elements. One pointer points to the parent node, one or more pointers point to the child nodes and two pointers point to neighboring nodes. At the leaves, the child node pointers are null. In one embodiment, the initial structure is passed to each user's computer, to allow the user to populate their personal WordCone(s). This gives each user a compatible structure and hence allows the WordCones to be added together easily.
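The pointer layout just described (a parent pointer, child pointers, and two neighbour pointers that close each layer into a ring) could be represented as follows; the class and function names are illustrative, not part of the disclosure.

```python
class WordConeNode:
    """One element of the WordCone: a word plus pointers to its parent,
    its children, and its two ring neighbours. At leaf level the child
    list is empty and the neighbour pointers close the ring."""
    def __init__(self, word, frequency=0.0):
        self.word = word
        self.frequency = frequency   # usage frequency stored at the node
        self.parent = None
        self.children = []
        self.left = None             # neighbouring node on the same layer
        self.right = None

    def add_child(self, node):
        node.parent = self
        self.children.append(node)
        return node

def link_ring(nodes):
    """Wrap one layer of nodes around itself so it has no beginning or end."""
    for i, n in enumerate(nodes):
        n.left = nodes[i - 1]
        n.right = nodes[(i + 1) % len(nodes)]

# A toy layer: "sport" summarizing three leaf words, linked into a ring.
root = WordConeNode("sport")
leaves = [root.add_child(WordConeNode(w))
          for w in ("baseball", "football", "rugby")]
link_ring(leaves)
```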
Alternatively, a structure for a word index can be created manually, with only the leaf level or the lower levels generated automatically. This allows existing headings such as "sports", "business" or "entertainment" to be used.
The same word can appear more than once and in different places in the WordCone, because the same word can have multiple meanings. The WordCone can use a link or pointer in the data structure to show the relationships between the same word with different meanings. Each instance of a word will be close to other words that are used in the same context. The WordCone can be stretched and/or compressed, for example to make a thesaurus in which each unit along the leaf level represents uniform usage of a word on the Web (for the Web's WordCone) or for a user. Each linear section of the thesaurus then represents a uniform amount of interest, instead of the sheer number of words. Fig. 3 shows the process of adjusting the overall WordCone, firstly to the overall frequency of word usage, and then to the interests of the user.
There is a unitary WordCone which constitutes the thesaurus for the World Wide Web: the base thesaurus together with the set of usage frequencies (or probabilities) for each of its words. The unitary WordCone is created from the base thesaurus by determining, for each word in the cone structure, the average frequency of usage for that word. The user's WordCone can be used to create a filter to emphasize a user's interests in Web information by suppressing information irrelevant to the user. The filtering process consists of the user's WordCone applied over the overall Web WordCone. Changes in a WordCone over time can, for example, illustrate recent changes in interests or recent trends on the Web.
The initial WordCone is then passed to individual users, who map their own level of interest by populating the data structure with their word usage. Alternatively, the system can scan the user's documents on their computer, or scan web pages bookmarked by the user, in order to calculate word frequencies associated with the user for the user's word preference profile. This WordCone stores the intensity of the user's interests. The intensities of words may vary with time, reflecting the user's changing interests or, for example, the latest events in current affairs.
The relative usage of words in the WordCone can be used as a filter to find similar WordCone structures to emphasize the user's interests. Each word that occupies a node (that is, categorical words and leaf nodes) has a frequency associated with it. The data structure can have an additional data item in the data element that houses the frequency for the word associated with that node. In one embodiment, the leaf nodes are used, and each leaf node is mapped to a position on the number line for ease of computation. In this case, the parent nodes (which have the category definitions in them) inherit a summed frequency by summing the individual frequencies of the leaf nodes, and so on up the tree, with the root node having a frequency of 100%. In another embodiment, each node in the WordCone is mapped to a unique position on a number line by well known tree ordering techniques used in computer programming. In this case, the leaf node words and category description words each have word-use frequencies associated with them. The WordCone represents a word preference profile for the data object it is derived from, and in the case of data objects associated with a user, the user's word preference profile. In either case we end up with a vector: algebraically, we would say that for the i-th word in the vector, w(i), there is a usage frequency P(i). In the second case, a parent node's P(i) would equal the sum of the P(i) associated with its child nodes. Besides summing, other heuristic techniques of combining the usage frequencies of leaf nodes can be used. By usage frequency is meant the percentage difference between the frequency of use of the word in that domain and the overall average frequency of use.
In general, we can say that for a word w(i) for user (j), there is a P(i)(j). Similarly, for all words w(i) for a given website (k), there is a P(i)(k) associated with the website. Filtering can be accomplished by a dot product of the two vectors. That is, a relevance score S(j)(k) = Σ_i P(i)(j)·P(i)(k). This represents a score of how strongly user j's interests intersect with website k, that is, the relevance of website k to user j. The score can be compared to a scoring threshold in order to determine whether to present the website to the user as a result of a search. Alternatively, search results can be ranked by the dot product score and presented to the user from the highest score down. Presentation can include displaying a web page with hyperlinks to the search results, where the top of the page shows the most relevant result, with decreasing relevance as one moves down the display page. In another embodiment, the user can adjust the scoring threshold to broaden or tighten the effect of the filter, typically by making changes through a graphical user interface.
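The dot-product scoring and threshold ranking just described might be sketched as follows, with the profiles flattened to word-to-frequency maps (the names and sample values are illustrative).

```python
def relevance_score(user_profile, site_profile):
    """Dot product S(j)(k) = sum over i of P(i)(j) * P(i)(k), with the
    two profiles flattened to word -> usage-frequency maps."""
    return sum(p * site_profile.get(word, 0.0)
               for word, p in user_profile.items())

def rank_results(user_profile, sites, threshold):
    """Score each candidate site, drop those below the scoring threshold
    and present the rest from the highest score down."""
    scored = [(relevance_score(user_profile, prof), name)
              for name, prof in sites.items()]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)

user = {"music": 0.9, "jazz": 0.05}
sites = {"musicsite": {"music": 0.9, "jazz": 0.9},
         "golfsite": {"golf": 0.8}}
ranked = rank_results(user, sites, 0.1)   # golfsite falls below threshold
```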
One issue is how to adjust the scoring threshold so that websites of limited breadth but sufficient depth are not unfairly excluded as a result of the limited breadth. In one embodiment, a Web site can be classified on the WordCone in the same way as an individual. This need only be done once per Web site and sent to a central database, to then be available to any user without further processing. This allows the web site to be matched to a particular user's interests by comparing the same node for the user and the web site. For example, consider a website specialising in music downloads. The website would show high specific values for the node of "entertainment", then sequentially higher values for the subnodes of (say) "arts", then "music" - because the site is more concerned with "music" than with its parent node "arts" or with "entertainment" in general. A user who is interested in classical music downloads will also have high values for the node of "entertainment", its child node "arts", and then "music". The match with the music download site, derived from the product of the two nodes for the user and the site, will therefore be a high value. The match could be, for example, a product of the level of interest for the same node for the user and the website. Thus if the user has an interest level of 90% for classical music, and the web site is 90%, the match level would be 81%. If the specific interest of the user is at a lower node beneath music (e.g. classical music), then that lower node will have a yet higher value of interest. Provided that the music download site also has a significant value for classical music, then the match for this node will be high for classical music and much lower for, say, "jazz" if the user has little interest there. If the user has a 5% interest in "jazz" and the website is 90% for the same node, the match would be 4.5%.
Evaluating the match between user and web site.
The WordCones for the user and the websites are then classified as to the level of interest in particular subjects. Matching the most important web sites to the user's interests then becomes possible.
In another embodiment, a simple measure of a user's interests against a web site consists of the sum of the products of corresponding leaf node values of the user and the Web site, i.e. Σ N_u·N_ws over all leaf nodes, where N_u and N_ws are the values at a leaf node for the user and the Web site respectively. The highest value will be obtained by the web site with the largest amount of information relevant to the user, regardless of the amount of irrelevant information.
Another measure of interest to the user is a smaller web site, or a section of a larger web site, that has a higher matching ratio over a smaller area of the web site, i.e. the maximum of Σ N_u·N_ws taken over only those leaf nodes, shared between the user and the Web site, that lie below a node of specific interest for the user. This maximum value shows the best match to a specialist area for the user.
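The two measures might be sketched as follows, assuming the WordCones are flattened to leaf-level maps and the leaves below each node of interest are supplied explicitly (both simplifying assumptions).

```python
def total_match(user_leaves, site_leaves):
    """Overall measure: sum of N_u * N_ws over all leaf nodes."""
    return sum(v * site_leaves.get(w, 0.0) for w, v in user_leaves.items())

def best_specialist_match(user_leaves, site_leaves, subtree_leaves):
    """Specialist measure: the node of specific interest whose leaves
    give the highest sum of N_u * N_ws restricted to that subtree."""
    best_node, best_score = None, 0.0
    for node, leaves in subtree_leaves.items():
        score = sum(user_leaves.get(w, 0.0) * site_leaves.get(w, 0.0)
                    for w in leaves)
        if score > best_score:
            best_node, best_score = node, score
    return best_node, best_score

user = {"classical": 0.9, "jazz": 0.05, "golf": 0.3}
site = {"classical": 0.9, "jazz": 0.9}          # a music download site
subtrees = {"music": ["classical", "jazz"], "sport": ["golf"]}
node, score = best_specialist_match(user, site, subtrees)
```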
Both measures can be presented to a user on a single graph by showing the Wordcone with each node populated by the highest Web site matches for the user. The generic large Web sites will occupy the top node, and the lowest specialized areas will occur at successively lower nodes. They will show web sites or sections of web sites of interest to the subject matter at that node. For example, the "early music" node will be occupied by specialist early music web sites, but also the early music sections of more generic web sites that cover the subject area well.
The user can select the relevant web site by clicking on a node of interest and seeing a list of web sites that have the highest matching values for that node.
More generally, in another embodiment, the calculation of S(j)(k) can be limited to the sum Σ_m P(m)(j)·P(m)(k), where m ranges over the indices associated with the words in the category of interest specified by the user; in the above example, it would be "music." The unitary WordCone can be used to determine what relative share of usage the entire selected category represents. The reciprocal of that share can be used to scale the dot product prior to comparison with the scoring threshold. For example, if the category of "music" is selected, and the unitary WordCone has 5% of usage being words in that category, then the relevance score dot product can be scaled by a factor of 20 prior to comparison with the scoring threshold. Alternatively, the scoring threshold can be reduced by a factor of 20. Other heuristic adjustments to the scoring threshold may be used. Comparison of two word preference profiles can be made by means of the dot product as well. In this case the result of the dot product can be compared to a threshold that indicates some level of confidence in the match.
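The category-restricted, scaled score might be sketched as follows; the function and parameter names are hypothetical, and the category's share of overall usage (from the unitary WordCone) is supplied as a fraction.

```python
def scaled_category_score(user_profile, site_profile, category_words,
                          category_share):
    """Dot product restricted to the selected category's words, scaled by
    the reciprocal of the category's share of overall usage taken from
    the unitary WordCone (a 5% share gives a scaling factor of 20)."""
    score = sum(user_profile.get(w, 0.0) * site_profile.get(w, 0.0)
                for w in category_words)
    return score / category_share
```

For example, a raw category score of 0.2 against a category representing 5% of overall usage is scaled to 4.0 before comparison with the threshold.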
By adjusting the filtering strength of the user WordCone relative to the generic WordCone in one dimension, then presenting it in two dimensions, the user can filter information to his/her interests in real time. Where there is a cluster of interests along a line, small gaps show potential areas of interest in which the user has not been active (see Figure 3).
When the user searches for a term on a Search Engine, the term can be used to locate a section of the WordCone corresponding to words of similar interest/meaning. The user may well have preferentially used words, recorded in the user profile, in that part of the WordCone. These words can be used to programmably reinforce the search term in the Search Engine. The resulting search is more relevant to the user's interests. An extension to this technique is to determine the area in the WordCone that is most relevant to the search and then to select user keywords that are specific to that subject area.
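A toy sketch of this query-reinforcement step, under two simplifying assumptions: the relevant slice of the WordCone leaf level is flattened to an ordered word list, and "that part of the WordCone" means a fixed window of positions around the search term.

```python
def expand_query(term, thesaurus, user_keywords, window=3):
    """Locate the search term in an ordered word list (a stand-in for a
    slice of the WordCone leaf level) and reinforce the query with the
    user's preferred words falling within `window` positions of it."""
    if term not in thesaurus:
        return [term]
    i = thesaurus.index(term)
    neighbourhood = thesaurus[max(0, i - window): i + window + 1]
    extras = [w for w in neighbourhood if w in user_keywords and w != term]
    return [term] + extras

# An illustrative subject-ordered slice of the leaf level.
thesaurus = ["opera", "symphony", "concerto", "sonata", "jazz", "blues"]
query = expand_query("concerto", thesaurus, {"symphony", "blues"}, window=3)
```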
Applications include the ability to filter or prioritize text, media or products from a Web site or an e-mail sent to a user, e.g. DVDs, wines, news snippets. For example, a Web newspaper (text, audio, video, MP3) could be filtered through a user filter to emphasize the user's interests. Similarly, it could reorder the information according to user interests. This filter could reside at the origin of the media information or at the user site. The system performs the steps of:
3) The production of an overall hierarchical word index for the Web, with the words at the leaf level organized by association. An alternative representation is as a layered cone.
4) The production and distribution of encrypted standard WordCone(s) that allow each user to populate their own WordCone with the relative strength of their interests.
5) The production of a filter of one or more dimensions representing the user's interests (and strength of interest), ordered substantially in the same way as the overall WordCone, so that the Web can be filtered or prioritized through the user's interests. This allows for a user-controlled profile to select relevant Web sites or to select relevant information within Web sites, e.g. a selection of text, video and/or audio from a TV guide or a newspaper.
6) The ability to show potential areas of interest by filling in small gaps between clusters of user interest along the WordCone.
2. Web analysis
2.1.Combining users' profiles
Synthetic view(s) of the overall Web information usage can be produced by combining individual users' WordCones. One or more user WordCones are sent to the server and combined to provide generic information about Web usage. Each user WordCone represents the interests of the individual user and the intensity of that interest, and is ordered by similarity of subject. Because the underlying WordCone structures used by each user are compatible, they can be added together straightforwardly. The summation is a map of word usage across the whole Web, in different areas, by the totality of users, i.e. an overall WordCone for the Web. Subsets can be chosen for different tasks, e.g. to define a USA stereotype.
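Because the structures are compatible, the summation step reduces to elementwise addition; a minimal sketch, assuming each cone has been flattened to a word-to-intensity map (the nesting and pointers of the full structure are omitted for brevity).

```python
def combine_wordcones(cones):
    """Sum compatible user WordCones (flattened to word -> intensity
    maps) into an overall usage map for the Web or for a group."""
    combined = {}
    for cone in cones:
        for word, intensity in cone.items():
            combined[word] = combined.get(word, 0.0) + intensity
    return combined

# Two users' cones summed into a group (or Web-wide) cone.
overall = combine_wordcones([{"music": 1.0},
                             {"music": 2.0, "golf": 1.0}])
```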
The current approach of "Web farming" requires many centralized computers to trawl the Web. WordCone makes this a distributed task shared between the users' computers, and the results focus on the actual usage by Web users. This allows a minute-by-minute analysis of Web usage, based on the Web sites that users are looking at.
The overall WordCone ordering and word usage can also be adjusted in the light of users' inputs and then re-distributed. The process of accumulating WordCones is anonymous, with the user identity being unknown, but the WordCones can be encrypted to ensure they are from unique individuals (to prevent automatic systems supplying information in bulk to distort the system). The system performs these steps:
7) Producing an overall analysis of Web usage by subject area by combining the usage of distributed computers that supply coherent structures to allow easy computation of overall usage.
8) Producing a combined measure of interest for a group (e.g. a company) by combining the Web usage statistics from members of the group.
9) The ability to adjust the relative word usage and ordering of the overall WordCone in the light of users' inputs.
3. Using the User profile
The user profile can now be anonymously used to scan Web information. A profile agent combines the user profile, software to view the Web, and encryption. The profile is owned by the user, who can read the profile and can alter parts of it. The user profile is partially or completely encrypted to preserve the user's anonymity. It permits different levels of exposure, allowing the user to apply the profile to different tasks while retaining sufficient anonymity and preventing tampering.
The profile agent can use the WordCone to filter the Web content for example to find other users with similar interests, advertisements, jobs etc. anonymously. The user can have multiple profiles or get a specialist profile from others for a particular task, for example a profile could be specific about a holiday destination or a pop star. WordCones are additive so multiple cones can be combined to complete complex tasks.
3.1. Filtering and reordering of information
The profile agent can be used to receive information from a Web site anonymously by having the information passed indirectly, e.g. through a KnowingMe site. When using the product, click-through information can be passed to third parties such as advertisers or KnowingMe (without compromising anonymity). The profile agents belong to the user, are based on the user profile, and anonymously trawl the Web to retrieve items of interest to the user.
3.2.Choice of media
The profile, profile agent and/or WordCone can be used on other devices to allow autonomous selection of information specific to a user. It may be passed to a mobile phone, a TV storage device that stores or displays programs, a PDA or other mobile devices, linking to those devices so as to detect items of interest to the user. For example, it may combine the user's interests with details of the current location known to the phone and tell the user that an event of interest is currently taking place locally. Alternatively, the user-controlled profile may extract information on the TV programs that are watched from a TV or TV recorder system and add them to the overall profile.
3.3.Verifying the user or site profile
A part of the profile can represent what the user (or site) has done in a neutral and unalterable fashion, i.e. the user can choose whether to display parts of this information, but cannot alter the information itself. This leads to a range of applications which verify the user or the site. The profile is still user-controlled (for display and use), but its content is not.
For example, the user can send an encrypted summary of the user's profile and the way it has been produced (e.g. by attaching it to an e-mail). The user is unable to alter this summary, which therefore forms a basis for independent verification. For example, the recipient would be able to see that the profile was generated over a period of 18 months and has changed only steadily during that time. This approach enables a user to be independently verified, to reduce spam and to verify the user's interests and stability.
In particular, some details from the profile (unique identifier, usage time etc) could be used to prevent and detect clickthrough fraud, where a computer is used to repeatedly click on ads to generate cash without any intention to purchase the product.
Similarly, the user may have a system which allows them to be rated by other users. The user can choose whether or not to display the overall rating, but cannot alter the overall rating or the information supplied by others. This is a generalized version of the rating systems which are specific to one site (Amazon, eBay), but in this case the information is held by the end user or Web site. However, the end user is not able to alter the statistics of this rating.
The encrypted user profile can be used by others to verify that the user is unique and active, without necessarily supplying an e-mail address. The agents will be able to assure click-through companies that the request is real and relevant to the user's interests. The profile agents will be able to supply the KnowingMe supplier with details of click-throughs and/or transactions, even though it may not know who the user is.
Similarly, there is currently a problem with the sale of popular tickets, e.g. for concerts or sporting events. Ticket touts can monopolize the web sites and cream off large numbers of tickets for re-sale. By linking the sale to parts of the user profile, including the unique ID and history, a fairer distribution can be made.
The same principle can be used to reduce spam. A user profile is unique, a valuable item built up over time, with elements not alterable by the user, yet it hides personal details of the user. As a result, a user could exclude or suppress e-mails from recently created e-mail addresses.
Parts of a user profile can be produced which the user (or a Web site) cannot alter, but can choose to display or use. This would allow aspects of the user's behaviour (interests, behaviour, ratings from others) to be displayed to Web users, e.g. on the user's Web site or on e-mails sent by the user, allowing the user to be verified. This would be tied to a unique identifier for the user, who could still remain anonymous. The application could be used to help reject spam, by indicating factors such as length of usage, interests etc. It can also be used to detect false clickthroughs (clickthroughs performed repeatedly and specifically to earn clickthrough income) or to prevent a single person ordering masses of tickets for re-sale.
3.4.Specialist filters
The user, domestic and commercial, will be able to receive specialist KnowingMes which are produced by third parties to filter the Web through specialist interests (e.g. for art, baseball or a geographical location). The WordCone may be encrypted to control the supply of these third-party elements, and to direct part of the click-through revenue back to the originator. Practitioners of ordinary skill will recognize that access to the WordCone can be controlled by means of the encryption. The code that can decrypt the WordCone and use it to create search queries, to filter search results or to send specific profile information to search engines can also include an identifier, allowing the search engine to recognize that the source of the query was WordCone or the source of the WordCone application. In this manner, a credit for the search referral can be accrued to the WordCone account associated with the identifier.
The user profile can be used to automatically classify information that is received and stored by the user, such as e-mails.
The system has these capabilities:
11) The ability to produce a user profile automatically and verifiably from a unique user with certain usage characteristics. On social sites such as My Space or Couchsurfer, this would give the reader some confidence that the profile is real, because it is automatically produced over months or years, and can be verified by KnowingMe, which can supply details about its history etc. to show that the profile was generated sensibly.
12) Keeping users anonymous while querying Web sites using user-controlled profile information and supplying information back to users.
13) The production and supply of specialist filters, WordCones and profile agents to users and commercial sites to allow them to use the Web more effectively.
14) The ability to receive and/or filter or reorder click-through information even when the user uses the user-controlled profile or profile agent anonymously (server side activity, to be patented).
15) The ability to ensure that the author of a click-through is a unique identifiable, yet anonymous, individual from his/her user-controlled profile rather than a fake user used solely for repetitive clickthroughs (server side activity, to be patented).
16) The ability to select information autonomously and automatically based on the user-controlled profile where the user has zero or limited ability to select manually. For example, the profile could download music or news to a car hi-fi system during the moments when there is a Wi-Fi link, or onto a storage device (iPod or USB storage) for replay during the journey.
3.5.Choice of relevant ads or media
In this application, the user-controlled profile interacts with other media systems (TV, radio, ads) to select appropriate material based on the user interests.
17) The ability to select non-Web media such as programs or ads where the user-controlled profile can interact with such systems. For example, KnowingMe can select an ad from several possibilities using the user profile, or select broadcast or stored programs or music based on the user profile. For example, a user could have one or more sequences of TV programs selected from broadcast options based on the user profile. The user choices could then be added to improve the user profile.
18) The ability to process a Web site to obtain a compatible WordCone index which can be matched against other indices to measure compatibility.
4. Display and filter techniques
4.1. 2-D display of 1-D information
Using the properties of a two-dimensional fractal curve (e.g. a Hilbert curve), the single-dimension WordCone can be wrapped along the curve to fill a two-dimensional display. The single-dimension line then has the whole 2-D area available for display, and related information on the WordCone will appear even closer on a 2-D display. Real-time filtering on the 1-D WordCone is then shown in 2-D through the curve. The single-dimensional WordCone is wrapped onto the two dimensions following the fractal curve that fills the plane with a single line. The curve ensures that subjects that are close in one dimension are statistically even closer in two dimensions. This is demonstrated by R.J. Stevens, A.F. Lehar and F.H. Preston in the article "Manipulation and Presentation of Multi-dimensional Image Data using the Peano Scan," I.E.E.E. Proc. Pattern Analysis and Machine Intelligence, February 1984, which is incorporated herein by reference. The display can show (e.g. by coloration or a 3-D relief map) the intensity of usage of the words across the WordCone, and the WordCone can be stretched to emphasize the heavily used elements. Filtering consists of overlaying the user's WordCone usage levels over the generic WordCone in 1-D. Display consists of showing the result in 2-D through the fractal curve. In this case a Hilbert curve is used, but other fractal curves will produce satisfactory results.
The user WordCone acts as a 1-D filter of the user's interests, enabling relevant Web sites to be selected and relevant information within a Web site to be picked out. The filters also allow Web sites to tailor their content to each user, providing information that is more relevant to the user's needs. Alternatively, the user can receive a standard set of information and filter it locally, e.g. for a Web newspaper or audio blog with tags that enable filtering.
4.2.Real-time filtering of Web sites
The one-dimensional WordCone is mapped along a two-dimensional fractal curve (e.g. a Hilbert curve) that fills the 2-D screen. This allows the whole of the Web (or segments of it) to be displayed efficiently. Fig. TBD shows how a continuous fractal curve can track through every pixel of a 2-D display, following the path of a Hilbert curve. This has two great advantages: a long, detailed curve can be mapped onto the whole of the 2-D screen, and elements close on the line will be even closer in 2-D because of the fractal curve. If we are dealing with a square of n elements by n elements, the effective length of the curve that can be displayed is n^2. So a screen of 1024 x 1024 elements can display an index 1,048,576 elements long, yet retain the coherence of the one-dimensional index.
Subject matter that is close on the index will be even more clustered on the 2-D fractal curve, because a section of the line will be wrapped into a tight 2-D curve on the fractal line. Real-time filtering of the index is straightforward, simply by arithmetic processing of the index. The strength of the user filter can be adjusted by simple arithmetic processing of the 1-D image, e.g. by allocating more of the fractal line to areas of high user interest, and less to areas of low interest. The line is then remapped on the fractal curve.
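The 1-D-to-2-D mapping described above can be sketched with the standard Hilbert curve index-to-coordinate conversion. This is an illustrative implementation under stated assumptions, not code from the specification; the function name and power-of-two grid size are choices made here.

```python
def d2xy(n, d):
    """Convert distance d along a Hilbert curve into (x, y) on an n x n grid.

    n must be a power of two; d runs from 0 to n*n - 1, so a 1-D index of
    length n*n is wrapped onto the whole 2-D screen.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate the quadrant so consecutive indices stay adjacent.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Consecutive 1-D indices always land on adjacent pixels, which is the clustering property the display relies on.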
The systems capabilities include:
19) The ability to filter the Web, or Web sites, or for Web sites to filter their content, through the user's interests, in real time by overlaying their WordCone(s) over the generic WordCones.
20) The ability of single-dimension Web information and indices to be displayed coherently in 2 dimensions through mapping the leaf level of the WordCone along a fractal curve to display the results.
WordCone Algorithms
Definitions
Ps(w): the Standard Word Frequency. The probability of occurrence of a word w in a large diverse sample of documents.
Pa(w): the Actual Word Frequency. The probability of the occurrence of a word w in a specific document or portion of text.
Pr(w): the Relative Word Frequency. The ratio of the actual word frequency to the standard word frequency for a word w in a specific document or portion of text. Pr(w) = Pa(w) / Ps(w).
Generating Relative Word Frequencies
Generating relative word frequencies relies on reasonable estimates of standard word frequencies for as large a dictionary of words as possible. Tables of Standard Word Frequencies are readily available in the public domain for very large samples of documents. Common examples of these include the Brown Corpus, covering tens of thousands of words.
In a given document or portion of text containing Ntot words, the actual frequency Pa(w) of a word w is the number of occurrences of the word, nw, divided by the total number of words:
Pa(w) = nw / Ntot
The relative frequency of a word w in a specific document or portion of text Pr(w) is defined by:
Pr(w) = Pa(w) / Ps(w).
Identifying Statistically Significant Words
In a given document we count the number of occurrences of each word, nw, and the total number of words, Ntot, and then rank the words in descending order of relative frequency. Words that have a relative frequency much greater than 1 are statistically significant within the text and give one of the required indications of the subject matter of the text.
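A sketch of this ranking in Python; the tiny standard-frequency table here is an illustrative stand-in for a full resource such as the Brown Corpus counts, and the function name and threshold are assumptions.

```python
from collections import Counter

# Illustrative Ps(w) values only; real tables cover tens of thousands of words.
STANDARD_FREQ = {"the": 0.07, "a": 0.02, "is": 0.01,
                 "word": 0.0003, "cone": 0.00002}

def significant_words(text, standard=STANDARD_FREQ, threshold=1.0):
    """Rank words by Pr(w) = Pa(w) / Ps(w), keeping those with Pr above threshold."""
    words = [w.lower() for w in text.split()]
    n_tot = len(words)
    ranked = {w: (c / n_tot) / standard[w]
              for w, c in Counter(words).items() if w in standard}
    return sorted(((w, pr) for w, pr in ranked.items() if pr > threshold),
                  key=lambda item: item[1], reverse=True)
```

Rare words used unusually often dominate the top of the ranking, as the text describes.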
Identifying Contextually Significant Words
A single word can be ambiguous without some context given to it by the words around it. We measure the context of a specific word by recording the distance (that is, counting the number of intervening words) between it and each of the surrounding words. A threshold must be placed on the maximum distance between words that we regard as related, e.g. the first and last words in a book are unlikely to be related. Empirically, distances of less than 10 words have proven to be the best choice. If we define the distance between a word wi and a word wj as Dij, and the number of instances of the pair of words wi, wj occurring within some predetermined number of words of each other in a given portion of text as I(i,j), and then rank the pairs of words in descending order of their number of instances, then the words at the top of the list are contextually significant. Typically, the predetermined number is about 10 words, and the portion of text over which I(i,j) is counted is typically an entire document. However, portions of a document may be used, for example article abstracts and the like.
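The pair counting can be sketched as follows, where the window size plays the role of the predetermined distance; the function name is an assumption for illustration.

```python
from collections import Counter

def significant_pairs(words, max_distance=10):
    """Rank unordered word pairs by how often they occur within max_distance words."""
    counts = Counter()
    for i, wi in enumerate(words):
        # Only look ahead, so each pair of positions is counted once.
        for wj in words[i + 1 : i + max_distance]:
            if wi != wj:
                counts[tuple(sorted((wi, wj)))] += 1
    return counts.most_common()
```

Pairs that repeatedly appear close together, such as a team name and its sport, rise to the top of the list.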
Identifying the Subject of a Document
If we pick the common information from the statistically significant words and the contextually significant words, then the resulting subset of words defines the likely subject of the document being examined. Each part of the information is not sufficient by itself: a very unusual single word must not be able to dominate the subject of a document, but similarly neither should a very common set of close word pairs. Unusual words that are close together are significant and relevant to the subject of the document. In this embodiment, the combination of the statistically significant words and the contextually significant words is a logical union. In some embodiments, the sets can be truncated so that the size of the union set of words is within practical bounds. Practitioners of ordinary skill will recognize that identifying the subject of a document is not limited to this method, but may be accomplished by a variety of techniques.
Generating the User Profile
The User Profile is generated by aggregating instances of the subjects of documents examined by the user over the course of time. The system will also perform a frequency analysis of the subject result words to determine word usage for the user. After a large enough collection of documents has been analyzed, the subject areas of interest to the user will rise to the top of the subject rankings. By aggregating the instances of each subject area over a large number of users, we can calculate the relative interest in a subject for an individual user. We define the relative interest in a subject s as Ir(s) = I(s) / Iav(s), where I(s) is the number of instances of the subject s for a user and Iav(s) is the average number of instances of the subject s over a large number N of users. Besides this technique of aggregation, all of the source documents can be combined and treated as one document to be analyzed. Also, algebraic approximations can be used to combine the results of two different documents. That is, if a set of words in one of a user's documents has one set of usage characteristics and another document has a different set, then a new usage set can be created algebraically in a number of ways, including heuristics. One way is simply to average the frequency values of a word. Another technique is to pick the larger number. However, the point is to use a formula to aggregate the individual characteristics of a number of documents.
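The relative-interest formula Ir(s) = I(s) / Iav(s) can be sketched directly; the function name and dictionary layout are illustrative assumptions.

```python
def relative_interest(user_counts, all_user_counts, n_users):
    """Ir(s) = I(s) / Iav(s), with Iav(s) the mean instance count per user.

    user_counts: subject -> number of instances I(s) for this user
    all_user_counts: subject -> total instances summed over all n_users users
    """
    return {s: user_counts.get(s, 0) / (total / n_users)
            for s, total in all_user_counts.items() if total > 0}
```

A value above 1 marks a subject the user follows more than the average user does.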
Generating the WordCone
The WordCone can be visualised as a 3-dimensional hierarchical structure consisting of subject-related words at each level that summarise the words in hierarchically lower levels, e.g. the word sport has the words football, hockey and baseball beneath it. At each level the subject-related words wrap around to form a "circle" of subject words. Each higher level in the hierarchy is a smaller-diameter circle of (summarising) parent words with a larger-diameter circle of more detailed words underneath. The structure naturally forms a cone. However, it can also be thought of as a tree data structure, where each word is represented by an element in the data structure, called a node. The subject categories are branch nodes and the finely granular normative words are leaf nodes. In the computer memory, a data structure can be used where each element of the data structure is combined with at least one pointer, pointing to a parent node (representing a category) or to a child node, possibly a leaf node, which has no further categorical distinctions. In one embodiment, when the leaf node word is detected, the parent node (and that parent node's parent and so on up the tree) is incremented in value. The point is that if a document has the word "softball", for example, then the node representing the category "sport" should also register a change, in order to facilitate word cone matching and filtering.
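A minimal sketch of such a node structure with upward count propagation; the class and method names are assumptions, not part of the specification.

```python
class Node:
    """One word in the WordCone tree, with a pointer up to its category node."""

    def __init__(self, word, parent=None):
        self.word = word
        self.parent = parent     # branch (category) node above, if any
        self.children = []       # more specific words below
        self.count = 0
        if parent is not None:
            parent.children.append(self)

    def record(self, n=1):
        """Record n occurrences here and in every ancestor category node."""
        node = self
        while node is not None:
            node.count += n
            node = node.parent
```

Recording an occurrence of "softball" at a leaf therefore also registers a change at the "sport" branch above it, as described in the text.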
The 3-dimensional nature of the structure is necessary to account for documents that cross subject areas. For example, History and Science could be nodes in the hierarchy. Some documents will be about the History of Science. In such a document there will be subject words that are children of the History node and children of the Science node. There will be links between the History node and the children of the Science node (and vice versa). These links will fall across the body of the cone structure (hence the need for a 3-dimensional structure). Grouping related subjects can be done with a suitable classification algorithm such as Bayesian classification or fuzzy logic, as is common in areas such as spam recognition.
An entire group of users can be considered to have one Word Cone data structure, which can be determined by aggregating each user's individual Word Cone. Documents, websites or other sources of words can be analyzed in the same manner so that they too have a Word Cone calculated. In the typical system, most of the nodes in the Word Cone are associated with a zero word frequency: very few documents use more than a small fraction of the words in a language. Documents can be matched with users by comparing the nodes in the document's word cone to the user's word cone. Those comparisons that show a sufficiently close match indicate that the user is likely to find the document interesting. In the same manner, a user can adjust their word cone so that they can pre-focus search efforts. For example, the user can specify that their immediate interest is "sports", and hence matching is conducted in the "sports" sub-tree of the user's word cone and not across all subjects, so that not all of the nodes need to be matched. The comparison of word cones is done by comparing the nodes. Each node in the data structure has at least two elements: the word and its frequency. Additionally, there are one or two structure pointers that point to the parent or leaf nodes or both.
Two word cones will likely differ, because a document and a user profile will not have the same word set. Therefore, nodes not appearing in both word cone structures are not further compared. Of the nodes appearing in both word cones, the frequency value is compared and, if within a predetermined threshold, considered a match. The result of the comparison is a certain number of words in common and, for each word, a match or no match. Any kind of analysis can be used to determine if a sufficient number of words in common have matching frequencies of usage. In addition, heuristics can be used to weight the comparisons. For example, branch matches may be considered more significant than mere leaf matches. In any case, when the two word cones are found to be sufficiently matched, the results of the match can be used, for example, for indexing or searching.
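The node-by-node comparison can be sketched as follows, representing each cone as a word-to-frequency mapping; the relative-difference threshold and function name are assumptions made here for illustration.

```python
def match_cones(cone_a, cone_b, threshold=0.5):
    """Compare two word cones given as word -> frequency dicts.

    Words missing from either cone are skipped. A shared word matches when
    its two frequencies (assumed positive) differ by less than `threshold`
    relative to the larger of the two. Returns (matches, words in common).
    """
    common = set(cone_a) & set(cone_b)
    matches = sum(
        1 for w in common
        if abs(cone_a[w] - cone_b[w]) / max(cone_a[w], cone_b[w]) < threshold
    )
    return matches, len(common)
```

A caller can then decide whether enough of the common words match, optionally weighting branch nodes more heavily than leaf nodes as the text suggests.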
Additional Embodiments:
Definitions
User-controlled means a characteristic of the user profile in which the user is able to know the profile is being created, and to control the use and display of the information in the profile. While the user may enter some of the information, other parts (e.g. interests and usage) are automatically derived and not controlled by the user.
WordCone means a structured organization of a set of words organized by similarity of words at the leaf and all levels of the structure. Also the name of the overall system. The nodes above the leaf level partition the word set into related groups. The structure forms a hierarchical cone - at all levels of the cone, similar words are close together and organized along a circular path. The structure can be cut vertically at any point to form a hierarchy with a thesaurus at the leaf level. The leaf level of the thesaurus is what is mapped along the fractal curve.
Filtering can easily be performed by, for example, allocating more of the length of the fractal curve to elements which have high levels of user interest. The WordCone can be an initial variety generated by analysis of the Web, a user WordCone which reflects the user's interests, or the current WordCone which is the summary of all user WordCones.
Definitions
Subject: a subject is an identifiable topic that an Information Unit relates to.
Category: a synonym of Subject.
Information Unit: an Information Unit is a collection of words about a subject. An Information Unit may take the form of written or spoken words in electronic or physical form. So both a web page and a television program are examples of Information Units. An Information Unit may also be a subset of a larger Information Unit, e.g. a chapter in a book.
Creating a Global Structure for User Profiles
In order to sum or compare interests between users, we must measure them against a common standard structure.
The common structure is essentially a tree structure with each node representing one or more subject words that summarise the subject words in their child nodes. So a node that represents "sport" would have child nodes that represent "football", "basketball", "athletics" etc. This is like an agreed thesaurus or Dewey Decimal index which each user can then populate. There are a number of ways that this hierarchy could be constructed. The most obvious is to construct the initial categories by hand and fill the leaf nodes by manually classifying a set of "training documents".
Another mechanism for filling the structure would be to determine the subjects (we describe how later) of a large set of Information Units and look for the most frequent hypernyms of those subjects (x is a hypernym of y if the statement "y is a kind of x" is true, so for example sport is a hypernym of football). Each of the most frequent hypernyms becomes the parent category for the Information Units where that hypernym is found most often. Finding hypernyms can be done using common tools such as WordNet (http://wordnet.princeton.edu). This global structure is created and updated by continuously determining the subjects examined by all of the users. Updated versions of the structure are periodically distributed to all users in order to ensure that subjects are up to date and that every user is measured against the most current structure.
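The hypernym step can be sketched as follows, with a toy lookup table standing in for a real resource such as WordNet; the table entries and function name are illustrative assumptions.

```python
from collections import Counter

# Toy hypernym table; a real system would query WordNet instead.
HYPERNYMS = {"football": "sport", "hockey": "sport", "baseball": "sport",
             "sonnet": "poetry", "haiku": "poetry"}

def parent_category(subject_words):
    """Return the most frequent hypernym of the unit's subject words, if any."""
    counts = Counter(HYPERNYMS[w] for w in subject_words if w in HYPERNYMS)
    return counts.most_common(1)[0][0] if counts else None
```

The most frequent hypernym becomes the parent category for the Information Unit, as described above.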
Populating the Structure to Create a User Profile
We create a user profile by associating a numerical value with each node in the common structure that is a "measurement" of the user's level of interest in the subject matter associated with the node. A value of 0 indicates that a user has no interest in a subject, and the larger the number, the greater the interest. The measurement should reflect such factors as the number of Information Units that a user examines in a subject area and the frequency with which they are examined, since this should give a good indication of interest. Other possible factors could be included.
Calculating the Interest Levels for the User Profile
There is more than one possible scoring system for determining which subjects a user is interested in. A crude mechanism would be a simple Boolean flag, set to TRUE if any Information Unit examined by the user fell into a given category and FALSE if there were no Information Units. The parent flag would be set to TRUE if any child node were set to TRUE, and FALSE if all child nodes were FALSE. In practice the user will examine many separate instances in an area, and an analog value is created to express the overall level of interest across time.
Another possible method would be to calculate a number for each category between (0,1) where 0 implied that the user had no interest (i.e. had examined no Information Units) in a category and 1 implied that the user had read every Information Unit relating to the category.
The method of calculating this would be to divide the total words in a category that the user had examined by the global number of words in the same category, i.e. the sum total of all the words in that category examined by all users. A suggested algorithm for performing this calculation follows. The same calculation could be applied at higher levels in the hierarchy by summing the user's word count for all of the subcategories and dividing by the sum of the global word counts for the subcategories.
An Algorithm for Calculating Interest Levels
Suppose that we have identified an Information Unit D, examined by a user, as belonging to a category (subject) C using Bayesian classification or some other method. If D contains a total of Nd words and the global total of words in the subject area C is Nc, then we add the contribution for this Information Unit into the user's existing structure thus:
Add the contribution of the new Information Unit to the user's running total Nu for category C and to the global total:
Nu = Nu + Nd
Nc = Nc + Nd
Calculate the new interest value Ic for category C as the ratio of the user's cumulative word count to the global word count:
Ic = Nu / Nc
We must also communicate this additional contribution to the global total in the category in such a way that it is available to all other users in order to allow the same calculations to be computed for every user.
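A sketch of this update in Python, keeping a per-user running count alongside the global count so that the interest value reflects the user's cumulative reading in the category; the function and dictionary names are assumptions.

```python
def update_interest(user_totals, global_totals, category, n_d):
    """Fold one Information Unit of n_d words into the interest levels.

    user_totals and global_totals map a category to cumulative word counts;
    the returned interest value Ic is the user's cumulative count in the
    category divided by the global cumulative count.
    """
    user_totals[category] = user_totals.get(category, 0) + n_d
    global_totals[category] = global_totals.get(category, 0) + n_d
    return user_totals[category] / global_totals[category]
```

The change to the global total would then be broadcast to other users, as described above, so every profile is measured against the same denominator.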
Another Embodiment:
Combining the Analysis of Multiple Documents
Suppose that we have identified a document D as belonging to a category (subject) C using Bayesian classification or some other method. If D consists of a set of words {wi}, and each wi appears ni times in the document, we add the contribution for this document into the existing structure thus:

for each wi not already in the category C:
    store the word count value ni
for each wi already in the category C:
    retrieve the stored running total count for wi, Ni
    set Ni = Ni + ni
    store Ni as the new running total for wi
retrieve the total number of words Nc found in all documents in category C so far
add the total number of words in D to the total number in C: Nc = Nc + sum_i ni
store Nc for category C
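The per-word bookkeeping in the algorithm above can be sketched as follows; the dictionary layout and function name are assumptions for illustration.

```python
from collections import Counter

def add_document(category_counts, category_totals, category, doc_words):
    """Fold one document's word counts into a category's running analysis.

    category_counts: category -> Counter of running totals Ni per word wi
    category_totals: category -> running total Nc of all words in the category
    """
    counts = Counter(doc_words)                      # ni for each wi in D
    category_counts.setdefault(category, Counter()).update(counts)
    category_totals[category] = category_totals.get(category, 0) + sum(counts.values())
```

The same routine can maintain either a user's local copy or the shared global analysis, depending on which pair of dictionaries is passed in.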
Each user examines a number of documents, and the above analysis is carried out for each document. A local copy is kept for each user that contains only the analysis (word counts by word and total words by category) for the documents examined by that user. A global analysis is also updated that contains the results for the documents examined by every user. The global analysis is the master structure that forms the standard that every user populates with their profile (dealt with under scoring).
The analysis could either be carried out at the user end and the results transmitted to the central resource, or the document (or its location) could be transmitted to the central resource and the analysis carried out there.
Building the Hierarchical Structure
The hierarchical structure is essentially a tree structure with each node representing one or more words that summarise the content of the words in their child nodes. So a node that represents "sport" would have child nodes that represent "football", "basketball", "athletics" etc.
There are a number of ways that this hierarchy could be constructed. The most obvious is to construct the categories by hand and fill the leaf nodes by manually classifying a set of "training documents".
Another mechanism for filling the structure would be to analyse the words in a document as described above and look for the most frequent hypernyms (x is a hypernym of y if the statement "y is a kind of x" is true, so for example sport is a hypernym of football). Each of the most frequent hypernyms becomes the parent category for the documents where that hypernym is found most often. Finding hypernyms can be done using common tools such as WordNet, a research program at Princeton University (http://wordnet.princeton.edu).
Category Scoring for the User
There is more than one possible scoring system for determining which subjects each user is interested in. One possible method would be to calculate a number for each category between (0,1) where 0 implied that the user had no interest (i.e. had examined no documents) in a category and 1 implied that the user had read every document relating to the category.
The method of calculating this would be to divide the total words in a category that the user had examined by the global number of words in the same category (both numbers being taken from the analysis above). The same calculation could be applied at higher levels in the hierarchy by summing the user's word count for all of the subcategories and dividing by the sum of the global word counts for the subcategories.
Another, cruder mechanism would be a simple Boolean flag set to TRUE if there were any word count entries for the user in a given category and FALSE if there were no entries. The parent flag would be set to TRUE if any child node were set to TRUE, and FALSE if all child nodes were FALSE.
Searching:
Searching a set of documents where each document already has a word cone associated with it can be implemented. In this case, the user can specify some keywords. Those words in turn can be expanded to find related words, as demonstrated by the prior art, or simply by using the word cone to include those words constituting parent nodes to the specified keywords and, if the keyword is a leaf node, those words within a predetermined distance of the leaf node, or a combination of the two. In any case, the expanded set of keywords can be used to expand the results of the search. Practitioners of ordinary skill will recognize that expanding search results is often not the goal, but rather focusing search results on the few documents that are truly relevant. The invention facilitates this by generating a word cone for each recovered document. The word cone can be matched against all or part of a user's word cone, where the match context is limited to a subtree, that is, a subject area that encompasses only part of the word cone. Those documents whose word cones sufficiently match the subtree would be considered relevant search results. Another way that the invention facilitates filtering search results is by making the query more limited: if the user specifies keywords and selects a subject area, then the invention can determine which words are the most likely to be found in documents related to the same subject area. In that case, the query can be automatically recomposed to require that those additional words are present and, as described above, exist at approximately the same word frequencies as expected for that search subject. In another embodiment, the user does not explicitly select a subject area for the search. Instead, the user's keywords are used and the word cone for the user is examined to determine which subject areas the keywords are mostly coming from. This is accomplished by examining the parent and leaf node structure.
As an example, a query for "basketball, knee and arthroscopic" is more likely a search under the category of "medicine" than "sports". That is because all three words would be found under "medicine" but not under "sports". In this way, incoming search query results can be filtered by determining which documents have approximately the same expected word frequencies for all three words within the document.
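The category inference in this example can be sketched by scoring each category by how many of the query words its vocabulary contains; the data layout and function name are assumptions.

```python
def likely_category(query_words, category_vocab):
    """Pick the category whose word set covers the most query words.

    category_vocab maps a category name to the set of words observed
    under that category in the word cone.
    """
    return max(category_vocab,
               key=lambda c: sum(w in category_vocab[c] for w in query_words))
```

With "basketball", "knee" and "arthroscopic" all present under "medicine" but only "basketball" under "sports", the query is assigned to "medicine".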
Another Embodiment:
The KnowingMe invention has applications beyond just the searching and filtering of documents or adverts. For example, this invention can be used to create a distributed search engine (as opposed to the centralised, company-controlled model used now). This is accomplished by the use of peer-to-peer distribution systems.
If all users who are members of a peer-to-peer system have every document analyzed by the KnowingMe application, then a search consists of broadcasting a desired word cone set to all users. Each user makes a comparison locally and then sends keywords or excerpts back to the requesting user. The requesting user can then request specific documents directly from the peer computer. In this case, there is no central indexing of document data.
The invention is not limited to searching a contained set of (structured) information (e.g. a database). It can search information wherever it can be found, including the internet. The invention may be a stand-alone application, or a tool that runs in conjunction with, or "piggybacks" on, existing search engines by constructing appropriate search terms on behalf of the user or creating its own index of web pages.
KnowingMe users can have their classification or word cone results uploaded in an aggregate data form so that a large user-base word cone is created. This data classification is dynamically updated by a large number of users. Both the classification and the data itself can be updated over time.
The word cone classification is not limited to documents. The text that a user reads can be used to classify the user themselves, or it could classify a collection of text (a whole web site), or a collection of people (a company). It is therefore also a means of aggregating knowledge.
The collective information gathered is anonymous. Each individual user chooses how much information to reveal, to whom, at any given time. No advertising agency purchases space. The user can reveal as much information as they please in order to receive relevant adverts.
Another Embodiment:
1.1 Using the user interests in searching
When the user makes a search, e.g. on Google, KnowingMe finds the area in WordCone to which the search is related, and extracts heavily used words and phrases from that subject area. KnowingMe then adds those to the search, thereby making the search more precise.
For example, if the user searches for "Chelsea Barcelona", KnowingMe may find heavily used words related to the Chelsea term such as "football club" and "Stamford Bridge". When these are added, the search results are reduced to 41 instead of the original 10,000,000. A slider bar allows the strength of the filter to be varied, e.g. to give a single page of results, or 1,000 results.
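The query refinement and slider described above can be sketched as follows. This is an illustrative assumption of how the slider might map to the number of added terms; `refine_query` is a hypothetical name and `related_words` is assumed to be pre-ranked by usage within the matched subject area.

```python
def refine_query(terms, related_words, strength):
    """Append the top heavily used words from the matched subject area.
    strength (0.0 - 1.0) plays the role of the slider bar: higher values
    add more words, narrowing the results further."""
    n = round(strength * len(related_words))
    return list(terms) + related_words[:n]

related = ["football club", "Stamford Bridge", "Premier League", "Abramovich"]
q = refine_query(["Chelsea", "Barcelona"], related, strength=0.5)
```

At `strength=0.0` the original query passes through unchanged; at `strength=1.0` all ranked subject-area words are appended.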
1.2 Word frequencies
The user's vocabulary size is calculated from his/her usage of the Web and from words written by the user.
From this, the size of the user's vocabulary can be estimated in terms of stem words, e.g. 25,000 stem words. Although the user may not have looked at all these words, the profile of word usage will drop off steeply at around 25,000 words, i.e. the user will use a few words in the 30,000 range but very few in the 40,000 range.
Now we can compare the user's vocabulary with that of a typical person with a vocabulary of 25,000 words. For example, to find out whether the user is interested in Science, we could take a batch of Science words drawn from the words ranked between 22,000 and 28,000 most common in the language.
The Science words would be extracted from the WordCone section on Science as if it were a thesaurus section. Each word is classified according to its frequency of usage on the web, allowing a selection to be made of words corresponding to the size of the user's vocabulary.
A typical user with a vocabulary of 25,000 stem words will therefore have 50% of these words in his/her vocabulary. In examining this particular user's vocabulary we might instead find 93% of these Science words, indicating a heavy user interest in the area.
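The 50%/93% comparison above can be expressed as a ratio of the observed hit rate on subject probe words to the baseline expected for a typical user of the same vocabulary size. A minimal sketch; `interest_score` and the probe words are illustrative, not from the specification.

```python
def interest_score(user_vocab, probe_words, baseline=0.5):
    """Fraction of subject probe words found in the user's vocabulary,
    divided by the fraction expected for a typical user of the same
    vocabulary size. Values well above 1.0 suggest a strong interest."""
    found = sum(1 for w in probe_words if w in user_vocab)
    observed = found / len(probe_words)
    return observed / baseline

# 3 of 4 Science probe words present: 75% observed vs 50% baseline
probes = ["isotope", "enzyme", "quark", "mitosis"]
vocab = {"isotope", "enzyme", "quark", "the", "cat"}
score = interest_score(vocab, probes)
```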
1.3 Other potentially novel aspects
1: Estimating the user's interest by comparing their vocabulary to "stereotype" vocabularies to determine interest, nationality, locality etc.
2: Building a user profile which is controlled by the user for his/her own benefit rather than the benefit of advertisers.
3: Having a user profile parts of which can be revealed incrementally without necessarily giving away details which the user wishes to keep anonymous (e.g. e-mail). This allows a user to find products or even negotiate rewards without having to give away sensitive data.
Personal Clickthrough management
KnowingMe allows the application of personal clickthrough management. Currently the money for a clickthrough goes to the search engine and to the site on which the advert appears. However, if the user can be incentivized to follow the clickthrough, or go further down the chain towards buying the product or service, the user could benefit by receiving a percentage of the clickthrough. KnowingMe can maximize this opportunity by increasing the value of the clickthrough because of known user interests. KnowingMe would also negotiate the clickthrough "fee" for the individual user and route it to a payment mechanism, e.g. PayPal.
The novel aspects are:
1 : Organizing clickthroughs to enable the user to benefit.
2: Maximizing the value of the clickthrough for the user.
3 : Banking and routing payments to the user.
By data object, it is meant any kind of file that contains text or data that can be converted into text, whether HTML, Java, digital video stream, word processing document, email, mobile SMS or MMS, text derived from voice data conversion, or any other kind of format.
A hyperlink: a link in a data object to information within that data object or another data object. These links are usually represented by highlighted words or images. When a reader selects a hyperlink, the computer display switches to the document or portion of the document referenced by the hyperlink.

Retrieving by Internet search may be accomplished in three ways. First, a user who uses an internet search website can transmit to the search website one or more keywords. The website will respond by sending back a message with links to retrieved web pages. In this case, the invention may be run on the user's computer, and the user's computer can process the retrieved hyperlinks to rank them according to the relevancy score. This can be done by fetching the cited web pages and processing them to calculate their relevancy. Second, the search website itself can execute the invention, in which case the internet search retrieval is simply the operation of its search engine database lookup system. The website operator can process cached copies of the web pages in its database to come up with relevancy scores. Alternatively, the word preference profile of a website can be determined during the web-crawling phase of search engine operation and the profile stored. Third, an intermediate website can receive keyword search requests from the user, transmit the search request to a search engine, and then check the results for relevancy before sending the most relevant search results back to the user.
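The first retrieval mode, in which the user's own computer re-ranks the returned hyperlinks, reduces to sorting results by a relevancy score. A minimal sketch; `rerank` is a hypothetical name, and the scoring callback stands in for fetching each cited page and matching its word cone against the user's profile.

```python
def rerank(results, relevancy):
    """Order retrieved hyperlinks by a relevancy score computed on the
    user's machine. `relevancy` is any callable scoring one result."""
    return sorted(results, key=relevancy, reverse=True)

# Illustrative (url, score) pairs standing in for fetched-and-scored pages
pages = [("http://a.example", 0.2), ("http://b.example", 0.9),
         ("http://c.example", 0.6)]
ranked = rerank(pages, relevancy=lambda p: p[1])
```

The same sorting step applies in the second and third modes; only the place where the score is computed (search site or intermediate site) differs.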
Practitioners of ordinary skill will recognize that the invention can be implemented on a computer by means of software, such that coded instructions comprising software that are executed by the computer cause the computer to execute the method. The computer typically comprises a central processing unit (CPU), a random access addressable memory (RAM) operatively coupled to the CPU, external mass storage, typically a hard drive, and some kind of input-output device (I/O). I/O is typically a keyboard operatively connected to the computer and a visual display also operatively connected to the computer. In the typical embodiment, the computer has an operative connection to a data communications network, typically the Internet, and executes typical network protocols in order that data packets be transmitted and received between one or more computers embodying the invention. Computer program logic implementing all or part of the functionality previously described herein may be embodied in various forms, including, but in no way limited to, a source code form, a computer executable form, and various intermediate forms (e.g., forms generated by an assembler, compiler, linker, or locator). Source code may include a series of computer program instructions implemented in any of various programming languages (e.g., object code, an assembly language, or a high-level language such as FORTRAN, C, C++, JAVA, or HTML) for use with various operating systems or operating environments. The source code may define and use various data structures and communication messages. The source code may be in a computer executable form (e.g., via an interpreter), or the source code may be converted (e.g., via a translator, assembler, or compiler) into a computer executable form.
The computer program embodying the invention may be fixed in any form (e.g., source code form, computer executable form, or an intermediate form) either permanently or transitorily in a tangible storage medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical memory device (e.g., a CD-ROM), a PC card (e.g., PCMCIA card), or other memory device. The computer program may be fixed in any form in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies, networking technologies, and internetworking technologies. The computer program may be distributed in any form as a removable storage medium with accompanying printed or electronic documentation (e.g., shrink-wrapped software or a magnetic tape), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the communication system (e.g., the Internet or World Wide Web).
Although the present invention has been described and illustrated in detail, it is to be clearly understood that the same is by way of illustration and example only, and is not to be taken by way of limitation. It is appreciated that various features of the invention which are, for clarity, described in the context of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable combination. It is appreciated that the particular embodiment described in the Appendices is intended only to provide an extremely detailed disclosure of the present invention and is not intended to be limiting. It is appreciated that any of the software components of the present invention may, if desired, be implemented in ROM (read-only memory) form. The software components may, generally, be implemented in hardware, if desired, using conventional techniques. The spirit and scope of the present invention are to be limited only by the terms of the appended claims.

Claims

CLAIMS:
1. A method of determining the relevance of a collection of at least one data object to a user comprising:
Determining the word preference profile of the collection of data objects;
Determining the word preference profile of the user;
Calculating a relevance score as a function of the collection profile and the user profile; and
Storing the result of the calculation.
2. The method of claim 1 where the word preference profiles are derived from word use frequencies.
3. The method of Claim 2 where the calculation step is comprised of calculating the dot product of at least part of the word frequency vector of the collection of data objects with at least part of the word frequency vector of the user.
4. The method of Claim 3 where the word frequency vector values represent for each word corresponding to the values, the relative difference between the word frequency of an average user of a similar sized vocabulary and the word frequency of such user.
5. The method of Claim 1 where a subset of the word preference profile of the user and a subset of the word preference profile of the object is used to calculate the relevance score.
6. The method of Claim 1 further comprising displaying on a user's computer display graphical objects representing Web search results where the display order of the results is ranked substantially in the order of the relevance score of each result.
7. The method of Claim 6 where the display order is presented with the higher relevancy scored search results presented first.
8. The method of Claim 3 further comprising comparing the result of the dot product with a pre-determined value.
9. The method of Claim 8 where the pre-determined value is changed by the user.
10. The method of Claim 8 where the pre-determined value is calculated as a function of the percentage usage of words in a specified category.
11. The methods of Claims 1 through 10 where the collection of data objects is at least one web-page.
12. The method of Claims 1 through 10 where the collection of data objects is an entire web-site.
13. A method of representing the relative interests of a computer user comprising:
Storing in a computer memory a data structure comprised of at least one data element each further comprised of one word;
For each stored element in the data structure, determining the relative frequency of use by the user of the word comprising the element;
Storing the determined result in the data element of the word cone corresponding to such word.
14. The method of Claim 13 where the determining step is further comprised of detecting each word input by a user into the computer.
15. The method of Claim 14 where the detection step is comprised of checking each user document accessible to the computer and compiling a word frequency count for such document.
16. A method of making a data representation of the subject matter of a data object comprising:
Retrieving from computer memory a stored data element comprising a data structure said element comprising a word;
For the word stored in the data element, determining the relative use frequency of such word by the data object;
Storing the determined result in the element of the word cone associated with such word.
17. The method of Claim 11 further comprising retrieving a web-page; calculating the word preference profile for the retrieved web-page and compiling a word preference profile for such web-page.
18. The method of Claim 17 further comprising matching a word in the retrieved web- page with a word in the data structure and updating the determined result associated with the matched word.
19. The method of Claim 17 further comprised of matching a word in the loaded web-page with a word in the data structure, making a first calculation of the word frequency of said matched word in the web-page and making a second calculation of a new average word frequency using the result of the first calculation.
20. A method of determining a word cone data structure comprising: Storing in a computer memory a first word-cone data structure;
For a second and third word cones representing the relative interests of two distinct users, combining the second and third word-cones by taking the frequency value associated with a word in the second and third word cones and averaging the values and storing the average value in the stored first word-cone data structure at the element location associated with such word.
21. A method of determining the relevancy between two data objects comprised of: Calculating a first and second word preference profile corresponding to a first and second data object;
Determining if the first and second profile are sufficiently similar;
Transmitting a message indicating that the first and second profiles are sufficiently similar.
22. A method of determining the authenticity of a user comprising: determining the word use frequency profile of the user; comparing said determined profile with a stored profile associated with the identity of the user; and storing the comparison result.
23. A method of conducting an internet search comprising: receiving from a user data comprising at least part of a word preference profile; receiving from said user at least one search term; retrieving by means of an internet search, at least one web-page containing the at least one search term; calculating a relevancy score for each such retrieved web-pages using the user data.
24. A method of finding at least one document on a computer connected to a data communications network comprising:
calculating a first word preference profile for at least one document on the computer; receiving into the computer a word cone request, said request coupled with a request source address and comprising a second word preference profile; comparing the second word preference profile with at least one calculated word preference profile; transmitting to the request source address the at least one document whose calculated word cone profiles meet a matching criteria.
25. The method of Claim 23 where the at least part of the word preference profile is word preference profiles for one or more words associated with a subject category.
26. A method of conducting an Internet search comprising:
Receiving from a user at least one search term; Receiving from a user at least one category term;
Determining a word preference profile for the at least one category term; Retrieving by means of Internet search, at least one web page containing the at least one search term;
Calculating the relevancy score of the at least one retrieved web pages.
27. The method of Claim 26 further comprising:
Transmitting hyperlinks to the retrieved web pages.
28. The method of Claim 27 where the transmitted hyperlinks are embedded in a transmitted web page.
29. The method of Claim 28 where the order of appearance of the transmitted hyperlinks in the transmitted web page is substantially dependent on the relative relevancy scores of the web pages associated with the transmitted hyperlinks.
30. A method of conducting an Internet search comprising:
Receiving from a user at least one search term;
Receiving from a user at least part of a word preference profile;
Retrieving by means of Internet search, at least one hyperlink to a web page containing the at least one search term;
Calculating the relevancy score of the web page corresponding to the at least one retrieved hyperlink.
31. The method of Claim 30 further comprising:
Transmitting the retrieved hyperlinks.
32. The method of Claim 31 where the transmitted hyperlinks are embedded in a transmitted web page.
33. The method of Claim 32 where the order of appearance of the transmitted hyperlinks in the transmitted web page is substantially dependent on the relative relevancy scores of the web pages associated with the transmitted hyperlinks.
PCT/GB2007/003418 2006-09-11 2007-09-11 Method and system for filtering and searching data using word frequencies WO2008032037A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US82520106P 2006-09-11 2006-09-11
US60/825,201 2006-09-11

Publications (1)

Publication Number Publication Date
WO2008032037A1 (en) 2008-03-20

Family

ID=38621708

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/003418 WO2008032037A1 (en) 2006-09-11 2007-09-11 Method and system for filtering and searching data using word frequencies

Country Status (1)

Country Link
WO (1) WO2008032037A1 (en)


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6981040B1 (en) * 1999-12-28 2005-12-27 Utopy, Inc. Automatic, personalized online information and product services


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AHU SIEG, BAMSHAD MOBASHER, AND ROBIN BURKE: "Inferring User's Information Context: Integrating User Profiles and Concept Hierarchies", in Proc. of the 2004 Meeting of the International Federation of Classification Societies, Chicago, IL, 15 July 2004 (2004-07-15) - 18 July 2004 (2004-07-18), pages 1 - 12, XP002457374, Retrieved from the Internet <URL:http://maya.cs.depaul.edu/~mobasher/papers/arch-ifcs2004.pdf> [retrieved on 20071031] *
J.C. BOTTRAUD, G. BISSON, M.F. BRUANDET: "An Adaptive Information Research Personal Assistant", IN PROC. OF WORKSHOP ARTIFICIAL INTELLIGENCE, INFORMATION ACCESS AND MOBILE COMPUTING, ACAPULCO, MEXICO, 2003, 11 August 2003 (2003-08-11), pages 48 - 58, XP002457373, Retrieved from the Internet <URL:http://users.dimi.uniud.it/workshop/ai2ia/cameraready/bottraud.pdf> [retrieved on 20071030] *
NORIHIDE SHINAGAWA ET AL: "Dynamic Generation and Browsing of Virtual WWW Space Based on User Profiles", INTERNET APPLICATIONS LECTURE NOTES IN COMPUTER SCIENCE;;LNCS, SPRINGER-VERLAG, BE, vol. 1749, 2004, pages 93 - 108, XP019001155, ISBN: 3-540-66903-5 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7949647B2 (en) 2008-11-26 2011-05-24 Yahoo! Inc. Navigation assistance for search engines
US8484184B2 (en) 2008-11-26 2013-07-09 Yahoo! Inc. Navigation assistance for search engines
US20140289389A1 (en) * 2012-02-29 2014-09-25 William Brandon George Systems And Methods For Analysis of Content Items
US9514461B2 (en) * 2012-02-29 2016-12-06 Adobe Systems Incorporated Systems and methods for analysis of content items
EP2840515A4 (en) * 2012-04-17 2015-09-23 Tencent Tech Shenzhen Co Ltd Method, device and computer storage media for user preferences information collection
CN110457917A (en) * 2019-01-09 2019-11-15 腾讯科技(深圳)有限公司 Filter out the method and relevant apparatus of the illegal contents in block chain data
CN110457917B (en) * 2019-01-09 2022-12-09 腾讯科技(深圳)有限公司 Method and related device for filtering illegal content in block chain data


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07804216

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07804216

Country of ref document: EP

Kind code of ref document: A1