US20100057719A1 - System And Method For Generating Training Data For Function Approximation Of An Unknown Process Such As A Search Engine Ranking Algorithm - Google Patents
- Publication number
- US20100057719A1 (application US12/367,656)
- Authority
- US
- United States
- Prior art keywords
- page
- rank
- label
- keyword
- processor
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
A system and method for generating training data for a machine learning system. A training data generator server sends at least one keyword to a search engine. The training data generator server receives at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword. The training data generator server assigns a first label to the first page based on the first rank; and assigns a second label to the second page based on the second rank. The first web page, second page, first label and second label are forwarded to a machine learning server.
Description
- This application claims priority to U.S. Patent application Ser. No. 61/093,586 entitled “Techniques for Automated Search Rank Function, Approximation, Rank Improvement Recommendations and Predictions”, filed Sep. 2, 2008, the entirety of which is hereby incorporated by reference.
- 1. Field of the Invention
- This disclosure relates to machine learning algorithms and, more particularly, to generation of training data for machine learning algorithms.
- 2. Description of the Related Art
- Referring to FIG. 1, the World Wide Web (“WWW”) is a distributed database including literally billions of pages accessible through the Internet. Searching and indexing these pages to produce useful results in response to user queries is a constant challenge. A search engine is typically used to search the WWW.
- A typical prior art search engine 20 is shown in FIG. 1. Pages from the Internet or other source 22 are accessed through the use of a crawler 24. Crawler 24 aggregates pages from source 22 to ensure that these pages are searchable. Many algorithms exist for crawlers, and in most cases these crawlers follow links in known hypertext documents to obtain other documents. The pages retrieved by crawler 24 are stored in a database 36. Thereafter, these pages are indexed by an indexer 26. Indexer 26 builds a searchable index of the pages in a database 34. For example, each web page may be broken down into words and respective locations of each word on the page. The pages are then indexed by the words and their respective locations.
- In use, a user 32 sends a search query to a dispatcher 30. Dispatcher 30 compiles a list of search nodes in cluster 28 to execute the query and forwards the query to those selected search nodes. The search nodes in search node cluster 28 search respective parts of the index 34 and return search results along with a document identifier to dispatcher 30. Dispatcher 30 merges the received results to produce a final result set displayed to user 32 sorted by ranking scores based on a ranking function.
- The ranking is further a function of the query itself and the type of page produced. Factors that are used for relevance include hundreds of features extracted, collected or identified for each page including: a static relevance score for the page such as link cardinality and page quality, superior parts of the page such as titles, metadata and page headers, authority of the page such as external references and the “level” of the references, the GOOGLE page rank algorithm, and page statistics such as query term frequency in the page, words on a page, global term frequency, term distances within the page, etc.
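As a minimal sketch of the word-and-location indexing attributed above to indexer 26, the following Python builds a positional index over a few pages. The tokenizer, the function name `build_index`, the page identifiers and the sample text are all assumptions made for illustration; the prior art search engine is not described at this level of detail.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(pages: Dict[str, str]) -> Dict[str, List[Tuple[str, int]]]:
    """Hypothetical helper: map each word to (page_id, word_position) pairs."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for page_id, text in pages.items():
        # Assumed tokenizer: lowercase runs of letters and digits.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].append((page_id, position))
    return dict(index)


# Illustrative pages only; real pages would come from the crawler.
pages = {
    "p1": "Cordless telephone for sale",
    "p2": "Telephone repair and telephone parts",
}
index = build_index(pages)
```

Looking up a query term in `index` returns every page and position where the term occurs, which is the information a search node would need before any ranking function is applied.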
- The use of search engines has become one of the most popular online activities with billions of searches being performed by users every month. Search engines are also a starting point for consumers for shopping and various day to day purchases and activities. With billions of dollars being spent by consumers online, it has become ever more important for web sites to organize and optimize their web pages in an effort to be more visible and accessible to users of a search engine.
- As discussed above, for each web page, hundreds of features are extracted and a ranking function is applied to those features to produce a ranking score. A merchant with a web page would like his page to be ranked higher in a result set based on relevant search keywords compared with web pages of his competitor for the same keywords. For example, for a merchant selling telephones, that merchant would like his web page to acquire a higher ranking score, and appear higher in a result set produced by a search engine based on the keyword query “telephone” than the ranking scores of web sites of his competitors for the same keyword. There are some prior art solutions available to guess the ranking algorithm used by a search engine and to provide recommendations about improvements that can be made to web pages so that the ranking score for a web page relating to particular keywords may improve. However, most of these systems use manual human judgment and historical knowledge about search engines. Humans must be trained to perform this analysis. These judgments are mostly based on guesses or arrived at by trial and error. Consequently, most prior art solutions are inaccurate, time consuming, and require expensive human capital. Moreover, these solutions are available only for specific search engines, are not immune to changes in search or ranking algorithms used by known search engines, and do not have the ability to adapt to new search engines.
- One embodiment of the invention is a method for generating training data for a machine learning system. The method comprises sending at least one keyword to a search engine; and receiving at a first processor at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword. The method further comprises assigning at the first processor a first label to the first page based on the first rank; assigning at the first processor a second label to the second page based on the second rank; and forwarding the first web page, second page, first label and second label to a machine learning processor.
- Another embodiment of the invention is a method for generating training data for a machine learning system. The method comprises sending at least one input to a system effective to perform a process; and receiving at a first processor at least a first and a second output from the system in response to the input, the first output having a first rank, the second output having a second rank, the first and second rank being based on the input. The method further comprises assigning at the first processor a first label to the first output based on the first rank; assigning at the first processor a second label to the second output based on the second rank; and forwarding the first result, second result, first label and second label to a machine learning processor.
- Yet another embodiment of the invention is a system for generating training data for a machine learning system. The system comprises a first processor effective to send at least one keyword to a search engine. The first processor is further effective to: receive at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword; assign a first label to the first page based on the first rank; and assign a second label to the second page based on the second rank. The system further comprises a machine learning processor connected to the first processor, the machine learning processor effective to receive the first web page, second web page, first label and second label.
- Still another embodiment of the invention is a computer readable storage medium including computer executable code effective to generate training data for a machine learning system. The code includes the steps of sending at least one keyword to a search engine; and receiving at a first processor at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword. The code further includes the steps of assigning at the first processor a first label to the first page based on the first rank; assigning at the first processor a second label to the second page based on the second rank; and forwarding the first web page, second page, first label and second label to a machine learning processor.
- The drawings constitute a part of the specification and include exemplary embodiments of the present invention and illustrate various objects and features thereof.
- FIG. 1 is a system drawing of a search engine in accordance with the prior art.
- FIG. 2 is a system drawing of a machine learning system in accordance with an embodiment of the invention.
- FIG. 3 is a schematic drawing of a database structure in accordance with an embodiment of the invention.
- FIG. 4 is a flow chart of a process which could be used in accordance with an embodiment of the invention.
- Various embodiments of the invention are described hereinafter with reference to the figures. Elements of like structures or function are represented with like reference numerals throughout the figures. The figures are intended only to facilitate the description of the invention and not as a limitation on the scope of the invention. In addition, an aspect described in conjunction with a particular embodiment of the invention is not necessarily limited to that embodiment and can be practiced in conjunction with any other embodiments of the invention.
- When applying a ranking function, search engines receive as input: 1) at least one keyword and 2) a plurality of web pages in a result set produced based on keyword(s). With those inputs, the search engine produces as an output a ranking score for each web page. The inventors recognized this phenomenon and produced a system and algorithm to reverse engineer the function performed by search engines to produce that output. Stated another way, search engines perform the following ranking function to generate a ranking score for each page in a result set:
- ranking score = F(input)
- where the input is the search query in the form of keyword(s) and the extracted features of the pages in the result set. The present system and method generates training data that may be used to determine the function F used by a search engine.
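One way to make the approximation goal concrete is sketched below; it is offered as an illustration, not as the formulation used by the inventors. The symbols are introduced here and do not appear in the source: for training example i, q_i is the keyword query, x_i is the extracted feature vector of a page in the result set, s_i is the score the search engine computes with its unknown F, and y_i is the training label that stands in for the unobserved s_i. The squared-error objective is an assumption, since the source does not name a loss.

```latex
s_i = F(q_i, x_i), \qquad
\hat{F} = \arg\min_{G} \sum_{i=1}^{N} \bigl( y_i - G(q_i, x_i) \bigr)^2
```

Under this reading, the quality of the approximated function depends directly on how well the labels y_i track the engine's own ordering, which is why the labeling step matters.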
- In order to approximate the ranking function, training data may be sent to a machine learning system. Generating such training data is perhaps the most difficult and labor intensive part of any machine learning system. As discussed above, prior art techniques for generating training data include the use of teams of humans subjectively viewing selected portions of available data such as keywords and result sets. Even if collection of data may be automated, in the prior art, labeling of the data is performed manually. Such labeling techniques are often inaccurate as they are subject to human judgment of a complex system such as a search engine. A human being typically cannot judge by intuition whether he has collected all kinds of different search results to ensure that the training data is diverse, and it is generally not possible to manually track or generate a diverse set of data. A diverse training set is desired for a machine learning algorithm to work well. Moreover, human labeling is not accurate because it is generally not possible to judge a label value by intuition.
- Referring to FIG. 2, there is shown a system 80 in accordance with an embodiment of the invention. System 80 includes a training data generator server 60. Training data generator server or processor 60 sends keywords 62 over a network 64 (such as the Internet) to a search engine server 66. Keywords 62 could be virtually any set of keywords that, when input to a search engine, yield web pages in a result set. It is desirable to generate a number of different sets of keywords. Many techniques could be used to generate such sets. For example, keyword tools provided by search engines such as the MSN Keyword tool or the GOOGLE ADWORDS tool could be used; third party tools which monitor and collect keywords based on popularity usage and other metrics may be used; or statistical analysis may be used to determine important keywords from web pages and web logs. For example, by collecting the frequency distribution of keywords from web pages and web logs, it may be possible to identify important keywords from pages. Keywords 62 are sent by search engine server 66 to a search engine index 68.
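The statistical approach mentioned above, collecting a frequency distribution of terms from web pages and web logs, might look something like the following Python sketch. The function name `important_keywords`, the small stop-word list, the tokenizer and the sample documents are assumptions for illustration only; the source does not specify how the frequency analysis is carried out.

```python
import re
from collections import Counter
from typing import Iterable, List, Set

# Assumed stop-word list for illustration; a practical list would be much larger.
STOP_WORDS: Set[str] = {"a", "an", "and", "for", "of", "the", "to"}


def important_keywords(documents: Iterable[str], top_n: int) -> List[str]:
    """Hypothetical helper: return the top_n most frequent non-stop-word terms."""
    counts: Counter = Counter()
    for text in documents:
        words = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


# Illustrative text standing in for page content or log entries.
docs = [
    "Cordless telephone for the home",
    "Telephone repair and telephone parts",
    "Home office telephone",
]
keywords = important_keywords(docs, 2)
```

Running the same function over different page collections or log windows is one simple way to obtain the several distinct keyword sets the description calls for.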
- Search engine index 68 outputs web pages 70 that are responsive to a search query including keywords 62. Search engine server 66 receives web pages 70 and orders or ranks web pages 70 based on an unknown ranking algorithm to produce ranked web pages 76.
- Ranked web pages 76 are sent over network 64 and fed to training data generator server 60. Training data generator server 60 stores ranked web pages 76 and labels 82 for those pages in a training data storage 84. A label 82 is associated with each ranked web page 76 corresponding to the rank of the ranked web page 76 based on keyword 62. Label 82 allows system 80 to represent the relevance of each ranked web page 76 to keywords 62. Prior art labeling techniques required manually intensive, inaccurate and expensive human capital. Humans would view each ranked web page 76 and provide an appropriate label. The inventors have determined that a linear distribution of the ranking scores is a good representation of those scores. Consequently, if L ranked web pages 76 are considered, the highest ranked web page is given a label L, the second highest is given a label L-1, etc.
- Referring to FIGS. 2 and 3, there is shown an example of a training data structure 110 which may be stored in training data storage 84. As shown, for a keyword 112 (“telephone” is shown) training data structure 110 may include a label column 114 and a web page column 118. Label column 114 includes labels 116 for ranked web pages 76 (FIG. 2). The web pages themselves may be stored in web page column 118. The contents of training data structure 110 may be forwarded and used as training data in a machine learning server or processor 74. Machine learning server 74 may use any known machine learning techniques on training data 110 to produce an approximated ranking function 88.
- Referring to FIG. 4, there is shown a flow chart of a process which could be used in accordance with an embodiment of the invention. The process could be used with, for example, system 80 described with respect to FIG. 2. As shown at step S2, at least one input or keyword is sent to a search engine or any other system implementing a process. At step S4, the search engine queries a search engine index using the keyword to produce a result set including web pages, or the process uses the keywords as input to produce an output. At step S6, the search engine ranks the web pages or the process ranks the output. At step S8, the search engine or process forwards the inputs or keywords and ranked web pages or outputs to a training data server or processor. At step S10, the training data server assigns a label to each page or output based on the rank. At step S12, the labels and pages or outputs are used as training data.
- Clearly, although different servers are shown for various elements, those servers could be combined in a single processor housing or location.
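The linear labeling rule (label L for the top page, L-1 for the next, and so on), the keyword/label/page layout of training data structure 110, and the steps of FIG. 4 can be sketched together in Python as follows. This is a minimal sketch under stated assumptions: pages are represented by URL strings, the search engine is replaced by a hypothetical stand-in function `fake_search`, and the names `TrainingRow`, `linear_labels` and `generate_training_data` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TrainingRow:
    keyword: str  # the query, corresponding to keyword 112
    label: int    # an entry of label column 114
    page: str     # an entry of web page column 118 (a URL string here, by assumption)


def linear_labels(num_pages: int) -> List[int]:
    """Top-ranked page gets label L, the next L-1, down to 1 for the last."""
    return list(range(num_pages, 0, -1))


def generate_training_data(
    keywords: Iterable[str],
    search: Callable[[str], List[str]],
) -> List[TrainingRow]:
    rows: List[TrainingRow] = []
    for keyword in keywords:                       # S2: send the keyword
        ranked_pages = search(keyword)             # S4-S8: engine queries, ranks, returns
        labels = linear_labels(len(ranked_pages))  # S10: label by rank
        rows.extend(
            TrainingRow(keyword, label, page)
            for label, page in zip(labels, ranked_pages)
        )
    return rows                                    # S12: use as training data


def fake_search(keyword: str) -> List[str]:
    """Hypothetical stand-in for a search engine; returns pages best-first."""
    results = {"telephone": ["shop-a.example", "shop-b.example", "shop-c.example"]}
    return results.get(keyword, [])


rows = generate_training_data(["telephone"], fake_search)
```

Because the labels are derived only from the order in which results arrive, the same code applies to any system that returns ranked outputs for an input, not just a web search engine.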
- A system in accordance with that described above can be used to collect training data for any search engine. Moreover, the system can adapt automatically to changes in ranking functions of existing search engines and produce new training data accordingly. Prior art systems are significantly limited in that subjective, expensive human capital is used to analyze only samples of available data. A system in accordance with the invention could analyze one page or thousands of pages easily and efficiently.
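To illustrate the adaptation described above, the sketch below relabels a fresh result set and reports which pages received a different label than before. The function names and the choice to compare labels page by page are assumptions for illustration; the source states only that changed results lead to new training data, not how a change would be detected.

```python
from typing import Dict, List, Optional, Tuple


def label_pages(ranked_pages: List[str]) -> Dict[str, int]:
    """Assign linear labels: L for the first page, L-1 for the second, and so on."""
    total = len(ranked_pages)
    return {page: total - position for position, page in enumerate(ranked_pages)}


def changed_labels(
    old_ranking: List[str],
    new_ranking: List[str],
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Hypothetical helper: map each page to (old_label, new_label) where they differ."""
    old = label_pages(old_ranking)
    new = label_pages(new_ranking)
    pages = sorted(set(old) | set(new))
    return {
        page: (old.get(page), new.get(page))
        for page in pages
        if old.get(page) != new.get(page)
    }


# Illustrative rankings for the same keyword at two points in time.
before = ["a.example", "b.example", "c.example"]
after = ["b.example", "a.example", "c.example"]
changes = changed_labels(before, after)
```

A non-empty result signals that the stored training data for that keyword no longer reflects the engine's current ordering and could be regenerated.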
- The invention has been described with reference to an embodiment that illustrates the principles of the invention and is not meant to limit the scope of the invention. Modifications and alterations may occur to others upon reading and understanding the preceding detailed description. It is intended that the scope of the invention be construed as including all modifications and alterations that may occur to others upon reading and understanding the preceding detailed description insofar as they come within the scope of the following claims or equivalents thereof. Various changes may be made without departing from the spirit and scope of the invention.
- Although the above description is focused on the search engine context, the inventive concepts may be applied to any function approximation system where the inputs and outputs are known.
- As can be discerned, the system and process described above is more accurate than human labeling because, in part, results of the unknown process, such as search engine ranking, are used. As the system is automated, it is possible to easily collect large amounts of training data without manual intervention. Ranking algorithms produced in accordance with the invention are change resistant. This is because training data is based on search results. If any search engine changes its ranking algorithm, the results will change and the training data will change. Prior art systems based on intuition and prior knowledge of humans cannot adapt as easily. The system works with known and to be developed search engines and can easily be applied to specific sites such as TRAVELOCITY.COM.
Claims (16)
1. A method for generating training data for a machine learning system, the method comprising:
sending at least one keyword to a search engine;
receiving at a first processor at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword;
assigning at the first processor a first label to the first page based on the first rank;
assigning at the first processor a second label to the second page based on the second rank; and
forwarding the first web page, second page, first label and second label to a machine learning processor.
2. The method as recited in claim 1, wherein the first and second labels are based on a linear distribution of a ranking of the first and second pages by the search engine.
3. The method as recited in claim 1, wherein the pages are web pages.
4. The method as recited in claim 1, wherein the keyword is generated using at least one of an MSN keyword tool, GOOGLE ADWORDS, and a statistical analysis of keywords from web pages.
5. A method for generating training data for a machine learning system, the method comprising:
sending at least one input to a system effective to perform a process;
receiving at a first processor at least a first and a second output from the system in response to the input, the first output having a first rank, the second output having a second rank, the first and second rank being based on the input;
assigning at the first processor a first label to the first output based on the first rank;
assigning at the first processor a second label to the second output based on the second rank; and
forwarding the first result, second result, first label and second label to a machine learning processor.
6. The method as recited in claim 5, wherein the first and second labels are based on a linear distribution of a ranking of the first and second pages by the search engine.
7. The method as recited in claim 5, wherein the pages are web pages.
8. The method as recited in claim 5, wherein the keyword is generated using at least one of an MSN keyword tool, GOOGLE ADWORDS, and a statistical analysis of keywords from web pages.
9. A system for generating training data for a machine learning system, the system comprising:
a first processor effective to send at least one keyword to a search engine;
the first processor further effective to:
receive at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword;
assign a first label to the first page based on the first rank; and
assign a second label to the second page based on the second rank; and
a machine learning processor connected to the first processor, the machine learning processor effective to receive the first web page, second web page, first label and second label.
10. The system as recited in claim 9, wherein the first and second labels are based on a linear distribution of a ranking of the first and second pages by the search engine.
11. The system as recited in claim 9, wherein the pages are web pages.
12. The system as recited in claim 9, wherein the keyword is generated using at least one of an MSN keyword tool, GOOGLE ADWORDS, and a statistical analysis of keywords from web pages.
13. A computer readable storage medium including computer executable code effective to generate training data for a machine learning system, the code including the steps of:
sending at least one keyword to a search engine;
receiving at a first processor at least a first and a second page from the search engine in response to the keyword, the first page having a first rank, the second page having a second rank, the first and second rank being based on the keyword;
assigning at the first processor a first label to the first page based on the first rank;
assigning at the first processor a second label to the second page based on the second rank; and
forwarding the first web page, second page, first label and second label to a machine learning processor.
14. The storage medium as recited in claim 13, wherein the first and second labels are based on a linear distribution of a ranking of the first and second pages by the search engine.
15. The storage medium as recited in claim 13, wherein the pages are web pages.
16. The storage medium as recited in claim 13, wherein the keyword is generated using at least one of an MSN keyword tool, GOOGLE ADWORDS, and a statistical analysis of keywords from web pages.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/367,656 US20100057719A1 (en) | 2008-09-02 | 2009-02-09 | System And Method For Generating Training Data For Function Approximation Of An Unknown Process Such As A Search Engine Ranking Algorithm |
PCT/US2009/055342 WO2010027916A1 (en) | 2008-09-02 | 2009-08-28 | System and method for generating an approximation of a search engine ranking algorithm |
PCT/US2009/055331 WO2010027914A1 (en) | 2008-09-02 | 2009-08-28 | System and method for generating a search ranking score for a web page |
PCT/US2009/055355 WO2010027917A1 (en) | 2008-09-02 | 2009-08-28 | System and method for generating training data for function approximation of an unknown process such as a search engine ranking algorithm |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US9358608P | 2008-09-02 | 2008-09-02 | |
US12/367,656 US20100057719A1 (en) | 2008-09-02 | 2009-02-09 | System And Method For Generating Training Data For Function Approximation Of An Unknown Process Such As A Search Engine Ranking Algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
US20100057719A1 (en) | 2010-03-04 |
Family
ID=41726829
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/367,656 Abandoned US20100057719A1 (en) | 2008-09-02 | 2009-02-09 | System And Method For Generating Training Data For Function Approximation Of An Unknown Process Such As A Search Engine Ranking Algorithm |
US12/367,646 Expired - Fee Related US8255391B2 (en) | 2008-09-02 | 2009-02-09 | System and method for generating an approximation of a search engine ranking algorithm |
US12/367,634 Abandoned US20100057717A1 (en) | 2008-09-02 | 2009-02-09 | System And Method For Generating A Search Ranking Score For A Web Page |
Country Status (2)
Country | Link |
---|---|
US (3) | US20100057719A1 (en) |
WO (3) | WO2010027916A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103530295A (en) * | 2012-07-05 | 2014-01-22 | 腾讯科技(深圳)有限公司 | Webpage pre-reading method and device |
CN103530321A (en) * | 2013-09-18 | 2014-01-22 | 上海交通大学 | Sequencing system based on machine learning |
US9454732B1 (en) * | 2012-11-21 | 2016-09-27 | Amazon Technologies, Inc. | Adaptive machine learning platform |
US10572527B1 (en) * | 2018-12-11 | 2020-02-25 | Rina Systems, Llc. | Enhancement of search results |
Families Citing this family (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8554638B2 (en) * | 2006-09-29 | 2013-10-08 | Microsoft Corporation | Comparative shopping tool |
US7899763B2 (en) | 2007-06-13 | 2011-03-01 | International Business Machines Corporation | System, method and computer program product for evaluating a storage policy based on simulation |
US8065199B2 (en) | 2009-04-08 | 2011-11-22 | Ebay Inc. | Method, medium, and system for adjusting product ranking scores based on an adjustment factor |
US9846898B2 (en) | 2009-09-30 | 2017-12-19 | Ebay Inc. | Method and system for exposing data used in ranking search results |
US8533191B1 (en) * | 2010-05-27 | 2013-09-10 | Conductor, Inc. | System for generating a keyword ranking report |
US8782037B1 (en) * | 2010-06-20 | 2014-07-15 | Remeztech Ltd. | System and method for mark-up language document rank analysis |
US20120011112A1 (en) * | 2010-07-06 | 2012-01-12 | Yahoo! Inc. | Ranking specialization for a search |
US8818880B1 (en) * | 2010-09-07 | 2014-08-26 | Amazon Technologies, Inc. | Systems and methods for source identification in item sourcing |
US8639686B1 (en) | 2010-09-07 | 2014-01-28 | Amazon Technologies, Inc. | Item identification systems and methods |
US20120066359A1 (en) * | 2010-09-09 | 2012-03-15 | Freeman Erik S | Method and system for evaluating link-hosting webpages |
US8478704B2 (en) | 2010-11-22 | 2013-07-02 | Microsoft Corporation | Decomposable ranking for efficient precomputing that selects preliminary ranking features comprising static ranking features and dynamic atom-isolated components |
US8620907B2 (en) | 2010-11-22 | 2013-12-31 | Microsoft Corporation | Matching funnel for large document index |
US9424351B2 (en) | 2010-11-22 | 2016-08-23 | Microsoft Technology Licensing, Llc | Hybrid-distribution model for search engine indexes |
US8713024B2 (en) | 2010-11-22 | 2014-04-29 | Microsoft Corporation | Efficient forward ranking in a search engine |
US9529908B2 (en) | 2010-11-22 | 2016-12-27 | Microsoft Technology Licensing, Llc | Tiering of posting lists in search engine index |
US8554700B2 (en) * | 2010-12-03 | 2013-10-08 | Microsoft Corporation | Answer model comparison |
US8630992B1 (en) * | 2010-12-07 | 2014-01-14 | Conductor, Inc. | URL rank variability determination |
US8909644B2 (en) | 2011-05-26 | 2014-12-09 | Nice Systems Technologies Uk Limited | Real-time adaptive binning |
US10223451B2 (en) * | 2011-06-14 | 2019-03-05 | International Business Machines Corporation | Ranking search results based upon content creation trends |
US8620840B2 (en) | 2011-07-19 | 2013-12-31 | Nice Systems Technologies Uk Limited | Distributed scalable incrementally updated models in decisioning systems |
US8788436B2 (en) * | 2011-07-27 | 2014-07-22 | Microsoft Corporation | Utilization of features extracted from structured documents to improve search relevance |
US8924318B2 (en) | 2011-09-28 | 2014-12-30 | Nice Systems Technologies Uk Limited | Online asynchronous reinforcement learning from concurrent customer histories |
US8914314B2 (en) | 2011-09-28 | 2014-12-16 | Nice Systems Technologies Uk Limited | Online temporal difference learning from incomplete customer interaction histories |
US8843477B1 (en) * | 2011-10-31 | 2014-09-23 | Google Inc. | Onsite and offsite search ranking results |
US9576053B2 (en) | 2012-12-31 | 2017-02-21 | Charles J. Reed | Method and system for ranking content of objects for search results |
US9141906B2 (en) * | 2013-03-13 | 2015-09-22 | Google Inc. | Scoring concept terms using a deep network |
US20140365453A1 (en) * | 2013-06-06 | 2014-12-11 | Conductor, Inc. | Projecting analytics based on changes in search engine optimization metrics |
US20160283952A1 (en) * | 2013-11-04 | 2016-09-29 | Agingo Corporation | Ranking information providers |
RU2634218C2 (en) | 2014-07-24 | 2017-10-24 | Общество С Ограниченной Ответственностью "Яндекс" | Method for determining sequence of web browsing and server used |
RU2637883C1 (en) * | 2016-06-20 | 2017-12-07 | Общество С Ограниченной Ответственностью "Яндекс" | Method of establishing training object for training machine training algorithm |
US11200242B2 (en) * | 2017-02-21 | 2021-12-14 | International Business Machines Corporation | Medical condition communication management |
US10878058B2 (en) * | 2017-06-16 | 2020-12-29 | T-Mobile Usa, Inc. | Systems and methods for optimizing and simulating webpage ranking and traffic |
US10489511B2 (en) * | 2018-03-01 | 2019-11-26 | Ink Content, Inc. | Content editing using AI-based content modeling |
US20220027400A1 (en) * | 2018-05-21 | 2022-01-27 | State Street Corporation | Techniques for information ranking and retrieval |
US11822588B2 (en) * | 2018-10-24 | 2023-11-21 | International Business Machines Corporation | Supporting passage ranking in question answering (QA) system |
US11449515B1 (en) * | 2019-06-14 | 2022-09-20 | Grant Michael Russell | Crowd sourced database system |
RU2019128026A (en) | 2019-09-05 | 2021-03-05 | Общество С Ограниченной Ответственностью «Яндекс» | METHOD AND SYSTEM FOR RANKING A SET OF DIGITAL DOCUMENTS |
CN111160001B (en) * | 2019-12-23 | 2022-09-23 | 联想(北京)有限公司 | Data processing method and device |
CN111966946A (en) * | 2020-09-10 | 2020-11-20 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for identifying authority value of page |
US11836173B2 (en) * | 2021-05-26 | 2023-12-05 | Banjo Health Inc. | Apparatus and method for generating a schema |
CN115983616B (en) * | 2023-03-21 | 2023-07-14 | 济南丽阳神州智能科技有限公司 | Business process mining method and equipment based on management system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050080782A1 (en) * | 2003-10-10 | 2005-04-14 | Microsoft Corporation | Computer aided query to task mapping |
US20060026152A1 (en) * | 2004-07-13 | 2006-02-02 | Microsoft Corporation | Query-based snippet clustering for search result grouping |
US20060248049A1 (en) * | 2005-04-27 | 2006-11-02 | Microsoft Corporation | Ranking and accessing definitions of terms |
US20070130112A1 (en) * | 2005-06-30 | 2007-06-07 | Intelligentek Corp. | Multimedia conceptual search system and associated search method |
US20080077574A1 (en) * | 2006-09-22 | 2008-03-27 | John Nicholas Gross | Topic Based Recommender System & Methods |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6601075B1 (en) * | 2000-07-27 | 2003-07-29 | International Business Machines Corporation | System and method of ranking and retrieving documents based on authority scores of schemas and documents |
US6865573B1 (en) | 2001-07-27 | 2005-03-08 | Oracle International Corporation | Data mining application programming interface |
US7425968B2 (en) * | 2003-06-16 | 2008-09-16 | Gelber Theodore J | System and method for labeling maps |
US20060074905A1 (en) * | 2004-09-17 | 2006-04-06 | Become, Inc. | Systems and methods of retrieving topic specific information |
US7801899B1 (en) * | 2004-10-01 | 2010-09-21 | Google Inc. | Mixing items, such as ad targeting keyword suggestions, from heterogeneous sources |
WO2006055983A2 (en) | 2004-11-22 | 2006-05-26 | Truveo, Inc. | Method and apparatus for a ranking engine |
GB2425195A (en) * | 2005-04-14 | 2006-10-18 | Yosi Heber | Website analysis method |
US7831685B2 (en) * | 2005-12-14 | 2010-11-09 | Microsoft Corporation | Automatic detection of online commercial intention |
US7593934B2 (en) * | 2006-07-28 | 2009-09-22 | Microsoft Corporation | Learning a document ranking using a loss function with a rank pair or a query parameter |
US8214272B2 (en) * | 2006-09-05 | 2012-07-03 | Rafael A. Sosa | Web site valuation |
US7617208B2 (en) * | 2006-09-12 | 2009-11-10 | Yahoo! Inc. | User query data mining and related techniques |
US7930302B2 (en) * | 2006-11-22 | 2011-04-19 | Intuit Inc. | Method and system for analyzing user-generated content |
US7925651B2 (en) * | 2007-01-11 | 2011-04-12 | Microsoft Corporation | Ranking items by optimizing ranking cost function |
US7877384B2 (en) * | 2007-03-01 | 2011-01-25 | Microsoft Corporation | Scoring relevance of a document based on image text |
US8407589B2 (en) * | 2007-04-20 | 2013-03-26 | Microsoft Corporation | Grouping writing regions of digital ink |
US7941391B2 (en) * | 2007-05-04 | 2011-05-10 | Microsoft Corporation | Link spam detection using smooth classification function |
US20090326916A1 (en) * | 2008-06-27 | 2009-12-31 | Microsoft Corporation | Unsupervised chinese word segmentation for statistical machine translation |
2009
- 2009-02-09: US 12/367,656, published as US20100057719A1 (status: Abandoned, not active)
- 2009-02-09: US 12/367,646, granted as US8255391B2 (status: Expired - Fee Related, not active)
- 2009-02-09: US 12/367,634, published as US20100057717A1 (status: Abandoned, not active)
- 2009-08-28: PCT/US2009/055342, published as WO2010027916A1 (status: Application Filing, active)
- 2009-08-28: PCT/US2009/055355, published as WO2010027917A1 (status: Application Filing, active)
- 2009-08-28: PCT/US2009/055331, published as WO2010027914A1 (status: Application Filing, active)
Also Published As
Publication number | Publication date |
---|---|
US8255391B2 (en) | 2012-08-28 |
US20100057718A1 (en) | 2010-03-04 |
US20100057717A1 (en) | 2010-03-04 |
WO2010027914A1 (en) | 2010-03-11 |
WO2010027917A1 (en) | 2010-03-11 |
WO2010027916A1 (en) | 2010-03-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: CONDUCTOR, INC., NEW YORK. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: KULKARNI, PARASHURAM; REEL/FRAME: 023992/0041. Effective date: 20100203 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |