WO2004097670A1 - Method for generating data records from a data bank, especially from the world wide web, characteristic short data records, method for determining data records from a data bank which are relevant for a predefined search query and search system for implementing said method - Google Patents
- Publication number
- WO2004097670A1 (PCT/EP2004/003972)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data records
- search
- characteristic
- search queries
- account
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Definitions
- Method for creating short data records characteristic of data records from a database in particular from the World Wide Web, method for determining data records relevant to a specifiable search query from a database and search system for carrying out the method
- The invention relates to a method for creating short data records characteristic of data records from a database, in particular from the World Wide Web, for storage on a memory module as a basis for determining the data records relevant to a predefinable search query. It further relates to a method for determining data records relevant to a specifiable search query from a database, in particular from the World Wide Web, in which such short data records are searched for their relevance to the respective search query. The invention further relates to a search system for determining a predefinable one
- An enormous amount of information is available in complex databases or in the global computer network ("World Wide Web"), which a user can access more or less specifically for research purposes.
- To make targeted use of this large amount of information, so-called search engines are used, some of which have become particularly widespread for obtaining information from the World Wide Web.
- An input/output module provides a query window through which targeted search or research terms can be specified.
- The search engine then searches the information base of the database or of the World Wide Web for matching keywords or key terms.
- the response data records found thereupon are usually categorized by the respective search engine with regard to their relevance for the specified search order and made available to the user in the manner of a hit list in an order arranged according to their relevance.
- the search engines used for the research are therefore increasingly being improved with regard to the search algorithms used, and further aids for classifying data records from the database can also be used in the manner of pre-sorting or pre-filtering.
- the data records are usually structured and organized in the form of so-called domains, a domain typically being maintained by an operator and in turn comprising a large number of sub-data records, text documents or the like.
- Each domain is assigned a characteristic value which, based on accessible secondary information in the manner of a relative relevance, characterizes the importance of considering the respective domain for the search query.
- an information base is usually used in the manner of a so-called static approach, in which, for example, the relative importance of the respective domain is inferred on the basis of the degree of networking of the respective domain with other domains.
- The number of so-called links or cross-references from other domains to the respective domain can be used as a measure of this importance, on the assumption that a large number of cross-references to a domain indicates that the domain is of particular importance to a large number of users when processing their search queries.
- The respective search module searches the currently selected domain or data record and, up to a limit specified by the assigned system resources, uses the information found there to compile a short data record characteristic of the domain or data record, for example in the form of a text file with possibly assigned headings or other indicators.
- This short data record is then stored on a memory module and kept ready for a subsequent examination.
- The entirety of the short data records created in this procedure from the data records or domains taken into account, and stored on the memory module, is also referred to as the "index" of the respective search engine and serves as the information basis for subsequent searches. The index is usually built continuously, with individual domains selected cyclically, so that it is continuously updated.
- The index formed by all stored short data records is then searched for the presence of keywords of the respective search query or of individual elements thereof. On the basis of the search results or hits obtained, the data records or domains assigned to the short data records found are determined to be relevant for the respective search query.
- the invention is therefore based on the object of specifying a method for creating short data records of the type mentioned above, which can be used to generate a search index which is particularly suitable for obtaining high quality information from the database or from the World Wide Web. Furthermore, using this method, a particularly suitable method for determining data records relevant to a specifiable search query from a database, in particular from the World Wide Web, and a search system for carrying out this method are to be specified.
- This object is achieved according to the invention in that the system resources provided for creating a short data record from a data record are selected taking into account empirical values determined from previous search queries.
- The invention is based on the consideration that, to generate an information base of short data records that is particularly suitable for obtaining high-quality information, available information about the individual data records or domains per se can be taken into account in the manner of static characteristic values, while information characteristic of the users' interests should additionally be taken into account in the manner of a dynamic element.
- This is based on the knowledge that the result of obtaining information from the database or the World Wide Web is considered to be of particularly high quality if it reflects the user interest correctly as far as possible. Measures should therefore be taken to incorporate information that is characteristic of the user's interest in further information gathering.
- The frequency of search queries in the recent past that are identical or similar to a given search query is advantageously taken into account as an empirical value when assigning the system resources.
- The frequency with which data records or domains appear as hits for the search queries most frequently specified by users in the recent past can also be taken into account.
- The empirical values therefore expediently comprise a characteristic number indicating the number of similar search queries in a predefinable time interval.
- The resources of a searcher module or crawler provided for creating the short data records characteristic of the data records are selected as system resources, taking into account the empirical values determined from previous search queries.
- the user interests are particularly largely taken into account when allocating the system resources by taking into account to a particular extent the possibly complex structure of the search queries used by the users when determining the empirical values.
- This is based on the insight that a particularly precise picture of general user interest can be obtained not only from the relative frequency of individual elements or terms used in search queries, but also, or additionally, by taking into account specific correlations between individual terms or elements of search queries.
- individual elements or components of a search query are preferably requested in combination with specific other individual elements or components of search queries in accordance with the currently widespread user interest.
- For example, current user interest could generally tend toward downloading free multimedia files from the Internet.
- In that case, a combination of the search terms "MP3", "free" and "download" is increasingly to be expected in search queries.
- the combination of these three individual elements of a search query can therefore be used as a particularly important indicator for
- correlations between individual elements of the search queries are preferably taken into account when determining the empirical values.
- the relative frequency of search queries and / or of individual elements of the search queries is advantageously taken into account when determining the empirical values.
- The short data records created in the manner described, which are characteristic of the data records from the database, are used to determine data records from the database, in particular from the World Wide Web, that are relevant to a specifiable search query: the short data records created in this way and stored in a memory module are searched for their relevance to the respective search query.
- The criterion for determining this relevance can be, for example, the frequency with which a keyword of the search query occurs in the respective short data record, and a distinction can be made according to where it is found, for example in a heading or in the full text.
- With regard to the search system, the above-mentioned object is achieved in that short data records characteristic of the data records are stored in a memory module, the system resources provided for creating a short data record from a data record being selected taking into account stored empirical values from previous search queries.
- the empirical values advantageously include a characteristic number that is characteristic of the number of similar search queries in a predefinable time interval.
- The resources of a searcher module provided for creating the short data records characteristic of the data records are selected as system resources, taking into account stored empirical values from previous search queries.
- The advantages achieved by the invention are, in particular, that by taking empirical values from previous search queries into account when allocating system resources for creating the index, or the short data records characteristic of the data records, current user interest can be broadly considered at a particularly early stage, namely in the preparation phase of a database or Internet search. Precisely by taking user interest into account in addition to, or instead of, the database-specific characteristics used previously, such as the frequency of cross-references, the user can obtain information that is regarded as being of particularly high quality.
- A particularly specific picture of user interest, and thus particularly accurate allocation of resources, can be achieved by taking correlations between individual elements of search queries into account: by considering particularly frequently used combinations of specific individual elements, and by drawing conclusions about the data records or domains found as a result of such combined search queries, hit generation that is particularly tailored to the users' interests can be expected.
- An embodiment of the invention is explained in more detail with reference to a drawing.
- the figure schematically shows a search system for determining data records or domains from the World Wide Web that are relevant for a specifiable search query.
- The search system 1 is connected to a large number of domains 4 via the data lines of the Internet or World Wide Web indicated by the double arrows 2, each domain 4 in turn typically comprising a large number of sub-data records, text modules, multimedia information elements or the like. Because of the large amount of information available on the World Wide Web, the search system 1 is designed to process a search query not by searching the domains 4 or their information content for specific keywords, but instead by searching a so-called index 8 stored in a memory module 6.
- the index 8 comprises a large number of short data records 10, each of which is characteristic of a data record or a domain 4 of the World Wide Web.
- Each short data record 10 contains a part of the information content of the respectively assigned domain 4 which is recognized as relevant, the text data contained in the respective domain 4 in particular being reproduced in the short data record 10.
- When a search query is received, as indicated by arrow 12, it is fed to an input/output module 14 of the search system 1, from where a search of the short data records 10 is started on the basis of keywords characteristic of the search query.
- the domain 4 corresponding to the respective short data record 10 is recognized as relevant for the search query and the corresponding domain address is communicated to the user on a result list.
- the search system 1 comprises a searcher module 16, also referred to as a "crawler".
- The searcher module 16 contacts the respective domains 4 at regular, preferably cyclical, intervals and searches their information content. In particular, the text information stored on the respective domain 4 can be captured and suitably compressed.
- The type and scope of the analysis of the content of each domain 4 by the searcher module 16 are determined by the system resources of the searcher module 16 specified for the respective domain 4.
- As system resources, for example, the time period provided for the search, the computing power used and/or the allocated storage capacity can be specified. In particular, it can also be specified whether the respective domain 4 is to be addressed by the searcher module 16 or ignored from the outset. Using the information base determined during the search of the respective domain 4, the searcher module 16 then creates the associated short data record 10 in the manner of a short version and stores it as a component of the index 8 in the memory module 6.
- the system resources for the search of the respective domain 4 can be assigned, for example, as a function of domain-specific relevance parameters.
- So-called static relevance parameters can also be provided, which use predetermined criteria, such as the degree of networking of a domain 4 with other domains 4, to characterize how high the degree of acceptance of the respective domain 4 is among the users.
- the search system 1 is also designed to take into account empirical values and knowledge from the previous search queries when creating the short data records 10 and thus to incorporate the current user interest reflected therein to a particular extent in the creation or cyclical renewal of the index 8.
- a further memory module 18 is assigned to the memory module 6, in which the incoming search queries are stored for further evaluation in the manner of a log book.
- the contents of the memory module 18 are made accessible to an analysis module 20, which subjects the search queries received to an evaluation and uses the knowledge gained thereby to redistribute the system resources to the domains 4 to be taken into account in the next search cycle.
- The analysis module 20 transmits the corresponding allocation of the system resources to the searcher module 16, as shown by the arrow 22.
- When assigning the system resources, the analysis module 20 thus takes into account empirical values from previous search queries. This can be done, for example, by determining the frequency of a search query, or of a keyword as an individual element of a search query; frequently used search queries or individual elements are taken to be comparatively popular among users at present. Accordingly, it is assumed that the data records or domains 4 found for comparatively popular search queries and identified as relevant reflect current user interest to a comparatively high degree. In this embodiment, the analysis module 20 can thus assign a correspondingly increased share of system resources, during the next search by the searcher module 16, to those domains 4 that are listed as results for comparatively frequently used search queries.
- the search system 1 is also designed to take comparatively complex structures in the profile of the search queries into account when allocating the system resources by the analysis module 20.
- Correlations between individual elements of search queries are also taken into account. If, for example, it is found that individual elements or search words are combined particularly frequently with certain other elements or search words, a high intrinsic correlation between these search elements is inferred. On the one hand, those domains 4 in which the complete or approximate combination is found are then recognized as particularly relevant; on the other hand, when the relative frequencies of individual search elements are evaluated, the relative frequencies of the search elements strongly correlated with them can also be taken into account.
- a correlation matrix is created in the analysis module 20, the matrix elements of which indicate a quantitative measure for the correlation between two individual elements of search queries.
- the relative frequency with which the two respective individual elements of search queries are asked for in combination can be provided as a quantitative measure.
- This correlation matrix is then diagonalized by a principal axis transformation, so that the eigenvalues of the original correlation matrix appear on the main diagonal of the diagonalized matrix. The eigenvectors of the correlation matrix are also determined in this principal axis transformation.
- the eigenvalues and eigenvectors of the correlation matrix can then be used for a further evaluation of the search queries.
- Those eigenvectors of the correlation matrix that have a comparatively large eigenvalue correspond to a mix of individual elements of search queries which, weighted by the linear coefficients of the individual elements, occurs comparatively frequently in typical search queries and thus reflects current user interest to a particular degree.
- Those eigenvectors of the correlation matrix that are assigned a comparatively large eigenvalue are therefore selected. Each eigenvector determined in this way yields a mix of search elements, an eigen-query (rendered as "self-query" in the machine translation), that has occurred in the respective combination with particularly high probability in the recent past.
- With such an eigen-query, the analysis module 20 accesses index 8 in the manner of a test query and thus determines the data records or domains 4 identified as relevant for it. Since the domains 4 determined in this way correspond to current user interest to a particular degree, the system resources for these domains 4 are increased proportionally, compared with the previous run, when the World Wide Web is searched again. This can be done, for example, by assigning a weighting factor when providing the system resources for the respective domain 4 according to the relationship
- where λ is the eigenvalue of the associated eigen-query, D_k is a domain 4 displayed as a hit for this eigen-query, and the remaining parameter can be a suitably chosen constant > 0.
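The relevance criterion described above (keyword frequency in a short data record, differentiated by whether the keyword is found in a heading or in the full text) can be illustrated with a minimal sketch. The `ShortRecord` type, the function names, the example domains and the weights 3.0 and 1.0 are assumptions made for illustration; the source does not specify any of them.

```python
import re
from typing import List, NamedTuple


class ShortRecord(NamedTuple):
    """Hypothetical short data record: one entry of the index."""
    domain: str
    heading: str
    text: str


def tokenize(value: str) -> List[str]:
    """Lower-case the input and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())


def relevance(record: ShortRecord, query: str,
              heading_weight: float = 3.0, text_weight: float = 1.0) -> float:
    """Sum keyword occurrences, weighting heading hits above full-text hits.

    The two weights are illustrative assumptions, not values from the source.
    """
    heading_tokens = tokenize(record.heading)
    text_tokens = tokenize(record.text)
    score = 0.0
    for term in tokenize(query):
        score += heading_weight * heading_tokens.count(term)
        score += text_weight * text_tokens.count(term)
    return score


def search(index: List[ShortRecord], query: str) -> List[str]:
    """Return the domains with a positive score, most relevant first."""
    scored = [(relevance(record, query), record.domain) for record in index]
    hits = [(score, domain) for score, domain in scored if score > 0]
    hits.sort(key=lambda item: (-item[0], item[1]))
    return [domain for _, domain in hits]


# Made-up example index with three short data records.
index = [
    ShortRecord("music.example", "Free MP3 download",
                "Download free MP3 files and more free music."),
    ShortRecord("news.example", "Daily news", "A report on MP3 players."),
    ShortRecord("garden.example", "Gardening tips", "How to grow tomatoes."),
]

print(search(index, "mp3 free download"))
```

In this sketch the hit list is ordered by descending score, with the domain name as a tie-breaker so that the result is deterministic.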
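The correlation analysis described above (a matrix of pairwise co-occurrence frequencies whose eigenvectors with large eigenvalues yield an eigen-query, called a "self-query" in the machine translation) can be sketched as follows. This is a hedged illustration, not the patented procedure: instead of a full principal axis transformation it uses power iteration, which finds only the eigenvector with the largest eigenvalue. Placing each term's own relative frequency on the diagonal, the 0.5 selection threshold and all function names are assumptions. The weighting relationship itself is not legible in the source and is deliberately left out.

```python
import math
from itertools import combinations


def cooccurrence_matrix(queries):
    """Return (terms, matrix) of relative co-occurrence frequencies.

    matrix[i][j] is the share of queries containing both term i and term j.
    The diagonal holds each term's own relative frequency (an assumption).
    """
    terms = sorted({term for query in queries for term in query})
    position = {term: i for i, term in enumerate(terms)}
    size = len(terms)
    matrix = [[0.0] * size for _ in range(size)]
    for query in queries:
        unique = sorted(set(query))
        for term in unique:
            matrix[position[term]][position[term]] += 1.0
        for a, b in combinations(unique, 2):
            matrix[position[a]][position[b]] += 1.0
            matrix[position[b]][position[a]] += 1.0
    total = float(len(queries))
    return terms, [[value / total for value in row] for row in matrix]


def multiply(matrix, vector):
    """Matrix-vector product on plain lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dominant_eigenpair(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    size = len(matrix)
    vector = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        product = multiply(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    # The Rayleigh quotient of the unit vector estimates the eigenvalue.
    eigenvalue = sum(v * p for v, p in zip(vector, multiply(matrix, vector)))
    return eigenvalue, vector


def eigen_query(terms, vector, share=0.5):
    """Keep the terms whose coefficient reaches `share` of the largest one."""
    limit = share * max(abs(c) for c in vector)
    return [term for term, c in zip(terms, vector) if abs(c) >= limit]


# Made-up query log; each query is a list of search terms.
queries = [
    ["mp3", "free", "download"],
    ["mp3", "free", "download"],
    ["mp3", "download"],
    ["weather", "berlin"],
]
terms, matrix = cooccurrence_matrix(queries)
eigenvalue, vector = dominant_eigenpair(matrix)
print(eigen_query(terms, vector))  # ['download', 'free', 'mp3']
```

The resulting term combination could then be run against the index as a test query to find the domains whose system resources are to be increased. The sketch assumes a non-empty query log; an empty one would need a guard before the division and normalization steps.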
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP04727536A EP1620809A1 (en) | 2003-04-29 | 2004-04-15 | Method for generating data records from a data bank, especially from the world wide web, characteristic short data records, method for determining data records from a data bank which are relevant for a predefined search query and search system for implementing said method |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE10319427A DE10319427A1 (en) | 2003-04-29 | 2003-04-29 | Method for creating short data records characteristic of data records from a database, in particular from the World Wide Web, method for determining data records relevant to a specifiable search query from a database and search system for carrying out the method |
DE10319427.4 | 2003-04-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2004097670A1 (en) | 2004-11-11 |
Family
ID=33394008
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2004/003972 WO2004097670A1 (en) | 2003-04-29 | 2004-04-15 | Method for generating data records from a data bank, especially from the world wide web, characteristic short data records, method for determining data records from a data bank which are relevant for a predefined search query and search system for implementing said method |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP1620809A1 (en) |
DE (1) | DE10319427A1 (en) |
WO (1) | WO2004097670A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2172898A1 (en) * | 2008-08-28 | 2010-04-07 | Palo Alto Research Center Incorporated | System and method for providing community-based advertising term disambiguation |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2002008962A1 (en) * | 2000-07-25 | 2002-01-31 | Energy E-Comm.Com, Inc. | Internet information retrieval method and apparatus |
WO2002027562A2 (en) * | 2000-09-29 | 2002-04-04 | Ninesigma, Inc. | Method and apparatus to retrieve information from a network |
EP1207468A2 (en) * | 2000-11-14 | 2002-05-22 | Itt Manufacturing Enterprises, Inc. | A method and system for updating a searchable database of descriptive information describing information stored at a plurality of addressable logical locations |
US6418433B1 (en) * | 1999-01-28 | 2002-07-09 | International Business Machines Corporation | System and method for focussed web crawling |
US6493703B1 (en) * | 1999-05-11 | 2002-12-10 | Prophet Financial Systems | System and method for implementing intelligent online community message board |
US20020194161A1 (en) * | 2001-04-12 | 2002-12-19 | Mcnamee J. Paul | Directed web crawler with machine learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6006225A (en) * | 1998-06-15 | 1999-12-21 | Amazon.Com | Refining search queries by the suggestion of correlated terms from prior searches |
US7194454B2 (en) * | 2001-03-12 | 2007-03-20 | Lucent Technologies | Method for organizing records of database search activity by topical relevance |
- 2003-04-29: DE application DE10319427A, published as DE10319427A1, status: ceased
- 2004-04-15: WO application PCT/EP2004/003972, published as WO2004097670A1, status: application discontinuation
- 2004-04-15: EP application EP04727536A, published as EP1620809A1, status: withdrawn
Non-Patent Citations (2)
Title |
---|
ARASU A ET AL: "SEARCHING THE WEB", ACM TRANSACTIONS ON INTERNET TECHNOLOGY, ACM, NEW YORK, NY, US, vol. 1, no. 1, August 2001 (2001-08-01), pages 2 - 43, XP001143684, ISSN: 1049-3301 * |
ROCHA L M: "Adaptive Webs for Heterarchies with Diverse Communities of Users", WORKSHOP FROM INTELLIGENT NETWORKS TO THE GLOBAL BRAIN: EVOLUTIONARY SOCIAL ORGANIZATION THROUGH KNOWLEDGE TECHNOLOGY, XX, XX, 3 July 2001 (2001-07-03), pages 1 - 35, XP002209508 * |
Also Published As
Publication number | Publication date |
---|---|
DE10319427A1 (en) | 2004-12-02 |
EP1620809A1 (en) | 2006-02-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
DE602004003361T2 (en) | SYSTEM AND METHOD FOR GENERATING REFINEMENT CATEGORIES FOR A GROUP OF SEARCH RESULTS | |
DE69933187T2 (en) | Document Search and Service | |
DE60004687T2 (en) | METHOD FOR THE THEMATIC CLASSIFICATION OF DOCUMENTS, MODULE FOR THE THEMATIC CLASSIFICATION AND A SEARCH ENGINE CONTAINING SUCH A MODULE | |
DE69731142T2 (en) | System for retrieving documents | |
DE60129652T2 (en) | Image retrieval system and method with semantic and property-based relevance feedback | |
DE69833238T2 (en) | Keyword extraction system and text retrieval system for its use | |
DE69934102T2 (en) | SYSTEM AND METHOD FOR MODEL MINING OF COMPLEX INFORMATION TECHNOLOGY SYSTEMS | |
DE602005001940T2 (en) | METHOD AND SYSTEM FOR GENERATING A POPULATION REPRESENTATIVE TO A LOT OF USERS OF A COMMUNICATION NETWORK | |
EP1877932B1 (en) | System and method for aggregating and monitoring decentrally stored multimedia data | |
DE202017107393U1 (en) | Predicting a search engine map signal value | |
DE10231161A1 (en) | Domain-specific knowledge-based meta search system and method for using the same | |
DE202004021885U1 (en) | Information retrieval system based on historical data | |
DE112015000218T5 (en) | A method, system and computer program for scanning a plurality of memory areas in a work memory for a specified number of results | |
CH704497B1 (en) | Procedures for notifying storage medium having processor instructions for such a procedure. | |
DE112018006345T5 (en) | GET SUPPORTING EVIDENCE FOR COMPLEX ANSWERS | |
DE69719641T2 (en) | A process for presenting information on screen devices in various sizes | |
DE102007037646A1 (en) | System and method for indexing, searching and retrieving databases | |
WO2006018041A1 (en) | Speech and textual analysis device and corresponding method | |
DE102020116499A1 (en) | Method for selecting questions for respondents in a respondent inquiry system | |
DE112010002620T5 (en) | ONTOLOGY USE FOR THE ORDER OF DATA RECORDS NACHRELEVANZ | |
DE102004016930A1 (en) | Generate a sampling plan for testing generated content | |
EP1264253B1 (en) | Method and arrangement for modelling a system | |
DE10048478A1 (en) | Method for accessing a storage unit when searching for substrings and associated storage unit | |
DE102005032733A1 (en) | Index extraction of documents | |
WO2004097670A1 (en) | Method for generating data records from a data bank, especially from the world wide web, characteristic short data records, method for determining data records from a data bank which are relevant for a predefined search query and search system for implementing said method |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1. Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1. Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
WWE | WIPO information: entry into national phase | Ref document number: 1020057020376. Country of ref document: KR |
WWE | WIPO information: entry into national phase | Ref document number: 2004727536. Country of ref document: EP |
WWP | WIPO information: published in national office | Ref document number: 2004727536. Country of ref document: EP |
WWW | WIPO information: withdrawn in national office | Ref document number: 1020057020376. Country of ref document: KR |
DPEN | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed from 20040101) | |
WWW | WIPO information: withdrawn in national office | Ref document number: 2004727536. Country of ref document: EP |