WO2008131597A1 - Search engine and method for filtering agency information

Info

Publication number: WO2008131597A1
Application number: PCT/CN2007/001474
Authority: WO
Inventors: Haitao Lin
Original assignee: Haitao Lin
Priority date: 2007-04-29
Filing date: 2007-04-29
Publication date: 2008-11-06
Also published as: CN101849232A

Abstract

A search engine and a method for filtering agency information, wherein the method comprises: crawling web pages from the Internet and sending them to a web page database; performing link information extraction, extracting the titles and contents of the web pages from the database, and further extracting agency feature information; analyzing the extracted agency feature information and, if a set agency information judging condition is satisfied, determining the information corresponding to the agency feature information to be agency information; and filtering the agency information out of the search results.

Description

Technical field

The present invention relates to computer search engine technology, and more particularly to a search engine and a method for filtering intermediary information (also rendered in this text as mediation or agency information).

Background technique

The Internet provides instant and rich information, as well as a platform for people to communicate and take part in entertainment, and it deeply influences modern life. But with the rapid increase in the number and content of websites, the Internet has become like a huge encyclopedia without a table of contents, making it hard for people to find the information they want. The emergence of search engines added catalogues and indexes to this encyclopedia: type a keyword into the search box and you get the relevant information or URLs. In the face of vast online resources, search engines provide an entry point for all Internet users. It is no exaggeration to say that almost all users can reach any place on the Internet they want by starting from a search. Search has therefore become the most used online service after email.

Figure 1 shows the system architecture diagram of a typical search engine in the prior art. The parts of the search engine are interrelated and interdependent. The processing flow is as follows:

The web spider crawls webpages from the Internet. The crawling process is as follows: (1) One or more URLs (Uniform Resource Locators, also known as webpage addresses) of starting webpages are manually added to the URL database; these URLs are also called seeds. (2) The web spider obtains a URL from the URL database, grabs the webpage content corresponding to the URL, and puts the webpage content into the webpage database. (3) The URLs that satisfy the requirements are extracted from the webpage content and put into the URL database; whether a URL satisfies the requirements is judged by pattern matching. (4) Steps (2) to (3) are repeated until no new records are added to the webpage database.
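The crawling loop described above can be sketched roughly as follows. This is a minimal illustration, not the prior-art implementation: the `crawl` function, the `URL_PATTERN` requirement (only URLs on example.com), and the in-memory `FAKE_WEB` standing in for the Internet are all assumptions made so that the sketch runs offline.

```python
import re
from collections import deque
from typing import Callable, Dict, Iterable

# Hypothetical requirement: only URLs on example.com are accepted.
URL_PATTERN = re.compile(r"https?://example\.com/[^\s\"'<>]*")

def crawl(seeds: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Crawl from the seed URLs until no new page is added."""
    url_db = deque(seeds)          # step (1): seed URLs in the URL database
    seen = set(url_db)
    page_db: Dict[str, str] = {}   # stands in for the webpage database
    while url_db:                  # step (4): repeat until nothing new arrives
        url = url_db.popleft()     # step (2): take a URL and grab its page
        content = fetch(url)
        page_db[url] = content
        for link in URL_PATTERN.findall(content):  # step (3): pattern matching
            if link not in seen:
                seen.add(link)
                url_db.append(link)
    return page_db

# An in-memory "Internet" so the sketch needs no network access.
FAKE_WEB = {
    "http://example.com/a": 'see <a href="http://example.com/b">b</a> and http://other.org/x',
    "http://example.com/b": 'back to <a href="http://example.com/a">a</a>',
}
pages = crawl(["http://example.com/a"], lambda u: FAKE_WEB.get(u, ""))
```

The `seen` set is what makes the loop terminate when pages link back to each other; the link to other.org is ignored because it does not match the pattern.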

The system obtains the original page from the webpage database and extracts the text from it, that is, removes all the HTML markup. The extracted text is then sent to the text indexing module to establish an index. Indexing first calculates the relevance (or importance) of each keyword in the page content and hyperlinks, and then uses this information to build the webpage index, forming the index database. While the text index is being built, the link information of the website needs to be consulted, mainly to guard against illegitimate websites, for example a website with multiple circular links to itself. At the same time as the index database is established, the link information (including the anchor text and the link itself) is extracted from the webpage database and sent to the link database to provide a basis for webpage rating.

The user submits a query request to the query server, and the server searches the index database for relevant webpages. The webpage rating combines the query request with the link information to evaluate the relevance of the search results, and the query server sorts the results by relevance and extracts a content summary around the keywords. Finally, the page generation system assembles the link addresses and content summaries of the search results and returns them to the user.

In the system architecture of the search engine shown in Figure 1, the web spider (Spider) and the link information extraction module (Parser) are the most important parts. Among them:

The web spider (Spider) uses multi-threaded concurrent crawling to implement the document access agent, the path selection engine, and the access control engine. The web spider mainly consists of a URL server, a crawler, a storage unit, and a URL parser, works with three major data resources (the webpage database, the anchor library, and the URL database), and is also assisted by the indexer. The specific process is as follows: the URL server obtains the URLs to be crawled from the URL database; the crawler grabs the webpages at those URLs and sends them to the storage unit, which compresses them and saves them in the webpage database; the indexer then analyzes all the links in each webpage and stores the important related information in the anchor file. The URL parser reads the anchor file and resolves each URL, which is in turn converted into a docID. The anchor text is then indexed and sent to the index database. The specific process is shown in Figure 2. The analyzer in Figure 2 can be seen as part of the indexer, or as an auxiliary part of the indexer. Since the processing flow of the web spider is a well-known technique, it is not described in detail herein.

The link information extraction module reads the webpage database, decompresses each document, and then analyzes it. Each document is converted into a set of word occurrences, here called samples; each sample records the word together with its position in the document, its font size, and its case. Search engines have two types of samples:

(1) Title: the title of the HTML page or the URL, together with the meta information in the HTML file. An index is built by analyzing each word, and users can search this information through that index.

(2) Content: all the content of the page. An index is built by analyzing each word, and users can search this information through that index.

It can be seen that the general search engine only extracts and indexes the title and content in the webpage, and does not further extract the information in the content.

With the rapid increase in the number of webpages that search engines can obtain, users often receive too much information after entering search keywords, including many irrelevant or useless results. Users must filter the results themselves, which greatly reduces their search efficiency. Therefore, to make search engines easier to use and to let users efficiently obtain useful information from them, the processing of search results becomes more and more important. For example, in search results for housing rental information, many users want to filter out the information posted by intermediaries. However, current search engines have not been able to solve this problem.

Summary of the invention

An object of the embodiments of the present invention is to provide a search engine and a method for filtering the mediation information, so that some or all of the mediation information is filtered out in the search result.

In order to achieve the above object, the present invention provides a search engine, including: a web spider, a link information extraction module, and a query server;

The link information extraction module is configured to extract a webpage title, webpage content, and intermediary feature information from a webpage database, and to determine, by using the set intermediary information judgment condition, whether the information corresponding to the intermediary feature information is intermediary information;

The search engine filters out the index corresponding to the mediation information from its index database.

The invention also provides a search engine, comprising: a web spider, a link information extraction module and a query server;

The link information extraction module is configured to extract a webpage title and webpage content from a webpage database, analyze the webpage content, and determine that content including intermediary propensity information is intermediary information.

The present invention also provides a method by which a search engine filters intermediary information, including:

Grab a web page from the Internet and send it to a web page database;

Performing link information extraction, extracting a webpage title and webpage content from the webpage database, and further extracting the mediation feature information from the webpage content;

The extracted mediation feature information is analyzed, and if the set mediation information judgment condition is met, the information corresponding to the mediation feature information is determined as the mediation information;

Filter out the mediation information in the search results.

The present invention also provides a method by which a search engine filters intermediary information, including:

Grab a web page from the Internet and send it to a web page database;

Extracting the link information, extracting the webpage title and the webpage content from the webpage database, and analyzing the extracted webpage content, and if the webpage content includes the intermediary propensity information, determining that the information corresponding to the intermediary propensity information is the intermediary information;

Filter out the mediation information in the search results.

The search engine and the method for filtering intermediary information in the embodiments of the present invention can filter some or all of the intermediary information out of the search results, effectively prevent intermediary information from interfering with the user, improve the usability of the search results, and provide the user with greater convenience.

Drawings

The drawings described herein are provided for a further understanding of the invention and are not intended to limit the invention. In the drawings:

FIG. 1 is a system architecture diagram of a typical search engine in the prior art;

FIG. 2 is a schematic diagram of the processing flow of a web spider in the prior art;

FIG. 3 is a schematic flowchart of filtering mediation information according to an embodiment of the present invention.

Detailed description

In order to make the objects, technical solutions and advantages of the present invention more comprehensible, the specific embodiments of the present invention will be described in detail below. The illustrative embodiments of the present invention and the description thereof are intended to explain the present invention, but are not intended to limit the invention.

Example 1

If you want search engines to filter your search results in a targeted manner, you must let the search engine "learn" the content of the page. For example, for search results such as house rentals, if you want to filter out the mediation information, you need to understand the general characteristics of the mediation information. The intermediary information generally has one or more of the following characteristics:

(1) The same intermediary will publish a lot of different information. Taking rental housing as an example, an intermediary usually publishes rental information in many different locations.

(2) The published information contains company information. For example, company address and company contact information.

(3) The published information contains unreasonable information. Examples include incorrect phone numbers (including cell phone numbers, landline numbers, PHS numbers, etc.), very low prices, and more.

The embodiment of the present invention modifies the link information extraction portion (the link information extraction module) of a general vertical search engine. The search engine in this embodiment mainly includes a web spider (Spider), a link information extraction module (Parser), and a query server. The web spider and the query server use common processing techniques, which are not described in detail herein. The link information extraction module is improved with respect to the characteristics of intermediary information: in addition to the webpage title and content, it further extracts information from within the content, namely the mediation feature information used to identify intermediary information (such as phone numbers, emails, and prices), and the extracted content can be processed further. By extracting and analyzing the mediation feature information, it is possible to find the many postings published by the same intermediary and the postings that contain unreasonable information; by analyzing and processing the webpage content, it is possible to find postings that contain company information or other signs of an intermediary.

In addition to the page title and content, the improved link information extraction module adds the following functions:

(1) Extract the mediation feature information used to identify intermediary information (taking the phone number and email as examples):

I. Extract the user's phone number. The extraction uses pattern matching: each webpage is searched for cue strings such as "mobile phone", "telephone", "Little Smart" (PHS), and "Cell Phone". Once a cue string is found, the first run of consecutive digits following it is extracted and taken as the user's phone number.

II. Extract the user's email. The extraction uses pattern matching: each webpage is searched for cue strings such as "email box" and "Email". Once a cue string is found, the consecutive characters following it are extracted, stopping at the first space. The extracted string is the user's email.

(2) After extracting the phone numbers and emails, count how many times each identical phone number or email occurs. The statistics are time-bound: generally, the repetitions of a phone number or email over the past n months (1 < n < 24, for example 3 months) are counted.

(3) Set a threshold for the number of repetitions of a phone number and of an email. If the threshold is exceeded, the posting is considered to come from an intermediary. For example, if the threshold for phone numbers is set to 10, then when a phone number is repeated more than 10 times, all the information corresponding to that phone number is considered intermediary information.

(4) Analyze the phone number and identify numbers that do not exist or are invalid according to the number prefix rules.

For example, on Chinese websites, in a number starting with 010 the next digit must be 5, 6, or 8. Otherwise, all the information corresponding to this number is considered intermediary information.

(5) Analyze the content of the main body of the webpage to identify the intermediary information.

Since the information published by an intermediary often also contains phrases such as "company" and "large amount of listings", the link information extraction module can further identify intermediary information by analyzing and processing the extracted content. For example, the main body of the webpage can be analyzed: if it contains phrases such as "company", "company address", "my company", or "large amount of listings", the information is considered intermediary information.
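Functions (1) to (5) above can be sketched together as follows. This is only an illustration under stated assumptions, not the embodiment's code: the English cue strings, the `Posting` record, the `find_intermediary` function, the 30-day month, and the single-entry `PREFIX_RULES` table are hypothetical; only the 010 rule, the 3-month window, and the threshold of 10 come from the examples in the text.

```python
import re
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, List, NamedTuple, Optional, Sequence, Set

# Hypothetical cue strings and rule table (see the assumptions above).
PHONE_CUES = ("mobile phone", "cell phone", "telephone", "PHS")
EMAIL_CUES = ("email box", "email")
BODY_CUES = ("company", "large amount of listings")
PREFIX_RULES: Dict[str, Set[str]] = {"010": {"5", "6", "8"}}

class Posting(NamedTuple):
    doc_id: str
    text: str
    published: datetime

def _after_cue(cues: Sequence[str], tail: str, text: str) -> Optional[str]:
    """Return the group captured by `tail` right after the first cue found."""
    alternation = "|".join(re.escape(c) for c in cues)
    match = re.search(rf"(?:{alternation}){tail}", text, flags=re.IGNORECASE)
    return match.group(1) if match else None

def extract_phone(text: str) -> Optional[str]:
    # (1) I: first run of consecutive digits after a phone cue
    return _after_cue(PHONE_CUES, r"\D*?(\d+)", text)

def extract_email(text: str) -> Optional[str]:
    # (1) II: characters after an email cue, stopping at whitespace
    return _after_cue(EMAIL_CUES, r"[\s:]*(\S+)", text)

def violates_prefix_rule(phone: str) -> bool:
    # (4): the digit after a known area code must be in the allowed set
    for area_code, allowed in PREFIX_RULES.items():
        if phone.startswith(area_code):
            return phone[len(area_code):len(area_code) + 1] not in allowed
    return False

def find_intermediary(postings: List[Posting], now: datetime,
                      months: int = 3, threshold: int = 10) -> Set[str]:
    """Return the doc_ids judged to be intermediary information."""
    cutoff = now - timedelta(days=30 * months)  # assumption: 30-day months
    contacts = {p.doc_id: [c for c in (extract_phone(p.text), extract_email(p.text)) if c]
                for p in postings}
    # (2): count repetitions among postings published inside the window
    counts = Counter(c for p in postings if p.published >= cutoff
                     for c in contacts[p.doc_id])
    flagged = set()
    for p in postings:
        phone = extract_phone(p.text)
        if (any(counts[c] > threshold for c in contacts[p.doc_id])      # (3)
                or (phone is not None and violates_prefix_rule(phone))  # (4)
                or any(cue in p.text.lower() for cue in BODY_CUES)):    # (5)
            flagged.add(p.doc_id)
    return flagged

now = datetime(2007, 4, 29)
postings = [Posting(f"agent-{i}", f"Flat {i} for rent, telephone: 13800138000", now)
            for i in range(11)]
postings += [
    Posting("owner-1", "Sunny room, mobile phone: 13900139000 email: owner@example.com", now),
    Posting("bad-number", "Cheap flat, telephone: 01012345678", now),
    Posting("firm", "Our company has many flats", now),
]
flagged = find_intermediary(postings, now)
```

In this sample data the eleven postings sharing one phone number, the posting with an invalid 010 number, and the posting mentioning a company are flagged, while `owner-1` is not. Note that taking only the first run of consecutive digits means a number written with separators (for example a dash after the area code) would be cut at the separator; a fuller implementation would probably normalize the number first.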

After the link information extraction module has extracted the above information, either only the information determined to be non-intermediary information is indexed, or all the extracted information is indexed and everything determined to be intermediary information is then deleted from the index database. Indexing uses the common "inverted index" technique, which is well known in the art and is not described in detail herein.
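The two filtering strategies just described can be illustrated with a toy inverted index. The `InvertedIndex` class, the sample documents, and the one-line `is_intermediary` judgment are hypothetical stand-ins; a production index would be far more elaborate.

```python
from collections import defaultdict
from typing import Callable, Dict, Set

class InvertedIndex:
    """A toy inverted index: word -> ids of the documents containing it."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def remove(self, doc_id: str) -> None:
        for ids in self.postings.values():
            ids.discard(doc_id)

    def search(self, word: str) -> Set[str]:
        return set(self.postings.get(word.lower(), set()))

docs = {"d1": "room for rent by owner", "d2": "our company offers rent listings"}
# Placeholder judgment; any of the conditions in the text could be used here.
is_intermediary: Callable[[str], bool] = lambda text: "company" in text

# Strategy A: index only the documents judged non-intermediary.
index_a = InvertedIndex()
for doc_id, text in docs.items():
    if not is_intermediary(text):
        index_a.add(doc_id, text)

# Strategy B: index everything, then delete the intermediary documents.
index_b = InvertedIndex()
for doc_id, text in docs.items():
    index_b.add(doc_id, text)
for doc_id, text in docs.items():
    if is_intermediary(text):
        index_b.remove(doc_id)
```

Both strategies leave the same searchable content; strategy B simply allows the judgment to be made, or revised, after indexing.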

In this way, the index entries corresponding to intermediary information are filtered out of the index database. When the user submits a query request to the query server, the server searches the index database for relevant webpages, and the intermediary information is largely filtered out of the returned search results.

FIG. 3 is a schematic diagram of a filtering process of a mediation information by a search engine according to an embodiment of the present invention. As shown in Figure 3, the following steps are included:

Step 100: Extract mediation feature information (such as a phone number and an email), specifically including the following information:

i. a mobile phone number;

ii. a fixed telephone number;

iii. a PHS number;

iv. an email.

Step 200: Count the occurrences of identical extracted information. The method implemented in this embodiment is to establish a table in the background database of the search engine, in which the first field is a phone number or email and the second field is the number of repeated occurrences. After each piece of information is extracted, the table is queried first. If a record already exists, its repetition count is incremented by one; if there is no record, a record is inserted with its repetition count set to 1.

Steps 300 and 400: If a mobile phone number, fixed telephone number, PHS number, or email is repeated more than 10 times, then all the postings corresponding to that mobile phone number, telephone number, PHS number, or email are not indexed, or all of those postings are deleted from the search engine's index database.

Step 500: Determine whether the mobile phone, telephone, or PHS number is valid. The judgment is based on the number rule table for the various regions of China; for example, a Beijing telephone number has 8 digits. For numbers that do not comply with the rules, all the postings corresponding to that mobile phone, telephone, or PHS number are deleted from the search engine's index database.

Step 600: Determine whether the extracted webpage content has an intermediary tendency. If the webpage content contains "the company" or "large number of listings", or contains multiple different addresses (for example, listings in Dongzhimen, Xizhimen, and Zhongguancun at the same time), then this information is not indexed, or this information is deleted from the search engine's index database.

In each of steps 300-600 of the embodiment, the information determined to be intermediary information may also be left without immediate processing; after all the conditions have been evaluated, all the information determined to be intermediary information is deleted from the search engine's index database. Alternatively, after all the conditions have been evaluated, the non-intermediary information is added to the index database for users to query, and the information determined to be intermediary information is not indexed. However, the present invention is not limited to these modes; any approach that filters the identified intermediary information out of the index database falls within the scope of the present invention.

In addition, the above steps in the present embodiment shown in FIG. 3 are not limited in order, and the mediation feature information is not limited to the phone number or email given in the embodiment, and may be other information such as price.
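The counting table of Step 200 and the threshold test of Steps 300 and 400 might look like the following sketch. The use of SQLite, the table name `contact_counts`, and the helper functions `record_contact` and `contacts_over` are assumptions made for illustration; the text only specifies a two-field table in the search engine's background database.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")  # assumption: an in-memory SQLite store
conn.execute("CREATE TABLE contact_counts ("
             "contact TEXT PRIMARY KEY, repeats INTEGER NOT NULL)")

def record_contact(contact: str) -> int:
    """Step 200: query first, then increment an existing record or insert one."""
    row = conn.execute("SELECT repeats FROM contact_counts WHERE contact = ?",
                       (contact,)).fetchone()
    if row is None:
        conn.execute("INSERT INTO contact_counts (contact, repeats) VALUES (?, 1)",
                     (contact,))
        return 1
    conn.execute("UPDATE contact_counts SET repeats = repeats + 1 WHERE contact = ?",
                 (contact,))
    return row[0] + 1

def contacts_over(threshold: int = 10) -> List[str]:
    """Steps 300/400: contacts repeated more than `threshold` times."""
    rows = conn.execute("SELECT contact FROM contact_counts "
                        "WHERE repeats > ? ORDER BY contact", (threshold,))
    return [r[0] for r in rows]

for _ in range(11):
    record_contact("13800138000")
record_contact("owner@example.com")
over_threshold = contacts_over()
```

Postings whose phone number or email appears in `over_threshold` would then be skipped at indexing time or deleted from the index database, whichever mode is chosen.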

A person skilled in the art can understand that all or part of the steps of the above embodiments may be completed by a program instructing the related hardware, and the program may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc.

Through the above processing, the index database record of the search engine can be provided to the query server for the user to query and use. At this time, since the index database has substantially no index of the mediation information, the mediation information in the search result can be reduced from 90% before processing to 10% or less.

As described above, the search engine and the method for filtering intermediary information in the embodiment of the present invention can filter some or all of the intermediary information out of the search results, effectively preventing intermediary information from interfering with the user, improving the usability of the search results, and providing users with greater convenience.

The above specific embodiments are merely illustrative of the invention and are not intended to limit the invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and scope of the present invention are intended to be included within the scope of the present invention.

Claims

1. A method for filtering mediation information by a search engine, characterized in that the method comprises:

Crawling a webpage from the Internet and sending it to a webpage database;

Performing link information extraction, extracting a webpage title and webpage content from the webpage database, and further extracting mediation feature information from the webpage content;

Filtering out the mediation information in the search results.

2. The method according to claim 1, wherein the manner of extracting the mediation feature information is a pattern matching mode.

3. The method of claim 1 wherein:

The mediation feature information is a phone number and/or email information;

The analysis of the extracted mediation feature information refers to: counting the number of repetitions of the same phone number and/or email in the webpage within a predetermined time period;

The set mediation information determination condition is: the number of repetitions of the same phone number and/or email information exceeds respective corresponding thresholds.

4. The method of claim 1 wherein:

The mediation feature information is a phone number;

The set intermediary information determining condition is: the phone number is an incorrect phone number.

5. The method according to claim 1, wherein: the mediation feature information is price information;

The set intermediary information determination condition is: the price is lower than a set threshold.

6. The method according to any one of claims 1 to 5, characterized in that filtering the intermediary information in the search result means:

Deleting the mediation information from the index database of the search engine or indexing only the information determined to be non-intermediary information to filter the mediation information from the index database;

The search engine searches based on an index database from which the intermediary information has been filtered out to obtain a search result.

7. The method according to any one of claims 1 to 5, characterized in that the method further comprises: analyzing the extracted webpage content, and if the webpage content includes intermediary propensity information, determining that the information corresponding to the intermediary propensity information is the intermediary information.

8. A method for filtering mediation information by a search engine, which is characterized by:

Grab a web page from the Internet and send it to a web page database;

Filter out the mediation information in the search results.

9. The method according to claim 8, wherein the method further comprises:

Further extracting the mediation feature information from the content of the webpage;

The extracted mediation feature information is analyzed. If the set mediation information determination condition is met, the webpage information corresponding to the mediation feature information is determined as the mediation information.

10. The method of claim 9 wherein:

The manner of extracting the mediation feature information is a pattern matching mode.

11. The method of claim 9 wherein:

The mediation feature information is a phone number and/or email information;

12. The method according to claim 9, wherein: said mediation feature information is a phone number;

13. The method of claim 9 wherein:

The mediation feature information is price information;

14. The method according to any one of claims 8 to 13, characterized in that filtering the intermediary information in the search result means:

Deleting the mediation information from the index database of the search engine or indexing only the information determined to be non-intermediary information to filter the mediation information from the index database;

The search engine searches based on an index database that filters out the intermediary information to obtain a search result.

15. A search engine, comprising a web spider and a query server, wherein the search engine further comprises a link information extraction module;

The search engine filters out the index corresponding to the mediation information from the index database.

16. The search engine of claim 15 wherein:

The link information extraction module is further configured to analyze the content of the webpage, and determine that content including intermediary propensity information is the mediation information.

17. The search engine of claim 15 wherein:

The manner in which the link information extraction module extracts the mediation feature information is a pattern matching mode.

18. The search engine of claim 15 wherein:

The mediation feature information is a phone number and/or email information;

The set mediation information determination condition is: the number of repetitions of the same phone number and/or email information within the predetermined time period, as counted by the link information extraction module, exceeds respective corresponding thresholds.

19. The search engine according to claim 15, wherein: said mediation feature information is a phone number;

20. The search engine of claim 15 wherein:

The mediation feature information is price information;

21. The search engine according to any one of claims 15 to 20, wherein the search engine filtering out the index corresponding to the intermediary information from the index database means: the link information extraction module indexes only the information determined to be non-intermediary information; or

After the link information extraction module extracts the information and indexes it, the information determined to be the intermediary information is deleted from the index database.

22. A search engine, comprising a web spider and a query server, wherein the search engine further comprises a link information extraction module;

23. The search engine according to claim 22, wherein the link information extraction module is further configured to extract the mediation feature information from the webpage database, and to determine, by using the set mediation information determination condition, whether the information corresponding to the mediation feature information is intermediary information.

24. The search engine according to claim 22, wherein the manner in which the link information extraction module extracts the mediation feature information is a pattern matching mode.

25. The search engine of claim 22, wherein:

The mediation feature information is a phone number and/or email information;

26. The search engine according to claim 22, wherein: said mediation feature information is a phone number;

27. The search engine of claim 22, wherein:

The mediation feature information is price information;

28. The search engine according to any one of claims 22-27, wherein the search engine filtering out the index corresponding to the intermediary information from the index database means: the link information extraction module indexes only the information determined to be non-intermediary information; or