WO2008131597A1 - Search engine and method for filtering agency information - Google Patents

Search engine and method for filtering agency information

Info

Publication number
WO2008131597A1
WO2008131597A1 PCT/CN2007/001474 CN2007001474W WO2008131597A1 WO 2008131597 A1 WO2008131597 A1 WO 2008131597A1 CN 2007001474 W CN2007001474 W CN 2007001474W WO 2008131597 A1 WO2008131597 A1 WO 2008131597A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
mediation
intermediary
search engine
webpage
Prior art date
Application number
PCT/CN2007/001474
Other languages
French (fr)
Chinese (zh)
Inventor
Haitao Lin
Original Assignee
Haitao Lin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haitao Lin
Priority to PCT/CN2007/001474 (WO2008131597A1)
Priority to CN200780052784A (CN101849232A)
Publication of WO2008131597A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9536: Search customisation based on social or collaborative filtering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Definitions

  • The present invention relates to computer search engine technology, and more particularly to a search engine and its method for filtering intermediary (agency) information.
  • The Internet provides rich, up-to-the-minute information (and a platform for communication and entertainment) and deeply influences modern life. As the number of websites and the volume of content have grown rapidly, however, the Internet has come to resemble a huge encyclopedia without a table of contents, in which people cannot find the information they want.
  • Search engines have added a table of contents and an index to this encyclopedia: typing keywords into the search box returns relevant information or URLs.
  • Search engines provide an entry point for everyone who surfs the web. It is no exaggeration to say that almost any user can start from a search and reach any place on the Internet. Search has therefore become the most widely used online service after email.
  • Figure 1 shows the system architecture of a typical prior-art search engine.
  • The parts of the search engine are interleaved and interdependent.
  • The processing flow is roughly as follows:
  • The web spider crawls web pages from the Internet.
  • The crawling process is as follows: (1) one or more URLs (Uniform Resource Locators, also called web addresses) of starting pages are manually added to the URL database; these URLs are also called seeds; (2) the spider takes a URL from the URL database, fetches the page content for that URL, and stores the content in the web page database; (3) URLs in the fetched page that meet the requirements are extracted and added to the URL database.
  • Whether a URL meets the requirements is decided by pattern matching; (4) steps (2) and (3) are repeated until no new records are added to the web page database.
  • The system takes the raw page from the web page database and extracts its text, that is, strips all HTML markup. The extracted text is then passed to the text indexing module to build an index.
  • Indexing first computes the relevance (or importance) of each keyword in the page content and hyperlinks, and then uses this relevance information to build the web page index database.
  • While the text index is built, the site's link information must be consulted, mainly to guard against illegitimate websites, for example sites with multiple circular links to themselves.
  • Link information is also extracted from the web page database, and this information (including the anchor text and the links themselves) is sent to the link database to provide a basis for page rating.
  • The user submits a query request to the query server, which looks up relevant pages in the index database. Page rating combines the query request with the link information to evaluate the relevance of the search results, and the query server sorts the results by relevance.
  • A content summary is extracted for the keywords, and finally the page generation system assembles the result links and page summaries and returns them to the user.
  • The web spider (Spider) and the link information extraction (Parser) module are the most important parts:
  • The web spider uses multi-threaded concurrent searching and mainly implements the document access agent, the path selection engine, and the access control engine.
  • The web spider consists of four functional components (URL server, crawler, store, and URL parser) and three data resources (the repository, i.e. the web page database; the anchor library; and the URL database). It also relies on an auxiliary function of the indexer.
  • Specifically, the URL server obtains the URLs to crawl from the URL database; the crawler fetches each web page by its URL and passes it to the store; the store compresses the page and saves it in the web page database; the indexer then analyzes all links in each page and stores the important related information in the anchors file.
  • The URL parser reads the anchors file and resolves the URLs, converting each in turn into a docID.
  • The anchor text is then turned into a forward index and sent to the index database.
  • The process is shown in Figure 2.
  • The analyzer in Figure 2 can be regarded as part of the indexer, or as an auxiliary function of it. Because the spider's processing flow is well known, it is not described in detail here.
  • The link information extraction module reads the web page database, decompresses the documents, and analyzes them. Each document is converted into a set of word occurrences, called samples. A sample records the word, its position in the document, the font size, and capitalization. The search engine has two types of samples: (1) Title: the title of the HTML page or URL, together with the meta information in the HTML file. An index is built by analyzing the individual words, and users can find the item through this index.
  • A general-purpose search engine only extracts and indexes the title and content of a web page and does not extract further information from within the content.
  • An object of the embodiments of the present invention is to provide a search engine and a method for filtering intermediary information, so that some or all of the intermediary information is filtered out of the search results.
  • The present invention provides a search engine comprising a web spider, a link information extraction module, and a query server;
  • the link information extraction module is configured to extract the web page title, the web page content, and intermediary feature information from a web page database, and to decide, using a set intermediary-information judgment condition, whether the information corresponding to the intermediary feature information is intermediary information;
  • the search engine filters the index entries corresponding to intermediary information out of its index database.
  • The invention also provides a search engine comprising a web spider, a link information extraction module, and a query server;
  • the link information extraction module is configured to extract the web page title and web page content from a web page database, analyze the web page content, and judge content containing intermediary-tendency information to be intermediary information;
  • the search engine filters the index entries corresponding to intermediary information out of its index database.
  • The present invention also provides a method by which a search engine filters intermediary information, including: crawling web pages from the Internet and sending them to a web page database;
  • analyzing the extracted intermediary feature information and, if the set intermediary-information judgment condition is met, judging the information corresponding to the intermediary feature information to be intermediary information.
  • The present invention also provides a second method by which a search engine filters intermediary information, in which the extracted web page content is analyzed and content containing intermediary-tendency information is judged to be intermediary information.
  • The search engine and filtering method of the embodiments can filter some or all of the intermediary information out of the search results, which effectively prevents intermediary information from disturbing the user, improves the usability of the search results, and makes searching more convenient for the user.
  • FIG. 1 is a system architecture diagram of a typical search engine in the prior art.
  • FIG. 2 is a schematic diagram of the processing flow of a web spider in the prior art.
  • FIG. 3 is a schematic flowchart of filtering intermediary information according to an embodiment of the present invention.
  • Intermediary information generally has one or more of the following characteristics:
  • The same intermediary publishes many different listings. Taking rental housing as an example, an intermediary usually publishes rental listings for many different locations.
  • The published information contains company information, for example a company address and company contact details.
  • The published information contains implausible details, for example incorrect phone numbers (mobile, landline, or PHS numbers) or very low prices.
  • The embodiment of the present invention starts from a general vertical search engine and modifies its link information extraction part (the link information extraction module).
  • The search engine in this embodiment mainly includes a web spider (Spider), a link information extraction module (Parser), and a query server.
  • The web spider and the query server use common processing techniques, which are not described in detail here.
  • The link information extraction module is improved to target the characteristics of intermediary information. Besides obtaining the web page title and content, it extracts further information from within the content, namely intermediary feature information used to identify intermediary information (such as phone number, email, and price). The extracted content can also be processed further:
  • by extracting and analyzing the intermediary feature information, the module can find the many listings published by the same intermediary as well as listings containing implausible information;
  • by analyzing the extracted web page content, it can find further information that contains company details or otherwise shows an intermediary tendency.
  • Besides obtaining the web page title and content, the improved link information extraction module adds the following functions:
  • Phone numbers are extracted by pattern matching: each web page is searched for keywords such as "mobile phone", "cell phone", "telephone", and "PHS" (Little Smart). Once a keyword is found, the first run of consecutive digits following it is extracted and taken as the user's phone number.
  • Email addresses are also extracted by pattern matching: each web page is searched for keywords such as "email box" and "Email". Once a keyword is found, the string of consecutive characters following it is extracted, stopping at the first space.
  • The extracted string is the user's email address.
  • The validity of an extracted number is checked against numbering rules. For example, in a number beginning with 010 the following digit must be 5, 6, or 8; otherwise, all information corresponding to this number is considered intermediary information.
  • The link information extraction module can further identify intermediary information by analyzing the extracted content. For example, the body text of the web page can be analyzed: if it contains phrases such as "company", "company address", "my company", or "large number of listings", the information is considered intermediary information.
  • After the link information extraction module extracts the above information, either only the information judged to be non-intermediary is indexed, or everything is indexed and all information judged to be intermediary is then deleted from the index database. Indexing uses the common inverted index technique, which is well known in the art and is not described in detail here.
  • Because the index entries corresponding to intermediary information have been filtered out of the index database, when the user submits a query request to the query server
  • and the server looks up relevant pages in the index database, intermediary information is largely absent from the returned search results.
  • FIG. 3 is a schematic diagram of how a search engine filters intermediary information according to an embodiment of the present invention. As shown in FIG. 3, the process includes the following steps:
  • Step 100: Extract intermediary feature information (such as phone number and email), specifically including the following: i. a mobile phone number;
  • Step 200: Count the occurrences of identical extracted information.
  • In this embodiment, a table is created in the search engine's back-end database. Its first field is a phone number or email address, and its second field is the number of times that value has occurred. After each item is extracted, the table is queried first: if a record already exists, its repetition count is incremented by one; if not, a record is inserted with a repetition count of 1.
  • Step 500: Determine whether the mobile, landline, or PHS number is valid.
  • The judgment is based on the numbering rules of the various regions of China. For example, Beijing telephone numbers have 8 digits. If a number does not comply with the rules, all postings associated with that mobile, landline, or PHS number are deleted from the search engine's index database.
  • Step 600: Determine whether the extracted web page content shows an intermediary tendency. If the content contains phrases such as "the company" or "large number of listings", or contains several different addresses (for example, listings in Dongzhimen, Xizhimen, and Zhongguancun at once), then this information is not indexed, or it is removed from the search engine's index.
  • Information judged to be intermediary information may also be left untouched at the moment of judgment. In that case, after all conditions have been evaluated, either all information judged to be intermediary is deleted from the search engine's index database, or only the non-intermediary information is added to the index database for users to query and the intermediary information is never indexed.
  • The present invention is not limited to these modes; any approach that filters the judged intermediary information out of the index database falls within the scope of the invention.
  • The steps of the embodiment shown in FIG. 3 need not be performed in the order given, and the intermediary feature information is not limited to the phone number or email of the embodiment; it may be other information, such as price.
  • The records of the search engine's index database can then be provided to the query server for users to query.
  • With this processing, the share of intermediary information in the search results can be reduced from 90% to 10% or less.
  • The search engine and filtering method of the embodiments can filter some or all of the intermediary information out of the search results, which effectively prevents intermediary information from disturbing the user, improves the usability of the search results, and makes searching more convenient for the user.
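The extraction, counting, number check, and keyword check described in this section can be sketched together in Python. This is only an illustrative sketch under stated assumptions, not the patent's implementation: the English keyword lists, the `max_repeats` threshold, the reading of the 010 rule (8 local digits starting with 5, 6, or 8), and every function name are hypothetical. As in the text, the phone pattern takes the first run of digits after a keyword, so a number written with hyphens would be cut short.

```python
import re
from collections import Counter

# Keyword lists are assumptions based on the translated examples in the text.
PHONE_RE = re.compile(r"(?:mobile phone|cell phone|telephone|PHS)\D*(\d+)", re.IGNORECASE)
EMAIL_RE = re.compile(r"e-?mail\s*:?\s*(\S+)", re.IGNORECASE)
# "company" already covers the longer company phrases; they are listed for clarity.
AGENCY_PHRASES = ("company", "company address", "my company", "large number of listings")


def extract_contacts(text):
    """Return phone numbers and email addresses found after their keywords."""
    return PHONE_RE.findall(text) + EMAIL_RE.findall(text)


def is_valid_beijing_number(number):
    """Assumed rule: after 010 come 8 digits, the first being 5, 6, or 8."""
    if not number.startswith("010"):
        return True  # rules for other regions are not modelled in this sketch
    local = number[3:]
    return len(local) == 8 and local[0] in "568"


def has_agency_tendency(text):
    """Check the page text for phrases that suggest an intermediary."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in AGENCY_PHRASES)


def filter_listings(listings, max_repeats=3):
    """Keep only listings not judged to be intermediary information.

    listings maps a document id to its page text. max_repeats is a
    hypothetical threshold; the text does not state one.
    """
    # Repetition table: contact value -> number of listings that use it.
    counts = Counter()
    for text in listings.values():
        counts.update(set(extract_contacts(text)))

    kept = {}
    for doc_id, text in listings.items():
        contacts = extract_contacts(text)
        repeated = any(counts[c] > max_repeats for c in contacts)
        invalid = any(c.isdigit() and not is_valid_beijing_number(c) for c in contacts)
        if repeated or invalid or has_agency_tendency(text):
            continue  # judged intermediary: leave it out of the index
        kept[doc_id] = text
    return kept
```

In this sketch the filtering happens before indexing, which corresponds to the variant in which intermediary information is never indexed; deleting entries from an existing index would be the other variant mentioned above.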

Abstract

A search engine and a method for filtering agency information. The method comprises: crawling web pages from the Internet and sending them to a web page database; extracting link information by extracting the titles and contents of the web pages from the database and further extracting agency feature information; analyzing the extracted agency feature information and, if a set agency-information judgment condition is satisfied, determining that the information corresponding to the agency feature information is agency information; and filtering the agency information out of the search results.

Description

Technical field
The present invention relates to computer search engine technology, and in particular to a search engine and its method for filtering intermediary information.
Background art
The Internet provides rich, up-to-the-minute information (and a platform for communication and entertainment) and deeply influences modern life. As the number of websites and the volume of content have grown rapidly, however, the Internet has come to resemble a huge encyclopedia without a table of contents, in which people cannot find the information they want. The emergence of search engines added a table of contents and an index to this encyclopedia: typing keywords into the search box returns relevant information or URLs. Faced with vast online resources, search engines provide an entry point for everyone who surfs the web. It is no exaggeration to say that almost any user can start from a search and reach any place on the Internet. Search has therefore become the most widely used online service after email.
Figure 1 shows the system architecture of a typical prior-art search engine. The parts of the search engine are interleaved and interdependent. The processing flow is roughly as follows:
The web spider crawls web pages from the Internet as follows: (1) one or more URLs (Uniform Resource Locators, also called web addresses) of starting pages are manually added to the URL database; these URLs are also called seeds; (2) the spider program takes a URL from the URL database, fetches the page content for that URL, and stores the content in the web page database; (3) URLs in the fetched page that meet the requirements are extracted and added to the URL database, with pattern matching deciding whether a URL meets the requirements; (4) steps (2) and (3) are repeated until no new records are added to the web page database.
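The four crawl steps can be illustrated with a short Python sketch. It rests on assumptions: `fetch` is a hypothetical caller-supplied function that returns a page's HTML (or `None`), links are found with a simple `href` pattern, and `max_pages` is an added safety limit that the text does not mention. The loop ends when the URL queue is empty, which is the point at which the page database stops receiving new records.

```python
import re
from collections import deque

# Simplified link pattern; a real crawler would parse the HTML properly.
HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(seeds, fetch, url_pattern, max_pages=100):
    """Crawl from the seed URLs and return {url: html}."""
    url_db = deque(seeds)   # step 1: URL database holding the manual seeds
    seen = set(seeds)
    page_db = {}            # web page database
    allowed = re.compile(url_pattern)

    while url_db and len(page_db) < max_pages:
        url = url_db.popleft()          # step 2: take a URL and fetch its page
        html = fetch(url)
        if html is None:
            continue
        page_db[url] = html
        for link in HREF_RE.findall(html):
            # step 3: keep only URLs that match the required pattern
            if allowed.match(link) and link not in seen:
                seen.add(link)
                url_db.append(link)
        # step 4: repeat until no URLs remain
    return page_db
```

For example, passing `pages.get` for a dictionary of canned pages as `fetch` exercises the loop without any network access.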
The system takes the raw page from the web page database and extracts its text, that is, strips all HTML markup. The extracted text is then passed to the text indexing module. Indexing first computes the relevance (or importance) of each keyword in the page content and hyperlinks, and then uses this relevance information to build the web page index database. While the text index is built, the site's link information must be consulted, mainly to guard against illegitimate websites, for example sites with multiple circular links to themselves. While the index database is built, link information is also extracted from the web page database and sent (including the anchor text and the links themselves) to the link database, which provides the basis for page rating.
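A minimal Python sketch of the text extraction and indexing step follows. It assumes that a plain term count can stand in for the relevance (importance) score, since the text does not say how relevance is computed; the function and class names are hypothetical, and link analysis is left out.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of a page, dropping the HTML markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(html):
    """Return the page text with tags removed and whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def build_index(page_db):
    """Build an inverted index: word -> {url: occurrence count}."""
    index = defaultdict(dict)
    for url, html in page_db.items():
        for word in re.findall(r"\w+", strip_markup(html).lower()):
            # The count is an assumed stand-in for the relevance score.
            index[word][url] = index[word].get(url, 0) + 1
    return index
```

A query for a word then reduces to looking it up in the returned mapping and sorting the URLs by their stored score.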
The user submits a query request to the query server, which looks up relevant pages in the index database. Page rating combines the query request with the link information to evaluate the relevance of the search results; the query server sorts the results by relevance and extracts content summaries for the keywords. Finally, the page generation system assembles the result links and page summaries and returns them to the user.
In the search engine architecture shown in Figure 1, the web spider (Spider) and the link information extraction (Parser) module are the most important parts:
The web spider (Spider) uses multi-threaded concurrent searching and mainly implements the document access agent, the path selection engine, and the access control engine. It consists of four functional components (URL server, crawler, store, and URL parser) and three data resources (the repository, i.e. the web page database; the anchor library; and the URL database), and it also relies on an auxiliary function of the indexer. Specifically, the URL server obtains the URLs to crawl from the URL database; the crawler fetches each web page by its URL and passes it to the store; the store compresses the page and saves it in the web page database; the indexer then analyzes all links in each page and stores the important related information in the anchors file. The URL parser reads the anchors file and resolves the URLs, converting each in turn into a docID. The anchor text is then turned into a forward index and sent to the index database. The process is shown in Figure 2, where the analyzer can be regarded as part of the indexer, or as an auxiliary function of it. Because the spider's processing flow is well known, it is not described in detail here.
The link information extraction module reads the web page database, decompresses the documents, and analyzes them. Each document is converted into a set of word occurrences, called samples. A sample records the word, its position in the document, the font size, and capitalization. The search engine has two types of samples:
(1) Title: the title of the HTML page or URL, together with the meta information in the HTML file. An index is built by analyzing the individual words, and users can find the item through this index.
(2) Content: the full content of the page. An index is built by analyzing the individual words, and users can find the item through this index.
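One way to picture the per-word records described above is the following Python sketch. The `Hit` record and `make_hits` function are hypothetical names, and the font size is omitted because recovering it would require parsing the page's markup and styles.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    word: str          # lower-cased word
    position: int      # word offset within its field
    capitalized: bool  # case information of the original token
    field: str         # "title" or "content", the two sample types


def make_hits(title, content):
    """Convert a document's title and content into a list of word records."""
    hits = []
    for field, text in (("title", title), ("content", content)):
        for position, match in enumerate(re.finditer(r"\w+", text)):
            token = match.group()
            hits.append(Hit(token.lower(), position, token[0].isupper(), field))
    return hits
```

Keeping the field name on each record lets an index treat title words and content words differently while storing them in one structure.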
As can be seen, a general-purpose search engine only extracts and indexes the title and content of a web page and does not extract further information from within the content.
As the number of web pages a search engine can reach grows rapidly, a keyword search often returns too much information, much of it irrelevant or useless. Users must sift through the results themselves, which greatly reduces search efficiency. To make search engines easier to use and to let users obtain useful information efficiently, the processing of search results is becoming increasingly important. For example, many users searching for rental housing want intermediary listings filtered out, but current search engines have not solved this problem.
Summary of the invention
An object of the embodiments of the present invention is to provide a search engine and a method for filtering intermediary information, so that some or all of the intermediary information is filtered out of the search results.
To achieve this object, the present invention provides a search engine comprising a web spider, a link information extraction module, and a query server;
the link information extraction module is configured to extract the web page title, the web page content, and intermediary feature information from a web page database, and to decide, using a set intermediary-information judgment condition, whether the information corresponding to the intermediary feature information is intermediary information;
the search engine filters the index entries corresponding to intermediary information out of its index database.
The invention also provides a search engine comprising a web spider, a link information extraction module, and a query server;
the link information extraction module is configured to extract the web page title and web page content from a web page database, analyze the web page content, and judge content containing intermediary-tendency information to be intermediary information;
the search engine filters the index entries corresponding to intermediary information out of its index database.
The present invention also provides a method by which a search engine filters intermediary information, including: crawling web pages from the Internet and sending them to a web page database;
extracting link information by extracting the web page title and web page content from the web page database and further extracting intermediary feature information from the web page content;
analyzing the extracted intermediary feature information and, if the set intermediary-information judgment condition is met, judging the information corresponding to the intermediary feature information to be intermediary information;
filtering the intermediary information out of the search results.
The present invention also provides a method by which a search engine filters intermediary information, including:
crawling web pages from the Internet and sending them to a web page database;
extracting link information by extracting the web page title and web page content from the web page database, analyzing the extracted web page content and, if the content contains intermediary-tendency information, judging the information corresponding to the intermediary-tendency information to be intermediary information;
filtering the intermediary information out of the search results.
The search engine and filtering method of the embodiments can filter some or all of the intermediary information out of the search results, which effectively prevents intermediary information from disturbing the user, improves the usability of the search results, and makes searching more convenient for the user.
Brief description of the drawings
The drawings described here provide a further understanding of the invention and form part of this application; they do not limit the invention. In the drawings:
FIG. 1 is a system architecture diagram of a typical search engine in the prior art;
FIG. 2 is a schematic diagram of the processing flow of a web spider in the prior art;
FIG. 3 is a schematic flowchart of filtering intermediary information according to an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, specific embodiments are described in detail below with reference to the drawings. The illustrative embodiments and their description serve to explain the invention and do not limit it.
Embodiment 1
For a search engine to screen search results in a targeted way, it must "understand" the content of the pages. For example, to filter intermediary information out of search results for rental housing, the general characteristics of intermediary information must be known. Intermediary information generally has one or more of the following characteristics:
(1) The same intermediary publishes many different listings. Taking rental housing as an example, an intermediary usually publishes rental listings for many different locations.
(2) The published information contains company information, for example a company address and company contact details.
(3) The published information contains implausible details, for example incorrect phone numbers (mobile, landline, or PHS numbers) or very low prices.
This embodiment is based on general-purpose vertical search and modifies the search engine's link information extraction part (the link information extraction module). The search engine in this embodiment mainly comprises a web spider (Spider), a link information extraction module (Parser), and a query server. The web spider and the query server use conventional processing techniques and are not described in detail here. The link information extraction module is improved to target the characteristics of intermediary information: in addition to obtaining the page title and content, it further extracts from the content the intermediary feature information used to identify intermediary information (such as telephone numbers, email addresses, and prices), and it can also process the extracted content further. By extracting and analyzing intermediary feature information, the module can find the many listings published by the same intermediary as well as listings that contain implausible information; by analyzing the extracted page content, it can additionally find listings that contain company information or other information indicating an intermediary tendency.
In addition to obtaining the page title and content, the improved link information extraction module adds the following functions:
(1) Extract the intermediary feature information used to identify intermediary information (telephone numbers and email addresses are used as examples):
I. Extract the user's telephone number by pattern matching: search each page for keywords such as "手机" (mobile phone), "移动电话" (mobile telephone), "电话" (telephone), "小灵通" (PHS), "Mobile Phone", and "Cell Phone"; when one is found, extract the first run of consecutive digits that follows it. That run of digits is taken to be the user's telephone number.
II. Extract the user's email address by pattern matching: search each page for keywords such as "电子邮箱" (email) and "Email"; when one is found, extract the continuous string that follows it, stopping at the first space. The extracted string is taken to be the user's email address.
(2) After extracting the telephone number and email address, count how many times the same telephone number or email address is repeated. The count is time-dependent: generally, repetitions are counted over the past n months (1 < n < 24, for example 3 months).
(3) Set a separate threshold for the repetition count of telephone numbers and of email addresses; if a count exceeds its threshold, the information is considered to have been published by an intermediary. For example, with a threshold of 10 for telephone numbers, when a number is repeated more than 10 times, all information associated with that number is considered intermediary information.
(4) Analyze the telephone number and, using number-prefix rules, identify numbers that do not exist or are invalid. For example, for a number on a Chinese website that begins with 010, the fourth digit must be 5, 6, or 8; otherwise, all information associated with that number is considered intermediary information.
(5) Analyze the main content of the page to identify intermediary information. Some information published by intermediaries also contains wording that indicates an intermediary tendency, such as "公司" (company) or "大量房源" (a large number of listings), so the link information extraction module can identify such information by analyzing the extracted content. For example, if the main content of a page contains phrases such as "本公司" (this company), "公司地址" (company address), "我公司" (our company), or "大量房源", the item is considered intermediary information.
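The keyword-driven extraction and the content check described above can be sketched in a few lines of Python. This is only an illustrative sketch: the keyword lists come from the description, but the regular expressions, the function names (`extract_phone`, `extract_email`, `has_agency_phrase`), and the handling of colons after a keyword are assumptions, not part of the patent.

```python
import re

# Keywords and phrases are taken from the description; everything else
# (regex details, function names) is a hypothetical illustration.
PHONE_KEYWORDS = ["移动电话", "手机", "电话", "小灵通", "Mobile Phone", "Cell Phone"]
EMAIL_KEYWORDS = ["电子邮箱", "Email"]
AGENCY_PHRASES = ["本公司", "公司地址", "我公司", "大量房源"]


def _keyword_pattern(keywords):
    # Try longer keywords first so "移动电话" is preferred over "电话".
    ordered = sorted(keywords, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(k) for k in ordered) + ")"


# A keyword, then any non-digits, then the first run of consecutive digits.
PHONE_RE = re.compile(_keyword_pattern(PHONE_KEYWORDS) + r"\D*?(\d+)", re.IGNORECASE)
# A keyword, optional whitespace or colons (assumed), then text up to the next space.
EMAIL_RE = re.compile(_keyword_pattern(EMAIL_KEYWORDS) + r"[\s::]*(\S+)", re.IGNORECASE)


def extract_phone(text):
    """Return the first digit run after a phone keyword, or None."""
    match = PHONE_RE.search(text)
    return match.group(1) if match else None


def extract_email(text):
    """Return the non-space string after an email keyword, or None."""
    match = EMAIL_RE.search(text)
    return match.group(1) if match else None


def has_agency_phrase(text):
    """True if the page text contains any phrase indicating an intermediary."""
    return any(phrase in text for phrase in AGENCY_PHRASES)
```

Because only the first run of consecutive digits is taken, a number written with separators (for example an area code followed by a hyphen) would be captured only up to the separator; a real extractor would probably need to normalize such formats first.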
After the link information extraction module has extracted the information above, either it builds an index only for information judged not to be intermediary information, or it builds an index for everything and then deletes from the index database all information judged to be intermediary information. Indexing uses the general "inverted index" technique (which is well known in the art and is not described in detail here).
In this way, the index entries corresponding to intermediary information are filtered out of the index database. When a user submits a query to the query server, the server looks up the relevant pages in the index database, and the returned search results are largely free of intermediary information.
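The two indexing options (index only non-intermediary pages, or index everything and delete afterwards) can be illustrated with a toy inverted index. This is a minimal sketch under stated assumptions: the function names, the in-memory dictionary, and whitespace tokenization are all hypothetical simplifications (Chinese text would need word segmentation), and the patent does not prescribe any particular index implementation.

```python
from collections import defaultdict


def build_index(pages, is_intermediary):
    """Option 1: index only pages that are not judged intermediary.

    pages: dict mapping a document id to its text.
    is_intermediary: callable(text) -> bool, the (assumed) classifier.
    """
    index = defaultdict(set)
    for doc_id, text in pages.items():
        if is_intermediary(text):
            continue  # never enters the index
        for term in text.split():  # naive whitespace tokenization (assumption)
            index[term.lower()].add(doc_id)
    return index


def remove_from_index(index, doc_ids):
    """Option 2: delete already-indexed documents judged intermediary."""
    doomed = set(doc_ids)
    for term in list(index):
        index[term] -= doomed
        if not index[term]:
            del index[term]  # drop terms that no longer point anywhere


def search(index, term):
    """Return the sorted ids of documents containing the term."""
    return sorted(index.get(term.lower(), set()))
```

Either path leaves the same index contents; the first avoids indexing work for filtered pages, while the second allows pages to be reclassified after indexing, for example once a repetition count crosses its threshold.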
FIG. 3 is a schematic flowchart of the search engine's filtering of intermediary information according to an embodiment of the present invention. As shown in FIG. 3, the process comprises the following steps:
Step 100: extract intermediary feature information (such as telephone numbers and email addresses), specifically:
i. mobile phone number;
ii. landline number;
iii. PHS number;
iv. email address.
Step 200: count occurrences of identical extracted items. In this embodiment, a table is created in the search engine's back-end database; its first field is the telephone number or email address, and its second field is the number of occurrences. Each time an item is extracted, the table is queried first: if a record already exists, its occurrence count is incremented by 1; if not, a record is inserted with its occurrence count set to 1.
Steps 300 and 400: if a mobile, landline, or PHS number or an email address is repeated more than 10 times, either no index is built for any of the listings associated with it, or all of those listings are deleted from the search engine's index database.
Step 500: determine whether the mobile, landline, or PHS number is valid. Validity is judged against a table of numbering rules for each region of China; for example, Beijing telephone numbers have 8 digits. If a number does not conform to the rules, all listings associated with it are deleted from the search engine's index database.
Step 600: determine whether the extracted page content indicates an intermediary tendency. If the page content contains "本公司" (this company) or "大量房源" (a large number of listings), or contains several different addresses (for example, "housing available at Dongzhimen, Xizhimen, and Zhongguancun"), either no index is built for the item, or the item is deleted from the search engine's index database.
In each of steps 300 to 600 of this embodiment, information judged to be intermediary information need not be handled immediately. Instead, after all conditions have been evaluated, all information judged to be intermediary information can be deleted from the search engine's index database in one pass; alternatively, after all conditions have been evaluated, only the non-intermediary information is added to the index database for user queries, and no index is built for information judged to be intermediary information. The present invention is not limited to these approaches: any approach that filters the identified intermediary information out of the index database falls within the scope of the invention.
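The counting table of step 200, the threshold test of steps 300 and 400, and a number-validity rule for step 500 might look like the following sketch. It uses an in-memory SQLite database purely for illustration; the table and column names, the function names, and the specific Beijing rule (prefix 010, 8 further digits, first of them 5, 6, or 8, combining the two examples given in the description) are assumptions rather than the patent's actual implementation.

```python
import sqlite3

THRESHOLD = 10  # example threshold from the description


def open_counts():
    """Create the two-field table: contact value and occurrence count."""
    conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
    conn.execute(
        "CREATE TABLE contact_counts ("
        "contact TEXT PRIMARY KEY, occurrences INTEGER NOT NULL)"
    )
    return conn


def record_contact(conn, contact):
    """Query first; increment an existing record or insert a new one at 1."""
    row = conn.execute(
        "SELECT occurrences FROM contact_counts WHERE contact = ?", (contact,)
    ).fetchone()
    if row is None:
        conn.execute("INSERT INTO contact_counts VALUES (?, 1)", (contact,))
    else:
        conn.execute(
            "UPDATE contact_counts SET occurrences = occurrences + 1 "
            "WHERE contact = ?",
            (contact,),
        )


def flagged_contacts(conn, threshold=THRESHOLD):
    """Contacts repeated more than `threshold` times (treated as intermediaries)."""
    rows = conn.execute(
        "SELECT contact FROM contact_counts WHERE occurrences > ? ORDER BY contact",
        (threshold,),
    )
    return [row[0] for row in rows]


def is_plausible_beijing_number(number):
    """Assumed rule: '010' + 8 digits, where the fourth digit overall is 5, 6, or 8."""
    return (
        number.isdigit()
        and len(number) == 11
        and number.startswith("010")
        and number[3] in "568"
    )
```

Listings whose contact appears in `flagged_contacts`, or whose number fails the validity check, would then either be skipped at indexing time or deleted from the index afterwards, matching the deferred handling described above.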
In addition, the steps of this embodiment shown in FIG. 3 are not restricted to any particular order, and the intermediary feature information is not limited to the telephone numbers and email addresses given in the embodiment; it may also be other information, such as price.
A person of ordinary skill in the art will understand that all or some of the steps of the method of the above embodiment can be carried out by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.
After the above processing, the search engine's index database records can be provided to the query server for user queries. Because the index database then contains essentially no index entries for intermediary information, the proportion of intermediary information in search results can be reduced from 90% before processing to below 10%.
As described above, the search engine of the embodiments of the present invention, and its method for filtering intermediary information, can filter out some or all of the intermediary information in search results, effectively preventing intermediary information from interfering with the user, improving the usability of search results, and making searching more convenient for the user.
The above specific embodiments are intended only to illustrate the present invention, not to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for a search engine to filter intermediary information, characterized in that the method comprises:
crawling web pages from the Internet and storing them in a web page database;
performing link information extraction, in which the page title and page content are extracted from the web page database and intermediary feature information is further extracted from the page content;
analyzing the extracted intermediary feature information and, if a set intermediary information judgment condition is satisfied, judging the information corresponding to the intermediary feature information to be intermediary information; and
filtering the intermediary information out of the search results.
2. The method according to claim 1, characterized in that:
the intermediary feature information is extracted by pattern matching.
3. The method according to claim 1, characterized in that:
the intermediary feature information is a telephone number and/or email information;
analyzing the extracted intermediary feature information means counting the number of repetitions of the same telephone number and/or email in web pages within a predetermined time; and
the set intermediary information judgment condition is that the number of repetitions of the same telephone number and/or email information exceeds the respective corresponding threshold.
4. The method according to claim 1, characterized in that:
the intermediary feature information is a telephone number; and
the set intermediary information judgment condition is that the telephone number is an incorrect telephone number.
5. The method according to claim 1, characterized in that:
the intermediary feature information is price information; and
the set intermediary information judgment condition is that the price is lower than a set threshold.
6. The method according to any one of claims 1 to 5, characterized in that filtering the intermediary information out of the search results means:
deleting the intermediary information from the search engine's index database, or building an index only for information judged not to be intermediary information, so as to filter the intermediary information out of the index database; and
the search engine searching on the basis of the index database from which the intermediary information has been filtered out, to obtain the search results.
7. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
analyzing the extracted page content and, if the page content contains information indicating an intermediary tendency, judging the information corresponding to that intermediary-tendency information to be intermediary information.
8. A method for a search engine to filter intermediary information, characterized by:
crawling web pages from the Internet and storing them in a web page database;
performing link information extraction, in which the page title and page content are extracted from the web page database and the extracted page content is analyzed and, if the page content contains information indicating an intermediary tendency, the information corresponding to that intermediary-tendency information is judged to be intermediary information; and
filtering the intermediary information out of the search results.
9. The method according to claim 8, characterized in that the method further comprises:
further extracting intermediary feature information from the page content; and
analyzing the extracted intermediary feature information and, if a set intermediary information judgment condition is satisfied, judging the web page information corresponding to the intermediary feature information to be intermediary information.
10. The method according to claim 9, characterized in that:
the intermediary feature information is extracted by pattern matching.
11. The method according to claim 9, characterized in that:
the intermediary feature information is a telephone number and/or email information;
analyzing the extracted intermediary feature information means counting the number of repetitions of the same telephone number and/or email in web pages within a predetermined time; and
the set intermediary information judgment condition is that the number of repetitions of the same telephone number and/or email information exceeds the respective corresponding threshold.
12. The method according to claim 9, characterized in that:
the intermediary feature information is a telephone number; and
the set intermediary information judgment condition is that the telephone number is an incorrect telephone number.
13. The method according to claim 9, characterized in that:
the intermediary feature information is price information; and
the set intermediary information judgment condition is that the price is lower than a set threshold.
14. The method according to any one of claims 8 to 13, characterized in that filtering the intermediary information out of the search results means:
deleting the intermediary information from the search engine's index database, or building an index only for information judged not to be intermediary information, so as to filter the intermediary information out of the index database; and
the search engine searching on the basis of the index database from which the intermediary information has been filtered out, to obtain the search results.
15. A search engine comprising a web spider and a query server, characterized in that the search engine further comprises a link information extraction module;
the link information extraction module is configured to extract the page title, page content, and intermediary feature information from a web page database, and to judge, by means of a set intermediary information judgment condition, whether the information corresponding to the intermediary feature information is intermediary information; and
the search engine filters the index entries corresponding to the intermediary information out of an index database.
16. The search engine according to claim 15, characterized in that:
the link information extraction module is further configured to analyze the page content and to judge content containing information indicating an intermediary tendency to be intermediary information.
17. The search engine according to claim 15, characterized in that:
the link information extraction module extracts the intermediary feature information by pattern matching.
18. The search engine according to claim 15, characterized in that:
the intermediary feature information is a telephone number and/or email information; and
the set intermediary information judgment condition is that the number of repetitions of the same telephone number and/or email information within a predetermined time, as counted by the link information extraction module, exceeds the respective corresponding threshold.
19. The search engine according to claim 15, characterized in that:
the intermediary feature information is a telephone number; and
the set intermediary information judgment condition is that the telephone number is an incorrect telephone number.
20. The search engine according to claim 15, characterized in that:
the intermediary feature information is price information; and
the set intermediary information judgment condition is that the price is lower than a set threshold.
21. The search engine according to any one of claims 15 to 20, characterized in that the search engine filtering the index entries corresponding to the intermediary information out of its index database means:
the link information extraction module builds an index only for information judged not to be intermediary information; or
after the link information extraction module has extracted the information and built the index, information judged to be intermediary information is deleted from the index database.
22. A search engine comprising a web spider and a query server, characterized in that the search engine further comprises a link information extraction module;
the link information extraction module is configured to extract the page title and page content from a web page database, to analyze the page content, and to judge content containing information indicating an intermediary tendency to be intermediary information; and
the search engine filters the index entries corresponding to the intermediary information out of an index database.
23. The search engine according to claim 22, characterized in that:
the link information extraction module is further configured to extract intermediary feature information from the web page database, and to judge, by means of a set intermediary information judgment condition, whether the information corresponding to the intermediary feature information is intermediary information.
24. The search engine according to claim 22, characterized in that:
the link information extraction module extracts the intermediary feature information by pattern matching.
25. The search engine according to claim 22, characterized in that:
the intermediary feature information is a telephone number and/or email information; and
the set intermediary information judgment condition is that the number of repetitions of the same telephone number and/or email information within a predetermined time, as counted by the link information extraction module, exceeds the respective corresponding threshold.
26. The search engine according to claim 22, characterized in that:
the intermediary feature information is a telephone number; and
the set intermediary information judgment condition is that the telephone number is an incorrect telephone number.
27. The search engine according to claim 22, characterized in that:
the intermediary feature information is price information; and
the set intermediary information judgment condition is that the price is lower than a set threshold.
28. The search engine according to any one of claims 22 to 27, characterized in that the search engine filtering the index entries corresponding to the intermediary information out of its index database means:
the link information extraction module builds an index only for information judged not to be intermediary information; or
after the link information extraction module has extracted the information and built the index, information judged to be intermediary information is deleted from the index database.
PCT/CN2007/001474 2007-04-29 2007-04-29 Search engine and method for filtering agency information WO2008131597A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2007/001474 WO2008131597A1 (en) 2007-04-29 2007-04-29 Search engine and method for filtering agency information
CN200780052784A CN101849232A (en) 2007-04-29 2007-04-29 Search engine and method for filtering agency information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2007/001474 WO2008131597A1 (en) 2007-04-29 2007-04-29 Search engine and method for filtering agency information

Publications (1)

Publication Number Publication Date
WO2008131597A1 true WO2008131597A1 (en) 2008-11-06

Family

ID=39925170

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2007/001474 WO2008131597A1 (en) 2007-04-29 2007-04-29 Search engine and method for filtering agency information

Country Status (2)

Country Link
CN (1) CN101849232A (en)
WO (1) WO2008131597A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062328A (en) * 2016-11-08 2018-05-22 北京国双科技有限公司 The method and apparatus for obtaining website nature search rank

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1536483A (en) * 2003-04-04 2004-10-13 陈文中 Method for extracting and processing network information and its system
US20060136411A1 (en) * 2004-12-21 2006-06-22 Microsoft Corporation Ranking search results using feature extraction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7962510B2 (en) * 2005-02-11 2011-06-14 Microsoft Corporation Using content analysis to detect spam web pages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG MAOYUAN AND ZOU CHUNYAN: "Research for Web Page Filter with Natural Language Processing", COMPUTER & DIGITAL ENGINEERING, vol. 31, no. 3, March 2003 (2003-03-01), pages 11, 24 - 28 *

Also Published As

Publication number Publication date
CN101849232A (en) 2010-09-29

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200780052784.0

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07721047

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07721047

Country of ref document: EP

Kind code of ref document: A1