WO2017000659A1 - Enriched uniform resource locator (url) identification method and apparatus - Google Patents

Enriched uniform resource locator (URL) identification method and apparatus

Info

Publication number
WO2017000659A1
Authority
WO
WIPO (PCT)
Prior art keywords
url
anchor
similarity
enriched
text
Prior art date
Application number
PCT/CN2016/081003
Other languages
French (fr)
Chinese (zh)
Inventor
王智广
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京奇虎科技有限公司 and 奇智软件(北京)有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Definitions

  • the present invention relates to the technical field of computer processing, and in particular, to a method for identifying an enriched URL and an apparatus for identifying an enriched URL.
  • the search engine usually downloads web pages from the network through a web crawler.
  • the web crawler starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs found on those pages. While crawling, it continuously extracts new URLs from the current page and places them in a queue, until a stopping condition of the system is met.
  • Web crawlers can find a large number of newly generated URLs in the network every day.
  • however, the volume of URLs in the network is massive, while the number of URLs that the search engine can actually crawl each day is limited. The URLs already discovered therefore need to be sorted before the web crawler actually starts fetching pages, so that some URLs are fetched preferentially.
  • at present, newly discovered URLs are sorted mainly based on feedback from the web pages already crawled. If the quality of a crawled web page is high, URLs similar to the URL of that page are also considered to be of high quality.
  • the present invention has been made in order to provide an enriched URL identification method and a corresponding enriched URL identification apparatus that overcome the above problems, or at least partially solve or alleviate them.
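  • a method for identifying an enriched URL, including the steps of:
  • extracting one or more URLs;
  • selecting candidate URLs from the one or more URLs, each candidate URL being associated with a respective anchor text (anchor);
  • calculating the similarity between the anchor texts;
  • identifying an enriched URL from the candidate URLs based on the similarity.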
  • an apparatus for identifying an enriched URL including:
  • a URL extraction module adapted to extract one or more URLs
  • a candidate URL selection module configured to select candidate URLs from the one or more URLs; each candidate URL is associated with each anchor text anchor;
  • a similarity calculation module configured to calculate a similarity between the anchor text anchors
  • An enriched URL identification module adapted to identify an enriched URL from the candidate URLs based on the similarity.
  • a computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform the method of identifying an enriched URL described above.
  • a computer readable medium wherein the computer program described above is stored.
  • candidate URLs are selected from the extracted URLs, and enriched URLs are identified according to the similarity of the anchor texts associated with the candidate URLs. This prevents the search engine from crawling junk and duplicate web pages, which greatly reduces the bandwidth wasted during crawling; because the amount of crawling is reduced, the burden on the search engine is reduced as well.
  • at the same time, the search engine can use the freed capacity to crawl other high-quality web pages, which improves the coverage and timeliness of the web pages it indexes.
  • FIG. 1 is a flow chart showing the steps of an embodiment of a method for identifying an enriched URL according to an embodiment of the present invention
  • FIG. 2 is a block diagram showing the structure of an embodiment of an apparatus for identifying an enriched URL according to an embodiment of the present invention
  • Figure 3 schematically shows a block diagram of a computing device for performing the method according to the invention
  • Fig. 4 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
  • Referring to FIG. 1, a flow chart of the steps of an embodiment of a method for identifying an enriched URL according to an embodiment of the present invention is shown. The method may specifically include the following steps:
  • Step 101 Extract one or more URLs
  • the search engine may use a web crawler (also known as a web spider) to fetch the URLs of web pages from the network in advance and store them in a database. When enriched URLs are to be identified, one or more URLs can then be extracted from the database.
  • the web crawler generally starts parsing from the URLs of one or more initial web pages and obtains the URLs found on those pages. While crawling, it continuously extracts new URLs from the current page and places them in a queue, until a stopping condition of the system is met.
  • the focused crawler (a type of web crawler) has a more complex workflow: it usually filters out links that are unrelated to the topic, retains useful links, and places them in a queue of URLs waiting to be crawled. It then selects the URL of the next web page to crawl from the queue according to a certain search strategy, and repeats the process until a stopping condition is reached.
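The queue-driven crawl loop described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the in-memory `LINK_GRAPH` stands in for downloading and parsing real pages, and the `max_pages` stopping condition, the `example.com` URLs, and the function name `crawl` are all assumptions made for the example.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> URLs linked from that page.
# A real crawler would download and parse each page instead.
LINK_GRAPH = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seeds, max_pages):
    """Breadth-first URL discovery starting from the seed URLs.

    Stops when the queue is empty or max_pages pages have been visited
    (the assumed stopping condition).
    """
    queue = deque(seeds)
    seen = set(seeds)      # URLs already queued, to avoid re-queuing them
    visited = []           # URLs in the order they were crawled
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in LINK_GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

A focused crawler would differ mainly in the loop body: it would drop links judged unrelated to the topic before queuing them, and pick the next URL by a priority rather than in first-in, first-out order.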
  • for a search engine, it is generally sufficient to index, for each question, one web page that has been answered and whose answer is satisfactory; the other pages for the same question can be regarded as duplicates.
  • Step 102 Select a candidate URL from the one or more URLs
  • some or all URLs may be selected as candidate URLs according to a certain policy from the extracted URLs.
  • step 102 may include the following sub-steps:
  • Sub-step S11: determining whether the URL matches a preset pattern; if so, performing sub-step S12;
  • Sub-step S12: selecting the URL as a candidate URL.
  • since a website generally configures similar URLs for the same type of service (such as question and answer), URLs of the same website may be selected as candidate URLs by means of a common pattern.
  • the pattern can describe URLs with the same or a similar style.
  • (\d+) is a wildcard.
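Pattern-based candidate selection can be sketched with a regular expression. The pattern below is an assumption modelled on the question-page URL style mentioned in this document (a numeric question ID followed by `.html`); the host `qa.example.com` and the names `QUESTION_PATTERN` and `select_candidates` are hypothetical.

```python
import re

# Assumed pattern: question pages on one site differ only in a numeric ID,
# which the (\d+) wildcard captures.
QUESTION_PATTERN = re.compile(r"http://qa\.example\.com/question/(\d+)\.html")


def select_candidates(urls, pattern=QUESTION_PATTERN):
    """Keep only the URLs that match the pattern in full."""
    return [url for url in urls if pattern.fullmatch(url)]
```

`fullmatch` is used rather than `search` so that a URL with extra path segments or a trailing query string is not mistaken for a question page.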
  • each candidate URL is associated with a respective anchor text (anchor); that is, URLs and anchor texts generally correspond one to one.
  • Anchor text, also known as an anchor text link, is a form of link.
  • similar to a hyperlink, anchor text turns a keyword into a link that points to a web page. This form of link is called anchor text.
  • on the one hand, anchor text can serve as an evaluation of the content of the web page on which it is located, i.e., on-site anchor text.
  • the links added to a web page generally have some relationship to the content of the page itself.
  • for example, a clothing industry website will add links to peer websites or to well-known clothing companies.
  • on the other hand, anchor text can serve as an evaluation of the web page it points to, i.e., off-site anchor text.
  • the anchor text can describe the content of the page it points to. For example, a personal website adds a link to "ABC" with the anchor text "search engine". In this way, the anchor text itself reveals that "ABC" is a search engine.
  • For the URLs crawled from the zhidao.***.com site, examples of their anchor texts may be as shown in the following table:
  • XXX is the name of a TV series.
  • Step 103 Calculate a similarity between the anchor text anchors
  • Similarity can refer to the content relevance between anchor text anchors.
  • step 103 may include the following sub-steps:
  • Sub-step S21 performing vectorization processing on the anchor text anchor
  • the similarity can be calculated based on the vector space model, which assumes that words are independent of one another and represents a text as a vector. This simplifies the complex relationships between the keywords in the text; representing a document with a very simple vector makes the model computable.
  • the sub-step S21 may further include the following sub-steps:
  • Sub-step S211: performing word segmentation on the anchor text to obtain text segments
  • the word segmentation process can be performed by one or more of the following methods:
  • Word segmentation based on string matching: the Chinese character string to be analyzed is matched against the entries of a preset machine dictionary according to a certain strategy. If a string is found in the dictionary, the match succeeds (a word is identified).
  • Word segmentation based on feature scanning or mark segmentation: words with obvious features are identified and split off first in the string to be analyzed. Using these words as breakpoints, the original string can be divided into smaller strings, which are then segmented mechanically, reducing the matching error rate. Alternatively, word segmentation is combined with part-of-speech tagging: rich part-of-speech information assists the segmentation decisions, and during tagging the segmentation result is in turn checked and adjusted, improving segmentation accuracy.
  • Word segmentation based on understanding: words are identified by having the computer simulate human understanding of the sentence.
  • the basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and to use syntactic and semantic information to resolve ambiguity. Such a system usually consists of three parts: a word segmentation subsystem, a syntactic and semantic subsystem, and a general control part. Under the coordination of the general control part, the word segmentation subsystem obtains syntactic and semantic information about words, sentences, and so on to resolve segmentation ambiguity; that is, it simulates the process by which a person understands the sentence.
  • Word segmentation based on statistics: the frequency or probability with which characters co-occur adjacently in Chinese text reflects how likely they are to form a word. The frequency of each combination of adjacently co-occurring characters in a corpus can therefore be counted and their mutual information calculated, for example from the adjacent co-occurrence probability of two Chinese characters X and Y. The mutual information reflects how tightly the characters are bound together; when it exceeds a certain threshold, the character group may be considered to constitute a word.
  • the above word segmentation methods are only examples.
  • a person skilled in the art may adopt other word segmentation methods according to actual needs, which is not limited by the embodiment of the present invention.
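As one concrete instance of the string-matching approach, the sketch below uses forward maximum matching. The text only says that matching follows "a certain strategy", so the choice of forward maximum matching, the `max_len` parameter, and the sample dictionary are assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by greedily taking the longest dictionary word at each position.

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

With the hypothetical dictionary `{"搜索", "引擎", "搜索引擎", "优化"}`, the string `"搜索引擎优化"` is split into `搜索引擎` and `优化`, because the four-character entry is preferred over the two shorter ones.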
  • Sub-step S212: filtering out invalid words from the text segments
  • a stop word list (a list of invalid words) may be used to remove words, symbols, punctuation, and garbled characters that appear frequently but carry no meaning for the text content.
  • the invalid words include one or more of the following:
  • adverbs, auxiliary words, symbols, punctuation, and garbled characters.
  • the process of eliminating stop words with a stop word list is roughly as follows: each text segment is checked against the stop word list and, if it appears there, is removed from the text segments.
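The stop-word filtering step amounts to a membership test against a list. The following is a minimal sketch; the contents of `STOP_WORDS` are illustrative assumptions, not a list taken from the patent.

```python
# Illustrative stop word list: particles, punctuation, and common function words.
STOP_WORDS = {"的", "了", "吗", "?", "?", ",", "the", "a", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list, keeping the order."""
    return [token for token in tokens if token not in stop_words]
```

Keeping the stop words in a set makes each lookup constant time, which matters when many anchor texts are processed.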
  • several keywords may be determined according to the frequency of the text segments.
  • for example, the keywords can be determined by term frequency (TF).
  • Sub-step S214: configuring weights for the keywords
  • weights are configured so that each keyword can contribute differently to the features of the text.
  • the weight of a keyword may be determined by its inverse document frequency (IDF).
  • Sub-step S215: setting the weights of the keywords as the components of the vector of the anchor text.
  • the anchor text is thus converted into an N-dimensional vector whose components are the weights of the keywords, so that the similarity calculation can be performed.
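The keyword selection and weighting sub-steps can be sketched as below. The text states only that keywords are chosen by TF and weighted by IDF; the smoothed IDF formula `log((1 + N) / (1 + df)) + 1`, the fixed `vocabulary` argument, and all function names are assumptions introduced for the example.

```python
import math
from collections import Counter


def top_keywords(tokens, k):
    """Select the k most frequent tokens (term-frequency based keyword selection)."""
    return [token for token, _ in Counter(tokens).most_common(k)]


def idf(term, documents):
    """Smoothed inverse document frequency (assumed formula, see lead-in)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log((1 + len(documents)) / (1 + df)) + 1


def vectorize(keywords, vocabulary, documents):
    """Build an N-dimensional vector over the vocabulary.

    A component is the keyword's IDF weight if the anchor text contains
    that keyword, and 0.0 otherwise.
    """
    present = set(keywords)
    return [idf(term, documents) if term in present else 0.0 for term in vocabulary]
```

All anchor texts must be vectorized over the same `vocabulary`, in the same order, so that component i refers to the same keyword in every vector.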
  • Sub-step S22: calculating the similarity between the vectorized anchor texts.
  • the cosine value between the vectors of the anchor texts (physically, the cosine of the spatial angle between the two vectors) may be calculated as the similarity between the anchor texts.
  • for example, the cosine of the angle between the vector (a_1, a_2, a_3, ..., a_n) of anchor text A and the vector (b_1, b_2, b_3, ..., b_n) of anchor text B is used as the similarity between anchor text A and anchor text B:
  • $$\mathrm{sim}(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
  • where sim(A, B) represents the similarity between anchor text A and anchor text B, and sqrt() denotes the square root.
  • in the example, the similarity between anchor A and anchor B calculated according to the above formula is 0.86.
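The cosine similarity between two anchor text vectors can be computed directly from its definition (dot product divided by the product of the vector lengths). In this sketch, returning 0.0 for a zero vector is an assumption made to avoid division by zero; the text does not address that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)
```

Because the keyword weights are non-negative, the result lies between 0 (no shared keywords) and 1 (identical direction).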
  • Step 104: identifying an enriched URL from the candidate URLs according to the similarity.
  • when the similarity is greater than a preset similarity threshold, the candidate URL is confirmed to be an enriched URL; that is, URLs whose similarity exceeds a certain threshold can be regarded as URLs with the same or similar content (i.e., enriched URLs).
  • in the example, the anchor texts all concern the music of season 5, episode 14 of XXX, so the corresponding URLs can be regarded as enriched URLs.
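The threshold test can be sketched as a pairwise comparison of the candidates' anchor text vectors. Comparing every pair of candidates is an assumption: the text states only that a candidate is enriched when the similarity exceeds a preset threshold. The function names and the sample vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def identify_enriched(urls, vectors, threshold):
    """Return the URLs whose anchor text is similar to that of another candidate.

    urls[i] is the candidate URL whose anchor text vector is vectors[i].
    """
    enriched = set()
    for i in range(len(urls)):
        for j in range(i + 1, len(urls)):
            if cosine(vectors[i], vectors[j]) > threshold:
                enriched.update((urls[i], urls[j]))
    return enriched
```

The pairwise loop is quadratic in the number of candidates, which is acceptable for a sketch; a production system would likely need clustering or hashing to scale.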
  • the method may further include the following steps:
  • Step 105 Select a target URL from the enriched URL.
  • some or all of the enriched URLs may be selected as target URLs according to a certain policy.
  • step 105 may include the following sub-steps:
  • Sub-step S31 acquiring the degree of attention of the enriched URL
  • Sub-step S32: selecting a target URL from the enriched URLs based on the degree of attention.
  • the degree of attention may be the degree of attention that users pay to the URL.
  • for example, it may be the number of recommendations (e.g., "awesome", "like", etc.) received by the web page corresponding to the URL; the more recommendations, the higher the degree of attention.
  • an enriched URL with a high degree of attention may be selected, for example one whose degree of attention is higher than a preset attention threshold.
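Selecting target URLs by degree of attention reduces to a filter on a per-URL score. In this sketch the score is the recommendation count; the dictionary input, the descending sort, and the name `select_targets` are assumptions.

```python
def select_targets(recommendations, attention_threshold):
    """Pick enriched URLs whose recommendation count exceeds the threshold.

    recommendations maps each enriched URL to its number of recommendations.
    The result is ordered from most to least recommended.
    """
    selected = [
        (count, url)
        for url, count in recommendations.items()
        if count > attention_threshold
    ]
    selected.sort(key=lambda item: item[0], reverse=True)
    return [url for _, url in selected]
```

Within one group of enriched URLs this keeps only the pages users found useful, so the crawler fetches those instead of every near-duplicate.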
  • Step 106: crawling the web page corresponding to the target URL
  • the target URL is placed in the queue of URLs to be crawled.
  • Step 107 Generate an index file by using the webpage.
  • the search engine search process is generally divided into two parts, one is the front-end user request process, and the other is the back-end production data process.
  • the front-end user request process is roughly as follows:
  • Receiving a request: receiving a search keyword entered by a user in the search engine;
  • Query word analysis: performing word segmentation on the search keyword;
  • Sorting: sorting the related web page information according to dimensions such as content relevance and timeliness;
  • Web crawling: using web crawler technology to fetch various types of web pages and save them.
  • Index production: analyzing the network information that has been fetched and saved, for example performing word segmentation on the page title and page text, and creating an index file (such as an inverted index) from the segmentation result for use by the front-end user request process.
  • the web page record may be written into an index file (such as an inverted index) for use in search engine queries.
  • the inverted index arises from practical applications in which records need to be found according to the value of an attribute.
  • Each item in the index table includes an attribute value and the addresses of the records having that attribute value. Because the attribute value is not determined by the record, but rather the position of the record is determined by the attribute value, it is called an inverted index.
  • a file with an inverted index is called an inverted index file, or simply an inverted file.
  • in an inverted file, the index object is a word in a document or a collection of documents (such as web pages), and the index stores the locations of that word within the document or collection of documents. It is one of the most commonly used indexing mechanisms for documents or document collections.
  • T1: “it is what it is”
  • T3: “it is a banana”
  • "banana": {(2, 3)} means that “banana” appears in the text of the third web page (T3), at the fourth word of that page (address 3).
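An inverted index with word positions, as in the "banana" example, can be built in a few lines. The texts for T1 and T3 are taken from the example above; the text used for T2 ("what is it") is an assumption, since T2 is not given here. Document numbers and word positions are counted from 0, matching the (2, 3) entry.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of (document number, word position) pairs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for position, word in enumerate(text.split()):
            index[word].add((doc_id, position))
    return index


documents = [
    "it is what it is",  # T1
    "what is it",        # T2 (assumed text, not given in the source)
    "it is a banana",    # T3
]
index = build_inverted_index(documents)
```

Looking up `index["banana"]` yields the single entry (2, 3): third document, fourth word. A query for several words can intersect the document numbers of each word's entries to find pages containing all of them.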
  • Referring to FIG. 2, a block diagram of an embodiment of an apparatus for identifying an enriched URL according to an embodiment of the present invention is shown. Specifically, the following modules may be included:
  • the URL extraction module 201 is adapted to extract one or more URLs
  • the candidate URL selection module 202 is adapted to select candidate URLs from the one or more URLs; each candidate URL is associated with each anchor text anchor;
  • the similarity calculation module 203 is adapted to calculate a similarity between the anchor text anchors
  • the enriched URL identification module 204 is adapted to identify the enriched URL from the candidate URLs based on the similarity.
  • the candidate URL selection module 202 may further be adapted to:
  • the similarity calculation module 203 is further adapted to:
  • the similarity calculation module 203 is further adapted to:
  • the weight of the keyword is set to the component of the anchor text anchor.
  • the similarity calculation module 203 is further adapted to:
  • the invalid word includes one or more of the following:
  • adverbs, auxiliary words, symbols, punctuation, and garbled characters.
  • the similarity calculation module 203 is further adapted to:
  • a cosine value between components of the anchor text anchor is calculated as the similarity between the anchor text anchors.
  • the enriched URL identification module 204 may further be adapted to:
  • confirm that the candidate URL is an enriched URL.
  • the device may further comprise the following modules:
  • a target URL selection module adapted to select a target URL from the enriched URL.
  • the target URL selection module may further be adapted to:
  • the target URL is selected from the enriched URLs based on the degree of attention.
  • the device may further comprise the following modules:
  • a webpage crawling module configured to capture a webpage corresponding to the target URL
  • An index file generating module is adapted to generate an index file by using the webpage.
  • for the apparatus embodiment, the description is relatively brief; for the relevant parts, refer to the description of the method embodiment.
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • a microprocessor or digital signal processor may be used in practice to implement some or all of the functionality of some or all of the components of the enriched URL identification apparatus in accordance with embodiments of the present invention.
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals.
  • Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • Figure 3 illustrates a computing device, such as an application server, that can implement the identification of an enriched URL in accordance with the present invention.
  • the computing device conventionally includes a processor 310 and a computer program product or computer readable medium in the form of a memory 320.
  • the memory 320 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • the memory 320 has a memory space 330 for program code 331 for performing any of the method steps described above.
  • the storage space 330 for program code may include individual program codes 331 for respectively implementing the various steps of the above methods.
  • the program code can be read from or written to one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 4.
  • the storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 320 in the computing device of FIG. 3.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit includes computer readable code 331', i.e., code readable by a processor such as 310, that when executed by a computing device causes the computing device to perform each step of the methods described above.

Abstract

Disclosed are an enriched uniform resource locator (URL) identification method and apparatus. The method comprises: extracting one or more URLs; selecting candidate URLs from the one or more URLs, each candidate URL being associated with an anchor text; calculating the similarity between the anchor texts; and identifying an enriched URL from the candidate URLs according to the similarity. The embodiments of the present invention can prevent a search engine from crawling spam and duplicate web pages, thereby greatly reducing the bandwidth wasted during crawling and, because the amount of crawling is reduced, lowering the burden on the search engine. At the same time, the search engine can additionally crawl other good-quality web pages, thereby improving the coverage and the timeliness of the web pages it collects.

Description

Method and device for identifying enriched URL
Technical field
The present invention relates to the technical field of computer processing, and in particular to a method for identifying an enriched URL and an apparatus for identifying an enriched URL.
Background
With the rapid development of the network, the network has become a carrier of a large amount of information. In order to extract and use this information effectively, a search engine usually downloads web pages from the network through a web crawler.
The web crawler starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs found on those pages. While crawling, it continuously extracts new URLs from the current page and places them in a queue, until a stopping condition of the system is met.
Web crawlers can discover a large number of newly generated URLs in the network every day. However, the volume of URLs in the network is massive, while the number of URLs that the search engine can actually crawl each day is limited. The URLs already discovered therefore need to be sorted before the web crawler actually starts fetching pages, so that some URLs are fetched preferentially.
At present, newly discovered URLs are sorted mainly based on feedback from the web pages already crawled. If the quality of a crawled web page is high, URLs similar to the URL of that page are also considered to be of high quality.
However, this scheme suffers from enrichment. Each URL has its own characteristics, and the quality of web pages with similar URLs varies greatly. There may be junk and duplicate web pages, and crawling them wastes a great deal of bandwidth and increases the burden on the search engine.
Summary of the invention
In view of the above problems, the present invention has been made in order to provide an enriched URL identification method and a corresponding enriched URL identification apparatus that overcome the above problems, or at least partially solve or alleviate them.
According to one aspect of the present invention, a method for identifying an enriched URL is provided, including the steps of:
extracting one or more URLs;
selecting candidate URLs from the one or more URLs, each candidate URL being associated with a respective anchor text (anchor);
calculating the similarity between the anchor texts;
identifying an enriched URL from the candidate URLs based on the similarity.
According to another aspect of the present invention, an apparatus for identifying an enriched URL is provided, including:
a URL extraction module, adapted to extract one or more URLs;
a candidate URL selection module, adapted to select candidate URLs from the one or more URLs, each candidate URL being associated with a respective anchor text (anchor);
a similarity calculation module, adapted to calculate the similarity between the anchor texts;
an enriched URL identification module, adapted to identify an enriched URL from the candidate URLs based on the similarity.
According to still another aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to perform the method of identifying an enriched URL described above.
According to yet another aspect of the present invention, a computer readable medium is provided, in which the computer program described above is stored.
The beneficial effects of the invention are as follows:
In the embodiments of the present invention, candidate URLs are selected from the extracted URLs, and enriched URLs are identified according to the similarity of the anchor texts associated with the candidate URLs. This prevents the search engine from crawling junk and duplicate web pages, which greatly reduces the bandwidth wasted during crawling; because the amount of crawling is reduced, the burden on the search engine is reduced as well. At the same time, the search engine can additionally crawl other high-quality web pages, which improves the coverage and timeliness of the web pages it indexes.
The above description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features, and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
Description of the drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be construed as limiting the invention. Throughout the drawings, the same reference numerals are used to refer to the same parts. In the drawings:
FIG. 1 schematically shows a flow chart of the steps of an embodiment of a method for identifying an enriched URL according to an embodiment of the present invention;
FIG. 2 schematically shows a block diagram of the structure of an embodiment of an apparatus for identifying an enriched URL according to an embodiment of the present invention;
FIG. 3 schematically shows a block diagram of a computing device for performing the method according to the invention; and
FIG. 4 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
Specific Embodiments
The present invention is further described below with reference to the drawings and specific embodiments.
Referring to FIG. 1, a flow chart of the steps of an embodiment of a method for identifying an enriched URL according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 101: extract one or more URLs;
In practical applications, websites of all kinds may create a large number of web pages every day, and each web page has a URL.
In the embodiment of the present invention, a search engine may use a web crawler (also known as a web spider) to crawl the URLs of web pages from the network in advance and store them in a database; when identifying enriched URLs, one or more URLs may then be extracted from the database.
A web crawler generally starts from the URLs of one or more initial web pages and parses them to obtain the URLs on those pages. While crawling, it continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.
In particular, the workflow of a focused crawler (a type of web crawler) is more complex: it usually filters out links unrelated to the topic, keeps the useful links, and puts them into the queue of URLs waiting to be crawled. It then selects the URL of the next web page to crawl from the queue according to a certain search strategy, and repeats the above process until a stop condition is reached.
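The queue-driven loop of a focused crawler can be sketched as follows. This is a minimal illustration, not the crawler of the embodiment: the function name `focused_crawl`, the callables `get_links` and `is_relevant`, the `max_pages` stop condition, and the in-memory link graph are all assumptions introduced here, and first-in-first-out order stands in for the unspecified search strategy.

```python
from collections import deque

def focused_crawl(seeds, get_links, is_relevant, max_pages=100):
    """Crawl from the seed URLs, queuing only links judged relevant to the topic."""
    frontier = deque(seeds)   # URLs waiting to be crawled
    seen = set(seeds)         # URLs already queued, so none is queued twice
    crawled = []              # URLs in the order they were visited
    while frontier and len(crawled) < max_pages:  # assumed stop condition
        url = frontier.popleft()                  # assumed strategy: first in, first out
        crawled.append(url)
        for link in get_links(url):
            # Filter out links unrelated to the topic; keep the useful ones.
            if link not in seen and is_relevant(link):
                seen.add(link)
                frontier.append(link)
    return crawled

# Hypothetical link graph standing in for real web pages.
LINKS = {
    "a": ["b", "ad1", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
order = focused_crawl(
    ["a"],
    get_links=lambda url: LINKS.get(url, []),
    is_relevant=lambda url: not url.startswith("ad"),
)
print(order)
```

A real crawler would replace `get_links` with page download and link extraction, and would typically use a priority queue if the search strategy ranks URLs.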
To help those skilled in the art better understand the embodiments of the present application, a question-and-answer website is used as an example in this specification.
On a question-and-answer website (such as zhidao.baidu.com), users may generate a large number of questions every day. Some of these questions are answered by other users and some are not, and many of them may be duplicates.
That is, a large number of questions are the same or similar. For a search engine, it is generally sufficient to index, for a given question, a web page that has been answered and whose answer is satisfactory; the others can be regarded as duplicates.
Examples of URLs crawled from the question-and-answer site zhidao.***.com are as follows:
http://zhidao.***.com/question/433737807751460604.html
http://zhidao.***.com/question/1605209362191413347.html
http://zhidao.***.com/question/618238863630856372.html
http://zhidao.***.com/question/625161396233610844.html
http://zhidao.***.com/question/1367620128259860259.html
http://zhidao.***.com/question/2139209187911446788.html
http://zhidao.***.com/question/584108667629594845.html
Here, "***" stands for the domain name of a website.
Step 102: select candidate URLs from the one or more URLs;
In a specific implementation, some or all of the extracted URLs may be selected as candidate URLs according to a certain policy.
In an optional embodiment of the present invention, step 102 may include the following sub-steps:
Sub-step S11: determine whether the URL matches a pattern; if so, perform sub-step S12;
Sub-step S12: select the URL as a candidate URL.
In the embodiment of the present invention, since a website generally assigns similar URLs to the same type of service (such as question and answer), URLs of the same website can be selected as candidate URLs by means of a shared pattern.
Here, a pattern describes URLs whose form is the same or similar.
For example, the URLs crawled from the question-and-answer site zhidao.***.com above share the same pattern:
http://zhidao.***.com/question/(\d+).html;
where (\d+) is a wildcard that matches one or more digits.
The URLs crawled from the question-and-answer site zhidao.***.com above can therefore be regarded as candidate URLs.
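Sub-steps S11 and S12 amount to a regular-expression filter. The sketch below assumes the hypothetical host name `zhidao.example.com` in place of the elided "***" domain; the function name `select_candidates` and the two non-matching URLs are likewise illustrative.

```python
import re

# "zhidao.example.com" is a hypothetical stand-in for the elided "***" domain.
# Literal dots are escaped so that only the (\d+) group acts as a wildcard.
QUESTION_PATTERN = re.compile(r"http://zhidao\.example\.com/question/(\d+)\.html")

def select_candidates(urls):
    """Keep the URLs that match the pattern in full (sub-steps S11 and S12)."""
    return [url for url in urls if QUESTION_PATTERN.fullmatch(url)]

urls = [
    "http://zhidao.example.com/question/433737807751460604.html",
    "http://zhidao.example.com/question/1605209362191413347.html",
    "http://zhidao.example.com/user/12345.html",    # different path: rejected
    "http://zhidao.example.com/question/abc.html",  # non-numeric id: rejected
]
candidates = select_candidates(urls)
print(candidates)
```

Using `fullmatch` rather than `search` avoids accepting URLs that merely contain the pattern as a substring.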
In practical applications, each candidate URL is associated with an anchor text (anchor); that is, URLs and anchor texts are generally in one-to-one correspondence.
Anchor text, also called an anchor text link, is a form of link.
It is similar to a hyperlink: a keyword is made into a link that points to a web page, and a link of this form is called anchor text.
On the one hand, anchor text can serve as an assessment of the content of the page on which it appears (in-site anchor text).
Links added to a web page bear some relation to the content of the page itself. For example, a clothing industry website may add links to peer websites or to well-known clothing companies.
On the other hand, anchor text can serve as an assessment of the page it points to (off-site anchor text).
Anchor text can describe the content of the page it points to. For example, a personal website adds a link to "ABC" with the anchor text "search engine"; the anchor text itself then indicates that "ABC" is a search engine.
For the URLs crawled from the site zhidao.***.com, examples of their anchor texts may be as shown in the following table:
Figure PCTCN2016081003-appb-000001
Figure PCTCN2016081003-appb-000002
Here, "XXX" is the name of a TV series.
Step 103: calculate the similarity between the anchor texts;
The similarity may refer to the degree of content relevance between anchor texts.
In an optional embodiment of the present invention, step 103 may include the following sub-steps:
Sub-step S21: vectorize the anchor text;
In the embodiment of the present invention, the similarity may be calculated based on the vector space model. This model assumes that words are independent of one another and represents text as a vector, which simplifies the complex relationships between the keywords in a text. Because a document is represented by a very simple vector, the model is computable.
In an optional embodiment of the present invention, sub-step S21 may further include the following sub-steps:
Sub-step S211: perform word segmentation on the anchor text to obtain segmented words;
In a specific implementation, word segmentation may be performed in one or more of the following ways:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched against the entries of a preset machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is recognized).
2. Segmentation based on feature scanning or marker splitting: words with obvious features are first recognized and split out of the string to be analyzed, and these words are used as breakpoints to divide the original string into smaller strings, which are then segmented mechanically, reducing the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: rich part-of-speech information assists the segmentation decisions, and the tagging process in turn checks and adjusts the segmentation result, improving segmentation accuracy.
3. Segmentation based on understanding: the computer simulates a human's understanding of a sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. Such a system usually includes three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Coordinated by the general control part, the segmentation subsystem obtains syntactic and semantic information about words, sentences, and so on to resolve segmentation ambiguity; that is, it simulates the human process of understanding a sentence.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects well how likely they are to form a word. The frequency of each combination of adjacently co-occurring characters in a corpus can therefore be counted to calculate their mutual information, as well as the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly Chinese characters are bound together; when the tightness exceeds a certain threshold, the character group may be considered to form a word.
Of course, the above segmentation methods are only examples. When implementing the embodiment of the present invention, other segmentation methods may be set according to the actual situation or adopted by those skilled in the art according to actual needs; the embodiment of the present invention is not limited in this respect.
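As an illustration of the first approach (string matching against a dictionary), the sketch below uses forward maximum matching, one common matching strategy; the text does not say which strategy the embodiment uses, so this choice, the function name `forward_max_match`, the tiny dictionary, and the `max_len` parameter are all assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by repeatedly taking the longest dictionary word at the cursor."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary entry is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan keeps moving
            # even when the dictionary has no entry at this position.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

# Hypothetical machine dictionary.
DICTIONARY = {"搜索", "引擎", "搜索引擎", "抓取", "网页"}
segments = forward_max_match("搜索引擎抓取网页", DICTIONARY)
print(segments)
```

Because the longest match wins, "搜索引擎" is kept whole rather than split into "搜索" and "引擎".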
Sub-step S212: filter invalid words out of the segmented words;
In a specific implementation, words, symbols, punctuation, garbled characters, and the like that occur frequently in the corpus but contribute little to identifying the text content may be removed according to the words in a stop word list (the invalid words).
The invalid words include one or more of the following:
adverbs, auxiliary words, symbols, punctuation, and garbled characters.
For example, words such as "这" (this), "的" (of), "和" (and), "会" (will), and "为" (for) appear in almost any Chinese text, yet contribute almost nothing to the meaning the text expresses.
The process of removing stop words with a stop word list is roughly as follows: for each segmented word, check whether it is in the stop word list; if so, delete it from the segmented words.
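The stop word check described above is a membership test per segmented word. In this minimal sketch the stop word list is a small assumed sample built from the words quoted in the text plus two punctuation marks; the function name `remove_stop_words` and the sample input are illustrative.

```python
# Assumed sample stop word list; a real list would be far longer.
STOP_WORDS = {"这", "的", "和", "会", "为", ",", "?"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop every segmented word that appears in the stop word list."""
    return [word for word in words if word not in stop_words]

filtered = remove_stop_words(["这", "首", "歌", "的", "名字", "?"])
print(filtered)
```

Storing the list as a set keeps each lookup constant-time, which matters when many anchor texts are processed.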
Sub-step S213: determine keywords from the segmented words;
In a specific implementation, several keywords may be determined according to the frequency of the segmented words.
In one embodiment, the word frequency may be determined by TF (term frequency).
TF is the frequency with which a keyword appears in an article. For example, if the keyword occurs N times in an article of M words, then TF = N/M is the term frequency of the keyword in that article.
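The TF definition above can be computed directly from word counts. The sketch below is one possible reading: the function names and the rule of taking the `k` most frequent words as keywords (ties broken alphabetically) are assumptions, since the text only says keywords are chosen according to frequency.

```python
from collections import Counter

def term_frequencies(words):
    """Return TF = N / M for each word, where M is the total word count."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

def top_keywords(words, k):
    """Pick the k words with the highest TF (assumed selection rule)."""
    tf = term_frequencies(words)
    return sorted(tf, key=lambda word: (-tf[word], word))[:k]

words = ["music", "episode", "music", "season", "music", "episode"]
tf = term_frequencies(words)
print(tf["music"])             # 3 occurrences out of 6 words
print(top_keywords(words, 2))
```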
Sub-step S214: assign weights to the keywords;
Weights are assigned because keywords differ in how strongly they reflect the features of a text.
In one embodiment, the weight of a keyword may be determined by IDF (inverse document frequency).
IDF is an index for measuring the weight of a keyword: $\mathrm{IDF}=\log(D/D_w)$, where $D$ is the total number of articles and $D_w$ is the number of articles in which the keyword appears.
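A small sketch of the IDF formula follows. The text does not state the base of the logarithm, so the natural logarithm is assumed here; the function name, the sample articles, and the choice to raise an error when the keyword occurs nowhere (where the formula is undefined) are also assumptions.

```python
import math

def inverse_document_frequency(keyword, documents):
    """Compute IDF = log(D / D_w) with the natural logarithm (assumed base)."""
    total = len(documents)                                       # D
    containing = sum(1 for doc in documents if keyword in doc)   # D_w
    if containing == 0:
        # D_w = 0 would divide by zero; the formula is undefined in this case.
        raise ValueError("keyword does not occur in any document")
    return math.log(total / containing)

# Hypothetical articles, each reduced to its set of words.
documents = [
    {"music", "episode"},
    {"music", "season"},
    {"music"},
    {"episode", "song"},
]
print(inverse_document_frequency("music", documents))  # common word, low weight
print(inverse_document_frequency("song", documents))   # rare word, high weight
```

A word that appears in every article gets IDF 0, which matches the intent that ubiquitous words carry little weight.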
Sub-step S215: set the weights of the keywords as the components of the anchor text.
In the embodiment of the present invention, the anchor text string is converted into an N-dimensional vector whose components are the keyword weights, so that the similarity can be calculated.
For example, anchor text A may be expressed as $A=(a_1,a_2,a_3,\dots,a_n)$ and anchor text B as $B=(b_1,b_2,b_3,\dots,b_n)$, where $a_1,a_2,a_3,\dots,a_n$ are the components of A and $b_1,b_2,b_3,\dots,b_n$ are the components of B.
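For two anchor texts to be compared component by component, their weights must sit on the same keyword dimensions. One way to arrange this, sketched below, is to take the union of both keyword sets as the dimensions and use 0 for a keyword an anchor text lacks; this alignment rule, the function name `to_vectors`, and the keywords `k1` to `k5` with their weights are assumptions for illustration.

```python
def to_vectors(weights_a, weights_b):
    """Align two keyword-weight mappings on a shared, sorted vocabulary."""
    vocabulary = sorted(set(weights_a) | set(weights_b))
    # A keyword missing from an anchor text contributes a 0 component.
    vec_a = [weights_a.get(keyword, 0) for keyword in vocabulary]
    vec_b = [weights_b.get(keyword, 0) for keyword in vocabulary]
    return vocabulary, vec_a, vec_b

# Hypothetical keyword weights for two anchor texts.
anchor_a = {"k1": 30, "k2": 20, "k3": 20, "k4": 10}
anchor_b = {"k1": 40, "k3": 30, "k4": 20, "k5": 10}
vocabulary, vec_a, vec_b = to_vectors(anchor_a, anchor_b)
print(vocabulary)
print(vec_a)
print(vec_b)
```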
Sub-step S22: calculate the similarity between the vectorized anchor texts.
In a specific implementation, the cosine between the component vectors of the anchor texts (geometrically, the cosine of the angle between the two vectors) may be calculated as the similarity between the anchor texts.
For example, for $A=(a_1,a_2,a_3,\dots,a_n)$ and $B=(b_1,b_2,b_3,\dots,b_n)$, the cosine of the angle between the vectors $(a_1,a_2,a_3,\dots,a_n)$ and $(b_1,b_2,b_3,\dots,b_n)$ may be calculated as the similarity between anchor text A and anchor text B.
An example of calculating the similarity as the cosine of the included angle is as follows:
$$\mathrm{sim}(A,B)=\frac{a_1 b_1+a_2 b_2+a_3 b_3+\dots+a_n b_n}{\sqrt{a_1^2+a_2^2+a_3^2+\dots+a_n^2}\,\sqrt{b_1^2+b_2^2+b_3^2+\dots+b_n^2}}$$
where $\mathrm{sim}(A,B)$ denotes the similarity between anchor text A and anchor text B, and $\sqrt{\cdot}$ denotes the square root.
Suppose the components (weights) of anchor text A are 30, 20, 20, 10 and those of anchor text B are 40, 30, 20, 10. Placed on the same keyword dimensions, with 0 for a keyword that an anchor text does not contain, the vector of anchor text A is A = (30, 20, 20, 10, 0) and the vector of anchor text B is B = (40, 0, 30, 20, 10). The similarity between anchor text A and anchor text B calculated by the above formula is then 0.86.
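The worked example can be checked numerically. The sketch below implements the cosine formula for plain Python lists; the function name `cosine_similarity` and the decision to return 0.0 for an all-zero vector (where the formula is undefined) are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an empty (all-zero) anchor text
    return dot / (norm_a * norm_b)

A = [30, 20, 20, 10, 0]
B = [40, 0, 30, 20, 10]
# dot = 1200 + 0 + 600 + 200 + 0 = 2000
# |A| = sqrt(1800), |B| = sqrt(3000)
print(round(cosine_similarity(A, B), 2))  # 0.86
```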
Step 104: identify enriched URLs from the candidate URLs according to the similarity.
In a specific implementation, the more similar the content of the web pages, the higher the similarity. When the similarity is greater than a preset similarity threshold, the candidate URLs are confirmed as enriched URLs; that is, URLs whose similarity exceeds a certain threshold can be regarded as URLs with the same or similar content (i.e. enriched URLs).
For example, the anchor texts of the URLs crawled from the site zhidao.***.com all relate to the music in episode 14 of season 5 of XXX, so these URLs can be regarded as enriched URLs.
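One way to apply the threshold test is to compare the anchor vectors of every pair of candidate URLs and flag both URLs of any pair whose similarity exceeds the threshold. The text does not specify how pairs are enumerated, so the pairwise scheme, the function names, the threshold value 0.8, and the sample vectors below are all assumptions.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_enriched(anchor_vectors, threshold=0.8):
    """Return the URLs whose anchor vector is similar to that of another candidate."""
    enriched = set()
    for (url_a, vec_a), (url_b, vec_b) in combinations(anchor_vectors.items(), 2):
        if cosine(vec_a, vec_b) > threshold:
            enriched.update((url_a, url_b))
    return enriched

# Hypothetical candidate URLs mapped to their anchor text vectors.
anchor_vectors = {
    "url1": [30, 20, 20, 10, 0],
    "url2": [40, 0, 30, 20, 10],
    "url3": [0, 0, 0, 0, 50],
}
print(sorted(find_enriched(anchor_vectors)))
```

The pairwise loop is quadratic in the number of candidates; for large candidate sets a clustering or hashing scheme would likely be needed, which is outside what the text describes.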
The embodiment of the present invention selects candidate URLs from the extracted URLs and identifies enriched URLs according to the similarity of the anchor texts associated with the candidate URLs. This prevents the search engine from crawling junk and duplicate web pages, which greatly reduces wasted bandwidth during crawling. Because the crawl volume decreases, the load on the search engine is reduced; at the same time, the search engine can crawl other high-quality web pages instead, improving the coverage and timeliness of the pages it indexes.
In an optional embodiment of the present invention, the method may further include the following steps:
Step 105: select a target URL from the enriched URLs.
In a specific implementation, some or all of the enriched URLs may be selected as target URLs according to a certain policy.
In an optional embodiment of the present invention, step 105 may include the following sub-steps:
Sub-step S31: obtain the attention degree of the enriched URLs;
Sub-step S32: select the target URL from the enriched URLs based on the attention degree.
The attention degree may be the degree to which users pay attention to the URL, for example the number of recommendations of the web page corresponding to the URL (indicated, for example, by "给力" (awesome) or "点赞" (like) votes); the more recommendations, the higher the attention degree.
A URL with a higher attention degree generally corresponds to a web page of higher quality. Therefore, in the embodiment of the present invention, enriched URLs with a higher attention degree may be selected as target URLs, for example enriched URLs whose attention degree exceeds a preset attention threshold, or the one or more enriched URLs ranked highest by attention degree.
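Both selection rules mentioned above (an attention threshold and a top-ranked cut) fit in one small function. The sketch assumes the attention degree is a recommendation count held in a dictionary; the function name `select_targets`, its parameters, and the sample counts are illustrative.

```python
def select_targets(attention, threshold=None, top_k=None):
    """Rank enriched URLs by attention degree and apply the optional filters."""
    # Highest attention first; the URL itself breaks ties deterministically.
    ranked = sorted(attention, key=lambda url: (-attention[url], url))
    if threshold is not None:
        ranked = [url for url in ranked if attention[url] > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked

# Hypothetical recommendation counts for three enriched URLs.
attention = {"q1": 3, "q2": 57, "q3": 12}
print(select_targets(attention, top_k=1))       # the single most recommended page
print(select_targets(attention, threshold=10))  # every page above the threshold
```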
Step 106: crawl the web page corresponding to the target URL;
In practical applications, the basic workflow by which a web crawler crawls web pages is generally as follows:
1. Select the target URL;
2. Put the target URL into the queue of URLs to be crawled;
3. Take the target URL out of the queue of URLs to be crawled, resolve it through DNS (Domain Name System) to obtain the IP (Internet Protocol) address of the host, access that IP address, download the web page corresponding to the target URL, and store it in the downloaded web page library.
In addition, put the target URL into the queue of crawled URLs.
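The workflow above can be sketched with the network parts injected as functions, so that the queue handling is visible and the example runs without network access. The names `crawl_targets`, `resolve`, and `download`, the fake resolver and downloader, and the host `zhidao.example.com` are assumptions; in a real crawler `resolve` could be the standard library's `socket.gethostbyname`.

```python
from collections import deque
from urllib.parse import urlsplit

def crawl_targets(target_urls, resolve, download):
    """Drain the pending queue, resolving and downloading each target URL."""
    pending = deque(target_urls)  # queue of URLs to be crawled
    crawled = []                  # queue of URLs already crawled
    page_store = {}               # downloaded web page library
    while pending:
        url = pending.popleft()
        host = urlsplit(url).hostname
        ip_address = resolve(host)                   # DNS: host name -> IP address
        page_store[url] = download(ip_address, url)  # fetch the page from that IP
        crawled.append(url)
    return crawled, page_store

# Stand-ins for DNS resolution and HTTP download (hypothetical).
def fake_resolve(host):
    return "192.0.2.1"  # address from the range reserved for documentation

def fake_download(ip_address, url):
    return f"<html>page for {url} from {ip_address}</html>"

targets = ["http://zhidao.example.com/question/1.html"]
crawled, page_store = crawl_targets(targets, fake_resolve, fake_download)
print(crawled)
print(page_store[targets[0]])
```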
Step 107: generate an index file from the web page.
The search process of a search engine is generally divided into two parts: the front-end user request process and the back-end data preparation process.
I. The front-end user request process is roughly as follows:
1. Receive the request: receive the search keywords entered by the user in the search engine;
2. Query analysis: perform word segmentation on the search keywords;
3. Retrieval: according to the segmentation result, look up the web page information related to the segmentation result in a pre-built index file (such as an inverted index);
4. Ranking: rank the related web page information by dimensions such as content relevance and timeliness;
5. Presentation: display the ranked web page information on the search engine's results page.
II. The back-end data preparation process:
1. Web page crawling: use web crawler technology to crawl web pages of all kinds and save them.
2. Index building: analyze the crawled and saved web page information, for example by segmenting page titles and page text, and build an index file (such as an inverted index) from the segmentation result for use by the front-end user request process.
In the embodiment of the present invention, web page records may be written into an index file (such as an inverted index) so that they can be searched in the search engine.
Taking the inverted index as an example: the inverted index arises from the practical need to find records by the value of an attribute. Each entry in such an index table includes an attribute value and the addresses of the records that have that attribute value. Because the position of a record is determined from the attribute value, rather than the attribute value from the record, it is called an inverted index. A file with an inverted index is called an inverted index file, or inverted file for short.
In an inverted file, the indexed objects are the words in a document or a collection of documents (such as web pages); the file stores the positions of these words within a document or a group of documents, and is a commonly used indexing mechanism for documents or document collections.
Taking English as an example, the following is the text information in the web pages to be indexed:
T1 = "it is what it is";
T2 = "what is it";
T3 = "it is a banana";
The following is the inverted index:
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
Here, "banana": {(2, 3)} means that "banana" appears in the text of the third web page (T3), where it is the fourth word (address 3); both web pages and word positions are numbered from 0.
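The inverted index listed above can be reproduced with a few lines of code. This sketch splits on whitespace, which suffices for the English example but not for Chinese text; the function names `build_inverted_index` and `pages_containing_all` are assumptions, and the lookup function is only a simplified stand-in for the retrieval step of the front-end process.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to a list of (page number, word position) pairs, both 0-based."""
    index = defaultdict(list)
    for page_number, text in enumerate(documents):
        for position, word in enumerate(text.split()):
            index[word].append((page_number, position))
    return dict(index)

def pages_containing_all(index, query):
    """Return the page numbers that contain every word of the query."""
    page_sets = [{page for page, _ in index.get(word, [])} for word in query.split()]
    return sorted(set.intersection(*page_sets)) if page_sets else []

pages = ["it is what it is", "what is it", "it is a banana"]  # T1, T2, T3
index = build_inverted_index(pages)
print(index["banana"])
print(index["is"])
print(pages_containing_all(index, "what is"))
```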
For simplicity of description, the method embodiments are each expressed as a series of combined actions. However, those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to FIG. 2, a structural block diagram of an embodiment of an apparatus for identifying an enriched URL according to an embodiment of the present invention is shown. The apparatus may specifically include the following modules:
a URL extraction module 201, adapted to extract one or more URLs;
a candidate URL selection module 202, adapted to select candidate URLs from the one or more URLs, each candidate URL being associated with an anchor text;
a similarity calculation module 203, adapted to calculate the similarity between the anchor texts;
an enriched URL identification module 204, adapted to identify enriched URLs from the candidate URLs according to the similarity.
In an optional embodiment of the present invention, the candidate URL selection module 202 may be further adapted to:
determine whether the URL matches a pattern; if so, select the URL as a candidate URL.
In an optional embodiment of the present invention, the similarity calculation module 203 may be further adapted to:
vectorize the anchor texts;
calculate the similarity between the vectorized anchor texts.
In an optional embodiment of the present invention, the similarity calculation module 203 may be further adapted to:
perform word segmentation on the anchor text to obtain segmented words;
determine keywords from the segmented words;
assign weights to the keywords;
set the weights of the keywords as the components of the anchor text.
In an optional embodiment of the present invention, the similarity calculation module 203 may be further adapted to:
filter invalid words out of the segmented words;
wherein the invalid words include one or more of the following:
adverbs, auxiliary words, symbols, punctuation, and garbled characters.
In an optional embodiment of the present invention, the similarity calculation module 203 may be further adapted to:
calculate the cosine between the component vectors of the anchor texts as the similarity between the anchor texts.
In an optional embodiment of the present invention, the enriched URL identification module 204 may be further adapted to:
when the similarity is greater than a preset similarity threshold, confirm the candidate URLs as enriched URLs.
In an optional embodiment of the present invention, the apparatus may further include the following module:
a target URL selection module, adapted to select a target URL from the enriched URLs.
In an optional embodiment of the present invention, the target URL selection module may be further adapted to:
obtain the attention degree of the enriched URLs;
select the target URL from the enriched URLs based on the attention degree.
In an optional embodiment of the present invention, the apparatus may further include the following modules:
a web page crawling module, adapted to crawl the web page corresponding to the target URL;
an index file generation module, adapted to generate an index file from the web page.
Since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying enriched URLs according to embodiments of the present invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 3 shows a computing device, such as an application server, that can implement the identification of enriched URLs according to the present invention. The computing device conventionally includes a processor 310 and a computer program product or computer-readable medium in the form of a memory 320. The memory 320 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. The memory 320 has a storage space 330 for program code 331 for performing any of the method steps described above. For example, the storage space 330 for program code may include individual program codes 331, each implementing one of the steps of the above method. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 4. The storage unit may have storage segments, storage space, and the like arranged similarly to the memory 320 in the computing device of FIG. 3. The program code may, for example, be compressed in an appropriate form. Typically, the storage unit includes computer-readable code 331', i.e. code that can be read by a processor such as 310, which, when run by a computing device, causes the computing device to perform the steps of the method described above.
本文中所称的“一个实施例”、“实施例”或者“一个或者多个实施例”意味着,结合实施例描述的特定特征、结构或者特性包括在本发明的至少一个实施例中。此外,请注意,这里“在一个实施例中”的词语例子不一定全指同一个实施例。"an embodiment," or "an embodiment," or "an embodiment," In addition, it is noted that the phrase "in one embodiment" is not necessarily referring to the same embodiment.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Therefore, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims (22)

  1. A method for identifying enriched URLs, comprising the steps of:
    extracting one or more URLs;
    selecting candidate URLs from the one or more URLs, each candidate URL being associated with a respective anchor text (anchor);
    calculating a similarity between the respective anchor texts;
    identifying an enriched URL from the candidate URLs according to the similarity.
  2. The method of claim 1, wherein the step of selecting candidate URLs from the one or more URLs comprises:
    determining whether the URL matches a pattern; if so, selecting the URL as a candidate URL.
  3. The method of claim 1 or 2, wherein the step of calculating the similarity between the respective anchor texts comprises:
    vectorizing the anchor texts;
    calculating the similarity between the vectorized anchor texts.
  4. The method of claim 3, wherein the step of vectorizing the anchor text comprises:
    performing word segmentation on the anchor text to obtain text segments;
    determining keywords from the text segments;
    assigning weights to the keywords;
    setting the weights of the keywords as the components of the anchor text.
  5. The method of claim 3, wherein the step of vectorizing the anchor text further comprises:
    filtering invalid words out of the text segments;
    wherein the invalid words include one or more of the following:
    adverbs, particles, symbols, punctuation marks, garbled characters.
  6. The method of claim 3, wherein the step of calculating the similarity between the vectorized anchor texts comprises:
    calculating a cosine value between the components of the anchor texts as the similarity between the anchor texts.
  7. The method of claim 1, 2, 4, 5, or 6, wherein the step of identifying an enriched URL from the candidate URLs according to the similarity comprises:
    when the similarity is greater than a preset similarity threshold, confirming the candidate URL as an enriched URL.
  8. The method of claim 1, further comprising the step of:
    selecting a target URL from the enriched URLs.
  9. The method of claim 8, wherein the step of selecting a target URL from the enriched URLs comprises:
    obtaining a degree of attention of the enriched URLs;
    selecting the target URL from the enriched URLs based on the degree of attention.
  10. The method of claim 8 or 9, further comprising the steps of:
    crawling the web page corresponding to the target URL;
    generating an index file using the web page.
  11. An apparatus for identifying enriched URLs, comprising:
    a URL extraction module, adapted to extract one or more URLs;
    a candidate URL selection module, adapted to select candidate URLs from the one or more URLs, each candidate URL being associated with a respective anchor text (anchor);
    a similarity calculation module, adapted to calculate a similarity between the respective anchor texts;
    an enriched URL identification module, adapted to identify an enriched URL from the candidate URLs according to the similarity.
  12. The apparatus of claim 11, wherein the candidate URL selection module is further adapted to:
    determine whether the URL matches a pattern; if so, select the URL as a candidate URL.
  13. The apparatus of claim 11 or 12, wherein the similarity calculation module is further adapted to:
    vectorize the anchor texts;
    calculate the similarity between the vectorized anchor texts.
  14. The apparatus of claim 13, wherein the similarity calculation module is further adapted to:
    perform word segmentation on the anchor text to obtain text segments;
    determine keywords from the text segments;
    assign weights to the keywords;
    set the weights of the keywords as the components of the anchor text.
  15. The apparatus of claim 13, wherein the similarity calculation module is further adapted to:
    filter invalid words out of the text segments;
    wherein the invalid words include one or more of the following:
    adverbs, particles, symbols, punctuation marks, garbled characters.
  16. The apparatus of claim 13, wherein the similarity calculation module is further adapted to:
    calculate a cosine value between the components of the anchor texts as the similarity between the anchor texts.
  17. The apparatus of claim 11, 12, 14, 15, or 16, wherein the enriched URL identification module is further adapted to:
    when the similarity is greater than a preset similarity threshold, confirm the candidate URL as an enriched URL.
  18. The apparatus of claim 11, further comprising:
    a target URL selection module, adapted to select a target URL from the enriched URLs.
  19. The apparatus of claim 18, wherein the target URL selection module is further adapted to:
    obtain a degree of attention of the enriched URLs;
    select the target URL from the enriched URLs based on the degree of attention.
  20. The apparatus of claim 18 or 19, further comprising:
    a web page crawling module, adapted to crawl the web page corresponding to the target URL;
    an index file generation module, adapted to generate an index file using the web page.
  21. A computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform the method for identifying enriched URLs according to any one of claims 1-10.
  22. A computer readable medium storing the computer program of claim 21.
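
As an illustration only, the steps recited in claims 1 to 7 can be sketched in Python. This is a minimal sketch under stated assumptions, not the applicant's implementation: the regular expression `URL_PATTERN`, the `STOP_WORDS` set, the term-frequency weighting, the regex tokenizer (a stand-in for a real Chinese word segmenter), the pairwise comparison strategy, and the default threshold of 0.5 are all hypothetical choices that the claims do not specify.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical pattern: a path that ends in a numeric id, e.g. /tv/123.html.
URL_PATTERN = re.compile(r"^https?://[^/]+/.+/\d+\.html$")

# Hypothetical stand-in for the "invalid words" (adverbs, particles, ...).
STOP_WORDS = {"the", "a", "of", "very"}


def select_candidates(links):
    """Keep the (url, anchor) pairs whose URL matches the pattern."""
    return [(url, anchor) for url, anchor in links if URL_PATTERN.match(url)]


def vectorize(anchor):
    """Segment the anchor text, drop invalid words, weight keywords by count.

    The regex split replaces a real word segmenter; it also discards
    symbols and punctuation.
    """
    tokens = re.findall(r"\w+", anchor.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(v1, v2):
    """Cosine of the angle between two sparse keyword-weight vectors."""
    dot = sum(weight * v2[key] for key, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def identify_enriched(links, threshold=0.5):
    """Return the candidate URLs whose anchor text is similar to another's."""
    candidates = select_candidates(links)
    vectors = [vectorize(anchor) for _, anchor in candidates]
    enriched = set()
    for i, j in combinations(range(len(candidates)), 2):
        if cosine(vectors[i], vectors[j]) > threshold:
            enriched.add(candidates[i][0])
            enriched.add(candidates[j][0])
    return enriched
```

Comparing every pair of candidates is quadratic in the number of candidates; it is used here only because it keeps the sketch short, and the claims leave the comparison strategy open.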
PCT/CN2016/081003 2015-06-30 2016-05-04 Enriched uniform resource locator (url) identification method and apparatus WO2017000659A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510375487.6A CN104965902A (en) 2015-06-30 2015-06-30 Enriched URL (uniform resource locator) recognition method and apparatus
CN201510375487.6 2015-06-30

Publications (1)

Publication Number Publication Date
WO2017000659A1 WO2017000659A1 (en) 2017-01-05

Family

ID=54219940

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/081003 WO2017000659A1 (en) 2015-06-30 2016-05-04 Enriched uniform resource locator (url) identification method and apparatus

Country Status (2)

Country Link
CN (1) CN104965902A (en)
WO (1) WO2017000659A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10154041B2 (en) * 2015-01-13 2018-12-11 Microsoft Technology Licensing, Llc Website access control
CN104965902A (en) * 2015-06-30 2015-10-07 北京奇虎科技有限公司 Enriched URL (uniform resource locator) recognition method and apparatus
CN108090104B (en) * 2016-11-23 2023-05-02 百度在线网络技术(北京)有限公司 Method and device for acquiring webpage information
CN109672706B (en) * 2017-10-16 2022-06-14 百度在线网络技术(北京)有限公司 Information recommendation method and device, server and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050165781A1 (en) * 2004-01-26 2005-07-28 Reiner Kraft Method, system, and program for handling anchor text
US20080104113A1 (en) * 2006-10-26 2008-05-01 Microsoft Corporation Uniform resource locator scoring for targeted web crawling
CN101650715A (en) * 2008-08-12 2010-02-17 厦门市美亚柏科信息股份有限公司 Method and device for screening links on web pages
CN102135967A (en) * 2010-01-27 2011-07-27 华为技术有限公司 Webpage keywords extracting method, device and system
CN104063506A (en) * 2014-07-08 2014-09-24 百度在线网络技术(北京)有限公司 Method and device for identifying repeated web pages
CN104090976A (en) * 2014-07-21 2014-10-08 北京奇虎科技有限公司 Method and device for crawling webpages by search engine crawlers
CN104965902A (en) * 2015-06-30 2015-10-07 北京奇虎科技有限公司 Enriched URL (uniform resource locator) recognition method and apparatus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102654861B (en) * 2011-03-01 2017-12-08 深圳市世纪光速信息技术有限公司 Webpage extraction accuracy computational methods and system
CN102411626A (en) * 2011-12-13 2012-04-11 北京大学 Correlation fraction distribution-based method for classifying query intentions
CN103631906A (en) * 2013-11-25 2014-03-12 北京奇虎科技有限公司 Method and device for recognizing page number identification in webpage URL

Also Published As

Publication number Publication date
CN104965902A (en) 2015-10-07

Similar Documents

Publication Publication Date Title
US8073877B2 (en) Scalable semi-structured named entity detection
Jijkoun et al. Retrieving answers from frequently asked questions pages on the web
US8341150B1 (en) Filtering search results using annotations
US8161059B2 (en) Method and apparatus for collecting entity aliases
US9104772B2 (en) System and method for providing tag-based relevance recommendations of bookmarks in a bookmark and tag database
US9367637B2 (en) System and method for searching a bookmark and tag database for relevant bookmarks
CA2774278C (en) Methods and systems for extracting keyphrases from natural text for search engine indexing
CN108280114B (en) Deep learning-based user literature reading interest analysis method
TWI695277B (en) Automatic website data collection method
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
CN108038173B (en) Webpage classification method and system and webpage classification equipment
US20080168049A1 (en) Automatic acquisition of a parallel corpus from a network
CN110555154B (en) Theme-oriented information retrieval method
WO2017000659A1 (en) Enriched uniform resource locator (url) identification method and apparatus
US20090259649A1 (en) System and method for detecting templates of a website using hyperlink analysis
CN104778232B (en) Searching result optimizing method and device based on long query
WO2015074455A1 (en) Method and apparatus for computing url pattern of associated webpage
KR102256007B1 (en) System and method for searching documents and providing an answer to a natural language question
US20090182759A1 (en) Extracting entities from a web page
Kang Transactional query identification in web search
CN112035723A (en) Resource library determination method and device, storage medium and electronic device
Seger A bounded delay race model
CN111767482B (en) Self-adaptive crawling method for focused web crawlers
Pu et al. A vision-based approach for deep web form extraction
Preetha et al. Personalized search engines on mining user preferences using clickthrough data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16817023

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16817023

Country of ref document: EP

Kind code of ref document: A1