WO2017000659A1

WO2017000659A1 - Enriched uniform resource locator (URL) identification method and apparatus

Info

Publication number: WO2017000659A1
Application number: PCT/CN2016/081003
Authority: WO
Inventors: 王智广
Original assignee: 北京奇虎科技有限公司; 奇智软件（北京）有限公司
Priority date: 2015-06-30
Filing date: 2016-05-04
Publication date: 2017-01-05
Also published as: CN104965902A

Abstract

Disclosed are an enriched uniform resource locator (URL) identification method and apparatus. The method comprises: extracting one or more URLs; selecting candidate URLs from the one or more URLs; associating each candidate URL with an anchor text; calculating the similarity between the anchor texts; and identifying an enriched URL from the candidate URLs according to the similarity. The embodiments of the present invention can prevent a search engine from crawling spam and duplicate web pages, thereby greatly reducing the bandwidth wasted during crawling and, because less is crawled, reducing the burden on the search engine. Meanwhile, the search engine can additionally crawl other good-quality web pages, thereby improving the coverage and the timeliness of its web page collection.

Description

Method and device for identifying enriched URL

Technical field

The present invention relates to the technical field of computer processing, and in particular, to a method for identifying an enriched URL and an apparatus for identifying an enriched URL.

Background technique

With the rapid development of the network, the network has become a carrier of a large amount of information. In order to effectively extract and utilize this information, the search engine usually downloads web pages from the network through a web crawler.

The web crawler starts from the URL (Uniform Resource Locator) of one or several initial web pages and obtains the URLs on those initial web pages. During the process of crawling web pages, the web crawler continuously extracts new URLs from the current web page and places them into the queue until a certain stop condition of the system is met.

Web crawlers can find a large number of newly generated URLs in the network every day. However, the volume of URLs in the network is massive, while the number of URLs that a search engine can actually crawl each day is limited. The web crawler therefore needs to sort the discovered URLs before actually fetching pages and to fetch some URLs preferentially.

Currently, the newly discovered URLs are sorted mainly based on feedback from the crawled web pages. If the quality of the crawled webpage is high, then the quality of the URL that is similar to the URL of the crawled webpage is considered to be higher.

However, this scheme suffers from the phenomenon of enrichment. Each URL has its own characteristics, and the quality of web pages with similar URLs can differ greatly; some of them may be spam or duplicate web pages. Crawling these web pages wastes bandwidth and increases the burden on the search engine.

Summary of the invention

In view of the above problems, the present invention has been made in order to provide an enriched URL identification method and a corresponding enrichment URL identification apparatus that overcome the above problems or at least partially solve or alleviate the above problems.

According to an aspect of the present invention, a method for identifying an enriched URL is provided, including the steps of:

Extract one or more URLs;

Selecting candidate URLs from the one or more URLs; each candidate URL is associated with each anchor text anchor;

Calculating a similarity between the anchor text anchors;

An enriched URL is identified from the candidate URLs based on the similarity.

According to another aspect of the present invention, an apparatus for identifying an enriched URL is provided, including:

a URL extraction module adapted to extract one or more URLs;

a candidate URL selection module, configured to select candidate URLs from the one or more URLs; each candidate URL is associated with each anchor text anchor;

a similarity calculation module, configured to calculate a similarity between the anchor text anchors;

An enriched URL identification module adapted to identify an enriched URL from the candidate URLs based on the similarity.

According to still another aspect of the present invention, a computer program is provided, comprising computer readable code that, when run on a computing device, causes the computing device to perform the method of identifying an enriched URL described above.

According to still another aspect of the present invention, a computer readable medium is provided, wherein the computer program described above is stored.

The beneficial effects of the invention are:

In the embodiment of the present invention, candidate URLs are selected from the extracted URLs, and enriched URLs are identified according to the similarity of the anchor texts associated with the candidate URLs. This prevents the search engine from crawling spam and duplicate web pages, which greatly reduces the bandwidth wasted during crawling and, by reducing the amount crawled, lightens the burden on the search engine. At the same time, the search engine can additionally crawl other high-quality web pages, which improves the coverage and timeliness of the web pages included in the search engine.

The above description is only an overview of the technical solutions of the present invention. So that the above-described and other objects, features, and advantages of the present invention can be more clearly understood, specific embodiments of the present invention are set forth below.

DRAWINGS

Various other advantages and benefits will become apparent to those skilled in the art from a reading of the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be construed as limiting. Throughout the drawings, the same reference numerals are used to refer to the same parts. In the drawings:

Figure 1 is a flow chart showing the steps of an embodiment of a method for identifying an enriched URL according to an embodiment of the present invention;

Figure 2 is a block diagram showing the structure of an embodiment of an apparatus for identifying an enriched URL according to an embodiment of the present invention;

Figure 3 schematically shows a block diagram of a computing device for performing the method according to the invention;

Fig. 4 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.

Specific embodiment

The invention is further described below in conjunction with the drawings and specific embodiments.

Referring to FIG. 1 , a flow chart of steps of an embodiment of a method for identifying an enriched URL according to an embodiment of the present invention is shown.

Step 101: Extract one or more URLs;

In practical applications, various types of websites may design a large number of web pages every day, and each web page will have a URL.

In the embodiment of the present invention, the search engine may fetch the URLs of web pages from the network in advance by using a web crawler (also known as a web spider) and store them in a database. When enriched URLs are to be identified, one or more URLs may be extracted from the database.

The web crawler generally starts from the URLs of one or more initial web pages, obtains the URLs on those initial web pages, and, during the process of crawling web pages, continuously extracts new URLs from the current page and places them into the queue until a certain stop condition of the system is satisfied.

In particular, the focus crawler (a type of web crawler) has a more complex workflow, usually filtering links that are not related to the topic, retaining useful links and placing them in a queue of URLs waiting to be crawled. Then, the focused crawler will select the URL of the web page to be crawled from the queue according to a certain search strategy, and repeat the above process until it stops when a certain condition is reached.

In order to enable those skilled in the art to better understand the embodiments of the present application, in the present specification, the website of the question and answer category is explained as an example.

On question-and-answer websites (such as zhidao.baidu.com), users may generate a large number of questions every day. Some of these questions will be answered by other users, while others will not be answered, and many of these questions may be duplicates.

That is to say, a large number of questions are the same or similar. For a given question, therefore, the search engine generally needs to include only a web page that has an answer, preferably a satisfactory one; the others can be regarded as duplicates.

An example of a URL that is crawled by the question and answer class site of zhidao.***.com is as follows:

http://zhidao.***.com/question/433737807751460604.html

http://zhidao.***.com/question/1605209362191413347.html

http://zhidao.***.com/question/618238863630856372.html

http://zhidao.***.com/question/625161396233610844.html

http://zhidao.***.com/question/1367620128259860259.html

http://zhidao.***.com/question/2139209187911446788.html

http://zhidao.***.com/question/584108667629594845.html

Among them, "***" is the domain name of a website.

Step 102: Select a candidate URL from the one or more URLs;

In a specific implementation, some or all URLs may be selected as candidate URLs according to a certain policy from the extracted URLs.

In an optional embodiment of the invention, step 102 may include the following sub-steps:

Sub-step S11, it is determined whether the URL matches a pattern; if so, sub-step S12 is performed;

Sub-step S12, the URL is selected as a candidate URL.

In the embodiment of the present invention, since a website generally configures similar URLs for the same type of service (such as question and answer), the URLs of the same website that match the same pattern may be selected as candidate URLs.

Among them, the pattern can be a URL template shared by URLs with the same or similar style.

For example, the above URLs crawled at the zhidao.***.com question-and-answer site share the same pattern:

http://zhidao.***.com/question/(\d+).html;

Among them, (\d+) is a wildcard that matches one or more digits, here the numeric identifier of the question.

It can be considered that the above URL crawled in the question and answer class site of zhidao.***.com is a candidate URL.
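Sub-steps S11 and S12 can be sketched as a regular-expression filter. This is a minimal illustration, not the patented implementation: the host name `zhidao.example.com` is a hypothetical stand-in for the masked "***" domain, and the names `QUESTION_PATTERN` and `select_candidates` are assumptions introduced here.

```python
import re

# Hypothetical pattern: "example" stands in for the masked "***" domain.
# Literal dots are escaped so that only real dots match.
QUESTION_PATTERN = re.compile(
    r"^http://zhidao\.example\.com/question/(\d+)\.html$"
)


def select_candidates(urls):
    """Keep the URLs that match the pattern (sub-steps S11 and S12)."""
    return [url for url in urls if QUESTION_PATTERN.match(url)]


urls = [
    "http://zhidao.example.com/question/433737807751460604.html",
    "http://zhidao.example.com/question/1605209362191413347.html",
    "http://zhidao.example.com/user/profile.html",  # does not match
]
candidates = select_candidates(urls)
```

Anchoring the expression with `^` and `$` keeps URLs that merely contain the pattern as a substring from being selected.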

In practical applications, each candidate URL is associated with an anchor text (anchor); that is, URLs and anchor texts generally correspond one to one.

Anchor text, also known as an anchor text link, is a form of link.

Like a hyperlink, it attaches a link to a keyword so that the keyword points to a web page; the keyword text that carries the link is called the anchor text.

On the one hand, the anchor text can be used as an evaluation of the content of the web page where the anchor text is located, ie the anchor text within the station.

The added links in the webpage have a certain relationship with the content of the webpage itself. For example, the clothing industry website will add links to some peer websites or some well-known companies that make clothing.

On the other hand, the anchor text can be used as an evaluation of the web page pointed to, ie the anchor text outside the station.

The anchor text can describe the content of the web page pointed to. For example, if a personal website adds a link to "ABC" with the anchor text "search engine", one can tell from the anchor text alone that "ABC" is a search engine.

For the URL crawled at the zhidao.***.com site, an example of its anchor text anchor can be as shown in the following table:

Among them, "XXX" is the name of a TV series.

Step 103: Calculate a similarity between the anchor text anchors;

Similarity can refer to the content relevance between anchor text anchors.

In an optional embodiment of the present invention, step 103 may include the following sub-steps:

Sub-step S21, performing vectorization processing on the anchor text anchor;

In the embodiment of the present invention, the similarity can be calculated based on the vector space model. The model assumes that words are independent of one another and represents each text as a vector, which simplifies the complex relationships between the keywords in the text; because the document is represented by a very simple vector, the model is computable.

In an optional embodiment of the present invention, the sub-step S21 may further include the following sub-steps:

Sub-step S211, performing a word segmentation process on the anchor text anchor to obtain a text segmentation;

In a specific implementation, the word segmentation process can be performed by one or more of the following methods:

1. Word segmentation based on string matching: the Chinese character string to be analyzed is matched against the entries of a preset machine dictionary according to a certain strategy. If a string is found in the dictionary, the match succeeds (a word is identified).

2. Word segmentation based on feature scanning or mark segmentation: words with obvious features are identified and segmented first in the string to be analyzed. Using these words as breakpoints, the original string can be divided into smaller strings that then undergo mechanical segmentation, which reduces the matching error rate. Alternatively, word segmentation is combined with part-of-speech tagging: rich part-of-speech information helps the segmentation decisions, and during tagging the segmentation result is in turn checked and adjusted, which improves the accuracy of the segmentation.

3. Word segmentation based on understanding: words are identified by having the computer simulate human understanding of the sentence. The basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and to use syntactic and semantic information to deal with ambiguity. It usually consists of three parts: the word segmentation subsystem, the syntactic and semantic subsystem, and the general control part. Under the coordination of the general control part, the word segmentation subsystem can obtain syntactic and semantic information about words, sentences, and so on in order to resolve segmentation ambiguity; that is, it simulates the process by which a human understands the sentence.

4. Statistics-based word segmentation: the frequency or probability with which characters co-occur adjacently in Chinese text reflects well how likely they are to form a word. The frequency of each combination of adjacently co-occurring characters in the corpus can therefore be counted and their mutual information calculated, for example from the adjacent co-occurrence probability of two Chinese characters X and Y. The mutual information reflects how closely the characters are bound to each other. When this closeness is above a certain threshold, the character group may be considered to constitute a word.

Of course, the above word segmentation methods are only examples. In implementing the embodiment of the present invention, a person skilled in the art may adopt other word segmentation methods according to actual needs, which is not limited by the embodiment of the present invention.
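As an illustration of the string-matching approach (method 1), the sketch below uses forward maximum matching, one common matching strategy. The text does not say which strategy is used, so this choice, the tiny dictionary, and the function name `forward_max_match` are all assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by repeatedly taking the longest dictionary entry
    that starts at the current position (forward maximum matching)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan advances.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical machine dictionary for the example.
dictionary = {"搜索", "引擎", "搜索引擎", "抓取", "网页"}
tokens = forward_max_match("搜索引擎抓取网页", dictionary)
print(tokens)  # ['搜索引擎', '抓取', '网页']
```

Because the longest entry wins, "搜索引擎" is kept as one word rather than being split into "搜索" and "引擎".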

Sub-step S212, filtering out invalid words from the text participle;

In a specific implementation, a stop-word list may be used to remove invalid words, that is, words, symbols, punctuation, and garbled characters that appear frequently but carry little meaning for the text content.

The invalid words include one or more of the following:

Adverbs, auxiliary words, symbols, punctuation, and garbled characters.

For example, function words such as "this" appear in almost any Chinese text, yet they have little to do with the meaning the text expresses.

The process of using a stop-word list to eliminate stop words is roughly as follows: each text segment is checked against the stop-word list and, if it is in the list, it is removed from the text segments.
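The filtering in sub-step S212 amounts to a membership test against the stop-word list. The following is a minimal sketch; the English stop words and the names `STOP_WORDS` and `remove_stop_words` are illustrative assumptions, not taken from the source.

```python
# Hypothetical stop-word list; a real one would be far larger.
STOP_WORDS = {"the", "of", "a", ",", "?"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every text segment that appears in the stop-word list."""
    return [token for token in tokens if token not in stop_words]


tokens = ["the", "music", "of", "episode", "14", "?"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['music', 'episode', '14']
```

Storing the list as a set keeps each lookup at constant time on average, which matters when many anchor texts are processed.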

Sub-step S213, determining a keyword from the text segmentation;

In a specific implementation, several keywords may be determined according to the frequency of text segmentation.

In an embodiment, the keywords can be determined by their TF (term frequency).

TF refers to the frequency of occurrence of keywords in an article. For example, in an article with M words, there are N such keywords, then TF=N/M, which is the word frequency of the keyword in this article.
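The definition TF = N/M can be computed directly from the text segments. The sketch below is illustrative only; the helper names `term_frequencies` and `top_keywords`, and the rule of keeping the k most frequent words, are assumptions, since the text says only that keywords are determined by frequency.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return TF = N / M for each word, where N is the number of times the
    word occurs and M is the total number of words."""
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}


def top_keywords(tokens, k):
    """Pick the k words with the highest TF (ties broken alphabetically)."""
    tf = term_frequencies(tokens)
    return sorted(tf, key=lambda word: (-tf[word], word))[:k]


tokens = ["music", "episode", "music", "season", "music", "episode"]
tf = term_frequencies(tokens)
keywords = top_keywords(tokens, 2)
print(tf["music"])  # 0.5
print(keywords)     # ['music', 'episode']
```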

Sub-step S214, configuring weights for the keywords;

Configuring weights is a mechanism that allows each keyword to influence the text features to a different degree.

In one embodiment, the weight of the keyword may be determined by the IDF (inverse document frequency).

IDF is an index used to measure the weight of a keyword: $\mathrm{IDF} = \log(D / D_w)$, where $D$ is the total number of articles and $D_w$ is the number of articles in which the keyword appears.
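The IDF formula can be sketched as follows. The text does not state the base of the logarithm or how a keyword that appears in no article is handled; the natural logarithm and the zero fallback used here are assumptions, as is the function name `inverse_document_frequency`.

```python
import math


def inverse_document_frequency(keyword, documents):
    """Compute IDF = log(D / D_w) for a keyword.

    documents is a list of word sets; D is their count and D_w is the
    number of documents that contain the keyword.
    """
    total = len(documents)
    containing = sum(1 for doc in documents if keyword in doc)
    if containing == 0:
        return 0.0  # assumption: an unseen keyword gets weight 0
    return math.log(total / containing)  # assumption: natural logarithm


documents = [
    {"music", "episode"},
    {"music", "season"},
    {"banana"},
    {"music"},
]
idf_music = inverse_document_frequency("music", documents)    # log(4/3)
idf_banana = inverse_document_frequency("banana", documents)  # log(4)
```

A word that appears in most documents, such as "music" here, receives a lower weight than a rare word such as "banana".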

Sub-step S215, setting the weight of the keyword to the component of the anchor text anchor.

In the embodiment of the present invention, the anchor text is converted into an N-dimensional vector whose components are the weights of the keywords, so that the similarity calculation can be performed.

For example, the anchor text A can be expressed as $A = (a_1, a_2, a_3, \dots, a_n)$ and the anchor text B as $B = (b_1, b_2, b_3, \dots, b_n)$, where $a_1, a_2, a_3, \dots, a_n$ are the components of A and $b_1, b_2, b_3, \dots, b_n$ are the components of B.
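To compare two anchor texts, both vectors need the same dimensions, so a keyword that is missing from one anchor text gets the component 0. The sketch below shows one way to do this; the keyword names, the weights, and the helpers `build_vocabulary` and `to_vector` are hypothetical.

```python
def build_vocabulary(*weight_maps):
    """Collect the keywords of all anchor texts in first-seen order."""
    vocabulary = []
    for weights in weight_maps:
        for word in weights:
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def to_vector(weights, vocabulary):
    """Use each keyword's weight as a component; missing keywords get 0."""
    return [weights.get(word, 0) for word in vocabulary]


# Hypothetical keyword weights for two anchor texts.
weights_a = {"music": 30, "episode": 20, "season": 20, "song": 10}
weights_b = {"music": 40, "season": 30, "song": 20, "ending": 10}

vocabulary = build_vocabulary(weights_a, weights_b)
vector_a = to_vector(weights_a, vocabulary)
vector_b = to_vector(weights_b, vocabulary)
print(vector_a)  # [30, 20, 20, 10, 0]
print(vector_b)  # [40, 0, 30, 20, 10]
```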

Sub-step S22, calculating the similarity between the vectorized anchor text anchors.

In a specific implementation, the cosine of the angle between the vectors of the anchor texts (its physical meaning is the cosine of the spatial angle between the two vectors) may be calculated as the similarity between the anchor texts.

For example, for $A = (a_1, a_2, a_3, \dots, a_n)$ and $B = (b_1, b_2, b_3, \dots, b_n)$, the cosine of the angle between the vectors $(a_1, a_2, a_3, \dots, a_n)$ and $(b_1, b_2, b_3, \dots, b_n)$ can be calculated and used as the similarity between the anchor text A and the anchor text B.

An example of calculating the similarity of the cosine of the included angle is as follows:

$$\mathrm{sim}(A,B) = \frac{a_1 b_1 + a_2 b_2 + a_3 b_3 + \dots + a_n b_n}{\sqrt{a_1^2 + a_2^2 + a_3^2 + \dots + a_n^2}\,\sqrt{b_1^2 + b_2^2 + b_3^2 + \dots + b_n^2}}$$

Where $\mathrm{sim}(A, B)$ represents the similarity between the anchor text A and the anchor text B, and $\sqrt{\cdot}$ denotes the square root.

Assume that the components (weights) of the anchor text A are 30, 20, 20, and 10, and that the components (weights) of the anchor text B are 40, 30, 20, and 10. Over the combined set of keywords, the vector of the anchor text A is represented as A=(30,20,20,10,0) and the vector of the anchor text B as B=(40,0,30,20,10). The similarity between the anchor text A and the anchor text B calculated according to the above formula is then approximately 0.86.
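The worked example can be checked numerically. The sketch below implements the cosine formula as written; the function name `cosine_similarity` and the zero result for an all-zero vector are assumptions added for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


A = (30, 20, 20, 10, 0)
B = (40, 0, 30, 20, 10)
similarity = cosine_similarity(A, B)
print(round(similarity, 2))  # 0.86
```

Here the dot product is 2000 and the norms are sqrt(1800) and sqrt(3000), which gives 2000 / 2323.79, or about 0.86.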

Step 104: Identify an enriched URL from the candidate URL according to the similarity.

In a specific implementation, the more similar the web page content, the higher the similarity. When the similarity is greater than a preset similarity threshold, the candidate URL is confirmed to be an enriched URL; that is, URLs whose similarity exceeds a certain threshold can be regarded as URLs with the same or similar content (i.e., enriched URLs).

For example, the anchor texts of the URLs crawled at zhidao.***.com all relate to the music in episode 14 of the fifth season of XXX, so these URLs can be considered enriched URLs.
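Step 104 can be sketched as a pairwise comparison against the threshold. The text does not specify how the pairs are formed, so the rule used here, flagging a candidate when its similarity to at least one other candidate exceeds the threshold, is an assumption, as are the URL labels, vectors, threshold value, and function names.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def identify_enriched(vectors_by_url, threshold):
    """Flag every candidate URL whose anchor-text vector is more similar
    than the threshold to that of at least one other candidate."""
    enriched = set()
    for (url_a, vec_a), (url_b, vec_b) in combinations(vectors_by_url.items(), 2):
        if cosine(vec_a, vec_b) > threshold:
            enriched.update((url_a, url_b))
    return enriched


# Hypothetical candidates and anchor-text vectors.
vectors_by_url = {
    "u1": (30, 20, 20, 10, 0),
    "u2": (40, 0, 30, 20, 10),
    "u3": (0, 0, 0, 0, 50),
}
enriched = identify_enriched(vectors_by_url, threshold=0.8)
print(sorted(enriched))  # ['u1', 'u2']
```

Comparing all pairs is quadratic in the number of candidates, so a production system would likely need a cheaper grouping step; that is outside what the text describes.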

In an optional embodiment of the present invention, the method may further include the following steps:

Step 105: Select a target URL from the enriched URL.

In a specific implementation, some or all of the enriched URLs may be selected as target URLs according to a certain policy.

In an optional embodiment of the invention, step 105 may include the following sub-steps:

Sub-step S31, acquiring the degree of attention of the enriched URL;

Sub-step S32, selecting a target URL from the enriched URL based on the degree of interest.

The degree of attention may be the degree to which users pay attention to the URL. For example, it may be the number of recommendations (e.g., "to force", "like", etc.) that the web page corresponding to the URL has received; the more recommendations, the higher the degree of attention.

Web pages whose URLs attract a high degree of attention are generally of higher quality. Therefore, in the embodiment of the present invention, enriched URLs with a high degree of attention may be selected as target URLs, for example the enriched URLs whose degree of attention is higher than a preset attention threshold, or the one or more enriched URLs ranked highest by degree of attention.
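The two selection policies mentioned above, a threshold on the degree of attention and a top-ranked cut, can be sketched as follows. The attention counts, URL labels, and function names are hypothetical.

```python
def select_by_threshold(attention_by_url, threshold):
    """Keep enriched URLs whose degree of attention exceeds the threshold."""
    return [url for url, score in attention_by_url.items() if score > threshold]


def select_top_k(attention_by_url, k):
    """Keep the k enriched URLs with the highest degree of attention."""
    ranked = sorted(attention_by_url, key=attention_by_url.get, reverse=True)
    return ranked[:k]


# Hypothetical recommendation counts for three enriched URLs.
attention_by_url = {"u1": 12, "u2": 3, "u3": 47}

above_threshold = select_by_threshold(attention_by_url, threshold=10)
best = select_top_k(attention_by_url, k=1)
print(above_threshold)  # ['u1', 'u3']
print(best)             # ['u3']
```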

Step 106: Grab a webpage corresponding to the target URL;

In practical applications, the basic workflow of crawling web pages by web crawlers is as follows:

1. Select the target URL;

2. Put the target URL into the queue to be crawled;

3. Retrieve the target URL from the queue to be crawled, resolve it through DNS (Domain Name System) to obtain the IP (Internet Protocol) address of the host, access that IP address, download the web page corresponding to the target URL, and store it in the downloaded web page library.

In addition, the target URL is placed in the crawled URL queue.
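The workflow above can be sketched as a simple queue loop. To keep the example runnable without network access, DNS resolution and downloading are passed in as functions and replaced by stubs; the names `crawl`, `fake_resolve`, and `fake_download`, the host `zhidao.example.com`, and the documentation-range IP address are all assumptions.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(target_urls, resolve, download):
    """Process target URLs in order: resolve the host, download the page,
    store it, and record the URL as crawled."""
    to_crawl = deque(target_urls)  # queue of URLs waiting to be crawled
    page_store = {}                # downloaded web page library
    crawled = []                   # crawled URL queue
    while to_crawl:
        url = to_crawl.popleft()
        ip_address = resolve(urlparse(url).hostname)
        page_store[url] = download(ip_address, url)
        crawled.append(url)
    return page_store, crawled


def fake_resolve(host):
    """Stub for DNS resolution; always returns a documentation address."""
    return "192.0.2.1"


def fake_download(ip_address, url):
    """Stub for the HTTP download; returns placeholder page content."""
    return f"<html>page for {url} from {ip_address}</html>"


targets = [
    "http://zhidao.example.com/question/1.html",
    "http://zhidao.example.com/question/2.html",
]
page_store, crawled = crawl(targets, fake_resolve, fake_download)
```

In a real crawler the stubs would be replaced by an actual resolver and HTTP client, along with politeness delays and error handling that the text does not cover.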

Step 107: Generate an index file by using the webpage.

The search engine search process is generally divided into two parts, one is the front-end user request process, and the other is the back-end production data process.

First, the front-end user request process is roughly as follows:

1. Receiving a request: receiving a search keyword input by a user in a search engine;

2. Query word analysis: perform word segmentation on the search keywords;

3. Search: According to the result of the word segmentation, search for the webpage information related to the word segmentation result from the pre-made index file (such as the inverted index);

4. Sorting: Sorting related webpage information according to dimensions such as content relevance and timeliness;

5. Presentation: Display the sorted webpage information on the search engine's result page.

Second, the back-end production data process:

1. Web crawling: use web crawler technology to capture various types of web pages and save them.

2. Index production: Analyze the network information that has been captured and saved, such as word segmentation of the page title and page text, and create an index file (such as an inverted index) according to the word segmentation result, which is used by the front-end user request process.

In the embodiment of the present invention, the web page record may be written into an index file (such as an inverted index) for use in search engine retrieval.

Taking the inverted index as an example: the inverted index arises from practical applications in which records must be found according to the values of their attributes. Each item in the index table includes an attribute value and the addresses of the records having that attribute value. Because the position of a record is determined by the attribute value, rather than the attribute value being determined by the record, it is called an inverted index. A file with an inverted index is called an inverted index file, or simply an inverted file.

In an inverted file, the index objects are the words in a document or collection of documents (such as web pages), and the file stores the locations of those words within the document or collection of documents. It is one of the most commonly used indexing mechanisms for documents or collections of documents.

Taking English as an example, the following is the text information in the web pages to be indexed:

T1=“it is what it is”;

T2=“what is it”;

T3=“it is a banana”;

The following is the inverted index:

"a": {(2, 2)}

"banana": {(2,3)}

"is": {(0,1),(0,4),(1,1),(2,1)}

"it": {(0,0),(0,3),(1,2),(2,0)}

"what": {(0,2),(1,0)}

Among them, "banana": {(2, 3)} means that "banana" appears in the text information of the third web page (T3, document number 2) and is the fourth word of that web page (address 3).
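The inverted index for T1 to T3 can be reproduced with a few lines of code. This sketch assumes whitespace tokenization and zero-based document and word positions, which matches the example entries; the function name `build_inverted_index` is an assumption.

```python
from collections import defaultdict


def build_inverted_index(texts):
    """Map each word to the list of (document number, word position) pairs
    at which it occurs, both counted from 0."""
    index = defaultdict(list)
    for doc_id, text in enumerate(texts):
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return dict(index)


texts = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(texts)
print(index["banana"])  # [(2, 3)]
print(index["what"])    # [(0, 2), (1, 0)]
```

Looking up a query word is then a single dictionary access, which is why the front-end search step can use the pre-built index instead of scanning the pages.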

For the sake of simple description, the method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described action sequence, because according to the embodiments of the present invention some steps can be performed in other orders or at the same time. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.

Referring to FIG. 2, a block diagram of an embodiment of an apparatus for identifying an enriched URL according to an embodiment of the present invention is shown. Specifically, the following modules may be included:

The URL extraction module 201 is adapted to extract one or more URLs;

The candidate URL selection module 202 is adapted to select candidate URLs from the one or more URLs; each candidate URL is associated with each anchor text anchor;

The similarity calculation module 203 is adapted to calculate a similarity between the anchor text anchors;

The enriched URL identification module 204 is adapted to identify the enriched URL from the candidate URLs based on the similarity.

In an optional embodiment of the present invention, the candidate URL selection module 202 may further be adapted to:

Determining whether the URL matches a pattern pattern; if so, selecting the URL as a candidate URL.

In an optional embodiment of the present invention, the similarity calculation module 203 may further be adapted to:

Perform vectorization processing on the anchor text;

Calculate the similarity between the vectorized anchor texts.

In an optional embodiment of the present invention, the similarity calculation module 203 may further be adapted to:

Perform word segmentation on the anchor text to obtain text segments;

Determine keywords from the text segments;

Configure weights for the keywords;

Set the weights of the keywords as the components of the anchor text.

In an optional embodiment of the present invention, the similarity calculation module 203 may further be adapted to:

Filter out invalid words from the text segments;

The invalid words include one or more of the following:

Adverbs, auxiliary words, symbols, punctuation, and garbled characters.

In an optional embodiment of the present invention, the similarity calculation module 203 may further be adapted to:

Calculate the cosine of the angle between the vectors of the anchor texts as the similarity between the anchor texts.

In an optional embodiment of the present invention, the enriched URL identification module 204 may further be adapted to:

When the similarity is greater than a preset similarity threshold, confirm that the candidate URL is an enriched URL.

In an optional embodiment of the invention, the device may further comprise the following modules:

A target URL selection module adapted to select a target URL from the enriched URL.

In an optional embodiment of the present invention, the target URL selection module may further be adapted to:

Obtaining the attention degree of the enriched URL;

The target URL is selected from the enriched URL based on the degree of interest.

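In an optional embodiment of the present invention, the device may further comprise the following modules:

A web page crawling module, adapted to crawl the web page corresponding to the target URL;

An index file generating module, adapted to generate an index file by using the web page.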

For the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment.

The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components of the enriched URL identification device in accordance with embodiments of the present invention. The invention can also be implemented as a device or apparatus program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein. Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

For example, Figure 3 illustrates a computing device, such as an application server, that can implement the identification of an enriched URL in accordance with the present invention. The computing device conventionally includes a processor 310 and a computer program product or computer readable medium in the form of a memory 320. The memory 320 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM. The memory 320 has a storage space 330 for program code 331 for performing any of the method steps described above. For example, the storage space 330 for program code may include various program codes 331 for respectively implementing the various steps in the above methods. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 4. The storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 320 in the computing device of FIG. 3. The program code can, for example, be compressed in an appropriate form. Typically, the storage unit includes computer readable code 331', that is, code readable by a processor such as 310, which, when run by a computing device, causes the computing device to perform each step of the methods described above.

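Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. In addition, it is noted that the phrase "in one embodiment" is not necessarily referring to the same embodiment.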

In the description provided herein, numerous specific details are set forth. However, it is understood that the embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of the description.

It is to be noted that the above-described embodiments are illustrative of the invention and are not intended to be limiting, and that the invention may be devised without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as a limitation. The word "comprising" does not exclude the presence of the elements or steps that are not recited in the claims. The word "a" or "an" The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means can be embodied by the same hardware item. The use of the words first, second, and third does not indicate any order. These words can be interpreted as names.

In addition, it should be noted that the language used in the specification has been selected principally for readability and instructional purposes, and not to delineate or limit the subject matter of the invention. Therefore, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims

A method for identifying an enriched URL, comprising the steps of:

extracting one or more URLs;

selecting candidate URLs from the one or more URLs, each candidate URL being associated with a respective anchor text;

calculating a similarity between the anchor texts; and

identifying an enriched URL from the candidate URLs based on the similarity.
The method of claim 1, wherein the step of selecting candidate URLs from the one or more URLs comprises:

determining whether a URL matches a pattern and, if so, selecting the URL as a candidate URL.
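The pattern-matching selection recited above can be illustrated with a minimal sketch. The claims do not say what the pattern looks like, so the regular expression, the function name select_candidates, and the example.com URLs below are all assumptions made for illustration only.

```python
import re

# Hypothetical pattern: detail pages of the form /item/<digits>.html.
# The patent text does not define the pattern; this is an assumption.
CANDIDATE_PATTERN = re.compile(r"^https?://[^/]+/item/\d+\.html$")


def select_candidates(urls, pattern=CANDIDATE_PATTERN):
    """Return the URLs that match the pattern, preserving input order."""
    return [url for url in urls if pattern.match(url)]


urls = [
    "http://example.com/item/1001.html",
    "http://example.com/about",
    "http://example.com/item/1002.html",
]
candidates = select_candidates(urls)
```

In this sketch a URL that does not match is simply dropped from further processing.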
The method according to claim 1 or 2, wherein the step of calculating the similarity between the anchor texts comprises:

performing vectorization processing on the anchor texts; and

calculating the similarity between the vectorized anchor texts.
The method of claim 3, wherein the step of performing vectorization processing on the anchor text comprises:

performing word segmentation on the anchor text to obtain text segments;

determining keywords from the text segments;

configuring weights for the keywords; and

setting the weight of each keyword as a component of the vector of the anchor text.
The method of claim 3, wherein the step of performing vectorization processing on the anchor text further comprises:

filtering out invalid words from the text segments;

wherein the invalid words include one or more of the following:

adverbs, auxiliary words, symbols, punctuation, and garbled characters.
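The vectorization steps recited above (word segmentation, filtering of invalid words, keyword selection, and weighting) might be sketched as follows. Several choices here are assumptions not found in the claims: whitespace splitting stands in for a real word segmenter, the INVALID_WORDS set is a toy list, every surviving token is treated as a keyword, and the weight is the plain term frequency.

```python
import re
from collections import Counter

# Assumed stand-in for the "invalid words" (adverbs, auxiliary words, ...);
# a real list would be language-specific.
INVALID_WORDS = {"very", "the", "of", "a"}


def segment(anchor_text):
    """Split on whitespace; a placeholder for a real word segmenter."""
    return anchor_text.lower().split()


def is_valid(token):
    """Keep tokens that contain a word character and are not invalid words."""
    return token not in INVALID_WORDS and re.search(r"\w", token) is not None


def vectorize(anchor_text):
    """Map an anchor text to a sparse vector {keyword: weight}."""
    # Strip surrounding punctuation so "review!" and "review" coincide.
    tokens = [t.strip(".,;:!?\"'()") for t in segment(anchor_text)]
    keywords = [t for t in tokens if t and is_valid(t)]
    return dict(Counter(keywords))  # weight = term frequency (assumption)


vector = vectorize("The very best phone , phone review !")
```

Other weighting schemes, such as TF-IDF, would fit the same structure; the claims leave the choice open.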
The method of claim 3, wherein the step of calculating the similarity between the vectorized anchor texts comprises:

calculating a cosine value between the components of the anchor texts as the similarity between the anchor texts.
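As an illustration of the cosine computation recited above, the following sketch treats each vectorized anchor text as a sparse mapping from keyword to weight. The function name, the sample vectors, and the convention of returning 0.0 for an empty vector are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors {keyword: weight}."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


a = {"phone": 2, "review": 1}
b = {"phone": 1, "price": 1}
sim = cosine_similarity(a, b)  # equals 2 / sqrt(5 * 2)
```

With non-negative weights the result lies between 0 (no shared keywords) and 1 (same direction).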
The method of claim 1 or 2 or 4 or 5 or 6, wherein the step of identifying an enriched URL from the candidate URLs according to the similarity comprises:

when the similarity is greater than a preset similarity threshold, confirming the candidate URL to be an enriched URL.
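One way to read the threshold test recited above is sketched below. The claims do not state which anchor texts are compared with which, so this sketch assumes a pairwise comparison in which both URLs of any pair exceeding the threshold are marked as enriched. The threshold value 0.7, the whitespace tokenization, and all names are likewise assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def identify_enriched(anchors_by_url, threshold):
    """Return candidate URLs whose anchor text resembles another candidate's."""
    vectors = {url: Counter(text.lower().split())
               for url, text in anchors_by_url.items()}
    enriched = set()
    for (url_a, vec_a), (url_b, vec_b) in combinations(vectors.items(), 2):
        if cosine(vec_a, vec_b) > threshold:  # strictly greater, as recited
            enriched.update((url_a, url_b))
    return enriched


anchors_by_url = {
    "http://example.com/item/1.html": "cheap phone online store",
    "http://example.com/item/2.html": "cheap phone online shop",
    "http://example.com/item/3.html": "weather forecast today",
}
enriched = identify_enriched(anchors_by_url, threshold=0.7)
```

Here the first two anchor texts share three of four words (cosine 0.75), so both URLs are marked, while the third is left out.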
The method of claim 1, further comprising the step of:

selecting a target URL from the enriched URLs.
The method of claim 8, wherein the step of selecting a target URL from the enriched URLs comprises:

obtaining the attention degree of each enriched URL; and

selecting the target URL from the enriched URLs based on the attention degree.
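The selection by attention degree recited above could look like the following sketch. The claims do not define how the attention degree is measured, so it is assumed here to be a numeric score per URL (for example, a click count), and the function simply keeps the highest-scoring URLs. The name select_targets and the parameter top_k are hypothetical.

```python
def select_targets(attention_by_url, top_k=1):
    """Return the top_k enriched URLs ranked by attention degree, highest first."""
    ranked = sorted(attention_by_url.items(),
                    key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:top_k]]


# Assumed scores; the source of the attention degree is not specified.
attention_by_url = {
    "http://example.com/item/1.html": 120,
    "http://example.com/item/2.html": 45,
}
targets = select_targets(attention_by_url)
```

Only the selected target URLs would then be passed on for crawling, so near-duplicate pages behind the remaining enriched URLs are not fetched.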
The method of claim 8 or 9, further comprising the steps of:

crawling the webpage corresponding to the target URL; and

generating an index file by using the webpage.
An apparatus for identifying an enriched URL, comprising:

a URL extraction module adapted to extract one or more URLs;

a candidate URL selection module adapted to select candidate URLs from the one or more URLs, each candidate URL being associated with a respective anchor text;

a similarity calculation module adapted to calculate a similarity between the anchor texts; and

an enriched URL identification module adapted to identify an enriched URL from the candidate URLs based on the similarity.
The apparatus according to claim 11, wherein the candidate URL selection module is further adapted to:

determine whether a URL matches a pattern and, if so, select the URL as a candidate URL.
The apparatus according to claim 11 or 12, wherein the similarity calculation module is further adapted to:

perform vectorization processing on the anchor texts; and

calculate the similarity between the vectorized anchor texts.
The apparatus of claim 13, wherein the similarity calculation module is further adapted to:

perform word segmentation on the anchor text to obtain text segments;

determine keywords from the text segments;

configure weights for the keywords; and

set the weight of each keyword as a component of the vector of the anchor text.
The apparatus of claim 13, wherein the similarity calculation module is further adapted to:

filter out invalid words from the text segments;

wherein the invalid words include one or more of the following:

adverbs, auxiliary words, symbols, punctuation, and garbled characters.
The apparatus of claim 13, wherein the similarity calculation module is further adapted to:

calculate a cosine value between the components of the anchor texts as the similarity between the anchor texts.
The apparatus of claim 11 or 12 or 14 or 15 or 16, wherein the enriched URL identification module is further adapted to:

when the similarity is greater than a preset similarity threshold, confirm the candidate URL to be an enriched URL.
The apparatus of claim 11, further comprising:

a target URL selection module adapted to select a target URL from the enriched URLs.
The apparatus of claim 18, wherein the target URL selection module is further adapted to:

obtain the attention degree of each enriched URL; and

select the target URL from the enriched URLs based on the attention degree.
The apparatus of claim 18 or 19, further comprising:

a webpage crawling module adapted to crawl the webpage corresponding to the target URL; and

an index file generating module adapted to generate an index file by using the webpage.
A computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform the method for identifying an enriched URL according to any one of claims 1-10.
A computer readable medium storing the computer program of claim 21.