CN105260469A - Sitemap processing method, apparatus and device - Google Patents

Publication number
CN105260469A
Authority
CN
China
Prior art keywords
site maps
link
page
keyword
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510676894.0A
Other languages
Chinese (zh)
Other versions
CN105260469B (en)
Inventor
梁捷
梁卡喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Shenma Mobile Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shenma Mobile Information Technology Co Ltd
Priority to CN201510676894.0A (patent CN105260469B)
Publication of CN105260469A
Priority to PCT/CN2016/102215 (patent WO2017063596A1)
Application granted
Publication of CN105260469B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The present invention discloses a sitemap processing method, apparatus and device. The method comprises: obtaining a sitemap of a website according to preset information; obtaining a link of a page in the sitemap, and carrying out access; deleting links which influence search inclusion in the sitemap according to an access result; and generating a new sitemap. According to the technical scheme provided by the present invention, the quality of the sitemap can be enhanced, the possibility of inclusion by a search engine also can be increased, and the respective needs of the website and the search engine are met.

Description

Method, apparatus and device for processing a sitemap
Technical field
The present invention relates to the field of mobile internet technology, and in particular to a method, apparatus and device for processing a sitemap.
Background
At present, a search engine usually discovers web pages through links inside a website and links on other websites. A sitemap makes it easy for a website to tell a search engine which of its pages are available for crawling. In its simplest form, a sitemap is an XML (Extensible Markup Language) file that lists the URLs of a website together with metadata about each URL (the time of the last update, the change frequency, and the importance relative to other URLs on the site), so that a search engine can crawl the site's content more intelligently. Put simply, a sitemap can be understood as a list of a website's links. Generating a sitemap and submitting it to a search engine makes the site's content easier to index, including deeply buried pages, and is a good way for a website and a search engine to communicate.
However, the links in the sitemaps that many websites currently provide may have quality problems, such as broken links, low-quality linked content, or content that is not updated in time. All of these waste the search engine's crawling resources. As a result, even though a website provides a sitemap, the search engine, depending on what it finds when crawling, will not necessarily index the sitemap's links. Such links may also trigger the search engine's ranking-demotion rules, reducing the number of the site's links that are indexed and lowering the site's search ranking.
Therefore, existing sitemap processing methods cannot meet the respective needs of websites and search engines.
Summary of the invention
To solve the above technical problem, the present invention provides a method, apparatus and device for processing a sitemap that can meet the respective needs of websites and search engines.
According to one aspect of the present invention, a method for processing a sitemap is provided, comprising:
Obtaining the sitemap of a website according to preset information;
Obtaining the links of the pages in the sitemap and accessing them;
Deleting, according to the access results, the links in the sitemap that affect search indexing;
Generating a new sitemap.
Preferably, after obtaining the links of the pages in the sitemap and accessing them, the method further comprises:
Extracting keywords and a body-text feature value from each accessed page;
Deleting the links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values.
Preferably, deleting, according to the access results, the links in the sitemap that affect search indexing comprises:
Deleting the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or
Deleting the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or
Deleting the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or
Deleting the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
Preferably, deleting the links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values comprises:
When the comparison shows that the extracted keywords and body-text feature value are identical to pre-stored keywords and a pre-stored body-text feature value, judging that the content is a duplicate submission and deleting the corresponding link.
Preferably, the method further comprises:
Providing the generated new sitemap for access by a search engine.
Preferably, the method further comprises:
Recording the indexing data produced when the search engine indexes pages after accessing the new sitemap.
According to another aspect of the present invention, an apparatus for processing a sitemap is provided, comprising:
An acquisition module, configured to obtain the sitemap of a website according to preset information;
An access module, configured to obtain, from the sitemap obtained by the acquisition module, the links of the pages in the sitemap and to access them;
A first processing module, configured to delete, according to the access results of the access module, the links in the sitemap that affect search indexing;
A generation module, configured to generate a new sitemap after the first processing module has finished processing.
Preferably, the apparatus further comprises:
A second processing module, configured to extract keywords and a body-text feature value from each accessed page, and to delete the links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values;
The generation module generates the new sitemap after the first processing module and the second processing module have finished processing.
Preferably, the apparatus further comprises:
An output module, configured to provide the new sitemap generated by the generation module for access by a search engine.
Preferably, the apparatus further comprises:
A monitoring module, configured to record the indexing data produced when the search engine indexes pages after accessing the new sitemap.
Preferably, the first processing module comprises:
A first deletion unit, configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or
A second deletion unit, configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or
A third deletion unit, configured to delete the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or
A fourth deletion unit, configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
According to another aspect of the present invention, a processing device is provided, comprising:
A memory, configured to store a program; and
A processor, configured to execute the following program stored in the memory:
Obtaining the sitemap of a website according to preset information;
Obtaining the links of the pages in the sitemap and accessing them;
Deleting, according to the access results, the links in the sitemap that affect search indexing;
Generating a new sitemap.
It can be seen that, in the technical solution of the embodiments of the present invention, the links of the pages in the sitemap are first obtained and accessed; links found from the access results to affect search indexing are then deleted from the sitemap, and a new sitemap is generated. The website's original sitemap is thereby optimized: links with poor content or links that are prone to errors are kept out of the sitemap as far as possible, which improves the quality of the sitemap, increases the likelihood of being indexed by a search engine, and meets the needs of both the website and the search engine.
Brief description of the drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the disclosure, taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 is a schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;
Fig. 2 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;
Fig. 3 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;
Fig. 4 is a schematic block diagram of an apparatus for processing a sitemap according to the present invention;
Fig. 5 is another schematic block diagram of an apparatus for processing a sitemap according to the present invention;
Fig. 6 is a schematic block diagram of a processing device according to the present invention.
Detailed description
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure is thorough and complete and fully conveys its scope to those skilled in the art.
The present invention provides a method for processing a sitemap that can meet the respective needs of websites and search engines.
Fig. 1 is a schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 1, the method comprises:
Step 101: obtain the sitemap of a website according to preset information.
In this step, after an agreement has been reached with the website, the sitemap of the website can be obtained according to the configuration information provided by the website.
Step 102: obtain the links of the pages in the sitemap and access them.
In this step, each URL (Uniform Resource Locator) link in the sitemap is obtained, and each URL link is accessed in turn for verification.
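As an illustration of how the URL links might be read out of a sitemap, the following is a minimal Python sketch, not the implementation of the invention. It assumes the sitemap follows the sitemaps.org XML format; the function name extract_urls and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol (assumed sitemap format).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iter(f"{{{SITEMAP_NS}}}loc"):
        # Skip empty <loc/> elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical sample sitemap used only for illustration.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc><lastmod>2015-10-01</lastmod></url>"
    "<url><loc> https://example.com/news/1.html </loc></url>"
    "</urlset>"
)

if __name__ == "__main__":
    for url in extract_urls(SAMPLE):
        print(url)
```

Each extracted URL would then be accessed in turn for verification, as described in this step.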
Step 103: delete, according to the access results, the links in the sitemap that affect search indexing.
In this step, deleting the links in the sitemap that affect search indexing according to the access results comprises:
Deleting the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or
Deleting the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or
Deleting the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or
Deleting the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
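The four deletion conditions above can be sketched as a single decision function. This is only a sketch under stated assumptions: the AccessResult structure, its field names, the reason strings and the default threshold of 1 second are all hypothetical and are not specified by the invention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessResult:
    """Hypothetical summary of what was observed when a link was accessed."""
    status_code: int
    response_seconds: float
    tkd_complete: bool        # title, keywords and description all present
    body_matches_tkd: bool    # body content corresponds to the TKD


def removal_reason(result: AccessResult, max_seconds: float = 1.0) -> Optional[str]:
    """Return why a link should be deleted, or None if it should be kept."""
    if result.status_code == 404:
        return "http_404"
    if result.response_seconds >= max_seconds:
        return "slow_response"
    if not result.tkd_complete:
        return "tkd_incomplete"
    if not result.body_matches_tkd:
        return "body_tkd_mismatch"
    return None
```

A link is kept only when none of the four conditions applies; any one of them is sufficient for deletion.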
Step 104: generate a new sitemap.
In this step, after every link that affects search indexing has been deleted from the sitemap, the remaining links are reorganized and a new sitemap is generated.
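One possible way to generate the new sitemap is to drop the deleted entries from the original document and serialize what remains. The sketch below assumes a sitemaps.org-style XML sitemap; the function name filter_sitemap is hypothetical and the invention does not prescribe this approach.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol (assumed sitemap format).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def filter_sitemap(sitemap_xml: str, removed: set[str]) -> str:
    """Return a new sitemap without the <url> entries whose <loc> is in removed."""
    # Serialize with a default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(sitemap_xml)
    for entry in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = entry.find(f"{{{SITEMAP_NS}}}loc")
        if loc is not None and loc.text and loc.text.strip() in removed:
            root.remove(entry)
    return ET.tostring(root, encoding="unicode")
```

Filtering the original document, rather than rebuilding it from a list of URLs, keeps the metadata (such as the last update time) of the links that remain.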
It can be seen that, in the technical solution of the embodiments of the present invention, the links of the pages in the sitemap are first obtained and accessed; links found from the access results to affect search indexing are then deleted from the sitemap, and a new sitemap is generated. The website's original sitemap is thereby optimized: links with poor content or links that are prone to errors are kept out of the sitemap as far as possible, which improves the quality of the sitemap, increases the likelihood of being indexed by a search engine, and meets the needs of both the website and the search engine.
The technical solution of the present invention is described in further detail below.
Fig. 2 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 2, the method comprises:
Step 201: obtain the sitemap of a website according to preset information.
For this step, see the description of step 101 above.
Step 202: obtain the links of the pages in the sitemap and access them.
For this step, see the description of step 102 above.
Step 203: delete, according to the access results, the links in the sitemap that affect search indexing.
For this step, see the description of step 103 above.
Step 204: extract keywords and a body-text feature value from each accessed page.
In this step, any of various existing algorithms may be used to extract keywords from the page content and to extract a body-text feature value from the body content; the present invention places no limitation on this.
Step 205: delete the links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values.
In this step, when the comparison shows that the extracted keywords and body-text feature value are identical to pre-stored keywords and a pre-stored body-text feature value, the content is judged to be a duplicate submission and the corresponding link is deleted.
Step 206: generate a new sitemap.
Step 207: provide the generated new sitemap for access by a search engine.
In this step, the generated new sitemap may replace the website's original sitemap, so that the search engine accesses the new sitemap on the website. Alternatively, the website may be configured so that the search engine accesses the new sitemap directly on the service platform. The present invention places no limitation on this, as long as the search engine is able to access the new sitemap.
It should be noted that there is no required order between the processing of steps 202 and 203 and the processing of steps 204 and 205; the order above is used only for convenience of description.
It should also be noted that, after step 207, the method may further comprise: recording the indexing data produced when the search engine indexes pages after accessing the new sitemap.
It can be seen that the technical solution of the embodiments of the present invention can delete links in the sitemap that affect search indexing both according to the access results and according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values, which improves the optimization effect. In addition, the indexing data produced when the search engine indexes pages after accessing the new sitemap can be recorded, providing a reference for later modification of the sitemap or for analysis by the website.
Fig. 3 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 3, the method comprises:
Step 301: the sitemap service platform extracts data from the website's sitemap according to the website's configuration information.
In this step, the website and the sitemap service platform (hereinafter the service platform) reach an agreement in advance. The website sets up a mapping between its sitemap and the service platform, allowing the service platform to process the sitemap according to configuration information provided by the website, such as address information. The website may implement the mapping in XML. According to the configuration information provided by the website, the service platform can extract data from the sitemap and obtain the URL of each link in it.
Step 302: the service platform checks each URL extracted from the sitemap and judges whether accessing the URL produces an HTTP 404 error, meaning the page cannot be accessed. If so, go to step 311, delete the URL from the sitemap and record the reason; if not, go to step 303.
An HTTP 404 error means that the web page the link points to does not exist, that is, the URL of the original page is no longer valid. This happens often, for example when the rules for generating page URLs change, when a page file is renamed or moved, or when an inbound link is misspelled, so that the original URL can no longer be accessed. When a web server receives such a request, it returns a 404 status code to tell the browser that the requested resource does not exist. Therefore, when accessing a URL produces an HTTP 404 error, the URL is invalid; it is deleted from the sitemap and the reason is recorded.
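The 404 check of step 302 could be sketched with the Python standard library as follows. This is a minimal sketch, not the service platform's actual code: the names fetch_status and is_dead_link, the injectable opener parameter and the 10-second timeout are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url.

    urlopen raises HTTPError for 4xx/5xx responses, so the code is read
    from the exception in that case instead of letting it propagate.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def is_dead_link(url: str, opener=urllib.request.urlopen) -> bool:
    """True when accessing the URL yields an HTTP 404 error."""
    return fetch_status(url, opener=opener) == 404
```

The opener parameter exists only so that the check can be exercised without network access. Other failures, such as DNS errors raised as urllib.error.URLError, are deliberately not treated as 404 here and would need separate handling.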
Step 303: the service platform judges whether the page response speed of the accessed URL is abnormal. If so, go to step 311, delete the URL from the sitemap and record the reason; if not, go to step 304.
When the URL can be accessed normally, the response speed of the page is checked; it can be measured by the response time. If the response time is greater than or equal to a set threshold, the response speed is considered abnormal; if it is less than the set threshold, the response speed is considered normal. The threshold can be chosen empirically, for example 500 milliseconds or 1 second; the present invention places no limitation on this.
It should be noted that whether the response speed is abnormal can also be judged by comparing the page's historical response speed with its current response speed. If the current response time exceeds the historical response time by more than a certain threshold, the response speed can be considered abnormal.
Therefore, when the page response speed has degraded, the page corresponding to the URL or the network connection to it may have a problem, either of which affects the user's browsing experience; the URL is then deleted from the sitemap and the reason is recorded.
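Both ways of judging response speed described above, the fixed threshold and the comparison against history, can be sketched in one function. The function name is_slow, the use of the historical mean as the baseline and the default slowdown factor of 3 are assumptions for illustration; the invention only states that some threshold is used.

```python
from typing import Optional, Sequence


def is_slow(
    current_seconds: float,
    threshold_seconds: float = 1.0,
    history_seconds: Optional[Sequence[float]] = None,
    slowdown_factor: float = 3.0,
) -> bool:
    """Judge whether a page response time is abnormal."""
    # Rule 1: response time greater than or equal to the set threshold.
    if current_seconds >= threshold_seconds:
        return True
    # Rule 2 (optional): much slower than the page's own history.
    if history_seconds:
        baseline = sum(history_seconds) / len(history_seconds)
        return current_seconds > baseline * slowdown_factor
    return False
```

For example, a 0.4 second response passes a 1 second threshold but is still flagged when the page has historically answered in about 0.1 second.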
Step 304: the service platform judges whether the TKD of the page is incomplete. If so, go to step 311, delete the URL from the sitemap and record the reason; if not, go to step 305.
TKD is an abbreviation of title, keywords and description. The TKD may be written in the following format:
<title>title content goes here</title>
<meta name="keywords" content="keyword content goes here"/>
<meta name="description" content="description content goes here"/>
Keywords are terms that a webmaster sets for a given page of a website so that users can find the page through a search engine; they represent the market positioning of the website. The description, also called the "content tag", "description tag" or "content summary", reflects the main content of the page.
In general, only a complete TKD satisfies a search engine's search rules. If the TKD is incomplete and does not satisfy those rules, the search engine may not be able to find the page in searches, or may not index the linked content. Therefore, when the TKD is found to be incomplete, the URL is deleted from the sitemap and the reason is recorded.
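A TKD completeness check along these lines might be sketched with Python's standard HTML parser. The class and function names (TKDParser, extract_tkd, tkd_is_complete) are hypothetical, and treating "complete" as "all three fields non-empty" is an assumption about what the invention means by completeness.

```python
from html.parser import HTMLParser


class TKDParser(HTMLParser):
    """Collect the title, keywords and description of an HTML page."""

    def __init__(self):
        super().__init__()
        self.tkd = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.tkd[name] = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.tkd["title"] += data


def extract_tkd(html: str) -> dict:
    parser = TKDParser()
    parser.feed(html)
    parser.close()
    return {key: value.strip() for key, value in parser.tkd.items()}


def tkd_is_complete(html: str) -> bool:
    """Assumed definition: title, keywords and description are all non-empty."""
    return all(extract_tkd(html).values())
```

A page whose head contains only a title, for instance, would be reported as incomplete and its URL would go to step 311.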
Step 305: the service platform judges whether the body content of the page fails to match the TKD. If so, go to step 311, delete the URL from the sitemap and record the reason; if not, go to step 306.
In this step, the body content of the page is examined to judge whether the keywords in the TKD appear in the body and whether the body content corresponds to the title and description in the TKD. If the TKD keywords appear in the body and the body content corresponds to the title and description, the page matches; otherwise it does not match. A mismatch means that either the body or the TKD has been set incorrectly, and both affect the search quality of the search engine and the user's browsing experience. Therefore, when the body content of the page is found not to match the TKD, the URL is deleted from the sitemap and the reason is recorded.
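The invention does not specify how correspondence between the body and the TKD is measured. The sketch below is one simple possibility, assuming a substring test for keywords and a word-overlap test for the title; the function names and the min_ratio parameter are hypothetical, and text without word separators (such as Chinese) would need a proper word segmenter.

```python
import re


def split_keywords(keywords: str) -> list[str]:
    """Split a keywords meta value on ASCII or full-width commas and semicolons."""
    return [k.strip().lower() for k in re.split(r"[,，;；]", keywords) if k.strip()]


def body_matches_tkd(body: str, title: str, keywords: str, min_ratio: float = 0.5) -> bool:
    """Heuristic match: enough TKD keywords and some title word occur in the body."""
    text = body.lower()
    terms = split_keywords(keywords)
    if not terms:
        return False
    hits = sum(1 for term in terms if term in text)
    keyword_ok = hits / len(terms) >= min_ratio
    title_words = re.findall(r"\w+", title.lower())
    title_ok = any(word in text for word in title_words)
    return keyword_ok and title_ok
```

A page titled "Sitemap guide" whose body only advertises unrelated products would fail both tests and be treated as a mismatch.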
Step 306: the service platform extracts keywords from the page content and extracts a body-text feature value from the body content.
In this step, the service platform may use any of various existing algorithms to extract keywords from the page content and to extract a body-text feature value from the body content; the present invention places no limitation on this.
For example, keyword extraction may use the existing TF-IDF (term frequency-inverse document frequency) algorithm, which keeps the information for all words in a dictionary, sorts the dictionary by value, and finally takes the top-ranked words after weighting as the keywords. As another example, the body-text feature value may be extracted using text features based on a context framework or ontology-based text features.
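A minimal TF-IDF keyword extractor in the spirit of the description above could look as follows. It is a sketch under stated assumptions: the tokenizer only handles ASCII letters and digits, the smoothed IDF formula is one common variant rather than the one used by the invention, and the names tokenize and tfidf_keywords are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Very simple tokenizer (assumption): lowercase ASCII words and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(document: str, corpus: list[str], top_n: int = 3) -> list[str]:
    """Return the top_n words of document ranked by TF-IDF against corpus."""
    tokens = tokenize(document)
    if not tokens:
        return []
    term_counts = Counter(tokens)
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets)
    scores = {}  # the "dictionary" of word information, keyed by word
    for word, count in term_counts.items():
        df = sum(1 for words in doc_sets if word in words)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        scores[word] = (count / len(tokens)) * idf
    # Sort by score (descending); ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]
```

Words that are frequent in the page but rare in the reference corpus rank highest, while very common words such as "the" fall to the bottom of the list.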
Step 307: the service platform compares the extracted keywords and body-text feature value with the keywords and body-text feature values stored on the service platform and checks whether the content is a duplicate submission. If so, go to step 311, delete the URL from the sitemap and record the reason; if not, go to step 308.
In this step, comparing the extracted keywords and body-text feature value with those stored on the service platform amounts to matching on text relevance. If identical keywords and an identical body-text feature value are found on the service platform, the content is judged to be a duplicate. This matching check reveals whether content has been submitted more than once. The service platform stores in advance the keywords and body-text feature value of every page article that has already been checked.
Step 308: the service platform saves the extracted keywords, the body-text feature value and the corresponding link for use in later duplicate checks.
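Steps 307 and 308 together behave like a lookup-then-store operation. The sketch below uses a SHA-256 hash of the whitespace-normalized body as a stand-in for the body-text feature value; that choice, the in-memory dictionary and the names text_fingerprint and ContentStore are assumptions for illustration, since the invention mentions context-framework or ontology-based features and does not describe the storage.

```python
import hashlib
import re
from typing import Iterable, Optional


def text_fingerprint(body: str) -> str:
    """Stand-in feature value: hash of the whitespace-normalized, lowercased body."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ContentStore:
    """In-memory store of (keywords, feature value) pairs already checked."""

    def __init__(self):
        self._seen = {}

    def check_and_add(self, url: str, keywords: Iterable[str], body: str) -> Optional[str]:
        """Return the URL that first submitted identical content, else store and return None."""
        key = (frozenset(k.lower() for k in keywords), text_fingerprint(body))
        original = self._seen.get(key)
        if original is not None and original != url:
            return original  # duplicate submission (step 307)
        self._seen[key] = url  # saved for later duplicate checks (step 308)
        return None
```

An exact hash only detects identical content after normalization; detecting near-duplicates would require a similarity-preserving feature instead.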
Step 309: the service platform generates new sitemap data after processing and makes it available to the search engine.
In this step, the website may be configured to instruct the search engine to obtain the sitemap directly from the service platform, or the service platform may directly replace the website's original sitemap with the new one.
Step 310: the service platform monitors the indexing status of the latest sitemap data.
If a URL in the sitemap is indexed by the search engine, marker information is returned. By monitoring which URLs are indexed by the search engine, the service platform can provide a reference for later adjustments to the sitemap.
Step 311: the service platform deletes the link from the sitemap and records the reason for the website to analyze.
In this step, the reason the link was deleted can be recorded in detail for analysis by the website.
It can be seen that, in the technical solution of the embodiments of the present invention, the website's sitemap data is analyzed and filtered, the links in the sitemap are accessed and verified, and keywords and body-text feature values are extracted from the body content and matched against pre-stored keywords and body-text feature values, so that duplicate or low-quality content is not submitted. Finally, the search engine's indexing of the sitemap can also be monitored. Through this processing, the present invention can improve the quality of the sitemap, increase the number of the website's pages indexed by the search engine, let the search engine index the website's pages more effectively, solve the problem of ranking demotion caused by submitting duplicate or spam content to a search engine, and allow the state of the website's content to be monitored more closely.
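The overall flow of Fig. 3, in which each URL passes through a sequence of checks and is deleted with a recorded reason at the first check it fails, can be sketched as below. The function name process_sitemap, the Check type and the reason strings are hypothetical; the individual checks are passed in as callables so that the sketch stays independent of how each one is implemented.

```python
from typing import Callable, Optional

# A check inspects one URL and returns a removal reason, or None to keep it.
Check = Callable[[str], Optional[str]]


def process_sitemap(
    urls: list[str], checks: list[Check]
) -> tuple[list[str], list[tuple[str, str]]]:
    """Run the checks in order on every URL.

    Returns the URLs to keep and a log of (url, reason) pairs for the
    URLs that were deleted, so the website can analyze the reasons.
    """
    kept = []
    removed = []
    for url in urls:
        reason = None
        for check in checks:
            reason = check(url)
            if reason is not None:
                break  # first failing check decides, as in step 311
        if reason is None:
            kept.append(url)
        else:
            removed.append((url, reason))
    return kept, removed
```

The kept URLs form the new sitemap data of step 309, and the removal log corresponds to the reasons recorded in step 311.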
The method for processing a sitemap according to the present invention has been described in detail above. Correspondingly, the present invention also provides an apparatus for processing a sitemap.
Fig. 4 is a schematic block diagram of an apparatus for processing a sitemap according to the present invention.
As shown in Fig. 4, an apparatus for processing a sitemap comprises: an acquisition module 401, an access module 402, a first processing module 403 and a generation module 404. The apparatus for processing a sitemap of the present invention may be a service platform or another device.
The acquisition module 401 is configured to obtain the sitemap of a website according to preset information.
After an agreement has been reached with the website, the apparatus can obtain the sitemap of the website through the acquisition module 401 according to the configuration information provided by the website.
The access module 402 is configured to obtain, from the sitemap obtained by the acquisition module 401, the links of the pages in the sitemap and to access them.
The access module 402 obtains each URL link in the sitemap and accesses each URL link in turn for verification.
The first processing module 403 is configured to delete, according to the access results of the access module 402, the links in the sitemap that affect search indexing.
The first processing module 403 deletes the links in the sitemap that affect search indexing according to the various different access results.
The generation module 404 is configured to generate a new sitemap after the first processing module 403 has finished processing.
Fig. 5 is another schematic block diagram of an apparatus for processing a sitemap according to the present invention.
As shown in Fig. 5, an apparatus for processing a sitemap comprises: an acquisition module 401, an access module 402, a first processing module 403 and a generation module 404. For the function of each module, see the description of Fig. 4.
In addition, the apparatus further comprises a second processing module 405.
The second processing module 405 is configured to extract keywords and a body-text feature value from each accessed page, and to delete the links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values. The generation module 404 generates the new sitemap after the first processing module 403 and the second processing module 405 have finished processing.
When the comparison shows that the extracted keywords and body-text feature value are identical to pre-stored keywords and a pre-stored body-text feature value, the second processing module 405 judges that the content is a duplicate submission and deletes the corresponding link.
The apparatus further comprises an output module 406.
The output module 406 is configured to provide the new sitemap generated by the generation module for access by a search engine.
In the present invention, the generated new sitemap may replace the website's original sitemap, so that the search engine accesses the new sitemap on the website. Alternatively, the website may be configured so that the search engine accesses the new sitemap directly on the service platform. The present invention places no limitation on this, as long as the search engine is able to access the new sitemap.
The apparatus further comprises a monitoring module 407.
The monitoring module 407 is configured to record the indexing data produced when the search engine indexes pages after accessing the new sitemap.
The first processing module 403 comprises: a first deletion unit 4031, a second deletion unit 4032, a third deletion unit 4033 or a fourth deletion unit 4034.
The first deletion unit 4031 is configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed.
The second deletion unit 4032 is configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold.
The third deletion unit 4033 is configured to delete the corresponding link when the access result is that the title, keywords and description of the page are incomplete.
The fourth deletion unit 4034 is configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
The present invention also provides a kind for the treatment of facility.
Fig. 6 is the schematic block diagram of a kind for the treatment of facility of the present invention.
As shown in Figure 6, treatment facility comprises: storer 601 and processor 602.
Storer 601, for storage program,
Processor 602, for performing the following program that described storer 601 stores:
The site maps of website is obtained according to presupposed information;
Obtain the link of the page in site maps and conduct interviews;
The link that in site maps, impact search is included is deleted according to access result;
Generate new site maps.
It should be noted that other programs that storer 601 stores, specifically see the description in previous methods flow process, repeat no more herein, processor 602 is also for other programs of execute store 601 storage.
In sum, the technical scheme of the embodiment of the present invention, the sitemap data analysis of the website obtained is filtered, the link provided sitemap conducts interviews checking, also keyword extraction and text characteristics extraction are carried out to body matter in addition, and mate with the keyword prestored and text eigenwert, thus avoid the content submitting duplicate contents or poor quality to.Finally can also monitor the collection situation of search engine to sitemap.By above-mentioned process, the present invention can optimize the quality of sitemap, what the searched engine of lifting web site contents was included includes quantity, search engine is allowed better to include the page of website, also solve duplicate contents, rubbish contents is submitted to the problem that power falls in search that search engine causes, can also the situation of better monitoring web site contents.
The technical solution according to the present invention has been described in detail above with reference to the accompanying drawings.
In addition, the method according to the present invention can also be implemented as a computer program comprising computer program code instructions for performing the above steps defined in the method of the present invention. Alternatively, the method according to the present invention can also be implemented as a computer program product comprising a computer-readable medium on which a computer program for performing the above functions defined in the method of the present invention is stored. Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the accompanying drawings show the possible architecture, functions, and operations of systems and methods according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A method for processing a sitemap, characterized by comprising:
obtaining the sitemap of a website according to preset information;
obtaining the links of the pages in the sitemap and accessing them;
deleting, according to the access results, the links in the sitemap that affect search indexing;
generating a new sitemap.
2. The method according to claim 1, characterized in that, after obtaining the links of the pages in the sitemap and accessing them, the method further comprises:
extracting keywords and a text feature value from each accessed page;
deleting the links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and text feature value with prestored keywords and text feature values.
3. The method according to claim 1, characterized in that deleting, according to the access results, the links in the sitemap that affect search indexing comprises:
deleting the corresponding link when the access result is an HTTP 404 error; or
deleting the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or
deleting the corresponding link when the access result is that the title, keywords, and description of the page are incomplete; or
deleting the corresponding link when the access result is that the body content of the page does not match the title, keywords, and description of the page.
4. The method according to claim 2, characterized in that deleting the links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and text feature value with the prestored keywords and text feature values comprises:
when the comparison shows that the extracted keywords and text feature value are consistent with the prestored keywords and text feature values, determining that the content is a repeated submission and deleting the corresponding link.
5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
providing the generated new sitemap to a search engine for access.
6. The method according to claim 5, characterized in that the method further comprises:
recording the indexing data produced when the search engine performs search indexing after accessing the new sitemap.
7. An apparatus for processing a sitemap, characterized by comprising:
an acquisition module, configured to obtain the sitemap of a website according to preset information;
an access module, configured to obtain, from the sitemap obtained by the acquisition module, the links of the pages in the sitemap and access them;
a first processing module, configured to delete, according to the access results of the access module, the links in the sitemap that affect search indexing;
a generation module, configured to generate a new sitemap after processing by the first processing module.
8. The apparatus according to claim 7, characterized in that the apparatus further comprises:
a second processing module, configured to extract keywords and a text feature value from each accessed page, and to delete the links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and text feature value with prestored keywords and text feature values;
wherein the generation module generates the new sitemap after processing by the first processing module and the second processing module.
9. The apparatus according to claim 7, characterized in that the apparatus further comprises:
an output module, configured to provide the new sitemap generated by the generation module to a search engine for access.
10. The apparatus according to claim 9, characterized in that the apparatus further comprises:
a monitoring module, configured to record the indexing data produced when the search engine performs search indexing after accessing the new sitemap.
11. The apparatus according to any one of claims 7 to 10, characterized in that the first processing module comprises:
a first deletion unit, configured to delete the corresponding link when the access result is an HTTP 404 error; or
a second deletion unit, configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or
a third deletion unit, configured to delete the corresponding link when the access result is that the title, keywords, and description of the page are incomplete; or
a fourth deletion unit, configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords, and description of the page.
12. A processing device, characterized by comprising:
a memory, configured to store a program;
a processor, configured to execute the following program stored in the memory:
obtaining the sitemap of a website according to preset information;
obtaining the links of the pages in the sitemap and accessing them;
deleting, according to the access results, the links in the sitemap that affect search indexing;
generating a new sitemap.
CN201510676894.0A 2015-10-16 2015-10-16 A kind of method, apparatus and equipment for handling site maps Active CN105260469B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510676894.0A CN105260469B (en) 2015-10-16 2015-10-16 A kind of method, apparatus and equipment for handling site maps
PCT/CN2016/102215 WO2017063596A1 (en) 2015-10-16 2016-10-14 Method, apparatus and device for processing sitemap

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510676894.0A CN105260469B (en) 2015-10-16 2015-10-16 A kind of method, apparatus and equipment for handling site maps

Publications (2)

Publication Number Publication Date
CN105260469A true CN105260469A (en) 2016-01-20
CN105260469B CN105260469B (en) 2017-12-26

Family

ID=55100159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510676894.0A Active CN105260469B (en) 2015-10-16 2015-10-16 A kind of method, apparatus and equipment for handling site maps

Country Status (2)

Country Link
CN (1) CN105260469B (en)
WO (1) WO2017063596A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095674A (en) * 2016-06-07 2016-11-09 百度在线网络技术(北京)有限公司 A kind of website automation test method and device
WO2017063596A1 (en) * 2015-10-16 2017-04-20 广州神马移动信息科技有限公司 Method, apparatus and device for processing sitemap
CN107807937A (en) * 2016-09-09 2018-03-16 阿里巴巴集团控股有限公司 A kind of website SEO processing methods, apparatus and system
CN108255831A (en) * 2016-12-28 2018-07-06 航天信息股份有限公司 A kind of method and system for being used to generate site maps for website

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695056B (en) * 2019-03-12 2024-03-22 阿里巴巴集团控股有限公司 Page processing and page return processing methods, devices and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1486457A (en) * 2000-11-21 2004-03-31 汤姆森许可公司 System and process for mediated crawling
US20090204638A1 (en) * 2008-02-08 2009-08-13 Microsoft Corporation Automated client sitemap generation
CN102057372A (en) * 2008-04-17 2011-05-11 谷歌公司 Generating sitemaps
CN104317938A (en) * 2014-10-31 2015-01-28 北京国双科技有限公司 Webpage validation method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7769742B1 (en) * 2005-05-31 2010-08-03 Google Inc. Web crawler scheduler that utilizes sitemaps from websites
US7865497B1 (en) * 2008-02-21 2011-01-04 Google Inc. Sitemap generation where last modified time is not available to a network crawler
CN105260469B (en) * 2015-10-16 2017-12-26 广州神马移动信息科技有限公司 A kind of method, apparatus and equipment for handling site maps

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017063596A1 (en) * 2015-10-16 2017-04-20 广州神马移动信息科技有限公司 Method, apparatus and device for processing sitemap
CN106095674A (en) * 2016-06-07 2016-11-09 百度在线网络技术(北京)有限公司 A kind of website automation test method and device
CN106095674B (en) * 2016-06-07 2019-05-24 百度在线网络技术(北京)有限公司 A kind of website automation test method and device
CN107807937A (en) * 2016-09-09 2018-03-16 阿里巴巴集团控股有限公司 A kind of website SEO processing methods, apparatus and system
CN107807937B (en) * 2016-09-09 2021-11-30 阿里巴巴集团控股有限公司 Website SEO processing method, device and system
CN108255831A (en) * 2016-12-28 2018-07-06 航天信息股份有限公司 A kind of method and system for being used to generate site maps for website

Also Published As

Publication number Publication date
WO2017063596A1 (en) 2017-04-20
CN105260469B (en) 2017-12-26

Similar Documents

Publication Publication Date Title
US11176124B2 (en) Managing a search
CN101782919B (en) Web form data output method, device and form processing system
CN105260469A (en) Sitemap processing method, apparatus and device
US20070118528A1 (en) Apparatus and method for blocking phishing web page access
US9514113B1 (en) Methods for automatic footnote generation
TW200849045A (en) Web spam page classification using query-dependent data
JP2009104591A (en) Web document clustering method and system
US20150324350A1 (en) Identifying Content Relationship for Content Copied by a Content Identification Mechanism
CN102737021B (en) Search engine and realization method thereof
CN102722498A (en) Search engine and implementation method thereof
CN104750754A (en) Website industry classification method and server
US20090187516A1 (en) Search summary result evaluation model methods and systems
US20090083266A1 (en) Techniques for tokenizing urls
US9792370B2 (en) Identifying equivalent links on a page
KR20110009142A (en) Method for aggregating web feed minimizing redundancies
CN102722499A (en) Search engine and implementation method thereof
CN105138907A (en) Method and system for actively detecting attacked website
US20040034635A1 (en) Method and system for identifying and matching companies to business event information
CN104778232B (en) Searching result optimizing method and device based on long query
CN111158973B (en) Web application dynamic evolution monitoring method
CN104965902A (en) Enriched URL (uniform resource locator) recognition method and apparatus
CN111814040A (en) Maintenance case searching method and device, terminal equipment and storage medium
US9910926B2 (en) Managing searches for information associated with a message
KR20120090131A (en) Method, system and computer readable recording medium for providing search results
CN104462519A (en) Search query method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200812

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping square B radio tower 12 layer self unit 01

Patentee before: GUANGZHOU SHENMA MOBILE INFORMATION TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right