US20120278299A1 - Presenting search results for gallery web pages - Google Patents

Presenting search results for gallery web pages

Publication number
US20120278299A1
Authority
US
United States
Prior art keywords
web page
gallery
web pages
images
web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/283,007
Inventor
Yuguo Liao
Ning Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Priority to US13/283,878 (patent US8938441B2)
Assigned to GOOGLE INC. reassignment GOOGLE INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: LIAO, Yuguo, WANG, NING
Publication of US20120278299A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9038 Presentation of query results

Definitions

  • Different web pages may include different types of content. For example, a text-intensive web page contains primarily text content, while an image-intensive web page contains primarily image content.
  • one innovative aspect of the subject matter described in this specification may be embodied in a method for classifying web pages as gallery web pages or as not gallery web pages, and for presenting search results for web pages that have been classified as gallery web pages.
  • a “gallery web page” is a web page that includes multiple images and may also include text, and the principal content of which is its images.
  • One method for classifying a web page as a gallery web page includes selecting a candidate web page for analysis, and identifying one or more images from the web page. Characteristics of the web page and/or the images are evaluated against one or more predefined criteria, and a score is generated based on the evaluation. In some examples, generating the score involves counting all or some of the images included on the web page, or counting the number of images from the web page that individually satisfy the predefined criteria.
  • the candidate web page is classified as a gallery web page if the score meets a predefined threshold, or is classified as not a gallery web page if the score does not meet the predefined threshold.
  • a label or tag that designates a web page as a gallery web page is stored to identify the web pages that are classified as gallery web pages.
  • Search engines may treat web pages that are classified as gallery web pages differently than web pages that have not yet been classified, or that are classified as not gallery web pages.
  • a search result that includes a reference to a gallery web page may appear differently in a listing of search results than a search result that does not include a reference to a gallery web page.
  • a search result that includes a reference to a gallery web page may include a larger or smaller snippet of text from the gallery web page than a search result that does not include a reference to a gallery web page.
  • a search result that includes a reference to a gallery web page may include an image from the gallery web page, a description of an image from the gallery web page, a preview or thumbnail version of an image from the gallery web page, or any other visual indication that indicates that the search result references a gallery web page.
  • a search result that does not include a reference to a gallery web page may merely include information that is typically associated with web pages that are not gallery web pages, and may not include the information that would be included if the search result referenced a gallery web page.
  • a search engine may provide more relevant and interesting search results, thereby enhancing the experience of the user of the search engine.
  • providing a preview of an image from a gallery web page in a search result provides the user a useful preview or indication of the content of the gallery web page.
  • the method includes receiving a web page that includes text and one or more images, and evaluating one or more characteristics of the web page against predefined criteria.
  • the method also includes generating a score for the web page based on evaluating the characteristics of the web page against the predefined criteria, and classifying the web page as a gallery web page or as not a gallery web page when the score meets or does not meet a predefined threshold, respectively.
  • a method that includes determining, by a search engine, that a web page that is classified as a gallery web page is responsive to a search query, and selecting a gallery-web-page-specific search result format.
  • the method also includes formatting a search result that identifies the web page according to the selected, gallery-web-page-specific search result format, and providing the formatted search result that identifies the web page for display in a list of search results.
  • evaluating one or more characteristics of the web page against predefined criteria includes evaluating an area of the web page that is covered by images, against a minimum value, evaluating an amount of text that is included on the web page, against a maximum value, evaluating a quantity of images included on the web page, against a minimum value, evaluating a quantity of images of the web page that share a same Document Object Model (DOM) path, against a minimum value, or evaluating a quantity of images of the web page that are not of an excluded type of image, using a minimum value.
  • the excluded type of image includes an image that includes pornographic content or advertising content, or an image that is included in a boilerplate section of the web page.
  • evaluating one or more characteristics of the web page using predefined criteria includes evaluating a quantity of images of the web page that individually satisfy the predefined criteria, against a minimum value.
  • the predefined criteria specify a minimum altitude on the web page.
  • the web page is classified as a gallery web page if and only if the score meets the predefined threshold.
  • the method includes selecting a subset of the images, where evaluating one or more characteristics of the web page includes evaluating characteristics of the subset of the images of the web page only.
  • the method includes labeling a web page that is classified as a gallery web page, as a gallery web page, or a web page that is classified as not a gallery web page, as not a gallery web page.
  • the method includes determining, after the web page has been classified as a gallery web page, that the web page is responsive to a search query, selecting a gallery-web-page-specific search result format, and presenting a search result for the web page in a list of search results, where the search result for the web page is formatted according to the selected, gallery-web-page-specific search result format.
  • a gallery web page is a web page in which its principal content is images.
  • the formatted search result that identifies the web page includes a preview of an image from the web page.
  • the gallery-web-page-specific search result format is selected from among multiple available search result formats.
  • FIG. 1 illustrates an example web page that includes text and images.
  • FIG. 2 is a block diagram of a server system for classifying web pages.
  • FIG. 3 is a flowchart illustrating a process for classifying web pages.
  • FIG. 4 is a flowchart illustrating an example process for determining whether an image satisfies predefined criteria.
  • FIG. 5 illustrates an example web page that includes text and images.
  • FIG. 6 is an example of a HyperText Markup Language (HTML) document containing various HTML elements for displaying images.
  • FIG. 7 is a tree representation of the hierarchy of the HTML elements in the HTML document of FIG. 6.
  • FIGS. 8A-8E are different examples of search results that reference gallery web pages.
  • FIG. 1 illustrates an example web page 100 that includes text 101 and images 102 .
  • the quantity, size, content, type, order, and/or arrangement of the images 102 suggest that the principal content of the web page 100 is the images 102, rather than the text 101.
  • the web page 100 may therefore be regarded as a gallery web page.
  • the web page 100 may be automatically classified as a gallery web page because it includes characteristics that are indicative of gallery web pages.
  • the characteristics of the web page 100 or the characteristics of all or some of the images 102 of the web page 100 , may be evaluated by a classifier using any number of predefined criteria. A score may be generated based on this evaluation, where the score may be used by the classifier to determine whether the web page 100 should be classified as a gallery web page, or as not a gallery web page.
  • the characteristics of the web page 100 may be evaluated using minimum image size criteria. Because the images of a gallery web page typically cover a large area of the page, the minimum image size criteria may specify a minimum value (e.g., 5%, 10%, 25%, 33%, or 50%) representing the amount of the display area of the web page 100 that must be covered by the images 102 in order for the web page 100 to be classified as a gallery web page.
  • the maximum text amount criteria may specify a maximum value (e.g., 100 words) representing the amount of the text 101 that the web page 100 may include in order for the web page 100 to be classified as a gallery web page.
  • the web page 100 may be classified as a gallery web page based in part on the total number of images 102, or the total number of images 102 that individually meet other quantity, size, order, quality, or arrangement criteria. For instance, the web page 100 may be classified as not a gallery web page if it includes no images, or if it includes three or fewer images. Further, the web page 100 may be classified as a gallery web page based in part on the total quantity of the images 102 that are displayed in an upper part of the web page 100, or may be classified as not a gallery web page if many or all of the images 102 are displayed in an area that is close to the bottom of the web page 100. All or some subset of all of the images 102 may be subject to this evaluation.
  • the characteristics of the web page 100 may also be evaluated using type or content criteria.
  • When selecting images for evaluation, for example, certain types of images that include excluded content, e.g., pornographic content, boilerplate, advertising content, or any content that is unrelated to the principal content of a web page, may be ignored, tagged, or processed differently than other types of images. If a web page is classified as a gallery web page despite including excluded content, this excluded content may be labeled or tagged, such that the excluded content is not shown in any search results that reference the web page.
  • the characteristics of the web page 100 may be evaluated based on Document Object Model (DOM) path criteria. Because the images of a gallery web page are typically displayed together and may therefore share a same or similar DOM path, the web page 100 may be classified as a gallery web page if more than a predefined number of images in the web page 100 share a same or similar DOM path, or if more than a predefined number of images in the web page 100 that share a same or similar DOM path satisfy other criteria. To increase processing efficiency, images from the web page 100 that share a same or similar DOM path with fewer than a predefined number of images, i.e., images that are not the principal content of the web page 100, may not be evaluated against these criteria.
  • FIG. 2 is a block diagram of a server system 200 for automatically classifying web pages.
  • the server system 200 includes a server 201 that is connected to the network 230 , and that receives and processes web pages 240 .
  • a search engine crawls the web pages 240 and stores the web pages 240 in a search engine cache, and the server 201 classifies each of the web pages 240 that are stored in the search engine cache as gallery web pages or as not gallery web pages.
  • the server 201 labels the web page as a gallery web page, e.g., by associating gallery-web-page-identifying data with the web page in the cache.
  • This data may be, for example, a tag that identifies the web page as a gallery web page.
  • the data that identifies the web page as a gallery web page may be stored in association with the web page, or separately from the web page.
  • the server 201 may also generate relevant information from the web page that is to be included in a search result that references the web page.
  • relevant information may include, for example, data referencing the number of images included in the web page, a description of the images, or a thumbnail or preview image.
  • Server 201 includes a layer of hardware or firmware, including one or more processors 212 , computer readable medium 216 , a communication interface 218 that communicates with other clients over the network 230 , user interface modules 220 and any additional modules 214 .
  • the server 201 also includes specialized application modules for classifying web pages as gallery web pages, through the evaluation of characteristics of web pages, and through scoring the web pages.
  • the specialized application modules for classifying web pages as gallery web pages may include an image parser 202 , a page evaluator 204 , a boilerplate identifier 206 , an altitude calculator 208 , and an image area calculator 210 .
  • the image parser 202 is configured to identify images included on the web page.
  • the page evaluator 204, which is a type of classifier, is configured to apply criteria to the web page or the images of the web page to determine whether the web page is a gallery web page.
  • the boilerplate identifier 206 identifies and optionally excludes boilerplate content on a web page from further processing, such as by excluding images that are included in boilerplate sections of the web page.
  • the boilerplate identifier 206 may also flag images that are included in boilerplate sections, so that these images are not used for generating search results.
  • the altitude calculator 208 is configured to determine whether the location of an image is above or below a predefined absolute or relative height on the web page, and optionally to exclude images that are located above or below the predefined height.
  • the altitude calculator 208 may, for example, exclude images that are positioned in the highest or lowest 10% or 25% of a web page, or that have top or bottom edges that are within “50” or “100” pixels from the top or bottom of a web page, respectively. Images that are located below the height that is predefined by the altitude calculator 208 may also be flagged by the altitude calculator 208 , so that they are not used for generating search results.
  • the image area calculator 210 calculates a size characteristic (e.g., quantity of pixels, total height) of the images included on a web page, and compares the size characteristic with the amount of textual content (e.g., number of words) on the web page, to determine the amount of image content in relation to the amount of text content.
  • the result of the calculation of the image area calculator 210 may be used by the page evaluator 204 to classify the web page as a gallery web page or as not a gallery web page if the ratio of the size characteristic to the amount of textual content exceeds or does not exceed a predefined threshold, respectively.
  • Other modules may optionally be included on the server 201 in addition to or instead of the image parser 202 , the page evaluator 204 , the boilerplate identifier 206 , the altitude calculator 208 and the image area calculator 210 .
  • the server 201 may be a dedicated server that is used solely for classifying web pages as gallery web pages.
  • the server 201 may include or may be associated with application modules for classifying web pages as gallery web pages, and application modules that perform the functionalities associated with a crawler or a search engine.
  • One or more of these application modules may be implemented as a service that is located on another server, and that is connected to the server 201 through the network 230.
  • FIG. 3 is a flowchart illustrating a process 300 for classifying web pages.
  • the process 300 includes receiving a web page that includes text and one or more images, and evaluating one or more characteristics of the web page against predefined criteria.
  • the process 300 also includes generating a score for the web page based on evaluating the characteristics of the web page against the predefined criteria, and classifying the web page as a gallery web page or as not a gallery web page when the score meets or does not meet a predefined threshold, respectively.
  • a web page that includes text and at least one image is received ( 302 ).
  • the received web page may be, for example, an HTML document that includes text and at least one <IMG> element.
  • the characteristics of the web page are evaluated using predefined criteria ( 304 ). Evaluating the web page may include evaluating characteristics of the web page itself, or characteristics of the images included on the web page. Because gallery web pages typically include several images, one example criterion may specify a minimum quantity of images (e.g., 6 images) that should be included on the web page in order for the web page to be classified as a gallery web page. Evaluating the web page using this criterion may include counting the quantity of <IMG> elements included in an HTML document, or counting the quantity of <IMG> elements that satisfy other predefined criteria. Other example criteria are discussed with reference to FIG. 4.
  • a score is generated for the web page based on evaluating the characteristics of the web page against the predefined criteria ( 306 ).
  • the score may equal the quantity of <IMG> elements counted in the HTML document that corresponds to the web page. The score meets the threshold when it is greater than, or greater than or equal to, a predefined threshold quantity (e.g., “6”).
  • alternatively, the score is generated by counting the number of images from the web page that individually meet the predefined criteria. All of the images from the web page may be evaluated against the predefined criteria, or a subset of the images may be selected for evaluation beforehand.
  • If the score meets a predefined threshold ( 308 , “Yes”), the web page is classified as a gallery web page ( 310 ). If the score does not meet the predefined threshold ( 308 , “No”), the web page is classified as not a gallery web page ( 312 ). The web page may then be labeled or tagged with data that identifies it as a gallery web page, as unclassified, or as not a gallery web page.
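  • As a minimal sketch of the process 300, and not the implementation of the specification itself, the following Python example counts <IMG> elements using the standard-library html.parser module and compares the count with the example threshold quantity of 6. The names ImgCounter, score_page, and classify_page, and the label strings, are hypothetical and are introduced only for illustration.

```python
from html.parser import HTMLParser

MIN_IMAGES = 6  # example threshold quantity taken from the text


class ImgCounter(HTMLParser):
    """Counts <img> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> and <img> both match.
        if tag == "img":
            self.count += 1


def score_page(html: str) -> int:
    """Score = quantity of <img> elements found in the document (step 306)."""
    counter = ImgCounter()
    counter.feed(html)
    counter.close()
    return counter.count


def classify_page(html: str, threshold: int = MIN_IMAGES) -> str:
    """Return a label depending on whether the score meets the threshold (308)."""
    return "gallery" if score_page(html) >= threshold else "not_gallery"
```

  • This sketch treats "meets" as "greater than or equal to"; the text also allows a strict "greater than" comparison, which would only change the operator in classify_page.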
  • FIG. 4 is a flowchart illustrating an example process 400 for determining whether an image satisfies predefined criteria.
  • the process 400 may be iteratively performed on each image included on a web page, or on the subset of images of the web page that are selected for evaluation, in order to determine a total quantity of images that satisfy the predefined criteria.
  • the total quantity of images that satisfy the predefined criteria may be used to generate the score for the web page.
  • Although the example process 400 evaluates images based on size ratio criteria, pixel quantity criteria, image altitude criteria, boilerplate content criteria, and excluded content criteria, other processes may omit certain of these criteria, may use other criteria, or may evaluate images using these same criteria but in a different order.
  • the process 400 first evaluates an image using size ratio criteria ( 410 ).
  • Size ratio criteria may be used to identify and exclude images that are tall and narrow, or short and wide. These characteristics may suggest that an image is associated with a banner ad, boilerplate content, or menu buttons.
  • the size ratio of an image is evaluated to determine whether it matches a predefined target ratio, or whether it fits within a predefined range.
  • evaluating the size ratio criteria for an image includes determining whether the width-to-height ratio of the image matches or exceeds “5:3.” Any image whose width-to-height ratio is greater than “5:3,” such as an image whose width-to-height ratio is “8:3,” will be regarded as not satisfying the size ratio criteria ( 410 , “No”), and will not be further evaluated ( 422 ).
  • Pixel quantity criteria may be used to identify and exclude small images that, while exhibiting an acceptable size ratio, may be associated with buttons, icons, or other graphics that may be unrelated to the other content of the web page itself.
  • evaluating the pixel quantity criteria includes determining whether the quantity of pixels of an image is at least 3,600 pixels.
  • Any image that has fewer than 3,600 pixels will be regarded as not satisfying the pixel quantity criteria ( 412 , “No”), and will not be further evaluated ( 422 ).
  • an image with 3,600 or more pixels, such as a 100×130 image with 13,000 pixels, will be regarded as satisfying the pixel quantity criteria ( 412 , “Yes”), and will be subject to further evaluation using additional predefined criteria.
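  • The size ratio and pixel quantity checks described above can be sketched as follows, under stated assumptions. The example values 5:3 and 3,600 come from the text; the function names are hypothetical, and treating an image whose ratio is exactly 5:3 as satisfying the criteria is an assumption, since the text only states that ratios greater than 5:3 fail.

```python
from fractions import Fraction

MAX_WIDTH_TO_HEIGHT = Fraction(5, 3)  # example ratio from the text
MIN_PIXELS = 3600                     # example pixel quantity from the text


def satisfies_size_ratio(width: int, height: int) -> bool:
    """False when the image is wider than the 5:3 example ratio (step 410)."""
    if width <= 0 or height <= 0:
        return False  # assumption: degenerate images never qualify
    # Fraction avoids floating-point rounding at the exact 5:3 boundary.
    return Fraction(width, height) <= MAX_WIDTH_TO_HEIGHT


def satisfies_pixel_quantity(width: int, height: int) -> bool:
    """True when the image has at least 3,600 pixels (step 412)."""
    return width * height >= MIN_PIXELS
```

  • The text also mentions excluding tall and narrow images; a lower bound on the ratio could be added in the same way, but no example value is given for it, so it is omitted here.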
  • the process 400 next evaluates the image using image altitude criteria ( 414 ).
  • the image altitude criteria may be used to identify and exclude images that are at the bottom of a web page, images which may be associated with boilerplate content or that may otherwise be unrelated to the other content of the web page itself.
  • the altitude of an image may be expressed and evaluated in relative terms, such as by measuring whether an image is wholly or partially positioned in the lower “5%” of a web page, or in absolute terms, such as by measuring whether an image is wholly or partially positioned within the bottom “50 pixels” of the web page or outside the top “1000” pixels of the web page.
  • FIG. 5 illustrates an example web page 500 that includes text 510 and images 520 .
  • the altitude of an image is defined by its bottom edge; the altitude 530 of the lower four images is therefore illustrated by a dotted line.
  • the web page 500 includes a visible section 540, which is an area of the web page that is within the viewable web browser window, and a non-visible section 550, which is an area of the web page 500 that is outside of the viewable web browser window. Portions of the non-visible section 550 may be made visible if the scroll bar 560 is manipulated to move the web page 500 downwards.
  • Height 570 refers to the distance from the bottom 580 of the web page 500 to the top 590 of the web page 500.
  • Height 595 refers to the distance between the bottom 580 of the web page and the altitude 530 of the lower four images.
  • the predefined threshold value is expressed as a percentage (e.g., “20%”), and evaluating the image altitude criteria for the lower four images includes determining if a ratio of the height 595 to the height 570 is above the predefined threshold value.
  • the predefined threshold value is expressed as a quantity of pixels (e.g., “50 pixels”), and evaluating the image altitude criteria for the lower four images includes determining if the height 595 exceeds the predefined threshold value.
  • any image whose altitude is not above the predefined threshold will be regarded as not satisfying the image altitude criteria ( 414 , “No”), and will not be further evaluated ( 422 ).
  • an image whose altitude is above the predefined threshold value will be regarded as satisfying the image altitude criteria ( 414 , “Yes”), and will be subject to additional evaluation.
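  • As a hedged illustration of the image altitude criteria, the following sketch computes the altitude of an image from its bottom edge and evaluates it against a relative threshold (the 20% example) or an absolute threshold (the 50 pixel example). The function names and the convention that image_bottom_y is measured downward from the top of the page are assumptions made for this example.

```python
def altitude(page_height: int, image_bottom_y: int) -> int:
    """Distance from the bottom of the page to the image's bottom edge.

    Corresponds to height 595 when page_height is height 570.
    image_bottom_y is measured downward from the top of the page.
    """
    return page_height - image_bottom_y


def satisfies_relative_altitude(page_height: int, image_bottom_y: int,
                                min_fraction: float = 0.20) -> bool:
    """True when altitude / page height is above the percentage threshold."""
    if page_height <= 0:
        return False
    return altitude(page_height, image_bottom_y) / page_height > min_fraction


def satisfies_absolute_altitude(page_height: int, image_bottom_y: int,
                                min_pixels: int = 50) -> bool:
    """True when the altitude exceeds the pixel threshold."""
    return altitude(page_height, image_bottom_y) > min_pixels
```

  • For a page that is 2,000 pixels tall, an image whose bottom edge sits 1,900 pixels from the top has an altitude of 100 pixels, so it passes the absolute 50 pixel check but fails the relative 20% check.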
  • the image is also evaluated using boilerplate content criteria ( 416 ).
  • the boilerplate content of a web page may be texts and/or images that appear on different web pages on the same web site, for example, navigational icons or hypertexts, copyright information, contact information, legal disclaimers, etc. Images that are included in boilerplate content sections are unlikely to be related to the other content of the web page itself. Determining whether the image is included in a section of the web page that is associated with boilerplate content includes providing the web page to a module that is adapted to detect boilerplate content within a web page, and receiving information from the boilerplate content detection module that identifies any potential boilerplate content.
  • If the image is in a section of the web page that has been identified as including boilerplate content ( 416 , “Yes”), the image will be regarded as not satisfying the boilerplate content criteria, and will not be further evaluated ( 422 ). If the image is not in a section of the web page that has been identified as including boilerplate content, or is in a section of the web page that has been identified as not including boilerplate content ( 416 , “No”), the image will be subject to further evaluation.
  • the image is lastly evaluated using excluded content criteria ( 418 ).
  • If an image includes excluded content (e.g., pornographic content or advertising content), the web page may be deemed to satisfy the excluded content criteria to a lesser extent, a search result format that is not specific to gallery web pages may be used even though the web page may be classified as a gallery web page, or the search result may not show the image that is determined to include the excluded content. If the image includes excluded content ( 418 , “Yes”), the image will be regarded as not satisfying the excluded content criteria, and will not be further evaluated ( 422 ). If the image does not include excluded content ( 418 , “No”), the image will be regarded as satisfying the predefined criteria associated with the process 400.
  • a score is generated based on the total quantity of images that satisfy the various predefined criteria, and the score is compared with a predefined threshold value. If the score is equal to or larger than the predefined threshold value, the web page is classified as a gallery web page. If not, the web page is classified as not a gallery web page, or is left unclassified.
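  • The overall flow of the process 400 can be sketched as an ordered chain of per-image checks followed by a count. This is an illustrative sketch only: the ImageInfo fields, the function names, and the page threshold of 6 are assumptions, the boilerplate and excluded-content flags are assumed to be supplied by separate detection modules, and the 5:3, 3,600 pixel, and 20% values are the examples given in the text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ImageInfo:
    width: int
    height: int
    altitude_fraction: float    # altitude divided by page height
    in_boilerplate: bool        # assumed output of a boilerplate detector
    has_excluded_content: bool  # assumed output of a content filter


Criterion = Callable[[ImageInfo], bool]

# Checks are listed in the order of steps 410 through 418.
CRITERIA: List[Criterion] = [
    lambda im: im.width * 3 <= im.height * 5,  # size ratio no wider than 5:3
    lambda im: im.width * im.height >= 3600,   # pixel quantity
    lambda im: im.altitude_fraction > 0.20,    # image altitude
    lambda im: not im.in_boilerplate,          # boilerplate content
    lambda im: not im.has_excluded_content,    # excluded content
]


def image_satisfies_criteria(image: ImageInfo) -> bool:
    """all() stops at the first failed check, mirroring the early exit (422)."""
    return all(check(image) for check in CRITERIA)


def classify(images: Iterable[ImageInfo], threshold: int = 6) -> str:
    """Score = quantity of qualifying images; compare with the threshold."""
    score = sum(1 for image in images if image_satisfies_criteria(image))
    return "gallery" if score >= threshold else "not_gallery"
```

  • Keeping the checks in a list makes it straightforward to omit, add, or reorder criteria, which the text notes other processes may do.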
  • the DOM path of images of a web page may be used to select images that are to be subject to evaluation using the predefined criteria, or the web page may be classified as a gallery web page based on DOM path criteria.
  • FIG. 6 is an example of an HTML document 600 containing various HTML elements for displaying images.
  • FIG. 7 is a tree representation 700 of the hierarchy of the HTML elements in the HTML document of FIG. 6.
  • a subset of the images of a web page may be selected, and only the subset of the images may be evaluated using the predefined criteria.
  • the HTML document 600 of the web page may be parsed to identify the particular DOM path of each image in the hierarchy of the HTML elements.
  • the images of “rose 1”, “rose 2”, “rose 3” and “rose 4” all have the same DOM path of “<HTML> <BODY> <TABLE> <TR> <TD>”, and the images of “rose 11” and “rose 12” have the same DOM path of “<HTML> <BODY> <TABLE> <TR> <TD> <A>”.
  • the images of “rose 1”, “rose 2”, “rose 3” and “rose 4” will be determined as belonging to a first group of images having a DOM path of “<HTML> <BODY> <TABLE> <TR> <TD>”, and the images of “rose 11” and “rose 12” as belonging to a second group of images having a DOM path of “<HTML> <BODY> <TABLE> <TR> <TD> <A>”.
  • the quantity of images in each group is determined, and is evaluated using DOM path group criteria.
  • the size of the first group is “4” and the size of the second group is “2”.
  • the sizes of the different groups of images having different DOM paths are further ordered, and the size of the largest group is compared to a predefined threshold value. If the size of the largest group is found to be equal to or larger than the predefined value, the web page is regarded as having satisfied the DOM path group criteria, and may be classified as a gallery web page. If the largest group of images having the same DOM path includes fewer images than the threshold value, the web page is regarded as having not satisfied the DOM path group criteria, and may be classified as not a gallery web page. In one implementation, this threshold value for the size of the largest group is set to four.
  • only the images in the groups having a size equal to or larger than a predefined group size may be selected as a subset for evaluation using other predefined criteria; images in the groups having a size smaller than the predefined group size may be ignored or discarded.
  • Such an approach reflects the recognition that gallery images on a gallery web page are typically similarly arranged for display during the creation of the web page, and therefore they are likely to share a same DOM path in the HTML document of the web page.
  • evaluation of a web page may include skipping certain HTML elements that do not have a significant effect on the formatting or arrangement of displayed images. For example, the pair of the HTML elements “<a>” and “</a>” simply embeds a hyperlink for the content enclosed therebetween. If an image is enclosed within these HTML elements, the image will be displayed in a similar manner as other images that do not share the same DOM path; however, the image will be selectable.
  • the DOM path of the image “rose 11” differs from the DOM path of “rose 4” only in that it has an additional “<a>” element immediately before the <IMG> element.
  • the HTML element “<a>” may be disregarded in determining the DOM path of a specific image.
  • all the images referenced in the example HTML document in FIG. 6 may be regarded as falling within the same group, having a size of “6.” If the images in this example satisfy the remaining predefined criteria, the web page will then be determined to meet the requirement on the number of images, and may be classified as a gallery web page.
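  • The DOM path grouping described above can be sketched with the standard-library html.parser module. This is a minimal sketch under stated assumptions, not the parser used by the specification: it assumes that non-void elements are explicitly closed, it joins lowercased tag names with "/" to form a path, and the names DomPathCollector, dom_path_groups, and satisfies_dom_path_group_criteria are hypothetical. The default threshold of four and the skipping of the <a> element come from the text.

```python
from collections import Counter
from html.parser import HTMLParser

# HTML void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"})


class DomPathCollector(HTMLParser):
    """Records the path of open ancestor elements for every <img>."""

    def __init__(self, ignored_tags=frozenset({"a"})):
        super().__init__()
        self.ignored_tags = ignored_tags
        self.stack = []   # currently open elements, outermost first
        self.paths = []   # one DOM path string per image

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            kept = (t for t in self.stack if t not in self.ignored_tags)
            self.paths.append("/".join(kept))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open element; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_path_groups(html: str, ignored_tags=frozenset({"a"})) -> Counter:
    """Map each DOM path to the quantity of images that share it."""
    collector = DomPathCollector(ignored_tags)
    collector.feed(html)
    collector.close()
    return Counter(collector.paths)


def satisfies_dom_path_group_criteria(html: str, min_group_size: int = 4,
                                      ignored_tags=frozenset({"a"})) -> bool:
    """True when the largest group of images reaches the threshold size."""
    groups = dom_path_groups(html, ignored_tags)
    return max(groups.values(), default=0) >= min_group_size
```

  • Passing an empty ignored_tags set keeps images wrapped in <a> elements in a separate group, which corresponds to the two groups of sizes four and two discussed above; the default merges them into one group.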
  • Additional criteria may be further applied to the web page to avoid false positives or negatives resulting from the evaluation of other criteria. For example, the total number of pixels of all the images of a web page can be determined and compared with the total area (in pixels) of the entire web page, to see if the ratio exceeds a predefined ratio, for example, 60%. If the ratio is below this predefined ratio, the images cover less than a predefined area of the entire web page, and the web page may not be classified as a gallery web page.
  • the ratio of the number of pixels of all the candidate gallery images versus the amount of textual contents displayed can also be calculated and determined to see if it is over a threshold value.
  • the amount of the textual contents displayed can be the number of words in the sections other than the boilerplate section on the web page and displayed to the user when the web page is rendered.
  • This threshold ratio can be set to “3,000:1,” for example. Any web page having a ratio of the total number of pixels of all the candidate gallery images versus the number of words on the web page equal to or higher than this threshold value can be thought of as an image-intensive web page and thereby qualified to be a gallery web page, provided that the other tests have been passed.
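  • The two ratio tests above (image area versus page area, and image pixels versus word count) can be illustrated with a short sketch. The function names and the `(width, height)` tuple representation are assumptions made for illustration. The defaults of 60% and 3,000 pixels per word follow the example values in the text, and treating a page with no words as passing the text test is also an assumption.

```python
def image_coverage_ratio(image_sizes, page_width, page_height):
    """Fraction of the page area (in pixels) covered by the images."""
    page_area = page_width * page_height
    if page_area <= 0:
        return 0.0
    image_pixels = sum(w * h for w, h in image_sizes)
    return image_pixels / page_area


def is_image_intensive(image_sizes, page_width, page_height, word_count,
                       min_coverage=0.60, min_pixels_per_word=3000):
    """Apply the area-coverage test and the pixels-per-word test together."""
    image_pixels = sum(w * h for w, h in image_sizes)
    coverage_ok = (
        image_coverage_ratio(image_sizes, page_width, page_height)
        >= min_coverage
    )
    # Assumption: a page with no counted words trivially passes the text test.
    text_ok = word_count == 0 or image_pixels / word_count >= min_pixels_per_word
    return coverage_ok and text_ok
```

  • For six 400x300 images on a 1000x1000 page, the images contribute 720,000 pixels, which is 72% of the page; with 200 words the pixel-to-word ratio is 3,600:1, so both tests pass under these example thresholds.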
  • Systems for identifying gallery web pages using any one of the implementations as set forth above can be used to assist a search engine in classifying web pages crawled from the Web as either being a gallery web page or as not being a gallery web page. Further processes can be performed to prepare the web page and the identified gallery images in the web page to be presented in a search result.
  • the total number of images can be recorded in the cache of indexed web pages, and for each image, a separate thumbnail or preview image within a predefined size range can be created and stored.
  • Preview or thumbnail images may not be prepared and stored for images that include excluded content.
  • a particular search result format that is specific to gallery web pages may present search results that include information specifying a total number of images on a particular gallery web page, thumbnails of at least a subset of these images, and a snippet of textual content of the gallery web page, if the web page is identified by a search engine in response to a particular search query.
  • FIGS. 8A-8E are different examples of search results that reference gallery web pages. For a particular search query “photos of grand canyon,” if any one of the search results is found to be a gallery web page, the number of gallery images and the thumbnail images can be displayed in different layouts, as shown in the examples in FIGS. 8A-8E .
  • thumbnail images shown in a search result may not cover all the gallery images identified for the web page. A subset of these gallery images can be selected sequentially, randomly, or in any other particular manner, to be presented in the search result.
  • navigational icons may be arranged beside these thumbnail images to assist the user in viewing these other preview images not initially shown in the search result.
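  • As a rough sketch of the gallery-specific search result described above, the following assembles the image count, a subset of thumbnails, and a text snippet. The `GalleryResult` structure, the `build_gallery_result` function, and the limit of four thumbnails are all hypothetical. The sketch also assumes one stored thumbnail per gallery image, so the list length stands in for the total image count.

```python
import random
from dataclasses import dataclass, field


@dataclass
class GalleryResult:
    url: str
    snippet: str
    total_images: int
    thumbnails: list = field(default_factory=list)
    has_more: bool = False  # True when navigation to further previews is needed


def build_gallery_result(url, snippet, thumbnail_urls, max_thumbnails=4, rng=None):
    """Select thumbnails sequentially, or randomly when an rng is supplied."""
    if rng is None:
        shown = thumbnail_urls[:max_thumbnails]
    else:
        shown = rng.sample(thumbnail_urls, min(max_thumbnails, len(thumbnail_urls)))
    return GalleryResult(
        url=url,
        snippet=snippet,
        total_images=len(thumbnail_urls),
        thumbnails=list(shown),
        has_more=len(thumbnail_urls) > len(shown),
    )
```

  • Passing `rng=random.Random(0)` would select a random subset instead of the first few thumbnails; the `has_more` flag indicates when navigational icons would be useful.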
  • Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus.
  • the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • a computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them.
  • while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially-generated propagated signal.
  • the computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).
  • the operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.
  • the term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing.
  • the apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • the apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them.
  • the apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.
  • a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment.
  • a computer program may, but need not, correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • the processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
  • a processor will receive instructions and data from a read-only memory or a random access memory or both.
  • the essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few.
  • Devices suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a
  • Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a backend component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network.
  • Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device).
  • Data generated at the client device e.g., a result of the user interaction

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for classifying web pages as gallery web pages, and for presenting search results for gallery web pages. In one aspect, a method includes receiving a web page that includes text and one or more images, evaluating one or more characteristics of the web page against predefined criteria, generating a score for the web page based on evaluating the characteristics of the web page against the predefined criteria, and classifying the web page as a gallery web page or as not a gallery web page when the score meets or does not meet a predefined threshold, respectively.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • The present application is a continuation of PCT/CN2011/073465, filed Apr. 28, 2011, titled PRESENTING SEARCH RESULTS FOR GALLERY WEB PAGES. The contents of the prior application are incorporated herein by reference in their entirety.
  • BACKGROUND
  • Different web pages may include different types of content. For example, a text-intensive web page contains primarily text content, while an image-intensive web page contains primarily image content.
  • SUMMARY
  • In general, one innovative aspect of the subject matter described in this specification may be embodied in a method for classifying web pages as gallery web pages or as not gallery web pages, and for presenting search results for web pages that have been classified as gallery web pages. Generally, a “gallery web page” is a web page that includes multiple images and may also include text, and the principal content of which is its images.
  • One method for classifying a web page as a gallery web page includes selecting a candidate web page for analysis, and identifying one or more images from the web page. Characteristics of the web page and/or the images are evaluated against one or more predefined criteria, and a score is generated based on the evaluation. In some examples, generating the score involves counting all or some of the images included on the web page, or counting the number of images from the web page that individually satisfy the predefined criteria.
  • The candidate web page is classified as a gallery web page if the score meets a predefined threshold, or is classified as not a gallery web page if the score does not meet the predefined threshold. A label or tag that designates a web page as a gallery web page is stored to identify the web pages that are classified as gallery web pages.
  • Search engines may treat web pages that are classified as gallery web pages differently than web pages that have not yet been classified, or that are classified as not gallery web pages. In one example, a search result that includes a reference to a gallery web page may appear differently in a listing of search results than a search result that does not include a reference to a gallery web page. For instance, a search result that includes a reference to a gallery web page may include a larger or smaller snippet of text from the gallery web page than a search result that does not include a reference to a gallery web page. Additionally, a search result that includes a reference to a gallery web page may include an image from the gallery web page, a description of an image from the gallery web page, a preview or thumbnail version of an image from the gallery web page, or any other visual indication that indicates that the search result references a gallery web page.
  • By contrast, a search result that does not include a reference to a gallery web page may merely include information that is typically associated with web pages that are not gallery web pages, and may not include the information that would be included if the search result referenced a gallery web page. By treating gallery web pages differently than web pages that are not gallery web pages, a search engine may provide more relevant and interesting search results, thereby enhancing the experience of the user of the search engine. Furthermore, providing a preview of an image from a gallery web page in a search result provides the user a useful preview or indication of the content of the gallery web page.
  • In general, another innovative aspect of the subject matter described in this specification may be embodied in a method for classifying web pages. The method includes receiving a web page that includes text and one or more images, and evaluating one or more characteristics of the web page against predefined criteria. The method also includes generating a score for the web page based on evaluating the characteristics of the web page against the predefined criteria, and classifying the web page as a gallery web page or as not a gallery web page when the score meets or does not meet a predefined threshold, respectively.
  • In general, another innovative aspect of the subject matter described in this specification may be embodied in a method that includes determining, by a search engine, that a web page that is classified as a gallery web page is responsive to a search query, and selecting a gallery-web-page-specific search result format. The method also includes formatting a search result that identifies the web page according to the selected, gallery-web-page-specific search result format, and providing the formatted search result that identifies the web page for display in a list of search results.
  • These and other embodiments may each optionally include one or more of the following features. For instance, evaluating one or more characteristics of the web page against predefined criteria includes evaluating an area of the web page that is covered by images, against a minimum value, evaluating an amount of text that is included on the web page, against a maximum value, evaluating a quantity of images included on the web page, against a minimum value, evaluating a quantity of images of the web page that share a same Document Object Model (DOM) path, against a minimum value, or evaluating a quantity of images of the web page that are not of an excluded type of image, using a minimum value. The excluded type of image includes an image that includes pornographic content or advertising content, or an image that is included in a boilerplate section of the web page.
  • In some examples, evaluating one or more characteristics of the web page using predefined criteria includes evaluating a quantity of images of the web page that individually satisfy the predefined criteria, against a minimum value. The predefined criteria specifies a minimum altitude on the web page. The web page is classified as a gallery web page if and only if the score meets the predefined threshold. The method includes selecting a subset of the images, where evaluating one or more characteristics of the web page includes evaluating characteristics of the subset of the images of the web page only. The method includes labeling a web page that is classified as a gallery web page, as a gallery web page, or a web page that is classified as not a gallery web page, as not a gallery web page.
  • In additional examples, the method includes determining, after the web page has been classified as a gallery web page, that the web page is responsive to a search query, selecting a gallery-web-page-specific search result format, and presenting a search result for the web page in a list of search results, where the search result for the web page is formatted according to the selected, gallery-web-page-specific search result format.
  • In other examples, a gallery web page is a web page in which its principal content is images. The formatted search result that identifies the web page includes a preview of an image from the web page. The gallery-web-page-specific search result format is selected from among multiple available search result formats.
  • The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 illustrates an example web page that includes text and images.
  • FIG. 2 is a block diagram of a server system for classifying web pages.
  • FIG. 3 is a flowchart illustrating a process for classifying web pages.
  • FIG. 4 is a flowchart illustrating an example process for determining whether an image satisfies predefined criteria.
  • FIG. 5 illustrates an example web page that includes text and images.
  • FIG. 6 is an example of a HyperText Markup Language (HTML) document containing various HTML elements for displaying images.
  • FIG. 7 is a tree representation of the hierarchy of the HTML elements in the HTML document of FIG. 6.
  • FIGS. 8A-8E are different examples of search results that reference gallery web pages.
  • Like reference numbers and designations in the various drawings indicate like elements.
  • DETAILED DESCRIPTION
  • FIG. 1 illustrates an example web page 100 that includes text 101 and images 102. The quantity, size, content, type, order, and/or arrangement of the images 102 suggests that the principal content of the web page 100 is the images 102, rather than the text 101. The web page 100 may therefore be regarded as a gallery web page.
  • The web page 100 may be automatically classified as a gallery web page because it includes characteristics that are indicative of gallery web pages. The characteristics of the web page 100, or the characteristics of all or some of the images 102 of the web page 100, may be evaluated by a classifier using any number of predefined criteria. A score may be generated based on this evaluation, where the score may be used by the classifier to determine whether the web page 100 should be classified as a gallery web page, or as not a gallery web page.
  • In one example, the characteristics of the web page 100 may be evaluated using minimum image size criteria. Because the images of a gallery web page typically cover a large area of a gallery web page, the minimum image size criteria may specify a minimum value (e.g., 5%, 10%, 25%, 33%, or 50%) representing the amount of the display area of the web page 100 that must be covered by the images 102, in order for the web page 100 to be classified as a gallery web page.
  • Another example of the predefined criteria is maximum text amount criteria. Because the principal content of gallery web pages is images rather than text, the maximum text amount criteria may specify a maximum value (e.g., 100 words) representing an amount of the text 101 that the web page 100 may include, in order for the web page 100 to be classified as a gallery web page.
  • In other examples, the web page 100 may be classified as a gallery web page based in part on the total number of images 102, or the total number of images 102 that individually meet other quantity, size, order, quality, or arrangement criteria. For instance, the web page 100 may be classified as not a gallery web page if it includes no images, or if it includes three or fewer images. Further, the web page 100 may be classified as a gallery web page based in part on the total quantity of the images 102 that are displayed in an upper part of the web page 100, or may be classified as not a gallery web page if many or all of the images 102 are displayed in an area that is close to the bottom of the web page 100. All or some subset of all of the images 102 may be subject to this evaluation.
  • The characteristics of the web page 100 may also be evaluated using type or content criteria. In selecting images for evaluation, for example, certain types of images that include excluded content, e.g., pornographic content, boilerplate, advertising content, or any content that is unrelated to the principal content of a web page, may be ignored, tagged or processed differently than other types of images. If a web page is classified as a gallery web page despite including excluded content, this excluded content may be labeled or tagged, such that the excluded content is not shown in any search results that reference the web page.
  • Furthermore, the characteristics of the web page 100 may be evaluated based on Document Object Model (DOM) path criteria. Because the images of a gallery web page are typically displayed together and may therefore share a same or similar DOM path, the web page 100 may be classified as a gallery web page if more than a predefined number of images in the web page 100 share a same or similar DOM path, or if more than a predefined number of images in the web page 100 that share a same or similar DOM path satisfy other criteria. To increase processing efficiency, images from a web page 100 that share a same or similar DOM path with fewer than a predefined number of images, i.e., images that are not the principal content of the web page 100, may not be evaluated against this criteria.
  • FIG. 2 is a block diagram of a server system 200 for automatically classifying web pages. The server system 200 includes a server 201 that is connected to the network 230, and that receives and processes web pages 240. In some implementations, a search engine crawls the web pages 240 and stores the web pages 240 in a search engine cache, and the server 201 classifies each of the web pages 240 that are stored in the search engine cache as gallery web pages or as not gallery web pages.
  • If a web page is classified as a gallery web page, the server 201 labels the web page as a gallery web page, e.g., by associating gallery-web-page-identifying data with the web page in the cache. This data may be, for example, a tag that identifies the web page as a gallery web page. The data that identifies the web page as a gallery web page may be stored in association with the web page, or separately from the web page.
  • The server 201 may also generate relevant information from the web page that is to be included in a search result that references the web page. Such relevant information may include, for example, data referencing the number of images included in the web page, a description of the images, or a thumbnail or preview image.
  • Server 201 includes a layer of hardware or firmware, including one or more processors 212, computer readable medium 216, a communication interface 218 that communicates with other clients over the network 230, user interface modules 220 and any additional modules 214. In addition to the hardware or firmware that supports the underlying functionality of the server 201, the server 201 also includes specialized application modules for classifying web pages as gallery web pages, through the evaluation of characteristics of web pages, and through scoring the web pages.
  • The specialized application modules for classifying web pages as gallery web pages may include an image parser 202, a page evaluator 204, a boilerplate identifier 206, an altitude calculator 208, and an image area calculator 210. The image parser 202 is configured to identify images included on the web page. The page evaluator 204, which is a type of classifier, is configured to apply criteria to the web page or the images of the web page to determine whether the web page is indeed a gallery web page.
  • The boilerplate identifier 206 identifies and optionally excludes boilerplate content on a web page from further processing, such as by excluding images that are included in boilerplate sections of the web page. The boilerplate identifier 206 may also flag images that are included in boilerplate sections, so that these images are not used for generating search results.
  • The altitude calculator 208 is configured to determine whether the location of an image is above or below a predefined absolute or relative height on the web page, and optionally to exclude images that are located above or below the predefined height. The altitude calculator 208 may, for example, exclude images that are positioned in the highest or lowest 10% or 25% of a web page, or that have top or bottom edges that are within “50” or “100” pixels from the top or bottom of a web page, respectively. Images that are located below the height that is predefined by the altitude calculator 208 may also be flagged by the altitude calculator 208, so that they are not used for generating search results.
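  • The altitude check performed by the altitude calculator 208 can be sketched as below. The function name `filter_by_altitude` and the `(name, top, height)` image representation are hypothetical. The text does not say how an image's position is measured, so this sketch assumes that an image is excluded when it lies entirely within the top band or begins inside the bottom band. The 10% bands follow one of the example values in the text.

```python
def filter_by_altitude(images, page_height, top_fraction=0.10, bottom_fraction=0.10):
    """Split images into (kept, flagged) lists by vertical position.

    images: iterable of (name, top, height), in pixels from the page top.
    """
    top_band_end = page_height * top_fraction
    bottom_band_start = page_height * (1 - bottom_fraction)
    kept, flagged = [], []
    for name, top, height in images:
        bottom = top + height
        in_top_band = bottom <= top_band_end        # entirely in the top band
        in_bottom_band = top >= bottom_band_start   # starts in the bottom band
        if in_top_band or in_bottom_band:
            flagged.append(name)  # not used for generating search results
        else:
            kept.append(name)
    return kept, flagged
```

  • Returning the flagged names separately mirrors the description that excluded images may also be flagged so that they are not used for generating search results.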
  • The image area calculator 210 calculates a size characteristic (e.g., quantity of pixels, total height) of the images included on a web page, and compares the size characteristic with the amount of textual content (e.g., number of words) on the web page, to determine the amount of image content in relation to the amount of text content. The result of the calculation of the image area calculator 210 may be used by the page evaluator 204 to classify the web page as a gallery web page or as not a gallery web page if the ratio of the size characteristic to the amount of textual content exceeds or does not exceed a predefined threshold, respectively. Other modules may optionally be included on the server 201 in addition to or instead of the image parser 202, the page evaluator 204, the boilerplate identifier 206, the altitude calculator 208 and the image area calculator 210.
  • In some implementations, the server 201 may be a dedicated server that is used solely for classifying web pages as gallery web pages. Alternatively, the server 201 may include or may be associated with application modules for classifying web pages as gallery web pages, and application modules that perform the functionalities associated with a crawler or a search engine. One or more of these application modules may be implemented as a service that is located on another server, and that is connected to the server 201 through the network 230.
  • FIG. 3 is a flowchart illustrating a process 300 for classifying web pages. Briefly, the process 300 includes receiving a web page that includes text and one or more images, and evaluating one or more characteristics of the web page against predefined criteria. The process 300 also includes generating a score for the web page based on evaluating the characteristics of the web page against the predefined criteria, and classifying the web page as a gallery web page or as not a gallery web page when the score meets or does not meet a predefined threshold, respectively.
  • In more detail, when the process 300 begins, a web page that includes text and at least one image is received (302). The received web page may be, for example, an HTML document that includes text and at least one <IMG> element.
  • The characteristics of the web page are evaluated using predefined criteria (304). Evaluating the web page may include evaluating characteristics of the web page itself, or characteristics of the images included on the web page. Because gallery web pages typically include several images, one example criterion may specify a minimum quantity of images (e.g., 6 images) that should be included on the web page in order for the web page to be classified as a gallery web page. Evaluating the web page using this criterion may include counting the quantity of <IMG> elements included in an HTML document, or counting the quantity of <IMG> elements that satisfy other predefined criteria. Other example criteria are discussed with reference to FIG. 4.
  • A score is generated for the web page based on evaluating the characteristics of the web page against the predefined criteria (306). In one implementation, the score may equal the quantity of <IMG> elements counted in the HTML document that corresponds to the web page, and the score meets the threshold when it is greater than, or greater than or equal to, a predefined threshold quantity (e.g., “6”).
  • In another example implementation, the score is generated by counting the number of images from the web page that individually meet the predefined criteria. All of the images from the web page may be evaluated against the predefined criteria, or a subset of the images may be selected for evaluation beforehand.
  • If the score meets a predefined threshold (308, “Yes”), the web page is classified as a gallery web page (310). If the score does not meet a predefined threshold (308, “No”), the web page is classified as not a gallery web page (312). The web page may then be labeled or tagged with data that identifies it as a gallery web page, as unclassified, or as not a gallery web page.
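  • By way of illustration only, the count-based implementation of the process 300 might be sketched in Python as follows. The names `ImgCounter` and `classify_page` are hypothetical and do not appear in this specification. The threshold of 6 is the example value given above, and treating a score equal to the threshold as meeting it is one of the two options described, chosen here as an assumption.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <IMG> elements in an HTML document (step 304)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> and <img> both match.
        if tag == "img":
            self.count += 1


def classify_page(html, threshold=6):
    """Return (score, label); the score is the <IMG> count (step 306)."""
    counter = ImgCounter()
    counter.feed(html)
    counter.close()
    score = counter.count
    # Steps 308 to 312: compare the score with the predefined threshold.
    # Assumption: a score equal to the threshold meets it.
    label = "gallery" if score >= threshold else "not gallery"
    return score, label
```

  • In this sketch every <IMG> element counts toward the score; an implementation that first filters images by other criteria would count only the images that pass.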
  • FIG. 4 is a flowchart illustrating an example process 400 for determining whether an image satisfies predefined criteria. The process 400 may be iteratively performed on each image included on a web page, or on the subset of images of the web page that are selected for evaluation, in order to determine a total quantity of images that satisfy the predefined criteria. The total quantity of images that satisfy the predefined criteria may be used to generate the score for the web page. Although the example process 400 evaluates images based on size ratio criteria, pixel quantity criteria, image altitude criteria, boilerplate content criteria, and excluded content criteria, other processes may omit certain of these criteria, may use other criteria, or may evaluate images using these same criteria but in a different order.
  • The process 400 first evaluates an image using size ratio criteria (410). Size ratio criteria may be used to identify and exclude images that are tall and narrow, or short and wide. These characteristics may suggest that an image is associated with a banner ad, boilerplate content, or menu buttons.
  • In more detail, the size ratio of an image is evaluated to determine whether it matches a predefined target ratio, or whether it fits within a predefined range. In an example implementation where the predefined target ratio is “5:3” (width-to-height), evaluating the size ratio criteria for an image includes determining whether the width-to-height ratio of the image matches or exceeds “5:3.” Any image whose width-to-height ratio is greater than “5:3,” such as an image whose width-to-height ratio is “8:3,” will be regarded as not satisfying the size ratio criteria (410, “No”), and will not be further evaluated (422).
  • Similarly, if the width-to-height ratio of an image falls below “3:5,” such as an image whose width-to-height ratio is “3:8,” it will be regarded as not satisfying the size ratio criteria (410, “No”), and will not be further evaluated (422). By contrast, an image with a size of “100×130” pixels will be regarded as satisfying the size ratio criteria (410, “Yes”), because its “10:13” size ratio lies between “3:5” and “5:3.” Such an image will be subject to further evaluation using additional predefined criteria.
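  • The size ratio test (410) can be sketched as a symmetric range check. This is a minimal sketch assuming that the example target ratio of 5:3 applies in both orientations, and that an image whose ratio exactly equals the limit still satisfies the criteria (the text above leaves that boundary case open). The name `satisfies_size_ratio` is hypothetical.

```python
from fractions import Fraction


def satisfies_size_ratio(width, height, max_ratio=Fraction(5, 3)):
    """Return True when the image is neither too wide nor too tall.

    The width-to-height ratio must lie between 1/max_ratio and max_ratio,
    i.e. between 3:5 and 5:3 for the example target ratio.
    """
    if width <= 0 or height <= 0:
        return False  # assumption: images without a valid size are rejected
    ratio = Fraction(width, height)  # exact arithmetic, no rounding error
    return 1 / max_ratio <= ratio <= max_ratio
```

  • Exact fractions are used so that a ratio sitting precisely on the limit is not misclassified by floating-point rounding.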
  • The process 400 next evaluates the image using pixel quantity criteria (412). Pixel quantity criteria may be used to identify and exclude small images that, while exhibiting an acceptable size ratio, may be associated with buttons, icons, or other graphics that may be unrelated to the other content of the web page itself. In one example implementation in which a predefined threshold value is “3,600 pixels,” evaluating the pixel quantity criteria includes determining whether the quantity of pixels of an image is at least 3,600 pixels.
  • Any image that has fewer than 3,600 pixels will be regarded as not satisfying the pixel quantity criteria (412, “No”), and will not be further evaluated (422). By contrast, an image with 3,600 pixels or more than 3,600 pixels, such as an image with 100×130, or 13,000 pixels, will be regarded as satisfying the pixel quantity criteria (412, “Yes”), and will be subject to further evaluation using additional predefined criteria.
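  • As a small worked example of the pixel quantity test (412), the following sketch multiplies width by height and compares the product with the example threshold of 3,600 pixels. The function name `satisfies_pixel_quantity` is hypothetical.

```python
def satisfies_pixel_quantity(width, height, min_pixels=3600):
    """Return True when the image has at least min_pixels pixels."""
    # An image with exactly 3,600 pixels satisfies the criteria,
    # matching the "3,600 pixels or more" wording above.
    return width * height >= min_pixels
```

  • For instance, a 100×130 image has 13,000 pixels and passes, whereas a 59×60 icon has 3,540 pixels and does not.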
  • The process 400 next evaluates the image using image altitude criteria (414). The image altitude criteria may be used to identify and exclude images that are at the bottom of a web page, images which may be associated with boilerplate content or that may otherwise be unrelated to the other content of the web page itself. The altitude of an image may be expressed and evaluated in relative terms, such as by measuring whether an image is wholly or partially positioned in the lower “5%” of a web page, or in absolute terms, such as by measuring whether an image is wholly or partially positioned within the bottom “50 pixels” of the web page or outside the top “1000” pixels of the web page.
  • The evaluation of an image using image altitude criteria is described with reference to FIG. 5. Specifically, FIG. 5 illustrates an example web page 500 that includes text 510 and images 520. In this example, the altitude of an image is defined by its bottom edge; the altitude 530 of the lower four images is therefore illustrated by a dotted line.
  • The web page 500 includes a visible section 540, which is an area of the web page that is within the viewable web browser window, and a non-visible section 550, which is an area of the web page 500 that is outside of the viewable web browser window. Portions of the non-visible section 550 may be made visible if the scroll bar 560 is manipulated to move the web page 500 downwards.
  • Height 570 refers to the distance from the bottom 580 of the web page 500 to the top 590 of the web page 500, and height 595 refers to the distance between the bottom 580 of the web page and the altitude 530 of the lower four images. In an example implementation, the predefined threshold value is expressed as a percentage (e.g., “20%”), and evaluating the image altitude criteria for the lower four images includes determining if a ratio of the height 595 to the height 570 is above the predefined threshold value. In another example implementation, the predefined threshold value is expressed as a quantity of pixels (e.g., “50 pixels”), and evaluating the image altitude criteria for the lower four images includes determining if the height 595 exceeds the predefined threshold value.
  • Referring back to FIG. 4, any image whose altitude is not above the predefined threshold will be regarded as not satisfying the image altitude criteria (414, “No”), and will not be further evaluated (422). By contrast, an image whose altitude is above the predefined threshold value will be regarded as satisfying the image altitude criteria (414, “Yes”), and will be subject to additional evaluation.
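  • The two forms of the image altitude test (414) might be sketched as follows. The sketch assumes that positions are measured in pixels from the top of the web page, so that the altitude (height 595) is the page height (height 570) minus the vertical position of the bottom edge of the image. The function names are hypothetical; the thresholds of 20% and 50 pixels are the example values given above.

```python
def altitude_from_bottom(page_height, image_bottom_y):
    """Distance in pixels from the page bottom to the image's bottom edge."""
    return page_height - image_bottom_y


def satisfies_altitude_relative(page_height, image_bottom_y, min_fraction=0.20):
    """Relative form: altitude / page height must be above min_fraction."""
    if page_height <= 0:
        return False  # assumption: a page without a height fails the test
    altitude = altitude_from_bottom(page_height, image_bottom_y)
    return altitude / page_height > min_fraction


def satisfies_altitude_absolute(page_height, image_bottom_y, min_pixels=50):
    """Absolute form: altitude in pixels must exceed min_pixels."""
    return altitude_from_bottom(page_height, image_bottom_y) > min_pixels
```

  • The two forms can disagree: on a 2,000-pixel page, an image whose bottom edge is 100 pixels above the page bottom passes the absolute test but fails the relative one.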
  • The image is also evaluated using boilerplate content criteria (416). The boilerplate content of a web page may be texts and/or images that appear on different web pages on the same web site, for example, navigational icons or hypertexts, copyright information, contact information, legal disclaimers, etc. Images that are included in boilerplate content sections are unlikely to be related to the other content of the web page itself. Determining whether the image is included in a section of the web page that is associated with boilerplate content includes providing the web page to a module that is adapted to detect boilerplate content within a web page, and receiving information from the boilerplate content detection module that identifies any potential boilerplate content.
  • If the image is in a section of the web page that has been identified as including boilerplate content (416, “Yes”), the image will be regarded as not satisfying the boilerplate content criteria, and will not be further evaluated (422). If the image is not in a section of the web page that has been identified as including boilerplate content, or is in a section of the web page that has been identified as not including boilerplate content (416, “No”), the image will be subject to further evaluation.
  • In process 400, the image is lastly evaluated using excluded content criteria (418). The content of an image may be important for determining whether a web page is to be classified as a gallery web page or, if the web page is classified as a gallery web page, whether the image should appear in a gallery-web-page-specific search result. Determining whether the image includes excluded content includes providing the image or the web page to a module that is adapted to detect excluded content, and receiving information from the excluded content detection module that identifies whether the image includes excluded content.
  • If an image is determined to include excluded content, e.g., pornographic content or advertising content, one or more consequences may follow: the web page may be deemed to satisfy the excluded content criteria to a lesser extent; a search result format that is not specific to gallery web pages may be used, even though the web page may be classified as a gallery web page; or the search result may not show the image that is determined to include the excluded content. If the image includes excluded content (418, “Yes”), the image will be regarded as not satisfying the excluded content criteria, and will not be further evaluated (422). If the image does not include excluded content (418, “No”), the image will be regarded as satisfying the predefined criteria associated with the process 400.
  • After each of the images on a web page has been evaluated, a score is generated based on the total quantity of images that satisfy the various predefined criteria, and the score is compared with a predefined threshold value. If the score is equal to or larger than the predefined threshold value, the web page is classified as a gallery web page. If not, the web page is classified as not a gallery web page, or is left unclassified.
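  • The following sketch chains the five criteria of the process 400 in the order described and then scores the web page. It is illustrative only: the `Image` record and the function names are hypothetical, the boolean fields `in_boilerplate` and `excluded` stand in for the output of the boilerplate and excluded content detection modules (which are not modeled here), and the numeric limits are the example values from the text.

```python
from dataclasses import dataclass


@dataclass
class Image:
    width: int
    height: int
    altitude: int                 # pixels from page bottom to image bottom edge
    in_boilerplate: bool = False  # stand-in for boilerplate module output
    excluded: bool = False        # stand-in for excluded content module output


def image_passes(img, page_height):
    """Apply criteria 410 to 418 in order, stopping at the first failure."""
    checks = (
        # 410: width-to-height ratio between 3:5 and 5:3 (integer form)
        lambda: 3 * img.width <= 5 * img.height and 3 * img.height <= 5 * img.width,
        # 412: at least 3,600 pixels
        lambda: img.width * img.height >= 3600,
        # 414: altitude above 20% of the page height
        lambda: img.altitude * 100 > 20 * page_height,
        # 416: not inside a boilerplate section
        lambda: not img.in_boilerplate,
        # 418: no excluded content
        lambda: not img.excluded,
    )
    return all(check() for check in checks)  # all() short-circuits


def score_page(images, page_height, threshold=6):
    """Return (score, is_gallery) where score counts the passing images."""
    score = sum(1 for img in images if image_passes(img, page_height))
    return score, score >= threshold
```

  • Because `all()` stops at the first failing check, an image rejected by an early criterion is not evaluated further, mirroring the exit to step 422.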
  • In an additional implementation, illustrated in FIGS. 6 and 7, the DOM path of images of a web page may be used to select images that are to be subject to evaluation using the predefined criteria, or the web page may be classified as a gallery web page based on DOM path criteria. FIG. 6 is an example of an HTML document 600 containing various HTML elements for displaying images, and FIG. 7 is a tree representation 700 of the hierarchy of the HTML elements in the HTML document of FIG. 6.
  • A subset of the images of a web page may be selected, and only the subset of the images may be evaluated using the predefined criteria. The HTML document 600 of the web page may be parsed to identify the particular DOM path of each image in the hierarchy of the HTML elements. As can be seen from FIG. 7, the images of “rose1”, “rose2”, “rose3” and “rose4” all have the same DOM path of “<HTML> <BODY> <TABLE> <TR> <TD>”, and the images of “rose11” and “rose12” have the same DOM path of “<HTML> <BODY> <TABLE> <TR> <TD> <A>”.
  • In this example, the images of “rose1”, “rose2”, “rose3” and “rose4” will be determined as belonging to a first group of images having a DOM path of “<HTML> <BODY> <TABLE> <TR> <TD>”, and the images of “rose11” and “rose12” will be determined as belonging to a second group of images having a DOM path of “<HTML> <BODY> <TABLE> <TR> <TD> <A>”.
  • The quantity of images in each group is determined, and is evaluated using DOM path group criteria. In the example shown in FIG. 7, the size of the first group is “4” and the size of the second group is “2”. The sizes of the different groups of images having different DOM paths are further ordered, and the size of the largest group is compared to a predefined threshold value. If the size of the largest group is found to be equal to or larger than the predefined value, the web page is regarded as having satisfied the DOM path group criteria, and may be classified as a gallery web page. If the largest group of images having the same DOM path includes fewer images than the threshold value, the web page is regarded as having not satisfied the DOM path group criteria, and may be classified as not a gallery web page. In one implementation, this threshold value for the size of the largest group is set to four.
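  • A minimal sketch of the DOM path group criteria follows, using the standard-library `html.parser` module to record the chain of open elements above each <IMG> element. The class and function names are hypothetical, and the list of void elements is a simplifying assumption rather than a complete list. The threshold of four is the example value given above.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements without closing tags; partial list, assumed for this sketch.
VOID_ELEMENTS = {"img", "br", "hr", "input", "meta", "link"}


class ImgPathCollector(HTMLParser):
    """Records the DOM path (tuple of ancestor tags) of every <IMG>."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.paths.append(tuple(self.stack))
        elif tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def largest_dom_group(html):
    """Return (path, size) of the largest group of images sharing a path."""
    collector = ImgPathCollector()
    collector.feed(html)
    collector.close()
    groups = Counter(collector.paths)
    if not groups:
        return None, 0
    return groups.most_common(1)[0]


def satisfies_dom_path_group(html, threshold=4):
    """True when the largest same-path group has at least threshold images."""
    return largest_dom_group(html)[1] >= threshold
```

  • Applied to a document with four images directly inside table cells and two images wrapped in hyperlinks, the sketch reports a largest group of size four, which meets the example threshold.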
  • In an alternative implementation, only the images in the groups having a size equal to or larger than a predefined group size may be selected as a subset for evaluation using other predefined criteria, and images in the groups having a size smaller than the predefined group size may be ignored or discarded. Such an approach reflects the recognition that gallery images on a gallery web page are typically similarly arranged for display during the creation of the web page, and therefore they are likely to share a same DOM path in the HTML document of the web page.
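  • The subset selection described above might look like the following sketch, which takes (image name, DOM path) pairs and keeps only the images whose group is large enough. The function name and the default group size of four are assumptions made for illustration; the specification does not fix a value for this variant.

```python
from collections import defaultdict


def select_images_by_group(images, min_group_size=4):
    """Keep images whose DOM path is shared by at least min_group_size images.

    `images` is an iterable of (name, dom_path) pairs, where dom_path is a
    tuple of tag names. Images in smaller groups are discarded.
    """
    groups = defaultdict(list)
    for name, path in images:
        groups[path].append(name)
    selected = []
    for names in groups.values():  # dicts preserve insertion order
        if len(names) >= min_group_size:
            selected.extend(names)
    return selected
```

  • The selected names can then be passed to the per-image criteria, so that images outside the dominant layout group are never evaluated.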
  • In another alternative implementation, evaluation of a web page may include skipping certain HTML elements that do not have a significant effect on the formatting or arrangement of displayed images. For example, the pair of the HTML elements “<a>” and “</a>” simply embeds a hyperlink for the content enclosed therebetween. If an image is enclosed within these HTML elements, the image will be displayed in a similar manner as other images that do not share the same DOM path; however, the image will be selectable.
  • For example, in FIG. 7, the DOM path of the image “rose11” is only different from the DOM path of “rose4” in that it has an additional “<a>” element immediately before the <IMG> element. In this case, the HTML element “<a>” may be disregarded in determining the DOM path of a specific image. Hence, all the images referenced in the example HTML document in FIG. 6 may be regarded as falling within the same group, having a size of “6.” If the images in this example satisfy the remaining predefined criteria, the web page will then be determined to meet the requirement on the number of images, and may be classified as a gallery web page.
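  • Disregarding the “<a>” element can be sketched as a normalization step applied to each DOM path before grouping. The function names and the representation of a DOM path as a tuple of tag names are assumptions made for this example.

```python
from collections import Counter


def normalize_path(path, ignored=frozenset({"a"})):
    """Drop elements that do not affect image layout, such as <a>."""
    return tuple(tag for tag in path if tag.lower() not in ignored)


def group_sizes(paths, ignored=frozenset({"a"})):
    """Count images per normalized DOM path."""
    return Counter(normalize_path(path, ignored) for path in paths)
```

  • With the six example paths of FIG. 7, normalization merges the two groups of sizes “4” and “2” into a single group of size “6”; passing an empty `ignored` set reproduces the original two groups.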
  • Additional criteria may be further applied to the web page to avoid false positives or negatives resulting from the evaluation of other criteria. For example, the total number of pixels of all the images of a web page can be determined and compared with the total area (in pixels) of the entire web page, to see if the ratio exceeds a predefined ratio, for example, 60%. If the ratio is below this predefined ratio, the images cover less than a predefined area of the entire web page, and the web page may not be classified as a gallery web page.
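  • The area coverage test can be expressed as a short calculation. In this sketch, which assumes that image sizes and page dimensions are available in pixels, the comparison is done in integers to avoid rounding error, and a ratio exactly equal to the limit is treated as passing, which is an assumption because the text addresses only ratios above or below the limit. The function names are hypothetical.

```python
def total_image_pixels(image_sizes):
    """Sum of width * height over (width, height) pairs."""
    return sum(width * height for width, height in image_sizes)


def satisfies_coverage(image_sizes, page_width, page_height, min_percent=60):
    """True when images cover at least min_percent of the page area."""
    page_area = page_width * page_height
    if page_area <= 0:
        return False  # assumption: a page without area fails the test
    # Integer comparison equivalent to total / page_area >= min_percent / 100.
    return total_image_pixels(image_sizes) * 100 >= min_percent * page_area
```

  • For example, six 400×300 images on a 1,000×1,000 page cover 72% of the page and pass, while two such images cover 24% and do not.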
  • In another alternative, the ratio of the number of pixels of all the candidate gallery images versus the amount of textual content displayed can also be calculated to see if it is over a threshold value. The amount of textual content displayed can be the number of words in the sections other than the boilerplate section on the web page that are displayed to the user when the web page is rendered. This threshold ratio can be set to “3,000:1,” for example. Any web page having a ratio of the total number of pixels of all the candidate gallery images versus the number of words on the web page equal to or higher than this threshold value can be thought of as an image-intensive web page and thereby qualified to be a gallery web page, provided that the other tests have been passed.
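  • A sketch of the pixels-to-words test follows. It assumes that the non-boilerplate text has already been extracted as a string and that words are separated by whitespace, which is a simplification; the function name is hypothetical, and the handling of a page with no words is an assumption, since the text does not address that case.

```python
def is_image_intensive(image_sizes, visible_text, min_pixels_per_word=3000):
    """True when candidate image pixels per displayed word meet the ratio.

    `image_sizes` holds (width, height) pairs for candidate gallery images;
    `visible_text` is the rendered text outside boilerplate sections.
    """
    total_pixels = sum(width * height for width, height in image_sizes)
    word_count = len(visible_text.split())
    if word_count == 0:
        # Assumption: a page with images and no text counts as image-intensive.
        return total_pixels > 0
    # Equivalent to total_pixels / word_count >= min_pixels_per_word.
    return total_pixels >= min_pixels_per_word * word_count
```

  • With six 100×130 images (78,000 pixels), a page of 20 words yields 3,900 pixels per word and qualifies, whereas a page of 30 words yields 2,600 pixels per word and does not.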
  • Systems for identifying gallery web pages using any one of the implementations as set forth above can be used to assist a search engine in classifying web pages crawled from the Web as either being a gallery web page or as not being a gallery web page. Further processes can be performed to prepare the web page and the identified gallery images in the web page to be presented in a search result.
  • For example, the total number of images can be recorded in the cache of indexed web pages, and for each image, a separate thumbnail or preview image within a predefined size range can be created and stored. Preview or thumbnail images may not be prepared and stored for images that include excluded content. A particular search result format that is specific to gallery web pages may present search results that include information specifying a total number of images on a particular gallery web page, thumbnails of at least a subset of these images, and a snippet of textual content of the gallery web page, if the web page is identified by a search engine in response to a particular search query.
  • FIGS. 8A-8E are different examples of search results that reference gallery web pages. For a particular search query “photos of grand canyon,” if any one of the search results is found to be a gallery web page, the number of gallery images and the thumbnail images can be displayed in different layouts, as shown in the examples in FIGS. 8A-8E.
  • Further, as the number of thumbnail images shown in a search result may not cover all the gallery images identified for the web page, a subset of these gallery images can be selected sequentially, randomly, or in any other particular manner, to be presented in the search result. Alternatively, in order for a user to browse to the other preview images of the gallery images without visiting the actual web page, navigational icons may be arranged beside these thumbnail images to assist the user in viewing these other preview images not initially shown in the search result.
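  • One possible way to assemble the data for a gallery-web-page-specific search result is sketched below. The dictionary keys, the function name `build_gallery_result`, and the default of four thumbnails are all hypothetical; the sketch only illustrates recording the total image count, skipping images flagged as including excluded content, and choosing the displayed subset sequentially or randomly.

```python
import random


def build_gallery_result(title, snippet, images, max_thumbnails=4,
                         shuffle=False, seed=None):
    """Return a dict describing a gallery search result.

    `images` is a list of dicts with a "src" key and an optional
    "excluded" flag (stand-in for the excluded content module output).
    """
    # No preview is offered for images that include excluded content.
    eligible = [img["src"] for img in images if not img.get("excluded", False)]
    shown_count = min(max_thumbnails, len(eligible))
    if shuffle:
        shown = random.Random(seed).sample(eligible, shown_count)
    else:
        shown = eligible[:shown_count]  # sequential selection
    return {
        "title": title,
        "snippet": snippet,
        "image_count": len(images),   # total images on the gallery page
        "thumbnails": shown,
        # True when navigational icons would be needed to reach the rest.
        "has_more": len(eligible) > len(shown),
    }
```

  • The `has_more` flag indicates when navigational icons would be useful for reaching previews that are not initially shown.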
  • Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially-generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).
  • The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.
  • The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.
  • A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few. Devices suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
  • Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a backend component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
  • The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.
  • While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any inventions or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
  • Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
  • Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

Claims (9)

1. A computer-implemented method comprising:
determining, by a search engine, that a set of search results that are responsive to a search query includes one or more search results that reference one or more web pages that are classified as gallery web pages, and one or more search results that reference one or more web pages that are not classified as gallery web pages, wherein a gallery web page is a web page that includes images and text, and the principal content of which is one or more of the images;
formatting the search results that reference the one or more web pages that are classified as gallery web pages according to a gallery-web-page-specific search result format, and formatting the search results that reference the one or more web pages that are not classified as gallery web pages according to one or more different search result formats; and
providing the formatted set of search results for display on a search results page.
2. The method of claim 1, wherein the formatted search results that reference the one or more web pages that are classified as gallery web pages each includes a preview of an image from the web page referenced by the search result.
3. The method of claim 1, wherein the gallery-web-page-specific search result format includes navigational icons that are used to display images that are not initially displayed.
4. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
determining, by a search engine, that a set of search results that are responsive to a search query includes one or more search results that reference one or more web pages that are classified as gallery web pages, and one or more search results that reference one or more web pages that are not classified as gallery web pages, wherein a gallery web page is a web page that includes images and text, and the principal content of which is one or more of the images;
formatting the search results that reference the one or more web pages that are classified as gallery web pages according to a gallery-web-page-specific search result format, and formatting the search results that reference the one or more web pages that are not classified as gallery web pages according to one or more different search result formats; and
providing the formatted set of search results for display on a search results page.
5. The system of claim 4, wherein the formatted search results that reference the one or more web pages that are classified as gallery web pages each include a preview of an image from the web page referenced by the search result.
6. The system of claim 4, wherein the gallery-web-page-specific search result format includes navigational icons that are used to display images that are not initially displayed.
7. A computer storage medium encoded with a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
determining, by a search engine, that a set of search results that are responsive to a search query includes one or more search results that reference one or more web pages that are classified as gallery web pages, and one or more search results that reference one or more web pages that are not classified as gallery web pages, wherein a gallery web page is a web page that includes images and text, and the principal content of which is one or more of the images;
formatting the search results that reference the one or more web pages that are classified as gallery web pages according to a gallery-web-page-specific search result format, and formatting the search results that reference the one or more web pages that are not classified as gallery web pages according to one or more different search result formats; and
providing the formatted set of search results for display on a search results page.
8. The medium of claim 7, wherein the formatted search results that reference the one or more web pages that are classified as gallery web pages each include a preview of an image from the web page referenced by the search result.
9. The medium of claim 7, wherein the gallery-web-page-specific search result format includes navigational icons that are used to display images that are not initially displayed.
US13/283,007 2011-04-28 2011-10-27 Presenting search results for gallery web pages Abandoned US20120278299A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/283,878 US8938441B2 (en) 2011-04-28 2011-10-28 Presenting search results for gallery web pages

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CNPCT/CN2011/073465 2011-04-28
PCT/CN2011/073465 WO2012145912A1 (en) 2011-04-28 2011-04-28 Presenting search results for gallery web pages

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/073465 Continuation WO2012145912A1 (en) 2011-04-28 2011-04-28 Presenting search results for gallery web pages

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/283,878 Continuation US8938441B2 (en) 2011-04-28 2011-10-28 Presenting search results for gallery web pages

Publications (1)

Publication Number Publication Date
US20120278299A1 (en) 2012-11-01

Family

ID=47068754

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/283,007 Abandoned US20120278299A1 (en) 2011-04-28 2011-10-27 Presenting search results for gallery web pages
US13/283,878 Active 2031-10-02 US8938441B2 (en) 2011-04-28 2011-10-28 Presenting search results for gallery web pages

Family Applications After (1)

Application Number Title Priority Date Filing Date
US13/283,878 Active 2031-10-02 US8938441B2 (en) 2011-04-28 2011-10-28 Presenting search results for gallery web pages

Country Status (3)

Country Link
US (2) US20120278299A1 (en)
DE (1) DE212011100098U1 (en)
WO (1) WO2012145912A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9881100B2 (en) 2013-01-14 2018-01-30 International Business Machines Corporation Scoping searches within websites

Families Citing this family (12)

Publication number Priority date Publication date Assignee Title
CN102346782A (en) * 2011-10-25 2012-02-08 中兴通讯股份有限公司 Method and device for displaying pictures on browser of user terminal as required
US9465572B2 (en) * 2011-11-09 2016-10-11 Microsoft Technology Licensing, Llc Dynamic server-side image sizing for fidelity improvements
US9519661B2 (en) * 2012-04-17 2016-12-13 Excalibur Ip, Llc Method and system for updating a background picture of a web search results page for different search queries
JP6064392B2 (en) * 2012-06-29 2017-01-25 株式会社リコー SEARCH DEVICE, SEARCH METHOD, SEARCH PROGRAM, AND SEARCH SYSTEM
US9832284B2 (en) 2013-12-27 2017-11-28 Facebook, Inc. Maintaining cached data extracted from a linked resource
US10133710B2 (en) * 2014-02-06 2018-11-20 Facebook, Inc. Generating preview data for online content
US9442903B2 (en) 2014-02-06 2016-09-13 Facebook, Inc. Generating preview data for online content
US10567327B2 (en) 2014-05-30 2020-02-18 Facebook, Inc. Automatic creator identification of content to be shared in a social networking system
CA2989462A1 (en) * 2015-06-18 2016-12-22 Tylio Inc. System and method for generating an electronic page
US10929461B2 (en) * 2016-07-25 2021-02-23 Evernote Corporation Automatic detection and transfer of relevant image data to content collections
US11086961B2 (en) * 2017-04-05 2021-08-10 Google Llc Visual leaf page identification and processing
CN108256104B (en) * 2018-02-05 2020-05-26 恒安嘉新(北京)科技股份公司 Comprehensive classification method of internet websites based on multidimensional characteristics

Citations (8)

Publication number Priority date Publication date Assignee Title
US20050234904A1 (en) * 2004-04-08 2005-10-20 Microsoft Corporation Systems and methods that rank search results
US20070266022A1 (en) * 2006-05-10 2007-11-15 Google Inc. Presenting Search Result Information
US20090254643A1 (en) * 2008-04-04 2009-10-08 Merijn Camiel Terheggen System and method for identifying galleries of media objects on a network
US20100287175A1 (en) * 2009-05-11 2010-11-11 Microsoft Corporation Model-based searching
US20110029541A1 (en) * 2009-07-31 2011-02-03 Yahoo! Inc. System and method for intent-driven search result presentation
US20110035345A1 (en) * 2009-08-10 2011-02-10 Yahoo! Inc. Automatic classification of segmented portions of web pages
US7984044B2 (en) * 2008-04-01 2011-07-19 Hitachi, Ltd. System or program for searching documents
US8086600B2 (en) * 2006-12-07 2011-12-27 Google Inc. Interleaving search results

Family Cites Families (15)

Publication number Priority date Publication date Assignee Title
US6049821A (en) * 1997-01-24 2000-04-11 Motorola, Inc. Proxy host computer and method for accessing and retrieving information between a browser and a proxy
US20040250205A1 (en) * 2003-05-23 2004-12-09 Conning James K. On-line photo album with customizable pages
US7913163B1 (en) * 2004-09-22 2011-03-22 Google Inc. Determining semantically distinct regions of a document
US7849093B2 (en) * 2005-10-14 2010-12-07 Microsoft Corporation Searches over a collection of items through classification and display of media galleries
US8015162B2 (en) * 2006-08-04 2011-09-06 Google Inc. Detecting duplicate and near-duplicate files
US8306326B2 (en) 2006-08-30 2012-11-06 Amazon Technologies, Inc. Method and system for automatically classifying page images
IES20070382A2 (en) * 2007-05-28 2008-10-29 Chad Gilmer A method and apparatus for providing an on-line directory service
US7840502B2 (en) * 2007-06-13 2010-11-23 Microsoft Corporation Classification of images as advertisement images or non-advertisement images of web pages
US8000504B2 (en) * 2007-08-03 2011-08-16 Microsoft Corporation Multimodal classification of adult content
US20090171766A1 (en) 2007-12-27 2009-07-02 Jeremy Schiff System and method for providing advertisement optimization services
US20090204610A1 (en) * 2008-02-11 2009-08-13 Hellstrom Benjamin J Deep web miner
US20090254515A1 (en) * 2008-04-04 2009-10-08 Merijn Camiel Terheggen System and method for presenting gallery renditions that are identified from a network
US20100082594A1 (en) * 2008-09-25 2010-04-01 Yahoo!, Inc. Building a topic based webpage based on algorithmic and community interactions
US8635528B2 (en) * 2008-11-06 2014-01-21 Nexplore Technologies, Inc. System and method for dynamic search result formatting
US9430521B2 (en) * 2009-09-30 2016-08-30 Microsoft Technology Licensing, Llc Query expansion through searching content identifiers

Non-Patent Citations (1)

Title
Danny Sullivan, "Why Google Can't Count Results Properly", 21 October 2010 *

Cited By (2)

Publication number Priority date Publication date Assignee Title
US9881100B2 (en) 2013-01-14 2018-01-30 International Business Machines Corporation Scoping searches within websites
US11157586B2 (en) 2013-01-14 2021-10-26 International Business Machines Corporation Scoping searches within websites

Also Published As

Publication number Publication date
DE212011100098U1 (en) 2013-01-10
US20120278338A1 (en) 2012-11-01
US8938441B2 (en) 2015-01-20
WO2012145912A1 (en) 2012-11-01

Similar Documents

Publication Publication Date Title
US8938441B2 (en) Presenting search results for gallery web pages
US10146743B2 (en) Systems and methods for optimizing content layout using behavior metrics
CN108090111B (en) Animated excerpts for search results
US9177046B2 (en) Refining image relevance models
US9262766B2 (en) Systems and methods for contextualizing services for inline mobile banner advertising
US11687707B2 (en) Arbitrary size content item generation
US20090148045A1 (en) Applying image-based contextual advertisements to images
US8548981B1 (en) Providing relevance- and diversity-influenced advertisements including filtering
US20140372873A1 (en) Detecting Main Page Content
US10210181B2 (en) Searching and annotating within images
US20130054672A1 (en) Systems and methods for contextualizing a toolbar
US20160070990A1 (en) Choosing image labels
US9275016B1 (en) Content item transformations for image overlays
US20160110082A1 (en) Arbitrary size content item generation
US9183577B2 (en) Selection of images to display next to textual content
US20160117331A1 (en) Providing a Search Results Document That Includes a User Interface for Performing an Action in Connection with a Web Page Identified in the Search Results Document
US8838432B2 (en) Image annotations on web pages
US11461801B2 (en) Detecting and resolving semantic misalignments between digital messages and external digital content
US9311734B1 (en) Adjusting a digital image characteristic of an object in a digital image
JP5767413B1 (en) Information processing system, information processing method, and information processing program
US20150169177A1 (en) Classifying particular images as primary images
US11768804B2 (en) Deep search embedding of inferred document characteristics
US9135313B2 (en) Providing a search display environment on an online resource
WO2013033445A2 (en) Systems and methods for contextualizing a toolbar, an image and inline mobile banner advertising

Legal Events

Date Code Title Description
AS Assignment

Owner name: GOOGLE INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LIAO, YUGUO;WANG, NING;REEL/FRAME:028501/0492

Effective date: 20110812

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION