WO2010076785A4 - System and method for aggregating data from a plurality of web sites - Google Patents

System and method for aggregating data from a plurality of web sites

Info

Publication number
WO2010076785A4
Authority
WO
WIPO (PCT)
Prior art keywords
record
records
analyzing
geometrical
data
Prior art date
Application number
PCT/IL2009/001218
Other languages
French (fr)
Other versions
WO2010076785A1 (en)
Inventor
Michael Rubanovich
Dmitry Babitsky
Original Assignee
Fornova Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fornova Ltd
Priority to RU2011130218/08A (RU2011130218A)
Priority to JP2011542972A (JP5501373B2)
Priority to EP09807502A (EP2380099A1)
Priority to CN2009801568512A (CN102317937A)
Publication of WO2010076785A1
Publication of WO2010076785A4

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

System and method for collecting information from a plurality of related sites, analyzing the information and storing the relevant information in a data base for future use. According to one embodiment of the present invention, the system uses the provided list of sites, whether obtained automatically or separately, queries them and analyzes the result retrieved from each site. The information may also optionally and preferably be ranked.

Claims

AMENDED CLAIMS received by the International Bureau on 11 AUG 2010 (11.08.2010)
Claims:
1. A method for automatic aggregation of data from a plurality of web sites; comprising:
i. Automatically and periodically querying for said data from a plurality of related sites;
ii. Analyzing the results from said querying, said results comprising at least one document, said analyzing comprising geometrical analyzing of a page layout of the document, wherein said geometrical analyzing comprises determining one or more geometrical properties of the document; analyzing said one or more geometrical properties to determine a layout of the document; searching for a plurality of record containers within said layout; and determining a relevancy of a record from at least one record container according to a semantic analysis and according to said one or more geometrical properties;
iii. Storing the relevant record data in a data base; and
iv. Retrieving said data from said data base, upon demand from user.
2. The method of claim 1 wherein said searching for a plurality of record containers within said layout further comprises identifying a plurality of records from each record container; dividing said records into groups, each group having the same geometrical pattern; the method further comprising semantically analyzing a representative from each said group; and wherein if the outcome of said semantic analyzing identifies relevant data, saving said data and said pattern in a data base.
3. The method of claim 2 wherein groups having identical said pattern in other pages are assumed to have the same semantic structure, such that data from said groups is fetched without further semantic analyzing.
4. The method of claim 1, wherein said searching for a plurality of record containers within said layout further comprises ranking the area size of the container and the closeness of the geometric center of the container to the geometric center of the layout of document; and selecting a record container according to said ranking to form a selected record container, such that said determining said relevancy is performed on said selected record container.
5. The method of claim 4, wherein said determining said relevancy of said record comprises identifying a plurality of records within said selected record container; grouping said plurality of records into groups according to geometrical pattern, such that records having the same geometrical pattern are identified as belonging to the same group; performing semantic analysis on a representative record of each group; and if said representative record is relevant, storing data from said group of records.
6. The method of claim 5, wherein said grouping according to geometrical pattern is performed by identifying geometrical rectangles, or other geometrically defined shapes, within the record container and by ordering the rectangles or other geometrically defined shapes.
7. The method of claim 6, further comprising receiving a query from a user and comparing said query to a plurality of records; and ranking a plurality of records according to said geometrical pattern for said comparing said query.
8. The method of claim 7, further comprising ranking a plurality of records according to one or more of "freshness", ranking of the source website according to reliability and/or popularity, completeness of the record, or prominence of the record within the website.
9. The method of claim 7, further comprising ranking said plurality of records according to a plurality of weighted attributes.
10. The method of claim 7, further comprising dividing said plurality of records into a group of one or more relevant records and a group of one or more non-relevant records before said ranking said plurality of records, such that said ranking said plurality of records is performed only for said group of one or more relevant records, wherein said dividing said plurality of records comprises analyzing said user query to decompose said query to a plurality of items; analyzing each record to decompose said record to a plurality of items; and comparing values of said items for said user query and for said record.
11. The method of claim 10, wherein said comparing said query to a plurality of records further comprise representing each record and said query as a vector of variables, said variables having differential weighting; and comparing said vectors of variables to determine their similarity.
12. A method for geometrical analyzing of a page layout comprising database query results; the method comprising:
a. Determining at least one record container within said layout by identifying said record container according to said layout;
b. If a plurality of record containers is determined, selecting a record container either by using the size relations of the layout records or by deducing the most regular area on a page;
c. Dividing the records within said record container into groups, each group having the same geometrical pattern; and
d. Analyzing the records according to semantic analysis, said semantic analysis comprising analyzing according to a plurality of keywords.
13. The method of claim 12 wherein rectangles within said chosen record container are identified.
14. The method of claim 13 wherein said identification is done by ordering said records inside said record container and by separating them, using line boundaries.
15. A system for automatic aggregating data from a plurality of web sites; comprising:
a. A crawler process for fetching data from a provided list of related web sites;
b. A geometrical analyzer process for analyzing said data, said data comprising at least one document, said analyzing comprising geometrical analyzing of a page layout of the document, wherein said geometrical analyzing comprises determining one or more geometrical properties of the document; analyzing said one or more geometrical properties to detect a geometric pattern; searching for a plurality of record containers within said layout; and determining a relevancy of a record from at least one record container according to said geometric pattern;
c. A semantic layer for textually analyzing said relevant record; and
d. A data base for storing the information retrieved by said semantic layer.
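The geometric steps recited in claims 4 to 6 and 12 can be illustrated with a small sketch. It is only one possible reading, not the applicant's implementation: the `Box` type, the equal weighting of area and center closeness, the rounding of coordinates to whole units, and the sample page are all assumptions introduced here, since the claims name the criteria but give no formula.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import hypot
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical helper type)."""
    x: float
    y: float
    w: float
    h: float

    @property
    def area(self) -> float:
        return self.w * self.h

    @property
    def center(self) -> Tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


def container_score(container: Box, page: Box, area_weight: float = 0.5) -> float:
    """Rank a container by relative area and closeness to the page center.

    The 50/50 weighting is an assumption; claim 4 only names the two criteria.
    """
    rel_area = container.area / page.area
    (cx, cy), (px, py) = container.center, page.center
    # Half the page diagonal is the largest possible distance from the center.
    max_dist = hypot(page.w, page.h) / 2
    closeness = 1.0 - hypot(cx - px, cy - py) / max_dist
    return area_weight * rel_area + (1.0 - area_weight) * closeness


def select_container(containers: List[Box], page: Box) -> Box:
    """Pick the highest-ranked record container."""
    return max(containers, key=lambda c: container_score(c, page))


Pattern = Tuple[Tuple[int, int, int, int], ...]
Record = Tuple[Box, List[Box]]  # (record rectangle, field rectangles inside it)


def geometric_pattern(record: Box, fields: List[Box]) -> Pattern:
    """Ordered field rectangles, expressed relative to the record's corner."""
    ordered = sorted(fields, key=lambda f: (f.y, f.x))  # top-to-bottom, left-to-right
    return tuple(
        (round(f.x - record.x), round(f.y - record.y), round(f.w), round(f.h))
        for f in ordered
    )


def group_by_pattern(records: List[Record]) -> Dict[Pattern, List[Record]]:
    """Records with the same geometrical pattern land in the same group."""
    groups: Dict[Pattern, List[Record]] = defaultdict(list)
    for record, fields in records:
        groups[geometric_pattern(record, fields)].append((record, fields))
    return dict(groups)


# Made-up sample page: a sidebar, a central results area, and a footer.
page = Box(0, 0, 1000, 2000)
sidebar = Box(0, 0, 200, 2000)
main = Box(250, 300, 700, 1400)
footer = Box(0, 1900, 1000, 100)
chosen = select_container([sidebar, main, footer], page)

records = [
    (Box(250, 300, 700, 100), [Box(260, 310, 200, 20), Box(260, 340, 100, 20)]),
    (Box(250, 400, 700, 100), [Box(260, 410, 200, 20), Box(260, 440, 100, 20)]),
    (Box(250, 500, 700, 100), [Box(250, 500, 700, 100)]),  # e.g. a banner row
]
groups = group_by_pattern(records)
print(chosen == main, sorted(len(g) for g in groups.values()))
```

Under claims 5 and 12, one representative per group (for example, the first record) would then be passed to semantic analysis, and its verdict applied to every record sharing that pattern.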
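Claims 10 and 11 describe splitting records into relevant and non-relevant groups by comparing decomposed items, then comparing weighted vectors. The following is a minimal sketch under stated assumptions: queries and records are taken to be already decomposed into item dictionaries, a record counts as relevant when at least one query item matches, the weights and listings are invented, and cosine similarity is swapped in as the vector comparison because the claims do not name a measure.

```python
from math import sqrt
from typing import Dict, List, Tuple

Items = Dict[str, str]                 # decomposed query or record: item name -> value
Vector = Dict[Tuple[str, str], float]  # sparse vector keyed by (item name, value)

# Hypothetical weights; claim 11 requires differential weighting but gives no values.
WEIGHTS = {"city": 3.0, "rooms": 2.0, "price_band": 1.0}


def is_relevant(query: Items, record: Items) -> bool:
    """Assumed rule: relevant if any query item has the same value in the record."""
    return any(record.get(name) == value for name, value in query.items())


def to_vector(items: Items, weights: Dict[str, float]) -> Vector:
    """One dimension per (name, value) pair, scaled by the weight of that item."""
    return {(name, value): weights.get(name, 1.0) for name, value in items.items()}


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[dim] * b[dim] for dim in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(query: Items, records: Dict[str, Items]) -> List[Tuple[str, float]]:
    """Drop non-relevant records first, then order the rest by similarity."""
    q = to_vector(query, WEIGHTS)
    scored = [
        (name, cosine_similarity(q, to_vector(items, WEIGHTS)))
        for name, items in records.items()
        if is_relevant(query, items)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up query and listings for illustration only.
query = {"city": "paris", "rooms": "3"}
listings = {
    "A": {"city": "paris", "rooms": "3", "price_band": "mid"},
    "B": {"city": "paris", "rooms": "2", "price_band": "low"},
    "C": {"city": "lyon", "rooms": "3"},
    "D": {"city": "lyon", "rooms": "5"},  # shares no item value with the query
}
ranking = rank(query, listings)
print([name for name, _ in ranking])
```

Keying each dimension by the (name, value) pair keeps categorical items comparable without a numeric encoding; numeric items such as price would need a normalization step that is not sketched here.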
PCT/IL2009/001218 2008-12-31 2009-12-27 System and method for aggregating data from a plurality of web sites WO2010076785A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
RU2011130218/08A RU2011130218A (en) 2008-12-31 2009-12-27 SYSTEM AND METHOD OF DATA AGREEMENT FROM MANY WEBSITES
JP2011542972A JP5501373B2 (en) 2008-12-31 2009-12-27 System and method for collecting and ranking data from multiple websites
EP09807502A EP2380099A1 (en) 2008-12-31 2009-12-27 System and method for aggregating data from a plurality of web sites
CN2009801568512A CN102317937A (en) 2008-12-31 2009-12-27 System and method for aggregating and ranking data from a plurality of web sites

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US19386208P 2008-12-31 2008-12-31
US61/193,862 2008-12-31
US12/567,773 US8880498B2 (en) 2008-12-31 2009-09-27 System and method for aggregating and ranking data from a plurality of web sites
US12/567,773 2009-09-27

Publications (2)

Publication Number Publication Date
WO2010076785A1 (en) 2010-07-08
WO2010076785A4 (en) 2010-10-07

Family

ID=42286118

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2009/001218 WO2010076785A1 (en) 2008-12-31 2009-12-27 System and method for aggregating data from a plurality of web sites

Country Status (6)

Country Link
US (2) US8880498B2 (en)
EP (1) EP2380099A1 (en)
JP (1) JP5501373B2 (en)
CN (1) CN102317937A (en)
RU (1) RU2011130218A (en)
WO (1) WO2010076785A1 (en)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006108069A2 (en) * 2005-04-06 2006-10-12 Google, Inc. Searching through content which is accessible through web-based forms
US10380652B1 (en) 2008-10-18 2019-08-13 Clearcapital.Com, Inc. Method and system for providing a home data index model
US8484286B1 (en) * 2009-11-16 2013-07-09 Hydrabyte, Inc Method and system for distributed collecting of information from a network
WO2012006509A1 (en) * 2010-07-09 2012-01-12 Google Inc. Table search using recovered semantic information
US9183573B2 (en) 2011-06-03 2015-11-10 Facebook, Inc. Überfeed
US20130019195A1 (en) * 2011-07-12 2013-01-17 Oracle International Corporation Aggregating multiple information sources (dashboard4life)
US10083247B2 (en) 2011-10-01 2018-09-25 Oracle International Corporation Generating state-driven role-based landing pages
US10210465B2 (en) * 2011-11-11 2019-02-19 Facebook, Inc. Enabling preference portability for users of a social networking system
US9672252B2 (en) 2012-03-08 2017-06-06 Hewlett-Packard Development Company, L.P. Identifying and ranking solutions from multiple data sources
US20130238972A1 (en) * 2012-03-09 2013-09-12 Nathan Woodman Look-alike website scoring
US8688713B1 (en) * 2012-03-22 2014-04-01 Google Inc. Resource identification from organic and structured content
US20130311440A1 (en) * 2012-05-15 2013-11-21 International Business Machines Corporation Comparison search queries
CN102750372A (en) * 2012-06-15 2012-10-24 翁时锋 Analytical method for automatically acquiring webpage structured information
US9582494B2 (en) 2013-02-22 2017-02-28 Altilia S.R.L. Object extraction from presentation-oriented documents using a semantic and spatial approach
US9733638B2 (en) * 2013-04-05 2017-08-15 Symbotic, LLC Automated storage and retrieval system and control system thereof
US9317873B2 (en) 2014-03-28 2016-04-19 Google Inc. Automatic verification of advertiser identifier in advertisements
US11080777B2 (en) * 2014-03-31 2021-08-03 Monticello Enterprises LLC System and method for providing a social media shopping experience
US20150287099A1 (en) 2014-04-07 2015-10-08 Google Inc. Method to compute the prominence score to phone numbers on web pages and automatically annotate/attach it to ads
US11115529B2 (en) 2014-04-07 2021-09-07 Google Llc System and method for providing and managing third party content with call functionality
US10817884B2 (en) * 2014-05-08 2020-10-27 Google Llc Building topic-oriented audiences
JP6386089B2 (en) 2014-06-26 2018-09-05 Google LLC Optimized browser rendering process
CN106462582B (en) 2014-06-26 2020-05-15 Google LLC Batch optimized rendering and fetching architecture
KR102133486B1 (en) * 2014-06-26 2020-07-13 Google LLC Optimized browser rendering process
US20160048548A1 (en) * 2014-08-13 2016-02-18 Microsoft Corporation Population of graph nodes
US10529031B2 (en) * 2014-09-25 2020-01-07 Sai Suresh Ganesamoorthi Method and systems of implementing a ranked health-content article feed
US20160125081A1 (en) * 2014-10-31 2016-05-05 Yahoo! Inc. Web crawling
US10083295B2 (en) * 2014-12-23 2018-09-25 Mcafee, Llc System and method to combine multiple reputations
US10643258B2 (en) * 2014-12-24 2020-05-05 Keep Holdings, Inc. Determining commerce entity pricing and availability based on stylistic heuristics
WO2017115272A1 (en) * 2015-12-28 2017-07-06 Sixgill Ltd. Dark web monitoring, analysis and alert system and method
US10469424B2 (en) 2016-10-07 2019-11-05 Google Llc Network based data traffic latency reduction
US11023526B2 (en) * 2017-06-02 2021-06-01 International Business Machines Corporation System and method for graph search enhancement
US11461829B1 (en) 2019-06-27 2022-10-04 Amazon Technologies, Inc. Machine learned system for predicting item package quantity relationship between item descriptions
JP7002804B2 (en) 2019-12-13 2022-01-20 翼 加藤 Search device, search application and search method
CN111291155A (en) * 2020-01-17 2020-06-16 Qingwutong Co., Ltd. Method and system for identifying homonymous cells based on text similarity
CN112734165A (en) * 2020-12-18 2021-04-30 Ping An Property & Casualty Insurance Company of China, Ltd. Intelligent function display method, device, equipment and storage medium

Family Cites Families (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5659732A (en) * 1995-05-17 1997-08-19 Infoseek Corporation Document retrieval over networks wherein ranking and relevance scores are computed at the client for multiple database documents
US6067552A (en) * 1995-08-21 2000-05-23 Cnet, Inc. User interface system and method for browsing a hypertext database
US6012053A (en) * 1997-06-23 2000-01-04 Lycos, Inc. Computer system with user-controlled relevance ranking of search results
US6275820B1 (en) * 1998-07-16 2001-08-14 Perot Systems Corporation System and method for integrating search results from heterogeneous information resources
AU4712601A (en) * 1999-12-08 2001-07-03 Amazon.Com, Inc. System and method for locating and displaying web-based product offerings
US7240067B2 (en) * 2000-02-08 2007-07-03 Sybase, Inc. System and methodology for extraction and aggregation of data from dynamic content
WO2001075664A1 (en) * 2000-03-31 2001-10-11 Kapow Aps Method of retrieving attributes from at least two data sources
US7346858B1 (en) * 2000-07-24 2008-03-18 The Hive Group Computer hierarchical display of multiple data characteristics
JP2002108846A (en) * 2000-09-27 2002-04-12 Fuji Xerox Co Ltd Device/method for processing document image and recording medium
US7231381B2 (en) * 2001-03-13 2007-06-12 Microsoft Corporation Media content search engine incorporating text content and user log mining
JP2003216647A (en) * 2002-01-18 2003-07-31 Matsushita Electric Ind Co Ltd Merchandise search device for use in cyber store, cyber store service providing device, media, and information assembly
US7246306B2 (en) * 2002-06-21 2007-07-17 Microsoft Corporation Web information presentation structure for web page authoring
JP4370783B2 (en) * 2002-06-27 2009-11-25 Oki Electric Industry Co., Ltd. Information processing apparatus and method
US7251648B2 (en) * 2002-06-28 2007-07-31 Microsoft Corporation Automatically ranking answers to database queries
US20060047649A1 (en) * 2003-12-29 2006-03-02 Ping Liang Internet and computer information retrieval and mining with intelligent conceptual filtering, visualization and automation
US7330608B2 (en) * 2004-12-22 2008-02-12 Ricoh Co., Ltd. Semantic document smartnails
US7672958B2 (en) * 2005-01-14 2010-03-02 Im2, Inc. Method and system to identify records that relate to a pre-defined context in a data set
EP1866806A1 (en) * 2005-03-09 2007-12-19 Medio Systems, Inc. Method and system for active ranking of browser search engine results
WO2006108069A2 (en) * 2005-04-06 2006-10-12 Google, Inc. Searching through content which is accessible through web-based forms
US20060282455A1 (en) * 2005-06-13 2006-12-14 It Interactive Services Inc. System and method for ranking web content
US20070078814A1 (en) * 2005-10-04 2007-04-05 Kozoru, Inc. Novel information retrieval systems and methods
US8065286B2 (en) * 2006-01-23 2011-11-22 Chacha Search, Inc. Scalable search system using human searchers
US20070208732A1 (en) * 2006-02-07 2007-09-06 Future Vistas, Inc. Telephonic information retrieval systems and methods
US20070294240A1 (en) * 2006-06-07 2007-12-20 Microsoft Corporation Intent based search
US20080033996A1 (en) * 2006-08-03 2008-02-07 Anandsudhakar Kesari Techniques for approximating the visual layout of a web page and determining the portion of the page containing the significant content
US8510298B2 (en) * 2006-08-04 2013-08-13 Thefind, Inc. Method for relevancy ranking of products in online shopping
US7917492B2 (en) * 2007-09-21 2011-03-29 Limelight Networks, Inc. Method and subsystem for information acquisition and aggregation to facilitate ontology and language-model generation within a content-search-service system
US20080098300A1 (en) * 2006-10-24 2008-04-24 Brilliant Shopper, Inc. Method and system for extracting information from web pages
US8707167B2 (en) * 2006-11-15 2014-04-22 Ebay Inc. High precision data extraction
US7930302B2 (en) * 2006-11-22 2011-04-19 Intuit Inc. Method and system for analyzing user-generated content
JP5056133B2 (en) * 2007-04-13 2012-10-24 NEC Corporation Information extraction system, information extraction method, and information extraction program
US8392446B2 (en) 2007-05-31 2013-03-05 Yahoo! Inc. System and method for providing vector terms related to a search query
US20090077180A1 (en) * 2007-09-14 2009-03-19 Flowers John S Novel systems and methods for transmitting syntactically accurate messages over a network
US8117208B2 (en) 2007-09-21 2012-02-14 The Board Of Trustees Of The University Of Illinois System for entity search and a method for entity scoring in a linked document database
KR100938830B1 (en) 2007-12-18 2010-01-26 Korea Institute of Science and Technology Information Method constructing knowledge base and thereof server
US20090265611A1 (en) * 2008-04-18 2009-10-22 Yahoo ! Inc. Web page layout optimization using section importance
US20100169352A1 (en) * 2008-12-31 2010-07-01 Flowers John S Novel systems and methods for transmitting syntactically accurate messages over a network
US8874552B2 (en) 2009-11-29 2014-10-28 Rinor Technologies Inc. Automated generation of ontologies

Also Published As

Publication number Publication date
CN102317937A (en) 2012-01-11
JP5501373B2 (en) 2014-05-21
US9430569B2 (en) 2016-08-30
JP2013515977A (en) 2013-05-09
US20100169301A1 (en) 2010-07-01
RU2011130218A (en) 2013-02-10
EP2380099A1 (en) 2011-10-26
US8880498B2 (en) 2014-11-04
WO2010076785A1 (en) 2010-07-08
US20150134636A1 (en) 2015-05-14

Similar Documents

Publication Publication Date Title
WO2010076785A4 (en) System and method for aggregating data from a plurality of web sites
Gauch et al. ProFusion*: Intelligent fusion from multiple, distributed search engines
US8117208B2 (en) System for entity search and a method for entity scoring in a linked document database
US8161050B2 (en) Visualizing hyperlinks in a search results list
Barbosa et al. Organizing hidden-web databases by clustering visible web documents
US20070162448A1 (en) Adaptive hierarchy structure ranking algorithm
US20100299343A1 (en) Identifying Task Groups for Organizing Search Results
US9460207B2 (en) Automated database generation for answering fact lookup queries
WO2005031614A1 (en) Systems and methods for clustering search results
JP2000339350A (en) Multi-mode information access
US9405803B2 (en) Ranking signals in mixed corpora environments
CN111506727B (en) Text content category acquisition method, apparatus, computer device and storage medium
Radu et al. A hybrid machine-crowd approach to photo retrieval result diversification
Tsai A review of image retrieval methods for digital cultural heritage resources
US9779140B2 (en) Ranking signals for sparse corpora
KR19990048712A (en) Map Type Classification Search Method
WO2001039008A1 (en) Method and system for collecting topically related resources
Sathya et al. Link based K-Means clustering algorithm for information retrieval
Bokhari et al. A new criterion for evaluating news search systems
Yoshida et al. Query transformation by visualizing and utilizing information about what users are or are not searching
Yadav et al. Ontdr: An ontology-based augmented method for document retrieval
AU5126700A (en) Method and system for creating a topical data structure
Umesh et al. Web images evaluations based on visual content
Vadivu et al. Ranking images in web documents based on HTML TAGs for image retrieval from WWW
Vadivu et al. Image Retrieval From WWW Using Attributes in HTML TAGs

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase. Ref document number: 200980156851.2; Country of ref document: CN
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 09807502; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase. Ref document number: 2011542972; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase. Ref country code: DE
WWE WIPO information: entry into national phase. Ref document number: 2009807502; Country of ref document: EP
WWE WIPO information: entry into national phase. Ref document number: 2011130218; Country of ref document: RU