WO2008039542A2 - System and method of ad-hoc analysis of data - Google Patents

Info

Publication number
WO2008039542A2
Authority
WO
WIPO (PCT)
Prior art keywords
information
metadata
items
generating
text index
Prior art date
Application number
PCT/US2007/021035
Other languages
French (fr)
Other versions
WO2008039542A3 (en)
Inventor
Mark William Reed
Original Assignee
Mark William Reed
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mark William Reed
Publication of WO2008039542A2
Publication of WO2008039542A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures

Definitions

  • the present invention is generally directed to providing an improved system and method for ad-hoc analysis of data. Specifically, the present invention implements a metadata lookup structure to assist in data analysis.
  • With the search approach, textual data is collected and stored in a full-text index that allows for rapid searching of the data.
  • Large public Internet portals (such as Google, Yahoo, etc.) as well as numerous commercial indexing solutions support this functionality.
  • a second approach to this problem is the "analytical approach."
  • the analytical approach allows for analysis by collecting items and running these items through various text mining algorithms to extract additional metadata information.
  • This additional processing may include language detection, extraction of links to other data, or determining the sentiment of the author.
  • This derived metadata information is typically stored in a relational database which allows for aggregate analytics such as what websites are linked by the data.
  • these analytics are preconfigured to extract information relevant to the goal of the system.
  • the advantage of the search approach is speed and simplicity. Without any pre-configuration, a full text index allows ad-hoc searching of the data. For instance, if someone wants to find textual data about a particular movie, they can simply search for the title of the movie and find it. However, the search approach does not give deeper insights such as what websites are linked or how people feel about the movie. The analysis approach can provide this type of information, but it typically requires a separate time-consuming text mining step. Therefore, the analysis approach lacks the speed and simplicity needed for "ad-hoc" analysis.
  • the step of accessing information sources for a plurality of textual information items prior to generating the text index is provided.
  • the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating a plurality of metadata IDs, each metadata ID associated with at least a type of metadata, analyzing each textual information item to determine which metadata ID(s) are associated with the respective textual information item, and mapping each textual information item with the respective metadata ID(s) determined for it in the analyzing step.
  • the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating one or more metadata items associated with the textual information items, determining a quantity of the one or more metadata items, and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity.
  • the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating one or more metadata items associated with the textual information items, determining a quantity of the one or more metadata items, and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity. In this embodiment, the number of metadata items generated is not the same for all of the textual information items.
  • the metadata items may include date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and/or language information.
  • the link information includes a Uniform Resource Locator.
  • the language information includes one or more language specific annotations.
  • the demographic information may be generated and includes at least one of age and gender of a text item author.
  • the language specific annotation may be provided by the textual information items.
  • the language specific annotation may be determined by analyzing the textual information items.
  • the textual information items include electronic data from one or more of Internet message boards, blogs and news groups.
  • the aggregate information may include date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and/or language information pertaining to the search results.
  • the text index may be updated after a predetermined time period.
  • the predetermined time period may be between five and fifteen minutes.
  • It is a second aspect of the present invention to provide a system for performing ad-hoc analysis including a computer server having access to information sources, the information sources including a plurality of textual information items, and a user computer device linked to the computer server.
  • the user computer device includes software that performs the steps of: (a) generating a text index of the textual information items; (b) generating a metadata lookup structure based, at least in part, on the text index, the metadata lookup structure including metadata items associated with each of the textual information items; (c) searching the text index using search queries, the searching step producing search results including textual information items matching the search query; (d) compiling results of the text index search into aggregate information related to characteristics of the search results from the metadata items associated with each of the textual information items in the search results from the metadata lookup structure; and (e) reporting the aggregate information.
  • the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating a plurality of metadata IDs, each metadata ID associated with at least a type of metadata, analyzing each textual information item to determine which metadata ID(s) are associated with the respective textual information item, and mapping each textual information item with the respective metadata ID(s) determined for it in the analyzing step.
  • the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating metadata items associated with the textual information items, determining a quantity of the metadata items, and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity.
  • the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating metadata items associated with the textual information items, wherein the number of metadata items generated is not the same for all of the textual information items, determining a quantity of the metadata items, and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity.
  • the metadata items may include date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and/or language information.
  • the aggregate information may include date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and/or language information.
  • the text index may be updated after a predetermined time period.
  • the predetermined time period may be between five and fifteen minutes.
  • Figure 1 is a flow diagram of one embodiment of the present invention.
  • Figure 2 is a view of an exemplary environment utilized in one embodiment of the present invention.
  • Figure 3 is an exemplary computer screenshot from one embodiment of the present invention.
  • one embodiment of the present invention accomplishes this by generating 20 an index that not only includes typical keyword information but also additional metadata information such as links and date information. By building this information into the index, it allows for searching and analyzing this data.
  • Embodiments of the present invention then use the index to rapidly build 22 a metadata lookup structure, rather than requiring additional analytics to be run separately on the data.
  • embodiments of the invention allow for deep analysis without requiring separate time-consuming text mining steps.
  • the first step in implementing this method is gathering data over which ad-hoc analysis is desired.
  • This data or information may come from any information source 24.
  • This data may include consumer generated media or social media such as boards, blogs, and newsgroups. It may also include content such as news media, press releases, website content, local content, networked content or anything else that can be rendered in digital form.
  • Hypertext Transfer Protocol (HTTP)
  • These requests may be for semi-structured data such as those available in feeds (RSS, Atom, etc.) or unstructured data such as raw Hypertext Markup Language (HTML).
  • Other methods may include accessing data located on local or networked computer servers or devices.
  • a text index of the data is generated 20.
  • a full text index of the data is generated 20.
  • Lucene from Apache is used to create 20 the full-text index.
  • Apache Lucene is a "high-performance, full-featured text search engine library written entirely in Java" (http://lucene.apache.org). Numerous other full-text indexing solutions may also be suitable.
  • data may include structured information and/or unstructured information.
  • Structured information may contain separate elements for each metadata type. In one embodiment, these elements may include a data element, a title element and a body element. Examples of structured data may include a structured XML format such as an RSS feed.
  • Unstructured information may be processed to extract metadata types. Such processing may include screen scraping techniques, pattern matching techniques or any other known processing technique capable of extracting metadata information. Metadata that may be the subject of analysis may also be included in the index. For instance, if a user wants to be able to break down the items by date, the data being added may include the date of creation or publication as additional metadata.
  • Source and demographic data may also be added to the index for further annotation.
  • the data source and author information may be added to the index.
  • demographic data about the author such as gender and age may be added to the index to allow for demographic based searching and analysis.
  • For demographics analytics, the age and/or gender of an author may be generated for each item. Using this demographic information, one embodiment of the present invention may generate a demographic breakdown such as the age breakdown of authors of items.
  • Various types of extraction algorithms may also be performed on the data to further annotate the index. This may include link extraction (determining HTML anchor reference to other data). It may also include various types of entity extraction, such as person name extraction or company name extraction. In one embodiment, the text may be examined to extract all proper names based on capitalization. In another embodiment, techniques based on large lists of company names or pattern recognition may also extract entities.
  • the index generation 20 process may be language neutral such that all items are processed identically regardless of language, or it can be language specific.
  • the language is identified. This identification may either be a configuration item provided to the system or through a dynamic language identification module. Once the language is identified, language specific processing may take place. For instance, language specific entity extraction may be used to extract names for a particular language.
  • the internal index data structures are used to build 22 a metadata lookup structure.
  • a full text index typically builds tables that map a particular keyword or attribute to corresponding data.
  • the metadata lookup structure may include a table that lists all text items having a common date. This metadata lookup structure is designed for rapid searching for various attributes.
  • the metadata lookup structure maps text items to metadata information.
  • the metadata information is stored in a way that a metadata value may be retrieved for any item in the text index.
  • index systems have an identification (ID) that represents a discrete item. Such an ID may represent a value or a numeric representation of a value. This may be used as a reference into the metadata lookup structure.
  • the metadata lookup structure maps text items to one or more IDs representing metadata values or numerical representations thereof. For instance, the index may return item #10 and the metadata lookup structure may be used to retrieve the metadata values associated with item #10. In this manner, the date value and author value of item #10 may be easily retrieved.
  • each metadata value is assigned a metadata ID and an array is built 24 mapping text item IDs to a metadata ID.
  • a metadata ID may represent a metadata value or a numeric representation of a metadata value.
  • This array is built 24 by interrogating the text index for each type of metadata and assigning it an ID. The text index is also interrogated for the text items matching that metadata value(s) and that metadata ID(s) is added to the array and associated with the corresponding text item.
  • the metadata lookup structure may be implemented in a variety of ways.
  • Metadata information is stored using a proprietary compression scheme.
  • This metadata information may be stored in memory, on a disk, or in any other storage medium.
  • An exemplary embodiment stores the data in memory for improved access speed.
  • a compressed structure references each item to conserve space. For example, a compressed structure may determine that an index has data from 200 unique dates. The compressed structure could assign each unique date a corresponding number (1-200) and then store the date and number as a single byte reference.
  • the compressed structure may use a compressed number format such that smaller numbers are stored in fewer bytes than larger numbers.
  • the assignment of reference numbers may be intelligent to make efficient use of space. For instance, smaller reference numbers may be used for items occurring frequently and larger reference numbers may be used for items occurring less frequently.
  • the size of the data structure varies depending on the content.
  • the dynamic structure first examines the items in the index and determines the necessary size of the structure such as how many metadata items does it need to store for an item and how many items are present. The dynamic structure then allocates the appropriate amount of memory.
  • the many-to-one structure allows multiple metadata items to be mapped to one item.
  • Traditional methods create a fixed array with a set number of metadata items per item.
  • some metadata items appear in different quantities. For instance, some items may be tagged with three different subjects while other posts might not be tagged at all.
  • the structure allows for both, while not wasting memory space.
  • An indexed data structure includes a periodic offset index to improve lookup performance. Because metadata items may appear in different quantities and each metadata item may be variable in size, a periodic offset index provides a means for quickly determining where a metadata item is located in the data structure. Unlike a typical fixed-size array, the indexed data structure may not know the location of a specific offset. In a fixed-size array with each metadata item having 4 bytes, going to metadata item #1002 would only require navigating to offset #4008 (item #1002 x 4 bytes per item). Because the metadata items may vary for each item and the size of each metadata item is not necessarily fixed, a lookup table including an offset is helpful.
  • the indexed data structure may reference the lookup table (the lookup table returns an offset of metadata item #1000) to move directly to metadata item #1000 and then process through two metadata items to get to
  • the system is ready for ad-hoc use.
  • a user may provide a query using standard full text query syntax. For instance, "Harry Potter" may be used as a query string. This query string is passed to the indexing system, which searches 26 the text index and produces the items in the text index matching this query.
  • the metadata lookup structure may then be used to retrieve 26 metadata information for the items found in the searching step. As this data is retrieved 26, it may be used to compile or aggregate 28 the search results based upon the metadata associated therewith. Compilation or aggregation 28 involves grouping items that share a common attribute.
  • the system may aggregate 28 items found in the searching step based on date. In this situation, it would create a table of dates and a count of items associated with that date. As each item is processed, a counter may be incremented for that item.
  • this aggregate information is provided to the user.
  • this aggregate information is reported 30 to the user by displaying the aggregate information on a display device and/or by generating a hard-copy report of such aggregate information.
  • This aggregate information may be displayed in a table format, graphical format, or any other known format. For example, date analysis may generate a table similar to the following:
  • the system may aggregate 28 the search results based on other types of data such as links (HTML anchor tag references to other data).
  • links may be added to a table to produce a table of the most popular links.
  • Link analysis may produce a table similar to the following:
  • Author information may be aggregated 28 to find authors that write about a certain subject.
  • Keyword analysis may be aggregated 28 to provide common keywords related to a query.
  • Sentiment analysis may be aggregated 28 to give some indication of overall opinion about a certain subject.
  • Other analytics such as language, demographic, and video analytics may also be generated.
  • For language analytics, the language of an item may be indexed 22 or may be otherwise determined. This may be accomplished by using language identifier metadata during indexing 22. With this language metadata, reports of language breakdowns may be generated. Such a language identifier may be commercially available.
  • For demographics analytics, the age and/or gender of an author may be generated for each item. Using this demographic information, one embodiment of the present invention may generate a demographic breakdown such as the age breakdown of authors of items mentioning Harry Potter.
  • the index may include video links in the items.
  • One embodiment of the present invention can identify video links in posts based on standard pattern matching with known video sites and URL formats. This video metadata may be used to produce a report of the most cited videos for a given search query.
  • Extracted entities may be used to provide additional analytics. For instance, if person names were extracted and annotated in the index, the system may provide a list of common person names used in items matching the query.
  • results of ad-hoc analysis may be used in many ways. Results may be used to track important dates or uncover important media sites related to query terms. For example, the "Harry Potter" query revealed that August 1 had a higher amount of "buzz" or popularity related to Harry Potter than other days. Also, the link analysis revealed that www.jkrowling.com (the website of the author of the Harry Potter book series) was the most popular link for the given time period.
  • the present invention is directed to a system for performing ad-hoc analysis.
  • a system includes a computer server 32 and a user computer device 34.
  • the computer server 32 is any server device capable of accessing information or data in an information source 24.
  • This information or data may include consumer generated media or social media such as boards, blogs, and newsgroups. It may also include content such as news media, press releases, website content, local content, networked content or any other information that can be rendered in digital form.
  • the user computer device 34 is linked via data links (wired, wireless, and/or networked) to the computer server 32.
  • the user computer device 34 includes software tools 36 operating thereon. These software tools 36 are configured to execute instructions to perform a method of ad-hoc analysis. This method generates a metadata lookup structure based, at least in part, on information or data gathered from one or more information sources 24.
  • the software tools 36 then search the one or more information sources 24 for a search string.
  • This search string may be any string of characters that a user desires to search.
  • the search returns results that are compiled into aggregate information. As described above, the compiling of the results incorporates the metadata lookup table for improved speed and completeness.
  • the use of the metadata lookup table also allows for various analytics to be run.
  • the software 36 then reports the aggregate information. This reporting may include outputting the aggregate information to an output device such as a monitor, a printer or other output device.
  • This system may analyze the aggregate information based on a date, a link, an author, a keyword, a sentiment, a demographic, an entity, and/or a language. Similar to the Harry Potter example, a series of analytics may be performed to better understand various aspects of the source or content of the information.
  • the computer server 32 on which the exemplary system is operating may be a single computer server 32, a networked group of computer servers 32, or any other networked computer device or computerized device or system of computer devices or computerized devices on which the tools and/or processes of the exemplary embodiments may operate.
  • the user computer device 34 may also comprise the server 32 or be included with the server 32 system.
  • FIG. 3 depicts an exemplary computer screenshot of one embodiment of the present invention.
  • This screenshot is an exemplary screenshot of what a user may see as aggregated information after searching for "Harry Potter."
  • the search interface may include an input area 38 for inputting a search string 40, date parameters 42 and other options.
  • the user may input that term into the input area 38.
  • the user may input specific dates 42 or a date range 42 (such as the last 90 days) for which to search. Any other search commands or Boolean operators may be used in the search query as known to those of ordinary skill in the art.
  • the user may then click on (or otherwise actuate or activate) the search button 44 to begin searching.
  • the system may output a graphical representation 46 of aggregated results for the given data parameters.
  • the search for "Harry Potter" between August 1, 2007 and August 10, 2007 generated a graph 46 showing how many messages were found.
  • the graph may be broken down to show how many messages were found on message boards, blogs, and groups.
  • the graph 46 in this embodiment may be displayed in various levels of detail based on user-selected display options.
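
The periodic offset index and the compressed number format described in the definitions above can be sketched roughly as follows. This is a minimal illustration, not the application's implementation: the application calls its compression scheme proprietary, so a common 7-bit variable-length integer encoding stands in for it here, and the names `encode_varint`, `decode_varint`, `build`, `lookup` and the `period` parameter are all hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative int in 7-bit groups, lowest group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode one varint starting at buf[pos]; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def build(per_item_ids, period):
    """Pack variable-length ID lists into bytes, recording an offset every `period` items."""
    data = bytearray()
    checkpoints = []
    for i, ids in enumerate(per_item_ids):
        if i % period == 0:
            checkpoints.append(len(data))  # byte offset where item i begins
        data += encode_varint(len(ids))    # count of metadata IDs for this item
        for metadata_id in ids:
            data += encode_varint(metadata_id)
    return bytes(data), checkpoints


def lookup(data, checkpoints, period, item):
    """Jump to the nearest earlier checkpoint, then skip forward to `item`."""
    pos = checkpoints[item // period]
    for _ in range(item % period):         # skip the items in between
        count, pos = decode_varint(data, pos)
        for _ in range(count):
            _, pos = decode_varint(data, pos)
    count, pos = decode_varint(data, pos)
    ids = []
    for _ in range(count):
        value, pos = decode_varint(data, pos)
        ids.append(value)
    return ids


# Hypothetical data: five text items with zero to three metadata IDs each.
per_item = [[1], [], [300, 2], [5, 5, 5], [70000]]
data, checkpoints = build(per_item, period=2)
ids_for_item_3 = lookup(data, checkpoints, 2, 3)
```

A larger `period` shrinks the offset table but lengthens the forward scan, which is the trade-off a periodic (rather than per-item) index makes.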

Abstract

It is a first aspect of the present invention to provide a computer implemented method of performing ad-hoc analysis including the steps of: generating a text index of the textual information items (20), generating a metadata lookup structure based, at least in part, on the text index (22), searching the text index using a search query (26), compiling results of the text index search into aggregate information related to characteristics of the search results from the metadata items associated with the textual information items in the search results from the metadata lookup structure (28), and reporting the aggregate information (30).

Description

Title: SYSTEM AND METHOD OF AD-HOC ANALYSIS OF DATA
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Application Serial No. 60/847,486, entitled "SYSTEM FOR AD-HOC ANALYSIS OF ONLINE DATA," filed on September 27, 2006, and U.S. Non-provisional Application Serial No. 11/897,984 entitled "SYSTEM AND METHOD OF AD-HOC ANALYSIS OF DATA," filed on August 31, 2007, the disclosures of which are incorporated herein by reference.
FIELD OF THE INVENTION
[0002] The present invention is generally directed to providing an improved system and method for ad-hoc analysis of data. Specifically, the present invention implements a metadata lookup structure to assist in data analysis.
BACKGROUND OF THE INVENTION
[0003] Today, there are vast amounts of unstructured data on the Internet. There is a great need to be able to search and analyze this data in order to uncover useful information about particular areas of interest. This is not only desired by consumers who want to find information about people and products, but also by companies that want to know what their customers are saying about their products and services.
[0004] Traditionally, there have been two approaches to this problem. One approach is the "search approach." With the search approach, textual data is collected and stored in a full-text index that allows for rapid searching of the data. Large public Internet portals (such as Google, Yahoo, etc.) as well as numerous commercial indexing solutions support this functionality.
[0005] A second approach to this problem is the "analytical approach." The analytical approach allows for analysis by collecting items and running these items through various text mining algorithms to extract additional metadata information. This additional processing may include language detection, extraction of links to other data, or determining the sentiment of the author. This derived metadata information is typically stored in a relational database which allows for aggregate analytics such as what websites are linked by the data. Typically, these analytics are preconfigured to extract information relevant to the goal of the system.
[0006] The advantage of the search approach is speed and simplicity. Without any pre-configuration, a full text index allows ad-hoc searching of the data. For instance, if someone wants to find textual data about a particular movie, they can simply search for the title of the movie and find it. However, the search approach does not give deeper insights such as what websites are linked or how people feel about the movie. The analysis approach can provide this type of information, but it typically requires a separate time-consuming text mining step. Therefore, the analysis approach lacks the speed and simplicity needed for "ad-hoc" analysis.
[0007] Therefore, there is a need for a solution that combines the speed of the search approach and the deep insights of the analytical approach to provide for true ad-hoc analysis. Aspects of the present invention address this need.
SUMMARY OF THE INVENTION
[0008] Aspects of the present invention address this need by providing an improved system and method for ad-hoc analysis of data.
[0009] It is a first aspect of the present invention to provide a computer implemented method of performing ad-hoc analysis including the steps of: generating a metadata lookup structure based, at least in part, on the text index, the metadata lookup structure including metadata items associated with each of the textual information items, searching the text index using search queries, the searching step producing search results including textual information items matching the search query, compiling results of the text index search into aggregate information related to characteristics of the search results from the metadata items associated with each of the textual information items in the search results from the metadata lookup structure, and reporting the aggregate information. In one embodiment of the first aspect, the step of accessing information sources for a plurality of textual information items prior to generating the text index is provided.
[0010] In one embodiment of the first aspect, the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating a plurality of metadata IDs, each metadata ID associated with at least a type of metadata, analyzing each textual information item to determine which metadata ID(s) are associated with the respective textual information item, and mapping each textual information item with the respective metadata ID(s) determined for it in the analyzing step.
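
As a rough illustration of the metadata ID embodiment in [0010], the following Python sketch assigns an ID to each distinct metadata value and maps each text item to the IDs found for it. The item contents, the metadata types (`date`, `author`), and the helper `build_metadata_ids` are hypothetical and are not taken from the application.

```python
from collections import defaultdict


def build_metadata_ids(items):
    """Assign an ID to each distinct (type, value) pair and map items to IDs.

    `items` maps a text item ID to a dict of metadata type -> value.
    """
    id_of = {}                        # (metadata type, value) -> metadata ID
    item_to_ids = defaultdict(list)   # text item ID -> list of metadata IDs
    for item_id, metadata in items.items():
        for meta_type, value in metadata.items():
            key = (meta_type, value)
            if key not in id_of:
                id_of[key] = len(id_of)   # next unused ID
            item_to_ids[item_id].append(id_of[key])
    return id_of, dict(item_to_ids)


# Hypothetical text items #10 and #11 sharing a date but not an author.
items = {
    10: {"date": "2007-08-01", "author": "alice"},
    11: {"date": "2007-08-01", "author": "bob"},
}
id_of, item_to_ids = build_metadata_ids(items)
```

Because both items share one date, they share one metadata ID for it, so the lookup structure stores a small integer per item rather than a repeated string.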
[0011] In another embodiment of the first aspect, the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating one or more metadata items associated with the textual information items, determining a quantity of the one or more metadata items, and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity.
[0012] In yet another embodiment of the first aspect, the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating one or more metadata items associated with the textual information items, determining a quantity of the one or more metadata items, and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity. In this embodiment, the number of metadata items generated is not the same for all of the textual information items.
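
One way to picture the allocation step in [0011] and [0012] is a two-pass build: first count how many metadata items exist, then allocate exactly that much storage and fill it. The sketch below is only an assumption about how this could be done; the function names and the start-offset layout are illustrative and do not come from the application.

```python
from array import array


def build_flat_lookup(per_item_ids):
    """Pack variable-length lists of metadata IDs into two fixed-size arrays."""
    # Pass 1: determine the quantity of metadata items to store.
    total = sum(len(ids) for ids in per_item_ids)
    # Allocate storage sized from that quantity, with no per-item padding.
    values = array("I", [0] * total)
    starts = array("I", [0] * (len(per_item_ids) + 1))
    # Pass 2: fill in; values[starts[i]:starts[i + 1]] belongs to item i.
    pos = 0
    for i, ids in enumerate(per_item_ids):
        starts[i] = pos
        for metadata_id in ids:
            values[pos] = metadata_id
            pos += 1
    starts[len(per_item_ids)] = pos
    return starts, values


def metadata_for(starts, values, item):
    """Return the metadata IDs stored for one text item."""
    return list(values[starts[item]:starts[item + 1]])


# Hypothetical items: one with three metadata IDs, one with none, one with one.
per_item = [[3, 7, 9], [], [7]]
starts, values = build_flat_lookup(per_item)
```

A fixed array with three slots per item would need nine entries here; the counted layout needs four values plus the start offsets.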
[0013] In one embodiment of the first aspect, the metadata items may include date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and/or language information. In one embodiment, the link information includes a Uniform Resource Locator. In another embodiment, the language information includes one or more language specific annotations. In another embodiment, the demographic information may be generated and includes at least one of age and gender of a text item author. In yet another embodiment, the language specific annotation may be provided by the textual information items. In yet another embodiment, the language specific annotation may be determined by analyzing the textual information items.
[0014] In one embodiment of the first aspect, the textual information items include electronic data from one or more of Internet message boards, blogs and news groups. In another embodiment, the aggregate information may include date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and/or language information pertaining to the search results.
[0015] In another embodiment of the first aspect, the text index may be updated after a predetermined time period. In one embodiment, the predetermined time period may be between five and fifteen minutes.
[0016] It is a second aspect of the present invention to provide a system for performing ad-hoc analysis including a computer server having access to information sources, the information sources including a plurality of textual information items, and a user computer device linked to the computer server. The user computer device includes software that performs the steps of: (a) generating a text index of the textual information items; (b) generating a metadata lookup structure based, at least in part, on the text index, the metadata lookup structure including metadata items associated with each of the textual information items; (c) searching the text index using search queries, the searching step producing search results including textual information items matching the search query; (d) compiling results of the text index search into aggregate information related to characteristics of the search results from the metadata items associated with each of the textual information items in the search results from the metadata lookup structure; and (e) reporting the aggregate information.
[0017] In one embodiment of the second aspect, the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating a plurality of metadata IDs, each metadata ID associated with at least a type of metadata, analyzing each textual information item to determine which metadata ID(s) are associated with the respective textual information item, and mapping each textual information item with the respective metadata ID(s) determined for it in the analyzing step.
[0018] In another embodiment of the second aspect, the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating metadata items associated with the textual information items, determining a quantity of the metadata items, and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity.
[0019] In yet another embodiment of the second aspect, the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating metadata items associated with the textual information items, wherein the number of metadata items generated is not the same for all of the textual information items, determining a quantity of the metadata items, and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity.
[0020] In one embodiment of the second aspect, the metadata items may include date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and/or language information.
[0021] In another embodiment of the second aspect, the aggregate information may include date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and/or language information.
[0022] In another embodiment of the second aspect, the text index may be updated after a predetermined time period. In one embodiment, the predetermined time period may be between five and fifteen minutes.
[0023] From the foregoing disclosure and the following detailed description of various preferred embodiments it will be apparent to those skilled in the art that the present invention provides a significant advance in the art of data analysis. Additional features and advantages of various preferred embodiments will be better understood in view of the detailed description provided below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] The present invention will be understood and appreciated more fully from the detailed description in conjunction with the following drawings in which:
Figure 1 is a flow diagram of one embodiment of the present invention.
Figure 2 is a view of an exemplary environment utilized in one embodiment of the present invention.
Figure 3 is an exemplary computer screenshot from one embodiment of the present invention.
DETAILED DESCRIPTION
[0025] It will be apparent to those skilled in the art that many uses and variations are possible for the method and system for data analysis. The following detailed discussion of various exemplary embodiments will illustrate the general principles of the invention. Other embodiments will be apparent to those skilled in the art given the benefit of this disclosure.
[0026] As shown in Figure 1, one embodiment of the present invention accomplishes this by generating 20 an index that not only includes typical keyword information but also additional metadata information such as links and date information. Building this information into the index allows this data to be searched and analyzed.
[0027] Embodiments of the present invention then use the index to rapidly build 22 a metadata lookup structure, rather than requiring additional analytics to be run separately on the data. By leveraging the index structures themselves to build 22 a metadata lookup structure, embodiments of the invention allow for deep analysis without requiring separate time-consuming text mining steps.
[0028] In an exemplary embodiment, the first step in implementing this method is gathering data over which ad-hoc analysis is desired. This data or information may come from any information source 24. This data may include consumer generated media or social media such as boards, blogs, and newsgroups. It may also include content such as news media, press releases, website content, local content, networked content or anything else that can be rendered in digital form.
[0029] There are numerous well known methods for acquiring this data. The most common current method is via Hypertext Transfer Protocol (HTTP) requests for the data. These requests may be for semi-structured data such as those available in feeds (RSS, Atom, etc.) or unstructured data such as raw Hypertext Markup Language (HTML). Other methods may include accessing data located on local or networked computer servers or devices.
[0030] Once the data has been acquired, it is processed to generate 20 a text index of the data. In one embodiment, a full text index of the data is generated 20. There are numerous technologies available for this process. In the exemplary embodiment, Lucene from Apache© is used to create 20 the full-text index. Apache© Lucene is a "high-performance, full-featured text search engine library written entirely in Java" (http://lucene.apache.org). Numerous other full-text indexing solutions may also be suitable.
[0031] When data is processed to generate 20 a full-text index, it goes through a series of steps. Text is broken down into keywords. Additional metadata such as date, author or source may also be included during this process. In an exemplary embodiment, data may include structured information and/or unstructured information. Structured information may contain separate elements for each metadata type. In one embodiment, these elements may include a data element, a title element and a body element. Examples of structured data may include a structured XML format such as an RSS feed. Unstructured information may be processed to extract metadata types. Such processing may include screen scraping techniques, pattern matching techniques or any other known processing technique capable of extracting metadata information. Metadata that may be the subject of analysis may also be included in the index. For instance, if a user wants to be able to break down the items by date, the data being added may include the date of creation or publication as additional metadata.
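As a rough illustration of the indexing step described above, the following Python sketch builds a toy in-memory inverted index in which keywords and metadata fields (date, author) are indexed side by side. It is not Lucene and does not reflect the exemplary embodiment's internals; the function name `build_index`, the `field:value` term convention and the sample items are assumptions made only for this example.

```python
import re
from collections import defaultdict

def build_index(items):
    """Map each term to the sorted list of item IDs containing it.

    Keywords come from the body text; metadata fields are indexed as
    'field:value' terms so they live in the same structure as keywords.
    """
    postings = defaultdict(set)
    for item_id, item in enumerate(items):
        # Break the text down into lowercase keywords.
        for word in re.findall(r"[a-z0-9']+", item["body"].lower()):
            postings["body:" + word].add(item_id)
        # Add whichever metadata fields the item actually carries.
        for field in ("date", "author"):
            if field in item:
                postings[field + ":" + item[field]].add(item_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical sample items; the third has no author metadata.
items = [
    {"body": "Harry Potter is great", "date": "2006-08-01", "author": "ann"},
    {"body": "New Potter film", "date": "2006-08-02", "author": "bob"},
    {"body": "Harry went home", "date": "2006-08-01"},
]
index = build_index(items)
```

Because date and author terms share the posting structure with keywords, the same index can later be asked both "which items mention this word" and "which items carry this metadata value".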
[0032] Source and demographic data may also be added to the index for further annotation. For instance, the data source and author information may be added to the index. Furthermore, demographic data about the author such as gender and age may be added to the index to allow for demographic based searching and analysis. For demographics analytics, the age and/or gender of an author may be generated for each item. Using this demographic information, one embodiment of the present invention may generate a demographic breakdown such as the age breakdown of authors of items.
[0033] Various types of extraction algorithms may also be performed on the data to further annotate the index. This may include link extraction (determining HTML anchor reference to other data). It may also include various types of entity extraction, such as person name extraction or company name extraction. In one embodiment, the text may be examined to extract all proper names based on capitalization. In another embodiment, techniques based on large lists of company names or pattern recognition may also extract entities.
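The two simplest extraction techniques mentioned above, link extraction and capitalization-based name extraction, might be sketched as follows. This is a minimal sketch using only the Python standard library; the class and function names are hypothetical, and the capitalization rule (two or more consecutive capitalized words) is a deliberately crude heuristic chosen for illustration rather than the technique any particular embodiment uses.

```python
import re
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every HTML anchor tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

def extract_proper_names(text):
    """Return runs of two or more capitalized words (crude heuristic)."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b", text)

links = extract_links('<p>See <a href="http://www.jkrowling.com/">this</a></p>')
names = extract_proper_names("I think Harry Potter met Albus Dumbledore yesterday.")
```

A heuristic of this kind will miss single-word names and will pick up capitalized phrases that are not names, which is one reason list-based or pattern-based entity extraction may be preferred.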
[0034] The index generation 20 process may be language neutral such that all items are processed identically regardless of language, or it can be language specific. In a language specific system, the language is identified. This identification may either be supplied to the system as a configuration item or be performed by a dynamic language identification module. Once the language is identified, language specific processing may take place. For instance, language specific entity extraction may be used to extract names for a particular language.
[0035] Once the full-text index is generated 20, the internal index data structures are used to build 22 a metadata lookup structure. A full text index typically builds tables that map a particular keyword or attribute to corresponding data. For instance, the metadata lookup structure may include a table that lists all text items having a common date. This metadata lookup structure is designed for rapid searching for various attributes.
[0036] In an exemplary embodiment, the metadata lookup structure maps text items to metadata information. The metadata information is stored in a way that a metadata value may be retrieved for any item in the text index. Typically, index systems have an identification (ID) that represents a discrete item. Such an ID may represent a value or a numeric representation of a value. This may be used as a reference into the metadata lookup structure. Specifically, the metadata lookup structure maps text items to one or more IDs representing metadata values or numerical representations thereof. For instance, the index may return item #10 and the metadata lookup structure may be used to retrieve the metadata values associated with item #10. In this manner, the date value and author value of item #10 may be easily retrieved.
[0037] In an exemplary embodiment, each metadata value is assigned a metadata ID and an array is built 24 mapping text item IDs to a metadata ID. A metadata ID may represent a metadata value or a numeric representation of a metadata value. This array is built 24 by interrogating the text index for each type of metadata and assigning it an ID. The text index is also interrogated for the text items matching that metadata value(s) and that metadata ID(s) is added to the array and associated with the corresponding text item.
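One way to picture the array described above is the following Python sketch. It assumes, purely for illustration, that the text index exposes its postings as a dictionary from `field:value` terms to lists of item IDs; the function name `build_metadata_array` and the reserved ID 0 for "no value" are likewise assumptions, not details taken from the disclosure.

```python
def build_metadata_array(postings, field, num_items):
    """Return (values, array) where array[item_id] is a metadata ID.

    postings maps 'field:value' terms to lists of item IDs, standing in
    for the text index. Metadata ID 0 is reserved for 'no value'.
    """
    values = [None]              # metadata ID -> metadata value
    array = [0] * num_items      # text item ID -> metadata ID
    prefix = field + ":"
    for term in sorted(postings):
        if not term.startswith(prefix):
            continue
        # Interrogate the index for this value and assign it an ID.
        metadata_id = len(values)
        values.append(term[len(prefix):])
        # Record that ID against every text item matching the value.
        for item_id in postings[term]:
            array[item_id] = metadata_id
    return values, array

postings = {
    "date:2006-08-01": [0, 2],
    "date:2006-08-02": [1],
    "author:ann": [0],
}
values, array = build_metadata_array(postings, "date", 4)
```

Given an item ID returned by a search, `values[array[item_id]]` then yields that item's date in constant time, without going back to the text.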
[0038] The metadata lookup structure may be implemented in a variety of ways. This may include many implementations, from a simple array that holds a metadata value for each item to a more complex structure that stores the metadata value in a more space efficient manner. Examples of such data structures include compressed, dynamic, "many-to-one" and indexed structures. In an exemplary embodiment, the metadata information is stored using a proprietary compression scheme. This metadata information may be stored in memory, on a disk, or in any other storage medium. An exemplary embodiment stores the data in memory for improved access speed.
[0039] A compressed structure references each item to conserve space. For example, a compressed structure may determine that an index has data from 200 unique dates. The compressed structure could assign each unique date a corresponding number (1-200) and then store the date and number as a single byte reference. The compressed structure may use a compressed number format such that smaller numbers are stored in fewer bytes than larger numbers. Furthermore, the assignment of reference numbers may be intelligent to make efficient use of space. For instance, smaller reference numbers may be used for items occurring frequently and larger reference numbers may be used for items occurring less frequently.
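The two ideas in the compressed structure, frequency-ordered reference numbers and a number format in which small numbers take fewer bytes, can be sketched as follows. The proprietary compression scheme of the exemplary embodiment is not disclosed, so this sketch substitutes a common 7-bits-per-byte variable-length integer encoding as a stand-in; the function names are hypothetical.

```python
from collections import Counter

def assign_ids(values):
    """Give the most frequent values the smallest reference numbers."""
    ranked = [value for value, _ in Counter(values).most_common()]
    return {value: number for number, value in enumerate(ranked)}

def encode_varint(number):
    """Encode a non-negative int using 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = number & 0x7F
        number >>= 7
        if number:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data, pos=0):
    """Return (number, next_position) decoded from data starting at pos."""
    number = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        number |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return number, pos
        shift += 7

dates = ["8/2/2006", "8/1/2006", "8/1/2006", "8/1/2006", "8/2/2006", "8/3/2006"]
ids = assign_ids(dates)
packed = b"".join(encode_varint(ids[d]) for d in dates)
```

With 200 unique dates, the 128 most frequent would fit in one byte each under this encoding, and only the rarer ones would need two.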
[0040] In a dynamic structure, the size of the data structure varies depending on the content. The dynamic structure first examines the items in the index and determines the necessary size of the structure, such as how many metadata items it needs to store for each item and how many items are present. The dynamic structure then allocates the appropriate amount of memory.
[0041] The many-to-one structure allows multiple metadata items to be mapped to one item. Traditional methods create a fixed array with a set number of metadata items per item. However, some metadata items appear in different quantities. For instance, some items may be tagged with three different subjects while other posts might not be tagged at all. The structure allows for both, while not wasting memory space.
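The dynamic and many-to-one structures can be combined in a single layout, sketched below under the assumption that metadata items are already represented as integer metadata IDs. The layout (one flat data array plus an offsets array) and the function names are illustrative choices, not details from the disclosure. A first pass counts what must be stored, and only then are exactly sized arrays allocated.

```python
from array import array

def build_many_to_one(item_tags):
    """Pack variable-length lists of metadata IDs into two flat arrays.

    item_tags[i] is the list of metadata IDs for item i (possibly empty).
    """
    # First pass: determine the quantity of metadata items to store.
    total = sum(len(tags) for tags in item_tags)
    # Allocate exactly what is needed, no fixed slots per item.
    offsets = array("I", [0] * (len(item_tags) + 1))
    data = array("I", [0] * total)
    # Second pass: fill the arrays.
    pos = 0
    for item_id, tags in enumerate(item_tags):
        offsets[item_id] = pos
        for tag in tags:
            data[pos] = tag
            pos += 1
    offsets[len(item_tags)] = pos
    return offsets, data

def tags_for(offsets, data, item_id):
    """Return the metadata IDs stored for one item."""
    return list(data[offsets[item_id]:offsets[item_id + 1]])

# One item with three subjects, one with none, one with a single subject.
offsets, data = build_many_to_one([[3, 7, 9], [], [4]])
```

An untagged item costs only its entry in the offsets array, whereas a fixed array sized for three subjects per item would reserve three slots for it.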
[0042] An indexed data structure includes a periodic offset index to improve lookup performance. Because metadata items may appear in different quantities and each metadata item may be variable in size, a periodic offset index provides a means for quickly determining where a metadata item is located in the data structure. Unlike a typical fixed-size array, the indexed data structure may not know the location of a specific offset. In a fixed-size array with each metadata item having 4 bytes, going to metadata item #1002 would only require navigating to offset #4008 (item #1002 x 4 bytes per item). Because the metadata items may vary for each item and the size of each metadata item is not necessarily fixed, a lookup table including an offset is helpful. For example, to find metadata item #1002, the indexed data structure may reference the lookup table (the lookup table returns an offset of metadata item #1000) to move directly to metadata item #1000 and then process through two metadata items to get to #1002. Without a lookup table, the size of each metadata item is unknown and the process would need to start at the beginning of the array and process all 1,002 entries.
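The lookup described above can be sketched in Python as follows. The record format (a one-byte length prefix followed by UTF-8 bytes) and the names `pack`, `lookup` and `step` are assumptions made for this example; the point is only that a checkpoint offset is recorded every `step` records, so a lookup jumps to the nearest checkpoint and scans forward at most `step - 1` records.

```python
def pack(values, step):
    """Pack strings as length-prefixed records.

    Returns (data, checkpoints), where checkpoints[k] is the byte offset
    of record number k * step.
    """
    data = bytearray()
    checkpoints = []
    for i, value in enumerate(values):
        if i % step == 0:
            checkpoints.append(len(data))
        raw = value.encode("utf-8")
        if len(raw) > 255:
            raise ValueError("record too long for a 1-byte length prefix")
        data.append(len(raw))
        data += raw
    return bytes(data), checkpoints

def lookup(data, checkpoints, step, n):
    """Return record n: jump to the nearest checkpoint, then scan forward."""
    pos = checkpoints[n // step]
    for _ in range(n % step):
        pos += 1 + data[pos]      # skip one variable-length record
    length = data[pos]
    return data[pos + 1:pos + 1 + length].decode("utf-8")

# Hypothetical records of varying size.
values = ["x" * (i % 5) + str(i) for i in range(1100)]
data, checkpoints = pack(values, 1000)
found = lookup(data, checkpoints, 1000, 1002)
```

With `step` set to 1000, finding record #1002 jumps to the checkpoint for #1000 and skips two records, mirroring the example in the text. A smaller `step` shortens the scan at the cost of a larger checkpoint table.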
[0043] Once the metadata lookup structure is generated, the system is ready for ad-hoc use. A user may provide a query using standard full text query syntax. For instance, "Harry Potter" may be used as a query string. This query string is passed to the indexing system, which searches 26 the text index and produces the items in the text index matching this query.
[0044] As items are identified, the metadata lookup structure may then be used to retrieve 26 metadata information for the items found in the searching step. As this data is retrieved 26, it may be used to compile or aggregate 28 the search results based upon the metadata associated therewith. Compilation or aggregation 28 involves grouping items that share a common attribute.
[0045] For example, the system may aggregate 28 items found in the searching step based on date. In this situation, it would create a table of dates and a count of the items associated with each date. As each item is processed, the counter for that item's date may be incremented.
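Date aggregation of this kind reduces to a counting loop over the search hits. The sketch below assumes a simple lookup structure in which `date_ids[item_id]` holds a metadata ID and `date_values[metadata_id]` holds the date itself; these names and the sample data are hypothetical.

```python
from collections import Counter

def aggregate_by_date(matching_item_ids, date_ids, date_values):
    """Count search hits per date using the metadata lookup arrays."""
    counts = Counter()
    for item_id in matching_item_ids:
        # Look up the item's date and increment that date's counter.
        counts[date_values[date_ids[item_id]]] += 1
    return sorted(counts.items())

date_values = ["8/1/2006", "8/2/2006"]   # metadata ID -> date
date_ids = [0, 1, 0, 0, 1]               # item ID -> metadata ID
hits = [0, 2, 4]                         # item IDs returned by a search
table = aggregate_by_date(hits, date_ids, date_values)
```

Only the items returned by the search are touched, so the cost of the aggregation grows with the number of hits rather than with the size of the index.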
[0046] Finally, this aggregate information is provided to the user. In an exemplary embodiment, this aggregate information is reported 30 to the user by displaying the aggregate information on a display device and/or by generating a hard-copy report of such aggregate information. This aggregate information may be displayed in a table format, graphical format, or any other known format. For example, date analysis may generate a table similar to the following:
Mentions of "Harry Potter" by date
8/1/2006 1188
8/2/2006 1166
8/3/2006 986
8/4/2006 992
8/5/2006 738
8/6/2006 770
8/7/2006 436
[0047] Similarly, the system may aggregate 28 the search results based on other types of data such as links (HTML anchor tag references to other data). As each item is processed, links may be added to a table to produce a table of the most popular links.
[0048] Link analysis may produce a table similar to the following:
Most Popular Links for posts mentioning "Harry Potter"
http://www.jkrowling.com/ 55
http://www.cbc.ca/story/arts/national/2006/09/03/harrypotter... 48
http://www.muggienet.com/ 43
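A table of most popular links can be produced the same way as the date table, except that each item may carry several links or none. The following is a minimal sketch; the dictionary `links_by_item` stands in for a many-to-one metadata lookup structure, and the choice to count a link once per item even when the item repeats it is an assumption of this example.

```python
from collections import Counter

def top_links(matching_item_ids, links_by_item, n=3):
    """Return the n links cited by the most matching items."""
    counts = Counter()
    for item_id in matching_item_ids:
        # Count each link once per item, even if the item repeats it.
        counts.update(set(links_by_item.get(item_id, ())))
    return counts.most_common(n)

# Hypothetical link metadata keyed by item ID.
links_by_item = {
    0: ["http://a.example/", "http://b.example/"],
    1: ["http://a.example/"],
    2: ["http://a.example/", "http://a.example/", "http://c.example/"],
    3: ["http://b.example/"],
}
popular = top_links([0, 1, 2, 3], links_by_item, n=2)
```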
[0049] A wide variety of analytics may be generated with this method. Author information may be aggregated 28 to find authors that write about a certain subject. Keyword analysis may be aggregated 28 to provide common keywords related to a query. Sentiment analysis may be aggregated 28 to give some indication of overall opinion about a certain subject.
[0050] Other analytics such as language, demographic, and video analytics may also be generated. For language analytics, the language of an item may be indexed 22 or may be otherwise determined. This may be accomplished by using language identifier metadata during indexing 22. With this language metadata, reports of language breakdowns may be generated. Such a language identifier may be commercially available. For demographics analytics, the age and/or gender of an author may be generated for each item. Using this demographic information, one embodiment of the present invention may generate a demographic breakdown such as the age breakdown of authors of items mentioning Harry Potter. For video analytics, the index may include video links in the items. One embodiment of the present invention can identify video links in posts based on standard pattern matching with known video sites and URL formats. This video metadata may be used to produce a report of the most cited videos for a given search query.
[0051] Extracted entities may be used to provide additional analytics. For instance, if persons' names were extracted and annotated in the index, the system may provide a list of the persons' names most commonly used in items matching the query.
[0052] The results of ad-hoc analysis may be used in many ways. Results may be used to track important dates or uncover important media sites related to query terms. For example, the "Harry Potter" query revealed that August 1 had a higher amount of "buzz" or popularity related to Harry Potter than other days. Also, the link analysis revealed that www.jkrowling.com (the website of the author of the Harry Potter book series) was the most popular link for the given time period.
[0053] In one exemplary embodiment, the present invention is directed to a system for performing ad-hoc analysis. Such a system (as shown in Figure 2) includes a computer server 32 and a user computer device 34. The computer server 32 is any server device capable of accessing information or data in an information source 24. This information or data may include consumer generated media or social media such as boards, blogs, and newsgroups. It may also include content such as news media, press releases, website content, local content, networked content or any other information that can be rendered in digital form.
[0054] In this exemplary system, the user computer device 34 is linked via data links (wired, wireless, and/or networked) to the computer server 32. The user computer device 34 includes software tools 36 operating thereon. These software tools 36 are configured to execute instructions to perform a method of ad-hoc analysis. This method generates a metadata lookup structure based, at least in part, on information or data gathered from one or more information sources 24. The software tools 36 then search the one or more information sources 24 for a search string. This search string may be any string of characters that a user desires to search. The search returns results that are compiled into aggregate information. As described above, the compiling of the results incorporates the metadata lookup table for improved speed and completeness. The use of the metadata lookup table also allows for various analytics to be run. The software 36 then reports the aggregate information. This reporting may include outputting the aggregate information to an output device such as a monitor, a printer or other output device.
[0055] This system may analyze the aggregate information based on a date, a link, an author, a keyword, a sentiment, a demographic, an entity, and/or a language. Similar to the Harry Potter example, a series of analytics may be performed to better understand various aspects of the source or content of the information.
[0056] It should be understood that the computer server 32 on which the exemplary system is operating (and which may appear in the appended claims) may be a single computer server 32, a networked group of computer servers 32, or any other networked computer device or computerized device or system of computer devices or computerized devices on which the tools and/or processes of the exemplary embodiments may operate. It is also to be understood that the user computer device 34 may also comprise the server 32 or be included with the server 32 system.
[0057] Figure 3 depicts an exemplary computer screenshot of one embodiment of the present invention. This screenshot is an exemplary screenshot of what a user may see as aggregated information after searching for "Harry Potter." The search interface may include an input area 38 for inputting a search string 40, date parameters 42 and other options. When a user wishes to search for "Harry Potter," the user may input that term into the input area 38. Also, the user may input specific dates 42 or a date range 42 (such as the last 90 days) for which to search. Any other search commands or Boolean operators may be used in the search query as known to those of ordinary skill in the art. The user may then click on (or otherwise actuate or activate) the search button 44 to begin searching. After the search is complete, the system may output a graphical representation 46 of aggregated results for the given data parameters. In this example, the search for "Harry Potter" between August 1, 2007 and August 10, 2007 generated a graph 46 showing how many messages were found. In addition, the graph may be broken down to show how many messages were found on message boards, blogs, and groups. The graph 46 in this embodiment may be displayed in various levels of detail based on user-selected display options.
[0058] Following from the above description and invention summaries, it should be apparent to persons of ordinary skill in the art that, while the methods and systems herein described constitute exemplary embodiments of the present invention, it is to be understood that the inventions contained herein are not limited to the above precise embodiments and that changes may be made without departing from the scope of the invention as defined by the claims. Likewise, it is to be understood that the invention is defined by the claims and it is not necessary to meet any or all of the identified advantages or objects of the invention disclosed herein in order to fall within the scope of the claims, since inherent and/or unforeseen advantages of the present invention may exist even though they may not have been explicitly discussed herein.
[0059] What is claimed is:

Claims

1. A computer implemented method for performing ad-hoc analysis, the method comprising the steps of: generating a text index of a plurality of textual information items; generating a metadata lookup structure based, at least in part, on the text index, the metadata lookup structure including one or more metadata items associated with each of the textual information items; searching the text index using one or more search queries, the searching step producing search results including one or more textual information items matching the search query; compiling results of the text index search into aggregate information related to characteristics of the search results from the metadata items associated with each of the one or more textual information items in the search results from the metadata lookup structure; and reporting the aggregate information.
2. The method of claim 1, further comprising the step of: prior to generating the text index, accessing a plurality of information sources for a plurality of textual information items.
3. The method of claim 1, wherein the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating a plurality of metadata IDs, each metadata ID associated with at least a type of metadata; analyzing each textual information item to determine which metadata ID(s) are associated with the respective textual information item; and mapping each textual information item with the respective metadata ID(s) determined for it in the analyzing step.
4. The method of claim 1, wherein the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating one or more metadata items associated with the textual information items; determining a quantity of the one or more metadata items; and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity.
5. The method of claim 1, wherein the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating one or more metadata items associated with the textual information items, wherein the number of metadata items generated is not the same for all of the textual information items; determining a quantity of the one or more metadata items; and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity.
6. The method of claim 1, wherein the one or more metadata items includes at least one of date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and language information.
7. The method of claim 6, wherein the link information includes a Uniform Resource Locator.
8. The method of claim 6, wherein the demographic information is generated and includes at least one of age and gender of a text item author.
9. The method of claim 6, wherein the language information includes one or more language specific annotation.
10. The method of claim 9, wherein the language specific annotation is provided by the textual information items.
11. The method of claim 9, wherein the language specific annotation is determined by analyzing the textual information items.
12. The method of claim 1, wherein the textual information items include electronic data from one or more of Internet message boards, blogs and news groups.
13. The method of claim 1, wherein the aggregate information includes at least one of date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and language information pertaining to the search results.
14. The method of claim 1, wherein the text index is updated after a predetermined time period.
15. The method of claim 14, wherein the predetermined time period is between five and fifteen minutes.
16. A system for performing ad-hoc analysis, comprising: a computer server having access to one or more information sources, the one or more information sources including a plurality of textual information items; and a user computer device linked via one or more data links to the computer server, the user computer device including software configured to perform the steps of: generating a text index of the textual information items; generating a metadata lookup structure based, at least in part, on the text index, the metadata lookup structure including one or more metadata items associated with each of the textual information items; searching the text index using one or more search queries, the searching step producing search results including one or more textual information items matching the search query; compiling results of the text index search into aggregate information related to characteristics of the search results from the metadata items associated with each of the one or more textual information items in the search results from the metadata lookup structure; and reporting the aggregate information.
17. The system of claim 16, wherein the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating a plurality of metadata IDs, each metadata ID associated with at least a type of metadata; analyzing each textual information item to determine which metadata ID(s) are associated with the respective textual information item; and mapping each textual information item with the respective metadata ID(s) determined for it in the analyzing step.
18. The system of claim 16, wherein the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating one or more metadata items associated with the textual information items; determining a quantity of the one or more metadata items; and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity.
19. The system of claim 16, wherein the step of generating a metadata lookup structure based, at least in part, on the text index includes the steps of: generating one or more metadata items associated with the textual information items, wherein the number of metadata items generated is not the same for all of the textual information items; determining a quantity of the one or more metadata items; and dynamically allocating a portion of a computer memory component based, at least in part, on the determined quantity.
20. The system of claim 16, wherein the one or more metadata items includes at least one of date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and language information.
21. The system of claim 16, wherein the aggregate information includes at least one of date information, link information, author information, keyword information, sentiment information, demographic information, entity information, and language information pertaining to the search results.
22. The system of claim 16, wherein the text index is updated after a predetermined time period.
23. The system of claim 22, wherein the predetermined time period is between five and fifteen minutes.
PCT/US2007/021035 2006-09-27 2007-09-27 System and method of ad-hoc analysis of data WO2008039542A2 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US84748606P 2006-09-27 2006-09-27
US60/847,486 2006-09-27
US11/897,984 US7660783B2 (en) 2006-09-27 2007-08-31 System and method of ad-hoc analysis of data
US11/897,984 2007-08-31

Publications (2)

Publication Number Publication Date
WO2008039542A2 true WO2008039542A2 (en) 2008-04-03
WO2008039542A3 WO2008039542A3 (en) 2008-10-02

Family

ID=39226278

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/021035 WO2008039542A2 (en) 2006-09-27 2007-09-27 System and method of ad-hoc analysis of data

Country Status (2)

Country Link
US (1) US7660783B2 (en)
WO (1) WO2008039542A2 (en)

Families Citing this family (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7268700B1 (en) 1998-01-27 2007-09-11 Hoffberg Steven M Mobile communication device
US9818136B1 (en) * 2003-02-05 2017-11-14 Steven M. Hoffberg System and method for determining contingent relevance
US7756750B2 (en) 2003-09-02 2010-07-13 Vinimaya, Inc. Method and system for providing online procurement between a buyer and suppliers over a network
US8538997B2 (en) 2004-06-25 2013-09-17 Apple Inc. Methods and systems for managing data
US8131674B2 (en) 2004-06-25 2012-03-06 Apple Inc. Methods and systems for managing data
US7590589B2 (en) 2004-09-10 2009-09-15 Hoffberg Steven M Game theoretic prioritization scheme for mobile ad hoc networks permitting hierarchal deference
US8874477B2 (en) 2005-10-04 2014-10-28 Steven Mark Hoffberg Multifactorial optimization system and method
US7464078B2 (en) * 2005-10-25 2008-12-09 International Business Machines Corporation Method for automatically extracting by-line information
TWI337712B (en) * 2006-10-30 2011-02-21 Inst Information Industry Systems and methods for measuring behavior characteristics, and machine readable medium thereof
US7660793B2 (en) 2006-11-13 2010-02-09 Exegy Incorporated Method and system for high performance integration, processing and searching of structured and unstructured data using coprocessors
US8326819B2 (en) 2006-11-13 2012-12-04 Exegy Incorporated Method and system for high performance data metatagging and data indexing using coprocessors
WO2008083504A1 (en) * 2007-01-10 2008-07-17 Nick Koudas Method and system for information discovery and text analysis
US7827128B1 (en) 2007-05-11 2010-11-02 Aol Advertising Inc. System identification, estimation, and prediction of advertising-related data
US20090157668A1 (en) * 2007-12-12 2009-06-18 Christopher Daniel Newton Method and system for measuring an impact of various categories of media owners on a corporate brand
US8347326B2 (en) 2007-12-18 2013-01-01 The Nielsen Company (US) Identifying key media events and modeling causal relationships between key events and reported feelings
CA2650319C (en) 2008-01-24 2016-10-18 Radian6 Technologies Inc. Method and system for targeted advertising based on topical memes
US9245252B2 (en) 2008-05-07 2016-01-26 Salesforce.Com, Inc. Method and system for determining on-line influence in social media
US8374986B2 (en) 2008-05-15 2013-02-12 Exegy Incorporated Method and system for accelerated stream processing
US9521013B2 (en) 2008-12-31 2016-12-13 Facebook, Inc. Tracking significant topics of discourse in forums
US8462160B2 (en) * 2008-12-31 2013-06-11 Facebook, Inc. Displaying demographic information of members discussing topics in a forum
US10007729B1 (en) 2009-01-23 2018-06-26 Zakta, LLC Collaboratively finding, organizing and/or accessing information
US10191982B1 (en) 2009-01-23 2019-01-29 Zakta, LLC Topical search portal
US9607324B1 (en) 2009-01-23 2017-03-28 Zakta, LLC Topical trust network
US8230062B2 (en) 2010-06-21 2012-07-24 Salesforce.Com, Inc. Referred internet traffic analysis system and method
US10068266B2 (en) 2010-12-02 2018-09-04 Vinimaya Inc. Methods and systems to maintain, check, report, and audit contract and historical pricing in electronic procurement
US10146845B2 (en) 2012-10-23 2018-12-04 Ip Reservoir, Llc Method and apparatus for accelerated format translation of data in a delimited data format
US10102260B2 (en) 2012-10-23 2018-10-16 Ip Reservoir, Llc Method and apparatus for accelerated data translation using record layout detection
US9633093B2 (en) 2012-10-23 2017-04-25 Ip Reservoir, Llc Method and apparatus for accelerated format translation of data in a delimited data format
US9298814B2 (en) 2013-03-15 2016-03-29 Maritz Holdings Inc. Systems and methods for classifying electronic documents
US11928606B2 (en) 2013-03-15 2024-03-12 TSG Technologies, LLC Systems and methods for classifying electronic documents
US10521807B2 (en) 2013-09-05 2019-12-31 TSG Technologies, LLC Methods and systems for determining a risk of an emotional response of an audience
US20150120748A1 (en) * 2013-10-31 2015-04-30 Microsoft Corporation Indexing spreadsheet structural attributes for searching
CN103678537B (en) * 2013-12-02 2017-06-20 华为技术有限公司 Metadata amending method, device and node device based on cluster
CA2934280C (en) 2013-12-16 2020-08-25 Mx Technologies, Inc. Long string pattern matching of aggregated account data
WO2015164639A1 (en) 2014-04-23 2015-10-29 Ip Reservoir, Llc Method and apparatus for accelerated data translation
US9990441B2 (en) * 2014-12-05 2018-06-05 Facebook, Inc. Suggested keywords for searching content on online social networks
US9940354B2 (en) 2015-03-09 2018-04-10 International Business Machines Corporation Providing answers to questions having both rankable and probabilistic components
US10942943B2 (en) 2015-10-29 2021-03-09 Ip Reservoir, Llc Dynamic field data translation to support high performance stream data processing
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US9996527B1 (en) 2017-03-30 2018-06-12 International Business Machines Corporation Supporting interactive text mining process with natural language and dialog
US10643178B1 (en) 2017-06-16 2020-05-05 Coupa Software Incorporated Asynchronous real-time procurement system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6928526B1 (en) * 2002-12-20 2005-08-09 Datadomain, Inc. Efficient data storage system
US20060041605A1 (en) * 2004-04-01 2006-02-23 King Martin T Determining actions involving captured information and electronic content associated with rendered documents
US20060173837A1 (en) * 2005-01-11 2006-08-03 Viktors Berstis Systems, methods, and media for awarding credits based on provided usage information
US20070027840A1 (en) * 2005-07-27 2007-02-01 Jobserve Limited Searching method and system

Family Cites Families (142)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3950618A (en) 1971-03-25 1976-04-13 Bloisi Albertoni De Lemos System for public opinion research
US4930077A (en) 1987-04-06 1990-05-29 Fan David P Information processing expert system for text analysis and predicting public opinion based information available to the public
US5124911A (en) 1988-04-15 1992-06-23 Image Engineering, Inc. Method of evaluating consumer choice through concept testing for the marketing and development of consumer products
US5041972A (en) 1988-04-15 1991-08-20 Frost W Alan Method of measuring and evaluating consumer response for the development of consumer products
US5301109A (en) 1990-06-11 1994-04-05 Bell Communications Research, Inc. Computerized cross-language document retrieval using latent semantic indexing
US5077785A (en) 1990-07-13 1991-12-31 Monson Gerald D System for recording comments by patrons of establishments
US5321833A (en) 1990-08-29 1994-06-14 Gte Laboratories Incorporated Adaptive ranking system for information retrieval
US5317507A (en) 1990-11-07 1994-05-31 Gallant Stephen I Method for document retrieval and for word sense disambiguation using neural networks
US5519608A (en) 1993-06-24 1996-05-21 Xerox Corporation Method for extracting from a text corpus answers to questions stated in natural language by using linguistic analysis and hypothesis generation
JPH0756933A (en) 1993-06-24 1995-03-03 Xerox Corp Method for retrieval of document
CA2179523A1 (en) 1993-12-23 1995-06-29 David A. Boulton Method and apparatus for implementing user feedback
US5671333A (en) 1994-04-07 1997-09-23 Lucent Technologies Inc. Training apparatus and method
US5495412A (en) 1994-07-15 1996-02-27 Ican Systems, Inc. Computer-based method and apparatus for interactive computer-assisted negotiations
US6029195A (en) 1994-11-29 2000-02-22 Herz; Frederick S. M. System for customized electronic identification of desirable objects
US5668953A (en) 1995-02-22 1997-09-16 Sloo; Marshall Allan Method and apparatus for handling a complaint
US5895450A (en) 1995-02-22 1999-04-20 Sloo; Marshall A. Method and apparatus for handling complaints
DE69634247T2 (en) 1995-04-27 2006-01-12 Northrop Grumman Corp., Los Angeles Classifier having a neural network for adaptive filtering
US5659732A (en) 1995-05-17 1997-08-19 Infoseek Corporation Document retrieval over networks wherein ranking and relevance scores are computed at the client for multiple database documents
US5675710A (en) 1995-06-07 1997-10-07 Lucent Technologies, Inc. Method and apparatus for training a text classifier
WO1997008604A2 (en) 1995-08-16 1997-03-06 Syracuse University Multilingual document retrieval system and method using semantic vector matching
US6026388A (en) 1995-08-16 2000-02-15 Textwise, Llc User interface and other enhancements for natural language information retrieval system and method
US5659742A (en) 1995-09-15 1997-08-19 Infonautics Corporation Method for storing multi-media information in an information retrieval system
US5819285A (en) 1995-09-20 1998-10-06 Infonautics Corporation Apparatus for capturing, storing and processing co-marketing information associated with a user of an on-line computer service using the world-wide-web.
IT1285619B1 (en) 1996-03-15 1998-06-18 Co Me Sca Costruzioni Meccanic BENDING METHOD OF A DIELECTRIC SHEET IN THE SHEET
US6314420B1 (en) 1996-04-04 2001-11-06 Lycos, Inc. Collaborative/adaptive search engine
US5867799A (en) 1996-04-04 1999-02-02 Lang; Andrew K. Information system and method for filtering a massive flow of information entities to meet user information classification needs
US5950172A (en) 1996-06-07 1999-09-07 Klingman; Edwin E. Secured electronic rating system
US6026387A (en) 1996-07-15 2000-02-15 Kesel; Brad Consumer comment reporting apparatus and method
US5822744A (en) 1996-07-15 1998-10-13 Kesel; Brad Consumer comment reporting apparatus and method
US6038610A (en) 1996-07-17 2000-03-14 Microsoft Corporation Storage of sitemaps at server sites for holding information regarding content
US5864863A (en) 1996-08-09 1999-01-26 Digital Equipment Corporation Method for parsing, indexing and searching world-wide-web pages
US5920854A (en) 1996-08-14 1999-07-06 Infoseek Corporation Real-time document collection search engine with phrase indexing
US5857179A (en) 1996-09-09 1999-01-05 Digital Equipment Corporation Computer method and apparatus for clustering documents and automatic generation of cluster keywords
US5911043A (en) 1996-10-01 1999-06-08 Baker & Botts, L.L.P. System and method for computer-based rating of information retrieved from a computer network
US5924094A (en) 1996-11-01 1999-07-13 Current Network Technologies Corporation Independent distributed database system
US5836771A (en) 1996-12-02 1998-11-17 Ho; Chi Fai Learning method and system based on questioning
US5950189A (en) 1997-01-02 1999-09-07 At&T Corp Retrieval system and method
US7437351B2 (en) 1997-01-10 2008-10-14 Google Inc. Method for searching media
US6138128A (en) 1997-04-02 2000-10-24 Microsoft Corp. Sharing and organizing world wide web references using distinctive characters
US6362837B1 (en) 1997-05-06 2002-03-26 Michael Ginn Method and apparatus for simultaneously indicating rating value for the first document and display of second document in response to the selection
US6098066A (en) 1997-06-13 2000-08-01 Sun Microsystems, Inc. Method and apparatus for searching for documents stored within a document directory hierarchy
US6012053A (en) 1997-06-23 2000-01-04 Lycos, Inc. Computer system with user-controlled relevance ranking of search results
US6233575B1 (en) 1997-06-24 2001-05-15 International Business Machines Corporation Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values
US6119933A (en) 1997-07-17 2000-09-19 Wong; Earl Chang Method and apparatus for customer loyalty and marketing analysis
US6278990B1 (en) 1997-07-25 2001-08-21 Claritech Corporation Sort system for text retrieval
US5845278A (en) 1997-09-12 1998-12-01 Infoseek Corporation Method for automatically selecting collections to search in full text searches
US5983216A (en) 1997-09-12 1999-11-09 Infoseek Corporation Performing automated document collection and selection by providing a meta-index with meta-index values indentifying corresponding document collections
US5974412A (en) 1997-09-24 1999-10-26 Sapient Health Network Intelligent query system for automatically indexing information in a database and automatically categorizing users
US6094657A (en) 1997-10-01 2000-07-25 International Business Machines Corporation Apparatus and method for dynamic meta-tagging of compound documents
US6266664B1 (en) 1997-10-01 2001-07-24 Rulespace, Inc. Method for scanning, analyzing and rating digital information content
US5953718A (en) 1997-11-12 1999-09-14 Oracle Corporation Research mode for a knowledge base search and retrieval system
US6236991B1 (en) 1997-11-26 2001-05-22 International Business Machines Corp. Method and system for providing access for categorized information from online internet and intranet sources
US6269362B1 (en) 1997-12-19 2001-07-31 Alta Vista Company System and method for monitoring web pages by comparing generated abstracts
US6289342B1 (en) 1998-01-05 2001-09-11 Nec Research Institute, Inc. Autonomous citation indexing and literature browsing using citation context
US6256620B1 (en) 1998-01-16 2001-07-03 Aspect Communications Method and apparatus for monitoring information access
JP3692764B2 (en) 1998-02-25 2005-09-07 株式会社日立製作所 Structured document registration method, search method, and portable medium used therefor
US6067539A (en) 1998-03-02 2000-05-23 Vigil, Inc. Intelligent information retrieval system
US6185558B1 (en) 1998-03-03 2001-02-06 Amazon.Com, Inc. Identifying the items most relevant to a current query based on items selected in connection with similar queries
US6421675B1 (en) 1998-03-16 2002-07-16 S. L. I. Systems, Inc. Search engine
US6064980A (en) 1998-03-17 2000-05-16 Amazon.Com, Inc. System and methods for collaborative recommendations
US6236987B1 (en) 1998-04-03 2001-05-22 Damon Horowitz Dynamic content organization in information retrieval systems
US6112203A (en) 1998-04-09 2000-08-29 Altavista Company Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis
US6078892A (en) 1998-04-09 2000-06-20 International Business Machines Corporation Method for customer lead selection and optimization
US6032145A (en) 1998-04-10 2000-02-29 Requisite Technology, Inc. Method and system for database manipulation
US7051277B2 (en) 1998-04-17 2006-05-23 International Business Machines Corporation Automated assistant for organizing electronic documents
GB2336698A (en) 1998-04-24 1999-10-27 Dialog Corp Plc The Automatic content categorisation of text data files using subdivision to reduce false classification
US6006225A (en) 1998-06-15 1999-12-21 Amazon.Com Refining search queries by the suggestion of correlated terms from prior searches
US6192360B1 (en) 1998-06-23 2001-02-20 Microsoft Corporation Methods and apparatus for classifying text and for building a text classifier
US6401118B1 (en) 1998-06-30 2002-06-04 Online Monitoring Services Method and computer program product for an online monitoring search engine
US6202068B1 (en) 1998-07-02 2001-03-13 Thomas A. Kraay Database display and search method
US6035294A (en) 1998-08-03 2000-03-07 Big Fat Fish, Inc. Wide access databases and database systems
AU5465099A (en) 1998-08-04 2000-02-28 Rulespace, Inc. Method and system for deriving computer users' personal interests
US6138113A (en) 1998-08-10 2000-10-24 Altavista Company Method for identifying near duplicate pages in a hyperlinked database
US6654813B1 (en) 1998-08-17 2003-11-25 Alta Vista Company Dynamically categorizing entity information
US6393460B1 (en) 1998-08-28 2002-05-21 International Business Machines Corporation Method and system for informing users of subjects of discussion in on-line chats
US6334131B2 (en) 1998-08-29 2001-12-25 International Business Machines Corporation Method for cataloging, filtering, and relevance ranking frame-based hierarchical information structures
US6513032B1 (en) 1998-10-29 2003-01-28 Alta Vista Company Search and navigation system and method using category intersection pre-computation
US6360215B1 (en) 1998-11-03 2002-03-19 Inktomi Corporation Method and apparatus for retrieving documents based on information other than document content
US6751606B1 (en) 1998-12-23 2004-06-15 Microsoft Corporation System for enhancing a query interface
US6236977B1 (en) 1999-01-04 2001-05-22 Realty One, Inc. Computer implemented marketing system
US20020059258A1 (en) 1999-01-21 2002-05-16 James F. Kirkpatrick Method and apparatus for retrieving and displaying consumer interest data from the internet
US6385586B1 (en) 1999-01-28 2002-05-07 International Business Machines Corporation Speech recognition text-based language conversion and text-to-speech in a client-server configuration to enable language translation devices
US6418433B1 (en) 1999-01-28 2002-07-09 International Business Machines Corporation System and method for focussed web crawling
US6411936B1 (en) 1999-02-05 2002-06-25 Nval Solutions, Inc. Enterprise value enhancement system and method
US7277919B1 (en) 1999-03-19 2007-10-02 Bigfix, Inc. Relevance clause for computed relevance messaging
US6553358B1 (en) 1999-04-20 2003-04-22 Microsoft Corporation Decision-theoretic approach to harnessing text classification for guiding automated action
US6304864B1 (en) 1999-04-20 2001-10-16 Textwise Llc System for retrieving multimedia information from the internet using multiple evolving intelligent agents
US6571234B1 (en) 1999-05-11 2003-05-27 Prophet Financial Systems, Inc. System and method for managing online message board
US6493703B1 (en) 1999-05-11 2002-12-10 Prophet Financial Systems System and method for implementing intelligent online community message board
US7006999B1 (en) 1999-05-13 2006-02-28 Xerox Corporation Method for enabling privacy and trust in electronic communities
US6571238B1 (en) 1999-06-11 2003-05-27 Abuzz Technologies, Inc. System for regulating flow of information to user by using time dependent function to adjust relevancy threshold
US6546390B1 (en) 1999-06-11 2003-04-08 Abuzz Technologies, Inc. Method and apparatus for evaluating relevancy of messages to users
KR20010004404A (en) 1999-06-28 2001-01-15 정선종 Keyfact-based text retrieval system, keyfact-based text index method, and retrieval method using this system
US6507866B1 (en) 1999-07-19 2003-01-14 At&T Wireless Services, Inc. E-mail usage pattern detection
US6341306B1 (en) 1999-08-13 2002-01-22 Atomica Corporation Web-based information retrieval responsive to displayed word identified by a text-grabbing algorithm
WO2001016839A2 (en) 1999-08-31 2001-03-08 Video Ventures Joint Venture D/B/A Comsort System for influence network marketing
US6260041B1 (en) 1999-09-30 2001-07-10 Netcurrents, Inc. Apparatus and method of implementing fast internet real-time search technology (first)
JP4279427B2 (en) 1999-11-22 2009-06-17 富士通株式会社 Communication support method and system
US6434549B1 (en) 1999-12-13 2002-08-13 Ultris, Inc. Network-based, human-mediated exchange of information
US6772141B1 (en) 1999-12-14 2004-08-03 Novell, Inc. Method and apparatus for organizing and using indexes utilizing a search decision table
US6651086B1 (en) 2000-02-22 2003-11-18 Yahoo! Inc. Systems and methods for matching participants to a conversation
US6606644B1 (en) 2000-02-24 2003-08-12 International Business Machines Corporation System and technique for dynamic information gathering and targeted advertising in a web based model using a live information selection and analysis tool
CA2402916A1 (en) 2000-03-16 2001-09-20 Yuan Yan Chen Apparatus and method for fuzzy analysis of statistical evidence
US6757646B2 (en) 2000-03-22 2004-06-29 Insightful Corporation Extended functionality for an inverse inference engine based web search
US6658389B1 (en) 2000-03-24 2003-12-02 Ahmet Alpdemir System, method, and business model for speech-interactive information system having business self-promotion, audio coupon and rating features
US6721734B1 (en) 2000-04-18 2004-04-13 Claritech Corporation Method and apparatus for information management using fuzzy typing
US6983320B1 (en) 2000-05-23 2006-01-03 Cyveillance, Inc. System, method and computer program product for analyzing e-commerce competition of an entity by utilizing predetermined entity-specific metrics and analyzed statistics from web pages
DE60119934D1 (en) 2000-05-25 2006-06-29 Manyworlds Inc NETWORK ADMINISTRATIVE AND ACCESS SYSTEM FOR HARMFUL CONTENT
US6782393B1 (en) 2000-05-31 2004-08-24 Ricoh Co., Ltd. Method and system for electronic message composition with relevant documents
US6640218B1 (en) 2000-06-02 2003-10-28 Lycos, Inc. Estimating the usefulness of an item in a collection of information
AU2001268314A1 (en) * 2000-06-14 2001-12-24 Artesia Technologies, Inc. Method and system for link management
US7136854B2 (en) 2000-07-06 2006-11-14 Google, Inc. Methods and apparatus for providing search results in response to an ambiguous search query
US6807566B1 (en) 2000-08-16 2004-10-19 International Business Machines Corporation Method, article of manufacture and apparatus for processing an electronic message on an electronic message board
US6662170B1 (en) 2000-08-22 2003-12-09 International Business Machines Corporation System and method for boosting support vector machines
US7146416B1 (en) 2000-09-01 2006-12-05 Yahoo! Inc. Web site activity monitoring system with tracking by categories and terms
NO313399B1 (en) 2000-09-14 2002-09-23 Fast Search & Transfer Asa Procedure for searching and analyzing information in computer networks
US6999914B1 (en) 2000-09-28 2006-02-14 Manning And Napier Information Services Llc Device and method of determining emotive index corresponding to a message
US6751683B1 (en) 2000-09-29 2004-06-15 International Business Machines Corporation Method, system and program products for projecting the impact of configuration changes on controllers
US7197470B1 (en) 2000-10-11 2007-03-27 Buzzmetrics, Ltd. System and method for collection analysis of electronic discussion methods
US6560600B1 (en) 2000-10-25 2003-05-06 Alta Vista Company Method and apparatus for ranking Web page search results
GB2368670A (en) 2000-11-03 2002-05-08 Envisional Software Solutions Data acquisition system
US6622140B1 (en) 2000-11-15 2003-09-16 Justsystem Corporation Method and apparatus for analyzing affect and emotion in text
US6526440B1 (en) 2001-01-30 2003-02-25 Google, Inc. Ranking search results by reranking the results based on local inter-connectivity
US6584470B2 (en) 2001-03-01 2003-06-24 Intelliseek, Inc. Multi-layered semiotic mechanism for answering natural language questions using document retrieval combined with information extraction
US8001118B2 (en) 2001-03-02 2011-08-16 Google Inc. Methods and apparatus for employing usage statistics in document retrieval
US6778975B1 (en) 2001-03-05 2004-08-17 Overture Services, Inc. Search engine for selecting targeted messages
US20020159642A1 (en) 2001-03-14 2002-10-31 Whitney Paul D. Feature selection and feature set construction
EP1444649A1 (en) 2001-10-11 2004-08-11 Exscientia, LLC Method and apparatus for learning to classify patterns and assess the value of decisions
US7055273B2 (en) 2001-10-12 2006-06-06 Attitude Measurement Corporation Removable label and incentive item to facilitate collecting consumer data
US20040205482A1 (en) 2002-01-24 2004-10-14 International Business Machines Corporation Method and apparatus for active annotation of multimedia content
WO2003075186A1 (en) 2002-03-01 2003-09-12 Paul Jeffrey Krupin A method and system for creating improved search queries
JP2003281446A (en) 2002-03-13 2003-10-03 Culture Com Technology (Macau) Ltd Media management method and system
US7716161B2 (en) 2002-09-24 2010-05-11 Google, Inc. Methods and apparatus for serving relevant advertisements
US7599911B2 (en) 2002-08-05 2009-10-06 Yahoo! Inc. Method and apparatus for search ranking using human input and automated ranking
US7051023B2 (en) 2003-04-04 2006-05-23 Yahoo! Inc. Systems and methods for generating concept units from search queries
US7130777B2 (en) 2003-11-26 2006-10-31 International Business Machines Corporation Method to hierarchical pooling of opinions from multiple sources
US7865354B2 (en) 2003-12-05 2011-01-04 International Business Machines Corporation Extracting and grouping opinions from text documents
US7287012B2 (en) 2004-01-09 2007-10-23 Microsoft Corporation Machine-learned approach to determining document relevance for search over large electronic collections of documents
US7596571B2 (en) 2004-06-30 2009-09-29 Technorati, Inc. Ecosystem method of aggregation and search and related techniques
US7523085B2 (en) 2004-09-30 2009-04-21 Buzzmetrics, Ltd An Israel Corporation Topical sentiments in electronically stored communications
US7624102B2 (en) 2005-01-28 2009-11-24 Microsoft Corporation System and method for grouping by attribute
US7680855B2 (en) 2005-03-11 2010-03-16 Yahoo! Inc. System and method for managing listings

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9418389B2 (en) 2012-05-07 2016-08-16 Nasdaq, Inc. Social intelligence architecture using social media message queues
US10304036B2 (en) 2012-05-07 2019-05-28 Nasdaq, Inc. Social media profiling for one or more authors using one or more social media platforms
US11086885B2 (en) 2012-05-07 2021-08-10 Nasdaq, Inc. Social intelligence architecture using social media message queues
US11100466B2 (en) 2012-05-07 2021-08-24 Nasdaq, Inc. Social media profiling for one or more authors using one or more social media platforms
US11803557B2 (en) 2012-05-07 2023-10-31 Nasdaq, Inc. Social intelligence architecture using social media message queues
US11847612B2 (en) 2012-05-07 2023-12-19 Nasdaq, Inc. Social media profiling for one or more authors using one or more social media platforms

Also Published As

Publication number Publication date
US20080077582A1 (en) 2008-03-27
WO2008039542A3 (en) 2008-10-02
US7660783B2 (en) 2010-02-09

Similar Documents

Publication Publication Date Title
US7660783B2 (en) System and method of ad-hoc analysis of data
US9990368B2 (en) System and method for automatic generation of information-rich content from multiple microblogs, each microblog containing only sparse information
US8135669B2 (en) Information access with usage-driven metadata feedback
US10235681B2 (en) Text extraction module for contextual analysis engine
US10430806B2 (en) Input/output interface for contextual analysis engine
JP5608286B2 (en) Infinite browsing
US8874542B2 (en) Displaying browse sequence with search results
US20150106078A1 (en) Contextual analysis engine
US20070192129A1 (en) Method and system for the objective quantification of fame
US20080228695A1 (en) Techniques for analyzing and presenting information in an event-based data aggregation system
US20120016863A1 (en) Enriching metadata of categorized documents for search
US8712999B2 (en) Systems and methods for online search recirculation and query categorization
JP2006309515A (en) Information delivery method and information delivery server
TW201415254A (en) Method and system for recommending semantic annotations
JP4820147B2 (en) Attribute evaluation program, attribute evaluation system, and attribute evaluation method
US9563666B2 (en) Unsupervised detection and categorization of word clusters in text data
JP5492047B2 (en) Purchasing behavior analysis apparatus, purchasing behavior analysis method, purchasing behavior analysis program, purchasing behavior analysis system, and control method
Paliouras et al. PNS: A personalized news aggregator on the web
JP4853915B2 (en) Search system
Fung et al. Discover information and knowledge from websites using an integrated summarization and visualization framework
Pradana et al. An Android-based Hoax Detection for Social Media
Perugini et al. Recommendation and personalization: a survey
Doerfel et al. Of course we share! testing assumptions about social tagging systems
KR101132393B1 (en) Method of searching web pages based on a collective intelligence using folksonomy and linked-based ranking strategy, and system for performing the method
CN113901325A (en) User behavior analysis device and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07839065

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07839065

Country of ref document: EP

Kind code of ref document: A2