US20110302179A1 - Using Context to Extract Entities from a Document Collection - Google Patents

Using Context to Extract Entities from a Document Collection

Publication number
US20110302179A1
Authority
US
United States
Prior art keywords
entity
entities
specific
context
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US12/794,779
Other versions
US9251248B2 (en)
Inventor
Sanjay Agrawal
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp
Priority to US12/794,779 (US9251248B2)
Assigned to MICROSOFT CORPORATION: assignment of assignors interest (see document for details). Assignors: AGRAWAL, SANJAY
Publication of US20110302179A1
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC: assignment of assignors interest (see document for details). Assignors: MICROSOFT CORPORATION
Priority to US15/006,743 (US20160154876A1)
Application granted
Publication of US9251248B2
Legal status: Active (current)
Adjusted expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems

Definitions

  • An example entity extraction task may be to identify mentions of book titles within the web pages, given a prepared list of the desired book titles.
  • the task of extracting entities is difficult when some of the entities in the provided list have a significant overlap with entities in other domains or with the underlying language of the documents or both. For example, consider the movie “seven” (ignoring uppercase versus lowercase) among a list of movie titles to extract. There are many documents that contain the term “seven” that have nothing to do with the movie, e.g., there are seven days in a week, the distance to a location is seven miles, and so on. This overlap makes it very difficult to disambiguate relevant (“true”) mentions of such entities with respect to the domain from irrelevant (“false”) mentions.
  • a list of entities is input into an entity extraction mechanism.
  • the entity extraction mechanism processes the collection of documents to determine data corresponding to how frequently each entity of a list of entities corresponding to a domain is mentioned in the collection. For example, for each entity, a percentage of how many documents the entity is mentioned in relative to the total number of documents may be used as a measure of the entity frequency. Entities that are mentioned infrequently are identified as specific entities, while entities that are mentioned frequently are identified as non-specific (e.g., generic or ambiguous) entities.
  • context relative to the mentions of the entities is extracted from the documents, e.g., some number of words or phrases (or a mix of words and phrases) before and after the entity mention.
  • interesting context terms (note that each “term” comprises a word, or a phrase comprising multiple words, for example)
  • terms in the contexts become candidate terms, with those candidate terms processed based upon count information to eliminate candidates that are too frequent among the collection (and thus may not have affinity with the domain or correlation with entity mentions for the domain), or to eliminate candidate terms that are as likely to be mentioned within the context of a specific entity as mentioned outside the context.
  • the documents are processed to determine whether non-specific entity mentions in those documents are likely relevant to the domain.
  • the context surrounding each non-specific entity mention is evaluated against the interesting context terms. If there is a match in the non-specific entity mention's context with one (or more) of the context terms, then the non-specific entity mention and document are considered relevant to that domain; other documents are filtered out.
  • a result set containing only relevant documents or relevant mentions or both corresponding to a filtered subset of the collection is output.
  • FIG. 1 is a block diagram representing an entity extraction mechanism that uses context to filter a large document collection based upon entity names for a domain.
  • FIG. 2 is a flow diagram representing example steps that may be taken by the entity extraction mechanism to filter the document collection.
  • FIG. 3 is a flow diagram representing example steps that may be taken by the entity extraction mechanism to separate entities into specific or non-specific (ambiguous) categories based on counts of the times the entities are mentioned in the document collection.
  • FIG. 4 is a flow diagram representing example steps that may be taken by the entity extraction mechanism to determine a set of interesting context terms for a domain from the context of entity mentions.
  • FIG. 5 shows an illustrative example of a computing environment into which various aspects of the present invention may be incorporated.
  • the technology described herein may be used to generate training data, e.g., by providing the output (document-to-entity mentions) to humans, such as to collect good quality training data for further supervised learning techniques for a given entity domain.
  • the technology may be used to generate rules, e.g., an entity mention is a likely true mention if it contains context terms.
  • FIG. 1 is a block diagram showing a document collection 102 being processed in an automated way by an entity extraction mechanism 104 to provide results 106 corresponding to a provided set (list) of extracted entities 108 .
  • the number of documents in the collection 102 is typically very large, e.g., at a web scale
  • the results 106 may comprise a list of entity mentions for each (document, entity mention) pair over the document collection, but may be in any other suitable format, e.g., a list of document identifiers for the documents that contain the entity names, text snippets in which the entity names appear, and so forth.
  • the mechanism 104 uses mathematical representations related to frequency distribution, such as counting of entity mentions or context terms over the documents (which is very efficient and can be performed in parallel; for example, for a large document collection, a map-reduce architecture may be used).
  • FIG. 2 shows general logic of the entity extraction mechanism 104 , beginning at step 202 where the mechanism 104 splits the entity list based on entity counts. More particularly, as generally represented in FIG. 3 , step 302 extracts an entity mention count from the documents for each entity.
  • the count may be the number of documents in which the entity is mentioned, may be the total number of instances (if mentioned twice in the same document add two to the count) or some combination thereof (e.g., count the instances but not more than three maximum per document).
  • if the entity mention count for a given entity is high, such as above some threshold percentage (e.g., one-tenth of a percent) of the total number of documents, then that entity is considered ambiguous and added to a non-specific list. If the entity mention count for a given entity is low, then that entity is not as likely to be ambiguous and is added to a specific entity list.
  • some threshold percentage (e.g., one-tenth of a percent)
  • the collection is processed based upon the entities named in the specific list to extract the context surrounding the named entities.
  • entity (e.g., an actual movie title)
  • terms (words or phrases)
  • because the specific list contains the entities that are likely true, their contexts provide useful information.
  • the context may be some number of terms (e.g., five, ten, twenty) before and after the entity; note that the “before” number need not be the same as the “after” number.
  • affinity count data is referenced at step 206 , as more particularly represented in FIG. 4 .
  • the mention counts of the context terms are obtained at step 402 as the affinity counts.
  • the mention counts of the context terms that are near an entity are also obtained.
  • the counts are used to remove candidate context terms that are used too frequently, or whose mentions are as likely to be near an entity as not near an entity and which thus do not correlate with entity mentions for the domain (note that exactly equal likelihood is not required for a term to be considered “as likely”).
  • the remaining context terms are those considered as the “interesting” context terms with respect to having affinity with the true entity mentions.
  • context terms (words or phrases) that occur near the mention of entities in the “specific” entity list are extracted as candidate context terms for entities in the non-specific entity list. Further, mentions of the candidate context terms are counted over the document collection. In addition, for each candidate context term, the affinity counts over the document collection are generated where the candidate context term is in the context of an entity in the “specific” entity list.
  • candidate context terms that occur in a large number of documents, or are as likely to be mentioned within the context of an entity in the “specific” entity list as they are likely to be mentioned outside the context of such entities, are removed from consideration. The remaining candidate context terms are the “interesting” context terms.
  • an ambiguous entity such as “seven” mentioned in a document may be considered a true mention with respect to a movie title if the surrounding terms include an interesting context term such as “director” or “starred in” that were extracted as being interesting context terms from documents known to have specific entity mentions. Conversely, if there are no such interesting terms in the surrounding context of a “seven” mention, then this mention is considered not true and the document is filtered out.
  • the context need not be the same number of words or phrases as processed in the specific entity list, e.g., the context in the specific entity list from which interesting context terms were extracted may have been ten words on either side of the specific entity mention, while filtering non-specific entity documents may need to have an interesting context term within a five word context on either side of the non-specific entity mention, or vice-versa.
  • the filter may be even more restrictive in an implementation by having to have two (or some other number of) interesting context terms within the context of the non-specific entity mention.
  • the more restrictive filter may be applied to certain entities and not others. For example, instead of having two (non-specific and specific) categories for entities, consider a split of the entities in the list into a specific category (e.g., low percentage of mentions per total documents) and a non-specific category comprising ambiguous (medium percentage) and very ambiguous (high percentage) categories. At least one interesting context term may be needed in an ambiguous entity mention's context to not filter out the document, with at least two interesting context terms needed for very ambiguous entity mentions. Alternatively, the list of interesting context terms may be larger for ambiguous entity mentions and smaller for very ambiguous entity mentions, such as by using different affinity counts for each category.
  • Information other than counts may be used to help in the filtering. For example, titles usually start with a capital letter, and thus if extracting titles, this information can be used as well.
  • the entities that are in the specific category are considered to be likely relevant with respect to the domain, whereby documents containing these specific entity mentions are not filtered out from the results.
  • the documents are processed to extract the specific entity mentions, as represented by step 210 .
  • Step 212 represents producing the results by combining the extracted specific mentions with the (formerly) non-specific entity mentions that remain after interesting context-based term filtering. These results may be used in any suitable way.
  • the mechanism is frequency or count-based, and is thus substantially non-specific and can be applied to entities in a variety of domains.
  • the mechanism operates in a generally automated manner, without requiring knowledge of underlying entity domains for the extraction task.
  • FIG. 5 illustrates an example of a suitable computing and networking environment 500 on which the examples of FIGS. 1-4 may be implemented.
  • the computing system environment 500 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 500 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 500 .
  • the invention is operational with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
  • the invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types.
  • the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in local and/or remote computer storage media including memory storage devices.
  • an exemplary system for implementing various aspects of the invention may include a general purpose computing device in the form of a computer 510 .
  • Components of the computer 510 may include, but are not limited to, a processing unit 520 , a system memory 530 , and a system bus 521 that couples various system components including the system memory to the processing unit 520 .
  • the system bus 521 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
  • such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
  • ISA Industry Standard Architecture
  • MCA Micro Channel Architecture
  • EISA Enhanced ISA
  • VESA Video Electronics Standards Association
  • PCI Peripheral Component Interconnect
  • the computer 510 typically includes a variety of computer-readable media.
  • Computer-readable media can be any available media that can be accessed by the computer 510 and includes both volatile and nonvolatile media, and removable and non-removable media.
  • Computer-readable media may comprise computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 510.
  • Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
  • the system memory 530 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 531 and random access memory (RAM) 532 .
  • ROM read only memory
  • RAM random access memory
  • BIOS basic input/output system
  • RAM 532 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 520 .
  • FIG. 5 illustrates operating system 534 , application programs 535 , other program modules 536 and program data 537 .
  • the computer 510 may also include other removable/non-removable, volatile/nonvolatile computer storage media.
  • FIG. 5 illustrates a hard disk drive 541 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 551 that reads from or writes to a removable, nonvolatile magnetic disk 552 , and an optical disk drive 555 that reads from or writes to a removable, nonvolatile optical disk 556 such as a CD ROM or other optical media.
  • removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like.
  • the hard disk drive 541 is typically connected to the system bus 521 through a non-removable memory interface such as interface 540
  • magnetic disk drive 551 and optical disk drive 555 are typically connected to the system bus 521 by a removable memory interface, such as interface 550 .
  • the drives and their associated computer storage media provide storage of computer-readable instructions, data structures, program modules and other data for the computer 510 .
  • hard disk drive 541 is illustrated as storing operating system 544 , application programs 545 , other program modules 546 and program data 547 .
  • operating system 544, application programs 545, other program modules 546 and program data 547 are given different numbers herein to illustrate that, at a minimum, they are different copies.
  • a user may enter commands and information into the computer 510 through input devices such as a tablet, or electronic digitizer, 564 , a microphone 563 , a keyboard 562 and pointing device 561 , commonly referred to as mouse, trackball or touch pad.
  • Other input devices not shown in FIG. 5 may include a joystick, game pad, satellite dish, scanner, or the like.
  • These and other input devices are often connected to the processing unit 520 through a user input interface 560 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB).
  • a monitor 591 or other type of display device is also connected to the system bus 521 via an interface, such as a video interface 590 .
  • the monitor 591 may also be integrated with a touch-screen panel or the like. Note that the monitor and/or touch screen panel can be physically coupled to a housing in which the computing device 510 is incorporated, such as in a tablet-type personal computer. In addition, computers such as the computing device 510 may also include other peripheral output devices such as speakers 595 and printer 596 , which may be connected through an output peripheral interface 594 or the like.
  • the computer 510 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 580 .
  • the remote computer 580 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 510 , although only a memory storage device 581 has been illustrated in FIG. 5 .
  • the logical connections depicted in FIG. 5 include one or more local area networks (LAN) 571 and one or more wide area networks (WAN) 573 , but may also include other networks.
  • LAN local area network
  • WAN wide area network
  • Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
  • When used in a LAN networking environment, the computer 510 is connected to the LAN 571 through a network interface or adapter 570.
  • When used in a WAN networking environment, the computer 510 typically includes a modem 572 or other means for establishing communications over the WAN 573, such as the Internet.
  • the modem 572, which may be internal or external, may be connected to the system bus 521 via the user input interface 560 or other appropriate mechanism.
  • a wireless networking component such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN.
  • program modules depicted relative to the computer 510 may be stored in the remote memory storage device.
  • FIG. 5 illustrates remote application programs 585 as residing on memory device 581 . It may be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • An auxiliary subsystem 599 may be connected via the user interface 560 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state.
  • the auxiliary subsystem 599 may be connected to the modem 572 and/or network interface 570 to allow communication between these systems while the main processing unit 520 is in a low power state.
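  • The counting options and the tiered filter listed among the definitions above (counting documents, counting instances, or counting instances up to a per-document maximum, and requiring more interesting context terms for more ambiguous entities) can be sketched in Python as follows. This is only an illustrative sketch: the function names, the `mode` strings, and the 1% boundary for "very ambiguous" are assumptions, while the cap of three and the one-tenth of a percent threshold follow the examples given in the text.

```python
def mention_count(per_document_hits, mode="documents", cap=3):
    """Aggregate one entity's per-document mention counts into a single count.

    per_document_hits: list with the number of mentions found in each document.
    mode: "documents" counts documents with at least one mention,
          "instances" counts every mention,
          "capped" counts mentions but at most `cap` per document.
    """
    if mode == "documents":
        return sum(1 for hits in per_document_hits if hits > 0)
    if mode == "instances":
        return sum(per_document_hits)
    if mode == "capped":
        return sum(min(hits, cap) for hits in per_document_hits)
    raise ValueError(f"unknown mode: {mode}")


def required_matches(doc_share, ambiguous_at=0.001, very_ambiguous_at=0.01):
    """Return how many interesting context terms a mention's context must contain.

    doc_share is the fraction of documents mentioning the entity.
    The 0.01 boundary for "very ambiguous" is a hypothetical value.
    """
    if doc_share > very_ambiguous_at:
        return 2  # very ambiguous: stricter filter
    if doc_share > ambiguous_at:
        return 1  # ambiguous: a single matching term suffices
    return 0      # specific: kept without a context check
```

  • Keeping the aggregation mode and the tier boundaries as parameters reflects that the text presents them as alternatives rather than fixed choices.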

Abstract

Described is the use of context information obtained from entity mentions in likely relevant documents to extract entity mentions from documents that are ambiguous with respect to their relevance to a domain. A list of entities is input into an entity extraction mechanism, which processes a large collection of documents to determine data (counts) corresponding to frequency of entity mentions. Infrequently mentioned entities are specific entities, while frequently mentioned entities are non-specific (generic or ambiguous) entities. The context surrounding mentions of the specific entities is processed to obtain interesting context terms (words, phrases or both) for the domain. The interesting context terms are then compared against the contexts of non-specific entity mentions to determine whether each non-specific entity mention is relevant to the domain. A result set containing only relevant documents or relevant mentions from the collection is output.

Description

    BACKGROUND
  • Given a large collection of documents, various applications benefit from filtering and identifying a relatively small subset of these documents in an appropriate way, from extracting certain entity information (e.g., words or phrases) from only relevant documents, or both. By way of example, consider that the collection to be processed comprises the large number of documents on the web, on the order of billions. An example entity extraction task may be to identify mentions of book titles within the web pages, given a prepared list of the desired book titles.
  • The task of extracting entities is difficult when some of the entities in the provided list have a significant overlap with entities in other domains or with the underlying language of the documents or both. For example, consider the movie “seven” (ignoring uppercase versus lowercase) among a list of movie titles to extract. There are many documents that contain the term “seven” that have nothing to do with the movie, e.g., there are seven days in a week, the distance to a location is seven miles, and so on. This overlap makes it very difficult to disambiguate relevant (“true”) mentions of such entities with respect to the domain from irrelevant (“false”) mentions.
  • Further, there is generally very limited domain-based information in terms of available training data, or in terms of available classifiers for entity extraction tasks, or both. In general this is because there is a significant variety of such entity lists for which extraction is desired, and differing entity domains over which extraction may be performed, with each domain requiring a classifier trained with knowledge of the specific domain. Indeed, such data may be entirely absent for an entity list or domain. By way of example, there may not be a classifier available for an entity list comprising romantic movies. Even if one exists, running such a classifier over such a large document collection may not be practical, as a classifier tends to have a large amount of performance overhead.
  • Another difficulty arises from the large size of the underlying document collection, which limits the time that can be spent on each document for extraction purposes. The large size of the document collection makes it impractical to identify all mentions of entities over the entire document collection as an intermediate step, followed by a subsequent step that removes false mentions. This is even worse in the presence of entities that overlap with the underlying language of the document, e.g., materializing mentions of “man” over web pages can lead to millions of web page URLs in which only a small fraction of the pages refer to a movie named “man.”
    SUMMARY
  • This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
  • Briefly, various aspects of the subject matter described herein are directed towards an entity extraction technology by which a large set of documents is filtered into a smaller subset of documents that contain mentions of identified entities that are likely relevant to a domain corresponding to the entities. In one aspect, a list of entities is input into an entity extraction mechanism. The entity extraction mechanism processes the collection of documents to determine data corresponding to how frequently each entity of a list of entities corresponding to a domain is mentioned in the collection. For example, for each entity, a percentage of how many documents the entity is mentioned in relative to the total number of documents may be used as a measure of the entity frequency. Entities that are mentioned infrequently are identified as specific entities, while entities that are mentioned frequently are identified as non-specific (e.g., generic or ambiguous) entities.
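  • As a rough illustration of the frequency-based split described above, the following Python sketch measures the share of documents that mention each entity and assigns the entity to the specific or the non-specific list. The function names, the regular-expression tokenizer, and the exact phrase matching are assumptions made for illustration and are not details given in this description; the default threshold of one-tenth of a percent follows the example threshold mentioned in this document.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (case is ignored)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def mentions(tokens, phrase):
    """Return True if `phrase` occurs in `tokens` as a contiguous run."""
    n = len(phrase)
    return n > 0 and any(
        tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)
    )


def split_entities(entities, documents, threshold=0.001):
    """Split entities into (specific, non_specific) by document frequency."""
    phrases = {entity: tokenize(entity) for entity in entities}
    doc_counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        for entity, phrase in phrases.items():
            if mentions(tokens, phrase):
                doc_counts[entity] += 1  # count each document at most once

    total = len(documents)
    specific, non_specific = [], []
    for entity in entities:
        share = doc_counts[entity] / total if total else 0.0
        if share > threshold:
            non_specific.append(entity)  # frequent: generic or ambiguous
        else:
            specific.append(entity)      # infrequent: likely a true mention
    return specific, non_specific
```

  • Because the split relies only on counts, each document can be tokenized and counted independently, which is what makes the step straightforward to run in parallel over a large collection.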
  • For the set of specific entities, context relative to the mentions of the entities is extracted from the documents, e.g., some number of words or phrases (or a mix of words and phrases) before and after the entity mention. Based upon the context, interesting context terms (note that each “term” comprises a word, or a phrase comprising multiple words, for example) for the domain are selected. For example, terms in the contexts become candidate terms, with those candidate terms processed based upon count information to eliminate candidates that are too frequent among the collection (and thus may not have affinity with the domain or correlation with entity mentions for the domain), or to eliminate candidate terms that are as likely to be mentioned within the context of a specific entity as mentioned outside the context.
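  • The selection of interesting context terms can be sketched along the following lines. This is a simplified sketch under stated assumptions: it handles only single-word terms, uses a symmetric window, and the parameter names `max_doc_share` and `min_affinity` (with their default values) are hypothetical cut-offs standing in for the count-based elimination described above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_mentions(tokens, phrase):
    """Return the start offsets at which `phrase` occurs in `tokens`."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def interesting_terms(documents, specific_entities, window=5,
                      max_doc_share=0.5, min_affinity=0.6):
    """Select context terms that correlate with specific-entity mentions."""
    phrases = [p for p in (tokenize(e) for e in specific_entities) if p]
    total_count = Counter()  # every mention of each term in the collection
    near_count = Counter()   # mentions inside a specific-entity context
    doc_count = Counter()    # number of documents containing each term

    for text in documents:
        tokens = tokenize(text)
        total_count.update(tokens)
        doc_count.update(set(tokens))
        near_positions = set()
        for phrase in phrases:
            for start in find_mentions(tokens, phrase):
                end = start + len(phrase)
                near_positions.update(range(max(0, start - window), start))
                near_positions.update(range(end, min(len(tokens), end + window)))
        near_count.update(tokens[i] for i in near_positions)

    n_docs = len(documents)
    keep = set()
    for term, near in near_count.items():
        if doc_count[term] / n_docs > max_doc_share:
            continue  # too frequent across the collection
        if near / total_count[term] < min_affinity:
            continue  # about as likely outside an entity context as inside
        keep.add(term)
    return keep
```

  • Collecting window positions in a set before counting avoids counting the same token twice when two entity mentions have overlapping contexts.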
  • Once the interesting context terms for the domain are known, the documents are processed to determine whether non-specific entity mentions in those documents are likely relevant to the domain. To this end, the context surrounding each non-specific entity mention is evaluated against the interesting context terms. If there is a match in the non-specific entity mention's context with one (or more) of the context terms, then the non-specific entity mention and document are considered relevant to that domain; other documents are filtered out. A result set containing only relevant documents or relevant mentions or both corresponding to a filtered subset of the collection is output.
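  • A minimal Python sketch of this context check follows. It assumes single-word context terms and a symmetric window; the function names and the `window` and `min_matches` parameters are hypothetical and chosen only to illustrate the idea of keeping a non-specific mention when its surrounding context contains one (or more) interesting context terms.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def relevant_mentions(text, entity, context_terms, window=5, min_matches=1):
    """Return token offsets of `entity` mentions whose context matches.

    A mention is kept when at least `min_matches` distinct interesting
    context terms appear within `window` tokens before or after it.
    """
    tokens = tokenize(text)
    phrase = tokenize(entity)
    n = len(phrase)
    terms = set(context_terms)
    hits = []
    if n == 0:
        return hits
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] != phrase:
            continue
        before = tokens[max(0, start - window):start]
        after = tokens[start + n:start + n + window]
        if len(terms.intersection(before + after)) >= min_matches:
            hits.append(start)
    return hits


def filter_documents(documents, non_specific, context_terms, **options):
    """Keep only documents with at least one relevant non-specific mention."""
    return [
        doc for doc in documents
        if any(relevant_mentions(doc, entity, context_terms, **options)
               for entity in non_specific)
    ]
```

  • Raising `min_matches` gives a stricter filter, at the cost of dropping true mentions whose context happens to contain only one interesting term.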
  • Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
    BRIEF DESCRIPTION OF THE DRAWINGS
  • The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
  • FIG. 1 is a block diagram representing an entity extraction mechanism that uses context to filter a large document collection based upon entity names for a domain.
  • FIG. 2 is a flow diagram representing example steps that may be taken by the entity extraction mechanism to filter the document collection.
  • FIG. 3 is a flow diagram representing example steps that may be taken by the entity extraction mechanism to separate entities into specific or non-specific (ambiguous) categories based on counts of the times the entities are mentioned in the document collection.
  • FIG. 4 is a flow diagram representing example steps that may be taken by the entity extraction mechanism to determine a set of interesting context terms for a domain from the context of entity mentions.
  • FIG. 5 shows an illustrative example of a computing environment into which various aspects of the present invention may be incorporated.
  • DETAILED DESCRIPTION
  • Various aspects of the technology described herein are generally directed towards a mechanism that uses context around mentions of entities in a large document collection to perform entity extraction in a generally automated manner. As will be understood, the technology performs the entity extraction without necessarily needing any knowledge of the underlying entity domain for the extraction task.
  • While entity extraction from documents is one usage scenario for the technology, the mechanism can be used for a variety of other tasks. For example, as will be understood, the mechanism may be used as a very fast and automated filtering mechanism to significantly reduce the amount of data (e.g., by orders of magnitude, approximately fifty times in one implementation) for further processing without requiring knowledge of underlying entity domains. In this way, for example, the mechanism may be used as a pre-filter that provides remaining documents to one or more subsequent extractors, such as an extractor having advanced domain-dependent knowledge to further improve the accuracy of extraction.
  • The technology described herein may be used to generate training data, e.g., by providing the output (document-to-entity mentions) to humans, such as to collect good quality training data for further supervised learning techniques for a given entity domain. The technology may be used to generate rules, e.g., an entity mention is a likely true mention if it contains context terms.
  • As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and data processing in general.
  • FIG. 1 is a block diagram showing a document collection 102 being processed in an automated way by an entity extraction mechanism 104 to provide results 106 corresponding to a provided set (list) of extracted entities 108. The number of documents in the collection 102 is typically very large, e.g., at web scale. The results 106 may comprise a list of entity mentions for each document, entity mention pair over the document collection, but may be in any other suitable format, e.g., a list of document identifiers for the documents that contain the entity names, text snippets in which the entity names appear, and so forth.
  • For example, a document identifier, entity name, and location (or multiple locations of that entity name) within the document may be maintained as the results, e.g., <docID, “Seven”, 100>. Instances of common documents (the same document) may be merged in the results, e.g., a document that contains two different entity names that are true with respect to the domain (and thus has two instances) may be referenced as <docID, <“Seven”, 100>, <“ABC goes to DEFG”, 150>> and so on (where “ABC goes to DEFG” is a hypothetical movie title); note that a pointer/identifier to the entity name in the list, rather than the entity name itself, may be maintained in the data.
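  • As a minimal sketch of the merged result format just described, the following Python groups per-mention records into one record per document. The function name merge_results and the tuple layout are illustrative assumptions, not part of the described mechanism.

```python
from collections import defaultdict

def merge_results(mentions):
    """Group (doc_id, entity_name, location) records by document.

    Hypothetical helper: returns {doc_id: [(entity_name, location), ...]},
    so a document with several true mentions is represented only once.
    """
    merged = defaultdict(list)
    for doc_id, entity_name, location in mentions:
        merged[doc_id].append((entity_name, location))
    # Sort each document's mentions by location for a stable order.
    return {doc_id: sorted(items, key=lambda m: m[1])
            for doc_id, items in merged.items()}

records = [
    ("doc1", "Seven", 100),
    ("doc1", "ABC goes to DEFG", 150),
    ("doc2", "Seven", 12),
]
merged = merge_results(records)
```

  • Storing an index into the entity list instead of the entity name, as the text suggests, would only change the first element of each inner tuple.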
  • In one implementation, the entity extraction mechanism 104 uses various count data 110 obtained from the documents as described below. As also described below, the entity extraction mechanism 104 uses context data 112, namely text surrounding the entity names in the documents, to determine whether mentions of entity names are true or false with respect to being relevant to the entity.
  • As can be readily appreciated, the use of appropriate context terms significantly reduces the number of false mentions of entities within documents. The mechanism 104 uses mathematical representations related to frequency distribution, such as counting of entity mentions or context terms over the documents (which is very efficient and can be performed in parallel; for example, for a large document collection, a map-reduce architecture may be used).
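  • The map-reduce style of counting mentioned above might look like the following sketch. It runs sequentially and assumes whitespace tokenization, lower-casing, and single-word terms; the names map_document and reduce_counts are hypothetical.

```python
from collections import Counter
from functools import reduce

def map_document(text, terms):
    """Map step: count how often each single-word term occurs in one document."""
    tokens = text.lower().split()
    wanted = {t.lower() for t in terms}
    return Counter(tok for tok in tokens if tok in wanted)

def reduce_counts(partials):
    """Reduce step: sum the per-document counters into collection-wide counts."""
    return reduce(lambda acc, part: acc + part, partials, Counter())

docs = [
    "seven is a film and seven is a number",
    "the director of seven",
    "a guitar solo",
]
totals = reduce_counts(map_document(d, ["Seven", "guitar"]) for d in docs)
```

  • Because each map call touches only one document and the reduce step is an associative sum, the map calls could be distributed across workers without changing the totals.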
  • FIG. 2 shows general logic of the entity extraction mechanism 104, beginning at step 202 where the mechanism 104 splits the entity list based on entity counts. More particularly, as generally represented in FIG. 3, step 302 extracts an entity mention count from the documents for each entity. The count may be the number of documents in which the entity is mentioned, may be the total number of instances (if mentioned twice in the same document add two to the count) or some combination thereof (e.g., count the instances but not more than three maximum per document).
  • If the entity mention count for a given entity is high, such as above some threshold percentage (e.g., one-tenth of a percent) of the total number of documents, then that entity is considered as ambiguous and added to a non-specific list. If the entity mention count for a given entity is low, then that entity is not as likely to be ambiguous and is added to a specific entity list. Using the above example, “Seven” is mentioned in a large number of documents, and is thus ambiguous as to whether it refers to the movie or to another concept, whereas a movie title having a long or unusual name will be mentioned far less frequently, and thus when the entity is mentioned, the document is more likely to be referring to that specific movie.
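  • The counting variants and the threshold split described for FIG. 3 can be sketched as follows. This is an illustration under stated assumptions: whitespace tokenization, case-insensitive matching, a non-empty collection, and the hypothetical names count_mentions and split_entities. The toy threshold of 0.5 is used only because the sample collection is tiny; the text suggests something like one-tenth of a percent.

```python
def count_mentions(docs, entity, per_doc_cap=None):
    """Count mentions of an entity name over a list of document strings.

    per_doc_cap=None counts every instance, per_doc_cap=1 counts documents,
    and per_doc_cap=3 counts instances but at most three per document.
    """
    target = entity.lower().split()
    n = len(target)
    total = 0
    for text in docs:
        tokens = text.lower().split()
        hits = sum(1 for i in range(len(tokens) - n + 1)
                   if tokens[i:i + n] == target)
        total += hits if per_doc_cap is None else min(hits, per_doc_cap)
    return total

def split_entities(docs, entities, threshold=0.001):
    """Split entities into (specific, non_specific) lists by document fraction."""
    specific, non_specific = [], []
    for entity in entities:
        fraction = count_mentions(docs, entity, per_doc_cap=1) / len(docs)
        (non_specific if fraction > threshold else specific).append(entity)
    return specific, non_specific

docs = [
    "seven people came",
    "i watched seven and abc goes to defg",
    "seven plus seven is fourteen",
    "nothing here",
]
specific, non_specific = split_entities(
    docs, ["Seven", "ABC goes to DEFG"], threshold=0.5)
```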
  • Returning to FIG. 2, at step 204 the collection is processed based upon the entities named in the specific list to extract the context surrounding the named entities. The general idea is that if an entity is likely to be the true entity (e.g., an actual movie title) in a given document, the terms (words or phrases) around that entity may be related to the concept (e.g., movies) to which that entity relates. Because the specific list contains the entities that are likely true, their contexts provide useful information. The context may be some number of terms (e.g., five, ten, twenty) before and after the entity; note that the “before” number need not be the same as the “after” number.
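  • A context window of this kind might be extracted as in the sketch below, which allows different "before" and "after" sizes. It assumes whitespace tokenization and single-word context terms (phrases would need n-grams); extract_contexts is a hypothetical name.

```python
def extract_contexts(text, entity, before=5, after=5):
    """Return one list of context tokens per mention of entity in text."""
    tokens = text.lower().split()
    target = entity.lower().split()
    n = len(target)
    contexts = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            left = tokens[max(0, i - before):i]    # up to `before` tokens
            right = tokens[i + n:i + n + after]    # up to `after` tokens
            contexts.append(left + right)
    return contexts

text = "the director of abc goes to defg also starred in another film"
contexts = extract_contexts(text, "ABC goes to DEFG", before=2, after=3)
```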
  • Not all of the context terms may be “interesting” terms with respect to the domain (e.g., movies, medicines, musicians, consumer electronics, people and so forth), in that they do not help distinguish entity mentions that are true with respect to the domain from those that are false. For example, frequently used terms such as “the” and “was” and “this is” do not provide much (if any) insight into whether an unknown entity mention is true or false in a document. However, if the entity list is names of musicians, for example, a term such as “guitar” is relevant to (has affinity with) the domain and is more likely an “interesting” term, such as determined in the manner described above. Note that stopword filtering may be applied to reduce the number of affinity counts that need to be determined, e.g., words such as “and” and “the” (and phrases such as “this is”) can be eliminated without obtaining their affinity counts in order to eliminate them.
  • In order to determine whether a context term is considered interesting, affinity count data is referenced at step 206, as more particularly represented in FIG. 4. In one implementation, the mention counts of the context terms are obtained at step 402 as the affinity counts. At step 404, the mention counts of the context terms that are near an entity are also obtained. At step 406, the counts are used to remove candidate context terms that are used too frequently, or whose mentions are as likely to be near an entity as not near an entity and thus do not correlate with entity mentions for the domain (note that the two likelihoods need not be exactly equal for a term to be considered “as likely”). The remaining context terms are those considered as the “interesting” context terms with respect to having affinity with the true entity mentions.
  • To summarize, context terms (words or phrases) that occur near the mention of entities in the “specific” entity list are extracted as candidate context terms for entities in the non-specific entity list. Further, mentions of the candidate context terms are counted over the document collection. In addition, for each candidate context term, the affinity counts over the document collection are generated where the candidate context term is in the context of an entity in the “specific” entity list. Candidate context terms that occur in a large number of documents, or are as likely to be mentioned within the context of an entity in the “specific” entity list as they are likely to be mentioned outside the context of such entities, are removed from consideration. The remaining candidate context terms are the “interesting” context terms.
  • Step 208 of FIG. 2 represents filtering the documents based on whether at least one interesting context term is in the context of a non-specific entity that is mentioned. This filtering restricts the true mentions of “non-specific” entities to ones where the mentions have at least one “interesting” context term in their context; that is, if there are no “interesting” context terms near the mention of the entity, the mention is considered as a false mention and is removed from consideration.
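  • The two elimination rules summarized above can be sketched as a single pass over the candidate terms. The text does not give concrete cutoffs, so max_doc_fraction and min_affinity below are assumed, illustrative thresholds, and select_interesting_terms is a hypothetical name. Counts are taken to be document counts.

```python
def select_interesting_terms(total_counts, near_counts, num_docs,
                             max_doc_fraction=0.2, min_affinity=0.6):
    """Keep candidate terms that are neither too frequent nor uncorrelated.

    total_counts: documents containing each candidate term.
    near_counts: documents where the term is in a specific entity's context.
    Both thresholds are assumptions chosen for illustration.
    """
    interesting = set()
    for term, near in near_counts.items():
        total = total_counts.get(term, 0)
        if total == 0:
            continue
        if total / num_docs > max_doc_fraction:
            continue  # too frequent across the collection
        if near / total < min_affinity:
            continue  # roughly as likely outside an entity context as inside
        interesting.add(term)
    return interesting

total_counts = {"the": 950, "director": 40, "blue": 50}
near_counts = {"the": 30, "director": 30, "blue": 5}
interesting = select_interesting_terms(total_counts, near_counts, num_docs=1000)
```

  • Here "the" is dropped for appearing in 95% of documents, "blue" is dropped because only one in ten of its mentions is near a specific entity, and "director" survives.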
  • In this manner, an ambiguous entity such as “seven” mentioned in a document may be considered a true mention with respect to a movie title if the surrounding terms include an interesting context term such as “director” or “starred in” that was extracted as an interesting context term from documents known to have specific entity mentions. Conversely, if there are no such interesting terms in the surrounding context of a “seven” mention, then this mention is considered not true and the document is filtered out. Note that the context need not be the same number of words or phrases as processed in the specific entity list, e.g., the context in the specific entity list from which interesting context terms were extracted may have been ten words on either side of the specific entity mention, while filtering non-specific entity documents may need to have an interesting context term within a five word context on either side of the non-specific entity mention, or vice-versa.
  • Note that the filter may be even more restrictive in an implementation by requiring two (or some other number of) interesting context terms within the context of the non-specific entity mention. The more restrictive filter may be applied to certain entities and not others. For example, instead of having two (non-specific and specific) categories for entities, consider a split of the entities in the list into a specific category (e.g., low percentage of mentions per total documents) and a non-specific category comprising ambiguous (medium percentage) and very ambiguous (high percentage) categories. At least a single interesting context term may be needed in an ambiguous entity mention's context to not filter out the document, with at least two interesting context terms needed for very ambiguous entity mentions. Alternatively, the list of interesting context terms may be larger for ambiguous entity mentions and smaller for very ambiguous entity mentions, such as by using different affinity counts for each category.
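  • As an illustration of this filtering step, the sketch below keeps a mention of a non-specific entity only when an interesting term falls inside its window. It assumes whitespace tokenization, single-word interesting terms (so a phrase such as “starred in” is not handled), and the hypothetical names find_mentions and filter_mentions.

```python
def find_mentions(tokens, target):
    """Return the start index of every occurrence of target in tokens."""
    n = len(target)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]

def filter_mentions(docs, entity, interesting_terms, window=5):
    """Return (doc_id, token_index) for mentions with an interesting term nearby."""
    target = entity.lower().split()
    kept = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for start in find_mentions(tokens, target):
            end = start + len(target)
            context = tokens[max(0, start - window):start] + tokens[end:end + window]
            if any(tok in interesting_terms for tok in context):
                kept.append((doc_id, start))
    return kept

docs = {
    "d1": "the director of seven won an award",
    "d2": "seven plus three equals ten",
}
true_mentions = filter_mentions(docs, "Seven", {"director", "actor"})
```

  • Document d2 contributes no surviving mention and would therefore be filtered out of the results.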
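  • The tiered variant could be sketched as below, where the number of interesting terms required grows with how ambiguous the entity is. The two cutoffs and the names required_hits and passes_filter are assumptions for illustration; distinct interesting terms are counted.

```python
def required_hits(doc_fraction, ambiguous_at=0.001, very_ambiguous_at=0.01):
    """Map an entity's document fraction to the number of interesting terms needed."""
    if doc_fraction <= ambiguous_at:
        return 0  # specific: kept without a context check
    if doc_fraction <= very_ambiguous_at:
        return 1  # ambiguous: one interesting term suffices
    return 2      # very ambiguous: two interesting terms needed

def passes_filter(context_tokens, interesting_terms, doc_fraction):
    """Return True if the context holds enough distinct interesting terms."""
    hits = len(set(context_tokens) & set(interesting_terms))
    return hits >= required_hits(doc_fraction)

terms = {"director", "actor"}
one_term_context = ["the", "director", "of"]
two_term_context = ["director", "and", "actor"]
```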
  • Information other than counts may be used to help in the filtering. For example, titles usually start with a capital letter, and thus if extracting titles, this information can be used as well.
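  • A capitalization check of the kind mentioned here might be as simple as the following sketch. The name starts_capitalized is hypothetical, and the check is only a weak signal, since any word at the start of a sentence is also capitalized.

```python
def starts_capitalized(text, start):
    """Return True if the character at offset start is an uppercase letter."""
    return 0 <= start < len(text) and text[start].isupper()

doc = "We watched Seven last night, then counted to seven."
first = doc.find("Seven")   # offset of the capitalized mention
second = doc.find("seven")  # offset of the lower-case mention
```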
  • The entities that are in the specific category are considered to be likely relevant with respect to the domain, whereby documents containing these specific entity mentions are not filtered out from the results. The documents are processed to extract the specific entity mentions, as represented by step 210.
  • Step 212 represents producing the results by combining the extracted specific mentions with the (formerly) non-specific entity mentions that remain after interesting context-based term filtering. These results may be used in any suitable way.
  • As can be seen, the mechanism is frequency or count-based, and is thus substantially non-specific and can be applied to entities in a variety of domains. The mechanism operates in a generally automated manner, without requiring knowledge of underlying entity domains for the extraction task.
  • Exemplary Operating Environment
  • FIG. 5 illustrates an example of a suitable computing and networking environment 500 on which the examples of FIGS. 1-4 may be implemented. The computing system environment 500 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 500 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 500.
  • The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
  • The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
  • With reference to FIG. 5, an exemplary system for implementing various aspects of the invention may include a general purpose computing device in the form of a computer 510. Components of the computer 510 may include, but are not limited to, a processing unit 520, a system memory 530, and a system bus 521 that couples various system components including the system memory to the processing unit 520. The system bus 521 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
  • The computer 510 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 510 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 510. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
  • The system memory 530 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 531 and random access memory (RAM) 532. A basic input/output system 533 (BIOS), containing the basic routines that help to transfer information between elements within computer 510, such as during start-up, is typically stored in ROM 531. RAM 532 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 520. By way of example, and not limitation, FIG. 5 illustrates operating system 534, application programs 535, other program modules 536 and program data 537.
  • The computer 510 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 5 illustrates a hard disk drive 541 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 551 that reads from or writes to a removable, nonvolatile magnetic disk 552, and an optical disk drive 555 that reads from or writes to a removable, nonvolatile optical disk 556 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 541 is typically connected to the system bus 521 through a non-removable memory interface such as interface 540, and magnetic disk drive 551 and optical disk drive 555 are typically connected to the system bus 521 by a removable memory interface, such as interface 550.
  • The drives and their associated computer storage media, described above and illustrated in FIG. 5, provide storage of computer-readable instructions, data structures, program modules and other data for the computer 510. In FIG. 5, for example, hard disk drive 541 is illustrated as storing operating system 544, application programs 545, other program modules 546 and program data 547. Note that these components can either be the same as or different from operating system 534, application programs 535, other program modules 536, and program data 537. Operating system 544, application programs 545, other program modules 546, and program data 547 are given different numbers herein to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 510 through input devices such as a tablet, or electronic digitizer, 564, a microphone 563, a keyboard 562 and pointing device 561, commonly referred to as mouse, trackball or touch pad. Other input devices not shown in FIG. 5 may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 520 through a user input interface 560 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 591 or other type of display device is also connected to the system bus 521 via an interface, such as a video interface 590. The monitor 591 may also be integrated with a touch-screen panel or the like. Note that the monitor and/or touch screen panel can be physically coupled to a housing in which the computing device 510 is incorporated, such as in a tablet-type personal computer. In addition, computers such as the computing device 510 may also include other peripheral output devices such as speakers 595 and printer 596, which may be connected through an output peripheral interface 594 or the like.
  • The computer 510 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 580. The remote computer 580 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 510, although only a memory storage device 581 has been illustrated in FIG. 5. The logical connections depicted in FIG. 5 include one or more local area networks (LAN) 571 and one or more wide area networks (WAN) 573, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
  • When used in a LAN networking environment, the computer 510 is connected to the LAN 571 through a network interface or adapter 570. When used in a WAN networking environment, the computer 510 typically includes a modem 572 or other means for establishing communications over the WAN 573, such as the Internet. The modem 572, which may be internal or external, may be connected to the system bus 521 via the user input interface 560 or other appropriate mechanism. A wireless networking component such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 510, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 5 illustrates remote application programs 585 as residing on memory device 581. It may be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • An auxiliary subsystem 599 (e.g., for auxiliary display of content) may be connected via the user interface 560 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 599 may be connected to the modem 572 and/or network interface 570 to allow communication between these systems while the main processing unit 520 is in a low power state.
  • CONCLUSION
  • While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.

Claims (20)

1. In a computing environment, a method performed on at least one processor, comprising:
inputting entity names from a list of entities corresponding to a domain;
processing a collection of documents to determine from entity mentions in those documents which entities are likely specific entities with respect to the domain and which are likely non-specific entities;
determining interesting context terms from the documents that contain one or more mentions of the specific entities; and
filtering documents that contain a mention of a non-specific entity based upon whether one or more of the interesting context terms are within a context of the non-specific entity.
2. The method of claim 1 further comprising, outputting results corresponding to a first set of documents that include a mention of a specific entity, and a second set of documents that each includes a mention of a non-specific entity and has one or more of the interesting context terms within the context of that non-specific entity.
3. The method of claim 2 further comprising, merging a plurality of instances of a common document into a single representation of that common document in the results.
4. The method of claim 1 wherein processing the collection of documents to determine from entity mentions in those documents which entities are likely specific entities with respect to the domain and which are likely non-specific entities comprises, using count data indicative of how frequently each entity is mentioned with respect to the collection of documents.
5. The method of claim 1 wherein using the count data comprises, for each entity, comparing a percentage of the documents that contain the entity with respect to a total number of documents against a threshold percentage, and placing entities below the threshold percentage into a category corresponding to the non-specific entities.
6. The method of claim 1 wherein determining the interesting context terms comprises determining candidate context terms obtained from contexts of the mentions of the specific entities, and using count information of the candidate context terms over the document collection to eliminate candidate context terms that appear frequently in the document collection.
7. The method of claim 1 wherein determining the interesting context terms comprises determining candidate context terms obtained from contexts of the mentions of the specific entities, and using count information to eliminate candidate context terms that are as likely to be mentioned within the context of a specific entity as mentioned outside the context.
8. The method of claim 1 wherein determining the interesting context terms comprises determining candidate context terms obtained from contexts of the mentions of the specific entities, using count information of the candidate context terms over the document collection to eliminate candidate context terms that appear frequently in the document collection, and using count information to eliminate candidate context terms that are as likely to be mentioned within the context of a specific entity as mentioned outside the context.
9. In a computing environment, a system comprising, an entity extraction mechanism that divides entities of a list of entities corresponding to a domain into specific entities and non-specific entities, including by processing a collection of documents to obtain data corresponding to how frequently each entity is mentioned in the collection and identifying entities that are mentioned infrequently as specific entities and entities that are mentioned frequently as non-specific entities, the entity extraction mechanism configured to identify interesting context terms for the domain based upon contexts of mentions of the specific entities within the collection, and to use the interesting context terms to determine whether mentions of the non-specific entities are relevant to the domain.
10. The system of claim 9 wherein the entity extraction mechanism outputs results comprising data corresponding to documents that contain one or more mentions of the specific entities and documents that contain one or more mentions of the non-specific entities that are determined to be relevant to the domain.
11. The system of claim 10 wherein the results include a document identifier, data corresponding to the entity mention or mentions in that document, and data corresponding to a location of each entity mention in that document, for each document containing at least one specific entity mention or non-specific entity mention determined to be relevant.
12. The system of claim 9 wherein the entity extraction mechanism determines how frequently each entity is mentioned in the collection by counts of mentions for the entities.
13. The system of claim 9 wherein the entity extraction mechanism determines the interesting context terms by determining candidate context terms, and using count information of the candidate context terms over the document collection to eliminate candidate context terms that appear frequently in the document collection.
14. The system of claim 9 wherein the entity extraction mechanism determines the interesting context terms by determining candidate context terms, and using count information to eliminate candidate context terms that are as likely to be mentioned within the context of a specific entity as mentioned outside the context.
15. The system of claim 9 wherein the entity extraction mechanism determines the interesting context terms by eliminating candidate context terms based upon a set of stopwords.
16. The system of claim 9 wherein the domain corresponds to a movie domain, a medicine domain, a music-related domain, a consumer products domain, or a people domain.
17. The system of claim 9 wherein the collection comprises web documents.
18. One or more computer-readable media having computer-executable instructions, which when executed perform steps, comprising:
processing a collection of documents to determine data corresponding to how frequently each entity of a list of entities corresponding to a domain is mentioned in the collection;
identifying entities that are mentioned infrequently as specific entities;
identifying entities that are mentioned frequently as non-specific entities;
extracting interesting context terms for the domain based upon contexts of mentions of the specific entities within the collection; and
providing results that identify which of the documents contain one or more mentions of at least one specific entity or non-specific entity that has context data that match one or more of the interesting context terms.
19. The one or more computer-readable media of claim 18 wherein extracting the interesting context terms for the domain comprises obtaining count information of candidate context terms over the document collection to eliminate candidate context terms that appear frequently in the document collection.
20. The one or more computer-readable media of claim 18 wherein extracting the interesting context terms for the domain comprises using count information to eliminate candidate context terms that are as likely to be mentioned within the context of a specific entity as mentioned outside the context.
US12/794,779 2010-06-07 2010-06-07 Using context to extract entities from a document collection Active 2033-02-01 US9251248B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US12/794,779 US9251248B2 (en) 2010-06-07 2010-06-07 Using context to extract entities from a document collection
US15/006,743 US20160154876A1 (en) 2010-06-07 2016-01-26 Using context to extract entities from a document collection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US12/794,779 US9251248B2 (en) 2010-06-07 2010-06-07 Using context to extract entities from a document collection

Related Child Applications (1)

Application Number Priority Date Filing Date Title
US15/006,743 Continuation US20160154876A1 (en) 2010-06-07 2016-01-26 Using context to extract entities from a document collection

Publications (2)

Publication Number Publication Date
US20110302179A1 (en) 2011-12-08
US9251248B2 (en) 2016-02-02

Family

ID=45065296

Family Applications (2)

Application Number Title Priority Date Filing Date
US12/794,779 Active 2033-02-01 US9251248B2 (en) 2010-06-07 2010-06-07 Using context to extract entities from a document collection
US15/006,743 Abandoned US20160154876A1 (en) 2010-06-07 2016-01-26 Using context to extract entities from a document collection

Country Status (1)

Country Link
US (2) US9251248B2 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130151538A1 (en) * 2011-12-12 2013-06-13 Microsoft Corporation Entity summarization and comparison
US9092517B2 (en) 2008-09-23 2015-07-28 Microsoft Technology Licensing, Llc Generating synonyms based on query log data
US20150351058A1 (en) * 2012-12-09 2015-12-03 Lg Electronics Inc. Method for obtaining synchronization for device-to-device communication outside of coverage area in a wireless communication system and apparatus for same
US9229924B2 (en) 2012-08-24 2016-01-05 Microsoft Technology Licensing, Llc Word detection and domain dictionary recommendation
US9594831B2 (en) * 2012-06-22 2017-03-14 Microsoft Technology Licensing, Llc Targeted disambiguation of named entities
US9600566B2 (en) 2010-05-14 2017-03-21 Microsoft Technology Licensing, Llc Identifying entity synonyms
CN106570132A (en) * 2016-10-27 2017-04-19 浙江大学 Document vector learning method with fusion of mentioned entity information
WO2017078367A1 (en) * 2015-11-06 2017-05-11 삼성전자 주식회사 Electronic device comprising plurality of displays and method for operating same
US9747278B2 (en) * 2012-02-23 2017-08-29 Palo Alto Research Center Incorporated System and method for mapping text phrases to geographical locations
US10032131B2 (en) 2012-06-20 2018-07-24 Microsoft Technology Licensing, Llc Data services for enterprises leveraging search system data assets
US10235358B2 (en) * 2013-02-21 2019-03-19 Microsoft Technology Licensing, Llc Exploiting structured content for unsupervised natural language semantic parsing
CN113326701A (en) * 2021-06-17 2021-08-31 广州华多网络科技有限公司 Nested entity recognition method and device, computer equipment and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10019535B1 (en) * 2013-08-06 2018-07-10 Intuit Inc. Template-free extraction of data from documents
CN108647258B (en) * 2018-01-24 2020-12-22 北京理工大学 Representation learning method based on entity relevance constraint

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6295543B1 (en) * 1996-04-03 2001-09-25 Siemens Aktiengesellshaft Method of automatically classifying a text appearing in a document when said text has been converted into digital data
US20010032204A1 (en) * 2000-03-13 2001-10-18 Ddi Corporation. Scheme for filtering documents on network using relevant and non-relevant profiles
US20050071365A1 (en) * 2003-09-26 2005-03-31 Jiang-Liang Hou Method for keyword correlation analysis
US20050086205A1 (en) * 2003-10-15 2005-04-21 Xerox Corporation System and method for performing electronic information retrieval using keywords
US20080195567A1 (en) * 2007-02-13 2008-08-14 International Business Machines Corporation Information mining using domain specific conceptual structures
US20090125371A1 (en) * 2007-08-23 2009-05-14 Google Inc. Domain-Specific Sentiment Classification
US7603393B1 (en) * 2007-04-02 2009-10-13 Juniper Networks, Inc. Software merging utility
US7895205B2 (en) * 2008-03-04 2011-02-22 Microsoft Corporation Using core words to extract key phrases from documents
US8041126B1 (en) * 2004-09-21 2011-10-18 Apple Inc. Intelligent document scanning
US8041669B2 (en) * 2004-09-30 2011-10-18 Buzzmetrics, Ltd. Topical sentiments in electronically stored communications
US20120136812A1 (en) * 2010-11-29 2012-05-31 Palo Alto Research Center Incorporated Method and system for machine-learning based optimization and customization of document similarities calculation

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070027672A1 (en) 2000-07-31 2007-02-01 Michel Decary Computer method and apparatus for extracting data from web pages
US7672833B2 (en) 2005-09-22 2010-03-02 Fair Isaac Corporation Method and apparatus for automatic entity disambiguation
US7685201B2 (en) 2006-09-08 2010-03-23 Microsoft Corporation Person disambiguation using name entity extraction-based clustering
US20080065621A1 (en) 2006-09-13 2008-03-13 Kenneth Alexander Ellis Ambiguous entity disambiguation method
US7974964B2 (en) 2007-01-17 2011-07-05 Microsoft Corporation Context based search and document retrieval
US8112402B2 (en) * 2007-02-26 2012-02-07 Microsoft Corporation Automatic disambiguation based on a reference resource
US7970808B2 (en) 2008-05-05 2011-06-28 Microsoft Corporation Leveraging cross-document context to label entity
US8782061B2 (en) 2008-06-24 2014-07-15 Microsoft Corporation Scalable lookup-driven entity extraction from indexed document collections
US8321398B2 (en) * 2009-07-01 2012-11-27 Thomson Reuters (Markets) Llc Method and system for determining relevance of terms in text documents

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6295543B1 (en) * 1996-04-03 2001-09-25 Siemens Aktiengesellshaft Method of automatically classifying a text appearing in a document when said text has been converted into digital data
US20010032204A1 (en) * 2000-03-13 2001-10-18 Ddi Corporation. Scheme for filtering documents on network using relevant and non-relevant profiles
US20050071365A1 (en) * 2003-09-26 2005-03-31 Jiang-Liang Hou Method for keyword correlation analysis
US20050086205A1 (en) * 2003-10-15 2005-04-21 Xerox Corporation System and method for performing electronic information retrieval using keywords
US8041126B1 (en) * 2004-09-21 2011-10-18 Apple Inc. Intelligent document scanning
US8041669B2 (en) * 2004-09-30 2011-10-18 Buzzmetrics, Ltd. Topical sentiments in electronically stored communications
US20080195567A1 (en) * 2007-02-13 2008-08-14 International Business Machines Corporation Information mining using domain specific conceptual structures
US7603393B1 (en) * 2007-04-02 2009-10-13 Juniper Networks, Inc. Software merging utility
US20090125371A1 (en) * 2007-08-23 2009-05-14 Google Inc. Domain-Specific Sentiment Classification
US7987188B2 (en) * 2007-08-23 2011-07-26 Google Inc. Domain-specific sentiment classification
US7895205B2 (en) * 2008-03-04 2011-02-22 Microsoft Corporation Using core words to extract key phrases from documents
US20120136812A1 (en) * 2010-11-29 2012-05-31 Palo Alto Research Center Incorporated Method and system for machine-learning based optimization and customization of document similarities calculation

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9092517B2 (en) 2008-09-23 2015-07-28 Microsoft Technology Licensing, Llc Generating synonyms based on query log data
US9600566B2 (en) 2010-05-14 2017-03-21 Microsoft Technology Licensing, Llc Identifying entity synonyms
US9251249B2 (en) * 2011-12-12 2016-02-02 Microsoft Technology Licensing, Llc Entity summarization and comparison
US20130151538A1 (en) * 2011-12-12 2013-06-13 Microsoft Corporation Entity summarization and comparison
US9747278B2 (en) * 2012-02-23 2017-08-29 Palo Alto Research Center Incorporated System and method for mapping text phrases to geographical locations
US10032131B2 (en) 2012-06-20 2018-07-24 Microsoft Technology Licensing, Llc Data services for enterprises leveraging search system data assets
US9594831B2 (en) * 2012-06-22 2017-03-14 Microsoft Technology Licensing, Llc Targeted disambiguation of named entities
US9229924B2 (en) 2012-08-24 2016-01-05 Microsoft Technology Licensing, Llc Word detection and domain dictionary recommendation
US20150351058A1 (en) * 2012-12-09 2015-12-03 Lg Electronics Inc. Method for obtaining synchronization for device-to-device communication outside of coverage area in a wireless communication system and apparatus for same
US10235358B2 (en) * 2013-02-21 2019-03-19 Microsoft Technology Licensing, Llc Exploiting structured content for unsupervised natural language semantic parsing
WO2017078367A1 (en) * 2015-11-06 2017-05-11 삼성전자 주식회사 Electronic device comprising plurality of displays and method for operating same
US11086436B2 (en) 2015-11-06 2021-08-10 Samsung Electronics Co., Ltd. Electronic device comprising plurality of displays and method for operating same
CN106570132A (en) * 2016-10-27 2017-04-19 浙江大学 Document vector learning method with fusion of mentioned entity information
CN113326701A (en) * 2021-06-17 2021-08-31 广州华多网络科技有限公司 Nested entity recognition method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
US20160154876A1 (en) 2016-06-02
US9251248B2 (en) 2016-02-02

Similar Documents

Publication Publication Date Title
US9251248B2 (en) Using context to extract entities from a document collection
Hill et al. Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study
US7783476B2 (en) Word extraction method and system for use in word-breaking using statistical information
US8370278B2 (en) Ontological categorization of question concepts from document summaries
US7424421B2 (en) Word collection method and system for use in word-breaking
US20100088303A1 (en) Mining new words from a query log for input method editors
US20140052728A1 (en) Text clustering device, text clustering method, and computer-readable recording medium
US8750630B2 (en) Hierarchical and index based watermarks represented as trees
Forsyth et al. Document dissimilarity within and across languages: a benchmarking study
US20140324416A1 (en) Method of automated analysis of text documents
RU2491622C1 (en) Method of classifying documents by categories
US7284006B2 (en) Method and apparatus for browsing document content
Basha et al. Evaluating the impact of feature selection on overall performance of sentiment analysis
US20050004902A1 (en) Information retrieving system, information retrieving method, and information retrieving program
Fišer et al. Distributional modelling for semantic shift detection
He et al. Language feature mining for music emotion classification via supervised learning from lyrics
Singh et al. Writing Style Change Detection on Multi-Author Documents.
Zheng et al. A review on authorship attribution in text mining
CN112861510A (en) Summary processing method, apparatus, device and storage medium
McEnery et al. Building a written corpus: What are the basics?
Shang et al. DIANES: A DEI Audit Toolkit for News Sources
US20230090601A1 (en) System and method for polarity analysis
JP4525433B2 (en) Document aggregation device and program
WO2022105178A1 (en) Keyword extraction method and related device
Ma et al. A cybercrime forensic method for chinese web information authorship analysis

Legal Events

Date Code Title Description
AS Assignment
Owner name: MICROSOFT CORPORATION, WASHINGTON
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:AGRAWAL, SANJAY;REEL/FRAME:024490/0120
Effective date: 20100602

AS Assignment
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034544/0001
Effective date: 20141014

STCF Information on status: patent grant
Free format text: PATENTED CASE

MAFP Maintenance fee payment
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY
Year of fee payment: 4

MAFP Maintenance fee payment
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY
Year of fee payment: 8