US20160117386A1 - Discovering terms using statistical corpus analysis - Google Patents


Info

Publication number
US20160117386A1
Authority
US
United States
Prior art keywords
term
corpus
contextual
terms
contextual characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/520,654
Inventor
Jitendra Ajmera
Ankur Parikh
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US14/520,654
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION (assignment of assignors interest). Assignors: Parikh, Ankur; Ajmera, Jitendra
Priority to US14/722,984 (US10592605B2)
Publication of US20160117386A1
Legal status: Abandoned


Classifications

    • G06F17/30684
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F17/30705
    • G06F17/30719

Definitions

  • the present invention relates generally to the field of natural language processing, and more particularly to “term extraction.”
  • Natural language processing is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding (that is, enabling computers to derive meaning from human or natural language input).
  • IE Information Extraction
  • NLP Natural Language Processing
  • Term Extraction is a sub-task of IE.
  • the goal of Term Extraction is to automatically extract relevant terms from a given text (or “corpus”).
  • Term Extraction is used in many NLP tasks and applications, such as question answering, information retrieval, ontology engineering, semantic web, text summarization, document classification, and clustering. Generally, in term extraction, statistical and machine learning methods may be used to help select relevant terms.
  • a domain ontology represents concepts which belong to a particular “domain” such as an industry or a genre.
  • multiple domain ontologies may exist within a single domain due to differences in language, intended use of the ontologies, and different perceptions of the domain.
  • because domain ontologies represent concepts in very specific and often eclectic ways, they are often incompatible.
  • term extraction becomes difficult when the text being processed belongs to a different domain (for example, medical technology) than the domain from which the NLP software was built (for example, financial news).
  • a method, computer program product and/or system that performs the following steps (not necessarily in the following order): (i) identifying a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus; (ii) adding the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and (iii) identifying a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).
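Stripped to its essentials, the claimed loop can be sketched in Python. The function names and the choice of "word immediately following a term" as the sole contextual characteristic are illustrative assumptions, not the patent's required implementation:

```python
def following_words(term, tokens):
    # Contextual characteristic used in this sketch: the set of words
    # that immediately follow `term` in the tokenized corpus.
    return {tokens[i + 1] for i, tok in enumerate(tokens)
            if tok == term and i + 1 < len(tokens)}

def discover(category_terms, candidates, tokens):
    # Steps (i)/(iii): a candidate is identified when it is immediately
    # followed by a contextual characteristic of the known terms.
    chars = set()
    for term in category_terms:
        chars |= following_words(term, tokens)
    return [c for c in candidates if following_words(c, tokens) & chars]
```

Step (ii) is then a matter of appending each discovered term to `category_terms` and recomputing the characteristics, so that later identifications benefit from the contextual use of earlier discoveries.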
  • FIG. 1 is a block diagram view of a first embodiment of a system according to the present invention
  • FIG. 2 is a flowchart showing a first embodiment method performed, at least in part, by the first embodiment system
  • FIG. 3 is a block diagram view of a machine logic (for example, software) portion of the first embodiment system
  • FIG. 4 is a flowchart view of a method according to the present invention.
  • FIG. 5 is a flowchart view of a method according to the present invention.
  • FIG. 6 is a flowchart view of a method according to the present invention.
  • FIG. 7 is a flowchart view of a method according to the present invention.
  • FIG. 8 is a flowchart view of a method according to the present invention.
  • FIG. 9 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 10 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 11 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 12 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 13 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 14 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 15 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 16 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • Some embodiments of the present invention extract contextually relevant terms from a text sample (or corpus) by iteratively discovering new terms using weighted “contextual characteristics” of terms discovered in previous iterations.
  • a “contextual characteristic” is a feature of a term derived from that term's particular usage in a given corpus (for example, one contextual characteristic is a list of words that commonly precede or follow a given term in the corpus).
  • This Detailed Description section is divided into the following sub-sections: (i) The Hardware and Software Environment; (ii) Example Embodiment; (iii) Further Comments and/or Embodiments; and (iv) Definitions.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a functional block diagram illustrating various portions of networked computers system 100 , including: sub-system 102 ; client sub-systems 104 , 106 , 108 , 110 , 112 ; communication network 114 ; computer 200 ; communication unit 202 ; processor set 204 ; input/output (I/O) interface set 206 ; memory device 208 ; persistent storage device 210 ; display device 212 ; external device set 214 ; random access memory (RAM) devices 230 ; cache memory device 232 ; and program 300 .
  • Sub-system 102 is, in many respects, representative of the various computer sub-system(s) in the present invention. Accordingly, several portions of sub-system 102 will now be discussed in the following paragraphs.
  • Sub-system 102 may be a laptop computer, tablet computer, netbook computer, personal computer (PC), a desktop computer, a personal digital assistant (PDA), a smart phone, or any programmable electronic device capable of communicating with the client sub-systems via network 114 .
  • Program 300 is a collection of machine readable instructions and/or data that is used to create, manage and control certain software functions that will be discussed in detail, below, in the Example Embodiment sub-section of this Detailed Description section.
  • Sub-system 102 is capable of communicating with other computer sub-systems via network 114 .
  • Network 114 can be, for example, a local area network (LAN), a wide area network (WAN) such as the Internet, or a combination of the two, and can include wired, wireless, or fiber optic connections.
  • network 114 can be any combination of connections and protocols that will support communications between server and client sub-systems.
  • Sub-system 102 is shown as a block diagram with many double arrows. These double arrows (no separate reference numerals) represent a communications fabric, which provides communications between various components of sub-system 102 .
  • This communications fabric can be implemented with any architecture designed for passing data and/or control information between processors (such as microprocessors, communications and network processors, etc.), system memory, peripheral devices, and any other hardware components within a system.
  • the communications fabric can be implemented, at least in part, with one or more buses.
  • Memory 208 and persistent storage 210 are computer-readable storage media.
  • memory 208 can include any suitable volatile or non-volatile computer-readable storage media. It is further noted that, now and/or in the near future: (i) external device(s) 214 may be able to supply, some or all, memory for sub-system 102 ; and/or (ii) devices external to sub-system 102 may be able to provide memory for sub-system 102 .
  • Program 300 is stored in persistent storage 210 for access and/or execution by one or more of the respective computer processors 204 , usually through one or more memories of memory 208 .
  • Persistent storage 210 (i) is at least more persistent than a signal in transit; (ii) stores the program (including its soft logic and/or data), on a tangible medium (such as magnetic or optical domains); and (iii) is substantially less persistent than permanent storage.
  • data storage may be more persistent and/or permanent than the type of storage provided by persistent storage 210 .
  • Program 300 may include both machine readable and performable instructions and/or substantive data (that is, the type of data stored in a database).
  • persistent storage 210 includes a magnetic hard disk drive.
  • persistent storage 210 may include a solid state hard drive, a semiconductor storage device, read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, or any other computer-readable storage media that is capable of storing program instructions or digital information.
  • the media used by persistent storage 210 may also be removable.
  • a removable hard drive may be used for persistent storage 210 .
  • Other examples include optical and magnetic disks, thumb drives, and smart cards that are inserted into a drive for transfer onto another computer-readable storage medium that is also part of persistent storage 210 .
  • Communications unit 202, in these examples, provides for communications with other data processing systems or devices external to sub-system 102 .
  • communications unit 202 includes one or more network interface cards.
  • Communications unit 202 may provide communications through the use of either or both physical and wireless communications links. Any software modules discussed herein may be downloaded to a persistent storage device (such as persistent storage device 210 ) through a communications unit (such as communications unit 202 ).
  • I/O interface set 206 allows for input and output of data with other devices that may be connected locally in data communication with server computer 200 .
  • I/O interface set 206 provides a connection to external device set 214 .
  • External device set 214 will typically include devices such as a keyboard, keypad, a touch screen, and/or some other suitable input device.
  • External device set 214 can also include portable computer-readable storage media such as, for example, thumb drives, portable optical or magnetic disks, and memory cards.
  • Software and data used to practice embodiments of the present invention, for example, program 300 can be stored on such portable computer-readable storage media. In these embodiments the relevant software may (or may not) be loaded, in whole or in part, onto persistent storage device 210 via I/O interface set 206 .
  • I/O interface set 206 also connects in data communication with display device 212 .
  • Display device 212 provides a mechanism to display data to a user and may be, for example, a computer monitor or a smart phone display screen.
  • FIG. 2 shows flowchart 250 depicting a method according to the present invention.
  • FIG. 3 shows program 300 for performing at least some of the method steps of flowchart 250 .
  • the present embodiment refers extensively to a high precision domain lexicon (HPDL).
  • the HPDL (also referred to as a “set of category related terms”) is a collection of terms (words or sets of words) that belong to a specific domain, category, or genre (“domain”).
  • the HPDL can serve as an underlying “knowledge base” for a given domain so as to extract more contextually relevant terms from a piece of text (or corpus).
  • the HPDL is used to: (i) extract contextually relevant terms (term extraction); and (ii) extract additional HPDL-eligible terms in order to grow, strengthen, and/or expand the HPDL.
  • HPDL domains may have multiple categories (or sub-domains).
  • the domain of smartphones may include categories such as smartphone models, smartphone apps, and/or smartphone modes. It is contemplated that the present invention may apply to HPDLs with singular domains, multiple domains, and/or multiple domain categories (or sub-domains).
  • method 250 may begin with an existing, predefined HPDL, while in other embodiments the HPDL may be initially extracted from the corpus using, for example, term extraction methods adapted to achieve high levels of precision. Some known methods for extracting an initial HPDL from the corpus are discussed below in the Further Comments and/or Embodiments Sub-Section of this Detailed Description.
  • the HPDL has a domain of “things that jump” and initially includes the following terms: (i) fox; and (ii) rabbit (in other embodiments, an HPDL including the terms “fox” and “rabbit” might also have a domain of “animals” and a sub-domain of “mammals”).
  • the present embodiment also refers extensively to a corpus.
  • the corpus is a text sample that method 250 extracts relevant terms from.
  • the corpus is the text that is being acted upon (interpreted, processed, classified, etc.) during term extraction.
  • the corpus includes the following text: “A quick brown fox jumps over the lazy dog, but a quicker, more nimble kangaroo jumps over the fox. The following day, while the kangaroo leaps over the still-lazy dog, a determined frog leaps over a surprisingly speedy sloth.”
  • Processing begins when extract candidate terms module (“mod”) 302 extracts candidate terms (also referred to as “relevant terms”) from the corpus.
  • various statistical methods are used to extract relevant candidate terms. A number of these known methods are discussed below in the Further Comments and/or Embodiments Sub-Section of this Detailed Description. However, these are not meant to be all-inclusive or limiting, as other, less traditional extraction methods may also be used.
  • dictionaries or domain lexicons different from and/or unrelated to the HPDL may be used in this step. For example, in the present example embodiment, candidate terms are extracted from the corpus if they are identified as “animals”.
  • Processing proceeds to step S260, where discover new generation mod 304 discovers a new generation of HPDL terms from the candidate terms using the HPDL and its contextual characteristics.
  • This step begins by identifying contextual characteristics (or “initial contextual characteristics”) of the terms in the HPDL.
  • a contextual characteristic is a feature of a term derived from that term's particular usage in a given corpus (for a more complete definition of “contextual characteristic,” see the Definitions Sub-Section of this Detailed Description).
  • the contextual characteristic for each term in the HPDL is the word immediately following that term in the corpus (when a term is the last word in a sentence, it does not have a contextual characteristic).
  • the only contextual characteristic for the term “fox” is “jumps”, because the only word immediately following “fox” in the corpus is “jumps”.
  • For the second HPDL term, “rabbit”, there are no contextual characteristics, because “rabbit” does not appear in the corpus.
  • As such, the only contextual characteristic of the HPDL is the word “jumps”. It should be noted that although the present embodiment includes a simple example with one contextual characteristic, in many embodiments the HPDL has a plurality of contextual characteristics.
  • In the present example, the only candidate term to immediately precede the word “jumps” is “kangaroo”.
  • As a result, “kangaroo” (the “first term”) is the only term included in the current generation of discovered terms.
  • the current generation may include a plurality of discovered terms.
  • additional steps may be taken to further refine the list of discovered terms (for some examples, see the Further Comments and/or Embodiments Sub-Section of this Detailed Description).
  • Processing proceeds to step S265, where update terms mod 306 adds the current generation of terms to the HPDL.
  • the term “kangaroo” is added to the HPDL, with the resulting HPDL (or “revised set of category related terms”) being as follows: (i) fox; (ii) rabbit; and (iii) kangaroo.
  • Processing proceeds to step S270, where update terms mod 306 deletes the current generation of terms from the candidate terms list.
  • the term “kangaroo” is removed from the candidate terms list, with the resulting candidate terms list being as follows: (i) dog; (ii) frog; and (iii) sloth.
  • Processing proceeds to step S275, where iterate mod 308 checks to see if method 250 is on its last iteration. In the present embodiment, a total of two iterations are to be performed. As such, method 250 is not on its last iteration (NO), and processing returns to step S260 for another iteration. In other embodiments, however, other tests may be used. For example, in one embodiment, iterations may occur until the HPDL reaches a certain size. In another embodiment, iterations may continue to occur for a certain period of time. In still other embodiments, iterations may continue to occur indefinitely and/or until no further terms for the HPDL are discovered.
  • On the second iteration, discover new generation mod 304 repeats the process of identifying contextual characteristics of the terms in the HPDL.
  • an additional contextual characteristic (or “first term contextual characteristic”) is identified: the word “leaps”, which immediately follows the word “kangaroo” in the second sentence of the corpus.
  • When the updated contextual characteristics are applied to the candidate terms, an additional match is found: the term “frog” appears immediately before the word “leaps” in the corpus.
  • “frog” (the “second term”) is added to the current generation of discovered terms.
  • Returning to step S265, update terms mod 306 adds “frog” to the HPDL, resulting in the following HPDL: (i) fox; (ii) rabbit; (iii) kangaroo; and (iv) frog.
  • At step S270, update terms mod 306 removes “frog” from the candidate terms list, with the resulting candidate terms list being as follows: (i) dog; and (ii) sloth.
  • At step S275, in the present example, two iterations have now been completed, which means that method 250 is on its final iteration. Therefore, step S275 resolves to “YES”, and processing proceeds to step S280, where method 250 ends.
  • the HPDL for the domain of “things that jump” now includes two additional terms (“kangaroo” and “frog”), and will be able to further extract contextually relevant terms in future iterations and/or from different corpuses.
  • system 102 now also has a list of candidate terms (“dog” and “sloth”) that may be helpful for other NLP-related tasks.
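The two-iteration walkthrough above can be reproduced end to end. The sketch below is a simplified reading of method 250; the tokenization details (lowercasing, splitting sentences on periods so that a sentence-final term has no contextual characteristic) and the fixed two-iteration loop are assumptions made to match the example:

```python
import re

CORPUS = ("A quick brown fox jumps over the lazy dog, but a quicker, "
          "more nimble kangaroo jumps over the fox. The following day, "
          "while the kangaroo leaps over the still-lazy dog, a determined "
          "frog leaps over a surprisingly speedy sloth.")

def sentences(text):
    # Tokenize per sentence so a sentence-final term has no follower.
    return [re.findall(r"[a-z-]+", s.lower())
            for s in text.split(".") if s.strip()]

def following_words(term, sents):
    return {s[i + 1] for s in sents for i, tok in enumerate(s)
            if tok == term and i + 1 < len(s)}

def run(hpdl, candidates, sents, iterations=2):
    for _ in range(iterations):
        # Contextual characteristics of the current HPDL (step S260).
        chars = set()
        for term in hpdl:
            chars |= following_words(term, sents)
        discovered = [c for c in candidates
                      if following_words(c, sents) & chars]
        hpdl = hpdl + discovered                    # step S265
        candidates = [c for c in candidates         # step S270
                      if c not in discovered]
    return hpdl, candidates

hpdl, leftover = run(["fox", "rabbit"],
                     ["dog", "kangaroo", "frog", "sloth"],
                     sentences(CORPUS))
```

Running this yields the state described above: the HPDL grows to fox, rabbit, kangaroo, frog (“kangaroo” discovered via “jumps” in the first iteration contributes “leaps”, which discovers “frog” in the second), leaving “dog” and “sloth” as leftover candidate terms.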
  • Some embodiments of the present invention recognize the following facts, potential problems and/or potential areas for improvement with respect to the current state of the art: (i) some existing approaches (including approaches that rely on linguistic processors to extract candidate terms) do not perform well when the corpus (or text) has a different genre (or domain) than the corpus used to build the processor; (ii) some existing approaches rely purely on statistical methods (such as n-gram sequences or topic modeling) to extract candidate terms, thereby negatively affecting system precision; (iii) existing approaches can be configured to provide terms with either high precision or high recall, but not both (thereby negatively affecting the overall accuracy of the system); and/or (iv) existing approaches are unable to discover new domain-specific terms directly from the corpus in a bootstrapping manner.
  • Some embodiments of the present invention may include one, or more, of the following features, characteristics and/or advantages: (i) using contextual similarity with a high precision domain lexicon for ranking; (ii) extracting candidate terms statistically; (iii) using an approach other than singular value decomposition; (iv) extracting terms without using linguistic processors; (v) extracting terms without analyzing linguistic or structural characteristics of a document; (vi) extracting terms without using syntactic and semantic contextual analysis; (vii) extracting terms without using dictionary-based statistics; (viii) extracting terms without using specialized corpora; (ix) extracting terms based on contextual information of a lexicon obtained from a given corpus; and/or (x) using association rules to measure unithood and/or filter candidate terms.
  • Some embodiments of the present invention are adapted to identify terms (nouns or noun phrases) from a corpus with both high precision and high recall without using any linguistic processors or open domain linguistic resources (such as dictionaries and ontologies).
  • these embodiments may include one, or more, of the following features, characteristics and/or advantages: (i) providing an iterative approach to term discovery where discovery depends on weighted contextual characteristics of already discovered terms in previous iterations; (ii) ranking pure statistically extracted candidate terms (N-grams) based on their noun specificity and term specificity determined using weighted contextual similarity with known terms (nouns or noun phrases); (iii) using association rules, filtering candidate terms that cannot exist independently; and/or (iv) validating unithood of candidate terms using association rules.
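The patent does not spell out the exact form of the association rules in items (iii) and (iv); one common confidence-based reading, in which a two-word candidate is kept only when its first word reliably co-occurs with its second, might look like the sketch below (the 0.5 threshold is an arbitrary illustration):

```python
from collections import Counter

def unithood_ok(bigram, tokens, min_conf=0.5):
    # Association-rule confidence: conf(w1 -> w2) = count(w1 w2) / count(w1).
    # Low confidence suggests the first word commonly exists independently
    # of the second, i.e. the bigram is a weak unit and should be filtered.
    w1, w2 = bigram
    pair_counts = Counter(zip(tokens, tokens[1:]))
    word_counts = Counter(tokens)
    return (word_counts[w1] > 0
            and pair_counts[(w1, w2)] / word_counts[w1] >= min_conf)
```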
  • Some embodiments of the present invention may further include one, or more, of the following helpful features, characteristics, and/or advantages: (i) achieving positive results in entity set expansion tasks, where the goal is to identify entities from the corpus in a bootstrapping manner; (ii) performing well on diverse domains such as medical and/or news; (iii) performing better term extraction for any language, including languages for which linguistic processors are not built or do not perform well for; and/or (iv) keeping resources such as dictionaries, lexicons, ontologies, and/or entity lists up-to-date.
  • Method 400 is provided in FIG. 4 .
  • Method 400 is adapted to extract terms and their variants from a corpus with both high precision and high recall. High precision and recall occur even if the corpus belongs to a domain different from the system's source domain, or if the corpus is in a language for which linguistic systems are not available or mature.
  • Processing begins with step S 402, where method 400 uses statistical corpus analysis to extract candidate terms.
  • This step uses known (or to be known in the future) statistical approaches (such as frequent item set mining, language modeling, and topic modeling) to extract potential candidate terms from the corpus. Additionally, step S 402 filters out irrelevant potential candidate terms using statistical criteria.
  • Processing proceeds to step S 404, where method 400 creates a high precision domain lexicon and analyzes contextual information therein.
  • The high precision domain lexicon terms (or "lexicon terms") are either manually extracted from the corpus or automatically extracted using any system (known or to be known in the future) configured to focus on high precision.
  • Context words of those lexicon terms (such as the words appearing before and after the lexicon terms in the corpus) are extracted and weighted to create a set of weighted term context words.
  • Processing proceeds to step S 406, where method 400 ranks the candidate terms (see step S 402) based on contextual similarity with the weighted term context words.
  • Processing proceeds to step S 408 , where method 400 selects the top candidate terms as discovered terms.
  • The number of top candidate terms to be selected is a preconfigured value that depends on the application and business context.
  • Processing proceeds to step S 410, where the newly discovered terms (that is, the top candidate terms) are added to the high precision domain lexicon. Once the newly discovered terms have been added, they are deleted from the candidate terms list.
  • Processing proceeds to step S 412, where method 400 compares an iteration count with a pre-defined iteration threshold (where the iteration threshold is determined based on application and business context). Processing then proceeds to step S 414. If the iteration count is less than the iteration threshold (NO), processing returns to step S 404 to discover more relevant terms from the corpus. If the iteration count is greater than or equal to the iteration threshold (YES), processing for method 400 completes. As a result of method 400 completing: (i) the high precision domain lexicon now includes additional relevant domain terms; and (ii) the list of candidate terms includes additional contextually relevant terms that may be used for natural language processing or other tasks.
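  • The iterative loop of method 400 (steps S 404 through S 414) can be sketched as follows. This is a minimal sketch, not the claimed implementation; the function name and the `rank` callback (standing in for the contextual ranking of steps S 404/S 406) are hypothetical.

```python
from typing import Callable, List, Set

def discover_terms(candidates: List[str],
                   lexicon: Set[str],
                   rank: Callable[[List[str], Set[str]], List[str]],
                   top_k: int,
                   max_iterations: int) -> Set[str]:
    """Iteratively grow a high precision domain lexicon (steps S 404-S 414)."""
    for _ in range(max_iterations):                  # S 412/S 414: iteration check
        if not candidates:
            break
        ranked = rank(candidates, lexicon)           # S 404/S 406: contextual ranking
        discovered = set(ranked[:top_k])             # S 408: pick top candidates
        lexicon = lexicon | discovered               # S 410: grow the lexicon ...
        candidates = [c for c in candidates
                      if c not in discovered]        # ... and shrink the candidates
    return lexicon
```

Each pass re-ranks the remaining candidates against the lexicon grown in the previous pass, which is what makes the discovery iterative.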
  • Step S 402 (Extract Candidate Terms Using Statistical Corpus Analysis, see FIG. 4) further includes method 500 (see FIG. 5).
  • Method 500 is adapted to apply statistical approaches to the corpus to extract potential candidate terms.
  • The potential candidate terms are then passed through statistical filters to identify relevant terms, resulting in new candidate terms.
  • Processing begins with step S 502 , where method 500 extracts text from the corpus and then applies heuristics-based sentence splitters on the text to extract sentences.
  • Processing proceeds to step S 504, where various statistical methods are applied to the extracted sentences for candidate term extraction.
  • The method to be used is typically determined based on a few factors: (i) the type of document comprising the corpus (for example, a web page, a textbook, or a manual); (ii) the length of the corpus (for example, the number of words in the corpus); and/or (iii) the general domain of the corpus (for example, healthcare, finance, or telecommunications). Although many methods may be used, two are discussed below: (i) a statistical language modeling method (beginning with step S 506); and (ii) an association rule mining method (beginning with step S 514).
  • If the statistical language modeling method is chosen, processing proceeds to step S 506, where method 500 extracts n-grams from each extracted sentence in the corpus for each preconfigured value of n.
  • The value of n may be determined in a number of ways, including, for example, by conducting experiments or by prioritizing certain features (such as speed versus accuracy).
  • An n-gram is a contiguous sequence of words from a given extracted sentence, where ‘n’ represents the number of words in the sequence. All unique n-grams of the corpus are considered as potential candidate terms.
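  • The n-gram extraction of step S 506 can be sketched as follows. This is an illustrative sketch; the function name is a hypothetical stand-in.

```python
def extract_ngrams(sentence: str, max_n: int) -> set:
    """Return all unique n-grams of 1..max_n words from a sentence (step S 506)."""
    words = sentence.split()
    return {" ".join(words[i:i + n])                 # contiguous n-word sequence
            for n in range(1, max_n + 1)            # every preconfigured value of n
            for i in range(len(words) - n + 1)}     # every start position
```

Applying this to every extracted sentence and taking the union yields the set of unique n-grams treated as potential candidate terms.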
  • Processing proceeds to steps S 508 and S 510 , where method 500 scores the potential candidate terms based on their termhood and unithood, respectively.
  • Termhood (as used in step S 508) scores the validity of a potential candidate term as a representative of the corpus content as a whole, using one or more statistical measures now known (or to be known in the future).
  • In one embodiment, a measure of the term's frequency in the corpus is used.
  • In another embodiment, a measure of "weirdness" is used: the term's frequency in the corpus compared to its frequency in a reference corpus.
  • In yet another embodiment, a measure of the pertinence or specificity of the term to a particular domain is used.
  • Unithood (as used in step S 510) scores the collocation strength (that is, the strength of association among a term's constituent words) of potential candidate terms using one of the statistical measures now known (or to be known in the future). In one embodiment, a mutual information test is used. In another embodiment, a t-test is used.
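  • As a sketch of these two scores, a "weirdness" termhood measure and a pointwise mutual information unithood measure (for a two-word phrase) might be computed as follows. The helper functions are hypothetical, and the add-one smoothing on the reference count is an assumption made here to avoid division by zero.

```python
import math
from collections import Counter

def weirdness(word, domain_tokens, reference_tokens):
    """Termhood (step S 508): relative frequency in the domain corpus
    divided by (smoothed) relative frequency in a reference corpus."""
    p_domain = Counter(domain_tokens)[word] / len(domain_tokens)
    p_reference = (Counter(reference_tokens)[word] + 1) / (len(reference_tokens) + 1)
    return p_domain / p_reference

def pmi(bigram, tokens):
    """Unithood (step S 510): pointwise mutual information of a two-word phrase."""
    w1, w2 = bigram.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_xy = bigrams[(w1, w2)] / (len(tokens) - 1)    # joint probability of the pair
    p_x = unigrams[w1] / len(tokens)
    p_y = unigrams[w2] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))            # high when words co-occur strongly
```

A high weirdness score flags domain-specific words; a high PMI flags word pairs that occur together far more often than chance.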
  • Processing proceeds to step S 512, where potential candidate terms with termhood and unithood scores above pre-defined thresholds are selected and identified as candidate terms, and processing for method 500 completes.
  • If the association rule mining method is chosen in step S 504 (as opposed to the statistical language modeling method discussed above), processing proceeds to step S 514.
  • In step S 514, method 500 uses an algorithm to extract frequent n-grams. The extracted frequent n-grams are identified as potential candidate terms.
  • Specifically, method 500 first extracts unigrams (that is, single words) that are frequent (that is, having a frequency above a pre-defined threshold). Then, method 500 extracts frequent bigrams (that is, two-word phrases) that are made up of the previously identified frequent unigrams. This continues for n steps (where n equals the number of words in the phrase being analyzed).
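  • This level-wise mining can be sketched as follows. The sketch assumes the standard a-priori-style pruning rule, consistent with the description above: an n-gram is only counted if both of its (n-1)-word sub-phrases were already found frequent. The function name and the per-level threshold list are hypothetical.

```python
from collections import Counter

def frequent_ngrams(sentences, max_n, thresholds):
    """Level-wise frequent n-gram mining (step S 514)."""
    frequent = {}                                    # tuple of words -> count
    for n in range(1, max_n + 1):
        counts = Counter()
        for sentence in sentences:
            words = sentence.split()
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # prune: both (n-1)-word sub-phrases must already be frequent
                if n > 1 and (gram[:-1] not in frequent or gram[1:] not in frequent):
                    continue
                counts[gram] += 1
        for gram, count in counts.items():
            if count >= thresholds[n - 1]:           # per-level frequency threshold
                frequent[gram] = count
    return {" ".join(gram): count for gram, count in frequent.items()}
```

Each level only considers phrases built from the survivors of the previous level, which keeps the candidate space small.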
  • Processing proceeds to step S 516, where method 500 analyzes each potential candidate term and generates association rules along with corresponding confidence values.
  • Step S 516 performs three tasks: (i) for every potential candidate term t, all non-empty ordered subsets s are generated; (ii) for every subset s of t, a forward rule, "s -> t\s", along with its confidence (measured as the frequency of t divided by the frequency of s), is generated; and (iii) for every subset s of t, an inverse rule, "s <- t\s", along with its confidence, is generated.
  • For example, consider the term t = "Mobile Phone A".
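  • The rule generation of step S 516 might be sketched as follows. This sketch makes two assumptions consistent with the examples in this document: each subset s is a prefix of t (with t\s the remaining suffix), and the inverse rule's confidence is the frequency of t divided by the frequency of t\s. The `freq` mapping of phrase frequencies is assumed given.

```python
def association_rules(term, freq):
    """Generate forward and inverse rules with confidences (step S 516).
    Forward rule 's -> rest' has confidence freq(term) / freq(s);
    inverse rule 's <- rest' here uses freq(term) / freq(rest)."""
    words = term.split()
    forward, inverse = {}, {}
    for i in range(1, len(words)):
        s, rest = " ".join(words[:i]), " ".join(words[i:])
        forward[f"{s} -> {rest}"] = freq[term] / freq[s]
        inverse[f"{s} <- {rest}"] = freq[term] / freq[rest]
    return forward, inverse
```

For t = "Mobile Phone A", this yields forward rules such as "Mobile -> Phone A" and inverse rules such as "Mobile <- Phone A"; a confident inverse rule means the suffix rarely appears without the prefix.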
  • Processing proceeds to step S 518, where n-grams are filtered using the inverse rules created in step S 516.
  • Specifically, term variations are identified and removed based on their confidence scores.
  • Method 500 identifies a term variation when an inverse rule from the term variation to the term has a confidence score above a predefined threshold (determined experimentally, for example). For example, "Mobile Phone A" has one inverse rule with a confidence score above the predefined threshold: "Mobile <- Phone A". Because the confidence score is over the threshold, method 500 identifies "Phone A" as a variation of "Mobile Phone A" and removes "Phone A" from the list of potential candidate terms.
  • Next, in step S 520, n-grams are filtered using forward rules (which serve as a measure of unithood for potential candidate terms).
  • Specifically, the confidence of a forward rule provides the probability of the order of term constituents. If none of the forward rules for a term has a confidence score above a pre-defined threshold, that term is removed from the list of potential candidate terms. For example, the term "Manufacturer launches new" has two forward rules: "Manufacturer launches -> new" and "Manufacturer -> launches new". Because neither forward rule has a confidence level above the threshold, the term is removed from the list of potential candidate terms.
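  • The two filtering passes (steps S 518 and S 520) might be sketched as follows. The function is hypothetical; rules are assumed to be strings in the "s -> rest" / "s <- rest" form, and the 0.8 threshold is only an example value.

```python
def filter_candidates(terms, forward, inverse, threshold=0.8):
    """Filter potential candidate terms using association rules."""
    kept = set(terms)
    # Step S 518: drop term variations -- the right-hand side of any
    # inverse rule whose confidence exceeds the threshold.
    for rule, confidence in inverse.items():
        if confidence > threshold:
            kept.discard(rule.split(" <- ")[1])
    # Step S 520: keep a multi-word term only if at least one of its
    # forward rules exceeds the threshold (unithood check).
    def unithood_ok(term):
        if len(term.split()) == 1:
            return True                              # unigrams have no forward rules
        return any(confidence > threshold
                   for rule, confidence in forward.items()
                   if rule.replace(" -> ", " ") == term)
    return {t for t in kept if unithood_ok(t)}
```

In the running example, "Phone A" falls to the inverse-rule pass while "Connect ManufacturerA" falls to the forward-rule pass.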
  • Following step S 520, processing for method 500 completes, resulting in a new list of candidate terms drawn from the remaining potential candidate terms.
  • Step S 404 (Analyzing Contextual Information of High Precision Domain Lexicon, see FIG. 4) further includes method 600, shown in FIG. 6.
  • Processing begins with step S 602 , where a high precision domain lexicon is created (either manually or automatically using methods configured to focus on high precision) or provided from previous iterations of method 600 .
  • Term variations from the lexicon are then filtered and/or replaced using previously generated inverse rules, if available (for example, from step S 518 (see FIG. 5 )).
  • A term from the lexicon is identified as a term variation if an inverse rule from the term to some other, longer term has a confidence score above a pre-defined threshold.
  • If the longer term is part of the lexicon, then the term identified as a term variation is removed. For example, if "Mobile Phone A" and "Phone A" are in the lexicon and the inverse rule "Mobile <- Phone A" has a confidence level above the threshold, then "Phone A" is removed from the lexicon. If the longer term is not part of the lexicon, then the lexicon term is replaced with the longer term. For example, if "Phone" is in the lexicon and the inverse rule "Mobile <- Phone" has a confidence level above the threshold, then "Phone" is replaced with "Mobile Phone".
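  • The filter/replace behavior of this lexicon cleanup might be sketched as follows. The helper is hypothetical; inverse rules are assumed as "prefix <- variation" strings, with the full term being "prefix variation".

```python
def clean_lexicon(lexicon, inverse_rules, threshold=0.8):
    """Filter and/or replace term variations in the lexicon (step S 602)."""
    cleaned = set(lexicon)
    for rule, confidence in inverse_rules.items():
        if confidence <= threshold:
            continue                                 # only confident rules apply
        prefix, variation = rule.split(" <- ")
        full_term = f"{prefix} {variation}"
        if variation in cleaned:
            cleaned.discard(variation)               # drop the shorter variation ...
            cleaned.add(full_term)                   # ... keeping/adding the longer term
    return cleaned
```

Because the result is a set, adding the full term covers both cases: if it was already in the lexicon the addition is a no-op (plain removal of the variation); if not, the variation is effectively replaced by the full term.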
  • Processing proceeds to step S 604, where method 600 scores lexicon terms and ranks them based on their scores. Scoring may be performed by a variety of methods now known (or to be known in the future), and may be based on properties such as term frequency observed in a given corpus. Processing proceeds to step S 606, where the top X terms are selected, where X is pre-defined (and determined experimentally, for example).
  • Processing proceeds to step S 608, where context words are extracted from the corpus.
  • First, occurrences of lexicon terms within a given corpus are identified. Then, context words are extracted per a pre-defined window size (for example, the two words before and the two words after a lexicon term). The words within the window are identified as context words and added to a list of context words.
  • Processing proceeds to step S 610, where closed class context words (such as determiners, prepositions, pronouns, and/or conjunctions) are removed from the list of context words.
  • Processing proceeds to step S 612, where each context word is weighted. In this embodiment, the weight of a context word equals the number of unique lexicon terms the context word appears with divided by the total number of lexicon terms. Processing for method 600 concludes with a list of weighted term context words.
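  • Steps S 608 through S 612 can be sketched together as follows. This is an illustrative sketch; the tokenized corpus, window size, and closed-class stoplist are assumed inputs, and the function name is hypothetical.

```python
def weighted_context_words(tokens, lexicon_terms, window=2,
                           closed_class=frozenset()):
    """Extract context words of lexicon terms (S 608), drop closed-class
    words (S 610), and weight each word by the fraction of lexicon terms
    it appears with (S 612)."""
    appears_with = {}                                # context word -> lexicon terms seen
    for term in lexicon_terms:
        term_words = term.split()
        for i in range(len(tokens) - len(term_words) + 1):
            if tokens[i:i + len(term_words)] == term_words:   # occurrence of the term
                before = tokens[max(0, i - window):i]
                after = tokens[i + len(term_words):i + len(term_words) + window]
                for word in before + after:
                    if word not in closed_class:     # S 610: skip closed-class words
                        appears_with.setdefault(word, set()).add(term)
    return {word: len(terms) / len(lexicon_terms)    # S 612: weight per context word
            for word, terms in appears_with.items()}
```

A word that appears next to every lexicon term gets weight 1.0; a word seen near only one of two lexicon terms gets weight 0.5.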
  • Step S 406 (Contextual Similarity Based Ranking of Candidate Terms, see FIG. 4) further includes method 700, shown in FIG. 7.
  • Processing begins with step S 702 , where context words for candidate terms are extracted. First, occurrences of each candidate term (see discussion of method 500 , above) in the corpus are identified. Then, from each occurrence, context words are extracted per a pre-defined window size (for example, the two words before and the two words after each candidate term).
  • Processing proceeds to step S 704, where closed class context words (such as determiners, prepositions, pronouns, and/or conjunctions) are removed from the list of context words.
  • The remaining context words ("candidate term context words") are selected, stored (along with their frequencies), and mapped to their corresponding candidate terms.
  • Processing proceeds to step S 706, where the contextual similarity between the candidate term context words (see step S 704, above) and the weighted term context words (see discussion of method 600, above) is measured by a contextual similarity score.
  • The contextual similarity score may be obtained by a number of methods now known (or to be known in the future).
  • In one embodiment, the contextual similarity score is represented by the equation "Σi Wi*Fi", where: 'i' indexes the distinct context words of a candidate term; 'Wi' equals the weight of context word i in the weighted term context words ('Wi' equals zero when the context word is not in the set); and 'Fi' equals the frequency of context word i with respect to the candidate term in a given corpus.
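  • The score "Σi Wi*Fi" might be computed as follows. This is a minimal sketch; the two dictionaries are assumed to come from the candidate-term context extraction and the weighted term context words, respectively, and the function name is hypothetical.

```python
def contextual_similarity(candidate_context_freq, term_context_weights):
    """Contextual similarity score (step S 706): sum of Fi * Wi over the
    candidate's distinct context words; Wi is zero for words absent from
    the weighted term context words."""
    return sum(freq * term_context_weights.get(word, 0.0)
               for word, freq in candidate_context_freq.items())
```

Candidate terms whose context words overlap heavily with those of known lexicon terms score highest and therefore rank first.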
  • Processing proceeds to step S 708, where candidate terms are ranked based on the contextual similarity scores obtained in step S 706 (and discussed in the preceding paragraph).
  • The result of this step is a list of ranked candidate terms.
  • Step S 408 (Discover New Terms, see FIG. 4) further includes method 800, shown in FIG. 8.
  • Processing begins with step S 802, where method 800 selects the top K candidate terms from the ranked list and creates a set of top K candidate terms, where the value of K is pre-configured. Processing then proceeds to step S 804, where method 800 removes any candidate terms from the set if they are also part of the domain lexicon. The remaining terms from the set of top K candidate terms are identified as, simply, "terms," and processing for method 800 completes.
  • Table 900 (see FIG. 9) shows the results of step S 514 on an example corpus. In this example, 'n' equals 4.
  • Table 900 begins with row 902 , which shows the result of a frequent unigram extraction on the example corpus (showing both the extracted unigrams and their corresponding frequencies).
  • Row 904 shows the result of a frequent bigram extraction on the example corpus (showing both the extracted bigrams and their corresponding frequencies).
  • The frequency threshold for the bigram extraction is 30; as such, bigrams with a frequency of less than 30 (none, in this example) will not be included in the output for step S 514.
  • Row 906 shows the result of a frequent trigram extraction on the example corpus (showing both the extracted trigrams and their corresponding frequencies).
  • The frequency threshold for the trigram extraction is 20; as such, trigrams with a frequency of less than 20 (none, in this example) will not be included in the output for step S 514.
  • Row 908 shows the result of a frequent 4-gram extraction on the example corpus (showing both the extracted 4-grams and their corresponding frequencies).
  • The frequency threshold for the 4-gram extraction is 10; as such, 4-grams with a frequency of less than 10 (none, in this example) will not be included in the output for step S 514.
  • The resulting example output of row 908, combined with the output from rows 902, 904, and 906, makes up the entire list of frequent n-grams generated by step S 514.
  • Table 1000 shows the results of step S 516 (see FIG. 5 ), where association rules are generated from the list of frequent n-grams generated by the previous step S 514 .
  • Row 1002 shows the generated forward rules, along with their corresponding confidence values (see discussion of step S 516, above).
  • Row 1004 similarly shows the generated inverse rules, along with their corresponding confidence values (again, see discussion of step S 516 , above).
  • Although a given term can have multiple inverse rules, in the present example, for each term, only the inverse rule with the maximum confidence value is shown.
  • Table 1100 shows the results of steps S 518 and S 520 (see FIG. 5 ), where the n-grams created in step S 514 are filtered using the inverse rules and the forward rules generated in step S 516 .
  • In step S 518, the inverse rules are applied to the list of frequent n-grams. For each inverse rule with a confidence value over a pre-defined threshold (in the present example, 0.8), the term on the right-hand side of the rule is removed from the list of frequent n-grams.
  • Row 1102 of table 1100 shows all of the n-grams that have been filtered using the inverse rules, and row 1104 shows the n-grams that remain after that filtering.
  • In step S 520, the forward rules are applied to the list of frequent n-grams. Each remaining n-gram is removed from the list if it does not have a corresponding forward rule with a confidence value above a pre-defined threshold (in the present example, 0.8).
  • For example, the forward rule "ManufacturerB PhoneA W -> 4G" has a confidence value of 1.00 (which is greater than 0.8), so the term "ManufacturerB PhoneA W 4G" remains on the list.
  • Conversely, the forward rule "Connect -> ManufacturerA" has a confidence value of 0.10 (which is less than 0.8), so the term "Connect ManufacturerA" is removed from the list.
  • Row 1106 of table 1100 shows all of the n-grams that have been filtered using the forward rules, and row 1108 shows the n-grams that remain after that filtering and are considered candidate terms.
  • Table 1200 (see FIG. 12) shows the results of steps S 602, S 604, and S 606 (see FIG. 6).
  • Row 1202 shows the lexicon terms that have been extracted at the beginning of step S 602 . These terms are considered to be high precision domain lexicon terms for the domain of smartphones (collectively, they are referred to as the “high precision domain lexicon,” the “lexicon,” and/or the “lexicon terms”).
  • First, method 600 identifies the n-grams from the corpus that end with any of the terms from the domain lexicon.
  • Method 600 generates inverse rules for these n-grams along with corresponding confidence values. If the confidence of an inverse rule exceeds a pre-determined confidence value threshold (in this case, 0.8), then the method checks if the full term of the inverse rule is included in the high precision domain lexicon. If so, the term on the right-hand side of the rule is removed from the lexicon. If not, then the right-hand side term is replaced in the lexicon by the full term of the inverse rule.
  • Row 1204 of table 1200 shows both of the generated inverse rules that meet the confidence value threshold in the present example embodiment, along with their corresponding confidence values.
  • For the rule "ManufacturerC <- PhoneB 12": because "ManufacturerC PhoneB 12" is already included in the lexicon, "PhoneB 12" (that is, the term on the right-hand side of the rule) is removed from the lexicon. For the rule "ManufacturerB <- PhoneC": because "ManufacturerB PhoneC" is not in the lexicon, "PhoneC" is replaced by "ManufacturerB PhoneC" in the lexicon.
  • The resulting, modified lexicon terms are shown in row 1206 of table 1200.
  • Row 1208 shows the results of step S 604 (see FIG. 6), where the lexicon is scored using a C-Value/NC-Value method. As shown in table 1200, the lexicon terms are ranked based on their respective scores. In the next method step, S 606, the top X terms are selected. In the present case, X equals 5 and the lexicon contains only four terms, so all four lexicon terms are selected, as shown in row 1210 of FIG. 12.
  • Table 1300 (see FIG. 13) shows the results of steps S 608, S 610, and S 612 (see FIG. 6).
  • Processing begins with step S 608 (the results of which are shown in row 1302), where context words are extracted from the corpus. In this example, the window extends to one word before the term and one word after the term. So, when a lexicon term is found in the corpus, the word immediately preceding that lexicon term and the word immediately following it are added to a list of context words.
  • The list of context words is shown in row 1302, where each context word is listed along with the lexicon term(s) used to identify it.
  • In step S 610, a list of various closed-class words (such as determiners, prepositions, pronouns, and conjunctions) is used to reduce the number of words included in the list of context words.
  • Row 1304 of table 1300 shows the results of step S 610 in the present example embodiment, where words such as “to,” “from,” and “your” have been removed from the list.
  • Step S 612 provides weights for the context words, thereby creating weighted term context words.
  • In this embodiment, the weight of a given word is equal to the number of lexicon terms the word appeared with in the corpus divided by the total number of lexicon terms.
  • The resulting weighted context words for the present example embodiment are shown in row 1306 of table 1300.
  • Table 1400 (see FIG. 14) shows the results of steps S 702 and S 704 for the present example embodiment.
  • Processing begins with step S 702 (the results of which are shown in row 1402), where context words for candidate terms are extracted from the corpus with a pre-defined window. In this example, the window extends to one word before the term and one word after the term (as in step S 608). So, when a candidate term is found in the corpus, the word immediately preceding the candidate term and the word immediately following it are extracted and added to a list of context words.
  • Row 1402 shows the extracted context words for the present embodiment, along with their corresponding candidate terms. The number of times a context word appears with each candidate term is denoted in parentheses.
  • Processing proceeds to step S 704, where closed-class context words (such as determiners, prepositions, pronouns, and conjunctions) are removed from the list of context words in a manner similar to the removal of closed-class context words in step S 610.
  • The resulting list of context words is shown in row 1404 of table 1400 (see FIG. 14).
  • Table 1500 shows the results of steps S 706 and S 708 for the present example embodiment.
  • In step S 706, for each candidate term, a contextual similarity analysis is performed between the candidate term's context words (produced in step S 704, discussed above) and the weighted term context words (produced in step S 612, discussed above).
  • The resulting contextual similarity score is calculated by computing the sum, over each of a candidate term's context words, of the candidate term's frequency with that context word multiplied by the context word's weight. If a given context word is not listed in the list of weighted term context words, then the weight of that context word is zero.
  • In step S 802 of this embodiment, the top K candidate terms are selected from the ranked list produced in step S 708 (and shown in row 1504 of table 1500 (see FIG. 15)). In the present example, K equals six.
  • The resulting discovered terms (that is, the top six terms from the ranked list of candidate terms) are shown in row 1602 of table 1600 (see FIG. 16).
  • In step S 804, the discovered terms produced in step S 802 are removed from the list of candidate terms. The terms remaining in the list of candidate terms after this step are shown in row 1604 of table 1600.
  • Processing then returns to step S 410 of method 400 (see FIG. 4), where the newly discovered terms from step S 802 are added to the high precision domain lexicon.
  • The new, modified high precision domain lexicon for the present example embodiment is shown in row 1606 of table 1600.
  • Present invention: should not be taken as an absolute indication that the subject matter described by the term "present invention" is covered by either the claims as they are filed, or by the claims that may eventually issue after patent prosecution. While the term "present invention" is used to help the reader get a general feel for which disclosures herein are believed to potentially be new, this understanding is tentative and provisional, and subject to change over the course of patent prosecution as relevant information is developed and as the claims are potentially amended.
  • Embodiment: see the definition of "present invention" above; similar cautions apply to the term "embodiment."
  • Module/Sub-Module: any set of hardware, firmware, and/or software that operatively works to do some kind of function, without regard to whether the module is: (i) in a single local proximity; (ii) distributed over a wide area; (iii) in a single proximity within a larger piece of software code; (iv) located within a single piece of software code; (v) located in a single storage device, memory, or medium; (vi) mechanically connected; (vii) electrically connected; and/or (viii) connected in data communication.
  • Computer: any device with significant data processing and/or machine readable instruction reading capabilities including, but not limited to: desktop computers, mainframe computers, laptop computers, field-programmable gate array (FPGA) based devices, smart phones, personal digital assistants (PDAs), body-mounted or inserted computers, embedded device style computers, and application-specific integrated circuit (ASIC) based devices.
  • Contextual characteristic: a feature of a term derived from that term's particular usage in a corpus; some examples of possible contextual characteristics include: (i) proximity-related characteristics such as the words located within n words of the term, the words located farther than n words away from the term, and/or the distance between the term and a specific, pre-identified word; (ii) frequency-related characteristics such as the number of times the term appears in the corpus, the most/least number of times the term appears in a sentence, and/or the relative percentage of the term compared to the other terms in the corpus; and/or (iii) usage-related characteristics such as the location of the term in a sentence, the location of the term in a paragraph, whether the term commonly appears in the singular form or in the plural form, whether the term regularly appears as a noun/verb/adjective/adverb/subject/object, the adjectives used to describe the term (when a noun), the adverbs used to describe the term (when a verb), and so on.

Abstract

Software that extracts contextually relevant terms from a text sample (or corpus) by performing the following steps: (i) identifying a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus; (ii) adding the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and (iii) identifying a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).

Description

    BACKGROUND OF THE INVENTION
  • The present invention relates generally to the field of natural language processing, and more particularly to “term extraction.”
  • Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding (that is, enabling computers to derive meaning from human or natural language input).
  • Information Extraction (IE) is a known element of NLP. IE is the task of automatically extracting structured information from unstructured (and/or semi-structured) machine-readable documents. Term Extraction is a sub-task of IE. The goal of Term Extraction is to automatically extract relevant terms from a given text (or “corpus”). Term Extraction is used in many NLP tasks and applications, such as question answering, information retrieval, ontology engineering, semantic web, text summarization, document classification, and clustering. Generally, in term extraction, statistical and machine learning methods may be used to help select relevant terms.
  • Domain ontologies are known. A domain ontology represents concepts which belong to a particular “domain” such as an industry or a genre. In fact, multiple domain ontologies may exist within a single domain due to differences in language, intended use of the ontologies, and different perceptions of the domain. However, since domain ontologies represent concepts in very specific and often eclectic ways, they are often incompatible. In the context of NLP, term extraction becomes difficult when the text being processed belongs to a different domain (for example, medical technology) than the domain from which the NLP software was built (for example, financial news).
  • SUMMARY
  • According to an aspect of the present invention, there is a method, computer program product and/or system that performs the following steps (not necessarily in the following order): (i) identifying a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus; (ii) adding the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and (iii) identifying a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram view of a first embodiment of a system according to the present invention;
  • FIG. 2 is a flowchart showing a first embodiment method performed, at least in part, by the first embodiment system;
  • FIG. 3 is a block diagram view of a machine logic (for example, software) portion of the first embodiment system;
  • FIG. 4 is a flowchart view of a method according to the present invention;
  • FIG. 5 is a flowchart view of a method according to the present invention;
  • FIG. 6 is a flowchart view of a method according to the present invention;
  • FIG. 7 is a flowchart view of a method according to the present invention;
  • FIG. 8 is a flowchart view of a method according to the present invention;
  • FIG. 9 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 10 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 11 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 12 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 13 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 14 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 15 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention; and
  • FIG. 16 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • DETAILED DESCRIPTION
  • Some embodiments of the present invention extract contextually relevant terms from a text sample (or corpus) by iteratively discovering new terms using weighted “contextual characteristics” of terms discovered in previous iterations. Roughly speaking, a “contextual characteristic” is a feature of a term derived from that term's particular usage in a given corpus (for example, one contextual characteristic is a list of words that commonly precede or follow a given term in the corpus). This Detailed Description section is divided into the following sub-sections: (i) The Hardware and Software Environment; (ii) Example Embodiment; (iii) Further Comments and/or Embodiments; and (iv) Definitions.
  • I. The Hardware and Software Environment
  • The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
  • An embodiment of a possible hardware and software environment for software and/or methods according to the present invention will now be described in detail with reference to the Figures. FIG. 1 is a functional block diagram illustrating various portions of networked computers system 100, including: sub-system 102; client sub-systems 104, 106, 108, 110, 112; communication network 114; computer 200; communication unit 202; processor set 204; input/output (I/O) interface set 206; memory device 208; persistent storage device 210; display device 212; external device set 214; random access memory (RAM) devices 230; cache memory device 232; and program 300.
  • Sub-system 102 is, in many respects, representative of the various computer sub-system(s) in the present invention. Accordingly, several portions of sub-system 102 will now be discussed in the following paragraphs.
  • Sub-system 102 may be a laptop computer, tablet computer, netbook computer, personal computer (PC), a desktop computer, a personal digital assistant (PDA), a smart phone, or any programmable electronic device capable of communicating with the client sub-systems via network 114. Program 300 is a collection of machine readable instructions and/or data that is used to create, manage and control certain software functions that will be discussed in detail, below, in the Example Embodiment sub-section of this Detailed Description section.
  • Sub-system 102 is capable of communicating with other computer sub-systems via network 114. Network 114 can be, for example, a local area network (LAN), a wide area network (WAN) such as the Internet, or a combination of the two, and can include wired, wireless, or fiber optic connections. In general, network 114 can be any combination of connections and protocols that will support communications between server and client sub-systems.
  • Sub-system 102 is shown as a block diagram with many double arrows. These double arrows (no separate reference numerals) represent a communications fabric, which provides communications between various components of sub-system 102. This communications fabric can be implemented with any architecture designed for passing data and/or control information between processors (such as microprocessors, communications and network processors, etc.), system memory, peripheral devices, and any other hardware components within a system. For example, the communications fabric can be implemented, at least in part, with one or more buses.
  • Memory 208 and persistent storage 210 are computer-readable storage media. In general, memory 208 can include any suitable volatile or non-volatile computer-readable storage media. It is further noted that, now and/or in the near future: (i) external device(s) 214 may be able to supply, some or all, memory for sub-system 102; and/or (ii) devices external to sub-system 102 may be able to provide memory for sub-system 102.
  • Program 300 is stored in persistent storage 210 for access and/or execution by one or more of the respective computer processors 204, usually through one or more memories of memory 208. Persistent storage 210: (i) is at least more persistent than a signal in transit; (ii) stores the program (including its soft logic and/or data), on a tangible medium (such as magnetic or optical domains); and (iii) is substantially less persistent than permanent storage. Alternatively, data storage may be more persistent and/or permanent than the type of storage provided by persistent storage 210.
  • Program 300 may include both machine readable and performable instructions and/or substantive data (that is, the type of data stored in a database). In this particular embodiment, persistent storage 210 includes a magnetic hard disk drive. To name some possible variations, persistent storage 210 may include a solid state hard drive, a semiconductor storage device, read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, or any other computer-readable storage media that is capable of storing program instructions or digital information.
  • The media used by persistent storage 210 may also be removable. For example, a removable hard drive may be used for persistent storage 210. Other examples include optical and magnetic disks, thumb drives, and smart cards that are inserted into a drive for transfer onto another computer-readable storage medium that is also part of persistent storage 210.
  • Communications unit 202, in these examples, provides for communications with other data processing systems or devices external to sub-system 102. In these examples, communications unit 202 includes one or more network interface cards. Communications unit 202 may provide communications through the use of either or both physical and wireless communications links. Any software modules discussed herein may be downloaded to a persistent storage device (such as persistent storage device 210) through a communications unit (such as communications unit 202).
  • I/O interface set 206 allows for input and output of data with other devices that may be connected locally in data communication with server computer 200. For example, I/O interface set 206 provides a connection to external device set 214. External device set 214 will typically include devices such as a keyboard, keypad, a touch screen, and/or some other suitable input device. External device set 214 can also include portable computer-readable storage media such as, for example, thumb drives, portable optical or magnetic disks, and memory cards. Software and data used to practice embodiments of the present invention, for example, program 300, can be stored on such portable computer-readable storage media. In these embodiments the relevant software may (or may not) be loaded, in whole or in part, onto persistent storage device 210 via I/O interface set 206. I/O interface set 206 also connects in data communication with display device 212.
  • Display device 212 provides a mechanism to display data to a user and may be, for example, a computer monitor or a smart phone display screen.
  • The programs described herein are identified based upon the application for which they are implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature herein is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature.
  • The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The terminology used herein was chosen to best explain the principles of the embodiment, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
  • II. Example Embodiment
  • FIG. 2 shows flowchart 250 depicting a method according to the present invention. FIG. 3 shows program 300 for performing at least some of the method steps of flowchart 250. This method and associated software will now be discussed, over the course of the following paragraphs, with extensive reference to FIG. 2 (for the method step blocks) and FIG. 3 (for the software blocks).
  • The present embodiment refers extensively to a high precision domain lexicon (HPDL). The HPDL (also referred to as a “set of category related terms”) is a collection of terms (words or sets of words) that belong to a specific domain, category, or genre (“domain”). In term extraction, and more generally in natural language processing, the HPDL can serve as an underlying “knowledge base” for a given domain so as to extract more contextually relevant terms from a piece of text (or corpus). In many embodiments of the present invention, the HPDL is used to: (i) extract contextually relevant terms (term extraction); and (ii) extract additional HPDL-eligible terms in order to grow, strengthen, and/or expand the HPDL.
  • HPDL domains may have multiple categories (or sub-domains). For example, the domain of smartphones may include categories such as smartphone models, smartphone apps, and/or smartphone modes. It is contemplated that the present invention may apply to HPDLs with singular domains, multiple domains, and/or multiple domain categories (or sub-domains).
  • In some embodiments of the present invention, method 250 may begin with an existing, predefined HPDL, while in other embodiments the HPDL may be initially extracted from the corpus using, for example, term extraction methods adapted to achieve high levels of precision. Some known methods for extracting an initial HPDL from the corpus are discussed below in the Further Comments and/or Embodiments Sub-Section of this Detailed Description. In the present example embodiment, the HPDL has a domain of “things that jump” and initially includes the following terms: (i) fox; and (ii) rabbit (in other embodiments, an HPDL including the terms “fox” and “rabbit” might also have a domain of “animals” and a sub-domain of “mammals”).
  • The present embodiment also refers extensively to a corpus. The corpus is a text sample that method 250 extracts relevant terms from. In other words, the corpus is the text that is being acted upon (interpreted, processed, classified, etc.) during term extraction. In the present example embodiment, the corpus includes the following text: “A quick brown fox jumps over the lazy dog, but a quicker, more nimble kangaroo jumps over the fox. The following day, while the kangaroo leaps over the still-lazy dog, a determined frog leaps over a surprisingly speedy sloth.”
  • Processing begins at step S255, where extract candidate terms module (“mod”) 302 extracts candidate terms (also referred to as “relevant terms”) from the corpus. In many embodiments, various statistical methods are used to extract relevant candidate terms. A number of these known methods are discussed below in the Further Comments and/or Embodiments Sub-Section of this Detailed Description. However, these are not meant to be all-inclusive or limiting, as other, less traditional extraction methods may also be used. In other embodiments of the present invention, dictionaries or domain lexicons different from and/or unrelated to the HPDL may be used in this step. For example, in the present example embodiment, candidate terms are extracted from the corpus if they are identified as “animals”. As such, the following terms are extracted from the corpus: (i) fox; (ii) dog; (iii) kangaroo; (iv) frog; and (v) sloth. Furthermore, terms that are already in the HPDL are excluded from the candidate terms list. Therefore, “fox” is not included in the candidate terms list, and the resulting list is as follows: (i) dog; (ii) kangaroo; (iii) frog; and (iv) sloth.
  • Processing proceeds to step S260, where discover new generation mod 304 discovers a new generation of HPDL terms from the candidate terms using the HPDL and its contextual characteristics. This step begins by identifying contextual characteristics (or “initial contextual characteristics”) of the terms in the HPDL. A contextual characteristic is a feature of a term derived from that term's particular usage in a given corpus (for a more complete definition of “contextual characteristic,” see the Definitions Sub-Section of this Detailed Description). In the present example embodiment, the contextual characteristic for each term in the HPDL is the word immediately following that term in the corpus (when a term is the last word in a sentence, it does not have a contextual characteristic). So, in the present embodiment, the only contextual characteristic for the term “fox” (the first HPDL term) is “jumps”, because the only word immediately following “fox” in the corpus is “jumps”. For the second HPDL term, “rabbit”, there are no contextual characteristics, because “rabbit” does not appear in the corpus. As such, the only contextual characteristic of the HPDL is the word “jumps”. It should be noted that although the present embodiment includes a simple example with one contextual characteristic, in many embodiments the HPDL has a plurality of contextual characteristics.
  • Once contextual characteristics for the HPDL have been identified, those characteristics are then applied to the candidate terms. In the present example, the only candidate term to immediately precede the word “jumps” is “kangaroo”. As such, “kangaroo” (the “first term”) is the only term included in the current generation of discovered terms. In other embodiments of the present invention, however, the current generation may include a plurality of discovered terms. In those embodiments, additional steps may be taken to further refine the list of discovered terms (for some examples, see the Further Comments and/or Embodiments Sub-Section of this Detailed Description).
  • Processing proceeds to step S265, where update terms mod 306 adds the current generation of terms to the HPDL. In the present embodiment, the term “kangaroo” is added to the HPDL, with the resulting HPDL (or “revised set of category related terms”) being as follows: (i) fox; (ii) rabbit; and (iii) kangaroo.
  • Processing proceeds to step S270, where update terms mod 306 deletes the current generation of terms from the candidate terms list. In the present embodiment, the term “kangaroo” is removed from the candidate terms list, with the resulting candidate terms list being as follows: (i) dog; (ii) frog; and (iii) sloth.
  • Processing proceeds to step S275, where iterate mod 308 checks to see if method 250 is on its last iteration. In the present embodiment, a total of two iterations are to be performed. As such, method 250 is not on its last iteration (NO), and processing returns to step S260 for another iteration. In other embodiments, however, other tests may be used. For example, in one embodiment, iterations may occur until the HPDL reaches a certain size. In another embodiment, iterations may continue to occur for a certain period of time. In still other embodiments, iterations may continue to occur indefinitely and/or until no further terms for the HPDL are discovered.
  • In the present example, upon returning to step S260, discover new generation mod 304 repeats the process of identifying contextual characteristics of the terms in the HPDL. However, this time, there is an additional term (“kangaroo”) in the HPDL. As a result, an additional contextual characteristic (or “first term contextual characteristic”) is identified: the word “leaps”, which immediately follows the word “kangaroo” in the second sentence of the corpus. As such, when the updated contextual characteristics are applied to the candidate terms, an additional match is found: the term “frog” appears immediately before the word “leaps” in the corpus. As a result, “frog” (the “second term”) is added to the current generation of discovered terms.
  • Processing proceeds to step S265, where update terms mod 306 adds “frog” to the HPDL, resulting in the following HPDL: (i) fox; (ii) rabbit; (iii) kangaroo; and (iv) frog. Processing then proceeds to step S270, where update terms mod 306 removes “frog” from the candidate terms list, with the resulting candidate terms list being as follows: (i) dog; and (ii) sloth.
  • Processing proceeds to step S275. In the present example, two iterations have now completed, which means that method 250 is on its final iteration. Therefore, step S275 resolves to “YES”, and processing proceeds to step S280, where method 250 ends. As a result of executing method 250, the HPDL for the domain of “things that jump” now includes two additional terms (“kangaroo” and “frog”), and will be able to further extract contextually relevant terms in future iterations and/or from different corpora. Additionally, sub-system 102 now also has a list of candidate terms (“dog” and “sloth”) that may be helpful for other natural language processing (NLP) related tasks.
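  • The two-iteration walkthrough above (steps S255 through S280) can be sketched as follows. This is a minimal illustration of the example embodiment only, not the claimed implementation: it assumes simple regular-expression sentence splitting and tokenization, lowercasing, and uses only the single contextual characteristic from the example (the word immediately following a term, with sentence-final occurrences contributing no characteristic).

```python
import re

def sentences_of(corpus):
    """Split the corpus on sentence-ending punctuation and tokenize
    each sentence into lowercase words (hyphenated words kept whole)."""
    return [re.findall(r"[a-z-]+", s.lower())
            for s in re.split(r"[.!?]", corpus) if s.strip()]

def grow_lexicon(corpus, hpdl, candidates, iterations=2):
    """Iteratively move candidate terms into the lexicon (HPDL).
    Each iteration: (1) collect the contextual characteristics of the
    current HPDL terms (word immediately following each occurrence);
    (2) promote every candidate that shares one of those contexts."""
    sents = sentences_of(corpus)
    hpdl, candidates = set(hpdl), set(candidates)
    for _ in range(iterations):
        # contextual characteristics of the current lexicon
        contexts = {s[i + 1] for s in sents
                    for i in range(len(s) - 1) if s[i] in hpdl}
        # candidates whose following word matches a lexicon context
        found = {s[i] for s in sents for i in range(len(s) - 1)
                 if s[i] in candidates and s[i + 1] in contexts}
        hpdl |= found
        candidates -= found
    return hpdl, candidates

corpus = ("A quick brown fox jumps over the lazy dog, but a quicker, more "
          "nimble kangaroo jumps over the fox. The following day, while the "
          "kangaroo leaps over the still-lazy dog, a determined frog leaps "
          "over a surprisingly speedy sloth.")
hpdl, leftovers = grow_lexicon(corpus, {"fox", "rabbit"},
                               {"dog", "kangaroo", "frog", "sloth"})
print(sorted(hpdl))      # ['fox', 'frog', 'kangaroo', 'rabbit']
print(sorted(leftovers)) # ['dog', 'sloth']
```

Iteration one discovers “kangaroo” (it precedes “jumps”, the context of “fox”); iteration two discovers “frog” (it precedes “leaps”, a context contributed by the newly added “kangaroo”), matching the walkthrough.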
  • III. Further Comments and/or Embodiments
  • Some embodiments of the present invention recognize the following facts, potential problems and/or potential areas for improvement with respect to the current state of the art: (i) some existing approaches (including approaches that rely on linguistic processors to extract candidate terms) do not perform well when the corpus (or text) has a different genre (or domain) than the corpus used to build the processor; (ii) some existing approaches rely purely on statistical methods (such as n-gram sequences or topic modeling) to extract candidate terms, thereby negatively affecting system precision; (iii) existing approaches can be configured to provide terms with either high precision or high recall, but not both (thereby negatively affecting the overall accuracy of the system); and/or (iv) existing approaches are unable to discover new domain-specific terms directly from the corpus in a bootstrapping manner.
  • Some embodiments of the present invention may include one, or more, of the following features, characteristics and/or advantages: (i) using contextual similarity with a high precision domain lexicon for ranking; (ii) extracting candidate terms statistically; (iii) using an approach other than singular value decomposition; (iv) extracting terms without using linguistic processors; (v) extracting terms without analyzing linguistic or structural characteristics of a document; (vi) extracting terms without using syntactic and semantic contextual analysis; (vii) extracting terms without using dictionary-based statistics; (viii) extracting terms without using specialized corpora; (ix) extracting terms based on contextual information of a lexicon obtained from a given corpus; and/or (x) using association rules to measure unithood and/or filter candidate terms.
  • Many embodiments of the present invention are adapted to identify terms (nouns or noun phrases) from a corpus with both high precision and recall without using any linguistic processors and open domain linguistic resources (such as dictionaries and ontology). In doing so, these embodiments may include one, or more, of the following features, characteristics and/or advantages: (i) providing an iterative approach to term discovery where discovery depends on weighted contextual characteristics of already discovered terms in previous iterations; (ii) ranking pure statistically extracted candidate terms (N-grams) based on their noun specificity and term specificity determined using weighted contextual similarity with known terms (nouns or noun phrases); (iii) using association rules, filtering candidate terms that cannot exist independently; and/or (iv) validating unithood of candidate terms using association rules.
  • Some embodiments of the present invention may further include one, or more, of the following helpful features, characteristics, and/or advantages: (i) achieving positive results in entity set expansion tasks, where the goal is to identify entities from the corpus in a bootstrapping manner; (ii) performing well on diverse domains such as medical and/or news; (iii) performing better term extraction for any language, including languages for which linguistic processors are not built or do not perform well for; and/or (iv) keeping resources such as dictionaries, lexicons, ontologies, and/or entity lists up-to-date.
  • Method 400 according to the present invention is provided in FIG. 4. Method 400 is adapted to extract terms and their variants from a corpus with both high precision and high recall. High precision and recall are maintained even if the corpus has a domain that is different from the system's source domain, or if the corpus is in a language for which linguistic systems are not available or mature. Processing begins with step S402, where method 400 uses statistical corpus analysis to extract candidate terms. This step S402 uses known (or to be known in the future) statistical approaches (such as frequent item set mining, language modeling, and topic modeling) to extract potential candidate terms from the corpus. Additionally, step S402 filters out irrelevant potential candidate terms using statistical criteria.
  • Processing proceeds to step S404, where method 400 creates a high precision domain lexicon and analyzes contextual information therein. The high precision domain lexicon terms (or “lexicon terms”) are either manually extracted from the corpus or automatically extracted using any system (known or to be known in the future) configured to focus on high precision. Once the lexicon terms have been extracted, context words of those lexicon terms (such as the words appearing before and after the lexicon terms in the corpus) are extracted and weighted to create a set of weighted term context words.
  • Processing proceeds to step S406, where the method 400 ranks the candidate terms (see step S402) based on contextual similarity with the weighted term context words.
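  • The ranking of step S406 might be sketched as follows. The text does not fix a particular weighting or similarity measure, so both choices here are illustrative assumptions: lexicon context words are weighted by raw co-occurrence frequency within a one-word window, and each candidate is scored by cosine similarity against the pooled lexicon context vector.

```python
from collections import Counter
import math

def context_vector(term, sentences, window=1):
    """Bag of words appearing within `window` positions of each
    occurrence of `term` (the term's context words)."""
    vec = Counter()
    for s in sentences:
        for i, w in enumerate(s):
            if w == term:
                vec.update(s[max(0, i - window):i] + s[i + 1:i + 1 + window])
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda c: math.sqrt(sum(x * x for x in c.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def rank_candidates(candidates, lexicon, sentences):
    """Rank candidates by cosine similarity between each candidate's
    context vector and the pooled, frequency-weighted context vector
    of all lexicon terms."""
    lex_vec = Counter()
    for t in lexicon:
        lex_vec.update(context_vector(t, sentences))
    scored = [(c, cosine(context_vector(c, sentences), lex_vec))
              for c in candidates]
    return sorted(scored, key=lambda p: p[1], reverse=True)

sents = [["a", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
         ["the", "nimble", "kangaroo", "jumps", "over", "the", "fox"]]
print(rank_candidates(["kangaroo", "dog"], ["fox"], sents))
# "kangaroo" ranks above "dog" (it shares the context word "jumps" with "fox")
```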
  • Processing proceeds to step S408, where method 400 selects the top candidate terms as discovered terms. The number of top candidate terms to be selected is a preconfigured value that depends on application and business context.
  • Processing proceeds to step S410, where the newly discovered terms (that is, the top candidate terms) are added to the high precision domain lexicon. Once the newly discovered terms have been added, they are deleted from the candidate terms list.
  • Processing proceeds to step S412, where the method 400 compares an iteration count with a pre-defined iteration threshold (where the iteration threshold is determined based on application and business context). Processing then proceeds to step S414. If the iteration count is less than the iteration threshold (NO), processing returns to step S404 to discover more relevant terms from the corpus. If the iteration count is greater than or equal to the iteration threshold, however, processing for method 400 completes. As a result of method 400 completing: (i) the high precision domain lexicon now includes additional relevant domain terms; and (ii) the list of candidate terms includes additional contextually relevant terms that may be used for natural language processing or other tasks.
  • In some embodiments of the present invention, step S402 (Extract Candidate Terms using Statistical Corpus Analysis, see FIG. 4) further includes method 500 (see FIG. 5). Method 500 is adapted to apply statistic approaches to the corpus to extract potential candidate terms. The potential candidate terms are passed through statistical filters to identify relevant terms, resulting in new candidate terms. Processing begins with step S502, where method 500 extracts text from the corpus and then applies heuristics-based sentence splitters on the text to extract sentences.
  • Processing proceeds to step S504, where various statistical methods are applied to the extracted sentences for candidate term extraction. The method to be used is typically determined based on a few factors: (i) the type of document the corpus is (for example, a web page, a text book, or a manual); (ii) the length of the corpus (for example, the number of words in the corpus); and/or (iii) the general domain of the corpus (for example, healthcare, finance, or telecommunication). Although many methods may be used, two are discussed below: (i) a statistical language modeling method (beginning with step S506); and (ii) an associated rule mining method (beginning with step S514).
  • If the statistical language modeling method is chosen, processing proceeds to step S506, where the method 500 extracts n-grams from each extracted sentence in the corpus for each preconfigured value of n. N may be determined in a number of ways, including, for example, by conducting experiments or by prioritizing certain features (such as speed vs. accuracy). An n-gram is a contiguous sequence of words from a given extracted sentence, where ‘n’ represents the number of words in the sequence. All unique n-grams of the corpus are considered as potential candidate terms.
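  • Step S506 can be sketched as follows, assuming tokenization has already been performed by step S502 and representing each n-gram as a tuple of words. This is an illustrative sketch, not the claimed implementation.

```python
def extract_ngrams(sentences, max_n=3):
    """Collect all unique n-grams (word tuples) for n = 1..max_n from
    already-tokenized sentences; each unique n-gram becomes a
    potential candidate term."""
    ngrams = set()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngrams.add(tuple(tokens[i:i + n]))
    return ngrams

print(sorted(extract_ngrams([["quick", "brown", "fox"]], max_n=2)))
# [('brown',), ('brown', 'fox'), ('fox',), ('quick',), ('quick', 'brown')]
```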
  • Processing proceeds to steps S508 and S510, where method 500 scores the potential candidate terms based on their termhood and unithood, respectively. Termhood (as used in step S508) scores the validity of the potential candidate term as a representative for the corpus content as a whole using one or more statistical measures now known (or to be known in the future). In one embodiment, a measure of frequency in a corpus is used. In another embodiment, a measure of ‘weirdness’ (the term's frequency in the corpus compared to its frequency in a reference corpus) is used. In yet another embodiment, a measure of the pertinence or specificity of the term to a particular domain is used.
  • Unithood (as used in step S510) scores the collocation strength (that is, how strongly the constituent words of a potential candidate term attach to one another) using one or more statistical measures now known (or to be known in the future). In one embodiment, a mutual information test is used. In another embodiment, a t-test is used.
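Steps S508 and S510 can be illustrated with two commonly used measures: "weirdness" for termhood and pointwise mutual information for unithood. This is a minimal sketch with invented counts; the patent permits any suitable statistical measure, and the add-one smoothing of the reference frequency is an assumption, not part of the disclosure:

```python
import math

def weirdness(term_freq, corpus_size, ref_freq, ref_size):
    """Termhood via 'weirdness': relative frequency in the domain corpus
    divided by relative frequency in a reference corpus (add-one smoothed)."""
    return (term_freq / corpus_size) / ((ref_freq + 1) / ref_size)

def pmi(bigram_freq, w1_freq, w2_freq, corpus_size):
    """Unithood via pointwise mutual information between a bigram's parts."""
    p_xy = bigram_freq / corpus_size
    p_x = w1_freq / corpus_size
    p_y = w2_freq / corpus_size
    return math.log2(p_xy / (p_x * p_y))

# "mobile phone": 50 occurrences in a 10,000-word domain corpus,
# but only 2 in a 1,000,000-word reference corpus
w = weirdness(50, 10_000, 2, 1_000_000)   # 0.005 / 0.000003, roughly 1666.7
score = pmi(50, 80, 60, 10_000)           # log2(0.005 / (0.008 * 0.006))
```

A high weirdness score suggests the term is specific to the domain corpus; a high PMI suggests the words co-occur far more often than chance, supporting the term's unithood.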
  • Once steps S508 and S510 have completed, processing proceeds to step S512. In step S512, potential candidate terms with termhood and unithood scores above pre-defined thresholds are selected and identified as candidate terms, and processing for method 500 completes.
  • If the association rule mining method is chosen in step S504 (as opposed to the statistical language modeling method discussed above), processing proceeds to step S514. In this step, method 500 uses an algorithm to extract frequent n-grams. The extracted frequent n-grams are identified as potential candidate terms.
  • An example of a way to extract frequent n-grams (sets of words occurring in a specific order) is to extract n-grams meeting the following criteria: (i) the n-grams are frequent; and (ii) the order-preserving subsets of the n-grams are also frequent. In other words, in this embodiment, method 500 first extracts unigrams (i.e. single words) that are frequent (that is, they have a frequency above a pre-defined threshold). Then, method 500 extracts frequent bigrams (i.e. two-word phrases) that are made up of the previously identified unigrams. This continues for n steps (where n equals the number of words in the phrase being analyzed). Another way to express this example extraction method is to say that for n>1, the system extracts n-grams that are frequent and also include frequent (n−1)-grams. For n=1, the system extracts unigrams that are frequent.
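The level-wise extraction described above can be sketched as follows (an illustrative Python sketch; the per-n thresholds and the toy sentences are assumptions chosen so that the pruning is visible):

```python
from collections import Counter

def frequent_ngrams(sentences, thresholds):
    """Level-wise extraction: an n-gram (n > 1) is kept only if it is
    frequent AND both of its contiguous (n-1)-grams are already frequent."""
    tokenized = [s.split() for s in sentences]
    frequent = {}
    for n, threshold in enumerate(thresholds, start=1):
        counts = Counter(
            tuple(words[i:i + n])
            for words in tokenized
            for i in range(len(words) - n + 1)
        )
        level = {}
        for gram, freq in counts.items():
            if freq < threshold:
                continue
            # prune: every (n-1)-gram inside must already be frequent
            if n > 1 and (gram[:-1] not in frequent or gram[1:] not in frequent):
                continue
            level[gram] = freq
        frequent.update(level)
    return frequent

res = frequent_ngrams(
    ["mobile phone a is great", "mobile phone a sale", "new phone sale"],
    thresholds=(2, 2, 2),
)
# ("mobile", "phone", "a") survives all three levels with frequency 2
```

The thresholds per level mirror the decreasing frequency thresholds used in the worked example of table 900 (30 for bigrams, 20 for trigrams, 10 for 4-grams).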
  • Processing proceeds to step S516, where the method 500 analyzes each potential candidate term and generates association rules along with corresponding confidence values. To generate association rules, step S516 performs three tasks: (i) for every potential candidate term t, all non-empty ordered subsets s are generated; (ii) for every subset s of t, a forward rule, “s->t−s”, along with its confidence (measured as the frequency of t divided by the frequency of s), is generated; and (iii) for every subset s of t, an inverse rule, “s<-t−s”, along with its confidence, is generated. To provide an example, in one embodiment of the invention, term t is “Mobile Phone A”. Applying task (i), the subsets for “Mobile Phone A” are: (a) “Mobile”; (b) “Phone”; (c) “A”; (d) “Mobile Phone”; and (e) “Phone A”. Applying task (ii), a forward rule for “Mobile Phone A” is “Mobile Phone->A”. And applying task (iii), an inverse rule is “Mobile<-Phone A”.
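Tasks (ii) and (iii) can be sketched as follows (an illustrative sketch, not the patent's implementation; for brevity it enumerates only the contiguous prefix/suffix splits of a term, which is what the “Mobile Phone->A” and “Mobile<-Phone A” examples use, and the frequency table is invented):

```python
def association_rules(term, freqs):
    """For each split of `term` (a word tuple) into prefix and suffix,
    emit a forward rule (prefix -> suffix, conf = f(term)/f(prefix))
    and an inverse rule (prefix <- suffix, conf = f(term)/f(suffix))."""
    fwd, inv = {}, {}
    for k in range(1, len(term)):
        prefix, suffix = term[:k], term[k:]
        fwd[(prefix, suffix)] = freqs[term] / freqs[prefix]
        inv[(prefix, suffix)] = freqs[term] / freqs[suffix]
    return fwd, inv

freqs = {
    ("mobile",): 100, ("phone",): 120, ("a",): 90,
    ("mobile", "phone"): 80, ("phone", "a"): 40,
    ("mobile", "phone", "a"): 40,
}
fwd, inv = association_rules(("mobile", "phone", "a"), freqs)
# forward "mobile phone -> a": 40 / 80 = 0.5
# inverse "mobile <- phone a": 40 / 40 = 1.0
```

An inverse confidence of 1.0 here means “phone a” never appears without “mobile” in front of it, which is exactly the signal the filtering step exploits.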
  • Processing proceeds to step S518, where n-grams are filtered using the inverse rules created in step S516. In this step, term variations are identified and removed based on their confidence score. The method 500 identifies a term variation if an inverse rule from term variation to term has a confidence score above a predefined threshold (determined experimentally, for example). For example, “Mobile Phone A” has one inverse rule with a confidence score above the predefined threshold: “Mobile<-Phone A”. Because the confidence score is over the threshold, the method 500 identifies that “Phone A” is a variation of “Mobile Phone A” and removes “Phone A” from the list of potential candidate terms.
  • Processing proceeds to step S520, where n-grams are filtered using forward rules (which serve as a measure of unithood for potential candidate terms). The confidence of a forward rule provides the probability of the order of term constituents. If none of the forward rules for a term have a confidence score above a pre-defined threshold, that term is removed from the list of potential candidate terms. For example, the term “Manufacturer launches new” has two forward rules: “Manufacturer launches->new” and “Manufacturer->launches new”. Because neither forward rule has a confidence level above the threshold, the term is removed from the list of potential candidate terms.
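The two filtering passes of steps S518 and S520 can be combined into one sketch (illustrative only; rules are represented as (prefix, suffix) word-tuple pairs mapped to confidence values, and all names and numbers are invented to mirror the worked example):

```python
def filter_candidates(candidates, fwd_rules, inv_rules, threshold=0.8):
    """S518: drop suffix variations implied by confident inverse rules.
    S520: keep multi-word terms only if some forward rule is confident."""
    kept = set(candidates)
    # S518: a confident inverse rule prefix <- suffix means the suffix
    # rarely stands alone, so remove it as a variation of the full term
    for (prefix, suffix), conf in inv_rules.items():
        if conf > threshold:
            kept.discard(suffix)
    # S520: a multi-word term needs at least one confident forward rule
    for term in list(kept):
        if len(term) > 1:
            confs = [c for (p, s), c in fwd_rules.items() if p + s == term]
            if not any(c > threshold for c in confs):
                kept.discard(term)
    return kept

candidates = {("phone", "d"), ("manufacturera", "phone", "d"),
              ("connect", "manufacturera")}
inv_rules = {(("manufacturera",), ("phone", "d")): 1.00}
fwd_rules = {(("manufacturera", "phone"), ("d",)): 0.90,
             (("connect",), ("manufacturera",)): 0.10}
remaining = filter_candidates(candidates, fwd_rules, inv_rules)
# only ("manufacturera", "phone", "d") survives both passes
```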
  • Upon completing step S520, processing for method 500 completes, resulting in a new list of candidate terms from the remaining potential candidate terms.
  • In some embodiments of the present invention, step S404 (Analyzing Contextual Information of High Precision Domain Lexicon, see FIG. 4) further includes method 600 shown in FIG. 6. Processing begins with step S602, where a high precision domain lexicon is created (either manually or automatically using methods configured to focus on high precision) or provided from previous iterations of method 600. Term variations from the lexicon are then filtered and/or replaced using previously generated inverse rules, if available (for example, from step S518 (see FIG. 5)). A term from the lexicon is identified as a term variation if an inverse rule from the term to some other longer term has a confidence score above a pre-defined threshold. If the longer term is a part of the lexicon, then the term identified as a term variation is removed. For example, if “Mobile Phone A” and “Phone A” are in the lexicon and inverse rule “Mobile<-Phone A” has a confidence level above the threshold, then “Phone A” is removed from the lexicon. If the longer term is not part of the lexicon, then the lexicon term is replaced with the longer term. For example, if “Phone” is in the lexicon and the inverse rule “Mobile<-Phone” has a confidence level above the threshold, then “Phone” is replaced with “Mobile Phone”.
  • Processing proceeds to step S604, where method 600 scores lexicon terms and ranks them based on their scores. Scoring may be performed by a variety of methods now known (or to be known in the future), and may be based on properties such as term frequency observed in a given corpus. Processing proceeds to step S606, where the top X terms are selected, where X is pre-defined (and determined experimentally, for example).
  • Processing proceeds to step S608, where context words are extracted from the corpus. First, occurrences of lexicon terms within a given corpus are identified. Then, for each occurrence, context words are extracted per a pre-defined window size (for example, the two words before and the two words after a lexicon term). The words within the window are identified as context words and added to a list of context words. Processing proceeds to step S610, where closed class context words (such as determiners, prepositions, pronouns, and/or conjunctions) are removed from the list of context words.
  • Processing proceeds to step S612, where each context word is weighted. In some embodiments, the weight of a context word equals the number of unique terms the context word appears in divided by the total number of terms. Processing for method 600 concludes with a list of weighted term context words.
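Steps S608 through S612 can be sketched end to end (illustrative Python only; the closed-class word list, the one-word window, and the toy token stream are assumptions made for the example):

```python
CLOSED_CLASS = {"a", "an", "the", "to", "from", "your", "and", "or", "it"}

def context_words(corpus_tokens, terms, window=1):
    """Map each context word to the set of lexicon terms it appears near."""
    seen = {}
    for term in terms:
        t = term.split()
        for i in range(len(corpus_tokens) - len(t) + 1):
            if corpus_tokens[i:i + len(t)] == t:
                before = corpus_tokens[max(0, i - window):i]
                after = corpus_tokens[i + len(t):i + len(t) + window]
                for word in before + after:
                    if word not in CLOSED_CLASS:  # step S610 filtering
                        seen.setdefault(word, set()).add(term)
    return seen

def weights(seen, total_terms):
    """Step S612: weight = number of unique terms a context word appears
    with, divided by the total number of lexicon terms."""
    return {word: len(ts) / total_terms for word, ts in seen.items()}

tokens = "buy phoneb today or buy phonec online".split()
seen = context_words(tokens, ["phoneb", "phonec"])
w = weights(seen, total_terms=2)
# "buy" precedes both terms, so its weight is 2/2 = 1.0
```

A weight near 1.0 marks a context word that accompanies most lexicon terms, making it a strong contextual signature for the domain.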
  • In some embodiments of the present invention, step S406 (Contextual Similarity based ranking of Candidate Terms, see FIG. 4) further includes method 700 shown in FIG. 7. Processing begins with step S702, where context words for candidate terms are extracted. First, occurrences of each candidate term (see discussion of method 500, above) in the corpus are identified. Then, from each occurrence, context words are extracted per a pre-defined window size (for example, the two words before and the two words after each candidate term).
  • Processing proceeds to step S704, where closed class context words (such as determiners, prepositions, pronouns, and/or conjunctions) are removed from the list of context words. The remaining context words (“candidate term context words”) are selected, stored (along with their frequency), and mapped to their corresponding candidate terms.
  • Processing proceeds to step S706, where the contextual similarity between candidate term context words (see step S704, above) and weighted term context words (see discussion of method 600, above) is measured by a contextual similarity score. The contextual similarity score may be obtained by a number of methods now known or to be known in the future. In one example embodiment, the contextual similarity score is represented by the equation “Σi Wi*Fi”, where: ‘i’ ranges over the distinct context words of a candidate term; ‘Wi’ equals the weight of context word i in the set of weighted term context words (‘Wi’ equals zero when the context word is not in that set); and ‘Fi’ equals the frequency of context word i with respect to the candidate term in a given corpus.
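The score Σi Wi*Fi translates directly into code (a minimal sketch; the example weights and frequencies are invented):

```python
def contextual_similarity(candidate_context_freqs, term_context_weights):
    """Score a candidate term: sum, over its context words, of the
    context word's frequency with the candidate times that word's weight
    among the weighted term context words (0 if the word is absent)."""
    return sum(
        freq * term_context_weights.get(word, 0.0)
        for word, freq in candidate_context_freqs.items()
    )

term_weights = {"buy": 1.0, "launch": 0.5}
# the candidate appeared 3 times next to "buy" and twice next to "cheap"
score = contextual_similarity({"buy": 3, "cheap": 2}, term_weights)
# 3 * 1.0 + 2 * 0.0 = 3.0
```

Candidates whose context words overlap heavily with the lexicon's weighted context words score high and rise in the ranking of step S708.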
  • Processing proceeds to step S708, where candidate terms are ranked based on the contextual similarity score obtained in step S706 (and discussed in the preceding paragraph). The result of this step is a list of ranked candidate terms.
  • In some embodiments of the present invention, step S408 (Discover New Terms, see FIG. 4) further includes method 800 shown in FIG. 8. Processing begins with step S802, where method 800 selects the top K candidate terms from the ranked list and creates a set of top K candidate terms, where the value of K is pre-configured. Processing then proceeds to step S804, where method 800 removes any candidate terms from the set if they are also part of the domain lexicon. The remaining terms from the set of top K candidate terms are identified as, simply, “terms” and processing for method 800 completes.
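Steps S802 and S804 amount to a top-K selection followed by a lexicon filter (a minimal sketch; the candidate and lexicon terms are invented):

```python
def discover_terms(ranked_candidates, lexicon, k):
    """Steps S802/S804: take the top K ranked candidates, then drop any
    that are already in the domain lexicon."""
    top_k = ranked_candidates[:k]
    known = set(lexicon)
    return [term for term in top_k if term not in known]

ranked = ["phoned", "mobile phone a", "phoneb", "new sale"]
new_terms = discover_terms(ranked, lexicon=["phoneb"], k=3)
# top 3 are phoned, mobile phone a, phoneb; phoneb is already known
```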
  • For explanation purposes, an example embodiment demonstrating the present invention and portions of the above-discussed methods 400, 500, 600, 700, 800 (see FIGS. 4, 5, 6, 7, and 8) is provided. Referring first to method 500 (see FIG. 5), table 900 (see FIG. 9) shows the results of step S514 on an example corpus. In this example, ‘n’ equals ‘4’. Table 900 begins with row 902, which shows the result of a frequent unigram extraction on the example corpus (showing both the extracted unigrams and their corresponding frequencies).
  • Referring still to table 900, row 904 shows the result of a frequent bigram extraction on the example corpus (showing both the extracted bigrams and their corresponding frequencies). The frequency threshold for the bigram extraction is 30; as such, bigrams with a frequency of less than 30 (none, in this example) will not be included in the output for step S514.
  • Referring still to table 900 (see FIG. 9), row 906 shows the result of a frequent trigram extraction on the example corpus (showing both the extracted trigrams and their corresponding frequencies). The frequency threshold for the trigram extraction is 20; as such, trigrams with a frequency of less than 20 (none, in this example) will not be included in the output for step S514.
  • Still referring to table 900 (see FIG. 9), row 908 shows the result of a frequent 4-gram extraction on the example corpus (showing both the extracted 4-grams and their corresponding frequencies). The frequency threshold for the 4-gram extraction is 10; as such, 4-grams with a frequency of less than 10 (none, in this example) will not be included in the output for step S514. The resulting example output of row 908, combined with the output from rows 902, 904, and 906, makes up the entire list of frequent n-grams generated by step S514.
  • Table 1000 (see FIG. 10) shows the results of step S516 (see FIG. 5), where association rules are generated from the list of frequent n-grams generated by the previous step S514. Specifically, row 1002 (see FIG. 10) shows the generated forward rules, along with their corresponding confidence values (see discussion of step S516, above). Although a given term can have multiple forward rules, in the present example, for each term, only the forward rule with the maximum confidence value is shown. Row 1004 (see FIG. 10) similarly shows the generated inverse rules, along with their corresponding confidence values (again, see discussion of step S516, above). Although a given term can have multiple inverse rules, in the present example, for each term, only the inverse rule with the maximum confidence value is shown.
  • Table 1100 (see FIG. 11) shows the results of steps S518 and S520 (see FIG. 5), where the n-grams created in step S514 are filtered using the inverse rules and the forward rules generated in step S516. In step S518, the inverse rules are applied to the list of frequent n-grams. For each inverse rule over a pre-defined confidence value threshold (in the present example, 0.8), the term on the right-hand side of the rule is removed from the list of frequent n-grams. The general reasoning for this is that when an inverse rule has a high confidence value, it is unlikely that the term on the right-hand side would exist independently, separate from the term on the left-hand side. To provide an example, because the inverse rule “ManufacturerA<-PhoneD” has a confidence value of 1.00, “PhoneD,” which is on the right-hand side of the rule, is removed from the list of frequent n-grams (as “PhoneD” is unlikely to appear without “ManufacturerA” as a prefix). Row 1102 of table 1100 shows all of the n-grams that have been filtered using the inverse rules, and row 1104 shows the n-grams that remain after that filtering.
  • In step S520, the forward rules are applied to the list of frequent n-grams. For each remaining n-gram in the list of frequent n-grams, the n-gram is removed from the list if it doesn't have a corresponding forward rule above a pre-defined confidence value threshold (in the present example, 0.8). To provide an example, because forward rule “ManufacturerB PhoneA W->4G” has a confidence value of 1.00 (which is greater than 0.8), the term “ManufacturerB PhoneA W 4G” remains on the list. Conversely, because the forward rule “Connect->ManufacturerA” has a confidence value of 0.10, the term “Connect ManufacturerA” is removed from the list. Row 1106 of table 1100 shows all of the n-grams that have been filtered using the forward rules, and row 1108 shows the n-grams that remain after that filtering and are considered candidate terms.
  • Referring now to method 600 (see FIG. 6), table 1200 (see FIG. 12) shows the results of steps S602, S604, and S606. Row 1202 shows the lexicon terms that have been extracted at the beginning of step S602. These terms are considered to be high precision domain lexicon terms for the domain of smartphones (collectively, they are referred to as the “high precision domain lexicon,” the “lexicon,” and/or the “lexicon terms”).
  • Continuing with step S602, method 600 identifies the n-grams from the corpus that end with any of the terms from the domain lexicon. Method 600 generates inverse rules for these n-grams along with corresponding confidence values. If the confidence of an inverse rule exceeds a pre-determined confidence value threshold (in this case, 0.8), then the method checks if the full term of the inverse rule is included in the high precision domain lexicon. If so, the term on the right-hand side of the rule is removed from the lexicon. If not, then the right-hand side term is replaced in the lexicon by the full term of the inverse rule. To provide an example of this, row 1204 of table 1200 shows both of the generated inverse rules that meet the confidence value threshold in the present example embodiment, along with their corresponding confidence values. For the first rule, “ManufacturerC<-PhoneB 12,” because “ManufacturerC PhoneB 12” is already included in the lexicon, “PhoneB 12” (that is, the term on the right-hand side of the rule) is removed from the lexicon. Conversely, for the second rule, “ManufacturerB<-PhoneC,” because “ManufacturerB PhoneC” is not in the lexicon, “PhoneC” is replaced by “ManufacturerB PhoneC” in the lexicon. The resulting, modified lexicon terms are shown in row 1206 of table 1200.
  • Still referring to table 1200 (see FIG. 12), row 1208 shows the results of step S604 (see FIG. 6), where the lexicon is scored using a C-Value/NC-Value method. As shown in table 1200, the lexicon terms are ranked based on their respective scores. In the next step, S606, the top X terms are selected. In the present case, X equals 5, so all four of the lexicon terms are selected, as shown in row 1210 of table 1200 (see FIG. 12).
  • Referring still to the present example embodiment, table 1300 (see FIG. 13) shows the results of steps S608, S610, and S612 (see FIG. 6). In step S608 (shown in row 1302), context words for lexicon terms are extracted from the corpus with a pre-defined window. In the present embodiment, the window extends to one word before the term and one word after the term. So, when a lexicon term is found in the corpus, the word immediately preceding that lexicon term and the word immediately following the lexicon term are added to a list of context words. The list of context words is shown in row 1302, where each context word is listed along with the lexicon term(s) used to identify the context word.
  • Proceeding to step S610, a list of various closed-class words (such as determiners, prepositions, pronouns, and conjunctions) is used to reduce the number of words included in the list of context words. Row 1304 of table 1300 (see FIG. 13) shows the results of step S610 in the present example embodiment, where words such as “to,” “from,” and “your” have been removed from the list.
  • Referring still to table 1300 (see FIG. 13), step S612 provides weights for the context words, thereby creating weighted term context words. As mentioned above in the discussion of step S612, the weight of a given word is equal to the number of lexicon terms the word appeared with in the corpus divided by the total number of lexicon terms. The resulting weighted context words for the present example embodiment are shown in row 1306 of table 1300.
  • Referring now to method 700 (see FIG. 7), table 1400 (see FIG. 14) shows the results of steps S702 and S704 for the present example embodiment. In step S702 (the results of which are shown in row 1402), context words for candidate terms (see discussion of method 500, above) are extracted from the corpus with a predefined window. In the present embodiment, the window extends to one word before the term and one word after the term (as in step S608). So, when a candidate term is found in the corpus, the word immediately preceding the candidate term and the word immediately following the candidate term are extracted and added to a list of context words. Row 1402 shows the extracted context words for the present embodiment, along with their corresponding candidate terms. The number of times a context word appears with each candidate term is denoted in parentheses.
  • Processing continues to step S704, where closed-class context words (such as determiners, prepositions, pronouns, and conjunctions) are removed from the list of context words in a manner similar to the removal of closed-class context words in step S610. The resulting list of context words is shown in row 1404 of table 1400 (see FIG. 14).
  • Table 1500 (see FIG. 15) shows the results of steps S706 and S708 for the present example embodiment. In step S706, for each candidate term, a contextual similarity analysis is performed between the candidate term's context words (produced in step S704, discussed above) and the weighted term context words (produced in step S612, discussed above). In the present embodiment, the resulting contextual similarity score is calculated by computing the sum, for each of a candidate term's context words, of the candidate term's frequency with that context word multiplied by the context word's weight. If the given context word is not listed in the list of weighted term context words, then the weight of the context word is zero. The calculations for computing the contextual similarity score for each candidate term in the present example embodiment are shown in row 1502 of table 1500 (see FIG. 15). In the next row 1504, the candidate terms are listed according to their resulting contextual similarity scores (as a result of the contextual similarity score ranking that occurs in step S708).
  • Referring now to method 800, table 1600 (see FIG. 16) shows the results of steps S802 and S804 (see FIG. 8) for the present example embodiment. In step S802 of this embodiment, the top K candidate terms are selected from the ranked list produced in step S708 (and shown in row 1504 of table 1500 (see FIG. 15)). In this example, K equals six. The resulting discovered terms (that is, the top six terms from the ranked list of candidate terms) are shown in row 1602 of table 1600 (see FIG. 16). In the following step S804, the discovered terms produced in step S802 are removed from the list of candidate terms. The terms remaining in the list of candidate terms after this step are shown in row 1604 of table 1600.
  • After completion of the steps in method 800, processing returns back to step S410 in method 400 (see FIG. 4), where the newly discovered terms from step S802 are added to the high precision domain lexicon. The new, modified, high precision domain lexicon for the present example embodiment is shown in row 1606 of table 1600.
  • IV. Definitions
  • Present invention: should not be taken as an absolute indication that the subject matter described by the term “present invention” is covered by either the claims as they are filed, or by the claims that may eventually issue after patent prosecution; while the term “present invention” is used to help the reader get a general feel for which disclosures herein are believed to possibly be new, this understanding, as indicated by use of the term “present invention,” is tentative and provisional and subject to change over the course of patent prosecution as relevant information is developed and as the claims are potentially amended.
  • Embodiment: see definition of “present invention” above—similar cautions apply to the term “embodiment.”
  • and/or: inclusive or; for example, A, B “and/or” C means that at least one of A or B or C is true and applicable.
  • Module/Sub-Module: any set of hardware, firmware and/or software that operatively works to do some kind of function, without regard to whether the module is: (i) in a single local proximity; (ii) distributed over a wide area; (iii) in a single proximity within a larger piece of software code; (iv) located within a single piece of software code; (v) located in a single storage device, memory or medium; (vi) mechanically connected; (vii) electrically connected; and/or (viii) connected in data communication.
  • Computer: any device with significant data processing and/or machine readable instruction reading capabilities including, but not limited to: desktop computers, mainframe computers, laptop computers, field-programmable gate array (FPGA) based devices, smart phones, personal digital assistants (PDAs), body-mounted or inserted computers, embedded device style computers, application-specific integrated circuit (ASIC) based devices.
  • Contextual characteristic: a feature of a term derived from that term's particular usage in a corpus; some examples of possible contextual characteristics include: (i) proximity related characteristics such as the words located within n words of the term, the words located farther than n words away from a term, and/or the distance between the term and a specific, pre-identified word; (ii) frequency-related characteristics such as the number of times the term appears in the corpus, the most/least number of times the term appears in a sentence, and/or the relative percentage of the term compared to the other terms in the corpus; and/or (iii) usage-related characteristics such as the location of the term in a sentence, the location of the term in a paragraph, whether the term commonly appears in the singular form or in plural form, whether the term regularly appears as a noun/verb/adjective/adverb/subject/object, the adjectives used to describe the term (when a noun), the adverbs used to describe a term (when a verb), the nouns the term typically describes (when an adjective), the verbs the term typically describes (when an adverb), the object of the term (when a subject), and/or the subject of the term (when an object).

Claims (14)

1-7. (canceled)
8. A computer program product comprising a computer readable storage medium having stored thereon:
first program instructions programmed to identify a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus;
second program instructions programmed to add the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and
third program instructions programmed to identify a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).
9. The computer program product of claim 8, further comprising:
fourth program instructions programmed to add the second term to the revised set of category related term(s), thereby creating a second revised set of category related term(s) and a set of second term contextual characteristic(s), where each second term contextual characteristic of the set of second term contextual characteristic(s) relates to the contextual use of the second term in the corpus; and
fifth program instructions programmed to identify a third term from the corpus, based, at least in part, on the set of second term contextual characteristic(s).
10. The computer program product of claim 8, wherein:
the identifying of the second term from the corpus is further based, at least in part, on the set of initial contextual characteristic(s).
11. The computer program product of claim 8, further comprising:
fourth program instructions programmed to create the set of category related term(s), where at least one category related term of the set of category related term(s) is extracted from the corpus using a precision oriented extraction method.
12. The computer program product of claim 8, wherein:
the first term belongs to a set of relevant term(s), where each relevant term of the set of relevant term(s) is extracted from the corpus using a statistical extraction method.
13. The computer program product of claim 8, wherein:
each initial contextual characteristic of the set of initial contextual characteristic(s) includes a contextual weight corresponding to the respective initial contextual characteristic's use in the corpus.
14. The computer program product of claim 8, wherein:
the identifying of the first term in the corpus is further based, at least in part, on a weighted strength of a match between the first term and the respective contextual weights of each initial contextual characteristic in the set of initial contextual characteristic(s).
15. A computer system comprising:
a processor(s) set; and
a computer readable storage medium;
wherein:
the processor set is structured, located, connected and/or programmed to run program instructions stored on the computer readable storage medium; and
the program instructions include:
first program instructions programmed to identify a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus;
second program instructions programmed to add the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and
third program instructions programmed to identify a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).
16. The computer system of claim 15, further comprising:
fourth program instructions programmed to add the second term to the revised set of category related term(s), thereby creating a second revised set of category related term(s) and a set of second term contextual characteristic(s), where each second term contextual characteristic of the set of second term contextual characteristic(s) relates to the contextual use of the second term in the corpus; and
fifth program instructions programmed to identify a third term from the corpus, based, at least in part, on the set of second term contextual characteristic(s).
17. The computer system of claim 15, wherein:
the identifying of the second term from the corpus is further based, at least in part, on the set of initial contextual characteristic(s).
18. The computer system of claim 15, further comprising:
fourth program instructions programmed to create the set of category related term(s), where at least one category related term of the set of category related term(s) is extracted from the corpus using a precision oriented extraction method.
19. The computer system of claim 15, wherein:
the first term belongs to a set of relevant term(s), where each relevant term of the set of relevant term(s) is extracted from the corpus using a statistical extraction method.
20. The computer system of claim 15, wherein:
each initial contextual characteristic of the set of initial contextual characteristic(s) includes a contextual weight corresponding to the respective initial contextual characteristic's use in the corpus.
US14/520,654 2014-10-22 2014-10-22 Discovering terms using statistical corpus analysis Abandoned US20160117386A1 (en)

Priority Applications (2)

Application Number  Priority Date  Filing Date  Title
US14/520,654 (US20160117386A1)  2014-10-22  2014-10-22  Discovering terms using statistical corpus analysis (Abandoned)
US14/722,984 (US10592605B2, continuation of US14/520,654)  2014-10-22  2015-05-27  Discovering terms using statistical corpus analysis (Active 2037-08-05)

Publications (1)

Publication Number  Publication Date
US20160117386A1  2016-04-28

Family

ID=55792137

Country Status (1)

Country: US

Cited By (140)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150347383A1 (en) * 2014-05-30 2015-12-03 Apple Inc. Text prediction using combined word n-gram and unigram language models
US9589049B1 (en) * 2015-12-10 2017-03-07 International Business Machines Corporation Correcting natural language processing annotators in a question answering system
US20170228461A1 (en) * 2016-02-04 2017-08-10 Gartner, Inc. Methods and systems for finding and ranking entities in a domain specific system
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
CN107784048A (en) * 2016-11-14 2018-03-09 Ping An Technology (Shenzhen) Co., Ltd. Method and device for classifying questions in a question-and-answer corpus
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US10120861B2 (en) * 2016-08-17 2018-11-06 Oath Inc. Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
US20190035083A1 (en) * 2017-03-14 2019-01-31 Adobe Systems Incorporated Automatically Segmenting Images Based On Natural Language Phrases
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
CN109815321A (en) * 2018-12-26 2019-05-28 Mobvoi Information Technology Co., Ltd. Question answering method, device, equipment and storage medium
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10354652B2 (en) 2015-12-02 2019-07-16 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US10529332B2 (en) 2015-03-08 2020-01-07 Apple Inc. Virtual assistant activation
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10580409B2 (en) 2016-06-11 2020-03-03 Apple Inc. Application integration with a digital assistant
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10592605B2 (en) 2014-10-22 2020-03-17 International Business Machines Corporation Discovering terms using statistical corpus analysis
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10684703B2 (en) 2018-06-01 2020-06-16 Apple Inc. Attention aware virtual assistant dismissal
US10699717B2 (en) 2014-05-30 2020-06-30 Apple Inc. Intelligent assistant for home automation
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US10714117B2 (en) 2013-02-07 2020-07-14 Apple Inc. Voice trigger for a digital assistant
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10741185B2 (en) 2010-01-18 2020-08-11 Apple Inc. Intelligent automated assistant
US10748546B2 (en) 2017-05-16 2020-08-18 Apple Inc. Digital assistant services based on device capabilities
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10769385B2 (en) 2013-06-09 2020-09-08 Apple Inc. System and method for inferring user intent from speech inputs
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10789945B2 (en) 2017-05-12 2020-09-29 Apple Inc. Low-latency intelligent automated assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US10942702B2 (en) 2016-06-11 2021-03-09 Apple Inc. Intelligent device arbitration and control
US10942703B2 (en) 2015-12-23 2021-03-09 Apple Inc. Proactive assistance based on dialog communication between devices
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11010127B2 (en) 2015-06-29 2021-05-18 Apple Inc. Virtual assistant for media playback
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11048473B2 (en) 2013-06-09 2021-06-29 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US11069336B2 (en) 2012-03-02 2021-07-20 Apple Inc. Systems and methods for name pronunciation
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11127397B2 (en) 2015-05-27 2021-09-21 Apple Inc. Device voice control
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11140115B1 (en) * 2014-12-09 2021-10-05 Google Llc Systems and methods of applying semantic features for machine learning of message categories
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11163952B2 (en) 2018-07-11 2021-11-02 International Business Machines Corporation Linked data seeded multi-lingual lexicon extraction
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US11231904B2 (en) 2015-03-06 2022-01-25 Apple Inc. Reducing response latency of intelligent automated assistants
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11314370B2 (en) 2013-12-06 2022-04-26 Apple Inc. Method for extracting salient dialog usage from live data
US11350253B2 (en) 2011-06-03 2022-05-31 Apple Inc. Active transport based notifications
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11495218B2 (en) 2018-06-01 2022-11-08 Apple Inc. Virtual assistant operation in multi-device environments
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US20220378874A1 (en) * 2018-10-22 2022-12-01 Verint Americas Inc. Automated system and method to prioritize language model and ontology pruning
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US11615154B2 (en) 2021-02-17 2023-03-28 International Business Machines Corporation Unsupervised corpus expansion using domain-specific terms
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11663411B2 (en) 2015-01-27 2023-05-30 Verint Systems Ltd. Ontology expansion using entity-association rules and abstract relations
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11769012B2 (en) * 2019-03-27 2023-09-26 Verint Americas Inc. Automated system and method to prioritize language model and ontology expansion and pruning
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11861301B1 (en) * 2023-03-02 2024-01-02 The Boeing Company Part sorting system
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11928604B2 (en) 2005-09-08 2024-03-12 Apple Inc. Method and apparatus for building an intelligent automated assistant
US11954405B2 (en) 2022-11-07 2024-04-09 Apple Inc. Zero latency digital assistant

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11468234B2 (en) 2017-06-26 2022-10-11 International Business Machines Corporation Identifying linguistic replacements to improve textual message effectiveness
US11354504B2 (en) * 2019-07-10 2022-06-07 International Business Machines Corporation Multi-lingual action identification
US11461339B2 (en) 2021-01-30 2022-10-04 Microsoft Technology Licensing, Llc Extracting and surfacing contextually relevant topic descriptions

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080010274A1 (en) * 2006-06-21 2008-01-10 Information Extraction Systems, Inc. Semantic exploration and discovery
US20100125540A1 (en) * 2008-11-14 2010-05-20 Palo Alto Research Center Incorporated System And Method For Providing Robust Topic Identification In Social Indexes
US20110093452A1 (en) * 2009-10-20 2011-04-21 Yahoo! Inc. Automatic comparative analysis
US20110246486A1 (en) * 2010-04-01 2011-10-06 Institute For Information Industry Methods and Systems for Extracting Domain Phrases
US20120271788A1 (en) * 2011-04-21 2012-10-25 Palo Alto Research Center Incorporated Incorporating lexicon knowledge into svm learning to improve sentiment classification
US20130246430A1 (en) * 2011-09-07 2013-09-19 Venio Inc. System, method and computer program product for automatic topic identification using a hypertext corpus
US8589399B1 (en) * 2011-03-25 2013-11-19 Google Inc. Assigning terms of interest to an entity
US20140082003A1 (en) * 2012-09-17 2014-03-20 Digital Trowel (Israel) Ltd. Document mining with relation extraction
US20140172417A1 (en) * 2012-12-16 2014-06-19 Cloud 9, Llc Vital text analytics system for the enhancement of requirements engineering documents and other documents
US20150199333A1 (en) * 2014-01-15 2015-07-16 Abbyy Infopoisk Llc Automatic extraction of named entities from texts
US20160034305A1 (en) * 2013-03-15 2016-02-04 Advanced Elemental Technologies, Inc. Methods and systems for purposeful computing
US9477752B1 (en) * 2013-09-30 2016-10-25 Verint Systems Inc. Ontology administration and application to enhance communication data analytics

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7395256B2 (en) 2003-06-20 2008-07-01 Agency For Science, Technology And Research Method and platform for term extraction from large collection of documents
US7478092B2 (en) 2005-07-21 2009-01-13 International Business Machines Corporation Key term extraction
US8131536B2 (en) 2007-01-12 2012-03-06 Raytheon Bbn Technologies Corp. Extraction-empowered machine translation
WO2010038540A1 (en) 2008-10-02 2010-04-08 International Business Machines Corporation System for extracting term from document containing text segment
US8768960B2 (en) * 2009-01-20 2014-07-01 Microsoft Corporation Enhancing keyword advertising using online encyclopedia semantics
US8073877B2 (en) 2009-01-20 2011-12-06 Yahoo! Inc. Scalable semi-structured named entity detection
US8255405B2 (en) 2009-01-30 2012-08-28 Hewlett-Packard Development Company, L.P. Term extraction from service description documents
US20100274770A1 (en) * 2009-04-24 2010-10-28 Yahoo! Inc. Transductive approach to category-specific record attribute extraction
US9164983B2 (en) * 2011-05-27 2015-10-20 Robert Bosch Gmbh Broad-coverage normalization system for social media language
US8635107B2 (en) * 2011-06-03 2014-01-21 Adobe Systems Incorporated Automatic expansion of an advertisement offer inventory
US10339214B2 (en) 2011-11-04 2019-07-02 International Business Machines Corporation Structured term recognition
US20160117386A1 (en) 2014-10-22 2016-04-28 International Business Machines Corporation Discovering terms using statistical corpus analysis

Cited By (216)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11928604B2 (en) 2005-09-08 2024-03-12 Apple Inc. Method and apparatus for building an intelligent automated assistant
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US11900936B2 (en) 2008-10-02 2024-02-13 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US10741185B2 (en) 2010-01-18 2020-08-11 Apple Inc. Intelligent automated assistant
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US10692504B2 (en) 2010-02-25 2020-06-23 Apple Inc. User profiling for voice input processing
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11350253B2 (en) 2011-06-03 2022-05-31 Apple Inc. Active transport based notifications
US11069336B2 (en) 2012-03-02 2021-07-20 Apple Inc. Systems and methods for name pronunciation
US11321116B2 (en) 2012-05-15 2022-05-03 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US11636869B2 (en) 2013-02-07 2023-04-25 Apple Inc. Voice trigger for a digital assistant
US11862186B2 (en) 2013-02-07 2024-01-02 Apple Inc. Voice trigger for a digital assistant
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US11557310B2 (en) 2013-02-07 2023-01-17 Apple Inc. Voice trigger for a digital assistant
US10714117B2 (en) 2013-02-07 2020-07-14 Apple Inc. Voice trigger for a digital assistant
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US11048473B2 (en) 2013-06-09 2021-06-29 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11727219B2 (en) 2013-06-09 2023-08-15 Apple Inc. System and method for inferring user intent from speech inputs
US10769385B2 (en) 2013-06-09 2020-09-08 Apple Inc. System and method for inferring user intent from speech inputs
US11314370B2 (en) 2013-12-06 2022-04-26 Apple Inc. Method for extracting salient dialog usage from live data
US11810562B2 (en) 2014-05-30 2023-11-07 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10699717B2 (en) 2014-05-30 2020-06-30 Apple Inc. Intelligent assistant for home automation
US11699448B2 (en) 2014-05-30 2023-07-11 Apple Inc. Intelligent assistant for home automation
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US9785630B2 (en) * 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US10878809B2 (en) 2014-05-30 2020-12-29 Apple Inc. Multi-command single utterance input method
US11670289B2 (en) 2014-05-30 2023-06-06 Apple Inc. Multi-command single utterance input method
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10657966B2 (en) 2014-05-30 2020-05-19 Apple Inc. Better resolution when referencing to concepts
US20150347383A1 (en) * 2014-05-30 2015-12-03 Apple Inc. Text prediction using combined word n-gram and unigram language models
US10714095B2 (en) 2014-05-30 2020-07-14 Apple Inc. Intelligent assistant for home automation
US11838579B2 (en) 2014-06-30 2023-12-05 Apple Inc. Intelligent automated assistant for TV user interactions
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10390213B2 (en) 2014-09-30 2019-08-20 Apple Inc. Social reminders
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
US10592605B2 (en) 2014-10-22 2020-03-17 International Business Machines Corporation Discovering terms using statistical corpus analysis
US11140115B1 (en) * 2014-12-09 2021-10-05 Google Llc Systems and methods of applying semantic features for machine learning of message categories
US11663411B2 (en) 2015-01-27 2023-05-30 Verint Systems Ltd. Ontology expansion using entity-association rules and abstract relations
US11231904B2 (en) 2015-03-06 2022-01-25 Apple Inc. Reducing response latency of intelligent automated assistants
US10930282B2 (en) 2015-03-08 2021-02-23 Apple Inc. Competing devices responding to voice triggers
US10529332B2 (en) 2015-03-08 2020-01-07 Apple Inc. Virtual assistant activation
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US11842734B2 (en) 2015-03-08 2023-12-12 Apple Inc. Virtual assistant activation
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11127397B2 (en) 2015-05-27 2021-09-21 Apple Inc. Device voice control
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US10681212B2 (en) 2015-06-05 2020-06-09 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11947873B2 (en) 2015-06-29 2024-04-02 Apple Inc. Virtual assistant for media playback
US11010127B2 (en) 2015-06-29 2021-05-18 Apple Inc. Virtual assistant for media playback
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US11550542B2 (en) 2015-09-08 2023-01-10 Apple Inc. Zero latency digital assistant
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11809886B2 (en) 2015-11-06 2023-11-07 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US10354652B2 (en) 2015-12-02 2019-07-16 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US9589049B1 (en) * 2015-12-10 2017-03-07 International Business Machines Corporation Correcting natural language processing annotators in a question answering system
US10942703B2 (en) 2015-12-23 2021-03-09 Apple Inc. Proactive assistance based on dialog communication between devices
US11853647B2 (en) 2015-12-23 2023-12-26 Apple Inc. Proactive assistance based on dialog communication between devices
US10586174B2 (en) * 2016-02-04 2020-03-10 Gartner, Inc. Methods and systems for finding and ranking entities in a domain specific system
US20170228461A1 (en) * 2016-02-04 2017-08-10 Gartner, Inc. Methods and systems for finding and ranking entities in a domain specific system
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11657820B2 (en) 2016-06-10 2023-05-23 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11749275B2 (en) 2016-06-11 2023-09-05 Apple Inc. Application integration with a digital assistant
US11809783B2 (en) 2016-06-11 2023-11-07 Apple Inc. Intelligent device arbitration and control
US10580409B2 (en) 2016-06-11 2020-03-03 Apple Inc. Application integration with a digital assistant
US10942702B2 (en) 2016-06-11 2021-03-09 Apple Inc. Intelligent device arbitration and control
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US20190073357A1 (en) * 2016-08-17 2019-03-07 Oath Inc. Hybrid classifier for assigning natural language processing (nlp) inputs to domains in real-time
US10120861B2 (en) * 2016-08-17 2018-11-06 Oath Inc. Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
US10997370B2 (en) * 2016-08-17 2021-05-04 Verizon Media Inc. Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10553215B2 (en) 2016-09-23 2020-02-04 Apple Inc. Intelligent automated assistant
CN107784048A (en) * 2016-11-14 2018-03-09 Ping An Technology (Shenzhen) Co., Ltd. Method and device for classifying questions in a question-and-answer corpus
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US11656884B2 (en) 2017-01-09 2023-05-23 Apple Inc. Application integration with a digital assistant
US10410351B2 (en) * 2017-03-14 2019-09-10 Adobe Inc. Automatically segmenting images based on natural language phrases
US20190035083A1 (en) * 2017-03-14 2019-01-31 Adobe Systems Incorporated Automatically Segmenting Images Based On Natural Language Phrases
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10741181B2 (en) 2017-05-09 2020-08-11 Apple Inc. User interface for correcting recognition errors
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10847142B2 (en) 2017-05-11 2020-11-24 Apple Inc. Maintaining privacy of personal information
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US11599331B2 (en) 2017-05-11 2023-03-07 Apple Inc. Maintaining privacy of personal information
US11862151B2 (en) 2017-05-12 2024-01-02 Apple Inc. Low-latency intelligent automated assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US11837237B2 (en) 2017-05-12 2023-12-05 Apple Inc. User-specific acoustic models
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
US10789945B2 (en) 2017-05-12 2020-09-29 Apple Inc. Low-latency intelligent automated assistant
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11538469B2 (en) 2017-05-12 2022-12-27 Apple Inc. Low-latency intelligent automated assistant
US11380310B2 (en) 2017-05-12 2022-07-05 Apple Inc. Low-latency intelligent automated assistant
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US10909171B2 (en) 2017-05-16 2021-02-02 Apple Inc. Intelligent automated assistant for media exploration
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
US11675829B2 (en) 2017-05-16 2023-06-13 Apple Inc. Intelligent automated assistant for media exploration
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10748546B2 (en) 2017-05-16 2020-08-18 Apple Inc. Digital assistant services based on device capabilities
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US11710482B2 (en) 2018-03-26 2023-07-25 Apple Inc. Natural assistant interaction
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US11900923B2 (en) 2018-05-07 2024-02-13 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11169616B2 (en) 2018-05-07 2021-11-09 Apple Inc. Raise to speak
US11907436B2 (en) 2018-05-07 2024-02-20 Apple Inc. Raise to speak
US11854539B2 (en) 2018-05-07 2023-12-26 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11487364B2 (en) 2018-05-07 2022-11-01 Apple Inc. Raise to speak
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US10684703B2 (en) 2018-06-01 2020-06-16 Apple Inc. Attention aware virtual assistant dismissal
US11431642B2 (en) 2018-06-01 2022-08-30 Apple Inc. Variable latency device coordination
US10984798B2 (en) 2018-06-01 2021-04-20 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11360577B2 (en) 2018-06-01 2022-06-14 Apple Inc. Attention aware virtual assistant dismissal
US11630525B2 (en) 2018-06-01 2023-04-18 Apple Inc. Attention aware virtual assistant dismissal
US11495218B2 (en) 2018-06-01 2022-11-08 Apple Inc. Virtual assistant operation in multi-device environments
US11009970B2 (en) 2018-06-01 2021-05-18 Apple Inc. Attention aware virtual assistant dismissal
US10720160B2 (en) 2018-06-01 2020-07-21 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US10944859B2 (en) 2018-06-03 2021-03-09 Apple Inc. Accelerated task performance
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US10504518B1 (en) 2018-06-03 2019-12-10 Apple Inc. Accelerated task performance
US11163952B2 (en) 2018-07-11 2021-11-02 International Business Machines Corporation Linked data seeded multi-lingual lexicon extraction
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US20220378874A1 (en) * 2018-10-22 2022-12-01 Verint Americas Inc. Automated system and method to prioritize language model and ontology pruning
US11934784B2 (en) 2018-10-22 2024-03-19 Verint Americas Inc. Automated system and method to prioritize language model and ontology expansion and pruning
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
CN109815321A (en) * 2018-12-26 2019-05-28 Mobvoi Information Technology Co., Ltd. Question answering method, device, equipment and storage medium
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11783815B2 (en) 2019-03-18 2023-10-10 Apple Inc. Multimodality in digital assistant systems
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11769012B2 (en) * 2019-03-27 2023-09-26 Verint Americas Inc. Automated system and method to prioritize language model and ontology expansion and pruning
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11705130B2 (en) 2019-05-06 2023-07-18 Apple Inc. Spoken notifications
US11675491B2 (en) 2019-05-06 2023-06-13 Apple Inc. User configurable task triggers
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11888791B2 (en) 2019-05-21 2024-01-30 Apple Inc. Providing message response suggestions
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11360739B2 (en) 2019-05-31 2022-06-14 Apple Inc. User activity shortcut suggestions
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11924254B2 (en) 2020-05-11 2024-03-05 Apple Inc. Digital assistant hardware abstraction
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11750962B2 (en) 2020-07-21 2023-09-05 Apple Inc. User identification using headphones
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US11615154B2 (en) 2021-02-17 2023-03-28 International Business Machines Corporation Unsupervised corpus expansion using domain-specific terms
US11954405B2 (en) 2022-11-07 2024-04-09 Apple Inc. Zero latency digital assistant
US11861301B1 (en) * 2023-03-02 2024-01-02 The Boeing Company Part sorting system

Also Published As

Publication number Publication date
US20160117313A1 (en) 2016-04-28
US10592605B2 (en) 2020-03-17

Similar Documents

Publication Publication Date Title
US10592605B2 (en) Discovering terms using statistical corpus analysis
US9922025B2 (en) Generating distributed word embeddings using structured information
Hill et al. The goldilocks principle: Reading children's books with explicit memory representations
Koh et al. An empirical survey on long document summarization: Datasets, models, and metrics
US9436918B2 (en) Smart selection of text spans
Li et al. Tweet segmentation and its application to named entity recognition
US9898529B2 (en) Augmenting semantic models based on morphological rules
Khalifa et al. Character convolutions for arabic named entity recognition with long short-term memory networks
US9588958B2 (en) Cross-language text classification
US9734238B2 (en) Context based passage retrieval and scoring in a question answering system
US10713438B2 (en) Determining off-topic questions in a question answering system using probabilistic language models
US20150178268A1 (en) Semantic disambiguation using a statistical analysis
US20150178270A1 (en) Semantic disambiguation using a language-independent semantic structure
Quan et al. Weighted high-order hidden Markov models for compound emotions recognition in text
US10810375B2 (en) Automated entity disambiguation
Yang et al. Exploring feature sets for two-phase biomedical named entity recognition using semi-CRFs
US20150178269A1 (en) Semantic disambiguation using a semantic classifier
Abdallah et al. Multi-domain evaluation framework for named entity recognition tools
US9984064B2 (en) Reduction of memory usage in feature generation
CN112541062B (en) Parallel corpus alignment method and device, storage medium and electronic equipment
Makrynioti et al. Sentiment extraction from tweets: multilingual challenges
Ritter Extracting knowledge from Twitter and the Web
Nguyen et al. Learning to summarize multi-documents with local and global information
US10528661B2 (en) Evaluating parse trees in linguistic analysis
US11487940B1 (en) Controlling abstraction of rule generation based on linguistic context

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:AJMERA, JITENDRA;PARIKH, ANKUR;SIGNING DATES FROM 20141015 TO 20141016;REEL/FRAME:034005/0524

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION