US20160117386A1 - Discovering terms using statistical corpus analysis - Google Patents


Info

Publication number
US20160117386A1
Authority
US
United States
Prior art keywords
term
corpus
contextual
terms
contextual characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/520,654
Inventor
Jitendra Ajmera
Ankur Parikh
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US14/520,654
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION (assignment of assignors interest). Assignors: Parikh, Ankur; Ajmera, Jitendra
Priority to US14/722,984 (US10592605B2)
Publication of US20160117386A1
Legal status: Abandoned


Classifications

    • G06F17/30684
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F17/30705
    • G06F17/30719

Definitions

  • the present invention relates generally to the field of natural language processing, and more particularly to “term extraction.”
  • Natural language processing is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding (that is, enabling computers to derive meaning from human or natural language input).
  • IE Information Extraction
  • NLP Natural Language Processing
  • Term Extraction is a sub-task of IE.
  • the goal of Term Extraction is to automatically extract relevant terms from a given text (or “corpus”).
  • Term Extraction is used in many NLP tasks and applications, such as question answering, information retrieval, ontology engineering, semantic web, text summarization, document classification, and clustering. Generally, in term extraction, statistical and machine learning methods may be used to help select relevant terms.
  • a domain ontology represents concepts which belong to a particular “domain” such as an industry or a genre.
  • multiple domain ontologies may exist within a single domain due to differences in language, intended use of the ontologies, and different perceptions of the domain.
  • because domain ontologies represent concepts in very specific and often eclectic ways, they are often incompatible.
  • term extraction becomes difficult when the text being processed belongs to a different domain (for example, medical technology) than the domain from which the NLP software was built (for example, financial news).
  • a method, computer program product and/or system that performs the following steps (not necessarily in the following order): (i) identifying a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus; (ii) adding the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and (iii) identifying a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).
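Stripped to its essentials, the claimed loop can be sketched in Python. The function names and the choice of "word immediately following a term" as the sole contextual characteristic are illustrative assumptions, not the patent's required implementation:

```python
def following_words(term, tokens):
    # Contextual characteristic used in this sketch: the set of words
    # that immediately follow `term` in the tokenized corpus.
    return {tokens[i + 1] for i, tok in enumerate(tokens)
            if tok == term and i + 1 < len(tokens)}

def discover(category_terms, candidates, tokens):
    # Steps (i)/(iii): a candidate is identified when it is immediately
    # followed by a contextual characteristic of the known terms.
    chars = set()
    for term in category_terms:
        chars |= following_words(term, tokens)
    return [c for c in candidates if following_words(c, tokens) & chars]
```

Step (ii) is then a matter of appending each discovered term to `category_terms` and recomputing the characteristics, so that later identifications benefit from the contextual use of earlier discoveries.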
  • FIG. 1 is a block diagram view of a first embodiment of a system according to the present invention
  • FIG. 2 is a flowchart showing a first embodiment method performed, at least in part, by the first embodiment system
  • FIG. 3 is a block diagram view of a machine logic (for example, software) portion of the first embodiment system
  • FIG. 4 is a flowchart view of a method according to the present invention.
  • FIG. 5 is a flowchart view of a method according to the present invention.
  • FIG. 6 is a flowchart view of a method according to the present invention.
  • FIG. 7 is a flowchart view of a method according to the present invention.
  • FIG. 8 is a flowchart view of a method according to the present invention.
  • FIG. 9 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 10 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 11 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 12 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 13 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 14 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 15 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • FIG. 16 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • Some embodiments of the present invention extract contextually relevant terms from a text sample (or corpus) by iteratively discovering new terms using weighted “contextual characteristics” of terms discovered in previous iterations.
  • a “contextual characteristic” is a feature of a term derived from that term's particular usage in a given corpus (for example, one contextual characteristic is a list of words that commonly precede or follow a given term in the corpus).
  • This Detailed Description section is divided into the following sub-sections: (i) The Hardware and Software Environment; (ii) Example Embodiment; (iii) Further Comments and/or Embodiments; and (iv) Definitions.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a functional block diagram illustrating various portions of networked computers system 100 , including: sub-system 102 ; client sub-systems 104 , 106 , 108 , 110 , 112 ; communication network 114 ; computer 200 ; communication unit 202 ; processor set 204 ; input/output (I/O) interface set 206 ; memory device 208 ; persistent storage device 210 ; display device 212 ; external device set 214 ; random access memory (RAM) devices 230 ; cache memory device 232 ; and program 300 .
  • Sub-system 102 is, in many respects, representative of the various computer sub-system(s) in the present invention. Accordingly, several portions of sub-system 102 will now be discussed in the following paragraphs.
  • Sub-system 102 may be a laptop computer, tablet computer, netbook computer, personal computer (PC), a desktop computer, a personal digital assistant (PDA), a smart phone, or any programmable electronic device capable of communicating with the client sub-systems via network 114 .
  • Program 300 is a collection of machine readable instructions and/or data that is used to create, manage and control certain software functions that will be discussed in detail, below, in the Example Embodiment sub-section of this Detailed Description section.
  • Sub-system 102 is capable of communicating with other computer sub-systems via network 114 .
  • Network 114 can be, for example, a local area network (LAN), a wide area network (WAN) such as the Internet, or a combination of the two, and can include wired, wireless, or fiber optic connections.
  • network 114 can be any combination of connections and protocols that will support communications between server and client sub-systems.
  • Sub-system 102 is shown as a block diagram with many double arrows. These double arrows (no separate reference numerals) represent a communications fabric, which provides communications between various components of sub-system 102 .
  • This communications fabric can be implemented with any architecture designed for passing data and/or control information between processors (such as microprocessors, communications and network processors, etc.), system memory, peripheral devices, and any other hardware components within a system.
  • the communications fabric can be implemented, at least in part, with one or more buses.
  • Memory 208 and persistent storage 210 are computer-readable storage media.
  • memory 208 can include any suitable volatile or non-volatile computer-readable storage media. It is further noted that, now and/or in the near future: (i) external device(s) 214 may be able to supply, some or all, memory for sub-system 102 ; and/or (ii) devices external to sub-system 102 may be able to provide memory for sub-system 102 .
  • Program 300 is stored in persistent storage 210 for access and/or execution by one or more of the respective computer processors 204 , usually through one or more memories of memory 208 .
  • Persistent storage 210 (i) is at least more persistent than a signal in transit; (ii) stores the program (including its soft logic and/or data), on a tangible medium (such as magnetic or optical domains); and (iii) is substantially less persistent than permanent storage.
  • data storage may be more persistent and/or permanent than the type of storage provided by persistent storage 210 .
  • Program 300 may include both machine readable and performable instructions and/or substantive data (that is, the type of data stored in a database).
  • persistent storage 210 includes a magnetic hard disk drive.
  • persistent storage 210 may include a solid state hard drive, a semiconductor storage device, read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, or any other computer-readable storage media that is capable of storing program instructions or digital information.
  • the media used by persistent storage 210 may also be removable.
  • a removable hard drive may be used for persistent storage 210 .
  • Other examples include optical and magnetic disks, thumb drives, and smart cards that are inserted into a drive for transfer onto another computer-readable storage medium that is also part of persistent storage 210 .
  • Communications unit 202, in these examples, provides for communications with other data processing systems or devices external to sub-system 102 .
  • communications unit 202 includes one or more network interface cards.
  • Communications unit 202 may provide communications through the use of either or both physical and wireless communications links. Any software modules discussed herein may be downloaded to a persistent storage device (such as persistent storage device 210 ) through a communications unit (such as communications unit 202 ).
  • I/O interface set 206 allows for input and output of data with other devices that may be connected locally in data communication with server computer 200 .
  • I/O interface set 206 provides a connection to external device set 214 .
  • External device set 214 will typically include devices such as a keyboard, keypad, a touch screen, and/or some other suitable input device.
  • External device set 214 can also include portable computer-readable storage media such as, for example, thumb drives, portable optical or magnetic disks, and memory cards.
  • Software and data used to practice embodiments of the present invention, for example, program 300 can be stored on such portable computer-readable storage media. In these embodiments the relevant software may (or may not) be loaded, in whole or in part, onto persistent storage device 210 via I/O interface set 206 .
  • I/O interface set 206 also connects in data communication with display device 212 .
  • Display device 212 provides a mechanism to display data to a user and may be, for example, a computer monitor or a smart phone display screen.
  • FIG. 2 shows flowchart 250 depicting a method according to the present invention.
  • FIG. 3 shows program 300 for performing at least some of the method steps of flowchart 250 .
  • the present embodiment refers extensively to a high precision domain lexicon (HPDL).
  • the HPDL (also referred to as a “set of category related terms”) is a collection of terms (words or sets of words) that belong to a specific domain, category, or genre (“domain”).
  • the HPDL can serve as an underlying “knowledge base” for a given domain so as to extract more contextually relevant terms from a piece of text (or corpus).
  • the HPDL is used to: (i) extract contextually relevant terms (term extraction); and (ii) extract additional HPDL-eligible terms in order to grow, strengthen, and/or expand the HPDL.
  • HPDL domains may have multiple categories (or sub-domains).
  • the domain of smartphones may include categories such as smartphone models, smartphone apps, and/or smartphone modes. It is contemplated that the present invention may apply to HPDLs with singular domains, multiple domains, and/or multiple domain categories (or sub-domains).
  • method 250 may begin with an existing, predefined HPDL, while in other embodiments the HPDL may be initially extracted from the corpus using, for example, term extraction methods adapted to achieve high levels of precision. Some known methods for extracting an initial HPDL from the corpus are discussed below in the Further Comments and/or Embodiments Sub-Section of this Detailed Description.
  • the HPDL has a domain of “things that jump” and initially includes the following terms: (i) fox; and (ii) rabbit (in other embodiments, an HPDL including the terms “fox” and “rabbit” might also have a domain of “animals” and a sub-domain of “mammals”).
  • the present embodiment also refers extensively to a corpus.
  • the corpus is a text sample that method 250 extracts relevant terms from.
  • the corpus is the text that is being acted upon (interpreted, processed, classified, etc.) during term extraction.
  • the corpus includes the following text: “A quick brown fox jumps over the lazy dog, but a quicker, more nimble kangaroo jumps over the fox. The following day, while the kangaroo leaps over the still-lazy dog, a determined frog leaps over a surprisingly speedy sloth.”
  • Processing begins when extract candidate terms module (“mod”) 302 extracts candidate terms (also referred to as “relevant terms”) from the corpus.
  • various statistical methods are used to extract relevant candidate terms. A number of these known methods are discussed below in the Further Comments and/or Embodiments Sub-Section of this Detailed Description. However, these are not meant to be all-inclusive or limiting, as other, less traditional extraction methods may also be used.
  • dictionaries or domain lexicons different from and/or unrelated to the HPDL may be used in this step. For example, in the present example embodiment, candidate terms are extracted from the corpus if they are identified as “animals”.
  • Processing proceeds to step S260, where discover new generation mod 304 discovers a new generation of HPDL terms from the candidate terms using the HPDL and its contextual characteristics.
  • This step begins by identifying contextual characteristics (or “initial contextual characteristics”) of the terms in the HPDL.
  • a contextual characteristic is a feature of a term derived from that term's particular usage in a given corpus (for a more complete definition of “contextual characteristic,” see the Definitions Sub-Section of this Detailed Description).
  • the contextual characteristic for each term in the HPDL is the word immediately following that term in the corpus (when a term is the last word in a sentence, it does not have a contextual characteristic).
  • the only contextual characteristic for the term “fox” is “jumps”, because the only word immediately following “fox” in the corpus is “jumps”.
  • For the second HPDL term, “rabbit”, there are no contextual characteristics, because “rabbit” does not appear in the corpus.
  • As such, the only contextual characteristic of the HPDL is the word “jumps”. It should be noted that although the present embodiment includes a simple example with one contextual characteristic, in many embodiments the HPDL has a plurality of contextual characteristics.
  • In the present example, the only candidate term to immediately precede the word “jumps” is “kangaroo”.
  • As a result, “kangaroo” (the “first term”) is the only term included in the current generation of discovered terms.
  • the current generation may include a plurality of discovered terms.
  • additional steps may be taken to further refine the list of discovered terms (for some examples, see the Further Comments and/or Embodiments Sub-Section of this Detailed Description).
  • Processing proceeds to step S265, where update terms mod 306 adds the current generation of terms to the HPDL.
  • the term “kangaroo” is added to the HPDL, with the resulting HPDL (or “revised set of category related terms”) being as follows: (i) fox; (ii) rabbit; and (iii) kangaroo.
  • Processing proceeds to step S270, where update terms mod 306 deletes the current generation of terms from the candidate terms list.
  • the term “kangaroo” is removed from the candidate terms list, with the resulting candidate terms list being as follows: (i) dog; (ii) frog; and (iii) sloth.
  • Processing proceeds to step S275, where iterate mod 308 checks to see if method 250 is on its last iteration. In the present embodiment, a total of two iterations are to be performed. As such, method 250 is not on its last iteration (NO), and processing returns to step S260 for another iteration. In other embodiments, however, other tests may be used. For example, in one embodiment, iterations may occur until the HPDL reaches a certain size. In another embodiment, iterations may continue to occur for a certain period of time. In still other embodiments, iterations may continue to occur indefinitely and/or until no further terms for the HPDL are discovered.
  • On the second iteration, discover new generation mod 304 repeats the process of identifying contextual characteristics of the terms in the HPDL.
  • an additional contextual characteristic (or “first term contextual characteristic”) is identified: the word “leaps”, which immediately follows the word “kangaroo” in the second sentence of the corpus.
  • When the updated contextual characteristics are applied to the candidate terms, an additional match is found: the term “frog” appears immediately before the word “leaps” in the corpus.
  • “frog” (the “second term”) is added to the current generation of discovered terms.
  • Returning to step S265, update terms mod 306 adds “frog” to the HPDL, resulting in the following HPDL: (i) fox; (ii) rabbit; (iii) kangaroo; and (iv) frog.
  • At step S270, update terms mod 306 removes “frog” from the candidate terms list, with the resulting candidate terms list being as follows: (i) dog; and (ii) sloth.
  • At step S275, in the present example, two iterations have now been completed, which means that method 250 is on its final iteration. Therefore, step S275 resolves to “YES”, and processing proceeds to step S280, where method 250 ends.
  • the HPDL for the domain of “things that jump” now includes two additional terms (“kangaroo” and “frog”), and will be able to further extract contextually relevant terms in future iterations and/or from different corpuses.
  • system 102 now also has a list of candidate terms (“dog” and “sloth”) that may be helpful for other NLP-related tasks.
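The two-iteration walkthrough above can be reproduced end to end. The sketch below is a simplified reading of method 250; the tokenization details (lowercasing, splitting sentences on periods so that a sentence-final term has no contextual characteristic) and the fixed two-iteration loop are assumptions made to match the example:

```python
import re

CORPUS = ("A quick brown fox jumps over the lazy dog, but a quicker, "
          "more nimble kangaroo jumps over the fox. The following day, "
          "while the kangaroo leaps over the still-lazy dog, a determined "
          "frog leaps over a surprisingly speedy sloth.")

def sentences(text):
    # Tokenize per sentence so a sentence-final term has no follower.
    return [re.findall(r"[a-z-]+", s.lower())
            for s in text.split(".") if s.strip()]

def following_words(term, sents):
    return {s[i + 1] for s in sents for i, tok in enumerate(s)
            if tok == term and i + 1 < len(s)}

def run(hpdl, candidates, sents, iterations=2):
    for _ in range(iterations):
        # Contextual characteristics of the current HPDL (step S260).
        chars = set()
        for term in hpdl:
            chars |= following_words(term, sents)
        discovered = [c for c in candidates
                      if following_words(c, sents) & chars]
        hpdl = hpdl + discovered                    # step S265
        candidates = [c for c in candidates         # step S270
                      if c not in discovered]
    return hpdl, candidates

hpdl, leftover = run(["fox", "rabbit"],
                     ["dog", "kangaroo", "frog", "sloth"],
                     sentences(CORPUS))
```

Running this yields the state described above: the HPDL grows to fox, rabbit, kangaroo, frog (“kangaroo” discovered via “jumps” in the first iteration contributes “leaps”, which discovers “frog” in the second), leaving “dog” and “sloth” as leftover candidate terms.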
  • Some embodiments of the present invention recognize the following facts, potential problems and/or potential areas for improvement with respect to the current state of the art: (i) some existing approaches (including approaches that rely on linguistic processors to extract candidate terms) do not perform well when the corpus (or text) has a different genre (or domain) than the corpus used to build the processor; (ii) some existing approaches rely purely on statistical methods (such as n-gram sequences or topic modeling) to extract candidate terms, thereby negatively affecting system precision; (iii) existing approaches can be configured to provide terms with either high precision or high recall, but not both (thereby negatively affecting the overall accuracy of the system); and/or (iv) existing approaches are unable to discover new domain-specific terms directly from the corpus in a bootstrapping manner.
  • Some embodiments of the present invention may include one, or more, of the following features, characteristics and/or advantages: (i) using contextual similarity with a high precision domain lexicon for ranking; (ii) extracting candidate terms statistically; (iii) using an approach other than singular value decomposition; (iv) extracting terms without using linguistic processors; (v) extracting terms without analyzing linguistic or structural characteristics of a document; (vi) extracting terms without using syntactic and semantic contextual analysis; (vii) extracting terms without using dictionary-based statistics; (viii) extracting terms without using specialized corpora; (ix) extracting terms based on contextual information of a lexicon obtained from a given corpus; and/or (x) using association rules to measure unithood and/or filter candidate terms.
  • Some embodiments of the present invention are adapted to identify terms (nouns or noun phrases) from a corpus with both high precision and high recall without using any linguistic processors or open domain linguistic resources (such as dictionaries and ontologies).
  • these embodiments may include one, or more, of the following features, characteristics and/or advantages: (i) providing an iterative approach to term discovery where discovery depends on weighted contextual characteristics of already discovered terms in previous iterations; (ii) ranking pure statistically extracted candidate terms (N-grams) based on their noun specificity and term specificity determined using weighted contextual similarity with known terms (nouns or noun phrases); (iii) using association rules, filtering candidate terms that cannot exist independently; and/or (iv) validating unithood of candidate terms using association rules.
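The patent does not spell out the exact form of the association rules in items (iii) and (iv); one common confidence-based reading, in which a two-word candidate is kept only when its first word reliably co-occurs with its second, might look like the sketch below (the 0.5 threshold is an arbitrary illustration):

```python
from collections import Counter

def unithood_ok(bigram, tokens, min_conf=0.5):
    # Association-rule confidence: conf(w1 -> w2) = count(w1 w2) / count(w1).
    # Low confidence suggests the first word commonly exists independently
    # of the second, i.e. the bigram is a weak unit and should be filtered.
    w1, w2 = bigram
    pair_counts = Counter(zip(tokens, tokens[1:]))
    word_counts = Counter(tokens)
    return (word_counts[w1] > 0
            and pair_counts[(w1, w2)] / word_counts[w1] >= min_conf)
```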
  • Some embodiments of the present invention may further include one, or more, of the following helpful features, characteristics, and/or advantages: (i) achieving positive results in entity set expansion tasks, where the goal is to identify entities from the corpus in a bootstrapping manner; (ii) performing well on diverse domains such as medical and/or news; (iii) performing better term extraction for any language, including languages for which linguistic processors are not built or do not perform well for; and/or (iv) keeping resources such as dictionaries, lexicons, ontologies, and/or entity lists up-to-date.
  • Method 400 is provided in FIG. 4 .
  • Method 400 is adapted to extract terms and their variants from a corpus with both high precision and high recall. High precision and recall occur even if the corpus belongs to a domain different from the system's source domain, or if the corpus is in a language for which linguistic systems are not available or mature.
  • Processing begins with step S 402, where method 400 uses statistical corpus analysis to extract candidate terms.
  • This step uses known (or to be known in the future) statistical approaches (such as frequent item set mining, language modeling, and topic modeling) to extract potential candidate terms from the corpus. Additionally, step S 402 filters out irrelevant potential candidate terms using statistical criteria.
  • Processing proceeds to step S 404, where method 400 creates a high precision domain lexicon and analyzes contextual information therein.
  • The high precision domain lexicon terms (or "lexicon terms") are either manually extracted from the corpus or automatically extracted using any system (known or to be known in the future) configured to focus on high precision.
  • Context words of those lexicon terms (such as the words appearing before and after the lexicon terms in the corpus) are extracted and weighted to create a set of weighted term context words.
  • Processing proceeds to step S 406, where method 400 ranks the candidate terms (see step S 402) based on contextual similarity with the weighted term context words.
  • Processing proceeds to step S 408 , where method 400 selects the top candidate terms as discovered terms.
  • The number of top candidate terms to be selected is a preconfigured value that depends on the application and business context.
  • Processing proceeds to step S 410, where the newly discovered terms (that is, the top candidate terms) are added to the high precision domain lexicon. Once the newly discovered terms have been added, they are deleted from the candidate terms list.
  • Processing proceeds to step S 412, where method 400 compares an iteration count with a pre-defined iteration threshold (where the iteration threshold is determined based on application and business context). Processing then proceeds to step S 414. If the iteration count is less than the iteration threshold (NO), processing returns to step S 404 to discover more relevant terms from the corpus. If the iteration count is greater than or equal to the iteration threshold (YES), processing for method 400 completes. As a result of method 400 completing: (i) the high precision domain lexicon now includes additional relevant domain terms; and (ii) the list of candidate terms includes additional contextually relevant terms that may be used for natural language processing or other tasks.
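  • The iterative loop of method 400 (steps S 404 through S 414) can be sketched as follows. This is a minimal sketch, not the claimed implementation; the function name and the `rank` callback (standing in for the contextual ranking of steps S 404/S 406) are hypothetical.

```python
from typing import Callable, List, Set

def discover_terms(candidates: List[str],
                   lexicon: Set[str],
                   rank: Callable[[List[str], Set[str]], List[str]],
                   top_k: int,
                   max_iterations: int) -> Set[str]:
    """Iteratively grow a high precision domain lexicon (steps S 404-S 414)."""
    for _ in range(max_iterations):                  # S 412/S 414: iteration check
        if not candidates:
            break
        ranked = rank(candidates, lexicon)           # S 404/S 406: contextual ranking
        discovered = set(ranked[:top_k])             # S 408: pick top candidates
        lexicon = lexicon | discovered               # S 410: grow the lexicon ...
        candidates = [c for c in candidates
                      if c not in discovered]        # ... and shrink the candidates
    return lexicon
```

Each pass re-ranks the remaining candidates against the lexicon grown in the previous pass, which is what makes the discovery iterative.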
  • Step S 402 (Extract Candidate Terms Using Statistical Corpus Analysis, see FIG. 4) further includes method 500 (see FIG. 5).
  • Method 500 is adapted to apply statistical approaches to the corpus to extract potential candidate terms.
  • The potential candidate terms are then passed through statistical filters to identify relevant terms, resulting in new candidate terms.
  • Processing begins with step S 502 , where method 500 extracts text from the corpus and then applies heuristics-based sentence splitters on the text to extract sentences.
  • Processing proceeds to step S 504, where various statistical methods are applied to the extracted sentences for candidate term extraction.
  • The method to be used is typically determined based on a few factors: (i) the type of document comprising the corpus (for example, a web page, a textbook, or a manual); (ii) the length of the corpus (for example, the number of words in the corpus); and/or (iii) the general domain of the corpus (for example, healthcare, finance, or telecommunications). Although many methods may be used, two are discussed below: (i) a statistical language modeling method (beginning with step S 506); and (ii) an association rule mining method (beginning with step S 514).
  • If the statistical language modeling method is chosen, processing proceeds to step S 506, where method 500 extracts n-grams from each extracted sentence in the corpus for each preconfigured value of n.
  • The value of n may be determined in a number of ways, including, for example, by conducting experiments or by prioritizing certain features (such as speed versus accuracy).
  • An n-gram is a contiguous sequence of words from a given extracted sentence, where ‘n’ represents the number of words in the sequence. All unique n-grams of the corpus are considered as potential candidate terms.
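  • The n-gram extraction of step S 506 can be sketched as follows. This is an illustrative sketch; the function name is a hypothetical stand-in.

```python
def extract_ngrams(sentence: str, max_n: int) -> set:
    """Return all unique n-grams of 1..max_n words from a sentence (step S 506)."""
    words = sentence.split()
    return {" ".join(words[i:i + n])                 # contiguous n-word sequence
            for n in range(1, max_n + 1)            # every preconfigured value of n
            for i in range(len(words) - n + 1)}     # every start position
```

Applying this to every extracted sentence and taking the union yields the set of unique n-grams treated as potential candidate terms.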
  • Processing proceeds to steps S 508 and S 510 , where method 500 scores the potential candidate terms based on their termhood and unithood, respectively.
  • Termhood (as used in step S 508) scores the validity of a potential candidate term as a representative of the corpus content as a whole, using one or more statistical measures now known (or to be known in the future).
  • In one embodiment, a measure of the term's frequency in the corpus is used.
  • In another embodiment, a measure of "weirdness" is used: the term's frequency in the corpus compared to its frequency in a reference corpus.
  • In yet another embodiment, a measure of the pertinence or specificity of the term to a particular domain is used.
  • Unithood (as used in step S 510) scores the collocation strength (that is, the strength of association among a term's constituent words) of potential candidate terms using one of the statistical measures now known (or to be known in the future). In one embodiment, a mutual information test is used. In another embodiment, a t-test is used.
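  • As a sketch of these two scores, a "weirdness" termhood measure and a pointwise mutual information unithood measure (for a two-word phrase) might be computed as follows. The helper functions are hypothetical, and the add-one smoothing on the reference count is an assumption made here to avoid division by zero.

```python
import math
from collections import Counter

def weirdness(word, domain_tokens, reference_tokens):
    """Termhood (step S 508): relative frequency in the domain corpus
    divided by (smoothed) relative frequency in a reference corpus."""
    p_domain = Counter(domain_tokens)[word] / len(domain_tokens)
    p_reference = (Counter(reference_tokens)[word] + 1) / (len(reference_tokens) + 1)
    return p_domain / p_reference

def pmi(bigram, tokens):
    """Unithood (step S 510): pointwise mutual information of a two-word phrase."""
    w1, w2 = bigram.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_xy = bigrams[(w1, w2)] / (len(tokens) - 1)    # joint probability of the pair
    p_x = unigrams[w1] / len(tokens)
    p_y = unigrams[w2] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))            # high when words co-occur strongly
```

A high weirdness score flags domain-specific words; a high PMI flags word pairs that occur together far more often than chance.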
  • Processing proceeds to step S 512, where potential candidate terms with termhood and unithood scores above pre-defined thresholds are selected and identified as candidate terms, and processing for method 500 completes.
  • If the association rule mining method is chosen in step S 504 (as opposed to the statistical language modeling method discussed above), processing proceeds to step S 514.
  • In step S 514, method 500 uses an algorithm to extract frequent n-grams. The extracted frequent n-grams are identified as potential candidate terms.
  • Specifically, method 500 first extracts unigrams (that is, single words) that are frequent (that is, having a frequency above a pre-defined threshold). Then, method 500 extracts frequent bigrams (that is, two-word phrases) that are made up of the previously identified frequent unigrams. This continues for n steps (where n equals the number of words in the phrase being analyzed).
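  • This level-wise mining can be sketched as follows. The sketch assumes the standard a-priori-style pruning rule, consistent with the description above: an n-gram is only counted if both of its (n-1)-word sub-phrases were already found frequent. The function name and the per-level threshold list are hypothetical.

```python
from collections import Counter

def frequent_ngrams(sentences, max_n, thresholds):
    """Level-wise frequent n-gram mining (step S 514)."""
    frequent = {}                                    # tuple of words -> count
    for n in range(1, max_n + 1):
        counts = Counter()
        for sentence in sentences:
            words = sentence.split()
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # prune: both (n-1)-word sub-phrases must already be frequent
                if n > 1 and (gram[:-1] not in frequent or gram[1:] not in frequent):
                    continue
                counts[gram] += 1
        for gram, count in counts.items():
            if count >= thresholds[n - 1]:           # per-level frequency threshold
                frequent[gram] = count
    return {" ".join(gram): count for gram, count in frequent.items()}
```

Each level only considers phrases built from the survivors of the previous level, which keeps the candidate space small.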
  • Processing proceeds to step S 516, where method 500 analyzes each potential candidate term and generates association rules along with corresponding confidence values.
  • Step S 516 performs three tasks: (i) for every potential candidate term t, all non-empty ordered subsets s are generated; (ii) for every subset s of t, a forward rule, "s -> t\s", along with its confidence (measured as the frequency of t divided by the frequency of s), is generated; and (iii) for every subset s of t, an inverse rule, "s <- t\s", along with its confidence, is generated.
  • For example, consider the term t = "Mobile Phone A".
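  • The rule generation of step S 516 might be sketched as follows. This sketch makes two assumptions consistent with the examples in this document: each subset s is a prefix of t (with t\s the remaining suffix), and the inverse rule's confidence is the frequency of t divided by the frequency of t\s. The `freq` mapping of phrase frequencies is assumed given.

```python
def association_rules(term, freq):
    """Generate forward and inverse rules with confidences (step S 516).
    Forward rule 's -> rest' has confidence freq(term) / freq(s);
    inverse rule 's <- rest' here uses freq(term) / freq(rest)."""
    words = term.split()
    forward, inverse = {}, {}
    for i in range(1, len(words)):
        s, rest = " ".join(words[:i]), " ".join(words[i:])
        forward[f"{s} -> {rest}"] = freq[term] / freq[s]
        inverse[f"{s} <- {rest}"] = freq[term] / freq[rest]
    return forward, inverse
```

For t = "Mobile Phone A", this yields forward rules such as "Mobile -> Phone A" and inverse rules such as "Mobile <- Phone A"; a confident inverse rule means the suffix rarely appears without the prefix.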
  • Processing proceeds to step S 518, where n-grams are filtered using the inverse rules created in step S 516.
  • Specifically, term variations are identified and removed based on their confidence scores.
  • Method 500 identifies a term variation when an inverse rule from the term variation to the term has a confidence score above a predefined threshold (determined experimentally, for example). For example, "Mobile Phone A" has one inverse rule with a confidence score above the predefined threshold: "Mobile <- Phone A". Because the confidence score is over the threshold, method 500 identifies "Phone A" as a variation of "Mobile Phone A" and removes "Phone A" from the list of potential candidate terms.
  • Next, in step S 520, n-grams are filtered using forward rules (which serve as a measure of unithood for potential candidate terms).
  • Specifically, the confidence of a forward rule provides the probability of the order of term constituents. If none of the forward rules for a term has a confidence score above a pre-defined threshold, that term is removed from the list of potential candidate terms. For example, the term "Manufacturer launches new" has two forward rules: "Manufacturer launches -> new" and "Manufacturer -> launches new". Because neither forward rule has a confidence level above the threshold, the term is removed from the list of potential candidate terms.
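  • The two filtering passes (steps S 518 and S 520) might be sketched as follows. The function is hypothetical; rules are assumed to be strings in the "s -> rest" / "s <- rest" form, and the 0.8 threshold is only an example value.

```python
def filter_candidates(terms, forward, inverse, threshold=0.8):
    """Filter potential candidate terms using association rules."""
    kept = set(terms)
    # Step S 518: drop term variations -- the right-hand side of any
    # inverse rule whose confidence exceeds the threshold.
    for rule, confidence in inverse.items():
        if confidence > threshold:
            kept.discard(rule.split(" <- ")[1])
    # Step S 520: keep a multi-word term only if at least one of its
    # forward rules exceeds the threshold (unithood check).
    def unithood_ok(term):
        if len(term.split()) == 1:
            return True                              # unigrams have no forward rules
        return any(confidence > threshold
                   for rule, confidence in forward.items()
                   if rule.replace(" -> ", " ") == term)
    return {t for t in kept if unithood_ok(t)}
```

In the running example, "Phone A" falls to the inverse-rule pass while "Connect ManufacturerA" falls to the forward-rule pass.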
  • Following step S 520, processing for method 500 completes, resulting in a new list of candidate terms drawn from the remaining potential candidate terms.
  • Step S 404 (Analyzing Contextual Information of High Precision Domain Lexicon, see FIG. 4) further includes method 600, shown in FIG. 6.
  • Processing begins with step S 602 , where a high precision domain lexicon is created (either manually or automatically using methods configured to focus on high precision) or provided from previous iterations of method 600 .
  • Term variations from the lexicon are then filtered and/or replaced using previously generated inverse rules, if available (for example, from step S 518 (see FIG. 5 )).
  • A term from the lexicon is identified as a term variation if an inverse rule from the term to some other, longer term has a confidence score above a pre-defined threshold.
  • If the longer term is part of the lexicon, then the term identified as a term variation is removed. For example, if "Mobile Phone A" and "Phone A" are in the lexicon and the inverse rule "Mobile <- Phone A" has a confidence level above the threshold, then "Phone A" is removed from the lexicon. If the longer term is not part of the lexicon, then the lexicon term is replaced with the longer term. For example, if "Phone" is in the lexicon and the inverse rule "Mobile <- Phone" has a confidence level above the threshold, then "Phone" is replaced with "Mobile Phone".
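  • The filter/replace behavior of this lexicon cleanup might be sketched as follows. The helper is hypothetical; inverse rules are assumed as "prefix <- variation" strings, with the full term being "prefix variation".

```python
def clean_lexicon(lexicon, inverse_rules, threshold=0.8):
    """Filter and/or replace term variations in the lexicon (step S 602)."""
    cleaned = set(lexicon)
    for rule, confidence in inverse_rules.items():
        if confidence <= threshold:
            continue                                 # only confident rules apply
        prefix, variation = rule.split(" <- ")
        full_term = f"{prefix} {variation}"
        if variation in cleaned:
            cleaned.discard(variation)               # drop the shorter variation ...
            cleaned.add(full_term)                   # ... keeping/adding the longer term
    return cleaned
```

Because the result is a set, adding the full term covers both cases: if it was already in the lexicon the addition is a no-op (plain removal of the variation); if not, the variation is effectively replaced by the full term.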
  • Processing proceeds to step S 604, where method 600 scores lexicon terms and ranks them based on their scores. Scoring may be performed by a variety of methods now known (or to be known in the future), and may be based on properties such as term frequency observed in a given corpus. Processing proceeds to step S 606, where the top X terms are selected, where X is pre-defined (and determined experimentally, for example).
  • Processing proceeds to step S 608, where context words are extracted from the corpus.
  • First, occurrences of lexicon terms within a given corpus are identified. Then, context words are extracted per a pre-defined window size (for example, the two words before and the two words after a lexicon term). The words within the window are identified as context words and added to a list of context words.
  • Processing proceeds to step S 610, where closed class context words (such as determiners, prepositions, pronouns, and/or conjunctions) are removed from the list of context words.
  • Processing proceeds to step S 612, where each context word is weighted. In this embodiment, the weight of a context word equals the number of unique lexicon terms the context word appears with divided by the total number of lexicon terms. Processing for method 600 concludes with a list of weighted term context words.
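  • Steps S 608 through S 612 can be sketched together as follows. This is an illustrative sketch; the tokenized corpus, window size, and closed-class stoplist are assumed inputs, and the function name is hypothetical.

```python
def weighted_context_words(tokens, lexicon_terms, window=2,
                           closed_class=frozenset()):
    """Extract context words of lexicon terms (S 608), drop closed-class
    words (S 610), and weight each word by the fraction of lexicon terms
    it appears with (S 612)."""
    appears_with = {}                                # context word -> lexicon terms seen
    for term in lexicon_terms:
        term_words = term.split()
        for i in range(len(tokens) - len(term_words) + 1):
            if tokens[i:i + len(term_words)] == term_words:   # occurrence of the term
                before = tokens[max(0, i - window):i]
                after = tokens[i + len(term_words):i + len(term_words) + window]
                for word in before + after:
                    if word not in closed_class:     # S 610: skip closed-class words
                        appears_with.setdefault(word, set()).add(term)
    return {word: len(terms) / len(lexicon_terms)    # S 612: weight per context word
            for word, terms in appears_with.items()}
```

A word that appears next to every lexicon term gets weight 1.0; a word seen near only one of two lexicon terms gets weight 0.5.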
  • Step S 406 (Contextual Similarity Based Ranking of Candidate Terms, see FIG. 4) further includes method 700, shown in FIG. 7.
  • Processing begins with step S 702 , where context words for candidate terms are extracted. First, occurrences of each candidate term (see discussion of method 500 , above) in the corpus are identified. Then, from each occurrence, context words are extracted per a pre-defined window size (for example, the two words before and the two words after each candidate term).
  • Processing proceeds to step S 704, where closed class context words (such as determiners, prepositions, pronouns, and/or conjunctions) are removed from the list of context words.
  • The remaining context words ("candidate term context words") are selected, stored (along with their frequencies), and mapped to their corresponding candidate terms.
  • Processing proceeds to step S 706, where the contextual similarity between the candidate term context words (see step S 704, above) and the weighted term context words (see discussion of method 600, above) is measured by a contextual similarity score.
  • The contextual similarity score may be obtained by a number of methods now known (or to be known in the future).
  • In one embodiment, the contextual similarity score is represented by the equation "Σi Wi*Fi", where: 'i' indexes the distinct context words of a candidate term; 'Wi' equals the weight of context word i in the weighted term context words ('Wi' equals zero when the context word is not in the set); and 'Fi' equals the frequency of context word i with respect to the candidate term in a given corpus.
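  • The score "Σi Wi*Fi" might be computed as follows. This is a minimal sketch; the two dictionaries are assumed to come from the candidate-term context extraction and the weighted term context words, respectively, and the function name is hypothetical.

```python
def contextual_similarity(candidate_context_freq, term_context_weights):
    """Contextual similarity score (step S 706): sum of Fi * Wi over the
    candidate's distinct context words; Wi is zero for words absent from
    the weighted term context words."""
    return sum(freq * term_context_weights.get(word, 0.0)
               for word, freq in candidate_context_freq.items())
```

Candidate terms whose context words overlap heavily with those of known lexicon terms score highest and therefore rank first.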
  • Processing proceeds to step S 708, where candidate terms are ranked based on the contextual similarity scores obtained in step S 706 (and discussed in the preceding paragraph).
  • The result of this step is a list of ranked candidate terms.
  • Step S 408 (Discover New Terms, see FIG. 4) further includes method 800, shown in FIG. 8.
  • Processing begins with step S 802, where method 800 selects the top K candidate terms from the ranked list and creates a set of top K candidate terms, where the value of K is pre-configured. Processing then proceeds to step S 804, where method 800 removes any candidate terms from the set if they are also part of the domain lexicon. The remaining terms from the set of top K candidate terms are identified as, simply, "terms," and processing for method 800 completes.
  • Table 900 (see FIG. 9) shows the results of step S 514 on an example corpus. In this example, 'n' equals 4.
  • Table 900 begins with row 902 , which shows the result of a frequent unigram extraction on the example corpus (showing both the extracted unigrams and their corresponding frequencies).
  • Row 904 shows the result of a frequent bigram extraction on the example corpus (showing both the extracted bigrams and their corresponding frequencies).
  • The frequency threshold for the bigram extraction is 30; as such, bigrams with a frequency of less than 30 (none, in this example) will not be included in the output for step S 514.
  • Row 906 shows the result of a frequent trigram extraction on the example corpus (showing both the extracted trigrams and their corresponding frequencies).
  • The frequency threshold for the trigram extraction is 20; as such, trigrams with a frequency of less than 20 (none, in this example) will not be included in the output for step S 514.
  • Row 908 shows the result of a frequent 4-gram extraction on the example corpus (showing both the extracted 4-grams and their corresponding frequencies).
  • The frequency threshold for the 4-gram extraction is 10; as such, 4-grams with a frequency of less than 10 (none, in this example) will not be included in the output for step S 514.
  • The resulting example output of row 908, combined with the output from rows 902, 904, and 906, makes up the entire list of frequent n-grams generated by step S 514.
  • Table 1000 shows the results of step S 516 (see FIG. 5 ), where association rules are generated from the list of frequent n-grams generated by the previous step S 514 .
  • Row 1002 shows the generated forward rules, along with their corresponding confidence values (see discussion of step S 516, above).
  • Row 1004 similarly shows the generated inverse rules, along with their corresponding confidence values (again, see discussion of step S 516 , above).
  • Although a given term can have multiple inverse rules, in the present example, for each term, only the inverse rule with the maximum confidence value is shown.
  • Table 1100 shows the results of steps S 518 and S 520 (see FIG. 5 ), where the n-grams created in step S 514 are filtered using the inverse rules and the forward rules generated in step S 516 .
  • In step S 518, the inverse rules are applied to the list of frequent n-grams. For each inverse rule with a confidence value over a pre-defined threshold (in the present example, 0.8), the term on the right-hand side of the rule is removed from the list of frequent n-grams.
  • Row 1102 of table 1100 shows all of the n-grams that have been filtered using the inverse rules, and row 1104 shows the n-grams that remain after that filtering.
  • In step S 520, the forward rules are applied to the list of frequent n-grams. Each remaining n-gram is removed from the list if it does not have a corresponding forward rule with a confidence value above a pre-defined threshold (in the present example, 0.8).
  • For example, the forward rule "ManufacturerB PhoneA W -> 4G" has a confidence value of 1.00 (which is greater than 0.8), so the term "ManufacturerB PhoneA W 4G" remains on the list.
  • Conversely, the forward rule "Connect -> ManufacturerA" has a confidence value of 0.10 (which is less than 0.8), so the term "Connect ManufacturerA" is removed from the list.
  • Row 1106 of table 1100 shows all of the n-grams that have been filtered using the forward rules, and row 1108 shows the n-grams that remain after that filtering and are considered candidate terms.
  • Table 1200 (see FIG. 12) shows the results of steps S 602, S 604, and S 606 (see FIG. 6).
  • Row 1202 shows the lexicon terms that have been extracted at the beginning of step S 602 . These terms are considered to be high precision domain lexicon terms for the domain of smartphones (collectively, they are referred to as the “high precision domain lexicon,” the “lexicon,” and/or the “lexicon terms”).
  • First, method 600 identifies the n-grams from the corpus that end with any of the terms from the domain lexicon.
  • Method 600 generates inverse rules for these n-grams along with corresponding confidence values. If the confidence of an inverse rule exceeds a pre-determined confidence value threshold (in this case, 0.8), then the method checks if the full term of the inverse rule is included in the high precision domain lexicon. If so, the term on the right-hand side of the rule is removed from the lexicon. If not, then the right-hand side term is replaced in the lexicon by the full term of the inverse rule.
  • Row 1204 of table 1200 shows both of the generated inverse rules that meet the confidence value threshold in the present example embodiment, along with their corresponding confidence values.
  • For the rule "ManufacturerC <- PhoneB 12": because "ManufacturerC PhoneB 12" is already included in the lexicon, "PhoneB 12" (that is, the term on the right-hand side of the rule) is removed from the lexicon. For the rule "ManufacturerB <- PhoneC": because "ManufacturerB PhoneC" is not in the lexicon, "PhoneC" is replaced by "ManufacturerB PhoneC" in the lexicon.
  • The resulting, modified lexicon terms are shown in row 1206 of table 1200.
  • Row 1208 shows the results of step S 604 (see FIG. 6), where the lexicon is scored using a C-Value/NC-Value method. As shown in table 1200, the lexicon terms are ranked based on their respective scores. In the next method step, S 606, the top X terms are selected. In the present case, X equals 5 and the lexicon contains only four terms, so all four lexicon terms are selected, as shown in row 1210 of FIG. 12.
  • Table 1300 (see FIG. 13) shows the results of steps S 608, S 610, and S 612 (see FIG. 6).
  • Processing begins with step S 608 (the results of which are shown in row 1302), where context words are extracted from the corpus. In this example, the window extends to one word before the term and one word after the term. So, when a lexicon term is found in the corpus, the word immediately preceding that lexicon term and the word immediately following it are added to a list of context words.
  • The list of context words is shown in row 1302, where each context word is listed along with the lexicon term(s) used to identify it.
  • In step S 610, a list of various closed-class words (such as determiners, prepositions, pronouns, and conjunctions) is used to reduce the number of words included in the list of context words.
  • Row 1304 of table 1300 shows the results of step S 610 in the present example embodiment, where words such as “to,” “from,” and “your” have been removed from the list.
  • Step S 612 provides weights for the context words, thereby creating weighted term context words.
  • In this embodiment, the weight of a given word is equal to the number of lexicon terms the word appeared with in the corpus divided by the total number of lexicon terms.
  • The resulting weighted context words for the present example embodiment are shown in row 1306 of table 1300.
  • Table 1400 (see FIG. 14) shows the results of steps S 702 and S 704 for the present example embodiment.
  • Processing begins with step S 702 (the results of which are shown in row 1402), where context words for candidate terms are extracted from the corpus with a pre-defined window. In this example, the window extends to one word before the term and one word after the term (as in step S 608). So, when a candidate term is found in the corpus, the word immediately preceding the candidate term and the word immediately following it are extracted and added to a list of context words.
  • Row 1402 shows the extracted context words for the present embodiment, along with their corresponding candidate terms. The number of times a context word appears with each candidate term is denoted in parentheses.
  • Processing proceeds to step S 704, where closed-class context words (such as determiners, prepositions, pronouns, and conjunctions) are removed from the list of context words in a manner similar to the removal of closed-class context words in step S 610.
  • The resulting list of context words is shown in row 1404 of table 1400 (see FIG. 14).
  • Table 1500 shows the results of steps S 706 and S 708 for the present example embodiment.
  • In step S 706, for each candidate term, a contextual similarity analysis is performed between the candidate term's context words (produced in step S 704, discussed above) and the weighted term context words (produced in step S 612, discussed above).
  • The resulting contextual similarity score is calculated by computing the sum, over each of a candidate term's context words, of the candidate term's frequency with that context word multiplied by the context word's weight. If a given context word is not listed in the list of weighted term context words, then the weight of that context word is zero.
  • In step S 802 of this embodiment, the top K candidate terms are selected from the ranked list produced in step S 708 (and shown in row 1504 of table 1500 (see FIG. 15)). In the present example, K equals six.
  • The resulting discovered terms (that is, the top six terms from the ranked list of candidate terms) are shown in row 1602 of table 1600 (see FIG. 16).
  • In step S 804, the discovered terms produced in step S 802 are removed from the list of candidate terms. The terms remaining in the list of candidate terms after this step are shown in row 1604 of table 1600.
  • Processing then returns to step S 410 of method 400 (see FIG. 4), where the newly discovered terms from step S 802 are added to the high precision domain lexicon.
  • The new, modified high precision domain lexicon for the present example embodiment is shown in row 1606 of table 1600.
  • Present invention: should not be taken as an absolute indication that the subject matter described by the term "present invention" is covered by either the claims as they are filed, or by the claims that may eventually issue after patent prosecution. While the term "present invention" is used to help the reader get a general feel for which disclosures herein are believed to potentially be new, this understanding is tentative and provisional, and subject to change over the course of patent prosecution as relevant information is developed and as the claims are potentially amended.
  • Embodiment: see the definition of "present invention" above; similar cautions apply to the term "embodiment."
  • Module/Sub-Module: any set of hardware, firmware, and/or software that operatively works to do some kind of function, without regard to whether the module is: (i) in a single local proximity; (ii) distributed over a wide area; (iii) in a single proximity within a larger piece of software code; (iv) located within a single piece of software code; (v) located in a single storage device, memory, or medium; (vi) mechanically connected; (vii) electrically connected; and/or (viii) connected in data communication.
  • Computer: any device with significant data processing and/or machine readable instruction reading capabilities including, but not limited to: desktop computers, mainframe computers, laptop computers, field-programmable gate array (FPGA) based devices, smart phones, personal digital assistants (PDAs), body-mounted or inserted computers, embedded device style computers, and application-specific integrated circuit (ASIC) based devices.
  • Contextual characteristic: a feature of a term derived from that term's particular usage in a corpus; some examples of possible contextual characteristics include: (i) proximity-related characteristics such as the words located within n words of the term, the words located farther than n words away from the term, and/or the distance between the term and a specific, pre-identified word; (ii) frequency-related characteristics such as the number of times the term appears in the corpus, the most/least number of times the term appears in a sentence, and/or the relative percentage of the term compared to the other terms in the corpus; and/or (iii) usage-related characteristics such as the location of the term in a sentence, the location of the term in a paragraph, whether the term commonly appears in the singular form or in the plural form, whether the term regularly appears as a noun/verb/adjective/adverb/subject/object, the adjectives used to describe the term (when a noun), the adverbs used to describe the term (when a verb), and so on.

Abstract

Software that extracts contextually relevant terms from a text sample (or corpus) by performing the following steps: (i) identifying a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus; (ii) adding the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and (iii) identifying a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).

Description

    BACKGROUND OF THE INVENTION
  • The present invention relates generally to the field of natural language processing, and more particularly to “term extraction.”
  • Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding (that is, enabling computers to derive meaning from human or natural language input).
  • Information Extraction (IE) is a known element of NLP. IE is the task of automatically extracting structured information from unstructured (and/or semi-structured) machine-readable documents. Term Extraction is a sub-task of IE. The goal of Term Extraction is to automatically extract relevant terms from a given text (or “corpus”). Term Extraction is used in many NLP tasks and applications, such as question answering, information retrieval, ontology engineering, semantic web, text summarization, document classification, and clustering. Generally, in term extraction, statistical and machine learning methods may be used to help select relevant terms.
  • Domain ontologies are known. A domain ontology represents concepts which belong to a particular “domain” such as an industry or a genre. In fact, multiple domain ontologies may exist within a single domain due to differences in language, intended use of the ontologies, and different perceptions of the domain. However, since domain ontologies represent concepts in very specific and often eclectic ways, they are often incompatible. In the context of NLP, term extraction becomes difficult when the text being processed belongs to a different domain (for example, medical technology) than the domain from which the NLP software was built (for example, financial news).
  • SUMMARY
  • According to an aspect of the present invention, there is a method, computer program product and/or system that performs the following steps (not necessarily in the following order): (i) identifying a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus; (ii) adding the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and (iii) identifying a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram view of a first embodiment of a system according to the present invention;
  • FIG. 2 is a flowchart showing a first embodiment method performed, at least in part, by the first embodiment system;
  • FIG. 3 is a block diagram view of a machine logic (for example, software) portion of the first embodiment system;
  • FIG. 4 is a flowchart view of a method according to the present invention;
  • FIG. 5 is a flowchart view of a method according to the present invention;
  • FIG. 6 is a flowchart view of a method according to the present invention;
  • FIG. 7 is a flowchart view of a method according to the present invention;
  • FIG. 8 is a flowchart view of a method according to the present invention;
  • FIG. 9 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 10 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 11 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 12 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 13 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 14 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention;
  • FIG. 15 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention; and
  • FIG. 16 is a table view showing information that is generated by and helpful in understanding embodiments of the present invention.
  • DETAILED DESCRIPTION
  • Some embodiments of the present invention extract contextually relevant terms from a text sample (or corpus) by iteratively discovering new terms using weighted “contextual characteristics” of terms discovered in previous iterations. Roughly speaking, a “contextual characteristic” is a feature of a term derived from that term's particular usage in a given corpus (for example, one contextual characteristic is a list of words that commonly precede or follow a given term in the corpus). This Detailed Description section is divided into the following sub-sections: (i) The Hardware and Software Environment; (ii) Example Embodiment; (iii) Further Comments and/or Embodiments; and (iv) Definitions.
  • I. The Hardware and Software Environment
  • The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
  • An embodiment of a possible hardware and software environment for software and/or methods according to the present invention will now be described in detail with reference to the Figures. FIG. 1 is a functional block diagram illustrating various portions of networked computers system 100, including: sub-system 102; client sub-systems 104, 106, 108, 110, 112; communication network 114; computer 200; communication unit 202; processor set 204; input/output (I/O) interface set 206; memory device 208; persistent storage device 210; display device 212; external device set 214; random access memory (RAM) devices 230; cache memory device 232; and program 300.
  • Sub-system 102 is, in many respects, representative of the various computer sub-system(s) in the present invention. Accordingly, several portions of sub-system 102 will now be discussed in the following paragraphs.
  • Sub-system 102 may be a laptop computer, tablet computer, netbook computer, personal computer (PC), a desktop computer, a personal digital assistant (PDA), a smart phone, or any programmable electronic device capable of communicating with the client sub-systems via network 114. Program 300 is a collection of machine readable instructions and/or data that is used to create, manage and control certain software functions that will be discussed in detail, below, in the Example Embodiment sub-section of this Detailed Description section.
  • Sub-system 102 is capable of communicating with other computer sub-systems via network 114. Network 114 can be, for example, a local area network (LAN), a wide area network (WAN) such as the Internet, or a combination of the two, and can include wired, wireless, or fiber optic connections. In general, network 114 can be any combination of connections and protocols that will support communications between server and client sub-systems.
  • Sub-system 102 is shown as a block diagram with many double arrows. These double arrows (no separate reference numerals) represent a communications fabric, which provides communications between various components of sub-system 102. This communications fabric can be implemented with any architecture designed for passing data and/or control information between processors (such as microprocessors, communications and network processors, etc.), system memory, peripheral devices, and any other hardware components within a system. For example, the communications fabric can be implemented, at least in part, with one or more buses.
  • Memory 208 and persistent storage 210 are computer-readable storage media. In general, memory 208 can include any suitable volatile or non-volatile computer-readable storage media. It is further noted that, now and/or in the near future: (i) external device(s) 214 may be able to supply, some or all, memory for sub-system 102; and/or (ii) devices external to sub-system 102 may be able to provide memory for sub-system 102.
  • Program 300 is stored in persistent storage 210 for access and/or execution by one or more of the respective computer processors 204, usually through one or more memories of memory 208. Persistent storage 210: (i) is at least more persistent than a signal in transit; (ii) stores the program (including its soft logic and/or data), on a tangible medium (such as magnetic or optical domains); and (iii) is substantially less persistent than permanent storage. Alternatively, data storage may be more persistent and/or permanent than the type of storage provided by persistent storage 210.
  • Program 300 may include both machine readable and performable instructions and/or substantive data (that is, the type of data stored in a database). In this particular embodiment, persistent storage 210 includes a magnetic hard disk drive. To name some possible variations, persistent storage 210 may include a solid state hard drive, a semiconductor storage device, read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, or any other computer-readable storage media that is capable of storing program instructions or digital information.
  • The media used by persistent storage 210 may also be removable. For example, a removable hard drive may be used for persistent storage 210. Other examples include optical and magnetic disks, thumb drives, and smart cards that are inserted into a drive for transfer onto another computer-readable storage medium that is also part of persistent storage 210.
  • Communications unit 202, in these examples, provides for communications with other data processing systems or devices external to sub-system 102. In these examples, communications unit 202 includes one or more network interface cards. Communications unit 202 may provide communications through the use of either or both physical and wireless communications links. Any software modules discussed herein may be downloaded to a persistent storage device (such as persistent storage device 210) through a communications unit (such as communications unit 202).
  • I/O interface set 206 allows for input and output of data with other devices that may be connected locally in data communication with server computer 200. For example, I/O interface set 206 provides a connection to external device set 214. External device set 214 will typically include devices such as a keyboard, keypad, a touch screen, and/or some other suitable input device. External device set 214 can also include portable computer-readable storage media such as, for example, thumb drives, portable optical or magnetic disks, and memory cards. Software and data used to practice embodiments of the present invention, for example, program 300, can be stored on such portable computer-readable storage media. In these embodiments the relevant software may (or may not) be loaded, in whole or in part, onto persistent storage device 210 via I/O interface set 206. I/O interface set 206 also connects in data communication with display device 212.
  • Display device 212 provides a mechanism to display data to a user and may be, for example, a computer monitor or a smart phone display screen.
  • The programs described herein are identified based upon the application for which they are implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature herein is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature.
  • The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The terminology used herein was chosen to best explain the principles of the embodiment, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
  • II. Example Embodiment
  • FIG. 2 shows flowchart 250 depicting a method according to the present invention. FIG. 3 shows program 300 for performing at least some of the method steps of flowchart 250. This method and associated software will now be discussed, over the course of the following paragraphs, with extensive reference to FIG. 2 (for the method step blocks) and FIG. 3 (for the software blocks).
  • The present embodiment refers extensively to a high precision domain lexicon (HPDL). The HPDL (also referred to as a “set of category related terms”) is a collection of terms (words or sets of words) that belong to a specific domain, category, or genre (“domain”). In term extraction, and more generally in natural language processing, the HPDL can serve as an underlying “knowledge base” for a given domain so as to extract more contextually relevant terms from a piece of text (or corpus). In many embodiments of the present invention, the HPDL is used to: (i) extract contextually relevant terms (term extraction); and (ii) extract additional HPDL-eligible terms in order to grow, strengthen, and/or expand the HPDL.
  • HPDL domains may have multiple categories (or sub-domains). For example, the domain of smartphones may include categories such as smartphone models, smartphone apps, and/or smartphone modes. It is contemplated that the present invention may apply to HPDLs with singular domains, multiple domains, and/or multiple domain categories (or sub-domains).
  • In some embodiments of the present invention, method 250 may begin with an existing, predefined HPDL, while in other embodiments the HPDL may be initially extracted from the corpus using, for example, term extraction methods adapted to achieve high levels of precision. Some known methods for extracting an initial HPDL from the corpus are discussed below in the Further Comments and/or Embodiments Sub-Section of this Detailed Description. In the present example embodiment, the HPDL has a domain of “things that jump” and initially includes the following terms: (i) fox; and (ii) rabbit (in other embodiments, an HPDL including the terms “fox” and “rabbit” might also have a domain of “animals” and a sub-domain of “mammals”).
  • The present embodiment also refers extensively to a corpus. The corpus is a text sample that method 250 extracts relevant terms from. In other words, the corpus is the text that is being acted upon (interpreted, processed, classified, etc.) during term extraction. In the present example embodiment, the corpus includes the following text: “A quick brown fox jumps over the lazy dog, but a quicker, more nimble kangaroo jumps over the fox. The following day, while the kangaroo leaps over the still-lazy dog, a determined frog leaps over a surprisingly speedy sloth.”
  • Processing begins at step S255, where extract candidate terms module (“mod”) 302 extracts candidate terms (also referred to as “relevant terms”) from the corpus. In many embodiments, various statistical methods are used to extract relevant candidate terms. A number of these known methods are discussed below in the Further Comments and/or Embodiments Sub-Section of this Detailed Description. However, these are not meant to be all-inclusive or limiting, as other, less traditional extraction methods may also be used. In other embodiments of the present invention, dictionaries or domain lexicons different from and/or unrelated to the HPDL may be used in this step. For example, in the present example embodiment, candidate terms are extracted from the corpus if they are identified as “animals”. As such, the following terms are extracted from the corpus: (i) fox; (ii) dog; (iii) kangaroo; (iv) frog; and (v) sloth. Furthermore, terms that are already in the HPDL are excluded from the candidate terms list. Therefore, “fox” is not included in the candidate terms list, and the resulting list is as follows: (i) dog; (ii) kangaroo; (iii) frog; and (iv) sloth.
  • Processing proceeds to step S260, where discover new generation mod 304 discovers a new generation of HPDL terms from the candidate terms using the HPDL and its contextual characteristics. This step begins by identifying contextual characteristics (or “initial contextual characteristics”) of the terms in the HPDL. A contextual characteristic is a feature of a term derived from that term's particular usage in a given corpus (for a more complete definition of “contextual characteristic,” see the Definitions Sub-Section of this Detailed Description). In the present example embodiment, the contextual characteristic for each term in the HPDL is the word immediately following that term in the corpus (when a term is the last word in a sentence, it does not have a contextual characteristic). So, in the present embodiment, the only contextual characteristic for the term “fox” (the first HPDL term) is “jumps”, because the only word immediately following “fox” in the corpus is “jumps”. For the second HPDL term, “rabbit”, there are no contextual characteristics, because “rabbit” does not appear in the corpus. As such, the only contextual characteristic of the HPDL is the word “jumps”. It should be noted that although the present embodiment includes a simple example with one contextual characteristic, in many embodiments the HPDL has a plurality of contextual characteristics.
  • Once contextual characteristics for the HPDL have been identified, those characteristics are then applied to the candidate terms. In the present example, the only candidate term to immediately precede the word “jumps” is “kangaroo”. As such, “kangaroo” (the “first term”) is the only term included in the current generation of discovered terms. In other embodiments of the present invention, however, the current generation may include a plurality of discovered terms. In those embodiments, additional steps may be taken to further refine the list of discovered terms (for some examples, see the Further Comments and/or Embodiments Sub-Section of this Detailed Description).
  • Processing proceeds to step S265, where update terms mod 306 adds the current generation of terms to the HPDL. In the present embodiment, the term “kangaroo” is added to the HPDL, with the resulting HPDL (or “revised set of category related terms”) being as follows: (i) fox; (ii) rabbit; and (iii) kangaroo.
  • Processing proceeds to step S270, where update terms mod 306 deletes the current generation of terms from the candidate terms list. In the present embodiment, the term “kangaroo” is removed from the candidate terms list, with the resulting candidate terms list being as follows: (i) dog; (ii) frog; and (iii) sloth.
  • Processing proceeds to step S275, where iterate mod 308 checks to see if method 250 is on its last iteration. In the present embodiment, a total of two iterations are to be performed. As such, method 250 is not on its last iteration (NO), and processing returns to step S260 for another iteration. In other embodiments, however, other tests may be used. For example, in one embodiment, iterations may occur until the HPDL reaches a certain size. In another embodiment, iterations may continue to occur for a certain period of time. In still other embodiments, iterations may continue to occur indefinitely and/or until no further terms for the HPDL are discovered.
  • In the present example, upon returning to step S260, discover new generation mod 304 repeats the process of identifying contextual characteristics of the terms in the HPDL. However, this time, there is an additional term (“kangaroo”) in the HPDL. As a result, an additional contextual characteristic (or “first term contextual characteristic”) is identified: the word “leaps”, which immediately follows the word “kangaroo” in the second sentence of the corpus. As such, when the updated contextual characteristics are applied to the candidate terms, an additional match is found: the term “frog” appears immediately before the word “leaps” in the corpus. As a result, “frog” (the “second term”) is added to the current generation of discovered terms.
  • Processing proceeds to step S265, where update terms mod 306 adds “frog” to the HPDL, resulting in the following HPDL: (i) fox; (ii) rabbit; (iii) kangaroo; and (iv) frog. Processing then proceeds to step S270, where update terms mod 306 removes “frog” from the candidate terms list, with the resulting candidate terms list being as follows: (i) dog; and (ii) sloth.
  • Processing proceeds to step S275. In the present example, two iterations have now completed, which means that method 250 is on its final iteration. Therefore, step S275 resolves to “YES”, and processing proceeds to step S280, where method 250 ends. As a result of executing method 250, the HPDL for the domain of “things that jump” now includes two additional terms (“kangaroo” and “frog”), and will be able to further extract contextually relevant terms in future iterations and/or from different corpora. Additionally, sub-system 102 now also has a list of candidate terms (“dog” and “sloth”) that may be helpful for other natural language processing (NLP) related tasks.
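  • The two-iteration walkthrough above (steps S255 through S280) can be sketched as follows. This is a minimal illustration of the example embodiment only, not the claimed implementation: it assumes simple regular-expression sentence splitting and tokenization, lowercasing, and uses only the single contextual characteristic from the example (the word immediately following a term, with sentence-final occurrences contributing no characteristic).

```python
import re

def sentences_of(corpus):
    """Split the corpus on sentence-ending punctuation and tokenize
    each sentence into lowercase words (hyphenated words kept whole)."""
    return [re.findall(r"[a-z-]+", s.lower())
            for s in re.split(r"[.!?]", corpus) if s.strip()]

def grow_lexicon(corpus, hpdl, candidates, iterations=2):
    """Iteratively move candidate terms into the lexicon (HPDL).
    Each iteration: (1) collect the contextual characteristics of the
    current HPDL terms (word immediately following each occurrence);
    (2) promote every candidate that shares one of those contexts."""
    sents = sentences_of(corpus)
    hpdl, candidates = set(hpdl), set(candidates)
    for _ in range(iterations):
        # contextual characteristics of the current lexicon
        contexts = {s[i + 1] for s in sents
                    for i in range(len(s) - 1) if s[i] in hpdl}
        # candidates whose following word matches a lexicon context
        found = {s[i] for s in sents for i in range(len(s) - 1)
                 if s[i] in candidates and s[i + 1] in contexts}
        hpdl |= found
        candidates -= found
    return hpdl, candidates

corpus = ("A quick brown fox jumps over the lazy dog, but a quicker, more "
          "nimble kangaroo jumps over the fox. The following day, while the "
          "kangaroo leaps over the still-lazy dog, a determined frog leaps "
          "over a surprisingly speedy sloth.")
hpdl, leftovers = grow_lexicon(corpus, {"fox", "rabbit"},
                               {"dog", "kangaroo", "frog", "sloth"})
print(sorted(hpdl))      # ['fox', 'frog', 'kangaroo', 'rabbit']
print(sorted(leftovers)) # ['dog', 'sloth']
```

Iteration one discovers “kangaroo” (it precedes “jumps”, the context of “fox”); iteration two discovers “frog” (it precedes “leaps”, a context contributed by the newly added “kangaroo”), matching the walkthrough.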
  • III. Further Comments and/or Embodiments
  • Some embodiments of the present invention recognize the following facts, potential problems and/or potential areas for improvement with respect to the current state of the art: (i) some existing approaches (including approaches that rely on linguistic processors to extract candidate terms) do not perform well when the corpus (or text) has a different genre (or domain) than the corpus used to build the processor; (ii) some existing approaches rely purely on statistical methods (such as n-gram sequences or topic modeling) to extract candidate terms, thereby negatively affecting system precision; (iii) existing approaches can be configured to provide terms with either high precision or high recall, but not both (thereby negatively affecting the overall accuracy of the system); and/or (iv) existing approaches are unable to discover new domain-specific terms directly from the corpus in a bootstrapping manner.
  • Some embodiments of the present invention may include one, or more, of the following features, characteristics and/or advantages: (i) using contextual similarity with a high precision domain lexicon for ranking; (ii) extracting candidate terms statistically; (iii) using an approach other than singular value decomposition; (iv) extracting terms without using linguistic processors; (v) extracting terms without analyzing linguistic or structural characteristics of a document; (vi) extracting terms without using syntactic and semantic contextual analysis; (vii) extracting terms without using dictionary-based statistics; (viii) extracting terms without using specialized corpora; (ix) extracting terms based on contextual information of a lexicon obtained from a given corpus; and/or (x) using association rules to measure unithood and/or filter candidate terms.
  • Many embodiments of the present invention are adapted to identify terms (nouns or noun phrases) from a corpus with both high precision and recall without using any linguistic processors and open domain linguistic resources (such as dictionaries and ontology). In doing so, these embodiments may include one, or more, of the following features, characteristics and/or advantages: (i) providing an iterative approach to term discovery where discovery depends on weighted contextual characteristics of already discovered terms in previous iterations; (ii) ranking pure statistically extracted candidate terms (N-grams) based on their noun specificity and term specificity determined using weighted contextual similarity with known terms (nouns or noun phrases); (iii) using association rules, filtering candidate terms that cannot exist independently; and/or (iv) validating unithood of candidate terms using association rules.
  • Some embodiments of the present invention may further include one, or more, of the following helpful features, characteristics, and/or advantages: (i) achieving positive results in entity set expansion tasks, where the goal is to identify entities from the corpus in a bootstrapping manner; (ii) performing well on diverse domains such as medical and/or news; (iii) performing better term extraction for any language, including languages for which linguistic processors are not built or do not perform well for; and/or (iv) keeping resources such as dictionaries, lexicons, ontologies, and/or entity lists up-to-date.
  • Method 400 according to the present invention is provided in FIG. 4. Method 400 is adapted to extract terms and their variants from a corpus with both high precision and high recall. High precision and recall are maintained even if the corpus has a domain that is different from the system's source domain, or if the corpus is in a language for which linguistic systems are not available or mature. Processing begins with step S402, where method 400 uses statistical corpus analysis to extract candidate terms. This step S402 uses known (or to be known in the future) statistical approaches (such as frequent item set mining, language modeling, and topic modeling) to extract potential candidate terms from the corpus. Additionally, step S402 filters out irrelevant potential candidate terms using statistical criteria.
  • Processing proceeds to step S404, where method 400 creates a high precision domain lexicon and analyzes contextual information therein. The high precision domain lexicon terms (or “lexicon terms”) are either manually extracted from the corpus or automatically extracted using any system (known or to be known in the future) configured to focus on high precision. Once the lexicon terms have been extracted, context words of those lexicon terms (such as the words appearing before and after the lexicon terms in the corpus) are extracted and weighted to create a set of weighted term context words.
  • Processing proceeds to step S406, where the method 400 ranks the candidate terms (see step S402) based on contextual similarity with the weighted term context words.
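  • The ranking of step S406 might be sketched as follows. The text does not fix a particular weighting or similarity measure, so both choices here are illustrative assumptions: lexicon context words are weighted by raw co-occurrence frequency within a one-word window, and each candidate is scored by cosine similarity against the pooled lexicon context vector.

```python
from collections import Counter
import math

def context_vector(term, sentences, window=1):
    """Bag of words appearing within `window` positions of each
    occurrence of `term` (the term's context words)."""
    vec = Counter()
    for s in sentences:
        for i, w in enumerate(s):
            if w == term:
                vec.update(s[max(0, i - window):i] + s[i + 1:i + 1 + window])
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda c: math.sqrt(sum(x * x for x in c.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def rank_candidates(candidates, lexicon, sentences):
    """Rank candidates by cosine similarity between each candidate's
    context vector and the pooled, frequency-weighted context vector
    of all lexicon terms."""
    lex_vec = Counter()
    for t in lexicon:
        lex_vec.update(context_vector(t, sentences))
    scored = [(c, cosine(context_vector(c, sentences), lex_vec))
              for c in candidates]
    return sorted(scored, key=lambda p: p[1], reverse=True)

sents = [["a", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
         ["the", "nimble", "kangaroo", "jumps", "over", "the", "fox"]]
print(rank_candidates(["kangaroo", "dog"], ["fox"], sents))
# "kangaroo" ranks above "dog" (it shares the context word "jumps" with "fox")
```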
  • Processing proceeds to step S408, where method 400 selects the top candidate terms as discovered terms. The number of top candidate terms to be selected is a preconfigured value that depends on application and business context.
  • Processing proceeds to step S410, where the newly discovered terms (that is, the top candidate terms) are added to the high precision domain lexicon. Once the newly discovered terms have been added, they are deleted from the candidate terms list.
  • Processing proceeds to step S412, where the method 400 compares an iteration count with a pre-defined iteration threshold (where the iteration threshold is determined based on application and business context). Processing then proceeds to step S414. If the iteration count is less than the iteration threshold (NO), processing returns to step S404 to discover more relevant terms from the corpus. If the iteration count is greater than or equal to the iteration threshold, however, processing for method 400 completes. As a result of method 400 completing: (i) the high precision domain lexicon now includes additional relevant domain terms; and (ii) the list of candidate terms includes additional contextually relevant terms that may be used for natural language processing or other tasks.
  • In some embodiments of the present invention, step S402 (Extract Candidate Terms using Statistical Corpus Analysis, see FIG. 4) further includes method 500 (see FIG. 5). Method 500 is adapted to apply statistic approaches to the corpus to extract potential candidate terms. The potential candidate terms are passed through statistical filters to identify relevant terms, resulting in new candidate terms. Processing begins with step S502, where method 500 extracts text from the corpus and then applies heuristics-based sentence splitters on the text to extract sentences.
  • Processing proceeds to step S504, where various statistical methods are applied to the extracted sentences for candidate term extraction. The method to be used is typically determined based on a few factors: (i) the type of document the corpus is (for example, a web page, a text book, or a manual); (ii) the length of the corpus (for example, the number of words in the corpus); and/or (iii) the general domain of the corpus (for example, healthcare, finance, or telecommunication). Although many methods may be used, two are discussed below: (i) a statistical language modeling method (beginning with step S506); and (ii) an associated rule mining method (beginning with step S514).
  • If the statistical language modeling method is chosen, processing proceeds to step S506, where the method 500 extracts n-grams from each extracted sentence in the corpus for each preconfigured value of n. N may be determined in a number of ways, including, for example, by conducting experiments or by prioritizing certain features (such as speed vs. accuracy). An n-gram is a contiguous sequence of words from a given extracted sentence, where ‘n’ represents the number of words in the sequence. All unique n-grams of the corpus are considered as potential candidate terms.
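  • Step S506 can be sketched as follows, assuming tokenization has already been performed by step S502 and representing each n-gram as a tuple of words. This is an illustrative sketch, not the claimed implementation.

```python
def extract_ngrams(sentences, max_n=3):
    """Collect all unique n-grams (word tuples) for n = 1..max_n from
    already-tokenized sentences; each unique n-gram becomes a
    potential candidate term."""
    ngrams = set()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngrams.add(tuple(tokens[i:i + n]))
    return ngrams

print(sorted(extract_ngrams([["quick", "brown", "fox"]], max_n=2)))
# [('brown',), ('brown', 'fox'), ('fox',), ('quick',), ('quick', 'brown')]
```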
  • Processing proceeds to steps S508 and S510, where method 500 scores the potential candidate terms based on their termhood and unithood, respectively. Termhood (as used in step S508) scores the validity of the potential candidate term as a representative for the corpus content as a whole using one or more statistical measures now known (or to be known in the future). In one embodiment, a measure of frequency in a corpus is used. In another embodiment, a measure of ‘weirdness’ (the term's frequency in the corpus compared to its frequency in a reference corpus) is used. In yet another embodiment, a measure of the pertinence or specificity of the term to a particular domain is used.
  • Unithood (as used in step S510) scores the collocation strength (that is, how strongly the constituent words of a potential candidate term attach to one another) using one or more statistical measures now known (or to be known in the future). In one embodiment, a mutual information test is used. In another embodiment, a t-test is used.
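Steps S508 and S510 can be illustrated with two commonly used measures: "weirdness" for termhood and pointwise mutual information for unithood. This is a minimal sketch with invented counts; the patent permits any suitable statistical measure, and the add-one smoothing of the reference frequency is an assumption, not part of the disclosure:

```python
import math

def weirdness(term_freq, corpus_size, ref_freq, ref_size):
    """Termhood via 'weirdness': relative frequency in the domain corpus
    divided by relative frequency in a reference corpus (add-one smoothed)."""
    return (term_freq / corpus_size) / ((ref_freq + 1) / ref_size)

def pmi(bigram_freq, w1_freq, w2_freq, corpus_size):
    """Unithood via pointwise mutual information between a bigram's parts."""
    p_xy = bigram_freq / corpus_size
    p_x = w1_freq / corpus_size
    p_y = w2_freq / corpus_size
    return math.log2(p_xy / (p_x * p_y))

# "mobile phone": 50 occurrences in a 10,000-word domain corpus,
# but only 2 in a 1,000,000-word reference corpus
w = weirdness(50, 10_000, 2, 1_000_000)   # 0.005 / 0.000003, roughly 1666.7
score = pmi(50, 80, 60, 10_000)           # log2(0.005 / (0.008 * 0.006))
```

A high weirdness score suggests the term is specific to the domain corpus; a high PMI suggests the words co-occur far more often than chance, supporting the term's unithood.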
  • Once steps S508 and S510 have completed, processing proceeds to step S512. In step S512, potential candidate terms with termhood and unithood scores above pre-defined thresholds are selected and identified as candidate terms, and processing for method 500 completes.
  • If the association rule mining method is chosen in step S504 (as opposed to the statistical language modeling method discussed above), processing proceeds to step S514. In this step, method 500 uses an algorithm to extract frequent n-grams. The extracted frequent n-grams are identified as potential candidate terms.
  • An example of a way to extract frequent n-grams (sets of words occurring in a specific order) is to extract n-grams meeting the following criteria: (i) the n-grams are frequent; and (ii) the order-preserving subsets of the n-grams are also frequent. In other words, in this embodiment, method 500 first extracts unigrams (i.e. single words) that are frequent (that is, they have a frequency above a pre-defined threshold). Then, method 500 extracts frequent bigrams (i.e. two-word phrases) that are made up of the previously identified unigrams. This continues for n steps (where n equals the number of words in the phrase being analyzed). Another way to express this example extraction method is to say that for n>1, the system extracts n-grams that are frequent and also include frequent (n−1)-grams. For n=1, the system extracts unigrams that are frequent.
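The level-wise extraction described above can be sketched as follows (an illustrative Python sketch; the per-n thresholds and the toy sentences are assumptions chosen so that the pruning is visible):

```python
from collections import Counter

def frequent_ngrams(sentences, thresholds):
    """Level-wise extraction: an n-gram (n > 1) is kept only if it is
    frequent AND both of its contiguous (n-1)-grams are already frequent."""
    tokenized = [s.split() for s in sentences]
    frequent = {}
    for n, threshold in enumerate(thresholds, start=1):
        counts = Counter(
            tuple(words[i:i + n])
            for words in tokenized
            for i in range(len(words) - n + 1)
        )
        level = {}
        for gram, freq in counts.items():
            if freq < threshold:
                continue
            # prune: every (n-1)-gram inside must already be frequent
            if n > 1 and (gram[:-1] not in frequent or gram[1:] not in frequent):
                continue
            level[gram] = freq
        frequent.update(level)
    return frequent

res = frequent_ngrams(
    ["mobile phone a is great", "mobile phone a sale", "new phone sale"],
    thresholds=(2, 2, 2),
)
# ("mobile", "phone", "a") survives all three levels with frequency 2
```

The thresholds per level mirror the decreasing frequency thresholds used in the worked example of table 900 (30 for bigrams, 20 for trigrams, 10 for 4-grams).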
  • Processing proceeds to step S516, where the method 500 analyzes each potential candidate term and generates association rules along with corresponding confidence values. To generate association rules, step S516 performs three tasks: (i) for every potential candidate term t, all non-empty ordered subsets s are generated; (ii) for every subset s of t, a forward rule, “s->t−s”, along with its confidence (measured as the frequency of t divided by the frequency of s), is generated; and (iii) for every subset s of t, an inverse rule, “s<-t−s”, along with its confidence, is generated. To provide an example, in one embodiment of the invention, term t is “Mobile Phone A”. Applying task (i), the subsets for “Mobile Phone A” are: (a) “Mobile”; (b) “Phone”; (c) “A”; (d) “Mobile Phone”; and (e) “Phone A”. Applying task (ii), a forward rule for “Mobile Phone A” is “Mobile Phone->A”. And applying task (iii), an inverse rule is “Mobile<-Phone A”.
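Tasks (ii) and (iii) can be sketched as follows (an illustrative sketch, not the patent's implementation; for brevity it enumerates only the contiguous prefix/suffix splits of a term, which is what the “Mobile Phone->A” and “Mobile<-Phone A” examples use, and the frequency table is invented):

```python
def association_rules(term, freqs):
    """For each split of `term` (a word tuple) into prefix and suffix,
    emit a forward rule (prefix -> suffix, conf = f(term)/f(prefix))
    and an inverse rule (prefix <- suffix, conf = f(term)/f(suffix))."""
    fwd, inv = {}, {}
    for k in range(1, len(term)):
        prefix, suffix = term[:k], term[k:]
        fwd[(prefix, suffix)] = freqs[term] / freqs[prefix]
        inv[(prefix, suffix)] = freqs[term] / freqs[suffix]
    return fwd, inv

freqs = {
    ("mobile",): 100, ("phone",): 120, ("a",): 90,
    ("mobile", "phone"): 80, ("phone", "a"): 40,
    ("mobile", "phone", "a"): 40,
}
fwd, inv = association_rules(("mobile", "phone", "a"), freqs)
# forward "mobile phone -> a": 40 / 80 = 0.5
# inverse "mobile <- phone a": 40 / 40 = 1.0
```

An inverse confidence of 1.0 here means “phone a” never appears without “mobile” in front of it, which is exactly the signal the filtering step exploits.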
  • Processing proceeds to step S518, where n-grams are filtered using the inverse rules created in step S516. In this step, term variations are identified and removed based on their confidence score. The method 500 identifies a term variation if an inverse rule from term variation to term has a confidence score above a predefined threshold (determined experimentally, for example). For example, “Mobile Phone A” has one inverse rule with a confidence score above the predefined threshold: “Mobile<-Phone A”. Because the confidence score is over the threshold, the method 500 identifies that “Phone A” is a variation of “Mobile Phone A” and removes “Phone A” from the list of potential candidate terms.
  • Processing proceeds to step S520, where n-grams are filtered using forward rules (which serve as a measure of unithood for potential candidate terms). The confidence of a forward rule provides the probability of the order of term constituents. If none of the forward rules for a term have a confidence score above a pre-defined threshold, that term is removed from the list of potential candidate terms. For example, the term “Manufacturer launches new” has two forward rules: “Manufacturer launches->new” and “Manufacturer->launches new”. Because neither forward rule has a confidence level above the threshold, the term is removed from the list of potential candidate terms.
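The two filtering passes of steps S518 and S520 can be combined into one sketch (illustrative only; rules are represented as (prefix, suffix) word-tuple pairs mapped to confidence values, and all names and numbers are invented to mirror the worked example):

```python
def filter_candidates(candidates, fwd_rules, inv_rules, threshold=0.8):
    """S518: drop suffix variations implied by confident inverse rules.
    S520: keep multi-word terms only if some forward rule is confident."""
    kept = set(candidates)
    # S518: a confident inverse rule prefix <- suffix means the suffix
    # rarely stands alone, so remove it as a variation of the full term
    for (prefix, suffix), conf in inv_rules.items():
        if conf > threshold:
            kept.discard(suffix)
    # S520: a multi-word term needs at least one confident forward rule
    for term in list(kept):
        if len(term) > 1:
            confs = [c for (p, s), c in fwd_rules.items() if p + s == term]
            if not any(c > threshold for c in confs):
                kept.discard(term)
    return kept

candidates = {("phone", "d"), ("manufacturera", "phone", "d"),
              ("connect", "manufacturera")}
inv_rules = {(("manufacturera",), ("phone", "d")): 1.00}
fwd_rules = {(("manufacturera", "phone"), ("d",)): 0.90,
             (("connect",), ("manufacturera",)): 0.10}
remaining = filter_candidates(candidates, fwd_rules, inv_rules)
# only ("manufacturera", "phone", "d") survives both passes
```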
  • Upon completing step S520, processing for method 500 completes, resulting in a new list of candidate terms from the remaining potential candidate terms.
  • In some embodiments of the present invention, step S404 (Analyzing Contextual Information of High Precision Domain Lexicon, see FIG. 4) further includes method 600 shown in FIG. 6. Processing begins with step S602, where a high precision domain lexicon is created (either manually or automatically using methods configured to focus on high precision) or provided from previous iterations of method 600. Term variations from the lexicon are then filtered and/or replaced using previously generated inverse rules, if available (for example, from step S518 (see FIG. 5)). A term from the lexicon is identified as a term variation if an inverse rule from the term to some other longer term has a confidence score above a pre-defined threshold. If the longer term is a part of the lexicon, then the term identified as a term variation is removed. For example, if “Mobile Phone A” and “Phone A” are in the lexicon and inverse rule “Mobile<-Phone A” has a confidence level above the threshold, then “Phone A” is removed from the lexicon. If the longer term is not part of the lexicon, then the lexicon term is replaced with the longer term. For example, if “Phone” is in the lexicon and the inverse rule “Mobile<-Phone” has a confidence level above the threshold, then “Phone” is replaced with “Mobile Phone”.
  • Processing proceeds to step S604, where method 600 scores lexicon terms and ranks them based on their scores. Scoring may be performed by a variety of methods now known (or to be known in the future), and may be based on properties such as term frequency observed in a given corpus. Processing proceeds to step S606, where the top X terms are selected, where X is pre-defined (and determined experimentally, for example).
  • Processing proceeds to step S608, where context words are extracted from the corpus. First, occurrences of lexicon terms within a given corpus are identified. Then, for each occurrence, context words are extracted per a pre-defined window size (for example, the two words before and the two words after a lexicon term). The words within the window are identified as context words and added to a list of context words. Processing proceeds to step S610, where closed class context words (such as determiners, prepositions, pronouns, and/or conjunctions) are removed from the list of context words.
  • Processing proceeds to step S612, where each context word is weighted. In some embodiments, the weight of a context word equals the number of unique terms the context word appears in divided by the total number of terms. Processing for method 600 concludes with a list of weighted term context words.
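Steps S608 through S612 can be sketched end to end (illustrative Python only; the closed-class word list, the one-word window, and the toy token stream are assumptions made for the example):

```python
CLOSED_CLASS = {"a", "an", "the", "to", "from", "your", "and", "or", "it"}

def context_words(corpus_tokens, terms, window=1):
    """Map each context word to the set of lexicon terms it appears near."""
    seen = {}
    for term in terms:
        t = term.split()
        for i in range(len(corpus_tokens) - len(t) + 1):
            if corpus_tokens[i:i + len(t)] == t:
                before = corpus_tokens[max(0, i - window):i]
                after = corpus_tokens[i + len(t):i + len(t) + window]
                for word in before + after:
                    if word not in CLOSED_CLASS:  # step S610 filtering
                        seen.setdefault(word, set()).add(term)
    return seen

def weights(seen, total_terms):
    """Step S612: weight = number of unique terms a context word appears
    with, divided by the total number of lexicon terms."""
    return {word: len(ts) / total_terms for word, ts in seen.items()}

tokens = "buy phoneb today or buy phonec online".split()
seen = context_words(tokens, ["phoneb", "phonec"])
w = weights(seen, total_terms=2)
# "buy" precedes both terms, so its weight is 2/2 = 1.0
```

A weight near 1.0 marks a context word that accompanies most lexicon terms, making it a strong contextual signature for the domain.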
  • In some embodiments of the present invention, step S406 (Contextual Similarity based ranking of Candidate Terms, see FIG. 4) further includes method 700 shown in FIG. 7. Processing begins with step S702, where context words for candidate terms are extracted. First, occurrences of each candidate term (see discussion of method 500, above) in the corpus are identified. Then, from each occurrence, context words are extracted per a pre-defined window size (for example, the two words before and the two words after each candidate term).
  • Processing proceeds to step S704, where closed class context words (such as determiners, prepositions, pronouns, and/or conjunctions) are removed from the list of context words. The remaining context words (“candidate term context words”) are selected, stored (along with their frequency), and mapped to their corresponding candidate terms.
  • Processing proceeds to step S706, where the contextual similarity between candidate term context words (see step S704, above) and weighted term context words (see discussion of method 600, above) is measured by a contextual similarity score. The contextual similarity score may be obtained by a number of methods now known or to be known in the future. In one example embodiment, the contextual similarity score is represented by the equation “Σi Wi*Fi”, where: ‘i’ ranges over the distinct context words of a candidate term; ‘Wi’ equals the weight of context word i in the set of weighted term context words (‘Wi’ equals zero when the context word is not in that set); and ‘Fi’ equals the frequency of context word i with respect to the candidate term in a given corpus.
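The score Σi Wi*Fi translates directly into code (a minimal sketch; the example weights and frequencies are invented):

```python
def contextual_similarity(candidate_context_freqs, term_context_weights):
    """Score a candidate term: sum, over its context words, of the
    context word's frequency with the candidate times that word's weight
    among the weighted term context words (0 if the word is absent)."""
    return sum(
        freq * term_context_weights.get(word, 0.0)
        for word, freq in candidate_context_freqs.items()
    )

term_weights = {"buy": 1.0, "launch": 0.5}
# the candidate appeared 3 times next to "buy" and twice next to "cheap"
score = contextual_similarity({"buy": 3, "cheap": 2}, term_weights)
# 3 * 1.0 + 2 * 0.0 = 3.0
```

Candidates whose context words overlap heavily with the lexicon's weighted context words score high and rise in the ranking of step S708.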
  • Processing proceeds to step S708, where candidate terms are ranked based on the contextual similarity score obtained in step S706 (and discussed in the preceding paragraph). The result of this step is a list of ranked candidate terms.
  • In some embodiments of the present invention, step S408 (Discover New Terms, see FIG. 4) further includes method 800 shown in FIG. 8. Processing begins with step S802, where method 800 selects the top K candidate terms from the ranked list and creates a set of top K candidate terms, where the value of K is pre-configured. Processing then proceeds to step S804, where method 800 removes any candidate terms from the set if they are also part of the domain lexicon. The remaining terms from the set of top K candidate terms are identified as, simply, “terms” and processing for method 800 completes.
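Steps S802 and S804 amount to a top-K selection followed by a lexicon filter (a minimal sketch; the candidate and lexicon terms are invented):

```python
def discover_terms(ranked_candidates, lexicon, k):
    """Steps S802/S804: take the top K ranked candidates, then drop any
    that are already in the domain lexicon."""
    top_k = ranked_candidates[:k]
    known = set(lexicon)
    return [term for term in top_k if term not in known]

ranked = ["phoned", "mobile phone a", "phoneb", "new sale"]
new_terms = discover_terms(ranked, lexicon=["phoneb"], k=3)
# top 3 are phoned, mobile phone a, phoneb; phoneb is already known
```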
  • For explanation purposes, an example embodiment demonstrating the present invention and portions of the above-discussed methods 400, 500, 600, 700, 800 (see FIGS. 4, 5, 6, 7, and 8) is provided. Referring first to method 500 (see FIG. 5), table 900 (see FIG. 9) shows the results of step S514 on an example corpus. In this example, ‘n’ equals ‘4’. Table 900 begins with row 902, which shows the result of a frequent unigram extraction on the example corpus (showing both the extracted unigrams and their corresponding frequencies).
  • Referring still to table 900, row 904 shows the result of a frequent bigram extraction on the example corpus (showing both the extracted bigrams and their corresponding frequencies). The frequency threshold for the bigram extraction is 30; as such, bigrams with a frequency of less than 30 (none, in this example) will not be included in the output for step S514.
  • Referring still to table 900 (see FIG. 9), row 906 shows the result of a frequent trigram extraction on the example corpus (showing both the extracted trigrams and their corresponding frequencies). The frequency threshold for the trigram extraction is 20; as such, trigrams with a frequency of less than 20 (none, in this example) will not be included in the output for step S514.
  • Still referring to table 900 (see FIG. 9), row 908 shows the result of a frequent 4-gram extraction on the example corpus (showing both the extracted 4-grams and their corresponding frequencies). The frequency threshold for the 4-gram extraction is 10; as such, 4-grams with a frequency of less than 10 (none, in this example) will not be included in the output for step S514. The resulting example output of row 908, combined with the output from rows 902, 904, and 906, makes up the entire list of frequent n-grams generated by step S514.
  • Table 1000 (see FIG. 10) shows the results of step S516 (see FIG. 5), where association rules are generated from the list of frequent n-grams generated by the previous step S514. Specifically, row 1002 (see FIG. 10) shows the generated forward rules, along with their corresponding confidence values (see discussion of step S516, above). Although a given term can have multiple forward rules, in the present example, for each term, only the forward rule with the maximum confidence value is shown. Row 1004 (see FIG. 10) similarly shows the generated inverse rules, along with their corresponding confidence values (again, see discussion of step S516, above). Although a given term can have multiple inverse rules, in the present example, for each term, only the inverse rule with the maximum confidence value is shown.
  • Table 1100 (see FIG. 11) shows the results of steps S518 and S520 (see FIG. 5), where the n-grams created in step S514 are filtered using the inverse rules and the forward rules generated in step S516. In step S518, the inverse rules are applied to the list of frequent n-grams. For each inverse rule over a pre-defined confidence value threshold (in the present example, 0.8), the term on the right-hand side of the rule is removed from the list of frequent n-grams. The general reasoning for this is that when an inverse rule has a high confidence value, it is unlikely that the term on the right-hand side would exist independently, separate from the term on the left-hand side. To provide an example, because the inverse rule “ManufacturerA<-PhoneD” has a confidence value of 1.00, “PhoneD,” which is on the right-hand side of the rule, is removed from the list of frequent n-grams (as “PhoneD” is unlikely to appear without “ManufacturerA” as a prefix). Row 1102 of table 1100 shows all of the n-grams that have been filtered using the inverse rules, and row 1104 shows the n-grams that remain after that filtering.
  • In step S520, the forward rules are applied to the list of frequent n-grams. For each remaining n-gram in the list of frequent n-grams, the n-gram is removed from the list if it doesn't have a corresponding forward rule above a pre-defined confidence value threshold (in the present example, 0.8). To provide an example, because forward rule “ManufacturerB PhoneA W->4G” has a confidence value of 1.00 (which is greater than 0.8), the term “ManufacturerB PhoneA W 4G” remains on the list. Conversely, because the forward rule “Connect->ManufacturerA” has a confidence value of 0.10, the term “Connect ManufacturerA” is removed from the list. Row 1106 of table 1100 shows all of the n-grams that have been filtered using the forward rules, and row 1108 shows the n-grams that remain after that filtering and are considered candidate terms.
  • Referring now to method 600 (see FIG. 6), table 1200 (see FIG. 12) shows the results of steps S602, S604, and S606. Row 1202 shows the lexicon terms that have been extracted at the beginning of step S602. These terms are considered to be high precision domain lexicon terms for the domain of smartphones (collectively, they are referred to as the “high precision domain lexicon,” the “lexicon,” and/or the “lexicon terms”).
  • Continuing with step S602, method 600 identifies the n-grams from the corpus that end with any of the terms from the domain lexicon. Method 600 generates inverse rules for these n-grams along with corresponding confidence values. If the confidence of an inverse rule exceeds a pre-determined confidence value threshold (in this case, 0.8), then the method checks if the full term of the inverse rule is included in the high precision domain lexicon. If so, the term on the right-hand side of the rule is removed from the lexicon. If not, then the right-hand side term is replaced in the lexicon by the full term of the inverse rule. To provide an example of this, row 1204 of table 1200 shows both of the generated inverse rules that meet the confidence value threshold in the present example embodiment, along with their corresponding confidence values. For the first rule, “ManufacturerC<-PhoneB 12,” because “ManufacturerC PhoneB 12” is already included in the lexicon, “PhoneB 12” (that is, the term on the right-hand side of the rule) is removed from the lexicon. Conversely, for the second rule, “ManufacturerB<-PhoneC,” because “ManufacturerB PhoneC” is not in the lexicon, “PhoneC” is replaced by “ManufacturerB PhoneC” in the lexicon. The resulting, modified lexicon terms are shown in row 1206 of table 1200.
  • Still referring to table 1200 (see FIG. 12), row 1208 shows the results of step S604 (see FIG. 6), where the lexicon is scored using a C-Value/NC-Value method. As shown in table 1200, the lexicon terms are ranked based on their respective scores. In the next step, S606, the top X terms are selected. In the present case, X equals 5, so all four of the lexicon terms are selected, as shown in row 1210 of table 1200 (see FIG. 12).
  • Referring still to the present example embodiment, table 1300 (see FIG. 13) shows the results of steps S608, S610, and S612 (see FIG. 6). In step S608 (shown in row 1302), context words for lexicon terms are extracted from the corpus with a pre-defined window. In the present embodiment, the window extends to one word before the term and one word after the term. So, when a lexicon term is found in the corpus, the word immediately preceding that lexicon term and the word immediately following the lexicon term are added to a list of context words. The list of context words is shown in row 1302, where each context word is listed along with the lexicon term(s) used to identify the context word.
  • Proceeding to step S610, a list of various closed-class words (such as determiners, prepositions, pronouns, and conjunctions) is used to reduce the number of words included in the list of context words. Row 1304 of table 1300 (see FIG. 13) shows the results of step S610 in the present example embodiment, where words such as “to,” “from,” and “your” have been removed from the list.
  • Referring still to table 1300 (see FIG. 13), step S612 provides weights for the context words, thereby creating weighted term context words. As mentioned above in the discussion of step S612, the weight of a given word is equal to the number of lexicon terms the word appeared with in the corpus divided by the total number of lexicon terms. The resulting weighted context words for the present example embodiment are shown in row 1306 of table 1300.
  • Referring now to method 700 (see FIG. 7), table 1400 (see FIG. 14) shows the results of steps S702 and S704 for the present example embodiment. In step S702 (the results of which are shown in row 1402), context words for candidate terms (see discussion of method 500, above) are extracted from the corpus with a predefined window. In the present embodiment, the window extends to one word before the term and one word after the term (as in step S608). So, when a candidate term is found in the corpus, the word immediately preceding the candidate term and the word immediately following the candidate term are extracted and added to a list of context words. Row 1402 shows the extracted context words for the present embodiment, along with their corresponding candidate terms. The number of times a context word appears with each candidate term is denoted in parentheses.
  • Processing continues to step S704, where closed-class context words (such as determiners, prepositions, pronouns, and conjunctions) are removed from the list of context words in a manner similar to the removal of closed-class context words in step S610. The resulting list of context words is shown in row 1404 of table 1400 (see FIG. 14).
  • Table 1500 (see FIG. 15) shows the results of steps S706 and S708 for the present example embodiment. In step S706, for each candidate term, a contextual similarity analysis is performed between the candidate term's context words (produced in step S704, discussed above) and the weighted term context words (produced in step S612, discussed above). In the present embodiment, the resulting contextual similarity score is calculated by computing the sum, for each of a candidate term's context words, of the candidate term's frequency with that context word multiplied by the context word's weight. If the given context word is not listed in the list of weighted term context words, then the weight of the context word is zero. The calculations for computing the contextual similarity score for each candidate term in the present example embodiment are shown in row 1502 of table 1500 (see FIG. 15). In the next row 1504, the candidate terms are listed according to their resulting contextual similarity scores (as a result of the contextual similarity score ranking that occurs in step S708).
  • Referring now to method 800, table 1600 (see FIG. 16) shows the results of steps S802 and S804 (see FIG. 8) for the present example embodiment. In step S802 of this embodiment, the top K candidate terms are selected from the ranked list produced in step S708 (and shown in row 1504 of table 1500 (see FIG. 15)). In this example, K equals six. The resulting discovered terms (that is, the top six terms from the ranked list of candidate terms) are shown in row 1602 of table 1600 (see FIG. 16). In the following step S804, the discovered terms produced in step S802 are removed from the list of candidate terms. The terms remaining in the list of candidate terms after this step are shown in row 1604 of table 1600.
  • After completion of the steps in method 800, processing returns back to step S410 in method 400 (see FIG. 4), where the newly discovered terms from step S802 are added to the high precision domain lexicon. The new, modified, high precision domain lexicon for the present example embodiment is shown in row 1606 of table 1600.
  • IV. Definitions
  • Present invention: should not be taken as an absolute indication that the subject matter described by the term “present invention” is covered by either the claims as they are filed, or by the claims that may eventually issue after patent prosecution; while the term “present invention” is used to help the reader get a general feel for which disclosures herein are believed to possibly be new, this understanding, as indicated by use of the term “present invention,” is tentative and provisional and subject to change over the course of patent prosecution as relevant information is developed and as the claims are potentially amended.
  • Embodiment: see definition of “present invention” above—similar cautions apply to the term “embodiment.”
  • and/or: inclusive or; for example, A, B “and/or” C means that at least one of A or B or C is true and applicable.
  • Module/Sub-Module: any set of hardware, firmware and/or software that operatively works to do some kind of function, without regard to whether the module is: (i) in a single local proximity; (ii) distributed over a wide area; (iii) in a single proximity within a larger piece of software code; (iv) located within a single piece of software code; (v) located in a single storage device, memory or medium; (vi) mechanically connected; (vii) electrically connected; and/or (viii) connected in data communication.
  • Computer: any device with significant data processing and/or machine readable instruction reading capabilities including, but not limited to: desktop computers, mainframe computers, laptop computers, field-programmable gate array (FPGA) based devices, smart phones, personal digital assistants (PDAs), body-mounted or inserted computers, embedded device style computers, application-specific integrated circuit (ASIC) based devices.
  • Contextual characteristic: a feature of a term derived from that term's particular usage in a corpus; some examples of possible contextual characteristics include: (i) proximity related characteristics such as the words located within n words of the term, the words located farther than n words away from a term, and/or the distance between the term and a specific, pre-identified word; (ii) frequency-related characteristics such as the number of times the term appears in the corpus, the most/least number of times the term appears in a sentence, and/or the relative percentage of the term compared to the other terms in the corpus; and/or (iii) usage-related characteristics such as the location of the term in a sentence, the location of the term in a paragraph, whether the term commonly appears in the singular form or in plural form, whether the term regularly appears as a noun/verb/adjective/adverb/subject/object, the adjectives used to describe the term (when a noun), the adverbs used to describe a term (when a verb), the nouns the term typically describes (when an adjective), the verbs the term typically describes (when an adverb), the object of the term (when a subject), and/or the subject of the term (when an object).

Claims (14)

1-7. (canceled)
8. A computer program product comprising a computer readable storage medium having stored thereon:
first program instructions programmed to identify a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus;
second program instructions programmed to add the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and
third program instructions programmed to identify a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).
9. The computer program product of claim 8, further comprising:
fourth program instructions programmed to add the second term to the revised set of category related term(s), thereby creating a second revised set of category related term(s) and a set of second term contextual characteristic(s), where each second term contextual characteristic of the set of second term contextual characteristic(s) relates to the contextual use of the second term in the corpus; and
fifth program instructions programmed to identify a third term from the corpus, based, at least in part, on the set of second term contextual characteristic(s).
10. The computer program product of claim 8, wherein:
the identifying of the second term from the corpus is further based, at least in part, on the set of initial contextual characteristic(s).
11. The computer program product of claim 8, further comprising:
fourth program instructions programmed to create the set of category related term(s), where at least one category related term of the set of category related term(s) is extracted from the corpus using a precision oriented extraction method.
12. The computer program product of claim 8, wherein:
the first term belongs to a set of relevant term(s), where each relevant term of the set of relevant term(s) is extracted from the corpus using a statistical extraction method.
13. The computer program product of claim 8, wherein:
each initial contextual characteristic of the set of initial contextual characteristic(s) includes a contextual weight corresponding to the respective initial contextual characteristic's use in the corpus.
14. The computer program product of claim 8, wherein:
the identifying of the first term in the corpus is further based, at least in part, on a weighted strength of a match between the first term and the respective contextual weights of each initial contextual characteristic in the set of initial contextual characteristic(s).
15. A computer system comprising:
a processor(s) set; and
a computer readable storage medium;
wherein:
the processor set is structured, located, connected and/or programmed to run program instructions stored on the computer readable storage medium; and
the program instructions include:
first program instructions programmed to identify a first term from a corpus, based, at least in part, on a set of initial contextual characteristic(s), where each initial contextual characteristic of the set of initial contextual characteristic(s) relates to the contextual use of at least one category related term of a set of category related term(s) in the corpus;
second program instructions programmed to add the first term to the set of category related term(s), thereby creating a revised set of category related term(s) and a set of first term contextual characteristic(s), where each first term contextual characteristic of the set of first term contextual characteristic(s) relates to the contextual use of the first term in the corpus; and
third program instructions programmed to identify a second term from the corpus, based, at least in part, on the set of first term contextual characteristic(s).
16. The computer system of claim 15, further comprising:
fourth program instructions programmed to add the second term to the revised set of category related term(s), thereby creating a second revised set of category related term(s) and a set of second term contextual characteristic(s), where each second term contextual characteristic of the set of second term contextual characteristic(s) relates to the contextual use of the second term in the corpus; and
fifth program instructions programmed to identify a third term from the corpus, based, at least in part, on the set of second term contextual characteristic(s).
17. The computer system of claim 15, wherein:
the identifying of the second term from the corpus is further based, at least in part, on the set of initial contextual characteristic(s).
18. The computer system of claim 15, further comprising:
fourth program instructions programmed to create the set of category related term(s), where at least one category related term of the set of category related term(s) is extracted from the corpus using a precision oriented extraction method.
19. The computer system of claim 15, wherein:
the first term belongs to a set of relevant term(s), where each relevant term of the set of relevant term(s) is extracted from the corpus using a statistical extraction method.
20. The computer system of claim 15, wherein:
each initial contextual characteristic of the set of initial contextual characteristic(s) includes a contextual weight corresponding to the respective initial contextual characteristic's use in the corpus.
US14/520,654 2014-10-22 2014-10-22 Discovering terms using statistical corpus analysis Abandoned US20160117386A1 (en)

Priority Applications (2)

Application Number  Priority Date  Filing Date  Title
US14/520,654 (US20160117386A1)  2014-10-22  2014-10-22  Discovering terms using statistical corpus analysis (Abandoned)
US14/722,984 (US10592605B2, continuation of US14/520,654)  2014-10-22  2015-05-27  Discovering terms using statistical corpus analysis (Active 2037-08-05)

Publications (1)

Publication Number  Publication Date
US20160117386A1  2016-04-28

Family

ID=55792137

Country Status (1)

Country: US

Cited By (140)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150347383A1 (en) * 2014-05-30 2015-12-03 Apple Inc. Text prediction using combined word n-gram and unigram language models
US9589049B1 (en) * 2015-12-10 2017-03-07 International Business Machines Corporation Correcting natural language processing annotators in a question answering system
US20170228461A1 (en) * 2016-02-04 2017-08-10 Gartner, Inc. Methods and systems for finding and ranking entities in a domain specific system
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
CN107784048A (en) * 2016-11-14 2018-03-09 Ping An Technology (Shenzhen) Co., Ltd. Method and device for classifying questions in a question-and-answer corpus
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US10120861B2 (en) * 2016-08-17 2018-11-06 Oath Inc. Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
US20190035083A1 (en) * 2017-03-14 2019-01-31 Adobe Systems Incorporated Automatically Segmenting Images Based On Natural Language Phrases
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
CN109815321A (en) * 2018-12-26 2019-05-28 Mobvoi Information Technology Co., Ltd. Question answering method, device, equipment and storage medium
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10354652B2 (en) 2015-12-02 2019-07-16 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US10529332B2 (en) 2015-03-08 2020-01-07 Apple Inc. Virtual assistant activation
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10580409B2 (en) 2016-06-11 2020-03-03 Apple Inc. Application integration with a digital assistant
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10592605B2 (en) 2014-10-22 2020-03-17 International Business Machines Corporation Discovering terms using statistical corpus analysis
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10684703B2 (en) 2018-06-01 2020-06-16 Apple Inc. Attention aware virtual assistant dismissal
US10699717B2 (en) 2014-05-30 2020-06-30 Apple Inc. Intelligent assistant for home automation
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US10714117B2 (en) 2013-02-07 2020-07-14 Apple Inc. Voice trigger for a digital assistant
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10741185B2 (en) 2010-01-18 2020-08-11 Apple Inc. Intelligent automated assistant
US10748546B2 (en) 2017-05-16 2020-08-18 Apple Inc. Digital assistant services based on device capabilities
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10769385B2 (en) 2013-06-09 2020-09-08 Apple Inc. System and method for inferring user intent from speech inputs
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10789945B2 (en) 2017-05-12 2020-09-29 Apple Inc. Low-latency intelligent automated assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US10942702B2 (en) 2016-06-11 2021-03-09 Apple Inc. Intelligent device arbitration and control
US10942703B2 (en) 2015-12-23 2021-03-09 Apple Inc. Proactive assistance based on dialog communication between devices
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11010127B2 (en) 2015-06-29 2021-05-18 Apple Inc. Virtual assistant for media playback
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11048473B2 (en) 2013-06-09 2021-06-29 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US11069336B2 (en) 2012-03-02 2021-07-20 Apple Inc. Systems and methods for name pronunciation
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11127397B2 (en) 2015-05-27 2021-09-21 Apple Inc. Device voice control
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11140115B1 (en) * 2014-12-09 2021-10-05 Google Llc Systems and methods of applying semantic features for machine learning of message categories
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11163952B2 (en) 2018-07-11 2021-11-02 International Business Machines Corporation Linked data seeded multi-lingual lexicon extraction
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US11231904B2 (en) 2015-03-06 2022-01-25 Apple Inc. Reducing response latency of intelligent automated assistants
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11314370B2 (en) 2013-12-06 2022-04-26 Apple Inc. Method for extracting salient dialog usage from live data
US11350253B2 (en) 2011-06-03 2022-05-31 Apple Inc. Active transport based notifications
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11495218B2 (en) 2018-06-01 2022-11-08 Apple Inc. Virtual assistant operation in multi-device environments
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US20220378874A1 (en) * 2018-10-22 2022-12-01 Verint Americas Inc. Automated system and method to prioritize language model and ontology pruning
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US11615154B2 (en) 2021-02-17 2023-03-28 International Business Machines Corporation Unsupervised corpus expansion using domain-specific terms
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11663411B2 (en) 2015-01-27 2023-05-30 Verint Systems Ltd. Ontology expansion using entity-association rules and abstract relations
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11769012B2 (en) * 2019-03-27 2023-09-26 Verint Americas Inc. Automated system and method to prioritize language model and ontology expansion and pruning
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11861301B1 (en) * 2023-03-02 2024-01-02 The Boeing Company Part sorting system
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11928604B2 (en) 2005-09-08 2024-03-12 Apple Inc. Method and apparatus for building an intelligent automated assistant
US11954405B2 (en) 2022-11-07 2024-04-09 Apple Inc. Zero latency digital assistant

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11468234B2 (en) 2017-06-26 2022-10-11 International Business Machines Corporation Identifying linguistic replacements to improve textual message effectiveness
US11354504B2 (en) * 2019-07-10 2022-06-07 International Business Machines Corporation Multi-lingual action identification
US11461339B2 (en) 2021-01-30 2022-10-04 Microsoft Technology Licensing, Llc Extracting and surfacing contextually relevant topic descriptions

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080010274A1 (en) * 2006-06-21 2008-01-10 Information Extraction Systems, Inc. Semantic exploration and discovery
US20100125540A1 (en) * 2008-11-14 2010-05-20 Palo Alto Research Center Incorporated System And Method For Providing Robust Topic Identification In Social Indexes
US20110093452A1 (en) * 2009-10-20 2011-04-21 Yahoo! Inc. Automatic comparative analysis
US20110246486A1 (en) * 2010-04-01 2011-10-06 Institute For Information Industry Methods and Systems for Extracting Domain Phrases
US20120271788A1 (en) * 2011-04-21 2012-10-25 Palo Alto Research Center Incorporated Incorporating lexicon knowledge into svm learning to improve sentiment classification
US20130246430A1 (en) * 2011-09-07 2013-09-19 Venio Inc. System, method and computer program product for automatic topic identification using a hypertext corpus
US8589399B1 (en) * 2011-03-25 2013-11-19 Google Inc. Assigning terms of interest to an entity
US20140082003A1 (en) * 2012-09-17 2014-03-20 Digital Trowel (Israel) Ltd. Document mining with relation extraction
US20140172417A1 (en) * 2012-12-16 2014-06-19 Cloud 9, Llc Vital text analytics system for the enhancement of requirements engineering documents and other documents
US20150199333A1 (en) * 2014-01-15 2015-07-16 Abbyy Infopoisk Llc Automatic extraction of named entities from texts
US20160034305A1 (en) * 2013-03-15 2016-02-04 Advanced Elemental Technologies, Inc. Methods and systems for purposeful computing
US9477752B1 (en) * 2013-09-30 2016-10-25 Verint Systems Inc. Ontology administration and application to enhance communication data analytics

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7395256B2 (en) 2003-06-20 2008-07-01 Agency For Science, Technology And Research Method and platform for term extraction from large collection of documents
US7478092B2 (en) 2005-07-21 2009-01-13 International Business Machines Corporation Key term extraction
US8131536B2 (en) 2007-01-12 2012-03-06 Raytheon Bbn Technologies Corp. Extraction-empowered machine translation
WO2010038540A1 (en) 2008-10-02 2010-04-08 International Business Machines Corporation System for extracting term from document containing text segment
US8768960B2 (en) * 2009-01-20 2014-07-01 Microsoft Corporation Enhancing keyword advertising using online encyclopedia semantics
US8073877B2 (en) 2009-01-20 2011-12-06 Yahoo! Inc. Scalable semi-structured named entity detection
US8255405B2 (en) 2009-01-30 2012-08-28 Hewlett-Packard Development Company, L.P. Term extraction from service description documents
US20100274770A1 (en) * 2009-04-24 2010-10-28 Yahoo! Inc. Transductive approach to category-specific record attribute extraction
US9164983B2 (en) * 2011-05-27 2015-10-20 Robert Bosch Gmbh Broad-coverage normalization system for social media language
US8635107B2 (en) * 2011-06-03 2014-01-21 Adobe Systems Incorporated Automatic expansion of an advertisement offer inventory
US10339214B2 (en) 2011-11-04 2019-07-02 International Business Machines Corporation Structured term recognition
US20160117386A1 (en) 2014-10-22 2016-04-28 International Business Machines Corporation Discovering terms using statistical corpus analysis

Cited By (216)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11928604B2 (en) 2005-09-08 2024-03-12 Apple Inc. Method and apparatus for building an intelligent automated assistant
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US11900936B2 (en) 2008-10-02 2024-02-13 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US10741185B2 (en) 2010-01-18 2020-08-11 Apple Inc. Intelligent automated assistant
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US10692504B2 (en) 2010-02-25 2020-06-23 Apple Inc. User profiling for voice input processing
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11350253B2 (en) 2011-06-03 2022-05-31 Apple Inc. Active transport based notifications
US11069336B2 (en) 2012-03-02 2021-07-20 Apple Inc. Systems and methods for name pronunciation
US11321116B2 (en) 2012-05-15 2022-05-03 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US11636869B2 (en) 2013-02-07 2023-04-25 Apple Inc. Voice trigger for a digital assistant
US11862186B2 (en) 2013-02-07 2024-01-02 Apple Inc. Voice trigger for a digital assistant
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US11557310B2 (en) 2013-02-07 2023-01-17 Apple Inc. Voice trigger for a digital assistant
US10714117B2 (en) 2013-02-07 2020-07-14 Apple Inc. Voice trigger for a digital assistant
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US11048473B2 (en) 2013-06-09 2021-06-29 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11727219B2 (en) 2013-06-09 2023-08-15 Apple Inc. System and method for inferring user intent from speech inputs
US10769385B2 (en) 2013-06-09 2020-09-08 Apple Inc. System and method for inferring user intent from speech inputs
US11314370B2 (en) 2013-12-06 2022-04-26 Apple Inc. Method for extracting salient dialog usage from live data
US11810562B2 (en) 2014-05-30 2023-11-07 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10699717B2 (en) 2014-05-30 2020-06-30 Apple Inc. Intelligent assistant for home automation
US11699448B2 (en) 2014-05-30 2023-07-11 Apple Inc. Intelligent assistant for home automation
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US9785630B2 (en) * 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US10878809B2 (en) 2014-05-30 2020-12-29 Apple Inc. Multi-command single utterance input method
US11670289B2 (en) 2014-05-30 2023-06-06 Apple Inc. Multi-command single utterance input method
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10657966B2 (en) 2014-05-30 2020-05-19 Apple Inc. Better resolution when referencing to concepts
US20150347383A1 (en) * 2014-05-30 2015-12-03 Apple Inc. Text prediction using combined word n-gram and unigram language models
US10714095B2 (en) 2014-05-30 2020-07-14 Apple Inc. Intelligent assistant for home automation
US11838579B2 (en) 2014-06-30 2023-12-05 Apple Inc. Intelligent automated assistant for TV user interactions
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10390213B2 (en) 2014-09-30 2019-08-20 Apple Inc. Social reminders
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
US10592605B2 (en) 2014-10-22 2020-03-17 International Business Machines Corporation Discovering terms using statistical corpus analysis
US11140115B1 (en) * 2014-12-09 2021-10-05 Google Llc Systems and methods of applying semantic features for machine learning of message categories
US11663411B2 (en) 2015-01-27 2023-05-30 Verint Systems Ltd. Ontology expansion using entity-association rules and abstract relations
US11231904B2 (en) 2015-03-06 2022-01-25 Apple Inc. Reducing response latency of intelligent automated assistants
US10930282B2 (en) 2015-03-08 2021-02-23 Apple Inc. Competing devices responding to voice triggers
US10529332B2 (en) 2015-03-08 2020-01-07 Apple Inc. Virtual assistant activation
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US11842734B2 (en) 2015-03-08 2023-12-12 Apple Inc. Virtual assistant activation
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11127397B2 (en) 2015-05-27 2021-09-21 Apple Inc. Device voice control
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US10681212B2 (en) 2015-06-05 2020-06-09 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11947873B2 (en) 2015-06-29 2024-04-02 Apple Inc. Virtual assistant for media playback
US11010127B2 (en) 2015-06-29 2021-05-18 Apple Inc. Virtual assistant for media playback
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US11550542B2 (en) 2015-09-08 2023-01-10 Apple Inc. Zero latency digital assistant
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11809886B2 (en) 2015-11-06 2023-11-07 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US10354652B2 (en) 2015-12-02 2019-07-16 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US9589049B1 (en) * 2015-12-10 2017-03-07 International Business Machines Corporation Correcting natural language processing annotators in a question answering system
US10942703B2 (en) 2015-12-23 2021-03-09 Apple Inc. Proactive assistance based on dialog communication between devices
US11853647B2 (en) 2015-12-23 2023-12-26 Apple Inc. Proactive assistance based on dialog communication between devices
US10586174B2 (en) * 2016-02-04 2020-03-10 Gartner, Inc. Methods and systems for finding and ranking entities in a domain specific system
US20170228461A1 (en) * 2016-02-04 2017-08-10 Gartner, Inc. Methods and systems for finding and ranking entities in a domain specific system
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11657820B2 (en) 2016-06-10 2023-05-23 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11749275B2 (en) 2016-06-11 2023-09-05 Apple Inc. Application integration with a digital assistant
US11809783B2 (en) 2016-06-11 2023-11-07 Apple Inc. Intelligent device arbitration and control
US10580409B2 (en) 2016-06-11 2020-03-03 Apple Inc. Application integration with a digital assistant
US10942702B2 (en) 2016-06-11 2021-03-09 Apple Inc. Intelligent device arbitration and control
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US20190073357A1 (en) * 2016-08-17 2019-03-07 Oath Inc. Hybrid classifier for assigning natural language processing (nlp) inputs to domains in real-time
US10120861B2 (en) * 2016-08-17 2018-11-06 Oath Inc. Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
US10997370B2 (en) * 2016-08-17 2021-05-04 Verizon Media Inc. Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10553215B2 (en) 2016-09-23 2020-02-04 Apple Inc. Intelligent automated assistant
CN107784048A (en) * 2016-11-14 2018-03-09 Ping An Technology (Shenzhen) Co., Ltd. Method and device for classifying questions in a question-and-answer corpus
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US11656884B2 (en) 2017-01-09 2023-05-23 Apple Inc. Application integration with a digital assistant
US10410351B2 (en) * 2017-03-14 2019-09-10 Adobe Inc. Automatically segmenting images based on natural language phrases
US20190035083A1 (en) * 2017-03-14 2019-01-31 Adobe Systems Incorporated Automatically Segmenting Images Based On Natural Language Phrases
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10741181B2 (en) 2017-05-09 2020-08-11 Apple Inc. User interface for correcting recognition errors
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10847142B2 (en) 2017-05-11 2020-11-24 Apple Inc. Maintaining privacy of personal information
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US11599331B2 (en) 2017-05-11 2023-03-07 Apple Inc. Maintaining privacy of personal information
US11862151B2 (en) 2017-05-12 2024-01-02 Apple Inc. Low-latency intelligent automated assistant
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US11837237B2 (en) 2017-05-12 2023-12-05 Apple Inc. User-specific acoustic models
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
US10789945B2 (en) 2017-05-12 2020-09-29 Apple Inc. Low-latency intelligent automated assistant
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11538469B2 (en) 2017-05-12 2022-12-27 Apple Inc. Low-latency intelligent automated assistant
US11380310B2 (en) 2017-05-12 2022-07-05 Apple Inc. Low-latency intelligent automated assistant
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US10909171B2 (en) 2017-05-16 2021-02-02 Apple Inc. Intelligent automated assistant for media exploration
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
US11675829B2 (en) 2017-05-16 2023-06-13 Apple Inc. Intelligent automated assistant for media exploration
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10748546B2 (en) 2017-05-16 2020-08-18 Apple Inc. Digital assistant services based on device capabilities
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US11710482B2 (en) 2018-03-26 2023-07-25 Apple Inc. Natural assistant interaction
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US11900923B2 (en) 2018-05-07 2024-02-13 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11169616B2 (en) 2018-05-07 2021-11-09 Apple Inc. Raise to speak
US11907436B2 (en) 2018-05-07 2024-02-20 Apple Inc. Raise to speak
US11854539B2 (en) 2018-05-07 2023-12-26 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11487364B2 (en) 2018-05-07 2022-11-01 Apple Inc. Raise to speak
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US10684703B2 (en) 2018-06-01 2020-06-16 Apple Inc. Attention aware virtual assistant dismissal
US11431642B2 (en) 2018-06-01 2022-08-30 Apple Inc. Variable latency device coordination
US10984798B2 (en) 2018-06-01 2021-04-20 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11360577B2 (en) 2018-06-01 2022-06-14 Apple Inc. Attention aware virtual assistant dismissal
US11630525B2 (en) 2018-06-01 2023-04-18 Apple Inc. Attention aware virtual assistant dismissal
US11495218B2 (en) 2018-06-01 2022-11-08 Apple Inc. Virtual assistant operation in multi-device environments
US11009970B2 (en) 2018-06-01 2021-05-18 Apple Inc. Attention aware virtual assistant dismissal
US10720160B2 (en) 2018-06-01 2020-07-21 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US10944859B2 (en) 2018-06-03 2021-03-09 Apple Inc. Accelerated task performance
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US10504518B1 (en) 2018-06-03 2019-12-10 Apple Inc. Accelerated task performance
US11163952B2 (en) 2018-07-11 2021-11-02 International Business Machines Corporation Linked data seeded multi-lingual lexicon extraction
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US20220378874A1 (en) * 2018-10-22 2022-12-01 Verint Americas Inc. Automated system and method to prioritize language model and ontology pruning
US11934784B2 (en) 2018-10-22 2024-03-19 Verint Americas Inc. Automated system and method to prioritize language model and ontology expansion and pruning
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
CN109815321A (en) * 2018-12-26 2019-05-28 Mobvoi Information Technology Co., Ltd. Question answering method, device, equipment and storage medium
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11783815B2 (en) 2019-03-18 2023-10-10 Apple Inc. Multimodality in digital assistant systems
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11769012B2 (en) * 2019-03-27 2023-09-26 Verint Americas Inc. Automated system and method to prioritize language model and ontology expansion and pruning
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11705130B2 (en) 2019-05-06 2023-07-18 Apple Inc. Spoken notifications
US11675491B2 (en) 2019-05-06 2023-06-13 Apple Inc. User configurable task triggers
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11888791B2 (en) 2019-05-21 2024-01-30 Apple Inc. Providing message response suggestions
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11360739B2 (en) 2019-05-31 2022-06-14 Apple Inc. User activity shortcut suggestions
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11924254B2 (en) 2020-05-11 2024-03-05 Apple Inc. Digital assistant hardware abstraction
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11750962B2 (en) 2020-07-21 2023-09-05 Apple Inc. User identification using headphones
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US11615154B2 (en) 2021-02-17 2023-03-28 International Business Machines Corporation Unsupervised corpus expansion using domain-specific terms
US11954405B2 (en) 2022-11-07 2024-04-09 Apple Inc. Zero latency digital assistant
US11861301B1 (en) * 2023-03-02 2024-01-02 The Boeing Company Part sorting system

Also Published As

Publication number Publication date
US20160117313A1 (en) 2016-04-28
US10592605B2 (en) 2020-03-17

Similar Documents

Publication Publication Date Title
US10592605B2 (en) Discovering terms using statistical corpus analysis
US9922025B2 (en) Generating distributed word embeddings using structured information
Hill et al. The goldilocks principle: Reading children's books with explicit memory representations
Koh et al. An empirical survey on long document summarization: Datasets, models, and metrics
US9436918B2 (en) Smart selection of text spans
Li et al. Tweet segmentation and its application to named entity recognition
US9898529B2 (en) Augmenting semantic models based on morphological rules
Khalifa et al. Character convolutions for arabic named entity recognition with long short-term memory networks
US9588958B2 (en) Cross-language text classification
US9734238B2 (en) Context based passage retrieval and scoring in a question answering system
US10713438B2 (en) Determining off-topic questions in a question answering system using probabilistic language models
US20150178268A1 (en) Semantic disambiguation using a statistical analysis
US20150178270A1 (en) Semantic disambiguation using a language-independent semantic structure
Quan et al. Weighted high-order hidden Markov models for compound emotions recognition in text
US10810375B2 (en) Automated entity disambiguation
Yang et al. Exploring feature sets for two-phase biomedical named entity recognition using semi-CRFs
US20150178269A1 (en) Semantic disambiguation using a semantic classifier
Abdallah et al. Multi-domain evaluation framework for named entity recognition tools
US9984064B2 (en) Reduction of memory usage in feature generation
CN112541062B (en) Parallel corpus alignment method and device, storage medium and electronic equipment
Makrynioti et al. Sentiment extraction from tweets: multilingual challenges
Ritter Extracting knowledge from Twitter and the Web
Nguyen et al. Learning to summarize multi-documents with local and global information
US10528661B2 (en) Evaluating parse trees in linguistic analysis
US11487940B1 (en) Controlling abstraction of rule generation based on linguistic context

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:AJMERA, JITENDRA;PARIKH, ANKUR;SIGNING DATES FROM 20141015 TO 20141016;REEL/FRAME:034005/0524

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION