US20070067320A1 - Detecting relationships in unstructured text - Google Patents
- Publication number
- US20070067320A1 (application US11/231,205)
- Authority
- US
- United States
- Prior art keywords
- document
- relationship
- entity
- text
- slot
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
Definitions
- the invention generally relates to the field of data mining and, more particularly, to a system and a computer-implemented method of detecting relationships by creating input files of text patterns for each type of relationship, identifying a specific text pattern within a text-based document, tagging proper names in the text-based document, and extracting those proper names located within the specific text pattern so as to identify the two entities in the relationship.
- embodiments of the invention provide a system and a computer implemented method of detecting relationships in unstructured text.
- An embodiment of a method of detecting relationships in unstructured text comprises first creating text patterns that represent different types of relationships and storing those text patterns in an input file.
- the input file can store various text patterns representing employer/employee relationships, various patterns representing partnership relationships, etc.
- the text patterns may be custom-created by a user and input into the input file and/or pre-created and stored in the input file by a system manufacturer.
- a text pattern may be created by developing at least one regular text expression, comprising a plurality of words that describe the particular type of relationship. Additionally, the text pattern is developed with two or more slots positioned within, before, or after this regular text expression.
- the text pattern can also be created with slot location identifiers which indicate a position of the first slot and/or a position of the second slot relative to the regular text expression.
- the text pattern can be created with slot location identifiers that indicate that the first slot should be located before the text expression and/or within a predetermined proximity from the text expression (e.g., within a predetermined number of words from the text expression).
- the text pattern can be created with slot location identifiers to indicate that the second slot should be located after the text expression and/or within a predetermined proximity from the text expression.
- the text pattern can be created with a relationship order identifier (i.e., an identifier that defines an order of the first and second entities in the relationship based on the locations of the proper names within the first and second slots). For example, if the type of relationship detected is a customer/seller relationship in which one entity is a “customer of” another entity, a relationship order identifier can be embedded in the text pattern to indicate that the proper name located in the first slot identifies the customer.
- the text pattern can be created with a keyword for the particular type of relationship, and specifically, for the particular text pattern. This keyword may be used in subsequent method steps, as described below, to screen out documents prior to conducting a pattern matching analysis.
- one or more text-based electronic documents are selected for processing by using an input device.
- the documents can be selected, for example, from the world wide web (WWW), from a wide area network (WAN), from a local area network, etc.
- the selection of documents can include a specific document, all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc.
- the selected unstructured text document(s) may be preprocessed, for example, by a preprocessor, in order to provide “noise free” text to either the proper noun tagger or the keyword identifier, described below.
- Processing of a selected text-based document comprises analyzing the document in order to determine the location for each proper name occurring within the document. This can be accomplished using a multi-step process performed, for example, by a proper noun tagger.
- the tagger can be adapted to first scan the document in order to identify each of the proper names occurring within the document based on a predetermined set of matching rules.
- the set of matching rules can be based, for example, on word capitalization, sentence structure, sentence boundaries, excluded words, etc.
- the tagger can also be adapted to re-scan the document in order to tag and record each of the proper names found within the document along with their locations.
- Processing of a selected text-based document also comprises analyzing the document on a sentence by sentence basis so as to locate a text pattern within the document. This can also be accomplished using a multi-step process performed, for example, by a pattern keyword identifier and pattern matcher.
- the keyword identifier can be adapted to first scan the document in order to determine whether or not a keyword from one or more of the text patterns in the input file is located in the document. If a keyword for a particular text pattern is found, then a full text pattern matching process can be performed, for example, by a pattern matcher, to determine if the regular text expression defined in the particular text pattern is located in the document. If a full text pattern is found within the document, the identity of the document is recorded and the location of the full text pattern is flagged.
- a multi-step relationship detection process is performed, for example, by a relationship detector.
- the relationship detector refers to the list of proper names recorded by the proper noun tagger and determines if proper names are located within the first and second slots and extracts those proper names, thereby, identifying the first and second entities engaged in the relationship. Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector determines the order. Lastly, the relationship detector outputs the results of the relationship detection analysis.
- the relationship detector can provide an output comprising the type of relationship, the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable) and the identification of the document and the location in the document where the relationship was detected (i.e., the location of the text pattern), which can be stored and/or displayed.
- An embodiment of a system for detecting relationships in one or more unstructured text documents comprises text pattern input files, a keyword identifier, a pattern matcher, a proper noun tagger and a relationship detector.
- the system can comprise text pattern input files stored in memory. These input files comprise text patterns that describe different types of relationships.
- the text patterns can be pre-created and input in the input file (e.g., by a system manufacturer) or custom developed and input into the input file by the user using an input device (e.g., a keyboard, disk, CD, internet link, hard drive, etc.).
- Each text pattern can comprise at least one regular text expression having a plurality of words that describe a particular relationship as well as two or more slots positioned within, before, or after this regular text expression.
- the slots will be used by system features, as described below, in order to identify the proper names of the entities involved in the relationship (e.g., a first slot for the name of the first entity and a second slot for the name of the second entity in the relationship).
- the text pattern can also comprise slot location identifiers that indicate a position of the first slot and/or a position of the second slot relative to the regular text expression, as described in detail above.
- the text pattern can comprise a relationship order identifier that defines an order of the first and second entities in the relationship based on the locations of the proper names within the first and second slots, also as described in detail above.
- the text pattern can comprise a keyword for the particular type of relationship and, specifically, for the particular text pattern. This keyword may be used by other features of the system, as described below, to screen out documents prior to conducting a pattern matching analysis.
- a communications link can be established between the system and a source for unstructured text documents (e.g., the world wide web (WWW), a wide area network (WAN), a local area network, etc.) so that a user of the system, using an input device (e.g., a keyboard, mouse, etc.) can select one or more text-based electronic documents for analysis.
- the documents may be selected such that they include a specific document, all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc.
- the system may further comprise a pre-processor adapted to pre-process selected unstructured text document(s) prior to analysis in order to provide “noise free” text to either the proper noun tagger or the keyword identifier, described below.
- the proper noun tagger can be adapted to receive the selected unstructured text document(s) and to perform a multi-step tagging process on the documents.
- the tagger can be adapted to first scan each document in order to identify each occurrence of a proper name within the document based on a predetermined set of matching rules.
- the set of matching rules can be based, for example, on word capitalization, sentence structure, sentence boundaries, excluded words, etc.
- the tagger can also be adapted to re-scan the document in order to tag and record a list of each of the proper names found within the document along with their locations.
- the keyword identifier is in communication with the relationship pattern input file and is adapted to receive the selected unstructured text document(s) (e.g., before, after, or separate from the processing by the proper noun tagger) and to analyze the document(s). Specifically, the keyword identifier is adapted to scan each document sentence by sentence in order to determine whether or not a keyword from one or more of the text patterns in the input file is located in the document. If a keyword for a particular text pattern is found, the document is forwarded to a pattern matcher for further analysis.
- the pattern matcher is adapted to perform a full text pattern matching process on the forwarded document. Specifically, the pattern matcher is adapted to scan the document sentence by sentence to determine if the regular text expression defined in the particular text pattern associated with the keyword is located in the document. If a full text pattern is found within the document, the identity of the document is recorded, the location of the full text pattern is flagged, and the document is forwarded to the relationship detector.
- the relationship detector is in communication with the proper noun tagger and is adapted to analyze the document further in order to detect a relationship. Specifically, the relationship detector is adapted to refer to the list of proper names recorded by the proper noun tagger and determines if proper names are located within the first and second slots for the text pattern that was located in the document. If proper names are found in both slots, the relationship detector extracts those proper names, and thereby, identifies the first and second entities engaged in the relationship described by the text pattern. Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector determines the order of each named entity. Lastly, the relationship detector outputs the results of the relationship detection analysis.
- the relationship detector can provide an output comprising the type of relationship, the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable) and the identification of the document and the location in the document where the relationship was detected (i.e., the location of the text pattern).
- This output can be stored (e.g., in a data storage device) and/or displayed on a display screen.
- FIG. 1 is a schematic flow diagram of an embodiment of a method of detecting relationships in unstructured text-based electronic documents
- FIG. 2 is a schematic block diagram of an exemplary relationship text pattern input file
- FIG. 3 is a schematic block diagram representing an embodiment of a system of detecting relationships in unstructured text-based electronic documents.
- FIG. 4 is a schematic representation of a computer system suitable for use in detecting relationships in unstructured text-based electronic documents.
- a system and a computer-implemented method for automatically and accurately detecting relationships (e.g., a partner relationship between two corporations, an employee-employer relationship between two people, a seller-customer relationship, etc.).
- the challenge is both in identifying entities in a document and in detecting the particular relationship, if any, between two entities. Therefore, disclosed herein are embodiments of a system and method for detecting any type of relationship that is described in unstructured text-based electronic documents. Specifically, the system and method each incorporate the use of an input file that contains one or more text patterns that represent particular relationships.
- the text patterns each include regular text expressions that describe the particular relationship and slots for the location of each entity in that relationship.
- Document(s) are selected by a user and scanned by a proper noun tagger that identifies and tags every occurrence of a proper name within the document(s). Then, a pattern matcher scans the document(s) to match text patterns from the input file. If a text pattern is matched, a relationship detector extracts the proper names found in the slots for each matched text pattern. The output from the relationship detector includes the names of each entity in a relationship, the type of relationship, the identity of the document, and the location of the sentence describing the relationship in the document.
- an embodiment of a method of detecting relationships in unstructured text comprises first creating text patterns 205 that represent different types of relationships 201 and storing those text patterns 205 in an input file 200 , as illustrated in FIG. 2 ( 102 - 104 ).
- the input file 200 can store various text patterns representing different types of relationships 201 , such as employer/employee relationships, various patterns representing partnership relationships, etc.
- the text patterns 205 may be custom created and input into the input file 200 by a user and/or pre-created and stored in the input file 200 by a system manufacturer (e.g., as illustrated in FIG. 3 and discussed below). Any number of input files may be given as input with each file containing a list of patterns 205 for a particular relationship 201 .
- the text patterns 205 may be created by developing at least one regular text expression 210 , comprising a plurality of words that describe the particular type of relationship, and providing two or more slots 208 , 212 positioned within, before, or after this regular text expression. These slots will be used in subsequent method steps, as described below, in order to identify the proper names of the entities involved in the relationship (e.g., a first slot for the name of the first entity and a second slot for the name of the second entity in the relationship).
- the text pattern 205 can also be created with slot location identifiers 202 that indicate a position of the first slot and/or a position of the second slot relative to the regular text expression.
- the text pattern 205 can be created with a slot identifier 202 a to indicate that the first slot 208 should be located before the text expression 210 and/or within a predetermined proximity from the text expression (e.g., within a predetermined number of words from the text expression).
- the text pattern 205 can be created with a slot identifier 202 b to indicate that the second slot 212 should be located after the text expression 210 and/or within a predetermined proximity from the text expression.
- the text pattern 205 can be created with a relationship order identifier 204 (i.e., an identifier that defines, for a relationship that is not symmetric, an order of the first and second entities based on the locations of the proper names within the first and second slots 208 and 212 ). For example, if the type of relationship detected is a customer/seller relationship in which one entity is a “customer of” another entity, a relationship order identifier can be embedded in the text pattern to indicate that the proper name located in the first slot identifies the customer.
- the text pattern 205 can also be created with a keyword 206 for the particular type of relationship 201 , and specifically, for the particular text pattern. This keyword 206 may be used in subsequent method steps, as described below, to screen out documents prior to conducting a pattern matching analysis and to, thereby, improve the speed of the pattern-matching.
- the text patterns 205 may be described in any language that supports regular expression matching (e.g., Perl) such that the slots 208 and 212 for the entities match the $1 and $2 variables after a successful match is performed.
- This exemplary text pattern contains five comma-separated fields: (1) a first number (i.e., slot location identifier 202 ), (2) a second number (i.e., another slot location identifier 202 ), (3) a third number (i.e., a relationship order identifier 204 ), (4) a keyword ( 206 ) for the pattern, and (5) a regular expression 210 with two slots 208 and 212 .
- the text matching the first (.*) in the expression will be accessible via the $1 variable after a successful match has been performed.
- the second occurrence will be accessible via the $2 variable.
- these slots may contain an arbitrary amount of text.
- proper names are located within the slots.
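- As a minimal sketch of the slot mechanism described above, the following Python fragment uses capture groups in place of Perl's $1 and $2 variables. The expression, the sentence, and the function name `extract_slots` are hypothetical illustrations, not taken from the patent.

```python
import re


def extract_slots(expression, sentence):
    """Return the text captured by the two slots, or None if there is no match.

    group(1) and group(2) play the role of Perl's $1 and $2 variables.
    """
    match = re.search(expression, sentence)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical pattern: each (.*) is a slot that may hold arbitrary text.
EXPRESSION = r"(.*) is a customer of (.*)"

slots = extract_slots(EXPRESSION, "Last year Acme Corp is a customer of Globex Inc in Europe")
print(slots)
```

In this example each slot captures more than the proper name alone ("Last year Acme Corp" and "Globex Inc in Europe"), which is why a later step still has to locate the proper names inside the captured text.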
- the first two numbers in this exemplary text pattern comprise the slot location identifiers 202 and refer to the text matched in the $1 and $2 slots, respectively.
- a 0 means that, for a successful match, a proper name found within the slot must occur at the far right of the slot; a 1 means it must occur at the far left.
- the third number in the exemplary text pattern comprises the relationship order identifier which specifies the order of the entities in the relationship. For example, if the relationship is “customer of,” a 1 in this field means that entity 1 (matched via $1) is a customer of entity 2 (matched via $2). A 2 in this field would mean that entity 2 is a customer of entity 1 . If this field is 0, the relationship is symmetric, as in a partnership relation.
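- The five fields and their meanings can be sketched as follows. The pattern line shown is a hypothetical example in the comma-separated layout described above (the exemplary pattern itself is not reproduced in this text), and the names `TextPattern`, `parse_pattern`, `name_satisfies_location`, and `ordered_entities` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextPattern:
    slot1_location: int  # 0 = name at far right of slot, 1 = name at far left
    slot2_location: int
    order: int           # 0 = symmetric, 1 = entity 1 first, 2 = entity 2 first
    keyword: str
    expression: str


def parse_pattern(line):
    """Split a pattern line into its five fields.

    Only the first four commas are treated as separators, so the regular
    expression in the last field may itself contain commas.
    """
    loc1, loc2, order, keyword, expression = line.strip().split(",", 4)
    return TextPattern(int(loc1), int(loc2), int(order), keyword, expression)


def name_satisfies_location(slot_text, name, location):
    """Check the slot location identifier against the text captured by a slot."""
    text = slot_text.strip()
    return text.endswith(name) if location == 0 else text.startswith(name)


def ordered_entities(order, entity1, entity2):
    """Return the entities so that the first one 'is a customer of' the second.

    For order 0 (symmetric) the original order is kept unchanged.
    """
    return (entity2, entity1) if order == 2 else (entity1, entity2)


pattern = parse_pattern("0,1,1,customer,(.*) is a customer of (.*)")
print(pattern)
```

The sketch assumes that "far right" and "far left" refer to the ends of the captured slot text after whitespace is trimmed; the source does not spell out how punctuation at the slot boundary is handled.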
- one or more text-based electronic documents are selected for processing by using an input device (e.g., the same input device used to create and enter the input files, or a different one) ( 106 ).
- the documents can be selected, for example, from the world wide web (WWW), from a wide area network (WAN), from a local area network, etc.
- the selection of documents can include a specific document, all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc.
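- The selection criteria listed above could be sketched as a simple filter. The `Document` fields, the `select_documents` function, and the reduction of a Boolean query to an "all of these terms" list plus an "any of these terms" list are all simplifying assumptions; the source does not specify a query syntax.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    doc_id: str
    category: str
    published: date
    text: str


def select_documents(documents, category=None, start=None, end=None,
                     all_terms=(), any_terms=()):
    """Keep documents that satisfy every criterion that was supplied."""
    selected = []
    for doc in documents:
        lowered = doc.text.lower()
        if category is not None and doc.category != category:
            continue
        if start is not None and doc.published < start:
            continue
        if end is not None and doc.published > end:
            continue
        # AND part of the simplified Boolean query.
        if not all(term.lower() in lowered for term in all_terms):
            continue
        # OR part of the simplified Boolean query (skipped when empty).
        if any_terms and not any(term.lower() in lowered for term in any_terms):
            continue
        selected.append(doc)
    return selected


DOCS = [
    Document("d1", "press", date(2005, 3, 1), "Acme Corp is a customer of Globex Inc."),
    Document("d2", "press", date(2004, 1, 15), "Globex Inc has partnered with Initech."),
    Document("d3", "blog", date(2005, 6, 9), "A note about customer service."),
]

print([d.doc_id for d in select_documents(DOCS, category="press")])
```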
- Each selected unstructured text document may be preprocessed, for example, by a preprocessor, in order to provide “noise free” text to either the proper noun tagger or the keyword identifier, described below ( 108 ).
- Processing of each selected text-based document comprises analyzing the document in order to determine the location for each proper name occurring within the document ( 116 ). This can be accomplished using a multi-step process performed, for example, by a proper noun tagger.
- the tagger can be adapted to first scan the document in order to identify each of the proper names occurring within each document based on a predetermined complex set of matching rules and lexicons.
- the set of matching rules define the proper nouns based, for example, on word capitalization, sentence structure, sentence boundaries, excluded words, etc. For example, the list of excluded words may include months, days of the week, words not capitalized in a title, etc.
- An exemplary rule may provide that all capitalized words, not located at the beginning of a sentence and not included on the list of excluded words, are identifiable as proper nouns.
- the tagger can also be adapted to re-scan the document in order to tag and record a list of each of the proper names found within the document along with their locations.
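- A rough sketch of the two-scan tagging process, using only the exemplary capitalization rule given above. The excluded-word list, the naive sentence splitting on `.`, `!` and `?`, and the function names are assumptions; a real tagger would rely on the fuller rule set and lexicons that the text mentions but does not detail.

```python
import re

# Hypothetical, deliberately tiny excluded-word list.
EXCLUDED_WORDS = {"Monday", "Tuesday", "January", "February", "The", "A", "An"}

SENTENCE = re.compile(r"[^.!?]+")
# A run of capitalized words separated by single spaces.
CAP_RUN = re.compile(r"\b[A-Z][A-Za-z&'-]*(?: [A-Z][A-Za-z&'-]*)*")


def find_proper_names(text):
    """First scan: collect candidate proper names under the exemplary rule."""
    names = set()
    for raw_sentence in SENTENCE.findall(text):
        sentence = raw_sentence.strip()
        for match in CAP_RUN.finditer(sentence):
            words = match.group().split(" ")
            if match.start() == 0:
                words = words[1:]  # rule: ignore the sentence-initial word
            run = []
            for word in words + [None]:  # the trailing None flushes the last run
                if word is not None and word not in EXCLUDED_WORDS:
                    run.append(word)
                else:
                    if run:
                        names.add(" ".join(run))
                    run = []
    return names


def tag_names(text, names):
    """Second scan: record every occurrence of each name with its location."""
    tagged = []
    for name in names:
        for match in re.finditer(re.escape(name), text):
            tagged.append((name, match.start(), match.end()))
    return sorted(tagged, key=lambda item: item[1])


TEXT = ("On Monday Acme Corp announced a deal. "
        "Analysts say Globex Inc is a customer of Acme Corp.")
NAMES = find_proper_names(TEXT)
TAGGED = tag_names(TEXT, NAMES)
print(TAGGED)
```

In this sketch the second scan searches the whole document, so it also records occurrences that the first scan skipped because they opened a sentence. Abbreviations that end in a period (such as "Inc.") would break the naive sentence splitting.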
- Processing of each selected text-based document also comprises analyzing the document on a sentence by sentence basis so as to locate a text pattern within the document ( 112 - 114 ).
- This can also be accomplished using a multi-step process performed, for example, by a pattern keyword identifier and pattern matcher.
- the keyword identifier can be adapted to first scan the document in order to determine whether or not a keyword from one or more of the text patterns in the input file is located in the document ( 112 ). If a keyword for a particular text pattern is found, then a full text pattern matching process can be performed, for example, by a pattern matcher, to determine if the regular text expression defined in the particular text pattern is located in the document ( 114 ). If a full text pattern is found within the document, the identity of the document is recorded and the location of the full text pattern is flagged ( 115 ).
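- The keyword screening followed by full pattern matching can be sketched as a two-stage check. The pattern list, the use of a sentence index as the "location", and the function names are assumptions made for illustration.

```python
import re

# Hypothetical (keyword, regular expression) pairs from an input file.
PATTERNS = [
    ("customer", r"(.*) is a customer of (.*)"),
    ("partner", r"(.*) has partnered with (.*)"),
]


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by space."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def locate_patterns(doc_id, text):
    """Screen the document by keyword, then run the full expressions."""
    # Stage 1: cheap substring test; documents with no keyword are skipped.
    candidates = [(kw, rx) for kw, rx in PATTERNS if kw in text]
    if not candidates:
        return []
    # Stage 2: full regular expression match, sentence by sentence.
    hits = []
    for index, sentence in enumerate(split_sentences(text)):
        for keyword, expression in candidates:
            if keyword in sentence and re.search(expression, sentence):
                hits.append({"document": doc_id, "sentence": index, "keyword": keyword})
    return hits


print(locate_patterns(
    "doc-2",
    "Acme grew fast. Globex Inc is a customer of Acme Corp. Every customer was happy.",
))
```

The third sentence in the usage example contains the keyword but not the full expression, so it passes the screening stage and is rejected by the pattern matcher.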
- a multi-step relationship detection process is performed, for example, by a relationship detector.
- the relationship detector refers to the list of proper names recorded by the proper noun tagger and determines if proper names are located within the first and second slots and extracts those proper names, thereby, identifying the first and second entities engaged in the relationship ( 119 ). Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector determines the order. Lastly, the relationship detector outputs the results of the relationship detection analysis ( 120 ).
- the relationship detector can provide an output comprising the type of relationship, the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable) and the identification of the document and the location in the document where the relationship was detected (i.e., the location of the text pattern), which can be stored ( 122 ) and/or displayed ( 124 ).
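- One possible shape for the output described above is a small record that can be stored or displayed. The field names, the use of a sentence index as the location, and JSON as the storage format are assumptions; the source only lists what the output comprises.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RelationshipRecord:
    relationship_type: str  # type of relationship, e.g. "customer of"
    first_entity: str       # proper name extracted from the first slot
    second_entity: str      # proper name extracted from the second slot
    order: int              # relationship order identifier (0 if symmetric)
    document_id: str        # identification of the document
    location: int           # where the text pattern was found (sentence index here)


def to_json(record):
    """Serialize a record so it can be written to storage or shown to a user."""
    return json.dumps(asdict(record), sort_keys=True)


record = RelationshipRecord("customer of", "Globex Inc", "Acme Corp", 1, "doc-2", 1)
print(to_json(record))
```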
- an embodiment of a system 300 for detecting relationships in one or more unstructured text documents comprises a text pattern input file 304 , a keyword identifier 312 , a pattern matcher 314 , a proper noun tagger 315 and a relationship detector 318 .
- the system 300 can comprise input files 304 stored in memory 305 (e.g., a hard drive, a disk, data storage device, etc.).
- the input files 304 , as illustrated in FIG. 2 and discussed above, comprise a plurality of text patterns 205 that describe different types of relationships 201 .
- These text patterns 205 can be pre-created and input in the input files 304 (e.g., by a system manufacturer) or custom developed and input into the input files 304 by the user using an input device 302 (e.g., a keyboard, disk, CD, internet link, hard drive, etc.).
- Each text pattern 205 can comprise at least one text expression 210 , discussed in detail above, having a plurality of words that describe a particular relationship 201 as well as two or more slots 208 , 212 positioned within, before, or after this regular text expression 210 .
- the slots 208 , 212 will be used by the relationship detector 318 , as described below, in order to identify the proper names of the entities involved in the relationship (e.g., a first slot 208 for the name of the first entity and a second slot 212 for the name of the second entity in the relationship).
- the text pattern 205 can also comprise slot location identifiers 202 a - b that indicate a position of the first slot 208 and/or a position of the second slot 212 relative to the regular text expression 210 , as described in detail above. Additionally, the text pattern 205 can comprise a relationship order identifier 204 that defines an order of the first and second entities in the relationship based on the locations of the proper names within the first and second slots 208 , 212 , also as described in detail above. Lastly, the text pattern 205 can comprise a keyword 206 for the particular type of relationship 201 and, specifically, for the particular text pattern 205 . This keyword 206 may be used by the keyword identifier 312 , as described below, to screen out documents prior to conducting a pattern matching analysis in order to improve processing speed.
- a communications link 307 can be established between the system 300 and a source 306 for unstructured text documents (e.g., the Internet, the world wide web (WWW), a wide area network (WAN), a local area network, etc.) so that a user of the system 300 can select, using an input device 308 (e.g., a keyboard, a mouse, etc.) one or more text-based electronic documents 309 for analysis.
- the document(s) may be selected to include specific document(s), all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc.
- the system 300 may further comprise a pre-processor 310 adapted to pre-process each selected unstructured text document 309 prior to analysis in order to provide “noise free” text to either the proper noun tagger 315 or the keyword identifier 312 , described below.
- the proper noun tagger 315 can be adapted to receive each selected unstructured text document 309 and to perform a multi-step tagging process on the documents. Specifically, the tagger 315 can be adapted to first scan each document in order to identify each occurrence of a proper name within the document based on a predetermined and complex set of matching rules and lexicons. The set of matching rules can be based, for example, on at least one of word capitalization, sentence structure, sentence boundaries, and excluded words (e.g., as illustrated in the detailed discussion above). The tagger 315 can also be adapted to re-scan the document(s) 309 in order to tag each proper name and record a list of each of the proper names found within the document along with their locations 317 in memory 316 .
- the keyword identifier 312 is in communication with (i.e., is adapted to access) the relationship pattern input file 304 and is further adapted to receive the selected unstructured text document(s) 309 from the preprocessor 310 (e.g., before, after, or separate from the processing by the proper noun tagger) and to analyze each document 309 .
- the keyword identifier 312 is adapted to scan each document 309 sentence by sentence in order to determine whether or not a keyword 206 from one or more of the text patterns 205 in the input file (as illustrated in FIG. 2 ) is located in the document 309 . If a keyword 206 for a particular text pattern 205 is found, the document containing the keyword is forwarded to a pattern matcher 314 for further analysis.
- the pattern matcher 314 is adapted to perform a full text pattern matching process on the forwarded document. Specifically, the pattern matcher 314 is adapted to scan the document sentence by sentence to determine if the regular text expression defined in the particular text pattern associated with the keyword is located in the document. If a full text pattern is found within the document, the identity of the document and the location of the full text pattern 320 are recorded in a memory 319 that is accessible by the relationship detector 318 . The document that contains the full text pattern is then forwarded to the relationship detector 318 for further analysis.
- the relationship detector 318 is adapted to further analyze the document that contains the full text pattern in order to detect a relationship and, particularly, the entities engaged in the relationship. Specifically, the relationship detector 318 is adapted to access the memory 316 in order to refer to the list of proper names 317 recorded by the proper noun tagger 315 . The relationship detector 318 then reviews the document and determines if proper names are located within the first and second slots for the text pattern that was located within the document. If proper names are found in both slots, the relationship detector 318 extracts those proper names, and thereby, identifies the first and second entities engaged in the relationship described by the text pattern. Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector 318 determines the order of each named entity.
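- The slot check performed by the relationship detector can be sketched by comparing the recorded name locations against the span of each slot. The function name, the tuple layout `(name, start, end)`, and the tie-breaking rule (take the name nearest the expression when a slot holds several) are assumptions rather than details given in the source.

```python
import re


def detect_relationship(sentence, expression, tagged_names):
    """Return (first_entity, second_entity), or None if a slot holds no name.

    tagged_names holds (name, start, end) offsets within `sentence`, as a
    proper noun tagger might have recorded them.
    """
    match = re.search(expression, sentence)
    if match is None:
        return None
    per_slot = []
    for slot in (1, 2):
        slot_start, slot_end = match.span(slot)
        inside = sorted(
            (start, name)
            for name, start, end in tagged_names
            if slot_start <= start and end <= slot_end
        )
        if not inside:
            return None  # both slots must contain a proper name
        per_slot.append([name for _, name in inside])
    # Assumed tie-break: last name in slot 1, first name in slot 2.
    return per_slot[0][-1], per_slot[1][0]


EXPRESSION = r"(.*) is a customer of (.*)"
SENTENCE_TEXT = "Analysts say Globex Inc is a customer of Acme Corp in Europe"
TAGGED = [
    (name, SENTENCE_TEXT.index(name), SENTENCE_TEXT.index(name) + len(name))
    for name in ("Globex Inc", "Acme Corp")
]

print(detect_relationship(SENTENCE_TEXT, EXPRESSION, TAGGED))
```

A fuller version would also apply the slot location identifiers and the relationship order identifier from the text pattern before reporting the entities.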
- the relationship detector outputs the results of the relationship detection analysis.
- the relationship detector 318 can provide an output comprising the type of relationship (as defined by the text pattern), the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable) and the identification of the document and the location in the document where the relationship was detected (i.e., the location of the text pattern).
- This output can be stored (e.g., in a data storage device 322 ) and/or displayed on a display screen 324 .
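The output described above can be pictured as a small record. The following is a minimal sketch only; the class name `DetectedRelationship`, its field names, and the use of a sentence index as the "location" are illustrative assumptions and are not taken from the disclosure.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DetectedRelationship:
    """Hypothetical container for one detected relationship."""
    relationship_type: str   # type of relationship, as defined by the text pattern
    first_entity: str        # proper name extracted from the first slot
    second_entity: str       # proper name extracted from the second slot
    order: Optional[int]     # relationship order identifier, or None if not applicable
    document_id: str         # identification of the document
    location: int            # assumed here to be the index of the matching sentence


def to_json(result: DetectedRelationship) -> str:
    # Serialize the record so that it can be stored and/or displayed.
    return json.dumps(asdict(result), sort_keys=True)


example = DetectedRelationship("customer of", "Army", "Acme", 1, "doc-001", 3)
print(to_json(example))
```

A serialized record of this kind could be written to a data storage device or rendered on a display; the JSON layout is a convenience of the sketch, not a requirement.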
- Embodiments of the system 300 can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements.
- the invention is implemented using software, which includes but is not limited to firmware, resident software, microcode, etc.
- embodiments of the system 300 can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
- a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- the medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
- Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk.
- Current examples of optical disks include compact disk—read only memory (CD-ROM), compact disk—read/write (CD-R/W) and DVD.
- a data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus.
- the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- FIG. 4 is a schematic representation of an exemplary computer system 400 suitable for use in detecting relationships as described herein.
- Computer software executes under a suitable operating system installed on the computer system 400 to assist in performing the described techniques.
- This computer software is programmed using any suitable computer programming language, and may be thought of as comprising various software code means for achieving particular steps.
- the components of the computer system 400 include a computer 420 , a keyboard 410 and a mouse 415 , and a video display 490 .
- the computer 420 includes a processor 440 , a memory 450 , input/output (I/O) interfaces 460 , 465 , a video interface 445 , and a storage device 455 .
- the processor 440 is a central processing unit (CPU) that executes the operating system and the computer software executing under the operating system.
- the memory 450 includes random access memory (RAM) and read-only memory (ROM), and is used under direction of the processor 440 .
- the video interface 445 is connected to video display 490 .
- User input to operate the computer 420 is provided from the keyboard 410 and mouse 415 .
- the storage device 455 can include a disk drive or any other suitable storage medium.
- Each of the components of the computer 420 is connected to an internal bus 430 that includes data, address, and control buses, to allow components of the computer 420 to communicate with each other via the bus 430 .
- the computer system 400 can be connected to one or more other similar computers via input/output (I/O) interface 465 using a communication channel 465 to a network, represented as the Internet 480 .
- the computer software may be recorded on a portable storage medium, in which case, the computer software program is accessed by the computer system 400 from the storage device 455 .
- the computer software can be accessed directly from the Internet 480 by the computer 420 .
- a user can interact with the computer system 400 using the keyboard 410 and mouse 415 to operate the programmed computer software executing on the computer 420 .
- Other configurations or types of computer systems can be equally well used to implement the described techniques.
- the computer system 400 described above is described only as an example of a particular type of system suitable for implementing the described techniques.
- a system and a method for detecting relationships described in unstructured text-based electronic documents incorporate the use of an input file that contains one or more text patterns that represent particular relationships.
- the text patterns each include regular text expressions that describe the particular relationship and slots for the location of each entity in that relationship.
- Document(s) are selected by a user and scanned by a proper noun tagger that identifies and tags every occurrence of proper names within the document(s). Then, a pattern matcher scans the document(s) to match text patterns. If a text pattern is matched within a document, a relationship detector extracts all pairs of proper names found in the slots for each matched text pattern.
- the output from the relationship detector includes the names for each entity in the relationship, the type of relationship, and the identity of the document and the location of the sentence describing the relationship in the document.
- This method and associated system are extremely cost and time efficient because they avoid the need of natural language processing or parsing (i.e., running expensive machines such as parsers and parts-of-speech taggers is unnecessary), so that they are scalable to a large number of documents. Additionally, because a user may define the text patterns with regular text expressions (as opposed to a single word or simple phrase) describing each relationship, the system and method are applicable to any type of relationship and are very precise in detecting particular relationships.
Abstract
Disclosed are embodiments of a system and a method for detecting relationships described in unstructured text-based electronic documents. The system and method incorporate the use of an input file that contains one or more text patterns that represent particular relationships. The text patterns each include regular text expressions that describe the particular relationship and slots for the location of each entity in that relationship. Document(s) are selected by a user and scanned by a proper noun tagger that identifies and tags every occurrence of proper names within the document(s). Then, a pattern matcher scans the document(s) to match text patterns. If a text pattern is matched within a document, a relationship detector extracts all pairs of proper names found in the slots for each matched text pattern. The output from the relationship detector includes the names for each entity in the relationship, the type of relationship, and the identity of the document and the location of the sentence describing the relationship in the document.
Description
- 1. Field of the Invention
- The invention generally relates to the field of data mining and, more particularly, to a system and a computer-implemented method of detecting relationships by creating input files of text patterns for each type of relationship, identifying a specific text pattern within a text-based document, tagging proper names in the text-based document, and extracting those proper names located within the specific text pattern so as to identify the two entities in the relationship.
- 2. Description of the Related Art
- Recently, there has been a rapid growth of on-line discussion groups and news websites on the World Wide Web (WWW). Detecting relationships between entities (e.g., buyer/seller, employee/employer, partnerships, parent/subsidiaries, etc.) discussed on those websites could prove to be a valuable resource (e.g., to a company investigating a rival company's business dealings, to a company or individual investigating a prospective client, employee, or contractor, etc.). However, the task of manually detecting such relationships from amongst the large corpus of documents contained on the Web is laborious. Thus, there is a need for a system and computer-implemented method for automatically and accurately detecting relationships in unstructured text contained within electronic documents with minimal processing times so as to be scalable to large document sets. The challenge is both in identifying entities in a document and in detecting the particular relationship, if any, between two entities.
- In view of the foregoing, embodiments of the invention provide a system and a computer implemented method of detecting relationships in unstructured text.
- An embodiment of a method of detecting relationships in unstructured text comprises first creating text patterns that represent different types of relationships and storing those text patterns in an input file. For example, the input file can store various text patterns representing employer/employee relationships, various patterns representing partnership relationships, etc. The text patterns may be custom-created by a user and input into the input file and/or pre-created and stored in the input file by a system manufacturer. A text pattern may be created by developing at least one regular text expression, comprising a plurality of words that describe the particular type of relationship. Additionally, the text pattern is developed with two or more slots positioned within, before, or after this regular text expression. These slots will be used in subsequent method steps, as described below, in order to identify the proper names of the entities involved in the relationship (e.g., a first slot for the name of the first entity and a second slot for the name of the second entity in the relationship). The text pattern can also be created with slot location identifiers which indicate a position of the first slot and/or a position of the second slot relative to the regular text expression. For example, the text pattern can be created with slot location identifiers that indicate that the first slot should be located before the text expression and/or within a predetermined proximity from the text expression (e.g., within a predetermined number of words from the text expression). Similarly, the text pattern can be created with slot location identifiers to indicate that the second slot should be located after the text expression and/or within a predetermined proximity from the text expression. 
- Additionally, the text pattern can be created with a relationship order identifier (i.e., an identifier that defines an order of the first and second entities in the relationship based on the locations of the proper names within the first and second slots). For example, if the type of relationship detected is a customer/seller relationship in which one entity is a “customer of” another entity, a relationship order identifier can be embedded in the text pattern to indicate that the proper name located in the first slot identifies the customer. Lastly, the text pattern can be created with a keyword for the particular type of relationship, and specifically, for the particular text pattern. This keyword may be used in subsequent method steps, as described below, to screen out documents prior to conducting a pattern matching analysis.
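The components of a text pattern listed above (slot location identifiers, relationship order identifier, keyword, and regular text expression) can be held in a small record. The sketch below assumes the five-field, comma-separated layout of the exemplary pattern given in the detailed description; the names `TextPattern` and `parse_pattern_line`, and the particular regular expression used, are illustrative assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class TextPattern:
    """Hypothetical in-memory form of one text pattern."""
    slot1_location: int   # slot location identifier for the first slot
    slot2_location: int   # slot location identifier for the second slot
    order: int            # relationship order identifier (0 = symmetric)
    keyword: str          # keyword used to screen documents
    regex: re.Pattern     # regular expression with two capture groups (the slots)


def parse_pattern_line(line: str) -> TextPattern:
    # Split on the first four commas only, because the regular
    # expression itself may contain commas (e.g. "{0,3}").
    loc1, loc2, order, keyword, expression = line.strip().split(",", 4)
    return TextPattern(int(loc1), int(loc2), int(order), keyword,
                       re.compile(expression))


# Assumed example line; the expression is a simplified illustration.
pattern = parse_pattern_line(
    r"0,1,1,awarded,(.*) has awarded (.*) a (?:[^ ]* ){0,3}contract"
)
print(pattern.keyword, pattern.regex.groups)
```

Limiting the split to four commas is the one design point worth noting: a naive split on every comma would cut the quantifier `{0,3}` in two.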
- In addition to creating an input file, one or more text-based electronic documents (e.g., an unstructured text document (UTD)) are selected for processing by using an input device. The documents can be selected, for example, from the world wide web (WWW), from a wide area network (WAN), from a local area network, etc. The selection of documents can include a specific document, all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc. The selected unstructured text document(s) may be preprocessed, for example, by a preprocessor, in order to provide “noise free” text to either the proper noun tagger or the keyword identifier, described below.
- Processing of a selected text-based document comprises analyzing the document in order to determine the location for each proper name occurring within the document. This can be accomplished using a multi-step process performed, for example, by a proper noun tagger. The tagger can be adapted to first scan the document in order to identify each of the proper names occurring within the document based on a predetermined set of matching rules. The set of matching rules can be based, for example, on word capitalization, sentence structure, sentence boundaries, excluded words, etc. The tagger can also be adapted to re-scan the document in order to tag and record each of the proper names found within the document along with their locations.
- Processing of a selected text-based document also comprises analyzing the document on a sentence by sentence basis so as to locate a text pattern within the document. This can also be accomplished using a multi-step process performed, for example, by a pattern keyword identifier and pattern matcher. The keyword identifier can be adapted to first scan the document in order to determine whether or not a keyword from one or more of the text patterns in the input file is located in the document. If a keyword for a particular text pattern is found, then a full text pattern matching process can be performed, for example, by a pattern matcher, to determine if the regular text expression defined in the particular text pattern is located in the document. If a full text pattern is found within the document, the identity of the document is recorded and the location of the full text pattern is flagged.
- Upon detection of a full text pattern within a document, a multi-step relationship detection process is performed, for example, by a relationship detector. The relationship detector refers to the list of proper names recorded by the proper noun tagger, determines if proper names are located within the first and second slots, and extracts those proper names, thereby identifying the first and second entities engaged in the relationship. Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector determines the order. Lastly, the relationship detector outputs the results of the relationship detection analysis. Specifically, the relationship detector can provide an output comprising the type of relationship, the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable) and the identification of the document and the location in the document where the relationship was detected (i.e., the location of the text pattern), which can be stored and/or displayed.
- An embodiment of a system for detecting relationships in one or more unstructured text documents comprises text pattern input files, a keyword identifier, a pattern matcher, a proper noun tagger and a relationship detector.
- More specifically, the system can comprise text pattern input files stored in memory. These input files comprise text patterns that describe different types of relationships. The text patterns can be pre-created and input in the input file (e.g., by a system manufacturer) or custom developed and input into the input file by the user using an input device (e.g., a keyboard, disk, CD, internet link, hard drive, etc.). Each text pattern can comprise at least one regular text expression having a plurality of words that describe a particular relationship as well as two or more slots positioned within, before, or after this regular text expression. The slots will be used by system features, as described below, in order to identify the proper names of the entities involved in the relationship (e.g., a first slot for the name of the first entity and a second slot for the name of the second entity in the relationship). The text pattern can also comprise slot location identifiers that indicate a position of the first slot and/or a position of the second slot relative to the regular text expression, as described in detail above. Additionally, the text pattern can comprise a relationship order identifier that defines an order of the first and second entities in the relationship based on the locations of the proper names within the first and second slots, also as described in detail above. Lastly, the text pattern can comprise a keyword for the particular type of relationship and, specifically, for the particular text pattern. This keyword may be used by other features of the system, as described below, to screen out documents prior to conducting a pattern matching analysis.
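As described in the detailed description, the input files are loaded into memory and a mapping is created from relationship type to the set of patterns for that relationship. The following is a minimal sketch of such a loading step. It assumes one pattern per line, that lines beginning with `#` are comments, and that the file name identifies the relationship type; none of these conventions is stated in the disclosure, and `load_pattern_files` is a hypothetical name.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def load_pattern_files(paths):
    """Return a mapping of relationship type -> list of raw pattern lines."""
    patterns = defaultdict(list)
    for path in paths:
        # Assumption: "customer_of.txt" holds patterns for "customer of".
        relationship_type = Path(path).stem.replace("_", " ")
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line and not line.startswith("#"):
                    patterns[relationship_type].append(line)
    return dict(patterns)


# Demonstration with a temporary input file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "customer_of.txt"
    sample.write_text(
        "# exemplary pattern\n0,1,1,awarded,(.*) awarded (.*) a contract\n",
        encoding="utf-8",
    )
    mapping = load_pattern_files([sample])

print(mapping)
```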
- A communications link can be established between the system and a source for unstructured text documents (e.g., the world wide web (WWW), a wide area network (WAN), a local area network, etc.) so that a user of the system, using an input device (e.g., a keyboard, mouse, etc.) can select one or more text-based electronic documents for analysis. The documents may be selected such that they include a specific document, all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc. The system may further comprise a pre-processor adapted to pre-process selected unstructured text document(s) prior to analysis in order to provide “noise free” text to either the proper noun tagger or the keyword identifier, described below.
- The proper noun tagger can be adapted to receive the selected unstructured text document(s) and to perform a multi-step tagging process on the documents. Specifically, the tagger can be adapted to first scan each document in order to identify each occurrence of a proper name within the document based on a predetermined set of matching rules. The set of matching rules can be based, for example, on word capitalization, sentence structure, sentence boundaries, excluded words, etc. The tagger can also be adapted to re-scan the document in order to tag and record a list of each of the proper names found within the document along with their locations.
- The keyword identifier is in communication with the relationship pattern input file and is adapted to receive the selected unstructured text document(s) (e.g., before, after, or separate from the processing by the proper noun tagger) and to analyze the document(s). Specifically, the keyword identifier is adapted to scan each document sentence by sentence in order to determine whether or not a keyword from one or more of the text patterns in the input file is located in the document. If a keyword for a particular text pattern is found, the document is forwarded to a pattern matcher for further analysis.
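The keyword screening step can be sketched as follows. This is an illustration under stated assumptions: the sentence splitter is deliberately naive, keywords are matched as case-insensitive substrings, and the function names `split_sentences` and `find_keyword_hits` are hypothetical.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_keyword_hits(text, keywords):
    """Return {keyword: [sentence indices]} for keywords found in the document."""
    hits = {}
    for index, sentence in enumerate(split_sentences(text)):
        lowered = sentence.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                hits.setdefault(keyword, []).append(index)
    return hits


doc = "The Army has awarded Acme a large contract. Shares rose on Monday."
hits = find_keyword_hits(doc, ["awarded", "partnered"])
print(hits)
```

Only documents with a non-empty result would be forwarded for full pattern matching, which is what allows the cheaper keyword test to screen out documents before the regular expressions are applied.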
- The pattern matcher is adapted to perform a full text pattern matching process on the forwarded document. Specifically, the pattern matcher is adapted to scan the document sentence by sentence to determine if the regular text expression defined in the particular text pattern associated with the keyword is located in the document. If a full text pattern is found within the document, the identity of the document is recorded, the location of the full text pattern is flagged, and the document is forwarded to the relationship detector.
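A minimal sketch of the full pattern matching step is given below. It assumes that the document has already been split into sentences and that the location of a match is recorded as a sentence index; the function name `match_pattern` and the regular expression are illustrative assumptions.

```python
import re


def match_pattern(doc_id, sentences, expression):
    """Return (doc_id, sentence_index, match) for each sentence that matches."""
    compiled = re.compile(expression)
    found = []
    for index, sentence in enumerate(sentences):
        match = compiled.search(sentence)
        if match:
            # Record the identity of the document and the location of the pattern.
            found.append((doc_id, index, match))
    return found


sentences = [
    "Shares rose on Monday.",
    "The Army has awarded Acme a large contract.",
]
# Assumed expression with two capture groups standing in for the two slots.
expression = r"(.*) has awarded (.*) a (?:[^ ]* ){0,3}contract"
matches = match_pattern("doc-001", sentences, expression)
for doc_id, index, match in matches:
    print(doc_id, index, match.group(1), "|", match.group(2))
```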
- The relationship detector is in communication with the proper noun tagger and is adapted to analyze the document further in order to detect a relationship. Specifically, the relationship detector is adapted to refer to the list of proper names recorded by the proper noun tagger and determines if proper names are located within the first and second slots for the text pattern that was located in the document. If proper names are found in both slots, the relationship detector extracts those proper names, and thereby, identifies the first and second entities engaged in the relationship described by the text pattern. Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector determines the order of each named entity. Lastly, the relationship detector outputs the results of the relationship detection analysis. Specifically, the relationship detector can provide an output comprising the type of relationship, the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable) and the identification of the document and the location in the document where the relationship was detected (i.e., the location of the text pattern). This output can be stored (e.g., in a data storage device) and/or displayed on a display screen.
- These and other aspects of embodiments of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following description, while indicating embodiments of the invention and numerous specific details thereof, is given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments of the invention without departing from the spirit thereof, and the invention includes all such modifications.
- The embodiments of the invention will be better understood from the following detailed description with reference to the drawings, in which:
- FIG. 1 is a schematic flow diagram of an embodiment of a method of detecting relationships in unstructured text-based electronic documents;
- FIG. 2 is a schematic block diagram of an exemplary relationship text pattern input file;
- FIG. 3 is a schematic block diagram representing an embodiment of a system of detecting relationships in unstructured text-based electronic documents; and,
- FIG. 4 is a schematic representation of a computer system suitable for use in detecting relationships in unstructured text-based electronic documents.
- The embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments of the invention. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments of the invention may be practiced and to further enable those of skill in the art to practice the embodiments of the invention. Accordingly, the examples should not be construed as limiting the scope of the invention.
- As mentioned above, there is a need for a system and a computer-implemented method for automatically and accurately detecting relationships (e.g., a partner relationship between two corporations, an employee-employer relationship between two people, a seller-customer relationship, etc.) in unstructured text contained within electronic documents with minimal processing times so as to be scalable to large document sets. The challenge is both in identifying entities in a document and in detecting the particular relationship, if any, between two entities. Therefore, disclosed herein are embodiments of a system and method for detecting any type of relationship that is described in unstructured text-based electronic documents. Specifically, the system and method each incorporate the use of an input file that contains one or more text patterns that represent particular relationships. The text patterns each include regular text expressions that describe the particular relationship and slots for the location of each entity in that relationship. Document(s) are selected by a user and scanned by a proper noun tagger that identifies and tags every occurrence of a proper name within the document(s). Then, a pattern matcher scans the document(s) to match text patterns from the input file. If a text pattern is matched, a relationship detector extracts the proper names found in the slots for each matched text pattern. The output from the relationship detector includes the names for each entity in a relationship, the type of relationship, and the identity of the document and the location of the sentence describing the relationship in the document.
- More particularly, referring to FIG. 1, an embodiment of a method of detecting relationships in unstructured text comprises first creating text patterns 205 that represent different types of relationships 201 and storing those text patterns 205 in an input file 200, as illustrated in FIG. 2 (102-104). For example, the input file 200 can store various text patterns representing different types of relationships 201, such as employer/employee relationships, partnership relationships, etc. The text patterns 205 may be custom created and input into the input file 200 by a user and/or pre-created and stored in the input file 200 by a system manufacturer (e.g., as illustrated in FIG. 3 and discussed below). Any number of input files may be given as input, with each file containing a list of patterns 205 for a particular relationship 201.
- Specifically, the text patterns 205 may be created by developing at least one regular text expression 210, comprising a plurality of words that describe the particular type of relationship, and providing two or more slots 208, 212 positioned within, before, or after the regular text expression 210. The text pattern 205 can also be created with slot location identifiers 202 that indicate a position of the first slot and/or a position of the second slot relative to the regular text expression. For example, the text pattern 205 can be created with a slot identifier 202a to indicate that the first slot 208 should be located before the text expression 210 and/or within a predetermined proximity from the text expression (e.g., within a predetermined number of words from the text expression). Similarly, the text pattern 205 can be created with a slot identifier 202b to indicate that the second slot 212 should be located after the text expression 210 and/or within a predetermined proximity from the text expression. Additionally, the text pattern 205 can be created with a relationship order identifier 204 (i.e., an identifier that defines an order of the first and second entities in a relationship that is not symmetric, based on the locations of the proper names within the first and second slots 208 and 212). For example, if the type of relationship detected is a customer/seller relationship in which one entity is a “customer of” another entity, a relationship order identifier can be embedded in the text pattern to indicate that the proper name located in the first slot identifies the customer. Lastly, the text pattern 205 can also be created with a keyword 206 for the particular type of relationship 201, and specifically, for the particular text pattern. This keyword 206 may be used in subsequent method steps, as described below, to screen out documents prior to conducting a pattern matching analysis and to, thereby, improve the speed of the pattern-matching.
- For example, the text patterns 205 may be described in any language that supports regular expression matching (e.g., Perl) such that the slots 208, 212 can be referenced as $1 and $2, as in the following exemplary text pattern:
- 0,1,1,awarded,(.*)(?:has)awarded (.*) a (?:[^ ]* ){0,3}contract
- This exemplary text pattern contains five comma-separated fields: (1) a first number (i.e., slot location identifier 202), (2) a second number (i.e., another slot location identifier 202), (3) a third number (i.e., a relationship order identifier 204), (4) a keyword 206 for the pattern, and (5) a regular expression 210 with two slots 208, 212. The first two numbers are the slot location identifiers 202 and refer to the text matched in the $1 and $2 slots, respectively. A 0 means that, for a successful match, a proper name found within the slot must occur to the far right; a 1 means it must occur to the far left. The third number in the exemplary text pattern comprises the relationship order identifier, which specifies the order of the entities in the relationship. For example, if the relationship is “customer of,” a 1 in this field means that entity 1 (matched via $1) is a customer of entity 2 (matched via $2). A 2 in this field would mean that entity 2 is a customer of entity 1. If this field is 0, the relationship is symmetric, as in a partnership relation.
- At the start of the process, all input files and corresponding text patterns are loaded into memory and a mapping is created from relationship type 201 to the set of patterns 205 for that relationship.
- Referring again to FIG. 1, in addition to creating an input file, one or more text-based electronic documents (e.g., an unstructured text document (UTD)) are selected for processing by using an input device (e.g., the same or a different input device than that used to create and input the input files) (106). The documents can be selected, for example, from the world wide web (WWW), from a wide area network (WAN), from a local area network, etc. The selection of documents can include a specific document, all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc. Each selected unstructured text document may be preprocessed, for example, by a preprocessor, in order to provide “noise free” text to either the proper noun tagger or the keyword identifier, described below (108).
- Processing of each selected text-based document comprises analyzing the document in order to determine the location for each proper name occurring within the document (116). This can be accomplished using a multi-step process performed, for example, by a proper noun tagger. The tagger can be adapted to first scan the document in order to identify each of the proper names occurring within each document based on a predetermined complex set of matching rules and lexicons. The set of matching rules defines the proper nouns based, for example, on word capitalization, sentence structure, sentence boundaries, excluded words, etc. For example, the list of excluded words may include months, days of the week, words not capitalized in a title, etc. An exemplary rule may provide that all capitalized words, not located at the beginning of a sentence and not included on the list of excluded words, are identifiable as proper nouns. The tagger can also be adapted to re-scan the document in order to tag and record a list of each of the proper names found within the document along with their locations.
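The exemplary tagging rule (capitalized words that are not at the beginning of a sentence and not on the list of excluded words) can be sketched as follows. This is a simplified illustration: the excluded-word list is a small assumed sample, adjacent capitalized words are merged into one name as a further assumption, and locations are recorded as character offsets. The name `tag_proper_names` is hypothetical.

```python
import re

# Assumed sample of excluded words (days of the week and months).
EXCLUDED_WORDS = {
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def tag_proper_names(text):
    """Return a list of (name, start_offset, end_offset) tuples."""
    names = []
    run_start = run_end = None
    at_sentence_start = True
    for token in re.finditer(r"[A-Za-z][\w'-]*|[.!?]", text):
        word = token.group()
        is_name_word = (
            word[0].isupper()
            and not at_sentence_start
            and word not in EXCLUDED_WORDS
        )
        # Extend the current name only if this word follows it after one space.
        extends_run = (
            run_start is not None
            and is_name_word
            and text[run_end:token.start()] == " "
        )
        if extends_run:
            run_end = token.end()
        else:
            if run_start is not None:
                names.append((text[run_start:run_end], run_start, run_end))
                run_start = None
            if is_name_word:
                run_start, run_end = token.start(), token.end()
        at_sentence_start = word in {".", "!", "?"}
    if run_start is not None:
        names.append((text[run_start:run_end], run_start, run_end))
    return names


text = "The Army has awarded Acme Corp a contract. On Monday Acme celebrated."
tagged = tag_proper_names(text)
print(tagged)
```

As the rule itself implies, a proper name that opens a sentence is not tagged by this sketch; a fuller rule set or lexicon would be needed to recover such cases.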
- Processing of each selected text-based document also comprises analyzing the document on a sentence by sentence basis so as to locate a text pattern within the document (112-114). This can also be accomplished using a multi-step process performed, for example, by a pattern keyword identifier and pattern matcher. The keyword identifier can be adapted to first scan the document in order to determine whether or not a keyword from one or more of the text patterns in the input file is located in the document (112). If a keyword for a particular text pattern is found, then a full text pattern matching process can be performed, for example, by a pattern matcher, to determine if the regular text expression defined in the particular text pattern is located in the document (114). If a full text pattern is found within the document, the identity of the document is recorded and the location of the full text pattern is flagged (115).
- Upon detection of a full text pattern within a document, a multi-step relationship detection process is performed, for example, by a relationship detector. The relationship detector refers to the list of proper names recorded by the proper noun tagger and determines if proper names are located within the first and second slots and extracts those proper names, thereby, identifying the first and second entities engaged in the relationship (119). Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector determines the order. Lastly, the relationship detector outputs the results of the relationship detection analysis (120). Specifically, the relationship detector can provide an output comprising the type of relationship, the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable) and the identification of the document and the location in the document where the relationship was detected (i.e., the location of the text pattern), which can be stored (122) and/or displayed (124).
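The relationship detection step can be sketched as follows, using the slot location convention of the exemplary pattern (0: the proper name must occur at the far right of the slot; 1: at the far left) and the relationship order identifier (1: entity 1 stands in the relationship to entity 2; 2: the reverse; 0: symmetric). The function names, the `subject`/`object` keys of the result, and the regular expression are illustrative assumptions, not the disclosed implementation.

```python
import re


def name_in_slot(slot_text, names, location):
    """Return the proper name at the far right (0) or far left (1) of a slot."""
    slot_text = slot_text.strip()
    for name in sorted(names, key=len, reverse=True):  # prefer the longest name
        if location == 0 and (slot_text == name or slot_text.endswith(" " + name)):
            return name
        if location == 1 and (slot_text == name or slot_text.startswith(name + " ")):
            return name
    return None


def detect_relationship(match, names, slot1_loc, slot2_loc, order, relationship_type):
    first = name_in_slot(match.group(1), names, slot1_loc)
    second = name_in_slot(match.group(2), names, slot2_loc)
    if first is None or second is None:
        return None  # proper names must be found in both slots
    if order == 2:
        first, second = second, first  # entity 2 stands in the relationship to entity 1
    return {
        "type": relationship_type,
        "subject": first,
        "object": second,
        "symmetric": order == 0,
    }


sentence = "The Army has awarded Acme Corp a large contract."
match = re.search(r"(.*) has awarded (.*) a (?:[^ ]* ){0,3}contract", sentence)
names = ["Army", "Acme Corp"]  # as recorded by a proper noun tagger
result = detect_relationship(match, names, 0, 1, 1, "customer of")
print(result)
```

With the identifiers 0, 1, 1, the sketch reports that the name at the far right of the first slot is a customer of the name at the far left of the second slot.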
- Referring to FIG. 3, an embodiment of a system 300 for detecting relationships in one or more unstructured text documents comprises a text pattern input file 304, a keyword identifier 312, a pattern matcher 314, a proper noun tagger 316 and a relationship detector 318.
- More specifically, the
system 300 can comprise input files 304 stored in memory 305 (e.g., a hard drive, a disk, data storage device, etc.). The input files 304, as illustrated in FIG. 2 and discussed above, comprise a plurality of text patterns 205 that describe different types of relationships 201. These text patterns 205 can be pre-created and input in the input files 304 (e.g., by a system manufacturer) or custom developed and input into the input files 304 by the user using an input device 302 (e.g., a keyboard, disk, CD, internet link, hard drive, etc.).
- Each text pattern 205 can comprise at least one
text expression 210, discussed in detail above, having a plurality of words that describe a particular relationship 201 as well as two or more slots 208, 212 associated with the regular text expression 210. The slots 208, 212 are used by the relationship detector 318, as described below, in order to identify the proper names of the entities involved in the relationship (e.g., a first slot 208 for the name of the first entity and a second slot 212 for the name of the second entity in the relationship). The text pattern 205 can also comprise slot location identifiers 202a-b that indicate a position of the first slot 208 and/or a position of the second slot 212 relative to the regular text expression 210, as described in detail above. Additionally, the text pattern 205 can comprise a relationship order identifier 204 that defines an order of the first and second entities in the relationship based on the locations of the proper names within the first and second slots 208, 212. The text pattern 205 can further comprise a keyword 206 for the particular type of relationship 201 and, specifically, for the particular text pattern 205. This keyword 206 may be used by the keyword identifier 312, as described below, to screen out documents prior to conducting a pattern matching analysis in order to improve processing speed.
- A communications link 307 can be established between the
system 300 and asource 306 for unstructured text documents (e.g., the Internet, the world wide web (WWW), a wide area network (WAN), a local area network, etc.) so that a user of thesystem 300 can select, using an input device 308 (e.g., a keyboard, a mouse, etc.) one or more text-based electronic documents 309 for analysis. The document(s) may be selected to include specific document(s), all documents in a specified category of documents, all documents having a specified date range, all documents matching a Boolean query of terms, etc. Thesystem 300 may further comprise a pre-processor 310 adapted to pre-process each selected unstructured text document 309 prior to analysis in order to provide “noise free” text to either the proper noun tagger 315 or thekeyword identifier 312, described below. - The proper noun tagger 315 can be adapted to receive each selected unstructured text document 309 and to perform a multi-step tagging process on the documents. Specifically, the tagger 315 can be adapted to first scan each document in order to identify each occurrence of a proper name within the document based on a predetermined and complex set of matching rules and lexicons. The set of matching rules can be based, for example, on at least one of word capitalization, sentence structure, sentence boundaries, and excluded words (e.g., as illustrated in the detail discussion above). The tagger 315 can also be adapted to re-scan the document(s) 309 in order to tag each proper name and record a list of each of the proper names found within the document along with their the locations 317 in
memory 318.
- The
keyword identifier 312 is in communication with (i.e., is adapted to access) the relationship pattern input file 304 and is further adapted to receive the selected unstructured text document(s) 309 from the pre-processor 310 (e.g., before, after, or separate from the processing by the proper noun tagger) and to analyze each document 309. Specifically, the keyword identifier 312 is adapted to scan each document 309 sentence by sentence in order to determine whether or not a keyword 206 from one or more of the text patterns 205 in the input file (as illustrated in FIG. 2) is located in the document 309. If a keyword 206 for a particular text pattern 205 is found, the document containing the keyword is forwarded to a pattern matcher 314 for further analysis.
- The
pattern matcher 314 is adapted to perform a full text pattern matching process on the forwarded document. Specifically, the pattern matcher 314 is adapted to scan the document sentence by sentence to determine if the regular text expression defined in the particular text pattern associated with the keyword is located in the document. If a full text pattern is found within the document, the identity of the document and the location of the full text pattern 320 are recorded in a memory 319 that is accessible by the relationship detector 318. The document that contains the full text pattern is then forwarded to the relationship detector 318 for further analysis.
- The
relationship detector 318 is adapted to further analyze the document that contains the full text pattern in order to detect a relationship and, particularly, the entities engaged in the relationship. Specifically, the relationship detector 318 is adapted to access the memory 316 in order to refer to the list of proper names 317 recorded by the proper noun tagger 315. The relationship detector 318 then reviews the document and determines if proper names are located within the first and second slots for the text pattern that was located within the document. If proper names are found in both slots, the relationship detector 318 extracts those proper names and thereby identifies the first and second entities engaged in the relationship described by the text pattern. Additionally, if an order for the relationship between the first and second entities is defined in the text pattern, then the relationship detector 318 determines the order of each named entity. Lastly, the relationship detector outputs the results of the relationship detection analysis. Specifically, the relationship detector 318 can provide an output comprising the type of relationship (as defined by the text pattern), the names of the first and second entities engaged in the relationship, the order of the relationship (if applicable) and the identification of the document and the location in the document where the relationship was detected (i.e., the location of the text pattern). This output can be stored (e.g., in a data storage device 322) and/or displayed on a display screen 324.
- Embodiments of the
system 300, described above, can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In a preferred embodiment, the invention is implemented using software, which includes but is not limited to firmware, resident software, microcode, etc. Furthermore, embodiments of the system 300 can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk - read only memory (CD-ROM), compact disk - read/write (CD-R/W) and DVD. A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- FIG. 4 is a schematic representation of an exemplary computer system 400 suitable for use in detecting relationships as described herein. Computer software executes under a suitable operating system installed on the computer system 400 to assist in performing the described techniques. This computer software is programmed using any suitable computer programming language, and may be thought of as comprising various software code means for achieving particular steps. The components of the computer system 400 include a computer 420, a keyboard 410 and a mouse 415, and a video display 490. The computer 420 includes a processor 440, a memory 450, input/output (I/O) interfaces 460, 465, a video interface 445, and a storage device 455. The processor 440 is a central processing unit (CPU) that executes the operating system and the computer software executing under the operating system. The memory 450 includes random access memory (RAM) and read-only memory (ROM), and is used under direction of the processor 440. The video interface 445 is connected to video display 490. User input to operate the computer 420 is provided from the keyboard 410 and mouse 415. The storage device 455 can include a disk drive or any other suitable storage medium. Each of the components of the computer 420 is connected to an internal bus 430 that includes data, address, and control buses, to allow components of the computer 420 to communicate with each other via the bus 430. The computer system 400 can be connected to one or more other similar computers via input/output (I/O) interface 465 using a communication channel 465 to a network, represented as the Internet 480. The computer software may be recorded on a portable storage medium, in which case, the computer software program is accessed by the computer system 400 from the storage device 455. Alternatively, the computer software can be accessed directly from the Internet 480 by the computer 420.
In either case, a user can interact with the computer system 400 using the keyboard 410 and mouse 415 to operate the programmed computer software executing on the computer 420. Other configurations or types of computer systems can be equally well used to implement the described techniques. The computer system 400 described above is described only as an example of a particular type of system suitable for implementing the described techniques.
- Therefore, disclosed above are embodiments of a system and a method for detecting relationships described in unstructured text-based electronic documents. The system and method incorporate the use of an input file that contains one or more text patterns that represent particular relationships. The text patterns each include regular text expressions that describe the particular relationship and slots for the location of each entity in that relationship. Document(s) are selected by a user and scanned by a proper noun tagger that identifies and tags every occurrence of proper names within the document(s). Then, a pattern matcher scans the document(s) to match text patterns. If a text pattern is matched within a document, a relationship detector extracts all pairs of proper names found in the slots for each matched text pattern. The output from the relationship detector includes the names for each entity in the relationship, the type of relationship, and the identity of the document and the location of the sentence describing the relationship in the document. This method and associated system are extremely cost and time efficient because they avoid the need for natural language processing or parsing (i.e., running expensive machines such as parsers and parts-of-speech taggers is unnecessary), so that they are scalable to a large number of documents.
Additionally, because a user may define the text patterns with regular text expressions (as opposed to a single word or simple phrase) describing each relationship, the system and method are applicable to any type of relationship and are very precise in detecting particular relationships.
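- The rule-based proper noun tagging summarized above can be sketched in Python under stated assumptions. The description names word capitalization, sentence boundaries and excluded words as possible matching rules but does not give the rules themselves, so the rule set, the `EXCLUDED` word list and the function name `tag_proper_names` below are hypothetical. The sketch also records names in a single pass rather than the scan and re-scan described, and it treats every period as a sentence boundary, so abbreviations such as "U.S." are not handled.

```python
import re

# Hypothetical excluded words: capitalized tokens that are not proper names.
EXCLUDED = {"The", "A", "An", "In", "On", "It", "However", "Then"}

# Alphabetic words, plus sentence-ending punctuation as separate tokens.
TOKEN = re.compile(r"[A-Za-z]+|[.!?]")


def tag_proper_names(text, excluded=EXCLUDED):
    """Return a list of (name, start, end) spans for runs of capitalized words."""
    names = []
    run = []  # match objects forming the current capitalized run

    def flush():
        # Close the current run and record it with its location.
        if run:
            start, end = run[0].start(), run[-1].end()
            names.append((text[start:end], start, end))
            run.clear()

    for tok in TOKEN.finditer(text):
        word = tok.group()
        if word in (".", "!", "?"):
            flush()  # a sentence boundary always ends a name
        elif word[0].isupper() and word not in excluded:
            # Only words separated by a single space join the same name.
            if run and text[run[-1].end():tok.start()] != " ":
                flush()
            run.append(tok)
        else:
            flush()  # a lowercase or excluded word ends the run
    flush()
    return names


if __name__ == "__main__":
    sample = ("The board said Acme Corp has acquired Widget Inc. "
              "However, John Smith objected.")
    for name, start, end in tag_proper_names(sample):
        print(name, start, end)
```

- Recording each name together with its character offsets is what lets a later step check whether a proper name falls inside a slot without re-reading the document.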
- The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the invention has been described in terms of preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.
Claims (20)
1. A computer implemented method of detecting a relationship between a first entity and a second entity, said method comprising:
creating a text pattern that represents a type of relationship, wherein said text pattern comprises a first slot for said first entity and a second slot for said second entity;
analyzing a text-based document so as to locate said text pattern within said document;
determining a location for each proper name occurring within said document; and
extracting proper names located within said first slot and said second slot of said text pattern within said document, wherein said proper names located within said first slot and said second slot identify said first entity and said second entity.
2. The method of claim 1, wherein said creating of said text pattern further comprises identifying a keyword describing said relationship and wherein said method further comprises before said analyzing of said document, reviewing said document to determine if said keyword is located in said document.
3. The method of claim 1, wherein said creating of said text pattern further comprises:
creating at least one text expression comprising a plurality of words that describe said type of said relationship; and
setting a position of said first slot and said second slot relative to said at least one text expression.
4. The method of claim 1, wherein said determining of said location of each of said proper names comprises:
scanning said document to identify each of said proper names occurring within said document based on a set of matching rules;
re-scanning said document to tag said location for each of said proper names identified; and
recording said location for each of said proper names.
5. The method of claim 4, wherein said set of matching rules is based on at least one of word capitalization, sentence structure, sentence boundaries, and excluded words.
6. The method of claim 1, wherein said creating of said text pattern further comprises defining an order of said first entity and said second entity in said relationship based on said locations of said proper names within said first slot and said second slot.
7. The method of claim 1, further comprising storing a record of said relationship comprising at least one of said proper name of said first entity, said proper name of said second entity, said type of relationship between said first entity and said second entity, said order of said first entity and said second entity in said relationship, and an identifier for said document and a location in said document where said relationship is detected.
8. A system for detecting a relationship between a first entity and a second entity, said system comprising:
an input file adapted to store a text pattern that describes a type of relationship, wherein said text pattern comprises a first slot for said first entity and a second slot for said second entity;
a pattern matcher in communication with said input file and adapted to analyze a text-based document so as to locate said text pattern within said document;
a proper noun tagger adapted to locate and record occurrences of proper names within said document; and
a relationship detector in communication with said pattern matcher and said proper noun tagger and adapted to extract said proper names located within said first slot and said second slot of said text pattern within said document so as to identify said first entity and said second entity and, thereby, detect said relationship.
9. The system of claim 8, further comprising a keyword identifier in communication with said input file and adapted to review said document for said keyword and to forward said document to said pattern matcher only if said keyword is located in said document.
10. The system of claim 8, wherein said text pattern further comprises:
at least one text expression comprising a plurality of words that describe said relationship; and
positions for said first slot and said second slot relative to said at least one text expression.
11. The system of claim 8, wherein said text pattern further comprises an order of said first entity and said second entity in said relationship based on said locations of said proper names within said first slot and said second slot.
12. The system of claim 8, wherein said proper noun tagger is further adapted to scan said document to identify each of said proper names occurring within said document based on a set of matching rules, to re-scan said document to tag said location for each of said proper names, and to record said location for each of said proper names within said document.
13. The system of claim 12, wherein said set of matching rules is based on at least one of word capitalization, sentence structure, sentence boundaries, and excluded words.
14. The system of claim 11, further comprising at least one of a data storage device adapted to store at least one of said proper name of said first entity, said proper name of said second entity, said relationship between said first entity and said second entity, said order of said first entity and said second entity in said relationship, and a record of said document in which said relationship is detected.
15. A program storage device readable by computer and tangibly embodying a program of instructions executable by said computer to perform a method of detecting a relationship between a first entity and a second entity, said method comprising:
creating a text pattern that represents a type of relationship, wherein said text pattern comprises a first slot for said first entity and a second slot for said second entity;
analyzing a text-based document so as to locate said text pattern within said document;
determining a location for each proper name occurring within said document; and
extracting proper names located within said first slot and said second slot of said text pattern within said document, wherein said proper names located within said first slot and said second slot identify said first entity and said second entity.
16. The program storage device of claim 15, wherein said creating of said text pattern further comprises identifying a keyword describing said relationship and wherein said method further comprises before said analyzing of said document, reviewing said document to determine if said keyword is located in said document.
17. The program storage device of claim 15, wherein said creating of said text pattern further comprises:
creating at least one text expression comprising a plurality of words that describe said type of said relationship; and
setting a position of said first slot and said second slot relative to said at least one text expression.
18. The program storage device of claim 15, wherein said determining of said location for each of said proper names comprises:
scanning said document to identify each of said proper names occurring within said document based on a set of matching rules;
re-scanning said document to tag said location for each of said proper names; and
recording said location for each of said proper names.
19. The program storage device of claim 18, wherein said set of matching rules is based on at least one of word capitalization, sentence structure, sentence boundaries, and excluded words.
20. The program storage device of claim 15, wherein said creating of said text pattern further comprises defining an order of said first entity and said second entity in said relationship based on said locations of said proper names within said first slot and said second slot.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/231,205 US20070067320A1 (en) | 2005-09-20 | 2005-09-20 | Detecting relationships in unstructured text |
US12/056,048 US8001144B2 (en) | 2005-09-20 | 2008-03-26 | Detecting relationships in unstructured text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/231,205 US20070067320A1 (en) | 2005-09-20 | 2005-09-20 | Detecting relationships in unstructured text |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/056,048 Continuation US8001144B2 (en) | 2005-09-20 | 2008-03-26 | Detecting relationships in unstructured text |
Publications (1)
Publication Number | Publication Date |
---|---|
US20070067320A1 (en) | 2007-03-22 |
Family
ID=37885423
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/231,205 Abandoned US20070067320A1 (en) | 2005-09-20 | 2005-09-20 | Detecting relationships in unstructured text |
US12/056,048 Active 2026-12-11 US8001144B2 (en) | 2005-09-20 | 2008-03-26 | Detecting relationships in unstructured text |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/056,048 Active 2026-12-11 US8001144B2 (en) | 2005-09-20 | 2008-03-26 | Detecting relationships in unstructured text |
Country Status (1)
Country | Link |
---|---|
US (2) | US20070067320A1 (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110093258A1 (en) * | 2009-10-15 | 2011-04-21 | 2167959 Ontario Inc. | System and method for text cleaning |
US20120066160A1 (en) * | 2010-09-10 | 2012-03-15 | Salesforce.Com, Inc. | Probabilistic tree-structured learning system for extracting contact data from quotes |
US8521769B2 (en) | 2011-07-25 | 2013-08-27 | The Boeing Company | Locating ambiguities in data |
US8527695B2 (en) | 2011-07-29 | 2013-09-03 | The Boeing Company | System for updating an associative memory |
US8639495B2 (en) | 2012-01-04 | 2014-01-28 | International Business Machines Corporation | Natural language processing (‘NLP’) |
US20150331936A1 (en) * | 2014-05-14 | 2015-11-19 | Faris ALQADAH | Method and system for extracting a product and classifying text-based electronic documents |
CN111209348A (en) * | 2018-11-21 | 2020-05-29 | 百度在线网络技术(北京)有限公司 | Method and apparatus for outputting information |
US10679230B2 (en) | 2009-04-07 | 2020-06-09 | The Boeing Company | Associative memory-based project management system |
US11360990B2 (en) | 2019-06-21 | 2022-06-14 | Salesforce.Com, Inc. | Method and a system for fuzzy matching of entities in a database system based on machine learning |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8275608B2 (en) * | 2008-07-03 | 2012-09-25 | Xerox Corporation | Clique based clustering for named entity recognition system |
JP4640554B2 (en) * | 2008-08-26 | 2011-03-02 | Necビッグローブ株式会社 | Server apparatus, information processing method, and program |
US20110035210A1 (en) * | 2009-08-10 | 2011-02-10 | Benjamin Rosenfeld | Conditional random fields (crf)-based relation extraction system |
US8180755B2 (en) * | 2009-09-04 | 2012-05-15 | Yahoo! Inc. | Matching reviews to objects using a language model |
US8375022B2 (en) * | 2010-11-02 | 2013-02-12 | Hewlett-Packard Development Company, L.P. | Keyword determination based on a weight of meaningfulness |
US8626682B2 (en) * | 2011-02-22 | 2014-01-07 | Thomson Reuters Global Resources | Automatic data cleaning for machine learning classifiers |
US10380554B2 (en) | 2012-06-20 | 2019-08-13 | Hewlett-Packard Development Company, L.P. | Extracting data from email attachments |
US8781815B1 (en) * | 2013-12-05 | 2014-07-15 | Seal Software Ltd. | Non-standard and standard clause detection |
US9311295B2 (en) | 2014-01-30 | 2016-04-12 | International Business Machines Corporation | Procedure extraction and enrichment from unstructured text using natural language processing (NLP) techniques |
US9659005B2 (en) | 2014-05-16 | 2017-05-23 | Semantix Technologies Corporation | System for semantic interpretation |
US11250450B1 (en) | 2014-06-27 | 2022-02-15 | Groupon, Inc. | Method and system for programmatic generation of survey queries |
US9317566B1 (en) | 2014-06-27 | 2016-04-19 | Groupon, Inc. | Method and system for programmatic analysis of consumer reviews |
US10878017B1 (en) | 2014-07-29 | 2020-12-29 | Groupon, Inc. | System and method for programmatic generation of attribute descriptors |
US10977667B1 (en) | 2014-10-22 | 2021-04-13 | Groupon, Inc. | Method and system for programmatic analysis of consumer sentiment with regard to attribute descriptors |
US9805025B2 (en) * | 2015-07-13 | 2017-10-31 | Seal Software Limited | Standard exact clause detection |
CN106991090B (en) * | 2016-01-20 | 2020-12-11 | 北京国双科技有限公司 | Public opinion event entity analysis method and device |
US10592603B2 (en) | 2016-02-03 | 2020-03-17 | International Business Machines Corporation | Identifying logic problems in text using a statistical approach and natural language processing |
US11042702B2 (en) | 2016-02-04 | 2021-06-22 | International Business Machines Corporation | Solving textual logic problems using a statistical approach and natural language processing |
US11003716B2 (en) | 2017-01-10 | 2021-05-11 | International Business Machines Corporation | Discovery, characterization, and analysis of interpersonal relationships extracted from unstructured text data |
Citations (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5819260A (en) * | 1996-01-22 | 1998-10-06 | Lexis-Nexis | Phrase recognition method and apparatus |
US6263335B1 (en) * | 1996-02-09 | 2001-07-17 | Textwise Llc | Information extraction system and method using concept-relation-concept (CRC) triples |
US6442545B1 (en) * | 1999-06-01 | 2002-08-27 | Clearforest Ltd. | Term-level text with mining with taxonomies |
US6505197B1 (en) * | 1999-11-15 | 2003-01-07 | International Business Machines Corporation | System and method for automatically and iteratively mining related terms in a document through relations and patterns of occurrences |
US6539376B1 (en) * | 1999-11-15 | 2003-03-25 | International Business Machines Corporation | System and method for the automatic mining of new relationships |
US20030120640A1 (en) * | 2001-12-21 | 2003-06-26 | Hitachi. Ltd. | Construction method of substance dictionary, extraction of binary relationship of substance, prediction method and dynamic viewer |
US20030125929A1 (en) * | 2001-12-10 | 2003-07-03 | Thomas Bergstraesser | Services for context-sensitive flagging of information in natural language text and central management of metadata relating that information over a computer network |
US6629097B1 (en) * | 1999-04-28 | 2003-09-30 | Douglas K. Keith | Displaying implicit associations among items in loosely-structured data sets |
US20030191625A1 (en) * | 1999-11-05 | 2003-10-09 | Gorin Allen Louis | Method and system for creating a named entity language model |
US20040073548A1 (en) * | 2002-10-09 | 2004-04-15 | Myung-Eun Lim | System and method of extracting event sentences from documents |
US20040073874A1 (en) * | 2001-02-20 | 2004-04-15 | Thierry Poibeau | Device for retrieving data from a knowledge-based text |
US20040093331A1 (en) * | 2002-09-20 | 2004-05-13 | Board Of Regents, University Of Texas System | Computer program products, systems and methods for information discovery and relational analyses |
US20040167911A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Methods and products for integrating mixed format data including the extraction of relational facts from free text |
US20040225555A1 (en) * | 2003-05-09 | 2004-11-11 | Andreas Persidis | System and method for generating targeted marketing resources and market performance data |
US20040243645A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations |
US6829668B2 (en) * | 2000-12-28 | 2004-12-07 | Intel Corporation | System for finding data related to an example datum on two electronic devices |
US20050038770A1 (en) * | 2003-08-14 | 2005-02-17 | Kuchinsky Allan J. | System, tools and methods for viewing textual documents, extracting knowledge therefrom and converting the knowledge into other forms of representation of the knowledge |
US20050060170A1 (en) * | 2003-09-17 | 2005-03-17 | Krishna Kummamura | Method, system and computer program product for profiling entities |
US20050071217A1 (en) * | 2003-09-30 | 2005-03-31 | General Electric Company | Method, system and computer product for analyzing business risk using event information extracted from natural language sources |
US20050108630A1 (en) * | 2003-11-19 | 2005-05-19 | Wasson Mark D. | Extraction of facts from text |
US20050125216A1 (en) * | 2003-12-05 | 2005-06-09 | Chitrapura Krishna P. | Extracting and grouping opinions from text documents |
US6968332B1 (en) * | 2000-05-25 | 2005-11-22 | Microsoft Corporation | Facility for highlighting documents accessed through search or browsing |
US20060277465A1 (en) * | 2005-06-07 | 2006-12-07 | Textual Analytics Solutions Pvt. Ltd. | System and method of textual information analytics |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5696916A (en) * | 1985-03-27 | 1997-12-09 | Hitachi, Ltd. | Information storage and retrieval system and display method therefor |
JP3108015B2 (en) * | 1996-05-22 | 2000-11-13 | 松下電器産業株式会社 | Hypertext search device |
US5819265A (en) * | 1996-07-12 | 1998-10-06 | International Business Machines Corporation | Processing names in a text |
US7490092B2 (en) * | 2000-07-06 | 2009-02-10 | Streamsage, Inc. | Method and system for indexing and searching timed media information based upon relevance intervals |
US6741988B1 (en) * | 2000-08-11 | 2004-05-25 | Attensity Corporation | Relational text index creation and searching |
US7363308B2 (en) * | 2000-12-28 | 2008-04-22 | Fair Isaac Corporation | System and method for obtaining keyword descriptions of records from a large database |
US7146308B2 (en) * | 2001-04-05 | 2006-12-05 | Dekang Lin | Discovery of inference rules from text |
US7587381B1 (en) * | 2002-01-25 | 2009-09-08 | Sphere Source, Inc. | Method for extracting a compact representation of the topical content of an electronic text |
JP4586690B2 (en) * | 2005-09-09 | 2010-11-24 | 沖電気工業株式会社 | Position estimation system |
Patent Citations (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5819260A (en) * | 1996-01-22 | 1998-10-06 | Lexis-Nexis | Phrase recognition method and apparatus |
US6263335B1 (en) * | 1996-02-09 | 2001-07-17 | Textwise Llc | Information extraction system and method using concept-relation-concept (CRC) triples |
US6629097B1 (en) * | 1999-04-28 | 2003-09-30 | Douglas K. Keith | Displaying implicit associations among items in loosely-structured data sets |
US6442545B1 (en) * | 1999-06-01 | 2002-08-27 | Clearforest Ltd. | Term-level text with mining with taxonomies |
US20030191625A1 (en) * | 1999-11-05 | 2003-10-09 | Gorin Allen Louis | Method and system for creating a named entity language model |
US6505197B1 (en) * | 1999-11-15 | 2003-01-07 | International Business Machines Corporation | System and method for automatically and iteratively mining related terms in a document through relations and patterns of occurrences |
US6539376B1 (en) * | 1999-11-15 | 2003-03-25 | International Business Machines Corporation | System and method for the automatic mining of new relationships |
US6968332B1 (en) * | 2000-05-25 | 2005-11-22 | Microsoft Corporation | Facility for highlighting documents accessed through search or browsing |
US6829668B2 (en) * | 2000-12-28 | 2004-12-07 | Intel Corporation | System for finding data related to an example datum on two electronic devices |
US20040073874A1 (en) * | 2001-02-20 | 2004-04-15 | Thierry Poibeau | Device for retrieving data from a knowledge-based text |
US20030125929A1 (en) * | 2001-12-10 | 2003-07-03 | Thomas Bergstraesser | Services for context-sensitive flagging of information in natural language text and central management of metadata relating that information over a computer network |
US20030120640A1 (en) * | 2001-12-21 | 2003-06-26 | Hitachi, Ltd. | Construction method of substance dictionary, extraction of binary relationship of substance, prediction method and dynamic viewer |
US20040093331A1 (en) * | 2002-09-20 | 2004-05-13 | Board Of Regents, University Of Texas System | Computer program products, systems and methods for information discovery and relational analyses |
US20040073548A1 (en) * | 2002-10-09 | 2004-04-15 | Myung-Eun Lim | System and method of extracting event sentences from documents |
US20040167911A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Methods and products for integrating mixed format data including the extraction of relational facts from free text |
US20040225555A1 (en) * | 2003-05-09 | 2004-11-11 | Andreas Persidis | System and method for generating targeted marketing resources and market performance data |
US20040243645A1 (en) * | 2003-05-30 | 2004-12-02 | International Business Machines Corporation | System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations |
US20050038770A1 (en) * | 2003-08-14 | 2005-02-17 | Kuchinsky Allan J. | System, tools and methods for viewing textual documents, extracting knowledge therefrom and converting the knowledge into other forms of representation of the knowledge |
US20050060170A1 (en) * | 2003-09-17 | 2005-03-17 | Krishna Kummamura | Method, system and computer program product for profiling entities |
US20050071217A1 (en) * | 2003-09-30 | 2005-03-31 | General Electric Company | Method, system and computer product for analyzing business risk using event information extracted from natural language sources |
US20050108630A1 (en) * | 2003-11-19 | 2005-05-19 | Wasson Mark D. | Extraction of facts from text |
US20050125216A1 (en) * | 2003-12-05 | 2005-06-09 | Chitrapura Krishna P. | Extracting and grouping opinions from text documents |
US20060277465A1 (en) * | 2005-06-07 | 2006-12-07 | Textual Analytics Solutions Pvt. Ltd. | System and method of textual information analytics |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10679230B2 (en) | 2009-04-07 | 2020-06-09 | The Boeing Company | Associative memory-based project management system |
US8380492B2 (en) * | 2009-10-15 | 2013-02-19 | Rogers Communications Inc. | System and method for text cleaning by classifying sentences using numerically represented features |
US20110093258A1 (en) * | 2009-10-15 | 2011-04-21 | 2167959 Ontario Inc. | System and method for text cleaning |
US9619534B2 (en) * | 2010-09-10 | 2017-04-11 | Salesforce.Com, Inc. | Probabilistic tree-structured learning system for extracting contact data from quotes |
US20120066160A1 (en) * | 2010-09-10 | 2012-03-15 | Salesforce.Com, Inc. | Probabilistic tree-structured learning system for extracting contact data from quotes |
US9805085B2 (en) | 2011-07-25 | 2017-10-31 | The Boeing Company | Locating ambiguities in data |
US8521769B2 (en) | 2011-07-25 | 2013-08-27 | The Boeing Company | Locating ambiguities in data |
US8527695B2 (en) | 2011-07-29 | 2013-09-03 | The Boeing Company | System for updating an associative memory |
US8639497B2 (en) | 2012-01-04 | 2014-01-28 | International Business Machines Corporation | Natural language processing (‘NLP’) |
US8639495B2 (en) | 2012-01-04 | 2014-01-28 | International Business Machines Corporation | Natural language processing (‘NLP’) |
US20150331936A1 (en) * | 2014-05-14 | 2015-11-19 | Faris ALQADAH | Method and system for extracting a product and classifying text-based electronic documents |
CN111209348A (en) * | 2018-11-21 | 2020-05-29 | 百度在线网络技术(北京)有限公司 | Method and apparatus for outputting information |
US11360990B2 (en) | 2019-06-21 | 2022-06-14 | Salesforce.Com, Inc. | Method and a system for fuzzy matching of entities in a database system based on machine learning |
Also Published As
Publication number | Publication date |
---|---|
US20080177740A1 (en) | 2008-07-24 |
US8001144B2 (en) | 2011-08-16 |
Similar Documents
Publication | Title |
---|---|
US8001144B2 (en) | Detecting relationships in unstructured text |
US11341194B1 (en) | Models for classifying documents |
US10282468B2 (en) | Document-based requirement identification and extraction |
US7565606B2 (en) | Automated spell analysis |
US8209335B2 (en) | Extracting informative phrases from unstructured text |
US20180075025A1 (en) | Converting data into natural language form |
US7630968B2 (en) | Extracting information from formatted sources |
JP5065420B2 (en) | Method, system, and computer-readable medium for pre-assessment and refinement of the quality of a web service definition |
US20030004941A1 (en) | Method, terminal and computer program for keyword searching |
Dingli et al. | An intelligent framework for website usability |
US20160292153A1 (en) | Identification of examples in documents |
Nasr et al. | Automated extraction of product comparison matrices from informal product descriptions |
US20050081146A1 (en) | Relation chart-creating program, relation chart-creating method, and relation chart-creating apparatus |
WO2009096523A1 (en) | Information analysis device, search system, information analysis method, and information analysis program |
Newman et al. | On the generation, structure, and semantics of grammar patterns in source code identifiers |
Merten et al. | Do information retrieval algorithms for automated traceability perform effectively on issue tracking system data? |
Dąbrowski et al. | Mining user opinions to support requirement engineering: an empirical study |
US20170083614A1 (en) | System and method for concept-based search summaries |
Quirchmayr et al. | Semi-automatic Software Feature-Relevant Information Extraction from Natural Language User Manuals: An Approach and Practical Experience at Roche Diagnostics GmbH |
Nassif et al. | Automatically categorizing software technologies |
Sahin et al. | Introduction to Apple ML tools |
US20230068340A1 (en) | Data management suggestions from knowledge graph actions |
JP4361299B2 (en) | Evaluation expression extraction apparatus, program, and storage medium |
CN112711695A (en) | Content-based search suggestion generation method and device |
Dąbrowski et al. | Mining and searching app reviews for requirements engineering: Evaluation and replication studies |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: NOVAK, JASMINE; REEL/FRAME: 017022/0632; Effective date: 20050919 |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO PAY ISSUE FEE |