WO2014035334A1 - Semiotic selection method and system for text summarization - Google Patents

Semiotic selection method and system for text summarization

Publication number
WO2014035334A1
Authority
WO
WIPO (PCT)
Prior art keywords
semiotic
text
marker
identifier
textual statement
Prior art date
Application number
PCT/SG2012/000306
Other languages
French (fr)
Inventor
Ewe Tiam TIAH
Ming Shen CHEO
Original Assignee
Nuffnangx Pte Ltd
Priority date
Filing date
Publication date
Application filed by Nuffnangx Pte Ltd
Priority to PCT/SG2012/000306
Publication of WO2014035334A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Definitions

  • This invention relates generally to methods and systems for text summarization, and more particularly to a semiotic selection method and system for text summarization.
  • Search facilitators such as search engines, search aggregators, and web crawlers have been developed in an attempt to assist in gathering and presenting a selection of information that is specific to an individual's request. These Internet search facilitators typically base their searches on a formatted representation of the actual information, such as formatted summaries or feeds of the information rather than the actual information itself.
  • the formatted summaries may be generated in accordance with a standardised format such as resource description framework (RDF) site summary or rich site summary or really simple syndication (RSS) format.
  • the quality of the search result is limited to the quality of the search query format and searchable format of the representation of the actual information on which the search request is based.
  • the quality of such formatted summaries or feeds of the actual information may range from an automatically provided title of the actual article or information to a manually prepared written synopsis of the information.
  • Merely providing the title of an article does not improve the quality of the searchable format, as the title is typically already included in the searchable format, and while preparing a manually written synopsis may improve the quality of the searchable format, it requires time-consuming manual input.
  • a semiotic selection system for text summarization comprises a processor configured for receiving a body of text and identifying at least one semiotic marker and associated semiotic identifiers within the body of text, each semiotic marker defining a textual statement; and a database for storing a corresponding value for each associated semiotic identifier; the processor calculating a score for each semiotic marker and associated textual statement, the score being the sum of the corresponding values retrieved from the database of any semiotic identifier associated with a semiotic marker within the textual statement, and selecting the textual statement and semiotic marker having the highest score for the body of text summary.
  • a semiotic selection method for text summarization that comprises receiving a body of text and identifying at least one semiotic marker and associated semiotic identifiers within the body of text, each semiotic marker defining a textual statement; assigning a corresponding value for each associated semiotic identifier; calculating a score for each semiotic marker and associated textual statement, the score being the sum of the corresponding values of any semiotic identifier associated with a semiotic marker within the textual statement; and selecting the textual statement and semiotic marker having the highest score for the body of text summary.
  • the textual statement is a sentence.
  • the semiotic marker is a delimiter comprising a full stop and a space.
  • the semiotic identifier is a semiotic keyword.
  • the semiotic keyword may be an adjective, a verb or the like.
  • the associated value for the semiotic keyword may be 50 points.
  • the semiotic identifier is a punctuation mark.
  • the associated value for the punctuation mark may be 10 points.
  • the semiotic identifier is a hypertext mark-up language (HTML) element.
  • the associated value for the HTML element may be 10 points.
  • the HTML element may comprise <b>, <em>, <strong>, or <i>.
  • the processor processes the selected textual statement for body of text summary as an RSS feed.
  • the processor is arranged to identify at least two categories of associated semiotic identifiers with the body of text.
  • the body of text is an article.
  • the body of text comprises at least two semiotic markers.
  • the body of text comprises a plurality of semiotic markers and a plurality of semiotic identifiers.
  • Figure 1 illustrates a block diagram of an exemplary system on which various embodiments of the invention may be implemented
  • Figure 2 illustrates a block diagram of an exemplary personal computer that may be implemented in the system of Figure 1 in accordance with an embodiment of the invention
  • FIG. 3 illustrates a block diagram of an exemplary server that may be implemented in the system of Figure 1 in accordance with an embodiment of the invention
  • Figure 4 illustrates a block diagram of an exemplary semiotic identifier module that may be implemented in the server of Figure 3 in accordance with an embodiment of the invention
  • Figure 5 illustrates a block diagram of an exemplary database that may be implemented in the system of Figure 1 in accordance with an embodiment of the invention
  • Figure 6 illustrates a semiotic keyword look up table (LUT) in accordance with an embodiment of the invention
  • Figure 7 illustrates a semiotic punctuation LUT in accordance with an embodiment of the invention
  • Figure 8 illustrates a semiotic HTML LUT in accordance with an embodiment of the invention
  • Figure 9 illustrates a semiotic identifier point score LUT in accordance with an embodiment of the invention.
  • Figure 10 illustrates a flow chart of a method in accordance with an embodiment of the invention
  • Figures 11-20 illustrate allocation of point value score tables for sentences in a body of text analysed with a semiotic selection method for text summarization in accordance with an embodiment of the invention
  • Figure 21 illustrates a total semiotic point value summary table of all the sentences analysed in Figures 11-20 in accordance with an embodiment of the invention.
  • a semiotic selection method and system for text summarization for automatically receiving and analysing a body of text in a natural language.
  • the body of text comprises semiotic markers defining a series of statements or sentences. Each of the sentences is compared and ranked according to its semiotic relevance to the body of text. The sentence with the highest semiotic ranking is selected as a summary for the body of text.
  • the body of information may be in textual form in the format of an article.
  • Such articles are published on the blog sites via a web service and the semiotic selection text summary may be the content of RSS feeds.
  • the sentence selected as the summary is an excerpt of the body of text.
  • the sentence selected is determined to be the most semiotically significant sentence in the body of text.
  • the categories for determining the semiotic significance of each sentence in the body of text relate to the spectrum of semiotic
  • the semiotic significance of text includes semantic relevance, and additional semiotic elements or semiotic identifiers which can be conveyed to the reader in other formats while the reader is reading the body of text.
  • semiotic identifier categories include keywords that convey particular semiotic meaning, punctuation marks, font types comprising bold, italics, and the like.
  • Each sentence is ranked according to the number and type of semiotic identifier categories. The sentence ranked with the highest semiotic score has different textual attribute elements and is ranked for semiotic identifiers in different semiotic categories. It will be appreciated that the semiotic selection method and system for text summarization may base the selection or ranking of sentences on one or more semiotic categories or identifiers.
  • Figure 1 illustrates a block diagram of an exemplary system 10 on which various embodiments of the invention may be implemented.
  • a user 12 accesses the system 10 with a personal computer 14 that is in communication with a server 16 and database 18 via a network 20, such as a local area network (LAN), intranet, internet or the like.
  • the server 16 may communicate directly with the personal computer 14 and/or the database 18.
  • FIG. 2 illustrates a block diagram 30 of an exemplary personal computer 14 that may be implemented in the system of Figure 1 in accordance with an embodiment of the invention.
  • the personal computer comprises a processor 32 and memory 34 for processing and storing data, respectively, and executing the various data modules and software commands installed to perform various functions on the personal computer.
  • the personal computer 14 comprises a communications interface 36 to transmit and receive data via the network 20 including cellular networks and other networks such as local area networks (LAN), intranet, internet and the like.
  • the personal computer may comprise data input means 42 such as a keyboard, camera, microphone, and the like, and data output means 40 such as a display, printer, and the like.
  • FIG. 3 illustrates a block diagram 50 of an exemplary server 16 that may be implemented in the system of Figure 1 in accordance with an embodiment of the invention.
  • the server comprises a processor 52 and memory 54 for processing and storing data, respectively, and executing the various data modules and software commands installed to perform various functions on the server.
  • the server 16 comprises a communications interface 56 to transmit and receive data via networks 20 such as local area networks LAN, intranet, internet and the like.
  • the server 16 comprises the semiotic selection and management system modules 60 comprising a semiotic marker module 62, a semiotic identifier module 64, and a semiotic point value scoring module 66.
  • the semiotic marker module 62 analyses the text of the information to be summarised.
  • the semiotic marker module 62 identifies the statements or sentences within the text by identifying a full stop followed by a space to define the sentences.
  • the semiotic identifier module 64 analyses each identified sentence for any semiotic identifiers or elements such as semiotic keywords, punctuation marks, HTML elements or tags, and the like.
  • the semiotic point value scoring module 66 calculates the number of semiotic identifiers or elements.
  • the semiotic point value scoring module 66 determines the semiotic total point scores for each of the sentences and compares the scores to determine and select the sentence with the highest semiotic point value score.
  • the sentence with the highest semiotic point value score is the sentence selected as the summary for the text of the information. It will be appreciated that the choice of selecting or ranking of sentences may comprise one or more semiotic category or identifier.
  • FIG. 4 illustrates a block diagram 70 of an exemplary semiotic identifier module 64 that may be implemented in the server 16 of Figure 3 in accordance with an embodiment of the invention.
  • the semiotic identifier module 64 comprises a keyword analyser module 72, a punctuation analyser module 74, and HTML analyser module 76.
  • the keyword analyser module 72 analyses each sentence for any semiotic keywords.
  • the punctuation analyser module 74 analyses each sentence for any punctuation.
  • the HTML analyser module 76 analyses each sentence for any HTML elements.
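  • As a minimal sketch only, the three analyser modules described above might be expressed in Python as follows. The function names, the tiny sample look-up sets, and the regular expressions are assumptions made for illustration and are not taken from the specification. The sketch counts an HTML element once per opening tag and strips tags before looking for keywords and punctuation.

```python
import re

# Hypothetical, heavily truncated stand-ins for the look-up tables.
KEYWORDS = {"abnormal", "aboard", "absorbing", "quick", "whittle", "wonderful", "wreak"}
PUNCTUATION = {"?", "!", "(", ")"}
HTML_ELEMENTS = {"b", "em", "i", "strong"}

# Matches an opening or closing tag and captures the element name.
TAG = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
# Matches lower-case words, allowing inner hyphens and apostrophes.
WORD = re.compile(r"[a-z]+(?:[-'][a-z]+)*")


def strip_tags(sentence):
    """Remove markup so that tag names are not mistaken for words."""
    return TAG.sub("", sentence)


def analyse_keywords(sentence):
    """Return the listed keywords found in the sentence, in order."""
    words = WORD.findall(strip_tags(sentence).lower())
    return [word for word in words if word in KEYWORDS]


def analyse_punctuation(sentence):
    """Return the listed punctuation marks found in the sentence, in order."""
    return [char for char in strip_tags(sentence) if char in PUNCTUATION]


def analyse_html(sentence):
    """Return the listed HTML elements, one entry per opening tag."""
    found = []
    for match in TAG.finditer(sentence):
        is_closing = match.group(0).startswith("</")
        name = match.group(1).lower()
        if not is_closing and name in HTML_ELEMENTS:
            found.append(name)
    return found
```

  • Counting an element once per opening tag is a design choice of this sketch; it matches the worked example, where a single emphasised span contributes one HTML identifier.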
  • FIG. 5 illustrates a block diagram 80 of an exemplary database 18 that may be implemented in the system 10 of Figure 1 in accordance with an embodiment of the invention.
  • the database 18 comprises a communications interface 82 to transmit and receive data via networks 20 such as local area networks LAN, intranet, internet and the like.
  • the database comprises a data store for semiotic keyword look up table (LUT) 84, semiotic punctuation LUT 86, semiotic HTML LUT 88, semiotic identifier point score LUT 90, and semiotic total point scores store 92.
  • the data may be stored in the database, or may be stored and received in other sources, such as other databases, server, external means, personal computer, or the like.
  • Although the database is shown as a standalone unit, it will be appreciated that the database 18 or each data store 84, 86, 88, 90, 92 may reside within the server or personal computer in the system or external to the system shown.
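  • For illustration, the point score look-up and the total point scores store could be held in a small relational database. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table names, column names, and helper functions are hypothetical, and only the 50/10/10 point values come from the embodiment described in this document.

```python
import sqlite3


def build_store():
    """Create an in-memory database with a point score table and a totals table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE identifier_points ("
        "category TEXT PRIMARY KEY, points INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO identifier_points (category, points) VALUES (?, ?)",
        [("keyword", 50), ("punctuation", 10), ("html", 10)],
    )
    conn.execute(
        "CREATE TABLE sentence_totals ("
        "sentence_no INTEGER PRIMARY KEY, total INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def points_for(conn, category):
    """Look up the point value for a category; unknown categories score 0."""
    row = conn.execute(
        "SELECT points FROM identifier_points WHERE category = ?", (category,)
    ).fetchone()
    return row[0] if row else 0


def store_total(conn, sentence_no, total):
    """Store (or overwrite) the total point score of one sentence."""
    conn.execute(
        "INSERT OR REPLACE INTO sentence_totals (sentence_no, total) VALUES (?, ?)",
        (sentence_no, total),
    )


def highest(conn):
    """Return (sentence_no, total) for the top score, earliest sentence on ties."""
    return conn.execute(
        "SELECT sentence_no, total FROM sentence_totals "
        "ORDER BY total DESC, sentence_no ASC LIMIT 1"
    ).fetchone()
```

  • Because the stores are plain tables, they could equally live on the server, on the personal computer, or in an external service, as the text allows.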
  • FIG. 6 illustrates a semiotic keyword look up table (LUT) 84 as shown in Figure 5 in greater detail in accordance with an embodiment of the invention.
  • the semiotic keywords in the LUT are shown as: "ABNORMAL"; "ABOARD"; "ABSORBING"; "QUICK"; "WHITTLE"; "WONDERFUL"; "WREAK"; and the like. It will be appreciated that any number of semiotic keywords can be in the LUT. For example, there may be 500 keywords chosen in the LUT. The keywords chosen are deemed semiotically significant.
  • Semiotically significant keywords are emotive words, such as "happy", value judgement words, such as "best", words that signify a conclusion, such as "end", or the like.
  • the semiotically significant keywords are chosen for the descriptive, emotive and distinct value of the word. For example, words that may have semiotic significance are descriptive, emotive and distinct from other words. It will be appreciated that any number of semiotic keywords and additional types of semiotic keywords may be included in the LUT. It will be appreciated that the list of semiotic keywords is selected to be applicable to all different types of subjects or topics of the body of text being summarised.
  • keywords are symbolic representations that convey a message or meaning, and may comprise combinations of words, phrases, letters, numbers, symbols, alphanumeric, and the like.
  • the keywords may comprise different spelt versions of the same word or words such as commonly misspelt versions, culturally different versions,
  • the keywords may comprise jargon, lingo, slang, profanity, dialect, acronyms, abbreviations, shorthand, and the like.
  • the keywords may take different grammatical or syntactic variations and comprise different tenses, base, stem, or root words with different affixes, such as prefixes and/or suffixes, adjectives, nouns, verbs, hyphenated and non-hyphenated versions, and the like.
  • keywords for illustrative purposes only is: abnormal, aboard, absorbing, accessible, accredited, accusing, acrimonious, adapted, adept, admiration, adoring, adrift, adverse, affection, afflicted, aflutter, aggrieved, agitated, alienated, internship, amused, anesthetized, animals, anticipation, antique, apathy, apologetic, apprehensive, archaic, armored, assign, astute, asymmetrical, at home, at war, attitude, average, awakened, awed, out, bad, banned, beckoned, belief, best, bland, boring, brash, spectacular, brisk, bristling, broadminded, broken-up, burdensome, capable, capricious, captious, cautious, ceaseless, classy, claustrophobic, clenched, closed, clouded, coherent, collapsed, colossal, comforted, commanding, committed, complimented, conclude,
  • Figure 7 illustrates a semiotic punctuation LUT 86 as shown in Figure 5 in greater detail in accordance with an embodiment of the invention.
  • the punctuation marks shown are a question mark, exclamation mark, open parenthesis, closed parenthesis, and the like. It will be appreciated that any number of punctuation marks and additional punctuation marks not shown may be included in the LUT.
  • Figure 8 illustrates a semiotic HTML LUT 88 as shown in Figure 5 in greater detail in accordance with an embodiment of the invention.
  • the text of the information being analysed contains HTML elements or tags.
  • the HTML elements or tags are searched.
  • the HTML elements shown are: "<b>", as a boldface presentation; "<em>", as an italic emphasis presentation; "<i>", as an italic presentation; "<strong>", as strong emphasis boldface presentation; and the like. It will be appreciated that any number of HTML elements and additional HTML elements not shown may be included in the LUT.
  • Figure 9 illustrates a semiotic identifier point score LUT 90 as shown in Figure 5 in greater detail in accordance with an embodiment of the invention.
  • FIG. 10 illustrates a flow chart 150 of a method in accordance with an embodiment of the invention. The method comprises receiving incoming data such as a body of text 152 of information to be summarised.
  • Text may be received from different sources, and the incoming data may comprise different formats and variations of text such as symbols, letters, words, numbers, fonts, font sizes, images, codes, ciphers, hieroglyphs, and the like.
  • the semiotic markers are identified 154.
  • the semiotic markers define the statements or sentences of the text.
  • the semiotic markers in this embodiment are full stops or periods followed by a space. It will be appreciated that the semiotic markers may comprise different forms and variations, for example double space, and the like.
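  • The marker step can be sketched in a few lines of Python. This is an illustrative assumption of how a full stop followed by a space might be detected, not the implementation of the embodiment; the names MARKER and split_statements are hypothetical. The pattern also accepts a run of spaces, which covers the double-space variation mentioned above.

```python
import re

# Assumed marker: a full stop followed by one or more spaces. The lookbehind
# keeps the full stop attached to the statement that it closes.
MARKER = re.compile(r"(?<=\.) +")


def split_statements(text):
    """Split a body of text into statements at each marker."""
    return [part.strip() for part in MARKER.split(text) if part.strip()]
```

  • A marker this simple also splits after abbreviations such as "e.g. "; the text here does not say how such cases are handled, so the sketch leaves them untreated.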
  • each sentence is analysed 156.
  • the semiotic identifiers are identified 158.
  • the semiotic identifiers in this embodiment are semiotic keywords, punctuation, HTML elements, and the like.
  • a point value is applied 160 to each semiotic identifier in accordance with the respective look up table (LUT).
  • points are allocated as follows: 50 points are applied to each semiotic keyword; 10 points are applied to each punctuation mark; and 10 points are applied to each HTML element.
  • a total semiotic point value is calculated and stored 162 for each sentence analysed. The steps are repeated 164 for each sentence until the final sentence. The total semiotic point value scores are compared 166, and the sentence with the highest semiotic point value score is selected 168 as the summary sentence to represent the text being analysed.
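  • Putting the steps of the flow chart together, a minimal end-to-end sketch in Python might look as follows. The look-up sets are small hypothetical excerpts, the regular expressions and function names are assumptions, and the tie-breaking rule (the earliest sentence wins) is a choice of this sketch, since the text does not state one. Only the 50/10/10 point values are taken from the embodiment.

```python
import re

POINTS = {"keyword": 50, "punctuation": 10, "html": 10}
KEYWORDS = {"quick", "wonderful", "best"}  # hypothetical excerpt of the keyword LUT
PUNCTUATION = {"?", "!", "(", ")"}
HTML_ELEMENTS = {"b", "em", "i", "strong"}

MARKER = re.compile(r"(?<=\.) +")  # full stop followed by space(s)
TAG = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
WORD = re.compile(r"[a-z]+(?:[-'][a-z]+)*")


def score_sentence(sentence):
    """Sum the point values of every semiotic identifier in one sentence."""
    plain = TAG.sub("", sentence)  # markup removed before word and mark counts
    keywords = sum(1 for word in WORD.findall(plain.lower()) if word in KEYWORDS)
    punctuation = sum(1 for char in plain if char in PUNCTUATION)
    html = sum(
        1
        for match in TAG.finditer(sentence)
        if not match.group(0).startswith("</")
        and match.group(1).lower() in HTML_ELEMENTS
    )
    return (
        keywords * POINTS["keyword"]
        + punctuation * POINTS["punctuation"]
        + html * POINTS["html"]
    )


def summarise(text):
    """Return (sentence, score) for the highest scoring sentence, or None."""
    sentences = [part.strip() for part in MARKER.split(text) if part.strip()]
    if not sentences:
        return None
    totals = [score_sentence(sentence) for sentence in sentences]
    # max() returns the first maximal index, so the earliest sentence wins ties.
    best = max(range(len(sentences)), key=lambda index: totals[index])
    return sentences[best], totals[best]
```

  • Called on a body of text in which one sentence contains the keyword "quick" inside a <strong> element, summarise returns that sentence with a score of 60 under these assumed tables.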
  • An example semiotic selection method for text summarization in accordance with an embodiment of the invention is applied to the following illustration of source of information:
  • the above body of text is an example for illustrative purposes only of incoming data format of text of an article retrieved or downloaded from a website, blog, or the like.
  • the text includes a markup language such as HTML.
  • data may be retrieved from a posting on a social network website such as FACEBOOK, LINKEDIN, or the like.
  • FACEBOOK is a trademark or registered trademark of Facebook, Inc. of Palo Alto,
  • LINKEDIN is a trademark or registered trademark of LinkedIn Corporation of Mountain View, California, United States of America.
  • the semiotic markers are identified as full stops.
  • the sentence numbering and the semiotic identifiers of the HTML elements are provided in the sentences and displayed in their corresponding word processing version for illustrative purposes only, as follows:
  • Sentence 1 Words and photographs by JK of Cupcake Weekend.
  • Sentence 2 On the last week of May I treaded on nippon ground despite the many incessant warnings of radiation scares and it was a risk well taken.
  • Sentence 3 If a city could be a person then I was extremely impressed by Tokyo.
  • Sentence 4 She is indeed very magical with so much to explore, taste & feel.
  • Sentence 5 Tokyo is undoubtedly my Paris in Asia.
  • Sentence 6 Here's a quick travel guide on things to do in Tokyo for five days.
  • Sentence 7 The first thing to do in right after you get off your plane is to purchase a subway pass.
  • Sentence 8 The next thing not to do: take a cab (unless you really have to).
  • Sentence 9 Not only is it veryly expensive, you miss out on alot when you travel via taxi.
  • Sentence 10 Though the subway map can be pretty taunting, I assure you that you will get the hang of it within a day or two.
  • FIGS. 11-20 illustrate the allocation of point value score tables
  • the point value score of sentence 1 is shown in the point value score table 200 of Figure 11 with a 10 point value score for one semiotic identifier taking the form of an HTML identifier of italics with emphasis identified with the <em> and </em> HTML tags.
  • the identified semiotic HTML element tag "<em>" in sentence 1 is listed in semiotic HTML LUT 88 of Figure 8 as "2. <em>".
  • the point value score of Sentence 6 is shown in the point value score table 210 of Figure 16 with a 60 point value score for two semiotic identifiers taking the forms of a semiotic keyword "quick", and an HTML identifier of bold identified with the <strong> and </strong> HTML tags.
  • the identified semiotic keyword "quick" in sentence 6 is shown listed in semiotic keyword LUT 84 of Figure 6 as semiotic keyword "N-3. QUICK".
  • the identified semiotic HTML element tag "<strong>" in sentence 6 is listed in semiotic HTML LUT 88 of Figure 8 as "N. <strong>".
  • the point value score of Sentence 8 is shown in the point value score table 214 of Figure 18 with a 20 point value score for the semiotic identifiers taking the form of punctuation, namely an opening and a closing parenthesis, which are listed in semiotic punctuation LUT 86 of Figure 7 as "N-1. (" and "N. )".
  • the overall total semiotic point value summary table 220 of Figure 21 shows the point value score results of all ten sentences. Sentence 6 has the highest semiotic point value score of 60; the sentence number and the score are shown in bold. Accordingly, sentence 6 is selected as the summary excerpt.
  • the sentence is displayed without the semiotic identifiers, "Here's a quick travel guide on things to do in Tokyo for five days.” It will be appreciated that the selected sentence for the summary excerpt may be displayed with all, at least one, or without any of the semiotic identifiers.
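  • The arithmetic behind the three score tables described above can be checked with a short Python sketch. The identifier counts are read from the descriptions of sentences 1, 6 and 8; the scores of the remaining sentences are not given in this text, so they are omitted, and the variable and function names are hypothetical.

```python
POINTS = {"keyword": 50, "punctuation": 10, "html": 10}

# Identifier counts for the three sentences whose score tables are described.
COUNTS = {
    1: {"html": 1},                # one <em> element
    6: {"keyword": 1, "html": 1},  # "quick" plus one <strong> element
    8: {"punctuation": 2},         # "(" and ")"
}


def total(counts):
    """Sum points over categories: point value times number of identifiers."""
    return sum(POINTS[category] * number for category, number in counts.items())


totals = {sentence_no: total(counts) for sentence_no, counts in COUNTS.items()}
selected = max(totals, key=totals.get)
print(totals, selected)  # prints {1: 10, 6: 60, 8: 20} 6
```

  • The sentence containing "quick" scores 50 + 10 = 60, which is why "Here's a quick travel guide on things to do in Tokyo for five days." is the excerpt selected.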
  • the semiotic point value scoring is calculated based on the number of semiotic identifiers associated with a particular semiotic marker. In this embodiment, the semiotic point value scoring is calculated on the number and type of semiotic identifiers or categories identified preceding an identified marker. It will be appreciated that this embodiment is suitable for the English natural language; however, the systems and methods of embodiments of the invention may have other semiotic marker and semiotic identifier formats and correlations that may be more suitable and arranged for other natural languages. It will be appreciated that the above body of text, semiotic markers, semiotic identifiers, and semiotic identifier LUTs are for illustrative purposes only. In accordance with embodiments of the invention the semiotic markers, semiotic identifiers, and semiotic identifier LUTs may take different forms and variations.
  • semiotic identifiers may take additional forms that appear in the body of text being summarised, such as capitalization, non-capitalization, fonts, font size, images, symbols, numbers, codes, ciphers, hieroglyphs, spaces, orientations, and the like, and which may be taken in different combinations thereof.
  • Additional LUTs may comprise different semiotic identifiers and may be used in conjunction with or replace any one or all of the LUTs shown and discussed.
  • the selected sentence excerpt may be presented in the original form of the sentence.
  • the sentence format may be displayed in HTML format in blog aggregator type web services, or the like.
  • Such web services include blogs posting blog articles where the selected sentence within the blog article acts as a snapshot summary.
  • the selected sentence excerpt may be used with web crawlers for content from blog RSS feeds, or the like.
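  • One way the selected sentence might be placed in an RSS feed is as the description of an item. The Python sketch below uses the standard library xml.etree.ElementTree module; the function name, the article title, and the example.com link are hypothetical, and the sketch builds a single item rather than a complete feed.

```python
import xml.etree.ElementTree as ET


def build_rss_item(title, link, summary_sentence):
    """Serialise one RSS <item> whose description is the selected sentence."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "description").text = summary_sentence
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(item, encoding="unicode")


xml_text = build_rss_item(
    "Five days in Tokyo",  # hypothetical article title
    "https://example.com/tokyo",  # hypothetical link
    "Here's a quick travel guide on things to do in Tokyo for five days.",
)
```

  • ElementTree escapes characters such as "<" and "&" in the text, so a sentence that keeps its HTML tags would be carried as escaped markup inside the description.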
  • the semiotic selection system and method for text summarization in accordance with an embodiment of the invention may be adapted for any type of encoded text such as UTF-8, or the like. It will be appreciated that the system and method can be adapted to natural languages other than English. It will be appreciated that in order to adapt the system and method for other Roman and non-Roman based languages, a combination of the same or different delimiters, semiotic markers, semiotic identifiers or categories, or the like is required as appropriate for each natural language. Additional services may be added to the method and system in accordance with embodiments of the invention such as translation services and the like.
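  • As a hedged illustration of adapting the marker to another language, the sketch below keeps the marker as per-language configuration. Only the English entry reflects the embodiment described (a full stop followed by a space); the Japanese entry, the MARKERS table, and the function name are assumptions added for illustration.

```python
# Hypothetical per-language marker settings.
MARKERS = {
    "en": {"terminator": ".", "separator": " "},
    "ja": {"terminator": "\u3002", "separator": ""},  # ideographic full stop
}


def split_statements(text, language="en"):
    """Split text into statements using the marker configured for a language."""
    config = MARKERS[language]
    pieces = text.split(config["terminator"] + config["separator"])
    statements = []
    for index, piece in enumerate(pieces):
        piece = piece.strip()
        if not piece:
            continue
        # str.split drops the marker, so the terminator is restored on every
        # piece except the last, which keeps whatever ended the text.
        if index < len(pieces) - 1:
            piece += config["terminator"]
        statements.append(piece)
    return statements
```

  • Keeping the marker in a table means that supporting a further language is a data change rather than a code change, although the keyword and punctuation tables would need the same treatment.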
  • embodiments are applicable to any suitable application, such as digital and non-digital content, devices, software, services, goods, resources, and the like, and may be practiced with variations in technology, interface, content, rights, offerings, services, speed, size, limitations, devices, and the like.
  • the devices and subsystems of the exemplary methods and systems described with respect to Figures 1-21 can communicate, for example, over a communication network, and can include any suitable servers, workstations, personal computers (PCs), laptop computers, handheld devices, with visual displays and/or monitors, tablets, telephones, cellular telephones, smartphones, wireless devices, personal digital assistants (PDAs), Internet appliances, set top boxes, modems, other devices, and the like, capable of performing the processes of the disclosed exemplary embodiments.
  • the devices and subsystems may communicate with each other using any suitable protocol and may be implemented using a general-purpose computer system and the like.
  • One or more interface mechanisms may be employed, for example, including internet access, telecommunications in any suitable form, such as voice, modem, and the like, wireless communications media, and the like.
  • the network may include, for example, wireless communications networks, cellular communications networks, public switched telephone networks (PSTNs), packet data networks (PDNs), the Internet, intranets, hybrid communications networks, combinations thereof, and the like.
  • the embodiments, as described with respect to Figures 1-21 are for exemplary purposes, as many variations of the specific hardware used to implement the disclosed exemplary embodiments are possible.
  • the functionality of the devices and the subsystems of the embodiments may be implemented via one or more programmed computer systems or devices.
  • a single computer system may be programmed to perform the functions of one or more of the devices and subsystems of the exemplary systems.
  • two or more programmed computer systems or devices may be substituted for any one of the devices and subsystems of the exemplary systems.
  • the exemplary systems described with respect to Figures 1-21 may be used to store information relating to various processes described herein. This information may be stored in one or more memories, such as hard disk, optical disk, magneto-optical disk, RAM, and the like, of the devices and sub-systems of the embodiments.
  • One or more databases of the devices and subsystems may store the information used to implement the exemplary embodiments.
  • the databases may be organised using data structures, such as records, tables, arrays, fields, graphs, trees, lists, and the like, included in one or more memories, such as the memories listed above.
  • All or a portion of the exemplary systems described with respect to Figures 1-21 may be conveniently implemented using one or more general-purpose computer systems, microprocessors, digital signal processors, micro-controllers, and the like, programmed according to the teachings of the disclosed exemplary embodiments.
  • Appropriate software may be readily prepared by programmers of ordinary skill based on the teachings of the disclosed exemplary embodiments.
  • the exemplary systems may be implemented by the preparation of application-specific integrated circuits or by interconnecting an appropriate network of component circuits.
  • the exemplary embodiments described herein may be employed in offline systems, online systems, and the like, and in applications, such as TV applications, computer applications, DVD applications, VCR applications, appliance applications, CD player applications, and the like.
  • signals employed to transmit the summary data, and the like, of the exemplary embodiments may be configured to be transmitted within the visible spectrum of a human, within the audible spectrum of a human, combinations thereof, and the like.
  • Embodiments of the invention have been described herein, including the best mode known to the inventors for carrying out the invention. Variations of those preferred embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the invention to be practiced otherwise than as specifically described herein. Accordingly, this invention includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by the applicable law.

Abstract

We describe a semiotic selection system for text summarization. The system comprises a processor configured to receive a body of text and identify at least one semiotic marker and associated semiotic identifiers within the body of text, each semiotic marker defining a textual statement; and a database for storing a corresponding value for each associated semiotic identifier. The processor calculates a score for each semiotic marker and associated textual statement, the score being the sum of the corresponding values retrieved from the database of any semiotic identifier associated with a semiotic marker within the textual statement, and selects the textual statement and semiotic marker having the highest score for the body of text summary.

Description

SEMIOTIC SELECTION METHOD AND SYSTEM FOR TEXT SUMMARIZATION
FIELD
This invention relates generally to methods and systems for text summarization, and more particularly to a semiotic selection method and system for text summarization.
BACKGROUND
The study of semiotics recognises that information can be conveyed between individuals in different forms and formats. Information in the form of natural language text is one convenient means of conveying information. With the advent of the digital age, computer devices linked together via networks, such as the Internet, have enabled individuals to freely communicate, and instantly access and store a vast amount of data and information in text form.
While this information is accessible, randomly searching through the data becomes problematic if a specific piece of information is required. Search facilitators such as search engines, search aggregators, and web crawlers have been developed in an attempt to assist in gathering and presenting a selection of information that is specific to an individual's request. These Internet search facilitators typically base their searches on a formatted representation of the actual information, such as formatted summaries or feeds of the information, rather than the actual information itself. The formatted summaries may be generated in accordance with a standardised format such as the resource description framework (RDF) site summary, rich site summary, or really simple syndication (RSS) format. However, as with any search facilitator, the quality of the search result is limited by the quality of the search query format and of the searchable format of the representation of the actual information on which the search request is based. The quality of such formatted summaries or feeds of the actual information may range from an automatically provided title of the actual article or information to a manually prepared written synopsis of the information. Merely providing the title of an article does not improve the quality of the searchable format, as the title is typically already included in the searchable format, and while a manually written synopsis may improve the quality of the searchable format, preparing one requires time-consuming manual input.
There is a need for a method and system for text summarization that facilitates the summarization of information in a semiotic approach to at least alleviate the above problems.
SUMMARY OF THE INVENTION
According to a 1st aspect of the present invention, we provide a semiotic selection system for text summarization that comprises a processor configured for receiving a body of text and identifying at least one semiotic marker and associated semiotic identifiers within the body of text, each semiotic marker defining a textual statement; and a database for storing a corresponding value for each associated semiotic identifier; the processor calculating a score for each semiotic marker and associated textual statement, the score being the sum of the corresponding values retrieved from the database of any semiotic identifier associated with a semiotic marker within the textual statement, and selecting the textual statement and semiotic marker having the highest score for the body of text summary.
There is provided, according to a 2nd aspect of the present invention, a semiotic selection method for text summarization that comprises receiving a body of text and identifying at least one semiotic marker and associated semiotic identifiers within the body of text, each semiotic marker defining a textual statement; assigning a corresponding value for each associated semiotic identifier; calculating a score for each semiotic marker and associated textual statement, the score being the sum of the corresponding values of any semiotic identifier associated with a semiotic marker within the textual statement; and selecting the textual statement and semiotic marker having the highest score for the body of text summary.
In an embodiment the textual statement is a sentence.
In an embodiment the semiotic marker is a delimiter comprising a full stop and a space.
In an embodiment the semiotic identifier is a semiotic keyword. The semiotic keyword may be an adjective, a verb or the like. The associated value for the semiotic keyword may be 50 points.
In an embodiment the semiotic identifier is a punctuation mark. The associated value for the punctuation mark may be 10 points.
In an embodiment the semiotic identifier is a hypertext mark-up language (HTML) element. The associated value for the HTML element may be 10 points. The HTML element may comprise <b>, <em>, <strong>, or <i>.
In an embodiment the processor processes the selected textual statement for body of text summary as an RSS feed.
In an embodiment the processor is arranged to identify at least two categories of associated semiotic identifiers within the body of text.
In an embodiment the body of text is an article.
In an embodiment the body of text comprises at least two semiotic markers.
In an embodiment the body of text comprises a plurality of semiotic markers and a plurality of semiotic identifiers.
BRIEF DESCRIPTION OF THE FIGURES
The accompanying drawings incorporated herein and forming a part of the specification illustrate several aspects of the present invention and, together with the description, serve to explain the principles of the invention. While the invention will be described in connection with certain embodiments, there is no intent to limit the invention to those embodiments described. On the contrary, the intent is to cover all alternatives, modifications and equivalents as included within the scope of the invention as defined by the appended claims. In the drawings:
Figure 1 illustrates a block diagram of an exemplary system on which various embodiments of the invention may be implemented;
Figure 2 illustrates a block diagram of an exemplary personal computer that may be implemented in the system of Figure 1 in accordance with an embodiment of the invention;
Figure 3 illustrates a block diagram of an exemplary server that may be implemented in the system of Figure 1 in accordance with an embodiment of the invention;
Figure 4 illustrates a block diagram of an exemplary semiotic identifier module that may be implemented in the server of Figure 3 in accordance with an embodiment of the invention;
Figure 5 illustrates a block diagram of an exemplary database that may be implemented in the system of Figure 1 in accordance with an embodiment of the invention;
Figure 6 illustrates a semiotic keyword look up table (LUT) in accordance with an embodiment of the invention;
Figure 7 illustrates a semiotic punctuation LUT in accordance with an embodiment of the invention;
Figure 8 illustrates a semiotic HTML LUT in accordance with an embodiment of the invention;
Figure 9 illustrates a semiotic identifier point score LUT in accordance with an embodiment of the invention;
Figure 10 illustrates a flow chart of a method in accordance with an embodiment of the invention;
Figures 11-20 illustrate allocation of point value score tables for sentences in a body of text analysed with a semiotic selection method for text summarization in accordance with an embodiment of the invention; and
Figure 21 illustrates a total semiotic point value summary table of all the sentences analysed in Figures 11-20 in accordance with an embodiment of the invention.
DETAILED DESCRIPTION
A semiotic selection method and system for text summarization is disclosed for automatically receiving and analysing a body of text in a natural language. The body of text comprises semiotic markers defining a series of statements or sentences. Each of the sentences is compared and ranked according to its semiotic relevance to the body of text. The sentence with the highest semiotic ranking is selected as a summary for the body of text.
In an embodiment, the body of information may be in textual form in the format of an article. Such articles are published on blog sites via a web service, and the semiotic selection text summary may be the content of RSS feeds. The sentence selected as the summary is an excerpt of the body of text. The sentence selected is determined to be the most semiotically significant sentence in the body of text. The categories for determining the semiotic significance of each sentence in the body of text relate to the spectrum of semiotic characteristics or features available to a reader reading the text in the text format. The semiotic significance of text includes semantic relevance, and additional semiotic elements or semiotic identifiers which can be conveyed to the reader in other formats while the reader is reading the body of text. For example, such semiotic identifier categories include keywords that convey particular semiotic meaning, punctuation marks, font types comprising bold, italics, and the like. Each sentence is ranked according to the number and type of semiotic identifier categories. The sentence ranked with the highest semiotic score has different textual attribute elements and is ranked for semiotic identifiers in different semiotic categories. It will be appreciated that the semiotic selection method and system for text summarization may base the choice of selecting or ranking of sentences on one or more semiotic categories or identifiers.
Figure 1 illustrates a block diagram of an exemplary system 10 on which various embodiments of the invention may be implemented. A user 12 accesses the system 10 with a personal computer 14 that is in communication with a server 16 and database 18 via a network 20, such as a local area network (LAN), intranet, internet or the like. The server 16 may communicate directly with the personal computer 14 and/or the database 18.
Figure 2 illustrates a block diagram 30 of an exemplary personal computer 14 that may be implemented in the system of Figure 1 in accordance with an embodiment of the invention. The personal computer comprises a processor 32 and memory 34 for processing and storing data, respectively, and executing the various data modules and software commands installed to perform various functions on the personal computer. The personal computer 14 comprises a communications interface 36 to transmit and receive data via the network 20 including cellular networks and other networks such as local area networks (LAN), intranet, internet and the like. The personal computer 14 may comprise data input means 42 such as a keyboard, camera, microphone, and the like, and data output means 40 such as a display, printer, and the like.
Figure 3 illustrates a block diagram 50 of an exemplary server 16 that may be implemented in the system of Figure 1 in accordance with an embodiment of the invention. The server comprises a processor 52 and memory 54 for processing and storing data, respectively, and executing the various data modules and software commands installed to perform various functions on the server. The server 16 comprises a communications interface 56 to transmit and receive data via networks 20 such as local area networks LAN, intranet, internet and the like. The server 16 comprises the semiotic selection and management system modules 60 comprising a semiotic marker module 62, a semiotic identifier module 64, and a semiotic point value scoring module 66. The semiotic marker module 62 analyses the text of the information to be summarised. The semiotic marker module 62 identifies the statements or sentences within the text by identifying a full stop followed by a space to define the sentences. The semiotic identifier module 64 analyses each identified sentence for any semiotic identifiers or elements such as semiotic keywords, punctuation marks, HTML elements or tags, and the like. The semiotic point value scoring module 66 calculates the number of semiotic identifiers or elements. The semiotic point value scoring module 66 determines the semiotic total point scores for each of the sentences and compares the scores to determine and select the sentence with the highest semiotic point value score. The sentence with the highest semiotic point value score is the sentence selected as the summary for the text of the information. It will be appreciated that the choice of selecting or ranking of sentences may comprise one or more semiotic category or identifier.
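The behaviour of the semiotic marker module 62 can be illustrated with a minimal sketch, assuming plain text input and a marker consisting of a full stop followed by a single space. The function name `split_into_sentences` and its `marker` parameter are hypothetical and are not taken from the specification.

```python
def split_into_sentences(text: str, marker: str = ". ") -> list[str]:
    """Split text into sentences at each semiotic marker.

    Assumption of this sketch: the marker begins with the full stop,
    so the full stop is kept with the sentence it closes.
    """
    sentences = []
    start = 0
    while True:
        pos = text.find(marker, start)
        if pos == -1:
            break
        # Keep the full stop, drop the whitespace that follows it.
        sentences.append(text[start:pos + 1].strip())
        start = pos + len(marker)
    # The final sentence normally ends the text, so no space follows it.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_into_sentences("The city is large. It is quick to cross. Take the subway."))
```

A delimiter this simple also splits after abbreviations such as "e.g. ", so a practical marker module would probably need additional rules.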
Figure 4 illustrates a block diagram 70 of an exemplary semiotic identifier module 64 that may be implemented in the server 16 of Figure 3 in accordance with an embodiment of the invention. The semiotic identifier module 64 comprises a keyword analyser module 72, a punctuation analyser module 74, and HTML analyser module 76. The keyword analyser module 72 analyses each sentence for any semiotic keywords. The punctuation analyser module 74 analyses each sentence for any punctuation. The HTML analyser module 76 analyses each sentence for any HTML elements.
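The three analyser modules might be sketched as three small counting functions, one per identifier category. This is an illustration only: the look-up tables below are small stand-ins, all function and variable names are hypothetical, and counting one identifier per opening HTML tag (ignoring the closing tag) is an assumption of this sketch.

```python
import re

# Stand-in look-up tables; a real system would load these from the database.
KEYWORD_LUT = {"abnormal", "aboard", "absorbing", "quick", "whittle", "wonderful", "wreak"}
PUNCTUATION_LUT = {"?", "!", "(", ")"}
HTML_LUT = {"b", "em", "i", "strong"}

# Matches opening and closing tags; group 1 is "/" for a closing tag.
TAG_PATTERN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def strip_tags(sentence: str) -> str:
    """Replace HTML tags with spaces so they do not affect word matching."""
    return TAG_PATTERN.sub(" ", sentence)


def analyse_keywords(sentence: str) -> int:
    """Count whole words (case-insensitive) that appear in the keyword LUT."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", strip_tags(sentence).lower())
    return sum(1 for word in words if word in KEYWORD_LUT)


def analyse_punctuation(sentence: str) -> int:
    """Count characters outside HTML tags that appear in the punctuation LUT."""
    return sum(1 for ch in strip_tags(sentence) if ch in PUNCTUATION_LUT)


def analyse_html(sentence: str) -> int:
    """Count opening tags whose element name appears in the HTML LUT."""
    return sum(1 for closing, name in TAG_PATTERN.findall(sentence)
               if not closing and name.lower() in HTML_LUT)


def identify(sentence: str) -> dict[str, int]:
    """Run all three analysers and return the count per category."""
    return {
        "keyword": analyse_keywords(sentence),
        "punctuation": analyse_punctuation(sentence),
        "html": analyse_html(sentence),
    }


print(identify("What a <b>wonderful</b> view (truly)!"))
# {'keyword': 1, 'punctuation': 3, 'html': 1}
```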
Figure 5 illustrates a block diagram 80 of an exemplary database 18 that may be implemented in the system 10 of Figure 1 in accordance with an embodiment of the invention. The database 18 comprises a communications interface 82 to transmit and receive data via networks 20 such as local area networks (LAN), intranet, internet and the like. The database comprises a data store for semiotic keyword look up table (LUT) 84, semiotic punctuation LUT 86, semiotic HTML LUT 88, semiotic identifier point score LUT 90, and semiotic total point scores store 92. The data may be stored in the database, or may be stored and received in other sources, such as other databases, server, external means, personal computer, or the like. Although the database is shown as a standalone unit, it will be appreciated that the database 18 or each data store 84, 86, 88, 90, 92 may reside within the server or personal computer in the system or external to the system shown.
Figure 6 illustrates a semiotic keyword look up table (LUT) 84 as shown in Figure 5 in greater detail in accordance with an embodiment of the invention. The semiotic keywords in the LUT are shown as: "ABNORMAL"; "ABOARD"; "ABSORBING"; "QUICK"; "WHITTLE"; "WONDERFUL"; "WREAK"; and the like. It will be appreciated that any number of semiotic keywords can be in the LUT. For example, there may be 500 keywords chosen in the LUT. The keywords chosen are deemed semiotically significant. Semiotically significant keywords are emotive words, such as "happy", value judgement words such as "best", words that signify a conclusion, such as "end", or the like. The semiotically significant keywords that are selected are chosen for the descriptive, emotive and distinct value of the word. For example, words that may have semiotic significance are descriptive, emotive and distinct from other words. It will be appreciated that any number of semiotic keywords and additional types of semiotic keywords may be included in the LUT. It will be appreciated that the list of semiotic keywords is selected to be applicable to all different types of subjects or topics of the body of text being summarised.
In an embodiment keywords are symbolic representations that convey a message or meaning, and may comprise combinations of words, phrases, letters, numbers, symbols, alphanumeric, and the like. The keywords may comprise different spelt versions of the same word or words such as commonly misspelt versions, culturally different versions, American/British spelt versions, phonetically spelt versions, homonyms, and the like. The keywords may comprise jargon, lingo, slang, profanity, dialect, acronyms, abbreviations, shorthand, and the like. The keywords may take different grammatical or syntactic variations and comprise different tenses, base, stem, or root words with different affixes, such as prefixes and/or suffixes, adjectives, nouns, verbs, hyphenated and non-hyphenated versions, and the like. An example list of keywords for illustrative purposes only is: abnormal, aboard, absorbing, accessible, accredited, accusing, acrimonious, adapted, adept, admiration, adoring, adrift, adverse, affection, afflicted, aflutter, aggrieved, agitated, alienated, ambitious, amused, anesthetized, animals, anticipation, antique, apathy, apologetic, apprehensive, archaic, armored, assign, astute, asymmetrical, at home, at war, attitude, average, awakened, awed, awful, bad, banned, beckoned, belief, best, bland, boring, brash, breathtaking, brisk, bristling, broadminded, broken-up, burdensome, capable, capricious, captious, cautious, ceaseless, classy, claustrophobic, clenched, closed, clouded, coherent, collapsed, colossal, comforted, commanding, committed, complimented, conclude, conclusion, confuse, connected, conscious, consistent, contaminated, contemplative, contradictory, convincing, corrupted, counterfeit, courteous, crafty, cramped, crave, craving, crippled, crooked, crucial, cruddy, cupidity, cut-down, cut-off, damaging, debilitating, defensive, definitely, delivered, denounced, deprecated, deserving, desire, desperate, dogmatic, dull, dumbfounded, elegant, enamored, enchanted, end, endangered, enfeebling, enjoyment, enraptured, eros, especially, etiolated, execrated, extreme, extroverted, fabricated, faint, famous, fantabulous, fantastic, fascinating, fatherless, faulty, fazed, feisty, felicitous, female, feminine, fermenting, fervent, fierce, finished, first-rate, flagellated, flaky, fleeced, flippant, flogged, flourishing, flush, foggy, forceful, forsaken, frantic, freakish, frenzied, fret-filled, friendly, functional, funereal, fun loving, funny, future minded, futuristic, game, gaze, genuine, giant, glance, gleaming, great, greedy, grind, grounded, gullible, gutted, half witted, handicapped, happy, harangued, harmed, harmonious, hasty, hatred, heard, held dear, hexed, high-spirited, humane, idealistic, idle, ignored, ill, immense, impassioned, impassive, impeccable, imperfect, imposed-upon, impotent, impoverished, impractical, imprisoned, inaccessible, inattentive, incapacitated, incompatible, incomplete, inconsolable, indestructible, indirect, inert, inferior, inhumane, injustice, insignificant, insouciant, inspired, instilled, instructive, intentional, interesting, intrepid, introverted, intuitive, irascible, irritable, itchy, jangled, jarred, jealousy, jejune, jinxed, jolly, jolted, joyful, jubilant, judged, judgmental, justice, kaput, kicked, kicked back, kind, kindhearted, kingly, kinky, knightly, knocked, knotted, knotty, knowledgeable, labeled, lament, lash, lifetime, limp, loath, lonely, loved, lovely, luring, lurking, lying, majestic, malicious, malnourished, managerial, manipulated, marauding, mean, meander, meanness, medmanic, mellow, menaced, mended, messed, messy, miffed, mind-blown, mind-gamed, mindful, miracle, misguided, misinformed, misunderstood, mixed-up, moany, molded, mollified, momentous, monopolistic, monopolized, moody, moralistic, morose, muddled, murky, mushy, mutinous, mystery, naked, nannied, narrow-minded, nauseated, needless, neurotic, nifty, nit-picking, noiseless, nonconformist, notted, nuts, obedient, objectified, obliterated, oblivious, obsequious, observant, obstructed, obvious, odd, off the hook, off-balance, ogre-ish, old, old-fashioned, omnipotent, open, oppositional, ornery, orphaned, ostracized, outdated, outdone, outgoing, outnumbered, outpowered, outranked, outreasoned, outspoken, outstanding, over-protected, overcome, overjoyed, overpowered, overrule, overzealous, overworked, owned, ownership, owning, pain, paired, paired-up, parsimonious, partial, peaceful, peevish, pell-mell, perfectionistic, perfused, perplexed, persnickety, personalized, pert, perturbed, petrinoid, phat, pious, piteous, pleading, plumbed, plundered, pompous, possessionless, practical, prim, probationary, probationed, prodigious, promise, propagandistic, propagandized, prosaic, protective, prudish, psychotic, puffed up, pugnacious, pulled, purged, quaking, qualified, quality, quandary, quarantined, quarrelsome, quashed, queasy, questioning, quick, quiescent, quirky, rageful, reactive, refueled, rejected, rejoiced, reliant, removed, reproached, respected, restricted, roam, rubbery, sacrificed, scared, scarred, scatter, scolded, scornful, searching, secrets, sedate, self-acceptance, self-acceptant, self-hatred, self-pitying, sensible, sentenced, shunned, skeptical, sleepy, small, smart, smudged, sold-out, special, spiritless, squashed, startling, stilted, straightjacketed, stroked, studious, stunned, stunning, subjugated, success, superseded, surpassed, survey, sycophantic, sympathetic, tear, tease, terrific, tested, thrash, thrill, thrust, thwart, tilt, trace, tremendous, trudge, twinkle, unconditional, unique, unlace, unlimited, unparalleled, unsurpassed, unusual, unusual size, urge, useful, utter, vacate, valuable, wager, wander, warble, warp, waver, wealth, wedge, weep, weird, whet, whirl, whittle, wonderful, wreak.
Figure 7 illustrates a semiotic punctuation LUT 86 as shown in Figure 5 in greater detail in accordance with an embodiment of the invention. The punctuation marks shown are a question mark, exclamation mark, open parenthesis, closed parenthesis, and the like. It will be appreciated that any number of punctuation marks and additional punctuation marks not shown may be included in the LUT.
Figure 8 illustrates a semiotic HTML LUT 88 as shown in Figure 5 in greater detail in accordance with an embodiment of the invention. The text of the information being analysed contains HTML elements or tags. The HTML elements or tags are searched. The HTML elements shown are: "<b>", as a boldface presentation; "<em>", as an italic emphasis presentation; "<i>", as an italic presentation; "<strong>", as strong emphasis boldface presentation; and the like. It will be appreciated that any number of HTML elements and additional HTML elements not shown may be included in the LUT.
Figure 9 illustrates a semiotic identifier point score LUT 90 as shown in Figure 5 in greater detail in accordance with an embodiment of the invention. The scale of scores shown includes: keywords receive 50 points; punctuation marks receive 10 points; HTML elements receive 10 points; and the like. It will be appreciated that any points may be weighted and allocated to each respective category of semiotic identifiers. It will be appreciated that one or more semiotic identifiers or categories may be used for the ranking and scoring of the sentences. Once the scores are compared and ranked, the sentence with the most points is selected as the summary excerpt of the text of information.
Figure 10 illustrates a flow chart 150 of a method in accordance with an embodiment of the invention. The method comprises receiving incoming data such as a body of text 152 of information to be summarised. Text may be received from different sources, and the incoming data may comprise different formats and variations of text such as symbols, letters, words, numbers, fonts, font size, images, codes, ciphers, hieroglyphs, and the like. The semiotic markers are identified 154. The semiotic markers define the statements or sentences of the text. The semiotic markers in this embodiment are full stops or periods followed by a space. It will be appreciated that the semiotic markers may comprise different forms and variations, for example double space, and the like. Once the sentences are defined, each sentence is analysed 156. The semiotic identifiers are identified 158. The semiotic identifiers in this embodiment are semiotic keywords, punctuation, HTML elements, and the like. A point value is applied 160 to each semiotic identifier in accordance with the respective look up table (LUT).
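The point allocation and the selection of the highest-scoring sentence can be sketched as follows, using the 50/10/10 scale described for the point score LUT 90. The names `POINT_SCORE_LUT`, `total_points` and `select_summary` are hypothetical, the identifier counts are made-up inputs, and breaking ties in favour of the earliest sentence is an assumption, since the description does not say how ties are resolved.

```python
# Points per identifier category, following the scale described for LUT 90.
POINT_SCORE_LUT = {"keyword": 50, "punctuation": 10, "html": 10}


def total_points(counts: dict[str, int], point_lut: dict[str, int] = POINT_SCORE_LUT) -> int:
    """Total for one sentence: identifier count times points, summed over categories."""
    return sum(point_lut[category] * n for category, n in counts.items())


def select_summary(per_sentence_counts: list[dict[str, int]]) -> tuple[int, int]:
    """Return (index, score) of the highest-scoring sentence.

    Assumption: on a tie the earliest sentence wins (max() keeps the first maximum).
    """
    scores = [total_points(counts) for counts in per_sentence_counts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Made-up identifier counts for three sentences.
counts = [
    {"keyword": 0, "punctuation": 1, "html": 0},  # 10 points
    {"keyword": 1, "punctuation": 0, "html": 2},  # 50 + 20 = 70 points
    {"keyword": 0, "punctuation": 0, "html": 0},  # 0 points
]
print(select_summary(counts))  # (1, 70)
```

Changing the values in `POINT_SCORE_LUT` is enough to re-weight a category, which corresponds to the statement that any points may be weighted and allocated to each category.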
In an embodiment, points are allocated as follows: 50 points are applied to each semiotic keyword; 10 points are applied to each punctuation mark; and 10 points are applied to each HTML element. A total semiotic point value is calculated and stored 162 for each sentence analysed. The steps are repeated 164 for each sentence until the final sentence. The total semiotic point value scores are compared 166, and the sentence with the highest semiotic point value score is selected 168 as the summary sentence to represent the text being analysed.
An example semiotic selection method for text summarization in accordance with an embodiment of the invention is applied to the following illustrative source of information:
<p><em> Words and photographs by JK of Cupcake Weekend</em>.</p>
<p>On the last week of May I treaded on nippon ground despite the many incessant warnings of radiation scares and it was a risk well taken. If a city could be a person then I was extremely impressed by Tokyo. She is indeed very magical with so much to explore, taste &amp; feel. Tokyo is undoubtedly my Paris in Asia.</p>
<p>Here&#8217;s a quick travel guide on <strong>things to do in Tokyo for five days</strong>. <p>
<p><span></span>The first thing to do in right after you get off your plane is to purchase a subway pass. The next thing not to do: take a cab (unless you really have to). Not only is it ridiculously expensive, you miss out on alot when you travel via taxi. Though the subway map can be pretty taunting, I assure you that you will get the hang of it within a day or two.</p>
The above body of text is an example for illustrative purposes only of incoming data format of text of an article retrieved or downloaded from a website, blog, or the like. The text includes a markup language such as HTML. For example, data may be retrieved from a posting on a social network website such as FACEBOOK, LINKEDIN, or the like.
FACEBOOK is a trademark or registered trademark of Facebook, Inc. of Palo Alto, California, United States of America. LINKEDIN is a trademark or registered trademark of LinkedIn Corporation of Mountain View, California, United States of America.
In accordance with an embodiment of the invention, the semiotic markers are identified as full stops. The sentence numbering and the semiotic identifiers of the HTML elements are provided in the sentences and displayed in their corresponding word processing version for illustrative purposes only, as follows:
Sentence 1: Words and photographs by JK of Cupcake Weekend.
Sentence 2: On the last week of May I treaded on nippon ground despite the many incessant warnings of radiation scares and it was a risk well taken.
Sentence 3: If a city could be a person then I was extremely impressed by Tokyo.
Sentence 4: She is indeed very magical with so much to explore, taste & feel.
Sentence 5: Tokyo is undoubtedly my Paris in Asia.
Sentence 6: Here's a quick travel guide on things to do in Tokyo for five days.
Sentence 7: The first thing to do in right after you get off your plane is to purchase a subway pass.
Sentence 8: The next thing not to do: take a cab (unless you really have to).
Sentence 9: Not only is it ridiculously expensive, you miss out on alot when you travel via taxi.
Sentence 10: Though the subway map can be pretty taunting, I assure you that you will get the hang of it within a day or two.
Figures 11-20 illustrate the allocation of point value score tables 200, 202, 204, 206, 208, 210, 212, 214, 216, 218 for sentences 1-10, respectively. The allocation of semiotic point value scores is shown as semiotic identifier words, punctuation, HTML, and total for each sentence, accordingly. Figure 21 illustrates the total semiotic point value table 220 of all the sentences analysed. Sentences 2, 3, 4, 5, 7, 9 and 10 have a zero point value score, as there are no semiotic identifiers in these sentences, as shown in Figures 12, 13, 14, 15, 17, 19 and 20, respectively.
The point value score of sentence 1 is shown in the point value score table 200 of Figure 11 with a 10 point value score for one semiotic identifier taking the form of an HTML identifier of italics with emphasis identified with the <em> and </em> HTML tag. The identified semiotic HTML element tag "<em>" in sentence 1 is listed in semiotic HTML LUT 88 of Figure 8 as "2. <em>".
The point value score of sentence 6 is shown in the point value score table 210 of Figure 16 with a 60 point value score for two semiotic identifiers taking the forms of a semiotic keyword "quick", and an HTML identifier of bold identified with the <strong> and </strong> HTML tag. The identified semiotic keyword "quick" in sentence 6 is shown listed in semiotic keyword LUT 84 of Figure 6 as semiotic keyword "N-3. QUICK". The identified semiotic HTML element tag "<strong>" in sentence 6 is listed in semiotic HTML LUT 88 of Figure 8 as "N. <strong>".
The point value score of sentence 8 is shown in the point value score table 214 of Figure 18 with a 20 point value score for two semiotic identifiers taking the form of an opening and a closing parenthesis. The identified semiotic punctuation open parenthesis "(" and closing parenthesis ")" in sentence 8 are listed in semiotic punctuation LUT 86 of Figure 7 as "N-1. (", and "N. )".
The overall total semiotic point value summary table 220 of Figure 21 shows the point value score results for all ten sentences. Sentence 6 has the highest semiotic point value score of 60; the sentence number and the score are shown in bold. Accordingly, sentence 6 is selected as the summary excerpt. The sentence is displayed without the semiotic identifiers, "Here's a quick travel guide on things to do in Tokyo for five days." It will be appreciated that the selected sentence for the summary excerpt may be displayed with all, at least one, or without any of the semiotic identifiers.
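The worked example can be re-run with a short script. This is a sketch under stated assumptions rather than the implementation behind Figures 11-21: the ten sentences are entered already split, with their HTML tags retained and character entities decoded; the keyword table holds only the seven keywords shown for Figure 6; one identifier is counted per opening tag; and all names are hypothetical.

```python
import re

KEYWORDS = {"abnormal", "aboard", "absorbing", "quick", "whittle", "wonderful", "wreak"}
PUNCTUATION = {"?", "!", "(", ")"}
HTML_ELEMENTS = {"b", "em", "i", "strong"}
POINTS = {"keyword": 50, "punctuation": 10, "html": 10}
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

SENTENCES = [
    "<em> Words and photographs by JK of Cupcake Weekend</em>.",
    "On the last week of May I treaded on nippon ground despite the many "
    "incessant warnings of radiation scares and it was a risk well taken.",
    "If a city could be a person then I was extremely impressed by Tokyo.",
    "She is indeed very magical with so much to explore, taste & feel.",
    "Tokyo is undoubtedly my Paris in Asia.",
    "Here's a quick travel guide on <strong>things to do in Tokyo for five days</strong>.",
    "The first thing to do in right after you get off your plane is to purchase a subway pass.",
    "The next thing not to do: take a cab (unless you really have to).",
    "Not only is it ridiculously expensive, you miss out on alot when you travel via taxi.",
    "Though the subway map can be pretty taunting, I assure you that you will "
    "get the hang of it within a day or two.",
]


def score(sentence: str) -> int:
    """Total semiotic point value of one sentence."""
    html = sum(1 for closing, name in TAG.findall(sentence)
               if not closing and name.lower() in HTML_ELEMENTS)
    plain = TAG.sub(" ", sentence)  # tags must not count as words or punctuation
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", plain.lower())
    keywords = sum(1 for word in words if word in KEYWORDS)
    punctuation = sum(1 for ch in plain if ch in PUNCTUATION)
    return (POINTS["keyword"] * keywords
            + POINTS["punctuation"] * punctuation
            + POINTS["html"] * html)


scores = [score(s) for s in SENTENCES]
best = max(range(len(scores)), key=scores.__getitem__)
summary = TAG.sub("", SENTENCES[best]).strip()  # display without HTML identifiers

print(scores)   # [10, 0, 0, 0, 0, 60, 0, 20, 0, 0]
print(summary)  # Here's a quick travel guide on things to do in Tokyo for five days.
```

Under these assumptions the script gives 10 points to the first sentence, 20 points to the sentence containing the parentheses, and 60 points to the "Here's a quick travel guide" sentence, which it selects as the summary excerpt.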
Although the semiotic markers are identified in this embodiment prior to the semiotic identifiers, it will be appreciated that the semiotic markers and semiotic identifiers may be identified in any order. The semiotic point value scoring is calculated based on the number of semiotic identifiers associated with a particular semiotic marker. In this embodiment, the semiotic point value scoring is calculated on the number and type of semiotic identifiers or categories identified preceding an identified marker. It will be appreciated that this embodiment is suitable for the English natural language, however, the systems and methods of embodiments of the invention may have other semiotic marker and semiotic identifier formats and correlations that may be more suitable and arranged for other natural languages. It will be appreciated that the above body of text, semiotic markers, semiotic identifiers, and semiotic identifier LUTs are for illustrative purposes only. In accordance with embodiments of the invention the semiotic markers, semiotic identifiers, and semiotic identifier LUTs may take different forms and variations. For example, semiotic identifiers may take additional forms that appear in the body of text being summarised, such as capitalization, non-capitalization, fonts, font size, images, symbols, numbers, codes, ciphers, hieroglyphs, spaces, orientations, and the like, and which may be taken in different combinations thereof. Additional LUTs may comprise different semiotic identifiers and may be used in conjunction with or replace any one or all of the LUTs shown and discussed.
The selected sentence excerpt may be presented in the original form of the sentence. The sentence format may be displayed in HTML format in blog aggregator type web services, or the like. Such web services include blogs posting blog articles where the selected sentence within the blog article acts as a snapshot summary. The selected sentence excerpt may be used with web crawlers for content from blog RSS feeds, or the like.
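As an illustration of carrying the selected sentence in an RSS feed, the sketch below builds a minimal RSS 2.0 document with Python's standard `xml.etree.ElementTree` module and places the summary sentence in the item's `description` element. The function names, titles and the `example.com` URLs are hypothetical placeholders, and the specification does not prescribe this feed layout.

```python
import xml.etree.ElementTree as ET


def build_rss_item(title: str, link: str, summary_sentence: str) -> ET.Element:
    """Create an RSS <item> whose description is the selected summary sentence."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "description").text = summary_sentence
    return item


def build_feed(channel_title: str, channel_link: str, items: list[ET.Element]) -> str:
    """Wrap the items in a minimal RSS 2.0 channel and serialise to a string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_title
    channel.extend(items)
    return ET.tostring(rss, encoding="unicode")


item = build_rss_item(
    "Things to do in Tokyo",                      # placeholder article title
    "https://example.com/tokyo",                  # placeholder article URL
    "Here's a quick travel guide on things to do in Tokyo for five days.",
)
feed_xml = build_feed("Example blog", "https://example.com/", [item])
print(feed_xml)
```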
The semiotic selection system and method for text summarization in accordance with an embodiment of the invention may be adapted for any type of text encoding such as UTF-8, or the like. It will be appreciated that the system and method can be adapted to natural languages other than English. It will be appreciated that in order to adapt the system and method for other Roman and non-Roman based languages, a combination of the same or different delimiters, semiotic markers, semiotic identifiers or categories, or the like is required as appropriate for each natural language. Additional services may be added to the method and system in accordance with embodiments of the invention such as translation services and the like.
Although the exemplary embodiments are described in terms of applications in semiotic selection methods and systems for text summarization, and the like, the exemplary embodiments are applicable to any suitable application, such as digital and non-digital content, devices, software, services, goods, resources, and the like, and may be practiced with variations in technology, interface, content, rights, offerings, services, speed, size, limitations, devices, and the like.
The devices and subsystems of the exemplary methods and systems described with respect to Figures 1-21 can communicate, for example, over a communication network, and can include any suitable servers, workstations, personal computers (PCs), laptop computers, handheld devices, with visual displays and/or monitors, tablets, telephones, cellular telephones, smartphones, wireless devices, personal digital assistants (PDAs), Internet appliances, set top boxes, modems, other devices, and the like, capable of performing the processes of the disclosed exemplary embodiments. The devices and subsystems, for example, may communicate with each other using any suitable protocol and may be implemented using a general-purpose computer system and the like. One or more interface mechanisms may be employed, for example, including internet access, telecommunications in any suitable form, such as voice, modem, and the like, wireless communications media, and the like.
Accordingly, the network may include, for example, wireless communications networks, cellular communications networks, public switched telephone networks (PSTNs), packet data networks (PDNs), the Internet, intranets, hybrid communications networks, combinations thereof, and the like.
It is to be understood that the embodiments, as described with respect to Figures 1-21, are for exemplary purposes, as many variations of the specific hardware used to implement the disclosed exemplary embodiments are possible. For example, the functionality of the devices and the subsystems of the embodiments may be implemented via one or more programmed computer systems or devices. To implement such variations as well as other variations, a single computer system may be programmed to perform the functions of one or more of the devices and subsystems of the exemplary systems. On the other hand, two or more programmed computer systems or devices may be substituted for any one of the devices and subsystems of the exemplary systems. Accordingly, principles and advantages of distributed processing, such as redundancy, replication, and the like, also may be implemented, as desired, for example, to increase robustness and performance of the exemplary systems described with respect to Figures 1-21.
The exemplary systems described with respect to Figures 1-21 may be used to store information relating to various processes described herein. This information may be stored in one or more memories, such as hard disk, optical disk, magneto-optical disk, RAM, and the like, of the devices and sub-systems of the embodiments. One or more databases of the devices and subsystems may store the information used to implement the exemplary embodiments. The databases may be organised using data structures, such as records, tables, arrays, fields, graphs, trees, lists, and the like, included in one or more memories, such as the memories listed above.
All or a portion of the exemplary systems described with respect to Figures 1-21 may be conveniently implemented using one or more general-purpose computer systems, microprocessors, digital signal processors, micro-controllers, and the like, programmed according to the teachings of the disclosed exemplary embodiments. Appropriate software may be readily prepared by programmers of ordinary skill based on the teachings of the disclosed exemplary embodiments. In addition, the exemplary systems may be implemented by the preparation of application-specific integrated circuits or by interconnecting an appropriate network of component circuits. Advantageously, the exemplary embodiments described herein may be employed in offline systems, online systems, and the like, and in applications, such as TV applications, computer applications, DVD applications, VCR applications, appliance applications, CD play applications, and the like. In addition, the signals employed to transmit the summary data, and the like, of the exemplary embodiments, may be configured to be transmitted within the visible spectrum of a human, within the audible spectrum of a human, combinations thereof, and the like.
Embodiments of the invention have been described herein, including the best mode known to the inventors for carrying out the invention. Variations of those preferred embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the invention to be practiced otherwise than as specifically described herein. Accordingly, this invention includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by the applicable law.
Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the invention unless otherwise indicated herein or otherwise clearly contradicted by context. In this document and in its claims, the verb "to comprise" and its conjugations are used in their non-limiting sense to mean that items following the word are included, but items not specifically mentioned are not excluded. In addition, reference to an element by the indefinite article "a" or "an" does not exclude the possibility that more than one of the element is present, unless the context clearly requires that there be one and only one of the elements. The indefinite article "a" or "an" thus usually means "at least one".
Each of the applications and patents mentioned in this document, and each document cited or referenced in each of the above applications and patents, including during the prosecution of each of the applications and patents ("application cited documents") and any manufacturer's instructions or catalogues for any products cited or mentioned in each of the applications and patents and in any of the application cited documents, are hereby incorporated herein by reference. Furthermore, all documents cited in this text, and all documents cited or referenced in documents cited in this text, and any manufacturer's instructions or catalogues for any products cited or mentioned in this text, are hereby incorporated herein by reference.
Various modifications and variations of the described methods and system of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific preferred embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention which are obvious to those skilled in this or related fields are intended to be within the scope of the claims.

Claims

1. A semiotic selection system for text summarization comprising: a processor configured to receive a body of text and identifying at least one semiotic marker and associated semiotic identifiers within the body of text, each semiotic marker defining a textual statement; and a database for storing a corresponding value for each associated semiotic identifier; the processor calculating a score for each semiotic marker and associated textual statement, the score being the sum of the corresponding values retrieved from the database of any semiotic identifier associated with a semiotic marker within the textual statement, and selecting the textual statement and semiotic marker having the highest score for the body of text summary.
2. A system according to Claim 1, in which the textual statement is a sentence.
3. A system according to Claim 1 or 2, in which the semiotic marker is a delimiter comprising a full stop and a space.
4. A system according to Claim 1, 2 or 3, in which the semiotic identifier is a semiotic keyword.
5. A system according to Claim 4, in which the semiotic keyword is an adjective.
6. A system according to Claim 4, in which the semiotic keyword is a verb.
7. A system according to Claim 4, in which the associated value for the semiotic keyword is 50 points.
8. A system according to any preceding claim, in which the semiotic identifier is a punctuation mark.
9. A system according to Claim 8, in which the associated value for the punctuation mark is 10 points.
10. A system according to any preceding claim, in which the semiotic identifier is a HTML element.
11. A system according to Claim 10, in which the associated value for the HTML element is 10 points.
12. A system according to Claim 10 or 11, in which the HTML element comprises <b>, <em>, <strong> or <i>.
13. A system according to any preceding claim, in which the processor processes the selected textual statement for body of text summary as an RSS feed.
14. A system according to any preceding claim, in which the processor is arranged to identify at least two categories of associated semiotic identifiers within the body of text.
15. A system according to any preceding claim, in which the body of text is an article.
16. A system according to any preceding claim, in which the body of text comprises at least two semiotic markers.
17. A system according to any preceding claim, in which the body of text comprises a plurality of semiotic markers and a plurality of semiotic identifiers.
18. A semiotic selection method for text summarization comprising: receiving a body of text and identifying at least one semiotic marker and associated semiotic identifiers within the body of text, each semiotic marker defining a textual statement; assigning a corresponding value for each associated semiotic identifier; calculating a score for each semiotic marker and associated textual statement, the score being the sum of the corresponding values of any semiotic identifier associated with a semiotic marker within the textual statement; and selecting the textual statement and semiotic marker having the highest score for the body of text summary.
19. A method according to Claim 18, further comprising arranging the selected textual statement for the body of text summary as an RSS feed.
20. A method according to Claim 18, in which the body of text comprises at least two semiotic markers.
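
By way of non-limiting illustration only, the receiving, assigning, calculating and selecting steps recited above may be sketched in Python as follows. The sketch is not the claimed implementation. The function names `split_statements`, `score_statement` and `select_summary`, the keyword and punctuation entries, and the decision to count every occurrence of an identifier are assumptions made for illustration; the 50-point and 10-point values, the four HTML elements, and the delimiter of a full stop and a space are taken from the claims.

```python
import re

# Hypothetical LUT contents; only the values and HTML elements are recited.
KEYWORDS = {"amazing": 50, "love": 50}
PUNCTUATION = {"!": 10, "?": 10}
HTML_ELEMENTS = {"<b>": 10, "<em>": 10, "<strong>": 10, "<i>": 10}


def split_statements(body):
    """Split the body of text on the delimiter: a full stop and a space."""
    return [s for s in body.split(". ") if s.strip()]


def score_statement(statement):
    """Sum the values of all semiotic identifiers found in a statement."""
    lowered = statement.lower()
    score = 0
    # Strip tags before looking for keywords so "<b>love</b>" yields "love".
    words = re.findall(r"[a-z']+", re.sub(r"<[^>]+>", " ", lowered))
    for word in words:
        score += KEYWORDS.get(word, 0)
    # Assumption: each occurrence of an identifier contributes its value.
    for mark, value in PUNCTUATION.items():
        score += value * statement.count(mark)
    for tag, value in HTML_ELEMENTS.items():
        score += value * lowered.count(tag)
    return score


def select_summary(body):
    """Return the highest-scoring statement, or None for an empty body."""
    statements = split_statements(body)
    if not statements:
        return None
    # On a tie, max() keeps the earliest statement in the body of text.
    return max(statements, key=score_statement)


body = "We went to the market. The food was <b>amazing</b>. We went home."
summary = select_summary(body)
```

In this example the second statement contains one keyword (50 points) and one HTML element (10 points), so it scores 60 against 0 for the others and is selected as the summary.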
PCT/SG2012/000306 2012-08-30 2012-08-30 Semiotic selection method and system for text summarization WO2014035334A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/SG2012/000306 WO2014035334A1 (en) 2012-08-30 2012-08-30 Semiotic selection method and system for text summarization

Publications (1)

Publication Number Publication Date
WO2014035334A1 (en) 2014-03-06

Family

ID=50183996

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2012/000306 WO2014035334A1 (en) 2012-08-30 2012-08-30 Semiotic selection method and system for text summarization

Country Status (1)

Country Link
WO (1) WO2014035334A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657347A (en) * 2015-02-06 2015-05-27 北京中搜网络技术股份有限公司 News optimized reading mobile application-oriented automatic summarization method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080109425A1 (en) * 2006-11-02 2008-05-08 Microsoft Corporation Document summarization by maximizing informative content words
US7912701B1 (en) * 2005-05-04 2011-03-22 IgniteIP Capital IA Special Management LLC Method and apparatus for semiotic correlation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12883764

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12883764

Country of ref document: EP

Kind code of ref document: A1