WO2002093414A1 - System and method for clustering and visualization of online chat - Google Patents

System and method for clustering and visualization of online chat

Info

Publication number
WO2002093414A1
Authority
WO
WIPO (PCT)
Prior art keywords
chat
cluster
topic
utterance
content
Prior art date
Application number
PCT/SG2001/000089
Other languages
French (fr)
Inventor
Lung Hsiang Wong
Chee Kit Looi
Original Assignee
Kent Ridge Digital Labs
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kent Ridge Digital Labs
Priority to PCT/SG2001/000089
Publication of WO2002093414A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • This invention relates generally to data processing systems and, more specifically, to systems and methods for clustering and visualization of content included in online chat sessions.
  • Online chat sessions may either be very structured and targeted to discussion of a particular topic, or may be unstructured and serve as an open forum for a public discussion of various topics. Unstructured online chat sessions are generally noisy and out-of-focus.
  • the term "noise,” as used herein, refers to off-topic content included in a topic-based online chat session. Participants in an online chat session, "chatters,” may switch discussion topics frequently and randomly without alerting other users. The randomness and frequency with which topics may be changed makes it difficult for chatters to determine both a current topic and to trace previous content of the chat session.
  • chat systems interface with a logging feature that allows users to log a chat transcript, indicating discussion topics covered in a particular chat session.
  • the chat transcript is generally saved as an archive file and thus is not organized according to, for example, a threaded structure that can be easily indexed and searched.
  • Such spontaneity and randomness in conventional online chat sessions thus poses several problems for chatters.
  • a latecomer to a particular chat session, or an existing chatter who leaves the chat session for a period of time and later returns may have some difficulty tracing the previous discussion. Therefore it is hard for such person to join, or rejoin, the chat session.
  • Second, individual chatters who enter a chat room when it is empty need a manner of determining a current status of a chat session.
  • chat sessions that include a logging feature that tracks chat discussions in the form of, for example, a chat transcript
  • searching such transcript to find a quality exchange can be a tedious and time-consuming process and therefore discouraging to many users.
  • chatters are likely to initiate, perhaps inadvertently, off-topic discussions unless there is a mechanism that can detect the switching of topics and possibly provide a warning or alert about the topic switch, either directly or indirectly.
  • a method to analyze chat content. The method includes determining a next utterance in a chat session, extracting keywords from the next utterance, associating the extracted keyword with a topic, and creating a cluster of the content of the chat session by organizing the content according to the topic.
  • a system to analyze chat content.
  • the system includes a chat summarization tool that divides a content included in a chat session into time-based segments of related chat data and provides a visual representation of the time-based segments of related chat data.
  • Fig. 1 depicts an illustrative block diagram of the invention.
  • Fig. 2 depicts an illustrative flow diagram of the invention.
  • Fig. 3 depicts an illustrative data structure that is used to facilitate logging of a chat session.
  • Figs. 4A-4E depict a chat timeline and illustrate various features of the chat timeline.
  • This invention provides a chat session summarization tool that makes associations among and organizes the content of a chat session.
  • the tool outputs a graphical representation of the content included in the chat session.
  • This representation is provided in the form of a bar chart that depicts various utterances of the chat session and indicates when each of the utterances occurs relative to other utterances of the chat session, and a length of time each utterance occurred.
  • the utterances are grouped as clusters, described further below.
  • This chat summarization tool assists current chatters, newly joined chatters, and chat log readers to conveniently and easily determine a current status of a chat session and navigate a content history of the chat session.
  • This chat summarization tool also creates a searchable chat database, i.e., a chat log, that allows users to search for segments of a chat session by topic or keyword.
  • this invention analyzes the content of a chat session by associating utterances of a chat session with one another to create "clusters" of chat content.
  • a “cluster” refers to a group of adjacent utterances with similar or related discussion topics in a chat session.
  • Chat clustering includes applying a content analysis tool to the content of a chat session, identifying "clusters" of temporal chat utterances in the chat session, and grouping the clusters. Chat clusters may be separated, for example, by noise, i.e., off topic utterances of the chat session.
  • Threshold parameters, configurable by users, determine the tolerance level of such "noisy utterances." Exemplary threshold parameters are described further below.
  • An utterance may involve more than one topic, e.g., the discussion of "computers in education” involves two topics: "computer science” and "education”. Therefore, multiple topics can be included in a single utterance.
  • the system can also detect "socializing clusters,” including “looking for chatters,” “greetings,” and “separation.” Socializing clusters are clusters that do not involve any specific discussion topic but are common socializing behaviors between humans, e.g., to exchange "hellos” or "goodbyes.”
  • the content of a chat session is divided into clusters according to a temporal sequence of the discussion topics.
  • conventional clustering methods are typically applied to a group of documents or items.
  • the system of the invention may execute either in real-time, i.e., while a chat session occurs, or after a chat session has completed.
  • a chat timeline is constructed.
  • the chat timeline depicts a visual, temporal representation of chat topics included in the chat session.
  • each cluster is represented as one or more lines of color.
  • Each of the clusters that relate to a particular topic is represented by the same color.
  • the length of each line corresponds to the time span of the content represented by the cluster.
  • the cluster lines on the timeline are selectable, allowing users to conveniently view statistics of each cluster and read the chat log in a newly launched window. Users may change the time span and the scale reflected on the timeline. Users may also change the temporal resolution of the timeline, e.g., on a scale of 1 minute or 10 minutes per unit on the timeline.
  • the timeline construction may be performed by either a software program that interfaces with the chat room, or a software agent who has the access to the chat content.
  • the chat timeline can either be updated in real-time during a chat session or after a chat session has completed if a chat session includes a logging feature.
  • This invention may also support additional embodiments, including, for example:
  • a time-based pattern of a chat room may include discussing a specific topic at a certain time on a particular day of the week.
  • an embodiment of this feature checks the contents of chat in a chat room during a specific day. This feature may be used to suggest to a user the best time to enter a particular chat room; and (5) alerting a chatter when the chatter is off-topic, e.g., the system may display a pop-up window to inform a particular user that his or her utterances are off-topic relative to an ongoing discussion.
  • Fig. 1 depicts an illustrative block diagram of the invention. Clients 110 a ... n interface with server 120 via network 130.
  • Each of clients 110 a...n includes the conventional components of memory 132, processor(s) 134, input/output devices 136, and browser 138.
  • Server 120 includes the following conventional components: processor 140, input/output device 142, storage 144, and memory 146.
  • Server 120 further includes additional storage, such as, chat log 150 and topic dictionary 154.
  • Memory 146 may further include local application 148.
  • Chat session summarization tool 160 analyzes the content of chat session 164 among clients 110 a... n and provides a visual representation of such content.
  • chat summarization tool 160 may be performed by an attached procedure of the chat room or a software agent who has access to the chat content.
  • Fig. 2 depicts an illustrative flow diagram of the invention. Chat clustering is performed relative to 210-240 and 260; 250 relates to constructing a chat timeline.
  • the invention first determines a next utterance in a chat session (210).
  • An utterance refers to a single word or group of words or sentences entered by a specific chatter during a single entry. Thus, an utterance refers to a message that is created and sent to a chat room when the chatter presses the "enter" key.
  • Fig. 3 depicts an illustrative data structure that is used to facilitate logging of a chat session.
  • the data structure of Fig. 3 depicts a linked list that defines a log unit and a cluster.
  • An utterance is a log unit that is included as part of a cluster.
  • an utterance may include multiple topics.
  • the system extracts keywords by applying standard text parsing and morphology algorithms. Once an utterance has been defined, keywords are extracted from the utterance (220).
  • the keywords correspond to words that have been included in a topic dictionary, such as topic dictionary 154.
  • Each keyword that has been extracted from an utterance is then associated to one or more topics listed in the topic dictionary (230).
  • each utterance is associated to a list of topics.
  • the invention includes a "dictionary” that associates, i.e., maps, keywords to topics (235).
  • the dictionary may be in the form of, for example, a relational database or a table, that lists keywords and related topics for each keyword.
  • Each keyword can be associated to multiple topics.
  • the keyword “Java” can be associated to "programming languages,” "geography,” and "food & beverage.”
  • This dictionary may be in the form of, for example, a relational database or a look-up table that maps keywords to topics.
  • the topics are categorized according to the following logic. If a keyword can only be mapped to one topic, then the keyword is added to the topic list associated to the utterance that the keyword was extracted from; If the keyword can be mapped to more than one topic, then the system determines whether the topic has been included in a recent chat cluster. Recent is determined according to user-specified values reflecting constraints of a cluster, which are described further below.
  • each of the topic(s) is added to a topic list associated with the utterance; If the topic does not relate to a recent cluster, then the system determines whether the particular keyword, or other keywords (if any), in the utterance relate to the same topic; If the topics share a particular keyword, then the system adds the topic to the topic list of the utterance; If no such topic is found, then the system adds the topic with the highest corresponding confidence rating (described further below) to the topic list of the utterance.
  • the topics are organized hierarchically according to a "confidence rating" between zero and one that is assigned to each keyword-topic mapping. If a keyword is associated to more than one topic, each mapping may possess a different "confidence rating," depending on which mapping is more likely to happen. For example, a chat room with a computer theme may assign the highest rating to the mapping of "Java -> programming_languages"; whereas a chat room with a Southeast Asian theme may give "Java -> geography" the highest rating.
  • a dictionary editor is provided for the chat room administrator to add/delete topics, re-arrange the topic hierarchy, add/delete keywords, and edit confidence rates.
  • the system divides the utterance into clusters (240).
  • the system executes chat clustering recursively. In each recursion, the system focuses on a single utterance. More specifically, for real-time clustering, it handles the latest utterance; and for post-processing, it handles each utterance in chronological order.
  • the system determines whether the specific utterance relates to a recent cluster(s). If the utterance does not relate to one or more recent clusters, a new cluster that includes the utterance is created. Hence, the system compares the topics with recent chat clusters to determine the most likely topic that the keyword in the utterance refers to.
  • the system groups a current utterance with a recent cluster that relates to the same topic.
  • the system considers the following three user-specified parameters that indicate how "fine-grained" the clustering should be, i.e., how many utterances should be included in a cluster and the relationship among utterances included in a cluster.
  • the size of an individual cluster is therefore influenced by specific threshold parameters. Bigger threshold parameters will result in the generation of clusters of larger sizes, which may involve more noise (i.e., off-topic utterances in individual clusters).
  • the system ensures that each cluster satisfies the constraints posed by each of these threshold parameters.
  • the threshold parameters include:
  • Utterance Count Threshold (UCT): the minimum count of utterances needed to form a cluster.
  • Utterance Proximity Threshold (UPT): the maximum count of off-topic utterances between a current utterance and the last utterance of a cluster for which the cluster may still be expanded to include the current utterance.
  • Time Threshold (TT): the maximum time gap between a current utterance and the last utterance of a cluster for which the cluster may still be expanded to include the current utterance.
  • After the utterance has been divided into clusters, the system generates and/or updates, as appropriate, the chat timeline (250).
  • the chat timeline reflects a temporal depiction of content of a chat session.
  • Each "track" in a chat timeline consists of one or more "cluster lines" of the same topic, illustrating when and how long each topic is covered in the course of a chat session. Overlapping clusters/topics can thus be easily observed from the chat timeline.
  • a user can select a cluster depicted on a chat timeline to gain additional information about the cluster and the utterances included in the cluster, described further relative to Fig 4, below.
  • After the chat timeline is updated, the system returns to 210 to analyze the next utterance and the processing repeats.
  • the following example depicts a process for clustering chat content and steps through exemplary algorithms that may be used to perform such processing. More specifically, the example demonstrates how the threshold parameters are used to detect clusters.
  • the first two clusters are overlapping with each other.
  • the cluster has a gap between 12-14.
  • the gap is taken as part of the cluster because it only covers 3 utterances (≤ UPT) and spans 11 seconds (≤ TT).
  • the 2nd and the 4th clusters are overlapping with each other.
  • the following is an illustrative algorithm to cluster chat content:
  • ALGORITHM clustering INPUT chat_log, UCT, UPT, TT;
  • VARIABLES log_unit current_utterance, current_utterance2; cluster eligible_block; list of clusters eligible_clusters; integer count;
  • eligible_block is a temporary cluster of the preceding utterances that fit the following conditions: (utterance count ≤ UPT AND time spanned ≤ TT); eligible_clusters are clusters that overlap with eligible_block;
  • After a cluster is created, it is refined.
  • the term "refining" a cluster refers to a cluster B being absorbed by, i.e., combined with, a cluster A when each of the following conditions is met: (1) clusters A and B both overlap each other; AND
  • cluster B's topic is a sub-category of cluster A's topic
  • the length, i.e., utterance count, of cluster B is less than 1/5 the length of cluster A.
  • a list of clusters, i.e., a cluster list, included in a particular chat session is stored in a chat log, such as, for example, a database.
  • the owner or the moderator of the chat room changes the values of UCT, UPT and TT
  • the system deletes a current cluster list and re-creates a new cluster list.
  • the user can save current cluster list.
  • Figs. 4A-4F depict a chat timeline and illustrate various features of the chat timeline.
  • Each topic of clusters is depicted with a different color.
  • a user may navigate the timeline, for example, via buttons provided on a user interface that allow a user to scroll left and right.
  • the user interface may include a scrollbar that allows a user to zoom in and out.
  • Fig. 4B depicts a chat session that includes more tracks than can be displayed, for example, on a single page. In such case, the visualization tool combines "tracks" that do not have overlapping clusters.
  • Each topic is still represented by a different color.
  • a pull-down menu appears, see Fig. 4C, enabling the user to access a series of features related to the cluster or the corresponding topic.
  • the following additional system features can be accessed via selection of a cluster: (1) Display of a topic information window that displays the "accumulative" statistics and information of each cluster that has been associated to a particular topic, e.g., total time elapsed, active chatters, regular patterns, etc., and the starting and ending time of each cluster.
  • the other information may include, for example, time elapsed, a list of all participants in a chat session, i.e., chatters, active chatters, an indication of a quality of the conversation, i.e., a measure of how much noise, or off-topic discussion is included in a chat session (see Fig. 4D).
  • Figure 4E depicts a chat timeline that includes small icons corresponding to cluster annotations and private chat rooms.
  • the icons are superimposed on the clusters having user annotations and/or private chat rooms.
  • a user may access annotations of a particular cluster, e.g., notes or statistics related to the cluster, or a private chat room.
  • Figure 4F shows an example of an annotation on a cluster.
  • the system supports a search engine embodiment that allows a user to search a chat log for either specific topics or specific keywords.
  • the system lists the clusters which relate to, i.e., have been associated to, the topic.
  • the system may also list the sub-categories that relate to the topic, which sub-categories are configurable by the user, as links to the relevant portions of the chat log and its statistics. If the resultant list is too large, the user can narrow it down by specifying more conditions, such as, for example, keywords of a cluster, time periods of a cluster, or participants in a chat session or specific cluster of a chat session, etc.
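The threshold parameters (UCT, UPT, TT) and the cluster refinement conditions listed above can be illustrated with a minimal Python sketch. It is not the algorithm of the application, only one possible reading of the stated rules. The `Cluster` class, the function names, the `parent_of` mapping, and the use of "less than or equal to" for the UPT and TT comparisons are assumptions made for illustration, as is treating every utterance between the end of a cluster and the current utterance as off-topic for that cluster.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    topic: str
    # Positions of the member utterances in the chat log (hypothetical layout).
    indices: list = field(default_factory=list)
    # Timestamp, in seconds, of the last member utterance.
    last_time: float = 0.0


def can_extend(cluster, index, time, upt, tt):
    """True if the cluster may be expanded to the utterance at `index`.

    Assumes every utterance between the cluster's last utterance and the
    current one is off-topic for this cluster.
    """
    off_topic_gap = index - cluster.indices[-1] - 1
    return off_topic_gap <= upt and (time - cluster.last_time) <= tt


def meets_uct(cluster, uct):
    """True if the cluster holds at least UCT utterances."""
    return len(cluster.indices) >= uct


def should_absorb(a, b, parent_of):
    """True if cluster b should be absorbed by cluster a.

    `parent_of` is a hypothetical topic -> parent topic mapping; only
    direct sub-categories are recognized in this sketch.
    """
    overlap = a.indices[0] <= b.indices[-1] and b.indices[0] <= a.indices[-1]
    is_sub_category = parent_of.get(b.topic) == a.topic
    return overlap and is_sub_category and len(b.indices) < len(a.indices) / 5
```

For example, a cluster whose last utterance is number 11 can be expanded to utterance 15 when UPT is 3 and the elapsed time is within TT, which matches the gap of three utterances (12-14) discussed in the example above.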

Abstract

This invention provides a system and method of creating temporal clusters of content included in a chat session and representing such clusters in a chat timeline. The chat timeline provides a visual summary of the content of the chat session. The system further includes additional user-friendly features that allow a user to view statistics of particular clusters, patterns of the chat session, and other relevant chat information.

Description

SYSTEM AND METHOD FOR CLUSTERING AND VISUALIZATION OF ONLINE CHAT
FIELD OF THE INVENTION
This invention relates generally to data processing systems and, more specifically, to systems and methods for clustering and visualization of content included in online chat sessions.
BACKGROUND OF THE INVENTION
The popularity of online chatting has increased significantly over the last several years. Online chat sessions may either be very structured and targeted to discussion of a particular topic, or may be unstructured and serve as an open forum for a public discussion of various topics. Unstructured online chat sessions are generally noisy and out-of-focus. The term "noise," as used herein, refers to off-topic content included in a topic-based online chat session. Participants in an online chat session, "chatters," may switch discussion topics frequently and randomly without alerting other users. The randomness and frequency with which topics may be changed make it difficult for chatters to determine both a current topic and to trace previous content of the chat session. Some chat systems interface with a logging feature that allows users to log a chat transcript, indicating discussion topics covered in a particular chat session. The chat transcript is generally saved as an archive file and thus is not organized according to, for example, a threaded structure that can be easily indexed and searched.
Such spontaneity and randomness in conventional online chat sessions thus pose several problems for chatters. First, a latecomer to a particular chat session, or an existing chatter who leaves the chat session for a period of time and later returns, may have some difficulty tracing the previous discussion. Therefore it is hard for such a person to join, or rejoin, the chat session. Second, individual chatters who enter a chat room when it is empty need a manner of determining a current status of a chat session. Conventional logging systems, as described above, are insufficient for this purpose.
Similarly, even in chat sessions that include a logging feature that tracks chat discussions in the form of, for example, a chat transcript, searching such transcript to find a quality exchange can be a tedious and time-consuming process and therefore discouraging to many users. Fourth, for topic-based discussions, chatters are likely to initiate, perhaps inadvertently, off-topic discussions unless there is a mechanism that can detect the switching of topics and possibly provide a warning or alert about the topic switch, either directly or indirectly.
Accordingly, a need exists for an online chat analysis tool that tracks a chat session such that chatters may enter and exit the conversation randomly, and easily determine both the previous content of the chat session and the current status of the chat session. Additionally, for topic-based chat sessions a tool is needed to monitor a chat session to ensure that topic switching is avoided.
SUMMARY OF THE INVENTION
According to an embodiment of this invention, a method is provided to analyze chat content. The method includes determining a next utterance in a chat session, extracting keywords from the next utterance, associating the extracted keyword with a topic, and creating a cluster of the content of the chat session by organizing the content according to the topic.
According to another embodiment of this invention, a system is provided to analyze chat content. The system includes a chat summarization tool that divides a content included in a chat session into time-based segments of related chat data and provides a visual representation of the time-based segments of related chat data.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 depicts an illustrative block diagram of the invention.
Fig. 2 depicts an illustrative flow diagram of the invention.
Fig. 3 depicts an illustrative data structure that is used to facilitate logging of a chat session.
Figs. 4A-4E depict a chat timeline and illustrate various features of the chat timeline.
DETAILED DESCRIPTION OF THE INVENTION
This invention provides a chat session summarization tool that makes associations among and organizes the content of a chat session. The tool outputs a graphical representation of the content included in the chat session. This representation is provided in the form of a bar chart that depicts various utterances of the chat session and indicates when each of the utterances occurs relative to other utterances of the chat session, and a length of time each utterance occurred. In the visual representation of the chat session, the utterances are grouped as clusters, described further below. This chat summarization tool assists current chatters, newly joined chatters, and chat log readers to conveniently and easily determine a current status of a chat session and navigate a content history of the chat session. This chat summarization tool also creates a searchable chat database, i.e., a chat log, that allows users to search for segments of a chat session by topic or keyword.
More specifically, this invention analyzes the content of a chat session by associating utterances of a chat session with one another to create "clusters" of chat content. A "cluster" refers to a group of adjacent utterances with similar or related discussion topics in a chat session. Chat clustering includes applying a content analysis tool to the content of a chat session, identifying "clusters" of temporal chat utterances in the chat session, and grouping the clusters. Chat clusters may be separated, for example, by noise, i.e., off topic utterances of the chat session. Threshold parameters, configurable by users, determine the tolerance level of such "noisy utterances." Exemplary threshold parameters are described further below. An utterance may involve more than one topic, e.g., the discussion of "computers in education" involves two topics: "computer science" and "education". Therefore, multiple topics can be included in a single utterance. In addition to determining clusters with specific topics, the system can also detect "socializing clusters," including "looking for chatters," "greetings," and "separation." Socializing clusters are clusters that do not involve any specific discussion topic but are common socializing behaviors between humans, e.g., to exchange "hellos" or "goodbyes." In the invention, the content of a chat session is divided into clusters according to a temporal sequence of the discussion topics. By contrast, conventional clustering methods are typically applied to a group of documents or items. The system of the invention may execute either in real-time, i.e., while a chat session occurs, or after a chat session has completed.
Once the content of a chat session has been divided into clusters, a chat timeline is constructed. The chat timeline depicts a visual, temporal representation of chat topics included in the chat session. On the chat timeline, each cluster is represented as one or more lines of color. Each of the clusters that relate to a particular topic is represented by the same color. The length of each line corresponds to the time span of the content represented by the cluster. The cluster lines on the timeline are selectable, allowing users to conveniently view statistics of each cluster and read the chat log in a newly launched window. Users may change the time span and the scale reflected on the timeline. Users may also change the temporal resolution of the timeline, e.g., on a scale of 1 minute or 10 minutes per unit on the timeline. The timeline construction may be performed by either a software program that interfaces with the chat room, or a software agent that has access to the chat content. The chat timeline can either be updated in real-time during a chat session or after a chat session has completed if a chat session includes a logging feature.
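As a rough illustration of the timeline construction described above, the following Python sketch groups clusters into one track per topic and converts each cluster into a line whose offset and length are expressed in timeline units. The `Cluster` fields, the `build_tracks` function, and the `seconds_per_unit` parameter are hypothetical names chosen for this sketch; the application does not prescribe a particular representation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Cluster:
    topic: str
    start: float  # seconds since the start of the chat session
    end: float    # seconds since the start of the chat session


def build_tracks(clusters, seconds_per_unit=60):
    """Return {topic: [(offset, length), ...]} in timeline units.

    Each tuple is one cluster line; its length reflects the time span of
    the cluster, and all lines of one topic share a track (and a color).
    """
    tracks = defaultdict(list)
    for cluster in sorted(clusters, key=lambda c: c.start):
        offset = cluster.start / seconds_per_unit
        length = (cluster.end - cluster.start) / seconds_per_unit
        tracks[cluster.topic].append((offset, length))
    return dict(tracks)


clusters = [
    Cluster("education", 0, 120),
    Cluster("programming languages", 60, 300),
    Cluster("education", 600, 660),
]
tracks = build_tracks(clusters, seconds_per_unit=60)
```

Changing `seconds_per_unit` from 60 to 600 corresponds to switching the temporal resolution from 1 minute to 10 minutes per unit; overlapping clusters show up as lines on different tracks whose spans intersect.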
This invention may also support additional embodiments, including, for example:
(1) tools for "pasting" notes, that will allow users, for example, to input comments on the content of a cluster or launch a new private chat session for further discussion on the content of a current chat session;
(2) updating chatters on topic changes and further including a feature to notify a chatter who is participating in a topic-based discussion when the chatter changes the topic;
(3) construction of a cluster search engine that allows a user to search a chat log by topic or keyword;
(4) detection of regular time-based patterns of discussion in a chat room. A time-based pattern of a chat room may include discussing a specific topic at a certain time on a particular day of the week. Thus, for example, an embodiment of this feature checks the contents of chat in a chat room during a specific day. This feature may be used to suggest to a user the best time to enter a particular chat room; and
(5) alerting a chatter when the chatter is off-topic, e.g., the system may display a pop-up window to inform a particular user that his or her utterances are off-topic relative to an ongoing discussion.
Fig. 1 depicts an illustrative block diagram of the invention. Clients 110 a...n interface with server 120 via network 130. Each of clients 110 a...n includes the conventional components of memory 132, processor(s) 134, input/output devices 136, and browser 138. Server 120 includes the following conventional components: processor 140, input/output device 142, storage 144, and memory 146. Server 120 further includes additional storage, such as chat log 150 and topic dictionary 154. Memory 146 may further include local application 148. Chat session summarization tool 160 analyzes the content of chat session 164 among clients 110 a...n and provides a visual representation of such content.
One of skill in the art will appreciate that while network 100 has been depicted with specific components, additional or different hardware or software components may be used within the scope of the invention. For example, topic dictionary 154 and chat log 150 may not reside in server 120, but may reside on a network accessible data storage device. Similarly, server 120 may not include local applications 148. Further, while chat summarization tool 160 has been depicted in memory 146, it may reside in a storage device connected to server 120 via a network. Still further, the processing described relative to chat summarization tool 160 may be performed by an attached procedure of the chat room or a software agent who has access to the chat content.
Fig. 2 depicts an illustrative flow diagram of the invention. Chat clustering is performed relative to 210-240 and 260; 250 relates to constructing a chat timeline. The invention first determines a next utterance in a chat session (210). An utterance refers to a single word or group of words or sentences entered by a specific chatter during a single entry. Thus, an utterance refers to a message that is created and sent to a chat room when the chatter presses the "enter" key.
Fig. 3 depicts an illustrative data structure that is used to facilitate logging of a chat session. The data structure of Fig. 3 depicts a linked list that defines a log unit and a cluster. An utterance is a log unit that is included as part of a cluster. As described above, an utterance may include multiple topics. Once an utterance has been defined, keywords are extracted from the utterance by applying standard text parsing and morphology algorithms (220). The keywords correspond to words that have been included in a topic dictionary, such as topic dictionary 154. Each keyword that has been extracted from an utterance is then associated to one or more topics listed in the topic dictionary (230). Thus, each utterance is associated to a list of topics. For this purpose, the invention includes a "dictionary" that associates, i.e., maps, keywords to topics (235). The dictionary may be in the form of, for example, a relational database or a look-up table that lists the related topics for each keyword. Each keyword can be associated to multiple topics. For example, the keyword "Java" can be associated to "programming languages," "geography," and "food & beverage." Once a keyword has been mapped to one or more topics, the topics are categorized and organized hierarchically.
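The keyword extraction and dictionary look-up described above can be sketched as follows. This is a minimal illustration only: the dictionary contents, the function names `extract_keywords` and `candidate_topics`, and the simple lower-casing tokenizer are assumptions, and the morphology step mentioned in the description is omitted.

```python
import re

# Hypothetical topic dictionary: keyword -> topics (contents are illustrative).
TOPIC_DICTIONARY = {
    "java": ["programming languages", "geography", "food & beverage"],
    "python": ["programming languages"],
    "coffee": ["food & beverage"],
}

def extract_keywords(utterance, dictionary):
    """Return the dictionary keywords found in the utterance, in order of first appearance."""
    words = re.findall(r"[a-z0-9]+", utterance.lower())
    found = []
    for word in words:
        if word in dictionary and word not in found:
            found.append(word)
    return found

def candidate_topics(utterance, dictionary):
    """Map each extracted keyword to the topics listed for it in the dictionary."""
    return {kw: list(dictionary[kw]) for kw in extract_keywords(utterance, dictionary)}
```

A keyword that maps to several topics, such as "java" here, is left ambiguous at this stage; choosing among the candidates is a separate step.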
The topics are categorized according to the following logic. If a keyword can only be mapped to one topic, then that topic is added to the topic list associated to the utterance that the keyword was extracted from. If the keyword can be mapped to more than one topic, then the system determines whether any of those topics has been included in a recent chat cluster. Recent is determined according to user-specified values reflecting constraints of a cluster, which are described further below. If there are one or more topics that relate to a recent chat cluster(s), then each of those topics is added to the topic list associated with the utterance. If none of the topics relates to a recent cluster, then the system determines whether other keywords (if any) in the utterance relate to the same topic. If a topic is shared with another keyword in the utterance, then the system adds that topic to the topic list of the utterance. If no such topic is found, then the system adds the topic with the highest corresponding confidence rating (described further below) to the topic list of the utterance.
The topics are organized hierarchically according to a "confidence rating" between zero and one that is assigned to each keyword-topic mapping. If a keyword is associated to more than one topic, each mapping may possess a different confidence rating, depending on which mapping is more likely to happen. For example, a chat room with a computer theme may assign the highest rating to the mapping "Java -> programming_languages", whereas a chat room with a Southeast Asian theme may give "Java -> geography" the highest rating. A dictionary editor is provided for the chat room administrator to add/delete topics, re-arrange the topic hierarchy, add/delete keywords, and edit confidence ratings.
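The topic selection logic and the confidence ratings described above can be sketched as follows. This is a sketch under assumptions, not the system's actual implementation: the function name `resolve_topics`, the shape of its inputs, and the example ratings are hypothetical.

```python
def resolve_topics(keyword_topics, recent_cluster_topics):
    """Build the topic list of one utterance.

    keyword_topics: {keyword: {topic: confidence rating between 0 and 1}}
    recent_cluster_topics: set of topics of recent clusters
    """
    topic_list = []

    def add(topic):
        if topic not in topic_list:
            topic_list.append(topic)

    for keyword, ratings in keyword_topics.items():
        # Case 1: the keyword maps to exactly one topic.
        if len(ratings) == 1:
            add(next(iter(ratings)))
            continue
        # Case 2: prefer topics that already appear in a recent cluster.
        recent = [t for t in ratings if t in recent_cluster_topics]
        if recent:
            for topic in recent:
                add(topic)
            continue
        # Case 3: prefer topics shared with another keyword of the utterance.
        shared = [
            t for t in ratings
            if any(t in other for k, other in keyword_topics.items() if k != keyword)
        ]
        if shared:
            for topic in shared:
                add(topic)
            continue
        # Case 4: fall back to the mapping with the highest confidence rating.
        add(max(ratings, key=ratings.get))
    return topic_list
```

With ratings such as those of a computer-themed chat room, an isolated "Java" resolves to the programming topic, while a recent geography cluster or a co-occurring beverage keyword overrides the rating.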
Once the topic list for an utterance is generated, the system divides the utterance into clusters (240). The system executes chat clustering recursively. In each recursion, the system focuses on a single utterance. More specifically, for real-time clustering, it handles the latest utterance; and for post-processing, it handles each utterance in chronological order. After the topic detection process is performed relative to a specific utterance, the system determines whether the specific utterance relates to a recent cluster(s). If the utterance does not relate to one or more recent clusters, a new cluster that includes the utterance is created. Hence, the system compares the topics with recent chat clusters to determine the most likely topic that the keyword in the utterance refers to.
In this processing, the system groups a current utterance with a recent cluster that relates to the same topic. When clustering an utterance, the system considers the following three user-specified parameters that indicate how "fine-grained" the clustering should be, i.e., how many utterances should be included in a cluster and the relationship among utterances included in a cluster. The size of an individual cluster is therefore influenced by specific threshold parameters. Bigger threshold parameters will result in the generation of clusters of larger sizes, which may involve more noise (i.e., off-topic utterances in individual clusters). The system ensures that each cluster satisfies the constraints posed by each of these threshold parameters. The threshold parameters include:
(1) Utterance Count Threshold (UCT): The minimum count of utterances needed to form a cluster;
(2) Utterance Proximity Threshold (UPT): The maximum count of off-topic utterances between a current utterance and a last utterance of a cluster that the cluster is allowed to be expanded to the current utterance; and
(3) Time Threshold (TT): The maximum time gap between a current utterance and a last utterance of a cluster that the cluster is allowed to be expanded to the current utterance.
After the utterance has been divided into clusters, the system generates and/or updates, as appropriate, the chat timeline (250). The chat timeline reflects a temporal depiction of content of a chat session. Each "track" in a chat timeline consists of one or more "cluster lines" of the same topic, illustrating when and how long each topic is covered in the course of a chat session. Overlapping clusters/topics can thus be easily observed from the chat timeline. A user can select a cluster depicted on a chat timeline to gain additional information about the cluster and the utterances included in the cluster, described further relative to Fig. 4, below.
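The role of UPT and TT when deciding whether a cluster may be expanded to a current utterance can be sketched as follows. The `Utterance` type and the function `can_expand` are hypothetical names, and the thresholds are treated as inclusive maxima, which is an assumption (the illustrative algorithm in this description writes the comparisons with a strict "<").

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    index: int        # position of the utterance in the chat log
    timestamp: float  # seconds since the start of the session

def can_expand(last_in_cluster, current, upt, tt):
    """Return True if a cluster ending at last_in_cluster may grow to current."""
    # Utterances lying strictly between the two are counted as the gap.
    gap_count = current.index - last_in_cluster.index - 1
    gap_time = current.timestamp - last_in_cluster.timestamp
    return gap_count <= upt and gap_time <= tt
```

For instance, with UPT = 3 and TT = 12 seconds, a gap of four utterances blocks the expansion even when it spans only 8 seconds, while a gap of three utterances spanning 11 seconds is accepted.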
After the chat timeline is updated, the system returns to 210 to analyze the next utterance and the processing repeats.
Example:
The following example depicts a process for clustering chat content and steps through exemplary algorithms that may be used to perform such processing. More specifically, the example demonstrates how the threshold parameters are used to detect clusters.
[Figure: table of example chat utterances (image not reproduced)]
Case 1: UCT = 3; UPT = 5; TT = 10 seconds
Clusters formed:
[Figure: table of clusters formed in Case 1 (image not reproduced)]
The first two clusters are overlapping with each other.
Case 2: UCT = 3; UPT = 3; TT = 12 seconds
Clusters formed:
| Topic | Range (Utterance Indices) | Remarks |
|---|---|---|
| A | 1-4 | |
| A | 9-11 | The reason this cluster can't merge with the first cluster (1-4, Topic A) is that it will otherwise form a gap (5-8) that spans 8 seconds (< TT) but covers 4 utterances (> UPT). |
| B | 3-6 | The reason utterances 11 & 12 can't be included in this cluster is that it will otherwise form a gap (7-10) that spans 8 seconds (< TT) but covers 4 utterances (> UPT). Neither can utterances 11 & 12 make up a separate cluster because it only has 2 utterances with the same topic (< UCT). |
| | 11-17 | The cluster has a gap between 12-14. The gap is taken as part of the cluster because it only covers 3 utterances (≤ UPT) and spans 11 seconds (< TT). |
The 2nd and the 4th clusters are overlapping with each other.
The following is an illustrative algorithm to cluster chat content:
ALGORITHM clustering
INPUT: chat_log, UCT, UPT, TT;
VARIABLES:
    log_unit current_utterance, current_utterance2;
    cluster eligible_block;
    list of clusters eligible_clusters;
    integer count;
START
WHILE there are more unanalyzed utterances
    current_utterance := the first unanalyzed utterance;
    Analyze current_utterance => Form topic_list;
    Identify eligible_block (a temporary cluster of the preceding utterances who fit the following conditions: utterance count < UPT AND time spanned < TT);
    eligible_clusters are clusters that overlap with eligible_block;
    /* merge current utterance with an existing cluster */
    WHILE there are more unchecked clusters in eligible_clusters
        current_cluster := the first unchecked cluster in eligible_clusters;
        IF ∃ (topic in current_utterance's topic_list) that tallies with current_cluster's topic
            Expand current_cluster to current_utterance (i.e., current_utterance is now part of current_cluster);
        ENDIF
    ENDWHILE
    /* generate new cluster */
    WHILE there are more "unclustered" topics in current_utterance
        count := 1;
        current_topic := first unchecked topic of current_utterance;
        Identify eligible_block with respect to current_utterance (a temporary cluster of the preceding utterances who fit the following conditions: utterance count < UPT AND time spanned < TT);
        WHILE (there are more unchecked previous utterances) AND (there are more unchecked utterances in eligible_block)
            current_utterance2 := the closest unchecked utterance to current_utterance;
            IF topic_list in current_utterance2 contains current_topic
                count := count + 1;
                IF (count = UCT)
                    Generate new cluster with topic := current_topic and covers current_utterance2 (starting) to current_utterance (ending);
                    EXIT (innermost) WHILE loop;
                ELSE
                    Identify eligible_block with respect to current_utterance2;
                    NEXT (innermost) WHILE iteration;
                ENDIF
            ENDIF
        ENDWHILE
    ENDWHILE
    Refine the clustering;
    Update timeline display;
ENDWHILE
END
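The illustrative algorithm above can be approximated in runnable form as follows. This is a simplified sketch rather than a faithful transcription: the names `Cluster`, `within`, and `cluster_chat` are hypothetical, utterances are assumed to arrive as (timestamp, set of topics) pairs after topic detection, thresholds are treated as inclusive, and the refinement and timeline update steps are left out.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    topic: str
    start: int  # index of the first utterance in the cluster
    end: int    # index of the last on-topic utterance in the cluster

def within(utterances, earlier, later, upt, tt):
    """True if the gap between two utterance indices satisfies UPT and TT."""
    gap_count = later - earlier - 1
    gap_time = utterances[later][0] - utterances[earlier][0]
    return gap_count <= upt and gap_time <= tt

def cluster_chat(utterances, uct, upt, tt):
    """utterances: list of (timestamp, set_of_topics) in chronological order."""
    clusters = []
    for i, (_, topics) in enumerate(utterances):
        clustered = set()
        # Merge the current utterance with existing clusters of a matching topic.
        for cluster in clusters:
            if cluster.topic in topics and within(utterances, cluster.end, i, upt, tt):
                cluster.end = i
                clustered.add(cluster.topic)
        # Try to generate a new cluster for each topic that was not merged.
        for topic in sorted(topics - clustered):
            count, start = 1, i
            j = i - 1
            # Walk backwards; each on-topic utterance found resets the gap window.
            while count < uct and j >= 0 and within(utterances, j, start, upt, tt):
                if topic in utterances[j][1]:
                    count += 1
                    start = j
                j -= 1
            if count >= uct:
                clusters.append(Cluster(topic, start, i))
    return clusters
```

Processing one utterance per loop iteration matches the real-time mode described earlier; for post-processing the same function is simply applied to the full chat log.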
After a cluster is created, it is refined. The term "refining" a cluster refers to a cluster B being absorbed by, i.e., combined with, a cluster A when each of the following conditions is met:
(1) clusters A and B overlap each other; AND
(2) cluster B's topic is a sub-category of cluster A's topic; AND
(3) the length, i.e., utterance count, of cluster B is less than 1/5 the length of cluster A.
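The three refinement conditions can be expressed as a single predicate, sketched below. The `Cluster` fields, the `parent_of` mapping (child topic to parent topic), and the choice to treat any descendant as a sub-category are assumptions made for illustration; the topic hierarchy is assumed to contain no cycles.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    topic: str
    start: int            # index of the first utterance
    end: int              # index of the last utterance
    utterance_count: int  # length of the cluster

def is_subcategory(topic, ancestor, parent_of):
    """True if topic lies below ancestor in the (acyclic) topic hierarchy."""
    current = parent_of.get(topic)
    while current is not None:
        if current == ancestor:
            return True
        current = parent_of.get(current)
    return False

def should_absorb(a, b, parent_of):
    """True if cluster b should be absorbed by cluster a."""
    overlap = a.start <= b.end and b.start <= a.end
    return (
        overlap
        and is_subcategory(b.topic, a.topic, parent_of)
        and b.utterance_count < a.utterance_count / 5
    )
```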
A list of clusters, i.e., a cluster list, included in a particular chat session is stored in a chat log, such as, for example, a database. When a user, for example, the owner or the moderator of the chat room, changes the values of UCT, UPT and TT, the system deletes the current cluster list and creates a new cluster list. Before modifying the UCT, UPT, and TT values, the user can save the current cluster list.
Figs. 4A-4F depict a chat timeline and illustrate various features of the chat timeline. Each topic of clusters is depicted with a different color. A user may navigate the timeline, for example, via buttons provided on a user interface that allow a user to scroll left and right. Similarly, the user interface may include a scrollbar that allows a user to zoom in and out. Fig. 4B depicts a chat session that includes more tracks than can be displayed, for example, on a single page. In such case, the visualization tool combines "tracks" that do not have overlapping clusters. Each topic is still represented by a different color.
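One way to combine tracks that do not have overlapping clusters is a greedy first-fit placement, sketched below. The description does not specify how tracks are combined, so the strategy, the function names, and the input shape (topic mapped to a list of (start, end) cluster ranges) are all assumptions.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) ranges share at least one index."""
    return a[0] <= b[1] and b[0] <= a[1]

def combine_tracks(tracks):
    """Pack topic tracks into display rows.

    tracks: {topic: [(start, end), ...]}
    Returns a list of rows, each row being the list of topics drawn on it.
    """
    rows = []  # each entry: (topics on the row, cluster ranges already placed)
    for topic, ranges in tracks.items():
        for topics, placed in rows:
            if not any(overlaps(x, y) for x in ranges for y in placed):
                topics.append(topic)
                placed.extend(ranges)
                break
        else:
            # No existing row is free of overlaps, so open a new row.
            rows.append(([topic], list(ranges)))
    return [topics for topics, _ in rows]
```

Because each topic keeps its own color, clusters of different topics remain distinguishable after they share a row.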
When a user clicks on a cluster line, a pull-down menu appears, see Fig. 4C, enabling the user to access a series of features related to the cluster or the corresponding topic. For example, the following additional system features can be accessed via selection of a cluster:
(1) Display of a topic information window that displays the "accumulative" statistics and information of each cluster that has been associated to a particular topic, e.g., total time elapsed, active chatters, regular patterns, etc., and the starting and ending time of each cluster.
(2) Display of a cluster information window that displays statistics and other information related to the cluster. The other information may include, for example, time elapsed, a list of all participants in a chat session, i.e., chatters, active chatters, and an indication of a quality of the conversation, i.e., a measure of how much noise, or off-topic discussion, is included in a chat session (see Fig. 4D).
(3) Display of a chat log window that displays the chat log of the cluster (see Fig. 4D).
(4) Access to an annotation tool that allows a user to add annotations/comments to a particular cluster. The annotations/comments are stored relative to individual clusters and can be displayed when a user requests to retrieve them. The annotation window is presented as a single-threaded online message board. If the user changes the values of UCT, UPT and TT, i.e., initiates a re-clustering of a chat session, the system will automatically attach the annotations to the new cluster(s) at the same portion of the dialogue with the same topic. The chat log is also provided with new cluster assignments when re-clustering occurs. For example, compare "Case 1" and "Case 2" in the above example.
(5) Launch of and access to a private chat room that is accessible to the chatters in the main chat room to engage in a topic-specific discussion directed to the topic reflected by the selected cluster. Each cluster can have one related private chat room.
(6) Entry into a private chat room to determine a status of the private chat room, by viewing, for example, a chat log or chat timeline, etc., and join a current discussion in the private chat room.
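The statistics shown in a cluster information window could be computed along the following lines. The function name, the tuple layout of an utterance, the definition of an active chatter as one with at least one on-topic utterance, and the noise measure as the share of off-topic utterances are all assumptions; the description does not define these quantities precisely.

```python
def cluster_statistics(utterances, topic):
    """Summarize one cluster.

    utterances: non-empty list of (timestamp, chatter, set_of_topics) tuples
    covering the cluster's range, in chronological order.
    """
    on_topic = [u for u in utterances if topic in u[2]]
    return {
        "time_elapsed": utterances[-1][0] - utterances[0][0],
        "chatters": sorted({u[1] for u in utterances}),
        "active_chatters": sorted({u[1] for u in on_topic}),
        # Share of utterances in the range that are off-topic.
        "noise_ratio": 1 - len(on_topic) / len(utterances),
    }
```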
Figure 4E depicts a chat timeline that includes small icons corresponding to cluster annotations and private chat rooms. The icons are superimposed on the clusters having user annotations and/or private chat rooms. Thus, by double-clicking on an icon, a user may access annotations of a particular cluster, e.g., notes or statistics related to the cluster, or a private chat room. Figure 4F shows an example of an annotation on a cluster.
As described above, the system supports a search engine embodiment that allows a user to search a chat log for either specific topics or specific keywords. In this embodiment, when the user specifies a topic, the system lists the clusters which relate to, i.e., have been associated to, the topic. The system may also list the sub-categories that relate to the topic, which sub-categories are configurable by the user, as links to the relevant portions of the chat log and its statistics. If the resultant list is too large, the user can narrow it down by specifying more conditions, such as, for example, keywords of a cluster, time periods of a cluster, or participants in a chat session or specific cluster of a chat session, etc.
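A topic search over stored clusters, with an optional narrowing condition, might look like the following sketch. The cluster records (dictionaries with `topic` and `participants` keys), the `children_of` hierarchy mapping, and the decision to include clusters of sub-categories directly in the result are assumptions made for illustration.

```python
def search_clusters(clusters, topic, children_of, participant=None):
    """Return clusters whose topic is the given topic or one of its sub-categories.

    clusters: list of dicts with at least "topic" and "participants" keys
    children_of: {topic: [direct sub-category topics]}
    participant: optional chatter name used to narrow the result
    """
    wanted = {topic}
    stack = [topic]
    # Collect all sub-categories reachable from the requested topic.
    while stack:
        for child in children_of.get(stack.pop(), []):
            if child not in wanted:
                wanted.add(child)
                stack.append(child)
    return [
        c for c in clusters
        if c["topic"] in wanted
        and (participant is None or participant in c["participants"])
    ]
```

Further conditions named in the description, such as keywords or time periods, could be added as extra optional filters in the same way.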
Although this invention has been described relative to a particular embodiment, one of skill in the art will appreciate that this description is merely exemplary and the system and method of this invention may include additional or different components. This description is therefore limited only by the appended claims and the full scope of their equivalents.

Claims

WHAT IS CLAIMED IS:
1. A method to analyze chat content, comprising: determining a next utterance in a chat session; extracting keywords from the next utterance; associating the extracted keywords with a topic; and creating a cluster of the content of the chat session by organizing the content according to the topic.
2. The method of claim 1, further comprising creating a chat timeline reflecting the content of the chat session by cluster.
3. The method of claim 2, wherein creating the timeline comprises creating a temporal representation of the content of the chat session.
4. A system to analyze chat content, comprising a server that includes a chat summarization tool that divides content of the chat session into time-based segments of related chat content and provides a visual representation of the time-based segments of the related chat content.
5. The system of claim 4, further including a client that participates in an online chat session.
PCT/SG2001/000089 2001-05-11 2001-05-11 System and method for clustering and visualization of online chat WO2002093414A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/SG2001/000089 WO2002093414A1 (en) 2001-05-11 2001-05-11 System and method for clustering and visualization of online chat

Publications (1)

Publication Number Publication Date
WO2002093414A1 true WO2002093414A1 (en) 2002-11-21

Family

ID=20428934

Country Status (1)

Country Link
WO (1) WO2002093414A1 (en)

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP