WO2006030751A1 - Device for creating a document correlation diagram in which documents are arranged in time series - Google Patents
- Publication number
- WO2006030751A1 (application PCT/JP2005/016785)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- cluster
- diagram
- elements
- cutting
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
Definitions
- Document correlation diagram creation device that arranges documents in time series
- The present invention relates to a technique for automatically creating a document correlation diagram that shows the relationships between documents and reflects their temporal order, and in particular to a device, a method, and a program for creating such a document correlation diagram.
- Patent documents, technical documents, and other documents are newly created every day, and the number is enormous.
- Patent Document 1 discloses a method for associating documents ordered in time series. Specifically, the similarity between documents is calculated based on the degree of coincidence of words between documents, and a similarity matrix is created from the similarities under a time constraint. This similarity matrix is converted into an adjacency matrix in which the elements whose similarity is equal to or greater than a predetermined threshold are 1 and the rest are 0. Based on this adjacency matrix, a directed graph is created that serves as the relationship diagram of the documents.
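- The thresholding step of this prior-art approach can be sketched as follows. This is a minimal illustration, not the method of Patent Document 1 itself: the function name, the sample values, and the reading of the "time constraint" as "edges run only from an older document to a newer one" are all assumptions.

```python
def adjacency_from_similarity(sim, times, threshold):
    """Build a 0/1 adjacency matrix from a similarity matrix.

    Assumption: an edge i -> j is allowed only when document i is
    older than document j (one possible form of a time constraint).
    """
    n = len(sim)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and times[i] < times[j] and sim[i][j] >= threshold:
                adj[i][j] = 1
    return adj


# Hypothetical similarities and years for three documents.
sim = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
times = [1998, 2000, 2003]

adj = adjacency_from_similarity(sim, times, threshold=0.5)
# Directed edges of the resulting graph, as (older, newer) index pairs.
edges = [(i, j) for i in range(len(adj)) for j in range(len(adj)) if adj[i][j]]
```

- With these sample values the graph is a chain from document 0 to 1 to 2, which illustrates how a reader following similar documents one hop at a time can drift away from the starting document.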
- Patent Document 1: Japanese Patent Laid-Open No. 11-53387, “Document Association Method and System”
- Disclosure of the Invention
- In Patent Document 1, however, drift accumulates as the user moves from a certain document to a similar document and then to a document similar to that one, so the user may end up at a completely different document. In addition, multiple flows that branch from one document may eventually converge on a single document, making the meaning of the branching unclear. The technique described in Japanese Patent Laid-Open No. 11-53387 (Patent Document 1) therefore has the problem that the temporal development of each field cannot be expressed appropriately.
- An object of the present invention is to provide a creation apparatus, a creation method, and a creation program for a document correlation diagram that can appropriately represent the temporal development of each field.
- The document correlation diagram creation apparatus of the present invention comprises: extraction means for extracting, for a plurality of document elements, the content data and time data of document elements each composed of one or a plurality of documents; tree diagram creating means for creating a tree diagram showing the correlation of the plurality of document elements based on the content data of each document element; clustering means for cutting the tree diagram based on a predetermined rule to extract clusters; and intra-cluster arrangement means for determining, based on the time data of each document element, the arrangement within each cluster of the document element group belonging to that cluster.
- The predetermined rule by which the clustering means cuts the tree diagram is derived by association rule analysis.
- With cutting rules derived by association rule analysis, cutting rules applicable to a wide variety of tree diagrams (high versatility) can be used, and cutting at the ideal cutting value can be achieved with high probability.
- By increasing the number of examples in the teacher diagrams, the accuracy of the cutting rules can easily be improved.
- The predetermined rule is derived based on shape parameters of the tree diagram.
- Because the cutting position can be determined by reading the shape parameters of the tree diagram under analysis and applying the association rules to them, the cutting position can be determined with a small amount of calculation.
- The tree diagram may be cut only once (fixed BC method; described later), or the rule may be derived again based on the shape parameters of a parent cluster obtained by one cut, and descendant clusters may be extracted by cutting that parent cluster (variable BC method; described later). According to the variable BC method, even if a parent cluster with a large number of elements is generated, it can be further separated into descendant clusters.
- The predetermined rule may also be derived based on the number of vector dimensions of the plurality of document elements combined at each node of the tree diagram.
- A more appropriate branching can be obtained by adopting a cutting rule derived by taking the number of vector dimensions into account.
- The number of vector dimensions of the plurality of document elements is desirably the number of dimensions obtained by excluding, from the number of dimensions of the vector sum of the plurality of document elements, the number of vector components whose deviation between the document elements is smaller than a value determined by a predetermined method. This makes it possible to use more appropriate cutting rules.
- The clustering means desirably determines, for each node, whether or not the number of vector dimensions of the plurality of document elements combined at that node is equal to or greater than a predetermined value, and individually cuts the nodes for which it is equal to or greater than the predetermined value based on the determination result. A more appropriate branching can be obtained by evaluating the cutting criterion for each node and cutting each node individually based on the result.
- The clustering means desirably cuts the tree diagram to extract a parent cluster, creates, based on the content data of each document element belonging to the parent cluster, a partial tree diagram showing the correlation of the document element group belonging to the parent cluster, and extracts descendant clusters by cutting the created partial tree diagram based on a predetermined rule.
- This reduces misclassification of descendant clusters, so an appropriate classification can be obtained.
- In order to create the partial tree diagram, the clustering means desirably removes, from each document element vector, the vector components whose deviation between the plurality of document elements belonging to the parent cluster is smaller than a value determined by a predetermined method.
- By removing the vector components that have a small deviation between the document elements belonging to each parent cluster, descendant clusters can be obtained from a point of view different from that used to extract the parent cluster.
- The vector component of a document element is, for example, the whole-document IDF-weighted TF value (TF*IDF(P) value; described later) of each index word in the document. Whether the deviation is small can be determined, for example, by calculating the TF*IDF(P) value of each index word for all document elements belonging to the parent cluster and checking whether the ratio of the standard deviation of these values to their average falls within a predetermined range.
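- The deviation test described above can be sketched as follows. This is only an illustrative reading: the function names, the sample weights, the use of the population standard deviation, and the interpretation of "within a predetermined range" as "ratio below `max_ratio`" are assumptions, not details taken from the specification.

```python
from statistics import mean, pstdev


def low_deviation_words(vectors, max_ratio):
    """Return index words whose (std dev / mean) across the cluster is small.

    vectors: one {index_word: weight} dict per document element, where the
    weight stands in for a TF*IDF(P) value. A missing word counts as 0.
    """
    words = set().union(*(v.keys() for v in vectors))
    low = set()
    for word in words:
        values = [v.get(word, 0.0) for v in vectors]
        avg = mean(values)
        if avg > 0 and pstdev(values) / avg < max_ratio:
            low.add(word)
    return low


def remove_components(vectors, words_to_drop):
    """Drop the given index words from every document element vector."""
    return [{w: x for w, x in v.items() if w not in words_to_drop} for v in vectors]


# Hypothetical parent cluster: "battery" appears with nearly equal weight in
# every element, so it does not help to split the cluster any further.
parent_cluster = [
    {"battery": 2.0, "anode": 3.0},
    {"battery": 2.1, "cathode": 2.5},
    {"battery": 1.9, "anode": 0.5, "cathode": 0.4},
]
low = low_deviation_words(parent_cluster, max_ratio=0.2)
reduced = remove_components(parent_cluster, low)
```

- The reduced vectors would then be used to compute similarities for the partial tree diagram, so that the descendant clusters are separated by the words that actually vary inside the parent cluster.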
- The tree diagram creating means creates the tree diagram so that the coupling height between document elements reflects the degree of similarity between them, and the clustering means preferably extracts the clusters by cutting the tree diagram at two or more predetermined heights.
- For the connection structure after cutting, it is desirable to determine the branch structure based on the number of branch lines cut at each cutting position.
- The tree diagram creating means creates the tree diagram so that the coupling height between document elements reflects the degree of similarity between them, and the clustering means desirably extracts the clusters by cutting at a cutting position based on a function that includes, as variables, either or both of the average value and the deviation of the coupling heights of the document element group belonging to the tree diagram.
- Because cutting is performed based on a function that includes either or both of the average value and the deviation of the coupling heights as variables, the method can handle a wide variety of tree diagram shapes, requires no complicated calculation, and easily yields an appropriate branching.
- The function is preferably one that includes at least the average value as a variable, and more preferably one that includes both the average value and the deviation as variables. For example, it is preferable to use $\langle d \rangle + \delta\sigma_d$ (where $-3 \le \delta \le 3$), using the average value $\langle d \rangle$ of the coupling height $d$ and its standard deviation $\sigma_d$.
- The deviation is not limited to the standard deviation $\sigma_d$; the average deviation, for example, may be used instead.
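- The cutting-height function described above (the average coupling height plus a multiple of its standard deviation, with the multiplier between -3 and 3) can be sketched as follows. The function name, the parameter name `delta`, the sample heights, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def cutting_height(coupling_heights, delta):
    """Cutting height = mean(d) + delta * std(d), with -3 <= delta <= 3."""
    if not -3 <= delta <= 3:
        raise ValueError("delta is expected to lie between -3 and 3")
    return mean(coupling_heights) + delta * pstdev(coupling_heights)


# Hypothetical coupling heights d of the nodes in one tree diagram.
heights = [0.2, 0.3, 0.4, 0.9]
alpha = cutting_height(heights, delta=0.5)

# Nodes whose coupling height lies above the cutting height are severed;
# the subtrees below them become clusters.
cut_nodes = [d for d in heights if d > alpha]
```

- Because the height is derived from the distribution of coupling heights rather than fixed in advance, the same rule adapts to tree diagrams whose elements are, overall, more or less similar to each other.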
- The tree diagram creating means creates the tree diagram so that the coupling height between document elements reflects the degree of similarity between them, and the clustering means desirably extracts parent clusters by cutting the tree diagram at a cutting position based on a function that includes, as variables, either or both of the average value and the deviation of the coupling heights of the document element group belonging to the tree diagram, and extracts descendant clusters by cutting each parent cluster at a cutting position based on a function that includes, as variables, either or both of the average value and the deviation of the coupling heights of the document element group belonging to that parent cluster.
- Because the parent clusters are extracted based on the coupling heights of the document elements belonging to the tree diagram, and the descendant clusters are extracted based on the coupling heights of the document elements belonging to each parent cluster, appropriate parent and descendant clusters can be obtained even if the number of elements is large (e.g., N > 20). In addition, because cluster extraction is based on a function of either or both of the average value and the deviation of the coupling heights, a wide variety of tree diagram shapes can be handled, including cases where the similarity among the document elements belonging to the tree diagram is high, and appropriate parent and descendant clusters can be obtained.
- The function is preferably one that includes at least the average value as a variable, and more preferably one that includes both the average value and the deviation as variables. For example, it is preferable to use $\langle d \rangle + \delta\sigma_d$ (where $-3 \le \delta \le 3$), using the average value $\langle d \rangle$ of the coupling height $d$ and its standard deviation $\sigma_d$. An example of a function that includes the deviation of the coupling height $d$ as a variable but not its average value $\langle d \rangle$ is one that uses the standard deviation $\sigma_d$ of the coupling height $d$ and the midpoint distance $m$ (described later).
- The deviation is, for example, the standard deviation, but it is not limited to the standard deviation $\sigma_d$; the average deviation, for example, may be used instead.
- The data of the applicant of a patent document is used as the content data for distinguishing display. In this way, it is possible to see how the group of patent documents of one applicant is positioned in relation to those of other companies.
- In the tree diagram composed of the document element group belonging to a cluster, the intra-cluster arrangement means desirably compares which of the combined document elements is older, in order from the lowest node; the document element determined to be older at a lower node becomes the comparison target at the next higher node, and this is repeated up to the highest node. The oldest element determined by the comparison at the highest node is placed at the head of the cluster, as many branches are created from the oldest element as there are document elements that were directly compared with it, and these compared document elements are connected to the respective branches to determine the arrangement.
- When an opponent of the oldest element has itself been compared with other document elements at a lower node, it is desirable to repeat, within each branch, the same process with that opponent treated in the same way as the oldest element.
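- The bottom-up "which is older" comparison described above behaves like a knockout tournament, and can be sketched as follows. The class and function names, the nested-tuple tree encoding, the sample years, and the tie-breaking rule (the left element wins on equal time) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Placed:
    """A document element together with the branches hanging from it."""
    name: str
    time: int
    branches: list = field(default_factory=list)


def arrange(node):
    """Arrange one cluster's tree diagram by repeated age comparison.

    node is either a leaf (name, time) or a pair (left, right) of sub-nodes.
    The older element of each comparison moves up to the next node, and the
    element it was compared with becomes one of its branches.
    """
    if isinstance(node[0], str):  # leaf: (name, time)
        return Placed(node[0], node[1])
    a, b = arrange(node[0]), arrange(node[1])
    older, newer = (a, b) if a.time <= b.time else (b, a)
    older.branches.append(newer)
    return older


# Hypothetical cluster: (A, B) and (C, D) are joined first, then the two pairs.
tree = ((("A", 2003), ("B", 2001)), (("C", 2002), ("D", 2005)))
head = arrange(tree)
```

- Here B is the oldest element and is placed at the head; it was compared directly with A and with C, so it gets two branches, and C in turn carries D because C won its own lower-level comparison.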
- The intra-cluster arrangement means desirably extracts the one or more oldest elements in the cluster and places them at the head; forms, for the remaining document elements, a time-ordered array for each classification assigned to those document elements; connects each time-ordered array whose classification is shared by one of the oldest elements to that oldest element of the same classification; and, for each time-ordered array whose classification is not shared by any of the oldest elements, selects from the cluster the document element most similar to the oldest element of that array and connects the array to it, thereby determining the arrangement within the cluster.
- Determining the intra-cluster arrangement with the classification information taken into account makes it possible to handle elements that have the same time data.
- Each of the document correlation diagram creation devices described above desirably further includes time slice classification means and inter-slice connection means, wherein the time slice classification means classifies the plurality of document elements into a plurality of time slices based on the time data of each document element, the tree diagram creating means creates a tree diagram showing the correlation of the document element group belonging to each time slice, the clustering means cuts the tree diagram of each time slice based on a predetermined rule to extract clusters, and the inter-slice connection means connects the clusters belonging to different time slices. By performing the time slice classification first in this way, it is possible to represent the relationships between contemporary documents of different classifications, and also the relationships between documents of the same field in different periods.
- For the connection between clusters by the inter-slice connection means, it is desirable to calculate the similarity between clusters based on the inter-group distance, the inter-element distance between the oldest element and the shortest-distance element of the time-front group, or the like, and to connect the clusters accordingly.
- The connection between clusters by the inter-slice connection means is desirably made between elements belonging to the two connected clusters (between the oldest element of the time-rear group and the latest element of the time-front group, or between the oldest element of the time-rear group and the shortest-distance element of the time-front group).
- In another aspect, the document correlation diagram creation apparatus comprises: extraction means for extracting, for a plurality of document elements, the content data and time data of document elements each composed of one or a plurality of documents; time slice classification means for classifying the plurality of document elements into a plurality of time slices based on the time data of each document element; clustering means for extracting clusters from each time slice based on the content data of each document element belonging to that time slice; and inter-slice connection means for connecting clusters belonging to different time slices.
- By first dividing the documents into time slices, it is possible to represent the relationships between contemporary documents of different classifications, and also the relationships between documents of the same field in different time periods.
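- The time slice classification step can be sketched as follows. The function name, the use of years as time data, and the idea of specifying slices by a sorted list of boundary values are assumptions made for illustration; the specification does not state how slice boundaries are chosen.

```python
from bisect import bisect_right


def classify_into_slices(elements, boundaries):
    """Split document elements into time slices.

    elements: {name: time}.
    boundaries: sorted times at which a new slice starts, so that
    len(boundaries) + 1 slices are produced, ordered from oldest to newest.
    """
    slices = [[] for _ in range(len(boundaries) + 1)]
    for name, t in sorted(elements.items(), key=lambda item: item[1]):
        slices[bisect_right(boundaries, t)].append(name)
    return slices


# Hypothetical document elements with their filing years.
elements = {"E1": 1996, "E2": 2001, "E3": 1999, "E4": 2004, "E5": 2000}
slices = classify_into_slices(elements, boundaries=[2000, 2003])
```

- Clusters would then be extracted separately inside each slice, and clusters from neighbouring slices connected to each other afterwards.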
- The cluster extraction by the clustering means is preferably performed by cutting a tree diagram, but it is not limited to this; cluster extraction using the known k-means method or the like may also be used.
- The arrangement of document elements within each cluster may be determined based on the time data of the document elements, or the elements may simply be arranged in parallel, for example, without reference to the time data.
- For the connection between clusters by the inter-slice connection means, it is desirable to calculate the similarity between clusters based on the inter-group distance, the inter-element distance between the oldest element and the shortest-distance element of the time-front group, or the like, and to connect the clusters accordingly.
- The connection between clusters by the inter-slice connection means is desirably made between elements belonging to the two connected clusters (between the oldest element of the time-rear group and the latest element of the time-front group, or between the oldest element of the time-rear group and the shortest-distance element of the time-front group).
- The present invention is also a document correlation diagram creation method including the same steps as the method executed by each of the above apparatuses, and a document correlation diagram creation program that causes a computer to execute the same processing as that executed by each of the above apparatuses.
- This program may be recorded on a recording medium such as an FD, CD-ROM, or DVD, or sent and received over a network.
- FIG. 1 is a diagram showing a hardware configuration of a document correlation diagram creating apparatus according to an embodiment of the present invention.
- FIG. 2 is a diagram for explaining in detail the configuration and functions of the document correlation diagram creation device described above, in particular, the processing device 1 and the recording device 3.
- FIG. 3 is a flowchart showing the operation procedure of processing device 1 in the document correlation diagram creation device.
- FIG. 5 is a flowchart for explaining a cluster extraction process in the first embodiment.
- FIG. 6 is a diagram showing a tree diagram arrangement example in the cluster extraction process in the first embodiment.
- FIG. 7 is a diagram showing a specific example of a document correlation diagram generated by the method of the first embodiment.
- FIG. 8 is a flowchart explaining the cluster extraction process in Example 2 (codimension descent method; CR method).
- FIG. 9 is a diagram showing a tree diagram arrangement example in the cluster extraction process in the second embodiment.
- FIG. 11 is a flowchart explaining the cluster extraction process in Example 3 (cell division method; CD method).
- FIG. 14 is a diagram showing another specific example of the document correlation diagram generated by the method of the third embodiment.
- FIG. 15 is a flowchart for explaining a cluster extraction process in Example 4 (stepwise cutting method; SC method).
- FIG. 16 is a diagram showing a tree diagram arrangement example in the cluster extraction process in the fourth embodiment.
- FIG. 20 is a diagram showing a part of a tree diagram arrangement example in the cluster extraction process in the fifth embodiment.
- FIG. 26 is a diagram showing a specific example (3000 documents) of the document correlation diagram generated by the method according to the second modification of the fifth embodiment.
- FIG. 27 is a diagram showing a specific example (300 documents) of the document correlation diagram generated by the method of the second modification of the fifth embodiment.
- FIG. 28 is a diagram showing a part of another display example in the document correlation diagram of FIG.
- FIG. 29 is a view showing a part of still another display example in the document correlation diagram of FIG. 26.
- FIG. 30 is a flowchart illustrating an intra-cluster arrangement process in Example 6 (pole-and-line fishing arrangement; PLA).
- FIG. 31 is a diagram showing a tree diagram arrangement example in the intra-cluster arrangement process in the sixth embodiment.
- FIG. 32 is a flowchart for explaining the intra-cluster arrangement process in Example 7 (group time order; GTO).
- FIG. 33 is a diagram showing a part of a tree diagram arrangement example in an intra-cluster arrangement process in Example 7.
- FIG. 34 is a diagram for explaining in more detail the configuration and functions of the document correlation diagram creation device of Example 8 (time cross-sectional analysis; TSA).
- FIG. 35 is a flowchart for explaining a document correlation diagram creation process in the eighth embodiment.
- FIG. 36 is a diagram showing a tree diagram arrangement example in the document correlation diagram creation process in the eighth embodiment.
- FIG. 37 is a diagram showing a first specific example of a document correlation diagram generated by the method of Embodiment 8 and a generation process thereof.
- FIG. 38 is a diagram showing a second specific example of a document correlation diagram generated by the method of Embodiment 8 and the generation process thereof.
- FIG. 39 is a diagram showing a third specific example of a document correlation diagram generated by the method of Embodiment 8 and a generation process thereof.
- FIG. 40 is a diagram showing a fourth specific example of a document correlation diagram generated by the method of Embodiment 8 and the generation process thereof.
- E: Document element
- α: Cutting height
- c: Node
- n: Slice number
- G: Group
- Best mode for carrying out the invention
- Document element (E, or E1 to EN): Each of the units that constitute the document group to be analyzed and that serve as the unit of analysis in the present invention. Each document element consists of one or more documents.
- Document element group: A plurality of document elements.
- Similarity: The similarity or dissimilarity between the items being compared, that is, between two document elements, between a document element and a document element group, or between two document element groups.
- Tree diagram: A diagram in which the document elements constituting the document group to be analyzed are connected in a tree shape.
- Dendrogram: A tree diagram generated by hierarchical cluster analysis. Its principle of creation is briefly as follows. First, based on the dissimilarity (similarity) between the document elements that make up the document group to be analyzed, the document elements with the lowest dissimilarity (highest similarity) are combined to form a combined object. Then a combined object and another document element, or two combined objects, are combined in order of increasing dissimilarity to generate new combined objects. The result is expressed as a hierarchical structure.
- Index word: A word cut out from all or part of a document. There is no particular restriction on how words are extracted: a conventionally known method may be used, or, for Japanese documents for example, commercially available morphological analysis software may be used to extract meaningful parts of speech while excluding particles and conjunctions. Alternatively, a database of an index word dictionary (thesaurus) may be stored in advance and index words obtained from that database may be used.
- d: Height of the coupling position (coupling distance) in the tree diagram between two document elements, between two document element groups, or between a document element and a document element group.
- N: Number of document elements to be analyzed.
- Document element time data: For patent documents, for example, the filing date, the publication date, the registration date, or the priority date can be used. If the application numbers, publication numbers, and so on of the patent documents follow the order of filing or publication, these numbers can also be used as time data. If a document element consists of multiple documents, the average value, median value, or the like of the time data of the documents making up the document element is obtained and used as the time data of the document element.
- DF(P): Document frequency, in the entire document set P serving as the population, of an index word of document element E. Document frequency is the number of documents hit when a set of documents is searched with a given index word. As the entire document set P, for the analysis of patent documents, for example, the approximately 4 million published patent publications and registered utility model publications issued in Japan over the past 10 years are used.
- TF*IDF(P): The product of TF(E) and the logarithm of {the reciprocal of DF(P) × the total number of documents}. Calculated for each index word in the document. When document element E consists of multiple documents, it is equivalent to GF(E)*IDF(P).
- GFIDF(E): GF(E) / DF(E) when document element E consists of multiple documents. Calculated for each index word in the document.
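- The TF*IDF(P) weight defined above can be sketched as follows, assuming the definition is read as TF(E) × log(total number of documents / DF(P)). The function name, the sample frequencies, and the choice of the natural logarithm are assumptions; the definition above does not state the base of the logarithm.

```python
import math


def tf_idf_p(tf, df_p, total_docs):
    """Weight each index word by TF(E) * log(total_docs / DF(P)).

    tf: {index_word: frequency of the word in document element E}.
    df_p: {index_word: number of documents in population P containing it}.
    """
    return {word: freq * math.log(total_docs / df_p[word]) for word, freq in tf.items()}


# Hypothetical frequencies; the population size follows the roughly
# 4 million publications mentioned for the entire document set P.
tf = {"dendrogram": 3, "document": 5}
df_p = {"dendrogram": 400, "document": 2_000_000}
weights = tf_idf_p(tf, df_p, total_docs=4_000_000)
```

- A word that is rare in the population receives a large weight even if it occurs only a few times in the document element, while a word found in half of all documents contributes little.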
- FIG. 1 is a diagram showing a hardware configuration of a document correlation diagram creating apparatus according to an embodiment of the present invention.
- The document correlation diagram creation device of the present embodiment consists of a processing device 1 composed of a CPU (central processing unit), memory (recording device), and the like; an input device 2, which is input means such as a keyboard (manual input device); a recording device 3, which is recording means for storing document data, conditions, and the work results of the processing device 1; and an output device 4, which is output means for displaying or printing the created document correlation diagram.
- FIG. 2 is a diagram for explaining in detail the configuration and function of the above-described document correlation diagram creation device, particularly the processing device 1 and the recording device 3.
- The processing device 1 includes a document reading unit 10, a time data extraction unit 20, an index word data extraction unit 30, a similarity calculation unit 40, a tree diagram creating unit 50, a cutting condition reading unit 60, a cluster extraction unit 70, an arrangement condition reading unit 80, and an intra-cluster element arrangement unit 90.
- the recording device 3 includes a condition recording unit 310, a work result storage unit 320, a document storage unit 330, and the like.
- the document storage unit 330 includes an external database and an internal database.
- An external database means, for example, a document database such as the Industrial Property Digital Library (IPDL) provided by the Japan Patent Office or PATOLIS provided by PATOLIS Corporation.
- The internal database includes a database that stores, in-house, data from commercially sold sources such as patent JP-ROM; devices that read from media storing documents, such as FD (flexible disk), CD-ROM (compact disc), MO (magneto-optical disk), and DVD (digital video disc); and OCR (optical information reading) devices.
- As communication means for exchanging signals and data among the processing device 1, the input device 2, the recording device 3, and the output device 4, the devices may be connected directly with a USB (Universal Serial Bus) cable or the like, data may be transmitted and received via a network such as a LAN (local area network), or data may be passed via a medium such as an FD, CD-ROM, MO, or DVD that stores documents. Some or a combination of these may also be used.
- The document reading unit 10 reads a plurality of document elements to be analyzed from the document storage unit 330 of the recording device 3 according to the reading conditions input by the input device 2.
- The read data of the document element group is either sent directly to the time data extraction unit 20 and the index word data extraction unit 30 and used for their processing, or sent to the work result storage unit 320 of the recording device 3 and stored there.
- The data sent from the document reading unit 10 to the time data extraction unit 20 and the index word data extraction unit 30, or to the work result storage unit 320, may be all of the data of the read document element group, including time data and content data. Alternatively, it may be only the bibliographic data (for example, an application number or a publication number in the case of patent documents) that identifies each document element. In the latter case, the data of each document element can be read again from the document storage unit 330 based on the bibliographic data when it is needed in subsequent processing.
- The time data extraction unit 20 extracts the time data of each element from the document element group read by the document reading unit 10.
- The extracted time data is either sent directly to the intra-cluster element arrangement unit 90 and used for processing there, or sent to the work result storage unit 320 of the recording device 3 and stored there.
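- When a document element consists of several documents, the definition of document element time data allows the average or the median of the individual documents' time data to be used. A minimal sketch of that reduction follows; the function name, the `how` parameter, the sample dates, and the rounding to whole days are assumptions made for illustration.

```python
from datetime import date
from statistics import median


def element_time(doc_dates, how="median"):
    """Reduce the dates of an element's documents to a single date.

    how: "median" for the median date, anything else for the average date.
    Dates are converted to day ordinals so that they can be averaged.
    """
    ordinals = [d.toordinal() for d in doc_dates]
    if how == "median":
        value = median(ordinals)
    else:
        value = sum(ordinals) / len(ordinals)
    return date.fromordinal(round(value))


# Hypothetical filing dates of three documents forming one document element.
dates = [date(2001, 1, 1), date(2001, 1, 11), date(2001, 3, 1)]
median_time = element_time(dates)
mean_time = element_time(dates, how="mean")
```

- The median is less affected by a single document filed much later than the others, which here pulls the average almost two weeks past the median.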
- the index word data extraction unit 30 extracts index word data, which is content data of each document element, from the document element group read by the document reading unit 10.
- the index word data extracted from each document element is directly sent to the similarity calculation unit 40 and used for processing there, or sent to the work result storage unit 320 of the recording device 3 and stored therein.
- the similarity calculation unit 40 calculates the similarity (or dissimilarity) between document elements based on the index word data of each document element extracted by the index word data extraction unit 30. This similarity calculation is based on the conditions input from the input device 2 and is used to calculate the similarity. The module is called from the condition recording unit 310 and executed. The calculated similarity is sent directly to the cage diagram creation unit 50 and used for processing there, or sent to the work result storage unit 320 of the recording apparatus 3 and stored therein.
- The dendrogram creation unit 50 creates a dendrogram of the document element group to be analyzed, based on the similarity calculated by the similarity calculation unit 40 and in accordance with the dendrogram creation conditions input by the input device 2.
- The created dendrogram is sent to and stored in the work result storage unit 320 of the recording device 3.
- The dendrogram can be stored, for example, as the coordinate values of each document element placed on a two-dimensional coordinate plane together with the coordinate values of the start and end points of each connecting line joining them, or as data indicating which document elements are combined and the position of each combination.
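- The second storage format (which elements are combined, and where) can be illustrated with a small sketch. The `Merge` record, the node numbering scheme, and the helper `leaves_under` are illustrative assumptions and are not taken from the specification.

```python
from dataclasses import dataclass

@dataclass
class Merge:
    """One combination in the dendrogram (field names are hypothetical)."""
    left: int      # id of the first combined node
    right: int     # id of the second combined node
    height: float  # combination position (coupling distance)

def leaves_under(node, n_leaves, merges):
    """Return the document element ids below a node.

    Assumed numbering: leaves are 0..n_leaves-1 and the k-th merge
    creates node n_leaves + k.
    """
    if node < n_leaves:
        return [node]
    m = merges[node - n_leaves]
    return (leaves_under(m.left, n_leaves, merges)
            + leaves_under(m.right, n_leaves, merges))

# Four document elements combined in three steps.
merges = [Merge(0, 1, 0.2), Merge(2, 3, 0.5), Merge(4, 5, 0.9)]
print(leaves_under(6, 4, merges))
```

Storing only the combinations keeps the data independent of any drawing layout; coordinates can be derived later when the diagram is output.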
- The cutting condition reading unit 60 reads the dendrogram cutting condition input by the input device 2 and recorded in the condition recording unit 310 of the recording device 3. The read cutting condition is sent to the cluster extraction unit 70.
- The cluster extraction unit 70 reads the dendrogram created by the dendrogram creation unit 50 from the work result storage unit 320 of the recording device 3, cuts the dendrogram based on the cutting condition read by the cutting condition reading unit 60, and extracts clusters. Data on the extracted clusters is sent to and stored in the work result storage unit 320 of the recording device 3.
- the cluster data includes, for example, information for specifying document elements belonging to each cluster and connection information between the clusters.
- The arrangement condition reading unit 80 reads the intra-cluster document element arrangement conditions input by the input device 2 and recorded in the condition recording unit 310 of the recording device 3. The read arrangement conditions are sent to the intra-cluster element arrangement unit 90.
- The intra-cluster element arrangement unit 90 reads the cluster data extracted by the cluster extraction unit 70 from the work result storage unit 320 of the recording device 3, and determines the arrangement of the document elements in each cluster based on the arrangement conditions read by the arrangement condition reading unit 80. Determining the arrangement within each cluster completes the document correlation diagram of the present invention. This document correlation diagram is sent to and stored in the work result storage unit 320 of the recording device 3, and is output by the output device 4 as necessary.
- the condition recording unit 310 records information such as conditions obtained from the input device 2, and sends necessary data based on a request from the processing device 1.
- the work result storage unit 320 stores the work result of each component in the processing device 1 and sends necessary data based on the request of the processing device 1.
- the document storage unit 330 stores and provides necessary document data obtained from the external database or the internal database based on the request of the input device 2 or the processing device 1.
- the output device 4 in FIG. 2 outputs the document correlation diagram created by the intra-cluster element placement unit 90 of the processing device 1 and stored in the work result storage unit 320 of the recording device 3.
- Examples of the output form include display on a display device, printing on a print medium such as paper, or transmission to a computer device on a network via a communication unit.
- FIG. 3 is a flowchart showing the operation procedure of the processing apparatus 1 in the document correlation diagram creation apparatus.
- the document reading unit 10 reads a plurality of document elements to be analyzed from the document storage unit 330 of the recording device 3 in accordance with the reading conditions input by the input device 2 (step S10).
- The document element group to be analyzed may be, for example, a document group selected from all patent documents in descending order of similarity (ascending order of dissimilarity) to a given patent document, a document group selected by a search on a certain theme such as a specific keyword (International Patent Classification, technical term, applicant, inventor, etc.), or a document group selected by other methods.
- the time data extraction unit 20 extracts time data of each element from the document element group read in the document reading step S10 (step S20).
- the index word data extraction unit 30 extracts index word data, which is the content data of each document element, from the document element group read in the document reading step S10 (step S30).
- The index word data of each document element can be expressed, for example, as a multidimensional vector whose components are function values of the number of occurrences in the document element of each index word extracted from the document element E (the index word frequency TF(E), or the global frequency GF(E) when each element E consists of multiple documents).
- the content data of document elements is not limited to index word data, but data such as the International Patent Classification (IPC), applicants, and inventors can also be used.
- the similarity calculation unit 40 calculates the similarity (or dissimilarity) between the document elements based on the index word data of each document element extracted in the index word data extraction step S30 ( Step S40).
- From the index word frequency TF(E) in document element E and statistics over the whole document set P (such as the number of documents in P), TF*IDF(P) is calculated for each index word of each document element, giving a vector representation of each document element.
- For document element vectors E_1 and E_2, the similarity (or dissimilarity) between E_1 and E_2 is then obtained, for example as the cosine between the vectors.
- The larger the cosine (similarity) between the vectors, the higher the degree of similarity.
- When each document element E consists of a single document (micro element), it is preferable to use, for example, TF*IDF(P) of each index word as the component of the document element vector. When each document element E consists of multiple documents (macro element), it is preferable to use, for example, GFIDF(E) or GF(E)*IDF(P) as the component of the document group vector that represents each document element. Other parameters may also be used as components of the document element vector.
- the similarity may be defined not only by the vector space method but also by using other methods.
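- As a worked illustration of the vector space method for micro elements, the following minimal sketch builds TF*IDF vectors and compares them by cosine. The IDF form ln(N/DF) and the function names `tfidf_vectors` and `cosine` are assumptions made for this example; the specification does not fix them.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of index word lists, one per document element.

    Returns one {index_word: weight} dict per element.
    Assumption: IDF = ln(N / DF), with N the number of documents.
    """
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    """Cosine between two sparse vectors; a larger value means more similar."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    ["sake", "brewing", "yeast"],
    ["sake", "brewing", "rice"],
    ["bottle", "cap", "rice"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[1], vecs[2]), cosine(vecs[0], vecs[2]))
```

A dissimilarity for dendrogram creation can then be taken as, for example, 1 minus the cosine, which is again a choice made for illustration.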
- Next, the dendrogram creation unit 50 creates a dendrogram of the document element group to be analyzed, based on the similarity calculated in the similarity calculation step S40 and in accordance with the dendrogram creation conditions input by the input device 2 (step S50). It is desirable to create a dendrogram in which the dissimilarity (or similarity) between document elements is reflected in the height of the coupling position (coupling distance).
- As a specific method for creating the dendrogram, a known method such as Ward's method is used.
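- The creation of a dendrogram whose coupling heights reflect dissimilarity can be sketched as follows. For brevity this sketch uses average linkage rather than Ward's method, and the function name `build_dendrogram` and the node numbering are assumptions for illustration only.

```python
def build_dendrogram(dist):
    """dist: symmetric matrix (list of lists) of dissimilarities.

    Returns merges as (node_a, node_b, height). Leaves are 0..n-1 and the
    k-th merge creates node n + k. Average linkage is swapped in here for
    Ward's method to keep the example short.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}  # active node -> leaf ids below it
    merges = []
    next_id = n
    while len(members) > 1:
        best = None
        ids = sorted(members)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                pairs = [(p, q) for p in members[a] for q in members[b]]
                d = sum(dist[p][q] for p, q in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return merges

dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
print(build_dendrogram(dist))
```

In practice a library routine for hierarchical clustering would normally replace this quadratic search; the sketch only shows where the coupling height of each combination comes from.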
- Next, the cutting condition reading unit 60 reads the dendrogram cutting condition input by the input device 2 and recorded in the condition recording unit 310 of the recording device 3 (step S60).
- Next, the cluster extraction unit 70 cuts the dendrogram created in the dendrogram creation step S50 based on the cutting condition and extracts clusters (step S70).
- Next, the arrangement condition reading unit 80 reads the intra-cluster document element arrangement conditions input by the input device 2 and recorded in the condition recording unit 310 of the recording device 3 (step S80).
- Next, the intra-cluster element arrangement unit 90 determines the arrangement of the document elements in each cluster extracted in the cluster extraction step S70, based on the arrangement conditions read in the arrangement condition reading step S80 (step S90).
- With this, the document correlation diagram of the present invention is completed.
- The arrangement conditions may be common to all clusters. In that case, once step S80 has been executed for one cluster, it need not be executed again for the other clusters.
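- Steps S70 and S90 can be sketched together: cut the dendrogram at a given height, then order the elements of each cluster by time. The merge-list format, the function name `cut_and_arrange`, and the use of ISO date strings as time data are assumptions for this illustration.

```python
def cut_and_arrange(n, merges, cut_height, dates):
    """Cut a dendrogram and order each cluster from oldest to newest.

    merges: (node_a, node_b, height); leaves are 0..n-1 and the k-th merge
            creates node n + k (assumed format).
    dates:  one sortable time value per document element.
    """
    parent = list(range(n + len(merges)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for k, (a, b, height) in enumerate(merges):
        if height <= cut_height:  # combinations below the cut stay joined
            parent[find(a)] = n + k
            parent[find(b)] = n + k

    clusters = {}
    for leaf in range(n):
        clusters.setdefault(find(leaf), []).append(leaf)
    return sorted(sorted(c, key=lambda e: dates[e]) for c in clusters.values())

merges = [(0, 1, 1.0), (2, 3, 2.0), (4, 5, 4.5)]
dates = ["2003-05-01", "2001-02-10", "2004-07-07", "2002-11-30"]
print(cut_and_arrange(4, merges, 3.0, dates))
print(cut_and_arrange(4, merges, 5.0, dates))
```

The same arrangement rule (oldest first) is applied to every cluster here, which corresponds to the case where the arrangement conditions are common to all clusters.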
- In the following, Examples 1 to 5, which relate to the process of cutting the dendrogram and extracting clusters (mainly corresponding to step S70 in FIG. 3), are described first, followed by Examples 6 to 8, which relate to the process of determining the arrangement based on time data (mainly corresponding to step S90 in FIG. 3).
- Examples 1 to 5 relating to the cluster extraction process and Examples 6 to 8 relating to the time arrangement process can be arbitrarily combined with each other.
- <Example 1 (Balance Cutting method; BC method)> In the Balance Cutting method, an association rule is used to determine the cutting position of the dendrogram.
- A number of existing teacher dendrograms (dendrograms whose ideal cutting positions, for providing a document correlation diagram arranged based on time data, are known) are analyzed in advance, and rules that select the ideal cutting position as often as possible (association rules) are obtained as conditional expressions on various dendrogram parameters. This analysis is called association rule analysis.
- The association rules found in this way are applied to the dendrogram to be analyzed, and the cutting position is determined.
- Let P(A) and P(B) be the probabilities that two events A and B each occur. The probability that event B (the consequence event) occurs given that event A (the premise event) has occurred is the conditional probability, written P(B|A). P(A) is called the “premise probability”, P(B) the “prior probability”, and P(B|A) the “posterior probability”.
- A pair of events selected according to the following criteria (1) to (3) is called an “association rule” A → B. It expresses the regularity that “if event A occurs, event B occurs (with a probability greater than a certain value)”.
- Here, a high probability means a value equal to or greater than a certain threshold.
- The threshold for the posterior probability P(B|A) is called the “confidence” and is set to about 60 to 70%, for example.
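- The probabilities named above can be estimated from teacher cases by simple counting. The following sketch covers only the confidence criterion on the posterior probability; the observation format and the names `rule_statistics` and `accept_rule` are assumptions for illustration, and the remaining selection criteria are not modeled.

```python
def rule_statistics(observations):
    """observations: list of (a_occurred, b_occurred) booleans,
    one pair per teacher case (hypothetical input format)."""
    n = len(observations)
    n_a = sum(1 for a, _ in observations if a)
    n_b = sum(1 for _, b in observations if b)
    n_ab = sum(1 for a, b in observations if a and b)
    return {
        "premise": n_a / n,                       # P(A)
        "prior": n_b / n,                         # P(B)
        "posterior": n_ab / n_a if n_a else 0.0,  # P(B|A)
    }

def accept_rule(stats, confidence=0.6):
    """Accept A -> B when the posterior probability reaches the confidence."""
    return stats["posterior"] >= confidence

observations = ([(True, True)] * 7 + [(True, False)] * 3
                + [(False, True)] * 2 + [(False, False)] * 8)
stats = rule_statistics(observations)
print(stats, accept_rule(stats))
```

With these counts, event A occurs in 10 of 20 cases and B follows in 7 of those 10, so the rule passes a confidence of 60% but would fail one of 80%.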
- The algorithm for calculating association rules is known. Its application to deriving the association rules for determining the dendrogram cutting position in the present invention is described in 4-1-2 and 4-1-3.
- FIG. 4 is an explanatory diagram of parameters used in the association rule analysis performed in the first embodiment.
- To derive the association rules, the parameters of the teacher dendrograms are read first. For example, the following parameters are read from the geometric shape of each teacher dendrogram.
- When the association rules are applied to the dendrogram to be analyzed, the same parameters must be read from that dendrogram.
- Midpoint distance m The height of the two-body bond (initial bond) is h,
- Δh_i = h_i − h_(i−1), where the subscript i is the join level (a number obtained by adding 1 at each step, with the initial coupling counted as 0).
- Foundation ⁇ h> Average value of height h of two-body bond. That is, the two-body bond is
- Cluster area s (not shown): Sum of initial coupling heights of all elements
- In addition, a function including one or both of the average value and the deviation of the coupling height d as a variable can be used.
- For example, the bond height average value ⟨d⟩ can be used, and instead of the base ⟨h⟩, ⟨d⟩ − σ_d or ⟨d⟩ − 2σ_d can be used, where σ_d is the standard deviation of the bond height d.
- The ratio “cluster area s / dendrogram area S” is the cluster density s/S, and the ratio “base ⟨h⟩ / midpoint distance m” is the base ratio ⟨h⟩/m.
- The s/S condition is combined with the ⟨h⟩/m condition in order to avoid misjudgment.
- conditional bifurcation Since the number of hits is as small as five, re-analyze 16 cases with m ⁇ 0.9 with different conditional branches. Since the purpose of reanalysis is to derive a conditional expression for low density or low density, we consider conditional bifurcation by height and density.
- The ratio “midpoint distance m / final bond height H” is defined as the high-rise degree, and cases are divided into m/H ≥ 0.55 (high-rise type) and m/H < 0.55 (lower cluster type).
- the cluster density sZS and the base ratio ⁇ h> Zm are high according to the above equation 1.
- the probability of the optimal solution is compared before (28 cases) and after (16 cases) with the condition m ⁇ 0.9.
- the probability of the optimal solution is compared before and after narrowing down under the condition m ⁇ 0.9 depending on the cluster density sZS.
- An expression can be derived.
- A F (m / H, 0.4; F ( ⁇ h> / m, 0.4; ⁇ , a), F (s / S, 0.4; a, a ⁇ ⁇ 0 0 1 0 0
- ⁇ (X) is a function that returns 1 when the proposition X is true and 0 otherwise. That is, F (X, y; y, z) is a function that returns y when X and y, and z when x ⁇ y.
- The association rules derived in this way are stored in the condition recording unit 310 of the recording device 3 in accordance with input from the input device 2 or the like. Since the association rules depend on the teacher dendrograms, a different association rule can be derived if, for example, the teacher dendrograms are updated to match the number of elements in the dendrogram to be analyzed and the association rule analysis is performed again.
- FIG. 5 is a flowchart explaining the cluster extraction process in Example 1 (balance cutting method; BC method). This flowchart shows the procedure of Example 1 in more detail than FIG. 3. Steps similar to those in FIG. 3 are given the step number of FIG. 3 plus 100, so that the last two digits match the step numbers in FIG. 3, and descriptions that duplicate FIG. 3 may be omitted.
- FIG. 6 is a diagram showing an example of dendrogram arrangement in the cluster extraction process of Example 1, and supplements FIG. 5. The symbols E (with subscripts) represent document elements.
- the document reading unit 10 of the processing device 1 reads a plurality of document elements to be analyzed from the document storage unit 330 of the recording device 3 (step S 110).
- time data extraction unit 20 of the processing device 1 extracts time data from each document element of the document group to be analyzed (step S 120).
- The index word data extraction unit 30 of the processing device 1 extracts index word data from each document element of the document group to be analyzed (step S130). As described later, the index word data of the oldest element (oldest document element) E in the document group is not needed, so it is preferable to extract index word data only for elements other than the oldest element, based on the time data extracted in step S120.
- the similarity calculation unit 40 of the processing device 1 calculates the similarity between the document elements (step S140). At this time, only the similarity between the elements other than the oldest element is calculated as described above.
- The dendrogram creation unit 50 of the processing device 1 creates a dendrogram consisting of the document elements of the document group to be analyzed (step S150: FIG. 6(A)). At this time, the oldest element E is placed at the top of the dendrogram regardless of its similarity to the other elements.
- the cutting condition reading unit 60 of the processing apparatus 1 reads the cutting condition (step S160).
- Here, the conditions for reading the dendrogram parameters and the association rules derived by the association rule analysis described above are read out.
- Next, the cluster extraction unit 70 performs cluster extraction.
- First, the parameters of the dendrogram are read in accordance with the parameter reading conditions (step S171).
- Next, the association rule is applied to these parameters to determine the cutting height α of the dendrogram (step S172: FIG. 6(B)).
- The dendrogram is then cut at this height and clusters are extracted (step S173).
- As many branch lines as the number of clusters extracted here are drawn from the first element E (see FIG. 6(C)).
- Next, the number of document elements in each cluster is counted (step S174). For each cluster with more than three document elements, the oldest element E of the cluster is excluded and placed at the head of the cluster, and a partial dendrogram is created from the remaining elements in the cluster (step S175: see FIG. 6).
- The partial dendrogram created at this time has almost the same structure as the part corresponding to that cluster in the dendrogram first created in step S150, except that the oldest element E of the cluster is excluded. However, because the oldest element E of the cluster is excluded, the distances between element groups in the cluster change. The structure formed by the remaining elements in the cluster may therefore differ slightly from the dendrogram created in step S150.
- the distance between centroids or the average of all distances as the distance between document elements and document elements (dissimilarity) or the distance between document elements and document elements (dissimilarity) Is used to create a saddle diagram, the distance between elements E and E and element E in Fig. 6 (B)
- For each cluster for which a partial dendrogram has been created, the process returns to step S171, the parameters of the partial dendrogram are read, and the cutting height α is determined in step S172 (FIG. 6(D)).
- Because the parameters of the partial dendrogram differ from those of the first dendrogram, the cutting height α changes even when the same association rule is applied. Cutting at the new cutting height is executed in step S173 to extract descendant clusters. It is preferable, however, to apply a different association rule to the partial dendrogram rather than reusing the association rule applied to the first dendrogram.
- Such an association rule is preferably derived by association rule analysis based on teacher dendrograms having about the same number of elements as the (partial) dendrogram to which it is applied.
- The intra-cluster element arrangement unit 90 determines the arrangement of the document element group in each cluster based on the time data of each document element (step S190: FIG. 6(E)).
- The arrangement condition in this case is preferably, for example, a single line ordered from oldest to newest based on the time data, but other arrangement conditions, such as the arrangements in Examples 6 to 8 described later, may be used.
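- The overall flow of Example 1 (place the oldest element at the head, cluster the rest, cut, and repeat for clusters above the size threshold) can be sketched recursively. This is a simplified illustration under several assumptions: single-linkage cutting stands in for the dendrogram creation, and the callback `choose_cut_height` is a hypothetical stand-in for applying an association rule to the dendrogram parameters.

```python
def single_link_clusters(ids, dist, cut_height):
    """Group ids into the clusters obtained by cutting a single-linkage
    dendrogram at cut_height (connected components of the threshold graph)."""
    clusters = []
    for e in ids:
        linked = [c for c in clusters
                  if any(dist[e][o] <= cut_height for o in c)]
        merged = [e] + [o for c in linked for o in c]
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters

def arrange(ids, dist, dates, choose_cut_height, threshold=3):
    """Return an oldest-first chain for small groups, or
    [oldest, [branch, ...]] for groups larger than `threshold`."""
    ordered = sorted(ids, key=lambda e: dates[e])
    if len(ordered) <= threshold:
        return ordered
    oldest, rest = ordered[0], ordered[1:]
    alpha = choose_cut_height(rest, dist)  # hypothetical association-rule step
    branches = [arrange(c, dist, dates, choose_cut_height, threshold)
                for c in single_link_clusters(rest, dist, alpha)]
    return [oldest, branches]

# Toy data: six elements on a line, dated in index order.
values = [0.0, 1.0, 1.2, 5.0, 5.1, 5.3]
dist = [[abs(a - b) for b in values] for a in values]
dates = [2000, 2001, 2002, 2003, 2004, 2005]
result = arrange(list(range(6)), dist, dates, lambda ids, d: 1.0)
print(result)
```

Because the oldest element is removed before each re-clustering, every recursive call works on a strictly smaller group, so the procedure terminates.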
- FIG. 7 is a diagram showing a specific example of the document correlation diagram generated by the method of the first embodiment.
- Seventeen Japanese patent applications related to sake, extracted by keyword search, were analyzed as document elements, and the patent application number and the title of the invention are shown for each document element in the document correlation diagram.
- In this example, a single cut left every cluster with a number of elements at or below the threshold (3), so the variable BC method and the fixed BC method yielded the same output.
- According to Example 1, by extracting clusters through cutting the dendrogram and determining the intra-cluster arrangement based on the time data, a dendrogram that appropriately represents the temporal development of each field can be created.
- Since the cutting rules for the dendrogram are derived by association rule analysis, highly versatile cutting rules that can be applied to various dendrograms can be used, and an appropriate cut can be achieved with high probability.
- In particular, since the association rules are derived from the shape parameters of the teacher dendrograms, highly reliable cutting rules that determine an appropriate cutting position according to the shape of the dendrogram can be used.
- Furthermore, since the cutting position is determined simply by reading the shape parameters of the dendrogram to be analyzed and applying the association rules to them, the cutting position can be determined with a small amount of computation.
- In this example, as in Example 1 (balance cutting method; BC method), an association rule is used to determine the cutting position of the dendrogram.
- In Example 1, however, parameters read from the geometric shape of the dendrogram were used, and a coupling height between elements was used as the cutting position.
- In this example, the index word dimensions that indicate the differences between the document element vectors are used to determine the cutting position.
- Since the basic description of association rule analysis has already been given in Example 1, it is omitted here. First, the parameters used in the association rule analysis of Example 2 are described, focusing on the differences from Example 1.
- FIG. 9A shows the coupling level i (c) for each of the nodes c to c.
- D takes a value that is less than or equal to the dimension number D of the index word union set of all elements in the cage diagram, but is not included in the document element group connected at node c (zero in each document element E).
- the index word frequency TF (E) of the index word takes the same value 0 in all the document elements connected at node c.
- The extra dimension (codimension) R can also be defined as the number of dimensions obtained by subtracting, from the dimension number D of the index word union set of all elements in the dendrogram, the number of index words whose index word frequency (including 0) is the same across the document elements connected at node c.
- The dimension number D of the index word union set (for a node or for the whole dendrogram) is closely related to the size of the variation between the document elements belonging to the partial dendrogram below that node or to the whole dendrogram.
- Even when the dimension number D of the index word union set is large, if many index words share the same index word frequency TF(E) (the extra dimension R is small), the differences between the document elements are small.
- Conversely, if few index words share the same index word frequency TF(E) (the extra dimension R is large), the differences between the document elements are large.
- This property is used to determine the cutting position of the dendrogram. Whereas the parameters used in Example 1 (balance cutting method; BC method) are geometric parameters related to the shape of the dendrogram, the extra dimension can be regarded as a non-geometric parameter.
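- The extra dimension R described above can be computed directly from index word frequencies. The sketch below follows the definition "dimension number of the index word union set of all elements, minus the number of index words whose frequency (including 0) is identical across the elements joined at the node"; the function name and the dictionary input format are assumptions for illustration.

```python
def extra_dimension(all_elements, node_elements):
    """all_elements / node_elements: lists of {index_word: frequency} dicts.

    Returns R = D - (number of index words with the same frequency,
    including 0, in every element joined at the node).
    """
    vocabulary = set().union(*all_elements)  # index word union set, D = len(...)
    shared = sum(
        1 for w in vocabulary
        if len({e.get(w, 0) for e in node_elements}) == 1
    )
    return len(vocabulary) - shared

e1 = {"sake": 2, "yeast": 1}
e2 = {"sake": 2, "rice": 3}
e3 = {"bottle": 1, "cap": 1}
print(extra_dimension([e1, e2, e3], [e1, e2]))      # node joining e1 and e2
print(extra_dimension([e1, e2, e3], [e1, e2, e3]))  # node joining all three
```

For the node joining e1 and e2, the words "sake", "bottle" and "cap" have identical frequencies, so only "yeast" and "rice" contribute to R. Adding e3 to the node makes every word vary, which raises R to the full dimension number.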
- Various other parameters may be used for the association rule analysis. For example, a function including either or both of the average value and the deviation of the coupling height d as a variable can be used.
- the bond height average value ⁇ d> can be used, and instead of the base ⁇ h>, the bond height average value ⁇ d> and the standard deviation ⁇
- association rule is as follows. A description of the derivation process of the association rules is omitted.
- ⁇ (X) is a function that returns 1 when the proposition X is true and 0 otherwise.
- the association rules are stored in the condition recording unit 310 of the recording device 3 in accordance with the input from the input device 2 and the like.
- FIG. 8 is a flowchart explaining the cluster extraction process in Example 2 (codimension reduction method; CR method). This flowchart shows the procedure of Example 2 in more detail than FIG. 3. Steps similar to those in FIG. 3 are given the step number of FIG. 3 plus 200, so that the last two digits match the step numbers in FIG. 3, and descriptions that duplicate FIG. 3 may be omitted.
- FIG. 9 is a diagram showing an example of dendrogram arrangement in the cluster extraction process of Example 2, and supplements FIG. 8.
- the document reading unit 10 of the processing device 1 reads a plurality of document elements to be analyzed from the document storage unit 330 of the recording device 3 (step S210).
- the time data extraction unit 20 of the processing device 1 extracts time data from each document element of the document group to be analyzed (step S220).
- the index word data extraction unit 30 of the processing device 1 extracts index word data from each document element of the document group to be analyzed (step S230). At this time, since the index word data of the oldest element (oldest document element) E in the document group is unnecessary as described later, the index other than the oldest element is based on the time data extracted in step S220. It is preferable to extract only word data.
- the similarity calculation unit 40 of the processing device 1 calculates the similarity between the document elements (step S 240). At this time, only the similarity between the elements other than the oldest element is calculated as described above.
- The dendrogram creation unit 50 of the processing device 1 creates a dendrogram consisting of the document elements of the document group to be analyzed (step S250: FIG. 9(A)). At this time, the oldest element E is placed at the top of the dendrogram regardless of its similarity to the other elements.
- The cutting condition reading unit 60 of the processing device 1 reads the cutting condition (step S260).
- Here, the conditions for reading the dendrogram parameters and the association rules derived by the association rule analysis described above are read out.
- Next, the cluster extraction unit 70 performs cluster extraction. First, the parameters of the dendrogram are read in accordance with the parameter reading conditions (step S271). Next, the association rule is applied to these parameters, and the critical dimension D used to determine the cutting position of the dendrogram is determined (step S272).
- Next, the extra dimension R(i; c) of the node c being processed is calculated (step S273).
- The extra dimension R(i; c) is then compared with the critical dimension D (step S274). If R(i; c) > D, the node is cut (step S275), and the process proceeds to step S276. If R(i; c) ≤ D, the node is not cut and the process proceeds to step S276.
- FIG. 9(B) shows an example of the result of comparing the extra dimension R with the critical dimension D for each of the nodes.
- Nodes whose extra dimension R exceeds the critical dimension D are cut in step S275, and clusters are extracted.
- A node with a higher coupling height than another node (that is, with a greater dissimilarity between the document elements it combines) may still have an extra dimension less than or equal to the critical dimension D, in which case it is not cut. In this method, the cutting position is therefore not directly related to the coupling height in the dendrogram.
- The document element group combined by an upper node located upstream of a lower node c includes all the document elements E combined by the lower node c. The upper node therefore has an extra dimension R equal to or larger than the extra dimension R of the lower node c. Consequently, when, for example, the extra dimension R(2; c) of a lower node c is determined to exceed the critical dimension D as shown in FIG. 9(B), the extra dimension of each node above it also exceeds the critical dimension D.
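- The node-by-node comparison of Example 2 can be sketched as a top-down traversal that cuts every node whose extra dimension exceeds the critical dimension. The nested-tuple tree format and all function names are assumptions for illustration; here the extra dimension is counted as the number of index words whose frequency varies among the joined elements, which equals the union-set dimension minus the number of shared-frequency words.

```python
def leaves(node):
    """Leaf ids below a node; a leaf is any non-tuple value."""
    if not isinstance(node, tuple):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def extra_dimension(vocabulary, vectors):
    """Number of index words whose frequency differs among the vectors."""
    return sum(1 for w in vocabulary
               if len({v.get(w, 0) for v in vectors}) > 1)

def extract_clusters(node, tf, vocabulary, critical_dimension):
    """Cut every node whose extra dimension exceeds the critical dimension."""
    if not isinstance(node, tuple):
        return [[node]]
    members = leaves(node)
    r = extra_dimension(vocabulary, [tf[e] for e in members])
    if r <= critical_dimension:  # not cut: the whole subtree is one cluster
        return [members]
    return (extract_clusters(node[0], tf, vocabulary, critical_dimension)
            + extract_clusters(node[1], tf, vocabulary, critical_dimension))

tf = {
    "a": {"sake": 2, "yeast": 1},
    "b": {"sake": 2, "rice": 3},
    "c": {"bottle": 1, "cap": 1},
    "d": {"bottle": 1, "cap": 2},
}
vocabulary = set().union(*tf.values())
tree = (("a", "b"), ("c", "d"))
print(extract_clusters(tree, tf, vocabulary, 2))
print(extract_clusters(tree, tf, vocabulary, 1))
```

Stopping the descent at the first uncut node relies on the property stated above: a lower node never has a larger extra dimension than the node above it, so nothing below an uncut node would be cut either.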
- the arrangement condition reading unit 80 reads out the arrangement conditions in the cluster (step S280).
- the intra-cluster element arrangement unit 90 determines the arrangement of document element groups in the cluster based on the time data of each document element (step S290: FIG. 9 (C)).
- The arrangement condition in this case is preferably, for example, a single line ordered from oldest to newest based on the time data, but other arrangements, such as those according to Examples 6 to 8 described later, may be used.
- The index words subtracted from the number of dimensions of the index word union set to obtain the extra dimension R need not be limited to those having exactly the same index word frequency TF(E). For example, index words whose index word frequency TF(E) has a deviation smaller than a predetermined value (for example, index words for which the standard deviation of TF(E) is below a certain value) may be used.
- When each document element E consists of a plurality of documents, it is preferable to use the global frequency GF(E) instead of the index word frequency TF(E).
- In general, the index words subtracted are those for which the deviation of the vector component is smaller than a value determined by a predetermined method.
- FIG. 10 is a diagram illustrating a specific example of the document correlation diagram generated by the method of the second embodiment.
- According to Example 2, by extracting clusters through cutting the dendrogram and determining the intra-cluster arrangement based on the time data, a dendrogram that appropriately represents the temporal development of each field can be created.
- Since the cutting rules for the dendrogram are derived by association rule analysis, highly versatile cutting rules that can be applied to various dendrograms can be used, and an appropriate cut can be achieved with high probability.
- <Example 3 (Cell Division method; CD method)> In the Cell Division method, parent clusters are first extracted by cutting the dendrogram at a cutting height α determined by some method. Then, to divide each parent cluster further into child clusters, a partial dendrogram is created again using only the document elements belonging to that parent cluster. When this partial dendrogram is created, the analysis is performed after removing the index word dimensions in which the deviation of the document element vector components within the parent cluster is smaller than a value determined by a predetermined method.
- FIG. 11 is a flowchart explaining the cluster extraction process in Example 3 (cell division method; CD method). This flowchart shows the procedure of Example 3 in more detail than FIG. 3. Steps similar to those in FIG. 3 are given the step number of FIG. 3 plus 300, so that the last two digits match the step numbers in FIG. 3, and descriptions that duplicate FIG. 3 may be omitted.
- FIG. 12 is a diagram showing an example of dendrogram arrangement in the cluster extraction process of Example 3, and supplements FIG. 11.
- the document reading unit 10 of the processing device 1 reads a plurality of document elements to be analyzed from the document storage unit 330 of the recording device 3 (step S310).
- time data extraction unit 20 of the processing device 1 extracts time data from each document element of the document group to be analyzed (step S320).
- The index word data extraction unit 30 of the processing device 1 extracts index word data from each document element of the document group to be analyzed (step S330). As described later, the index word data of the oldest element (oldest document element) E in the document group is not needed, so it is preferable to extract index word data only for elements other than the oldest element, based on the time data extracted in step S320.
- the similarity calculation unit 40 of the processing device 1 calculates the similarity between the document elements (step S340). At this time, only the similarity between the elements other than the oldest element E is calculated as described above.
- The dendrogram creation unit 50 of the processing device 1 creates a dendrogram consisting of the document elements of the document group to be analyzed (step S350: FIG. 12(A)). At this time, the oldest element E is placed at the top of the dendrogram regardless of its similarity to the other elements.
- The cutting condition reading unit 60 of the processing device 1 reads the cutting condition (step S360). Here, the cutting height α, the deviation judgment threshold described later, and the like are read out.
- the cluster extraction unit 70 performs cluster extraction.
- Step S373 When the cage diagram is cut, the oldest elements E and E in each cluster are assigned to each cluster.
- Step S374 Fig. 12 (C)
- the following processing is performed for document elements other than the oldest element in each cluster.
- First, index word dimensions in which the deviation between the elements in the cluster other than the oldest element is smaller than a value determined by a predetermined method are deleted (step S375).
- document elements E, E, E, E the cluster starting with document element E in Figure 12, document elements E, E, E, E
- index values of 2 3 4 5 6 and the component values of each document element vector calculated for each index word are as shown in the following table.
- the index words w and w are judged to have a small deviation and are deleted.
- Step S376 Fig. 12 (D)
- a partial dendrogram is created using the remaining index words w, w, w, w. As a result, a branching within the cluster that differs from the branching in the dendrogram created in step S350 is obtained.
- because the index word dimensions with small deviation have been deleted, the differences in the remaining index words are emphasized. Therefore, even for the same pair of document elements, the similarity evaluated when creating the partial dendrogram in step S376 is smaller (the dissimilarity is higher) than the similarity evaluated when creating the dendrogram in step S350.
- in step S377, the number of elements in the cluster excluding the oldest element is obtained and compared with a predetermined threshold value (for example, 3).
- step S377 NO
- in step S371, the dendrogram is cut and the descendant clusters are extracted.
- the cutting height α (or α*) at this time may be the same as that used previously in step S371 (or step S373). Because deleting the index word dimensions with small deviation causes the similarity to be evaluated as lower, as described above, the dendrogram can be cut again at the same cutting height α (or α*).
- α* may be updated each time according to the heights d of the joining positions in the parent cluster to be cut (variable method), or its initial value may be used as is (fixed method).
- step S380 the arrangement condition reading unit 80 reads the arrangement conditions in the cluster. According to this arrangement condition, the intra-cluster element arrangement unit 90 determines the arrangement of document element groups in the cluster based on the time data of each document element (step S390: FIG. 12 (F)).
- I is a serial chain of document element E, document element E, and document elements E and E in time data order.
- the elements in a cluster are preferably arranged in order from the oldest based on the time data, as in this example, but other arrangement conditions, such as those in Examples 6 to 8 described later, may be used.
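The time-ordered serial chain described above can be sketched as follows. This is only an illustration: the function name `serial_chain`, the dictionary input, and the tie-break by element name are assumptions, not part of the source.

```python
from typing import Dict, List, Tuple


def serial_chain(times: Dict[str, float]) -> List[Tuple[str, str]]:
    """Link the elements of one cluster into a chain, oldest first.

    `times` maps a document element name to its time data t.
    Ties are broken by element name (an assumption made for determinism).
    """
    order = sorted(times, key=lambda name: (times[name], name))
    # Each element is connected to the next younger one.
    return list(zip(order, order[1:]))


# Hypothetical time data (filing years) for three elements of one cluster.
chain = serial_chain({"E3": 2003, "E2": 2001, "E5": 2004})
```

Each returned pair is one connecting line of the chain, drawn from the older element to the younger one.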
- the deviation determination threshold is 10% in terms of the ratio of the standard deviation to the average
- the decision threshold is preferably 0% or more and 10% or less.
- when each document element consists of multiple documents, the deviation is preferably judged to be small if the ratio of the standard deviation to the average over the document elements in the cluster is 60% or 70% or less.
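The deletion of low-deviation index word dimensions (step S375) with a threshold on the ratio of the standard deviation to the average could look roughly like the sketch below. The function name, the data layout, the use of the population standard deviation, and the sample values are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def drop_low_deviation_dims(
    vectors: Dict[str, List[float]], threshold: float = 0.10
) -> Tuple[List[int], Dict[str, List[float]]]:
    """Keep only index word dimensions whose std/mean ratio reaches `threshold`.

    `vectors` maps each document element in the cluster (oldest element
    already excluded) to its component values, one per index word.
    """
    n_dims = len(next(iter(vectors.values())))
    keep = []
    for j in range(n_dims):
        column = [v[j] for v in vectors.values()]
        avg = mean(column)
        # A dimension with zero average is treated as having no deviation.
        ratio = pstdev(column) / avg if avg else 0.0
        if ratio >= threshold:
            keep.append(j)
    reduced = {name: [v[j] for j in keep] for name, v in vectors.items()}
    return keep, reduced


# Hypothetical component values for three index words.
vectors = {
    "E3": [5.0, 1.0, 2.0],
    "E4": [5.1, 4.0, 2.0],
    "E5": [4.9, 0.5, 2.1],
    "E6": [5.0, 3.0, 1.9],
}
kept, reduced = drop_low_deviation_dims(vectors, threshold=0.10)
```

With these sample values only the second index word survives, so a partial dendrogram built from `reduced` would be driven entirely by the dimension in which the elements actually differ.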
- FIG. 13 is a diagram showing a specific example of the document correlation diagram generated by the method of the third embodiment.
- TF * IDF (P) is used as the component value of the document element vector
- the document correlation diagram is analyzed.
- the patent application number and the title of the invention were entered.
- one of the partial saddle diagrams created in step S376 was further cut to form a two-stage branch.
- FIG. 14 is a diagram showing another specific example of the document correlation diagram generated by the method of the third embodiment.
- the document groups that should belong to each of the 16 main fields are selected by keyword search.
- One document element (macro element).
- the oldest element was removed and placed at the top, a dendrogram was created from the remaining 15 elements and cut, and the branch structure shown in the figure was obtained.
- the average value of the filing date is used as the time data t for each document element
- GFIDF (E) is used as the component value of the document element vector
- 70% is used as the deviation determination threshold.
- in the document correlation diagram, keywords that characterize the above 16 fields were entered.
- according to the third embodiment, a dendrogram that appropriately represents the temporal development of each field can be created by performing cluster extraction by cutting the dendrogram and determining the intra-cluster arrangement based on time data.
- the child cluster is extracted from the partial tree diagram created by reanalyzing each parent cluster, so that the misclassification of the child cluster can be improved and an appropriate classification can be obtained.
- the vector components in which the deviation between the document elements belonging to each parent cluster takes a value smaller than the value determined by a predetermined method is removed.
- for example, when index words related to solvents, which have small deviations, are removed in each parent cluster, the differences between pigments are emphasized, and the cluster is broadly divided into a group using organic pigments and a group using inorganic pigments.
- if index words with small deviations were not removed in each parent cluster, the classification related to solvents and the classification related to pigments could compete with each other, and appropriate child clusters might not be obtained. By emphasizing the differences within the cluster, an appropriate classification into descendant clusters can be obtained.
- the dendrogram is cut at two or more cutting heights (fixed values), and the parent clusters and the descendant clusters are extracted.
- FIG. 15 is a flowchart explaining the cluster extraction process in Example 4 (stepwise cutting method; SC method).
- This flowchart shows the procedure of the fourth embodiment in more detail than FIG.
- steps similar to those in FIG. 3 are given step numbers obtained by adding 400 to the step numbers of FIG. 3, so that the last two digits match those of FIG. 3, and descriptions overlapping with FIG. 3 may be omitted.
- FIG. 16 is a diagram showing a saddle diagram arrangement example in the cluster extraction process in the fourth embodiment, which supplements FIG. E to E represent document elements. For convenience, it is assumed that the smaller subscript is the document element with the smaller time t (older).
- the document reading unit 10 of the processing device 1 reads a plurality of document elements to be analyzed from the document storage unit 330 of the recording device 3 (step S410).
- the time data extraction unit 20 of the processing device 1 extracts time data from each document element of the document group to be analyzed (step S420).
- step S430 the index word data extraction unit 30 of the processing device 1 extracts index word data from each document element of the document group to be analyzed. At this time, as described later, since the index word data of the oldest element (oldest document element) E in the document group is unnecessary, step S
- the similarity calculation unit 40 of the processing device 1 calculates the similarity between the document elements (step S440). At this time, only the similarity between the elements other than the oldest element is calculated as described above.
- the dendrogram creation unit 50 of the processing device 1 creates a dendrogram consisting of the document elements of the document group to be analyzed (step S450: FIG. 16 (A)). At this time, the oldest element E is placed at the top of the dendrogram regardless of its similarity to the other elements.
- the cutting condition reading unit 60 of the processing apparatus 1 reads the cutting conditions (step S460).
- here, the cutting heights α_i and α_ii (where α_i > α_ii), or the method of calculating them, are read.
- the cluster extraction unit 70 performs cluster extraction.
- the number of branch lines cut by the cutting line (the first branch number) is read, and that number of branch lines is drawn directly from the oldest element excluded in step S450 (step S472: FIG. 16 (C)).
- This number of first branches is the number of parent clusters.
- step S473 Fig. 16D
- in step S474, the number of branch lines cut by the cutting line (the second branch number) is read for each parent cluster, and that number of branch lines is drawn directly from the line of each parent cluster.
- the total number of child clusters is the sum of the second branch numbers over all parent clusters. This completes the cluster extraction.
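The two-stage cut can be illustrated with a small sketch. It assumes that the dendrogram is a binary tree whose joining height d is a dissimilarity (so a node is split when its height exceeds the cutting height), and that the oldest element has already been set aside; the `Node` class, the function names, and the sample heights are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    height: float                     # joining height d; 0.0 for a leaf
    label: Optional[str] = None       # set only for leaves (document elements)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf_labels(node: Node) -> List[str]:
    if node.left is None:
        return [node.label]
    return leaf_labels(node.left) + leaf_labels(node.right)


def cut(node: Node, alpha: float) -> List[Node]:
    """Subtrees left hanging when the dendrogram is cut at height alpha."""
    if node.left is None or node.height <= alpha:
        return [node]
    return cut(node.left, alpha) + cut(node.right, alpha)


def stepwise_cut(root: Node, alpha_1: float, alpha_2: float) -> List[List[List[str]]]:
    """Cut at alpha_1 for parent clusters, then cut each parent at alpha_2."""
    parents = cut(root, alpha_1)
    return [[leaf_labels(child) for child in cut(p, alpha_2)] for p in parents]


def leaf(name: str) -> Node:
    return Node(0.0, label=name)


# Hypothetical dendrogram of the elements other than the oldest one.
root = Node(
    0.9,
    left=Node(0.5, left=Node(0.2, left=leaf("E2"), right=leaf("E3")), right=leaf("E4")),
    right=Node(0.3, left=leaf("E5"), right=leaf("E6")),
)
clusters = stepwise_cut(root, alpha_1=0.7, alpha_2=0.25)
first_branch_number = len(clusters)
second_branch_numbers = [len(children) for children in clusters]
```

Here the first cut yields two parent clusters and the second cut yields two child clusters in each, so the total number of child clusters is the sum of the second branch numbers, 4. Because only the branch counts are kept, differences in joining height above each cut are flattened, which is the simplification the text describes.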
- the arrangement condition reading unit 80 next reads out the arrangement conditions in the cluster (step S480).
- the in-cluster element arrangement unit 90 determines the arrangement of the document element group in the cluster based on the time data of each document element (step S490: FIG. 16 (E)).
- in this case, the elements are preferably arranged in a line from the oldest based on the time data, but other arrangements, such as those according to Examples 6 to 8 described later, may be used.
- step S472 the number of branch lines corresponding to the number of first branches is drawn directly from the oldest element. Therefore, for example, even when the parent cluster [1] and the parent clusters [2] and [3] are located in different layers as shown in the diagram of FIG. As shown in Fig. 5, the hierarchical structure above the cutting height ⁇ can be processed uniformly. Therefore, the diagram can be simplified.
- in step S474, as many branch lines as the second branch number of each parent cluster are drawn directly from the line of that parent cluster. Therefore, even when, for example, the child clusters [11] and [12] branching from the parent cluster [1] and the child cluster [13] are located at different levels, as shown in FIG. 16 (D), the hierarchical structure between the two cutting heights is treated uniformly, as shown in FIG. 16 (E). The diagram can therefore be simplified.
- child clusters [11], [12] and [13] branch from the parent cluster [1]
- child clusters branch from the parent cluster [3] [ Even if 31] and [32] are joined at different heights, they are joined at the same height as shown in Fig. 16 (E).
- accordingly, differences in joining height between the two cutting heights are treated uniformly, and the dendrogram can be simplified.
- the dendrogram can thus be reasonably simplified while the first branch number at the first cutting height α_i and the second branch number at the second cutting height are maintained. Therefore, it is possible to create a document correlation diagram that reflects the hierarchical structure of the original dendrogram while moderately simplifying that structure.
- FIGS. 17 and 18 are diagrams showing specific examples of the document correlation diagram generated by the method of the fourth embodiment.
- the same publication as FIG. 7 of Example 1 was analyzed as document elements, and the patent application number and the title of the invention were entered for each document element in the document correlation diagram.
- the oldest element is not extracted before the generation of the descendant cluster, the oldest element of the parent cluster is placed between the oldest element and the descendant cluster in the entire chart. Only the hangar-shaped structure is displayed.
- Fig. 17 shows a saddle diagram created by using a similarity chart (cosine) that has not been standardized, and a chart created by using standardized similarity score (number of relations). It has been cut.
- according to the fourth embodiment, a dendrogram that appropriately represents the temporal development of each field can be created by performing cluster extraction by cutting the dendrogram and determining the intra-cluster arrangement based on time data.
- the hierarchical structure of the dendrogram is moderately simplified, and a document correlation diagram that reflects the hierarchical structure of the initial dendrogram can be created.
- FIG. 19 is a flowchart illustrating the cluster extraction process in the fifth embodiment (variable composite method; FC method). This flowchart shows the procedure of the fifth embodiment in more detail than FIG. Steps similar to those in FIG. 3 are given step numbers obtained by adding 500 to the step numbers of FIG. 3, so that the last two digits match those of FIG. 3, and descriptions overlapping with FIG. 3 may be omitted.
- FIG. 20 is a diagram showing part of an example of dendrogram arrangement in the cluster extraction process of the fifth embodiment, and supplements FIG. E to E represent document elements.
- the document reading unit 10 of the processing device 1 reads a plurality of document elements to be analyzed from the document storage unit 330 of the recording device 3 (step S510).
- the time data extraction unit 20 of the processing device 1 extracts time data from each document element of the document group to be analyzed (step S520).
- step S530 the index word data extraction unit 30 of the processing device 1 extracts index word data from each document element of the document group to be analyzed. At this time, as described later, since the index word data of the oldest element (oldest document element) E in the document group is unnecessary, step S
- the similarity calculation unit 40 of the processing device 1 calculates the similarity between the document elements (step S540). At this time, only the similarity between the elements other than the oldest element E is calculated as described above.
- the dendrogram creation unit 50 of the processing device 1 creates a dendrogram consisting of the document elements of the document group to be analyzed (step S550: FIG. 20 (A)). At this time, the oldest element E is placed at the top of the dendrogram regardless of its similarity to the other elements.
- the cutting condition reading unit 60 of the processing device 1 reads the cutting condition (step S560).
- the calculation method of the cutting height α, the upper limit g of the number of cuts (number of layers), and the like are read.
- the upper limit g of the number of cuts is, for example, the total number N of document elements to be analyzed
- step S571 it is determined whether or not the calculated cutting height a * is smaller than the maximum value Max (d) of the coupling height d of the elements E to E d [2-N] 2 (step S572), and is small
- the subsequent processing is performed for each cluster.
- the number of document elements exceeds a predetermined threshold (here, 4; the predetermined threshold is preferably 4 or more and 10 × [ln N / ln 10] or less) (step S5).
- step S575 NO
- the oldest element E is excluded from the cluster.
- Step S576 Fig. 20 (C)
- the partial chart created at this time is the first chart created in step S550, except that the oldest element E of the cluster is excluded.
- the structure is almost the same as the portion corresponding to the cluster. However, since the oldest element E of the cluster is excluded, the distance between element groups in the cluster changes.
- step S571 After creating a partial cage diagram with the elements in the cluster, return to step S571, and use the height d of each coupling position of elements E to E, excluding the oldest element E, among the elements in the cluster. ⁇
- for clusters in which the number of document elements is equal to or less than the predetermined threshold (here, 4) (step S574: YES), the process proceeds, regardless of the number of cuts made so far, to descendant cluster extraction by another cluster extraction method such as the cell division method (CD method) of Example 3 (step S577).
- for clusters whose number of cuts has reached the upper limit g (step S575: YES), the process likewise proceeds, regardless of the number of document elements in the cluster, to descendant cluster extraction by another cluster extraction method such as the cell division method (CD method) of Example 3 (step S577).
- in step S577, the balanced cutting method (BC method) of Example 1, the codimension descent method (CR method) of Example 2, or the stepwise cutting method (SC method) of Example 4 may be used instead.
- step S572 the cutting height a * or ⁇ * force element ⁇ to ⁇ or
- step S575 the number of cuts is determined in step S575 (in this case, the cutting process has been skipped and the number of cuts has not increased, so the determination of the number of cuts may be omitted).
- the next oldest element E or E is excluded in S576.
- step S576 the oldest elements are excluded one by one (step S576), and if the number of elements in the cluster is equal to or less than the threshold (step S574), the process proceeds to step S577.
- the arrangement condition reading unit 80 reads out the arrangement condition in the cluster (step S580).
- the intra-cluster element arrangement unit 90 determines the arrangement of document element groups in the cluster based on the time data of each document element (step S590: FIG. 20 (D)).
- in this case, the elements are preferably arranged in a line from the oldest based on the time data, for example, but other arrangements, such as those in Examples 6 to 8 described later, may be used.
- step S575 is omitted, and if step S574 is NO, the process immediately proceeds to step S576, and a descendant cluster is extracted with an unlimited number of cuts.
- step S574 it is desirable to make a NO determination if the number of document elements exceeds 9, for example, and to determine YES for a cluster where the number of document elements is 9 or less.
- FIG. 21 and FIG. 22 are diagrams showing specific examples of the document correlation diagram generated by the method of the fifth embodiment.
- Japanese patent applications and utility model registration applications related to the liquid wrinkle prevention method, extracted by keyword search, were analyzed as document elements. For simplicity, only part of the obtained document correlation diagram (35 cases) is shown here. In the illustrated document correlation diagram, the patent application number (or the utility model registration application number, marked with (U) at the end) is entered for each document element, and the title of the invention is also entered for the higher-level document elements. In Examples 1 to 4, fewer than 20 elements appears to be preferable, but in Example 5 appropriate parent-child clusters can be obtained even when the number of elements to be analyzed is large, as in this example.
- the parent cluster (number of elements 5) starting with application number H03-320020 was separated into child clusters by the second cut because the number of elements exceeded the threshold value 4.
- the child cluster (number of elements: 10) starting with application number S63-033662 (U) was generated by the second cut, so it was not cut and separated any further.
- the parent cluster (number of elements 5) starting with the application number H03-320020 was not cut because the number of elements was 9 or less.
- the child cluster (number of elements: 10) starting with application number S63-033662 (U) was cut for the third time and separated into grandchild clusters.
- FIG. 23 is a diagram showing another specific example of the document correlation diagram generated by the method of the fifth embodiment.
- Document elements macro elements
- for the same document elements of the 16 fields as in FIG. 14 of Example 3, the oldest element was excluded and placed at the top, and cutting was performed according to Example 5. The removal of the oldest element and the creation and cutting of the dendrogram were repeated until the upper limit of the number of elements in a cluster (4) was reached. Clusters whose number of elements fell to or below the upper limit were further divided into descendant clusters by the method of Example 3 (cell division method; CD method), and the branch structure shown in the figure was obtained.
- the average value of the filing date is used as the time data t of each document element
- GFIDF (E) is used as the component value of the document element vector
- in step S550 and step S576 described above, the oldest element is excluded when creating the dendrogram and the partial dendrogram. However, the dendrogram may also be created without excluding the oldest element, and this dendrogram is then cut g times as described above. By obtaining clusters in this way, the document elements can be classified. In this case, macro-analysis of the document element group is facilitated by labeling each obtained classification appropriately based on the content data of the document elements that belong to it.
- FIG. 24 is a diagram showing a specific example of the document correlation diagram generated by the method according to the first modification of the fifth embodiment.
- the procedure for creating this document correlation diagram is as follows. First, for approximately 4000 Japanese patent publications filed by a household chemicals manufacturer, a dendrogram was created without removing the oldest publication and was cut g times using the method according to the first modification. A dendrogram was then newly created with the 27 clusters thus obtained as document elements (macro elements), the oldest element was extracted by the method of Example 5, and the dendrogram was cut. Extraction of the oldest element and cutting of the dendrogram were repeated until the upper limit of the number of elements in a cluster (4) was reached, and the branch structure shown in the figure was obtained. Each macro element was labeled based on the content data of the documents belonging to it. As a result, even a document group consisting of an enormous number of documents can be analyzed automatically at the macro level, which makes the general flow of the technology easier to understand.
- This document correlation diagram first creates a document correlation diagram of a group of patent documents held by an applicant X, and among the group of patent documents by the applicant X, a group of patent documents belonging to a specific technical field It shows the relationship with the patent documents of other companies.
- FIG. 25 is a diagram illustrating a process of creating a document correlation diagram according to the second modification of the fifth embodiment
- FIGS. 26 and 27 are diagrams illustrating specific examples of the document correlation diagram according to the second modification of the fifth embodiment.
- FIG. 28 and FIG. 29 are diagrams illustrating a part of another display example in the document correlation diagram according to the second modification of the fifth embodiment.
- a dendrogram was newly created with the 21 clusters thus obtained as document elements (macro elements).
- the oldest element was extracted by the method of Example 5 and the dendrogram was cut. Extraction of the oldest element and cutting of the dendrogram were repeated until the upper limit of the number of elements in a cluster (set to 4) was reached, and the branch structure shown in FIG. 26 was obtained.
- document elements in which the patent documents whose applicant is Company X rank near the top (here, fifth or higher) are displayed so as to be distinguished from the other document elements.
- a highlight is added to document elements in which they rank near the top, and a stronger highlight is added to those in which they rank first.
- Such highlighting may be based on the thickness of the frame line as shown in the figure, or may be based on color coding or a pattern.
- such highlighting is not limited to whether documents of a certain applicant (one's own company or another company) rank at the top. It may instead be based on whether the document element includes even one document of that applicant, or on other criteria.
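One way to decide the highlight level of a macro element from its applicant ranking is sketched below. The function name, the three-level return value, and the reading of the criterion as "within the top five applicants by document count" are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def highlight_level(applicants: List[str], target: str, top_n: int = 5) -> int:
    """Return 0 (no highlight), 1 (highlight) or 2 (stronger highlight).

    `applicants` lists the applicant of every document in one macro element.
    The element is highlighted when `target` is among the `top_n` applicants
    by document count, and highlighted more strongly when it ranks first.
    """
    ranking = [name for name, _ in Counter(applicants).most_common()]
    if target not in ranking[:top_n]:
        return 0
    return 2 if ranking[0] == target else 1


# Hypothetical macro elements.
level_first = highlight_level(["X", "X", "X", "A", "B"], "X")
level_top = highlight_level(["A", "A", "A", "X", "X", "B"], "X")
level_none = highlight_level(["A", "B"], "X")
```

The alternative criterion mentioned in the text, highlighting any element that contains at least one document of the applicant, would reduce to the test `target in applicants`.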
- Fig. 26 and Fig. 27 the average value of the filing date of each document element (here, the last two digits of the year) is entered as the value on the vertical axis.
- in FIGS. 26 and 27, for convenience of explanation, only a symbol such as “E201” is displayed as the name of each document element. However, it is advisable to attach a label indicating the content characteristics of each document element, based on the content data of the documents belonging to it.
- a document element having a specific attribute among the document elements in the document correlation diagram, for example, a document element composed of patent documents of a specific applicant, or one in which patent documents of a specific applicant are dominant, is displayed in a form that distinguishes it from the other document elements. This makes it possible to see at a glance how document elements having the specific attribute, such as the group of patents of the specific applicant in a certain field, are positioned in terms of content and time relative to other companies. If one's own company is selected as the specific applicant, the position of its technology in a certain field within the industry as a whole can be understood.
- FIG. 28 and FIG. 29 are diagrams showing a part of another display example in the document correlation diagram of FIG.
- each document element is labeled based on content data, such as “Oka-Keido powder related”, and as a more detailed display, the number of documents belonging to the document element and the applicant ranking (company name and number of documents) are displayed. Adding a detailed display in this way enables a more detailed analysis.
- the content of the detailed display is not limited to this, but may be a ranking based on the International Patent Classification (IPC), filing date (average value or range, etc.), keywords, etc. of patent documents.
- the detailed display may be shown simultaneously for all the document elements, as in FIGS. 28 and 29. Alternatively, a document correlation diagram without the detailed display may be shown first on the image display device, and the detailed display for a document element may be output additionally when the pointer is moved over that element.
- the document element description column itself may be enlarged as shown in FIG. 28, or may be displayed outside the column as shown in FIG. Further, not only in FIG. 26, the same detailed display may be performed for FIG. 27 or other document correlation diagrams.
- according to the fifth embodiment, a dendrogram that appropriately represents the temporal development of each field can be created by performing cluster extraction by cutting the dendrogram and determining the intra-cluster arrangement based on the time data.
- the parent clusters are extracted based on a function whose variables include the average value of the joining heights of the document element group belonging to the dendrogram, their deviation, or both, and the child clusters are extracted based on a function whose variables include the average value of the joining heights of the document element group belonging to each parent cluster, their deviation, or both. Therefore, appropriate parent-child clusters can be obtained even when the number of elements N is large. In addition, because the cut adapts to the average and deviation of the joining heights, a wide range of dendrogram shapes can be handled, such as the case where the similarity among the document elements belonging to the dendrogram is high.
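A cutting height that depends on the average and deviation of the joining heights can be sketched as below. The specific form `mean + k * std` and the parameter `k` are assumptions chosen only for illustration; this passage does not state the function actually used. The comparison with the maximum joining height corresponds to the check in step S572.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def cutting_height(join_heights: List[float], k: float = 1.0) -> float:
    """Assumed form of the variable cutting height: mean(d) + k * std(d)."""
    return mean(join_heights) + k * pstdev(join_heights)


def should_cut(join_heights: List[float], k: float = 1.0) -> Tuple[bool, float]:
    """Cut only if the cutting height lies below the highest joining position."""
    alpha_star = cutting_height(join_heights, k)
    return alpha_star < max(join_heights), alpha_star


# Hypothetical joining heights d of one (partial) dendrogram.
spread_out = should_cut([0.2, 0.3, 0.5, 0.9])
uniform = should_cut([0.4, 0.4, 0.4])
```

With spread-out joining heights the computed height falls below the maximum and the dendrogram is cut. When all joining heights are equal the computed height is not below the maximum, so no cut is made and the next oldest element would be excluded instead.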
- the arrangement in the cluster is determined based on the time data and the chart layout data.
- FIG. 30 is a flowchart for explaining the intra-cluster arrangement process in Example 6 (-main fishing arrangement; PLA).
- This flowchart assumes that the cluster has been extracted by the processing up to step S70 (cluster extraction) in Fig. 3, and that steps S80 (read placement conditions) and step S90 (element array in cluster) in Fig. 3
- FIG. 31 is a diagram showing an example of dendrogram arrangement in the intra-cluster arrangement process of the sixth embodiment, and supplements FIG. E to E represent document elements.
- FIG. 31 (A) shows the dendrogram structure of each of the five clusters extracted by the processing up to step S70 in FIG. 3.
- Example 1 (equilibrium cutting method: BC method), Example 2 (codimension descent method: CR method), Example 3 (cell division method: CD method) or Example 4 (stage cutting method: SC method) ) Etc.
- the placement condition reading unit 80 first reads the placement conditions in the cluster (step S680).
- the intra-cluster element arrangement unit 90 determines the arrangement of the document element group in the cluster based on the time data and the dendrogram arrangement data of each document element in the cluster.
- the cluster portion of the dendrogram is regarded as a tournament table, and the winner of each stage (the element with the smaller time t) is determined (FIG. 31 (B)). That is, starting from the lowest node (joining point), it is determined which document element has the smaller time data t, and the result is recorded (step S691). This determination is performed from the lowest node (a connection of two elements) up to the highest node of the cluster (step S692). The winner at a lower node (the document element with the smaller time data t) becomes the contestant at the next higher node, where its time data t is compared again (step S693).
- when the determination has been made up to the highest node, the overall winner (the oldest document element) is decided, and this winner is placed at the head of the cluster (step S694). Branches are then drawn from the winner, one for each opponent directly defeated by the winner, that is, each document element that was directly compared with the oldest document element and found to have a larger time data t (step S695: FIG. 31 (C)). The following processing is performed for each branch.
- in step S697, for each branch, the number of opponents directly defeated by the winner of that branch is counted. If the number is 0, processing of the branch ends. If it is 1 or more, that number of new branches is created from the winner of the branch (step S698: FIG. 31 (D)), and the process returns to step S696. By repeating steps S696 to S698, the intra-cluster arrangement is determined (FIG. 31 (E)).
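The tournament-style arrangement can be sketched recursively as follows. The nested-tuple representation of the cluster's dendrogram, the function names, and the rule that the left-hand contestant wins a tie are assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def tournament(node: Tree, times: Dict[str, float], defeated_by: Dict[str, str]) -> str:
    """Return the winner (oldest element) of a subtree and record each loss."""
    if isinstance(node, str):          # a leaf is a single document element
        return node
    left, right = node
    a = tournament(left, times, defeated_by)
    b = tournament(right, times, defeated_by)
    # The element with the smaller time data t wins; ties go to the left side.
    winner, loser = (a, b) if times[a] <= times[b] else (b, a)
    defeated_by[loser] = winner
    return winner


def arrange_cluster(tree: Tree, times: Dict[str, float]) -> Tuple[str, Dict[str, List[str]]]:
    """Head of the cluster plus, for each winner, the elements branching from it."""
    defeated_by: Dict[str, str] = {}
    head = tournament(tree, times, defeated_by)
    branches: Dict[str, List[str]] = {}
    for loser, winner in defeated_by.items():
        branches.setdefault(winner, []).append(loser)
    return head, branches


# Hypothetical cluster: joining structure and time data t of five elements.
tree = ((("E3", "E1"), "E4"), ("E2", "E5"))
times = {"E1": 1, "E2": 2, "E3": 3, "E4": 4, "E5": 5}
head, branches = arrange_cluster(tree, times)
```

In this example E1 heads the cluster, E2, E3 and E4 branch directly from E1 because E1 defeated each of them, and E5 branches from E2. Every element therefore hangs below the element that directly defeated it, which always has an equal or smaller time t.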
- according to the sixth embodiment, a dendrogram that appropriately represents the temporal development of each field can be created by performing cluster extraction by cutting the dendrogram and determining the intra-cluster arrangement based on time data.
- Group Time Ordering is an effective method when the element definition of multiple document elements is based on classification information and large time units. If the element definition is based on a large time unit (for example, in units of a certain number of years), the same time element may occur, which may cause trouble when considering an arrangement in time series. This is solved by determining the sequence.
- FIG. 32 is a flowchart for explaining the intra-cluster arrangement process in Example 7 (group time order; GTO). This flowchart assumes that the clusters have been extracted by the processing up to step S70 (cluster extraction) in FIG. 3, and shows the procedure of Example 7 in more detail for the part corresponding to step S80 (reading of arrangement conditions) and step S90 (arrangement of elements in the cluster) in FIG. 3. Steps similar to those in FIG. 3 are given step numbers obtained by adding 700 to the step numbers of FIG. 3, so that the last two digits match those of FIG. 3, and detailed description may be omitted.
- FIG. 33 is a diagram showing part of an example of dendrogram arrangement in the intra-cluster arrangement process of the seventh embodiment, and supplements FIG. E, E, etc. are documents consisting of multiple documents.
- classification such as International Patent Classification (IPC)
- Arabic numeral represents the time t (smaller is older).
- First, the arrangement condition reading unit 80 reads the arrangement conditions within the cluster (step S780). In accordance with these conditions, the intra-cluster element arrangement unit 90 determines the arrangement of the document elements in the cluster based on the time data and the dendrogram arrangement data of each document element in the cluster.
- The oldest element in the cluster is extracted and placed at the head of the cluster (step S791).
- When there are multiple oldest elements (E and E in Fig. 33 (B)), they are handled in step S792 (Fig. 33 (B)).
- Next, elements of the same classification as the oldest element extracted in step S791 are searched for (step S793).
- The time-series chain of elements sharing a classification with an oldest element is connected to that oldest element (step S794).
- In this way, the intra-cluster arrangement is determined.
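As a rough illustration of steps S791 to S794, the sketch below places the oldest elements at the head and chains the elements of the same classification behind each of them in time order. It rests on assumptions that the surviving text does not confirm: each element is a `(classification, t)` tuple, and elements whose classification matches no oldest element are simply returned separately in `rest`. The function name `group_time_order` is hypothetical.

```python
def group_time_order(elements):
    """Chain same-classification elements behind each oldest element.

    elements: list of (classification, t) tuples, smaller t being older.
    Returns (chains, rest): chains maps the classification of each oldest
    element to its time-ordered chain; rest holds the remaining elements.
    """
    t_min = min(t for _, t in elements)
    # Steps S791/S792: every element sharing the oldest time heads a chain.
    heads = [e for e in elements if e[1] == t_min]
    chains = {}
    for cls, _ in heads:
        # Steps S793/S794: collect the same classification in time order.
        chains[cls] = sorted(e for e in elements if e[0] == cls)
    rest = [e for e in elements if e[0] not in chains]
    return chains, rest


cluster = [("A", 1), ("B", 1), ("A", 3), ("B", 2), ("A", 2), ("C", 4)]
chains, rest = group_time_order(cluster)
print(chains, rest)
```

Because the two oldest elements `("A", 1)` and `("B", 1)` carry different classifications, each one starts its own chain, which is how the classification information resolves the tie in time.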
- According to the seventh embodiment, extracting clusters by cutting the dendrogram and determining the intra-cluster arrangement based on time data makes it possible to create a dendrogram that appropriately represents the temporal development of each field.
- In addition, the arrangement within the cluster is determined taking the classification information into account.
- As a result, elements with the same time can be handled.
- In time slice analysis, the document elements to be analyzed are first classified based on time data, and cluster analysis is then performed within each time classification. This differs from Examples 6 and 7 in that the analysis based on time data is performed before the cluster extraction based on content data. After the classification based on time data and the cluster analysis within each time classification, the elements belonging to clusters in preceding and following time classifications are connected, which completes the document correlation diagram.
- FIG. 34 is a diagram for explaining the configuration and function of the document correlation diagram creation device of Example 8 (temporal section analysis; TSA) in more detail than FIG. Portions common to those in FIG. 2 are denoted by the same reference numerals and description thereof is omitted.
- the document correlation diagram creation apparatus of the eighth embodiment includes a time slice classification unit 25 and a time slice connection unit 75 in addition to the components of the document correlation diagram creation apparatus described in FIG.
- The time slice classification unit 25 acquires the time data of each document element extracted by the time data extraction unit 20, either from the work result storage unit 320 or directly from the time data extraction unit 20, and, based on this time data, classifies the document group to be analyzed into time slices at regular intervals. The classification result is sent directly to the similarity calculation unit 40 and used for processing there, or sent to the work result storage unit 320 for storage.
- The similarity calculation unit 40 calculates the similarity of the document elements in each time slice, the dendrogram creation unit 50 creates a dendrogram for each time slice, and the cluster extraction unit 70 extracts clusters from each time slice.
- The inter-slice connection unit 75 acquires the cluster information extracted by the cluster extraction unit 70, either from the work result storage unit 320 or directly from the cluster extraction unit 70, and, based on the cluster information, connects clusters belonging to different time slices.
- The generated connection data is sent directly to the intra-cluster element arrangement unit 90 and used for processing there, or sent to the work result storage unit 320 and stored there.
- The intra-cluster element arrangement unit 90 arranges the elements in each cluster and completes the document correlation diagram by referring to the connection data of the inter-slice connection unit 75.
- FIG. 35 is a flowchart for explaining the document correlation diagram creation process in the eighth embodiment. This flowchart shows the procedure of the eighth embodiment in more detail than FIG. Steps similar to those in FIG. 3 are given step numbers formed by adding 800 to the step number in FIG. 3, so that the last two digits match.
- FIG. 36 is a diagram showing a dendrogram arrangement example in the document correlation diagram creation process in the eighth embodiment. This supplements Figure 35.
- the document reading unit 10 reads a plurality of document elements to be analyzed from the document storage unit 330 of the recording device 3 in accordance with the reading conditions input by the input device 2 (step S810).
- the time data extraction unit 20 extracts time data of each element from the document element group read in the document reading step S810 (step S820).
- Once the time data of each element has been extracted, the elements are classified based on the time data (step S825).
- The classification based on time data may use variable intervals instead of fixed time intervals.
- For example, a slice may be cut off each time a certain number of documents has accumulated in time order.
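The two ways of classifying by time data in step S825 (fixed intervals, or a cut each time a certain number of items has accumulated) can be sketched as below. This is a minimal sketch under assumptions: the function names `slice_fixed` and `slice_by_count` are hypothetical, time data are plain numbers, and slice number 0 is assigned to the oldest slice.

```python
def slice_fixed(times, width):
    """Assign slice numbers using a fixed time interval `width`."""
    t0 = min(times)
    return [int((t - t0) // width) for t in times]


def slice_by_count(times, per_slice):
    """Cut a new slice each time `per_slice` elements accumulate in time order."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    labels = [0] * len(times)
    for rank, i in enumerate(order):
        labels[i] = rank // per_slice
    return labels


years = [2000, 2001, 2004, 2005, 2009]
print(slice_fixed(years, 3))
print(slice_by_count([5, 1, 3, 2, 4], 2))
```

With fixed intervals some slices can stay empty (slice 2 in the first call), whereas the count-based variant keeps the slices evenly filled at the cost of uneven time spans.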
- Next, groups G are formed in each slice. Specifically, clusters are extracted from each slice as follows.
- The index word data extraction unit 30 extracts index word data (step S830), and the similarity calculation unit 40 calculates the similarity (or dissimilarity) between the document elements in each slice (step S840). Then, for each slice, the dendrogram creation unit 50 creates a dendrogram (step S850).
- The cutting condition reading unit 60 reads the dendrogram cutting conditions (step S860), and the cluster extraction unit 70 extracts clusters from each slice (step S870).
- Each cluster extracted from the n-slice is called a group G.
- Each group G has a slice number n and a group number j, and is denoted G(n, j) (Fig. 36 (A)).
- A group G may consist of multiple document elements or of a single document element.
- A group consisting of a single document element is called a trivial group.
- the cutting height a is preferably set in the range of a—b ⁇ a ⁇ a—0.5b.
- The cluster extraction is preferably performed by cutting the dendrogram as described in steps S830 to S870, but other methods may be used. For example, cluster extraction using the known k-means method may be used.
- Alternatively, an arc segmentation method may be used, in which the document elements to be analyzed are connected by lines and clusters are extracted by eliminating the lines whose dissimilarity exceeds the cutting radius.
- this arc segmentation method there are M elements (E, E, ..., E)
- This document element E consists of document element E and document
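One way to read the arc segmentation idea is as a graph problem: connect the elements, drop every line whose dissimilarity is larger than the cutting radius, and take the remaining connected components as clusters. The sketch below follows that reading, which is an interpretation of the text and not a confirmed detail of the patent; the function name `arc_segmentation` and the dictionary input format are hypothetical.

```python
def arc_segmentation(dissim, radius):
    """Return clusters as connected components of the retained lines.

    dissim: dict mapping an element pair (i, j) to its dissimilarity.
    A line is retained only if its dissimilarity is at most `radius`.
    """
    nodes = sorted({x for pair in dissim for x in pair})
    parent = {x: x for x in nodes}

    def find(x):
        # Follow parent links to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), d in dissim.items():
        if d <= radius:
            parent[find(i)] = find(j)

    clusters = {}
    for x in nodes:
        clusters.setdefault(find(x), []).append(x)
    return sorted(clusters.values())


lines = {("a", "b"): 0.2, ("b", "c"): 0.9, ("a", "c"): 0.8, ("c", "d"): 0.3}
print(arc_segmentation(lines, 0.5))
```

A larger radius keeps more lines and therefore merges more elements into one cluster, which plays the same role as raising the cutting height of a dendrogram.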
- The method for forming the groups G may be something other than the cluster analysis described above. For example, if the document element group is already classified by patent classification or company name, the groups may be defined using that classification. In this case, since the element definition and the group definition coincide, each group is formed by one document element consisting of multiple documents (this is also a trivial group).
- Next, the connections between the groups belonging to the 0-slice are determined (step S872). For example, the clusters obtained by cutting the dendrogram are connected according to the connection structure of the dendrogram above the cutting position (FIG. 36 (B)).
- Next, the slices are connected to one another. This process is performed by the time slice connection unit 75.
- For the oldest element of each group G(n, j) belonging to an n-slice other than the 0-slice, the document element having the highest similarity to it (hereinafter referred to as the "shortest distance element") is selected from among the elements of the time-forward groups G (groups in slices ahead of the n-slice in time). The oldest element of group G(n, j) is then connected to the shortest distance element selected from the time-forward groups (step S875: Fig. 36 (C)). If there are multiple shortest distance elements, the oldest of them is selected and connected to the oldest element of group G(n, j).
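The selection of the shortest distance element in step S875 can be sketched as follows. The sketch assumes that "time-forward" groups are those in slices with a smaller slice number, that elements are `(name, t)` tuples, and that a dissimilarity function is supplied by the caller; the names `connect_to_time_forward` and `dissim` are hypothetical.

```python
def connect_to_time_forward(group, forward_elements, dissim):
    """Return the (oldest element, shortest distance element) name pair.

    group: elements (name, t) of one group G(n, j).
    forward_elements: elements (name, t) of the time-forward groups.
    dissim: function taking two element names and returning a distance.
    """
    oldest = min(group, key=lambda e: e[1])
    # Smallest dissimilarity first; among equals the older element wins.
    target = min(forward_elements,
                 key=lambda e: (dissim(oldest[0], e[0]), e[1]))
    return oldest[0], target[0]


table = {frozenset(("y", "a")): 0.5,
         frozenset(("y", "b")): 0.2,
         frozenset(("y", "c")): 0.2}
link = connect_to_time_forward(
    [("x", 7), ("y", 6)],
    [("a", 1), ("b", 2), ("c", 3)],
    lambda p, q: table[frozenset((p, q))],
)
print(link)
```

Sorting on the pair (dissimilarity, time) implements the tie rule from the text: when several candidates are equally close, the oldest one is chosen.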
- Alternatively, for each group G(n, j) belonging to an n-slice other than the 0-slice, the group having the highest inter-group similarity (that is, the shortest inter-group distance) may be selected from among the time-forward groups G.
- In that case, the oldest element of the group G(n, j) and the latest element of the selected time-forward group are connected.
- The distance between groups can be defined as the average of the dissimilarities (distances) between the elements belonging to the groups being compared. For trivial groups, each consisting of a single document element, it coincides with the dissimilarity between the elements (inter-element distance).
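The averaged inter-group distance can be written out in a few lines. This is a sketch under assumptions: groups are lists of element names, every cross-group pair contributes equally to the average, and the names `group_distance` and `dissim` are hypothetical.

```python
def group_distance(g1, g2, dissim):
    """Average dissimilarity over all pairs (a, b) with a in g1 and b in g2."""
    pairs = [(a, b) for a in g1 for b in g2]
    return sum(dissim(a, b) for a, b in pairs) / len(pairs)


table = {frozenset(("a", "c")): 0.4, frozenset(("b", "c")): 0.8}


def lookup(p, q):
    return table[frozenset((p, q))]


print(group_distance(["a", "b"], ["c"], lookup))
print(group_distance(["a"], ["c"], lookup))
```

The second call compares two single-element groups, and the result is simply the dissimilarity between the two elements, as the text states for trivial groups.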
- Finally, the arrangement condition reading unit 80 reads the document element arrangement conditions for each group (step S880).
- The intra-cluster element arrangement unit 90 determines the arrangement of the document elements in each group (step S890), and the document correlation diagram is completed.
- In FIG. 36 (C), the document elements are arranged in parallel within each group, but other arrangements are possible, such as arranging them in time order within each group.
- FIG. 37 is a diagram showing a first specific example of the document correlation diagram generated by the method of the eighth embodiment and its generation process.
- The oldest element of each group was connected to the shortest distance element of the time-forward groups, and the elements within each group were connected in time series.
- The patent application number was entered for each document element in the document correlation diagram (Figure 37 (B)).
- FIG. 38 is a diagram showing a second specific example of the document correlation diagram generated by the method of the eighth embodiment and its generation process.
- the application date average value of the document group constituting each document element by the method of Example 8 is set as time data t of each document element, and 1
- time slices of n 0-4.
- Groups were formed (FIG. 38 (A)).
- The oldest element of each group was connected to the shortest distance element of the time-forward groups, and the elements within each group were connected in time series. Keywords that characterize the above 16 fields were entered in the document correlation diagram (Fig. 38 (B)).
- FIG. 39 is a diagram showing a third specific example of the document correlation diagram generated by the method of the eighth embodiment and its generation process.
- FIG. 40 is a diagram showing a fourth specific example of the document correlation diagram generated by the method of the eighth embodiment and its generation process.
- a cluster analysis was conducted to form a group.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
BRPI0515687-4A BRPI0515687A (pt) | 2004-09-14 | 2005-09-12 | dispositivo, programa e método de desenho para o relacionamento de diagrama dos documentos em ordem cronológica |
RU2007114059/09A RU2007114059A (ru) | 2004-09-14 | 2005-09-12 | Чертежное устройство для схемы взаимосвязи документов, компонующее документы в хронологическом порядке |
CA002589531A CA2589531A1 (en) | 2004-09-14 | 2005-09-12 | Drawing device for relationship diagram of documents arranging the documents in chronological order |
JP2006535132A JP4171514B2 (ja) | 2004-09-14 | 2005-09-12 | 文書を時系列に配置した文書相関図の作成装置 |
US11/662,759 US20080294651A1 (en) | 2004-09-14 | 2005-09-12 | Drawing Device for Relationship Diagram of Documents Arranging the Documents in Chronolgical Order |
EP05782121A EP1806663A1 (en) | 2004-09-14 | 2005-09-12 | Device for drawing document correlation diagram where documents are arranged in time series |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2004266199 | 2004-09-14 | ||
JP2004-266199 | 2004-09-14 | ||
JP2005-171755 | 2005-06-10 | ||
JP2005171755 | 2005-06-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2006030751A1 (ja) | 2006-03-23 |
Family
ID=36060003
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2005/016785 WO2006030751A1 (ja) | 2004-09-14 | 2005-09-12 | 文書を時系列に配置した文書相関図の作成装置 |
Country Status (8)
Country | Link |
---|---|
US (1) | US20080294651A1 (ja) |
EP (1) | EP1806663A1 (ja) |
JP (2) | JP4171514B2 (ja) |
KR (1) | KR20070053246A (ja) |
BR (1) | BRPI0515687A (ja) |
CA (1) | CA2589531A1 (ja) |
RU (1) | RU2007114059A (ja) |
WO (1) | WO2006030751A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008090510A (ja) * | 2006-09-29 | 2008-04-17 | Shin Etsu Polymer Co Ltd | 文書分類装置及び文書分類方法 |
JP2008234482A (ja) * | 2007-03-22 | 2008-10-02 | Nippon Telegr & Teleph Corp <Ntt> | 文書分類装置、文書分類方法、プログラムおよび記録媒体 |
Families Citing this family (206)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8645137B2 (en) | 2000-03-16 | 2014-02-04 | Apple Inc. | Fast, language-independent method for user authentication by voice |
US8082264B2 (en) | 2004-04-07 | 2011-12-20 | Inquira, Inc. | Automated scheme for identifying user intent in real-time |
US8612208B2 (en) | 2004-04-07 | 2013-12-17 | Oracle Otc Subsidiary Llc | Ontology for use with a system, method, and computer readable medium for retrieving information and response to a query |
US7747601B2 (en) | 2006-08-14 | 2010-06-29 | Inquira, Inc. | Method and apparatus for identifying and classifying query intent |
US8005707B1 (en) | 2005-05-09 | 2011-08-23 | Sas Institute Inc. | Computer-implemented systems and methods for defining events |
US8677377B2 (en) | 2005-09-08 | 2014-03-18 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US7711734B2 (en) * | 2006-04-06 | 2010-05-04 | Sas Institute Inc. | Systems and methods for mining transactional and time series data |
US7921099B2 (en) * | 2006-05-10 | 2011-04-05 | Inquira, Inc. | Guided navigation system |
US8781813B2 (en) | 2006-08-14 | 2014-07-15 | Oracle Otc Subsidiary Llc | Intent management tool for identifying concepts associated with a plurality of users' queries |
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US8112302B1 (en) | 2006-11-03 | 2012-02-07 | Sas Institute Inc. | Computer-implemented systems and methods for forecast reconciliation |
US8095476B2 (en) * | 2006-11-27 | 2012-01-10 | Inquira, Inc. | Automated support scheme for electronic forms |
JP4550074B2 (ja) * | 2007-01-23 | 2010-09-22 | インターナショナル・ビジネス・マシーンズ・コーポレーション | 不均質な情報源からの情報トラッキングのためのシステム、方法およびコンピュータ実行可能プログラム |
JP4403561B2 (ja) * | 2007-01-31 | 2010-01-27 | インターナショナル・ビジネス・マシーンズ・コーポレーション | 画面の表示を制御する技術 |
US7797265B2 (en) * | 2007-02-26 | 2010-09-14 | Siemens Corporation | Document clustering that applies a locality sensitive hashing function to a feature vector to obtain a limited set of candidate clusters |
US7711668B2 (en) * | 2007-02-26 | 2010-05-04 | Siemens Corporation | Online document clustering using TFIDF and predefined time windows |
US8977255B2 (en) | 2007-04-03 | 2015-03-10 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US8793264B2 (en) * | 2007-07-18 | 2014-07-29 | Hewlett-Packard Development Company, L. P. | Determining a subset of documents from which a particular document was derived |
US8069404B2 (en) | 2007-08-22 | 2011-11-29 | Maya-Systems Inc. | Method of managing expected documents and system providing same |
US8601392B2 (en) | 2007-08-22 | 2013-12-03 | 9224-5489 Quebec Inc. | Timeline for presenting information |
US8949177B2 (en) * | 2007-10-17 | 2015-02-03 | Avaya Inc. | Method for characterizing system state using message logs |
JP4146505B1 (ja) * | 2007-11-19 | 2008-09-10 | デュアキシズ株式会社 | 判定装置及び判定方法 |
US10002189B2 (en) | 2007-12-20 | 2018-06-19 | Apple Inc. | Method and apparatus for searching using an active ontology |
US7921100B2 (en) * | 2008-01-02 | 2011-04-05 | At&T Intellectual Property I, L.P. | Set similarity selection queries at interactive speeds |
US9330720B2 (en) | 2008-01-03 | 2016-05-03 | Apple Inc. | Methods and apparatus for altering audio output signals |
US20090216611A1 (en) * | 2008-02-25 | 2009-08-27 | Leonard Michael J | Computer-Implemented Systems And Methods Of Product Forecasting For New Products |
US8739050B2 (en) | 2008-03-07 | 2014-05-27 | 9224-5489 Quebec Inc. | Documents discrimination system and method thereof |
JP5157551B2 (ja) * | 2008-03-17 | 2013-03-06 | 株式会社リコー | オブジェクト連携システム、オブジェクト連携方法およびプログラム |
US8996376B2 (en) | 2008-04-05 | 2015-03-31 | Apple Inc. | Intelligent text-to-speech conversion |
US8676815B2 (en) | 2008-05-07 | 2014-03-18 | City University Of Hong Kong | Suffix tree similarity measure for document clustering |
US10496753B2 (en) | 2010-01-18 | 2019-12-03 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US20100030549A1 (en) | 2008-07-31 | 2010-02-04 | Lee Michael M | Mobile device having human language translation capability with positional feedback |
US8676904B2 (en) | 2008-10-02 | 2014-03-18 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
KR101054824B1 (ko) * | 2008-11-28 | 2011-08-05 | 한국과학기술원 | 키워드 시맨틱 네트워크 구성을 통한 특허정보 시각화 시스템 및 그 방법 |
US9959870B2 (en) | 2008-12-11 | 2018-05-01 | Apple Inc. | Speech recognition involving a mobile device |
JP5213758B2 (ja) * | 2009-02-26 | 2013-06-19 | 三菱電機株式会社 | 情報処理装置及び情報処理方法及びプログラム |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US9858925B2 (en) | 2009-06-05 | 2018-01-02 | Apple Inc. | Using context information to facilitate processing of commands in a virtual assistant |
US10706373B2 (en) | 2011-06-03 | 2020-07-07 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US9431006B2 (en) | 2009-07-02 | 2016-08-30 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
US10679605B2 (en) | 2010-01-18 | 2020-06-09 | Apple Inc. | Hands-free list-reading by intelligent automated assistant |
US10705794B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US10553209B2 (en) | 2010-01-18 | 2020-02-04 | Apple Inc. | Systems and methods for hands-free notification summaries |
US10276170B2 (en) | 2010-01-18 | 2019-04-30 | Apple Inc. | Intelligent automated assistant |
US8631040B2 (en) | 2010-02-23 | 2014-01-14 | Sas Institute Inc. | Computer-implemented systems and methods for flexible definition of time intervals |
US8682667B2 (en) | 2010-02-25 | 2014-03-25 | Apple Inc. | User profiling for selecting user specific voice input processing information |
JP2011243148A (ja) * | 2010-05-21 | 2011-12-01 | Sony Corp | 情報処理装置、情報処理方法及びプログラム |
US8713021B2 (en) * | 2010-07-07 | 2014-04-29 | Apple Inc. | Unsupervised document clustering using latent semantic density analysis |
US8407221B2 (en) * | 2010-07-09 | 2013-03-26 | International Business Machines Corporation | Generalized notion of similarities between uncertain time series |
US10762293B2 (en) | 2010-12-22 | 2020-09-01 | Apple Inc. | Using parts-of-speech tagging and named entity recognition for spelling correction |
US9189129B2 (en) | 2011-02-01 | 2015-11-17 | 9224-5489 Quebec Inc. | Non-homogeneous objects magnification and reduction |
US9262612B2 (en) | 2011-03-21 | 2016-02-16 | Apple Inc. | Device access using voice authentication |
US8751858B2 (en) | 2011-05-20 | 2014-06-10 | International Business Machines Corporation | System, method, and computer program product for physical drive failure identification, prevention, and minimization of firmware revisions |
US10057736B2 (en) | 2011-06-03 | 2018-08-21 | Apple Inc. | Active transport based notifications |
US9336493B2 (en) | 2011-06-06 | 2016-05-10 | Sas Institute Inc. | Systems and methods for clustering time series data based on forecast distributions |
US9361273B2 (en) * | 2011-07-21 | 2016-06-07 | Sap Se | Context-aware parameter estimation for forecast models |
US9047559B2 (en) | 2011-07-22 | 2015-06-02 | Sas Institute Inc. | Computer-implemented systems and methods for testing large scale automatic forecast combinations |
US8994660B2 (en) | 2011-08-29 | 2015-03-31 | Apple Inc. | Text correction processing |
US10289657B2 (en) | 2011-09-25 | 2019-05-14 | 9224-5489 Quebec Inc. | Method of retrieving information elements on an undisplayed portion of an axis of information elements |
CA2801663A1 (en) * | 2012-01-10 | 2013-07-10 | Francois Cassistat | Method of reducing computing time and apparatus thereof |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US9483461B2 (en) | 2012-03-06 | 2016-11-01 | Apple Inc. | Handling speech synthesis of content for multiple languages |
US9280610B2 (en) | 2012-05-14 | 2016-03-08 | Apple Inc. | Crowd sourcing information to fulfill user requests |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
US9519693B2 (en) | 2012-06-11 | 2016-12-13 | 9224-5489 Quebec Inc. | Method and apparatus for displaying data element axes |
US9646080B2 (en) | 2012-06-12 | 2017-05-09 | 9224-5489 Quebec Inc. | Multi-functions axis-based interface |
US9495129B2 (en) | 2012-06-29 | 2016-11-15 | Apple Inc. | Device, method, and user interface for voice-activated navigation and browsing of a document |
US9087306B2 (en) | 2012-07-13 | 2015-07-21 | Sas Institute Inc. | Computer-implemented systems and methods for time series exploration |
US9244887B2 (en) | 2012-07-13 | 2016-01-26 | Sas Institute Inc. | Computer-implemented systems and methods for efficient structuring of time series data |
US9576574B2 (en) | 2012-09-10 | 2017-02-21 | Apple Inc. | Context-sensitive handling of interruptions by intelligent digital assistant |
US9547647B2 (en) | 2012-09-19 | 2017-01-17 | Apple Inc. | Voice-based media searching |
EP2954514B1 (en) | 2013-02-07 | 2021-03-31 | Apple Inc. | Voice trigger for a digital assistant |
US9147218B2 (en) | 2013-03-06 | 2015-09-29 | Sas Institute Inc. | Devices for forecasting ratios in hierarchies |
US9368114B2 (en) | 2013-03-14 | 2016-06-14 | Apple Inc. | Context-sensitive handling of interruptions |
AU2014233517B2 (en) | 2013-03-15 | 2017-05-25 | Apple Inc. | Training an at least partial voice command system |
WO2014144579A1 (en) | 2013-03-15 | 2014-09-18 | Apple Inc. | System and method for updating an adaptive speech recognition model |
WO2014197334A2 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
WO2014197336A1 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for detecting errors in interactions with a voice-based digital assistant |
US9582608B2 (en) | 2013-06-07 | 2017-02-28 | Apple Inc. | Unified ranking with entropy-weighted information for phrase-based semantic auto-completion |
WO2014197335A1 (en) | 2013-06-08 | 2014-12-11 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
DE112014002747T5 (de) | 2013-06-09 | 2016-03-03 | Apple Inc. | Vorrichtung, Verfahren und grafische Benutzerschnittstelle zum Ermöglichen einer Konversationspersistenz über zwei oder mehr Instanzen eines digitalen Assistenten |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
KR101809808B1 (ko) | 2013-06-13 | 2017-12-15 | 애플 인크. | 음성 명령에 의해 개시되는 긴급 전화를 걸기 위한 시스템 및 방법 |
AU2014306221B2 (en) | 2013-08-06 | 2017-04-06 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US9934259B2 (en) | 2013-08-15 | 2018-04-03 | Sas Institute Inc. | In-memory time series database and processing in a distributed environment |
US10296160B2 (en) | 2013-12-06 | 2019-05-21 | Apple Inc. | Method for extracting salient dialog usage from live data |
JP6326886B2 (ja) * | 2014-03-19 | 2018-05-23 | 富士通株式会社 | ソフトウェア分割プログラム、ソフトウェア分割装置およびソフトウェア分割方法 |
US10169720B2 (en) | 2014-04-17 | 2019-01-01 | Sas Institute Inc. | Systems and methods for machine learning using classifying, clustering, and grouping time series data |
US9620105B2 (en) | 2014-05-15 | 2017-04-11 | Apple Inc. | Analyzing audio input for efficient speech and music recognition |
US10592095B2 (en) | 2014-05-23 | 2020-03-17 | Apple Inc. | Instantaneous speaking of content on touch devices |
US9502031B2 (en) | 2014-05-27 | 2016-11-22 | Apple Inc. | Method for supporting dynamic grammars in WFST-based ASR |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US9734193B2 (en) | 2014-05-30 | 2017-08-15 | Apple Inc. | Determining domain salience ranking from ambiguous words in natural speech |
AU2015266863B2 (en) | 2014-05-30 | 2018-03-15 | Apple Inc. | Multi-command single utterance input method |
US9760559B2 (en) | 2014-05-30 | 2017-09-12 | Apple Inc. | Predictive text input |
US10078631B2 (en) | 2014-05-30 | 2018-09-18 | Apple Inc. | Entropy-guided text prediction using combined word and character n-gram language models |
US10289433B2 (en) | 2014-05-30 | 2019-05-14 | Apple Inc. | Domain specific language for encoding assistant dialog |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US9785630B2 (en) | 2014-05-30 | 2017-10-10 | Apple Inc. | Text prediction using combined word N-gram and unigram language models |
US9842101B2 (en) | 2014-05-30 | 2017-12-12 | Apple Inc. | Predictive conversion of language input |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US9892370B2 (en) | 2014-06-12 | 2018-02-13 | Sas Institute Inc. | Systems and methods for resolving over multiple hierarchies |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10659851B2 (en) | 2014-06-30 | 2020-05-19 | Apple Inc. | Real-time digital assistant knowledge updates |
US10446141B2 (en) | 2014-08-28 | 2019-10-15 | Apple Inc. | Automatic speech recognition based on user feedback |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10789041B2 (en) | 2014-09-12 | 2020-09-29 | Apple Inc. | Dynamic thresholds for always listening speech trigger |
US9606986B2 (en) | 2014-09-29 | 2017-03-28 | Apple Inc. | Integrated word N-gram and class M-gram language models |
US9886432B2 (en) | 2014-09-30 | 2018-02-06 | Apple Inc. | Parsimonious handling of word inflection via categorical stem + suffix N-gram language models |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US9646609B2 (en) | 2014-09-30 | 2017-05-09 | Apple Inc. | Caching apparatus for serving phonetic pronunciations |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US9208209B1 (en) | 2014-10-02 | 2015-12-08 | Sas Institute Inc. | Techniques for monitoring transformation techniques using control charts |
US10552013B2 (en) | 2014-12-02 | 2020-02-04 | Apple Inc. | Data detection |
US9715418B2 (en) | 2014-12-02 | 2017-07-25 | International Business Machines Corporation | Performance problem detection in arrays of similar hardware |
US9711141B2 (en) | 2014-12-09 | 2017-07-18 | Apple Inc. | Disambiguating heteronyms in speech synthesis |
US9418339B1 (en) | 2015-01-26 | 2016-08-16 | Sas Institute, Inc. | Systems and methods for time series analysis techniques utilizing count data sets |
US10152299B2 (en) | 2015-03-06 | 2018-12-11 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US9865280B2 (en) | 2015-03-06 | 2018-01-09 | Apple Inc. | Structured dictation using intelligent automated assistants |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US9899019B2 (en) | 2015-03-18 | 2018-02-20 | Apple Inc. | Systems and methods for structured stem and suffix language models |
US9842105B2 (en) | 2015-04-16 | 2017-12-12 | Apple Inc. | Parsimonious continuous-space phrase representations for natural language processing |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US10127220B2 (en) | 2015-06-04 | 2018-11-13 | Apple Inc. | Language identification from short strings |
US10101822B2 (en) | 2015-06-05 | 2018-10-16 | Apple Inc. | Language input correction |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10186254B2 (en) | 2015-06-07 | 2019-01-22 | Apple Inc. | Context-based endpoint detection |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US10255907B2 (en) | 2015-06-07 | 2019-04-09 | Apple Inc. | Automatic accent detection using acoustic models |
US10983682B2 (en) | 2015-08-27 | 2021-04-20 | Sas Institute Inc. | Interactive graphical user-interface for analyzing and manipulating time-series projections |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US9697820B2 (en) | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
DK179309B1 (en) | 2016-06-09 | 2018-04-23 | Apple Inc | Intelligent automated assistant in a home environment |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc | Intelligent task discovery |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
DK201770383A1 (en) | 2017-05-09 | 2018-12-14 | Apple Inc. | USER INTERFACE FOR CORRECTING RECOGNITION ERRORS |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
DK201770427A1 (en) | 2017-05-12 | 2018-12-20 | Apple Inc. | LOW-LATENCY INTELLIGENT AUTOMATED ASSISTANT |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
DK179549B1 (en) | 2017-05-16 | 2019-02-12 | Apple Inc. | FAR-FIELD EXTENSION FOR DIGITAL ASSISTANT SERVICES |
US20180336275A1 (en) | 2017-05-16 | 2018-11-22 | Apple Inc. | Intelligent automated assistant for media exploration |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10671266B2 (en) | 2017-06-05 | 2020-06-02 | 9224-5489 Quebec Inc. | Method and apparatus of aligning information element axes |
US10565444B2 (en) * | 2017-09-07 | 2020-02-18 | International Business Machines Corporation | Using visual features to identify document sections |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US10331490B2 (en) | 2017-11-16 | 2019-06-25 | Sas Institute Inc. | Scalable cloud-based time series analysis |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10338994B1 (en) | 2018-02-22 | 2019-07-02 | Sas Institute Inc. | Predicting and adjusting computer functionality to avoid failures |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10255085B1 (en) | 2018-03-13 | 2019-04-09 | Sas Institute Inc. | Interactive graphical user interface with override guidance |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
DK180639B1 (en) | 2018-06-01 | 2021-11-04 | Apple Inc | DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
DK201870355A1 (en) | 2018-06-01 | 2019-12-16 | Apple Inc. | VIRTUAL ASSISTANT OPERATION IN MULTI-DEVICE ENVIRONMENTS |
DK179822B1 (da) | 2018-06-01 | 2019-07-12 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11076039B2 (en) | 2018-06-03 | 2021-07-27 | Apple Inc. | Accelerated task performance |
US10685283B2 (en) | 2018-06-26 | 2020-06-16 | Sas Institute Inc. | Demand classification based pipeline system for time-series data forecasting |
US10560313B2 (en) | 2018-06-26 | 2020-02-11 | Sas Institute Inc. | Pipeline system for time-series data forecasting |
US10922271B2 (en) * | 2018-10-08 | 2021-02-16 | Minereye Ltd. | Methods and systems for clustering files |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH07319905A (ja) * | 1994-05-25 | 1995-12-08 | Fujitsu Ltd | Information retrieval device |
JP2572308B2 (ja) * | 1991-01-25 | 1997-01-16 | 株式会社テレマティーク国際研究所 | Review processing device |
JP2000242652A (ja) * | 1999-02-18 | 2000-09-08 | Nippon Telegr & Teleph Corp <Ntt> | Information trend retrieval method and device, and recording medium storing an information trend retrieval program |
JP2002163275A (ja) * | 2000-11-29 | 2002-06-07 | Matsushita Electric Ind Co Ltd | Technical document retrieval device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB9517988D0 (en) * | 1995-09-04 | 1995-11-08 | Ibm | Interactive visualisation aid |
JPH1153387A (ja) * | 1997-08-06 | 1999-02-26 | Ibm Japan Ltd | Document association method and system therefor |
US6526398B2 (en) * | 1999-05-28 | 2003-02-25 | Ricoh Co., Ltd. | Generating labels indicating gaps in retrieval of electronic documents |
US7395256B2 (en) * | 2003-06-20 | 2008-07-01 | Agency For Science, Technology And Research | Method and platform for term extraction from large collection of documents |
- 2005
  - 2005-09-12 KR KR1020077005827A patent/KR20070053246A/ko not_active Application Discontinuation
  - 2005-09-12 RU RU2007114059/09A patent/RU2007114059A/ru not_active Application Discontinuation
  - 2005-09-12 BR BRPI0515687-4A patent/BRPI0515687A/pt not_active IP Right Cessation
  - 2005-09-12 CA CA002589531A patent/CA2589531A1/en not_active Abandoned
  - 2005-09-12 EP EP05782121A patent/EP1806663A1/en not_active Withdrawn
  - 2005-09-12 WO PCT/JP2005/016785 patent/WO2006030751A1/ja active Application Filing
  - 2005-09-12 US US11/662,759 patent/US20080294651A1/en not_active Abandoned
  - 2005-09-12 JP JP2006535132A patent/JP4171514B2/ja not_active Expired - Fee Related
- 2008
  - 2008-06-09 JP JP2008150022A patent/JP2008269639A/ja active Pending
Non-Patent Citations (1)
Title |
---|
TAKEDA K. ET AL: "Text Joho no Kashika o Riyo shita Joho Kensaku", JOHO SHORI, vol. 41, no. 4, 15 April 2000 (2000-04-15), pages 343 - 350, XP002955618 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
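JP2008090510A (ja) * | 2006-09-29 | 2008-04-17 | Shin Etsu Polymer Co Ltd | Document classification device and document classification method |
JP2008234482A (ja) * | 2007-03-22 | 2008-10-02 | Nippon Telegr & Teleph Corp <Ntt> | Document classification device, document classification method, program, and recording medium |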
Also Published As
Publication number | Publication date |
---|---|
CA2589531A1 (en) | 2006-03-23 |
KR20070053246A (ko) | 2007-05-23 |
JP4171514B2 (ja) | 2008-10-22 |
EP1806663A1 (en) | 2007-07-11 |
RU2007114059A (ru) | 2008-10-27 |
JP2008269639A (ja) | 2008-11-06 |
BRPI0515687A (pt) | 2008-07-29 |
US20080294651A1 (en) | 2008-11-27 |
JPWO2006030751A1 (ja) | 2008-05-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2006030751A1 (ja) | Document correlation diagram creation device that arranges documents in time series | |
US8131087B2 (en) | Program and apparatus for forms processing | |
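JP5407169B2 (ja) | Clustering program, retrieval program, clustering method, retrieval method, clustering device, and retrieval device | |
CN101916382B (zh) | Image recognition method for plant leaves | |
CN107577785A (zh) | Hierarchical multi-label classification method suitable for legal recognition | |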
US20020078044A1 (en) | System for automatically classifying documents by category learning using a genetic algorithm and a term cluster and method thereof | |
US20120197908A1 (en) | Method and apparatus for associating a table of contents and headings | |
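CN110532379B (zh) | LSTM-based electronic information recommendation method using sentiment analysis of user comments | |
CN111324765A (zh) | Fine-grained sketch-based image retrieval method based on deep cascaded cross-modal correlation | |
JP6973782B2 (ja) | Standard item name setting device, standard item name setting method, and standard item name setting program | |
CN110738053A (zh) | News topic recommendation algorithm based on semantic analysis and supervised learning models | |
KR101224312B1 (ko) | Friend recommendation method for social networking service users, recording medium therefor, and social networking service and server using the same | |
CN104778157A (zh) | Method for generating multi-document summary sentences | |
JP4017354B2 (ja) | Information classification device and information classification program | |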
Steven et al. | The right sentiment analysis method of Indonesian tourism in social media Twitter | |
Luqman et al. | Subgraph spotting through explicit graph embedding: An application to content spotting in graphic document images | |
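CN110781300A (zh) | Tourism resource cultural characteristic scoring algorithm based on the Baidu Baike knowledge graph | |
WO2022038821A1 (ja) | Table structure recognition device and method | |
JP2008176489A (ja) | Text discrimination device and text discrimination method | |
JP2000250919A (ja) | Document processing device and program storage medium therefor | |
CN115630141B (zh) | Scientific and technical expert retrieval method based on community query and high-dimensional vector retrieval | |
CN100462966C (zh) | Device for creating a document correlation diagram in which documents are arranged in time series | |
CN113505223B (zh) | Method and system for identifying paid online posters ("internet water army") | |
CN106202116B (zh) | Text classification method and system based on rough sets and KNN | |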
Saund | A graph lattice approach to maintaining and learning dense collections of subgraphs as image features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1 Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | ||
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase | Ref document number: 2006535132 Country of ref document: JP |
WWE | Wipo information: entry into national phase | Ref document number: 1020077005827 Country of ref document: KR Ref document number: 200580030724.X Country of ref document: CN |
NENP | Non-entry into the national phase | Ref country code: DE |
WWE | Wipo information: entry into national phase | Ref document number: 2005782121 Country of ref document: EP |
WWE | Wipo information: entry into national phase | Ref document number: 2007114059 Country of ref document: RU Ref document number: 2823/DELNP/2007 Country of ref document: IN |
WWE | Wipo information: entry into national phase | Ref document number: 2589531 Country of ref document: CA |
WWP | Wipo information: published in national office | Ref document number: 2005782121 Country of ref document: EP |
WWE | Wipo information: entry into national phase | Ref document number: 11662759 Country of ref document: US |
ENP | Entry into the national phase | Ref document number: PI0515687 Country of ref document: BR |