WO2004066086A3 - Identifying similarities and history of modification within large collections of unstructured data - Google Patents

Info

Publication number
WO2004066086A3
Authority
WO
WIPO (PCT)
Prior art keywords
documents
representation
history
modification
dependencies
Prior art date
Application number
PCT/US2004/001530
Other languages
French (fr)
Other versions
WO2004066086A2 (en)
Inventor
Dwayne A Carson
Donato Buccella
Michael Smolsky
Original Assignee
Verdasys Inc
Dwayne A Carson
Donato Buccella
Michael Smolsky
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2003-01-23
Filing date
2004-01-21
Publication date
2005-01-20
Priority claimed from US10/738,919 (US6947933B2)
Application filed by Verdasys Inc, Dwayne A Carson, Donato Buccella, Michael Smolsky
Priority to JP2006501066A (JP4667362B2)
Priority to CA2553654A (CA2553654C)
Priority to EP04704049A (EP1590748A4)
Publication of WO2004066086A2
Publication of WO2004066086A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Abstract

A technique for efficiently representing dependencies between electronically stored documents, such as in an enterprise data processing system. A document distribution path is developed as a directional graph that represents the historic dependencies between documents and is constructed in real time as documents are created. The system preferably maintains a lossy hierarchical representation of the documents, indexed in a way that allows fast queries for similar but not necessarily equivalent documents. A distribution path, coupled with a document similarity service, can support a number of applications, such as a security solution capable of finding and restricting access to documents that contain information similar to that in existing files known to contain sensitive information.
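
The two ideas in the abstract, a dependency graph built as documents are created and a similarity lookup over a lossy representation, can be sketched roughly as follows. This is a minimal illustration and not the patented method: the names `DistributionPath`, `record`, `shingles`, `jaccard`, and `flag_documents` are hypothetical, and character shingles compared by Jaccard overlap are swapped in as a simple stand-in for the lossy hierarchical representation, whose details the abstract does not give.

```python
from collections import defaultdict, deque


class DistributionPath:
    """Directed graph: an edge src -> dst records that dst was derived from src."""

    def __init__(self):
        self.children = defaultdict(set)

    def record(self, src, dst):
        # Hypothetical hook, called when a document is copied or saved as a new file.
        self.children[src].add(dst)

    def descendants(self, root):
        # Breadth-first walk over every document derived from root.
        seen, queue = set(), deque([root])
        while queue:
            node = queue.popleft()
            for child in self.children.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


def shingles(text, k=5):
    # Lossy view of a document: the set of its k-character substrings,
    # taken after lowercasing and collapsing whitespace.
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def jaccard(a, b):
    # Set overlap in [0, 1]; 1.0 means identical shingle sets.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def flag_documents(path, contents, sensitive, threshold=0.5):
    """Return documents derived from, or similar to, known sensitive documents."""
    flagged = set(sensitive)
    for doc in sensitive:
        flagged |= path.descendants(doc)
    sensitive_sets = [shingles(contents[doc]) for doc in sensitive]
    for name, text in contents.items():
        if name in flagged:
            continue
        candidate = shingles(text)
        if any(jaccard(candidate, s) >= threshold for s in sensitive_sets):
            flagged.add(name)
    return flagged


# Example data (invented for illustration only).
path = DistributionPath()
path.record("plan.doc", "plan_v2.doc")
path.record("plan_v2.doc", "plan_final.doc")
contents = {
    "plan.doc": "acquisition target list for fiscal year 2004",
    "pasted.txt": "Acquisition target list for fiscal year 2005",
    "menu.txt": "lunch menu for the cafeteria this week",
}
flagged = flag_documents(path, contents, {"plan.doc"})
print(sorted(flagged))
# ['pasted.txt', 'plan.doc', 'plan_final.doc', 'plan_v2.doc']
```

In this sketch the graph catches documents whose lineage is known (`plan_v2.doc`, `plan_final.doc`), while the similarity check catches a near copy with no recorded lineage (`pasted.txt`). The all-pairs comparison is only suitable for small examples; the indexed representation described in the abstract is what is meant to make such queries fast at scale.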

Priority Applications (3)

Application Number | Publication | Priority Date | Filing Date | Title
JP2006501066A | JP4667362B2 (en) | 2003-01-23 | 2004-01-21 | Identifying similarity and revision history in large unstructured data sets
CA2553654A | CA2553654C (en) | 2003-01-23 | 2004-01-21 | Identifying similarities and history of modification within large collections of unstructured data
EP04704049A | EP1590748A4 (en) | 2003-01-23 | 2004-01-21 | Identifying similarities and history of modification within large collections of unstructured data

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US44246403P 2003-01-23 2003-01-23
US60/442,464 2003-01-23
US10/738,919 US6947933B2 (en) 2003-01-23 2003-12-17 Identifying similarities within large collections of unstructured data
US10/738,919 2003-12-17
US10/738,924 US7490116B2 (en) 2003-01-23 2003-12-17 Identifying history of modification within large collections of unstructured data
US10/738,924 2003-12-17

Publications (2)

Publication Number | Publication Date
WO2004066086A2 (en) | 2004-08-05
WO2004066086A3 (en) | 2005-01-20

Family

ID=32777026

Family Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
PCT/US2004/001530 | WO2004066086A2 (en) | 2003-01-23 | 2004-01-21 | Identifying similarities and history of modification within large collections of unstructured data

Country Status (4)

Country | Link
EP (1) | EP1590748A4 (en)
JP (1) | JP4667362B2 (en)
CA (1) | CA2553654C (en)
WO (1) | WO2004066086A2 (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JP4695388B2 (en) * | 2004-12-27 | 2011-06-08 | 株式会社リコー | Security information estimation apparatus, security information estimation method, security information estimation program, and recording medium
JP2006338147A (en) * | 2005-05-31 | 2006-12-14 | Ricoh Co Ltd | Document management device, document management method and program
JP4791776B2 (en) * | 2005-07-26 | 2011-10-12 | 株式会社リコー | Security information estimation apparatus, security information estimation method, security information estimation program, and recording medium
WO2007117643A2 (en) * | 2006-04-07 | 2007-10-18 | Mathsoft Engineering & Education, Inc. | System and method for maintaining the genealogy of documents
JP4895696B2 (en) * | 2006-06-14 | 2012-03-14 | 株式会社リコー | Information processing apparatus, information processing method, and information processing program
JP5003131B2 (en) | 2006-12-04 | 2012-08-15 | 富士ゼロックス株式会社 | Document providing system and information providing program
JP5023715B2 (en) * | 2007-01-25 | 2012-09-12 | 富士ゼロックス株式会社 | Information processing system, information processing apparatus, and program
JP2008305094A (en) * | 2007-06-06 | 2008-12-18 | Canon Inc | Documentation management method and its apparatus
JP5294002B2 (en) * | 2008-07-22 | 2013-09-18 | 株式会社日立製作所 | Document management system, document management program, and document management method
JP5213758B2 (en) * | 2009-02-26 | 2013-06-19 | 三菱電機株式会社 | Information processing apparatus, information processing method, and program
JP2011022705A (en) | 2009-07-14 | 2011-02-03 | Hitachi Ltd | Trail management method, system, and program
JP5264643B2 (en) * | 2009-07-28 | 2013-08-14 | 日本電信電話株式会社 | Content distribution monitoring method and system, and apparatus and program used in this system
JP5630193B2 (en) * | 2010-10-08 | 2014-11-26 | 富士通株式会社 | Operation restriction management program, operation restriction management apparatus, and operation restriction management method
JP5621490B2 (en) * | 2010-10-08 | 2014-11-12 | 富士通株式会社 | Log management program, log management apparatus, and log management method
US20120215908A1 (en) * | 2011-02-18 | 2012-08-23 | Hitachi, Ltd. | Method and system for detecting improper operation and computer-readable non-transitory storage medium
JP5701096B2 (en) * | 2011-02-24 | 2015-04-15 | 三菱電機株式会社 | File tracking apparatus, file tracking method, and file tracking program
JP5689174B2 (en) * | 2011-05-27 | 2015-03-25 | 株式会社日立製作所 | File history recording system, file history management device, and file history recording method
CN112199936B (en) * | 2020-11-12 | 2024-01-23 | 深圳供电局有限公司 | Intelligent analysis method and storage medium for repeated declaration of scientific research projects

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US5926812A (en) * | 1996-06-20 | 1999-07-20 | Mantra Technologies, Inc. | Document extraction and comparison method with applications to automatic personalized database searching
US5940830A (en) * | 1996-09-05 | 1999-08-17 | Fujitsu Limited | Distributed document management system
US6633882B1 (en) * | 2000-06-29 | 2003-10-14 | Microsoft Corporation | Multi-dimensional database record compression utilizing optimized cluster models

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JPH0581096A (en) * | 1991-09-19 | 1993-04-02 | Matsushita Electric Ind Co Ltd | Page deletion system for electronic filing device
JP3584540B2 (en) * | 1995-04-20 | 2004-11-04 | 富士ゼロックス株式会社 | Document copy relation management system
JPH0944432A (en) * | 1995-05-24 | 1997-02-14 | Fuji Xerox Co Ltd | Information processing method and information processor
JPH0950410A (en) * | 1995-06-01 | 1997-02-18 | Fuji Xerox Co Ltd | Information processing method and information processor
JPH10133934A (en) * | 1996-09-05 | 1998-05-22 | Fujitsu Ltd | Distributed document managing system and program storing medium realizing it
JP3832077B2 (en) * | 1998-03-06 | 2006-10-11 | 富士ゼロックス株式会社 | Document management device
JP3689593B2 (en) * | 1999-07-02 | 2005-08-31 | シャープ株式会社 | Content distribution management device and program recording medium
JP2001136363A (en) * | 1999-11-02 | 2001-05-18 | Nippon Telegraph & Telephone West Corp | Contents use acceptance managing method and its device

Also Published As

Publication number | Publication date
WO2004066086A2 (en) | 2004-08-05
CA2553654A1 (en) | 2004-08-05
JP2006516775A (en) | 2006-07-06
EP1590748A4 (en) | 2008-07-30
EP1590748A2 (en) | 2005-11-02
CA2553654C (en) | 2014-04-22
JP4667362B2 (en) | 2011-04-13

Legal Events

Date Code Title Description
AK Designated states
    Kind code of ref document: A2
    Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG UZ VC VN YU ZA ZM ZW
AL Designated countries for regional patents
    Kind code of ref document: A2
    Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG
121 EP: The EPO has been informed by WIPO that EP was designated in this application
WWE WIPO information: entry into national phase
    Ref document number: 2006501066
    Country of ref document: JP
WWE WIPO information: entry into national phase
    Ref document number: 2004704049
    Country of ref document: EP
WWP WIPO information: published in national office
    Ref document number: 2004704049
    Country of ref document: EP
DPEN Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed from 20040101)
WWE WIPO information: entry into national phase
    Ref document number: 2553654
    Country of ref document: CA