US20100146299A1 - System and method for confidentiality-preserving rank-ordered search - Google Patents

System and method for confidentiality-preserving rank-ordered search

Publication number
US20100146299A1
Authority
US
United States
Prior art keywords
data collection
rank
search
term frequency
contents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/608,724
Inventor
Ashwin Swaminathan
Yinian Mao
Guan-Ming Su
Hongmei Gou
Avinash L. Varna
Shan He
Min Wu
Douglas W. Oard
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Maryland at College Park
Original Assignee
Ashwin Swaminathan
Yinian Mao
Guan-Ming Su
Hongmei Gou
Avinash L. Varna
Shan He
Min Wu
Douglas W. Oard
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ashwin Swaminathan, Yinian Mao, Guan-Ming Su, Hongmei Gou, Avinash L. Varna, Shan He, Min Wu, Douglas W. Oard
Priority to US12/608,724 (US20100146299A1)
Publication of US20100146299A1
Priority to US14/104,652 (US20160154971A9)
Priority to US15/274,605 (US20170235736A1)
Assigned to UNIVERSITY OF MARYLAND, COLLEGE PARK. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SWAMINATHAN, ASHWIN, WU, MIN, MAO, YINIAN, GOU, HONGMEI, VARNA, Avinash L., HE, SHAN, OARD, DOUGLAS, SU, GUAN-MING
Priority to US17/112,874 (US11567950B2)
Current legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/008 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3236 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using cryptographic hash functions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2209/00 Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
    • H04L2209/60 Digital content management, e.g. content distribution

Definitions

  • This invention relates to information search and retrieval.
  • the instant invention relates to a system and method for information search and retrieval in large-scale encrypted databases, with a particular embodiment employing a confidentiality-preserving rank-ordered search.
  • A known method of protecting data from theft or intrusion is cryptographic encryption. If the contents of a data storage system are not encrypted, any outsider intruding into the system may gain knowledge of the data content. In addition to such outsider attacks, security measures must also be taken against potential insider attacks. For example, when data storage is outsourced to a third-party data center, system administrators and other personnel involved may not be trusted to hold decryption keys and thereby have access to the content of the data collections. When an authorized user remotely accesses the data collection to search and retrieve desired documents, the large size of the collections can often make it infeasible to transfer all encrypted data to the user's side and then perform decryption and search on the user's trusted computers. Therefore, new techniques are needed to encrypt and organize data collections in such a way as to allow the data center to perform effective and efficient search on encrypted data.
  • the searcher may be a scholar or a low-level analyst who wants to identify relevant documents from a private/classified collection, and may need clearance only for the top-ranked documents; the searcher may also be the opposing party during the document discovery phase of a litigation, who would request relevant documents from the content owner's digital collection (e.g. e-mails) be turned in.
  • Conventional practices to accommodate such searches on hard-copy collections are extremely time consuming, and often depend on human factors (e.g. searchers have limited memory and are bound by rules of privilege) that cannot all be directly extended to computerized practice. New algorithms and processes are thus needed to enable secure search for a variety of applications.
  • Keyword based approaches to reduce search complexity have been introduced at the expense of limited search capabilities confined by a keyword list identified beforehand.
  • the documents containing some of the keywords are first identified, and the keywords or the keyword indices are encrypted in a way that facilitates search and retrieval.
  • Secure indices based on Bloom filters have also been proposed to further enhance search efficiency, and conjunctive keyword based searches have been investigated.
  • the aforementioned techniques involve a high computational complexity, and target simple Boolean searches to identify the presence or absence of a term in encrypted text. Furthermore, the aforementioned techniques cannot be easily extended to more sophisticated relevance-ranked searches over large collections.
  • the inventors herein have thus recognized the need for balancing privacy and confidentiality with efficiency and accuracy, which pose significant challenges to the design of search schemes for a number of search scenarios and large data collections.
  • the inventors herein have also recognized the need for a system that focuses on secure and efficient rank-ordered search and retrieval over large data collections.
  • the confidentiality preserving rank-ordered search system and method of the invention focuses on secure and efficient rank-ordered search and retrieval over large data collections.
  • the system includes a framework to securely rank-order documents in response to a query, and techniques for extracting the most relevant document(s) from an encrypted data collection.
  • the system and method includes collection of term frequency information for each of the documents in the collection to build indices, as in traditional retrieval systems in plaintext. The system and method further includes securing of these indices that would otherwise reveal important statistical information about the collection to protect against statistical attacks.
  • the query terms may be encrypted to prevent the exposure of information to the data center and other intruders, and also confine the searching entity to only make queries within an authorized scope. Utilizing the term frequencies and other document information, schemes are developed herein to securely compute relevance scores of each document, identify the most relevant documents, and reserve the right to screen and release the full content of relevant documents.
  • the proposed framework is built upon well-studied cryptographic encryption and hashing primitives.
  • the system achieves search accuracy comparable to that of conventional searching systems designed for non-encrypted data.
  • other security issues such as protecting communication links and combating traffic analysis are addressed by appropriate security protocols and randomization.
  • the invention provides a confidentiality preserving system for performing a rank-ordered search and retrieval of contents of a data collection.
  • the system may include a computer system including a search and retrieval algorithm using term frequency and/or similar features for rank-ordering selective contents of the data collection, and enabling secure retrieval of the selective contents based on the rank-order.
  • the search and retrieval algorithm may generate a relevance score for the rank-ordering based on one or more queries.
  • the data collection and/or query may be encrypted.
  • the data collection may include documents and/or multi-media content.
  • the search and retrieval algorithm may include three algorithms: a baseline algorithm, a partially server oriented algorithm, and a fully server oriented algorithm.
  • the baseline algorithm may include a pre-processing algorithm for building a secure term frequency table and an inverse data collection frequency table, and a search stage algorithm for rank-ordering in response to a query.
  • the pre-processing algorithm may include stemming of selective components of the contents of the data collection and mapping of the stemmed components in the term frequency table.
  • the selective components may be words, and the data collection contents may be documents.
  • the search stage algorithm may include stemming of a query term, searching of the term frequency table, generation of a relevance score, rank ordering of the selective contents of the data collection based on the relevance score, and retrieval of the selective contents of the data collection based on the rank order.
  • the pre-processing and search stage algorithms may be executed at a user site remote from a data center for storing the data collection.
  • the partially server oriented algorithm may include performance of selective computations at a user site remote from a data center for storing the data collection.
  • the partially server oriented algorithm may include building of a term frequency table and/or generation of a relevance score at a user site remote from a data center for storing the data collection.
  • the fully server oriented algorithm may include building of a term frequency table at a user site, and generation of a relevance score at a secure computing unit and/or a data center for storing the data collection.
  • the partially and/or fully server oriented algorithms may enable search capability from a user other than an owner of the contents of the data collection.
  • the invention also provides a confidentiality preserving method for performing a rank-ordered search and retrieval of contents of a data collection.
  • the method may include using term frequency and/or similar features for rank-ordering selective contents of the data collection, and securely retrieving the selective contents based on the rank-order.
  • the method may further include generating a relevance score for the rank-ordering based on at least one query.
  • the method may further include encrypting the data collection and/or query.
  • the data collection may include documents and/or multi-media content.
  • the method may further include building a secure term frequency table and an inverse data collection frequency table by stemming of selective components of the contents of the data collection and mapping of the stemmed components in the term frequency table.
  • the selective components may include words.
  • the data collection contents may include documents.
  • the term frequency table may be generated at a user site remote from a data center for storing the data collection.
  • the method may further include stemming of a query term, searching of a term frequency table, generation of a relevance score, rank ordering of the selective contents of the data collection based on the relevance score, and retrieval of the selective contents of the data collection based on the rank order.
  • generation of the relevance score and rank ordering may be performed at a user site remote from a data center for storing the data collection.
  • the term frequency table and relevance score may be selectively generated at a user site remote from a data center for storing the data collection, and/or at a data center for storing the data collection.
  • the method may include using homomorphic encryption and/or order preserving encryption for enabling search capability from a user other than an owner of the contents of the data collection.
  • FIG. 1 is a diagram illustrating the confidentiality-preserving rank-ordered search system and method of the invention.
  • FIG. 2 is a diagram illustrating the generation and securing of index information.
  • FIG. 3 is a diagram illustrating search and retrieval for a confidentiality-preserving baseline model scheme according to the invention.
  • FIG. 4 is a diagram illustrating search and retrieval in a fully server oriented scheme according to the invention.
  • FIGS. 5A and 5B are examples of term frequency histograms, and FIGS. 5C and 5D are the corresponding histograms of the encrypted term frequency values.
  • FIG. 6 is a diagram illustrating the partially server oriented scheme according to the invention.
  • FIG. 7 is a precision-recall graph for the baseline scheme, and the order-preserving encryption scheme according to the invention.
  • FIG. 8 is a graph illustrating the difference in Mean Average Precision (MAP) between the baseline and order-preserving encryption schemes according to the invention.
  • FIG. 9 is a scatter plot of Mean Average Precision (MAP) values for the order-preserving encryption scheme with a different mapping table for each row of a TF table, plotted with respect to the baseline scheme.
  • FIG. 10 is a graph illustrating use of a modified Kendall distance measure for comparing top 20 and top 100 ranks obtained using the baseline and order-preserving encryption schemes according to the invention.
  • Referring to FIG. 1, a diagram illustrating the confidentiality-preserving rank-ordered search system and method of the invention is shown.
  • a content owner 100 e.g. a supervisor
  • the content owner may also grant another user 104 the permission to search and retrieve his/her documents through the data center.
  • the documents stored at the data center are encrypted at location 106 .
  • the content owner manages the content decryption keys and may provide decryption services upon the user's request. In the following discussion, a few application scenarios will be examined under this framework.
  • Case 1: The content owner wants to search for some documents stored at the data center. He/she has a limited bandwidth connection with the data center, and needs to search through the encrypted content without downloading the entire collection. Furthermore, the content owner does not trust the data center with his/her unencrypted content. He/she wants to remotely search and retrieve top-ranked relevant documents without revealing the search terms, document content, and/or document index information to the data center.
  • This scenario will be referred to as the confidentiality preserving baseline model, as discussed below, where the scheme enables both the confidentiality protection and the use of term frequency (discussed below) to achieve secure and efficient retrieval.
  • Case 2: Next, consider the scenario where a user who is not the content owner wants to search for a particular phrase in the set of confidential documents held by the data center.
  • This scenario may arise in a number of cases, for example, where the user may be a scholar or a low-level analyst who wants to search relevant documents from a private/classified collection, and may need clearance only for the top-ranked documents.
  • the user may also be the opposing side in a litigation requesting relevant documents from a digital collection (e.g. e-mails) be turned in by the content owner's side.
  • the content owner does not trust the data center with the document content or the term frequency values.
  • the data center has a secure computing unit (SCU), which is trusted by the content owner to some degree.
  • Case 2a: the content owner trusts the SCU with both the plain-text documents and the associated term-frequency table (discussed below).
  • Case 2b: the content owner trusts the SCU with the plain-text term-frequency values, but not with the plain-text documents.
  • Case 2c: the content owner does not trust the SCU with either the term-frequency values or the documents in plain-text form, but trusts the SCU with certain computations to be performed on an encrypted version of the term-frequency (TF) table without disclosing the exact values.
  • the content owner trusts the SCU with the term frequency values.
  • the SCU can be considered as a heavily guarded “Maximum-Security Computing Unit” (MaxSCU) in the data center that can be used to decrypt the term frequency (TF) table, compute relevance scores using EQ-1 (see below), and rank-order the documents based on these values.
  • the baseline model introduced under the Confidentiality Preserving Baseline Model section can be the solution under this scenario.
  • the MaxSCU is a critical link of the overall system security and may be subject to heavy attacks, and as such, it can be expensive to design and maintain such a unit hosted in a data center.
  • N^(D) documents in which N^(T) unique terms appear.
  • the term frequency information for all terms and all documents can be organized as a table at location 110 of size N^(T) × N^(D), in which the entry at the i-th row and j-th column indicates the number of occurrences of the i-th term in the j-th document.
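  • As an illustration of the table layout described above, the following is a minimal plaintext sketch in which rows are stemmed terms and columns are documents. The helper names `simple_stem` and `build_tf_table` are hypothetical, and the toy suffix-stripping stemmer is only a stand-in for whatever stemming algorithm an implementation would actually use.

```python
from collections import Counter

def simple_stem(word):
    """Hypothetical toy stemmer: lowercases and strips a few common endings."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_tf_table(documents):
    """Return (terms, table) where table[i][j] counts term i in document j."""
    counts = [Counter(simple_stem(w) for w in doc.split()) for doc in documents]
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    table = [[c[t] for c in counts] for t in terms]
    return terms, table

terms, table = build_tf_table(["secure search search", "ranked search"])
```

    Each row of `table` corresponds to one unique stemmed term, so the table has one row per term and one column per document, matching the layout described above.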
  • Term frequency has been employed as a core variable to define the relevance score in rank-ordering documents in a collection.
  • One example metric is the Okapi relevance score CW (i, j), which is defined as:
  • $$ CW(i, j) = \frac{CFW(i) \cdot TF(i, j) \cdot (K_1 + 1)}{K_1 \cdot \left( (1 - b) + b \cdot NDL(j) \right) + TF(i, j)}, \qquad \text{(EQ-1)} $$
  • N(i) is the number of documents containing the i-th term
  • the CFW plays an equivalent role as the inverse document frequency used in some information retrieval schemes. It can be either pre-computed or obtained concurrently from the term frequency table.
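  • The relevance score of EQ-1 can be sketched in a few lines. This is a hedged illustration, not the patented implementation: the form CFW(i) = log(N / N(i)) is an assumption (the usual collection frequency weight, where N is the total number of documents), NDL(j) is assumed to be the length of document j divided by the average document length, and the default values of K_1 and b are commonly used Okapi settings that the text above does not specify.

```python
import math

def cfw(n_docs, n_docs_with_term):
    """Assumed collection frequency weight: log(N / N(i))."""
    return math.log(n_docs / n_docs_with_term)

def okapi_cw(cfw_i, tf_ij, ndl_j, k1=1.2, b=0.75):
    """Okapi relevance score CW(i, j) following the structure of EQ-1.

    k1 and b are tuning constants; the defaults here are assumptions.
    """
    return cfw_i * tf_ij * (k1 + 1) / (k1 * ((1 - b) + b * ndl_j) + tf_ij)

# Example: a term found in 10 of 1000 documents, occurring 3 times in a
# document of average length (NDL = 1.0).
score = okapi_cw(cfw(1000, 10), tf_ij=3, ndl_j=1.0)
```

    For a fixed document length, the score grows with the term frequency and saturates as TF(i, j) becomes large, which is the property the order-preserving scheme relies on for single-term queries.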
  • $$ \sum_{i_k = i_1}^{i_M} CW(i_k, j), \quad \forall j, $$
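  • A minimal sketch of the multi-term aggregation: the per-term scores CW(i_k, j) are summed over the query terms for every document j, and the documents are then sorted by the total. The function name `rank_documents` and the input layout (one list of per-document scores per query term) are assumptions made for illustration.

```python
def rank_documents(cw_rows):
    """cw_rows[k][j] holds CW(i_k, j); returns (order, scores).

    order lists document indices from most to least relevant.
    """
    n_docs = len(cw_rows[0])
    scores = [sum(row[j] for row in cw_rows) for j in range(n_docs)]
    order = sorted(range(n_docs), key=lambda j: scores[j], reverse=True)
    return order, scores

order, scores = rank_documents([[0.5, 1.0, 0.0],
                                [0.25, 0.0, 2.0]])
```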
  • the term frequency table and indices may be secured at location 112.
  • the confidentiality preserving system and method of the invention includes a unique framework for performing ranked search securely and efficiently without revealing the indexing information.
  • in the baseline scheme, it is assumed that the data center can only be trusted with data storage and should not be allowed to obtain any information about the stored data.
  • the baseline model is proposed that involves multiple rounds of interaction between the client and server to obtain the relevant information pertaining to a query.
  • the proposed framework may include two major stages, a pre-processing stage for building a secure term frequency table and an inverse document frequency table, and a search stage for rank-ordering documents in response to a particular query while preserving the confidentiality of term frequency information.
  • the pre-processing is executed once by the content owner, when he/she stores the documents, all in encrypted form, in the data center.
  • the major task of the pre-processing stage is to build a secure term frequency table and an inverse document frequency table, so as to facilitate efficient and accurate information retrieval.
  • both the search term and its term frequency information are in plain text. To protect the confidentiality of the search, both of them may be encrypted in an appropriate way.
  • Referring to FIG. 2, a diagram illustrating the generation and securing of index information for the baseline model is shown.
  • a word w in a document first undergoes stemming at location 130, which retains the word root and removes the word ending, to obtain w_S.
  • the word key may be unique to each stemmed word and is obtained using the stemmed word and a pre-defined master key.
  • the term frequency information is collected by counting the number of occurrences of the stemmed word in the j-th document and is stored in the table entry TF(i, j) at location 136.
  • K_i^(TF) denotes the key used to encrypt the i-th row of the term frequency table TF^(s).
  • the value of K_i^(TF) is unique for each row and is derived from the word-key K_{w_S} corresponding to the i-th row. Thus, even if the key corresponding to one row is compromised, no information can be obtained about other rows of the term frequency table.
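  • The key hierarchy described above (master key, then a per-word key, then a per-row key) can be sketched as follows. The text does not name a derivation function, so the use of HMAC-SHA256 here is an assumption, and the function names and the `b"word|"` and `b"tf-row"` labels are hypothetical.

```python
import hashlib
import hmac

def derive_word_key(master_key: bytes, stemmed_word: str) -> bytes:
    """Word key from the master key and the stemmed word (HMAC is an assumption)."""
    message = b"word|" + stemmed_word.encode("utf-8")
    return hmac.new(master_key, message, hashlib.sha256).digest()

def derive_row_key(word_key: bytes) -> bytes:
    """Row key for the term frequency row, derived from that row's word key."""
    return hmac.new(word_key, b"tf-row", hashlib.sha256).digest()

master = b"example master key (illustration only)"
row_key = derive_row_key(derive_word_key(master, "search"))
```

    Because each row key is derived through a one-way function of a word-specific key, learning one row key does not reveal the master key or the keys of other rows, which is the isolation property stated above.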
  • Referring to FIG. 3, a diagram illustrating search and retrieval for the confidentiality-preserving baseline model scheme is shown.
  • when searching for a particular word w in the collection, the content owner first performs stemming at location 170 to obtain the stemmed word w_S.
  • the word-key is then derived from the master key and used to encrypt the stemmed word w_S to obtain w_S^(e).
  • the hash value of w_S^(e) is calculated at location 172 and sent to the data center.
  • the content owner further computes relevance scores at location 178 from the term frequency values as in EQ-1, rank-orders the documents based on the score, and requests the most relevant documents from the data center at locations 180 , 182 .
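  • The baseline flow (encrypt the stemmed word, hash it, look up the encrypted row at the data center, decrypt at the owner's side) can be sketched end to end. This is an illustration under stated assumptions only: `toy_cipher` is a toy XOR stream cipher standing in for a real cipher and must not be used in practice, HMAC-SHA256 and SHA-256 are assumed primitives, the data center is modelled as a plain dictionary, all function names are hypothetical, and term frequencies are limited to 0..255 so that a row fits in a byte string.

```python
import hashlib
import hmac

def _prf(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (same call encrypts and decrypts); NOT secure."""
    blocks = range(len(data) // 32 + 1)
    stream = b"".join(_prf(key, i.to_bytes(8, "big")) for i in blocks)
    return bytes(a ^ b for a, b in zip(data, stream))

def lookup_hash(master: bytes, stem: str) -> str:
    """Hash of the encrypted stemmed word; the only query value sent out."""
    word_key = _prf(master, stem.encode())
    return hashlib.sha256(toy_cipher(word_key, stem.encode())).hexdigest()

def _row_key(master: bytes, stem: str) -> bytes:
    return _prf(_prf(master, stem.encode()), b"tf-row")

def preprocess(master: bytes, tf_rows: dict) -> dict:
    """Owner side: build the protected table stored at the data center."""
    return {lookup_hash(master, stem): toy_cipher(_row_key(master, stem), bytes(row))
            for stem, row in tf_rows.items()}

def search(master: bytes, stem: str, store: dict) -> list:
    """Owner side: fetch the encrypted row by hash, then decrypt it locally."""
    encrypted_row = store.get(lookup_hash(master, stem))
    if encrypted_row is None:
        return []
    return list(toy_cipher(_row_key(master, stem), encrypted_row))

store = preprocess(b"master", {"search": [2, 1, 0], "rank": [0, 1, 3]})
tf_row = search(b"master", "search", store)
```

    The decrypted row would then feed the relevance score computation and rank-ordering performed by the content owner; the dictionary keys and values held by the data center never contain a plaintext term or count.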
  • a query consists of multiple terms, w(i_1), w(i_2), . . .
  • the data center does not get access to the unencrypted content at any point during either the pre-processing stage or the search and retrieval stage.
  • the data center does not know the term frequency information because it is stored in encrypted form.
  • the only information that the data center gains from the search process is the retrieval log.
  • the retrieval log at most contains data on which user searched for what encrypted queries, when and how often.
  • the data center may also learn which documents were requested pertaining to the encrypted search queries. Based on such information collected over a period of time, the data center may launch some kinds of statistical attacks. However, such attacks can be easily mitigated by the content owner, by adding to his/her requests some phantom terms and document indices to obfuscate the access statistics of his/her intended terms and documents.
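  • One way to picture the phantom-request countermeasure is to mix decoy items into each request before it is sent. The sketch below is a hypothetical illustration: `pad_request` and its parameters are assumed names, and decoys are drawn from a pool of other valid items (for example, hashes of other indexed terms or other document indices) so that they look like genuine requests to the data center.

```python
import random

def pad_request(intended, candidates, n_phantoms, rng=None):
    """Return the intended items plus up to n_phantoms decoys, shuffled."""
    rng = rng or random.SystemRandom()
    wanted = set(intended)
    pool = [c for c in candidates if c not in wanted]
    phantoms = rng.sample(pool, min(n_phantoms, len(pool)))
    request = list(intended) + phantoms
    rng.shuffle(request)
    return request

# The owner keeps track of which items were intended and discards the rest.
request = pad_request(["doc-17"], ["doc-3", "doc-17", "doc-42", "doc-88"], 2)
```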
  • the content owner can also hide his/her identity by introducing a proxy in his/her connection link with the data center.
  • Encoding the term frequency rows helps reduce the bandwidth required for its transmission during the search phase.
  • Value-precision encoding is used herein to compress the term-frequency rows, wherein the position and the value of every non-zero entry in a term-frequency row are encoded.
  • the results with 200,000 e-mails from the Enron e-mail corpus suggest that the average size of the compressed term frequency rows is 435 bytes, and 86% of them have a size within 200 to 300 bytes (see B. Klimt and Y. Yang, “Introducing the Enron Corpus,” Conf. on Email and Anti-Spam (CEAS), Mountain View, Calif., 2004).
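  • A minimal sketch of encoding only the position and value of each non-zero entry of a sparse term-frequency row. The function names are hypothetical, and the byte-level packing an actual implementation would use is not specified in the text, so plain (position, value) pairs are used here.

```python
def encode_row(row):
    """Keep only (position, value) pairs for the non-zero entries."""
    return [(pos, val) for pos, val in enumerate(row) if val != 0]

def decode_row(pairs, length):
    """Rebuild the full row; the row length must be known to the decoder."""
    row = [0] * length
    for pos, val in pairs:
        row[pos] = val
    return row

pairs = encode_row([0, 0, 3, 0, 0, 0, 1, 0])
```

    Since most terms occur in only a small fraction of the documents, the number of pairs is typically far smaller than the number of columns, which is what reduces the bandwidth needed to transmit a row.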
  • the required bandwidth in transmitting the term frequency rows can also be minimized.
  • the CFW can be computed before-hand and encrypted using the same word key as in the term frequency table.
  • the CFW is then stored in the data center separately from the term frequency. It can be sent to the content owner along with the term frequency rows during relevance computation. If the relevance score is computed by the data center, the CFW can be stored in the data center in clear-text form.
  • the baseline model previously introduced provides secure and effective search for scenarios in which the content owner makes a query himself/herself.
  • two different schemes, namely homomorphic encryption and order-preserving encryption (each discussed in greater detail below), are presented for enabling search capability for a user other than the content owner.
  • These schemes reduce the involvement of the content owner either partially or completely by shifting the task of computing the relevance score to the data center, while still maintaining the confidentiality of the term frequency information and the document content.
  • an additional layer of encryption on the term frequency information is designed. This additional layer of encryption is referred to as the inner-layer encryption.
  • Two different types of inner-layer encryptions/schemes, namely, homomorphic encryption and order-preserving encryption are discussed herein.
  • TF^(s) is encoded to obtain TF_C^(s), and further encrypted to obtain TF_C^(e) in the same way as in the baseline scheme.
  • This second round of encryption is referred to as outer-layer encryption, which prevents unauthorized users from accessing term frequency information.
  • FIG. 4 is a diagram illustrating search and retrieval in the fully server oriented scheme according to the invention.
  • the indexing and pre-processing stages of the proposed schemes are similar to the baseline model with an additional inner-layer encryption, and the searching stage is shown in FIG. 4 .
  • When searching the collection for a particular query consisting of multiple terms, w(i_1), w(i_2), . . . , w(i_M), the user first performs stemming to obtain the corresponding stemmed words. The user then sends the stemmed words to the content owner, who checks at location 210 whether the user has the required permission to search for the query words.
  • the hash value of w_S(i_k)^(e) is calculated and transmitted to the user, who forwards it to the data center.
  • using the received hash values H(w_S(i_k)^(e)) from location 212, the data center searches the protected term frequency table TF_C^(e) at location 214 and identifies the rows corresponding to the query words. In this way, the data center does not get any information about the query.
  • After the data center identifies the target rows from the term frequency table TF_C^(e), it uses the Secure Computing Unit (SCU) to decrypt and decode them at location 216, and subsequently obtains the corresponding rows of the term frequency table TF^(s) that are protected by the inner-layer encryption algorithms. During this stage, the encrypted rows, TF^(s), are retained within the SCU and not revealed to the data center. The SCU then performs part or all of the computation of the relevance scores at location 218 in the encrypted domain, as shown in FIG. 4. In the homomorphic encryption based scheme (HME), the computation results are then sent to the content owner, who decrypts the results, obtains the relevance score, and rank-orders the documents.
  • HME is also referred to as the partially server oriented scheme.
  • the order of the relevant documents pertaining to the user's query is sent back to the data center, which gives the user the corresponding documents at location 220.
  • In the order preserving encryption based scheme (OPE), the entire computational burden is shifted to the SCU, which computes relevance scores, rank-orders the documents, and directly sends back to the user the most relevant documents with their ranking information.
  • the OPE is also referred to as the fully server oriented scheme.
  • the main difference between the HME and the OPE schemes is the additional round of communication between the data center and the content owner, and the need of using the content owner's decryption key. As discussed below, the need for this additional round of communication can be offset at the cost of slightly reduced retrieval accuracy. In the following sections, details of the OPE and HME schemes are discussed.
  • OPE order preserving encryption scheme
  • Order preserving encryption is applied on TF(i, j) to obtain encrypted TF (s) (i, j) in the inner-layer encryption step, i.e., if TF(i, j) < TF(i, k), then TF (s) (i, j) < TF (s) (i, k). Due to the monotonicity of the relevance score function in EQ-1, as long as the order of relevance scores (or the order of term frequency values) is preserved, rather than their exact values, the correct search results can be obtained for queries that involve only one term. Based on the experimental analysis on the Enron e-mail corpus discussed earlier, generally peaked histograms are observed for the term frequency values over a large number of rows, and some examples are shown in FIGS. 5A and 5B.
  • the encryption is performed row by row for each of the N (TF) terms.
  • the generally peaking structure of term frequency distribution reflects that there are a large number of entries having the same term frequency value in individual row of the term frequency table.
  • the random mapping range [tf l , tf u ] for a term frequency value tf is adaptively determined according to the distribution of row term frequency values, so that an approximately uniform distribution can be obtained for the encrypted term frequency values TF (s) (i, j). More specifically, the width of the random mapping range [tf l ,tf u ] is chosen proportional to the counts of tf l in that particular row. The values of tf l and tf u are then determined with 0 ⁇ tf l ⁇ tf u ⁇ 2 B and the constraint in EQ-3. In this way, an approximately uniform distribution can be obtained for the encrypted TF (s) (i, j) at individual rows.
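  • The adaptive random mapping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `build_ope_mapping` and `encrypt_row`, the parameter `bits` (standing in for B), and the choice of disjoint, contiguous sub-ranges are all assumptions, and the constraint of EQ-3 is taken here to mean only that ranges for larger term frequency values lie strictly above ranges for smaller ones.

```python
import random
from collections import Counter


def build_ope_mapping(row, bits=16):
    """Assign each distinct term frequency in a row a sub-range of [0, 2**bits).

    Ranges are laid out in increasing term-frequency order and do not overlap,
    so order is preserved. Each range's width is proportional to how often the
    value occurs in the row, which flattens the encrypted histogram.
    """
    space = 2 ** bits
    total = len(row)
    if total == 0 or total > space:
        raise ValueError("row must be non-empty and no longer than 2**bits")
    counts = Counter(row)
    mapping = {}
    lower = 0
    cumulative = 0
    for tf in sorted(counts):
        cumulative += counts[tf]
        upper = space * cumulative // total  # exclusive upper bound
        mapping[tf] = (lower, upper)
        lower = upper
    return mapping


def encrypt_row(row, mapping, rng):
    """Replace each term frequency by a random value drawn from its range."""
    return [rng.randrange(*mapping[tf]) for tf in row]
```

  • Because equal term frequency values are mapped to different random points inside the same range, repeated values in a row no longer show up as a single spike in the encrypted histogram, while comparisons between different values still come out the same way as in the plaintext row.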
  • FIGS. 5A and 5B are examples of term frequency histograms
  • FIGS. 5C and 5D are the corresponding histograms of the encrypted term frequency values.
  • the server will perform part of the combination, which is then sent back to the user side for obtaining the final relevance scores.
  • the basis for the partially server oriented scheme is that in some scenarios such as that of a mobile computing unit, the computation power of the client and the bandwidth of the communication channel may be severely limited and the MedSCU can help perform certain computations in a secure manner. Hence, the amount of data transferred between the client and server and the amount of computation to be performed by the client should be minimized.
  • FIG. 6 is a diagram illustrating the partially server oriented scheme according to the invention.
  • When searching for a particular word w in the database, the user side first performs stemming at location 240 to obtain the corresponding stemmed word w S .
  • the word-key is then derived from the master key and used to encrypt the stemmed-word w S to be w S (e) at location 242 .
  • the hash value of w S (e) is calculated at location 244 and transmitted to the server side.
  • the server can search the protected term frequency table TF C (e) at location 246 and identify the row corresponding to the query word w.
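  • The query-side steps at locations 242 to 246 (derive a word-key, encrypt the stemmed word, hash it, look up the row) can be sketched as below. This is only an illustration under stated assumptions: HMAC-SHA256 is used here as a stand-in both for the word-key derivation and for the deterministic encryption of the stemmed word, and the names `query_token`, `build_index`, and `lookup` are hypothetical. The text does not specify these primitives.

```python
import hashlib
import hmac


def query_token(master_key: bytes, stemmed_word: str) -> str:
    """Return the hash the user would transmit for one stemmed query word."""
    word = stemmed_word.encode("utf-8")
    # Assumed key derivation: word-key = HMAC(master key, label || word).
    word_key = hmac.new(master_key, b"word-key|" + word, hashlib.sha256).digest()
    # Assumed stand-in for the encrypted stemmed word w_S^(e).
    encrypted_word = hmac.new(word_key, word, hashlib.sha256).digest()
    # Hash of the encrypted word; this is all the server ever sees.
    return hashlib.sha256(encrypted_word).hexdigest()


def build_index(master_key: bytes, protected_rows: dict) -> dict:
    """Key each protected term-frequency row by its token (done by the owner)."""
    return {query_token(master_key, w): row for w, row in protected_rows.items()}


def lookup(index: dict, token: str):
    """Server-side search: find the row for a token without learning the word."""
    return index.get(token)
```

  • In this sketch the server stores only tokens and opaque rows, so a lookup succeeds or fails without the plaintext query word ever being sent.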
  • After the server identifies the target row TF C (e) (k,.) at location 246 from the term frequency table TF C (e) , in the partially server oriented scheme, the server itself decrypts and decompresses it at locations 248 , 250 and subsequently obtains term frequencies TF (s) (k,.) that are protected with inner-layer encryption algorithms. The server then performs part of or all the computation at location 252 in finding the relevance scores in the encrypted domain. After that, the server sends the computation results back to the user side at location 254 , which then decrypts the received results and further rank-orders the documents. The encrypted documents are then obtained at location 256 , and returned to the user at location 258 for decryption.
  • For a query submitted by the user, the server first extracts the corresponding term-frequency rows stored in the encrypted format. For each of the identified rows, TF C (e) (i,.), the server decrypts it using the word key and then decompresses it to obtain TF (s) (i,.) with an inner-layer encryption. Then, in this encrypted domain, at location 252 as discussed above, the server performs certain computations toward finding the relevance scores. The computation results are then sent back to the user, who uses the decryption keys to find the actual values of the relevance scores at location 254 . The user then rank orders the documents using the derived relevance scores and requests the most pertinent documents from the server at location 256 .
  • the partially server oriented scheme also involves two rounds of communication.
  • the user sends the query word(s) and gets the encrypted relevance scores from the server.
  • the user then processes the results to find the relevant documents and requests the documents in the second round.
  • this method does not require transmission of all term frequency files related to a query. Therefore, it needs much lower bandwidth in the searching process and would be feasible for low-bandwidth scenarios.
  • The RSA public-key cryptosystem involves a public key (n, e) and a private key (n, d) such that e·d ≡ 1 (mod φ(n)), where φ is Euler's totient function.
  • the RSA encryption scheme has the following property:
  • EQ-5 is used to compute the relevance score for document D(j) for each word in the query, CW(i 1 , j),CW(i 2 , j), . . . , CW(i M , j) and the final relevance score is calculated by
  • $CW(j) = CW(i_1, j) + CW(i_2, j) + \cdots + CW(i_M, j)$ (EQ-7)
  • the client sends the query with terms and the corresponding keys K i 1 (TF) , to the server.
  • To compute CW(i m ,j), TF (e) (i m ,.) is decrypted using the decryption function D and key K i m (TF) and decompressed to obtain TF (s) (i m ,.).
  • the server then performs the following computation to obtain the encrypted values of the relevance scores
  • the client then requests the relevant files from the server.
  • the RSA based scheme has the advantage that the relevance scores are computed on the server without sacrificing security. However, the amount of data that needs to be transferred to the client is still proportional to the number of terms in the query. This is due to the fact that the only operation that is homomorphic in RSA is multiplication, which limits the operations that can be performed on the server without sacrificing security. To overcome this limitation, a scheme based on a homomorphic encryption scheme may be utilized, as discussed below.
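  • The multiplicative homomorphism of RSA mentioned above can be checked with a small sketch. The parameters below are a textbook toy key chosen for illustration only (no padding, far too small to be secure), and the function names are assumptions rather than part of the described system.

```python
# Toy textbook RSA parameters (illustrative only, not secure).
P, Q = 61, 53
N = P * Q          # modulus n = 3233
E = 17             # public exponent
D = 2753           # private exponent; E * D = 46801 = 15 * 3120 + 1, phi(n) = 3120


def rsa_encrypt(m: int) -> int:
    return pow(m, E, N)


def rsa_decrypt(c: int) -> int:
    return pow(c, D, N)


def multiply_encrypted(c1: int, c2: int) -> int:
    """Multiply two ciphertexts; the result decrypts to the product mod n."""
    return (c1 * c2) % N
```

  • For example, `rsa_decrypt(multiply_encrypted(rsa_encrypt(7), rsa_encrypt(12)))` equals `(7 * 12) % N`. There is no corresponding operation on ciphertexts that yields the encrypted sum, which is why the per-term scores in the RSA based scheme cannot be added on the server.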
  • the function gK is homomorphic with respect to addition and multiplication operations. Division can then be performed by treating it as operations on rational numbers, and the numerator and denominator terms can be computed separately as follows:
  • $$g\left(\frac{x_1}{x_2} + \frac{x_3}{x_4}\right) = \frac{g(x_1) \cdot g(x_4) + g(x_2) \cdot g(x_3)}{g(x_2) \cdot g(x_4)} \qquad \text{(EQ-9)}$$
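  • A small sketch of how EQ-9 lets a sum of fractions be formed on encrypted values, with numerator and denominator tracked separately. The function gK is not specified in this passage, so the code substitutes a deliberately simple toy scheme, g(x) = x + r·p mod (p·q) with decryption by reduction mod p, which is additively and multiplicatively homomorphic but is not claimed to be the scheme of the invention or to be secure. All names and parameters are hypothetical.

```python
import random

SECRET_P = 2**31 - 1            # secret modulus; plaintext results must stay below it
PUBLIC_N = SECRET_P * (2**61 - 1)  # public modulus used for ciphertext arithmetic


def g(x: int, rng: random.Random) -> int:
    """Toy homomorphic encryption: hide x by adding a random multiple of SECRET_P."""
    return (x + rng.randrange(1, 2**61 - 1) * SECRET_P) % PUBLIC_N


def g_inverse(c: int) -> int:
    """Decrypt by reducing modulo the secret modulus."""
    return c % SECRET_P


def add_encrypted_fractions(num1: int, den1: int, num2: int, den2: int):
    """EQ-9 on ciphertexts: g(x1)/g(x2) + g(x3)/g(x4) as (numerator, denominator)."""
    numerator = (num1 * den2 + den1 * num2) % PUBLIC_N
    denominator = (den1 * den2) % PUBLIC_N
    return numerator, denominator
```

  • With x1/x2 = 3/4 and x3/x4 = 5/6, the decrypted numerator and denominator are 38 and 24, that is 19/12. The party performing the arithmetic never divides; only the key holder converts the pair into a numeric score.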
  • the values of the constants C 1 (i) and C 2 (j) are also computed and stored along with the encrypted term frequency rows TF (e) (i,.).
  • In the search phase, suppose that a query contains the terms, w(i 1 ), w(i 2 ), . . . , w(i M ); for each term in the query, the SCU decrypts and decodes the corresponding term frequency row to obtain TF (s) (i m ,.). It then obtains the numerator and denominator of gK(CW(i m , j)) for each query term using
  • $g_K(\mathrm{Num}(i_m, j)) = C_1(i_m) \cdot g_K(TF(i_m, j))$ (EQ-11)
  • the overall encrypted value of the relevance score, gK(CW(j)), is then obtained by adding the relevance scores in the encrypted domain and can be shown to be
  • the exact value of the relevance score cannot be computed by the SCU, and the numerator and denominator of gK(CW(j)) are sent to the content owner/supervisor.
  • the content owner decrypts with the secret key to obtain the actual numeric values of Num(j) and Den(j) to compute the relevance score for each document.
  • the content owner sorts the relevance scores and sends the list of relevant documents to the data center who retrieves them from his/her collection for the user.
  • The proposed symmetric homomorphic encryption based scheme has the advantage that the amount of data transferred between the server and the client is independent of the number of terms in the query. Also, the amount of computation that has to be performed on the client side is reduced by shifting most of the computation to the server side. However, this necessitates that the keys used for encrypting the rows of the Term Frequency table TF(i,.), K i (s) , be the same. In contrast, the RSA based scheme does not require that the keys used for encrypting the rows of the term frequency table be the same. The consequence is the relatively larger amount of data that needs to be transferred from the server to the client. Thus, depending on the usage scenario, the user may choose one of the two options.
  • Performance of the homomorphic encryption (HME) scheme, the order-preserving encryption (OPE) scheme, and the baseline model will now be compared in terms of security and retrieval accuracy, and the tradeoffs involved in securing the term frequency using order preserving encryption will be examined.
  • the retrieval accuracies of the secure search schemes will be evaluated on the W3C collection, and the 59 queries used for the discussion search in the enterprise track in the 2005 Text Retrieval Conference (TREC). Any document that is judged partially relevant or relevant is taken to be relevant (i.e. conflating the top two judgment levels).
  • the performance of the HME scheme should be identical to the baseline model as it also has the accurate term frequency information to compute the relevance score.
  • By introducing the order-preserving encryption on row term frequency values, the OPE scheme enables document search on the data center side while preventing it from learning the critical term frequency information.
  • The OPE scheme can search as effectively as the baseline model by accurately identifying the target documents. This is because the order of term frequency values is preserved after the inner-layer encryption, and the relevance score is a strictly increasing function of the term frequency. As the number of terms in a query increases, however, the order may not be completely preserved when the scores of all terms are summed.
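  • The effect described above can be seen in a tiny worked example. The sketch makes two simplifying assumptions that are not from the text: the relevance score of a term is taken to be its raw (or encrypted) term frequency, and each row uses its own hand-picked order-preserving mapping. Document names, values, and function names are hypothetical.

```python
def rank(scores: dict) -> list:
    """Return document ids sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def summed_scores(table: dict, terms: list) -> dict:
    """Add the per-term values of each document over the given query terms."""
    docs = set().union(*(table[t].keys() for t in terms))
    return {d: sum(table[t].get(d, 0) for t in terms) for d in docs}


# Plaintext term frequencies: one row per term, one column per document.
plain = {"term1": {"A": 1, "B": 2}, "term2": {"A": 3, "B": 1}}

# A separate order-preserving mapping per row (larger tf -> larger code).
row_maps = {"term1": {1: 10, 2: 200}, "term2": {1: 5, 3: 20}}
encrypted = {t: {d: row_maps[t][tf] for d, tf in row.items()}
             for t, row in plain.items()}

# Single-term queries rank identically in both domains.
single_ok = all(rank(plain[t]) == rank(encrypted[t]) for t in plain)

# A two-term query can rank differently once the values are summed.
plain_order = rank(summed_scores(plain, ["term1", "term2"]))          # A: 4, B: 3
encrypted_order = rank(summed_scores(encrypted, ["term1", "term2"]))  # A: 30, B: 205
```

  • Each row preserves order on its own, yet the summed plaintext scores put document A first while the summed encrypted values put document B first, which is the source of the accuracy gap for multi-term queries.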
  • FIG. 8 shows the differences in the Mean Average Precision (MAP) for the baseline scheme and that for the order-preserving encryption scheme for different numbers of search terms.
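  • For readers unfamiliar with the measure, Mean Average Precision can be computed as sketched below, using the usual definition in which average precision is the mean of the precision values at the rank of each relevant document, divided by the total number of relevant documents. The function names and the input layout are assumptions for illustration; this is not the evaluation code used for the reported experiments.

```python
def average_precision(ranked_docs: list, relevant: set) -> float:
    """Average precision of one ranked list against a set of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs: list) -> float:
    """Mean of average precision over (ranked_docs, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

  • Comparing this value for the ranked lists produced by two schemes on the same queries gives the kind of difference plotted against the number of search terms.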
  • the search accuracy is examined and compared with the number of searched terms within this range.
  • the accuracy of OPE is only within a small gap from that of the baseline model.
  • the number of search terms in the query does not affect the performance of the OPE scheme.
  • FIG. 9 shows a scatter plot of the Mean Average Precision (MAP) values for the fully server oriented (FSO) scheme plotted with respect to the baseline scheme for the 59 search queries in the W3C database.
  • The modified Kendall distance measure proposed in “Common Evaluation Measures,” Appendix to the Proceedings of Text Retrieval Conference, 2005, is used to compare the top 20 and top 100 ranks obtained using the baseline scheme and the FSO scheme.
  • the distance between the top 20 ranks for the FSO scheme and the baseline scheme is approximately 0.42 and the corresponding value for the top 100 ranks is approximately 0.29.
  • the distance for the top 20 ranks is higher because the random mapping may change the order of the top 20 ranks.
  • the distance is much lower because most of the top 100 documents are common between the two lists.
  • the inner layer encryption in HME and OPE would have to retain the sparsity of the TF table by keeping the zero-valued terms.
  • the SCU may gain knowledge of the zero-valued TF, without knowing which plain-text term and which document these correspond to.
  • the proposed schemes require a secure environment to initially generate the encrypted indices and encrypted documents. Usually such initial processing is required only once. However, in the case when the collection is constantly changing, such as by adding more documents or changing the contents in existing documents, the secure index information in HME and OPE should also be updated.
  • mapping of frequency values for all terms that appear in the new/changed documents should be updated to ensure security and search accuracy.
  • the cost of maintaining a secure search system can be relatively high.
  • One method of addressing such incremental changes to the encrypted TF without a complete update would be to encrypt each document separately, instead of encrypting the documents together. By doing so, while accuracy is slightly reduced due to the different encryption for the different document, the documents can nevertheless be updated as needed.
  • the invention thus provides a new framework for secure and confidentiality-preserving search and retrieval in large scale document collections, and techniques for securely rank-ordering the documents and extracting the most relevant documents from an encrypted collection based on the encrypted search queries.
  • The baseline, fully server oriented, and partially server oriented schemes maintain the confidentiality of the query as well as the content of retrieved documents.
  • the confidentiality preserving system and method described herein are highly secure (relying on the secure cryptographic encryption and hashing algorithms), accurate (comparable to conventional searching systems working with unencrypted data), and efficient (in terms of computational complexity, and communication bandwidth), as demonstrated by experiments with the W3C collection (discussed above).
  • the confidentiality preserving system and method have a wide range of applications, such as searching information with hierarchical access control, flexible “e-discovery” practices for digital records in legal proceedings, a variety of multi-media applications, image/video searching, and finger-print matching etc.
  • joinder references do not necessarily infer that two elements are directly connected and in fixed relation to each other. It is intended that all matter contained in the above description or shown in the accompanying drawings shall be interpreted as illustrative only and not as limiting. Changes in detail or structure may be made without departing from the invention as defined in the appended claims.

Abstract

A confidentiality preserving system and method for performing a rank-ordered search and retrieval of contents of a data collection. The system includes at least one computer system including a search and retrieval algorithm using term frequency and/or similar features for rank-ordering selective contents of the data collection, and enabling secure retrieval of the selective contents based on the rank-order. The search and retrieval algorithm includes a baseline algorithm, a partially server oriented algorithm, and/or a fully server oriented algorithm. The partially and/or fully server oriented algorithms use homomorphic and/or order preserving encryption for enabling search capability from a user other than an owner of the contents of the data collection. The confidentiality preserving method includes using term frequency for rank-ordering selective contents of the data collection, and retrieving the selective contents based on the rank-order.

Description

    CROSS-REFERENCE TO RELATED APPLICATION(S)
  • This application claims the benefit of provisional patent application U.S. Ser. No. 61/109,291, filed Oct. 29, 2008, which is expressly incorporated herein by reference.
  • GOVERNMENT SUPPORT CLAUSE
  • This invention was made with government support under H9823005C0425 awarded by NSA. The government has certain rights in the invention.
  • BACKGROUND OF INVENTION
  • a. Field of Invention
  • This invention relates to information search and retrieval. In particular, the instant invention relates to a system and method for information search and retrieval in large-scale encrypted databases, with a particular embodiment employing a confidentiality-preserving rank-ordered search.
  • b. Background Art
  • In today's information era, efficient and effective search capability of digital collections is essential in information management and knowledge discovery. At the same time, many data collections have to be stored in an encrypted form to limit their access to only authorized users in order to protect confidentiality and privacy. Examples of such data collections include medical records, corporate proprietary communications, and classified government documents. An emerging critical issue that must be addressed is how to protect data collections and indexes through encryption, while simultaneously providing efficient and accurate search capabilities.
  • A known method of protecting data from theft or intrusion is cryptographic encryption. If the contents of a data storage system are not encrypted, any outsider intruding into the system may gain knowledge of the data content. In addition to such outsider attacks, security measures must also be taken against potential insider attacks. For example, when data storage is outsourced to a third-party data center, system administrators and other personnel involved may not be trusted to have decryption keys and thus have access to the content of the data collections. When an authorized user remotely accesses the data collection to search and retrieve desired documents, the large size of the collections can often make it infeasible to transfer all encrypted data to the user's side, and then perform decryption and search on the user's trusted computers. Therefore, new techniques are needed to encrypt and organize data collections in such a way as to allow the data center to perform effective and efficient search in encrypted data.
  • A number of scenarios exist where the content owner may want to grant a user limited access to search a confidential collection. For example, the searcher may be a scholar or a low-level analyst who wants to identify relevant documents from a private/classified collection, and may need clearance only for the top-ranked documents; the searcher may also be the opposing party during the document discovery phase of a litigation, who would request that relevant documents from the content owner's digital collection (e.g. e-mails) be turned in. Conventional practices to accommodate such searches on hard-copy collections are extremely time consuming, and are often based on human factors (e.g. people have limited memory and are bound by rules of privilege) that cannot all be directly extended to computerized practice. New algorithms and processes are thus needed to enable secure search for a variety of applications.
  • There has been a considerable amount of prior work on algorithms and data structures to support information retrieval. The vast majority of such work has focused on efficient representation and effective ranking. There has also been minimal effort in addressing secure searching, and such effort has typically been limited to small collections. One example of a search in encrypted data and private information retrieval includes using established cryptographic tools as building blocks, and devising an encryption method to make two subparts of each encrypted term in a document to hold a special relationship to allow for determination of the presence or absence of a query term in an encrypted document. This method still incurs a significant increase in storage (for storing the specially encrypted documents) and typically involves a linear time computational complexity with respect to the number of words in the collection.
  • Keyword based approaches to reduce search complexity have been introduced at the expense of limited search capabilities confined by a keyword list identified beforehand. The documents containing some of the keywords are first identified, and the keywords or the keyword indices are encrypted in a way that facilitates search and retrieval. Secure indices based on Bloom filters have also been proposed to further enhance search efficiency, and conjunctive keyword based searches have been investigated.
  • The aforementioned techniques involve a high computational complexity, and target simple Boolean searches to identify the presence or absence of a term in encrypted text. Furthermore, the aforementioned techniques cannot be easily extended to more sophisticated relevance-ranked searches over large collections.
  • The inventors herein have thus recognized the need for balancing privacy and confidentiality with efficiency and accuracy, which pose significant challenges to the design of search schemes for a number of search scenarios and large data collections. The inventors herein have also recognized the need for a system that focuses on secure and efficient rank-ordered search and retrieval over large data collections.
  • BRIEF SUMMARY OF THE INVENTION
  • The confidentiality preserving rank-ordered search system and method of the invention focuses on secure and efficient rank-ordered search and retrieval over large data collections. The system includes a framework to securely rank-order documents in response to a query, and techniques for extracting the most relevant document(s) from an encrypted data collection. The system and method includes collection of term frequency information for each of the documents in the collection to build indices, as in traditional retrieval systems in plaintext. The system and method further includes securing of these indices that would otherwise reveal important statistical information about the collection to protect against statistical attacks. During the search process, the query terms may be encrypted to prevent the exposure of information to the data center and other intruders, and also confine the searching entity to only make queries within an authorized scope. Utilizing the term frequencies and other document information, schemes are developed herein to securely compute relevance scores of each document, identify the most relevant documents, and reserve the right to screen and release the full content of relevant documents.
  • For the system and method of the invention, the proposed framework is built upon well-studied cryptographic encryption and hashing primitives. In terms of search accuracy, the system achieves performance comparable to conventional searching systems designed for non-encrypted data. In addition to the focus on securing the indexes and ranking, other security issues such as protecting communication links and combating traffic analysis are addressed by appropriate security protocols and randomization.
  • In an exemplary embodiment, the invention provides a confidentiality preserving system for performing a rank-ordered search and retrieval of contents of a data collection. The system may include a computer system including a search and retrieval algorithm using term frequency and/or similar features for rank-ordering selective contents of the data collection, and enabling secure retrieval of the selective contents based on the rank-order.
  • For the confidentiality preserving system described above, in an embodiment, the search and retrieval algorithm may generate a relevance score for the rank-ordering based on one or more queries. In an embodiment, the data collection and/or query may be encrypted. The data collection may include documents and/or multi-media content. The search and retrieval algorithm may include three algorithms: a baseline algorithm, a partially server oriented algorithm, and a fully server oriented algorithm.
  • In an embodiment, the baseline algorithm may include a pre-processing algorithm for building a secure term frequency table and an inverse data collection frequency table, and a search stage algorithm for rank-ordering in response to a query. The pre-processing algorithm may include stemming of selective components of the contents of the data collection and mapping of the stemmed components in the term frequency table. The selective components may be words, and the data collection contents may be documents. In an embodiment, the search stage algorithm may include stemming of a query term, searching of the term frequency table, generation of a relevance score, rank ordering of the selective contents of the data collection based on the relevance score, and retrieval of the selective contents of the data collection based on the rank order. The pre-processing and search stage algorithms may be executed at a user site remote from a data center for storing the data collection.
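  • The pre-processing step of stemming words and mapping them into a term frequency table can be sketched as follows. This is a plaintext illustration only, before any of the protection steps: the suffix-stripping `toy_stem` is a hypothetical stand-in for a real stemmer, and the table layout (one row per stemmed term, one entry per document) is an assumption.

```python
from collections import Counter


def toy_stem(word: str) -> str:
    """Very rough stand-in for a stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_tf_table(documents: dict) -> dict:
    """Return {stemmed term: {document id: term frequency}}."""
    table = {}
    for doc_id, text in documents.items():
        counts = Counter(toy_stem(w) for w in text.lower().split())
        for stem, count in counts.items():
            table.setdefault(stem, {})[doc_id] = count
    return table
```

  • For instance, a document containing "searching searched search" contributes a count of 3 to the row for "search". In the described system each such row would then be encrypted and indexed before being stored at the data center.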
  • In an embodiment, the partially server oriented algorithm may include performance of selective computations at a user site remote from a data center for storing the data collection. The partially server oriented algorithm may include building of a term frequency table and/or generation of a relevance score at a user site remote from a data center for storing the data collection.
  • In an embodiment, the fully server oriented algorithm may include building of a term frequency table at a user site, and generation of a relevance score at a secure computing unit and/or a data center for storing the data collection.
  • In an embodiment, the partially and/or fully server oriented algorithms may enable search capability from a user other than an owner of the contents of the data collection.
  • The invention also provides a confidentiality preserving method for performing a rank-ordered search and retrieval of contents of a data collection. The method may include using term frequency and/or similar features for rank-ordering selective contents of the data collection, and securely retrieving the selective contents based on the rank-order.
  • For the method described above, in an embodiment, the method may further include generating a relevance score for the rank-ordering based on at least one query. The method may further include encrypting the data collection and/or query. In an embodiment, the data collection may include documents and/or multi-media content.
  • For the method described above, the method may further include building a secure term frequency table and an inverse data collection frequency table by stemming of selective components of the contents of the data collection and mapping of the stemmed components in the term frequency table. In an embodiment, the selective components may include words, and the data collection contents may include documents. The term frequency table may be generated at a user site remote from a data center for storing the data collection.
  • For the method described above, the method may further include stemming of a query term, searching of a term frequency table, generation of a relevance score, rank ordering of the selective contents of the data collection based on the relevance score, and retrieval of the selective contents of the data collection based on the rank order. In an embodiment, generation of the relevance score and rank ordering may be performed at a user site remote from a data center for storing the data collection. In an embodiment, the term frequency table and relevance score may be selectively generated at a user site remote from a data center for storing the data collection, and/or at a data center for storing the data collection.
  • For the method described above, the method may include using homomorphic encryption and/or order preserving encryption for enabling search capability from a user other than an owner of the contents of the data collection.
  • Additional features, advantages, and embodiments of the invention may be set forth or apparent from consideration of the following detailed description, drawings, and claims. Moreover, it is to be understood that both the foregoing summary of the invention and the following detailed description are exemplary and intended to provide further explanation without limiting the scope of the invention as claimed.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate preferred embodiments of the invention and, together with the detailed description, serve to explain the principles of the invention. In the drawings:
  • FIG. 1 is a diagram illustrating the confidentiality-preserving rank-ordered search system and method of the invention;
  • FIG. 2 is a diagram illustrating the generation and securing of index information;
  • FIG. 3 is a diagram illustrating search and retrieval for a confidentiality-preserving baseline model scheme according to the invention;
  • FIG. 4 is a diagram illustrating search and retrieval in a fully server oriented scheme according to the invention;
  • FIGS. 5A and 5B are examples of term frequency histograms, and FIGS. 5C and 5D are the corresponding histograms of the encrypted term frequency values;
  • FIG. 6 is a diagram illustrating the partially server oriented scheme according to the invention;
  • FIG. 7 is a precision-recall graph for the baseline scheme, and the order-preserving encryption scheme according to the invention;
  • FIG. 8 is a graph illustrating the difference in Mean Average Precision (MAP) between the baseline and order-preserving encryption schemes according to the invention;
  • FIG. 9 is a scatter plot of Mean Average Precision (MAP) values for the order-preserving encryption scheme with a different mapping table for each row of a TF table, plotted with respect to the baseline scheme; and
  • FIG. 10 is a graph illustrating use of a modified Kendall distance measure for comparing top 20 and top 100 ranks obtained using the baseline and order-preserving encryption schemes according to the invention.
  • DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
  • Referring now to the drawings wherein like reference numerals are used to identify identical components and steps in the various views, an embodiment of the confidentiality preserving rank-ordered search system and method (hereinafter the “confidentiality preserving system” or “confidentiality preserving method”) will be described in detail.
  • Before proceeding with a detailed description of the confidentiality preserving system and method of the invention, exemplary use-cases will be described for facilitating an understanding of the invention. It should be noted that the use-cases are for exemplary purposes only and should by no means be used to limit the scope of the invention.
  • Scenarios of Secure Search
  • This section discusses representative scenarios where the secure search over a document collection may take place. FIG. 1 is a diagram illustrating the confidentiality-preserving rank-ordered search system and method of the invention. Referring to FIG. 1, a content owner 100 (e.g. a supervisor) uses the services of a data center 102 to store a large number of documents, as well as to perform search and retrieval. The content owner may also grant another user 104 the permission to search and retrieve his/her documents through the data center. Additionally, to prevent leakage of information through potential hacker attacks, the documents stored at the data center are encrypted at location 106. The content owner manages the content decryption keys and may provide decryption services upon the user's request. In the following discussion, a few application scenarios will be examined under this framework.
  • Case 1: The content owner wants to search for some documents stored at the data center. He/she has a limited bandwidth connection with the data center, and needs to search through the encrypted content without downloading the entire collection. Furthermore, the content owner does not trust the data center with his/her unencrypted content. He/she wants to remotely search and retrieve top-ranked relevant documents without revealing the search terms, document content, and/or document index information to the data center. This scenario will be referred to as the confidentiality preserving baseline model, as discussed below, where the scheme enables both the confidentiality protection and the use of term frequency (discussed below) to achieve secure and efficient retrieval.
  • Case 2: Next, consider the scenario where a user, who is not the content owner, wants to search for a particular phrase in the set of confidential documents held by the data center. This scenario may arise in a number of cases, for example, where the user may be a scholar or a low-level analyst who wants to search relevant documents from a private/classified collection, and may need clearance only for the top-ranked documents. The user may also be the opposing side in a litigation requesting relevant documents from a digital collection (e.g. e-mails) be turned in by the content owner's side. In general, the content owner does not trust the data center with the document content or the term frequency values. However, it is considered herein that the data center has a secure computing unit (SCU), which is trusted by the content owner to some degree. Depending on the level of trust on the SCU by the content owner, the following exemplary scenarios are identified:
  • Case 2a: the content owner trusts the SCU both with the plain-text documents and the associated term-frequency table (discussed below).
  • Case 2b: the content owner trusts the SCU with the plain-text term-frequency values, but not with the plain-text documents.
  • Case 2c: the content owner does not trust the SCU with either the term-frequency values or the documents in plain-text form, but trusts the SCU with certain computations to be performed on some encrypted version of the term-frequency (TF) table without disclosing the exact values.
  • In Cases 2a and 2b, the content owner trusts the SCU with the term frequency values. In these cases, the SCU can be considered as a heavily guarded “Maximum-Security Computing Unit” (MaxSCU) in the data center that can be used to decrypt the term frequency (TF) table, compute relevance scores using EQ-1 (see below), and rank-order the documents based on these values. The baseline model introduced under the Confidentiality Preserving Baseline Model section can be the solution under this scenario. The MaxSCU, however, is a critical link in the overall system security and may be subject to heavy attacks, and as such, it can be expensive to design and maintain such a unit hosted in a data center.
  • In Case 2c, the threat posed by an adversary who breaks into the SCU is reduced, as the SCU only sees some encrypted index data and not the exact plain-text values. As such, an SCU with medium security (MedSCU) can be sufficient. This scenario calls for two layers of carefully designed encryption: an inner layer that allows the SCU to compute relevance scores in the encrypted domain, and an outer layer that enhances confidentiality outside the SCU. Two exemplary schemes to accomplish this objective, homomorphic encryption (HME) and order-preserving encryption (OPE), are discussed in the Secure Ranking of Document Relevance section presented below.
  • If the content owner does not trust the SCU with any plain-text or encrypted data, the content owner's involvement would be required in computing the relevance score. Thus it would reduce to the baseline model discussed in the Confidentiality Preserving Baseline Model section presented below.
  • Because term frequency statistics of a collection are useful for ranked retrieval, the underlying concepts will be briefly discussed before proceeding with a detailed description of the aforementioned baseline model and the fully and partially server oriented schemes.
  • Term Frequency
  • Referring to FIG. 1, consider a data collection 108 that contains N(D) documents, in which N(T) unique terms appear. The term frequency information for all terms and all documents can be organized as a table at location 110 of size N(T)×N(D), in which the entry at ith row and jth column indicates the number of occurrences of the ith term in the jth document. Term frequency has been employed as a core variable to define the relevance score in rank-ordering documents in a collection. One example metric is the Okapi relevance score CW (i, j), which is defined as:
  • $$CW(i,j) = \frac{CFW(i)\, TF(i,j)\, (K_1 + 1)}{K_1 \left( (1 - b) + b \cdot NDL(j) \right) + TF(i,j)}, \qquad \text{(EQ-1)}$$
  • where N(i) is the number of documents containing the ith term; NDL(j) represents the normalized length of the jth document and is given by dividing the length of the jth document, L(j), by the average document length Lavg, i.e., NDL(j)=L(j)/Lavg; and K1 and b are constants chosen to achieve the best performance for the particular collection (see S. E. Robertson and K. S. Jones, “Simple Proven Approaches to Text Retrieval,” Technical Report TR356, Cambridge University Computer Laboratory, 1997). Exemplary values are K1=2 and b=0.75. CFW (i) denotes the cumulative frequency of the ith word in the whole collection and is given by CFW (i)=log(N(D)/N(i)). The CFW plays an equivalent role as the inverse document frequency used in some information retrieval schemes. It can be either pre-computed or obtained concurrently from the term frequency table.
  • Given a query consisting of a single term w(i), the set of relevance scores {CW (i, j), j=1, . . . , N(D)} can be directly used to identify the most relevant documents, which have the largest relevance scores over the above set {CW (i, j), j=1, . . . , N(D)}. If a query contains multiple terms {w(i1), w(i2), . . . , w(iM)}, the relevance scores for each of the query terms are added, i.e.,
  • $$\left\{ \sum_{i_k = i_1}^{i_M} CW(i_k, j),\ \forall j \right\},$$
  • and this overall score vector is employed to rank-order the documents. The term frequency table and indices may be secured at location 112.
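  • The relevance computation of EQ-1 and the multi-term summation can be sketched as follows. This is a minimal illustration on plain-text term frequencies, not the implementation of the invention; the function names (`cfw`, `okapi_cw`, `rank_documents`) and the list-of-rows layout are assumptions made for the example, and the defaults K1=2 and b=0.75 are the exemplary values given above.

```python
import math


def cfw(n_docs, n_docs_with_term):
    """CFW(i) = log(N(D) / N(i))."""
    return math.log(n_docs / n_docs_with_term)


def okapi_cw(tf, cfw_i, ndl_j, k1=2.0, b=0.75):
    """Okapi relevance score CW(i, j) as in EQ-1."""
    return cfw_i * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl_j) + tf)


def rank_documents(tf_rows, doc_lengths, k1=2.0, b=0.75):
    """Sum CW(i_k, j) over the query terms and rank documents by the total.

    tf_rows: one term-frequency row per query term (hypothetical layout).
    doc_lengths: L(j) for every document j.
    """
    n_docs = len(doc_lengths)
    l_avg = sum(doc_lengths) / n_docs
    scores = [0.0] * n_docs
    for row in tf_rows:
        n_i = sum(1 for tf in row if tf > 0)  # N(i): documents containing the term
        if n_i == 0:
            continue  # term absent from the collection, contributes nothing
        weight = cfw(n_docs, n_i)
        for j, tf in enumerate(row):
            scores[j] += okapi_cw(tf, weight, doc_lengths[j] / l_avg, k1, b)
    order = sorted(range(n_docs), key=lambda j: scores[j], reverse=True)
    return order, scores
```

  • For a single-term query with row [3, 0, 1, 0] over four documents of equal length, the sketch ranks the first document ahead of the third, and the two documents that do not contain the term receive a score of zero.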
  • The confidentiality preserving baseline model, and fully and partially server oriented schemes will now be discussed in detail in the following sections.
  • Approach/Scheme I—Confidentiality Preserving Baseline Model
  • As discussed above, the confidentiality preserving system and method of the invention includes a unique framework for performing ranked search securely and efficiently without revealing the indexing information. For the baseline scheme, it is assumed that the data center can only be trusted with data storage and should not be allowed to obtain any information about the stored data. To achieve secure search, the baseline model is proposed that involves multiple rounds of interaction between the client and server to obtain the relevant information pertaining to a query. It should be noted that various aspects of the fully and partially server oriented schemes will also be discussed in conjunction with the baseline model to provide a full understanding of the invention. The proposed framework may include two major stages, a pre-processing stage for building a secure term frequency table and an inverse document frequency table, and a search stage for rank-ordering documents in response to a particular query while preserving the confidentiality of term frequency information.
  • Indexing Stage to Secure Term Frequency
  • The pre-processing is executed once by the content owner, when he/she stores the documents, all in encrypted form, in the data center. The major task of the pre-processing stage is to build a secure term frequency table and an inverse document frequency table, so as to facilitate efficient and accurate information retrieval.
  • For an unprotected term frequency table, both the search term and its term frequency information are in plain text. To protect the confidentiality of the search, both may be encrypted in an appropriate way. FIG. 2 is a diagram illustrating the generation and securing of index information for the baseline model. Referring to FIG. 2, a word w in a document first undergoes stemming at location 130, which retains the word root and removes the word ending, to obtain the stemmed word $w_S$. The stemmed word may then be encrypted at location 132 using an encryption function E and the word-key $K_{w_S}$, obtaining the encrypted word $w_S^{(e)} = E(K_{w_S}, w_S)$. The word-key may be unique to each stemmed word and is obtained using the stemmed word and a pre-defined master key. The encrypted word $w_S^{(e)}$ is further mapped to a particular row i in the term frequency table, where the index i is established via a hashing function at location 134 such that $i = H(w_S^{(e)})$. With the stemmed word, the term frequency information is collected by counting the number of occurrences of the stemmed word in the jth document and stored in the table entry TF(i, j) at location 136.
  • This process is repeated to obtain the term frequencies for all terms and documents, which are then further encrypted. In the baseline model discussed herein, where the data center can only be trusted with storing data, a single layer of encryption is sufficient to protect the term frequency information from both unauthorized users and from the data center. The term frequency information, i.e., $TF^{(s)}(i,j) = TF(i,j)$, is directly used at location 138. If needed, proper encoding can be performed to minimize the required storage. The encoded term frequency table, denoted by $TF_C^{(s)}$, is then encrypted to create $TF_C^{(e)}$ at location 140 as follows:

  • $$TF_C^{(e)}(i,\cdot) = E\left(K_i^{(TF)},\ TF_C^{(s)}(i,\cdot)\right) \qquad \text{(EQ-2)}$$
  • Here, $TF_C^{(s)}(i,\cdot) = C(TF^{(s)}(i,\cdot))$ represents the encoded term frequency values obtained through an encoding function C that removes redundancies in the term frequency table. $K_i^{(TF)}$ denotes the key used to encrypt the ith row of the term frequency table $TF^{(s)}$. To increase the security, the value of $K_i^{(TF)}$ is unique for each row and is derived from the word-key $K_{w_S}$ corresponding to the ith row. Thus, even if the key corresponding to one row is compromised, no information can be obtained about other rows of the term frequency table.
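  • The indexing steps above (stemming, word-key derivation, row-index hashing, and per-row key derivation) can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the stemmer, the key-derivation method, the cipher E, or the hash H, so the sketch uses a toy suffix-stripping stemmer, HMAC-SHA256 for key derivation, an HMAC tag as a deterministic stand-in for $E(K_{w_S}, w_S)$, and SHA-256 for H. All function names are hypothetical, and the outer-layer encryption of the rows (EQ-2) is not shown.

```python
import hashlib
import hmac
from collections import Counter


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common endings."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def word_key(master_key, stem):
    """Word-key derived from the stemmed word and the master key (assumed: HMAC)."""
    return hmac.new(master_key, stem.encode(), hashlib.sha256).digest()


def row_index(master_key, stem):
    """Row index i = H(encrypted stemmed word); the HMAC tag stands in for E."""
    k_ws = word_key(master_key, stem)
    token = hmac.new(k_ws, b"word:" + stem.encode(), hashlib.sha256).digest()
    return hashlib.sha256(token).hexdigest()


def row_key(master_key, stem):
    """Per-row key derived from the word-key, so rows do not share keys."""
    return hmac.new(word_key(master_key, stem), b"tf-row", hashlib.sha256).digest()


def build_tf_table(master_key, documents):
    """Map each hashed row index to its list of per-document term counts."""
    table = {}
    for j, text in enumerate(documents):
        counts = Counter(toy_stem(w) for w in text.split())
        for stem, count in counts.items():
            row = table.setdefault(row_index(master_key, stem), [0] * len(documents))
            row[j] = count
    return table
```

  • Because the row index depends on the master key, a party that holds only the table cannot recompute the index for a guessed word; a query is issued by recomputing `row_index` on the owner's side and sending only that value.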
  • Secure Search Stage
  • In the baseline model discussed herein, search and retrieval is initiated by the content owner. FIG. 3 is a diagram illustrating search and retrieval for the confidentiality-preserving baseline model scheme. Referring to FIG. 3, when searching for a particular word w in the collection, the content owner first performs stemming at location 170 to obtain the stemmed word $w_S$. The word-key is then derived from the master key and used to encrypt the stemmed word $w_S$ to obtain $w_S^{(e)}$. After that, the hash value of $w_S^{(e)}$ is calculated at location 172 and sent to the data center. Using the received hash value $k = H(w_S^{(e)})$, the data center searches the protected term frequency table $TF_C^{(e)}$ at location 174 and identifies the row corresponding to the query word w. In this way, the query content is concealed from the data center.
  • After the data center identifies the target row $TF_C^{(e)}(k,\cdot)$ from the encrypted term frequency table $TF_C^{(e)}$ based on the calculated value of $k = H(w_S^{(e)})$, that particular row is sent back to the content owner, who then decrypts and decodes it at location 176 to obtain the plain-text term frequencies {TF(k, j), ∀j}. The content owner further computes relevance scores at location 178 from the term frequency values as in EQ-1, rank-orders the documents based on the score, and requests the most relevant documents from the data center at locations 180, 182. When a query consists of multiple terms, w(i1), w(i2), . . . , w(iM), the M corresponding rows in the TF table, $TF_C^{(e)}(i_1,\cdot), TF_C^{(e)}(i_2,\cdot), \ldots, TF_C^{(e)}(i_M,\cdot)$, are identified and sent back to the content owner for computing relevance scores. The content owner uses the received information to compute the relevance scores for each term, and then combines them to obtain the final score.
  • The baseline model and the fully and partially server oriented schemes differ in where the relevance scores are computed. In the baseline scheme, all of these term frequency rows are sent back to the user side, which computes the relevance scores using the combined information. In the partially server oriented scheme, after the term frequency rows $TF_C^{(e)}(i_1,\cdot), TF_C^{(e)}(i_2,\cdot), \ldots, TF_C^{(e)}(i_M,\cdot)$ go through outer-layer decryption and decompression, the server performs part of the combination, and the result is then sent back to the user side for obtaining the final relevance scores. In the fully server oriented scheme, after the outer-layer decryption and decompression of all the M related term frequency rows, the server computes relevance scores for each of them, and then combines them to obtain the final scores.
  • TABLE I
    Comparison of the Proposed Techniques
    Property                                | Baseline | Partial Server Oriented | Fully Server Oriented
    No. of communication rounds             | 2        | 2                       | 1
    Bandwidth requirement for communication | High     | Medium                  | Low
    Memory Storage required at Server       | Low      | Low                     | Medium
    Memory Storage required at User         | Medium   | Medium                  | Low
    Security w.r.t outsiders                | High     | High                    | High
    Security w.r.t Server                   | High     | High/Medium             | Medium
  • Comparison of the Three Searching Schemes: Table I compares the three proposed searching schemes in terms of storage, bandwidth requirement, and security. The ratings of low, medium and high in Table I represent relative values only; they are intended for comparison purposes and do not signify performance in absolute terms. Each of the three approaches has its advantages and disadvantages, and may be suitable for different scenarios depending on the system constraints. The choice of the most appropriate searching scheme usually depends on the application requirements and user preferences, in consideration of the specific threat model. In the subsequent discussion, techniques developed for each of the three schemes are presented in greater detail. For the baseline scheme, as whole term frequency rows are transmitted from the server to the user during the searching process, compression of term frequencies will be discussed for saving communication bandwidth. For the partially and fully server oriented schemes, one important consideration will be developing appropriate inner-layer encryption algorithms to achieve a good tradeoff between data security, retrieval accuracy, and searching efficiency.
  • In the baseline model, the data center does not get access to the unencrypted content at any point during either the pre-processing stage or the search and retrieval stage. The data center does not know the term frequency information because it is stored in encrypted form. The only information that the data center gains from the search process is the retrieval log. The retrieval log at most contains data on which user searched for which encrypted queries, when, and how often. The data center may also learn which documents were requested in connection with the encrypted search queries. Based on such information collected over a period of time, the data center may launch some kinds of statistical attacks. However, such attacks can be mitigated by the content owner by adding to his/her requests some phantom terms and document indices to obfuscate the access statistics of his/her intended terms and documents. The content owner can also hide his/her identity by introducing a proxy in his/her connection link with the data center.
  • Encoding the term frequency rows helps reduce the bandwidth required for its transmission during the search phase. Value-precision encoding is used herein for encoding to compress the term-frequency rows, wherein the position and the value of every non-zero term is encoded in the term-frequency table. As an example, the results with 200,000 e-mails from the Enron e-mail corpus suggest that the average size of the compressed term frequency rows is 435 bytes, and 86% of them have a size within 200 to 300 bytes (see B. Klimt and Y. Yang, “Introducing the Enron Corpus,” Conf. On Email and Anti-Spam (CEAS), Mountain View, Calif., 2004). Thus, by encoding, the required bandwidth in transmitting the term frequency rows can also be minimized.
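  • The encoding of the position and value of every non-zero entry can be sketched as follows. The byte layout used here (a header with the row length and pair count, then a 4-byte position and a 2-byte value per non-zero entry) is an assumption chosen for illustration; the text does not specify the actual format, and `encode_row` and `decode_row` are hypothetical names.

```python
import struct


def encode_row(row):
    """Keep only (position, value) pairs of the non-zero term frequencies."""
    pairs = [(j, tf) for j, tf in enumerate(row) if tf != 0]
    out = struct.pack(">II", len(row), len(pairs))  # header: row length, pair count
    for j, tf in pairs:
        out += struct.pack(">IH", j, tf)  # 4-byte position, 2-byte value
    return out


def decode_row(blob):
    """Rebuild the full row from the encoded (position, value) pairs."""
    n_docs, n_pairs = struct.unpack_from(">II", blob, 0)
    row = [0] * n_docs
    for k in range(n_pairs):
        j, tf = struct.unpack_from(">IH", blob, 8 + 6 * k)
        row[j] = tf
    return row
```

  • With this layout the encoded size grows with the number of documents that contain the term rather than with the size of the collection, which is why sparse rows compress well.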
  • Since computing the relevance score requires the use of cumulative frequency of a word (CFW) as in EQ-1, the CFW can be computed before-hand and encrypted using the same word key as in the term frequency table. The CFW is then stored in the data center separately from the term frequency. It can be sent to the content owner along with the term frequency rows during relevance computation. If the relevance score is computed by the data center, the CFW can be stored in the data center in clear-text form.
  • Secure Ranking of Document Relevance
  • The baseline model previously introduced provides secure and effective search to the scenarios where the content owner makes a query himself/herself. In this section, two different schemes, namely homomorphic encryption and order-preserving encryption (each discussed in greater detail below), are presented for enabling the search capability from a user other than the content owner. These schemes reduce the involvement of the content owner either partially or completely by shifting the task of computing the relevance score to the data center, while still maintaining the confidentiality of the term frequency information and the document content. To achieve the goal, an additional layer of encryption on the term frequency information is designed. This additional layer of encryption is referred to as the inner-layer encryption. Two different types of inner-layer encryptions/schemes, namely, homomorphic encryption and order-preserving encryption are discussed herein. After the inner-layer encryption, TF(s) is encoded to obtain TFC (s), and further encrypted to obtain TFC (e) in the same way as in the baseline scheme. This second round of encryption is referred to as outer-layer encryption, which prevents unauthorized users from accessing term frequency information.
  • FIG. 4 is a diagram illustrating search and retrieval in the fully server oriented scheme according to the invention. The indexing and pre-processing stages of the proposed schemes are similar to the baseline model with an additional inner-layer encryption, and the searching stage is shown in FIG. 4. When searching for a particular query consisting of multiple terms, w(i1), w(i2), . . . , w(iM), in the collection, the user first performs stemming to obtain the corresponding stemmed words. The user then sends the stemmed words to the content owner, who checks if the user has the required permission to search for the query words at location 210. Upon verification, the content owner derives the word-keys from the master key and uses them to encrypt the stemmed words to obtain $w_S^{(e)}(i_k)$, k=1, 2, . . . , M. After that, the hash value of each $w_S^{(e)}(i_k)$ is calculated and transmitted to the user, who forwards it to the data center. Using the received hash values $H(w_S^{(e)}(i_k))$ from location 212, the data center searches the protected term frequency table $TF_C^{(e)}$ at location 214 and identifies the rows corresponding to the query words. In this way, the data center does not get any information about the query.
  • After the data center identifies the target rows from the term frequency table $TF_C^{(e)}$, it uses the Secure Computing Unit (SCU) to decrypt and decode them at location 216, and subsequently obtains the corresponding rows of the term frequency table $TF^{(s)}$ that are protected by the inner-layer encryption algorithms. During this stage, the encrypted rows of $TF^{(s)}$ are retained within the SCU and not revealed to the data center. The SCU then performs part of or the entire computation of the relevance scores at location 218 in the encrypted domain as shown in FIG. 4. In the homomorphic encryption based scheme (HME), the computation results are then sent to the content owner, who decrypts the results, obtains the relevance scores, and rank-orders the documents. Therefore, HME is also referred to as the partially server oriented scheme. The order of the relevant documents pertaining to the user's query is sent back to the data center, which gives the user the corresponding documents at location 220. On the other hand, in the order preserving encryption based scheme (OPE), the entire computational burden is shifted to the SCU, which computes relevance scores, rank-orders the documents, and directly sends back to the user the most relevant documents with their ranking information. The OPE is also referred to as the fully server oriented scheme.
  • The main difference between the HME and the OPE schemes is the additional round of communication between the data center and the content owner, and the need of using the content owner's decryption key. As discussed below, the need for this additional round of communication can be offset at the cost of slightly reduced retrieval accuracy. In the following sections, details of the OPE and HME schemes are discussed.
  • Approach/Scheme II—Fully Server Oriented Scheme Based on Order Preserving Encryption
  • To remove the need of communications between the data center and content owner during content search, computations and ranking are performed directly on term-frequency data in its inner-encrypted form. Discussed herein is an order preserving encryption scheme (OPE) as the inner-layer encryption and the method of computing and ranking relevance scores in the encrypted domain.
  • More specifically, order preserving encryption is applied on TF(i, j) to obtain encrypted TF(s)(i, j) in the inner-layer encryption step, i.e., if TF(i, j)<TF(i,k), then TF(s)(i, j)<TF(s)(i, k). Due to the monotonicity of the relevance score function in EQ-1, as long as the order of relevance scores (or the order of term frequency values) is preserved, rather than their exact values, the correct search results can be obtained for queries that involve only one term. Based on the experimental analysis on the Enron e-mail corpus discussed earlier, generally peak histograms are observed for the term frequency values over a large number of rows, and some examples are shown in FIGS. 5A and 5B. Applying the existing algorithms of order preserving encryption to such generally peaking distributions would not be able to randomize the term frequency values, since their one-to-one mapping operation will largely retain generally peaking nature of term frequency distributions, leaking valuable information to the server. Therefore, in order to enhance security and prevent the leak of term-frequency information, appropriate one-to-many mapping is required to flatten the generally peaking distribution to an approximately uniform distribution and increase its randomness.
  • In the one-to-many order preserving encryption method, the encryption is performed row by row for each of the N(T) terms. The generally peaking structure of the term frequency distribution reflects that there are a large number of entries having the same term frequency value in an individual row of the term frequency table. In order to flatten the generally peaking distribution, every entry TF(i, j) with the value tf is mapped to a random number in the range $[tf^l, tf^u]$, where $0 \le tf^l \le tf^u < 2^B$ (B=8 in the experiment) are the lower bound and the upper bound of the random mapping range, which must be carefully chosen. In order to make the one-to-many mapping an order preserving operation, for two different term frequency values $tf_1$ and $tf_2$, their random mapping ranges $[tf_1^l, tf_1^u]$ and $[tf_2^l, tf_2^u]$ are chosen to satisfy the following constraint:

  • $$\text{if } tf_1 < tf_2, \text{ then } tf_1^u < tf_2^l \qquad \text{(EQ-3)}$$
  • To maximize the entropy of the encrypted output, the random mapping range $[tf^l, tf^u]$ for a term frequency value tf is adaptively determined according to the distribution of row term frequency values, so that an approximately uniform distribution can be obtained for the encrypted term frequency values $TF^{(s)}(i,j)$. More specifically, the width of the random mapping range $[tf^l, tf^u]$ is chosen proportional to the count of the value tf in that particular row. The values of $tf^l$ and $tf^u$ are then determined subject to $0 \le tf^l \le tf^u < 2^B$ and the constraint in EQ-3. In this way, an approximately uniform distribution can be obtained for the encrypted $TF^{(s)}(i,j)$ in individual rows.
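  • The one-to-many order preserving mapping described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the rule used to turn counts into range widths (one reserved slot per distinct value, with the remaining slots shared in proportion to the counts) is an assumption, as are the function names `build_ranges` and `ope_encrypt_row`. The per-row ranges play the role of the secret mapping table.

```python
import random
from collections import Counter


def build_ranges(row, bits=8):
    """Assign each distinct TF value a disjoint range in [0, 2**bits).

    Ranges are laid out in increasing order of TF value, so EQ-3 holds, and
    each width grows with how often the value occurs in the row.
    """
    size = 1 << bits
    counts = Counter(row)
    values = sorted(counts)
    if len(values) > size:
        raise ValueError("more distinct values than ciphertext slots")
    spare = size - len(values)  # slots left after reserving one per value
    ranges, low = {}, 0
    for v in values:
        width = 1 + (spare * counts[v]) // len(row)
        ranges[v] = (low, low + width - 1)
        low += width
    return ranges


def ope_encrypt_row(row, bits=8, rng=None):
    """Map every entry to a random integer inside the range of its TF value."""
    rng = rng or random.SystemRandom()
    ranges = build_ranges(row, bits)
    return [rng.randint(*ranges[tf]) for tf in row], ranges
```

  • Because equal term frequency values are spread over a whole range, two documents with the same count usually receive different encrypted values, while any two documents with different counts keep their relative order.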
  • FIGS. 5A and 5B, briefly discussed above, are examples of term frequency histograms, and FIGS. 5C and 5D are the corresponding histograms of the encrypted term frequency values. Applying the proposed random mapping method to the two histograms shown in FIGS. 5A and 5B, with the random mapping range determined for individual rows, encrypted $TF^{(s)}(i,j)$ is obtained with the histograms shown in FIGS. 5C and 5D, respectively. It can be seen that approximately uniform distributions are obtained after the one-to-many order preserving encryption, even though the distributions of row term frequency values are quite different in these two examples. This indicates that the confidentiality of critical term frequency information can be protected from hackers, unauthorized users, and the data center that carries out the search task.
  • Approach/Scheme III—Partially Server Oriented Scheme Using Homomorphic Encryption
  • In the partially server oriented scheme discussed herein, after the term frequency rows $TF_C^{(e)}(i_1,\cdot), TF_C^{(e)}(i_2,\cdot), \ldots, TF_C^{(e)}(i_M,\cdot)$ go through outer-layer decryption and decompression, the server will perform part of the combination, which is then sent back to the user side for obtaining the final relevance scores. The basis for the partially server oriented scheme is that in some scenarios such as that of a mobile computing unit, the computation power of the client and the bandwidth of the communication channel may be severely limited and the MedSCU can help perform certain computations in a secure manner. Hence, the amount of data transferred between the client and server and the amount of computation to be performed by the client should be minimized.
  • FIG. 6 is a diagram illustrating the partially server oriented scheme according to the invention. As shown in FIG. 6, when searching for a particular word w in the database, the user side first performs stemming at location 240 to obtain its corresponding stemmed word $w_S$. The word-key is then derived from the master key and used to encrypt the stemmed word $w_S$ to obtain $w_S^{(e)}$ at location 242. After that, the hash value of $w_S^{(e)}$ is calculated at location 244 and transmitted to the server side. Using the received hash value $H(w_S^{(e)})$, the server can search the protected term frequency table $TF_C^{(e)}$ at location 246 and identify the row corresponding to the query word w.
  • After the server identifies the target row $TF_C^{(e)}(k,\cdot)$ at location 246 from the term frequency table $TF_C^{(e)}$, in the partially server oriented scheme, the server itself decrypts and decompresses it at locations 248, 250 and subsequently obtains term frequencies $TF^{(s)}(k,\cdot)$ that are protected with inner-layer encryption algorithms. The server then performs part of or all the computation at location 252 in finding the relevance scores in the encrypted domain. After that, the server sends the computation results back to the user side at location 254, which then decrypts the received results and further rank-orders the documents. The encrypted documents are then obtained at location 256, and returned to the user at location 258 for decryption.
  • In further detail, for the partially server oriented scheme, for a query submitted by the user, the server first extracts the corresponding term-frequency rows stored in the encrypted format. For each of the identified rows, $TF_C^{(e)}(i,\cdot)$, the server decrypts it using the word key and then decompresses it to obtain $TF^{(s)}(i,\cdot)$ with an inner-layer encryption. Then, in this encrypted domain, at location 252 as discussed above, the server performs certain computations toward finding the relevance scores. The computation results are then sent back to the user, who uses the decryption keys to find the actual values of the relevance scores at location 254. The user then rank-orders the documents using the derived relevance scores and requests the most pertinent documents from the server at location 256. Similar to the baseline scheme, the partially server oriented scheme also involves two rounds of communication. In the first round, the user sends the query word(s) and gets the encrypted relevance scores from the server. The user then processes the results to find the relevant documents and requests the documents in the second round. Unlike the baseline scheme, this method does not require transmission of all term frequency files related to a query. Therefore, it needs much lower bandwidth in the searching process and would be feasible for low-bandwidth scenarios.
  • When the server performs the computation of relevance scores, it works on term frequencies $TF^{(s)}(i,\cdot)$ with an inner-layer encryption. Therefore, the security of the term frequency information with respect to the server itself largely depends on the nature of the inner-layer encryption. Meanwhile, computation results on $TF^{(s)}(i,\cdot)$ should benefit the user side in the subsequent sorting of final relevance scores. The following sections show that homomorphic encryption algorithms may be used to encrypt the term-frequency values so that arithmetic computations can be performed in the encrypted domain.
  • Secure Computation of Relevance Scores Based on Homomorphic Encryption
  • Generally, when the SCU performs the computation of relevance scores, it works on term frequencies rows, TF(s)(i,.), encrypted with an inner-layer encryption. Therefore, the security of the term frequency information with respect to the SCU itself largely depends on the nature of the inner-layer encryption. Meanwhile, computation results on TF(s)(i,.) should benefit the content owner in the subsequent sorting of final relevance scores. Homomorphic encryption algorithms may be used to encrypt the term-frequency values to enable performing arithmetic computations in the encrypted domain (see J. Domingo-Ferrer, “A New Privacy Homomorphism and Applications,” Information Processing Letters, Vol. 60, No. 5, pp. 277-282, December 1996, and R. L. Rivest, L. Adleman, and M. L. Dertouzos, “On Data Banks and Privacy Homomorphisms,” Foundations of Secure Computation, Academic Press, 1978, pp. 169-179). The RSA encryption and symmetric homomorphism schemes that may be used will now be discussed in detail.
  • RSA Based Approach
  • The RSA public-key cryptosystem involves a public key (n, e) and a private key (n, d) such that $e \cdot d \equiv 1 \pmod{\varphi(n)}$, where $\varphi(n)$ is Euler's totient of the modulus n. A message $m \in Z_n = \{0, 1, 2, \ldots, n-1\}$ is encrypted using the public key (n, e) as $c = RSA(m) = m^e \bmod n$. The message can then be recovered using the private key (n, d) as $m = c^d \bmod n$. The RSA encryption scheme has the following property:
  • $$\begin{aligned} \left( RSA(m_1) \cdot RSA(m_2) \right) \bmod n &= \left( (m_1^e \bmod n) \cdot (m_2^e \bmod n) \right) \bmod n \\ &= (m_1 m_2)^e \bmod n \\ &= RSA(m_1 \cdot m_2). \end{aligned} \qquad \text{(EQ-4)}$$
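  • The multiplicative property in EQ-4 can be checked numerically with a toy example. The parameters below (p = 61, q = 53, e = 17) are small textbook values chosen only for illustration; they are assumptions of this sketch, offer no security, and are not taken from the invention. The modular inverse via `pow(e, -1, phi)` requires Python 3.8 or later.

```python
# Toy RSA parameters (illustrative only, far too small to be secure).
p, q = 61, 53
n = p * q                    # modulus
phi = (p - 1) * (q - 1)      # Euler's totient of n
e = 17                       # public exponent, coprime to phi
d = pow(e, -1, phi)          # private exponent: e * d = 1 (mod phi)


def rsa_encrypt(m):
    """Textbook RSA without padding: c = m^e mod n."""
    return pow(m, e, n)


def rsa_decrypt(c):
    """Textbook RSA decryption: m = c^d mod n."""
    return pow(c, d, n)


m1, m2 = 12, 7
# Multiply the two ciphertexts without ever decrypting them.
c_product = (rsa_encrypt(m1) * rsa_encrypt(m2)) % n
```

  • Decrypting `c_product` yields m1·m2, provided that the product is smaller than n. The property holds only for unpadded RSA, which is deterministic, so a practical design would have to account for the small range of term frequency values.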
  • This homomorphic property is used to perform relevance score computations at the server's end. To facilitate easy computations in the encrypted domain, the relevance score defined in EQ-1 is approximated as follows:
  • $$CW(i,j) \approx \frac{CFW(i)\, TF(i,j)\, (K_1 + 1)}{K_1} = C(i)\, TF(i,j), \qquad \text{(EQ-5)}$$
  • where
  • $$C(i) = \frac{CFW(i)\, (K_1 + 1)}{K_1} \qquad \text{(EQ-6)}$$
  • and can be calculated with the knowledge of number of documents that do not contain the ith word. In arriving at EQ-5, the TF(i, j) term is ignored in the denominator of EQ-1 and it is assumed that NDL(j)≈1, i.e. the length of all documents is approximately the same and equal to the average length. Although ignoring the TF (i, j) term in the denominator would change the actual value of CW(i, j), the relative order is still preserved as both functions are monotonic in TF(i, j). For queries containing multiple terms, EQ-5 is used to compute the relevance score for document D(j) for each word in the query, CW(i1, j),CW(i2, j), . . . , CW(iM, j) and the final relevance score is calculated by

  • CW(j) = CW(i_1, j) + CW(i_2, j) + . . . + CW(i_M, j)  (EQ-7)
  • TABLE II
    Evaluation of the Retrieval Results using the Simplified Relevance Score in EQ-5
                 Number of Search Terms
    Ranks        1      2      3      5
    Top 10      10     10      9      7
    Top 20      20     20     20     18
    Top 50      50     50     50     48
    Top 100    100    100    100    100
  • To evaluate the performance of the search method using the approximation in EQ-5, the documents in the top 10, top 20, etc. retrieved using the original Okapi score are compared with those retrieved using the score calculated with EQ-5, and the number of documents common to both lists is counted. Table II shows the results obtained. The approximation does not affect the retrieved set for queries with one or two terms, and the overlap decreases only gradually as the number of query terms increases (for five-term queries, 7 of the top 10 and 48 of the top 50 documents still agree). This supports the use of the approximation in EQ-5.
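  • The single-term case can be illustrated with a small numerical sketch. It compares the Okapi term score, written in the form given later in EQ-10, with the simplified score of EQ-5 for documents of equal normalized length. The values K1 = 1.2 and b = 0.75, the CFW value, and the term frequencies are assumptions made only for this illustration.

```python
def okapi_cw(tf, cfw, k1=1.2, b=0.75, ndl=1.0):
    """Okapi term score in the form of EQ-10: C1(i)*TF / (TF + C2(j))."""
    c1 = (k1 + 1) * cfw
    c2 = k1 * (1 - b + b * ndl)
    return c1 * tf / (tf + c2)


def simplified_cw(tf, cfw, k1=1.2):
    """Simplified score of EQ-5: C(i) * TF(i, j)."""
    return cfw * (k1 + 1) / k1 * tf


# Hypothetical term frequencies of one query term in four documents.
term_freqs = {"doc_a": 3, "doc_b": 0, "doc_c": 7, "doc_d": 1}
cfw = 2.0  # hypothetical collection frequency weight of the term


def rank(score_fn):
    """Document identifiers sorted by decreasing score."""
    return sorted(term_freqs,
                  key=lambda doc: score_fn(term_freqs[doc], cfw),
                  reverse=True)


okapi_rank = rank(okapi_cw)
simple_rank = rank(simplified_cw)

# Both scores increase with TF, so the single-term ranking is identical.
assert okapi_rank == simple_rank
```

  • With several query terms the per-term scores are summed, and the two sums need not order documents identically, which is consistent with the small differences reported in Table II.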
  • While creating the database, each row of the term frequency table, TF(i,.), is first encrypted using RSA to obtain TF(s)(i,.) = RSA(K_i^(s), TF(i,.)). The encrypted row is then compressed to TF_C(s)(i,.) and encrypted again using a symmetric encryption function E and key K_i^(TF) to obtain TF(e)(i,.) = E(K_i^(TF), TF_C(s)(i,.)), which is stored in the database. The encrypted value of C(i), C(s)(i) = RSA(K_i^(s), C(i)), is also stored.
  • In the searching phase, the client sends the query terms and the corresponding keys K_{i_m}^(TF) to the server. To compute the relevance score CW(i_m, j), TF(e)(i_m,.) is decrypted using the decryption function D and key K_{i_m}^(TF) and decompressed to obtain TF(s)(i_m,.). The server then performs the following computation to obtain the encrypted values of the relevance scores:

  • $$\mathrm{RSA}\big(K_{i_m}^{(s)}, CW(i_m, j)\big) = \mathrm{RSA}\big(K_{i_m}^{(s)}, C(i_m)\big) \cdot \mathrm{RSA}\big(K_{i_m}^{(s)}, TF(i_m, j)\big) \pmod{n} \tag{EQ-8}$$
  • The server then returns RSA(K_{i_m}^(s), CW(i_m,.)), m = 1, 2, . . . , M, to the client, which decrypts, sums, and sorts the scores. The client then requests the relevant files from the server.
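  • The search phase of the RSA based approach can be sketched end to end as follows. This is an illustration under several assumptions that are not part of the description: textbook RSA with toy primes, a single key pair shared by all rows, integer-valued weights C(i), and no outer symmetric encryption or compression layer. The term names and values are hypothetical.

```python
# Toy textbook RSA parameters (assumptions; far too small for real use).
P, Q, E = 1009, 1013, 17
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent, Python 3.8+
NUM_DOCS = 3


def enc(m):
    return pow(m, E, N)


def dec(c):
    return pow(c, D, N)


# Content owner side: hypothetical integer weights C(i) and rows TF(i, .).
weights = {"secure": 5, "search": 3}
tf = {"secure": [2, 0, 4], "search": [1, 6, 0]}
enc_weights = {term: enc(c) for term, c in weights.items()}
enc_tf = {term: [enc(v) for v in row] for term, row in tf.items()}


def server_scores(term):
    """EQ-8: multiply ciphertexts modulo N; no plaintext is used here."""
    return [(enc_weights[term] * c) % N for c in enc_tf[term]]


def client_rank(query):
    """Client decrypts each per-term score, sums per document, and sorts."""
    totals = [0] * NUM_DOCS
    for term in query:
        for j, cipher in enumerate(server_scores(term)):
            totals[j] += dec(cipher)
    order = sorted(range(NUM_DOCS), key=lambda j: totals[j], reverse=True)
    return totals, order


totals, ranking = client_rank(["secure", "search"])
print(totals, ranking)
```

  • The server returns one encrypted row per query term, so the transferred data grows with the number of terms. Note also that textbook RSA is deterministic and maps 0 to 0, so this toy version reveals which entries are zero.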
  • The RSA based scheme has the advantage that the per-term relevance scores are computed on the server without sacrificing security. However, the amount of data that needs to be transferred to the client is still proportional to the number of terms in the query. This is because the only operation that is homomorphic in RSA is multiplication, so the server cannot add the per-term scores in the encrypted domain and must return each of them to the client. To overcome this limitation, a scheme based on symmetric homomorphic encryption may be utilized, as discussed below.
  • Symmetric Homomorphism Based Approach
  • A key-dependent homomorphic encryption algorithm gK, with key K, operating on data items x1 and x2, satisfies gK(x1+x2)=gK(x1)+gK(x2), gK(x1*x2)=gK(x1)*gK(x2), and gK(x1*c)=c*gK(x1) for any constant c. Thus, the function gK is homomorphic with respect to addition and multiplication operations. Division can then be performed by treating it as an operation on rational numbers, with the numerator and denominator terms computed separately as follows:
  • $$g\!\left(\frac{x_1}{x_2} + \frac{x_3}{x_4}\right) = \frac{g(x_1) \cdot g(x_4) + g(x_2) \cdot g(x_3)}{g(x_2) \cdot g(x_4)} \tag{EQ-9}$$
  • These properties can be used to efficiently compute the relevance scores. Referring to EQ-1, the Okapi relevance score can now be written as follows:
  • $$CW(i,j) = \frac{TF(i,j)\,C_1(i)}{TF(i,j) + C_2(j)} = \frac{Num(i,j)}{Den(i,j)} \tag{EQ-10}$$
  • where C1(i)=(K1+1)CFW(i) and C2(j)=K1(1−b+b×NDL(j)).
  • In the pre-processing stage, the content owner encodes each row of the term frequency table TF(i,.) separately using homomorphic encryption to obtain TF(s)(i,.)=gK(TF(i,.)), and these results are used in the search stage. The values of the constants C1(i) and C2(j) are also computed and stored along with the encrypted term frequency rows TF(e)(i,.). In the search phase, suppose that a query contains the terms, w(i1), w(i2), . . . , w(iM); for each term in the query, the SCU decrypts and decodes the corresponding term frequency row to obtain TF(s)(im,.). It then obtains the numerator and denominator of gK(CW(im, j)) for each query term using

  • gK(Num(i_m, j)) = C1(i_m) * gK(TF(i_m, j))  (EQ-11)

  • gK(Den(i_m, j)) = C2(j) + gK(TF(i_m, j))  (EQ-12)
  • The overall encrypted value of the relevance score, gK(CW(j)), is then obtained by adding the relevance scores in the encrypted domain and can be shown to be
  • $$g_K(CW(j)) = \frac{\displaystyle\sum_{m=1}^{M} g_K(Num(i_m, j)) \prod_{\substack{n=1 \\ n \neq m}}^{M} g_K(Den(i_n, j))}{\displaystyle\prod_{m=1}^{M} g_K(Den(i_m, j))} \tag{EQ-13}$$
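  • The numerator and denominator bookkeeping of EQ-11 through EQ-13 can be checked on plain integers. In the sketch below the encryption gK is replaced by the identity map, which is an assumption made purely to verify the arithmetic; it is not an implementation of a privacy homomorphism. The constants and term frequencies are hypothetical integers.

```python
from fractions import Fraction
from math import prod


def combine(nums, dens):
    """Numerator and denominator of sum(nums[m] / dens[m]), as in EQ-13."""
    numerator = sum(
        nums[m] * prod(dens[n] for n in range(len(dens)) if n != m)
        for m in range(len(nums))
    )
    denominator = prod(dens)
    return numerator, denominator


# Hypothetical values for one document j and a three-term query.
c1 = [4, 6, 2]   # C1(i_m) for each query term
tf = [3, 1, 5]   # TF(i_m, j) for each query term
c2 = 2           # C2(j) for the document

nums = [c * t for c, t in zip(c1, tf)]   # EQ-11, with gK as the identity
dens = [c2 + t for t in tf]              # EQ-12, with gK as the identity

num, den = combine(nums, dens)
direct = sum(Fraction(a, b) for a, b in zip(nums, dens))

# The combined fraction equals the direct sum of the per-term scores.
assert Fraction(num, den) == direct
print(num, den)
```

  • Only additions and multiplications appear in `combine`, which is why the same steps can be carried out on ciphertexts under an encryption that is homomorphic in both operations, with the single division deferred to the key holder.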
  • In the absence of the decryption key, the exact value of the relevance score cannot be computed by the SCU, so the numerator and denominator of gK(CW(j)) are sent to the content owner/supervisor. The content owner decrypts them with the secret key to obtain the actual numeric values of Num(j) and Den(j) and computes the relevance score for each document. The content owner then sorts the relevance scores and sends the list of relevant documents to the data center, which retrieves them from its collection for the user.
  • Comparison of RSA and Homomorphic Encryption Approaches
  • The proposed symmetric homomorphic encryption based scheme has the advantage that the amount of data transferred between the server and the client is independent of the number of terms in the query. Also, the amount of computation that has to be performed on the client side is reduced by shifting most of the computation to the server side. However, this requires that the keys K_i^(s) used for encrypting the rows of the term frequency table TF(i,.) be the same. In contrast, the RSA based scheme does not require that the keys used for encrypting the rows of the term frequency table be the same; the consequence is the relatively larger amount of data that needs to be transferred from the server to the client. Thus, depending on the usage scenario, the user may choose one of the two options.
  • RESULTS/DISCUSSION
  • The performance of the homomorphic encryption (HME) scheme, the order-preserving encryption (OPE) scheme, and the baseline model will now be compared in terms of security and retrieval accuracy, and the tradeoffs involved in securing the term frequency using order-preserving encryption will be examined. The retrieval accuracies of the secure search schemes will be evaluated on the W3C collection using the 59 queries of the discussion search task in the enterprise track of the 2005 Text Retrieval Conference (TREC). Any document that is judged partially relevant or relevant is taken to be relevant (i.e. conflating the top two judgment levels). In terms of retrieval accuracy, the performance of the HME scheme should be identical to that of the baseline model, as it also has the accurate term frequency information to compute the relevance score.
  • The performance of the proposed schemes is discussed using precision-recall graphs. The precision-recall results for all 59 queries are collected and the average performance is shown in FIG. 7, which shows that the retrieval accuracy of the OPE is slightly lower than that of the baseline scheme. However, this slight drop in performance in OPE comes with added advantages of fewer communication rounds compared with the HME and the baseline schemes.
  • TABLE III
    Retrieval Accuracy Measures for Various Schemes
    Metric    Baseline    OPE       Metric    Baseline    OPE
    MAP       0.3739      0.3142    P@20      0.4271      0.3839
    r-prec    0.3878      0.3476    P@30      0.3791      0.3271
    bpref     0.3798      0.3412    P@100     0.2366      0.2056
    P@5       0.5424      0.5017    P@1000    0.0471      0.0422
    P@10      0.4881      0.4627    RR1       0.7257      0.6749
  • The search-retrieval accuracy of the proposed schemes is also examined using a set of common evaluation metrics discussed in N. Craswell, A. P. de Vries, and Ian Soboroff, “Overview of the TREC-2005 Enterprise Track,” Text Retrieval Conference, 2005, and “Common Evaluation Measures,” Appendix to the Proceedings of Text Retrieval Conference, 2005. The evaluation results are shown in Table III. Comparing the values in Table III with the results published in the “Overview of the TREC-2005 Enterprise Track” document, the baseline scheme using the Okapi relevance score would have been ranked second in the evaluation, suggesting that the retrieval accuracy of the baseline scheme is comparable to that of state-of-the-art information retrieval systems that do not take security issues into account. With regard to the OPE, even with the added layer of security, the scheme would have placed among the top five search retrieval schemes evaluated in the TREC 2005 conference.
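  • For reference, two of the measures reported in Table III can be computed as in the following sketch. It uses the common definitions of precision at rank k (P@k) and average precision, the per-query quantity that is averaged over queries to give MAP; the document identifiers and relevance judgments are hypothetical, and the sketch is not the evaluation tool used for the reported figures.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each relevant document's rank,
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranked list and relevance judgments for one query.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d3", "d2", "d9"}

p_at_5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
print(p_at_5, ap)
```

  • In this example two of the three relevant documents are retrieved, at ranks 1 and 4, and the relevant document that is never retrieved still counts in the denominator of the average precision.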
  • By introducing the order-preserving encryption on row term frequency values, the OPE enables document search on the data center side while preventing it from learning the critical term frequency information. When a query contains a single term, the OPE can achieve a search as effective as that of the baseline model by accurately identifying the target documents. This is because the order of the term frequency values is preserved after the inner-layer encryption, and the relevance score is a strictly increasing function of the term frequency. As the number of terms in a query increases, the order may not be completely preserved when summing up the scores of all terms. To examine the search accuracy for multiple terms, FIG. 8 shows the differences between the Mean Average Precision (MAP) of the baseline scheme and that of the order-preserving encryption scheme for different numbers of search terms. As the majority of queries in the W3C experiments for which the ground truth is available include 2 to 4 terms, the search accuracy is examined and compared for numbers of search terms within this range. With multiple terms in a query, the accuracy of the OPE remains within a small gap of that of the baseline model. Thus, within this range, the number of search terms in the query does not significantly affect the performance of the OPE scheme. These results show that the OPE scheme is capable of effectively processing multiple-term queries while maintaining confidentiality of the content statistics.
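  • The difference between the single-term and multiple-term cases can be illustrated with a toy order-preserving map. The sketch below uses a random strictly increasing lookup table per term, which is an assumption standing in for the inner-layer order-preserving encryption and is not the construction used in the described scheme; raw term frequencies are summed as a simple proxy for the relevance score.

```python
import random


def make_ope_table(max_value, seed):
    """Toy order-preserving map: a random strictly increasing table."""
    rng = random.Random(seed)
    table, current = [], 0
    for _ in range(max_value + 1):
        current += rng.randint(1, 100)   # positive gap keeps the order
        table.append(current)
    return table


def rank(scores):
    """Document indices sorted by decreasing score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# One table per query term (assumed to use independent keys).
ope_a = make_ope_table(20, seed=1)
ope_b = make_ope_table(20, seed=2)

# Hypothetical term frequencies of two terms in four documents.
tf_a = [3, 9, 0, 5]
tf_b = [4, 1, 9, 3]

# Single term: the ranking on mapped values equals the plaintext ranking.
single_plain = rank(tf_a)
single_enc = rank([ope_a[v] for v in tf_a])
assert single_plain == single_enc

# Two terms: sums of mapped values may order documents differently
# from sums of plaintext values, so no equality is asserted here.
multi_plain = rank([a + b for a, b in zip(tf_a, tf_b)])
multi_enc = rank([ope_a[a] + ope_b[b] for a, b in zip(tf_a, tf_b)])
print(multi_plain, multi_enc)
```

  • Because each table is monotone but nonlinear, the mapped sums are not a monotone function of the plaintext sums, which is the source of the small accuracy gap for multiple-term queries.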
  • FIG. 9 shows a scatter plot of the Mean Average Precision (MAP) values for the fully server oriented (FSO) scheme plotted with respect to the baseline scheme for the 59 search queries in the W3C database. The figure shows strong correlation, with the slope of the best linear fit close to 1, indicating that there is no significant reduction in performance for the FSO scheme compared to the baseline scheme.
  • As shown in FIG. 10, to compare the ranking accuracies, the modified Kendall distance measure proposed in “Common Evaluation Measures,” Appendix to the Proceedings of Text Retrieval Conference, 2005, is used to compare the top 20 and top 100 ranks obtained using the baseline scheme and the FSO scheme. The distance between the top 20 ranks for the FSO scheme and the baseline scheme is approximately 0.42, and the corresponding value for the top 100 ranks is approximately 0.29. The distance for the top 20 ranks is higher because the random mapping may change the order of the top 20 ranks. However, for the top 100 ranks the distance is much lower because most of the top 100 documents are common to the two lists.
  • Certain aspects of the proposed framework, as related to security, storage efficiency, search accuracy, and system complexity, will now be discussed. If efficient storage of term frequency is needed, the inner-layer encryption in HME and OPE would have to retain the sparsity of the TF table by keeping the zero-valued terms. Thus the SCU may gain knowledge of the zero-valued TF entries, without knowing which plain-text term and which document these correspond to. The proposed schemes require a secure environment to initially generate the encrypted indices and encrypted documents. Usually such initial processing is required only once. However, when the collection is constantly changing, such as by adding more documents or changing the contents of existing documents, the secure index information in HME and OPE should also be updated. For the OPE scheme, the mapping of frequency values for all terms that appear in the new or changed documents should be updated to ensure security and search accuracy. In such cases, the cost of maintaining a secure search system can be relatively high. One method of addressing such incremental changes to the encrypted TF without a complete update would be to encrypt each document separately, instead of encrypting the documents together. By doing so, while accuracy is slightly reduced due to the different encryption used for different documents, the documents can nevertheless be updated as needed.
  • The invention thus provides a new framework for secure and confidentiality-preserving search and retrieval in large scale document collections, and techniques for securely rank-ordering the documents and extracting the most relevant documents from an encrypted collection based on the encrypted search queries. The baseline, fully server oriented, and partially server oriented schemes maintain the confidentiality of the query as well as the content of retrieved documents. The confidentiality preserving system and method described herein are highly secure (relying on secure cryptographic encryption and hashing algorithms), accurate (comparable to conventional searching systems working with unencrypted data), and efficient (in terms of computational complexity and communication bandwidth), as demonstrated by experiments with the W3C collection (discussed above). The confidentiality preserving system and method have a wide range of applications, such as searching information with hierarchical access control, flexible “e-discovery” practices for digital records in legal proceedings, a variety of multi-media applications, image/video searching, and finger-print matching.
  • Although several embodiments of this invention have been described above with a certain degree of particularity, those skilled in the art may make numerous alterations to the disclosed embodiments without departing from the scope of this invention. All directional references (e.g., upper, lower, upward, downward, left, right, leftward, rightward, top, bottom, above, below, vertical, horizontal, clockwise and counterclockwise) are only used for identification purposes to aid the reader's understanding of the present invention, and do not create limitations, particularly as to the position, orientation, or use of the invention. Joinder references (e.g., attached, coupled, connected, and the like) are to be construed broadly and may include intermediate members between a connection of elements and relative movement between elements. As such, joinder references do not necessarily infer that two elements are directly connected and in fixed relation to each other. It is intended that all matter contained in the above description or shown in the accompanying drawings shall be interpreted as illustrative only and not as limiting. Changes in detail or structure may be made without departing from the invention as defined in the appended claims.

Claims (20)

1. A confidentiality preserving system for performing a rank-ordered search and retrieval of contents of a data collection, the system comprising:
at least one computer system including a search and retrieval algorithm using at least one of term frequency and similar features for rank-ordering selective contents of the data collection, and enabling secure retrieval of the selective contents based on the rank-order.
2. A confidentiality preserving system according to claim 1, wherein the search and retrieval algorithm generates a relevance score for the rank-ordering based on at least one query.
3. A confidentiality preserving system according to claim 2, wherein at least one of the data collection and query are encrypted.
4. A confidentiality preserving system according to claim 1, wherein the data collection includes at least one of documents and multi-media content.
5. A confidentiality preserving system according to claim 1, wherein the search and retrieval algorithm includes at least one of a baseline algorithm, a partially server oriented algorithm, and a fully server oriented algorithm.
6. A confidentiality preserving system according to claim 5, wherein the baseline algorithm includes a pre-processing algorithm for building a secure term frequency table and an inverse data collection frequency table, and a search stage algorithm for the rank-ordering in response to a query.
7. A confidentiality preserving system according to claim 6, wherein the pre-processing algorithm includes stemming of selective components of the contents of the data collection and mapping of the stemmed components in the term frequency table.
8. A confidentiality preserving system according to claim 7, wherein the selective components are words, and the data collection contents are documents.
9. A confidentiality preserving system according to claim 6, wherein the search stage algorithm includes stemming of a query term, searching of the term frequency table, generation of a relevance score, rank ordering of the selective contents of the data collection based on the relevance score, and retrieval of the selective contents of the data collection based on the rank order.
10. A confidentiality preserving system according to claim 6, wherein the pre-processing and search stage algorithms are executed at a user site remote from a data center for storing the data collection.
11. A confidentiality preserving system according to claim 5, wherein the partially server oriented algorithm includes performance of selective computations at a user site remote from a data center for storing the data collection.
12. A confidentiality preserving system according to claim 5, wherein the partially server oriented algorithm includes at least one of building of a term frequency table and generation of a relevance score at a user site remote from a data center for storing the data collection.
13. A confidentiality preserving system according to claim 5, wherein the fully server oriented algorithm includes building of a term frequency table at a user site and generation of a relevance score at a secure computing unit in a data center for storing the data collection.
14. A confidentiality preserving system according to claim 5, wherein at least one of the partially and fully server oriented algorithms use at least one of homomorphic encryption and order-preserving encryption for enabling search capability from a user other than an owner of the contents of the data collection.
15. A confidentiality preserving method for performing a rank-ordered search and retrieval of contents of a data collection, the method comprising:
using at least one of term frequency and similar features for rank-ordering selective contents of the data collection; and
securely retrieving the selective contents based on the rank-order.
16. A confidentiality preserving method according to claim 15, further comprising generating a relevance score for the rank-ordering based on at least one query.
17. A confidentiality preserving method according to claim 16, further comprising encrypting at least one of the data collection and query.
18. A confidentiality preserving method according to claim 15, wherein the data collection includes at least one of documents and multi-media content.
19. A confidentiality preserving method according to claim 15, further comprising building a secure term frequency table and an inverse data collection frequency table by stemming of selective components of the contents of the data collection and mapping of the stemmed components in the term frequency table.
20. A confidentiality preserving method according to claim 15, further comprising stemming of a query term, searching of a term frequency table, generation of a relevance score, rank ordering of the selective contents of the data collection based on the relevance score, and retrieval of the selective contents of the data collection based on the rank order.
US12/608,724 2008-10-29 2009-10-29 System and method for confidentiality-preserving rank-ordered search Abandoned US20100146299A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US12/608,724 US20100146299A1 (en) 2008-10-29 2009-10-29 System and method for confidentiality-preserving rank-ordered search
US14/104,652 US20160154971A9 (en) 2008-10-29 2013-12-12 System and Method for Confidentiality-Preserving Rank-Ordered Search
US15/274,605 US20170235736A1 (en) 2008-10-29 2016-09-23 System and method for confidentiality-preserving rank-ordered search
US17/112,874 US11567950B2 (en) 2008-10-29 2020-12-04 System and method for confidentiality-preserving rank-ordered search

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10929108P 2008-10-29 2008-10-29
US12/608,724 US20100146299A1 (en) 2008-10-29 2009-10-29 System and method for confidentiality-preserving rank-ordered search

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/104,652 Continuation US20160154971A9 (en) 2008-10-29 2013-12-12 System and Method for Confidentiality-Preserving Rank-Ordered Search

Publications (1)

Publication Number Publication Date
US20100146299A1 true US20100146299A1 (en) 2010-06-10

Family

ID=42232402

Family Applications (4)

Application Number Title Priority Date Filing Date
US12/608,724 Abandoned US20100146299A1 (en) 2008-10-29 2009-10-29 System and method for confidentiality-preserving rank-ordered search
US14/104,652 Abandoned US20160154971A9 (en) 2008-10-29 2013-12-12 System and Method for Confidentiality-Preserving Rank-Ordered Search
US15/274,605 Abandoned US20170235736A1 (en) 2008-10-29 2016-09-23 System and method for confidentiality-preserving rank-ordered search
US17/112,874 Active 2030-03-18 US11567950B2 (en) 2008-10-29 2020-12-04 System and method for confidentiality-preserving rank-ordered search

Country Status (1)

Country Link
US (4) US20100146299A1 (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010019614A1 (en) * 2000-10-20 2001-09-06 Medna, Llc Hidden Link Dynamic Key Manager for use in Computer Systems with Database Structure for Storage and Retrieval of Encrypted Data
US20020174355A1 (en) * 2001-03-12 2002-11-21 Arcot Systems, Inc. Techniques for searching encrypted files
US20040243816A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation Querying encrypted data in a relational database system
US20050144162A1 (en) * 2003-12-29 2005-06-30 Ping Liang Advanced search, file system, and intelligent assistant agent
US20050147246A1 (en) * 2004-01-05 2005-07-07 Rakesh Agrawal System and method for fast querying of encrypted databases
US20050166046A1 (en) * 2004-01-23 2005-07-28 Bellovin Steven M. Privacy-enhanced searches using encryption
US20060129545A1 (en) * 2004-12-09 2006-06-15 Philippe Golle System and method for performing a conjunctive keyword search over encrypted data
US20070250486A1 (en) * 2006-03-01 2007-10-25 Oracle International Corporation Document date as a ranking factor for crawling
US20080077570A1 (en) * 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
US20080294909A1 (en) * 2005-03-01 2008-11-27 The Regents Of The University Of California Method for Private Keyword Search on Streaming Data
US20090157652A1 (en) * 2007-12-18 2009-06-18 Luciano Barbosa Method and system for quantifying the quality of search results based on cohesion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8321485B2 (en) * 2006-11-08 2012-11-27 Hitachi, Ltd. Device and method for constructing inverted indexes
US7970721B2 (en) * 2007-06-15 2011-06-28 Microsoft Corporation Learning and reasoning from web projections

Cited By (102)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8291237B2 (en) * 2005-03-01 2012-10-16 The Regents Of The University Of California Method for private keyword search on streaming data
US20080294909A1 (en) * 2005-03-01 2008-11-27 The Regents Of The University Of California Method for Private Keyword Search on Streaming Data
US20110167255A1 (en) * 2008-09-15 2011-07-07 Ben Matzkel System, apparatus and method for encryption and decryption of data transmitted over a network
US9444793B2 (en) 2008-09-15 2016-09-13 Vaultive Ltd. System, apparatus and method for encryption and decryption of data transmitted over a network
US9338139B2 (en) * 2008-09-15 2016-05-10 Vaultive Ltd. System, apparatus and method for encryption and decryption of data transmitted over a network
US8812867B2 (en) * 2009-12-16 2014-08-19 Electronics And Telecommunications Research Institute Method for performing searchable symmetric encryption
US20110145594A1 (en) * 2009-12-16 2011-06-16 Electronics And Telecommunications Research Institute Method for performing searchable symmetric encryption
US20110264920A1 (en) * 2010-04-27 2011-10-27 Fuji Xerox Co., Ltd. Systems and methods for communication, storage, retrieval, and computation of simple statistics and logical operations on encrypted data
US8862895B2 (en) * 2010-04-27 2014-10-14 Fuji Xerox Co., Ltd. Systems and methods for communication, storage, retrieval, and computation of simple statistics and logical operations on encrypted data
US10313371B2 (en) 2010-05-21 2019-06-04 Cyberark Software Ltd. System and method for controlling and monitoring access to data processing applications
US20120078914A1 (en) * 2010-09-29 2012-03-29 Microsoft Corporation Searchable symmetric encryption with dynamic updating
US8533489B2 (en) * 2010-09-29 2013-09-10 Microsoft Corporation Searchable symmetric encryption with dynamic updating
US8396871B2 (en) 2011-01-26 2013-03-12 DiscoverReady LLC Document classification and characterization
US9703863B2 (en) 2011-01-26 2017-07-11 DiscoverReady LLC Document classification and characterization
US9519665B2 (en) 2011-07-06 2016-12-13 Business Partners Limited Search index
US10552466B2 (en) 2011-07-06 2020-02-04 Business Partners Limited Search index
US8526603B2 (en) * 2011-07-08 2013-09-03 Sap Ag Public-key encrypted bloom filters with applications to private set intersection
CN103095453B (en) * 2011-07-08 2017-11-03 Sap欧洲公司 The Bloom filter of the public key encryption occured simultaneously using privately owned set
CN103095453A (en) * 2011-07-08 2013-05-08 Sap股份公司 Public-key Encrypted Bloom Filters With Applications To Private Set Intersection
US9690845B2 (en) * 2011-07-29 2017-06-27 Nec Corporation System for generating index resistant against divulging of information, index generation device, and method therefor
US20140129567A1 (en) * 2011-07-29 2014-05-08 C/O Nec Corporation System for generating index resistant against divulging of information, index generation device, and method therefor
US9600677B2 (en) * 2011-11-11 2017-03-21 Nec Corporation Database apparatus, method, and program
JPWO2013069770A1 (en) * 2011-11-11 2015-04-02 日本電気株式会社 Database apparatus, method and program
US20140325217A1 (en) * 2011-11-11 2014-10-30 Nec Corporation Database apparatus, method, and program
TWI467411B (en) * 2011-12-20 2015-01-01 Ind Tech Res Inst Document processing method and system
US20130159695A1 (en) * 2011-12-20 2013-06-20 Industrial Technology Research Institute Document processing method and system
US9197613B2 (en) * 2011-12-20 2015-11-24 Industrial Technology Research Institute Document processing method and system
CN103179179A (en) * 2011-12-20 2013-06-26 财团法人工业技术研究院 Document processing method and system
US20130159694A1 (en) * 2011-12-20 2013-06-20 Industrial Technology Research Institute Document processing method and system
US8819408B2 (en) * 2011-12-20 2014-08-26 Industrial Technology Research Institute Document processing method and system
US8904171B2 (en) 2011-12-30 2014-12-02 Ricoh Co., Ltd. Secure search and retrieval
US9892202B2 (en) * 2012-01-25 2018-02-13 Microsoft Technology Licensing, Llc Web page load time reduction by optimized authentication
US20130191650A1 (en) * 2012-01-25 2013-07-25 Massachusetts Institute Of Technology Methods and apparatus for securing a database
US9087212B2 (en) * 2012-01-25 2015-07-21 Massachusetts Institute Of Technology Methods and apparatus for securing a database
US20130191498A1 (en) * 2012-01-25 2013-07-25 Microsoft Corporation Web page load time reduction by optimized authentication
US10181049B1 (en) * 2012-01-26 2019-01-15 Hrl Laboratories, Llc Method and apparatus for secure and privacy-preserving querying and interest announcement in content push and pull protocols
US10467252B1 (en) 2012-01-30 2019-11-05 DiscoverReady LLC Document classification and characterization using human judgment, tiered similarity analysis and language/concept analysis
US9667514B1 (en) 2012-01-30 2017-05-30 DiscoverReady LLC Electronic discovery system with statistical sampling
US8832427B2 (en) 2012-03-30 2014-09-09 Microsoft Corporation Range-based queries for searchable symmetric encryption
EP2865127A4 (en) * 2012-06-22 2016-03-09 Commw Scient Ind Res Org Homomorphic encryption for database querying
US20150295716A1 (en) * 2012-06-22 2015-10-15 Commonwealth Scientific And Industrial Research Organisation Homomorphic encryption for database querying
US10027486B2 (en) * 2012-06-22 2018-07-17 Commonwealth Scientific And Industrial Research Organisation Homomorphic encryption for database querying
US20140012862A1 (en) * 2012-07-04 2014-01-09 Sony Corporation Information processing apparatus, information processing method, program, and information processing system
US20150172044A1 (en) * 2012-07-04 2015-06-18 Nec Corporation Order-preserving encryption system, encryption device, decryption device, encryption method, decryption method, and programs thereof
US9584315B2 (en) * 2012-07-04 2017-02-28 Nec Corporation Order-preserving encryption system, encryption device, decryption device, encryption method, decryption method, and programs thereof
EP2709307A3 (en) * 2012-08-21 2014-08-27 Xerox Corporation Methods and systems for securely accessing translation resource manager
US9535658B2 (en) * 2012-09-28 2017-01-03 Alcatel Lucent Secure private database querying system with content hiding bloom filters
US20140108435A1 (en) * 2012-09-28 2014-04-17 Vladimir Y. Kolesnikov Secure private database querying system with content hiding bloom fiters
US10346627B2 (en) * 2012-10-25 2019-07-09 Verisign, Inc. Privacy preserving data querying
US9363288B2 (en) 2012-10-25 2016-06-07 Verisign, Inc. Privacy preserving registry browsing
US20140122476A1 (en) * 2012-10-25 2014-05-01 Verisign, Inc. Privacy preserving data querying
US20160085987A1 (en) * 2012-10-25 2016-03-24 Verisign, Inc. Privacy preserving data querying
US9866536B2 (en) 2012-10-25 2018-01-09 Verisign, Inc. Privacy preserving registry browsing
US9202079B2 (en) * 2012-10-25 2015-12-01 Verisign, Inc. Privacy preserving data querying
US10565394B2 (en) 2012-10-25 2020-02-18 Verisign, Inc. Privacy—preserving data querying with authenticated denial of existence
WO2014109828A3 (en) * 2012-11-16 2014-09-18 Raytheon Bbn Technologies Corp. Method for secure substring search
US10038562B2 (en) 2013-01-29 2018-07-31 Nec Corporation Method and system for providing encrypted data for searching of information therein and a method and system for searching of information on encrypted data
US10341086B2 (en) 2013-01-29 2019-07-02 Nec Corporation Method and system for providing encrypted data for searching of information therein and a method and system for searching of information on encrypted data
WO2014118230A1 (en) * 2013-01-29 2014-08-07 Nec Europe Ltd. Method and system for providing encrypted data for searching of information therein and a method and system for searching of information on encrypted data
US10089487B2 (en) 2013-08-05 2018-10-02 International Business Machines Corporation Masking query data access pattern in encrypted data
US20150039903A1 (en) * 2013-08-05 2015-02-05 International Business Machines Corporation Masking query data access pattern in encrypted data
US9646166B2 (en) * 2013-08-05 2017-05-09 International Business Machines Corporation Masking query data access pattern in encrypted data
US9852306B2 (en) 2013-08-05 2017-12-26 International Business Machines Corporation Conjunctive search in encrypted data
US9313179B1 (en) 2013-08-16 2016-04-12 Google Inc. Mixing secure and insecure data and operations at server database
US9118631B1 (en) * 2013-08-16 2015-08-25 Google Inc. Mixing secure and insecure data and operations at server database
US9037860B1 (en) * 2013-11-22 2015-05-19 Sap Se Average-complexity ideal-security order-preserving encryption
US20150149773A1 (en) * 2013-11-22 2015-05-28 Sap Ag Average-complexity ideal-security order-preserving encryption
US9576116B2 (en) * 2013-12-26 2017-02-21 Nxp B.V. Secure software components anti-reverse-engineering by table interleaving
US20150186627A1 (en) * 2013-12-26 2015-07-02 Nxp B.V. Secure software compnents anti-reverse-engineering by table interleaving
US20160335450A1 (en) * 2014-01-16 2016-11-17 Hitachi, Ltd. Searchable encryption processing system and searchable encryption processing method
US10489604B2 (en) * 2014-01-16 2019-11-26 Hitachi, Ltd. Searchable encryption processing system and searchable encryption processing method
CN103927340A (en) * 2014-03-27 2014-07-16 中国科学院信息工程研究所 Ciphertext retrieval method
US20170048058A1 (en) * 2014-04-23 2017-02-16 Agency For Science, Technology And Research Method and system for generating/decrypting ciphertext, and method and system for searching ciphertexts in a database
US10693626B2 (en) * 2014-04-23 2020-06-23 Agency For Science, Technology And Research Method and system for generating/decrypting ciphertext, and method and system for searching ciphertexts in a database
US10089336B2 (en) * 2014-12-22 2018-10-02 Oracle International Corporation Collection frequency based data model
US20160179854A1 (en) * 2014-12-22 2016-06-23 Oracle International Corporation Collection frequency based data model
US10037433B2 (en) 2015-04-03 2018-07-31 Ntt Docomo Inc. Secure text retrieval
EP3076329A1 (en) * 2015-04-03 2016-10-05 NTT DoCoMo, Inc. Secure text retrieval
US11775656B2 (en) * 2015-05-01 2023-10-03 Micro Focus Llc Secure multi-party information retrieval
US20170004324A1 (en) * 2015-07-02 2017-01-05 Samsung Electronics Co., Ltd. Method for managing data and apparatuses therefor
KR20170004456A (en) * 2015-07-02 2017-01-11 삼성전자주식회사 A method for managing data and apparatuses therefor
US10198592B2 (en) * 2015-07-02 2019-02-05 Samsung Electronics Co., Ltd. Method and system for communicating homomorphically encrypted data
KR102402625B1 (en) * 2015-07-02 2022-05-27 삼성전자주식회사 A method for managing data and apparatuses therefor
CN104967693A (en) * 2015-07-15 2015-10-07 中南民族大学 Document similarity calculation method facing cloud storage based on fully homomorphic password technology
US10614135B2 (en) * 2015-09-11 2020-04-07 Skyhigh Networks, Llc Wildcard search in encrypted text using order preserving encryption
US9641489B1 (en) * 2015-09-30 2017-05-02 EMC IP Holding Company Fraud detection
US11308233B2 (en) 2017-02-15 2022-04-19 Wallix Method for information retrieval in an encrypted corpus stored on a server
WO2018150119A1 (en) 2017-02-15 2018-08-23 Wallix Method for information retrieval in an encrypted corpus stored on a server
FR3062936A1 (en) * 2017-02-15 2018-08-17 Wallix METHOD FOR SEARCHING INFORMATION IN A STORED STORED CORPUS ON A SERVER
CN110110163A (en) * 2018-01-18 2019-08-09 Sap欧洲公司 Safe substring search is with filtering enciphered data
US10885216B2 (en) * 2018-01-18 2021-01-05 Sap Se Secure substring search to filter encrypted data
US20190318118A1 (en) * 2018-04-16 2019-10-17 International Business Machines Corporation Secure encrypted document retrieval
US11216575B2 (en) * 2018-10-09 2022-01-04 Q-Net Security, Inc. Enhanced securing and secured processing of data at rest
US20220237311A1 (en) * 2018-10-09 2022-07-28 Q-Net Security, Inc. Enhanced Securing and Secured Processing of Data at Rest
US11853445B2 (en) * 2018-10-09 2023-12-26 Q-Net Security, Inc. Enhanced securing and secured processing of data at rest
US11861027B2 (en) 2018-10-09 2024-01-02 Q-Net Security, Inc. Enhanced securing of data at rest
US11764940B2 (en) 2019-01-10 2023-09-19 Duality Technologies, Inc. Secure search of secret data in a semi-trusted environment using homomorphic encryption
US20210056220A1 (en) * 2019-08-22 2021-02-25 Mediatek Inc. Method for improving confidentiality protection of neural network model
US20220231847A1 (en) * 2021-01-19 2022-07-21 Bank Of America Corporation Collaborative architecture for secure data sharing
US11799643B2 (en) * 2021-01-19 2023-10-24 Bank Of America Corporation Collaborative architecture for secure data sharing
CN113132085A (en) * 2021-04-14 2021-07-16 上海同态信息科技有限责任公司 Ciphertext query method based on searchable encryption
CN114329154A (en) * 2021-12-30 2022-04-12 电子科技大学广东电子信息工程研究院 Safety search method based on data stored by cloud server

Also Published As

Publication number Publication date
US20160154971A9 (en) 2016-06-02
US20150169889A1 (en) 2015-06-18
US20170235736A1 (en) 2017-08-17
US11567950B2 (en) 2023-01-31
US20210109940A1 (en) 2021-04-15

Similar Documents

Publication Publication Date Title
US11567950B2 (en) System and method for confidentiality-preserving rank-ordered search
Swaminathan et al. Confidentiality-preserving rank-ordered search
EP3058678B1 (en) System and method for dynamic, non-interactive, and parallelizable searchable symmetric encryption
Cui et al. Efficient and expressive keyword search over encrypted data in cloud
Orencik et al. A practical and secure multi-keyword search method over encrypted cloud data
Xia et al. Secure semantic expansion based search over encrypted cloud data supporting similarity ranking
Zhang et al. SE-PPFM: A searchable encryption scheme supporting privacy-preserving fuzzy multikeyword in cloud systems
US7558970B2 (en) Privacy-enhanced searches using encryption
CN112270006A (en) Searchable encryption method for hiding search mode and access mode in e-commerce platform
KR20120068524A (en) Method and apparatus for providing data management
CN112332979B (en) Ciphertext search method, system and equipment in cloud computing environment
Carbunar et al. Toward private joins on outsourced data
Wang et al. An efficient and privacy-preserving range query over encrypted cloud data
Wang et al. Towards practical private processing of database queries over public data
Muhammad et al. A secure data outsourcing scheme based on Asmuth–Bloom secret sharing
Selvam et al. On developing dynamic and efficient cryptosystem for safeguarding healthcare data in public clouds
JP7132506B2 (en) Confidential Information Retrieval System, Confidential Information Retrieval Program, and Confidential Information Retrieval Method
Kamini et al. Encrypted multi-keyword ranked search supporting gram based search technique
Chi et al. Privacy-enhancing range query processing over encrypted cloud databases
Uplavikar et al. Lucene-P²: A Distributed Platform for Privacy-Preserving Text-Based Search
Zhu et al. Secure data retrieval of outsourced data with complex query support
Rajendran et al. An Efficient Ranked Multi-Keyword Search for Multiple Data Owners Over Encrypted Cloud Data: Survey
Zhang et al. Secure multi-keyword fuzzy search supporting logic query over encrypted cloud data
Raj et al. A Survey on Healthcare Standards and Security Requirements for Electronic Health Records
Boucenna et al. Access Pattern Hiding in Searchable Encryption

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION
AS Assignment Owner name: UNIVERSITY OF MARYLAND, COLLEGE PARK, MARYLAND Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SWAMINATHAN, ASHWIN;MAO, YINIAN;SU, GUAN-MING;AND OTHERS;SIGNING DATES FROM 20190711 TO 20200519;REEL/FRAME:054526/0202