WO2007040380A1

WO2007040380A1 - Method of generating a printed signature in order to secure the contents of text documents

Info

Publication number: WO2007040380A1
Application number: PCT/MX2005/000089
Authority: WO
Inventors: Sergio Antonio Fernandez Orozco; Leo Hendrik Reyes Lozano
Original assignee: Fernandez Orozco Sergio Antoni
Priority date: 2005-10-04
Filing date: 2005-10-04
Publication date: 2007-04-12

Abstract

The invention relates to a method of generating secure documents. The invention is characterised by the way in which the signatures used to secure the documents are generated. Said signatures are formed from the text content of a printed document without requiring an electronic transcription of the text. The inventive method is based on automatically recognising the shape of the words and locating the positions at which the words are repeated, without necessarily knowing their meaning. In this way, the document can be copied or modified by any means, provided the original message contained in the document is not altered. Finally, the method can be used to secure documents printed on normal paper.

Description

METHOD FOR GENERATING A PRINTED SIGNATURE TO SECURE THE CONTENTS OF TEXT DOCUMENTS

Technical Field of the Invention

This invention relates to the security and integrity of the content of printed text documents. Its objective is to protect the content of documents printed on ordinary paper against malicious changes and counterfeiting, while tolerating slight alterations of the document that do not change the message written on it. It describes a system and method that generate a signature that can be attached in printed form to a text document to ensure that its content has not been maliciously modified. In addition, a system and method are described for verifying whether a document signed by the previous procedure has been modified. The main characteristics of these methods are that they do not require special technologies or materials (such as lasers, holograms or plastics); the original document does not have to be digital; the signed document can be copied without losing the signature properties; the signature is clearly visible; optical character recognition (OCR) is not required; and only the part that contains text can be protected.

Background of the invention

At present, there are many methods to protect electronic documents against malicious modifications. However, there are relatively few advances to protect physical documents. Among the protections available for the latter are watermarks, the use of bar codes, special papers, holograms and digital signatures or printed watermarks.

Any document that requires assurance that its content has not been modified will benefit from the technology set forth in this document. Examples of such documents are bank checks, deeds, wills, notarized documents, passports, etc. A brief summary of the state of the art related to the present invention is presented below.

In US patents 4,835,028, US 6,414,761 and US 4,630,845 techniques are described that require the use of special papers or the addition of optical or magnetic elements difficult to reproduce (holograms or magnetic tapes, respectively) to ensure the authenticity and integrity of a document. This type of protection has the disadvantage of requiring special materials. The present invention works with any type of paper and still maintains the same level of protection. US 6,496,933 describes a method that produces an image mark that can be added to a document to ensure its authenticity. However, the main disadvantage of this invention with respect to that presented here is that it is required that the document be originally in an electronic format; while the invention described herein can work with printed documents. US 6,764,000 describes a system that uses a scanner to identify points of interest or signs (such as watermarks, holograms, serial numbers, patterns, colors, etc.) present in the document to be secured. These clues are compared to a database that has been previously built. If sufficient clues are similar to those stored, the document is classified as authentic. Unlike that method, the present invention does not require the use of a database to authenticate the document. In addition, the invention generates a signature from the content or message written in the original document. In US 6,934,845 an invention is described that uses an encoding based on the blanks of a document to hide a signature that allows the document to be authenticated. This invention requires that a document be used in electronic format to generate the signature, and the signature is not generated with the content of the document. On the contrary, the present invention differs from that in that it does not require the document in electronic format and uses the content of the document to generate the authentication information. 
The invention also does not use a coding based on blanks, and its signature is clearly visible. US 6,427,921 describes a method and system that overlays various types of patterns on the original image to be secured in order to produce a watermark. Unlike this method and system, the present invention uses the contents of the document to generate the signature that will authenticate it. HK1028662 describes a method for securing printed documents that requires a laser to illuminate the questioned document and then verify that the reflected pattern meets certain characteristics. The present invention does not require laser technology. US Patent 6,785,405 describes a system that uses digital images of printed documents to verify their authenticity. The system segments the images and then compares these segments with images contained in a database, obtaining a correlation number for each type of document and segment. This phase serves to categorize the document. With this information, the system reads the authentication information that a document of that category must possess and uses it to verify that the document is original. Unlike this system, the present invention does not require segmentation or correlation. In addition, it does not require a database of known document images.

US Patent 3,069,654 "Method and Means for Recognizing Complex Patterns" describes a method for detecting figures in images. However, this method cannot be used by itself to protect the content of text documents. HP Safe Paper technology uses a watermark that is only visible when a plastic sheet is placed on the signed document. This watermark can be printed on any document using normal printers and inks. It cannot be reproduced by normal means, so documents cannot be copied. Unlike this system, the invention presented here does not use watermarks or require special plastics to verify the integrity of the document. Moreover, the invention allows multiple copies of a document to be made as long as the content of said document is not altered. This allows documents generated with this system to be sent electronically or electromechanically (for example, by facsimile). The similarly named AlpVision SafePaper technology uses an invisible watermark to secure a document. Unlike the HP technology, this watermark can be read with a scanner, but it disappears when the document is copied or reproduced by other means. In addition, this system can only sign electronic documents. The present invention allows documents to be copied as long as the content is intact.

In the area of scientific publications, there are several methods for attaching a digital signature to a printed document. S. Bhattacharjee and M. Kutter describe one of these algorithms in "Compression Tolerant Image Authentication", IEEE Int. Conference on Image Processing, USA, pp. 435-439, 1998. This algorithm is based on the use of wavelets to generate a digital document signature. J. Fridrich, in "Methods for Detecting Changes in Digital Images", Proc. IEEE Int. Workshop on Intell. Signal Processing and Communication Systems, 1998, uses spread-spectrum signals to generate digital document signatures.

C.-Y. Lin and S.-F. Chang, in "A Robust Image Authentication Method Surviving JPEG Lossy Compression", SPIE Storage and Retrieval of Image/Video Database, Vol. 3312, San Jose, 1998, use wavelets to secure the content of digital documents. None of the articles cited so far works with printed documents. In addition, the present invention does not use wavelets and is based on encoding the content of the text printed in the document.

Baoshi Zhu, Jiankang Wu and Mohan Kankanhalli, in "Print Signatures for Document Authentication", Proceedings of the 10th ACM Conference on Computer and Communications Security, Washington D.C., USA, pp. 145-154, 2003, use the randomness intrinsic to the laser printing process to verify the authenticity of a printed document. The present invention does not require a specific printing technology. In "Comparison of Some Thresholding Algorithms for Text/Background Segmentation in Difficult Document Images", published in The Seventh International Conference on Document Analysis and Recognition, Vol. 2, p. 859, Leedham, Yan, Takru and Tan describe various binarization algorithms for the segmentation of text documents. However, binarization by itself cannot be used to secure the content of printed text documents. Our invention uses other techniques besides binarization to achieve this goal.

González and Woods, in Chapter 5 (Image Restoration) of Digital Image Processing, Second Edition, U.S.A., New Jersey, Addison-Wesley, 2002, describe some algorithms for eliminating image noise. However, none of these algorithms can be used by itself to ensure the integrity of text documents. In addition to these algorithms, our invention uses other techniques to produce secure documents.

The chapter Optical Character Recognition of the book Algorithms for Image Processing and Computer Vision, by J. R. Parker, Wiley, 1996, describes an algorithm that aligns the text of an image with the horizontal lines of the image. However, this algorithm by itself cannot be used to secure the content of text documents. Our invention uses other techniques in addition to this one to achieve that goal. In "A Computational Framework for Segmentation and Grouping" by Medioni, Lee and Tang, Elsevier, 2000, algorithms are described that find oriented line segments using a methodology called tensor voting. However, tensor voting by itself cannot be used to secure text documents. The present invention uses techniques in addition to this one to reach this end. Zhang describes a method (called Iterative Closest Point, or ICP) for aligning two-dimensional figures in the article "Iterative Point Matching for Registration of Free-Form Curves and Surfaces", International Journal of Computer Vision, Vol. 13, No. 2, pp. 119-152, 1994. Veltkamp and Hagedoorn describe similar algorithms (such as Chamfer Matching) in "State-of-the-art in shape matching", technical report UU-CS-1999-27, Utrecht University, Netherlands, 1999. It is also possible to align figures by an exhaustive search that simply tries all the possible ways in which the figures can coincide. This method is not usually efficient, and for this reason it is rarely mentioned in the literature, but its implementation is obvious. However, figure alignment algorithms cannot be used by themselves to generate secure documents. In addition to these techniques, our invention uses other algorithms to achieve this goal.

C. Xu and J. L. Prince describe an algorithm for shape recognition and vectorization (commonly called snakes or active contours) in "Snakes, Shapes, and Gradient Vector Flow", IEEE Transactions on Image Processing, 7(3), pp. 359-369, March 1998. However, this algorithm cannot be used by itself to obtain secure documents. Our invention uses similar techniques at one stage of the method, but it uses other algorithms besides this one to produce secure documents. Bernd Jähne describes several local orientation recognition algorithms in Chapter 13 of his book "Digital Image Processing", Springer-Verlag, 1997. These algorithms are based on the Fourier transform, gradient analysis, tensor representation, local wave numbers and phase, the Hilbert transform and the Hilbert filter, quadrature filters, Gabor filters, and variants thereof. However, all these methods produce only the local orientation of an image. Other algorithms are needed in addition to produce secure documents, as the present invention does.

A. J. Menezes, P. C. van Oorschot and S. A. Vanstone describe cyclic redundancy checks (CRC) and public and private key encryption algorithms in the book "Handbook of Applied Cryptography", CRC Press, 1997. However, these algorithms cannot be applied directly to a scanned image to verify the integrity of printed documents, because they require the original byte sequence to be identical at all times. This condition is not met when scanning a printed document, where lighting conditions may vary and produce a different byte sequence in each case. Our invention works despite these variations in lighting. Finally, it should be noted that this patent is an extension of the patent application PCT/MX2005/000019 by Sergio Fernández. That document claims a system and method for secure printing of documents. Our invention differs from it in the method used to secure the documents (the signature).

The method presented in this document is based on recognizing the patterns and repetitions of the words of a text document to generate a signature that can be attached to the paper on which the document is printed. At no time is an optical character recognizer (OCR) used, as these usually present some failures (due to noise in the digital image) or limitations (due to the need to use word dictionaries). Our method has the advantage that it does not require a digital document. The methods based on correlation and wavelets use textures, which are very sensitive to non-malicious changes such as rotation, translation and scaling that may be present when scanning a document. The method presented here is robust to this kind of change. In addition, it does not require special materials such as holograms, special inks or laser systems to sign a document. Finally, unlike some of the described technologies, the invention allows the document to be copied and transmitted by any electronic, electrical or mechanical method as long as the written message has not been modified.

Brief description of the figures

Figure 1 shows a diagram of the method for printing secure documents. Figure 2 shows a diagram of the method to verify the integrity of a signed document. Figure 3 shows a diagram showing the internal parts of the signature generation stage.

Detailed description of the invention.

With reference to these figures, the method to ensure the integrity of the content of printed text documents consists of:

A method for printing secure documents, which in turn consists of an image acquisition stage 3 that converts a printed text document 1 into a digital image (hereinafter simply called the "image"). This stage can be implemented with any digitizing device, such as a scanner, a digital copier, a multifunction printer or a digital camera. The image obtained in the acquisition stage passes to a signature generation stage 5, which generates the element that ensures the integrity of the document (hereinafter simply called the "signature"). The sub-stages that make up this stage are described in detail later. The signature generated by this stage is sent along with the image to the signature annexation and printing stage 6, which appends the signature to the original document and thus produces a secure document printed on paper 7. The printing stage can be implemented with a device such as a printer, digital copier, plotter or similar. Alternatively, this part of the system may contain a rasterization stage (raster image processor) 4 that converts an electronic document 2 (such as those generated by conventional programs like Microsoft Word or Excel) into a digital image, which is then passed to the signature generation stage 5. Raster image processors are usually an integral part of the hardware of many printers or are provided by the drivers of these devices.

A method for the verification of secure documents, which in turn consists of an image acquisition stage 9 (such as the one described in the previous section) that converts a printed document 8 into a digital image. This image is passed to a signature analysis stage 10, which extracts the signature contained in the document. This stage can be a simple two-dimensional barcode reader. The image of the document is also passed to the signature generation stage 11 (described in detail in the next section of this document), which obtains the integrity check element (signature) from the textual content of the document. The signatures extracted by these modules are passed to the document certification stage 12, which verifies that both signatures coincide and, if so, issues a certificate of authenticity 13 for the signed document.

Two signature generation stages (5 and 11), each of which consists, in turn, of: a binarization stage 14 that converts a color or grayscale image to black and white; a noise elimination stage 15 that removes the noise from the image binarized in the previous module; a horizontal alignment stage 16 that corrects the inclination of the image so that the lines of text appear horizontal; a word segmentation stage 17 that finds the two-dimensional limits of each word in the document; a stroke identification stage 18 that obtains the dominant strokes of each word found in the previous module; a line alignment stage 19 that aligns the words obtained in the previous modules with each other; and, finally, a combination and encryption stage 20 that combines the information obtained in these modules into a signature that can be attached to the original document to secure its content.

Method for printing secure documents.

To generate a secure document, the user either scans a printed document 1 at the image acquisition stage 3 or, using an electronic editor (such as Word or Excel), generates a file 2 that is converted into an image by the raster image processor 4. The image is passed to the signature generation stage 5.

The signature generator first binarizes the image in step 14 to obtain a black and white bitmap. The binarization method can be any of those described in "Comparison of Some Thresholding Algorithms for Text/Background Segmentation in Difficult Document Images" by Leedham, Yan, Takru and Tan, published in The Seventh International Conference on Document Analysis and Recognition, Vol. 2, p. 859. Once the image has been binarized, step 15 eliminates the small points produced by noise in the measurement system. González and Woods describe several such algorithms in Chapter 5 (Image Restoration) of Digital Image Processing, Second Edition, USA, New Jersey, Addison-Wesley, 2002.
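As an illustration of steps 14 and 15, the following minimal sketch binarizes a grayscale image and then removes small isolated specks. It is not the method of the cited references: a fixed global threshold and a connected-component size filter are assumed here only for brevity, and the names `binarize`, `remove_specks`, `threshold` and `min_size` are hypothetical.

```python
from typing import List

Gray = List[List[int]]     # grayscale values, 0 (black) to 255 (white)
Bitmap = List[List[int]]   # 1 = black (ink) pixel, 0 = white pixel


def binarize(gray: Gray, threshold: int = 128) -> Bitmap:
    """Step 14 (assumed variant): fixed global threshold."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def remove_specks(bitmap: Bitmap, min_size: int = 3) -> Bitmap:
    """Step 15 (assumed variant): erase 8-connected blobs smaller than min_size."""
    height, width = len(bitmap), len(bitmap[0])
    out = [row[:] for row in bitmap]
    seen = [[False] * width for _ in range(height)]
    for r0 in range(height):
        for c0 in range(width):
            if not bitmap[r0][c0] or seen[r0][c0]:
                continue
            # Collect the whole blob that contains (r0, c0).
            seen[r0][c0] = True
            stack, blob = [(r0, c0)], []
            while stack:
                r, c = stack.pop()
                blob.append((r, c))
                for nr in range(max(r - 1, 0), min(r + 2, height)):
                    for nc in range(max(c - 1, 0), min(c + 2, width)):
                        if bitmap[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            stack.append((nr, nc))
            if len(blob) < min_size:
                for r, c in blob:
                    out[r][c] = 0   # too small to be text: treat as noise
    return out


scan = [
    [255, 255, 255, 255, 255],
    [10, 20, 30, 255, 40],
    [255, 255, 255, 255, 255],
]
clean = remove_specks(binarize(scan))
```

In this toy input the three adjacent dark pixels survive as ink, while the single dark pixel at the right edge is discarded as noise. A real scan would likely need an adaptive threshold and a size limit tuned to the scanning resolution.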

The clean image is then aligned so that the lines of text appear horizontal in the image (step 16). Some procedures to achieve this are described in Chapter 9, Optical Character Recognition, of Algorithms for Image Processing and Computer Vision, by J. R. Parker, Wiley, 1996. Once the document has been aligned in this way, it is passed to the word segmenter 17. At this stage a vertical histogram of the image is obtained. This histogram contains one entry for each row of pixels in the image, and each entry stores the number of black pixels in that row. In this way, the text lines appear as maxima and the blanks between lines as minima. It is then easy to identify the beginning and end of each text line by a simple differential analysis: the beginning of each line is marked by a transition from a small value to a large value, and the end of each line by a transition from a large value to a small value. One skilled in the art can easily implement an algorithm that performs this identification. After identifying the lines of the text, the segmenter obtains a horizontal histogram for each line, which stores the number of black pixels per column in that line. This histogram is used in the same way to identify the beginning and end of each word. The segmenter thus produces a list of words by line, where each word is assigned a unique number (coordinate). This numbering depends only on the content of the document itself and is not altered by circumstantial changes (such as a change in image resolution, translation, rotation, or copying methods that do not alter the content). Other techniques, such as wavelets, are affected by this kind of harmless transformation.
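The two-histogram segmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the image is assumed to be already binarized, cleaned and aligned; the names `runs`, `segment_words` and `word_gap` are hypothetical; and a word boundary is assumed to be any run of at least `word_gap` empty columns.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]  # 1 = black (ink) pixel, 0 = white pixel


def runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) spans of non-zero entries in a histogram.

    Spans separated by fewer than `min_gap` zero entries are merged, so
    that the small gaps between letters do not split a word.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # small-to-large transition
        elif value == 0 and start is not None:
            spans.append((start, i))       # large-to-small transition
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    merged: List[Tuple[int, int]] = []
    for span in spans:
        if merged and span[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], span[1])
        else:
            merged.append(span)
    return merged


def segment_words(image: Bitmap, word_gap: int = 2) -> Dict[Tuple[int, int], Tuple[int, int, int, int]]:
    """Map (line index, word index) to the box (top, bottom, left, right)."""
    row_profile = [sum(row) for row in image]            # black pixels per row
    width = len(image[0])
    words = {}
    for line_idx, (top, bottom) in enumerate(runs(row_profile)):
        col_profile = [sum(image[r][c] for r in range(top, bottom))
                       for c in range(width)]            # black pixels per column
        for word_idx, (left, right) in enumerate(runs(col_profile, word_gap)):
            words[(line_idx, word_idx)] = (top, bottom, left, right)
    return words


page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
boxes = segment_words(page)
```

The keys of `boxes` play the role of the word coordinates: they depend only on the order of lines and words, not on pixel positions, which is why a translation or a change of resolution would leave them unchanged.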

The positions of each word and the image are passed to the stroke identification stage 18. This stage analyzes each word individually to obtain its most representative straight line segments (strokes). This is achieved, in particular, through tensor voting techniques (described in detail in "A Computational Framework for Segmentation and Grouping" by Medioni, Lee and Tang, Elsevier, 2000), although alternative techniques may also be used, such as the Hough transform (US Patent 3,069,654 "Method and Means for Recognizing Complex Patterns"), active contours (snakes, such as those described in the article "Snakes, Shapes, and Gradient Vector Flow", IEEE Transactions on Image Processing, 7(3), pp. 359-369, March 1998, by C. Xu and J. L. Prince), or any of the techniques described in the background of this document. These technologies have been in use for several years and are widely known to any expert in the area. Tensor voting produces a stroke for each pixel in the image. To compress this data, the next step is to group the strokes according to the affinity of their direction. This is achieved by a depth-first search whose termination criterion is an excessive angle difference between pairs of strokes. Depth-first search is a graph traversal technique that has been widely known and used for many years. Once again, implementing a depth-first search with the indicated characteristics is a trivial task for any programmer.
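The grouping step can be illustrated with a short depth-first search. The sketch below assumes that a per-pixel orientation (in degrees) has already been produced by tensor voting or by any other local-orientation method; it does not implement tensor voting itself. The dictionary representation, the 8-neighbour adjacency, the 15-degree limit and the names `angle_diff` and `group_strokes` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]  # (row, column)


def angle_diff(a: float, b: float) -> float:
    """Smallest difference between two undirected orientations, in degrees."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


def group_strokes(orient: Dict[Pixel, float], max_diff: float = 15.0) -> List[List[Pixel]]:
    """Group neighbouring pixels whose orientations are similar.

    The depth-first search stops expanding along a pair of neighbouring
    pixels as soon as their angle difference exceeds `max_diff`.
    """
    seen = set()
    groups: List[List[Pixel]] = []
    for seed in sorted(orient):
        if seed in seen:
            continue
        seen.add(seed)
        stack, group = [seed], []
        while stack:
            r, c = stack.pop()
            group.append((r, c))
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nb = (r + dr, c + dc)
                    if (nb in orient and nb not in seen
                            and angle_diff(orient[(r, c)], orient[nb]) <= max_diff):
                        seen.add(nb)
                        stack.append(nb)
        groups.append(sorted(group))
    return groups


# Hypothetical orientations: a roughly horizontal run that meets a vertical one.
orientations = {
    (0, 0): 0.0, (0, 1): 5.0, (0, 2): 10.0,
    (1, 2): 90.0, (2, 2): 95.0,
    (5, 5): 178.0,
}
strokes = group_strokes(orientations)
```

Each resulting group could then be summarized by a single segment (for example, its end points and mean angle) to form the list of strokes for the word; that summarization is left out of this sketch.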

The strokes identified in this way are stored in a list of strokes per word and are passed, together with the positions of each word, to the line alignment stage 19. At this stage, all occurrences of each word in the rest of the document are found. To detect an occurrence of a word, the word is superimposed on a candidate target word. If the main strokes of the original word can be properly aligned with those of the target word, then there is an occurrence. A variety of algorithms allow this alignment to be made, such as the one described by Zhang (ICP) in "Iterative Point Matching for Registration of Free-Form Curves and Surfaces", International Journal of Computer Vision, Vol. 13, No. 2, pp. 119-152, 1994, or any of the methods described by Veltkamp and Hagedoorn (such as Chamfer Matching) in "State-of-the-art in shape matching", technical report UU-CS-1999-27, Utrecht University, Netherlands, 1999. These algorithms have been in use for some years and are easy to implement for an expert in the area. At the end of this stage there is a list of repetitions that indicates the positions at which each word of the text recurs. It can be argued that if the text does not contain repeated words, then it can easily be altered without the change being noticeable in the signature. However, a meaningful document does not usually have this characteristic. In general, therefore, any change in the position or frequency of the words produces a different list, which in turn produces a different signature, and the alteration is detectable.
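How a repetition list might be built can be sketched as follows. This is a deliberately simplified stand-in for the alignment algorithms cited above: instead of iterating as ICP does, it applies a single translation that makes the centroids of the two stroke sets coincide and then matches strokes greedily. The stroke representation (midpoint plus angle), the tolerances and the names `centred`, `same_shape` and `repetition_list` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Stroke = Tuple[float, float, float]  # (x, y, angle in degrees) of a stroke midpoint
WordId = Tuple[int, int]             # (line index, word index)


def centred(strokes: List[Stroke]) -> List[Stroke]:
    """Translate the strokes so that their centroid is at the origin."""
    cx = sum(s[0] for s in strokes) / len(strokes)
    cy = sum(s[1] for s in strokes) / len(strokes)
    return [(x - cx, y - cy, a) for x, y, a in strokes]


def same_shape(a: List[Stroke], b: List[Stroke],
               tol: float = 1.5, max_angle: float = 15.0) -> bool:
    """True if every stroke of `a` has a distinct close match in `b`."""
    if len(a) != len(b) or not a:
        return False
    unused = centred(b)
    for x, y, ang in centred(a):
        match = None
        for cand in unused:
            d = abs(ang - cand[2]) % 180.0
            if (abs(x - cand[0]) <= tol and abs(y - cand[1]) <= tol
                    and min(d, 180.0 - d) <= max_angle):
                match = cand
                break
        if match is None:
            return False
        unused.remove(match)   # each target stroke may be used only once
    return True


def repetition_list(words: Dict[WordId, List[Stroke]]) -> Dict[WordId, List[WordId]]:
    """For each word, list the coordinates of the other words it matches."""
    ids = sorted(words)
    return {w: [o for o in ids if o != w and same_shape(words[w], words[o])]
            for w in ids}


# Hypothetical strokes: the first and third words have the same shape.
doc = {
    (0, 0): [(0.0, 0.0, 90.0), (2.0, 0.0, 0.0)],
    (0, 1): [(10.0, 0.0, 90.0), (14.0, 0.0, 45.0)],
    (1, 0): [(5.0, 20.0, 90.0), (7.0, 20.0, 0.0)],
}
reps = repetition_list(doc)
```

Because only a translation is applied, two occurrences of a word match regardless of where they sit on the page, while a word with a different stroke layout or different stroke angles does not.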

Finally, the list of strokes of each word and the list of repetitions of each word are appended to the list of words by line in step 20 to generate a series of numbers that uniquely identifies each document. This series of numbers can itself be encoded as a two-dimensional barcode (or some other form of binary image coding); usually, however, it is compressed and a cyclic redundancy code (CRC) is computed on the result, and this CRC is normally used as the signature. It is also possible to compute the CRC of the original (uncompressed) series. The signature thus obtained can be as small as a 16-bit number. This number is finally printed as a conventional or two-dimensional barcode on the printed image to be secured. Both techniques (compression and CRC) are widely known and used.
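The reduction of the series of numbers to a 16-bit value can be sketched with the Python standard library. The serialization format (32-bit big-endian integers), the choice of the CCITT polynomial via `binascii.crc_hqx`, the initial value `0xFFFF` and the function name `crc16_signature` are assumptions; the text does not specify which compression algorithm or CRC variant is used.

```python
import binascii
import struct
import zlib
from typing import Sequence


def crc16_signature(numbers: Sequence[int], compress: bool = True) -> int:
    """Reduce a series of non-negative 32-bit integers to a 16-bit value."""
    # Fixed-width big-endian packing keeps the byte sequence unambiguous.
    payload = struct.pack(f">{len(numbers)}I", *numbers)
    if compress:
        payload = zlib.compress(payload, 9)
    # binascii.crc_hqx computes a 16-bit CRC with the CCITT polynomial 0x1021.
    return binascii.crc_hqx(payload, 0xFFFF)


# Hypothetical series: word coordinates followed by repetition positions.
series = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0]
print(f"{crc16_signature(series):04x}")
```

One design point worth noting: compressed output may differ between builds or versions of a compression library, so the signer and the verifier would need to use the same compressor, or use the uncompressed variant (`compress=False`), for the two CRC values to be comparable.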

Alternatively, step 20 may use the main strokes of some critical sections of the document (such as the figures on a check) to generate the signature, either on their own or as an annex to the regular signature described in the previous paragraph. It is also possible to systematically interleave the list of words by line, the list of strokes by word (for the critical sections of the document) and the list of repetitions, according to the needs of each particular document or the user's instructions, to generate the signature. Additionally, the signature generated with any of the procedures described (the sequence of numbers that describe the words and their repetitions, the sequence of the main strokes of the words, or the systematic combination of all of them) can be encrypted with the public key of the issuer of the document to give the document greater security. The encrypted signature can then be processed normally (it is compressed and its CRC is obtained).

It should be noted that procedures similar to those described for stages 14, 15, 16 and 17 are already used in some optical character recognizers (OCR). The novelty is that instead of trying to recognize characters to form meaningful words (as an OCR does), the present process recognizes arbitrary patterns and their repetitions. An OCR has the disadvantages that it does not work for every typeface (font) and that it needs a dictionary for each language to disambiguate some words. Our invention is distinguished from an OCR in that it does not require any dictionary and works for any language and typeface. Moreover, by not requiring word recognition, the invention can easily work with handwritten text, which is still considered a difficult problem for OCR.

Finally, the signature obtained by this method is sent together with the image to the signature annexation and printing stage 6. This stage converts the signature to a small image (a two-dimensional barcode, a widely known and used technology) and places it in some unused part of the original document (for example, at the edges of the page). The modified image is then printed to generate a secure document 7.

Method for the verification of secure documents.

To verify the integrity of a document 8, it is scanned and converted into an image in step 9. This image is sent to a signature analysis stage 10 that simply reads and interprets the signature printed on the document. This stage can be implemented with a two-dimensional barcode reader, the use of which is widespread.

Additionally, after the signature has been read by the barcode reader, it may be necessary to decrypt it with the issuer's private key to verify that the integrity of the signature itself has not been compromised. This is also done in step 10. In parallel with this stage, the image is sent to the signature generation stage 11 (whose operation has already been explained above) to obtain the signature from the content of document 8.

Both signatures are compared in the certification stage 12. When the document has been maliciously modified, the signatures will be different. In that case, the certifier 12 responds by indicating that the document has been altered. Otherwise, an integrity certificate 13 is issued, indicating that the document is true to the original.

The method of generating the signature constitutes the novelty of the invention. The novelty of the method is that, through computer vision algorithms, the textual content of the printed document is analyzed to generate a signature without using a character recognizer (OCR). That is, all the information necessary to verify the integrity of the document is found in the document itself, and the signature can be as small as a 16-bit number. Instead of recognizing individual characters, the shape and position of the words in the text are used as a signature to ensure that the content has not been modified. No other current system can sign documents in this way without using an OCR. Furthermore, the systems that generate digital signatures based on wavelets have the disadvantage that the generated signature is usually relatively large. This size makes it difficult to include the signature in the blank areas of the original document (usually only the margins of the text). The advantage of our method over wavelet-based methods is that the signature can be reduced to 16 bits, which makes it easy to include in the blank areas of any printed text document. In addition, wavelet-based methods do not usually tolerate alterations that preserve the content of the document, such as a change of resolution or a translation or rotation of the image. Our invention works despite these changes. Beyond everything that has been mentioned, our invention does not need a set of word dictionaries. It does not require the original document in electronic format. It does not require special materials in the paper or attached to it. It does not need colored plastics to visualize the signature. It does not require lasers to illuminate the document and detect the signature. Finally, the method can be used with texts in any language.

Claims

1. A method for generating a printed signature, which includes the stages of binarization, noise elimination, horizontal alignment and word segmentation to generate a list of words by line where each word has a unique associated coordinate, characterized by the stages of:

a) Identification of the word strokes. At this stage, tensor voting is used to obtain the local orientation of each pixel. Then, the local orientations are grouped by a depth-first search according to the affinity of their angles. This is done to identify the most representative orientations of each word and their relative positions within it. This stage produces a list of strokes of each word.

b) Line alignment. Using the most representative strokes of each word, an attempt is made to align the strokes of each word with the rest of the document. Those positions where this alignment is possible mark the repetitions of that word. The positions where each word is repeated are stored in a repetition list. This stage is implemented by a modified version of the ICP algorithm that uses only translations to align the strokes.

c) Combination and encryption. At this stage the list of words by line, the list of repetitions, and the list of the strokes of each word are combined to create the signature. The signature can be constituted by the simple annexation of these lists, the annexation of two of them, or only one of these lists. It is also possible to systematically interleave parts of these three lists to generate the signature, according to the indications given by the user or according to the characteristics of each document.

d) Signature annexation and printing. At this stage, the signature obtained with this method is converted into a two-dimensional barcode by known techniques and is printed on the original document using a printer or similar device.

2. A method for generating a printed signature according to claim 1, characterized in that the identification of strokes is done by the Hough transform, active contours (snakes), the Fourier transform, gradient analysis, tensor representation, local wave numbers and phase, the Hilbert transform, the Hilbert filter, quadrature filters, Gabor filters, and variants thereof, or some other similar method that allows the local orientation of an image to be obtained.

3. A method for generating a printed signature according to claims 1 and 2, characterized in that the alignment of strokes is done by exhaustive search, Chamfer Matching, active contours (snakes) or some similar algorithm.

4. A method for generating a printed signature according to claims 1, 2 and 3, characterized in that the signature is further protected by encryption with the public key of the issuer of the document.

5. A method for generating a printed signature according to claims 1, 2, 3 and 4, characterized in that the size of the signature is reduced using compression algorithms or cyclic redundancy codes (CRC).