US20090113003A1 - Image spam filtering based on senders' intention analysis - Google Patents
Image spam filtering based on senders' intention analysis
- Publication number
- US20090113003A1 (US 11/932,589)
- Authority
- US
- United States
- Prior art keywords
- image
- spam
- blocks
- text
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/107—Computer-aided management of electronic mailing [e-mailing]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
Definitions
- Embodiments of the present invention generally relate to the field of spam filtering and anti-spam techniques.
- various embodiments relate to image analysis and methods for combating spam in which spammers use images to carry the advertising text.
- Image spam was originally created in order to get past heuristic filters, which block messages containing words and phrases commonly found in spam. Because image files use different formats than the text found in the message body of an electronic mail (email) message, conventional heuristic filters, which analyze only such text, do not detect message content that is partly or wholly conveyed by text embedded within an image. As a result, heuristic filters were easily defeated by image spam techniques.
- fuzzy signature technologies, which flag both known and similar messages as spam, were deployed by anti-spam vendors.
- Such fuzzy signature technologies allowed message attachments to be targeted, thereby recognizing as spam messages with different content but the same attachment.
- FIGS. 1A and 1B illustrate sample images and obfuscation techniques used by spammers to defeat OCR image spam detection techniques. As shown in FIGS. 1A and 1B , polygons, lines, random colors, jagged text, random dots, varying borders and the like may be inserted into image spam in an attempt to defeat signature detection techniques and obscure the embedded text from OCR techniques. There are an almost infinite number of ways that spammers can randomize images.
- spammers have recently used techniques such as varying the colors used in an image, changing the width and/or pattern of the border, altering the font style, and slicing images into smaller pieces (which are then reassembled to appear as a single image to the recipient).
- OCR is very computationally expensive.
- fully rendering a message and then looking for word matches against different character set libraries may take as long as several seconds per message, which is typically unacceptable for many contexts.
- an anti-spam detection module that can detect image spam.
- one or more of the quantity and position of text within an image associated with an electronic message are measured or estimated. Then, based at least in part on the results of the measuring or estimating, the likelihood that the electronic message is spam is determined.
- an embedded image of an electronic mail (email) message is converted to a binarized representation by performing thresholding on a grayscale representation of the embedded image.
- a quantity of text included in the embedded image is then measured by analyzing one or more blocks of the binarized representation.
- the email message is classified as spam or clean based at least in part on the quantity of text measured.
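- The three steps above can be illustrated with a minimal sketch. It assumes the grayscale image is already available as rows of 0-255 intensities, uses a fixed threshold of 128, and uses the fraction of dark pixels per block as a crude stand-in for text detection; the function names, the threshold and the `low`/`high` band are all hypothetical choices, not values taken from the embodiments.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 intensities (assumed input layout)


def binarize(gray: Gray, threshold: int = 128) -> List[List[int]]:
    """Map each pixel to 1 (dark, candidate ink) or 0 (background).

    The fixed threshold is an assumption made for this sketch.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def block_ink_ratios(binary: List[List[int]], block_h: int, block_w: int) -> List[float]:
    """Return the fraction of dark pixels in each block, in row-major order."""
    height = len(binary)
    width = len(binary[0]) if binary else 0
    ratios = []
    for top in range(0, height, block_h):
        for left in range(0, width, block_w):
            cells = [
                binary[y][x]
                for y in range(top, min(top + block_h, height))
                for x in range(left, min(left + block_w, width))
            ]
            ratios.append(sum(cells) / len(cells))
    return ratios


def text_block_count(ratios: List[float], low: float = 0.05, high: float = 0.6) -> int:
    """Count blocks whose ink ratio falls inside a band assumed to resemble text."""
    return sum(1 for r in ratios if low <= r <= high)
```

- A caller could compare `text_block_count(...)` against the total number of blocks to obtain a rough text quantity for the classification step.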
- the embedded image may be formatted in accordance with the Graphic Interchange Format (GIF), Joint Photographic Experts Group (JPEG) or Portable Network Graphics (PNG) formats/standards.
- the embedded image may be an image contained within a file attached to the email message.
- the method also includes determining an approximate display location of an embedded image within the email message and identifying existence of one or more abnormal factors associated with the embedded image. Then, the classification can be based upon the approximate display location, the existence of one or more abnormal factors as well as the quantity of text measured.
- FIGS. 1A and 1B illustrate sample images and obfuscation techniques used by spammers to defeat OCR image spam detection techniques.
- FIG. 2 is a block diagram conceptually illustrating a simplified network architecture in which embodiments of the present invention may be employed.
- FIG. 3 is a block diagram conceptually illustrating interaction among various functional units of an email security system with a client workstation and an email server in accordance with an embodiment of the present invention.
- FIG. 4 is an example of a computer system with which embodiments of the present invention may be utilized.
- FIG. 5 is a high-level flow diagram illustrating anti-spam processing of images using sender's intention analysis in accordance with an embodiment of the present invention.
- FIG. 6 is a flow diagram illustrating quantity of text measurement processing in accordance with an embodiment of the present invention.
- FIG. 7 is an example of an image spam email message containing an embedded image.
- FIG. 8 is a grayscale image based on the embedded image of FIG. 7 according to one embodiment of the present invention.
- FIG. 9 is an intensity histogram for the grayscale image of FIG. 8 according to one embodiment of the present invention.
- FIG. 10 is a binary image resulting from thresholding the grayscale image of FIG. 8 in accordance with an embodiment of the present invention.
- FIG. 11 illustrates an exemplary segmentation of the binary image of FIG. 10 into 28 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
- FIG. 12 is a grayscale image based on another exemplary embedded image observed in connection with image spam.
- FIG. 13 illustrates an exemplary segmentation of a binarized image corresponding to the image of FIG. 12 into 56 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
- images attached to or embedded within email messages are analyzed to determine the senders' intention.
- Empirical analysis reveals legitimate emails may contain embedded images, but valid images sent through email rarely contain a substantial quantity of text.
- the senders of such email messages do not painstakingly adjust the location of such included images to assure such images appear in the preview window of an email client.
- legitimate senders do not intentionally inject noise into the embedded images.
- spammers usually compose email messages in different ways. For example, in the context of image spam, spammers insert text into images to avoid filtering by traditional text filters and employ techniques to randomize images and/or obscure text embedded within images.
- various indicators of image spam include, but are not limited to, inclusion of one or more images in the front part of an email message, inclusion of one or more images containing text meeting a certain threshold and/or inclusion of one or more images into which noise appears to have been injected to obfuscate embedded text.
- various image analysis techniques are employed to more accurately detect image spam based on senders' intention analysis.
- the goal of senders' intention analysis is to discover the email message sender's intent by examining various characteristics of the email message and the embedded or attached images. If it appears, for example, after performing image analysis that one or more images associated with an email message have had one or more obfuscation techniques applied, the intent is to draw attention to the one or more images and/or the one or more images include suspicious quantities of text, then the sender's intention analysis anti-spam processing may flag the email message at issue as spam.
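- One way to picture how these indicators might be combined is the sketch below. The `ImageFindings` fields, the `text_threshold` value and the `min_signals` rule are hypothetical; the text above only says that obfuscation, attention-drawing placement and/or suspicious quantities of text may cause a message to be flagged.

```python
from dataclasses import dataclass


@dataclass
class ImageFindings:
    """Per-message results of image analysis (field names are illustrative)."""
    near_top_of_message: bool    # image placed where a preview window would show it
    text_block_fraction: float   # share of image blocks that appear to hold text
    abnormal_factor_count: int   # number of obfuscation indicators observed


def looks_like_image_spam(
    findings: ImageFindings,
    text_threshold: float = 0.3,
    min_signals: int = 2,
) -> bool:
    """Flag the message when enough independent indicators are present."""
    signals = [
        findings.near_top_of_message,
        findings.text_block_fraction >= text_threshold,
        findings.abnormal_factor_count > 0,
    ]
    return sum(signals) >= min_signals
```

- Setting `min_signals=1` would flag on any single indicator; requiring more trades recall for fewer false positives.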
- the image scanning spam detection method is based on a combination of email header analysis, email body analysis and image processing on image attachments.
- anti-spam detection module and image scanning methodologies are discussed in the context of an email security system, they are equally applicable to network gateways, email appliances, client workstations, servers and other virtual or physical network devices or appliances that may be logically interposed between client workstations and servers, such as firewalls, network security appliances, email security appliances, virtual private network (VPN) gateways, switches, bridges, routers and like devices through which email messages flow.
- the anti-spam techniques described herein are equally applicable to instant messages, Multimedia Message Service (MMS) messages and other forms of electronic communications in the event that such messages become vulnerable to image spam in the future.
- embodiments of the present invention are not so limited and are equally applicable to various other current and future graphic/image file formats, including, but not limited to, Progressive Graphics File (PGF), Tagged Image File Format (TIFF), bit mapped format (BMP), HDP, WDP, XPM, MacOS-PICT, Irix-RGB, Multiresolution Seamless Image Database (MrSID), RAW formats used by various digital cameras, various vector formats, such as Scalable Vector Graphics (SVG), as well as other file formats of attachments which may themselves contain embedded images, such as Portable Document Format (PDF), Encapsulated PostScript, SWF, Windows Metafile, AutoCAD DXF and CorelDraw CDR.
- Embodiments of the present invention include various steps, which will be described below.
- the steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps.
- the steps may be performed by a combination of hardware, software, firmware and/or by human operators.
- Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable medium having stored thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process.
- the machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, ROMs, random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions.
- embodiments of the present invention may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
- client generally refers to an application, program, process or device in a client/server relationship that requests information or services from another program, process or device (a server) on a network.
- client and server are relative since an application may be a client to one application but a server to another.
- client also encompasses software that makes the connection between a requesting application, program, process or device to a server possible, such as an FTP client.
- connection or coupling and related terms are used in an operational sense and are not necessarily limited to a direct connection or coupling.
- two devices may be coupled directly, or via one or more intermediary media or devices.
- devices may be coupled in such a way that information can be passed therebetween, while not sharing any physical connection with one another.
- connection or coupling exists in accordance with the aforementioned definition.
- embedded image generally refers to an image that is displayed or rendered inline within a styled or formatted electronic message, such as a HyperText Markup Language (HTML)-based or formatted email message.
- embedded image is intended to encompass scenarios in which the image data is sent with the email message and linked images in which a reference to the image is sent with the email message and the image data is retrieved once the recipient views the email message.
- embedded image also includes an image that is embedded in other file formats of attachments, such as Portable Document Format (PDF) attachments, in which the image data is displayed to the email recipient when the attachment is viewed.
- image spam generally refers to spam in which the “call to action” of the message is partially or completely contained within an embedded file attachment, such as a .gif, .jpeg or .pdf file, rather than in the body of the email message.
- Such images are typically automatically displayed to the email recipients and typically some form of obfuscation is implemented in an attempt to hide the true content of the image from spam filters.
- network gateway generally refers to an internetworking system, a system that joins two networks together.
- a “network gateway” can be implemented completely in software, completely in hardware, or as a combination of the two.
- network gateways can operate at any level of the OSI model from application protocols to low-level signaling.
- proxy generally refers to an intermediary device, program or agent, which acts as both a server and a client for the purpose of making or forwarding requests on behalf of other clients.
- responsive includes completely or partially responsive.
- server generally refers to an application, program, process or device in a client/server relationship that responds to requests for information or services by another program, process or device (a client) on a network.
- server also encompasses software that makes the act of serving information or providing services possible.
- spam generally refers to electronic junk mail, typically bulk electronic mail (email) messages in the form of commercial advertising. Often, email message content may be irrelevant in determining whether an email message is spam, though most spam is commercial in nature. There is spam that fraudulently promotes penny stocks in the classic pump-and-dump scheme. There is spam that promotes religious beliefs. From the recipient's perspective, spam typically represents unsolicited, unwanted, irrelevant, and/or inappropriate email messages, often unsolicited commercial email (UCE). In addition to UCE, spam includes, but is not limited to, email messages regarding or associated with fraudulent business schemes, chain letters, and/or offensive sexual or political messages.
- spam comprises Unsolicited Bulk Email (UBE).
- Unsolicited generally means the recipient of the email message has not granted verifiable permission for the email message to be sent and the sender has no discernible relationship with all or some of the recipients.
- Bulk generally refers to the fact that the email message is sent as part of a larger collection of email messages, all having substantively identical content.
- an email message is considered spam if it is both unsolicited and bulk.
- Unsolicited email can be normal email, such as first contact enquiries, job enquiries, and sales enquiries.
- Bulk email can be normal email, such as subscriber newsletters, customer communications, discussion lists, etc.
- an email message would be considered spam if both (i) the recipient's personal identity and context are irrelevant because the email message is equally applicable to many other potential recipients; and (ii) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for the email message to be sent.
- transparent proxy generally refers to a specialized form of proxy that only implements a subset of a given protocol and allows unknown or uninteresting protocol commands to pass unaltered.
- compared to a full proxy, in which use by a client typically requires editing of the client's configuration file(s) to point to the proxy, a transparent proxy requires no such extra configuration.
- FIG. 2 is a block diagram conceptually illustrating a simplified network architecture in which embodiments of the present invention may be employed.
- spammers 205 are shown coupled to the public Internet 200 to which local area network (LAN) 240 is also communicatively coupled through a firewall 210 , a network gateway 215 and an email security system 220 , which incorporates within an anti-spam module 225 various novel image spam detection methodologies that are described further below.
- the email security system 220 is logically interposed between spammers and the email server 230 to perform spam filtering on incoming email messages from the public Internet 200 prior to receipt and storage on the email server 230 from which and through which client workstations 260 residing on the LAN 240 may retrieve and send email correspondence.
- the firewall 210 may represent a hardware or software solution configured to protect the resources of LAN 240 from outsiders and to control which outside resources local users have access to by enforcing security policies.
- the firewall 210 may filter or disallow unauthorized or potentially dangerous material or content from entering LAN 240 and may otherwise limit access between the LAN 240 and the public Internet 200 in accordance with local security policy established and maintained by an administrator of LAN 240 .
- the network gateway 215 acts as an interface between the LAN 240 and the public Internet 200 .
- the network gateway 215 may, for example, translate between dissimilar protocols used internally and externally to the LAN 240 .
- the network gateway 215 or the firewall 210 may perform network address translation (NAT) to hide private Internet Protocol (IP) addresses used within LAN 240 and enable multiple client workstations, such as client workstations 260 , to access the public Internet 200 using a single public IP address.
- the email security system 220 performs email filtering to detect, tag, block and/or remove unwanted spam and malicious attachments.
- an anti-spam module 225 of the email security system 220 performs one or more spam filtering techniques, including but not limited to, sender IP reputation analysis and content analysis, such as attachment/content filtering, heuristic rules, deep email header inspection, spam URI real-time blacklists (SURBL), banned word filtering, spam checksum blacklist, forged IP checking, greylist checking, Bayesian classification, Bayesian statistical filters, signature reputation, and/or filtering methods such as FortiGuard-Antispam, access policy filtering, global and user black/white list filtering, spam Real-time Blackhole List (RBL), Domain Name Service (DNS) Block List (DNSBL) and per user Bayesian filtering so that individual users can set their own profiles.
- the anti-spam module 225 also performs various novel image spam detection methodologies or spam image analysis scanning based on sender's intention analysis in an attempt to detect, tag, block and/or remove spam presented in the form of one or more images. Examples of the image analysis techniques and the sender's intention analysis methodologies are described in more detail below.
- Existing email security platforms that exemplify various operational characteristics of the email security system 220 according to an embodiment of the present invention include the FortiMail™ family of high-performance, multi-layered email security platforms, including the FortiMail-100 platform, the FortiMail-400 platform, the FortiMail-2000 platform and the FortiMail-4000A platform, all of which are available from Fortinet, Inc. of Sunnyvale, Calif.
- FIG. 3 is a block diagram conceptually illustrating interaction among various functional units of an email security system 320 with a client workstation 360 and an email server 330 in accordance with an embodiment of the present invention.
- many local and/or remote client workstations, servers and email servers may interact directly or indirectly with the email security system 320 and directly or indirectly with each other.
- the email security system 320 which may be implemented as one or more virtual or physical devices, includes a content processor 321 , logically interposed between sources of inbound email 380 and an enterprise's email server 330 .
- the content processor 321 performs scanning of inbound email messages 380 originating from sources on the public Internet 200 before allowing such inbound email messages 380 to be stored on the email server 330 .
- an anti-spam module 325 of the content processor 321 may perform spam filtering and an anti-virus (AV) module 326 implementing AV and other filters potentially performs other traditional anti-virus detection and content filtering on data associated with the email messages.
- anti-spam module 325 may apply various image analysis methodologies described further below to ascertain email senders' intentions and therefore the likelihood that attached and/or embedded images represent image spam. According to the current example, the anti-spam module 325 , responsive to being presented with an inbound email message, determines whether the email message contains embedded or attached images and if so, as described further below with reference to FIG. 5 and FIG. 6 , determines if such images represent image spam.
- the content processor 321 is an integrated FortiASIC™ Content Processor chip developed by Fortinet, Inc. of Sunnyvale, Calif.
- the content processor 321 may be a dedicated coprocessor or software to help offload content filtering tasks from a host processor (not shown).
- the anti-spam module 325 may be associated with or otherwise responsive to a mail transfer protocol proxy (not shown).
- the mail transfer protocol proxy may be implemented as a transparent proxy that implements handlers for Simple Mail Transfer Protocol (SMTP) or Extended SMTP (ESMTP) commands/replies relevant to the performance of content filtering activities and passes through those not relevant to the performance of content filtering activities.
- the mail transfer protocol proxy may subject each of incoming email, outgoing email and internal email to scanning by the anti-spam module 325 and/or the content processor 321 .
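- The pass-through behavior of such a transparent proxy can be sketched as a small dispatcher. The `make_proxy` helper and the shape of the handler table are hypothetical; the sketch only shows that commands with a registered handler are intercepted while all others are forwarded unaltered.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def make_proxy(handlers: Dict[str, Handler]) -> Handler:
    """Build a line processor that intercepts only the commands it knows.

    Keys of `handlers` are upper-case command verbs (for example "DATA").
    """
    def process(line: str) -> str:
        stripped = line.strip()
        verb = stripped.split(" ", 1)[0].upper() if stripped else ""
        handler = handlers.get(verb)
        # Unknown or uninteresting commands pass through unchanged.
        return handler(line) if handler else line

    return process
```

- For example, a table containing only a "DATA" handler would leave commands such as NOOP or VRFY untouched on their way to the email server.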
- filtering of email need not be performed prior to storage of email message on the email server 330 .
- the content processor 321 , the mail transfer protocol proxy (not shown) or some other functional unit logically interposed between a user agent or email client 361 executing on the client workstation 360 and the email server 330 may process email messages at the time they are requested to be transferred from the user agent/email client 361 to the email server 330 or vice versa.
- neither the email messages nor their attachments need be stored locally on the email security system 320 to support the filtering functionality described herein.
- the email security system 320 may open a direct connection between the email client 361 and the email server 330 , and filter email in real-time as it passes through.
- the content processor 321 , the anti-spam module 325 and the mail transfer protocol proxy have been described as residing within or as part of the same network device, in alternative embodiments one or more of these functional units may be located remotely from the other functional units.
- the hardware components and/or software modules that implement the content processor 321 , the anti-spam module 325 and the mail transfer protocol proxy are generally provided on or distributed among one or more Internet and/or LAN accessible networked devices, such as one or more network gateways, firewalls, network security appliances, email security systems, switches, bridges, routers, data storage devices, computer systems and the like.
- the functionality of one or more of the above-referenced functional units may be merged in various combinations.
- the content processor 321 may be incorporated within the mail transfer protocol proxy or the anti-spam module 325 may be incorporated within the email server 330 or the email client 361 .
- the functional units can be communicatively coupled using any suitable communication method (e.g., message passing, parameter passing, and/or signals through one or more communication paths etc.).
- the functional units can be physically connected according to any suitable interconnection architecture (e.g., fully connected, hypercube, etc.).
- the functional units can be any suitable type of logic (e.g., digital logic) for executing the operations described herein.
- Any of the functional units used in conjunction with embodiments of the invention can include machine-readable media including instructions for performing operations described herein.
- Machine-readable media include any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine (e.g., a computer).
- a machine-readable medium includes read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), etc.
- FIG. 4 is an example of a computer system with which embodiments of the present invention may be utilized.
- the computer system 400 may represent or form a part of an email security system, network gateway, firewall, network appliance, switch, bridge, router, data storage devices, server, client workstation and/or other network device implementing one or more of the content processor 321 or other functional units depicted in FIG. 3 .
- the computer system 400 includes one or more processors 405 , one or more communication ports 410 , main memory 415 , read only memory 420 , mass storage 425 , a bus 430 , and removable storage media 440 .
- the processor(s) 405 may be Intel® Itanium® or Itanium 2® processor(s), AMD® Opteron® or Athlon MP® processor(s) or other processors known in the art.
- Communication port(s) 410 represent physical and/or logical ports.
- communication port(s) may be any of an RS-232 port for use with a modem based dialup connection, a 10/100 Ethernet port, or a Gigabit port using copper or fiber.
- Communication port(s) 410 may be chosen depending on a network such as a Local Area Network (LAN), Wide Area Network (WAN), or any network to which the computer system 400 connects.
- Communication port(s) 410 may also be the name of the end of a logical connection (e.g., a Transmission Control Protocol (TCP) port or a User Datagram Protocol (UDP) port).
- communication ports may be one of the Well Known Ports, such as TCP port 25 or UDP port 25 (used for Simple Mail Transfer), assigned by the Internet Assigned Numbers Authority (IANA) for specific uses.
- Main memory 415 may be Random Access Memory (RAM), or any other dynamic storage device(s) commonly known in the art.
- Read only memory 420 may be any static storage device(s) such as Programmable Read Only Memory (PROM) chips for storing static information such as instructions for processors 405 .
- Mass storage 425 may be used to store information and instructions.
- hard disks such as the Adaptec® family of SCSI drives, an optical disc, an array of disks such as RAID (for example, the Adaptec family of RAID drives), or any other mass storage devices may be used.
- Bus 430 communicatively couples processor(s) 405 with the other memory, storage and communication blocks.
- Bus 430 may be a PCI/PCI-X or SCSI based system bus depending on the storage devices used.
- Optional removable storage media 440 may be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video Disk (DVD)-Read Only Memory (DVD-ROM), Re-Writable DVD and the like.
- FIG. 5 is a high-level flow diagram illustrating anti-spam processing of images using sender's intention analysis in accordance with an embodiment of the present invention.
- the various process and decision blocks described below may be performed by hardware components, embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps, or the steps may be performed by a combination of hardware, software, firmware and/or involvement of human participation/interaction.
- an email message is analyzed to determine if it contains images.
- the direction of flow of the email message is not pertinent.
- the email message may be inbound, outbound or an intra-enterprise email message.
- the anti-spam processing may be enabled in one direction only or various detection thresholds could be configured differently for different flows.
- the headers, body and attachments, if any, of the email message at issue are parsed and scanned to identify whether the email message is deemed to contain one or more embedded images. If so, processing continues with block 520 . Otherwise, no further image spam analysis is required and processing branches to the end.
- the email message at issue has been determined to contain one or more embedded images.
- the senders' intention analysis anti-spam processing therefore, proceeds to calculate the location(s) of the embedded image(s).
- Images may be embedded in a HyperText Markup Language (HTML) part of an HTML formatted email message, within a MIME document or attached separately.
- abnormal factors are manifestations of a spammer's attempt to obscure text embedded within the one or more images by injecting a variety of noise.
- abnormal factors include the presence of one or more of the following characteristics: (i) illegal base64 encoding; (ii) multiple images within one HTML part; (iii) one or more low entropy frames in an animated Graphic Interchange Format (GIF); (iv) illegal chunk data within a Portable Network Graphics (PNG) file; (v) quantities of unsmoothed curves; and (vi) quantities of unsmoothed color blocks.
- illegal base64 encoding can be detected by, among other things, observing illegal characters, such as ‘!’ in the encoded content, such as the HTML formatted message or any part of the MIME document.
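- One way to observe illegal characters such as ‘!’ is to compare the encoded content against the base64 alphabet. The sketch below is an assumption-laden illustration (the function name `has_illegal_base64` is hypothetical); it ignores whitespace, since line breaks are normal in encoded MIME bodies, and it does not verify that the length is a multiple of four.

```python
import re

# Base64 alphabet, followed by at most two '=' padding characters.
_B64_LEGAL = re.compile(r"^[A-Za-z0-9+/]*={0,2}$")

def has_illegal_base64(encoded: str) -> bool:
    """Return True if the text contains characters outside the base64 alphabet."""
    compact = "".join(encoded.split())  # drop line breaks and other whitespace
    return _B64_LEGAL.match(compact) is None
```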
- the inclusion of multiple images within one HTML part can be detected by parsing the HTML formatted email message and observing more than one image within an HTML part.
- the existence of three images within a single table row reveals an intention on the part of the creator of the email message to display a contiguous image to the email recipient based on the three separate embedded images.
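- The multiple-image observation can be illustrated with Python's standard `html.parser` module. The class and function names below are hypothetical, and counting images per table row is only one possible reading of the example above; nested tables are not handled.

```python
from html.parser import HTMLParser

class RowImageCounter(HTMLParser):
    """Counts <img> tags overall and per <tr> (nested rows are not tracked)."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.per_row = []
        self._in_row = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._in_row = True
            self.per_row.append(0)
        elif tag == "img":
            self.total += 1
            if self._in_row:
                self.per_row[-1] += 1

    def handle_endtag(self, tag):
        if tag == "tr":
            self._in_row = False

def count_images(html: str):
    """Return (total images, largest number of images in a single table row)."""
    parser = RowImageCounter()
    parser.feed(html)
    parser.close()
    return parser.total, max(parser.per_row, default=0)
```

A result such as `(3, 3)` would correspond to the three-images-in-one-row case described above.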
- the existence of one or more low entropy frames of an animated GIF may be determined on an absolute and/or relative basis.
- an animated GIF frame may be determined to be low with reference to observed entropy values of normal GIF files, which vary from approximately 0.1 to 5.0. Therefore, in one embodiment, the existence of one or more low entropy frames is confirmed based on a comparison of the entropy values calculated for the animated GIF at issue to 0.1. If the entropy value calculated for any frame of the animated GIF at issue is less than 0.1, then this abnormal factor is deemed to exist.
- one or more frames of the animated GIF file at issue may simply be “low” entropy relative to the other, higher entropy frames. For example, this abnormal factor may be deemed to exist when the entropy values of the highest entropy frame and the lowest entropy frame within the animated GIF file at issue differ by more than 4.9.
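- The absolute and relative tests can be combined in a few lines. The sketch assumes the per-frame entropy values have already been computed elsewhere; the function name and parameter names are hypothetical, while the default values 0.1 and 4.9 are taken from the description above.

```python
def low_entropy_frames(entropies, floor=0.1, max_spread=4.9):
    """Return True if any frame is low entropy on an absolute or relative basis."""
    # Absolute basis: some frame falls below the floor observed for normal GIFs.
    absolute = any(e < floor for e in entropies)
    # Relative basis: highest and lowest frame entropies differ by more than max_spread.
    relative = bool(entropies) and (max(entropies) - min(entropies)) > max_spread
    return absolute or relative
```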
- Illegal chunk data within a Portable Network Graphics (PNG) file may be observed by evaluating information contained within and/or about the chunks. For example, the length of the chunk and cyclic redundancy checksum (CRC) may be verified against the actual data length and recomputed CRC.
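- The length and CRC verification can be sketched as follows. In the PNG format, each chunk consists of a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 computed over the type and data. The function names are hypothetical, and the sketch stops at the IEND chunk without judging any trailing bytes.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_chunks_ok(data: bytes) -> bool:
    """Return True if every chunk up to IEND has a consistent length and CRC."""
    if not data.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        if pos + 8 > len(data):
            return False  # truncated chunk header
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4
        if end > len(data):
            return False  # declared length exceeds the actual data length
        body = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[end - 4:end])
        if zlib.crc32(ctype + body) & 0xFFFFFFFF != stored_crc:
            return False  # recomputed CRC does not match the stored CRC
        if ctype == b"IEND":
            return True
        pos = end
    return False  # no IEND chunk was found

def make_chunk(ctype: bytes, body: bytes) -> bytes:
    """Helper for tests: build a well-formed chunk."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)
```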
- Quantities of unsmoothed curves may be detected by counting the pixels whose color differs from the average color of the surrounding pixels by more than a threshold.
- Quantities of unsmoothed color blocks may be detected by counting the color blocks whose color differs from the color of the surrounding color blocks by more than a threshold.
- Color blocks contain pixels with the same or similar colors.
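- The pixel-level comparison can be sketched as below. As a simplifying assumption, the example works on a single-channel (grayscale) 2-D list rather than full color values, and treats the up to eight adjacent pixels as the "surrounding pixels"; the function name and the threshold value used in any call are hypothetical.

```python
def count_unsmoothed_pixels(gray, threshold):
    """Count pixels that differ from the mean of their neighbours by more than threshold."""
    rows, cols = len(gray), len(gray[0])
    count = 0
    for y in range(rows):
        for x in range(cols):
            # Collect the adjacent pixels, clipped at the image border.
            neighbours = [
                gray[ny][nx]
                for ny in range(max(0, y - 1), min(rows, y + 2))
                for nx in range(max(0, x - 1), min(cols, x + 2))
                if (ny, nx) != (y, x)
            ]
            if neighbours:
                mean = sum(neighbours) / len(neighbours)
                if abs(gray[y][x] - mean) > threshold:
                    count += 1
    return count
```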
- a value within a range may be returned indicating the degree to which the abnormal factor is expressed.
- images are converted to a binary representation based on a thresholding technique described in further detail below.
- thresholding is a simple method of image segmentation. Individual pixels in a grayscale image are marked as “information” pixels if their value is greater than some threshold value, T, (assuming the information content is brighter than the background) and as “background” pixels otherwise. Typically, an information pixel is given a value of “1” while a background pixel is given a value of “0.” Then, a text string measurement algorithm is applied to the binary representation of the portion of the image deemed to contain the information content.
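- A minimal sketch of this marking step, assuming the grayscale image is a 2-D list of intensity values and that the information content is brighter than the background (the function name is hypothetical):

```python
def binarize(gray, t):
    """Mark pixels brighter than threshold t as information (1), others as background (0)."""
    return [[1 if g > t else 0 for g in row] for row in gray]
```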
- both the quantity of text and the relative position of such text within an email viewer's preview window, for example, or within the image itself may be taken into consideration.
- a high spam score could be assigned to a very large image (with a correspondingly smaller percentage of text) in which the text is nonetheless positioned to occupy the whole preview window.
- the email message is classified as spam or clean based on the observed characteristics of the embedded image(s), such as image location information, the existence/non-existence of various abnormal factors and the quantity of text determined to exist within the embedded image(s).
- the spam/clean classification may be based upon a weighted average of the various observed characteristics.
- each observed characteristic may contribute to the score. Once the score reaches a threshold, the email message may be classified as spam and the further characteristics may not require analysis or observation. The email message is classified as clean if the score is less than the threshold after all the characteristics have been evaluated. In one embodiment, the characteristics may be considered in the following order:
- a “spaminess” score may be generated. For example, rather than simply conveying a binary result (e.g., spam vs. clean), a value within a range (e.g., 0 to 10) may be returned indicating the degree to which the email message appeared to contain indications of being spam or the likelihood the email message is spam.
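- The cumulative scoring with early termination can be sketched as follows. Everything specific here is an assumption for illustration: the function name, the check names, the per-check contributions, the order of the checks, and the threshold are hypothetical and are not taken from the description.

```python
def classify(checks, threshold):
    """Evaluate checks in order, stopping as soon as the running score reaches threshold.

    checks: list of (name, callable) pairs; each callable returns a score contribution.
    Returns (label, score, names_of_checks_evaluated).
    """
    score = 0.0
    evaluated = []
    for name, check in checks:
        score += check()
        evaluated.append(name)
        if score >= threshold:
            # Remaining characteristics need not be analyzed.
            return "spam", score, evaluated
    return "clean", score, evaluated
```

Returning the running score alongside the label also supports reporting a graded result instead of a purely binary one.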
- FIG. 6 is a flow diagram illustrating quantity of text measurement processing in accordance with an embodiment of the present invention. The steps described below represent the processing of block 540 of FIG. 5 according to one embodiment of the present invention.
- the various process and decision blocks described below may be performed by hardware components, embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps, or the steps may be performed by a combination of hardware, software, firmware and/or involvement of human participation/interaction.
- grayscale representation G i,j .
- color pixels of the image at issue are converted to grayscale by computing an average or weighted average of the red, green and blue color components. While various conversions may be used, examples of suitable conversion equations include the following:
- $G_{i,j} = (0.299\,r_{i,j} + 0.587\,g_{i,j} + 0.114\,b_{i,j})/3, \quad 0 \le i \le x_{\max},\ 0 \le j \le y_{\max}$ (EQ #1)
- $G_{i,j} = (0.3\,r_{i,j} + 0.6\,g_{i,j} + 0.1\,b_{i,j})/3, \quad 0 \le i \le x_{\max},\ 0 \le j \le y_{\max}$ (EQ #2)
- $G_{i,j} = (r_{i,j} + g_{i,j} + b_{i,j})/3, \quad 0 \le i \le x_{\max},\ 0 \le j \le y_{\max}$ (EQ #3)
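- As a small illustration of the conversion, the sketch below applies the simple average of EQ #3 to a 2-D list of (r, g, b) tuples. The function name and the `weights` parameter are hypothetical; the parameter is included only so that other weightings can be tried.

```python
def to_grayscale(rgb, weights=(1.0, 1.0, 1.0)):
    """Convert rows of (r, g, b) pixels to gray levels: (wr*r + wg*g + wb*b) / 3."""
    wr, wg, wb = weights
    return [[(wr * r + wg * g + wb * b) / 3 for (r, g, b) in row] for row in rgb]
```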
- entropy and threshold values are determined for the grayscale image, G i,j .
- Entropy is a statistical measure of randomness that can be used to characterize the texture of the input image.
- an intermediate data structure is built containing an intensity histogram, C g .
- each pixel may have a value of 0 to 255.
- the intensity histogram includes 256 bins each of which maintain a count of the number of pixels in the grayscale image having that value.
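- The histogram structure can be sketched as a 256-element list of counts. The entropy function shown with it uses the common Shannon entropy of the normalized histogram; that choice is an assumption made for illustration and is not necessarily the entropy formula used in this embodiment. Both function names are hypothetical.

```python
import math

def intensity_histogram(gray):
    """Return a 256-bin list counting the pixels at each gray level."""
    counts = [0] * 256
    for row in gray:
        for g in row:
            level = min(255, max(0, int(round(g))))  # clamp to the 0..255 range
            counts[level] += 1
    return counts

def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution described by counts (assumed formula)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)
```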
- FIG. 9 represents an intensity histogram for a grayscale representation of FIG. 8 .
- entropy is calculated according to:
- a threshold value within the intensity histogram is selected simply by choosing the mean or median value.
- the rationale for this simple threshold selection is that if the information pixels are brighter than the background, they should also be brighter than the average.
- a more sophisticated approach is to create a histogram of the image pixel intensities and then use the valley point as the threshold, T. This histogram approach assumes that there is some average value for the background and information pixels, but that the actual pixel values have some variation around these average values.
- the threshold, T is calculated by:
- the gray levels are divided into two groups by i, and w i1 and w i2 are the total amount of the pixels of each group while M i1 and M i2 are the average of the gray level of each group.
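- One well-known technique that works with exactly these quantities (the pixel totals and mean gray levels of the two groups at each candidate split) is Otsu's method, which picks the split maximizing w1 · w2 · (M1 − M2)². Whether this matches the threshold equation of this embodiment is an assumption; the sketch below is offered only as an illustration, and the function name is hypothetical.

```python
def otsu_threshold(counts):
    """Return the gray level t that best separates levels <= t from levels > t."""
    total = sum(counts)
    total_sum = sum(level * c for level, c in enumerate(counts))
    best_t, best_score = 0, -1.0
    w1 = 0      # number of pixels in the lower group
    sum1 = 0    # sum of gray levels in the lower group
    for t, c in enumerate(counts):
        w1 += c
        sum1 += t * c
        w2 = total - w1
        if w1 == 0 or w2 == 0:
            continue  # one group is empty, so this split is not meaningful
        m1 = sum1 / w1
        m2 = (total_sum - sum1) / w2
        score = w1 * w2 * (m1 - m2) ** 2  # between-group separation
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```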
- thresholding is performed to form a binary representation, B i,j , of the grayscale image based on the threshold value selected in block 620 .
- thresholding is performed in accordance with the following equations:
- $B_{i,j} = \begin{cases} 0 & G_{i,j} \le T \\ 1 & \text{otherwise} \end{cases}, \quad 0 \le i \le x_{\max},\ 0 \le j \le y_{\max}$ (EQ #13)
- B i , j ′ ⁇ B i , j max ⁇ ( C k ) ⁇ ⁇ , max ⁇ ( C k ) ⁇ T ! B i , j Otherwise ⁇ 0 ⁇ k ⁇ 255 EQ ⁇ ⁇ #14
- ⁇ is an adjustable parameter
- the binary image is logically divided into M × N virtual blocks.
- the M × N virtual blocks are analyzed to quantify the number of text strings.
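- The logical division into virtual blocks can be sketched as follows. The sketch assumes the binary image is a 2-D list and, as a labeled assumption, treats the first count as the number of block rows and the second as the number of block columns; the function name is hypothetical, and the text string measurement itself is not shown.

```python
def split_blocks(binary, m, n):
    """Divide a 2-D binary image into m rows by n columns of virtual blocks.

    Returns a dict mapping (block_row, block_col) to the block's pixel rows.
    """
    rows, cols = len(binary), len(binary[0])
    blocks = {}
    for bi in range(m):
        y0, y1 = bi * rows // m, (bi + 1) * rows // m
        for bj in range(n):
            x0, x1 = bj * cols // n, (bj + 1) * cols // n
            blocks[(bi, bj)] = [row[x0:x1] for row in binary[y0:y1]]
    return blocks
```

Each block can then be analyzed separately, so a block with no information pixels contributes nothing to the measured quantity of text.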
- the text strings in the binary image are quantified in accordance with the following equations:
- ⁇ 0 . . . a ⁇ 7 are adjustable parameters
- T y t , y b m,n is the likelihood that the row between y t and y b in the block [m,n] represents text
- CB i n is the likelihood that the line[i] is a part of text
- B k,i is the value of pixel[k,i] in the binary image.
- FIG. 7 is an example of an image spam email message 700 containing an embedded image 710 .
- image spam email messages also include text 720 in an attempt to defeat conventional heuristic filters.
- FIG. 8 is a grayscale image 810 based on the embedded image 710 of FIG. 7 according to one embodiment of the present invention.
- the first step (block 610 ) is to convert the embedded image 710 to a grayscale representation, G i,j .
- Assuming embedded image 710 of FIG. 7 is a color image having red (r), green (g) and blue (b) color components, after application of one of equations EQ #1, EQ #2, EQ #3 or the like, the grayscale representation, G i,j , appears as grayscale image 810 .
- FIG. 9 is an intensity histogram 900 for the grayscale image 810 of FIG. 8 according to one embodiment of the present invention.
- the next step (block 620 ) is to build an intensity histogram data structure, C g , and determine a threshold value for the grayscale image 810 .
- an intensity histogram data structure, C g results, which appears as intensity histogram 900 when displayed in graphical form.
- the intensity histogram 900 graphically illustrates the number of pixel occurrences in grayscale image 810 for each gray level.
- a threshold value, T, 910 is calculated for grayscale image 810 .
- the threshold value 910 is 109 .
- FIG. 10 is a binary image 1010 resulting from thresholding the grayscale image 810 of FIG. 8 in accordance with an embodiment of the present invention.
- the next step (block 630 ) is to binarize the image by performing thresholding with the calculated threshold value.
- the result of graphically depicting the binary representation, B i,j , in which pixels having a value of one are presented as black and pixels having a value of zero are presented as white, is shown as binary image 1010 .
- the information content intended to be conveyed to the email recipient, i.e., the various text strings, is now clearly distinguishable from the background.
- FIG. 11 illustrates an exemplary segmentation of the binary image of FIG. 10 into 28 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
- the next steps are to logically divide the binary image 1010 into virtual blocks and then separately analyze each block to measure perceived text content.
- segmented binary image 1110 contains 28 virtual blocks, examples of which are pointed out with reference numerals 1120 and 1130 .
- Based on equations EQ #15, EQ #16, EQ #17 and EQ #18, 23 of the 28 blocks are determined to contain a total of 63 text strings.
- Text strings detected by the algorithm are underlined.
- Block 1120 is an example of a block that has been determined to contain one or more text strings, i.e., the word “TRADE” 1121 .
- Block 1130 is an example of a block that has been determined not to contain a text string.
- FIG. 12 is a grayscale image 1210 based on another exemplary embedded image observed in connection with image spam.
- FIG. 13 illustrates an exemplary segmentation of a binarized image 1310 , corresponding to the grayscale image 1210 of FIG. 12 , into 56 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
- segmented binary image 1310 contains 56 virtual blocks, examples of which are pointed out with reference numerals 1320 and 1330 .
- Based on equations EQ #15, EQ #16, EQ #17 and EQ #18, 26 of the 56 blocks are determined to contain a total of 40 text strings. Text strings detected by the algorithm are underlined.
- Block 1320 is an example of a block that has been determined to contain one or more text strings, i.e., the group of letters “ebtEras”.
- Block 1330 is an example of a block that has been determined not to contain a text string.
Abstract
Systems and methods for an anti-spam detection module that can detect image spam are provided. According to one embodiment, an image spam detection process involves determining and measuring various characteristics of images that may be embedded within or otherwise associated with an electronic mail (email) message. An approximate display location of the embedded images is determined. The existence of one or more abnormal factors associated with the embedded images is identified. A quantity of text included in the one or more embedded images is determined and measured by analyzing one or more blocks of binarized representations of the one or more embedded images. Finally, the likelihood that the email message is spam is determined based on one or more of the approximate display location, the existence of one or more abnormal factors and the quantity and location of text measured.
Description
- Contained herein is material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent disclosure by any person as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all rights to the copyright whatsoever. Copyright© 2007, Fortinet, Inc.
- 1. Field
- Embodiments of the present invention generally relate to the field of spam filtering and anti-spam techniques. In particular, various embodiments relate to image analysis and methods for combating spam in which spammers use images to carry the advertising text.
- 2. Description of the Related Art
- Image spam was originally created in order to get past heuristic filters, which block messages containing words and phrases commonly found in spam. Since image files have different formats than the text found in the message body of an electronic mail (email) message, conventional heuristic filters, which analyze such text, do not detect the content of the message, which may be partly or wholly conveyed by embedded text within the image. As a result, heuristic filters were easily defeated by image spam techniques.
- To address this spamming technique, fuzzy signature technologies, which flag both known and similar messages as spam, were deployed by anti-spam vendors. Such fuzzy signature technologies allowed message attachments to be targeted, thereby recognizing as spam messages with different content but the same attachment.
- Spammers now alter the images to make the email message appear different to signature-based filtering approaches while maintaining readability of the embedded text message to human viewers. The content of images lies in two levels: (i) the pixel matrix and (ii) the text or graphics these pixel matrices represent. At present, the notion of pixel-based matching does not make sense, as the same text could be represented by countless pixel matrices by simply changing various attributes, such as the font, size, color or by adding noise. Therefore, hash matching and other signature-based approaches have essentially been rendered useless to block image spam as they fail as a result of even minor changes to the background of the image.
- Some vendors have attempted to catch image spam by employing Optical Character Recognition (OCR) techniques; however, such approaches have only limited success in view of spammers' use of techniques to obscure the embedded text messages with a variety of noise.
- FIGS. 1A and 1B illustrate sample images and obfuscation techniques used by spammers to defeat OCR image spam detection techniques. As shown in FIGS. 1A and 1B, polygons, lines, random colors, jagged text, random dots, varying borders and the like may be inserted into image spam in an attempt to defeat signature detection techniques and obscure the embedded text from OCR techniques. There are an almost infinite number of ways that spammers can randomize images. In addition to the foregoing obfuscation techniques, spammers have recently used techniques such as varying the colors used in an image, changing the width and/or pattern of the border, altering the font style, and slicing images into smaller pieces (which are then reassembled to appear as a single image to the recipient). Meanwhile, OCR is very computationally expensive. Depending upon the implementation, fully rendering a message and then looking for word matches against different character set libraries may take as long as several seconds per message, which is typically unacceptable for many contexts.
- Systems and methods are described for an anti-spam detection module that can detect image spam. According to one embodiment, one or more of the quantity and position of text within an image associated with an electronic message are measured or estimated. Then, based at least in part on the results of the measuring or estimating, the likelihood that the electronic message is spam is determined.
- According to another embodiment, an embedded image of an electronic mail (email) message is converted to a binarized representation by performing thresholding on a grayscale representation of the embedded image. A quantity of text included in the embedded image is then determined and measured by analyzing one or more blocks of the binarized representations. Finally, the email message is classified as spam or clean based at least in part on the quantity of text measured.
- In one embodiment, the embedded image may be formatted in accordance with the Graphic Interchange Format (GIF), Joint Photographic Experts Group (JPEG) or Portable Network Graphics (PNG) formats/standards.
- In one embodiment, the embedded image may be an image contained within a file attached to the email message.
- In one embodiment, the method also includes determining an approximate display location of an embedded image within the email message and identifying existence of one or more abnormal factors associated with the embedded image. Then, the classification can be based upon the approximate display location, the existence of one or more abnormal factors as well as the quantity of text measured.
- Other features of embodiments of the present invention will be apparent from the accompanying drawings and from the detailed description that follows.
- Embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
- FIGS. 1A and 1B illustrate sample images and obfuscation techniques used by spammers to defeat OCR image spam detection techniques.
- FIG. 2 is a block diagram conceptually illustrating a simplified network architecture in which embodiments of the present invention may be employed.
- FIG. 3 is a block diagram conceptually illustrating interaction among various functional units of an email security system with a client workstation and an email server in accordance with an embodiment of the present invention.
- FIG. 4 is an example of a computer system with which embodiments of the present invention may be utilized.
- FIG. 5 is a high-level flow diagram illustrating anti-spam processing of images using sender's intention analysis in accordance with an embodiment of the present invention.
- FIG. 6 is a flow diagram illustrating quantity of text measurement processing in accordance with an embodiment of the present invention.
- FIG. 7 is an example of an image spam email message containing an embedded image.
- FIG. 8 is a grayscale image based on the embedded image of FIG. 7 according to one embodiment of the present invention.
- FIG. 9 is an intensity histogram for the grayscale image of FIG. 8 according to one embodiment of the present invention.
- FIG. 10 is a binary image resulting from thresholding the grayscale image of FIG. 8 in accordance with an embodiment of the present invention.
- FIG. 11 illustrates an exemplary segmentation of the binary image of FIG. 10 into 28 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
- FIG. 12 is a grayscale image based on another exemplary embedded image observed in connection with image spam.
- FIG. 13 illustrates an exemplary segmentation of a binarized image, corresponding to the image of FIG. 12, into 56 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
- Systems and methods are described for an anti-spam detection module that can detect various forms of image spam. According to one embodiment, images attached to or embedded within email messages are analyzed to determine the senders' intention. Empirical analysis reveals legitimate emails may contain embedded images, but valid images sent through email rarely contain a substantial quantity of text. Additionally, when legitimate images are included within email messages, the senders of such email messages do not painstakingly adjust the location of such included images to assure such images appear in the preview window of an email client. Furthermore, legitimate senders do not intentionally inject noise into the embedded images. In contrast, spammers usually compose email messages in different ways. For example, in the context of image spam, spammers insert text into images to avoid filtering by traditional text filters and employ techniques to randomize images and/or obscure text embedded within images. Spammers also typically make great efforts to draw attention to their image spam by carefully placing the image in such a manner as to make it visible to the recipient in the preview window/pane of an email client that supports HTML email, such as Netscape Messenger or Microsoft Outlook. Consequently, various indicators of image spam include, but are not limited to, inclusion of one or more images in the front part of an email message, inclusion of one or more images containing text meeting a certain threshold and/or inclusion of one or more images into which noise appears to have been injected to obfuscate embedded text.
- According to one embodiment, various image analysis techniques are employed to more accurately detect image spam based on senders' intention analysis. The goal of senders' intention analysis is to discover the email message sender's intent by examining various characteristics of the email message and the embedded or attached images. If it appears, for example, after performing image analysis that one or more images associated with an email message have had one or more obfuscation techniques applied, the intent is to draw attention to the one or more images and/or the one or more images include suspicious quantities of text, then the sender's intention analysis anti-spam processing may flag the email message at issue as spam. In one embodiment, the image scanning spam detection method is based on a combination of email header analysis, email body analysis and image processing on image attachments.
- Importantly, although various embodiments of the anti-spam detection module and image scanning methodologies are discussed in the context of an email security system, they are equally applicable to network gateways, email appliances, client workstations, servers and other virtual or physical network devices or appliances that may be logically interposed between client workstations and servers, such as firewalls, network security appliances, email security appliances, virtual private network (VPN) gateways, switches, bridges, routers and like devices through which email messages flow. Furthermore, the anti-spam techniques described herein are equally applicable to instant messages, Multimedia Message Service (MMS) messages and other forms of electronic communications in the event that such messages become vulnerable to image spam in the future.
- Additionally, various embodiments of the present invention are described with reference to filtering of incoming email messages. However, it is to be understood, that the filtering methodologies described herein are equally applicable to email messages originated within an enterprise and circulated internally or outgoing email messages intended for recipients outside of the enterprise. Therefore, the specific examples presented herein are not intended to be limiting and are merely representative of exemplary functionality.
- Furthermore, while, for convenience, various embodiments of the present invention may be described with reference to detecting image spam in the graphic/image file formats currently most prevalent (i.e., Graphic Interchange Format (GIF), Joint Photographic Experts Group (JPEG) and Portable Network Graphics (PNG) graphic/image file formats), embodiments of the present invention are not so limited and are equally applicable to various other current and future graphic/image file formats, including, but not limited to, Progressive Graphics File (PGF), Tagged Image File Format (TIFF), bit mapped format (BMP), HDP, WDP, XPM, MacOS-PICT, Irix-RGB, Multiresolution Seamless Image Database (MrSID), RAW formats used by various digital cameras, various vector formats, such as Scalable Vector Graphics (SVG), as well as other file formats of attachments which may themselves contain embedded images, such as Portable Document Format (PDF), Encapsulated PostScript, SWF, Windows Metafile, AutoCAD DXF and CorelDraw CDR.
- In the following description, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the present invention. It will be apparent, however, to one skilled in the art that embodiments of the present invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form.
- Embodiments of the present invention include various steps, which will be described below. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware, software, firmware and/or by human operators.
- Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable medium having stored thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, ROMs, random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions. Moreover, embodiments of the present invention may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
- Brief definitions of terms used throughout this application are given below.
- The term “client” generally refers to an application, program, process or device in a client/server relationship that requests information or services from another program, process or device (a server) on a network. Importantly, the terms “client” and “server” are relative since an application may be a client to one application but a server to another. The term “client” also encompasses software that makes the connection between a requesting application, program, process or device to a server possible, such as an FTP client.
- The terms “connected” or “coupled” and related terms are used in an operational sense and are not necessarily limited to a direct connection or coupling. Thus, for example, two devices may be coupled directly, or via one or more intermediary media or devices. As another example, devices may be coupled in such a way that information can be passed there between, while not sharing any physical connection with one another. Based on the disclosure provided herein, one of ordinary skill in the art will appreciate a variety of ways in which connection or coupling exists in accordance with the aforementioned definition.
- The phrase “embedded image” generally refers to an image that is displayed or rendered inline within a styled or formatted electronic message, such as a HyperText Markup Language (HTML)-based or formatted email message. As used herein, the phrase “embedded image” is intended to encompass scenarios in which the image data is sent with the email message and linked images in which a reference to the image is sent with the email message and the image data is retrieved once the recipient views the email message. The phrase “embedded image” also includes an image that is embedded in other file formats of attachments, such as Portable Document Format (PDF) attachments, in which the image data is displayed to the email recipient when the attachment is viewed.
- The phrase “image spam” generally refers to spam in which the “call to action” of the message is partially or completely contained within an embedded file attachment, such as a .gif, .jpeg or .pdf file, rather than in the body of the email message. Such images are typically automatically displayed to the email recipients and typically some form of obfuscation is implemented in an attempt to hide the true content of the image from spam filters.
- The phrases “in one embodiment,” “according to one embodiment,” and the like generally mean the particular feature, structure, or characteristic following the phrase is included in at least one embodiment of the present invention, and may be included in more than one embodiment of the present invention. Importantly, such phrases do not necessarily refer to the same embodiment.
- The phrase “network gateway” generally refers to an internetworking system, a system that joins two networks together. A “network gateway” can be implemented completely in software, completely in hardware, or as a combination of the two. Depending on the particular implementation, network gateways can operate at any level of the OSI model from application protocols to low-level signaling.
- If the specification states a component or feature “may”, “can”, “could”, or “might” be included or have a characteristic, that particular component or feature is not required to be included or have the characteristic.
- The term “proxy” generally refers to an intermediary device, program or agent, which acts as both a server and a client for the purpose of making or forwarding requests on behalf of other clients.
- The term “responsive” includes completely or partially responsive.
- The term “server” generally refers to an application, program, process or device in a client/server relationship that responds to requests for information or services by another program, process or device (a client) on a network. The term “server” also encompasses software that makes the act of serving information or providing services possible.
- The term “spam” generally refers to electronic junk mail, typically bulk electronic mail (email) messages in the form of commercial advertising. Often, email message content may be irrelevant in determining whether an email message is spam, though most spam is commercial in nature. There is spam that fraudulently promotes penny stocks in the classic pump-and-dump scheme. There is spam that promotes religious beliefs. From the recipient's perspective, spam typically represents unsolicited, unwanted, irrelevant, and/or inappropriate email messages, often unsolicited commercial email (UCE). In addition to UCE, spam includes, but is not limited to, email messages regarding or associated with fraudulent business schemes, chain letters, and/or offensive sexual or political messages.
- According to one embodiment, “spam” comprises Unsolicited Bulk Email (UBE). Unsolicited generally means the recipient of the email message has not granted verifiable permission for the email message to be sent and the sender has no discernible relationship with all or some of the recipients. Bulk generally refers to the fact that the email message is sent as part of a larger collection of email messages, all having substantively identical content. In embodiments in which spam is equated with UBE, an email message is considered spam if it is both unsolicited and bulk. Unsolicited email can be normal email, such as first contact enquiries, job enquiries, and sales enquiries. Bulk email can be normal email, such as subscriber newsletters, customer communications, discussion lists, etc. Consequently, in such embodiments, an email message would be considered spam if (i) the recipient's personal identity and context are irrelevant because the email message is equally applicable to many other potential recipients; and (ii) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for the email message to be sent.
- The phrase “transparent proxy” generally refers to a specialized form of proxy that only implements a subset of a given protocol and allows unknown or uninteresting protocol commands to pass unaltered. Advantageously, as compared to a full proxy in which use by a client typically requires editing of the client's configuration file(s) to point to the proxy, it is not necessary to perform such extra configuration in order to use a transparent proxy.
-
FIG. 2 is a block diagram conceptually illustrating a simplified network architecture in which embodiments of the present invention may be employed. In this simple example, spammers 205 are shown coupled to the public Internet 200 to which local area network (LAN) 240 is also communicatively coupled through a firewall 210, a network gateway 215 and an email security system 220, which incorporates within an anti-spam module 225 various novel image spam detection methodologies that are described further below. - In the present example, the
email security system 220 is logically interposed between spammers and the email server 230 to perform spam filtering on incoming email messages from the public Internet 200 prior to receipt and storage on the email server 230 from which and through which client workstations 260 residing on the LAN 240 may retrieve and send email correspondence. - In the exemplary network architecture of
FIG. 2, the firewall 210 may represent a hardware or software solution configured to protect the resources of the LAN from outsiders and to control what outside resources local users have access to by enforcing security policies. The firewall 210 may filter or disallow unauthorized or potentially dangerous material or content from entering LAN 240 and may otherwise limit access between the LAN 240 and the public Internet 200 in accordance with local security policy established and maintained by an administrator of LAN 240. - In one embodiment, the
network gateway 215 acts as an interface between the LAN 240 and the public Internet 200. The network gateway 215 may, for example, translate between dissimilar protocols used internally and externally to the LAN 240. Depending upon the distribution of functionality, the network gateway 215 or the firewall 210 may perform network address translation (NAT) to hide private Internet Protocol (IP) addresses used within LAN 240 and enable multiple client workstations, such as client workstations 260, to access the public Internet 200 using a single public IP address. - According to one embodiment, the
email security system 220 performs email filtering to detect, tag, block and/or remove unwanted spam and malicious attachments. In one embodiment, an anti-spam module 225 of the email security system 220 performs one or more spam filtering techniques, including but not limited to, sender IP reputation analysis and content analysis, such as attachment/content filtering, heuristic rules, deep email header inspection, spam URI real-time blacklists (SURBL), banned word filtering, spam checksum blacklist, forged IP checking, greylist checking, Bayesian classification, Bayesian statistical filters, signature reputation, and/or filtering methods such as FortiGuard-Antispam, access policy filtering, global and user black/white list filtering, spam Real-time Blackhole List (RBL), Domain Name Service (DNS) Block List (DNSBL) and per user Bayesian filtering so that individual users can set their own profiles. - The
anti-spam module 225 also performs various novel image spam detection methodologies or spam image analysis scanning based on sender's intention analysis in an attempt to detect, tag, block and/or remove spam presented in the form of one or more images. Examples of the image analysis techniques and the sender's intention analysis methodologies are described in more detail below. Existing email security platforms that exemplify various operational characteristics of the email security system 220 according to an embodiment of the present invention include the FortiMail™ family of high-performance, multi-layered email security platforms, including the FortiMail-100 platform, the FortiMail-400 platform, the FortiMail-2000 platform and the FortiMail-4000A platform, all of which are available from Fortinet, Inc. of Sunnyvale, Calif. -
FIG. 3 is a block diagram conceptually illustrating interaction among various functional units of an email security system 320 with a client workstation 360 and an email server 330 in accordance with an embodiment of the present invention. - While in this simplified example, only a single client workstation, i.e.,
client workstation 360, and a single email server, i.e., email server 330, are shown interacting with the email security system 320, it should be understood that many local and/or remote client workstations, servers and email servers may interact directly or indirectly with the email security system 320 and directly or indirectly with each other. - According to the present example, the
email security system 320, which may be implemented as one or more virtual or physical devices, includes a content processor 321, logically interposed between sources of inbound email 380 and an enterprise's email server 330. In the context of the present example, the content processor 321 performs scanning of inbound email messages 380 originating from sources on the public Internet 200 before allowing such inbound email messages 380 to be stored on the email server 330. In one embodiment, an anti-spam module 325 of the content processor 321 may perform spam filtering and an anti-virus (AV) module 326 implementing AV and other filters potentially performs other traditional anti-virus detection and content filtering on data associated with the email messages. - In the current example,
anti-spam module 325 may apply various image analysis methodologies described further below to ascertain email senders' intentions and therefore the likelihood that attached and/or embedded images represent image spam. According to the current example, the anti-spam module 325, responsive to being presented with an inbound email message, determines whether the email message contains embedded or attached images and if so, as described further below with reference to FIG. 5 and FIG. 6, determines if such images represent image spam. - In one embodiment, the
content processor 321 is an integrated FortiASIC™ Content Processor chip developed by Fortinet, Inc. of Sunnyvale, Calif. In alternative embodiments, the content processor 321 may be a dedicated coprocessor or software to help offload content filtering tasks from a host processor (not shown). - In alternative embodiments, the
anti-spam module 325 may be associated with or otherwise responsive to a mail transfer protocol proxy (not shown). The mail transfer protocol proxy may be implemented as a transparent proxy that implements handlers for Simple Mail Transfer Protocol (SMTP) or Extended SMTP (ESMTP) commands/replies relevant to the performance of content filtering activities and passes through those not relevant to the performance of content filtering activities. In one embodiment, the mail transfer protocol proxy may subject each of incoming email, outgoing email and internal email to scanning by the anti-spam module 325 and/or the content processor 321. - Notably, filtering of email need not be performed prior to storage of email messages on the
email server 330. In alternative embodiments, the content processor 321, the mail transfer protocol proxy (not shown) or some other functional unit logically interposed between a user agent or email client 361 executing on the client workstation 360 and the email server 330 may process email messages at the time they are requested to be transferred from the user agent/email client 361 to the email server 330 or vice versa. Meanwhile, neither the email messages nor their attachments need be stored locally on the email security system 320 to support the filtering functionality described herein. For example, instead of the anti-spam processing running responsive to a mail transfer protocol proxy, the email security system 320 may open a direct connection between the email client 361 and the email server 330, and filter email in real-time as it passes through. - While in the context of the present example, the
content processor 321, the anti-spam module 325 and the mail transfer protocol proxy (not shown) have been described as residing within or as part of the same network device, in alternative embodiments one or more of these functional units may be located remotely from the other functional units. According to one embodiment, the hardware components and/or software modules that implement the content processor 321, the anti-spam module 325 and the mail transfer protocol proxy are generally provided on or distributed among one or more Internet and/or LAN accessible networked devices, such as one or more network gateways, firewalls, network security appliances, email security systems, switches, bridges, routers, data storage devices, computer systems and the like. - In one embodiment, the functionality of one or more of the above-referenced functional units may be merged in various combinations. For example, the
content processor 321 may be incorporated within the mail transfer protocol proxy or the anti-spam module 325 may be incorporated within the email server 330 or the email client 361. Moreover, the functional units can be communicatively coupled using any suitable communication method (e.g., message passing, parameter passing, and/or signals through one or more communication paths etc.). Additionally, the functional units can be physically connected according to any suitable interconnection architecture (e.g., fully connected, hypercube, etc.). - According to embodiments of the invention, the functional units can be any suitable type of logic (e.g., digital logic) for executing the operations described herein. Any of the functional units used in conjunction with embodiments of the invention can include machine-readable media including instructions for performing operations described herein. Machine-readable media include any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), etc.
-
FIG. 4 is an example of a computer system with which embodiments of the present invention may be utilized. The computer system 400 may represent or form a part of an email security system, network gateway, firewall, network appliance, switch, bridge, router, data storage device, server, client workstation and/or other network device implementing one or more of the content processor 321 or other functional units depicted in FIG. 3. According to FIG. 4, the computer system 400 includes one or more processors 405, one or more communication ports 410, main memory 415, read only memory 420, mass storage 425, a bus 430, and removable storage media 440. - The processor(s) 405 may be Intel® Itanium® or Itanium 2® processor(s), AMD® Opteron® or Athlon MP® processor(s) or other processors known in the art.
- Communication port(s) 410 represent physical and/or logical ports. For example, communication port(s) may be any of an RS-232 port for use with a modem based dialup connection, a 10/100 Ethernet port, or a Gigabit port using copper or fiber. Communication port(s) 410 may be chosen depending on a network such as a Local Area Network (LAN), Wide Area Network (WAN), or any network to which the
computer system 400 connects. - Communication port(s) 410 may also be the name of the end of a logical connection (e.g., a Transmission Control Protocol (TCP) port or a User Datagram Protocol (UDP) port). For example, communication ports may be one of the Well Known Ports, such as TCP port 25 or UDP port 25 (used for Simple Mail Transfer), assigned by the Internet Assigned Numbers Authority (IANA) for specific uses.
-
Main memory 415 may be Random Access Memory (RAM), or any other dynamic storage device(s) commonly known in the art. - Read only
memory 420 may be any static storage device(s) such as Programmable Read Only Memory (PROM) chips for storing static information such as instructions for processors 405. -
Mass storage 425 may be used to store information and instructions. For example, hard disks such as the Adaptec® family of SCSI drives, an optical disc, an array of disks such as RAID, such as the Adaptec family of RAID drives, or any other mass storage devices may be used. - Bus 430 communicatively couples processor(s) 405 with the other memory, storage and communication blocks. Bus 430 may be a PCI/PCI-X or SCSI based system bus depending on the storage devices used.
- Optional
removable storage media 440 may be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video Disk (DVD)-Read Only Memory (DVD-ROM), Re-Writable DVD and the like. -
FIG. 5 is a high-level flow diagram illustrating anti-spam processing of images using sender's intention analysis in accordance with an embodiment of the present invention. Depending upon the particular implementation, the various process and decision blocks described below may be performed by hardware components, embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps, or the steps may be performed by a combination of hardware, software, firmware and/or involvement of human participation/interaction. - At
block 510, an email message is analyzed to determine if it contains images. For purposes of the present example, the direction of flow of the email message is not pertinent. As indicated above, the email message may be inbound, outbound or an intra-enterprise email message. In various embodiments, however, the anti-spam processing may be enabled in one direction only or various detection thresholds could be configured differently for different flows. - In any event, in one embodiment, the headers, body and attachments, if any, of the email message at issue are parsed and scanned to identify whether the email message is deemed to contain one or more embedded images. If so, processing continues with
block 520. Otherwise, no further image spam analysis is required and processing branches to the end. - At
block 520, the email message at issue has been determined to contain one or more embedded images. In the current example, the senders' intention analysis anti-spam processing, therefore, proceeds to calculate the location(s) of the embedded image(s). Images may be embedded in a HyperText Markup Language (HTML) part of an HTML formatted email message, within a MIME document or attached separately. In one embodiment, by parsing the HTML, plain text and/or other Multipurpose Internet Mail Extension (MIME) parts, the displaying line just prior to the images can be identified and thus the approximate displaying location of any embedded images can be calculated. - At
block 530, the one or more images are analyzed for indications of one or more abnormal factors. Typically, the abnormal factors are manifestations of a spammer's attempt to obscure text embedded within the one or more images by injecting a variety of noise. In one embodiment, abnormal factors include the presence of one or more of the following characteristics: (i) illegal base64 encoding; (ii) multiple images within one HTML part; (iii) one or more low entropy frames in an animated Graphics Interchange Format (GIF) file; (iv) illegal chunk data within a Portable Network Graphics (PNG) file; (v) quantities of unsmoothed curves; and (vi) quantities of unsmoothed color blocks. - In one embodiment, illegal base64 encoding can be detected by, among other things, observing illegal characters, such as ‘!’, in the encoded content, such as the HTML formatted message or any part of the MIME document.
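By way of a non-limiting illustration, the illegal-character check described above might be sketched as follows. The function name `has_illegal_base64` and the choice to ignore whitespace are assumptions made for this sketch; the alphabet is the standard base64 alphabet with `=` padding.

```python
import re

# Standard base64 alphabet, followed by at most two '=' padding characters.
_BASE64_BODY = re.compile(r"[A-Za-z0-9+/]*={0,2}")


def has_illegal_base64(encoded: str) -> bool:
    """Return True if the payload contains characters outside the base64 alphabet."""
    # MIME bodies are wrapped across lines, so whitespace is removed before checking
    # (an assumption of this sketch).
    compact = "".join(encoded.split())
    return _BASE64_BODY.fullmatch(compact) is None
```

A payload such as `aGVs!bG8=` is flagged because of the `!`, whereas a line-wrapped but otherwise valid payload is not.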
- In one embodiment, the inclusion of multiple images within one HTML part can be detected by parsing the HTML formatted email message and observing more than one image within an HTML part. In the exemplary HTML code excerpt below, the existence of three images within a single table row (<tr> . . . </tr>) reveals an intention on the part of the creator of the email message to display a contiguous image to the email recipient based on the three separate embedded images.
-
<html>
<head>
  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
  <title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
  <title>abovementioned bertie</title>
  <div align="center">
  [...]
  <tr>
    <td width="33%">
      <a href="http://www.lklljjp.biz/vpr6160/">
      <img name="apprehension" src="cid:part2.00020108.07020409@72.ca" border="0" height="179" width="184"></a></td>
    <td width="33%">
      <a href="http://www.lklljjp.biz/vpr6160/">
      <img name="gradate" src="cid:part3.00060308.03010709@72.ca" border="0" height="179" width="184"></a></td>
    <td width="34%">
      <a href="http://www.lklljjp.biz/vpr6160/">
      <img name="maltreat" src="cid:part4.02080304.00040002@72.ca" border="0" height="179" width="184"></a></td>
  </tr>
  [...]
</body>
</html>
- The existence of one or more low entropy frames of an animated GIF may be determined on an absolute and/or relative basis. For example, the entropy of an animated GIF frame may be determined to be low with reference to observed entropy values of normal GIF files, which vary from approximately 0.1 to 5.0. Therefore, in one embodiment, the existence of one or more low entropy frames is confirmed based on a comparison of the entropy values calculated for the animated GIF at issue to 0.1. If the entropy value calculated for any frame of the animated GIF at issue is less than 0.1, then this abnormal factor is deemed to exist. In other embodiments, one or more frames of the animated GIF file at issue may simply be “low” entropy relative to the other high entropy frames. For example, a variation of more than 4.9 between the highest entropy frame and the lowest entropy frame may indicate that the lowest entropy frame is low relative to the others within the animated GIF file at issue.
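For purposes of illustration only, the absolute and relative tests described above might be combined as in the following sketch. It assumes the per-frame entropy values have already been computed elsewhere; the function name `has_low_entropy_frame` and its parameter names are hypothetical, while the default values 0.1 and 4.9 are taken from the example above.

```python
def has_low_entropy_frame(frame_entropies, absolute_floor=0.1, max_spread=4.9):
    """Return True if an animated GIF exhibits the low-entropy-frame abnormal factor.

    frame_entropies: one precomputed entropy value per frame.
    """
    if not frame_entropies:
        return False
    lowest = min(frame_entropies)
    highest = max(frame_entropies)
    # Absolute test: any frame below the floor observed for normal GIF files.
    if lowest < absolute_floor:
        return True
    # Relative test: lowest frame is far below the highest frame.
    return (highest - lowest) > max_spread
```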
- Illegal chunk data within a Portable Network Graphics (PNG) file may be observed by evaluating information contained within and/or about the chunks. For example, the length of the chunk and cyclic redundancy checksum (CRC) may be verified against the actual data length and recomputed CRC.
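One possible way of carrying out the length and CRC verification described above is sketched here. The sketch assumes the standard PNG chunk layout (4-byte big-endian length, 4-byte type, data, 4-byte CRC computed over type and data); the function name `png_has_illegal_chunks` is hypothetical, and treating a missing IEND chunk as illegal is an assumption of the sketch.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_has_illegal_chunks(data: bytes) -> bool:
    """Return True if any chunk's declared length or CRC disagrees with the data."""
    if not data.startswith(PNG_SIGNATURE):
        return True
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        if pos + 8 > len(data):
            return True  # not enough bytes left for a length and type field
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 8 + length
        if end + 4 > len(data):
            return True  # declared length overruns the actual data
        (stored_crc,) = struct.unpack(">I", data[end:end + 4])
        # The CRC covers the chunk type and chunk data, not the length field.
        if zlib.crc32(data[pos + 4:end]) != stored_crc:
            return True
        if chunk_type == b"IEND":
            return False  # reached the end marker with every chunk verified
        pos = end + 4
    return True  # ran out of data without seeing IEND
```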
- Quantities of unsmoothed curves may be detected by evaluating the number of pixels for which the difference between their color and the average color of the surrounding pixels is greater than a threshold.
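As a minimal sketch of the pixel-versus-neighborhood comparison described above, the following counts such pixels in a single-channel image. Working on one intensity channel and using the up-to-eight adjacent pixels as the "surrounding pixels" are assumptions made for this sketch, and the function name `count_unsmoothed_pixels` is hypothetical.

```python
def count_unsmoothed_pixels(gray, threshold):
    """Count pixels differing from the mean of their neighbours by more than threshold.

    gray: list of rows, each a list of intensity values.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    count = 0
    for y in range(height):
        for x in range(width):
            # Collect the adjacent pixels, clipping at the image border.
            neighbours = [
                gray[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if not neighbours:
                continue  # a 1x1 image has no surrounding pixels
            average = sum(neighbours) / len(neighbours)
            if abs(gray[y][x] - average) > threshold:
                count += 1
    return count
```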
- Quantities of unsmoothed color blocks may be detected by evaluating the number of color blocks for which the difference between their color and the color of the surrounding color blocks is greater than a threshold. Color blocks contain pixels with the same or similar colors.
- In one embodiment, rather than simply conveying a binary result (e.g., a zero indicating the absence of the abnormal factor at issue and a one indicating the presence of the abnormal factor at issue), a value within a range (e.g., 0 to 10) may be returned indicating the degree to which the abnormal factor is expressed.
- At
block 540, the quantity of text embedded in the images is measured. In one embodiment, images are converted to a binary representation based on a thresholding technique described in further detail below. In general, thresholding is a simple method of image segmentation. Individual pixels in a grayscale image are marked as “information” pixels if their value is greater than some threshold value, T, (assuming the information content is brighter than the background) and as “background” pixels otherwise. Typically, an information pixel is given a value of “1” while a background pixel is given a value of “0.” Then, a text string measurement algorithm is applied to the binary representation of the portion of the image deemed to contain the information content. - Notably, in one embodiment, rather than considering the quantity of embedded text alone, both the quantity of text and the relative position of such text within an email viewer's preview window, for example, or within the image itself may be taken into consideration. For example, a high spam score could be assigned to a very large image (with a correspondingly smaller percentage of text) if the text is positioned to occupy the whole preview window.
- At
block 550, the email message is classified as spam or clean based on the observed characteristics of the embedded image(s), such as image location information, the existence/non-existence of various abnormal factors and the quantity of text determined to exist within the embedded image(s). In one embodiment, the spam/clean classification may be based upon a weighted average of the various observed characteristics. - In one embodiment, each observed characteristic may contribute to the score. Once the score reaches a threshold, the email message may be classified as spam and the further characteristics may not require analysis or observation. The email message is classified as clean if the score is less than the threshold after all the characteristics have been evaluated. In one embodiment, the characteristics may be considered in the following order:
-
- Image location information
- Presence of continuous images
- Presence of illegal base64 encoding
- Presence of lower entropy frames in an animated GIF
- Presence of illegal chunk data of a PNG encoded image
- Quantities and/or location of text in the images
- Quantities of unsmoothed curves in the images
- Quantities of unsmoothed color blocks in the images
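The ordered, early-terminating evaluation described above might be sketched as follows. The representation of each characteristic as a (weight, callable) pair, the function name `classify_message`, and the returned tuple are all assumptions made for illustration; the weights and threshold shown in the usage are arbitrary.

```python
def classify_message(checks, threshold):
    """Accumulate a score over ordered checks, stopping once the threshold is reached.

    checks: ordered list of (weight, check) pairs, where check() returns the
            degree (e.g., 0.0 to 1.0) to which a characteristic is observed.
    Returns (verdict, score, number_of_checks_evaluated).
    """
    score = 0.0
    evaluated = 0
    for weight, check in checks:
        score += weight * check()
        evaluated += 1
        if score >= threshold:
            # Remaining, possibly more expensive, checks are skipped.
            return "spam", score, evaluated
    return "clean", score, evaluated


# Usage: cheap checks are listed first so later ones run only when needed.
verdict, score, evaluated = classify_message(
    [(3, lambda: 1.0), (4, lambda: 1.0), (5, lambda: 1.0)], threshold=6
)
```

Ordering inexpensive characteristics first means the costlier image measurements are reached only when the earlier ones are inconclusive.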
- In one embodiment, similar to that described above with reference to abnormal factors, rather than making the ultimate spam/clean decision (because the ultimate decision could be made by another component), a “spaminess” score may be generated. For example, rather than simply conveying a binary result (e.g., spam vs. clean), a value within a range (e.g., 0 to 10) may be returned indicating the degree to which the email message appeared to contain indications of being spam or the likelihood the email message is spam.
- If upon completion of the anti-spam processing described above there is not sufficient data to determine the email message is spam (e.g., there is insufficient data to determine the sender's intention), then according to one embodiment, more CPU intensive processes, such as OCR, may be invoked. Advantageously, in this manner, most image spam emails can be detected in real-time without compromising performance and more CPU intensive processes are only performed if and when required.
-
FIG. 6 is a flow diagram illustrating quantity of text measurement processing in accordance with an embodiment of the present invention. The steps described below represent the processing of block 540 of FIG. 5 according to one embodiment of the present invention. - As mentioned with reference to
FIG. 5 , depending upon the particular implementation, the various process and decision blocks described below may be performed by hardware components, embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps, or the steps may be performed by a combination of hardware, software, firmware and/or involvement of human participation/interaction. - At
block 610, if the image at issue is color, then it is converted to grayscale to form a grayscale representation, Gi,j. According to one embodiment, color pixels of the image at issue are converted to grayscale by computing an average or weighted average of the red, green and blue color components. While various conversions may be used, examples of suitable conversion equations include the following: -
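$G_{i,j} = (0.299\,r_{i,j} + 0.587\,g_{i,j} + 0.114\,b_{i,j})/3, \quad 0 \le i < x_{\max},\ 0 \le j < y_{\max}$ (EQ #1) -
$G_{i,j} = (0.3\,r_{i,j} + 0.6\,g_{i,j} + 0.1\,b_{i,j})/3, \quad 0 \le i < x_{\max},\ 0 \le j < y_{\max}$ (EQ #2) -
$G_{i,j} = (r_{i,j} + g_{i,j} + b_{i,j})/3, \quad 0 \le i < x_{\max},\ 0 \le j < y_{\max}$ (EQ #3) - At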
block 620, entropy and threshold values are determined for the grayscale image, Gi,j. Entropy is a statistical measure of randomness that can be used to characterize the texture of the input image. In connection with calculating the entropy of the grayscale image, an intermediate data structure is built containing an intensity histogram, Cg. In the context of an 8-bit grayscale image, each pixel may have a value of 0 to 255. Thus, the intensity histogram includes 256 bins, each of which maintains a count of the number of pixels in the grayscale image having that value. An example of an intensity histogram is shown in FIG. 9, which represents an intensity histogram for a grayscale representation of FIG. 8. In one embodiment entropy is calculated according to: -
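For illustration only, the grayscale conversion of EQ #3, the 256-bin intensity histogram and an entropy value computed from that histogram might be sketched as follows. The entropy formula used here is standard Shannon entropy with a natural logarithm, which is an assumption made for this sketch and may differ from the equations of a given embodiment; the function names are likewise hypothetical.

```python
import math


def to_grayscale(rgb_rows):
    """Simple average of the r, g and b components (EQ #3), as integers 0..255."""
    return [[(r + g + b) // 3 for (r, g, b) in row] for row in rgb_rows]


def intensity_histogram(gray_rows):
    """Count how many pixels take each of the 256 possible gray levels."""
    hist = [0] * 256
    for row in gray_rows:
        for value in row:
            hist[value] += 1
    return hist


def histogram_entropy(hist):
    """Shannon entropy of the gray-level distribution (assumed formula, natural log)."""
    total = sum(hist)
    if total == 0:
        return 0.0
    probabilities = [count / total for count in hist if count]
    return -sum(p * math.log(p) for p in probabilities)
```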
- According to one embodiment, a threshold value within the intensity histogram is selected simply by choosing the mean or median value. The rationale for this simple threshold selection is that if the information pixels are brighter than the background, they should also be brighter than the average. However, to compensate for the existence of noise and variability in the background, a more sophisticated approach is to create a histogram of the image pixel intensities and then use the valley point as the threshold, T. This histogram approach assumes that there is some average value for the background and information pixels, but that the actual pixel values have some variation around these average values. In one embodiment, the threshold, T, is calculated by:
-
$T = \max(\delta_i), \quad 0 \le i \le 255$ (EQ #7) - Subject to:
 -
$\delta_i = w_{i1}\, w_{i2}\, (M_{i1} - M_{i2})^2, \quad 0 \le i \le 255$ (EQ #8) -
- According to the above example, the gray levels are divided into two groups by i, and wi1 and wi2 are the total amount of the pixels of each group while Mi1 and Mi2 are the average of the gray level of each group.
- Notably, there are many existing methods of performing thresholding. Consequently, any other current or future method of performing thresholding may be used depending upon the needs of a particular implementation.
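As one non-limiting sketch of the threshold selection of EQ #7 and EQ #8, the following scans every gray level i, splits the histogram into two groups at i, and keeps the level at which w_i1 · w_i2 · (M_i1 − M_i2)² is largest. Interpreting T as the gray level at which the maximum occurs, and placing level i itself in the second (brighter) group, are assumptions of this sketch; the function name `select_threshold` is hypothetical.

```python
def select_threshold(hist):
    """Return the gray level i maximising w1 * w2 * (M1 - M2) ** 2.

    hist: list of pixel counts per gray level (e.g., 256 bins).
    Group 1 holds levels below i; group 2 holds level i and above.
    """
    total = sum(hist)
    total_sum = sum(level * count for level, count in enumerate(hist))
    best_level, best_delta = 0, -1.0
    w1 = 0      # number of pixels in group 1
    sum1 = 0    # sum of gray levels over group 1
    for level, count in enumerate(hist):
        w2 = total - w1
        if w1 > 0 and w2 > 0:
            m1 = sum1 / w1                # average gray level of group 1
            m2 = (total_sum - sum1) / w2  # average gray level of group 2
            delta = w1 * w2 * (m1 - m2) ** 2
            if delta > best_delta:
                best_level, best_delta = level, delta
        # Move the current level into group 1 before testing the next split.
        w1 += count
        sum1 += level * count
    return best_level
```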
- At
block 630, thresholding is performed to form a binary representation, Bi,j, of the grayscale image based on the threshold value selected in block 620. In one embodiment, thresholding is performed in accordance with the following equations: -
- where, ∂ is an adjustable parameter.
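In its simplest form, and leaving aside the adjustable parameter, the thresholding step might be sketched as below: pixels at or above the threshold are marked as information pixels (1) and all others as background (0). The function name `binarize` is hypothetical, and the sketch does not model how the adjustable parameter enters the computation.

```python
def binarize(gray_rows, threshold):
    """Map each gray level to 1 (at or above threshold) or 0 (below threshold)."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray_rows]
```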
- At
block 640, the binary image is logically divided into M×N virtual blocks. - At
block 650, the M×N virtual blocks are analyzed to quantify the number of text strings. In one embodiment, the text strings in the binary image are quantified in accordance with the following equations: -
- where,
- $\partial_0 \ldots \partial_7$ are adjustable parameters;
- $T^{m,n}_{y_t,y_b}$ is the likelihood that the row between $y_t$ and $y_b$ in the block [m,n] represents text;
- $CB^{n}_{i}$ is the likelihood that the line[i] is a part of text;
- $B_{k,i}$ is the value of pixel[k,i] in the binary image.
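The logical division of the binary image into M×N virtual blocks (block 640) might be sketched as follows. The sketch covers only the partitioning; it does not attempt the text-likelihood computation of block 650. The function name `split_into_blocks`, the dictionary keyed by (m, n), and the use of integer division to place block boundaries are assumptions made for illustration.

```python
def split_into_blocks(binary_rows, m, n):
    """Divide a binary image into m rows by n columns of virtual blocks.

    Returns a dict mapping (block_row, block_col) to that block's pixel rows.
    """
    height = len(binary_rows)
    width = len(binary_rows[0]) if height else 0
    blocks = {}
    for block_row in range(m):
        y0 = block_row * height // m
        y1 = (block_row + 1) * height // m
        for block_col in range(n):
            x0 = block_col * width // n
            x1 = (block_col + 1) * width // n
            blocks[(block_row, block_col)] = [row[x0:x1] for row in binary_rows[y0:y1]]
    return blocks
```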
- Notably, while in the context of the equations presented above, a global thresholding approach is implemented taking into consideration the image as a whole, in alternative embodiments, various forms of local thresholding may be performed that consider groups of blocks or individual blocks to determine the best thresholding approach for such block or blocks.
- For sake of illustration, two concrete examples of application of the thresholding and text quantification described above will now be provided with reference to
FIG. 7 to FIG. 13. -
FIG. 7 is an example of an image spam email message 700 containing an embedded image 710. Typically, such image spam email messages also include text 720 in an attempt to defeat conventional heuristic filters. -
FIG. 8 is a grayscale image 810 based on the embedded image 710 of FIG. 7 according to one embodiment of the present invention. According to the flow diagram of FIG. 6, the first step (block 610) is to convert the embedded image 710 to a grayscale representation, Gi,j. Assuming embedded image 710 of FIG. 7 is a color image having red (r), green (g) and blue (b) color components, after application of one of equations EQ #1, EQ #2, EQ #3 or the like, the grayscale representation, Gi,j, appears as grayscale image 810. -
FIG. 9 is an intensity histogram 900 for the grayscale image 810 of FIG. 8 according to one embodiment of the present invention. According to the flow diagram of FIG. 6, the next step (block 620) is to build an intensity histogram data structure, Cg, and determine a threshold value for the grayscale image 810. After application of one or more of equations EQ #4, EQ #5, EQ #6, EQ #7, EQ #8, EQ #9, EQ #10, EQ #11, EQ #12 or the like to the grayscale representation, Gi,j, (grayscale image 810), an intensity histogram data structure, Cg, results, which appears as intensity histogram 900 when displayed in graphical form. Assuming 256 possible gray levels are represented in grayscale image 810, the intensity histogram 900 graphically illustrates the number of pixel occurrences in grayscale image 810 for each gray level. - Application of the above-referenced equations also results in a threshold value, T, 910, being calculated for
grayscale image 810. According to this example, the threshold value 910 is 109. -
FIG. 10 is a binary image 1010 resulting from thresholding the grayscale image 810 of FIG. 8 in accordance with an embodiment of the present invention. According to the flow diagram of FIG. 6, the next step (block 630) is to binarize the image by performing thresholding with the calculated threshold value. Application of one or both of equations EQ #13 and EQ #14 or the like causes the binary representation, Bi,j, to contain a zero for each pixel in which the grayscale representation, Gi,j, is less than the calculated threshold value, T, and to contain a one for each pixel in which the grayscale representation, Gi,j, is greater than or equal to the calculated threshold value, T. For purposes of illustration, the result of graphically depicting the binary representation, Bi,j, in which pixels having a value of one are presented as black and pixels having a value of zero are presented as white, is shown as binary image 1010. As can be seen with reference to FIG. 10, the information content intended to be conveyed to the email recipient, i.e., the various text strings, is now clearly distinguishable from the background. -
FIG. 11 illustrates an exemplary segmentation of the binary image of FIG. 10 into 28 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention. According to the flow diagram of FIG. 6, the next steps (blocks 640 and 650) are to logically divide the binary image 1010 into virtual blocks and then separately analyze each block to measure perceived text content. Application of one or more of equations EQ #15, EQ #16, EQ #17, EQ #18 or the like causes the text string count, T, to contain the sum of all blocks determined to contain a text string.
In the present example, segmented binary image 1110 contains 28 virtual blocks, examples of which are pointed out with reference numerals 1120 and 1130. According to EQ #16, EQ #17 and EQ #18, 23 of the 28 blocks contain a total of 63 text strings. Text strings detected by the algorithm are underlined. Block 1120 is an example of a block that has been determined to contain one or more text strings, i.e., the word "TRADE" 1121. Block 1130 is an example of a block that has been determined not to contain a text string.
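The block segmentation and counting steps can be sketched as follows. Equations EQ #15 through EQ #18 are not reproduced in this section, so the per-block test here is a placeholder assumption: a block is treated as containing text when its fraction of one-valued pixels falls within a configurable range. The names `split_into_blocks`, `looks_like_text` and `count_text_blocks`, and the range limits, are hypothetical.

```python
from typing import List, Sequence

Block = List[List[int]]


def split_into_blocks(binary: Sequence[Sequence[int]],
                      block_rows: int, block_cols: int) -> List[Block]:
    """Logically divide a binary image into block_rows x block_cols blocks."""
    height, width = len(binary), len(binary[0])
    blocks = []
    for r in range(block_rows):
        top, bottom = r * height // block_rows, (r + 1) * height // block_rows
        for c in range(block_cols):
            left, right = c * width // block_cols, (c + 1) * width // block_cols
            blocks.append([list(row[left:right]) for row in binary[top:bottom]])
    return blocks


def looks_like_text(block: Block, low: float = 0.05, high: float = 0.6) -> bool:
    """Placeholder text test (an assumption, not EQ #16-#18):
    the share of one-valued pixels must lie between low and high."""
    pixels = [p for row in block for p in row]
    if not pixels:
        return False
    density = sum(pixels) / len(pixels)
    return low <= density <= high


def count_text_blocks(binary: Sequence[Sequence[int]],
                      block_rows: int, block_cols: int) -> int:
    """Sum the blocks judged to contain text."""
    blocks = split_into_blocks(binary, block_rows, block_cols)
    return sum(1 for block in blocks if looks_like_text(block))
```

A 28-block layout such as the one in FIG. 11 could be requested as, for instance, `count_text_blocks(binary, 4, 7)`; the exact grid shape used in the figure is not stated here, so 4 by 7 is only one possibility.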
FIG. 12 is a grayscale image 1210 based on another exemplary embedded image observed in connection with image spam.
FIG. 13 illustrates an exemplary segmentation into 56 virtual blocks of a binarized image 1310 corresponding to the grayscale image 1210 of FIG. 12 and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention. In the present example, segmented binary image 1310 contains 56 virtual blocks, examples of which are pointed out with reference numerals 1320 and 1330. According to EQ #16, EQ #17 and EQ #18, 26 of the 56 blocks contain a total of 40 text strings. Text strings detected by the algorithm are underlined. Block 1320 is an example of a block that has been determined to contain one or more text strings, i.e., the group of letters "ebtEras". Block 1330 is an example of a block that has been determined not to contain a text string.
Notably, to the extent reverse video, i.e., the presentation of light colored (e.g., white) text on a dark (e.g., black) background, becomes problematic (see, e.g., the "LEARN MORE" text string embedded within FIG. 13), one approach to detect such text strings would be to apply a local thresholding approach using EQ #14, which would effectively reverse the black and white pixels for the blocks at issue.
While embodiments of the invention have been illustrated and described, it will be clear that the invention is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the invention, as described in the claims.
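The reverse-video handling mentioned in connection with FIG. 13 can be illustrated with a small per-block sketch. EQ #14 is not reproduced in this section, so the rule below is an assumption made for illustration: when more than half of a block's pixels are one-valued, the block is taken to be reverse video and its pixels are inverted. The function name `normalize_polarity` and the 0.5 cutoff are hypothetical.

```python
from typing import List, Sequence


def normalize_polarity(block: Sequence[Sequence[int]],
                       max_fraction: float = 0.5) -> List[List[int]]:
    """Invert a binary block whose one-valued pixels dominate.

    Assumed heuristic (not EQ #14): text normally covers a minority of a
    block, so a majority of ones suggests light text on a dark background.
    """
    pixels = [p for row in block for p in row]
    if pixels and sum(pixels) / len(pixels) > max_fraction:
        return [[1 - p for p in row] for row in block]
    return [list(row) for row in block]
```

Applying this to each virtual block before the text test keeps the polarity decision local, so a dark banner in one corner does not affect how the rest of the image is interpreted.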
Claims (19)
1. A method comprising:
measuring or estimating one or more of the quantity and position of text within an image associated with an electronic message; and
estimating the likelihood that the electronic message is spam based at least in part on results of the measuring or estimating.
2. The method of claim 1, wherein the electronic message comprises an electronic mail (email) message.
3. The method of claim 1, wherein the image is divided up into a plurality of blocks and image processing is applied to each of the plurality of blocks.
4. The method of claim 3, wherein the image processing includes local thresholding.
5. The method of claim 3, wherein the image processing includes global thresholding.
6. The method of claim 1, wherein filtering is applied to the image to remove noise deliberately added by an originator of the electronic message.
7. The method of claim 3, wherein the image processing comprises converting the image or one or more of the plurality of blocks to grayscale.
8. The method of claim 3, further comprising determining which colors or intensities are likely to represent text within the image or within one or more of the plurality of blocks by calculating an intensity histogram for the image or for the one or more of the plurality of blocks.
9. The method of claim 3, wherein the quantity of text is measured or estimated by summing the number of blocks within a portion of the image visible in a preview pane of an email client.
10-27. (canceled)
28. A computer-readable medium having stored thereon instructions, which when executed by one or more processors cause the one or more processors to perform a method comprising:
measuring or estimating one or more of the quantity and position of text within an image associated with an electronic message; and
estimating the likelihood that the electronic message is spam based at least in part on results of the measuring or estimating.
29. The computer-readable medium of claim 28, wherein the electronic message comprises an electronic mail (email) message.
30. The computer-readable medium of claim 28, wherein the image is divided up into a plurality of blocks and image processing is applied to each of the plurality of blocks.
31. The computer-readable medium of claim 30, wherein the image processing includes local thresholding.
32. The computer-readable medium of claim 30, wherein the image processing includes global thresholding.
33. The computer-readable medium of claim 28, wherein filtering is applied to the image to remove noise deliberately added by an originator of the electronic message.
34. The computer-readable medium of claim 30, wherein the image processing comprises converting the image or one or more of the plurality of blocks to grayscale.
35. The computer-readable medium of claim 30, further comprising determining which colors or intensities are likely to represent text within the image or within one or more of the plurality of blocks by calculating an intensity histogram for the image or for the one or more of the plurality of blocks.
36. The computer-readable medium of claim 30, wherein the quantity of text is measured or estimated by summing the number of blocks within a portion of the image visible in a preview pane of an email client.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/932,589 US20090113003A1 (en) | 2007-10-31 | 2007-10-31 | Image spam filtering based on senders' intention analysis |
US12/114,815 US8180837B2 (en) | 2007-10-31 | 2008-05-04 | Image spam filtering based on senders' intention analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/932,589 US20090113003A1 (en) | 2007-10-31 | 2007-10-31 | Image spam filtering based on senders' intention analysis |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/114,815 Continuation US8180837B2 (en) | 2007-10-31 | 2008-05-04 | Image spam filtering based on senders' intention analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
US20090113003A1 (en) | 2009-04-30 |
Family
ID=40582891
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/932,589 Abandoned US20090113003A1 (en) | 2007-10-31 | 2007-10-31 | Image spam filtering based on senders' intention analysis |
US12/114,815 Active 2029-05-18 US8180837B2 (en) | 2007-10-31 | 2008-05-04 | Image spam filtering based on senders' intention analysis |
Country Status (1)
Country | Link |
---|---|
US (2) | US20090113003A1 (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090043853A1 (en) * | 2007-08-06 | 2009-02-12 | Yahoo! Inc. | Employing pixel density to detect a spam image |
US20090077617A1 (en) * | 2007-09-13 | 2009-03-19 | Levow Zachary S | Automated generation of spam-detection rules using optical character recognition and identifications of common features |
US20090110233A1 (en) * | 2007-10-31 | 2009-04-30 | Fortinet, Inc. | Image spam filtering based on senders' intention analysis |
US20090150419A1 (en) * | 2007-12-10 | 2009-06-11 | Won Ho Kim | Apparatus and method for removing malicious code inserted into file |
US20120150959A1 (en) * | 2010-12-14 | 2012-06-14 | Electronics And Telecommunications Research Institute | Spam countering method and apparatus |
US8290311B1 (en) * | 2007-01-11 | 2012-10-16 | Proofpoint, Inc. | Apparatus and method for detecting images within spam |
US8290203B1 (en) * | 2007-01-11 | 2012-10-16 | Proofpoint, Inc. | Apparatus and method for detecting images within spam |
US8396876B2 (en) | 2010-11-30 | 2013-03-12 | Yahoo! Inc. | Identifying reliable and authoritative sources of multimedia content |
US20140074942A1 (en) * | 2012-09-10 | 2014-03-13 | International Business Machines Corporation | Identifying a webpage from which an e-mail address is obtained |
US20150100648A1 (en) * | 2013-10-03 | 2015-04-09 | Yandex Europe Ag | Method of and system for processing an e-mail message to determine a categorization thereof |
US20160321255A1 (en) * | 2015-04-28 | 2016-11-03 | International Business Machines Corporation | Unsolicited bulk email detection using url tree hashes |
TWI569608B (en) * | 2015-10-08 | 2017-02-01 | 網擎資訊軟體股份有限公司 | A computer program product and e-mail transmission method thereof for e-mail transmission in monitored network environment |
US20170126601A1 (en) * | 2008-12-31 | 2017-05-04 | Dell Software Inc. | Image based spam blocking |
US20170289083A1 (en) * | 2010-06-09 | 2017-10-05 | Quest Software Inc. | Net- based email filtering |
US10361989B2 (en) * | 2016-10-06 | 2019-07-23 | International Business Machines Corporation | Visibility management enhancement for messaging systems and online social networks |
US10469510B2 (en) * | 2014-01-31 | 2019-11-05 | Juniper Networks, Inc. | Intermediate responses for non-html downloads |
US11164156B1 (en) * | 2021-04-30 | 2021-11-02 | Oracle International Corporation | Email message receiving system in a cloud infrastructure |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7299261B1 (en) | 2003-02-20 | 2007-11-20 | Mailfrontier, Inc. A Wholly Owned Subsidiary Of Sonicwall, Inc. | Message classification using a summary |
US7406502B1 (en) | 2003-02-20 | 2008-07-29 | Sonicwall, Inc. | Method and system for classifying a message based on canonical equivalent of acceptable items included in the message |
US8266215B2 (en) * | 2003-02-20 | 2012-09-11 | Sonicwall, Inc. | Using distinguishing properties to classify messages |
US20090245635A1 (en) * | 2008-03-26 | 2009-10-01 | Yehezkel Erez | System and method for spam detection in image data |
US8731284B2 (en) * | 2008-12-19 | 2014-05-20 | Yahoo! Inc. | Method and system for detecting image spam |
US20130156288A1 (en) * | 2011-12-19 | 2013-06-20 | De La Rue North America Inc. | Systems And Methods For Locating Characters On A Document |
US9047293B2 (en) * | 2012-07-25 | 2015-06-02 | Aviv Grafi | Computer file format conversion for neutralization of attacks |
CN103020646A (en) * | 2013-01-06 | 2013-04-03 | 深圳市彩讯科技有限公司 | Incremental training supported spam image identifying method and incremental training supported spam image identifying system |
CN103944810B (en) * | 2014-05-06 | 2017-02-15 | 厦门大学 | Spam e-mail intention recognition system |
CN106547852B (en) * | 2016-10-19 | 2021-03-12 | 腾讯科技(深圳)有限公司 | Abnormal data detection method and device, and data preprocessing method and system |
US9858424B1 (en) | 2017-01-05 | 2018-01-02 | Votiro Cybersec Ltd. | System and method for protecting systems from active content |
US10013557B1 (en) | 2017-01-05 | 2018-07-03 | Votiro Cybersec Ltd. | System and method for disarming malicious code |
US10331889B2 (en) | 2017-01-05 | 2019-06-25 | Votiro Cybersec Ltd. | Providing a fastlane for disarming malicious content in received input content |
US10331890B2 (en) | 2017-03-20 | 2019-06-25 | Votiro Cybersec Ltd. | Disarming malware in protected content |
US10621554B2 (en) | 2017-11-29 | 2020-04-14 | International Business Machines Corporation | Image representation of e-mails |
CN110086706B (en) * | 2019-04-24 | 2022-05-27 | 北京众纳鑫海网络技术有限公司 | Method and system for joining a device-specific message group |
CN110866543B (en) * | 2019-10-18 | 2022-07-15 | 支付宝(杭州)信息技术有限公司 | Picture detection and picture classification model training method and device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6738496B1 (en) * | 1999-11-01 | 2004-05-18 | Lockheed Martin Corporation | Real time binarization of gray images |
US6772196B1 (en) * | 2000-07-27 | 2004-08-03 | Propel Software Corp. | Electronic mail filtering system and methods |
US20040221062A1 (en) * | 2003-05-02 | 2004-11-04 | Starbuck Bryan T. | Message rendering for identification of content features |
US20050204005A1 (en) * | 2004-03-12 | 2005-09-15 | Purcell Sean E. | Selective treatment of messages based on junk rating |
US20060123083A1 (en) * | 2004-12-03 | 2006-06-08 | Xerox Corporation | Adaptive spam message detector |
US20070168436A1 (en) * | 2006-01-19 | 2007-07-19 | Worldvuer, Inc. | System and method for supplying electronic messages |
US20080091765A1 (en) * | 2006-10-12 | 2008-04-17 | Simon David Hedley Gammage | Method and system for detecting undesired email containing image-based messages |
US20090110233A1 (en) * | 2007-10-31 | 2009-04-30 | Fortinet, Inc. | Image spam filtering based on senders' intention analysis |
US7882177B2 (en) * | 2007-08-06 | 2011-02-01 | Yahoo! Inc. | Employing pixel density to detect a spam image |
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10095922B2 (en) * | 2007-01-11 | 2018-10-09 | Proofpoint, Inc. | Apparatus and method for detecting images within spam |
US8290203B1 (en) * | 2007-01-11 | 2012-10-16 | Proofpoint, Inc. | Apparatus and method for detecting images within spam |
US20130039582A1 (en) * | 2007-01-11 | 2013-02-14 | John Gardiner Myers | Apparatus and method for detecting images within spam |
US8290311B1 (en) * | 2007-01-11 | 2012-10-16 | Proofpoint, Inc. | Apparatus and method for detecting images within spam |
US7882177B2 (en) * | 2007-08-06 | 2011-02-01 | Yahoo! Inc. | Employing pixel density to detect a spam image |
US20110078269A1 (en) * | 2007-08-06 | 2011-03-31 | Yahoo! Inc. | Employing pixel density to detect a spam image |
US20090043853A1 (en) * | 2007-08-06 | 2009-02-12 | Yahoo! Inc. | Employing pixel density to detect a spam image |
US8301719B2 (en) | 2007-08-06 | 2012-10-30 | Yahoo! Inc. | Employing pixel density to detect a spam image |
US20090077617A1 (en) * | 2007-09-13 | 2009-03-19 | Levow Zachary S | Automated generation of spam-detection rules using optical character recognition and identifications of common features |
US20090110233A1 (en) * | 2007-10-31 | 2009-04-30 | Fortinet, Inc. | Image spam filtering based on senders' intention analysis |
US8180837B2 (en) | 2007-10-31 | 2012-05-15 | Fortinet, Inc. | Image spam filtering based on senders' intention analysis |
US8590016B2 (en) * | 2007-12-10 | 2013-11-19 | Electronics And Telecommunications Research Institute | Apparatus and method for removing malicious code inserted into file |
US20090150419A1 (en) * | 2007-12-10 | 2009-06-11 | Won Ho Kim | Apparatus and method for removing malicious code inserted into file |
US10204157B2 (en) * | 2008-12-31 | 2019-02-12 | Sonicwall Inc. | Image based spam blocking |
US20170126601A1 (en) * | 2008-12-31 | 2017-05-04 | Dell Software Inc. | Image based spam blocking |
US10419378B2 (en) * | 2010-06-09 | 2019-09-17 | Sonicwall Inc. | Net-based email filtering |
US20170289083A1 (en) * | 2010-06-09 | 2017-10-05 | Quest Software Inc. | Net- based email filtering |
US8396876B2 (en) | 2010-11-30 | 2013-03-12 | Yahoo! Inc. | Identifying reliable and authoritative sources of multimedia content |
US20120150959A1 (en) * | 2010-12-14 | 2012-06-14 | Electronics And Telecommunications Research Institute | Spam countering method and apparatus |
US9076130B2 (en) * | 2012-09-10 | 2015-07-07 | International Business Machines Corporation | Identifying a webpage from which an E-mail address is obtained |
US20140074942A1 (en) * | 2012-09-10 | 2014-03-13 | International Business Machines Corporation | Identifying a webpage from which an e-mail address is obtained |
US9794208B2 (en) | 2013-10-03 | 2017-10-17 | Yandex Europe Ag | Method of and system for constructing a listing of e-mail messages |
US9525654B2 (en) | 2013-10-03 | 2016-12-20 | Yandex Europe Ag | Method of and system for reformatting an e-mail message based on a categorization thereof |
US9749275B2 (en) | 2013-10-03 | 2017-08-29 | Yandex Europe Ag | Method of and system for constructing a listing of E-mail messages |
US9521101B2 (en) | 2013-10-03 | 2016-12-13 | Yandex Europe Ag | Method of and system for reformatting an e-mail message based on a categorization thereof |
US9521102B2 (en) | 2013-10-03 | 2016-12-13 | Yandex Europe Ag | Method of and system for constructing a listing of e-mail messages |
US9450903B2 (en) * | 2013-10-03 | 2016-09-20 | Yandex Europe Ag | Method of and system for processing an e-mail message to determine a categorization thereof |
US20150100648A1 (en) * | 2013-10-03 | 2015-04-09 | Yandex Europe Ag | Method of and system for processing an e-mail message to determine a categorization thereof |
US10469510B2 (en) * | 2014-01-31 | 2019-11-05 | Juniper Networks, Inc. | Intermediate responses for non-html downloads |
US10810176B2 (en) | 2015-04-28 | 2020-10-20 | International Business Machines Corporation | Unsolicited bulk email detection using URL tree hashes |
US20160321255A1 (en) * | 2015-04-28 | 2016-11-03 | International Business Machines Corporation | Unsolicited bulk email detection using url tree hashes |
US10706032B2 (en) * | 2015-04-28 | 2020-07-07 | International Business Machines Corporation | Unsolicited bulk email detection using URL tree hashes |
TWI569608B (en) * | 2015-10-08 | 2017-02-01 | 網擎資訊軟體股份有限公司 | A computer program product and e-mail transmission method thereof for e-mail transmission in monitored network environment |
US10826865B2 (en) | 2016-10-06 | 2020-11-03 | International Business Machines Corporation | Visibility management enhancement for messaging systems and online social networks |
US10361989B2 (en) * | 2016-10-06 | 2019-07-23 | International Business Machines Corporation | Visibility management enhancement for messaging systems and online social networks |
US11164156B1 (en) * | 2021-04-30 | 2021-11-02 | Oracle International Corporation | Email message receiving system in a cloud infrastructure |
US20220351143A1 (en) * | 2021-04-30 | 2022-11-03 | Oracle International Corporation | Email message receiving system in a cloud infrastructure |
US11544673B2 (en) * | 2021-04-30 | 2023-01-03 | Oracle International Corporation | Email message receiving system in a cloud infrastructure |
Also Published As
Publication number | Publication date |
---|---|
US8180837B2 (en) | 2012-05-15 |
US20090110233A1 (en) | 2009-04-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: FORTINET, INC., CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LU, JUN;CHENG, JIANDONG;REEL/FRAME:020419/0977; Effective date: 20080111 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |