US20090113003A1 - Image spam filtering based on senders' intention analysis - Google Patents


Info

Publication number
US20090113003A1
Authority
US
United States
Prior art keywords
image
email
spam
blocks
text
Prior art date
Legal status
Abandoned
Application number
US11/932,589
Inventor
Jun Lu
Jiandong Cheng
Current Assignee
Fortinet Inc
Original Assignee
Fortinet Inc
Priority date
Application filed by Fortinet Inc
Priority to US11/932,589 (US20090113003A1)
Assigned to FORTINET, INC.; Assignors: CHENG, JIANDONG; LU, JUN
Priority to US12/114,815 (US8180837B2)
Publication of US20090113003A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/10: Office automation; Time management
    • G06Q10/107: Computer-aided management of electronic mailing [e-mailing]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/413: Classification of content, e.g. text, photographs or tables

Definitions

  • Embodiments of the present invention generally relate to the field of spam filtering and anti-spam techniques.
  • various embodiments relate to image analysis and methods for combating spam in which spammers use images to carry the advertising text.
  • Image spam was originally created in order to get past heuristic filters, which block messages containing words and phrases commonly found in spam. Since image files have different formats than the text found in the message body of an electronic mail (email) message, conventional heuristic filters, which analyze such text, do not detect the content of the message, which may be partly or wholly conveyed by embedded text within the image. As a result, heuristic filters were easily defeated by image spam techniques.
  • In response, fuzzy signature technologies, which flag both known and similar messages as spam, were deployed by anti-spam vendors.
  • Such fuzzy signature technologies allowed message attachments to be targeted, thereby recognizing as spam messages with different content but the same attachment.
  • FIGS. 1A and 1B illustrate sample images and obfuscation techniques used by spammers to defeat OCR image spam detection techniques. As shown in FIGS. 1A and 1B , polygons, lines, random colors, jagged text, random dots, varying borders and the like may be inserted into image spam in an attempt to defeat signature detection techniques and obscure the embedded text from OCR techniques. There are an almost infinite number of ways that spammers can randomize images.
  • spammers have recently used techniques such as varying the colors used in an image, changing the width and/or pattern of the border, altering the font style, and slicing images into smaller pieces (which are then reassembled to appear as a single image to the recipient).
  • OCR is very computationally expensive.
  • fully rendering a message and then looking for word matches against different character set libraries may take as long as several seconds per message, which is typically unacceptable for many contexts.
  • Hence, there is a need for an anti-spam detection module that can detect image spam.
  • one or more of the quantity and position of text within an image associated with an electronic message are measured or estimated. Then, based at least in part on the results of the measuring or estimating, the likelihood that the electronic message is spam is determined.
  • an embedded image of an electronic mail (email) message is converted to a binarized representation by performing thresholding on a grayscale representation of the embedded image.
  • a quantity of text included in the embedded image is then measured by analyzing one or more blocks of the binarized representation.
  • the email message is classified as spam or clean based at least in part on the quantity of text measured.
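The grayscale-thresholding, text-measurement and classification steps described above can be sketched as follows. This is a hypothetical illustration only: the binarization threshold of 128, the 5% text-ratio cutoff and all function names are invented for the example and are not taken from the patent.

```python
# Hypothetical sketch of the binarize-and-measure flow; thresholds are
# illustrative assumptions, not values from the patent.

def binarize(gray, threshold=128):
    """Convert a grayscale image (2D list of 0-255 values) to a binary
    image: 1 for dark (likely ink/text) pixels, 0 for light ones."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def text_quantity(binary):
    """Estimate the quantity of text as the fraction of 'ink' pixels."""
    total = sum(len(row) for row in binary)
    ink = sum(sum(row) for row in binary)
    return ink / total if total else 0.0

def classify(gray, spam_ratio=0.05):
    """Classify as 'spam' when the measured text quantity exceeds a
    configurable threshold, and 'clean' otherwise."""
    return "spam" if text_quantity(binarize(gray)) > spam_ratio else "clean"

# A mostly-white image with one dark band of 'text' (12.5% ink):
image = [[255] * 10 for _ in range(8)]
image[3] = [0] * 10
print(classify(image))  # -> spam
```

A production implementation would of course measure text-like structure rather than raw ink density, but the control flow (grayscale, threshold, measure, compare) is the same.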
  • the embedded image may be formatted in accordance with the Graphic Interchange Format (GIF), Joint Photographic Experts Group (JPEG) or Portable Network Graphics (PNG) formats/standards.
  • GIF Graphic Interchange Format
  • JPEG Joint Photographic Experts Group
  • PNG Portable Network Graphics
  • the embedded image may be an image contained within a file attached to the email message.
  • the method also includes determining an approximate display location of an embedded image within the email message and identifying the existence of one or more abnormal factors associated with the embedded image. The classification can then be based upon the approximate display location and the existence of one or more abnormal factors, as well as the quantity of text measured.
  • FIGS. 1A and 1B illustrate sample images and obfuscation techniques used by spammers to defeat OCR image spam detection techniques.
  • FIG. 2 is a block diagram conceptually illustrating a simplified network architecture in which embodiments of the present invention may be employed.
  • FIG. 3 is a block diagram conceptually illustrating interaction among various functional units of an email security system with a client workstation and an email server in accordance with an embodiment of the present invention.
  • FIG. 4 is an example of a computer system with which embodiments of the present invention may be utilized.
  • FIG. 5 is a high-level flow diagram illustrating anti-spam processing of images using sender's intention analysis in accordance with an embodiment of the present invention.
  • FIG. 6 is a flow diagram illustrating quantity of text measurement processing in accordance with an embodiment of the present invention.
  • FIG. 7 is an example of an image spam email message containing an embedded image.
  • FIG. 8 is a grayscale image based on the embedded image of FIG. 7 according to one embodiment of the present invention.
  • FIG. 9 is an intensity histogram for the grayscale image of FIG. 8 according to one embodiment of the present invention.
  • FIG. 10 is a binary image resulting from thresholding the grayscale image of FIG. 8 in accordance with an embodiment of the present invention.
  • FIG. 11 illustrates an exemplary segmentation of the binary image of FIG. 10 into 28 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
  • FIG. 12 is a grayscale image based on another exemplary embedded image observed in connection with image spam.
  • FIG. 13 illustrates an exemplary segmentation into 56 virtual blocks a binarized image corresponding to the image of FIG. 12 and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
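The virtual-block segmentation described for FIGS. 11 and 13 might be sketched as below. The 4x7 grid (which happens to yield 28 blocks, as in FIG. 11) and the ink-density heuristic for flagging text-bearing blocks are illustrative assumptions, not the patent's exact procedure.

```python
# Illustrative segmentation of a binary image into virtual blocks,
# flagging blocks whose ink density suggests embedded text.

def split_blocks(binary, rows, cols):
    """Split a binary image (2D list of 0/1) into rows x cols blocks."""
    h, w = len(binary), len(binary[0])
    bh, bw = h // rows, w // cols
    blocks = []
    for r in range(rows):
        for c in range(cols):
            blocks.append([row[c * bw:(c + 1) * bw]
                           for row in binary[r * bh:(r + 1) * bh]])
    return blocks

def text_blocks(binary, rows=4, cols=7, min_ink=0.1):
    """Count blocks whose fraction of 'ink' pixels meets min_ink."""
    count = 0
    for block in split_blocks(binary, rows, cols):
        area = sum(len(row) for row in block)
        if area and sum(sum(row) for row in block) / area >= min_ink:
            count += 1
    return count

# An 8x14 binary image with one inked 2x2 region -> one text-bearing block.
binary = [[0] * 14 for _ in range(8)]
for r in (0, 1):
    for c in (0, 1):
        binary[r][c] = 1
print(text_blocks(binary))  # -> 1
```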
  • images attached to or embedded within email messages are analyzed to determine the senders' intention.
  • Empirical analysis reveals legitimate emails may contain embedded images, but valid images sent through email rarely contain a substantial quantity of text.
  • the senders of such email messages do not painstakingly adjust the location of such included images to assure such images appear in the preview window of an email client.
  • legitimate senders do not intentionally inject noise into the embedded images.
  • spammers usually compose email messages in different ways. For example, in the context of image spam, spammers insert text into images to avoid filtering by traditional text filters and employ techniques to randomize images and/or obscure text embedded within images.
  • various indicators of image spam include, but are not limited to: inclusion of one or more images in the front part of an email message; inclusion of one or more images containing a quantity of text meeting a certain threshold; and/or inclusion of one or more images into which noise appears to have been injected to obfuscate embedded text.
  • various image analysis techniques are employed to more accurately detect image spam based on senders' intention analysis.
  • the goal of senders' intention analysis is to discover the email message sender's intent by examining various characteristics of the email message and the embedded or attached images. If, for example, image analysis suggests that one or more images associated with an email message have had obfuscation techniques applied, that the message is composed to draw attention to those images, and/or that the images include suspicious quantities of text, then the sender's intention analysis anti-spam processing may flag the email message at issue as spam.
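One hypothetical way to combine such indicators into a sender's-intention decision is a weighted score. The weights, the 5% text ratio and the decision threshold below are invented for illustration and do not appear in the patent.

```python
# Hedged sketch: combine image-spam indicators into a single score.
# All weights and thresholds here are illustrative assumptions.

def intention_score(image_in_front, text_ratio, abnormal_factors):
    """Combine indicators: image position, measured text quantity, and
    the number of abnormal (noise-injection) factors detected."""
    score = 0.0
    if image_in_front:           # image placed to show in the preview pane
        score += 1.0
    if text_ratio > 0.05:        # suspicious quantity of embedded text
        score += 2.0
    score += 0.5 * abnormal_factors  # each obfuscation artifact adds weight
    return score

def is_image_spam(image_in_front, text_ratio, abnormal_factors,
                  threshold=2.5):
    return intention_score(image_in_front, text_ratio,
                           abnormal_factors) >= threshold

print(is_image_spam(True, 0.12, 2))   # -> True  (score 4.0)
print(is_image_spam(False, 0.01, 0))  # -> False (score 0.0)
```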
  • the image scanning spam detection method is based on a combination of email header analysis, email body analysis and image processing on image attachments.
  • While the anti-spam detection module and image scanning methodologies are discussed in the context of an email security system, they are equally applicable to network gateways, email appliances, client workstations, servers and other virtual or physical network devices or appliances that may be logically interposed between client workstations and servers, such as firewalls, network security appliances, email security appliances, virtual private network (VPN) gateways, switches, bridges, routers and like devices through which email messages flow.
  • the anti-spam techniques described herein are equally applicable to instant messages, Multimedia Message Service (MMS) messages and other forms of electronic communication in the event that such messages become vulnerable to image spam in the future.
  • While, for sake of illustration, embodiments are described with reference to the Graphic Interchange Format (GIF), Joint Photographic Experts Group (JPEG) and Portable Network Graphics (PNG) formats, the present invention is not so limited and is equally applicable to various other current and future graphic/image file formats, including, but not limited to, Progressive Graphics File (PGF), Tagged Image File Format (TIFF), bit mapped format (BMP), HDP, WDP, XPM, MacOS-PICT, Irix-RGB, Multiresolution Seamless Image Database (MrSID), RAW formats used by various digital cameras, various vector formats, such as Scalable Vector Graphics (SVG), as well as other file formats of attachments which may themselves contain embedded images, such as Portable Document Format (PDF), Encapsulated PostScript, SWF, Windows Metafile, AutoCAD DXF and CorelDraw CDR.
  • Embodiments of the present invention include various steps, which will be described below.
  • the steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps.
  • the steps may be performed by a combination of hardware, software, firmware and/or by human operators.
  • Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable medium having stored thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process.
  • the machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, ROMs, random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions.
  • embodiments of the present invention may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
  • client generally refers to an application, program, process or device in a client/server relationship that requests information or services from another program, process or device (a server) on a network.
  • client and server are relative since an application may be a client to one application but a server to another.
  • client also encompasses software that makes the connection between a requesting application, program, process or device to a server possible, such as an FTP client.
  • connection or coupling and related terms are used in an operational sense and are not necessarily limited to a direct connection or coupling.
  • two devices may be coupled directly, or via one or more intermediary media or devices.
  • devices may be coupled in such a way that information can be passed therebetween, while not sharing any physical connection with one another.
  • connection or coupling exists in accordance with the aforementioned definition.
  • embedded image generally refers to an image that is displayed or rendered inline within a styled or formatted electronic message, such as a HyperText Markup Language (HTML)-based or formatted email message.
  • embedded image is intended to encompass scenarios in which the image data is sent with the email message and linked images in which a reference to the image is sent with the email message and the image data is retrieved once the recipient views the email message.
  • embedded image also includes an image that is embedded in other file formats of attachments, such as Portable Document Format (PDF) attachments, in which the image data is displayed to the email recipient when the attachment is viewed.
  • image spam generally refers to spam in which the “call to action” of the message is partially or completely contained within an embedded file attachment, such as a .gif, .jpeg or .pdf file, rather than in the body of the email message.
  • Such images are typically automatically displayed to the email recipients and typically some form of obfuscation is implemented in an attempt to hide the true content of the image from spam filters.
  • network gateway generally refers to an internetworking system, a system that joins two networks together.
  • a “network gateway” can be implemented completely in software, completely in hardware, or as a combination of the two.
  • network gateways can operate at any level of the OSI model from application protocols to low-level signaling.
  • proxy generally refers to an intermediary device, program or agent, which acts as both a server and a client for the purpose of making or forwarding requests on behalf of other clients.
  • responsive includes completely or partially responsive.
  • server generally refers to an application, program, process or device in a client/server relationship that responds to requests for information or services from another program, process or device (a client) on a network.
  • server also encompasses software that makes the act of serving information or providing services possible.
  • spam generally refers to electronic junk mail, typically bulk electronic mail (email) messages in the form of commercial advertising. Often, email message content may be irrelevant in determining whether an email message is spam, though most spam is commercial in nature. There is spam that fraudulently promotes penny stocks in the classic pump-and-dump scheme. There is spam that promotes religious beliefs. From the recipient's perspective, spam typically represents unsolicited, unwanted, irrelevant, and/or inappropriate email messages, often unsolicited commercial email (UCE). In addition to UCE, spam includes, but is not limited to, email messages regarding or associated with fraudulent business schemes, chain letters, and/or offensive sexual or political messages.
  • spam comprises Unsolicited Bulk Email (UBE).
  • Unsolicited generally means the recipient of the email message has not granted verifiable permission for the email message to be sent and the sender has no discernible relationship with all or some of the recipients.
  • Bulk generally refers to the fact that the email message is sent as part of a larger collection of email messages, all having substantively identical content.
  • an email message is considered spam if it is both unsolicited and bulk.
  • Unsolicited email can be normal email, such as first contact enquiries, job enquiries, and sales enquiries.
  • Bulk email can be normal email, such as subscriber newsletters, customer communications, discussion lists, etc.
  • an email message would be considered spam if: (i) the recipient's personal identity and context are irrelevant because the email message is equally applicable to many other potential recipients; and (ii) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for the email message to be sent.
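The unsolicited-AND-bulk rule above reduces to a two-input predicate; determining those two inputs reliably is, of course, the hard part in practice. A minimal sketch:

```python
# The two-part spam test: both properties must hold. Either alone can
# describe perfectly normal email, as the counterexamples show.

def is_spam(unsolicited: bool, bulk: bool) -> bool:
    """An email message is considered spam only if it is BOTH unsolicited
    and sent in bulk."""
    return unsolicited and bulk

print(is_spam(True, True))   # -> True: unsolicited bulk email (UBE)
print(is_spam(True, False))  # -> False: e.g., a first-contact enquiry
print(is_spam(False, True))  # -> False: e.g., a subscriber newsletter
```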
  • transparent proxy generally refers to a specialized form of proxy that only implements a subset of a given protocol and allows unknown or uninteresting protocol commands to pass unaltered.
  • in contrast to a full proxy, whose use by a client typically requires editing of the client's configuration file(s) to point to the proxy, no such extra configuration is necessary in order to use a transparent proxy.
  • FIG. 2 is a block diagram conceptually illustrating a simplified network architecture in which embodiments of the present invention may be employed.
  • spammers 205 are shown coupled to the public Internet 200 to which local area network (LAN) 240 is also communicatively coupled through a firewall 210 , a network gateway 215 and an email security system 220 , which incorporates within an anti-spam module 225 various novel image spam detection methodologies that are described further below.
  • the email security system 220 is logically interposed between spammers and the email server 230 to perform spam filtering on incoming email messages from the public Internet 200 prior to receipt and storage on the email server 230 from which and through which client workstations 260 residing on the LAN 240 may retrieve and send email correspondence.
  • the firewall 210 may represent a hardware or software solution configured to protect the resources of LAN from outsiders and to control what outside resources local users have access to by enforcing security policies.
  • the firewall 210 may filter or disallow unauthorized or potentially dangerous material or content from entering LAN 240 and may otherwise limit access between the LAN 240 and the public Internet 200 in accordance with local security policy established and maintained by an administrator of LAN 240 .
  • the network gateway 215 acts as an interface between the LAN 240 and the public Internet 200 .
  • the network gateway 215 may, for example, translate between dissimilar protocols used internally and externally to the LAN 240 .
  • the network gateway 215 or the firewall 210 may perform network address translation (NAT) to hide private Internet Protocol (IP) addresses used within LAN 240 and enable multiple client workstations, such as client workstations 260 , to access the public Internet 200 using a single public IP address.
  • the email security system 220 performs email filtering to detect, tag, block and/or remove unwanted spam and malicious attachments.
  • an anti-spam module 225 of the email security system 220 performs one or more spam filtering techniques, including but not limited to, sender IP reputation analysis and content analysis, such as attachment/content filtering, heuristic rules, deep email header inspection, spam URI real-time blacklists (SURBL), banned word filtering, spam checksum blacklist, forged IP checking, greylist checking, Bayesian classification, Bayesian statistical filters, signature reputation, and/or filtering methods such as FortiGuard-Antispam, access policy filtering, global and user black/white list filtering, spam Real-time Blackhole List (RBL), Domain Name Service (DNS) Block List (DNSBL) and per user Bayesian filtering so that individual users can set their own profiles.
  • the anti-spam module 225 also performs various novel image spam detection methodologies or spam image analysis scanning based on sender's intention analysis in an attempt to detect, tag, block and/or remove spam presented in the form of one or more images. Examples of the image analysis techniques and the sender's intention analysis methodologies are described in more detail below.
  • Existing email security platforms that exemplify various operational characteristics of the email security system 220 according to an embodiment of the present invention include the FortiMail™ family of high-performance, multi-layered email security platforms, including the FortiMail-100 platform, the FortiMail-400 platform, the FortiMail-2000 platform and the FortiMail-4000A platform, all of which are available from Fortinet, Inc. of Sunnyvale, Calif.
  • FIG. 3 is a block diagram conceptually illustrating interaction among various functional units of an email security system 320 with a client workstation 360 and an email server 330 in accordance with an embodiment of the present invention.
  • While, for simplicity, only a single client workstation 360 and a single email server 330 are depicted, it is to be understood that many local and/or remote client workstations, servers and email servers may interact directly or indirectly with the email security system 320 and directly or indirectly with each other.
  • the email security system 320, which may be implemented as one or more virtual or physical devices, includes a content processor 321 logically interposed between sources of inbound email 380 and an enterprise's email server 330.
  • the content processor 321 performs scanning of inbound email messages 380 originating from sources on the public Internet 200 before allowing such inbound email messages 380 to be stored on the email server 330 .
  • an anti-spam module 325 of the content processor 321 may perform spam filtering and an anti-virus (AV) module 326 implementing AV and other filters potentially performs other traditional anti-virus detection and content filtering on data associated with the email messages.
  • anti-spam module 325 may apply various image analysis methodologies described further below to ascertain email senders' intentions and therefore the likelihood that attached and/or embedded images represent image spam. According to the current example, the anti-spam module 325 , responsive to being presented with an inbound email message, determines whether the email message contains embedded or attached images and if so, as described further below with reference to FIG. 5 and FIG. 6 , determines if such images represent image spam.
  • the content processor 321 is an integrated FortiASIC™ Content Processor chip developed by Fortinet, Inc. of Sunnyvale, Calif.
  • the content processor 321 may be a dedicated coprocessor or software to help offload content filtering tasks from a host processor (not shown).
  • the anti-spam module 325 may be associated with or otherwise responsive to a mail transfer protocol proxy (not shown).
  • the mail transfer protocol proxy may be implemented as a transparent proxy that implements handlers for Simple Mail Transfer Protocol (SMTP) or Extended SMTP (ESMTP) commands/replies relevant to the performance of content filtering activities and passes through those not relevant to the performance of content filtering activities.
  • the mail transfer protocol proxy may subject each of incoming email, outgoing email and internal email to scanning by the anti-spam module 325 and/or the content processor 321 .
  • filtering of email need not be performed prior to storage of email message on the email server 330 .
  • the content processor 321 , the mail transfer protocol proxy (not shown) or some other functional unit logically interposed between a user agent or email client 361 executing on the client workstation 360 and the email server 330 may process email messages at the time they are requested to be transferred from the user agent/email client 361 to the email server 330 or vice versa.
  • neither the email messages nor their attachments need be stored locally on the email security system 320 to support the filtering functionality described herein.
  • the email security system 320 may open a direct connection between the email client 361 and the email server 330 , and filter email in real-time as it passes through.
  • the content processor 321 , the anti-spam module 325 and the mail transfer protocol proxy have been described as residing within or as part of the same network device, in alternative embodiments one or more of these functional units may be located remotely from the other functional units.
  • the hardware components and/or software modules that implement the content processor 321, the anti-spam module 325 and the mail transfer protocol proxy are generally provided on or distributed among one or more Internet and/or LAN accessible networked devices, such as one or more network gateways, firewalls, network security appliances, email security systems, switches, bridges, routers, data storage devices, computer systems and the like.
  • the functionality of one or more of the above-referenced functional units may be merged in various combinations.
  • the content processor 321 may be incorporated within the mail transfer protocol proxy or the anti-spam module 325 may be incorporated within the email server 330 or the email client 361 .
  • the functional units can be communicatively coupled using any suitable communication method (e.g., message passing, parameter passing, and/or signals through one or more communication paths etc.).
  • the functional units can be physically connected according to any suitable interconnection architecture (e.g., fully connected, hypercube, etc.).
  • the functional units can be any suitable type of logic (e.g., digital logic) for executing the operations described herein.
  • Any of the functional units used in conjunction with embodiments of the invention can include machine-readable media including instructions for performing operations described herein.
  • Machine-readable media include any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine (e.g., a computer).
  • a machine-readable medium includes read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), etc.
  • FIG. 4 is an example of a computer system with which embodiments of the present invention may be utilized.
  • the computer system 400 may represent or form a part of an email security system, network gateway, firewall, network appliance, switch, bridge, router, data storage device, server, client workstation and/or other network device implementing one or more of the content processor 321 or other functional units depicted in FIG. 3 .
  • the computer system 400 includes one or more processors 405 , one or more communication ports 410 , main memory 415 , read only memory 420 , mass storage 425 , a bus 430 , and removable storage media 440 .
  • the processor(s) 405 may be Intel® Itanium® or Itanium 2® processor(s), AMD® Opteron® or Athlon MP® processor(s) or other processors known in the art.
  • Communication port(s) 410 represent physical and/or logical ports.
  • communication port(s) may be any of an RS-232 port for use with a modem based dialup connection, a 10/100 Ethernet port, or a Gigabit port using copper or fiber.
  • Communication port(s) 410 may be chosen depending on the network to which the computer system 400 connects, such as a Local Area Network (LAN), Wide Area Network (WAN), or any other network.
  • Communication port(s) 410 may also be the name of the end of a logical connection (e.g., a Transmission Control Protocol (TCP) port or a User Datagram Protocol (UDP) port).
  • communication ports may be one of the Well-Known Ports, such as TCP port 25 or UDP port 25 (used for Simple Mail Transfer), assigned by the Internet Assigned Numbers Authority (IANA) for specific uses.
  • Main memory 415 may be Random Access Memory (RAM), or any other dynamic storage device(s) commonly known in the art.
  • Read only memory 420 may be any static storage device(s) such as Programmable Read Only Memory (PROM) chips for storing static information such as instructions for processors 405 .
  • Mass storage 425 may be used to store information and instructions.
  • hard disks, such as the Adaptec® family of SCSI drives, an optical disc, an array of disks, such as a RAID array (e.g., the Adaptec family of RAID drives), or any other mass storage devices may be used.
  • Bus 430 communicatively couples processor(s) 405 with the other memory, storage and communication blocks.
  • Bus 430 may be a PCI/PCI-X or SCSI based system bus depending on the storage devices used.
  • Optional removable storage media 440 may be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video Disk (DVD)-Read Only Memory (DVD-ROM), Re-Writable DVD and the like.
  • FIG. 5 is a high-level flow diagram illustrating anti-spam processing of images using sender's intention analysis in accordance with an embodiment of the present invention.
  • the various process and decision blocks described below may be performed by hardware components, embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps, or the steps may be performed by a combination of hardware, software, firmware and/or involvement of human participation/interaction.
  • an email message is analyzed to determine if it contains images.
  • the direction of flow of the email message is not pertinent.
  • the email message may be inbound, outbound or an intra-enterprise email message.
  • the anti-spam processing may be enabled in one direction only, or various detection thresholds could be configured differently for different flows.
  • the headers, body and attachments, if any, of the email message at issue are parsed and scanned to identify whether the email message is deemed to contain one or more embedded images. If so, processing continues with block 520 . Otherwise, no further image spam analysis is required and processing branches to the end.
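The parsing and scanning of block 510 can be sketched with Python's standard email package. The MIME types checked and the helper name are illustrative assumptions, not the patent's actual implementation:

```python
from email import message_from_string
from email.message import Message

# Illustrative set of image MIME types; the patent targets GIF, JPEG and PNG
IMAGE_TYPES = {"image/gif", "image/jpeg", "image/png"}

def contains_images(raw_email: str) -> bool:
    """Return True if any MIME part is an image or the HTML body references one."""
    msg: Message = message_from_string(raw_email)
    for part in msg.walk():  # walk() visits every MIME part, attachments included
        if part.get_content_type() in IMAGE_TYPES:
            return True
        if part.get_content_type() == "text/html":
            # Embedded images may be mere <img> references in the HTML part
            payload = part.get_payload(decode=True) or b""
            if b"<img" in payload.lower():
                return True
    return False
```

A message with no image parts and no image references would branch to the end of the flow, as described for block 510.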
  • the email message at issue has been determined to contain one or more embedded images.
  • the senders' intention analysis anti-spam processing therefore proceeds to calculate the location(s) of the embedded image(s).
  • Images may be embedded in a HyperText Markup Language (HTML) part of an HTML formatted email message, within a MIME document or attached separately.
  • abnormal factors are manifestations of a spammer's attempt to obscure text embedded within the one or more images by injecting a variety of noise.
  • abnormal factors include the presence of one or more of the following characteristics: (i) illegal base64 encoding; (ii) multiple images within one HTML part; (iii) one or more low entropy frames in an animated Graphic Interchange Format (GIF); (iv) illegal chunk data within a Portable Network Graphics (PNG) file; (v) quantities of unsmoothed curves; and (vi) quantities of unsmoothed color blocks.
  • illegal base64 encoding can be detected by, among other things, observing illegal characters, such as ‘!’ in the encoded content, such as the HTML formatted message or any part of the MIME document.
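As a sketch of the illegal-base64 check: the standard base64 alphabet is A-Z, a-z, 0-9, '+' and '/', with optional '=' padding, so any other character, such as '!', marks the encoding as illegal. The function name below is illustrative:

```python
import re

# A syntactically legal base64 line: alphabet characters plus at most two
# trailing '=' padding characters (line breaks are handled by splitting).
BASE64_LINE = re.compile(r"^[A-Za-z0-9+/]*={0,2}$")

def is_illegal_base64(body: str) -> bool:
    """True if any non-empty line contains characters outside the base64 alphabet."""
    return any(not BASE64_LINE.match(line) for line in body.splitlines() if line)
```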
  • the inclusion of multiple images within one HTML part can be detected by parsing the HTML formatted email message and observing more than one image within an HTML part.
  • the existence of three images within a single table row reveals an intention on the part of the creator of the email message to display a contiguous image to the email recipient based on the three separate embedded images.
  • the existence of one or more low entropy frames of an animated GIF may be determined on an absolute and/or relative basis.
  • an animated GIF frame's entropy may be determined to be low with reference to observed entropy values of normal GIF files, which vary from approximately 0.1 to 5.0. Therefore, in one embodiment, the existence of one or more low entropy frames is confirmed based on a comparison of the entropy values calculated for the animated GIF at issue to 0.1. If the entropy value calculated for any frame of the animated GIF at issue is less than 0.1, then this abnormal factor is deemed to exist.
  • one or more frames of the animated GIF file at issue may simply be “low” entropy relative to the other high entropy frames. For example, a variation of more than 4.9 between the highest entropy frame and the lowest entropy frame may cause the lowest entropy frame to be deemed relatively lower than the others within the animated GIF file at issue.
  • Illegal chunk data within a Portable Network Graphics (PNG) file may be observed by evaluating information contained within and/or about the chunks. For example, the length of the chunk and cyclic redundancy checksum (CRC) may be verified against the actual data length and recomputed CRC.
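The chunk verification described above follows directly from the PNG container format: every chunk carries a big-endian 4-byte length, a 4-byte type, the data, and a CRC-32 computed over the type and data. The function below is an illustrative sketch, not the patent's code:

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_chunks_valid(data: bytes) -> bool:
    """Walk the chunk stream, verifying declared lengths and stored CRC-32 values."""
    if not data.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        chunk = data[pos + 8:pos + 8 + length]
        if len(chunk) != length or pos + 12 + length > len(data):
            return False  # declared length exceeds the actual data
        (stored_crc,) = struct.unpack(">I", data[pos + 8 + length:pos + 12 + length])
        if zlib.crc32(ctype + chunk) & 0xFFFFFFFF != stored_crc:
            return False  # recomputed CRC disagrees with the stored one
        pos += 12 + length
        if ctype == b"IEND":
            return True  # clean end-of-image chunk reached
    return False
```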
  • Quantities of unsmoothed curves may be detected by evaluating the number of pixels whose color differs from the average color of the surrounding pixels by more than a threshold.
  • Quantities of unsmoothed color blocks may be detected by evaluating the number of color blocks whose color differs from the color of the surrounding color blocks by more than a threshold.
  • Color blocks contain pixels with the same or similar colors.
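A minimal sketch of the unsmoothed-pixel count follows (the unsmoothed color block count is analogous at block granularity). The 8-neighbour average and the threshold of 64 are illustrative assumptions:

```python
def count_unsmoothed(pixels, threshold=64):
    """Count interior pixels whose value differs from the mean of their
    8 neighbours by more than the threshold (a proxy for injected noise)."""
    h, w = len(pixels), len(pixels[0])
    noisy = 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            neighbours = [pixels[y + dy][x + dx]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dy, dx) != (0, 0)]
            if abs(pixels[y][x] - sum(neighbours) / 8) > threshold:
                noisy += 1
    return noisy

smooth = [[100] * 5 for _ in range(5)]       # uniform region: nothing to flag
speckled = [row[:] for row in smooth]
speckled[2][2] = 255                         # one injected noise dot
```

A count well above a configured limit would contribute to the abnormal-factor score for the image.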
  • a value within a range may be returned indicating the degree to which the abnormal factor is expressed.
  • images are converted to a binary representation based on a thresholding technique described in further detail below.
  • thresholding is a simple method of image segmentation. Individual pixels in a grayscale image are marked as “information” pixels if their value is greater than some threshold value, T, (assuming the information content is brighter than the background) and as “background” pixels otherwise. Typically, an information pixel is given a value of “1” while a background pixel is given a value of “0.” Then, a text string measurement algorithm is applied to the binary representation of the portion of the image deemed to contain the information content.
  • both the quantity of text and the relative position of such text within an email viewer's preview window, for example, or within the image itself may be taken into consideration.
  • a high spam score could be assigned to a very large image with a correspondingly small percentage of text when that text is positioned to occupy the whole preview window.
  • the email message is classified as spam or clean based on the observed characteristics of the embedded image(s), such as image location information, the existence/non-existence of various abnormal factors and the quantity of text determined to exist within the embedded image(s).
  • the spam/clean classification may be based upon a weighted average of the various observed characteristics.
  • each observed characteristic may contribute to the score. Once the score reaches a threshold, the email message may be classified as spam and the further characteristics may not require analysis or observation. The email message is classified as clean if the score is less than the threshold after all the characteristics have been evaluated. In one embodiment, the characteristics may be considered in the following order:
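The early-exit weighted scoring described above might look like the following sketch; the characteristic weights, the spam threshold and the check functions are hypothetical:

```python
SPAM_THRESHOLD = 1.0  # illustrative threshold, not from the patent

def classify(checks):
    """checks: ordered list of (weight, check_fn) pairs; check_fn() -> 0.0..1.0.

    Each characteristic contributes weight * check() to the score; once the
    score reaches the threshold, remaining characteristics are not evaluated.
    """
    score = 0.0
    for weight, check in checks:
        score += weight * check()
        if score >= SPAM_THRESHOLD:
            return "spam"  # early exit: later checks are skipped entirely
    return "clean"

result = classify([
    (0.5, lambda: 1.0),   # e.g. image placed to fill the preview window
    (0.6, lambda: 1.0),   # e.g. abnormal factor found; 0.5 + 0.6 >= 1.0
    (0.4, lambda: 1 / 0), # never evaluated thanks to the early exit
])
```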
  • a “spaminess” score may be generated. For example, rather than simply conveying a binary result (e.g., spam vs. clean), a value within a range (e.g., 0 to 10) may be returned indicating the degree to which the email message appeared to contain indications of being spam or the likelihood the email message is spam.
  • FIG. 6 is a flow diagram illustrating quantity of text measurement processing in accordance with an embodiment of the present invention. The steps described below represent the processing of block 540 of FIG. 5 according to one embodiment of the present invention.
  • the various process and decision blocks described below may be performed by hardware components, embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps, or the steps may be performed by a combination of hardware, software, firmware and/or involvement of human participation/interaction.
  • at block 610, the image at issue is converted to a grayscale representation, G_i,j.
  • color pixels of the image at issue are converted to grayscale by computing an average or weighted average of the red, green and blue color components. While various conversions may be used, examples of suitable conversion equations include the following:
  • G_i,j = (0.299·r_i,j + 0.587·g_i,j + 0.114·b_i,j)/3, 0 ≤ i < x_max, 0 ≤ j < y_max (EQ #1)
  • G_i,j = (0.3·r_i,j + 0.6·g_i,j + 0.1·b_i,j)/3, 0 ≤ i < x_max, 0 ≤ j < y_max (EQ #2)
  • G_i,j = (r_i,j + g_i,j + b_i,j)/3, 0 ≤ i < x_max, 0 ≤ j < y_max (EQ #3)
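A minimal sketch of the conversion, using the unweighted average of EQ #3 (the weighted variants EQ #1 and EQ #2 differ only in their coefficients); pixels are represented here as (r, g, b) tuples in nested lists:

```python
def to_grayscale(rgb):
    """Convert an RGB pixel grid to grayscale per EQ #3: G = (r + g + b) / 3."""
    return [[(r + g + b) // 3 for (r, g, b) in row] for row in rgb]

image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (90, 90, 90)]]
gray = to_grayscale(image)
```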
  • entropy and threshold values are determined for the grayscale image, G i,j .
  • Entropy is a statistical measure of randomness that can be used to characterize the texture of the input image.
  • an intermediate data structure is built containing an intensity histogram, C g .
  • each pixel may have a value of 0 to 255.
  • the intensity histogram includes 256 bins each of which maintain a count of the number of pixels in the grayscale image having that value.
  • FIG. 9 represents an intensity histogram for a grayscale representation of FIG. 8 .
  • entropy is calculated according to:
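The entropy equation itself is not reproduced above; assuming the standard Shannon entropy over the normalized intensity histogram C_g (which is consistent with the 0.1 to 5.0 range cited for normal GIF files), a sketch is:

```python
import math

def histogram_entropy(counts):
    """Shannon entropy in bits of an intensity histogram (one count per bin)."""
    total = sum(counts)
    entropy = 0.0
    for c in counts:
        if c:  # empty bins contribute nothing (p * log p -> 0)
            p = c / total
            entropy -= p * math.log2(p)
    return entropy

flat = [0] * 255 + [1000]   # single-intensity frame: zero entropy
uniform = [10] * 256        # maximally mixed frame: log2(256) = 8 bits
```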
  • a threshold value within the intensity histogram is selected simply by choosing the mean or median value.
  • the rationale for this simple threshold selection is that if the information pixels are brighter than the background, they should also be brighter than the average.
  • a more sophisticated approach is to create a histogram of the image pixel intensities and then use the valley point as the threshold, T. This histogram approach assumes that there is some average value for the background and information pixels, but that the actual pixel values have some variation around these average values.
  • the threshold, T is calculated by:
  • the gray levels are divided into two groups by i; w_i1 and w_i2 are the total numbers of pixels in each group, while M_i1 and M_i2 are the average gray levels of each group.
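The threshold formula is not reproduced above, but the quantities described (per-split pixel totals w_i1, w_i2 and group means M_i1, M_i2) match Otsu's method, which selects the gray level maximizing the between-class variance w1·w2·(M1 - M2)^2. A sketch under that assumption:

```python
def otsu_threshold(counts):
    """Pick the gray level maximizing between-class variance over a 256-bin histogram."""
    total = sum(counts)
    total_sum = sum(g * c for g, c in enumerate(counts))
    best_t, best_var = 0, -1.0
    w1 = s1 = 0  # running pixel count and intensity sum of the lower group
    for t in range(256):
        w1 += counts[t]
        if w1 == 0:
            continue
        w2 = total - w1
        if w2 == 0:
            break
        s1 += t * counts[t]
        m1 = s1 / w1                   # mean gray level of the lower group
        m2 = (total_sum - s1) / w2     # mean gray level of the upper group
        var = w1 * w2 * (m1 - m2) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```

For two well-separated intensity clusters, such as dark background pixels and bright information pixels, the selected threshold falls between the clusters.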
  • thresholding is performed to form a binary representation, B i,j , of the grayscale image based on the threshold value selected in block 620 .
  • thresholding is performed in accordance with the following equations:
  • B_i,j = { 0, if G_i,j < T; 1, otherwise }, 0 ≤ i < x_max, 0 ≤ j < y_max (EQ #13)
  • B′_i,j = { B_i,j, if max(C_k) < T; !B_i,j, otherwise }, 0 ≤ k ≤ 255 (EQ #14)
  • the comparison against T in EQ #14 involves an adjustable parameter
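EQ #13 maps directly to code. The inversion step of EQ #14, whose exact condition depends on an adjustable parameter, is approximated below by a simple majority test so that information pixels always end up as 1; this majority test is an assumption, not the patent's precise rule:

```python
def binarize(gray, t):
    """EQ #13: pixels darker than t become 0, the rest become 1; then invert
    if the majority of pixels came out as 1 (i.e. the background is bright)."""
    binary = [[0 if g < t else 1 for g in row] for row in gray]
    ones = sum(sum(row) for row in binary)
    total = sum(len(row) for row in binary)
    if ones * 2 > total:  # background dominated the 1s: flip so text pixels are 1
        binary = [[1 - b for b in row] for row in binary]
    return binary

# Dark text on a bright background: the lone dark pixel ends up as 1
dark_on_bright = [[10, 240, 240],
                  [240, 240, 240]]
```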
  • the binary image is logically divided into M ⁇ N virtual blocks.
  • the M ⁇ N virtual blocks are analyzed to quantify the number of text strings.
  • the text strings in the binary image are quantified in accordance with the following equations:
  • a_0 . . . a_7 are adjustable parameters
  • T^(m,n)_(y_t,y_b) is the likelihood that the row between y_t and y_b in block [m,n] represents text
  • CB^n_i is the likelihood that line[i] is a part of text
  • B_k,i is the value of pixel[k,i] in the binary image.
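The bodies of EQ #15 through EQ #18 are not reproduced here; as an illustrative stand-in, the sketch below scores a row of a block as text-like when it contains many 0/1 transitions, the alternation pattern that letter strokes produce in a binary image. The transition threshold of 4 is an assumption:

```python
def count_text_rows(block, min_transitions=4):
    """Count rows of a binary block whose 0->1/1->0 transition count suggests text."""
    rows = 0
    for line in block:
        transitions = sum(1 for a, b in zip(line, line[1:]) if a != b)
        if transitions >= min_transitions:
            rows += 1
    return rows

solid = [[1] * 12, [0] * 12]      # no alternation: not text-like
striped = [[1, 0] * 6, [0] * 12]  # letter-like alternation in the first row
```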
  • FIG. 7 is an example of an image spam email message 700 containing an embedded image 710 .
  • image spam email messages also include text 720 in an attempt to defeat conventional heuristic filters.
  • FIG. 8 is a grayscale image 810 based on the embedded image 710 of FIG. 7 according to one embodiment of the present invention.
  • the first step (block 610 ) is to convert the embedded image 710 to a grayscale representation, G i,j .
  • Assuming embedded image 710 of FIG. 7 is a color image having red (r), green (g) and blue (b) color components, after application of one of equations EQ #1, EQ #2, EQ #3 or the like, the grayscale representation, G_i,j, appears as grayscale image 810.
  • FIG. 9 is an intensity histogram 900 for the grayscale image 810 of FIG. 8 according to one embodiment of the present invention.
  • the next step (block 620 ) is to build an intensity histogram data structure, C g , and determine a threshold value for the grayscale image 810 .
  • an intensity histogram data structure, C g results, which appears as intensity histogram 900 when displayed in graphical form.
  • the intensity histogram 900 graphically illustrates the number of pixel occurrences in grayscale image 810 for each gray level.
  • a threshold value, T, 910 is calculated for grayscale image 810 .
  • the threshold value 910 is 109 .
  • FIG. 10 is a binary image 1010 resulting from thresholding the grayscale image 810 of FIG. 8 in accordance with an embodiment of the present invention.
  • the next step (block 630 ) is to binarize the image by performing thresholding with the calculated threshold value.
  • the result of graphically depicting the binary representation, B_i,j, in which pixels having a value of one are presented as black and pixels having a value of zero are presented as white, is shown as binary image 1010.
  • the information content intended to be conveyed to the email recipient, i.e., the various text strings, is now clearly distinguishable from the background.
  • FIG. 11 illustrates an exemplary segmentation of the binary image of FIG. 10 into 28 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
  • the next steps are to logically divide the binary image 1010 into virtual blocks and then separately analyze each block to measure perceived text content.
  • segmented binary image 1110 contains 28 virtual blocks, examples of which are pointed out with reference numerals 1120 and 1130 .
  • applying equations EQ #15, EQ #16, EQ #17 and EQ #18, 23 of the 28 blocks are determined to contain a total of 63 text strings.
  • Text strings detected by the algorithm are underlined.
  • Block 1120 is an example of a block that has been determined to contain one or more text strings, i.e., the word “TRADE” 1121 .
  • Block 1130 is an example of a block that has been determined not to contain a text string.
  • FIG. 12 is a grayscale image 1210 based on another exemplary embedded image observed in connection with image spam.
  • FIG. 13 illustrates an exemplary segmentation into 56 virtual blocks of a binarized image 1310 corresponding to the grayscale image 1210 of FIG. 12 and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
  • segmented binary image 1310 contains 56 virtual blocks, examples of which are pointed out with reference numerals 1320 and 1330 .
  • applying equations EQ #15, EQ #16, EQ #17 and EQ #18, 26 of the 56 blocks are determined to contain a total of 40 text strings. Text strings detected by the algorithm are underlined.
  • Block 1320 is an example of a block that has been determined to contain one or more text strings, i.e., the group of letters “ebtEras”.
  • Block 1330 is an example of a block that has been determined not to contain a text string.

Abstract

Systems and methods for an anti-spam detection module that can detect image spam are provided. According to one embodiment, an image spam detection process involves determining and measuring various characteristics of images that may be embedded within or otherwise associated with an electronic mail (email) message. An approximate display location of the embedded images is determined. The existence of one or more abnormal factors associated with the embedded images is identified. A quantity of text included in the one or more embedded images is determined and measured by analyzing one or more blocks of binarized representations of the one or more embedded images. Finally, the likelihood that the email message is spam is determined based on one or more of the approximate display location, the existence of one or more abnormal factors and the quantity and location of text measured.

Description

    COPYRIGHT NOTICE
  • Contained herein is material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent disclosure by any person as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all rights to the copyright whatsoever. Copyright© 2007, Fortinet, Inc.
  • BACKGROUND
  • 1. Field
  • Embodiments of the present invention generally relate to the field of spam filtering and anti-spam techniques. In particular, various embodiments relate to image analysis and methods for combating spam in which spammers use images to carry the advertising text.
  • 2. Description of the Related Art
  • Image spam was originally created in order to get past heuristic filters, which block messages containing words and phrases commonly found in spam. Since image files have different formats than the text found in the message body of an electronic mail (email) message, conventional heuristic filters, which analyze such text, do not detect the content of the message, which may be partly or wholly conveyed by embedded text within the image. As a result, heuristic filters were easily defeated by image spam techniques.
  • To address this spamming technique, fuzzy signature technologies, which flag both known and similar messages as spam, were deployed by anti-spam vendors. Such fuzzy signature technologies allowed message attachments to be targeted, thereby recognizing as spam messages with different content but the same attachment.
  • Spammers now alter the images to make the email message appear different to signature-based filtering approaches while maintaining readability of the embedded text message to human viewers. The content of images lies at two levels: (i) the pixel matrix and (ii) the text or graphics the pixel matrix represents. At present, pixel-based matching does not make sense, as the same text could be represented by countless pixel matrices simply by changing various attributes, such as the font, size or color, or by adding noise. Therefore, hash matching and other signature-based approaches have essentially been rendered useless for blocking image spam, as they fail as a result of even minor changes to the background of the image.
  • Some vendors have attempted to catch image spam by employing Optical Character Recognition (OCR) techniques; however, such approaches have only limited success in view of spammers' use of techniques to obscure the embedded text messages with a variety of noise. FIGS. 1A and 1B illustrate sample images and obfuscation techniques used by spammers to defeat OCR image spam detection techniques. As shown in FIGS. 1A and 1B, polygons, lines, random colors, jagged text, random dots, varying borders and the like may be inserted into image spam in an attempt to defeat signature detection techniques and obscure the embedded text from OCR techniques. There are an almost infinite number of ways that spammers can randomize images. In addition to the foregoing obfuscation techniques, spammers have recently used techniques such as varying the colors used in an image, changing the width and/or pattern of the border, altering the font style, and slicing images into smaller pieces (which are then reassembled to appear as a single image to the recipient). Meanwhile, OCR is very computationally expensive. Depending upon the implementation, fully rendering a message and then looking for word matches against different character set libraries may take as long as several seconds per message, which is typically unacceptable for many contexts.
  • SUMMARY
  • Systems and methods are described for an anti-spam detection module that can detect image spam. According to one embodiment, one or more of the quantity and position of text within an image associated with an electronic message are measured or estimated. Then, based at least in part on the results of the measuring or estimating, the likelihood that the electronic message is spam is determined.
  • According to another embodiment, an embedded image of an electronic mail (email) message is converted to a binarized representation by performing thresholding on a grayscale representation of the embedded image. A quantity of text included in the embedded image is then determined and measured by analyzing one or more blocks of the binarized representations. Finally, the email message is classified as spam or clean based at least in part on the quantity of text measured.
  • In one embodiment, the embedded image may be formatted in accordance with the Graphic Interchange Format (GIF), Joint Photographic Experts Group (JPEG) or Portable Network Graphics (PNG) formats/standards.
  • In one embodiment, the embedded image may be an image contained within a file attached to the email message.
  • In one embodiment, the method also includes determining an approximate display location of an embedded image within the email message and identifying existence of one or more abnormal factors associated with the embedded image. Then, the classification can be based upon the approximate display location, the existence of one or more abnormal factors as well as the quantity of text measured.
  • Other features of embodiments of the present invention will be apparent from the accompanying drawings and from the detailed description that follows.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
  • FIGS. 1A and 1B illustrate sample images and obfuscation techniques used by spammers to defeat OCR image spam detection techniques.
  • FIG. 2 is a block diagram conceptually illustrating a simplified network architecture in which embodiments of the present invention may be employed.
  • FIG. 3 is a block diagram conceptually illustrating interaction among various functional units of an email security system with a client workstation and an email server in accordance with an embodiment of the present invention.
  • FIG. 4 is an example of a computer system with which embodiments of the present invention may be utilized.
  • FIG. 5 is a high-level flow diagram illustrating anti-spam processing of images using sender's intention analysis in accordance with an embodiment of the present invention.
  • FIG. 6 is a flow diagram illustrating quantity of text measurement processing in accordance with an embodiment of the present invention.
  • FIG. 7 is an example of an image spam email message containing an embedded image.
  • FIG. 8 is a grayscale image based on the embedded image of FIG. 7 according to one embodiment of the present invention.
  • FIG. 9 is an intensity histogram for the grayscale image of FIG. 8 according to one embodiment of the present invention.
  • FIG. 10 is a binary image resulting from thresholding the grayscale image of FIG. 8 in accordance with an embodiment of the present invention.
  • FIG. 11 illustrates an exemplary segmentation of the binary image of FIG. 10 into 28 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
  • FIG. 12 is a grayscale image based on another exemplary embedded image observed in connection with image spam.
  • FIG. 13 illustrates an exemplary segmentation into 56 virtual blocks a binarized image corresponding to the image of FIG. 12 and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION
  • Systems and methods are described for an anti-spam detection module that can detect various forms of image spam. According to one embodiment, images attached to or embedded within email messages are analyzed to determine the senders' intention. Empirical analysis reveals legitimate emails may contain embedded images, but valid images sent through email rarely contain a substantial quantity of text. Additionally, when legitimate images are included within email messages, the senders of such email messages do not painstakingly adjust the location of such included images to assure such images appear in the preview window of an email client. Furthermore, legitimate senders do not intentionally inject noise into the embedded images. In contrast, spammers usually compose email messages in different ways. For example, in the context of image spam, spammers insert text into images to avoid filtering by traditional text filters and employ techniques to randomize images and/or obscure text embedded within images. Spammers also typically make great efforts to draw attention to their image spam by carefully placing the image in such a manner as to make it visible to the recipient in the preview window/pane of an email client that supports HTML email, such as Netscape Messenger or Microsoft Outlook. Consequently, various indicators of image spam include, but are not limited to, inclusion of one or more images in the front part of an email message, inclusion of one or more images containing text meeting a certain threshold and/or inclusion of one or more images into which noise appears to have been injected to obfuscate embedded text.
  • According to one embodiment, various image analysis techniques are employed to more accurately detect image spam based on senders' intention analysis. The goal of senders' intention analysis is to discover the email message sender's intent by examining various characteristics of the email message and the embedded or attached images. If it appears, for example, after performing image analysis that one or more images associated with an email message have had one or more obfuscation techniques applied, the intent is to draw attention to the one or more images and/or the one or more images include suspicious quantities of text, then the sender's intention analysis anti-spam processing may flag the email message at issue as spam. In one embodiment, the image scanning spam detection method is based on a combination of email header analysis, email body analysis and image processing on image attachments.
  • Importantly, although various embodiments of the anti-spam detection module and image scanning methodologies are discussed in the context of an email security system, they are equally applicable to network gateways, email appliances, client workstations, servers and other virtual or physical network devices or appliances that may be logically interposed between client workstations and servers, such as firewalls, network security appliances, email security appliances, virtual private network (VPN) gateways, switches, bridges, routers and like devices through which email messages flow. Furthermore, the anti-spam techniques described herein are equally applicable to instant messages, Multimedia Message Service (MMS) messages and other forms of electronic communications in the event that such messages become vulnerable to image spam in the future.
  • Additionally, various embodiments of the present invention are described with reference to filtering of incoming email messages. However, it is to be understood, that the filtering methodologies described herein are equally applicable to email messages originated within an enterprise and circulated internally or outgoing email messages intended for recipients outside of the enterprise. Therefore, the specific examples presented herein are not intended to be limiting and are merely representative of exemplary functionality.
  • Furthermore, while, for convenience, various embodiments of the present invention may be described with reference to detecting image spam in the graphic/image file formats currently most prevalent (i.e., Graphic Interchange Format (GIF), Joint Photographic Experts Group (JPEG) and Portable Network Graphics (PNG) graphic/image file formats), embodiments of the present invention are not so limited and are equally applicable to various other current and future graphic/image file formats, including, but not limited to, Progressive Graphics File (PGF), Tagged Image File Format (TIFF), bit mapped format (BMP), HDP, WDP, XPM, MacOS-PICT, Irix-RGB, Multiresolution Seamless Image Database (MrSID), RAW formats used by various digital cameras, various vector formats, such as Scalable Vector Graphics (SVG), as well as other file formats of attachments which may themselves contain embedded images, such as Portable Document Format (PDF), Encapsulated PostScript, SWF, Windows Metafile, AutoCAD DXF and CorelDraw CDR.
  • In the following description, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the present invention. It will be apparent, however, to one skilled in the art that embodiments of the present invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form.
  • Embodiments of the present invention include various steps, which will be described below. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware, software, firmware and/or by human operators.
  • Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable medium having stored thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, ROMs, random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions. Moreover, embodiments of the present invention may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
  • Terminology
  • Brief definitions of terms used throughout this application are given below.
  • The term “client” generally refers to an application, program, process or device in a client/server relationship that requests information or services from another program, process or device (a server) on a network. Importantly, the terms “client” and “server” are relative since an application may be a client to one application but a server to another. The term “client” also encompasses software that makes the connection between a requesting application, program, process or device to a server possible, such as an FTP client.
  • The terms “connected” or “coupled” and related terms are used in an operational sense and are not necessarily limited to a direct connection or coupling. Thus, for example, two devices may be coupled directly, or via one or more intermediary media or devices. As another example, devices may be coupled in such a way that information can be passed there between, while not sharing any physical connection with one another. Based on the disclosure provided herein, one of ordinary skill in the art will appreciate a variety of ways in which connection or coupling exists in accordance with the aforementioned definition.
  • The phrase “embedded image” generally refers to an image that is displayed or rendered inline within a styled or formatted electronic message, such as a HyperText Markup Language (HTML)-based or formatted email message. As used herein, the phrase “embedded image” is intended to encompass scenarios in which the image data is sent with the email message and linked images in which a reference to the image is sent with the email message and the image data is retrieved once the recipient views the email message. The phrase “embedded image” also includes an image that is embedded in other file formats of attachments, such as Portable Document Format (PDF) attachments, in which the image data is displayed to the email recipient when the attachment is viewed.
  • The phrase “image spam” generally refers to spam in which the “call to action” of the message is partially or completely contained within an embedded file attachment, such as a .gif, .jpeg or .pdf file, rather than in the body of the email message. Such images are typically automatically displayed to the email recipients and typically some form of obfuscation is implemented in an attempt to hide the true content of the image from spam filters.
  • The phrases “in one embodiment,” “according to one embodiment,” and the like generally mean the particular feature, structure, or characteristic following the phrase is included in at least one embodiment of the present invention, and may be included in more than one embodiment of the present invention. Importantly, such phrases do not necessarily refer to the same embodiment.
  • The phrase “network gateway” generally refers to an internetworking system, a system that joins two networks together. A “network gateway” can be implemented completely in software, completely in hardware, or as a combination of the two. Depending on the particular implementation, network gateways can operate at any level of the OSI model from application protocols to low-level signaling.
  • If the specification states a component or feature “may”, “can”, “could”, or “might” be included or have a characteristic, that particular component or feature is not required to be included or have the characteristic.
  • The term “proxy” generally refers to an intermediary device, program or agent, which acts as both a server and a client for the purpose of making or forwarding requests on behalf of other clients.
  • The term “responsive” includes completely or partially responsive.
  • The term “server” generally refers to an application, program, process or device in a client/server relationship that responds to requests for information or services by another program, process or device (a client) on a network. The term “server” also encompasses software that makes the act of serving information or providing services possible.
  • The term “spam” generally refers to electronic junk mail, typically bulk electronic mail (email) messages in the form of commercial advertising. Often, email message content may be irrelevant in determining whether an email message is spam, though most spam is commercial in nature. There is spam that fraudulently promotes penny stocks in the classic pump-and-dump scheme. There is spam that promotes religious beliefs. From the recipient's perspective, spam typically represents unsolicited, unwanted, irrelevant, and/or inappropriate email messages, often unsolicited commercial email (UCE). In addition to UCE, spam includes, but is not limited to, email messages regarding or associated with fraudulent business schemes, chain letters, and/or offensive sexual or political messages.
  • According to one embodiment, “spam” comprises Unsolicited Bulk Email (UBE). Unsolicited generally means the recipient of the email message has not granted verifiable permission for the email message to be sent and the sender has no discernible relationship with all or some of the recipients. Bulk generally refers to the fact that the email message is sent as part of a larger collection of email messages, all having substantively identical content. In embodiments in which spam is equated with UBE, an email message is considered spam if it is both unsolicited and bulk. Unsolicited email can be normal email, such as first contact enquiries, job enquiries, and sales enquiries. Bulk email can be normal email, such as subscriber newsletters, customer communications, discussion lists, etc. Consequently, in such embodiments, an email message would be considered spam if (i) the recipient's personal identity and context are irrelevant because the email message is equally applicable to many other potential recipients; and (ii) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for the email message to be sent.
  • The phrase “transparent proxy” generally refers to a specialized form of proxy that only implements a subset of a given protocol and allows unknown or uninteresting protocol commands to pass unaltered. Advantageously, as compared to a full proxy in which use by a client typically requires editing of the client's configuration file(s) to point to the proxy, it is not necessary to perform such extra configuration in order to use a transparent proxy.
  • FIG. 2 is a block diagram conceptually illustrating a simplified network architecture in which embodiments of the present invention may be employed. In this simple example, spammers 205 are shown coupled to the public Internet 200 to which local area network (LAN) 240 is also communicatively coupled through a firewall 210, a network gateway 215 and an email security system 220, which incorporates within an anti-spam module 225 various novel image spam detection methodologies that are described further below.
  • In the present example, the email security system 220 is logically interposed between spammers and the email server 230 to perform spam filtering on incoming email messages from the public Internet 200 prior to receipt and storage on the email server 230 from which and through which client workstations 260 residing on the LAN 240 may retrieve and send email correspondence.
  • In the exemplary network architecture of FIG. 2, the firewall 210 may represent a hardware or software solution configured to protect the resources of the LAN 240 from outsiders and to control what outside resources local users have access to by enforcing security policies. The firewall 210 may filter or disallow unauthorized or potentially dangerous material or content from entering the LAN 240 and may otherwise limit access between the LAN 240 and the public Internet 200 in accordance with local security policy established and maintained by an administrator of the LAN 240.
  • In one embodiment, the network gateway 215 acts as an interface between the LAN 240 and the public Internet 200. The network gateway 215 may, for example, translate between dissimilar protocols used internally and externally to the LAN 240. Depending upon the distribution of functionality, the network gateway 215 or the firewall 210 may perform network address translation (NAT) to hide private Internet Protocol (IP) addresses used within LAN 240 and enable multiple client workstations, such as client workstations 260, to access the public Internet 200 using a single public IP address.
  • According to one embodiment, the email security system 220 performs email filtering to detect, tag, block and/or remove unwanted spam and malicious attachments. In one embodiment, an anti-spam module 225 of the email security system 220, performs one or more spam filtering techniques, including but not limited to, sender IP reputation analysis and content analysis, such as attachment/content filtering, heuristic rules, deep email header inspection, spam URI real-time blacklists (SURBL), banned word filtering, spam checksum blacklist, forged IP checking, greylist checking, Bayesian classification, Bayesian statistical filters, signature reputation, and/or filtering methods such as FortiGuard-Antispam, access policy filtering, global and user black/white list filtering, spam Real-time Blackhole List (RBL), Domain Name Service (DNS) Block List (DNSBL) and per user Bayesian filtering so that individual users can set their own profiles.
  • The anti-spam module 225 also performs various novel image spam detection methodologies or spam image analysis scanning based on sender's intention analysis in an attempt to detect, tag, block and/or remove spam presented in the form of one or more images. Examples of the image analysis techniques and the sender's intention analysis methodologies are described in more detail below. Existing email security platforms that exemplify various operational characteristics of the email security system 220 according to an embodiment of the present invention include the FortiMail™ family of high-performance, multi-layered email security platforms, including the FortiMail-100 platform, the FortiMail-400 platform, the FortiMail-2000 platform and the FortiMail-4000A platform, all of which are available from Fortinet, Inc. of Sunnyvale, Calif.
  • FIG. 3 is a block diagram conceptually illustrating interaction among various functional units of an email security system 320 with a client workstation 360 and an email server 330 in accordance with an embodiment of the present invention.
  • While in this simplified example, only a single client workstation, i.e., client workstation 360, and a single email server, i.e., email server 330, are shown interacting with the email security system 320, it should be understood that many local and/or remote client workstations, servers and email servers may interact directly or indirectly with the email security system 320 and directly or indirectly with each other.
  • According to the present example, the email security system 320, which may be implemented as one or more virtual or physical devices, includes a content processor 321, logically interposed between sources of inbound email 380 and an enterprise's email server 330. In the context of the present example, the content processor 321 performs scanning of inbound email messages 380 originating from sources on the public Internet 200 before allowing such inbound email messages 380 to be stored on the email server 330. In one embodiment, an anti-spam module 325 of the content processor 321 may perform spam filtering and an anti-virus (AV) module 326 implementing AV and other filters potentially performs other traditional anti-virus detection and content filtering on data associated with the email messages.
  • In the current example, anti-spam module 325 may apply various image analysis methodologies described further below to ascertain email senders' intentions and therefore the likelihood that attached and/or embedded images represent image spam. According to the current example, the anti-spam module 325, responsive to being presented with an inbound email message, determines whether the email message contains embedded or attached images and if so, as described further below with reference to FIG. 5 and FIG. 6, determines if such images represent image spam.
  • In one embodiment, the content processor 321 is an integrated FortiASIC™ Content Processor chip developed by Fortinet, Inc. of Sunnyvale, Calif. In alternative embodiments, the content processor 321 may be a dedicated coprocessor or software to help offload content filtering tasks from a host processor (not shown).
  • In alternative embodiments, the anti-spam module 325 may be associated with or otherwise responsive to a mail transfer protocol proxy (not shown). The mail transfer protocol proxy may be implemented as a transparent proxy that implements handlers for Simple Mail Transfer Protocol (SMTP) or Extended SMTP (ESMTP) commands/replies relevant to the performance of content filtering activities and passes through those not relevant to the performance of content filtering activities. In one embodiment, the mail transfer protocol proxy may subject each of incoming email, outgoing email and internal email to scanning by the anti-spam module 325 and/or the content processor 321.
  • Notably, filtering of email need not be performed prior to storage of email message on the email server 330. In alternative embodiments, the content processor 321, the mail transfer protocol proxy (not shown) or some other functional unit logically interposed between a user agent or email client 361 executing on the client workstation 360 and the email server 330 may process email messages at the time they are requested to be transferred from the user agent/email client 361 to the email server 330 or vice versa. Meanwhile, neither the email messages nor their attachments need be stored locally on the email security system 320 to support the filtering functionality described herein. For example, instead of the anti-spam processing running responsive to a mail transfer protocol proxy, the email security system 320 may open a direct connection between the email client 361 and the email server 330, and filter email in real-time as it passes through.
  • While in the context of the present example, the content processor 321, the anti-spam module 325 and the mail transfer protocol proxy (not shown) have been described as residing within or as part of the same network device, in alternative embodiments one or more of these functional units may be located remotely from the other functional units. According to one embodiment, the hardware components and/or software modules that implement the content processor 321, the anti-spam module 325 and the mail transfer protocol proxy are generally provided on or distributed among one or more Internet and/or LAN accessible networked devices, such as one or more network gateways, firewalls, network security appliances, email security systems, switches, bridges, routers, data storage devices, computer systems and the like.
  • In one embodiment, the functionality of one or more of the above-referenced functional units may be merged in various combinations. For example, the content processor 321 may be incorporated within the mail transfer protocol proxy or the anti-spam module 325 may be incorporated within the email server 330 or the email client 361. Moreover, the functional units can be communicatively coupled using any suitable communication method (e.g., message passing, parameter passing, and/or signals through one or more communication paths etc.). Additionally, the functional units can be physically connected according to any suitable interconnection architecture (e.g., fully connected, hypercube, etc.).
  • According to embodiments of the invention, the functional units can be any suitable type of logic (e.g., digital logic) for executing the operations described herein. Any of the functional units used in conjunction with embodiments of the invention can include machine-readable media including instructions for performing operations described herein. Machine-readable media include any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), etc.
  • FIG. 4 is an example of a computer system with which embodiments of the present invention may be utilized. The computer system 400 may represent or form a part of an email security system, network gateway, firewall, network appliance, switch, bridge, router, data storage device, server, client workstation and/or other network device implementing one or more of the content processor 321 or other functional units depicted in FIG. 3. According to FIG. 4, the computer system 400 includes one or more processors 405, one or more communication ports 410, main memory 415, read only memory 420, mass storage 425, a bus 430, and removable storage media 440.
  • The processor(s) 405 may be Intel® Itanium® or Itanium 2® processor(s), AMD® Opteron® or Athlon MP® processor(s) or other processors known in the art.
  • Communication port(s) 410 represent physical and/or logical ports. For example, communication port(s) 410 may be any of an RS-232 port for use with a modem based dialup connection, a 10/100 Ethernet port, or a Gigabit port using copper or fiber. Communication port(s) 410 may be chosen depending on the network to which the computer system 400 connects, such as a Local Area Network (LAN), a Wide Area Network (WAN), or any other network.
  • Communication port(s) 410 may also be the name of the end of a logical connection (e.g., a Transmission Control Protocol (TCP) port or a User Datagram Protocol (UDP) port). For example, communication ports may be one of the Well Known Ports, such as TCP port 25 or UDP port 25 (used for Simple Mail Transfer), assigned by the Internet Assigned Numbers Authority (IANA) for specific uses.
  • Main memory 415 may be Random Access Memory (RAM), or any other dynamic storage device(s) commonly known in the art.
  • Read only memory 420 may be any static storage device(s) such as Programmable Read Only Memory (PROM) chips for storing static information such as instructions for processors 405.
  • Mass storage 425 may be used to store information and instructions. For example, hard disks such as the Adaptec® family of SCSI drives, an optical disc, an array of disks such as RAID, such as the Adaptec family of RAID drives, or any other mass storage devices may be used.
  • Bus 430 communicatively couples processor(s) 405 with the other memory, storage and communication blocks. Bus 430 may be a PCI/PCI-X or SCSI based system bus depending on the storage devices used.
  • Optional removable storage media 440 may be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video Disk (DVD)-Read Only Memory (DVD-ROM), Re-Writable DVD and the like.
  • FIG. 5 is a high-level flow diagram illustrating anti-spam processing of images using sender's intention analysis in accordance with an embodiment of the present invention. Depending upon the particular implementation, the various process and decision blocks described below may be performed by hardware components, embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps, or the steps may be performed by a combination of hardware, software, firmware and/or involvement of human participation/interaction.
  • At block 510, an email message is analyzed to determine if it contains images. For purposes of the present example, the direction of flow of the email message is not pertinent. As indicated above, the email message may be inbound, outbound or an intra-enterprise email message. In various embodiments, however, the anti-spam processing may be enabled in one direction only or various detection thresholds could be configured differently for different flows.
  • In any event, in one embodiment, the headers, body and attachments, if any, of the email message at issue are parsed and scanned to identify whether the email message is deemed to contain one or more embedded images. If so, processing continues with block 520. Otherwise, no further image spam analysis is required and processing branches to the end.
  • At block 520, the email message at issue has been determined to contain one or more embedded images. In the current example, the senders' intention analysis anti-spam processing, therefore, proceeds to calculate the location(s) of the embedded image(s). Images may be embedded in a HyperText Markup Language (HTML) part of an HTML formatted email message, within a MIME document or attached separately. In one embodiment, by parsing the HTML, plain text and/or other Multipurpose Internet Mail Extension (MIME) parts, the display line just prior to the images can be identified and thus the approximate display location of any embedded images can be calculated.
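The MIME-walking portion of this step can be sketched with Python's standard `email` package; the function name `find_embedded_images` and the minimal two-part message below are illustrative assumptions, not part of the patent:

```python
import email

def find_embedded_images(raw_email):
    """Walk the MIME tree and return (part_index, content_type) for each
    image part; the part index approximates where the image appears
    relative to the other MIME parts of the message."""
    msg = email.message_from_string(raw_email)
    found = []
    for idx, part in enumerate(msg.walk()):
        ctype = part.get_content_type()
        if ctype.startswith("image/"):
            found.append((idx, ctype))
    return found

# Hypothetical message: an HTML body part followed by an inline GIF part.
raw = (
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/related; boundary=XYZ\n"
    "\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><img src=\"cid:part1\"></html>\n"
    "--XYZ\n"
    "Content-Type: image/gif\n"
    "Content-ID: <part1>\n"
    "\n"
    "R0lGOD...\n"
    "--XYZ--\n"
)
```

Here `walk()` yields the multipart container first, then each part in display order, so the returned indices track approximate display position.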
  • At block 530, the one or more images are analyzed for indications of one or more abnormal factors. Typically, the abnormal factors are manifestations of a spammer's attempt to obscure text embedded within the one or more images by injecting a variety of noise. In one embodiment, abnormal factors include the presence of one or more of the following characteristics: (i) illegal base64 encoding; (ii) multiple images within one HTML part; (iii) one or more low entropy frames in an animated Graphic Interchange Format (GIF); (iv) illegal chunk data within a Portable Network Graphics (PNG) file; (v) quantities of unsmoothed curves; and (vi) quantities of unsmoothed color blocks.
  • In one embodiment, illegal base64 encoding can be detected by, among other things, observing illegal characters, such as ‘!’ in the encoded content, such as the HTML formatted message or any part of the MIME document.
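A minimal check for this factor might scan the encoded content for characters outside the base64 alphabet; the function name and regular expression below are an illustrative sketch, not the patent's implementation:

```python
import re

# Legal base64 characters: A-Z, a-z, 0-9, '+', '/', '=' padding, plus
# line breaks. Anything else (such as '!') indicates illegal encoding.
_ILLEGAL_B64_CHAR = re.compile(r"[^A-Za-z0-9+/=\r\n]")

def has_illegal_base64(encoded_content):
    """Return True if the supposedly base64-encoded content contains a
    character outside the base64 alphabet."""
    return _ILLEGAL_B64_CHAR.search(encoded_content) is not None
```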
  • In one embodiment, the inclusion of multiple images within one HTML part can be detected by parsing the HTML formatted email message and observing more than one image within an HTML part. In the exemplary HTML code excerpt below, the existence of three images within a single table row (<tr> . . . </tr>) reveals an intention on the part of the creator of the email message to display a contiguous image to the email recipient based on the three separate embedded images.
  • <html>
    <head>
     <meta content=“text/html;charset=ISO-8859-1” http-equiv=“Content-
    Type”>
     <title></title>
    </head>
    <body bgcolor=“#ffffff” text=“#000000”>
    <title>abovementioned bertie</title>
    <div align=“center”>
    [...]
        <tr>
           <td width=“33%”> <a
    href=“http://www.lklljjp.biz/vpr6160/”> <img name=“apprehension”
    src=“cid:part2.00020108.07020409@72.ca” border=“0” height=“179”
    width=“184”></a></td>
           <td width=“33%”> <a
    href=“http://www.lklljjp.biz/vpr6160/”> <img name=“gradate”
    src=“cid:part3.00060308.03010709@72.ca” border=“0” height=“179”
    width=“184”></a></td>
           <td width=“34%”> <a
    href=“http://www.lklljjp.biz/vpr6160/”> <img name=“maltreat”
    src=“cid:part4.02080304.00040002@72.ca” border=“0” height=“179”
    width=“184”></a></td>
        </tr>
    [...]
    </body>
    </html>
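One way to detect this tiling pattern, sketched here with Python's standard `html.parser` module (the class name is illustrative), is to count `<img>` tags per table row and flag rows containing more than one:

```python
from html.parser import HTMLParser

class ImageRunCounter(HTMLParser):
    """Record how many <img> tags appear inside each <tr>; multiple
    images in one row suggest slices of a single contiguous picture."""

    def __init__(self):
        super().__init__()
        self.in_row = False
        self.current = 0
        self.runs = []  # images counted per completed table row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row, self.current = True, 0
        elif tag == "img" and self.in_row:
            self.current += 1

    def handle_endtag(self, tag):
        if tag == "tr" and self.in_row:
            self.runs.append(self.current)
            self.in_row = False
```

Feeding the HTML excerpt above to this parser would produce a run of three images in a single row, triggering the abnormal factor.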
  • The existence of one or more low entropy frames of an animated GIF may be determined on an absolute and/or relative basis. For example, an animated GIF frame may be determined to be low with reference to observed entropy values of normal GIF files, which vary from approximately 0.1 to 5.0. Therefore, in one embodiment, the existence of one or more low entropy frames is confirmed based on a comparison of the entropy values calculated for the animated GIF at issue to 0.1. If the entropy value calculated for any frame of the animated GIF at issue is less than 0.1, then this abnormal factor is deemed to exist. In other embodiments, one or more frames of the animated GIF file at issue may simply be “low” entropy relative to the other high entropy frames. For example, a variation of more than 4.9 between the highest entropy frame and the lowest entropy frame may cause the lowest entropy frame to be deemed relatively lower than the others within the animated GIF file at issue.
  • Illegal chunk data within a Portable Network Graphics (PNG) file may be observed by evaluating information contained within and/or about the chunks. For example, the length of the chunk and cyclic redundancy checksum (CRC) may be verified against the actual data length and recomputed CRC.
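The chunk verification described above follows directly from the public PNG chunk layout (a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 over the type and data); the function name below is an illustrative sketch:

```python
import struct
import zlib

def png_chunks_valid(data):
    """Walk the chunks of a PNG byte stream, verifying each declared
    length against the data actually present and each stored CRC against
    a recomputed CRC; a mismatch is the 'illegal chunk data' factor."""
    if not data.startswith(b"\x89PNG\r\n\x1a\n"):
        return False
    pos = 8  # skip the 8-byte PNG signature
    while pos + 8 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        body = data[pos + 8:pos + 8 + length]
        if len(body) != length or pos + 12 + length > len(data):
            return False  # declared length runs past the end of the file
        (crc,) = struct.unpack(">I", data[pos + 8 + length:pos + 12 + length])
        if zlib.crc32(ctype + body) & 0xFFFFFFFF != crc:
            return False  # stored CRC does not match recomputed CRC
        pos += 12 + length
        if ctype == b"IEND":
            return True
    return False  # ran out of data before reaching IEND
```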
  • Quantities of unsmoothed curves may be detected by evaluating the number of pixels for which the difference between their color and the average color of the surrounding pixels is greater than a threshold.
  • Quantities of unsmoothed color blocks may be detected by evaluating the number of color blocks for which the difference between their color and the color of the surrounding color blocks is greater than a threshold. A color block contains pixels with the same or similar colors.
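A pixel-level version of this measurement might look like the following sketch, where the image is a plain 2-D list of intensities and `threshold` stands in for the patent's unspecified difference threshold:

```python
def count_unsmoothed(gray, threshold):
    """Count pixels whose intensity differs from the average of their
    4-connected neighbours by more than `threshold`. Clean photographic
    images have few such pixels; injected speckle noise produces many."""
    rows, cols = len(gray), len(gray[0])
    count = 0
    for i in range(rows):
        for j in range(cols):
            nbrs = [gray[a][b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < rows and 0 <= b < cols]
            if abs(gray[i][j] - sum(nbrs) / len(nbrs)) > threshold:
                count += 1
    return count
```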
  • In one embodiment, rather than simply conveying a binary result (e.g., a zero indicating the absence of the abnormal factor at issue and a one indicating the presence of the abnormal factor at issue), a value within a range (e.g., 0 to 10) may be returned indicating the degree to which the abnormal factor is expressed.
  • At block 540, the quantity of text embedded in the images is measured. In one embodiment, images are converted to a binary representation based on a thresholding technique described in further detail below. In general, thresholding is a simple method of image segmentation. Individual pixels in a grayscale image are marked as “information” pixels if their value is greater than some threshold value, T, (assuming the information content is brighter than the background) and as “background” pixels otherwise. Typically, an information pixel is given a value of “1” while a background pixel is given a value of “0.” Then, a text string measurement algorithm is applied to the binary representation of the portion of the image deemed to contain the information content.
  • Notably, in one embodiment, rather than considering the quantity of embedded text alone, both the quantity of text and the relative position of such text, within an email viewer's preview window, for example, or within the image itself, may be taken into consideration. For example, a high spam score could be assigned to a very large image having a correspondingly smaller percentage of text, but in which the text is positioned to occupy the whole preview window.
  • At block 550, the email message is classified as spam or clean based on the observed characteristics of the embedded image(s), such as image location information, the existence/non-existence of various abnormal factors and the quantity of text determined to exist within the embedded image(s). In one embodiment, the spam/clean classification may be based upon a weighted average of the various observed characteristics.
  • In one embodiment, each observed characteristic may contribute to the score. Once the score reaches a threshold, the email message may be classified as spam and any remaining characteristics need not be analyzed or observed. The email message is classified as clean if the score is less than the threshold after all the characteristics have been evaluated. In one embodiment, the characteristics may be considered in the following order:
      • Image location information
      • Presence of continuous images
      • Presence of illegal base64 encoding
      • Presence of lower entropy frames in an animated GIF
      • Presence of illegal chunk data of a PNG encoded image
      • Quantities and/or location of text in the images
      • Quantities of unsmoothed curves in the images
      • Quantities of unsmoothed color blocks in the images
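The ordered, early-exit evaluation described above can be sketched as follows; the weights and threshold are hypothetical placeholders, and `observe` is assumed to return the degree (0 to 1) to which a characteristic is present:

```python
# Hypothetical per-characteristic weights, evaluated in the patent's
# stated order; a real deployment would tune these values.
CHECKS = [
    ("image_location", 2.0),
    ("contiguous_images", 3.0),
    ("illegal_base64", 4.0),
    ("low_entropy_gif_frames", 3.0),
    ("illegal_png_chunks", 4.0),
    ("embedded_text_quantity", 2.5),
    ("unsmoothed_curves", 1.5),
    ("unsmoothed_color_blocks", 1.5),
]

def classify(observe, threshold=5.0):
    """Accumulate weight * degree for each characteristic in order and
    stop as soon as the running score reaches the spam threshold, so the
    costlier checks at the end of the list are often skipped."""
    score = 0.0
    for name, weight in CHECKS:
        score += weight * observe(name)  # observe returns a degree in [0, 1]
        if score >= threshold:
            return "spam", score
    return "clean", score
```

For example, a message exhibiting both contiguous images and illegal base64 encoding reaches the threshold after only three checks and is classified as spam without the image-content measurements ever running.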
  • In one embodiment, similar to that described above with reference to abnormal factors, rather than making the ultimate spam/clean decision (because the ultimate decision could be made by another component), a “spaminess” score may be generated. For example, rather than simply conveying a binary result (e.g., spam vs. clean), a value within a range (e.g., 0 to 10) may be returned indicating the degree to which the email message appeared to contain indications of being spam or the likelihood the email message is spam.
  • If upon completion of the anti-spam processing described above there is not sufficient data to determine the email message is spam (e.g., there is insufficient data to determine the sender's intention), then according to one embodiment, more CPU intensive processes, such as OCR, may be invoked. Advantageously, in this manner, most image spam emails can be detected in real-time without compromising performance and more CPU intensive processes are only performed if and when required.
  • FIG. 6 is a flow diagram illustrating quantity of text measurement processing in accordance with an embodiment of the present invention. The steps described below represent the processing of block 540 of FIG. 5 according to one embodiment of the present invention.
  • As mentioned with reference to FIG. 5, depending upon the particular implementation, the various process and decision blocks described below may be performed by hardware components, embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps, or the steps may be performed by a combination of hardware, software, firmware and/or involvement of human participation/interaction.
  • At block 610, if the image at issue is color, then it is converted to grayscale to form a grayscale representation, Gi,j. According to one embodiment, color pixels of the image at issue are converted to grayscale by computing an average or weighted average of the red, green and blue color components. While various conversions may be used, examples of suitable conversion equations include the following:

  • G_{i,j} = (0.299*r_{i,j} + 0.587*g_{i,j} + 0.114*b_{i,j})/3, 0 ≤ i < x_max, 0 ≤ j < y_max   EQ #1

  • G_{i,j} = (0.3*r_{i,j} + 0.6*g_{i,j} + 0.1*b_{i,j})/3, 0 ≤ i < x_max, 0 ≤ j < y_max   EQ #2

  • G_{i,j} = (r_{i,j} + g_{i,j} + b_{i,j})/3, 0 ≤ i < x_max, 0 ≤ j < y_max   EQ #3
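EQ #3, the simple average, translates directly into code; here the image is represented as a 2-D list of (r, g, b) tuples, and the function name is illustrative:

```python
def to_grayscale(rgb_image):
    """Convert an image of (r, g, b) tuples to grayscale per EQ #3:
    G[i][j] = (r + g + b) / 3."""
    return [[(r + g + b) / 3 for (r, g, b) in row] for row in rgb_image]
```

The weighted variants EQ #1 and EQ #2 differ only in multiplying each color component by its coefficient before summing.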
  • At block 620, entropy and threshold values are determined for the grayscale image, Gi,j. Entropy is a statistical measure of randomness that can be used to characterize the texture of the input image. In connection with calculating the entropy of the grayscale image, an intermediate data structure is built containing an intensity histogram, Cg. In the context of an 8-bit grayscale image, each pixel may have a value of 0 to 255. Thus, the intensity histogram includes 256 bins, each of which maintains a count of the number of pixels in the grayscale image having that value. An example of an intensity histogram is shown in FIG. 9, which represents an intensity histogram for a grayscale representation of FIG. 8. In one embodiment, entropy is calculated according to:
  • E = −Σ_{g=0..255} (C_g / N) * log(C_g / N), where N = Σ_{g=0..255} C_g is the total pixel count   EQ #4
  • Subject to: C_g = Σ_{i=0..x_max} Σ_{j=0..y_max} c_{i,j}^g   EQ #5
  • c_{i,j}^g = 1 if G_{i,j} = g, 0 otherwise, for 0 ≤ i < x_max, 0 ≤ j < y_max   EQ #6
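EQ #4 through EQ #6 amount to building the 256-bin intensity histogram, normalizing each bin to a probability, and summing -p * log(p) over the non-empty bins; the function name below is illustrative:

```python
import math

def image_entropy(gray_image):
    """Entropy per EQ #4: build the intensity histogram C_g (EQ #5/#6),
    normalize each bin by the total pixel count, and sum -p * log(p)
    over the non-empty bins."""
    hist = [0] * 256
    total = 0
    for row in gray_image:
        for g in row:
            hist[int(g)] += 1
            total += 1
    return -sum((c / total) * math.log(c / total) for c in hist if c)
```

A single-valued frame yields entropy 0, which is exactly the kind of low-entropy animated GIF frame flagged as an abnormal factor above.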
  • According to one embodiment, a threshold value within the intensity histogram is selected simply by choosing the mean or median value. The rationale for this simple threshold selection is that if the information pixels are brighter than the background, they should also be brighter than the average. However, to compensate for the existence of noise and variability in the background, a more sophisticated approach is to create a histogram of the image pixel intensities and then use the valley point as the threshold, T. This histogram approach assumes that there is some average value for the background and information pixels, but that the actual pixel values have some variation around these average values. In one embodiment, the threshold, T, is calculated by:

  • T = arg max δ_i, 0 ≤ i ≤ 255   EQ #7
  • Subject to:
  • δ_i = w_{i1} * w_{i2} * (M_{i1} − M_{i2})^2, 0 ≤ i ≤ 255   EQ #8
  • w_{i1} = Σ_{g=0..i} C_g   EQ #9
  • w_{i2} = Σ_{g=i+1..255} C_g   EQ #10
  • M_{i1} = (Σ_{g=0..i} g*C_g) / (Σ_{g=0..i} C_g)   EQ #11
  • M_{i2} = (Σ_{g=i+1..255} g*C_g) / (Σ_{g=i+1..255} C_g)   EQ #12
  • According to the above example, the gray levels are divided into two groups by i; w_{i1} and w_{i2} are the total numbers of pixels in the two groups, while M_{i1} and M_{i2} are the average gray levels of the two groups.
  • Notably, there are many existing methods of performing thresholding. Consequently, any other current or future method of performing thresholding may be used depending upon the needs of a particular implementation.
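Read as selecting the gray level i that maximizes δ_i, EQ #7 through EQ #12 correspond to maximizing the between-class variance over the histogram (Otsu's method). The sketch below works on a 256-bin histogram and also includes the EQ #13 binarization step; the function names are illustrative:

```python
def select_threshold(hist):
    """Return the gray level maximizing delta_i = w_i1 * w_i2 *
    (M_i1 - M_i2)^2 per EQ #8 through EQ #12."""
    best_t, best_delta = 0, -1.0
    for i in range(256):
        w1 = sum(hist[:i + 1])       # pixels at or below level i (EQ #9)
        w2 = sum(hist[i + 1:])       # pixels above level i (EQ #10)
        if w1 == 0 or w2 == 0:
            continue                 # all pixels on one side; no split here
        m1 = sum(g * hist[g] for g in range(i + 1)) / w1         # EQ #11
        m2 = sum(g * hist[g] for g in range(i + 1, 256)) / w2    # EQ #12
        delta = w1 * w2 * (m1 - m2) ** 2                         # EQ #8
        if delta > best_delta:
            best_t, best_delta = i, delta
    return best_t

def binarize(gray_image, t):
    """EQ #13: pixels below T become 0 (background), the rest become 1."""
    return [[0 if g < t else 1 for g in row] for row in gray_image]
```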
  • At block 630, thresholding is performed to form a binary representation, Bi,j, of the grayscale image based on the threshold value selected in block 620. In one embodiment, thresholding is performed in accordance with the following equations:
  • B_{i,j} = 0 if G_{i,j} < T, 1 otherwise, for 0 ≤ i < x_max, 0 ≤ j < y_max   EQ #13
  • B_{i,j} = B_{i,j} if max(C_k) < ∂ and max(C_k) < T, !B_{i,j} otherwise, for 0 ≤ k ≤ 255   EQ #14
  • where, ∂ is an adjustable parameter.
  • At block 640, the binary image is logically divided into M×N virtual blocks.
  • At block 650, the M×N virtual blocks are analyzed to quantify the number of text strings. In one embodiment, the text strings in the binary image are quantified in accordance with the following equations:
  • T = Σ_{m=0..M} Σ_{n=0..N} T_{m,n}   EQ #15
  • Subject to: T_{m,n} = Σ_{y_t=y_0^m..y_max^m} Σ_{y_b=y_t+1..y_max^m} T_{y_t,y_b}^{m,n}, where y_0^m = y_max^0*(m−1) and y_max^m = y_0^m + ∂_0   EQ #16
  • T_{y_t,y_b}^{m,n} = 1 if ∂_1 > Σ_{i=y_t..y_b} CB_i^n > ∂_2 and Σ_{k=x_0^n..x_max^n} B_{k,y_b+1} < ∂_3, where x_0^n = x_max^0*(n−1) and x_max^n = x_0^n + ∂_0; 0 otherwise   EQ #17
  • CB_i^n = 1 if ∂_4 > Σ_{k=x_0^n..x_max^n} B_{k,i} > ∂_5 and Max(Σ_{k=x_w..x_w+∂_6} B_{k,i}) < ∂_7 for x_0^n ≤ x_w ≤ x_max^n; 0 otherwise   EQ #18
  • where,
  • ∂_0 … ∂_7 are adjustable parameters;
  • T_{y_t,y_b}^{m,n} is the likelihood that the row between y_t and y_b in block [m,n] represents text;
  • CB_i^n is the likelihood that line[i] is part of a text string;
  • B_{k,i} is the value of pixel [k,i] in the binary image.
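Because the patent leaves ∂_0 through ∂_7 adjustable, any concrete code must pick stand-in values. The sketch below captures the spirit of EQ #17 and EQ #18 for a single block: a row of the binary image is a text-line candidate when its fraction of set pixels falls in a middle band (neither blank background nor a solid border) and it contains no overly long run of set pixels; the fill bounds and run limit are hypothetical parameters:

```python
def count_text_rows(block, min_fill=0.05, max_fill=0.6, max_run=20):
    """Count text-line candidate rows in one M x N virtual block of the
    binary image (a 2-D list of 0/1 values). A candidate row has a fill
    ratio strictly between min_fill and max_fill and no horizontal run
    of 1s reaching max_run (long solid runs are borders or graphics,
    not characters)."""
    candidates = 0
    width = len(block[0])
    for row in block:
        fill = sum(row) / width
        run = longest = 0
        for px in row:
            run = run + 1 if px else 0
            longest = max(longest, run)
        if min_fill < fill < max_fill and longest < max_run:
            candidates += 1
    return candidates
```

Summing this count over all M x N blocks gives a block-level approximation of the total text quantity T of EQ #15.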
  • Notably, while in the context of the equations presented above, a global thresholding approach is implemented taking into consideration the image as a whole, in alternative embodiments, various forms of local thresholding may be performed that consider groups of blocks or individual blocks to determine the best thresholding approach for such block or blocks.
  • CONCRETE EXAMPLES
  • For the sake of illustration, two concrete examples of application of the thresholding and text quantification described above will now be provided with reference to FIG. 7 to FIG. 13.
  • FIG. 7 is an example of an image spam email message 700 containing an embedded image 710. Typically, such image spam email messages also include text 720 in an attempt to defeat conventional heuristic filters.
  • FIG. 8 is a grayscale image 810 based on the embedded image 710 of FIG. 7 according to one embodiment of the present invention. According to the flow diagram of FIG. 6, the first step (block 610) is to convert the embedded image 710 to a grayscale representation, Gi,j. Assuming embedded image 710 of FIG. 7 is a color image having red (r), green (g) and blue (b) color components, after application of one of equations EQ #1, EQ #2, EQ #3 or the like, the grayscale representation, Gi,j, appears as grayscale image 810.
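The grayscale conversion of block 610 can be sketched as below. Equations EQ #1 through EQ #3 are not reproduced in this excerpt, so the common ITU-R BT.601 luma weights (0.299, 0.587, 0.114) are used here as an assumed stand-in for one of those conversions:

```python
def to_grayscale(rgb_image):
    """Convert an RGB image (nested lists of (r, g, b) tuples) to a
    grayscale representation G[i][j], one intensity (0-255) per pixel.
    The 0.299/0.587/0.114 luma weights are an assumption standing in
    for the patent's EQ #1-EQ #3, which are not shown in this excerpt."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]

# A 1x2 image: one pure-white pixel and one pure-black pixel.
gray = to_grayscale([[(255, 255, 255), (0, 0, 0)]])
```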
  • FIG. 9 is an intensity histogram 900 for the grayscale image 810 of FIG. 8 according to one embodiment of the present invention. According to the flow diagram of FIG. 6, the next step (block 620) is to build an intensity histogram data structure, Cg, and determine a threshold value for the grayscale image 810. After application of one or more of equations EQ #4, EQ #5, EQ #6, EQ #7, EQ #8, EQ #9, EQ #10, EQ #11, EQ #12 or the like to the grayscale representation, Gi,j, (grayscale image 810), an intensity histogram data structure, Cg, results, which appears as intensity histogram 900 when displayed in graphical form. Assuming 256 possible gray levels are represented in grayscale image 810, the intensity histogram 900 graphically illustrates the number of pixel occurrences in grayscale image 810 for each gray level.
  • Application of the above-referenced equations also results in a threshold value, T, 910, being calculated for grayscale image 810. According to this example, the threshold value 910 is 109.
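The histogram construction and threshold selection of block 620 can be sketched as follows. Equations EQ #4 through EQ #12 are likewise not reproduced in this excerpt, so Otsu's method is used here as an assumed, illustrative way of deriving a threshold value T from the intensity histogram C_g; it need not match the patent's actual equations:

```python
def intensity_histogram(gray):
    """Build C_g: the number of pixel occurrences at each of 256 gray levels."""
    counts = [0] * 256
    for row in gray:
        for g in row:
            counts[g] += 1
    return counts

def otsu_threshold(counts):
    """Pick T maximizing between-class variance (Otsu's method), used here
    as an assumed stand-in for the patent's EQ #4-EQ #12."""
    total = sum(counts)
    total_sum = sum(g * c for g, c in enumerate(counts))
    best_t, best_var = 0, -1.0
    w_b = sum_b = 0
    for t in range(256):
        w_b += counts[t]          # background weight: pixels at levels <= t
        if w_b == 0:
            continue
        w_f = total - w_b         # foreground weight
        if w_f == 0:
            break
        sum_b += t * counts[t]
        mean_b = sum_b / w_b
        mean_f = (total_sum - sum_b) / w_f
        var_between = w_b * w_f * (mean_b - mean_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Two well-separated intensity clusters; T should land at the low cluster.
gray = [[20] * 8 + [200] * 8]
T = otsu_threshold(intensity_histogram(gray))
```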
  • FIG. 10 is a binary image 1010 resulting from thresholding the grayscale image 810 of FIG. 8 in accordance with an embodiment of the present invention. According to the flow diagram of FIG. 6, the next step (block 630) is to binarize the image by performing thresholding with the calculated threshold value. Application of one or both of equations EQ #13 and EQ #14 or the like causes the binary representation, Bi,j, to contain a zero for each pixel in which the grayscale representation, Gi,j, is less than the calculated threshold value, T, and to contain a one for each pixel in which the grayscale representation, Gi,j, is greater than or equal to the calculated threshold value, T. For purposes of illustration, the binary representation, Bi,j, is depicted graphically as binary image 1010, in which pixels having a value of one are presented as black and pixels having a value of zero are presented as white. As can be seen with reference to FIG. 10, the information content intended to be conveyed to the email recipient, i.e., the various text strings, is now clearly distinguishable from the background.
  • FIG. 11 illustrates an exemplary segmentation of the binary image of FIG. 10 into 28 virtual blocks and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention. According to the flow diagram of FIG. 6, the next steps (blocks 640 and 650) are to logically divide the binary image 1010 into virtual blocks and then separately analyze each block to measure perceived text content. Application of one or more of equations EQ #15, EQ #16, EQ #17, EQ #18 or the like, causes the text string count, T, to contain the sum of all blocks determined to contain a text string.
  • In the present example, segmented binary image 1110 contains 28 virtual blocks, examples of which are pointed out with reference numerals 1120 and 1130. According to equations EQ #15, EQ #16, EQ #17 and EQ #18, 23 of the 28 blocks contain a total of 63 text strings. Text strings detected by the algorithm are underlined. Block 1120 is an example of a block that has been determined to contain one or more text strings, i.e., the word “TRADE” 1121. Block 1130 is an example of a block that has been determined not to contain a text string.
  • FIG. 12 is a grayscale image 1210 based on another exemplary embedded image observed in connection with image spam.
  • FIG. 13 illustrates an exemplary segmentation, into 56 virtual blocks, of a binarized image 1310 corresponding to the grayscale image 1210 of FIG. 12, and highlights the text strings detected within the blocks in accordance with an embodiment of the present invention. In the present example, segmented binary image 1310 contains 56 virtual blocks, examples of which are pointed out with reference numerals 1320 and 1330. According to equations EQ #15, EQ #16, EQ #17 and EQ #18, 26 of the 56 blocks contain a total of 40 text strings. Text strings detected by the algorithm are underlined. Block 1320 is an example of a block that has been determined to contain one or more text strings, i.e., the group of letters “ebtEras”. Block 1330 is an example of a block that has been determined not to contain a text string.
  • Notably, to the extent reverse video or the presentation of light colored (e.g., white) text in the context of a dark (e.g., black) background becomes problematic (see, e.g., the “LEARN MORE” text string embedded within FIG. 13), one approach to detect such text strings would be to apply a local thresholding approach using EQ #14, which would effectively reverse the black and white pixels for the blocks at issue.
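The local thresholding approach mentioned above can be sketched as a per-block pass over the binary image: within each virtual block, the dominant gray level decides whether that block's pixels are inverted so that text pixels end up as ones, which effectively reverses the black and white pixels for reverse-video blocks. The block geometry, the darkness cutoff, and the function name are illustrative assumptions:

```python
def normalize_blocks(binary, gray, block_w, block_h, cutoff=128):
    """Per-block EQ #14-style pass: within each virtual block, find the
    dominant gray level. A light dominant level means dark-on-light text,
    so that block's thresholded output is inverted to leave text pixels
    as ones; a dark dominant level (reverse video) is left as-is. The
    cutoff value and block geometry are illustrative assumptions."""
    rows, cols = len(binary), len(binary[0])
    for y0 in range(0, rows, block_h):
        for x0 in range(0, cols, block_w):
            counts = {}
            for y in range(y0, min(y0 + block_h, rows)):
                for x in range(x0, min(x0 + block_w, cols)):
                    g = gray[y][x]
                    counts[g] = counts.get(g, 0) + 1
            peak = max(counts, key=counts.get)
            if peak >= cutoff:  # light background: invert this block
                for y in range(y0, min(y0 + block_h, rows)):
                    for x in range(x0, min(x0 + block_w, cols)):
                        binary[y][x] = 1 - binary[y][x]
    return binary

# Left 2x2 block: dark text on a light background; right block: reverse video.
gray = [[220, 30, 10, 240],
        [220, 220, 10, 10]]
binary = [[1, 0, 0, 1],        # thresholded output with T = 109 (EQ #13)
          [1, 1, 0, 0]]
normalize_blocks(binary, gray, block_w=2, block_h=2)
```

After the pass, both the dark-on-light glyph and the light-on-dark glyph carry the value one, so a text string such as “LEARN MORE” in FIG. 13 would no longer be missed.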
  • While embodiments of the invention have been illustrated and described, it will be clear that the invention is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the invention, as described in the claims.

Claims (19)

1. A method comprising:
measuring or estimating one or more of the quantity and position of text within an image associated with an electronic message; and
estimating the likelihood that the electronic message is spam based at least in part on results of the measuring or estimating.
2. The method of claim 1, wherein the electronic message comprises an electronic mail (email) message.
3. The method of claim 1, wherein the image is divided up into a plurality of blocks and image processing is applied to each of the plurality of blocks.
4. The method of claim 3, wherein the image processing includes local thresholding.
5. The method of claim 3, wherein the image processing includes global thresholding.
6. The method of claim 1, wherein filtering is applied to the image to remove noise deliberately added by an originator of the electronic message.
7. The method of claim 3, wherein the image processing comprises converting the image or one or more of the plurality of blocks to grayscale.
8. The method of claim 3, further comprising determining which colors or intensities are likely to represent text within the image or within one or more of the plurality of blocks by calculating an intensity histogram for the image or for the one or more of the plurality of blocks.
9. The method of claim 3, wherein the quantity of text is measured or estimated by summing the number of blocks within a portion of the image visible in a preview pane of an email client.
10-27. (canceled)
28. A computer-readable medium having stored thereon instructions, which when executed by one or more processors cause the one or more processors to perform a method comprising:
measuring or estimating one or more of the quantity and position of text within an image associated with an electronic message; and
estimating the likelihood that the electronic message is spam based at least in part on results of the measuring or estimating.
29. The computer-readable medium of claim 28, wherein the electronic message comprises an electronic mail (email) message.
30. The computer-readable medium of claim 28, wherein the image is divided up into a plurality of blocks and image processing is applied to each of the plurality of blocks.
31. The computer-readable medium of claim 30, wherein the image processing includes local thresholding.
32. The computer-readable medium of claim 30, wherein the image processing includes global thresholding.
33. The computer-readable medium of claim 28, wherein filtering is applied to the image to remove noise deliberately added by an originator of the electronic message.
34. The computer-readable medium of claim 30, wherein the image processing comprises converting the image or one or more of the plurality of blocks to grayscale.
35. The computer-readable medium of claim 30, further comprising determining which colors or intensities are likely to represent text within the image or within one or more of the plurality of blocks by calculating an intensity histogram for the image or for the one or more of the plurality of blocks.
36. The computer-readable medium of claim 30, wherein the quantity of text is measured or estimated by summing the number of blocks within a portion of the image visible in a preview pane of an email client.
US11/932,589 2007-10-31 2007-10-31 Image spam filtering based on senders' intention analysis Abandoned US20090113003A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US11/932,589 US20090113003A1 (en) 2007-10-31 2007-10-31 Image spam filtering based on senders' intention analysis
US12/114,815 US8180837B2 (en) 2007-10-31 2008-05-04 Image spam filtering based on senders' intention analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/932,589 US20090113003A1 (en) 2007-10-31 2007-10-31 Image spam filtering based on senders' intention analysis

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/114,815 Continuation US8180837B2 (en) 2007-10-31 2008-05-04 Image spam filtering based on senders' intention analysis

Publications (1)

Publication Number Publication Date
US20090113003A1 true US20090113003A1 (en) 2009-04-30

Family

ID=40582891

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/932,589 Abandoned US20090113003A1 (en) 2007-10-31 2007-10-31 Image spam filtering based on senders' intention analysis
US12/114,815 Active 2029-05-18 US8180837B2 (en) 2007-10-31 2008-05-04 Image spam filtering based on senders' intention analysis

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/114,815 Active 2029-05-18 US8180837B2 (en) 2007-10-31 2008-05-04 Image spam filtering based on senders' intention analysis

Country Status (1)

Country Link
US (2) US20090113003A1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090043853A1 (en) * 2007-08-06 2009-02-12 Yahoo! Inc. Employing pixel density to detect a spam image
US20090077617A1 (en) * 2007-09-13 2009-03-19 Levow Zachary S Automated generation of spam-detection rules using optical character recognition and identifications of common features
US20090110233A1 (en) * 2007-10-31 2009-04-30 Fortinet, Inc. Image spam filtering based on senders' intention analysis
US20090150419A1 (en) * 2007-12-10 2009-06-11 Won Ho Kim Apparatus and method for removing malicious code inserted into file
US20120150959A1 (en) * 2010-12-14 2012-06-14 Electronics And Telecommunications Research Institute Spam countering method and apparatus
US8290311B1 (en) * 2007-01-11 2012-10-16 Proofpoint, Inc. Apparatus and method for detecting images within spam
US8290203B1 (en) * 2007-01-11 2012-10-16 Proofpoint, Inc. Apparatus and method for detecting images within spam
US8396876B2 (en) 2010-11-30 2013-03-12 Yahoo! Inc. Identifying reliable and authoritative sources of multimedia content
US20140074942A1 (en) * 2012-09-10 2014-03-13 International Business Machines Corporation Identifying a webpage from which an e-mail address is obtained
US20150100648A1 (en) * 2013-10-03 2015-04-09 Yandex Europe Ag Method of and system for processing an e-mail message to determine a categorization thereof
US20160321255A1 (en) * 2015-04-28 2016-11-03 International Business Machines Corporation Unsolicited bulk email detection using url tree hashes
TWI569608B (en) * 2015-10-08 2017-02-01 網擎資訊軟體股份有限公司 A computer program product and e-mail transmission method thereof for e-mail transmission in monitored network environment
US20170126601A1 (en) * 2008-12-31 2017-05-04 Dell Software Inc. Image based spam blocking
US20170289083A1 (en) * 2010-06-09 2017-10-05 Quest Software Inc. Net- based email filtering
US10361989B2 (en) * 2016-10-06 2019-07-23 International Business Machines Corporation Visibility management enhancement for messaging systems and online social networks
US10469510B2 (en) * 2014-01-31 2019-11-05 Juniper Networks, Inc. Intermediate responses for non-html downloads
US11164156B1 (en) * 2021-04-30 2021-11-02 Oracle International Corporation Email message receiving system in a cloud infrastructure

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7299261B1 (en) 2003-02-20 2007-11-20 Mailfrontier, Inc. A Wholly Owned Subsidiary Of Sonicwall, Inc. Message classification using a summary
US7406502B1 (en) 2003-02-20 2008-07-29 Sonicwall, Inc. Method and system for classifying a message based on canonical equivalent of acceptable items included in the message
US8266215B2 (en) * 2003-02-20 2012-09-11 Sonicwall, Inc. Using distinguishing properties to classify messages
US20090245635A1 (en) * 2008-03-26 2009-10-01 Yehezkel Erez System and method for spam detection in image data
US8731284B2 (en) * 2008-12-19 2014-05-20 Yahoo! Inc. Method and system for detecting image spam
US20130156288A1 (en) * 2011-12-19 2013-06-20 De La Rue North America Inc. Systems And Methods For Locating Characters On A Document
US9047293B2 (en) * 2012-07-25 2015-06-02 Aviv Grafi Computer file format conversion for neutralization of attacks
CN103020646A (en) * 2013-01-06 2013-04-03 深圳市彩讯科技有限公司 Incremental training supported spam image identifying method and incremental training supported spam image identifying system
CN103944810B (en) * 2014-05-06 2017-02-15 厦门大学 Spam e-mail intention recognition system
CN106547852B (en) * 2016-10-19 2021-03-12 腾讯科技(深圳)有限公司 Abnormal data detection method and device, and data preprocessing method and system
US9858424B1 (en) 2017-01-05 2018-01-02 Votiro Cybersec Ltd. System and method for protecting systems from active content
US10013557B1 (en) 2017-01-05 2018-07-03 Votiro Cybersec Ltd. System and method for disarming malicious code
US10331889B2 (en) 2017-01-05 2019-06-25 Votiro Cybersec Ltd. Providing a fastlane for disarming malicious content in received input content
US10331890B2 (en) 2017-03-20 2019-06-25 Votiro Cybersec Ltd. Disarming malware in protected content
US10621554B2 (en) 2017-11-29 2020-04-14 International Business Machines Corporation Image representation of e-mails
CN110086706B (en) * 2019-04-24 2022-05-27 北京众纳鑫海网络技术有限公司 Method and system for joining a device-specific message group
CN110866543B (en) * 2019-10-18 2022-07-15 支付宝(杭州)信息技术有限公司 Picture detection and picture classification model training method and device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6738496B1 (en) * 1999-11-01 2004-05-18 Lockheed Martin Corporation Real time binarization of gray images
US6772196B1 (en) * 2000-07-27 2004-08-03 Propel Software Corp. Electronic mail filtering system and methods
US20040221062A1 (en) * 2003-05-02 2004-11-04 Starbuck Bryan T. Message rendering for identification of content features
US20050204005A1 (en) * 2004-03-12 2005-09-15 Purcell Sean E. Selective treatment of messages based on junk rating
US20060123083A1 (en) * 2004-12-03 2006-06-08 Xerox Corporation Adaptive spam message detector
US20070168436A1 (en) * 2006-01-19 2007-07-19 Worldvuer, Inc. System and method for supplying electronic messages
US20080091765A1 (en) * 2006-10-12 2008-04-17 Simon David Hedley Gammage Method and system for detecting undesired email containing image-based messages
US20090110233A1 (en) * 2007-10-31 2009-04-30 Fortinet, Inc. Image spam filtering based on senders' intention analysis
US7882177B2 (en) * 2007-08-06 2011-02-01 Yahoo! Inc. Employing pixel density to detect a spam image


Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10095922B2 (en) * 2007-01-11 2018-10-09 Proofpoint, Inc. Apparatus and method for detecting images within spam
US8290203B1 (en) * 2007-01-11 2012-10-16 Proofpoint, Inc. Apparatus and method for detecting images within spam
US20130039582A1 (en) * 2007-01-11 2013-02-14 John Gardiner Myers Apparatus and method for detecting images within spam
US8290311B1 (en) * 2007-01-11 2012-10-16 Proofpoint, Inc. Apparatus and method for detecting images within spam
US7882177B2 (en) * 2007-08-06 2011-02-01 Yahoo! Inc. Employing pixel density to detect a spam image
US20110078269A1 (en) * 2007-08-06 2011-03-31 Yahoo! Inc. Employing pixel density to detect a spam image
US20090043853A1 (en) * 2007-08-06 2009-02-12 Yahoo! Inc. Employing pixel density to detect a spam image
US8301719B2 (en) 2007-08-06 2012-10-30 Yahoo! Inc. Employing pixel density to detect a spam image
US20090077617A1 (en) * 2007-09-13 2009-03-19 Levow Zachary S Automated generation of spam-detection rules using optical character recognition and identifications of common features
US20090110233A1 (en) * 2007-10-31 2009-04-30 Fortinet, Inc. Image spam filtering based on senders' intention analysis
US8180837B2 (en) 2007-10-31 2012-05-15 Fortinet, Inc. Image spam filtering based on senders' intention analysis
US8590016B2 (en) * 2007-12-10 2013-11-19 Electronics And Telecommunications Research Institute Apparatus and method for removing malicious code inserted into file
US20090150419A1 (en) * 2007-12-10 2009-06-11 Won Ho Kim Apparatus and method for removing malicious code inserted into file
US10204157B2 (en) * 2008-12-31 2019-02-12 Sonicwall Inc. Image based spam blocking
US20170126601A1 (en) * 2008-12-31 2017-05-04 Dell Software Inc. Image based spam blocking
US10419378B2 (en) * 2010-06-09 2019-09-17 Sonicwall Inc. Net-based email filtering
US20170289083A1 (en) * 2010-06-09 2017-10-05 Quest Software Inc. Net- based email filtering
US8396876B2 (en) 2010-11-30 2013-03-12 Yahoo! Inc. Identifying reliable and authoritative sources of multimedia content
US20120150959A1 (en) * 2010-12-14 2012-06-14 Electronics And Telecommunications Research Institute Spam countering method and apparatus
US9076130B2 (en) * 2012-09-10 2015-07-07 International Business Machines Corporation Identifying a webpage from which an E-mail address is obtained
US20140074942A1 (en) * 2012-09-10 2014-03-13 International Business Machines Corporation Identifying a webpage from which an e-mail address is obtained
US9794208B2 (en) 2013-10-03 2017-10-17 Yandex Europe Ag Method of and system for constructing a listing of e-mail messages
US9525654B2 (en) 2013-10-03 2016-12-20 Yandex Europe Ag Method of and system for reformatting an e-mail message based on a categorization thereof
US9749275B2 (en) 2013-10-03 2017-08-29 Yandex Europe Ag Method of and system for constructing a listing of E-mail messages
US9521101B2 (en) 2013-10-03 2016-12-13 Yandex Europe Ag Method of and system for reformatting an e-mail message based on a categorization thereof
US9521102B2 (en) 2013-10-03 2016-12-13 Yandex Europe Ag Method of and system for constructing a listing of e-mail messages
US9450903B2 (en) * 2013-10-03 2016-09-20 Yandex Europe Ag Method of and system for processing an e-mail message to determine a categorization thereof
US20150100648A1 (en) * 2013-10-03 2015-04-09 Yandex Europe Ag Method of and system for processing an e-mail message to determine a categorization thereof
US10469510B2 (en) * 2014-01-31 2019-11-05 Juniper Networks, Inc. Intermediate responses for non-html downloads
US10810176B2 (en) 2015-04-28 2020-10-20 International Business Machines Corporation Unsolicited bulk email detection using URL tree hashes
US20160321255A1 (en) * 2015-04-28 2016-11-03 International Business Machines Corporation Unsolicited bulk email detection using url tree hashes
US10706032B2 (en) * 2015-04-28 2020-07-07 International Business Machines Corporation Unsolicited bulk email detection using URL tree hashes
TWI569608B (en) * 2015-10-08 2017-02-01 網擎資訊軟體股份有限公司 A computer program product and e-mail transmission method thereof for e-mail transmission in monitored network environment
US10826865B2 (en) 2016-10-06 2020-11-03 International Business Machines Corporation Visibility management enhancement for messaging systems and online social networks
US10361989B2 (en) * 2016-10-06 2019-07-23 International Business Machines Corporation Visibility management enhancement for messaging systems and online social networks
US11164156B1 (en) * 2021-04-30 2021-11-02 Oracle International Corporation Email message receiving system in a cloud infrastructure
US20220351143A1 (en) * 2021-04-30 2022-11-03 Oracle International Corporation Email message receiving system in a cloud infrastructure
US11544673B2 (en) * 2021-04-30 2023-01-03 Oracle International Corporation Email message receiving system in a cloud infrastructure

Also Published As

Publication number Publication date
US8180837B2 (en) 2012-05-15
US20090110233A1 (en) 2009-04-30

Similar Documents

Publication Publication Date Title
US8180837B2 (en) Image spam filtering based on senders' intention analysis
AU2004202268B2 (en) Origination/destination features and lists for spam prevention
US7882187B2 (en) Method and system for detecting undesired email containing image-based messages
US8032594B2 (en) Email anti-phishing inspector
US20050050150A1 (en) Filter, system and method for filtering an electronic mail message
US9413716B2 (en) Securing email communications
US20100095377A1 (en) Detection of suspicious traffic patterns in electronic communications
US7925044B2 (en) Detecting online abuse in images
AU2006260933B2 (en) Method and system for filtering electronic messages
US20050015626A1 (en) System and method for identifying and filtering junk e-mail messages or spam based on URL content
US20060075099A1 (en) Automatic elimination of viruses and spam
CN113518987A (en) E-mail security analysis
JP4670049B2 (en) E-mail filtering program, e-mail filtering method, e-mail filtering system
WO2017162997A1 (en) A method of protecting a user from messages with links to malicious websites containing homograph attacks
EP1733521B1 (en) A method and an apparatus to classify electronic communication
Ismail et al. Image spam detection: problem and existing solution
Clifford et al. Miracle cures and toner cartridges: Finding solutions to the spam problem
Srikanthan An Overview of Spam Handling Techniques
GARG Integrated Approach for Email Spam Filter

Legal Events

Date Code Title Description
AS Assignment

Owner name: FORTINET, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LU, JUN;CHENG, JIANDONG;REEL/FRAME:020419/0977

Effective date: 20080111

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION