US20120130983A1 - Efficient string pattern matching for large pattern sets - Google Patents

Efficient string pattern matching for large pattern sets

Info

Publication number
US20120130983A1
Authority
US
United States
Prior art keywords
string
segment
strings
segments
match
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US12/953,469
Other versions
US8407245B2 (en)
Inventor
Matthew V. Ryan
Andrew Hodgkinson
Russell Young
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp
Priority to US12/953,469 (patent US8407245B2)
Assigned to MICROSOFT CORPORATION (assignment of assignors' interest; see document for details). Assignors: HODGKINSON, ANDREW; RYAN, MATTHEW V.; YOUNG, RUSSELL
Publication of US20120130983A1
Application granted
Publication of US8407245B2
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC (assignment of assignors' interest; see document for details). Assignors: MICROSOFT CORPORATION
Legal status: Expired - Fee Related (current)
Adjusted expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques

Definitions

  • Strings can be stored in a variety of ways in a computer system, such as one or more characters terminated by a predetermined identifier (e.g., null-terminated strings) or counted data structures that store a string length and array of characters.
  • String matching may include comparing a single source string to a single target string, comparing a single source string to a set of multiple target strings, comparing multiple source strings to multiple target strings, and so forth.
  • String matching often consumes a significant quantity of computer hardware resources (e.g., processor time or memory space).
  • Modern desktop search programs often spend a large percentage of their execution time comparing a search query string with many possible matches in a search index.
  • The performance of applications can be noticeably affected by the algorithms and data structures that application developers select to store and manipulate strings.
  • Standard techniques for matching a source string against a large set of target pattern strings are inefficient and expensive. For example, many techniques iterate through each potential target string, comparing characters until a mismatch is found before moving to the next potential target string. The running time of this approach grows with every target string added to the set and with the length of the source string.
  • Pattern strings are strings that include one or more literal characters and/or wildcard characters that can match one or more literal characters. If a set of pattern strings is comprised of strings that can be logically divided into segments, these pattern strings can be stored efficiently in a tree-like data structure, segment by segment, storing pattern-match syntax segments separately from literal string segments. Pattern-match syntax segments may include “wild card” characters (e.g., “*” or “?”) that can be replaced by one or more literal characters to form a match. After segmentation of candidate strings, individual match candidate strings can be compared, segment by segment, against elements in the tree to quickly determine if the candidate matches any of the original pattern strings.
  • the string matching system uses a data structure that is conceptually a combination of a tree and a hash table. Using a hash table to implement the branching elements at each level in the tree contributes to quick matching speed at each level. By consistently separating strings into segments, the system can also share storage locations for like elements in the data structure. Thus, the system allows fast string matching even with large data sets.
  • FIG. 1 is a block diagram that illustrates components of the string matching system, in one embodiment.
  • FIG. 2 is a flow diagram that illustrates processing of the string matching system to store new strings in the string store against which candidate strings will be matched, in one embodiment.
  • FIG. 3 is a flow diagram that illustrates processing of the string matching system to determine whether a candidate string matches any string stored in the string store, in one embodiment.
  • FIG. 4 is a data structure diagram that graphically illustrates contents of the string data store of the string matching system, in one embodiment.
  • a string matching system is described herein that provides for very fast and efficient pattern matching against large sets of certain types of pattern strings. If a set of pattern strings is comprised of strings that can be logically divided into segments, these pattern strings can be stored efficiently in a tree-like data structure, segment by segment, storing pattern-match syntax segments separately from literal string segments. Pattern-match syntax segments may include “wild card” characters (e.g., “*” or “?”) that can be replaced by one or more literal characters to form a match. After segmentation, individual match candidate strings can be compared, segment by segment, against elements in the tree to quickly determine if the candidate matches any of the original pattern strings. In some embodiments, the string matching system uses a data structure that is conceptually a combination of a tree and a hash table.
  • Using a hash table to implement the branching elements at each level in the tree contributes to quick matching speed at each level.
  • the system can also share storage locations for like elements in the data structure.
  • the system allows fast string matching even with large data sets.
  • the string matching system is applied to applications in which a goal is to determine, given a set of pattern strings and a separate match candidate string, whether any of the pattern strings in the set match the candidate string.
  • One straightforward approach to solving this problem is to enumerate each pattern string in the set, compare it to the match candidate, and determine if there is a match. For example, given this set: “aaa”, “bbb”, “ccc”, and the match candidate string “bbb”, a common approach would first compare “bbb” to “aaa”. Upon finding these strings do not match, the approach next compares “bbb” to “bbb” and, having found a match, processing terminates. For literal strings, as in this example, matching can be done more efficiently by using a hash table instead of enumerating each string and comparing values individually.
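The contrast between enumeration and hashing for literal strings can be sketched in Python as follows. The function names `match_by_enumeration` and `match_by_hash` are illustrative assumptions and do not come from the patent text.

```python
def match_by_enumeration(patterns, candidate):
    """Compare the candidate to each stored string in turn (linear in the set size)."""
    for pattern in patterns:
        if pattern == candidate:
            return True
    return False


def match_by_hash(pattern_set, candidate):
    """Look the candidate up in a hash-based set (constant time on average)."""
    return candidate in pattern_set


patterns = ["aaa", "bbb", "ccc"]
print(match_by_enumeration(patterns, "bbb"))  # compares with "aaa", then "bbb"
print(match_by_hash(set(patterns), "bbb"))    # single hashed lookup
```

The hashed lookup only works for literal strings; a wildcard pattern does not hash to the same key as the strings it matches, which is the problem the rest of the description addresses.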
  • the string matching system addresses this problem for certain types of pattern strings that are easily and consistently separable into segments and where pattern match portions can be separated into distinct segments from literal match portions.
  • File paths are examples of strings that fit these rules.
  • Configuration for this application may specify to delete all files in temporary directories and all files that end with the suffix “tmp” or “temp”.
  • Using an asterisk (“*”) to represent a sequence of one or more matching characters, the list of pattern strings to represent this configuration might look like this: “/temp/*”, “/home/*/temp/*”, “*.tmp”, and “*.temp”.
  • These strings meet the criteria specified. They are easily and consistently separable (the slash (“/”) and period (“.”) characters are used to separate elements of the path specification), and the pattern match element, the asterisk (“*”), can be separated into its own segment.
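As a minimal sketch of this kind of segmentation, the following Python function splits a pattern string at slashes and periods so that each asterisk lands in its own segment. The name `segment` and the choice to drop empty pieces (such as the one before a leading slash) are assumptions made for illustration.

```python
import re


def segment(pattern, delimiters="/."):
    """Split a string at any delimiter character, dropping empty pieces."""
    parts = re.split("[" + re.escape(delimiters) + "]", pattern)
    return [p for p in parts if p]


for p in ["/temp/*", "/home/*/temp/*", "*.tmp", "*.temp"]:
    print(p, "->", segment(p))
```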
  • the first three segments (“home”, “*”, and “temp”) are the same for each of these pattern strings and can therefore be shared.
  • the string matching system creates a tree-like data structure that represents the pattern strings in a format that leverages the efficiencies described.
  • the tree is comprised of a root node that acts as the starting access point to the data structure, and zero or more tree nodes that contain the pattern string data.
  • Each tree node contains a hash table that associates a specific string segment with a reference to the node that contains the next segment for that pattern string (see FIG. 4 described further herein). Thus, if two pattern strings have the same segment in common at the same location in the string, they will share the hash table entry for that particular node.
  • segments are defined as being the same if: 1) the string values of both segments are identical (or can be considered identical based on a pattern match) and 2) the status of the segment as being the last segment in the string is the same (i.e., both segments represent the last segment in their respective strings, or both segments do not represent the last segment in their respective strings). Other factors may be considered as a part of the segment-matching test, depending on implementation.
  • a hash table entry for the last segment of a string includes some indication that it is the last segment (shown in FIG. 4 as a NULL reference).
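One way to picture such a node is sketched below in Python, using a dictionary as the hash table. The class name `Node`, the attribute `entries`, and the use of a `(segment, is_last)` tuple as the key are assumptions chosen for illustration; the patent does not prescribe a concrete layout.

```python
class Node:
    """One level of the tree: a hash table from segment keys to child nodes."""

    def __init__(self):
        # Key: (segment text, is_last flag).
        # Value: the child Node holding the next segment, or None when the
        # segment ends a stored string (the NULL reference mentioned above).
        self.entries = {}


root = Node()
child = Node()
root.entries[("S1-A", False)] = child   # "S1-A" followed by further segments
root.entries[("S1-B", True)] = None     # "S1-B" is a complete string
child.entries[("S2-B", True)] = None    # completes "S1-A:S2-B"
child.entries[("*", True)] = None       # completes "S1-A:*"

# Both "S1-A:S2-B" and "S1-A:*" share the single ("S1-A", False) entry.
print(len(root.entries), len(child.entries))
```

Making the end-of-string status part of the key keeps "S1-A" as a complete string distinct from "S1-A" as a prefix of longer strings.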
  • storage of pattern strings in the data structure is as follows.
  • the system first divides each string into segments (called segmentation herein).
  • the system stores each segment at the corresponding level in the tree structure according to the above rules described for the data structure (see FIG. 2 described herein).
  • FIG. 4 represents the storage of the following set of pattern strings in the data structure described: “S1-A:S2-A:S3-A”, “S1-B”, “S1-A:*:S3-B”, “S1-A:S2-B”, “S1-A:*”, and “S1-A:S2-A”. Note that in this example the system can easily segment the strings at each colon (“:”).
  • both periods (“.”) and slashes (“/” and “\”) may be used to segment strings, as well as combinations of characters (e.g., “://” to separate “http” from a domain name) or other delimiters.
  • Other domains may contain other logical divisions that the system can use to divide strings into segments.
  • the system may store empty segments in some cases. For example, for some implementations two delimiters in a row might be ignored while in other implementations such a combination may indicate an empty segment that has logical significance to the implementation.
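The two treatments of consecutive delimiters can be illustrated with a short Python sketch. The function name `segment` and the `keep_empty` flag are hypothetical; they simply make the implementation choice explicit.

```python
def segment(text, delimiter=":", keep_empty=False):
    """Split at the delimiter; optionally keep empty segments."""
    parts = text.split(delimiter)
    return parts if keep_empty else [p for p in parts if p]


print(segment("S1-A::S3-B"))                   # ['S1-A', 'S3-B']
print(segment("S1-A::S3-B", keep_empty=True))  # ['S1-A', '', 'S3-B']
```

Whichever choice is made, the same rule has to be applied when storing strings and when segmenting candidates, or equal strings would produce different segment lists.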
  • Determining whether a given string matches the set of pattern strings uses similar logic (see FIG. 3 described herein).
  • the system divides the candidate string into segments, the same as the string would be if it were being stored in the data structure. Then, starting with the first segment of the string, the system tests for a match for the first segment in the hash table of the first node of the tree.
  • the test for whether segments match is the same test that is used during storage, to determine whether a segment already exists or will be created. In other words, the status of the segment as the last in the string is considered as a part of the match test; again, other factors may be considered depending on implementation. If a match is found, the system follows the reference in the hash table to the next node, where a match for the next segment of the candidate string is sought.
  • While testing for a match, if no literal match is found in the current node's hash table for the candidate segment, the system then tests for a match against the pattern strings stored at that node. Those of ordinary skill in the art will recognize that this order can be reversed so that pattern strings are tested before literal matches in some implementations. If a match is found, the reference to the next node is followed and matching continues. If no match is found, control returns to the previous node, where the same algorithm is applied. Lacking a match, the system continues applying the same logic to each previous node until a match is found or there is no previous node. If, at any point during the matching effort, the reference in the matching element of the hash table refers to no node (e.g., NULL), matching efforts are complete.
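The literal-first lookup with fallback to the previous node can be sketched recursively in Python. This is an illustrative reading of the description, not the patented implementation: the tree is a nested dictionary keyed by hypothetical `(segment, is_last)` tuples, the helper names are invented, and the standard-library `fnmatchcase` stands in for the pattern test (its “*” also matches an empty string, unlike the one-or-more semantics described above).

```python
from fnmatch import fnmatchcase


def is_pattern(segment):
    """Treat any segment containing "*" or "?" as pattern-match syntax."""
    return "*" in segment or "?" in segment


def find(node, segments, i=0):
    """Return True if segments[i:] can be matched starting at this node."""
    if i >= len(segments):
        return False
    seg = segments[i]
    is_last = i == len(segments) - 1
    options = []
    # 1) Literal lookup through the node's hash table.
    if (seg, is_last) in node:
        options.append(node[(seg, is_last)])
    # 2) Slower scan of the pattern entries stored at the same node.
    for (stored, stored_last), child in node.items():
        if (stored_last == is_last and stored != seg
                and is_pattern(stored) and fnmatchcase(seg, stored)):
            options.append(child)
    for child in options:
        if is_last or find(child, segments, i + 1):
            return True
    # Nothing worked here: returning False hands control back to the
    # previous node, which then tries its own remaining options.
    return False


# Hand-built tree for "S1-A:S2-A:S3-A" and "S1-A:*:S3-B".
n3a = {("S3-A", True): None}
n3b = {("S3-B", True): None}
n2 = {("S2-A", False): n3a, ("*", False): n3b}
root = {("S1-A", False): n2}

# The literal "S2-A" branch fails on "S3-B", so the search falls back to
# the "*" entry at the previous node and succeeds there.
print(find(root, ["S1-A", "S2-A", "S3-B"]))
```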
  • a partial match is one in which at least one segment of the candidate string has a match stored in the tree and at least one segment does not.
  • a full match is one in which all of the segments of the candidate string have matching segments stored in the tree.
  • the system is flexible to allow specific implementations to determine the criteria for what is a match, what is a partial match, or what is not a match, and the examples provided herein are merely to illustrate the capabilities of the system. For example, suppose the string “http://www.abc.com/home” is stored in the data store. Then suppose the system receives the string “http://www.abc.com/home/index.html” to see if it matches. An implementation may decide that the candidate string matches even if the last part of the path (“index.html”) is not included in the data store; in other words, the implementation may decide that the trailing parts of the path are not important for matching.
  • another candidate string may be “http://www.abc.com” and in this case, it may be that not matching the “home” part of the path may mean there is NOT a match.
  • another candidate string “http://www.abc.net” may also be considered not a match because the domain name portion may be considered significant.
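The three URL cases above can be reproduced with a small Python sketch that compares segment lists and leaves the final decision to the implementation. The function names, the returned labels, and the segmentation at “://”, “/” and “.” are assumptions made for this example.

```python
import re


def url_segments(url):
    """Split a URL at "://", "/" and "." (empty pieces are dropped)."""
    return [s for s in re.split(r"://|[/.]", url) if s]


def classify(stored, candidate):
    """Describe how a candidate segment list relates to a stored one."""
    common = 0
    for s, c in zip(stored, candidate):
        if s != c:
            break
        common += 1
    if common == len(stored) == len(candidate):
        return "full"
    if common == len(stored):
        return "stored_is_prefix"      # candidate has extra trailing segments
    if common == len(candidate):
        return "candidate_is_prefix"   # candidate stops short of the stored string
    return "diverged"


stored = url_segments("http://www.abc.com/home")
for url in ["http://www.abc.com/home/index.html",
            "http://www.abc.com",
            "http://www.abc.net"]:
    print(url, "->", classify(stored, url_segments(url)))
```

An implementation following the example in the text might accept `full` and `stored_is_prefix` as matches and reject the other two outcomes.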
  • FIG. 1 is a block diagram that illustrates components of the string matching system, in one embodiment.
  • the system 100 includes a string store 110 , a string loading component 120 , a segmentation component 130 , a segment matching component 140 , a node creation component 150 , and a search component 160 . Each of these components is described in further detail herein.
  • the string store 110 is a data store that stores data structures used by the string matching system 100 to hold a set of strings against which the system 100 searches for matches for a candidate string.
  • the data store may include persistent or in-memory data storage devices, including one or more memory units, hard drives, file systems, databases, cloud-based storage services, or other facility for storing data.
  • the string store 110 may include multiple instances of trees described herein used for matching candidate strings for different problem domains (e.g., a tree for matching URLs, a tree for matching file paths, and so forth).
  • the string store 110 may also include a combination of in-memory and persistent data storage based on frequency of use/reference to each node, and other performance or implementation considerations.
  • the string loading component 120 loads one or more strings into the string store 110 by invoking the segmentation component 130 to divide the strings into segments and locating or creating nodes in the tree to store each string segment.
  • the string loading component 120 may be invoked during a startup process that prepares a string data structure for use by an application that tests candidate strings for matches within the data structure. Alternatively or additionally, applications may add strings to the data structure as the application runs, by loading new strings as they are discovered or configured by an administrator, other user, or automated process.
  • the string loading component 120 determines whether segments of the string to be loaded are already present in the data structure and adds nodes for segments that are not already present.
  • the segmentation component 130 divides one or more strings into logical segments based on one or more segmentation criteria.
  • Many strings are such that they can be logically divided into independently recognizable or significant sub-strings.
  • Internet domain names can be divided by the periods (“.”) in the domain name and URLs can be divided by forward slashes (“/”) or backward slashes (“\”) as well as by periods (“.”) in the domain name portion of the URL, or any other method appropriate to a particular implementation.
  • Implementers of the string matching system may provide one or more segmentation criteria for dividing strings according to the problem domain addressed by their particular application.
  • the segmentation component 130 applies the segmentation criteria to divide strings received from the string loading component 120 to be loaded into the tree stored by the string store 110 and also to divide candidate strings provided by the search component 160 so that the system can determine whether particular segments of the candidate string match one or more stored strings.
  • the segment matching component 140 performs a segment-by-segment match of a candidate string to one or more stored strings to identify segments of the strings that match.
  • a match may be determined as a literal character match or by matching one or more literal characters to one or more pattern-match specifications. For example, some implementations may allow wildcard characters in the stored strings, candidate strings, or both, and matches may be determined by expanding these wildcards to match characters provided in the subject string.
  • the segment matching component 140 may also determine other criteria, such as whether the status of two segments as marking the end of a string or not marking the end of a string is the same.
  • the system stores segments that do not mark the end of a string separately (e.g., as separate hash table entries) from those segments that do mark the end of a string.
  • the segment matching component 140 applies these and any other application-specific criteria to determine whether two segments will be declared a match.
  • the segment matching component 140 determines whether an input string to be loaded is already represented in the string store 110 and, if not, can invoke the node creation component 150 to add a new node or hash table entry to the tree.
  • the node creation component 150 creates new nodes or hash table entries to represent string segments in the tree of stored strings. For example, as strings with more segments are loaded into the tree and no existing node represents the latter segments, the component 150 may create new nodes and add a reference from previous nodes to store the new string segments. Each node has a hash table and each created new node includes a new hash table with the strings located at that level. If a node already exists that is suitable for a particular segment, but no matching entry for that segment is found in the hash table, the node creation component 150 may add a hash table entry to represent the segment. Likewise, if the end status of the segment does not match the end status of an existing matching string segment, then the component 150 may add a new hash table entry based on the new end status.
  • the search component 160 receives a candidate string and searches the string store 110 for one or more matching strings.
  • the search component 160 invokes the segmentation component 130 to divide the candidate string into segments in the same way that strings loaded into the string store were previously divided. Then, the search component 160 invokes the segment matching component 140 to perform a segment-by-segment match of the segments of the candidate string to the segments of strings stored in the string store 110 . If a matching segment is found in the string store 110 for every segment of the candidate string, then the search component 160 declares a match and may return any metadata associated with the matching string entry. If no match is found for any segment of the candidate string, then the search component 160 may report no match. Some implementations may support partial matches and may report a portion of the candidate string for which a match is available in the string store 110 .
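A compact Python sketch of how these components might fit together is shown below. It handles literal segments only, and the class name `StringStore`, its methods, and the idea of keeping metadata in the entry for the last segment are all assumptions made for illustration.

```python
class StringStore:
    """Toy string store: loading and searching share one segmentation rule."""

    def __init__(self, delimiter="/"):
        self.delimiter = delimiter
        self.root = {}

    def _segments(self, text):
        # Plays the role of the segmentation component.
        return [s for s in text.split(self.delimiter) if s]

    def load(self, text, metadata=None):
        # Plays the role of the string loading and node creation components.
        node = self.root
        segs = self._segments(text)
        for i, seg in enumerate(segs):
            if i == len(segs) - 1:
                node[(seg, True)] = metadata   # last segment carries metadata
            else:
                node = node.setdefault((seg, False), {})

    def search(self, text):
        # Plays the role of the search and segment matching components.
        node = self.root
        segs = self._segments(text)
        for i, seg in enumerate(segs):
            key = (seg, i == len(segs) - 1)
            if key not in node:
                return False, None
            if key[1]:
                return True, node[key]
            node = node[key]
        return False, None


store = StringStore()
store.load("/windows/system32", {"kind": "system"})
print(store.search("/windows/system32"))
print(store.search("/windows"))
```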
  • the computing device on which the string matching system is implemented may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives or other non-volatile storage media).
  • the memory and storage devices are computer-readable storage media that may be encoded with computer-executable instructions (e.g., software) that implement or enable the system.
  • the data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communication link.
  • Various communication links may be used, such as the Internet, a local area network, a wide area network, a point-to-point dial-up connection, a cell phone network, and so on.
  • Embodiments of the system may be implemented in various operating environments that include personal computers, server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, digital cameras, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, set top boxes, systems on a chip (SOCs), and so on.
  • the computer systems may be cell phones, personal digital assistants, smart phones, personal computers, programmable consumer electronics, digital cameras, and so on.
  • the system may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices.
  • program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types.
  • functionality of the program modules may be combined or distributed as desired in various embodiments.
  • FIG. 2 is a flow diagram that illustrates processing of the string matching system to store new strings in the string store against which candidate strings will be matched, in one embodiment.
  • the system receives a string to be added to the string data store.
  • the system may receive the string from an application (e.g., in an extensible markup language (XML) file or as a parameter to an application programming interface (API)), as user input, via a network interface, or from another source.
  • the system adds the string to a store of strings against which to attempt to match later-received candidate strings using the following steps.
  • the system divides the received string into one or more segments.
  • the system may identify segments based on one or more delimiting characters (e.g., periods or slashes), a count of characters (e.g., every four characters is a segment), or other segmentation criteria suitable for a particular application.
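Count-based segmentation, the second criterion mentioned above, can be sketched in a few lines of Python. The function name `segment_by_count` is hypothetical.

```python
def segment_by_count(text, size=4):
    """Cut the string into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


print(segment_by_count("ABCDEFGHIJ"))  # ['ABCD', 'EFGH', 'IJ']
```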
  • the system selects a root node of the string data store that identifies an entry to a data structure for efficiently storing a set of strings in a tree-like form.
  • the data store may include a node for each level in the tree that represents a different possible following segment for the previous node. At the top is the root node that provides a reference to the first node in the data structure.
  • Each node of the tree may include a hash table or other data structure for efficiently looking up matching strings at that level of the tree.
  • the system selects the first segment of the received string. In some embodiments, the system may reorder segments to increase efficiency of storage or for other reasons.
  • domain names often have the most significant portion at the end, while virtual paths in a URL often have the most significant portions at the beginning.
  • the system may reverse the segments of the domain name portion (e.g., “com”->“microsoft” instead of “microsoft”->“com”).
  • the system selects the next segment of the string in the order determined.
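Reordering the domain portion of a URL might look like the following Python sketch, which uses the standard-library `urlsplit` to separate host and path. The function name `url_segments`, the example URL, and the decision to place the scheme first are assumptions for illustration.

```python
from urllib.parse import urlsplit


def url_segments(url):
    """Segment a URL with the domain labels reversed (most significant first)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    domain = [s for s in host.split(".") if s][::-1]
    path = [s for s in parts.path.split("/") if s]
    return [parts.scheme] + domain + path


print(url_segments("http://www.microsoft.com/en-us/windows"))
# ['http', 'com', 'microsoft', 'www', 'en-us', 'windows']
```

With this ordering, URLs under the same top-level and second-level domain share their leading nodes in the tree.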
  • If the system determines that all segments of the received string have been stored in the string data store, then the system completes, else the system continues in block 260.
  • the system processes each segment in turn until each segment has been processed.
  • the system navigates to the next node of the string data store. For example, if the current node is the root node, then navigating to the next node places the system at the level of a hash table containing the first set of string segments with which previously stored strings begin. If there is no additional node but the string to be stored contains additional segments, a new node is created that is referenced by the hash table entry of the current node, at which point the system navigates to the newly created node. As the system navigates deeper in the string data store, navigating to the next node follows a reference from the previous matching segment to determine whether a subsequent node matches the next segment of the received string.
  • the system performs a look up at the current node to determine whether the current string segment is already stored in the string data store at a current level.
  • the system may use a hash table at each node of the string data store and the look up may perform a look up using a hash key derived from the current string segment. If a matching string segment is found, then a string like the received string up to the current segment has previously been stored in the data store.
  • If a matching segment is found, the system loops to block 240 to select the next segment, else the system continues at block 290.
  • the system stores the non-matching segment as a new entry in the string data store at the current level, then loops to block 240 to select the next segment. If the system is using a hash table, then this block adds a new entry to the hash table that includes a reference to a new node that represents any subsequent segments in the received string. After block 290 , these steps conclude.
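The storage loop described in these steps can be approximated with the Python sketch below, which loads the example pattern strings listed for FIG. 4. The nested-dictionary representation, the `(segment, is_last)` key, and the name `add_string` are assumptions; the figure itself may lay the structure out differently.

```python
def add_string(root, text, delimiter=":"):
    """Store one string segment by segment, reusing entries that already exist."""
    segments = text.split(delimiter)
    node = root
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        key = (seg, is_last)
        if key not in node:
            # New hash table entry: None marks the end of a string,
            # otherwise a fresh node for the following segment.
            node[key] = None if is_last else {}
        node = node[key]
    return root


root = {}
for s in ["S1-A:S2-A:S3-A", "S1-B", "S1-A:*:S3-B",
          "S1-A:S2-B", "S1-A:*", "S1-A:S2-A"]:
    add_string(root, s)

print(sorted(root))                   # first-level entries
print(sorted(root[("S1-A", False)]))  # second-level entries shared under "S1-A"
```

Five of the six strings begin with “S1-A” followed by more segments, so they all pass through a single first-level entry.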
  • FIG. 3 is a flow diagram that illustrates processing of the string matching system to determine whether a candidate string matches any string stored in the string store, in one embodiment.
  • the system receives a candidate string for which to search the string store to identify zero or more matches to the candidate string.
  • the candidate string is provided by an application for a purpose specific to the application. For example, an anti-malware application may compare a received file path to a set of file paths stored in the string store that refer to interesting file locations (e.g., system or temporary files).
  • the system divides the received string into one or more segments.
  • the system may identify segments based on one or more delimiting characters (e.g., periods or slashes), a count of characters (e.g., every four characters is a segment), or other segmentation criteria suitable for a particular application.
  • the system divides the candidate string using the same segmentation criteria used to initially store strings in the string store.
  • the system selects a root node of the string data store that identifies an entry to a data structure for efficiently storing a set of strings in a tree-like form.
  • the data store may include a node for each level in the tree that represents a different possible following segment for the previous node. At the top is the root node that provides a reference to the first node in the data structure.
  • Each node of the tree may include a hash table or other data structure for efficiently looking up matching strings at that level of the tree.
  • the system selects the first segment of the received string.
  • the system may reorder segments to increase efficiency of storage or for other reasons. For example, domain names often have the most significant portion at the end, while virtual paths in a URL often have the most significant portions at the beginning. Thus, for a URL, the system may reverse the segments of the domain name portion (e.g., “com”->“microsoft” instead of “microsoft”->“com”). On subsequent iterations, the system selects the next segment of the string in the order determined.
  • If the system determines that all segments of the received string have been matched to one or more segments stored in the string data store, then the system continues at block 340 , else the system continues at block 330 .
  • the system processes each segment in turn until each segment has been matched or it is determined that there is no matching string in the string store.
  • In decision block 330 , if a child node exists for the current node, then the system continues at block 345 , else the system continues at block 335 .
  • a child node indicates that further potentially matching segments are available in the string store. Because the test in block 325 indicates that there are more segments to the candidate string, there will also be more segments in the string store unless no match exists in the string store.
  • In decision block 335 , if partial matches are accepted, the system continues at block 340 to report a partial match, else the system continues at block 375 to report no match.
  • the system reports a match.
  • the match may be full or partial, depending on application-specific preference. A full match indicates that some string previously stored in the string store matched the candidate string, either literally or by applying pattern-matching wildcard characters.
  • the system may report a match by returning a return value in response to an API call, providing a text response, or other common methods of reporting results.
  • the system selects the child node of the current matching segment in the string store. This is the child node determined to exist in block 330 .
  • the child node contains a hash table or other data structure that identifies segments of strings stored at the current level of the string store.
  • the system performs a look up at the current node to determine whether the current string segment matches a segment previously stored in the string data store at a current level. The system may use a hash table at each node of the string data store and the look up may perform a look up using a hash key derived from the current string segment. If a matching string segment is found, then a string like the received string up to the current segment has previously been stored in the data store.
  • the system determines whether a literal match for the current segment was found. If a literal match is found, then the system loops to block 320 to access the matching record and retrieve a reference to a node in the string store containing subsequent segments that may match the next segment(s) of the candidate string. If no literal match is found then the system continues in block 360 to identify pattern matches.
  • If the system determines that a pattern match exists, then the system loops to block 320 to continue processing the next candidate string segment.
  • the system separates processing of literal string matches from pattern string matches.
  • pattern strings are not typically well suited to inclusion in a hash table because they will not generate the same hash key as a literal candidate string segment that matches the pattern, leading to failure to find the matching pattern string in the hash table.
  • the system may perform slower or alternative processing to iterate through pattern strings and determine whether any match the provided candidate string segment.
  • If the system determines that a parent node of the current node exists, then the system continues at block 370, else the system continues at block 375 to report no match. Because patterns can be inclusive enough to span multiple segments, a failure to find a match at a particular level of the string store does not by itself indicate that no match exists. By navigating to the parent level (potentially multiple times), the system searches for pattern matches at higher levels that may encompass the current candidate string segment. For example, a pattern string “http://www.*”, where “*” is a wildcard pattern for matching any characters, would match many potential candidate strings having many more segments than this pattern string. Continuing in block 370, the system selects the parent node and loops to block 360 to determine whether a pattern match exists by comparing the current segment of the candidate string to the parent node's string patterns.
  • the system reports that no match was found for the candidate string.
  • the system may report no match by returning a return value in response to an API call, providing a text response, raising an exception, or other common methods of reporting results.
  • FIG. 4 is a data structure diagram that graphically illustrates contents of the string data store of the string matching system, in one embodiment.
  • FIG. 4 represents the storage of the following set of pattern strings in the data structure described: “S1-A:S2-A:S3-A”, “S1-B”, “S1-A:*:S3-B”, “S1-A:S2-B”, “S1-A:*”, and “S1-A:S2-A”.
  • the data structure starts with a root node 410 that contains a pointer to the first node 420 of the data structure containing strings.
  • Each string-containing node contains a hash table or other data structure that makes finding matching segments at that level efficient.
  • the first node 420 includes a hash table 430 .
  • the hash table 430 contains the strings “S1-A” and “S1-B” because these encompass the potential first segment values of all of the strings stored in the string store. Note that even though there are six strings stored in the string store, data that is similar for multiple strings may not be duplicated at each level, resulting in significant space savings.
  • the “S1-A” value in the hash table 430 contains a reference 440 to the second node 460 .
  • the second node 460 represents string segments that can follow “S1-A” for strings stored in the string store.
  • the second node 460 contains its own hash table 470 with both literal strings and pattern-match strings. Note that in the embodiment illustrated, the system separately stores segments that represent the end of a string from segments that point to further segments.
  • the segment value “S1-B” in the first node 420 points to a NULL reference 450 that indicates that no further segments follow that instance of “S1-B” for data presently stored in the string store.
  • the value “S2-A” is stored twice, once for strings that end in “S2-A” and again for strings that continue with one or more other segments.
  • the data structure described allows for both space-efficient storage and fast matching of candidate strings to the strings stored in the string store.
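  • The layout described for FIG. 4 can be approximated with nested dictionaries. The following is a minimal sketch, not the patented implementation: it assumes each node is a Python dictionary keyed by a (segment, ends-a-string) pair, with `None` standing in for the NULL reference. The names `insert`, `first_node`, and `second_node` are hypothetical, and literal and pattern segments are kept in the same table for brevity.

```python
# Hypothetical sketch: each node is a dict acting as the per-level hash table.
# Key:   (segment text, True if the segment ends a stored string)
# Value: the child node (another dict), or None for the NULL reference.

def insert(root, pattern, delimiter=":"):
    segments = pattern.split(delimiter)
    node = root
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        key = (seg, is_last)
        if is_last:
            node.setdefault(key, None)   # terminal entry, no further segments
        else:
            if node.get(key) is None:
                node[key] = {}           # create the node for following segments
            node = node[key]
    return root

first_node = {}
for p in ["S1-A:S2-A:S3-A", "S1-B", "S1-A:*:S3-B",
          "S1-A:S2-B", "S1-A:*", "S1-A:S2-A"]:
    insert(first_node, p)

# Node holding the segments that can follow "S1-A".
second_node = first_node[("S1-A", False)]
```

  • In this sketch the first node holds only two entries for six stored strings, and “S2-A” appears twice in the second node, once as a terminal entry and once as an entry that leads to further segments.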
  • the string matching system uses a Boolean value to indicate segment termination for partial matches.
  • Where partial matches are allowed (i.e., are treated as matches), the system may distinguish segments that end a string with a simple Boolean value rather than a null terminator or other delimiter. This allows the strings loaded in the string store to continue on, potentially beyond the length of provided candidate strings.
  • the string matching system combines node type or other application criteria at each segment to determine a match.
  • node types may include scheme (e.g., http), domain (e.g., .com), address (e.g., facebook), directory, and so forth.
  • the system may test that a candidate string's segments have both a matching type and a matching value at each level before declaring a match. This allows an administrator who wants to match a specific address or other typed part of a string to distinguish type-based matches from standard string matches.
  • the system may accept partial matches based on application-specified preferences.
  • the string matching system allows the application to determine particular portions of strings for which partial matches are accepted. For example, using URLs, a particular application may allow partial matches for the virtual directory portion of the URL, but not for the domain name.
  • the system may provide settings for partial matching that include a text description, such as “always”, “never”, “only after address”, and so forth.
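  • Type-aware matching and the partial-match settings can be combined in a few lines. The sketch below is illustrative only: it assumes each segment is a (node type, value) pair and that the setting is one of the three text descriptions mentioned above. The function name `match` and the sample data are hypothetical.

```python
# Hypothetical sketch: segments carry a node type as well as a value.
stored = [("scheme", "http"), ("address", "facebook"), ("domain", "com")]

def match(stored, candidate, partial="never"):
    matched_types = []
    for s, c in zip(stored, candidate):
        if s != c:                      # both type and value must agree
            return False
        matched_types.append(s[0])
    if len(candidate) == len(stored):
        return True                     # full match
    if len(candidate) < len(stored):
        return False                    # candidate ends before the stored string
    # The candidate continues past the stored string: apply the setting.
    if partial == "always":
        return True
    if partial == "only after address":
        return "address" in matched_types
    return False                        # "never" (and any unknown setting)

longer = stored + [("directory", "photos")]
```

  • With these assumptions, `match(stored, longer, "only after address")` accepts the candidate because an address segment was matched before the stored string ended, while `match(stored, longer, "never")` rejects it.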
  • the string matching system facilitates faster updates of the string data store using the data structure described herein. Because the string data is subdivided as described herein, updates only affect the leaf nodes of each string or the farthest current segment in the string store that is also in the string to be added. Because updates often lock the data structure being updated to prevent interference by other threads, the subdivision described herein limits the level at which a lock is needed, so that updates are less likely to impact nodes that are currently being used in a search. The system can quickly lock the furthest node to which the added string pertains, then unlock that node after the update. During that time, searches for strings that do not visit that node are unaffected by the update lock.
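  • One way to picture the narrow locking described above is a lock per node that is taken only when that node's table must change. This is a simplified sketch under stated assumptions: the `Node` class and `add` function are hypothetical, end-of-string flags are omitted, and the unlocked read relies on dictionary lookups being safe alongside writes in CPython.

```python
import threading

class Node:
    def __init__(self):
        self.lock = threading.Lock()
        self.table = {}                 # segment -> child Node

def add(root, segments):
    node = root
    for seg in segments:
        child = node.table.get(seg)     # existing entries are read without locking
        if child is None:
            with node.lock:             # lock only the node that must change
                # setdefault re-checks under the lock in case another
                # thread added the same segment in the meantime.
                child = node.table.setdefault(seg, Node())
        node = child
```

  • Nodes above the point where a new segment is added are never locked in this sketch, so operations that do not need to modify the locked node can proceed during an update.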
  • Although strings have been described, the techniques described herein can also be used with other types of data, such as binary streams.
  • Anti-virus software, for example, often scans executable code for known malicious binary streams.
  • the system described can store particular binary streams to be matched against input streams to identify malicious software code. Accordingly, the invention is not limited except as by the appended claims.

Abstract

A string matching system is described herein that provides for very fast and efficient pattern matching against large sets of certain types of pattern strings. If a set of pattern strings is comprised of strings that can be logically divided into segments, these pattern strings can be stored efficiently in a tree-like data structure, segment by segment, storing pattern-match syntax segments separately from literal string segments. After segmentation, individual match candidate strings are compared, segment by segment, against elements in the tree. The system uses a data structure that is conceptually a combination of a tree and a hash table. Using a hash table to implement the branching elements at each level in the tree contributes to quick matching speed at each level. By consistently separating strings into segments, the system can also share storage locations for like elements in the data structure.

Description

    BACKGROUND
  • Numerous computer science problems involve searching strings of characters to find two matching or partially matching strings. Many types of data involve strings, including addresses, names, file paths, Uniform Resource Locators (URLs), and so forth. Strings can be stored in a variety of ways in a computer system, such as one or more characters terminated by a predetermined identifier (e.g., null-terminated strings) or counted data structures that store a string length and array of characters. String matching may include comparing a single source string to a single target string, comparing a single source string to a set of multiple target strings, comparing multiple source strings to multiple target strings, and so forth.
  • In many software applications, string matching consumes a significant quantity of the computer hardware resources (e.g., processor time or memory space). Modern desktop search programs often spend a large percentage of their execution time comparing a search query string with many possible matches in a search index. Thus, performance of applications can be noticeably affected by the algorithms and data structures selected by application developers to store and manipulate strings.
  • Standard techniques for matching a source string against a large set of target pattern strings are inefficient and expensive. For example, many techniques iterate through each potential target string, comparing characters until a mismatch is found before moving to the next potential target string. The running time of this technique grows with every new target string added to the set and with the length of the source string.
  • SUMMARY
  • A string matching system is described herein that provides for very fast and efficient pattern matching against large sets of certain types of pattern strings. Pattern strings are strings that include one or more literal characters and/or wildcard characters that can match one or more literal characters. If a set of pattern strings is comprised of strings that can be logically divided into segments, these pattern strings can be stored efficiently in a tree-like data structure, segment by segment, storing pattern-match syntax segments separately from literal string segments. Pattern-match syntax segments may include “wild card” characters (e.g., “*” or “?”) that can be replaced by one or more literal characters to form a match. After segmentation of candidate strings, individual match candidate strings can be compared, segment by segment, against elements in the tree to quickly determine if the candidate matches any of the original pattern strings. In some embodiments, the string matching system uses a data structure that is conceptually a combination of a tree and a hash table. Using a hash table to implement the branching elements at each level in the tree contributes to quick matching speed at each level. By consistently separating strings into segments, the system can also share storage locations for like elements in the data structure. Thus, the system allows fast string matching even with large data sets.
  • This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram that illustrates components of the string matching system, in one embodiment.
  • FIG. 2 is a flow diagram that illustrates processing of the string matching system to store new strings in the string store against which candidate strings will be matched, in one embodiment.
  • FIG. 3 is a flow diagram that illustrates processing of the string matching system to determine whether a candidate string matches any string stored in the string store, in one embodiment.
  • FIG. 4 is a data structure diagram that graphically illustrates contents of the string data store of the string matching system, in one embodiment.
  • DETAILED DESCRIPTION
  • A string matching system is described herein that provides for very fast and efficient pattern matching against large sets of certain types of pattern strings. If a set of pattern strings is comprised of strings that can be logically divided into segments, these pattern strings can be stored efficiently in a tree-like data structure, segment by segment, storing pattern-match syntax segments separately from literal string segments. Pattern-match syntax segments may include “wild card” characters (e.g., “*” or “?”) that can be replaced by one or more literal characters to form a match. After segmentation, individual match candidate strings can be compared, segment by segment, against elements in the tree to quickly determine if the candidate matches any of the original pattern strings. In some embodiments, the string matching system uses a data structure that is conceptually a combination of a tree and a hash table. Using a hash table to implement the branching elements at each level in the tree contributes to quick matching speed at each level. By consistently separating strings into segments, the system can also share storage locations for like elements in the data structure. Thus, the system allows fast string matching even with large data sets.
  • In some embodiments, the string matching system is applied to applications in which a goal is to determine, given a set of pattern strings and a separate match candidate string, whether any of the pattern strings in the set match the candidate string. One straightforward approach to solving this problem is to enumerate each pattern string in the set, compare it to the match candidate, and determine if there is a match. For example, given this set: “aaa”, “bbb”, “ccc”, and the match candidate string “bbb”, a common approach would first compare “bbb” to “aaa”. Upon finding these strings do not match, the approach next compares “bbb” to “bbb” and, having found a match, processing terminates. For literal strings, as in this example, matching can be done more efficiently by using a hash table instead of enumerating each string and comparing values individually.
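  • The two literal-matching approaches in this example can be contrasted in a short sketch. The function names are hypothetical, and Python's built-in `set` is used here as the hash table.

```python
patterns = ["aaa", "bbb", "ccc"]

def match_by_enumeration(candidate, patterns):
    # Compare against each stored string in turn; cost grows with the set size.
    for p in patterns:
        if p == candidate:
            return True
    return False

# A hash-based set answers the same question with a single lookup on average.
pattern_set = set(patterns)

def match_by_hash(candidate, pattern_set):
    return candidate in pattern_set
```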
  • Additional flexibility can be provided by using some form of pattern matching syntax. If we assume that the ‘#’ character is interpreted to mean “any alphabetic character”, we could change our set to the following: “a##”, “b##”, and “c##”. If the match candidate string is still “bbb”, we would use the same technique and find that “bbb” matches “b##” by simply extending the matching algorithm to match any character to the ‘#’ character. While this general approach is sufficient for small sets of pattern strings, the cost of finding a match increases linearly as the pattern strings become longer and as the set becomes larger. With very large sets of strings, the standard approach to this problem quickly becomes insufficient. Furthermore, once patterns are introduced, hash tables become useless for lookups because the match candidate strings will not generate the same hash values as the pattern strings that are stored.
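  • The ‘#’ example can be sketched as follows. This is an illustration only, assuming ‘#’ matches exactly one alphabetic character and every other character is literal; the names `matches` and `find_match` are hypothetical.

```python
def matches(pattern, candidate):
    # '#' stands for any single alphabetic character; other characters are literal.
    if len(pattern) != len(candidate):
        return False
    return all((p == "#" and c.isalpha()) or p == c
               for p, c in zip(pattern, candidate))

patterns = ["a##", "b##", "c##"]

def find_match(candidate):
    for p in patterns:                  # linear scan over the whole set
        if matches(p, candidate):
            return p
    return None

# A direct hash lookup misses, because "bbb" does not hash like "b##".
hash_lookup_hit = "bbb" in set(patterns)
```

  • Here `find_match("bbb")` has to walk the list to reach “b##”, while `hash_lookup_hit` is false, which is the limitation the paragraph describes.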
  • The string matching system addresses this problem for certain types of pattern strings that are easily and consistently separable into segments and where pattern match portions can be separated into distinct segments from literal match portions. File paths are examples of strings that fit these rules. As an example, consider an automatic disk cleanup application. Configuration for this application may specify to delete all files in temporary directories and all files that end with the suffix “tmp” or “temp”. Using an asterisk (“*”) to represent a sequence of one or more matching characters, the list of pattern strings to represent this configuration might look like this: “/temp/*”, “/home/*/temp/*”, “*.tmp”, and “*.temp”. These strings meet the criteria specified. They are easily and consistently separable (the slash (“/”) and period (“.”) characters are used to separate elements of the path specification), and the pattern match element, the asterisk (“*”), can be separated into its own segment.
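  • Segmenting the example path patterns might look like the sketch below. It assumes that both “/” and “.” act as delimiters and that empty pieces (such as the one before a leading slash) are dropped; the function name `segment` is hypothetical, and an implementation could instead treat empty segments as significant.

```python
import re

def segment(path):
    # Split on "/" and "."; drop empty pieces produced by a leading delimiter.
    return [s for s in re.split(r"[/.]", path) if s]

examples = ["/temp/*", "/home/*/temp/*", "*.tmp", "*.temp"]
segmented = {p: segment(p) for p in examples}
```

  • Under these assumptions each asterisk lands in a segment of its own, for example `["home", "*", "temp", "*"]` for “/home/*/temp/*”.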
  • Splitting the strings into segments offers two possible increases in efficiency. First, pattern matching is only performed on the smallest portions of the candidate match strings, not the entire string, limiting the scope in which less efficient pattern matching algorithms are employed. Second, common elements can be shared. As an extension of the previous example, suppose that instead of simply deleting all the files in each user's “temp” directory, the configuration wanted to specify that only certain types of files or files in certain subfolders be deleted:
  • “/home/*/temp/*.jpg”
    “/home/*/temp/*.mp3”
    “/home/*/temp/*.avi”
    “/home/*/temp/downloaded-files/*”
    “/home/*/temp/internet-cache/*”
    “/home/*/temp/temporary-application-files/*”
  • The first three segments (“home”, “*”, and “temp”) are the same for each of these pattern strings and can therefore be shared. To implement this technique, the string matching system creates a tree-like data structure that represents the pattern strings in a format that leverages the efficiencies described. The tree is comprised of a root node that acts as the starting access point to the data structure, and zero or more tree nodes that contain the pattern string data.
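  • The saving from sharing can be estimated with a short sketch. For brevity it assumes segmentation on slashes only, so that “*.jpg” stays a single segment; the helper names are hypothetical.

```python
patterns = [
    "/home/*/temp/*.jpg",
    "/home/*/temp/*.mp3",
    "/home/*/temp/*.avi",
    "/home/*/temp/downloaded-files/*",
    "/home/*/temp/internet-cache/*",
    "/home/*/temp/temporary-application-files/*",
]

def segment(path):
    return [s for s in path.split("/") if s]

def shared_prefix(seg_lists):
    # Collect leading segments that are identical across every pattern.
    prefix = []
    for column in zip(*seg_lists):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix

seg_lists = [segment(p) for p in patterns]
prefix = shared_prefix(seg_lists)

# Segments stored if every pattern keeps its own copy of each segment.
unshared = sum(len(s) for s in seg_lists)
# Segments stored if identical prefixes are stored once, as in a tree.
shared = len({tuple(s[:i + 1]) for s in seg_lists for i in range(len(s))})
```

  • For these six patterns the sketch finds the prefix `["home", "*", "temp"]` and counts 27 stored segments without sharing versus 12 with sharing.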
  • Each tree node contains a hash table that associates a specific string segment with a reference to the node that contains the next segment for that pattern string (see FIG. 4 described further herein). Thus, if two pattern strings have the same segment in common at the same location in the string, they will share the hash table entry for that particular node. In some embodiments, segments are defined as being the same if: 1) the string values of both segments are identical (or can be considered identical based on a pattern match) and 2) the status of the segment as being the last segment in the string is the same (i.e., both segments represent the last segment in their respective strings, or both segments do not represent the last segment in their respective strings). Other factors may be considered as a part of the segment-matching test, depending on implementation. Once a differing segment is found, an entry will be created in that node's hash table for both of the distinct segments. Thus, both strings can be represented without unnecessarily duplicating storage. A hash table entry for the last segment of a string includes some indication that it is the last segment (shown in FIG. 4 as a NULL reference).
  • In some embodiments, storage of pattern strings in the data structure is as follows. The system first divides each string into segments (called segmentation herein). The system then stores each segment at the corresponding level in the tree structure according to the above rules described for the data structure (see FIG. 2 described herein). FIG. 4 represents the storage of the following set of pattern strings in the data structure described: “S1-A:S2-A:S3-A”, “S1-B”, “S1-A:*:S3-B”, “S1-A:S2-B”, “S1-A:*”, and “S1-A:S2-A”. Note that in this example the system can easily segment the strings at each colon (“:”). For URLs, both periods (“.”) and slashes (“/” and “\”) may be used to segment strings, as well as combinations of characters (e.g., “://” to separate “http” from a domain name) or other delimiters. Other domains may contain other logical divisions that the system can use to divide strings into segments. In some embodiments, the system may store empty segments in some cases. For example, for some implementations two delimiters in a row might be ignored while in other implementations such a combination may indicate an empty segment that has logical significance to the implementation.
  • Determining whether a given string matches the set of pattern strings uses similar logic (see FIG. 3 described herein). The system divides the candidate string into segments in the same way it would if the string were being stored in the data structure. Then, starting with the first segment of the string, the system tests for a match for the first segment in the hash table of the first node of the tree. The test for whether segments match is the same test that is used during storage, to determine whether a segment already exists or will be created. In other words, the status of the segment as the last in the string is considered as a part of the match test; again, other factors may be considered depending on implementation. If a match is found, the system follows the reference in the hash table to the next node, where a match for the next segment of the candidate string is sought.
  • While testing for a match, if no literal match is found in the current node's hash table for that candidate segment, the system then tests for a match against the pattern strings stored at that node. Those of ordinary skill in the art will recognize that this order can be reversed so that pattern strings are tested before literal matches in some implementations. If a match is found, the reference to the next node is followed and matching continues. If no match is found, processing returns to the previous node, where this same algorithm is applied. Lacking a match, the system continues applying the same logic to the previous node until there is no previous node or a match is found. If, at any point during the matching effort, the reference in the matching element of the hash table refers to no node (e.g., NULL), matching efforts are complete. The candidate string matches up to this point. If there are additional segments in the candidate string for which no match was found, then the system may declare a partial match, otherwise the system may declare no match. Whether partial matches are valid is implementation dependent. A partial match is one in which at least one segment of the candidate string has a match stored in the tree and at least one segment does not. A full match is one in which all of the segments of the candidate string have matching segments stored in the tree.
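  • The literal-first, pattern-second search with a return to the previous node can be sketched as below. This is one possible reading, not the patented implementation: it handles full matches only, uses recursion so that a failed branch unwinds to the previous node, and borrows `fnmatch.fnmatchcase` from the Python standard library for wildcard comparison. The `Node` class and the `store`, `is_pattern`, and `matches` functions are hypothetical names.

```python
from fnmatch import fnmatchcase

class Node:
    def __init__(self):
        self.literals = {}   # (segment, is_last) -> child Node or None
        self.patterns = []   # (pattern segment, is_last, child Node or None)

def is_pattern(seg):
    return "*" in seg or "?" in seg

def store(root, segments):
    node = root
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if is_pattern(seg):
            entry = next((e for e in node.patterns
                          if e[0] == seg and e[1] == is_last), None)
            if entry is None:
                entry = (seg, is_last, None if is_last else Node())
                node.patterns.append(entry)
            node = entry[2]
        else:
            key = (seg, is_last)
            if key not in node.literals:
                node.literals[key] = None if is_last else Node()
            node = node.literals[key]

def matches(node, segments, i=0):
    if i >= len(segments):
        return False
    seg = segments[i]
    is_last = i == len(segments) - 1
    # 1) Literal lookup through the node's hash table.
    key = (seg, is_last)
    if key in node.literals:
        if is_last or matches(node.literals[key], segments, i + 1):
            return True
    # 2) Slower scan over the pattern segments stored at this node.
    for pat, pat_last, child in node.patterns:
        if pat_last == is_last and fnmatchcase(seg, pat):
            if is_last or matches(child, segments, i + 1):
                return True
    # 3) Returning False hands control back to the previous node.
    return False

root = Node()
for p in ["S1-A:S2-A:S3-A", "S1-B", "S1-A:*:S3-B",
          "S1-A:S2-B", "S1-A:*", "S1-A:S2-A"]:
    store(root, p.split(":"))
```

  • With the FIG. 4 strings loaded, the candidate “S1-A:S2-A:S3-B” first follows the literal “S2-A” branch, fails at the third level, returns to the second node, and then succeeds through the “*” pattern stored there.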
  • The system is flexible to allow specific implementations to determine the criteria for what is a match, what is a partial match, or what is not a match, and the examples provided herein are merely to illustrate the capabilities of the system. For example, suppose the string “http://www.abc.com/home” is stored in the data store. Then suppose the system receives the string “http://www.abc.com/home/index.html” to see if it matches. An implementation may decide that the candidate string matches even if the last part of the path (“index.html”) is not included in the data store; in other words, the implementation may decide that the trailing parts of the path are not important for matching. However, another candidate string may be “http://www.abc.com” and in this case, it may be that not matching the “home” part of the path may mean there is NOT a match. Likewise, another candidate string “http://www.abc.net” may also be considered not a match because the domain name portion may be considered significant. Thus, the rules of the implementation, location within the pattern or candidate string, context, matching in other parts of the string, or any number of other conditions could apply in any particular case.
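  • The URL example can be expressed as one possible classification policy. The sketch below assumes a single stored string, segmentation on “://”, “/”, and “.”, and a policy in which extra trailing candidate segments yield a partial match; the names `segment_url` and `classify` are hypothetical, and an implementation could apply different rules.

```python
import re

def segment_url(url):
    # Split on "://", "/" and "."; drop empty pieces.
    return [s for s in re.split(r"://|[/.]", url) if s]

def classify(stored, candidate):
    s, c = segment_url(stored), segment_url(candidate)
    if c[:len(s)] != s:
        return "none"        # shorter candidates and differing segments
    return "full" if len(c) == len(s) else "partial"

stored = "http://www.abc.com/home"
results = {
    "longer": classify(stored, "http://www.abc.com/home/index.html"),
    "shorter": classify(stored, "http://www.abc.com"),
    "other_domain": classify(stored, "http://www.abc.net"),
}
```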
  • FIG. 1 is a block diagram that illustrates components of the string matching system, in one embodiment. The system 100 includes a string store 110, a string loading component 120, a segmentation component 130, a segment matching component 140, a node creation component 150, and a search component 160. Each of these components is described in further detail herein.
  • The string store 110 is a data store that stores data structures used by the string matching system 100 to hold a set of strings against which the system 100 searches for matches for a candidate string. The data store may include persistent or in-memory data storage devices, including one or more memory units, hard drives, file systems, databases, cloud-based storage services, or other facility for storing data. The string store 110 may include multiple instances of trees described herein used for matching candidate strings for different problem domains (e.g., a tree for matching URLs, a tree for matching file paths, and so forth). The string store 110 may also include a combination of in-memory and persistent data storage based on frequency of use/reference to each node, and other performance or implementation considerations.
  • The string loading component 120 loads one or more strings into the string store 110 by invoking the segmentation component 130 to divide the strings into segments and locating or creating nodes in the tree to store each string segment. The string loading component 120 may be invoked during a startup process that prepares a string data structure for use by an application that tests candidate strings for matches within the data structure. Alternatively or additionally, applications may add strings to the data structure as the application runs, by loading new strings as they are discovered or configured by an administrator, other user, or automated process. The string loading component 120 determines whether segments of the string to be loaded are already present in the data structure and adds nodes for segments that are not already present.
  • The segmentation component 130 divides one or more strings into logical segments based on one or more segmentation criteria. Many strings are such that they can be logically divided into independently recognizable or significant sub-strings. For example, Internet domain names can be divided by the periods (“.”) in the domain name and URLs can be divided by forward slashes (“/”) or backward slashes (“\”) as well as by periods (“.”) in the domain name portion of the URL, or any other method appropriate to a particular implementation. Implementers of the string matching system may provide one or more segmentation criteria for dividing strings according to the problem domain addressed by their particular application. The segmentation component 130 applies the segmentation criteria to divide strings received from the string loading component 120 to be loaded into the tree stored by the string store 110 and also to divide candidate strings provided by the search component 160 so that the system can determine whether particular segments of the candidate string match one or more stored strings.
  • The segment matching component 140 performs a segment-by-segment match of a candidate string to one or more stored strings to identify segments of the strings that match. A match may be determined as a literal character match or by matching one or more literal characters to one or more pattern-match specifications. For example, some implementations may allow wildcard characters in the stored strings, candidate strings, or both, and matches may be determined by expanding these wildcards to match characters provided in the subject string. The segment matching component 140 may also determine other criteria, such as whether the status of two segments as marking the end of a string or not marking the end of a string is the same. In some embodiments, the system stores segments that do not mark the end of a string separately (e.g., as separate hash table entries) from those segments that do mark the end of a string. The segment matching component 140 applies these and any other application-specific criteria to determine whether two segments will be declared a match. When adding strings to the string store, the segment matching component 140 determines whether an input string to be loaded is already represented in the string store 110 and, if not, can invoke the node creation component 150 to add a new node or hash table entry to the tree.
  • The node creation component 150 creates new nodes or hash table entries to represent string segments in the tree of stored strings. For example, as strings with more segments are loaded into the tree and no existing node represents the latter segments, the component 150 may create new nodes and add a reference from previous nodes to store the new string segments. Each node has a hash table, and each newly created node includes a new hash table with the strings located at that level. If a node already exists that is suitable for a particular segment, but no matching entry for that segment is found in the hash table, the node creation component 150 may add a hash table entry to represent the segment. Likewise, if the end status of the segment does not match the end status of an existing matching string segment, then the component 150 may add a new hash table entry based on the new end status.
  • The search component 160 receives a candidate string and searches the string store 110 for one or more matching strings. The search component 160 invokes the segmentation component 130 to divide the candidate string into segments similar to how strings loaded into the string store were previously divided into segments. Then, the search component 160 invokes the segment matching component 140 to perform a segment-by-segment match of the segments of the candidate string to the segments of strings stored in the string store 110. If a segment is found in the string store 110 that matches each segment of the candidate string, then the search component 160 declares a match and may return any metadata associated with the matching string entry. If no match is found for any segment of the candidate string, then the search component 160 may report no match. Some implementations may support partial matches and may report a portion of the candidate string for which a match is available in the string store 110.
  • The computing device on which the string matching system is implemented may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives or other non-volatile storage media). The memory and storage devices are computer-readable storage media that may be encoded with computer-executable instructions (e.g., software) that implement or enable the system. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communication link. Various communication links may be used, such as the Internet, a local area network, a wide area network, a point-to-point dial-up connection, a cell phone network, and so on.
  • Embodiments of the system may be implemented in various operating environments that include personal computers, server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, digital cameras, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, set top boxes, systems on a chip (SOCs), and so on. The computer systems may be cell phones, personal digital assistants, smart phones, personal computers, programmable consumer electronics, digital cameras, and so on.
  • The system may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
  • FIG. 2 is a flow diagram that illustrates processing of the string matching system to store new strings in the string store against which candidate strings will be matched, in one embodiment. Beginning in block 210, the system receives a string to be added to the string data store. The system may receive the string from an application (e.g., in an extensible markup language (XML) file or as a parameter to an application programming interface (API)), as user input, via a network interface, or from another source. The system adds the string to a store of strings against which to attempt to match later-received candidate strings using the following steps. Continuing in block 220, the system divides the received string into one or more segments. The system may identify segments based on one or more delimiting characters (e.g., periods or slashes), a count of characters (e.g., every four characters is a segment), or other segmentation criteria suitable for a particular application.
  • Continuing in block 230, the system selects a root node of the string data store that identifies an entry to a data structure for efficiently storing a set of strings in a tree-like form. For example, the data store may include a node for each level in the tree that represents a different possible following segment for the previous node. At the top is the root node that provides a reference to the first node in the data structure. Each node of the tree may include a hash table or other data structure for efficiently looking up matching strings at that level of the tree. Continuing in block 240, the system selects the first segment of the received string. In some embodiments, the system may reorder segments to increase efficiency of storage or for other reasons. For example, domain names often have the most significant portion at the end, while virtual paths in a URL often have the most significant portions at the beginning. Thus, for a URL, the system may reverse the segments of the domain name portion (e.g., “com”->“microsoft” instead of “microsoft”->“com”). On subsequent iterations, the system selects the next segment of the string in the order determined.
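The segment reordering described above for URLs might look like the following sketch, which reverses the domain labels and keeps the path segments in their original order. The helper name `url_segments` is an assumption for illustration; the parsing relies on the standard `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit

def url_segments(url):
    """Return URL segments with the most significant portion first.

    Domain labels are reversed ("com" before "microsoft"); path segments
    keep their left-to-right order. The scheme is ignored in this sketch.
    """
    parts = urlsplit(url)
    host_labels = [label for label in (parts.hostname or "").split(".") if label]
    path_pieces = [piece for piece in parts.path.split("/") if piece]
    return list(reversed(host_labels)) + path_pieces

print(url_segments("http://www.microsoft.com/en-us/windows"))
```

Placing "com" at the first level means that all stored strings under the same top-level domain share one entry near the root.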
  • Continuing in decision block 250, if the system determines that all segments of the received string have been stored in the string data store, then the system completes, else the system continues in block 260. The system processes each segment in turn until each segment has been processed. Continuing in block 260, the system navigates to the next node of the string data store. For example, if the current node is the root node, then navigating to the next node places the system at the level of a hash table containing the first set of string segments with which previously stored strings begin. If there is no additional node but the string to be stored contains additional segments, a new node is created that is referenced by the hash table entry of the current node, at which point the system navigates to the newly created node. As the system navigates deeper in the string data store, navigating to the next node follows a reference from the previous matching segment to determine whether a subsequent node matches the next segment of the received string.
  • Continuing in block 270, the system performs a look up at the current node to determine whether the current string segment is already stored in the string data store at a current level. The system may use a hash table at each node of the string data store and the look up may perform a look up using a hash key derived from the current string segment. If a matching string segment is found, then a string like the received string up to the current segment has previously been stored in the data store.
  • Continuing in decision block 280, if the system found a matching string segment, then the system loops to block 240 to select the next segment, else the system continues at block 290. Continuing in block 290, the system stores the non-matching segment as a new entry in the string data store at the current level, then loops to block 240 to select the next segment. If the system is using a hash table, then this block adds a new entry to the hash table that includes a reference to a new node that represents any subsequent segments in the received string. After block 290, these steps conclude.
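The storing steps of FIG. 2 can be sketched roughly as below. This is a simplified illustration under stated assumptions: the root object doubles as the first node, each node's hash table is a Python dict, and an end-of-string flag replaces separate storage of terminating segments. The names `Node`, `Entry`, and `add_string` are hypothetical.

```python
class Entry:
    """A hash table entry for one stored segment."""
    def __init__(self):
        self.child = None      # Node holding segments that may follow, or None
        self.terminal = False  # True when some stored string ends here

class Node:
    """One level of the store: a hash table from segment to Entry."""
    def __init__(self):
        self.entries = {}

def add_string(root, text, delimiter=":"):
    """Store text segment by segment; return how many new entries were created."""
    segments = text.split(delimiter)
    node = root
    created = 0
    for index, segment in enumerate(segments):
        entry = node.entries.get(segment)   # look up at the current level
        if entry is None:                   # non-matching segment: store it
            entry = Entry()
            node.entries[segment] = entry
            created += 1
        if index == len(segments) - 1:
            entry.terminal = True
        else:
            if entry.child is None:         # no further node yet: create one
                entry.child = Node()
            node = entry.child              # navigate to the next node
    return created

root = Node()
print(add_string(root, "S1-A:S2-A:S3-A"))  # all three segments are new
print(add_string(root, "S1-A:S2-B"))       # only "S2-B" is new
print(add_string(root, "S1-A:S2-A"))       # nothing new, only the end flag is set
```

The return value makes the sharing visible: a string that repeats an already stored prefix adds entries only for the segments that differ.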
  • FIG. 3 is a flow diagram that illustrates processing of the string matching system to determine whether a candidate string matches any string stored in the string store, in one embodiment. Beginning in block 305, the system receives a candidate string for which to search the string store to identify zero or more matches to the candidate string. The candidate string is provided by an application for a purpose specific to the application. For example, an anti-malware application may compare a received file path to a set of file paths stored in the string store that refer to interesting file locations (e.g., system or temporary files).
  • Continuing in block 310, the system divides the received string into one or more segments. The system may identify segments based on one or more delimiting characters (e.g., periods or slashes), a count of characters (e.g., every four characters is a segment), or other segmentation criteria suitable for a particular application. The system divides the candidate string using the same segmentation criteria used to initially store strings in the string store. Continuing in block 315, the system selects a root node of the string data store that identifies an entry to a data structure for efficiently storing a set of strings in a tree-like form. For example, the data store may include a node for each level in the tree that represents a different possible following segment for the previous node. At the top is the root node that provides a reference to the first node in the data structure. Each node of the tree may include a hash table or other data structure for efficiently looking up matching strings at that level of the tree.
  • Continuing in block 320, the system selects the first segment of the received string. In some embodiments, the system may reorder segments to increase efficiency of storage or for other reasons. For example, domain names often have the most significant portion at the end, while virtual paths in a URL often have the most significant portions at the beginning. Thus, for a URL, the system may reverse the segments of the domain name portion (e.g., “com”->“microsoft” instead of “microsoft”->“com”). On subsequent iterations, the system selects the next segment of the string in the order determined.
  • Continuing in decision block 325, if the system determines that all segments of the received string have been matched to one or more segments stored in the string data store, then the system continues at block 340, else the system continues at block 330. The system processes each segment in turn until each segment has been matched or it is determined that there is no matching string in the string store. Continuing in decision block 330, if a child node exists for the current node, then the system continues at block 345, else the system continues at block 335. A child node indicates that further potentially matching segments are available in the string store. Because the test in block 325 indicates that there are more segments to the candidate string, there will also be more segments in the string store unless no match exists in the string store.
  • Continuing in decision block 335, if partial matches are allowed and at least one previous segment had a match (or other criteria determined by the particular implementation of the system), then the system continues at block 340 to report a partial match, else the system continues at block 375 to report no match. Continuing in block 340, the system reports a match. The match may be full or partial, depending on application-specific preference. A full match indicates that some string previously stored in the string store matched the candidate string, either literally or by applying pattern-matching wildcard characters. The system may report a match by returning a return value in response to an API call, providing a text response, or other common methods of reporting results.
  • Continuing in block 345, the system selects the child node of the current matching segment in the string store. This is the child node determined to exist in block 330. The child node contains a hash table or other data structure that identifies segments of strings stored at the current level of the string store. Continuing in block 350, the system performs a look up at the current node to determine whether the current string segment matches a segment previously stored in the string data store at a current level. The system may use a hash table at each node of the string data store and the look up may perform a look up using a hash key derived from the current string segment. If a matching string segment is found, then a string like the received string up to the current segment has previously been stored in the data store.
  • Continuing in decision block 355, the system determines whether a literal match for the current segment was found. If a literal match is found, then the system loops to block 320 to access the matching record and retrieve a reference to a node in the string store containing subsequent segments that may match the next segment(s) of the candidate string. If no literal match is found then the system continues in block 360 to identify pattern matches.
  • Continuing in decision block 360, if the system determines that a pattern match exists, then the system loops to block 320 to continue processing the next candidate string segment. In some embodiments, the system separates processing of literal string matches from pattern string matches. As noted herein, pattern strings are not typically well suited to inclusion in a hash table because they will not generate the same hash key as a literal candidate string segment that matches the pattern, leading to failure to find the matching pattern string in the hash table. The system may perform slower or alternative processing to iterate through pattern strings and determine whether any match the provided candidate string segment.
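The separation of literal and pattern segments at a single node can be illustrated with the following sketch. It assumes shell-style wildcards as implemented by Python's `fnmatch` module, which is only one possible pattern syntax; the class name `Level` and its methods are hypothetical.

```python
from fnmatch import fnmatchcase

class Level:
    """One node's segments: literals in a hash table, patterns in a list."""
    def __init__(self):
        self.literals = {}   # literal segment -> value, found by hash key
        self.patterns = []   # (pattern, value) pairs, scanned one by one

    def add(self, segment, value):
        # A pattern would hash differently from the literal segments it
        # matches, so it is kept out of the hash table.
        if any(ch in segment for ch in "*?["):
            self.patterns.append((segment, value))
        else:
            self.literals[segment] = value

    def lookup(self, segment):
        """Try the fast literal lookup first, then the slower pattern scan."""
        if segment in self.literals:
            return self.literals[segment]
        for pattern, value in self.patterns:
            if fnmatchcase(segment, pattern):
                return value
        return None

level = Level()
level.add("S2-A", "literal entry")
level.add("S2-*", "pattern entry")
print(level.lookup("S2-A"))
print(level.lookup("S2-Z"))
print(level.lookup("S9"))
```

The pattern scan is linear in the number of patterns at that node, so it is only reached when the hash lookup fails.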
  • Continuing in decision block 365, if the system determines that a parent node of the current node exists, then the system continues at block 370, else the system continues at block 375 to report no match. Because patterns can be inclusive enough to span multiple segments, a failure to find a match at a particular level of the string store does not by itself indicate that no match exists. By navigating to the parent level (potentially multiple times), the system searches for pattern matches at higher levels that may encompass the current candidate string segment. For example, a pattern string “http://www.*”, where “*” is a wildcard pattern for matching any characters, would match with many potential candidate strings having many more segments than this target pattern string. Continuing in block 370, the system selects the parent node and loops to block 360 to determine whether a pattern match exists by comparing the current segment of the candidate string to the parent node's string patterns.
  • Continuing in block 375, the system reports that no match was found for the candidate string. The system may report no match by returning a return value in response to an API call, providing a text response, raising an exception, or other common methods of reporting results. After block 375, these steps conclude.
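The matching flow of FIG. 3 can be approximated by the recursive sketch below. It is a simplification, not the flow diagram itself: recursion with backtracking stands in for the parent-node navigation of blocks 365 and 370, each wildcard matches within a single segment only, a "partial" result means a stored string is a segment-wise prefix of the candidate, and the wildcard syntax is that of Python's `fnmatch`. All function names are hypothetical.

```python
from fnmatch import fnmatchcase

def build_store(strings, delimiter=":"):
    """Build nested hash tables: segment -> {"end": bool, "next": dict}."""
    root = {}
    for text in strings:
        level = root
        segments = text.split(delimiter)
        for index, segment in enumerate(segments):
            entry = level.setdefault(segment, {"end": False, "next": {}})
            if index == len(segments) - 1:
                entry["end"] = True   # some stored string ends at this entry
            level = entry["next"]
    return root

def is_pattern(segment):
    return any(ch in segment for ch in "*?[")

def search(level, segments, allow_partial=False):
    """Return "full", "partial", or None for the remaining candidate segments."""
    segment, rest = segments[0], segments[1:]
    # Literal hash lookup first, then the slower scan over pattern segments.
    entries = [level[segment]] if segment in level else []
    entries += [entry for key, entry in level.items()
                if key != segment and is_pattern(key) and fnmatchcase(segment, key)]
    best = None
    for entry in entries:
        if not rest:
            if entry["end"]:
                return "full"
            continue
        result = search(entry["next"], rest, allow_partial) if entry["next"] else None
        if result == "full":
            return "full"
        if result == "partial" or (allow_partial and entry["end"]):
            best = "partial"
    return best

def match(store, candidate, allow_partial=False, delimiter=":"):
    return search(store, candidate.split(delimiter), allow_partial)

store = build_store(["S1-A:S2-A:S3-A", "S1-B", "S1-A:*:S3-B"])
print(match(store, "S1-A:S2-A:S3-B"))                  # found via the "*" branch
print(match(store, "S1-B:extra"))                      # no match without partials
print(match(store, "S1-B:extra", allow_partial=True))  # "S1-B" is a stored prefix
```

In the first call the literal branch under "S2-A" fails at the third segment, and the search falls back to the pattern branch, which is the situation the parent-node navigation is meant to handle.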
  • FIG. 4 is a data structure diagram that graphically illustrates contents of the string data store of the string matching system, in one embodiment. FIG. 4 represents the storage of the following set of pattern strings in the data structure described: “S1-A:S2-A:S3-A”, “S1-B”, “S1-A:*:S3-B”, “S1-A:S2-B”, “S1-A:*”, and “S1-A:S2-A”. Note that in this example the system can easily segment the strings at each colon (“:”). The data structure starts with a root node 410 that contains a pointer to the first node 420 of the data structure containing strings. Each string-containing node contains a hash table or other data structure that makes finding matching segments at that level efficient. For example, the first node 420 includes a hash table 430. The hash table 430 contains the strings “S1-A” and “S1-B” because these encompass the potential first segment values of all of the strings stored in the string store. Note that even though there are six strings stored in the string store, data that is similar for multiple strings may not be duplicated at each level, resulting in significant space savings.
  • The “S1-A” value in the hash table 430 contains a reference 440 to the second node 460. The second node 460 represents string segments that can follow “S1-A” for strings stored in the string store. The second node 460 contains its own hash table 470 with both literal strings and pattern-match strings. Note that in the embodiment illustrated, the system separately stores segments that represent the end of a string from segments that point to further segments. The segment value “S1-B” in the first node 420 points to a NULL reference 450 that indicates that no further segments follow that instance of “S1-B” for data presently stored in the string store. In the second node 460 hash table 470, the value “S2-A” is stored twice, once for strings that end in “S2-A” and again for strings that continue with one or more other segments. The data structure described allows for both space-efficient storage and fast matching of candidate strings to the strings stored in the string store.
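The space savings described for FIG. 4 can be made concrete with a small sketch that loads the six example strings and counts stored entries. Two assumptions apply: the third segment of the first example string is taken to be "S3-A", and an end-of-string flag is used instead of storing a terminating segment twice, so the entry count differs slightly from the illustrated embodiment. The names `build_store` and `count_entries` are hypothetical.

```python
def build_store(strings, delimiter=":"):
    """Build nested hash tables: segment -> {"end": bool, "next": dict}."""
    root = {}
    for text in strings:
        level = root
        segments = text.split(delimiter)
        for index, segment in enumerate(segments):
            entry = level.setdefault(segment, {"end": False, "next": {}})
            if index == len(segments) - 1:
                entry["end"] = True
            level = entry["next"]
    return root

def count_entries(level):
    """Count hash table entries at this level and every level below it."""
    return sum(1 + count_entries(entry["next"]) for entry in level.values())

strings = ["S1-A:S2-A:S3-A", "S1-B", "S1-A:*:S3-B",
           "S1-A:S2-B", "S1-A:*", "S1-A:S2-A"]
store = build_store(strings)

total_segments = sum(len(text.split(":")) for text in strings)
print(total_segments)            # segments across all six strings
print(count_entries(store))      # entries actually stored
print(sorted(store))             # first-level hash table keys
print(store["S1-B"]["next"])     # empty: nothing follows "S1-B"
```

Under these assumptions the six strings contain 13 segments in total but occupy 7 entries, because the shared "S1-A" prefix and the shared second-level segments are stored once.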
  • In some embodiments, the string matching system uses a Boolean value to indicate segment termination for partial matches. In implementations where partial matches are allowed (i.e., are treated as matches), then the system may distinguish segments that end a string with a simple Boolean value rather than a null terminator or other delimiter. This allows the strings loaded in the string store to continue on, potentially beyond the length of provided candidate strings.
  • In some embodiments, the string matching system combines node type or other application criteria at each segment to determine a match. For example, node types may include scheme (e.g., http), domain (e.g., .com), address (e.g., facebook), directory, and so forth. The system may test that a candidate string's segments have both a matching type and a matching value at each level before declaring a match. This allows an administrator who wants to match a specific address or other part of a string to distinguish type-based matches from standard string matches.
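Type-and-value matching of this kind could be sketched by keying each level on a (type, value) pair rather than on the value alone. This is an illustrative assumption about representation; the type labels follow the examples above, and the names `END`, `add_typed`, and `has_typed` are hypothetical.

```python
END = object()  # sentinel key marking that a stored string ends at this level

def add_typed(store, typed_segments):
    """Store a sequence of (node_type, value) pairs in nested hash tables."""
    level = store
    for key in typed_segments:
        level = level.setdefault(key, {})
    level[END] = True

def has_typed(store, typed_segments):
    """Match only when both the type and the value agree at every level."""
    level = store
    for key in typed_segments:
        if key not in level:
            return False
        level = level[key]
    return END in level

store = {}
add_typed(store, [("scheme", "http"), ("domain", "com"), ("address", "facebook")])

same_types = [("scheme", "http"), ("domain", "com"), ("address", "facebook")]
wrong_type = [("scheme", "http"), ("domain", "com"), ("directory", "facebook")]
print(has_typed(store, same_types))
print(has_typed(store, wrong_type))
```

The second candidate carries the same text "facebook" but as a directory rather than an address, so it is rejected even though a plain string comparison would succeed.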
  • As noted herein, the system may accept partial matches based on application-specified preferences. In some embodiments, the string matching system allows the application to determine particular portions of strings for which partial matches are accepted. For example, using URLs, a particular application may allow partial matches for the virtual directory portion of the URL, but not for the domain name. The system may provide settings for partial matching that include a text description, such as “always”, “never”, “only after address”, and so forth.
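One way to interpret the partial-matching settings listed above is sketched below. The exact semantics of each setting are an assumption made for illustration (in particular, "only after address" is read as "an address-type segment has already been matched"), and the function name `partial_match_allowed` is hypothetical.

```python
POLICIES = ("always", "never", "only after address")

def partial_match_allowed(policy, matched_types):
    """Decide whether a partial match may be reported.

    matched_types lists the node types of the candidate segments matched
    so far, in order, for example ["scheme", "domain", "address"].
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown partial-match policy: {policy!r}")
    if policy == "never" or not matched_types:
        return False            # nothing matched yet, or partials disabled
    if policy == "always":
        return True
    return "address" in matched_types   # policy == "only after address"

print(partial_match_allowed("always", ["scheme"]))
print(partial_match_allowed("only after address", ["scheme", "domain"]))
print(partial_match_allowed("only after address", ["scheme", "domain", "address"]))
```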
  • In some embodiments, the string matching system facilitates faster updates of the string data store using the data structure described herein. Because the string data is subdivided as described herein, updates only affect the leaf nodes of each string or the farthest current segment in the string store that is also in the string to be added. Because updates often lock the data structure being updated to prevent interference by other threads, the subdivision described herein limits the level at which a lock is needed, so that updates are less likely to impact nodes that are currently being used in a search. The system can quickly lock the furthest node to which the added string pertains, then unlock that node after the update. During that time, searches for strings that do not visit that node are unaffected by the update lock.
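The limited locking described above can be sketched with one lock per node, so that an update holds only the lock of the node it is currently modifying. This is a simplified variant that briefly locks each node along the path rather than only the furthest one, and it tracks path presence without end-of-string flags; the names `Node`, `add_string`, and `contains_path` are hypothetical.

```python
import threading

class Node:
    """One level of the store with its own lock."""
    def __init__(self):
        self.entries = {}              # segment -> child Node
        self.lock = threading.Lock()

def add_string(root, segments):
    """Insert segments, locking one node at a time instead of the whole tree."""
    node = root
    for segment in segments:
        with node.lock:                # other nodes stay available to searches
            child = node.entries.get(segment)
            if child is None:
                child = Node()
                node.entries[segment] = child
        node = child

def contains_path(root, segments):
    """Return True when every segment is present along one path from the root."""
    node = root
    for segment in segments:
        with node.lock:
            child = node.entries.get(segment)
        if child is None:
            return False
        node = child
    return True

root = Node()
paths = [["com", "microsoft", str(number)] for number in range(8)]
workers = [threading.Thread(target=add_string, args=(root, path)) for path in paths]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(all(contains_path(root, path) for path in paths))
```

A search that never visits the node being updated acquires none of the locks an update holds, which is the property the paragraph above relies on.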
  • From the foregoing, it will be appreciated that specific embodiments of the string matching system have been described herein for purposes of illustration, but that various modifications may be made without deviating from the spirit and scope of the invention. For example, although strings have been described, the techniques described herein can also be used with other types of data, such as binary streams. Anti-virus software, for example, often scans executable code for known malicious binary streams. The system described can store particular binary streams to be matched against input streams to identify malicious software code. Accordingly, the invention is not limited except as by the appended claims.
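Applied to binary data, the same structure might use fixed-size byte chunks as segments, as in the sketch below. It assumes signature lengths that are a multiple of the chunk size and checks only whether an input begins with a stored signature; scanning at every offset of an input stream is not shown. The function names are hypothetical.

```python
def byte_chunks(data, size=4):
    """Split a bytes object into consecutive chunks of `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def add_signature(store, signature, size=4):
    """Store a byte signature chunk by chunk in nested hash tables."""
    level = store
    for chunk in byte_chunks(signature, size):
        level = level.setdefault(chunk, {})
    level[None] = True   # end marker; None cannot collide with a bytes key

def starts_with_signature(store, data, size=4):
    """Return True when data begins with any stored signature."""
    level = store
    for chunk in byte_chunks(data, size):
        if None in level:
            return True
        if chunk not in level:
            return False
        level = level[chunk]
    return None in level

store = {}
signature = bytes.fromhex("deadbeefcafebabe")
add_signature(store, signature)
print(starts_with_signature(store, signature + b"\x00\x01"))
print(starts_with_signature(store, b"\x00" * 8))
```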

Claims (20)

1. A computer-implemented method for determining whether a candidate string matches any string stored in a string data store, the method comprising:
receiving a candidate string for which to search the string data store to identify zero or more matches to the candidate string;
dividing the received string into one or more segments;
selecting a root node of the string data store that identifies an entry to a data structure for efficiently storing a set of strings in a tree-like form;
searching the string data store segment-by-segment to identify segments that match the segments of the candidate string; and
upon determining that each segment in the candidate string matches a segment stored in the string data store, reporting that the candidate string matches a string stored in the string data store,
wherein the preceding steps are performed by at least one processor.
2. The method of claim 1 wherein receiving the candidate string comprises receiving the string from an application through an application-programming interface (API) provided by a string matching system.
3. The method of claim 1 wherein dividing the received string into segments comprises identifying segments based on one or more delimiting characters that separate segments.
4. The method of claim 1 wherein dividing the received string into segments comprises dividing the candidate string using the same segmentation criteria used to previously store strings in the string data store.
5. The method of claim 1 wherein each node of the tree after the root node includes a hash table that allows efficiently looking up matching strings at the node's level of the string data store.
6. The method of claim 1 wherein searching the string data store segment-by-segment comprises: selecting a first segment of the received candidate string and the child node of the root node of the data structure, and looking up the first segment of the received candidate string in a data structure associated with the child node.
7. The method of claim 1 wherein searching the string data store segment-by-segment comprises determining that the candidate string partially matches a string stored in the string data store, and reporting the partial match.
8. The method of claim 1 wherein at least one string stored in the string data store includes a pattern match segment with at least one wildcard character that represents one or more optional characters to match in the candidate string.
9. The method of claim 1 wherein searching the string data store segment-by-segment comprises, at each level of the tree data structure, determining whether there are any literal match strings and if no literal matches are found, searching zero or more pattern match strings for a pattern match.
10. The method of claim 1 wherein searching the string data store segment-by-segment comprises performing a data structure look up at one or more nodes in the tree data structure that include a matching segment for a segment in that candidate string.
11. The method of claim 1 further comprising, upon determining that no string segments at a current tree node match a current candidate string segment, navigating to an ancestor node of the current node, navigating to a previous candidate string segment, and determining whether any pattern match string segments associated with the ancestor node match the previous candidate string segment.
12. The method of claim 1 further comprising, upon determining that at least one segment of the candidate string does not match a segment stored in the string data store, reporting that no match was found for the candidate string.
13. A computer system for fast and efficient pattern matching against large sets of pattern strings, the system comprising:
a processor and memory configured to execute software instructions embodied within the following components;
a string store configured to store one or more data structures that hold a set of strings against which the system searches for matches to a candidate string;
a string loading component configured to load one or more strings into the string store by invoking the segmentation component to divide the strings into segments and locating or creating nodes in the tree to store each string segment;
a segmentation component configured to divide one or more strings into logical segments based on one or more segmentation criteria;
a segment matching component configured to perform a segment-by-segment match of a candidate string to one or more stored strings to identify segments of the strings that match;
a node creation component configured to create new entries to represent string segments in the tree of stored strings; and
a search component configured to receive a candidate string and search the string store for one or more matching strings.
14. The system of claim 13 wherein the string store is further configured to include a combination of in-memory and persistent data storage based on frequency of use of each node in a tree data structure for storing the strings.
15. The system of claim 13 wherein the string loading component is further configured to be invoked during a startup process that prepares a string data structure for use by an application that tests candidate strings for matches within the data structure.
16. The system of claim 13 wherein the string loading component is further configured to update the string store with new strings without locking access to all of the strings in the string data store.
17. The system of claim 13 wherein the segmentation component is further configured to apply the segmentation criteria to divide strings received from the string loading component to be loaded into the tree stored by the string store and also to divide candidate strings provided by the search component so that the system can determine whether particular segments of the candidate string match one or more stored strings.
18. The system of claim 13 wherein the segment matching component is further configured to determine literal character matches and pattern matches with one or more wildcard characters.
19. The system of claim 13 wherein the segment matching component is further configured to determine whether a status of a first segment as marking the end of a string and a second segment as marking the end of a string is the same.
20. A computer-readable storage medium comprising instructions for controlling a computer system to store one or more strings in a string data store against which candidate strings will subsequently be tested for a match, wherein the instructions, upon execution, cause a processor to perform actions comprising:
receiving a string to be added to the string data store;
dividing the received string into one or more segments;
selecting a root node of the string data store that identifies an entry to a tree-based data structure for efficiently storing a set of strings;
selecting a first segment of the received string;
performing a look up at the current node to determine whether the current string segment is already stored in the string data store at a current level; and
upon identifying a segment of the received string that is not already stored in the string data store, storing the non-matching segment as a new entry in the string data store at the current level.
US12/953,469 2010-11-24 2010-11-24 Efficient string pattern matching for large pattern sets Expired - Fee Related US8407245B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/953,469 US8407245B2 (en) 2010-11-24 2010-11-24 Efficient string pattern matching for large pattern sets

Publications (2)

Publication Number Publication Date
US20120130983A1 US20120130983A1 (en) 2012-05-24
US8407245B2 US8407245B2 (en) 2013-03-26

Family

ID=46065320

Family Applications (1)

Application Number Title Priority Date Filing Date
US12/953,469 Expired - Fee Related US8407245B2 (en) 2010-11-24 2010-11-24 Efficient string pattern matching for large pattern sets

Country Status (1)

Country Link
US (1) US8407245B2 (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6047283A (en) * 1998-02-26 2000-04-04 Sap Aktiengesellschaft Fast string searching and indexing using a search tree having a plurality of linked nodes
US20030144832A1 (en) * 2002-01-16 2003-07-31 Harris Henry M. Machine translation system
US6662184B1 (en) * 1999-09-23 2003-12-09 International Business Machines Corporation Lock-free wild card search data structure and method
US20030233362A1 (en) * 2002-06-14 2003-12-18 Integrated Knowledge Solutions Inc. Method to efficiently process and present possible arrangements of a set of contiguous peer-to-peer links
US7058663B2 (en) * 2001-03-13 2006-06-06 Koninklijke Philips Electronics, N.V. Automatic data update
US20070050335A1 (en) * 2005-08-26 2007-03-01 Fujitsu Limited Information searching apparatus and method with mechanism of refining search results
US20070150469A1 (en) * 2005-12-19 2007-06-28 Charles Simonyi Multi-segment string search
US20070198548A1 (en) * 2005-11-28 2007-08-23 Lee Won S Compressed prefix trees and estDec+ method for finding frequent itemsets over data streams
US7403941B2 (en) * 2004-04-23 2008-07-22 Novauris Technologies Ltd. System, method and technique for searching structured databases
US20080177737A1 (en) * 2002-06-21 2008-07-24 Microsoft Corporation Method and system for a pattern matching engine
US20100017397A1 (en) * 2008-07-17 2010-01-21 International Business Machines Corporation Defining a data structure for pattern matching

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7054855B2 (en) 2001-07-03 2006-05-30 International Business Machines Corporation Method and system for performing a pattern match search for text strings
US7636703B2 (en) 2006-05-02 2009-12-22 Exegy Incorporated Method and apparatus for approximate pattern matching
US8189931B2 (en) 2008-01-04 2012-05-29 International Business Machines Corporation Method and apparatus for matching of bracketed patterns in test strings
US9032503B2 (en) 2008-05-20 2015-05-12 Shakeel Mustafa Diversity string based pattern matching
US8103764B2 (en) 2008-10-14 2012-01-24 CacheIQ, Inc. Method and apparatus for matching trigger pattern

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9852311B1 (en) 2011-03-08 2017-12-26 Ciphercloud, Inc. System and method to anonymize data transmitted to a destination computing device
US11228566B1 (en) 2011-03-08 2022-01-18 Ciphercloud, Inc. System and method to anonymize data transmitted to a destination computing device
US9432342B1 (en) 2011-03-08 2016-08-30 Ciphercloud, Inc. System and method to anonymize data transmitted to a destination computing device
US9413526B1 (en) 2011-03-08 2016-08-09 Ciphercloud, Inc. System and method to anonymize data transmitted to a destination computing device
US9667741B1 (en) 2011-03-08 2017-05-30 Ciphercloud, Inc. System and method to anonymize data transmitted to a destination computing device
US9338220B1 (en) 2011-03-08 2016-05-10 Ciphercloud, Inc. System and method to anonymize data transmitted to a destination computing device
US9300637B1 (en) * 2011-03-08 2016-03-29 Ciphercloud, Inc. System and method to anonymize data transmitted to a destination computing device
US9087064B2 (en) * 2011-10-27 2015-07-21 International Business Machines Corporation User-defined hierarchies in file system data sets
US9344091B2 (en) 2012-08-06 2016-05-17 Advanced Micro Devices, Inc. Die-stacked memory device with reconfigurable logic
US9697147B2 (en) 2012-08-06 2017-07-04 Advanced Micro Devices, Inc. Stacked memory device with metadata management
US9201777B2 (en) 2012-12-23 2015-12-01 Advanced Micro Devices, Inc. Quality of service support using stacked memory device with logic die
US9170948B2 (en) 2012-12-23 2015-10-27 Advanced Micro Devices, Inc. Cache coherency using die-stacked memory device with logic die
US9135185B2 (en) 2012-12-23 2015-09-15 Advanced Micro Devices, Inc. Die-stacked memory device providing data translation
US9065722B2 (en) 2012-12-23 2015-06-23 Advanced Micro Devices, Inc. Die-stacked device with partitioned multi-hop network
WO2014106282A1 (en) * 2012-12-31 2014-07-03 Mandiant Corporation Identification of obfuscated computer items using visual algorithms
US9690935B2 (en) 2012-12-31 2017-06-27 Fireeye, Inc. Identification of obfuscated computer items using visual algorithms
US9930052B2 (en) 2013-06-27 2018-03-27 International Business Machines Corporation Pre-processing before precise pattern matching
US10171482B2 (en) 2013-06-27 2019-01-01 International Business Machines Corporation Pre-processing before precise pattern matching
US10594704B2 (en) 2013-06-27 2020-03-17 International Business Machines Corporation Pre-processing before precise pattern matching
US10333947B2 (en) 2013-06-27 2019-06-25 International Business Machines Corporation Pre-processing before precise pattern matching
US20150016172A1 (en) * 2013-07-15 2015-01-15 Advanced Micro Devices, Inc. Query operations for stacked-die memory device
US9286948B2 (en) * 2013-07-15 2016-03-15 Advanced Micro Devices, Inc. Query operations for stacked-die memory device
US9818455B2 (en) * 2013-07-15 2017-11-14 Advanced Micro Devices, Inc. Query operations for stacked-die memory device
EP2871816B1 (en) 2013-11-11 2016-03-09 51 Degrees Mobile Experts Limited Identifying properties of a communication device
US9875264B2 (en) 2013-11-11 2018-01-23 51 Degrees Mobile Experts Limited Identifying properties of a communication device
US10277613B2 (en) * 2013-12-10 2019-04-30 Nippon Telegraph And Telephone Corporation URL matching apparatus, URL matching method, and URL matching program
US10719540B2 (en) 2014-01-30 2020-07-21 Verint Systems Ltd. Systems and methods for keyword spotting using alternating search algorithms
US20160063055A1 (en) * 2014-08-29 2016-03-03 Alvin Roy Reed Method And Apparatus For Improved Database Searching
US10262081B2 (en) * 2014-08-29 2019-04-16 Alvin Roy Reed Method and apparatus for improved database searching
US10706032B2 (en) * 2015-04-28 2020-07-07 International Business Machines Corporation Unsolicited bulk email detection using URL tree hashes
US10810176B2 (en) 2015-04-28 2020-10-20 International Business Machines Corporation Unsolicited bulk email detection using URL tree hashes
US20160321255A1 (en) * 2015-04-28 2016-11-03 International Business Machines Corporation Unsolicited bulk email detection using url tree hashes
WO2016177830A1 (en) * 2015-05-06 2016-11-10 Örjan Vestgöte Method, system and computer program product for performing numeric searches
EP3091450A1 (en) * 2015-05-06 2016-11-09 Örjan Vestgöte Method and system for performing binary searches
US10649997B2 (en) 2015-05-06 2020-05-12 Örjan Vestgöte Technology AB Method, system and computer program product for performing numeric searches related to biometric information, for finding a matching biometric identifier in a biometric database
GB2540216A (en) * 2015-07-10 2017-01-11 Advanced Risc Mach Ltd Comparison of wide data types
US10474427B2 (en) 2015-07-10 2019-11-12 Arm Limited Comparison of wide data types
GB2540216B (en) * 2015-07-10 2020-01-01 Advanced Risc Mach Ltd Comparison of wide data types
US10204177B2 (en) 2015-11-30 2019-02-12 International Business Machines Corporation Matching an ordered set of strings containing wild cards
US10482175B2 (en) 2017-07-31 2019-11-19 51 Degrees Mobile Experts Limited Identifying properties of a communication device
US10949476B1 (en) 2017-10-20 2021-03-16 Carmax Business Services, Llc Systems and methods for generating web pages for database search and search engine optimization
US10068021B1 (en) * 2017-10-20 2018-09-04 Carmax Business Services, Llc Systems and methods for generating web pages for database search and search engine optimization
US11797616B1 (en) 2017-10-20 2023-10-24 Carmax Enterprise Services, Llc Systems and methods for generating web pages for database search and search engine optimization
CN110728113A (en) * 2018-06-28 2020-01-24 北京金山办公软件股份有限公司 Information screening method and device of electronic forms and terminal equipment
US11288368B1 (en) * 2019-06-26 2022-03-29 Support Intelligence, Inc. Signature generation
CN110674247A (en) * 2019-09-23 2020-01-10 广州虎牙科技有限公司 Barrage information intercepting method and device, storage medium and equipment
CN111159362A (en) * 2019-12-31 2020-05-15 奇安信科技集团股份有限公司 Text matching processing method and device
US20220139506A1 (en) * 2020-10-30 2022-05-05 Shenzhen Academy Of Inspection And Quarantine Method for automatically collecteing and matching of laboratory data
US11557382B2 (en) * 2020-10-30 2023-01-17 Shenzhen Academy Of Inspection And Quarantine Method for automatically collecting and matching of laboratory data
US11563715B2 (en) * 2020-11-23 2023-01-24 Juniper Networks, Inc. Pattern matching by a network device for domain names with wildcard characters
CN113312549A (en) * 2021-05-25 2021-08-27 北京天空卫士网络安全技术有限公司 Domain name processing method and device

Also Published As

Publication number Publication date
US8407245B2 (en) 2013-03-26

Similar Documents

Publication Publication Date Title
US8407245B2 (en) Efficient string pattern matching for large pattern sets
US11423216B2 (en) Providing extraction results for a particular field
US10042746B2 (en) Callpath finder
KR101963917B1 (en) Automatic synchronization of most recently used document lists
EP1587006A2 (en) Method and system for renaming consecutive keys in a B-tree
US9020951B2 (en) Methods for indexing and searching based on language locale
US11068536B2 (en) Method and apparatus for managing a document index
Keivanloo et al. SeByte: Scalable clone and similarity search for bytecode
US20060112127A1 (en) Method for efficiently mapping error messages to unique identifiers
CN104320312A (en) Network application safety test tool and fuzz test case generation method and system
Kim et al. Supporting set-valued joins in NoSQL using MapReduce
US11645234B2 (en) Rule-based collections of subset(s) of metadata in response to a trigger event occurring
US9411618B2 (en) Metadata-based class loading using a content repository
US10606805B2 (en) Object-level image query and retrieval
EP3649566A1 (en) System and method for value based region searching and associated search operators
US20090171922A1 (en) System and Method for Dynamically Exposing SQL Statements as Web Protocols
US20040098670A1 (en) Processing of data
US6768989B2 (en) Collection recognizer
Mathew et al. Efficient information retrieval using Lucene, LIndex and HIndex in Hadoop
US20150347402A1 (en) System and method for enabling a client system to generate file system operations on a file system data set using a virtual namespace
Raj et al. Distributed component-based crawler for AJAX applications
US10896115B2 (en) Investigation of performance bottlenecks occurring during execution of software applications
US20230094137A1 (en) Generating and visualizing a data structure of a management model of an application server
US20080281838A1 (en) Method of determining dependencies between items in a graph in an extensible system
CN117828134A (en) Quick positioning method for object semantics under tree metadata structure

Legal Events

Date Code Title Description
AS Assignment
Owner name: MICROSOFT CORPORATION, WASHINGTON
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:RYAN, MATTHEW V.;HODGKINSON, ANDREW;YOUNG, RUSSELL;REEL/FRAME:025417/0636
Effective date: 20101122

STCF Information on status: patent grant
Free format text: PATENTED CASE

AS Assignment
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034544/0001
Effective date: 20141014

FPAY Fee payment
Year of fee payment: 4

FEPP Fee payment procedure
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees
Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee
Effective date: 20210326