WO2013126281A1 - Systems and methods for putative cluster analysis - Google Patents

Systems and methods for putative cluster analysis

Info

Publication number
WO2013126281A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
distinct data
connections
data points
clusters
Prior art date
Application number
PCT/US2013/026343
Other languages
French (fr)
Inventor
Johannes Philippus de Villiers PRICHARD
David Alan Bayliss
Original Assignee
Lexisnexis Risk Solutions Fl Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lexisnexis Risk Solutions Fl Inc.
Priority to US13/848,850 (patent US9412141B2)
Publication of WO2013126281A1
Priority to US15/202,099 (patent US10438308B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0204 Market segmentation

Definitions

  • Bayliss I U.S. Patent No. 7,403,942 to Bayliss, David et al.
  • Bayliss II US Patent Application Serial No. 10/357,489 to Bayliss, David et al.
  • Various embodiments of the systems and methods described herein relate to data mining and, more particularly, to systems and methods for efficiently mining data to identify collusion, fraud, and organized groups of entities.
  • Applications for exploiting collected data include, but are not limited to: national security; law enforcement; immigration and border control; locating missing persons and property; firearms tracking; civil and criminal investigations; person and property location and verification; governmental and agency record handling; entity searching and location; package delivery; telecommunications; consumer related applications; credit reporting, scoring, and/or evaluating; debt collection; entity identification verification; account establishment, scoring and monitoring; fraud detection; health industry (patient record maintenance); biometric and other forms of authentication; insurance and risk management; marketing, including direct to consumer marketing; human resources/employment; and financial/banking industries.
  • the applications may span an enterprise or agency or extend across multiple agencies, businesses, industries, etc.
  • various embodiments of the disclosed technology may include putative cluster analysis systems and methods for identifying various connected entities and organizations.
  • an analytical system may be provided that includes a database, a clustering unit, a scoring unit, and a filtering unit.
  • Certain implementations of the disclosed technology may include systems, methods, and computer-readable media for identifying connected organizations in a collection of distinct data points.
  • a method is provided for determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points.
  • the method includes identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid.
  • the method further includes identifying cluster connections among the plurality of clusters, scoring the cluster connections based on predetermined criteria, and identifying one or more of the distinct data points associated with the scored cluster connections.
  • a system may include one or more processors and at least one memory in communication with the one or more processors.
  • the at least one memory may include an operating system, a database, a clustering unit, and a scoring unit.
  • the memory in communication with the one or more processors may be configured for storing data and instructions which, when executed by the at least one processor under control of the operating system, enable the system to determine, from a collection of records in the database, wherein the collection of records comprise a plurality of distinct data points, connections between one or more of the plurality of distinct data points.
  • the instructions, when executed by the at least one processor under control of the operating system may further identify, by the clustering unit, and from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid.
  • the instructions, when executed by the at least one processor under control of the operating system, may further identify, by the clustering unit, cluster connections among the plurality of clusters; score, by the scoring unit, the cluster connections based on predetermined criteria; and identify one or more of the distinct data points associated with the scored cluster connections.
  • computer-readable media are provided that store instructions for performing a method.
  • the method includes determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points.
  • the method includes identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid.
  • the method further includes identifying cluster connections among the plurality of clusters, scoring the cluster connections based on predetermined criteria, and identifying one or more of the distinct data points associated with the scored cluster connections.
  • the database may store a plurality of records to be analyzed.
  • Each record may include data related to an entity or transaction.
  • a record may include data related to a real estate purchase, an insurance claim, or an income tax return.
  • the putative cluster analysis system may be directed to identify organizations related to a single industry. In that case, each record in the database, for the purpose of the putative cluster analysis system, may be related to that single industry. For example, if an embodiment of the putative cluster analysis system is directed to identifying insurance fraud, then various records may be related to insurance claims.
  • Some embodiments of the disclosed technology may include a database.
  • Other example embodiments of the disclosed technology may include systems and/or methods for accessing a database or other collection of data to be analyzed.
  • the clustering unit may group the various records into distinct, putative clusters.
  • the term "putative clusters" as discussed herein may mean groups of records that are supposed, presumed, and/or reputed as having some type of a connection to one another, no matter how tenuous that connection may prove to be in actuality.
  • each record, or data point may be deemed the central point of a cluster. For that data point, relatives of that data point may be identified up to a predetermined distance from the central data point, where "distance" between points is predefined and, in some embodiments, relates to a degree of connectivity between data points.
  • the scoring unit may have access to a predetermined feature set, and may be configured to analyze each putative cluster based on the feature set.
  • a direct link exists between each pair of data points with a direct relationship. For example, if a pair of data points represents two real estate transactions with the same seller, then these data points may be connected by a direct link within a cluster. Data points within a cluster may be indirectly connected when the data points are connected by a series of links.
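  • The direct and indirect links described above can be sketched as follows. This is a minimal illustration, not the patented method: the record layout, the field names `seller` and `buyer`, and the helper names `direct_links` and `connected` are all assumptions introduced here. Linking on a shared buyer is likewise an assumption; the text only gives the shared-seller example.

```python
from collections import defaultdict
from itertools import combinations


def direct_links(records, key):
    """Link every pair of records that share the same value for `key`."""
    by_value = defaultdict(list)
    for rec_id, rec in records.items():
        by_value[rec[key]].append(rec_id)
    links = set()
    for ids in by_value.values():
        for a, b in combinations(sorted(ids), 2):
            links.add((a, b))
    return links


def connected(links, start, goal):
    """Return True if `goal` can be reached from `start` through a series of links."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return False


# Hypothetical real estate transactions (illustrative data only).
records = {
    "t1": {"seller": "A", "buyer": "B"},
    "t2": {"seller": "A", "buyer": "C"},
    "t3": {"seller": "D", "buyer": "C"},
}
links = direct_links(records, "seller") | direct_links(records, "buyer")
```

  • In this sketch, t1 and t2 are directly linked by their common seller, t2 and t3 by their common buyer, and t1 and t3 are only indirectly connected through t2.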
  • the scoring unit may analyze the attributes of the various links or data points in the cluster to provide a score with respect to the feature in question.
  • each data point or each link may be assigned a score for each feature.
  • the cluster as a whole may be assigned a total score comprising a combination of the scores of the various features applicable to the cluster.
  • the total score may be one of various combinations calculated from the feature scores, such as, for example, a sum, a weighted sum, or another formula based on the various features.
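  • A total score of the kind described above might be combined from per-feature scores as in the following sketch. The function name `total_score`, the feature names, and the weight values are hypothetical; the text leaves the actual formula open.

```python
def total_score(feature_scores, weights=None):
    """Combine per-feature scores into one cluster score.

    With no weights this is a plain sum; with weights it is a weighted sum.
    Features missing from `weights` contribute nothing (an assumption).
    """
    if weights is None:
        return sum(feature_scores.values())
    return sum(weights.get(name, 0.0) * score
               for name, score in feature_scores.items())


# Hypothetical feature scores and weights for one cluster.
scores = {"velocity": 2.0, "profit": 1.0, "relationship": 3.0}
weights = {"velocity": 0.5, "profit": 2.0, "relationship": 2.0}

plain = total_score(scores)              # sum of the three scores
weighted = total_score(scores, weights)  # 0.5*2.0 + 2.0*1.0 + 2.0*3.0
```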
  • the filtering unit may filter the putative clusters into real clusters and false clusters, where the real ones will be deemed to be those of interest for potential collusion.
  • the filtering unit may utilize a predetermined algorithm for separating the clusters into two groups based on the results of the scoring.
  • the algorithm may include a filter that significantly reduces the data set by selecting a subset of the putative clusters to deem real clusters.
  • the algorithm may be embodied in various forms according to certain embodiments.
  • the algorithm may examine the result of the scoring for each feature, and may select a subset of the clusters based on the various feature scores.
  • the filtering unit may have a target score, and real clusters may be those that meet a criterion, e.g., greater than, less than, with respect to that score for the combination of feature scores.
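  • The target-score filter described above could look like the following sketch. The criterion table, the function name `filter_clusters`, and the sample scores are assumptions made for illustration; the text does not fix a particular comparison or target value.

```python
import operator

# Supported comparison criteria (an illustrative choice).
CRITERIA = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def filter_clusters(cluster_scores, target, criterion=">="):
    """Split putative clusters into (real, false) by comparing each score to `target`."""
    compare = CRITERIA[criterion]
    real = {cid: s for cid, s in cluster_scores.items() if compare(s, target)}
    false = {cid: s for cid, s in cluster_scores.items() if cid not in real}
    return real, false


# Hypothetical total scores keyed by cluster centroid.
cluster_scores = {"c1": 9.0, "c2": 3.5, "c3": 6.0}
real, false = filter_clusters(cluster_scores, target=6.0)
```

  • With these sample values, clusters c1 and c3 meet the "greater than or equal to" criterion and c2 does not.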
  • the putative cluster analysis system may calculate a set of putative clusters and filter those putative clusters into a set of high-interest real clusters.
  • FIG. 1 illustrates an analytics method according to an example implementation of the disclosed technology.
  • FIG. 2 illustrates a putative cluster evidencing connectedness between entities represented in the cluster and sub-clusters.
  • FIG. 3 illustrates a representative computer architecture, according to an example embodiment of the disclosed technology.
  • FIG. 4 illustrates a diagram of potentially fraudulent transactions identified by an example implementation of the disclosed technology during a test analysis.
  • FIG. 5 is a flow-diagram of a method, according to an example embodiment of the disclosed technology.
  • Example systems and methods described herein may utilize various forms of data to identify connected entities and/or organizations. Certain embodiments of the disclosed technology may provide improved accuracy over conventional data mining and putative cluster analysis systems and techniques. For example, insurance companies and other industries attempting to identify fraud may utilize conventional focused analysis techniques that examine each event in isolation. The conventional techniques typically utilize high thresholds to filter the large number of events to be analyzed. In other words, because the data that entities must analyze with conventional techniques is so large, a high degree of suspicious activity may be required in order to identify fraud. Without a high threshold, conventional techniques may have too many potentially fraudulent events to investigate. As a result, entities using conventional techniques often overlook collusion from groups that are able to stay below these high thresholds with respect to certain suspicious activities.
  • the putative cluster analysis systems and methods disclosed herein may perform one or more of the following tasks: drive improved investigative and due diligence workflows; evaluate and segment loan files to identify notable risks; identify non-obvious relationships between entities, within and external to loan transactions; expose key perpetrators to improve remediation and recourse opportunities; augment existing fraud detection and scoring models during origination and loan pool acquisition; and enhance internal fraud and risk controls with a flexible pattern selection process.
  • the putative cluster analysis system may start with a large quantity of data and group that data into smaller, distinct clusters.
  • the proximity of seemingly low risk activity within each cluster may be measured using lower thresholds than is reasonably possible in the methods used by conventional systems.
  • the putative cluster analysis system may identify potentially organized groups without having to apply low thresholds to the large amounts of data as a whole.
  • high interest clusters may be identified from a plurality of data.
  • High interest clusters may represent connected organizations, entities, and/or people.
  • the putative cluster analysis system disclosed herein may rely upon relatively large amounts of data to measure proximity of seemingly low risk events commonly associated with high risk activities to detect potentially fraudulent activities.
  • a domain of entities may be identified for analysis. For example, data associated with a large number (perhaps millions) of property deeds may be gathered for analysis.
  • the associated data may include identities of individuals, organizations, companies, etc., that are associated with the deeds.
  • the associated data may include information such as addresses, mortgage lenders, names of law firms, dates of transactions, etc.
  • one or more types of relationships between the entities may then be collected.
  • a non-partitioning clustering algorithm may be utilized to form clusters for each of the domain entities, wherein copies of the domain entity may be created, as required, for populating clusters associated with neighboring clusters.
  • a filtering mechanism may operate against the clusters and may retain those clusters that have outlying behavior.
  • Such filtering may conventionally utilize graph or network analysis, and queries/filtering of this form may utilize sub-graph matching routines or fuzzy sub-graph matching.
  • sub-graph matching routines or fuzzy sub-graph matching techniques may be NP-complete, and thus, impractical for analyzing large sets of data.
  • the most notable characteristic of NP-complete problems is that no fast solution to them is known. That is, the time required to solve the problem using any currently known algorithm increases very quickly as the size of the problem grows.
  • Embodiments of the disclosed technology may be utilized to provide clusters and connections between entities even though the set of data analyzed may be extremely large.
  • entities may be identified and may include people, companies, places, objects, virtual identities, etc.
  • relationships may be formed in many ways, and with many qualities. For example, co-occurrence of values in common database fields may be utilized, such as the same last name. Relationships may also be formed using multiple co-occurrence of an entity with one or more other properties, such as people who have lived at two or more addresses. Relationships may also be formed based on a high reoccurrence and/or frequency of a common relationship, according to an example embodiment. For example, records of person X sending an email to person Y greater than N times may indicate a relationship between person X and person Y.
  • if person X sends an email to or receives an email from person Y, and person Z sends an email to or receives an email from person Y, a relationship may be implied between person X and person Z.
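  • The frequency-based and implied relationships in the email example can be sketched as follows. The helper names, the sample messages, and the decision to count messages in either direction together are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(emails, n):
    """Pairs of people who exchanged more than `n` emails, in either direction."""
    counts = Counter(frozenset(pair) for pair in emails if pair[0] != pair[1])
    return {pair for pair, count in counts.items() if count > n}


def implied_pairs(direct):
    """Pairs that share a common correspondent but are not directly related."""
    neighbors = {}
    for pair in direct:
        a, b = tuple(pair)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    implied = set()
    for others in neighbors.values():
        for x, z in combinations(sorted(others), 2):
            candidate = frozenset((x, z))
            if candidate not in direct:
                implied.add(candidate)
    return implied


# Hypothetical (sender, recipient) records.
emails = [("X", "Y")] * 3 + [("Y", "Z")] * 2 + [("Z", "Y")] + [("X", "W")]
direct = frequent_pairs(emails, n=2)
implied = implied_pairs(direct)
```

  • Here X and Y, and Y and Z, each exchange more than two emails, so a relationship between X and Z is implied through Y; the single email between X and W falls below the threshold.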
  • relationships between entities may comprise Boolean, weighted, directed, undirected, and/or combinations of multiple relationships.
  • clustering of the entities may rely on relationship steps.
  • entities may be related by at least two different relationship types.
  • relationships for the clustering may be established by examining weights or strengths of connections between entities in certain directions and conditional upon other relationships, including temporal relationships. For example, in one embodiment, the directional relationships between entities X, Y, and Z may be examined and the connection between X, Y, and Z may be followed if the link between Y and Z was established (in time) after the link between X and Y.
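  • The temporal condition in the X, Y, Z example can be sketched as a traversal that only follows a directed link when it was established after the link used to reach its source. The tuple layout `(source, target, time)`, the function name `follow_temporal`, and the strict "after" comparison are assumptions for illustration.

```python
import math


def follow_temporal(links, start):
    """Return the earliest time each entity is reached from `start`.

    `links` is an iterable of (source, target, time) directed links. A link is
    followed only if its time is strictly later than the time at which its
    source was reached, so every followed chain is ordered in time.
    """
    reached = {start: -math.inf}
    # Processing links in time order lets a single pass suffice.
    for source, target, time in sorted(links, key=lambda link: link[2]):
        if source in reached and reached[source] < time:
            if time < reached.get(target, math.inf):
                reached[target] = time
    return reached


# Hypothetical links: Y->Z comes after X->Y, but Y->W came before it.
links = [("X", "Y", 1), ("Y", "Z", 2), ("Y", "W", 0)]
reached = follow_temporal(links, "X")
```

  • In this sketch the chain X, Y, Z is followed, while W is not reached from X because the link from Y to W predates the link from X to Y.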
  • clusters may be scored.
  • a threshold may be utilized to identify clusters of interest.
  • a model may be utilized to compute a number of statistics on each cluster.
  • the model may be as simple as determining counts.
  • the model may detect relationships within a cluster, for example, entities that are related to the centroid of the cluster that are also related to each other. This analysis may provide a measure of cohesiveness of relationships that exist inside the cluster.
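  • One possible cohesiveness statistic of the kind described above counts how many pairs of the centroid's relatives are also related to each other. The ratio returned here resembles a local clustering coefficient; that choice, the function name `cohesiveness`, and the adjacency layout are assumptions, since the text does not prescribe a formula.

```python
from itertools import combinations


def cohesiveness(neighbors, centroid):
    """Return (linked_pairs, ratio) for the entities related to `centroid`.

    `neighbors` maps each entity to the set of entities it is related to.
    `ratio` is linked pairs divided by all possible pairs of relatives.
    """
    members = sorted(neighbors.get(centroid, set()))
    possible = len(members) * (len(members) - 1) // 2
    if possible == 0:
        return 0, 0.0
    linked = sum(1 for a, b in combinations(members, 2)
                 if b in neighbors.get(a, set()))
    return linked, linked / possible


# Hypothetical cluster: C is the centroid; A and B are also related to each other.
neighbors = {
    "C": {"A", "B", "D"},
    "A": {"C", "B"},
    "B": {"C", "A"},
    "D": {"C"},
}
linked, ratio = cohesiveness(neighbors, "C")
```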
  • scoring and weighting of each cluster may be utilized to determine which clusters rise above a particular threshold, and may be classified as "interesting."
  • scoring and weighting of the determined statistics may be accomplished using a heuristic scoring model, such as linear regression, neural network analysis, etc.
  • An example analytics method may be implemented by a putative cluster analysis system 100, as illustrated in FIG. 1. It will be understood that the method illustrated herein is provided for illustrative purposes only and does not limit the scope of the disclosed technology.
  • the putative cluster analysis system 100 may receive a plurality of data 102 to be analyzed.
  • the data may be processed 104, and output 106 may be generated.
  • the data may include identities and property deeds 108.
  • the data may also include information 110, for example, that may include data related to a bank portfolio.
  • the system 100 may receive the data 102 in its various forms (which may include identities, property deeds, portfolios, etc.), and may process 104 the data 102 to derive relationships 112 and perform analytics 114.
  • the relationships 112 and analytics 114 may be used to determine particular attributes 116.
  • the attributes 116 may include one or more of the following: property status; property deed transfer history; buyer history; and/or the previous seller's cluster activity.
  • the determined attributes 116 may go through a scoring and filtering process 118, which may result in an output 106 that may include one or more primary attributes 120, features 122, and risk segmentation 124.
  • the primary attributes 120 may include entity and property characteristics, such as suspicious deeds, associations with businesses and other entities, seller address history, etc.
  • the features 122 may be derived from aggregating characteristics such as store code deeds, defaults, transfer activity, etc. In one example embodiment, such features 122 may be derived by combining primary attributes 120.
  • the risk segmentation 124 may be utilized to augment current scoring models.
  • the clustering unit of the putative cluster analysis system 100 may treat each data point in the data as a centroid of its own cluster.
  • the total number of clusters may be equal to the total number of data points, and each cluster may be uniquely represented by its centroid data point.
  • the distance between the centroid and any data point within each cluster may be limited, such that the clusters are limited in size and, for some analyses, may be treated as being disconnected from one another.
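  • The centroid-per-data-point clustering described above can be sketched with a bounded breadth-first search. This is not the clustering method of Bayliss I and II; it is a simplified stand-in in which "distance" is taken to be the number of links from the centroid, and the names `cluster_around` and `putative_clusters` are hypothetical.

```python
from collections import deque


def cluster_around(neighbors, centroid, max_distance):
    """Return {data_point: distance} for points within `max_distance` links of `centroid`."""
    distance = {centroid: 0}
    queue = deque([centroid])
    while queue:
        node = queue.popleft()
        if distance[node] == max_distance:
            continue  # do not expand beyond the distance limit
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance


def putative_clusters(neighbors, max_distance):
    """Build one cluster per data point, keyed by its centroid."""
    return {point: cluster_around(neighbors, point, max_distance)
            for point in neighbors}


# Hypothetical chain of linked data points: a - b - c - d.
neighbors = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
clusters = putative_clusters(neighbors, max_distance=2)
```

  • Because every data point is its own centroid, the number of clusters equals the number of data points, and a point such as b appears in several clusters.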
  • An example method of clustering data for the purposes of the example implementation of the disclosed technology of the putative cluster analysis systems and methods is disclosed in Bayliss I and II, which are incorporated herein.
  • scoring and filter 118 may be applied, for example, to analyze each cluster and assign one or more scores to each cluster.
  • a scoring unit may utilize a predetermined scoring algorithm for scoring some or all of the clusters.
  • the scoring unit may utilize a dynamic scoring algorithm for scoring some or all of the clusters.
  • the scoring algorithm may be based on seemingly low-risk events that tend to be associated with organizations, such as fraud organizations. The algorithm may thus also be based on research into what events tend to be indicative of fraud in the industry or application to which the putative cluster analysis system is directed.
  • each putative cluster may be scored individually. For example, a plurality of predetermined attributes, or variables, may be calculated for each cluster based on the data points in the cluster. For each attribute, the putative cluster as a whole may be considered, or each data point or link between pairs of data points may be considered. An attribute may be evaluated and scored depending on the nature of the attribute.
  • the property status attribute may include one or more of the following: the date of the subject property's last deed; the sale amount of the subject property at the last recorded deed transfer; the number of months the subject property was owned by the previous owner; and/or the number of potential flip deed transfers (for example, property being owned less than 6 months or having a greater than 10% appreciation, etc.).
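  • The "potential flip" example above (owned less than 6 months, or more than 10% appreciation) can be sketched as follows. The function names, the whole-month calculation, and the sample transfers are assumptions; the sketch also assumes a positive purchase price.

```python
from datetime import date


def months_between(start, end):
    """Whole months elapsed from `start` to `end`."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    return months


def is_potential_flip(purchase_date, purchase_price, sale_date, sale_price,
                      max_months=6, min_appreciation=0.10):
    """Flag a transfer held under `max_months` or sold above `min_appreciation`."""
    held = months_between(purchase_date, sale_date)
    appreciation = (sale_price - purchase_price) / purchase_price
    return held < max_months or appreciation > min_appreciation


# Hypothetical deed transfers: (purchase date, price, sale date, price).
transfers = [
    (date(2012, 1, 15), 100_000, date(2012, 5, 1), 101_000),  # short holding period
    (date(2010, 1, 1), 100_000, date(2012, 1, 1), 105_000),   # long hold, 5% gain
    (date(2010, 1, 1), 100_000, date(2012, 1, 1), 120_000),   # long hold, 20% gain
]
flip_count = sum(is_potential_flip(*t) for t in transfers)
```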
  • the property deed transfer history attribute may include one or more of the following information: the previous owner is a member of a network having high volume or suspicious deed transfer activity; the number of properties ever sold by previous owner that then resulted in default; and/or the previous owner's count of historical deed transfers within a network of associates.
  • the buyer history attribute may include one or more of the following information: the number of properties ever owned by the buyer(s); the number of properties ever owned by the buyer(s) business; and/or the number of properties ever sold by the buyer(s) that resulted in default.
  • the previous seller's cluster activity attribute may include one or more of the following information: buyer(s') count of historical deed transfers within a network of associates; and/or number of potential flip deed transfers (for example, property being owned less than 6 months or having a greater than 10% appreciation, etc.).
  • These or other features may be integrated into the scoring unit, so as to score the various putative clusters provided by the clustering unit.
  • Core transaction measurements which may be incorporated into the above list of features, may include velocity, profit, and buyer or seller relationship.
  • the filtering unit may filter the putative clusters and identify those clusters that are deemed to represent real organizations based on the scoring.
  • the putative cluster analysis system may leverage publicly available data, such as property deeds and assessments, which may include several hundred million records.
  • the putative cluster analysis system may also clean and standardize data to reduce the possibility that matching entities are considered as distinct.
  • the putative cluster analysis system may use this data to build a large-scale network map of the population in question and its associated flow of property.
  • the putative cluster analysis system may leverage relatively large-scale supercomputing power and analytics to target organized collusion.
  • Example implementation of the disclosed technology of the putative cluster analysis systems and methods may rely upon open-source large scale parallel-processing computing platforms to increase the agility and scale of solutions.
  • centroids may be derived from a public database of around fifty terabytes for the U.S. population.
  • a cluster network map may be created with around four hundred million clusters with seventeen billion relationships.
  • Example implementation of the disclosed technology of the putative cluster analysis systems and methods may measure behavior and relationships that traditionally may be used to obscure activities to more actively and effectively expose syndicates and rings of collusion. Unlike many conventional systems, the putative cluster analysis system need not be limited to rings operating in a single geographic location, and it need not be limited to short time periods. Further, the putative cluster analysis system need not be limited to measuring only individually high value transactions, as banks do when identifying potential fraud that they consider to be worth their resources. The putative cluster analysis systems and methods disclosed herein thus may enable investigations to prioritize efforts on organized groups more effectively, rather than investigating individual transactions to determine whether they fall within an organized ring.
  • A list of example attributes is shown in Table 1, below. It will be understood that these attributes are provided for illustrative purposes only and do not limit the scope of the putative cluster analysis systems and methods. Not all of these attributes need be used, and other attributes may be used as well, such as those described above with respect to the attributes 116 in reference to FIG. 1.
  • Buyer's Cluster in network high profit byr cl in net hi prof transfers byr cl in net hi prof flip cnt Buyer's Cluster in network high profit flips byr cl flop cnt Buyer's Cluster flop count
  • the scoring of the clusters may include accessing features, which may be based on research or knowledge about behaviors that suggest collusive activity.
  • each feature may represent a risky activity or characteristic.
  • the features may include the number of automobiles involved in an accident, the number of people injured, the value of vehicles involved in the accident, and/or number and extent of injuries.
  • each feature may be computed for each cluster.
  • a feature, for example, may be calculated as a composite of one or more attributes of the cluster in question. For example, an attribute for detecting mortgage fraud may be "date of last deed transfer." A feature that is based on this attribute may be "whether previous owner is a member of a network that shows high volume or suspicious deed transfer activity." Thus, this feature may be a composite of the "date of last deed transfer" attribute, along with other attributes.
  • the scoring of the clusters may include calculating a score for each cluster, based on the features computed for the cluster. With wisely-chosen features, the resulting score for a cluster may be indicative of the connectedness of the various data points within a cluster.
  • filtering may be utilized to examine the cluster scores and filter, or identify, which putative clusters are real clusters, i.e., represent organized groups of entities. Organized groups may be flagged as being potentially involved in collusion-based fraud.
  • a filter may be utilized to reduce the data set to identify groups that evidence the greatest connectedness based on the scoring algorithm.
  • putative clusters with scores that match a predetermined set of criteria may be flagged for evaluation.
  • filtering may utilize one or more target scores, which may be selected based on the industry, goals of the putative cluster analysis system, or the scoring algorithm.
  • putative clusters having scores greater than or equal to a target score may be flagged as being potentially collusive.
  • in conventional systems, the threshold for identifying fraud is set high so as to prevent identifying too many entities for examination.
  • the features and scoring algorithm may be chosen to identify connectedness without the concern that too many individuals will be identified.
  • groups, instead of individuals, may be identified.
  • FIG. 2 illustrates an example putative cluster 200 where certain connectedness between entities may be determined according to the systems and methods disclosed herein.
  • This particular example may be directed toward identifying potential mortgage fraud, and at the centroid of this example putative cluster 200 is a specific first property 202, which may be a house, for example.
  • This particular example is over-simplified for clarity, and it should be realized that such putative clusters in practice may actually contain hundreds of thousands of properties and associated entities having a dense web of connections among the properties, entities, etc.
  • the first property 202 may have certain characteristics (historical or otherwise) associated with it, for example, flipping (i.e., fast turnover), high sales profit, and/or transactions in which parties appeared to be associated with each other even outside of the transaction.
  • a first bank 206 that is considering providing a mortgage on this first property 202 to a potential buyer 204 may have certain visibility to the aforementioned characteristics but, using a conventional fraud-identification system, the bank 206 may not be able to detect the various connections that actually exist.
  • Other properties within the same putative cluster 200 may show similar characteristics: flipping, high sales profit, and relationships between parties.
  • connections between entities may be established based on public record documents, property deeds, etc., and such connections may be represented by lines connecting the entities, property, banks, etc.
  • a potential buyer 204 of a first property 202 may be in communication with a first bank 206.
  • the potential buyer 204, the first property, and the first bank may represent a first sub cluster 207.
  • the entire putative cluster 200 may include multiple sub clusters, each established with a property, person, etc., at its particular centroid.
  • the putative cluster 200 of FIG. 2 illustrates a number of sub clusters 207, 209, 212, 214, 226, 208.
  • a particular entity may be at the centroid of its own cluster, and that same particular entity may be duplicated in the putative cluster to show connections with other entities that are set at the centroid of their own cluster.
  • the potential buyer 204 is shown connected to the first sub cluster 207 in which the first house 202 is at the centroid.
  • the potential buyer 204 is also shown in the figure as being the centroid of a second sub cluster 209.
  • the fourth sub cluster 214 includes a second bank 215 at its centroid, and the potential buyer 204 is duplicated and shown as having a connection to the second bank 215.
  • the connection between first and second instances of the potential buyer 204 is represented in this figure by a thick line 205. Focusing now on the fourth sub cluster 214, in which the second bank 215 is at its centroid, we see that a second entity 216 is connected with the potential buyer 204, and with second bank 215.
  • a third entity 218 is connected to the second bank 215 and to the second entity 216.
  • the third entity 218 is a member of the fourth sub cluster 214 and the fifth sub cluster 226. Again, the connection between the duplicated instances of the third entity 218 is signified by the thick line 219.
  • the third entity 218 within the fifth sub cluster 226 is shown as being connected to a fourth entity 220, who is connected to a fifth entity 222. Therefore, according to this example putative cluster 200, a connection may be determined to exist between the potential buyer 204 and the fifth entity 222, and this connection is shown by the dotted line 224.
  • a single property may have changed ownership between multiple entities in the sub cluster, as shown in the first sub-cluster 207, the fifth sub-cluster 226 and the sixth sub-cluster 208.
  • the centroid property 202 being analyzed has been subject to a number of transfers between related entities, which is often an indicator of fraudulent activities. Again, the movement of this property among these various entities would likely be overlooked in a conventional fraud-detection system.
  • FIG. 3 depicts a block diagram of an illustrative computer system architecture 300 according to an example implementation of the disclosed technology.
  • Various implementations and methods herein may be embodied in non-transitory computer readable media for execution by a processor. It will be understood that the architecture 300 is provided for example purposes only and does not limit the scope of the various implementations of the communication systems and methods.
  • the architecture 300 of FIG. 3 includes a central processing unit (CPU) 302, where computer instructions are processed, and a display interface 304 that acts as a communication interface and provides functions for rendering video, graphics, images, and text on the display.
  • the display interface 304 may be directly connected to a local display.
  • the display interface 304 may be configured for providing data, images, and other information for an external/remote display or computer that is not necessarily connected to the particular CPU 302.
  • the display interface 304 may wirelessly communicate, for example, via a Wi-Fi channel or other available network connection interface 312 to an external/remote display.
  • the architecture 300 may include a keyboard interface 306 that provides a communication interface to a keyboard; and a pointing device interface 308 that provides a communication interface to a pointing device, mouse, and/or touch screen.
  • Example implementations of the architecture 300 may include an antenna interface 310 that provides a communication interface to an antenna, and a network connection interface 312 that provides a communication interface to a network.
  • the display interface 304 may be in communication with the network connection interface 312, for example, to provide information for display on a remote display that is not directly connected or attached to the system.
  • a camera interface 314 may be provided that may act as a communication interface and/or provide functions for capturing digital images from a camera.
  • a sound interface 316 is provided as a communication interface for converting sound into electrical signals using a microphone and for converting electrical signals into sound using a speaker.
  • a random access memory (RAM) 318 is provided, where computer instructions and data may be stored in a volatile memory device for processing by the CPU 302.
  • the architecture 300 includes a read-only memory (ROM) 320 where invariant low-level system code or data for basic system functions such as basic input and output (I/O), startup, or reception of keystrokes from a keyboard are stored in a non-volatile memory device.
  • the architecture 300 includes a storage medium 322 or other suitable type of memory (e.g.
  • the application programs 326 may include putative clustering instructions for organizing, storing, retrieving, comparing, and/or analyzing the various connections associated with the properties and entities associated with embodiments of the disclosed technology.
  • the putative cluster analysis system, the clustering unit, and/or the scoring unit may be embodied, at least in part, via the application programs 326 interacting with data from the ROM 320 or other memory storage medium 322, and may be enabled by interaction with the operating system 324 via the CPU 302 and bus 334.
  • the architecture 300 includes a power source 330 that provides an appropriate alternating current (AC) or direct current (DC) to power components.
  • the architecture 300 may include a telephony subsystem 332 that allows the device 300 to transmit and receive sound over a telephone network.
  • the constituent devices and the CPU 302 communicate with each other over a bus 334.
  • the CPU 302 has appropriate structure to be a computer processor.
  • the computer CPU 302 may include more than one processing unit.
  • the RAM 318 interfaces with the computer bus 334 to provide quick RAM storage to the CPU 302 during the execution of software programs such as the operating system, application programs, and device drivers. More specifically, the CPU 302 loads computer-executable process steps from the storage medium 322 or other media into a field of the RAM 318 in order to execute software programs. Data may be stored in the RAM 318, where the data may be accessed by the computer CPU 302 during execution.
  • the device 300 includes at least 128 MB of RAM, and 256 MB of flash memory.
  • the storage medium 322 itself may include a number of physical drive units, such as a redundant array of independent disks (RAID), a floppy disk drive, a flash memory, a USB flash drive, an external hard disk drive, thumb drive, pen drive, key drive, a High-Density Digital Versatile Disc (HD-DVD) optical disc drive, an internal hard disk drive, a Blu-Ray optical disc drive, or a Holographic Digital Data Storage (HDDS) optical disc drive, an external mini-dual inline memory module (DIMM) synchronous dynamic random access memory (SDRAM), or an external micro-DIMM SDRAM.
  • Such computer readable storage media allow the device 300 to access computer-executable process steps, application programs and the like, stored on removable and non-removable memory media, to off-load data from the device 300 or to upload data onto the device 300.
  • a computer program product, such as one utilizing a communication system, may be tangibly embodied in storage medium 322, which may comprise a machine-readable storage medium.
  • FIG. 4 illustrates a Venn diagram 400 of potentially fraudulent real estate transactions that may be identified, categorized, and/or flagged by putative cluster analysis, according to certain example embodiments of the disclosed technology.
  • the putative cluster analysis system may identify high-risk transactions that are performed within a network of associates that involve flipped properties and that result in high profits.
  • the term "flipping" is used herein to describe purchasing a revenue-generating asset and quickly reselling it for profit. This term is frequently used as a descriptive term for legal real estate investing strategies that are perceived by some to be unethical or socially destructive. Certain embodiments of the disclosed technology may be applied for sensing schemes involving market manipulation and other illegal conduct including potentially collusive behavior.
  • the Venn diagram 400 of FIG. 4 illustrates the overlap of certain attributes that may be determined from a number of transactions involving certain properties.
  • related entities that are identified as being in the same network 402 may comprise a subset of transactions.
  • Certain transactions may be flagged as extracting high profit 404, and other transactions may be characterized as flipping or flopping 406.
  • flipping or flopping 406 may have the characteristic of a purchase, followed by a sale within a short period of time after the purchase.
  • Certain flipping or flopping 406 transactions may have low profit, and certain transactions may have a high profit.
  • the overlap (designated by the letter Y) of flipping or flopping 406 transactions with those that are high profit 404 may provide loan profiles with the characteristic of loan files that were flipped and resulted in high profit gains 410.
  • the overlap (designated by the letter W) of high profit 404 transactions with in-network 402 transactions may provide loan profiles with a high profit gain having no flip 408.
  • the overlap (designated by the letter X) of in-network 402 transactions and flipping or flopping 406 transactions may provide a loan profile with flip flops that are not high profit 412.
  • the overlap of the in-network 402, the flipping or flopping 406, and the high profit 404 transactions may be designated (by the letter Z) as having the characteristic of in-cluster loans that were flipped and had high profit gains 414.
  • such overlap of characteristics, attributes data, etc. may be utilized to identify potential collusion within a network that may otherwise be very difficult to detect.
  • the method 500 starts in block 502, and according to an example implementation includes determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points.
  • the method 500 includes identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid.
  • the method 500 includes identifying cluster connections among the plurality of clusters.
  • the method 500 includes scoring the cluster connections based on predetermined criteria.
  • the method 500 includes identifying one or more of the distinct data points associated with the scored cluster connections.
  • the identified ringleader was not listed on any of the deeds of flipped properties, but could be identified by the test putative cluster analysis system by indirect connections with the flipped properties, and by other metrics disclosed herein.
  • Example implementations of the disclosed technology may be able to detect criminal activities that would not likely be identified if the involved individual or individuals intentionally avoid the type of behavior and connections that would be identifiable by conventional means.
  • Certain implementations of the disclosed technology of the putative cluster analysis systems and methods may be used to identify potential organizations of health insurance fraud, such as Medicaid fraud.
  • the input data to the clustering unit may be derived from historical address history of a population to be examined and such address history may be used to link individuals based on, for example, familial, residential, and business relationships.
  • the clustering unit may then take this input data and output clusters for use by the scoring unit.
  • some features considered for the scoring algorithm with respect to health insurance fraud may include: (1) the number of people within a cluster who lived in expensive residences, owned expensive property, or drove expensive cars; (2) the number of insurance recipients within the cluster who are contacts of medical providers; (3) the number of medical businesses associated with people in the cluster; (4) the number of people in cluster currently receiving benefits; and/or (5) the number of recipients associated with excluded providers.
  • These features may enable the putative cluster system to identify, among others, clusters that have dense clusters of recipients who appear to be colluding and transferring knowledge of how to claim Medicaid benefits and bypass eligibility requirements, as well as clusters that have close ties to medical providers who have the knowledge and means to defraud Medicaid.
  • the putative cluster analysis system may consider the following features to identify potential drug-seeking behavior: (1) prescription filling distance deviation; and (2) watchlist drug prescriptions. Such features may enable the putative cluster system to identify, among others, clusters that include patients who deviate when filling prescriptions for certain watchlist drugs, as well as clusters that include providers and prescribers with patterns of prescribing to the drug-seeking clusters.
  • certain technical effects can be provided, such as creating certain systems and methods that are able to identify an entity that is connected to various other entities evidencing suspicious behavior.
  • Embodiments of the disclosed technology may be utilized to examine related data in addition to data that is indicative of whether or not an individual entity is an active recipient of health insurance.
  • the putative cluster analysis system disclosed herein may also consider other recipients in an individual's cluster, which may be indicative of collusion.
  • the above or other features may be integrated into the scoring unit, so as to score the various clusters provided by the clustering unit.
  • the filtering unit may then filter the clusters to identify those that are deemed to represent real organizations based on the scoring.
  • Example implementations of the disclosed putative cluster analysis systems and methods may also be used to identify potential organizations of automobile insurance fraud.
  • automobile insurance fraud may include multiple victims, expensive vehicles, or multiple injuries.
  • some features considered for the scoring algorithm with respect to automobile insurance fraud may include: (1) the number of involved parties; (2) the number of claimants requiring medical treatment; (3) individual claim amounts; (4) vehicle damage; and (5) makes or models of involved automobiles. Analysis of these features, according to an example embodiment, may enable the putative cluster system to identify, among others, clusters that have a high number of collective claims with low standard deviation of claim counts, as well as clusters that have a statistically higher number of claims with soft tissue injuries, multiple passengers, low vehicle damage, or common passengers across multiple claims in the cluster.
  • the above or other features may be integrated into the scoring unit, so as to score the various clusters provided by the clustering unit.
  • the filtering unit may then filter the clusters to identify those that are deemed to represent real organizations based on the scoring.
  • Example implementations of the disclosed putative cluster analysis systems and methods may also be used to identify potential organizations involved in tax fraud.
  • some features considered for the scoring algorithm with respect to tax fraud may include: (1) a significant change in income between tax years; (2) a significant increase in deductions; (3) a change in filing status; (4) a change in the number or nature of dependents; and (5) the number of self-employed individuals in the cluster.
  • These or other features may be integrated into the scoring unit, so as to score the various clusters provided by the clustering unit.
  • the filtering unit may then filter the clusters to identify those that are deemed to represent real organizations based on the scoring.
  • Various embodiments of the putative cluster analysis systems and methods may be embodied, in whole or in part, in a computer program product stored on non-transitory computer-readable media for execution by one or more processors.
  • various aspects of the disclosed technology such as the clustering unit, the scoring unit, and the filtering unit, may comprise hardware or software of a computer system, as discussed above with respect to FIG. 3.
  • Although these units may be discussed herein as being distinct from one another, they may be implemented in various ways. The distinctions between them throughout this disclosure are made for illustrative purposes only, based on operational distinctiveness.
  • putative cluster analysis systems and methods need not be limited to those above.
  • an example implementation of the putative cluster analysis system may be used to identify potential fraud related to credit card applications, identity theft, investments, and various other fraud types that might involve an organization of connected entities.
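The FIG. 4 regions W, X, Y, and Z described in the list above can be read as overlaps of three yes/no attributes of a transaction. The following is a minimal sketch under that reading; the function name `venn_region` and the boolean inputs are hypothetical, and treating the regions as mutually exclusive is an assumption rather than something the disclosure states.

```python
def venn_region(in_network, flipped, high_profit):
    """Map three transaction attributes to a FIG. 4 region letter.

    Assumption: regions are mutually exclusive, so the three-way
    overlap Z is tested before the pairwise overlaps.
    """
    if in_network and flipped and high_profit:
        return "Z"  # in-cluster loans that were flipped with high profit gains
    if flipped and high_profit:
        return "Y"  # flipped with high profit gains
    if in_network and high_profit:
        return "W"  # high profit gain having no flip
    if in_network and flipped:
        return "X"  # flip flops that are not high profit
    return None  # at most one attribute present, so no overlap region


# Hypothetical transactions: (in_network, flipped, high_profit).
examples = {
    "loan_a": (True, True, True),
    "loan_b": (False, True, True),
    "loan_c": (True, False, True),
    "loan_d": (True, True, False),
    "loan_e": (False, True, False),
}
regions = {name: venn_region(*flags) for name, flags in examples.items()}
```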

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Certain implementations of the disclosed technology may include systems, methods, and computer-readable media for identifying connected organizations from a collection of records. A method is provided for determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points. The method includes identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid. The method further includes identifying cluster connections among the plurality of clusters, scoring the cluster connections based on predetermined criteria, and identifying one or more of the distinct data points associated with the scored cluster connections.

Description

SYSTEMS AND METHODS FOR PUTATIVE CLUSTER ANALYSIS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is related to U.S. Provisional Application serial no. 61/603,068 filed on February 24, 2012, entitled: "Systems and Methods for Putative Cluster Analysis," the contents of which are hereby incorporated by reference in their entirety.
[0002] This application is also related to U.S. Patent No. 7,403,942 to Bayliss, David et al. (hereinafter Bayliss I), filed February 4, 2003, and to US Patent Application Serial No. 10/357,489 to Bayliss, David et al. (hereinafter Bayliss II), which are hereby incorporated by reference as if fully set forth below.
TECHNICAL FIELD
[0003] Various embodiments of the systems and methods described herein relate to data mining and, more particularly, to systems and methods for efficiently mining data to identify collusion, fraud, and organized groups of entities.
BACKGROUND
[0004] Increasingly, commercial, governmental, institutional and other entities collect vast amounts of data related to a variety of subjects, activities and pursuits. Society's appreciation for and use of information technology and management to analyze such data is now well ensconced in everyday life. For example, collected data may be examined for historical, trending, predictive, preventive, profiling, and many other useful purposes. Although the technology for collecting and storing such vast amounts of data is in place, efficient and effective technology for accessing, processing, verifying, analyzing and decisioning relating to such vast amounts of data is presently lacking or at the least in need of improvement. There exists broad and eager anticipation for unleashing the potential associated with such vast amounts of data and expanding the power that intelligent business solutions bring to commercial, governmental, and other societal pursuits. There exists a need and desire for intelligent solutions to realize this potential.
[0005] Applications for exploiting collected data include, but are not limited to: national security; law enforcement; immigration and border control; locating missing persons and property; firearms tracking; civil and criminal investigations; person and property location and verification; governmental and agency record handling; entity searching and location; package delivery; telecommunications; consumer related applications; credit reporting, scoring, and/or evaluating; debt collection; entity identification verification; account establishment, scoring and monitoring; fraud detection; health industry (patient record maintenance); biometric and other forms of authentication; insurance and risk management; marketing, including direct to consumer marketing; human resources/employment; and financial/banking industries. The applications may span an enterprise or agency or extend across multiple agencies, businesses, industries, etc.
[0006] Another such application is identifying collusion, such as that related to mortgage fraud. According to the Federal Bureau of Investigation (FBI), pending mortgage fraud-related investigations increased twelve percent in the fiscal year ending September 30, 2010, compared with the previous year. This represents a ninety percent jump in the increase amount from the previous fiscal year. The collapse of the housing boom and the financial crisis have increased foreclosures. Although mortgage origination schemes have decreased because of the depressed housing market, fraud aimed at troubled borrowers has increased. Such fraud includes loan modification scams and foreclosure rescue schemes, in which perpetrators convince borrowers they can save their homes through deed transfers and upfront fees.
[0007] The available data related to mortgages and to other industries is potentially immense. It is desirable to use that data efficiently to identify groups of entities that work as organizations. These groups may be colluding to perform illegal activity, and identifying them may reduce that activity.
BRIEF SUMMARY
[0008] Briefly described, various embodiments of the disclosed technology may include putative cluster analysis systems and methods for identifying various connected entities and organizations. In an example implementation of the disclosed technology, an analytical system may be provided that includes a database, a clustering unit, a scoring unit, and a filtering unit. Certain implementations of the disclosed technology may include systems, methods, and computer-readable media for identifying connected organizations in a collection of distinct data points.
[0009] According to an example embodiment of the disclosed technology, a method is provided for determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points. The method includes identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid. The method further includes identifying cluster connections among the plurality of clusters, scoring the cluster connections based on predetermined criteria, and identifying one or more of the distinct data points associated with the scored cluster connections.
[0010] In an example implementation, a system is provided. The system may include one or more processors and at least one memory in communication with the one or more processors. The at least one memory may include an operating system, a database, a clustering unit, and a scoring unit. The memory in communication with the one or more processors may be configured for storing data and instructions which, when executed by the at least one processor under control of the operating system, enable the system to determine, from a collection of records in the database, wherein the collection of records comprises a plurality of distinct data points, connections between one or more of the plurality of distinct data points. The instructions, when executed by the at least one processor under control of the operating system, may further identify, by the clustering unit, and from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid. The instructions, when executed by the at least one processor under control of the operating system, may further identify, by the clustering unit, cluster connections among the plurality of clusters; score, by the scoring unit, the cluster connections based on predetermined criteria; and identify one or more of the distinct data points associated with the scored cluster connections.
[0011] According to an example embodiment of the disclosed technology, a computer-readable medium is provided for performing a method. The method includes determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points. The method includes identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid. The method further includes identifying cluster connections among the plurality of clusters, scoring the cluster connections based on predetermined criteria, and identifying one or more of the distinct data points associated with the scored cluster connections.
[0012] In an example implementation, the database may store a plurality of records to be analyzed. Each record may include data related to an entity or transaction. For example, a record may include data related to a real estate purchase, an insurance claim, or an income tax return. In an example implementation of the disclosed technology, the putative cluster analysis system may be directed to identify organizations related to a single industry. In that case, each record in the database, for the purpose of the putative cluster analysis system, may be related to that single industry. For example, if an embodiment of the putative cluster analysis system is directed to identifying insurance fraud, then various records may be related to insurance claims. Some embodiments of the disclosed technology may include a database. Other example embodiments of the disclosed technology may include systems and/or methods for accessing a database or other collection of data to be analyzed.
[0013] In an example implementation, the clustering unit may group the various records into distinct, putative clusters. The term "putative clusters" as discussed herein may mean groups of records that are supposed, presumed, and/or reputed as having some type of a connection to one another, no matter how tenuous that connection may prove to be in actuality. In an example implementation, each record, or data point, may be deemed the central point of a cluster. For that data point, relatives of that data point may be identified up to a predetermined distance from the central data point, where "distance" between points is predefined and, in some embodiments, relates to a degree of connectivity between data points.
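As a rough illustration of the clustering step in paragraph [0013], the sketch below treats every data point as the central point of its own putative cluster and gathers relatives up to a predetermined distance. It assumes that "distance" is simply the number of links (hops) between data points, which is only one possible reading; the adjacency map `links` and the function name `build_putative_cluster` are hypothetical.

```python
from collections import deque


def build_putative_cluster(links, centroid, max_distance):
    """Return {point: hop_distance} for points within max_distance of centroid.

    Assumption: distance is the hop count found by breadth-first search.
    """
    distances = {centroid: 0}
    queue = deque([centroid])
    while queue:
        point = queue.popleft()
        if distances[point] == max_distance:
            continue  # do not expand beyond the predetermined distance
        for neighbor in links.get(point, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[point] + 1
                queue.append(neighbor)
    return distances


# Hypothetical undirected links between data points (illustrative only).
links = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}

# Every data point is treated as the central point of its own putative cluster.
clusters = {p: build_putative_cluster(links, p, max_distance=2) for p in links}
```

With these sample links, the cluster centered on "A" reaches "D" at distance 2 but excludes "E", which is three links away.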
[0014] According to an example implementation, the scoring unit may have access to a predetermined feature set, and may be configured to analyze each putative cluster based on the feature set. Within a cluster, a direct link exists between each pair of data points with a direct relationship. For example, if a pair of data points represents two real estate transactions with the same seller, then these data points may be connected by a direct link within a cluster. Data points within a cluster may be indirectly connected when the data points are connected by a series of links.
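The direct and indirect links of paragraph [0014] can be sketched as follows. Two transaction records are linked directly when they share a seller, as in the example above; linking records that share a buyer is an added assumption made only so that the sample data contains an indirect connection. The record fields, names, and helper functions are all hypothetical.

```python
from collections import deque
from itertools import combinations

# Hypothetical real estate transaction records (illustrative only).
transactions = {
    "t1": {"seller": "Smith", "buyer": "Jones"},
    "t2": {"seller": "Smith", "buyer": "Lee"},
    "t3": {"seller": "Patel", "buyer": "Lee"},
    "t4": {"seller": "Garcia", "buyer": "Kim"},
}


def direct_links(records):
    """Link two records that share a seller (or, by assumption, a buyer)."""
    links = {rid: set() for rid in records}
    for a, b in combinations(records, 2):
        same_seller = records[a]["seller"] == records[b]["seller"]
        same_buyer = records[a]["buyer"] == records[b]["buyer"]
        if same_seller or same_buyer:
            links[a].add(b)
            links[b].add(a)
    return links


def connection_path(links, start, goal):
    """Return the shortest series of links from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        point = queue.popleft()
        if point == goal:
            path = []
            while point is not None:
                path.append(point)
                point = previous[point]
            return path[::-1]
        for neighbor in sorted(links[point]):
            if neighbor not in previous:
                previous[neighbor] = point
                queue.append(neighbor)
    return None


links = direct_links(transactions)
path_t1_t3 = connection_path(links, "t1", "t3")  # indirect, via t2
path_t1_t4 = connection_path(links, "t1", "t4")  # no connection
```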
[0015] According to an example implementation, for each feature in the feature set, the scoring unit may analyze the attributes of the various links or data points in the cluster to provide a score with respect to the feature in question. Thus, in one example embodiment, each data point or each link may be assigned a score for each feature. In one implementation, the cluster as a whole may be assigned a total score comprising a combination of the scores of the various features applicable to the cluster. The total score may be one of various combinations calculated from the feature scores, such as, for example, a sum, a weighted sum, or another formula based on the various features.
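The combination of feature scores into a total score, as described in paragraph [0015], might look like the sketch below. The feature names, score values, and weights are invented for illustration; the disclosure names a sum and a weighted sum as options but does not specify weights.

```python
def score_cluster(feature_scores, weights=None):
    """Combine per-feature scores into a total score for one cluster.

    With no weights this is a plain sum; otherwise it is a weighted sum.
    """
    if weights is None:
        weights = {name: 1.0 for name in feature_scores}
    return sum(weights[name] * value for name, value in feature_scores.items())


# Hypothetical per-feature scores for a single putative cluster.
feature_scores = {"flips": 3.0, "high_profit_sales": 2.0, "related_parties": 4.0}
# Hypothetical weights; real weights would depend on the fraud type analyzed.
weights = {"flips": 0.5, "high_profit_sales": 1.0, "related_parties": 2.0}

plain_total = score_cluster(feature_scores)
weighted_total = score_cluster(feature_scores, weights)
```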
[0016] According to an example implementation, the filtering unit may filter the putative clusters into real clusters and false clusters, where the real ones will be deemed to be those of interest for potential collusion. In an example implementation of the disclosed technology, the filtering unit may utilize a predetermined algorithm for separating the clusters into two groups based on the results of the scoring. For example, the algorithm may include a filter that significantly reduces the data set by selecting a subset of the putative clusters to deem real clusters. The algorithm may be embodied in various forms according to certain embodiments. For example, the algorithm may examine the result of the scoring for each feature, and may select a subset of the clusters based on the various feature scores. Alternatively, the filtering unit may have a target score, and real clusters may be those whose combination of feature scores meets a criterion (e.g., greater than or less than) with respect to that target score.
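The target-score variant of the filtering step in paragraph [0016] can be sketched as a simple partition. The cluster identifiers, total scores, and target value are hypothetical, and the default "greater than" criterion is only one of the comparisons the paragraph allows.

```python
import operator


def filter_clusters(total_scores, target_score, criterion=operator.gt):
    """Split putative clusters into real and false clusters by target score.

    criterion is any two-argument comparison, e.g. operator.gt or operator.lt.
    """
    real_clusters, false_clusters = {}, {}
    for cluster_id, score in total_scores.items():
        if criterion(score, target_score):
            real_clusters[cluster_id] = score
        else:
            false_clusters[cluster_id] = score
    return real_clusters, false_clusters


# Hypothetical total scores produced by a scoring step.
total_scores = {"c1": 11.5, "c2": 3.0, "c3": 8.0}
real_clusters, false_clusters = filter_clusters(total_scores, target_score=7.5)
```

Passing `criterion=operator.lt` would instead treat low-scoring clusters as the real ones, which may suit features where a smaller value is the more suspicious one.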
[0017] According to an example implementation, the putative cluster analysis system may calculate a set of putative clusters and filter those putative clusters into a set of high-interest real clusters. These and other embodiments of the putative cluster analysis systems and methods will be described in more detail below with reference to the figures.
BRIEF DESCRIPTION OF THE FIGURES
[0018] FIG. 1 illustrates an analytics method according to an example implementation of the disclosed technology.
[0019] FIG. 2 illustrates a putative cluster evidencing connectedness between entities represented in the cluster and sub-clusters.
[0020] FIG. 3 illustrates a representative computer architecture, according to an example embodiment of the disclosed technology.
[0021] FIG. 4 illustrates a diagram of potentially fraudulent transactions identified by an example implementation of the disclosed technology during a test analysis.
[0022] FIG. 5 is a flow-diagram of a method, according to an example embodiment of the disclosed technology.
DETAILED DESCRIPTION
[0023] To facilitate an understanding of the principles and features of the disclosed technology, various illustrative embodiments are explained below. Embodiments of the disclosed technology, however, are not limited to these embodiments. The materials and components described hereinafter as making up elements of the disclosed technology are intended to be illustrative and not restrictive. Many suitable materials and components that would perform the same or similar functions as the materials and components described herein are intended to be embraced within the scope of the disclosed technology. Other materials and components not described herein can include, but are not limited to, for example, similar or analogous materials or components developed after development of the disclosed technology.
[0024] Example systems and methods described herein may utilize various forms of data to identify connected entities and/or organizations. Certain embodiments of the disclosed technology may provide improved accuracy over conventional data mining and putative cluster analysis systems and techniques. For example, insurance companies and other industries attempting to identify fraud may utilize conventional focused analysis techniques that examine each event in isolation. The conventional techniques typically utilize high thresholds to filter the large number of events to be analyzed. In other words, because the data that entities must analyze with conventional techniques is so large, a high degree of suspicious activity may be required in order to identify fraud. Without a high threshold, conventional techniques may have too many potentially fraudulent events to investigate. As a result, entities using conventional techniques often overlook collusion from groups that are able to stay below these high thresholds with respect to certain suspicious activities.
[0025] Conventional systems for identifying mortgage fraud are often tied to specific loan portfolios, such as those related to a particular bank. Thus, these systems are not well-suited to fraud detection in large scale and across multiple banks. Mortgage fraud is prolific and can be hard to detect, especially given that mortgage data is spread across numerous databases and not consolidated into one database. The relevant data may be spread through financial services organizations, government agencies, and public records, such as property and assessment deeds. Further, the government agencies controlling some of this data have limited resources to detect and investigate the bigger mortgage fraud schemes. The putative cluster analysis system disclosed herein may be capable of efficiently leveraging readily available data to help organizations detect, prioritize, and investigate large mortgage fraud schemes.
[0026] When applied to the problem of mortgage fraud, the putative cluster analysis systems and methods disclosed herein may perform one or more of the following tasks: drive improved investigative and due diligence workflows; evaluate and segment loan files to identify notable risks; identify non-obvious relationships between entities, within and external to loan transactions; expose key perpetrators to improve remediation and recourse opportunities; augment existing fraud detection and scoring models during origination and loan pool acquisition; and enhance internal fraud and risk controls with a flexible pattern selection process.
[0027] According to an example implementation of the disclosed technology, the putative cluster analysis system may start with a large quantity of data and group that data into smaller, distinct clusters. In an example embodiment, the proximity of seemingly low risk activity within each cluster may be measured using lower thresholds than is reasonably possible in the methods used by conventional systems. As a result, the putative cluster analysis system may identify potentially organized groups without having to apply low thresholds to the large amounts of data as a whole.
[0028] In accordance with certain example embodiments of the disclosed technology, high interest clusters may be identified from a plurality of data. High interest clusters, for example, may represent connected organizations, entities, and/or people. In certain example implementations, the putative cluster analysis system disclosed herein may rely upon relatively large amounts of data to measure proximity of seemingly low risk events commonly associated with high risk activities to detect potentially fraudulent activities.
[0029] In one example embodiment, a domain of entities may be identified for analysis. For example, data associated with a large number (perhaps millions) of property deeds may be gathered for analysis. The associated data may include identities of individuals, organizations, companies, etc., that are associated with the deeds. The associated data may include information such as addresses, mortgage lenders, names of law firms, dates of transactions, etc. According to certain example embodiments of the disclosed technology, one or more types of relationships between the entities may then be collected. According to an example embodiment, a non-partitioning clustering algorithm may be utilized to form clusters for each of the domain entities, wherein copies of the domain entity may be created, as required, for populating clusters associated with neighboring clusters.
[0030] In certain embodiments, a filtering mechanism may operate against the clusters and may retain those clusters that have outlying behavior. Such filtering may conventionally utilize graph or network analysis, and queries/filtering of this form may utilize sub-graph matching routines or fuzzy sub-graph matching. However, sub-graph matching routines or fuzzy sub-graph matching techniques may be NP-complete, and thus, impractical for analyzing large sets of data. The most notable characteristic of NP-complete problems is that no fast solution to them is known. That is, the time required to solve the problem using any currently known algorithm increases very quickly as the size of the problem grows. This means that the time required to solve even moderately sized versions of many of these problems can easily reach into the billions or trillions of years, using any amount of computing power available today. Embodiments of the disclosed technology may be utilized to provide clusters and connections between entities even though the set of data analyzed may be extremely large.
[0031] In accordance with an example implementation of the disclosed technology, entities may be identified and may include people, companies, places, objects, virtual identities, etc. In an example embodiment, relationships may be formed in many ways, and with many qualities. For example, co-occurrence of values in common database fields may be utilized, such as the same last name. Relationships may also be formed using multiple co-occurrences of an entity with one or more other properties, such as people who have lived at two or more of the same addresses.
[0032] Relationships may also be formed based on a high reoccurrence and/or frequency of a common relationship, according to an example embodiment. For example, records of person X sending an email to person Y greater than N times may indicate a relationship between person X and person Y. In another example embodiment, if person X sends an email to or receives an email from person Y, and within a short period of time, person Z sends an email to or receives an email from person Y, then a relationship may be implied between person X and person Z.
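The two ways of forming relationships described in paragraphs [0031] and [0032] can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the disclosure: the function names `cooccurrence_links` and `frequency_links`, the record layout, and the sample records are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_links(person_addresses, min_shared=2):
    """Link two people when they share at least `min_shared` addresses."""
    residents = defaultdict(set)
    for person, addresses in person_addresses.items():
        for address in addresses:
            residents[address].add(person)
    shared = Counter()
    for people in residents.values():
        # Count each unordered pair once per shared address.
        for pair in combinations(sorted(people), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}

def frequency_links(emails, n):
    """Link sender X to recipient Y when X emailed Y more than `n` times."""
    counts = Counter(emails)  # each email is a (sender, recipient) tuple
    return {pair for pair, count in counts.items() if count > n}

# Hypothetical records, for illustration only.
addresses = {
    "X": {"1 Elm St", "9 Oak Ave"},
    "Y": {"1 Elm St", "9 Oak Ave", "4 Pine Rd"},
    "Z": {"4 Pine Rd"},
}
emails = [("X", "Y")] * 4 + [("Z", "Y")]

address_links = cooccurrence_links(addresses)
email_links = frequency_links(emails, n=3)
```

Here X and Y share two addresses and are linked, while Y and Z share only one and are not. The threshold values are arbitrary and would be tuned to the data in practice.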
[0033] In accordance with an example implementation of the disclosed technology, relationships between entities may comprise Boolean, weighted, directed, undirected, and/or combinations of multiple relationships. According to certain example embodiments of the disclosed technology, clustering of the entities may rely on relationship steps. In one embodiment, entities may be related by at least two different relationship types. In one embodiment, relationships for the clustering may be established by examining weights or strengths of connections between entities in certain directions and conditional upon other relationships, including temporal relationships. For example, in one embodiment, the directional relationships between entities X, Y, and Z may be examined and the connection between X, Y, and Z may be followed if the link between Y and Z was established (in time) after the link was established between X and Y.
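The temporal condition in the X, Y, Z example above can be sketched in a few lines. This is an illustrative sketch only: the tuple layout `(source, target, established)`, the integer timestamps, and the function name `follow_temporal_paths` are assumptions and do not come from the disclosure.

```python
def follow_temporal_paths(links, start):
    """Return X -> Y -> Z chains where the Y-Z link was established after the X-Y link.

    `links` holds directed (source, target, established) tuples; `established`
    is any comparable timestamp (plain integers in this sketch).
    """
    chains = []
    for x, y, t_xy in links:
        if x != start:
            continue
        for y2, z, t_yz in links:
            # Follow the second hop only if it is later in time than the first.
            if y2 == y and z != x and t_yz > t_xy:
                chains.append((x, y, z))
    return chains

# Hypothetical links, for illustration only.
links = [
    ("X", "Y", 1),
    ("Y", "Z", 2),  # after X-Y, so X -> Y -> Z is followed
    ("Y", "W", 0),  # before X-Y, so X -> Y -> W is not followed
]
chains = follow_temporal_paths(links, "X")
```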
[0034] Many methods may be utilized to filter clusters, once they are identified. For example, in one embodiment, clusters may be scored. In another embodiment, a threshold may be utilized to identify clusters of interest. According to an example embodiment of the disclosed technology, a model may be utilized to compute a number of statistics on each cluster. In one embodiment, the model may be as simple as determining counts. In another embodiment, the model may detect relationships within a cluster, for example, entities that are related to the centroid of the cluster that are also related to each other. This analysis may provide a measure of cohesiveness of relationships that exist inside the cluster. According to an example embodiment of the disclosed technology, once the statistics have been computed, scoring and weighting of each cluster may be utilized to determine which clusters rise above a particular threshold, and may be classified as "interesting." In accordance with an example embodiment of the disclosed technology, the weighting and/or scoring of the determined statistics may be accomplished using a heuristic scoring model, such as linear regression, neural network analysis, etc.
[0035] An example analytics method may be implemented by a putative cluster analysis system 100, as illustrated in FIG. 1. It will be understood that the method illustrated herein is provided for illustrative purposes only and does not limit the scope of the disclosed technology.
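The cluster statistics of paragraph [0034] (simple counts, plus entities related to the centroid that are also related to each other) can be illustrated with a short sketch. The cohesiveness ratio used here, the fraction of centroid-neighbor pairs that are themselves linked, is one possible definition chosen for illustration; the function name `cluster_statistics` and the sample edges are hypothetical.

```python
from itertools import combinations

def cluster_statistics(centroid, edges):
    """Compute simple counts and a cohesiveness ratio for one cluster.

    `edges` is a set of frozenset pairs (undirected relationships). Cohesiveness
    is taken here as the fraction of neighbor pairs of the centroid that are
    also related to each other; that definition is an assumption.
    """
    neighbors = sorted({n for e in edges if centroid in e for n in e if n != centroid})
    pairs = list(combinations(neighbors, 2))
    linked = sum(1 for a, b in pairs if frozenset((a, b)) in edges)
    return {
        "neighbor_count": len(neighbors),
        "linked_neighbor_pairs": linked,
        "cohesiveness": linked / len(pairs) if pairs else 0.0,
    }

# Hypothetical cluster: centroid C with neighbors A, B, D, where A and B are also related.
edges = {frozenset(p) for p in [("C", "A"), ("C", "B"), ("C", "D"), ("A", "B")]}
stats = cluster_statistics("C", edges)
```

In this example one of the three neighbor pairs is linked, so the ratio is one third. Statistics of this kind could then be passed to whatever weighting or scoring model is chosen.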
[0036] The putative cluster analysis system 100 may receive a plurality of data 102 to be analyzed. In accordance with an example embodiment, the data may be processed 104, and output 106 may be generated. In one example embodiment, the data may include identities and property deeds 108. The data may also include information 110, for example, that may include data related to a bank portfolio. In an example embodiment, the system 100 may receive the data 102 in its various forms (which may include identities, property deeds, portfolios, etc.), and may process 104 the data 102 to derive relationships 112 and perform analytics 114. In an example embodiment, the relationships 112 and analytics 114 may be used to determine particular attributes 116. For example, the attributes 116 may include one or more of the following: property status; property deed transfer history; buyer history; and/or the previous seller's cluster activity. According to an example embodiment of the disclosed technology, the determined attributes 116 may go through a scoring and filtering process 118, which may result in an output 106 that may include one or more primary attributes 120, features 122, and risk segmentation 124. In accordance with an example embodiment of the disclosed technology, the primary attributes 120 may include entity and property characteristics, such as suspicious deeds, associations with businesses and other entities, seller address history, etc. According to an example embodiment, the features 122 may be derived from aggregating characteristics such as store code deeds, defaults, transfer activity, etc. In one example embodiment, such features 122 may be derived by combining primary attributes 120. In an example embodiment, the risk segmentation 124 may be utilized to augment current scoring models.
[0037] According to an example implementation of the disclosed technology, the clustering unit of the putative cluster analysis system 100 may treat each data point in the data as a centroid of its own cluster. Thus, the total number of clusters may be equal to the total number of data points, and each cluster may be uniquely represented by its centroid data point. The distance between the centroid and any data point within each cluster may be limited, such that the clusters are limited in size and, for some analyses, may be treated as being disconnected from one another. An example method of clustering data for the purposes of the example implementation of the disclosed technology of the putative cluster analysis systems and methods is disclosed in Bayliss I and II, which are incorporated herein.
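The idea of treating every data point as the centroid of its own size-limited cluster can be sketched as a bounded breadth-first search. This is not the clustering method of Bayliss I and II, which is not reproduced here; it is a simplified stand-in that assumes hop count as the distance measure, and the names `build_clusters` and `max_distance` are hypothetical.

```python
from collections import deque

def build_clusters(adjacency, max_distance=2):
    """Form one cluster per data point, limited to `max_distance` hops from the centroid.

    Clusters may overlap, so a data point can appear (as a copy) in many clusters.
    Hop count as the distance measure is an assumption.
    """
    clusters = {}
    for centroid in adjacency:
        seen = {centroid: 0}  # node -> hop distance from the centroid
        queue = deque([centroid])
        while queue:
            node = queue.popleft()
            if seen[node] == max_distance:
                continue  # do not expand past the distance limit
            for neighbor in adjacency.get(node, ()):
                if neighbor not in seen:
                    seen[neighbor] = seen[node] + 1
                    queue.append(neighbor)
        clusters[centroid] = set(seen)
    return clusters

# Hypothetical chain of four related data points.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
clusters = build_clusters(adjacency, max_distance=1)
```

The number of clusters equals the number of data points, and data point B appears in the clusters of A, B, and C, which reflects the non-partitioning character of the approach.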
[0038] According to an example implementation of the disclosed technology, scoring and filtering 118 may be applied, for example, to analyze each cluster and assign one or more scores to each cluster. In an example implementation, a scoring unit may utilize a predetermined scoring algorithm for scoring some or all of the clusters. In another example implementation, the scoring unit may utilize a dynamic scoring algorithm for scoring some or all of the clusters. The scoring algorithm, for example, may be based on seemingly low-risk events that tend to be associated with organizations, such as fraud organizations. The algorithm may thus also be based on research into what events tend to be indicative of fraud in the industry or application to which the putative cluster analysis system is directed.
[0039] In one example implementation, each putative cluster may be scored individually. For example, a plurality of predetermined attributes, or variables, may be calculated for each cluster based on the data points in the cluster. For each attribute, the putative cluster as a whole may be considered, or each data point or link between pairs of data points may be considered. An attribute may be evaluated and scored depending on the nature of the attribute.
[0040] According to an example implementation of the disclosed technology, the property status attribute may include one or more of the following: the date of subject property last deed; the sale amount of subject property, the last recorded deed transfer; the number of months subject property was owned by previous owner; and/or the number of potential flip deed transfers (for example, property being owned less than 6 months or having a greater than 10% appreciation, etc.).
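The "potential flip" example in paragraph [0040] (ownership of less than 6 months, or appreciation greater than 10%) translates directly into a small predicate. The sketch below treats those two figures as default parameters; the function names, the whole-calendar-month convention, and the sample values in the usage note are assumptions made for illustration.

```python
from datetime import date

def months_between(earlier, later):
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the final month is not yet complete
    return months

def is_potential_flip(purchase_date, purchase_price, sale_date, sale_price,
                      max_months=6, min_appreciation=0.10):
    """Flag a transfer held under `max_months` or sold above `min_appreciation`."""
    held = months_between(purchase_date, sale_date)
    appreciation = (sale_price - purchase_price) / purchase_price
    return held < max_months or appreciation > min_appreciation
```

For example, `is_potential_flip(date(2012, 1, 15), 200000, date(2012, 4, 20), 205000)` is flagged because the property was held for only three months, even though the appreciation is small.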
[0041] According to an example implementation of the disclosed technology, the property deed transfer history attribute may include one or more of the following information: the previous owner is a member of a network having high volume or suspicious deed transfer activity; the number of properties ever sold by previous owner that then resulted in default; and/or the previous owner's count of historical deed transfers within a network of associates.
[0042] According to an example implementation of the disclosed technology, the buyer history attribute may include one or more of the following information: the number of properties ever owned by the buyer(s); the number of properties ever owned by the buyer(s) business; and/or the number of properties ever sold by the buyer(s) that resulted in default.
[0043] According to an example implementation of the disclosed technology, the previous seller's cluster activity attribute may include one or more of the following information: the buyer's (or buyers') count of historical deed transfers within a network of associates; and/or the number of potential flip deed transfers (for example, property being owned less than 6 months or having a greater than 10% appreciation, etc.).
[0044] These or other features may be integrated into the scoring unit, so as to score the various putative clusters provided by the clustering unit. Core transaction measurements, which may be incorporated into the above list of features, may include velocity, profit, and buyer or seller relationship. The filtering unit may filter out those clusters that are deemed to represent real organizations based on the scoring.
[0045] In accordance with an example implementation of the disclosed technology, the putative cluster analysis system may leverage publicly available data, such as property deeds and assessments, which may include several hundred million records. The putative cluster analysis system may also clean and standardize data to reduce the possibility that matching entities are considered as distinct. Before creating the putative clusters, the putative cluster analysis system may use this data to build a large-scale network map of the population in question and its associated flow of property.
[0046] According to an example implementation, the putative cluster analysis system may leverage a relatively large scale of supercomputing power and analytics to target organized collusion. Example implementations of the disclosed technology of the putative cluster analysis systems and methods may rely upon open-source large scale parallel-processing computing platforms to increase the agility and scale of solutions. In one embodiment of the putative cluster analysis system, centroids may be derived from a public database of around fifty terabytes for the U.S. population. In this embodiment, a cluster network map may be created with around four hundred million clusters with seventeen billion relationships.
[0047] Example implementation of the disclosed technology of the putative cluster analysis systems and methods may measure behavior and relationships that traditionally may be used to obscure activities to more actively and effectively expose syndicates and rings of collusion. Unlike many conventional systems, the putative cluster analysis system need not be limited to rings operating in a single geographic location, and it need not be limited to short time periods. Further, the putative cluster analysis system need not be limited to measuring only individually high value transactions, as banks do when identifying potential fraud that they consider to be worth their resources. The putative cluster analysis systems and methods disclosed herein thus may enable investigations to prioritize efforts on organized groups more effectively, rather than investigating individual transactions to determine whether they fall within an organized ring.
[0048] A list of example attributes is shown in Table 1, below. It will be understood that these attributes are provided for illustrative purposes only and do not limit the scope of the putative cluster analysis systems and methods. Not all of these attributes need be used, and other attributes may be used as well, such as those described above with respect to the attributes 116 in reference to FIG. 1.
Table 1
[Figures imgf000014_0001 and imgf000015_0001: image content not reproduced in this text]
sir cl hi prof cnt             | Previous Seller's Cluster high profit transfers
sir cl in net hi prof          | Previous Seller's Cluster in network high profit transfers
sir cl in net hi prof flip cnt | Previous Seller's Cluster in network high profit flips
sir cl flop cnt                | Previous Seller's Cluster flop count
sir cl default cnt             | Previous Seller's Cluster property transfers ending in default
sir cl fc cnt                  | Previous Seller's Cluster property transfers ending in foreclosure
sir cl ends in default fc      | Previous Seller's Cluster property sales end in default or foreclosure
Buyer's Activity
byr cl flip 0 deg              | Buyer Flips
byr cl in net cnt 0 deg        | Buyer in network deed transfers
byr cl in net flip cnt 0 deg   | Buyer in network flips
byr cl hi prof cnt 0 deg       | Buyer high profit transfer
byr cl in net hi prof 0 deg    | Buyer in network high profit transfers
byr cl cl fc default cnt 0 deg | Buyer property transfers in default or foreclosure
byr susp flip net              | Buyer member of a suspicious flip network
byr susp fc net                | Buyer member of suspicious network with foreclosure
Buyer's Cluster Activity
byr cl sales cnt               | Buyer's Cluster Total Deed Transfers
byr cl flip cnt                | Buyer's Cluster Flips
byr cl flip bus cnt            | Buyer's Cluster business flips
byr cl in net cnt              | Buyer's Cluster in network transfers
byr cl in net flip bus cnt     | Buyer's Cluster in network business flips
byr cl in net flop             | Buyer's Cluster in network flops
byr cl in net flip cnt         | Buyer's Cluster in network flips
byr cl hi prof cnt             | Buyer's Cluster high profit transfers
byr cl in net hi prof          | Buyer's Cluster in network high profit transfers
byr cl in net hi prof flip cnt | Buyer's Cluster in network high profit flips
byr cl flop cnt                | Buyer's Cluster flop count
byr cl default cnt             | Buyer's Cluster property transfers ending in default
byr cl fc cnt                  | Buyer's Cluster property transfers ending in foreclosure
byr cl ends in default fc      | Buyer's Cluster property sales end in default or foreclosure
[0049] In accordance with an example implementation of the disclosed technology, the scoring of the clusters may include accessing features, which may be based on research or knowledge about behaviors that suggest collusive activity. For example, each feature may represent a risky activity or characteristic. In identifying automobile insurance fraud, for example, the features may include the number of automobiles involved in an accident, the number of people injured, the value of vehicles involved in the accident, and/or number and extent of injuries.
[0050] In an example implementation, each feature may be computed for each cluster. A feature, for example, may be calculated as a composite of one or more attributes of the cluster in question. For example, an attribute for detecting mortgage fraud may be "date of last deed transfer." A feature that is based on this attribute may be "whether previous owner is a member of a network that shows high volume or suspicious deed transfer activity." Thus, this feature may be a composite of the "date of last deed transfer" attribute, along with other attributes.
[0051] In accordance with an example implementation of the disclosed technology, the scoring of the clusters may include calculating a score for each cluster, based on the features computed for the cluster. With wisely-chosen features, the resulting score for a cluster may be indicative of the connectedness of the various data points within a cluster. In accordance with an example implementation of the disclosed technology, filtering may be utilized to examine the cluster scores and filter, or identify, which putative clusters are real clusters, i.e., represent organized groups of entities. Organized groups may be flagged as being potentially involved in collusion-based fraud.
[0052] In one example implementation, a filter may be utilized to reduce the data set to identify groups that evidence the greatest connectedness based on the scoring algorithm. In one example implementation, putative clusters with scores that match a predetermined set of criteria may be flagged for evaluation. In an example implementation of the disclosed technology, filtering may utilize one or more target scores, which may be selected based on the industry, goals of the putative cluster analysis system, or the scoring algorithm. In one example implementation, putative clusters having scores greater than or equal to a target score may be flagged as being potentially collusive.
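The scoring and target-score filtering described in paragraphs [0051] and [0052] can be sketched as follows. The weighted sum is only a stand-in for whatever scoring algorithm is chosen, and the feature names, weights, feature values, and target score are all hypothetical values picked for illustration.

```python
def cluster_score(features, weights):
    """Weighted sum of feature values; a stand-in for any scoring model."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights and per-cluster features, for illustration only.
weights = {"flip_cnt": 2.0, "in_net_cnt": 1.0, "default_cnt": 3.0}
putative = {
    "cluster_a": {"flip_cnt": 4, "in_net_cnt": 6, "default_cnt": 2},
    "cluster_b": {"flip_cnt": 0, "in_net_cnt": 1, "default_cnt": 0},
}

scores = {cid: cluster_score(f, weights) for cid, f in putative.items()}

# Flag putative clusters whose score is greater than or equal to the target score.
target = 10.0
flagged = [cid for cid, s in scores.items() if s >= target]
```

Only `cluster_a` meets the target in this example. In practice the target would be selected based on the industry, the goals of the system, or the scoring algorithm, as the text notes.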
[0053] As discussed above, an issue with conventional systems is that the threshold for identifying fraud is too high, so as to prevent identifying too many entities for examination. According to an example implementation of the disclosed technology, the features and scoring algorithm may be chosen to identify connectedness without the concern that too many individuals will be identified. According to an example implementation of the disclosed technology, groups, instead of individuals, may be identified.
[0054] FIG. 2 illustrates an example putative cluster 200 where certain connectedness between entities may be determined according to the systems and methods disclosed herein. This particular example may be directed toward identifying potential mortgage fraud, and at the centroid of this example putative cluster 200 is a specific first property 202, which may be a house, for example. This particular example is over-simplified for clarity, and it should be realized that such putative clusters in practice may actually contain hundreds of thousands of properties and associated entities having a dense web of connections among the properties, entities, etc.
[0055] In accordance with an example implementation, and with continued reference to FIG. 2, the first property 202 may have certain characteristics (historical or otherwise) associated with it, for example, flipping (i.e., fast turnover), high sales profit, and/or transactions in which parties appeared to be associated with each other even outside of the transaction. A first bank 206 that is considering providing a mortgage on this first property 202 to a potential buyer 204 may have certain visibility to the aforementioned characteristics but, using a conventional fraud-identification system, the bank 206 may not be able to detect the various connections that actually exist. Other properties within the same putative cluster 200 may show similar characteristics: flipping, high sales profit, and relationships between parties. Alone, this may not raise a flag, but the putative cluster analysis systems and/or methods disclosed herein may identify this centroid property's connections to the other properties in the putative cluster 200, thus possibly raising a flag when suspicious connections are identified.
[0056] In accordance with an example embodiment of the disclosed technology, connections between entities may be established based on public record documents, property deeds, etc., and such connections may be represented by lines connecting the entities, property, banks, etc. For example, a potential buyer 204 of a first property 202 may be in communication with a first bank 206. In one embodiment, the potential buyer 204, the first property 202, and the first bank 206 may represent a first sub cluster 207. The entire putative cluster 200 may include multiple sub clusters, each established with a property, person, etc., at its particular centroid. For example, the putative cluster 200 of FIG. 2 illustrates a number of sub clusters 207, 208, 209, 212, 214, and 226. As discussed above, a particular entity may be at the centroid of its own cluster, and that same particular entity may be duplicated in the putative cluster to show connections with other entities that are set at the centroid of their own cluster.
[0057] The potential buyer 204, for example, is shown connected to the first sub cluster 207 in which the first property 202 is at the centroid. The potential buyer 204 is also shown in the figure as being the centroid of a second sub cluster 209. The fourth sub cluster 214 includes a second bank 215 at its centroid, and the potential buyer 204 is duplicated and shown as having a connection to the second bank 215. The connection between first and second instances of the potential buyer 204 is represented in this figure by a thick line 205. Focusing now on the fourth sub cluster 214, in which the second bank 215 is at its centroid, we see that a second entity 216 is connected with the potential buyer 204, and with the second bank 215. Additionally, connected to the second bank 215 and to the second entity 216 is a third entity 218. The third entity 218 is a member of the fourth sub cluster 214 and the fifth sub cluster 226. Again, the connection between the duplicated instances of the third entity 218 is signified by the thick line 219. The third entity 218 within the fifth sub cluster 226 is shown as being connected to a fourth entity 220, who is connected to a fifth entity 222. Therefore, according to this example putative cluster 200, a connection may be determined to exist between the potential buyer 204 and the fifth entity 222, and this connection is shown by the dotted line 224.
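The chain of connections traced in paragraph [0057] can be reproduced with a breadth-first search over the links of the sub clusters. The sketch below encodes only the links stated in the text for sub clusters 214 and 226, using the reference numerals as identifiers; the data layout and the function name `find_connection` are assumptions, and the disclosure does not state that a breadth-first search is used.

```python
from collections import deque

# Links read from the description of FIG. 2; keys and values are reference numerals.
sub_clusters = {
    214: [(204, 215), (216, 204), (216, 215), (218, 215), (218, 216)],
    226: [(218, 220), (220, 222)],
}

def find_connection(sub_clusters, source, target):
    """Breadth-first search over the union of all sub cluster links.

    Because an entity keeps the same identifier in every sub cluster it
    appears in, its duplicates are merged automatically.
    """
    graph = {}
    for links in sub_clusters.values():
        for a, b in links:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

path = find_connection(sub_clusters, 204, 222)
```

The resulting path runs from the potential buyer 204 through the second bank 215, the third entity 218, and the fourth entity 220 to the fifth entity 222, which corresponds to the connection drawn as the dotted line 224.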
[0058] According to an example implementation, a single property may have changed ownership between multiple entities in the sub cluster, as shown in the first sub-cluster 207, the fifth sub-cluster 226 and the sixth sub-cluster 208. However, as shown in FIG. 2, the centroid property 202 being analyzed has been subject to a number of transfers between related entities, which is often an indicator of fraudulent activities. Again, the movement of this property among these various entities would likely be overlooked in a conventional fraud-detection system.
[0059] FIG. 3 depicts a block diagram of an illustrative computer system architecture 300 according to an example implementation of the disclosed technology. Various implementations and methods herein may be embodied in non-transitory computer readable media for execution by a processor. It will be understood that the architecture 300 is provided for example purposes only and does not limit the scope of the various implementations of the communication systems and methods.
[0060] The architecture 300 of FIG. 3 includes a central processing unit (CPU) 302, where computer instructions are processed, and a display interface 304 that acts as a communication interface and provides functions for rendering video, graphics, images, and texts on the display. In certain example implementations of the disclosed technology, the display interface 304 may be directly connected to a local display. In another example implementation, the display interface 304 may be configured for providing data, images, and other information for an external/remote display or computer that is not necessarily connected to the particular CPU 302. In certain example implementations, the display interface 304 may wirelessly communicate, for example, via a Wi-Fi channel or other available network connection interface 312 to an external/remote display.
[0061] The architecture 300 may include a keyboard interface 306 that provides a communication interface to a keyboard, and a pointing device interface 308 that provides a communication interface to a pointing device, mouse, and/or touch screen. Example implementations of the architecture 300 may include an antenna interface 310 that provides a communication interface to an antenna, and a network connection interface 312 that provides a communication interface to a network. As mentioned above, the display interface 304 may be in communication with the network connection interface 312, for example, to provide information for display on a remote display that is not directly connected or attached to the system. In certain implementations, a camera interface 314 may be provided that may act as a communication interface and/or provide functions for capturing digital images from a camera. In certain implementations, a sound interface 316 is provided as a communication interface for converting sound into electrical signals using a microphone and for converting electrical signals into sound using a speaker. According to example implementations, a random access memory (RAM) 318 is provided, where computer instructions and data may be stored in a volatile memory device for processing by the CPU 302.
[0062] According to an example implementation, the architecture 300 includes a read-only memory (ROM) 320 where invariant low-level system code or data for basic system functions such as basic input and output (I/O), startup, or reception of keystrokes from a keyboard are stored in a non-volatile memory device. According to an example implementation, the architecture 300 includes a storage medium 322 or other suitable type of memory (e.g., RAM, ROM, programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic disks, optical disks, floppy disks, hard disks, removable cartridges, flash drives), where files including an operating system 324, application programs 326, and data files 328 are stored. The application programs 326 may include putative clustering instructions for organizing, storing, retrieving, comparing, and/or analyzing the various connections associated with the properties and entities associated with embodiments of the disclosed technology. According to example implementations of the disclosed technology, the putative cluster analysis system, the clustering unit, and/or the scoring unit may be embodied, at least in part, via the application programs 326 interacting with data from the ROM 320 or other memory storage medium 322, and may be enabled by interaction with the operating system 324 via the CPU 302 and bus 334.
[0063] According to an example implementation, the architecture 300 includes a power source 330 that provides an appropriate alternating current (AC) or direct current (DC) to power components. According to an example implementation, the architecture 300 may include a telephony subsystem 332 that allows the device 300 to transmit and receive sound over a telephone network. The constituent devices and the CPU 302 communicate with each other over a bus 334.
[0064] In accordance with an example implementation, the CPU 302 has appropriate structure to be a computer processor. In one arrangement, the computer CPU 302 may include more than one processing unit. The RAM 318 interfaces with the computer bus 334 to provide quick RAM storage to the CPU 302 during the execution of software programs such as the operating system, application programs, and device drivers. More specifically, the CPU 302 loads computer-executable process steps from the storage medium 322 or other media into a field of the RAM 318 in order to execute software programs. Data may be stored in the RAM 318, where the data may be accessed by the computer CPU 302 during execution. In one example configuration, the device 300 includes at least 128 MB of RAM, and 256 MB of flash memory.
[0065] The storage medium 322 itself may include a number of physical drive units, such as a redundant array of independent disks (RAID), a floppy disk drive, a flash memory, a USB flash drive, an external hard disk drive, thumb drive, pen drive, key drive, a High-Density Digital Versatile Disc (HD-DVD) optical disc drive, an internal hard disk drive, a Blu-Ray optical disc drive, or a Holographic Digital Data Storage (HDDS) optical disc drive, an external mini-dual inline memory module (DIMM) synchronous dynamic random access memory (SDRAM), or an external micro-DIMM SDRAM. Such computer readable storage media allow the device 300 to access computer-executable process steps, application programs and the like, stored on removable and non-removable memory media, to off-load data from the device 300 or to upload data onto the device 300. A computer program product, such as one utilizing a communication system, may be tangibly embodied in storage medium 322, which may comprise a machine-readable storage medium.
[0066] FIG. 4 illustrates a Venn diagram 400 of potentially fraudulent real estate transactions that may be identified, categorized, and/or flagged by putative cluster analysis, according to certain example embodiments of the disclosed technology. As shown in FIG. 4, the putative cluster analysis system may identify high-risk transactions that are performed within a network of associates that involve flipped properties and that result in high profits. The term "flipping" is used herein to describe purchasing a revenue-generating asset and quickly reselling it for profit. This term is frequently used as a descriptive term for legal real estate investing strategies that are perceived by some to be unethical or socially destructive. Certain embodiments of the disclosed technology may be applied for sensing schemes involving market manipulation and other illegal conduct including potentially collusive behavior.
[0067] The Venn diagram 400 of FIG. 4 illustrates the overlap of certain attributes that may be determined from a number of transactions involving certain properties. For example, related entities that are identified as being in the same network 402 may comprise a subset of transactions. Certain transactions may be flagged as extracting high profit 404, and other transactions may be characterized as flipping or flopping 406. For example, flipping or flopping 406 may have the characteristic of a purchase, followed by a sale within a short period of time after the purchase. Certain flipping or flopping 406 transactions may have low profit, and certain transactions may have a high profit. The overlap (designated by the letter Y) of flipping or flopping 406 transactions with those that are high profit 404 may provide loan profiles with the characteristic of loan files that were flipped and resulted in high profit gains 410.
[0068] The overlap (designated by the letter W) of high profit 404 transactions with in-network 402 transactions may provide loan profiles with a high profit gain having no flip 408. The overlap (designated by the letter X) of in-network 402 transactions and flipping or flopping 406 transactions may provide a loan profile with flip flops that are not high profit 412. The overlap of the in-network 402, the flipping or flopping 406, and the high profit 404 transactions may be designated (by the letter Z) as having the characteristic of in-cluster loans that were flipped and had high profit gains 414. According to an example implementation of the disclosed technology, such overlap of characteristics, attributes, data, etc., may be utilized to identify potential collusion within a network that may otherwise be very difficult to detect.
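By way of illustration only, the overlaps W, X, Y, and Z described with reference to FIG. 4 may be sketched with ordinary set operations. The transaction identifiers and set memberships below are hypothetical and do not come from the disclosure. The sketch also assumes that the pairwise regions W, X, and Y exclude the three-way overlap Z, which is one reading of the labels "no flip" and "not high profit".

```python
# Hypothetical transaction identifiers; not taken from the disclosure.
in_network = {"t1", "t2", "t3", "t4"}     # in the same network (402)
high_profit = {"t2", "t3", "t5", "t6"}    # high profit (404)
flip_or_flop = {"t3", "t4", "t6", "t7"}   # flipping or flopping (406)


def venn_regions(in_network, high_profit, flip_or_flop):
    """Return the four overlap regions of FIG. 4 as sets of transactions.

    Assumption: W, X, and Y are pairwise overlaps with the three-way
    overlap Z removed.
    """
    z = in_network & high_profit & flip_or_flop   # in network, flipped, high profit
    w = (high_profit & in_network) - z            # high profit, in network, no flip
    x = (in_network & flip_or_flop) - z           # in network, flipped, not high profit
    y = (flip_or_flop & high_profit) - z          # flipped, high profit, not in network
    return {"W": w, "X": x, "Y": y, "Z": z}


regions = venn_regions(in_network, high_profit, flip_or_flop)
```

In this sketch, the transactions in region Z would be the candidates for closer review, since they combine all three attributes.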
[0069] It will be understood that the combination of flipping and high-profit transactions illustrated in FIG. 4 is an illustrative example of potential results of the putative analysis systems and methods. Systems and methods disclosed herein may be capable of identifying potential collusion by much more complex mechanisms.
An example method 500 for identifying connected organizations from a collection of records will now be described with reference to the flowchart of FIG. 5. The method 500 starts in block 502, and according to an example implementation includes determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points. In block 504, the method 500 includes identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid. In block 506, the method 500 includes identifying cluster connections among the plurality of clusters. In block 508, the method 500 includes scoring the cluster connections based on predetermined criteria. In block 510, the method 500 includes identifying one or more of the distinct data points associated with the scored cluster connections.
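By way of illustration only, and without limitation, blocks 502 through 510 may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: records are assumed to be pairs of data points that appear together, a cluster is assumed to be a centroid plus its direct connections, two clusters are assumed to be connected when they share members, and the scoring criterion (the count of shared members) is hypothetical. All function names and sample records are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def determine_connections(records):
    """Block 502: link data points that appear together in a record."""
    connections = defaultdict(set)
    for a, b in records:
        connections[a].add(b)
        connections[b].add(a)
    return connections


def identify_clusters(connections):
    """Block 504: one cluster per centroid, holding its direct connections."""
    return {centroid: {centroid} | neighbors
            for centroid, neighbors in connections.items()}


def identify_cluster_connections(clusters):
    """Block 506: assume two clusters are connected when they share members."""
    links = {}
    for c1, c2 in combinations(sorted(clusters), 2):
        shared = clusters[c1] & clusters[c2]
        if shared:
            links[(c1, c2)] = shared
    return links


def score_cluster_connections(links):
    """Block 508: hypothetical criterion, the number of shared members."""
    return {pair: len(shared) for pair, shared in links.items()}


def identify_data_points(links, scores, min_score):
    """Block 510: data points on cluster connections scoring >= min_score."""
    points = set()
    for pair, score in scores.items():
        if score >= min_score:
            points |= links[pair]
    return points


# Hypothetical records: each pair names two entities found on the same record.
records = [("ann", "bob"), ("bob", "cy"), ("ann", "cy"), ("cy", "dee")]
connections = determine_connections(records)
clusters = identify_clusters(connections)
links = identify_cluster_connections(clusters)
scores = score_cluster_connections(links)
flagged = identify_data_points(links, scores, min_score=3)
```

Because each distinct data point serves as the centroid of at most one cluster in this sketch, the number of clusters cannot exceed the number of distinct data points.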
[0070] An example embodiment of the putative cluster analysis system was tested using transaction data related to properties in Sarasota, Florida, over a ten year period. In that test, it was determined that the highest risk cluster did not include any high-velocity flippers. Instead, risk behavior was evenly spread across the actors within the cluster and this characteristic may have inhibited detection by conventional means. In a blind study, the test putative cluster analysis system was able to identify a "ringleader" at the center of a cluster. This ringleader was indicted approximately one month before the test was conducted, and was identified by authorities based on information provided by a disgruntled employee informant. The identified ringleader was not listed on any of the deeds of flipped properties, but could be identified by the test putative cluster analysis system by indirect connections with the flipped properties, and by other metrics disclosed herein. Example implementations of the disclosed technology may be able to detect criminal activities that would not likely be identified if the involved individual or individuals intentionally avoid the type of behavior and connections that would be identifiable by conventional means.
[0071] Certain implementations of the disclosed technology of the putative cluster analysis systems and methods may be used to identify potential organizations of health insurance fraud, such as Medicaid fraud. For example, the input data to the clustering unit may be derived from historical address history of a population to be examined and such address history may be used to link individuals based on, for example, familial, residential, and business relationships. The clustering unit may then take this input data and output clusters for use by the scoring unit.
[0072] Without limitation, some features considered for the scoring algorithm with respect to health insurance fraud may include: (1) the number of people within a cluster who lived in expensive residences, owned expensive property, or drove expensive cars; (2) the number of insurance recipients within the cluster who are contacts of medical providers; (3) the number of medical businesses associated with people in the cluster; (4) the number of people in the cluster currently receiving benefits; and/or (5) the number of recipients associated with excluded providers. These features may enable the putative cluster system to identify, among others, clusters that have dense groups of recipients who appear to be colluding and transferring knowledge of how to claim Medicaid benefits and bypass eligibility requirements, as well as clusters that have close ties to medical providers who have the knowledge and means to defraud Medicaid.
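By way of illustration only, features (1) through (5) above may be computed per cluster and combined into a score as sketched below. The field names, the weights, and the weighted-sum combination are all hypothetical; the disclosure does not specify how the features are combined. The sketch also assumes that a "recipient" is a person currently receiving benefits.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical per-person attributes derived from the input records."""
    name: str
    expensive_assets: bool = False       # feature (1)
    provider_contact: bool = False       # feature (2), counted for recipients
    medical_businesses: int = 0          # feature (3)
    receiving_benefits: bool = False     # feature (4)
    excluded_provider_tie: bool = False  # feature (5), counted for recipients


def cluster_features(members):
    """Count features (1)-(5) over the members of one cluster."""
    recipients = [p for p in members if p.receiving_benefits]
    return {
        "expensive_assets": sum(p.expensive_assets for p in members),
        "recipient_provider_contacts": sum(p.provider_contact for p in recipients),
        "medical_businesses": sum(p.medical_businesses for p in members),
        "current_recipients": len(recipients),
        "excluded_provider_ties": sum(p.excluded_provider_tie for p in recipients),
    }


# Hypothetical weights; a real system would choose these from its own criteria.
WEIGHTS = {
    "expensive_assets": 1,
    "recipient_provider_contacts": 1,
    "medical_businesses": 1,
    "current_recipients": 1,
    "excluded_provider_ties": 2,
}


def score_cluster(members, weights=WEIGHTS):
    """Weighted sum of the cluster-level feature counts."""
    features = cluster_features(members)
    return sum(weights[name] * value for name, value in features.items())


cluster = [
    Person("a", expensive_assets=True, provider_contact=True, receiving_benefits=True),
    Person("b", receiving_benefits=True, excluded_provider_tie=True),
    Person("c", expensive_assets=True, medical_businesses=2),
]
features = cluster_features(cluster)
score = score_cluster(cluster)
```

Note that the features are counted over the whole cluster, so a member who is not a recipient (such as "c" above) still contributes to the cluster score.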
[0073] According to an example implementation of the disclosed technology, the putative cluster analysis system may consider the following features to identify potential drug-seeking behavior: (1) prescription filling distance deviation; and (2) watchlist drug prescriptions. Such features may enable the putative cluster system to identify, among others, clusters that include patients who deviate when filling prescriptions for certain watchlist drugs, as well as clusters that include providers and prescribers with patterns of prescribing to the drug-seeking clusters.
[0074] According to example implementations, certain technical effects can be provided, such as creating certain systems and methods that are able to identify an entity that is connected to various other entities evidencing suspicious behavior. Embodiments of the disclosed technology may be utilized to examine related data in addition to data that is indicative of whether or not an individual entity is an active recipient of health insurance. The putative cluster analysis system disclosed herein may also consider other recipients in an individual's cluster, which may be indicative of collusion.
[0075] According to an example implementation of the disclosed technology, the above or other features may be integrated into the scoring unit, so as to score the various clusters provided by the clustering unit. In an example implementation, the filtering unit may then filter out those clusters that are deemed to represent real organizations based on the scoring.
[0076] An example implementation of the disclosed technology of the putative cluster analysis system was tested to identify potential Medicaid fraud. During the test, the individuals flagged as being potential ring-leaders were often not themselves Medicaid recipients. Rather, they were members of putative clusters having large numbers of recipients, and they were members of clusters in which other cluster-members drove a high proportion of expensive vehicles. Because these ring-leaders did not meet conventional criteria for fraud-flagging, they might have been overlooked; however, they were flagged as potential ring-leaders using an example embodiment of the putative cluster analysis system in a test with real data.
[0077] Example implementations of the disclosed putative cluster analysis systems and methods may also be used to identify potential organizations of automobile insurance fraud. In some instances, automobile insurance fraud may include multiple victims, expensive vehicles, or multiple injuries.
[0078] Without limitation, some features considered for the scoring algorithm with respect to automobile insurance fraud may include: (1) the number of involved parties; (2) the number of claimants requiring medical treatment; (3) individual claim amounts; (4) vehicle damage; and (5) makes or models of involved automobiles. Analysis of these features, according to an example embodiment, may enable the putative cluster system to identify, among others, clusters that have a high number of collective claims with low standard deviation of claim counts, as well as clusters that have a statistically higher number of claims with soft tissue injuries, multiple passengers, low vehicle damage, or common passengers across multiple claims in the cluster.
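By way of illustration only, the pattern of "a high number of collective claims with low standard deviation of claim counts" may be checked as sketched below. The thresholds, function names, and sample clusters are hypothetical, and the use of the population standard deviation is an assumption; the disclosure does not specify which estimator is used.

```python
from statistics import pstdev


def claim_profile(claim_counts):
    """Summarize a cluster; claim_counts maps each member to a claim count.

    The mapping must be non-empty, since pstdev raises StatisticsError
    on empty input.
    """
    counts = list(claim_counts.values())
    return {"total_claims": sum(counts), "claim_count_stdev": pstdev(counts)}


def is_suspicious(profile, min_total, max_stdev):
    """Flag clusters with many claims spread evenly across members."""
    return (profile["total_claims"] >= min_total
            and profile["claim_count_stdev"] <= max_stdev)


# Hypothetical clusters with the same total number of claims.
even_cluster = {"p1": 3, "p2": 3, "p3": 3, "p4": 3}
skewed_cluster = {"p1": 12, "p2": 0, "p3": 0, "p4": 0}

even_flag = is_suspicious(claim_profile(even_cluster), min_total=10, max_stdev=1.0)
skewed_flag = is_suspicious(claim_profile(skewed_cluster), min_total=10, max_stdev=1.0)
```

Both clusters have twelve claims in total, but only the evenly spread cluster is flagged; a single high-volume claimant, as in the skewed cluster, would more likely be caught by conventional per-individual checks.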
[0079] According to an example implementation of the disclosed technology, the above or other features may be integrated into the scoring unit, so as to score the various clusters provided by the clustering unit. The filtering unit may then filter out those clusters that are deemed to represent real organizations based on the scoring.
[0080] Example implementations of the disclosed putative cluster analysis systems and methods may also be used to identify potential organizations involved in tax fraud. Without limitation, some features considered for the scoring algorithm with respect to tax fraud may include: (1) a significant change in income between tax years; (2) a significant increase in deductions; (3) a change in filing status; (4) a change in the number or nature of dependents; and (5) the number of self-employed individuals in the cluster. These or other features may be integrated into the scoring unit, so as to score the various clusters provided by the clustering unit. The filtering unit may then filter out those clusters that are deemed to represent real organizations based on the scoring.
[0081] Various embodiments of the putative cluster analysis systems and methods may be embodied, in whole or in part, in a computer program product stored on non-transitory computer-readable media for execution by one or more processors. It will thus be understood that various aspects of the disclosed technology, such as the clustering unit, the scoring unit, and the filtering unit, may comprise hardware or software of a computer system, as discussed above with respect to FIG. 3. It will also be understood that, although these units may be discussed herein as being distinct from one another, they may be implemented in various ways. The distinctions between them throughout this disclosure are made for illustrative purposes only, based on operational distinctiveness.
[0082] Application of the various embodiments of the putative cluster analysis systems and methods need not be limited to those above. For example, an example implementation of the putative cluster analysis system may be used to identify potential fraud related to credit card applications, identity theft, investments, and various other fraud types that might involve an organization of connected entities.
[0083] While certain implementations of the disclosed technology have been described in connection with what is presently considered to be the most practical and various implementations, it is to be understood that the disclosed technology is not to be limited to the disclosed implementations, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
[0084] This written description uses examples to disclose certain implementations of the disclosed technology, including the best mode, and also to enable any person skilled in the art to practice certain implementations of the disclosed technology, including making and using any devices or systems and performing any incorporated methods. The patentable scope of certain implementations of the disclosed technology is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.

Claims

What is claimed is:
1. A computer-implemented method comprising:
determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points;
identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid;
identifying cluster connections among the plurality of clusters;
scoring the cluster connections based on predetermined criteria; and
identifying one or more of the distinct data points associated with the scored cluster connections.
2. The method of claim 1, wherein the collection of records comprise transaction records or relationship records.
3. The method of claim 1, wherein a distinct data point represents an individual, an organization, or a property.
4. The method of claim 1, wherein the connections between the one or more of the plurality of distinct data points comprise information derived from public records.
5. The method of claim 1, further comprising filtering the cluster connections based on predetermined attributes.
6. The method of claim 1, wherein scoring the cluster connections based on predetermined criteria.
7. The method of claim 1, wherein a number of the identified clusters is equal to or less than a number of the distinct data points.
8. A system comprising:
one or more processors; and
at least one memory comprising an operating system, a database, a clustering unit, and a scoring unit, the memory in communication with the one or more processors and configured for storing data and instructions which, when executed by the at least one processor under control of the operating system, enable the system to:
determine, from a collection of records in the database, wherein the collection of records comprise a plurality of distinct data points, connections between one or more of the plurality of distinct data points;
identify, by the clustering unit, and from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid;
identify, by the clustering unit, cluster connections among the plurality of clusters;
score, by the scoring unit, the cluster connections based on predetermined criteria; and
identify one or more of the distinct data points associated with the scored cluster connections.
9. The system of claim 8, wherein the collection of records comprise transaction records or relationship records.
10. The system of claim 8, wherein a distinct data point represents an individual, an organization, or a property.
11. The system of claim 8, wherein the connections between the one or more of the plurality of distinct data points comprise information derived from public records.
12. The system of claim 8, further comprising a filtering unit that is operable to filter the cluster connections based on predetermined attributes.
13. The system of claim 8, wherein the scoring unit is configured to score the cluster connections based on predetermined criteria.
14. A computer-readable medium that stores instructions which, when executed by at least one processor in a system, cause the system to perform a method comprising:
determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points;
identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid;
identifying cluster connections among the plurality of clusters;
scoring the cluster connections based on predetermined criteria; and
identifying one or more of the distinct data points associated with the scored cluster connections.
15. The computer-readable medium of claim 14, wherein the collection of records comprise transaction records or relationship records.
16. The computer-readable medium of claim 14, wherein a distinct data point represents an individual, an organization, or a property.
17. The computer-readable medium of claim 14, wherein the connections between the one or more of the plurality of distinct data points comprise information derived from public records.
18. The computer-readable medium of claim 14, further comprising filtering the cluster connections based on predetermined attributes.
19. The computer-readable medium of claim 14, wherein scoring the cluster connections based on predetermined criteria.
20. The computer-readable medium of claim 14, wherein a number of the identified clusters is equal to or less than a number of the distinct data points.
PCT/US2013/026343 2003-02-04 2013-02-15 Systems and methods for putative cluster analysis WO2013126281A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/848,850 US9412141B2 (en) 2003-02-04 2013-03-22 Systems and methods for identifying entities using geographical and social mapping
US15/202,099 US10438308B2 (en) 2003-02-04 2016-07-05 Systems and methods for identifying entities using geographical and social mapping

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261603068P 2012-02-24 2012-02-24
US61/603,068 2012-02-24

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US13/541,092 Continuation-In-Part US8549590B1 (en) 2003-02-04 2012-07-03 Systems and methods for identity authentication using a social network

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/848,850 Continuation-In-Part US9412141B2 (en) 2003-02-04 2013-03-22 Systems and methods for identifying entities using geographical and social mapping

Publications (1)

Publication Number Publication Date
WO2013126281A1 true WO2013126281A1 (en) 2013-08-29

Family

ID=49006122

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2013/026343 WO2013126281A1 (en) 2003-02-04 2013-02-15 Systems and methods for putative cluster analysis

Country Status (1)

Country Link
WO (1) WO2013126281A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5404561A (en) * 1989-04-19 1995-04-04 Hughes Aircraft Company Clustering and associate processor
US20020107858A1 (en) * 2000-07-05 2002-08-08 Lundahl David S. Method and system for the dynamic analysis of data
US20030212519A1 (en) * 2002-05-10 2003-11-13 Campos Marcos M. Probabilistic model generation
US20060093222A1 (en) * 1999-09-30 2006-05-04 Battelle Memorial Institute Data processing, analysis, and visualization system for use with disparate data types
US20090177589A1 (en) * 1999-12-30 2009-07-09 Marc Thomas Edgar Cross correlation tool for automated portfolio descriptive statistics

Cited By (114)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11157872B2 (en) 2008-06-26 2021-10-26 Experian Marketing Solutions, Llc Systems and methods for providing an integrated identifier
US11769112B2 (en) 2008-06-26 2023-09-26 Experian Marketing Solutions, Llc Systems and methods for providing an integrated identifier
US10909617B2 (en) 2010-03-24 2021-02-02 Consumerinfo.Com, Inc. Indirect monitoring and reporting of a user's credit data
US10593004B2 (en) 2011-02-18 2020-03-17 Csidentity Corporation System and methods for identifying compromised personally identifiable information on the internet
US11954655B1 (en) 2011-06-16 2024-04-09 Consumerinfo.Com, Inc. Authentication alerts
US11232413B1 (en) 2011-06-16 2022-01-25 Consumerinfo.Com, Inc. Authentication alerts
US11030562B1 (en) 2011-10-31 2021-06-08 Consumerinfo.Com, Inc. Pre-data breach monitoring
US11568348B1 (en) 2011-10-31 2023-01-31 Consumerinfo.Com, Inc. Pre-data breach monitoring
US10592982B2 (en) 2013-03-14 2020-03-17 Csidentity Corporation System and method for identifying related credit inquiries
US8788405B1 (en) 2013-03-15 2014-07-22 Palantir Technologies, Inc. Generating data clusters with customizable analysis strategies
US11288677B1 (en) 2013-03-15 2022-03-29 Consumerinfo.Com, Inc. Adjustment of knowledge-based authentication
US9177344B1 (en) 2013-03-15 2015-11-03 Palantir Technologies Inc. Trend data clustering
US10937034B2 (en) * 2013-03-15 2021-03-02 Palantir Technologies Inc. Systems and user interfaces for dynamic and interactive investigation based on automatic malfeasance clustering of related data in various data structures
US10834123B2 (en) * 2013-03-15 2020-11-10 Palantir Technologies Inc. Generating data clusters
US9230280B1 (en) 2013-03-15 2016-01-05 Palantir Technologies Inc. Clustering data based on indications of financial malfeasance
US10216801B2 (en) 2013-03-15 2019-02-26 Palantir Technologies Inc. Generating data clusters
US10721268B2 (en) 2013-03-15 2020-07-21 Palantir Technologies Inc. Systems and user interfaces for dynamic and interactive investigation based on automatic clustering of related data in various data structures
US8788407B1 (en) 2013-03-15 2014-07-22 Palantir Technologies Inc. Malware data clustering
US10264014B2 (en) 2013-03-15 2019-04-16 Palantir Technologies Inc. Systems and user interfaces for dynamic and interactive investigation based on automatic clustering of related data in various data structures
US9165299B1 (en) 2013-03-15 2015-10-20 Palantir Technologies Inc. User-agent data clustering
US9135658B2 (en) 2013-03-15 2015-09-15 Palantir Technologies Inc. Generating data clusters
US8818892B1 (en) 2013-03-15 2014-08-26 Palantir Technologies, Inc. Prioritizing data clusters with customizable scoring strategies
US10275778B1 (en) 2013-03-15 2019-04-30 Palantir Technologies Inc. Systems and user interfaces for dynamic and interactive investigation based on automatic malfeasance clustering of related data in various data structures
US8855999B1 (en) 2013-03-15 2014-10-07 Palantir Technologies Inc. Method and system for generating a parser and parsing complex data
US10120857B2 (en) 2013-03-15 2018-11-06 Palantir Technologies Inc. Method and system for generating a parser and parsing complex data
US20190166135A1 (en) * 2013-03-15 2019-05-30 Palantir Technologies Inc. Generating data clusters
US8930897B2 (en) 2013-03-15 2015-01-06 Palantir Technologies Inc. Data integration tool
US11790473B2 (en) 2013-03-15 2023-10-17 Csidentity Corporation Systems and methods of delayed authentication and billing for on-demand products
US11164271B2 (en) 2013-03-15 2021-11-02 Csidentity Corporation Systems and methods of delayed authentication and billing for on-demand products
US9965937B2 (en) 2013-03-15 2018-05-08 Palantir Technologies Inc. External malware data item clustering and analysis
US11775979B1 (en) 2013-03-15 2023-10-03 Consumerinfo.Com, Inc. Adjustment of knowledge-based authentication
US9171334B1 (en) 2013-03-15 2015-10-27 Palantir Technologies Inc. Tax data clustering
US20190205897A1 (en) * 2013-03-15 2019-07-04 Palantir Technologies Inc. Systems and user interfaces for dynamic and interactive investigation based on automatic malfeasance clustering of related data in various data structures
US11803929B1 (en) 2013-05-23 2023-10-31 Consumerinfo.Com, Inc. Digital identity
US11120519B2 (en) 2013-05-23 2021-09-14 Consumerinfo.Com, Inc. Digital identity
US9424337B2 (en) 2013-07-09 2016-08-23 Sas Institute Inc. Number of clusters estimation
US10719527B2 (en) 2013-10-18 2020-07-21 Palantir Technologies Inc. Systems and user interfaces for dynamic and interactive simultaneous querying of multiple data stores
US10579647B1 (en) 2013-12-16 2020-03-03 Palantir Technologies Inc. Methods and systems for analyzing entity performance
US9552615B2 (en) 2013-12-20 2017-01-24 Palantir Technologies Inc. Automated database analysis to detect malfeasance
US10356032B2 (en) 2013-12-26 2019-07-16 Palantir Technologies Inc. System and method for detecting confidential information emails
US10230746B2 (en) 2014-01-03 2019-03-12 Palantir Technologies Inc. System and method for evaluating network threats and usage
US10805321B2 (en) 2014-01-03 2020-10-13 Palantir Technologies Inc. System and method for evaluating network threats and usage
US9923925B2 (en) 2014-02-20 2018-03-20 Palantir Technologies Inc. Cyber security sharing and identification system
US9009827B1 (en) 2014-02-20 2015-04-14 Palantir Technologies Inc. Security sharing system
US10873603B2 (en) 2014-02-20 2020-12-22 Palantir Technologies Inc. Cyber security sharing and identification system
US9202178B2 (en) 2014-03-11 2015-12-01 Sas Institute Inc. Computerized cluster analysis framework for decorrelated cluster identification in datasets
US11074641B1 (en) 2014-04-25 2021-07-27 Csidentity Corporation Systems, methods and computer-program products for eligibility verification
US11587150B1 (en) 2014-04-25 2023-02-21 Csidentity Corporation Systems and methods for eligibility verification
US11341178B2 (en) 2014-06-30 2022-05-24 Palantir Technologies Inc. Systems and methods for key phrase characterization of documents
US10180929B1 (en) 2014-06-30 2019-01-15 Palantir Technologies, Inc. Systems and methods for identifying key phrase clusters within documents
US9535974B1 (en) 2014-06-30 2017-01-03 Palantir Technologies Inc. Systems and methods for identifying key phrase clusters within documents
US9785773B2 (en) 2014-07-03 2017-10-10 Palantir Technologies Inc. Malware data item analysis
US9998485B2 (en) 2014-07-03 2018-06-12 Palantir Technologies, Inc. Network intrusion data item clustering and analysis
US9021260B1 (en) 2014-07-03 2015-04-28 Palantir Technologies Inc. Malware data item analysis
US9875293B2 (en) 2014-07-03 2018-01-23 Palantir Technologies Inc. System and method for news events detection and visualization
US9344447B2 (en) 2014-07-03 2016-05-17 Palantir Technologies Inc. Internal malware data item clustering and analysis
US10929436B2 (en) 2014-07-03 2021-02-23 Palantir Technologies Inc. System and method for news events detection and visualization
US10572496B1 (en) 2014-07-03 2020-02-25 Palantir Technologies Inc. Distributed workflow system and database with access controls for city resiliency
US9202249B1 (en) 2014-07-03 2015-12-01 Palantir Technologies Inc. Data item clustering and analysis
US10798116B2 (en) 2014-07-03 2020-10-06 Palantir Technologies Inc. External malware data item clustering and analysis
US9881074B2 (en) 2014-07-03 2018-01-30 Palantir Technologies Inc. System and method for news events detection and visualization
US11436606B1 (en) 2014-10-31 2022-09-06 Experian Information Solutions, Inc. System and architecture for electronic fraud detection
US10990979B1 (en) 2014-10-31 2021-04-27 Experian Information Solutions, Inc. System and architecture for electronic fraud detection
US11941635B1 (en) 2014-10-31 2024-03-26 Experian Information Solutions, Inc. System and architecture for electronic fraud detection
US10135863B2 (en) 2014-11-06 2018-11-20 Palantir Technologies Inc. Malicious software detection in a computing system
US9043894B1 (en) 2014-11-06 2015-05-26 Palantir Technologies Inc. Malicious software detection in a computing system
US9558352B1 (en) 2014-11-06 2017-01-31 Palantir Technologies Inc. Malicious software detection in a computing system
US10728277B2 (en) 2014-11-06 2020-07-28 Palantir Technologies Inc. Malicious software detection in a computing system
US9898528B2 (en) 2014-12-22 2018-02-20 Palantir Technologies Inc. Concept indexing among database of documents using machine learning techniques
US9367872B1 (en) 2014-12-22 2016-06-14 Palantir Technologies Inc. Systems and user interfaces for dynamic and interactive investigation of bad actor behavior based on automatic clustering of related data in various data structures
EP3037991A1 (en) * 2014-12-22 2016-06-29 Palantir Technologies, Inc. Systems and user interfaces for dynamic and interactive investigation of bad actor behavior based on automatic clustering of related data in various data structures
US9589299B2 (en) 2014-12-22 2017-03-07 Palantir Technologies Inc. Systems and user interfaces for dynamic and interactive investigation of bad actor behavior based on automatic clustering of related data in various data structures
US10552994B2 (en) 2014-12-22 2020-02-04 Palantir Technologies Inc. Systems and interactive user interfaces for dynamic retrieval, analysis, and triage of data items
US11252248B2 (en) 2014-12-22 2022-02-15 Palantir Technologies Inc. Communication data processing architecture
US10447712B2 (en) 2014-12-22 2019-10-15 Palantir Technologies Inc. Systems and user interfaces for dynamic and interactive investigation of bad actor behavior based on automatic clustering of related data in various data structures
US10362133B1 (en) 2014-12-22 2019-07-23 Palantir Technologies Inc. Communication data processing architecture
US10552998B2 (en) 2014-12-29 2020-02-04 Palantir Technologies Inc. System and method of generating data points from one or more data stores of data items for chart creation and manipulation
US9817563B1 (en) 2014-12-29 2017-11-14 Palantir Technologies Inc. System and method of generating data points from one or more data stores of data items for chart creation and manipulation
US11151468B1 (en) 2015-07-02 2021-10-19 Experian Information Solutions, Inc. Behavior analysis using distributed representations of event data
US11501369B2 (en) 2015-07-30 2022-11-15 Palantir Technologies Inc. Systems and user interfaces for holistic, data-driven investigation of bad actor behavior based on clustering and scoring of related data
US9454785B1 (en) 2015-07-30 2016-09-27 Palantir Technologies Inc. Systems and user interfaces for holistic, data-driven investigation of bad actor behavior based on clustering and scoring of related data
US10223748B2 (en) 2015-07-30 2019-03-05 Palantir Technologies Inc. Systems and user interfaces for holistic, data-driven investigation of bad actor behavior based on clustering and scoring of related data
US9456000B1 (en) 2015-08-06 2016-09-27 Palantir Technologies Inc. Systems, methods, user interfaces, and computer-readable media for investigating potential malicious communications
US10484407B2 (en) 2015-08-06 2019-11-19 Palantir Technologies Inc. Systems, methods, user interfaces, and computer-readable media for investigating potential malicious communications
US9635046B2 (en) 2015-08-06 2017-04-25 Palantir Technologies Inc. Systems, methods, user interfaces, and computer-readable media for investigating potential malicious communications
US10489391B1 (en) 2015-08-17 2019-11-26 Palantir Technologies Inc. Systems and methods for grouping and enriching data items accessed from one or more databases for presentation in a user interface
US9898509B2 (en) 2015-08-28 2018-02-20 Palantir Technologies Inc. Malicious activity detection system capable of efficiently processing data accessed from databases and generating alerts for display in interactive user interfaces
US11048706B2 (en) 2015-08-28 2021-06-29 Palantir Technologies Inc. Malicious activity detection system capable of efficiently processing data accessed from databases and generating alerts for display in interactive user interfaces
US10346410B2 (en) 2015-08-28 2019-07-09 Palantir Technologies Inc. Malicious activity detection system capable of efficiently processing data accessed from databases and generating alerts for display in interactive user interfaces
US10572487B1 (en) 2015-10-30 2020-02-25 Palantir Technologies Inc. Periodic database search manager for multiple data sources
US10318630B1 (en) 2016-11-21 2019-06-11 Palantir Technologies Inc. Analysis of large bodies of textual data
US11681282B2 (en) 2016-12-20 2023-06-20 Palantir Technologies Inc. Systems and methods for determining relationships between defects
US10620618B2 (en) 2016-12-20 2020-04-14 Palantir Technologies Inc. Systems and methods for determining relationships between defects
US10325224B1 (en) 2017-03-23 2019-06-18 Palantir Technologies Inc. Systems and methods for selecting machine learning training data
US10606866B1 (en) 2017-03-30 2020-03-31 Palantir Technologies Inc. Framework for exposing network activities
US11481410B1 (en) 2017-03-30 2022-10-25 Palantir Technologies Inc. Framework for exposing network activities
US11947569B1 (en) 2017-03-30 2024-04-02 Palantir Technologies Inc. Framework for exposing network activities
US11714869B2 (en) 2017-05-02 2023-08-01 Palantir Technologies Inc. Automated assistance for generating relevant and valuable search results for an entity of interest
US10235461B2 (en) 2017-05-02 2019-03-19 Palantir Technologies Inc. Automated assistance for generating relevant and valuable search results for an entity of interest
US11210350B2 (en) 2017-05-02 2021-12-28 Palantir Technologies Inc. Automated assistance for generating relevant and valuable search results for an entity of interest
US11537903B2 (en) 2017-05-09 2022-12-27 Palantir Technologies Inc. Systems and methods for reducing manufacturing failure rates
US10482382B2 (en) 2017-05-09 2019-11-19 Palantir Technologies Inc. Systems and methods for reducing manufacturing failure rates
US11954607B2 (en) 2017-05-09 2024-04-09 Palantir Technologies Inc. Systems and methods for reducing manufacturing failure rates
US10699028B1 (en) 2017-09-28 2020-06-30 Csidentity Corporation Identity security architecture systems and methods
US11580259B1 (en) 2017-09-28 2023-02-14 Csidentity Corporation Identity security architecture systems and methods
US11157650B1 (en) 2017-09-28 2021-10-26 Csidentity Corporation Identity security architecture systems and methods
US10896472B1 (en) 2017-11-14 2021-01-19 Csidentity Corporation Security and identity verification system and architecture
US10838987B1 (en) 2017-12-20 2020-11-17 Palantir Technologies Inc. Adaptive and transparent entity screening
US11119630B1 (en) 2018-06-19 2021-09-14 Palantir Technologies Inc. Artificial intelligence assisted evaluations and user interface for same
US11588639B2 (en) 2018-06-22 2023-02-21 Experian Information Solutions, Inc. System and method for a token gateway environment
US10911234B2 (en) 2018-06-22 2021-02-02 Experian Information Solutions, Inc. System and method for a token gateway environment
CN108985950B (en) * 2018-07-13 2023-04-18 平安科技(深圳)有限公司 Electronic device, user insurance fraud risk early warning method and storage medium
CN108985950A (en) * 2018-07-13 2018-12-11 平安科技(深圳)有限公司 Electronic device, user insurance fraud risk early warning method and storage medium
US11941065B1 (en) 2019-09-13 2024-03-26 Experian Information Solutions, Inc. Single identifier platform for storing entity data

Similar Documents

Publication Publication Date Title
WO2013126281A1 (en) Systems and methods for putative cluster analysis
Gaitonde et al. Interventions to reduce corruption in the health sector
Sithic et al. Survey of insurance fraud detection using data mining techniques
Harrell et al. Victims of identity theft, 2012
Okoye et al. Forensic accounting and fraud prevention in manufacturing companies in Nigeria
Ekin et al. Statistical medical fraud assessment: exposition to an emerging field
Kowshalya et al. Predicting fraudulent claims in automobile insurance
Levi et al. The nature, extent and economic impact of fraud in the UK
CN113994323A (en) Intelligent alarm system
US8429050B2 (en) Method for detecting ineligibility of a beneficiary and system
Anbarasi et al. Fraud detection using outlier predictor in health insurance data
US8566204B2 (en) Method for detecting ineligibility of a beneficiary and system
Yange A Fraud Detection System for Health Insurance in Nigeria
Khurjekar et al. Detection of fraudulent claims using hierarchical cluster analysis
WO2022228688A1 (en) Automated fraud monitoring and trigger-system for detecting unusual patterns associated with fraudulent activity, and corresponding method thereof
Timofeyev et al. Current trends in insurance fraud in Russia: Evidence from a survey of industry experts
Power et al. Sharing and analyzing data to reduce insurance fraud
Aiken Analyzing proactive fraud detection software tools and the push for quicker Solutions
Shekhar et al. Unsupervised Machine Learning for Explainable Health Care Fraud Detection
Desi et al. Forensic Accounting, a Veritable Financial Tool for Qualitative Financial Reporting Systems in the 21st Century
Kapoor Deception Detection And Vulnerability Analysis Using A Multi-Level Clustering Machine Learning Algorithm In Business Transactions
Şen et al. Detecting falsified financial statements using data mining: empirical research on finance sector in Turkey
Levi Assessing the cost of fraud
Yange et al. A Schematic View of the Application of Big Data Analytics in Healthcare Crime Investigation
Malaszczyk et al. Big data analytics in tax fraud detection

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 13752433; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 13752433; Country of ref document: EP; Kind code of ref document: A1)