WO2013126281A1

WO2013126281A1 - Systems and methods for putative cluster analysis

Info

Publication number: WO2013126281A1
Application number: PCT/US2013/026343
Authority: WO
Inventors: Johannes Philippus de Villiers PRICHARD; David Alan Bayliss
Original assignee: Lexisnexis Risk Solutions Fl Inc.
Priority date: 2012-02-24
Filing date: 2013-02-15
Publication date: 2013-08-29

Abstract

Certain implementations of the disclosed technology may include systems, methods, and computer-readable media for identifying connected organizations from a collection of records. A method is provided for determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points. The method includes identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid. The method further includes identifying cluster connections among the plurality of clusters, scoring the cluster connections based on predetermined criteria, and identifying one or more of the distinct data points associated with the scored cluster connections.

Description

SYSTEMS AND METHODS FOR PUTATIVE CLUSTER ANALYSIS

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is related to U.S. Provisional Application serial no. 61/603,068 filed on February 24, 2012, entitled: "Systems and Methods for Putative Cluster Analysis," the contents of which are hereby incorporated by reference in their entirety.

[0002] This application is also related to U.S. Patent No. 7,403,942 to Bayliss, David et al. (hereinafter Bayliss I), filed February 4, 2003, and to US Patent Application Serial No. 10/357,489 to Bayliss, David et al. (hereinafter Bayliss II), which are hereby incorporated by reference as if fully set forth below.

TECHNICAL FIELD

[0003] Various embodiments of the systems and methods described herein relate to data mining and, more particularly, to systems and methods for efficiently mining data to identify collusion, fraud, and organized groups of entities.

BACKGROUND

[0004] Increasingly, commercial, governmental, institutional and other entities collect vast amounts of data related to a variety of subjects, activities and pursuits. Society's appreciation for and use of information technology and management to analyze such data is now well ensconced in everyday life. For example, collected data may be examined for historical, trending, predictive, preventive, profiling, and many other useful purposes. Although the technology for collecting and storing such vast amounts of data is in place, efficient and effective technology for accessing, processing, verifying, analyzing and decisioning relating to such vast amounts of data is presently lacking or at the least in need of improvement. There exists broad and eager anticipation for unleashing the potential associated with such vast amounts of data and expanding the power that intelligent business solutions bring to commercial, governmental, and other societal pursuits. There exists a need and desire for intelligent solutions to realize this potential.

[0005] Applications for exploiting collected data include, but are not limited to: national security; law enforcement; immigration and border control; locating missing persons and property; firearms tracking; civil and criminal investigations; person and property location and verification; governmental and agency record handling; entity searching and location; package delivery; telecommunications; consumer related applications; credit reporting, scoring, and/or evaluating; debt collection; entity identification verification; account establishment, scoring and monitoring; fraud detection; health industry (patient record maintenance); biometric and other forms of authentication; insurance and risk management; marketing, including direct to consumer marketing; human resources/employment; and financial/banking industries. The applications may span an enterprise or agency or extend across multiple agencies, businesses, industries, etc.

[0006] Another such application is identifying collusion, such as that related to mortgage fraud. According to the Federal Bureau of Investigation (FBI), pending mortgage fraud-related investigations increased twelve percent in the fiscal year ending September 30, 2010, compared with the previous year. This represents a ninety percent jump over the increase seen in the previous fiscal year. The collapse of the housing boom and the financial crisis have increased foreclosures. Although mortgage origination schemes have decreased because of the depressed housing market, fraud aimed at troubled borrowers has increased. Such fraud includes loan modification scams and foreclosure rescue schemes, in which perpetrators convince borrowers they can save their homes through deed transfers and upfront fees.

[0007] The available data related to mortgages and to other industries is potentially immense. It is desirable to use that data efficiently to identify groups of entities that work as organizations. These groups may be colluding to perform illegal activity, and identifying them may reduce that activity.

BRIEF SUMMARY

[0008] Briefly described, various embodiments of the disclosed technology may include putative cluster analysis systems and methods for identifying various connected entities and organizations. In an example implementation of the disclosed technology, an analytical system may be provided that includes a database, a clustering unit, a scoring unit, and a filtering unit. Certain implementations of the disclosed technology may include systems, methods, and computer-readable media for identifying connected organizations in a collection of distinct data points.

[0009] According to an example embodiment of the disclosed technology, a method is provided for determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points. The method includes identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid. The method further includes identifying cluster connections among the plurality of clusters, scoring the cluster connections based on predetermined criteria, and identifying one or more of the distinct data points associated with the scored cluster connections.

[0010] In an example implementation, a system is provided. The system may include one or more processors and at least one memory in communication with the one or more processors. The at least one memory may include an operating system, a database, a clustering unit, and a scoring unit. The memory in communication with the one or more processors may be configured for storing data and instructions which, when executed by the at least one processor under control of the operating system, enable the system to determine, from a collection of records in the database, wherein the collection of records comprises a plurality of distinct data points, connections between one or more of the plurality of distinct data points. The instructions, when executed by the at least one processor under control of the operating system, may further identify, by the clustering unit, and from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid. The instructions, when executed by the at least one processor under control of the operating system, may further identify, by the clustering unit, cluster connections among the plurality of clusters; score, by the scoring unit, the cluster connections based on predetermined criteria; and identify one or more of the distinct data points associated with the scored cluster connections.

[0011] According to an example embodiment of the disclosed technology, a computer-readable medium is provided for performing a method. The method includes determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points. The method includes identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid. The method further includes identifying cluster connections among the plurality of clusters, scoring the cluster connections based on predetermined criteria, and identifying one or more of the distinct data points associated with the scored cluster connections.

[0012] In an example implementation, the database may store a plurality of records to be analyzed. Each record may include data related to an entity or transaction. For example, a record may include data related to a real estate purchase, an insurance claim, or an income tax return. In an example implementation of the disclosed technology, the putative cluster analysis system may be directed to identify organizations related to a single industry. In that case, each record in the database, for the purpose of the putative cluster analysis system, may be related to that single industry. For example, if an embodiment of the putative cluster analysis system is directed to identifying insurance fraud, then various records may be related to insurance claims. Some embodiments of the disclosed technology may include a database. Other example embodiments of the disclosed technology may include systems and/or methods for accessing a database or other collection of data to be analyzed.

[0013] In an example implementation, the clustering unit may group the various records into distinct, putative clusters. The term "putative clusters" as discussed herein may mean groups of records that are supposed, presumed, and/or reputed as having some type of a connection to one another, no matter how tenuous that connection may prove to be in actuality. In an example implementation, each record, or data point, may be deemed the central point of a cluster. For that data point, relatives of that data point may be identified up to a predetermined distance from the central data point, where "distance" between points is predefined and, in some embodiments, relates to a degree of connectivity between data points.

[0014] According to an example implementation, the scoring unit may have access to a predetermined feature set, and may be configured to analyze each putative cluster based on the feature set. Within a cluster, a direct link exists between each pair of data points with a direct relationship. For example, if a pair of data points represents two real estate transactions with the same seller, then these data points may be connected by a direct link within a cluster. Data points within a cluster may be indirectly connected when the data points are connected by a series of links.

[0015] According to an example implementation, for each feature in the feature set, the scoring unit may analyze the attributes of the various links or data points in the cluster to provide a score with respect to the feature in question. Thus, in one example embodiment, each data point or each link may be assigned a score for each feature. In one implementation, the cluster as a whole may be assigned a total score comprising a combination of the scores of the various features applicable to the cluster. The total score may be one of various combinations calculated from the feature scores, such as, for example, a sum, a weighted sum, or another formula based on the various features.
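As a non-limiting illustration, the combination of feature scores into a total cluster score may be sketched as follows. The function name `total_cluster_score`, the feature names, and the default weight of 1.0 for features without an explicit weight are hypothetical and are chosen only for this example; the disclosure does not prescribe a particular formula.

```python
from typing import Mapping, Optional


def total_cluster_score(
    feature_scores: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Combine per-feature scores into one score for a cluster.

    With no weights the result is a plain sum. With weights it is a
    weighted sum; a feature missing from `weights` is given a weight of
    1.0, which is an assumption made for this sketch.
    """
    if weights is None:
        return sum(feature_scores.values())
    return sum(
        score * weights.get(name, 1.0)
        for name, score in feature_scores.items()
    )


# Hypothetical feature scores for a single putative cluster.
scores = {"flip_cnt": 3.0, "default_cnt": 1.0, "hi_prof_cnt": 2.0}

plain = total_cluster_score(scores)
weighted = total_cluster_score(scores, {"flip_cnt": 2.0, "default_cnt": 5.0})
```

Here `plain` is the simple sum of the three feature scores, while `weighted` emphasizes flips and defaults. Other formulas based on the same feature scores may be substituted.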

[0016] According to an example implementation, the filtering unit may filter the putative clusters into real clusters and false clusters, where the real ones will be deemed to be those of interest for potential collusion. In an example implementation of the disclosed technology, the filtering unit may utilize a predetermined algorithm for separating the clusters into two groups based on the results of the scoring. For example, the algorithm may include a filter that significantly reduces the data set by selecting a subset of the putative clusters to deem real clusters. The algorithm may be embodied in various forms according to certain embodiments. For example, the algorithm may examine the result of the scoring for each feature, and may select a subset of the clusters based on the various feature scores. Alternatively, the filtering unit may have a target score, and real clusters may be those that meet a criterion, e.g., greater than, less than, with respect to that score for the combination of feature scores.
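As a non-limiting illustration, the target-score variant of the filtering described above may be sketched as follows. The function name `filter_clusters`, the centroid labels, and the target value are hypothetical; the comparison is passed in as a parameter because the disclosure allows the criterion to be, for example, greater than or less than the target score.

```python
import operator
from typing import Callable, Dict, Tuple


def filter_clusters(
    total_scores: Dict[str, float],
    target: float,
    criterion: Callable[[float, float], bool] = operator.gt,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split putative clusters into (real, false) groups.

    A cluster, keyed here by its centroid label, is deemed real when
    criterion(score, target) is true, and false otherwise.
    """
    real_clusters: Dict[str, float] = {}
    false_clusters: Dict[str, float] = {}
    for centroid, score in total_scores.items():
        if criterion(score, target):
            real_clusters[centroid] = score
        else:
            false_clusters[centroid] = score
    return real_clusters, false_clusters


# Hypothetical total scores, one per putative cluster.
totals = {"deed_A": 13.0, "deed_B": 2.5, "deed_C": 8.0}
real_clusters, false_clusters = filter_clusters(totals, target=7.5)
```

With these example values, only the clusters whose totals exceed the target are retained as real clusters, which reduces the data set passed on for investigation.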

[0017] According to an example implementation, the putative cluster analysis system may calculate a set of putative clusters and filter those putative clusters into a set of high-interest real clusters. These and other embodiments of the putative cluster analysis systems and methods will be described in more detail below with reference to the figures.

BRIEF DESCRIPTION OF THE FIGURES

[0018] FIG. 1 illustrates an analytics method according to an example implementation of the disclosed technology.

[0019] FIG. 2 illustrates a putative cluster evidencing connectedness between entities represented in the cluster and sub-clusters.

[0020] FIG. 3 illustrates a representative computer architecture, according to an example embodiment of the disclosed technology.

[0021] FIG. 4 illustrates a diagram of potentially fraudulent transactions identified by an example implementation of the disclosed technology during a test analysis.

[0022] FIG. 5 is a flow-diagram of a method, according to an example embodiment of the disclosed technology.

DETAILED DESCRIPTION

[0023] To facilitate an understanding of the principles and features of the disclosed technology, various illustrative embodiments are explained below. Embodiments of the disclosed technology, however, are not limited to these embodiments. The materials and components described hereinafter as making up elements of the disclosed technology are intended to be illustrative and not restrictive. Many suitable materials and components that would perform the same or similar functions as the materials and components described herein are intended to be embraced within the scope of the disclosed technology. Other materials and components not described herein can include, but are not limited to, for example, similar or analogous materials or components developed after development of the disclosed technology.

[0024] Example systems and methods described herein may utilize various forms of data to identify connected entities and/or organizations. Certain embodiments of the disclosed technology may provide improved accuracy over conventional data mining and putative cluster analysis systems and techniques. For example, insurance companies and other industries attempting to identify fraud may utilize conventional focused analysis techniques that examine each event in isolation. The conventional techniques typically utilize high thresholds to filter the large number of events to be analyzed. In other words, because the data that entities must analyze with conventional techniques is so large, a high degree of suspicious activity may be required in order to identify fraud. Without a high threshold, conventional techniques may have too many potentially fraudulent events to investigate. As a result, entities using conventional techniques often overlook collusion from groups that are able to stay below these high thresholds with respect to certain suspicious activities.

[0025] Conventional systems for identifying mortgage fraud are often tied to specific loan portfolios, such as those related to a particular bank. Thus, these systems are not well-suited to fraud detection in large scale and across multiple banks. Mortgage fraud is prolific and can be hard to detect, especially given that mortgage data is spread across numerous databases and not consolidated into one database. The relevant data may be spread through financial services organizations, government agencies, and public records, such as property and assessment deeds. Further, the government agencies controlling some of this data have limited resources to detect and investigate the bigger mortgage fraud schemes. The putative cluster analysis system disclosed herein may be capable of efficiently leveraging readily available data to help organizations detect, prioritize, and investigate large mortgage fraud schemes.

[0026] When applied to the problem of mortgage fraud, the putative cluster analysis systems and methods disclosed herein may perform one or more of the following tasks: drive improved investigative and due diligence workflows; evaluate and segment loan files to identify notable risks; identify non-obvious relationships between entities, within and external to loan transactions; expose key perpetrators to improve remediation and recourse opportunities; augment existing fraud detection and scoring models during origination and loan pool acquisition; and enhance internal fraud and risk controls with a flexible pattern selection process.

[0027] According to an example implementation of the disclosed technology, the putative cluster analysis system may start with a large quantity of data and group that data into smaller, distinct clusters. In an example embodiment, the proximity of seemingly low risk activity within each cluster may be measured using lower thresholds than is reasonably possible in the methods used by conventional systems. As a result, the putative cluster analysis system may identify potentially organized groups without having to apply low thresholds to the large amounts of data as a whole.

[0028] In accordance with certain example embodiments of the disclosed technology, high interest clusters may be identified from a plurality of data. High interest clusters, for example, may represent connected organizations, entities, and/or people. In certain example implementations, the putative cluster analysis system disclosed herein may rely upon relatively large amounts of data to measure proximity of seemingly low risk events commonly associated with high risk activities to detect potentially fraudulent activities.

[0029] In one example embodiment, a domain of entities may be identified for analysis. For example, data associated with a large number (perhaps millions) of property deeds may be gathered for analysis. The associated data may include identities of individuals, organizations, companies, etc., that are associated with the deeds. The associated data may include information such as addresses, mortgage lenders, names of law firms, dates of transactions, etc. According to certain example embodiments of the disclosed technology, one or more types of relationships between the entities may then be collected. According to an example embodiment, a non-partitioning clustering algorithm may be utilized to form clusters for each of the domain entities, wherein copies of the domain entity may be created, as required, for populating clusters associated with neighboring clusters.

[0030] In certain embodiments, a filtering mechanism may operate against the clusters and may retain those clusters that have outlying behavior. Such filtering may conventionally utilize graph or network analysis, and queries/filtering of this form may utilize sub-graph matching routines or fuzzy sub-graph matching. However, sub-graph matching routines or fuzzy sub-graph matching techniques may be NP-complete, and thus, impractical for analyzing large sets of data. The most notable characteristic of NP-complete problems is that no fast solution to them is known. That is, the time required to solve the problem using any currently known algorithm increases very quickly as the size of the problem grows. This means that the time required to solve even moderately sized versions of many of these problems can easily reach into the billions or trillions of years, using any amount of computing power available today. Embodiments of the disclosed technology may be utilized to provide clusters and connections between entities even though the set of data analyzed may be extremely large.

[0031] In accordance with an example implementation of the disclosed technology, entities may be identified and may include people, companies, places, objects, virtual identities, etc. In an example embodiment, relationships may be formed in many ways, and with many qualities. For example, co-occurrence of values in common database fields may be utilized, such as the same last name. Relationships may also be formed using multiple co-occurrence of an entity with one or more other properties, such as people who have lived at two or more addresses.

[0032] Relationships may also be formed based on a high reoccurrence and/or frequency of a common relationship, according to an example embodiment. For example, records of person X sending an email to person Y greater than N times may indicate a relationship between person X and person Y. In another example embodiment, if person X sends an email to or receives an email from person Y, and within a short period of time, person Z sends an email to or receives an email from person Y, then a relationship may be implied between person X and person Z.
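As a non-limiting illustration of the frequency-based relationship of paragraph [0032], a relationship between a sender and a recipient may be inferred when the number of emails exceeds N. The function name `frequent_relationships`, the representation of the email log as (sender, recipient) pairs, and the example values are assumptions made only for this sketch.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def frequent_relationships(
    emails: Iterable[Tuple[str, str]], n: int
) -> Set[Tuple[str, str]]:
    """Return the directed (sender, recipient) pairs seen more than n times."""
    counts = Counter(emails)
    return {pair for pair, count in counts.items() if count > n}


# Hypothetical email log: X emails Y four times, Y replies once,
# and Z emails Y twice.
email_log = [("X", "Y")] * 4 + [("Y", "X")] + [("Z", "Y")] * 2

related = frequent_relationships(email_log, n=3)
```

In this sketch the pairs are treated as directed, so emails from Y to X are counted separately from emails from X to Y. An undirected variant could sort each pair before counting.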

[0033] In accordance with an example implementation of the disclosed technology, relationships between entities may comprise Boolean, weighted, directed, undirected, and/or combinations of multiple relationships. According to certain example embodiments of the disclosed technology, clustering of the entities may rely on relationship steps. In one embodiment, entities may be related by at least two different relationship types. In one embodiment, relationships for the clustering may be established by examining weights or strengths of connections between entities in certain directions and conditional upon other relationships, including temporal relationships. For example, in one embodiment, the directional relationships between entities X, Y, and Z may be examined and the connection between X, Y, and Z may be followed if the link between Y and Z was established (in time) after the link was established between X and Y.
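As a non-limiting illustration, the temporal condition described above may be sketched as follows. The representation of a link as a (source, target, time) tuple, the use of integer timestamps, and the function name `time_ordered_chains` are assumptions made only for this example.

```python
from typing import Dict, List, Tuple

# A directed link: (source entity, target entity, time the link was established).
Link = Tuple[str, str, int]


def time_ordered_chains(links: List[Link]) -> List[Tuple[str, str, str]]:
    """Return (x, y, z) chains in which the y->z link was established
    strictly after the x->y link."""
    outgoing: Dict[str, List[Tuple[str, int]]] = {}
    for source, target, established in links:
        outgoing.setdefault(source, []).append((target, established))

    chains = []
    for x, y, t_xy in links:
        for z, t_yz in outgoing.get(y, []):
            # Follow the connection only when the second link is later in time.
            if t_yz > t_xy and z != x:
                chains.append((x, y, z))
    return chains


# Hypothetical links: Y->Z follows X->Y in time, but Y->W precedes it.
links = [("X", "Y", 1), ("Y", "Z", 5), ("Y", "W", 0)]
chains = time_ordered_chains(links)
```

In this example the connection from X through Y to Z is followed, while the connection from X through Y to W is not, because the link between Y and W predates the link between X and Y.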

[0034] Many methods may be utilized to filter clusters, once they are identified. For example, in one embodiment, clusters may be scored. In another embodiment, a threshold may be utilized to identify clusters of interest. According to an example embodiment of the disclosed technology, a model may be utilized to compute a number of statistics on each cluster. In one embodiment, the model may be as simple as determining counts. In another embodiment, the model may detect relationships within a cluster, for example, entities that are related to the centroid of the cluster that are also related to each other. This analysis may provide a measure of cohesiveness of relationships that exist inside the cluster. According to an example embodiment of the disclosed technology, once the statistics have been computed, scoring and weighting of each cluster may be utilized to determine which clusters rise above a particular threshold, and may be classified as "interesting." In accordance with an example embodiment of the disclosed technology, weighting and/or scoring of the determined statistics may be accomplished using a heuristic scoring model, such as linear regression, neural network analysis, etc.

[0035] An example analytics method may be implemented by a putative cluster analysis system 100, as illustrated in FIG. 1. It will be understood that the method illustrated herein is provided for illustrative purposes only and does not limit the scope of the disclosed technology.
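As a non-limiting illustration of the cohesiveness statistic of paragraph [0034], the fraction of pairs of entities related to the centroid that are also related to each other may be computed as follows. This quantity is the local clustering coefficient, a standard graph statistic that is swapped in here as one possible measure; the disclosure does not specify this formula. The function name `cohesiveness` and the adjacency-set representation are likewise assumptions.

```python
from itertools import combinations
from typing import Dict, Set


def cohesiveness(centroid: str, adjacency: Dict[str, Set[str]]) -> float:
    """Fraction of pairs of the centroid's neighbors that are directly linked.

    Returns 0.0 when the centroid has fewer than two neighbors.
    """
    neighbors = sorted(adjacency.get(centroid, set()) - {centroid})
    possible_pairs = len(neighbors) * (len(neighbors) - 1) // 2
    if possible_pairs == 0:
        return 0.0
    linked_pairs = sum(
        1
        for a, b in combinations(neighbors, 2)
        if b in adjacency.get(a, set()) or a in adjacency.get(b, set())
    )
    return linked_pairs / possible_pairs


# Hypothetical cluster: centroid C is related to A, B, and D,
# and A and B are also related to each other.
graph = {
    "C": {"A", "B", "D"},
    "A": {"C", "B"},
    "B": {"C", "A"},
    "D": {"C"},
}
value = cohesiveness("C", graph)
```

In this example, one of the three possible pairs among the entities related to the centroid is itself linked. A statistic of this kind could be one of several inputs to the scoring and weighting of a cluster.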

[0036] The putative cluster analysis system 100 may receive a plurality of data 102 to be analyzed. In accordance with an example embodiment, the data may be processed 104, and output 106 may be generated. In one example embodiment, the data may include identities and property deeds 108. The data may also include information 110, for example, that may include data related to a bank portfolio. In an example embodiment, the system 100 may receive the data 102 in its various forms (which may include identities, property deeds, portfolios, etc.), and may process 104 the data 102 to derive relationships 112 and perform analytics 114. In an example embodiment, the relationships 112 and analytics 114 may be used to determine particular attributes 116. For example, the attributes 116 may include one or more of the following: property status; property deed transfer history; buyer history; and/or the previous seller's cluster activity. According to an example embodiment of the disclosed technology, the determined attributes 116 may go through a scoring and filtering process 118, which may result in an output 106 that may include one or more primary attributes 120, features 122, and risk segmentation 124. In accordance with an example embodiment of the disclosed technology, the primary attributes 120 may include entity and property characteristics, such as suspicious deeds, associations with businesses and other entities, seller address history, etc. According to an example embodiment, the features 122 may be derived from aggregating characteristics such as store code deeds, defaults, transfer activity, etc. In one example embodiment, such features 122 may be derived by combining primary attributes 120. In an example embodiment, the risk segmentation 124 may be utilized to augment current scoring models.

[0037] According to an example implementation of the disclosed technology, the clustering unit of the putative cluster analysis system 100 may treat each data point in the data as a centroid of its own cluster. Thus, the total number of clusters may be equal to the total number of data points, and each cluster may be uniquely represented by its centroid data point. The distance between the centroid and any data point within each cluster may be limited, such that the clusters are limited in size and, for some analyses, may be treated as being disconnected from one another. An example method of clustering data suitable for use with the putative cluster analysis systems and methods is disclosed in Bayliss I and II, which are incorporated herein.
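As a non-limiting illustration, the formation of one distance-limited cluster per data point may be sketched as follows. This sketch uses a breadth-first search with hop count as the "distance," which is an assumption made only for this example; it does not reproduce the clustering method of Bayliss I and II. The function name `putative_clusters` and the adjacency-set representation are hypothetical.

```python
from collections import deque
from typing import Dict, Set


def putative_clusters(
    adjacency: Dict[str, Set[str]], max_distance: int
) -> Dict[str, Dict[str, int]]:
    """Build one cluster per data point, keyed by its centroid.

    Each cluster maps a member data point to its hop distance from the
    centroid, and contains only members within max_distance hops.
    """
    clusters: Dict[str, Dict[str, int]] = {}
    for centroid in adjacency:
        distances = {centroid: 0}
        queue = deque([centroid])
        while queue:
            node = queue.popleft()
            if distances[node] == max_distance:
                continue  # do not expand beyond the distance limit
            for neighbor in adjacency.get(node, ()):
                if neighbor not in distances:
                    distances[neighbor] = distances[node] + 1
                    queue.append(neighbor)
        clusters[centroid] = distances
    return clusters


# Hypothetical chain of related data points: A - B - C - D.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
clusters = putative_clusters(graph, max_distance=2)
```

In this example there are as many clusters as data points, and a given data point may appear in several clusters, consistent with a non-partitioning approach. Data point D is absent from the cluster centered on A because it lies beyond the distance limit.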

[0038] According to an example implementation of the disclosed technology, scoring and filter 118 may be applied, for example, to analyze each cluster and assign one or more scores to each cluster. In an example implementation, a scoring unit may utilize a predetermined scoring algorithm for scoring some or all of the clusters. In another example implementation, the scoring unit may utilize a dynamic scoring algorithm for scoring some or all of the clusters. The scoring algorithm, for example, may be based on seemingly low-risk events that tend to be associated with organizations, such as fraud organizations. The algorithm may thus also be based on research into what events tend to be indicative of fraud in the industry or application to which the putative cluster analysis system is directed.

[0039] In one example implementation, each putative cluster may be scored individually. For example, a plurality of predetermined attributes, or variables, may be calculated for each cluster based on the data points in the cluster. For each attribute, the putative cluster as a whole may be considered, or each data point or link between pairs of data points may be considered. An attribute may be evaluated and scored depending on the nature of the attribute.

[0040] According to an example implementation of the disclosed technology, the property status attribute may include one or more of the following: the date of subject property last deed; the sale amount of subject property, the last recorded deed transfer; the number of months subject property was owned by previous owner; and/or the number of potential flip deed transfers (for example, property being owned less than 6 months or having a greater than 10% appreciation, etc.).
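As a non-limiting illustration, the example flip criteria above (ownership of less than 6 months or appreciation greater than 10%) may be sketched as follows. The class name `DeedTransfer`, its field names, the calendar-month approximation of the ownership period, and the example values are assumptions made only for this sketch.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DeedTransfer:
    purchase_date: date
    sale_date: date
    purchase_price: float
    sale_price: float


def months_owned(transfer: DeedTransfer) -> int:
    """Whole calendar months between purchase and sale (day of month ignored)."""
    return (
        (transfer.sale_date.year - transfer.purchase_date.year) * 12
        + (transfer.sale_date.month - transfer.purchase_date.month)
    )


def is_potential_flip(
    transfer: DeedTransfer,
    max_months: int = 6,
    min_appreciation: float = 0.10,
) -> bool:
    """Flag a transfer held for a short time or sold at a large gain."""
    if transfer.purchase_price > 0:
        appreciation = (
            transfer.sale_price - transfer.purchase_price
        ) / transfer.purchase_price
    else:
        appreciation = 0.0  # assumption: no appreciation without a valid price
    return months_owned(transfer) < max_months or appreciation > min_appreciation


# Hypothetical transfers: a quick resale, a slow but profitable resale,
# and a long hold with modest appreciation.
quick = DeedTransfer(date(2010, 1, 15), date(2010, 4, 1), 200000.0, 205000.0)
slow_gain = DeedTransfer(date(2008, 1, 1), date(2010, 1, 1), 200000.0, 230000.0)
hold = DeedTransfer(date(2005, 1, 1), date(2010, 1, 1), 200000.0, 210000.0)

flip_count = sum(is_potential_flip(t) for t in (quick, slow_gain, hold))
```

The resulting count corresponds to the "number of potential flip deed transfers" attribute. The thresholds are parameters so that other values may be used.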

[0041] According to an example implementation of the disclosed technology, the property deed transfer history attribute may include one or more of the following information: the previous owner is a member of a network having high volume or suspicious deed transfer activity; the number of properties ever sold by previous owner that then resulted in default; and/or the previous owner's count of historical deed transfers within a network of associates.

[0042] According to an example implementation of the disclosed technology, the buyer history attribute may include one or more of the following information: the number of properties ever owned by the buyer(s); the number of properties ever owned by the buyer(s) business; and/or the number of properties ever sold by the buyer(s) that resulted in default.

[0043] According to an example implementation of the disclosed technology, the previous seller's cluster activity attribute may include one or more of the following information: the buyer(s)' count of historical deed transfers within a network of associates; and/or the number of potential flip deed transfers (for example, property being owned less than 6 months or having a greater than 10% appreciation, etc.).

[0044] These or other features may be integrated into the scoring unit, so as to score the various putative clusters provided by the clustering unit. Core transaction measurements, which may be incorporated into the above list of features, may include velocity, profit, and buyer or seller relationship. The filtering unit may filter the putative clusters and retain those clusters that are deemed to represent real organizations based on the scoring.

[0045] In accordance with an example implementation of the disclosed technology, the putative cluster analysis system may leverage publicly available data, such as property deeds and assessments, which may include several hundred million records. The putative cluster analysis system may also clean and standardize data to reduce the possibility that matching entities are considered as distinct. Before creating the putative clusters, the putative cluster analysis system may use this data to build a large-scale network map of the population in question and its associated flow of property.

[0046] According to an example implementation, the putative cluster analysis system may leverage a relatively large scale of supercomputing power and analytics to target organized collusion. Example implementations of the putative cluster analysis systems and methods may rely upon open-source, large-scale parallel-processing computing platforms to increase the agility and scale of solutions. In one embodiment of the putative cluster analysis system, centroids may be derived from a public database of around fifty terabytes for the U.S. population. In this embodiment, a cluster network map may be created with around four hundred million clusters with seventeen billion relationships.

[0047] Example implementations of the putative cluster analysis systems and methods may measure behavior and relationships that traditionally may be used to obscure activities, and may thereby more actively and effectively expose syndicates and rings of collusion. Unlike many conventional systems, the putative cluster analysis system need not be limited to rings operating in a single geographic location, and it need not be limited to short time periods. Further, the putative cluster analysis system need not be limited to measuring only individually high value transactions, as banks do when identifying potential fraud that they consider to be worth their resources. The putative cluster analysis systems and methods disclosed herein thus may enable investigations to prioritize efforts on organized groups more effectively, rather than investigating individual transactions to determine whether they fall within an organized ring.

[0048] A list of example attributes is shown in Table 1, below. It will be understood that these attributes are provided for illustrative purposes only and do not limit the scope of the putative cluster analysis systems and methods. Not all of these attributes need be used, and other attributes may be used as well, such as those described above with respect to the attributes 116 in reference to FIG. 1.

Table 1

sir cl hi prof cnt: Previous Seller's Cluster high profit transfers
sir cl in net hi prof: Previous Seller's Cluster in network high profit transfers
sir cl in net hi prof flip cnt: Previous Seller's Cluster in network high profit flips
sir cl flop cnt: Previous Seller's Cluster flop count
sir cl default cnt: Previous Seller's Cluster property transfers ending in default
sir cl fc cnt: Previous Seller's Cluster property transfers ending in foreclosure
sir cl ends in default fc: Previous Seller's Cluster property sales end in default or foreclosure

Buyer's Activity

byr cl flip 0 deg: Buyer Flips
byr cl in net cnt 0 deg: Buyer in network deed transfers
byr cl in net flip cnt 0 deg: Buyer in network flips
byr cl hi prof cnt 0 deg: Buyer high profit transfer
byr cl in net hi prof 0 deg: Buyer in network high profit transfers
byr cl cl fc default cnt 0 deg: Buyer property transfers in default or foreclosure
byr susp flip net: Buyer member of a suspicious flip network
byr susp fc net: Buyer member of suspicious network with foreclosure

Buyer's Cluster Activity

byr cl sales cnt: Buyer's Cluster Total Deed Transfers
byr cl flip cnt: Buyer's Cluster Flips
byr cl flip bus cnt: Buyer's Cluster business flips
byr cl in net cnt: Buyer's Cluster in network transfers
byr cl in net flip bus cnt: Buyer's Cluster in network business flips
byr cl in net flop: Buyer's Cluster in network flops
byr cl in net flip cnt: Buyer's Cluster in network flips
byr cl hi prof cnt: Buyer's Cluster high profit transfers
byr cl in net hi prof: Buyer's Cluster in network high profit transfers
byr cl in net hi prof flip cnt: Buyer's Cluster in network high profit flips
byr cl flop cnt: Buyer's Cluster flop count
byr cl default cnt: Buyer's Cluster property transfers ending in default
byr cl fc cnt: Buyer's Cluster property transfers ending in foreclosure
byr cl ends in default fc: Buyer's Cluster property sales end in default or foreclosure

[0049] In accordance with an example implementation of the disclosed technology, the scoring of the clusters may include accessing features, which may be based on research or knowledge about behaviors that suggest collusive activity. For example, each feature may represent a risky activity or characteristic. In identifying automobile insurance fraud, for example, the features may include the number of automobiles involved in an accident, the number of people injured, the value of vehicles involved in the accident, and/or number and extent of injuries.

[0050] In an example implementation, each feature may be computed for each cluster. A feature, for example, may be calculated as a composite of one or more attributes of the cluster in question. For example, an attribute for detecting mortgage fraud may be "date of last deed transfer." A feature that is based on this attribute may be "whether previous owner is a member of a network that shows high volume or suspicious deed transfer activity." Thus, this feature may be a composite of the "date of last deed transfer" attribute, along with other attributes.
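As a hedged illustration of how a feature might be composed from several attributes, the sketch below combines a "date of last deed transfer" attribute with transfer counts. The class `ClusterAttributes`, the function `high_volume_recent_activity`, and both thresholds are hypothetical and chosen only to make the example concrete.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ClusterAttributes:
    # Hypothetical attribute names; the disclosure lists its own in Table 1.
    last_deed_transfer: date
    in_network_transfers: int
    total_transfers: int

def high_volume_recent_activity(attrs: ClusterAttributes, as_of: date,
                                max_age_days: int = 365,
                                min_in_network_share: float = 0.5) -> bool:
    """Composite feature: recent deed activity dominated by in-network transfers."""
    if attrs.total_transfers == 0:
        return False
    # Attribute 1: how recently the last deed transfer occurred.
    recent = (as_of - attrs.last_deed_transfer).days <= max_age_days
    # Attributes 2 and 3: share of transfers that stayed inside the network.
    share = attrs.in_network_transfers / attrs.total_transfers
    return recent and share >= min_in_network_share

example = ClusterAttributes(date(2012, 1, 10), in_network_transfers=6, total_transfers=8)
flag = high_volume_recent_activity(example, as_of=date(2012, 6, 1))
```

Here the feature is true only when both conditions hold, which is one simple way a single feature can depend on more than one attribute.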

[0051] In accordance with an example implementation of the disclosed technology, the scoring of the clusters may include calculating a score for each cluster, based on the features computed for the cluster. With wisely-chosen features, the resulting score for a cluster may be indicative of the connectedness of the various data points within a cluster. In accordance with an example implementation of the disclosed technology, filtering may be utilized to examine the cluster scores and filter, or identify, which putative clusters are real clusters, i.e., represent organized groups of entities. Organized groups may be flagged as being potentially involved in collusion-based fraud.

[0052] In one example implementation, a filter may be utilized to reduce the data set to identify groups that evidence the greatest connectedness based on the scoring algorithm. In one example implementation, putative clusters with scores that match a predetermined set of criteria may be flagged for evaluation. In an example implementation of the disclosed technology, filtering may utilize one or more target scores, which may be selected based on the industry, goals of the putative cluster analysis system, or the scoring algorithm. In one example implementation, putative clusters having scores greater than or equal to a target score may be flagged as being potentially collusive.
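The scoring and filtering steps can be sketched as follows. A weighted sum is assumed here purely for illustration, since the disclosure does not fix a particular scoring algorithm; the feature names, weights, and target score are likewise hypothetical.

```python
def score_cluster(features, weights):
    """Assumed scoring rule: weighted sum of feature values (missing = 0)."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

def flag_clusters(clusters, weights, target_score):
    """Return ids of putative clusters whose score is >= target_score."""
    return [cid for cid, feats in clusters.items()
            if score_cluster(feats, weights) >= target_score]

# Hypothetical feature values for three putative clusters.
weights = {"in_net_flips": 2.0, "hi_profit": 1.0}
clusters = {
    "A": {"in_net_flips": 3, "hi_profit": 1},
    "B": {"in_net_flips": 0, "hi_profit": 2},
    "C": {"in_net_flips": 1, "hi_profit": 3},
}
flagged = flag_clusters(clusters, weights, target_score=5.0)
```

Because the comparison is "greater than or equal to", a cluster whose score exactly matches the target score is flagged as well.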

[0053] As discussed above, an issue with conventional systems is that the threshold for identifying fraud is too high, so as to prevent identifying too many entities for examination. According to an example implementation of the disclosed technology, the features and scoring algorithm may be chosen to identify connectedness without the concern that too many individuals will be identified. According to an example implementation of the disclosed technology, groups, instead of individuals, may be identified.

[0054] FIG. 2 illustrates an example putative cluster 200 where certain connectedness between entities may be determined according to the systems and methods disclosed herein. This particular example may be directed toward identifying potential mortgage fraud, and at the centroid of this example putative cluster 200 is a specific first property 202, which may be a house, for example. This particular example is over-simplified for clarity, and it should be realized that such putative clusters in practice may actually contain hundreds of thousands of properties and associated entities having a dense web of connections among the properties, entities, etc.

[0055] In accordance with an example implementation, and with continued reference to FIG. 2, the first property 202 may have certain characteristics (historical or otherwise) associated with it, for example, flipping (i.e., fast turnover), high sales profit, and/or transactions in which parties appeared to be associated with each other even outside of the transaction. A first bank 206 that is considering providing a mortgage on this first property 202 to a potential buyer 204 may have certain visibility to the aforementioned characteristics but, using a conventional fraud-identification system, the bank 206 may not be able to detect the various connections that actually exist. Other properties within the same putative cluster 200 may show similar characteristics: flipping, high sales profit, and relationships between parties. Alone, this may not raise a flag, but the putative cluster analysis systems and/or methods disclosed herein may identify this centroid property's connections to the other properties in the putative cluster 200, thus possibly raising a flag when suspicious connections are identified.

[0056] In accordance with an example embodiment of the disclosed technology, connections between entities may be established based on public record documents, property deeds, etc., and such connections may be represented by lines connecting the entities, property, banks, etc. For example, a potential buyer 204 of a first property 202 may be in communication with a first bank 206. In one embodiment, the potential buyer 204, the first property 202, and the first bank 206 may represent a first sub cluster 207. The entire putative cluster 200 may include multiple sub clusters, each established with a property, person, etc., at its particular centroid. For example, the putative cluster 200 of FIG. 2 illustrates a number of sub clusters 207, 208, 209, 212, 214, 226, 208. As discussed above, a particular entity may be at the centroid of its own cluster, and that same particular entity may be duplicated in the putative cluster to show connections with other entities that are set at the centroid of their own cluster.

[0057] The potential buyer 204, for example, is shown connected to the first sub cluster 207 in which the first property 202 is at the centroid. The potential buyer 204 is also shown in the figure as being the centroid of a second sub cluster 209. The fourth sub cluster 214 includes a second bank 215 at its centroid, and the potential buyer 204 is duplicated and shown as having a connection to the second bank 215. The connection between the first and second instances of the potential buyer 204 is represented in this figure by a thick line 205. Focusing now on the fourth sub cluster 214, in which the second bank 215 is at its centroid, we see that a second entity 216 is connected with the potential buyer 204 and with the second bank 215. Additionally, connected to the second bank 215 and to the second entity 216 is a third entity 218. The third entity 218 is a member of the fourth sub cluster 214 and the fifth sub cluster 226. Again, the connection between the duplicated instances of the third entity 218 is signified by the thick line 219. The third entity 218 within the fifth sub cluster 226 is shown as being connected to a fourth entity 220, who is connected to a fifth entity 222. Therefore, according to this example putative cluster 200, a connection may be determined to exist between the potential buyer 204 and the fifth entity 222, and this connection is shown by the dotted line 224.
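The chain of connections just described can be traced with a short graph sketch. The edges below are transcribed from the description of FIG. 2, with each duplicated entity merged into a single node (an assumption standing in for the thick lines 205 and 219). The node labels and the use of breadth-first search are illustrative choices, not a method stated in the disclosure.

```python
from collections import deque

# Edges taken from the text; labels combine a role with its reference numeral.
EDGES = [
    ("buyer_204", "property_202"),
    ("buyer_204", "bank_215"),
    ("entity_216", "buyer_204"),
    ("entity_216", "bank_215"),
    ("entity_218", "bank_215"),
    ("entity_218", "entity_216"),
    ("entity_218", "entity_220"),
    ("entity_220", "entity_222"),
]

def build_graph(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

graph = build_graph(EDGES)
path = shortest_path(graph, "buyer_204", "entity_222")
```

The returned path links the potential buyer 204 to the fifth entity 222 through intermediate entities, which corresponds to the indirect connection drawn as the dotted line 224.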

[0058] According to an example implementation, a single property may have changed ownership between multiple entities in the sub cluster, as shown in the first sub cluster 207, the fifth sub cluster 226 and the sixth sub cluster 208. However, as shown in FIG. 2, the centroid property 202 being analyzed has been subject to a number of transfers between related entities, which is often an indicator of fraudulent activities. Again, the movement of this property among these various entities would likely be overlooked in a conventional fraud-detection system.

[0059] FIG. 3 depicts a block diagram of an illustrative computer system architecture 300 according to an example implementation of the disclosed technology. Various implementations and methods herein may be embodied in non-transitory computer readable media for execution by a processor. It will be understood that the architecture 300 is provided for example purposes only and does not limit the scope of the various implementations of the communication systems and methods.

[0060] The architecture 300 of FIG. 3 includes a central processing unit (CPU) 302, where computer instructions are processed, and a display interface 304 that acts as a communication interface and provides functions for rendering video, graphics, images, and text on the display. In certain example implementations of the disclosed technology, the display interface 304 may be directly connected to a local display. In another example implementation, the display interface 304 may be configured for providing data, images, and other information for an external/remote display or computer that is not necessarily connected to the particular CPU 302. In certain example implementations, the display interface 304 may wirelessly communicate, for example, via a Wi-Fi channel or other available network connection interface 312 to an external/remote display.

[0061] The architecture 300 may include a keyboard interface 306 that provides a communication interface to a keyboard, and a pointing device interface 308 that provides a communication interface to a pointing device, mouse, and/or touch screen. Example implementations of the architecture 300 may include an antenna interface 310 that provides a communication interface to an antenna, and a network connection interface 312 that provides a communication interface to a network. As mentioned above, the display interface 304 may be in communication with the network connection interface 312, for example, to provide information for display on a remote display that is not directly connected or attached to the system. In certain implementations, a camera interface 314 may be provided that may act as a communication interface and/or provide functions for capturing digital images from a camera. In certain implementations, a sound interface 316 is provided as a communication interface for converting sound into electrical signals using a microphone and for converting electrical signals into sound using a speaker. According to example implementations, a random access memory (RAM) 318 is provided, where computer instructions and data may be stored in a volatile memory device for processing by the CPU 302.

[0062] According to an example implementation, the architecture 300 includes a read-only memory (ROM) 320 where invariant low-level system code or data for basic system functions such as basic input and output (I/O), startup, or reception of keystrokes from a keyboard are stored in a non-volatile memory device. According to an example implementation, the architecture 300 includes a storage medium 322 or other suitable type of memory (e.g., RAM, ROM, programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic disks, optical disks, floppy disks, hard disks, removable cartridges, flash drives), where files including an operating system 324, application programs 326, and data files 328 are stored. The application programs 326 may include putative clustering instructions for organizing, storing, retrieving, comparing, and/or analyzing the various connections associated with the properties and entities associated with embodiments of the disclosed technology. According to example implementations of the disclosed technology, the putative cluster analysis system, the clustering unit, and/or the scoring unit may be embodied, at least in part, via the application programs 326 interacting with data from the ROM 320 or other memory storage medium 322, and may be enabled by interaction with the operating system 324 via the CPU 302 and bus 334.

[0063] According to an example implementation, the architecture 300 includes a power source 330 that provides an appropriate alternating current (AC) or direct current (DC) to power components. According to an example implementation, the architecture 300 may include a telephony subsystem 332 that allows the device 300 to transmit and receive sound over a telephone network. The constituent devices and the CPU 302 communicate with each other over a bus 334.

[0064] In accordance with an example implementation, the CPU 302 has appropriate structure to be a computer processor. In one arrangement, the computer CPU 302 may include more than one processing unit. The RAM 318 interfaces with the computer bus 334 to provide quick RAM storage to the CPU 302 during the execution of software programs such as the operating system, application programs, and device drivers. More specifically, the CPU 302 loads computer-executable process steps from the storage medium 322 or other media into a field of the RAM 318 in order to execute software programs. Data may be stored in the RAM 318, where the data may be accessed by the computer CPU 302 during execution. In one example configuration, the device 300 includes at least 128 MB of RAM and 256 MB of flash memory.

[0065] The storage medium 322 itself may include a number of physical drive units, such as a redundant array of independent disks (RAID), a floppy disk drive, a flash memory, a USB flash drive, an external hard disk drive, thumb drive, pen drive, key drive, a High-Density Digital Versatile Disc (HD-DVD) optical disc drive, an internal hard disk drive, a Blu-Ray optical disc drive, or a Holographic Digital Data Storage (HDDS) optical disc drive, an external mini-dual inline memory module (DIMM) synchronous dynamic random access memory (SDRAM), or an external micro-DIMM SDRAM. Such computer readable storage media allow the device 300 to access computer-executable process steps, application programs and the like, stored on removable and non-removable memory media, to off-load data from the device 300 or to upload data onto the device 300. A computer program product, such as one utilizing a communication system, may be tangibly embodied in storage medium 322, which may comprise a machine-readable storage medium.

[0066] FIG. 4 illustrates a Venn diagram 400 of potentially fraudulent real estate transactions that may be identified, categorized, and/or flagged by putative cluster analysis, according to certain example embodiments of the disclosed technology. As shown in FIG. 4, the putative cluster analysis system may identify high-risk transactions that are performed within a network of associates, that involve flipped properties, and that result in high profits. The term "flipping" is used herein to describe purchasing a revenue-generating asset and quickly reselling it for profit. This term is frequently used as a descriptive term for legal real estate investing strategies that are perceived by some to be unethical or socially destructive. Certain embodiments of the disclosed technology may be applied for sensing schemes involving market manipulation and other illegal conduct including potentially collusive behavior.

[0067] The Venn diagram 400 of FIG. 4 illustrates the overlap of certain attributes that may be determined from a number of transactions involving certain properties. For example, related entities that are identified as being in the same network 402 may comprise a subset of transactions. Certain transactions may be flagged as extracting high profit 404, and other transactions may be characterized as flipping or flopping 406. For example, flipping or flopping 406 may have the characteristic of a purchase, followed by a sale within a short period of time after the purchase. Certain flipping or flopping 406 transactions may have low profit, and certain transactions may have a high profit. The overlap (designated by the letter Y) of flipping or flopping 406 transactions with those that are high profit 404 may provide loan profiles with the characteristic of loan files that were flipped and resulted in high profit gains 410.

[0068] The overlap (designated by the letter W) of high profit 404 transactions with in network 402 transactions may provide loan profiles with a high profit gain having no flip 408. The overlap (designated by the letter X) of in network 402 transactions and flipping or flopping 406 transactions may provide a loan profile with flip flops that are not high profit 412. The overlap of the in network 402, the flipping or flopping 406, and the high profit 404 transactions may be designated (by the letter Z) as having the characteristic of in-cluster loans that were flipped and had high profit gains 414. According to an example implementation of the disclosed technology, such overlap of characteristics, attribute data, etc., may be utilized to identify potential collusion within a network that may otherwise be very difficult to detect.
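The four overlap regions of FIG. 4 can be expressed as a small classification sketch. The function name `classify` and its boolean inputs are hypothetical; only the letters W, X, Y, and Z and their meanings come from the description above.

```python
from typing import Optional

def classify(in_network: bool, high_profit: bool, flip_or_flop: bool) -> Optional[str]:
    """Map three transaction attributes to the overlap letters used in FIG. 4."""
    if in_network and high_profit and flip_or_flop:
        return "Z"  # in-cluster loans that were flipped with high profit gains (414)
    if high_profit and flip_or_flop:
        return "Y"  # flipped with high profit gains (410)
    if in_network and high_profit:
        return "W"  # high profit gain, no flip (408)
    if in_network and flip_or_flop:
        return "X"  # flip or flop that is not high profit (412)
    return None     # falls in at most one circle, so no overlap region applies

# Hypothetical transactions: (in_network, high_profit, flip_or_flop)
transactions = [(True, True, True), (False, True, True),
                (True, True, False), (True, False, True),
                (True, False, False)]
regions = [classify(*t) for t in transactions]
```

The three-way overlap is tested first so that a transaction satisfying all three attributes is assigned to Z rather than to one of the pairwise regions.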

[0069] It will be understood that the combination of flipping and high-profit transactions illustrated in FIG. 4 is an illustrative example of potential results of the putative cluster analysis systems and methods. Systems and methods disclosed herein may be capable of identifying potential collusion by much more complex mechanisms.

An example method 500 for identifying connected organizations from a collection of records will now be described with reference to the flowchart of FIG. 5. The method 500 starts in block 502, and according to an example implementation includes determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points. In block 504, the method 500 includes identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid. In block 506, the method 500 includes identifying cluster connections among the plurality of clusters. In block 508, the method 500 includes scoring the cluster connections based on predetermined criteria. In block 510, the method 500 includes identifying one or more of the distinct data points associated with the scored cluster connections.
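The five blocks of method 500 can be sketched end to end on toy data. Several choices here are assumptions made only for illustration: records are modeled as pairs of data points, two clusters are treated as connected when they share a data point other than their own centroids, and the score is the number of such shared points. The disclosure leaves the actual criteria open.

```python
from itertools import combinations

def determine_connections(records):
    """Block 502: build an undirected adjacency map from (a, b) record pairs."""
    adj = {}
    for a, b in records:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def identify_clusters(adj):
    """Block 504: one cluster per centroid, holding it and its direct connections."""
    return {c: {c} | nbrs for c, nbrs in adj.items()}

def identify_cluster_connections(clusters):
    """Block 506 (assumed rule): clusters connect via shared non-centroid points."""
    links = {}
    for c1, c2 in combinations(sorted(clusters), 2):
        shared = (clusters[c1] & clusters[c2]) - {c1, c2}
        if shared:
            links[(c1, c2)] = shared
    return links

def score_connections(links):
    """Block 508 (assumed criterion): score = number of shared data points."""
    return {pair: len(shared) for pair, shared in links.items()}

def identify_points(links, scores, min_score):
    """Block 510: centroids and shared points behind sufficiently scored connections."""
    points = set()
    for pair, s in scores.items():
        if s >= min_score:
            points |= set(pair) | links[pair]
    return points

# Hypothetical deed records linking people to properties.
records = [("ann", "house1"), ("bob", "house1"), ("bob", "house2"),
           ("cat", "house2"), ("ann", "house2")]
adj = determine_connections(records)
clusters = identify_clusters(adj)
links = identify_cluster_connections(clusters)
scores = score_connections(links)
flagged = identify_points(links, scores, min_score=2)
```

In this toy data, the clusters centered on "ann" and "bob" share two properties and therefore score higher than the clusters involving "cat", who is linked through only one property.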

[0070] An example embodiment of the putative cluster analysis system was tested using transaction data related to properties in Sarasota, Florida, over a ten-year period. In that test, it was determined that the highest risk cluster did not include any high-velocity flippers. Instead, risk behavior was evenly spread across the actors within the cluster and this characteristic may have inhibited detection by conventional means. In a blind study, the test putative cluster analysis system was able to identify a "ringleader" at the center of a cluster. This ringleader was indicted approximately one month before the test was conducted, and was identified by authorities based on information provided by a disgruntled employee informant. The identified ringleader was not listed on any of the deeds of flipped properties, but could be identified by the test putative cluster analysis system by indirect connections with the flipped properties, and by other metrics disclosed herein. Example implementations of the disclosed technology may be able to detect criminal activities that would not likely be identified if the involved individual or individuals intentionally avoid the type of behavior and connections that would be identifiable by conventional means.

[0071] Certain implementations of the disclosed technology of the putative cluster analysis systems and methods may be used to identify potential organizations of health insurance fraud, such as Medicaid fraud. For example, the input data to the clustering unit may be derived from historical address history of a population to be examined and such address history may be used to link individuals based on, for example, familial, residential, and business relationships. The clustering unit may then take this input data and output clusters for use by the scoring unit.

[0072] Without limitation, some features considered for the scoring algorithm with respect to health insurance fraud may include: (1) the number of people within a cluster who lived in expensive residences, owned expensive property, or drove expensive cars; (2) the number of insurance recipients within the cluster who are contacts of medical providers; (3) the number of medical businesses associated with people in the cluster; (4) the number of people in the cluster currently receiving benefits; and/or (5) the number of recipients associated with excluded providers. These features may enable the putative cluster system to identify, among others, clusters that have dense groups of recipients who appear to be colluding and transferring knowledge of how to claim Medicaid benefits and bypass eligibility requirements, as well as clusters that have close ties to medical providers who have the knowledge and means to defraud Medicaid.

[0073] According to an example implementation of the disclosed technology, the putative cluster analysis system may consider the following features to identify potential drug-seeking behavior: (1) prescription filling distance deviation; and (2) watchlist drug prescriptions. Such features may enable the putative cluster system to identify, among others, clusters that include patients who deviate when filling prescriptions for certain watchlist drugs, as well as clusters that include providers and prescribers with patterns of prescribing to the drug-seeking clusters.

[0074] According to example implementations, certain technical effects can be provided, such as creating certain systems and methods that are able to identify an entity that is connected to various other entities evidencing suspicious behavior. Embodiments of the disclosed technology may be utilized to examine related data in addition to data that is indicative of whether or not an individual entity is an active recipient of health insurance. The putative cluster analysis system disclosed herein may also consider other recipients in an individual's cluster, which may be indicative of collusion.

[0075] According to an example implementation of the disclosed technology, the above or other features may be integrated into the scoring unit, so as to score the various clusters provided by the clustering unit. In an example implementation, the filtering unit may then filter, or identify, those clusters that are deemed to represent real organizations based on the scoring.

[0076] An example implementation of the disclosed technology of the putative cluster analysis system was tested to identify potential Medicaid fraud. During the test, the individuals flagged as being potential ring-leaders were often not themselves Medicaid recipients. Rather, they were members of putative clusters having large numbers of recipients, and they were members of clusters in which other cluster-members drove a high proportion of expensive vehicles. Because these ring-leaders did not meet conventional criteria for fraud-flagging, they might have been overlooked; however, they were flagged as potential ring-leaders using an example embodiment of the putative cluster analysis system in a test with real data.

[0077] Example implementations of the disclosed putative cluster analysis systems and methods may also be used to identify potential organizations of automobile insurance fraud. In some instances, automobile insurance fraud may include multiple victims, expensive vehicles, or multiple injuries.

[0078] Without limitation, some features considered for the scoring algorithm with respect to automobile insurance fraud may include: (1) the number of involved parties; (2) the number of claimants requiring medical treatment; (3) individual claim amounts; (4) vehicle damage; and (5) makes or models of involved automobiles. Analysis of these features, according to an example embodiment, may enable the putative cluster system to identify, among others, clusters that have a high number of collective claims with low standard deviation of claim counts, as well as clusters that have a statistically higher number of claims with soft tissue injuries, multiple passengers, low vehicle damage, or common passengers across multiple claims in the cluster.
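One of the patterns mentioned above, a high number of collective claims with a low standard deviation of claim counts, can be checked with a short sketch. The function name and both thresholds are hypothetical, and the population standard deviation is assumed as the spread measure.

```python
from statistics import pstdev

def evenly_spread_high_volume(claim_counts, min_total=10, max_stdev=1.0):
    """True when cluster members file many claims in total, spread evenly."""
    if not claim_counts:
        return False
    # High collective volume AND low variation between members.
    return sum(claim_counts) >= min_total and pstdev(claim_counts) <= max_stdev

# Hypothetical per-member claim counts for three clusters.
even_cluster = [3, 3, 4, 3]      # many claims, similar counts per member
skewed_cluster = [12, 0, 1, 0]   # many claims, but concentrated in one member
small_cluster = [1, 1, 1]        # evenly spread, but low total volume

results = [evenly_spread_high_volume(c)
           for c in (even_cluster, skewed_cluster, small_cluster)]
```

Only the first cluster satisfies both conditions, which mirrors the idea that activity spread evenly across members can escape checks aimed at a single high-volume individual.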

[0079] According to an example implementation of the disclosed technology, the above or other features may be integrated into the scoring unit, so as to score the various clusters provided by the clustering unit. The filtering unit may then filter, or identify, those clusters that are deemed to represent real organizations based on the scoring.

[0080] Example implementations of the disclosed putative cluster analysis systems and methods may also be used to identify potential organizations involved in tax fraud. Without limitation, some features considered for the scoring algorithm with respect to tax fraud may include: (1) a significant change in income between tax years; (2) a significant increase in deductions; (3) a change in filing status; (4) a change in number or nature of dependents; and (5) the number of self-employed individuals in the cluster. These or other features may be integrated into the scoring unit, so as to score the various clusters provided by the clustering unit. The filtering unit may then filter, or identify, those clusters that are deemed to represent real organizations based on the scoring.

[0081] Various embodiments of the putative cluster analysis systems and methods may be embodied, in whole or in part, in a computer program product stored on non-transitory computer-readable media for execution by one or more processors. It will thus be understood that various aspects of the disclosed technology, such as the clustering unit, the scoring unit, and the filtering unit, may comprise hardware or software of a computer system, as discussed above with respect to FIG. 3. It will also be understood that, although these units may be discussed herein as being distinct from one another, they may be implemented in various ways. The distinctions between them throughout this disclosure are made for illustrative purposes only, based on operational distinctiveness.

[0082] Application of the various embodiments of the putative cluster analysis systems and methods need not be limited to those above. For example, an example implementation of the putative cluster analysis system may be used to identify potential fraud related to credit card applications, identity theft, investments, and various other fraud types that might involve an organization of connected entities.

[0083] While certain implementations of the disclosed technology have been described in connection with what is presently considered to be the most practical and various implementations, it is to be understood that the disclosed technology is not to be limited to the disclosed implementations, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

[0084] This written description uses examples to disclose certain implementations of the disclosed technology, including the best mode, and also to enable any person skilled in the art to practice certain implementations of the disclosed technology, including making and using any devices or systems and performing any incorporated methods. The patentable scope of certain implementations of the disclosed technology is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.

Claims

CLAIMS What is claimed is:

1. A computer-implemented method comprising:

determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points;

identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid;

identifying cluster connections among the plurality of clusters;

scoring the cluster connections based on predetermined criteria; and

identifying one or more of the distinct data points associated with the scored cluster connections.

2. The method of claim 1, wherein the collection of records comprise transaction records or relationship records.

3. The method of claim 1, wherein a distinct data point represents an individual, an organization, or a property.

4. The method of claim 1, wherein the connections between the one or more of the plurality of distinct data points comprise information derived from public records.

5. The method of claim 1, further comprising filtering the cluster connections based on predetermined attributes.

6. The method of claim 1, wherein scoring the cluster connections based on predetermined criteria.

7. The method of claim 1, wherein a number of the identified clusters is equal to or less than a number of the distinct data points.

8. A system comprising:

one or more processors; and

at least one memory comprising an operating system, a database, a clustering unit, and a scoring unit, the memory in communication with the one or more processors and configured for storing data and instructions which, when executed by the at least one processor under control of the operating system, enable the system to:

determine, from a collection of records in the database, wherein the collection of records comprises a plurality of distinct data points, connections between one or more of the plurality of distinct data points;

identify, by the clustering unit, and from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid;

identify, by the clustering unit, cluster connections among the plurality of clusters;

score, by the scoring unit, the cluster connections based on predetermined criteria; and

identify one or more of the distinct data points associated with the scored cluster connections.

9. The system of claim 8, wherein the collection of records comprises transaction records or relationship records.

10. The system of claim 8, wherein a distinct data point represents an individual, an organization, or a property.

11. The system of claim 8, wherein the connections between the one or more of the plurality of distinct data points comprise information derived from public records.

12. The system of claim 8, further comprising a filtering unit that is operable to filter the cluster connections based on predetermined attributes.

13. The system of claim 8, wherein the scoring unit is configured to score the cluster connections based on predetermined criteria.

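14. A computer-readable medium that stores instructions which, when executed by at least one processor in a system, cause the system to perform a method comprising:

determining, from a collection of records comprising a plurality of distinct data points, connections between one or more of the plurality of distinct data points;

identifying, from the plurality of distinct data points, a plurality of clusters, each of the clusters comprising a cluster centroid, each cluster centroid comprising a distinct data point, wherein each cluster comprises the determined connections between the one or more of the plurality of distinct data points and the cluster centroid;

identifying cluster connections among the plurality of clusters;

scoring the cluster connections based on predetermined criteria; and

identifying one or more of the distinct data points associated with the scored cluster connections.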

15. The computer-readable medium of claim 14, wherein the collection of records comprises transaction records or relationship records.

16. The computer-readable medium of claim 14, wherein a distinct data point represents an individual, an organization, or a property.

17. The computer-readable medium of claim 14, wherein the connections between the one or more of the plurality of distinct data points comprise information derived from public records.

18. The computer-readable medium of claim 14, further comprising filtering the cluster connections based on predetermined attributes.

19. The computer-readable medium of claim 14, wherein scoring the cluster connections is based on predetermined criteria.

20. The computer-readable medium of claim 14, wherein a number of the identified clusters is equal to or less than a number of the distinct data points.
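
By way of a non-limiting illustration only, the five steps recited in the independent claims can be sketched in Python as follows. Everything specific in this sketch is an assumption made for illustration and is not taken from the disclosure: records are modeled as pairs of identifiers, every data point is treated as the centroid of its own cluster, two clusters are treated as connected when they share at least one member, and the "predetermined criteria" are reduced to a count of shared members compared against a hypothetical `min_shared` threshold. The function names are likewise hypothetical.

```python
from collections import defaultdict


def build_connections(records):
    """Step 1: derive undirected connections between data points.

    Assumption: each record is a (point_a, point_b) pair.
    """
    neighbors = defaultdict(set)
    for a, b in records:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def build_clusters(neighbors):
    """Step 2: form one cluster per data point, with that point as centroid.

    Assumption: a cluster is the centroid plus its directly connected points.
    """
    return {centroid: {centroid} | linked for centroid, linked in neighbors.items()}


def find_cluster_connections(clusters):
    """Step 3: connect two clusters when they share at least one member."""
    centroids = sorted(clusters)
    links = {}
    for i, first in enumerate(centroids):
        for second in centroids[i + 1:]:
            shared = clusters[first] & clusters[second]
            if shared:
                links[(first, second)] = shared
    return links


def score_cluster_connections(links, min_shared):
    """Step 4: score each connection by its number of shared members.

    Assumption: the predetermined criterion is a simple minimum count.
    """
    return {
        pair: len(shared)
        for pair, shared in links.items()
        if len(shared) >= min_shared
    }


def points_in_scored_connections(links, scores):
    """Step 5: collect the data points behind the retained connections."""
    points = set()
    for pair in scores:
        points |= links[pair]
    return points


def run_pipeline(records, min_shared=2):
    """Run steps 1 through 5 and return (clusters, scores, points)."""
    neighbors = build_connections(records)
    clusters = build_clusters(neighbors)
    links = find_cluster_connections(clusters)
    scores = score_cluster_connections(links, min_shared)
    points = points_in_scored_connections(links, scores)
    return clusters, scores, points
```

For example, calling `run_pipeline([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("E", "F")], min_shared=3)` under these assumptions yields six clusters (one per data point, so the number of clusters does not exceed the number of data points) and retains only the connections among the clusters centered on A, B, and C, identifying A, B, and C as the associated data points. A practical implementation would likely substitute richer connection types and scoring criteria for the simple shared-member count used here.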