WO2003060652A2 - Method and apparatus for multifactor dimensionality reduction - Google Patents

Method and apparatus for multifactor dimensionality reduction


Publication number
WO2003060652A2
Authority
WO
WIPO (PCT)
Prior art keywords
data set
value
dependent variable
observations
disease
Prior art date
Application number
PCT/US2003/001333
Other languages
French (fr)
Other versions
WO2003060652A3 (en)
Inventor
Lance W. Hahn
Marylyn Ritchie
Jason H. Moore
Original Assignee
Vanderbilt University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vanderbilt University
Priority to AU2003212806A
Publication of WO2003060652A2
Publication of WO2003060652A3


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present invention is directed to a method and an apparatus for analyzing data. More particularly, the present invention is directed to a method and an apparatus for reducing multidimensional data.
  • logistic regression is a commonly used method for modeling the relationship between discrete predictors, such as genotypes, and discrete clinical outcomes.
  • logistic regression like most parametric-statistical methods, is less practical for dealing with high-dimensional data. That is, when high-order interactions are modeled, there are many contingency-table cells that contain no observations (i.e., that are empty cells). This can lead to very large coefficient estimates and standard errors.
  • One solution to this problem is to collect very large numbers of samples to allow robust estimation of interaction effects; however, the magnitudes of the samples that are often required incur prohibitive expense.
  • An alternative solution is to develop new statistical and computational methods that have improved power to identify multilocus effects in relatively small samples.
  • n observations are measured out of a first data set including discrete independent variables.
  • a number of observations in which a dependent variable has a first value is determined, and a number of observations in which the dependent variable has a second value is determined.
  • Combinations are formed of the independent variables to produce a second data set. In each combination of independent variables, a ratio of the number of observations with the dependent variable having the first value to the number of observations with the dependent variable having the second value is determined.
  • Each ratio is compared to one or more thresholds or ranges of thresholds to determine which combination of independent variables optimally discriminates observations with the dependent variable having the first value and observations with the dependent variable having the second value.
  • Those combinations that have ratios at approximately the same threshold or within the same range of thresholds are pooled together to produce a third data set.
  • the third data set includes a smaller number of independent variable combinations than the second data set.
  • the independent variables may be predictor or explanatory variables, including genetic and environmental factors.
  • the dependent variable may be a response variable indicative of disease.
  • the number of independent variable combinations may be reduced to two, one of which has a high ratio of disease indicative values to disease-free indicative values of the dependent variable and the other of which has a low ratio of disease indicative values to disease-free indicative values of the dependent variable.
  • Figure 1 illustrates an exemplary summary of steps for implementing a method for multifactor reduction according to one embodiment
  • Figure 2 illustrates an exemplary apparatus for performing multifactor reduction according to an exemplary embodiment
  • Figure 3 illustrates an exemplary summary of genotype combinations produced by the steps shown in Figure 1 according to an exemplary embodiment.
  • a multifactor-dimensionality reduction (MDR) method has been developed for detecting and characterizing high-order gene-gene and gene-environment interactions in case-control and discordant-sib-pair studies with relatively small samples.
  • the MDR method may be considered to be inspired in part by the combinatorial-partitioning method, a data-reduction method for the exploratory analysis of quantitative traits.
  • multilocus genotypes are pooled into high-risk and low-risk groups, effectively reducing the genotype predictors from n dimensions to one dimension.
  • the new, one-dimensional multilocus-genotype variable is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing.
  • the MDR method is model free, in that it does not assume any particular genetic model, and is nonparametric, in that it does not estimate any parameters.
  • the MDR method is first evaluated by using simulated multilocus data with epistatic effects, and it is then applied to identification of multiple single-nucleotide polymorphisms associated with sporadic breast cancer. It will be appreciated, however, that the invention is not limited to this application.
  • Breast cancer is generally considered a complex disease, since its most common form - sporadic breast cancer - is due to multiple unknown etiologies.
  • the catechol-estrogen pathway is regulated by catechol-O-methyltransferase (COMT), by cytochromes P450 1A1 and P450 1B1 (CYP1A1 and CYP1B1, respectively), and by glutathione S-transferases M1 and T1 (GSTM1 and GSTT1, respectively).
  • Each of the genes encoding these enzymes contains functional polymorphisms that result in different concentrations of catechol-estrogen metabolites. Interactions between polymorphisms of these genes may have a synergistic, or nonadditive, effect on the pathogenesis of breast cancer. This may explain differences in breast cancer risk.
  • FIG. 1 illustrates a summary of steps involved in implementing the MDR method for case-control studies according to an exemplary embodiment. The same procedure is equally applicable to discordant-sib-pair studies.
  • step 1 a set of n genetic and/or discrete environmental factors is selected from a pool of all factors.
  • step 2 the n factors and their possible multifactor classes or cells are represented in n-dimensional space. For example, for two loci with three genotypes each, there are nine two-locus-genotype combinations. Then, the ratio of the number of cases (or affected sibs) to the number of controls (or unaffected sibs) is estimated within each multifactor class.
  • each multifactor cell in n-dimensional space is labeled either as "high-risk," if the cases:controls ratio meets or exceeds some threshold (e.g., ≥1.0), or as "low-risk," if that threshold is not exceeded.
  • a model for both cases and controls is formed by pooling high-risk cells into one group and low-risk-cells into another group. This reduces the n-dimensional model to a one-dimensional model (i.e., having one variable with two multifactor classes-high risk and low risk).
  • step 4 the prediction error of each model is estimated by 10-fold cross-validation.
  • the data i.e., subjects
  • the MDR model is developed for each possible 9/10 of the subjects and then is used to make predictions about the disease status of each possible 1/10 of the subjects excluded.
  • the proportion of subjects for which an incorrect prediction was made is an estimation of the prediction error.
  • the 10-fold cross-validation is repeated 10 times, and the prediction errors are averaged.
  • the four steps of the MDR method may be repeated for each possible combination, if computationally feasible. If the number of combinations to be evaluated exceeds computational feasibility, machine learning methods, such as parallel genetic algorithms, may be employed.
  • This two-locus model will have the minimum classification error among the two-locus models.
  • Single best multifactor models are also selected from among the models for each of the three- to n-factor combinations. Among this set of best multifactor models, the combination of loci and/or discrete environmental factors that minimizes the prediction error is selected. Thus, the classification errors and the prediction errors estimated by 10-fold cross-validation are used to select the final multifactor model.
  • Hypothesis testing for this final model can then be performed by evaluating the consistency of the model across cross-validation data sets - that is, how many times the same MDR model is identified in each possible 9/10 of the subjects.
  • the reasoning is that a true signal (i.e., association) should be present in the data regardless of how they are divided.
  • Statistical significance was determined by comparing the average cross-validation consistency from the observed data to the distribution of average consistencies under the null hypothesis of no associations derived empirically from 1,000 permutations. The null hypothesis was rejected when the upper-tail Monte Carlo P value derived from the permutation test was ≤.05.
  • Figure 2 illustrates an exemplary apparatus for performing multifactor reduction according to an exemplary embodiment.
  • the method for multifactor reduction may be performed on a personal computer 200.
  • the personal computer 200 includes, for example, a memory and a microprocessor.
  • the microprocessor may be an Intel 600 MHz Pentium III processor or a Solaris (Sun) processor.
  • Devices such as a keyboard, a mouse, and a display terminal may be connected to the microprocessor and memory to receive input data and export output data.
  • This two- locus epistasis model was extended to three-locus, four-locus, and five-locus epistasis models by adding corresponding homozygous or heterozygous genotypes to the aforementioned penetrance functions.
  • P(D | AAbbcc) = .2, P(D | AaBbcc) = .2, P(D | AabbCc) = .2, and P(D | ...
  • Table 1 summarizes the polymorphisms, in these genes, that were analyzed by PCR and restriction-endonuclease digestion. Genotype frequencies have been previously reported. The specific primers and amplification conditions and the subsequent restriction-endonuclease analysis for CYP1A1, CYP1B1, GSTM1, and GSTT1 have been described elsewhere.
  • COMT was amplified with primers C1 (5'-GCC GCC ATC ACC CAG CGG ATG GTG GAT TTC GCT GTC) and C2 (5'-GTT TTC AGT GAA CGT GGT GTG). Each PCR contained internal controls for the respective gene, and random retesting of ~5% of the samples yielded 100% reproducibility.
  • Codon 119 355G→T 119Ala→Ser B1, B2 C NgoMIV 51 40 9
  • Prior to application of MDR to the sporadic-breast cancer data set, the method was evaluated by use of the simulated multilocus data sets. For each of the 50 replicates generated by each of the four multilocus epistasis models, the MDR algorithm was applied as described in the subsection "MDR," with a threshold cases:controls ratio of at least 1:1. This threshold was selected so that multilocus-genotype combinations would be considered high-risk if the number of cases with that particular combination either was equal to or exceeded the number of controls; more-stringent thresholds may improve the results. An exhaustive search of all possible two- to nine-locus models was performed. The 10-locus model was not evaluated, since there is only one such model and since its cross-validation consistency is always 10. On validation of the method, MDR was then applied to the sporadic-breast cancer data set, with the same threshold cases:controls ratio, at least 1:1. An exhaustive search of all possible two- to nine-locus models was again performed.
  • Table 2 summarizes the means and the standard errors of the means (SEMs), of both the cross-validation consistency and the prediction error, obtained from the MDR analysis of each group of 50 simulated data sets for each gene-gene interaction model and each number of loci evaluated.
  • the mean prediction error for the four-locus model was much closer to that of the three-locus model, because these models contained the correct three functional loci as well as a false-positive locus, whereas the two-locus models were missing one of the functional loci. Selecting the smaller three-locus model with the lower mean prediction error is consistent with statistical parsimony (i.e., smaller models are better because they are easier to interpret). For the three-locus models in this example, the cross-validation consistency was always 10.00; that is, the same three-locus model was found in each possible 9/10 of the subjects. These results suggest that, for this particular epistasis model, the cross-validation strategy is a reasonable approach to the identification of the correct multilocus model. Furthermore, the threshold cases:controls ratio of at least 1:1 was reasonable for this epistasis model.
  • the Monte Carlo P values for each of the correctly identified models were all <.001.
  • the estimated power to identify the correct multilocus model was 78% for the two-locus model, 82% for the three-locus model, 94% for the four-locus model, and 90% for the five-locus model. It is interesting that the power to identify the correct multilocus model tends to increase as higher-order interactions are modeled. This may be a real phenomenon, or it may be due to the fact that fewer non-functional loci of the 10 that were simulated were present. These results suggest that, for this particular epistasis model, the MDR method has reasonable power to identify high-order gene-gene interactions in a sample of 200 cases and 200 controls.
  • Table 3 summarizes the cross-validation consistency and the prediction error obtained from MDR analysis of the sporadic-breast cancer case-control data set, for each number of loci evaluated.
  • One four-locus model had a minimum prediction error of 46.73 and a maximum cross-validation consistency of 9.8 that was significant at the .001 level, as determined empirically by permutation testing. Thus, under the null hypothesis of no association, it is highly unlikely that a cross-validation consistency of 9.8 will be observed for this four-locus model.
  • the four-locus model included the polymorphisms COMT, CYP1A1m1, CYP1B1 codon 48, and CYP1B1 codon 432.
  • Figure 3 illustrates a summary of the four-locus-genotype combinations associated with high risk and with low risk, along with the corresponding distribution of cases (left bars in boxes) and of controls (right bars in boxes), for each multilocus-genotype combination. Note that the patterns of high-risk and low-risk cells differ across each of the different multilocus dimensions. This is evidence of epistasis, or gene-gene interaction; that is, the influence that each genotype at a particular locus has on disease risk is dependent on the genotypes at each of the other three loci. Previous analysis of this data set, by logistic regression, revealed no statistically significant evidence of independent main effects of any of the 10 polymorphisms.
  • MDR is a method for reducing the dimensionality of multilocus information, to improve identification of combinations of polymorphisms associated with the risk for common complex multifactorial diseases.
  • the development of MDR was motivated by the limitations of the generalized linear model for detection and characterization of gene-gene and gene-environment interactions and by the success of data-reduction methods for quantitative traits. Using simulated data, MDR was demonstrated to be useful for identification of genes whose effects are primarily through interaction. MDR was then applied to identify gene-gene interaction effects on risk for sporadic breast cancer.
  • Breast cancer is generally considered a multifactorial disease with estrogens as one of the principal factors. MDR was therefore applied to a set of genes (i.e., COMT, CYP1A1, CYP1B1, GSTM1, and GSTT1) whose protein products interact as enzymes in the metabolism of estrogens in breast tissue.
  • Several studies have examined the breast cancer risk associated with individual genotypes of each of these enzymes. Not surprisingly, the results have been inconsistent and even contradictory. That is, if a single gene in the estrogen-metabolism pathway were solely responsible for breast cancer, then the malignancy would likely present as familial breast cancer, and the gene would be identified by linkage analysis, as in the case of BRCA1 and BRCA2.
  • CYP1A1, GSTM1, and GSTT1 polymorphisms were examined in a case-control study of 328 white and 108 African American women, using multiple logistic-regression analysis. None of the enzyme genotypes, individually or combined, were associated with an increased risk for breast cancer. COMT and CYP1B1 were not included in the analysis, because their roles in the catechol-estrogen pathway and/or their various polymorphisms were only recently elucidated. Because of their clearly defined functional interactions in the catechol-estrogen pathway, it is important to consider the combined effect of all these enzymes.
  • MDR applied to 10 single-nucleotide polymorphisms in COMT, CYP1A1, CYP1B1, GSTM1, and GSTT1 identifies a four-locus interaction that is significantly associated with risk for sporadic breast cancer. This is the first report of a four-locus interaction associated with a common complex multifactorial disease.
  • Breast cancer risk is influenced by several nongenetic hormonal factors, such as age at menarche, and by age at menopause, body-mass index, reproductive history, lactation history, and use of exogenous estrogen in the form of either oral contraceptives or hormone-replacement therapy. Although these factors allow prediction of a relative risk for a given population, they are not very helpful to individual women.
  • the determination of a woman's genotype may add another dimension to the assessment of overall breast cancer risk.
  • For example, obesity has been related both to the concentration of endogenous estrogen and to breast cancer risk.
  • Several studies have demonstrated that obese postmenopausal women have an increased risk for breast cancer, compared to age-matched nonobese postmenopausal women.
  • the elevated risk has been attributed to higher levels of circulating estrogens secondary to increased conversion, in adipose tissue, of androgen to estrogen.
  • serum-estradiol concentrations in obese postmenopausal women than in their nonobese counterparts.
  • MDR facilitates the simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical end-point. This is accomplished by reducing the dimensionality of the multilocus data. In essence, genotypes from multiple loci and/or discrete environmental classes are pooled into high-risk and low-risk groups, depending on whether they are more common in affected or in unaffected subjects. This new multilocus-genotype encoding reduces the dimensionality to one. For the simulated data, the mean cross-validation consistency was always maximized, and the mean prediction error was always minimized, at the correct multilocus model. Another important advantage of MDR is that it is nonparametric.
  • logistic-regression models should contain no more than $P + 1 \le \min(n_1, n_0)/10$ parameters, where $n_1$ is the number of events of type 1 and $n_0$ is the number of events of type 0.
  • this formula suggests that no more than 19 parameters should be estimated in a logistic-regression model.
  • the number of orthogonal-regression terms needed to describe the interactions among a subset, k, of n biallelic loci is $\binom{n}{k} \times 2^k$ (Wade 2000).
  • MDR model the main effects
  • 180 parameters to model the two-way interactions
  • 960 parameters to model the three-way interactions
  • 3,360 parameters to model the four-way interactions
  • the MDR method avoids the problems associated with the use of parametric statistics to model high-order interactions.
  • a third advantage of MDR is that it assumes no particular genetic model (i.e., it is model free). That is, no mode of inheritance needs to be specified. This is important for diseases, such as sporadic breast cancer, in which the mode of inheritance is unknown and likely very complex. In its current form, MDR can be directly applied to case-control and discordant-sib-pair studies. Extension to other family-based control study designs, such as those using trios, should also be possible.
  • a fourth advantage of MDR is that false-positive results due to multiple testing are minimized. This is primarily due to the cross-validation strategy used to select optimal models. Data-reduction and pattern-recognition methods are good for identification of complex relationships among data, even when those relationships are due to either chance or false-positive variations.
  • the real test of any method is its ability to make predictions in independent data.
  • Cross-validation divides the data into 10 equal parts, allowing 9/10 of the data to be used to develop a model and the independent 1/10 of the data to be used to evaluate the predictive ability of the model.
  • Optimal models are selected solely on the basis of their ability to make predictions with regard to independent data. Only when a final predictive model has been selected is the null hypothesis of no association tested via permutation testing. It is this combined cross-validation-testing/permutation-testing method that minimizes false-positives due to multiple examinations of the data.
  • MDR overcomes many limitations of the generalized linear model. However,
  • MDR can be computationally intensive, especially when more than 10 polymorphisms need to be evaluated.
  • a genome scan with hundreds to thousands of polymorphisms requires robust machine learning algorithms, since all of the possible multilocus combinations cannot be exhaustively searched. This is, however, a consequence of any multilocus method that does not first condition on a particular locus having an independent main effect (e.g., stepwise logistic regression).
  • MDR models may be difficult to interpret. This is illustrated clearly in the four-locus model in Figure 3. There are no obvious trends or patterns in the distribution of high-risk and low-risk groupings across the four-dimensional genotype space; for example, a consistent trend of high-risk or low-risk cells across a series of rows or of columns may indicate that a particular locus has a main effect.
  • MDR is best applied to case-control studies that are balanced (i.e., that have the same number of cases and of controls). Also, MDR is somewhat limited in its ability to make predictions for independent data sets when the dimensionality of the best model is relatively high and the sample is relatively small. High dimensionality and a small sample lead to many multifactor cells with either missing data or singleton data. This is not a problem for estimation of the classification error and evaluation of the cross-validation consistency, but it is a problem for estimation of the prediction error.
  • MDR is a powerful alternative to traditional parametric statistics, such as logistic regression.
  • MDR has the ability to identify high-order (i.e., more than two) gene- gene interactions in relatively small simulated and real data sets.
  • Although MDR addresses some of the limitations of the generalized linear model, there are several ways in which the method can be improved.
  • the first strategy uses a nearest neighbor method to determine whether an empty cell should be classified as high risk or as low risk; for example, if the majority of multilocus-genotype combinations within one step in n-dimensional space are classified as high risk, then the empty cell is also classified as high risk.
  • the second strategy projects either a high risk or a low risk classification for an empty cell in a lower dimension; for example, the locus with the least frequent genotype might be removed from the model, and risk could then be determined from the equivalent genotypes in a lower dimension. These strategies may be compared to determine whether either improves the estimation of the prediction error when empty cells are present.
  • MDR is introduced as a method for reducing the dimensionality of multilocus information, to improve the identification of polymorphism combinations associated with disease risk.
  • the MDR method is nonparametric (i.e., no hypothesis about the value of a statistical parameter is made), is model-free (i.e., it assumes no particular inheritance model), and is directly applicable to case-control and discordant-sib-pair studies. Using simulated case-control data, the description above demonstrates that MDR has reasonable power to identify interactions among two or more loci in relatively small samples.
  • MDR identified a statistically significant high-order interaction among four polymorphisms from three different estrogen-metabolism genes. This is the first known report of a four-locus interaction associated with a common complex multifactorial disease.
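
The 10-fold cross-validation procedure described in the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the application: the function names (`fit_mdr`, `cv_prediction_error`, `repeated_cv_error`) are hypothetical, each subject is represented as a `(cell, status)` pair in which `cell` is a tuple of factor levels and `status` is 1 for a case and 0 for a control, and a cell that was never seen in the training portion is assumed to be predicted low risk.

```python
import random
from collections import Counter


def fit_mdr(train, threshold=1.0):
    """Return the set of high-risk cells learned from (cell, status) pairs."""
    cases, controls = Counter(), Counter()
    for cell, status in train:
        (cases if status == 1 else controls)[cell] += 1
    # A cell is high risk when cases:controls >= threshold. The comparison is
    # written without division so cells with zero controls need no special case.
    return {c for c in set(cases) | set(controls)
            if cases[c] >= threshold * controls[c]}


def cv_prediction_error(data, n_folds=10, threshold=1.0, seed=0):
    """Estimate the prediction error by n-fold cross-validation."""
    if len(data) < n_folds:
        raise ValueError("need at least one subject per fold")
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    fold_errors = []
    for i, held_out in enumerate(folds):
        # Develop the model on the other folds (about 9/10 of the subjects).
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        high_risk = fit_mdr(train, threshold)
        # Assumption: cells unseen in training are predicted low risk.
        wrong = sum((cell in high_risk) != (status == 1)
                    for cell, status in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / n_folds


def repeated_cv_error(data, repeats=10, n_folds=10, threshold=1.0):
    """Repeat the cross-validation with different splits and average."""
    errors = [cv_prediction_error(data, n_folds, threshold, seed=r)
              for r in range(repeats)]
    return sum(errors) / repeats
```

For a candidate set of loci, `data` would be built by restricting each subject's genotype to those loci; the candidate with the lowest averaged error would then be compared against the other candidates of the same size.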

Abstract

Multidimensional data is reduced by measuring n observations out of a first data set including discrete independent variables, such as genetic or environmental factors. A number of observations in which a dependent variable has a first value is determined, such as a disease-indicative value, and a number of observations in which the dependent variable has a second value, such as a disease-free value, is determined. Combinations are formed of the independent variables to produce a second data set. In each combination of independent variables, a ratio of the number of observations with the dependent variable having the first value to the number of observations with the dependent variable having the second value is determined. Each ratio is compared to one or more thresholds or ranges of thresholds to determine which combination of independent variables optimally discriminates observations with the dependent variable having the first value and observations with the dependent variable having the second value. Those combinations having ratios at approximately the same threshold or within the same range of thresholds are pooled together to produce a third data set having a smaller number of independent variable combinations than the second data set.

Description

METHOD AND APPARATUS FOR MULTIFACTOR DIMENSIONALITY REDUCTION
CROSS-REFERENCE TO RELATED APPLICATION
This application is related to and claims priority from commonly assigned U.S. Provisional Application No. 60/349,317, filed January 15, 2002 in the names of Hahn et al.
STATEMENT OF GOVERNMENT SPONSORED RESEARCH
This invention was made with government support under Grant No. 1 R01 CA/ES83572 awarded by the National Institutes of Health. The United States government has certain rights in the invention.
BACKGROUND
The present invention is directed to a method and an apparatus for analyzing data. More particularly, the present invention is directed to a method and an apparatus for reducing multidimensional data.
The identification and characterization of susceptibility genes for common complex human diseases is one of the greatest challenges facing human geneticists. This challenge is partly due to the limitations of parametric-statistical methods (i.e., those in which a hypothesis about the value of a statistical parameter is made) for detection of gene effects that are dependent solely or partially on interactions with other genes and with environmental exposures.
For example, logistic regression is a commonly used method for modeling the relationship between discrete predictors, such as genotypes, and discrete clinical outcomes. However, logistic regression, like most parametric-statistical methods, is less practical for dealing with high-dimensional data. That is, when high-order interactions are modeled, there are many contingency-table cells that contain no observations (i.e., that are empty cells). This can lead to very large coefficient estimates and standard errors. One solution to this problem is to collect very large numbers of samples to allow robust estimation of interaction effects; however, the magnitudes of the samples that are often required incur prohibitive expense. An alternative solution is to develop new statistical and computational methods that have improved power to identify multilocus effects in relatively small samples.
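
The empty-cell problem described above can be illustrated with a short simulation. The following Python sketch is an illustration only and is not part of the disclosed method. It assumes three genotypes per locus, genotypes drawn uniformly at random, and a sample of 400 subjects; the function name `count_empty_cells` is hypothetical.

```python
import random


def count_empty_cells(n_subjects, n_loci, seed=0):
    """Return (total_cells, empty_cells) for an n_loci-way genotype table.

    Assumption: each locus has three genotypes (coded 0, 1, 2) and every
    subject's genotype is drawn uniformly at random.
    """
    rng = random.Random(seed)
    total_cells = 3 ** n_loci
    observed = {
        tuple(rng.randrange(3) for _ in range(n_loci))
        for _ in range(n_subjects)
    }
    return total_cells, total_cells - len(observed)


if __name__ == "__main__":
    for k in range(1, 9):
        total, empty = count_empty_cells(400, k)
        print(f"{k}-locus table: {total} cells, {empty} empty")
```

Because the number of cells grows as 3 to the power of the number of loci while the sample size stays fixed, a table with more cells than subjects must contain empty cells no matter how the genotypes are distributed.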
SUMMARY
It is an object of the present invention to provide a method and apparatus for reducing multidimensional data in a simple, efficient manner. According to an exemplary embodiment, n observations are measured out of a first data set including discrete independent variables. A number of observations in which a dependent variable has a first value is determined, and a number of observations in which the dependent variable has a second value is determined. Combinations are formed of the independent variables to produce a second data set. In each combination of independent variables, a ratio of the number of observations with the dependent variable having the first value to the number of observations with the dependent variable having the second value is determined. Each ratio is compared to one or more thresholds or ranges of thresholds to determine which combination of independent variables optimally discriminates observations with the dependent variable having the first value and observations with the dependent variable having the second value. Those combinations that have ratios at approximately the same threshold or within the same range of thresholds are pooled together to produce a third data set. The third data set includes a smaller number of independent variable combinations than the second data set.
According to exemplary embodiments, the independent variables may be predictor or explanatory variables, including genetic and environmental factors. The dependent variable may be a response variable indicative of disease. The number of independent variable combinations may be reduced to two, one of which has a high ratio of disease indicative values to disease-free indicative values of the dependent variable and the other of which has a low ratio of disease indicative values to disease-free indicative values of the dependent variable.
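
The pooling of independent-variable combinations into two groups, as summarized above, can be sketched in Python. This is a minimal sketch under stated assumptions rather than the disclosed implementation: the function names are hypothetical, the dependent variable is coded 1 (disease-indicative) or 0 (disease-free), a single threshold of 1.0 is used, and only combinations that actually occur in the data are labeled.

```python
from collections import Counter


def label_cells(genotypes, status, threshold=1.0):
    """Label each observed combination 'high' or 'low' by its case:control ratio.

    genotypes: list of tuples, one tuple of factor levels per observation.
    status:    list of 0/1 values (1 = disease-indicative, 0 = disease-free).
    """
    cases, controls = Counter(), Counter()
    for cell, s in zip(genotypes, status):
        (cases if s == 1 else controls)[cell] += 1
    labels = {}
    for cell in set(cases) | set(controls):
        # ratio >= threshold, written without division so that combinations
        # with zero controls do not raise ZeroDivisionError.
        is_high = cases[cell] >= threshold * controls[cell]
        labels[cell] = "high" if is_high else "low"
    return labels


def reduce_to_one_dimension(genotypes, labels):
    """Replace each multifactor combination by its pooled risk group."""
    return [labels[cell] for cell in genotypes]


def classification_error(reduced, status):
    """Proportion of observations whose risk group disagrees with status."""
    wrong = sum((group == "high") != (s == 1)
                for group, s in zip(reduced, status))
    return wrong / len(status)


if __name__ == "__main__":
    # Hypothetical two-factor data: eight observations, two levels per factor.
    genotypes = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1)]
    status = [1, 1, 0, 1, 0, 0, 1, 0]
    labels = label_cells(genotypes, status)
    reduced = reduce_to_one_dimension(genotypes, labels)
    print(labels)
    print(classification_error(reduced, status))
```

With more than one threshold, the same counting step could assign combinations to several ratio ranges instead of two groups; the two-group case is shown because it is the one the summary singles out.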
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates an exemplary summary of steps for implementing a method for multifactor reduction according to one embodiment;
Figure 2 illustrates an exemplary apparatus for performing multifactor reduction according to an exemplary embodiment; and
Figure 3 illustrates an exemplary summary of genotype combinations produced by the steps shown in Figure 1 according to an exemplary embodiment.
DETAILED DESCRIPTION
According to exemplary embodiments, a multifactor-dimensionality reduction (MDR) method has been developed for detecting and characterizing high-order gene-gene and gene-environment interactions in case-control and discordant-sib-pair studies with relatively small samples. The MDR method may be considered to be inspired in part by the combinatorial-partitioning method, a data-reduction method for the exploratory analysis of quantitative traits. With MDR, multilocus genotypes are pooled into high-risk and low-risk groups, effectively reducing the genotype predictors from n dimensions to one dimension. The new, one-dimensional multilocus-genotype variable is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing. The MDR method is model free, in that it does not assume any particular genetic model, and is nonparametric, in that it does not estimate any parameters.
For the purpose of illustration, in the following, the MDR method is first evaluated by using simulated multilocus data with epistatic effects, and it is then applied to identification of multiple single-nucleotide polymorphisms associated with sporadic breast cancer. It will be appreciated, however, that the invention is not limited to this application.
Breast cancer is generally considered a complex disease, since its most common form - sporadic breast cancer - is due to multiple unknown etiologies. This is in contrast to the less common form - familial breast cancer - which is attributed to single-gene abnormalities (e.g., BRCA1 [MIM 113705] and BRCA2 [MIM 600185]). Although the causes of sporadic breast cancer remain undetermined, there is substantial experimental, epidemiological, and clinical evidence that estrogens influence breast cancer risk. In fact, recent evidence indicates that the oxidative metabolism of estrogens to catechol estrogens and to estrogen quinones can cause mutagenic DNA lesions. Consequently, catechol estrogens and estrogen quinones have been implicated in mammary carcinogenesis.
The catechol-estrogen pathway is regulated by catechol-O-methyltransferase (COMT), by cytochromes P450 1A1 and P450 1B1 (CYP1A1 and CYP1B1, respectively), and by glutathione S-transferases M1 and T1 (GSTM1 and GSTT1, respectively). Each of the genes encoding these enzymes contains functional polymorphisms that result in different concentrations of catechol-estrogen metabolites. Interactions between polymorphisms of these genes may have a synergistic, or nonadditive, effect on the pathogenesis of breast cancer. This may explain differences in breast cancer risk.
In one experiment, application of MDR to a sporadic breast cancer case-control data set, in the absence of any statistically significant independent main effects, identified a statistically significant high-order interaction among four polymorphisms from three different estrogen-metabolism genes - COMT (MIM 116790), CYP1B1 (MIM 601771), and CYP1A1 (MIM 108330).
Subjects and Methods (MDR)
Figure 1 illustrates a summary of steps involved in implementing the MDR method for case-control studies according to an exemplary embodiment. The same procedure is equally applicable to discordant-sib-pair studies. In step 1, a set of n genetic and/or discrete environmental factors is selected from a pool of all factors. In step 2, the n factors and their possible multifactor classes or cells are represented in n-dimensional space. For example, for two loci with three genotypes each, there are nine two-locus-genotype combinations. Then, the ratio of the number of cases (or affected sibs) to the number of controls (or unaffected sibs) is estimated within each multifactor class.
In step 3, each multifactor cell in n-dimensional space is labeled either as "high-risk," if the cases:controls ratio meets or exceeds some threshold (e.g., ≥1.0), or as "low-risk," if that threshold is not exceeded. For each multifactor combination, hypothetical distributions of cases (left bars in boxes) and of controls (right bars in boxes) are shown. In this way, a model for both cases and controls (or for affected and unaffected sibs) is formed by pooling high-risk cells into one group and low-risk cells into another group. This reduces the n-dimensional model to a one-dimensional model (i.e., having one variable with two multifactor classes: high risk and low risk).
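Steps 2 and 3 can be sketched in a few lines of Python. The sketch is illustrative only: the function names `label_cells` and `reduce_to_one_dimension`, the tuple encoding of genotypes, and the 1/0 coding of cases and controls are assumptions made for this example and are not part of the described embodiment.

```python
from collections import Counter

def label_cells(genotypes, status, threshold=1.0):
    """Label each observed multifactor cell as "high" or "low" risk.

    genotypes: one tuple per subject, holding that subject's genotype at
               each selected factor (hypothetical encoding).
    status:    1 for a case, 0 for a control, in the same order.
    """
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    labels = {}
    for cell in set(cases) | set(controls):
        # cases/controls >= threshold, cross-multiplied so that a cell
        # with zero controls does not cause a division by zero.
        high = cases[cell] >= threshold * controls[cell]
        labels[cell] = "high" if high else "low"
    return labels

def reduce_to_one_dimension(genotypes, labels):
    """Replace each subject's n-factor genotype with its risk label."""
    return [labels[g] for g in genotypes]

# Toy data: two loci, genotypes coded 0/1/2 (made-up values).
genotypes = [(0, 0), (0, 0), (0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1)]
status = [1, 1, 0, 0, 0, 1, 0, 1]
labels = label_cells(genotypes, status)
reduced = reduce_to_one_dimension(genotypes, labels)
```

A cell in which cases and controls are equally frequent is labeled high risk here, matching the "meets or exceeds" wording of step 3.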
In this initial implementation of MDR, balanced case-control studies are needed. In step 4, the prediction error of each model is estimated by 10-fold cross-validation. Here, the data (i.e., subjects) are randomly divided into 10 equal parts. The MDR model is developed for each possible 9/10 of the subjects and then is used to make predictions about the disease status of each possible 1/10 of the subjects excluded. The proportion of subjects for which an incorrect prediction was made is an estimation of the prediction error. To reduce the possibility of poor estimates of the prediction error that are due to chance divisions of the data set, the 10-fold cross-validation is repeated 10 times, and the prediction errors are averaged.
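The cross-validation of step 4 might be sketched as follows. This is a minimal illustration under stated assumptions: the names `fit_mdr`, `cv_prediction_error`, and `repeated_cv_error` are hypothetical, the error is pooled over the ten held-out parts, and held-out subjects whose cell is absent from the training data are skipped because no prediction can be made for them.

```python
import random
from collections import Counter

def fit_mdr(genotypes, status, threshold=1.0):
    """Map each observed cell to 1 (high risk) or 0 (low risk)."""
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    return {c: int(cases[c] >= threshold * controls[c])
            for c in set(cases) | set(controls)}

def cv_prediction_error(genotypes, status, folds=10, seed=0):
    """Proportion of held-out subjects whose status is predicted incorrectly."""
    rng = random.Random(seed)
    idx = list(range(len(status)))
    rng.shuffle(idx)                                  # random division of subjects
    parts = [idx[i::folds] for i in range(folds)]     # near-equal parts
    wrong = total = 0
    for held_out in parts:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_mdr([genotypes[i] for i in train],
                        [status[i] for i in train])
        for i in held_out:
            pred = model.get(genotypes[i])
            if pred is None:
                continue          # cell empty in training data: no prediction
            total += 1
            wrong += int(pred != status[i])
    return wrong / total if total else float("nan")

def repeated_cv_error(genotypes, status, repeats=10, folds=10, seed=0):
    """Average the cross-validated error over several random divisions."""
    errors = [cv_prediction_error(genotypes, status, folds, seed + r)
              for r in range(repeats)]
    return sum(errors) / len(errors)
```

With perfectly separable toy data (every case in one cell, every control in another) the estimated error is zero; with a single uninformative cell and balanced classes it cannot fall below one half.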
For studies with more than two factors, the four steps of the MDR method may be repeated for each possible combination, if computationally feasible. If the number of combinations to be evaluated exceeds computational feasibility, machine learning methods, such as parallel genetic algorithms, may be employed. Among all of the two-factor combinations, a single model that maximizes the cases:controls ratio of the high-risk group is selected. This two-locus model will have the minimum classification error among the two-locus models. Single best multifactor models are also selected from among the models for each of the three- to n-factor combinations. Among this set of best multifactor models, the combination of loci and/or discrete environmental factors that minimizes the prediction error is selected. Thus, the classification errors and the prediction errors estimated by 10-fold cross-validation are used to select the final multifactor model. Hypothesis testing for this final model can then be performed by evaluating the consistency of the model across cross-validation data sets - that is, how many times the same MDR model is identified in each possible 9/10 of the subjects. The reasoning is that a true signal (i.e., association) should be present in the data regardless of how they are divided. Statistical significance was determined by comparing the average cross-validation consistency from the observed data to the distribution of average consistencies under the null hypothesis of no associations derived empirically from 1,000 permutations. The null hypothesis was rejected when the upper-tail Monte Carlo P value derived from the permutation test was < .05.
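The permutation test described above can be sketched generically. In the described method the test statistic is the average cross-validation consistency; to keep the example short, a simple stand-in statistic is used instead. The names `permutation_p_value` and `cases_with_first_genotype`, the toy data, and the add-one convention in the returned estimate are all assumptions of this sketch.

```python
import random

def permutation_p_value(statistic, genotypes, status, n_perm=1000, seed=0):
    """Upper-tail Monte Carlo P value for `statistic` under label permutation."""
    rng = random.Random(seed)
    observed = statistic(genotypes, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)     # break any genotype-status association
        if statistic(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one convention (an assumption) keeps the estimate above zero.
    return (hits + 1) / (n_perm + 1)

def cases_with_first_genotype(genotypes, status):
    """Stand-in statistic: number of cases carrying genotype (0,)."""
    return sum(s for g, s in zip(genotypes, status) if g == (0,))

# Toy data in which disease status is fully determined by genotype.
genotypes = [(0,)] * 20 + [(1,)] * 20
status = [1] * 20 + [0] * 20
p_value = permutation_p_value(cases_with_first_genotype, genotypes, status)
```

Any function of the genotypes and the (possibly permuted) status labels can be passed as `statistic`, including a full model search that returns a cross-validation consistency.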
Figure 2 illustrates an exemplary apparatus for performing multifactor reduction according to an exemplary embodiment. According to the embodiment depicted in Figure 2, the method for multifactor reduction may be performed on a personal computer 200. The personal computer 200 includes, for example, a memory and a microprocessor. The microprocessor may be an Intel 600 MHz Pentium III processor or a Solaris (Sun) processor. Devices such as a keyboard, a mouse, and a display terminal may be connected to the microprocessor and memory to receive input data and export output data.
Data Simulation
To evaluate the MDR method, four sets of 50 replicates of 200 cases and 200 controls were simulated, using four different multilocus epistasis models. This number of replicates was selected to be large enough to provide validation of the method and to be small enough to allow exhaustive computational searches of all possible multilocus models. Unrelated subjects and genotypes for 10 unlinked biallelic loci were simulated by the Genometric Analysis Simulation Package. Allele frequencies for each of the 10 loci were selected to match those in the sporadic-breast cancer case-control sample. Hardy-Weinberg equilibrium and linkage equilibrium were assumed.
For the first model, a two-locus interaction effect was simulated, using penetrance functions P(D | AAbb) = .2, P(D | AaBb) = .2, P(D | aaBB) = .2, and P(D | others) = 0, where D is disease and A, a, B, and b represent the alleles for the disease-susceptibility loci. This is a well-characterized model for epistasis, in which disease risk is dependent on whether two deleterious alleles and two normal alleles are present, from either one locus or both loci. The independent main effects for the loci in this model are small. This two-locus epistasis model was extended to three-locus, four-locus, and five-locus epistasis models by adding corresponding homozygous or heterozygous genotypes to the aforementioned penetrance functions. For example, for the three-locus epistasis model, penetrance functions P(D | AAbbcc) = .2, P(D | AaBbcc) = .2, P(D | aaBBcc) = .2, P(D | aaBbCc) = .2, P(D | AabbCc) = .2, and P(D | aabbCC) = .2 were used. Thus, of the 10 total simulated loci, there were 2, 3, 4, or 5 functional epistatic loci and up to 8 nonfunctional loci.
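The two-locus penetrance functions above can be written as a lookup table. The sketch below is illustrative only: the names `PENETRANCE`, `penetrance`, `hwe_frequencies`, and `marginal_penetrance_a` are hypothetical, and the allele frequency passed in is a free parameter of the example, whereas the simulations described here used frequencies matched to the breast cancer sample.

```python
# Penetrance table for the two-locus model; unlisted genotypes have penetrance 0.
PENETRANCE = {("AA", "bb"): 0.2, ("Aa", "Bb"): 0.2, ("aa", "BB"): 0.2}

def penetrance(geno_a, geno_b):
    """P(D | two-locus genotype) under the two-locus epistasis model."""
    return PENETRANCE.get((geno_a, geno_b), 0.0)

def hwe_frequencies(p, upper, lower):
    """Genotype frequencies under Hardy-Weinberg equilibrium.

    p is the frequency of the `upper` allele (an assumed input).
    """
    q = 1.0 - p
    return {upper * 2: p * p, upper + lower: 2 * p * q, lower * 2: q * q}

def marginal_penetrance_a(p_b):
    """P(D | genotype at locus A), averaging over locus B genotypes."""
    freqs_b = hwe_frequencies(p_b, "B", "b")
    return {ga: sum(f * penetrance(ga, gb) for gb, f in freqs_b.items())
            for ga in ("AA", "Aa", "aa")}

marginals = marginal_penetrance_a(0.5)  # 0.5 is an arbitrary example frequency
```

Averaging over the second locus in this way gives the single-locus penetrances from which a main effect at locus A would have to be detected.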
Sporadic-Breast Cancer Data
One study was based on 200 white women with sporadic primary invasive breast cancer who were treated at Vanderbilt University Medical Center during 1982-96. Informed consent for this study was obtained from all study subjects, in accordance with the requirements of the Institutional Review Board of Vanderbilt University Medical School. Breast cancer was classified as either sporadic or familial, on the basis of family history as determined by patient questionnaire: patients with either at least one first-degree relative with breast cancer or at least two second-degree relatives with breast cancer were considered to have familial breast cancer; patients not fulfilling these criteria were considered to have sporadic breast cancer. Patients with sporadic breast cancer were frequency age-matched to control patients at Vanderbilt University Medical Center who had been hospitalized for various acute and chronic illnesses. Reasons for exclusion of controls included breast cancer or other forms of malignancy, as well as family history of breast cancer.
DNA was isolated from all samples by use of a DNA extraction kit (Gentra). Because their enzyme products interact in the metabolism of estrogens to catechol estrogens and to estrogen quinones, the analysis focused on the genes COMT (MIM 116790), on chromosome 22q11.2; CYP1A1 (MIM 108330), on chromosome 15q22-qter; CYP1B1 (MIM 601771), on chromosome 2p21-22; GSTM1 (MIM 138350), on chromosome 1p13.3; and GSTT1 (MIM 600436), on chromosome 22q11.2. COMT and GSTT1 are ~4 Mb apart on chromosome 22q11.2.
Table 1 summarizes the polymorphisms, in these genes, that were analyzed by PCR and restriction-endonuclease digestion. Genotype frequencies have been previously reported. The specific primers and amplification conditions and the subsequent restriction-endonuclease analysis for CYP1A1, CYP1B1, GSTM1, and GSTT1 have been described elsewhere. COMT was amplified with primers C1 (5'-GCC GCC ATC ACC CAG CGG ATG GTG GAT TTC GCT GTC) and C2 (5'-GTT TTC AGT GAA CGT GGT GTG). Each PCR contained internal controls for the respective gene, and random retesting of ~5% of the samples yielded 100% reproducibility.
Table 1
Enzyme Genotype Analysis by PCR and Restriction-Endonuclease Digestion
(Genotype frequencies (a) for w/w, w/p, and p/p are given in %.)
ENZYME       | NUCLEOTIDE | CODON          | PRIMERS    | ENDONUCLEASE | w/w    | w/p | p/p
COMT         | 1947G→A    | 158Val→Met     | C1, C2     | BspHI        | 25     | 51  | 24
CYP1A1:
  m1         | 6235T→C    | 3' UTR         | A3, A4 (b) | MspI         | 82     | 15  | 3
  m2         | 4887C→A    | 461Thr→Asn     | A1, A4 (b) | BsaI         | 92     | 7   | 1
  m4         | 4889A→G    | 462Ile→Val     | A1, A2 (b) | BsrDI        | 92     | 8   | 0
CYP1B1:
  Codon 48   | 143C→G     | 48Arg→Gly      | B1, B2 (c) | RsrII        | 51     | 40  | 9
  Codon 119  | 355G→T     | 119Ala→Ser     | B1, B2 (c) | NgoMIV       | 51     | 40  | 9
  Codon 432  | 1294G→C    | 432Val→Leu     | B3, B4 (c) | Eco57I       | 12     | 58  | 30
  Codon 453  | 1358A→G    | 453Asn→Ser     | B3, B4 (c) | Cac8I        | 68     | 30  | 2
GSTM1        | Deletion   | Loss of enzyme | M1, M2 (b) | ...          | 57 (d) |     | 43
GSTT1        | Deletion   | Loss of enzyme | T1, T2 (b) | ...          | 79 (d) |     | 21
(a) w = Wild-type allele; p = polymorphic allele.
(b) Bailey et al. (1998b).
(c) Bailey et al. (1998a).
(d) Either w/w or w/p genotype.
Data Analysis
Prior to application of MDR to the sporadic-breast cancer data set, the method was evaluated by use of the simulated multilocus data sets. For each of the 50 replicates generated by each of the four multilocus epistasis models, the MDR algorithm was applied as described in the subsection "MDR," with a threshold cases:controls ratio of at least 1:1. This threshold was selected so that multilocus-genotype combinations would be considered high-risk if the number of cases with that particular combination either was equal to or exceeded the number of controls; more-stringent thresholds may improve the results. An exhaustive search of all possible two- to nine-locus models was performed. The 10-locus model was not evaluated, since there is only one such model and since its cross-validation consistency is always 10. On validation of the method, MDR was then applied to the sporadic-breast cancer data set, with the same threshold cases:controls ratio, at least 1:1. An exhaustive search of all possible two- to nine-locus models was again performed.
Results
Application of MDR to Simulated Data
Table 2 summarizes the means and the standard errors of the means (SEMs), of both the cross-validation consistency and the prediction error, obtained from the MDR analysis of each group of 50 simulated data sets for each gene-gene interaction model and each number of loci evaluated. For the particular multilocus models that contain the correct two, three, four, or five genes, for each group of 50 simulated data sets, the mean prediction error was minimum, and the mean cross-validation consistency was maximum. Additionally, the SEM of the prediction error and of the cross-validation consistency was minimum at the correct multilocus model. For example, in the case in which a three-locus epistasis model was used to simulate the data sets, the mean ± SEM prediction error was minimum for the three-locus model, at 12.00% ± 0.22%. The two-locus models had a mean ± SEM prediction error of 21.91% ± 0.33%, whereas the four-locus model had a mean ± SEM prediction error of 12.37% ± 0.24%.
The mean prediction error for the four-locus model was much closer to that of the three-locus model, because these models contained the correct three functional loci as well as a false-positive locus, whereas the two-locus models were missing one of the functional loci. Selecting the smaller three-locus model with the lower mean prediction error is consistent with statistical parsimony (i.e., smaller models are better because they are easier to interpret). For the three-locus models in this example, the cross-validation consistency was always 10.00; that is, the same three-locus model was found in each possible 9/10 of the subjects. These results suggest that, for this particular epistasis model, the cross-validation strategy is a reasonable approach to the identification of the correct multilocus model. Furthermore, the threshold cases:controls ratio of at least 1:1 was reasonable for this epistasis model.
Table 2
Summary of Simulation Results
(All entries are mean ± SEM.)
NO. OF LOCI (a) | CROSS-VALIDATION CONSISTENCY | PREDICTION ERROR (%)
Model 2:
  2             | 9.86 ± .08                   | 14.99 ± .24
  3             | 7.41 ± .21                   | 15.58 ± .26
  4             | 6.01 ± .22                   | 16.49 ± .29
  5             | 5.56 ± .24                   | 19.03 ± .38
  6             | 6.52 ± .34                   | 23.23 ± .53
  7             | 6.94 ± .26                   | 24.49 ± .62
  8             | 7.90 ± .29                   | 25.02 ± .73
  9             | 8.03 ± .23                   | 25.40 ± .73
Model 3:
  2             | 9.20 ± .17                   | 21.91 ± .33
  3             | 10.00 ± .00                  | 12.00 ± .22
  4             | 9.27 ± .13                   | 12.37 ± .24
  5             | 6.28 ± .21                   | 13.90 ± .28
  6             | 5.86 ± .25                   | 15.57 ± .32
  7             | 6.26 ± .29                   | 17.75 ± .43
  8             | 7.68 ± .28                   | 19.39 ± .47
  9             | 7.99 ± .15                   | 19.93 ± .50
Model 4:
  2             | 8.40 ± .26                   | 19.15 ± .35
  3             | 8.79 ± .20                   | 10.20 ± .23
  4             | 10.00 ± .00                  | 5.68 ± .17
  5             | 9.32 ± .12                   | 6.02 ± .19
  6             | 7.74 ± .16                   | 6.88 ± .22
  7             | 7.01 ± .22                   | 7.73 ± .26
  8             | 7.04 ± .24                   | 8.64 ± .31
  9             | 7.79 ± .24                   | 9.46 ± .34
Model 5:
  2             | 9.01 ± .20                   | 15.33 ± .28
  3             | 8.37 ± .25                   | 8.54 ± .24
  4             | 8.16 ± .25                   | 5.17 ± .20
  5             | 9.99 ± .01                   | 2.95 ± .11
  6             | 9.52 ± .12                   | 3.17 ± .14
  7             | 9.13 ± .16                   | 3.66 ± .17
  8             | 8.74 ± .17                   | 4.17 ± .19
  9             | 9.00 ± .14                   | 4.60 ± .18
NOTE. - For each simulation model, the multilocus model with maximum mean ± SEM cross-validation consistency and minimum mean ± SEM prediction error is indicated in boldface italic type.
(a) Model number is based on the number of epistatic genes in each simulation model.
The Monte Carlo P values for each of the correctly identified models were all <.001. The estimated power to identify the correct multilocus model was 78% for the two-locus model, 82% for the three-locus model, 94% for the four-locus model, and 90% for the five-locus model. It is interesting that the power to identify the correct multilocus model tends to increase as higher-order interactions are modeled. This may be a real phenomenon, or it may be due to the fact that fewer nonfunctional loci of the 10 that were simulated were present. These results suggest that, for this particular epistasis model, the MDR method has reasonable power to identify high-order gene-gene interactions in a sample of 200 cases and 200 controls.
Application of MDR to Breast Cancer Data
Table 3 summarizes the cross-validation consistency and the prediction error obtained from MDR analysis of the sporadic-breast cancer case-control data set, for each number of loci evaluated. One four-locus model had a minimum prediction error of 46.73 and a maximum cross-validation consistency of 9.8 that was significant at the .001 level, as determined empirically by permutation testing. Thus, under the null hypothesis of no association, it is highly unlikely that a cross-validation consistency of 9.8 will be observed for this four-locus model. The four-locus model included the polymorphisms of COMT, CYP1A1 m1, CYP1B1 codon 48, and CYP1B1 codon 432.
Figure 3 illustrates a summary of the four-locus-genotype combinations associated with high risk and with low risk, along with the corresponding distribution of cases (left bars in boxes) and of controls (right bars in boxes), for each multilocus-genotype combination. Note that the patterns of high-risk and low-risk cells differ across each of the different multilocus dimensions. This is evidence of epistasis, or gene-gene interaction; that is, the influence that each genotype at a particular locus has on disease risk is dependent on the genotypes at each of the other three loci. Previous analysis of this data set, by logistic regression, revealed no statistically significant evidence of independent main effects of any of the 10 polymorphisms.
Table 3
Summary of Results for Breast Cancer
NO. OF LOCI | CROSS-VALIDATION CONSISTENCY | PREDICTION ERROR
2           | 7.00                         | 51.06
3           | 4.17                         | 51.35
4           | 9.80 (a)                     | 46.73
5           | 4.71                         | 50.26
6           | 5.00                         | 48.61
7           | 8.60                         | 47.45
8           | 8.20                         | 52.55
9           | 7.10                         | 53.40
NOTE. - The multilocus model with maximum cross-validation consistency and minimum prediction error is indicated in boldface italic type.
(a) P < .001.
According to exemplary embodiments, MDR is a method for reducing the dimensionality of multilocus information, to improve identification of combinations of polymorphisms associated with the risk for common complex multifactorial diseases. The development of MDR was motivated by the limitations of the generalized linear model for detection and characterization of gene-gene and gene-environment interactions and by the success of data-reduction methods for quantitative traits. Using simulated data, MDR was demonstrated to be useful for identification of genes whose effects are primarily through interaction. MDR was then applied to identify gene-gene interaction effects on risk for sporadic breast cancer.
Breast cancer is generally considered a multifactorial disease with estrogens as one of the principal factors. MDR was therefore applied to a set of genes (i.e., COMT, CYP1A1, CYP1B1, GSTM1, and GSTT1) whose protein products interact as enzymes in the metabolism of estrogens in breast tissue. Several studies have examined the breast cancer risk associated with individual genotypes of each of these enzymes. Not surprisingly, the results have been inconsistent and even contradictory. That is, if a single gene in the estrogen-metabolism pathway were solely responsible for breast cancer, then the malignancy would likely present as familial breast cancer, and the gene would be identified by linkage analysis, as in the case of BRCA1 and BRCA2. Studies of two or three genotypes in combination have also yielded inconsistent results. For example, CYP1A1, GSTM1, and GSTT1 polymorphisms were examined in a case-control study of 328 white and 108 African American women, using multiple logistic-regression analysis. None of the enzyme genotypes, individually or combined, were associated with an increased risk for breast cancer. COMT and CYP1B1 were not included in the analysis, because their roles in the catechol-estrogen pathway and/or their various polymorphisms were only recently elucidated. Because of their clearly defined functional interactions in the catechol-estrogen pathway, it is important to consider the combined effect of all these enzymes. MDR applied to 10 single-nucleotide polymorphisms in COMT, CYP1A1, CYP1B1, GSTM1, and GSTT1 identifies a four-locus interaction that is significantly associated with risk for sporadic breast cancer. This is the first report of a four-locus interaction associated with a common complex multifactorial disease.
Breast cancer risk is influenced by several nongenetic hormonal factors, such as age at menarche, age at menopause, body-mass index, reproductive history, lactation history, and use of exogenous estrogen in the form of either oral contraceptives or hormone-replacement therapy. Although these factors allow prediction of a relative risk for a given population, they are not very helpful to individual women.
As defined by the MDR, the determination of a woman's genotype may add another dimension to the assessment of overall breast cancer risk. However, it is obvious that there is also an interaction between genotype risk factors and traditional hormonal risk factors. For example, obesity has been related both to the concentration of endogenous estrogen and to breast cancer risk. Several studies have demonstrated that obese postmenopausal women have an increased risk for breast cancer, compared to age-matched nonobese postmenopausal women. The elevated risk has been attributed to higher levels of circulating estrogens secondary to increased conversion, in adipose tissue, of androgen to estrogen. Several studies have demonstrated significantly higher serum-estradiol concentrations in obese postmenopausal women than in their nonobese counterparts. Thus, any effect that COMT, CYP1A1, CYP1B1, GSTM1, and GSTT1 may have on estrogen metabolism may be affected by the concentration of estradiol. Consequently, the analysis of genetic factors is limited by lack of consideration of these traditional hormonal risk factors.
Advantages of MDR
An important advantage of MDR is that it facilitates the simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical end-point. This is accomplished by reducing the dimensionality of the multilocus data. In essence, genotypes from multiple loci and/or discrete environmental classes are pooled into high-risk and low-risk groups, depending on whether they are more common in affected or in unaffected subjects. This new multilocus-genotype encoding reduces the dimensionality to one. For the simulated data, the mean cross-validation consistency was always maximized, and the mean prediction error was always minimized, at the correct multilocus model.
Another important advantage of MDR is that it is nonparametric. This is an important difference from traditional parametric-statistical methods, which rely on the generalized linear model. For example, in logistic regression, as each additional main effect is included in the model, the number of possible interaction terms grows exponentially. Having too many independent variables in relation to the number of observed outcome events is a well-recognized problem. Studies suggest that having fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in type 1 and type 2 errors. For example, with two outcome events per independent variable, more than one-third of the estimated regression coefficients differed from the true parameter value by a magnitude of 2. It has been suggested that logistic-regression models should contain no more than P + 1 < min(n1, n0)/10 parameters, where n1 is the number of events of type 1 and n0 is the number of events of type 0. For the 200 cases and the 200 controls evaluated in the present study, this formula suggests that no more than 19 parameters should be estimated in a logistic-regression model.
In a logistic-regression model, how many parameters must be estimated to identify interactions among the 10 estrogen-metabolism-gene polymorphisms? The number of orthogonal-regression terms needed to describe the interactions among a subset, k, of n biallelic loci is (n choose k) × 2^k (Wade 2000). Thus, for 10 genes, we would need 20 parameters to model the main effects (assuming two dummy variables per biallelic locus), 180 parameters to model the two-way interactions, 960 parameters to model the three-way interactions, 3,360 parameters to model the four-way interactions, and so forth. Thus, fitting a full model with all interaction terms and then using backward elimination to derive a parsimonious model would not be possible. The MDR method avoids the problems associated with the use of parametric statistics to model high-order interactions.
A third advantage of MDR is that it assumes no particular genetic model (i.e., it is model free). That is, no mode of inheritance needs to be specified. This is important for diseases, such as sporadic breast cancer, in which the mode of inheritance is unknown and likely very complex. In its current form, MDR can be directly applied to case-control and discordant-sib-pair studies. Extension to other family-based control study designs, such as those using trios, should also be possible.
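The logistic-regression parameter counts discussed above can be checked by evaluating the (n choose k) × 2^k expression and the P + 1 < min(n1, n0)/10 rule of thumb directly. The following is a minimal sketch; the function names `interaction_terms` and `max_parameters` are hypothetical and chosen only for this example.

```python
from math import ceil, comb

def interaction_terms(n_loci, order):
    """Regression terms for all order-way interactions among n biallelic loci."""
    return comb(n_loci, order) * 2 ** order

def max_parameters(n_cases, n_controls):
    """Largest whole number of parameters strictly below min(n1, n0)/10."""
    limit = min(n_cases, n_controls) / 10
    return ceil(limit) - 1

# Main effects (order 1) through four-way interactions for 10 loci.
terms_by_order = {k: interaction_terms(10, k) for k in range(1, 5)}
allowed = max_parameters(200, 200)
```

Comparing `terms_by_order` with `allowed` shows how quickly the number of interaction terms outgrows what a sample of 200 cases and 200 controls can support.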
A fourth advantage of MDR is that false-positive results due to multiple testing are minimized. This is primarily due to the cross-validation strategy used to select optimal models. Data-reduction and pattern-recognition methods are good for identification of complex relationships among data, even when those relationships are due to either chance or false-positive variations. However, the real test of any method is its ability to make predictions in independent data. Cross-validation divides the data into 10 equal parts, allowing 9/10 of the data to be used to develop a model and the independent 1/10 of the data to be used to evaluate the predictive ability of the model. Optimal models are selected solely on the basis of their ability to make predictions with regard to independent data. Only when a final predictive model has been selected is the null hypothesis of no association tested via permutation testing. It is this combined cross-validation-testing/permutation-testing method that minimizes false-positives due to multiple examinations of the data.
MDR overcomes many limitations of the generalized linear model. However, MDR can be computationally intensive, especially when more than 10 polymorphisms need to be evaluated. A genome scan with hundreds to thousands of polymorphisms requires robust machine learning algorithms, since all of the possible multilocus combinations cannot be exhaustively searched. This is, however, a consequence of any multilocus method that does not first condition on a particular locus having an independent main effect (e.g., stepwise logistic regression). Also, MDR models may be difficult to interpret. This is illustrated clearly in the four-locus model in Figure 3. There are no obvious trends or patterns in the distribution of high-risk and low-risk groupings across the four-dimensional genotype space; for example, a consistent trend of high-risk or low-risk cells across a series of rows or of columns may indicate that a particular locus has a main effect. The lack of such trends in the four-locus model for breast cancer is indicative of epistasis; that is, the influence of each genotype on disease risk appears to be dependent on the genotypes at each of the other loci. Sorting out the nature of the interactions in four-dimensional space to infer function remains an interpretive challenge.
In its current form, MDR is best applied to case-control studies that are balanced (i.e., that have the same number of cases and of controls). Also, MDR is somewhat limited in its ability to make predictions for independent data sets when the dimensionality of the best model is relatively high and the sample is relatively small. High dimensionality and a small sample lead to many multifactor cells with either missing data or singleton data. This is not a problem for estimation of the classification error and evaluation of the cross-validation consistency, but it is a problem for estimation of the prediction error. For example, if there were one observation for each multifactor cell in n-dimensional space, then, during cross-validation, that one observation will end up in either the training data used to estimate the classification error or the test data used to estimate the prediction error but not in both. If the observation ends up in the test data, there will be, from the training data, no model (i.e., there will be an empty cell) to make a prediction. This greatly limits the number of observations for which predictions can be made in the test set and ultimately impacts the SEM of the prediction error.
MDR is a powerful alternative to traditional parametric statistics, such as logistic regression. MDR has the ability to identify high-order (i.e., more than two) gene-gene interactions in relatively small simulated and real data sets. Although MDR addresses some of the limitations of the generalized linear model, there are several ways in which the method can be improved.
First, if MDR is going to be used for genome scans with hundreds to thousands of single-nucleotide polymorphisms, machine learning strategies may be used to optimize the selection of polymorphisms to be modeled, since an exhaustive search of all possible combinations will not be possible. Parallel genetic algorithms may be used as a robust machine learning approach.
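The size of the search space can be made concrete by counting candidate models. The following is a minimal sketch: the function name `n_models` is hypothetical, and the figure of 1,000 polymorphisms is simply an example within the "hundreds to thousands" range mentioned above.

```python
from math import comb

def n_models(n_loci, min_order, max_order):
    """Number of distinct multilocus models of the given orders."""
    return sum(comb(n_loci, k) for k in range(min_order, max_order + 1))

# Two- to nine-locus models among 10 loci, as in the exhaustive searches
# described earlier.
exhaustive_10 = n_models(10, 2, 9)

# Four-locus models alone among 1,000 polymorphisms (example scan size).
four_locus_1000 = comb(1000, 4)
```

The first count is on the order of a thousand models, whereas the second exceeds forty billion, which is why a guided search strategy is suggested for genome scans.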
Second, it may be important to improve MDR's predictive ability in the higher dimensions. Several strategies can be used to improve the estimation of the prediction error. The first strategy uses a nearest neighbor method to determine whether an empty cell should be classified as high risk or as low risk; for example, if the majority of multilocus-genotype combinations within one step in n-dimensional space are classified as high risk, then the empty cell is also classified as high risk. The second strategy projects either a high risk or a low risk classification for an empty cell in a lower dimension; for example, the locus with the least frequent genotype might be removed from the model, and risk could then be determined from the equivalent genotypes in a lower dimension. These strategies may be compared to determine whether either improves the estimation of the prediction error when empty cells are present.
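The first (nearest neighbor) strategy might be sketched as follows. Several details are assumptions of this example rather than part of the description: genotypes are coded 0, 1, or 2 at each locus, "within one step" is taken to mean a change of one genotype level at a single locus, ties fall back to a default label, and the names `neighbors` and `classify_empty_cell` are hypothetical.

```python
def neighbors(cell, n_levels=3):
    """Yield cells that differ from `cell` by one level at exactly one locus."""
    for i, g in enumerate(cell):
        for step in (-1, 1):
            v = g + step
            if 0 <= v < n_levels:
                yield cell[:i] + (v,) + cell[i + 1:]

def classify_empty_cell(cell, labels, default="low"):
    """Assign an empty cell the majority label of its labeled neighbors."""
    votes = [labels[nb] for nb in neighbors(cell) if nb in labels]
    high = votes.count("high")
    low = len(votes) - high
    if high > low:
        return "high"
    if low > high:
        return "low"
    return default        # tie or no labeled neighbors (assumed behavior)

# Toy two-locus example: cell (1, 1) is empty in the training data.
labels = {(0, 1): "high", (2, 1): "high", (1, 0): "low"}
imputed = classify_empty_cell((1, 1), labels)
```

The second strategy, projecting into a lower dimension, would instead drop one locus from the cell and look up the label of the resulting lower-dimensional genotype.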
Third, it may be important to modify MDR for the analysis of unbalanced case-control studies. Several different weighting schemes may be used for the case-control ratio that account for whether the total number of cases or the total number of controls is greater.
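One possible weighting scheme, offered purely as an illustration and not as the scheme contemplated here, is to compare each cell's cases:controls ratio with the overall cases:controls ratio of the study instead of with a fixed value of 1. The function name `label_cell_weighted` is hypothetical.

```python
def label_cell_weighted(cases, controls, total_cases, total_controls):
    """Label a cell by comparing its ratio with the study-wide ratio.

    Tests cases/controls >= total_cases/total_controls, cross-multiplied
    so that empty counts do not cause a division by zero.
    """
    if cases * total_controls >= controls * total_cases:
        return "high"
    return "low"

# In a balanced study this reduces to the usual 1:1 threshold.
balanced = label_cell_weighted(2, 4, 200, 200)

# With 100 cases and 300 controls, the same cell is enriched for cases.
unbalanced = label_cell_weighted(2, 4, 100, 300)
```

Under this scheme a cell is called high risk when cases are over-represented relative to the sample as a whole, so the label does not depend on which group happens to be larger.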
Finally, simulation studies may be used to determine the strengths and the weaknesses of MDR in the presence of genotyping errors, phenocopies, genetic heterogeneity, and other phenomena that complicate the identification and characterization of functional polymorphisms. Data-reduction methods such as MDR will be invaluable for the identification and characterization of high-order gene-gene and high-order gene-environment interactions, when few degrees of freedom are available for parametric-statistical estimation of interaction effects.
According to exemplary embodiments, multifactor-dimensionality reduction (MDR) is introduced as a method for reducing the dimensionality of multilocus information, to improve the identification of polymorphism combinations associated with disease risk. The MDR method is nonparametric (i.e., no hypothesis about the value of a statistical parameter is made), is model-free (i.e., it assumes no particular inheritance model), and is directly applicable to case-control and discordant-sib-pair studies. Using simulated case-control data, the description above demonstrates that MDR has reasonable power to identify interactions among two or more loci in relatively small samples. When it was applied to a sporadic breast cancer case-control data set, in the absence of any statistically significant independent main effects, MDR identified a statistically significant high-order interaction among four polymorphisms from three different estrogen-metabolism genes. This is the first known report of a four-locus interaction associated with a common complex multifactorial disease.
It should be understood that the foregoing description and accompanying drawings are by example only. A variety of modifications are envisioned that do not depart from the scope and spirit of the invention. The above description is intended by way of example only and is not intended to limit the present invention in any way.

Claims

WHAT IS CLAIMED IS:
1. A method for reducing multidimensional data, comprising the steps of: out of a first data set including discrete independent variables, measuring n observations; determining a number of observations in which a dependent variable has a first value; determining a number of observations in which the dependent variable has a second value; forming combinations of the independent variables in the first data set to produce a second data set; in each combination of independent variables in the second data set, determining a ratio of the number of observations with the dependent variable having the first value to the number of observations with the dependent variable having the second value; comparing each ratio to one or more thresholds or ranges of thresholds to determine which combination of independent variables in the second data set optimally discriminates observations with the dependent variable having the first value and observations with the dependent variable having the second value; and pooling those combinations in the second data set that have ratios at approximately the same threshold or within the same range of thresholds to produce a third data set, wherein the third data set has a smaller number of independent variable combinations than the second data set.
2. The method of claim 1, wherein the independent variables are predictor or explanatory variables, and the dependent variable is a response variable.
3. The method of claim 2, wherein the independent variables include at least one of genetic and environmental factors.
4. The method of claim 3, wherein the first value of the dependent variable is disease indicative, and the second value is disease-free indicative.
5. The method of claim 4, wherein the third data set includes two independent variable combinations, one of which has a high ratio of disease indicative values to disease-free indicative values of the dependent variable and the other of which has a low ratio of disease indicative values to disease-free indicative values of the dependent variable.
6. An apparatus for reducing multidimensional data, comprising: means for measuring n observations out of a first data set including discrete independent variables; means for determining a number of observations in which a dependent variable has a first value; means for determining a number of observations in which the dependent variable has a second value; means for forming combinations of the independent variables in the first data set to produce a second data set; means for determining a ratio of the number of observations with the dependent variable having the first value to the number of observations with the dependent variable having the second value in each combination of independent variables in the second data set; means for comparing each ratio to one or more thresholds or ranges of thresholds to determine which combination of independent variables optimally discriminates observations with the dependent variable having the first value and observations with the dependent variable having the second value; and means for pooling those combinations in the second data set that have ratios at approximately the same threshold or within the same range of thresholds to produce a third data set, wherein the third data set has a smaller number of independent variable combinations than the second data set.
7. The apparatus of claim 6, wherein the independent variables are predictor or explanatory variables, and the dependent variable is a response variable.
8. The apparatus of claim 7, wherein the independent variables include at least one of genetic and environmental factors.
9. The apparatus of claim 8, wherein the first value of the dependent variable is disease indicative, and the second value is disease-free indicative.
10. The apparatus of claim 9, wherein the third data set has two independent variable combinations, one of which has a high ratio of disease indicative values to disease-free indicative values of the dependent variable and the other of which has a low ratio of disease indicative values to disease-free indicative values of the dependent variable.
PCT/US2003/001333 2002-01-15 2003-01-15 Method and apparatus for multifactor dimensionality reduction WO2003060652A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003212806A AU2003212806A1 (en) 2002-01-15 2003-01-15 Method and apparatus for multifactor dimensionality reduction

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US34931702P 2002-01-15 2002-01-15
US60/349,317 2002-01-15

Publications (2)

Publication Number Publication Date
WO2003060652A2 true WO2003060652A2 (en) 2003-07-24
WO2003060652A3 WO2003060652A3 (en) 2004-02-05

Family

ID=23371855

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/001333 WO2003060652A2 (en) 2002-01-15 2003-01-15 Method and apparatus for multifactor dimensionality reduction

Country Status (2)

Country Link
AU (1) AU2003212806A1 (en)
WO (1) WO2003060652A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020133299A1 (en) * 2000-09-20 2002-09-19 Jacob Howard J. Physiological profiling

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1725967A4 (en) * 2004-03-05 2008-01-02 Perlegen Sciences Inc Methods for genetic analysis
EP1725967A2 (en) * 2004-03-05 2006-11-29 Perlegen Sciences, Inc. Methods for genetic analysis
CN101145030B (en) * 2006-09-13 2011-01-12 新鼎系统股份有限公司 Method and system for increasing variable amount, obtaining rest variable, dimensionality appreciation and variable screening
US10957455B2 (en) 2007-03-16 2021-03-23 Expanse Bioinformatics, Inc. Computer implemented identification of genetic similarity
US11545269B2 (en) 2007-03-16 2023-01-03 23Andme, Inc. Computer implemented identification of genetic similarity
US7818310B2 (en) 2007-03-16 2010-10-19 Expanse Networks, Inc. Predisposition modification
US10991467B2 (en) 2007-03-16 2021-04-27 Expanse Bioinformatics, Inc. Treatment determination and impact analysis
US7941329B2 (en) 2007-03-16 2011-05-10 Expanse Networks, Inc. Insurance optimization and longevity analysis
US11791054B2 (en) 2007-03-16 2023-10-17 23Andme, Inc. Comparison and identification of attribute similarity based on genetic markers
US8024348B2 (en) 2007-03-16 2011-09-20 Expanse Networks, Inc. Expanding attribute profiles
US11735323B2 (en) 2007-03-16 2023-08-22 23Andme, Inc. Computer implemented identification of genetic similarity
US8099424B2 (en) 2007-03-16 2012-01-17 Expanse Networks, Inc. Treatment determination and impact analysis
US11621089B2 (en) 2007-03-16 2023-04-04 23Andme, Inc. Attribute combination discovery for predisposition determination of health conditions
US8209319B2 (en) 2007-03-16 2012-06-26 Expanse Networks, Inc. Compiling co-associating bioattributes
US11600393B2 (en) 2007-03-16 2023-03-07 23Andme, Inc. Computer implemented modeling and prediction of phenotypes
US11581098B2 (en) 2007-03-16 2023-02-14 23Andme, Inc. Computer implemented predisposition prediction in a genetics platform
US8606761B2 (en) 2007-03-16 2013-12-10 Expanse Bioinformatics, Inc. Lifestyle optimization and behavior modification
US11581096B2 (en) 2007-03-16 2023-02-14 23Andme, Inc. Attribute identification based on seeded learning
US7844609B2 (en) 2007-03-16 2010-11-30 Expanse Networks, Inc. Attribute combination discovery
US10379812B2 (en) 2007-03-16 2019-08-13 Expanse Bioinformatics, Inc. Treatment determination and impact analysis
US10803134B2 (en) 2007-03-16 2020-10-13 Expanse Bioinformatics, Inc. Computer implemented identification of genetic similarity
US10896233B2 (en) 2007-03-16 2021-01-19 Expanse Bioinformatics, Inc. Computer implemented identification of genetic similarity
US7797302B2 (en) 2007-03-16 2010-09-14 Expanse Networks, Inc. Compiling co-associating bioattributes
US7933912B2 (en) 2007-03-16 2011-04-26 Expanse Networks, Inc. Compiling co-associating bioattributes using expanded bioattribute profiles
US8051033B2 (en) 2007-03-16 2011-11-01 Expanse Networks, Inc. Predisposition prediction using attribute combinations
US7941434B2 (en) 2007-03-16 2011-05-10 Expanse Networks, Inc. Efficiently compiling co-associating bioattributes
US11348692B1 (en) 2007-03-16 2022-05-31 23Andme, Inc. Computer implemented identification of modifiable attributes associated with phenotypic predispositions in a genetics platform
US11348691B1 (en) 2007-03-16 2022-05-31 23Andme, Inc. Computer implemented predisposition prediction in a genetics platform
US11515047B2 (en) 2007-03-16 2022-11-29 23Andme, Inc. Computer implemented identification of modifiable attributes associated with phenotypic predispositions in a genetics platform
US11482340B1 (en) 2007-03-16 2022-10-25 23Andme, Inc. Attribute combination discovery for predisposition determination of health conditions
US11495360B2 (en) 2007-03-16 2022-11-08 23Andme, Inc. Computer implemented identification of treatments for predicted predispositions with clinician assistance
US8788286B2 (en) 2007-08-08 2014-07-22 Expanse Bioinformatics, Inc. Side effects prediction using co-associating bioattributes
US8386519B2 (en) 2008-12-30 2013-02-26 Expanse Networks, Inc. Pangenetic web item recommendation system
US9031870B2 (en) 2008-12-30 2015-05-12 Expanse Bioinformatics, Inc. Pangenetic web user behavior prediction system
US11514085B2 (en) 2008-12-30 2022-11-29 23Andme, Inc. Learning system for pangenetic-based recommendations
US8255403B2 (en) 2008-12-30 2012-08-28 Expanse Networks, Inc. Pangenetic web satisfaction prediction system
US11003694B2 (en) 2008-12-30 2021-05-11 Expanse Bioinformatics Learning systems for pangenetic-based recommendations
US8108406B2 (en) 2008-12-30 2012-01-31 Expanse Networks, Inc. Pangenetic web user behavior prediction system
US11468971B2 (en) 2008-12-31 2022-10-11 23Andme, Inc. Ancestry finder
US11508461B2 (en) 2008-12-31 2022-11-22 23Andme, Inc. Finding relatives in a database
US11657902B2 (en) 2008-12-31 2023-05-23 23Andme, Inc. Finding relatives in a database
US11776662B2 (en) 2008-12-31 2023-10-03 23Andme, Inc. Finding relatives in a database
US11322227B2 (en) 2008-12-31 2022-05-03 23Andme, Inc. Finding relatives in a database
US11935628B2 (en) 2008-12-31 2024-03-19 23Andme, Inc. Finding relatives in a database

Also Published As

Publication number Publication date
AU2003212806A1 (en) 2003-07-30
WO2003060652A3 (en) 2004-02-05
AU2003212806A8 (en) 2003-07-30

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP