WO2003060652A2

WO2003060652A2 - Method and apparatus for multifactor dimensionality reduction

Info

Publication number: WO2003060652A2
Application number: PCT/US2003/001333
Authority: WO
Inventors: Lance W. Hahn; Marylyn Ritchie; Jason H. Moore
Original assignee: Vanderbilt University
Priority date: 2002-01-15
Filing date: 2003-01-15
Publication date: 2003-07-24
Also published as: AU2003212806A1; WO2003060652A3; AU2003212806A8

Abstract

Multidimensional data is reduced by measuring n observations out of a first data set including discrete independent variables, such as genetic or environmental factors. A number of observations in which a dependent variable has a first value is determined, such as a disease-indicative value, and a number of observations in which the dependent variable has a second value, such as a disease-free value, is determined. Combinations are formed of the independent variables to produce a second data set. In each combination of independent variables, a ratio of the number of observations with the dependent variable having the first value to the number of observations with the dependent variable having the second value is determined. Each ratio is compared to one or more thresholds or ranges of thresholds to determine which combination of independent variables optimally discriminates observations with the dependent variable having the first value and observations with the dependent variable having the second value. Those combinations having ratios at approximately the same threshold or within the same range of thresholds are pooled together to produce a third data set having a smaller number of independent variable combinations than the second data set.

Description

METHOD AND APPARATUS FOR MULTIFACTOR DIMENSIONALITY REDUCTION

CROSS-REFERENCE TO RELATED APPLICATION

This application is related to and claims priority from commonly assigned U.S. Provisional Application No. 60/349,317, filed January 15, 2002 in the names of Hahn et al.

STATEMENT OF GOVERNMENT SPONSORED RESEARCH

This invention was made with government support under Grant No. 1 R01 CA/ES83572 awarded by the National Institutes of Health. The United States government has certain rights in the invention.

BACKGROUND

The present invention is directed to a method and an apparatus for analyzing data. More particularly, the present invention is directed to a method and an apparatus for reducing multidimensional data.

The identification and characterization of susceptibility genes for common complex human diseases is one of the greatest challenges facing human geneticists. This challenge is partly due to the limitations of parametric-statistical methods (i.e., those in which a hypothesis about the value of a statistical parameter is made) for detection of gene effects that are dependent solely or partially on interactions with other genes and with environmental exposures.

For example, logistic regression is a commonly used method for modeling the relationship between discrete predictors, such as genotypes, and discrete clinical outcomes. However, logistic regression, like most parametric-statistical methods, is less practical for dealing with high-dimensional data. That is, when high-order interactions are modeled, there are many contingency-table cells that contain no observations (i.e., that are empty cells). This can lead to very large coefficient estimates and standard errors. One solution to this problem is to collect very large numbers of samples to allow robust estimation of interaction effects; however, the magnitudes of the samples that are often required incur prohibitive expense. An alternative solution is to develop new statistical and computational methods that have improved power to identify multilocus effects in relatively small samples.

SUMMARY

It is an object of the present invention to provide a method and apparatus for reducing multidimensional data in a simple, efficient manner. According to an exemplary embodiment, n observations are measured out of a first data set including discrete independent variables. A number of observations in which a dependent variable has a first value is determined, and a number of observations in which the dependent variable has a second value is determined. Combinations are formed of the independent variables to produce a second data set. In each combination of independent variables, a ratio of the number of observations with the dependent variable having the first value to the number of observations with the dependent variable having the second value is determined. Each ratio is compared to one or more thresholds or ranges of thresholds to determine which combination of independent variables optimally discriminates observations with the dependent variable having the first value and observations with the dependent variable having the second value. Those combinations that have ratios at approximately the same threshold or within the same range of thresholds are pooled together to produce a third data set. The third data set includes a smaller number of independent variable combinations than the second data set.

According to exemplary embodiments, the independent variables may be predictor or explanatory variables, including genetic and environmental factors. The dependent variable may be a response variable indicative of disease. The number of independent variable combinations may be reduced to two, one of which has a high ratio of disease indicative values to disease-free indicative values of the dependent variable and the other of which has a low ratio of disease indicative values to disease-free indicative values of the dependent variable.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 illustrates an exemplary summary of steps for implementing a method for multifactor reduction according to one embodiment;

Figure 2 illustrates an exemplary apparatus for performing multifactor reduction according to an exemplary embodiment; and

Figure 3 illustrates an exemplary summary of genotype combinations produced by the steps shown in Figure 1 according to an exemplary embodiment.

DETAILED DESCRIPTION

According to exemplary embodiments, a multifactor-dimensionality reduction (MDR) method has been developed for detecting and characterizing high-order gene-gene and gene-environment interactions in case-control and discordant-sib-pair studies with relatively small samples. The MDR method may be considered to be inspired in part by the combinatorial-partitioning method, a data-reduction method for the exploratory analysis of quantitative traits. With MDR, multilocus genotypes are pooled into high-risk and low-risk groups, effectively reducing the genotype predictors from n dimensions to one dimension. The new, one-dimensional multilocus-genotype variable is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing. The MDR method is model free, in that it does not assume any particular genetic model, and is nonparametric, in that it does not estimate any parameters.

For the purpose of illustration, in the following, the MDR method is first evaluated by using simulated multilocus data with epistatic effects, and it is then applied to identification of multiple single-nucleotide polymorphisms associated with sporadic breast cancer. It will be appreciated, however, that the invention is not limited to this application.

Breast cancer is generally considered a complex disease, since its most common form - sporadic breast cancer - is due to multiple unknown etiologies. This is in contrast to the less common form - familial breast cancer - which is attributed to single-gene abnormalities (e.g., BRCA1 [MIM 113705] and BRCA2 [MIM 600185]). Although the causes of sporadic breast cancer remain undetermined, there is substantial experimental, epidemiological, and clinical evidence that estrogens influence breast cancer risk. In fact, recent evidence indicates that the oxidative metabolism of estrogens to catechol estrogens and to estrogen quinones can cause mutagenic DNA lesions. Consequently, catechol estrogens and estrogen quinones have been implicated in mammary carcinogenesis.

The catechol-estrogen pathway is regulated by catechol-O-methyltransferase (COMT), by cytochromes P450 1A1 and P450 1B1 (CYP1A1 and CYP1B1, respectively), and by glutathione S-transferases M1 and T1 (GSTM1 and GSTT1, respectively). Each of the genes encoding these enzymes contains functional polymorphisms that result in different concentrations of catechol-estrogen metabolites. Interactions between polymorphisms of these genes may have a synergistic, or nonadditive, effect on the pathogenesis of breast cancer. This may explain differences in breast cancer risk.

In one experiment, application of MDR to a sporadic breast cancer case-control data set, in the absence of any statistically significant independent main effects, identified a statistically significant high-order interaction among four polymorphisms from three different estrogen-metabolism genes - COMT (MIM 116790), CYP1B1 (MIM 601771), and CYP1A1 (MIM 108330).

Subjects and Methods (MDR)

Figure 1 illustrates a summary of steps involved in implementing the MDR method for case-control studies according to an exemplary embodiment. The same procedure is equally applicable to discordant-sib-pair studies. In step 1, a set of n genetic and/or discrete environmental factors is selected from a pool of all factors. In step 2, the n factors and their possible multifactor classes or cells are represented in n-dimensional space. For example, for two loci with three genotypes each, there are nine two-locus-genotype combinations. Then, the ratio of the number of cases (or affected sibs) to the number of controls (or unaffected sibs) is estimated within each multifactor class.

In step 3, each multifactor cell in n-dimensional space is labeled either as "high-risk," if the cases:controls ratio meets or exceeds some threshold (e.g., ≥1.0), or as "low-risk," if that threshold is not met. For each multifactor combination, hypothetical distributions of cases (left bars in boxes) and of controls (right bars in boxes) are shown. In this way, a model for both cases and controls (or for affected and unaffected sibs) is formed by pooling high-risk cells into one group and low-risk cells into another group. This reduces the n-dimensional model to a one-dimensional model (i.e., one variable with two multifactor classes: high risk and low risk).
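Steps 2 and 3 can be illustrated with a minimal Python sketch. The function names, the coding of genotypes as small integers, and the toy data are assumptions made for illustration only; they are not part of the described embodiment. The comparison is written as cases ≥ threshold × controls so that cells with no controls do not require a division.

```python
from collections import Counter

def label_cells(genotypes, status, loci, threshold=1.0):
    """Label each observed multifactor cell as 'high' or 'low' risk.

    genotypes: list of tuples, one per subject, one entry per factor
    status:    list of 1 (case) / 0 (control), same length as genotypes
    loci:      indices of the factors forming the combination
    """
    cases, controls = Counter(), Counter()
    for g, s in zip(genotypes, status):
        cell = tuple(g[i] for i in loci)
        (cases if s == 1 else controls)[cell] += 1
    labels = {}
    for cell in set(cases) | set(controls):
        # cases:controls ratio meets or exceeds the threshold
        high = cases[cell] >= threshold * controls[cell]
        labels[cell] = "high" if high else "low"
    return labels

def reduce_to_one_dimension(genotypes, loci, labels, default="low"):
    """Replace each subject's multifactor genotype by its pooled risk class.

    Cells never seen when labeling fall back to `default` (an assumption).
    """
    return [labels.get(tuple(g[i] for i in loci), default) for g in genotypes]

# Toy data: two loci, genotypes coded 0/1/2 (hypothetical values)
genotypes = [(0, 2), (0, 2), (1, 1), (1, 1), (2, 0), (0, 0), (0, 0), (1, 0)]
status = [1, 1, 1, 0, 0, 0, 0, 1]
labels = label_cells(genotypes, status, loci=(0, 1))
reduced = reduce_to_one_dimension(genotypes, (0, 1), labels)
```

In the toy data, the cell (1, 1) holds one case and one control, so its ratio of 1.0 meets the threshold and the cell is pooled into the high-risk group.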

In this initial implementation of MDR, balanced case-control studies are needed. In step 4, the prediction error of each model is estimated by 10-fold cross-validation. Here, the data (i.e., subjects) are randomly divided into 10 equal parts. The MDR model is developed for each possible 9/10 of the subjects and then is used to make predictions about the disease status of each possible 1/10 of the subjects excluded. The proportion of subjects for which an incorrect prediction was made is an estimation of the prediction error. To reduce the possibility of poor estimates of the prediction error that are due to chance divisions of the data set, the 10-fold cross-validation is repeated 10 times, and the prediction errors are averaged.
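The cross-validation estimate of step 4 might be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in the study: the function names are hypothetical, subjects are (genotype, status) pairs, and a cell that does not appear among the training subjects is predicted to be low-risk.

```python
import random
from collections import Counter

def fit_high_risk_cells(train, loci, threshold=1.0):
    """Return the set of cells labeled high-risk in the training subjects."""
    cases, controls = Counter(), Counter()
    for g, s in train:
        cell = tuple(g[i] for i in loci)
        (cases if s else controls)[cell] += 1
    return {c for c in cases if cases[c] >= threshold * controls[c]}

def cv_prediction_error(data, loci, k=10, repeats=10, seed=0):
    """Average k-fold prediction error over several random divisions."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]  # k nearly equal parts
        wrong = 0
        for i, test in enumerate(folds):
            train = [x for j, f in enumerate(folds) if j != i for x in f]
            high = fit_high_risk_cells(train, loci)
            for g, s in test:
                # Assumption: cells unseen in training are predicted low-risk
                predicted = 1 if tuple(g[m] for m in loci) in high else 0
                wrong += predicted != s
        # proportion of all subjects predicted incorrectly in this division
        errors.append(wrong / len(data))
    return sum(errors) / len(errors)

# Toy data: factor 0 determines status exactly, factor 1 is unrelated
data = [((s, i % 3), s) for i in range(100) for s in (0, 1)]
error_informative = cv_prediction_error(data, loci=(0,))
error_noise = cv_prediction_error(data, loci=(1,))
```

Each subject is held out exactly once per division, so the pooled proportion of incorrect predictions is the prediction error for that division; averaging over the repeated divisions reduces the influence of any single chance split.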

For studies with more than two factors, the four steps of the MDR method may be repeated for each possible combination, if computationally feasible. If the number of combinations to be evaluated exceeds computational feasibility, machine learning methods, such as parallel genetic algorithms, may be employed. Among all of the two-factor combinations, a single model that maximizes the cases:controls ratio of the high-risk group is selected. This two-locus model will have the minimum classification error among the two-locus models. Single best multifactor models are also selected from among the models for each of the three- to n-factor combinations. Among this set of best multifactor models, the combination of loci and/or discrete environmental factors that minimizes the prediction error is selected. Thus, the classification errors and the prediction errors estimated by 10-fold cross-validation are used to select the final multifactor model.

Hypothesis testing for this final model can then be performed by evaluating the consistency of the model across cross-validation data sets - that is, how many times the same MDR model is identified in each possible 9/10 of the subjects. The reasoning is that a true signal (i.e., association) should be present in the data regardless of how they are divided. Statistical significance was determined by comparing the average cross-validation consistency from the observed data to the distribution of average consistencies under the null hypothesis of no associations derived empirically from 1,000 permutations. The null hypothesis was rejected when the upper-tail Monte Carlo P value derived from the permutation test was < .05.
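The exhaustive search and the cross-validation consistency described above can be sketched in Python as follows. The sketch assumes that the best k-factor model in each 9/10 of the subjects is the one with minimum classification error; the function names and the toy data set are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def classification_error(data, loci, threshold=1.0):
    """Fraction of subjects misclassified when cells are labeled on the same data."""
    cases, controls = Counter(), Counter()
    for g, s in data:
        cell = tuple(g[i] for i in loci)
        (cases if s else controls)[cell] += 1
    wrong = 0
    for cell in set(cases) | set(controls):
        high = cases[cell] >= threshold * controls[cell]
        # a high-risk cell misclassifies its controls, a low-risk cell its cases
        wrong += controls[cell] if high else cases[cell]
    return wrong / len(data)

def best_model(data, n_factors, k):
    """Exhaustively search all k-factor combinations for minimum error."""
    return min(combinations(range(n_factors), k),
               key=lambda loci: classification_error(data, loci))

def cv_consistency(data, n_factors, k, folds=10, seed=0):
    """Return the most often selected model and how many of the folds chose it."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    parts = [shuffled[i::folds] for i in range(folds)]
    chosen = Counter()
    for i in range(folds):
        train = [x for j, p in enumerate(parts) if j != i for x in p]
        chosen[best_model(train, n_factors, k)] += 1
    return chosen.most_common(1)[0]

# Toy data: four factors coded 0/1/2; status depends only on factors 0 and 1
data = [((a, b, c, d), int(a + b == 2))
        for a in range(3) for b in range(3)
        for c in range(3) for d in range(3)] * 3
model, count = cv_consistency(data, n_factors=4, k=2)
```

Because the toy signal is present however the subjects are divided, the same two-factor model is selected in every training set, which gives the maximum consistency of 10.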

Figure 2 illustrates an exemplary apparatus for performing multifactor reduction according to an exemplary embodiment. According to the embodiment depicted in Figure 2, the method for multifactor reduction may be performed on a personal computer 200. The personal computer 200 includes, for example, a memory and a microprocessor. The microprocessor may be an Intel 600 MHz Pentium III processor or a Solaris (Sun) processor. Devices such as a keyboard, a mouse, and a display terminal may be connected to the microprocessor and memory to receive input data and export output data.

Data Simulation

To evaluate the MDR method, four sets of 50 replicates of 200 cases and 200 controls were simulated, using four different multilocus epistasis models. This number of replicates was selected to be large enough to provide validation of the method and to be small enough to allow exhaustive computational searches of all possible multilocus models. Unrelated subjects and genotypes for 10 unlinked biallelic loci were simulated by the Genometric Analysis Simulation Package. Allele frequencies for each of the 10 loci were selected to match those in the sporadic-breast cancer case-control sample. Hardy-Weinberg equilibrium and linkage equilibrium were assumed.

For the first model, a two-locus interaction effect was simulated, using penetrance functions P(D|AAbb) = .2, P(D|AaBb) = .2, P(D|aaBB) = .2, and P(D|others) = 0, where D is disease and A, a, B, and b represent the alleles for the disease-susceptibility loci. This is a well-characterized model for epistasis, in which disease risk is dependent on whether two deleterious alleles and two normal alleles are present, from either one locus or both loci. The independent main effects for the loci in this model are small. This two-locus epistasis model was extended to three-locus, four-locus, and five-locus epistasis models by adding corresponding homozygous or heterozygous genotypes to the aforementioned penetrance functions. For example, for the three-locus epistasis model, penetrance functions P(D|AAbbcc) = .2, P(D|AaBbcc) = .2, P(D|aaBBcc) = .2, P(D|aaBbCc) = .2, P(D|AabbCc) = .2, and P(D|aabbCC) = .2 were used. Thus, of the 10 total simulated loci, there were 2, 3, 4, or 5 functional epistatic loci and up to 8 nonfunctional loci.
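A simulation of this kind might be sketched as follows. In the penetrance functions listed above, every genotype with penetrance .2 carries exactly two upper-case alleles across the functional loci, so the sketch codes each genotype as its number of upper-case alleles and tests whether the total equals 2. Applying the same rule to the four-locus and five-locus models is an assumption, as are the allele frequency of 0.5, the rejection-sampling scheme, and all names used; the study itself used the Genometric Analysis Simulation Package with allele frequencies matched to the breast cancer sample.

```python
import random

def penetrance(genotype, functional_loci):
    """P(D | genotype): .2 when the functional loci carry exactly two
    upper-case alleles in total (e.g., AAbb, AaBb, aaBB), otherwise 0."""
    return 0.2 if sum(genotype[i] for i in functional_loci) == 2 else 0.0

def simulate(functional_loci, n_cases=200, n_controls=200,
             n_loci=10, freq=0.5, seed=0):
    """Draw unrelated subjects until both groups are full.

    Each genotype is coded as the number of upper-case alleles (0, 1, or 2),
    drawn as two independent alleles (Hardy-Weinberg equilibrium); loci are
    drawn independently (linkage equilibrium). `freq` is a hypothetical
    allele frequency.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g = tuple(sum(rng.random() < freq for _ in range(2))
                  for _ in range(n_loci))
        affected = rng.random() < penetrance(g, functional_loci)
        group, limit = (cases, n_cases) if affected else (controls, n_controls)
        if len(group) < limit:
            group.append(g)
    return cases, controls

# Two-locus epistasis model: loci 0 and 1 functional, loci 2-9 nonfunctional
cases, controls = simulate(functional_loci=(0, 1))
```

Because P(D|others) = 0, every simulated case necessarily carries one of the at-risk genotype combinations, whereas controls may carry any combination.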

Sporadic-Breast Cancer Data

One study was based on 200 white women with sporadic primary invasive breast cancer who were treated at Vanderbilt University Medical Center during 1982-96. Informed consent for this study was obtained from all study subjects, in accordance with the requirements of the Institutional Review Board of Vanderbilt University Medical School. Breast cancer was classified as either sporadic or familial, on the basis of family history as determined by patient questionnaire: patients with either at least one first-degree relative with breast cancer or at least two second-degree relatives with breast cancer were considered to have familial breast cancer; patients not fulfilling these criteria were considered to have sporadic breast cancer. Patients with sporadic breast cancer were frequency age-matched to control patients at Vanderbilt University Medical Center who had been hospitalized for various acute and chronic illnesses. Reasons for exclusion of controls included breast cancer or other forms of malignancy, as well as family history of breast cancer.

DNA was isolated from all samples by use of a DNA extraction kit (Gentra). Because their enzyme products interact in the metabolism of estrogens to catechol estrogens and to estrogen quinones, the analysis focused on the genes COMT (MIM 116790), on chromosome 22q11.2; CYP1A1 (MIM 108330), on chromosome 15q22-qter; CYP1B1 (MIM 601771), on chromosome 2p21-22; GSTM1 (MIM 138350), on chromosome 1p13.3; and GSTT1 (MIM 600436), on chromosome 22q11.2. COMT and GSTT1 are ~4 Mb apart on chromosome 22q11.2.

Table 1 summarizes the polymorphisms, in these genes, that were analyzed by PCR and restriction-endonuclease digestion. Genotype frequencies have been previously reported. The specific primers and amplification conditions and the subsequent restriction-endonuclease analysis for CYP1A1, CYP1B1, GSTM1, and GSTT1 have been described elsewhere. COMT was amplified with primers C1 (5'-GCC GCC ATC ACC CAG CGG ATG GTG GAT TTC GCT GTC) and C2 (5'-GTT TTC AGT GAA CGT GGT GTG). Each PCR contained internal controls for the respective gene, and random retesting of ~5% of the samples yielded 100% reproducibility.

Table 1

Enzyme Genotype Analysis by PCR and Restriction-Endonuclease Digestion

                POLYMORPHISM                                               GENOTYPE FREQUENCY^a (%)
ENZYME          Nucleotide   Codon            PRIMERS     ENDONUCLEASE     w/w     w/p     p/p
COMT            1947G→A      158Val→Met       C1, C2      BspHI            25      51      24
CYP1A1:
  m1            6235T→C      3' UTR           A3, A4^b    MspI             82      15      3
  m2            4887C→A      461Thr→Asn       A1, A4^b    BsaI             92      7       1
  m4            4889A→G      462Ile→Val       A1, A2^b    BsrDI            92      8       0
CYP1B1:
  Codon 48      143C→G       48Arg→Gly        B1, B2^c    RsrII            51      40      9
  Codon 119     355G→T       119Ala→Ser       B1, B2^c    NgoMIV           51      40      9
  Codon 432     1294G→C      432Val→Leu       B3, B4^c    Eco57I           12      58      30
  Codon 453     1358A→G      453Asn→Ser       B3, B4^c    Cac8I            68      30      2
GSTM1           Deletion     Loss of enzyme   M1, M2^b    . . .            57^d            43
GSTT1           Deletion     Loss of enzyme   T1, T2^b    . . .            79^d            21

a w = Wild-type allele; p = polymorphic allele.
b Bailey et al. (1998b).
c Bailey et al. (1998a).
d Either w/w or w/p genotype.

Data Analysis

Prior to application of MDR to the sporadic-breast cancer data set, the method was evaluated by use of the simulated multilocus data sets. For each of the 50 replicates generated by each of the four multilocus epistasis models, the MDR algorithm was applied as described in the subsection "MDR," with a threshold cases:controls ratio of at least 1:1. This threshold was selected so that multilocus-genotype combinations would be considered high-risk if the number of cases with that particular combination either was equal to or exceeded the number of controls; more-stringent thresholds may improve the results. An exhaustive search of all possible two- to nine-locus models was performed. The 10-locus model was not evaluated, since there is only one such model and since its cross-validation consistency is always 10. On validation of the method, MDR was then applied to the sporadic-breast cancer data set, with the same threshold cases:controls ratio, at least 1:1. An exhaustive search of all possible two- to nine-locus models was again performed.

Results

Application of MDR to Simulated Data

Table 2 summarizes the means and the standard errors of the means (SEMs), of both the cross-validation consistency and the prediction error, obtained from the MDR analysis of each group of 50 simulated data sets for each gene-gene interaction model and each number of loci evaluated. For the particular multilocus models that contain the correct two, three, four, or five genes, for each group of 50 simulated data sets, the mean prediction error was minimum, and the mean cross-validation consistency was maximum. Additionally, the SEM of the prediction error and of the cross-validation consistency was minimum at the correct multilocus model. For example, in the case in which a three-locus epistasis model was used to simulate the data sets, the mean ± SEM prediction error was minimum for the three-locus model, at 12.00% ± 0.22%. The two-locus models had a mean ± SEM prediction error of 21.91% ± 0.33%, whereas the four-locus model had a mean ± SEM prediction error of 12.37% ± 0.24%.

The mean prediction error for the four-locus model was much closer to that of the three-locus model, because these models contained the correct three functional loci as well as a false-positive locus, whereas the two-locus models were missing one of the functional loci. Selecting the smaller three-locus model with the lower mean prediction error is consistent with statistical parsimony (i.e., smaller models are better because they are easier to interpret). For the three-locus models in this example, the cross-validation consistency was always 10.00; that is, the same three-locus model was found in each possible 9/10 of the subjects. These results suggest that, for this particular epistasis model, the cross-validation strategy is a reasonable approach to the identification of the correct multilocus model. Furthermore, the threshold cases:controls ratio of at least 1:1 was reasonable for this epistasis model.

Table 2

Summary of Simulation Results

                     MEAN ± SEM
No. of Loci^a        Cross-Validation Consistency    Prediction Error
Model 2:
  2*                 9.86 ± .08                      14.99 ± .24
  3                  7.41 ± .21                      15.58 ± .26
  4                  6.01 ± .22                      16.49 ± .29
  5                  5.56 ± .24                      19.03 ± .38
  6                  6.52 ± .34                      23.23 ± .53
  7                  6.94 ± .26                      24.49 ± .62
  8                  7.90 ± .29                      25.02 ± .73
  9                  8.03 ± .23                      25.40 ± .73
Model 3:
  2                  9.20 ± .17                      21.91 ± .33
  3*                 10.00 ± .00                     12.00 ± .22
  4                  9.27 ± .13                      12.37 ± .24
  5                  6.28 ± .21                      13.90 ± .28
  6                  5.86 ± .25                      15.57 ± .32
  7                  6.26 ± .29                      17.75 ± .43
  8                  7.68 ± .28                      19.39 ± .47
  9                  7.99 ± .15                      19.93 ± .50
Model 4:
  2                  8.40 ± .26                      19.15 ± .35
  3                  8.79 ± .20                      10.20 ± .23
  4*                 10.00 ± .00                     5.68 ± .17
  5                  9.32 ± .12                      6.02 ± .19
  6                  7.74 ± .16                      6.88 ± .22
  7                  7.01 ± .22                      7.73 ± .26
  8                  7.04 ± .24                      8.64 ± .31
  9                  7.79 ± .24                      9.46 ± .34
Model 5:
  2                  9.01 ± .20                      15.33 ± .28
  3                  8.37 ± .25                      8.54 ± .24
  4                  8.16 ± .25                      5.17 ± .20
  5*                 9.99 ± .01                      2.95 ± .11
  6                  9.52 ± .12                      3.17 ± .14
  7                  9.13 ± .16                      3.66 ± .17
  8                  8.74 ± .17                      4.17 ± .19
  9                  9.00 ± .14                      4.60 ± .18

NOTE. - For each simulation model, the multilocus model with maximum mean ± SEM cross-validation consistency and minimum mean ± SEM prediction error is marked with an asterisk (boldface italic type in the original).
a Model number is based on the number of epistatic genes in each simulation model.

The Monte Carlo P values for each of the correctly identified models were all <.001. The estimated power to identify the correct multilocus model was 78% for the two-locus model, 82% for the three-locus model, 94% for the four-locus model, and 90% for the five-locus model. It is interesting that the power to identify the correct multilocus model tends to increase as higher-order interactions are modeled. This may be a real phenomenon, or it may be due to the fact that fewer non-functional loci of the 10 that were simulated were present. These results suggest that, for this particular epistasis model, the MDR method has reasonable power to identify high-order gene-gene interactions in a sample of 200 cases and 200 controls.

Application of MDR to Breast Cancer Data

Table 3 summarizes the cross-validation consistency and the prediction error obtained from MDR analysis of the sporadic-breast cancer case-control data set, for each number of loci evaluated. One four-locus model had a minimum prediction error of 46.73 and a maximum cross-validation consistency of 9.8 that was significant at the .001 level, as determined empirically by permutation testing. Thus, under the null hypothesis of no association, it is highly unlikely that a cross-validation consistency of 9.8 will be observed for this four-locus model. The four-locus model included the polymorphisms of COMT, CYP1A1 m1, CYP1B1 codon 48, and CYP1B1 codon 432.
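The empirical significance test can be illustrated with a short Python sketch of an upper-tail Monte Carlo P value. The statistic used here is a deliberately simple stand-in (an assumption), not the average cross-validation consistency used in the study; in practice the full MDR procedure would be rerun on each permuted data set. All names are hypothetical.

```python
import random

def monte_carlo_p(statistic, genotypes, status, n_perm=1000, seed=0):
    """Upper-tail Monte Carlo P value obtained by permuting disease status.

    Returns the fraction of permuted data sets whose statistic is at least
    as large as the observed one. Some authors instead report
    (count + 1) / (n_perm + 1), which can never be exactly zero.
    """
    rng = random.Random(seed)
    observed = statistic(genotypes, status)
    permuted = list(status)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # breaks any genotype-status association
        if statistic(genotypes, permuted) >= observed:
            at_least += 1
    return at_least / n_perm

def agreement(genotypes, status):
    """Stand-in statistic: subjects whose genotype code equals their status."""
    return sum(g == s for g, s in zip(genotypes, status))

# Toy data with a perfect association between genotype and status
genotypes = [0] * 20 + [1] * 20
status = [0] * 20 + [1] * 20
p_value = monte_carlo_p(agreement, genotypes, status)
```

With 1,000 permutations, a result in which no permuted data set reaches the observed statistic corresponds to a reported P value of <.001.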

Figure 3 illustrates a summary of the four-locus-genotype combinations associated with high risk and with low risk, along with the corresponding distribution of cases (left bars in boxes) and of controls (right bars in boxes), for each multilocus-genotype combination. Note that the patterns of high-risk and low-risk cells differ across each of the different multilocus dimensions. This is evidence of epistasis, or gene-gene interaction; that is, the influence that each genotype at a particular locus has on disease risk is dependent on the genotypes at each of the other three loci. Previous analysis of this data set, by logistic regression, revealed no statistically significant evidence of independent main effects of any of the 10 polymorphisms.

Table 3

Summary of Results for Breast Cancer

No. of Loci     Cross-Validation Consistency    Prediction Error
2               7.00                            51.06
3               4.17                            51.35
4*              9.80^a                          46.73
5               4.71                            50.26
6               5.00                            48.61
7               8.60                            47.45
8               8.20                            52.55
9               7.10                            53.40

NOTE. - The multilocus model with maximum cross-validation consistency and minimum prediction error is marked with an asterisk (boldface italic type in the original).
a P < .001.

According to exemplary embodiments, MDR is a method for reducing the dimensionality of multilocus information, to improve identification of combinations of polymorphisms associated with the risk for common complex multifactorial diseases. The development of MDR was motivated by the limitations of the generalized linear model for detection and characterization of gene-gene and gene-environment interactions and by the success of data-reduction methods for quantitative traits. Using simulated data, MDR was demonstrated to be useful for identification of genes whose effects are primarily through interaction. MDR was then applied to identify gene-gene interaction effects on risk for sporadic breast cancer.

Breast cancer is generally considered a multifactorial disease with estrogens as one of the principal factors. MDR was therefore applied to a set of genes (i.e., COMT, CYP1A1, CYP1B1, GSTM1, and GSTT1) whose protein products interact as enzymes in the metabolism of estrogens in breast tissue. Several studies have examined the breast cancer risk associated with individual genotypes of each of these enzymes. Not surprisingly, the results have been inconsistent and even contradictory. That is, if a single gene in the estrogen-metabolism pathway were solely responsible for breast cancer, then the malignancy would likely present as familial breast cancer, and the gene would be identified by linkage analysis, as in the case of BRCA1 and BRCA2. Studies of two or three genotypes in combination have also yielded inconsistent results. For example, CYP1A1, GSTM1, and GSTT1 polymorphisms were examined in a case-control study of 328 white and 108 African American women, using multiple logistic-regression analysis. None of the enzyme genotypes, individually or combined, were associated with an increased risk for breast cancer. COMT and CYP1B1 were not included in the analysis, because their roles in the catechol-estrogen pathway and/or their various polymorphisms were only recently elucidated. Because of their clearly defined functional interactions in the catechol-estrogen pathway, it is important to consider the combined effect of all these enzymes. MDR applied to 10 single-nucleotide polymorphisms in COMT, CYP1A1, CYP1B1, GSTM1, and GSTT1 identifies a four-locus interaction that is significantly associated with risk for sporadic breast cancer. This is the first report of a four-locus interaction associated with a common complex multifactorial disease.

Breast cancer risk is influenced by several nongenetic hormonal factors, such as age at menarche, age at menopause, body-mass index, reproductive history, lactation history, and use of exogenous estrogen in the form of either oral contraceptives or hormone-replacement therapy. Although these factors allow prediction of a relative risk for a given population, they are not very helpful to individual women.

As defined by the MDR, the determination of a woman's genotype may add another dimension to the assessment of overall breast cancer risk. However, it is obvious that there is also an interaction between genotype risk factors and traditional hormonal risk factors. For example, obesity has been related both to the concentration of endogenous estrogen and to breast cancer risk. Several studies have demonstrated that obese postmenopausal women have an increased risk for breast cancer, compared to age-matched nonobese postmenopausal women. The elevated risk has been attributed to higher levels of circulating estrogens secondary to increased conversion, in adipose tissue, of androgen to estrogen. Several studies have demonstrated significantly higher serum-estradiol concentrations in obese postmenopausal women than in their nonobese counterparts. Thus, any effect that COMT, CYP1A1, CYP1B1, GSTM1, and GSTT1 may have on estrogen metabolism may be affected by the concentration of estradiol. Consequently, the analysis of genetic factors is limited by lack of consideration of these traditional hormonal risk factors.

Advantages of MDR

An important advantage of MDR is that it facilitates the simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical end-point. This is accomplished by reducing the dimensionality of the multilocus data. In essence, genotypes from multiple loci and/or discrete environmental classes are pooled into high-risk and low-risk groups, depending on whether they are more common in affected or in unaffected subjects. This new multilocus-genotype encoding reduces the dimensionality to one. For the simulated data, the mean cross-validation consistency was always maximized, and the mean prediction error was always minimized, at the correct multilocus model.

Another important advantage of MDR is that it is nonparametric. This is an important difference versus traditional parametric-statistical methods, which rely on the generalized linear model. For example, in logistic regression, as each additional main effect is included in the model, the number of possible interaction terms grows exponentially. Having too many independent variables in relation to the number of observed outcome events is a well-recognized problem. Studies suggest that having fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in type 1 and type 2 errors. For example, with two outcome events per independent variable, more than one-third of the estimated regression coefficients differed from the true parameter value by a magnitude of 2. It has been suggested that logistic-regression models should contain no more than $P + 1 < \min(n_1, n_0)/10$ parameters, where $n_1$ is the number of events of type 1 and $n_0$ is the number of events of type 0. For the 200 cases and the 200 controls evaluated in the present study, this formula suggests that no more than 19 parameters should be estimated in a logistic-regression model. In a logistic-regression model, how many parameters must be estimated to identify interactions among the 10 estrogen-metabolism-gene polymorphisms? The number of orthogonal-regression terms needed to describe the interactions among a subset, k, of n biallelic loci is $\binom{n}{k} \times 2^k$ (Wade 2000). Thus, for 10 genes, we would need 20 parameters to model the main effects (assuming two dummy variables per biallelic locus), 180 parameters to model the two-way interactions, 960 parameters to model the three-way interactions, 3,360 parameters to model the four-way interactions, and so forth. Thus, fitting a full model with all interaction terms and then using backward elimination to derive a parsimonious model would not be possible. The MDR method avoids the problems associated with the use of parametric statistics to model high-order interactions.

A third advantage of MDR is that it assumes no particular genetic model (i.e., it is model free). That is, no mode of inheritance needs to be specified. This is important for diseases, such as sporadic breast cancer, in which the mode of inheritance is unknown and likely very complex. In its current form, MDR can be directly applied to case-control and discordant-sib-pair studies. Extension to other family-based control study designs, such as those using trios, should also be possible.
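The growth in the number of regression terms can be checked by evaluating the expression (n choose k) × 2^k directly, as in the following short Python sketch. The function name is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def interaction_terms(n_loci, k):
    """Number of orthogonal regression terms for k-way interactions among
    n biallelic loci: (n choose k) * 2**k."""
    return comb(n_loci, k) * 2 ** k

# Terms needed for 10 loci, from main effects (k = 1) up to four-way interactions
terms = {k: interaction_terms(10, k) for k in range(1, 5)}
total_terms = sum(terms.values())
```

Even when the model is truncated at four-way interactions, the number of terms is far larger than the number of subjects in a study of 200 cases and 200 controls.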

A fourth advantage of MDR is that false-positive results due to multiple testing are minimized. This is primarily due to the cross-validation strategy used to select optimal models. Data-reduction and pattern-recognition methods are good for identification of complex relationships among data, even when those relationships are due to either chance or false-positive variations. However, the real test of any method is its ability to make predictions in independent data. Cross-validation divides the data into 10 equal parts, allowing 9/10 of the data to be used to develop a model and the independent 1/10 of the data to be used to evaluate the predictive ability of the model. Optimal models are selected solely on the basis of their ability to make predictions with regard to independent data. Only when a final predictive model has been selected is the null hypothesis of no association tested via permutation testing. It is this combined cross-validation-testing/permutation-testing method that minimizes false-positives due to multiple examinations of the data.

MDR overcomes many limitations of the generalized linear model. However, MDR can be computationally intensive, especially when more than 10 polymorphisms need to be evaluated. A genome scan with hundreds to thousands of polymorphisms requires robust machine learning algorithms, since all of the possible multilocus combinations cannot be exhaustively searched. This is, however, a consequence of any multilocus method that does not first condition on a particular locus having an independent main effect (e.g., stepwise logistic regression). Also, MDR models may be difficult to interpret. This is illustrated clearly in the four-locus model in Figure 3. There are no obvious trends or patterns in the distribution of high-risk and low-risk groupings across the four-dimensional genotype space; for example, a consistent trend of high-risk or low-risk cells across a series of rows or of columns may indicate that a particular locus has a main effect. The lack of such trends in the four-locus model for breast cancer is indicative of epistasis; that is, the influence of each genotype on disease risk appears to be dependent on the genotypes at each of the other loci. Sorting out the nature of the interactions in four-dimensional space to infer function remains an interpretive challenge.
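The combined cross-validation and permutation-testing strategy described above might be sketched as follows. This is a simplified illustration under stated assumptions: the loci of the final model are taken as already chosen, a cell is called high risk when its cases equal or outnumber its controls, test observations falling in cells unseen during training are skipped, and the p-value uses the common (hits + 1)/(permutations + 1) estimate. None of these choices or names is taken from the description.

```python
import random
from collections import Counter

def fit_cells(rows, labels, loci):
    """Map each observed multifactor cell to 1 (high risk) or 0 (low risk)."""
    cases, controls = Counter(), Counter()
    for row, y in zip(rows, labels):
        cell = tuple(row[i] for i in loci)
        (cases if y == 1 else controls)[cell] += 1
    return {c: int(cases[c] >= controls[c]) for c in set(cases) | set(controls)}

def prediction_error(rows, labels, loci, n_folds=10, seed=0):
    """Mean error over the held-out folds of an n_folds cross-validation."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    fold_errors = []
    for f in range(n_folds):
        test = order[f::n_folds]          # roughly 1/10 of the data
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        model = fit_cells([rows[i] for i in train],
                          [labels[i] for i in train], loci)
        scored = []
        for i in test:
            cell = tuple(rows[i][j] for j in loci)
            if cell in model:             # empty cells cannot be predicted
                scored.append(model[cell] != labels[i])
        if scored:
            fold_errors.append(sum(scored) / len(scored))
    return sum(fold_errors) / len(fold_errors)

def permutation_p_value(rows, labels, loci, n_perm=200, seed=0):
    """Share of label permutations predicting at least as well as the data."""
    rng = random.Random(seed)
    observed = prediction_error(rows, labels, loci)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)             # breaks any genotype-status link
        if prediction_error(rows, shuffled, loci) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A fuller implementation would repeat the model-selection step inside every permutation rather than holding the loci fixed; the sketch keeps them fixed only for brevity.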

In its current form, MDR is best applied to case-control studies that are balanced (i.e., that have the same number of cases and of controls). Also, MDR is somewhat limited in its ability to make predictions for independent data sets when the dimensionality of the best model is relatively high and the sample is relatively small. High dimensionality and a small sample lead to many multifactor cells with either missing data or singleton data. This is not a problem for estimation of the classification error and evaluation of the cross-validation consistency, but it is a problem for estimation of the prediction error. For example, if there were one observation for each multifactor cell in n-dimensional space, then, during cross-validation, that one observation will end up in either the training data used to estimate the classification error or the test data used to estimate the prediction error but not in both. If the observation ends up in the test data, there will be, from the training data, no model (i.e., there will be an empty cell) to make a prediction. This greatly limits the number of observations for which predictions can be made in the test set and ultimately impacts the SEM of the prediction error.
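The sparsity problem described above can be made concrete with a small sketch. It assumes three genotypes per biallelic locus, which is consistent with the two dummy variables per locus mentioned earlier; the function names are hypothetical.

```python
def n_cells(k, genotypes_per_locus=3):
    """Number of multifactor cells spanned by k loci."""
    return genotypes_per_locus ** k

def unpredictable(train_rows, test_rows, loci):
    """Count test observations whose cell never occurs in the training data."""
    seen = {tuple(row[i] for i in loci) for row in train_rows}
    return sum(tuple(row[i] for i in loci) not in seen for row in test_rows)

if __name__ == "__main__":
    n_subjects = 400  # 200 cases and 200 controls, as in the present study
    for k in range(1, 7):
        per_cell = n_subjects / n_cells(k)
        print(f"{k} loci: {n_cells(k)} cells, {per_cell:.1f} subjects per cell")
```

With four loci there are already 81 cells, so 400 subjects average fewer than five per cell before the data are split into training and test portions.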

MDR is a powerful alternative to traditional parametric statistics, such as logistic regression. MDR has the ability to identify high-order (i.e., more than two) gene-gene interactions in relatively small simulated and real data sets. Although MDR addresses some of the limitations of the generalized linear model, there are several ways in which the method can be improved.

First, if MDR is going to be used for genome scans with hundreds to thousands of single-nucleotide polymorphisms, machine learning strategies may be used to optimize the selection of polymorphisms to be modeled, since an exhaustive search of all possible combinations will not be possible. Parallel genetic algorithms may be used as a robust machine learning approach.
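The scale of the exhaustive search can be illustrated by counting the candidate models. The sketch below uses only the binomial coefficient; the function name and the example sizes are illustrative assumptions.

```python
from math import comb

def n_models(n_snps, max_order):
    """Number of distinct combinations of 1 to max_order polymorphisms."""
    return sum(comb(n_snps, k) for k in range(1, max_order + 1))

if __name__ == "__main__":
    for n_snps in (10, 100, 1000):
        print(f"{n_snps} SNPs: {n_models(n_snps, 4):,} models up to four loci")
```

Ten polymorphisms give 385 models of up to four loci, whereas 1,000 polymorphisms give more than 4 × 10^10 four-locus combinations alone, each of which would have to be cross-validated.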

Second, it may be important to improve MDR's predictive ability in the higher dimensions. Several strategies can be used to improve the estimation of the prediction error. The first strategy uses a nearest neighbor method to determine whether an empty cell should be classified as high risk or as low risk; for example, if the majority of multilocus-genotype combinations within one step in n-dimensional space are classified as high risk, then the empty cell is also classified as high risk. The second strategy projects either a high risk or a low risk classification for an empty cell in a lower dimension; for example, the locus with the least frequent genotype might be removed from the model, and risk could then be determined from the equivalent genotypes in a lower dimension. These strategies may be compared to determine whether either improves the estimation of the prediction error when empty cells are present.
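The nearest-neighbor strategy might be sketched as follows. The sketch assumes that genotypes are coded 0, 1, 2 at each locus and that "within one step" means differing by one code at exactly one locus; both are assumptions of this illustration, as is the decision to leave ties unresolved.

```python
def neighbors(cell, n_levels=3):
    """Yield the cells that differ from `cell` by one step at one locus."""
    for i, genotype in enumerate(cell):
        for step in (-1, 1):
            if 0 <= genotype + step < n_levels:
                yield cell[:i] + (genotype + step,) + cell[i + 1:]

def classify_empty_cell(cell, model, n_levels=3):
    """Majority vote of classified neighbors: 1 high risk, 0 low risk.

    `model` maps observed cells to 1 or 0. Returns None when no neighbor
    has been classified or when the vote is tied.
    """
    votes = [model[n] for n in neighbors(cell, n_levels) if n in model]
    high = sum(votes)
    if not votes or 2 * high == len(votes):
        return None
    return int(2 * high > len(votes))
```

For example, with a hypothetical two-locus model `{(0, 1): 1, (1, 0): 1, (1, 2): 0}`, the empty cell `(1, 1)` has two high-risk neighbors and one low-risk neighbor and is therefore classified as high risk.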

Third, it may be important to modify MDR for the analysis of unbalanced case-control studies. Several different weighting schemes may be used for the case-control ratio that account for whether the total number of cases or the total number of controls is greater.
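One possible weighting scheme, offered only as an assumption of this sketch and not as the scheme contemplated above, compares each cell's case:control ratio with the overall case:control ratio of the sample instead of with a fixed value of 1. The function name and data layout are hypothetical.

```python
from collections import Counter

def pool_cells_weighted(cells, labels):
    """Call a cell high risk (1) when its case:control ratio is at least
    the overall case:control ratio of the sample, else low risk (0).

    cells  -- one hashable multifactor cell per subject
    labels -- 1 for cases, 0 for controls
    """
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    cases, controls = Counter(), Counter()
    for cell, y in zip(cells, labels):
        (cases if y == 1 else controls)[cell] += 1
    # cases/controls >= n_cases/n_controls, cross-multiplied to avoid
    # dividing by zero in cells that contain no controls
    return {c: int(cases[c] * n_controls >= controls[c] * n_cases)
            for c in set(cases) | set(controls)}
```

With 3 cases and 9 controls overall, a cell holding 2 cases and 4 controls is called high risk under this scheme, although a fixed threshold of 1 would call it low risk.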

Finally, simulation studies may be used to determine the strengths and the weaknesses of MDR in the presence of genotyping errors, phenocopies, genetic heterogeneity, and other phenomena that complicate the identification and characterization of functional polymorphisms. Data-reduction methods such as MDR will be invaluable for the identification and characterization of high-order gene-gene and high-order gene-environment interactions, when few degrees of freedom are available for parametric-statistical estimation of interaction effects.

According to exemplary embodiments, multifactor-dimensionality reduction (MDR) is introduced as a method for reducing the dimensionality of multilocus information, to improve the identification of polymorphism combinations associated with disease risk. The MDR method is nonparametric (i.e., no hypothesis about the value of a statistical parameter is made), is model-free (i.e., it assumes no particular inheritance model), and is directly applicable to case-control and discordant-sib-pair studies. Using simulated case-control data, the description above demonstrates that MDR has reasonable power to identify interactions among two or more loci in relatively small samples. When it was applied to a sporadic breast cancer case-control data set, in the absence of any statistically significant independent main effects, MDR identified a statistically significant high-order interaction among four polymorphisms from three different estrogen-metabolism genes. This is the first known report of a four-locus interaction associated with a common complex multifactorial disease.

It should be understood that the foregoing description and accompanying drawings are by example only. A variety of modifications are envisioned that do not depart from the scope and spirit of the invention. The above description is intended by way of example only and is not intended to limit the present invention in any way.

Claims

WHAT IS CLAIMED IS:

1. A method for reducing multidimensional data, comprising the steps of: out of a first data set including discrete independent variables, measuring n observations; determining a number of observations in which a dependent variable has a first value; determining a number of observations in which the dependent variable has a second value; forming combinations of the independent variables in the first data set to produce a second data set; in each combination of independent variables in the second data set, determining a ratio of the number of observations with the dependent variable having the first value to the number of observations with the dependent variable having the second value; comparing each ratio to one or more thresholds or ranges of thresholds to determine which combination of independent variables in the second data set optimally discriminates observations with the dependent variable having the first value and observations with the dependent variable having the second value; and pooling those combinations in the second data set that have ratios at approximately the same threshold or within the same range of thresholds to produce a third data set, wherein the third data set has a smaller number of independent variable combinations than the second data set.

2. The method of claim 1, wherein the independent variables are predictor or explanatory variables, and the dependent variable is a response variable.

3. The method of claim 2, wherein the independent variables include at least one of genetic and environmental factors.

4. The method of claim 3, wherein the first value of the dependent variable is disease indicative, and the second value is disease-free indicative.

5. The method of claim 4, wherein the third data set includes two independent variable combinations, one of which has a high ratio of disease indicative values to disease-free indicative values of the dependent variable and the other of which has a low ratio of disease indicative values to disease-free indicative values of the dependent variable.

6. An apparatus for reducing multidimensional data, comprising: means for measuring n observations out of a first data set including discrete independent variables; means for determining a number of observations in which a dependent variable has a first value; means for determining a number of observations in which the dependent variable has a second value; means for forming combinations of the independent variables in the first data set to produce a second data set; means for determining a ratio of the number of observations with the dependent variable having the first value to the number of observations with the dependent variable having the second value in each combination of independent variables in the second data set; means for comparing each ratio to one or more thresholds or ranges of thresholds to determine which combination of independent variables optimally discriminates observations with the dependent variable having the first value and observations with the dependent variable having the second value; and means for pooling those combinations in the second data set that have ratios at approximately the same threshold or within the same range of thresholds to produce a third data set, wherein the third data set has a smaller number of independent variable combinations than the second data set.

7. The apparatus of claim 6, wherein the independent variables are predictor or explanatory variables, and the dependent variable is a response variable.

8. The apparatus of claim 7, wherein the independent variables include at least one of genetic and environmental factors.

9. The apparatus of claim 8, wherein the first value of the dependent variable is disease indicative, and the second value is disease-free indicative.

10. The apparatus of claim 9, wherein the third data set has two independent variable combinations, one of which has a high ratio of disease indicative values to disease-free indicative values of the dependent variable and the other of which has a low ratio of disease indicative values to disease-free indicative values of the dependent variable.