WO2004031912A2 - Methods for estimating haplotype frequencies and disease associations with haplotypes and environmental variables - Google Patents

Methods for estimating haplotype frequencies and disease associations with haplotypes and environmental variables Download PDF

Info

Publication number
WO2004031912A2
WO2004031912A2 PCT/US2003/031186 US0331186W WO2004031912A2 WO 2004031912 A2 WO2004031912 A2 WO 2004031912A2 US 0331186 W US0331186 W US 0331186W WO 2004031912 A2 WO2004031912 A2 WO 2004031912A2
Authority
WO
WIPO (PCT)
Prior art keywords
trait
block
haplotype
markers
haplotypes
Prior art date
Application number
PCT/US2003/031186
Other languages
French (fr)
Other versions
WO2004031912A3 (en
Inventor
Lue Ping Zhao
Original Assignee
Fred Hutchinson Cancer Research Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fred Hutchinson Cancer Research Center filed Critical Fred Hutchinson Cancer Research Center
Priority to AU2003282907A priority Critical patent/AU2003282907A1/en
Publication of WO2004031912A2 publication Critical patent/WO2004031912A2/en
Publication of WO2004031912A3 publication Critical patent/WO2004031912A3/en

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40Population genetics; Linkage disequilibrium
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00Oligonucleotides characterized by their use
    • C12Q2600/156Polymorphic or mutational markers

Definitions

  • the present invention relates to methods of estimating haplotype frequencies and methods of associating haplotype frequencies and environmental factors with a trait.
  • a trait may be associated with a particular S ⁇ P genotype or with a distinct combination of contiguous S ⁇ P genotypes (S ⁇ P haplotype). For example, there is a haplotype association of lipoprotein lipase with coronary heart diseases, but no association is evident with individual S ⁇ Ps (Hallman et al. (1999) Ann. Hum. Genet. 63:499-510).
  • haplotypes from diploid individuals is a difficult task for individuals heterozygous at multiple SNPs. For example, if a sample has two SNP genotypes, A/C and A/G, then the two haplotypes could be either AA and CG or AG and CA.
  • Several approaches have been attempted to read alleles from separated chromosomes, such as dissecting a single chromosome or inserting an entire chromosome into a yeast artificial chromosome (YAC) (Green et al. (1998) in The Genetic Basis of Human Cancer, McGraw-Hill Health Professions Division, NY) or using rodent-human hybrid technique to physically separate two chromosomes (Patil et al.
  • genotyping Another approach is to infer haplotypes by genotyping first-degree relatives.
  • parental genotype data With parental genotype data, the phases (origin of parental copy of chromosome) of genotypes can be partially resolved (Wijsman (1987) Amer. J. Hum. Genet. : 41 :356-373).
  • gathering parental biological samples, if available, and genotyping them are expensive and may be ethically sensitive.
  • haplotype frequencies can be accurately estimated using expectation-maximization algorithms (EM).
  • EM expectation-maximization algorithms
  • maximum likelihood approaches have been developed to estimate population haplotype frequencies directly from genotype data in unrelated individuals (Excoffier & Slatkin (1995) Mol. Biol. Evol 12:921-927; Hawley & Kidd (1995) J Heredity 86:409-411; Long et al. (1995) Amer. J. Hum. Genet. 56:799-810).
  • Excoffier & Slatkin's EM algorithm is the most popular one and appears to require the fewest assumptions.
  • the EM algorithms are computationally burdensome when the number of SNPs exceeds 20.
  • SEs standard errors
  • haplotype frequencies and assessing the association of particular haplotypes with defined traits becomes increasingly practical as millions of SNPs are identified and are readily genotyped in population research. There exists a need for algorithms that accurately estimate haplotype frequencies from genotyped SNPs when the number of SNPs is large, and to provide SEs for the estimated haplotype frequencies. There also exists a need for methods for assessing disease associations with SNP haplotypes and environmental variables in case-control studies.
  • the invention provides a method for estimating haplotype frequencies for a set of markers in a sample population using an estimating equation (EE) method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in a sample population, wherein the phase information is incomplete, and wherein the method directly provides a standard error measurement for each estimated haplotype frequency.
  • EE estimating equation
  • the forward-block algorithm comprises distributing the markers into a plurality of blocks and (i) estimating block haplotype frequencies for a first block, wherein a block haplotype comprises a combination of genotypes for markers in a block; (ii) estimating block haplotype frequencies for a second block; (iii) estimating combination block haplotype frequencies for a first combination block using selected and pooled block haplotypes for the first and second blocks, wherein a combination block comprises two or more blocks, and wherein block haplotypes with greater than a predetermined minimum frequency are selected and block haplotypes that are not selected are pooled; (iv) estimating block haplotype frequencies for a third block; (v) estimating combination block haplotype frequencies for a second combination block using selected and pooled block haplotypes for the first combination block and the third block; and (vi) sequentially repeating steps (iv) and (v), wherein each repetition provides combination block haplotype frequencies for a combination block comprising an additional block.
  • the invention provides a method for assessing the association of a haplotype with a trait.
  • the method comprises two parts.
  • the first part of the method estimates a set of haplotype frequencies in a group of trait- positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown) and wherein the algorithm provides a standard error measurement for each estimated haplotype frequency.
  • the forward-block algorithm comprises distributing the markers into a plurality of blocks and determining haplotype frequencies in a step-wise fashion for both individuals in the trait- positive group and individuals in the trait-negative group, as described above for the methods of the first aspect of the invention.
  • the second part of the method estimates the differences in haplotype frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in haplotype frequencies.
  • the second aspect of the invention provides a haplotype- based method of diagnosing an increased risk of developing a trait.
  • the method comprises four parts.
  • the first part of the method estimates a set of haplotype frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown), and wherein the algorithm provides a standard error measurement for each estimated haplotype frequency.
  • the forward-block algorithm comprises distributing the markers into a plurality of blocks and determining haplotype frequencies in a step-wise fashion for both individuals in the trait-positive group and individuals in the trait-negative group, as described above.
  • the second part of the method estimates the differences in haplotype frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in haplotype frequencies.
  • the third part of the method derives one or more haplotypes that are significantly associated with the trait.
  • the fourth part of the method diagnoses an increased risk of developing the trait in a trait- negative individual by determining the presence of a pattern that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
  • the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with a trait.
  • the method comprises two parts.
  • the first part of the method estimates pattern frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of one or more haplotypes or diplotypes at one or more loci, one or more environmental factors, or a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown), and wherein the algorithm provides a standard error measurement for each estimated pattern frequency.
  • the forward-block algorithm comprises distributing the markers in each set of markers into a plurality of blocks and determining pattern frequencies in a step-wise fashion for each set of markers in both trait-positive individuals and trait-negative individuals.
  • the forward-block algorithm comprises the steps of: (i) estimating pattern frequencies for a first block, wherein pattern comprises a block haplotypes and one or more environmental factors and a block haplotype comprises a combination of genotypes for markers in a block; (iii) estimating pattern frequencies for a second block; (iii) estimating pattern frequencies for a first combination block using selected and pooled patterns for the first and second blocks, wherein a combination block comprises two or more blocks, and wherein patterns with greater than a predetermined minimum frequency are selected and patterns that are not selected are pooled; (iv) estimating pattern frequencies for a third block; (v) estimating pattern frequencies for a second combination block using selected and pooled patterns for the first combination block and the third block; and (vi) sequentially repeating
  • the second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies.
  • the third aspect of the invention provides a haplotype- based method of diagnosing an increased risk of developing a trait. In this embodiment, the method comprises four parts.
  • the first part of the method estimates pattern frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of one or more haplotypes or diplotypes at one or more loci, one or more environmental factors, or a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown), and wherein the algorithm provides a standard error measurement for each estimated pattern frequency.
  • the forward-block algorithm comprises distributing the markers in each set of markers into a plurality of blocks and determining pattern frequencies in a step-wise fashion for each set of markers in both trait-positive individuals and trait- negative individuals, as described above.
  • the second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies.
  • the third part of the method derives one or more patterns that are significantly associated with the trait.
  • the fourth part of the method diagnoses an increased risk of developing the trait in a trait- negative individual by determining the presence of a pattern that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
  • the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with multiple phenotypes or traits.
  • the methods of the invention may be used for any study design, for example case- control studies , observational studies, cohort studies, or clinical trials.
  • the markers used in any of the methods of the invention may have two alleles (biallelic markers) or multiple alleles (multiallelic markers).
  • Representative markers include, but are not limited to, microsatellite markers or single nucleotide polymorphisms (SNPs).
  • SNPs single nucleotide polymorphisms
  • the markers are biallelic SNPs.
  • the set of markers may comprise more than about 12 markers, such as more than about 20 markers, such as more than about 50 markers, such as more than about 100 markers, or such as several hundreds of markers.
  • each block may comprise at least about 3 markers, such as at least about 5 markers, such as at least about 10 markers, or such as at least about 20 markers. In some embodiments the genotype of at least some of the markers is unknown.
  • the sample population in any of the methods of the invention may comprise more than about 10 individuals, more than about 50 individuals, more than about 100 individuals, more than about 500 individuals, or more than about 1000 individuals.
  • the invention also provides a computer-readable medium having computer- readable instructions for performing the methods of the invention.
  • the invention also provides a computer-system having a processor, a memory, and an operating environment, the computer system operable for performing the methods of the invention. BRIEF DESCRIPTION OF THE DRAWINGS
  • FIGURE 1 shows a schematic diagram representing the forward-block algorithm.
  • FIGURE 2 shows a schematic flowchart for the simulations.
  • FIGURE 3 shows plots of the average discrepancy of the estimates of haplotype frequencies.
  • the plots in the top panel are the average discrepancy ⁇ ⁇ j - ⁇ j
  • the plots in the lower panel are the average discrepancy y ⁇ ⁇ j - ⁇ j ⁇ summed over all common haplotypes) of the estimated standard error j compared to the sample standard deviation of the estimate based on 500 simulations under HWE, HWD(l), and HWD(2), respectively. The simulations are carried out for a five-locus system.
  • FIGURE 4 shows the discrepancy of the estimated haplotype frequencies using the EE method.
  • the solid line represents the average discrepancy over 1,000 simulated data sets and the two dashed lines represent the 5 and 95 percentile of 1 ,000 discrepancies.
  • FIGURE 5 shows Table 1 describing estimated haplotype frequencies (standard errors) using the EE and estimated haplotype frequencies using Bayesian method
  • ARHGDIB on chromosome 12 core SNP, G4923a6, G4923a7, G4923al0, G4923al l, G4923al2, G4923al8, G4923al5, G4923al6, G4923al7, G4923a22, G4923a26, G4923a28, G4923a44, G4923a46, G4923a35, G4923a37, and G4923a38.
  • Complete genotype for all 44 individuals are available for the 8 SNPs indicated in italics, the remaining SNPs have missing genotypes for one or more individuals (0: reference allele, 1 : variant allele).
  • FIGURE 6 shows Table 2 describing estimated haplotype frequencies (standard errors) using both the EE and EM algorithms and estimated haplotype frequencies using Bayesian methods, PHASE and HAPLOTYPER, from 44 individuals with complete genotype data for 8 SNPs within the gene ARHGDIB on chromosome 12: core SNP, G4923a6, G4923a7, G4923al2, G4923a26, G4923a44, G4923a46, and G4923a35 (0: reference allele, 1 : variant allele).
  • the standard errors of EM estimates of haplotype frequencies were computed from 1,000 bootstrap samples.
  • FIGURE 7 shows Table 3 describing estimated allelic frequencies among cases and controls, their allelic differences, standard errors and Z-statistics, and estimated odds ratios with bootstrap confidence intervals
  • FIGURE 8 shows Table 4 describing estimated Z-statistics, odds ratios, and their 95% confidence intervals for haplotype associations with two adjacent SNPs within APOC3.
  • FIGURES 9(a)-(d) show the distributions of maximum Z-statistics, 90% and 95% threshold values obtained through 1,000 permutations, (a) shows the frequency of maximum Z-scores using 2 adjacent SNPs; (b) shows the maximized Z-scores at every SNP locus when two adjacent SNPs are used ; (c) and (d) are similar to (a) and (b) using three adjacent SNPs.
  • the two threshold values in (b) are 2.86 and 2.54), in (d) are 3.05 and 2.82.
  • FIGURE 10 shows Table 5 describing estimated Z-statistics, OR, and their 95% confidence intervals for haplotype associations at three adjacent SNPs within APOC3.
  • FIGURE 11 shows Table 6 describing estimated Z statistics, OR, and their 95% confidence intervals for haplotype associations at four adjacent SNPs within APOC3.
  • FIGURE 12 shows Table 7 describing estimated Z statistics, OR, and their 95% confidence intervals for haplotype associations at five adjacent SNPs within APOC3.
  • FIGURE 13 shows Table 8 describing estimated Z statistics, OR, and their 95% confidence intervals for haplotype associations at six adjacent SNPs within APOC3.
  • FIGURE 14 shows Table 9 comparing the estimating equation approach (EE) with the EM algorithm in estimating haplotype frequencies and their standard errors.
  • FIGURE 15 shows Table 10 describing a Monte Carlo simulation study under the null hypothesis with no covariates.
  • Avg. bias is the average bias of estimate of parameter over 1,000 duplicates.
  • Std is the standard deviation of 1,000 estimates of parameter.
  • Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates. There are 300 cases and 300 controls in each set of simulation data.
  • FIGURE 16 shows Table 11 describing a Monte Carlo simulation study under the c null hypothesis with covariates.
  • FIGURE 17 shows Table 12 describing a Monte Carlo simulation study under the c null hypothesis with varying sample sizes.
  • FIGURE 18 shows Table 13 describing a Monte Carlo simulation study under the c alternative hypothesis with no covariates.
  • Avg. bias is the average bias of estimate of parameter over 1,000 duplicates.
  • Std is the standard deviation of 1,000 estimates of parameter.
  • Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates.
  • FIGURE 19 shows Table 14 describing a Monte Carlo simulation study under the alternative hypothesis with covariates.
  • Avg. bias is the average bias of estimate of parameter over 1,000 duplicates. Std is the standard deviation of 1,000 estimates of parameter. Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates. There are 300 cases and 300 controls in each simulation data.
  • FIGURE 20 shows Table 15 describing a case-control study under the null hypothesis with covariates.
  • I( ) is an indicator function;
  • h 2 and A- are the second and third common haplotypes.
  • the risk factors Gender and Age are omitted from the estimation model and only all common haplotypes are modeled in a case-control study with 300 cases and 300 controls.
  • Avg. bias is the average bias of estimate of parameter over 1,000 duplicates.
  • Std is the standard deviation of 1,000 estimates of parameter.
  • Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates.
  • ⁇ 5 is the coefficient of I(h 4 ) in the estimation model.
  • the invention provides a method for estimating haplotype frequencies in a population using an estimating equation method that provides a standard error measurement for each estimated haplotype frequency.
  • the invention provides a method assessing the association of a haplotype with a trait, comprising estimating haplotype frequencies for a group of trait-positive individuals and a group of trait-negative individuals using an estimating equation method that provides a standard error measurement for each estimated haplotype frequency; and estimating the differences in haplotype frequencies for the trait-positive group and the trait-negative group.
  • the invention provides a method for assessing the association of a trait with a pattern comprising: (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, and/or (3) a combination of haplotypes or diplotypes and environmental factors, using an estimating equation method that provides standard error measurements.
  • the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with multiple phenotypes.
  • the invention also provides a computer-readable medium having computer-readable instructions for performing the methods of the invention, and a computer-system having a processor, a memory, and an operating environment, the computer system operable for performing the methods of the invention. Unless specifically defined herein, all terms used herein have the same meaning as they would to one skilled in the art of the present invention.
  • the term "genetic marker” or “marker” refers to genome-derived polynucleotides which are sufficiently polymorphic to allow a reasonable probability that a randomly selected individual will be heterozygous, and thus informative for genetic analysis by methods such as linkage analysis or association studies.
  • the human haploid genome contains a 3xl0 9 base- long double stranded DNA shared among the 24 chromosomes. Each human individual is diploid, i.e., possesses two haploid genomes, one of paternal origin, the other of maternal origin. The sequence of the human genome varies among individuals in a population.
  • polymorphism refers to the occurrence of two or more alternative genomic sequences or alleles between or among different genomes or individuals. “Polymorphic” refers to the condition in which two or more variants of a specific genomic sequence can be found in a population. A “polymorphic site” is the locus at which the variation occurs.
  • a single nucleotide polymorphism is the replacement of one nucleotide by another nucleotide at the polymorphic site. Deletion of a single nucleotide or insertion of a single nucleotide also gives rise to single nucleotide polymorphisms.
  • the term "single nucleotide polymorphism” or "SNP" refers to which ' are genome-derived polynucleotides markers that exhibit allelic polymorphism.
  • allele refers to a variant of a nucleotide sequence. All alleles on a single chromosome are from one of the two parents. At a given polymorphic site, any individual (diploid), can be either homozygous (twice the same allele) or heterozygous (two different alleles).
  • haplotype refers to a particular combination of alleles at multiple markers, such as SNPs, present in an individual. Haplotype information is specified by the parental origin (phase) of individual marker alleles. Across short distances (usually 10's of kbp), haplotypes are conserved over the evolutionary time-scale (Drysdale et al. (2000) Proc. Natl Acad. Sci.
  • the term “genotype” refers the identity of the alleles present in an individual or a sample.
  • the term “genotyping” a sample or an individual for a marker involves determining the specific allele or the specific nucleotide carried by an individual at a biallelic marker.
  • the term “diplotype” refers to the identity of the alleles on both chromosomes in an individual, i.e., a pair of haplotypes.
  • the term “Hardy- Weinberg Equilibrium” or “HWE” refers to the assumption that haplotypes are randomly paired to form individuals' diplotypes.
  • a “trait” or “genetic trait” or “phenotype” refers to a measurable characteristic present in some individuals of a population and absent in others.
  • a given polymo ⁇ hism or rare mutation can be either neutral, i.e., it has no effect on the trait, or functional, i.e., it is responsible for a particular genetic trait.
  • the preferred traits contemplated within the present invention relate to fields of therapeutic interest, for example susceptibility to a disease, or drug response reflecting drug efficacy, toxicity, or other side effects related to treatment.
  • Traits may also relate to any other desirable or undesirable characteristic in a population of individuals, such as a bovine trait to produce tender, high-quality beef (see Adam (2002) Nature 417:778) or a disease-resistance in a plant.
  • a bovine trait to produce tender
  • high-quality beef see Adam (2002) Nature 417:778
  • a disease-resistance in a plant.
  • the term “trait-positive individuals” or “cases” refers to individuals in which a particular trait is present.
  • twin-negative individuals or “controls” refers to individuals in which a particular trait is absent.
  • Traits can either be "binary”, e.g. diabetic vs. non-diabetic, "quantitative”, e.g. elevated blood pressure, or "censored”, e.g., time-to-onset of a disease outcome.
  • Individuals affected by a quantitative trait can be classified according to an appropriate scale of trait values, e.g., blood pressure ranges. Each trait value range can then be analyzed as a binary trait. Individuals showing a trait value within one such range are compared to individuals showing a trait value outside of this range. In such a case, genetic analysis methods are applied to subpopulations of individuals showing trait values within defined ranges.
  • a trait can be categorical (e.g., blood types), ordinal (e.g., stages of breast cancer), censored (e.g., observations for study participants who do not have the event of interest during the period of follow-up), or provide frequency information.
  • categorical e.g., blood types
  • ordinal e.g., stages of breast cancer
  • censored e.g., observations for study participants who do not have the event of interest during the period of follow-up
  • environmental variable refers to non-genetic contributions to the presence or absence of a trait.
  • Environmental factors include, for example, age, sex, weight, height, nutrition, life-styles, smoking, alcohol-consumption, work habits, history of medications, medical history, family history, leisure activities, etc.
  • pattern refers to either (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, or (3) a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors.
  • the presence of a particular pattern may associate a trait with the combination of a haplotype or diplotype at one locus with a haplotype or diplotype at another locus, or it may associate a trait with one or more environmental factors, or it may associate a trait with a combination of a haplotype or diplotype at one ore more loci with one or more environmental factors.
  • locus refers to the specific location of a marker in the genome.
  • the terms "significantly associated,” “significantly positively associated,” and “significantly negatively associated” all refer to statistical significance. Statistical significance, is used herein as it is typically used by those with skill in the art. It is a measure of the probability that an observed difference would have been observed simply by chance and is not the result of a "real" difference between two groups, for example. Thus, the lower the probability that the observed difference would have happened by chance, the less likely that it happened by chance. Statistical significance is based on p- values. A p-value ⁇ 0.05 is typically considered statistically significant, although in some instances a p-value of ⁇ 0.01 or even ⁇ 0.005 or ⁇ 0.001 is preferred.
  • a statistically significant positive association between a haplotype and a trait means that the presence of a haplotype correlates with the presence of the trait
  • a statistically significant negative association between a haplotype and a trait means that the presence of a haplotype correlates with the absence of the trait.
  • the invention provides a method for estimating haplotype frequencies for a set of markers in a sample population using an estimating equation (EE) method, wherein the genotypes of at least some of the markers are known for each individual in a sample population, wherein the phase information for the markers is incomplete, and wherein the algorithm used directly provides a standard error measurement for each estimated haplotype frequency without using bootstrap methods.
  • EE estimating equation
  • genotyping protocols result purely in genotype information; they produce information about the pair of alleles an individual possesses at each locus, but not necessarily haplotype information which would reveal the alleles that have been inherited together on the same paternal or maternal chromosome. Without explicit haplotype information, there is ambiguity with respect to the origin of alleles at neighboring loci. For example, it is difficult to determine if there are differences in the frequency of certain haplotypes between individuals with a disease ("cases”) and individuals without the disease (“controls”) in the absence of haplotype information.
  • haplotype frequencies from genotype data gathered on a sample of individuals is based on the fact that the haplotypes of some individuals in the sample are unambiguous. This allows the ambiguous haplotypes to be estimated using statistical predictions. Individuals that are unambiguous with respect to phase or haplotype information have homozygous genotypes either at all relevant loci or at all but one relevant locus. Individuals with two or more heterozygous genotypes have more than one possible haplotype configuration compatible with their genotype data, and hence are ambiguous with respect to phase or haplotype information.
  • the EE method of the first aspect of the invention is motivated by the likelihood approach for missing data problems (Efron (1994) J. Am. Stat. Assoc. 89:463-475; Heitjan (1994) Biometrika 81 :701-708; Rubin (1996) J Am. Stat. Assoc. 91:473-489).
  • the phase of a heterozygous SNP is unresolved, and may be treated as missing data.
  • the missingness of phase is Missing At Random (Heitjan (1994) Biometrika 81 :701-708) in the sense that the missing mechanism depends on observed genotype data, rather than missed phase information. Note that if the parental origin (phase) of an allele at any single heterozygous locus is assumed to be fixed, it may serve as a reference phase for other SNPs within the same individual.
  • the EE method of the first aspect of the invention is described in EXAMPLE 1.
  • the derivations of the likelihood and estimating equations are described in the first section of EXAMPLE 1.
  • the likelihood function used in the method of the invention is essentially the same as the likelihood function that was derived in Excoffier & Slatkin (1995) Mol. Biol. Evol 12:921-927. This likelihood function was also used in the two Bayesian methods (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989; Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169).
  • the EE method uses a forward-block computational algorithm, which is computationally efficient algorithm that permits the resolution of phase information when the number of SNPs is larger than about 20.
  • the forward-block algorithm used in the EE method of the invention is described in the second section of EXAMPLE 1.
  • the forward-block algorithm carries out computations in a stepwise fashion as shown in FIGURE 1.
  • the set of markers e.g., SNPs
  • a block may contain from about 3 to about 100 markers.
  • haplotype frequencies for the first two blocks are estimated separately.
  • the first two blocks are then joined and the haplotype frequencies for the enlarged block are estimated using the estimation results of the first two blocks as initial values.
  • haplotype frequencies for the next single block are estimated, and the single block is added to the enlarged block.
  • the estimations are done sequentially for each enlarged block and next single block. This process is continued until all the blocks are joined.
  • haplotypes with insignificant frequencies are filtered out.
  • the markers used in the methods of the invention may be biallelic or multiallelic markers.
  • Exemplary markers are microsatellite markers and SNPs.
  • the markers are biallelic SNPs.
  • the set of markers may comprise more than about 12 markers, more than about 20 markers, more than about 50 markers, more than about 100 markers, between about 200 and about 1000 markers, or several thousands of markers.
  • the genotype of at least some of the markers is unknown.
  • the sample population may comprise more than about 40 individuals, more than about 100 individuals, more than about 500 individuals, or more than 1000 individuals.
  • the EE method of the invention directly provides a standard error measurement for each estimated haplotype frequency using estimation equation theory and can also handle data sets that include missing genotype values.
  • the estimations of the standard error for any haplotype frequency is as shown in the third section of EXAMPLE 1.
  • the adjustment of the covariance matrix for missing genotypes is described in the fourth section of EXAMPLE 1.
  • the EE method also permits the inference of haplotypes of individuals.
  • the phase of an individual's genotype can be predicted to provide a pair of haplotypes with a probability statement, as shown in the fifth section of EXAMPLE 1.
  • the EE method accurately estimates haplotype frequencies and their standard errors (SEs) under the assumption of Hardy- Weinberg Equilibrium (HWE) and under conditions when the model assumption is violated, as shown in EXAMPLE 2.
  • SEs standard errors
  • HWE Hardy- Weinberg Equilibrium
  • the estimated haplotype frequencies and their SEs using the EE method are consistent even when the sample size is small (see FIGURES 2 and 3).
  • Another simulation study shows that the haplotype frequencies estimated using the EE method are accurate even when the number of markers (e.g., SNPs) is large, as shown in EXAMPLE 2 and FIGURE 4.
  • the number of individuals can be several thousands, and the number of markers can be several hundreds.
  • the estimation of haplotype frequencies from a dataset of 632 individuals and 296 markers was completed in approximately 2 minutes, as shown in EXAMPLE 2.
  • the estimated haplotype frequencies and their SEs using the EE method of the invention compare favorable with those calculated with three existing methods: Arlequin software (Schneider et al. (2000) http//lgb.unige.ch/arelquin), which is an implementation of the EM method proposed in Excoffier & Slatkin (1995), PHASE software, which is an implementation of the Bayesian method (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989), and HAPLOTYPER software, which is an implementation of the Bayesian method (Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169).
  • the EE method Utilizing the forward-block-computational algorithm and estimating equation technique, the EE method is able to handle the data with large numbers of SNPs and estimate SEs directly without using bootstraps. Unlike the EM method, The EE method can also handle the data with missing genotypes. Compared with the EM method for estimating haplotype frequencies, the EE method yields consistently smaller estimates for SEs, as documented in EXAMPLE 4 and FIGURE 14. In addition, the computations using the EE methods were at least 120 times faster than using the EM algorithm in a dataset including 6 markers and 437 individuals, as shown in EXAMPLE 4.
  • haplotype frequencies can be estimated using Bayesian approach.
  • two Bayesian methods (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978- 989; Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169) were proposed to reconstruct individuals' haplotypes from their genotypes.
  • both maximum likelihood and Bayesian methods can be used to reconstruct individuals' haplotypes and estimate haplotype frequencies.
  • an additional assumption is required for the conditional distribution of individual haplotype given all haplotypes in the first Bayesian approach (PHASE, Stephens et al. (2001) Amer. J. Hum.
  • An extension of the EE method may be used if an individual's parental genotypes are also available.
  • the accuracy of the estimates may be improved by inco ⁇ orating parental genotypes in estimating equation (9) in EXAMPLE 1.
  • the improvement can be very valuable, especially for estimating frequencies of rare haplotypes, which is not reliable unless a very large number of individuals are genotyped.
  • the EE method may be used to estimate haplotype frequencies for multi-allelic markers, such as microsatellite markers.
  • the EE method may also be modified by inco ⁇ orating available population genetic models to estimate haplotype frequencies for an isolated population.
  • the EE method may be used together with the forward block computational algorithm to systematically identify haplotype blocks such that there are only limited haplotypes within each block. With identified blocks and haplotypes, it is possible to determine which SNPs are essential for defining the haplotypes and which SNPs are redundant, obviating the need to genotype non-essential SNPs.
  • the estimating equation approach may be used for assessing trait associations with haplotypes.
  • SNP genotypes are the study of their association with complex traits, such as cancers or coronary heart diseases.
  • case-control study a widely accepted design strategy is the case-control study, which has been established through decades of epidemiological research (Breslow & Day (1980) Statistical Methods in Cancer Research, Int'l Agency for Research on Cancer, Lyon; Schlesselman (1982) Case-Control Studies: Design, Conduct, Analysis, Oxford University Press, NY).
  • a typical case-control study identifies a sample of trait-positive individuals ("cases") and a sample of matched, trait-negative individuals (“controls”) from a well-defined population.
  • haplotypes are associated with haplotypes, rather than with any individual marker within the haplotype. For example, Hallman and co-workers noted a haplotype association of lipoprotein lipase with coronary heart diseases (Hallman et al. (1999) Ann.
  • haplotypes Without relying on experimental methods or on parental data, another class of methods infers haplotypes from multiple markers. Unambiguous haplotypes among cases and controls are identified, ignoring remainder haplotypes (Haviland et al. (1995) Ann. Hum. Genet. 59:211-231). As expected, ignoring partially informative haplotypes implies a loss of efficiency, and this difficulty becomes more significant as the number of sites assayed increases. Typically, such a method is applicable to at the most three or four markers. To inco ⁇ orate partially observed haplotypes, another method is to infer haplotypes based upon empirical distributions, tolerating a degree of misclassification error (Hallman et al. (1999) Ann. Hum. Genet.
  • the method of the second aspect of the invention uses an estimating equation approach for analyzing case-control marker data to associate haplotypes with a trait, as shown in EXAMPLES 4 and 5.
  • the method infers haplotype frequencies statistically, and correlates these with case-control status, thereby establishing associations of particular haplotypes with the disease phenotype.
  • the method of the invention has two parts. In the first part, the method estimates haplotype frequencies for a set of markers with incomplete phase information in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation algorithm, which directly provides a standard error measurement for each estimated haplotype frequency.
  • a forward-block algorithm carries out computation of this part of the method in a stepwise fashion as shown in FIGURE 1.
  • the set of markers (e.g., SNPs) is divided into several block.
  • a block may contain between about 3 and about 100 markers.
  • haplotype frequencies for the first two blocks are estimated separately. The first two blocks are then joined and the haplotype frequencies for the enlarged block is estimated using the estimation results of the first two blocks as initial values.
  • haplotype frequencies for the next single block are estimated, and the single block is added to the enlarged block. The estimations are done sequentially for each enlarged block and next single block. This process is continued until all the blocks are joined. At each step, haplotypes with insignificant frequencies are filtered out.
  • the EE method of the invention directly provides a standard error measurement for each estimated haplotype frequency using estimation equation theory and can also handle data sets that include missing genotype values.
  • the estimations of the standard error for any haplotype frequency is as shown in the third section of EXAMPLE 1.
  • the adjustment of the covariance matrix for missing genotypes is described in the fourth section of EXAMPLE 1.
  • the method estimates the differences in frequencies of individual haplotypes for the set of markers in the trait-positive group and in the trait- negative group using an estimating equation algorithm, which directly provides a standard error measurement for each estimated difference in haplotype frequency.
  • the invention provides a haplotype- based method for diagnosing an increased risk of development a trait.
  • the method comprises four parts.
  • the first part of the method estimates a set of haplotype frequencies in a group of trait-positive individuals and in a group of trait- negative individuals using an estimating equation method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers is incomplete, and wherein the algorithm provides a standard error measurement for each estimated haplotype frequency.
  • the forward-block algorithm comprises distributing the markers into a plurality of blocks and determining haplotype frequencies in a step-wise fashion for both individuals in the trait-positive group and individuals in the trait-negative group.
  • haplotype frequencies are estimated for the subset of markers in the first block.
  • the second step comprises estimating haplotype frequencies for the subset of markers in the second block.
  • the haplotype frequencies are estimated from a combination of selected and pooled haplotypes for the first block and selected and pooled haplotypes for the second block, wherein a selected block haplotype has greater than a predetermined minimum frequency and a pooled block haplotypes is a haplotype that is not selected.
  • Step four comprises sequentially repeating steps two and three, wherein during each repetition the pooled and selected haplotypes for one additional block is added to the selected and pooled haplotypes for the previous combination of blocks.
  • the second part of the method estimates the differences in haplotype frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in haplotype frequencies.
  • the third part of the method derives one or more haplotypes that are significantly associated with the trait.
  • the fourth part of the method diagnoses an increased risk of developing the trait in a trait-negative individual by determining the presence of a pattern that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
  • the likelihood derivation used in the EE method in a case-control study is described in the first section of EXAMPLE 4.
  • the motivation of the estimating equation is described in the second section of EXAMPLE 4.
  • the strategy used is to search for haplotype associations progressively, in multiple steps, as shown in the third section of EXAMPLE 4.
  • the permutation of the case/control study under the null hypothesis that markers and their haplotypes do not associate with disease status to compute the maximum Z-statistic is described in the fourth section of EXAMPLE 4.
  • An efficient computational implementation of the estimating equation method for use in the case-control study is described in the fifth section of EXAMPLE 4.
  • FIGURES 7-13 show the estimated allele or haplotype frequencies, estimated odds ratios (ORs), and their 95% confidence intervals for cases and controls.
  • This study shows the feasibility of using the method of the invention with many markers.
  • This EE approach for case-control studies has several important features. First, it does not require family data to infer haplotype information. Of course, if family data are available, they can be used to improve haplotype information.
  • the method estimates haplotype frequencies, and establishes their associations (but not necessarily causal associations) with a phenotype, without the need to assay haplotypes directly.
  • this method is easily generalized to inco ⁇ orate many markers, and this feature is important in view of increasing knowledge of SNPs in the genome and early evidence for haplotype specific effects.
  • this method is more efficient than the analysis with individual markers, because the actual number of common haplotypes is much smaller than the theoretically possible number of haplotypes.
  • Sixth, because of its computational efficiency, this method is easily scaled to deal with a large number of SNPs (e.g., >100) collected from a large number of subjects (e.g., >1000).
  • This method of the second aspect of the invention if applied only to controls, has a close connection with Expectation-Maximization (EM) algorithms developed for estimating haplotype frequencies in a general population (Excoffier & Slatkin (1995) Mol Biol Evol. 12:921-927; Hawley & Kidd (1995) J. Heredity 86:409-411; Long et al. (1995) Amer. J. Hum. Genet. 56:799-810).
  • the EE method has the advantage of readily estimating standard errors for haplotype frequency estimates, while an EM algorithm may not.
  • Arlequin Excoffier and co-workers used a parametric bootstrap method to estimate standard errors. As expected, the computing burden from the bootstrap is substantial, and is as much as 120 times slower than the EE method.
  • a modification of the EE method may be used if an individual's parental genotypes are also available.
  • the accuracy of the estimates may be improved by inco ⁇ orating parental genotypes in estimating equation (23) in EXAMPLE 4.
  • the application of such a modified likelihood function will improve the efficiency in deriving distributions of haplotypes and hence the efficiency of the estimation.
  • the improvement can be particularly valuable especially for estimating frequencies of rare haplotypes, which is not reliable unless a very large number of individuals are genotyped.
  • the EE method may be used to estimate haplotype frequencies for multi-allelic markers, such as microsatellite markers.
  • the method may be modified by inco ⁇ orating population genetic models.
  • the method of the invention may be extended to analyze quantitative phenotypes.
  • the invention provides a method for assessing trait associations with haplotypes or diplotypes and environmental factors in case-control studies. Interactions between genes and environmental factors has been of interest to genetic epidemiology (Khoury et al. (1993) Fundamentals of Genetic Epidemiology, Oxford University Press, New York). In recent years, researchers in pharmacological research have been very interested in studying the interactions of drugs and genes, the field for which is known as pharmacogenomics. Additionally, researchers in clinical sciences have been interested in personalized medicine in the sense that physicians can prescribe medical treatment based upon patients' genotypes. Characterization of the association between traits and genes, independently and/or interactively with environmental factors, can also improve the prediction, diagnosis, and prognosis of disease or other traits in an individual.
  • a gene may also interact with another gene at a functional level (epistasis). Given the current understanding on genetic regulatory circuitry, it is believed that multiple candidate genes, rather than a single gene, likely play a role in most of chronic diseases. Consequently, multiple genes may jointly penetrate to disease phenotype in the form of gene/gene interactions. The interactions could occur as haplotype-haplotype interaction, or as diplotype-diplotype interaction.
  • the method of the invention treats haplotypes, if unknown, as latent variables and constructs an estimating equation by integrating out these latent haplotypes, resulting in an estimating equation for estimating association parameters of interest.
  • the various notations and assumptions, and the estimation procedure is shown in EXAMPLE 6.
  • the derivation of the estimating equation is described in EXAMPLE 7. Under the assumption that the traits are uncommon, this conditional probability may be approximated as described in EXAMPLE 8.
  • the method of the invention provides an assessment of associations of a trait with haplotypes or with diplotypes, as shown in EXAMPLE 6.
  • the method estimates the association of a trait with a combination of haplotypes or diplotypes and environmental factors.
  • the method also provides estimates of associations of interacting haplotypes or diplotypes with a trait.
  • case/control status is permutated under the null hypothesis that markers and their haplotypes do not associate with trait status to compute the Z-statistic over all selected haplotypes, as shown in EXAMPLE 6.
  • the estimated association parameters are unbiased and inferences retain the desired false error rate, provided that the logistic regression model in equation (26) holds.
  • the method for associating one or more haplotypes or diplotypes for a set of markers at one or more loci and one or more environmental factors with a trait comprises two parts.
  • the first part of the method estimates pattern frequencies in a group of trait- positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, and (3) a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers is incomplete, and wherein the algorithm provides a standard error measurement for each estimated pattern frequency.
  • the forward-block algorithm comprises a series of steps for both individuals in the trait-positive group and individuals in the trait-negative group.
  • each set of markers is divided into a plurality of blocks.
  • a block may contain between about 3 and about 100 markers.
  • the first part of the method comprises the following steps: (1) estimating pattern frequencies for block 1, wherein a pattern comprises a haplotype with or without the one or more environmental factors; (2) estimating pattern frequencies for block 2 haplotypes; (3) estimating the pattern frequencies for a combination of the first and the second blocks using of selected and pooled patterns for block 1 and selected and pooled patterns for block 2, wherein a selected pattern has greater than a predetermined minimum frequency and a pooled pattern is a haplotype that is not selected; (4) sequentially repeating steps (2) and (3), wherein during each repetition the patterns for one additional block is added to the previous combination of blocks.
  • the second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies.
  • the third aspect of the invention provides a haplotype- based method of diagnosing an increased risk of developing a trait. In this embodiment, the method comprises four parts.
  • the first part of the method estimates pattern frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, and (3) a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers is incomplete, and wherein the algorithm provides a standard error measurement for each estimated pattern frequency.
  • the forward-block algorithm comprises a series of steps for both individuals in the trait-positive group and individuals in the trait-negative group.
  • each set of markers is divided into a plurality of blocks.
  • a block may contain between about 3 and about 10 markers.
  • the first part of the method comprises the following steps: (1) estimating pattern frequencies for block 1 haplotypes with or without the one or more environmental factors, wherein a block 1 haplotype comprises the subset of markers in a first block; (2) estimating pattern frequencies for block 2 haplotypes with or without the one or more environmental factors, wherein a block 2 haplotype comprises the subset of markers in a second block, (3) estimating the pattern frequencies for a combination of selected and pooled block 1 haplotypes with or without environmental factors and selected and pooled block 2 haplotypes with or without environmental factors, wherein a selected block haplotype has greater than a predetermined minimum frequency and a pooled block haplotypes is a haplotype that is not selected; (4) sequentially repeating steps (2) and (3), wherein during each repetition the pooled and selected haplotypes with or
  • the second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies.
  • the third part of the method derives one or more patterns that are significantly associated with the trait.
  • the fourth part of the method diagnoses an increased risk of developing the trait in a trait-negative individual by determining the presence of a pattem that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
  • the validity of the method of this aspect of the invention was established using
  • the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with multiple phenotypes.
  • Yet another aspect of the invention provides computer programs and systems for implementing the haplotype-based methods of the invention.
  • the computation steps of the previous methods are implemented on a computer system or on one or more networked computer systems in order to provide a powerful and convenient facility for forming and testing network models of biological systems.
  • the computer system can include but is not limited to a hand-held device, a server computer, a desktop personal computer, a portable computer or a mobile telephone.
  • a representative computer system is a single hardware platform comprising internal components and being linked to external components. The internal components of this computer system include processor elements interconnected with main memory.
  • the computer system includes a processing unit, a display, an input/output (I/O) interface and a mass memory, all connected via a communication bus, or other communication device.
  • the I/O interface includes hardware and software components that facilitate interaction with a variety of the monitoring devices via a variety of communication protocols including TCP/IP, XI 0, digital I/O, RS-232, RS-485 and the like. Additionally, the I/O interface facilitates communication via a variety of communication mediums including telephone land lines, wireless networks (including cellular, digital and radio networks), cable networks and the like.
  • the I/O interface is implemented as a layer between the server hardware and software applications. It will be understood by one skilled in the relevant art that alternative interface configurations can be practiced with the present invention.
  • the external components include mass storage.
  • the mass memory generally comprises a RAM, ROM, and a permanent mass storage device, such as a hard disk drive, tape drive, optical drive, floppy disk drive, or combination thereof.
  • the mass memory stores an operating system for controlling the operation of the premises server. It will be appreciated that this component can comprise a general piupose server operating system as is known to those skilled in the art, such as UNIX, LINUX, or Microsoft WINDOWS NT.
  • the memory also includes a WWW browser, such as Netscape's NAVIGATOR or Microsoft's Internet Explorer browsers, for accessing the WWW.
  • This mass storage can be one or more hard disks (which are typically packaged together with the processor and memory).
  • Computer system is also linked to other local computer systems, remote computer systems, or wide area communication networks, such as the Internet. This network link allows computer system to share data and processing tasks with other computer systems.
  • a software component represents the operating system, which is responsible for managing computer system and its network interconnections. This operating system can be, e.g., of the Microsoft Windows family, a UNIX operating system, or a LINUX-based operating system. Another software component represents common languages and functions conveniently present on this system to assist programs implementing the methods specific to this invention.
  • Languages that can be used to program the analytic methods of this invention include C, C++, or, less preferably, JAVA.
  • the methods of this invention are programmed in mathematical software packages which allow symbolic entry of equations and high-level specification of processing, including algorithms to be used, thereby freeing a user of the need to procedurally program individual equations or algorithms.
  • Such packages include, e.g., MATLAB from Mathworks (Natick, MA), MATHEMATICA from Wolfram Research (Champaign, 111.), and MATHCAD from Mathsoft (Cambridge, MA).
  • the analytical methods of the invention can be programmed in a procedural language or symbolic package.
  • the mass memory generally comprises a RAM, ROM, and a permanent mass storage device, such as a hard disk drive, tape drive, optical drive, floppy disk drive, or combination thereof.
  • the mass memory stores an operating system for controlling the operation of the premises server. It will appreciated that this component can be comprised of a general-pu ⁇ ose server operating system as is known to those skilled in the art, such as UNIX, LINUX, or Microsoft WINDOWS NT.
  • the memory also includes a WWW browser, such as Netscape's NAVIGATOR or Microsoft's INTERNET EXPLORER browsers, for accessing the WWW.
  • the mass memory also stores program code and data for interfacing with various premises monitoring devices, for processing the monitoring device data and for transmitting the data to a central server. More specifically, the mass memory stores a device interface application in accordance with the present invention for obtaining monitoring device data from a variety of devices and for manipulating the data for processing by the central server.
  • the device interface application comprises computer-executable instructions which, when executed by the premises server obtains and transmits device data as will be explained below in greater detail.
  • the mass memory also stores a data transmittal application program for transmitting the device data to a central server and to facilitate communication between the central server and the monitoring devices. It will be appreciated that these components can be stored on a computer-readable medium and loaded into the memory of the premises server using a drive mechanism associated with the computer-readable medium, such as a floppy,
  • CD-ROM CD-ROM, DVD-ROM drive, or network drive.
  • This example describes the estimating equation (EE) method, along with a computationally efficient algorithm.
  • genotype g ⁇ can be represented by a pair of haplotypes namely, (H' : H, 2 ) . Under the assumption of the Hardy- Weinberg
  • the function f(H ; ⁇ ) may be represented by a multinomial distribution function: m ⁇ rff) (4)
  • This likelihood function is essentially the same as the likelihood function that was derived in Excoffier & Slatkin (1995) Mol. Biol. Evol. 12:921-927. This likelihood function was also used in the two Bayesian methods (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989; Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169).
  • V is the covariance matrix of multinomial distribution, which is a semi- positive definite and constant over all individuals. Consequently, solving score estimation equations in (6) for ⁇ are equivalent to solving the following equations, which are called estimating and denoted by U( ⁇ ) ,
  • we first find the partial haplotypes corresponding to the genotype data g lU ,i l,---,n in the first block.
  • R is the number of haplotypes whose estimated frequencies are above ⁇ in the first block.
  • ⁇ Jl2k ⁇ ⁇ k * ⁇ J ⁇ (k+l)2k
  • ⁇ ( ⁇ ) (dU( ⁇ ) / d ⁇ ) ⁇ l var [U( ⁇ )] (dU( ⁇ ) I (11) dU( ⁇ )
  • var [[( ?)] ⁇ E p (F t ⁇ g x ; ⁇ E p (F, '
  • the covariance matrix i of ⁇ can be estimated by ⁇ ( ⁇ ) and the standard error of ⁇ is estimated by the square root of diagonals of ⁇ ( ⁇ ) .
  • genotypes on some loci cannot be determined therefore are treated as missing data.
  • the missing genotype is associated with the choice of probe sequences, experimental conditions or random fluctuations. It is reasonable to assume that the missing mechanism is either Missing
  • haplotypes have been identified as disease-associated haplotypes and if the penetrance from the haplotypes to the disease phenotype is specified, an inference of haplotypes from an individual's genotypes at multiple SNP loci can be used to predict the individual's risk of having the disease.
  • phase f(P, ⁇ g, > ' ⁇ ) given in equation (8), we can predict the phase of individual's genotype and, therefore, provide a pair of haplotypes with probability statement.
  • EXAMPLE 2 Simulation Study This example describes a simulation study to assess the accuracy of estimates under the HWE model assumption and when the model assumption is violated, and when the number of SNPs is large.
  • Sample haplotype frequencies are estimated from the sampled individuals' genotypes with known phase.
  • Step E we estimate haplotype frequencies and standard errors (SE) from the sampled individuals' genotypes using the EE method discussed in EXAMPLE 1 and ignoring the phase information.
  • Step G and Step E we repeat Step G and Step E for 500 times after each Step G. Then, we repeat Step G for 20 times.
  • HWD(l) we assume that one out of five SNP loci carries a lethal mutant allele and that individuals with a heterozygous genotype (0/1) at that locus have a 50% chance for survival and individuals with a homozygous variant genotype (1/1) have no chance for survival.
  • HWD(2) we assume that one haplotype is associated with a disease and that individuals with one copy of the haplotype have a 75% chance for survival and individuals with two copies have a 50% chance for survival.
  • the simulation results are shown in FIGURE 3.
  • the top panels present the average discrepancy of the estimated haplotype frequencies from the true population frequencies (dashed line) and the average discrepancy of the estimated haplotype frequencies from the sample frequencies (solid line) under the assumption of HWE, HWD(l), and HWD(2), respectively.
  • the discrepancy of the estimated haplotype frequencies from the true population frequencies assesses the overall validity of the final estimates of haplotype frequencies with respect to the true population values.
  • the discrepancy of the estimated haplotype frequencies from the sample frequencies assesses the accuracy of the estimation method. The difference between the two discrepancies is due to sampling error. The difference approaches to zero when the sample sizes approach infinity.
  • the simulation results show that the estimated haplotype frequencies by the EE method are consistent even when sample sizes are small.
  • the discrepancies are not affected significantly by the departures of HWE. These observations are consistent to the observations done in Fallin and Slatkin (2000) for the EM method (Fallin & Schork (2000) Amer. J. Hum. Genet. 65:947-959).
  • the bottom panels present the average discrepancy of the estimated SE using the EE method from the sample SD of the estimates. The small discrepancy indicates that the EE method gives correctly estimates of standard errors. The discrepancy is not affected by the departures of HWE.
  • the average number of common haplotypes for sample sizes of 30, 50,100, 150, 200, 500, and 1,000 is 2.3, 3.6, 4.7, 5.0, 5.3, 11.8, and 20.9 under HWE, 2.1, 3.3, 4.3, 4.8, 5.1, 10.5, and 17.2 under HWD(l), and 2.4, 3.6, 4.6, 5.0, 5.2, 10.7, and 20.0 under HWD(2).
  • the number of common haplotypes increases with sample size because a haplotype is considered "common” if its estimated frequency multiplied by twice the sample size is at least 5.
  • Step S we generate genotypes for 100 individuals. Each individual's genotype is generated from the generated haplotype population under the assumption of HWE. The rest of the simulation procedure is as discussed at the beginning of this example. The simulation result is shown in FIGURE 4. Similarly, the top panels show the average discrepancy of the estimated haplotype frequencies from the population haplotype frequencies (dashed line) and the average discrepancy the estimated haplotype frequencies from the sample haplotype frequencies (solid line).
  • the bottom panels show the average discrepancy of the estimated standard errors of the estimates of haplotype frequencies from the sample standard deviation of the estimates of 500 replicates.
  • the average number of common haplotypes observed from 100 individuals' genotypes is between 6 and 12 regardless of the recombination rate.
  • the running time depends linearly on the number of SNPs, the number of subjects and complexity of the data set.
  • SNPs were with complete genotypes on all 44 individuals, the sample size used in that study. The rest of the SNPs were missing genotypes in one or more individuals. Because only our EE method and HAPLOTYPER can handle the missing genotype data, we first analyzed the whole data set using the two methods. The EE method found 20 haplotypes and HAPLOTYPER found 21 haplotypes.
  • haplotypes 16 haplotypes were found by both methods.
  • the haplotypes that were found by only one method had estimated frequencies less than 1%.
  • the SE of the estimate for common haplotypes was estimated in the EE method and are given in parentheses after the corresponding estimate of haplotype frequency. Comparing the estimates of haplotype frequencies using the EE method and HAPLOTYPER, one can see that HAPLOTYPER produced a higher estimate for the more frequent haplotype than the EE method.
  • the reason for this phenomenon is that the estimates of haplotype frequencies are estimated from the best reconstruction of individuals' haplotypes. Because each individual contributes paired haplotypes, other most frequent haplotypes' frequencies might be underestimated. If we use the reconstructed the individuals' haplotypes for a subsequent analysis (here the estimate of population haplotype frequencies), misclassification errors can result in biased estimations.
  • EXAMPLE 4 The Use of the EE Method in a Case-Control Study 1.
  • the phase (parental origin) of an individual allele, e.g., g l , within a subject may or may not be known.
  • p t (p n ,p i2 ,- - -,p iq ) denote a vector of phase indicators; p..
  • denotes parameters of interest to be estimated from the data, including haplotype frequencies and their differences between cases and within controls. From the likelihood (13), it is clear that the derivation of the likelihood function is equivalent to the derivation of an individual distributional function.
  • the distributional function f(g, ⁇ p,,d l ) specifies the association of genotypes and hence haplotypes with the disease phenotype.
  • ⁇ d ( ) may be represented by a multinomial distribution function:
  • ⁇ t (d t ) denotes the frequency of the tth haplotype
  • ⁇ t (d i ) a t + ⁇ t d i , (18) where or, represents the frequency of the tth haplotype in controls, and (a t + ⁇ t ) represents the frequency in cases.
  • the difference ⁇ t if not equal to zero, indicates that the tth haplotype frequency is different in cases and controls.
  • odds ratio is more commonly used as a measure of association.
  • the above likelihood function encompasses the haplotype likelihood function derived by Excoffier and Slatkin (1995), when the disease status is ignored.
  • the score estimating equation (21) can be solved for the maximum likelihood estimate, but the computational burden of enumerating all possible phases
  • the estimate and its standard error can then be used to construct a 95% confidence interval for the difference between cases and controls of the tth haplotype as [ ⁇ t -1.96 ⁇ ⁇ t , ⁇ , +1.96 ⁇ ⁇ t ] .
  • Constructing a test statistic based upon the estimating function U( ⁇ , ⁇ ) is also possible.
  • haplotype-based associations When analyzing multiple SNPs, one has to rely on a systematic strategy for analyzing all potential haplotype-based associations. While an optimal strategy remains to be investigated, we consider an analytic strategy that searches for haplotype- associations progressively, in multiple steps. The first step is to examine allelic association with the case/control disease phenotype. The second step is to correlate haplotypes with two adjacent SNPs with the disease phenotype. Progressively, one can add more SNPs in assessing associations of phenotype with haplotypes of three, four, or more adjacent SNPs. The primary rationale is that the physical adjacency implies the linkage-disequilibrium and hence the haplotype-based associations with adjacent SNPs.
  • equation (25) is iterated until convergence in all parameters is reached, thus obtaining the estimate ⁇ .
  • equation (25) is iterated until convergence in all parameters is reached, thus obtaining the estimate ⁇ .
  • a similar procedure can be applied to cases to yield the estimate ( + ⁇ ) .
  • the choice of the initial value is important, given the high dimensionality of many haplotypes.
  • the second issue relates to the convergence criteria.
  • convergence in the estimating function U(a, ⁇ ) (23)
  • convergence in estimates we examine whether the estimated parameters stabilize and if the increments of updating parameters approach zero.
  • the third technical issue relates to the numerical derivative matrix of the estimating function U(a, ⁇ ) .
  • U(a, ⁇ ) we calculate the difference of da U( ⁇ + ⁇ , ⁇ ) and U( ⁇ , ⁇ ) divided by a small ⁇ , where ⁇ may be chosen as 10 "5 .
  • the primary reason for choosing ⁇ instead of just ⁇ , is to ensure precision when ⁇ is small.
  • EXAMPLE 5 This example compares the use of the EE method and the EM method using data from an actual case-control study.
  • DNA from all 779 patients were genotyped using a panel of biallelic sites that are candidate markers for atherosclerotic disease risk.
  • This panel represents an expansion of a previously described assay (Cheng et al. (1999) Genome Res. 9:936-949). Briefly, the assay uses multiplex PCR and immobilized sequence-specific probes to detect amplified alleles. All probes have been validated using DNAs of known genotype, as confirmed by sequencing. Most of the candidate genes in the panel were represented by one or two polymo ⁇ hisms, but multiple genes were typed in a few genes, including apolipoprotein CIII (APOC3), a gene encoding a component of plasma lipoproteins.
  • APOC3 apolipoprotein CIII
  • APOC3 was represented by six biallelic sites: three in the promoter region, one in exon 3, two in the 3'-untranslated region of exon 4 (Cheng et al. (1999) Genome Res. 9:936-949). APOC3, thus offered us an opportunity to illustrate the estimating equation approach.
  • Paired SNP loci are listed row-wise, (1,2), (2,3), (3,4), (4,5), (5,6), where the locus numbers are associated with C(-641)A, C(-482)T, T(-455)C, C1100T, C3175G, T3206G, respectively.
  • All possible haplotypes, except the reference haplotype 11, are listed column-wise.
  • haplotype 00 for locus 1,2 is equivalent to the haplotype AT at C(-641)A, C(- 482)T.
  • FIGURE 9(a) shows the distribution of maximized test statistics after 1,000 permutations. It appears that the estimated maximum test statistic centers around 2 with a heavy tail towards the right. Within the same 1,000 permutations, we computed maximum Z-statistics at every loci (FIGURE 9(b), solid line). Based upon this distribution, we estimated the threshold values for 90% and 95%) significance levels (dotted lines). At loci 4 and 5, the observed Z-statistics remains significant at the 90% level, but not at the 95% level. That is, the statistical significance associated with the haplotype TC at C1100T and C3175G, is between 90% and 95%.
  • Table 5 in FIGURE 10 shows Z-statistics, odds ratios and their 95% confidence intervals for haplotype associations with three adjacent SNPs.
  • Table 5 lists four possible adjacencies, i.e., 123, 234, 345 and 456, in the columns, while all possible haplotypes, excluding 111, which is the reference haplotype, are listed by row.
  • Rare haplotypes were merged with more common haplotypes, and are thus omitted from the table.
  • the rare haplotype 110 at locus 123 i.e., haplotype CCC at C(-641)A, C(-482)T and T(-455)C, was merged with other uncommon haplotypes.
  • haplotype 010 at locus 456 i.e., haplotype TCG at C1100T, C3175G and T3206G
  • FOGURE 9(c) the distribution of maximum Z statistics
  • FIG. 9(d) the threshold values at 90% and 95%) level
  • haplotype 0110 at locus 3456 i.e., haplotype CCCG at T(-455)C, C1100T, C3175G and T3206G, has a reduced risk associated with the phenotype.
  • Haplotype TCG (010) which was significant at locus 456, extends to haplotypes CTCG (0010) or TTCG (1010) at locus 3456. That neither haplotype was significantly associated with the phenotype is inconsistent with TCG at locus 456 identified earlier.
  • Table 7 in FIGURE 12 and Table 8 in FIGURE 13 show Z statistics, odds ratios and their confidence intervals for haplotype associations with five and six SNP loci, respectively. It appears that maximum Z-statistics become much smaller, even though the 95% CIs seem to suggest several genes with marginally significant.
  • This exercise shows the feasibility of performing haplotype analysis with many SNPs. Indeed, when the number of SNP loci increases, the total number of haplotypes does not increase exponentially with the number of loci. The number of common haplotypes appears to be around 10, even though the theoretical number of all possible haplotypes could be far greater. This result is consistent with the recent observations by (Drysdale et al. (2000) Proc. Natl Acad. Sci. U.S.A. 97:10483-10488).
  • Table 9 in FIGURE 14 lists estimated haplotype frequencies and their standard errors. Estimated frequencies by both methods are identical up to the 5 l decimal point, so only one column of haplotype frequencies is shown in the second column of Table 9.
  • the third and fourth columns in Table 7 show estimated standard errors by EM algorithm and EE approach. While the estimates obtained by the EM algorithm and the EE approach are largely consistent, the EE approach yields consistently smaller estimates for standard errors. The smaller standard errors reflect the efficiency of the EE approach in estimating haplotype frequencies. Nevertheless, it is quite comforting to learn that both EM algorithm and EE approach yield comparable results. However, the EM approach was computationally less efficient. In this application, the EE computations were at least 120 times faster than EM. Computation speed is important in the design of analytic strategies for dealing with complex situations: 1) dealing with many more SNPs, 2) testing many combinations of SNP loci, adjacent or non-adjacent, and 3) using a permutation method to estimate significance level.
  • EXAMPLE 6 Trait Associations with SNP Haplotypes and Environmental Variables This example describes the application of the EE method for assessing trait associations with SNP haplotypes and environmental variables.
  • the analytic objective is to correlate haplotypes of multiple SNPs with the disease phenotype, i.e., the penetrance from haplotypes and covariates to disease phenotype.
  • haplotypes of multiple SNPs i.e., the penetrance from haplotypes and covariates to disease phenotype.
  • penetrance function we assume a logistic regression that relates haplotypes and covariables with disease phenotype via:
  • I(h),h 2 ,x t , ⁇ ) is a function of haplotypes and covariables (A?,/., 2 ,*,) with ⁇ as regression coefficients.
  • this function I(h],h 2 x t , ⁇ ) may choose the following function:
  • the rationale for choosing the logistic regression function includes: 1) regression coefficients ( ⁇ x , ⁇ 2 ) are rather inte ⁇ retable as log odds ratios, and log odds ratios approximate log relative risk when the disease incidence rate is low (Prentice & Pyke (1979) Biometrika 66(3):403- 11); 2) The logistic regression technique has been well studied in biostatistical literature and its statistical properties are well known; and 3) The logistic regression is routinely applied to epidemiological studies to inte ⁇ ret results from case-control studies (Rothman & Greenland (1998) Modern Epidemiology, Lipincott-Raven Publishers, Philadelphia), and is thus readily accepted to study gene/environmental interactions.
  • the logistic regression defined by equation (26) is a penetrance function quantifying the disease probability given paired haplotypes and covariates, and is not directly estimable case-control studies.
  • the intercept a specifying the baseline prevalence rate, is unspecified in case-control studies (Prentice & Pyke (1979) Biometrika 66:403-11; Whittemore (1995) Biometrika 82:57-67).
  • the analytic objective is to estimate parameters ( , ?) via an estimating equation, the derivation for which is detailed in EXAMPLE 7.
  • an estimating function for ( ⁇ , ⁇ ) if phases were known. Since phases are largely unknown, one can treat them as latent variables. After obtaining a posterial distribution of phases given all observed data (phenotypes, genotypes and covariates), one can integrate out these latent haplotypes via conditional expectation of estimating function given observed data. Setting the integrated estimating function to zero results in an estimating equation for estimating ( ⁇ , ⁇ ) . 2.
  • X i is the partial derivative of I(h),h 2 ,x t , ⁇ ) with respect to ⁇
  • an f(Pi ⁇ gi.d j ,x i ) is the posterial probability of latent phases given observed data.
  • 0 is a matrix of appropriate dimension
  • conditional means, variances and covariances are computed in the usual way.
  • Raphson method to estimate all parameters that satisfy the estimating equation (32). Starting from an initial value ( ⁇ (0) , ⁇ (0) , ⁇ (0) ) , one can iterate to a new value
  • the covariance matrix is easily estimable, and may be written as
  • I( ⁇ , ⁇ , ⁇ ) ⁇ var[u( ⁇ , ⁇ , ⁇ )]l( ⁇ , ⁇ , ⁇ )- 1 , (34) in which all quantities are evaluated at their respective estimates with a variance matrix va ⁇ [u( ⁇ , ⁇ , ⁇ )] is estimated by ⁇ . i
  • test statistic based upon the estimating function U( ⁇ , ⁇ , ⁇ ) (32).
  • Haplotype-Specific Effects An immediate interest is to assess the disease associations with haplotypes.
  • Let h denote H common haplotypes of interest.
  • To test this null hypothesis one may use the above Z-statistics (35).
  • Diplotype-Specific Effects While haplotype-based associations are of interest, the disease association could also be genotype-specific, that is, the disease associations are with one or more genotypes, formed by paired haplotypes (diplotype). Diplotype- based associations may be categorized into four different penetrance modes: being dominant, or being recessive, or being additive, or being co-dominant. To capture the mode of diplotype associations, one needs to re-code corresponding genotypes under each mode of penetrance. Suppose that h is the target haplotype of interest. Under a dominant mode, we use the following indicator function,
  • K(h],ht) 1 One of/?, 1 and h equals /?
  • I(h),h 2 x l , ⁇ ) ⁇ n K (h),h 2 ) + ⁇ n K 2 (ti l ,h 2 ) + ⁇ 2 'x l and
  • h. and h 2 are two co-dominant haplotypes under considerations. Indeed, this model encompasses both dominance and additive modes. Specifically, if one of two coefficients i ⁇ , ⁇ 12 ) equals zero, the model implies the dominant. If two coefficients ( ⁇ u , ⁇ l2 ) are equal, the above model implies the additive mode.
  • ⁇ 3 ( ⁇ ,—, ⁇ 3H )' quantifies the interactions of all candidate haplotypes with the dosage. Indeed, one can postulate other models to depict interactions that may be dominant, recessive or co-dominant, besides the additive mode described above. Interactions among Candidate Genes: In addition to gene/environmental interactions, one candidate gene may also interact with another candidate gene at a functional level, which is known as epistasis. The interactions could occur as haplotype- haplotype interaction, or as genotype-genotype interaction.
  • K 2 (h] 2 ,h 2 2 ) are two indicator functions for candidate genotype 1 and 2, respectively, and the log odds ratio ⁇ i quantifies the interaction of interest.
  • TDT Transmission Disequilibrium Test
  • the idea is to permute case/control status under the null hypothesis that SNPs and their haplotypes do not associate with disease status.
  • This example describes the derivation of the estimating equations.
  • EXAMPLE 8 Approximation of Conditional Probability This example describes the approximation of the conditional probability.
  • the probability function may be written as ft ⁇
  • the marginal disease probability Pr(-/, l
  • the disease probability is approximated by exp[ ⁇ + /(/?,', h', Xl , ⁇ )]
  • Pr( ⁇ , 1 l 2 . , ): exp[ ⁇ + /(?;, h;, Xl , ⁇ )], l + exp[ ⁇ + I(h), h 2 ,x lake ⁇ )] because exp[ ⁇ + I(h), h 2 ,x t , ⁇ )] is much smaller than one. Substituting these approximations into the above probability function, one obtains, for cases, an approximated function,
  • f(p t ⁇ g,,d l ,x t ) may be represented by exp ⁇ /?, 1 ,/?, 2 ⁇ ,, ⁇ )]/ ⁇ , 1 ,/?, 2 ) f(P, ⁇ g handedd constrainx,) ⁇
  • EXAMPLE 9 Derivation of Derivative Matrix for the Joint Estimating Equation This example describes the derivation of the derivative matrix for the joint estimating equations.
  • ⁇ Ai d i- ⁇ dfiPAgi ⁇ A - ⁇ EiX ⁇ ' ⁇ XA ⁇ d ⁇ g ⁇ + ⁇ ⁇ X ⁇ - ⁇ -fip ⁇ g ⁇ x,)
  • This example describes a Monte Carlo simulation study.
  • the first statistic is the biases in estimating log odds ratios.
  • ⁇ j denote they ' th estimated regression coefficient in (26) from the rth replicate.
  • the second statistic is the biases in estimating standard errors.
  • SEj SE ( ⁇ j ) denote the estimated standard error for they ' th log odds ratio ⁇ ⁇ r from the rth replicate.
  • the third statistic is the coverage probability, measuring how frequently the confidence interval, ⁇ ⁇ r -Z g SE ⁇ j +Z e SE ⁇ r ] at the significance level of ⁇ , covers the true value ⁇ ⁇ .
  • the rationale for choosing the coverage probability is that it is reliable measurement under both null and alternative hypothesis. If the estimation and inference are appropriate, the coverage probability should be around 1- ⁇ . 3. Under the Null Hypothesis with No Covariates
  • haplotypes For consistence throughout the simulation studies, we always use the haplotype with the highest haplotype frequency as the reference. In the current simulation, we focus on three haplotypes, which are considered as common haplotypes in the entire population. In this case, we simulate phenotypes in the general population by the following penetrance function:

Abstract

The invention provides methods for estimating haplotype frequencies using an estimating equation method and a forward-block algorithm that provides a standard error measurement for each estimated haplotype frequency. The methods may be used, for example, for estimating a set of haplotype frequencies in a population, for assessing the association of a haplotype with a trait, and for assessing the association of haplotypes for one or more set of markers and one or more environmental factors with a trait. The invention also provides a computer-readable medium having computer-readable instructions for performing the methods of the invention, and a computer-system for performing the methods of the invention.

Description

METHODS FOR ESTIMATING HAPLOTYPE FREQUENCIES AND DISEASE ASSOCIATIONS WITH HAPLOTYPES AND ENVIRONMENTAL VARIABLES
This invention was made with government support under grant numbers HG02283 and CA106320-04 awarded by the National Institutes of Health. The government has certain rights in the invention.
CROSS-REFERENCE TO RELATED APPLICATION This application claims the benefit of U.S. Provisional Application No. 60/415,028, filed October 1, 2002, under 35 U.S.C. § 119, which application is herein incorporated by reference.
FIELD OF THE INVENTION The present invention relates to methods of estimating haplotype frequencies and methods of associating haplotype frequencies and environmental factors with a trait.
BACKGROUND OF THE INVENTION The completion of the Human Genome Map (Ventner et al. (2001) Science
291:1304-1351) marked the end of the genomic era and the beginning of the post- genomic era, with the expectation that the rich genomic information would promote our understanding of basic biology and improve human health (International Human Genome Sequencing Consortium (2001) Nature 409:860-921). A preliminary examination of the human genome sequence that there may be 30,000 to 40,000 functional genes (International Human Genome Sequencing Consortium (2001) Nature 409:860-921).
Much of the variation between human individuals can be attributed to variations in genomic DNA sequences, and most variations are nucleotide substitutions which arose only once in human evolutionary history (Wang et al. (1998) Science 280:1077-1082; Cargill et al. (1999) Nature Genet. 22:231-238; Halushka et al. (1999) Nαtwre Genet. 22:239-247). A nucleotide substitution at a single nucleotide base pair is called a single nucleotide polymorphism (SΝP). Over a million SΝPs have already been discovered throughout the human genome, including in many coding regions (International Human Genome Sequencing Consortium (2001) Nature 409:860-921). The total number of SΝPs with moderate frequency (> 10%) in the human genome has been estimated at approximately 5.3 million (one SΝP per 600 base pairs on average) (Kruglyak & Νickerson (2001) Nature Genet. 27:234-236). Coupled with recently developed array technologies (Chee et al. (1996) Science 274:610-614; Wang et al. (1998) Science 280:1077-1082), biomedical researchers are rapidly able to generate thousands of SNP genotypes. One important application of these recent advances is to study variations between SNPs and to correlate genotype with trait phenotype as a way of identifying trait-associated genes (Chakravarti (1998) Nature Genetics 19:216-217; Chakravarti (1999) Nαtwre Genet. Suppl. 21:-60). A trait may be associated with a particular SΝP genotype or with a distinct combination of contiguous SΝP genotypes (SΝP haplotype). For example, there is a haplotype association of lipoprotein lipase with coronary heart diseases, but no association is evident with individual SΝPs (Hallman et al. (1999) Ann. Hum. Genet. 63:499-510). It has become clear that, instead of single SΝPs one at a time, testing for SΝP haplotypes is essential to the detection of susceptibility genes, especially in the presence of allelic heterogeneity (Daly et al. (2001) Nat. Genet. 29:229-232; Pritchard (2001) Am. J. Hum. Genet. 69:124- 137. Haplotype methods have contributed to the identification of genes for Mendelian disease and disorders that are both common and complex in inheritance (see, e.g., Kerem et al. (1989) Science 245:1073; Hastbacka et al. (1992) Nature Genet. 2:204; Puffenberger et al. (1994) Cell 79:1257; Rioux et al. (2001) Nature Genet. 29:223; Hugot et al. (2001) Nature 411:599; Ogura et al. (2001) Nature 411 :603).
The unambiguous determination of haplotypes from diploid individuals is a difficult task for individuals heterozygous at multiple SNPs. For example, if a sample has two SNP genotypes, A/C and A/G, then the two haplotypes could be either AA and CG or AG and CA. Several approaches have been attempted to read alleles from separated chromosomes, such as dissecting a single chromosome or inserting an entire chromosome into a yeast artificial chromosome (YAC) (Green et al. (1998) in The Genetic Basis of Human Cancer, McGraw-Hill Health Professions Division, NY) or using rodent-human hybrid technique to physically separate two chromosomes (Patil et al. (2001) Science 294:1719-1723). However, such direct biochemical and cell biological phase resolution methods remain extremely difficult to automate and costly, low-throughput, and prone to experimental errors (Judson & Stephens (2001) Pharmacogenomics 2:7-10). Therefore, they are not appropriate for any large scale population studies.
Another approach is to infer haplotypes by genotyping first-degree relatives. With parental genotype data, the phases (origin of parental copy of chromosome) of genotypes can be partially resolved (Wijsman (1987) Amer. J. Hum. Genet. : 41 :356-373). However, gathering parental biological samples, if available, and genotyping them are expensive and may be ethically sensitive.
Several statistical approaches to predict phase from unrelated individuals have been developed to avoid the difficulties posed by the absence of parental in pedigree- based approaches. Historically, the most popular approach has been Clark's parsimony method (Clark (1990) Mol Biol. Evol 7:111-122), although recently, two Bayesian methods have provided substantial improvements over prior methods in accuracy and scalability to handle a larger number of SNPs (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989; Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169). However, when the number of SNPs becomes large, the inferred haplotypes are reconstructed with less certainty regardless of which a statistical methodology is used. If the reconstructed haplotypes are used for subsequent analyses, misclassification errors are inevitable and may mislead in further analyses.
Alternatively, haplotype frequencies can be accurately estimated using expectation-maximization algorithms (EM). Several maximum likelihood approaches have been developed to estimate population haplotype frequencies directly from genotype data in unrelated individuals (Excoffier & Slatkin (1995) Mol. Biol. Evol 12:921-927; Hawley & Kidd (1995) J Heredity 86:409-411; Long et al. (1995) Amer. J. Hum. Genet. 56:799-810). Among these maximum likelihood approaches, Excoffier & Slatkin's EM algorithm is the most popular one and appears to require the fewest assumptions. However, the EM algorithms are computationally burdensome when the number of SNPs exceeds 20. Even when the number of SNPs is less than 20, EM algorithms may fail to yield estimates of standard errors (SEs) if the information matrix is singular. This frequently occurs when the number of possible haplotypes is large and estimated frequencies of some of the haplotypes approach zero. Consequently, bootstrap methods have been used to estimate SEs (Schneider et al. (2000) Arlequin version 2.000, Genetics and Biometry Laboratory, University of Geneva, Switzerland, http://lgb.unige.ch/arelquin).
Estimating haplotype frequencies and assessing the association of particular haplotypes with defined traits becomes increasingly practical as millions of SNPs are identified and are readily genotyped in population research. There exists a need for algorithms that accurately estimate haplotype frequencies from genotyped SNPs when the number of SNPs is large, and to provide SEs for the estimated haplotype frequencies. There also exists a need for methods for assessing disease associations with SNP haplotypes and environmental variables in case-control studies.
SUMMARY OF THE INVENTION In a first aspect, the invention provides a method for estimating haplotype frequencies for a set of markers in a sample population using an estimating equation (EE) method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in a sample population, wherein the phase information is incomplete, and wherein the method directly provides a standard error measurement for each estimated haplotype frequency. Generally, the forward-block algorithm comprises distributing the markers into a plurality of blocks and (i) estimating block haplotype frequencies for a first block, wherein a block haplotype comprises a combination of genotypes for markers in a block; (ii) estimating block haplotype frequencies for a second block; (iii) estimating combination block haplotype frequencies for a first combination block using selected and pooled block haplotypes for the first and second blocks, wherein a combination block comprises two or more blocks, and wherein block haplotypes with greater than a predetermined minimum frequency are selected and block haplotypes that are not selected are pooled; (iv) estimating block haplotype frequencies for a third block; (v) estimating combination block haplotype frequencies for a second combination block using selected and pooled block haplotypes for the first combination block and the third block; and (vi) sequentially repeating steps (iv) and (v), wherein each repetition provides combination block haplotype frequencies for a combination block comprising an additional block.
In a second aspect, the invention provides a method for assessing the association of a haplotype with a trait. In some embodiments, the method comprises two parts. The first part of the method estimates a set of haplotype frequencies in a group of trait- positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown) and wherein the algorithm provides a standard error measurement for each estimated haplotype frequency. Generally, the forward-block algorithm comprises distributing the markers into a plurality of blocks and determining haplotype frequencies in a step-wise fashion for both individuals in the trait- positive group and individuals in the trait-negative group, as described above for the methods of the first aspect of the invention. The second part of the method estimates the differences in haplotype frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in haplotype frequencies.
In another embodiment, the second aspect of the invention provides a haplotype- based method of diagnosing an increased risk of developing a trait. In this embodiment, the method comprises four parts. The first part of the method estimates a set of haplotype frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown), and wherein the algorithm provides a standard error measurement for each estimated haplotype frequency. Generally, the forward-block algorithm comprises distributing the markers into a plurality of blocks and determining haplotype frequencies in a step-wise fashion for both individuals in the trait-positive group and individuals in the trait-negative group, as described above. The second part of the method estimates the differences in haplotype frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in haplotype frequencies. The third part of the method derives one or more haplotypes that are significantly associated with the trait. The fourth part of the method diagnoses an increased risk of developing the trait in a trait- negative individual by determining the presence of a pattern that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
In a third aspect, the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with a trait. In some embodiments, the method comprises two parts. The first part of the method estimates pattern frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of one or more haplotypes or diplotypes at one or more loci, one or more environmental factors, or a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown), and wherein the algorithm provides a standard error measurement for each estimated pattern frequency. Generally, the forward-block algorithm comprises distributing the markers in each set of markers into a plurality of blocks and determining pattern frequencies in a step-wise fashion for each set of markers in both trait-positive individuals and trait-negative individuals. In some embodiments, the forward-block algorithm comprises the steps of: (i) estimating pattern frequencies for a first block, wherein pattern comprises a block haplotypes and one or more environmental factors and a block haplotype comprises a combination of genotypes for markers in a block; (iii) estimating pattern frequencies for a second block; (iii) estimating pattern frequencies for a first combination block using selected and pooled patterns for the first and second blocks, wherein a combination block comprises two or more blocks, and wherein patterns with greater than a predetermined minimum frequency are selected and patterns that are not selected are pooled; (iv) estimating pattern frequencies for a third block; (v) estimating pattern frequencies for a second combination block using selected and pooled patterns for the first combination block and the third block; and (vi) sequentially repeating steps (iv) and (v), wherein each repetition provides pattern frequencies for a combination block comprising an additional block. The second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies. In another embodiment, the third aspect of the invention provides a haplotype- based method of diagnosing an increased risk of developing a trait. In this embodiment, the method comprises four parts. The first part of the method estimates pattern frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of one or more haplotypes or diplotypes at one or more loci, one or more environmental factors, or a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers may be incomplete (i.e., the phase information may be known, partially known, or completely unknown), and wherein the algorithm provides a standard error measurement for each estimated pattern frequency. In some embodiments, the forward-block algorithm comprises distributing the markers in each set of markers into a plurality of blocks and determining pattern frequencies in a step-wise fashion for each set of markers in both trait-positive individuals and trait- negative individuals, as described above. The second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies. The third part of the method derives one or more patterns that are significantly associated with the trait. The fourth part of the method diagnoses an increased risk of developing the trait in a trait- negative individual by determining the presence of a pattern that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
In a fourth aspect, the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with multiple phenotypes or traits.
The methods of the invention may be used for any study design, for example case- control studies , observational studies, cohort studies, or clinical trials. The markers used in any of the methods of the invention may have two alleles (biallelic markers) or multiple alleles (multiallelic markers). Representative markers include, but are not limited to, microsatellite markers or single nucleotide polymorphisms (SNPs). In some embodiments, the markers are biallelic SNPs. The set of markers may comprise more than about 12 markers, such as more than about 20 markers, such as more than about 50 markers, such as more than about 100 markers, or such as several hundreds of markers. In any of the methods of the invention, each block may comprise at least about 3 markers, such as at least about 5 markers, such as at least about 10 markers, or such as at least about 20 markers. In some embodiments the genotype of at least some of the markers is unknown. The sample population in any of the methods of the invention may comprise more than about 10 individuals, more than about 50 individuals, more than about 100 individuals, more than about 500 individuals, or more than about 1000 individuals. The invention also provides a computer-readable medium having computer- readable instructions for performing the methods of the invention. The invention also provides a computer-system having a processor, a memory, and an operating environment, the computer system operable for performing the methods of the invention. BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein: FIGURE 1 shows a schematic diagram representing the forward-block algorithm.
FIGURE 2 shows a schematic flowchart for the simulations.
FIGURE 3 shows plots of the average discrepancy of the estimates of haplotype frequencies. The plots in the top panel are the average discrepancy \ θjj | summed j over all haplotypes) of the estimates of haplotype frequencies compared to the population frequencies (dashed line) and the sample frequency (solid line) under HWE, HWD(l), and HWD(2), respectively. The plots in the lower panel are the average discrepancy y\ τjj \ summed over all common haplotypes) of the estimated standard error j compared to the sample standard deviation of the estimate based on 500 simulations under HWE, HWD(l), and HWD(2), respectively. The simulations are carried out for a five-locus system.
FIGURE 4 shows the discrepancy of the estimated haplotype frequencies using the EE method. We simulated 100 haplotypes, which were randomly paired to form 100 genotypes, assuming that R=2 ( R = 4N0r where r is the recombination rate and N0 is the effective population size), the number of sites between which recombination can occur is 1,000, and the number of segregating sites is fixed to 20-100, using a coalescent-based program. We generated 1,000 independent data sets for each number of segregating sites. The solid line represents the average discrepancy over 1,000 simulated data sets and the two dashed lines represent the 5 and 95 percentile of 1 ,000 discrepancies.
FIGURE 5 shows Table 1 describing estimated haplotype frequencies (standard errors) using the EE and estimated haplotype frequencies using Bayesian method
(HAPLOTYPER), from 44 individuals genotyped for 18 SΝPs within the gene
ARHGDIB on chromosome 12: core SNP, G4923a6, G4923a7, G4923al0, G4923al l, G4923al2, G4923al8, G4923al5, G4923al6, G4923al7, G4923a22, G4923a26, G4923a28, G4923a44, G4923a46, G4923a35, G4923a37, and G4923a38. Complete genotype for all 44 individuals are available for the 8 SNPs indicated in italics, the remaining SNPs have missing genotypes for one or more individuals (0: reference allele, 1 : variant allele).
FIGURE 6 shows Table 2 describing estimated haplotype frequencies (standard errors) using both the EE and EM algorithms and estimated haplotype frequencies using Bayesian methods, PHASE and HAPLOTYPER, from 44 individuals with complete genotype data for 8 SNPs within the gene ARHGDIB on chromosome 12: core SNP, G4923a6, G4923a7, G4923al2, G4923a26, G4923a44, G4923a46, and G4923a35 (0: reference allele, 1 : variant allele). The standard errors of EM estimates of haplotype frequencies were computed from 1,000 bootstrap samples.
FIGURE 7 shows Table 3 describing estimated allelic frequencies among cases and controls, their allelic differences, standard errors and Z-statistics, and estimated odds ratios with bootstrap confidence intervals
FIGURE 8 shows Table 4 describing estimated Z-statistics, odds ratios, and their 95% confidence intervals for haplotype associations with two adjacent SNPs within APOC3.
FIGURES 9(a)-(d) show the distributions of maximum Z-statistics, 90% and 95% threshold values obtained through 1,000 permutations, (a) shows the frequency of maximum Z-scores using 2 adjacent SNPs; (b) shows the maximized Z-scores at every SNP locus when two adjacent SNPs are used ; (c) and (d) are similar to (a) and (b) using three adjacent SNPs. The two threshold values in (b) are 2.86 and 2.54), in (d) are 3.05 and 2.82. FIGURE 10 shows Table 5 describing estimated Z-statistics, OR, and their 95% confidence intervals for haplotype associations at three adjacent SNPs within APOC3.
FIGURE 11 shows Table 6 describing estimated Z statistics, OR, and their 95% confidence intervals for haplotype associations at four adjacent SNPs within APOC3.
FIGURE 12 shows Table 7 describing estimated Z statistics, OR, and their 95% confidence intervals for haplotype associations at five adjacent SNPs within APOC3.
FIGURE 13 shows Table 8 describing estimated Z statistics, OR, and their 95% confidence intervals for haplotype associations at six adjacent SNPs within APOC3. FIGURE 14 shows Table 9 comparing the estimating equation approach (EE) with the EM algorithm in estimating haplotype frequencies and their standard errors.
FIGURE 15 shows Table 10 describing a Monte Carlo simulation study under the null hypothesis with no covariates. The estimation model is c η = ξ + alGender + a2Age + βJ_lI(hj) , where hj,j ~ 2,-- -,c are the 2nd to the last
common haplotypes; /(•) is an indicator function; the most common haplotype is used as reference; True model is η = -5 . Avg. bias is the average bias of estimate of parameter over 1,000 duplicates. Std is the standard deviation of 1,000 estimates of parameter. Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates. There are 300 cases and 300 controls in each set of simulation data.
FIGURE 16 shows Table 11 describing a Monte Carlo simulation study under the c null hypothesis with covariates. The estimation model is η = ξ + βj_J hj) , where
7=2 hj , j = 2, ■ ■ ■ , c are the 2nd to the last common haplotypes; /(•) is an indicator function; the most common haplotype is used as reference; True model is η = -5 + 1.0* Race where Race=l represent Africa Americans with 30% and Race=0 represents Caucasian with 70% in the population. Race and genotypes are generated independently. Avg. bias is the average bias of estimate of parameter over 1 ,000 duplicates. Std is the standard deviation of 1 ,000 estimates of parameter. Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates. There are 300 cases and 300 controls in each set of simulation data.
FIGURE 17 shows Table 12 describing a Monte Carlo simulation study under the c null hypothesis with varying sample sizes. The estimation model is η = ξ + ∑β lI(hj) ,
7=2 where h},j = 2,- --,c are the 2nd to the last common haplotypes; 7(-) is an indicator function; the most common haplotype is used as reference; True model is η = -5 + 1.0* Race where Race=l represent Africa Americans with 30% and Race=0 represents Caucasian with 70% in the population. The haplotype h2 is observed only in Africa Americans. Avg. bias is the average bias of estimate of parameter over 1,000 duplicates. Std is the standard deviation of 1,000 estimates of parameter. Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates. There are 300 cases and 300 controls in each simulation data. FIGURE 18 shows Table 13 describing a Monte Carlo simulation study under the c alternative hypothesis with no covariates. The estimation model is η = ξ + ∑βj_λI(hJ) ,
7=2 where hj,j = 2,- - -,c are the 2nd to the last common haplotypes; 7(-) is an indicator function; the most common haplotype is used as reference; True model is 77 = -5 + 1.0 * I(h2 ) - 1.0 * I(h3 ) . Avg. bias is the average bias of estimate of parameter over 1,000 duplicates. Std is the standard deviation of 1,000 estimates of parameter. Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates. There are 300 cases and 300 controls in each simulation data.
FIGURE 19 shows Table 14 describing a Monte Carlo simulation study under the alternative hypothesis with covariates. The estimation model is c η = ξ + alGender + a2Age + βJ_lI(hj) , where hj,j = 2,-- -,c are the 2nd to the last
7=2 common haplotypes; /(•) is an indicator function; the most common haplotype is used as reference; True model is η = -8 -1.0* Gender + .1* Age + 1.0*1 (h2) -1.0* 1(h) . Avg. bias is the average bias of estimate of parameter over 1,000 duplicates. Std is the standard deviation of 1,000 estimates of parameter. Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates. There are 300 cases and 300 controls in each simulation data.
FIGURE 20 shows Table 15 describing a case-control study under the null hypothesis with covariates. True model is η = ζ + βλGender + β2Age + ^/(/^ ) + β ^ ) where ζ = -8, λ = -l,β2 = .l,β3 = l,β4 = -l ; I( ) is an indicator function; h2 and A- are the second and third common haplotypes. The risk factors Gender and Age are omitted from the estimation model and only all common haplotypes are modeled in a case-control study with 300 cases and 300 controls. Avg. bias is the average bias of estimate of parameter over 1,000 duplicates. Std is the standard deviation of 1,000 estimates of parameter. Avg. SE is the average of estimated standard error of the estimate of parameter over 1,000 duplicates. β5 is the coefficient of I(h4) in the estimation model.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In a first aspect, the invention provides a method for estimating haplotype frequencies in a population using an estimating equation method that provides a standard error measurement for each estimated haplotype frequency. In a second aspect, the invention provides a method assessing the association of a haplotype with a trait, comprising estimating haplotype frequencies for a group of trait-positive individuals and a group of trait-negative individuals using an estimating equation method that provides a standard error measurement for each estimated haplotype frequency; and estimating the differences in haplotype frequencies for the trait-positive group and the trait-negative group. In a third aspect, the invention provides a method for assessing the association of a trait with a pattern comprising: (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, and/or (3) a combination of haplotypes or diplotypes and environmental factors, using an estimating equation method that provides standard error measurements. In a fourth aspect, the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with multiple phenotypes. The invention also provides a computer-readable medium having computer-readable instructions for performing the methods of the invention, and a computer-system having a processor, a memory, and an operating environment, the computer system operable for performing the methods of the invention. Unless specifically defined herein, all terms used herein have the same meaning as they would to one skilled in the art of the present invention.
As used herein, the following terms have the meanings defined below: The term "genetic marker" or "marker" refers to genome-derived polynucleotides which are sufficiently polymorphic to allow a reasonable probability that a randomly selected individual will be heterozygous, and thus informative for genetic analysis by methods such as linkage analysis or association studies. For example, the human haploid genome contains a 3xl09 base- long double stranded DNA shared among the 24 chromosomes. Each human individual is diploid, i.e., possesses two haploid genomes, one of paternal origin, the other of maternal origin. The sequence of the human genome varies among individuals in a population. About 107 sites scattered along the 3xl09 base pairs of DNA are polymorphic, existing in at least two variant forms called alleles. Most of these polymorphic sites are generated by single base substitution mutations (SNPs) and are biallelic. Less than 105 polymorphic sites are due to more complex changes and are very often multi-allelic, t.e., exist in more than two allelic forms. The term "polymorphism" as used herein refers to the occurrence of two or more alternative genomic sequences or alleles between or among different genomes or individuals. "Polymorphic" refers to the condition in which two or more variants of a specific genomic sequence can be found in a population. A "polymorphic site" is the locus at which the variation occurs. A single nucleotide polymorphism is the replacement of one nucleotide by another nucleotide at the polymorphic site. Deletion of a single nucleotide or insertion of a single nucleotide also gives rise to single nucleotide polymorphisms. The term "single nucleotide polymorphism" or "SNP" refers to which ' are genome-derived polynucleotides markers that exhibit allelic polymorphism.
The term "allele" refers to a variant of a nucleotide sequence. All alleles on a single chromosome are from one of the two parents. At a given polymorphic site, any individual (diploid), can be either homozygous (twice the same allele) or heterozygous (two different alleles). The term "haplotype" refers to a particular combination of alleles at multiple markers, such as SNPs, present in an individual. Haplotype information is specified by the parental origin (phase) of individual marker alleles. Across short distances (usually 10's of kbp), haplotypes are conserved over the evolutionary time-scale (Drysdale et al. (2000) Proc. Natl Acad. Sci. U.S.A. 97:10483-10488; International Human Genome Sequencing Consortium (2001) Nature 409:860-921; Gabriel et al. (2002) Science 296:2225-2229). Consequently, in many regions the number of common haplotypes comprising multiple markers in the population is much smaller than the theoretical number of possible combinatorial haplotypes. For example, assuming 10 biallelic SNPs there are 210=1024 possible haplotypes, but the expected number of observable haplotypes is approximately 10~14, which is considerably smaller.
The term "genotype" refers the identity of the alleles present in an individual or a sample. The term "genotyping" a sample or an individual for a marker involves determining the specific allele or the specific nucleotide carried by an individual at a biallelic marker. The term "diplotype" refers to the identity of the alleles on both chromosomes in an individual, i.e., a pair of haplotypes. The term "Hardy- Weinberg Equilibrium" or "HWE" refers to the assumption that haplotypes are randomly paired to form individuals' diplotypes.
A "trait" or "genetic trait" or "phenotype" refers to a measurable characteristic present in some individuals of a population and absent in others. A given polymoφhism or rare mutation can be either neutral, i.e., it has no effect on the trait, or functional, i.e., it is responsible for a particular genetic trait. The preferred traits contemplated within the present invention relate to fields of therapeutic interest, for example susceptibility to a disease, or drug response reflecting drug efficacy, toxicity, or other side effects related to treatment. Traits may also relate to any other desirable or undesirable characteristic in a population of individuals, such as a bovine trait to produce tender, high-quality beef (see Adam (2002) Nature 417:778) or a disease-resistance in a plant. The term "trait-positive individuals" or "cases" refers to individuals in which a particular trait is present. The term "trait-negative individuals" or "controls" refers to individuals in which a particular trait is absent.
Traits can either be "binary", e.g. diabetic vs. non-diabetic, "quantitative", e.g. elevated blood pressure, or "censored", e.g., time-to-onset of a disease outcome. Individuals affected by a quantitative trait can be classified according to an appropriate scale of trait values, e.g., blood pressure ranges. Each trait value range can then be analyzed as a binary trait. Individuals showing a trait value within one such range are compared to individuals showing a trait value outside of this range. In such a case, genetic analysis methods are applied to subpopulations of individuals showing trait values within defined ranges. In addition, a trait can be categorical (e.g., blood types), ordinal (e.g., stages of breast cancer), censored (e.g., observations for study participants who do not have the event of interest during the period of follow-up), or provide frequency information.
The term "environmental variable" or "environmental factor" refers to non-genetic contributions to the presence or absence of a trait. Environmental factors include, for example, age, sex, weight, height, nutrition, life-styles, smoking, alcohol-consumption, work habits, history of medications, medical history, family history, leisure activities, etc.
The term "pattern" as used herein refers to either (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, or (3) a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors. Thus, the presence of a particular pattern may associate a trait with the combination of a haplotype or diplotype at one locus with a haplotype or diplotype at another locus, or it may associate a trait with one or more environmental factors, or it may associate a trait with a combination of a haplotype or diplotype at one ore more loci with one or more environmental factors.
The term "locus" refers to the specific location of a marker in the genome. The terms "significantly associated," "significantly positively associated," and "significantly negatively associated" all refer to statistical significance. Statistical significance, is used herein as it is typically used by those with skill in the art. It is a measure of the probability that an observed difference would have been observed simply by chance and is not the result of a "real" difference between two groups, for example. Thus, the lower the probability that the observed difference would have happened by chance, the less likely that it happened by chance. Statistical significance is based on p- values. A p-value <0.05 is typically considered statistically significant, although in some instances a p-value of <0.01 or even <0.005 or <0.001 is preferred. In general, the lower the p-value, the less likely that an observed difference occurred by chance, and thus, the more statistically significant the difference. A statistically significant positive association between a haplotype and a trait means that the presence of a haplotype correlates with the presence of the trait, a statistically significant negative association between a haplotype and a trait means that the presence of a haplotype correlates with the absence of the trait.
In the first aspect, the invention provides a method for estimating haplotype frequencies for a set of markers in a sample population using an estimating equation (EE) method, wherein the genotypes of at least some of the markers are known for each individual in a sample population, wherein the phase information for the markers is incomplete, and wherein the algorithm used directly provides a standard error measurement for each estimated haplotype frequency without using bootstrap methods.
Most genotyping protocols result purely in genotype information; they produce information about the pair of alleles an individual possesses at each locus, but not necessarily haplotype information which would reveal the alleles that have been inherited together on the same paternal or maternal chromosome. Without explicit haplotype information, there is ambiguity with respect to the origin of alleles at neighboring loci. For example, it is difficult to determine if there are differences in the frequency of certain haplotypes between individuals with a disease ("cases") and individuals without the disease ("controls") in the absence of haplotype information.
The estimation of haplotype frequencies from genotype data gathered on a sample of individuals is based on the fact that the haplotypes of some individuals in the sample are unambiguous. This allows the ambiguous haplotypes to be estimated using statistical predictions. Individuals that are unambiguous with respect to phase or haplotype information have homozygous genotypes either at all relevant loci or at all but one relevant locus. Individuals with two or more heterozygous genotypes have more than one possible haplotype configuration compatible with their genotype data, and hence are ambiguous with respect to phase or haplotype information.
For example, consider two biallelic loci, one with alleles A and a and one with alleles B and b. An individual with a two-locus genotype (AA) and (BB) must have received two "A-B" haplotypes. An individual with an (aA) and a (BB) genotype must have received haplotypes "A-B" and "a-B" from his/her parents. The haplotypes of these two individuals are "unambiguous" for these two loci (i.e., the phase information for these loci is known). An individual with the genotypes (Aa) and (Bb) however, could have received haplotypes "A-B" and "a-b", or haplotypes "A-b" and "a-B", and hence is ambiguous with respect phase information for these loci.
Thus, in the absence of explicit phase information, if an individual is heterozygous at more than one locus, he or she is ambiguous with respect to haplotype and phase information. Individuals that are ambiguous thus have "missing" or "incomplete" phase information. Individuals that are unambiguous with respect to phase and haplotype data have "complete" information.
The EE method of the first aspect of the invention is motivated by the likelihood approach for missing data problems (Efron (1994) J. Am. Stat. Assoc. 89:463-475; Heitjan (1994) Biometrika 81 :701-708; Rubin (1996) J Am. Stat. Assoc. 91:473-489). Specifically, at an SNP locus, the phase of a heterozygous SNP is unresolved, and may be treated as missing data. In the terminology of missing data analysis, the missingness of phase is Missing At Random (Heitjan (1994) Biometrika 81 :701-708) in the sense that the missing mechanism depends on observed genotype data, rather than missed phase information. Note that if the parental origin (phase) of an allele at any single heterozygous locus is assumed to be fixed, it may serve as a reference phase for other SNPs within the same individual.
The EE method of the first aspect of the invention is described in EXAMPLE 1. The derivations of the likelihood and estimating equations are described in the first section of EXAMPLE 1. The likelihood function used in the method of the invention is essentially the same as the likelihood function that was derived in Excoffier & Slatkin (1995) Mol. Biol. Evol 12:921-927. This likelihood function was also used in the two Bayesian methods (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989; Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169). However, unlike these methods, the EE method uses a forward-block computational algorithm, which is computationally efficient algorithm that permits the resolution of phase information when the number of SNPs is larger than about 20. The forward-block algorithm used in the EE method of the invention is described in the second section of EXAMPLE 1.
The forward-block algorithm carries out computations in a stepwise fashion as shown in FIGURE 1. First, the set of markers (e.g., SNPs) is divided into several blocks. A block may contain from about 3 to about 100 markers. Second, haplotype frequencies for the first two blocks are estimated separately. The first two blocks are then joined and the haplotype frequencies for the enlarged block are estimated using the estimation results of the first two blocks as initial values. Third, haplotype frequencies for the next single block are estimated, and the single block is added to the enlarged block. The estimations are done sequentially for each enlarged block and next single block. This process is continued until all the blocks are joined. At each step, haplotypes with insignificant frequencies are filtered out.
The markers used in the methods of the invention may be biallelic or multiallelic markers. Exemplary markers are microsatellite markers and SNPs. In some embodiments, the markers are biallelic SNPs. The set of markers may comprise more than about 12 markers, more than about 20 markers, more than about 50 markers, more than about 100 markers, between about 200 and about 1000 markers, or several thousands of markers. In some embodiments the genotype of at least some of the markers is unknown. The sample population may comprise more than about 40 individuals, more than about 100 individuals, more than about 500 individuals, or more than 1000 individuals.
The EE method of the invention directly provides a standard error measurement for each estimated haplotype frequency using estimation equation theory and can also handle data sets that include missing genotype values. The estimations of the standard error for any haplotype frequency is as shown in the third section of EXAMPLE 1. The adjustment of the covariance matrix for missing genotypes is described in the fourth section of EXAMPLE 1.
The EE method also permits the inference of haplotypes of individuals. Thus, the phase of an individual's genotype can be predicted to provide a pair of haplotypes with a probability statement, as shown in the fifth section of EXAMPLE 1.
The EE method accurately estimates haplotype frequencies and their standard errors (SEs) under the assumption of Hardy- Weinberg Equilibrium (HWE) and under conditions when the model assumption is violated, as shown in EXAMPLE 2. In a simulation study, the estimated haplotype frequencies and their SEs using the EE method are consistent even when the sample size is small (see FIGURES 2 and 3). Another simulation study shows that the haplotype frequencies estimated using the EE method are accurate even when the number of markers (e.g., SNPs) is large, as shown in EXAMPLE 2 and FIGURE 4. In some embodiments of the EE method of the invention, the number of individuals can be several thousands, and the number of markers can be several hundreds. In some embodiments, the estimation of haplotype frequencies from a dataset of 632 individuals and 296 markers was completed in approximately 2 minutes, as shown in EXAMPLE 2.
The estimated haplotype frequencies and their SEs using the EE method of the invention compare favorable with those calculated with three existing methods: Arlequin software (Schneider et al. (2000) http//lgb.unige.ch/arelquin), which is an implementation of the EM method proposed in Excoffier & Slatkin (1995), PHASE software, which is an implementation of the Bayesian method (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989), and HAPLOTYPER software, which is an implementation of the Bayesian method (Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169). The comparison of the EE method with the other three methods is described for an actual dataset in EXAMPLE 3 and FIGURES 5 and 6. Both EE and EM methods estimate haplotype frequencies from the same genotype data; therefore, both should give the comparable estimates of haplotype frequencies, as shown EXAMPLE 3. Because of computational difficulties, the EM method only applies to the data with a small number of SNPs. Moreover, because a typical information matrix is sparse under the likelihood, the EM method may not be able to yield any estimate of SEs via inverting the information matrix. Therefore, the EM method has to rely on bootstrap methods, which are inevitably computationally intensive, to estimate SEs. Utilizing the forward-block-computational algorithm and estimating equation technique, the EE method is able to handle the data with large numbers of SNPs and estimate SEs directly without using bootstraps. Unlike the EM method, The EE method can also handle the data with missing genotypes. Compared with the EM method for estimating haplotype frequencies, the EE method yields consistently smaller estimates for SEs, as documented in EXAMPLE 4 and FIGURE 14. In addition, the computations using the EE methods were at least 120 times faster than using the EM algorithm in a dataset including 6 markers and 437 individuals, as shown in EXAMPLE 4.
Alternatively, haplotype frequencies can be estimated using Bayesian approach. Recently, two Bayesian methods (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978- 989; Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169) were proposed to reconstruct individuals' haplotypes from their genotypes. In despite of the different focuses, both maximum likelihood and Bayesian methods can be used to reconstruct individuals' haplotypes and estimate haplotype frequencies. Compared to the maximum likelihood methods (including the EE method), an additional assumption is required for the conditional distribution of individual haplotype given all haplotypes in the first Bayesian approach (PHASE, Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989). In general, if this assumption were satisfied by the observed data, the Bayesian method would be expected to produce more accurate estimates than the maximum likelihood estimates. On the hand, when such an assumption is violated, the Bayesian method can produce biased estimates. These points were demonstrated in recent simulations (Niu et al. (2002) Amer.
J. Hum. Genet. 70:157-169), and in real data sets (Zhang et al. (2001) Am. J. Hum. Genet.
69:906-912).
The second Bayesian approach (HAPLOTYPER, Niu et al. (2002) Amer. J. Hum.
Genet. 70:157-169), instead of assuming the conditional distribution of haplotypes, assumes a prior distribution of haplotype frequencies to be a Dirichlet( β ) distribution, which is indexed by parameters β estimated from the data. The resulted posterior distribution is Dirichlet( ? + N(Z)), where N(Z) is the vector of haplotype counts observed in the "imputed" haplotypes Z. In general, estimates using this Bayesian method are very comparable to the direct maximum likelihood estimates unless prior knowledge on haplotypes in the population is known. Simulation results showed that this Bayesian method and the EM method produced the similar results in term of reconstructing for individuals' haplotypes (Νiu et al. (2002) Amer. J. Hum. Genet. 70:157-169). However, the prior distributional assumption, if violated, may have an impact on estimation of standard errors, which have not been dealt with in HAPLOTYPER at this time.
An extension of the EE method may be used if an individual's parental genotypes are also available. Thus, the accuracy of the estimates may be improved by incoφorating parental genotypes in estimating equation (9) in EXAMPLE 1. The improvement can be very valuable, especially for estimating frequencies of rare haplotypes, which is not reliable unless a very large number of individuals are genotyped.
In addition, the EE method may be used to estimate haplotype frequencies for multi-allelic markers, such as microsatellite markers. The EE method may also be modified by incoφorating available population genetic models to estimate haplotype frequencies for an isolated population. Additionally, recognizing that block-like structures of SNPs exist throughout the genome (Daly et al. (2001) Nature Genetics 29:229-232; Goldstein (2001) Nature Genetics 29:109-111; Patil et al. (2001) Science 294:1719-1723), the EE method may be used together with the forward block computational algorithm to systematically identify haplotype blocks such that there are only limited haplotypes within each block. With identified blocks and haplotypes, it is possible to determine which SNPs are essential for defining the haplotypes and which SNPs are redundant, obviating the need to genotype non-essential SNPs.
In a second aspect of the invention, the estimating equation approach may be used for assessing trait associations with haplotypes. One important application of SNP genotypes is the study of their association with complex traits, such as cancers or coronary heart diseases. When a trait is relatively rare, a widely accepted design strategy is the case-control study, which has been established through decades of epidemiological research (Breslow & Day (1980) Statistical Methods in Cancer Research, Int'l Agency for Research on Cancer, Lyon; Schlesselman (1982) Case-Control Studies: Design, Conduct, Analysis, Oxford University Press, NY). A typical case-control study identifies a sample of trait-positive individuals ("cases") and a sample of matched, trait-negative individuals ("controls") from a well-defined population. For each case and control, investigators obtain biological samples as well as demographic and clinical information via questionnaires. In the current context, genomic DNA from these biological samples are genotyped at multiple SNPs in the candidate genes of interest. The analytic objective is to correlate SNP genotypes with trait phenotype, that is, to estimate differences in allele frequencies between cases and controls (or odds ratios, OR). With a few exceptions (Whittemore & Halpern (1997) Stat. Med. 16:153-167; Zhao et al. (1997) Am. J. Med. Genet. 84:433-453), typical case-control studies do not involve family members, other than to obtain self-reported family history of the disease of interest, and, therefore, biological samples from family members are not available. One possible analysis uses logistic regression methodology (Breslow & Day (1980) Statistical Methods in Cancer Research, Int'l Agency for Research on Cancer, Lyon). Essentially, markers are treated as covariates, measured as either genotypes or alleles, and then correlated with disease phenotype. To enumerate all possible associations, logistic regression analysis has to consider all main effects and all pair-wise and high order interactions. Thus, the number of potential tests increases at an exponential rate with the number of markers being considered. While conceptually straightforward, such methods are not efficient, because they do not take advantage of the genomic structure of multiple markers, such as SNPs, within a functional gene. Furthermore, the inteφretation of regression coefficients associated with genotypes is not straightforward. Most importantly, regression analysis has limited power because it has to test hypotheses under all possible combinations of genotypes.
An alternative approach to examining individual markers is to focus on associations of marker haplotypes with disease phenotype, which is advocated by population geneticists (Haiti & Clark (1997) Principles of Population Genetics, Sinauer Assoc, Inc., Sunderland, MA; Drysdale et al. (2000) Proc. Natl. Acad. Sci. U.S.A. 97:10483-10488). Using a haplotype-based approach has two advantages. First, haplotypes are conserved across short distances over the evolutionary time-scale (Drysdale et al. (2000) Proc. Natl. Acad. Sci. U.S.A. 97:10483-10488; International Human Genome Sequencing Consortium (2001) Nature 409:860-921; Gabriel et al. (2002) Science 296:2225-2229). Consequently, in many regions the number of common haplotypes comprising multiple markers in the population is much smaller than the theoretical number of possible combinatorial haplotypes. Because of this greatly reduced dimensionality, the approach of using haplotypes is expected to be more efficient than logistic regression analysis which treats markers as individual covariates. Second, some phenotypes are associated with haplotypes, rather than with any individual marker within the haplotype. For example, Hallman and co-workers noted a haplotype association of lipoprotein lipase with coronary heart diseases (Hallman et al. (1999) Ann. Hum. Genet. 63:499-510), but no association was evident with individual SNPs. The traditional method to associate haplotypes with a trait is to collect genotype data from both parents, and to compare individual marker alleles with the father's and mother's marker alleles, to determine phase of marker alleles (Wijsman (1987) Am. J. Hum. Genet. 41 :356-373). However, gathering parental biological samples and genotyping them, even if possible, is difficult to implement in large case-control studies. Alternatively it is possible to haplotype multiple markers experimentally (Weston et al. (1992) Cancer Epidemiology, Biomarkers & Prevention 1:481-483; Green et al. (1998) The Human Genome Project and Its Impact on the Study of Human Disease, The Genetic Basis of Human Cancer, McGraw-Hll, Health Professions Division, NY). For example, one may dissect out a single chromosome, and then genotype multiple marker alleles from the single chromosome. Without a biotechnological breakthrough, such direct biochemical phase resolution methods remain extremely costly.
Without relying on experimental methods or on parental data, another class of methods infers haplotypes from multiple markers. Unambiguous haplotypes among cases and controls are identified, ignoring remainder haplotypes (Haviland et al. (1995) Ann. Hum. Genet. 59:211-231). As expected, ignoring partially informative haplotypes implies a loss of efficiency, and this difficulty becomes more significant as the number of sites assayed increases. Typically, such a method is applicable to at the most three or four markers. To incoφorate partially observed haplotypes, another method is to infer haplotypes based upon empirical distributions, tolerating a degree of misclassification error (Hallman et al. (1999) Ann. Hum. Genet. 63:499-510). However, misclassification errors may bias the results. Recently, Stephens and co-workers (2001) have proposed a Markov Chain-Monte Carlo (MCMC) method to construct haplotypes (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989). Of course, the computational burden associated with MCMC is significant, and may become prohibitive when dealing with a large number of cases and controls.
Rather than reconstructing haplotypes, several groups have proposed a statistical approach to estimate haplotype frequencies using the maximum likelihood approach. There are three different implementations (Excoffier & Slatkin (1995) Mol. Biol. Evol. 12:921-927; Hawley & Kidd (1995) J. Heredity 86:409-411; Long et al. (1995) Amer. J. Hum. Genet. 56:799-810). While the implementations are largely comparable except for slightly different modeling assumptions, Excoffier and Slatkin's development seems to require the fewest assumptions. All used the EM algorithm to maximize the respective likelihood functions. Recently, a simulation study showed that Excoffier & Slatkin's EM implementation yields reliable estimates of haplotype frequencies under some typical population genetic scenarios (Fallin & Schork (2000) Amer. J. Hum. Genet. 67:947-959). A validation study on estimation of haplotype frequencies in the CD4 locus showed that estimated haplotype frequencies by all three implementations yield comparable estimates, and estimates are largely consistent with experimental data when haplotypes are observed experimentally (Tishkoff et al. (2000) Amer. J. Hum. Genet. 67:518-522). However, estimated haplotype frequencies are not reliable for rare haplotypes. This reduced reliability is expected, and is best quantified by their confidence intervals and standard errors, which are not documented.
The method of the second aspect of the invention uses an estimating equation approach for analyzing case-control marker data to associate haplotypes with a trait, as shown in EXAMPLES 4 and 5. The method infers haplotype frequencies statistically, and correlates these with case-control status, thereby establishing associations of particular haplotypes with the disease phenotype. The method of the invention has two parts. In the first part, the method estimates haplotype frequencies for a set of markers with incomplete phase information in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation algorithm, which directly provides a standard error measurement for each estimated haplotype frequency. A forward-block algorithm carries out computation of this part of the method in a stepwise fashion as shown in FIGURE 1. First, the set of markers (e.g., SNPs) is divided into several block. A block may contain between about 3 and about 100 markers. Second, haplotype frequencies for the first two blocks are estimated separately. The first two blocks are then joined and the haplotype frequencies for the enlarged block is estimated using the estimation results of the first two blocks as initial values. Third, haplotype frequencies for the next single block are estimated, and the single block is added to the enlarged block. The estimations are done sequentially for each enlarged block and next single block. This process is continued until all the blocks are joined. At each step, haplotypes with insignificant frequencies are filtered out.
The EE method of the invention directly provides a standard error measurement for each estimated haplotype frequency using estimation equation theory and can also handle data sets that include missing genotype values. The estimations of the standard error for any haplotype frequency is as shown in the third section of EXAMPLE 1. The adjustment of the covariance matrix for missing genotypes is described in the fourth section of EXAMPLE 1.
In the second part, the method estimates the differences in frequencies of individual haplotypes for the set of markers in the trait-positive group and in the trait- negative group using an estimating equation algorithm, which directly provides a standard error measurement for each estimated difference in haplotype frequency.
In another embodiment of the second aspect, the invention provides a haplotype- based method for diagnosing an increased risk of development a trait. In this embodiment, the method comprises four parts. The first part of the method estimates a set of haplotype frequencies in a group of trait-positive individuals and in a group of trait- negative individuals using an estimating equation method and a forward-block algorithm, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers is incomplete, and wherein the algorithm provides a standard error measurement for each estimated haplotype frequency. The forward-block algorithm comprises distributing the markers into a plurality of blocks and determining haplotype frequencies in a step-wise fashion for both individuals in the trait-positive group and individuals in the trait-negative group. In the first step, haplotype frequencies are estimated for the subset of markers in the first block. The second step comprises estimating haplotype frequencies for the subset of markers in the second block. In a third step, the haplotype frequencies are estimated from a combination of selected and pooled haplotypes for the first block and selected and pooled haplotypes for the second block, wherein a selected block haplotype has greater than a predetermined minimum frequency and a pooled block haplotypes is a haplotype that is not selected. Step four comprises sequentially repeating steps two and three, wherein during each repetition the pooled and selected haplotypes for one additional block is added to the selected and pooled haplotypes for the previous combination of blocks. The second part of the method estimates the differences in haplotype frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in haplotype frequencies. The third part of the method derives one or more haplotypes that are significantly associated with the trait. The fourth part of the method diagnoses an increased risk of developing the trait in a trait-negative individual by determining the presence of a pattern that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
The likelihood derivation used in the EE method in a case-control study is described in the first section of EXAMPLE 4. The motivation of the estimating equation is described in the second section of EXAMPLE 4. In some embodiments, the strategy used is to search for haplotype associations progressively, in multiple steps, as shown in the third section of EXAMPLE 4. The permutation of the case/control study under the null hypothesis that markers and their haplotypes do not associate with disease status to compute the maximum Z-statistic is described in the fourth section of EXAMPLE 4. An efficient computational implementation of the estimating equation method for use in the case-control study is described in the fifth section of EXAMPLE 4.
The use of the EE method in an actual case-control study is described in EXAMPLE 5. This study involved 6 biallelic markers in 779 individuals, of which 342 were cases and 437 were controls. The dataset was analyzed in several steps: one marker at a time, 2 adjacent markers at a time, 3 adjacent markers at a time, etc. up to 6 adjacent markers. FIGURES 7-13 show the estimated allele or haplotype frequencies, estimated odds ratios (ORs), and their 95% confidence intervals for cases and controls. This study shows the feasibility of using the method of the invention with many markers. This EE approach for case-control studies has several important features. First, it does not require family data to infer haplotype information. Of course, if family data are available, they can be used to improve haplotype information. Second, the method estimates haplotype frequencies, and establishes their associations (but not necessarily causal associations) with a phenotype, without the need to assay haplotypes directly. Third, this method is easily generalized to incoφorate many markers, and this feature is important in view of increasing knowledge of SNPs in the genome and early evidence for haplotype specific effects. Fourth, it relies on the empirical distribution of multiple marker alleles for haplotype inference, rather than requiring linkage-disequilibrium assumptions or other population genetics assumptions. Fifth, this method is more efficient than the analysis with individual markers, because the actual number of common haplotypes is much smaller than the theoretically possible number of haplotypes. Sixth, because of its computational efficiency, this method is easily scaled to deal with a large number of SNPs (e.g., >100) collected from a large number of subjects (e.g., >1000).
This method of the second aspect of the invention, if applied only to controls, has a close connection with Expectation-Maximization (EM) algorithms developed for estimating haplotype frequencies in a general population (Excoffier & Slatkin (1995) Mol Biol Evol. 12:921-927; Hawley & Kidd (1995) J. Heredity 86:409-411; Long et al. (1995) Amer. J. Hum. Genet. 56:799-810). In this context, the EE method has the advantage of readily estimating standard errors for haplotype frequency estimates, while an EM algorithm may not. In a software implementation, called Arlequin, Excoffier and co-workers used a parametric bootstrap method to estimate standard errors. As expected, the computing burden from the bootstrap is substantial, and is as much as 120 times slower than the EE method.
A major limitation inherent in the case-control study design is that analytic results may be confounded by population stratification or admixture of populations (Elston (1999) Genet. Epidemiol. 17:79-101). In epidemiological terminology, the sources of subpopulations may confound associations of genetic markers and disease phenotype (Breslow & Day (1980) Statistical Methods in Cancer Research, International Agency for Research on Cancer, Lyon; Rothman & Greenland (1998) Modern Epidemiology, Lipincott-Raven Publishers, Philadelphia). The consequence is that associations of haplotypes with a phenotype may be confounded by the association of the phenotype with subpopulation-specific factors. To evaluate possibly spurious associations, one could independently validate the function of a particular haplotype. For example, one can study functionalities of its transcripts and proteins in vitro and in vivo. Of course, when the number of positive findings is large, it is likely that population stratification may be a serious problem, and a statistical analysis may be necessary to control for such a confounding effect. One method for such an analysis is to identify underlying subpopulations via cluster analysis of subjects and then to adjust for the confounding effects (Pritchard et al. (2000) Am. J. Hum. Genet. 67:170-181). An alternative method is to use a latent-class model to account for unmeasured population substructures (Satten et al. (2001) Am. J. Hum. Genet. 68:473-489).
Rather than analytically addressing this issue of population stratification, another approach is to design studies with cases and their parents, and to test disequilibrium in transmission (TDT) (Spielman et al. (1993) Am. J. Hum. Genet. 54:559-560; Spielman & Ewens (1996) Am J. Hum. Genet. 59:983-989; Schaid et al. (1999) J. Natl. Cancer Inst. Monogr. 26:1-16). A set of analytic tools have been developed to analyze such trio-data. However, the primary concern with this design is that the power of discovering genes is probably too low for mapping complex traits.
A modification of the EE method may be used if an individual's parental genotypes are also available. Thus, the accuracy of the estimates may be improved by incoφorating parental genotypes in estimating equation (23) in EXAMPLE 4. The application of such a modified likelihood function will improve the efficiency in deriving distributions of haplotypes and hence the efficiency of the estimation. The improvement can be particularly valuable especially for estimating frequencies of rare haplotypes, which is not reliable unless a very large number of individuals are genotyped. In some embodiments, the EE method may be used to estimate haplotype frequencies for multi-allelic markers, such as microsatellite markers. Also, in studies of isolated populations, where population genetic history is well characterized, the method may be modified by incoφorating population genetic models. Moreover, recognizing the richness of study designs for case-control studies, one may modify the above likelihood construction for various study designs, such as matched case-controls with sibling controls. In some embodiment the method of the invention may be extended to analyze quantitative phenotypes.
In a third aspect, the invention provides a method for assessing trait associations with haplotypes or diplotypes and environmental factors in case-control studies. Interactions between genes and environmental factors has been of interest to genetic epidemiology (Khoury et al. (1993) Fundamentals of Genetic Epidemiology, Oxford University Press, New York). In recent years, researchers in pharmacological research have been very interested in studying the interactions of drugs and genes, the field for which is known as pharmacogenomics. Additionally, researchers in clinical sciences have been interested in personalized medicine in the sense that physicians can prescribe medical treatment based upon patients' genotypes. Characterization of the association between traits and genes, independently and/or interactively with environmental factors, can also improve the prediction, diagnosis, and prognosis of disease or other traits in an individual. To model gene/environmental interactions, one would typically gather an array of covariables, referred here to as environmental factors, for example age, sex, clinical data, history of medications, nutritional factors, etc. For an example, consider x( measuring dosage of the compound consumed by the tth subject. An analytic objective is to identify the association of the compound with the phenotype, as well as the same associations among a subset of subjects who have a specific haplotype. In addition to the interaction of genes and environmental factors, a gene may also interact with another gene at a functional level (epistasis). Given the current understanding on genetic regulatory circuitry, it is believed that multiple candidate genes, rather than a single gene, likely play a role in most of chronic diseases. Consequently, multiple genes may jointly penetrate to disease phenotype in the form of gene/gene interactions. The interactions could occur as haplotype-haplotype interaction, or as diplotype-diplotype interaction.
The method of the invention treats haplotypes, if unknown, as latent variables and constructs an estimating equation by integrating out these latent haplotypes, resulting in an estimating equation for estimating association parameters of interest. The various notations and assumptions, and the estimation procedure is shown in EXAMPLE 6. The derivation of the estimating equation is described in EXAMPLE 7. Under the assumption that the traits are uncommon, this conditional probability may be approximated as described in EXAMPLE 8. The method of the invention provides an assessment of associations of a trait with haplotypes or with diplotypes, as shown in EXAMPLE 6. Moreover, the method estimates the association of a trait with a combination of haplotypes or diplotypes and environmental factors. The method also provides estimates of associations of interacting haplotypes or diplotypes with a trait. In the method, case/control status is permutated under the null hypothesis that markers and their haplotypes do not associate with trait status to compute the Z-statistic over all selected haplotypes, as shown in EXAMPLE 6. Using this approach, the estimated association parameters are unbiased and inferences retain the desired false error rate, provided that the logistic regression model in equation (26) holds. The method for associating one or more haplotypes or diplotypes for a set of markers at one or more loci and one or more environmental factors with a trait comprises two parts. The first part of the method estimates pattern frequencies in a group of trait- positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, and (3) a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers is incomplete, and wherein the algorithm provides a standard error measurement for each estimated pattern frequency. The forward-block algorithm comprises a series of steps for both individuals in the trait-positive group and individuals in the trait-negative group. First, each set of markers is divided into a plurality of blocks. A block may contain between about 3 and about 100 markers. For each set of markers in both trait-positive individuals and trait-negative individuals, the first part of the method comprises the following steps: (1) estimating pattern frequencies for block 1, wherein a pattern comprises a haplotype with or without the one or more environmental factors; (2) estimating pattern frequencies for block 2 haplotypes; (3) estimating the pattern frequencies for a combination of the first and the second blocks using of selected and pooled patterns for block 1 and selected and pooled patterns for block 2, wherein a selected pattern has greater than a predetermined minimum frequency and a pooled pattern is a haplotype that is not selected; (4) sequentially repeating steps (2) and (3), wherein during each repetition the patterns for one additional block is added to the previous combination of blocks.
The second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies. In another embodiment, the third aspect of the invention provides a haplotype- based method of diagnosing an increased risk of developing a trait. In this embodiment, the method comprises four parts. The first part of the method estimates pattern frequencies in a group of trait-positive individuals and in a group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of (1) one or more haplotypes or diplotypes at one or more loci, (2) one or more environmental factors, and (3) a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors, wherein the genotypes of at least some of the markers are known for each individual in the population and wherein the phase information for the markers is incomplete, and wherein the algorithm provides a standard error measurement for each estimated pattern frequency. The forward-block algorithm comprises a series of steps for both individuals in the trait-positive group and individuals in the trait-negative group. First, each set of markers is divided into a plurality of blocks. A block may contain between about 3 and about 10 markers. For each set of markers in both trait-positive individuals and trait- negative individuals, the first part of the method comprises the following steps: (1) estimating pattern frequencies for block 1 haplotypes with or without the one or more environmental factors, wherein a block 1 haplotype comprises the subset of markers in a first block; (2) estimating pattern frequencies for block 2 haplotypes with or without the one or more environmental factors, wherein a block 2 haplotype comprises the subset of markers in a second block, (3) estimating the pattern frequencies for a combination of selected and pooled block 1 haplotypes with or without environmental factors and selected and pooled block 2 haplotypes with or without environmental factors, wherein a selected block haplotype has greater than a predetermined minimum frequency and a pooled block haplotypes is a haplotype that is not selected; (4) sequentially repeating steps (2) and (3), wherein during each repetition the pooled and selected haplotypes with or without environmental factors for one additional block is added to the selected and pooled haplotypes with or without environmental factors for the previous combination of blocks.
The second part of the method estimates the differences in pattern frequencies in the trait-positive group and the trait-negative group, wherein the algorithm provides a standard error measurement for each estimated difference in pattern frequencies.
The third part of the method derives one or more patterns that are significantly associated with the trait.
The fourth part of the method diagnoses an increased risk of developing the trait in a trait-negative individual by determining the presence of a pattem that is significantly positively associated with the trait or the absence of a pattern that is significantly negatively associated with the trait. The validity of the method of this aspect of the invention was established using
Monte Carlo simulation studies, as described in EXAMPLE 11. The simulations were performed under the null hypothesis with no covariates (FIGURE 15), under the null hypothesis with covariates (FIGURE 16), under the null hypothesis with varying sample sizes (FIGURE 17), under the alternative hypothesis with no covariates (FIGURE 18), and under the alternative hypothesis with covariates (FIGURE 19). In one embodiment, the method is used in an actual case-control study, as described in EXAMPLE 11 and FIGURE 20.
In a fourth aspect, the invention provides a method for associating haplotypes for one or more sets of markers and one or more environmental factors with multiple phenotypes.
Yet another aspect of the invention provides computer programs and systems for implementing the haplotype-based methods of the invention. In an illustrative embodiment, the computation steps of the previous methods are implemented on a computer system or on one or more networked computer systems in order to provide a powerful and convenient facility for forming and testing network models of biological systems. In some embodiments, the computer system can include but is not limited to a hand-held device, a server computer, a desktop personal computer, a portable computer or a mobile telephone. A representative computer system is a single hardware platform comprising internal components and being linked to external components. The internal components of this computer system include processor elements interconnected with main memory.
The computer system includes a processing unit, a display, an input/output (I/O) interface and a mass memory, all connected via a communication bus, or other communication device. The I/O interface includes hardware and software components that facilitate interaction with a variety of the monitoring devices via a variety of communication protocols including TCP/IP, XI 0, digital I/O, RS-232, RS-485 and the like. Additionally, the I/O interface facilitates communication via a variety of communication mediums including telephone land lines, wireless networks (including cellular, digital and radio networks), cable networks and the like. In an actual embodiment of the present invention, the I/O interface is implemented as a layer between the server hardware and software applications. It will be understood by one skilled in the relevant art that alternative interface configurations can be practiced with the present invention.
The external components include mass storage. The mass memory generally comprises a RAM, ROM, and a permanent mass storage device, such as a hard disk drive, tape drive, optical drive, floppy disk drive, or combination thereof. The mass memory stores an operating system for controlling the operation of the premises server. It will be appreciated that this component can comprise a general piupose server operating system as is known to those skilled in the art, such as UNIX, LINUX, or Microsoft WINDOWS NT. The memory also includes a WWW browser, such as Netscape's NAVIGATOR or Microsoft's Internet Explorer browsers, for accessing the WWW. This mass storage can be one or more hard disks (which are typically packaged together with the processor and memory). Other external components include user interface device, which can be a monitor and keyboards, together with pointing device, which can be a "mouse", or other graphic input devices. Typically, computer system is also linked to other local computer systems, remote computer systems, or wide area communication networks, such as the Internet. This network link allows computer system to share data and processing tasks with other computer systems.
Loaded into memory during operation of this system are several software components, which are both standard in the art and special to the instant invention. These software components collectively cause the computer system to function according to the methods of this invention. These software components are typically stored on mass storage. Alternatively, the software components can be stored on removable media such as floppy disks, CD-ROM, or other networked devices. A software component represents the operating system, which is responsible for managing computer system and its network interconnections. This operating system can be, e.g., of the Microsoft Windows family, a UNIX operating system, or a LINUX-based operating system. Another software component represents common languages and functions conveniently present on this system to assist programs implementing the methods specific to this invention. Languages that can be used to program the analytic methods of this invention include C, C++, or, less preferably, JAVA. Most preferably, the methods of this invention are programmed in mathematical software packages which allow symbolic entry of equations and high-level specification of processing, including algorithms to be used, thereby freeing a user of the need to procedurally program individual equations or algorithms. Such packages include, e.g., MATLAB from Mathworks (Natick, MA), MATHEMATICA from Wolfram Research (Champaign, 111.), and MATHCAD from Mathsoft (Cambridge, MA). The analytical methods of the invention can be programmed in a procedural language or symbolic package.
The mass memory generally comprises a RAM, ROM, and a permanent mass storage device, such as a hard disk drive, tape drive, optical drive, floppy disk drive, or combination thereof. The mass memory stores an operating system for controlling the operation of the premises server. It will appreciated that this component can be comprised of a general-puφose server operating system as is known to those skilled in the art, such as UNIX, LINUX, or Microsoft WINDOWS NT. The memory also includes a WWW browser, such as Netscape's NAVIGATOR or Microsoft's INTERNET EXPLORER browsers, for accessing the WWW.
The mass memory also stores program code and data for interfacing with various premises monitoring devices, for processing the monitoring device data and for transmitting the data to a central server. More specifically, the mass memory stores a device interface application in accordance with the present invention for obtaining monitoring device data from a variety of devices and for manipulating the data for processing by the central server. The device interface application comprises computer-executable instructions which, when executed by the premises server obtains and transmits device data as will be explained below in greater detail. The mass memory also stores a data transmittal application program for transmitting the device data to a central server and to facilitate communication between the central server and the monitoring devices. It will be appreciated that these components can be stored on a computer-readable medium and loaded into the memory of the premises server using a drive mechanism associated with the computer-readable medium, such as a floppy,
CD-ROM, DVD-ROM drive, or network drive.
Alternative systems and methods for implementing the analytic methods of this invention will be apparent to one of skill in the art and are intended to be comprehended within the accompanying claims. In particular, the accompanying claims are intended to include the alternative program structures for implementing the methods of this invention that will be readily apparent to one of skill in the art.
The following examples are provided for the puφose of illustrating, not limiting, the invention.
EXAMPLES EXAMPLE 1
The Estimating Equation Method
This example describes the estimating equation (EE) method, along with a computationally efficient algorithm.
1. Derivations of the Likelihood and Estimating Equations Suppose we have a sample of n unrelated individuals from a population. From each individual, we observe q SNP-genotypes on a specific region in genome, e.g., on a candidate gene. Let gt : = (gn,gi2. - • •>£,,,) denote the q SNP-genotypes for the ith individual, where gy = g : g2 denotes a pair of alleles at they'th SNP locus. There are two possible values, 0 and 1, for each allele; 0 represents a reference allele and 1 represents a variant allele. From those observed genotypes, we estimate haplotype frequencies and covariance of estimates using estimating equation technique.
To ensure the efficiency of EE method, we derive the estimating equations from the likelihood of haplotype frequencies. Excoffier & Slatkin derived a likelihood of haplotype frequencies by modeling each individual's two haplotypes directly (Excoffier &
Slatkin (1995) Mol. Biol. Evol. 12:921-927). Because phases are unknown, the derivation is not straightforward and the required assumptions are not implicit. To enhance the theoretical basis, we propose to treat phase indicators as latent variables, and then to use the conceptual framework built for the random effect model to motivate the derivation of the same likelihood, so that assumptions required for the derivation becomes transparent. Let py denote the phase status for the ith individual and the y'th locus; py = 0 implies that g1 comes from one chromosome (either paternal or maternal copy but fixed for all loci) and ptj = 1 implies that gl is from that chromosome. Let Pi = (Pt\ . Pa . ■ ■ • . P , ) • The likelihood function is the product of the distribution function of g., given haplotype frequencies Θ = (θx,...,θτ), where T = 2V is the total number of possible haplotypes, over n independent individuals, that is,
To incoφorate the phase information, we rewrite the distribution of gt as a marginal distribution of its joint distribution with phase p. and further decomposed the joint distribution as,
/(?,;*) = Σ f g,-Pi.Θ = Σ /(?/ 1 p,;0)f(p.) . (2) p, ft where f(p() is a prior distribution of the phase which describes a random process of phase at all SNP loci. Recall that the uncertainty of phase is due to the nature of genotyping. Hence, it is appropriate to assume that the phase indicators are independent, i.e., f(Pi) = T"T /(Py) - Furthermore, ifgl7 is homozygous for the first heterozygous
7=1 genotype, there is only one solution for haplotypes, therefore, f(pjj = 0) = 1 . If gtj is heterozygous genotype but not the first one, there are two solutions for haplotypes, therefore, f(pi} = 0) = f(py = 1) = 1 / 2 . As a result, the distribution function f(pl) = l/2c' , where e, is the number of heterozygous loci minus one for the ith individual. Given the phase information, genotype g{ can be represented by a pair of haplotypes namely, (H' : H,2) . Under the assumption of the Hardy- Weinberg
Equilibrium (HWE), haplotypes are randomly paired to form individuals' genotypes. Therefore, f(g, \ pfβ) can be written as f(g, I Pi;θ) = f(H),H ;θ) = f(H);θ)f(H ;θ) . (3) Let
Figure imgf000036_0001
all possible haplotypes given q SNP markers. Without requiring any further distributional assumptions, the function f(H ;θ) may be represented by a multinomial distribution function: mϊrff)
Figure imgf000036_0002
(4)
T where /(•) is an indicator function and T θt = 1 .
1=1
Combining equations (2)-(4) and the distribution of phase with the likelihood function (2.1), we have a likelihood function, which may be written as
Figure imgf000036_0003
This likelihood function is essentially the same as the likelihood function that was derived in Excoffier & Slatkin (1995) Mol. Biol. Evol. 12:921-927. This likelihood function was also used in the two Bayesian methods (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989; Niu et al. (2002) Amer. J. Hum. Genet. 70:157-169).
To obtain the maximum likelihood estimate of θ , theoretically one can maximize the likelihood function (5) by taking derivatives of log-likelihood function and solving the resulting score estimating equations, which can be written in the following form after some algebraic manipulations,
∑ ∑ V-'Fif(pJ \ gi;θ) = 0, (6) ι=l Pi where Fi = (Fn,Fi2,...,FiT)' in which R.. = I(HA hj) + I(H = λ,) -20y ; V is the covariance matrix of the multinomial distribution; and (/>, ! £,;#) is the posterior distribution of phase /?,., given genotype g{ at multiple loci and haplotype frequencies θ . and f(pt \ gf,θ) are both functions of θ, with
0,(1-0,) -θ.θ2 —θfj
2ΘX θ2(l -θ2) —θ2θj
V = (7)
jθ —θjθ2 θj (1 — θj τ. ))j and f(H ;θ)f(H2;θ)f(Pi) f(Hl;θ)f(H2;θ) f i n. g. ' t) ) = ■ ; = = '■ ■ (8)
~ ~" ∑AH];ΘMH*;θ)f(P,) ∑AH ;ff)f{H>;ff) In (6), V is the covariance matrix of multinomial distribution, which is a semi- positive definite and constant over all individuals. Consequently, solving score estimation equations in (6) for θ are equivalent to solving the following equations, which are called estimating and denoted by U(θ) ,
U(θ) = ∑ ∑ F (P I g,.Φ = ∑ EPl C5 I g ,θ) = 0 , (9) ι'=l p, ι'=l or
Figure imgf000037_0001
where /,. = (IΛ,...,IiT) , Ii} = 1(H) = hj) + I(H? = h;) , which can be 0, 1, and 2.
We can find the estimate of θ by iteratively evaluating θ using (10) where f(pt \ gt;θ) is computed for the estimate oft? in the previous iteration using (8) until the convergence is reached. The iteration starts with letting f(p, | g ,θ) = f(pt) = — — or specifying an initial value for θ .
Comparing to the EM method of Excoffier & Slatkin (1995), in the expectation step, the probability of resolving each genotype into the different possible pairs of haplotypes is computed using the formula that is equivalent to the posterior distribution given in (8). In the maximization step, haplotype frequencies are computed using the equations that are equivalent to the ones given in (9) and (10). 2. A Forward-Block-Computational Algorithm
The estimation described in the previous section is usually slow when the number of SNPs is greater than 10 and fails to resolve any result when the number of SNPs is larger than 20. This phenomenon has also been reported for the EM method (Fallin & Schork (2000) Amer. J. Hum. Genet. 67:947-959). The reason is that the number of possible values of individual's phase increases exponentially in the number of heterozygous loci. Each possible phase gives rise to a pair of haplotypes. The requirement to save those haplotypes can easily exceed the capacity of most modem computers.
To overcome this computational hurdle, we introduce a computational algorithm in which a large number of SNPs is divided into several blocks and each block contains a small number of SNPs. First we estimate haplotype frequencies for the first two blocks separately using the method discussed in the previous section. Because of the low diversity of haplotypes, there are not many haplotypes with non-zero estimates of frequency in each individual block. We then join the first two blocks and estimate the frequencies of haplotypes for the enlarged block using the estimation results done for the first two blocks as initial values. Successively, we estimate haplotype frequencies for the next single block and add the block to the previously enlarged block. The estimations are done sequentially for each enlarged block and next single block. The computations are carried out in a forward stepwise fashion, as shown in FIGURE 1. This computational algorithm is similar to the progressive ligation algorithm proposed by Niu et al. (2002)
Amer. J. Hum. Genet.70:157-169.
To be specific, the genotype datag(,t = l, •••,«, with q SNPs is divided into several blocks glΛk, g, k+nu, ■■•, g <i„-i)k+i)Mk- i = — ,n, with equal length k (we typically choose A: = 10) and if necessary, a residual segment g, kq,i = l,---,n with length q-Mk which is less than k. Using the method described in the previous section, we first find the partial haplotypes corresponding to the genotype data glU,i = l,---,n in the first block. We then select haplotypes if their estimated frequencies are above δ (e.g. δ = 10~5 ) and denote these haplotypes by partial haplotypes by hJlk,j = 1,---, R, and their corresponding frequencies byθJlk,j = l,---,Bi, where R, is the number of haplotypes whose estimated frequencies are above δ in the first block. Similarly, we find the partial haplotypes hj(k+1)2k,j = 1,---,B2 and corresponding frequencies θjXk+l)2k,j = l,---,B2 for the genotype data g,χM)2k,i = l,---,n in the second block. The partial haplotypes h)U,j =l,---,B in the first block are joined to the partial haplotypes hj(k+)2k,j = l,---,B2 in the second block to form the list of partial haplotypes for the genotype data (g, k,g,χk+ι)2k), i = A".n, in the merged block from the first two blocks as
Figure imgf000038_0001
= l,2, and corresponding frequencies θJl2k}λkJι(k+l)2k, j= /, *j2, jt =!,■■-, B„t = 1,2. We use the partial haplotypes hj]2k, j = l,---,B1xB2, and their corresponding frequencies θ l2k, j = 1, • ■ • , R, x B2 , as initial values in the estimation iteration steps in (9) and (10). Again, after the estimation, we select the haplotypes with estimated frequencies are above δ for the estimation in following steps. We repeat the process for the merged block and the next single block until that the last block has been merged to the bigger block. At each step, haplotypes with insignificant frequencies (less than^ ) are filtered out. Thus, the value of δ controls the number of haplotypes in the final estimation step.
The smaller is, the more the haplotypes are retained and therefore the more accurate the final estimates of haplotype frequencies, but at the cost of computing speed. One can choose the value of δ to balance between the accuracy and the computing speed. 3. Covariance matrix of θ
The EM method of Excoffier & Slatkin (1995) computes an inverse of information matrix for covariance oft? . However, the information matrix is very complex and is sometimes not invertible when the number of SNPs gets large and the number of haplotypes becomes large. Here we use the estimating equation theory, which was previously developed for a general application (Zeger & Liang (1986) Biometrics 42:121-130) and for a genetic application (Zhao et al. (1998) Amer. J. Hum. Genet. 63:255-240), to calculate the covariance ∑(θ) . Thus,
∑(θ) = (dU(θ) / dθ)~l var [U(θ)] (dU(θ) I
Figure imgf000039_0001
(11) dU(θ) where = -∑ [2/ - var(E(. \ g{;θ)V '] is the derivative matrix of estimating dθ j ~ function with respect to haplotype frequencies, and var [[/( ?)] = ∑ Ep (Ft \ gx;θ Ep (F, ' | gf,θ) is its variance matrix. The covariance matrix i of θ can be estimated by ∑(θ) and the standard error of θ is estimated by the square root of diagonals of ∑(θ) . Furthermore, provided that the conditional probability f(Pi I g>θ) *s correctly specified, the asymptotic distribution off? is normal with mean θ and covariance ∑(6?) .
Experimental evidence suggested that only a few common haplotypes exists within each candidate gene. Therefore, most haplotype frequencies are very small or near zero. Estimation of haplotype frequency and its standard error for any rare haplotype in a reasonably sized sample is not reliable because such uncommon haplotype frequencies can cause instability of the inverse function in equation (11). We therefore excluded the haplotypes with 2nθj < 5 from θ and calculated the covariance matrix ∑(θ) (therefore the standard error of θ) only for the common haplotypes with 2nθj ≥ 5 . Our experience has shown that the standard error of θ for the common haplotypes does not change significantly if the uncommon haplotypes are excluded from the calculation. 4. Missing Genotypes
Despite advances in genotyping technologies, genotypes on some loci cannot be determined therefore are treated as missing data. Typically, the missing genotype is associated with the choice of probe sequences, experimental conditions or random fluctuations. It is reasonable to assume that the missing mechanism is either Missing
Completely at Random or Missing at Random (Heitjam (1994) Biometrika 81:701-708).
Under this assumption, it has been shown that the likelihood, resulting from integrating out missing data, can yield unbiased and efficient estimates. The same property holds for the estimates using estimating equations (9) and (10) because equations (9) and (10) are derived from the score estimating equation (6).
There are two possibilities for missing genotype data: 1) missing both alleles and
2) missing one allele, although the second possibility is very rare. For a bi-allelic SNP, if both alleles are missing, there are three possibilities: the genotype is homozygous with wild type (0:0) or with variant type (1:1) or heterozygous either (1:0) or (0:1). If one allele is missing, there are two possibilities: either 0 or 1 for that missing allele. Let gt = (g,°,g,Λ/) where superscripts O and M represent observed and missing genotype, respectively, for the fth individual. Then, the estimating equations (10) is rewritten in the following equations U(θ) = FJ(Pl I g, ;φf(g I g?) = 0 (12)
Figure imgf000040_0001
where f g^ \ g°) is the conditional distribution of missing genotypes given observed genotypes which is assumed to be uniformly distributed among all possible values. The covariance matrix is adjusted accordingly. 5. Inference of Haplotypes of Individuals Although estimating haplotype frequencies for a population is the most interesting task in the population studies, inferring individuals' haplotypes from their genotypes can also be of use in some applications, e.g., risk assessment in genetic counseling. If one or more haplotypes have been identified as disease-associated haplotypes and if the penetrance from the haplotypes to the disease phenotype is specified, an inference of haplotypes from an individual's genotypes at multiple SNP loci can be used to predict the individual's risk of having the disease. Using the posterior distribution of phase f(P, \ g,>'θ) given in equation (8), we can predict the phase of individual's genotype and, therefore, provide a pair of haplotypes with probability statement.
EXAMPLE 2 Simulation Study This example describes a simulation study to assess the accuracy of estimates under the HWE model assumption and when the model assumption is violated, and when the number of SNPs is large.
Estimates of haplotype frequencies and their SE derived from estimating equations (10) and (11) have been theoretically proved to approach their "true" values as the sample size goes to infinity under the assumption of HWE (Lian & Zeger (1986) Biometrika 73:13-22). For finite sample sizes, using simulation, we assess accuracy of estimates under the model assumption and under the model assumption being violated. We also assess accuracy of estimates when the number of SNPs is large. The schematic flow chart of the simulation is shown in FIGURE 2. In the generation step (Step G), we generate population haplotype frequencies. In the sampling step (Step S), we sample individuals' genotypes by drawing pairs of haplotypes from the population haplotypes generated in Step G. Sample haplotype frequencies are estimated from the sampled individuals' genotypes with known phase. In the estimation step (Step E), we estimate haplotype frequencies and standard errors (SE) from the sampled individuals' genotypes using the EE method discussed in EXAMPLE 1 and ignoring the phase information. We repeat Step G and Step E for 500 times after each Step G. Then, we repeat Step G for 20 times.
We use function d(λ,λ) = - λj , where A and λ denote for the estimated
Figure imgf000041_0001
and the reference value, respectively, to measure an overall discrepancy of the estimated values from "true" values. ∑ is the summation over all haplotypes, when the j estimation of haplotype frequencies is assessed, and the summation over all common haplotypes, when the estimation of SE is assessed, d = 2 * D where D is a discrepancy measure (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989), and d = 2 * (1 - IF) ,where IF is a similarity measure used in Excoffier & Slatkin (1995).
We first assess statistical proprieties of the EE estimation for haplotype frequencies and SE for finite sample sizes and a small number of SNPs. We choose the sample sizes to 10, 50, 100, 150, 200, 500, and 1000, and the number of SNPs to be five. For five SNPs, there are 32 possible haplotypes. In our first sets of simulation data, we assume an equal probability for all 32 haplotypes in the population. The genotype data is generated under HWE, i.e., each individual' genotype is generated by forming from two haplotypes that are randomly drawn from the population. The obtained results (measured in mean square errors) are very similar to those obtained by Fallin & Schork (2000) using an EM method (Fallin & Schork (2000) Amer. J. Hum. Genet. 65:947-959).
Then, we assume that population haplotype frequencies are unequal, which is more realistic. To generate unequal probabilities for 32 haplotypes, we first generate 32 random numbers from a uniform distribution on (0,1) and then add 4, 8, 12, 16, and 20 to five randomly selected numbers from the 32 numbers. Finally we re-scale the number such that they add up to one. On average, we have 11 out of 32 probabilities whose value is above 1%. The genotype data is generated under the assumption of HWE and under the departures of the assumption. Here we consider two scenarios in which the model assumption is not satisfied. In the first scenario, denoted by HWD(l), we assume that one out of five SNP loci carries a lethal mutant allele and that individuals with a heterozygous genotype (0/1) at that locus have a 50% chance for survival and individuals with a homozygous variant genotype (1/1) have no chance for survival. In the second scenario, denoted by HWD(2), we assume that one haplotype is associated with a disease and that individuals with one copy of the haplotype have a 75% chance for survival and individuals with two copies have a 50% chance for survival. For general consequences of departures from HWE to estimation of haplotype frequencies, see Fallin & Schork (2000) Amer. J. Hum. Genet. 65:947-959.
The simulation results are shown in FIGURE 3. The top panels present the average discrepancy of the estimated haplotype frequencies from the true population frequencies (dashed line) and the average discrepancy of the estimated haplotype frequencies from the sample frequencies (solid line) under the assumption of HWE, HWD(l), and HWD(2), respectively. The discrepancy of the estimated haplotype frequencies from the true population frequencies assesses the overall validity of the final estimates of haplotype frequencies with respect to the true population values. And the discrepancy of the estimated haplotype frequencies from the sample frequencies assesses the accuracy of the estimation method. The difference between the two discrepancies is due to sampling error. The difference approaches to zero when the sample sizes approach infinity. The simulation results show that the estimated haplotype frequencies by the EE method are consistent even when sample sizes are small. The discrepancies are not affected significantly by the departures of HWE. These observations are consistent to the observations done in Fallin and Slatkin (2000) for the EM method (Fallin & Schork (2000) Amer. J. Hum. Genet. 65:947-959). The bottom panels present the average discrepancy of the estimated SE using the EE method from the sample SD of the estimates. The small discrepancy indicates that the EE method gives correctly estimates of standard errors. The discrepancy is not affected by the departures of HWE. As we discussed in EXAMPLE 1, estimation of frequencies for rare haplotypes is not reliable for small sample sizes; therefore the estimation is carried out only for common haplotypes. The average number of common haplotypes for sample sizes of 30, 50,100, 150, 200, 500, and 1,000 is 2.3, 3.6, 4.7, 5.0, 5.3, 11.8, and 20.9 under HWE, 2.1, 3.3, 4.3, 4.8, 5.1, 10.5, and 17.2 under HWD(l), and 2.4, 3.6, 4.6, 5.0, 5.2, 10.7, and 20.0 under HWD(2). The number of common haplotypes increases with sample size because a haplotype is considered "common" if its estimated frequency multiplied by twice the sample size is at least 5.
In the next set of simulations, we assess the statistical proprieties of estimates of haplotype frequencies and SE using the EE method for finite sample size and larger numbers of SNPs. We choose the sample size to be 100, and the number of SNPs to be 20, 40, 60, 80, and 100. Although, the number of possible haplotypes is extremely large, varying from 220 to 2100, the actual number remains relatively small in a well-defined population. Here we use the coalescence-based program from Dr. R. R. Hudson's website (http://home.uchicago.edu/~rhudsonl/source/mksamples.html) to generate population's haplotypes. The coalescence process (Hudson (2002) Bioinformatics 18 (2):337-8) was developed to simulate the population evolutionary process. One needs to specify either a mutation rate or a number of segregating sites (equivalent to SNP) using the coalescence- based program. One can also specify a recombination rate (R = 4N0r where N0 is the effective population size and r is the recombination rate per generation between a length of base pairs). The higher the specified recombination rate, the larger the number of expected possible haplotypes. In our simulations, we choose five different numbers of segregating sites, 20, 40, 60, 80, 100, and three different recombination rates, R = 0, 4, 40 between 1000 base pairs. For each combination of the number of segregating sites and recombination rate, we generate 2000 haplotypes to estimate the population's haplotype frequencies. In Step S, we generate genotypes for 100 individuals. Each individual's genotype is generated from the generated haplotype population under the assumption of HWE. The rest of the simulation procedure is as discussed at the beginning of this example. The simulation result is shown in FIGURE 4. Similarly, the top panels show the average discrepancy of the estimated haplotype frequencies from the population haplotype frequencies (dashed line) and the average discrepancy the estimated haplotype frequencies from the sample haplotype frequencies (solid line). Both discrepancy measures for the EE estimates of haplotype frequencies increase slowly when the number of SNPs increases, especially for small recombination rates and the second discrepancy measure. The overall discrepancy is similar to that observed in the first set of simulation, and supports the accuracy of EE estimates.
The bottom panels show the average discrepancy of the estimated standard errors of the estimates of haplotype frequencies from the sample standard deviation of the estimates of 500 replicates. The average numbers of observed population haplotypes corresponding to 20, 40, 60, 80, and 100 SNPs are 17, 28, 37, 54, and 59 when R = 0 ; 21, 35, 48, 60, and 65 when ; and 42, 84, 107, 119, and 135 when R = 40. The average number of common haplotypes observed from 100 individuals' genotypes is between 6 and 12 regardless of the recombination rate.
The same restrictive stopping criterion (epsilon = 10-5) in estimation iterations and filtering criterion ( = 10-5) in the forward-block-computational algorithm are used in all simulations. If the smaller values are applied both criteria, the discrepancies are expected to be smaller than ones given in FIGURES 3 and 4. The EE method not only produces accurate estimates of haplotype frequencies and standard errors but also is computationally efficient. The running time is on average 0.04-2 seconds for each data set in our simulation studies. All simulations were run on a 3 -machine MOSIX cluster, with each having a dual Pentium III 800 MHZ with 2GB RAM, running under Linux 2.2.18-mosix. The running time depends linearly on the number of SNPs, the number of subjects and complexity of the data set. We recently estimated haplotype frequencies for a dataset with 632 subjects and 296 SNPs resulting in 38 common haplotypes. This calculation was completed in approximately 2 minutes in the same computing environment. We expect that the EE method can handle a data set with the number of subjects up to several thousands and the number of SNPs up to several hundreds. EXAMPLE 3
The Use of the EE Method on an Actual Dataset This example describes the comparative use of the EE method and previous methods on an actual dataset. To illustrate the EE method for an actual data set, we used data on gene ARHGDIB, one of the 19 regions on chromosome 12 (Reich et al. (2001) Nature 411:199-204). To compare our method with other existing methods, we also analyzed the data using Arlequin software (Schneider et al. (2000) http//lgb.unige.ch arelquin), which is an implementation of the EM method proposed in Excoffier & Slatkin (1995), PHASE software, which is an implementation of the Bayesian method (Stephens et al. (2001) Amer. J. Hum. Genet. 68:978-989), and HAPLOTYPER software, which is an implementation of the Bayesian method (Niu et al. (2002) Amer. J. Hum. Genet. 70:157- 169). There were 18 SNPs found on gene ARHGDIB. Among the 18 SNPs, only 8
SNPs were with complete genotypes on all 44 individuals, the sample size used in that study. The rest of the SNPs were missing genotypes in one or more individuals. Because only our EE method and HAPLOTYPER can handle the missing genotype data, we first analyzed the whole data set using the two methods. The EE method found 20 haplotypes and HAPLOTYPER found 21 haplotypes.
16 haplotypes were found by both methods. The haplotypes that were found by only one method had estimated frequencies less than 1%. The ten most frequent haplotypes shown in Table 1 of FIGURE 5. The SE of the estimate for common haplotypes was estimated in the EE method and are given in parentheses after the corresponding estimate of haplotype frequency. Comparing the estimates of haplotype frequencies using the EE method and HAPLOTYPER, one can see that HAPLOTYPER produced a higher estimate for the more frequent haplotype than the EE method. The reason for this phenomenon is that the estimates of haplotype frequencies are estimated from the best reconstruction of individuals' haplotypes. Because each individual contributes paired haplotypes, other most frequent haplotypes' frequencies might be underestimated. If we use the reconstructed the individuals' haplotypes for a subsequent analysis (here the estimate of population haplotype frequencies), misclassification errors can result in biased estimations.
We then analyzed the subset of the data that contains the 8 SNPs with complete genotypes on 44 individuals using all four methods. The ten haplotypes were found by all four methods except that the last haplotype was missed by HAPLOTYPER, as shown in Table 2 of FIGURE 6. The estimates of haplotype frequencies obtained from the EE method match the estimates obtained from EM method up to 5 decimal points (the precision was set at 10-5 for both methods). The EM estimates of SEs using 1,000 bootstraps are slightly but consistently larger than the EE estimates of SEs for the haplotypes with high frequencies, but are smaller for the three rare haplotypes. However, the difference in the estimates from two methods is no more than 10% of the average of the two estimates. Nevertheless, for the rare haplotypes, t.e., those with estimated frequencies of less than 6.6%, the estimated SEs are as large as their corresponding frequencies, implying that the estimates for rare haplotypes are likely to be unreliable. This result is largely consistent with empirical observation (Tishkoff et al. (2000) Amer. J. Hum. Genet. 67:518-522). The last two columns in Table 2 in FIGURE 6 show the estimations of haplotype frequencies using PHASE and HAPLOTYPER, respectively. Both methods overestimated the frequencies of the most frequent haplotype.
EXAMPLE 4 The Use of the EE Method in a Case-Control Study 1. A Likelihood Derivation: Notation, Assumptions and Formalization Consider a case-control study with n subjects (i=l,2, ...,n), with cases denoted by dι=l and controls denoted as di=0. From the ith subject, a sample is genotyped for q SNPs. Let gt = (gn,gi2,- - -,giq) denote linearly ordered SNP genotypes within a single candidate gene. Let gff = g1 : g2 denote a pair of alleles at the y'th locus in the rth individual, where gk for the Mi allele has a value of either 0 or 1 for the two possible alleles of a biallelic SNP. The phase (parental origin) of an individual allele, e.g., gl , within a subject may or may not be known. Let pt = (pn,pi2,- - -,piq) denote a vector of phase indicators; p.. = 0 implies that the phase status of gl , the first allele at they'th locus for the rth subject, is known to be, for example, from the paternal side. Then in contrast, Py = 1 implies that gx is from the maternal side. Summarizing statistical information, we derive a likelihood function. Let f(g, I d) denote the distributional function of genotypes (gi) given the disease status (di), where i=l,2, ...,n. The likelihood function results from the product of the distributional functions in all samples, that is, (Θ) = Y[ fig, \ dt) , (13)
where θ denotes parameters of interest to be estimated from the data, including haplotype frequencies and their differences between cases and within controls. From the likelihood (13), it is clear that the derivation of the likelihood function is equivalent to the derivation of an individual distributional function.
In the following we derive the distributional function f(gt \ dt) , together with a discussion of the required assumptions. First of all, we decompose the distributional function by incoφorating the partial phase information. Summing over all possible phases, fig, ) = ∑ f(g„P, \ d,) = ∑ f(g, I P,Λ)fiP, ) > (14) ft p, in which the second equality holds by conditioning argument. The distribution function fiP, \ dt) describes the random process of phases at all SNP loci. The uncertainty of phase is due to nature of genotyping technology, rather than the disease phenotype. Hence, we assume that the randomness in phases is independent of the disease status, i.e., fiP, \ d,) = fip,) - hi the absence of any prior knowledge about phase, it is appropriate to assume that the phase indicators are independent, i.e., fip,) = J J fipυ) ■ Further, if
7 the phase at the jth locus for the tth individual is completely known, f(pυ = 0) = 1 , . When a phase is ambiguous, f(p = 0) = f(pυ = 1) = 1 / 2 . As a result, the distribution function f(pt \ dl) = H2c' , where c, is the number of heterozygous loci minus one for the z'th sample.
Now the distributional function f(g, \ p,,dl) specifies the association of genotypes and hence haplotypes with the disease phenotype. Conditioning upon the phase information, one can deduce two haplotypes (h) : h2), leading to fig, \ p ) = fih) : h2 \ p„dl) = f(h) : h2 \ dl) , (15) in which the second equality holds since haplotypes, once known, are independent of phases. To decompose this distribution further, it is essential to assume that two haplotypes within a subject are independent. Under this assumption, the distribution can now be written as f{gt \ p,A) = f< \ d,)fVt \ d,) t (16) in which /( ?* | dt) denotes the distribution of the kth haplotype given the disease status for the z'th subject. This assumption is crucial, and its validity is justified as follows.
Within the context of case-control studies, parents of cases and controls are not included in the study and hence all subjects may be considered as founders (although there are exceptions in some large population-based studies where relatives are included within single studies). Under a set of population genetics assumptions (Hartl & Clark (1997)
Principles of Population Genetics, Sinauer Associates, Inc., Sunderland, MA), one can show that the Hardy- Weinberg equilibrium holds and is assumed to apply to controls as well as cases, except that haplotype frequencies may differ in the two groups. Let S denote a set of all possible haplotypes, {t | t=lst, 2nd, ...,H"" haplotypes}.
Without requiring any further distributional assumptions, the function
Figure imgf000048_0001
\ d() may be represented by a multinomial distribution function:
Figure imgf000048_0002
where the product is over all H possible haplotypes, μt(dt) denotes the frequency of the tth haplotype, and the indicator function /{/?* = t} equals one if A*=t. As usual for the multinomial distribution, the summation of μ,(dt) equals one, i.e., V μt(dt) = 1 . t
To quantify the associations of the tth candidate haplotype with the disease status, one approach is to postulate the following linear model: μt(di) = at + βtdi, (18) where or, represents the frequency of the tth haplotype in controls, and (at + βt) represents the frequency in cases. The difference βt , if not equal to zero, indicates that the tth haplotype frequency is different in cases and controls. In the epidemiological literature, odds ratio is more commonly used as a measure of association. Treating the first haplotype as a reference haplotype, the odds ratio of the tth haplotype is defined as φl = (all)al / l(a1l) . (19)
Applying the logarithm results in the log odds ratio. The primary rationale for using odds ratio is that it approximates the relative risk in studies of uncommon diseases (Breslow & Day (1980) Statistical Methods in Cancer Research, International Agency for Research on Cancer, Lyon; Prentice (1985) Environ. Health Perspect. 63:225-234). Relative risk is inteφreted as the fold-change in the risk of disease associated with one haplotype versus a reference haplotype.
Integrating equations (14)-(18) with the likelihood function (13), we now have a likeliho
Figure imgf000048_0003
where θ = (a,β) , a = (aλ,a2,...,aH) and β = (β ,β2,...,βH) . Note that the above likelihood function encompasses the haplotype likelihood function derived by Excoffier and Slatkin (1995), when the disease status is ignored.
To obtain maximum likelihood estimate of θ = (a, β) , one maximizes the above likelihood function. Or equivalently, one may solve the score estimating equation for the estimator, defined as the first derivative of the logarithm of the above likelihood function (20). After some algebraic manipulation, one can show that the score estimating equation may be written as
Figure imgf000049_0001
where F, = (FΛ , Fl2 , ... , FlH ) ' , FIJ = I(h) = t) + I(h2 = t) - 2μ, (d, ) represents the difference of "observed haplotype counts" and their expectations, the covariance matrix V, is obtained under the multinomial distribution for f(hx \ dt) and f(h2 \ d,) . Now the distribution of phase given the genotype data is specified by all of the multinomial
Figure imgf000049_0002
where the summation in the denominator is over all possible phases, and the distribution function /(/*,*) is specified by (17) and is a function of (α,β) . In contrast with the earlier assumption that f(pι \ dI) = f(pt) without conditioning on genotypes (g,), the above distribution function f(pt \ g,,dt) may be thought of as a posterior probability for the phase indicators.
Conceptually, the score estimating equation (21) can be solved for the maximum likelihood estimate, but the computational burden of enumerating all possible phases
(∑ ) in the above likelihood function (20), or in the score estimating equation (21)is ft prohibitive. Overcoming such computational difficulty has motivated some to use the EM algorithm, (Excoffier & Slatkin (1995) Mol. Biol Evol. 12:921-927; Hawley & Kidd
(1995) J. Heredity 86:409-411; Long et al. (1995) Amer. J. Hum. Genet. 56:799-810).
More recently, Excoffier and co-workers began using the Monte Carlo Markovian Chain
(MCMC) method to address the computational difficulties.
To overcome this computational difficulty, we use an estimating equation (EE) approach. Examining the score estimating equation (21), we realize that it has a close connection with the efficient estimating equation derived earlier by us for combined linkage and linkage-disequilibrium analysis (Zhao et al. (1999) Amer. J. Med. Genet.
84:433-453), in which the computational burden was reduced substantially. This recognition motivated us to develop an estimation equation-based approach for correlating haplotypes with disease phenotype. The primary advantage is that one can simplify the score estimating equation (21) while retaining efficiency and without losing the consistency of estimated parameters. An important bonus is that the estimating equation technique directly yields estimates of the standard errors.
2. Estimating Equation: Estimation and Inferences To motivate the estimating equation, we examine the score estimating equation
(21), and note that , is a covariance matrix under the multinomial distribution and is a function of μt(d;) . Within the context of the case-control study, the covariance matrix is the same across all cases and similarly across all controls, even though it may be different between cases and controls. If we treat , as a "working weight matrix", it can be chosen to be the same between cases and controls. After some manipulation, one can show that this covariance matrix can be factored out from the summation and thus eliminated from the above score estimating equation, resulting in the following: U(a,β) = \ gi,di} = 0 , (23)
Figure imgf000050_0001
where the conditional expectation E (. | g,,J,) is over unknown phase indicators. By the estimating equation theory (Liang & Zeger (1986) Biometrika 73:13-22; Zhao & Prentice (1990) Biometrika 77:642-648), the choice of a "working weight matrix" does not affect the consistency of estimate, but an inappropriate choice may reduce the efficiency of the estimation. In the current context however, the misspecification is associated with the equality of covariance matrix between case and control, and hence, under the null hypothesis, there is no loss of efficiency.
In addition to simplifying the estimating equation form (23), one may select only the haplotypes of interest and estimate their frequencies, rather than estimating frequencies of all possible haplotypes within the likelihood function (20). An obvious choice is to focus on common haplotypes; estimating their haplotype frequencies and assessing their associations with the disease phenotype, and grouping the uncommon haplotypes together. So in the estimating equation (23), we choose the top Ho common haplotypes F, = (Fή,Fl2,...,FlH , (H0 < H ). Indeed, this flexibility affords us an efficient computation.
Solving the estimating equation (23) yields an estimate of ( ,β), denoted as (ά,β) . Provided that the conditional probability /fø, \gι,d,) is correctly specified, it can be shown that as the number of subjects increases, the estimated parameters (ά,β) approach their "true values". Further, the distribution of the estimated parameters approaches an asymptotic normal distribution with a covariance matrix, which may be written as
Figure imgf000051_0001
in which all quantities are evaluated at their respective estimates with a variance matrix vaτ[U(α,β)] as
∑ Zr g}E p F d,F ) \ g,A} -
Figure imgf000051_0002
The derivative matrix, dU(α,β)/d(α,β) , lacks an explicit expression, because both F, and f(p,\g,,dj in U(α,β) in equation (23) are functions of (α,β). Instead, we use a numerical derivative matrix. Some implementation details are given later. For the tth haplotype, one can obtain estimated parameters (άtt) along with their estimated standard errors (σαtβt) (t=l,2, ...,Ho), where (σωβt) are square-roots of diagonal elements in the covariance matrix of Σ . These values can be used to make inferences. In particular, one may construct a Z-test statistic for testing zero difference between cases and controls as Zt = β, lσβt , which follows a standard normal distribution under the null hypothesis of β, = 0. The estimate and its standard error can then be used to construct a 95% confidence interval for the difference between cases and controls of the tth haplotype as [βt -1.96σβt,β, +1.96σβt] . Constructing a test statistic based upon the estimating function U(α, β) is also possible.
As noted earlier, estimating the odds ratio is of interest. Using the odds ratio formula (19), one obtains a point estimate of the odds ratio for the tth haplotype, treating the 1st haplotype as a reference: φt = (άlt)ά άlλl) . To make inference from the odds ratio, one may take a logarithm of the odds ratio, so that it has desirable statistical properties: its asymptotic distribution is approximated well by a normal distribution with a standard error that can be estimated. Then, under the asymptotic normality assumption, one may compute a Z-statistic to test the null hypothesis, H0: φ, = l . Further, one can use the asymptotic normality of the log odds ratio to construct confidence interval. However, the assumption of asymptotic normality may not be valid in situations where haplotype frequencies are close to zero or to one.
An alternative is to use bootstrap methods for computing odds ratios and their confidence intervals following the scheme previously outlined (Efron & Tibshirani (1993) J. Am. Stat. Assoc. 89:463-475). We sample cases and controls from original study with replacement, and then apply the procedure described above to estimate haplotype frequencies and obtain bootstrap estimates of odds ratios. We repeat the bootstrap sampling, say 500 times, and obtain 500 odds ratios for each haplotype. Estimated odds ratios from all bootstrap samples form an empirical distribution of every odds ratio of interest. The empirical distribution is then used to construct a confidence interval. 3. An Analytic Strategy
When analyzing multiple SNPs, one has to rely on a systematic strategy for analyzing all potential haplotype-based associations. While an optimal strategy remains to be investigated, we consider an analytic strategy that searches for haplotype- associations progressively, in multiple steps. The first step is to examine allelic association with the case/control disease phenotype. The second step is to correlate haplotypes with two adjacent SNPs with the disease phenotype. Progressively, one can add more SNPs in assessing associations of phenotype with haplotypes of three, four, or more adjacent SNPs. The primary rationale is that the physical adjacency implies the linkage-disequilibrium and hence the haplotype-based associations with adjacent SNPs. Computationally, using the adjacency reduces the search domain, hence reducing effects of multiple comparisons and improving the statistical power. However, it should be noted that this strategy may miss some important haplotype-based associations among non-adjacent SNPs. 4. Multiple Comparisons and Permutation Method One of challenges facing the above strategy, and many other analytic strategies, is how to deal with the increase in false positive errors due to multiple comparisons. There are two distinct types of multiple comparison problems, one of which involves multiple loci and another, the focus here, involves multiple haplotypes. Here our primary concern is with multiple haplotypes within a single candidate gene. Further, it is unwise to rely on an asymptotic distribution to assess the rate of false positive errors, since such a rate may be quite small.
To address the multiple comparison problem, we have chosen to use a Monte Carlo method, as has been successfully applied to the Test of Disequilibrium Transmission (TDT) (Mclntyre et al. (2000) Genet. Epidemiol. 19:18-29). Briefly, the idea is to permute case/control status under the null hypothesis that SNPs and their haplotypes do not associate with disease status. Sampling from a randomly permuted case/control population, say 1000 times, we apply our analytic strategy, and compute the maximum Z-statistic of all computed Z-statistics, with varying numbers of SNPs. The distribution of maximum Z-statistics from all permuted samples is used to establish the threshold value, e.g., the critical statistic at 5% significance level. 5. A Computing Algorithm
As detailed above, attempts have been made to improve computational efficiency by appropriately constructing the estimating equation (23). Nevertheless, we expect the computational burden to be substantial, as a typical analysis involves many SNPs (e.g., more than 10 SNPs) and use of the bootstrap method for evaluating odds ratios-related statistics and permutation to address the multiple comparison problem will further increase the computational burden. Here we outline an efficient implementation of the estimating equation method by taking advantage of the case-control study design. To begin, note that haplotype frequencies a = (cti,a2,...,aH) can be estimated from controls only, while + β = (α, + ?, , a2 + β2 , ... , aH + βH ) can be estimated from cases only. For example, within just controls, the corresponding estimating equation may be written as
Figure imgf000053_0001
which can be solved for α on the left hand side of the above equation. The solution has the following representation,
Figure imgf000053_0002
which is intuitively inteφreted as weighted averages of haplotype frequencies by direct enumeration of all possible phases with weights in favor of common haplotypes. If all phase indicators were known, the above formula would yield an unbiased estimate of haplotype frequencies via direct counting. Otherwise, all possible phases have to be enumerated and each weighted according to the probability f(p,\gι,d . Despite this intuitive representation, the parameter (α) is not directly solvable by (25), because the probability f(pι\g„dj is also a function of α. Nevertheless, the above equation can be effectively used to solve for α numerically: starting from an initial value, equation (25) is iterated until convergence in all parameters is reached, thus obtaining the estimate ά . A similar procedure can be applied to cases to yield the estimate ( + β) .
Now let us comment on several technical details that are essential for an efficient implementation. First, the choice of the initial value is important, given the high dimensionality of many haplotypes. We choose the initial haplotype frequencies to equal the product of corresponding allele frequencies, as if all SNPs are independent The second issue relates to the convergence criteria. We choose two criteria: convergence in the estimating function U(a,β) (23) and convergence in estimating parameters. To check if the estimating function is convergent, we examine if it approaches zero. On the other hand, to check the convergence in estimates, we examine whether the estimated parameters stabilize and if the increments of updating parameters approach zero. We declare the convergence if both differences are below a pre-specified precision (e.g., 10' 5). The third technical issue relates to the numerical derivative matrix of the estimating function U(a, β) . Specifically, to evaluate ,. .Λ , we calculate the difference of da U(ά + άδ,β) and U(ά,β) divided by a small δά, where δ may be chosen as 10"5. The primary reason for choosing δά , instead of just δ, is to ensure precision when ά is small.
EXAMPLE 5 This example compares the use of the EE method and the EM method using data from an actual case-control study.
To demonstrate this analytical approach, we had access to genotype data from a restenosis study undertaken by the Hospital Universitario San Carlos, Ciudad
Universitaria, Madrid, Spain. Of 779 consecutive individuals who had undergone angioplasty, 342 developed restenosis within 6 months and were designated as cases. The remaining 437 patients served as controls. The study was approved by the Hospital San Carlos Institutional Review Board and appropriate informed consent was obtained from all patients in the study. Restenosis has been observed in as many as 50% of treated patients, with stenting providing some reduction Previous studies of risk factors have found little evidence for contributions from conventional risk factors such as cholesterol levels, and underlying genetic factors may influence the restenotic process, for example, by modulating the response to injury at the arterial wall (Kastrati et al. (2000) Herz 25:34-46).
DNA from all 779 patients were genotyped using a panel of biallelic sites that are candidate markers for atherosclerotic disease risk. This panel represents an expansion of a previously described assay (Cheng et al. (1999) Genome Res. 9:936-949). Briefly, the assay uses multiplex PCR and immobilized sequence-specific probes to detect amplified alleles. All probes have been validated using DNAs of known genotype, as confirmed by sequencing. Most of the candidate genes in the panel were represented by one or two polymoφhisms, but multiple genes were typed in a few genes, including apolipoprotein CIII (APOC3), a gene encoding a component of plasma lipoproteins. APOC3 was represented by six biallelic sites: three in the promoter region, one in exon 3, two in the 3'-untranslated region of exon 4 (Cheng et al. (1999) Genome Res. 9:936-949). APOC3, thus offered us an opportunity to illustrate the estimating equation approach.
Following the analytic strategy outlined in EXAMPLE 4, we analyzed this data set in several steps: one SNP a time, 2 adjacent SNPs a time, 3 adjacent SNPs a time and so on, up to 6 adjacent SNPs. Our primary objective is to illustrate the application of the estimating equation approach. The six SNPs are C(-641)A, C(-482)T, T(-455)C, C1100T, C3175G, T3206G. We coded the common allele at each site in the combined population as allele 1 and the less common allele as allele 0. For example, at SNP site (- 641), the nucleotide of C is coded as allele 1 (common allele) and A is coded as allele 0 (less common allele).
Table 3 in FIGURE 7 shows the estimated allele frequencies of allele 0 at the six SNP loci, with the actual number of copies in parentheses. Allele frequencies for cases are listed in the second column, and for controls, in the third column. The difference in allele frequencies between cases and controls, standard errors of the difference and Z- statistics are shown in the fourth to sixth columns. Estimated odds ratios and their 95% confidence intervals are shown in last two columns. In computing odds ratios, we used the most common allele as the reference allele, with an OR=l. Overall, estimated allele frequencies are comparable between cases and controls, with the exception of C HOOT, which has a marginally significant (Z=1.941), frequency difference of 0.045. The odds ratio is estimated at 0.79, suggesting that the allele T at this site (1100 of APOC3) is associated with a reduced risk of having the disease phenotype. Note that the 95% confidence intervals are constructed without adjusting for effect of multiple comparisons. Table 4 in FIGURE 8 shows estimated Z-scores, ORs and their 95% confidence intervals for the haplotype analysis with two adjacent SNPs, in which the haplotype (11) for all paired SNPs is chosen as a reference haplotype with an OR=l . Paired SNP loci are listed row-wise, (1,2), (2,3), (3,4), (4,5), (5,6), where the locus numbers are associated with C(-641)A, C(-482)T, T(-455)C, C1100T, C3175G, T3206G, respectively. All possible haplotypes, except the reference haplotype 11, are listed column-wise. For example, haplotype 00 for locus 1,2 is equivalent to the haplotype AT at C(-641)A, C(- 482)T. Upon inspecting all possible pairwise and adjacent haplotypes, we note that that a significant association with haplotype (01)=TC at locus (4,5)=(O 100T, C3175G), (Z=2.69). The associated OR is 0.69, implying that individuals carrying haplotype TC at these two loci are at a lower risk of restenosis than those carrying the CC (reference) haplotype. Recognizing that multiple comparisons had been made in both the univariate analysis and the pairwise haplotype analysis, we used the permutation method to estimate significance levels of test statistics Z at 5% and 10%. FIGURE 9(a) shows the distribution of maximized test statistics after 1,000 permutations. It appears that the estimated maximum test statistic centers around 2 with a heavy tail towards the right. Within the same 1,000 permutations, we computed maximum Z-statistics at every loci (FIGURE 9(b), solid line). Based upon this distribution, we estimated the threshold values for 90% and 95%) significance levels (dotted lines). At loci 4 and 5, the observed Z-statistics remains significant at the 90% level, but not at the 95% level. That is, the statistical significance associated with the haplotype TC at C1100T and C3175G, is between 90% and 95%.
Table 5 in FIGURE 10 shows Z-statistics, odds ratios and their 95% confidence intervals for haplotype associations with three adjacent SNPs. For an efficient presentation, Table 5 lists four possible adjacencies, i.e., 123, 234, 345 and 456, in the columns, while all possible haplotypes, excluding 111, which is the reference haplotype, are listed by row. Rare haplotypes were merged with more common haplotypes, and are thus omitted from the table. For example, the rare haplotype 110 at locus 123, i.e., haplotype CCC at C(-641)A, C(-482)T and T(-455)C, was merged with other uncommon haplotypes. Upon inspecting all Z statistics, we note that the haplotype 010 at locus 456, i.e., haplotype TCG at C1100T, C3175G and T3206G, has Z-statistic of 2.63, with a corresponding OR = 0.63. Using the permutation method, we estimated the distribution of maximum Z statistics (FIGURE 9(c)), and established the threshold values at 90% and 95%) level (FIGURE 9(d)). The results indicate that after correction for multiple comparisons, the association fails to achieve significance and the maximum Z statistics fall just below the 90% significance level.
Proceeding to haplotype analysis with 4 adjacent SNPs, we have three possible combinations of SNP loci, 1234, 2345 and 3456, and 16 possible haplotypes, with l l l l as a reference haplotype. Following the format of Table 5, Table 6 in FIGURE 11 lists estimated Z statistics, odds ratios and their 95% confidence intervals for all haplotypes with frequency exceeding 5 counts. Upon inspection all the Z statistics, it is noted that the haplotype 0110 at locus 3456, i.e., haplotype CCCG at T(-455)C, C1100T, C3175G and T3206G, has a reduced risk associated with the phenotype. Haplotype TCG (010), which was significant at locus 456, extends to haplotypes CTCG (0010) or TTCG (1010) at locus 3456. That neither haplotype was significantly associated with the phenotype is inconsistent with TCG at locus 456 identified earlier.
For illustrative puφoses, Table 7 in FIGURE 12 and Table 8 in FIGURE 13 show Z statistics, odds ratios and their confidence intervals for haplotype associations with five and six SNP loci, respectively. It appears that maximum Z-statistics become much smaller, even though the 95% CIs seem to suggest several genes with marginally significant. This exercise shows the feasibility of performing haplotype analysis with many SNPs. Indeed, when the number of SNP loci increases, the total number of haplotypes does not increase exponentially with the number of loci. The number of common haplotypes appears to be around 10, even though the theoretical number of all possible haplotypes could be far greater. This result is consistent with the recent observations by (Drysdale et al. (2000) Proc. Natl Acad. Sci. U.S.A. 97:10483-10488).
Lastly, we compared the EE approach with the EM method for estimating haplotype frequencies. Because the EM algorithm is designed to estimate haplotype frequencies in one population, we compared these two methods in the context of estimating haplotype frequencies among controls only. Because of its ease of use, we used Arlequin (http://lgb.unige.ch/arlequin), an implemented EM algorithm by Excoffier and co-workers. In the current version, Arlequin estimates standard errors by parametric bootstrap to overcome the limitation of not being able to yield standard errors directly from the log likelihood calculation (Excoffier & Slatkin (1995) Mol. Biol. Evol. 12:921- 927). The bootstrap algorithm is detailed in the documentation for Arlequin. Table 9 in FIGURE 14 lists estimated haplotype frequencies and their standard errors. Estimated frequencies by both methods are identical up to the 5l decimal point, so only one column of haplotype frequencies is shown in the second column of Table 9. The third and fourth columns in Table 7 show estimated standard errors by EM algorithm and EE approach. While the estimates obtained by the EM algorithm and the EE approach are largely consistent, the EE approach yields consistently smaller estimates for standard errors. The smaller standard errors reflect the efficiency of the EE approach in estimating haplotype frequencies. Nevertheless, it is quite comforting to learn that both EM algorithm and EE approach yield comparable results. However, the EM approach was computationally less efficient. In this application, the EE computations were at least 120 times faster than EM. Computation speed is important in the design of analytic strategies for dealing with complex situations: 1) dealing with many more SNPs, 2) testing many combinations of SNP loci, adjacent or non-adjacent, and 3) using a permutation method to estimate significance level.
EXAMPLE 6 Trait Associations with SNP Haplotypes and Environmental Variables This example describes the application of the EE method for assessing trait associations with SNP haplotypes and environmental variables.
1. Notation and Assumptions
Consider a case-control study with n subjects (i=l,2, ...,n), with cases denoted by dj=l and controls denoted as d(= . Let xt = (xn,...xic)' denote a vector of c covariables, such as clinical variables, or demographical data, or drug usage. Also obtained from the tth subject is a biological sample, which is used to genotype for multiple SNPs. Let gi = (g >gi2 >--->giq) denote linearly ordered SNP genotypes within a single candidate gene, and the extension to multiple candidate genes is straightforward by including an additional subscript. Let gy = gl : g2 denote a pair of alleles at they'th locus in the tth individual, where gk for the i allele has a value of either 0 or 1 for the two possible alleles. Due to the nature of genotyping technology, the parental origin of individual alleles is unknown, the knowledge for which is referred to as phase. Let Pi =
Figure imgf000059_0001
denote a vector of phase indicators; ptj = 0 implies that the first allele at the y'th locus for the tth subject, gl , is from father. Then in contrast, ptJ = 1 implies that g1 is from mother. When phases are known, (gh pi) defines two haplotypes, denoted as h) : h2 , where h- = (Λ*,...,Λ* ),k = 1,2 .
As noted earlier, the analytic objective is to correlate haplotypes of multiple SNPs with the disease phenotype, i.e., the penetrance from haplotypes and covariates to disease phenotype. To model this penetrance function, we assume a logistic regression that relates haplotypes and covariables with disease phenotype via:
Pr(-J = l \ h , h , x.) = ϊ— j— ; , (26)
' ' « ' " ' l + exp[-« -/(Λ, 2, ,., ?)] ' where I(h),h2 ,xt,β) is a function of haplotypes and covariables (A?,/.,2,*,) with β as regression coefficients. There are many choices for this function I(h],h2 xt,β) . For example, to assess the main effect of haplotypes with an adjustment for covariables, one may choose the following function:
I(h),h2t.,β) = βx \K(h)) + K(h2)] + β2 ' ,., (27) where K(.) represents a vector of indicator functions for various haplotypes. Besides, there can be many other choices, some of which are listed in the section on Analytic Strategies. Estimating regression coefficients (β ,β2) are of interest. The rationale for choosing the logistic regression function includes: 1) regression coefficients (βx2) are rather inteφretable as log odds ratios, and log odds ratios approximate log relative risk when the disease incidence rate is low (Prentice & Pyke (1979) Biometrika 66(3):403- 11); 2) The logistic regression technique has been well studied in biostatistical literature and its statistical properties are well known; and 3) The logistic regression is routinely applied to epidemiological studies to inteφret results from case-control studies (Rothman & Greenland (1998) Modern Epidemiology, Lipincott-Raven Publishers, Philadelphia), and is thus readily accepted to study gene/environmental interactions.
The logistic regression defined by equation (26) is a penetrance function quantifying the disease probability given paired haplotypes and covariates, and is not directly estimable case-control studies. In particular, the intercept a , specifying the baseline prevalence rate, is unspecified in case-control studies (Prentice & Pyke (1979) Biometrika 66:403-11; Whittemore (1995) Biometrika 82:57-67). It has been shown that if the logistic regression (26) is assumed as the penetrance function, the logistic regression model can still be used but the intercept is a function of the intercept a , the sampling fraction and disease probability, which may be written as = fid: = 1 , (28)
Figure imgf000060_0001
where = α + log[ — ] represents a shifted intercept on the logit scale, π is a π(l-η) fraction of cases in the study, and η is the disease probability in the general population.
Clearly, this intercept in this model (28) is not terribly meaningful and is best treated as a nuisance parameter, since the disease probability in the general probability is unknown. Nevertheless, the intercept ξ is estimable and is constraint by the case-control sampling scheme (Prentice & Pyke (1979) Biometrika 66:403-11).
Conceptually, the analytic objective is to estimate parameters ( , ?) via an estimating equation, the derivation for which is detailed in EXAMPLE 7. Briefly, one could construct an estimating function for (ξ,β) , if phases were known. Since phases are largely unknown, one can treat them as latent variables. After obtaining a posterial distribution of phases given all observed data (phenotypes, genotypes and covariates), one can integrate out these latent haplotypes via conditional expectation of estimating function given observed data. Setting the integrated estimating function to zero results in an estimating equation for estimating (ξ, β) . 2. An Estimation Procedure
As being alluded to above, our analytic objective is to estimate regression parameters (ξ,β) (26). The estimate, denoted by (ξ,β) , satisfies the following estimating equation:
U(ξ,β)
Figure imgf000060_0002
where Xi =
Figure imgf000060_0003
is the partial derivative of I(h),h2 ,xt,β) with respect to β an f(Pi \ gi.dj,xi) is the posterial probability of latent phases given observed data.
Under the assumption that the diseases are uncommon, this conditional probability may be approximated (see EXAMPLE 8), and may be expressed as
Figure imgf000061_0001
where f(h],h ) = f(h])f(h2) is the joint distribution of paired haplotypes and satisfies the Hardy- Weinberg equilibrium in the general population. Note that for controls (d,=0), the above posterial probability degenerates to a simple calculation of probability observing a certain phase in the general population given genotype. Among cases, this probability calculation is simply weighted by exponential function exp[/(Λ1 1 , h2 , xt , β)] . It is also worth noting that the above conditional probability (30) does not depend on the intercept ξ or , implying that the estimation would be robust against potential misspecification of this intercept.
It is important to realize that the marginal joint distribution f(h),h2) , which does not depend on the disease status, is entirely specified by haplotype frequencies in the general population and needs to be calculated. Under the assumption of rare diseases, one can treat control population representative of the general population, and can estimate haplotype frequencies using controls. To proceed with estimating haplotype frequencies in controls, we propose to use the same estimating equation that has been described in EXAMPLE 4. Briefly, one may represent the distribution of haplotypes, f(h]) , via a multinomial distribution with haplotype frequencies π = (πh ,...,πh ) , for all possible haplotypes (1 ,h2,...,hH) . To estimate π = (πh ,...,πh ) , one may use the following estimating equation
' I(hA h]) + I(h = hl) -2π, x
U(π) = ∑ ∑ (1-d,) f(h),h2 \ dt = 0)
I(h = hH) + I(h2 = hH) -2πH (31)
∑ ∑ (l -d,)F,f(h],h2 \ dl = 0), = ∑ U,(π) = 0,
where (1 - dt ) is introduced to ensure the selection of control samples only, and Ft is the difference between observed haplotype counts and their expectations (haplotype frequencies).
Now joining both equation (29) and (31) results in a joint estimating equation. At the estimate, denoted as (ξ,β,τt) , which may be written as fuχξ) Λ
U(ξ,β,π) = ∑ = 0, (32)
Figure imgf000061_0002
in which estimating functions U's are specified in (29) and (31). To proceed with the estimation, one needs to compute the derivative of U(ξ,β,π) with respect to all parameters (ξ,β,π) . As shown in EXAMPLE 9, the derivative matrix of the above joint estimating equation (32), denoted as I(ξ, β, π), may be written as
Figure imgf000062_0001
where 0 is a matrix of appropriate dimension, and conditional means, variances and covariances are computed in the usual way.
The available derivative matrix I(ξ,β,π) (33) facilitates use of the Newton-
Raphson method to estimate all parameters that satisfy the estimating equation (32). Starting from an initial value (ξ(0)(0)(0)) , one can iterate to a new value
mmm) via
?(D 9 (o) -I ξ^ β^ A0 U ξw(-° π(0)) , π (1) π (0)
where all quantities on the left hand side are evaluated at the initial value. The iteration continues until convergence in all parameters, resulting in estimates denoted as (ξ,β,π) .
3. Asymptotics and Inferences
The joint estimation equation (32) is written as the summation of individual estimating functions over K independent samples. Naturally, by the central limit theorem
(Godambe & Thompson (1989) J Statistical Planning and Inference 22:137-152; Liang
& Zeger (1986) Biometrika 73:13-22), one can prove that under the regularity conditions estimated parameters are consistent as K approaches infinity. Moreover, the estimated parameters have an asymptotic normal distribution. One of the key regularity conditions is that the estimating function in (32) approaches zero, which is shown in EXAMPLE 10.
The covariance matrix is easily estimable, and may be written as
Σ = I(ξ,β,π)~ var[u(ξ,β,π)]l(ξ,β,π)-1 , (34) in which all quantities are evaluated at their respective estimates with a variance matrix vaι[u(ξ,β,π)] is estimated by ∑
Figure imgf000063_0001
. i
This asymptotic distribution can now be used to make inferences. Obviously, the primary interest is to make inference on parameters in β . Suppose that one is interested in assessing haplotype-based association, after adjusting for covariables xt via the linear model (27). For an example, consider the tth haplotype, one obtains an estimated parameter βt , along with their estimated standard error σβt , where σβt is the squared root of diagonal elements in the covariance matrix of Σ . This value can be used to make inference. In particular, one may construct a Z-test statistic for testing zero log odds ratio, Zt = β, lσβ„ (35) which follows a standard normal distribution under the null hypothesis of βt = 0. The estimate and its standard error can be used to construct a 95 % confidence interval for the odds ratio of the tth haplotype as
[eAl.96άβ, ,+X.96*β, ^ (36)
Of course, it is also possible to construct a test statistic based upon the estimating function U(ξ,β,π) (32).
4. Analytic Strategies
Based on upon the flexible formulation of the logistic regression formulation (26), one can design various analytic strategies via choosing I(h],h2 χ.,β) . Below, we consider a set of such strategies.
Haplotype-Specific Effects: An immediate interest is to assess the disease associations with haplotypes. Let h =
Figure imgf000063_0002
denote H common haplotypes of interest. To assess their associations with the disease phenotype, one may choose the equation (27) for I(h),h2 ,xt,β) , with a control of environmental covariates „ where βl = (βu,..., β ) ' . Under the null hypothesis that the Λh haplotype is not associated with the disease phenotype, the corresponding log odds ratio equals zero, Ho : βu = 0. To test this null hypothesis, one may use the above Z-statistics (35).
Diplotype-Specific Effects: While haplotype-based associations are of interest, the disease association could also be genotype-specific, that is, the disease associations are with one or more genotypes, formed by paired haplotypes (diplotype). Diplotype- based associations may be categorized into four different penetrance modes: being dominant, or being recessive, or being additive, or being co-dominant. To capture the mode of diplotype associations, one needs to re-code corresponding genotypes under each mode of penetrance. Suppose that h is the target haplotype of interest. Under a dominant mode, we use the following indicator function,
I(ti„h2,xl,β) = βlK(til ,h2) + β2 <xι and K(h ,h2) =
Figure imgf000064_0001
where all other haplotypes are treated as a reference. Similarly, if the recessive mode is considered, one may use the same regression function with a slightly different indicator function as
Figure imgf000064_0002
Now to model the additive penetrance, one may choose the following indicator function:
2 Both /?, 1 and /?, 2 equal / ~?,
K(h],ht) = 1 One of/?,1 and h equals /?,
0 None of/?,1 and A2 equals /?.
Lastly, consider the co-dominant mode of penetrance. The regression model is now written as
I(h),h2 xl,β) = βnK (h),h2) + βnK2(til ,h2) + β2 'xl and
Figure imgf000064_0003
Kιi ) otherwise
Figure imgf000064_0004
Figure imgf000064_0005
where h. and h2 are two co-dominant haplotypes under considerations. Indeed, this model encompasses both dominance and additive modes. Specifically, if one of two coefficients iβ ,β12) equals zero, the model implies the dominant. If two coefficients (βul2) are equal, the above model implies the additive mode.
Interactions between Haplotypes and Covariables: To model gene/environmental interactions, one would typically gather an array of environmental factors. For an example, consider x, measuring dosage of the compound consumed by the ?'th subject. An analytic objective is to identify the association of the compound with the disease phenotype, as well as the same associations among a subset of subjects who have a specific type of haplotypes. To accomplish this objective, we model the haplotype- covariable interaction via I(h],h2,xt,β) = β. \K(h ) + K(h2)] + β2Xl3 \K(h ) + K(h2)]Xl , where the third term β3 = (β ,—,β3H)' quantifies the interactions of all candidate haplotypes with the dosage. Indeed, one can postulate other models to depict interactions that may be dominant, recessive or co-dominant, besides the additive mode described above. Interactions among Candidate Genes: In addition to gene/environmental interactions, one candidate gene may also interact with another candidate gene at a functional level, which is known as epistasis. The interactions could occur as haplotype- haplotype interaction, or as genotype-genotype interaction. For the illustrative puφose, let consider a genotype-genotype interaction between two candidate genes with corresponding haplotypes h : h2 and h)2 : h2 2 . Let K(h] h2) denote the corresponding genotype specified above according to the mode of inheritance (j=l and 2). One can thus model their interaction via
Figure imgf000065_0001
where
Figure imgf000065_0002
and K2(h]2,h2 2) are two indicator functions for candidate genotype 1 and 2, respectively, and the log odds ratio βi quantifies the interaction of interest.
5. Multiple Comparisons and Permutation Method
One of challenges facing all analytic strategies, such as those listed above, is how to deal with an increase in false positive errors due to multiple comparisons. There are two distinct types of multiple comparison problems, one of which involves multiple loci and another, the focus here, involves multiple haplotypes. Here our primary concern is with multiple haplotypes within a single candidate gene or at most two candidate genes. Additionally purely relying on asymptotic distribution to assess the rate of false positive errors is probably unreliable, since sample size is much smaller than number of comparisons. To address the multiple comparison problem, we propose to use a permutation method, as has been successfully applied to the Transmission Disequilibrium Test (TDT) (Mclntyre et al. (2000) Genet Epidemiol. 19:18-29). Briefly, the idea is to permute case/control status under the null hypothesis that SNPs and their haplotypes do not associate with disease status. Sampling from a randomly permuted case/control population, say 1000 times, we apply an analytic strategy, and compute the Z-statistics over all selected haplotypes. Then, we can use ordered Z-statistics to establish their distribution under the null hypothesis. Such a distribution can be used to estimate number of false discoveries (Loomis & Stemberg (1995) Science 269(5224):649), or the upper bound of false discovery rate.
EXAMPLE 7 Derivation of Estimating Equations
This example describes the derivation of the estimating equations.
There is a wealth of literature on logistic regression and its variations for case- control studies (Prentice & Pyke (1979) Biometrika 66(3):403-l l). Based on the retrospective log likelihood function, one can easily write down the estimating equation for (ξ,β) as
Figure imgf000066_0001
where i(h ,h2 ,xt,β) is the partial derivative of I(h],h2 ,x,,β) , and (dtt) are disease phenotype and the mean defined in (28). Note that the first equality of zero ensures the constraint associated with case-control studies (Whittemore (1995) Biometrika 82:57-67). When phases (pi) are unknown, one may construct an estimating equation by integrating latent phases, and resulted estimating equation U(ξ,β) may be written as
1 Ep^id. -μ g^d,^,} Σ EPi[ idll) \ g„dl,x,] = ∑ = 0. iih],h2A), EΛ [i(hJ lx d, -μ,) \ g,tdl lx,] where the conditional expectation is defined through the conditional probability fiP, \ g,>d„x,) . The conditional probability can be approximated, and the approximation is derived in EXAMPLE 8.
EXAMPLE 8 Approximation of Conditional Probability This example describes the approximation of the conditional probability. By the Bayes' rule, the probability function may be written as
Figure imgf000067_0001
ft Λ
ft ft under the assumption that the covariables are independent of haplotypes. When the disease of interest is rare, the marginal disease probability Pr(-/, =l|/?,', ?2, ,) is much smaller than the marginal non-disease probability Vx(dl=0\h),h2,xl) . Hence, the disease probability is approximated by exp[α + /(/?,', h',Xl,β)]
Pr(<,=1 l 2. ,): exp[α + /(?;, h;,Xl,β)], l + exp[α + I(h), h2,x„β)] because exp[α + I(h), h2,xt,β)] is much smaller than one. Substituting these approximations into the above probability function, one obtains, for cases, an approximated function,
Pr«=l I h), h2, x, )/(/?,', h2) f(Pl\gl.d,=l,xl) =
∑ Pr^^ll/?,1,/?2,^)/^,1, ?,2)
Figure imgf000067_0003
Similarly, for controls, one obtains an approximation as
Figure imgf000067_0004
In general, f(pt \g,,dl,xt) may be represented by exp ^/?,1,/?,2^,,^)]/^,1,/?,2) f(P, \g„d„x,)∞
∑ ^[d,I(h],h2,x„β f(h],h2)
EXAMPLE 9 Derivation of Derivative Matrix for the Joint Estimating Equation This example describes the derivation of the derivative matrix for the joint estimating equations.
As noted in the text, the joint estimating equation for (ξ,β,π) may be written as
Figure imgf000067_0005
where Xt =ϊ(h), h2,xt,β) represent a vector of covariate function, the indicator function I(h],h2) is used here genetically to denote
I(h),h2) = (l(h]=h.) + I(h2=h.),...,I(h =hH) + I(h2=hHj) Note by construction, X, should be free from any unknown parameters. The derivative matrix of the above joint estimating equation with respect to (ξ,β,π) may be written as
Figure imgf000068_0001
Now let us derive each individual derivative below:
(1,1) dU{ξ/)dξ = -∑ EiμXl-μ d.g.-X,]
Figure imgf000068_0002
= -∑ E[μχi-μι)XI\dl,g„xl] + ∑ ∑ (d, -μ,)f(p, \ dltg,.x,) — bifip, \d„g„x,)
1 ft
Figure imgf000068_0003
= —[(d,β'Xl) + ln /(/?,', h2)-ln∑ exp(^'N,)/(/?;,/?,2)] dP ft
Figure imgf000068_0004
= d,[X,-E(Xl\d„g„xl)]
(1,2) dU^y = -∑ E[μχi-μl)Xι\dl,gl,x,} + ∑ cov[(-i, -μ,), d,X, \d,tgl,x.]
? oπ1 = r oπ-∑ , Σ id,-μ,)fiP,\g,ΛA) = ∑ ∑ i .-μ^fip.lg.A.A)
1 , ft „. oπ d
= ∑ ∑ idl-μi)fip,\gl,d„xi)—-nf(p,\gl,dl,x,)
, A dπ ±h. f(n \tr d A d P- .β'Ii Λ'XMiltf) *JlP-\g,> "^-θπ∑ eSp[d,β'lflA2,x,)W;Λ2) p,
= ^-[lnexp[-i,/5'/(/?,1,/?2, ,)]/(/?,1,/?,2)]- in^ Q V[dβ^ I(h,h2,Xl)\f(h],h2) oπ oπ Pt
Figure imgf000069_0001
ft = F-1[/(/?,1,Λ,2)-2Λ-]-E[ ,[/(/?,1,?,2)-2Λ:] „g„x,]
(1,3) dU{ξ/) = ∑ cov[( ,.-/,.),[/(?,1,?2)-2^ -1|^,.,J,., ,.],
(2,1) at/(% = -∑ E[ J/,α-/ ,g,,*,]
∑ Aidi-μdfiPAgiΛA
Figure imgf000069_0002
= -∑ EiX^'μXAμ^d^g^ + ∑ ∑ X^-^^-fip^g^x,)
(2,2)
dU(β)/dβ = -∑ E[N,. ,.'/.,.(i-/,.)| ,.,g,.,x,.]+∑ .cov[ ,.( ,.- .),^: ,.,g,., ,.]
Figure imgf000069_0003
, ^ oπ
(2,3) dU{β/) = ∑ cov{ ,(ύf,- ),F-1[/(/?,1,/?2)-2^|g„ „ ,} dU(π)
(3,1) = 0 δξ
dU(π)
= 0 (3,2)
Figure imgf000070_0001
(3,3) ^) = _2N0l + ∑(l-Jι)F-1 var[/( ?,1,/,2)-2^ | J„g„Λ,] oπ ,
dU(ξ,β,π) d(ξ,β,π)
= ∑
Figure imgf000070_0002
θ d, ∞ {d, -μ,)tX, \ d„g„x,- ∞vM -μW tτV--ι1 \ g„d,A] 0 J, cov[X,(rf, -/ ,),N,
Figure imgf000070_0003
0 0 (l-rf ' var . ,,*,]
in which E, = /(A,1 , h 2 ) - 2π . EXAMPLE 10
Consistency of Estimating Equations This example shows the consistency of the estimating equations. One important aspect of this development is to prove the consistency of the estimating equation in the sense that estimated parameters are consistent as the number of samples approaches infinity. To prove this consistency, it is suffice to prove that the estimating function asymptotically approaches zero. Consider the situation where haplotype frequencies are consistently estimated and are held consistent. By the law of large number, the estimating function on the left hand side of (29) may be approximated by: U{a,β) = ∑ ∑ ( Did. -μJAP. l gΛA)
Figure imgf000071_0001
where Nj is the number of cases, and ' >' represents convergence, when the number of independent samples becomes sufficiently large. Hence, the estimating function above approaches zero asymptotically. This convergence indicates that estimated parameter is also consistent via the Taylor expansion.
EXAMPLE 11 Monte Carlo Simulation Study
This example describes a Monte Carlo simulation study.
Recognizing that results presented above base upon asymptotic theory, we are interested in how well asymptotic results approximate distributions of results in finite samples. To do so, we performed Monte Carlo simulation studies. Once the validity of this method is established, we used the same simulation method to investigate its performance under some realistic situations. 1. A General Setup We use a coalescent theory to simulate a population of 10 million individuals and their genotypes, using the simulation program of the coalescent-process, which is from Dr. Hudson's website (http://home.uchicago.edu/~rhudsonl/source/mksamples.html). Then, for every individual in this population, we simulate a vector of covariates, specifically, age, gender and race. Then, we use the penetrance model (26) to compute the expected disease probability and then to simulate the binary phenotype status using Bernoulli process. Treating simulated data on 10 million individuals as the entire population, one can now conduct a case-control study with a sample of representative cases and a sample of representative controls. Analyzing case-control data, we can estimate log odds ratios and their standard errors for every haplotypes and covariates of interest. One case-control study is considered as a replicate in the Monte Carlo simulation study. As needed, we can repeatedly perform such a case-control study, and use 1000 replicates in all of our Monte Carlo simulation study. Resulting statistics from 1000 replicates are then used to evaluate following measurement statistics. 2. Measurement Statistics
Among many possible choices of measurement statistics, we choose to use three measurement statistics. The first statistic is the biases in estimating log odds ratios. Let βj denote they'th estimated regression coefficient in (26) from the rth replicate. The bias measurement of this log odds ratio is defined as Bias(£) = ±∑ (ft -βj) , r=l where β represents the true value from the simulation setup, and R equals 1000 for our simulation study. The second statistic is the biases in estimating standard errors. Let SEj = SE (βj) denote the estimated standard error for they'th log odds ratio β} r from the rth replicate. The bias measurement of this standard error is defined as Bias(SE,) = T (SE, -SE,) and SE2 = ^∑ (β -β s2 , where SE^ is the Monte Carlo estimate of the true standard error for the jth. regression coefficient. The third statistic is the coverage probability, measuring how frequently the confidence interval, β} r -ZgSE^βj +ZeSE} r] at the significance level of θ, covers the true value β} . The rationale for choosing the coverage probability is that it is reliable measurement under both null and alternative hypothesis. If the estimation and inference are appropriate, the coverage probability should be around 1-θ . 3. Under the Null Hypothesis with No Covariates
For consistence throughout the simulation studies, we always use the haplotype with the highest haplotype frequency as the reference. In the current simulation, we focus on three haplotypes, which are considered as common haplotypes in the entire population. In this case, we simulate phenotypes in the general population by the following penetrance function:
?x(dl = l \ h),h2,xl) = l l + exV[-a -β,Kχh],h2) -β2K2(h ,h2)-β,Ki(til ,h2)]
(37) where (β. ,β23) = (0, 0, 0) and [Kλ (/?,' , h2 ), K2 (h , h2 ), K3 (h] , h2 )] are indicator functions for three common haplotype of interest. The above model is used to fit case-control data in every replicate (see Table 10 in FIGURE 15). 4. Under the Null Hypothesis with Covariates
Here we consider two scenarios for inclusion of a covariate that associates with phenotype, such a binary indicator for ethnicity. Two scenarios are that the covariate is independent of or dependent of genetic constitution. The penetrance function used to simulate phenotypes may be written as:
(38) where (β. ,β23) = (0, 0, 0) and [K. (h ,h2), K2 (h) ,h2), K3 (h , h2 )] are indicator functions for three common haplotype of interest. By this specification, we interest to investigate properties of test statistics for haplotypes under the null hypothesis. When the analysis is performed, we consider two strategies; the covariate is or is not included in the analysis (see Table 11 in FIGURE 16). The simulations were also performed under the null hypothesis with varying sample sizes (see Table 12 in FIGURE 17), under the alternative hypothesis with no covariates (see Table 13 in FIGURE 18), and under the alternative hypothesis with covariates (see Table 14 in FIGURE 19).
EXAMPLE 12 Case-Control Study The example describes a case-control study assembled from GAW12 (see Table 15 in FIGURE 20). Further details and results are described in Zhao et al. (2003) (Am. J. Hum. Genet. 72:1231-50), herein incoφorated by reference. While the prefeoed embodiment of the invention has been illustrated and described, it will be appreciated that various changes can be made therein without departing from the spirit and scope of the invention. '

Claims

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:
1. A method for estimating haplotype frequencies for a set of markers in a sample population, comprising:
(a) providing a set of markers, wherein each marker has a genotype, wherein the genotypes of at least some of the markers are known for each individual in a sample population; and
(b) estimating haplotype frequencies for the marker set in a sample population using an estimating equation method and a forward-block algorithm, wherein the algorithm provides a standard error measurement for each estimated haplotype frequency.
2. The method of Claim 1, wherein the forward-block algorithm comprises dividing the set of markers into a plurality of blocks; and
(i) estimating block haplotype frequencies for a first block, wherein a block haplotype comprises a combination of genotypes for markers in a block;
(ii) estimating block haplotype frequencies for a second block;
(iii) estimating combination block haplotype frequencies for a first combination block using selected and pooled block haplotypes for the first and second blocks, wherein a combination block comprises two or more blocks, and wherein block haplotypes with greater than a predetermined minimum frequency are selected and block haplotypes that are not selected are pooled;
(iv) estimating block haplotype frequencies for a third block;
(v) estimating combination block haplotype frequencies for a second combination block using selected and pooled block haplotypes for the first combination block and the third block; and
(vi) sequentially repeating steps (iv) and (v), wherein each repetition provides combination block haplotype frequencies for a combination block comprising an additional block.
3. The method of Claim 1, wherein the markers comprise single nucleotide polymoφhisms (SNPs).
4. The method of Claim 1, wherein the markers comprise microsatellite markers.
5. The method of Claim 1, wherein the set of markers comprises more than about 12 markers.
6. The method of Claim 1 , wherein the set of markers comprises more than about 100 markers.
7. The method of Claim 1, wherein the set of markers comprises several hundred markers.
8. The method of Claim 1, wherein the sample population comprises more than about 40 individuals.
9. The method of Claim 1, wherein the sample population comprises more than about 1000 individuals.
10. The method of Claim 1, wherein the genotype of at least some of the markers is unknown.
11. The method of Claim 2, wherein each block comprises at least about 3 markers.
12. The method of Claim 2, wherein each block comprises at least about 20 markers.
13. A method for associating haplotype frequencies for a set of markers with a trait, comprising:
(a) providing a set of markers, wherein each marker has a genotype, wherein the genotypes of at least some of the markers is known for a group of trait-positive individuals and a group of trait-negative individuals; and
(b) estimating haplotype frequencies for the set of markers in the group of trait- positive individuals and in the group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein the algorithm provides a standard eoor measurement for each estimated haplotype frequency; and (c) estimating the differences in frequencies of haplotypes for the set of markers in the trait-positive group and in the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard eoor measurement for each estimated difference in haplotype frequency.
14. A method for diagnosing an increased risk of developing a trait, comprising:
(a) providing a set of markers, wherein each marker has a genotype, wherein the genotypes of at least some of the markers is known for a group of trait-positive individuals and a group of trait-negative individuals; and
(b) estimating haplotype frequencies for the marker set in the group of trait- positive individuals and in the group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein the algorithm provides a standard eoor measurement for each estimated haplotype frequency; and
(c) estimating the differences in haplotype frequencies in the trait-positive group and in the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard eoor measurement for each estimated difference in haplotype frequencies;
(d) deriving one or more haplotypes that are significantly associated with the trait; and
(e) diagnosing an increased risk of developing the trait in a trait-negative individual by determining the presence of a haplotype that is significantly positively associated with the trait or the absence of a haplotype that is significantly negatively associated with the trait.
15. A method for associating haplotypes or diplotypes at one or more loci and one or more environmental factors with a trait, comprising:
(a) providing a set of markers for each of one or more loci, wherein each marker has a genotype, wherein the genotypes of at least some of the markers is known for a group of trait-positive individuals and a group of trait-negative individuals; (b) providing a set of environmental factors, wherein the presence of each environmental factors is known for at least some members of a group of trait-positive individuals and at least some members of a group of trait-negative individuals;
(c) estimating pattern frequencies in the group of trait-positive individuals and in the group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein a pattern comprises at least one of (i) one or more haplotypes or diplotypes at one or more loci; (ii) one or more environmental factors; and (iii) a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors; and wherein the algorithm provides a standard eoor measurement for each estimated pattern frequency;
(d) estimating the differences in pattern frequencies in the trait-positive group and in the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard eoor measurement for each estimated difference in pattern frequency.
16. The method of Claim 15, wherein the forward-block algorithm comprises dividing each set of markers into a plurality of blocks; and for each set of markers:
(i) estimating pattern frequencies for a first block, wherein a pattern comprises haplotype and one or more environmental factors and a haplotype comprises a combination of genotypes for markers in a block;
(iii) estimating pattern frequencies for a second block;
(iii) estimating pattern frequencies for a first combination block using selected and pooled patterns for the first and second blocks, wherein a combination block comprises two or more blocks, and wherein patterns with greater than a predetermined minimum frequency are selected and patterns that are not selected are pooled;
(iv) estimating pattern frequencies for a third block;
(v) estimating pattern frequencies for a second combination block using selected and pooled patterns for the first combination block and the third block; and
(vi) sequentially repeating steps (iv) and (v), wherein each repetition provides pattern frequencies for a combination block comprising an additional block.
17. A method for diagnosing an increased risk of developing a trait, comprising: (a) providing a set of markers for each of one or more loci, wherein each marker has a genotype, wherein the genotypes of at least some of the markers is known for a group of trait-positive individuals and a group of trait-negative individuals;
(b) providing a set of environmental factors, wherein the presence of each environmental factors is known for at least some members of a group of trait-positive individuals and at least some members of a group of trait-negative individuals;
(c) estimating pattern frequencies in the group of trait-positive individuals and in the group of trait-negative individuals using an estimating equation method and a forward-block algorithm, wherein pattern comprises at least one of (i) one or more haplotypes or diplotypes at one or more loci; (ii) one or more environmental factors; and (iii) a combination of one or more haplotypes or diplotypes at one or more loci and one or more environmental factors; and wherein the algorithm provides a standard eoor measurement for each estimated pattern frequency;
(d) estimating the differences in pattern frequencies in the trait-positive group and in the trait-negative group using an estimating equation algorithm, wherein the algorithm provides a standard eoor measurement for each estimated difference in pattern frequency.
(e) deriving one or more patterns that are significantly associated with the trait; and
(f) diagnosing an increased risk of developing a trait in a trait-negative individual by determining the presence of a pattern that is significantly associated with the trait or the absence of a pattern that is significantly negatively associated with the trait.
18. A computer-readable medium having computer-executable instructions for performing the method recited in any of Claims 1, 13-15, and 17.
19. A computer-system having a processor, a memory and an operating environment, the computer system operable for performing the method recited in any of Claims 1, 13- 15, and 17.
PCT/US2003/031186 2002-10-01 2003-10-01 Methods for estimating haplotype frequencies and disease associations with haplotypes and environmental variables WO2004031912A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003282907A AU2003282907A1 (en) 2002-10-01 2003-10-01 Methods for estimating haplotype frequencies and disease associations with haplotypes and environmental variables

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US41502802P 2002-10-01 2002-10-01
US60/415,028 2002-10-01

Publications (2)

Publication Number Publication Date
WO2004031912A2 true WO2004031912A2 (en) 2004-04-15
WO2004031912A3 WO2004031912A3 (en) 2004-08-05

Family

ID=32069800

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/031186 WO2004031912A2 (en) 2002-10-01 2003-10-01 Methods for estimating haplotype frequencies and disease associations with haplotypes and environmental variables

Country Status (2)

Country Link
AU (1) AU2003282907A1 (en)
WO (1) WO2004031912A2 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7797302B2 (en) 2007-03-16 2010-09-14 Expanse Networks, Inc. Compiling co-associating bioattributes
US7917438B2 (en) 2008-09-10 2011-03-29 Expanse Networks, Inc. System for secure mobile healthcare selection
US8200509B2 (en) 2008-09-10 2012-06-12 Expanse Networks, Inc. Masked data record access
US20130332081A1 (en) * 2010-09-09 2013-12-12 Omicia Inc Variant annotation, analysis and selection tool
WO2014089356A1 (en) * 2012-12-05 2014-06-12 Genepeeks, Inc. System and method for the computational prediction of expression of single-gene phenotypes
US8788286B2 (en) 2007-08-08 2014-07-22 Expanse Bioinformatics, Inc. Side effects prediction using co-associating bioattributes
WO2014144032A2 (en) * 2013-03-15 2014-09-18 The Broad Institute, Inc. Systems and methods for identifying significantly mutated genes
US9031870B2 (en) 2008-12-30 2015-05-12 Expanse Bioinformatics, Inc. Pangenetic web user behavior prediction system
US11322227B2 (en) 2008-12-31 2022-05-03 23Andme, Inc. Finding relatives in a database

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DRYSDALE C.M. ET AL: 'Complex promoter and coding region beta2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness' PNAS vol. 97, no. 19, 12 September 2000, pages 10483 - 10488, XP002940094 *
EXCOFFIER L. ET AL: 'Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population' MOLECULAR BIOLOGY AND EVOLUTION vol. 12, no. 5, 1995, pages 921 - 927, XP002953528 *
FALLIN D. ET AL: 'Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data' J. OF HUMAN GENETICS vol. 67, October 2000, pages 947 - 959, XP002951850 *
NIU T. ET AL: 'Bayesian haplotype inference for multiple linked single-nucleotide polymophisms' AM.J. OF HUMAN GENETICS vol. 70, January 2002, pages 157 - 169, XP002978163 *
PATIL N. ET AL: 'Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21' SCIENCE vol. 294, 23 November 2001, pages 1719 - 1723, XP002965310 *
STEPHENS M. ET AL: 'A new statistical method for haplotype reconstruction from population data' AM. J. OF HUMAN GENETICS vol. 68, April 2001, pages 978 - 989, XP002955780 *

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10896233B2 (en) 2007-03-16 2021-01-19 Expanse Bioinformatics, Inc. Computer implemented identification of genetic similarity
US11791054B2 (en) 2007-03-16 2023-10-17 23Andme, Inc. Comparison and identification of attribute similarity based on genetic markers
US7844609B2 (en) 2007-03-16 2010-11-30 Expanse Networks, Inc. Attribute combination discovery
US10957455B2 (en) 2007-03-16 2021-03-23 Expanse Bioinformatics, Inc. Computer implemented identification of genetic similarity
US7933912B2 (en) 2007-03-16 2011-04-26 Expanse Networks, Inc. Compiling co-associating bioattributes using expanded bioattribute profiles
US7941434B2 (en) 2007-03-16 2011-05-10 Expanse Networks, Inc. Efficiently compiling co-associating bioattributes
US7941329B2 (en) 2007-03-16 2011-05-10 Expanse Networks, Inc. Insurance optimization and longevity analysis
US8024348B2 (en) 2007-03-16 2011-09-20 Expanse Networks, Inc. Expanding attribute profiles
US8051033B2 (en) 2007-03-16 2011-11-01 Expanse Networks, Inc. Predisposition prediction using attribute combinations
US8099424B2 (en) 2007-03-16 2012-01-17 Expanse Networks, Inc. Treatment determination and impact analysis
US11735323B2 (en) 2007-03-16 2023-08-22 23Andme, Inc. Computer implemented identification of genetic similarity
US8209319B2 (en) 2007-03-16 2012-06-26 Expanse Networks, Inc. Compiling co-associating bioattributes
US8606761B2 (en) 2007-03-16 2013-12-10 Expanse Bioinformatics, Inc. Lifestyle optimization and behavior modification
US11621089B2 (en) 2007-03-16 2023-04-04 23Andme, Inc. Attribute combination discovery for predisposition determination of health conditions
US11600393B2 (en) 2007-03-16 2023-03-07 23Andme, Inc. Computer implemented modeling and prediction of phenotypes
US11581096B2 (en) 2007-03-16 2023-02-14 23Andme, Inc. Attribute identification based on seeded learning
US7797302B2 (en) 2007-03-16 2010-09-14 Expanse Networks, Inc. Compiling co-associating bioattributes
US11581098B2 (en) 2007-03-16 2023-02-14 23Andme, Inc. Computer implemented predisposition prediction in a genetics platform
US11545269B2 (en) 2007-03-16 2023-01-03 23Andme, Inc. Computer implemented identification of genetic similarity
US11515047B2 (en) 2007-03-16 2022-11-29 23Andme, Inc. Computer implemented identification of modifiable attributes associated with phenotypic predispositions in a genetics platform
US10379812B2 (en) 2007-03-16 2019-08-13 Expanse Bioinformatics, Inc. Treatment determination and impact analysis
US10803134B2 (en) 2007-03-16 2020-10-13 Expanse Bioinformatics, Inc. Computer implemented identification of genetic similarity
US10991467B2 (en) 2007-03-16 2021-04-27 Expanse Bioinformatics, Inc. Treatment determination and impact analysis
US7818310B2 (en) 2007-03-16 2010-10-19 Expanse Networks, Inc. Predisposition modification
US11495360B2 (en) 2007-03-16 2022-11-08 23Andme, Inc. Computer implemented identification of treatments for predicted predispositions with clinician assistance
US11482340B1 (en) 2007-03-16 2022-10-25 23Andme, Inc. Attribute combination discovery for predisposition determination of health conditions
US11348691B1 (en) 2007-03-16 2022-05-31 23Andme, Inc. Computer implemented predisposition prediction in a genetics platform
US11348692B1 (en) 2007-03-16 2022-05-31 23Andme, Inc. Computer implemented identification of modifiable attributes associated with phenotypic predispositions in a genetics platform
US8788286B2 (en) 2007-08-08 2014-07-22 Expanse Bioinformatics, Inc. Side effects prediction using co-associating bioattributes
US7917438B2 (en) 2008-09-10 2011-03-29 Expanse Networks, Inc. System for secure mobile healthcare selection
US8200509B2 (en) 2008-09-10 2012-06-12 Expanse Networks, Inc. Masked data record access
US9031870B2 (en) 2008-12-30 2015-05-12 Expanse Bioinformatics, Inc. Pangenetic web user behavior prediction system
US11003694B2 (en) 2008-12-30 2021-05-11 Expanse Bioinformatics Learning systems for pangenetic-based recommendations
US11514085B2 (en) 2008-12-30 2022-11-29 23Andme, Inc. Learning system for pangenetic-based recommendations
US11935628B2 (en) 2008-12-31 2024-03-19 23Andme, Inc. Finding relatives in a database
US11468971B2 (en) 2008-12-31 2022-10-11 23Andme, Inc. Ancestry finder
US11322227B2 (en) 2008-12-31 2022-05-03 23Andme, Inc. Finding relatives in a database
US11657902B2 (en) 2008-12-31 2023-05-23 23Andme, Inc. Finding relatives in a database
US11508461B2 (en) 2008-12-31 2022-11-22 23Andme, Inc. Finding relatives in a database
US11776662B2 (en) 2008-12-31 2023-10-03 23Andme, Inc. Finding relatives in a database
US20130332081A1 (en) * 2010-09-09 2013-12-12 Omicia Inc Variant annotation, analysis and selection tool
US20150317432A1 (en) * 2012-12-05 2015-11-05 Genepeeks, Inc. System and method for the computational prediction of expression of single-gene phenotypes
WO2014089356A1 (en) * 2012-12-05 2014-06-12 Genepeeks, Inc. System and method for the computational prediction of expression of single-gene phenotypes
US11545235B2 (en) 2012-12-05 2023-01-03 Ancestry.Com Dna, Llc System and method for the computational prediction of expression of single-gene phenotypes
WO2014144032A3 (en) * 2013-03-15 2014-11-06 The Broad Institute, Inc. Systems and methods for identifying significantly mutated genes
WO2014144032A2 (en) * 2013-03-15 2014-09-18 The Broad Institute, Inc. Systems and methods for identifying significantly mutated genes

Also Published As

Publication number Publication date
AU2003282907A1 (en) 2004-04-23
AU2003282907A8 (en) 2004-04-23
WO2004031912A3 (en) 2004-08-05

Similar Documents

Publication Publication Date Title
Choin et al. Genomic insights into population history and biological adaptation in Oceania
Speidel et al. A method for genome-wide genealogy estimation for thousands of samples
Marees et al. A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis
Palamara et al. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability
Pe'er et al. Evaluating and improving power in whole-genome association studies using fixed marker sets
Gao et al. New software for the fast estimation of population recombination rates (FastEPRR) in the genomic era
Ngo et al. A diagnostic ceiling for exome sequencing in cerebellar ataxia and related neurological disorders
Flutre et al. A statistical framework for joint eQTL analysis in multiple tissues
Glémin et al. Quantification of GC-biased gene conversion in the human genome
Prabhu et al. Ultrafast genome-wide scan for SNP–SNP interactions in common complex disease
Dumas et al. Direct quantitative trait locus mapping of mammalian metabolic phenotypes in diabetic and normoglycemic rat models
Ptak et al. Evidence for population growth in humans is confounded by fine-scale population structure
Munafo et al. Meta-analysis of genetic association studies
Sun et al. eQTL mapping using RNA-seq data
Furlotte et al. Efficient multiple-trait association and estimation of genetic correlation using the matrix-variate linear mixed model
Gompert et al. A hierarchical Bayesian model for next-generation population genomics
Hung et al. Analysis of microarray and RNA-seq expression profiling data
Hu et al. Proper use of allele-specific expression improves statistical power for cis-eQTL mapping with RNA-seq data
Chou et al. A combined reference panel from the 1000 Genomes and UK10K projects improved rare variant imputation in European and Chinese samples
Paşaniuc et al. Accurate estimation of expression levels of homologous genes in RNA-seq experiments
Paschou et al. Intra-and interpopulation genotype reconstruction from tagging SNPs
Leache et al. Comparative species divergence across eight triplets of spiny lizards (Sceloporus) using genomic sequence data
Rodriguez et al. Parente2: a fast and accurate method for detecting identity by descent
Yuan et al. Refining models of archaic admixture in Eurasia with ArchaicSeeker 2.0
Chanda et al. Comprehensive evaluation of imputation performance in African Americans

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP