Dimension Reduction Approaches for Genome-wide Association Testing
PI:
Andrew Clark
Co-PIs:
Carlos Bustamante, Rasmus Nielsen
Support:
current
Source:
NIH -- PI: Andrew Clark
Location:
Cornell University
Duration:
04/01/06 - 03/31/09
Summary:
Whole-genome association testing is widely cited as a promising way to identify genetic variants that are causal to elevated risk of complex disorders such as cardiovascular disease, diabetes, and cancers. The technology for genotyping at the requisite scale is becoming practical and affordable, but the analytical tools needed to make the most reliable inferences from these data lag behind. As a result, we cannot yet design optimal studies, because we do not know which aspects of experimental design erode the power of the studies.
Specific Aim 1 will develop Bayesian classification models, a promising approach for inference when the number of predictors (SNPs) is large but the prior expectation is that most SNPs will have zero effect. The model will have a three-component mixture prior with a high point mass at zero (no effect) as well as components for positive and negative effects on risk. Fitting will be done by Markov chain Monte Carlo (MCMC) and by stochastic variable selection. We will apply the model to BeadArray data, which provide transcript abundance for 700 genes in cell lines from the 270 subjects of the HapMap project (each having more than 4 million SNP genotypes). The Bayesian classification approach will be contrasted with linear-model-based approaches. Both case-control and random cohort data will be addressed. Performance of the methods in the face of missing and erroneous data will be quantified.
Specific Aim 2 will explore the effects of ascertainment bias and of departures from neutrality of the marker variation on association testing. The HapMap SNPs were discovered in small samples, resulting in a bias toward SNPs that are more common than those found in the full population. There is a pressing need to explore the impact of such ascertainment bias on inference of association. Most methods of association testing assume that the markers follow neutral expectations, but we know that many regions of the genome show marked departures from this pattern. We will show through theory and simulation how these distortions affect standard approaches to association testing, and devise accommodations to the tests.
Specific Aim 3 will apply data reduction methods to both the SNP and the phenotype data. SNP data consist of discrete factors that arise through a well-understood process (the coalescent), and explicit modeling of this process is likely to identify better methods for SNP dimension reduction. Early steps in this direction have appeared in the literature in the form of tag SNPs. The phenotype data can be reduced by combining methods such as clustering and sparse principal components. These methods will be applied to the Sanger gene expression data, and will be tested by simulation.
Specific Aim 4 will employ simulations to assess the power of association tests under violations of model assumptions. Of particular interest will be tuning the model parameters to optimize the balance of false positive and false negative inferences.
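For illustration only, the three-component mixture prior described under Specific Aim 1 could be sketched as follows. This is a minimal sketch, not the project's model. The function name sample_effects, the mixture weights, and the half-normal shape assumed for the non-zero effect sizes are all hypothetical choices. The summary specifies only a high point mass at zero plus positive and negative effect components.

```python
import numpy as np

def sample_effects(n_snps, p_zero=0.99, p_pos=0.005, scale=0.5, seed=0):
    """Draw per-SNP effects from an illustrative three-component mixture prior.

    Assumed (hypothetical) components:
      - point mass at zero with weight p_zero (no effect),
      - positive effects with weight p_pos,
      - negative effects with the remaining weight.
    Non-zero magnitudes are half-normal with the given scale (an assumption).
    """
    if p_zero < 0 or p_pos < 0 or p_zero + p_pos > 1 + 1e-12:
        raise ValueError("mixture weights must be non-negative and sum to at most 1")
    # Guard against tiny negative values caused by floating-point rounding.
    p_neg = max(0.0, 1.0 - p_zero - p_pos)

    rng = np.random.default_rng(seed)
    # Component label per SNP: 0 = no effect, +1 = raises risk, -1 = lowers risk.
    labels = rng.choice([0, 1, -1], size=n_snps, p=[p_zero, p_pos, p_neg])
    # Half-normal magnitudes; multiplying by the label zeroes out null SNPs.
    magnitudes = np.abs(rng.normal(0.0, scale, size=n_snps))
    return labels * magnitudes

effects = sample_effects(10_000)
print("fraction with zero effect:", np.mean(effects == 0.0))
```

With weights like these, nearly all simulated SNPs receive an effect of exactly zero, which is the sparsity expectation the summary describes. An actual analysis would place this prior inside a classification likelihood and fit it by MCMC.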