Whole-genome association (WGA) studies are becoming a common tool for exploring the genetic components of common disease. Analyzing such large-scale data presents distinct challenges, including multiple testing, correlated independent variables, and large multivariate model spaces. These issues have prompted the development of novel computational approaches. Thorough simulation studies are necessary in methods development to evaluate the power and validity of these approaches. Many data simulation packages exist; however, the resulting data are often overly simplistic and do not approach the complexity of real data, especially with respect to linkage disequilibrium (LD). To overcome this limitation, we have developed genomeSIMLA, a forward-time population simulation method that can simulate realistic patterns of LD in both family-based and case-control datasets.
genomeSIM [Dudek et al. 2006] was developed for the simulation of large-scale genomic data in population-based case-control samples. It is a forward-time population simulation algorithm that allows the user to specify many evolutionary parameters and control evolutionary processes. SIMLA [Bass et al. 2004, Schmidt et al. 2004] is a simulation program that allows the user to specify varying levels of both linkage and LD among and between markers and disease loci. SIMLA was specifically designed for the simultaneous study of linkage and association methods in extended pedigrees, but its penetrance specification algorithm can also be used to simulate samples of unrelated individuals (e.g., cases and controls). We have combined the two programs, using genomeSIM as a front end to generate a population of founder chromosomes. This population exhibits the desired patterns of LD and serves as input for the SIMLA simulation of disease models. Particular SNPs may be chosen to represent disease loci according to desired location, correlation with nearby SNPs, and allele frequency. Up to six loci may be selected for main effects, and all possible 2- and 3-way interactions as specified in  among these six loci are available to the user as elements of a disease model.
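The selection criteria above (allele frequency and correlation with nearby SNPs) can be illustrated with a small sketch. This is not genomeSIMLA code: the function names, the 0/1 haplotype encoding, the use of r² as the correlation measure, and the choice to compare only immediately adjacent SNPs are all assumptions made for illustration.

```python
def allele_frequency(pool, locus):
    """Frequency of the allele coded 1 at one locus in a pool of haplotypes."""
    return sum(hap[locus] for hap in pool) / len(pool)


def r_squared(pool, i, j):
    """Pairwise LD (r^2) between loci i and j, computed from haplotypes."""
    p_i = allele_frequency(pool, i)
    p_j = allele_frequency(pool, j)
    p_ij = sum(1 for hap in pool if hap[i] == 1 and hap[j] == 1) / len(pool)
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # at least one locus is monomorphic, so r^2 is undefined
    d = p_ij - p_i * p_j  # the disequilibrium coefficient D
    return d * d / denom


def candidate_disease_loci(pool, min_freq, max_freq, min_r2):
    """Hypothetical filter: keep loci whose allele frequency lies in
    [min_freq, max_freq] and whose r^2 with an adjacent SNP is >= min_r2."""
    n_loci = len(pool[0])
    selected = []
    for k in range(n_loci):
        freq = allele_frequency(pool, k)
        if not (min_freq <= freq <= max_freq):
            continue
        neighbours = [m for m in (k - 1, k + 1) if 0 <= m < n_loci]
        best = max((r_squared(pool, k, m) for m in neighbours), default=0.0)
        if best >= min_r2:
            selected.append(k)
    return selected
```

A user might call `candidate_disease_loci(pool, 0.05, 0.5, 0.8)` to list common SNPs that are well tagged by a neighbour, then pick up to six of them as disease loci.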
Once these loci are chosen, the user specifies the disease prevalence, a mode of inheritance for each locus, and the relative risks of exposure to the genotypes at each locus. Pairs of chromosomes are then drawn at random from the genomeSIM population to produce nonredundant individuals. These individuals are either mated to yield pedigrees or evaluated by a logistic function of their genotypic exposures to determine disease status for case-control data. An advantage of the SIMLA approach to the logistic function is that it can simulate data on markers that are not independent, yet still yield the correct relative risks and prevalence. Many simulation packages that use a logistic function for penetrance specification do not have this capability. Thus, the marriage of genomeSIM and SIMLA allows the simulation of large-scale datasets with realistic patterns of LD and realistic disease models in both family-based and case-control data.
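To make the idea of a logistic penetrance function concrete, here is a minimal sketch. It is not SIMLA's algorithm: SIMLA's procedure for matching relative risks under correlated markers is not described here. The sketch assumes per-locus coefficients (`betas`) are given on the log-odds scale and only calibrates the intercept, by bisection, so that the mean penetrance in a sample matches the target prevalence. All names and the genotype coding (count of risk alleles, 0 to 2) are hypothetical.

```python
import math
import random


def exposure(genotype, mode):
    """Code a genotype (number of risk alleles: 0, 1 or 2) as an exposure."""
    if mode == "dominant":
        return 1.0 if genotype >= 1 else 0.0
    if mode == "recessive":
        return 1.0 if genotype == 2 else 0.0
    if mode == "additive":
        return float(genotype)
    raise ValueError(f"unknown mode of inheritance: {mode}")


def penetrance(genotypes, modes, betas, intercept):
    """P(affected | genotypes) under a main-effects logistic model."""
    z = intercept + sum(
        b * exposure(g, m) for g, m, b in zip(genotypes, modes, betas)
    )
    return 1.0 / (1.0 + math.exp(-z))


def calibrate_intercept(sample, modes, betas, prevalence, tol=1e-9):
    """Bisection on the intercept: mean penetrance over the sample is
    monotonically increasing in the intercept, so the root is unique."""
    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        mean = sum(penetrance(g, modes, betas, mid) for g in sample) / len(sample)
        if mean < prevalence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def assign_status(genotypes, modes, betas, intercept, rng):
    """Draw affected (True) or unaffected (False) from the penetrance."""
    return rng.random() < penetrance(genotypes, modes, betas, intercept)
```

Because the intercept is calibrated on the sampled genotypes themselves, the prevalence target is met even when the loci are in LD; reproducing specified relative risks under LD would need the additional machinery that SIMLA provides.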
genomeSIMLA generates datasets using a forward-time population simulator that relies on random mating, genetic drift, recombination, and population growth to allow a population to acquire LD features naturally. An initial population (or pool of chromosomes) is generated using the allele frequencies and positions of a set of desired SNPs. Recombinant gametes are created based on intermarker recombination fractions, which are computed from map distance using the Kosambi or Haldane mapping function. Crossover events are drawn from a Poisson distribution where λ is the sum of all intermarker probabilities. The resulting gamete is then combined with a gamete from another individual. The approximation of 1 million bases per centimorgan is applied here; however, other values could be applied to simulate population-specific genetic maps or recombination hotspots.
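The recombination step can be sketched as follows. The Haldane and Kosambi formulas are the standard ones, and the 1 Mb per cM conversion and the Poisson mean come from the text. How crossovers are placed among intervals is not stated, so placing them in proportion to each interval's recombination fraction is an assumption, as are all function names.

```python
import math
import random

BASES_PER_CENTIMORGAN = 1_000_000  # approximation stated in the text


def haldane(d_morgans):
    """Haldane map function: map distance (Morgans) -> recombination fraction."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def kosambi(d_morgans):
    """Kosambi map function: map distance (Morgans) -> recombination fraction."""
    return 0.5 * math.tanh(2.0 * d_morgans)


def recombination_fractions(positions_bp, map_function=kosambi):
    """Recombination fraction for each interval between adjacent markers."""
    fractions = []
    for left, right in zip(positions_bp, positions_bp[1:]):
        d_morgans = (right - left) / BASES_PER_CENTIMORGAN / 100.0
        fractions.append(map_function(d_morgans))
    return fractions


def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def make_gamete(hap_a, hap_b, fractions, rng):
    """Build one recombinant gamete from a parent's two haplotypes."""
    lam = sum(fractions)  # Poisson mean: sum of intermarker probabilities
    n_cross = poisson(lam, rng)
    switch = [False] * len(fractions)
    if lam > 0 and n_cross > 0:
        # Assumption: crossovers land in intervals in proportion to theta.
        for i in rng.choices(range(len(fractions)), weights=fractions, k=n_cross):
            switch[i] = not switch[i]  # two crossovers in one interval cancel
    current, other = (hap_a, hap_b) if rng.random() < 0.5 else (hap_b, hap_a)
    gamete = [current[0]]
    for i, crossed in enumerate(switch):
        if crossed:
            current, other = other, current
        gamete.append(current[i + 1])
    return gamete
```

Replacing `BASES_PER_CENTIMORGAN` with a per-interval rate would be one way to model the population-specific maps or hotspots mentioned above.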
This process is repeated on the pool of chromosomes for a set number of generations to generate realistic patterns of LD and to produce enough chromosomes for drawing data. After the pool of chromosomes has developed suitable LD and grown to a useful size, datasets can be drawn by randomly sampling chromosomes with replacement to create nonredundant individuals. To assign affected status, the disease-susceptibility effects of multiple genetic variables can be modeled using either the SIMLA logistic function [Bass et al. 2004, Schmidt et al. 2004] or a purely epistatic multi-locus penetrance function [Moore et al. 2004] found using a genetic algorithm.
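Drawing a dataset from the final chromosome pool might look like the sketch below. The text does not define "nonredundant", so this sketch assumes it means that no two individuals are built from the same pair of chromosomes, while a single chromosome may still appear in several individuals. The function name and the 0/1 haplotype encoding are hypothetical.

```python
import random


def draw_individuals(pool, n_individuals, rng):
    """Pair chromosomes from the pool into diploid individuals.

    A chromosome may be reused across individuals, but (by assumption)
    the same unordered pair of chromosomes is never used twice.
    Returns one list of genotypes (0, 1 or 2 copies of allele 1) per person.
    """
    n_chrom = len(pool)
    max_pairs = n_chrom * (n_chrom - 1) // 2
    if n_individuals > max_pairs:
        raise ValueError("pool too small for that many distinct individuals")
    seen = set()
    individuals = []
    while len(individuals) < n_individuals:
        i, j = rng.sample(range(n_chrom), 2)  # two different chromosomes
        pair = (min(i, j), max(i, j))
        if pair in seen:
            continue  # this pair already forms an individual; redraw
        seen.add(pair)
        individuals.append([a + b for a, b in zip(pool[i], pool[j])])
    return individuals
```

Rejection sampling of pairs is cheap when the pool is much larger than the requested sample, which is the situation the growth phase is meant to produce.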
- Dudek S, Motsinger AA, Velez D, Williams SM, Ritchie MD: Data simulation software for whole-genome association and other studies in human genetics. Pac Symp Biocomput 2006, 499-510.
- Schmidt M, Hauser ER, Martin ER, Schmidt S: Extension of the SIMLA package for generating pedigrees with complex inheritance patterns: environmental covariates, gene-gene and gene-environment interaction. Stat Appl Genet Mol Biol 2005, 4: Article15.
- Bass MP, Martin ER, Hauser ER: Pedigree generation for analysis of genetic linkage and association. Pac Symp Biocomput 2004, 93-103.
- Bush WS, Chen G, Torstenson ES, Ritchie MD: LD-Spline: Mapping SNPs on genotyping platforms to genomic regions using patterns of linkage disequilibrium. BMC BioData Mining 2009, 2(1):7.
- Edwards TL, Bush WS, Turner SD, Dudek SM, Torstenson ES, Schmidt M, Martin E, Ritchie MD: Generating linkage disequilibrium patterns in data simulations using genomeSIMLA. Lecture Notes in Computer Science 2008, 4973:24-35.