Imputation is a method of inferring missing genotypes in a dataset. In our paper, "Imputation and quality control steps for combining multiple genome-wide datasets", we describe how to perform imputation on a large dataset and how to combine imputed datasets from separate platforms into one dataset. Imputation is performed using the standard information provided on the Impute2 website. Below we provide two scripts that can be used to combine imputed datasets and to convert impute2 files to PLINK files.
- impute2-group-join.py: This script takes all imputed files as input and produces one merged dataset as output.
- impute2-to-plink.py: This script is used to convert impute2 files to PLINK format.
- Refer to each utility's --help option for more information.
For more information on imputation and our method, please refer to our paper.
What is PLATO ?
The PLatform for the Analysis, Translation, and Organization of large-scale data (PLATO) is a standalone program written in C++ that is designed to be a flexible and extensible analysis tool for a wide variety of genetic data. PLATO includes a configurable set of QC and analysis steps that can be used for the filtering and analysis of data in a single command step. Further, through the abstraction of genetic data, PLATO allows for the easy addition of customized analysis or filtering steps requiring only a basic level of computing expertise.
Why use PLATO ?
With the wide array of genotypic and phenotypic data available, there is no single analytical method that is appropriate for all data. In fact, no single method can be optimal for all datasets, especially when the genetic architecture for diseases can vary substantially. PLATO serves as an integrative platform that can accommodate multiple analytical methods for analysis as we learn more about genetic architecture. By allowing for user customization through the use of command line options, PLATO can adapt to many different kinds of data and analyses. Additionally, PLATO has the ability to be run in parallel for some steps, reducing the computing time of the analyses on the multi-core machines that have become standard.
Notes about PLATO 2.0
PLATO 2.0 is a complete rewrite of the previous version of PLATO, intended to simplify the most commonly used analysis steps. We have enhanced the steps that we have found most useful and removed steps that we were no longer using. Please see the user manual for a complete summary of all that PLATO can do. If there was a feature in PLATO 1.2 that is no longer in 2.0 and that you would find useful, please let us know by using the Contact Us link above.
What's Coming in PLATO 2.1:
- Ability to use categorical covariates without manual dummy-encoding
- Addition of Firth bias reduction in logistic regression
- Reporting and correcting for genomic inflation factor
- Permutation p-value calculations added
- Higher resolution of small p-values
- Fixed some deadlocking, especially when using MPI
New Features in PLATO 2.0:
- PLATO has a completely rewritten command structure, and all commands can be given on the command line.
- Able to read VCF, Beagle, TPED, and LGEN files
- Can work with arbitrary number of chromosomes
- Added more complex model-generation techniques for regression testing
- Ability to run PheWAS
- MPI Parallelization
- PLATO 1.2.6, released October 17, 2013
- PLATO 1.2.5, released October 10, 2013
- PLATO 1.2.4, released September 13, 2013
- PLATO 1.2.3, released July 22, 2013
- PLATO 1.2.2, released June 19, 2013
- PLATO 1.2.1, released June 5, 2013
Bug Fixes in PLATO 1.2.x
- Version 1.2.4 contains a bug fix for likelihood ratio test p-values when using logistic regression in the interaction method (introduced in version 1.2). Please update your installation to the latest version.
- We have found a critical bug in PLATO versions 1.2.1 and earlier regarding reading binary PLINK files. This has been fixed in version 1.2.2; please update your installation at your earliest convenience. For more details on the bug fix, click here.
What is Synthesis-View ?
Synthesis-View is software for data visualization. With Synthesis-View, through the use of stacked data-tracks, information on SNP genomic locations, presence of the SNP in a specific study or analysis, as well as related information such as genetic effect size and summary phenotype information, can be plotted according to user preference. Through these data visualizations, rapid comparisons of multiple forms of information are possible, not easily achievable through reviewing results in tabular form alone.
Why use Synthesis-View ?
Visualization of the complex data arising from current genetic studies is an integral part of interpreting as well as sharing these complex and multi-layered results. For example, significant GWAS findings are being further investigated for replication and characterization, both in the populations in which the initial GWAS findings were discovered (such as European-Americans) and in new cohorts and populations. To increase power, meta-analysis is often used to combine results from multiple research sites. Multiple independent and correlated phenotypic measurements may be included in these analyses, such as measurements of cardiovascular disease and related biomarkers (lipids, inflammation, etc.). Many of these studies characterize fewer than 100 SNPs. The software "Synthesis-View" has been developed to visually synthesize multiple pieces of information of interest from these studies, with the flexibility to perform multiple types of data comparisons.
Link to Web-based Synthesis-View
What is PhenoGram ?
With PhenoGram, researchers can create chromosomal ideograms annotated with lines in color at specific base-pair locations, or colored base-pair to base-pair regions, with or without other annotation. PhenoGram allows for annotation of chromosomal locations and/or regions with shapes in different colors, gene identifiers, or other text. PhenoGram also allows for creation of plots showing expanded chromosomal locations, providing a way to show results for specific chromosomal regions in greater detail.
Why use PhenoGram ?
PhenoGram is a versatile, user-friendly software tool fostering the exploration and sharing of genomic information. We have used PhenoGram to visualize the genomic coverage of SNPs from genotyping arrays, to highlight the chromosomal coverage of imputed SNPs, to show copy-number variation region coverage, as well as to generate plots similar to the NHGRI GWA Catalog of genome-wide association results. Through visualization of data, researchers can both explore and share complex results, facilitating a greater understanding of these data.
Link to Web-based PhenoGram
What is PheWAS-View ?
We have developed the software PheWAS-View for visually integrating PheWAS results, including information about the SNPs, relevant genes, phenotypes, and the interrelationships between phenotypes that exist in PheWAS. As a result, both the fine-grained detail and the larger trends that exist within PheWAS results can be elucidated.
Why use PheWAS-View ?
PheWAS can be used to discover novel relationships between SNPs, phenotypes, and networks of interrelated phenotypes; identify pleiotropy; provide novel mechanistic insights; and foster hypothesis generation – and these results can be both explored and presented with PheWAS-View.
Link to Web-based PheWAS-View
Supporting and Sample files:
What is PARIS?
Pathway Analysis by Randomization Incorporating Structure (PARIS) is a software tool for conducting pathway analysis independent of study design. It is written in C++ and utilizes an SQLite database and variations file (both included in download) for analysis.
Why use PARIS ?
PARIS conducts pathway analysis to aggregate signals from GWAS analysis. It groups the SNPs into linkage disequilibrium (LD) features and single SNPs in linkage equilibrium (LE) features. Those features are then grouped by pathways (using the KEGG database or other sources) drawn from data in the accompanying database and variations file. The significance of a pathway is determined by permutation testing. In each permutation, the features in a pathway are replaced by a randomly selected set of features of similar size. The total number of features with a significant p-value in the pathway is compared with the number of significant features in the permuted pathway. This approach has the following benefits:
- It only requires the results of the GWAS analysis. No manipulation of the original data is needed.
- It permutes the genome rather than the data and is, therefore, faster to run than methods that require affection status permutation.
- The permutation testing accounts for LD, gene size, pathway size and SNP coverage.
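The permutation scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PARIS's actual implementation: the feature representation (a tuple of SNP count and a significance flag), the `size_bin` rule for deciding which features are of "similar size", and the function names are all hypothetical.

```python
import random
from collections import defaultdict


def size_bin(n_snps):
    """Hypothetical rule: features whose SNP counts share a bit length are 'similar size'."""
    return n_snps.bit_length()


def pathway_p_value(pathway, genome, n_perm=1000, seed=0):
    """Empirical p-value for a pathway.

    pathway, genome: lists of (n_snps, is_significant) features.
    Every size bin used by the pathway must also occur in the genome.
    """
    rng = random.Random(seed)

    # Index genome-wide features by size bin so replacements are size-matched.
    by_bin = defaultdict(list)
    for n_snps, is_significant in genome:
        by_bin[size_bin(n_snps)].append(is_significant)

    # Observed statistic: number of significant features in the real pathway.
    observed = sum(sig for _, sig in pathway)

    hits = 0
    for _ in range(n_perm):
        # Replace each pathway feature with a random feature of similar size.
        permuted = sum(rng.choice(by_bin[size_bin(n)]) for n, _ in pathway)
        if permuted >= observed:
            hits += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

Because only the feature list is resampled, no genotype or phenotype data is touched, which is why this style of test needs nothing beyond the GWAS results.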
What is pMDR ?
Parallel multifactor dimensionality reduction (pMDR) is a parallelized version of the multifactor dimensionality reduction (MDR) method for the analysis of gene-gene and gene-environment interactions. It uses the Message Passing Interface (MPI) for inter-process communication and can be run on computing clusters.
Why use pMDR ?
Multifactor dimensionality reduction (MDR) is a non-parametric, model-free method for detecting gene–gene and gene–environment interactions. pMDR is a parallelized implementation of this method and allows for an unlimited number of variable states (for haplotype encoding) and an unlimited number of individuals. The number of variables and the order of interaction to analyze (two-locus interactions, three-locus interactions, etc.) are limited only by machine memory and computation time. These improvements allow the analysis of higher-order interactions for small datasets and make two-locus interactions computationally feasible for very large datasets. If run on a cluster, pMDR will use MPI to split the analysis among the nodes to reduce the run time needed to complete an analysis.
- MDR-PDT 18.104.22.168, Released on June 29, 2010 (Source with makefiles and binaries for OSX(intel) and redhat linux)
- MDR-PDT 22.214.171.124, Released on June 10, 2010 (Source with makefiles and binaries for OSX(intel) and redhat linux)
- MDR-PDT 126.96.36.199, Released on June 10, 2010 (Source with makefiles and binaries for OSX(intel) and redhat linux)
- MDR-PDT 1.1.6 WIN32, Released on August 14, 2007 (Windows/DOS compatible version)
- MDR-PDT 1.1.6 MAC OSX, Released on August 14, 2007 (MAC OSX compatible version)
- MDR-PDT 1.1.6 Linux, Released on August 10, 2007 (Linux (gcc 3.2) binary for i386. This should be fine for most modern linux systems running on intel hardware)
Throughout this site you will be able to find the source for the application, as well as links to any dependencies (and, in most cases, local downloads that can be extracted directly into your mdrpdt/src directory).
In addition to bug reports, we are interested in usability issues as well. Is there anything we can do to make the software easier to use or to fit more effectively into your workflow? Please contact us at this address.
Please note that version two is now available as a single download. The tarball contains the source, which should compile on any modern GCC-enabled system, as well as the manual and prebuilt versions for OSX and certain versions of RedHat Linux. Please contact us at the address above if you have trouble getting the software compiled on other platforms. New features include the following:
- Cross Validation
- Matched Odds Ratio Evaluation
Performance under version 2.0 should be a little better under larger order models than the previous version.
What is Biofilter ?
Biofilter provides a convenient single interface for accessing multiple publicly available human genetic data sources that have been compiled in the supporting database of the Library of Knowledge Integration (LOKI). Information within LOKI includes genomic locations of SNPs and genes, as well as known relationships among genes and proteins such as interaction pairs, pathways and ontological categories. Via Biofilter 2.0 researchers can:
- Annotate genomic location or region based data, such as results from association studies, or CNV analyses, with relevant biological knowledge for deeper interpretation.
- Filter genomic location or region based data on biological criteria, such as filtering a series of SNPs to retain only SNPs present in specific genes within specific pathways of interest.
- Generate Predictive Models for gene-gene, SNP-SNP, or CNV-CNV interactions based on biological information, with priority for models to be tested based on biological relevance, thus narrowing the search space and reducing multiple hypothesis-testing.
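As a rough illustration of the filtering use case above (retaining only SNPs that fall in genes belonging to pathways of interest), the following Python sketch works on small in-memory dictionaries. It does not use Biofilter's interface or the LOKI schema; the data layout, the function name `filter_snps`, and the example identifiers are all assumptions made for the example.

```python
def filter_snps(snps, genes, pathways, wanted_pathways):
    """Keep SNPs located inside genes that belong to the wanted pathways.

    snps:     {snp_id: (chrom, position)}
    genes:    {gene_name: (chrom, start, end)}
    pathways: {pathway_name: set of gene names}
    """
    # Collect every gene that is a member of at least one wanted pathway.
    wanted_genes = set()
    for name in wanted_pathways:
        wanted_genes |= pathways.get(name, set())

    # Look up genomic regions for those genes (genes without coordinates are skipped).
    regions = [genes[g] for g in wanted_genes if g in genes]

    # Retain SNPs whose position falls inside any selected region.
    kept = {}
    for snp_id, (chrom, pos) in snps.items():
        if any(c == chrom and start <= pos <= end for c, start, end in regions):
            kept[snp_id] = (chrom, pos)
    return kept


# Toy data, invented purely for illustration.
example_snps = {"snpA": ("1", 150), "snpB": ("1", 900), "snpC": ("2", 150)}
example_genes = {"GENE1": ("1", 100, 200), "GENE2": ("2", 100, 200)}
example_pathways = {"pathway_x": {"GENE1"}, "pathway_y": {"GENE2"}}
example_kept = filter_snps(example_snps, example_genes, example_pathways, ["pathway_x"])
```

Filtering before association testing in this way reduces the number of tests performed, which is the motivation for knowledge-driven filtering.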
Why use Biofilter ?
Expanding resources of existing knowledge can be used to direct new analyses investigating the relationship between genetic architecture and outcome traits, provide more information for interpreting analysis results, and provide a way to explore complex interactions. Many different types of “-omic” analyses, such as genome-wide association studies (GWAS) or eQTL analyses, take a high-throughput association approach with the multiple hypothesis testing burden and potential for Type-1 error increasing with the number of SNPs and/or phenotypes/outcomes used in these studies. Existing expert-knowledge can be used to filter data in various ways before calculating associations, thus reducing the number of proposed tests and the multiple testing burden based on a biologically-driven rationale. Existing biological data can also be used to annotate results of high-throughput studies to provide further biological interpretation of genomic regions showing statistically significant associations. Further, methodologies that facilitate the exploration of models beyond those explored through the GWA approach are important for potentially explaining more of the heritability of complex traits. Biofilter is a software tool that provides a flexible way to use the ever-expanding expert biological knowledge that exists to direct filtering, annotation, and complex predictive model development for elucidating the etiology of complex phenotypic outcomes.
Archived Releases and Documentation:
What is BioBin?
BioBin is a standalone command line application and collapsing method that uses prior biological knowledge from a prebuilt database. While it was specifically developed to investigate rare variant burden in traditional genetic trait studies, BioBin can be used to apply multiple levels of burden testing and is useful for exploring the natural distribution of rare variants in ancestral populations.
Why use BioBin?
The era of successful genome-wide association studies (GWAS) has increased the field’s understanding of heritable traits, highlighted novel disease associations that were critical for further biochemical and pharmaceutical development, and advanced the understanding of genetic association and complexity of common diseases. However, large proportions of variance in common complex diseases remain to be investigated. Many researchers are investigating the effects of rare variants. Collapsing and/or binning methods such as BioBin have been a popular approach because they are easily applied to case-control studies, can utilize whole-genome data, and allow for the investigation of collective polygenic inheritance. BioBin’s approach meets four criteria we have defined for improved binning algorithm development:
1. Complexity of interactions (both epistatic and in aggregate) of rare and common variants
2. Potential non-independence between rare variants and between bins
3. Importance and possible limitation of “user” feature definition
4. Necessity of tool flexibility
Instead of focusing on a novel statistical test, we have concentrated on biologically-driven automated bin generation. BioBin can create bins based on many features, including regulatory regions, evolutionarily conserved regions, genes, and/or pathways. Based on the user’s selected features, BioBin creates appropriate feature-level bins using information from one or more of the sources in our integrated database, called the Library of Knowledge Integration (LOKI). In addition, users can utilize complex binning, e.g. collapsing only exons in pathways or performing regulatory and gene feature analyses simultaneously. The innovation of BioBin’s flexible algorithm and incorporation of prior biological knowledge to automate bin generation allows the user the opportunity to test unique hypotheses.
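To make the idea of collapsing concrete, here is a minimal Python sketch of gene-level binning of rare variants. It is not BioBin's algorithm or interface: the variant records, the 0.05 MAF cutoff, and the function name `bin_rare_variants` are assumptions chosen for illustration, and the gene annotation that LOKI would supply is simply given as a field on each variant.

```python
from collections import defaultdict


def bin_rare_variants(variants, n_samples, maf_threshold=0.05):
    """Sum rare alternate-allele counts per sample within each gene bin.

    variants: list of dicts with keys
        "gene"      - name of the bin (feature) the variant maps to
        "maf"       - minor allele frequency
        "genotypes" - per-sample alternate allele counts (0, 1, 2) or None if missing
    Returns {gene: [burden for sample 0, burden for sample 1, ...]}.
    """
    bins = defaultdict(lambda: [0] * n_samples)
    for variant in variants:
        # Only rare variants contribute to the burden (cutoff is an assumption).
        if variant["maf"] >= maf_threshold:
            continue
        for i, count in enumerate(variant["genotypes"]):
            if count is not None:  # missing genotypes add nothing
                bins[variant["gene"]][i] += count
    return dict(bins)


# Toy input, invented for illustration: two rare variants in GENE_A, one common in GENE_B.
example_variants = [
    {"gene": "GENE_A", "maf": 0.01, "genotypes": [0, 1, 2]},
    {"gene": "GENE_A", "maf": 0.02, "genotypes": [1, None, 0]},
    {"gene": "GENE_B", "maf": 0.30, "genotypes": [1, 1, 1]},
]
example_bins = bin_rare_variants(example_variants, n_samples=3)
```

The per-sample totals for each bin can then be passed to a burden test such as a regression of phenotype on bin count; swapping the "gene" field for a pathway or regulatory-region label gives bins at a different feature level.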
New Features in 2.3:
- Genomic build is now detected from the VCF file when --genomic-build is not specified.
- Summary script output now includes case/control capacity and gene list, and is tab-separated by default.
- Changed the representation of p-values from SKAT methods in the case when they are very close to 0.
- Added option --bin-constant-loci to bin loci even when they do not vary among samples with non-missing phenotypes (default: N).
- Added option --drop-missing-phenotype-samples to drop samples which are neither case nor control for any phenotype (default: Y).
- Added option --force-all-control to consider all phenotypes as controls (default: N).
- Added option --ignore-build-difference to suppress the error when the specified genome build does not match the VCF (default: N).
- Added options --include-samples and --exclude-samples to specify files containing samples (one per line) to keep or filter out, respectively.
- Added option --set-star-referent to consider star alleles as referent rather than missing (default: Y).
- SKAT implementation is in beta development.
Bug Fix in 2.2.1:
- Fixed a multithreading crash when using roles and threads (multiple phenotypes) together.
Bug Fix in 2.2.0:
- Variants that include the "FT" genotype-level filter tag are now parsed correctly. Previous released versions used vcftools, which ignored this flag. Previous beta versions incorrectly set the variant to monomorphic, effectively ignoring the entire variant.
New Features in 2.2:
- Added ability to run PheWAS, reading the VCF only once
- Compressed VCFs are automatically detected
- Added an output listing unlifted loci
- Added the ability to suppress summary information in the bins file
- Removed dependence on VCFTools library and other various performance improvements
- Removed Genotype and Frequency reports
- Added the ability to run statistical tests within BioBin. Tests included:
  - linear regression
  - logistic regression
  - Wilcoxon rank-sum test
  - SKAT (continuous phenotype) (beta)
  - SKAT (dichotomous phenotype) (beta)
Bug Fixes in 2.1.2:
- Fixed reactome, GWAS catalog and liftOver loaders in LOKI.
Bug Fix in 2.1.1:
- Fixed NaN in bins when using the built-in weighting method.
New Features in 2.1:
- Weighting of individual loci according to MAF or user input
- Sliding-window intergenic bins
- Minimum MAF threshold added
- Filtering of genes / regions of interest
What is ATHENA ?
The Analysis Tool for Heritable and Environmental Network Associations (ATHENA) is a software package that combines statistical and biological variable selection methods with machine learning modeling techniques to identify complex prediction models that can include non-linear interactions and different types of high-throughput data.
Why use ATHENA ?
ATHENA was developed to integrate biological data to perform feature/variable selection and modeling of complex genetic effects. In addition, ATHENA, a multi-functional software package, was designed to perform the three main functions essential to determine the meta-dimensional models of complex disease: (1) performing feature/variable selection from categorical or continuous independent variables; (2) modeling main and interaction effects that explain or predict categorical or continuous clinical outcomes; and (3) interpreting the significant models for use in further translational bioinformatics. ATHENA contains filtering components, modeling components, and an evolutionary computing approach based on a machine learning technique to generate complex models. The current version of ATHENA has two different computational evolution modeling methods, Grammatical Evolution Symbolic Regression (GESR) and Grammatical Evolution Neural Networks (GENN).