Main

The Analysis Commons, which relies on a new team-science model for genetic epidemiology, integrates multi-omic data and rich phenotypic and clinical information from diverse population studies into a single shared analytic platform that leverages the resources of a cloud-computing environment and allows for distributed access. The number of whole-genome sequencing (WGS) studies with large sample sizes is rapidly expanding. Projects such as the NHLBI TOPMed Program, the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium1,2 and the Centers for Common Disease Genomics (CCDG)3, among others, have already conducted WGS in more than 100,000 individuals, and the Precision Medicine Initiative4 promises WGS in over a million samples. These programs span a diverse set of studies and institutions, many of which lack the computational infrastructure to store and analyze data at this scale. Genomic, epigenomic, metabolic and proteomic data derived from expensive assays often do not exist in large numbers in any single study but represent a powerful discovery resource when they are combined across studies and integrated with phenotypic data.

In aggregate, many population-based studies have collected data on tens of thousands of variables over a period of decades, and the addition of WGS data to cohorts with long-term prospective follow-up provides a powerful resource for immediate discovery. Analysis of WGS data for large samples presents formidable computational and administrative challenges. Evaluation of rare genetic variation in WGS data requires manipulation of data sets that are tens to hundreds of terabytes in size and are prohibitively large for exchange between analysis sites. In contrast, pooled data sets that include genotype and phenotype data from all participants in the contributing individual studies provide for practical and efficient WGS analysis. The creation of such large pooled data sets containing harmonized multi-omic, phenotype and clinical data with appropriate metadata (for example, parent-study information and use permissions) is difficult and time consuming.

Because it can provide extensive computational resources and can host many users, the cloud-computing environment serves as an excellent platform and infrastructure for the Analysis Commons. Instead of distributing copies of excessively large data sets to many analysts, the Analysis Commons uses a cloud-computing infrastructure providing both data and tools to many analysts. This setting, which incorporates collaborative resources and a team-science approach to discovery, permits nimble analyses and methodological developments.

Although existing studies provide valuable data, a major hurdle to the Analysis Commons is that these same studies have legacy data-sharing policies that were not developed with complex data sharing in mind. The Analysis Commons requires the ability not only to combine data across studies and institutions but also to share pooled data among participating investigators from multiple institutions. In addition, mechanisms must be in place to ensure that sensitive participant data are both accessible to authorized investigators and simultaneously protected by robust security protocols. To bring data and researchers from multiple studies together into the Analysis Commons, we implemented two models for data sharing. In the first model, individual studies secure institutional approval to share data with a consortium through a single 'consortium agreement' rather than through the typical series of bilateral agreements. Under this model, the individual studies retain oversight over their shared data by way of a steering committee. The second model leverages the National Center for Biotechnology Information (NCBI) database of Genotypes and Phenotypes (dbGaP) system of controlled access to coordinate authorization and data sharing across the set of approved external collaborators. Both systems build upon well-used approval mechanisms but extend them to enable sharing among a broad group of investigators from multiple institutions.

In most cohort studies, some phenotypes have multiple repeated measures and may require several data types. For example, ascertainment of type 2 diabetes and its date of onset represent a combination of longitudinal glucose measures, medication use, self-reported measures and, in some cases, review of diagnostic codes from medical records. The Analysis Commons can accommodate multiple approaches to phenotype harmonization. The Working Group model, which has facilitated discovery in other settings1, convenes investigators from multiple institutions who have content knowledge of a related set of phenotypes along with analytic or biostatistical expertise to develop analysis plans and consensus definitions for key analytic variables. Analysis plans often require harmonization of primary outcomes as well as eligibility criteria and exclusions. This approach leverages the knowledge of investigators from the contributing studies.
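The type 2 diabetes example above can be made concrete with a minimal sketch of a consensus case definition. Everything in it is an assumption for illustration: the `Visit` record, its field names and the fasting-glucose cut-off of 126 mg/dL (a commonly used clinical criterion) are not taken from the article, and a real Working Group definition would likely include further eligibility criteria and exclusions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence, Tuple

# Assumed cut-off for illustration only; the article does not specify one.
FASTING_GLUCOSE_CUTOFF_MG_DL = 126.0


@dataclass(frozen=True)
class Visit:
    """One hypothetical study visit for a single participant."""
    visit_date: date
    fasting_glucose_mg_dl: Optional[float] = None
    on_diabetes_medication: bool = False
    self_reported_diabetes: bool = False
    has_diagnostic_code: bool = False


def meets_criteria(visit: Visit) -> bool:
    """True if any one of the four evidence types indicates diabetes."""
    glucose = visit.fasting_glucose_mg_dl
    return (
        (glucose is not None and glucose >= FASTING_GLUCOSE_CUTOFF_MG_DL)
        or visit.on_diabetes_medication
        or visit.self_reported_diabetes
        or visit.has_diagnostic_code
    )


def harmonize_t2d(visits: Sequence[Visit]) -> Tuple[bool, Optional[date]]:
    """Return (case status, earliest visit date at which a criterion was met)."""
    qualifying = sorted(v.visit_date for v in visits if meets_criteria(v))
    if not qualifying:
        return False, None
    return True, qualifying[0]
```

Taking the earliest qualifying visit as the onset date is one simple convention; studies with different visit schedules may need interval-based definitions instead.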

In the Analysis Commons, harmonized genomic and phenotypic data are available to authorized researchers to conduct genotype–phenotype analyses that require 'bursts' of intense computing. Implementing this workflow in a cloud environment can efficiently use on-demand computing capacity and thus avoid a costly build-out of local computing clusters at multiple institutions. The Analysis Commons also provides analysts with access to mature pipelines that represent the methods that have been tested and debugged, and are likely to become a standard in the field. Access is possible either through a web interface or through command-line batch processing. The logging of parameters and data-file identifiers used in analyses provides the provenance of results files and facilitates the reproducibility of analyses.
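As a rough illustration of how logged parameters and data-file identifiers can establish provenance, the sketch below builds a record for one analysis run and derives a fingerprint from it. The function name, the record fields and the file identifiers are hypothetical; this is not the logging format used by the Analysis Commons or by DNAnexus.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Sequence


def provenance_record(
    app_name: str,
    app_version: str,
    parameters: Dict[str, Any],
    input_file_ids: Sequence[str],
    run_at: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Describe one run so that its results file can be traced and reproduced."""
    payload = {
        "app": app_name,
        "version": app_version,
        "parameters": parameters,
        # Sorted so the fingerprint does not depend on listing order
        # (an assumption: input order is taken to be irrelevant here).
        "input_file_ids": sorted(input_file_ids),
    }
    # Canonical JSON: sorted keys and fixed separators give a stable byte string.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    record = dict(payload)
    record["fingerprint"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # The timestamp is stored but deliberately excluded from the fingerprint.
    record["run_at"] = (run_at or datetime.now(timezone.utc)).isoformat()
    return record
```

Two runs with the same app version, parameters and inputs share a fingerprint, so a results file carrying it can be matched to the settings that produced it.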

The Analysis Commons is designed to support a variety of software applications that have particular strengths, such as familial adjustment, analysis of time-to-event outcomes and computational optimization. Available applications for genetic association analyses currently include Genetic Estimation and Inference in Structured Samples (GENESIS)5, Mixed Model Analysis for Pedigrees and Populations (MMAP), Efficient and Parallelizable Association Container Toolbox (EPACTS) and seqMeta. Applications support the analysis of both related and unrelated individuals. The multiple-variant tests are flexibly designed so that variants can be aggregated by genes, by regulatory regions, by sliding windows or by user-defined motifs. Variants can be filtered or weighted according to annotations (for example, WGS Annotator (WGSA)6 or Cassandra7), which build on a base of common information such as conservation and functional protein predictions as well as extensive tissue-specific assays from projects such as the Encyclopedia of DNA Elements (ENCODE)8. By focusing on those variants with higher likelihoods to be functional for a given phenotype, these tools allow researchers to leverage their specific expertise in trait biology to improve power.
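The filtering and aggregation described above can be sketched in a few lines. This is an illustrative stand-in, not the code of GENESIS, MMAP, EPACTS or seqMeta: the `Variant` record and `window_sets` function are hypothetical, windows are fixed and non-overlapping rather than sliding, and the default thresholds simply mirror the fibrinogen example reported later in this article (MAF <5%, CADD phred score ≥10, 50-kb windows).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record."""
    chrom: str
    pos: int      # 1-based position
    maf: float    # minor allele frequency
    score: float  # annotation score, for example a CADD phred score


def window_sets(
    variants: Iterable[Variant],
    window_size: int = 50_000,
    max_maf: float = 0.05,
    min_score: float = 10.0,
) -> Dict[Tuple[str, int], List[Variant]]:
    """Group variants that pass the filters into fixed, non-overlapping windows.

    Keys are (chromosome, 1-based window start); windows with no passing
    variants are simply absent from the result.
    """
    windows: Dict[Tuple[str, int], List[Variant]] = defaultdict(list)
    for v in variants:
        if v.maf < max_maf and v.score >= min_score:
            start = ((v.pos - 1) // window_size) * window_size + 1
            windows[(v.chrom, start)].append(v)
    return dict(windows)
```

Each resulting set would then be passed to an aggregate test such as SKAT; swapping the grouping key for a gene or regulatory-region identifier gives the other aggregation units mentioned above.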

The setting of the Analysis Commons has the flexibility to serve phenotype-driven research as well as to aid investigators in developing and testing new statistical methods and computational algorithms. Although analysis in this setting is more complicated to execute than in a model that offers users only predefined point-and-click analysis tools, full direct access to the combined data sets makes methods development possible. Importantly, these new methods, which will be essential to leverage a growing collection of WGS data sets, can be readily benchmarked against established methods in a controlled environment and then rapidly distributed. For example, fastSKAT9, a methodological advance that greatly decreases the computational burden of the sequence kernel association test (SKAT)10 with large numbers of variants, was developed and validated in the Analysis Commons and benefited from access to sample data sets and benchmarking against standard SKAT implementations. This collaborative 'sandbox' assures the availability of the latest methods to interested investigators and provides researchers with full access to the individual-level data needed to drive discovery.

Modular analysis applications (apps) each implement a particular operation and are chained together into pipelines (Fig. 1). As an example, we implemented one such pipeline for a sequence-to-discovery workflow, including (i) conversion of variant call format to a binary random-access genetic-storage format by using the SeqArray R package, (ii) single-variant and aggregate tests implemented through the GENESIS R package and (iii) visualization for quality control and display of the results. Apps for each step in the workflow were contributed by users at different institutions and coordinated through the Apps Development Working Group, thus demonstrating that the Analysis Commons allows for greater collaboration in both development and analysis. This pipeline is publicly available on DNAnexus (Supplementary Note). All analyses were performed in parallel in an independently developed MMAP pipeline, which allowed for not only validation of the methods and results but also benchmarking of computing parameters.

Figure 1: Analysis Commons design.

The Analysis Commons is a cloud-computing environment that combines data from multiple sources and provides analysis access to a wide range of analysts and developers. Each study uploads phenotype, genotype or other -omics data. Both genetic data and phenotypic data are harmonized and pooled into joint data sets. Analysts can choose from multiple analytic pipelines for association analysis as well as quality control, annotation and results visualization. A large number of analysts from remote sites can access the analytic tools through a web interface or batch processing through a command-line interface. In addition, analysts can run ad hoc analyses, and developers can test and implement new methods by accessing the underlying data resources directly.

The Analysis Commons is currently implemented in DNAnexus, which is built on Amazon Web Services. Data from 12 studies from 2 large WGS efforts, CHARGE and TOPMed, are combined and made accessible to authorized study investigators. Data sets are held securely within the DNAnexus platform for genomic-data management and analysis, which is independently certified as compliant with relevant research and clinical regulations (including ISO 27001, HIPAA, CLIA, CAP and GCP). For the purpose of illustration, we integrated data from 2 of the 12 studies with measured plasma fibrinogen levels, the Old Order Amish Study and the Framingham Heart Study, to analyze genetic association with fibrinogen levels in 3,996 study participants (Supplementary Note). The participating studies and the analysts received institutional approval via a consortium agreement to share phenotype and genotype data and perform analyses within the Analysis Commons. The analyses used linear mixed models that were adjusted for family structure through an empirical kinship matrix. Single-variant regression analyses assessed associations with common variants (i.e., those with a minor allele count ≥5). After correction for the number of variants tested (n = 13,742,969), we identified a low-frequency variant with a two-tailed score test (rs148685782[G>C] (p.Ala108Gly); P = 2.51 × 10^−9, MAF = 0.34%), a previously identified11 nonsynonymous variant in FGG (Fig. 2a), the gene encoding the gamma chain of the fibrinogen glycoprotein. Rare variants (MAF <5%) were limited to those with a Combined Annotation-Dependent Depletion (CADD)12 phred score ≥10 and were tested in aggregate within sequential 50-kb windows (Fig. 2b). No windows were genome-wide significant after Bonferroni correction. These analyses benefited from the extensive computing resources. For example, the GENESIS SKAT analyses that used 380 CPU hours were run in approximately 1 hour of wall-clock time. The analyses were validated by having analysts from separate institutions run both the GENESIS and MMAP applications.
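The multiple-testing and computing figures in this paragraph can be checked with simple arithmetic. The sketch below assumes a family-wise error rate of 0.05, which the text does not state, and assumes ideal parallel scaling when converting CPU hours to concurrent cores; the function names are illustrative.

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def implied_concurrency(cpu_hours: float, wall_clock_hours: float) -> float:
    """Average number of cores in use, assuming perfect parallel scaling."""
    if wall_clock_hours <= 0:
        raise ValueError("wall_clock_hours must be positive")
    return cpu_hours / wall_clock_hours


N_VARIANTS = 13_742_969  # variants tested, from the text
TOP_P = 2.51e-9          # P value for rs148685782, from the text

threshold = bonferroni_threshold(N_VARIANTS)  # alpha = 0.05 is an assumption
print(f"Bonferroni threshold: {threshold:.3e}")
print(f"Top variant passes:   {TOP_P < threshold}")
print(f"Implied concurrency:  {implied_concurrency(380, 1):.0f} cores")
```

Under these assumptions the threshold is about 3.64 × 10^−9, so the reported P value of 2.51 × 10^−9 clears it, and 380 CPU hours completed in roughly 1 hour corresponds to about 380 cores running at once.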

Figure 2: Plasma fibrinogen association results.

(a) Top single-variant association results fall within a region on chromosome 4 containing the fibrinogen subunits. A regional-association-plotting application computes the linkage disequilibrium with the top signal (diamond) and plots the −log10(P values) and genes within a specified window. (b) Rare variants (MAF <5%) were filtered for those with high CADD phred scores and aggregated into genomic windows covering 50 kb.

We present a model that builds a collaboration among researchers with the common goal of multicenter genomic epidemiology research. The oversight of the Analysis Commons requires the management of four activities: (i) data access, (ii) phenotype harmonization, (iii) app development and (iv) analysis. The management is shared among several committees and Working Groups. These components of the Analysis Commons are designed to flexibly accommodate teams that may work on subprojects with distinct permissions, data sets and analytic approaches. Team members participate in the Analysis Committee, wherein researchers present work in progress focusing on ongoing challenges in analytic methods and discuss data-set curation and availability, as well as annotation resources. Similarly, the membership of the Apps Development Working Group is drawn from the phenotype-driven Working Groups and focuses on the development and testing of software for use across the Analysis Commons and eventual release to the broader scientific community. Although project teams primarily work independently on their research aims, communication among investigators through joint teleconferences, real-time messaging and in-person training seminars is key to successful collaboration. Large multistudy collaborations and big-data efforts are the next stage in contemporary genetics. With the Analysis Commons, we present a blueprint for how to navigate the practical issues of both large-scale computing and collaboration that are facing many studies, and the analytic code and data-sharing mechanisms that can be adopted by other investigators. The Analysis Commons is a resource for many research groups, through direct collaboration, established committees or parallel adoption of the governance model and the developed apps.

The Analysis Commons is one model for the translation of WGS resources from a massive quantity of raw data into a better understanding of the determinants of health in diverse human populations. Strong infrastructure support is needed for analysis of these WGS data in a setting that allows for phenotype, analytic and computational experts to convene and address these questions. This environment should enable and accelerate the promise of precision medicine to provide the right treatment at the right time and to tailor treatments to patients' individual needs.

URLs. GENESIS, http://bioconductor.org/packages/release/bioc/html/GENESIS.html; MMAP, https://github.com/MMAP/; EPACTS, http://genome.sph.umich.edu/wiki/EPACTS; seqMeta, https://cran.r-project.org/web/packages/seqMeta/index.html; SeqArray, https://www.bioconductor.org/packages/release/bioc/html/SeqArray.html; DNAnexus, https://www.dnanexus.com/; Analysis Commons GitHub, https://github.com/AnalysisCommons/; Analysis Commons analysis tools, https://platform.dnanexus.com/projects/F2KK1b80zzK7vb0G0qb8fJvk/; Analysis Commons public site, http://analysiscommons.com/.

Data availability. All data generated or analyzed during this study are included in this published article and its supplementary information.

Author Contributions

S.R.H., C.C.L., B.D.M., J.R.O., R.S.V., S.S.R., J.I.R., J.G.W., J.A.B., A.C.M., J.C.B., E.B., B.M.P., L.A.C. and K.M.R. formed the management team of the Analysis Commons. J.A.B., A.C.M., J.C.B., E.B., B.M.P., L.A.C., S.R.H. and X.L. drafted the manuscript. W.S., R.A.G., A.C. and D.C.A. managed the computing infrastructure. A.K.M., J.R.O., M.R.B., D.C.A., A.C., M.P.C., S.M.G. and A.N.P. were responsible for the implementation and design of the applications. S.G. and N.G. oversaw the sequence generation. J.A.B., J.E.H., J.P.L., A.D.J., J.C.B., C.M.S., N.L.S., C.E.J. and G.J.P. conceived, designed and implemented the example data analyses. All coauthors reviewed and edited the manuscript before approving its submission.