AnGeLi: A Tool for the Analysis of Gene Lists from Fission Yeast

Bitton, Danny A.; Schubert, Falk; Dey, Shoumit; Okoniewski, Michal; Smith, Graeme C.; Khadayate, Sanjay; Pancaldi, Vera; Wood, Valerie; Bähler, Jürg

doi:10.3389/fgene.2015.00330

TECHNOLOGY REPORT article

Front. Genet., 16 November 2015

Sec. Genomic Assay Technology

Volume 6 - 2015 | https://doi.org/10.3389/fgene.2015.00330

Danny A. Bitton¹‡

Falk Schubert¹‡

Shoumit Dey¹

Michal Okoniewski²

Graeme C. Smith¹

Sanjay Khadayate¹†

Vera Pancaldi¹†

Valerie Wood³

Jürg Bähler¹*

¹Research Department of Genetics, Evolution and Environment – UCL Genetics Institute, University College London, London, UK
²Scientific IT Services, ETH Zürich, Zürich, Switzerland
³Cambridge Systems Biology and Department of Biochemistry, University of Cambridge, Cambridge, UK

Genome-wide assays and screens typically result in large lists of genes or proteins. Enrichments of functional or other biological properties within such lists can provide valuable insights and testable hypotheses. To systematically detect these enrichments can be challenging and time-consuming, because relevant data to compare against query gene lists are spread over many different sources. We have developed AnGeLi (Analysis of Gene Lists), an intuitive, integrated web-tool for comprehensive and customized interrogation of gene lists from the fission yeast, Schizosaccharomyces pombe. AnGeLi searches for significant enrichments among multiple qualitative and quantitative information sources, including gene and phenotype ontologies, genetic and protein interactions, numerous features of genes, transcripts, translation, and proteins such as copy numbers, chromosomal positions, genetic diversity, RNA polymerase II and ribosome occupancy, localization, conservation, half-lives, domains, and molecular weight among others, as well as diverse sets of genes that are co-regulated or lead to the same phenotypes when mutated. AnGeLi uses robust statistics which can be tailored to specific needs. It also provides the option to upload user-defined gene sets to compare against the query list. Through an integrated data submission form, AnGeLi encourages the community to contribute additional curated gene lists to further increase the usefulness of this resource and to get the most from the ever increasing large-scale experiments. AnGeLi offers a rigorous yet flexible statistical analysis platform for rich insights into functional enrichments and biological context for query gene lists, thus providing a powerful exploratory tool through which S. pombe researchers can uncover fresh perspectives and unexpected connections from genomic data. AnGeLi is freely available at: www.bahlerlab.info/AnGeLi

Introduction

Large-scale and genome-wide studies such as the profiling of gene expression, DNA-binding sites, mutant phenotypes, or genetic interactions, typically lead to sizeable lists of candidate genes or proteins. Such gene lists often contain valuable, hidden biological information which can enlighten the processes studied, provide useful context, and generate testable hypotheses for targeted follow-up experiments. While the generation of gene lists entails established experimental and analytical procedures, the extraction of any biologically meaningful information from such lists can be a serious challenge. Evidently, prior knowledge is a major factor affecting interpretation of any gene list, regardless of the underlying biological or experimental context by which it was generated. Gene list interpretation therefore relies on the availability of comprehensive reference information on genes and proteins against which the list can be compared to uncover any statistically significant common features among its members.

To get the most from gene lists, such reference information may include validated or predicted gene/protein functions, detailed data on gene architecture and conservation, regulatory factors, expression levels and context, cellular localization, pathway information, physical/genetic interactions, and phenotypic data, to name just a few. Such databases of integrated gene and protein information are partially provided through InterMine for some organisms but not for fission yeast (Kalderimis et al., 2014). Hence, gene list interpretation relies on incomplete functional annotation databases, combined with statistical tools, which typically interrogate one or more properties in search of any significant enrichment. Popular examples are GO enrichment tools (Ashburner et al., 2000; Boyle et al., 2004; Carbon et al., 2009), which look for over-representation of associated GO terms within gene lists. To meet the growing needs of biologists in the omics era, more specialized gene identifier-based search engines have been developed for various model organisms, including free or commercial resources such as functional annotation tools (Huang da et al., 2009), pathway mapping algorithms (Kanehisa and Goto, 2000; Nikolsky and Bryant, 2009; Kelder et al., 2012; Mi et al., 2013; Croft et al., 2014), or protein interaction search tools (Stark et al., 2006).

The emergence of central, regularly maintained and updated databases that store genomic variation, ontology, pathway, interaction or phenotypic data has attracted software developers to design ‘all-in-one’ search engines that enable systematic searches against published, pre-defined gene sets (Subramanian et al., 2005) and/or multiple functional annotation resources (Zhang et al., 2005; Araki et al., 2012). Such gene set enrichment tools have proven valuable for downstream analysis of large-scale experiments by providing functional insights for query gene lists of interest. Given the rapid growth of relevant information, integration of developer-curated and user-defined gene sets into a single resource offers a flexible solution. GSEA (Subramanian et al., 2005), for example, a standalone or web-based application for selected vertebrates, allows the user to search a query gene list against thousands of curated gene sets but also against additional user-defined lists.

The fission yeast Schizosaccharomyces pombe is an important model organism that shares many critical biological processes with multicellular eukaryotes (Wood et al., 2002). Over the years, the fission yeast community has produced many genomic data sets and resources, including a gene deletion collection (Kim et al., 2010; Chen et al., 2015) and protein localization data (Matsuyama et al., 2006). The curators at PomBase [the S. pombe model organism database (Wood et al., 2012; McDowall et al., 2015)] are assembling rich information on gene characteristics and functions and on mutant phenotypes by applying the Fission Yeast Phenotype Ontology (FYPO; Harris et al., 2013). These efforts are supported by volunteer expert curators among the fission yeast community, using the Canto online tool (Rutherford et al., 2014).

We have exploited the rich published and annotated resources to build a generic gene list enrichment tool, AnGeLi, that can satisfy the growing need of the community for a comprehensive, one-stop analysis of gene lists. AnGeLi is an intuitive web-based tool which offers customized analyses of gene lists, by systematically screening a multitude of data sources, including published and user-defined gene sets to highlight statistically significant enrichments. Moreover, AnGeLi encourages a community-wide effort to further increase its usefulness by contributing additional published or otherwise annotated gene lists via its data submission feature. The more data are included in AnGeLi the more powerful it will become in uncovering functional insights, context and unexpected connections, and thus fully unleashing the information hidden in genomic data that currently remain only partially explored.

Overview of AnGeLi Tool

Database, Data Types, and Gene Set Resources

AnGeLi is a knowledge-driven, web-based application implemented in Perl. It takes as an input a list of systematic gene identifiers and searches for any enrichment of common features using a diverse collection of annotated resources, published gene lists, or curated gene sets or features (henceforth AnGeLi’s database), as well as user-defined gene sets (optional). AnGeLi’s database includes three discrete types of data: categorical, metric, and pairwise (Table 1). Categorical data refer to gene sets representing membership in specific biological categories, where gene membership of a category is stored in binary format. These gene sets are derived from different sources such as specific GO categories, phenotypes, or published gene lists, as examples. A query gene can either belong to a specific gene set (gene value is 1) or not (gene value is 0). Metric data describe a quantifiable, continuous characteristic of a gene or protein such as intron number, distance from centromere or transcript copy number, to name a few examples. Both categorical and metric data are organized in a tabular format prior to data compilation (Table 1). Pairwise data represent pair relationships such as genetic or protein–protein interactions.
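The three data types described above can be pictured with a small sketch. This is only an illustration in Python (AnGeLi itself is implemented in Perl); the class name `FeatureStore`, the feature names, and the gene identifiers are hypothetical placeholders, not AnGeLi's actual data structures.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class FeatureStore:
    """Toy container for categorical, metric, and pairwise features (illustrative only)."""
    # feature name -> set of member genes (membership is binary: in or out)
    categorical: dict[str, set[str]] = field(default_factory=dict)
    # feature name -> {gene -> continuous value}
    metric: dict[str, dict[str, float]] = field(default_factory=dict)
    # feature name -> set of unordered gene pairs
    pairwise: dict[str, set[frozenset[str]]] = field(default_factory=dict)

    def binary_value(self, feature: str, gene: str) -> int:
        """Return 1 if the gene belongs to the categorical set, else 0."""
        return 1 if gene in self.categorical.get(feature, set()) else 0


# Hypothetical content; real features would come from the sources listed in the text.
store = FeatureStore()
store.categorical["example_gene_set"] = {"geneA", "geneB"}
store.metric["intron_number"] = {"geneA": 2.0, "geneC": 0.0}
store.pairwise["example_interactions"] = {frozenset(("geneA", "geneB"))}
```

Storing categorical membership as sets is equivalent to the binary (1/0) column representation described in the text, and keeps overlap counts cheap to compute.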

TABLE 1. Organization of data types (binary and metric) and grouping into themes.

AnGeLi’s database currently holds 9632 features (9579 binary, 49 metric, and 4 pairwise features; Supplementary Table S1). These features are sourced directly from PomBase, or calculated using sequence or annotation data (55 features). Other data sources include GO categories (5603 features: 3529 Biological Process; 1277 Molecular Function; 797 Cellular Component; Wood et al., 2012), phenotypes (FYPO; 2682 features; Harris et al., 2013), Pfam domains (1130 features; Finn et al., 2014), and BioGrid interactions (four features; Breitkreutz et al., 2008). To augment AnGeLi’s database beyond the annotated resources, we have initially selected 23 genomic papers which report fundamental expression or functional profiling data (158 features); many more such data can be added in the future using a straightforward submission form (see below). Among the categorical data, we included gene lists from defined ‘housekeeping’ genes (Pancaldi et al., 2010), stress-response genes (Chen et al., 2003, 2008; Tanay et al., 2005), meiotic differentiation genes (Mata et al., 2002, 2007; Tanay et al., 2005; Mata and Bähler, 2006), and cell cycle-regulated genes (Rustici et al., 2004; Marguerat et al., 2006), genes regulated in chromatin mutants (De Groot et al., 2003; Tanny et al., 2007) or in response to caffeine and rapamycin (Rallis et al., 2013), and gene sets that highlight differences between haploid and diploid transcriptomes (Bitton et al., 2011). We also incorporated key regulatory modules (Tanay et al., 2005), transcription factor targets (Rustici et al., 2004; Tanay et al., 2005), protein localization data (Matsuyama et al., 2006), genes identified in genome-wide splicing assays (Bitton et al., 2014, 2015), targets of RNA-binding proteins (Lemieux et al., 2011; Hasan et al., 2014), GPI-anchored cell-surface proteins (De Groot et al., 2003; Tanny et al., 2007), as well as genes involved in TORC1 function, lifespan and growth (Rallis et al., 2014; Sideri et al., 2014). 
Among the metric data, we incorporated genetic diversity among wild S. pombe strains (Jeffares et al., 2015), transcript half-life data (Amorim et al., 2010; Hasan et al., 2014), RNA polymerase II occupancy (Lackner et al., 2007), cellular transcript and protein copy numbers (Marguerat et al., 2012), protein molecular weight, amino acid composition, ribosome occupancy, and density (Lackner et al., 2007), AUG translation initiation index (Miyasaka, 2002; Lackner et al., 2007), poly-A tail lengths (Beilharz and Preiss, 2007; Lackner et al., 2007), protein half-lives (Christiano et al., 2014), as well as protein fold-index (Prilusky et al., 2005), which predicts intrinsically unfolded proteins (Gsponer et al., 2008). AnGeLi also stores interaction data from BioGRID (Breitkreutz et al., 2008), including protein–protein and genetic interactions identified in fission yeast, and inferred interactions based on orthologs in budding yeast (Wood, 2006). AnGeLi may thus facilitate the discovery of protein complexes, network ‘hubs’, or enrichment of specific pathways among the query genes.

AnGeLi’s output is grouped in themes capturing different biological aspects: GO categories, Gene Expression (differentially regulated genes under different conditions), Gene Features (e.g., intron number, chromosomal position, and genetic diversity), Genetic and Physical Interactions (based on BioGRID), Phenotypes (based on FYPO), Phenotypic Profiles (genes identified in mutant screens), Protein Domains (based on Pfam), Protein Features (e.g., amino-acid composition, conservation, and cellular copy numbers), Protein Localizations (based on ORFeome), and Transcript Features (e.g., RNA length and type, ribosome occupancy, and cellular copy numbers). This grouping into themes facilitates an overview of the results but is not used for any higher-level analysis.

Statistical Framework for Gene Enrichment Analyses

To determine whether a feature is significantly enriched or under-enriched in the query gene list, AnGeLi automatically selects from three statistical tests depending on the data type. Categorical data are countable (i.e., number of overlapping genes between the query list A and categorical set C), and AnGeLi applies a widely used test for gene set enrichment, the 2-tailed Fisher’s exact test (Rivals et al., 2007). AnGeLi thus determines whether the proportion of genes from set C found in the query list A is significantly higher or lower than the proportion of genes from set C in the entire background gene population. The statistic is therefore affected by the background gene population, which can be adjusted to best match the analysis (see below). Metric data are continuous (e.g., transcript length, copy numbers), and AnGeLi performs a 2-sided Wilcoxon rank-sum test to assess whether the values of metric feature M associated with the genes in query list A are significantly higher or lower than the values of feature M associated with the genes not present in list A. Pairwise data are assessed by a permutation test (Good, 2000) to reveal any enrichments of protein–protein or genetic interactions within the query list. Briefly, a random set of genes (of the same size as list A) is iteratively drawn from a pool of genes not found in list A and evaluated for protein–protein or genetic interactions in pairwise gene set P; the number of permutations is determined by the user (default is 1000), while the p-value is derived from the number of times the random set achieved a greater sum of interactions in set P than the sum of interactions in query list A. Owing to the large increase in analysis time, AnGeLi does not analyze pairwise data by default.

Under default settings, the query list is tested against 7554 features simultaneously (7505 binary and 49 metric features, excluding user-defined gene sets); thus, the probability of false positive enrichments is quite high. To account for this multiple testing problem, AnGeLi provides three approaches for p-value correction. The Bonferroni method (Shaffer, 1995) is conservative and simply multiplies the original p-value by the total number of tests performed to derive the corrected p-value. The Holm (1979) method of correction reduces false negatives, but is still conservative; in brief, the p-values are ranked in ascending order, and the first p-value is multiplied by the total number of tests, while each subsequent p-value is multiplied by the decreasing number of remaining tests (i.e., p-value₁ × ‘t’ [total number of tests], p-value₂ × [‘t’ – 1], etc.). The FDR (Benjamini and Hochberg, 1995) is used as the default option by AnGeLi. FDR is less conservative because it controls the expected proportion of false positives among the reported significant features, rather than the chance of any false positive. Again, the p-values are ranked in ascending order, and each p-value is compared against a threshold equal to its rank divided by the total number of tests performed, multiplied by the accepted false discovery rate chosen by the user. At an FDR of 0.01, we expect 1% false positives among the reported significant features. Note that for the pairwise data type, the p-value is highly dependent on the number of permutations set by the user, which in turn dictates the analysis time (see Materials and Methods). When the number of permutations is relatively low (e.g., 1000), even the lowest p-value will not be sufficient to pass multiple testing corrections; AnGeLi therefore provides the option to increase the number of permutations at the expense of analysis time.
Furthermore, AnGeLi permits deselecting categories that are not of interest, which in turn will increase the statistical power and enhance identification of subtle enrichments, and is therefore recommended if applicable.
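The three multiple-testing corrections can be sketched as follows. This is a minimal Python illustration of the textbook procedures, written in their adjusted-p-value form, and is not AnGeLi's Perl code. Capping at 1 and enforcing monotonicity across ranks are standard conventions assumed here rather than details taken from the text. For Benjamini-Hochberg, declaring a feature significant when its adjusted p-value is at or below the chosen FDR is equivalent to the rank-based threshold comparison.

```python
def bonferroni(pvals):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Holm step-down: smallest p times m, next smallest times m - 1, and so on."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up: adjusted p = p * m / rank, made monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)  # descending p
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # rank of this p-value in ascending order (1-based)
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Because each function scales with the number of tests `m`, deselecting categories that are not of interest lowers `m` and so makes every adjusted p-value smaller, which is the gain in power mentioned above.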

AnGeLi provides the ability to choose a background gene population as a reference for the statistical analyses based on the query gene list. This option allows tailoring of the analysis to the context of the gene list of interest, which can greatly increase the accuracy and sensitivity of the analysis. For example, a query list from an experiment which has only considered protein-coding genes should be analyzed with the protein-coding gene background. As another example, query genes derived from phenotypic screens with the deletion mutant library will all be non-essential, which would skew the statistics if all genes were used as background. AnGeLi offers six pre-set background options, covering all common scenarios: protein-coding genes (default), all annotated genes, non-coding RNA genes, genes with associated GO terms, genes with associated phenotypes, and non-essential genes. In addition, users can provide their own bespoke background gene list to tailor the analysis to their particular requirements. An overview of AnGeLi’s steps for data entry, statistical tests and data processing is presented in Figure 1.
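The effect of the background choice can be illustrated with a toy calculation of the expected overlap between a query list and a gene set. The sketch below is in Python with made-up gene identifiers, and it assumes that genes outside the chosen background are simply ignored; the text does not state how AnGeLi handles such genes.

```python
def expected_overlap(query, gene_set, background):
    """Expected number of shared genes if the query were drawn at random from the background."""
    query_bg = set(query) & set(background)    # assumption: drop genes outside the background
    set_bg = set(gene_set) & set(background)
    return len(query_bg) * len(set_bg) / len(background)


# Hypothetical toy genome: 20 genes, of which the first 10 are non-essential.
all_genes = {f"g{i}" for i in range(1, 21)}
non_essential = {f"g{i}" for i in range(1, 11)}
gene_set = {"g1", "g2", "g3", "g4", "g5"}   # a set containing only non-essential genes
query = {"g1", "g2", "g3", "g4"}            # hits from a deletion-library screen

print(expected_overlap(query, gene_set, all_genes))      # 1.0
print(expected_overlap(query, gene_set, non_essential))  # 2.0
```

In this toy case the observed overlap is 4 under either background, but the expected overlap doubles when the non-essential background is used, so the enrichment appears weaker and is less likely to be overstated.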

FIGURE 1. Workflows in AnGeLi. (Top – blue) Data entry: the user pastes a query gene list and has the option to add user-defined gene sets and/or select the background gene set (default = PC; protein-coding genes). If no additional gene sets are added, under default settings, 7554 features of the AnGeLi knowledgebase will be analyzed (7505 binary, and 49 metric features), because 1277 GO Molecular Function, 797 GO Cellular Component, and 4 Genetic and Physical Interactions (BioGRID) features are excluded by default (9632 features in total). If any user-defined gene sets are added, the database is augmented accordingly. (Middle – green) Statistical parameter settings: the user selects GO category (default = BP; Biological Process), a method for multiple testing correction (default = FDR) and the desired p-value threshold (default = 0.01). Users can also specify whether to perform the pairwise interaction enrichment analysis (default = No), set the desired number of permutations accordingly (default = 1000), and adjust the p-value to account for multiple testing. (Bottom – red) Data processing: AnGeLi performs gene list enrichment analysis based on user input and reports any significantly enriched functional features, along with associated information.

Comparison to Other Tools and Applications

The breadth of AnGeLi offers several advantages over existing tools that are based on only one or two data types such as GO categories or pathways. We compared AnGeLi’s performance to two other tools that support GO enrichment analysis for fission yeast, GeneCodis (Carmona-Saez et al., 2007) and GO Term Finder (Boyle et al., 2004). We assembled a list of 100 protein-coding genes containing 50 cell cycle-regulated genes (Rustici et al., 2004) and 50 random genes (Supplementary Table S2). This list was analyzed with all three tools using FDR as the multiple-testing correction method, with a cutoff of <0.01, using all genes with GO terms as background and Biological Process as category. Surprisingly, GeneCodis did not identify any enrichment in the list, even after disabling the multiple-testing correction option. The GeneCodis database for fission yeast was last updated in December 2011, which could partially explain the lack of any enrichment. On the other hand, the results obtained from AnGeLi and GO Term Finder corresponded very well, with only minor differences (Supplementary Tables S3 and S4): of the 17 enrichments found by AnGeLi, 15 were also found by GO Term Finder, which reported numerous additional enrichments with lower significance. These differences between the two tools largely arise from differences in statistical tests and thresholds. AnGeLi actually did find all enrichments presented by GO Term Finder after relaxing the FDR to <0.08.

Importantly, AnGeLi offers a uniquely broad analysis tailored to fission yeast, far beyond GO term enrichments. Enrichments for several informative features are exclusive to AnGeLi, like gene expression signatures and phenotype annotations; the absolute number of phenotype annotations exceeds the number of GO annotations and is currently increasing at a rate of ∼1000 per year. When analyzing the test list of 100 genes with AnGeLi using default settings, rich additional biological insights were provided (Supplementary Table S5). For example, the analysis revealed enrichment in target genes for specific transcription factors that control gene expression during distinct phases of the cell cycle. As another example, the list was associated with abnormal cell-cycle phenotypes, like aberrant mitosis and cell division, and was also enriched for cell surface proteins. AnGeLi has served our group and collaborators very well in numerous studies to obtain biologically meaningful insights from large gene lists. As recent examples, the tool has uncovered helpful functional enrichments, besides GO categories, among lifespan and growth mutants (Sideri et al., 2014) and among the targets of RNA-binding proteins (Cotobal et al., 2015).

AnGeLi provides additional advantages compared to other enrichment tools. It is easily configurable for additional data sets, and users can incorporate their own gene sets. It also provides a broad choice for statistical analyses. Moreover, because of its link with PomBase, users can be assured that AnGeLi remains updated and uses current data. On the other hand, AnGeLi is organism specific and therefore its application is narrower than for other tools, but other organism communities may benefit from similar tools which are configured as a one-stop resource for datasets of specific interest.

User Interface

AnGeLi offers an intuitive online interface: www.bahlerlab.info/AnGeLi. The user supplies a query gene list (systematic names only), and sets the statistical parameters and background gene list. In addition, users can provide additional gene sets in tab-delimited format. AnGeLi’s output is organized into different themes and includes hyperlinks to the corresponding resource or the publication from which the data derive. AnGeLi allows the user to de-select any pre-defined themes; in the extreme case, AnGeLi’s statistical framework could just be used to analyze a query list against a user-defined gene set.

Once analysis is completed, AnGeLi reports a summary of all tests performed, including color-coded tables where over- and under-represented sets and features are highlighted in red and green, respectively. For each theme, enrichments are ranked by their p-values, with expected vs. observed gene overlaps provided for categorical data, average values for metric data, and the number of interactions for pairwise data. Only gene sets or features that show any enrichments or under-enrichments are listed. An explain button next to each over- or under-represented gene set or feature provides a detailed summary for the corresponding enrichment. The user can export the results page in tab-delimited format, which also includes the corresponding external database identifiers as well as the actual intersection between the gene sets. A detailed Help Page is also provided (Link ‘Help’).

To further extend the data types available in AnGeLi, users are encouraged to submit published gene lists through a straightforward submission form. AnGeLi’s database will be updated monthly via synchronization with the annmap annotation package (Yates et al., 2008). The database can be downloaded via a link from the website (Link ‘Download Database’). Furthermore, user feedback will be monitored via the GitHub issue-tracking utility to allow continuous improvement (Link ‘Report a bug’).

Materials and Methods

Database Construction and Resources

AnGeLi utilizes the S. pombe Ensembl annotation database (version 27) as the source for gene features (Kersey et al., 2010), which is based on PomBase (McDowall et al., 2015) and is implemented in the annmap core Bioconductor/R package (Gentleman et al., 2004; Yates et al., 2008). The database was used to derive the following: lists of genes, exons, proteins, and their chromosomal positions as well as transcript biotypes (e.g., protein-coding, ncRNA). Applying customized R and Perl scripts, these data were used to compute relative and absolute distances from centromere and telomeres. Similarly, these data were used to compute intron locations, intron number per gene, average intron length and total transcript length. The GC content of the first intron was computed using the ‘geecee’ function within the EMBOSS suite (Rice et al., 2000). The protein sequence data were downloaded from PomBase (McDowall et al., 2015), and protein features such as molecular weight, isoelectric point, charge, and number of amino acids were also calculated using the EMBOSS suite (pepstats function). Amino acid compositions were calculated using a customized Perl script. The fold-index for each fission yeast protein was computed using a modified Perl script available from http://bip.weizmann.ac.il/fldbin/findex (Prilusky et al., 2005). S. pombe GO annotations and the generic GO OBO flat file were downloaded from ftp://ftp.geneontology.org. A recursive algorithm was used to map genes to all corresponding ancestor terms in the ontology. Pfam domains (Finn et al., 2014) were retrieved from the xmapcore database (Yates et al., 2008). For phenotype mappings (Harris et al., 2013), we used the phenotype annotation ‘phaf’ file available from ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/Phenotype_annotations/phenotype_annotations.pombase.phaf.gz, and the FYPO OBO file available from https://cdn.rawgit.com/pombase/fypo/master/release/fypo.obo.
We only considered GO terms, Pfam domains and phenotypes that were associated with at least two genes. The manually curated human and budding yeast orthologs of fission yeast proteins (Wood, 2006) were retrieved from ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/orthologs/cerevisiae-orthologs.txt. Physical and genetic interaction data were downloaded from BioGRID (Breitkreutz et al., 2008) and processed using customized Perl scripts. All binary and metric data were combined into a single table using an R script (similar to Table 1) prior to conversion into Perl associative array data structures. Pairwise relationships were stored directly in Perl data structures.
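The recursive mapping of genes to ancestor terms, followed by the filter for terms with at least two genes, can be sketched as below. This Python version is illustrative only (the original was written in R). The toy ontology, term names, and gene names are hypothetical, and the sketch assumes an acyclic parent map without specifying which ontology relations are followed.

```python
from collections import defaultdict


def ancestors(term, parents, cache):
    """Return all ancestors of a term by recursing over its parent links (assumes a DAG)."""
    if term in cache:
        return cache[term]
    result = set()
    for parent in parents.get(term, ()):
        result.add(parent)
        result |= ancestors(parent, parents, cache)
    cache[term] = result
    return result


def propagate(direct_annotations, parents, min_genes=2):
    """Map each gene to its direct terms plus all ancestors; keep terms with >= min_genes genes."""
    term_to_genes = defaultdict(set)
    cache = {}
    for gene, terms in direct_annotations.items():
        for term in terms:
            term_to_genes[term].add(gene)
            for anc in ancestors(term, parents, cache):
                term_to_genes[anc].add(gene)
    return {t: g for t, g in term_to_genes.items() if len(g) >= min_genes}


# Hypothetical toy ontology: two leaf terms share one parent, which has a root.
parents = {"T:child1": ["T:mid"], "T:child2": ["T:mid"], "T:mid": ["T:root"]}
direct = {"geneA": ["T:child1"], "geneB": ["T:child2"]}
gene_sets = propagate(direct, parents)
```

In this example each leaf term has only one gene and is dropped, while `T:mid` and `T:root` each collect both genes through propagation and are retained.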

Implementation of Statistical Tests

All statistical tests and multiple testing correction functions were implemented in Perl. For Fisher’s exact test, the Text::NSP::Measures::2D::Fisher::twotailed module was used (available from http://search.cpan.org), where the 2 × 2 contingency table was constructed using the following values: (row1) genes found in input list ‘A’ and in gene set ‘G’, genes found in gene set ‘G’ but not in input list ‘A’, (row2) genes found in input list ‘A’ but not in gene set ‘G’ and genes not found in input list ‘A’ and not in gene set ‘G’.

The core of the Wilcoxon rank sum test implemented in Perl was adapted from http://www.fon.hum.uva.nl/rob/SignedRank/. In this script, a normal approximation with a continuity correction or an exact test is used, depending on the number of permutations (‘k’ out of ‘n’) and estimation of the p-value. AnGeLi displays a warning for small gene lists (below 10 genes), for permutations ≥2500 or for p ≥ 0.25. Genes with no values are ignored throughout.
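The normal-approximation branch of such a test can be sketched as follows. This Python illustration covers only the two-sided rank-sum test with a continuity correction. The exact-test branch and the warning logic of the original script are omitted, and the average ranks and tie correction are standard additions assumed here rather than details taken from the text.

```python
from math import erfc, sqrt


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, continuity-corrected)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return 1.0
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                        # pooled[i:j] is a block of tied values
        avg_rank = (i + 1 + j) / 2.0      # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0  # Mann-Whitney U statistic for sample x
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0                        # all values identical
    z = max((abs(u - mean_u) - 0.5) / sqrt(var_u), 0.0)
    return erfc(z / sqrt(2))              # equals 2 * (1 - Phi(z))
```

Here `x` would hold the values of a metric feature for the genes in the query list and `y` the values for the remaining background genes, with genes lacking a value removed beforehand.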

The pairwise permutation test repeatedly draws a random set of genes from a pool of genes not found in the query list; the number of permutations is set by the user, and the size of the random set is equal to the size of the query list. However, the pool of genes has to be at least twice as large as the query list; otherwise, AnGeLi will display a warning that the query list is too large and p-values cannot be computed. The running time of the permutation test is quadratic; pairwise analysis is therefore excluded by default and, if selected, the number of permutations defaults to 1000. The p-value is equal to the number of times the random set has a greater sum of interactions than the real set, divided by the total number of permutations and multiplied by 2 (i.e., pairwise). For example, in the best-case scenario, where the sum of random interactions equals 0 or 1 following 1000 permutations, the p-value will be equal to (1/1000) × 2 = 0.002. This relatively high p-value is unlikely to be significant following correction for multiple testing (7558 tests: 7505 binary, 49 metric, and 4 pairwise features), and a higher number of permutations should be set, at the expense of analysis time.
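The permutation procedure can be sketched as follows. This is a Python illustration under stated assumptions, not the Perl implementation: the function names are hypothetical, interactions are given as a list of gene pairs, and the count of exceedances is floored at 1, which is one reading of the worked example above (lowest attainable p-value of 0.002 at 1000 permutations).

```python
import random


def count_interactions(genes, pairs):
    """Number of interaction pairs whose two partners are both in the gene set."""
    genes = set(genes)
    return sum(1 for a, b in pairs if a in genes and b in genes)


def pairwise_permutation_p(query, pool, pairs, n_perm=1000, seed=0):
    """Permutation p-value for enrichment of interactions within the query list."""
    query = set(query)
    candidates = sorted(set(pool) - query)   # random sets exclude the query genes
    if len(candidates) < 2 * len(query):
        raise ValueError("query list too large relative to the gene pool")
    rng = random.Random(seed)                # fixed seed for reproducibility
    observed = count_interactions(query, pairs)
    exceed = 0
    for _ in range(n_perm):
        sample = rng.sample(candidates, len(query))
        if count_interactions(sample, pairs) > observed:
            exceed += 1
    # Assumption: floor the count at 1 so the p-value is never reported as 0
    return min(1.0, 2.0 * max(exceed, 1) / n_perm)
```

With this floor, the smallest attainable p-value is 2 / `n_perm`, which makes explicit why more permutations are needed before a pairwise result can survive multiple-testing correction.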

Conclusion

AnGeLi offers a unique and flexible statistical framework for the analysis of gene lists derived from S. pombe, using a rich catalog of annotated features, published information and gene sets that span multiple and diverse biological aspects. The analyses can be tailored to the query gene lists and enhanced by the addition of user-defined gene sets. With respect to published gene sets, the current content of AnGeLi’s database is somewhat arbitrary and far from complete. We encourage a community-wide effort to further increase the usefulness of AnGeLi by contributing additional published gene lists via its data submission feature. Such community submissions will enhance the visibility and citations of the papers reporting the submitted lists, and will help to unleash the full power of genomic data sets.

Author Contributions

DB, FS, and JB conceived the study. FS developed the prototype of AnGeLi and wrote the core Perl modules. DB extended its functionality, improved Perl-cgi scripts, and wrote all the R scripts needed for creation and update of AnGeLi’s database. DB also integrated multiple annotation resources, curated the majority of data features, and configured the web server. MO wrote the recursive R function needed for traversing the ontology graphs. SK and SD improved the user interface. GS helped fine-tune AnGeLi’s performance. VW helped in designing the tool and improving its functionality. VP wrote the scripts for retrieval of pairwise data type and amino acid composition. DB and JB wrote the manuscript.

Funding

This work was supported by a Wellcome Trust Senior Investigator Award (grant # 095598/Z/11/Z).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We thank Tristan Clark and David Gregory for their help with setting up the web server, and Juan Mata, Midori Harris, Antonia Lock, Brian Wilhelm, St John Townsend, Martin Převorovský, Rob de Bruin, Dan Jeffares, and all members of the Bähler laboratory for their constructive comments and continuous help with improving AnGeLi. We also thank Phoebe Tristram Churchill for designing the AnGeLi logo, and Dan Jeffares for the creation of a Google form for metric dataset submission.

Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fgene.2015.00330

Abbreviations

AnGeLi, Analysis of Gene Lists; BioGRID, Biological General Repository for Interaction Datasets; EMBOSS, European Molecular Biology Open Software Suite; FDR, False Discovery Rate; FYPO, Fission Yeast Phenotype Ontology; GO, Gene Ontology; GSEA, Gene Set Enrichment Analysis; Pfam, Protein Families.

References

Amorim, M. J., Cotobal, C., Duncan, C., and Mata, J. (2010). Global coordination of transcriptional control and mRNA decay during cellular differentiation. Mol. Syst. Biol. 6, 380. doi: 10.1038/msb.2010.38

Araki, H., Knapp, C., Tsai, P., and Print, C. (2012). GeneSetDB: a comprehensive meta-database, statistical and visualisation framework for gene set analysis. FEBS Open Bio 2, 76–82. doi: 10.1016/j.fob.2012.04.003

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29. doi: 10.1038/75556

Beilharz, T. H., and Preiss, T. (2007). Widespread use of poly(A) tail length control to accentuate expression of the yeast transcriptome. RNA 13, 982–997. doi: 10.1261/rna.569407

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300.

Bitton, D. A., Atkinson, S. R., Rallis, C., Smith, G. C., Ellis, D. A., Chen, Y. Y., et al. (2015). Widespread exon skipping triggers degradation by nuclear RNA surveillance in fission yeast. Genome Res. 25, 884–896. doi: 10.1101/gr.185371.114

Bitton, D. A., Grallert, A., Scutt, P. J., Yates, T., Li, Y., Bradford, J. R., et al. (2011). Programmed fluctuations in sense/antisense transcript ratios drive sexual differentiation in S. pombe. Mol. Syst. Biol. 7, 559. doi: 10.1038/msb.2011.90

Bitton, D. A., Rallis, C., Jeffares, D. C., Smith, G. C., Chen, Y. Y., Codlin, S., et al. (2014). LaSSO, a strategy for genome-wide mapping of intronic lariats and branch points using RNA-seq. Genome Res. 24, 1169–1179. doi: 10.1101/gr.166819.113

Boyle, E. I., Weng, S., Gollub, J., Jin, H., Botstein, D., Cherry, J. M., et al. (2004). GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20, 3710–3715. doi: 10.1093/bioinformatics/bth456

Breitkreutz, B. J., Stark, C., Reguly, T., Boucher, L., Breitkreutz, A., Livstone, M., et al. (2008). The BioGRID interaction database: 2008 update. Nucleic Acids Res. 36, D637–D640. doi: 10.1093/nar/gkm1001

Carbon, S., Ireland, A., Mungall, C. J., Shu, S., Marshall, B., and Lewis, S. (2009). AmiGO: online access to ontology and annotation data. Bioinformatics 25, 288–289. doi: 10.1093/bioinformatics/btn615

Carmona-Saez, P., Chagoyen, M., Tirado, F., Carazo, J. M., and Pascual-Montano, A. (2007). GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists. Genome Biol. 8, R3. doi: 10.1186/gb-2007-8-1-r3

Chen, D., Toone, W. M., Mata, J., Lyne, R., Burns, G., Kivinen, K., et al. (2003). Global transcriptional responses of fission yeast to environmental stress. Mol. Biol. Cell 14, 214–229. doi: 10.1091/mbc.E02-08-0499

Chen, D., Wilkinson, C. R., Watt, S., Penkett, C. J., Toone, W. M., Jones, N., et al. (2008). Multiple pathways differentially regulate global oxidative stress responses in fission yeast. Mol. Biol. Cell 19, 308–317. doi: 10.1091/mbc.E07-08-0735

Chen, J. S., Beckley, J. R., McDonald, N. A., Ren, L., Mangione, M., Jang, S. J., et al. (2015). Identification of new players in cell division, DNA damage response, and morphogenesis through construction of Schizosaccharomyces pombe deletion strains. G3 5, 361–370. doi: 10.1534/g3.114.015701

Christiano, R., Nagaraj, N., Frohlich, F., and Walther, T. C. (2014). Global proteome turnover analyses of the yeasts S. cerevisiae and S. pombe. Cell Rep. 9, 1959–1965. doi: 10.1016/j.celrep.2014.10.065

Cotobal, C., Rodriguez-Lopez, M., Duncan, C., Hasan, A., Yamashita, A., Yamamoto, M., et al. (2015). Role of Ccr4-Not complex in heterochromatin formation at meiotic genes and subtelomeres in fission yeast. Epigenetics Chromatin 8, 28. doi: 10.1186/s13072-015-0018-4

Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., et al. (2014). The Reactome pathway knowledgebase. Nucleic Acids Res. 42, D472–D477. doi: 10.1093/nar/gkt1102

De Groot, P. W., Hellingwerf, K. J., and Klis, F. M. (2003). Genome-wide identification of fungal GPI proteins. Yeast 20, 781–796. doi: 10.1002/yea.1007

Finn, R. D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R. Y., Eddy, S. R., et al. (2014). Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230. doi: 10.1093/nar/gkt1223

Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., et al. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80. doi: 10.1186/gb-2004-5-10-r80

Good, P. (2000). “Theory of permutation tests,” in Permutation Tests, ed. P. Good (New York, NY: Springer), 201–214.

Gsponer, J., Futschik, M. E., Teichmann, S. A., and Babu, M. M. (2008). Tight regulation of unstructured proteins: from transcript synthesis to protein degradation. Science 322, 1365–1368. doi: 10.1126/science.1163581

Harris, M. A., Lock, A., Bähler, J., Oliver, S. G., and Wood, V. (2013). FYPO: the fission yeast phenotype ontology. Bioinformatics 29, 1671–1678. doi: 10.1093/bioinformatics/btt266

Hasan, A., Cotobal, C., Duncan, C. D., and Mata, J. (2014). Systematic analysis of the role of RNA-binding proteins in the regulation of RNA stability. PLoS Genet. 10:e1004684. doi: 10.1371/journal.pgen.1004684

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70.

Huang da, W., Sherman, B. T., and Lempicki, R. A. (2009). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57. doi: 10.1038/nprot.2008.211

Jeffares, D. C., Rallis, C., Rieux, A., Speed, D., Prevorovsky, M., Mourier, T., et al. (2015). The genomic and phenotypic diversity of Schizosaccharomyces pombe. Nat. Genet. 47, 235–241. doi: 10.1038/ng.3215

Kalderimis, A., Lyne, R., Butano, D., Contrino, S., Lyne, M., Heimbach, J., et al. (2014). InterMine: extensive web services for modern biology. Nucleic Acids Res. 42, W468–W472. doi: 10.1093/nar/gku301

Kanehisa, M., and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30. doi: 10.1093/nar/28.1.27

Kelder, T., van Iersel, M. P., Hanspers, K., Kutmon, M., Conklin, B. R., Evelo, C. T., et al. (2012). WikiPathways: building research communities on biological pathways. Nucleic Acids Res. 40, D1301–D1307. doi: 10.1093/nar/gkr1074

Kersey, P. J., Lawson, D., Birney, E., Derwent, P. S., Haimel, M., Herrero, J., et al. (2010). Ensembl Genomes: extending Ensembl across the taxonomic space. Nucleic Acids Res. 38, D563–D569. doi: 10.1093/nar/gkp871

Kim, D. U., Hayles, J., Kim, D., Wood, V., Park, H. O., Won, M., et al. (2010). Analysis of a genome-wide set of gene deletions in the fission yeast Schizosaccharomyces pombe. Nat. Biotechnol. 28, 617–623. doi: 10.1038/nbt.1628

Lackner, D. H., Beilharz, T. H., Marguerat, S., Mata, J., Watt, S., Schubert, F., et al. (2007). A network of multiple regulatory layers shapes gene expression in fission yeast. Mol. Cell 26, 145–155. doi: 10.1016/j.molcel.2007.03.002

Lemieux, C., Marguerat, S., Lafontaine, J., Barbezier, N., Bähler, J., and Bachand, F. (2011). A Pre-mRNA degradation pathway that selectively targets intron-containing genes requires the nuclear poly(A)-binding protein. Mol. Cell 44, 108–119. doi: 10.1016/j.molcel.2011.06.035

Marguerat, S., Jensen, T. S., de Lichtenberg, U., Wilhelm, B. T., Jensen, L. J., and Bähler, J. (2006). The more the merrier: comparative analysis of microarray studies on cell cycle-regulated genes in fission yeast. Yeast 23, 261–277. doi: 10.1002/yea.1351

Marguerat, S., Schmidt, A., Codlin, S., Chen, W., Aebersold, R., and Bähler, J. (2012). Quantitative analysis of fission yeast transcriptomes and proteomes in proliferating and quiescent cells. Cell 151, 671–683. doi: 10.1016/j.cell.2012.09.019

Mata, J., and Bähler, J. (2006). Global roles of Ste11p, cell type, and pheromone in the control of gene expression during early sexual differentiation in fission yeast. Proc. Natl. Acad. Sci. U.S.A. 103, 15517–15522. doi: 10.1073/pnas.0603403103

Mata, J., Lyne, R., Burns, G., and Bähler, J. (2002). The transcriptional program of meiosis and sporulation in fission yeast. Nat. Genet. 32, 143–147. doi: 10.1038/ng951

Mata, J., Wilbrey, A., and Bähler, J. (2007). Transcriptional regulatory network for sexual differentiation in fission yeast. Genome Biol. 8, R217. doi: 10.1186/gb-2007-8-10-r217

Matsuyama, A., Arai, R., Yashiroda, Y., Shirai, A., Kamata, A., Sekido, S., et al. (2006). ORFeome cloning and global analysis of protein localization in the fission yeast Schizosaccharomyces pombe. Nat. Biotechnol. 24, 841–847. doi: 10.1038/nbt1222

McDowall, M. D., Harris, M. A., Lock, A., Rutherford, K., Staines, D. M., Bähler, J., et al. (2015). PomBase 2015: updates to the fission yeast database. Nucleic Acids Res. 43, D656–D661. doi: 10.1093/nar/gku1040

Mi, H., Muruganujan, A., and Thomas, P. D. (2013). PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 41, D377–D386. doi: 10.1093/nar/gks1118

Miyasaka, H. (2002). Translation initiation AUG context varies with codon usage bias and gene length in Drosophila melanogaster. J. Mol. Evol. 55, 52–64. doi: 10.1007/s00239-001-0090-1

Nikolsky, Y., and Bryant, J. (2009). “Protein networks and pathway analysis,” in Methods in Molecular Biology, Vol. 563, eds Y. Nikolsky and J. Bryant (New York City, NY: Humana Press).

Pancaldi, V., Schubert, F., and Bähler, J. (2010). Meta-analysis of genome regulation and expression variability across hundreds of environmental and genetic perturbations in fission yeast. Mol. Biosyst. 6, 543–552. doi: 10.1039/b913876p

Prilusky, J., Felder, C. E., Zeev-Ben-Mordehai, T., Rydberg, E. H., Man, O., Beckmann, J. S., et al. (2005). FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics 21, 3435–3438. doi: 10.1093/bioinformatics/bti537

Rallis, C., Codlin, S., and Bähler, J. (2013). TORC1 signaling inhibition by rapamycin and caffeine affect lifespan, global gene expression, and cell proliferation of fission yeast. Aging Cell 12, 563–573. doi: 10.1111/acel.12080

Rallis, C., Lopez-Maury, L., Georgescu, T., Pancaldi, V., and Bähler, J. (2014). Systematic screen for mutants resistant to TORC1 inhibition in fission yeast reveals genes involved in cellular ageing and growth. Biol. Open 3, 161–171. doi: 10.1242/bio.20147245

Rice, P., Longden, I., and Bleasby, A. (2000). EMBOSS: the European molecular biology open software suite. Trends Genet. 16, 276–277. doi: 10.1016/S0168-9525(00)02024-2

Rivals, I., Personnaz, L., Taing, L., and Potier, M. C. (2007). Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 23, 401–407. doi: 10.1093/bioinformatics/btl633

Rustici, G., Mata, J., Kivinen, K., Lio, P., Penkett, C. J., Burns, G., et al. (2004). Periodic gene expression program of the fission yeast cell cycle. Nat. Genet. 36, 809–817. doi: 10.1038/ng1377

Rutherford, K. M., Harris, M. A., Lock, A., Oliver, S. G., and Wood, V. (2014). Canto: an online tool for community literature curation. Bioinformatics 30, 1791–1792. doi: 10.1093/bioinformatics/btu103

Shaffer, J. (1995). Multiple hypothesis testing. Annu. Rev. Psychol. 46, 561–584. doi: 10.1146/annurev.psych.46.1.561

Sideri, T., Rallis, C., Bitton, D. A., Lages, B. M., Suo, F., Rodriguez-Lopez, M., et al. (2014). Parallel profiling of fission yeast deletion mutants for proliferation and for lifespan during long-term quiescence. G3 5, 145–155. doi: 10.1534/g3.114.014415

Stark, C., Breitkreutz, B. J., Reguly, T., Boucher, L., Breitkreutz, A., and Tyers, M. (2006). BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 34, D535–D539. doi: 10.1093/nar/gkj109

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550. doi: 10.1073/pnas.0506580102

Tanay, A., Regev, A., and Shamir, R. (2005). Conservation and evolvability in regulatory networks: the evolution of ribosomal regulation in yeast. Proc. Natl. Acad. Sci. U.S.A. 102, 7203–7208. doi: 10.1073/pnas.0502521102

Tanny, J. C., Erdjument-Bromage, H., Tempst, P., and Allis, C. D. (2007). Ubiquitylation of histone H2B controls RNA polymerase II transcription elongation independently of histone H3 methylation. Genes Dev. 21, 835–847. doi: 10.1101/gad.1516207

Wood, V. (2006). “Schizosaccharomyces pombe comparative genomics; from sequence to systems,” in Comparative Genomics, eds P. Sunnerhagen and J. Piskur (Heidelberg: Springer), 233–285.

Wood, V., Gwilliam, R., Rajandream, M. A., Lyne, M., Lyne, R., Stewart, A., et al. (2002). The genome sequence of Schizosaccharomyces pombe. Nature 415, 871–880. doi: 10.1038/nature724

Wood, V., Harris, M. A., McDowall, M. D., Rutherford, K., Vaughan, B. W., Staines, D. M., et al. (2012). PomBase: a comprehensive online resource for fission yeast. Nucleic Acids Res. 40, D695–D699. doi: 10.1093/nar/gkr853

Yates, T., Okoniewski, M. J., and Miller, C. J. (2008). X:Map: annotation and visualization of genome structure for Affymetrix exon array analysis. Nucleic Acids Res. 36, D780–D786. doi: 10.1093/nar/gkm779

Zhang, B., Kirov, S., and Snoddy, J. (2005). WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 33, W741–W748. doi: 10.1093/nar/gki475

Keywords: gene cluster, ontology, S. pombe, PomBase, data mining, database, large-scale assay, genetic screen

Citation: Bitton DA, Schubert F, Dey S, Okoniewski M, Smith GC, Khadayate S, Pancaldi V, Wood V and Bähler J (2015) AnGeLi: A Tool for the Analysis of Gene Lists from Fission Yeast. Front. Genet. 6:330. doi: 10.3389/fgene.2015.00330

Received: 09 September 2015; Accepted: 30 October 2015;
Published: 16 November 2015.

Edited by:

Zhen Su, China Agricultural University, China

Reviewed by:

Yijing Zhang, Chinese Academy of Sciences, China
Zhenyan Miao, Purdue University, USA

Copyright © 2015 Bitton, Schubert, Dey, Okoniewski, Smith, Khadayate, Pancaldi, Wood and Bähler. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jürg Bähler, j.bahler@ucl.ac.uk

^†Present address: Sanjay Khadayate, Imperial College London, Medical Research Council Clinical Sciences Centre, London, UK; Vera Pancaldi, Spanish National Cancer Research Centre, Madrid, Spain

^‡These authors have contributed equally to this work.
