Top R Packages for GO Enrichment Analysis: topGO vs globaltest Explained

1. Introduction

Gene Ontology (GO) enrichment analysis is a cornerstone of gene expression studies. It helps researchers identify biological processes, molecular functions, and cellular components that are overrepresented in a set of genes, offering insights into the underlying biology. Several statistical methods can be used for GO enrichment analysis, including Fisher’s exact test, the Kolmogorov-Smirnov test, and the global test.

This article aims to simplify the process of choosing an appropriate R package for GO enrichment analysis by introducing two popular Bioconductor packages: topGO and globaltest. While both packages are widely used, their distinct statistical methods and outputs can make it challenging to choose the right tool for your study. This guide compares these packages, highlights their differences, and provides practical examples to help researchers, especially those with busy lab schedules, efficiently integrate GO analysis into their workflows.

2. Overview of Bioconductor Packages

topGO

topGO is a versatile R package for GO enrichment analysis, well-suited for identifying specific GO terms that are enriched among differentially expressed genes. It primarily employs statistical methods like Fisher’s exact test and the Kolmogorov-Smirnov test, making it a reliable choice for detecting overrepresentation in gene lists. With its robust functionality and detailed documentation, topGO is a go-to tool for exploring gene-level associations in various biological datasets.

globaltest

globaltest takes a different approach by assessing whether specific GO terms are associated with clinical outcomes or other response variables, such as a binary class label, a continuous measurement, or survival time. It uses the global test methodology, which tests all genes annotated to a term jointly rather than one gene at a time. This makes it particularly valuable for studies where the research question involves linking GO terms to phenotypic data, such as disease progression or treatment response.

Key Differences in Statistical Approaches

Both packages are actively maintained in the Bioconductor repository and come with comprehensive documentation. However, their underlying statistical methods set them apart:

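• topGO: Uses Fisher’s exact test or the Kolmogorov-Smirnov test to test, for each GO term, the null hypothesis that the term is not enriched among the genes of interest. Fisher’s exact test works on a list of selected genes, while the Kolmogorov-Smirnov test works on the scores of all genes.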

• globaltest: Employs the global test to evaluate the null hypothesis that no association exists between a set of genes and a clinical outcome or phenotype.
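
To make the score-based Kolmogorov-Smirnov approach concrete, the following base-R sketch applies a one-sided test to made-up gene scores. The scores, the term membership, and all variable names are illustrative assumptions; topGO itself also handles the GO graph and the annotation, which this sketch ignores.

```r
# Made-up p-values for 200 genes, evenly spaced so the example is deterministic
scores <- seq(0.001, 1, length.out = 200)
names(scores) <- paste0("gene", seq_along(scores))

# Hypothetical GO term annotating 10 genes, 8 of which have the smallest p-values
in_term <- seq_along(scores) %in% c(1:8, 50, 120)

# One-sided KS test: are the scores inside the term shifted towards small values?
# alternative = "greater" means the CDF of the first sample lies above that of the second
ks_res <- ks.test(scores[in_term], scores[!in_term], alternative = "greater")
ks_res$p.value
```

No significance cutoff is needed here: every gene contributes through its score, which is the main practical difference from Fisher’s exact test.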

Use Case Comparison

• topGO is ideal for researchers seeking to uncover enriched biological processes in differentially expressed genes.

• globaltest is better suited for studies focused on linking GO terms to clinical or phenotypic outcomes, such as identifying functional pathways associated with disease progression.

By understanding these distinctions, researchers can choose the package that best aligns with their study objectives.

3. Setting Up the Environment

Setting up includes ensuring the required packages are installed and loaded.

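# Install the required packages (BiocManager itself comes from CRAN)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("globaltest", "topGO", "golubEsets", "vsn", "hu6800.db", "GO.db", "AnnotationDbi", "annotate", "ALL", "Biobase", "limma", "dplyr"))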

# Load the required packages
library(topGO)
library(globaltest)
library(golubEsets) 
library(vsn)
library(hu6800.db)
library(AnnotationDbi)
library(methods)
library(annotate)
library(limma) 
library(dplyr)

4. Data Preparation

The example uses the Golub leukemia training set and normalizes it with the vsn package.

# Load the Golub training data set consisting of 7129 genes and 38 samples (27 ALL and 11 AML)
data(Golub_Train, package = "golubEsets")  

## Normalize the data using the VSN package
Golub_Train_VSN <- vsn::vsn2(exprs(Golub_Train)) 

Gene expression data before the normalization process:

A matrix: 5 x 10 of type int

	1	2	3	4	5	6	7	8	9	10
AFFX-BioB-5_at	-214	-139	-76	-135	-106	-138	-72	-413	5	-88
AFFX-BioB-M_at	-153	-73	-49	-114	-125	-85	-144	-260	-127	-105
AFFX-BioB-3_at	-58	-1	-307	265	-76	215	238	7	106	42
AFFX-BioC-5_at	88	283	309	12	168	71	55	-2	268	219
AFFX-BioC-3_at	-295	-264	-376	-419	-230	-272	-399	-541	-210	-178

Gene expression data after the normalization process:

A matrix: 5 x 10 of type dbl

	1	2	3	4	5	6	7	8	9	10
AFFX-BioB-5_at	5.053873	5.396673	5.972362	5.549766	5.337167	5.411235	5.968888	4.616873	6.396420	5.615805
AFFX-BioB-M_at	5.364311	5.838160	6.128136	5.678733	5.212093	5.797844	5.508008	5.178958	5.605328	5.480476
AFFX-BioB-3_at	5.948439	6.391596	4.906233	8.079299	5.550368	8.071159	7.967052	6.646939	7.033943	6.825375
AFFX-BioC-5_at	6.988237	8.224862	8.003817	6.558152	7.593938	7.118814	6.884934	6.591159	7.872088	8.162249
AFFX-BioC-3_at	4.707586	4.747971	4.672928	4.322134	4.638834	4.665432	4.390771	4.257310	5.197148	4.978863
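
The raw intensities above include negative values, which an ordinary log2 transform cannot handle. VSN instead fits a calibrated arsinh-type (generalized log) transformation. The snippet below only illustrates why arsinh copes with negative input; it applies an uncalibrated asinh to a few raw values from the first table and will not reproduce the normalized values shown above.

```r
# A few raw intensities copied from the first table (probe AFFX-BioB-5_at)
raw <- c(-214, -139, -76, 5)

# log2 is undefined for negative numbers (NaN, normally with a warning)
log_vals <- suppressWarnings(log2(raw))

# asinh is defined everywhere and behaves like a log for large positive values
asinh_vals <- asinh(raw)

data.frame(raw, log2 = log_vals, asinh = asinh_vals)
```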

5. GO Enrichment Analysis with topGO

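The topGOdata object bundles the gene scores, the rule for selecting significant genes, and the GO annotation; runTest then applies the chosen algorithm and test statistic.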
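
The topGOdata call that follows uses three objects that are never defined in this article: pvalues, topDiffGenes, and affyLib. As a minimal sketch of their expected shape, assuming made-up p-values and a 0.01 cutoff (both are illustrative choices, not the values behind the results reported later), they could look like this. In a real analysis the p-values would come from a differential expression test between the ALL and AML samples, for example with limma.

```r
# Named numeric vector: one score per probe, names are probe IDs
# (probe IDs taken from the tables above, p-values invented for illustration)
pvalues <- c("AFFX-BioB-5_at" = 0.200,
             "AFFX-BioB-M_at" = 0.004,
             "AFFX-BioB-3_at" = 0.650,
             "AFFX-BioC-5_at" = 0.008,
             "AFFX-BioC-3_at" = 0.030)

# Selection function: takes the score vector, returns TRUE for "significant" genes
# (the 0.01 cutoff is an assumption)
topDiffGenes <- function(allScore) allScore < 0.01

# Name of the chip annotation package; "hu6800.db" is assumed because it is loaded above
affyLib <- "hu6800.db"

table(topDiffGenes(pvalues))
```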

# Create a topGOdata object 
sampleGOdata <- new("topGOdata", description = "Simple session", ontology = "BP", allGenes = pvalues, geneSel = topDiffGenes, nodeSize = 10, annot = annFUN.db, affyLib = affyLib) 

# Run GO enrichment analysis with Fisher's exact test 
resultFisher <- runTest(sampleGOdata, algorithm = "classic", statistic = "fisher") 

Display summary of results:

# Display the results 
resultFisher 

Description: Simple session 
Ontology: BP 
'classic' algorithm with the 'fisher' test
5431 GO terms scored: 302 terms with p < 0.01
Annotation data:
    Annotated genes: 6234 
    Significant genes: 1512 
    Min. no. of genes annotated to a GO: 10 
    Nontrivial nodes: 5341 

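The table below lists the top 10 GO terms, ordered by the elimKS p-value. The classicKS and elimKS columns come from additional Kolmogorov-Smirnov runs that are not shown in the code above.

	GO.ID	Term	Annotated	Significant	Expected	Rank in classicFisher	classicFisher	classicKS	elimKS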
1	GO:0010042	response to manganese ion	18	14	4.37	10	2.6e-06	2.1e-06	2.1e-06
2	GO:0000002	mitochondrial genome maintenance	13	8	3.15	199	0.00458	0.00015	0.00015
3	GO:0044539	long-chain fatty acid import into cell	11	8	2.67	88	0.00095	0.00017	0.00017
4	GO:0070198	protein localization to chromosome, telo…	20	13	4.85	43	0.00013	0.00018	0.00018
5	GO:0071897	DNA biosynthetic process	109	41	26.44	101	0.00118	1.1e-05	0.00029
6	GO:0045429	positive regulation of nitric oxide bios…	32	18	7.76	42	0.00010	0.00033	0.00033
7	GO:0016570	histone modification	44	16	10.67	694	0.04846	0.00053	0.00053
8	GO:0098869	cellular oxidant detoxification	70	28	16.98	142	0.00243	0.00056	0.00056
9	GO:0045820	negative regulation of glycolytic proces…	11	9	2.67	39	9.6e-05	0.00056	0.00056
10	GO:1903241	U2-type prespliceosome assembly	15	9	3.64	168	0.00332	0.00057	0.00057
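
The Expected and classicFisher columns can be checked by hand for a single term. The sketch below rebuilds the 2 x 2 table for GO:0010042 from the counts reported above (6234 annotated genes, 1512 significant genes, 18 genes annotated to the term, 14 of them significant) and runs a one-sided Fisher’s exact test in base R. It assumes the classic algorithm uses exactly this contingency table, so small differences from the reported p-value are possible.

```r
n_all  <- 6234   # annotated genes (gene universe)
n_sig  <- 1512   # significant genes
n_term <- 18     # genes annotated to GO:0010042
n_hit  <- 14     # significant genes annotated to GO:0010042

# Expected number of significant genes in the term under the null hypothesis
expected <- n_term * n_sig / n_all

# Rows: significant / not significant; columns: in term / not in term
tab <- matrix(c(n_hit,          n_sig - n_hit,
                n_term - n_hit, n_all - n_sig - (n_term - n_hit)),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("significant", "not significant"),
                              c("in term", "not in term")))

# One-sided test for overrepresentation of the term among significant genes
fisher_res <- fisher.test(tab, alternative = "greater")

round(expected, 2)   # 4.37, matching the Expected column
fisher_res$p.value
```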

The GO topology graph for the top 5 enriched GO terms is shown below:

6. Analysis with globaltest

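# Global test of association between the ALL/AML label and all genes at once
global_test_result <- globaltest::gt(response = ALL.AML, alternative = Golub_Train)

# Global test per GO term (Biological Process), with Benjamini-Hochberg adjusted p-values
res <- globaltest::gtGO(ALL.AML, Golub_Train, ontology = "BP", annotation = "hu6800.db", multtest = "BH")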

	GO	alias	BH
1	GO:0006979	response to oxidative stress	5.937219e-09
2	GO:0062197	cellular response to chemical stress	5.937219e-09
3	GO:0034599	cellular response to oxidative stress	5.937219e-09
4	GO:0034614	cellular response to reactive oxygen species	1.479059e-08
5	GO:0009628	response to abiotic stimulus	2.469754e-08
6	GO:0000302	response to reactive oxygen species	2.680335e-08
7	GO:0019932	second-messenger-mediated signaling	1.030937e-07
8	GO:0019722	calcium-mediated signaling	1.080573e-07
9	GO:0050921	positive regulation of chemotaxis	1.080573e-07
10	GO:0003013	circulatory system process	1.342398e-07
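
The idea behind the global test, a single statistic for the whole gene set instead of one test per gene, can be illustrated with a small permutation sketch in base R. This is a simplified stand-in under stated assumptions (simulated data, an unscaled score-type statistic, a permutation null distribution, and a hypothetical helper named gt_stat); it is not the globaltest implementation, which computes its statistic and p-value differently in detail.

```r
# Simulated toy data (not the Golub data): 20 samples, 5 genes in one gene set
set.seed(42)
n <- 20
y <- rep(c(0, 1), each = n / 2)            # binary phenotype, e.g. two disease classes
X <- matrix(rnorm(n * 5), nrow = n)        # rows are samples, columns are genes
X[y == 1, 1:3] <- X[y == 1, 1:3] + 1.5     # three genes shift with the phenotype

# Score-type statistic: sum over genes of the squared covariance-like term
# between each centred gene and the centred response
gt_stat <- function(X, y) {
  r <- y - mean(y)
  sum(crossprod(scale(X, scale = FALSE), r)^2)
}

observed <- gt_stat(X, y)

# Permutation null: shuffle the phenotype labels and recompute the statistic
perm <- replicate(999, gt_stat(X, sample(y)))
p_perm <- (1 + sum(perm >= observed)) / (1 + length(perm))
p_perm
```

Because the statistic pools all genes in the set, many small coordinated shifts can yield a small p-value even when no single gene would pass a per-gene threshold.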

7. Comparing Results

The results of GO enrichment analysis using topGO and globaltest provide distinct yet informative insights into the biological processes underlying the data.

Key Results and Differences:

• Significant GO Terms:

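• topGO scored 5431 GO terms in the Biological Process (BP) ontology with Fisher’s exact test, of which 302 had an unadjusted p-value below 0.01. The globaltest table lists only the 10 top-ranked terms, all with BH-adjusted p-values below 1.4e-07, including “response to oxidative stress” (GO:0006979) and “cellular response to reactive oxygen species” (GO:0034614). Because one number is a count of unadjusted p-values and the other is a top-10 excerpt, the two are not directly comparable.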

• Methodological Nuances:

• topGO excels in leveraging the GO hierarchy, particularly with algorithms like “elim” to minimize redundancy. This approach is effective for capturing nuanced biological pathways.

• globaltest uses a multivariate approach that evaluates the overall association between gene sets and outcomes, making it particularly sensitive to systemic biological patterns.

• Type of Questions Addressed:

• topGO is better suited for researchers aiming to understand specific enriched processes while accounting for the GO graph topology.

• globaltest is ideal for broader hypotheses, such as assessing the overall contribution of a gene set to a phenotype.
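
When comparing the two outputs, it helps to put the p-values on the same scale: the topGO summary counts terms with unadjusted p < 0.01, while the gtGO call requests Benjamini-Hochberg (BH) adjusted values. A minimal sketch of the adjustment with base R’s p.adjust, using made-up p-values:

```r
# Made-up unadjusted p-values for five GO terms
raw_p <- c(0.001, 0.008, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
bh_p <- p.adjust(raw_p, method = "BH")

# Fewer terms pass the same 0.01 threshold after adjustment
c(unadjusted = sum(raw_p < 0.01), adjusted = sum(bh_p < 0.01))
```

With topGO, the unadjusted p-values can be extracted with score(resultFisher) and passed to p.adjust in the same way.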

8. Discussion

The results from topGO and globaltest are largely complementary rather than competitive. Each package offers a unique lens through which to interpret the data:

• Complementary Insights:

• topGO provides granular details about localized processes within the GO hierarchy.

• globaltest identifies overarching biological associations, highlighting system-wide trends.

• Scenarios for Combining Insights:

• Combining topGO’s hierarchical insights with globaltest’s systemic view can help uncover both specific mechanisms and broader biological themes, particularly for complex datasets.

• For example, a researcher could first use topGO to pinpoint key pathways and then employ globaltest to evaluate their aggregate impact.

• Limitations and Considerations:

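• topGO’s classic algorithm tests each GO term independently and ignores the dependencies between parent and child terms, which may oversimplify relationships in some contexts; the elim algorithm is designed to reduce this redundancy. Fisher’s exact test also depends on the cutoff used to select significant genes.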

• globaltest may lose specificity in its focus on global patterns, potentially overlooking individual pathway nuances.

• Computational efficiency may also differ: topGO’s hierarchical algorithms may demand more preprocessing, while globaltest benefits from a simpler multivariate setup.

9. Conclusion

This comparative analysis demonstrates that both topGO and globaltest offer valuable but distinct approaches to GO enrichment analysis:

Practical Takeaways:

• topGO is preferred for detailed pathway analysis, especially for datasets with hierarchical biological information.

• globaltest excels in scenarios requiring a holistic assessment of gene set relevance to phenotypes.

Encouragement for Beginners:

New researchers are encouraged to explore both packages to deepen their understanding of GO enrichment methods. Experimenting with these tools fosters a comprehensive grasp of the biological and statistical nuances critical for robust bioinformatics analysis.