Background¶
Power analysis is an essential tool for clinical and biomedical researchers, helping them determine the minimal sample size required for their studies. By guiding experimental design and statistical analysis, power analysis ensures that studies are both robust and efficient. While conventional power analysis methods have been well-established for classical hypothesis tests, they often struggle with the multi-dimensional nature of omics data.
This blog post introduces the Bioconductor R package SSPA, which offers an advanced approach to power analysis for omics studies. We will explore the conceptual methodologies behind the package and showcase its core functionalities through two case studies—one involving microarray data and another focusing on metabolomics data.
Minimal Statistical Concepts¶
Before diving into the complexities of omics studies, it’s helpful to refresh a few key statistical concepts that will aid in understanding the approach discussed here.
- Power of a test: In hypothesis testing, there are two types of errors: Type I and Type II. A Type I error occurs when the null hypothesis is rejected when it is actually true (false positive), while a Type II error occurs when the null hypothesis is not rejected when the alternative hypothesis is true (false negative). The power of a test is the probability of correctly rejecting the null hypothesis when it is false, i.e., the ability to detect a true effect.
- Effect size: Effect size quantifies the magnitude of the difference between two groups, often expressed as a function of the difference in means and the variability (standard deviation) within the groups. It is a crucial measure of the practical significance of the findings. There are various ways to calculate effect size depending on the type of outcome variable (e.g., Cohen’s d for continuous variables).
- Test statistics: Test statistics measure the difference between the observed data and the null hypothesis. They help determine whether the observed effect is statistically significant. The test statistic is compared to a critical value or used to calculate a p-value, which indicates the likelihood of observing the data under the null hypothesis.
- False discovery rate (FDR): The significance level (often set at 5%) indicates the probability of making a Type I error (false positive) in a single hypothesis test. However, when multiple tests are performed simultaneously, the likelihood of at least one false positive increases. The false discovery rate (FDR) is a method for controlling the expected proportion of false positives among all significant results. It is an important consideration when performing multiple hypothesis tests, such as in omics studies.
Special Consideration to Complex Omics Data¶
In statistical analysis, the power of a test is influenced by three key factors: the effect size, the significance level, and the sample size. Typically, the significance level is set at 5%, and the sample size is often determined based on the available data. However, the effect size can vary depending on the nature of the dataset and the underlying biological phenomena.
Omics datasets, which can include data from genomics, proteomics, or metabolomics, often present unique challenges. These datasets typically contain tens of thousands of features, yet the number of available samples remains relatively small. This imbalance between the number of features and samples results in large variability in both effect size and variance. As a result, traditional power analysis methods may not be directly applicable, and the average power of the test is a better metric to use rather than a single value of power.
Additionally, when performing hypothesis testing on omics data, significance levels need to account for multiple comparisons. Given the high number of features being tested simultaneously, traditional raw p-values may lead to inflated false-positive rates. Instead, methods like the False Discovery Rate (FDR) are commonly employed to adjust for multiple testing, ensuring that the findings are more reliable.
One of the most challenging aspects of power analysis in omics data is determining the appropriate effect size. While effect size can sometimes be estimated by mining literature, this approach is not always feasible, especially when dealing with multidimensional omics data. In such cases, an alternative approach is to use Monte Carlo simulations based on pilot data. This method can simulate a range of scenarios and generate more accurate estimates of effect size, providing a more robust basis for power analysis in complex omics datasets.
Introduction of SSPA¶
The SSPA package in R offers a powerful tool for conducting simulation-based power analysis and estimating sample sizes, particularly for omics datasets like gene expression data. SSPA, which stands for General Sample Size and Power Analysis for Microarray and Next-Generation Sequencing, was developed by Maarten van Iterson in 2009 and was initially published in the Bioconductor repository (version 3.12).
One of the key features of SSPA is its ability to simulate scenarios based on pilot data, allowing researchers to estimate both the power of a test and the sample sizes needed to detect specific effects in omics data. This is particularly useful when dealing with complex datasets, where determining an appropriate effect size can be challenging.
While the package offers valuable functionality for power analysis, it should be noted that SSPA has been removed from the latest version of Bioconductor for reasons that remain unclear. Nevertheless, for those able to access the package, it provides a reliable approach to power estimation in high-dimensional data, especially in the context of gene expression profiling.
For a deeper understanding of the mathematical foundations behind SSPA, the paper “Relative power and sample size analysis on gene expression profiling data” offers a detailed explanation of the methods and algorithms employed by the package. This paper is a valuable resource for anyone looking to apply SSPA to their omics research, providing clarity on how the simulations are constructed and how they can be interpreted for sample size and power optimization.
Installing R Packages¶
# Install the required packages
if (!require("SSPA")) remotes::install_github("mvaniterson/SSPA")
# Load necessary libraries
library(SSPA)
library(genefilter)
library(lattice)
Case Study¶
Case 1 - Nutrigeonmics and Intestinal Transcription Factor¶
The study aimed to investigate the activation effects of three agonists—Wy14,643, C18:3, and fenofibrate—on PPARα, a transcription factor, in the small intestine of mice, examining differences in potency and exposure time. Agonists were administered to both PPARα-wild and PPARα-null mice for durations of 6 hours and 5 days. Gene expression data were collected using Affymetrix GeneChip Mouse 430 2.0 arrays. The dataset includes five sets of test statistics, each corresponding to a specific compound and exposure time. These statistics were derived using an empirical Bayes linear model, as implemented in the limma R package, comparing the effects between PPARα-null mice (n = 4) and wild-type mice (n = 4).
1. Microarray Data Overview¶
The ‘Nutrigenomics’ example dataset provided by the SSPA package includes test statistics obtained from linear regression for five experimental conditions, which involve different combinations of agonists and exposure times across 16439 genes.
The first row of the dataset indicates the effective sample size for each experimental condition, while the column names specify the compound and exposure time associated with the respective data.
# Access documentation for the "Nutrigenomics" dataset
# help("Nutrigenomics", package = "SSPA")
# Load the example data set
data(Nutrigenomics)
dim(Nutrigenomics)
- 16540
- 5
2. Initializing PilotData Object¶
Create a PilotData object for power and sample size analysis using the user-friendly SSPA::pilotData() function.
There are three key arguments in SSPA::pilotData() function as follows,
statistics
takes each column of Nutrigenomics datasetsamplesize
square root of the effective sample size in the frist row of Nutrigenomics datasetdistribution
assume t test statistics follows the normal distribution
pilot_data <- apply(Nutrigenomics, 2, function(x) SSPA::pilotData(statistics = x[-1], samplesize = x[1], distribution = "norm"))
We can visualize the PilotData object to check if distribution assumptions are met and to assess the proportion of significantly different positive genes between wild-type and knockout mice across various agonist and exposure time combinations.
# Visualization of the test statistics and p-values distribution
par(mfrow = c(2, 2))
for (i in 1:4) {
plot(pilot_data[[i]])
}
The histogram and Q-Q plot of the test statistics (upper-left and lower-left, respectively) show the distribution’s normality. However, the test statistics for the five experimental conditions do not perfectly follow a normal distribution, potentially impacting the validity of the power analysis.
Additionally, the p-value distribution and sorted regression plots (upper-right and lower-right, respectively) reveal the proportion of positive genes in specific experimental conditions. The 5-day treatment with the agonist Wy14,643 appears to produce the most substantial activation of intestinal PPARα.
3. Estimating Effect Sizes¶
Create a SampleSize object to estimate the proportion of non-differentially expressed genes and the density of effect sizes using the SSPA::sampleSize() function. Refer to the help documentation for detailed explanations of the primary arguments.
According to the SSPA authors, the proportion of non-differentially expressed genes is estimated using the method by Langaas et al., while the density (λ) of effect size θ is estimated with a deconvolution estimator.
sample_size <- lapply(pilot_data, SSPA::sampleSize, method = "congrad", control = list(verbose = FALSE, resolution = 2^10, from = -10, to = 10))
compounds_vector <- c("Wy14,643", "fenofibrate", "trilinolenin (C18:3)", "Wy14,643", "fenofibrate")
compounds <- rep(compounds_vector, each = 1024) |> factor()
exposure_vector <- c("5_Days", "6_Hours")
exposure <- rep(exposure_vector, c(2, 3)) |> rep(each = 1024) |> factor()
lambda <- as.vector(sapply(sample_size, function(x) x@lambda))
theta <- as.vector(sapply(sample_size, function(x) x@theta))
effect_size <- data.frame(exposure = exposure, compounds = compounds, lambda = lambda, theta = theta)
lattice::xyplot(lambda ~ theta | exposure,
group = compounds,
data = effect_size,
type = c('g', 'l'),
layout = c(1, 2),
lwd = 2,
xlab = "effect size",
ylab = "",
auto.key = list(columns = 3, lines = TRUE, points = FALSE, cex = 0.7)) |> print()
The plot shows the density of estimated effect sizes for different agonists and exposure times. Overall, effect sizes are much stronger on average for the 5-day exposure compared to the 6-hour exposure.
4. Calculating the Average Power¶
Using simulated effect sizes and proportions of non-differentially expressed genes, we can calculate the average power while controlling for multiple testing with the adaptive Benjamini-Hochberg method. For detailed mathematical explanations, refer to the paper.
The SSPA::predictpower()
function calculates the average power based on these simulations. The samplesize
argument represents the square root of the biological replicates per group, as defined by the sample_size_per_group
variable.
# Define the expected sample size
sample_size_per_group <- seq(2, 20)
power = sapply(sample_size, function(x) as.numeric(SSPA::predictpower(x, samplesize = sqrt(sample_size_per_group)))) |> as.vector()
exposure = rep(exposure_vector, c(2, 3)) |> rep(each = length(sample_size_per_group)) |> factor()
compounds = rep(compounds_vector, each = length(sample_size_per_group)) |> factor()
samplesize = rep(sample_size_per_group, 5)
average_power <- data.frame(power = power, exposure = exposure, compounds = compounds, samplesize = samplesize)
lattice::xyplot(power ~ samplesize | exposure,
group = compounds,
data = average_power,
type = c('g', 'b'), # g for lines, b for points
layout = c(1, 2),
lwd = 2,
pch = 16,
xlab = "sample size (per group)",
ylab = "",
auto.key = list(columns = 3, lines = TRUE, points = FALSE, cex = 0.7)) |> print()
The curve plot shows the average power as a function of biological replicates per group. Exposure time significantly affects the effect size. With 6 hours of exposure (upper panel), 20 biological replicates per group nearly achieve 80% average power. In contrast, with 5 days of exposure (lower panel), only 8 biological replicates per group are needed to reach 80% average power.
Case Study 2 - Metabolomics in Cancer Study¶
In this case study, I will use a metabolomics dataset from a pilot urine study on cancer patients to demonstrate power analysis for high-throughput omics data using SSPA. The dataset includes metabolite concentrations for two groups: cachexic patients (n = 47) and a control group (n = 30). This pilot data is openly available from the web application MetaboAnalyst 6.0. Cachexia is a syndrome marked by severe weight and muscle loss, commonly seen in chronic diseases like cancer, and is resistant to standard nutritional support.
In this case study, I applied the analytical workflow from Case Study 1. Since the input data in Case Study 2 consists of metabolite concentrations rather than test statistics, we first need to calculate the test statistics using the limma
package.
# Load the required packages
require(limma)
# Read in the pilot metabolomics data
df_metabo_data <- read.csv("human_cachexia.csv", check.names = FALSE)
cat("The pilot data set contains ", dim(df_metabo_data)[1], " sample and ", dim(df_metabo_data)[2]-1, " metabolites.")
The pilot data set contains 77 sample and 64 metabolites.
The first column represents patient ID (or sample ID), the second column represents the group (control and disease state), and the remaining columns represent the 63 metabolite concentrations.
Prepare pilot data to fit the limma model
df_metabo_data2 <- df_metabo_data[ ,-2] |> t() |> as.data.frame()
colnames(df_metabo_data2) <- df_metabo_data2[1, ]
df_metabo_data3 <- df_metabo_data2[-1, ]
df_metabo_data4 <- apply(df_metabo_data3, 2, as.numeric) |> as.data.frame()
rownames(df_metabo_data4) <- rownames(df_metabo_data3)
# Define the experimental design
treatment_group <- df_metabo_data[ ,2] |> factor()
exp_design <- model.matrix(~treatment_group, data = df_metabo_data4)
fit_metabo_data <- limma::lmFit(df_metabo_data4, exp_design)
fit_metabo_data_ebayes <- limma::eBayes(fit_metabo_data)
# Extract the test statistics
test_statistics <- limma::topTable(fit_metabo_data_ebayes, coef = 2, number = Inf)
# Extract t test statistics
t_statistics <- test_statistics[ , "t"] |> as.numeric()
Define the object pilotData class for the metabolomics data
Before creating the pilotData object, we need to calculate the effective sample size, a key parameter for power analysis. For two groups, it is calculated as the square root of the inverse of 1/n1 + 1/n2.
# Calculate the effective sample size
sample_size_treatment <- as.numeric(table(treatment_group))[1]
sample_size_control <- as.numeric(table(treatment_group))[2]
effective_sample_size <- sqrt(1 / (1/sample_size_treatment + 1/sample_size_control))
pilot_data_metabo <- SSPA::pilotData(statistics = t_statistics, samplesize = effective_sample_size, distribution = "norm")
plot(pilot_data_metabo)
sample_size_distribution <- SSPA::sampleSize(pilot_data_metabo, method = "congrad", control = list(verbose = FALSE, resolution = 2^10, from = -10, to = 10))
plot(sample_size_distribution)
sample_size_expected_metabo <- seq(2, 20, by = 2)
effective_sample_size_expected <- sqrt(sample_size_expected_metabo / 2)
power_metabo <- predictpower(sample_size_distribution, samplesize = effective_sample_size_expected)
matplot(
sample_size_expected_metabo,
power_metabo,
type = "b",
pch = 16,
ylim = c(0, 1),
ylab = "predicted power",
xlab = "sample size (per group)"
)
grid()
In summary, the minimal sample size (biology replicate) is 20 per group to achieve 80% power to detect difference in mean metabolite concentration between the two groups.