Parametric Survival Fit & PSA for Age-at-Onset in Autoimmune Disease: Generalized Gamma Workflow
In rare disease epidemiology, researchers often face a familiar challenge: most published studies do not provide individual-level patient data. Instead, they report only high-level summary statistics such as quartiles, means, and total sample size. Yet for health economic modeling, burden of disease estimation, or natural history modeling, we still need to understand the full age-at-onset distribution—including its shape, skew, and uncertainty.
This article presents a practical workflow for reconstructing that distribution using parametric survival models, with a focus on the generalized gamma distribution. We will also show how to incorporate parameter uncertainty using probabilistic sensitivity analysis (PSA) and how to compute key outputs such as diagnosis probabilities within specific age bands.
This workflow is adapted from a full analysis notebook, and is intended to be immediately useful for epidemiologists, biostatisticians, and health economists working with sparse or aggregate-only datasets.
1. Why Parametric Survival Modeling Matters
When individual-level data are unavailable, modeling decisions become difficult:
- How do we estimate onset age distributions when we only know quartiles?
- How can downstream models quantify uncertainty?
- Can we still estimate age-specific probabilities, like onset before 12 or after 18?
The answer is yes. Well-chosen parametric models—fit using a quantile-matching method—allow us to approximate the underlying distribution closely enough for epidemiological and health economic purposes.
The approach also supports probabilistic uncertainty methods widely used in cost-effectiveness modeling.
2. Summary Data and Python Setup
2.1 Summary Statistics Used
The analysis uses the following literature-reported values.
| Statistic | Value |
|---|---|
| Min | 19 |
| Q1 | 61 |
| Median | 66 |
| Q3 | 72 |
| Max | 88 |
| Mean | 67 |
| Sample Size | 111 |
These values provide a robust representation of the distribution’s central tendency and spread.
2.2 Software Environment
The full workflow uses:
- NumPy for numerical operations
- pandas for data structuring
- SciPy for probability distributions
- statsmodels for numerical Hessian estimation
- matplotlib for visualization
All version details are recorded to support reproducibility.
3. Fitting Parametric Models from Summary Data
3.1 Why Use Parametric Distributions?
Parametric survival distributions help us:
- interpolate and extrapolate beyond observed quartiles
- obtain smooth CDFs and PDFs
- compute probabilities within arbitrary age bands
- implement Monte Carlo–based uncertainty quantification
This is particularly valuable in rare disease research, where datasets are often small.
3.2 Quantile Matching: Reconstructing the Distribution
Because raw data are missing, we instead match:
\[(Q1, Median, Q3)_\text{model} \approx (Q1, Median, Q3)_\text{observed}\]We define an objective function that penalizes deviations between model-predicted and observed quartiles, and we optimize the model’s parameters to minimize this squared error.
q1, median, q3 = 61, 66, 72
empirical_q = [q1, median, q3]
def gengamma_objective(params):
a, c, scale = params
if a <= 0 or scale <= 0:
return np.inf
try:
dist = gengamma(a=a, c=c, scale=scale)
theo_q = dist.ppf([0.25, 0.5, 0.75])
return np.sum((np.array(theo_q)- np.array(empirical_q))**2)
except:
return np.inf
This allows us to recover a plausible distribution consistent with the literature.
3.3 Why Generalized Gamma?
While lognormal and Weibull are widely used, the generalized gamma provides:
- high flexibility in skewness
- adjustable tail behavior
- ability to mimic many other survival distributions
After fitting the model via quantile matching, we obtain parameter estimates for shape (a), power (c), and scale (s). These values define the final onset-age distribution.
initial_guess_gengamma = [2.0, 1.0, 10.0]
bounds_gengamma = [(0.01, None), (0.01, None), (0.01, None)]
result_gengamma = minimize(gengamma_objective, x0=initial_guess_gengamma, bounds=bounds_gengamma)
a_fit_gengamma, c_fit_gengamma, scale_fit_gengamma = result_gengamma.x
| Parameter | Value |
|---|---|
| a | 26.6110 |
| c | 1.5835 |
| scale | 8.4092 |
3.4 Diagnostic Checks
To validate the fit, we:
- simulate onset ages using the fitted parameters
- plot a histogram compared to the reported median
- overlay the fitted CDF against empirical quartile points
This confirms that the generalized gamma model aligns well with both the median and quartile anchors.
4. Adding Uncertainty: Probabilistic Sensitivity Analysis (PSA)
4.1 Why PSA?
Health economic and epidemiological models require not only point estimates but also credible intervals, especially when clinical evidence is sparse.
Parameter uncertainty in the fitted model must therefore be propagated to all downstream quantities.
4.2 Building the Variance-Covariance Matrix
- We approximate the Hessian numerically at the parameter optimum.
- Inverting the Hessian yields the variance–covariance matrix of the parameter estimates.
- This matrix captures both individual parameter uncertainty and correlations among parameters.
theta_hat = np.array([a_fit, c_fit, scale_fit])
H = approx_hess(theta_hat, gengamma_objective)
vcov = np.linalg.inv(H)
| 0 | 1 | 2 | |
|---|---|---|---|
| 0 | 595.1296 | -17.7608 | -315.0622 |
| 1 | -17.7608 | 0.5511 | 9.6338 |
| 2 | -315.0622 | 9.6338 | 169.3408 |
4.3 Multivariate Normal Sampling
Using Cholesky decomposition, we sample thousands of parameter sets from a multivariate normal distribution:
\[\theta^{(i)} \sim \text{MVN}(\hat{\theta},\, \Sigma)\]For each sampled parameter set, we compute:
- a full onset-age CDF
- age-band probabilities
This forms the backbone of the PSA.
rng = np.random.default_rng(123)
L = np.linalg.cholesky(vcov)
Z = rng.standard_normal((5000, 3))
theta_draws = theta_hat + Z @ L.T
5. Estimating Age-Band Probabilities
For many models, we need to know the probability that onset occurs within:
- 0–12 years
- 12–18 years
- 18+ years
Using the CDF from each Monte Carlo draw, we compute:
\[P(a \leq \text{onset} < b) = F(b) - F(a)\]This avoids slow individual-level simulation and supports rapid PSA.
A summary table (across ~5,000 draws) includes:
- mean probability
- standard deviation
- median
- 95% uncertainty interval (2.5–97.5th percentile)
Typical results show:
- ~95% of onsets occur after age 18
- 0–12 and 12–18 have wide uncertainty due to very low event frequency
| Age Band | Mean | SD | Median | CI Lower (2.5%) | CI Upper (97.5%) |
|---|---|---|---|---|---|
| 0–12 | 2.67% | 14.00% | 0.00% | 0.00% | 44.47% |
| 12–18 | 1.95% | 9.41% | 0.00% | 0.00% | 23.81% |
| 18+ | 95.38% | 18.45% | 100.00% | 9.54% | 100.00% |
This output can be plugged directly into disease models.
6. Visualizing Uncertainty: CDF Ensembles
One of the most insightful plots overlays:
- light gray curves—thousands of CDFs generated from sampled parameters
- a blue line—mean CDF across all Monte Carlo draws
- red points—empirical quartile anchors
This reveals:
- strong alignment near the median
- greatest uncertainty in the tails
- minimal probability mass before ~20 years
- a steep probability rise between 55–75 years
The ensemble plot illustrates not just the best-fit curve, but the range of plausible distributions given uncertainty in the literature.
7. Ensuring Reproducibility
At the end of the analysis, we record:
- Python version
- OS
- NumPy, pandas, SciPy, statsmodels, matplotlib versions
Reproducibility is essential for transparent modeling, peer review, and regulatory or client submissions.
8. Conclusion
Even when individual-level data are unavailable, we can still reconstruct a meaningful, validated age-at-onset distribution using parametric survival models. The generalized gamma distribution offers excellent flexibility, and the combination of:
- quantile matching,
- diagnostic validation, and
- probabilistic sensitivity analysis (PSA)
produces a robust and transparent modeling workflow.
This approach fits naturally into broader health economic and epidemiological frameworks, enabling:
- estimation of onset-age probabilities,
- sensitivity and uncertainty analyses,
- population-level burden and cost projections,
- and forecasting models informed by realistic evidence intervals.
As the next step, this method can be extended to multiple studies, enabling meta-analytic quantile fitting, and eventually integrated into automated data extraction and modeling pipelines.