How to Reconstruct Disease Onset Age Distributions Using Summary Statistics: A Rare Disease Use Case in Python

In rare disease research, one of the most frustrating barriers is the lack of individual-level data. While disease onset age is critical for modeling epidemiological burden and designing trials, most published studies only report summary statistics like medians or quartiles.

But what if you could reconstruct a full onset age distribution from just these summary stats?

In this post, I’ll walk you through how to simulate granular age-at-onset profiles using Python. We’ll use anti-GABABR autoimmune encephalitis (AIE) as a real-world example and fit three candidate distributions—log-normal, Weibull, and generalized gamma—using a technique called quantile matching.

You’ll learn how to:

  • Fit distributions to summary statistics using constrained optimization
  • Simulate individual-level onset ages
  • Compare model fit across distributions
  • Estimate age-band proportions with bootstrap confidence intervals

Whether you’re working on a burden-of-disease model, HTA submission, or just exploring rare disease analytics, this post will give you a reproducible template to start from.

The reusable Python notebook is available at here


Step 1: Background and Input Data

Our example comes from Lamblin et al. (2024), who reported onset age statistics for anti-GABABR AIE:

median = 66
q1 = 61 
q3 = 72
min = 19
max = 88
mean = 67
size = 111

For our modeling, we’ll focus on the three quantiles (Q1, median, Q3).


Step 2: Quantile Matching to Fit Distributions

We’ll use constrained optimization to find the best-fitting parameters such that theoretical quantiles match the reported ones.

Here’s the objective function for a log-normal distribution:

from scipy.stats import lognorm
from scipy.optimize import minimize
import numpy as np

# Empirical quantiles
empirical_q = [61, 66, 72]  

# Objective function: Minimize squared differences between model and empirical quantiles
def objective(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    dist = lognorm(s=sigma, scale=np.exp(mu))
    theo_q = dist.ppf([0.25, 0.5, 0.75])
    return np.sum((np.array(theo_q) - empirical_q)**2)

# Initial guess for mu and sigma
initial_guess = [np.log(66), 0.5]

# Optimize
result = minimize(objective, x0=initial_guess, bounds=[(0, None), (0.01, 5)])

mu_fit, sigma_fit = result.x

Fitted parameters:

  • μ (log-scale mean): 4.192
  • σ (log-scale SD): 0.123

This defines the objective function for optimization:

  • Takes parameters mu and sigma as input
  • Returns infinity if sigma is non-positive (constraint)
  • Calculates theoretical quantiles using the log-normal distribution with given parameters
  • Returns the sum of squared differences between theoretical and empirical quantiles (this is what we want to minimize)

We repeat this process for Weibull and generalized gamma as well.


Step 3: Visualizing the Simulated Distribution

Once we’ve fitted the distribution, we simulate 10,000 onset ages and plot:

  • Left: Histogram of simulated ages
  • Right: CDF overlay comparing empirical and model quantiles

Example output for log-normal:

This dual-panel visualization helps assess how well the simulated distribution captures the shape and tails of the reported data.


Step 4: Comparing Goodness-of-Fit

We assess fit using the sum of squared differences between simulated and empirical quantiles:

Distribution Sum of Squared Errors
Log-normal 0.0916
Weibull 0.5238
Generalized Gamma 0.0772 (best fit)

Unsurprisingly, the generalized gamma distribution—known for its flexibility—performed best.


Step 5: Estimating Age-Band Proportions with Bootstrap CI

Let’s say we want to estimate the proportion of patients in three age bands:

- <12 years
- 1217 years
- 18 years

We use bootstrapping (1,000 iterations) to estimate proportions and their 95% confidence intervals:

age_bands = [(0,12), (12,18), (18,100)]
n_iterations = 1000
n_samples = 10000

from scipy.stats import gengamma

# Using best-fit generalized gamma parameters
bootstrap_results = []
for _ in range(n_iterations):
    sim_ages = gengamma(a=26.611, c=1.583, scale=8.409).rvs(n_samples)
    proportions = [np.mean((sim_ages >= low) & (sim_ages < high)) for (low, high) in age_bands]
    bootstrap_results.append(proportions)

Resulting summary:

Age Band Mean Proportion 95% CI Lower 95% CI Upper
0–12 0.0000 0.0000 0.0000
12–18 0.0000 0.0000 0.0000
18+ 0.9999 0.9997 1.0000

Visualization:

This clearly shows the distribution is concentrated in adult-onset.


Conclusion and Reuse

This method allows you to simulate realistic onset age distributions using only summary statistics. It’s particularly valuable in rare disease modeling, where raw data are scarce but precise modeling inputs are needed.

You can adapt this notebook by:

  • Swapping in new quantiles from another disease or study
  • Testing additional distribution families (e.g., skew-normal)
  • Extending to model time to diagnosis or treatment delay

Whether you’re building HTA models or conducting evidence synthesis, this is a scalable, data-light, and transparent tool to add to your workflow.