Stan-Based Epidemiological Models I: The Basics
Introduction
This post is primarily for my own understanding rather than to present new ideas. I'll walk through, from the ground up, the latest epidemiological model developed by a leading group of researchers, one that incorporates behavioral factors like mobility. I couldn't find working Stan implementations, partly due to version changes (e.g., the renaming of the differential equation solver from integrate_ode_rk45 to ode_rk45). LLMs don't do that well on it either, so I've reconstructed the models here. Hopefully, this will be useful to others facing similar issues.
The simplest models in Stan
SIR
Files: sir0.stan (the model) and sir0.py (the runtime).
SIR models divide a population into three compartments: S (Susceptible) → I (Infected) → R (Recovered). S: people who can catch the disease; I: people currently infected and contagious; R: people who have recovered (or died!) and are immune. The parameters of these models are:
- beta: Transmission rate or how easily the disease spreads
- gamma: Recovery rate or how quickly people recover
- R0 = beta/gamma: Basic reproduction number or average new infections per infected person.
The rate of transition from one state to the other can be modeled using a system of ordinary differential equations, with each equation describing the rate of change of one compartment.
dS/dt = -beta × S × I / N
dI/dt = beta × S × I / N - gamma × I
dR/dt = gamma × I
which in turn can be straightforwardly encoded in Stan as:
functions {
  vector sir(real t, vector y, real beta, real gamma) {
    real S = y[1];
    real I = y[2];
    real R = y[3];
    real N = S + I + R;
    vector[3] dydt;
    dydt[1] = -beta * S * I / N;             // dS/dt
    dydt[2] = beta * S * I / N - gamma * I;  // dI/dt
    dydt[3] = gamma * I;                     // dR/dt
    return dydt;
  }
}
The most important remaining piece is the generated quantities block, which simply calls the ODE solver:
generated quantities {
  array[n_days] vector[3] y = ode_rk45(sir, y0, t0, ts, beta, gamma);
}
You will see in sir0.stan that there's an empty model {...} block. In Stan, the model block is where you typically specify your likelihood and priors for Bayesian inference. However, in this case we're not doing any parameter estimation, just a forward simulation. We'll come back to this below.
The data come from the Python file sir0.py.
N = 1000  # Total population size.

data = {
    'n_days': 100,        # How long to run the simulation (100 days)
    't0': 0,              # Starting time (day 0)
    'y0': [N - 1, 1, 0],  # Initial conditions: [999, 1, 0]
    'beta': 0.5,          # Transmission rate
    'gamma': 0.1          # Recovery rate
}
The initial conditions y0 = [N-1, 1, 0] set up the epidemic's starting point. S0 = 999 means almost everyone in the population starts susceptible to the disease. I0 = 1 indicates there is one initially infected person, often called "patient zero," who will start the outbreak. R0 = 0 (the initial recovered count, not the reproduction number) shows that nobody has recovered yet at the beginning. The model's parameters control how the disease spreads: beta = 0.5 is the transmission rate, meaning an infected person makes, on average, 0.5 disease-transmitting contacts per day (the resulting number of new infections is scaled by the susceptible fraction S/N). gamma = 0.1 is the recovery rate, indicating that each infected person has roughly a 10% chance of recovering each day, which translates to an average infectious period of 1/gamma = 10 days.
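As a reference for how the forward simulation can be run end to end, here is a minimal sketch of a runtime like sir0.py, assuming cmdstanpy as the interface; the ts entry (the solution times) is an assumption, since it is needed by the generated quantities block but wasn't shown in the data dictionary above:

# Minimal sketch of a runtime like sir0.py, assuming cmdstanpy.
# With an empty model block there is nothing to sample, so we use
# fixed_param=True; a single draw yields the deterministic trajectories.
from cmdstanpy import CmdStanModel

N = 1000
data = {
    'n_days': 100,
    't0': 0,
    'ts': list(range(1, 101)),  # assumed: solution times, days 1..100
    'y0': [N - 1, 1, 0],
    'beta': 0.5,
    'gamma': 0.1,
}

model = CmdStanModel(stan_file='sir0.stan')
fit = model.sample(data=data, fixed_param=True, chains=1, iter_sampling=1)
y = fit.stan_variable('y')[0]  # shape (n_days, 3): columns S, I, R
print(y[:5])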
SEIR
Files are seir0.stan and seir0.py.
The Exposed (E) compartment represents individuals who have been infected but are not yet infectious themselves. When a susceptible person contracts the disease, they first enter the E state, where the pathogen is incubating: multiplying within their body but not yet at levels sufficient to transmit to others. After an average incubation period of 1/sigma days, they transition to the Infectious (I) state, where they can spread the disease. This is more realistic than the SIR model's assumption that people become immediately infectious upon infection, and it is particularly important for diseases like COVID-19, influenza, or measles, where there's a significant latent period between infection and infectiousness. The E compartment is easy to add. First we modify the equations; in Stan this translates to:
dydt[1] = -beta * S * I / N;             // dS/dt
dydt[2] = beta * S * I / N - sigma * E;  // dE/dt
dydt[3] = sigma * E - gamma * I;         // dI/dt
dydt[4] = gamma * I;                     // dR/dt
We just added the sigma parameter (i.e., the rate of progression from E to I).
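To see the effect of the E compartment outside of Stan, here is an illustrative Python sketch of the same SEIR system using scipy; the parameter values are assumptions for demonstration, not taken from seir0.py:

# Illustrative SEIR forward simulation with scipy. Parameter values
# (beta=0.5, sigma=0.2, gamma=0.1) are assumptions for demonstration.
import numpy as np
from scipy.integrate import solve_ivp

def seir(t, y, beta, sigma, gamma):
    S, E, I, R = y
    N = S + E + I + R
    dS = -beta * S * I / N
    dE = beta * S * I / N - sigma * E  # newly exposed, incubating
    dI = sigma * E - gamma * I         # infectious after ~1/sigma days
    dR = gamma * I
    return [dS, dE, dI, dR]

N = 1000
y0 = [N - 1, 0, 1, 0]  # one initially infectious person, nobody exposed
sol = solve_ivp(seir, (0, 100), y0, args=(0.5, 0.2, 0.1),
                t_eval=np.arange(0, 101))
print(sol.y[2].max())  # peak number of infectious people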
Inference
The above essentially uses Stan as a differential equation solver rather than as a probabilistic programming language. Remember the forward simulation mentioned above? It means we're taking the SIR differential equations and solving them forward in time from a starting point, without trying to fit the model to any data or estimate unknown parameters. In the forward case, we're asking: "Given these parameters, what happens?" In the inverse case, we'd ask: "Given what happened (the data), what were the parameters?" Or, a bit closer to our examples here: instead of "Given beta and gamma, what epidemic do we get?", we ask "Given observed cases, what beta and gamma most likely produced them?"
Now we'll demonstrate Bayesian inference using Markov chain Monte Carlo (MCMC) methods applied to the classic SIR epidemiological model. To make this exercise concrete, we'll work with synthetic epidemic data representing fictional disease outbreaks, each with its own transmission characteristics and backstory. The complete code for generating these synthetic datasets can be found in sir_synthetic_data.py.
The objective is to fit the case counts brought about by these diseases, that is, to fit beta and gamma without knowing them a priori. We know them, but the model doesn't, and it should be able to recover them. You can find the model specification in sir_infer.stan, and we'll try to recover the data generated by our script, whose most important part is the following:
def generate_sir_synthetic_data(
    N=1000,
    beta=0.8,   # <-- this matters
    gamma=0.1,  # <-- this matters
    I0=1,
    n_days=50,
    observation_noise='poisson',
    noise_scale=1.0,
    seed=42,
    ...
where beta and gamma are what we will try to recover (the defaults shown are beta = 0.8 and gamma = 0.1; the run analyzed below was generated with beta = 1.2 and gamma = 0.1). To do this, the two most important parts of the model specification now live in the model {...} block of the Stan code: the likelihood function (our expectation of what the data will look like) and the priors. In Bayesian statistics, the priors express what we expect the parameters to be (given our previous knowledge) before actually seeing the data. In a way, they guide the model toward the correct parameter values without it having to explore an infinite parameter space.
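Before fitting anything, it helps to see how such synthetic data can be produced. Here is a minimal sketch of the core idea, solving the SIR system and overlaying Poisson observation noise; the actual sir_synthetic_data.py may differ in its details:

# Sketch of a synthetic-data generator (an assumption: the real
# sir_synthetic_data.py may differ in details). We solve the SIR
# ODEs and overlay Poisson observation noise on the infected counts.
import numpy as np
from scipy.integrate import solve_ivp

def sir(t, y, beta, gamma):
    S, I, R = y
    N = S + I + R
    return [-beta * S * I / N, beta * S * I / N - gamma * I, gamma * I]

rng = np.random.default_rng(42)
N, beta, gamma, I0, n_days = 1000, 0.8, 0.1, 1, 50
sol = solve_ivp(sir, (0, n_days), [N - I0, I0, 0],
                args=(beta, gamma), t_eval=np.arange(1, n_days + 1))
cases = rng.poisson(sol.y[1])  # observed cases: a noisy version of I(t)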
In our model, the priors are:
beta ~ lognormal(-0.69, 0.5);
gamma ~ lognormal(-2.30, 0.5);
phi ~ exponential(0.1);
These prior choices are weakly informative (they don't constrain the search too much) and are grounded in epidemiological knowledge. The beta prior lognormal(-0.69, 0.5) centers around 0.5 per day with a 95% prior interval of roughly [0.2, 1.2], covering realistic transmission rates from slow-spreading diseases with interventions to highly transmissible infections like early COVID-19, while the lognormal distribution ensures positivity and allows for the right-skewed behavior typical of transmission parameters. The gamma prior lognormal(-2.30, 0.5) centers around 0.1 per day (corresponding to a 10-day infectious period) with a range covering 4- to 25-day infectious periods, which encompasses most respiratory infections, from fast-recovering common colds to slower bacterial infections. The phi prior has a mean of 10 and its mode at 0, weakly favoring less overdispersion while its long tail accommodates high overdispersion when the data demand it; small phi values indicate noisy data, and large phi values approach Poisson-like behavior. Together, these priors encode basic biological constraints (positivity), rule out completely unrealistic scenarios (like beta = 100 or 1-day infectious periods), provide gentle regularization when data is sparse, and let the data dominate inference while implying reasonable R0 values (median ≈ 5) that align with epidemiological expectations for infectious disease modeling.
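A quick sanity check on what these priors imply is to draw from them and inspect the induced distribution of R0 = beta/gamma; a minimal sketch using the priors above:

# Prior predictive check on R0 = beta/gamma, using the priors above.
# numpy's lognormal takes the log-scale mean and sd directly.
import numpy as np

rng = np.random.default_rng(0)
beta = rng.lognormal(mean=-0.69, sigma=0.5, size=100_000)
gamma = rng.lognormal(mean=-2.30, sigma=0.5, size=100_000)
r0 = beta / gamma
print(np.median(r0))                   # about 5: the prior's typical R0
print(np.percentile(r0, [2.5, 97.5]))  # roughly [1.2, 20]: wide but plausible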
Finally, the likelihood is cases ~ neg_binomial_2(y[,2], phi), because real epidemiological data almost always exhibit overdispersion (the variance is much larger than the mean) due to reporting inconsistencies, testing variability, behavioral heterogeneity, spatial clustering, and measurement error that our simple SIR model doesn't capture. A Poisson likelihood would assume the variance equals the mean, which is unrealistic for disease surveillance data. The negative binomial distribution naturally extends the Poisson to handle overdispersion, with mean E[cases] = y[,2] (the infected count from our SIR model) and variance Var[cases] = mu + mu²/phi, where phi controls the overdispersion: large phi behaves like a Poisson, while small phi allows high overdispersion. This likelihood essentially says: "the observed cases are drawn from a distribution centered at our model's predicted infected count, but with additional variance controlled by phi to account for all the real-world noise our ODE doesn't capture." This approach is empirically supported, since count data in epidemiology is almost always overdispersed; it provides a flexible variance structure that is robust to model misspecification, remains computationally efficient, and gives interpretable parameters, where phi directly measures how much extra variability exists beyond what a perfect model would predict. This makes our inferences more robust and realistic by acknowledging that our SIR model is a simplified approximation of reality.
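To make the mean-variance relation concrete, note that neg_binomial_2(mu, phi) can be simulated as a Gamma-Poisson mixture; a small sketch verifying Var = mu + mu²/phi:

# Stan's neg_binomial_2(mu, phi) as a Gamma-Poisson mixture:
# lambda ~ Gamma(shape=phi, scale=mu/phi), cases ~ Poisson(lambda).
import numpy as np

rng = np.random.default_rng(0)
mu, phi = 50.0, 10.0
lam = rng.gamma(shape=phi, scale=mu / phi, size=200_000)
cases = rng.poisson(lam)
print(cases.mean())  # about 50  = mu
print(cases.var())   # about 300 = mu + mu**2 / phi = 50 + 2500/10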
So, does our model fit the generated data? Let's look at some plots generated by running the Python script sir_infer.py:
or, in numbers:
            Mean      MCSE    StdDev        MAD        5%       50%       95%  ESS_bulk  ESS_tail  R_hat
lp__    -203.918  0.033031    1.318    1.008910  -206.610  -203.558  -202.536   1777.15   2417.88  1.000
beta       1.183  0.000359    0.018    0.018110     1.154     1.182     1.212   2742.12   2336.94  1.000
gamma      0.100  0.000022    0.001    0.001233     0.098     0.100     0.102   3098.30   2665.67  1.000
phi       91.223  0.433643   25.389   23.988400    54.080    88.724   137.356   3350.32   2747.04  1.000
...
[205 rows x 10 columns]
The parameter recovery is remarkably precise (in this toy example; reality is not like that, unfortunately). For beta (the transmission rate), our posterior mean of 1.183 is extremely close to the true value of 1.2 we used to generate the data, a difference of only 1.4%. The 95% credible interval [1.154, 1.212] captures the true value with high confidence, and the narrow standard deviation of 0.018 indicates excellent precision.
Similarly, gamma (the recovery rate) shows outstanding recovery, with a posterior mean of 0.100 versus the true value of 0.1. The extremely tight 95% credible interval [0.098, 0.102] and minuscule standard deviation of 0.001 demonstrate the model's ability to precisely estimate this parameter from the observed case counts alone. This translates to a basic reproduction number R0 = beta/gamma = 1.183/0.100 ≈ 11.8, which closely matches the true R0 = 1.2/0.1 = 12 used in data generation (under 2% difference).
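Rather than dividing posterior means, it is better practice to compute R0 draw by draw; a sketch that assumes fit is the cmdstanpy fit object produced by sir_infer.py:

# Posterior for R0 computed per draw (sketch; assumes `fit` is the
# CmdStanMCMC object returned by model.sample() in sir_infer.py).
import numpy as np

r0_draws = fit.stan_variable('beta') / fit.stan_variable('gamma')
print(np.mean(r0_draws), np.percentile(r0_draws, [5, 95]))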
The model diagnostics are also excellent: all R-hat values are very close to 1.0 (indicating convergence), and the effective sample sizes (ESS) are well above 1000, confirming reliable posterior estimates. The fit between observed and predicted cases (bottom left panel) shows the model capturing the epidemic curve's shape and magnitude with high fidelity, while the inferred SIR trajectories (bottom right) recreate the full epidemic dynamics that were never directly observed, demonstrating the power of mechanistic modeling to recover underlying epidemiological processes from limited surveillance data.
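For reference, the summary table and the convergence checks above are straightforward to produce; a sketch under the same assumption that fit comes from cmdstanpy:

# Summary table and diagnostics with cmdstanpy (sketch; assumes
# `fit` is the CmdStanMCMC object from sir_infer.py).
summary = fit.summary()  # pandas DataFrame: Mean, MCSE, R_hat, ...
print(summary.loc[['beta', 'gamma', 'phi']])
print(fit.diagnose())    # runs CmdStan's convergence diagnostics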
A real disease!
There's a GitHub repo with a few files that can serve as an example. You can either download a specific file from the repo's data directory or clone all of it. I've written a small Python script, convert2csv.py, that reads the RData files and transforms them into csv files.
I have chosen a simple flu outbreak, which appears as influenza_england_1978_school.csv among the files above.
           Mean   MCSE  StdDev    MAD       5%      50%      95%  ESS_bulk  ESS_tail  R_hat
lp__    -71.844  0.060   1.628  1.182  -74.967  -71.397  -70.221   1111.97   981.472  1.001
beta      1.716  0.001   0.062  0.051    1.623    1.717    1.812   1804.90  1017.500  1.000
gamma     0.511  0.001   0.049  0.041    0.434    0.514    0.581   1633.92  1150.480  1.000
phi      10.105  0.131   5.508  4.686    3.405    8.964   20.467   1437.19  1373.830  1.000
Just like in the previous case, the model performs quite well, even without any modifications. The recovered rates imply R0 = beta/gamma = 1.716/0.511 ≈ 3.4 and an infectious period of about two days, which is plausible for influenza. We could, for example, have modified the priors with more specific information about influenza, which might have reduced the errors, but there's no need to do that at this point; this is just an example. We will refine the model in future posts so that we can account for our potential dengue outbreaks.
This post also appears on Substack.
This post has been written with the support of Fondo de Investigación y Desarrollo en Salud – Fonis, Project SA24I0124
—
Now you can go home or read the second part of this series.