Stan-Based Epidemiological Models I: The Basics

Introduction

This post is primarily for my own understanding rather than to present new ideas. I'll walk through, from the ground up, the latest epidemiological models developed by leading research groups, including ones that incorporate behavioral factors like mobility. Finding working Stan implementations proved challenging for several reasons: version changes have broken many existing examples (for instance, the differential equation solver was renamed from integrate_ode_rk45 to ode_rk45), and even LLMs struggle with the updated syntax. I've therefore reconstructed the models from scratch, hoping this will help others facing similar implementation issues.

This work is motivated by an upcoming analysis of dengue fever outbreaks in Chile, which present unique modeling challenges due to their vector-borne transmission patterns and seasonal dynamics. However, before tackling dengue-specific models, I need to establish a solid foundation in basic epidemiological modeling with Stan, hence this step-by-step walkthrough of increasingly complex compartmental models.

This post has been heavily inspired by Grinsztajn et al.

The simplest models in Stan

SIR

Files: sir0.stan (the model) and sir0.py (the runtime).

SIR models divide a population into three compartments: S (Susceptible) → I (Infected) → R (Recovered). S: people who can catch the disease; I: people currently infected and contagious; R: people who have recovered (or died!) and are immune. The parameters of these models are:

  • beta: Transmission rate or how easily the disease spreads
  • gamma: Recovery rate or how quickly people recover
  • R0 = beta/gamma: Basic reproduction number or average new infections per infected person.

The rate of transition from one state to the other can be modeled using a system of ordinary differential equations, with each equation describing the rate of change of one compartment.

dS/dt = -beta × S × I / N
dI/dt = beta × S × I / N - gamma × I
dR/dt = gamma × I

which in turn can be straightforwardly encoded in Stan as:

functions {
  vector sir(real t, vector y, real beta, real gamma) {
    real S = y[1];
    real I = y[2];
    real R = y[3];
    real N = S + I + R;

    vector[3] dydt;
    dydt[1] = -beta * S * I / N;            // dS/dt
    dydt[2] = beta * S * I / N - gamma * I; // dI/dt
    dydt[3] = gamma * I;                    // dR/dt

    return dydt;
  }
}
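If you want to sanity-check the Stan function outside of Stan, here is a minimal SciPy sketch of the same right-hand side and a forward solve. The code and variable names are my own, not part of sir0.py, but the parameter values match the ones used below:

import numpy as np
from scipy.integrate import solve_ivp

def sir_rhs(t, y, beta, gamma):
    # Same right-hand side as the Stan sir() function above.
    S, I, R = y
    N = S + I + R
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

# Forward-simulate 100 days, starting from one infected person in a population of 1000,
# using SciPy's default RK45 method (the counterpart of Stan's ode_rk45).
sol = solve_ivp(sir_rhs, t_span=(0, 100), y0=[999, 1, 0],
                args=(0.5, 0.1), t_eval=np.arange(1, 101))
S, I, R = sol.y  # three arrays of length 100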

With the ODE system defined, the only remaining step is to generate the trajectories in the generated quantities block.

generated quantities {
  array[n_days] vector[3] y = ode_rk45(sir, y0, t0, ts, beta, gamma);
}

You will see in sir0.stan that there's an empty model {...} block. In Stan, the model block is where you typically specify your likelihood and priors for Bayesian inference. However, in this case we're not doing any parameter estimation, so it stays empty. We'll come back to this below.

sir0.png

The data come from the Python file sir0.py.

N = 1000 # Total population size.

data = {
    'n_days': 100,          # How long to run the simulation (100 days)
    't0': 0,                # Starting time (day 0)
    'y0': [N-1, 1, 0],      # Initial conditions: [999, 1, 0]
    'beta': 0.5,            # Transmission rate
    'gamma': 0.1            # Recovery rate
}

The initial conditions y0: [N-1, 1, 0] set up the epidemic starting point. S0 = 999 means almost everyone in the population starts susceptible to the disease. I0 = 1 is the single initially infected person, often called "patient zero," who starts the outbreak. R0 = 0 means nobody has recovered yet at the beginning. The model's parameters control how the disease spreads: beta = 0.5 is the transmission rate, meaning each infected person makes on average 0.5 disease-transmitting contacts per day, of which only the fraction S/N that involve susceptible people produce new infections. gamma = 0.1 is the recovery rate, meaning infected people recover at a rate of 0.1 per day, which translates to an average infectious period of 1/gamma = 10 days.
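For completeness, here is a minimal sketch of how sir0.py might drive the model with CmdStanPy. Because nothing is estimated, Stan's fixed-parameter sampler is enough; the ts construction below is my assumption (the data snippet above doesn't show it), as is the file path:

from cmdstanpy import CmdStanModel

# The ODE solver also needs the times at which to report the solution.
data['ts'] = list(range(1, data['n_days'] + 1))  # assumed: daily time points 1..100

model = CmdStanModel(stan_file='sir0.stan')
# fixed_param=True: no parameters are sampled; Stan only evaluates generated quantities.
fit = model.sample(data=data, fixed_param=True, chains=1, iter_sampling=1)
y = fit.stan_variable('y')  # shape (1, n_days, 3): the S, I, R trajectories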

SEIR

Files are seir0.stan and seir0.py.

The Exposed (E) compartment represents individuals who have been infected but are not yet infectious themselves. When a susceptible person contracts the disease, they first enter the E state where the pathogen is incubating: multiplying within their body but not yet at levels sufficient to transmit to others. After an average incubation period of 1/sigma days, they transition to the Infectious (I) state, where they can spread the disease. This is more realistic than the SIR model's assumption that people become infectious immediately upon infection, and it is particularly important for diseases like COVID-19, influenza, or measles, where there is a significant latent period between infection and infectiousness. The E compartment is easy to add. First we modify the equations; in Stan this translates to:

dydt[1] = -beta * S * I / N;              // dS/dt
dydt[2] = beta * S * I / N - sigma * E;   // dE/dt
dydt[3] = sigma * E - gamma * I;          // dI/dt
dydt[4] = gamma * I;                      // dR/dt

We just added the sigma parameter (i.e., the rate of progression from E to I) and a fourth component, E, to the state vector.
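As a quick cross-check outside Stan, the modified right-hand side in SciPy style might look like this (a sketch of mine, not taken from seir0.py; the sigma value in the comment is just an example):

def seir_rhs(t, y, beta, sigma, gamma):
    # SEIR right-hand side: S -> E -> I -> R.
    S, E, I, R = y
    N = S + E + I + R
    dS = -beta * S * I / N
    dE = beta * S * I / N - sigma * E
    dI = sigma * E - gamma * I
    dR = gamma * I
    return [dS, dE, dI, dR]

# Example: sigma = 0.2 corresponds to an average incubation period of 1/sigma = 5 days.
# sol = solve_ivp(seir_rhs, (0, 100), y0=[999, 0, 1, 0], args=(0.5, 0.2, 0.1), t_eval=np.arange(1, 101))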

seir0.png

Inference

The above is essentially using Stan as a differential equation solver rather than a probabilistic programming language: a forward simulation. We take the SIR differential equations and solve them forward in time from a starting point, without trying to fit the model to any data or estimate unknown parameters. In the forward case, we ask: "Given these parameters, what happens?" In the inverse case, we ask: "Given what happened (the data), what were the parameters?" Or, a bit closer to our examples here: instead of "Given beta and gamma, what epidemic do we get?", we ask "Given observed cases, what beta and gamma most likely produced them?"

Now we'll demonstrate Bayesian inference using Markov Chain Monte Carlo (MCMC) methods applied to the classic SIR epidemiological model. To make this exercise concrete, we'll work with synthetic epidemic data representing fictional disease outbreaks, each with its own transmission characteristics and backstory. The complete code for generating these synthetic datasets can be found here; if you don't want to read the explanation of the file, the Python script is sir_synthetic_data.py.

The objective is to fit the case counts produced by these simulated outbreaks; that is, to estimate beta and gamma without knowing them a priori. We know them, but the model doesn't, and it should be able to recover them. You can find the model specification in sir_infer.stan, and we'll try to recover the data generated by our script, whose most important part is the following:

def generate_sir_synthetic_data(
    N=1000,
    beta=0.8,  # <-- this matters
    gamma=0.1, # <-- this matters
    I0=1,
    n_days=50,
    observation_noise='poisson',
    noise_scale=1.0,
    seed=42
    ...

These beta and gamma values are what we will try to recover. The two most important parts of the model specification now live in the model {...} block of the Stan code: the likelihood (our expectation of what the data will look like given the parameters) and the priors. In Bayesian statistics, the priors encode what we expect the parameters to be, given our previous knowledge, before actually seeing the data. In a way, they guide the model toward plausible parameter values without having to explore an infinite value space.

Priors

beta ~ lognormal(-0.69, 0.5);
gamma ~ lognormal(-2.30, 0.5);
phi ~ exponential(0.1);

These prior choices are weakly informative (they don't constrain the search too much) and grounded in epidemiological knowledge.

Beta prior: The lognormal(-0.69, 0.5) distribution centers around 0.5 per day with a 95% prior interval of roughly [0.2, 1.3]. This covers realistic transmission rates, from slow-spreading diseases under interventions to highly transmissible infections like early COVID-19. The lognormal distribution ensures positivity while allowing for the right-skewed behavior typical of transmission parameters.

Gamma prior: The lognormal(-2.30, 0.5) distribution centers around 0.1 per day, corresponding to a 10-day infectious period. This range covers 4-25 day infectious periods, encompassing most respiratory infections from fast-recovering common colds to slower bacterial infections.

Phi prior: The exponential(0.1) prior has mean 10 and its mode at 0. Since small phi means strong overdispersion and large phi approaches Poisson behavior, the mass near zero readily accommodates overdispersed counts, while the long tail toward large phi lets the data pull the likelihood toward near-Poisson behavior when little extra variance is needed.

Together, these priors encode basic biological constraints (positivity), rule out completely unrealistic scenarios (like beta = 100 or 1-day infectious periods), provide gentle regularization when data is sparse, and allow the data to dominate inference while implying reasonable R0 values (prior median ≈ 5) that align with epidemiological expectations for infectious disease modeling.
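A quick way to see what these priors imply, sketched with SciPy (this check is my own, not part of the original scripts):

import numpy as np
from scipy import stats

# Stan's lognormal(mu, sigma) means the log of the variable is Normal(mu, sigma);
# in SciPy that is lognorm(s=sigma, scale=exp(mu)).
beta_prior = stats.lognorm(s=0.5, scale=np.exp(-0.69))
gamma_prior = stats.lognorm(s=0.5, scale=np.exp(-2.30))

print(beta_prior.ppf([0.025, 0.5, 0.975]))   # roughly [0.19, 0.50, 1.34] per day
print(gamma_prior.ppf([0.025, 0.5, 0.975]))  # roughly [0.04, 0.10, 0.27] per day

# Implied prior on R0 = beta / gamma: the ratio of two independent lognormals is
# lognormal with mu = -0.69 - (-2.30) = 1.61 and sigma = sqrt(0.5**2 + 0.5**2).
r0_prior = stats.lognorm(s=np.sqrt(0.5**2 + 0.5**2), scale=np.exp(1.61))
print(r0_prior.median())  # about 5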

Why a negative binomial likelihood?

Finally, the likelihood is cases ~ neg_binomial_2(y[,2], phi);. Real epidemiological data almost always exhibit overdispersion (the variance is much larger than the mean) due to reporting inconsistencies, testing variability, behavioral heterogeneity, spatial clustering, and measurement error that our simple SIR model doesn't capture. A Poisson likelihood would assume the variance equals the mean, which is unrealistic for disease surveillance data.

The negative binomial distribution handles overdispersion better than the Poisson. It uses our SIR model's infected count as the mean, but adds extra variance controlled by the parameter phi: large phi values behave like a Poisson, while small phi allows for much higher variance than the mean. This likelihood essentially says "the observed cases are drawn from a distribution centered at our model's predicted infected count, but with additional variance controlled by phi to account for all the real-world noise our ODE doesn't capture."

This approach is empirically supported (count data in epidemiology are almost always overdispersed), provides a flexible variance structure that is robust to model misspecification, remains computationally efficient, and gives interpretable parameters: phi directly measures how much extra variability exists beyond what a perfect model would predict. It makes our inferences more robust and realistic by acknowledging that our SIR model is a simplified approximation of reality.
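To make the mean-variance relationship concrete, here is a small NumPy sketch of Stan's neg_binomial_2(mu, phi) parameterization; the conversion to NumPy's (n, p) parameterization is standard, and the example values are mine:

import numpy as np

rng = np.random.default_rng(0)

def neg_binomial_2_rng(mu, phi, size):
    # Draw counts with mean mu and variance mu + mu**2 / phi (Stan's neg_binomial_2).
    # NumPy uses the (n, p) parameterization; n = phi and p = phi / (phi + mu)
    # give exactly that mean and variance.
    return rng.negative_binomial(phi, phi / (phi + mu), size=size)

mu = 50.0
for phi in (1.0, 10.0, 1e6):
    draws = neg_binomial_2_rng(mu, phi, size=100_000)
    print(phi, round(draws.mean(), 1), round(draws.var(), 1))
# As phi grows, the variance shrinks toward the mean (Poisson-like behavior).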

Parameter recovery

So, does our model fit the generated data? Let's look at some plots generated by running the Python script sir_infer.py:

sir_inference_results.png

or, in numbers:

                 Mean      MCSE  StdDev        MAD       5%      50%      95%  ESS_bulk  ESS_tail  R_hat
lp__         -203.918  0.033031   1.318   1.008910 -206.610 -203.558 -202.536   1777.15   2417.88  1.000
beta            1.183  0.000359   0.018   0.018110    1.154    1.182    1.212   2742.12   2336.94  1.000
gamma           0.100  0.000022   0.001   0.001233    0.098    0.100    0.102   3098.30   2665.67  1.000
phi            91.223  0.433643  25.389  23.988400   54.080   88.724  137.356   3350.32   2747.04  1.000
...
[205 rows x 10 columns]

The parameter recovery is remarkably precise (in this toy example; reality is unfortunately not like that). For beta (the transmission rate), our posterior mean of 1.183 is extremely close to the true value of 1.2 used to generate the data, a difference of only about 1.4%. The 5%-95% posterior interval [1.154, 1.212] contains the true value, and the narrow standard deviation of 0.018 indicates excellent precision.

Similarly, gamma (the recovery rate) shows outstanding recovery, with a posterior mean of 0.100 essentially on top of the true value of 0.1. The extremely tight 5%-95% posterior interval [0.098, 0.102] and minuscule standard deviation of 0.001 demonstrate the model's ability to estimate this parameter precisely from the observed case counts alone. This translates to a basic reproduction number R0 = beta/gamma = 1.183/0.100 ≈ 11.8, which closely matches the true R0 = 1.2/0.1 = 12 used in data generation (less than a 2% difference).

Model diagnostics

The model diagnostics are also excellent: all R-hat values are very close to 1.0 (indicating convergence), and the effective sample sizes (ESS) are well above 1000, confirming reliable posterior estimates. The fit between observed and predicted cases (bottom-left panel) shows the model capturing the epidemic curve's shape and magnitude with high fidelity, while the inferred SIR trajectories (bottom-right panel) recreate the full epidemic dynamics that were never directly observed. This demonstrates the power of mechanistic modeling to recover underlying epidemiological processes from limited surveillance data.
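If you are running sir_infer.py yourself, CmdStanPy exposes these checks directly; a minimal sketch, assuming the CmdStanMCMC object returned by model.sample(...) is called fit:

print(fit.diagnose())    # divergences, treedepth and E-BFMI warnings, if any
summary = fit.summary()  # the per-parameter table shown above
print(summary.loc[['beta', 'gamma', 'phi'], ['Mean', 'StdDev', 'ESS_bulk', 'R_hat']])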

Testing on Real Epidemiological Data

While synthetic data validation is encouraging, real disease outbreaks present additional challenges: irregular reporting, behavioral changes, and environmental factors that our simple models don't capture. To test our approach's robustness, let's fit the model to actual influenza data.

I chose influenza data for this validation because it shares key characteristics with dengue: both are infectious diseases with clear epidemic curves, but influenza's simpler transmission dynamics (direct human-to-human) make it ideal for testing our modeling framework before moving to dengue's more complex vector-borne transmission in future posts.

There's a GitHub repo with a few files that can serve as examples. You can either download a specific file from the repo's data directory or clone all of it. I've written a small Python script, convert2csv.py, that reads the RData files and transforms them into CSV files.

I have chosen a simple flu outbreak, appearing in influenza_england_1978_school.csv in the files above.

Priors

For the influenza data, I kept the same priors used in the synthetic example rather than tailoring them to flu-specific knowledge. This tests whether our 'generic' priors are robust enough for real applications. However, we could have used more informative priors based on influenza literature: flu typically has R0 values between 1.3-1.7, suggesting a tighter beta prior, and infectious periods of 2-8 days, implying a gamma prior centered around 0.2-0.5 per day.
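If we did want flu-specific priors, lognormal hyperparameters can be backed out from a target median and upper bound. Here is a small sketch of that calculation; the helper function and the example numbers are mine, chosen to match the ranges quoted above:

import numpy as np

def lognormal_from_interval(median, upper_975):
    # Return (mu, sigma) for a lognormal with the given median and 97.5% quantile.
    mu = np.log(median)
    sigma = (np.log(upper_975) - mu) / 1.96
    return mu, sigma

# Center gamma on a 4-day infectious period (0.25/day), with a 2-day period (0.5/day)
# as the upper end; the implied lower end is ~0.125/day, i.e. an 8-day period.
mu, sigma = lognormal_from_interval(median=0.25, upper_975=0.5)
print(mu, sigma)  # roughly (-1.39, 0.35), i.e. gamma ~ lognormal(-1.39, 0.35)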

Prior sensitivity deserves attention in any Bayesian analysis. Our lognormal priors are weakly informative by design, but even weak priors can influence results when data is limited. A proper sensitivity analysis would involve refitting the model with alternative prior specifications (e.g., wider or narrower variances, different central values) and comparing posterior distributions. This is particularly important for policy-relevant parameters like R0, where prior assumptions could meaningfully affect public health recommendations.

Fitting the model

                 Mean   MCSE   StdDev    MAD       5%      50%      95%  ESS_bulk  ESS_tail  R_hat
lp__          -71.844  0.060   1.628   1.182  -74.967  -71.397  -70.221   1111.97   981.472  1.001
beta            1.716  0.001   0.062   0.051    1.623    1.717    1.812   1804.90  1017.500  1.000
gamma           0.511  0.001   0.049   0.041    0.434    0.514    0.581   1633.92  1150.480  1.000
phi            10.105  0.131   5.508   4.686    3.405    8.964   20.467   1437.19  1373.830  1.000

While parameter recovery looks good, other validation metrics would strengthen our confidence: posterior predictive checks comparing simulated outbreaks to observed data, cross-validation on held-out time periods, and comparison with alternative model structures (e.g., SEIR, age-structured models) to assess whether added complexity improves predictive performance.
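As one example, a posterior predictive check only takes a few lines, assuming the Stan program also generates replicated case counts; the variable name pred_cases and everything else below are my assumptions, not something sir_infer.stan necessarily contains:

import numpy as np
import matplotlib.pyplot as plt

pred = fit.stan_variable('pred_cases')          # draws x n_days replicated counts
lo, med, hi = np.percentile(pred, [5, 50, 95], axis=0)

days = np.arange(1, pred.shape[1] + 1)
plt.fill_between(days, lo, hi, alpha=0.3, label='90% posterior predictive band')
plt.plot(days, med, label='predictive median')
plt.plot(days, cases, 'o', label='observed cases')  # cases: the counts we fit
plt.legend()
plt.show()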

We could refine this influenza model with more specific priors or account for seasonal patterns, but for now, this demonstrates that our basic framework works on real data. The next step is adapting these methods for dengue's unique epidemiological features: vector dynamics, climate dependence, and serotype interactions that make dengue modeling considerably more complex.

flu_inference_results.png

Model limitations

Our SIR framework makes several strong assumptions that merit discussion. We assume homogeneous mixing (everyone contacts everyone else equally), constant transmission and recovery rates, and no births, deaths, or migration during the epidemic (a reasonable simplification for a short outbreak in a place like Chile, where the population is effectively constant over that timescale). Real flu outbreaks violate these assumptions through age-structured transmission, changing behavior during epidemics, and seasonal effects. The negative binomial likelihood partially addresses these limitations by allowing extra variance, but it's a statistical bandage rather than a mechanistic solution.

The Stan implementation requires careful attention to numerical stability. The ODE solver can fail with extreme parameter values, so our priors implicitly constrain the search space to numerically stable regions. Additionally, the negative binomial parameterization using phi (dispersion) rather than traditional r and p parameters improves MCMC efficiency by reducing parameter correlation.

Conclusion

This foundational post establishes the core Stan implementation patterns we'll need for dengue modeling. In the next posts, we'll extend these methods to handle vector-borne transmission, incorporating mosquito population dynamics, seasonal climate effects, and the multiple dengue serotypes that complicate real-world dengue surveillance and prediction.

This post also appears on Substack.

This post has been written with the support of Fondo de Investigación y Desarrollo en Salud – Fonis, Project SA24I0124

Now you can go home or read the second part of this series.