Introduction to Computational Causal Inference Using Reproducible Stata, R, and Python Code: A Tutorial

Overview

Journal Stat Med

Publisher Wiley

Specialty Public Health

Date 2021 Oct 29

PMID 34713468

Citations 19

Authors

Matthew J Smith

Mohammad A Mansournia

Camille Maringe

Paul N Zivich

Stephen R Cole

Clemence Leyrat

Aurelien Belot

Bernard Rachet

Miguel A Luque-Fernandez

Abstract

The main purpose of many medical studies is to estimate the effect of a treatment or exposure on an outcome. However, it is not always possible to randomize study participants to a particular treatment, so observational study designs may be used instead. Observational studies pose major challenges, one of which is confounding. Confounding is commonly controlled by direct adjustment for measured confounders, although this approach is sometimes suboptimal because of modeling assumptions and misspecification. Recent advances in the field of causal inference have dealt with confounding by building on classical standardization methods. However, these advances have progressed quickly, with a relative paucity of computationally oriented applied tutorials, contributing to some confusion among applied researchers about how to use these methods. In this tutorial, we show the computational implementation of different causal inference estimators from a historical perspective, in which new estimators were developed to overcome the limitations of previous ones (ie, nonparametric and parametric g-formula, inverse probability weighting, double-robust, and data-adaptive estimators). We illustrate the implementation of these methods using an empirical example from the Connors study based on intensive care medicine and, most importantly, we provide reproducible and commented code in Stata, R, and Python for researchers to adapt to their own observational studies. The code can be accessed at https://github.com/migariane/Tutorial_Computational_Causal_Inference_Estimators.
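To give a feel for three of the estimator families named in the abstract, here is a minimal Python sketch. It is not the authors' code and does not use the Connors data: it assumes a simulated data set with a single binary confounder `w`, a binary treatment `a`, and a binary outcome `y`, with a true risk difference of 0.2 by construction. All function names (`simulate`, `g_formula`, `ipw`, `aipw`, and the helpers) are hypothetical and chosen only for illustration.

```python
import random


def simulate(n=50_000, seed=0):
    """Simulate (w, a, y) records. ASSUMPTION: toy data, not the Connors study.

    w confounds the a -> y relation; the true risk difference is 0.2.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w = int(rng.random() < 0.4)                      # binary confounder
        a = int(rng.random() < 0.2 + 0.5 * w)            # treatment depends on w
        y = int(rng.random() < 0.1 + 0.2 * a + 0.3 * w)  # outcome depends on a and w
        data.append((w, a, y))
    return data


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def naive_difference(data):
    """Crude risk difference, ignoring the confounder."""
    return (mean(y for _, a, y in data if a == 1)
            - mean(y for _, a, y in data if a == 0))


def outcome_means(data, treatment):
    """Nonparametric outcome model: mean of y within each level of w."""
    levels = {w for w, _, _ in data}
    return {lv: mean(y for w, a, y in data if w == lv and a == treatment)
            for lv in levels}


def propensity(data):
    """Nonparametric propensity score: P(a = 1 | w) within each level of w."""
    levels = {w for w, _, _ in data}
    return {lv: mean(a for w, a, _ in data if w == lv) for lv in levels}


def g_formula(data):
    """Nonparametric g-formula: standardize stratum-specific risks over w."""
    m1, m0 = outcome_means(data, 1), outcome_means(data, 0)
    return mean(m1[w] - m0[w] for w, _, _ in data)


def ipw(data):
    """Inverse probability weighting with unstabilized weights."""
    ps = propensity(data)
    return (mean(a * y / ps[w] for w, a, y in data)
            - mean((1 - a) * y / (1 - ps[w]) for w, a, y in data))


def aipw(data):
    """Augmented IPW, one double-robust estimator."""
    ps = propensity(data)
    m1, m0 = outcome_means(data, 1), outcome_means(data, 0)
    return mean(m1[w] - m0[w]
                + a * (y - m1[w]) / ps[w]
                - (1 - a) * (y - m0[w]) / (1 - ps[w])
                for w, a, y in data)


data = simulate()
for name, estimator in [("naive", naive_difference), ("g-formula", g_formula),
                        ("IPW", ipw), ("AIPW", aipw)]:
    print(f"{name:>10}: {estimator(data):.3f}")
```

With a single binary confounder, both nuisance models are saturated, so the three adjusted estimates agree up to floating-point error, while the naive contrast is biased away from 0.2. With continuous or high-dimensional confounders, the outcome and propensity models would normally be fitted parametrically or data-adaptively, and the estimators could then differ.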

Citing Articles

Guidelines and Best Practices for the Use of Targeted Maximum Likelihood and Machine Learning When Estimating Causal Effects of Exposures on Time-To-Event Outcomes.

Talbot D, Diop A, Mesidor M, Chiu Y, Sirois C, Spieker A. Stat Med. 2025; 44(6):e70034.

PMID: 40079648 PMC: 11905698. DOI: 10.1002/sim.70034.

So Many Choices: A Guide to Selecting Among Methods to Adjust for Observed Confounders.

Keele L, Grieve R. Stat Med. 2025; 44(5):e10336.

PMID: 39947224 PMC: 11825193. DOI: 10.1002/sim.10336.

Interaction between opium use and cigarette smoking on bladder cancer: An inverse probability weighting approach based on a multicenter case-control study in Iran.

Akrami R, Hadji M, Rashidian H, Nazemipour M, Naghibzadeh-Tahami A, Ansari-Moghaddam A. Glob Epidemiol. 2025; 9:100182.

PMID: 39846054 PMC: 11751544. DOI: 10.1016/j.gloepi.2024.100182.

Comparative effectiveness of laparoscopic versus open colectomy in colon cancer patients: a study protocol for emulating a target trial using cancer registry data.

Abera S, Robers G, Kastner A, Stentzel U, Weitmann K, Hoffmann W. J Cancer Res Clin Oncol. 2025; 151(1):34.

PMID: 39798018 PMC: 11724780. DOI: 10.1007/s00432-024-06057-x.

Machine learning in causal inference for epidemiology.

Moccia C, Moirano G, Popovic M, Pizzi C, Fariselli P, Richiardi L. Eur J Epidemiol. 2024; 39(10):1097-1108.

PMID: 39535572 PMC: 11599438. DOI: 10.1007/s10654-024-01173-x.

References

1. Naimi A, Mishler A, Kennedy E. Challenges in Obtaining Valid Causal Effect Estimates with Machine Learning Algorithms. Am J Epidemiol. 2021; 192(9). DOI: 10.1093/aje/kwab201.

2. Rubin D. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Stat Med. 2006; 26(1):20-36. DOI: 10.1002/sim.2739.

3. Bang H, Robins J. Doubly robust estimation in missing data and causal inference models. Biometrics. 2006; 61(4):962-73. DOI: 10.1111/j.1541-0420.2005.00377.x.

4. Jung K, Lee J, Gupta V, Cho G. Comparison of Bootstrap Confidence Interval Methods for GSCA Using a Monte Carlo Simulation. Front Psychol. 2019; 10:2215. PMC: 6797821. DOI: 10.3389/fpsyg.2019.02215.

5. Luque-Fernandez M, Belot A, Valeri L, Cerulli G, Maringe C, Rachet B. Data-Adaptive Estimation for Double-Robust Methods in Population-Based Cancer Epidemiology: Risk Differences for Lung Cancer Mortality by Emergency Presentation. Am J Epidemiol. 2017; 187(4):871-878. PMC: 5888939. DOI: 10.1093/aje/kwx317.

6. Zivich P, Breskin A. Machine Learning for Causal Inference: On the Use of Cross-fit Estimators. Epidemiology. 2021; 32(3):393-401. PMC: 8012235. DOI: 10.1097/EDE.0000000000001332.

7. van der Laan M, Polley E, Hubbard A. Super learner. Stat Appl Genet Mol Biol. 2007; 6:Article25. DOI: 10.2202/1544-6115.1309.

8. Gutman R, Rubin D. Estimation of causal effects of binary treatments in unconfounded studies. Stat Med. 2015; 34(26):3381-98. PMC: 4782596. DOI: 10.1002/sim.6532.

9. Connors Jr A, Speroff T, Dawson N, Thomas C, Harrell Jr F, Wagner D. The effectiveness of right heart catheterization in the initial care of critically ill patients. SUPPORT Investigators. JAMA. 1996; 276(11):889-97. DOI: 10.1001/jama.276.11.889.

10. Luque-Fernandez M, Redondo-Sanchez D, Schomaker M. Effect Modification and Collapsibility in Evaluations of Public Health Interventions. Am J Public Health. 2019; 109(3):e12-e13. PMC: 6366494. DOI: 10.2105/AJPH.2018.304916.

11. Webster-Clark M, Sturmer T, Wang T, Man K, Marinac-Dabic D, Rothman K. Using propensity scores to estimate effects of treatment initiation decisions: State of the science. Stat Med. 2020; 40(7):1718-1735. DOI: 10.1002/sim.8866.

12. Hernan M, Brumback B, Robins J. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000; 11(5):561-70. DOI: 10.1097/00001648-200009000-00012.

13. Austin P. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009; 28(25):3083-107. PMC: 3472075. DOI: 10.1002/sim.3697.

14. Luque-Fernandez M, Schomaker M, Rachet B, Schnitzer M. Targeted maximum likelihood estimation for a binary treatment: A tutorial. Stat Med. 2018; 37(16):2530-2546. PMC: 6032875. DOI: 10.1002/sim.7628.

15. Keil A, Edwards J, Richardson D, Naimi A, Cole S. The parametric g-formula for time-to-event data: intuition and a worked example. Epidemiology. 2014; 25(6):889-97. PMC: 4310506. DOI: 10.1097/EDE.0000000000000160.

16. Goetghebeur E, le Cessie S, De Stavola B, Moodie E, Waernbaum I. Formulating causal questions and principled statistical answers. Stat Med. 2020; 39(30):4922-4948. PMC: 7756489. DOI: 10.1002/sim.8741.

17. Schuler M, Rose S. Targeted Maximum Likelihood Estimation for Causal Inference in Observational Studies. Am J Epidemiol. 2016; 185(1):65-73. DOI: 10.1093/aje/kww165.

18. Tsiatis A, Davidian M. Comment: Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data. Stat Sci. 2008; 22(4):569-573. PMC: 2397555. DOI: 10.1214/07-STS227.

19. Mansournia M, Nazemipour M, Naimi A, Collins G, Campbell M. Reflection on modern methods: demystifying robust standard errors for epidemiologists. Int J Epidemiol. 2020; 50(1):346-351. DOI: 10.1093/ije/dyaa260.

20. Cole S, Hernan M. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008; 168(6):656-64. PMC: 2732954. DOI: 10.1093/aje/kwn164.