Simulation of Complex Data Structures for Planning of Studies with Focus on Biomarker Comparison

Overview

Journal BMC Med Res Methodol

Publisher Biomed Central

Specialties General Medicine, Health Services

Date 2017 Jun 15

PMID 28610631

Citations 3

Authors

Andreas Schulz

Daniela Zoller

Stefan Nickels

Manfred E Beutel

Maria Blettner

Philipp S Wild

Harald Binder

Abstract

Background: A growing number of observational studies go beyond single biomarkers for predicting an outcome event and address questions in a multivariable setting. For example, when quantifying the added value of new biomarkers in addition to established risk factors, the aim might be to rank several new markers by their prediction performance. The correlation structure of the markers therefore has to be considered when planning such a study. Because of this complexity, a simulation approach may be required to adequately assess sample size or other aspects, such as the choice of a performance measure.

Methods: In a simulation study based on real data, we investigated how to generate covariates with realistic distributions and which generating model should be used for the outcome, aiming to determine the minimum amount of information and complexity needed to obtain realistic results. A large epidemiological cohort study, the Gutenberg Health Study, served as the basis for the simulation. The added value of markers was quantified and ranked in data sets subsampled from this population, and simulation approaches were judged by the quality of the resulting ranking. One of the evaluated approaches, the random forest, requires original data at the individual level. Therefore, the effect of the size of a pilot study on random forest based simulation was also investigated.
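The subsampling evaluation described in the Methods can be illustrated with a small sketch. It is not the authors' procedure: the synthetic population, the effect sizes, the shared latent factor that induces marker correlation, and all function names are assumptions made for illustration. Each marker is scored by its univariable AUC as a simple stand-in for added value, and ranking quality is measured as the mean Spearman correlation between the ranking in each subsample and the ranking in the full population.

```python
import math
import random

def auc(scores, labels):
    """AUC via the Mann-Whitney rank statistic; tied scores get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def simulate_population(n, effects, shared=0.6, seed=1):
    """Hypothetical population: correlated markers, logistic outcome model."""
    rng = random.Random(seed)
    markers, outcome = [], []
    for _ in range(n):
        latent = rng.gauss(0, 1)  # shared factor makes the markers correlated
        x = [shared * latent + math.sqrt(1 - shared ** 2) * rng.gauss(0, 1)
             for _ in effects]
        linear = -2.0 + sum(b * v for b, v in zip(effects, x))
        p = 1 / (1 + math.exp(-linear))
        markers.append(x)
        outcome.append(1 if rng.random() < p else 0)
    return markers, outcome

def rank_markers(markers, outcome):
    """Marker indices ordered from highest to lowest univariable AUC."""
    n_markers = len(markers[0])
    aucs = [auc([row[j] for row in markers], outcome) for j in range(n_markers)]
    return sorted(range(n_markers), key=lambda j: -aucs[j])

def spearman(ranking_a, ranking_b):
    """Spearman correlation of two rankings of the same items (no ties)."""
    m = len(ranking_a)
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    d2 = sum((pos_a[item] - pos_b[item]) ** 2 for item in pos_a)
    return 1 - 6 * d2 / (m * (m * m - 1))

def ranking_quality(markers, outcome, subsample_size, n_repeats=50, seed=2):
    """Mean agreement between subsample rankings and the population ranking."""
    rng = random.Random(seed)
    reference = rank_markers(markers, outcome)
    values = []
    for _ in range(n_repeats):
        idx = rng.sample(range(len(outcome)), subsample_size)
        sub_y = [outcome[i] for i in idx]
        if 0 < sum(sub_y) < len(sub_y):  # AUC needs both outcome classes
            sub_x = [markers[i] for i in idx]
            values.append(spearman(reference, rank_markers(sub_x, sub_y)))
    return sum(values) / len(values)

if __name__ == "__main__":
    X, y = simulate_population(5000, effects=[0.8, 0.4, 0.2, 0.0])
    for size in (100, 250, 1000):
        print(size, round(ranking_quality(X, y, size), 3))
```

In this setup one would expect the agreement to increase with the subsample size, which is the kind of information a sample size assessment by simulation is meant to provide.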

Results: We found that simple logistic regression models failed to generate realistic data, even with extensions such as interaction terms or non-linear effects. The random forest approach appeared more appropriate for simulating complex data structures. Pilot studies with about 250 observations or more appeared to provide a reasonable level of information for this approach.
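One way a random forest could be used to generate data from a pilot study is sketched below. This is an illustration under stated assumptions, not the procedure from the paper: covariates are generated by resampling whole pilot rows, the outcome is drawn from the event probability predicted by a forest fitted to the pilot data, and the function name, the hyperparameters, and the use of NumPy and scikit-learn are choices made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def simulate_from_pilot(pilot_X, pilot_y, n_sim, seed=0):
    """Generate a synthetic data set of size n_sim from individual-level pilot data."""
    rng = np.random.default_rng(seed)
    pilot_X = np.asarray(pilot_X, dtype=float)
    pilot_y = np.asarray(pilot_y, dtype=int)

    # Outcome model: a forest used as a probability estimator.
    # Larger leaves give smoother risk estimates (assumed setting).
    forest = RandomForestClassifier(
        n_estimators=200, min_samples_leaf=10, random_state=seed
    )
    forest.fit(pilot_X, pilot_y)

    # Covariates: resample complete pilot rows with replacement so that the
    # joint marker distribution, including correlations, is carried over.
    rows = rng.integers(0, len(pilot_X), size=n_sim)
    sim_X = pilot_X[rows]

    # Outcome: Bernoulli draw from the predicted probability of class 1.
    event_col = list(forest.classes_).index(1)
    risk = forest.predict_proba(sim_X)[:, event_col]
    sim_y = rng.binomial(1, risk)
    return sim_X, sim_y

if __name__ == "__main__":
    # Hypothetical pilot study: 250 observations, outcome driven by an interaction.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(250, 3))
    p = 1 / (1 + np.exp(-(-1 + X[:, 0] * X[:, 1])))
    y = rng.binomial(1, p)
    sim_X, sim_y = simulate_from_pilot(X, y, n_sim=1000)
    print(sim_X.shape, sim_y.mean())
```

Because the forest predicts on rows it was trained on, the simulated risks are in-sample estimates; whether that matters depends on the leaf size and the pilot sample size.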

Conclusions: We advise against using oversimplified regression models for simulation, in particular when focusing on multivariable research questions. More generally, a simulation should be based on real data to adequately reflect complex observational data structures, such as those found in epidemiological cohort studies.

Citing Articles

Machine-Learning vs. Expert-Opinion Driven Logistic Regression Modelling for Predicting 30-Day Unplanned Rehospitalisation in Preterm Babies: A Prospective, Population-Based Study (EPIPAGE 2).

Reed R, Morgan A, Zeitlin J, Jarreau P, Torchin H, Pierrat V. Front Pediatr. 2021; 8:585868.

PMID: 33614539 PMC: 7886676. DOI: 10.3389/fped.2020.585868.

Comparison of statistical and machine learning models for healthcare cost data: a simulation study motivated by Oncology Care Model (OCM) data.

Mazumdar M, Lin J, Zhang W, Li L, Liu M, Dharmarajan K. BMC Health Serv Res. 2020; 20(1):350.

PMID: 32334595 PMC: 7183716. DOI: 10.1186/s12913-020-05148-y.

Integrated Chemometrics and Statistics to Drive Successful Proteomics Biomarker Discovery.

Suppers A, van Gool A, Wessels H. Proteomes. 2018; 6(2).

PMID: 29701723 PMC: 6027525. DOI: 10.3390/proteomes6020020.
