
Simulation of Complex Data Structures for Planning of Studies with Focus on Biomarker Comparison

Overview
Publisher: BioMed Central
Date: 2017 Jun 15
PMID: 28610631
Citations: 3
Abstract

Background: A growing number of observational studies do not focus only on single biomarkers for predicting an outcome event but address questions in a multivariable setting. For example, when quantifying the added value of new biomarkers over established risk factors, the aim might be to rank several new markers by their prediction performance. This makes it important to consider the correlation structure of the markers when planning such a study. Because of this complexity, a simulation approach may be required to adequately assess sample size or other aspects, such as the choice of a performance measure.

Methods: In a simulation study based on real data, we investigated how to generate covariates with realistic distributions and which generating model should be used for the outcome, aiming to determine the least amount of information and complexity needed to obtain realistic results. A large epidemiological cohort study, the Gutenberg Health Study, served as the basis for the simulation. The added value of markers was quantified and ranked in data sets subsampled from this population, and simulation approaches were judged by the quality of the ranking they produced. One of the evaluated approaches, the random forest, requires original data at the individual level. Therefore, the effect of pilot study size on random forest based simulation was also investigated.

Results: Simple logistic regression models failed to generate realistic data, even with extensions such as interaction terms or non-linear effects. The random forest approach appeared more appropriate for simulating complex data structures. Pilot studies with about 250 or more observations provided a reasonable level of information for this approach.

Conclusions: We advise against oversimplified regression models for simulation, in particular when focusing on multivariable research questions. More generally, a simulation should be based on real data in order to adequately reflect complex observational data structures, such as those found in epidemiological cohort studies.
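
The subsampling evaluation described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: the "population" is synthetic rather than the Gutenberg Health Study, markers are correlated through a single shared latent factor, the outcome comes from a plain logistic model (the kind of generator the abstract warns may be too simple), and markers are ranked by their individual AUC instead of by added value over established risk factors. All function names, parameters, and effect sizes are hypothetical.

```python
import math
import random


def simulate_population(n, effects, shared_weight=0.5, seed=1):
    """Draw n subjects with correlated markers and a binary outcome.

    Assumption: markers share one latent factor (which makes them
    correlated) and the outcome follows a logistic model.
    """
    rng = random.Random(seed)
    markers, outcome = [], []
    for _ in range(n):
        latent = rng.gauss(0, 1)
        x = [shared_weight * latent + rng.gauss(0, 1) for _ in effects]
        linear = -1.0 + sum(b * v for b, v in zip(effects, x))
        p = 1.0 / (1.0 + math.exp(-linear))
        markers.append(x)
        outcome.append(1 if rng.random() < p else 0)
    return markers, outcome


def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank statistic."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # tied scores receive their average rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def rank_markers(markers, outcome):
    """Return marker indices ordered from highest to lowest AUC."""
    n_markers = len(markers[0])
    aucs = [auc([row[m] for row in markers], outcome) for m in range(n_markers)]
    return sorted(range(n_markers), key=lambda m: -aucs[m])


def ranking_agreement(markers, outcome, sample_size, n_rep=200, seed=2):
    """Fraction of subsamples whose ranking equals the population ranking.

    Subsamples that contain only one outcome class count as misses.
    """
    rng = random.Random(seed)
    reference = rank_markers(markers, outcome)
    hits = 0
    for _ in range(n_rep):
        idx = rng.sample(range(len(outcome)), sample_size)
        sub_y = [outcome[i] for i in idx]
        if 0 < sum(sub_y) < len(sub_y):
            sub_x = [markers[i] for i in idx]
            hits += rank_markers(sub_x, sub_y) == reference
    return hits / n_rep


if __name__ == "__main__":
    # Hypothetical effect sizes for three markers.
    pop_x, pop_y = simulate_population(5000, effects=[0.8, 0.4, 0.1])
    for size in (100, 250, 1000):
        print(size, ranking_agreement(pop_x, pop_y, size))
```

Printing the agreement for several subsample sizes gives a rough impression of how large a study would need to be before the marker ranking becomes stable under these assumed settings. Replacing `simulate_population` with a generator fitted to real pilot data, such as the random forest approach the abstract favours, would be the natural next step.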

Citing Articles

Machine-Learning vs. Expert-Opinion Driven Logistic Regression Modelling for Predicting 30-Day Unplanned Rehospitalisation in Preterm Babies: A Prospective, Population-Based Study (EPIPAGE 2).

Reed R, Morgan A, Zeitlin J, Jarreau P, Torchin H, Pierrat V. Front Pediatr. 2021; 8:585868.

PMID: 33614539 PMC: 7886676. DOI: 10.3389/fped.2020.585868.


Comparison of statistical and machine learning models for healthcare cost data: a simulation study motivated by Oncology Care Model (OCM) data.

Mazumdar M, Lin J, Zhang W, Li L, Liu M, Dharmarajan K. BMC Health Serv Res. 2020; 20(1):350.

PMID: 32334595 PMC: 7183716. DOI: 10.1186/s12913-020-05148-y.


Integrated Chemometrics and Statistics to Drive Successful Proteomics Biomarker Discovery.

Suppers A, van Gool A, Wessels H. Proteomes. 2018; 6(2).

PMID: 29701723 PMC: 6027525. DOI: 10.3390/proteomes6020020.
