
Fast, Accurate, and Racially Unbiased Pan-cancer Tumor-only Variant Calling with Tabular Machine Learning

Overview
Publisher Springer Nature
Specialty Oncology
Date 2023 Jan 7
PMID 36611079
Abstract

Accurately identifying somatic mutations is essential for precision oncology and crucial for calculating tumor-mutational burden (TMB), an important predictor of response to immunotherapy. For tumor-only variant calling (i.e., when the cancer biopsy but not the patient's normal tissue sample is sequenced), accurately distinguishing somatic mutations from germline variants is a challenging problem that, when unaddressed, results in unreliable, biased, and inflated TMB estimates. Here, we apply machine learning to the task of somatic vs. germline classification in tumor-only solid tumor samples using TabNet, XGBoost, and LightGBM, three machine-learning models for tabular data. We constructed a training set for supervised classification using features derived exclusively from tumor-only variant calling, drawing somatic and germline truth labels from an independent pipeline that uses the patient-matched normal samples. All three trained models achieved state-of-the-art performance on two holdout test datasets: a TCGA dataset including sarcoma, breast adenocarcinoma, and endometrial carcinoma samples (AUC > 94%), and a metastatic melanoma dataset (AUC > 85%). Concordance between matched-normal and tumor-only TMB improves from R = 0.006 to 0.71-0.76 with the addition of a machine-learning classifier, with LightGBM performing best. Notably, these machine-learning models generalize across cancer subtypes and capture kits with a call rate of 100%. We reproduce the recent finding that tumor-only TMB estimates for Black patients are extremely inflated relative to those for white patients due to the racial biases of germline databases. We show that our approach with XGBoost and LightGBM eliminates this significant racial bias in tumor-only variant calling.
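
To make the TMB concordance measure concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a classifier has already assigned each tumor-only variant a probability of being somatic, counts the variants above a cutoff per megabase of covered sequence, and compares the result with matched-normal TMB using Pearson's R. The function names, the 0.5 cutoff, the 2.0 Mb covered region, and all numbers are hypothetical toy values chosen for illustration.

```python
from math import sqrt


def tmb_per_mb(somatic_probs, covered_mb, threshold=0.5):
    """Count variants called somatic (probability >= threshold) per megabase.

    The threshold and the per-megabase normalization are assumptions made
    for this sketch, not values taken from the paper.
    """
    n_somatic = sum(1 for p in somatic_probs if p >= threshold)
    return n_somatic / covered_mb


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-sample classifier outputs: one probability per variant.
predicted_somatic_probs = {
    "sample_1": [0.9, 0.8, 0.1, 0.2],
    "sample_2": [0.95, 0.1, 0.05, 0.3],
    "sample_3": [0.7, 0.9, 0.85, 0.6],
}
covered_mb = 2.0  # hypothetical size of the sequenced region in megabases

# Tumor-only TMB after filtering with the classifier.
tumor_only_tmb = [
    tmb_per_mb(probs, covered_mb) for probs in predicted_somatic_probs.values()
]

# Hypothetical reference TMB from a matched-normal pipeline, same sample order.
matched_normal_tmb = [1.0, 0.5, 2.0]

concordance = pearson_r(tumor_only_tmb, matched_normal_tmb)
print(f"Pearson R between tumor-only and matched-normal TMB: {concordance:.3f}")
```

In this toy setup, setting the cutoff to 0 counts every variant as somatic, which mimics tumor-only calling with no germline filtering. All three samples then get the same inflated TMB, and R is undefined because the tumor-only values have zero variance.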

Citing Articles

Refined variant calling pipeline on RNA-seq data of breast cancer cell lines without matched-normal samples.

Eberth S, Koblitz J, Steenpass L, Pommerenke C BMC Res Notes. 2025; 18(1):67.

PMID: 39955561 PMC: 11829467. DOI: 10.1186/s13104-025-07140-3.


Transformers meets neoantigen detection: a systematic literature review.

Machaca V, Goyzueta V, Cruz M, Sejje E, Pilco L, Lopez J J Integr Bioinform. 2024; 21(2).

PMID: 38960869 PMC: 11377031. DOI: 10.1515/jib-2023-0043.


Improved detection of low-frequency within-host variants from deep sequencing: A case study with human papillomavirus.

Mishra S, Nelson C, Zhu B, Pinheiro M, Lee H, Dean M Virus Evol. 2024; 10(1):veae013.

PMID: 38455683 PMC: 10919477. DOI: 10.1093/ve/veae013.


Detection of mutant antigen-specific T cell receptors against multiple myeloma for T cell engineering.

Okada M, Shimizu K, Nakazato H, Yamasaki S, Fujii S Mol Ther Methods Clin Dev. 2023; 29:541-555.

PMID: 37359417 PMC: 10285226. DOI: 10.1016/j.omtm.2023.05.014.
