Contemporary Symbolic Regression Methods and Their Relative Performance

Overview
Date 2024 May 8
PMID 38715933
Abstract

Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. We address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems. Our assessment includes both real-world datasets with no known model form and ground-truth benchmark problems. For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity relative to state-of-the-art machine learning methods. For the synthetic problems, we assess each method's ability to find exact solutions in the presence of varying levels of noise. Under these controlled experiments, we conclude that the best-performing methods for real-world regression combine genetic algorithms with parameter estimation and/or semantic search drivers. When tasked with recovering exact equations in the presence of noise, we find that several approaches perform similarly. We provide a detailed guide to reproducing this experiment and contributing new methods, and encourage other researchers to collaborate with us on a common and living symbolic regression benchmark.
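The two quantities the abstract compares, error and complexity, and the idea of testing under increasing noise can be illustrated with a small sketch. This is not the benchmark's own code. The ground-truth expression, the candidate models, the node-count complexity measure, the noise levels, and the choice to scale Gaussian noise by the root mean square of the target are all assumptions made for illustration, and every function name here is hypothetical.

```python
import ast
import math
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination; 1.0 means a perfect fit."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def complexity(expr):
    """Size proxy (assumed): count operators, names, and constants in the parse tree."""
    counted = (ast.BinOp, ast.UnaryOp, ast.Name, ast.Constant)
    return sum(isinstance(n, counted) for n in ast.walk(ast.parse(expr, mode="eval")))


def predict(expr, xs):
    """Evaluate a candidate expression string at each x (trusted input only)."""
    code = compile(expr, "<model>", "eval")
    env = {"__builtins__": {}, "sin": math.sin, "cos": math.cos, "exp": math.exp}
    return [eval(code, env, {"x": x}) for x in xs]


def add_noise(ys, level, rng):
    """Add Gaussian noise with standard deviation level * RMS(ys) (assumed scheme)."""
    rms = math.sqrt(sum(y * y for y in ys) / len(ys))
    return [y + rng.gauss(0.0, level * rms) for y in ys]


rng = random.Random(0)
xs = [i / 10 for i in range(-30, 31)]
clean = predict("x*x + 2.0", xs)  # hypothetical ground-truth problem
candidates = ["x*x + 2.0", "x*x", "2.0*x + 1.0"]  # hypothetical recovered models

for level in (0.0, 0.01, 0.1):
    noisy = add_noise(clean, level, rng)
    for expr in candidates:
        score = r2_score(noisy, predict(expr, xs))
        print(f"noise={level:<5} model={expr:<12} r2={score:.3f} size={complexity(expr)}")
```

Reporting fit and size side by side, rather than folding them into one score, keeps the trade-off visible: a model can be compared against alternatives that are slightly less accurate but much smaller.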

Citing Articles

Using interpretable machine learning to predict bloodstream infection and antimicrobial resistance in patients admitted to ICU: Early alert predictors based on EHR data to guide antimicrobial stewardship.

Ferrari D, Arina P, Edgeworth J, Curcin V, Guidetti V, Mandreoli F PLOS Digit Health. 2024; 3(10):e0000641.

PMID: 39413052 PMC: 11482717. DOI: 10.1371/journal.pdig.0000641.


Achieving Occam's razor: Deep learning for optimal model reduction.

Antal B, Chesebro A, Strey H, Mujica-Parodi L, Weistuch C PLoS Comput Biol. 2024; 20(7):e1012283.

PMID: 39024398 PMC: 11288447. DOI: 10.1371/journal.pcbi.1012283.


Distilling identifiable and interpretable dynamic models from biological data.

Massonis G, Villaverde A, Banga J PLoS Comput Biol. 2023; 19(10):e1011014.

PMID: 37851682 PMC: 10615316. DOI: 10.1371/journal.pcbi.1011014.


Exploring genetic influences on adverse outcome pathways using heuristic simulation and graph data science.

Romano J, Mei L, Senn J, Moore J, Mortensen H Comput Toxicol. 2023; 25.

PMID: 37829618 PMC: 10569310. DOI: 10.1016/j.comtox.2023.100261.


Artificial Intelligence in Physical Sciences: Symbolic Regression Trends and Perspectives.

Angelis D, Sofos F, Karakasidis T Arch Comput Methods Eng. 2023; :1-21.

PMID: 37359747 PMC: 10113133. DOI: 10.1007/s11831-023-09922-z.

