The primary setting for this work is Symbolic Regression (SR), the task of deriving mathematical formulas from observational data without any foreknowledge of the domain or problem. In essence, SR is the scientific process performed by a computer: hypotheses are formulated, tested against the observations, and compared for explanatory value. Because the resulting models are mathematical formulas, SR can assist domain experts in almost all fields.
The main contribution of this work is Prioritized Grammar Enumeration (PGE), a deterministic machine learning algorithm for solving Symbolic Regression. Working from a grammar's rules, PGE prioritizes the enumeration of mathematical expressions in order to find the best-fit model. By recognizing large overlaps in the search space and introducing mechanisms for memoization, PGE can explore the space of all equations efficiently. Most notably, PGE provides reproducibility of results, a key requirement for any system used by scientists at large.
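The core idea of prioritized enumeration with memoization can be illustrated with a toy sketch. This is not the authors' implementation; the names `candidates`, `error`, and `pge_sketch`, the tiny grammar, and the error-plus-size priority are all hypothetical simplifications chosen for illustration. Candidate expressions are popped best-first from a priority queue, expanded by a few grammar-like rules, and a memo set prevents re-evaluating duplicate forms:

```python
# Toy sketch of prioritized enumeration with memoization, in the spirit of
# PGE (NOT the dissertation's implementation). All names are hypothetical.
import heapq
import math

def candidates(expr):
    """Expand an expression string using a few hypothetical grammar rules."""
    return [f"({expr}+x)", f"({expr}*x)", f"sin({expr})"]

def error(expr, data):
    """Sum of squared residuals of expr (in variable x) over (x, y) pairs."""
    return sum((eval(expr, {"sin": math.sin, "x": x}) - y) ** 2 for x, y in data)

def pge_sketch(data, steps=50):
    """Enumerate expressions best-first; return (priority, expr) of the best seen."""
    seen = {"x"}                       # memoization: never revisit a form
    queue = [(error("x", data), "x")]  # priority queue ordered by fit + size
    best = queue[0]
    for _ in range(steps):
        if not queue:
            break
        prio, expr = heapq.heappop(queue)
        if prio < best[0]:
            best = (prio, expr)
        for nxt in candidates(expr):
            if nxt not in seen:
                seen.add(nxt)
                # size term penalizes complexity, trading fit for parsimony
                heapq.heappush(queue, (error(nxt, data) + len(nxt), nxt))
    return best

if __name__ == "__main__":
    data = [(x, 2 * x) for x in range(10)]  # samples from y = 2x
    print(pge_sketch(data))                 # recovers (x+x), i.e. 2x
```

The priority queue is what makes the search deterministic: given the same grammar, data, and priority function, the expressions are always visited in the same order, which is the reproducibility property the abstract highlights.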
By the author
Symbolic Regression for mathematical discovery
Motivation
Overview, context, details, implementations
The Problem - Definition, classes, applications
Standing on the shoulders of others
Genetic Programming - The original Symbolic Regression algorithm
Deterministic, reproducible, and reliable Symbolic Regression
Theory - Rethinking the Symbolic Regression problem
Overcoming limitations in the original formulation
Decoupling - Separating into scalable services
Testing the PGE algorithm
Overview - Synopsis for the benchmarks and purposes
Final thoughts and places to go
Open-source projects for PGE
Comparative results for the PyPGE, DEAP, and FFX Python packages
The system classes and equations used for benchmarking
Explicit Equations - Benchmarks from the SR literature
A quick reference