
Foreword



This is the online form of my dissertation, the subject of which is Symbolic Regression.

It is a work in progress, with completion planned for the end of November 2015.

Cheers, Tony


Thank yous and appreciations

Talk about trials, tribulations, motivations



Tony Worm, 2015




##### A First Example

Let's begin with a simple example to motivate the need for Symbolic Regression tools. We'll then follow with a real, complex example that will stay with us throughout. Below, you will see a plot which compares test scores to hours studied. As one would expect, this looks like a strong linear relationship: scores go up with hours spent studying.

We know that you can only study so much before the law of diminishing returns and burnout set in. So how does a 2nd-order polynomial fit the data?
This model looks much better and the $$R^2$$ value has improved as well. How about a 3rd-order polynomial?
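The comparison of polynomial orders can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic (generated from a made-up quadratic plus noise, not the data behind the plots), and the helper names `polyfit`, `predict`, `r_squared`, and `compare_degrees` are hypothetical. It fits 1st-, 2nd-, and 3rd-order polynomials by ordinary least squares and reports $$R^2$$ for each.

```python
import random


def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ...] for c0 + c1*x + c2*x^2 + ...
    """
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(rows[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (rows[i][n] - tail) / rows[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def r_squared(ys, y_hats):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def compare_degrees(xs, ys, degrees=(1, 2, 3)):
    """Map each polynomial degree to the R^2 of its least-squares fit."""
    result = {}
    for d in degrees:
        coeffs = polyfit(xs, ys, d)
        result[d] = r_squared(ys, [predict(coeffs, x) for x in xs])
    return result


# Synthetic stand-in data (an assumption, not the data from the plots):
# scores rise with hours studied, then flatten as returns diminish.
random.seed(0)
hours = [0.25 * i for i in range(41)]
scores = [40.0 + 12.0 * h - 0.7 * h * h + random.gauss(0.0, 3.0) for h in hours]

r2_by_degree = compare_degrees(hours, scores)
for degree, r2 in r2_by_degree.items():
    print(f"degree {degree}: R^2 = {r2:.4f}")
```

Because each higher-order polynomial contains the lower-order ones as special cases, the training $$R^2$$ can never go down as the order increases, which is why $$R^2$$ alone cannot tell us when to stop.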
Even better, though only a marginal improvement. If we continue down this path, we will likely start overfitting. So which is it? Which model should we choose as our final model? The automated answer lies within model selection, a deep and complex problem. The breadth of model selection is largely beyond the scope of this work; however, PGE and SR implementations have an internal model selection scheme which is integral to their success. In the end, it is up to a domain expert to make the final determination.

##### A Motivating Example

Now for our complex example from a real-life system. The following chart shows the time series for seven variables related to yeast processing sugar. Mmmmmm beer! (or more generally, [The Art of Fermentation](http://www.amazon.com/The-Art-Fermentation-Exploration-Essential-ebook/dp/B0083JQCF2))
It took humans years of study and analysis to formulate the differential equations for yeast metabolism [[Wolf:2000:BiochemJ](http://www.ncbi.nlm.nih.gov/pubmed/10702114)]. The set of equations helps scientists to understand the interactions and complex cellular behaviors. The equations also allow for increased accuracy in simulation under a greater diversity of situations. These capabilities enable researchers to rapidly prototype before moving to laboratory tests. In 2011, machines recovered the equations for the first time [[Schmidt:2011:PhysBiol](http://www.ncbi.nlm.nih.gov/pubmed/21832805)]. With Symbolic Regression technology, we can help scientists find useful models that shed light on the problems they study. The Genetic Programming methods first employed on this problem required computational time on the order of hours. This is an incredible feat. Later we will see how Prioritized Grammar Enumeration can solve this same task on the order of minutes.
##### Contributions

The main contribution of this work is Prioritized Grammar Enumeration (PGE), a method which can derive mathematical formulas from data in an efficient and reproducible way. We step back from mainstream Genetic Programming (GP) thinking to fundamentally change the way we approach the Symbolic Regression (SR) problem. The overall goal of SR is to produce equations which balance the trade-off between accuracy and complexity; the method of getting there need not be fixed. By reconsidering the fundamental approach to the SR problem, we believe that PGE is an evolution in thought, leading the way to a reliable SR technology.

PGE brief PGE basic

1. Redefining the problem
1. Creating a deterministic and reproducible algorithm

We then enhance the PGE algorithm in several ways.

1. More application domains through differential equations
1. Deeper abstractions and richer relationships
1. Scaling the algorithm by decoupling into services
1. Combining multiple experiments to return more symbolic expressions
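One common way to make the accuracy versus complexity trade-off concrete is to keep only the candidate equations that no other candidate beats on both measures at once (the Pareto front). The sketch below is illustrative only and is not claimed to be PGE's internal selection scheme. The `Candidate` type, the complexity counts, and the error values are all hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    expr: str        # human-readable form of the equation
    complexity: int  # e.g. node count of the expression tree (assumed measure)
    error: float     # fitting error on the data (lower is better)


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates not dominated in both complexity and error."""
    front = []
    for c in candidates:
        dominated = any(
            o.complexity <= c.complexity
            and o.error <= c.error
            and (o.complexity < c.complexity or o.error < c.error)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.complexity)


# Hypothetical candidates with made-up complexity and error values.
candidates = [
    Candidate("a*x", 3, 0.30),
    Candidate("a*x + b", 5, 0.12),
    Candidate("a*sin(x) + b", 6, 0.20),
    Candidate("a*x^2 + b*x + c", 11, 0.05),
    Candidate("a*x^3 + b*x^2 + c*x + d", 17, 0.049),
]

front = pareto_front(candidates)
for c in front:
    print(f"{c.complexity:>3}  {c.error:.3f}  {c.expr}")
```

Here `a*sin(x) + b` is dropped because `a*x + b` is both simpler and more accurate. The remaining equations each represent a different balance, and choosing among them is left to the domain expert.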
##### Results

Will fill this in later after the chapter(s) are written.

In [meier:2014:symbolic], PGE derived simple, compact functions for predicting precrash severity in automobile accidents in less than 2 ms, exceeding the required real-time constraint by several orders of magnitude.
##### Reproducibility

A running theme of this work is reproducibility. There have been many controversies recently... Estimates of irreproducibility...

[Oncology: 11%](http://www.nature.com/nature/journal/v483/n7391/full/483531a.html) [Computer Science: 25%](http://reproducibility.cs.arizona.edu/tr.pdf) [Psychology: 39%](http://www.sciencemag.org/content/349/6251/aac4716)

We champion reproducibility through:

1. A deterministic algorithm which reproduces its own results
1. An open source Python project (PyPGE)
1. Accurate portrayal of results

Each chapter will end with a section on reproducibility and the ideas relevant to that chapter.