##### A First Example
Let's begin with a simple example
to motivate the need for
Symbolic Regression tools.
We'll then move on to a
real and complex example
that will stay with us throughout.
Below, you will see a plot
which compares test scores
to hours studied.
As one would expect,
this looks like a strong linear relationship:
scores go up with
hours spent studying.
We know that you can only study so much
before the law of diminishing returns
and burnout set in.
So how well does a 2nd-order polynomial
fit the data?
This model looks much better and the $$R^2$$ value has improved as well.
How about a 3rd-order polynomial?
Even better, though only a marginal improvement.
If we continue down this path,
we will likely start overfitting.
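
To make the comparison concrete, here is a minimal sketch of fitting 1st-, 2nd-, and 3rd-order polynomials and computing $$R^2$$ for each. The data is synthetic and invented purely for illustration (it is not the data behind the plot), and the names `r_squared` and `fit_polynomials` are hypothetical helpers, not part of any PGE code.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_polynomials(x, y, degrees=(1, 2, 3)):
    """Least-squares fit of one polynomial per degree; returns {degree: R^2}."""
    results = {}
    for degree in degrees:
        coeffs = np.polyfit(x, y, degree)
        results[degree] = r_squared(y, np.polyval(coeffs, x))
    return results


# Synthetic stand-in data (an assumption): scores rise with hours studied,
# then flatten out, with some noise added.
rng = np.random.default_rng(0)
hours = np.linspace(0.0, 10.0, 40)
test_scores = 40.0 + 12.0 * hours - 0.7 * hours ** 2 + rng.normal(0.0, 3.0, hours.size)

r2_by_degree = fit_polynomials(hours, test_scores)
for degree, r2 in r2_by_degree.items():
    print(f"degree {degree}: R^2 = {r2:.4f}")
```

Because each higher-order polynomial contains the lower-order ones as special cases, $$R^2$$ on the fitted data can only stay the same or go up as the order increases. That is why $$R^2$$ alone cannot tell us when to stop.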
So which is it?
Which model should we choose as our final model?
The automated answer lies within model selection,
a deep and complex problem.
The breadth of model selection
is largely beyond the scope of this work.
However,
PGE and other SR implementations
have an internal model selection
scheme which is integral to their success.
In the end, it is up to a domain expert
to make the final determination.
##### A Motivating Example
Now for our complex example from a real-life system.
The following chart shows the time-series for
seven variables related to yeast processing sugar.
Mmmmmm beer!
(or more generally,
[The Art of Fermentation](http://www.amazon.com/The-Art-Fermentation-Exploration-Essential-ebook/dp/B0083JQCF2))
It took humans years of study and analysis
to formulate the differential equations for
yeast metabolism
[[Wolf:2000:BiochemJ](http://www.ncbi.nlm.nih.gov/pubmed/10702114)].
These equations
help scientists understand the
interactions and complex cellular behaviors.
They also
allow for increased accuracy in simulation
under a greater diversity of situations.
These capabilities
enable researchers to rapidly
prototype before moving to
laboratory tests.
In 2011, machines recovered the equations
for the first time
[[Schmidt:2011:PhysBiol](http://www.ncbi.nlm.nih.gov/pubmed/21832805)].
With Symbolic Regression technology,
we can help scientists find useful models
that shed light on the problems they study.
The Genetic Programming methods first employed
on this problem required computational time
on the order of hours.
This is an incredible feat.
Later we will see how
Prioritized Grammar Enumeration
can solve this same task
on the order of minutes.
##### Contributions
The main contribution of this work is
Prioritized Grammar Enumeration (PGE),
a method which can derive
mathematical formulas from data
in an efficient and reproducible way.
We step back from mainstream
Genetic Programming (GP) thinking
to fundamentally change the way we
approach the Symbolic Regression (SR) problem.
The overall goal of SR is to produce
equations which find a balance in the
trade-off between accuracy and complexity;
the method of getting there need
not be fixed.
By reconsidering the fundamental approach to the SR problem,
we believe that PGE is an evolution in thought,
leading the way to a reliable SR technology.
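
One common way to express the trade-off between accuracy and complexity is a Pareto front: keep every candidate equation that no other candidate beats on both measures at once. The sketch below illustrates that idea only. It is not a description of PGE's internal selection scheme, and the `Candidate` type, the complexity counts, and the error values are all made up for the example.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    expression: str   # human-readable form of the equation
    complexity: int   # e.g. number of nodes in the expression tree (assumed measure)
    error: float      # e.g. mean squared error on the data (assumed measure)


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates not dominated in both complexity and error."""
    front = []
    for c in candidates:
        dominated = any(
            o.complexity <= c.complexity
            and o.error <= c.error
            and (o.complexity < c.complexity or o.error < c.error)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.complexity)


# Hypothetical candidates with invented scores.
candidates = [
    Candidate("a*x + b", 3, 0.40),
    Candidate("a*x^2 + b*x + c", 5, 0.15),
    Candidate("a*sin(x) + b*x + c", 6, 0.30),
    Candidate("a*x^3 + b*x^2 + c*x + d", 7, 0.14),
]

front = pareto_front(candidates)
for c in front:
    print(f"{c.complexity:2d}  {c.error:.2f}  {c.expression}")
```

Here the sine model drops out because the quadratic is both simpler and more accurate. The three polynomials all remain, which leaves the final choice among them to a domain expert.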
PGE brief
PGE basic
1. Redefining the problem
1. Creating a deterministic and reproducible algorithm
We then enhance the PGE algorithm
in several ways.
1. More application domains through differential equations.
1. Deeper abstractions and richer relationships.
1. Scaling the algorithm by decoupling into services.
1. Combining multiple experiments to return more symbolic expressions.
##### Results
Will fill this in later after the chapter(s) are written.
In [meier:2014:symbolic],
PGE derived simple, compact functions for predicting
precrash severity in automobile accidents in less than 2ms,
exceeding the required real-time constraint
by several orders of magnitude.
##### Reproducibility
A running theme of this work is reproducibility.
There have been many controversies recently...
Estimates of irreproducibility...
[Oncology: 11%](http://www.nature.com/nature/journal/v483/n7391/full/483531a.html)
[Computer Science: 25%](http://reproducibility.cs.arizona.edu/tr.pdf)
[Psychology: 39%](http://www.sciencemag.org/content/349/6251/aac4716)
We champion reproducibility through
1. A deterministic algorithm which reproduces its own results
1. An open source Python project (PyPGE)
1. An accurate portrayal of results
Each chapter will end with a section on reproducibility
and the ideas relevant to that chapter.

## Foreword


This is the online form of my dissertation, the subject of which is Symbolic Regression.

It is a work in progress, with completion planned by the end of November 2015.

Cheers, Tony

Thank yous and appreciations

Talk about trials, tribulations, motivations

Tony Worm, 2015
