### Theory
1. **Removing Non-determinism** - Creation of a completely reproducible SR algorithm.
1. **Search Space Organization** - Understanding and designing the structure of the search space.
1. **Evaluating Forms Once** - Designing the model for detection and avoidance of overlapping subproblems.
PGE was born from the process of removing
random number generators from GP.
This required methods for
selection, survival, and replication
to all be replaced.
The replacement process
brought to light several issues
which have not been addressed
by the GP community.
The theory underlying PGE
is based on intuitive principles.
First, take into account the rules of algebra
to reduce model multiplicity into a canonical form,
thereby reducing search space complexity.
Second, design models so that they
are only handled once. This requires that
they are also designed with dynamic programming
techniques in mind.
Third, remove all sources of non-determinism
to create a completely reproducible algorithm.
This also simplifies development and debugging
of the software which implements the algorithm.
#### Removing Non-determinism
Removing sources of non-determinism
was a central theme to the development of PGE.
The aim was to produce an algorithm
which would give the same results with each invocation.
The reason for this goal was the
detrimental effects non-determinism has
on search algorithms. We believe
it to be an inherent and unavoidable
difficulty present in the GP algorithm.
You can find details on GP in [Chapter 7](/sr/07-gp).
To achieve deterministic behavior,
PGE makes no use of random number generation.
It performs the exact same algorithmic steps
given the same parameters and same training data.
This requires several aspects of the PGE algorithm
to be defined deterministically.
**Initialization** establishes the starting points
from which PGE searches. Determinism starts here
by constructing a set of basis functions
from the input variables.
**Prioritization** determines which points
in the space to search next. This process
uses a priority queue coupled with a
Pareto non-dominated sorting algorithm.
**Exploration** discovers new models by
making small changes to a given model.
To maintain determinism, PGE makes
all possible changes to a model,
given the particular exploration scheme in use.
**Communication** occurs when PGE uses
parallel execution. In order to maintain
deterministic behavior, well defined
synchronization points are used.
Details of each aspect will be expanded upon
in the upcoming sections and next chapter.
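As a hypothetical sketch of how deterministic initialization might look (the template strings and `C` coefficient marker here are assumptions for illustration, not PGE's actual basis set):

```python
# Hypothetical sketch of deterministic initialization: the same starting
# set of basis functions is built on every run, because the construction
# iterates over variables and templates in a fixed order (no sampling).
def initialize(variables):
    # assumed building-block templates; 'C' marks an abstract coefficient
    templates = ["C*{v}", "C*{v}**2", "C/{v}", "C*sin({v})"]
    basis = []
    for v in sorted(variables):   # fixed variable order
        for t in templates:       # fixed template order
            basis.append(t.format(v=v))
    return basis

print(initialize(["y", "x"]))
# The same inputs always yield the same list, in the same order.
```

Because there is no random sampling, two invocations with the same variables produce identical starting points, which is the property the section above requires.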
#### Search Space Organization
The size of the space defined by a grammar
is infinite, even when disregarding the
increase brought by considering real valued coefficients.
The main contributions are combinatorial components
at both the search space and model levels of the problem.
PGE's great efficiency gains are the result
of methods designed to combat the combinatorial components.
We begin at the model level, where the main tool
is the trie based representation.
It is dynamic programming which comes to the rescue
at the search space level.
**Tree Representation Combinatorics**
PGE primarily uses the trie based representation
for the models, as detailed in the last chapter.
The reason for this is the combinatorial
number of different binary tree representations
for the equivalent model.
These different representations arise
from permutations of the leaf nodes
and from variations in the tree structure.
Consider the equation `a•b•c` (Figure 4-1).
**Figure 4-1** - Combinatorics of the Tree
This equation has 12 different binary tree representations,
from six leaf permutations and two shape permutations.
If we account for both addition and multiplication,
the number of trees grows to 48.
As you can imagine, with
increasing numbers of operations and operands,
the tree complexity and the number of equivalent representations
undergo a combinatorial explosion.
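These counts can be checked directly: the number of binary trees is the number of tree shapes (a Catalan number) times the permutations of the leaves times the operator labelings of the internal nodes.

```python
from math import comb, factorial

def catalan(n):
    # number of binary tree shapes with n internal nodes
    return comb(2 * n, n) // (n + 1)

def tree_count(num_leaves, num_ops):
    internal = num_leaves - 1
    shapes = catalan(internal)           # distinct tree shapes
    leaf_orders = factorial(num_leaves)  # permutations of the leaves
    op_labels = num_ops ** internal      # operator choice per internal node
    return shapes * leaf_orders * op_labels

print(tree_count(3, 1))  # a*b*c, one operator: 2 shapes * 6 orders = 12
print(tree_count(3, 2))  # with both + and *: 12 * 4 = 48
```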
The advantage of the trie based model
is the ability to take advantage of the
commutative and associative properties
of addition and multiplication.
Because of these properties,
we are free to order the operands,
sub-trees in the model trie, as we choose.
Associativity is the freedom to rearrange parentheses
and is responsible for the multiple tree representations
of the same expression. By using the trie,
we flatten a binary tree and metaphorically
erase the parentheses from the expression.
Commutativity means that we can reorder
the operands to addition and multiplication.
By sorting them, we create a consistent ordering
which can be treated as the canonical form for a model.
In doing so, we remove the combinatorial element
associated with the permutation of leaf nodes.
Sorting is possible by
creating an ordering on node types.
This is easily achieved by assigning
an integer to each node type.
By turning binary trees into tries
and by sorting the tries,
each model can have a canonical form.
Figure 4-1, right column, shows
the canonical form of `a•b•c`.
After applying these reductions,
the original 48 models become just four.
The savings created by sorting
become even more important as equations
become more complex.
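A minimal sketch of this canonicalization (the integer node ordering here is an arbitrary assumption; PGE's actual implementation differs):

```python
# Sketch of canonicalization: n-ary nodes for commutative operators,
# a fixed integer assigned to each node type, and recursive sorting
# of operands by that integer.
NODE_ORDER = {"+": 0, "*": 1, "x": 2, "y": 3, "z": 4}

def canonical(node):
    # node is either a variable string or a tuple (op, [children])
    if isinstance(node, str):
        return node
    op, children = node
    kids = sorted((canonical(c) for c in children),
                  key=lambda c: NODE_ORDER[c if isinstance(c, str) else c[0]])
    return (op, kids)

# two orderings of x*y*z, flattened into a single n-ary node
t1 = ("*", ["z", "x", "y"])
t2 = ("*", ["y", "z", "x"])
print(canonical(t1) == canonical(t2))  # both reduce to the same form
```

Sorting each commutative node's children by a total order on node types is what makes the canonical form unique, collapsing the leaf permutations counted above.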
**Algebraic simplifications**
There are other sub-expressions which arise
that we would also like to remove and reduce.
Simplifications group like terms together ($$x+x$$),
replace sub-expressions which evaluate
to a constant value ($$\sin\pi$$, $$\frac{x}{x}$$, or $$x-x$$),
and reduce unnecessarily complex expressions such as $$\frac{1}{\frac{1}{x}}$$.
The resulting equations are equivalent,
simpler forms.
There is debate as to how
simplification affects the SR process
[ [kinzett:2008:using](), [kinzett:2009:online](), [mcree:2010:symbolic]() ].
Certainly, changing the tree affects
the Pareto trade-off, which in turn
has consequences for selection and
therefore search progress.
Questions still remain as to
how much and which simplifications should be applied,
which forms of the equation should be kept,
and the overall effects simplification has on the search process.
Despite these questions,
PGE still employs
the use of the aforementioned simplifications.
This is done to eliminate
those sub-expressions which
unnecessarily complicate the equation
and the overall search space.
If they were not replaced,
many equivalent variations of a good equation
could crowd out other potential solutions,
as they would be selected more often.
This would result in the algorithm
stagnating in a local optimum of the search space.
Simplifications effectively trim
redundant parts of the search space
and also yield results
which are more comprehensible.
PGE uses the simplification techniques
which are present in the SymPy Python library.
This and further implementation details
can be found in [Appendix-A1](/sr/A1-pypge).
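As a toy illustration of such rewrite rules (PGE delegates this work to SymPy; the function below is a hypothetical, minimal stand-in handling only a few of the patterns mentioned above):

```python
# Toy rewrite rules in the spirit of the simplifications described above.
# Expressions are binary tuples like ("-", a, b) or ("/", a, b), or atoms.
def simplify(expr):
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a, b = simplify(a), simplify(b)       # simplify children first
    if op == "-" and a == b:
        return 0                          # x - x -> 0
    if op == "/" and a == b:
        return 1                          # x / x -> 1
    if op == "/" and isinstance(b, tuple) and b[0] == "/" and b[1] == 1:
        # a / (1/x) -> a*x, and in particular 1/(1/x) -> x
        return b[2] if a == 1 else ("*", a, b[2])
    return (op, a, b)

print(simplify(("-", "x", "x")))          # 0
print(simplify(("/", 1, ("/", 1, "x"))))  # x
```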
#### Evaluating Forms Once
At its core, PGE is a dynamic programming algorithm
which aims to evaluate each sub-problem once.

In PGE and SR, a sub-problem is a specific form, namely the parse tree for an equation and its terminals. Recall that PGE uses placeholders for the parameters, so an equation form accounts for the existence of a parameter without knowing its value. The key to evaluating forms once is to fully optimize a form the first time it is seen and to remember the forms which have already been explored. PGE fully optimizes forms by fitting the valueless parameters with non-linear regression, thus assigning them values. In order to memoize the equation forms, they are converted to a linear representation and then recorded in a lookup structure called the Integer Prefix Trie. This is possible because each equation has a canonical form which is unique.

**Non-linear Regression** is not unique to PGE, or GP for that matter, but it is a central component of PGE's ability to evaluate an equation form once. As PGE encounters new equation forms, it fits the abstract parameters using non-linear regression, resulting in the 'best' version of that form on the training data. PGE uses the Levenberg-Marquardt (LM) optimization algorithm [CITE]() to fully fit an equation to a set of data. In cases where an equation is linear w.r.t. the parameters, the LM library returns in one iteration by using singular value decomposition (SVD).

PGE's treatment of parameters and use of non-linear regression enables the separation of the search for equation form from the optimization of a form's parameters. In this way, the search for form and the optimization of coefficients are split into separate tasks. This separation has profound implications for the efficiency and effectiveness of PGE. It enables an equation form to be fully evaluated once, removing the duplication of effort that GP experiences when using genetic operators to optimize the coefficients of an equation form. Some GP implementations have incorporated the use of regression for fitting coefficients.
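To illustrate the parameter-fitting step in the simplest case, here is a toy pure-Python least-squares fit of the abstract parameters in the assumed form $$c_0 + c_1 x$$ (an illustrative example; PGE itself uses a Levenberg-Marquardt library). Because this form is linear in its parameters, the fit is exact in one pass:

```python
# Illustrative only: fitting the abstract parameters of the form c0 + c1*x.
# Linear-in-parameter forms admit an exact closed-form least-squares
# solution; genuinely non-linear forms would need iterative LM steps.
def fit_linear_form(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    c1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    c0 = (sy - c1 * sx) / n
    return c0, c1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated by 1 + 2*x
print(fit_linear_form(xs, ys))     # recovers (1.0, 2.0)
```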
Such GP implementations, however, make no use of memoization, and therefore still suffer from the duplication of effort which results from multiple copies of the same equation.

**Memoization** Canonical forms for equations reduce one source of combinatorics in the SR problem. A second source comes from multiple derivations: multiple paths over the search graph which reach the same node. The diagram below demonstrates this.
**Figure #** - Multiple Derivations
In this diagram, we start with models
which consist of a single variable.
Now we could take each of these models
and add a different variable to it.
Consider though that we could
start with $$x$$ and add $$y$$
or
start with $$y$$ and then add $$x$$,
both arriving at the canonical form $$ x+y $$.
Repeat this process, adding a third variable,
and we have six paths to the model $$ x+y+z $$.
It is important that we detect this situation,
by determining that each of the six $$ x+y+z $$
are really the same model.
If we do not detect the separate models as one,
we would end up repeating a lot of work.
This repetition would become a combinatorial
explosion of exponential proportions.
PGE uses dynamic programming techniques
to detect multiple derivations and
consider an equation form only once.
This feature of PGE is enabled by
the Integer Prefix Trie and
an integer linear representation
for the equation models.
The next section will fill in the details
of both representation and memoization.
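The derivation collapse described above can be sketched in a few lines (illustrative only; the sorted-string canonical form here is an assumption for brevity):

```python
# Growing sums one variable at a time: every order of additions is a
# distinct derivation path, but canonicalizing (sorting) after each step
# means a memo set collapses them to one subproblem.
from itertools import permutations

def derive(order):
    terms = []
    for v in order:
        terms = sorted(terms + [v])   # canonicalize after each addition
    return "+".join(terms)

seen = set()
paths = 0
for order in permutations(["x", "y", "z"]):
    paths += 1
    seen.add(derive(order))

print(paths, len(seen))  # 6 derivation paths, 1 unique canonical model
```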

### Components
1. **Representation** - Trie and linear representations with valueless parameters
1. **Evaluation** - Nonlinear regression for parameters, fitness metric for models
1. **Optimization** - Exploration, prioritization, and regularization
1. **Memoization** - Detecting overlapping subproblems to avoid duplicate work

PGE has four components which enable its search capabilities. The first of these is the set of generating functions which produce new equations from previous equations. The second is the Integer Prefix Tree which enables memoization of equations from their serialized representation. The third component is non-linear regression, which separates form from optimization. The fourth component is the Pareto Priority Queue, which gives order to the search space by prioritizing equations over one another.

#### Representation

PGE uses multiple representations for an equation. The syntax trie is used for optimization. The integer sequence is used for memoization.

**Syntax Trie** PGE uses tries, where operators can have a variable number of children. In the n-ary trie representation, the associative operators can have n sub-trees, flattening and reducing the tree's size. This is a slight modification of the usual binary tree, affecting only the associative operators addition and multiplication. The n-ary trie does not change the modeling ability of the equation, but it will affect the trade-off between parsimony and accuracy. This in turn affects the selection operation of any SR implementation, though we do not investigate this issue here.

In addition to the reductions in parse tree size, the n-ary representation eases the design of sorting and simplification algorithms. These algorithms, described above, work within the parse tree to reshape equivalent trees into a canonical form. This effectively merges isomorphic equations, reducing the size of the search space while simultaneously adding structure to it.
**Linear integer sequence** is the prefix-notation representation of an equation, mapped to the integers. It is equivalent to the trie representation, with a one-to-one correspondence. The linear integer sequence is obtained by mapping each node type to a different integer and then performing a pre-order traversal of the trie, collecting the integers by appending them to a list.

The usefulness of this representation lies in implementation efficiency. Integers can be used for constant-time lookups, where otherwise we might be limited to a higher-complexity operation. The linear representation is mainly used in conjunction with the Integer Prefix Trie (IPT) during the memoization process. The IPT is an efficient data structure for recording and looking up equations and will be detailed shortly. Additionally, the linear representation can be used for interprocess and network communication. Its main benefit there is on the receiving side, where we can avoid the complexity of parsing the human-readable version of an equation.

**Abstract parameters** are the valueless placeholders used as coefficients in equations. They signify that a parameter is present and where in an equation it occurs. The benefit of abstract parameters is two-fold. First, equation form matching and memoization can happen without regard to their value. Second, they enable the optimization of the parameters to be performed in a separate process, via non-linear regression. This is the key to separating the optimization of a given form from the optimization which occurs at the global scale, in the space of all equations.

#### Evaluation

**Non-linear parameter fitting**, though not unique to PGE, or GP for that matter, is a central component of the PGE algorithm. It enables the separation of the search for form from the optimization of that form's parameters. Great gains in efficacy are realized by separating the search for equation form from the optimization of parameters.
PGE achieves this separation by using abstract parameters. Parameters are indexed into an array, which means they do not store their value internally. The indexed parameters contribute only to form, taking on value only after non-linear fitting. Handling parameters in this manner is analogous to the way variables are treated, which only contribute to fitness during evaluation.

**Fitness metrics** are composite metrics which combine a notion of accuracy and complexity. They are used to compare the relative fitness between two models or equations. They are the multi-objective value which SR implementations seek to optimize. The most common fitness metric uses the tree size and the mean squared error. Many values for accuracy and complexity may be used, and more than one may be used at a time. Additionally, the values in the fitness metric may be weighted to give preference. Another useful practice is to normalize the values used in the fitness metric, across the set, before sorting into successive non-dominated sets. The Pareto non-dominated sorting methods, described in the previous chapter, use fitness metrics to determine the successive non-dominated sets of equations.

#### Optimization

SR is a multi-objective optimization task which seeks to maximize accuracy while minimizing complexity. To search over equation space, PGE must move within the space and decide where and how to move. PGE uses exploration operators derived from the rules of the grammar. The exploration operators take one equation as input and output the set of equations which result from small edits. PGE uses a priority queue to express which points in the search space to expand next. By incorporating the Pareto non-dominated sort into a priority queue, the PGE search can exploit and optimize the trade-off between competing objectives in a deterministic order.

**Exploration Operators** are the means by which PGE searches the space of all equations.
Each exploration operator is a policy for how to expand and modify an equation's parse tree to obtain functions 'close' to the current one.

**Figure #**: PGE Exploration Functions
{% highlight Python linenos %}
def Grow(model):
    add_expands = add_extend(model)
    mul_expands = mul_extend(model)
    var_expands = var_substitution(model)
    return var_expands + add_expands + mul_expands

def add_extend(model):
    # E -> E + c*L
    # E -> E + c*N(L)
    ...

def mul_extend(model):
    # T -> T * L
    # T -> T * N(L)
    # T -> T / L
    # T -> T / N(L)
    ...

def var_substitution(model):
    # N -> ( N + c*L )
    # N -> ( N + c*N(L) )
    # N -> ( (c*L) / N )
    # N -> ( N / (c*L) )
    ...
{% endhighlight %}
PGE uses the grammar's production rules
to expand previously encountered equations.
New equations are produced by applying
the function *Grow*, from Listing \ref{expand},
while recursing over the parse tree.
The *Grow* function determines the
current node's type and applies the appropriate
production function.
Each production function corresponds to
one or more of the grammar's production rules.
*AddTerm* increases
the number of terms in a summation, such as
$$aX + bY \Rightarrow aX + bY + cZ$$.
*WidenTerm* increases
the number of terms in a product, such as
$$aXY^2 \Rightarrow aX^2Y^2$$ or $$aXY^2 \Rightarrow aXY^2Z$$.
*DeepenTerm* increases
the complexity of a term, such as
$$aXY \Rightarrow a(X+bZ)Y$$ or $$aSin(X) \Rightarrow aSin(X+Y)$$.
The generating functions define
the set of reachable expressions
in conjunction with the supplied building blocks
(terminals and non-terminals of the grammar).
However, generating functions do not determine
the order in which the space is searched.
This is the role of the PPQ
(section \ref{sec-pge-ppq}),
which prioritizes the areas to explore next.
**Prioritization Strategies**
The Pareto Priority Queue (PPQ)
is the deterministic mechanism
for controlling the search direction of PGE.
The PPQ replaces selection for mating
with prioritized equations for expansion
(selection for survival is unnecessary
since all equations are remembered).
The PPQ is what its name implies,
a priority queue built on the Pareto non-dominated sorting.
The PPQ prioritizes expansion towards
those branches of the search space which
balance size and accuracy best.
The PPQ removes the need for
generations, survivors, and
mating pools, storing
all explored equations which
have not been expanded.
Our experiments have not shown this to
be prohibitive. Consumed memory
never exceeded 500MB, even in the face of
hundreds of thousands of unique equations.
To construct the PPQ, successive
Pareto frontiers are appended onto a linear structure.
Thus, the smallest equation from the
first Pareto front is at the head of the queue.
The next elements are the remainder of
the first Pareto front in increasing size.
The remaining Pareto fronts
follow a similar pattern.
Priority is first given to the Pareto frontier
and then to size within the frontier.
This is the same as the Pareto sorted array
that results from the GP Pareto sort
during selection.
**Regularization** is the process of
introducing a penalty for complexity
in order to avoid overfitting.
In PGE and GP, this is an implicit process
which is part of the Pareto non-dominated sort.
PGE uses the PPQ and a pop count to
control the amount of regularization.
During processing, we remove the top
$$p$$ equations (3 in our experiments)
when selecting the next areas to search.
By doing so, we select the $$p$$ smallest
equations from the first Pareto frontier.
This gives variation across the trade-offs
for the equations to be expanded,
and thus variation in the search direction too.
If we only remove one equation,
the smallest equations would be replaced
from the next front and the search
would progress through equation space by size.
If we remove too many,
then PGE starts to produce
complex and overfit models.
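The prioritization and pop-count behavior described above can be sketched as follows (assuming a model is just an `(error, size)` pair; this is an illustration, not PGE's implementation):

```python
# Sketch of the PPQ: sort models into successive non-dominated fronts,
# order each front by size, then pop the top p entries.
def dominates(a, b):
    # a dominates b if it is no worse in every objective and differs
    return all(x <= y for x, y in zip(a, b)) and a != b

def pareto_fronts(models):
    remaining = list(models)
    fronts = []
    while remaining:
        front = [m for m in remaining
                 if not any(dominates(o, m) for o in remaining)]
        fronts.append(sorted(front, key=lambda m: m[1]))  # size order
        remaining = [m for m in remaining if m not in front]
    return fronts

def pop(models, p):
    queue = [m for front in pareto_fronts(models) for m in front]
    return queue[:p]

models = [(0.9, 1), (0.5, 3), (0.1, 7), (0.6, 3), (0.4, 9)]  # (error, size)
print(pop(models, 3))  # the first front, smallest models first
```

Popping several models at once, rather than one, pulls candidates from across the accuracy/complexity trade-off, which is the regularizing variation in search direction described above.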
#### Memoization
**Detect overlapping subproblems**
is the key to the dynamic programming approach taken by PGE.
In PGE, a sub-problem is equivalent
to a particular equation form.
Sub-problems are encountered
when an equation has
several derivations,
or paths over the search graph
which reach the same node.
This means there are multiple
orderings of the production rules which
result in the same equation
and that each equation may
appear in more than one form,
as several equivalent points
within a grammar's representable space.
The memoization of form allows PGE
to consider a form just once.
PGE matches equations by comparing
the serialized representation of the equations.
Serialization transforms an equation into a sequence of integers
by assigning a unique value to each node type.
The resulting integer sequence is equivalent
to the prefix notation of the equation.
Since each sequence represents a unique equation,
the parse tree is recoverable from the sequence.
Also, since PGE uses abstract coefficients,
they are all converted to the same value.
This means PGE only memoizes their existence and location.
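A minimal sketch of this serialization (the node-to-integer codes here are assumed for illustration):

```python
# Pre-order traversal of a trie, emitting one integer per node type.
CODES = {"+": 1, "*": 2, "C": 3, "x": 4, "y": 5}   # assumed codes

def serialize(node, out=None):
    out = [] if out is None else out
    if isinstance(node, str):
        out.append(CODES[node])
    else:
        op, children = node
        out.append(CODES[op])
        for c in children:
            serialize(c, out)
    return out

# c*x + y as a trie: ("+", [("*", ["C", "x"]), "y"])
print(serialize(("+", [("*", ["C", "x"]), "y"])))  # [1, 2, 3, 4, 5]
```

Note that the abstract coefficient `C` maps to a single code regardless of its value, so only its existence and location are recorded, as stated above.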
**The Integer Prefix Tree** (IPT) is at the core
of the PGE dynamic programming framework.
The IPT detects overlapping subproblems
by comparing integer sequences,
the serialized representation of an equation.
Detecting overlapping subproblems
enables us to eliminate duplication of effort
in an exponentially growing search space.
The IPT was inspired by the
suffix tree algorithm for string matching
[[Weiner:1973:LPM](),
[Farach:1997:OST](),
[Skiena:1998:ADM]()].
The suffix tree algorithm
gives log-time lookup for
string existence in a piece of text.
Similarly, the IPT provides log-time determination,
w.r.t. the size of the equation alphabet,
as to whether a subproblem has been encountered before.
Since the ASCII alphabet is equivalent to the integers,
in a mathematical mapping sense,
we can use any sized alphabet,
or equivalently, range of integers.
PGE uses the IPT memoization tree defined in Listing \ref{pretree}.
The IPT is a recursive structure and insertion algorithm
which memoizes a sequence of integers.
The IPT is built up as new expressions are encountered.
As each node in the IPT is visited,
its associated counter is incremented.
If the insertion algorithm has to create
a new node, then the expression is also new,
and the expression type is stored in that new node.
New node creation is propagated back up the tree
as the insertion function unwinds.
The root IPT node's unique counter is incremented if
a new insertion took place.
Each insertion or lookup takes $$O(m \log n)$$ operations,
where $$m$$ is the length of the serial
and $$n$$ is the size of the alphabet.
The depth of the IPT is equal to
the length of the longest serial encountered.
The IPT also tracks the number of times
each node in the structure has been visited.
We keep counts for total visits
and unique visits.
This allows us to do some accounting
of the number of unique equations
as well as how many times each
equation is produced by the generation functions.
There has been no previous research on the topic
of unique equations in relation to the
total number of equations created or evaluated.
We believe this to be due to the lack of
a method for matching equations, which is
due to the lack of canonical forms in the literature.
We use the IPT to compare
the efficiency of GP and PGE in section \ref{sec-results-efficient}.
The IPT is also the enabling structure
which conceptually turns the search space from
a tree into a graph by matching and memoizing
subproblems, or equations.
This greatly reduces the amount of work
that PGE must perform
while searching for ideal equations,
by eliminating duplication.
As mentioned earlier,
the search space is also reduced in size by
sorting equations into a canonical form and
simplifying them by rewrite rules.
**Figure #**: Integer Prefix Tree
{% highlight Python linenos %}
class Node:
    def __init__(self, key=0, value=None):
        self.map = {}  # int -> Node
        self.key = key
        self.value = value

    def insert(self, iis, value):
        if len(iis) > 1:
            ii = iis[0]
            if ii not in self.map:
                self.map[ii] = Node(key=ii)
            return self.map[ii].insert(iis[1:], value)
        if len(iis) == 1:
            ii = iis[0]
            if ii not in self.map:
                # new node, so a new equation form
                self.map[ii] = Node(key=ii, value=value)
                return True
            else:
                return False

    def lookup(self, iis):
        if len(iis) > 1:
            ii = iis[0]
            if ii in self.map:
                return self.map[ii].lookup(iis[1:])
            else:
                return False, None
        if len(iis) == 1:
            ii = iis[0]
            if ii in self.map:
                return True, self.map[ii].value
            else:
                return False, None
{% endhighlight %}
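For clarity, the same idea can be condensed into a stand-alone sketch that also tracks the total and unique counts mentioned above (a nested-dict trie; prefix-notation serials of complete equations are prefix-free, so no end marker is needed):

```python
# Compact stand-alone sketch of the IPT idea: a nested-dict prefix trie
# that counts total vs unique insertions of integer sequences.
class IPT:
    def __init__(self):
        self.root, self.total, self.unique = {}, 0, 0

    def insert(self, seq):
        self.total += 1
        node, created = self.root, False
        for i in seq:
            if i not in node:
                node[i], created = {}, True
            node = node[i]
        if created:
            self.unique += 1
        return created   # True only the first time this sequence is seen

ipt = IPT()
print(ipt.insert([1, 2, 3]))  # True: new equation form
print(ipt.insert([1, 2, 3]))  # False: duplicate derivation, skipped
print(ipt.total, ipt.unique)  # 2 1
```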
### The Search Loop
PGE iteratively refines equations
by applying the grammar's
production rules recursively.
PGE removes the top equations
from the queue, prioritizing
the first Pareto frontier.
The top equations are then expanded
according to the production rules.
The resultant expanded equations
are checked for uniqueness
by the Integer Prefix Tree (IPT).
The unique set of trees that remain are
fit using non-linear regression,
evaluated for fitness,
and pushed into the Pareto Priority Queue (PPQ).
The main PGE loop continues
in this way until the
stopping criteria is reached.
**Figure #** - PGE Flow Diagram
The PGE search proceeds as follows.
Initialization generates
a set of basis functions for each variable.
These basis functions are then
memoized by the IPT,
fit to the training data
and pushed into the PPQ.
The main PGE loop iteratively
pops $$p$$ equations from the top of
the PPQ for processing.
Each of these equations is expanded
through recursive application of the generation functions.
The resulting equations are memoized using the IPT.
If the equation has been encountered before,
it is discarded.
The remaining unique equations are
fit to the training data
with non-linear regression
and pushed into the PPQ.
PGE continues until
a model of desired accuracy is discovered
or a computational threshold is reached.
Theoretically, PGE could
explore the entire space of representable equations
given infinite space and unlimited time.
**Figure #**: PGE Search Loop
{% highlight python linenos %}
def loop(iterations):
    for I in range(iterations):
        popd = pop(p)

        # expand popd models, they are now fully processed
        expanded = expand(popd)

        # filter and memoize expanded models
        to_memo = filter(expanded)
        to_eval = memoize(to_memo)

        # fit and evaluate these models
        eval(to_eval)

        # push fully fit models into the queue
        push(to_eval)
{% endhighlight %}
The main PGE loop
performs the following steps (line numbers refer to the listing):
(3) $$p$$ equations are popped from the top of the PPQ.
(6) The $$p$$ equations are expanded to create
the set 'expanded' of more complex equations,
by applying the grammar's production rules
recursively to an equation's parse tree.
(9) Each equation $$e \in$$ 'expanded' is potentially filtered
for excessive size or illegal operations.
(10) The IPT uses an equation's serialized representation
to check if $$e$$ has been seen before.
If it has, $$e$$ is discarded; otherwise $$e$$ is unique and is
(13) fit to the training data
and (16) pushed into the PPQ.
Variations on the PGE algorithm
can be created by altering how PGE is initialized
and how it steps through the grammar via its production rules.
We can seed PGE with a richer set of basis functions,
and even allow complete sub-trees to become building blocks.
Further, by enriching the production rules applied at each node,
we can enlarge the set of equations generated at each point in the search space.
Variations on the PGE expansion algorithm
can be created by parameterizing where and
which generation functions are applied to a candidate.
By modifying the productions applied at each node,
we can reduce or increase the set of equations
generated at each node of the tree,
and thus each point in the search space.
In this work,
we used three different expansion methods.
The basic method (PGE-1) restricts the grammar
by removing productions which result in
non-distributed expressions like $$ax*(b+cx*(d+x^2))$$.
It only uses *AddTerm* and *WidenTerm*
during the recursive expansion.
The second method (PGE-2) lifts the previously mentioned restriction,
using all three generation functions from Listing \ref{expand}.
This results in isomorphs
which are too 'far' apart
for the simplification rules to consider them the same.
Despite being equivalent,
these isomorphic forms can produce very different offspring,
and represent separate areas of the search space.
The third method (PGE-3) is FFX inspired,
but starts with only a set of univariate bases.
These bases are iteratively
grown into a summation of multiplications
by applying *AddTerm* and *WidenTerm*,
creating candidates with more and increasingly complex bases.
PGE-3 differs from the PGE-1 method by
explicitly defining the root node to be addition
and placing stricter limitations on the complexity
of each term in the summation.
### Limitations
During the development of PGE,
we encountered some of the limitations
with our current implementation and reasoning.
#### Exponential Growth in the Number of Features
PGE does not handle dimensionality well.
As the number of variables increases,
so does the number of equations generated.
Growth is exponential in the number of variables.
This is because the production rules
are applied to each node of the parse tree,
expanding with all possible variables.
Further, with more complex generator functions,
equations which are mathematically identical
are not recognized as equivalent by our current rewrite rules.
The combination of these two effects
compounds the issue of the exponential rate of equation generation.
This fact may only be partially limiting,
as the various forms of one equation
represent different areas of the search space,
and different forms are producible from
these points in the space.
#### Algebra Policies
Expansion and simplification policies
are difficult to establish and generalize.
There is no free lunch.
The policies determine the reachable search space
and how local changes are made.
If expansion is unrestrained,
complexity grows quickly,
often creating equations which
have a simpler equivalent version.
Additionally, the original implementation
used a custom algebra system and lacked many
features and capabilities.
We advise moving to a mature Computer Algebra System (CAS),
which we do in the next chapter.
#### Bloat in PGE
Bloat in PGE occurs when
good partial solutions dominate
the selection pool and create
an expansion cycle.
Consider $$F(x)$$ as a good partial solution.
As PGE makes all possible modifications to this expression,
it will produce $$F(x) + c*g(x)$$
where either $$c$$ is very small
or $$g(x)$$ produces near-zero values.
In both cases, bloat occurs because the extra
terms of the expression have little effect
on the fitness of the equation.
Adding to this, each node in the bloated portion is
expanded upon, further exacerbating the situation.
#### Similar Selection
Due to the bloat and all inclusive expansion,
the bloated partial solutions can quickly
dominate the selection pool.
These similar solutions have similar fitness,
causing the selection process to have
difficulty distinguishing between them.
They can create a situation where
similar models are selected and expanded,
with bloated terms and fitness incomparability,
leaving PGE stuck in a local optimum.
#### Parameter Optimization
Parameter optimization is the portion of any SR implementation
which takes the longest time.
In many cases, the model is linear w.r.t. its parameters
and we get an exact best solution for the model.
In the nonlinear cases we may not, as
nonlinear regression is not guaranteed to find an optimal solution.
There are also questions as to where parameters should be included.
If excessive parameters are added to a model,
it can take a comparatively long time to fit
and in many cases may overfit the
noise in the data.
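The distinction between the linear and nonlinear cases can be sketched with standard tools. This is only an illustration with hypothetical model forms and data: NumPy's exact linear least-squares stands in for the linear case, and SciPy's `curve_fit` (which defaults to Levenberg-Marquardt for unbounded problems) for the nonlinear case.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)
y = 2.0 * x + 3.0 + rng.normal(0, 0.1, x.size)

# Linear w.r.t. the parameters: c0*x + c1 has an exact least-squares solution.
A = np.column_stack([x, np.ones_like(x)])
(c0, c1), *_ = np.linalg.lstsq(A, y, rcond=None)

# Nonlinear w.r.t. the parameters: fitting is iterative, has no optimality
# guarantee, and the result can depend on the starting guess p0.
def form(x, a, b):
    return a * np.exp(b * x)

(a, b), _ = curve_fit(form, x, 2.0 * np.exp(0.5 * x), p0=(1.0, 1.0))
```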
### Relation to Genetic Programming
Prioritized Grammar Enumeration is
heavily influenced by GP.
It uses the same parse tree representation
for candidate equations.
PGE, however, differs in how it
evaluates, selects, and generates
new candidates from old ones.
This section is meant to help
those familiar with GP
to understand where and what
the analogous processes are.
**Parameters**
PGE is a nearly parameter-free
algorithm, requiring only that
the building blocks, generating functions,
the value of $$p$$, and termination criteria
be established prior to
beginning an equation search.
Since building blocks are
generally established by the benchmarks,
and stopping criteria are relatively
common across SR implementations,
PGE has only two unique parameters:
(1) $$p$$, the number of equations to remove from the PPQ,
and
(2) which generation functions to use.
This is a substantial
advantage over GP,
which requires many parameters to be set,
a notoriously difficult task.
**Evaluations**
Evaluation is the most
time consuming phase
of the model discovery process.
PGE performs two steps of evaluation,
fitting coefficients and evaluating fitness.
PGE uses
non-linear regression
to optimize the coefficients,
which requires an undetermined
number of evaluations
on the derivatives of the equation.
Once fit, PGE calculates
fitness on the testing set.
In GP, one evaluation
per data point, per equation
is required for both training and testing.
Regressing the coefficients
has a profound effect on the
ability of PGE to quickly
traverse the search space.
It allows for the separation
of equation form and coefficient optimization
into two sub-tasks of the overall search.
In GP, the optimization of form
is difficult and the most time
consuming portion of the search process.
**Selection**
To prioritize the enumeration of expressions,
we maintain what we will refer to as a Pareto Priority Queue (PPQ).
The contents of the PPQ are the equations
which have been regressed and evaluated,
but have not yet been expanded.
The PPQ is similar to
the basic priority queue,
in that we want to quickly
prioritize the elements
which have the highest priority.
However it is different
because we are sorting on
two objectives, parsimony and accuracy,
rather than one.
When selecting the equations
to expand next,
we remove one or more elements
from the front of the heap.
Since the front of the heap
is the first Pareto front,
we prioritize the smallest
equations first.
The more equations we *peel* (remove)
from the front of the heap,
the more of the search space
we consider concurrently.
This enables the search to
exploit equations that
express different levels
of the trade off between
parsimony and accuracy.
If we select only one equation
from the PPQ per iteration,
the search degenerates into
a search which progresses
through expression size.
This degenerate case will
exhaust all possible equations
of the current size
before considering
any larger equations.
**Population and Breeding**
In PGE, there is no need to
maintain a diverse population
in which survival and replication
determine the outcome of the search.
In fact, one of the goals of PGE
is to only process
an expression form once.
This is managed by the IPT
which tracks the equations that have
been evaluated and expanded.
Equations which have been
generated, regressed, and evaluated
sit in the PPQ.
Once an equation has been
removed from the PPQ
and expanded upon,
it requires no further consideration.
">next
index

## Related Work

**Figure 4-1** - Combinatorics of the Tree
The equation `a•b•c` has 12 different binary tree representations,
from six leaf permutations and two shape permutations.
If we account for both addition and multiplication,
the number of trees grows to 48.
As you can imagine, with
increasing numbers of operations and operands,
the tree complexity and this number
undergoes a combinatorial explosion.
The advantage of the trie-based model
is the ability to exploit the
commutative and associative properties
of addition and multiplication.
Because of these properties,
we are free to order the operands,
the sub-tries in the model trie, as we choose.
Associativity is the ability to rearrange parentheses
and is associated with the multiple tree representations
for the same expression. By using the trie,
we flatten a binary tree and metaphorically
erase the parentheses from the expression.
Commutativity means that we can sort
the operands to addition and multiplication.
By sorting them, we create a consistent ordering
which can be treated as the canonical form for a model.
In doing so, we remove the combinatorial element
associated with the permutation of leaf nodes.
Sorting is possible by
creating an ordering on node types.
This is easily achieved by assigning
an integer to each node type.
By turning binary trees into tries
and by sorting the tries,
each model can have a canonical form.
Figure 4-1, right column, shows
the canonical form of `a•b•c`.
After applying the reductions,
the original 48 models are reduced to just 4.
The savings created by sorting
become even more important as equations
become more complex.
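A minimal sketch of this canonicalization, assuming nested tuples for the trie and an illustrative integer ordering on node types:

```python
# Node types are ranked by an assumed integer ordering; leaves sort by name.
RANK = {'add': 1, 'mul': 2}

def canon(node):
    """Flatten associative operators and sort their operands."""
    if isinstance(node, str):          # leaf: a variable name
        return node
    op, *args = node
    flat = []
    for a in map(canon, args):
        # associativity: absorb children with the same operator, (a*b)*c -> a*b*c
        if isinstance(a, tuple) and a[0] == op:
            flat.extend(a[1:])
        else:
            flat.append(a)
    # commutativity: sort operands into one consistent, canonical order
    flat.sort(key=lambda n: (RANK.get(n[0], 0), repr(n))
              if isinstance(n, tuple) else (0, n))
    return (op, *flat)

# Two of the 12 binary trees for a*b*c collapse to one canonical trie.
t1 = ('mul', ('mul', 'b', 'c'), 'a')
t2 = ('mul', 'c', ('mul', 'a', 'b'))
assert canon(t1) == canon(t2) == ('mul', 'a', 'b', 'c')
```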
**Algebraic simplifications**
There are other sub-expressions which arise
that we would also like to remove and reduce.
Simplifications group like terms together ($$x+x$$),
replace sub-expressions which evaluate
to a constant value ($$\sin\pi$$, $$\frac{x}{x}$$, or $$x-x$$),
and reduce unnecessarily complex expressions such as $$\frac{1}{\frac{1}{x}}$$.
The resulting equations are equivalent,
simpler forms.
There is debate as to how
simplification affects the SR process
[ [kinzett:2008:using](), [kinzett:2009:online](), [mcree:2010:symbolic]() ].
Certainly, changing the tree affects
the Pareto trade-off, which in turn
has consequences on selection and
therefore search progress.
Questions still remain as to
how much and which simplifications should be applied,
which forms of the equation should be kept,
and the overall effects simplification has on the search process.
Despite these questions,
PGE still employs
the aforementioned simplifications.
This is done to eliminate
those sub-expressions which
unnecessarily complicate equations
and the overall search space.
If they were not replaced,
many equivalent variations of a good equation
could crowd out other potential solutions,
as they would be selected more often.
This would result in the algorithm
stagnating in a local optimum of the search space.
Simplifications effectively trim
out parts of the search space
which are redundant, and also
return results which are more comprehensible.
PGE uses the simplification techniques
which are present in the SymPy Python library.
This and further implementation details
can be found in [Appendix-A1](/sr/A1-pypge).
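As a brief illustration, the simplifications named above can be reproduced directly with SymPy:

```python
import sympy as sp

x = sp.Symbol('x')

# Group like terms, fold constant sub-expressions, and reduce nested forms.
assert sp.simplify(x + x) == 2 * x          # like terms
assert sp.simplify(sp.sin(sp.pi)) == 0      # constant sub-expression
assert sp.simplify(x / x) == 1              # constant sub-expression
assert sp.simplify(x - x) == 0              # constant sub-expression
assert sp.simplify(1 / (1 / x)) == x        # unnecessarily complex form
```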
#### Evaluating Forms Once
At its core, PGE is a dynamic programming algorithm
which aims to evaluate each sub-problem once.

In PGE and SR, a sub-problem is a specific form, namely the parse tree for an equation and its terminals. Recall that PGE uses placeholders for the parameters, so an equation form accounts for the existence of a parameter without knowing its value. The key to evaluating forms once is to fully optimize a form the first time it is seen and to remember the forms which have already been explored. PGE fully optimizes forms by fitting the valueless parameters with non-linear regression, thus assigning them values. In order to memoize the equation forms, they are converted to a linear representation and then recorded in a lookup structure called the Integer Prefix Trie. This is possible because each equation has a canonical form which is unique.

**Non-linear Regression** is not unique to PGE, or GP for that matter, but it is a central component of PGE's ability to evaluate an equation form once. As PGE encounters new equation forms, it fits the abstract parameters using non-linear regression, resulting in the 'best' version of that form on the training data. PGE uses the Levenberg-Marquardt (LM) optimization algorithm [CITE]() to fully fit an equation to a set of data. In cases where an equation is linear w.r.t. the parameters, the LM algorithm returns in one iteration by using singular value decomposition (SVD). PGE's treatment of parameters and use of non-linear regression enables the separation of the search for equation form from the optimization of a form's parameters. In this way, the search for form and the optimization of coefficients are split into separate tasks. This separation has profound implications for the efficiency and effectiveness of PGE. It enables an equation form to be fully evaluated once, removing the duplication of effort GP experiences when using genetic operators to optimize the coefficients of an equation form. Some GP implementations have incorporated the use of regression for fitting coefficients.
However, they make no use of memoization, and therefore still suffer from the duplication of effort which results from multiple copies of the same equation.

**Memoization** Canonical forms for equations reduce one source of combinatorics in the SR problem. A second source is multiple derivations, that is, multiple paths over the search graph which reach the same node. Below is a diagram which demonstrates this.
**Figure #** - Multiple Derivations
In this diagram, we start with models
which consist of a single variable.
Now we could take each of these models
and add a different variable to it.
Consider though that we could
start with $$x$$ and add $$y$$
or
start with $$y$$ and then add $$x$$,
both arriving at the canonical form $$ x+y $$.
Repeat this process, adding a third variable,
and we have six paths to the model $$ x+y+z $$.
It is important that we detect this situation,
determining that each of the six $$ x+y+z $$
derivations is really the same model.
If we do not detect the separate models as one,
we would end up repeating a great deal of work.
This repetition would become a combinatorial
explosion of exponential proportions.
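A small sketch of these colliding derivations, with nested tuples standing in for tries and a hypothetical `add` helper that sorts operands into canonical order:

```python
from itertools import permutations

def add(expr, var):
    """Append a variable to a summation and sort operands (commutativity)."""
    ops = list(expr[1:]) + [var] if expr and expr[0] == 'add' else [expr, var]
    return ('add', *sorted(ops))

# Start from x and add y, or start from y and add x: one canonical model.
assert add('x', 'y') == add('y', 'x') == ('add', 'x', 'y')

# All six orderings of {x, y, z} derive the same canonical model x+y+z.
forms = set()
for p in permutations(['x', 'y', 'z']):
    e = p[0]
    for v in p[1:]:
        e = add(e, v)
    forms.add(e)
assert forms == {('add', 'x', 'y', 'z')}
```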
PGE uses dynamic programming techniques
to detect multiple derivations and
consider an equation form only once.
This feature of PGE is enabled by
the Integer Prefix Trie and
an integer linear representation
for the equation models.
The next section will fill in the details
of both representation and memoization.

### Components
1. **Representation** - Trie and linear representations with valueless parameters
1. **Evaluation** - Nonlinear regression for parameters, fitness metric for models
1. **Optimization** - Exploration, prioritization, and regularization
1. **Memoization** - Detecting overlapping subproblems to avoid duplicate work

PGE has four components which enable its search capabilities. The first is the set of generating functions which produce new equations from previous equations. The second is the Integer Prefix Tree which enables memoization of equations from their serialized representation. The third component is non-linear regression, which separates form from optimization. The fourth component is the Pareto Priority Queue, which gives order to the search space by prioritizing equations over one another.
#### Representation
PGE uses multiple representations for an equation. The syntax trie is used for optimization. The integer sequence is used for memoization.

**Syntax Trie** PGE uses trees, where operands can have a variable number of children. In the n-ary tree representation, the associative operators can have n sub-trees, flattening and reducing the tree's size. This is a slight modification of the usual binary tree, affecting only the associative operators addition and multiplication. The n-ary tree does not change the modeling ability of the equation, but does affect the trade-off between parsimony and accuracy. This in turn affects the selection operation of any SR implementation, though we do not investigate this issue here. In addition to the reductions in parse tree size, the n-ary representation eases the design of sorting and simplification algorithms. These algorithms, detailed next, work within the parse tree to reshape equivalent trees into a canonical form. This effectively merges isomorphic equations, reducing the size of the search space, while simultaneously adding structure to the search space.
**Linear integer sequence** is the prefix notation representation of an equation, mapped to the integers. It is equivalent to the trie representation, with a one-to-one correspondence. The linear integer sequence is obtained by mapping each node type to a different integer and then performing a pre-order traversal of the trie, collecting the integers by appending them to a list. The usefulness of this representation is efficiency in implementation. Integers can be used for constant time lookups, where otherwise we may have been limited to a higher complexity operation. The linear representation is mainly used in conjunction with the Integer Prefix Trie (IPT) during the memoization process. The IPT is an efficient data structure for recording and looking up equations and will be detailed shortly. Additionally, the linear representation can be used for interprocess and network communication. Its main benefit there is on the receiving side, where we can avoid the complexity of parsing the human readable version of an equation.

**Abstract parameters** are the valueless placeholders used as coefficients in equations. They signify that a parameter is present and where in an equation it occurs. The benefit of abstract parameters is two-fold. First, equation form matching and memoization can happen without regard to their value. Second, they enable the optimization of the parameters to be performed in a separate process, via non-linear regression. This is the key to separating the optimization of a given form from the optimization which occurs at the global scale, in the space of all equations.
#### Evaluation
**Non-linear parameter fitting**, though not unique to PGE, or GP for that matter, is a central component of the PGE algorithm. It enables the separation of the search for form from the optimization of that form's parameters. Great gains in efficacy are realized by separating the search for equation form from the optimization of parameters.
PGE achieves this separation by using abstract parameters. Parameters are indexed into an array, which means they do not store their value internally. The indexed parameters contribute only to form, taking on value only after non-linear fitting. Handling parameters in this manner is analogous to the way variables are treated, which only contribute to fitness during evaluation.

**Fitness metrics** are composite metrics which combine a notion of accuracy and complexity. They are used to compare the relative fitness between two models or equations. They are the multi-objective value which SR implementations seek to optimize. The most common fitness metric uses the tree size and the mean squared error. Many values for accuracy and complexity may be used, and more than one of each may be used. Additionally, the values in the fitness metric may be weighted to give preference. Another useful practice is to normalize the values used in the fitness metric, across the set, before sorting into successive non-dominated sets. The Pareto non-dominated sorting methods, described in the previous chapter, use fitness metrics to determine the successive non-dominated sets of equations.
#### Optimization
SR is a multi-objective optimization task which seeks to maximize accuracy while minimizing complexity. To search over equation space, PGE must move within the space and decide where to move next. PGE uses exploration operators derived from the rules of the grammar. The exploration operators take one equation as input and output a set of equations which result from small edits. PGE uses a priority queue to express which points in the search space to expand next. By incorporating the Pareto non-dominated sort into a priority queue, the PGE search can exploit and optimize the trade-off between competing objectives in a deterministic order.

**Exploration Operators** are the means by which PGE searches the space of all equations.
Each exploration operator is a policy for how to expand and modify an equation's parse tree to obtain functions 'close' to the current one.

**Figure #**: PGE Exploration Functions
{% highlight Python linenos %}
def Grow(model):
    add_expands = add_extend(model)
    mul_expands = mul_extend(model)
    var_expands = var_substitution(model)
    return var_expands + add_expands + mul_expands

def add_extend(model):
    # E -> E + c*L
    # E -> E + c*N(L)
    ...

def mul_extend(model):
    # T -> T * L
    # T -> T * N(L)
    # T -> T / L
    # T -> T / N(L)
    ...

def var_substitution(model):
    # N -> ( N + c*L )
    # N -> ( N + c*N(L) )
    # N -> ( (c*L) / N )
    # N -> ( N / (c*L) )
    ...
{% endhighlight %}
PGE uses the grammar's production rules
to expand previously encountered equations.
New equations are produced by applying
the function *Grow*, from Listing \ref{expand},
while recursing over the parse tree.
The *Grow* function determines the
current node's type and applies the appropriate
production function.
Each production function corresponds to
one or more of the grammar's production rules.
*AddTerm* increases
the number of terms in a summation, such as
$$aX + bY \Rightarrow aX + bY + cZ$$.
*WidenTerm* increases
the number of terms in a product, such as
$$aXY^2 \Rightarrow aX^2Y^2$$ or $$aXY^2 \Rightarrow aXY^2Z$$.
*DeepenTerm* increases
the complexity of a term, such as
$$aXY \Rightarrow a(X+bZ)Y$$ or $$aSin(X) \Rightarrow aSin(X+Y)$$.
The generating functions,
in conjunction with the supplied building blocks
(terminals and non-terminals of the grammar),
define the set of reachable expressions.
However, generating functions do not determine
the order in which the space is searched.
This is the role of the PPQ
(section \ref{sec-pge-ppq}),
which prioritizes the areas to explore next.
**Prioritization Strategies**
The Pareto Priority Queue (PPQ)
is the deterministic mechanism
for controlling the search direction of PGE.
The PPQ replaces selection for mating
with prioritized equations for expansion
(selection for survival is unnecessary
since all equations are remembered).
The PPQ is what its name implies,
a priority queue built on the Pareto non-dominated sorting.
The PPQ prioritizes expansion towards
those branches of the search space which
balance size and accuracy best.
The PPQ removes the need for
generations, survivors, and
mating pools, storing
all explored equations which
have not been expanded.
Our experiments have not shown this to
be prohibitive. Consumed memory
never exceeded 500Mb, even in the face of
hundreds of thousands of unique equations.
To construct the PPQ, successive
Pareto frontiers are appended onto a linear structure.
Thus, the smallest equation from the
first Pareto front is at the head of the queue.
The next elements are the remainder of
the first Pareto front in increasing size.
The remaining Pareto fronts
follow a similar pattern.
Priority is first given to the Pareto frontier
and then to size within the frontier.
This is the same as the Pareto sorted array
that results from the GP Pareto sort
during selection.
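The construction can be sketched as follows, with hypothetical `(size, error)` pairs standing in for evaluated equations:

```python
def dominates(a, b):
    # a dominates b if it is no worse in both objectives and differs from b
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def pareto_fronts(models):
    """Successive non-dominated sorts of (size, error) pairs."""
    remaining = list(models)
    fronts = []
    while remaining:
        front = [m for m in remaining
                 if not any(dominates(o, m) for o in remaining)]
        fronts.append(sorted(front))   # within a front, smallest size first
        remaining = [m for m in remaining if m not in front]
    return fronts

def build_ppq(models):
    # append successive fronts; the head is the smallest model of front 1
    return [m for front in pareto_fronts(models) for m in front]

ppq = build_ppq([(5, 0.10), (3, 0.20), (7, 0.05), (3, 0.50)])
assert ppq[0] == (3, 0.20)   # smallest equation on the first Pareto front
```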
**Regularization** is the process of
introducing a penalty for complexity
in order to avoid overfitting.
In PGE and GP, this is an implicit process
which is part of the Pareto non-dominated sort.
PGE uses the PPQ and a pop count to
control the amount of regularization.
During processing, we remove the top
$$p$$ equations (3 in our experiments)
when selecting the next areas to search.
By doing so, we select the $$p$$ smallest
equations from the first Pareto frontier.
This gives variation across the trade-offs
for the equations to be expanded,
and thus variation in the search direction too.
If we only remove one equation,
the smallest equations would be replaced
from the next front and the search
would progress through equation space by size.
If we remove too many,
then PGE starts to produce
complex and overfit models.
#### Memoization
**Detect overlapping subproblems**
is the key to the dynamic programming approach taken by PGE.
In PGE, a sub-problem is equivalent
to a particular equation form.
Sub-problems are encountered
when an equation has
several derivations,
or paths over the search graph
which reach the same node.
This means there are multiple
orderings of the production rules which
result in the same equation
and that each equation may
appear in more than one form,
as several equivalent points
within a grammar's representable space.
The memoization of form allows PGE
to consider a form just once.
PGE matches equations by comparing
the serialized representation of the equations.
Serialization transforms an equation into a sequence of integers
by assigning a unique value to each node type.
The resulting integer sequence is equivalent
to the prefix notation of the equation.
Since each sequence represents a unique equation,
the parse tree is recoverable from the sequence.
Also, since PGE uses abstract coefficients,
they are all converted to the same value.
This means PGE only memoizes their existence and location.
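A sketch of this serialization, assuming an illustrative mapping from node types to integers and nested tuples for the trie:

```python
# Hypothetical node-type codes; all coefficients share the single code 'C'.
CODES = {'add': 1, 'mul': 2, 'C': 3, 'x': 4, 'y': 5}

def serialize(node):
    """Pre-order traversal of the trie, emitting one integer per node."""
    if isinstance(node, str):
        return [CODES[node]]
    op, *args = node
    out = [CODES[op]]
    for a in args:
        out.extend(serialize(a))
    return out

# c0*x + c1*y and c2*x + c3*y serialize identically: only the existence
# and location of coefficients is recorded, never their values.
form = ('add', ('mul', 'C', 'x'), ('mul', 'C', 'y'))
assert serialize(form) == [1, 2, 3, 4, 2, 3, 5]
```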
**The Integer Prefix Tree** (IPT) is at the core
of the PGE dynamic programming framework.
The IPT detects overlapping subproblems
by comparing integer sequences,
the serialized representation of an equation.
Detecting overlapping subproblems
enables us to eliminate duplication of effort
in an exponentially growing search space.
The IPT was inspired by the
suffix tree algorithm for string matching
[[Weiner:1973:LPM](),
[Farach:1997:OST](),
[Skiena:1998:ADM]()].
The suffix tree algorithm
gives log-time lookup for
string existence in a piece of text.
Similarly, the IPT provides log-time determination,
w.r.t. the size of the equation alphabet,
as to whether a subproblem has been encountered before.
Since the ASCII alphabet is equivalent to a range of integers,
in a mathematical mapping sense,
we can use an alphabet of any size,
or equivalently, any range of integers.
PGE uses the IPT memoization tree defined in Listing \ref{pretree}.
The IPT is a recursive structure and insertion algorithm
which memoizes a sequence of integers.
The IPT is built up as new expressions are encountered.
As each node in the IPT is visited,
its associated counter is incremented.
If the insertion algorithm has to create
a new node, then the expression is also new,
and the expression type is stored in that new node.
New node creation is propagated back up the tree
as the insertion function unwinds.
The root IPT node unique counter is incremented if
a new insertion took place.
Each insertion or lookup takes $$O(m \log n)$$ operations,
where $$m$$ is the length of the serial
and $$n$$ is the size of the alphabet.
The depth of the IPT is equal to
the length of the longest serial encountered.
The IPT also tracks the number of times
each node in the structure has been visited.
We keep counts for total visits
and unique visits.
This allows us to do some accounting
of the number of unique equations
as well as how many times each
equation is produced by the generation functions.
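This accounting can be sketched with a dict-based prefix trie that keeps total and unique counters. It is a simplification of the IPT (and assumes no serial is a strict prefix of another, which holds for pre-order serials of typed nodes):

```python
class CountingTrie:
    """Prefix trie that counts total insertions and unique sequences."""
    def __init__(self):
        self.root = {}
        self.total = 0    # how many serials were inserted overall
        self.unique = 0   # how many distinct serials were seen

    def insert(self, serial):
        node, created = self.root, False
        for ii in serial:
            if ii not in node:
                node[ii] = {}
                created = True   # a new node means a new equation
            node = node[ii]
        self.total += 1
        if created:
            self.unique += 1
        return created

t = CountingTrie()
for serial in [[1, 2, 3], [1, 2, 4], [1, 2, 3]]:
    t.insert(serial)
assert (t.total, t.unique) == (3, 2)   # one serial was generated twice
```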
There has been no previous research on the topic
of unique equations in relation to the
total number of equations created or evaluated.
We believe this to be due to the lack of
a method for matching equations, which is
due to the lack of canonical forms in the literature.
We use the IPT to compare
the efficiency of GP and PGE in section \ref{sec-results-efficient}.
The IPT is also the enabling structure
which conceptually turns the search space from
a tree into a graph by matching and memoizing
subproblems, or equations.
This greatly reduces the amount of work,
by eliminating duplication,
that the PGE must perform
while searching for ideal equations.
As mentioned earlier,
the search space is also reduced in size by
sorting equations into a canonical form and
simplifying them by rewrite rules.
**Figure #**: Integer Prefix Tree
{% highlight Python linenos %}
class Node:
    def __init__(self, key=0, value=None):
        self.map = {}  # int -> Node
        self.key = key
        self.value = value

    def insert(self, iis, value):
        # returns True if the sequence was new (a new equation)
        if len(iis) > 1:
            ii = iis[0]
            if ii not in self.map:
                self.map[ii] = Node(key=ii)
            return self.map[ii].insert(iis[1:], value)
        if len(iis) == 1:
            ii = iis[0]
            if ii not in self.map:
                # a new leaf node, and thus a new equation
                self.map[ii] = Node(key=ii, value=value)
                return True
            else:
                return False

    def lookup(self, iis):
        # returns (found, value) for the given integer sequence
        if len(iis) > 1:
            ii = iis[0]
            if ii in self.map:
                return self.map[ii].lookup(iis[1:])
            else:
                return False, None
        if len(iis) == 1:
            ii = iis[0]
            if ii in self.map:
                return True, self.map[ii].value
            else:
                return False, None
{% endhighlight %}
### The Search Loop
PGE iteratively refines equations
by applying the grammar's
production rules recursively.
PGE removes the top equations
from the queue, prioritizing
the first Pareto frontier.
The top equations are then expanded
according to the production rules.
The resultant expanded equations
are checked for uniqueness
by the Integer Prefix Tree (IPT).
The unique set of trees that remain are
fit using non-linear regression,
evaluated for fitness,
and pushed into the Pareto Priority Queue (PPQ).
The main PGE loop continues
in this way until the
stopping criteria is reached.
**Figure #** - PGE Flow Diagram
The PGE search proceeds as follows.
Initialization generates
a set of basis functions for each variable.
These basis functions are then
memoized by the IPT,
fit to the training data
and pushed into the PPQ.
The main PGE loop iteratively
pops $$p$$ equations from the top of
the PPQ for processing.
Each of these equations is expanded
through recursive application of the generation functions.
The resulting equations are memoized using the IPT.
If the equation has been encountered before,
it is discarded.
The remaining unique equations are
fit to the training data
with non-linear regression
and pushed into the PPQ.
PGE continues until
a model of desired accuracy is discovered
or a computational threshold is reached.
Theoretically, PGE could
explore the entire space of representable equations
given infinite space and unlimited time.
**Figure #**: PGE Search Loop
{% highlight python linenos %}
def loop(iterations):
    for I in range(iterations):
        popd = pop(p)

        # expand popd models, they are now fully processed
        expanded = expand(popd)

        # filter and memoize expanded models
        to_memo = filter(expanded)
        to_eval = memoize(to_memo)

        # fit and evaluate these models
        eval(to_eval)

        # push fully fit models into the queue
        push(to_eval)
{% endhighlight %}
The main PGE loop
performs the following steps (line numbers):
(3) $$p$$ equations are popped from the top of the PPQ.
(6) The $$p$$ equations are expanded to create
the more complex equations 'expanded'
by applying the grammar's production rules
recursively to an equation's parse tree.
(9) Each equation $$e \in $$ 'expanded' is potentially filtered
for excessive size or illegal operations.
(10) The IPT uses an equation's serialized representation
to check if $$e$$ has been seen before.
If it has, $$e$$ is discarded; otherwise $$e$$ is unique and
(13) fit to the training data
and (16) pushed into the PPQ.
Variations on the PGE algorithm
can be created by altering how PGE is initialized
and how it steps through the grammar via its production rules.
We can seed PGE with a richer set of basis functions,
and even allow complete sub-trees to become building blocks.
Further, by enriching the production rules applied at each node,
we can enlarge the set of equations generated at each point in the search space.
Variations on the PGE expansion algorithm
can be created by parameterizing where and
which generation functions are applied to a candidate.
By modifying the productions applied at each node,
we can reduce or increase the set of equations
generated at each node of the tree,
and thus each point in the search space.
In this chapter,
we used three different expansion methods.
The basic method (PGE-1) restricts the grammar
by removing productions which result in
non-distributed expressions like $$ax*(b+cx*(d+x^2))$$.
It only uses *AddTerm* and *WidenTerm*
during the recursive expansion.
The second method (PGE-2) adds back the previously removed productions,
using all three generation functions from Listing \ref{expand}.
This results in isomorphs
which are too 'far' apart
for the simplification rules to consider them the same.
Despite being equivalent,
these isomorphic forms can produce very different offspring,
and represent separate areas of the search space.
The third method (PGE-3) is FFX inspired,
but starts with only a set of univariate bases.
These bases are iteratively
grown into a summation of multiplications
by applying *AddTerm* and *WidenTerm*,
creating candidates with more and increasingly complex bases.
PGE-3 differs from the PGE-1 method by
explicitly defining the root node to be addition
and placing stricter limitations on the complexity
of each term in the summation.
### Limitations
During the development of PGE,
we encountered some of the limitations
with our current implementation and reasoning.
#### Exponential to |{Features}|
PGE does not handle dimensionality well.
As the number of variables increases,
so do the number of equations generated.
Growth is exponential to the number of variables.
This is because of the production rules
being applied to each node of the parse tree,
which expand with all possible variables.
Further, with more complex generator functions,
equations which are the same mathematically,
are not equivalent through our current rewrite rules.
The combination of these two effects
compounds the issue of the exponential rate of equation generation.
This fact may only be partially limiting,
as the various forms of one equation
represent different areas of the search space,
and different forms are producible from
these points in the space.
#### Algebra Policies
Expansion and simplification policies
are difficult to establish and generalize.
There is no free lunch.
The policies determine the reachable search space
and how local changes are made.
If too much expansion is unrestrained,
the complexity grows quickly,
often creating equations which
have a simpler version.
Additionally, the original implementation
used a custom algebra system which lacked
many features and capabilities.
We advise moving to a mature
Computer Algebra System (CAS), and do so in the next chapter.
#### Bloat in PGE
Bloat in PGE is when
good partial solutions dominate
the selection pool and create
an expansion cycle.
Consider $$F(x)$$ to be a good partial solution.
Because PGE makes all possible modifications to an expression,
it will produce $$F(x) + c*g(x)$$
where either $$c$$ is very small
or $$g(x)$$ produces values near zero.
In both cases, bloat occurs because the extra
terms of the expression have little effect
on the fitness of the equation.
Adding to this, each node in the bloated portion is
expanded upon, further exacerbating the situation.
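A quick numeric illustration of why such bloat goes unpunished by accuracy alone. The functions here are arbitrary stand-ins for the text's $$F(x)$$ and $$g(x)$$:

```python
# Adding a term with a near-zero coefficient barely changes the error,
# so selection cannot penalize the extra nodes on accuracy alone.

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(1, 51)]
ys = [x * x for x in xs]                      # target: y = x^2

plain   = lambda x: x * x                     # good partial solution F(x)
bloated = lambda x: x * x + 1e-9 * (x ** 3)   # F(x) + c*g(x), tiny c

print(mse(plain, xs, ys), mse(bloated, xs, ys))
# both errors are essentially zero; the bloated form pays no accuracy penalty
```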
#### Similar Selection
Due to the bloat and all-inclusive expansion,
the bloated partial solutions can quickly
dominate the selection pool.
These similar solutions have similar fitness,
causing the selection process to have
difficulty distinguishing between them.
This can create a situation where
similar models are selected and expanded,
with bloated terms and incomparable fitnesses
causing PGE to become stuck in a local optimum.
#### Parameter Optimization
Parameter optimization is the portion of any SR implementation
which takes the longest time.
In many cases, the model is linear in its parameters
and we obtain an exact best solution.
In the nonlinear cases we may not, as
nonlinear regression is not guaranteed to find an optimal solution.
There are also questions as to where parameters should be included.
If excessive parameters are added to a model
it can take a comparatively long time to fit
and in many cases may be overfitting to the
noise in the data.
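To make the linear/nonlinear distinction concrete, here is a minimal damped Gauss-Newton loop (Levenberg-Marquardt style) fitting the hypothetical nonlinear model $$y = c_0 e^{c_1 x}$$. The model, data, and starting point are all invented for illustration; real implementations use a mature LM library. Unlike the linear case, which is solved in one step, this must iterate and can in general stall at a local optimum.

```python
import numpy as np

def lm_fit(x, y, c, lam=1e-3, iters=100):
    """Damped Gauss-Newton (LM-style) fit of y ~ c[0] * exp(c[1] * x)."""
    for _ in range(iters):
        r = c[0] * np.exp(c[1] * x) - y                     # residuals
        J = np.column_stack([np.exp(c[1] * x),              # dr/dc0
                             c[0] * x * np.exp(c[1] * x)])  # dr/dc1
        step = np.linalg.solve(J.T @ J + lam * np.eye(2), -J.T @ r)
        trial = c + step
        r_trial = trial[0] * np.exp(trial[1] * x) - y
        if r_trial @ r_trial < r @ r:     # better fit: accept, trust more
            c, lam = trial, lam / 3
        else:                             # worse fit: reject, damp more
            lam *= 3
    return c

x = np.linspace(0.0, 2.0, 25)
y = 2.0 * np.exp(0.5 * x)                 # clean data, true c = (2.0, 0.5)
c = lm_fit(x, y, np.array([1.0, 0.1]))
print(c)                                  # converges near [2.0, 0.5]
```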
### Relation to Genetic Programming
Prioritized Enumeration is
heavily influenced by GP.
It uses the same parse tree representation
for candidate equations.
PGE, however, differs in how it
evaluates, selects, and generates
new candidates from old ones.
This section is meant to help
those familiar with GP
to understand where and what
the analogous processes are.
**Parameters**
PGE is nearly a parameter free
algorithm, requiring only that
the building blocks, generating functions,
the value of $p$, and termination criteria
be established prior to
beginning an equation search.
Since building blocks are
generally established by the benchmarks,
and stopping criteria are relatively
common across SR implementations,
PGE has only two unique parameters:
(1) $p$, the number of equations to remove from the PPQ
and
(2) which generation functions to use.
This is a substantial
advantage over GP,
which requires many parameters to be set,
a notoriously difficult task.
**Evaluations**
Evaluation is the most
time consuming phase
of the model discovery process.
PGE performs two steps of evaluation:
fitting coefficients and evaluating fitness.
PGE uses
non-linear regression
to optimize the coefficients,
which requires an undetermined
number of evaluations
on the derivatives of the equation.
Once fit, PGE calculates
fitness on the testing set.
In GP, one evaluation
per data point, per equation
is required for both training and testing.
Regressing the coefficients
has a profound effect on the
ability of PGE to quickly
traverse the search space.
It allows for the separation
of equation form and coefficient optimization
into two sub-tasks of the overall search.
In GP, the optimization of form
is difficult and the most time
consuming portion of the search process.
**Selection**
To prioritize the enumeration of expressions,
we maintain what we will refer to as a Pareto Priority Queue (PPQ).
The contents of the PPQ are the equations
which have been regressed and evaluated,
but have not yet been expanded.
The PPQ is similar to
the basic priority queue,
in that we want quick access
to the elements with the highest priority.
However it is different
because we are sorting on
two objectives, parsimony and accuracy,
rather than one.
When selecting the equations
to expand next,
we remove one or more elements
from the front of the heap.
Since the front of the heap
is the first Pareto front,
we prioritize the smallest
equations first.
The more equations we *peel* (remove)
from the front of the heap,
the more of the search space
we consider concurrently.
This enables the search to
exploit equations that
express different levels
of the trade off between
parsimony and accuracy.
If we select only one equation
from the PPQ per iteration,
the search degenerates into
a search which progresses
through expression size.
This degenerate case will
exhaust all possible equations
of the current size
before considering
any larger equations.
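The peel-and-expand selection described above can be sketched as follows. Candidates are reduced to hypothetical (complexity, error) pairs, and only the first-front peel is shown; the real PPQ also orders successive fronts.

```python
# Minimal sketch of peeling the first Pareto front from a pool of
# (complexity, error) candidates; both objectives are minimized.

def dominates(a, b):
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def peel_front(pool):
    """Remove and return the non-dominated candidates, smallest first."""
    front = [p for p in pool if not any(dominates(q, p) for q in pool)]
    for p in front:
        pool.remove(p)
    return sorted(front)          # smallest equations first

pool = [(3, 0.9), (5, 0.5), (5, 0.7), (9, 0.2), (11, 0.25)]
print(peel_front(pool))  # [(3, 0.9), (5, 0.5), (9, 0.2)]
print(pool)              # dominated leftovers await a later peel
```

Peeling more than one front per iteration corresponds to a larger *peel* count $p$, widening the slice of the parsimony/accuracy trade-off searched concurrently.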
**Population and Breeding**
In PGE, there is no need to
maintain a diverse population
in which survival and replication
determine the outcome of the search.
In fact, one of the goals of PGE
is to only process
an expression form once.
This is managed by the IPT
which tracks the equations that have
been evaluated and expanded.
Equations which have been
generated, regressed, and evaluated
sit in the PPQ.
Once an equation has been
removed from the PPQ
and expanded upon,
it requires no further consideration.

A dynamic-programming algorithm solves each subsubproblem just once and then saves its answer in a table, thereby avoiding the work of recomputing the answer every time it solves each subsubproblem.

~ Cormen:2009:algorithms

In PGE and SR, a sub-problem is a specific form, namely the parse tree for an equation and its terminals. Recall that PGE uses placeholders for the parameters, so an equation form accounts for the existence of a parameter without knowing its value. The key to evaluating forms once is to fully optimize a form the first time it is seen, and to remember the forms which have already been explored. PGE fully optimizes forms by fitting the valueless parameters with non-linear regression, thus assigning them values. In order to memoize the equation forms, they are converted to a linear representation and then recorded in a lookup structure called the Integer Prefix Trie. This is possible because each equation has a unique canonical form.

**Non-linear Regression** is not unique to PGE, or GP for that matter, but it is a central component of PGE's ability to evaluate an equation form once. As PGE encounters new equation forms, it fits the abstract parameters using non-linear regression, resulting in the 'best' version of that form on the training data. PGE uses the Levenberg-Marquardt (LM) optimization algorithm [CITE]() to fully fit an equation to a set of data. In cases where an equation is linear w.r.t. the parameters, the LM library returns in one iteration by using singular value decomposition (SVD). PGE's treatment of parameters and use of non-linear regression enables the separation of the search for equation form from the optimization of a form's parameters. In this way, the search for form and the optimization of coefficients are split into separate tasks. This separation has profound implications for the efficiency and effectiveness of PGE. It enables an equation form to be fully evaluated once, removing the duplication of effort GP experiences when using genetic operators to optimize the coefficients of an equation form. Some GP implementations have incorporated the use of regression for fitting coefficients.
However, they make no use of memoization, and therefore still suffer from the duplication of effort which results from multiple copies of the same equation.

**Memoization** Canonical forms for equations reduced one source of combinatorics in the SR problem. A second source is multiple derivations: multiple paths over the graph which reach the same node. Below is a diagram which demonstrates this.
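The IPT lookup described above can be sketched with nested dictionaries. This is a hypothetical minimal version; the actual IPT in PGE may lay out nodes differently.

```python
# Minimal integer-prefix-trie sketch for memoizing equation forms.
# insert() walks the integer sequence, creating nodes as needed, and
# reports whether this exact sequence was recorded before.

class IPT:
    def __init__(self):
        self.root = {}

    def insert(self, seq):
        """Return True if this serialized equation was already recorded."""
        node = self.root
        for code in seq:
            node = node.setdefault(code, {})
        seen = node.get("end", False)
        node["end"] = True
        return seen

ipt = IPT()
add_x_c = [1, 4, 6]          # e.g. invented codes for [add, x, c] in prefix order
print(ipt.insert(add_x_c))   # False: first visit, so regress and expand it
print(ipt.insert(add_x_c))   # True: duplicate derivation, skip the work
print(ipt.insert([1, 4, 5])) # False: a different form
```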

### Components
1. **Representation** - Trie and linear representations with valueless parameters
1. **Evaluation** - Nonlinear regression for parameters, fitness metric for models
1. **Optimization** - Exploration, prioritization, and regularization
1. **Memoization** - Detecting overlapping subproblems to avoid duplicate work

PGE has four components which enable its search capabilities. The first is the set of generating functions which produce new equations from previous equations. The second is an Integer Prefix Tree which enables memoization of equations from their serialized representation. The third component is non-linear regression, which separates form from optimization. The fourth component is the Pareto Priority Queue, which gives order to the search space by prioritizing equations over one another.

#### Representation

PGE uses multiple representations for an equation. The syntax trie is used for optimization. The integer sequence is used for memoization.

**Syntax Trie** PGE uses syntax tries, in which operators can have a variable number of children. In the n-ary trie representation, the associative operators can have n sub-trees, flattening and reducing the tree's size. This is a slight modification of the usual binary tree, affecting only the associative operators addition and multiplication. The n-ary trie does not change the modeling ability of the equation, but will affect the trade-off between parsimony and accuracy. This in turn affects the selection operation of any SR implementation, though we do not investigate this issue here. In addition to the reductions in parse tree size, the n-ary representation eases the design of sorting and simplification algorithms. These algorithms, detailed next, work within the parse tree to reshape equivalent trees into a canonical form. This effectively merges isomorphic equations, reducing the size of the search space while simultaneously adding structure to it.
**Linear integer sequence** is the prefix-notation representation of an equation, mapped to the integers. It is equivalent to the trie representation, with a one-to-one correspondence. The linear integer sequence is obtained by mapping each node type to a different integer and then performing a pre-order traversal of the trie, collecting the integers by appending them to a list.

The usefulness of this representation is its efficiency in implementation. Integers can be used for constant time lookups, where otherwise we may have been limited to a higher complexity operation. The linear representation is mainly used in conjunction with the Integer Prefix Trie (IPT) during the memoization process. The IPT is an efficient data structure for recording and looking up equations and will be detailed shortly. Additionally, the linear representation can be used for interprocess and network communication. Its main benefit there is on the receiving side, where we can avoid the complexity of parsing the human-readable version of an equation.

**Abstract parameters** are the valueless placeholders used as coefficients in equations. They signify that a parameter is present and where in an equation it occurs. The benefit of abstract parameters is two-fold. First, equation form matching and memoization can happen without regard to parameter values. Second, they enable the optimization of the parameters to be performed in a separate process, via non-linear regression. This is the key to separating the optimization of a given form from the optimization which occurs at the global scale, in the space of all equations.

#### Evaluation

**Non-linear parameter fitting**, though not unique to PGE, or GP for that matter, is a central component of the PGE algorithm. It enables the separation of the search for form from the optimization of that form's parameters. Great gains in efficacy are realized by this separation.
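A pre-order serialization like the one described might look like this. The node-to-integer codes are invented for illustration; any fixed one-to-one mapping works.

```python
# Pre-order serialization of an n-ary syntax trie (nested tuples, with
# leaves as strings) into a linear integer sequence.

CODE = {"add": 1, "mul": 2, "x": 3, "y": 4, "c": 5}

def serialize(tree, out=None):
    """Map each node to its integer code in pre-order, collecting a list."""
    if out is None:
        out = []
    if isinstance(tree, str):
        out.append(CODE[tree])
    else:
        out.append(CODE[tree[0]])
        for child in tree[1:]:
            serialize(child, out)
    return out

# c + x*y  as an n-ary trie: ("add", "c", ("mul", "x", "y"))
print(serialize(("add", "c", ("mul", "x", "y"))))  # [1, 5, 2, 3, 4]
```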
PGE achieves this separation by using abstract parameters. Parameters are indexed into an array, which means they do not store their value internally. The indexed parameters contribute only to form, taking on value only after non-linear fitting. Handling parameters in this manner is analogous to the way variables are treated, which contribute to fitness only during evaluation.

**Fitness metrics** are composite metrics which combine a notion of accuracy with a notion of complexity. They are used to compare the relative fitness between two models or equations, and are the multi-objective value which SR implementations seek to optimize. The most common fitness metric uses the tree size and the mean squared error. Many values for accuracy and complexity may be used, and more than one of each may be used. Additionally, the values in the fitness metric may be weighted to give preference. Another useful practice is to normalize the values used in the fitness metric across the set, before sorting into successive non-dominated sets. The Pareto non-dominated sorting methods, described in the previous chapter, use fitness metrics to determine the successive non-dominated sets of equations.

#### Optimization

SR is a multi-objective optimization task which seeks to maximize accuracy while minimizing complexity. To search over equation space, PGE must move within the space and decide where and how to move. PGE uses exploration operators derived from the rules of the grammar. The exploration operators take one equation as input and output a set of equations which result from small edits. PGE uses a priority queue to express which points in the search space to expand next. By incorporating the Pareto non-dominated sort into a priority queue, the PGE search can exploit and optimize the trade-off between competing objectives in a deterministic order.

**Exploration Operators** are the means by which PGE searches the space of all equations. Each exploration operator is a policy for how to expand and modify an equation's parse tree to obtain functions 'close' to the current one.



**Prioritization** determines which points
in the space to search next. This process
uses a priority queue coupled with a
Pareto non-dominated sorting algorithm.
**Exploration** discovers new models by
making small changes to a given model.
To maintain determinism, PGE makes
all possible changes to a model,
given the particular exploration scheme in use.
**Communication** occurs when PGE uses
parallel execution. In order to maintain
deterministic behavior, well-defined
synchronization points are used.
Details of each aspect will be expanded upon
in the upcoming sections and next chapter.
#### Search Space Organization
The size of the space defined by a grammar
is infinite, even when disregarding the
increase brought by considering real-valued coefficients.
The main contributions are combinatorial components
at both the search space and model levels of the problem.
PGE's great efficiency gains are the result
of methods designed to combat the combinatorial components.
We begin at the model level, where the main tool
is the trie based representation.
It is dynamic programming which comes to the rescue
at the search space level.
**Tree Representation Combinatorics**
PGE primarily uses the trie based representation
for the models, as detailed in the last chapter.
The reason for this is the combinatorial
number of different binary tree representations
for the equivalent model.
These different representations arise
from permutations of the leaf nodes
and from variations in the tree structure.
Consider the equation `a•b•c` (Figure 4-1).
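The combinatorial pile-up of equivalent trees for `a•b•c` can be tamed by flattening associative operators and sorting the children of commutative ones. A minimal sketch on tuple-based trees (the representation is assumed for illustration):

```python
# Flatten nested add/mul nodes into one n-ary node and sort their
# children, so every ordering and shape of a*b*c canonicalizes to the
# same form, merging isomorphic equations.

def flatten(tree):
    """Return a canonical n-ary form of a nested-tuple parse tree."""
    if isinstance(tree, str):
        return tree
    op, kids = tree[0], [flatten(k) for k in tree[1:]]
    if op in ("add", "mul"):
        merged = []
        for k in kids:
            if not isinstance(k, str) and k[0] == op:
                merged.extend(k[1:])     # splice same-op child upward
            else:
                merged.append(k)
        kids = sorted(merged, key=repr)  # commutative: canonical order
    return (op, *kids)

t1 = ("mul", "a", ("mul", "c", "b"))     # a*(c*b) as a binary tree
t2 = ("mul", ("mul", "b", "a"), "c")     # (b*a)*c
print(flatten(t1))                       # ('mul', 'a', 'b', 'c')
print(flatten(t1) == flatten(t2))        # True: one canonical form
```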

Genetic Programming (GP) is the original implementation of Symbolic Regression (SR). Traditionally, much of the literature has used the term Genetic Programming for both the problem and the implementation. Historically, the term Symbolic Regression has been used for a sub-problem of Genetic Programming, the task of evolving equations, and often for an evolutionary method for finding equations. We want to distinguish the task of symbolically regressing a parse tree in a grammar from the method of realizing that tree. That is, we separate the objective from the optimization method. From here on, when we refer to Genetic Programming, we mean an evolutionary implementation of Symbolic Regression. And when we use Symbolic Regression, we mean the generalized regression problem, limiting our examples to equation regression, which we consider a specific sub-problem of Symbolic Regression.

1. **Overview** - Basics of Genetic Programming
1. **Representation** - Tree-based and beyond
1. **Evaluation** - Fitness measurement
1. **Optimization** - Genetic operators and selection
1. **Initialization & Termination** - It's important

GP requires several components to be specified in its implementation. Representation and evaluation define the solution structure and how it is simulated to infer modeling accuracy. Initialization determines the search starting point and what portions of equation space are reachable. Selection and breeding policies determine which candidates are better and how the space of equations is searched. This section overviews each of these components in the GP algorithm.

The most common implementation of SR has been the evolutionarily inspired method called Genetic Programming (GP) [ koza:1992:gen_prog ]. GP is part of the larger family of Genetic Algorithms (GA), a class of algorithms inspired by 'survival of the fittest' [ holland:1962:outline, goldberg:1988:genetic ]. GP differs from GAs in solution representation. In GAs, the solution representation is a fixed-size structure. In GP, solutions vary in size, usually represented as a parse tree within a grammar. While GP can be applied to generic languages, and was originally intended to evolve computer programs, we shall restrict our discussion to mathematical equations. This is in line with the general focus of the GP literature.

GP is a stochastic search heuristic over the syntax-tree representation of an equation. GP requires several components to be defined: representation, evaluation, selection, genetic operators, initialization & termination, and population management. Representation and evaluation, the first two components of a ML algorithm, were detailed previously. We will expand upon their usage in GP here. The third component, optimization, is the combination of selection, the genetic operators, and population management. Initialization and termination also require care, because the success of GP is sensitive to their choices.

The basic GP process begins with an initial, random population of models. Each individual is then evaluated to determine its fitness. This is followed by a selection process for candidate survival and replication. Survivors are then recombined using methods which resemble sexual reproduction and mutation. GP continues this process for a number of generations, until a model of desired accuracy is discovered or a computational threshold is reached.

For detailed overviews of tree-based Genetic Programming, see [ koza:1992:gen_prog, kouch:08:thesis ]. For a nearly complete list of all GP publications, visit GP-bib, maintained by Bill Langdon.

Section SR-components described the various schemes for representing equations.

GP uses the operators and operands of this framework to construct parse trees for equations. The parse trees are the chromosomes of an individual. This analogy is important to the methodology, which uses sexual reproduction, mutation, and natural selection to search. One note about representation: equation coefficients are real-valued numbers. As we will see, having continuous parameters, and optimizing them by evolution, makes the task much harder.

**Representation Closure**

In the basic GP algorithm, expressions are generated at random. This can result in invalid operations being evolved which violate our basic mathematical constraints. Closure is the property that an expression be valid mathematically. Invalid expressions include dividing by zero, taking the square root of a negative number, and violating other similar rules.

A simple remedy exists in which
invalid operations are removed
and functions such as log
are protected.
This method has been termed
*interval arithmetic*
[
keijzer:03:improving,
kotanchek:2008:trustable
].
A second method is to place restrictions
on the grammar, as in
[
hoai:2001:framework,
hoai:2002:solving,
hoai:2003:tree
].
Hoai prevents invalid expressions from being generated
where the previous method removes them later.
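The protected-operator remedy can be sketched as follows. Conventions vary across implementations (some return 0, some 1, on invalid input), so the exact return values here are illustrative:

```python
import math

# Common 'protected' operators used to maintain closure: every input
# produces a finite, defined result instead of a domain error.

def pdiv(a, b):
    """Protected division: defined even when b is zero."""
    return a / b if b != 0 else 1.0

def plog(a):
    """Protected log: log of the absolute value, with zero guarded."""
    return math.log(abs(a)) if a != 0 else 0.0

def psqrt(a):
    """Protected square root: operate on the absolute value."""
    return math.sqrt(abs(a))

print(pdiv(6, 3), pdiv(1, 0))   # 2.0 1.0
print(plog(0))                  # 0.0
print(psqrt(-9))                # 3.0
```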

**Populations** are the pools
of candidate equations GP has to work with.
A population is essentially just a set of equations,
with nomenclature reflecting the
biological evolution inspirations of GP.
Basic GP uses a single population.
Parents are selected from this population
for breeding, children and parents
compete for positions in the population
between generations.

**Schema Theory** developed as a
framework for talking about partial
solutions in GP
[ holland:1975:adaptation ].
Originally associated with Genetic Algorithms,
it has been useful in discussions
around solution propagation and convergence,
as well as the disruptiveness
of crossover in GP.

A schema is a template for an individual where some of the parts are filled in and some are "blanks", not yet determined. The Schema Theorem states that schemata which have fewer determined components and above-average fitness will increase exponentially in successive generations. This means that good schemata will propagate through the population.
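Stated quantitatively, in its classical genetic-algorithm form (bit strings of length $$l$$, crossover rate $$p_c$$, mutation rate $$p_m$$, schema defining length $$\delta(H)$$ and order $$o(H)$$), the theorem bounds the expected count $$m(H,t)$$ of schema $$H$$:

$$ E[m(H, t+1)] \ge m(H, t)\,\frac{f(H)}{\bar{f}(t)}\left[1 - p_c\,\frac{\delta(H)}{l-1} - o(H)\,p_m\right] $$

GP analogues replace the string-based disruption terms with tree-based disruption probabilities, but the qualitative conclusion is the same: short, above-average schemata spread.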

The basic GP algorithm uses the evaluation methods outlined in SR-components. The fitness of an individual is then determined by a combination of accuracy and size. We will revisit evaluation in GP in the next section to discuss some variations.

Optimization is performed over a series of iterations, known as generations in the GP scheme. Following the evolutionary motif, optimizing a population requires selection and replication. As the great MPUA says

The sole reason we exist is to survive and replicate. That's it. ~ Mystery

GP simulates the evolutionary process on equations.
Replication uses *crossover* and *mutation*
to create new individuals from existing ones.
Selection uses the non-dominated sorting
Pareto fronts described in
Section SR-???.
Here we will describe the particulars
as they relate to GP.

**Genetic operators** are
the means by which GP
searches the space of equations.
The search uses
crossover, an exploratory technique, to find unexplored regions
and mutation, an exploitative technique, to search locally.
A balance between both is needed to make progress
and converge towards good solutions.

**Crossover** is analogous to sexual reproduction and
uses two parent equations to produce child offspring equations.
To perform crossover, subtrees of the parents
are selected, snipped, and swapped.
This method is similar to crossing DNA,
except that GP uses a branching representation,
rather than a linear one.
The original crossover method
selects subtrees of the parents at random.
Enhancements on selecting crossover points
will be described below.
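Subtree crossover can be sketched on tuple-based parse trees. The crossover points are passed explicitly here (pre-order indices) so the example is deterministic; GP implementations choose them at random. The representation is assumed for illustration.

```python
# Subtree crossover: snip the chosen subtree from each parent and swap.

def nodes(tree):
    """List subtrees in pre-order; index 0 is the whole tree."""
    out = [tree]
    if not isinstance(tree, str):
        for child in tree[1:]:
            out.extend(nodes(child))
    return out

def replace(tree, idx, sub):
    """Copy tree with the pre-order idx-th node replaced by sub."""
    state = {"i": -1}
    def walk(t):
        state["i"] += 1
        if state["i"] == idx:
            return sub
        if isinstance(t, str):
            return t
        return (t[0],) + tuple(walk(c) for c in t[1:])
    return walk(tree)

def crossover(p1, p2, i1, i2):
    """Swap subtree i1 of p1 with subtree i2 of p2."""
    s1, s2 = nodes(p1)[i1], nodes(p2)[i2]
    return replace(p1, i1, s2), replace(p2, i2, s1)

mom = ("add", "x", ("mul", "x", "y"))   # x + x*y
dad = ("sub", ("div", "y", "x"), "c")   # y/x - c
kid1, kid2 = crossover(mom, dad, 2, 1)  # swap mom's x*y with dad's y/x
print(kid1)  # ('add', 'x', ('div', 'y', 'x'))
print(kid2)  # ('sub', ('mul', 'x', 'y'), 'c')
```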

It is considered a destructive operation due to the significant probability that good schemata of the tree are disrupted. Disrupted in this context means that a partial solution, stored in a subtree, has an internal node selected for the crossover point. To combat this, i.e. reduce the probability of disruption, candidate solutions grow extra ‘genetic material’ that has little value in the evaluation. This ‘natural’ growth, called bloat, is described below in the sections on limitations and enhancements.

**Mutation** alters a single candidate equation
to search locally around that candidate.
Mutation simply selects a node of a tree
and changes it to a different type.
Examples of mutations are
a change of variable from X to Y,
or of an operation from addition to subtraction.
Mutation is also the way the original GP algorithm
optimizes real-valued coefficients.
If a coefficient is selected for mutation,
it is scaled by a random value in some predefined range.

**Injection** is a less used method
in which a single candidate is altered
by choosing a single point of change.
It is similar to mutation,
but instead of changing a node type,
the subtree is trimmed and replaced
by a freshly grown branch.
In terms of exploratory and exploitative
search qualities,
injection lies between
crossover and mutation.

**Selection** is the most important part
of evolutionary algorithms,
and thus GP.
It is therefore very important
to choose the selection mechanisms wisely
[ Dumitrescu:2000:EC ].
Selection's role is to differentiate between candidates,
based on their quality, for survival and replication.
This artificial environmental pressure causes natural selection,
raising the fitness of the population.
[
Dumitrescu:2000:EC,
goldberg:1991:comparative
]
extensively cover the variations in selection schemes.
Here, we will give an overview to show what is possible.

There are two points where selection is used, for survival and replication. Replication is the chance of having genetics passed to an offspring. This is how good partial solutions are passed through the population and altered to find more of the solution.

Parental selection can happen with or without replacement. Proportional selection favors better candidates by increasing the probability that they will be selected. This can cause premature convergence, with a locally optimal solution permeating the population and wiping out diversity. Rank-based selection seeks to overcome this by associating a position, rather than a probability, with a candidate's chance of selection. Rank is often calculated from the dominate/dominated ratio of a candidate. Rank selection benefits GP by increasing selection pressure when the population variance is low.

Binary Tournament [ goldberg:1991:comparative ] selects two individuals and competes them for one of the parental slots. This is done twice to generate both parents for a recombination. Binary tournament can be extended to compete N candidates against each other. Binary tournament is similar to selecting a pivot point in library implementations of quick-sort.
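A tournament can be sketched as follows, with candidates as hypothetical (name, error) pairs and a seeded RNG so the example is repeatable:

```python
import random

# Tournament selection: draw k entrants at random, keep the fittest
# (here, lowest error). Run once per parental slot.

def tournament(pop, k, rng):
    """Select one parent: best fitness among k random entrants."""
    entrants = rng.sample(pop, k)
    return min(entrants, key=lambda ind: ind[1])

pop = [("m1", 0.9), ("m2", 0.4), ("m3", 0.7), ("m4", 0.2)]
rng = random.Random(0)
parents = [tournament(pop, 2, rng) for _ in range(2)]   # binary: k = 2
print(parents)  # two winners, possibly the same model twice
```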

Survival is the chance of an individual solution remaining in the population. In the elitist method, a parent is only removed if one of its offspring is better than it. In the pure method, all parents and children contend for a spot in the population.

**Initialization** determines the search starting point
and what portions of equation space are reachable.
GP uses random generation,
with a uniform probability distribution,
to create the initial population of candidate solutions.

There are two ways to grow a single parse tree. The first is to grow a full tree up to a specified depth. The second is to allow node selection to include variables. This causes generation to halt at unspecified times while traversing the tree, with the resultant tree having variable shape. To create the initial population, these two tree generation schemes can be used for the entire population or can be used for different percentages of the population. A method known as ramped half-and-half creates half the population with the full method and the second half with the variable grow method. A size proportional set can be created by growing an increasing number of trees as the maximum depth increases. For example, if we limit candidates to binary trees, we grow twice as many trees as the previous depth.
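The full and grow schemes, and their ramped half-and-half combination, can be sketched as follows. The operator set and the 0.3 early-termination rate are assumptions for illustration, and the RNG is seeded for repeatability:

```python
import random

# full(): every branch reaches the depth limit.
# grow(): branches may stop early by choosing a terminal.

OPS, TERMS = ["add", "mul"], ["x", "y", "c"]

def full(d, rng):
    if d == 0:
        return rng.choice(TERMS)
    return (rng.choice(OPS), full(d - 1, rng), full(d - 1, rng))

def grow(d, rng):
    if d == 0 or rng.random() < 0.3:       # 0.3: assumed terminal rate
        return rng.choice(TERMS)
    return (rng.choice(OPS), grow(d - 1, rng), grow(d - 1, rng))

def depth(t):
    return 0 if isinstance(t, str) else 1 + max(depth(c) for c in t[1:])

rng = random.Random(1)
# ramped half-and-half: half the population from each scheme
pop = [full(3, rng) for _ in range(5)] + [grow(3, rng) for _ in range(5)]
print([depth(t) for t in pop])  # full trees hit depth 3; grown trees vary
```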

Ensuring diversity of equations at the start is difficult. One method is to continually generate equations, eliminating duplicates and invalid expressions, until the population is filled. Another method is to run several short GP searches, leveraging GP's randomness (non-determinism), to obtain a pool of partially optimized candidates.

**Termination**

Since Genetic Programming is a heuristic search, it has no intrinsic stopping point. Therefore, we must tell it when to stop. The most common method is generational, i.e. stopping upon reaching a predefined number of generations or iterations. This gives us a reasonable idea of how long a search will take, as well as providing a measurement of what can be solved in a given number of steps.

Another class is performance based stopping criteria.
To use these, an accuracy metric needs to be defined.
One such metric is simply the
best performing candidate's error.
If the error is less than some defined threshold,
then the search is terminated.
This method, however, is sensitive to the
scale of the data and error measurement.
Another performance based method uses *hits*,
the number of data points which a candidate equation
is within some threshold of the real value
citation.
In this case, the search terminates if any
candidate scores a hit on all data points
or some percentage of all the data points.

Other schemes attempt to measure the progress of the search. One such example is the number of generations since a new best solution has been found. If the search continues for a time without finding improvement, then termination occurs.
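The common stopping criteria described above can be combined into a single check (a sketch; all thresholds are hypothetical defaults):

```python
def should_stop(generation, best_error, errors, state,
                max_gens=500, err_thresh=1e-6,
                hit_thresh=0.01, hit_frac=1.0, stall_limit=50):
    """Combine the common GP stopping criteria (hypothetical thresholds)."""
    # 1. Generational: fixed iteration budget.
    if generation >= max_gens:
        return True
    # 2. Error threshold: best candidate is accurate enough.
    if best_error < err_thresh:
        return True
    # 3. Hits: fraction of points within hit_thresh of the true value.
    hits = sum(1 for e in errors if abs(e) <= hit_thresh)
    if hits >= hit_frac * len(errors):
        return True
    # 4. Stagnation: generations since the last improvement.
    if best_error < state.get("best_so_far", float("inf")):
        state["best_so_far"] = best_error
        state["stall"] = 0
    else:
        state["stall"] = state.get("stall", 0) + 1
    return state["stall"] >= stall_limit
```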

No matter what means is used to determine the stopping point, the results of the search need to be returned. Generally a set of the best solutions are returned, providing some amount of diversity. This set usually comes from the Pareto frontier, but can also include solutions which come from successive frontiers.

- **Disruptiveness of Crossover** - the basic search operator has uniform selection probability. It is the reason for bloat and inefficiencies.
- **Equation Bloat** - the growth in tree size to reduce the probability of node selection. Bloat emerges to protect good partial solutions.
- **Equation Redundancy** - results from multiple copies of the same equation. Compounded by commutative and associative properties.
- **Loss of Population Diversity** - results when a partial solution dominates the population. Caused by bloat and good partial solutions.
- **Premature Convergence** - results when a search converges upon a local optimum. Caused by loss of population diversity.
- **Fundamental Issues in GP** - topics which are inherent to the process or have been unaddressed by the GP community.

The basic GP algorithm has many limitations stemming from its evaluation, selection, and breeding sub-algorithms. Many research papers in the field have sought to address these limitations and offer solutions. In this section we discuss these limitations, detailing where they come from. We relate the solutions which offer improvement to each limitation, and discuss their effect on the GP algorithm. The limitations, as well as the counters, are often intertwined between several aspects of the GP algorithm. For example, if the initial starting points do not give sufficient coverage of the search space, GP will not be able to reach areas of that space.

GP has issues which it inherits from evolution, which is good at dealing with uncertainty and variability, but is not the ideal method for homing in on the most fit individual. These issues stem from the difficulties in maintaining a diverse population of equations while converging towards the best equation. Representation, the genetic operators, selection mechanisms, and population structure are interwoven in complex ways. These difficulties, and advancements made in GP, have effects which permeate through the entire GP infrastructure and operation. Measuring and understanding these effects requires rigorous testing and analysis, something lacking but being addressed [ McDermott:2012:benchmarks ].

The basic version of crossover has uniform selection probability driven by random number generation. This is the main cause of bloat and inefficiencies. The diagram below shows the probability of selecting a node given the tree size.

As the partial solutions converge towards the real solution, an increasingly larger percentage of the tree structure is correct as is. However, as the percentage of correct structure increases, so does its chance of being selected for crossover. Thus the probability of disrupting the correct portion of a good partial solution increases [ Whitley:1994:tutorial, Poli:1998:ec ].

To negate this issue, the internal GP process grows large subtrees with little effective output. That is, they return values close to zero and do not change the output of the model significantly. By having large subtrees, the probabilities of selection change, and reduce the likelihood of disrupting the good partial solution in a different branch of the tree. This extra no-op material is referred to as bloat.
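The protective effect of bloat follows from simple arithmetic over node counts (hypothetical numbers):

```python
def disruption_probability(correct_nodes, total_nodes):
    """With uniform node selection, the chance that crossover lands inside
    the correct partial solution is simply its share of the tree's nodes."""
    return correct_nodes / total_nodes

# A correct 10-node subtree in a 15-node tree is hit 2/3 of the time;
# adding 35 bloat nodes elsewhere drops that to 1/5, shielding the subtree.
```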

Equation bloat is the growth of average candidate size over the course of evolution. Bloat is inherent in any variable-length representation [ langdon:1998:bloat, langdon:2000:quadratic ]. The general consensus for the cause of bloat is the disruptiveness of crossover, though [ banzhaf:2000:bloat ] speculated that fitness causes bloat.

Parsimony pressure is meant to control bloat, but the disruptiveness of crossover, more often than not, overpowers the Pareto front’s benefits. The next section discusses several schemes which attempt to mitigate bloat and the disruption of partial solutions.

Bloat causes partial solutions to dominate the population as GP attempts to optimize the best partial solution’s coefficients. These equations are either disrupted partial solutions from crossover or the same solution differing only in the value of one coefficient. The net result is that, as the GP search progresses, many copies of the same equation are generated. This in turn, results in a lot of time being wasted testing equations that are minimally different from others.

Equation redundancy is further compounded by commutative and associative properties that allow for the tree combinatorics discussed in the last chapter. To our knowledge, this issue still remains unaddressed in the GP community.

A major issue through the course of evolution is maintaining the proper diversity within the population of candidate solutions. One can often find a good partial solution spreading quickly through the population and destroying diversity. Generally speaking, diversity should be greater at the early stages and reduce as time progresses. If we allow too much diversity, the algorithm cannot refine the candidates and fails to converge to any optima. If we force too little diversity, the algorithm converges too quickly and doesn’t thoroughly search the space.

Population diversity is difficult to diagnose and address. Many factors affect the population diversity through the course of evolution: the population size, the mating pool size, elitism, and the survival scheme. The difficulty in diagnosis is likely due to the lack of a good measure. The next chapter will introduce a structure for comparing unique to total equation forms which could be useful to GP for measuring diversity.

Premature convergence is the result of a loss of diversity within the population. As diversity is reduced, it becomes difficult for the search to move away from a solution which has dominated the population. The result is convergence upon a local optimum [ Srinivas:1994:probabilities, Baker:1985:adaptive ].

Premature convergence is the final result of the aforementioned limitations. It is affected by all components of the GP process and has become the term describing the event, in GP, when the algorithm is stuck on the wrong answer. Generally, premature convergence is remedied by improving individual components.

Several issues are inherent to the GP algorithm. They stem from the heavy usage of random number generation and the use of these random numbers for making choices. This affects the control flow through almost all aspects of a GP implementation. It is our belief that where GP uses a random number to make a decision, a well-defined and informed choice can instead be made. Another effect of random choice-making is the loss of structural information about the search space. Randomized exploration operators lack information about equation relationships in the larger search space. The next chapter will show how PGE maintains and takes advantage of the search space structure. A final result of using random number generators is that the final results of a GP search are different with each invocation. GP usually mitigates this by running the search several times to gain a consensus. A consistent random seed could be used; however, this would only fixate the random choices made.

Other issues in GP remain completely unaddressed by the community: redundant candidates due to associativity and commutativity in algebra, as well as the amount of diversity and redundant effort. Both seem to be due to the lack of a detection or measurement technique.

More concerning is the lack of good research practice within a significant portion of the GP community. An effort was made to shed light on this issue and start a conversation to improve the situation. In brief, there are inconsistencies in benchmarking, metrics, and cross-comparisons between implementations. Further, the details reported in the literature are insufficient to reproduce the work or compare results in a meaningful way.

- **Selection Improvements** - parent recombination and archives.
- **Variations on Crossover** - grammar guided and probabilistic.
- **The Island Model** - run several GP instances with message passing.
- **Co-evolutionary GP** - evolve fitness predictors in parallel with equations.
- **Hybrid Algorithms** - GP algorithms with non-GP parts.

Since Koza’s initial formulation of GP and SR, there has been a plethora of research from generalized implementation enhancements to improvements addressing the issues of disruptiveness of crossover, bloat, population diversity, and premature convergence among many others. We cover material which has relevance to our discussion, though it will necessarily be abridged due to space limitations.

**Brood selection**

Brood selection is a technique to create better offspring from the crossover operation. Rather than each set of parents producing only two children, each set of parents is crossed multiple times, producing many children. The rationale is that by choosing multiple crossover points, we gain a better chance of selecting a good one [ Tackett:1994:brood, Fogel:1997:handbook ].

Brood selection occurs in two phases: offspring creation and survival selection. The first phase happens during crossover. Brood selection applies crossover on two parents N times, producing 2N offspring. In the second phase, during selection, the brood of candidates from two parents is preselected to filter out the best offspring. This best child then becomes part of the usual survival and replication selection phases.
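A minimal sketch of brood crossover, assuming the surrounding GP system supplies the `crossover` and `fitness` functions (both names are hypothetical):

```python
def brood_crossover(parent1, parent2, crossover, fitness, n=5):
    """Cross two parents N times and keep only the fittest child.
    `crossover` returns two children per call; lower fitness is better."""
    brood = []
    for _ in range(n):
        brood.extend(crossover(parent1, parent2))   # 2 children per cross
    brood.sort(key=fitness)                         # best child first
    return brood[0]
```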

**Pareto Front**

The Pareto front is a central component of the SR framework. The Pareto non-dominated sort, or Pareto front, balances the trade-off between opposing goals in multi-objective optimization [ fonseca:1993:genetic, horn:1994:niched, van:1998:evolutionary, luke:2002:lexicographic, smits:2005:pareto ]. Various methods improve the diversity along the Pareto front and/or maintain archives of good solutions: NSGA [ srinivas:1994:muiltiobjective ], NSGA-II [ deb:2000:fast ], SPEA [ zitzler:1998:evolutionary ], SPEA2 [ zitzler:2001:spea2 ], SPEA2+ [ kim:2004:spea2 ], PAES [ knowles:1999:pareto ], PESA [ corne:2000:pareto ], and PESA-II [ corne:2001:pesa ] are all algorithms which aim to improve the quality of solutions. SPEA, PESA, and MOGA aim to improve diversity by accounting for density along the front. SPEA2, NSGA, and NSGA-II use an archive to maintain the population of good solutions. NSGA-II and SPEA2+ improve the time complexity of the Pareto sort itself. The overall goal is to include individuals near the front and reward uniqueness and diversity among the candidate solutions.
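The core non-dominated sort can be sketched for the two-objective case of model complexity versus error (a simple O(n^2) version; the production algorithms cited above improve on it):

```python
def pareto_front(candidates):
    """Return the non-dominated set under (complexity, error) minimization.
    Each candidate is a (complexity, error) pair; a point is dominated if
    another is no worse in both objectives and strictly better in one."""
    front = []
    for c in candidates:
        dominated = any(
            o[0] <= c[0] and o[1] <= c[1] and o != c
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front
```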

**Stepwise Adaptive Weights** (SAW)
alters the fitness function
with information gained during the search process [eggermont:2000:stepwise].
SAW weights each sample point,
periodically updating the weights
using the error of the best candidate solution.
If the informing candidate has non-zero error, then
either a constant is added to the weight (CSAW),
or a precision value is added (PSAW).
By varying the sample weights,
SAW accentuates points which
are not modeled well.
This exerts pressure on the
evolutionary process to
produce better models
geared towards explaining
the difficult points.
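A sketch of the SAW update rule under these assumptions (the constant `delta` and the mode names follow the CSAW/PSAW description above):

```python
def saw_update(weights, residuals, delta=1.0, mode="CSAW"):
    """Stepwise Adaptive Weights: boost the weight of points the current
    best candidate fails to fit (residual != 0). CSAW adds a constant;
    PSAW adds the residual magnitude as a precision value."""
    new = []
    for w, r in zip(weights, residuals):
        if r != 0:
            new.append(w + (delta if mode == "CSAW" else abs(r)))
        else:
            new.append(w)
    return new

def weighted_error(weights, residuals):
    """Fitness under SAW: weighted sum of absolute residuals."""
    return sum(w * abs(r) for w, r in zip(weights, residuals))
```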

**Tree-Adjunct Grammar Guided Genetic Programming** (TAG3P)
uses a grammar to enforce syntactical constraints on a solution
[ hoai:2001:framework, hoai:2002:solving, hoai:2003:tree ].
TAG3P is basically the same as GP,
having the same representation and components.
However, TAG3P differs in a couple of ways.
TAG3P only allows crossover points
where the non-terminals are the same.
A similar constraint is enforced
during mutation. Only subtrees
of the same root node can be
substituted into the parse tree.
TAG3P can also incorporate the use of
tree rewriting systems
[ Joshi:1975:TAG ].
By producing only syntactically valid solutions,
TAG3P biases the search towards those solutions
[hoai:2001:framework, hoai:2002:solving].

At the individual genome level, changes to representation, crossover, and coefficient optimization have been explored. Tree-based representation is standard, others include linear, grammar-based [ mckay:2010:grammar ], and graph or Cartesian GP [miller:2000:cartesian]. Invalid mathematical operations are removed in [ keijzer:03:improving,kotanchek:2008:trustable]. Context-aware crossover [majeed:2006:using] selects a snip in one tree and substitutes it at all valid locations in the other parent. [korns:2008:large] extends this to include all possible valid snip replacements from both parents. Semantically Aware Crossover (SAC) [nguyen:2009:semantic] biases crossover to exchange semantically different material and Semantic Similarity-based Crossover (SSC) [uy:2011:semantically] extends this, by limiting the size of material to small, manageable snips.

The island model [ cohoon:1987:punctuated ] splits a unified population into several concurrent GP searches called islands. The interconnections between these islands form lines of communication. At one extreme there is no connection, and each island is a completely separate GP search with a smaller population. If we follow the usual scheme of communication, then we have what is called migration. At regular intervals, an island informs its neighbors of its search progress. This usually takes the form of sending several good candidates to be included in the neighbors' equation pools. Connection settings vary from a ring, to fully connected, to random connections, and several other variants [ alba:2000:influence ]. In [ lassig:2010:benefit ], a proof is given that islands are needed to achieve polynomial-time convergence. Without communication, the islands become basic GP searches and don’t receive the added benefit of information sharing.

Using the island model with migration, several topologies of interaction arise. These topologies are undirected graphs. At the extremes, the ring topology connects each island to one neighbor on each side, and the full topology connects each island to every other island. Other topologies connect to N random neighbors, bringing the ring topology closer to a fully connected graph. Using the island model permits easy GP parallelization [ alba:2002:parallelism ]. [ alba:2001:analyzing ] analyzed synchronous and asynchronous parallel distributed genetic algorithms and found no meaningful difference in ability. In [ alba:2002:super-linear ], super-linear speed-up is claimed by running several GP searches in parallel. However, they incorporated information sharing between the processes, which alters the algorithm’s behavior so much that they should really be considered different implementations.
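Migration on a ring topology can be sketched as follows (a minimal sketch; the population and candidate encodings are hypothetical):

```python
def migrate_ring(islands, k=2):
    """Ring-topology migration: each island sends copies of its k best
    candidates to its clockwise neighbor's population.
    `islands` is a list of populations; candidates are (fitness, model)
    pairs with lower fitness being better."""
    migrants = [sorted(pop)[:k] for pop in islands]   # each island's best k
    for i, pop in enumerate(islands):
        pop.extend(migrants[(i - 1) % len(islands)])  # receive from neighbor
    return islands
```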

Co-evolution is the simultaneous evolution of separate species. Using co-evolution has several consequences on the Symbolic Regression algorithm. First, and most directly, evaluating equations on only a subset of the data alleviates a large portion of the computational burden. Second, co-evolution naturally adapts and maintains pressure on both populations. Pressure is maintained by exchanging some of the best members of each population, the third consequence of co-evolution. At regular intervals, a best and diversified set of equations is shared with the subset evolution process, and similarly current best subsets are shared with the equation islands. A fourth consequence of co-evolution is that the subsets are free to emphasize different areas of the data set to different extents. This allows the search to focus on different areas of the observation space at various times in the search. It also permits for important points to show up more often. [hod:09:implicit_long] showed that points where the rate of change in the data was increased had a greater ability to differentiate between solutions.

Co-evolution adds a multitude of new parameters, more than doubling the number in vanilla Symbolic Regression. This is due to the nearly equal number of parameters for the subset evolution process, plus the extra parameters for describing the interaction. Despite the large increase in parameter space, the benefits from reduced computational effort and increased search ability justify using co-evolution [packard:1988:adaptation, kaufman:1991:coevo, paige:2002:compare].

Due to issues with benchmarking SR algorithms [ McDermott:2012:benchmarks ], as well as generally poor accuracy from state-of-the-art GP algorithms [ korns:2011:accuracy ], several recent works have proposed hybrid or alternative techniques to SR. [ topchy:2001:faster, raidl:1998:hybrid, eberhart:2001:swarm ] use hybrid approaches to speed equation parameter fitting. Abstract Expression Grammar (AEG) [ korns:2011:abstract ] uses several concurrent GP searches with abstract place-holders. This enables the optimization methods to focus on smaller areas of the search space in parallel. Fast Function eXtraction (FFX) [ McConaghy:2011:FFX ] is a non-GP, deterministic algorithm for SR. In FFX, the model is a Generalized Linear Model, and pathwise regularized learning [ zou:2005:regularization, friedman:2010:regularization ] is used to optimize model parameters. In [ bongard:2013:improving ], a hybrid GP-FFX algorithm is proposed, where FFX is used for feature selection prior to a GP search. Their approach was shown to be more effective than either one alone. Real-valued coefficient optimization, a known difficulty for GP, has been improved by local gradient search, non-linear regression, and swarm intelligence [ topchy:2001:faster, raidl:1998:hybrid, eberhart:2001:swarm ].

The classical GP algorithm naturally fits both functional and data parallelism, and as such, there exists a significant amount of literature on the subject. The first investigation was done by the originator of GP, Koza [koza:1995:parallel]. The effects of (a)synchronous execution in distributed GP were investigated in [alba:2001:analyzing], concluding that asynchronous execution achieved equivalent or better results in less time. [alba:2000:influence] investigates the influence of migration policy on distributed GAs, and [lassig:2010:benefit] proves that migration is required for convergence. In [hodjat:2014:maintenance], the effects on, and maintenance of, long-running GP systems are investigated. Work towards implementing GP systems on cloud providers is explored in [veeramachaneni:2013:learning, derby:2013:cloud]. Other research [langdon:2010:large, harding:2011:implementing, augusto:2013:accelerated] has focused on utilizing the GPU to speed up evaluation. Several advanced GP systems have also emerged as the field has matured, including: AEG [korns:2011:abstract], Ec-Star [oreilly:2013:ecstar], Eureqa [hod:09:science], and FlexGP [derby:2013:flexgp].

In addition to research on scaling GP for BigData, many works have investigated the application of GP-based SR to real-world problems. Here we give an overview of GP applications to differential equations and dynamical systems, but the body of literature is filled with applications to many fields. General applications to dynamical systems are investigated in [ lay:1994:application, cao:2000:evolutionary, tsoulos:2006:solving, hod:08:mining, iba:2008:inference, seaton:2010:analytic ]. Specific applications to metabolic networks can be found in [ cho:2006:identification, schmidt:2011:automated ] and genetics in [ sakamoto:2001:inferring, qian:2008:inference ]. In [hod:07:pnas], it was shown that the equations of a dynamical system could be searched in parallel by using a partitioned evaluation scheme. Dynamical systems with multiple time-scales, with signals composed of simpler signals operating at different time-scales, are investigated in [cornforth:2012:symbolic]. [cornforth:2013:inference] showed that equations for unobserved variables in dynamical systems can be recovered in some instances. Invariants, or conserved quantities, were shown recoverable in [ hod:09:implicit_long, hod:09:science ]. Hybrid dynamical systems, which feature both discrete and continuous components, were shown tractable in [ly:2012:learning]. Much of this work uses simulated data from known systems. There are, however, many references to the use of GP on real-world problems and data as well.

Fast Function eXtraction (FFX) is the first deterministic SR implementation [ McConaghy:http:FFX, McConaghy:2011:GPTP, McConaghy:2011:CICC ]. FFX does not use genetic operators, tree-based representations, or random number generation. Instead, FFX uses a Generalized Linear Model (GLM) and a set of basis functions derived from the input features. FFX then applies a series of regressions to fit the parameters of the basis functions up to a desired complexity. This makes FFX very fast, but also limits its flexibility and the complexity of solutions.

FFX uses a Generalized Linear Model (GLM) of the form:

$y = a_0 + \sum_{i} a_i f_i(\vec{x})$

The GLM has linear coefficients on the terms of the summation, and is a flexible version of ordinary linear regression [ McCullagh:1972:Paper, McCullagh:1989:Book ]. The $f_i$ are not required to be linear functions themselves; rather, the model is linear in the coefficients of the terms of the summation. In other words, there are no coefficients inside any of the $f_i$.

To learn the coefficients of the GLM, FFX uses Pathwise Regularized Learning (PRL). PRL augments least squares by adding regularization terms, then sweeping across the regularization weights and returning multiple parameter vectors [ ZouHastie:2005:Paper, Friedman:2010:Paper ]. PRL also has some interesting properties:

- Learning speed is comparable or better than least squares
- Can learn when there are fewer samples than coefficients
- Can learn thousands or more coefficients
- Sets of coefficient vectors are returned trading off the number of coefficients and training accuracy


    def FFX(Data):
        bases = createBasisFunctions(Data.features)
        best = []
        for b in range(1, B):
            lam = 1.0
            eqn = None
            while complexityOf(eqn) != b:
                eqn = PathwiseLearning(Data, bases, lam)
                lam = adjustLambda(eqn, lam)
            best.Push(eqn)
        return best

Following the pseudocode, FFX first creates a set of basis functions for the GLM. To do this, univariate bases are formed from each variable with a set of unary operations. In their example this produced 176 bases. Next, the univariate bases were combined to produce 3374 bivariate bases, resulting in 3550 total bases. By allowing bases to be in both the numerator and the denominator, the overall number of bases doubles to 7100.

Then, for each complexity from 1 to B, FFX derives a linear combination of basis functions. To learn a model, FFX applies pathwise regularized learning to fit the GLM coefficients. FFX then adjusts the regularization value to change the emphasis in the pathwise learning. This has the effect of alternating between increasing and decreasing the number of bases, continuing until the number of function bases equals the desired model complexity. FFX repeats this process for a number of desired complexities so that a variety of models is returned. This enables behavior similar to GP returning the first Pareto frontier.

FFX works well for many problems, requiring far fewer computations than GP. FFX, however, has its own limitations:

(1) there are no coefficients or parameters within the bases, meaning more difficult, non-linear relationships are beyond its abilities. This issue could be addressed by using non-linear regression and abstract coefficients.

(2) Equations such as $v = x*y*z$ are beyond the
abilities of FFX. Individual terms of the summation
are limited in complexity
to a pair-wise combination of uni-variate and bi-variate
bases determined at initialization.
Seeding with increased basis functions
could become prohibitive as the number of
terms grows through pair-wise basis combinations.
In the 13 variable example provided,
the initial number of GLM basis functions was 7100.

As FFX is incapable of finding many of the benchmarks, we did not compare against it.

There are many methods for regression, which generally attempt to explain the relationship between variables. Here, we highlight several which are common or relevant to the subject matter.

Linear regression is a technique for modelling a dependent variable from a set of independent variables and unknown parameters. The most common method for determining the parameters is least squares, which minimizes the squared residuals [ Legrand:1805:orbites_des_comietes ]. The key is that the dependent variable is linear in the parameters of the basis terms constructed from the independent variables. Nonlinear functions may be used, so long as the parameters remain linear.
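For the simple one-variable model y = a + b*x, the least-squares solution has a closed form:

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: covariance over variance; intercept follows from the means.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b
```

Replacing each x with a nonlinear basis term such as x**2 leaves the parameters linear, so the same machinery still applies.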

Nonlinear regression is an extension to the linear variation where the independent variable can now have more complex, nonlinear relationships to the parameters and dependent variables. Unlike linear regression, no guarantee can be made about the optimality of the final parameter estimate. Most methods for estimating the parameters incorporate successive iterations which improve upon the accuracy. Significant algorithms include Gauss-Newton, Levenberg-Marquardt, and gradient descent [ Fletcher:2000:Book, Madsen:2004:Methods ].

PGE uses the Levenberg-Marquardt (LM) algorithm, an iterative technique that finds a local minimum of a function that is expressed as the sum of squares of nonlinear functions. It has become a standard technique for nonlinear least-squares problems and can be thought of as a combination of steepest descent and the Gauss-Newton method. When the current solution is far from the correct one, the algorithm behaves like a steepest descent method: slow, but guaranteed to converge. When the current solution is close to the correct solution, it becomes a Gauss-Newton method \cite{lourakis04LM}.
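For intuition, here is a minimal LM iteration for the one-parameter model y = exp(c*x) (a sketch, not a production implementation):

```python
import math

def lm_fit(xs, ys, c=0.0, mu=1e-3, iters=100):
    """Levenberg-Marquardt for y = exp(c*x): a damping term mu blends
    steepest descent (large mu) with Gauss-Newton (small mu)."""
    def sse(cc):
        return sum((y - math.exp(cc * x)) ** 2 for x, y in zip(xs, ys))
    for _ in range(iters):
        # Residuals and Jacobian of the residuals w.r.t. c.
        r = [y - math.exp(c * x) for x, y in zip(xs, ys)]
        J = [-x * math.exp(c * x) for x in xs]
        JtJ = sum(j * j for j in J)
        Jtr = sum(j * ri for j, ri in zip(J, r))
        step = -Jtr / (JtJ + mu)      # damped Gauss-Newton step
        if sse(c + step) < sse(c):
            c, mu = c + step, mu / 2  # success: act more like Gauss-Newton
        else:
            mu *= 2                   # failure: act more like steepest descent
    return c
```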

Support Vector Regression Machines (SVRM) were proposed as a variation on the original Support Vector Machine (SVM) [ Vapnik:1996:SVRM ]. In SVM, a classification boundary is sought which maximizes the margin between the classes. In SVRM, the basic idea is to flip the SVM problem around and find a function for which the deviation from the data is minimized. SVM and SVRM work for both linear and nonlinear models, and each poses a convex optimization problem [ Smola:1996:Tutorial ].

**Ridge** regression is a regularization method
for ordinary least squares.
It uses the L2-norm of the parameter vector,
imposing a penalty on excessively large parameters.
Often a scaling term is used to control
the effect of the penalty.
[Tikhonov:1943:stability]

**Lasso** regression is similar to Ridge
as a regularization method on ordinary least squares.
Lasso uses the L1-norm, which causes
terms to have zero valued parameters,
thus effectively removing them from the model
[ Tibshirani:1996:lasso,
stanford:http:lasso ]

**Elastic Net** is the linear combination of
Ridge and Lasso techniques
[Zou:2005:elastic].
This method uses the offsetting features
of both methods to find a sparse model
with few non-zero terms.
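For the one-feature model y = w*x, all three penalties admit a closed form, which makes the behavior easy to see (a sketch; `lam1` and `lam2` are the L1 and L2 weights):

```python
def elastic_net_1d(xs, ys, lam1=0.0, lam2=0.0):
    """Closed-form elastic net for the one-feature model y = w*x.
    lam1 = lam2 = 0 recovers ordinary least squares; lam2 alone is
    ridge, lam1 alone is lasso."""
    xty = sum(x * y for x, y in zip(xs, ys))
    xtx = sum(x * x for x in xs)
    # L1 soft-thresholds the correlation; L2 inflates the denominator.
    if abs(xty) <= lam1:
        return 0.0                    # lasso zeroes the coefficient out
    sign = 1.0 if xty > 0 else -1.0
    return sign * (abs(xty) - lam1) / (xtx + lam2)
```

Sweeping `lam1` over a range of values traces out a regularization path like the pathwise methods used by FFX.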

We include several graph problems and algorithms as they served as seeds of thought that grew into the PGE algorithm.

The Minimum Spanning Tree (MST) of a graph is a tree which minimizes the sum of edges which connect all vertices in a graph. MST is solvable in polynomial time using a greedy algorithm.

**Prim’s** algorithm is a greedy algorithm
for finding the MST of an undirected graph.
The algorithm starts with an arbitrary vertex
and constructs the MST by adding the
lowest weight edge of a node
which is already in the MST.
A priority queue is used
to efficiently determine
the next lowest cost edge to add.
[ Prim:1957:Shortest ]
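A compact version of Prim's algorithm using a priority queue (the adjacency-list encoding is a hypothetical choice):

```python
import heapq

def prim_mst(graph, start):
    """Prim's algorithm: grow the MST from `start`, always adding the
    lowest-weight edge leaving the tree (graph: {v: [(weight, u), ...]})."""
    visited = {start}
    edges = list(graph[start])
    heapq.heapify(edges)              # priority queue of frontier edges
    mst, total = [], 0
    while edges and len(visited) < len(graph):
        w, u = heapq.heappop(edges)
        if u in visited:
            continue                  # edge would form a cycle
        visited.add(u)
        mst.append((w, u))
        total += w
        for e in graph[u]:
            if e[1] not in visited:
                heapq.heappush(edges, e)
    return total, mst
```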

**Kruskal’s** algorithm is also a greedy algorithm
for finding the MST of an undirected graph.
The algorithm starts by treating each
vertex as a separate MST.
It then proceeds to select the minimum
edge from all remaining edges.
If the edge connects two distinct MSTs,
then it is included, otherwise, the edge
introduces a cycle and is subsequently discarded.
A disjoint-set data structure can be used
to efficiently determine this condition.
[ Kruskal:1956:Shortest ]
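Kruskal's algorithm with a disjoint-set structure can be sketched as:

```python
def kruskal_mst(n, edges):
    """Kruskal's algorithm: take edges in weight order, discarding any
    that would close a cycle, detected with a disjoint-set (union-find).
    Vertices are 0..n-1; edges are (weight, a, b) tuples."""
    parent = list(range(n))

    def find(v):
        """Find the representative of v's set, with path halving."""
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = []
    for w, a, b in sorted(edges):
        ra, rb = find(a), find(b)
        if ra != rb:                  # edge joins two distinct trees
            parent[ra] = rb
            mst.append((w, a, b))
    return mst
```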

The single-source shortest path (SSSP) problem is to find the shortest path to all nodes, given a starting point.

**Dijkstra’s** algorithm for SSSP is
a greedy algorithm which resembles Prim’s.
[ Dijkstra:1959:algorithm ].
At each step, the next closest vertex
is added to the already visited set,
thus expanding outwards in a greedy manner.
A priority queue is used as well
to improve upon runtime complexity.
The difference is that at each step,
the distances to vertices on the frontier
are updated to reflect the latest addition,
and thus possibly changing the order
of the priority queue.
Knuth generalized Dijkstra’s algorithm to hyper-graphs
[ Knuth:1977:generalization].
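Dijkstra's algorithm with a priority queue can be sketched as (the adjacency-list encoding is a hypothetical choice):

```python
import heapq

def dijkstra(graph, source):
    """Dijkstra's SSSP: greedily settle the closest frontier vertex,
    relaxing its neighbors' distances (graph: {v: [(weight, u), ...]})."""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist.get(v, float("inf")):
            continue                  # stale queue entry, already settled
        for w, u in graph[v]:
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd          # relaxation may reorder the queue
                heapq.heappush(pq, (nd, u))
    return dist
```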

**A* search** is a graph traversal algorithm
proposed as an extension to Dijkstra's algorithm
with better runtime performance.
[ Hart:1968:astar]
The A* algorithm cuts down on the size of the subgraph that must be explored,
by using a heuristic to estimate a lower bound on the “distance” to the target.
Similarly, it uses a priority queue to determine
the next vertex to include.
The difference is that the current distance
is added to the estimated distance to the target.


A dynamic-programming algorithm solves each subsubproblem just once and then saves its answer in a table, thereby avoiding the work of recomputing the answer every time it solves each subsubproblem.

~ Cormen:2009:algorithms

In PGE and SR, a sub-problem is a specific form, namely the parse tree for an equation and its terminals. Recall that PGE uses placeholders for the parameters, so an equation form accounts for the existence of a parameter without knowing its value. The key to evaluating forms once is to fully optimize a form the first time it is seen and to remember the forms which have already been explored. PGE fully optimizes forms by fitting the valueless parameters with non-linear regression, thus assigning them values. In order to memoize the equation forms, they are converted to a linear representation and then recorded in a lookup structure called the Integer Prefix Trie. This is possible because each equation has a canonical form which is unique.

**Non-linear Regression** is not unique to PGE, or GP for that matter, but it is a central component of PGE's ability to evaluate an equation form once. As PGE encounters new equation forms, it fits the abstract parameters using non-linear regression, resulting in the 'best' version of that form on the training data. PGE uses the Levenberg-Marquardt (LM) optimization algorithm [CITE]() to fully fit an equation to a set of data. In cases where an equation is linear w.r.t. the parameters, the LM algorithm library returns in one iteration by using singular value decomposition (SVD). PGE's treatment of parameters and use of non-linear regression enables the separation of the search for equation form from the optimization of a form's parameters. In this way, the search for form and the optimization of coefficients are split into separate tasks. This separation has profound implications for the efficiency and effectiveness of PGE. It enables an equation form to be fully evaluated once, removing the duplication of effort that GP experiences when using genetic operators to optimize the coefficients of an equation form. Some GP implementations have incorporated the use of regression for fitting coefficients. However, they make no use of memoization, and therefore still suffer from the duplication of effort which results from multiple copies of the same equation.

**Memoization** Canonical forms for equations reduce one source of combinatorics in the SR problem. A second source is multiple derivations, that is, multiple paths over the graph which reach the same node. Below is a diagram which demonstrates this.

### Components

1. **Representation** - Trie and linear representations with valueless parameters
1. **Evaluation** - Nonlinear regression for parameters, fitness metric for models
1. **Optimization** - Exploration, prioritization, and regularization
1. **Memoization** - Detecting overlapping subproblems to avoid duplicate work

PGE has four components which enable its search capabilities. The first of these is the set of generating functions which produce new equations from previous equations. The second is the Integer Prefix Trie which enables memoization of equations from their serialized representation. The third component is non-linear regression which separates form from optimization. The fourth component is the Pareto Priority Queue which gives order to the search space by prioritizing equations over one another.

#### Representation

PGE uses multiple representations for an equation. The syntax trie is used for optimization. The integer sequence is used for memoization.

**Syntax Trie** PGE uses trees in which operands can have a variable number of children. In the n-ary tree representation, the associative operators can have n sub-trees, flattening and reducing the tree's size. This is a slight modification of the usual binary tree, affecting only the associative operators addition and multiplication. The n-ary tree does not change the modeling ability of the equation, but will affect the trade-off between parsimony and accuracy. This in turn affects the selection operation of any SR implementation, though we do not investigate this issue here. In addition to the reductions in parse tree size, the n-ary representation eases the design of sorting and simplification algorithms. These algorithms, detailed next, work within the parse tree to reshape equivalent trees into a canonical form. This effectively merges isomorphic equations, reducing the size of the search space while simultaneously adding structure to it.
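The reshaping of equivalent trees into a canonical form can be sketched as follows: flatten nested associative operators into n-ary nodes, then sort commutative operands. The node structure and serialization are illustrative assumptions, not PGE's implementation.

```python
# Sketch of canonicalizing an n-ary parse tree.
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                        # "add", "mul", "x", "c", "sin", ...
    kids: list = field(default_factory=list)

ASSOCIATIVE = {"add", "mul"}

def serialize(n):
    return n.op + "(" + ",".join(serialize(k) for k in n.kids) + ")" if n.kids else n.op

def canonical(n):
    n.kids = [canonical(k) for k in n.kids]
    if n.op in ASSOCIATIVE:
        flat = []
        for k in n.kids:           # flatten add(add(a,b),c) into add(a,b,c)
            flat.extend(k.kids if k.op == n.op else [k])
        flat.sort(key=serialize)   # sort commutative operands into fixed order
        n.kids = flat
    return n

# (x + c) + x  and  x + (c + x) are isomorphic; both become add(c,x,x)
a = Node("add", [Node("add", [Node("x"), Node("c")]), Node("x")])
b = Node("add", [Node("x"), Node("add", [Node("c"), Node("x")])])
print(serialize(canonical(a)) == serialize(canonical(b)))  # True
```

Merging isomorphic trees this way is what makes the later memoization step sound: two derivations of the same equation serialize identically.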
**Linear integer sequence** is the prefix notation representation of an equation, mapped to the integers. It is equivalent to the trie representation, with a one-to-one correspondence. The linear integer sequence is obtained by mapping each node type to a different integer and then performing a pre-order traversal of the trie, collecting the integers by appending them to a list. The usefulness of this representation is efficiency in implementation. Integers can be used for constant time lookups, where otherwise we may have been limited to a higher complexity operation. The linear representation is mainly used in conjunction with the Integer Prefix Trie (IPT) during the memoization process. The IPT is an efficient data structure for recording and looking up equations and will be detailed shortly. Additionally, the linear representation can be used for interprocess and network communication. Its main benefit there is on the receiving side, where we can avoid the complexity of parsing the human readable version of an equation.

**Abstract parameters** are the valueless placeholders used as coefficients in equations. They signify that a parameter is present and where in an equation it occurs. The benefit of abstract parameters is two-fold. First, equation form matching and memoization can happen without regard to their value. Second, they enable the optimization of the parameters to be performed in a separate process, via non-linear regression. This is the key to separating the optimization of a given form from the optimization which occurs at the global scale, in the space of all equations.

#### Evaluation

**Non-linear parameter fitting**, though not unique to PGE, or GP for that matter, is a central component of the PGE algorithm. It enables the separation of the search for form from the optimization of that form's parameters. Great gains in efficacy are realized by separating the search for equation form from the optimization of parameters.
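The mapping from tree to linear integer sequence is a pre-order traversal over an integer encoding of node types. The encoding table below is hypothetical; only the traversal scheme is the point.

```python
# Sketch of serializing a parse tree to its linear integer sequence.
# Hypothetical node-type encoding:
CODES = {"add": 0, "mul": 1, "sin": 2, "x": 3, "param": 4}

def encode(node, out=None):
    """Pre-order traversal, appending each node's integer code to a list."""
    if out is None:
        out = []
    op, kids = node
    out.append(CODES[op])
    for k in kids:
        encode(k, out)
    return out

# c0 * sin(x), i.e. mul(param, sin(x))
tree = ("mul", [("param", []), ("sin", [("x", [])])])
print(encode(tree))  # [1, 4, 2, 3]
```

Because the trie is already canonical, this sequence is unique per form, and it is exactly the key walked through the Integer Prefix Trie during memoization.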
PGE achieves this separation by using abstract parameters. Parameters are indexed into an array, which means they do not store their value internally. The indexed parameters contribute only to form, taking on value only after non-linear fitting. Handling parameters in this manner is analogous to the way variables are treated, which contribute to fitness only during evaluation.

**Fitness metrics** are composite metrics which combine a notion of accuracy and complexity. They are used to compare the relative fitness between two models or equations. They are the multi-objective value which SR implementations seek to optimize. The most common fitness metric uses the tree size and the mean squared error. Many values for accuracy and complexity may be used, and more than one of each may be used. Additionally, the values in the fitness metric may be weighted to give preference. Another useful practice is to normalize the values used in the fitness metric, across the set, before sorting into successive non-dominated sets. The Pareto non-dominated sorting methods, described in the previous chapter, use fitness metrics to determine the successive non-dominated sets of equations.

#### Optimization

SR is a multi-objective optimization task which seeks to maximize accuracy while minimizing complexity. To search over equation space, PGE must move within the space and decide where to move next. PGE uses exploration operators derived from the rules of the grammar. The exploration operators input one equation and output a set of equations which result from small edits. PGE uses a priority queue to express which points in the search space to expand next. By incorporating the Pareto non-dominated sort into a priority queue, the PGE search can exploit and optimize the trade-off between competing objectives in a deterministic order.

**Exploration Operators** are the means by which PGE searches the space of all equations.
Each exploration operator is a policy for how to expand and modify an equation's parse tree to obtain functions 'close' to the current one.
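The Pareto ordering that drives the priority queue rests on a dominance test over (complexity, accuracy) pairs. A minimal sketch, with hypothetical model names and fitness values:

```python
# Sketch of Pareto dominance over (tree size, mean squared error) pairs;
# smaller is better for both objectives.
def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Hypothetical models: fitness = (tree size, MSE)
models = {"m1": (5, 0.10), "m2": (7, 0.05), "m3": (9, 0.20)}

# The first non-dominated set: models no other model dominates.
front = [name for name, fit in models.items()
         if not any(dominates(other, fit)
                    for other in models.values() if other != fit)]
print(front)  # ['m1', 'm2']  (m3 is larger and less accurate than m1)
```

Repeatedly extracting and removing such fronts yields the successive non-dominated sets; feeding them through a priority queue in a fixed order is what keeps the trade-off exploration deterministic.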