### Decoupling into Services
Evaluation is the most time consuming process
in any SR implementation,
generally exceeding 90% of the total time.
The main reasons for this are:
1. Floating point calculations take longer
1. Implementations generate many candidate models
1. Volume of data for training is increasing
1. The other phases tend not to require much time
One of the nice properties of SR, and evaluation,
is that it has both data and functional parallelism.
As a result, research into
parallel and distributed GP has been
extensive and fruitful (see Chapter 3).
We follow suit with PGE by splitting the algorithm
into a set of services and deploying them to
cloud infrastucture.
We decoupled the PGE algorithm
into three services:
1. Main Search Loop
1. Evaluation (for parallel benefits)
1. Algebra (for a mature CAS)
We elected for the additional algebra service
so that a mature Computer Algebra System (CAS)
could be used. We chose SymPy because
it had the desired functionality
and was simple to integrate.
We believe it also possible
to decouple the expansion and
memoization functionality into
their own services.
This is however, beyound the scope of this work.
This enhancement was also made in conjunction
with the last enhancement, waterfall evaluation.
**Figure #** - Decoupled PGE
#### The Three Services
##### **Algebra Service**
The original PGE algorithm
internally implemented
basic algebra operations to
collect terms and perform
simple rewrites.
While decoupling this service,
we opted to replace the internal algorithms
with a third-party library.
We chose Sympy for algebraic manipulations
as it offers more mature
and well tested set of functionalities.
Initially we used the string representation
of an equation when sending messages between services.
To alleviate some of the overhead
we converted to using the same
integer serialization as is used in the
memoization process.
##### **Evaluation Service**
As evaluation is the most resource
intensive part of the search,
this was the first natural choice
to decouple. Each PGE search
will require several evaluator instances
and so we chose to have them only
concerned with a single data set
at a time.
PGE uses a third-party library, Levmar,
to perform non-linear fitting
of a model's parameters.
PGE uses the analytical jacobian
fitting function and thus requires
the jacobian, w.r.t. the parameters,
to be calculated.
The evaluator service calls upon
the algebra service to perform this
operation on its behalf.
Since both the subset and fullfit
evaluation phases make use of
the non-linear regression,
they thus both require the jacobian
to be available.
Each evaluation service instance
is first sent the necessary information
for setup, including the parameters
and the dataset to be used for evaluation.
They then sit in an event loop waiting for
a request to evaluate a model.
##### **Search Service**
The search loop exists
as its own service and is the main
control-flow structure.
It remains basically the same
as the original search loop.
The main difference is
that function calls to the
algebra and evalutation service
now require network activity.
During each iteration, PGE delegates
to the appropriate services
at the neccessary times.
##### **Service Interactions**
The algebra service is independent and does not
depend on the other two services.
The evaluation service depends upon
the algebra service for calculating
the analytical jacobian of a model
w.r.t. the parameters.
The search service depends upon both
the algebra and evaluation service.
Each search service has its own set
of evaluation service instances.
The algebra services are more flexible,
as they are independent.
Several variations we explore are:
1) each service instance having its own dedicated
algebra instance (additionally the search process
can have multiple instances)
2) a search instance and its evaluator instances
can share a pool of algebra services
3) multiple PGE search processes can
share a larger pool of algebra service instances.
We explore the trade-offs and appropriate situations
for each case in the evaluation phase.
The best setups are tied to both the problem at
hand and the current implementation.
Our preliminary experiences in decoupling allowed us
to reduce the original count of algebra service instances after converting messages to use
the serialization of a model, as opposed to
the human readable versions of the model.
This was most likely the result of removing
the string processing necessary in tokenization and parsing into the syntax tree.

### Progressive Evaluation One of the issues in PGE is that a significant amount of effort is spent evaluating models which end up with poor fit. A useful idea is to evaluate models on a subset of the data. This method has been shown to provide good estimates with as few as just 8 data points [ [hod:08:coevo_fp]() ].
**Figure #** - PGE Progressive Evaluation
We use this idea of subsampling and
introduce an extra stage of processing,
which we call *peeked evaluation*.
Peeked evaluation is used to provide
a quick estimate of fitness from
a small number of training points.
The training data is uniformly sampled
to the reduced data set.
With peeked evaluation,
each equation must pass
through an additional priority queue.
The idea is to get a quick estimate
of accuracy, and then prioritize
full evaluation from these estimates.
The main reason for incorporating
peeked evaluation is to avoid
fully evaluating models which
will be poor candidates.
Only models which are fully fit
can be entered into the priority
queue for expansion candidacy.
In this way, there is an extra
barrier to poor models producing
more poor models.
Figure "peeked" below shows the
psuedo code with the additional
phases for evaluation and prioritization.
**Figure #**: PGE Peeked Flow
{% highlight Python linenos %}
for each iteration:
to_eval = FitQueue.Pop()
fullfitEvaluate(to_eval)
ExpandQueue.Push(to_eval)
to_expand = ExpandQueue.Pop()
expanded = expand(to_expand)
unique = memoize(expanded)
peekEvaluate(unique)
FitQueue.Push(unique)
{% endhighlight %}
The loop is now composed of
two pop-process-push sequences.
At the begining of the iteration
all equations have already
been estimate under peeked evaluation.
The `FitQueue` then has several
models popped for full evaluation
and subsequent pushing into
the `ExpandQueue`.
Next the `ExpandQueue` is popped
to select models to serve
as the next expansion points.
The new models are checked for
uniqueness, estimated, and
pushed into the `FitQueue`.
Under this scheme, a single model can pass
through all phases
in a single iteration.
One interesting consequence is that
the majority of coefficient tuning
can happen in the peek evaluation phase.
Initially, we make a guess of `1` for the
coefficient values. Peek evaluation
makes the best estimates on limited data.
However, when a model makes it to full evaluation,
it has much better starting points for
the coefficient values.
Often, the full fitting procedures
need fewer passes because of this situation.

### Selection Improvements The original PGE algorithm used tree size and an error metric to calculate model fitness. It then used a simple Pareto sorting mechanism to sort the population. This section describes enhancements we make to the metrics, fitness, and selection processes. #### More Model Metrics Our experience showed that modelling error was often insufficient for effectively separating good models from the rest. We add several additional metrics for models which account for quality while penalizing complexity. This is separate from the complexity and accuracy tradeoff which happens at the global level. They do however have a benefical influence on the global selection and prioritization process. **Penalized complexity** is a tree size calculation which adds a penalty to specific types of nodes. In our experiments, we applied it to function nodes from the triganomic and logarithmic families. The penalty accounts for increased relative computational complexity which we set at +2. The value, while effective, deserves further investigation than we provide. **Jacobian complexity** is the summed tree size of the jacobians for the model. We also included the penalized version as well. This metric has multiple effect for minimizing complexity. It minimizes absolute, summed size. It minimizes the number of model parameters, because fewer terms means smaller sum. It also minimizes the nesting complexity of the original model. When coefficients are dependent upon one another, the jacobians become quite complex and drastically increase the value of this metric. **R-squared** measures the goodness of fit for a model and attempts to account for the amount of variance explained. Values range from zero to one and can be interpreted as the percentage of variance the model explains. One issue with this metric is that extra variables can be included to increase its value. The adjusted R-squared addresses this by adding a penalty which increases with each extra variable [[Taylor:1997:error](An_Introduction_To_Error_Analysis_The_Study_Of_Uncertainties_In_Physical_Measurements_Taylor_John)]. **reduced Chi-squared** also measures the goodness of fit for a model. It includes the variance in the data and degrees of freedom. This effectively normalizes the value to the number of data points and model complexity [ [Taylor:1997:error](An_Introduction_To_Error_Analysis_The_Study_Of_Uncertainties_In_Physical_Measurements_Taylor_John), [Pearson:1900:chi](http://www.economics.soton.ac.uk/staff/aldrich/1900.pdf) ]. **AIC/BIC** measure the relative quality of fit for models. Akaike information criterion (AIC) and Bayesian information criterion (BIC) do not say anything about the absolute quality of the model. Rather they measure the relative quality of models against eachother. Both add a penalty term based on the number of model parameters. BIC penalizes complexity more and is thus the more conservative metric [ [Akaike:1974:aic](http://www.unt.edu/rss/class/Jon/MiscDocs/Akaike_1974.pdf), [Schwarz:1978:bic](http://projecteuclid.org/euclid.aos/1176344136) ]. #### Improvement over parent Improvement over the parent refers to how much an accuracy or goodness of fit metric changes in an offspring, relative to the parent. We use a simple difference between the values, making sure to account for whether the metric is a minimizing or maximising one. One interesting situation happens when we find a new path to an existing model. In this case, we can look to Dijkstra's algorithm and *relax* to find a new relative improvement value. #### Model Fitness Model fitness is used to make relative comparisons during the selection process. The fitness for an individual model is usually a combination of complexity and accuracy in an attempt to regularize. The original PGE algorithm used tree size and model accuracy, as determined by an error metric, as the components to the fitness metric. It then used a the basic Pareto non-dominated sort to prioritize the next models to be expanded. In our enhancements, we made several changes in how we used the metrics in fitness. In addition to being able to use any of the new metrics, we added the ability to use multiple metrics at once, allowing the number of components to grow beyond just one for size and one for accuracy. We then added the ability to weight each component separately. This proved beneficial as we discovered the best policy was to weight the complexity measure so that it was equal to multiple accuracy metrics. We also found that competing certain metrics was beneficial, such as minimizing BIC while maximizing the improvement of BIC. One of the more significant changes we made was to normalize each of the values across the set of models. For each metric being used, we normalized across the models, and then used the new value as the component to the fitness for the model. This removed the effect of scale of values on the relative comparisons made during selection. With out normalization, one of the components can easily dominate the solution. Additionally, the dominating component often changed as the search progressed. In example, as error tends towards zero the size becomes the dominate factor, and distinguishing between models on error gets more difficult for the algorithm. #### Density based Pareto methods The orginial PGE algorithm used a simlistic Pareto sorting algorithm. Research in GP has shown that accounting for denisity along the frontier, better choices can be made. Section ? in Chapter 3 discussed the various improved algorithms for Pareto selection. We settled on the NSGA-II algorithm for its good balance between effectiveness and computational complexity. In our experiments, the addition of NSGA-II did not, by itself, improve the prioritization process. It was its combined usage when the fitness values contained more than two components. NSGA-II was able to distribute selection by accounting for density across all dimension, making the use of more dimensions beneficial.

### Expansion Improvements This section describes improvements to the expansion phase and functionality. The first changes enable multiple levels of complexity and context awareness when appling the expanions operators. We then describe two new expansion operators which were designed to work on summations. Finally, we enable multiple tiers of model expansion, thereby applying progressively complex expansion operations to progressively fewer models at a time. #### Configurable complexity levels The original PGE algorithm had a three different sets of expansion functions. They encompased three different complexities or scenarios. We alter this idea for greater grainularity. Each of the four operators can have its own complexity level. The *low*, *medium*, and *high* options provide flexibility while simplifying the choices for an end-user. The following tables describe the the basis functions used for each of the operators. Their application remains the same as the original PGE algorithm.

**Initialization** | level | result set | | ------- | ------------| | low | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:2]} \Big) + C $$ | | med | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_3 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:2]} \Big) + C $$ | | high | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_4 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:3]} \Big) + C $$ |

**Variable Substitution** | level | result set | | ------- | ------------| | low | $$ \big \{ \vec{x}C_1 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | med | $$ \big \{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | high | $$ \big \{Add Med \big \} \bigcup \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \} +b $$ |

**Multiplication Extension** | level | result set | | ------- | ------------| | low | $$ \big \{ \vec{x}C_1 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | med | $$ \big \{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | high | $$ \big \{Mul Med \big \} \bigcup \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \}$$ |

**Addition Extension** | level | result set | | ------- | ------------| | low | $$ \big \{ a * \vec{x}C_1 +b \big \} \bigcup \big \{ a * F(\vec{x}C_1) +b \big \}$$ | | med | $$ \big \{ a * \vec{x}C_2 +b \big \} \bigcup \big \{ a * F(\vec{x}C_1) +b \big \}$$ | | high | $$ \big \{Add Med \big \} \bigcup a* \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \} +b $$ #### Context-aware The original expansion polices were applied recursively at each node in the model tree. This was done without regard for the context in which these operations took place. This led to situations where trigonimic functions were embedded within eachother, such as $$ sin(cos(x)) $$. It also led to the production of ovely complex expressions involving coefficients and addition. In example, $$C * ( C + C*X(C*X + C*(C*X + C)^2)) $$ is more simply expressed as a single level summation. Additionally, it is much more difficult to fit the coefficients as they involve many nonlinear relationships. We incorporate the idea of context-aware expansion policies to deal with undesirable situations and consequences. Here we include two types of context-awareness, functional and depth, though there are generally more options. Context aware operators are easily created with state traking during the recursion process. Our experience has been that their use implies restriction based on context. However, there may be situations where certain operations are only permitted within a particular context. #### New Extension Operator The additon and multiplication expansion policies extend the number of terms to the respective operators. They do this by drawing from a predefined pool of basis functions. We make use of a new extension operator which only works at root level addition. This operator extends an addition with small modification to its current terms. First, each term is expanded with a set of basis terms. This is similar to the multiplication expansion. Then each of these terms is included in the summation, as opposed to replacing an existing term. This operator makes larger movements around the search space, effectively being the combination of smaller modifications with several intermediate steps. We chose to only use this at root level addition because deeper recursion becomes overly complex very quickly. #### New Shrinkage Operator Under the original policies, there was no explicit method to go backwards, or decrease the size of a model. This can occur when a cancelling term is applied through an expansion operator, such as $$ x^2 * x^{-1} \rightarrow x $$. We make use of a new shrinkage operator which removes terms from additions. The operator recurses over the entire tree. When an addition is found, it produces all expressions possible, with one term removed. The reason for adding this is to clean up after intermediary terms in models. During the search process, it is often the case that additonal terms are incorporated which either bridge a gap or are meant to deal with noise. This is particularly noticable with trigonimic functions whos output is in the range [-1:1]. ADD PICTURES HERE FROM ACTUAL BENCHMARKS / PROBLEMS The shrinkage operator removes excessive terms from overfitting noise or from the intemediary terms needed during search progression. We exclude a similar operator for multiplication for two reasons. The first is that there would be a large combinatorial number of models generated from this process. Second, in most cases the models were or can be reached through the usual multiplication expansion process. That is, this process would be to similar to the multiplying by a cancelling term to make a significant difference. #### Progressive expansion Progressive expansion is an idea that a series of increasingly complex expansion schemes can be applied to a decreasing number of models. This follows along with the multiple tiers of refinement motif. Once a model is fully fit during a search, it becomes a candidate for expansion. If selected through the heap prioritization, it then has the expansion methods applied. Under the progressive expansion design, each step in the series has its own selection heap and its own set of expansion policies. When a model has been selected and expanded it is pushed into the next tier's heap.
**Figure #** - PGE Progressive Expansion
Example schemes for progressive expansion are:
1. Increasing complexity of modifications
1. Limiting variables available at early tiers
1. Reducing contextual restrictions
1. Delaying incorporation of functions

### Enhanced PGE This chapter has described many enhancements to the original PGE algorithm. The diagram below depicts PGE with these enhancements. The loop remains the same however several phases have been internally extended with progressive application and intemediate heaps. The next chapter will show their effectiveness over a range of experiments.
**Figure #** - PGE Progressive Expansion
">next
index

## Prioritized Grammar Enumeration

**Figure #** - Decoupled PGE
#### The Three Services
##### **Algebra Service**
The original PGE algorithm
internally implemented
basic algebra operations to
collect terms and perform
simple rewrites.
While decoupling this service,
we opted to replace the internal algorithms
with a third-party library.
We chose Sympy for algebraic manipulations
as it offers more mature
and well tested set of functionalities.
Initially we used the string representation
of an equation when sending messages between services.
To alleviate some of the overhead
we converted to using the same
integer serialization as is used in the
memoization process.
##### **Evaluation Service**
As evaluation is the most resource
intensive part of the search,
this was the first natural choice
to decouple. Each PGE search
will require several evaluator instances
and so we chose to have them only
concerned with a single data set
at a time.
PGE uses a third-party library, Levmar,
to perform non-linear fitting
of a model's parameters.
PGE uses the analytical jacobian
fitting function and thus requires
the jacobian, w.r.t. the parameters,
to be calculated.
The evaluator service calls upon
the algebra service to perform this
operation on its behalf.
Since both the subset and fullfit
evaluation phases make use of
the non-linear regression,
they thus both require the jacobian
to be available.
Each evaluation service instance
is first sent the necessary information
for setup, including the parameters
and the dataset to be used for evaluation.
They then sit in an event loop waiting for
a request to evaluate a model.
##### **Search Service**
The search loop exists
as its own service and is the main
control-flow structure.
It remains basically the same
as the original search loop.
The main difference is
that function calls to the
algebra and evalutation service
now require network activity.
During each iteration, PGE delegates
to the appropriate services
at the neccessary times.
##### **Service Interactions**
The algebra service is independent and does not
depend on the other two services.
The evaluation service depends upon
the algebra service for calculating
the analytical jacobian of a model
w.r.t. the parameters.
The search service depends upon both
the algebra and evaluation service.
Each search service has its own set
of evaluation service instances.
The algebra services are more flexible,
as they are independent.
Several variations we explore are:
1) each service instance having its own dedicated
algebra instance (additionally the search process
can have multiple instances)
2) a search instance and its evaluator instances
can share a pool of algebra services
3) multiple PGE search processes can
share a larger pool of algebra service instances.
We explore the trade-offs and appropriate situations
for each case in the evaluation phase.
The best setups are tied to both the problem at
hand and the current implementation.
Our preliminary experiences in decoupling allowed us
to reduce the original count of algebra service instances after converting messages to use
the serialization of a model, as opposed to
the human readable versions of the model.
This was most likely the result of removing
the string processing necessary in tokenization and parsing into the syntax tree.

### Progressive Evaluation One of the issues in PGE is that a significant amount of effort is spent evaluating models which end up with poor fit. A useful idea is to evaluate models on a subset of the data. This method has been shown to provide good estimates with as few as just 8 data points [ [hod:08:coevo_fp]() ].
**Figure #** - PGE Progressive Evaluation
We use this idea of subsampling and
introduce an extra stage of processing,
which we call *peeked evaluation*.
Peeked evaluation is used to provide
a quick estimate of fitness from
a small number of training points.
The training data is uniformly sampled
to the reduced data set.
With peeked evaluation,
each equation must pass
through an additional priority queue.
The idea is to get a quick estimate
of accuracy, and then prioritize
full evaluation from these estimates.
The main reason for incorporating
peeked evaluation is to avoid
fully evaluating models which
will be poor candidates.
Only models which are fully fit
can be entered into the priority
queue for expansion candidacy.
In this way, there is an extra
barrier to poor models producing
more poor models.
Figure "peeked" below shows the
psuedo code with the additional
phases for evaluation and prioritization.
**Figure #**: PGE Peeked Flow
{% highlight Python linenos %}
for each iteration:
to_eval = FitQueue.Pop()
fullfitEvaluate(to_eval)
ExpandQueue.Push(to_eval)
to_expand = ExpandQueue.Pop()
expanded = expand(to_expand)
unique = memoize(expanded)
peekEvaluate(unique)
FitQueue.Push(unique)
{% endhighlight %}
The loop is now composed of
two pop-process-push sequences.
At the begining of the iteration
all equations have already
been estimate under peeked evaluation.
The `FitQueue` then has several
models popped for full evaluation
and subsequent pushing into
the `ExpandQueue`.
Next the `ExpandQueue` is popped
to select models to serve
as the next expansion points.
The new models are checked for
uniqueness, estimated, and
pushed into the `FitQueue`.
Under this scheme, a single model can pass
through all phases
in a single iteration.
One interesting consequence is that
the majority of coefficient tuning
can happen in the peek evaluation phase.
Initially, we make a guess of `1` for the
coefficient values. Peek evaluation
makes the best estimates on limited data.
However, when a model makes it to full evaluation,
it has much better starting points for
the coefficient values.
Often, the full fitting procedures
need fewer passes because of this situation.

### Selection Improvements The original PGE algorithm used tree size and an error metric to calculate model fitness. It then used a simple Pareto sorting mechanism to sort the population. This section describes enhancements we make to the metrics, fitness, and selection processes. #### More Model Metrics Our experience showed that modelling error was often insufficient for effectively separating good models from the rest. We add several additional metrics for models which account for quality while penalizing complexity. This is separate from the complexity and accuracy tradeoff which happens at the global level. They do however have a benefical influence on the global selection and prioritization process. **Penalized complexity** is a tree size calculation which adds a penalty to specific types of nodes. In our experiments, we applied it to function nodes from the triganomic and logarithmic families. The penalty accounts for increased relative computational complexity which we set at +2. The value, while effective, deserves further investigation than we provide. **Jacobian complexity** is the summed tree size of the jacobians for the model. We also included the penalized version as well. This metric has multiple effect for minimizing complexity. It minimizes absolute, summed size. It minimizes the number of model parameters, because fewer terms means smaller sum. It also minimizes the nesting complexity of the original model. When coefficients are dependent upon one another, the jacobians become quite complex and drastically increase the value of this metric. **R-squared** measures the goodness of fit for a model and attempts to account for the amount of variance explained. Values range from zero to one and can be interpreted as the percentage of variance the model explains. One issue with this metric is that extra variables can be included to increase its value. The adjusted R-squared addresses this by adding a penalty which increases with each extra variable [[Taylor:1997:error](An_Introduction_To_Error_Analysis_The_Study_Of_Uncertainties_In_Physical_Measurements_Taylor_John)]. **reduced Chi-squared** also measures the goodness of fit for a model. It includes the variance in the data and degrees of freedom. This effectively normalizes the value to the number of data points and model complexity [ [Taylor:1997:error](An_Introduction_To_Error_Analysis_The_Study_Of_Uncertainties_In_Physical_Measurements_Taylor_John), [Pearson:1900:chi](http://www.economics.soton.ac.uk/staff/aldrich/1900.pdf) ]. **AIC/BIC** measure the relative quality of fit for models. Akaike information criterion (AIC) and Bayesian information criterion (BIC) do not say anything about the absolute quality of the model. Rather they measure the relative quality of models against eachother. Both add a penalty term based on the number of model parameters. BIC penalizes complexity more and is thus the more conservative metric [ [Akaike:1974:aic](http://www.unt.edu/rss/class/Jon/MiscDocs/Akaike_1974.pdf), [Schwarz:1978:bic](http://projecteuclid.org/euclid.aos/1176344136) ]. #### Improvement over parent Improvement over the parent refers to how much an accuracy or goodness of fit metric changes in an offspring, relative to the parent. We use a simple difference between the values, making sure to account for whether the metric is a minimizing or maximising one. One interesting situation happens when we find a new path to an existing model. In this case, we can look to Dijkstra's algorithm and *relax* to find a new relative improvement value. #### Model Fitness Model fitness is used to make relative comparisons during the selection process. The fitness for an individual model is usually a combination of complexity and accuracy in an attempt to regularize. The original PGE algorithm used tree size and model accuracy, as determined by an error metric, as the components to the fitness metric. It then used a the basic Pareto non-dominated sort to prioritize the next models to be expanded. In our enhancements, we made several changes in how we used the metrics in fitness. In addition to being able to use any of the new metrics, we added the ability to use multiple metrics at once, allowing the number of components to grow beyond just one for size and one for accuracy. We then added the ability to weight each component separately. This proved beneficial as we discovered the best policy was to weight the complexity measure so that it was equal to multiple accuracy metrics. We also found that competing certain metrics was beneficial, such as minimizing BIC while maximizing the improvement of BIC. One of the more significant changes we made was to normalize each of the values across the set of models. For each metric being used, we normalized across the models, and then used the new value as the component to the fitness for the model. This removed the effect of scale of values on the relative comparisons made during selection. With out normalization, one of the components can easily dominate the solution. Additionally, the dominating component often changed as the search progressed. In example, as error tends towards zero the size becomes the dominate factor, and distinguishing between models on error gets more difficult for the algorithm. #### Density based Pareto methods The orginial PGE algorithm used a simlistic Pareto sorting algorithm. Research in GP has shown that accounting for denisity along the frontier, better choices can be made. Section ? in Chapter 3 discussed the various improved algorithms for Pareto selection. We settled on the NSGA-II algorithm for its good balance between effectiveness and computational complexity. In our experiments, the addition of NSGA-II did not, by itself, improve the prioritization process. It was its combined usage when the fitness values contained more than two components. NSGA-II was able to distribute selection by accounting for density across all dimension, making the use of more dimensions beneficial.

### Expansion Improvements This section describes improvements to the expansion phase and functionality. The first changes enable multiple levels of complexity and context awareness when appling the expanions operators. We then describe two new expansion operators which were designed to work on summations. Finally, we enable multiple tiers of model expansion, thereby applying progressively complex expansion operations to progressively fewer models at a time. #### Configurable complexity levels The original PGE algorithm had a three different sets of expansion functions. They encompased three different complexities or scenarios. We alter this idea for greater grainularity. Each of the four operators can have its own complexity level. The *low*, *medium*, and *high* options provide flexibility while simplifying the choices for an end-user. The following tables describe the the basis functions used for each of the operators. Their application remains the same as the original PGE algorithm.

**Initialization** | level | result set | | ------- | ------------| | low | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:2]} \Big) + C $$ | | med | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_3 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:2]} \Big) + C $$ | | high | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_4 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:3]} \Big) + C $$ |

**Variable Substitution** | level | result set | | ------- | ------------| | low | $$ \big \{ \vec{x}C_1 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | med | $$ \big \{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | high | $$ \big \{Add Med \big \} \bigcup \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \} +b $$ |

**Multiplication Extension** | level | result set | | ------- | ------------| | low | $$ \big \{ \vec{x}C_1 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | med | $$ \big \{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | high | $$ \big \{Mul Med \big \} \bigcup \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \}$$ |

**Addition Extension** | level | result set | | ------- | ------------| | low | $$ \big \{ a * \vec{x}C_1 +b \big \} \bigcup \big \{ a * F(\vec{x}C_1) +b \big \}$$ | | med | $$ \big \{ a * \vec{x}C_2 +b \big \} \bigcup \big \{ a * F(\vec{x}C_1) +b \big \}$$ | | high | $$ \big \{Add Med \big \} \bigcup a* \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \} +b $$ #### Context-aware The original expansion polices were applied recursively at each node in the model tree. This was done without regard for the context in which these operations took place. This led to situations where trigonimic functions were embedded within eachother, such as $$ sin(cos(x)) $$. It also led to the production of ovely complex expressions involving coefficients and addition. In example, $$C * ( C + C*X(C*X + C*(C*X + C)^2)) $$ is more simply expressed as a single level summation. Additionally, it is much more difficult to fit the coefficients as they involve many nonlinear relationships. We incorporate the idea of context-aware expansion policies to deal with undesirable situations and consequences. Here we include two types of context-awareness, functional and depth, though there are generally more options. Context aware operators are easily created with state traking during the recursion process. Our experience has been that their use implies restriction based on context. However, there may be situations where certain operations are only permitted within a particular context. #### New Extension Operator The additon and multiplication expansion policies extend the number of terms to the respective operators. They do this by drawing from a predefined pool of basis functions. We make use of a new extension operator which only works at root level addition. This operator extends an addition with small modification to its current terms. First, each term is expanded with a set of basis terms. This is similar to the multiplication expansion. Then each of these terms is included in the summation, as opposed to replacing an existing term. This operator makes larger movements around the search space, effectively being the combination of smaller modifications with several intermediate steps. We chose to only use this at root level addition because deeper recursion becomes overly complex very quickly. #### New Shrinkage Operator Under the original policies, there was no explicit method to go backwards, or decrease the size of a model. This can occur when a cancelling term is applied through an expansion operator, such as $$ x^2 * x^{-1} \rightarrow x $$. We make use of a new shrinkage operator which removes terms from additions. The operator recurses over the entire tree. When an addition is found, it produces all expressions possible, with one term removed. The reason for adding this is to clean up after intermediary terms in models. During the search process, it is often the case that additonal terms are incorporated which either bridge a gap or are meant to deal with noise. This is particularly noticable with trigonimic functions whos output is in the range [-1:1]. ADD PICTURES HERE FROM ACTUAL BENCHMARKS / PROBLEMS The shrinkage operator removes excessive terms from overfitting noise or from the intemediary terms needed during search progression. We exclude a similar operator for multiplication for two reasons. The first is that there would be a large combinatorial number of models generated from this process. Second, in most cases the models were or can be reached through the usual multiplication expansion process. That is, this process would be to similar to the multiplying by a cancelling term to make a significant difference. #### Progressive expansion Progressive expansion is an idea that a series of increasingly complex expansion schemes can be applied to a decreasing number of models. This follows along with the multiple tiers of refinement motif. Once a model is fully fit during a search, it becomes a candidate for expansion. If selected through the heap prioritization, it then has the expansion methods applied. Under the progressive expansion design, each step in the series has its own selection heap and its own set of expansion policies. When a model has been selected and expanded it is pushed into the next tier's heap.
**Figure #** - PGE Progressive Expansion
Example schemes for progressive expansion are:
1. Increasing complexity of modifications
1. Limiting variables available at early tiers
1. Reducing contextual restrictions
1. Delaying incorporation of functions

### Enhanced PGE This chapter has described many enhancements to the original PGE algorithm. The diagram below depicts PGE with these enhancements. The loop remains the same however several phases have been internally extended with progressive application and intemediate heaps. The next chapter will show their effectiveness over a range of experiments.
**Figure #** - PGE Progressive Expansion
">next (Enhancing PGE)

### Progressive Evaluation One of the issues in PGE is that a significant amount of effort is spent evaluating models which end up with poor fit. A useful idea is to evaluate models on a subset of the data. This method has been shown to provide good estimates with as few as just 8 data points [ [hod:08:coevo_fp]() ].

### Selection Improvements The original PGE algorithm used tree size and an error metric to calculate model fitness. It then used a simple Pareto sorting mechanism to sort the population. This section describes enhancements we make to the metrics, fitness, and selection processes. #### More Model Metrics Our experience showed that modelling error was often insufficient for effectively separating good models from the rest. We add several additional metrics for models which account for quality while penalizing complexity. This is separate from the complexity and accuracy tradeoff which happens at the global level. They do however have a benefical influence on the global selection and prioritization process. **Penalized complexity** is a tree size calculation which adds a penalty to specific types of nodes. In our experiments, we applied it to function nodes from the triganomic and logarithmic families. The penalty accounts for increased relative computational complexity which we set at +2. The value, while effective, deserves further investigation than we provide. **Jacobian complexity** is the summed tree size of the jacobians for the model. We also included the penalized version as well. This metric has multiple effect for minimizing complexity. It minimizes absolute, summed size. It minimizes the number of model parameters, because fewer terms means smaller sum. It also minimizes the nesting complexity of the original model. When coefficients are dependent upon one another, the jacobians become quite complex and drastically increase the value of this metric. **R-squared** measures the goodness of fit for a model and attempts to account for the amount of variance explained. Values range from zero to one and can be interpreted as the percentage of variance the model explains. One issue with this metric is that extra variables can be included to increase its value. The adjusted R-squared addresses this by adding a penalty which increases with each extra variable [[Taylor:1997:error](An_Introduction_To_Error_Analysis_The_Study_Of_Uncertainties_In_Physical_Measurements_Taylor_John)]. **reduced Chi-squared** also measures the goodness of fit for a model. It includes the variance in the data and degrees of freedom. This effectively normalizes the value to the number of data points and model complexity [ [Taylor:1997:error](An_Introduction_To_Error_Analysis_The_Study_Of_Uncertainties_In_Physical_Measurements_Taylor_John), [Pearson:1900:chi](http://www.economics.soton.ac.uk/staff/aldrich/1900.pdf) ]. **AIC/BIC** measure the relative quality of fit for models. Akaike information criterion (AIC) and Bayesian information criterion (BIC) do not say anything about the absolute quality of the model. Rather they measure the relative quality of models against eachother. Both add a penalty term based on the number of model parameters. BIC penalizes complexity more and is thus the more conservative metric [ [Akaike:1974:aic](http://www.unt.edu/rss/class/Jon/MiscDocs/Akaike_1974.pdf), [Schwarz:1978:bic](http://projecteuclid.org/euclid.aos/1176344136) ]. #### Improvement over parent Improvement over the parent refers to how much an accuracy or goodness of fit metric changes in an offspring, relative to the parent. We use a simple difference between the values, making sure to account for whether the metric is a minimizing or maximising one. One interesting situation happens when we find a new path to an existing model. In this case, we can look to Dijkstra's algorithm and *relax* to find a new relative improvement value. #### Model Fitness Model fitness is used to make relative comparisons during the selection process. The fitness for an individual model is usually a combination of complexity and accuracy in an attempt to regularize. The original PGE algorithm used tree size and model accuracy, as determined by an error metric, as the components to the fitness metric. It then used a the basic Pareto non-dominated sort to prioritize the next models to be expanded. In our enhancements, we made several changes in how we used the metrics in fitness. In addition to being able to use any of the new metrics, we added the ability to use multiple metrics at once, allowing the number of components to grow beyond just one for size and one for accuracy. We then added the ability to weight each component separately. This proved beneficial as we discovered the best policy was to weight the complexity measure so that it was equal to multiple accuracy metrics. We also found that competing certain metrics was beneficial, such as minimizing BIC while maximizing the improvement of BIC. One of the more significant changes we made was to normalize each of the values across the set of models. For each metric being used, we normalized across the models, and then used the new value as the component to the fitness for the model. This removed the effect of scale of values on the relative comparisons made during selection. With out normalization, one of the components can easily dominate the solution. Additionally, the dominating component often changed as the search progressed. In example, as error tends towards zero the size becomes the dominate factor, and distinguishing between models on error gets more difficult for the algorithm. #### Density based Pareto methods The orginial PGE algorithm used a simlistic Pareto sorting algorithm. Research in GP has shown that accounting for denisity along the frontier, better choices can be made. Section ? in Chapter 3 discussed the various improved algorithms for Pareto selection. We settled on the NSGA-II algorithm for its good balance between effectiveness and computational complexity. In our experiments, the addition of NSGA-II did not, by itself, improve the prioritization process. It was its combined usage when the fitness values contained more than two components. NSGA-II was able to distribute selection by accounting for density across all dimension, making the use of more dimensions beneficial.

### Expansion Improvements This section describes improvements to the expansion phase and functionality. The first changes enable multiple levels of complexity and context awareness when appling the expanions operators. We then describe two new expansion operators which were designed to work on summations. Finally, we enable multiple tiers of model expansion, thereby applying progressively complex expansion operations to progressively fewer models at a time. #### Configurable complexity levels The original PGE algorithm had a three different sets of expansion functions. They encompased three different complexities or scenarios. We alter this idea for greater grainularity. Each of the four operators can have its own complexity level. The *low*, *medium*, and *high* options provide flexibility while simplifying the choices for an end-user. The following tables describe the the basis functions used for each of the operators. Their application remains the same as the original PGE algorithm.

**Initialization** | level | result set | | ------- | ------------| | low | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:2]} \Big) + C $$ | | med | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_3 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:2]} \Big) + C $$ | | high | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_4 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:3]} \Big) + C $$ |

**Variable Substitution** | level | result set | | ------- | ------------| | low | $$ \big \{ \vec{x}C_1 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | med | $$ \big \{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | high | $$ \big \{Add Med \big \} \bigcup \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \} +b $$ |

**Multiplication Extension** | level | result set | | ------- | ------------| | low | $$ \big \{ \vec{x}C_1 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | med | $$ \big \{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | high | $$ \big \{Mul Med \big \} \bigcup \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \}$$ |

**Addition Extension** | level | result set | | ------- | ------------| | low | $$ \big \{ a * \vec{x}C_1 +b \big \} \bigcup \big \{ a * F(\vec{x}C_1) +b \big \}$$ | | med | $$ \big \{ a * \vec{x}C_2 +b \big \} \bigcup \big \{ a * F(\vec{x}C_1) +b \big \}$$ | | high | $$ \big \{Add Med \big \} \bigcup a* \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \} +b $$ #### Context-aware The original expansion polices were applied recursively at each node in the model tree. This was done without regard for the context in which these operations took place. This led to situations where trigonimic functions were embedded within eachother, such as $$ sin(cos(x)) $$. It also led to the production of ovely complex expressions involving coefficients and addition. In example, $$C * ( C + C*X(C*X + C*(C*X + C)^2)) $$ is more simply expressed as a single level summation. Additionally, it is much more difficult to fit the coefficients as they involve many nonlinear relationships. We incorporate the idea of context-aware expansion policies to deal with undesirable situations and consequences. Here we include two types of context-awareness, functional and depth, though there are generally more options. Context aware operators are easily created with state traking during the recursion process. Our experience has been that their use implies restriction based on context. However, there may be situations where certain operations are only permitted within a particular context. #### New Extension Operator The additon and multiplication expansion policies extend the number of terms to the respective operators. They do this by drawing from a predefined pool of basis functions. We make use of a new extension operator which only works at root level addition. This operator extends an addition with small modification to its current terms. First, each term is expanded with a set of basis terms. This is similar to the multiplication expansion. Then each of these terms is included in the summation, as opposed to replacing an existing term. This operator makes larger movements around the search space, effectively being the combination of smaller modifications with several intermediate steps. We chose to only use this at root level addition because deeper recursion becomes overly complex very quickly. #### New Shrinkage Operator Under the original policies, there was no explicit method to go backwards, or decrease the size of a model. This can occur when a cancelling term is applied through an expansion operator, such as $$ x^2 * x^{-1} \rightarrow x $$. We make use of a new shrinkage operator which removes terms from additions. The operator recurses over the entire tree. When an addition is found, it produces all expressions possible, with one term removed. The reason for adding this is to clean up after intermediary terms in models. During the search process, it is often the case that additonal terms are incorporated which either bridge a gap or are meant to deal with noise. This is particularly noticable with trigonimic functions whos output is in the range [-1:1]. ADD PICTURES HERE FROM ACTUAL BENCHMARKS / PROBLEMS The shrinkage operator removes excessive terms from overfitting noise or from the intemediary terms needed during search progression. We exclude a similar operator for multiplication for two reasons. The first is that there would be a large combinatorial number of models generated from this process. Second, in most cases the models were or can be reached through the usual multiplication expansion process. That is, this process would be to similar to the multiplying by a cancelling term to make a significant difference. #### Progressive expansion Progressive expansion is an idea that a series of increasingly complex expansion schemes can be applied to a decreasing number of models. This follows along with the multiple tiers of refinement motif. Once a model is fully fit during a search, it becomes a candidate for expansion. If selected through the heap prioritization, it then has the expansion methods applied. Under the progressive expansion design, each step in the series has its own selection heap and its own set of expansion policies. When a model has been selected and expanded it is pushed into the next tier's heap.

### Enhanced PGE This chapter has described many enhancements to the original PGE algorithm. The diagram below depicts PGE with these enhancements. The loop remains the same however several phases have been internally extended with progressive application and intemediate heaps. The next chapter will show their effectiveness over a range of experiments.

Theory

Components

Searching

Limitations

Reproducibility

Relation to GP

#### An Overview

### Theory

#### Removing Non-determinism

#### Search Space Organization

**Figure 4-1** - Combinatorics of the Tree
#### Evaluating Forms Once

**Figure #** - Multiple Derivations
### Components

#### Representation

#### Evaluation

#### Optimization

**Figure #**: PGE Exporation Functions
#### Memoization

**Figure #**: Integer Prefix Tree
### The Search Loop

**Figure #** - PGE Flow Diagram
**Figure #**: PGE Search Loop
### Limitations

#### Exponential to |{Features}|

#### Algebra Policies

#### Bloat in PGE

#### Similar Selection

#### Parameter Optimization

### Relation to Genetic Programming

prev top

### Decoupling into Services
Evaluation is the most time consuming process
in any SR implementation,
generally exceeding 90% of the total time.
The main reasons for this are:
1. Floating point calculations take longer
1. Implementations generate many candidate models
1. Volume of data for training is increasing
1. The other phases tend not to require much time
One of the nice properties of SR, and evaluation,
is that it has both data and functional parallelism.
As a result, research into
parallel and distributed GP has been
extensive and fruitful (see Chapter 3).
We follow suit with PGE by splitting the algorithm
into a set of services and deploying them to
cloud infrastucture.
We decoupled the PGE algorithm
into three services:
1. Main Search Loop
1. Evaluation (for parallel benefits)
1. Algebra (for a mature CAS)
We elected for the additional algebra service
so that a mature Computer Algebra System (CAS)
could be used. We chose SymPy because
it had the desired functionality
and was simple to integrate.
We believe it also possible
to decouple the expansion and
memoization functionality into
their own services.
This is however, beyound the scope of this work.
This enhancement was also made in conjunction
with the last enhancement, waterfall evaluation.
Components

Searching

Limitations

Reproducibility

Relation to GP

Prioritized Grammar Enumeration (PGE) is a deterministic and highly efficient algorithm for Symbolic Regression (SR). It answers the question, how would you solve SR without using random number generators? This requirement forced us to rethink the fundamental approach to SR. PGE is the result of this process. To our knowledge, PGE is the first SR implementation that is both deterministic and tree-based. PGE’s unique perspective on SR brings the problem structure to the forefront. At the same time, it enables vast reductions in effort and opens the door to advanced analyses not possible with other methods. PGE also offers consistency, capability, and reliability not available with other methods. We believe PGE is the forefront of SR technology.

Prioritized Grammar Enumeration (PGE) is a deterministic algorithm for SR. PGE solves the SR problem from a language point of view, to prioritize the enumeration of expressions in that grammar. By recognizing the structure of the search space, PGE turns the SR problem into a graph traversal problem. PGE also recognizes multiplicity of points in in the search tree and overlaps them, resulting in the search graph. PGE uses dynamic programming techniques that memoizes previous results and prevent the compounding duplication of effort.

PGE further shrinks the search space by reducing models to a canonical form with the rules of algebra and simplification. Recursion and generating functions are applied to produce new forms from the current model. By placing valueless parameters, which are later determined with non-linear regression, PGE additionally separates the search for model form from the optimization of any given form.

PGE takes reductionism one step further, fully optimizing an equation in the search space the first time it is encountered and remembering which equations it has discovered thus far.

Partial ordering, coupled with the trie representation and simplifications, yields many-fold reductions of the search space. Invalid and ineffectual expressions are removed, variations of associative forms are limited, and isomorphs are combined, shrinking the number of representable equations that need to be explored by a SR implementation.

To consolidate this space, PGE imposes operator restrictions, simplifies equations, and partially orders sub-expressions. When combined, these methods only allow syntactically valid equations to be considered, merge isomorphic equations into a canonical form, reduce the overall size of the search space, and create a structural ordering to the search space.

The third observation PGE makes is that the search for an equations form, its structural configuration, can be separated for the optimization of that form’s parameters.

**Removing Non-determinism**- Creation of a completely reproducible SR algorithm.**Search Space Organization**- Understanding and designing the structure of the search space.**Evaluating Forms Once**- Designing the model for detection and avoidance of overlapping subproblems.

PGE was born from the process of removing random number generators from GP. This required methods for selection, survival, and replication to all be replaced. The replacement process brought to light several issues which have not been addressed by the GP community.

The theory underlying PGE is based on intuitive principles. First, take into account the rules from algebra, to reduce model multiplicity into a canonical form, thereby reducing search space complexity. Second, design models so that they are only handled once. This requires that they are also designed with dynamic programming techniques in mind. Third, remove all sources of non-determinism to create a completely reproducible algorithm. This also simplifies development and debugging of the software which implements the algorithm.

Removing sources of non-determinism was a central theme to the development of PGE. The aim was to produce an algorithm which would give the same results with each invocation. The reason for this goal was the detrimental effects non-determinism has on search algorithms. We believe it to be an inherent and unavoidable difficulty present in the GP algorithm. You can find details on GP in Chapter 7

To achieve deterministic behavior, PGE makes no use of random number generation. It performs the exact same algorithmic steps given the same parameters and same training data. This requires several aspects of the PGE algorithm to be defined deterministically.

**Initialization** establishes the starting points
from which PGE searches. Determinism starts here
by constructing a set of basis functions
from the input variables.

**Prioritization** determines which points
in the space to search next. This process
uses a priority queue coupled with a
Pareto non-dominated sorting algorithm.

**Exploration** discovers new models by
making small changes to a given model.
To maintain determinism, PGE makes
all possible changes to a model,
given the particular exploration scheme in use.

**Communication** occurs when PGE uses
parallel execution. In order to maintain
deterministic behavior, well defined
synchronization points are used.

Details of each aspect will be expanded upon in the upcoming sections and next chapter.

The size of the space defined by a grammar is infinite, even when disregarding the increase brought by considering real valued coefficients. The main contributions are combinatorial components at both the search space and model levels of the problem. PGE’s great efficiency gains are the result of methods designed to combat the combinatorial components. We begin at the model level, where the main tool is the trie based representation. It is dynamic programming which comes to the rescue at the search space level.

**Tree Representation Combinatorics**

PGE primarily uses the trie based representation
for the models, as detailed in the last chapter.
The reason for this is the combinatorial
number of different binary tree representations
for the equivalent model.
These different representations arise
from permutations of the leaf nodes
and from variations in the tree structure.
Consider the equation `a•b•c`

(Figure 4-1).

This equation has 12 different binary tree representations, from six leaf permutations and two shape permutations. If we account for both addition and multiplication, the number of trees grows to 48. As you can imagine, with increasing numbers of operations and operands, the tree complexity and this number undergoes a combinatorial explosion.

The advantage of the trie based model is the ability to take advantage of the commutative and associative properties of addition and multiplication. Because of these properties, we are free to order the operands, sub-trees in the model trie, as we choose.

Commutativity is the ability to rearrange parentheses and is associated with the multiple tree representations for the same expression. By using the trie, we flatten a binary tree and metaphorically erase the parentheses from the expression.

Associativity means that we can sort the operands to addition and multiplication. By sorting them, we create a consistent ordering which can be treated as the canonical form for a model. In doing so, we remove the combinatorial element associated with the permutation of leaf nodes. Sorting is possible by creating a ordering on node types. This is easily achievable by assigning an integer to each node type.

By turning binary trees into tries
and by sorting the tries,
each model can have a canonical form.
Figure 4-1, right column, shows
the canonical form of `a•b•c`

.
After applying the reductions,
the original 48 models is now just 4.
This savings created by sorting
become even more important as equations
become more complex.

**Algebraic simplifications**

There are other sub-expressions which arise that we would also like to remove and reduce.

Simplifications group like terms together (), replace sub-expressions which evaluate to a constant value (, , or ), and reduce unnecessarily complex expressions such as . The resulting equations are equivalent, simpler forms.

There is debate as to how simplification effects the SR process [ kinzett:2008:using, kinzett:2009:online, mcree:2010:symbolic ]. Certainly, changing the tree effects the Pareto trade-off which in turn has consequences on selection and therefore search progress. Questions still remain as to how much and which simplifications should be applied, which forms of the equation should be kept, and the overall effects simplification has on the search process.

Despite these questions, PGE still employs the use of the aforementioned simplifications. This is done to eliminate those sub-expressions which unnecessarily complicate equation and the overall search space. If they were not replaced, many equivalent variations of a good equation could crowd out other potential solutions, as they would be selected more often. This would result in the algorithm stagnating in a local optima of the search space. Simplifications effectively trim out parts of the search space which are redundant, and also return results which are more comprehensible.

PGE uses the simplification techniques which are present in the SymPy Python library. This and further implementation details can be found in Appendix-A1.

At its core, PGE is a dynamic programming algorithm which aims to evaluate each sub-problem once.

A dynamic-programming algorithm solves each subsubproblem just once and then saves its answer in a table, thereby avoiding the work of recomputing the answer every time it solves each subsubproblem.

~ Cormen:2009:algorithms

In PGE and SR, a sub-problem is a specific form, namely the parse tree for an equation and its terminals. Recall that PGE uses placeholders for the parameters, so an equation form accounts for the existence of a parameter without knowing its value.

The key to evaluating forms once is to fully optimize a form the first time it is seen and to remember the forms which have already been explored. PGE fully optimizes forms by fitting the valueless parameters with non-linear regression, thus assigning it a value. In order to memoize the equation forms, they are converted to a linear representation and then recorded in a lookup structure called the Integer Prefix Trie. This is possible because each equation has a canonical form which is unique.

**Non-linear Regression** is not unique to PGE, or GP for that matter,
but it is a central component of PGE’s
ability to evaluate an equation form once.
As PGE encounters new equation forms,
it fits the abstract parameters using non-linear regression,
resulting in the ‘best’ version of that form on the training data.
PGE uses the
Levenberg-Marquardt (LM) optimization algorithm CITE
to fully fit an equation to a set of data.
In cases were an equation is linear w.r.t.
the parameters, the LM algorithm library
returns in one iteration by using
singular value decomposition (SVD).

PGE’s treatment of parameters and use of non-linear regression enables the separation of search for equation form, from optimization of a form’s parameters. In this way, the search for form, and the optimization of coefficients are split into separate tasks.

This separation has a profound implications on the efficiency and effectiveness of PGE. It enables an equation form to be fully evaluated once, removing duplication of effort that GP experiences when using genetic operators to optimize the coefficients of an equation form. Some GP implementations have incorporated the use of regression for fitting coefficients. However, they make no use of memoization, and therefore still suffer from duplication of effort which results from multiple copies of the same equation.

**Memoization**

Canonical forms for equations reduced one source of combinatorics in the SR problem. A second source is from multiple derivations. That is multiple paths over the graph which reach the same node. Below is a diagram which demonstrates this.

In this diagram, we start with models which consist of a single variable. Now we could take each of these models and add a different variable to it. Consider though that we could start with and add or start with and then add , both arriving at the canonical for . Repeat this process, adding a third variable, and we have six paths to the model .

It is important that we detect this situation, by determining that each of the six are really the same model. If we do not detect the separate models as one, would would end up repeating a lot of work. This repetition would become a combinatorial explosion of exponential proportions.

PGE uses dynamic programming techniques to detect multiple derivations and consider an equation form only once. This feature of PGE is enabled by the Integer Prefix Trie and an integer linear representation for the equation models. The next section will fill in the details of both representation and memoization.

**Representation**- Trie and linear representations with valueless parameters**Evaluation**- Nonlinear regression for parameters, fitness metric for models**Optimization**- Exploration, prioritization, and regularization**Memoization**- Detecting overlapping subproblems to avoid duplicate work

PGE has four components which enable its search capabilities. The first of these are the generating functions which produce new equations from previous equations. The second is an Integer Prefix Tree which enables memoization of equations from their serialized representation. The third component is non-linear regression which separates form from optimization. The fourth component is the Pareto Priority Queue which gives order to the search space by prioritizing equations over one another.

PGE uses multiple representations for an equation. The syntax trie is used for optimization. The integer sequence is used for memoization.

**Syntax Trie**

PGE uses trees, where operands can have a variable number of children. In the n-ary tree representation, the associative operators can have n sub-trees, flattening and reducing the tree’s size. This is a slight modification from the usual binary tree; only affecting the associative operators addition and multiplication. The n-ary tree does not change the modeling ability of the equation, but will effect the trade-off between parsimony and accuracy. This in turn effects the selection operation of any SR implementation, though we do not investigate this issue here.

In addition to the reductions in parse tree size, the n-ary representation eases the design of sorting and simplification algorithms. These algorithms, detailed next, work within the parse tree to reshape equivalent trees into a canonical form. This effectively merges isomorphic equations, reducing the size of the search space, while simultaneously adding structure to the search space.

**Linear integer sequence** is
the prefix notation representation
of an equation, mapped to the integers.
It is equivalent to the trie representation
with one-to-one correspondence.
The linear integer sequence is obtained by
mapping each node type to a different integer
and the performing a pre-order traversal
of the trie, collecting the integers
by appending them to a list.

The usefulness of this representation is for efficiency in implementation. Integers can be used for constant time lookups, where otherwise we may have been limited to a higher complexity operation. The linear representation is mainly used in conjunction with the Integer Prefix Trie (IPT) during the memoization process. The IPT is an efficient data structure for recording and looking up equations and will be detailed shortly.

Additionally, the linear representation can be used for interprocess and network communication. It’s main benefit there is on the receiving side, where we can avoid the complexity of parsing the human readable version of an equation.

**Abstract parameters** are the valueless
placeholders used as coefficients
in equations. They signify that a parameter
is present and where in an equation it occurs.
The benefit of abstract parameters is two-fold.
First, equation form matching and memoization
can happen without regard to their value.
Second, they enable the optimization
of the parameters to be performed
in a separate process, via non-linear regression.
This is the key to separating
the optimization of a given form,
from the optimization which occurs
at the global scale, in the space of all equations.

**Non-linear parameter fitting**
though not unique to PGE, or GP for that matter,
is a central component to the PGE algorithm.
It enables the separation of search for form
and optimization of that form’s parameters.
Great gains in efficacy are realized by
separating the search for equation form,
from the optimization of parameters.

PGE achieves this separation by using abstract parameters. Parameters are indexed into an array, which means they do not store their value internally. The indexed parameters contribute only to form, taking on value only after non-linear fitting. Handling parameters in this manner is analogous to the way variables are treated, which only contribute to fitness during evaluation.

**Fitness metrics** are composite metrics
which combine a notion of accuracy and complexity.
They are used to compare the relative
fitness between two models or equations.
They are the multi-objective value
which SR implementations seek to optimize.
The most common fitness metric is to
use the tree size and the mean squared error.
Many values for accuracy and complexity
may be used, and more than one may be used.
Additionally, the values in the fitness metric
may be weighted to give preference.
Another useful practice is to normalize
the values used in the fitness metric,
across the set, before sorting
into successive non-dominated sets.
The Pareto non-dominated sorting methods,
described in the previous chapter,
use fitness metrics to determine
the successive non-dominated sets of equations.

SR is a multi-objective optimization task seeks to maximize accuracy while minimizing complexity. To search over equation space PGE must move within the space and decide where and to to move.

PGE uses exploration operators derived from the rules of the grammar. The exploration operators input one equation and output a set of equations which result from small edits. PGE uses a priority queue to express which points in the search space to expand next. By incorporating the Pareto non-dominated sort into a priority queue, the PGE search can exploit and optimize the trade-off between competing objectives in a deterministic order.

**Exploration Operators** are
the means by which PGE searches
the space of all equations.
Each exploration operator is a policy
for how to expand and modify
an equation’s parse tree to obtain
functions ‘close’ to the current one.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

def Grow(model):
add_expands = add_extend(model)
mul_expands = mul_extend(model)
var_expands = var_substitution(model)
return var_expands + add_expands + mul_expands
def add_extend(model):
// E -> E + c*L
// E -> E + c*N(L)
def mul_extend(model):
// T -> T * L
// T -> T * N(L)
// T -> T / L
// T -> T / N(L)
def var_substitution(model):
// N -> ( N + c*L )
// N -> ( N + c*N(L) )
// N -> ( (c*L) / N )
// N -> ( N / (c*L) )

PGE uses the grammar’s production rules
expand previously encountered equations.
New equations are produced by applying
the function *Grow*, from Listing \ref{expand},
while recursing over the parse tree.

The *Grow* function determines the
current node’s type and applies the appropriate
production function.
Each production function corresponds to
one or more of the grammar’s production rules.
*AddTerm* increases
the number of terms in a summation, such as
.
*WidenTerm* increases
the number of terms in a product, such as
or .
*DeepenTerm* increases
the complexity of a term, such as
or .

The generating functions define the set of reachable expression in conjunction with the supplied building blocks (terminals and non-terminals of the grammar). However, generating functions do not determine the order in which the space is searched. This is the role of the PPQ (section \ref{sec-pge-ppq}), which prioritizes the areas to explore next.

**Prioritization Strategies**

The Pareto Priority Queue (PPQ) is the deterministic mechanism for controlling the search direction of PGE. The PPQ replaces selection for mating with prioritized equations for expansion (selection for survival is unnecessary since all equations are remembered). The PPQ is what its name implies, a priority queue built on the Pareto non-dominated sorting. The PPQ prioritizes expansion towards those branches of the search space which balance size and accuracy best. The PPQ removes the need for generations, survivors, and mating pools, storing all explored equations which have not been expanded. Our experimentations have not shown this to be prohibitive. Consumed memory never exceeded 500Mb, even in the face of hundreds of thousands of unique equations.

To construct the PPQ, successive Pareto frontiers are appended onto a linear structure. Thus, the smallest equation from the first Pareto front is at the head of the queue. The next elements are the remainder of the first Pareto front in increasing size. The remaining Pareto fronts follow a similar pattern. Priority is first given to the Pareto frontier and then to size within the frontier. This is the same as Pareto sorted array that results from the GP Pareto sort during selection.

The Pareto Priority Queue (PPQ) determines the order the space of equations is searched. The PPQ gives equations which expand the branches of the search space towards small equations with low error.

To construct the PPQ, successive Pareto fronts are appended to the end of the queue. Thus, the smallest equation from the first Pareto front is at the head of the queue. The next elements are the remainder of the first Pareto front in increasing size. The remaining Pareto fronts follow a similar pattern. Priority is first given to the Pareto frontier and then to size within the frontier. This is the same as Pareto sorted array that results from the GP Pareto sort during selection.

**Regularization** is the process of
introducing a penalty for complexity
in order to avoid overfitting.
In PGE and GP, this is an implicit process
which is part of the Pareto non-dominated sort.
PGE uses the PPQ and a pop count to
control the amount of regularization.

During processing, we remove the top equations (3 in our experiments) when selecting the next areas to search. By doing so, we select the smallest equations from the first Pareto frontier. This gives variation across the trade-offs for the equations to be expanded, and thus variation in the search direction too. If we only remove one equation, the smallest equations would be replaced from the next front and the search would progress through equation space by size. If we remove too many, then PGE starts to produce complex and over fit models.

**Detect overlapping subproblems**
is the key to the dynamic programming approach taken by PGE.
In PGE, a sub-problem is equivalent
to a particular equation form.
Sub-problems are encountered
when an equation has
several derivations,
or paths over the search graph
which reach the same node.
This means there are multiple
orderings of the production rules which
result in the same equation
and that each equation may
appear in more than one form,
as several equivalent points
within a grammar’s representable space.

The memoization of form allows PGE to consider a form just once. PGE matches equations by comparing the serialized representation of the equations. Serialization transforms an equation into a sequence of integers by assigning a unique value to each node type. The resulting integer sequence is equivalent to the prefix notation of the equation. Since each sequence represents a unique equation, the parse tree is recoverable from the sequence. Also, since PGE uses abstract coefficients, they are all converted to the same value. This means PGE only memoizes their existence and location.

**The Integer Prefix Tree** (IPT) is at the core
of the PGE dynamic programming framework.
The IPT detects overlapping subproblems
by comparing integer sequences,
the serialized representation of an equation.
Detecting overlapping subproblems
enables us to eliminate duplication of effort
in an exponentially growing search space.

The IPT was inspired by the suffix tree algorithm for string matching [Weiner:1973:LPM, Farach:1997:OST, Skiena:1998:ADM]. The suffix tree algorithm gives log-time lookup for string existence in a piece of text. Similarly, the IPT provides log-time determination, w.r.t. the size of the equation alphabet, as to whether a subproblem has been encountered before. Since the ASCII alphabet is equivalent to the integers, in a mathematical mapping sense, we can use any sized alphabet, or equivalently, range of integers.

PGE uses the IPT memoization tree defined in Listing \ref{pretree}. The IPT is a recursive structure and insertion algorithm which memoizes a sequence of integers. The IPT is built up as new expressions are encountered. As each node in the IPT is visited, its associated counter is incremented. If the insertion algorithm has to create a new node, then the expression is also new, and the expression type is stored in that new node. New node creation is propagated back up the tree as the insertion function unwinds. The root IPT node unique counter is incremented if a new insertion took place. Each insertion or lookup takes O(mlog(n)) operations, where m is the length of the serial and n is the size of the alphabet. The depth of the IPT is equal to the length of the longest serial encountered.

The IPT also tracks the number of times each node in the structure has been visited. We keep counts for total visits and unique visits. This allows us to do some accounting of the number of unique equations as well as how many times each equation is produced by the generation functions. There has been no previous research on the topic of unique equations in relation to the total number of equations created or evaluated. We believe this to be due to the lack of a method for matching equations, which is due to the lack of canonical forms in the literature. We use the IPT to compare the efficiency of GP and PGE in section \ref{sec-results-efficient}.

The IPT is also the enabling structure which conceptually turns the search space from a tree into a graph by matching and memoizing subproblems, or equations. This greatly reduces the amount of work, by eliminating duplication, that the PGE must perform while searching for ideal equations. As mentioned earlier, the search space is also reduced in size by sorting equations into a canonical form and simplifying them by rewrite rules.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37

class Node:
def __init__(self, key=0, value=None):
self.map = {} # int -> Node
self.key = key
self.value = value
def insert(self,iis, value):
if len(iis) > 1:
ii = iis[0]
if ii not in self.map:
self.map[ii] = Node(key=ii)
return self.map[ii].insert(iis[1:], value)
if len(iis) == 1:
ii = iis[0]
if ii not in self.map:
# print " new node for key: ", ii
self.map[ii] = Node(key=ii, value=value)
return True
else:
return False
def lookup(self,iis):
if len(iis) > 1:
ii = iis[0]
if ii in self.map:
return self.map[ii].lookup(iis[1:])
else:
return False, None
if len(iis) == 1:
ii = iis[0]
if ii in self.map:
return True, self.map[ii].get_value()
else:
return False, None

PGE iteratively refines equations by applying the grammar’s production rules recursively. PGE removes the top equations from the queue, prioritizing the first Pareto frontier. The top equations are then expanded according to the production rules. The resultant expanded equations are checked for uniqueness by the Integer Prefix Tree (IPT). The unique set of trees that remain are fit using non-linear regression, evaluated for fitness, and pushed into the Pareto Priority Queue (PPQ). The main PGE loop continues in this way until the stopping criteria is reached.

The PGE search proceeds as follows. Initialization generates a set of basis functions for each variable. These basis functions are then memoized by the IPT, fit to the training data and pushed into the PPQ. The main PGE loop iteratively pops equations from the top of the PPQ for processing. Each of these equations is expanded through recursive application of the generation functions. The resulting equations are memoized using the IPT. If the equation has been encountered before, it is discarded. The remaining unique equations are fit to the training data with non-linear regression and pushed into the PPQ. PGE continues until a model of desired accuracy is discovered or a computational threshold is reached. Theoretically, PGE could explore the entire space of representable equations given infinite space and unlimited time.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

def loop(iterations):
for I in range(iterations):
popd = pop(p)
# expand popd models, they are now fully processed
expanded = expand(popd)
# filter and memoize expanded models
to_memo = filter(expanded)
to_eval = memoize(to_memo)
# fit and evaluate these models
eval(to_eval)
# push fully fit models into the queue
push(to_eval)

The main PGE loop makes the following steps (line numbers): (3) equations are from the top of the PPQ. (6) The equations are expanded to create the more complex equations ‘expanded’ by applying the grammar’s production rules recursively to an equation’s parse tree. (9) Each equation ‘expanded’ is potentially filtered for excessive size or illegal operations. (10) The IPT uses an equation’s serialized representation to checks if has been seen before. If it has, is discarded, otherwise $e$ is unique and (13) fit to the training data and (16) pushed into the PPQ.

Variations on the PGE algorithm can be created by altering how initialize PGE and step through the grammar via its production rules. We can seed PGE with a richer set of basis function, and even allow complete sub-trees to become building blocks. Further, by enriching the production rules applied at each node, we can enlarge the set of equations generated at each point in the search space.

Variations on the PGE expansion algorithm
can be created by parameterizing where and
which generation functions are applied to a candidate.
By modifying the productions applied at each node,
we can reduce or increase the set of equations
generated at each node of the tree,
and thus each point in the search space.
In this paper,
we used three different expansion methods.
The basic method (PGE-1) restricts the grammar
by removing productions which result in
non-distributed expressions like .
It only uses *AddTerm* and *WidenTerm*
during the recursive expansion.
The second method (PGE-2) adds back the previously mentioned restriction,
using all three generation functions from Listing \ref{expand}.
This results in isomorphs
which are too `far’ apart
for simplification rules to consider the same.
Despite being equivalent,
these isomorphic forms can produce very different offspring,
and represent separate areas of the search space.
The third method (PGE-3) is FFX inspired,
but starts with only a set of univariate bases.
These bases are iteratively
grown into a summation of multiplications
by applying *AddTerm* and *WidenTerm*,
creating candidates with more and increasingly complex bases.
PGE-3 differs from the PGE-1 method by
explicitly defining the root node to be addition
and placing stricter limitations on the complexity
of each term in the summation.

During the development of PGE, we encountered some of the limitations with our current implementation and reasoning.

PGE does not handle dimensionality well. As the number of variables increases, so do the number of equations generated. Growth is exponential to the number of variables. This is because of the production rules being applied to each node of the parse tree, which expand with all possible variables. Further, with more complex generator functions, equations which are the same mathematically, are not equivalent through our current rewrite rules. The combination of these two effects compounds the issue of the exponential rate of equation generation. This fact may only be partially limiting, as the various forms of one equation represent different areas of the search space, and different forms are producible from these points in the space.

Expansion and simplification policies are difficult to establish and generalize. There is no free lunch. The policies determine the reachable search space and how local changes are made. If too much expansion is unrestrained, the complexity grows quickly, often creating equations which have a simpler version.

Additionally, the originial implementation used a custom algebra system and lacked many features and capabilities. We advise, and will in the next chapter, move to using a mature Computer Algebra System (CAS).

Bloat in PGE is when good partial solutions dominate the selection pool and create an expansion cycle. Consider as a good partial solution. PGE makes all possible modifications to this expression, PGE will produce where either is very small or produces near zero values. In both cases, bloat occurs because the extra terms of the expression have little effect on the fitness of the equation. Adding to this, each node in the bloated portion is expanded upon, further exasperating the situation.

Due to the bloat and all inclusive expansion, the bloated partial solutions can quickly dominate the selection pool. These similar solutions have similar fitness, causing the selection process to have difficulty distinguishing between them. They can create a situation where similar models are selected and expanded, with bloated terms and fitness incomparability, and PGE getting stuck in a local optima.s

Parameter optimization is the portion of any SR implementation which takes the longest time. In many cases, the parameters are linearly dependent and we get an exact best solution for a model. In the nonlinear cases we may not, as nonlinear regression is not guaranteed to find an optimal solution. There are also questions as to where parameters should be included. If excessive parameters are added to a model it can take a comparatively long time to fit and in many cases may be overfitting to the noise in the data.

Prioritized Enumeration is heavily influenced by GP. It uses the same parse tree representation for candidate equations. PGE, however, differs in how it evaluates, selects, and generates new candidates from old ones. This section is meant to help those familiar with GP to understand where and what the analogous processes are.

**Parameters**

PGE is nearly a parameter free algorithm, requiring only the building blocks, peel count, and termination criteria to be established prior beginning an equation search. Since building blocks, generally established by the problem or benchmark, and stopping criteria are common to any search method, PGE only has one parameter the number of candidates to remove from the PPQ. This is a substantial advantage over GP, which requires many parameters to be set. This is a notoriously difficult task as was discussed in the previous chapter.

PGE is nearly a parameter free algorithm, requiring only that the building blocks, generating functions, the value of $p$, and termination criteria be established prior to beginning an equation search. Since building blocks are generally established by the benchmarks, and stopping criteria are relatively common across SR implementations, PGE has only two unique parameters: (1) $p$, the number equations to remove from the PPQ and (2) which generation functions to use. % PGE has no populations, % no generations or ages, % and no random number generation. This is a substantial advantage over GP, which requires many parameters to be set, a notoriously difficult task.

**Evaluations**

Evaluation is the most time consuming phase of the model discovery process. PGE performs two steps of evaluation, fitting coefficients and evaluating fitness. PGE uses non-linear regression to optimize the coefficients, which requires an undetermined number of evaluations on the derivatives of the equation. Once fit, PGE calculates fitness on the testing set. In GP, one evaluation per data point, per equation is required for both training and testing.

Regressing the coefficients has a profound effect on the ability of PGE to quickly traverse the search space. It allows for the separation of equation form and coefficient optimization into two sub-tasks of the overall search. In GP, the optimization of form is difficult and the most time consuming portion of the search process.

**Selection**

To prioritize the enumeration of expressions, we maintain, what we will refer to as, a Pareto Priority Queue (PPQ). The contents of the PPQ are the equations which have been regressed and evaluated, but have not yet been expanded. The PPQ is similar to the basic priority queue, in that we want to quickly prioritize the elements which have the highest priority. However it is different because we are sorting on two objectives, parsimony and accuracy, rather than one. When selecting the equations to expand next, we remove one or more elements from the front of the heap. Since the front of the heap is the first Pareto front, we prioritize the smallest equations first

The more equations we *peel* (remove)
from the front of the heap,
the more of the search space
we consider concurrently.
This enables the search to
exploit equations that
express different levels
of the trade off between
parsimony and accuracy.
If we select only one equation
from the PPQ per iteration,
the search degenerates into
a search which progresses
through expression size.
This degenerate case will
exhaust all possible equations
of the current size
before considering
any larger equations.

**Population and Breeding**

In PGE, there is no need to maintain a diverse population in which survival and replication determine the outcome of the search. In fact, one of the goals of PGE is to only process an expression form once. This is managed by the IPT which tracks the equations that have been evaluated and expanded. Equations which have been generated, regressed, and evaluated sit in the PPQ. Once an equation has been removed from the PPQ and expanded upon, it requires no further consideration.

prev top

### Progressive Evaluation One of the issues in PGE is that a significant amount of effort is spent evaluating models which end up with poor fit. A useful idea is to evaluate models on a subset of the data. This method has been shown to provide good estimates with as few as just 8 data points [ [hod:08:coevo_fp]() ].

### Selection Improvements The original PGE algorithm used tree size and an error metric to calculate model fitness. It then used a simple Pareto sorting mechanism to sort the population. This section describes enhancements we make to the metrics, fitness, and selection processes. #### More Model Metrics Our experience showed that modelling error was often insufficient for effectively separating good models from the rest. We add several additional metrics for models which account for quality while penalizing complexity. This is separate from the complexity and accuracy tradeoff which happens at the global level. They do however have a benefical influence on the global selection and prioritization process. **Penalized complexity** is a tree size calculation which adds a penalty to specific types of nodes. In our experiments, we applied it to function nodes from the triganomic and logarithmic families. The penalty accounts for increased relative computational complexity which we set at +2. The value, while effective, deserves further investigation than we provide. **Jacobian complexity** is the summed tree size of the jacobians for the model. We also included the penalized version as well. This metric has multiple effect for minimizing complexity. It minimizes absolute, summed size. It minimizes the number of model parameters, because fewer terms means smaller sum. It also minimizes the nesting complexity of the original model. When coefficients are dependent upon one another, the jacobians become quite complex and drastically increase the value of this metric. **R-squared** measures the goodness of fit for a model and attempts to account for the amount of variance explained. Values range from zero to one and can be interpreted as the percentage of variance the model explains. One issue with this metric is that extra variables can be included to increase its value. The adjusted R-squared addresses this by adding a penalty which increases with each extra variable [[Taylor:1997:error](An_Introduction_To_Error_Analysis_The_Study_Of_Uncertainties_In_Physical_Measurements_Taylor_John)]. **reduced Chi-squared** also measures the goodness of fit for a model. It includes the variance in the data and degrees of freedom. This effectively normalizes the value to the number of data points and model complexity [ [Taylor:1997:error](An_Introduction_To_Error_Analysis_The_Study_Of_Uncertainties_In_Physical_Measurements_Taylor_John), [Pearson:1900:chi](http://www.economics.soton.ac.uk/staff/aldrich/1900.pdf) ]. **AIC/BIC** measure the relative quality of fit for models. Akaike information criterion (AIC) and Bayesian information criterion (BIC) do not say anything about the absolute quality of the model. Rather they measure the relative quality of models against eachother. Both add a penalty term based on the number of model parameters. BIC penalizes complexity more and is thus the more conservative metric [ [Akaike:1974:aic](http://www.unt.edu/rss/class/Jon/MiscDocs/Akaike_1974.pdf), [Schwarz:1978:bic](http://projecteuclid.org/euclid.aos/1176344136) ]. #### Improvement over parent Improvement over the parent refers to how much an accuracy or goodness of fit metric changes in an offspring, relative to the parent. We use a simple difference between the values, making sure to account for whether the metric is a minimizing or maximising one. One interesting situation happens when we find a new path to an existing model. In this case, we can look to Dijkstra's algorithm and *relax* to find a new relative improvement value. #### Model Fitness Model fitness is used to make relative comparisons during the selection process. The fitness for an individual model is usually a combination of complexity and accuracy in an attempt to regularize. The original PGE algorithm used tree size and model accuracy, as determined by an error metric, as the components to the fitness metric. It then used a the basic Pareto non-dominated sort to prioritize the next models to be expanded. In our enhancements, we made several changes in how we used the metrics in fitness. In addition to being able to use any of the new metrics, we added the ability to use multiple metrics at once, allowing the number of components to grow beyond just one for size and one for accuracy. We then added the ability to weight each component separately. This proved beneficial as we discovered the best policy was to weight the complexity measure so that it was equal to multiple accuracy metrics. We also found that competing certain metrics was beneficial, such as minimizing BIC while maximizing the improvement of BIC. One of the more significant changes we made was to normalize each of the values across the set of models. For each metric being used, we normalized across the models, and then used the new value as the component to the fitness for the model. This removed the effect of scale of values on the relative comparisons made during selection. With out normalization, one of the components can easily dominate the solution. Additionally, the dominating component often changed as the search progressed. In example, as error tends towards zero the size becomes the dominate factor, and distinguishing between models on error gets more difficult for the algorithm. #### Density based Pareto methods The orginial PGE algorithm used a simlistic Pareto sorting algorithm. Research in GP has shown that accounting for denisity along the frontier, better choices can be made. Section ? in Chapter 3 discussed the various improved algorithms for Pareto selection. We settled on the NSGA-II algorithm for its good balance between effectiveness and computational complexity. In our experiments, the addition of NSGA-II did not, by itself, improve the prioritization process. It was its combined usage when the fitness values contained more than two components. NSGA-II was able to distribute selection by accounting for density across all dimension, making the use of more dimensions beneficial.

### Expansion Improvements This section describes improvements to the expansion phase and functionality. The first changes enable multiple levels of complexity and context awareness when appling the expanions operators. We then describe two new expansion operators which were designed to work on summations. Finally, we enable multiple tiers of model expansion, thereby applying progressively complex expansion operations to progressively fewer models at a time. #### Configurable complexity levels The original PGE algorithm had a three different sets of expansion functions. They encompased three different complexities or scenarios. We alter this idea for greater grainularity. Each of the four operators can have its own complexity level. The *low*, *medium*, and *high* options provide flexibility while simplifying the choices for an end-user. The following tables describe the the basis functions used for each of the operators. Their application remains the same as the original PGE algorithm.

**Initialization** | level | result set | | ------- | ------------| | low | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:2]} \Big) + C $$ | | med | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_3 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:2]} \Big) + C $$ | | high | $$ \sum \Big( \Big\{ \big\{ \vec{x}C_4 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \} \Big \} C_{[1:3]} \Big) + C $$ |

**Variable Substitution** | level | result set | | ------- | ------------| | low | $$ \big \{ \vec{x}C_1 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | med | $$ \big \{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | high | $$ \big \{Add Med \big \} \bigcup \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \} +b $$ |

**Multiplication Extension** | level | result set | | ------- | ------------| | low | $$ \big \{ \vec{x}C_1 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | med | $$ \big \{ \vec{x}C_2 \big \} \bigcup \big \{ F(\vec{x}C_1) \big \}$$ | | high | $$ \big \{Mul Med \big \} \bigcup \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \}$$ |

**Addition Extension** | level | result set | | ------- | ------------| | low | $$ \big \{ a * \vec{x}C_1 +b \big \} \bigcup \big \{ a * F(\vec{x}C_1) +b \big \}$$ | | med | $$ \big \{ a * \vec{x}C_2 +b \big \} \bigcup \big \{ a * F(\vec{x}C_1) +b \big \}$$ | | high | $$ \big \{Add Med \big \} \bigcup a* \Big \{ \big \{ \vec{x}C_1 \big \} \times \big \{ F(\vec{x}C_1) \big \} \Big \} +b $$ #### Context-aware The original expansion polices were applied recursively at each node in the model tree. This was done without regard for the context in which these operations took place. This led to situations where trigonimic functions were embedded within eachother, such as $$ sin(cos(x)) $$. It also led to the production of ovely complex expressions involving coefficients and addition. In example, $$C * ( C + C*X(C*X + C*(C*X + C)^2)) $$ is more simply expressed as a single level summation. Additionally, it is much more difficult to fit the coefficients as they involve many nonlinear relationships. We incorporate the idea of context-aware expansion policies to deal with undesirable situations and consequences. Here we include two types of context-awareness, functional and depth, though there are generally more options. Context aware operators are easily created with state traking during the recursion process. Our experience has been that their use implies restriction based on context. However, there may be situations where certain operations are only permitted within a particular context. #### New Extension Operator The additon and multiplication expansion policies extend the number of terms to the respective operators. They do this by drawing from a predefined pool of basis functions. We make use of a new extension operator which only works at root level addition. This operator extends an addition with small modification to its current terms. First, each term is expanded with a set of basis terms. This is similar to the multiplication expansion. Then each of these terms is included in the summation, as opposed to replacing an existing term. This operator makes larger movements around the search space, effectively being the combination of smaller modifications with several intermediate steps. We chose to only use this at root level addition because deeper recursion becomes overly complex very quickly. #### New Shrinkage Operator Under the original policies, there was no explicit method to go backwards, or decrease the size of a model. This can occur when a cancelling term is applied through an expansion operator, such as $$ x^2 * x^{-1} \rightarrow x $$. We make use of a new shrinkage operator which removes terms from additions. The operator recurses over the entire tree. When an addition is found, it produces all expressions possible, with one term removed. The reason for adding this is to clean up after intermediary terms in models. During the search process, it is often the case that additonal terms are incorporated which either bridge a gap or are meant to deal with noise. This is particularly noticable with trigonimic functions whos output is in the range [-1:1]. ADD PICTURES HERE FROM ACTUAL BENCHMARKS / PROBLEMS The shrinkage operator removes excessive terms from overfitting noise or from the intemediary terms needed during search progression. We exclude a similar operator for multiplication for two reasons. The first is that there would be a large combinatorial number of models generated from this process. Second, in most cases the models were or can be reached through the usual multiplication expansion process. That is, this process would be to similar to the multiplying by a cancelling term to make a significant difference. #### Progressive expansion Progressive expansion is an idea that a series of increasingly complex expansion schemes can be applied to a decreasing number of models. This follows along with the multiple tiers of refinement motif. Once a model is fully fit during a search, it becomes a candidate for expansion. If selected through the heap prioritization, it then has the expansion methods applied. Under the progressive expansion design, each step in the series has its own selection heap and its own set of expansion policies. When a model has been selected and expanded it is pushed into the next tier's heap.

### Enhanced PGE This chapter has described many enhancements to the original PGE algorithm. The diagram below depicts PGE with these enhancements. The loop remains the same however several phases have been internally extended with progressive application and intemediate heaps. The next chapter will show their effectiveness over a range of experiments.