### Theory

1. **Removing Non-determinism** - Creation of a completely reproducible SR algorithm.
1. **Search Space Organization** - Understanding and designing the structure of the search space.
1. **Evaluating Forms Once** - Designing the model for detection and avoidance of overlapping subproblems.

PGE was born from the process of removing random number generators from GP. This required replacing the methods for selection, survival, and replication. The replacement process brought to light several issues which have not been addressed by the GP community.

The theory underlying PGE is based on intuitive principles. First, take into account the rules of algebra, reducing model multiplicity to a canonical form and thereby reducing search space complexity. Second, design models so that they are only handled once. This requires that they also be designed with dynamic programming techniques in mind. Third, remove all sources of non-determinism to create a completely reproducible algorithm. This also simplifies development and debugging of the software which implements the algorithm.

#### Removing Non-determinism

Removing sources of non-determinism was a central theme in the development of PGE. The aim was to produce an algorithm which would give the same results with each invocation. The reason for this goal was the detrimental effect non-determinism has on search algorithms. We believe it to be an inherent and unavoidable difficulty present in the GP algorithm. Details on GP can be found in [Chapter 7](/sr/07-gp).

To achieve deterministic behavior, PGE makes no use of random number generation. It performs the exact same algorithmic steps given the same parameters and the same training data. This requires several aspects of the PGE algorithm to be defined deterministically.

**Initialization** establishes the starting points from which PGE searches. Determinism starts here by constructing a set of basis functions from the input variables.
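As a concrete illustration, deterministic initialization can be sketched as building basis functions by iterating the variables and operators in a fixed order, so every run produces the same starting set. The function name and operator list below are illustrative assumptions, not PGE's exact implementation.

```python
def initial_basis(variables, unary_ops):
    """Build a reproducible list of basis functions from the input variables.

    Iterating a sorted variable list and a fixed operator list guarantees
    the same output on every invocation, with no random number generation.
    """
    basis = []
    for v in sorted(variables):        # fixed order => reproducible
        basis.append(v)                # the bare variable itself
        for op in unary_ops:           # wrapped in each unary operator
            basis.append(f"{op}({v})")
    return basis

# the same inputs always yield the same basis set, in the same order
funcs = initial_basis({"y", "x"}, ["sin", "cos"])
assert funcs == ["x", "sin(x)", "cos(x)", "y", "sin(y)", "cos(y)"]
```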
**Prioritization** determines which points in the space to search next. This process uses a priority queue coupled with a Pareto non-dominated sorting algorithm.

**Exploration** discovers new models by making small changes to a given model. To maintain determinism, PGE makes all possible changes to a model, given the particular exploration scheme in use.

**Communication** occurs when PGE uses parallel execution. To maintain deterministic behavior, well defined synchronization points are used.

Details of each aspect will be expanded upon in the upcoming sections and the next chapter.

#### Search Space Organization

The size of the space defined by a grammar is infinite, even when disregarding the increase brought by considering real valued coefficients. The main contributions are combinatorial components at both the search space and model levels of the problem. PGE's efficiency gains are the result of methods designed to combat these combinatorial components. We begin at the model level, where the main tool is the trie based representation. At the search space level, it is dynamic programming which comes to the rescue.

**Tree Representation Combinatorics**

PGE primarily uses the trie based representation for models, as detailed in the last chapter. The reason for this is the combinatorial number of different binary tree representations for the equivalent model. These different representations arise from permutations of the leaf nodes and from variations in the tree structure. Consider the equation a•b•c (Figure 4-1).
Figure 4-1 - Combinatorics of the Tree
This equation has 12 different binary tree representations, from six leaf permutations and two shape permutations. If we account for both addition and multiplication, the number of trees grows to 48. As you can imagine, with increasing numbers of operations and operands, the tree complexity and this number undergo a combinatorial explosion.

The advantage of the trie based model is the ability to take advantage of the commutative and associative properties of addition and multiplication. Because of these properties, we are free to order the operands, sub-trees in the model trie, as we choose. Associativity is the ability to rearrange parentheses and is associated with the multiple tree representations for the same expression. By using the trie, we flatten a binary tree and metaphorically erase the parentheses from the expression. Commutativity means that we can reorder the operands to addition and multiplication. By sorting them, we create a consistent ordering which can be treated as the canonical form for a model. In doing so, we remove the combinatorial element associated with the permutation of leaf nodes. Sorting is possible by creating an ordering on node types, easily achieved by assigning an integer to each node type.

By turning binary trees into tries and by sorting the tries, each model can have a canonical form. Figure 4-1, right column, shows the canonical form of a•b•c. After applying the reductions, the original 48 models are now just 4. The savings created by sorting become even more important as equations become more complex.

**Algebraic simplifications**

There are other sub-expressions which arise that we would also like to remove or reduce. Simplifications group like terms together ($$x+x$$), replace sub-expressions which evaluate to a constant value ($$\sin\pi$$, $$\frac{x}{x}$$, or $$x-x$$), and reduce unnecessarily complex expressions such as $$\frac{1}{\frac{1}{x}}$$. The resulting equations are equivalent, simpler forms.
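The kinds of reductions listed above can be sketched as rewrite rules over expression trees. The toy simplifier below, with its nested-tuple encoding and tiny rule set, is only an illustration, a stand-in for the full CAS machinery a real implementation would use.

```python
def simplify(expr):
    """Recursively apply a few algebraic rewrite rules to a tuple tree.

    Trees are nested tuples: ("-", "x", "x") means x - x. Children are
    simplified first, then a handful of rules fire at the current node.
    """
    if not isinstance(expr, tuple):
        return expr                                  # leaf: variable or number
    op = expr[0]
    args = tuple(simplify(a) for a in expr[1:])
    if op == "-" and args[0] == args[1]:             # x - x  ->  0
        return 0
    if op == "/" and args[0] == args[1]:             # x / x  ->  1
        return 1
    if (op == "/" and args[0] == 1 and isinstance(args[1], tuple)
            and args[1][0] == "/" and args[1][1] == 1):
        return args[1][2]                            # 1/(1/x)  ->  x
    if op == "+" and args[0] == args[1]:             # x + x  ->  2*x
        return ("*", 2, args[0])
    return (op,) + args

assert simplify(("-", "x", "x")) == 0
assert simplify(("/", 1, ("/", 1, "x"))) == "x"
assert simplify(("+", "x", "x")) == ("*", 2, "x")
```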
There is debate as to how simplification affects the SR process [ [kinzett:2008:using](), [kinzett:2009:online](), [mcree:2010:symbolic]() ]. Certainly, changing the tree affects the Pareto trade-off, which in turn has consequences on selection and therefore search progress. Questions still remain as to how many and which simplifications should be applied, which forms of the equation should be kept, and the overall effects simplification has on the search process.

Despite these questions, PGE still employs the aforementioned simplifications. This is done to eliminate those sub-expressions which unnecessarily complicate equations and the overall search space. If they were not replaced, many equivalent variations of a good equation could crowd out other potential solutions, as they would be selected more often. This would result in the algorithm stagnating in a local optimum of the search space. Simplifications effectively trim out parts of the search space which are redundant, and also return results which are more comprehensible.

PGE uses the simplification techniques present in the SymPy Python library. This and further implementation details can be found in [Appendix-A1](/sr/A1-pypge).

#### Evaluating Forms Once

At its core, PGE is a dynamic programming algorithm which aims to evaluate each sub-problem once.
> A dynamic-programming algorithm solves each subproblem just once and then saves its answer in a table, thereby avoiding the work of recomputing the answer every time it solves each subproblem.
>
> ~ Cormen:2009:algorithms

In PGE and SR, a sub-problem is a specific form, namely the parse tree for an equation and its terminals. Recall that PGE uses placeholders for the parameters, so an equation form accounts for the existence of a parameter without knowing its value. The key to evaluating forms once is to fully optimize a form the first time it is seen and to remember the forms which have already been explored. PGE fully optimizes forms by fitting the valueless parameters with non-linear regression, thus assigning them values. In order to memoize the equation forms, they are converted to a linear representation and then recorded in a lookup structure called the Integer Prefix Trie. This is possible because each equation has a canonical form which is unique.

**Non-linear Regression** is not unique to PGE, or GP for that matter, but it is a central component of PGE's ability to evaluate an equation form once. As PGE encounters new equation forms, it fits the abstract parameters using non-linear regression, resulting in the 'best' version of that form on the training data. PGE uses the Levenberg-Marquardt (LM) optimization algorithm [CITE]() to fully fit an equation to a set of data. In cases where an equation is linear w.r.t. the parameters, the LM algorithm library returns in one iteration by using singular value decomposition (SVD).

PGE's treatment of parameters and use of non-linear regression enables the separation of the search for equation form from the optimization of a form's parameters. In this way, the search for form and the optimization of coefficients are split into separate tasks. This separation has profound implications for the efficiency and effectiveness of PGE. It enables an equation form to be fully evaluated once, removing the duplication of effort that GP experiences when using genetic operators to optimize the coefficients of an equation form. Some GP implementations have incorporated the use of regression for fitting coefficients.
However, they make no use of memoization, and therefore still suffer from the duplication of effort which results from multiple copies of the same equation.

**Memoization**

Canonical forms for equations reduced one source of combinatorics in the SR problem. A second source is multiple derivations, that is, multiple paths over the graph which reach the same node. Below is a diagram which demonstrates this.
Figure # - Multiple Derivations
In this diagram, we start with models which consist of a single variable. Now we could take each of these models and add a different variable to it. Consider though that we could start with $$x$$ and add $$y$$, or start with $$y$$ and then add $$x$$, both arriving at the canonical form $$x+y$$. Repeat this process, adding a third variable, and we have six paths to the model $$x+y+z$$. It is important that we detect this situation, by determining that each of the six $$x+y+z$$ are really the same model. If we do not detect the separate models as one, we would end up repeating a lot of work. This repetition would become a combinatorial explosion of exponential proportions. PGE uses dynamic programming techniques to detect multiple derivations and consider an equation form only once. This feature of PGE is enabled by the Integer Prefix Trie and an integer linear representation for the equation models. The next section will fill in the details of both representation and memoization.
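The six-derivation example above can be sketched directly: every ordering in which the variables are added reaches the same canonical form, so memoizing that form collapses all six paths into one subproblem. The tuple encoding here is illustrative.

```python
from itertools import permutations

# Each permutation of (x, y, z) is one derivation order: a distinct path
# through the search graph that ends at the same model, x+y+z.
seen = set()
derivations = 0
for order in permutations(["x", "y", "z"]):
    # canonical form: sort the operands of the commutative "+" node
    model = ("+",) + tuple(sorted(order))
    derivations += 1
    seen.add(model)

assert derivations == 6    # six paths were taken...
assert len(seen) == 1      # ...but only one unique form needs evaluating
```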
### Components

1. **Representation** - Trie and linear representations with valueless parameters
1. **Evaluation** - Nonlinear regression for parameters, fitness metric for models
1. **Optimization** - Exploration, prioritization, and regularization
1. **Memoization** - Detecting overlapping subproblems to avoid duplicate work

PGE has four components which enable its search capabilities. The first of these is the set of generating functions which produce new equations from previous equations. The second is the Integer Prefix Tree which enables memoization of equations from their serialized representation. The third component is non-linear regression which separates form from optimization. The fourth component is the Pareto Priority Queue which gives order to the search space by prioritizing equations over one another.

#### Representation

PGE uses multiple representations for an equation. The syntax trie is used for optimization. The integer sequence is used for memoization.

**Syntax Trie**

PGE uses tries, in which operators can have a variable number of children. In the n-ary trie representation, the associative operators can have n sub-trees, flattening and reducing the tree's size. This is a slight modification from the usual binary tree, affecting only the associative operators addition and multiplication. The n-ary trie does not change the modeling ability of the equation, but will affect the trade-off between parsimony and accuracy. This in turn affects the selection operation of any SR implementation, though we do not investigate this issue here.

In addition to the reductions in parse tree size, the n-ary representation eases the design of sorting and simplification algorithms. These algorithms, detailed next, work within the parse tree to reshape equivalent trees into a canonical form. This effectively merges isomorphic equations, reducing the size of the search space, while simultaneously adding structure to the search space.
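A minimal sketch of such a canonicalization pass, assuming a nested-tuple trie and an illustrative integer ordering on node types (neither is PGE's exact encoding): nested associative operators are flattened into n-ary nodes, and commutative operands are sorted into a consistent order.

```python
# Illustrative node-type ordering; any fixed assignment of integers works.
ORDER = {"+": 0, "*": 1, "a": 10, "b": 11, "c": 12}

def canonical(node):
    """Return the canonical nested-tuple form of an expression trie."""
    if isinstance(node, str):              # leaf: a variable
        return node
    op = node[0]
    children = [canonical(ch) for ch in node[1:]]
    flat = []
    for ch in children:                    # flatten nested associative ops
        if isinstance(ch, tuple) and ch[0] == op:
            flat.extend(ch[1:])
        else:
            flat.append(ch)
    # sort commutative operands by their node-type integer
    flat.sort(key=lambda ch: ORDER[ch if isinstance(ch, str) else ch[0]])
    return (op,) + tuple(flat)

# two different binary-tree shapes of a*b*c collapse to one canonical trie
left = ("*", ("*", "a", "b"), "c")
right = ("*", "c", ("*", "b", "a"))
assert canonical(left) == canonical(right) == ("*", "a", "b", "c")
```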
**Linear integer sequence** is the prefix notation representation of an equation, mapped to the integers. It is equivalent to the trie representation, with a one-to-one correspondence. The linear integer sequence is obtained by mapping each node type to a different integer and then performing a pre-order traversal of the trie, collecting the integers by appending them to a list.

The usefulness of this representation is efficiency in implementation. Integers can be used for constant time lookups, where otherwise we may have been limited to a higher complexity operation. The linear representation is mainly used in conjunction with the Integer Prefix Trie (IPT) during the memoization process. The IPT is an efficient data structure for recording and looking up equations and will be detailed shortly. Additionally, the linear representation can be used for interprocess and network communication. Its main benefit there is on the receiving side, where we can avoid the complexity of parsing the human readable version of an equation.

**Abstract parameters** are the valueless placeholders used as coefficients in equations. They signify that a parameter is present and where in an equation it occurs. The benefit of abstract parameters is two-fold. First, equation form matching and memoization can happen without regard to their value. Second, they enable the optimization of the parameters to be performed in a separate process, via non-linear regression. This is the key to separating the optimization of a given form from the optimization which occurs at the global scale, in the space of all equations.

#### Evaluation

**Non-linear parameter fitting**, though not unique to PGE, or GP for that matter, is a central component of the PGE algorithm. It enables the separation of the search for form from the optimization of that form's parameters. Great gains in efficacy are realized by separating the search for equation form from the optimization of parameters.
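For a form that is linear in its abstract parameters, the fit is closed-form, the analogue of the one-step SVD path mentioned earlier; nonlinear forms would go through an iterative optimizer such as LM. The sketch below fits the form $$c_0 + c_1 x$$ by ordinary least squares; all names are illustrative.

```python
def fit_linear_form(xs, ys):
    """Assign values to the abstract parameters of y = c0 + c1*x.

    The form is linear in (c0, c1), so least squares has a closed-form
    solution: no iteration required.
    """
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    c1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    c0 = ybar - c1 * xbar
    return c0, c1

# data generated by y = 1 + 2*x: fitting recovers the parameter values
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
c0, c1 = fit_linear_form(xs, ys)
assert abs(c0 - 1.0) < 1e-9 and abs(c1 - 2.0) < 1e-9
```

The same form, with its two abstract parameters, is fit exactly once; only the resulting values and error feed back into the search.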
PGE achieves this separation by using abstract parameters. Parameters are indexed into an array, which means they do not store their value internally. The indexed parameters contribute only to form, taking on value only after non-linear fitting. Handling parameters in this manner is analogous to the way variables are treated, which contribute to fitness only during evaluation.

**Fitness metrics** are composite metrics which combine a notion of accuracy and complexity. They are used to compare the relative fitness between two models or equations. They are the multi-objective value which SR implementations seek to optimize. The most common fitness metric uses the tree size and the mean squared error. Many values for accuracy and complexity may be used, and more than one may be used at a time. Additionally, the values in the fitness metric may be weighted to give preference. Another useful practice is to normalize the values used in the fitness metric, across the set, before sorting into successive non-dominated sets. The Pareto non-dominated sorting methods, described in the previous chapter, use fitness metrics to determine the successive non-dominated sets of equations.

#### Optimization

SR is a multi-objective optimization task which seeks to maximize accuracy while minimizing complexity. To search over equation space, PGE must move within the space and decide where to move next. PGE uses exploration operators derived from the rules of the grammar. The exploration operators take one equation as input and output a set of equations which result from small edits. PGE uses a priority queue to determine which points in the search space to expand next. By incorporating the Pareto non-dominated sort into a priority queue, the PGE search can exploit and optimize the trade-off between competing objectives in a deterministic order.

**Exploration Operators** are the means by which PGE searches the space of all equations.
Each exploration operator is a policy for how to expand and modify an equation's parse tree to obtain functions 'close' to the current one.
Figure #: PGE Exploration Functions
{% highlight python linenos %}
def Grow(model):
    add_expands = add_extend(model)
    mul_expands = mul_extend(model)
    var_expands = var_substitution(model)
    return var_expands + add_expands + mul_expands

def add_extend(model):
    # E -> E + c*L
    # E -> E + c*N(L)
    ...

def mul_extend(model):
    # T -> T * L
    # T -> T * N(L)
    # T -> T / L
    # T -> T / N(L)
    ...

def var_substitution(model):
    # N -> ( N + c*L )
    # N -> ( N + c*N(L) )
    # N -> ( (c*L) / N )
    # N -> ( N / (c*L) )
    ...
{% endhighlight %}

PGE uses the grammar's production rules to expand previously encountered equations. New equations are produced by applying the function *Grow*, from Listing \ref{expand}, while recursing over the parse tree. The *Grow* function determines the current node's type and applies the appropriate production function. Each production function corresponds to one or more of the grammar's production rules. *AddTerm* increases the number of terms in a summation, such as $$aX + bY \Rightarrow aX + bY + cZ$$. *WidenTerm* increases the number of terms in a product, such as $$aXY^2 \Rightarrow aX^2Y^2$$ or $$aXY^2 \Rightarrow aXY^2Z$$. *DeepenTerm* increases the complexity of a term, such as $$aXY \Rightarrow a(X+bZ)Y$$ or $$aSin(X) \Rightarrow aSin(X+Y)$$.

The generating functions, in conjunction with the supplied building blocks (terminals and non-terminals of the grammar), define the set of reachable expressions. However, generating functions do not determine the order in which the space is searched. This is the role of the PPQ (section \ref{sec-pge-ppq}), which prioritizes the areas to explore next.

**Prioritization Strategies**

The Pareto Priority Queue (PPQ) is the deterministic mechanism for controlling the search direction of PGE. The PPQ replaces selection for mating with prioritized equations for expansion (selection for survival is unnecessary since all equations are remembered). The PPQ is what its name implies, a priority queue built on Pareto non-dominated sorting.
The PPQ prioritizes expansion towards those branches of the search space which best balance size and accuracy. The PPQ removes the need for generations, survivors, and mating pools, storing all explored equations which have not been expanded. Our experimentation has not shown this to be prohibitive. Consumed memory never exceeded 500Mb, even in the face of hundreds of thousands of unique equations.

To construct the PPQ, successive Pareto frontiers are appended onto a linear structure. Thus, the smallest equation from the first Pareto front is at the head of the queue. The next elements are the remainder of the first Pareto front in increasing size. The remaining Pareto fronts follow a similar pattern. Priority is first given to the Pareto frontier and then to size within the frontier. This is the same as the Pareto-sorted array that results from the GP Pareto sort during selection.

**Regularization** is the process of introducing a penalty for complexity in order to avoid overfitting. In PGE and GP, this is an implicit process which is part of the Pareto non-dominated sort. PGE uses the PPQ and a pop count to control the amount of regularization. During processing, we remove the top $$p$$ equations (3 in our experiments) when selecting the next areas to search.
By doing so, we select the $$p$$ smallest equations from the first Pareto frontier. This gives variation across the trade-offs for the equations to be expanded, and thus variation in the search direction too. If we only remove one equation, the smallest equations would be replaced from the next front and the search would progress through equation space by size. If we remove too many, then PGE starts to produce complex and overfit models.

#### Memoization

**Detecting overlapping subproblems** is the key to the dynamic programming approach taken by PGE. In PGE, a sub-problem is equivalent to a particular equation form. Sub-problems are encountered when an equation has several derivations, or paths over the search graph which reach the same node. This means there are multiple orderings of the production rules which result in the same equation, and that each equation may appear in more than one form, as several equivalent points within a grammar's representable space. The memoization of form allows PGE to consider a form just once.

PGE matches equations by comparing their serialized representations. Serialization transforms an equation into a sequence of integers by assigning a unique value to each node type. The resulting integer sequence is equivalent to the prefix notation of the equation. Since each sequence represents a unique equation, the parse tree is recoverable from the sequence. Also, since PGE uses abstract coefficients, they are all converted to the same value. This means PGE only memoizes their existence and location.

**The Integer Prefix Tree** (IPT) is at the core of the PGE dynamic programming framework. The IPT detects overlapping subproblems by comparing integer sequences, the serialized representation of an equation. Detecting overlapping subproblems enables us to eliminate duplication of effort in an exponentially growing search space.
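Serialization as described above can be sketched as a pre-order traversal with a node-type-to-integer table; every abstract coefficient maps to one shared code, so only its existence and position are recorded. The table and encoding below are illustrative, not PGE's actual assignment.

```python
# Illustrative node-type codes; "c" stands for any abstract parameter.
CODES = {"+": 1, "*": 2, "c": 3, "x": 4, "y": 5}

def serialize(node, out=None):
    """Pre-order traversal of a tuple trie into its integer sequence."""
    if out is None:
        out = []
    if isinstance(node, str):          # leaf: variable or abstract parameter
        out.append(CODES[node])
        return out
    out.append(CODES[node[0]])         # operator first (prefix notation)
    for child in node[1:]:
        serialize(child, out)
    return out

# c*x + y  ->  its linear integer sequence
model = ("+", ("*", "c", "x"), "y")
assert serialize(model) == [1, 2, 3, 4, 5]
```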
The IPT was inspired by the suffix tree algorithm for string matching [[Weiner:1973:LPM](), [Farach:1997:OST](), [Skiena:1998:ADM]()]. The suffix tree algorithm gives log-time lookup for the existence of a string in a piece of text. Similarly, the IPT provides log-time determination, w.r.t. the size of the equation alphabet, as to whether a subproblem has been encountered before. Since the ASCII alphabet is equivalent to the integers, in a mathematical mapping sense, we can use any sized alphabet, or equivalently, range of integers.

PGE uses the IPT memoization tree defined in Listing \ref{pretree}. The IPT is a recursive structure and insertion algorithm which memoizes a sequence of integers. The IPT is built up as new expressions are encountered. As each node in the IPT is visited, its associated counter is incremented. If the insertion algorithm has to create a new node, then the expression is also new, and the expression type is stored in that new node. New node creation is propagated back up the tree as the insertion function unwinds. The root IPT node's unique counter is incremented if a new insertion took place. Each insertion or lookup takes $$O(m \log n)$$ operations, where m is the length of the serialized sequence and n is the size of the alphabet. The depth of the IPT is equal to the length of the longest serial encountered.

The IPT also tracks the number of times each node in the structure has been visited. We keep counts for total visits and unique visits. This allows us to do some accounting of the number of unique equations, as well as how many times each equation is produced by the generation functions. There has been no previous research on the topic of unique equations in relation to the total number of equations created or evaluated. We believe this to be due to the lack of a method for matching equations, which is due to the lack of canonical forms in the literature. We use the IPT to compare the efficiency of GP and PGE in section \ref{sec-results-efficient}.
The IPT is also the enabling structure which conceptually turns the search space from a tree into a graph, by matching and memoizing subproblems, or equations. By eliminating duplication, this greatly reduces the amount of work that PGE must perform while searching for ideal equations. As mentioned earlier, the search space is also reduced in size by sorting equations into a canonical form and simplifying them with rewrite rules.
Figure #: Integer Prefix Tree
{% highlight python linenos %}
class Node:
    def __init__(self, key=0, value=None):
        self.map = {}    # int -> Node
        self.key = key
        self.value = value

    def insert(self, iis, value):
        if len(iis) > 1:
            ii = iis[0]
            if ii not in self.map:
                self.map[ii] = Node(key=ii)
            return self.map[ii].insert(iis[1:], value)
        if len(iis) == 1:
            ii = iis[0]
            if ii not in self.map:
                # a new node means a new, unseen expression
                self.map[ii] = Node(key=ii, value=value)
                return True
            else:
                return False

    def lookup(self, iis):
        if len(iis) > 1:
            ii = iis[0]
            if ii in self.map:
                return self.map[ii].lookup(iis[1:])
            else:
                return False, None
        if len(iis) == 1:
            ii = iis[0]
            if ii in self.map:
                return True, self.map[ii].value
            else:
                return False, None
{% endhighlight %}

### The Search Loop

PGE iteratively refines equations by applying the grammar's production rules recursively. PGE removes the top equations from the queue, prioritizing the first Pareto frontier. The top equations are then expanded according to the production rules. The resulting expanded equations are checked for uniqueness by the Integer Prefix Tree (IPT). The unique set of trees that remain are fit using non-linear regression, evaluated for fitness, and pushed into the Pareto Priority Queue (PPQ). The main PGE loop continues in this way until the stopping criteria are reached.
Figure # - PGE Flow Diagram
The PGE search proceeds as follows. Initialization generates a set of basis functions for each variable. These basis functions are then memoized by the IPT, fit to the training data and pushed into the PPQ. The main PGE loop iteratively pops $$p$$ equations from the top of the PPQ for processing. Each of these equations is expanded through recursive application of the generation functions. The resulting equations are memoized using the IPT. If the equation has been encountered before, it is discarded. The remaining unique equations are fit to the training data with non-linear regression and pushed into the PPQ. PGE continues until a model of desired accuracy is discovered or a computational threshold is reached. Theoretically, PGE could explore the entire space of representable equations given infinite space and unlimited time.
Figure #: PGE Search Loop
{% highlight python linenos %}
def loop(iterations):
    for I in range(iterations):
        popd = pop(p)

        # expand popd models, they are now fully processed
        expanded = expand(popd)

        # filter and memoize expanded models
        to_memo = filter(expanded)
        to_eval = memoize(to_memo)

        # fit and evaluate these models
        eval(to_eval)

        # push fully fit models into the queue
        push(to_eval)
{% endhighlight %}

The main PGE loop makes the following steps (line numbers): (3) $$p$$ equations are popped from the top of the PPQ. (6) The $$p$$ equations are expanded to create the more complex equations 'expanded' by applying the grammar's production rules recursively to an equation's parse tree. (9) Each equation $$e \in$$ 'expanded' is potentially filtered for excessive size or illegal operations. (10) The IPT uses an equation's serialized representation to check whether $$e$$ has been seen before. If it has, $$e$$ is discarded; otherwise $$e$$ is unique and (13) fit to the training data and (16) pushed into the PPQ.

Variations on the PGE algorithm can be created by altering how we initialize PGE and step through the grammar via its production rules. We can seed PGE with a richer set of basis functions, and even allow complete sub-trees to become building blocks. Further, by enriching the production rules applied at each node, we can enlarge the set of equations generated at each point in the search space. Variations on the PGE expansion algorithm can be created by parameterizing where and which generation functions are applied to a candidate. By modifying the productions applied at each node, we can reduce or increase the set of equations generated at each node of the tree, and thus each point in the search space.

In this work, we used three different expansion methods. The basic method (PGE-1) restricts the grammar by removing productions which result in non-distributed expressions like $$ax*(b+cx*(d+x^2))$$. It only uses *AddTerm* and *WidenTerm* during the recursive expansion.
The second method (PGE-2) removes the previously mentioned restriction, using all three generation functions from Listing \ref{expand}. This results in isomorphs which are too 'far apart' for the simplification rules to consider the same. Despite being equivalent, these isomorphic forms can produce very different offspring, and represent separate areas of the search space.

The third method (PGE-3) is FFX inspired, but starts with only a set of univariate bases. These bases are iteratively grown into a summation of multiplications by applying *AddTerm* and *WidenTerm*, creating candidates with more, and increasingly complex, bases. PGE-3 differs from the PGE-1 method by explicitly defining the root node to be addition and by placing stricter limitations on the complexity of each term in the summation.

### Limitations

During the development of PGE, we encountered some of the limitations of our current implementation and reasoning.

#### Exponential to |{Features}|

PGE does not handle dimensionality well. As the number of variables increases, so does the number of equations generated. Growth is exponential in the number of variables. This is because the production rules are applied to each node of the parse tree, expanding with all possible variables. Further, with more complex generator functions, equations which are mathematically the same are not equivalent under our current rewrite rules. The combination of these two effects compounds the issue of the exponential rate of equation generation. This fact may be only partially limiting, as the various forms of one equation represent different areas of the search space, and different forms are producible from these points in the space.

#### Algebra Policies

Expansion and simplification policies are difficult to establish and generalize. There is no free lunch. The policies determine the reachable search space and how local changes are made.
If expansion is unrestrained, the complexity grows quickly, often creating equations which have a simpler equivalent version. Additionally, the original implementation used a custom algebra system and lacked many features and capabilities. We advise moving to a mature Computer Algebra System (CAS), and will do so in the next chapter.

#### Bloat in PGE

Bloat in PGE occurs when good partial solutions dominate the selection pool and create an expansion cycle. Consider $$F(x)$$ as a good partial solution. As PGE makes all possible modifications to this expression, it will produce $$F(x) + c*g(x)$$ where either $$c$$ is very small or $$g(x)$$ produces near zero values. In both cases, bloat occurs because the extra terms of the expression have little effect on the fitness of the equation. Adding to this, each node in the bloated portion is expanded upon, further exacerbating the situation.

#### Similar Selection

Due to the bloat and all inclusive expansion, the bloated partial solutions can quickly dominate the selection pool. These similar solutions have similar fitness, causing the selection process to have difficulty distinguishing between them. They can create a situation where similar models are selected and expanded, with bloated terms and fitness incomparability, and PGE gets stuck in a local optimum.

#### Parameter Optimization

Parameter optimization is the portion of any SR implementation which takes the longest time. In many cases, the parameters are linearly dependent and we get an exact best solution for a model. In the nonlinear cases we may not, as nonlinear regression is not guaranteed to find an optimal solution. There are also questions as to where parameters should be included. If excessive parameters are added to a model, it can take a comparatively long time to fit, and in many cases may overfit to the noise in the data.

### Relation to Genetic Programming

Prioritized Enumeration is heavily influenced by GP.
It uses the same parse tree representation for candidate equations. PGE, however, differs in how it evaluates, selects, and generates new candidates from old ones. This section is meant to help those familiar with GP understand where and what the analogous processes are.

**Parameters**

PGE is a nearly parameter-free algorithm, requiring only that the building blocks, generating functions, the value of $p$, and termination criteria be established prior to beginning an equation search. Since building blocks are generally established by the benchmarks, and stopping criteria are relatively common across SR implementations, PGE has only two unique parameters: (1) $p$, the number of equations to remove from the PPQ, and (2) which generation functions to use. PGE has no populations, no generations or ages, and no random number generation. This is a substantial advantage over GP, which requires many parameters to be set, a notoriously difficult task, as was discussed in the previous chapter.

**Evaluations**

Evaluation is the most time consuming phase of the model discovery process. PGE performs two steps of evaluation: fitting coefficients and evaluating fitness. PGE uses non-linear regression to optimize the coefficients, which requires an undetermined number of evaluations on the derivatives of the equation. Once fit, PGE calculates fitness on the testing set. In GP, one evaluation per data point, per equation is required for both training and testing.
Regressing the coefficients has a profound effect on the ability of PGE to quickly traverse the search space. It allows for the separation of equation form and coefficient optimization into two sub-tasks of the overall search. In GP, the optimization of form is difficult and the most time consuming portion of the search process.

**Selection**

To prioritize the enumeration of expressions, we maintain what we will refer to as a Pareto Priority Queue (PPQ). The contents of the PPQ are the equations which have been regressed and evaluated, but have not yet been expanded. The PPQ is similar to a basic priority queue, in that we want to quickly prioritize the elements which have the highest priority. It differs in that we sort on two objectives, parsimony and accuracy, rather than one. When selecting the equations to expand next, we remove one or more elements from the front of the heap. Since the front of the heap is the first Pareto front, we prioritize the smallest equations first. The more equations we *peel* (remove) from the front of the heap, the more of the search space we consider concurrently. This enables the search to exploit equations that express different levels of the trade-off between parsimony and accuracy. If we select only one equation from the PPQ per iteration, the search degenerates into a search which progresses through expression size. This degenerate case will exhaust all possible equations of the current size before considering any larger equations.

**Population and Breeding**

In PGE, there is no need to maintain a diverse population in which survival and replication determine the outcome of the search. In fact, one of the goals of PGE is to process an expression form only once. This is managed by the IPT, which tracks the equations that have been evaluated and expanded. Equations which have been generated, regressed, and evaluated sit in the PPQ.
Once an equation has been removed from the PPQ and expanded upon, it requires no further consideration.
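The PPQ's peel operation can be sketched with models as (size, error) pairs. This is an illustrative assumption, not the actual PGE implementation; the function names and the quadratic front computation are hypothetical simplifications.

```python
# Hypothetical sketch of the Pareto Priority Queue (PPQ): models are
# (size, error) pairs, the first non-dominated front is computed, and the
# top `p` models are peeled for expansion, smallest first.

def dominates(a, b):
    """a dominates b if it is no worse in both objectives and better in one."""
    return (a[0] <= b[0] and a[1] <= b[1]) and (a[0] < b[0] or a[1] < b[1])

def peel(queue, p):
    """Remove and return up to p models from the first Pareto front,
    most parsimonious first."""
    front = [m for m in queue
             if not any(dominates(o, m) for o in queue if o is not m)]
    front.sort(key=lambda m: m[0])          # prioritize smaller equations
    chosen = front[:p]
    for m in chosen:
        queue.remove(m)
    return chosen

ppq = [(3, 0.9), (5, 0.4), (7, 0.1), (6, 0.5), (9, 0.05)]
best = peel(ppq, 2)   # -> [(3, 0.9), (5, 0.4)]
```

Peeling more than one model per iteration is what lets the search work several points of the parsimony-accuracy trade-off concurrently.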

## Related Work

### Genetic Programming

Genetic Programming (GP) is the original implementation of Symbolic Regression (SR). Traditionally, much of the literature has used the term Genetic Programming for both the problem and the implementation. Historically, the term Symbolic Regression has been used as a sub-problem of Genetic Programming, as the task of evolving equations, and often as an evolutionary method for finding equations. We want to distinguish the task of symbolically regressing a parse tree in a grammar from the method of realizing that tree; that is, to separate the objective from the optimization method. From here on, when we refer to Genetic Programming, we mean an evolutionary implementation of Symbolic Regression. When we use Symbolic Regression, we mean the generalized regression problem, limiting our examples to equation regression, which we consider a specific sub-problem of Symbolic Regression.

#### The Algorithm

1. Overview - Basics of Genetic Programming
2. Representation - Tree-based and beyond
3. Evaluation - Fitness measurement
4. Optimization - Genetic operators and selection
5. Initialization & Termination - It’s important

GP requires several components to be specified in its implementation. Representation and evaluation define the solution structure and how it is simulated to infer modeling accuracy. Initialization determines the search starting point and what portions of equation space are reachable. Selection and breeding policies determine which candidates are better and how the space of equations is searched. This section overviews each of these components in the GP algorithm.

##### Overview of GP

The most common implementation of SR has been the evolutionarily inspired method called Genetic Programming (GP) [ koza:1992:gen_prog ]. GP is part of the larger family of Genetic Algorithms (GA), a class of algorithms inspired by 'survival of the fittest' [ holland:1962:outline, goldberg:1988:genetic ]. GP differs from GAs in solution representation. In GAs, the solution representation is a fixed-size structure. In GP, solutions vary in size, usually represented as a parse tree within a grammar. While GP can be applied to generic languages, and was originally intended to evolve computer programs, we shall restrict our discussion to mathematical equations. This is in line with the general focus of the GP literature.

GP is a stochastic search heuristic over the syntax-tree representation of an equation. GP requires several components to be defined: representation, evaluation, selection, genetic operators, initialization & termination, and population management. Representation and evaluation, the first two components of an ML algorithm, were detailed previously. We will expand upon their usage in GP here. The third component, optimization, is the combination of selection, the genetic operators, and population management. Initialization and termination are required because the success of GP is sensitive to how they are determined.

The Basic GP Process begins with an initial, random population of models. Then each individual is evaluated to determine fitness. This is followed by a selection process for candidate survival and replication. Survivors are then recombined using methods which resemble sexual reproduction and mutation. GP continues this process for a number of generations, until a model of desired accuracy is discovered or a computational threshold is reached.
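The generational process just described can be sketched as a minimal loop. The operator interfaces (`select`, `crossover`, `mutate`) and the toy demo are illustrative assumptions, not any particular GP system:

```python
import random

def gp_search(init_pop, fitness, select, crossover, mutate,
              generations=50, target_error=1e-6, seed=0):
    """Minimal generational GP loop: evaluate, select, breed, repeat."""
    rng = random.Random(seed)
    pop = list(init_pop)
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in pop]
        best_err, best = min(scored, key=lambda s: s[0])
        if best_err <= target_error:        # desired accuracy reached
            return best
        nxt = [best]                        # simple elitism
        while len(nxt) < len(pop):
            a = select(scored, rng)
            b = select(scored, rng)
            child = mutate(crossover(a, b, rng), rng)
            nxt.append(child)
        pop = nxt
    return min(pop, key=fitness)

# toy demo: "equations" are plain numbers evolved toward the target 3
fit = lambda x: abs(x - 3)
tourney = lambda scored, rng: min(rng.sample(scored, 2))[1]
cross = lambda a, b, rng: (a + b) / 2
mut = lambda x, rng: x + rng.uniform(-0.5, 0.5)
result = gp_search([0.0, 10.0, -5.0, 6.0], fit, tourney, cross, mut)
```

With elitism, the best error is monotonically non-increasing across generations, which is the only property the demo relies on.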

For detailed overviews of tree-based Genetic Programming, see [ koza:1992:gen_prog, kouch:08:thesis ]. For a nearly complete list of all GP publications, visit GP-bib, maintained by Bill Langdon.

##### Representation in Genetic Programming

Section SR-components described the various schemes for representing equations.

GP uses the operators and operands of this framework to construct parse trees for equations. The parse trees are the chromosomes of an individual. This analogy is important to the methodology, which uses sexual reproduction, mutation, and natural selection to search. One note about representation: equation coefficients are real-valued numbers. As we will see, having continuous parameters, and optimizing them by evolution, makes the task much harder.

**Representation Closure**

In the basic GP algorithm, expressions are generated at random. This can result in invalid operations being evolved which violate our basic mathematical constraints. Closure is the property that an expression be valid mathematically. Invalid expressions include dividing by zero, taking the square root of a negative number, and violating other similar rules.

One simple remedy removes invalid operations and protects functions such as log; a related approach uses interval arithmetic [ keijzer:03:improving, kotanchek:2008:trustable ]. A second method is to place restrictions on the grammar, as in [ hoai:2001:framework, hoai:2002:solving, hoai:2003:tree ]. Hoai prevents invalid expressions from being generated, where the previous methods remove or guard them after the fact.

Populations are the pools of candidate equations GP has to work with. A population is essentially just a set of equations, with nomenclature reflecting the biological inspirations of GP. Basic GP uses a single population. Parents are selected from this population for breeding; children and parents compete for positions in the population between generations.

Schema Theory developed as a framework for talking about partial solutions in GP [ holland:1975:adaptation ]. Originally associated with Genetic Algorithms, it has been useful in discussions of solution propagation and convergence, as well as the disruptiveness of crossover in GP.

A schema is a template for an individual in which some parts are filled in and others are blanks, not yet determined. The Schema Theorem states that schemata which have fewer determined components and above-average fitness will increase exponentially in successive generations. This means that good schemata will propagate through the population.

##### Evaluation in Genetic Programming

The basic GP algorithm uses the evaluation methods outlined in SR-components. The fitness of an individual is then determined by a combination of accuracy and size. We will revisit evaluation in GP in the next section to discuss some variations.

##### Optimization in Genetic Programming

Optimization is performed over a series of iterations, known as generations in the GP scheme. Following the evolutionary motif, optimizing a population requires selection and replication. As the great MPUA says:

> The sole reason we exist is to survive and replicate. That's it. ~ Mystery

GP simulates the evolutionary process on equations. Replication uses crossover and mutation to create new individuals from existing ones. Selection uses the non-dominated sorting Pareto fronts described in Section SR-???. Here we will describe the particulars as they relate to GP.

Genetic operators are the means by which GP searches the space of equations. The search uses crossover, an exploratory technique, to find unexplored regions and mutation, an exploitative technique, to search locally. A balance between both is needed to make progress and converge towards good solutions.

Crossover is analogous to sexual reproduction and uses two parent equations to produce child offspring equations. To perform crossover, subtrees of the parents are selected, snipped, and swapped. This method is similar to crossing DNA, except that GP uses a branching representation rather than a linear one. The original crossover method selects subtrees of the parents at random. Enhancements on selecting crossover points will be described below.

Crossover is considered a destructive operation due to the significant probability that good schemata of the tree are disrupted. Disrupted in this context means that a partial solution, stored in a subtree, has an internal node selected as the crossover point. To combat this, i.e. to reduce the probability of disruption, candidate solutions grow extra 'genetic material' that has little value in the evaluation. This 'natural' growth, called bloat, is described below in the sections on limitations and enhancements.
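Subtree crossover can be sketched on a nested-tuple tree encoding. The representation and helper names here are illustrative assumptions rather than a standard library API:

```python
import random

# Trees as nested tuples: ('+', left, right) internal nodes, strings as leaves.

def nodes(t):
    """Enumerate all subtrees of t in preorder."""
    yield t
    if isinstance(t, tuple):
        for child in t[1:]:
            yield from nodes(child)

def replace(t, target, new):
    """Return a copy of t with the subtree `target` (by identity) replaced."""
    if t is target:
        return new
    if not isinstance(t, tuple):
        return t
    return (t[0],) + tuple(replace(c, target, new) for c in t[1:])

def crossover(a, b, rng):
    """Swap a uniformly chosen subtree of `a` for one of `b` (one child shown)."""
    snip_a = rng.choice(list(nodes(a)))
    snip_b = rng.choice(list(nodes(b)))
    return replace(a, snip_a, snip_b)

rng = random.Random(1)
child = crossover(('+', 'x', ('*', 'y', 'z')), ('-', 'a', 'b'), rng)
```

Note that `rng.choice` over all nodes is exactly the uniform selection probability discussed in the text: every node, including those inside good partial solutions, is equally likely to be the cut point.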

Mutation alters a single candidate equation to search locally around that candidate. Mutation simply selects a node of a tree and changes it to a different type. Examples of mutations are a change of variable from X to Y, or an operation from addition to subtraction. Mutation is also the way the original GP algorithm optimizes real-valued coefficients. If a coefficient is selected for mutation, it is scaled by a random value in some predefined range.
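Point mutation of a single node can be sketched as below; the operator and variable sets, and the scaling range, are illustrative assumptions:

```python
import random

OPS = ['+', '-', '*', '/']
VARS = ['x', 'y']

def mutate_node(node, rng):
    """Point-mutate one node: swap operator, swap variable, or scale a constant."""
    if isinstance(node, float):                 # coefficient: random rescale
        return node * rng.uniform(0.5, 2.0)
    if node in OPS:                             # operator: pick a different one
        return rng.choice([op for op in OPS if op != node])
    if node in VARS:                            # variable: pick a different one
        return rng.choice([v for v in VARS if v != node])
    return node

rng = random.Random(0)
mutate_node('+', rng)     # some operator other than '+'
mutate_node(1.5, rng)     # 1.5 scaled by a factor in [0.5, 2.0]
```

The coefficient branch shows why evolution is a weak coefficient optimizer: each change is an undirected random rescale rather than a gradient-informed step.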

Injection is a less used method in which a single candidate is altered by choosing a single point of change. It is similar to mutation, but instead of changing a node type, the subtree is trimmed and replaced by a freshly grown branch. In terms of exploratory and exploitative search qualities, injection lies between crossover and mutation.

Selection is the most important part of evolutionary algorithms, and thus of GP. It is therefore very important to choose the selection mechanisms wisely [ Dumitrescu:2000:EC ]. Selection's role is to differentiate between candidates, based on their quality, for survival and replication. This artificial environmental pressure causes natural selection, raising the fitness of the population. [ Dumitrescu:2000:EC, goldberg:1991:comparative ] extensively cover the variations in selection schemes. Here, we give an overview to show what is possible.

There are two points where selection is used, for survival and replication. Replication is the chance of having genetics passed to an offspring. This is how good partial solutions are passed through the population and altered to find more of the solution.

Parental selection can happen with or without replacement. Proportional selection favors better candidates by increasing the probability they will be selected. This can cause premature convergence, with a locally optimal solution permeating the population and wiping out diversity. Rank-based selection seeks to overcome this by associating a position, rather than a probability, with the chance of selection. Rank is often calculated from the dominates/dominated ratio of a candidate. Rank selection benefits GP by increasing selection pressure when the population variance is low.

Binary tournament [ goldberg:1991:comparative ] selects two individuals and competes them for one of the parental slots. This is done twice to generate both parents for a recombination. Binary tournament can be extended to compete N candidates against each other. Binary tournament is similar to selecting a pivot point in library implementations of quick-sort.
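A minimal sketch of binary tournament, assuming candidates are scored as (error, equation) pairs; the pool contents are hypothetical:

```python
import random

def binary_tournament(scored, rng):
    """Pick two candidates at random; the fitter (lower error) wins."""
    a, b = rng.sample(scored, 2)
    return a if a[0] <= b[0] else b

def select_parents(scored, rng):
    """Run the tournament twice, once per parental slot."""
    return binary_tournament(scored, rng), binary_tournament(scored, rng)

pool = [(0.9, 'e1'), (0.2, 'e2'), (0.5, 'e3'), (0.1, 'e4')]
mom, dad = select_parents(pool, random.Random(0))
```

Extending the sample size from 2 to N raises the selection pressure, since weaker candidates have more chances to lose.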

Survival is the chance of an individual solution remaining in the population. In the elitist method, a parent is only removed if one of its offspring is better than it. In the pure method, all parents and children contend for a spot in the population.

##### Initialization & Termination

Initialization determines the search starting point and what portions of equation space are reachable. GP uses random generation, typically with a uniform probability distribution, to create the initial population of candidate solutions.

There are two ways to grow a single parse tree. The first is to grow a full tree up to a specified depth. The second is to allow node selection to include variables, which causes generation to halt at unspecified points while traversing the tree, with the resultant tree having variable shape. To create the initial population, these two tree generation schemes can be used for the entire population, or for different percentages of the population. A method known as ramped half-and-half creates half the population with the full method and the other half with the variable grow method. A size-proportional set can be created by growing an increasing number of trees as the maximum depth increases. For example, if we limit candidates to binary trees, we grow twice as many trees at each depth as at the previous depth.
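The full and grow schemes, and a simple ramped half-and-half over them, can be sketched as follows. The operator/variable sets, the 0.3 early-leaf probability, and the depth-ramping rule are illustrative assumptions:

```python
import random

OPS = ['+', '*']
VARS = ['x', 'y']

def full(depth, rng):
    """Grow a full tree to exactly `depth`."""
    if depth == 0:
        return rng.choice(VARS)
    return (rng.choice(OPS), full(depth - 1, rng), full(depth - 1, rng))

def grow(depth, rng):
    """Grow a tree of variable shape: leaves may be chosen early."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(VARS)
    return (rng.choice(OPS), grow(depth - 1, rng), grow(depth - 1, rng))

def ramped_half_and_half(size, max_depth, rng):
    """Half the population via `full`, half via `grow`, over ramped depths."""
    pop = []
    for i in range(size):
        depth = 1 + i % max_depth
        pop.append(full(depth, rng) if i % 2 == 0 else grow(depth, rng))
    return pop

pop = ramped_half_and_half(8, 3, random.Random(0))
```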

Ensuring diversity of equations at the start is difficult. One method is to continually generate equations, eliminating duplicates and invalid expressions, until the population is filled. Another method is to run several short GP searches, leveraging GP's randomness (non-determinism), to obtain a pool of partially optimized candidates.

**Termination**

Since Genetic Programming is a heuristic search, it has no intrinsic stopping point. Therefore, we must tell it when to stop. The most common method used is generational, i.e. stopping upon reaching a predefined number of generations or iterations. This gives us a reasonable idea of how long a search will take, as well as providing a measurement of what can be solved in a given number of steps.

Another class is performance-based stopping criteria. To use these, an accuracy metric needs to be defined. One such metric is simply the best performing candidate's error. If the error is less than some defined threshold, then the search is terminated. This method, however, is sensitive to the scale of the data and the error measurement. Another performance-based method uses hits, the number of data points for which a candidate equation is within some threshold of the real value. In this case, the search terminates if any candidate scores a hit on all data points, or on some percentage of all the data points.

Other schemes attempt to measure the progress of the search. One such example is the number of generations since a new best solution has been found. If the search continues for a time without finding improvement, then termination occurs.
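The three classes of stopping rule (generational, performance-based, and progress-based) are often combined into a single check. A minimal sketch with hypothetical threshold names:

```python
def should_stop(generation, best_error, stagnant_gens, *,
                max_generations=500, error_threshold=1e-4, patience=50):
    """Combine generational, performance, and progress-based stopping rules."""
    if generation >= max_generations:        # generational limit
        return True
    if best_error <= error_threshold:        # performance threshold reached
        return True
    if stagnant_gens >= patience:            # no new best for a while
        return True
    return False

should_stop(10, 0.5, 3)      # -> False
should_stop(10, 1e-5, 3)     # -> True (accuracy reached)
```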

No matter what means is used to determine the stopping point, the results of the search need to be returned. Generally a set of the best solutions are returned, providing some amount of diversity. This set usually comes from the Pareto frontier, but can also include solutions which come from successive frontiers.

#### Limitations in GP

1. Disruptiveness of Crossover - the basic search operator has uniform selection probability. It is the reason for bloat and inefficiencies.
2. Equation Bloat - the growth in tree size to reduce probability of node selection. Bloat emerges to protect good partial solutions.
3. Equation Redundancy - results from multiple copies of the same equation. Compounded by commutative and associative properties.
4. Loss of Population Diversity - results when a partial solution dominates the population. Caused by bloat and good partial solutions.
5. Premature Convergence - results when a search converges upon a local optimum. Caused by loss of population diversity and the previous points.
6. Fundamental Issues in GP - topics which are inherent to the process or have been unaddressed by the GP community

The basic GP algorithm has many limitations stemming from its evaluation, selection, and breeding sub-algorithms. Many research papers in the field have sought to address these limitations and offer solutions. In this section we discuss these limitations, detailing where they come from. We relate the solutions which offer improvement to each limitation, and discuss their effect on the GP algorithm. The limitations, as well as the counters, are often intertwined between several aspects of the GP algorithm. For example, if the initial starting points do not give sufficient coverage of the search space, GP will not be able to reach areas of that space.

GP has issues which it inherits from evolution, which is good at dealing with uncertainty and variability, but is not the ideal method for honing in on the most fit individual. These issues stem from the difficulties of maintaining a diverse population of equations while converging towards the best equation. Representation, the genetic operators, selection mechanisms, and population structure are interwoven in complex ways. These difficulties, and advancements made in GP, have effects which permeate the entire GP infrastructure and operation. Measuring and understanding these effects requires rigorous testing and analysis, something lacking but being addressed [ McDermott:2012:benchmarks ].

##### Disruptiveness of Crossover

The basic version of crossover has uniform selection probability driven by random number generation. This is the main cause of bloat and inefficiencies. The diagram below shows the probability of selecting a node given the tree size.

Figure 4-1 - Probabilities on the Tree

As the partial solutions converge towards the real solution, an increasingly larger percentage of the tree structure is correct as is. However, as the percentage of correct structure increases, so does its chance of being selected for crossover. Thus the probability of disrupting the correct portion of a good partial solution increases [ Whitley:1994:tutorial, Poli:1998:ec ].

To negate this issue, the internal GP process grows large subtrees with little effective output. That is, they return values close to zero and do not change the output of the model significantly. By having large subtrees, the probabilities of selection change, and reduce the likelihood of disrupting the good partial solution in a different branch of the tree. This extra no-op material is referred to as bloat.

##### Equation Bloat

Equation bloat is the growth of average candidate size over the course of evolution. Bloat is inherent in any variable-length representation [ langdon:1998:bloat, langdon:2000:quadratic ]. The general consensus for the cause of bloat is the disruptiveness of crossover, though [ banzhaf:2000:bloat ] speculated that fitness causes bloat.

Parsimony pressure is meant to control bloat, but the disruptiveness of crossover, more often than not, overpowers the Pareto front’s benefits. The next section discusses several schemes which attempt to mitigate bloat and the disruption of partial solutions.

##### Equation Redundancy

Bloat causes partial solutions to dominate the population as GP attempts to optimize the best partial solution’s coefficients. These equations are either disrupted partial solutions from crossover or the same solution differing only in the value of one coefficient. The net result is that, as the GP search progresses, many copies of the same equation are generated. This in turn, results in a lot of time being wasted testing equations that are minimally different from others.

Equation redundancy is further compounded by the commutative and associative properties that allow for the tree combinatorics discussed in the last chapter. To our knowledge, this issue still remains unaddressed in the GP community.

##### Loss of Population Diversity

A major issue through the course of evolution is maintaining the proper diversity within the population of candidate solutions. One can often find a good partial solution spreading quickly through the population and destroying diversity. Generally speaking, diversity should be greater in the early stages and reduce as time progresses. If we allow too much diversity, the algorithm cannot refine the candidates and fails to converge to any optimum. If we force too little diversity, the algorithm converges too quickly and doesn't thoroughly search the space.

Population diversity is difficult to diagnose and address. There are many factors which affect population diversity through the course of evolution: the population size, the mating pool size, elitism, and the survival scheme. The difficulty in diagnosis is likely due to the lack of a good measure. The next chapter will introduce a structure for comparing unique to total equation forms, which could be useful to GP for measuring diversity.

##### Premature Convergence

Premature convergence is the result of a loss of diversity within the population. As diversity is reduced, it becomes difficult for the search to move away from a solution which has dominated the population. The result is convergence upon a local optimum [ Srinivas:1994:probabilities, Baker:1985:adaptive ].

Premature convergence is the final result of the aforementioned limitations. It is affected by all components of the GP process and has become the term describing the event, in GP, when the algorithm is stuck on the wrong answer. Generally, premature convergence is remedied by improving the individual components.

##### Fundamental Issues

Several issues are inherent to the GP algorithm. They stem from the heavy use of random number generation and the use of these random numbers for making choices. This affects the control flow through almost all aspects of a GP implementation. It is our belief that wherever GP uses a random number to make a decision, a well defined and informed choice can instead be made. Another effect of random choice making is the loss of structural information about the search space. Randomized exploration operators lack information about equation relationships in the larger search space. The next chapter will show how PGE maintains and takes advantage of the search space structure. A final result of using random number generators is that the results of a GP search are different with each invocation. GP usually mitigates this by running the search several times to gain a consensus. A consistent random seed could be used, however this would only fixate the random choices made.

Other issues in GP remain completely unaddressed by the community: redundant candidates due to associativity and commutativity in algebra, as well as the amount of diversity and redundant effort. Both seem to be due to the lack of a detection or measurement technique.

More concerning is the lack of good research practice within a significant portion of the GP community. An effort was made to shed light on this issue and start a conversation to improve the situation. In brief, there are inconsistencies in benchmarking, metrics, and cross-comparisons between implementations. Further, the details reported in the literature are insufficient to reproduce the work or compare results in a meaningful way.


### Genetic Programming Enhancements

1. Selection Improvements - Parent recombination and archives.
2. Variations on crossover - Grammar guided and probabilistic.
3. The Island Model - Run several GP instances with message passing.
4. Co-evolutionary GP - Evolve fitness predictors in parallel with equations.
5. Hybrid Algorithms - GP algorithms with non-GP parts

Since Koza’s initial formulation of GP and SR, there has been a plethora of research, from generalized implementation enhancements to improvements addressing the issues of disruptiveness of crossover, bloat, population diversity, and premature convergence, among many others. We cover material which has relevance to our discussion, though it will necessarily be abridged due to space limitations.

#### Selection

**Brood Selection**

Brood selection is a technique to create better offspring from the crossover operation. Rather than each set of parents producing only two children, each set of parents is crossed multiple times, producing many children. The rationale is that by choosing multiple crossover points, we gain a better chance of selecting a good one [ Tackett:1994:brood, Fogel:1997:handbook ].

Brood selection occurs in two phases: offspring creation and survival selection. The first phase happens during crossover. Brood selection applies crossover on two parents N times, producing 2N offspring. In the second phase, during selection, the brood of candidates from two parents is pre-filtered to select the best offspring. This best child then becomes part of the usual survival and replication selection phases.
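The two-phase scheme can be sketched directly. The toy encoding (numbers instead of trees, noisy averaging for crossover) is an illustrative assumption:

```python
import random

def brood_crossover(a, b, crossover, fitness, n, rng):
    """Cross the same parents n times (2n offspring) and keep the best child."""
    brood = []
    for _ in range(n):
        brood.append(crossover(a, b, rng))
        brood.append(crossover(b, a, rng))
    return min(brood, key=fitness)

# toy demo: "equations" are numbers, crossover averages with noise
cross = lambda a, b, rng: (a + b) / 2 + rng.uniform(-1, 1)
best_child = brood_crossover(0.0, 10.0, cross, lambda x: abs(x - 5), 4,
                             random.Random(0))
```

Only `best_child` re-enters the population; the rest of the brood is discarded, which is what filters out badly chosen crossover points.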

**Pareto Front**

The Pareto front is a central component of the SR framework. The Pareto non-dominated sort, or Pareto front, balances the trade-off between opposing goals in multi-objective optimization [ fonseca:1993:genetic, horn:1994:niched, van:1998:evolutionary, luke:2002:lexicographic, smits:2005:pareto ]. Various methods improve the diversity along the Pareto front and/or maintain archives of good solutions: NSGA [ srinivas:1994:muiltiobjective ], NSGA-II [ deb:2000:fast ], SPEA [ zitzler:1998:evolutionary ], SPEA2 [ zitzler:2001:spea2 ], SPEA2+ [ kim:2004:spea2 ], PAES [ knowles:1999:pareto ], PESA [ corne:2000:pareto ], and PESA-II [ corne:2001:pesa ] are all algorithms which aim to improve the quality of solutions. SPEA, PESA, and MOGA aim to improve diversity by accounting for density along the front. SPEA2, NSGA, and NSGA-II use an archive to maintain the population of good solutions. NSGA-II and SPEA2+ improve the time complexity of the Pareto sort itself. The overall goal is to include individuals near the front and reward uniqueness and diversity among the candidate solutions.
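The underlying non-dominated sort can be sketched by repeatedly peeling fronts from (size, error) points; this quadratic version is a simplification, not the fast sort of NSGA-II:

```python
def pareto_fronts(points):
    """Peel successive non-dominated fronts from (size, error) points."""
    def dominated(p, pool):
        return any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in pool)
    remaining, fronts = list(points), []
    while remaining:
        front = [p for p in remaining if not dominated(p, remaining)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts

fronts = pareto_fronts([(1, 0.9), (2, 0.5), (3, 0.6), (4, 0.1)])
# -> [[(1, 0.9), (2, 0.5), (4, 0.1)], [(3, 0.6)]]
```

Successive fronts give the rank ordering that archive-based schemes like SPEA2 and NSGA-II build their selection pressure on.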

#### Variations on crossover

Stepwise Adaptive Weights (SAW) alters the fitness function with information gained during the search process [ eggermont:2000:stepwise ]. SAW weights each sample point, periodically updating the weights using the error of the best candidate solution. If the informing candidate has non-zero error, then either a constant is added to the weight (CSAW), or a precision value is added (PSAW). By varying the sample weights, SAW accentuates points which are not modeled well. This exerts pressure on the evolutionary process to produce better models geared towards explaining the difficult points.
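The CSAW-style update can be sketched in a few lines; the function name and the unit starting weights are illustrative assumptions:

```python
def saw_update(weights, errors, delta=1.0):
    """Stepwise Adaptive Weights (constant variant): bump the weight of each
    sample the current best candidate still gets wrong."""
    return [w + delta if e > 0 else w
            for w, e in zip(weights, errors)]

w = saw_update([1.0, 1.0, 1.0], [0.0, 0.3, 0.0])
# -> [1.0, 2.0, 1.0]: the poorly modeled point gains influence
```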

Tree-Adjunct Grammar Guided Genetic Programming (TAG3P) uses a grammar to enforce syntactical constraints on a solution [ hoai:2001:framework, hoai:2002:solving, hoai:2003:tree ]. TAG3P is basically the same as GP, having the same representation and components. However, TAG3P differs in a couple of ways. TAG3P only allows crossover points where the non-terminals are the same. A similar constraint is enforced during mutation: only subtrees with the same root node can be substituted into the parse tree. TAG3P can also incorporate the use of tree rewriting systems [ Joshi:1975:TAG ]. By producing only syntactically valid solutions, TAG3P biases the search towards those solutions [ hoai:2001:framework, hoai:2002:solving ].

At the individual genome level, changes to representation, crossover, and coefficient optimization have been explored. Tree-based representation is standard; others include linear, grammar-based [ mckay:2010:grammar ], and graph or Cartesian GP [ miller:2000:cartesian ]. Invalid mathematical operations are removed in [ keijzer:03:improving, kotanchek:2008:trustable ]. Context-aware crossover [ majeed:2006:using ] selects a snip in one tree and substitutes it at all valid locations in the other parent. [ korns:2008:large ] extends this to include all possible valid snip replacements from both parents. Semantically Aware Crossover (SAC) [ nguyen:2009:semantic ] biases crossover to exchange semantically different material, and Semantic Similarity-based Crossover (SSC) [ uy:2011:semantically ] extends this by limiting the size of material to small, manageable snips.

#### The Island Model

The island model [ cohoon:1987:punctuated ] splits a unified population into several concurrent GP searches called islands. The interconnections between these islands form lines of communication. At one extreme there is no connection, and each island is a completely separate GP search with a smaller population. The usual scheme of communication is called migration: at regular intervals, an island informs its neighbors of its search progress, usually by sending several good candidates to be included in the neighbor's equation pool. Connection settings vary from a ring, to fully connected, to random connections, and several other variants [ alba:2000:influence ]. In [ lassig:2010:benefit ], a proof is given that islands are needed to achieve polynomial-time convergence. Without communication, the islands become basic GP searches and do not receive the added benefit of information sharing.

Using the island model with migration, several topologies of interaction arise. These topologies are undirected graphs. At the extremes, the ring topology connects each island to one neighbor on each side, and the full topology connects each island to every other island. Other topologies connect to N random neighbors, bringing the ring topology closer to a fully connected graph. Using the island model also permits easy GP parallelization [ alba:2002:parallelism ]. [ alba:2001:analyzing ] analyzed synchronous and asynchronous parallel distributed genetic algorithms and found no meaningful difference in ability. In [ alba:2002:super-linear ], the authors claim super-linear speed-up by running several GP searches in parallel. However, they incorporated information sharing between the processes, which alters the algorithm's behavior so much that it should really be considered a different implementation.
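A ring-topology migration step can be sketched as below. Representing each island's population as a list of error values, and the replace-worst absorption policy, are illustrative assumptions:

```python
def migrate(islands, k=1):
    """Ring topology: each island sends its k best (lowest-error) candidates
    to its neighbor, which absorbs them by replacing its worst."""
    n = len(islands)
    emigrants = [sorted(pop)[:k] for pop in islands]   # snapshot before moving
    for i in range(n):
        dest = islands[(i + 1) % n]
        dest.sort()
        dest[-k:] = emigrants[i]           # replace worst with neighbor's best
    return islands

isles = [[0.1, 0.5, 0.9], [0.4, 0.6, 0.8], [0.2, 0.3, 0.7]]
migrate(isles)
```

Changing the destination rule (e.g. to N random neighbors, or all islands) moves the topology between the ring and fully connected extremes described above.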

#### Co-evolution

Co-evolution is the simultaneous evolution of separate species. Using co-evolution has several consequences for the Symbolic Regression algorithm. First, and most directly, evaluating equations on only a subset of the data alleviates a large portion of the computational burden. Second, co-evolution naturally adapts and maintains pressure on both populations. Pressure is maintained by exchanging some of the best members of each population, which is the third consequence of co-evolution. At regular intervals, a best and diversified set of equations is shared with the subset evolution process, and similarly current best subsets are shared with the equation islands. A fourth consequence of co-evolution is that the subsets are free to emphasize different areas of the data set to different extents. This allows the search to focus on different areas of the observation space at various times in the search. It also permits important points to show up more often. [hod:09:implicit_long] showed that points where the rate of change in the data was greater had a greater ability to differentiate between solutions.

Co-evolution adds a multitude of new parameters, more than doubling the number in vanilla Symbolic Regression. This is due to the nearly equal number of parameters for the subset evolution process, plus the extra parameters for describing the interaction. Despite the large increase in parameter space, the benefits from reduced computational effort and increased search ability justify using co-evolution [packard:1988:adaptation, kaufman:1991:coevo, paige:2002:compare].

#### Hybrid Algorithms

Due to issues with benchmarking SR algorithms [McDermott:2012:benchmarks], as well as generally poor accuracy from state-of-the-art GP algorithms [korns:2011:accuracy], several recent works have proposed hybrid or alternative techniques for SR. [topchy:2001:faster, raidl:1998:hybrid, eberhart:2001:swarm] use hybrid approaches to speed equation parameter fitting. Abstract Expression Grammar (AEG) [korns:2011:abstract] uses several concurrent GP searches with abstract place-holders. This enables the optimization methods to focus on smaller areas of the search space in parallel. Fast Function eXtraction (FFX) [McConaghy:2011:FFX] is a non-GP, deterministic algorithm for SR. In FFX, the model is a Generalized Linear Model, and pathwise regularized learning [zou:2005:regularization, friedman:2010:regularization] is used to optimize model parameters. In [bongard:2013:improving], a hybrid GP-FFX algorithm is proposed, where FFX is used for feature selection prior to a GP search. Their approach was shown to be more effective than either one alone. Real-valued coefficient optimization, a known difficulty for GP, has been improved by local gradient search, non-linear regression, and swarm intelligence [topchy:2001:faster, raidl:1998:hybrid, eberhart:2001:swarm].

The classical GP algorithm naturally fits both functional and data parallelism, and as such, there exists a significant amount of literature on the subject. The first investigation was done by the originator of GP, Koza [koza:1995:parallel]. The effects of (a)synchronous execution in distributed GP were investigated in [alba:2001:analyzing], concluding that asynchronous execution achieved equivalent or better results in less time. [alba:2000:influence] investigates the influence of migration policy on distributed GAs, and [lassig:2010:benefit] proves that migration is required for convergence. In [hodjat:2014:maintenance], the effects on, and maintenance of, long-running GP systems are investigated. Work towards implementing GP systems on cloud providers is explored in [veeramachaneni:2013:learning, derby:2013:cloud]. Other research [langdon:2010:large, harding:2011:implementing, augusto:2013:accelerated] has focused on utilizing the GPU to speed up evaluation. Several advanced GP systems have also emerged as the field has matured, including: AEG [korns:2011:abstract], Ec-Star [oreilly:2013:ecstar], Eureqa [hod:09:science], and FlexGP [derby:2013:flexgp].

In addition to research on scaling GP for BigData, many works have investigated the application of GP-based SR to real-world problems. Here we give an overview of GP application to differential equations and dynamical systems, but the body of literature is filled with applications to many fields. General applications to dynamical systems are investigated in [lay:1994:application, cao:2000:evolutionary, tsoulos:2006:solving, hod:08:mining, iba:2008:inference, seaton:2010:analytic]. Specific applications to metabolic networks can be found in [cho:2006:identification, schmidt:2011:automated] and to genetics in [sakamoto:2001:inferring, qian:2008:inference]. In [hod:07:pnas], it was shown that the equations of a dynamical system could be searched in parallel by using a partitioned evaluation scheme. Dynamical systems with multiple time-scales, whose signals are composed of simpler signals operating at different time-scales, are investigated in [cornforth:2012:symbolic]. [cornforth:2013:inference] showed that equations for unobserved variables in dynamical systems can be recovered in some instances. Invariants, or conserved quantities, were shown recoverable in [hod:09:implicit_long, hod:09:science]. Hybrid dynamical systems, which feature both discrete and continuous components, were shown tractable in [ly:2012:learning]. Many of these works use simulated data from known systems. There are, however, many references to the use of GP on real-world problems and data as well.

### Fast Function eXtraction

Fast Function eXtraction (FFX) is the first deterministic SR implementation [ McConaghy:http:FFX, McConaghy:2011:GPTP, McConaghy:2011:CICC ]. FFX does not use genetic operators, tree-based representations, or random number generation. Instead, FFX uses a Generalized Linear Model (GLM) and a set of basis functions derived from the input features. FFX then applies a series of regressions to fit the parameters of the basis functions to a desired complexity. This makes FFX very fast, but also limits its flexibility and the complexity of solutions.

#### Algorithm

FFX uses a Generalized Linear Model (GLM) of the form:

$$\hat{y}(\vec{x}) = a_0 + \sum_{b=1}^{B} a_b f_b(\vec{x})$$

The GLM is a flexible version of ordinary linear regression [ McCullagh:1972:Paper, McCullagh:1989:Book ]. The $f_b(\vec{x})$ are not required to be linear functions themselves; rather, the model is linear in the coefficients of the terms of the summation. In other words, there are no coefficients inside any of the $f_b(\vec{x})$.
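A minimal sketch of this linearity. The bases $1$, $x^2$, and $\log(x)$ below are hypothetical choices, not FFX's; because the coefficients enter linearly, ordinary least squares via the normal equations $A^T A w = A^T y$ recovers them even though the bases are nonlinear in $x$:

```python
import math

# Hypothetical model, linear in (a0, a1, a2) but nonlinear in x:
#   y_hat = a0*1 + a1*x^2 + a2*log(x)
bases = [lambda x: 1.0, lambda x: x * x, lambda x: math.log(x)]

def fit_glm(xs, ys):
    """Ordinary least squares via the normal equations A^T A w = A^T y."""
    n = len(bases)
    A = [[f(x) for f in bases] for x in xs]
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    Aty = [sum(row[i] * y for row, y in zip(A, ys)) for i in range(n)]
    # Gauss-Jordan elimination on the small augmented system
    M = [AtA[i] + [Aty[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - fb * f for a, fb in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [2.0 + 3.0 * x * x - 1.5 * math.log(x) for x in xs]  # true coefficients (2, 3, -1.5)
w = fit_glm(xs, ys)
```

The bases are fixed before fitting; all of the learning happens in the linear coefficient vector, which is what keeps the GLM fast to fit.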

To learn the coefficients of the GLM, FFX uses Pathwise Regularized Learning (PRL). PRL augments least squares by adding regularization terms, then sweeps across the regularization weights and returns multiple parameter vectors [ ZouHastie:2005:Paper, Friedman:2010:Paper ]. PRL also has some interesting properties:

1. Learning speed is comparable or better than least squares
2. Can learn when there are fewer samples than coefficients
3. Can learn thousands or more coefficients
4. Sets of coefficient vectors are returned, trading off the number of coefficients against training accuracy
Figure #: FFX Algorithm

Following the pseudocode, FFX first creates a set of basis functions for the GLM. To do this, univariate bases are created from each variable using the operations ($x^{\pm 0.5}, x^{\pm 1}, x^{\pm2}, abs(x),log(x)$). In their example this produced 176 bases. Next, the univariate bases were combined to produce 3374 bivariate bases, resulting in 3550 total bases. By allowing bases to appear in both the numerator and the denominator, the overall number of bases doubles to 7100.
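The combinatorial growth of the basis set can be sketched as follows. The operator list is paraphrased from the text, and FFX additionally prunes invalid or constant bases (which is why its counts of 176/3374/7100 for 13 variables do not follow directly from this arithmetic); the sketch only shows the shape of the blow-up:

```python
import itertools

# one template per unary operation applied to a variable
ops = ["{}^0.5", "{}^-0.5", "{}", "{}^-1", "{}^2", "{}^-2", "abs({})", "log({})"]

def univariate_bases(variables):
    # one basis per (variable, operator) pair, e.g. "x0^2", "log(x1)"
    return [op.format(v) for v in variables for op in ops]

def bivariate_bases(uni):
    # pairwise products of distinct univariate bases
    return ["{}*{}".format(a, b) for a, b in itertools.combinations(uni, 2)]

variables = ["x{}".format(i) for i in range(3)]
uni = univariate_bases(variables)           # 3 vars * 8 ops = 24 bases
all_bases = uni + bivariate_bases(uni)      # 24 + C(24, 2) = 300 bases
with_denominators = 2 * len(all_bases)      # each basis in numerator or denominator
```

Even for three variables the pairwise step dominates; the quadratic growth in the number of univariate bases is what makes the 13-variable case balloon into the thousands.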

Then, for all $b$ from $1 \rightarrow B$, FFX derives a linear combination of $b$ basis functions. To learn a model, FFX applies pathwise regularized learning to fit the GLM coefficients. FFX then adjusts the $\alpha$ value to change the emphasis in the pathwise learning. This has the effect of alternating between increasing and decreasing the number of bases, continuing until the number of basis functions equals the desired model complexity. FFX repeats this process for a number of desired complexities so that a variety of models is returned. This enables behavior similar to GP returning the first Pareto frontier.

#### Limitations

FFX works well for many problems, requiring far fewer computations than GP. FFX, however, has its own limitations:

(1) There are no coefficients or parameters within the bases, meaning more difficult, non-linear relationships are beyond its abilities. This issue could be addressed by using non-linear regression and abstract coefficients.

(2) Equations such as $v = xyz$ are beyond the abilities of FFX. Individual terms of the summation are limited in complexity to a pairwise combination of the univariate and bivariate bases determined at initialization. Seeding with more basis functions could become prohibitive as the number of terms grows through pairwise basis combination. In the 13-variable example provided, the initial number of GLM basis functions was 7100.

As FFX is incapable of finding many of the benchmarks, we did not compare against it.

### Regression Techniques

There are many methods for regression, which generally attempt to explain the relationship between variables. Here, we highlight several which are common or relevant to the subject matter.

#### Linear Regression

Linear regression is a technique for modelling a dependent variable from a set of independent variables and unknown parameters. The most common method for determining the parameters is least squares, which minimizes the squared residuals [Legrand:1805:orbites_des_comietes]. The key is that the dependent variable is linear in the parameters of the basis terms constructed from the independent variables. Nonlinear functions may be used, so long as the parameters remain linear.
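For the simple line $y = a + bx$, minimizing the squared residuals has a familiar closed form, sketched here on illustrative data:

```python
def least_squares_line(xs, ys):
    """Closed-form least squares for y = a + b*x (minimizes squared residuals)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x); intercept from the means
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

a, b = least_squares_line([0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0])  # data on y = 1 + 2x
```

Replacing `x` with any fixed nonlinear basis (say $x^2$ or $\log x$) leaves this machinery untouched, which is exactly the "linear in the parameters" point above.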

#### Nonlinear Regression

Nonlinear regression is an extension to the linear variation where the independent variables can now have more complex, nonlinear relationships to the parameters and dependent variable. Unlike linear regression, no guarantee can be made about the optimality of the final parameter estimate. Most methods for estimating the parameters use successive iterations which improve upon the accuracy. Significant algorithms include Gauss-Newton, Levenberg-Marquardt, and gradient descent [ Fletcher:2000:Book, Madsen:2004:Methods].

PGE uses the Levenberg-Marquardt (LM) algorithm, an iterative technique that finds a local minimum of a function expressed as a sum of squares of nonlinear functions. It has become a standard technique for nonlinear least-squares problems and can be thought of as a combination of steepest descent and the Gauss-Newton method. When the current solution is far from the correct one, the algorithm behaves like a steepest descent method: slow, but guaranteed to converge. When the current solution is close to the correct solution, it becomes a Gauss-Newton method [lourakis04LM].
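A minimal one-parameter LM sketch makes the damping behavior concrete. The model $y = e^{ax}$ and data are illustrative only (PGE itself relies on a full LM library); the damping factor `lam` interpolates between gradient descent (large `lam`) and Gauss-Newton (small `lam`):

```python
import math

def lm_fit_scalar(f, df, xs, ys, a0, lam=1e-3, iters=50):
    """Levenberg-Marquardt for a one-parameter model f(x, a):
    solve (J^T J + lam) * delta = -J^T r at each step."""
    def cost(a):
        return sum((f(x, a) - y) ** 2 for x, y in zip(xs, ys))
    a, c = a0, cost(a0)
    for _ in range(iters):
        r = [f(x, a) - y for x, y in zip(xs, ys)]   # residuals
        J = [df(x, a) for x in xs]                  # Jacobian d f / d a
        JtJ = sum(j * j for j in J)
        Jtr = sum(j * ri for j, ri in zip(J, r))
        delta = -Jtr / (JtJ + lam)                  # damped Gauss-Newton step
        c_new = cost(a + delta)
        if c_new < c:                 # accept: trust the quadratic model more
            a, c, lam = a + delta, c_new, lam / 10
        else:                         # reject: fall back toward steepest descent
            lam *= 10
    return a

# fit y = exp(a*x) on noiseless data with a_true = 0.5
xs = [i * 0.5 for i in range(11)]
ys = [math.exp(0.5 * x) for x in xs]
a_hat = lm_fit_scalar(lambda x, a: math.exp(a * x),
                      lambda x, a: x * math.exp(a * x),
                      xs, ys, a0=0.1)
```

The accept/reject loop is the essence of LM: rejected steps inflate the damping until a safe downhill step is found, while accepted steps deflate it so the iteration finishes with fast Gauss-Newton convergence.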

#### Support Vector Regression

Support Vector Regression Machines (SVRM) were proposed as a variation on the original Support Vector Machine (SVM) [ Vapnik:1996:SVRM ]. In SVM, a classification boundary is sought which maximizes the margin between the classes. The basic idea of SVRM is to flip the SVM problem around and find a function whose deviation from the training data is minimized. SVM and SVRM work for both linear and nonlinear models, and both lead to a convex optimization problem [ Smola:1996:Tutorial ].

#### Ridge, Lasso, Elastic Net

Ridge regression is a regularization method for ordinary least squares. It uses the L2-norm of the parameter vector, imposing a penalty on excessively large parameters. Often a scaling term is used to control the strength of the penalty [Tikhonov:1943:stability].

Lasso regression is similar to Ridge as a regularization method on ordinary least squares. Lasso uses the L1-norm, which causes terms to have zero valued parameters, thus effectively removing them from the model [ Tibshirani:1996:lasso, stanford:http:lasso ].

Elastic Net is the linear combination of Ridge and Lasso techniques [Zou:2005:elastic]. This method uses the offsetting features of both methods to find a sparse model with few non-zero terms.
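The shared shrinkage mechanism behind these methods is easiest to see in one dimension, where ridge regression has a closed form (data below is illustrative):

```python
def ridge_1d(xs, ys, lam):
    """Closed form for y = w*x with an L2 penalty:
    minimize sum (y - w*x)^2 + lam * w^2  ->  w = <x,y> / (<x,x> + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # data on y = 2x exactly
w0 = ridge_1d(xs, ys, 0.0)    # lam = 0: ordinary least squares, w = 2
w1 = ridge_1d(xs, ys, 14.0)   # heavier penalty shrinks w toward zero
```

Ridge shrinks coefficients smoothly toward zero; the L1 penalty of Lasso instead drives small coefficients exactly to zero, and Elastic Net blends the two penalties.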

### Graph Algorithms

We include several graph problems and algorithms as they served as seeds of thought that grew into the PGE algorithm.

#### Minimum Spanning Tree

The Minimum Spanning Tree (MST) of a graph is a tree which minimizes the sum of edges which connect all vertices in a graph. MST is solvable in polynomial time using a greedy algorithm.

Prim’s algorithm is a greedy algorithm for finding the MST of an undirected graph. The algorithm starts with an arbitrary vertex and constructs the MST by repeatedly adding the lowest-weight edge that connects a vertex already in the MST to one that is not. A priority queue is used to efficiently determine the next lowest cost edge to add. [ Prim:1957:Shortest ]
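A minimal sketch of Prim's algorithm with a binary-heap priority queue (the edge list and weights are illustrative):

```python
import heapq

def prim_mst(n, edges):
    """Prim's algorithm: grow the MST from vertex 0, always taking the
    cheapest heap edge that reaches a not-yet-visited vertex."""
    adj = {v: [] for v in range(n)}
    for u, v, w in edges:
        adj[u].append((w, v))
        adj[v].append((w, u))
    visited, total = {0}, 0
    heap = list(adj[0])
    heapq.heapify(heap)
    while heap and len(visited) < n:
        w, v = heapq.heappop(heap)
        if v not in visited:
            visited.add(v)
            total += w
            for e in adj[v]:          # frontier edges of the new vertex
                heapq.heappush(heap, e)
    return total

edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5), (2, 3, 8)]
cost = prim_mst(4, edges)             # MST uses edges of weight 1, 2, 5
```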

Kruskal’s algorithm is also a greedy algorithm for finding the MST of an undirected graph. The algorithm starts by treating each vertex as a separate MST. It then proceeds to select the minimum edge from all remaining edges. If the edge connects two distinct MSTs, then it is included; otherwise, the edge introduces a cycle and is discarded. A disjoint-set data structure can be used to efficiently determine this condition. [ Kruskal:1956:Shortest ]
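The cycle check with a disjoint-set structure can be sketched as follows (same illustrative graph as above; `find` uses path halving for near-constant amortized cost):

```python
def kruskal_mst(n, edges):
    """Kruskal's algorithm: take edges in weight order, keeping an edge
    only if it joins two different components (checked via disjoint sets)."""
    parent = list(range(n))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    total = 0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge joins two trees: keep it
            parent[ru] = rv
            total += w
        # else: edge would close a cycle, discard
    return total

edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5), (2, 3, 8)]
cost = kruskal_mst(4, edges)
```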

#### Single-source Shortest Path

The single-source shortest path (SSSP) problem is to find the shortest path to all nodes, given a starting point.

Dijkstra’s algorithm for SSSP is a greedy algorithm which resembles Prim’s [ Dijkstra:1959:algorithm ]. At each step, the next closest vertex is added to the already visited set, thus expanding outwards in a greedy manner. A priority queue is used as well to improve the runtime complexity. The difference is that at each step, the distances to vertices on the frontier are updated to reflect the latest addition, possibly changing the order of the priority queue. Knuth generalized Dijkstra’s algorithm to hyper-graphs [ Knuth:1977:generalization].
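A minimal sketch of Dijkstra's algorithm with the same heap-based pattern (the graph is illustrative; stale heap entries are skipped rather than decreased in place):

```python
import heapq

def dijkstra(adj, src):
    """Single-source shortest paths: each pop settles the closest unsettled
    vertex, then relaxes its outgoing edges."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                        # stale entry, already settled closer
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd                # relax edge u -> v
                heapq.heappush(heap, (nd, v))
    return dist

adj = {"a": [("b", 1), ("c", 4)], "b": [("c", 2), ("d", 6)], "c": [("d", 3)]}
dist = dijkstra(adj, "a")
```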

A* search is a graph traversal algorithm proposed as an extension to Dijkstra’s algorithm with better runtime performance [ Hart:1968:astar]. The A* algorithm cuts down on the size of the subgraph that must be explored by using a heuristic to estimate a lower bound on the “distance” to the target. Similarly, it uses a priority queue to determine the next vertex to include. The difference is that the current distance is added to the estimated distance to the target.

### Theory

1. **Removing Non-determinism** - Creation of a completely reproducible SR algorithm.
1. **Search Space Organization** - Understanding and designing the structure of the search space.
1. **Evaluating Forms Once** - Designing the model for detection and avoidance of overlapping subproblems.

PGE was born from the process of removing random number generators from GP. This required methods for selection, survival, and replication to all be replaced. The replacement process brought to light several issues which had not been addressed by the GP community.

The theory underlying PGE is based on intuitive principles. First, take into account the rules from algebra to reduce model multiplicity into a canonical form, thereby reducing search space complexity. Second, design models so that they are only handled once. This requires that they are also designed with dynamic programming techniques in mind. Third, remove all sources of non-determinism to create a completely reproducible algorithm. This also simplifies development and debugging of the software which implements the algorithm.

#### Removing Non-determinism

Removing sources of non-determinism was a central theme in the development of PGE. The aim was to produce an algorithm which would give the same results with each invocation. The reason for this goal was the detrimental effect non-determinism has on search algorithms. We believe it to be an inherent and unavoidable difficulty present in the GP algorithm. You can find details on GP in [Chapter 7](/sr/07-gp).

To achieve deterministic behavior, PGE makes no use of random number generation. It performs the exact same algorithmic steps given the same parameters and same training data. This requires several aspects of the PGE algorithm to be defined deterministically.

**Initialization** establishes the starting points from which PGE searches. Determinism starts here by constructing a set of basis functions from the input variables.

**Prioritization** determines which points in the space to search next. This process uses a priority queue coupled with a Pareto non-dominated sorting algorithm.

**Exploration** discovers new models by making small changes to a given model. To maintain determinism, PGE makes all possible changes to a model, given the particular exploration scheme in use.

**Communication** occurs when PGE uses parallel execution. In order to maintain deterministic behavior, well defined synchronization points are used.

Details of each aspect will be expanded upon in the upcoming sections and next chapter.

#### Search Space Organization

The size of the space defined by a grammar is infinite, even when disregarding the increase brought by considering real valued coefficients. The main contributions are combinatorial components at both the search space and model levels of the problem. PGE's great efficiency gains are the result of methods designed to combat these combinatorial components. We begin at the model level, where the main tool is the trie based representation. It is dynamic programming which comes to the rescue at the search space level.

**Tree Representation Combinatorics**

PGE primarily uses the trie based representation for models, as detailed in the last chapter. The reason for this is the combinatorial number of different binary tree representations for the equivalent model. These different representations arise from permutations of the leaf nodes and from variations in the tree structure. Consider the equation a•b•c (Figure 4-1).
Figure 4-1 - Combinatorics of the Tree
This equation has 12 different binary tree representations, from six leaf permutations and two shape permutations. If we account for both addition and multiplication, the number of trees grows to 48. As you can imagine, with increasing numbers of operations and operands, the tree complexity and this count undergo a combinatorial explosion.

The advantage of the trie based model is the ability to take advantage of the commutative and associative properties of addition and multiplication. Because of these properties, we are free to order the operands, the sub-tries in the model, as we choose. Associativity is the ability to rearrange parentheses and is associated with the multiple tree representations for the same expression. By using the trie, we flatten a binary tree and metaphorically erase the parentheses from the expression. Commutativity means that we can sort the operands to addition and multiplication. By sorting them, we create a consistent ordering which can be treated as the canonical form for a model. In doing so, we remove the combinatorial element associated with the permutation of leaf nodes. Sorting is possible by creating an ordering on node types, easily achieved by assigning an integer to each node type.

By turning binary trees into tries and by sorting the tries, each model can have a canonical form. Figure 4-1, right column, shows the canonical form of a•b•c. After applying the reductions, the original 48 models are reduced to just 4. These savings created by sorting become even more important as equations become more complex.

**Algebraic simplifications**

There are other sub-expressions which arise that we would also like to remove and reduce. Simplifications group like terms together ($$x+x$$), replace sub-expressions which evaluate to a constant value ($$\sin\pi$$, $$\frac{x}{x}$$, or $$x-x$$), and reduce unnecessarily complex expressions such as $$\frac{1}{\frac{1}{x}}$$. The resulting equations are equivalent, simpler forms.
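The flatten-and-sort canonicalization can be sketched with nested tuples standing in for trie nodes (a simplification of the actual model structures):

```python
def canonical(expr):
    """Flatten nested +/* (associativity) and sort their operands
    (commutativity) so equivalent trees share one canonical trie."""
    if not isinstance(expr, tuple):
        return expr                          # leaf: variable name
    op, args = expr[0], [canonical(a) for a in expr[1:]]
    if op in ("+", "*"):
        flat = []
        for a in args:
            if isinstance(a, tuple) and a[0] == op:
                flat.extend(a[1:])           # lift grandchildren: binary tree -> n-ary trie
            else:
                flat.append(a)
        args = sorted(flat, key=repr)        # deterministic operand ordering
    return (op,) + tuple(args)

# two of the twelve binary-tree shapes of a*b*c collapse to one trie
t1 = ("*", ("*", "a", "b"), "c")
t2 = ("*", "c", ("*", "b", "a"))
```

Both inputs reduce to the single flattened, sorted form `("*", "a", "b", "c")`, which is the sense in which sorting erases the leaf-permutation combinatorics.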
There is debate as to how simplification affects the SR process [ [kinzett:2008:using](), [kinzett:2009:online](), [mcree:2010:symbolic]() ]. Certainly, changing the tree affects the Pareto trade-off, which in turn has consequences on selection and therefore search progress. Questions still remain as to how much and which simplifications should be applied, which forms of the equation should be kept, and the overall effects simplification has on the search process. Despite these questions, PGE still employs the aforementioned simplifications. This is done to eliminate those sub-expressions which unnecessarily complicate equations and the overall search space. If they were not replaced, many equivalent variations of a good equation could crowd out other potential solutions, as they would be selected more often. This would result in the algorithm stagnating in a local optimum of the search space. Simplifications effectively trim out parts of the search space which are redundant, and also return results which are more comprehensible. PGE uses the simplification techniques present in the SymPy Python library. This and further implementation details can be found in [Appendix-A1](/sr/A1-pypge).

#### Evaluating Forms Once

At its core, PGE is a dynamic programming algorithm which aims to evaluate each sub-problem once.
> A dynamic-programming algorithm solves each subsubproblem just once and then saves its answer in a table, thereby avoiding the work of recomputing the answer every time it solves each subsubproblem.
>
> ~ Cormen:2009:algorithms

In PGE and SR, a sub-problem is a specific form, namely the parse tree for an equation and its terminals. Recall that PGE uses placeholders for the parameters, so an equation form accounts for the existence of a parameter without knowing its value. The key to evaluating forms once is to fully optimize a form the first time it is seen and to remember the forms which have already been explored. PGE fully optimizes forms by fitting the valueless parameters with non-linear regression, thus assigning them values. In order to memoize the equation forms, they are converted to a linear representation and then recorded in a lookup structure called the Integer Prefix Trie. This is possible because each equation has a canonical form which is unique.

**Non-linear Regression** is not unique to PGE, or GP for that matter, but it is a central component of PGE's ability to evaluate an equation form once. As PGE encounters new equation forms, it fits the abstract parameters using non-linear regression, resulting in the 'best' version of that form on the training data. PGE uses the Levenberg-Marquardt (LM) optimization algorithm [CITE]() to fully fit an equation to a set of data. In cases where an equation is linear w.r.t. the parameters, the LM algorithm library returns in one iteration by using singular value decomposition (SVD).

PGE's treatment of parameters and use of non-linear regression enables the separation of the search for equation form from the optimization of a form's parameters. In this way, the search for form and the optimization of coefficients are split into separate tasks. This separation has profound implications for the efficiency and effectiveness of PGE. It enables an equation form to be fully evaluated once, removing the duplication of effort GP experiences when using genetic operators to optimize the coefficients of an equation form. Some GP implementations have incorporated the use of regression for fitting coefficients.
However, they make no use of memoization, and therefore still suffer from the duplication of effort which results from multiple copies of the same equation.

**Memoization**

Canonical forms for equations reduced one source of combinatorics in the SR problem. A second source is multiple derivations: multiple paths over the graph which reach the same node. Below is a diagram which demonstrates this.
Figure # - Multiple Derivations
In this diagram, we start with models which consist of a single variable. Now we could take each of these models and add a different variable to it. Consider though that we could start with $$x$$ and add $$y$$, or start with $$y$$ and then add $$x$$, both arriving at the canonical form $$x+y$$. Repeat this process, adding a third variable, and we have six paths to the model $$x+y+z$$. It is important that we detect this situation, by determining that each of the six $$x+y+z$$ are really the same model. If we do not detect the separate models as one, we would end up repeating a lot of work. This repetition would become a combinatorial explosion of exponential proportions. PGE uses dynamic programming techniques to detect multiple derivations and consider an equation form only once. This feature of PGE is enabled by the Integer Prefix Trie and an integer linear representation for the equation models. The next section will fill in the details of both representation and memoization.
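The collapse of the six derivations can be sketched directly, using a sorted operand tuple as the canonical form of an n-ary sum:

```python
from itertools import permutations

def canonical_sum(terms):
    # commutativity: sorting the operands of n-ary '+' yields a canonical form
    return ("+",) + tuple(sorted(terms))

# derive x+y+z by adding one variable at a time, in every possible order
seen = set()
paths = 0
for order in permutations(["x", "y", "z"]):
    terms = []
    for var in order:
        terms.append(var)        # one derivation step per added variable
    seen.add(canonical_sum(terms))
    paths += 1
```

Six distinct derivation paths produce exactly one canonical model; without this detection, each path would be expanded again, and the duplication compounds at every level of the search.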
### Components

1. **Representation** - Trie and linear representations with valueless parameters
1. **Evaluation** - Nonlinear regression for parameters, fitness metric for models
1. **Optimization** - Exploration, prioritization, and regularization
1. **Memoization** - Detecting overlapping subproblems to avoid duplicate work

PGE has four components which enable its search capabilities. The first of these are the generating functions which produce new equations from previous equations. The second is an Integer Prefix Tree which enables memoization of equations from their serialized representation. The third component is non-linear regression which separates form from optimization. The fourth component is the Pareto Priority Queue which gives order to the search space by prioritizing equations over one another.

#### Representation

PGE uses multiple representations for an equation. The syntax trie is used for optimization. The integer sequence is used for memoization.

**Syntax Trie** PGE uses tries, where operators can have a variable number of children. In the n-ary trie representation, the associative operators can have n sub-tries, flattening and reducing the tree's size. This is a slight modification from the usual binary tree, affecting only the associative operators addition and multiplication. The n-ary trie does not change the modeling ability of the equation, but will affect the trade-off between parsimony and accuracy. This in turn affects the selection operation of any SR implementation, though we do not investigate this issue here. In addition to the reductions in parse tree size, the n-ary representation eases the design of sorting and simplification algorithms. These algorithms, detailed next, work within the parse tree to reshape equivalent trees into a canonical form. This effectively merges isomorphic equations, reducing the size of the search space while simultaneously adding structure to it.
**Linear integer sequence** is the prefix notation representation of an equation, mapped to the integers. It is equivalent to the trie representation, with one-to-one correspondence. The linear integer sequence is obtained by mapping each node type to a different integer and then performing a pre-order traversal of the trie, collecting the integers by appending them to a list. The usefulness of this representation is efficiency in implementation. Integers can be used for constant time lookups, where otherwise we may have been limited to a higher complexity operation. The linear representation is mainly used in conjunction with the Integer Prefix Trie (IPT) during the memoization process. The IPT is an efficient data structure for recording and looking up equations and will be detailed shortly. Additionally, the linear representation can be used for interprocess and network communication. Its main benefit there is on the receiving side, where we can avoid the complexity of parsing the human readable version of an equation.

**Abstract parameters** are the valueless placeholders used as coefficients in equations. They signify that a parameter is present and where in an equation it occurs. The benefit of abstract parameters is two-fold. First, equation form matching and memoization can happen without regard to their value. Second, they enable the optimization of the parameters to be performed in a separate process, via non-linear regression. This is the key to separating the optimization of a given form from the optimization which occurs at the global scale, in the space of all equations.

#### Evaluation

**Non-linear parameter fitting**, though not unique to PGE, or GP for that matter, is a central component of the PGE algorithm. It enables the separation of the search for form and the optimization of that form's parameters. Great gains in efficacy are realized by separating the search for equation form from the optimization of parameters.
PGE achieves this separation by using abstract parameters. Parameters are indexed into an array, which means they do not store their value internally. The indexed parameters contribute only to form, taking on value only after non-linear fitting. Handling parameters in this manner is analogous to the way variables are treated, which only contribute to fitness during evaluation.

**Fitness metrics** are composite metrics which combine a notion of accuracy and complexity. They are used to compare the relative fitness between two models or equations. They are the multi-objective value which SR implementations seek to optimize. The most common fitness metric uses the tree size and the mean squared error. Many values for accuracy and complexity may be used, and more than one may be used. Additionally, the values in the fitness metric may be weighted to give preference. Another useful practice is to normalize the values used in the fitness metric, across the set, before sorting into successive non-dominated sets. The Pareto non-dominated sorting methods, described in the previous chapter, use fitness metrics to determine the successive non-dominated sets of equations.

#### Optimization

SR is a multi-objective optimization task which seeks to maximize accuracy while minimizing complexity. To search over equation space, PGE must move within the space and decide where and how to move. PGE uses exploration operators derived from the rules of the grammar. The exploration operators input one equation and output a set of equations which result from small edits. PGE uses a priority queue to express which points in the search space to expand next. By incorporating the Pareto non-dominated sort into a priority queue, the PGE search can exploit and optimize the trade-off between competing objectives in a deterministic order.

**Exploration Operators** are the means by which PGE searches the space of all equations.
Each exploration operator is a policy for how to expand and modify an equation's parse tree to obtain functions 'close' to the current one.
Figure #: PGE Exploration Functions
{% highlight Python linenos %}
def Grow(model):
    add_expands = add_extend(model)
    mul_expands = mul_extend(model)
    var_expands = var_substitution(model)
    return var_expands + add_expands + mul_expands

def add_extend(model):
    # E -> E + c*L
    # E -> E + c*N(L)

def mul_extend(model):
    # T -> T * L
    # T -> T * N(L)
    # T -> T / L
    # T -> T / N(L)

def var_substitution(model):
    # N -> ( N + c*L )
    # N -> ( N + c*N(L) )
    # N -> ( (c*L) / N )
    # N -> ( N / (c*L) )
{% endhighlight %}

PGE uses the grammar's production rules to expand previously encountered equations. New equations are produced by applying the function *Grow*, from Listing \ref{expand}, while recursing over the parse tree. The *Grow* function determines the current node's type and applies the appropriate production function. Each production function corresponds to one or more of the grammar's production rules. *AddTerm* increases the number of terms in a summation, such as $$aX + bY \Rightarrow aX + bY + cZ$$. *WidenTerm* increases the number of terms in a product, such as $$aXY^2 \Rightarrow aX^2Y^2$$ or $$aXY^2 \Rightarrow aXY^2Z$$. *DeepenTerm* increases the complexity of a term, such as $$aXY \Rightarrow a(X+bZ)Y$$ or $$aSin(X) \Rightarrow aSin(X+Y)$$.

The generating functions define the set of reachable expressions in conjunction with the supplied building blocks (terminals and non-terminals of the grammar). However, generating functions do not determine the order in which the space is searched. This is the role of the PPQ (section \ref{sec-pge-ppq}), which prioritizes the areas to explore next.

**Prioritization Strategies** The Pareto Priority Queue (PPQ) is the deterministic mechanism for controlling the search direction of PGE. The PPQ replaces selection for mating with prioritized equations for expansion (selection for survival is unnecessary since all equations are remembered). The PPQ is what its name implies, a priority queue built on Pareto non-dominated sorting.
The PPQ prioritizes expansion towards those branches of the search space which best balance size and accuracy. The PPQ removes the need for generations, survivors, and mating pools, storing all explored equations which have not yet been expanded. Our experiments have not shown this to be prohibitive; consumed memory never exceeded 500MB, even in the face of hundreds of thousands of unique equations.

To construct the PPQ, successive Pareto frontiers are appended onto a linear structure. Thus, the smallest equation from the first Pareto front is at the head of the queue, followed by the remainder of the first front in order of increasing size. The remaining Pareto fronts follow the same pattern. Priority is given first to the Pareto frontier and then to size within the frontier. This is the same ordering as the Pareto sorted array that results from the GP Pareto sort during selection.

**Regularization** is the process of introducing a penalty for complexity in order to avoid overfitting. In PGE, as in GP, this is an implicit process which is part of the Pareto non-dominated sort. PGE uses the PPQ and a pop count to control the amount of regularization. During processing, we remove the top $$p$$ equations (3 in our experiments) when selecting the next areas to search.
By doing so, we select the $$p$$ smallest equations from the first Pareto frontier. This gives variation across the trade-offs for the equations to be expanded, and thus variation in the search direction too. If we remove only one equation, the smallest equation is replaced from the next front and the search progresses through equation space by size alone. If we remove too many, PGE starts to produce complex and overfit models.

#### Memoization

**Detecting overlapping subproblems** is the key to the dynamic programming approach taken by PGE. In PGE, a subproblem is equivalent to a particular equation form. Subproblems are encountered when an equation has several derivations, or paths over the search graph which reach the same node. This means there are multiple orderings of the production rules which result in the same equation, and that each equation may appear in more than one form, as several equivalent points within a grammar's representable space. The memoization of form allows PGE to consider each form just once.

PGE matches equations by comparing their serialized representations. Serialization transforms an equation into a sequence of integers by assigning a unique value to each node type. The resulting integer sequence is equivalent to the prefix notation of the equation. Since each sequence represents a unique equation, the parse tree is recoverable from the sequence. Also, since PGE uses abstract coefficients, they are all converted to the same value; PGE only memoizes their existence and location.

**The Integer Prefix Tree** (IPT) is at the core of the PGE dynamic programming framework. The IPT detects overlapping subproblems by comparing integer sequences, the serialized representations of equations. Detecting overlapping subproblems enables us to eliminate duplication of effort in an exponentially growing search space.
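The serialization just described can be sketched as follows; the node-type table and the tuple encoding of parse trees are hypothetical stand-ins for PGE's actual token assignment. Note that every coefficient collapses to the single `'c'` token.

```python
# hypothetical node-type table: one integer per node type,
# with all abstract coefficients sharing the 'c' token
TOKENS = {'c': 0, 'x': 1, 'y': 2, 'add': 10, 'mul': 11, 'sin': 12}

def serialize(tree):
    # tree is either a leaf name or a tuple (op, child, ...);
    # a prefix-order walk yields the integer sequence
    if isinstance(tree, str):
        return [TOKENS[tree]]
    op, *kids = tree
    seq = [TOKENS[op]]
    for kid in kids:
        seq += serialize(kid)
    return seq

# c*x + sin(y)  ->  [add, mul, c, x, sin, y]  ->  [10, 11, 0, 1, 12, 2]
expr = ('add', ('mul', 'c', 'x'), ('sin', 'y'))
```

Because the sequence is the prefix traversal of the tree, the parse tree (up to coefficient values) can be reconstructed from it, which is what makes the sequence usable as a memoization key.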
The IPT was inspired by the suffix tree algorithm for string matching [[Weiner:1973:LPM](), [Farach:1997:OST](), [Skiena:1998:ADM]()]. The suffix tree algorithm gives log-time lookup for string existence in a piece of text. Similarly, the IPT provides log-time determination, w.r.t. the size of the equation alphabet, as to whether a subproblem has been encountered before. Since the ASCII alphabet is equivalent to the integers, in a mathematical mapping sense, we can use any sized alphabet, or equivalently, any range of integers.

PGE uses the IPT memoization tree defined in Listing \ref{pretree}. The IPT is a recursive structure with an insertion algorithm which memoizes a sequence of integers. The IPT is built up as new expressions are encountered. As each node in the IPT is visited, its associated counter is incremented. If the insertion algorithm has to create a new node, then the expression is also new, and the expression type is stored in that new node. New node creation is propagated back up the tree as the insertion function unwinds, and the root IPT node's unique counter is incremented if a new insertion took place. Each insertion or lookup takes O(m log(n)) operations, where m is the length of the serial and n is the size of the alphabet. The depth of the IPT is equal to the length of the longest serial encountered.

The IPT also tracks the number of times each node in the structure has been visited, keeping counts for total visits and unique visits. This allows us to do some accounting of the number of unique equations, as well as how many times each equation is produced by the generation functions. There has been no previous research on the topic of unique equations in relation to the total number of equations created or evaluated. We believe this to be due to the lack of a method for matching equations, which in turn is due to the lack of canonical forms in the literature. We use the IPT to compare the efficiency of GP and PGE in section \ref{sec-results-efficient}.
The IPT is also the enabling structure which conceptually turns the search space from a tree into a graph by matching and memoizing subproblems, or equations. By eliminating duplication, this greatly reduces the amount of work PGE must perform while searching for ideal equations. As mentioned earlier, the search space is also reduced in size by sorting equations into a canonical form and simplifying them with rewrite rules.
Figure #: Integer Prefix Tree
{% highlight Python linenos %}
class Node:

    def __init__(self, key=0, value=None):
        self.map = {}    # int -> Node
        self.key = key
        self.value = value

    def insert(self, iis, value):
        if len(iis) > 1:
            ii = iis[0]
            if ii not in self.map:
                self.map[ii] = Node(key=ii)
            return self.map[ii].insert(iis[1:], value)

        if len(iis) == 1:
            ii = iis[0]
            if ii not in self.map:
                # new node, so this expression is unique
                self.map[ii] = Node(key=ii, value=value)
                return True
            else:
                return False

    def lookup(self, iis):
        if len(iis) > 1:
            ii = iis[0]
            if ii in self.map:
                return self.map[ii].lookup(iis[1:])
            else:
                return False, None

        if len(iis) == 1:
            ii = iis[0]
            if ii in self.map:
                return True, self.map[ii].value
            else:
                return False, None
{% endhighlight %}

### The Search Loop

PGE iteratively refines equations by applying the grammar's production rules recursively. PGE removes the top equations from the queue, prioritizing the first Pareto frontier. The top equations are then expanded according to the production rules. The resulting expanded equations are checked for uniqueness by the Integer Prefix Tree (IPT). The unique trees that remain are fit using non-linear regression, evaluated for fitness, and pushed into the Pareto Priority Queue (PPQ). The main PGE loop continues in this way until the stopping criteria are reached.
Figure # - PGE Flow Diagram
The PGE search proceeds as follows. Initialization generates a set of basis functions for each variable. These basis functions are then memoized by the IPT, fit to the training data and pushed into the PPQ. The main PGE loop iteratively pops $$p$$ equations from the top of the PPQ for processing. Each of these equations is expanded through recursive application of the generation functions. The resulting equations are memoized using the IPT. If the equation has been encountered before, it is discarded. The remaining unique equations are fit to the training data with non-linear regression and pushed into the PPQ. PGE continues until a model of desired accuracy is discovered or a computational threshold is reached. Theoretically, PGE could explore the entire space of representable equations given infinite space and unlimited time.
Figure #: PGE Search Loop
{% highlight python linenos %}
def loop(iterations):
    for I in range(iterations):
        popd = pop(p)

        # expand popd models, they are now fully processed
        expanded = expand(popd)

        # filter and memoize expanded models
        to_memo = filter(expanded)
        to_eval = memoize(to_memo)

        # fit and evaluate these models
        eval(to_eval)

        # push fully fit models into the queue
        push(to_eval)
{% endhighlight %}

The main PGE loop makes the following steps (line numbers): (3) $$p$$ equations are popped from the top of the PPQ. (6) These equations are expanded into the more complex set 'expanded' by applying the grammar's production rules recursively to each equation's parse tree. (9) Each equation $$e \in$$ 'expanded' is potentially filtered for excessive size or illegal operations. (10) The IPT uses an equation's serialized representation to check whether $$e$$ has been seen before. If it has, $$e$$ is discarded; otherwise $$e$$ is unique and is (13) fit to the training data and (16) pushed into the PPQ.

Variations on the PGE algorithm can be created by altering how PGE is initialized and how it steps through the grammar via its production rules. We can seed PGE with a richer set of basis functions, and even allow complete sub-trees to become building blocks. Further, variations on the expansion algorithm can be created by parameterizing where and which generation functions are applied to a candidate. By modifying the productions applied at each node, we can reduce or enlarge the set of equations generated at each node of the tree, and thus at each point in the search space.

In this paper, we used three different expansion methods. The basic method (PGE-1) restricts the grammar by removing productions which result in non-distributed expressions like $$ax*(b+cx*(d+x^2))$$. It only uses *AddTerm* and *WidenTerm* during the recursive expansion.
The second method (PGE-2) lifts the previously mentioned restriction, using all three generation functions from Listing \ref{expand}. This results in isomorphs which are too 'far' apart for the simplification rules to recognize as the same. Despite being equivalent, these isomorphic forms can produce very different offspring, and they represent separate areas of the search space. The third method (PGE-3) is FFX inspired, but starts with only a set of univariate bases. These bases are iteratively grown into a summation of multiplications by applying *AddTerm* and *WidenTerm*, creating candidates with more, and increasingly complex, bases. PGE-3 differs from PGE-1 by explicitly defining the root node to be addition and placing stricter limitations on the complexity of each term in the summation.

### Limitations

During the development of PGE, we encountered some limitations in our current implementation and reasoning.

#### Exponential to |{Features}|

PGE does not handle dimensionality well. As the number of variables increases, so does the number of equations generated. Growth is exponential in the number of variables, because the production rules are applied to each node of the parse tree and expand with all possible variables. Further, with more complex generator functions, equations which are mathematically the same are not recognized as equivalent by our current rewrite rules. The combination of these two effects compounds the exponential rate of equation generation. This may be only partially limiting, as the various forms of one equation represent different areas of the search space, and different forms are producible from these points in the space.

#### Algebra Policies

Expansion and simplification policies are difficult to establish and generalize. There is no free lunch. The policies determine the reachable search space and how local changes are made.
If expansion is unrestrained, complexity grows quickly, often creating equations which have a simpler equivalent. Additionally, the original implementation used a custom algebra system which lacked many features and capabilities. We advise moving to a mature Computer Algebra System (CAS), and will do so in the next chapter.

#### Bloat in PGE

Bloat in PGE occurs when good partial solutions dominate the selection pool and create an expansion cycle. Consider $$F(x)$$ as a good partial solution. Because PGE makes all possible modifications to this expression, it will produce $$F(x) + c*g(x)$$ where either $$c$$ is very small or $$g(x)$$ produces near-zero values. In both cases, bloat occurs because the extra terms of the expression have little effect on the fitness of the equation. Adding to this, each node in the bloated portion is expanded upon, further exacerbating the situation.

#### Similar Selection

Due to the bloat and all-inclusive expansion, bloated partial solutions can quickly dominate the selection pool. These similar solutions have similar fitness, causing the selection process to have difficulty distinguishing between them. This can create a situation where similar models are selected and expanded, with bloated terms and incomparable fitnesses, and PGE gets stuck in a local optimum.

#### Parameter Optimization

Parameter optimization is the portion of any SR implementation which takes the longest time. In many cases, the parameters are linearly dependent and we obtain an exact best solution for a model. In the nonlinear cases we may not, as nonlinear regression is not guaranteed to find an optimal solution. There are also questions as to where parameters should be included. If excessive parameters are added to a model, it can take a comparatively long time to fit, and in many cases may overfit the noise in the data.

### Relation to Genetic Programming

Prioritized Enumeration is heavily influenced by GP.
It uses the same parse tree representation for candidate equations. PGE, however, differs in how it evaluates, selects, and generates new candidates from old ones. This section is meant to help those familiar with GP understand where and what the analogous processes are.

**Parameters** PGE is nearly a parameter-free algorithm, requiring only that the building blocks, the generating functions, the value of $$p$$ (the peel count), and the termination criteria be established prior to beginning an equation search. Since building blocks are generally established by the problem or benchmark, and stopping criteria are relatively common across SR implementations, PGE has only two unique parameters: (1) $$p$$, the number of equations to remove from the PPQ, and (2) which generation functions to use. This is a substantial advantage over GP, which requires many parameters to be set, a notoriously difficult task as was discussed in the previous chapter.

**Evaluations** Evaluation is the most time consuming phase of the model discovery process. PGE performs two steps of evaluation: fitting coefficients and evaluating fitness. PGE uses non-linear regression to optimize the coefficients, which requires an undetermined number of evaluations of the derivatives of the equation. Once fit, PGE calculates fitness on the testing set. In GP, one evaluation per data point, per equation, is required for both training and testing.
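As an illustration of the coefficient-fitting step, the sketch below fits the abstract coefficients of one fixed equation form with `scipy.optimize.curve_fit`. This is only an analogy under assumed names: PGE's actual regressor and the form `c0 * exp(c1 * x)` are not taken from the source.

```python
import numpy as np
from scipy.optimize import curve_fit

# hypothetical abstract form: f(x) = c0 * exp(c1 * x)
# c0 and c1 are the abstract coefficients to be regressed
def form(x, c0, c1):
    return c0 * np.exp(c1 * x)

# noiseless training data generated from known coefficients
x = np.linspace(0.0, 2.0, 50)
y = 2.0 * np.exp(0.5 * x)

# non-linear least squares recovers the coefficients for this form
coeffs, _ = curve_fit(form, x, y, p0=(1.0, 1.0))
print(coeffs)  # close to [2.0, 0.5]
```

The key point is the separation of concerns: the search proposes the form, and a numerical optimizer finds its best coefficients, so the discrete search never has to discover coefficient values itself.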
Regressing the coefficients has a profound effect on the ability of PGE to quickly traverse the search space. It allows for the separation of equation form and coefficient optimization into two sub-tasks of the overall search. In GP, the optimization of form is difficult and the most time consuming portion of the search process.

**Selection** To prioritize the enumeration of expressions, we maintain what we will refer to as a Pareto Priority Queue (PPQ). The contents of the PPQ are the equations which have been regressed and evaluated, but have not yet been expanded. The PPQ is similar to a basic priority queue, in that we want to quickly prioritize the elements which have the highest priority. It differs in that we sort on two objectives, parsimony and accuracy, rather than one. When selecting the equations to expand next, we remove one or more elements from the front of the heap. Since the front of the heap is the first Pareto front, we prioritize the smallest equations first. The more equations we *peel* (remove) from the front of the heap, the more of the search space we consider concurrently. This enables the search to exploit equations that express different levels of the trade-off between parsimony and accuracy. If we select only one equation from the PPQ per iteration, the search degenerates into one which progresses through expression size. This degenerate case will exhaust all possible equations of the current size before considering any larger equations.

**Population and Breeding** In PGE, there is no need to maintain a diverse population in which survival and replication determine the outcome of the search. In fact, one of the goals of PGE is to process an expression form only once. This is managed by the IPT, which tracks the equations that have been evaluated and expanded. Equations which have been generated, regressed, and evaluated sit in the PPQ.
Once an equation has been removed from the PPQ and expanded upon, it requires no further consideration.