### The Problem The overall goal of SR is to produce models which find a balance in the trade-off between accuracy and complexity. The best models will therefore be both simple and explanatory, elucidating the dynamics of the system under study. In this way, an expert is freed to think about the larger and more complex aspects of that system, gaining insights from the results of SR. The domain of SR is to model a dataset with a mathematical formula or equation. SR makes no a priori assumptions about the form of a model. In SR, the search encompases the space of all equations. Compare this with more traditional methods for regression. Linear and non-linear regression assume a model and fit the model's parameters to the data on hand. More sophisticated methods like the Fourier transform and neural networks can model almost any signal or function. However, they don't elucidate an understanding of the interactions and behavior of that signal or function. SR is often considered a generalization of traditional methods for regression. A SR implementation will generate, tests, and validated a massive number of equations. Thus it is flexible, capable of modelling almost any signal, and also explanitory, producing models which are easily interpreted by humans. Because SR models data with equations, it also finds a vast array of application domains. Simple relationships are modeled with explicit functions. Time variant relationships are modeled with differential equations. Several differential equations can be combined into a set, forming a dynamical system. Relationships with vary in both time and space are modeled with partial differential equations. Invariants, or conserved quantities can also be modeled with an equation and its partial derivatives. These are all incantations of ideas from calculus and are prominent throughout the sciences and engineering. From cell biology to ecology, physics to astronomy, chemistry to pharmaceuticals, engineering and finance; all fields rely on mathematical notions and theory to produce solid foundations upon which decisions can be made. Looking forward, a SR technology will. be a tool of great assistance to domain experts. The ability to automatically discover analytical models from our increasing volumes and complexity of data is paramount to continued scientific progress. The equation and formula models are human interpretable, precisely because they are a human conception and language for talking about functional relationships found in the world. It is the relation between humans, math, and science; that SR holds a special place. As a tool for helping humans make sense in a data driven world. SR holds the potential to accelerate the pace of scientific breakthroughs. #### The Search Space The SR search space is the space of all equations. This search space is defined by a grammar through its terminals, non-terminals, and production rules. It is an exponential space that can be represented as a tree or a graph. As a tree, the space is the ordered expansion of applied productions, with the start symbol at the root. The graph is similar, but connects nodes in the tree that are equivalent. In the graph, there are multiple paths to an equation, coinciding with the multiple derivations for an equations parse tree. Defining the building blocks from which equations can be built, also defines the space of representable expressions that be searched. The size of this space is infinite even when disregarding the massive increase brought on by real valued coefficients. This is because a grammar is infinite through its production rules. Consider the following spaces: 1. The space of all polynomials in one variable 1. The space of all polynomials in more than one variable 1. The space of all polynomials with trigonometric functions 1. Increasing the depth of expressions (allowing parentheses ()') 1. Allowing division (put any of the previous spaces on top, bottom, or both) The complexity of the search space is also linked to how the search space is represented. Tree representations create a large fanout ratio in the search space, with repetition. Using dynamic programming, that repetition can be detected and the search space simplified. The next section will describe various representations used in SR. Subsequent chapters, especially the PGE Chapter, will discuss the search space and these issues further.
### The Components Like all machine learning tasks, SR can be logically broken into three components: Representation, Evaluation, and Optimization [[Pedro:20??:ml](), [Mustafa - Learning from Data]()]. Representations are the conceptual and implementation models we choose, and which define the hypothesis space. Evaluation determines a model's fitness, and is defined by the objective function. Optimization is then the technique which explores the representational space and maximizes model fitness. Often these three components have a high degree of coupling or co-dependence. That is, optimization techniques usually have a representation and objective in mind. In our work, we additionally include a memoization component for tracking the sub-problems previously solved. Memoization is necessary in SR as the representational space contains an extremely high degree of overlapping subproblems. If we did not detect overlap, we would experience a high degree of repetition and process the same equations many times. To our knowledge, this is a first for the SR research community. #### Representation in Symbolic Regression In SR, equations need to be represented *in cilico*. The choice of representation determines the methods we can later use to work with the equations. As we will see, there are tradeoffs when choosing between them. There is also opportunity to use multiple representations simultaneously and will prove to be beneficial. Of the many equation representations which have been explored in SR research, binary trees are the most common. They fit naturally with the theoretical concepts and simplify the underlying algorithms. Tree representationss in SR can be subdivided into binary and n-ary. Binary trees are natural to computers, n-ary tries are natural to algebra. Graph and linear representations have been used as well. All of the representations are interchangeable and we are able to transform between them. This is important as each representation has its benefits and can be useful to different subtasks in the SR problem. **Linguistic Foundations** Before we explore each representation, some context for the discussion is in order. In this work, we take a strong point of view from a language perspective. This perspective mainly comes from a computational background, in terms of regular languages, but also from and algebraic perspective. It was this perspective which lead to the PGE algorithm and underpins much of our foundation. From a programmers point of view, a grammar defines a language in a recursive manner [ [Aho:1972:TPT](http://dl.acm.org/citation.cfm?id=578789), [Dragon Book](http://dragonbook.stanford.edu/) ]. A grammar is comprised of a start symbol, nonterminals, terminals, and production rules. The language generated by a grammar G, denoted L(G), is the set of sentences generated by G. A sentence *S* is in L(G) if there is a sequence productions which results *S*. This sequence of productions constructs the derivation for *S*. If a sentence *S* has a derivation then it is in the language G. Thus, grammar allows us to determine if a sentence is in a language. Below is a simplified example for a grammar for mathematics. START -> E E -> E + T | E - T | T T -> T * N | T / N | N N -> Cos(E) | Sin(E) | Tan(E) | Log(E) | Exp(E) | L L -> (E) | -(E) | (E)^(E) | TERM TERM -> Constant | Variable When working with languages in computers, expressions are represented as trees. The trees represent the application of the production rules for a grammar. They are often refered to as syntax or parse trees. The grammar can then be used to determine if a tree is valid by finding a matching set of production rule applications. In PGE, the production rules will also provide a means to recursively enumerate sentences in a language. **Binary Trees** In Symbolic Regression, equations are constructed as binary trees made of basic components called building blocks. The tree of building blocks makes up the *DNA* of an equation. It matches the parse or derivation tree for a given equation, and can also be called its Abstract Syntax Tree (AST). Building blocks come in two types, operators and operands. Operators are the internal nodes of the AST and include familiar mathematical functions $$(+,-,*,/,sin,cos,...)$$. Operands are the terminals or leafs of the AST and represent state-variables, constants, and real-valued numbers. There are two ways to represent constant, indexed and floating point. The real value floating point representation has been the most common in GP literature. The value is randomly assigned, mutated by multiplying it by a small number or swapping it out with another during crossover. Indexing associates a constant in the tree with an element into an array. This permits the coefficients to be optimized outside of the equation form. This can be done by a second evolutionary algorithm or by standard regression techniques. We use real-valued coefficients in GP and indexed coefficients in PGE. **Algebra Tries** The algebra *trie*, or just plain *trie*, is a tree structure where each operand can have a variable number of children. The algebra trie representation only affects two operators in SR, addition and multiplication. In fact, in any tree based representation, some operators (internal nodes) such as $$cos$$ and $$log$$ always have only one outgoing edge to their operand. Figure \ref{fig:eqn-trees} shows the comparable representations of several equations as both binary and algebra tries.
Figure # - Binary Trees and Algebra Tries
The algebra trie representation has no effect on the constituents or modeling ability of the equation. It does however, alter the equation tree's size and structure. Changing size of equations effects the trade-off between parsimony and accuracy. This multi-objective trade-off, often called Pareto sorting, is the multi-objective optimization search of SR. Any change to the fitness function undoubtedly effects the operation of any SR implementation. Additionally, the trie representation eases the design of simplification algorithms. These algorithms work within the tree to rewrite it to a canonical form. The simplification algorithms are analogous to rewrite rules from parser literature [ [Aho:1972:TPT](http://dl.acm.org/citation.cfm?id=578789) ], and are reminiscent of high school algebra. The implications of tries and associated algorithms have largely been unexplored in the SR community. Despite the lack of detailed analyses, our theories and experience have shown their use to greately increase the tractability of SR. As such, we will use them exclusively and omit the use of binary trees. **Linear Forms** Linear representations come in a few flavors and are usually the result of a transformation. Linear GP considers that a program tree is compiled into a linear set of machine instructions. In Linear GP, the evolution occurs on the machine code by modifying the sequence of instructions [ [Refs Here](http://cswww.essex.ac.uk/staff/rpoli/gp-field-guide/71LinearGeneticProgramming.html#13_1) ]. In Fast Function eXtraction [ [McConaghy:2011:FFX]() ], equations are represented as a linear combination of simple basis functions. FFX uses non-GP methods to determine which bases to use and what the coefficient should be. **Graph Representations** Graph representations encode the same information as tree representations. The difference is that a subtree which appears more than once in the tree, can have a single instance connected to by different parts of a graph representation [ [Hod:2007:graphrep]()]. A second variation is to maintain a library of small trees. Then, indexing nodes reference which basis to use and where to use it [[miller:2000:cartesian]()]. With graph representations it is possible to simultaneously update several parts of a model, to its benefit or detriment. **Restrictions and Simplifications** Simple restrictions disallow invalid mathematics, operations such as dividing by zero or taking the logarithm of a negative number. This is sometimes called interval arithmatic in the literature \cite{keijzer:03:improving}. In general, we wish to only consider valid equations. Simplification groups like terms together, remove sub-expressions which evaluate to a constant value ($$\sin\pi$$, $$\frac{x}{x}$$, or $$x-x$$), and reduce unnecessarily complex expressions such as $$\frac{1}{\frac{1}{x}}$$. These algorithms, known as tree rewriting systems [ [Joshi:1975:TAG](), [dershowitz:1982:orderings](), [huet:1980:equations]() ], convert one parse tree to an equivalent and simpler tree. There is debate as to how simplification effects the SR process [ [kinzett:2008:using](), [kinzett:2009:online](), [mcree:2010:symbolic]() ]. Simplifications effect the structure of the AST through its manipulation, which in turn effects the areas of the overall search space. In addition, depth and size bounds are often placed on the equation's derivation tree. Placing bounds on the tree explicitly places bounds on the search space. Generally, a small bound on the size or depth is started with and increased if the models returned are unsatisfactory. #### Evaluation in Symbolic Regression In SR, equations need to be evaluated to assess their accuracy and fitness. This involves processing on the training data, measuring the error, and then constructing a fitness value to be used for relative comparison. **Point evaluation** is evaluating a model representation on a data point. This is generally a straight forward process and is repeated for every point in the training data to produce the predicted values. **Accuracy metrics** are methods for estimating the residual error of a model. Error is estimated by comparing the predicted values with the true values using a standard measure. Typical measures include: 1. Mean Absolute Error (MAE) 1. Mean Squared Error (MSE) 1. Root Mean Absolute Error (RMAE) 1. Root Mean Squared Error (RMSE) Less direct accuracy calculations include: total hits, which tallies the number of data points that an equation is 'close' to, determined by a threshold. Ranking calculates a score based on relative order by considering the Pareto dominates, is dominated ratio. The most common accuracy metrics are mean squared error and average absolute error. **Fitness values** are used to determine the relative fitness of individuals as opposed to the absolute fitness. In SR, they are generally constructed from the values of competing objective functions. The most typical is to minimize complexity while minimizing error. Complexity is usually measured as the number of nodes that are contained in the tree. The values used for fitness can further be manipulated prior to model comparison. The values can be weighted to express relative emphasis during optimization and skew the balance. Normalization of the values, on a per value basis, limits the effect of scale between fitness components. The next section describes how fitness value are used to make the determination of relative fitness. #### Optimization in Symbolic Regression Optimization is then the technique which explores the representational space and maximizes model fitness. This phase is highly related to the method for solving SR. Arguably, one could say this is the most significant point of difference between the various methods. The next section describes the three main methods for SR.
Figure # - SR Loop
**Exploration operators** Exploration operators determine how the search moves around the search space. The method of exploration is tied to the method of representation, since the former must act upon the later. Generally, an exploration operator will input one or two models and output one or more models. Exploration operators can be further subdivided into exploration and exploitation operators. Exploration operators make larger moves within the search space, while exploitation operators search locally around good partial solutions. **Selection Mechanisms** Selection in SR chooses a subset of models from a larger group or population. Many methods for selection exist, including tournament, fitness-proportional, and stochastic sampling [ [Goldberg:1989:book](http://cswww.essex.ac.uk/staff/rpoli/gp-field-guide/23Selection.html#7_3) ]. SR can also be a multi-objective optimization (MOO). In this case the Pareto objective can be used for selection [ [fonseca:1993:genetic](), [horn:1994:niched]() ]. Selection is used to determine which models will undergo a subsequent process or which are privileged enough to pass on. **Pareto Selection** In multi-objective optimizations, such as SR, no single model metric is better than others. There is a trade-off between the various metrics. In SR, there are two metrics, a real-valued error and a discrete size. We prefer SR to find the simpler equations which model the data accurately. The Pareto non-dominated sort, usually just called the Pareto front, addresses this trade-off between opposing objectives [ [luke:2002:lexicographic](), [smits:2005:pareto](), [van:1998:evolutionary](), [fonseca:1993:genetic](), [horn:1994:niched]() ]. At the Pareto frontier, no one solution is better than the solutions in both complexity and performance. In mathematical terms, vector $$\vec{u} \prec \vec{v}$$ if $$\forall i, f_i(u_i) \leq f_i(v_i)$$. In the case of SR, we have first consider the discrete objective, parsimony, and then the continuous objective, accuracy. A smaller function will always dominate any function which is larger and less accurate. A large function $$g$$ dominates $$f$$ when its error is smaller. That is to say that we accept larger functions when they are more accurate and smaller functions are allowed to have more error.
Figure # - Pareto Fronts
#### Memoization Memoization is a technique which records the results of expensive computations, and later returns the cached value when requested. A primary example is computing an iterated function such as the Fibonacci sequence. Memoization is useful in these recursive cases because the repetition of computation tends to grow as the recursion deepens. [ [Wiki](https://en.wikipedia.org/wiki/Memoization) ] Dynamic Programming is a technique where a problem is broken down into smaller sub-problems which are solved independently and memoized. DP is available when there is an overlapping and optimal substructural patterns within the problem. SR is such a problem and can make use of memoization. To date, we are unaware of any work beyond our own which has done so. In Chapter 4, we will see how the SR search space can be constructed to fit the DP paradigm. This will prove to be exceedingly beneficial.
### Implementations Similarly, as in any machine learning task, there are several algorithms which try to solve the SR problem. Originally, SR was born from Genetic Programming (GP), an evolutionary algorithm for attempting this problem. The use of GP as a method for SR has been the de-facto until only recently. It is for this reason that SR as a problem is just starting to find itself researched outside of evolutionary algorithms. With the advent of Fast Function eXtraction (FFX) and Prioritized Grammar Enumeration (PGE), there are now competing methods to GP which are also deterministic. SR implementations are the assemblages of algorithms and data structures for representing, generating, and testing hypotheses. The most common implementation of SR has been the evolutionarily inspired method called Genetic Programming (GP). More recently, Fast Function eXtraction (FFX) was proposed as the first deterministic SR implementation, based on ridge and lasso regressions. The main contribution of this work is Prioritized Grammar Enumeration (PGE), a second deterministic implementation, with foundations in dynamic programming. We will briefly describe each here with further details in subsequent chapters. #### Genetic Programming GP is a stochastic search heuristic over the syntax-tree representation of an equation. GP requires several components to be defined: representation, evaluation, selection, genetic operators, initialization & termination, and population management. he Basic GP Process begins with an initial, random population of models. Then each individual is evaluated to determine fitness. This is followed by a selection process for candidate survival and replication. Survivors are then recombined using methods which resemble sexual reproduction and mutation. GP continues this process for a number of generations, until a model of desired accuracy is discovered or a computational threshold is reached.
Figure # - GP Loop
#### Fast Function eXtraction Fast Function eXtraction (FFX) is the first deterministic SR implementation [ [McConaghy:http:FFX](http://trent.st/ffx/), [McConaghy:2011:GPTP](http://trent.st/content/2011-GPTP-FFX-paper.pdf), [McConaghy:2011:CICC](http://trent.st/content/2011-CICC-FFX-paper.pdf) ]. FFX does not use genetic operators, tree based representations, or random number generation. Instead, FFX uses a Generalized Linear Model (GLM). and a set of basis functions derived from the input features. FFX then applies Pathwise Regularized Learning to fit the parameters of the basis functions to a desired complexity This makes FFX very fast, but also limits its flexibility and the complexity of solutions.
Figure # - FFX Loop
#### Prioritized Grammar Enumeration Prioritized Grammar Enumeration (PGE) is a deterministic algorithm for SR. PGE solves the SR problem from a language point of view, to prioritize the enumeration of expressions in that grammar. By recognizing the structure of the search space, PGE turns the SR problem into a graph traversal problem. PGE also recognizes multiplicity of points in in the search tree and overlaps them, resulting in the search graph. PGE uses dynamic programming techniques that memoizes previous results and prevent the compounding duplication of effort. Initially, PGE starts with a set of basis functions. Each basis function is fit to the training data, evaluated on the testing data, and placed into the Pareto Priority Queue (PPQ). The PGE algorithm iteratively pops equations from the top of the queue for processing. Each of these equations is expanded by applying the grammar's production rules recursively. The new equations are memoized using a specialized structure, the Integer Prefix Tree (IPT). The IPT uses an equation's serialized representation to track which solutions have been processed already. The IPT ensures that we only consider an equation form once. This allows PGE to eliminate duplication of effort in evaluation and expansion. Unique equations are fit to the training data, evaluated on the test data, and pushed into the priority queue. PGE continues until a model of desired accuracy is discovered or a computational threshold is reached.
Figure # - PGE Loop

### Applications Symbolic Regression can, and has been, applied to many problems from a multitude of domains. This is possible because the underlying tree or trie can represent multiple types of equations. The different types of equation can, in turn, model different problems. Here we describe the different types of equations and provide some sample applications. #### Equation Types | Problem Type | Target Equation Form | | ----------------------- | -------------------- | | Explicit Equations | $$y = f(\vec{x})$$ | | Differential Equations | $$\frac{dx_i}{dt} = f(\vec{x})$$ | | Invariants | $$F(\vec{x}) = C$$ | **Explicit Equations** Explicit functions are the most basic equation type. These are a direct mapping, or rule, from the independent input variables to the dependent output variable. Data is plugged in and the output is the result of depth-first evaluation. The most familiar of these is the line: $$y = ax + b$$. The line has one input variable and one output variable. Other familiar examples are the polynomial: $$y = ax^2 + bx + c$$ and the plane: $$z = ax + by + c$$. A real-world example is calculating housing price based on square footage and other features of a house. **Differential Equations and Dynamical Systems** Differential equations are functions which relate a quantity to its rate of change. They are prominent through the sciences. Dynamical systems are a set of differential equations, either heterogeneous or homogeneous. Each differential equations, in turn, is a rule for the rate of change, in one variable over time, to the current state of all variables. That is to say that the future depends upon the past. To evaluate differential equations, techniques of numerical integration are required. There are many methods of numerical integration [[kress:98:num_analysis]()]. Runge-Kutta 4 (RK4) is one such technique. It makes four smaller steps in time, updating the input variables along the way. Since RK4 involves four evaluations of a function, it requires four times the computational effort. RK4 also requires all of the differential equations, for a dynamical system, in order to simulate the system. RK4 integrates from current state to produce the next state of the system. It thus requires a temporal ordering of the data. We use the RK4 algorithm in this work to produce the data sets for the dynamical systems. When performing an SR search, it is possible to decompose the differential equations for separate evaluation. We can approximate the integrated values of the other variables by substituting interpolated data. This partitions the SR search making the equation recovery simpler task [ [hod:07:pnas]() ]. We call this Partitioned RK4 (PRK4). PRK4 requires the current and next states of the system, and four evaluations, in order to measure the fitness of an equation. It also maintains the temporal restriction of data in the evaluation process. An alternative evaluation method for differential equations exists, if the numerical derivatives of the data can be calculated. The equations fitness is then measured by its ability to predict the numerical derivative. There are two reasons for doing this. The first is that small variations in the equation result in a larger magnitude of difference from the fitness function. The integrals of $$x^2$$ and $$x^2+1$$ diverge linearly, which means the integrated values of RK4 are very similar. The fitness landscape of the numerical derivative method is less smooth. The second reason for evaluating against the numerical derivative data is that we can perform a single point evaluation. We perform four times fewer evaluations and we can perform them at arbitrary points [ [hod:08:mining]() ]. **Invariants and Partial Differential Equations** The natural extension to differential equations is Partial Differential Equations (PDE). They are analogous in that they relate a function to rate of change. PDEs allow more than one independent variable, such as x, y, and z in three dimensional space. Invariant represent the solution to such a system system and are also called a conserved quantity. Well known examples from physics are the conservation of energy, mass, and angular momentum. Previous work has shown that it is actually possible to uncover the Hamiltonians for conservation of angular momentum, directly from measured data [ [hod:09:science](), [hod:09:implicit_long]() ]. This begs the question:
Are there conserved quantities in biological or ecological systems.
Despite the potential for, PDEs and invariants are beyond the scope of this work.
top ### Further Considerations #### Separating Task from Implementation In this work, we view SR as only a problem and view GP as an algorithm for performing SR. We attempt to make a distinction which the SR literature has not by using GP as a term for both the problem *and* solution. We view SR, not as a sub-class, but as a super-class to GP, as the regression within a grammar. A particular programming language is a specific grammar and an instance of the SR problem. So can be circuit or atomic regression. Mathematical equations are another instance of the SR problem and the one we focus on. We want to reiterate the difference between problem and implementation. Symbolic Regression is the task which uncovers relationships in data by searching the expression space defined by grammar. An implementation is the assemblage of methods which realize the Symbolic Regression task. We believe the lack of separation has led to some missed opportunities is SR research. In this work, we step back from SR to fundamentally change the way we think about regressing symbolic expressions. We also make this distinction because recent work [[McConaghy:2011:FFX]()] introduces, to our knowledge, the first deterministic algorithm for SR called Fast Function eXtraction (FFX), offering a new means of performing an equation search. #### Benchmark Problems Benchmarking has been a problem for SR research. Candidate fitness metrics, methods for comparing implementations, and benchmark problems vary widely across the GP field. This situation has made effective comparison and progress difficult to assess. In order to compare SR implementations, and before trusting them on unsolved problems, methods for quantifying the effectiveness of their search capabilities are needed. In 2012, [McDermott:2012:benchmarks]() surveyed three years of literature from EuroGP and GECCO GP track, bringing this issue to the forefront of the community. Their aim was to start a discussion on unifying and standardizing the evaluation process in GP. We agree with these ideals and use 22 of their SR target functions for the evaluation of PGE. We do, however, disagree with the assumption in ~\cite{McDermott:2012:benchmarks}, that results should not be expected to be repeatable, and thus unverifiable by a third party. A non-GP practitioner will not likely use a tool which gives different answers each time it is used. This has been partially addressed by *rate of convergence* (how often an implementation finds an answer) and *cumulative probability of success* (the probability that an ideal solution would be found on or prior to generation $$i$$). Both of these methods require many trials. Nevertheless, we agree that on optimum is less obtainable and that a consensus needs to be reached on unbiased methods for comparison between different implementations. #### Metrics for Success **Accuracy of Candidate Solutions** Accuracy predicts a candidates ability to model a given data set. Depending on the goals in using SR accuracy may or may not be the right measure. If we seek the most predictive model then accuracy can tell us which model is best. However, it is often the goal in SR to provide insight through meaningful relations. This is the trade-off between accuracy and parsimony. We want the simplest equation that isn't to simple. In SR, we usually return multiple models instead of one single best model. This allows the domain expert can evaluate the varying degrees of complexity and accuracy. It is usually the case that the most accurate model has also over fit the data set. Over fit models high frequency signals within the data by adding subexpressions to fit these signals. Validation sets overcome this issue by testing the model on unseen data after training. This results is an unbiased measurement, unless the learner is also being optimized. In that case, cross validation or a third data set is used as the never-before-seen testing data. **Rate of Convergence** The rate of convergence is defined as an implementation's probability of finding the correct solution. This metric is because GP is a randomized algorithm which produces inconsistent results. A randomized algorithm's results depend on the initial starting points. GP has non-deterministic behavior when creating the initial equation population and when genetic operations are applied during breeding. To compensate for the lack of consistency, GP researchers run many trials. The final results of all runs are tallied to come up with a probability for success. Rate of convergence can be a good measure when comparing the relative efficacy of GP implementations. It can also be used to tune the parameters of the GP implementation. PGE doesn't have a rate of convergence. It is a deterministic algorithm which executes the same way, in the same order, every time. Thus, given a set of parameters, PGE will either find the solution or it will not. The ability of PGE to find a solution depends upon the expansion functions and the amount of time alloted for searching. Rate of convergence will not allow us to make direct comparison between GP and PGE. It will permit us to loosely infer problem complexity with the both implementations difficulty in finding the 'correct' solution. Further complicating this issue is that determining convergence is not well defined. The literature has used hit ratio, error thresholds, and $$R^2$$ correlation. Real convergence is the matching of form, which can be done by manual inspection or by tree comparisons. To our knowledge, no one has reported this type of convergence by directly stating how close the results are to the true form. **Is the solution present?** Determining if the exact solution is present only works for benchmark problems. Optimizing for this can mean the researcher has overfit their algorithm to the benchmark problems. Even so, an algorithm should return the answer to most benchmark problems. The difficulties with this metric have been that GP is generally run multiple times to form the rate of convergence metric. It can become cumbersome to inspect the results for exact solution. Adding more difficulty to this process is the SR research has not considered the algebraic variations on an equation, that is, there is no canonical form. This makes the use of automated methods ineffective for detecting if the solution is present. #### Metrics for Effort **Number of Iterations** is often equated with how many steps it takes to find a solution. GP has a natural step between parents and children and is referred to as generations. Each generational step uses the current solutions and genetic operators to produce new candidate solutions. The number of generations is loosely tied to how long it takes a GP implementation to converge to an answer. In PGE, the generational step is applying the grammar productions to the best candidate solutions. Measuring performance by iterations is easily a biased measure. It requires careful attention to detail. Different implementations use different logic and parameters, searching different amounts of equation space per iteration. Varying the number of equations or the layout of the population effects the search rate. Fewer iterations is not always better, but can be a useful measure to minimize when optimizing the parameters within a particular SR implementation. When comparing between implementations, iterations becomes misleading if other factors are not considered. **Clock Time** Clock time tells us how long, from when we hit enter, until we get our answer. All other considerations equal, it is an unbiased measure. Generally every thing is not equal, and why comparing several measures is useful and no measure should be used in isolation. At some point, algorithms and computers become fast enough at a problem that time becomes less of an issue. To a domain expert, the difference between an hour and two, or hours and a day, may not be important. When they are modeling a large data set for a particular problem, a 24 hour turnaround is often acceptable, especially when one considers the time it takes to publish in a journal. Time, despite being something not to fool with, is a powerful measure. Small incremental improvements compound in time to create meaningful progress. When time is coupled with other metrics, we can gain measures for the amount of work. **Number of Evaluations** Amount of work done is a good metric when we want the most return for what we put in. Work, however, can be difficult no quantify. In SR, work is usually measured by counting the number of evaluations made, and relating this measurement over time. Counting evaluations is easy in GP. Calculation is population size multiplied by generations and the number of data points. In PGE, the scheme is similar, however the number of models per iteration varies. A further unknown arises with the use of iterative fitting schemes like those used in FFX and with nonlinear regression. There is an extra iteration loop over the data points per model which needs to be accounted for. **Number of Models** Counting the number of candidates tells us how many equations we actually considered. If we count the number of evaluations we made on each equation, we get a measure of work. Most work occurs during the evaluation phase of the search process. The computation complexity of simulation and evaluation, far outweighs the time taken for the remainder of the search process. We can reduce this work by reducing either of the terms, number of equations or number of evaluations. We can further refine this by considering the number of unique equations evaluated compared to the total number. We want to be smart about how many and which, equations and data points we choose. Coupling the unique to total equations with the number of evaluations will provide us a good comparison between the SR implementations. GP evaluates many equations fewer times, but experiences great amount of redundancy in form. PGE spends a larger amount of effort per equation, considers a form only once. #### Reproducibility and the Importance of Determinism This deterministic nature of FFX and PGE is an important feature for a SR algorithm. It is because of the link between human, math, and science. In science we expect reproducible results. We believe this is why GP has not become a predominant technology as predicted by its proponents, and likewise, SR for being so tightly coupled to GP for much of its history. Here, we take a strong stance that reproducibility is one of the key features to a SR implementation. Reproducibility provides the necessary reliability for scientific discovery and adoption of a new tool, as well as the foundation upon which research into SR can stand. Deterministic execution has benefits. First, the algorithm is consistent and repeatable. This means that researchers are able to reproduce results easily. This is an important feature with large portions of research unable to be replicated [[moraila:2013:measuring](), [MORE..]() ]. Second, deterministic execution means that modifications to the algorithm can be studied in isolation with their effects becoming much easier to measure. Third, the PGE algorithm only needs to be run once on a given data set. This is a significant advantage over GP which requires 30 or more trials in order to obtain statistically significant results. In all of our modifications to the algorithm, we maintained PGE's deterministic execution nature. The following two quotes from prior publications are included in the hope that it will frame this work in the larger context of producing usable SR tools and technologies. Much of the research in SR aims at improving the GP implementation, with only limited results and inconsistent comparisons. By reconsidering the definition of the SR problem, we believe that PGE is an evolution in perspective and approach. Through determinism we may further our understanding of the problem and find ways to create such a true Symbolic Regression technology.
Genetic programming (GP) is not a field noted for the rigor of its benchmarking. Some of its benchmark problems are popular purely through historical contingency, and they can be criticized as too easy or as providing misleading information concerning real-world performance, but they persist largely because of inertia and the lack of good alternatives. Even where the problems themselves are impeccable, comparisons between studies are made more difficult by the lack of standardization. We argue that the definition of standard benchmarks is an essential step in the maturation of the field. We make several contributions towards this goal. We motivate the development of a benchmark suite and define its goals; we survey existing practice; we enumerate many candidate benchmarks; we report progress on reference implementations; and we set out a concrete plan for gathering feedback from the GP community that would, if adopted, lead to a standard set of benchmarks.
~ McDermott:2012:benchmarks

Outside the GP literature, SR is rare; there are only scattered references such as (Langley et al., 1987). In contrast, the GP literature has dozens of papers on SR every year; even the previous GPTP had seven papers involving SR (Riolo et al., 2010). In a sense, the home field of SR is GP. This means, of course, that when authors aim at SR, they start with GP, and look to modify GP to improve speed, scalability, reliability, interpretability, etc. The improvements are typically 2x to 10x, but fall short of performance that would makes SR a “technology” the way LS or linear programming is. We are aiming for SR as a technology. What if we did not constrain ourselves to using GP? To GP researchers, this may seem heretical at first glance. But if the aim is truly to improve SR, then this should pose no issue. And in fact, we argue that the GP literature is still an appropriate home for such work, because (a) GP authors doing SR deeply care about SR problems, and (b) as already mentioned, GP is where all the SR publications are. Of course, we can draw inspiration from GP literature, but also many other potentially-useful fields.
~ McConaghy:2011:FFX
">next   index

## Introduction

The primary setting for this work is Symbolic Regression (SR), the task of deriving mathematical formula from observational data without any fore-knowledge of the domain or problem. In essence, this is the scientific process performed by a computer. Hypotheses are formulated, tested against the observations, and compared for explanatory value. Because the resulting models are mathematical formula, SR can assist domain experts in almost all fields.

The main contribution of this work is Prioritized Grammar Enumeration (PGE), a deterministic machine learning algorithm for solving Symbolic Regression. Working with a grammar’s rules, PGE prioritizes the enumeration of mathematical expressions in order to find the best fit model. By recognizing large overlaps in the search space and introducing mechanisms for memoization, PGE can exploring the space of all equations efficiently. Most notably, PGE provides reproducibility of results, a key aspect to any system used by scientists at large.

##### A Brief Computational History

For many thousands of years, humans have been using mathematics to understand and study the world. Mathematics is our bridge between the world, the science, and each other. It is through communications in this language, the writing of formulaic descriptions, that we share our discoveries and understandings with each other. Without a doubt, mathematics has been an integral part of human growth and progress.

With the invention of calculus, and the start of the Industrial Revolution, equations became synonymous with progress. Machines and math brought order to chaos. During World War II it was becoming apparent that humans were reaching the limit of their calculation abilities. The US military was employing hundreds of women, known as “computers,” to perform calculations for artillery aiming tables. Completing the table for a single artillery gun would take several months and was prone to errors. It was during these dark times that the first electronic computers were used to replace the human calculators. The EINAC was programmed to simulate the differential equations for an artillery shell fired under different conditions. EINAC could simulate thousands of artillery shells and produce the same firing tables in less than a day. With the end of the war EINAC and other early computers were put to use calculating weather models, making financial predictions, and simulating complex systems.

As computers have permeated our society, we have experienced waves of automation and analytics in every domain. The thousands of engineering and scientific breakthroughs that define the world we live in today were only possible with the progress of computers. While data science and machine learning have made great contributions and are achieving incredible feats, it has remained humans’ responsibility for discovering the models and mathematics which we put into our computers.

Enter Symbolic Regression (SR), a machine learning problem for discovering the mathematics, the equations and formula, that define the behavior described in the data. The Symbolic Regression method is much like Newton writing down his Law of Universal Gravitation by analyzing the movement of the planets. Scientists observe the world around them and write down mathematical formula which accurately and concisely explain behavior to other humans. Symbolic Regression automates this process, from observation to mathematical formula.

##### A First Example

Let’s begin with a simple example to motivate the need for Symbolic Regression tools. We’ll then follow with a real and complex example that will follow us throughout.

Below, you will see a plot which compares test scores to hours studied. As one would expect, this looks like a strong linear relationship, scores to go up with hours spent studying.

We know that you can only study so much before the law of diminishing returns and burnout sets in. So how does a 2nd-order polynomial fit with the data?

This model looks much better and the $R^2$ value has improved as well. How about a 3rd-order polynomial?

Even better, though only a marginal improvement. If we continue down this path, we will likely start overfitting.

So which is it? Which model should we choose as our final model? The automated answer lies within model selection, a deep and complex problem. The breadth of model selection is largely beyond the scope of this work, however, PGE and SR implementations have an internal model selection scheme which is integral to their success. In the end, it is up to a domain expert to make the final determination.

##### A Motivating Example

Now for our complex example from a real life system. The following chart shows the time-series for seven variables related to yeast processing sugar. Mmmmmm beer!

(or more generally, The Art of Fermentation)

It took humans years of study and analysis to formulate the differential equations for yeast metabolism [Wolf:2000:BiochemJ]. The set of equations help scientists to understand the interactions and complex cellular behaviors. They also allow for increased accuracy in simulation under a greater diversity of situations. These capabilities enables researchers to rapidly prototype before moving to laboratory tests.

In 2011, machines recovered the equations for the first time [Schmidt:2011:PhysBiol]. With Symbolic Regression technology, we can help scientists find useful models that shed light on the problems they study. The Genetic Programming methods first employed on this problem required computational time on the order of hours. This is an incredible feat. Later we will see how Prioritized Grammar Enumeration can solve this same task on the order of minutes.

##### Contributions

The main contribution of this work is Prioritized Grammar Enumeration (PGE), a method which can derive mathematical formula from data in an efficient and reproducible way. We step back from mainstream Genetic Programming (GP) thinking, to fundamentally change the way we approach the Symbolic Regression (SR) problem. The overall goal of SR is to produce equations which find a balance in the trade-off between accuracy and complexity, the method of getting there need not be fixed. By reconsidering the fundamental approach to SR problem, we believe that PGE is an evolution in thought, leading the way to a reliable SR technology.

PGE brief

PGE basic

1. Redefining the problem
2. Creating a deterministic and reproducible algorithm

We then enhance the PGE algorithm in several ways.

1. More application domains through differential equations.
2. Deeper abstractions and richer relationships.
3. Scaling the algorithm by decoupling into services
4. Combine multiple experiments to return more symbolic expressions
##### Results

Will fill this in later after the chapter(s) are written.

In~\cite{meier:2014:symbolic}, PGE derived simple, compact functions for predicting precrash severity in automobile accidents in less than 2ms, exceeding the required real-time constraint by several orders of magnitude.

##### Reproducibility

A running theme to this work is Reproducibility.

There have been many controversies recently…

Estimates of irreproducibility…

Oncology: 11%

Computer Science: 25%

Psychology: 39%

We champion reproducibility by

1. Deterministic algorithm which will reproduce itself
2. Open source Python project (PyPGE)
3. Accurate portrayal of results

Each chapter will end with a section on reproducibility and the ideas relevant to that chapter.

### The Problem The overall goal of SR is to produce models which find a balance in the trade-off between accuracy and complexity. The best models will therefore be both simple and explanatory, elucidating the dynamics of the system under study. In this way, an expert is freed to think about the larger and more complex aspects of that system, gaining insights from the results of SR. The domain of SR is to model a dataset with a mathematical formula or equation. SR makes no a priori assumptions about the form of a model. In SR, the search encompases the space of all equations. Compare this with more traditional methods for regression. Linear and non-linear regression assume a model and fit the model's parameters to the data on hand. More sophisticated methods like the Fourier transform and neural networks can model almost any signal or function. However, they don't elucidate an understanding of the interactions and behavior of that signal or function. SR is often considered a generalization of traditional methods for regression. A SR implementation will generate, tests, and validated a massive number of equations. Thus it is flexible, capable of modelling almost any signal, and also explanitory, producing models which are easily interpreted by humans. Because SR models data with equations, it also finds a vast array of application domains. Simple relationships are modeled with explicit functions. Time variant relationships are modeled with differential equations. Several differential equations can be combined into a set, forming a dynamical system. Relationships with vary in both time and space are modeled with partial differential equations. Invariants, or conserved quantities can also be modeled with an equation and its partial derivatives. These are all incantations of ideas from calculus and are prominent throughout the sciences and engineering. From cell biology to ecology, physics to astronomy, chemistry to pharmaceuticals, engineering and finance; all fields rely on mathematical notions and theory to produce solid foundations upon which decisions can be made. Looking forward, a SR technology will. be a tool of great assistance to domain experts. The ability to automatically discover analytical models from our increasing volumes and complexity of data is paramount to continued scientific progress. The equation and formula models are human interpretable, precisely because they are a human conception and language for talking about functional relationships found in the world. It is the relation between humans, math, and science; that SR holds a special place. As a tool for helping humans make sense in a data driven world. SR holds the potential to accelerate the pace of scientific breakthroughs. #### The Search Space The SR search space is the space of all equations. This search space is defined by a grammar through its terminals, non-terminals, and production rules. It is an exponential space that can be represented as a tree or a graph. As a tree, the space is the ordered expansion of applied productions, with the start symbol at the root. The graph is similar, but connects nodes in the tree that are equivalent. In the graph, there are multiple paths to an equation, coinciding with the multiple derivations for an equations parse tree. Defining the building blocks from which equations can be built, also defines the space of representable expressions that be searched. The size of this space is infinite even when disregarding the massive increase brought on by real valued coefficients. This is because a grammar is infinite through its production rules. Consider the following spaces: 1. The space of all polynomials in one variable 1. The space of all polynomials in more than one variable 1. The space of all polynomials with trigonometric functions 1. Increasing the depth of expressions (allowing parentheses ()') 1. Allowing division (put any of the previous spaces on top, bottom, or both) The complexity of the search space is also linked to how the search space is represented. Tree representations create a large fanout ratio in the search space, with repetition. Using dynamic programming, that repetition can be detected and the search space simplified. The next section will describe various representations used in SR. Subsequent chapters, especially the PGE Chapter, will discuss the search space and these issues further.
### The Components Like all machine learning tasks, SR can be logically broken into three components: Representation, Evaluation, and Optimization [[Pedro:20??:ml](), [Mustafa - Learning from Data]()]. Representations are the conceptual and implementation models we choose, and which define the hypothesis space. Evaluation determines a model's fitness, and is defined by the objective function. Optimization is then the technique which explores the representational space and maximizes model fitness. Often these three components have a high degree of coupling or co-dependence. That is, optimization techniques usually have a representation and objective in mind. In our work, we additionally include a memoization component for tracking the sub-problems previously solved. Memoization is necessary in SR as the representational space contains an extremely high degree of overlapping subproblems. If we did not detect overlap, we would experience a high degree of repetition and process the same equations many times. To our knowledge, this is a first for the SR research community. #### Representation in Symbolic Regression In SR, equations need to be represented *in cilico*. The choice of representation determines the methods we can later use to work with the equations. As we will see, there are tradeoffs when choosing between them. There is also opportunity to use multiple representations simultaneously and will prove to be beneficial. Of the many equation representations which have been explored in SR research, binary trees are the most common. They fit naturally with the theoretical concepts and simplify the underlying algorithms. Tree representationss in SR can be subdivided into binary and n-ary. Binary trees are natural to computers, n-ary tries are natural to algebra. Graph and linear representations have been used as well. All of the representations are interchangeable and we are able to transform between them. This is important as each representation has its benefits and can be useful to different subtasks in the SR problem. **Linguistic Foundations** Before we explore each representation, some context for the discussion is in order. In this work, we take a strong point of view from a language perspective. This perspective mainly comes from a computational background, in terms of regular languages, but also from and algebraic perspective. It was this perspective which lead to the PGE algorithm and underpins much of our foundation. From a programmers point of view, a grammar defines a language in a recursive manner [ [Aho:1972:TPT](http://dl.acm.org/citation.cfm?id=578789), [Dragon Book](http://dragonbook.stanford.edu/) ]. A grammar is comprised of a start symbol, nonterminals, terminals, and production rules. The language generated by a grammar G, denoted L(G), is the set of sentences generated by G. A sentence *S* is in L(G) if there is a sequence productions which results *S*. This sequence of productions constructs the derivation for *S*. If a sentence *S* has a derivation then it is in the language G. Thus, grammar allows us to determine if a sentence is in a language. Below is a simplified example for a grammar for mathematics. START -> E E -> E + T | E - T | T T -> T * N | T / N | N N -> Cos(E) | Sin(E) | Tan(E) | Log(E) | Exp(E) | L L -> (E) | -(E) | (E)^(E) | TERM TERM -> Constant | Variable When working with languages in computers, expressions are represented as trees. The trees represent the application of the production rules for a grammar. They are often refered to as syntax or parse trees. The grammar can then be used to determine if a tree is valid by finding a matching set of production rule applications. In PGE, the production rules will also provide a means to recursively enumerate sentences in a language. **Binary Trees** In Symbolic Regression, equations are constructed as binary trees made of basic components called building blocks. The tree of building blocks makes up the *DNA* of an equation. It matches the parse or derivation tree for a given equation, and can also be called its Abstract Syntax Tree (AST). Building blocks come in two types, operators and operands. Operators are the internal nodes of the AST and include familiar mathematical functions $$(+,-,*,/,sin,cos,...)$$. Operands are the terminals or leafs of the AST and represent state-variables, constants, and real-valued numbers. There are two ways to represent constant, indexed and floating point. The real value floating point representation has been the most common in GP literature. The value is randomly assigned, mutated by multiplying it by a small number or swapping it out with another during crossover. Indexing associates a constant in the tree with an element into an array. This permits the coefficients to be optimized outside of the equation form. This can be done by a second evolutionary algorithm or by standard regression techniques. We use real-valued coefficients in GP and indexed coefficients in PGE. **Algebra Tries** The algebra *trie*, or just plain *trie*, is a tree structure where each operand can have a variable number of children. The algebra trie representation only affects two operators in SR, addition and multiplication. In fact, in any tree based representation, some operators (internal nodes) such as $$cos$$ and $$log$$ always have only one outgoing edge to their operand. Figure \ref{fig:eqn-trees} shows the comparable representations of several equations as both binary and algebra tries.
Figure # - Binary Trees and Algebra Tries
The algebra trie representation has no effect on the constituents or modeling ability of the equation. It does however, alter the equation tree's size and structure. Changing size of equations effects the trade-off between parsimony and accuracy. This multi-objective trade-off, often called Pareto sorting, is the multi-objective optimization search of SR. Any change to the fitness function undoubtedly effects the operation of any SR implementation. Additionally, the trie representation eases the design of simplification algorithms. These algorithms work within the tree to rewrite it to a canonical form. The simplification algorithms are analogous to rewrite rules from parser literature [ [Aho:1972:TPT](http://dl.acm.org/citation.cfm?id=578789) ], and are reminiscent of high school algebra. The implications of tries and associated algorithms have largely been unexplored in the SR community. Despite the lack of detailed analyses, our theories and experience have shown their use to greately increase the tractability of SR. As such, we will use them exclusively and omit the use of binary trees. **Linear Forms** Linear representations come in a few flavors and are usually the result of a transformation. Linear GP considers that a program tree is compiled into a linear set of machine instructions. In Linear GP, the evolution occurs on the machine code by modifying the sequence of instructions [ [Refs Here](http://cswww.essex.ac.uk/staff/rpoli/gp-field-guide/71LinearGeneticProgramming.html#13_1) ]. In Fast Function eXtraction [ [McConaghy:2011:FFX]() ], equations are represented as a linear combination of simple basis functions. FFX uses non-GP methods to determine which bases to use and what the coefficient should be. **Graph Representations** Graph representations encode the same information as tree representations. The difference is that a subtree which appears more than once in the tree, can have a single instance connected to by different parts of a graph representation [ [Hod:2007:graphrep]()]. A second variation is to maintain a library of small trees. Then, indexing nodes reference which basis to use and where to use it [[miller:2000:cartesian]()]. With graph representations it is possible to simultaneously update several parts of a model, to its benefit or detriment. **Restrictions and Simplifications** Simple restrictions disallow invalid mathematics, operations such as dividing by zero or taking the logarithm of a negative number. This is sometimes called interval arithmatic in the literature \cite{keijzer:03:improving}. In general, we wish to only consider valid equations. Simplification groups like terms together, remove sub-expressions which evaluate to a constant value ($$\sin\pi$$, $$\frac{x}{x}$$, or $$x-x$$), and reduce unnecessarily complex expressions such as $$\frac{1}{\frac{1}{x}}$$. These algorithms, known as tree rewriting systems [ [Joshi:1975:TAG](), [dershowitz:1982:orderings](), [huet:1980:equations]() ], convert one parse tree to an equivalent and simpler tree. There is debate as to how simplification effects the SR process [ [kinzett:2008:using](), [kinzett:2009:online](), [mcree:2010:symbolic]() ]. Simplifications effect the structure of the AST through its manipulation, which in turn effects the areas of the overall search space. In addition, depth and size bounds are often placed on the equation's derivation tree. Placing bounds on the tree explicitly places bounds on the search space. Generally, a small bound on the size or depth is started with and increased if the models returned are unsatisfactory. #### Evaluation in Symbolic Regression In SR, equations need to be evaluated to assess their accuracy and fitness. This involves processing on the training data, measuring the error, and then constructing a fitness value to be used for relative comparison. **Point evaluation** is evaluating a model representation on a data point. This is generally a straight forward process and is repeated for every point in the training data to produce the predicted values. **Accuracy metrics** are methods for estimating the residual error of a model. Error is estimated by comparing the predicted values with the true values using a standard measure. Typical measures include: 1. Mean Absolute Error (MAE) 1. Mean Squared Error (MSE) 1. Root Mean Absolute Error (RMAE) 1. Root Mean Squared Error (RMSE) Less direct accuracy calculations include: total hits, which tallies the number of data points that an equation is 'close' to, determined by a threshold. Ranking calculates a score based on relative order by considering the Pareto dominates, is dominated ratio. The most common accuracy metrics are mean squared error and average absolute error. **Fitness values** are used to determine the relative fitness of individuals as opposed to the absolute fitness. In SR, they are generally constructed from the values of competing objective functions. The most typical is to minimize complexity while minimizing error. Complexity is usually measured as the number of nodes that are contained in the tree. The values used for fitness can further be manipulated prior to model comparison. The values can be weighted to express relative emphasis during optimization and skew the balance. Normalization of the values, on a per value basis, limits the effect of scale between fitness components. The next section describes how fitness value are used to make the determination of relative fitness. #### Optimization in Symbolic Regression Optimization is then the technique which explores the representational space and maximizes model fitness. This phase is highly related to the method for solving SR. Arguably, one could say this is the most significant point of difference between the various methods. The next section describes the three main methods for SR.
Figure # - SR Loop
**Exploration operators** Exploration operators determine how the search moves around the search space. The method of exploration is tied to the method of representation, since the former must act upon the later. Generally, an exploration operator will input one or two models and output one or more models. Exploration operators can be further subdivided into exploration and exploitation operators. Exploration operators make larger moves within the search space, while exploitation operators search locally around good partial solutions. **Selection Mechanisms** Selection in SR chooses a subset of models from a larger group or population. Many methods for selection exist, including tournament, fitness-proportional, and stochastic sampling [ [Goldberg:1989:book](http://cswww.essex.ac.uk/staff/rpoli/gp-field-guide/23Selection.html#7_3) ]. SR can also be a multi-objective optimization (MOO). In this case the Pareto objective can be used for selection [ [fonseca:1993:genetic](), [horn:1994:niched]() ]. Selection is used to determine which models will undergo a subsequent process or which are privileged enough to pass on. **Pareto Selection** In multi-objective optimizations, such as SR, no single model metric is better than others. There is a trade-off between the various metrics. In SR, there are two metrics, a real-valued error and a discrete size. We prefer SR to find the simpler equations which model the data accurately. The Pareto non-dominated sort, usually just called the Pareto front, addresses this trade-off between opposing objectives [ [luke:2002:lexicographic](), [smits:2005:pareto](), [van:1998:evolutionary](), [fonseca:1993:genetic](), [horn:1994:niched]() ]. At the Pareto frontier, no one solution is better than the solutions in both complexity and performance. In mathematical terms, vector $$\vec{u} \prec \vec{v}$$ if $$\forall i, f_i(u_i) \leq f_i(v_i)$$. In the case of SR, we have first consider the discrete objective, parsimony, and then the continuous objective, accuracy. A smaller function will always dominate any function which is larger and less accurate. A large function $$g$$ dominates $$f$$ when its error is smaller. That is to say that we accept larger functions when they are more accurate and smaller functions are allowed to have more error.
Figure # - Pareto Fronts
#### Memoization Memoization is a technique which records the results of expensive computations, and later returns the cached value when requested. A primary example is computing an iterated function such as the Fibonacci sequence. Memoization is useful in these recursive cases because the repetition of computation tends to grow as the recursion deepens. [ [Wiki](https://en.wikipedia.org/wiki/Memoization) ] Dynamic Programming is a technique where a problem is broken down into smaller sub-problems which are solved independently and memoized. DP is available when there is an overlapping and optimal substructural patterns within the problem. SR is such a problem and can make use of memoization. To date, we are unaware of any work beyond our own which has done so. In Chapter 4, we will see how the SR search space can be constructed to fit the DP paradigm. This will prove to be exceedingly beneficial.
### Implementations Similarly, as in any machine learning task, there are several algorithms which try to solve the SR problem. Originally, SR was born from Genetic Programming (GP), an evolutionary algorithm for attempting this problem. The use of GP as a method for SR has been the de-facto until only recently. It is for this reason that SR as a problem is just starting to find itself researched outside of evolutionary algorithms. With the advent of Fast Function eXtraction (FFX) and Prioritized Grammar Enumeration (PGE), there are now competing methods to GP which are also deterministic. SR implementations are the assemblages of algorithms and data structures for representing, generating, and testing hypotheses. The most common implementation of SR has been the evolutionarily inspired method called Genetic Programming (GP). More recently, Fast Function eXtraction (FFX) was proposed as the first deterministic SR implementation, based on ridge and lasso regressions. The main contribution of this work is Prioritized Grammar Enumeration (PGE), a second deterministic implementation, with foundations in dynamic programming. We will briefly describe each here with further details in subsequent chapters. #### Genetic Programming GP is a stochastic search heuristic over the syntax-tree representation of an equation. GP requires several components to be defined: representation, evaluation, selection, genetic operators, initialization & termination, and population management. he Basic GP Process begins with an initial, random population of models. Then each individual is evaluated to determine fitness. This is followed by a selection process for candidate survival and replication. Survivors are then recombined using methods which resemble sexual reproduction and mutation. GP continues this process for a number of generations, until a model of desired accuracy is discovered or a computational threshold is reached.
Figure # - GP Loop
#### Fast Function eXtraction Fast Function eXtraction (FFX) is the first deterministic SR implementation [ [McConaghy:http:FFX](http://trent.st/ffx/), [McConaghy:2011:GPTP](http://trent.st/content/2011-GPTP-FFX-paper.pdf), [McConaghy:2011:CICC](http://trent.st/content/2011-CICC-FFX-paper.pdf) ]. FFX does not use genetic operators, tree based representations, or random number generation. Instead, FFX uses a Generalized Linear Model (GLM). and a set of basis functions derived from the input features. FFX then applies Pathwise Regularized Learning to fit the parameters of the basis functions to a desired complexity This makes FFX very fast, but also limits its flexibility and the complexity of solutions.
Figure # - FFX Loop
#### Prioritized Grammar Enumeration Prioritized Grammar Enumeration (PGE) is a deterministic algorithm for SR. PGE solves the SR problem from a language point of view, to prioritize the enumeration of expressions in that grammar. By recognizing the structure of the search space, PGE turns the SR problem into a graph traversal problem. PGE also recognizes multiplicity of points in in the search tree and overlaps them, resulting in the search graph. PGE uses dynamic programming techniques that memoizes previous results and prevent the compounding duplication of effort. Initially, PGE starts with a set of basis functions. Each basis function is fit to the training data, evaluated on the testing data, and placed into the Pareto Priority Queue (PPQ). The PGE algorithm iteratively pops equations from the top of the queue for processing. Each of these equations is expanded by applying the grammar's production rules recursively. The new equations are memoized using a specialized structure, the Integer Prefix Tree (IPT). The IPT uses an equation's serialized representation to track which solutions have been processed already. The IPT ensures that we only consider an equation form once. This allows PGE to eliminate duplication of effort in evaluation and expansion. Unique equations are fit to the training data, evaluated on the test data, and pushed into the priority queue. PGE continues until a model of desired accuracy is discovered or a computational threshold is reached.
Figure # - PGE Loop

### Applications Symbolic Regression can, and has been, applied to many problems from a multitude of domains. This is possible because the underlying tree or trie can represent multiple types of equations. The different types of equation can, in turn, model different problems. Here we describe the different types of equations and provide some sample applications. #### Equation Types | Problem Type | Target Equation Form | | ----------------------- | -------------------- | | Explicit Equations | $$y = f(\vec{x})$$ | | Differential Equations | $$\frac{dx_i}{dt} = f(\vec{x})$$ | | Invariants | $$F(\vec{x}) = C$$ | **Explicit Equations** Explicit functions are the most basic equation type. These are a direct mapping, or rule, from the independent input variables to the dependent output variable. Data is plugged in and the output is the result of depth-first evaluation. The most familiar of these is the line: $$y = ax + b$$. The line has one input variable and one output variable. Other familiar examples are the polynomial: $$y = ax^2 + bx + c$$ and the plane: $$z = ax + by + c$$. A real-world example is calculating housing price based on square footage and other features of a house. **Differential Equations and Dynamical Systems** Differential equations are functions which relate a quantity to its rate of change. They are prominent through the sciences. Dynamical systems are a set of differential equations, either heterogeneous or homogeneous. Each differential equations, in turn, is a rule for the rate of change, in one variable over time, to the current state of all variables. That is to say that the future depends upon the past. To evaluate differential equations, techniques of numerical integration are required. There are many methods of numerical integration [[kress:98:num_analysis]()]. Runge-Kutta 4 (RK4) is one such technique. It makes four smaller steps in time, updating the input variables along the way. Since RK4 involves four evaluations of a function, it requires four times the computational effort. RK4 also requires all of the differential equations, for a dynamical system, in order to simulate the system. RK4 integrates from current state to produce the next state of the system. It thus requires a temporal ordering of the data. We use the RK4 algorithm in this work to produce the data sets for the dynamical systems. When performing an SR search, it is possible to decompose the differential equations for separate evaluation. We can approximate the integrated values of the other variables by substituting interpolated data. This partitions the SR search making the equation recovery simpler task [ [hod:07:pnas]() ]. We call this Partitioned RK4 (PRK4). PRK4 requires the current and next states of the system, and four evaluations, in order to measure the fitness of an equation. It also maintains the temporal restriction of data in the evaluation process. An alternative evaluation method for differential equations exists, if the numerical derivatives of the data can be calculated. The equations fitness is then measured by its ability to predict the numerical derivative. There are two reasons for doing this. The first is that small variations in the equation result in a larger magnitude of difference from the fitness function. The integrals of $$x^2$$ and $$x^2+1$$ diverge linearly, which means the integrated values of RK4 are very similar. The fitness landscape of the numerical derivative method is less smooth. The second reason for evaluating against the numerical derivative data is that we can perform a single point evaluation. We perform four times fewer evaluations and we can perform them at arbitrary points [ [hod:08:mining]() ]. **Invariants and Partial Differential Equations** The natural extension to differential equations is Partial Differential Equations (PDE). They are analogous in that they relate a function to rate of change. PDEs allow more than one independent variable, such as x, y, and z in three dimensional space. Invariant represent the solution to such a system system and are also called a conserved quantity. Well known examples from physics are the conservation of energy, mass, and angular momentum. Previous work has shown that it is actually possible to uncover the Hamiltonians for conservation of angular momentum, directly from measured data [ [hod:09:science](), [hod:09:implicit_long]() ]. This begs the question:
Are there conserved quantities in biological or ecological systems.
Despite the potential for, PDEs and invariants are beyond the scope of this work.
top ### Further Considerations #### Separating Task from Implementation In this work, we view SR as only a problem and view GP as an algorithm for performing SR. We attempt to make a distinction which the SR literature has not by using GP as a term for both the problem *and* solution. We view SR, not as a sub-class, but as a super-class to GP, as the regression within a grammar. A particular programming language is a specific grammar and an instance of the SR problem. So can be circuit or atomic regression. Mathematical equations are another instance of the SR problem and the one we focus on. We want to reiterate the difference between problem and implementation. Symbolic Regression is the task which uncovers relationships in data by searching the expression space defined by grammar. An implementation is the assemblage of methods which realize the Symbolic Regression task. We believe the lack of separation has led to some missed opportunities is SR research. In this work, we step back from SR to fundamentally change the way we think about regressing symbolic expressions. We also make this distinction because recent work [[McConaghy:2011:FFX]()] introduces, to our knowledge, the first deterministic algorithm for SR called Fast Function eXtraction (FFX), offering a new means of performing an equation search. #### Benchmark Problems Benchmarking has been a problem for SR research. Candidate fitness metrics, methods for comparing implementations, and benchmark problems vary widely across the GP field. This situation has made effective comparison and progress difficult to assess. In order to compare SR implementations, and before trusting them on unsolved problems, methods for quantifying the effectiveness of their search capabilities are needed. In 2012, [McDermott:2012:benchmarks]() surveyed three years of literature from EuroGP and GECCO GP track, bringing this issue to the forefront of the community. Their aim was to start a discussion on unifying and standardizing the evaluation process in GP. We agree with these ideals and use 22 of their SR target functions for the evaluation of PGE. We do, however, disagree with the assumption in ~\cite{McDermott:2012:benchmarks}, that results should not be expected to be repeatable, and thus unverifiable by a third party. A non-GP practitioner will not likely use a tool which gives different answers each time it is used. This has been partially addressed by *rate of convergence* (how often an implementation finds an answer) and *cumulative probability of success* (the probability that an ideal solution would be found on or prior to generation $$i$$). Both of these methods require many trials. Nevertheless, we agree that on optimum is less obtainable and that a consensus needs to be reached on unbiased methods for comparison between different implementations. #### Metrics for Success **Accuracy of Candidate Solutions** Accuracy predicts a candidates ability to model a given data set. Depending on the goals in using SR accuracy may or may not be the right measure. If we seek the most predictive model then accuracy can tell us which model is best. However, it is often the goal in SR to provide insight through meaningful relations. This is the trade-off between accuracy and parsimony. We want the simplest equation that isn't to simple. In SR, we usually return multiple models instead of one single best model. This allows the domain expert can evaluate the varying degrees of complexity and accuracy. It is usually the case that the most accurate model has also over fit the data set. Over fit models high frequency signals within the data by adding subexpressions to fit these signals. Validation sets overcome this issue by testing the model on unseen data after training. This results is an unbiased measurement, unless the learner is also being optimized. In that case, cross validation or a third data set is used as the never-before-seen testing data. **Rate of Convergence** The rate of convergence is defined as an implementation's probability of finding the correct solution. This metric is because GP is a randomized algorithm which produces inconsistent results. A randomized algorithm's results depend on the initial starting points. GP has non-deterministic behavior when creating the initial equation population and when genetic operations are applied during breeding. To compensate for the lack of consistency, GP researchers run many trials. The final results of all runs are tallied to come up with a probability for success. Rate of convergence can be a good measure when comparing the relative efficacy of GP implementations. It can also be used to tune the parameters of the GP implementation. PGE doesn't have a rate of convergence. It is a deterministic algorithm which executes the same way, in the same order, every time. Thus, given a set of parameters, PGE will either find the solution or it will not. The ability of PGE to find a solution depends upon the expansion functions and the amount of time alloted for searching. Rate of convergence will not allow us to make direct comparison between GP and PGE. It will permit us to loosely infer problem complexity with the both implementations difficulty in finding the 'correct' solution. Further complicating this issue is that determining convergence is not well defined. The literature has used hit ratio, error thresholds, and $$R^2$$ correlation. Real convergence is the matching of form, which can be done by manual inspection or by tree comparisons. To our knowledge, no one has reported this type of convergence by directly stating how close the results are to the true form. **Is the solution present?** Determining if the exact solution is present only works for benchmark problems. Optimizing for this can mean the researcher has overfit their algorithm to the benchmark problems. Even so, an algorithm should return the answer to most benchmark problems. The difficulties with this metric have been that GP is generally run multiple times to form the rate of convergence metric. It can become cumbersome to inspect the results for exact solution. Adding more difficulty to this process is the SR research has not considered the algebraic variations on an equation, that is, there is no canonical form. This makes the use of automated methods ineffective for detecting if the solution is present. #### Metrics for Effort **Number of Iterations** is often equated with how many steps it takes to find a solution. GP has a natural step between parents and children and is referred to as generations. Each generational step uses the current solutions and genetic operators to produce new candidate solutions. The number of generations is loosely tied to how long it takes a GP implementation to converge to an answer. In PGE, the generational step is applying the grammar productions to the best candidate solutions. Measuring performance by iterations is easily a biased measure. It requires careful attention to detail. Different implementations use different logic and parameters, searching different amounts of equation space per iteration. Varying the number of equations or the layout of the population effects the search rate. Fewer iterations is not always better, but can be a useful measure to minimize when optimizing the parameters within a particular SR implementation. When comparing between implementations, iterations becomes misleading if other factors are not considered. **Clock Time** Clock time tells us how long, from when we hit enter, until we get our answer. All other considerations equal, it is an unbiased measure. Generally every thing is not equal, and why comparing several measures is useful and no measure should be used in isolation. At some point, algorithms and computers become fast enough at a problem that time becomes less of an issue. To a domain expert, the difference between an hour and two, or hours and a day, may not be important. When they are modeling a large data set for a particular problem, a 24 hour turnaround is often acceptable, especially when one considers the time it takes to publish in a journal. Time, despite being something not to fool with, is a powerful measure. Small incremental improvements compound in time to create meaningful progress. When time is coupled with other metrics, we can gain measures for the amount of work. **Number of Evaluations** Amount of work done is a good metric when we want the most return for what we put in. Work, however, can be difficult no quantify. In SR, work is usually measured by counting the number of evaluations made, and relating this measurement over time. Counting evaluations is easy in GP. Calculation is population size multiplied by generations and the number of data points. In PGE, the scheme is similar, however the number of models per iteration varies. A further unknown arises with the use of iterative fitting schemes like those used in FFX and with nonlinear regression. There is an extra iteration loop over the data points per model which needs to be accounted for. **Number of Models** Counting the number of candidates tells us how many equations we actually considered. If we count the number of evaluations we made on each equation, we get a measure of work. Most work occurs during the evaluation phase of the search process. The computation complexity of simulation and evaluation, far outweighs the time taken for the remainder of the search process. We can reduce this work by reducing either of the terms, number of equations or number of evaluations. We can further refine this by considering the number of unique equations evaluated compared to the total number. We want to be smart about how many and which, equations and data points we choose. Coupling the unique to total equations with the number of evaluations will provide us a good comparison between the SR implementations. GP evaluates many equations fewer times, but experiences great amount of redundancy in form. PGE spends a larger amount of effort per equation, considers a form only once. #### Reproducibility and the Importance of Determinism This deterministic nature of FFX and PGE is an important feature for a SR algorithm. It is because of the link between human, math, and science. In science we expect reproducible results. We believe this is why GP has not become a predominant technology as predicted by its proponents, and likewise, SR for being so tightly coupled to GP for much of its history. Here, we take a strong stance that reproducibility is one of the key features to a SR implementation. Reproducibility provides the necessary reliability for scientific discovery and adoption of a new tool, as well as the foundation upon which research into SR can stand. Deterministic execution has benefits. First, the algorithm is consistent and repeatable. This means that researchers are able to reproduce results easily. This is an important feature with large portions of research unable to be replicated [[moraila:2013:measuring](), [MORE..]() ]. Second, deterministic execution means that modifications to the algorithm can be studied in isolation with their effects becoming much easier to measure. Third, the PGE algorithm only needs to be run once on a given data set. This is a significant advantage over GP which requires 30 or more trials in order to obtain statistically significant results. In all of our modifications to the algorithm, we maintained PGE's deterministic execution nature. The following two quotes from prior publications are included in the hope that it will frame this work in the larger context of producing usable SR tools and technologies. Much of the research in SR aims at improving the GP implementation, with only limited results and inconsistent comparisons. By reconsidering the definition of the SR problem, we believe that PGE is an evolution in perspective and approach. Through determinism we may further our understanding of the problem and find ways to create such a true Symbolic Regression technology.
Genetic programming (GP) is not a field noted for the rigor of its benchmarking. Some of its benchmark problems are popular purely through historical contingency, and they can be criticized as too easy or as providing misleading information concerning real-world performance, but they persist largely because of inertia and the lack of good alternatives. Even where the problems themselves are impeccable, comparisons between studies are made more difficult by the lack of standardization. We argue that the definition of standard benchmarks is an essential step in the maturation of the field. We make several contributions towards this goal. We motivate the development of a benchmark suite and define its goals; we survey existing practice; we enumerate many candidate benchmarks; we report progress on reference implementations; and we set out a concrete plan for gathering feedback from the GP community that would, if adopted, lead to a standard set of benchmarks.
~ McDermott:2012:benchmarks

Outside the GP literature, SR is rare; there are only scattered references such as (Langley et al., 1987). In contrast, the GP literature has dozens of papers on SR every year; even the previous GPTP had seven papers involving SR (Riolo et al., 2010). In a sense, the home field of SR is GP. This means, of course, that when authors aim at SR, they start with GP, and look to modify GP to improve speed, scalability, reliability, interpretability, etc. The improvements are typically 2x to 10x, but fall short of performance that would makes SR a “technology” the way LS or linear programming is. We are aiming for SR as a technology. What if we did not constrain ourselves to using GP? To GP researchers, this may seem heretical at first glance. But if the aim is truly to improve SR, then this should pose no issue. And in fact, we argue that the GP literature is still an appropriate home for such work, because (a) GP authors doing SR deeply care about SR problems, and (b) as already mentioned, GP is where all the SR publications are. Of course, we can draw inspiration from GP literature, but also many other potentially-useful fields.
~ McConaghy:2011:FFX
">next (Symbolic Regression)