3.4 Evolving Neural Networks

- Representations
- Credit Assignment
- Adapting Weights
- Evolving AND Learning Weights
- Evolving Architectures
- Evolving Learning Rules
- 3.4.1 Ensembles of NNs
- 3.4.2 Yao's Framework for Evolving NNs
- 3.4.3 Conclusions

3.4 Evolving Neural Networks

The study of Neural Networks (NNs) is a large and interdisciplinary area. The term Artificial Neural Network (ANN) is often used to distinguish simulations from biological NNs, but having noted this we shall refer simply to NNs. When evolution is involved such systems may be called EANNs (Evolving Artificial Neural Networks) [316] or ECoSs (Evolving Connectionist Systems) [147].

A neural network consists of a set of nodes, a set of directed connections between a subset of the nodes, and a set of weights on the connections. The connections specify the inputs and outputs of nodes, and nodes come in three forms: input nodes (which receive input from the outside world), output nodes, and hidden nodes, which connect only to other nodes. Nodes are typically arranged in layers: the input layer, hidden layer(s) and output layer.

Nodes compute by integrating their inputs using an activation function and passing on their activation as output. Connection weights modulate the activation passed along each connection, and in the simplest form of learning the weights are modified while all else remains fixed. The most common approach to learning weights is a gradient descent-based learning rule such as backpropagation.

The *architecture* of a NN refers to the set of nodes, the connections, the activation functions and the plasticity of the nodes, that is, whether or not they can be updated. Most often all nodes use the same activation function, and in virtually all cases all nodes can be updated.

Evolution has been applied at three levels: weights, architecture and learning rules. In terms of architecture, evolution has been used to determine connectivity, select activation functions, and determine plasticity.
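As a minimal sketch of these basics (the 2-3-1 layer sizes, sigmoid activation, learning rate and XOR task are all illustrative assumptions, not from the text), the following trains only the weights of a fixed architecture with backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2-3-1 layered network (sizes are arbitrary for illustration). The weights
# are the only part that changes during learning; the nodes, connections and
# activation functions stay fixed.
W1 = rng.normal(scale=0.5, size=(2, 3))   # input-to-hidden weights
W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden-to-output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    # Each node integrates its inputs and passes on its activation.
    h = sigmoid(x @ W1)                   # hidden activations
    y = sigmoid(h @ W2)                   # output activation
    return h, y

def backprop_step(x, target, lr=0.5):
    # One gradient-descent step on the squared error for a single example.
    global W1, W2
    h, y = forward(x)
    delta_out = (y - target) * y * (1 - y)          # output-layer error term
    delta_hid = (delta_out @ W2.T) * h * (1 - h)    # backpropagated to hidden layer
    W2 -= lr * np.outer(h, delta_out)
    W1 -= lr * np.outer(x, delta_hid)

# Train on XOR, example by example.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
for _ in range(2000):
    for x, t in zip(X, T):
        backprop_step(x, t)
```

Everything the later subsections vary (the weights, the architecture, the update rule itself) appears explicitly here.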

An alternative to gradient descent is to evolve the weights, which has the advantage that EAs do not rely on gradients and can work with discrete fitness functions. Another advantage of evolving weights is that the same evolutionary method can be used for different types of network (feedforward, recurrent, higher order), which is a great convenience for the engineer [316]. Consequently, much research has been done on the evolution of weights. Unsurprisingly, fitness functions penalise NN error, but they also typically penalise network complexity (the number of hidden nodes) in order to control overfitting. The expressive power of a NN depends on the number of hidden nodes: fewer nodes make the network less expressive, so it fits the training data less closely, while more nodes make it more expressive, so it fits the data more closely. As a result, a NN with too few nodes underfits while one with too many overfits. In terms of training speed there is no clear winner between evolution and gradient descent; which is better depends on the problem [316]. However, Yao [316] states that evolving weights AND architecture is better than evolving weights alone, and that evolution seems better suited to reinforcement learning and recurrent networks. Floreano [93] suggests evolution is better for dynamic networks. Happily, we do not have to choose between the two approaches.
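As a sketch of the idea (the fixed 2-3-1 network, mutation scale and XOR task are illustrative assumptions), a simple (1+1) evolution strategy can train the weights of a fixed network using only fitness comparisons, with no gradients at all:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed 2-3-1 feedforward architecture; the genome is the flat weight vector.
N_W1, N_W2 = 2 * 3, 3 * 1

def forward(genome, x):
    W1 = genome[:N_W1].reshape(2, 3)
    W2 = genome[N_W1:].reshape(3, 1)
    return np.tanh(np.tanh(x @ W1) @ W2)

# XOR as a toy task; fitness needs only the network's error, not its gradient.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[-1], [1], [1], [-1]], dtype=float)   # tanh outputs lie in [-1, 1]

def error(genome):
    return float(np.mean((forward(genome, X) - T) ** 2))

# (1+1) evolution strategy: mutate every weight, keep the child if no worse.
genome = rng.normal(scale=0.5, size=N_W1 + N_W2)
initial_error = error(genome)
for _ in range(3000):
    child = genome + rng.normal(scale=0.2, size=genome.shape)
    if error(child) <= error(genome):
        genome = child
```

Because selection only ever compares fitness values, the same loop works unchanged for recurrent or higher-order networks, or for a discrete fitness function.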

Evolving the architecture is more difficult. Yao [316] characterises the architecture search surface as follows:
- The surface is infinitely large since the number of possible nodes and connections is unbounded;
- the surface is nondifferentiable since changes in the number of nodes or connections are discrete and can have a discontinuous effect on an EANN's performance;
- the surface is complex and noisy since the mapping from an architecture to its performance is indirect, strongly epistatic, and dependent on the evaluation method used;
- the surface is deceptive since similar architectures may have quite different performance;
- the surface is multimodal since different architectures may have similar performance.

There are good reasons to evolve architectures and weights simultaneously. If architectures are evolved but the weights are learned by gradient descent, there is a one-to-many mapping from NN genotypes to phenotypes [318]: random initial weights and stochastic learning lead to different outcomes, which makes fitness evaluation noisy and necessitates averaging over multiple runs, which in turn makes the process slow. If, on the other hand, we evolve architectures and weights simultaneously, we have a one-to-one genotype-to-phenotype mapping, which avoids this problem and results in faster learning. Furthermore, we can co-optimise other parameters of the network [93] at the same time. For example, [21] found that the best networks had a very high learning rate, which may have been optimal because of the interaction of many factors such as the initial weights, the training order, and the amount of training. Without co-optimising architecture and weights, evolution would not have been able to take all these factors into account at the same time.
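One way to realise the one-to-one mapping is a direct encoding in which a single genotype carries both a binary connectivity mask (the architecture) and the weights; the sketch below is illustrative, with arbitrary sizes and an assumed complexity penalty in the fitness:

```python
import numpy as np

rng = np.random.default_rng(2)
N_IN, N_HID, N_OUT = 2, 4, 1
N_CONN = N_IN * N_HID + N_HID * N_OUT   # 12 possible connections

def random_genotype():
    # Direct encoding: one genotype holds both the architecture (a binary
    # connectivity mask) and the weights, so genotype -> phenotype is one-to-one.
    return {
        "mask": rng.integers(0, 2, size=N_CONN),        # which connections exist
        "weights": rng.normal(scale=0.5, size=N_CONN),  # and their weights
    }

def forward(g, x):
    w = g["weights"] * g["mask"]        # absent connections contribute nothing
    W1 = w[:N_IN * N_HID].reshape(N_IN, N_HID)
    W2 = w[N_IN * N_HID:].reshape(N_HID, N_OUT)
    return np.tanh(np.tanh(x @ W1) @ W2)

# XOR as a toy task; fitness penalises both error and complexity
# (the number of active connections).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[-1], [1], [1], [-1]], dtype=float)

def fitness(g, complexity_cost=0.01):
    err = float(np.mean((forward(g, X) - T) ** 2))
    return -(err + complexity_cost * int(g["mask"].sum()))

def mutate(g):
    child = {"mask": g["mask"].copy(), "weights": g["weights"].copy()}
    child["mask"][rng.integers(N_CONN)] ^= 1            # add or remove a connection
    child["weights"] += rng.normal(scale=0.1, size=N_CONN)
    return child

g = random_genotype()
initial_fitness = fitness(g)
for _ in range(2000):
    child = mutate(g)
    if fitness(child) >= fitness(g):
        g = child
```

Since no within-lifetime learning is involved, each genotype yields exactly one deterministic fitness value, so no averaging over runs is needed.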

One approach is to evolve only the learning rule's parameters [316], such as the learning rate and momentum in backpropagation. This has the effect of adapting a standard learning rule to the architecture or problem at hand. Non-evolutionary methods of adapting training rules also exist. Castillo [66], working with multi-layer perceptrons, found that evolving the architecture, initial weights and rule parameters together was as good as or better than evolving only the first two or only the third.
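The idea can be sketched with a toy stand-in for training (the 1-D quadratic loss, training budget and all constants here are illustrative assumptions, not from the text): the EA tunes the rule's learning rate and momentum so that a fixed number of gradient-descent steps reaches the lowest loss.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for "training": minimise a 1-D quadratic loss with a momentum
# gradient-descent rule. The EA tunes the rule's parameters (learning rate
# and momentum), not the weight itself.
def train_loss(lr, momentum, steps=30):
    w, v = 5.0, 0.0                    # initial weight and velocity
    for _ in range(steps):
        grad = 2.0 * w                 # gradient of the loss w**2
        v = momentum * v - lr * grad
        w = w + v
    return w ** 2                      # loss reached within the fixed budget

# (1+1) evolution strategy over the two rule parameters.
params = np.array([0.01, 0.0])         # [learning rate, momentum]
for _ in range(500):
    child = np.clip(params + rng.normal(scale=0.05, size=2), 0.0, 0.99)
    if train_loss(*child) <= train_loss(*params):
        params = child
```

In a realistic setting `train_loss` would run backpropagation on an actual network, so each fitness evaluation is an entire training run; the outer loop is unchanged.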

We can also evolve new learning rules [316,233]. Open-ended evolution of rules was initially considered impractical, so instead Chalmers [67] specified a generic, abstract form of weight update and evolved its parameters to produce different concrete rules. The generic update was a linear function of ten terms, each with an associated evolved real-valued coefficient. Four of the terms represented information local to the connection being updated, while the other six were the pairwise products of the first four. Using this method Chalmers was able to rediscover the delta rule and some of its variants. The approach has since been used by a number of others and has been reported to outperform human-designed rules [78]. More recently, GP was used to evolve novel kinds of rules from a set of mathematical functions, and the best new rules consistently outperformed standard backpropagation [233]. Whereas an architecture is fixed, a rule could in principle change over the network's lifetime (e.g. its learning rate could change), but evolving such dynamic rules would naturally be much more complex than evolving static ones.
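Chalmers' generic form can be written down directly. The particular choice of the four local terms below (current weight, input activation, output activation, training signal) and the scaling constant are illustrative assumptions; the ten coefficients are what evolution would search over.

```python
import numpy as np

# Chalmers-style generic update for one connection: a linear function of four
# local terms and their six pairwise products (ten terms in all), weighted by
# evolved coefficients.
def generic_update(w, a_in, a_out, target, coeffs, scale=0.1):
    local = np.array([w, a_in, a_out, target])
    pairs = [local[i] * local[j] for i in range(4) for j in range(i + 1, 4)]
    terms = np.concatenate([local, pairs])   # 4 local terms + 6 products = 10
    return scale * float(coeffs @ terms)

# With these hand-set coefficients the generic form reduces to the delta rule,
# delta_w = scale * a_in * (target - a_out), one of the rules Chalmers
# rediscovered. The pairwise products are ordered (w,a_in), (w,a_out),
# (w,target), (a_in,a_out), (a_in,target), (a_out,target).
delta_coeffs = np.zeros(10)
delta_coeffs[7] = -1.0    # -a_in * a_out
delta_coeffs[8] = +1.0    # +a_in * target
```

An evolved coefficient vector is thus a complete concrete learning rule, and the search space, ten real numbers, is small enough for a standard EA.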