Tim Kovacs
School of Computer Science, University of Birmingham
Birmingham U.K. B15 2TT
T.Kovacs@cs.bham.ac.uk
Phone: +44 121 414 3736
Fax: +44 121 414 4281
Keywords: XCS, Classifier Systems, Generalization Problem, Structural Credit Assignment Problem, Reinforcement Learning.
The ability of XCS to evolve optimal populations for boolean multiplexer problems is demonstrated using condensation, a technique in which evolutionary search is suspended by setting the crossover and mutation rates to zero. Condensation is automatically triggered by self-monitoring of performance statistics, and the entire learning process is terminated by autotermination. Combined, these techniques allow a classifier system to evolve optimal representations of boolean functions without any form of supervision.
A more complex but more robust and efficient technique for obtaining optimal populations called subset extraction is also presented and compared to condensation.
This work is concerned with finding solutions to the generalization or structural credit assignment problem, which may be seen as the problem of generalizing accurately from experience. This is an important issue for machine learning systems as effective generalization is vital for systems to scale well to larger problems. Additionally, our folk concept of ``understanding'' learned material seems to imply an ability to generalize effectively. Certainly it is important for machine learning systems to be able to generalize from past experience in order to cope with novel situations.
Reinforcement learning systems attempt to learn complete maps of their environment from a limited input/action/payoff interaction with it. A complete map is one which has an estimated payoff for each input/action pair. Many approaches to learning such mappings, and to producing accurate generalizations within them, have been used by reinforcement learning systems. For example, the popular tabular Q-learning technique (Watkins, 1990) exhaustively enumerates condition/action pairs and maintains a payoff estimate for each. As a result it suffers from poor scalability, although modifications to allow generalization have been introduced (see, for example, Lin, 1992; Munos, 1994).
Classifier systems (CS) are rule-based learning systems which are able to generalize over their inputs, and thus have the potential to scale well, but they have not traditionally constructed complete payoff maps. The current work is based on Wilson's XCS system (Wilson, 1995, 1996), a new type of classifier system which, unlike traditional CS, does construct complete payoff maps, thanks to its shift to accuracy-based fitness. According to Wilson's XCS Generalization Hypothesis, XCS has a natural tendency towards accurate generalization.
Potential applications of XCS in design and manufacturing include those in which accurate generalization (rather than traditional point optimization) is desirable. XCS may also be suitable when a complete representation of the problem space is needed. One case where this is useful is when a variety of fitness functions need to be applied to a common structure. The current work enhances the ability of XCS to find useful solutions.
The focus of the current work is this: given that we have in XCS a reinforcement learning system with a tendency towards accurate generalization, how can we achieve optimal generalizations? In this paper I will present two techniques which allow XCS to evolve optimally general representations for boolean functions.
I will be referring to optimal populations or optimal solutions for given problems. For the purposes of this work, an optimal solution is defined as one which has three characteristics: it accurately and completely covers the input/action space, it contains no overlapping classifiers, and it contains as few classifiers as possible.
The advantages of optimal solutions over those which are merely correct are:
Classifier systems are a form of domain independent rule-based machine learning system introduced by John Holland (see Holland, 1975, 1986). Classifier systems use a Genetic Algorithm (GA) to generate condition/action rules or classifiers which are evaluated during interaction with the problem environment. Classifiers typically use strings of characters composed from the ternary alphabet {0, 1, #} to represent conditions and actions. This is the binary alphabet of the standard GA augmented with a ``don't care'' symbol # which acts as a wildcard when matching environmental inputs to rule conditions. It is the use of the # which allows rules to generalize over inputs. Classifiers also have various parameters (statistics associated with them), such as estimates of the payoff the system will receive if it uses that rule, or the fitness of the rule (an estimate of its overall value to the system). Search proceeds in the GA by selecting parent classifiers (with fitter classifiers being more likely to be selected) and generating children from them using search operators. The two most common search operators used by the GA are crossover (in which segments of two parent's strings are recombined to make new strings for their children) and mutation (in which a random change is made to a child's string).
XCS is a type of classifier system introduced in (Wilson, 1995) and extended in (Wilson, 1996) (additional detail and analysis of the system are available in (Kovacs, 1996)). The primary distinguishing feature of XCS is that classifier fitness is based on the accuracy of classifier payoff prediction rather than on payoff prediction (strength) itself. In addition, there are many more subtle differences between the traditional classifier system and XCS, including the use of a niche GA, a Q-learning-like update mechanism, and a deletion mechanism which seeks to balance classifier allocation between niches. XCS has many features in common with Wilson's earlier ZCS work (Wilson, 1994).
For an extended review and analysis of XCS the reader is referred to (Kovacs, 1996), which addresses many of the subjects of this paper in more detail, and includes overviews of classifier systems, reinforcement learning, payoff environments, multiplexer problems, representational difficulties with classifier systems and other subjects.
In traditional CS, classifier strength plays a double role: it is used as a predictor of payoff in action-selection, and as a measure of fitness in the GA. In XCS, strength is replaced by three parameters: (payoff) prediction, prediction error and fitness. Prediction fills the role of strength in the action-selection component, and is also used to calculate prediction error, a measure of classifier accuracy. Fitness is an inverse function of the prediction error (put another way, it is a function of the accuracy of the prediction).
By relieving payoff prediction of its double role Wilson has significantly improved the classifier system architecture. In XCS, payoff prediction is only relevant in action selection which means XCS classifiers are free to map any region of the input/action/payoff space -- as long as they do so accurately. As a result XCS tends to form ``complete and accurate mappings X x A => P from inputs and actions to payoff predictions ...which can make payoff-maximising action-selection straightforward'' (Wilson, 1995). The attempt to construct a complete mapping of the payoff environment is in the spirit of much reinforcement learning work, and indeed the function of the learning subsystem in XCS is related to the reinforcement learning technique Q-learning.
A second major difference of XCS is the use of macroclassifiers, the name given to a classifier with an additional numerosity parameter which indicates how many copies of that classifier are considered to be in the population. When a new classifier is created, the population is scanned to see if a classifier with the same condition and action exists. If so, the new classifier is discarded and the existing one has its numerosity incremented by one. An analogous process is used in deleting classifiers from the system. XCS treats a macroclassifier in all ways as the equivalent number of (micro)classifiers. Macroclassifiers increase the run-time speed of the system, make for a more compact and better scaling representation, offer interesting information on the distribution of resources within the system, and most importantly for the current work, make possible the use of techniques for obtaining optimal populations of classifiers which I will outline shortly. I have briefly evaluated macroclassifiers in (Kovacs, 1996).
(Wilson, 1995) describes two techniques for calculating the probability of a classifier being selected for deletion in the GA. I have compared these techniques and found that although the second technique results in smaller population sizes, it is highly detrimental to the development of members of the optimal population. I believe this is a result of excessive bias against newly evolved classifiers, and have found that it can be rectified by modifying this technique to only penalise low fitness classifiers if their experience parameter is greater than 20 (i.e. if they have had their parameters updated more than 20 times). I will refer herein to this modified form of the second technique as deletion technique 3.
The following terminology and notation was drawn from (Wilson, 1995) whenever possible.
condition : action => payoff prediction
E.g. 111###:0 => 100 would be interpreted as: if the input string begins with 111, then action 0 should be taken, and a payoff of 100 units is expected.
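The matching rule implied by this notation can be sketched in a few lines of Python (a purely illustrative sketch for exposition; XCS itself is not given in code here):

```python
# Illustrative sketch: matching a ternary classifier condition against a
# binary input string. A '#' matches either bit; '0' and '1' must match
# exactly.

def matches(condition: str, inp: str) -> bool:
    """Return True if the condition matches the input string."""
    return all(c == '#' or c == b for c, b in zip(condition, inp))

# The classifier 111###:0 => 100 matches any input beginning with 111:
print(matches("111###", "111010"))  # True
print(matches("111###", "011010"))  # False
```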
Classifiers express generalizations using the don't care symbol # in their conditions. For example, a classifier with condition 00# will match both 001 and 000 and treats these two inputs as equivalent. A classifier is either overgeneral, maximally (optimally) general, or suboptimally general in respect to the inputs it matches. Consider the following payoff landscape:
Suppose an XCS system trained on this payoff landscape has a population consisting of the following classifiers:
(The accuracy of A is 0.0 because its prediction error exceeds a threshold called the accuracy criterion (Wilson, 1995; Kovacs, 1996)). We can describe each classifier as one of the following:
If fitness is based on strength, overgeneral classifiers may survive as ``guessers'' which are sometimes correct and sometimes not. The prediction of A, 100, is an average of the payoffs it receives. To a system which bases fitness on strength (predicted payoff), A will look as valuable as C or D for use as a parent classifier. However, to a system which bases fitness on accuracy of payoff prediction, it is clear that A is not as valuable as the other classifiers.
It appears that evolutionary pressure to generate maximally general classifiers (i.e. classifiers which are as general as possible while remaining within some accuracy criterion) exists within XCS. Wilson's Generalization Hypothesis explains this process as follows:
``Consider two classifiers C1 and C2 having the same action, where C2's condition is a generalization of C1's. That is, C2's condition can be generated from C1's by changing one or more of C1's specified (1 or 0) alleles to don't cares (#). Suppose that C1 and C2 are equally accurate in that their values of [prediction error] are the same. Whenever C1 and C2 occur in the same action set, their fitness values will be updated by the same amounts. However, since C2 is a generalization of C1, it will tend to occur in more match sets than C1. Since the GA occurs in match sets, C2 would have more reproductive opportunities and thus its number of exemplars would tend to grow with respect to C1's (or, in macroclassifier terms, the ratio of C2's numerosity to C1's would increase). Consequently, when C1 and C2 next meet in the same action set, a larger fraction of the constant fitness update amount would be ``steered'' toward exemplars of C2, resulting through the GA in yet more exemplars of C2 relative to C1. Eventually, it was hypothesised, C2 would displace C1 from the population.'' (Wilson 1995)
The logical conclusion of this preference for the more general of two equally accurate classifiers is that the maximally general classifier for a payoff level will tend to be evolved and tend to displace its competitors from the population. In the following I will make use of this tendency to obtain optimal populations.
I propose a simple extension to the Generalization Hypothesis called the Optimality Hypothesis. It states that XCS will eventually evolve a maximally general classifier for each payoff level and that that classifier will have a greater numerosity than any other in its payoff level. The significance of the Optimality Hypothesis is that it holds that XCS will evolve a population of which a subset is an optimal solution and that we will be able to distinguish this subset from the general population based on the distribution of numerosity.
The predictions of the Optimality Hypothesis are naturally contingent upon such things as enough cycles and a large enough population size being used for the generalization process to succeed completely. Further, the generalization process is limited by the nature of the samples drawn from the environment (e.g. unrepresentative sampling of the problem environment may throw the generalization process off). However, such complications are inherent in all reinforcement learning systems.
The multiplexer function is defined for strings of length L = k + 2^k where k is an integer > 0. In a 6 multiplexer problem (k = 2), the input to the system consists of a string of six binary digits, of which the first two (the address) represent an index into the remaining bits (the data), and the value of the function is the value of the indexed data bit. E.g., the value of 101101 is 0, as the first two bits 10 represent the index 2 (in base 10), which selects the third bit following the address. Similarly, the value of 001000 is 1, as the 0th bit after the address is indexed. The optimal population for this problem consists of the 8 classifier conditions of the form 11###1 (i.e. where all digits in the data part are # apart from the indexed one), each paired with the 2 possible actions, for a total of 16 classifiers. The multiplexer is actually a rather difficult problem to find an optimal population for, as there are a number of maximally general classifiers of the form 0#00## which are as general (both types have three #s) and as accurate as the members of [O]. However, maximally general classifiers of this second form tend to overlap, and a population based on them requires more than 16 classifiers to completely map the input/action space. Because these two types of maximally general classifier are equally general and accurate, the generalization mechanism must distinguish the optimal population by considering the solution as a combination of interacting classifiers, not just as a group of independent classifiers.
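For concreteness, the multiplexer function just defined can be sketched as follows (an illustrative Python sketch of the problem environment, not part of XCS itself):

```python
# Sketch of the boolean multiplexer: the first k bits address one of the
# 2^k data bits, and the function's value is the addressed data bit.

def multiplexer(bits: str, k: int = 2) -> int:
    """Value of the k-address-bit multiplexer on a binary string of
    length k + 2**k."""
    assert len(bits) == k + 2 ** k
    index = int(bits[:k], 2)      # address bits read in base 2
    return int(bits[k + index])   # the indexed data bit

print(multiplexer("101101"))  # 0: address 10 indexes the third data bit
print(multiplexer("001000"))  # 1: address 00 indexes the 0th data bit
```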
Wilson (Wilson, 1995) discusses the use of condensation to reduce the size of the population once the system appears to have learned a given problem. Condensation consists of running the system with the mutation and crossover rates set to zero. This suspends the genetic search as no new classifier conditions can be generated, but allows the classifier selection/deletion dynamics in the GA to continue to operate. The result is a gradual shift in numerosity from less fit/less general classifiers to more fit/more general classifiers. Once a classifier reaches zero numerosity it is removed from the system. Over several thousand cycles the result is a condensation of the population to its fittest members.
Although condensation was applied to an XCS system in (Wilson, 1995), optimal populations were not obtained as condensation was applied too soon. Because no new classifier conditions are generated once condensation begins, the optimal population must already exist within the general population at this point. Choosing when to begin condensation is thus of some importance, and, unfortunately, difficult to determine because of the variability in the time required for the completion of [O] and the indirectness of problem-independent techniques for detecting it. I investigated the use of several statistics for estimating the completion of [O] in (Kovacs, 1996) but none was very accurate, and all involved the use of a user-selected delay between the trigger condition being met and the commencement of condensation. I have triggered condensation for the 6 multiplexer problem when a moving average of the last 50 points of system error has remained below 0.01 for 2,500 consecutive GA cycles (using deletion technique 1). With these settings the system was able to find optimal populations on 100 out of 100 runs. Unfortunately, this approach does not scale well, as a delay of 10,000 GA cycles is required when triggering condensation for the 11 multiplexer to again achieve 100% success on 100 runs.
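The self-monitoring trigger described above can be sketched as follows (an illustrative sketch only; the class name and structure are assumptions, not the original implementation):

```python
# Sketch of a condensation trigger: condensation begins once a 50-point
# moving average of system error has stayed below a threshold for a fixed
# number of consecutive GA cycles.
from collections import deque

class CondensationTrigger:
    def __init__(self, window=50, threshold=0.01, hold_cycles=2500):
        self.errors = deque(maxlen=window)  # sliding window of system error
        self.threshold = threshold
        self.hold_cycles = hold_cycles
        self.cycles_below = 0               # consecutive cycles below threshold

    def update(self, system_error: float) -> bool:
        """Record one cycle's system error; return True when condensation
        should begin (i.e. crossover and mutation rates are set to zero)."""
        self.errors.append(system_error)
        avg = sum(self.errors) / len(self.errors)
        self.cycles_below = self.cycles_below + 1 if avg < self.threshold else 0
        return self.cycles_below >= self.hold_cycles
```

The delay parameter (hold_cycles) is exactly the user-selected quantity whose sensitivity to problem complexity is discussed above.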
The system can be left to run for a preset number of cycles, or alternatively can make use of autotermination to cease execution when an optimal population has been evolved. Autotermination works as follows: once condensation has begun, each time a classifier is removed from the system, each pair of classifiers which can be formed from those in [P] is compared to see if they overlap (i.e. to see if there is any input which both will match). Initially, there will be many overlaps in the population, but once condensation has reduced [P] to [O], no overlaps will remain (recall from section 1.1 that an optimal population has no overlaps). It should be noted that this only works if an optimal population is found, otherwise overlaps will likely remain and autotermination will not stop the system.
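The overlap test at the heart of autotermination is straightforward for the ternary alphabet: two conditions of equal length match a common input exactly when there is no position at which one specifies 0 and the other specifies 1. A minimal sketch of the test on a pair of conditions (illustrative Python, assuming pairs are compared as described above):

```python
# Sketch of the pairwise overlap test used during autotermination.

def conditions_overlap(c1: str, c2: str) -> bool:
    """True iff some input matches both ternary conditions, i.e. at every
    position the two are compatible ('#' is compatible with anything)."""
    return all(a == '#' or b == '#' or a == b for a, b in zip(c1, c2))

print(conditions_overlap("11###1", "0#00##"))  # False: conflict at bit 0
print(conditions_overlap("11###1", "1#00##"))  # True: both match 110001
```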
I have found that the rate of condensation can be greatly increased by the simple modification of truncating each action set to its single most numerous member once condensation has begun. In other words, each time the system forms an action set, all but the most numerous classifier in that set are removed completely from the entire system. When an optimal population is present in the general population, this has the same effect as the original condensation method because in an optimal population there are no overlaps and thus only a single classifier for each action set.
Although optimal populations can be found by condensation, the approach is somewhat unsatisfactory in that the delay before commencing condensation must be set by the user and appears to be rather sensitive to problem complexity. Additionally, because of the considerable variability from run to run in the number of cycles required for the complete optimal population to evolve (before which condensation must not be started), a rather liberal delay must be used. A second approach to obtaining optimal populations called subset extraction improves on condensation by being less sensitive to problem complexity and yet finds optimal populations in fewer cycles. However, it has the disadvantage of being more complex than the straightforward condensation method.
Subset extraction attempts to find an optimal population by analysing the system at a particular instant -- it does not rely on incremental calculations or a delay as condensation does. This means it can be applied at any interval, e.g. every 50th or 100th cycle, in order to reduce processing resource use. Subset extraction is conceptually simple: if some well-evaluated subset of [P] with good accuracy ratings can be found which completely covers the input/action space and is non-overlapping, this subset is taken as a candidate optimal population. The third property of an optimal population, that it have as few members as possible, is then dealt with by attempting to combine compatible classifiers. The following sections describe this process in detail.
The first problem which presents itself is that some classifiers in [P] may be inaccurate and should not be used in forming [O]. For example, ###### : 1 and ###### : 0 cover the input/action space completely and in a non-overlapping way, but for the 6 multiplexer problem clearly do not form an optimal population, as they will often incorrectly classify the inputs they match. To avoid generating [O]s based on inaccurate classifiers we simply ignore any classifiers which are not accurate (i.e. which do not have an accuracy of 1.0) or which are not experienced (i.e. which have had their parameters updated fewer than 20 times as members of [A]).
The second difficulty is dealing with the large number of possible subsets of [P]. In a typical 6 multiplexer experiment there might be a maximum of 400 microclassifiers allowed, although this might amount to only 100 macroclassifiers when we try to extract [O]. Given that some classifiers may be rejected out of hand due to a lack of accuracy or experience, we might be left with only 75 to consider. However, this still leaves an impractical 2^75 (approximately 3.8 x 10^22) possible subsets to consider.
A simple technique was adopted to reduce the computational load, based on XCS's tendency to eventually allocate more numerosity to members of [O] than to any other classifiers in [P]. This technique requires the assumption that if a classifier x is an element of [O], then any more numerous classifiers will also be in [O]. This assumption is not always true, but as XCS redistributes numerosity it will tend to become true.
Given this assumption, we can sort the accurate, experienced members of [P] according to numerosity and use the following algorithm to identify elements of the optimal population:
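The idea can be reconstructed as follows (a hedged sketch: under the numerosity assumption above, [O] must be a prefix of the classifiers sorted by decreasing numerosity, so only as many subsets as there are classifiers need testing; the brute-force coverage check shown here is feasible only for small input spaces, and all helper names are illustrative, not from the original system):

```python
# Sketch of subset extraction by prefix checking over the numerosity-sorted
# population. Classifiers are (condition, action, numerosity) tuples,
# pre-filtered to accurate, experienced rules.
from itertools import product

def matches(condition, inp):
    return all(c == '#' or c == b for c, b in zip(condition, inp))

def covers_all(subset, n_bits, actions):
    """True if every input/action pair is matched by some classifier."""
    return all(
        any(act == a and matches(cond, ''.join(inp)) for cond, a, _ in subset)
        for inp in product('01', repeat=n_bits)
        for act in actions)

def has_overlap(subset):
    """True if two classifiers with the same action can match a common input."""
    return any(
        a1 == a2 and all(x == '#' or y == '#' or x == y for x, y in zip(c1, c2))
        for i, (c1, a1, _) in enumerate(subset)
        for (c2, a2, _) in subset[i + 1:])

def extract_optimal(classifiers, n_bits, actions=('0', '1')):
    """Return the first complete, non-overlapping prefix of the
    numerosity-sorted classifier list, or None if no prefix qualifies."""
    ranked = sorted(classifiers, key=lambda c: -c[2])
    for i in range(1, len(ranked) + 1):
        if covers_all(ranked[:i], n_bits, actions) and not has_overlap(ranked[:i]):
            return ranked[:i]
    return None
```

For example, fed the 16 members of the 6 multiplexer [O], this sketch returns exactly that set; fed a population whose most numerous members do not yet form a complete cover, it returns None and can simply be tried again later.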
With this approach we have at most 75 subsets of [P] to consider -- making the assumption above changes the computational complexity of the algorithm from exponential to linear time. Although this algorithm can incorrectly conclude that a complete [O] does not exist within [P], this becomes less likely as numerosity is redistributed in the expected way. Because the algorithm is computationally inexpensive it can be applied at relatively frequent intervals, minimising the delay in extracting [O].
This assumption appears reasonable and justifiable given the Generalization Hypothesis. By this I mean that, although it is possible for some element in [P] not in [O] to be more numerous than some element in [O], this situation is counter to the trend of numerosity allocation in XCS. Consequently, although this algorithm may not extract [O] from [P] as soon as possible, it will not be long before the least numerous classifier in [O] becomes more numerous than the most numerous outside [O], at which point this algorithm will extract [O] from [P]. In other words, by making this assumption and attempting subset extraction relatively frequently we merely delay the extraction of [O] by some reasonably small number of cycles.
Once [O] has been extracted from [P] there is still the matter of minimality to contend with before [O] is truly optimal. The problem is that [O] may cover the input/action space completely, but may use more classifiers than necessary. Fortunately, it is always possible to detect and rectify this condition by combining classifiers: two or more classifiers may have their conditions combined such that a smaller number of conditions expresses the union of the originals. E.g. 000000 : 0 => 100 and 000001 : 0 => 100 can be combined into (and replaced by) 00000# : 0 => 100. This is not always possible, due to limitations in the representational capacity of the ternary alphabet currently used in XCS classifier conditions (see Schuurmans & Schaeffer, 1989, and Kovacs, 1996, section 5.3). This is a difficulty with the ternary alphabet, and thus only indirectly with XCS, as some other representational scheme could be employed for conditions. Classifiers which are to be combined must advocate the same action and have the same prediction, in order not to corrupt the representation of the input/action space. In principle, we might have a population in which many classifiers could be combined, e.g. 8 classifiers might be combined into a single one. In practice, the only combinations encountered in minimising the candidate [O] in 6 and 11 multiplexer experiments are pairs of classifiers combined into a single classifier, and triplets combined into pairs. Thus it appears unnecessary to attempt all possible combinations of classifiers. This is thanks to XCS's tendency towards generalization, which produces classifiers which are already maximally general, or close to it. Once all classifier combinations have been made, [O] has all three properties of an optimal population.
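The pairwise combination step can be sketched as follows (illustrative Python: within the ternary alphabet, two conditions can be merged exactly when they differ at a single specified position; checking actions and predictions match is left to the caller):

```python
# Sketch of combining a pair of classifier conditions into one more
# general condition, as in 000000 + 000001 -> 00000#.

def combine(c1: str, c2: str):
    """Return the merged condition, or None if the pair cannot be combined
    within the ternary alphabet (they must differ at exactly one position,
    and that position must be specified (0/1) in both)."""
    diffs = [i for i, (a, b) in enumerate(zip(c1, c2)) if a != b]
    if len(diffs) == 1 and '#' not in (c1[diffs[0]], c2[diffs[0]]):
        i = diffs[0]
        return c1[:i] + '#' + c1[i + 1:]
    return None

print(combine("000000", "000001"))  # 00000#
print(combine("000000", "000011"))  # None: differ at two positions
```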
Condensation and subset extraction were tested on 6 and 11 multiplexer problems in order to compare their performance, although only 11 multiplexer results are shown due to lack of space. Figure 1 shows condensation on an 11 multiplexer problem. In this test, the population size limit was set to 1,000, condensation was triggered by 12,000 consecutive GA cycles of system error below 0.01, and deletion technique 3 from section 2 was used. [O] was first completed around 13,000 cycles and condensation began around 28,000 cycles. This excessive delay is the only cause for complaint with the condensation method, as the accelerated form of condensation performed very well, reducing [P] to [O] within a few hundred cycles once it had actually begun. Unfortunately, the large delay before beginning condensation is necessary due to the considerable variance, from run to run, in the interval between the trigger condition being met and the completion of [O]. If a more accurate estimate of the completion of [O] could be found it would greatly improve the efficiency and scalability of the condensation method.
Figure 2 shows the performance of subset extraction on the 11 multiplexer problem. Again, the population size limit was 1,000 microclassifiers and deletion technique 3 was used. [O] is first completed just after 16,000 cycles, but its final member is deleted, evolved again, and deleted again before reappearing around 18,000 cycles. [O] was finally extracted from [P] around 19,500 cycles. [O] was not extracted directly at 18,000 cycles because the numerosity allocation dynamics had not yet met the conditions of the numerosity distribution assumption made in section 3.1. However, the lag of 1,500 cycles in obtaining [O] from [P] is much better than that of 15,000 cycles for the condensation experiment.
Figure 1: A single run of an 11 multiplexer experiment using the condensation method. System error is a measure of the difference between the system's prediction of payoff and the actual payoff received on each cycle. Members of [O] is the percentage of [O] present (generated using problem-dependent information). Population size is the number of macroclassifiers / 1000.
Figure 2: A single run of an 11 multiplexer experiment using the subset extraction method.
Note the difference between figures 1 and 2 in terms of when [O] was first completed (13,000 and 16,000 cycles) and when system error dropped below 0.01 (13,000 and 15,000 cycles). It is this sort of variability that makes triggering condensation difficult and which necessitates the use of long delays for reliable performance.
I have presented two methods of evolving optimal solutions for boolean functions using XCS. Condensation is the simpler of the two to implement, but requires the user to choose an appropriate delay for the trigger, and because of this delay works more slowly than subset extraction. The delay is needed because our estimate, based on the system error statistic, of the time at which [O] is completed is imprecise. As a result, this approach scales poorly. Furthermore, if condensation begins too soon, [O] cannot be found, at least not without a costly restart of evolutionary search.
In contrast, the subset extraction method works without modification on both the 6 and 11 multiplexer problems despite the latter problem being considerably more complex. It is computationally tractable thanks to XCS's tendency to allocate more numerosity to the members of [O] than to any other classifiers, which allows us to check only a small number of the possible subsets of [P] for completeness and a lack of overlaps. XCS also tends to evolve minimal solutions, and when this does not occur it is possible to combine classifiers to form a minimal population. Finally, subset extraction, unlike condensation, can be applied at any time without ill effects, as failure to extract [O] leaves the population unchanged.
Both methods rely on XCS's tendency towards accurate generalization and its ability to form complete mappings of the input/action space. These two characteristics are not found in other classifier systems, which consequently are incapable of evolving optimal populations.
The success of these two methods demonstrates the effectiveness of XCS's tendency towards generalization and its ability to form complete mappings. Combined, these characteristics make XCS a reinforcement learning system which scales well, and suggest that XCS will be a useful means of investigating reinforcement learning based systems, for instance in the context of artificial life research.
I would like to thank Stewart Wilson, Manfred Kerber, Ian Wright and Aaron Sloman for their help in all its forms. This work was funded by Tim Kovacs and the School of Computer Science at the University of Birmingham.
XCS Classifier System Reliably Evolves Accurate, Complete, and Minimal Representations for Boolean Functions