A Response to the ML-95 Paper Entitled
"Hill Climbing Beats Genetic Search on a Boolean Circuit Synthesis Problem of
Koza's"
Version 1 distributed at the Twelfth International Machine Learning Conference
(ML-95) held in Tahoe City, California on July 9-12, 1995. Version 2
distributed at the Sixth International Conference on Genetic Algorithms
(ICGA-95) to be held in Pittsburgh on July 15-19, 1995. Version 3
deposited at WWW site on May 26, 1996.
Section 10 of
this response is entitled "A Peer Review of the Peer Reviewing Process of
the International Machine Learning Conference"
This paper comments on the paper by Kevin Lang entitled "Hill Climbing Beats Genetic Search on a Boolean Circuit Synthesis Problem of Koza's" that appears in the proceedings of the 1995 Twelfth International Machine Learning Conference.
In a paper entitled "Hill Climbing Beats Genetic Search on a Boolean Circuit Synthesis Problem of Koza's" at the 1995 International Machine Learning Conference, Kevin Lang raises the question of whether the existence of the population of individual entities (a key characteristic of genetic algorithms) actually performs any useful role in searching complicated search spaces. In particular, the paper focuses on searches in the space of computer programs (i.e., compositions of functions and terminals) searched by the technique of genetic programming (GP). The paper says,
"Genetic programming has attracted a large following ... probably because it comes with an appealing story about how the power of evolution is being harnessed by the technique. A crucial part of this story is the idea that the pool of highly fit individuals selected from earlier generations constitutes a valuable genetic resource that facilitates the creation of even more fit individuals in the future. This paper describes an experiment that casts doubt not only on the relative efficiency of genetic programming, but on the validity of the story about the value of the high fitness gene pool."
As an alternative to the population-based search employed in
the genetic algorithm, the paper proposes a population-free hill climbing
algorithm that goes from a single point in the search space of the problem to a
single new point. The proposed hill climbing algorithm does not use (and has no
use for) a population.
Specifically, the proposed population-free search algorithm starts with a
single random individual and, at each time step, remembers the single best
individual ever seen. Then, on each time step, one candidate new individual is
created and compared with the current single best individual. If the new
candidate is at least as good as the current best individual, the candidate
becomes the new best individual.
The candidate is created differently on alternate time steps. On the
even-numbered steps (starting with time step 0), the new candidate is a
randomly created entirely new individual. On the odd-numbered steps, the new
candidate is the offspring produced by "crossing" the current best
individual with a randomly created new individual. In this "crossing"
operation, a randomly chosen subtree from an especially created new random
individual is inserted at a randomly chosen point of the current best
individual (thereby replacing the subtree currently located at that point).
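The alternating scheme described above can be sketched in a few lines of Python. This is a minimal illustration of the procedure as described, not the paper's actual code; the function names and the toy problem used in the test are assumptions made purely for the sake of the example.

```python
import random

def hill_climb(random_individual, mutate, fitness, num_steps):
    """Sketch of the paper's population-free hill climber.

    A single champion is kept.  On even-numbered steps (starting with step 0)
    the candidate is a brand-new random individual; on odd-numbered steps it
    is a mutation of the champion (what the paper calls "crossing" the
    champion with a randomly created individual).
    """
    best = random_individual()          # time step 0: a random starting point
    best_fit = fitness(best)
    for step in range(1, num_steps):
        if step % 2 == 0:               # even step: entirely new random individual
            candidate = random_individual()
        else:                           # odd step: mutate the current champion
            candidate = mutate(best)
        cand_fit = fitness(candidate)
        if cand_fit >= best_fit:        # a tie also dethrones the champion
            best, best_fit = candidate, cand_fit
    return best, best_fit
```

Note that, per the paper's description, a candidate that merely ties the current best also replaces it.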
The paper explains,
"The main difference is that our algorithm generates new hypotheses by crossing the current champion with totally random individuals, while GP generates them by crossing pairs of individuals drawn from a high-fitness pool."
Using a testbed of 100 runs on each of the 80 distinct 3-argument Boolean
functions, the paper concludes,
"We found hill climbing to be more efficient than GP, with the advantage increasing to a factor of 40:1 on the hardest problem instances."
The paper's interpretation is that
"Evidently, the high-fitness pool is worse than [a] random source of components for building improved individuals."
The 1995 International Conference on Machine Learning accepted and published
this paper (Lang 1995).
When a scientific claim must be established by experimental
evidence (as opposed to proof), the persuasive quality of the experimental
testbed is necessarily a threshold issue. One would think that most researchers
in automated learning would be very surprised to see the family of 3-argument
Boolean functions proffered as a suitable testbed for reaching generalizable
conclusions about machine learning.
The paper attempts to justify its choice of 3-argument Boolean functions as
follows:
"The testbed for the experiment is the Boolean circuit synthesis task described in chapter 9 of Genetic Programming. This task was selected by the book's author on which to battle an earlier batch of critics who suspected that genetic programming was nothing more than a disguised version of random-generation-and-test algorithm ... " (Emphasis added)
However, chapter 9 of Genetic Programming: On the Programming of Computers by Means of Natural Selection (Koza 1992) says precisely the opposite. The 3-argument Boolean functions were not "selected by the book's author" as a suitable testbed for disproving the claim that genetic search is no better than blind random search - much less for basing generalizations about machine learning algorithms. In fact, in the trivial world of 3-argument Boolean functions, blind random search actually outperforms genetic programming in a good many instances. Far from endorsing 3-argument Boolean functions as a suitable testbed, chapter 9 specifically dismisses them (page 206) as
"... the obvious exception of certain designated trivial problems (some of which [namely, the 3-argument Boolean functions] appear later in this chapter ... ) ... " (Emphasis added).
A glance at chapter 9 shows that the category of
experimental evidence that Genetic Programming does embrace for the purpose of
comparison with random-generation-and-test is the "generation 0"
performance of genetic programming (i.e., blind random search). The book
provides a mass of experimental evidence about "generation 0"
performance in the form of 20 fitness curves, 43 performance curves, and 44
probability of success curves. This collection of curves (each based on up to
several hundred runs) covers a wide variety of different problems, including
many benchmark problems from robotics, control, system identification,
optimization, and game playing. Only three of the 81 or so different problems
in Genetic Programming are Boolean problems (namely, the 6-argument multiplexer,
the 11-argument multiplexer, and the even-5-parity function). (The contents of
chapter 9 are intentionally not even considered as "problems" by the
index of the book).
The experimental evidence that chapter 9 embraces to show that genetic
programming is not just blind random search is not the trivial family of 80
3-argument Boolean functions, but, instead,
"[t]he fact that we do not solve a problem on generation 0, but do solve the problem before generation 50 on a high percentage of the runs ... "
Boolean functions of any arity are, by themselves, of
questionable value for basing generalizations about the performance of learning
algorithms. Boolean functions are unusual in that both the range and domain are
discrete and their performance can be efficiently represented by finite truth
tables. Boolean functions of arity 3 are particularly inadequate as a suitable
testbed since they exhibit a mere 9 discrete levels of fitness (i.e., from 0 to
8 correct answers for the 8 possible combinations of the 3 Boolean arguments).
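The claim about the 9 discrete fitness levels is easy to verify by brute force. The short Python sketch below (an illustration written for this discussion, not code from either paper) scores every possible 8-row truth table against the even-3-parity target:

```python
from itertools import product

# A 3-argument Boolean function is fully described by an 8-row truth table,
# so fitness against any fixed target is simply the number of agreeing rows.
target = [a ^ b ^ c ^ 1 for a, b, c in product([0, 1], repeat=3)]  # even-3-parity

def fitness(table):
    return sum(t == o for t, o in zip(table, target))

# Enumerate all 2**8 = 256 possible truth tables and collect the fitness values.
levels = {fitness(list(bits)) for bits in product([0, 1], repeat=8)}
print(sorted(levels))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8] -- exactly 9 discrete levels
```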
Indeed, 3-argument Boolean functions are about as useful as looking at the
first three integers to conclude that all integers are prime numbers.
The inadequacies of Boolean functions in general and 3-argument Boolean
functions in particular are so obvious and so well known that it is difficult
to discern the reason why an experimental paper that did no more than
mechanically tabulate the results of 100 runs of 80 3-argument Boolean
functions could possibly have been accepted for publication by the
International Conference on Machine Learning.
In any event, the burden is on the paper to justify its surprising choice of
the manifestly trivial 3-argument Boolean functions as a suitable testbed. The
attempt to invoke support for this surprising choice from chapter 9 of Genetic
Programming is just plain wrong. Other than this mischaracterization of chapter
9, the paper provides no other justification to support its surprising choice
on this threshold issue. The inadequacy of this testbed and its effect on the
validity of the paper's conclusions will become apparent momentarily.
When a published paper bombastically proclaims in its title
that it is going to "beat" a particular published result and
over-personalizes the issue by including the name of the author of the
previously published result in its title, one can reasonably expect that the
paper will present minimal familiarity with the subject matter it promises to
"beat."
Lang refers (section 1) to "our version of the hill climbing algorithm"
in which "[t]he main difference is that our algorithm generates new
hypotheses by crossing the current champion with totally random individuals,
while GP generates them by crossing pairs of individuals drawn from a
high-fitness pool" (Emphasis added).
However, anyone familiar with even the basic elements of genetic programming
knows that what Lang calls "our" approach of "crossing the
current champion with totally random individuals" is nothing more than the
familiar GP operation called mutation. This ordinary GP operation is
described in detail in chapter 6 on page 106 of Genetic Programming (and is
properly credited in chapter 4 on page 65 to several previous researchers from
the mid-1980s, the most extensive and detailed of which is Hicklin's 1986 work).
As clearly stated on page 106 of Genetic Programming,
"The mutation operation ... removes whatever is currently at the selected point and whatever is below the selected point and inserts a randomly generated subtree at that point."
Lang's "crossing" is not sexual recombination or crossover at all.
Nor is it new. It is ordinary GP mutation. Once this unacknowledged equivalence
is recognized, the way the paper's proposed hill climbing algorithm goes about
creating each successive new point in its search can be restated in the
following simpler way: The new candidate is either a randomly created entirely
new individual (i.e., identical to the way individuals are created in
"generation 0" of a GP run) or the offspring produced by a random GP
mutation of the current best individual. In other words, we are about to imbibe
of the very old wine of "mutate and save the best."
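For readers unfamiliar with the operation in question, the page-106 mutation operation can be sketched roughly as follows. The tuple-based tree representation, the primitive sets, and the branching probabilities below are illustrative assumptions made for this sketch, not details taken from Genetic Programming:

```python
import random

# Illustrative primitive sets (assumptions, not Koza's exact function set).
OPS = ["and", "or", "nand", "nor"]
VARS = ["a", "b", "c"]

def random_tree(depth=3):
    """Grow a random program tree as nested (op, left, right) tuples."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(VARS)
    return (random.choice(OPS), random_tree(depth - 1), random_tree(depth - 1))

def mutate(tree):
    """Subtree mutation: pick a random point in the tree and replace
    whatever is there (and below it) with a freshly grown random subtree."""
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return random_tree()                  # mutation point reached
    op, left, right = tree
    if random.random() < 0.5:
        return (op, mutate(left), right)      # descend into the left child
    return (op, left, mutate(right))          # descend into the right child
```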
We will use the even-3-parity function (one of the 80
members of the family of 3-argument Boolean functions) to illustrate the
fallacy behind the paper's bombastic proclamation that it "beats"
genetic programming.
For convenience, we will refer to the already-published histogram showing how
well 1,000,000 randomly created programs perform the even-3-parity task found
in table 26.1 of Genetic Programming II (Koza 1994a) and shown here as figure
1. Out of 1,000,000 randomly generated programs, 67.5% have fitness 4 (i.e.,
produce the correct answer for half of the 8 combinations of the 3 Boolean
inputs), 15.6% have fitness 5 and 3 each, 0.5% have fitness 2 and 6, only 0.33%
have fitness 7 or 1, and none (out of 1,000,000) have fitness 8 (the best) or 0
(the worst). Note that the vertical axis of this figure has a logarithmic scale
running between 10^0 and 10^6 (thereby highlighting the otherwise imperceptibly
small number of programs that have scores that stray very far from the mean
value of 4).
Figure 1 Histogram for the even-3-parity problem.
Because two thirds of the random population have fitness 4, a run of the
proposed population-free hill climbing algorithm will typically start at time
step 0 with a randomly created individual with a fitness of 4. On step 1 (and
every succeeding odd-numbered step), a random GP mutation of the current best
individual will be executed. On step 2 (and every succeeding even-numbered
step), a randomly created entirely new individual will be generated. This
entirely new random individual (and every succeeding entirely new random
individual generated on these even-numbered steps) will have a 2/3 probability
of having fitness 4 and a 1/6 probability of having fitness 5 (as shown in the
histogram).
After alternating these odd and even time steps a few times, the current best
individual will probably work its way up to a fitness value of 5 and later to
6.
Once the current best individual reaches a fitness level of 6, the
even-numbered steps start to become irrelevant since there is only a 0.33% (1
in 300) chance of creating a randomly created entirely new individual with
fitness 7. The point is that the even-numbered steps always draw from the same
well (i.e., the known distribution as shown in figure 1).
And, when the fitness of the current best individual reaches the level of 7
(most likely due to an odd-numbered step), the chance that an even-numbered
step will create an entirely new individual with fitness 8 is considerably less
than one in a million. These odds are essentially "never" in relation
to the total of 1,250 steps involved in the runs in the paper.
In other words, as soon as the fitness of the current best individual moves
just a few standard deviations away from the mean value for randomly created
new individuals, the even-numbered steps in the proposed hill climbing
algorithm become totally ineffective and irrelevant in advancing the progress
of the search. In fact, the even-numbered steps matter only when the best
current individual in this hillclimbing algorithm is very close to the mean
(e.g., at the very beginning of the run).
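This argument can be checked numerically. Treating the percentages read from figure 1 as draw probabilities for a fresh random individual (an approximation from the histogram, not exact published figures), the expected number of even-numbered steps before a strictly better individual appears is the geometric expectation 1/p:

```python
# Approximate draw probabilities for a fresh random individual, read from the
# even-3-parity histogram (figure 1); fitness 8 was never observed in
# 1,000,000 random programs.
p_random = {0: 0.0, 1: 0.0033, 2: 0.005, 3: 0.156, 4: 0.675,
            5: 0.156, 6: 0.005, 7: 0.0033, 8: 0.0}

def expected_even_steps(current_best):
    """Expected number of even-numbered steps before a fresh random
    individual strictly beats `current_best` (geometric expectation 1/p)."""
    p = sum(prob for fit, prob in p_random.items() if fit > current_best)
    return float("inf") if p == 0 else 1.0 / p

print(expected_even_steps(6))  # ≈ 303 steps -- the "1 in 300" figure above
print(expected_even_steps(7))  # → inf -- the even steps essentially never help
```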
The irrelevance of the even-numbered steps would have been obvious to the
paper's author had he not made the threshold error (endorsed by the 1995
International Machine Learning Conference) of employing the very trivial
3-argument Boolean functions as its chosen testbed and had, instead, studied
the ever-so-slightly less trivial 4-argument or 5-argument Boolean functions.
For example, figure 2 shows the histogram of fitness values for 1,000,000
randomly generated programs for the even-4-parity problem. 100% of the
1,000,000 randomly created individuals for the even-4-parity problem have
fitness between 4 and 12. None of these 1,000,000 randomly created individuals
even come close to the best level of fitness of 16 because 16 is many, many
standard deviations away from the mean of 8. Note again that the vertical axis
of this figure has a logarithmic scale.
Figure 2 Histogram for the even-4-parity problem.
Similarly, figure 3 shows the histogram of fitness values for 1,000,000
randomly generated programs for the even-5-parity problem. 100% of the
1,000,000 randomly created individuals for the even-5-parity task have fitness
between 12 and 20. Again, a perfect score of 32 is many, many standard
deviations away from the mean value of 16. Note again that the vertical axis of
this figure has a logarithmic scale.
Figure 3 Histogram for the even-5-parity problem.
Thus, once the proposed population-free hillclimbing algorithm moves even
slightly above the mean, there is essentially no chance that the even-numbered
steps will produce a new individual that will dethrone the current best
individual. The entire burden of advancing the progress of the search falls,
almost immediately, upon the odd-numbered steps (i.e., the ordinary GP
operation of mutation).
Having eliminated the even-numbered steps from the picture, the question then
becomes whether the odd-numbered steps can successfully search a complicated
search space. That is, the question is whether the ordinary GP mutation
operation is capable of successfully searching the tiny world of 3-argument
Boolean functions.
We need not speculate on the answer to this question because the answer is
already in the paper. Lang apparently first tried an approach consisting only
of the odd-numbered steps (i.e., the ordinary GP mutation operation) and
apparently discovered that this approach didn't work even in the trivial world
of 3-argument Boolean functions. The paper says,
"Fully half of our candidates circuits were random, drawn from the same distribution that yielded poor performance for [random-generation-and-test]."
Tellingly, the paper continues,
"While this seems like a waste of resources, we needed to have some mechanism for escaping from the local optima that would have resulted from keeping only one good circuit around at a time. (Emphasis added)."
In other words, the paper concedes that the odd-numbered steps alone (i.e.,
the ordinary GP mutation operation) cannot successfully search a non-linear
space because of the well-known problem of lingering for long stretches of time
at various suboptimal levels of fitness.
Given that the paper concedes the inability of the odd-numbered steps to work
and given the provable inability of the even-numbered steps to play any role in
the search (once the search breaks out of the immediate neighborhood of the
mean value of the initial random distribution), it is clear that neither part
of the proposed algorithm works. The proposed population-free hill climbing
algorithm is a pasting together of an approach that the paper concedes doesn't
work with an ineffective and irrelevant approach that provably can't work.
Presumably, Lang came to the realization that he "needed to have some
mechanism" from his own preliminary experimentation on the family of
3-argument Boolean functions. Then, because of the threshold error of
conducting these experiments in the trivial testbed of 3-argument Boolean
functions (a glaringly obvious error accepted uncritically by the
International Machine Learning Conference), all the search activity of
consequence was concentrated in the highly confined interval between a fitness
level of 6 and 8. In this highly misleading and trivial world, the author of
the paper managed to convince himself that his even-numbered steps (i.e., the
resort to totally random "generation 0" individuals) actually
contributed something to the progress of the search.
It is noteworthy that the proposed population-free algorithm dethrones the current
best individual whenever the fitness of a newly created individual is better
than or equal to that of the current best individual. In the highly confined
range of fitness levels 6, 7, and 8 for the 3-argument Boolean functions, this
"fine print" of the hill climbing algorithm played a role in further
misleading the author of the paper. The author convinced himself (and the 1995
International Machine Learning Conference uncritically accepted his conclusion)
that he "beat" genetic programming because of the peculiarities of
his own, unjustified choice of the trivial, highly confined, unsuitable testbed
of 3-argument Boolean functions.
When an accepted and published paper proclaims in its title
that it is going to "beat" previously published results from a
specific book, the reader is reasonably entitled to assume that the author of
the paper (and the scholarly institution accepting and publishing the paper)
has accorded at least minimal attention to relevant evidence contained in the
specific book that is being "beaten."
However, the paper inexplicably pays no heed at all to the experiment (on page
607 in chapter 25 of Genetic Programming) that contains unwelcome evidence
contrary to its position. That experiment considers the ability of the GP
mutation operation alone (i.e., the odd-numbered steps of the proposed hill
climbing algorithm) to successfully search spaces of computer programs for the
6-argument Boolean multiplexer problem.
Figure 4 (identical to figure 25.15 from Genetic Programming) shows, by
generation, the probability of success for 200 runs of the Boolean
6-multiplexer problem with a population of 1,000 for runs employing the normal
GP operations (i.e., with ordinary two-parent GP crossover) and for 200 other
runs employing only the mutation operation. The probability of success after 50
generations reaches 100% for this problem for runs employing the normal GP
operations. However, for runs employing the mutation operation, the probability
of success after 50 generations is only 8%.
Figure 4. Contrary evidence from Genetic Programming (1992) that was
inexplicably ignored in the paper proclaiming that it "beats genetic
programming."
This experiment constitutes evidence, sitting right in the very same book that
the paper bombastically proclaims it is going to "beat" and that the
paper (and the scholarly institution accepting and publishing the paper) had no
right to ignore. While Boolean 6-argument functions are, in my opinion, also
very simple-minded (in relation to the mass of non-Boolean problems in Genetic
Programming and Genetic Programming II), when one travels in the world of the
trivial, the evidential value of the 6-argument Boolean functions surely far
exceeds that of the 3-argument domain.
In addition, the extended run of the 6-multiplexer (on page 614 in section
25.13 of Genetic Programming) provides additional contrary evidence indicating
that the subtrees available for crossover improve in later generations of a run
of genetic programming.
A misstatement contained in section 2 of Lang's paper as to
how the genetic algorithm works directly relates to the author's
misunderstanding of what he was doing. The misstatement relates to the
important issue of why the author perceived that he "needed to have some
mechanism" for escaping from the local optima.
Specifically, the misstatement is that, "...successive generations ... are
created by selecting the best trees from preceding generations." (Emphasis
added). This is precisely how the genetic algorithm does not work.
In the genetic algorithm, the selection is probabilistic, so that any tree may
be selected. Known-bad trees are often selected. Even the worst tree can potentially
be selected. The probability of selection increases with increased observed
fitness, so that a better tree definitely has a greater probability of being
chosen than a less fit tree. However, while the genetic algorithm is somewhat
greedy, it is not entirely greedy. The genetic algorithm intentionally
allocates some new trials in its search to known-inferior individuals.
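The probabilistic selection just described is conventionally implemented as fitness-proportionate ("roulette-wheel") selection. The sketch below is a generic textbook version of that mechanism, not code from any of the works discussed:

```python
import random

def proportionate_select(population, fitnesses):
    """Fitness-proportionate (roulette-wheel) selection: every individual,
    even the worst, has a nonzero chance of being picked whenever its
    fitness is positive; fitter individuals are simply more likely."""
    total = sum(fitnesses)
    spin = random.uniform(0, total)     # spin the wheel
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if spin <= running:
            return individual
    return population[-1]               # guard against floating-point drift
```

With fitnesses of 1, 1, and 8, for example, the weakest individual still retains a 10% chance of selection on every draw, which is precisely the intentional allocation of trials to known-inferior individuals described above.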
A consideration of simulated annealing further clarifies the importance of not
exclusively doing hill climbing in searching a non-trivial space. Simulated
annealing is a population-free search algorithm that moves, on each generation
(time step), from a single point in the search space of the problem to a new
single point. Simulated annealing generates a candidate new point from the
search space of the problem. However, unlike simple hill climbing (but like the
genetic algorithm), simulated annealing occasionally (probabilistically)
discards a current good point in exchange for a known-inferior new point.
Indeed, without this ability to occasionally go to known-inferior points,
simulated annealing degenerates to simple hill climbing. Then, like simple hill
climbing searches, the search would necessarily often become trapped on
locally-optimal (but globally-non-optimal) points of a non-trivial search
space.
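The acceptance rule that separates simulated annealing from simple hill climbing is the Metropolis criterion (Metropolis et al. 1953). A minimal sketch follows, assuming a maximization problem where `delta` is the fitness change produced by the proposed move:

```python
import math
import random

def metropolis_accept(delta, temperature):
    """Metropolis acceptance rule for a maximization problem.

    An uphill (or equal) move is always accepted; a downhill move to a
    known-inferior point is accepted with probability exp(delta / T),
    which is what lets the search escape local optima."""
    if delta >= 0:                      # new point is at least as good
        return True
    return random.random() < math.exp(delta / temperature)
```

As the temperature approaches zero, the downhill acceptance probability vanishes and the rule degenerates to simple hill climbing, exactly as noted above.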
The intentional allocation of some trials to known-inferior individuals (in
favor of known-better individuals) is one of the features that distinguishes
the genetic algorithm (and simulated annealing) from simple hill climbing.
Moreover, for both simulated annealing and the genetic algorithm, this
occasional preference toward known-inferior points is not done haphazardly,
but, instead, is done in a principled (exponentially increasing) way that
is closely related to the solution of the multi-armed bandit problem that turns
out to be near-optimal under certain reasonable definitions and assumptions
(Holland 1975; Kirkpatrick, Gelatt, and Vecchi 1983; and Metropolis,
Rosenbluth, Rosenbluth, Teller, and Teller 1953).
None of this is new. The deficiencies of hill climbing are well known because
in any non-linear search space (which is to say, any non-trivial search space),
hill climbing will become trapped on local optima.
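A toy example makes the trap concrete. The one-dimensional landscape below is an invented illustration with a local optimum at x = 3 and a global optimum at x = 10; a strictly greedy climber started near the wrong peak halts at the local one:

```python
def greedy_hill_climb(x, fitness, neighbors, max_steps=1000):
    """Pure hill climbing: move only to a strictly better neighbor.
    On a multimodal landscape it halts at the nearest local optimum."""
    for _ in range(max_steps):
        better = [n for n in neighbors(x) if fitness(n) > fitness(x)]
        if not better:
            return x            # trapped: no improving neighbor exists
        x = better[0]
    return x

# Invented bimodal landscape: local peak at x=3 (fitness 5), global at x=10.
fitness = lambda x: 5 - abs(x - 3) if x < 7 else 10 - abs(x - 10)
neighbors = lambda x: [x - 1, x + 1]
print(greedy_hill_climb(0, fitness, neighbors))  # → 3, not the global optimum 10
```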
It is interesting to note the vastly different weight that the author of the paper proclaiming to "beat" genetic programming accords to his own 100 runs of 80 3-argument Boolean functions versus the amount of weight he gives to mass of evidence contained in Genetic Programming. The paper tells us,
"The book Genetic Programming demonstrates the method's generality by exhibiting solutions to a variety of problems. However, it provides little evidence for believing that GP is better than the many existing algorithms for general learning and optimization, much less than the specialized algorithms that exists for many of the various problems."
Skipping past the puzzling question as to the identity of these unnamed
"existing algorithms for general learning" that may be capable of
creating computer programs to solve problems, let's look at the composition of
this "little evidence" that the paper dismisses in preference to its
own 100 runs of the family of 80 3-argument Boolean functions.
First, the book Genetic Programming provides evidence that genetic
programming can evolve satisfactory solutions for 81 or so problems from a
variety of fields, including many benchmark problems from robotics, control,
system identification, optimization, and game playing.
Is that "little evidence" in relation to a mechanical tabulation of
100 runs of 80 3-argument Boolean functions?
Second, Genetic Programming II (Koza 1994a) describes how to evolve
multi-part programs consisting of a main program and one or more reusable,
parameterized, hierarchically-called subprograms (called "automatically
defined functions"). Genetic Programming II contains evidence of the
effectiveness of automatically defined functions from a couple of dozen
distinct areas, including one where the genetically evolved program was
superior to that written by several different human investigators.
Third, Genetic Programming II also provides evidence that genetic
programming requires less computational effort (i.e., fewer fitness evaluations
to yield a solution with a satisfactorily high probability) with automatically
defined functions than without them (provided the difficulty of the problem is
above a certain relatively low break-even point). Genetic Programming II also
provides evidence that genetic programming usually yields solutions with
smaller average overall size with automatically defined functions than without
them (provided, again, that the problem is not too simple). That is, both
learning efficiency and parsimony appear to be properties of genetic programming
with automatically defined functions.
Fourth, Genetic Programming II also provides evidence that genetic
programming with automatically defined functions is scalable. For several
problems for which a progression of scaled-up versions was studied, the
computational effort increases as a function of problem size at a slower rate
with automatically defined functions than without them. In addition, the
average size of solutions similarly increases as a function of problem size at
a slower rate with automatically defined functions than without them. This
observed scalability results from the profitable reuse of
hierarchically-callable, parameterized subprograms within the overall program.
Fifth, six new architecture-altering operations have recently been developed
(Koza 1994b, 1995) for genetic programming that are patterned after the
naturally-occurring chromosomal operations of gene duplication and gene
deletion. When these new operations are included in a run of genetic
programming, genetic programming can dynamically change, during the run, the
architecture of a multi-part program consisting of a main program and a set of
hierarchically-called subprograms. These on-the-fly architectural changes occur
while genetic programming is concurrently evolving the work-performing steps of
the main program and the hierarchically-called subprograms. The new operations
can be interpreted as an automated way to change the representation of a
problem while solving the problem. Equivalently, these operations can be viewed
as an automated way to decompose a problem into a non-pre-specified number of
subproblems of non-pre-specified dimensionality; solve the subproblems; and
assemble the solutions of the subproblems into a solution of the overall
problem. These operations can also be interpreted as providing an automated way
to specialize and generalize.
Sixth, over 250 articles using genetic programming have now been published by
other authors, including one collection of two dozen papers on advances in
genetic programming (Kinnear 1994, Koza 1994a).
All of the above is dismissed as "little evidence" while the paper
(and, more importantly, the 1995 International Machine Learning Conference) has
promoted a tabulation of 100 runs of 80 3-argument Boolean functions into the
only accepted and published paper relating to genetic programming to be
presented to the 1995 Machine Learning Conference.
The question arises as to whether the proposed population-free hill climbing algorithm continues to "beat" genetic programming when the number of Boolean arguments is expanded from 3 to the domain of 4- and 5-argument Boolean functions.
The paper claims that its algorithm "beat" genetic
programming by its largest margin on the most challenging 3-argument Boolean
functions (including the even-3-parity function), given the representation,
fitness measure, and operations involved in these discussions.
Thus, we made two series of 12 runs of the Boolean even-4-parity problem. One
series used ordinary GP and the other used the proposed population-free hill
climbing algorithm.
A maximum time limit of processing 4,800,000 individuals was used for both
series of runs here (and for both series of runs for the even-5-parity in the
following section of this paper).
All 12 runs using ordinary GP produced solutions before generation 20 using a
population size of 32,000. The average number of individuals that were
processed to yield a solution using ordinary GP was 528,500 (i.e., 16.5
generations).
The proposed hill climbing algorithm solved the problem in 9 of the 12 runs
within the 4,800,000 limit (32,000 times 150). The 9 solutions came on very
early time steps (the latest being at step 58,879). Similarly, the best value
of fitness in the 3 failed runs came on very early time steps (70, 2,329, and
3,803). No subsequent improvement occurred during the last 99.93% of the
duration of these 3 failed runs. There is an early burst of activity followed
by a long freeze.
Table 1 shows the outcome of the 12 runs of the proposed population-free hill
climbing algorithm, the time step on which the best value of fitness was
attained, and the length of the run. As can be seen, all 12 runs became
stabilized very early - either because a solution was found very early (for 9
of the 12 runs) or because the run droned on futilely until step 4,800,000 (for
the other 3 runs).
If we make the overly generous assumption that lightning will strike on all 3
failed runs of the proposed hill climbing algorithm on time step 4,800,001
(after having been moribund for 99.93% or more of the first 4,800,000 steps),
the average number of individuals that would need to be processed to yield a
solution would be 1,218,640 with the proposed hill climbing algorithm. That is
2.3 times the 528,500 required for ordinary GP, so the proposed hill climbing
algorithm does not "beat" genetic programming for this problem at the
4-argument level. That is, the ranking between the proposed hill climbing algorithm and
genetic programming has already reversed.
Table 1  Summary of hill climbing runs for the even-4-parity problem.

Run  Outcome  Step of best fitness  Length of run
1    Solved   14,285                14,285
2    Solved   16,347                16,347
3    Solved   16,897                16,897
4    Solved   17,437                17,437
5    Solved   20,261                20,261
6    Solved   21,859                21,859
7    Solved   22,991                22,991
8    Solved   34,723                34,723
9    Solved   58,879                58,879
10   Failed   70                    4,800,000
11   Failed   2,329                 4,800,000
12   Failed   3,803                 4,800,000
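The arithmetic behind this "lightning strike" average can be checked directly against the step counts in Table 1 (a minimal sketch; the 528,500 figure is ordinary GP's average as stated in the text):

```python
# Step counts for the 9 solved runs from Table 1, plus 3 failed runs
# generously assumed to succeed on step 4,800,001.
solved = [14285, 16347, 16897, 17437, 20261,
          21859, 22991, 34723, 58879]
assumed = [4_800_001] * 3

average = sum(solved + assumed) / 12
print(round(average))               # -> 1218640 individuals
print(round(average / 528_500, 1))  # -> 2.3 times ordinary GP's average
```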
We then made two series of 12 runs of the Boolean
even-5-parity problem.
Again, all 12 runs using ordinary GP produced solutions prior to generation 71
using a population size of 64,000. The average number of individuals that were
processed to yield a solution using ordinary GP was 2,913,583 (i.e., 45.5
generations).
The gap in performance between the proposed hill climbing algorithm and genetic
programming widens for this problem at arity 5. Only 3 of the 12 runs using
the proposed hill climbing algorithm solved the problem within the limit of
4,800,000 (75 times 64,000) processed individuals. Again, these 100%-correct solutions
emerged on very early steps (37,028, 59,643, and 60,226). Similarly, the best
value of fitness ever attained during the 9 failed runs emerged on very early
steps (between time steps 0 and 16,747). No subsequent improvement occurred
during the last 99.7% of the duration of these 9 failed runs.
We believe that there is little chance of soon dislodging these
seemingly-unending runs from the plateau of suboptimal fitness values on which
they are lingering, and little chance that any additional progress would have
been made had these runs been continued for many millions of additional steps.
Of course, given enough time, Lang's random mutation of program trees will
eventually solve any problem whatsoever. All runs of Lang's algorithm will
eventually become successful at some time in the distant future.
Table 2 shows the outcome of the 12 runs of the proposed population-free hill
climbing algorithm, the time step on which the best value of fitness was
attained, and the length of the run. As can be seen, all 12 runs stabilized
very early - either because a solution was found very early (for 3 of the 12
runs) or because the run droned on futilely until step 4,800,000 (for 9 of the
12 runs).
Four of the 9 failed runs never broke out above 20 (out of 32) in fitness.
Table 2  Summary of hill climbing runs for the even-5-parity problem.

Run  Outcome  Step of best fitness  Length of run
1    Solved   37,028                37,028
2    Solved   59,643                59,643
3    Solved   60,226                60,226
4    Failed   0                     4,800,000
5    Failed   39                    4,800,000
6    Failed   187                   4,800,000
7    Failed   284                   4,800,000
8    Failed   4,009                 4,800,000
9    Failed   5,985                 4,800,000
10   Failed   8,631                 4,800,000
11   Failed   14,279                4,800,000
12   Failed   16,747                4,800,000
Practical considerations of computer time motivated the decision to terminate
all runs of both the 4- and 5-parity problems for both genetic programming and
the hill climbing algorithm after processing 4,800,000 (75 times 64,000)
individuals. Each run of the hill climbing algorithm consumed about 20 hours on
our 90 MHz Pentium PC-type computer.
If we again make the overly generous assumption that lightning will strike on
all 9 failed runs of the proposed hill climbing algorithm on time step
4,800,001, the average number of individuals that would need to be processed to
yield a solution would be 3,763,076 with the proposed hill climbing algorithm.
That is 1.3 times the 2,913,583 for ordinary GP.
Of course, the overly generous assumption that all 9 failed runs will suddenly
succeed at step 4,800,001 substantially reduces the lead of genetic programming
over the hill climbing algorithm for the even-5-parity problem (as compared to
the even-4-parity problem). At the time of termination of the 9 failed runs,
the reigning champions had been sitting on their thrones for at least 99.7% of
the 4,800,000 steps of the runs.
As explained in detail in chapter 8 of Genetic Programming, the use of
arithmetic averages is an oversimplification when there are failed runs in a
probabilistic algorithm. However, because of the slowness of the proposed hill
climbing algorithm, construction of a performance curve was not an option for
us. In any case, the collapse of the hill climbing algorithm is obvious as the
arity is scaled up from 3 to 4 to 5.
Based on these experiments, we totally concur with the author's own concession
that he
" ... needed to have some mechanism for escaping from the local optima ... "
Unfortunately, his proposed hill climbing algorithm contains
no such mechanism.
The reversal of the ranking of genetic programming and the proposed hill
climbing algorithm between the even-3-parity problem and the even-4-parity
problem, and the widening of genetic programming's lead over the hill climbing
algorithm between the even-4-parity problem (3 failed runs out of 12 for the
hill climbing algorithm) and the even-5-parity problem (9 failed runs out of
12), strongly suggest that the proposed hill climbing algorithm simply doesn't
work on anything but the most trivial, unrepresentative, and misleading of
test beds.
The proposed algorithm doesn't work because it is based on hill climbing
(instead of the less greedy, but mathematically near-optimal allocation of
trials inherent in the genetic algorithm and simulated annealing) and because
the pool of individuals in the population does, in fact, provide useful
subtrees that are beneficially exploited by the true two-parent sexual
recombination operation (crossover).
The acceptance and publication by the 1995 International Conference on Machine
Learning of an experimental paper that mechanically tabulated the results of
100 runs on the family of 80 3-argument Boolean functions raises questions
about the ML-95 reviewing process.
Such a mechanical tabulation (even if its conclusions were correct) would
hardly qualify for a passing grade in a reputable university course. The paper
becomes further removed from the realm of the appropriate by its bombastic
title proclaiming that it "beats" genetic programming, its over-personalized
inclusion of another author's name in its title, and its sociological
introduction characterizing GP's "large following" as gullibly embracing some
kind of "appealing" Pied Piper "story" of evolution.
One does not need to be a rocket scientist to know that 3-argument Boolean
functions are about as useful for purposes of generalization as looking at the
first three integers to conclude that all integers are prime numbers. Nor does
one have to be a rocket scientist to know about the deficiencies of hill
climbing as a search technique for non-linear search spaces.
Acceptance of the Lang paper by ML-95 therefore cannot be viewed as a matter
of bad judgment, but rather as a suspension of judgment.
Why were good judgment and normal scientific standards suspended in favor of
this paper that proclaims that it "beats" genetic programming?
This question does not arise in a vacuum, as if this particular GP-bashing
paper is an isolated event in the history of the International Machine Learning
Conference. Concurrent with the acceptance of the Lang paper
"beating" genetic programming, 100% of the genetic programming papers
that were submitted to the 1995 Machine Learning Conference were rejected. And,
all but one of the papers on genetic programming submitted to the Machine
Learning Conference over the past five years have been rejected.
(The exact number of submitted 1995 GP papers is not fully known to us at this
time; however, it is at least 7. Based on the co-chair's statement that there
were 15-20 submitted GA-GP papers and other information that we will be
verifying in the next few weeks as we attend other conferences, the number is
almost certainly higher than 7).
Focusing on the 7 currently known rejected 1995 GP papers, is there anyone who
is prepared to step forward and straight-facedly assert that 0% of the papers in
this entire category of papers had any merit (while the same conference was
accepting and publishing 68 other papers on machine learning)?
And, since Lang's paper provides a standard as to what is acceptable at the
ML-95 conference, is there anyone with any scientific integrity who is prepared
to step forward and straight-facedly assert that 100% of the papers in the GP
category were inferior to Lang's accepted GP-bashing paper?
And, looking back over previous years, is it reasonable to say that almost 100%
of the many GP papers submitted to ML in previous years had no merit?
Surely, the problem with the GP papers is not relevance, since it is hard to
think of anything more clearly embraced by Art Samuel's term "machine
learning" than a method for creating computer programs to solve a broad
range of problems. Thus, ML's long history of rejections must be based on the
proposition that there is a persistent lack of scientific merit among the GP
papers.
Surely all processes that aspire to any degree of integrity and credibility are
susceptible to some accountability. If the peer review process is appropriate
for assessing the quality of submitted papers, then it seems altogether fitting
and proper that a peer review process should be used to assess the quality of
ML's peer review process in connection with the question of why essentially
100% of GP papers are consistently rejected by the Machine Learning Conference.
Of course, in undertaking a peer review of ML's peer review process, mere
differences of opinion can be resolved in favor of ML and quibbling over minor
matters is to be avoided. However, after generously granting an a priori
presumption in favor of ML, it is nonetheless the case that this presumption
remains susceptible to reversal upon presentation of evidence of a course of
conduct that is intellectually dishonest and preposterous.
We, therefore, now present a review of the currently available ML-95 reviews
(as supplied by the authors of the 7 currently known rejected 1995 GP papers).
Paper 302 uses genetic programming to evolve a motif for
families of proteins. The use of automatically defined functions in conjunction
with genetic programming overcomes the deficiency that existing motif languages
used by molecular biologists do not provide for reusability of common
subexpressions. Paper 302 compares a genetically evolved motif to the motif
determined by the committee of human molecular biologists that meets in Geneva to
identify and archive regularities in protein sequence data. In paper 302, GP
evolved a feature that the committee of human experts missed. The genetically
evolved motif (which takes the missed feature into account) does slightly
better than the committee of human experts on the human committee's own data.
Paper 302 suggests a reasonable biological interpretation for the feature
missed by the human committee.
The 1995 International Machine Learning Conference complains,
"The problem attacked in the paper has already been
solved (GP does give a very slightly better solution, but the original
solution was already almost perfect)"
(Emphasis added)
Yes, indeed, the problem has already been solved. But, the
reviewer fails to mention that this existing solution was accomplished by
humans! The "original solution" that GP only "slightly"
exceeded was expert human performance on an international committee of
human experts on molecular biology.
One would think that the purpose of the field of machine learning was to
develop automated means to rival human performance on non-trivial problems.
Just how many published papers at the ML-95 conference deliver anything near
human-level performance, much less the merely "slightly better" than
human-level performance?
Since Lang's paper was accepted and paper 302 was rejected, one can conclude
that a mechanical tabulation of 100 runs of 80 3-argument Boolean functions has
more scientific merit, in the eyes of the ML-95 conference, than exceeding
human-level performance on a problem ordinarily handled by an international
committee of human experts.
ML-95 complains about paper 228 because
" ... there [were no] ... comparisons with other search methods."
The above is just one of several examples of precisely the same criticism, in
almost the same words (coming, no doubt, from precisely the same reviewer)
daisy-chaining their way through ML-95's reviews of the 7 rejected GP papers.
Just how many published papers at the ML conference make cross-paradigm
comparisons?
After leafing through the proceedings of the most recent (1994) ML conference,
I could not find a single cross-paradigm comparison in any papers employing
decision trees, inductive logic programming, reinforcement learning, or any
other machine learning paradigm.
I also briefly leafed through six randomly chosen issues of the Machine
Learning journal, including the first issue in 1986. I did not find any
cross-paradigm comparisons in those particular issues (although there are
several that I recall seeing in other issues in the past).
In the first issue of the Machine Learning journal, the founding editor held
up the four papers in that issue as models that authors should consider
following for machine learning papers (Langley 1986). None of these four
excellent papers (which included, among others, Quinlan's often-cited and
well-regarded 1986 paper on decision trees) had any cross-paradigm
comparisons.
Of course, the idea of doing cross-paradigm comparisons is itself a highly
debatable proposition.
For one thing, doing a cross-paradigm comparison is a major undertaking for a
researcher. Such comparisons require writing, debugging, and running computer
programs for each of the other methodologies involved (or, alternatively,
coming up with an all-encompassing analytic theory definitively modeling the
comparative performance of the different technologies). About two years ago, I
reviewed a government grant proposal that carefully laid out the costs and
difficulties of doing such comparisons on a rather limited set of such
paradigms and problems. A mid-six-figure budget spanning a couple of years was
proposed for the project.
Second, doing a cross-paradigm comparison even-handedly is itself extremely
difficult. The possibility of unintentional bias presents the real risk that
the very considerable effort involved will not prove persuasive to anyone.
Third, and more importantly, preventing publication of a new result merely
because it has not (and perhaps cannot) also be produced by all other
techniques of machine learning would mean that only the most inconsequential
and incremental results would ever be published in the field. Cross-paradigm
comparisons can, by definition, only be done on problems that can be solved by
several different paradigms. Given the inability of many paradigms to solve any
breadth of problems, such cross-paradigm comparisons are necessarily limited to
trivial problem domains. The challenge for machine learning at the present
time is how to get non-trivial results on non-trivial problems - not the
relative speed by which trivial and useless results can be simultaneously
obtained by various different paradigms.
AUTHOR'S NOTE: Click here for additional discussion about cross-paradigm comparisons
But, in any event, the issue here is not whether more authors should, or
should not, do cross-paradigm comparisons. The point is that there is a
long-standing, well-known, prevailing practice at ML conferences (and
throughout the ML community generally) of not doing cross-paradigm
comparisons. (In fact, given the difficulty, the very few papers that do
cross-paradigm comparisons generally do only that.)
Why are papers on genetic programming being subjected to a unique and onerous
higher standard than the long-standing, well-known, prevailing standard applied
to papers employing decision trees, inductive logic programming, reinforcement
learning, and all other paradigms?
Certainly, researchers in decision trees can get their papers published without
also solving the very same problem using inductive logic programming,
reinforcement learning, or genetic programming. So why should ML-95 demand that
researchers in genetic programming also work with decision trees, inductive
logic programming, and reinforcement learning?
Do all of this reviewer's own papers contain such cross-paradigm comparisons?
Indeed, do any of his papers contain such a comparison?
If doing such cross-paradigm comparisons is not the prevailing practice of ML
and if doing them is not the reviewer's own practice, then why is the reviewer
criticizing this paper (and other GP papers) for not doing such comparisons?
The (no doubt same) ML-95 reviewer further complains about paper 302,
" ... no comparisons are made with other search
methods (e.g., why use GP instead of some other method?)."
(Emphasis added)
I searched in vain for a paper in the recent ML proceedings
employing decision trees, inductive logic programming, reinforcement learning,
or any other paradigm where the author gives any kind of organized and detailed
recital as to why he used his chosen paradigm "instead of some other
method."
On the whole, the fact is that researchers use the paradigms that they use in
their papers because that is what they do.
Apparently this reviewer thinks that a researcher wishing to use genetic
programming owes some special explanation or apology (not demanded from users
of other paradigms) for using GP. Perhaps the reviewer views GP as some
forbidden technique that may be used only as a last resort after one has first
obligingly tried other (presumably more legitimate) techniques.
Again, the question is why the papers on genetic programming are being singled
out, criticized, and penalized for an omission common to almost every other
paper ever published by the ML conference.
Curiously, the author of one of the rejected GP papers ventured a single sentence addressing the question "why use GP instead of some other method?" ML-95 then bristled,
"There are unsubstantiated claims, such as
'Genetic programming...is particularly suited to problems in which some
underlying regularity or structure must be discovered'"
(Emphasis added)
It is hard to imagine a more innocuous or inoffensive
one-sentence statement as to "why use GP" (specifically, GP with automatically
defined functions).
In any event, paper 221's author was just giving the reviewer what he
complained that paper 302's author did not.
Moreover, the reviewer's accusation that this offending sentence is
"unsubstantiated" is itself wrong. There is a considerable literature on GP's
usefulness and applicability to regularity discovery, and it is cited right in
the paper.
If it were the prevailing practice of ML to require each paper to contain an
organized and detailed recital of the specific reasons for choosing a
particular paradigm (and it is not) and if each author's reasons also had to be
"substantiated" in the paper (as the reviewer is demanding of author
221), there would be no room for the paper!
It is totally unreasonable to demand inclusion of this kind of introductory
material and then to take offense when such necessarily terse material is
"unsubstantiated."
The point here is that the same reviewer criticizes and penalizes GP authors
when they fail to fess up as to why they use GP and then criticizes and
penalizes them when they try to explain why they use GP.
The stated purpose of paper 055 was to compare the author's
proposed enhancement to GP (adding recursion) with existing forms of GP
(without recursion). Thus, the author made intra-paradigm comparisons among the
different ways of doing GP in his paper.
However, this apparently was not enough, and paper 055 was dismissed because
"... comparisons were done with other forms of GP but
not with other search methods."
(Emphasis added)
Skipping past the obvious and embarrassing question of whether any other
search method exists that could even handle the problem presented in this
paper, did ML-95 even-handedly criticize and penalize authors of papers on
decision trees, inductive logic programming, reinforcement learning, or any
other paradigm for not making cross-paradigm comparisons?
The essence of institutionalized bigotry is the imposition of an unreasonable
standard against the group at which improper discrimination is aimed.
Evidently, none of the other categories of accepted papers at ML conferences
have been subjected to any demand, criticism, or penalty concerning
cross-paradigm comparisons. And, there is certainly no conceivable reason why
cross-paradigm comparisons are differentially more desirable or more
appropriate for GP papers than for any other category of papers.
So why are papers 228, 302, 055 and others on genetic programming singled out
for criticism and penalty because they failed to do that which almost every
other paper ever published by the 1994 ML conference (and the Machine
Learning journal) also fails to do?
Of course, the reviewer is factually correct when he says that papers 228, 302,
and 055 (and others) make no cross-paradigm comparisons. But does this
nominally true statement really have anything to do with the reviewer's
decision-making process in rejecting these papers?
If not, why are these idle words in his review?
Given his privileged and insulated position as a peer reviewer, why would the
reviewer not feel entirely comfortable in stating the actual reasons for his
negative decisions in his written reviews?
Can it be because he knows that the actual reasons for his decision were
illegitimate and unethical and that the words in his reviews are merely
hypocritical window-dressing concealing the reviewer's hidden agenda?
Tellingly, paper 222 did do a cross-paradigm comparison
involving three paradigms: genetic programming, the ID3 decision tree
algorithm, and a nearest neighbor algorithm.
If the absence of cross-paradigm comparisons is so important to the reviewer,
why didn't the ML-95 review of paper 222 make a positive comment about the
presence of this cross-paradigm comparison in paper 222?
Certainly, cross-paradigm comparisons are infrequent enough in published machine
learning papers.
Could it be that ML-95 doesn't really care that much about cross-paradigm
comparisons after all?
ML-95 criticizes paper 055 (which demonstrated how to do recursion in GP),
"The strengths and limitations of GP are not discussed, except for the single fact that GP succeeded in evolving a general solution to this problem." (Emphasis added).
What "strengths" beyond solving one "general
problem" must a short paper at the ML conference possess?
How many other papers at the ML conference had more "strengths" than
that of finding "a general solution" to one problem?
Indeed, how many papers at the ML conference, when examined, yielded a
"general solution" to even one problem?
The important point in the above quotation is that the reviewer does
acknowledge, on the face of his review, that paper 055 did, in fact, yield a
"general solution" to what any open-minded person in machine learning would
acknowledge to be an important problem (adding recursion to a machine learning
paradigm).
Why isn't that enough?
Far less seems to be more than enough when ML is looking at papers outside the
GP category.
The same reviewer who complained that the genetically evolved solution to the molecular biology problem did only "slightly better" than human performance now tells us,
"GP is by now not a new method, and there have been a large number of papers (and two large books) detailing GP's success on various problems, so it is not clear what this one additional example adds to our knowledge about GP."
What does the number of previously published papers have to do with the process
of judging any newly submitted paper?
Can anyone conceive of any possible circumstances under which the ML-95
conference would tell the author of a submitted paper on, say, inductive logic
programming, that
"Inductive logic programming is by now not a new method and there have been a large number of papers (and XXX large books) ... "
Or, can anyone conceive of any possible circumstances under which the ML-95
conference would say something like that to the author of a submitted paper on,
say, decision trees?
Is this a Freudian slip indicating that the reviewer has mentally established a
global cap on GP papers and that GP has already been too pushy?
In any event, if there has already been an intolerable excess of GP papers
published, it's safe to say that it hasn't been at the ML conference or in the
ML community.
ML-95 complains,
" ... not enough information about the particular
GP algorithm is given ... . Missing is information about the probabilities of
the various functions and terminals in the initial population ... Also,
some particulars of the method (e.g., percent of the population copied to
the next generation, probability of crossing over at various points in the
trees) are not given but one of the first author's books is cited. It would be
helpful to give these parameters in the paper.
(Emphasis added).
Why would this be helpful in a short paper?
To whom?
Few, if any, readers of a paper are actually interested in replicating
experiments. And, if they are, the reviewer acknowledges that the complete
information required for replication is readily available (in the almost 10,000
copies "of the first author's books" that are currently in
circulation).
It takes several pages to meticulously catalog all of the marginally different
alternative ways of translating the metaphor of the genetic algorithm into an
actual computer program. That amount of space is clearly not available in a
short conference paper.
The only issue here is the reviewer's strong opinion that this voluminous
information should be inserted into the confines of the small number of pages
allotted to ML papers.
A glance at the recent ML proceedings indicates that many accepted ML papers do
not meet the minimal requirement of stating all the important parameters of
their work. Why then is this reviewer singling out this GP paper (and other GP
papers) when all the important parameters are in this paper (and all the
unimportant parameters are readily available to anyone who is interested)?
ML-95 complains about paper 281 saying
"It would also help to have some discussion of the strengths and weaknesses of its technique."
Similarly, ML-95 complains about paper 302 saying
"The strengths and limitations of GP are not discussed ... "
Similarly, ML-95 complains about paper 055 saying
"The strengths and limitations of GP are not discussed ... "
Note that these reviewer comments are specifically aimed at the "limitations
of GP" and "weaknesses of the technique" - not merely at the weaknesses of the
results of the specific work described in the paper.
Again, a search through the most recent ML proceedings does not uncover a
single article on inductive logic programming (or decision trees or
reinforcement learning or any other paradigm) that contains any kind of
detailed recital of the strengths and weaknesses of the technique used in the
paper. Where do we find Quinlan, Muggleton, or any of the other many
practitioners of decision trees and inductive logic programming dutifully reciting
the weaknesses of their techniques?
It might be nice if every paper at ML began with a disciplined and detailed
recitation of the strengths and weaknesses of the technique being used in the
paper, followed by a careful linkage of these itemized strengths and
weaknesses to the decision as to "why use" technique X.
One argument against this is that most readers are not impressed by such
recitals from avid practitioners of a particular paradigm and that such
recitals are best done by others. (Recall how the reviewer himself bristled
when paper 221 contained just one sentence about an asserted strength of GP).
But regardless of whether or not this practice would be "nice," the
important point is that such space-consuming recitals are not part of the prevailing
practice of ML.
And, there certainly is no special reason why such recitals should be
differentially required for GP while not being required for any other paradigm.
The problem with the ML-95 reviews is that there is a blizzard of words in
which the reviewer selectively complains about such matters concerning the GP
papers, while there is total silence as to the real reasons behind ML-95's
decision to exclude all GP papers. The ML-95 reviews are full of words that the
reviewer and the authors both know have nothing to do with the decisions. The
obvious lack of connection between the real decision-making criteria and the
reviewer's written words heightens interest in the question of the real reason
why 100% of the GP papers were rejected by ML while Lang's GP-bashing paper
was accepted.
Almost all the papers at ML conferences cast a blind eye to
the critical issue of scalability. In fact, there is apparently a gentleman's
agreement not to mention the "S-word" in the ML community's nanoworld
of 3-argument Boolean functions and sick soybeans.
Paper 055 did utter the forbidden S-word and demonstrated scaling over a series
of comparative experiments. ML-95 admits,
"The problem is of enormous significance, since proper handling of recursion in GAs would potentially allow significant advance in all types of first-order learning with GAs." (Emphasis added)
The reviewer then calls paper 055 "original." (Indeed, it is).
Then, in dismissing paper 055, ML-95 said,
" ... this ... kind of recursion did improve the performance of GP dramatically, and it would be very interesting to see it scaled up to problems whose interest and difficulty is clearer." (Emphasis added)
Thus, after doing what the reviewer acknowledges is indeed "original" work,
after grappling with a problem of "enormous significance", after producing
some evidence on the important question of scalability, and after actually
doing comparative experiments, ML-95 dismissed paper 055 because it didn't
also solve unspecified additional problems of "interest."
"Come back when you've cured both cancer and heart disease!"
One wonders how ML-95 came to view Lang's mechanical tabulation of 100 runs of
80 3-argument Boolean functions as "original" or "of enormous significance"
when it manifestly would not qualify for a passing grade in a reputable
university course.
How could ML-95 possibly have regarded the bogus conclusions derived from the
trivial tabulation in Lang's paper as superior to paper 055?
ML-95 again complains,
"Too many details are missing ... and the reader is referred to a [technical report] for details." (Emphasis added)
Note, again, the issue is not whether the details were readily available,
but whether the author should have put them into his highly space-limited
paper.
One thing is noteworthy here: the apparently common ML-95 reviewer is to be
congratulated for his ecologically-minded recycling of words as he goes from
paper to paper.
For paper 221, ML-95 complains,
"Some details of the GA are missing, e.g., at each generation, how many individuals are created through copying, and how many through crossover? Or, what was the probability of using each of the functions and terminals in creating the initial population?" (Emphasis added)
For paper 228, ML-95 complains,
"... [T]he author leaves out some details of the GP method used -- e.g., how was the initial population formed? How many individuals were copied directly each generation and how many were formed via crossover?"
Although the authors of papers 221 and 228 did cover all the important
parameters in their short papers, they are indeed guilty of omitting some of
their minor parameters.
The key point here is whether ML-95 even-handedly demanded, criticized, and
penalized the non-GP papers for similar (and far worse) omissions. If the
published papers in previous ML proceedings are a guide, the answer is clearly
"no." Is the answer "yes" for 100% of the accepted and published ML-95 papers?
The ML-95 reviewer asserts,
"The basic capability was demonstrated in the IEEE paper and the author does not note any significant differences in the results presented in this paper compared to the results in the IEEE paper."
The ML-95 reviewer clearly didn't bother to read paper 281 (much less the 1994
paper by the author of paper 281).
He's just plain wrong when he says paper 281 failed to "note ...
significant differences" between the 1995 work and the previous work. The
paper did recite these differences and they are substantial.
The accusation that the ML-95 reviewer didn't bother to read paper 281 can
be easily experimentally verified. As it happens, a paper by the same
author (substantially similar to paper 281) was submitted and subsequently
accepted by IJCAI-95. That is, paper 281, which was summarily rejected by
ML-95, was accepted by IJCAI-95 (a particularly competitive conference) in its
machine learning area. (Advance copies are available for readers not wishing to
wait until August 22, 1995 to conduct this experiment for themselves). The
reader can easily compare the two papers and form his own opinion as to whether
the ML-95 reviewer really read ML paper 281 or whether the ML-95 reviewer is
merely taking a cheap shot based on an incorrect guess about the contents of
paper(s) that he didn't read.
As a foretaste of the results of this experiment, let me merely mention that the
author has informed me that one of the independent IJCAI reviewers (whose
identity is unknown to us) specifically commented in his IJCAI review about the
significant differences between the 1994 and 1995 papers.
We are confident that if the reader conducts his own experiment, he will end up
agreeing with the author of paper 281, with me, and with the independent IJCAI
reviewer that the ML-95 reviewer didn't read the paper(s).
The fact is that the (apparently common) ML-95 reviewer probably didn't read
most of the other GP papers on which he passed judgment.
Could it be that the reason the ML-95 reviewer didn't read the papers is
that he really didn't need to read the papers because his negativity had
nothing to do with the contents of the papers, but derived from a
manuscript-independent dislike of GP?
AUTHOR'S NOTE: The IJCAI-95 proceedings have, of course, been publicly
available for some time now. The complete reference is as follows:
Andre, David. 1995. The automatic programming of agents that learn mental
models and create simple plans of action. Proceedings of the 14th
International Joint Conference on Artificial Intelligence. San Francisco,
CA: Morgan Kaufmann. Pages 741-747.
The complete reference to the 1994 paper by the author of paper 281 is as
follows:
Andre, David. 1994. Evolution of map making: Learning, planning, and memory
using genetic programming. Proceedings of the First IEEE Conference on
Evolutionary Computation. IEEE Press. Volume I. Pages 250-255.
The reader is invited to compare the two papers and form his own opinion as
to whether the ML-95 reviewer really read ML paper 281.
The title of paper 281 is
"The artificial creation of agents that learn and use mental models to evolve simple plans" (Emphasis added).
There is, of course, an enormous machine learning literature (measured in tens
of thousands of published papers) on agents, planning, and mental models.
Puzzlingly, the reviewer of paper 281 further complains,
"The only related work I'm aware of is a dissertation that showed how to evolvng learning classifier systems". TGCA report No. 91003. Department of Engineering Mechanics, The University of Alabama, Tuscaloosa, Alabama."
Surely the reviewer is aware of more than one paper in
existence on the subject of agents, planning, and mental models.
In any event, this puzzling comment is an additional indicator that the
reviewer did not, in fact, read paper 281 because paper 281 has nothing
remotely to do with classifier systems.
Classifier systems are a minuscule sub-area of the field of agents, planning,
and mental models.
And, how could paper 281 possibly be considered deficient in neglecting to cite
an obscure technical report outside the subject matter of paper 281?
Indeed, standard practice is not to cite a technical report if a more widely
published source is available (since technical reports are typically published
in very small numbers, not to be found in most libraries, and usually
familiar only to specialists who monitor the latest and most detailed new
developments in a particular subfield).
To offer
"The only related work I'm aware of"
in connection with paper 281 and then cite one particular
difficult-to-find University of Alabama technical report on classifier systems
shows again that the ML-95 reviewer has no idea what is in paper 281.
This is another throw-away comment that attempts to manufacture flaws in paper
281.
What this comment does show, however, is that this apparently
common ML-95 reviewer is very conversant with the extremely specialized
subfield of classifier systems - a subfield in which fewer than
100 papers have ever been published (those few having been written by a
relatively small group of researchers).
Paper 228 applied GP to the automatic construction of mental models and illustrated it with a tree search task. ML-95 complained that
"No case is made that the problem being solved contains features that allow for generalization to more interesting problems." (Emphasis added).
No one complained (or should have complained) when Quinlan's often-cited and
well-regarded 1986 article on decision trees presented the infinitely more
trivial ("Saturday morning weather") problem to illustrate his
then-new technique. No one in computer science had any difficulty visualizing
numerous applications for Quinlan's "Saturday morning weather"
problem. Similarly, no one complains about the numerous papers in previous ML
conference proceedings that use a modest illustrative problem (usually far more
modest than the tree search task in paper 228) to illustrate whatever new
approach they are presenting.
Assuming the hostile common reviewer even bothered to read paper 228 (and one
cannot be sure, given his indiscriminate recycling of the more or less
identical complaints over all the GP papers), he would have seen what anyone in
computer science would readily recognize to be a very general procedure for the
automatic construction of mental models.
Anyone familiar with machine learning or artificial intelligence should have
had no difficulty visualizing a cornucopia of applications for something as
fundamental as mental models (just as any reader of Quinlan's 1986 paper would
have had no difficulty appreciating the wide applicability of his
"Saturday morning weather" illustration).
It is nominally true that
"No case is made ..."
in paper 228. However, no case was made because no case
was needed. Any open-minded reader not feigning inability to visualize
"interesting" applications would have immediately grasped the
importance of the automatic construction of mental models.
However, if ML-95 felt that paper 228 fell short because it didn't make a
specific detailed "case" for the wider applicability of its
illustrative example, how is it that ML-95 came to accept Lang's mechanical
tabulation of 100 runs of 80 3-argument Boolean functions? Does Lang's accepted
paper
"contain ... features that allow for generalization to more interesting problems"? (Emphasis added)
ML-95 could not possibly have applied this test (or any test of scientific competency) to Lang's GP-bashing paper.
In section 10.3, I mentioned that the (no doubt common) ML-95 reviewer complained about paper 302,
" ... no comparisons are made with other search methods (e.g., why use GP instead of some other method?)." (Emphasis added)
In preparing version 2 of this paper, I happened to notice how utterly untrue the statement by the ML-95 reviewer was. In fact, paper 302 included several statements that directly spoke to the question of why genetic programming should be chosen for the particular problem addressed in the paper. One of four such linkages is
"When this problem is rephrased as a search for an unknown-sized task-performing computer program (i.e., a composition of primitive functions and terminals), genetic programming becomes a candidate for approaching this problem. Moreover, if it is also desired to capture regularities in the problem environment (e.g., the repeated use of a subexpression such as [LIVM]), then genetic programming with automatically defined functions becomes relevant." (Emphasis added).
Paper 255 attempted to tackle a problem of classifying complicated visual and acoustic signals. ML-95 criticized it because
"[t]he experiments used manufactured, reasonably small datasets with relatively small number of signal classes."
The reviewer - so anxious to fire off shoot-from-the-hip criticisms of GP papers -
is apparently unaware that both of the datasets that he criticizes come
from published sources in the signal processing field (including Sebastian
Thrun's set) and the ML community.
Be that as it may, did ML-95 similarly criticize Lang's GP-bashing paper with
its 100 runs of 80 3-argument Boolean functions because it used
"... manufactured, reasonably small datasets with relatively small number of ... classes"?
What dataset could possibly have been smaller and less significant than the
family of 80 3-argument Boolean functions?
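The scale of Lang's problem domain is easy to check for oneself: a 3-argument Boolean function is completely determined by its 8-row truth table, so there are only 2^8 = 256 such functions in existence (Lang's family of 80 is some subset of these, defined in his paper, not reconstructed here). A minimal Python sketch of this counting argument:

```python
from itertools import product

# A 3-argument Boolean function is determined entirely by its output
# on the 8 possible input triples, i.e., by an 8-bit truth table.
inputs = list(product([0, 1], repeat=3))            # the 8 input triples
tables = list(product([0, 1], repeat=len(inputs)))  # every possible truth table

print(len(inputs))   # 8 input combinations
print(len(tables))   # 256 distinct 3-argument Boolean functions
```

Exhaustively tabulating the entire space is thus a matter of milliseconds on any machine, which is precisely why the domain is so small a basis for sweeping conclusions.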
Even if the authors of paper 255 were guilty of the alleged sin of using an
oversimplified dataset (and they are not), there are degrees of sin. How
serious was paper 255's oversimplification in comparison to the
oversimplification in Lang's accepted and published GP-bashing paper?
Even the ML-95 reviewer acknowledges
"the chosen problem domain is certainly challenging"
for paper 255 (something that surely cannot be said of
Lang's problem domain).
And, paper 255 faced pressing practical problems of computer time arising from
the massive number of pixels and time slices (something that surely cannot be said
of Lang's nanoscale problem domain).
In any event, author 255 did not oversimplify to anywhere near the degree that
Lang did in his accepted and published GP-bashing ML-95 paper.
And, most important, there is no reason to believe that the author of paper 255
produced any bogus conclusions.
Why was paper 255 criticized and penalized for a far less consequential
instance of the alleged sin of oversimplification that was breezily ignored by
ML-95 in accepting and publishing Lang's GP-bashing paper?
By the way, the reviewer also complained,
"There are no references to the related work
in the area of signal processing. The authors should also reference the work
by Usama Fayyad et al on SKYCAT..."
(Emphasis added)
This is yet additional evidence of the ML-95 reviewer not
reading the paper.
Clearly it is the reviewer's bias that caused him to reject the paper - not the
actual contents of the paper - and certainly not the empty words contained in
his review.
Most of the reviews that I have received over the years of
conference papers are extremely short. Most of the reviews of the rejected GP
papers at ML are also short.
The review of paper 302 (in which "GP does give a very slightly better
solution" than human performance) from the common ML reviewer begins with
the following extraordinary preamble:
"My opinion is that for a paper to be significant
for the ML community, it has to do at least one of the following things:
"(1) attack a problem that has not previously been solved, and show why
the problem being attacked is interesting and/or important
"(2) give some new insight as to why a particular ML method works or does
not work, or as to why some particular class of problems is difficult or easy
for ML
"(3) demonstrate that some new, previously untested method has the
potential to solve interesting and important problems
"(4) model some phenomenon in real-world learning in an interesting and
enlightening way
"This list could no doubt be added to, ... "
Why is this lengthy philosophizing about review criteria in the review of paper
302?
Did this ML-95 reviewer apply these criteria to the Lang article?
Paper 281, which was summarily rejected by ML-95, was accepted by IJCAI-95 (a
particularly competitive conference) in its machine learning area.
Subsequent to its rejection by ML-95, paper 302 was submitted and accepted as a
chapter in a refereed and edited book on machine learning being published
overseas. Papers 228 and 055 were accepted as regular papers by other
conferences in the machine learning area. Paper 222 was accepted as a poster
paper at another conference. The work from paper 255 will appear as a chapter
in a refereed and edited book.
When favorable decisions come in from the independent peer reviewers of IJCAI,
other conferences, and other publications for 86% of the papers involved
here, it certainly suggests that these GP papers are not as
worthless as ML-95 would have the world believe.
This independent reviewing is, in effect, a peer review of the ML peer review
process.
While decisions made by other conferences and publications are, of course, not
conclusive, these independent conclusions certainly carry some additional
evidential value that can be added to the scales of judgment. This is
especially so in light of the extreme disparity between the ML's 0%
acceptance rate and the 86% acceptance rate from these other independent
reviewers.
In the past five years, none of this writer's
submitted papers were accepted by the ML conference. In fact, during this same
five-year period, all but one paper on genetic programming was rejected
by the ML conference.
Over 250 GP papers (by authors other than myself) have been accepted and
published elsewhere and over 50 of my papers have been accepted and published
elsewhere (in almost every case by publications considered to be of at least
equal quality to ML).
The fact is that the collective scientific judgment of a very large sampling of
other independent reviewers - over an extended period of time - is very much at
variance with the extreme nature of the outcome at ML.
We are not talking about some minor percentage difference here.
What is the reason for this extreme disparity?
This writer's most recent experience with a previous ML conference may shed
additional light on the mystery of this extreme disparity, over a period of
years, between the outcome at ML and the outcome from the collective scientific
judgment of a mass of other researchers. I received two "independent"
reviews from ML that each contained several very odd common words and phrases.
These unusual words and phrases were, in turn, embedded in almost identical
sentences. The sentences conveyed identical thoughts. However, the order of the
sentences was symmetrically and mechanically toggled between the two
"independent" peer reviews. Anyone with even a little experience at
spotting plagiarism would instantly recognize that the two
"independent" reviewers were not independent at all, but, instead had
colluded in the writing of their reviews (most likely with the second reviewer using
the first review as his template and mechanically switching slightly modified
sentences around as to order).
Thus, ML's essentially manuscript-independent rejection of all the 1995 GP
papers is nothing new, but, instead, part of a persistent pattern of abuse of
the peer review process at ML.
The inescapable conclusion is that the only way paper 302 (or any of the other
1995 massacred GP papers) would have been accepted by ML-95 is if it had
carried a title such as "Yet Another GP Failure - GP Performs Only
Slightly Better than Human Experts."
The goal now should be to ascertain the truth of the matter
raised herein.
In ascertaining the truth of what happened to the rejected 1995 GP papers,
there is an obvious need for the acquisition of additional information
currently contained in ML-95's records.
What does the written record show about how the ML
conference came to accept the Lang paper?
Since the Lang paper was accepted at ML-95, its two reviews must have been quite
favorable.
How, for example, did the two ML-95 reviewers of Lang's paper answer the
following six questions (taken from the official ML reviewer form)? How did the
two reviewers answer the question:
"To what extent does it evaluate the strengths and limitations of its result(s)? Dimensions for evaluation include generality, ... "
Did the reviewers actually say "yes - this paper has strength and,
yes, this paper has generality"?
Since the Lang paper was totally silent insofar as introspectively identifying
or itemizing limitations of its results, did the reviewers criticize the Lang
paper for this omission?
What did the ML-95 reviewers say in answer to the question,
"What lesson(s) does the paper offer the scientific community?"
Did the reviewers actually say that Lang showed that hill
climbing "beat" GP?
Did the reviewers really say that the Lang paper "had strengths,"
"was general," and "offered lessons"?
Presumably the reviewers must have said just that!!
If they didn't, did ML-95 accept and publish a paper that the reviewers said
did not have "strengths," did not have "generality," and
did not offer any "lessons"?
In answer to the question,
"How important is the work reported?"
did the reviewers actually say that a tabulation of the behavior of the family
of 3-argument Boolean functions was "important"?
In answer to the question,
"How difficult is the problem it attacks?"
did the reviewers really say there was some level of difficulty
involved in Lang's 100 runs of the family of 80 3-argument Boolean functions?
How did the two reviewers answer the question:
"Is the paper technically sound?"
Did the reviewers really say that Lang's paper was
technically sound?
In answer to the question,
"To what extent does the paper discuss relevant research?",
did the reviewers not notice the equivalence with mutation
and not cite the failure of attribution in the Lang paper?
Most important of all: Were any of many criticisms that the ML-95 reviewers
repeatedly leveled at the rejected GP papers (cited in earlier sections)
even-handedly raised with respect to Lang's paper on these two review forms?
Was a "single standard" applied or was a "double standard"
applied by ML-95?
It is the job of the ML program committee to maintain overall consistency in
the standards of the reviewing process as it adopts or rejects reviewer
recommendations.
The necessarily generally favorable comments on the two review forms for Lang's
accepted and now-published paper (and the necessary general absence of any
significant negative comments) are unlikely to demonstrate a fair or
even-handed application of a consistent single standard at ML-95.
Did the apparent common reviewer of the rejected GP papers
also review the Lang paper?
The answer is probably "yes" because the daisy-chaining of repetitive
statements across the rejected GP papers suggests that all the GP papers were
submitted to one common reviewer (with apparently different people providing
the second review) and the Lang paper had "genetic programming" in
its title.
If the answer is "yes," this particular review form takes on unusual
significance (without any reference to the reviewer's name).
The question is then whether the common reviewer even-handedly complained on
his review form for the Lang paper about all of the same matters that he
repeatedly raised about the rejected GP papers or whether he developed sudden
amnesia and extolled the Lang paper's "beating" of genetic
programming as a towering intellectual achievement providing great insights for
the machine learning community.
Since the Lang paper was favorably reviewed and accepted, it is unlikely that
the common reviewer made any negative comments (or at least not many). Thus,
this reviewing form is very important evidence that will almost certainly
further support the already very strong case of unethical discrimination and
scientific misconduct by this reviewer.
There is a very good possibility that the ML conference and
the ML-95 co-chairs were themselves also a victim in these events.
The fate of all the GP papers at ML was probably determined, as a practical
matter, at the time of appointment of the reviewer, not by the quality or
content of the submitted papers. There is a very good possibility that highly
relevant information that would have almost certainly deflected ML from
appointing the reviewer in the first place was intentionally withheld from ML
co-chairs by the would-be reviewer himself (perhaps in conjunction with someone
who may have promoted the appointment of this person as the reviewer of the GP
papers). This information is of such a clear nature that both the reviewer
himself (and the second person, if any, who may have pushed for his
appointment) knew, with certainty at the time they withheld (or colored) the
information, that there would be no way that the ML co-chairs would appoint the
person involved to review GP papers if the information had been fully and
properly disclosed to ML. Indeed, if the co-chairs had even an inkling of this
information, they would certainly have sought to acquire additional information
from sources independent of the reviewer himself (and the second person, if
any, who may have pushed for his appointment).
There is a good possibility that the above paragraph will come as a surprise to
the ML-95 co-chairs. If, upon reading the above paragraph, the co-chairs say
"ah ha," then the co-chairs will know that material information was
indeed withheld from them and that they, too, have been victimized by the same
people who victimized the GP authors.
This paper raises legitimate questions about scientific misconduct and calls upon ML for a candid explanation of these events.
In responding to the above, there will, no doubt, be those who will advocate
further victimizing the victims (and the messenger) by attacking them for
making waves by raising these legitimate questions.
There will also, no doubt, be the see-no-evil whitewashers who will advocate
denying that there is any problem at all and will suggest manufacturing some
strained and implausible interpretation of events to explain away the clear
evidence in this matter.
I hope a clear focus is maintained on who exactly are the victims and who is
the perpetrator. David Goodstein (1995), vice provost at Caltech, covered this
point in his recent editorial in Biotechnology when he said,
"Editors of scientific journals and program
officers ... have the most to gain from peer review, and they steadfastly
refuse to believe that anything might be wrong with the system. Their jobs are
made easier because they never have to take responsibility for decisions. ...
" ... referees are able, with relative impunity, to delay or deny funding
or publication to their rivals. When misconduct of this kind occurs, it is the
referee who is guilty ..." (Emphasis added).
If there is any temptation for ML to resort to denial of the
existence of the problem, avoidance of dealing with the problem, or
blame-shifting to the victims (or messenger) as the way to deal with the
legitimate issues raised by the massacre of the GP papers, it should be noted
that there is one issue of scientific integrity for ML that cannot be easily
sidestepped if ML aspires to any degree of scientific credibility, namely the
issue of the presence in the scientific record of the demonstrably erroneous
Lang paper.
We submit that the best course of action for ML is to ascertain the truth of
this entire matter and then forthrightly take action against the perpetrator
that will send an unmistakable deterrent message in support of scientific
integrity - rather than to make itself part of the problem by denying the
existence of the problem, avoiding the problem, trying to shift the blame to
the victims, or trying to defend the indefensible. The evidence supports taking
the following actions:
David Andre did the programming of the runs of the even-4-parity and even-5-parity problems for genetic programming and the hill climbing algorithm.
Goodstein, David. 1995. Ethics and
peer review. Biotechnology 13:618. June 13, 1995.
Holland, John H. 1975. Adaptation in Natural and Artificial Systems. Ann
Arbor, MI: University of Michigan Press.
Kinnear, Kenneth E. Jr. (editor). 1994. Advances in Genetic Programming.
Cambridge, MA: The MIT Press.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated
annealing. Science 220:671-680.
Koza, John R. 1992. Genetic Programming: On the Programming of Computers by
Means of Natural Selection. Cambridge, MA: The MIT Press.
Koza, John R. 1994a. Genetic Programming II: Automatic Discovery of Reusable
Programs. Cambridge, MA: The MIT Press.
Koza, John R. 1994b. Architecture-Altering Operations for Evolving the
Architecture of a Multi-part Program in Genetic Programming. Stanford
University Computer Science Department technical report stan-cs-tr-94-1528.
October 21, 1994.
Koza, John R. 1995. Gene duplication to enable genetic programming to
concurrently evolve both the architecture and work-performing steps of a
computer program. Proceedings of 14th International Joint Conference on
Artificial Intelligence. San Francisco, CA: Morgan Kaufmann.
Lang, Kevin. 1995. Hill climbing beats genetic search on a Boolean circuit
synthesis problem of Koza's. Proceedings of the Twelfth International
Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann. Pages
340-343.
Langley, Pat. 1986. On machine learning. Machine Learning 1(1):5-10.
Metropolis, N. A., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and
Teller, E. 1953. Equation of state calculations by fast computing machines. Journal
of Chemical Physics. 21: 1087-1092.
Quinlan, J. R. 1986. Induction of decision trees. Machine Learning. 1
(1): 81-106.