A VERY SCHOLARLY DISCUSSION OF THE EFFECT OF SYNONYMS ON PERFORMANCE
                       by Eugene Volokh, VESOFT
     Presented at the 1982 HPIUG Conference, Copenhagen, DENMARK
   Published in "Thoughts & Discourses on HP3000 Software", 1st ed.


ABSTRACT

This  paper   will  discuss  the  effect  of  synonyms  on  IMAGE/3000
performance,   in  particular  the average number of I/Os necessary to
perform  a DBGET mode 7 (also known as a hashed, calculated, or  keyed
read).


THE PROBLEM

As  is well known, IMAGE/3000 master datasets  use  an  access  method
called   HASHING  to implement keyed I/O against them.  The essence of
this  method is that when  a  DBPUT  is  performed  against  a  master
dataset,   the  key  is  translated through a special HASHING FUNCTION
(which    is   explained   in    Robelle    Consulting's    SMUG    II
micro-proceedings)   to  a record number within the dataset.  Then the
entry  to be added is inserted  into  the  dataset  at  the  resulting
record   number.   Now, when a DBGET mode 7 is performed with the same
key,  the key is translated through the same hashing function  to  the
same   record  number;  then, the entry we are looking for is just the
record  with that number.  Thus, we should ideally be able to  add  or
retrieve   any  entry in just one I/O (as opposed to considerably more
I/Os needed for other access methods, such as KSAM).

However,  a problem arises: what if two keys map to  the  same  record
number   (this condition is known as a COLLISION)?  We can't very well
put  both entries into the same record number, so we will have to  put
one  of these entries (known as the SECONDARY) somewhere else, and put
a  pointer to it into the other entry (known as the PRIMARY).  Now, if
we   want to retrieve the secondary, we will have to perform two I/Os:
one  to retrieve the primary, and then when we see that the primary is
not  the entry we want, a second to retrieve the secondary.  A similar
situation  arises when a key maps to a record number that already  has
a   primary and a secondary attached to it; then the entry is added to
the  so-called SYNONYM CHAIN for that record number (the  primary  and
all   its  secondaries),  and  we  now  need  three I/Os to get to it.
Similar  things occur when a key maps to a record number that has  two
synonyms, three synonyms, etc.

In  short,  access to primaries requires only one I/O; however, access
to  secondaries requires two or more  I/Os (unless the blocking factor
is greater than one).
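
To make the above concrete, here is a minimal sketch of this model, written
in Python purely for illustration (the hash function, the key format, and
the capacity are arbitrary stand-ins, not IMAGE's own): each synonym chain
is kept as a list headed by its primary, and the cost of a keyed read is
simply the position of the entry in its chain.

  def build_master(keys, capacity):
      """Hash each key to a record number; keys that collide are appended to
      the synonym chain headed by the primary (the first key to land there)."""
      chains = {}                        # record number -> [primary, 2-ary, ...]
      for key in keys:
          slot = hash(key) % capacity    # stand-in for IMAGE's hashing function
          chains.setdefault(slot, []).append(key)
      return chains

  def dbget_mode7_ios(chains, capacity, key):
      """I/Os for a keyed read: 1 for the primary plus 1 per secondary walked."""
      return chains[hash(key) % capacity].index(key) + 1

  keys = ["KEY%05d" % n for n in range(525)]     # arbitrary example keys
  chains = build_master(keys, capacity=650)      # arbitrary example capacity
  print(sum(dbget_mode7_ios(chains, 650, k) for k in keys) / len(keys))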

It  is also well known that the more  full  a  dataset  is,  the  more
secondaries   there  are  in  that  dataset; and, the more secondaries
there  are, the more I/Os are needed (on the average) to  retrieve  an
entry.  However, how many more I/Os are needed, nobody knows.

This brings up an interesting question: what is the relationship between
the average number of I/Os needed to retrieve an entry and how full the
dataset is (the FILL FACTOR, measured from 0=empty to 1=full)?


THE SOLUTION

In this paper, I will attempt to answer the above question
theoretically,   i.e.  derive  a  general formula that, given the fill
factor,  will return the average number of I/Os.  In order to do this,
however, some simplifying assumptions must be made:

First  of all, I will assume that our hashing function is "good", i.e.
regardless  of the data fed to it, it will return each  record  number
with roughly equal frequency.  Thus, for instance, a hashing function
that returns record number 17 half of the time, or one that returns odd
record numbers more often than even ones, is not "good".
Fortunately,  IMAGE's hashing  function  is  a  relatively  good  one.
(Note:   If  the  dataset's  key  is  numeric,  e.g. I2, a not-so-good
hashing  function is used by IMAGE; thus, these  results  may  not  be
correct for datasets with numeric keys).

Secondly,  I will assume that  the  master  dataset  in  question  has
blocking   factor 1.  This will permit me to ignore the possibility of
a  synonym being in the same block as the primary, thus  requiring  no
additional   I/Os  when it is retrieved.  I must confess that this may
be  a quite fallacious assumption in that synonyms very often  are  in
the   same  blocks as their primaries; however, although this fact may
decrease the average number of I/Os for a given fill factor, it
should   not  influence  the  overall  nature  of  the function we are
seeking.

Now, we can get down to business.   Let  us  define
  N = number of entries  in  the  dataset,
  C = capacity of the dataset,
  F = fill factor = N/C,
  An X-ary = an entry that is the Xth  in  a synonym chain, and thus
             takes X I/Os to retrieve.  Thus, a 1-ary is a  primary,
             and 2-aries, 3-aries, 4-aries, etc. are all secondaries,
  E (X, I) = the expected number of X-aries in a dataset after I
             entries have been added.

Let us thus ask the following question:

  How  many primaries can we expect in the  dataset  after  I  entries
  have been added to it, i.e. what is E(1,I)?

Well,  we can expect as many  primaries  as  are  expected  after  I-1
entries   have  been  added  (i.e.   E(1,I-1)) plus either 0 or 1 more
primaries  (depending on whether the Ith entry becomes  a  primary  or
not), i.e.

  E(1,I) = E(1,I-1)+                                              [1]
           1*(probability that the Ith entry becomes a primary).

What  is the probability that the Ith entry  becomes  a  primary?   An
entry   becomes a primary if it hashes to any slot in the dataset that
is  NOT OCCUPIED BY ANOTHER PRIMARY, i.e. that is empty or is occupied
by  a secondary!  The probability that the Ith entry becomes a primary
is thus equal to

  (the number of slots that are not occupied by another primary)
  --------------------------------------------------------------
    (the total number of slots that it could possibly hash to)

The  total number of slots that  it  could  possibly  hash to  is,  of
course,   C.   The  number  of  slots that are not occupied by another
primary  is C - (the number of slots that are occupied by a  primary).
When   the Ith entry is being added, the expected number of slots that
are occupied by a primary is by definition E(1,I-1).

Thus, the probability that the Ith entry will become a primary is

  (C-E(1,I-1))/C = 1-E(1,I-1)/C                                   [2]

Therefore, substituting [2] into [1],

  E(1,I) = E(1,I-1)+1-E(1,I-1)/C =                                [3]
         = E(1,I-1)*(1-1/C)+1

Moreover,  we know that E(1,0) = expected number of primaries when the
dataset  has 0 entries, i.e. when the dataset is empty = 0.  Thus,  we
have   a  recursive formula for E(1,I).  From it we can, for instance,
determine  that E(1,1) = E(1,0)*(1-1/C)+1=0*(1-1/C)+1=1, that E(1,2) =
E(1,1)*(1-1/C)+1=1*(1-1/C)+1=2-1/C, etc.
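
As a quick sanity check, recursion [3] can be iterated mechanically; the
sketch below (C = 1000 is just an arbitrary example capacity) reproduces
E(1,1) = 1 and E(1,2) = 2-1/C.

  C = 1000.0                     # arbitrary example capacity
  E1 = [0.0]                     # E(1,0) = 0: an empty dataset has no primaries
  for I in range(1, 6):
      E1.append(E1[-1] * (1 - 1/C) + 1)      # recursion [3]
  print(E1[1], E1[2])            # prints 1.0 and 1.999 (i.e. 2 - 1/C)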

But,  we want a non-recursive, straightforward formula.  We can obtain
it by further substituting [3] into itself:

  E(1,I) = E(1,I-1)*(1-1/C)+1 =                                   [4]
         = (E(1,I-2)*(1-1/C)+1)*(1-1/C)+1 =
         = E(1,I-2)*(1-1/C)^2+(1-1/C)+1 =
           ...
         = (1-1/C)^(I-1)+(1-1/C)^(I-2)+ ... +(1-1/C)^2+(1-1/C)+1

The above is what is known in mathematical circles as a geometric
progression, i.e. a sum of terms in which the ratio between consecutive
terms is the constant (1-1/C).  It is also known that the sum of the
above progression is

  (1-1/C)^I-1   (1-1/C)^I-1
  ----------- = ----------- = -C*((1-1/C)^I-1) = C*(1-(1-1/C)^I)  [5]
   (1-1/C)-1       -1/C

Thus,  E(1,I) = C*(1-(1-1/C)^I).  If what we are looking  for  is  the
percentage   of  the  dataset's  entries that are primaries, we should
consider E(1,I)/I = (C/I)*(1-(1-1/C)^I).
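
For the suspicious, the closed form can be checked against recursion [3]
numerically; in the sketch below (C and I are arbitrary example values),
the two agree to within rounding error.

  C, I = 1000.0, 800             # arbitrary example values
  e1 = 0.0
  for _ in range(I):
      e1 = e1 * (1 - 1/C) + 1    # recursion [3]
  print(e1, C * (1 - (1 - 1/C)**I))     # recursion vs. closed form [5]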

Now, let us take a closer look at (1-1/C)^I.  Clearly,

  (1-1/C)^I = (1-1/C)^(C*I/C) = ((1-1/C)^C)^(I/C)                 [6]

Also,  we know that when C is sufficiently large (say,  in  excess  of
100),   (1-1/C)^C is close to (and in fact gets closer as C increases)
the constant 1/e = 1/2.71828... = 0.367879....
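
This limit is easy to verify numerically (the capacities below are
arbitrary examples):

  for C in (10, 100, 1000, 10000):
      print(C, (1 - 1.0/C)**C)   # approaches 1/e = 0.367879... as C grows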

Thus, for large C, we have (substituting [6] into [5])

  E(1,I)/I = (C/I)*(1-(1-1/C)^I) = (C/I)*(1-(1/e)^(I/C)) =        [7]
           = (C/I)*(1-e^(-I/C)) = (1-e^(-I/C))/(I/C).

So,  the expected fraction of primaries in our dataset, which contains
N   entries,  is  E(1,N)/N  =  (1-e^(-N/C))/(N/C).   But,  N/C  =   F!
Therefore,

               -F
  E (1,N)   1-e
  ------- = -----                                                 [8]
     N        F

Ugh.

At  last, we have a solid result -- we have  determined  the  expected
number   of  primaries  in  a  dataset  given  its  fill  factor.   An
interesting  corollary to the above is that when the dataset is  full,
i.e.   N=C  or  F=1, E(1,N)/N = (1-e^(-1))/1 = .632121....  Thus, even
when a dataset is full, 63% of all entries in it should be primaries.
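
Formula [8] is cheap to evaluate; the sketch below tabulates the expected
fraction of primaries at a few fill factors -- roughly 88% at F=0.25, 79%
at F=0.50, 70% at F=0.75, and the 63% just mentioned at F=1.

  import math
  for F in (0.25, 0.50, 0.75, 1.00):
      print(F, (1 - math.exp(-F)) / F)       # expected fraction of primaries [8]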

However,  the battle is far from over yet.  What we are trying  to  do
is  to determine the average number of I/Os needed to retrieve a given
entry.    Before  doing  the  actual  derivations,  let me explain the
method that we are going to use:

The  average number of I/Os needed to retrieve a given entry is  equal
to   the  total  number  of I/Os needed to retrieve all entries in the
dataset divided by the total number of entries.

The  total number of I/Os  needed  to  retrieve  all  entries  in  the
dataset   is the total number of I/Os needed to retrieve all primaries
+  the total number of I/Os needed to  retrieve  all  2-aries  +  ....
But,   the  total  number of I/Os needed to retrieve all X-aries = the
expected  number of X-aries * the number of I/Os needed to retrieve an
X-ary.   Clearly, to retrieve an X-ary we would need X I/Os; thus, the
total number of I/Os needed to retrieve all X-aries = X*E(X,N).

Therefore,  the total number of I/Os needed to retrieve all entries in
the  dataset  is  1*E(1,N)+2*E(2,N)+3*E(3,N)+....   And,  the  average
number   of I/Os needed to retrieve a given entry = the number of I/Os
needed to retrieve all entries / number of entries =

  1*E(1,N)+2*E(2,N)+3*E(3,N)+...
  ------------------------------                                  [9]
                N

We  have already determined E(1,N).  Now, if  we  can  only  determine
E(2,N),   E(3,N),  ...,  we  will be almost done.  So, this is what we
will attempt to do below.

How are we going to determine E(X,I) for X>1?

Well, reasoning similar to that used for E(1,I) indicates that

  E(X,I) = E(X,I-1)+                                             [10]
           (probability that the Ith entry becomes an X-ary)

Now,  what is the probability that the Ith entry becomes an X-ary?  To
answer  this, we must ask: under what conditions would an entry become
an  X-ary?  The answer to this is: if and only if it hashes to a  slot
that   is currently occupied by a primary which has a synonym chain of
length  exactly X-1.  How many entries like this are there?  Well, the
distinguishing  feature of this kind of entry is that it  has  in  its
synonym   chain an (X-1)-ary but no X-aries.  Thus, the number of such
primary  entries =  number  of  (X-1)-aries  -  number  of  X-aries  =
E(X-1,I-1)   -  E(X,I-1).   Thus,  the  probability that the Ith entry
becomes an X-ary is

  E(X-1,I-1)-E(X,I-1)
  -------------------                                            [11]
           C

and thus

  E(X,I) = E(X,I-1)+(E(X-1,I-1)-E(X,I-1))/C =                    [12]
         = E(X,I-1)*(1-1/C)+E(X-1,I-1)/C

To obtain a non-recursive formula, we expand the above into

              I
             ---
             \   E(X-1,J-1)
  E(X,I) =    -  ---------- * (1-1/C)^(I-J)                      [13]
             /       C
             ---
             J=1
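
Before pressing on with the algebra, note that recursions [3] and [12] can
also simply be iterated numerically, with none of the approximations that
follow; in the sketch below, C, N, and the chain-depth cutoff MAXX are
arbitrary example values, and the last line applies formula [9] to the
computed expectations.

  C, N, MAXX = 650.0, 525, 8     # arbitrary example values; MAXX truncates the chains
  E = [0.0] * (MAXX + 1)         # E[X] holds E(X,I); E[0] is unused
  for I in range(1, N + 1):
      prev = E[:]                # E(*, I-1)
      E[1] = prev[1] * (1 - 1/C) + 1                     # recursion [3]
      for X in range(2, MAXX + 1):
          E[X] = prev[X] * (1 - 1/C) + prev[X - 1] / C   # recursion [12]
  print([round(e, 1) for e in E[1:]])                    # E(1,N), E(2,N), ...
  print(sum(X * E[X] for X in range(1, MAXX + 1)) / N)   # average I/Os per [9]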

In  particular, let us try to determine E(2,I) = the  expected  number
of 2-aries after I entries have been added.

  E(2,I) = sum(J=1:I | E(1,J-1)/C * (1-1/C)^(I-J)) ~             [14]
         ~ sum(J=1:I | E(1,J)/C * (1-1/C)^(I-J)) ~
         ~ sum(J=1:I | C*(1-e^(-J/C))/C * (1-1/C)^(I-J)) =
         = sum(J=1:I | (1-1/C)^(I-J)) -
           sum(J=1:I | e^(-J/C) * (1-1/C)^(I-J)) =
         = sum(J=0:I-1 | (1-1/C)^J) -
           sum(J=1:I | e^(-J/C) * (1-1/C)^(I-J))

Now,  we can observe that sum(J=0:I-1 |  (1-1/C)^J)  is  just  E(1,I);
moreover,    (1-1/C)^(I-J)  =  ((1-1/C)^C)^((I-J)/C)  ~  e^(-(I-J)/C).
Thus,

  E(2,I) = E(1,I) - sum(J=1:I | e^(-J/C) * e^(-(I-J)/C)) =       [15]
         = E(1,I) - sum(J=1:I | e^(-I/C)) =
         = E(1,I) - I*e^(-I/C).

So, E(2,N) = E(1,N)-N*e^(-N/C); furthermore,

  E (2,N)   E (1,N)    -N/C   E (1,N)    -F
  ------- = ------- - e     = ------- - e                        [16]
     N         N                 N

Now,  we can proceed to the general question: what is E(X,I) (or, what
is E(X,I)/I)?

I claim that for X>=2,

  E (X,I)   E (X-1,I)   (I/C)^(X-2)    -I/C
  ------- = --------- - ----------- * e                          [17]
     I          I          (X-1)!

where X! = X factorial = 1*2*3*...*X.

I  will prove this claim via a device known as mathematical induction;
i.e.  I will demonstrate that the claim is valid for X=2, and if it is
valid  for X=A, then it is also valid for X=A+1.  Clearly,  this  will
demonstrate   that  the  claim  is  valid  for  X=2, and therefore for
X=2+1=3, and therefore also for X=3+1=4, etc.

First  of  all,  I must prove that the claim  is valid for X=2. In the
case of X=2, the claim says that

  E (2,I)   E (1,I)   (I/C)^0    -I/C   E (1,I)    -I/C
  ------- = ------- - ------- * e     = ------- - e              [18]
     I         I         1!                I

But,  this was proven above in the derivation  of  E(2,I).   Thus,  we
know that our claim is valid for X=2.

Now let us assume that the claim is valid for X=A, i.e.

  E (A,I)   E (A-1,I)   (I/C)^(A-2)    -I/C
  ------- = --------- - ----------- * e                          [19]
      I          I          (A-1)!

We must show that it is also valid for X=A+1.

  E(A+1,I) = SUM(J=1:I | E(A,J-1)/C * (1-1/C)^(I-J)) =           [20]
           = SUM(J=1:I | (E(A-1,J-1)-J*(J/C)^(A-2)/(A-1)!*e^(-J/C))*
                         (1-1/C)^(I-J)/C) =
           = SUM(J=1:I | E(A-1,J-1)*(1-1/C)^(I-J)/C) -
             SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-J/C)*
                          (1-1/C)^(I-J)) =
           = E(A,I) -
             SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-J/C)*e^(-(I-J)/C)) =
           = E(A,I) -
             SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-I/C)) =
           = E(A,I) - e^(-I/C)/(A-1)!*SUM(J=1:I | J^(A-1))/C^(A-1) =

We  know that SUM(J=1:I | J^(A-1)) = I^A/A + smaller powers of I,  and
is thus approximately equal to I^A/A; so,

  E(A+1,I) = E(A,I) - e^(-I/C)/(A-1)!*                           [21]
             SUM(J=1:I | J^(A-1))/C^(A-1) =
           = E(A,I) - e^(-I/C)/(A-1)!*(I^A/A)/C^(A-1) =
           = E(A,I) - e^(-I/C)/A!*I*(I/C)^(A-1)

Thus,

  E (A+1,I)   E (A,I)   (I/C)^(A-1)    -I/C
  --------- = ------- - ----------- * e                          [22]
      I          I          A!

This is indeed the result predicted by our claim for X=A+1.  Thus,
we know that if the claim is valid for X=A, it is valid for X=A+1.

Thus,  we are justified in saying that our claim is true.  Finally, we
have obtained a general  (albeit  recursive)  formula  for  E  (X,I).
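
The claim is also easy to evaluate directly: chaining [8] and [17] together
yields the expected fractions of 1-aries, 2-aries, etc. from the fill
factor alone (F = 0.8 and the chain-depth cutoff below are arbitrary
example values), and applying [9] to these fractions already hints at the
result derived next.

  import math
  F, MAXX = 0.8, 8               # arbitrary example fill factor and cutoff
  frac = [(1 - math.exp(-F)) / F]                        # E(1,N)/N from [8]
  for X in range(2, MAXX + 1):
      frac.append(frac[-1] -
                  F**(X - 2) / math.factorial(X - 1) * math.exp(-F))   # [17]
  print([round(f, 3) for f in frac])                     # E(X,N)/N for X = 1..MAXX
  print(sum(X * f for X, f in enumerate(frac, 1)))       # average I/Os per [9]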

Now  comes the final phase of our  proof  --  the  derivation  of  the
average   number  of  I/Os  needed  to  retrieve  a  record.   For the
remainder  of this discussion, let us for  convenience  use  the  fill
factor F = N/C.

The  average number of I/Os needed to retrieve a record  is  equal  to
the   total  number  of I/Os needed to retrieve all records divided by
the number of records.

The  total number of I/Os needed to retrieve all records is  equal  to
(the expected number of 1-aries)*(I/Os needed to retrieve a 1-ary)+
(the expected number of 2-aries)*(I/Os needed to retrieve a 2-ary)+...

But, we know that the expected number of X-aries is simply E(X,N); the
number of I/Os needed to retrieve an X-ary is simply X; and, the
number of records is N.

Thus, the average number of I/Os needed to retrieve a record is

  1*E(1,N)+2*E(2,N)+...     E (1,N)     E (2,N)
  --------------------- = 1*------- + 2*------- + ...            [23]
            N                  N           N

We have found that

  E (X,N)   E (X-1,N)   F^(X-2)    -F
  ------- = --------- - ------- * e                              [24]
     N          N        (X-1)!

Thus,

  E (X,N)   E (1,N)   F^0  -F   F^1  -F         F^(X-2)  -F
  ------- = ------- - ---*e   - ---*e   - ... - -------*e        [25]
     N         N       1!        2!              (X-1)!

Now,

  E(1,N)/N = (1-e^(-F))/F =                                      [26]
           = e^(-F)*(e^F-1)/F =
           = e^(-F)*(1+(F^1)/1!+(F^2)/2!+(F^3)/3!+...-1)/F =
           = e^(-F)*((F^1)/1!+(F^2)/2!+(F^3)/3!+...)/F =
           = e^(-F)*((F^0)/1!+(F^1)/2!+(F^2)/3!+...)

So,

  E(X,N)/N = E(1,N)/N -                                          [27]
             ((F^0)/1!+(F^1)/2!+...+(F^(X-2))/(X-1)!)*e^(-F) =
           = ((F^0)/1!+(F^1)/2!+...)*e^(-F) -
             ((F^0)/1!+(F^1)/2!+...+(F^(X-2))/(X-1)!)*e^(-F) =
           = (F^(X-1)/X!+F^X/(X+1)!+...)*e^(-F) =
           = sum(Y=X-1:infinity | F^Y/(Y+1)!)*e^(-F)

We are trying to calculate

    E (1,N)     E (2,N)     E (3,N)
  1*------- + 2*------- + 3*------- + ... =                      [28]
       N           N           N

           = 1*(F^0/1!+F^1/2!+F^2/3!+...)*e^(-F) +
             2*       (F^1/2!+F^2/3!+...)*e^(-F) +
             3*              (F^2/3!+...)*e^(-F) + ...

How  many times will F^(Y-1)/Y! be included in the above sum  for  any
given   Y?   Well,  it  will be included once in the first row, with a
factor  of 1; once in the second row, with a factor of 2;  and  so  on
until  the Yth row, in which it will occur with a factor of Y.  In the
(Y+1)st   row,  it  will  no  longer  occur,  because  the (Y+1)st row
contains  only terms  starting  with  (F^Y)/(Y+1)!   Thus,  in  total,
F^(Y-1)/Y!   will  occur  in  the  resulting  sum  with  a  factor  of
(1+2+3+...+Y) = (Y+1)*Y/2.  Thus, the average number of I/Os will be

  = sum(Y=1:infinity | F^(Y-1)/Y!*(Y+1)*Y/2) * e^(-F) =          [29]
  = sum(Y=1:infinity | F^(Y-1)/(Y-1)!*(Y+1)/2) * e^(-F) =
  = sum(Y=1:infinity | F^(Y-1)/(Y-1)!*(Y-1)/2+
                       F^(Y-1)/(Y-1)!*2/2) * e^(-F) =
  = sum(Y=0:infinity | F^Y/Y!*Y/2) * e^(-F) +
    sum(Y=0:infinity | F^Y/Y!) * e^(-F) =
  = sum(Y=1:infinity | F^Y/(Y-1)!)/2 * e^(-F) +
    sum(Y=0:infinity | F^Y/Y!) * e^(-F) =
  = (F/2+1) * sum(Y=0:infinity | F^Y/Y!) * e^(-F)

Here  we can again observe that  sum(Y=0:infinity  |  F^Y/Y!)  =  e^F.
Thus, the average number of I/Os is

  = (F/2+1) * sum(Y=0:infinity | F^Y/Y!) * e^(-F) =               [30]
  = (F/2+1) * e^F * e^(-F) = 1+F/2

Eureka!  So,

  THE  AVERAGE NUMBER OF I/OS REQUIRED FOR A DBGET MODE 7  AGAINST  AN
  IMAGE/3000   MASTER DATASET WITH BLOCKING FACTOR 1 AND FILL FACTOR F
  IS 1+F/2.

At last.
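
Readers who distrust the algebra can check the result by simulation; in the
sketch below, a random number generator stands in for a "good" hashing
function, blocking factor 1 is assumed, the capacity and seed are
arbitrary, and the measured average chain position is printed next to
1+F/2.

  import random

  def average_ios(capacity, entries, rng):
      """Average DBGET mode 7 cost under the blocking-factor-1 model."""
      chain_len = [0] * capacity
      total_ios = 0
      for _ in range(entries):
          slot = rng.randrange(capacity)   # stand-in for a "good" hashing function
          chain_len[slot] += 1             # the new entry becomes an X-ary ...
          total_ios += chain_len[slot]     # ... and will cost X I/Os to retrieve
      return total_ios / entries

  rng = random.Random(1982)                # arbitrary seed
  C = 50000                                # arbitrary (large) capacity
  for F in (0.25, 0.50, 0.75, 1.00):
      print(F, round(average_ios(C, int(C * F), rng), 3), 1 + F / 2)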


THE APPLICATIONS

Now  that all of you  have  been  dazzled  by  the  above  display  of
Incredibly    Incomprehensible  Mathematical  Formulae,  an  important
question comes up:

  WHY BOTHER?

As  an amateur mathematician (i.e. someone  who  is  crazy  enough  to
actually   LIKE  the  above  mumbo-jumbo), I can answer the above with
"Why   not?"  or  "Because  it's  there"  or  "But  aren't  you   just
overwhelmed by the incredible beauty of the result?".

However,  as  a computer programmer who works on what is essentially a
business  computer, I feel that I ought to show some  applications  of
the above mess.  Well, here they come.

For  one, note that the formula generated depends only on how full the
dataset is as a percentage, not on the actual number of entries in the
dataset.  Thus, if we compare the efficiency of IMAGE mode 7 DBGETs
and  KSAM FREADBYKEYs, we find that IMAGE should theoretically require
only 1 to 1.5 I/Os, whereas KSAM requires on the order of log N I/Os.

Another  point  that this proves is that  increasing the capacity of a
dataset  will not necessarily greatly improve the efficiency of master
dataset  access  (as  is  commonly  believed).   For instance,  if the
capacity  of a dataset that is 80% full is doubled, the average number
of I/Os required for a DBGET mode 7 against this dataset will decrease
from  1.4  to  1.2, which may not be  significant enough to offset the
two-fold increase in disc space usage.

On  the other hand, if you see that the  fraction  of  secondaries  is
much   larger than 1-(1-e^(-F))/F and/or the average number of I/Os is
much  greater than 1+F/2, you may have come across  a  case  in  which
IMAGE's   hashing algorithm behaves poorly. In this case, changing the
capacity  very slightly may cut the  number  of  secondaries  and  the
average  number of I/Os dramatically.  In one case which I ran across,
changing   the  capacity  by 1 cut the average number of I/Os in half!
The  actual number of secondaries and the  actual  average  number  of
I/Os   may  be  obtained  from  various  database  utilities  such  as
DBLOADNG, DBSTAT2, or Robelle Consulting's HowMessy.
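
For everyday use, the two formulas can be packaged into a small helper and
compared against whatever such a utility reports; the function below and
its argument names are mine, not part of DBLOADNG, DBSTAT2, or HowMessy,
and the capacity-650/525-entry example is arbitrary.

  import math

  def expected_stats(capacity, entries):
      """Expected fraction of primaries [8] and expected average I/Os [30]
      for a master dataset with a good hashing function and blocking factor 1."""
      F = float(entries) / capacity
      return (1 - math.exp(-F)) / F, 1 + F / 2

  # e.g. capacity 650, 525 entries: about 68.6% primaries, about 1.40 I/Os;
  # a dataset reporting much worse than this is a candidate for a capacity change.
  print(expected_stats(650, 525))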


CONCLUSION

The two major results that were derived in this paper are:

If a dataset has fill factor (number of entries/capacity) F,

the expected fraction of primaries in that dataset is:

          -F
       1-e
  P(F)=-----                                                      [8]
         F

and  the  expected average number of I/Os  needed to retrieve an entry
(if the dataset has blocking factor 1) is:

  A(F)=1+F/2                                                     [30]


EXPERIMENTAL EVIDENCE

The  following is a table of the results that were  expected  and  the
results that were obtained for several test datasets:

                   --------Actual--------  -------Expected-------
--C-- --N-- --F--  %Primaries Avg num IOs  %Primaries Avg num IOs
  650   525 80.8%     67.2%      1.411        68.6%      1.404
  331   193 58.3%     75.1%      1.337        75.8%      1.291
  331   193 58.3%     75.6%      1.275        75.8%      1.291
15000 10582 70.5%     71.0%      1.387        71.8%      1.352
  331   118 35.6%     80.5%      1.203        84.1%      1.178
  250   206 82.4%     66.0%      1.456        68.1%      1.412

In the following table, the datasets marked + have primary percentages
that are significantly less than expected and average numbers of I/Os
that are significantly more than expected.  The datasets marked - are
exact copies of those datasets with slightly changed capacities -- their
primary percentages and average numbers of I/Os are quite close to those
expected:

                   --------Actual--------  -------Expected-------
--C-- --N-- --F--  %Primaries Avg num IOs  %Primaries Avg num IOs
+ 128   123 96.1%      7.3%     13.008        64.3%      1.480
+ 127   123 96.8%     52.0%      1.667        64.0%      1.484
- 126   123 97.6%     70.7%      1.407        63.9%      1.488
+ 320   284 88.9%     37.3%      2.810        66.3%      1.444
- 319   284 89.0%     65.5%      1.444        66.2%      1.445
+ 400   158 39.5%     69.6%      1.456        82.6%      1.198
- 399   158 39.6%     86.1%      1.158        82.6%      1.198

These  datasets are good  examples  of  instances  in  which  slightly
altering  the dataset capacity (even decreasing it!) can significantly
improve performance.
