A VERY SCHOLARLY DISCUSSION OF THE EFFECT OF SYNONYMS ON PERFORMANCE

by Eugene Volokh, VESOFT
Presented at 1982 HPIUG Conference, Copenhagen, DENMARK
Published in "Thoughts & Discourses on HP3000 Software", 1st ed.

ABSTRACT

This paper discusses the effect of synonyms on IMAGE/3000 performance, in particular the average number of I/Os necessary to perform a DBGET mode 7 (also known as a hashed, calculated, or keyed read).

THE PROBLEM

As is well known, IMAGE/3000 master datasets use an access method called HASHING to implement keyed I/O against them. The essence of this method is that when a DBPUT is performed against a master dataset, the key is translated through a special HASHING FUNCTION (which is explained in Robelle Consulting's SMUG II micro-proceedings) to a record number within the dataset; the entry to be added is then inserted into the dataset at the resulting record number. When a DBGET mode 7 is later performed with the same key, the key is translated through the same hashing function to the same record number, and the entry we are looking for is simply the record with that number. Thus, we should ideally be able to add or retrieve any entry in just one I/O (as opposed to the considerably more I/Os needed for other access methods, such as KSAM).

However, a problem arises: what if two keys map to the same record number (a condition known as a COLLISION)? We can't very well put both entries into the same record number, so we must put one of these entries (known as the SECONDARY) somewhere else, and put a pointer to it into the other entry (known as the PRIMARY). Now, if we want to retrieve the secondary, we must perform two I/Os: one to retrieve the primary, and then, when we see that the primary is not the entry we want, a second to retrieve the secondary.
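The chain mechanics just described can be modeled in a few lines. The following Python sketch (a toy illustration of hashing with synonym chains, not IMAGE's actual algorithm; the sample keys are made up) counts how many I/Os each entry would cost to retrieve with blocking factor 1:

```python
# Toy model of master-dataset hashing: each key hashes to one of C slots;
# the first key to land in a slot becomes the primary, later arrivals are
# chained behind it as secondaries. The Xth entry in a synonym chain
# costs X I/Os to retrieve (blocking factor 1).

def build_chains(keys, capacity, hash_fn=hash):
    """Return, for each key, its position in its slot's synonym chain."""
    chains = {}    # slot number -> list of keys, in arrival order
    position = {}  # key -> 1 for the primary, 2 for first secondary, ...
    for key in keys:
        slot = hash_fn(key) % capacity
        chain = chains.setdefault(slot, [])
        chain.append(key)
        position[key] = len(chain)  # I/Os needed to retrieve this key
    return position

keys = ["ADAMS", "BAKER", "CLARK", "DAVIS", "EVANS"]
ios = build_chains(keys, capacity=3)
# With 5 keys and only 3 slots, at least two keys must end up as
# secondaries (pigeonhole), so some retrievals need 2 or more I/Os.
```

Whatever the hash function does, the first key to reach a slot always costs one I/O; only the entries chained behind it pay extra.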
A similar situation arises when a key maps to a record number that already has a primary and a secondary attached to it; the entry is then added to the so-called SYNONYM CHAIN for that record number (the primary and all its secondaries), and we now need three I/Os to get to it. The same thing occurs when a key maps to a record number that has two synonyms, three synonyms, etc.

In short, access to primaries requires only one I/O, but access to secondaries requires two or more I/Os (unless the blocking factor is greater than one). It is also well known that the fuller a dataset is, the more secondaries there are in it; and the more secondaries there are, the more I/Os are needed (on the average) to retrieve an entry. However, how many more I/Os are needed, nobody knows. This brings up an interesting question: what is the relationship between the average number of I/Os needed to retrieve an entry and how full the dataset is (THE FILL FACTOR, measured from 0=empty to 1=full)?

THE SOLUTION

In this paper, I will attempt to solve the above question theoretically, i.e. derive a general formula that, given the fill factor, will return the average number of I/Os. In order to do this, however, some simplifying assumptions must be made:

First, I will assume that our hashing function is "good", i.e. regardless of the data fed to it, it will return each record number with roughly equal frequency. Thus, for instance, a hashing function that returns record number 17 half of the time, or returns odd record numbers more often than even record numbers, is not "good". Fortunately, IMAGE's hashing function is a relatively good one. (Note: if the dataset's key is numeric, e.g. I2, a not-so-good hashing function is used by IMAGE; thus, these results may not be correct for datasets with numeric keys.)

Secondly, I will assume that the master dataset in question has blocking factor 1.
This will permit me to ignore the possibility of a synonym being in the same block as the primary, thus requiring no additional I/Os when it is retrieved. I must confess that this may be a quite fallacious assumption, in that synonyms very often are in the same blocks as their primaries; however, although this fact may decrease the average number of I/Os for a given fill factor, it should not influence the overall nature of the function we are seeking.

Now, we can get down to business. Let us define:

  N = number of entries in the dataset,
  C = capacity of the dataset,
  F = fill factor = N/C,
  An X-ary = an entry that is the Xth in a synonym chain, and thus takes
    X I/Os to retrieve. Thus, a 1-ary is a primary, and 2-arys, 3-arys,
    4-arys, etc. are all secondaries,
  E(X,I) = the expected number of X-aries in a dataset after I entries
    have been added.

Let us thus ask the following question: how many primaries can we expect in the dataset after I entries have been added to it, i.e. what is E(1,I)? Well, we can expect as many primaries as are expected after I-1 entries have been added (i.e. E(1,I-1)), plus either 0 or 1 more primaries (depending on whether the Ith entry becomes a primary or not), i.e.

  E(1,I) = E(1,I-1) +                                               [1]
           1*(probability that the Ith entry becomes a primary)

What is the probability that the Ith entry becomes a primary? An entry becomes a primary if it hashes to any slot in the dataset that is NOT OCCUPIED BY ANOTHER PRIMARY, i.e. that is empty or is occupied by a secondary! The probability that the Ith entry becomes a primary is thus equal to

  (the number of slots that are not occupied by another primary)
  --------------------------------------------------------------
  (the total number of slots that it could possibly hash to)

The total number of slots that it could possibly hash to is, of course, C. The number of slots that are not occupied by another primary is C - (the number of slots that are occupied by a primary).
When the Ith entry is being added, the expected number of slots that are occupied by a primary is by definition E(1,I-1). Thus, the probability that the Ith entry will become a primary is

  (C-E(1,I-1))/C = 1 - E(1,I-1)/C                                   [2]

Therefore, substituting [2] into [1],

  E(1,I) = E(1,I-1) + 1 - E(1,I-1)/C =                              [3]
         = E(1,I-1)*(1-1/C) + 1

Moreover, we know that E(1,0) = the expected number of primaries when the dataset has 0 entries, i.e. when the dataset is empty = 0. Thus, we have a recursive formula for E(1,I). From it we can, for instance, determine that E(1,1) = E(1,0)*(1-1/C)+1 = 0*(1-1/C)+1 = 1, that E(1,2) = E(1,1)*(1-1/C)+1 = 1*(1-1/C)+1 = 2-1/C, etc. But we want a non-recursive, straightforward formula. We can obtain it by further substituting [3] into itself:

  E(1,I) = E(1,I-1)*(1-1/C) + 1 =                                   [4]
         = (E(1,I-2)*(1-1/C)+1)*(1-1/C) + 1 =
         = E(1,I-2)*(1-1/C)^2 + (1-1/C) + 1 = ...
         = (1-1/C)^(I-1) + (1-1/C)^(I-2) + ... + (1-1/C)^2 + (1-1/C) + 1

The above is what is known in mathematical circles as a geometric progression, i.e. a sum of terms each one of which is (1-1/C) times less than the previous one. It is also known that the sum of the above progression is

  (1-1/C)^I - 1   (1-1/C)^I - 1
  ------------- = ------------- = -C*((1-1/C)^I-1) = C*(1-(1-1/C)^I) [5]
   (1-1/C) - 1        -1/C

Thus, E(1,I) = C*(1-(1-1/C)^I). If what we are looking for is the percentage of the dataset's entries that are primaries, we should consider E(1,I)/I = (C/I)*(1-(1-1/C)^I).

Now, let us take a closer look at (1-1/C)^I. Clearly,

  (1-1/C)^I = (1-1/C)^(C*I/C) = ((1-1/C)^C)^(I/C)                   [6]

Also, we know that when C is sufficiently large (say, in excess of 100), (1-1/C)^C is close to (and in fact gets closer as C increases) the constant 1/e = 1/2.71828... = 0.367879.... Thus, for large C, we have (substituting [6] into [5])

  E(1,I)/I = (C/I)*(1-(1-1/C)^I) = (C/I)*(1-(1/e)^(I/C)) =          [7]
           = (C/I)*(1-e^(-I/C)) = (1-e^(-I/C))/(I/C)

So, the expected fraction of primaries in our dataset, which contains N entries, is E(1,N)/N = (1-e^(-N/C))/(N/C). But N/C = F!
Therefore,

  E(1,N)   1-e^(-F)
  ------ = --------                                                 [8]
    N         F

Ugh. At last, we have a solid result -- we have determined the expected number of primaries in a dataset given its fill factor. An interesting corollary to the above is that when the dataset is full, i.e. N=C or F=1, E(1,N)/N = (1-e^(-1))/1 = .632121.... Thus, even when a dataset is full, 63% of all entries in it should be primaries.

However, the battle is far from over yet. What we are trying to do is to determine the average number of I/Os needed to retrieve a given entry. Before doing the actual derivations, let me explain the method that we are going to use:

The average number of I/Os needed to retrieve a given entry is equal to the total number of I/Os needed to retrieve all entries in the dataset divided by the total number of entries. The total number of I/Os needed to retrieve all entries in the dataset is the total number of I/Os needed to retrieve all primaries + the total number of I/Os needed to retrieve all 2-aries + .... But the total number of I/Os needed to retrieve all X-aries = the expected number of X-aries * the number of I/Os needed to retrieve an X-ary. Clearly, to retrieve an X-ary we would need X I/Os; thus, the total number of I/Os needed to retrieve all X-aries = X*E(X,N). Therefore, the total number of I/Os needed to retrieve all entries in the dataset is 1*E(1,N)+2*E(2,N)+3*E(3,N)+.... And the average number of I/Os needed to retrieve a given entry = the number of I/Os needed to retrieve all entries / number of entries =

  1*E(1,N) + 2*E(2,N) + 3*E(3,N) + ...
  ------------------------------------                              [9]
                   N

We have already determined E(1,N). Now, if we can only determine E(2,N), E(3,N), ..., we will be almost done. So, this is what we will attempt to do below. How are we going to determine E(X,I) for X>1?
Well, reasoning similar to the one we used for E(1,I) indicates that

  E(X,I) = E(X,I-1) +                                               [10]
           (probability that the Ith entry becomes an X-ary)

Now, what is the probability that the Ith entry becomes an X-ary? To answer this, we must ask: under what conditions would an entry become an X-ary? The answer is: if and only if it hashes to a slot that is currently occupied by a primary which has a synonym chain of length exactly X-1. How many entries like this are there? Well, the distinguishing feature of this kind of entry is that it has in its synonym chain an (X-1)-ary but no X-aries. Thus, the number of such primary entries = the number of (X-1)-aries - the number of X-aries = E(X-1,I-1) - E(X,I-1). Thus, the probability that the Ith entry becomes an X-ary is

  E(X-1,I-1) - E(X,I-1)
  ---------------------                                             [11]
            C

and thus

  E(X,I) = E(X,I-1) + (E(X-1,I-1)-E(X,I-1))/C =                     [12]
         = E(X,I-1)*(1-1/C) + E(X-1,I-1)/C

To obtain a non-recursive formula, we expand the above into

  E(X,I) = sum(J=1:I | E(X-1,J-1)/C * (1-1/C)^(I-J))                [13]

In particular, let us try to determine E(2,I) = the expected number of 2-aries after I entries have been added.

  E(2,I) = sum(J=1:I | E(1,J-1)/C * (1-1/C)^(I-J)) ~                [14]
         ~ sum(J=1:I | E(1,J)/C * (1-1/C)^(I-J)) =
         = sum(J=1:I | C*(1-e^(-J/C))/C * (1-1/C)^(I-J)) ~
         = sum(J=1:I | (1-1/C)^(I-J)) -
           sum(J=1:I | e^(-J/C) * (1-1/C)^(I-J)) =
         = sum(J=0:I-1 | (1-1/C)^J) -
           sum(J=1:I | e^(-J/C) * (1-1/C)^(I-J))

Now, we can observe that sum(J=0:I-1 | (1-1/C)^J) is just E(1,I); moreover, (1-1/C)^(I-J) = ((1-1/C)^C)^((I-J)/C) ~ e^(-(I-J)/C). Thus,

  E(2,I) = E(1,I) - sum(J=1:I | e^(-J/C) * e^(-(I-J)/C)) =          [15]
         = E(1,I) - sum(J=1:I | e^(-I/C)) =
         = E(1,I) - I*e^(-I/C)

So, E(2,N) = E(1,N) - N*e^(-N/C); furthermore,

  E(2,N)   E(1,N)               E(1,N)
  ------ = ------ - e^(-N/C) =  ------ - e^(-F)                     [16]
    N        N                    N

Now, we can proceed to the general question: what is E(X,I) (or, what is E(X,I)/I)?
I claim that for X>=2,

  E(X,I)   E(X-1,I)   (I/C)^(X-2)
  ------ = -------- - ----------- * e^(-I/C)                        [17]
    I         I          (X-1)!

where X! = X factorial = 1*2*3*...*X. I will prove this claim via a device known as mathematical induction; i.e. I will demonstrate that the claim is valid for X=2, and that if it is valid for X=A, then it is also valid for X=A+1. Clearly, this will demonstrate that the claim is valid for X=2, and therefore for X=2+1=3, and therefore also for X=3+1=4, etc.

First of all, I must prove that the claim is valid for X=2. In the case of X=2, the claim says that

  E(2,I)   E(1,I)   (I/C)^0              E(1,I)
  ------ = ------ - ------- * e^(-I/C) = ------ - e^(-I/C)          [18]
    I        I         0!                  I

But this was proven above in the derivation of E(2,I). Thus, we know that our claim is valid for X=2. Now let us assume that the claim is valid for X=A, i.e.

  E(A,I)   E(A-1,I)   (I/C)^(A-2)
  ------ = -------- - ----------- * e^(-I/C)                        [19]
    I         I          (A-1)!

We must show that it is also valid for X=A+1.

  E(A+1,I) = SUM(J=1:I | E(A,J-1)/C * (1-1/C)^(I-J)) =              [20]
           = SUM(J=1:I | (E(A-1,J-1)-J*(J/C)^(A-2)/(A-1)!*e^(-J/C))*
                         (1-1/C)^(I-J)/C) =
           = SUM(J=1:I | E(A-1,J-1)*(1-1/C)^(I-J)/C) -
             SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-J/C)*(1-1/C)^(I-J)) =
           = E(A,I) - SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-J/C)*e^(-(I-J)/C)) =
           = E(A,I) - SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-I/C)) =
           = E(A,I) - e^(-I/C)/(A-1)!*SUM(J=1:I | J^(A-1))/C^(A-1)

We know that SUM(J=1:I | J^(A-1)) = I^A/A + smaller powers of I, and is thus approximately equal to I^A/A; so,

  E(A+1,I) = E(A,I) - e^(-I/C)/(A-1)!*                              [21]
                      SUM(J=1:I | J^(A-1))/C^(A-1) =
           = E(A,I) - e^(-I/C)/(A-1)!*(I^A/A)/C^(A-1) =
           = E(A,I) - e^(-I/C)/A!*I*(I/C)^(A-1)

Thus,

  E(A+1,I)   E(A,I)   (I/C)^(A-1)
  -------- = ------ - ----------- * e^(-I/C)                        [22]
      I        I           A!

This is indeed the result predicted by our claim for X=A+1. Thus, we know that if the claim is valid for X=A, it is valid for X=A+1. Thus, we are justified in saying that our claim is true.
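The claim can also be spot-checked numerically. In this Python sketch (an addition of mine, with illustrative values C=1000 and I=800), E(X,I) is computed exactly from recursions [3] and [12] and then compared against the approximate closed form [17]:

```python
# Numeric check of claim [17]: compute E(X,I) exactly by iterating the
# recursions E(1,i) = E(1,i-1)*(1-1/C)+1 and
# E(X,i) = E(X,i-1)*(1-1/C) + E(X-1,i-1)/C, starting from E(X,0) = 0,
# then compare with E(X,I)/I ~ E(X-1,I)/I - (I/C)^(X-2)/(X-1)! * e^(-I/C).
import math

def expected_counts(C, I, xmax=8):
    E = [0.0] * (xmax + 1)        # E[X] holds E(X,i); index 0 unused
    for _ in range(I):
        prev = E[:]               # values at i-1
        E[1] = prev[1] * (1 - 1/C) + 1
        for X in range(2, xmax + 1):
            E[X] = prev[X] * (1 - 1/C) + prev[X-1] / C
    return E

C, I = 1000, 800
F = I / C
E = expected_counts(C, I)

def claim_rhs(X):
    # right-hand side of claim [17], using the exact E(X-1,I)
    return E[X-1]/I - F**(X-2) / math.factorial(X-1) * math.exp(-F)
```

For large C the exact recursion and the approximate closed form agree to a few decimal places, which is all the large-C approximations in the derivation promise.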
Finally, we have obtained a general (albeit recursive) formula for E(X,I). Now comes the final phase of our proof -- the derivation of the average number of I/Os needed to retrieve a record. For the remainder of this discussion, let us for convenience use the fill factor F = N/C.

The average number of I/Os needed to retrieve a record is equal to the total number of I/Os needed to retrieve all records divided by the number of records. The total number of I/Os needed to retrieve all records is equal to

  (the expected number of 1-arys)*(I/Os needed to retrieve a 1-ary) +
  (the expected number of 2-arys)*(I/Os needed to retrieve a 2-ary) + ...

But we know that the expected number of X-arys is simply E(X,N); the number of I/Os needed to retrieve an X-ary is simply X; and the number of records is N. Thus, the average number of I/Os needed to retrieve a record is

  1*E(1,N)+2*E(2,N)+...     E(1,N)     E(2,N)
  --------------------- = 1*------ + 2*------ + ...                 [23]
            N                  N          N

We have found that

  E(X,N)   E(X-1,N)   F^(X-2)
  ------ = -------- - ------- * e^(-F)                              [24]
    N         N       (X-1)!

Thus,

  E(X,N)   E(1,N)   F^0          F^1                F^(X-2)
  ------ = ------ - ---*e^(-F) - ---*e^(-F) - ... - -------*e^(-F)  [25]
    N        N       1!           2!                (X-1)!

Now,

  E(1,N)/N = (1-e^(-F))/F =                                         [26]
           = e^(-F)*(e^F-1)/F =
           = e^(-F)*(1+(F^1)/1!+(F^2)/2!+(F^3)/3!+...-1)/F =
           = e^(-F)*((F^1)/1!+(F^2)/2!+(F^3)/3!+...)/F =
           = e^(-F)*((F^0)/1!+(F^1)/2!+(F^2)/3!+...)

So,

  E(X,N)/N = E(1,N)/N -                                             [27]
             ((F^0)/1!+(F^1)/2!+...+(F^(X-2))/(X-1)!)*e^(-F) =
           = ((F^0)/1!+(F^1)/2!+...)*e^(-F) -
             ((F^0)/1!+(F^1)/2!+...+(F^(X-2))/(X-1)!)*e^(-F) =
           = (F^(X-1)/X!+F^X/(X+1)!+...)*e^(-F) =
           = sum(Y=X-1:infinity | F^Y/(Y+1)!)*e^(-F)

We are trying to calculate

    E(1,N)     E(2,N)     E(3,N)
  1*------ + 2*------ + 3*------ + ... =                            [28]
       N          N          N
  = 1*(F^0/1!+F^1/2!+F^2/3!+...)*e^(-F) +
    2*(F^1/2!+F^2/3!+...)*e^(-F) +
    3*(F^2/3!+...)*e^(-F) + ...

How many times will F^(Y-1)/Y! be included in the above sum for any given Y?
Well, it will be included once in the first row, with a factor of 1; once in the second row, with a factor of 2; and so on until the Yth row, in which it will occur with a factor of Y. In the (Y+1)st row, it will no longer occur, because the (Y+1)st row contains only terms starting with (F^Y)/(Y+1)!. Thus, in total, F^(Y-1)/Y! will occur in the resulting sum with a factor of (1+2+3+...+Y) = (Y+1)*Y/2. Thus, the average number of I/Os will be

  = sum(Y=1:infinity | F^(Y-1)/Y!*(Y+1)*Y/2) * e^(-F) =             [29]
  = sum(Y=1:infinity | F^(Y-1)/(Y-1)!*(Y+1)/2) * e^(-F) =
  = sum(Y=1:infinity | F^(Y-1)/(Y-1)!*(Y-1)/2 +
                       F^(Y-1)/(Y-1)!*2/2) * e^(-F) =
  = sum(Y=0:infinity | F^Y/Y!*Y/2) * e^(-F) +
    sum(Y=0:infinity | F^Y/Y!) * e^(-F) =
  = sum(Y=1:infinity | F^Y/(Y-1)!)/2 * e^(-F) +
    sum(Y=0:infinity | F^Y/Y!) * e^(-F) =
  = (F/2+1) * sum(Y=0:infinity | F^Y/Y!) * e^(-F)

Here we can again observe that sum(Y=0:infinity | F^Y/Y!) = e^F. Thus, the average number of I/Os is

  = (F/2+1) * sum(Y=0:infinity | F^Y/Y!) * e^(-F) =                 [30]
  = (F/2+1) * e^F * e^(-F) =
  = 1 + F/2

Eureka! So, THE AVERAGE NUMBER OF I/OS REQUIRED FOR A DBGET MODE 7 AGAINST AN IMAGE/3000 MASTER DATASET WITH BLOCKING FACTOR 1 AND FILL FACTOR F IS 1+F/2. At last.

THE APPLICATIONS

Now that all of you have been dazzled by the above display of Incredibly Incomprehensible Mathematical Formulae, an important question comes up: WHY BOTHER? As an amateur mathematician (i.e. someone who is crazy enough to actually LIKE the above mumbo-jumbo), I can answer the above with "Why not?" or "Because it's there" or "But aren't you just overwhelmed by the incredible beauty of the result?". However, as a computer programmer who works on what is essentially a business computer, I feel that I ought to show some applications of the above mess. Well, here they come.

For one, note that the formula generated depends only on how full the dataset is as a percentage, not on the actual number of entries in the dataset.
Thus, if we compare the efficiency of IMAGE mode 7 DBGETs and KSAM FREADBYKEYs, we find that IMAGE should theoretically require only 1 to 1.5 I/Os, whereas KSAM requires on the order of log N I/Os.

Another point that this proves is that increasing the capacity of a dataset will not necessarily greatly improve the efficiency of master dataset access (as is commonly believed). For instance, if the capacity of a dataset that is 80% full is doubled, the average number of I/Os required for a DBGET mode 7 against this dataset will decrease from 1.4 to 1.2, which may not be significant enough to offset the two-fold increase in disc space usage.

On the other hand, if you see that the fraction of secondaries is much larger than 1-(1-e^(-F))/F and/or the average number of I/Os is much greater than 1+F/2, you may have come across a case in which IMAGE's hashing algorithm behaves poorly. In this case, changing the capacity very slightly may cut the number of secondaries and the average number of I/Os dramatically. In one case which I ran across, changing the capacity by 1 cut the average number of I/Os in half! The actual number of secondaries and the actual average number of I/Os may be obtained from various database utilities such as DBLOADNG, DBSTAT2, or Robelle Consulting's HowMessy.
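The expected figures that such comparisons rely on are easy to check by simulation. The following Python sketch (mine, not from the paper; the uniform `random.randrange` stands in for a "good" hashing function) throws N entries at C slots and measures both quantities:

```python
# Monte-Carlo check of the two results: with a uniform ("good") hash and
# blocking factor 1, the fraction of primaries should be near
# (1-e^(-F))/F and the average retrieval cost near 1+F/2 I/Os.
import math
import random

def simulate(C, N, trials=100, seed=42):
    rng = random.Random(seed)
    prim_frac = avg_ios = 0.0
    for _ in range(trials):
        chain_len = [0] * C         # synonym-chain length per slot
        primaries = ios = 0
        for _ in range(N):
            slot = rng.randrange(C)
            if chain_len[slot] == 0:
                primaries += 1      # first arrival in a slot is the primary
            chain_len[slot] += 1
            ios += chain_len[slot]  # an X-ary costs X I/Os to retrieve
        prim_frac += primaries / N
        avg_ios += ios / N
    return prim_frac / trials, avg_ios / trials

C, N = 1000, 800                    # F = 0.8
F = N / C
frac, ios = simulate(C, N)
# frac should come out near (1-e^(-0.8))/0.8, and ios near 1+0.8/2 = 1.4
```

Replacing the uniform hash with a deliberately skewed one (say, one that favors even slots) makes both measurements blow past the formulas, which is exactly the symptom described above for datasets where IMAGE's hashing behaves poorly.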
CONCLUSION

The two major results that were derived in this paper are: if a dataset has fill factor (number of entries/capacity) F, the expected fraction of primaries in that dataset is

         1-e^(-F)
  P(F) = --------                                                   [8]
             F

and the expected average number of I/Os needed to retrieve an entry (if the dataset has blocking factor 1) is

  A(F) = 1 + F/2                                                    [30]

EXPERIMENTAL EVIDENCE

The following is a table of the results that were expected and the results that were obtained for several test datasets:

                         --------Actual--------  -------Expected-------
  --C--  --N--  --F--    %Primaries  Avg num IOs  %Primaries  Avg num IOs
    650    525  80.8%      67.2%       1.411        68.6%       1.404
    331    193  58.3%      75.1%       1.337        75.8%       1.291
    331    193  58.3%      75.6%       1.275        75.8%       1.291
  15000  10582  70.5%      71.0%       1.387        71.8%       1.352
    331    118  35.6%      80.5%       1.203        84.1%       1.178
    250    206  82.4%      66.0%       1.456        68.1%       1.412

Of the following datasets, some (the ones marked +) have primary percentages that are significantly less than expected and average numbers of I/Os that are significantly more than expected. The datasets marked - are exact copies of those datasets with slightly changed capacities -- their primary percentages and average numbers of I/Os are quite close to those expected:

                           --------Actual--------  -------Expected-------
    --C--  --N--  --F--    %Primaries  Avg num IOs  %Primaries  Avg num IOs
  +   128    123  96.1%       7.3%      13.008        64.3%       1.480
  +   127    123  96.8%      52.0%       1.667        64.0%       1.484
  -   126    123  97.6%      70.7%       1.407        63.9%       1.488
  +   320    284  88.9%      37.3%       2.810        66.3%       1.444
  -   319    284  89.0%      65.5%       1.444        66.2%       1.445
  +   400    158  39.5%      69.6%       1.456        82.6%       1.198
  -   399    158  39.6%      86.1%       1.158        82.6%       1.198

These datasets are good examples of instances in which slightly altering the dataset capacity (even decreasing it!) can significantly improve performance.
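The "Expected" columns in both tables follow directly from results [8] and [30]; as a quick sketch (my code, not part of the original paper), they can be recomputed from C and N alone:

```python
# Recompute the "Expected" columns of the tables above from C and N,
# using P(F) = (1-e^(-F))/F  [8]  and  A(F) = 1+F/2  [30].
import math

def expected_stats(C, N):
    F = N / C
    pct_primaries = 100 * (1 - math.exp(-F)) / F   # expected %Primaries
    avg_ios = 1 + F / 2                            # expected Avg num IOs
    return round(pct_primaries, 1), round(avg_ios, 3)

# e.g. the first row of the first table (C=650, N=525) should come out
# at roughly 68.6% primaries and 1.404 I/Os.
```

Comparing these two numbers against the actual figures reported by a utility such as DBLOADNG or HowMessy is the practical test suggested above: a dataset far off the expected values is a candidate for a small capacity change.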