Do migrating secondaries give you migraines?
F. Alfredo Rego
Adager Corporation
Sun Valley, Idaho 83353-3000 · USA

www.adager.com


What are secondaries, anyway?

Secondaries are necessary evils associated with IMAGE's method of handling synonyms.



What are synonyms?

Synonyms are necessary evils associated with hashing.



What is hashing?

Hashing is a technique which allows us to quickly access a specific entry (among millions) according to the value of its search field, regardless of the entry's location in the dataset.

IMAGE's master datasets are optimized for hashed access. For online applications, which usually serve people waiting over the counter or over the telephone, this kind of quick access provides better performance than serial scans and is simpler than indexing.

Like everything else in the universe, the advantages of hashing come tightly coupled with disadvantages. For instance, we have the very real possibility of synonyms: entries with different search-field values which, nevertheless, hash to the same location. Even though mathematicians have written countless dissertations on the desirability of "perfect" hashing algorithms which would not produce any synonyms at all, every hashing algorithm known today is imperfect and produces synonyms.



What is a synonym chain?

A synonym chain is IMAGE's method of managing the class of entries whose search-field values hash to the same location. An entry which hashes to an empty location (or to a location temporarily occupied by a secondary) becomes a primary. An entry which hashes to a primary-occupied location becomes, ipso facto, a secondary.

The primary has an important job: it serves as the synonym chain's ChainHead, which contains information regarding the total number of synonyms in the chain (if any) and keeps track of the first and last members in the chain. A secondary only has to worry about its predecessor (if any) and its successor (if any).



Why do some secondaries migrate?

Some secondaries migrate because primaries (new or old) have top priority.

Fortunately, not all secondaries migrate. The rules for migrating secondaries are quite clear:



Are all secondaries "bad"?

No. There are "good" secondaries and "bad" secondaries, according to their demands for a very valuable computer resource with direct impact on performance: disc access.

Good secondaries are those which we may access directly in memory buffers (IMAGE's or caching's), without having to go to disc. Bad secondaries are those which force excessive disc activity.

Messy synonym chains, with entries scattered all over, will probably contribute numerous bad synonyms. Cleanly-packed synonym chains, on the other hand, may contribute good synonyms which will be, for all practical purposes, equivalent to primary entries. Intra-memory operations are, after all, significantly faster than disc operations. Disc caching (under Classic machines running MPE V) helps a lot, particularly if the dataset's block length is small (or if you have specified a random caching quantum larger than the default of 16 sectors). There is no disc caching on Spectrum RISC machines running MPE XL or MPE/iX.

An increase in IMAGE's number of buffers (by means of DBUTIL's BufSpec command) may also help, at the expense of more memory and more CPU time to search a larger buffer pool.

Under any circumstances, short synonym chains are much better than long synonym chains!



Why do secondaries (migrating or not) give you headaches?

There are three fundamental operations on entries:

We want to do these operations as quickly as possible. Therefore, we want to avoid excessive disc accesses. Unfortunately, secondaries (migrating or not) tend to increase disc activity. As a consequence, the response time for online applications deteriorates and the throughput of batch jobs becomes unacceptable.



Quick review: Adding and deleting entries.

If a new entry's appointed location is vacant, we are home free! We just add the new entry on the spot and mark it as "occupied". The new primary entry will have tenure for life.

If a new entry's appointed location is already occupied, we must do a lot of work. There are two possibilities:

  1. The current occupant is at its appointed place and, since it arrived before the new entry, it has "seniority" and tenure for life. The new entry, sadly, becomes a synonym and we must place it elsewhere as a secondary, linked to the primary by means of a synonym chain. Before we add the entry, though, we must make sure that it will not duplicate an existing entry's search-field value. We must, therefore, scan the whole synonym chain before we can add the new entry. If the synonym chain is long, this may take a while.

  2. The current occupant did not hash to this location but was placed here as a secondary (subject to migration). Sorry: its time to migrate has come and we must place it elsewhere. After we evict the secondary, we can place the new entry in its appointed spot, as a primary.

Notice that the innocent-sounding expression "we must place it elsewhere as a secondary" is easier said than done. Finding the "elsewhere" may take a long time if the dataset has a long, uninterrupted cluster of occupied entries. This usually happens if the dataset is quite full or if the distribution of search-field values taxes the limits of IMAGE's hashing algorithm.

Having found the "elsewhere" does not guarantee tenure there, since any secondary is subject to migration in the future. If we add a new entry which hashes to a spot occupied by a secondary, we must migrate the secondary elsewhere. If we delete a primary with secondaries, we must move the first secondary into the spot previously occupied by the deleted primary (since we must have a primary). IMAGE, under extreme circumstances, may spend a significant amount of time chasing its own tail while migrating secondaries all over the place!



Quick review: Finding existing entries.

Even in a static dataset (where we never add or delete entries), we may have performance problems when we simply want to find an existing entry.

If the entry happens to be at its appointed location, we are in good shape. If the entry is NOT at its appointed location, there are two possibilities:

  1. The appointed spot is empty, in which case we know, immediately, that our entry does not exist (this is a valid result in a "find" task).

  2. The appointed location contains some other synonym entry (which happens to hash to the same spot as the entry which interests us). Since this entry is the primary in a synonym chain, it keeps track of the number of synonyms. If there are no synonyms, we know that our entry does not exist. If there are synonyms, we must scan the synonym chain until we either find our entry or exhaust the chain. In either case, we may have to go to disc more than once (depending on the length of the chain, the messiness of the chain, the dataset's blocking factor, the database's BufSpec, disc caching for Classic CISC machines, and so on).



What can you do about nasty secondaries?

If you are a theoretical type, you can spend endless pleasant hours dedicated to the fine art of figuring out the ideal mix of capacity, percent-full, search-field value distribution, and so on. This is a worthy endeavor, of course. But the sad thing is that your dataset, most likely, is dynamic. The minute you add or delete a few thousand entries, most of your glorious conclusions become invalid.

If you are a practical type, you might just as well accept reality, bite the bullet, and periodically repack your master datasets to balance the synonym load (just as you periodically repack your detail datasets, your disc drives, or the drawers in your desk). If your dataset is static, you are in luck: you just repack once!

If the widths of master ventilation shafts (free areas reserved for future secondaries) are too suffocating, you may consider changes in the dataset capacity which, together with dataset repacking, will improve the ventilation. Dataset reblocking, database BufSpec redefinition, and increases in the random quantum for disc caching under Classic RISC machines may also help (provided you have memory to spare, of course).

An ideally-packed master dataset allows sufficient ventilation space for DBPUT's sake (so that it does not take forever to find an empty spot for a new secondary or for a migrating old secondary) without having so much empty space that a serial scan takes forever (not to mention a DBSTORE).

Furthermore, the ventilation space has to be intelligently distributed throughout the entire dataset.



The good news and the not-so-bad news.

Fortunately, it turns out that repacking a master dataset is a very efficient operation (when we compare it to the complexities of repacking a detail dataset). Smart repacking of a master dataset takes only slightly longer than an optimized serial dataset scan!

Unfortunately, repacking does not seem to be an a priori activity. We can only repack master datasets a posteriori. We must repack periodically, and (unless we quit adding or deleting entries) we must keep repacking ad infinitum!




What do your worldwide HP e3000 colleagues think of Adager? See a sample of comments from real people who use Adager in the real world, where performance and reliability really count.
Back to Adager