Transaction Logging Concepts

Engine Crew Monograph No. 4
Last updated July 20, 1997

Gus Bjorklund, Progress Software Corporation
Thanks to Greg Higgins for HTML assistance

1. Introduction

This monograph describes how the Progress database manager uses logging to enhance database reliability and performance. By being familiar with these subjects you will be better equipped to take full advantage of the Progress RDBMS.

The major topics covered include a short review of the transaction concept, general logging concepts, normal and recovery processing, checkpoints, and log space allocation.

2. Transactions

The transaction concept is central to the operation of modern database management systems. A short review is provided here as background.

Transactions are an error handling mechanism and a program structuring mechanism. As an error handling mechanism, they allow you to do an arbitrary amount of work in a system and then change your mind. When you do work within a transaction, you can tell the system

"Oops, I don't want to do this after all. Please put everything back the way it was before I started.".

Or you can say

"OK, I'm finished now. Please make all my changes official.".

When a transaction is executing, we say it is active or in-flight. When a transaction is completed normally, we say that it is committed. If a failure occurs before your transaction is complete, the system will automatically undo any work that cannot be finished because of the failure. The process of undoing the effects of an incomplete transaction is called rollback.

As a program structuring mechanism, transactions have well-defined boundaries that encapsulate the database operations performed by some part of an application. Encapsulation in a transaction should be matched by encapsulation in the application.

Transactions have four basic properties (the well-known ACID properties): atomicity, consistency, isolation, and durability. These properties are closely related to each other and are described below.

Atomicity

The atomicity property says transactions are an all-or-nothing affair.

Transactions often make several related changes to a database. These related changes are a logical unit of work and must be performed together. For example, to transfer money from one bank account to another, funds must be deducted from the first account and added to the second. These operations must be performed as a unit. They can be made in any order that is convenient as long as all the changes are made, or none of them are made.
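
To make the all-or-nothing behavior concrete, here is a small sketch in Python (not Progress 4GL). The Transaction class and transfer function are invented for illustration; they simply show that either both account changes survive or neither does.

# Illustrative only: a toy transaction that either applies both balance
# changes or puts everything back the way it was.
class Transaction:
    def __init__(self, db):
        self.db = db
        self.undo = []                    # (account, old balance) pairs

    def update(self, account, delta):
        self.undo.append((account, self.db[account]))
        self.db[account] += delta

    def rollback(self):
        for account, old in reversed(self.undo):
            self.db[account] = old        # restore the previous value

def transfer(db, source, target, amount):
    txn = Transaction(db)
    try:
        txn.update(source, -amount)
        if db[source] < 0:
            raise ValueError("insufficient funds")
        txn.update(target, +amount)       # commit point: both changes stand
    except Exception:
        txn.rollback()                    # neither change happened
        raise

accounts = {"checking": 100, "savings": 0}
transfer(accounts, "checking", "savings", 60)      # succeeds: 40 and 60
try:
    transfer(accounts, "checking", "savings", 90)  # fails and is rolled back
except ValueError:
    pass                                           # balances are still 40 and 60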

Consistency

The consistency property says that transactions transform the database from one consistent state to another.

The notion of consistency refers both to physical database consistency (i.e. that a database's internal structures, such as indexes, records, etc. are valid and consistent) and to logical database consistency (i.e. the data contained within the database is valid from an application's point of view).

Physical consistency is managed entirely by the database manager. It is a prerequisite for logical consistency.

Logical consistency, on the other hand, is managed both by the database manager, through mechanisms such as validation rules, triggers, integrity constraints, etc., and by the actions of the application.

For a transformation from one state to another to be valid, the database manager and the application must perform correct actions to change the data from one state to another.

An important point: although the database state before and after execution of a transaction must be consistent, it can be inconsistent during execution. A consistent state need exist only after the transaction ends.

Isolation

The isolation property gives each transaction the illusion that it is the only one executing.

The effects of several concurrently executing transactions should not be visible to each other. Any changes made by a transaction are considered provisional while the transaction is still executing. They do not become "official" or permanent until the transaction commits.

Until a transaction commits, an error or failure will cause the changes made by that transaction to be rolled back. Therefore, it is important that one transaction does not make decisions based on the uncommitted changes made by another. If the uncommitted changes are undone, the decision can become invalid.

The Progress database manager provides isolation through the use of two-phase locking, but that is a subject beyond the scope of this discussion.

Durability

The durability property says once committed, always committed.

When a transaction is committed, its effects on the database are permanent. They must not be undone even if a failure occurs. The effects of a committed transaction can only be reversed by executing a second transaction that undoes the effects of the first.

3. Logging Concepts

The Progress database manager achieves the transaction properties of atomicity and durability by using a technique called undo-redo logging, combined with write-ahead logging.

As database changes are made by one or more transactions, notations or descriptions of these changes are durably recorded on disk in a type of transaction log file known as an undo-redo log. When an error occurs during a transaction, these log records are used to roll back the transaction, removing all of its effects on the database (restoring all data items to their previous values). The same log records are also used during crash recovery processing to restore the database to a consistent state after a failure.

There is an optional second log, called the redo log (discussed later), that can be used to enhance system reliability by recording the transaction log in two places.

The undo-redo log

The undo-redo log is called the "before-image file" or "bi file". The before-image file contains a log of all recent database changes. As transactions change the database during normal ("forward") processing, the database manager writes one or more log records describing each change to the before-image file before making the change. Although it is called the before-image file for sentimental reasons, the undo-redo log does not actually contain before-images. It contains data that are used to:

* Undo or reverse the effects of a transaction that has not yet been committed.

* Restore the database to a consistent state after a failure by repeating (redoing) previously made changes. The same set of changes may have to be repeated more than once because a failure might occur during crash recovery.

* Preserve the effects of committed transactions even if a failure occurs immediately after a transaction is committed.

* Allow incremental reuse of space in the undo-redo log when records are no longer needed, minimizing the size of the log.

* Limit the amount of data that must be processed, and the time required to do so, when recovering from a failure.

Depending on the particular database operation being performed, what is needed to meet these requirements might be a copy of the data before they were changed, a copy of the new data, some combination of the two, or perhaps other information. There are over 50 different types of log records and each type contains different data.

The before-image log can be either a single file, called the bi file, or multiple files (before-image extents) that are part of a multi-file database structure.

Redoing a change

The information that is recorded in the log records is such that one can combine the data in the log record with the data in the old version of a database block to produce a new, updated version of the block. For example, when a record is updated, the log record would contain old and new values of only those parts that changed. The new values would be merged into the old record to redo a change.

Undoing a change

The same log record that was used to make the initial change can also be used to undo an operation in combination with the new version of a database block to produce the old version of the block. Continuing the previous example, the old values would be merged with the new record to undo a change.

Note that when a change is undone, a new log record that describes the undo is generated. This is necessary because a failure can occur before the disk-resident copy of the database has been updated to reflect the fact that the undo occurred.
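
The idea can be sketched with a simplified "physical change" record that holds the old and new values of just the bytes that changed. This is a Python illustration only, not the actual Progress log record format (which has many more record types):

# Simplified sketch: one kind of log record, and the redo/undo operations
# that merge it into a block image.  Not the real Progress record format.
from dataclasses import dataclass

@dataclass
class LogRecord:
    block_no: int
    offset: int
    old: bytes        # the values before the change
    new: bytes        # the values after the change

def redo(block: bytearray, rec: LogRecord) -> None:
    # combine the new values with the old block to produce the new block
    block[rec.offset:rec.offset + len(rec.new)] = rec.new

def undo(block: bytearray, rec: LogRecord) -> LogRecord:
    # combine the old values with the new block to produce the old block,
    # and generate a new log record describing the undo itself
    block[rec.offset:rec.offset + len(rec.old)] = rec.old
    return LogRecord(rec.block_no, rec.offset, old=rec.new, new=rec.old)

block = bytearray(b"................")
rec = LogRecord(block_no=409824, offset=3, old=bytes(block[3:11]), new=b"!@#$%^&*")
redo(block, rec)             # the change is applied (or repeated during recovery)
undo_rec = undo(block, rec)  # the change is reversed; undo_rec logs the reversal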

The redo log

The optional redo log is called the "after-image log" or "ai file". The after-image log is a log of all database changes that have occurred after a full backup of the entire database.

The purpose of the after-image log is to provide a mechanism for recovering from a failure that destroys all or part of the database's data files or all or part of the before-image log. To recover from such a failure, one restores the database from the backup and then reapplies all changes that were made since the time of the backup.

The process of reapplying the changes is called roll-forward. During this process, the database manager reads records in the after-image log and repeats each and every database change, in the same order that they were originally made. At the end of the roll-forward process, the database will be in the same state it was in just before the failure occurred. From this state, the normal crash-recovery process (described below) restores the database to a consistent state.

The after-image log can be either a single file, called the ai file, or multiple files (after-image extents) that are part of a multi-file database structure.

Deferred writes

The existence of the undo-redo log allows the Progress database manager to keep database changes in memory indefinitely, using a buffer management policy known as "steal, no-force". Database blocks can be written to disk whenever it is convenient. Changes are always made in-place to the actual data, in an area of memory called the database buffer pool.

Changed database blocks do not have to be written to disk when a change is made, nor when the transaction that changed them is committed. It is sufficient that the transaction log records exist on disk. This is known in the jargon as "deferred writes". It allows for high performance and extensive optimization of disk write operations. Database blocks in the buffer pool can be updated many times before they are written to the disk. The database manager can choose the time and the order in which to write the data. In fact, updated blocks can be written before the updating transaction has committed, or after, whichever is most suitable. In effect, the log allows slower random writes to be transformed into faster sequential writes.

Write-ahead logging

When using write-ahead logging, records of all database changes are created and always written to the log first, before the changes are applied to the memory-resident and disk-resident part of the database. This way, the data in the undo-redo log (the before-image log) can always be used to repeat or redo any changes that may be lost before they can be written to the database.

If a change were made to the database and written to disk before the log record, and the log record were then lost, there would be a database change for which no log record exists. Without the log record, the change cannot be undone if the transaction has to be rolled back. The database cannot be restored to a consistent state and will be unusable.

For write-ahead logging to work correctly, all writes to the log must be unbuffered or "synchronous". This means that when the database manager writes a log buffer to disk (using a write() system call or something similar), the data must actually be written to the disk and not just saved in memory by the operating system or a disk controller.
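
The rule can be reduced to a few lines. The sketch below is Python using a POSIX-style file interface with made-up file names; the only point is the ordering: the log record is forced to disk before the changed block is allowed to reach the database file.

# Sketch of the write-ahead ordering rule (made-up file names, 8K blocks
# assumed).  The log write is forced to disk before the block write.
import os

log_fd = os.open("sample.bi", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
db_fd  = os.open("sample.db", os.O_RDWR | os.O_CREAT, 0o644)
BLOCK_SIZE = 8192

def log_then_write(block_no, offset, new_bytes, log_record):
    os.write(log_fd, log_record)   # 1. the log record is written first ...
    os.fsync(log_fd)               # ... and forced out (unbuffered/synchronous)
    # 2. only now may the changed block go to the database file; in practice
    #    this write is deferred until it is convenient (see above)
    os.lseek(db_fd, block_no * BLOCK_SIZE + offset, os.SEEK_SET)
    os.write(db_fd, new_bytes)

log_then_write(0, 354, b"!@#$%^&*", b"blk 0 off 354 old ........ new !@#$%^&*\n")
os.close(log_fd)
os.close(db_fd)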

Because of the requirement for reliable write-ahead logging, database files and before-image and after-image logs must not be stored on remote file systems (such as NFS) mounted across networks. This is because writes to such file systems are not sufficiently reliable.

There are storage subsystems that provide enhanced performance through the use of a technique known as write caching. With write caching, a disk subsystem contains a memory buffer which is used to store data written to the disk by the operating system. Should a power failure occur, the contents of this memory buffer will be lost. If that happens to data that the Progress database manager believes has been recorded on the disk, crash recovery may not be possible.

4. Normal Transaction Processing

Forward processing

The normal activities performed by an application while it is accessing and changing the database are called forward processing. Typically these activities include starting and committing transactions, finding and retrieving data, changing existing data, and adding new and deleting old data. The changes made by an application are called logical changes. Some examples of them are

create a new record in the customer table

delete this record from the customer table

replace a record with a new one whose primary key column value has been changed to "Kathryn"

Each logical change is performed by making changes to one or more database blocks. A change to an individual block is called a physical change and might be an operation like

in block 409824 starting 354 bytes from the beginning, replace 8 bytes with "!@#$%^&*"

During forward processing log records are generated to record each physical database change. The typical sequence of events is as follows:

  1. The database block that contains the data to be changed is located in the database buffer pool or read from disk if it is not already there.
  2. The buffer is locked in update (exclusive) mode to prevent another process from accessing or changing that block until the change is complete.
  3. An undo-redo log record describing the change to be made is created in memory.
  4. The log record is stored at the next available space in the current log buffer. If no space is available, the buffer is queued for writing by the bi writer and a free log buffer is made the current buffer. If there are no free log buffers, the oldest full buffer is written to disk.
  5. The change is made to the database block, using only the data in the log record and the data in the block.
  6. The buffer lock is released.

When a logical operation affects more than one block, the steps described above are performed on each affected block in turn. An undo-redo log record is generated for each block.
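
The six steps can be put into a small sketch. The buffer pool and log here are toy Python stand-ins, not the actual shared-memory structures, and the full-log-buffer handling of step 4 (queuing for the bi writer) is left out.

# Toy sketch of steps 1-6 above.  Full log-buffer handling and the bi writer
# are omitted; the lock shows the exclusive access held while changing a block.
import threading

class Buffer:
    def __init__(self, block_no, size=8192):
        self.block_no = block_no
        self.data = bytearray(size)
        self.lock = threading.Lock()          # update (exclusive) lock

class BufferPool:
    def __init__(self):
        self.blocks = {}
    def fetch(self, block_no):
        # 1. locate the block in the buffer pool, or "read it from disk"
        return self.blocks.setdefault(block_no, Buffer(block_no))

def apply_change(pool, log_buffer, block_no, offset, new_bytes):
    buf = pool.fetch(block_no)
    with buf.lock:                            # 2. lock the buffer in update mode
        rec = {"block": block_no, "offset": offset,      # 3. build the log record
               "old": bytes(buf.data[offset:offset + len(new_bytes)]),
               "new": new_bytes}
        log_buffer.append(rec)                # 4. store it in the current log buffer
        buf.data[offset:offset + len(new_bytes)] = rec["new"]   # 5. change the block
    # 6. the buffer lock is released on leaving the "with" block

pool, log_buffer = BufferPool(), []
apply_change(pool, log_buffer, 409824, 354, b"!@#$%^&*")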

Transaction rollback

When an application decides to undo a transaction rather than commit it, transaction rollback is initiated. This occurs when the application executes an UNDO statement in the 4GL, when an interactive user presses the "stop" key on the keyboard, when a lock wait is canceled by an interactive user, or when a lock wait times out. The processing required to perform rollback is handled automatically, without the need for any intervention by the application.

To roll back an active transaction, the database manager reads, in reverse order, all the log records generated by the transaction, back to the point at which it began. The effects of each change are reversed and the original data values that existed just before the change was made are restored. As these changes are being undone, the database is being changed again. These new changes are also logged. This generates additional log records during the rollback operation just as they would be during forward processing.

The sequence of events for undoing a single database change are the same as those described previously for forward processing. The log records identify which blocks have to be changed and also contain all data needed to perform the undo operation.
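
Rollback can then be sketched as a backward pass over that transaction's records, using the same simplified record layout as in the earlier sketches (Python illustration only); note that each undo is itself logged.

# Toy rollback: scan the transaction's log records newest-first, put back the
# old values, and log each undo as a new record (same simplified layout).
def rollback(txn_records, blocks, log_buffer):
    for rec in reversed(txn_records):
        block = blocks[rec["block"]]
        # restore the data that existed just before this change was made
        block[rec["offset"]:rec["offset"] + len(rec["old"])] = rec["old"]
        # undoing is itself a database change, so it generates a log record too
        log_buffer.append({"block": rec["block"], "offset": rec["offset"],
                           "old": rec["new"], "new": rec["old"]})

blocks = {1: bytearray(b"BBBBxxxx")}          # the forward change already applied
log_buffer = [{"block": 1, "offset": 0, "old": b"AAAA", "new": b"BBBB"}]
rollback(log_buffer[:], blocks, log_buffer)   # restores b"AAAA", logs the undo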

Transactions can be rolled back explicitly at the request of the application or the user, or automatically by the database manager when an error occurs and the application is no longer able to communicate with the database.

5. Crash Recovery

Crash recovery is the process whereby the database is restored to a consistent state after a failure (crash) occurs. When a failure occurs, one or more transactions may be active and not yet committed. Since transactions must be atomic, the effects of these partially completed transactions must be removed from the database. Also, changes made by committed transactions may not yet have been written to disk. Since transactions must be durable, any committed data that were lost must be recreated.

Database state

The current state of the database is a combination of data from three places:

On disk, in the files that make up the database - The disk-resident part contains all data that have been written to the disk(s) so far. They may be incomplete and inconsistent because some changes may not have been written to disk yet.

In shared-memory - The memory-resident part contains all the most recent changes. Some or all of these changes may not have been written to disk. The memory-resident part is almost always incomplete because the entire database does not normally fit in memory.

In the transaction log - The undo-redo log contains records of all recent changes to the database. The transaction log is always incomplete because old log records are discarded and the space is reused.

Because database changes are buffered in memory and are written back to disk only when it becomes necessary to do so, as more and more database changes are made, the disk-resident copy of the database becomes more and more out of date.

When the database system is shut down in an orderly manner, all the memory-resident data are written to disk, making the disk-resident copy up-to-date.

The recovery process

When a failure occurs, the memory-resident part of the database state is always lost. The disk-resident part is in an unknown, but most likely out of date state. Some database changes may have been written to disk. Some may have been buffered in memory and lost. One or more transactions may have been active and not completed. During the recovery process the data that were lost are reconstructed by combining the data on disk in the before-image log with the data previously stored on disk in the database.

The recovery operation needs the data recorded in the before-image log and the data in the disk-resident copy of the database. If either one is not available or has been damaged, successful crash recovery is not possible, but roll-forward recovery will be. More on that later.

The steps performed during the recovery process are organized into two major phases: the redo phase and the undo phase. In these phases, the data that were lost are reconstructed and the results of any partially completed operations are discarded.

The redo-phase

During the redo-phase, the memory-resident and disk-resident data are reconstructed by performing a forward scan of the records in the before-image log and repeating or "redoing" any changes that were never written to the disk-resident copy of the database. This is possible because write-ahead logging ensures that a record of each change exists on disk in the undo-redo log.

At the end of the redo-phase, the database state (in the three parts as mentioned previously) has been restored to that which existed just before the failure occurred.

The undo-phase

When the redo-phase has been completed, the undo-phase is begun. The undo-phase performs a backward scan of the before-image log starting with the most recent change, undoing only those changes that were made by uncommitted transactions. Changes made by committed transactions are not disturbed. The backward scan continues to the point at which the oldest uncommitted transaction began. This is possible because write-ahead logging ensures that the data required to undo every change is present in the before-image log.

The undo-phase is quite similar to the processing that occurs during a normal transaction rollback.

When the undo-phase has been completed, the database state is once again consistent and ready to be used.
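
Both phases can be sketched over the same simplified records used earlier, with a transaction id on each record and a commit marker at transaction end. This is a Python illustration only; among other things, the real redo-phase also decides whether a block already contains a given change, a detail omitted here.

# Toy crash recovery over the simplified records: forward redo of everything,
# then backward undo of changes made by transactions that never committed.
def crash_recovery(log, blocks, block_size=16):
    committed = set()
    # redo phase: forward scan, repeat the changes
    for rec in log:
        if rec.get("commit"):
            committed.add(rec["txn"])
            continue
        b = blocks.setdefault(rec["block"], bytearray(block_size))
        b[rec["offset"]:rec["offset"] + len(rec["new"])] = rec["new"]
    # undo phase: backward scan, reverse only uncommitted transactions' changes
    for rec in reversed(log):
        if rec.get("commit") or rec["txn"] in committed:
            continue
        b = blocks[rec["block"]]
        b[rec["offset"]:rec["offset"] + len(rec["old"])] = rec["old"]

log = [
    {"txn": 1, "block": 5, "offset": 0, "old": b"....", "new": b"PAID"},
    {"txn": 1, "commit": True},
    {"txn": 2, "block": 5, "offset": 4, "old": b"....", "new": b"OOPS"},
    # crash here: transaction 2 never committed
]
blocks = {}
crash_recovery(log, blocks)
# block 5 now begins with b"PAID...." -- txn 1 preserved, txn 2 undone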

Danger - the -F startup option

Starting a database session with the -F option causes crash recovery to be skipped. This option is provided as an emergency measure and should never be needed. It exists for the purpose of allowing you to extract and salvage as much data as possible from a database that is damaged and otherwise unusable. It should be used only when the alternative is that the entire database will be lost. If you have a backup, you will probably lose less data by restoring the backed up copy of the database.

The memory-resident state that has been lost is not recreated, so any database updates that were not written to disk will be lost. Incomplete transactions are not rolled back. Only the (most probably) out-of-date disk resident part of the database state is retained.

Using the -F option when starting a database will almost always produce a database that is in an inconsistent (corrupt) state. There will usually be physical inconsistencies, such as incompletely updated records, partially deleted records, records without index entries, index entries that refer to nonexistent records, etc., as well as logical ones such as partially completed transactions.

The fact that the database has been damaged by skipping crash recovery is usually not immediately apparent. The existence of a partially created record will be revealed when that record is accessed and an error occurs. Until then, everything may appear normal.

Physical inconsistencies in index structures, except for the critical schema indexes needed by the index rebuild utility, can usually be repaired by rebuilding the indexes. This is possible because the index rebuild utility first deletes all the index blocks and then creates new ones using the key values present in records.

Physical inconsistencies in data records (including the records that describe the schema) cannot be repaired. The only thing that can be done is to remove damaged records manually, if they can be located.

Logical inconsistencies can only be repaired by examining all data values in all records and verifying that they are correct, that no records are missing, and that there are no extra records. This is almost always impossible.

Beginning with Version 8.2A, you can only force access to the database by using the -F option with the PROUTIL utility in conjunction with the TRUNCATE BI qualifier, as shown below.

proutil demo -C truncate bi -F

You can no longer start the database using the -F option. When you run the proutil command shown above, the following dire warnings (danger, danger, Will Robinson) and prompt are displayed:

The -F option has been specified to proutil. (6260)
Forcing into the database skips database recovery. (6261)
This leaves the database in unknown state, considered damaged. (6262)
Are you sure you want to skip crash recovery? (6263)

If you enter "y" for "yes", the force access will occur and the following additional dire warnings will be displayed and written to the log file:

** The FORCE option was given, database recovery will be skipped. (33)
** Your database was damaged. Dump its data and reload it. (37)

6. Roll-forward Processing

Roll-forward processing provides a method for recovering from a failure that has destroyed all or part of the disk-resident copy of the database or the before-image log. A second optional log called the redo log or after-image log can be generated. During normal processing, records of all database changes are written to the optional redo log, as well as to the undo-redo or before-image log. The contents of the two types of logs are almost exactly identical. The difference is that space in the redo log is not reused until the data in it have been archived.

To recover the database, the roll-forward process combines a backup copy of the database with the records in the after-image log to redo all the changes that were made since the backup was created. This redo process is very similar to the redo-phase of crash recovery, except that the log records are obtained from the after-image log instead of the before-image log. It reconstructs the memory and disk resident parts of the database state that existed just before the failure occurred, as well as a new before-image log.

Once the roll-forward operation has been completed, the normal crash recovery function described previously is used to undo any incomplete transactions to restore the database to a usable state.

If desired, it is possible to partially execute the roll-forward operation, stopping at a specified time and date and discarding any changes made after that point. This is known as point-in-time-recovery.
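
In outline, and in the same toy Python terms as the earlier sketches (not the real utilities), roll-forward restores the backup, replays the after-image records in their original order until the end of the log or a chosen stopping time, and then hands the result to crash recovery.

# Toy roll-forward: start from the backup, replay after-image records in their
# original order, optionally stopping at a point in time; crash recovery (as
# sketched earlier) would then undo whatever was still incomplete.
def roll_forward(backup_blocks, ai_log, stop_time=None, block_size=16):
    blocks = {n: bytearray(b) for n, b in backup_blocks.items()}   # restored backup
    for rec in ai_log:                                  # forward, original order
        if stop_time is not None and rec["time"] > stop_time:
            break                                       # point-in-time recovery
        b = blocks.setdefault(rec["block"], bytearray(block_size))
        b[rec["offset"]:rec["offset"] + len(rec["new"])] = rec["new"]
    return blocks

backup = {5: b"................"}
ai_log = [{"time": 100, "block": 5, "offset": 0, "new": b"PAID"},
          {"time": 200, "block": 5, "offset": 4, "new": b"LATE"}]
restored = roll_forward(backup, ai_log, stop_time=150)  # only the first change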

7. Checkpoints

If the database manager were to write changed database blocks back to disk only when the database was shut down, then when a failure occurred, the crash recovery operation would have to repeat every change made since the start of the session. This could possibly take a very long time.

In addition the database manager would never be able to reuse any before-image log space because it would need all the data for the redo-phase of crash recovery. To avoid this sad state of affairs, the memory-resident state of the database is periodically reconciled with the disk-resident state. This reconciling operation is called a "checkpoint". A checkpoint is begun whenever a before-image cluster (see the section about before-image log space allocation, below) is filled.

During the checkpoint operation, all memory-resident state information and all buffered changes are written to disk. This includes the transaction table, all changed database blocks in the buffer pool, and some other data structures normally kept in memory.

Fuzzy checkpoints

Checkpoints are asynchronous and occur at the same time as new database changes are being made. Since the database state keeps changing while the reconciling operation is occurring, the memory-resident and disk-resident states are never actually consistent with each other. For this reason, Progress checkpoints are fuzzy checkpoints.

A checkpoint is begun when a before-image cluster is filled, but it does not have to be completed until just before the next one begins. One or more asynchronous page writer processes (also called page cleaners) can do the needed disk writes in the background. At the same time, servers can continue to process client requests and make additional database changes on their behalf.

Provided the page writers can complete a checkpoint by the time the next one must begin, only a very few disk writes will be required at cluster close time. The only database buffers that must be written then are any that were scheduled for checkpointing but were not already written by the page writers.

The fuzzy checkpoint technique minimizes the impact of checkpoints on normal processing.

When a checkpoint begins

The start of a checkpoint is triggered by a process that wants to store a transaction log record in the current before-image log buffer and discovers that there is insufficient free space to hold the record. Normally the next available block in the cluster will be used, but when the last block of the cluster has been used, the cluster becomes full and must be closed. Before the next cluster can be opened, a checkpoint is begun. The following actions are performed before the pending operation or any further database changes can proceed:

  1. All unwritten log buffers are forced to disk from memory. This includes both before-image (undo-redo) buffers and after-image (redo) buffers.
  2. A check is made to see if there are any buffers still on the checkpoint list from the previous checkpoint. Under normal circumstances, the list will be empty because the page writers will have already written all the buffers and removed them from the list. If they have not, then all buffers on the checkpoint list will be written at this time. Promon reports these writes as "buffers flushed".
  3. All the buffers in the buffer pool are examined. Any buffer that has been modified since the current before-image cluster was opened is added to the checkpoint list.
  4. The current before-image cluster is marked with the current time and date and the cluster header is written to disk. At this point, the current cluster is officially "closed".
  5. A new before-image cluster is opened. The checkpoint begins at this point.
  6. The contents of the transaction table and other memory-resident data are written to the cluster.

Once these actions have completed, the log record that is ready can be stored in the before-image log buffer and the pending database change can proceed. The remainder of the checkpoint can also proceed in parallel with future database changes.
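
The six actions can be sketched as follows. The BiLog class is a toy Python stand-in for the real shared-memory structures, and its "disk" is just a list; the point is only the order of the steps.

# Toy sketch of closing a bi cluster and beginning a checkpoint (steps 1-6).
import time

class BiLog:
    def __init__(self):
        self.cluster_no = 1
        self.disk = []               # what has reached the bi file "on disk"
        self.log_buffers = []        # unwritten log buffers
        self.checkpoint_list = []    # database buffers scheduled for checkpointing
        self.buffer_pool = []        # dicts: {"block": n, "dirty_since_open": bool}
        self.txn_table = {}          # active transactions

    def close_cluster_and_begin_checkpoint(self):
        # 1. force all unwritten log buffers to disk
        self.disk.extend(self.log_buffers)
        self.log_buffers.clear()
        # 2. anything left over from the previous checkpoint is written now
        #    (Promon reports these writes as "buffers flushed")
        leftovers = len(self.checkpoint_list)
        self.checkpoint_list.clear()
        # 3. schedule every buffer modified since this cluster was opened
        self.checkpoint_list = [b for b in self.buffer_pool
                                if b["dirty_since_open"]]
        # 4. stamp the current cluster's header and close it
        self.disk.append({"cluster": self.cluster_no, "closed_at": time.time()})
        # 5. open the next cluster; the checkpoint officially begins here
        self.cluster_no += 1
        # 6. write the transaction table and other memory-resident state to it
        self.disk.append({"cluster": self.cluster_no,
                          "txn_table": dict(self.txn_table)})
        return leftovers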

Performing the checkpoint

After the new cluster has been opened, the page writers examine the checkpoint list periodically, write some number of the buffers on it to disk and remove them from the list.

The page writers attempt to pace themselves in such a manner that they will empty the checkpoint list just before the current cluster becomes full, ending the checkpoint just before the next one will begin. This allows the cluster to be closed without having to write any database buffers to disk at the time the cluster is closed.
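
The pacing amounts to simple arithmetic: spread the buffers still on the checkpoint list over the bi cluster space that remains. The numbers below are invented and this is not the engine's actual formula, only the idea.

# The pacing idea as arithmetic: finish the checkpoint list just before the
# current bi cluster fills.  (Invented numbers; not the engine's real formula.)
def buffers_per_scan(buffers_left, bi_blocks_left, bi_blocks_per_scan):
    scans_left = max(1, bi_blocks_left // bi_blocks_per_scan)
    return -(-buffers_left // scans_left)          # ceiling division

# e.g. 600 buffers still to write, 300 bi blocks of cluster space left, and
# about 10 bi blocks filled between page-writer scans: write ~20 per scan
print(buffers_per_scan(600, 300, 10))              # -> 20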

When a checkpoint ends

A checkpoint normally ends just before the current before-image cluster becomes full, at the time the checkpoint list becomes empty. No special processing is required at this time.

8. Before-Image Log Space Allocation

Progress uses sophisticated algorithms for managing and reusing space in the before-image log to minimize the amount of disk space required while still providing reliable operation. In most situations, the before-image log is significantly smaller than the database. The exact amount of disk space required is highly dependent on the size and duration of an application's transactions.

Clusters

Space in the before-image file is allocated in fixed-size units called clusters. The cluster size can be changed whenever the before-image file is truncated, using the "proutil" utility.

When the before-image file is initialized, space for four clusters is allocated and linked together to form a ring. Once these four clusters are filled with log records, all the initially allocated space will have been used. Then either the first cluster is reused, or a new cluster is added, either by allocating existing free space from a multi-extent bi file, or by expanding the bi file if no free space is available. When a new cluster is added, it is linked into the ring, time stamped, and assigned the next sequential cluster number.

Each time a cluster is filled, either the oldest cluster is reused or a new one is added.
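
The ring can be sketched as below. This is a toy Python illustration of the bookkeeping only, not the on-disk format; formatting a new cluster is reduced to creating a dictionary.

# Toy sketch of the bi cluster ring: four clusters to start, then on each fill
# either reuse the oldest cluster or link in a newly formatted one.
import time

class BiClusterRing:
    def __init__(self):
        self.next_number = 0
        self.ring = [self._new_cluster() for _ in range(4)]   # initial four
        self.current = 0
        self.ring[0]["used"] = True        # cluster 1 is the current cluster

    def _new_cluster(self):
        self.next_number += 1
        return {"number": self.next_number, "stamp": time.time(), "used": False}

    def cluster_filled(self, oldest_still_needed=False):
        nxt = (self.current + 1) % len(self.ring)
        if not self.ring[nxt]["used"]:
            pass                           # one of the initial clusters; just open it
        elif oldest_still_needed:
            # the oldest cluster's records are still needed: format a new
            # cluster and link it into the ring ahead of the oldest one
            self.ring.insert(nxt, self._new_cluster())
        else:
            # reuse the oldest cluster: restamp and renumber it in place
            self.next_number += 1
            self.ring[nxt].update(number=self.next_number, stamp=time.time())
        self.current = nxt
        self.ring[nxt]["used"] = True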

Reusing an existing cluster

A cluster can be reused when the log records it contains are no longer needed. They are not needed when:

All transactions that were active when the cluster was opened have either been committed or rolled back. The database manager will not have to undo them during the transaction rollback or undo phase of crash recovery.

All database blocks that were changed between the time the cluster was opened and the time it was closed have been safely written to disk. Then the database manager knows it will never have to repeat any of the changes to them during the redo phase of crash recovery.

A long-running transaction can prevent all the space from the cluster in which the transaction began to the current cluster from being reused. If that transaction has to be rolled back, the log records it generated will be needed to reverse the effects of its changes.
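
Those conditions can be written down as a simple test. The sets below are invented for illustration; the engine's actual bookkeeping is different, and this sketch only mirrors the conditions described above.

# The reuse test, over toy bookkeeping: the oldest cluster can be reused only
# when every transaction that wrote records into it has committed or rolled
# back, and every block changed while it was the current cluster has since
# been written to disk.
def can_reuse(cluster, still_active_txns, blocks_written_to_disk):
    txns_finished = not (cluster["txns_with_records"] & still_active_txns)
    changes_on_disk = cluster["blocks_changed"] <= blocks_written_to_disk
    return txns_finished and changes_on_disk

oldest = {"txns_with_records": {17, 18}, "blocks_changed": {409824, 409825}}
print(can_reuse(oldest, {42}, {409824, 409825}))   # True: safe to reuse
print(can_reuse(oldest, {17}, {409824}))           # False: txn 17 still active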

When a previously used cluster is reused, the cluster header is updated with a new cluster number and the current time. Since previously used clusters are already linked into the ring of allocated clusters, they do not need to be formatted or linked again.

Using a new cluster

A new cluster must be formatted by writing to each block in the entire cluster if it is allocated from free space in fixed-length extents or if it is allocated by expanding a bi file. In the first case, the space may contain old log records that are obsolete and that must not be processed during any subsequent crash recovery operations. In the second case, it must be formatted because on many systems, space is not allocated from the underlying filesystem until data are actually written to the file.

After a cluster has been formatted, it is linked into the ring of allocated clusters, time stamped, and assigned the next cluster number.

Before version 7.2E, when the oldest cluster could not be reused, four new clusters were formatted at a time. In version 7.2E and later, once the initial four clusters have been formatted, only one cluster at a time is formatted.

Expanding the Before-Image Log

Beginning with Version 8.0B, you can allocate and format space for the before-image log in advance of when it is needed. Using this mechanism, you can create a specified number of bi clusters without starting the database:

proutil dbname -C bigrow clustercount

where dbname is the name of your database and clustercount is the number of clusters to create, format, and link into the ring of clusters described above.

9. Truncating the Before-Image Log

Truncating the before-image log consists of going through recovery to ensure the database files are consistent and that there are no active transactions, and then discarding the contents of the before-image log. There are several reasons why one might want to do this. Among them are:

To avoid having to include the before-image log when doing a full off-line backup copy of a database. In version 7.3A and later the full backup utility does not copy the before-image log to the backup medium. If needed, the backup utility performs recovery on the database before making the backup copy. The backup copy is automatically made with a truncated before-image log even though the original is not truncated.

To enable after-imaging after a full backup. In this case, Progress automatically truncates the before-image log.

To change the before-image cluster size or block size.

To reduce the size of a before-image file that has become unusually large. This rarely occurs during normal operation because the log space is reused. Most of the time, the log reaches some stable size and does not grow further. The size is dependent on the frequency and types of transactions executed by the application.

Single file before-image logs

When a single file bi log is truncated, the database master block is updated to reflect the fact that it has been truncated, and the existing bi file is deleted, freeing the space that was allocated for it. Then a new, empty file (zero bytes long) is created.

The next time the database is updated, four clusters will be allocated and formatted, causing the file to be expanded.

Multi-extent before-image logs

When a multi-extent bi log is truncated, the database master block is updated to reflect this fact and the headers in all fixed-length before-image extents are marked. Any data previously stored in the extents is not removed or zeroed.

Any variable length extent is deleted and a new empty one is created.

The next time the database is updated, four clusters are allocated and formatted from the available before-image extents, using space from fixed-length extents and expanding any variable-length extent if needed.

Copyright 1997, Progress Software Corp., All Rights Reserved