This part covers what the recovery algorithm does during normal execution.
For this part, please review [Recovery](https://liuying-1.github.io/blog/2023/recover/).
Let’s talk about the algorithm: we need to support both REDO and UNDO. What do we log, and how do we log it?
What do we need to store in the log to do our accounting, that is, to keep track of what is happening in the system?
First of all, we need IDs for the records that we store in the log; we call them log sequence numbers, or just LSNs. The log is sequential, and LSNs are always increasing. We use this fact to compare sequence numbers and figure out what happened before what.
Then, for the data pages stored throughout the system, both in memory and on disk, we extend each page beyond the actual data with a pageLSN: the LSN of the most recent update to this page.
Moreover, the system keeps track of another special log sequence number: the flushedLSN. The log itself should live in stable storage, at least the largest part of it. But still, when we extend it, we first go through RAM. So the log is split into a part that lives on disk and a part (the log tail) that lives in RAM, and the flushedLSN is a pointer to where the split happens between the two.
When a crash happens, we lose the "Log tail", since it lives in RAM. This is not perfect: this part is lost, and it was our only information about the actions that happened during that period of time. => 1. We want to flush the log frequently, so that the time log entries spend in RAM is as short as possible. 2. We are not aiming for perfection here; some entries will inevitably be lost, but we want to lose as few as possible.
What does write-ahead logging mean? It means that whenever we write a page to disk, we must make sure that its pageLSN, the LSN of the most recent record modifying the page, is at or below what has been flushed. "Flushed" means that the log record corresponding to this modification is already in the on-disk part of the log.
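The WAL rule can be sketched in a few lines of Python. This is a minimal illustration with hypothetical names (`BufferManager`, `can_write_page`, etc. are not from the lecture); the point is simply that a page write first forces the log up to that page's pageLSN.

```python
# Minimal sketch of the WAL rule (all names are illustrative).
class BufferManager:
    def __init__(self):
        self.flushed_lsn = 0  # highest LSN already durable on disk
        self.log_tail = []    # LSNs of log records still in RAM

    def flush_log(self, up_to_lsn):
        # Move log-tail entries up to up_to_lsn to stable storage (simulated).
        self.log_tail = [lsn for lsn in self.log_tail if lsn > up_to_lsn]
        self.flushed_lsn = max(self.flushed_lsn, up_to_lsn)

    def can_write_page(self, page_lsn):
        # WAL rule: pageLSN <= flushedLSN before the page may go to disk.
        return page_lsn <= self.flushed_lsn

    def write_page(self, page_lsn):
        if not self.can_write_page(page_lsn):
            self.flush_log(page_lsn)  # force the log first
        # ... only now may the data page itself be written ...

bm = BufferManager()
bm.log_tail = [1, 2, 3]
bm.write_page(2)  # forces flush_log(2) first, so flushed_lsn becomes 2
```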
LogRecord fields
| prevLSN | XID | type | pageID | length | offset | old data | new data |
| :-----: | :-: | :--: | :----: | :----: | :----: | :------: | :------: |
prevLSN is a pointer to the previous log entry of the same transaction. In addition to the sequential view over all actions, we also store these linked lists that chain together the actions performed by a single transaction. So, when we undo a single transaction, we can focus on just that transaction instead of traversing the entire log: we have easy access to its actions.
type -> Possible log record types
| Update | Commit | Abort | End | Compensation Log Records (CLRs) |
| :----: | :----: | :---: | :-: | :-----------------------------: |
Update -> old data & new data
Commit -> signifies the start of commit
End -> Since a commit or abort also takes a certain amount of time, this signifies the end of the commit or abort.
Compensation Log Records (CLRs) -> they will be recording the undo actions
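The record layout above can be sketched as a small dataclass. This is purely illustrative (the class and field names are my own rendering of the table, not an actual implementation); note that the page-related fields only apply to Update records.

```python
# Sketch of the log record layout (illustrative, mirrors the table above).
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogRecord:
    lsn: int
    prev_lsn: Optional[int]  # previous record of the same Xact; None for its first
    xid: int
    type: str                # "UPDATE", "COMMIT", "ABORT", "END", "CLR"
    # The remaining fields are present only for UPDATE records:
    page_id: Optional[int] = None
    length: Optional[int] = None
    offset: Optional[int] = None
    old_data: Optional[bytes] = None
    new_data: Optional[bytes] = None

rec = LogRecord(lsn=10, prev_lsn=None, xid=1, type="UPDATE",
                page_id=7, length=3, offset=0,
                old_data=b"abc", new_data=b"xyz")
```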
The Transaction Table and the Dirty Page Table are data structures maintained in memory. They are the local state of the algorithm responsible for recovery, but they are completely lost on a crash, and we will use the log to reconstruct their state as of the moment of the crash.
Transaction Table (XT)
One entry per active transaction.
Contains XID, status (running/committed/aborted), and lastLSN.
| XID | Status | lastLSN |
| :-: | :----: | :-----: |
So, what is the lastLSN? It is the LSN of the last operation performed by the transaction: the entry point into the backwards-pointing linked list associated with that transaction.
Dirty Page Table (DPT)
One entry per dirty page in buffer pool.
Contains recLSN – the LSN of the log record which first caused the page to be dirty.
| PID | recLSN |
| :-: | :----: |
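The two tables can be sketched as plain dictionaries. This is a toy illustration (names and structure are my own, not the lecture's): every update moves the transaction's lastLSN forward, while a page's recLSN is set only the first time the page becomes dirty.

```python
# Toy sketch of the Transaction Table (XT) and Dirty Page Table (DPT).
xact_table = {}        # XID -> {"status": ..., "lastLSN": ...}
dirty_page_table = {}  # PID -> recLSN (LSN that first dirtied the page)

def on_update(xid, pid, lsn):
    # Every update advances the transaction's lastLSN...
    xact_table.setdefault(xid, {"status": "running", "lastLSN": None})
    xact_table[xid]["lastLSN"] = lsn
    # ...but recLSN is only recorded the *first* time the page gets dirty.
    dirty_page_table.setdefault(pid, lsn)

on_update(xid=1, pid=7, lsn=10)
on_update(xid=1, pid=7, lsn=20)
# lastLSN follows the newest update (20); recLSN keeps the oldest (10).
```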
What do we do during normal execution of a transaction?
Firstly, what is a transaction? From our point of view, a transaction is a series of READs and WRITEs followed by a Commit or Abort at the end. So far we have assumed that writes to disk take no time, as opposed to commits or aborts. In practice, we also need to handle the fact that writes take time.
Series of READs & WRITEs, followed by commit or abort.
STEAL, NO-FORCE buffer management, with Write-Ahead Logging.
We have the log, which sits in hypothetical stable storage. A log record contains all of these fields, but not every entry has all of them; for example, pageID, length, offset, old data, and new data are only present in Update entries. The other fields are always there.
Then, we have the database, where each page has been extended with a pageLSN pointing at the last modification to this page. We also have the master record.
flushedLSN indicates how the log is split between the part that is still in RAM and the part that lives in stable storage. The log tail also sits in RAM, as do the two data structures, the transaction table and the dirty page table.
Right now, we have to deal with the case where the RAM part is completely gone.
Before explaining how we handle crashes during normal operation, here are two extra log records. They can be seen as an optimization: what we want to avoid is recovery after a crash taking a long time.
That would happen if, after losing the transaction table and the dirty page table, we had to start from the beginning of time to reconstruct them. => So we want to push the point in time from which we must start reconstructing the two tables as late as possible, and that is why we take checkpoints of them and store them in the log as well.
We initiate a checkpoint, and it consists of the transaction table and the dirty page table. Initiating a checkpoint means starting to write these tables to the log; once we are done, we create the end_checkpoint record, which contains the state of the two tables (XT & DPT).
Then, when we recover, we can start from this state instead of from empty tables. This is called a fuzzy checkpoint, because writing these tables takes time, and that is where the fuzziness comes from.
The other transactions continue to run. What is stored in the end_checkpoint record is the snapshot as of the moment of the corresponding begin_checkpoint record.
This is the reliable information. Starting from that point (begin_checkpoint), we can rely on what is stored at the slightly later end_checkpoint, and we recreate the state of our tables from that earlier point onwards.
Since this is an optimization that tries to bring the starting point as close to the present as possible, it doesn't matter much that some operations happened between the two records.
Official Version
Periodically, the DBMS creates a checkpoint, to minimize the time taken to recover in the event of a system crash. Write to log:
begin_checkpoint record: Indicates when checkpoint began.
end_checkpoint record: Contains current Xact table and dirty page table. This is a fuzzy checkpoint:
Other Xacts continue to run; so these tables are accurate only as of the time of the begin_checkpoint record.
No attempt to force dirty pages to disk; effectiveness of checkpoint limited by oldest unwritten change to a dirty page, minDirtyPagesLSN.
We are not trying to force dirty pages to disk at any point. The checkpoint is just about recording these tables, nothing about the actual data; the actual data is handled by the background process.
Use background process to flush dirty pages to disk!
Store LSN of begin_checkpoint record in a safe place (master record). => This is the point from which we need to start our recovery.
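The checkpoint steps above can be sketched as follows. This is a toy model under my own assumptions (the log is a Python list, LSNs are list indices, and `take_checkpoint` is an illustrative name): snapshot the two tables at begin_checkpoint, write them into the end_checkpoint record, and keep the begin LSN as the master record.

```python
# Toy sketch of a fuzzy checkpoint (all names are illustrative).
import copy

log = []

def append(record):
    record["lsn"] = len(log)  # LSN = position in the log, for simplicity
    log.append(record)
    return record["lsn"]

def take_checkpoint(xact_table, dirty_page_table):
    begin_lsn = append({"type": "begin_checkpoint"})
    # Other transactions keep running here; the snapshot is only
    # accurate as of the begin_checkpoint record.
    xt_snap = copy.deepcopy(xact_table)
    dpt_snap = copy.deepcopy(dirty_page_table)
    append({"type": "end_checkpoint", "xt": xt_snap, "dpt": dpt_snap})
    master_record = begin_lsn  # stored in a well-known safe place
    return master_record

master = take_checkpoint({1: {"status": "running", "lastLSN": 5}}, {7: 3})
```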
What should happen when a transaction commits?
We write the commit record to the log, and we want this record to reach the disk as quickly as possible. Flushing it guarantees that flushedLSN will be greater than or equal to the lastLSN of the transaction that has committed; it makes sure the user can rely on the information about the commit having made it to stable storage. Only after we flush the log can Commit() return. Then the user gets the notification “Your transaction has been committed” and can rely on it being durable.
Here I did not say anything about the actual data written to disk, this happens in background process.
Then, we update the local state (XT and DPT) of the algorithm. After that, we write the end record to the log.
Official Version
- Write the commit record to the log.
- All log records up to the Xact's lastLSN are flushed, so that flushedLSN >= lastLSN.
- Commit() returns.
- Write the end record to the log. => We don't need to flush it immediately. The transaction is considered committed as soon as the commit record makes it to stable storage, which is the last entry that must be flushed.
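The commit steps above can be sketched like this. It is a toy model under my own assumptions (a list stands in for the log, a second list records forced flushes, and the function names are illustrative); the key ordering is: append COMMIT, force the log, only then acknowledge, and append END without forcing.

```python
# Toy sketch of the commit protocol (illustrative names).
records, flushed = [], []

def append_log(rec):
    records.append(rec)
    return len(records) - 1  # the record's LSN

def commit(xid, xact_table):
    last_lsn = append_log({"type": "COMMIT", "xid": xid})
    flushed.append(last_lsn)  # force the log: flushedLSN >= lastLSN
    # Commit() may return here: the commit record is durable.
    xact_table[xid]["status"] = "committed"
    append_log({"type": "END", "xid": xid})  # no immediate flush needed

xt = {1: {"status": "running"}}
commit(1, xt)
```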
What should happen when a transaction aborts? We want to UNDO the effects of the transaction so far. We essentially play the log backwards: UNDO the latest action of the transaction, then the second latest, and so on, until the very start.
So, where do we get the lastLSN of the transaction? We look it up in the transaction table.
Get the lastLSN of the Xact from the XT. We look it up to find what the latest update was. If it was a standard update, we UNDO it by taking the old data from the log record and writing it to memory; it is flushed to disk at some later time.
Once we've done that, we follow our linked list backwards, looking at the prevLSN of the log entry we've just UNDOne.
Follow the prevLSN field. Before we start the backwards traversal, we write an Abort log record. This is important in case a crash happens during the UNDO phase: we must remember that we wanted to abort this transaction.
First of all, we will be writing pages, and this is a running transaction. To perform UNDO, we must hold a lock on the data. But conveniently, we already hold the lock, because we are using S2PL: we only release locks once the transaction ends. => This would not work with C2PL, which releases locks gradually before the transaction finishes.
To perform UNDO, must have a lock on data!
Before restoring old value of a page, write a CLR log record during the UNDO:
You continue logging while you UNDO!!
A CLR has one extra field: undoNextLSN => another log sequence number.
Points to the next LSN to undo (i.e., the prevLSN of the record we’re currently undoing).
CLRs never Undone (but they might be Redone when repeating history: guarantees Atomicity!)
They are literally never undone: during the Undo phase, when we process the records to UNDO, we simply skip over CLRs. But they might need to be redone when we repeat history.
At the end of UNDO, write an end log record. => We have reached the first action of the transaction; for any first action performed by a transaction, the prevLSN is a NULL pointer rather than a concrete pointer.
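The whole abort walk can be sketched as a short loop. This is a toy model under my own assumptions (a dict keyed by LSN stands in for the log, only one transaction's updates are present, and names like `undoNextLSN` follow the notes): write ABORT, follow the prevLSN chain, emit a CLR before restoring each old value, and finish with END.

```python
# Toy sketch of the abort/UNDO walk (illustrative record layout).
log = {
    0: {"type": "UPDATE", "prevLSN": None, "pageID": 7, "oldData": "a"},
    1: {"type": "UPDATE", "prevLSN": 0, "pageID": 8, "oldData": "b"},
}
new_records, restored = [], []

def abort(xid, last_lsn):
    new_records.append({"type": "ABORT", "xid": xid})
    lsn = last_lsn
    while lsn is not None:  # None plays the role of the NULL pointer
        rec = log[lsn]
        if rec["type"] == "UPDATE":
            # Write the CLR *before* restoring the old value:
            # keep logging while you UNDO.
            new_records.append({"type": "CLR", "xid": xid,
                                "undoNextLSN": rec["prevLSN"]})
            restored.append((rec["pageID"], rec["oldData"]))
        lsn = rec["prevLSN"]  # follow the chain backwards
    new_records.append({"type": "END", "xid": xid})

abort(xid=1, last_lsn=1)  # in the real algorithm, lastLSN comes from the XT
```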
Solution: below is my sketch of the RAM, recording what we have in the data structures as we go.
Today was all about the normal execution of a transaction, including transaction commits and transaction aborts. In the next lecture, we will talk about Crash Recovery.