This part covers recovery and normal operation.
For this part, please review [Experi | Recovery](https://liuying-1.github.io/blog/2023/recover/).
Let’s talk about the algorithm: we need to support all REDOs and UNDOs. What do we log, and how do we log it?
What do we need to store in the log to do our accounting, that is, to account for what is happening in the system?
First of all, we need IDs for the records that we store in the log; we call them Log Sequence Numbers, or just LSNs. The log is sequential, and LSNs are always increasing. We will use this fact to compare sequence numbers and figure out what happened before what.
Then, when we look at the data pages stored throughout the system, both in memory and on disk, we extend each data page beyond the actual data with a `pageLSN`: the LSN of the most recent update to this page.
Moreover, the system keeps track of another special log sequence number: the `flushedLSN`. The log itself should live in stable storage, at least the largest part of it. But still, when we extend it, we first go through RAM. The log is thus split into a part that lives on disk and a part (the log tail) that lives in RAM, and the `flushedLSN` is a pointer to where the split between the two happens.
When a crash happens, we lose the log tail, since it lives in RAM. This is not perfect: this lost part was our only information about the actions happening in that period of time. => 1. We want to flush the log frequently, keeping the time that log entries live only in RAM as short as possible. 2. We are not aiming for perfection here; some information will be lost. However, we don’t want to lose the entries of committed transactions.
What does write-ahead logging mean? It means that whenever we write a page to disk, we must make sure that its `pageLSN` (the LSN of the most recent record modifying the page) is at most the `flushedLSN`. In other words, the log record corresponding to this modification must already be in the flushed, on-disk part of the log.
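The WAL rule above can be sketched in a few lines of Python. This is a minimal illustration, not any real system’s API; the `Log`/`Page` classes and `flush_page` are hypothetical names.

```python
# Minimal sketch of the write-ahead logging rule. LSN = index into the
# log's record list; flushedLSN marks the highest LSN on stable storage.

class Log:
    def __init__(self):
        self.records = []      # the whole log; tail beyond flushedLSN is "in RAM"
        self.flushedLSN = -1   # highest LSN already on stable storage

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # the new record's LSN

    def flush_to(self, lsn):
        # Writing records[:lsn + 1] to stable storage is elided here.
        self.flushedLSN = max(self.flushedLSN, lsn)

class Page:
    def __init__(self):
        self.pageLSN = -1   # LSN of the most recent update to this page

def flush_page(page, log):
    # WAL rule: the log record of the latest modification to this page
    # must be on stable storage before the page itself may go to disk.
    if page.pageLSN > log.flushedLSN:
        log.flush_to(page.pageLSN)
    # ... now it is safe to write the page to disk (elided).
```

The check `pageLSN <= flushedLSN` is the whole rule; forcing the log first is what makes the page write safe.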
LogRecord fields
| prevLSN | XID | type | pageID | length | offset | old data | new data |
| :-----: | :-: | :--: | :----: | :----: | :----: | :------: | :------: |
`prevLSN` is a pointer to the previous log entry of the same transaction. In addition to the sequential view of all actions, we also store these linked lists corresponding to the actions performed by a single transaction. So, when we undo the actions of a single transaction, we can focus on just that transaction instead of traversing the entire log: we have easy access to its actions.
`type` -> Possible log record types:

| Update | Commit | Abort | End | Compensation Log Records (CLRs) |
| :----: | :----: | :---: | :-: | :-----------------------------: |
- `Update` -> contains `old data` & `new data`.
- `Commit` -> signifies the start of a commit.
- `Abort` -> signifies the start of an abort.
- `End` -> since a `commit` or `abort` also takes a certain time, this signifies the end of the `commit` or `abort`.
- Compensation Log Records (CLRs) -> they record the undo actions.
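The record layout in the table above can be written down as a small dataclass. This is a hypothetical sketch, not a real system’s structure; field names follow the table, and only `Update` records fill the page-related fields.

```python
# Hypothetical sketch of the log record layout described in the table.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class RecordType(Enum):
    UPDATE = auto()
    COMMIT = auto()
    ABORT = auto()
    END = auto()
    CLR = auto()    # Compensation Log Record

@dataclass
class LogRecord:
    lsn: int
    prevLSN: Optional[int]   # previous record of the same transaction
    xid: int
    type: RecordType
    # Update-only fields:
    pageID: Optional[int] = None
    length: Optional[int] = None
    offset: Optional[int] = None
    old_data: Optional[bytes] = None
    new_data: Optional[bytes] = None
    # CLR-only field (discussed later):
    undoNextLSN: Optional[int] = None
```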
The Transaction Table and the Dirty Page Table are data structures maintained in memory. They are the local state of the algorithm responsible for recovery, but they are completely lost on a crash, and we will use the log to reconstruct their state as of the moment of the crash.
Transaction Table (XT)
One entry per active transaction.
Contains XID, status (running/committed/aborted), and lastLSN.
| XID | Status | lastLSN |
| :-: | :----: | :-----: |
So, what is the `lastLSN`? It is the LSN of the last operation performed by the transaction: the entry point to the backward-pointing linked list (via `prevLSN`) associated with that transaction.
Dirty Page Table (DPT)
One entry per dirty page in buffer pool.
Contains `recLSN`, the LSN of the log record which first caused the page to be dirty.
| PID | recLSN |
| :-: | :----: |
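A dict-based sketch of the two tables shows the key invariant: `lastLSN` advances on every update, while `recLSN` is set only the first time a page becomes dirty. The function name `on_update` is illustrative, not from any real system.

```python
# Sketch of the two in-memory tables maintained during normal execution.

xact_table = {}        # XID -> {"status": ..., "lastLSN": ...}
dirty_page_table = {}  # pageID -> recLSN (first LSN that dirtied the page)

def on_update(xid, page_id, lsn):
    # Every update advances the transaction's lastLSN ...
    entry = xact_table.setdefault(xid, {"status": "running", "lastLSN": None})
    entry["lastLSN"] = lsn
    # ... but recLSN is only recorded the *first* time the page gets dirty.
    dirty_page_table.setdefault(page_id, lsn)
```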
What do we do during normal execution of a transaction?
Firstly, what is a transaction? From our point of view, a transaction is a series of `READ`s and `WRITE`s followed by a `Commit` or `Abort` at the end. We have assumed that writes to disk have no duration, as opposed to commits or aborts; in practice, we need to handle the fact that writes also take time.
- Series of `READ`s & `WRITE`s, followed by `commit` or `abort`.
- `STEAL`, `NO-FORCE` buffer management, with Write-Ahead Logging.

We have the log that sits in hypothetical stable storage. Each log record contains the fields above, but not every field appears in every entry: for example, `pageID`, `length`, `offset`, `old data`, and `new data` exist only in `Update` entries, while the other fields are always there.
Then, we have the database, and the data pages have been extended with a `pageLSN` pointer => the last modification to this page. We also have the master record.
`flushedLSN` indicates how the log is split between the part still in RAM and the part that lives in stable storage. The log tail is also in RAM. The two data structures, the transaction table and the dirty page table, sit in memory as well.
On a crash, we have to deal with the case where the RAM part is completely gone.
Before I tell you how we handle crashes during normal operation, here are two extra log records. They can be seen as an optimization: what we want to avoid is that the recovery after a crash takes a long time.
A long recovery happens because we lose the transaction table and the dirty page table. If we needed to start from the beginning of time to reconstruct them, that would be a problem. => So we want to make the point in time from which we need to start reconstructing the transaction table and the dirty page table as late as possible. That is why we take checkpoints of those tables and store them also in the log.
We initiate a checkpoint, and it consists of the transaction table and the dirty page table. Initiating a checkpoint means we start writing these things to the log, and once we are done, we create the end-checkpoint record, which contains the state of the two tables (XT & DPT).
Then, when we recover, we can start from this state as opposed to empty tables. This is called a fuzzy checkpoint, because writing these tables takes time; the fuzziness is the point.
The other transactions continue to run. What is stored in the `end_checkpoint` record is the snapshot at the moment of the corresponding `begin_checkpoint` record.
This is the reliable information. From that point (`begin_checkpoint`), we can rely on what is stored at the slightly later point in time, and we will recreate the state of our tables starting from the earlier point.
Since this is an optimization trying to bring the starting point as close to the present as possible, it doesn’t matter that much that certain operations happened between the two records.
Official Version
Periodically, the DBMS creates a checkpoint, to minimize the time taken to recover in the event of a system crash. Write to log:

- `begin_checkpoint` record: indicates when the checkpoint began.
- `end_checkpoint` record: contains the current Xact table and dirty page table. This is a fuzzy checkpoint:
  - Other Xacts continue to run, so these tables are accurate only as of the time of the `begin_checkpoint` record.
  - No attempt to force dirty pages to disk; the effectiveness of a checkpoint is limited by the oldest unwritten change to a dirty page, `minDirtyPagesLSN`.
We are not trying to force dirty pages to disk at any point. The checkpoint is just about recording these tables, nothing about the actual data; the actual data is handled by a background process.
- Use a background process to flush dirty pages to disk!
- Store the LSN of the `begin_checkpoint` record in a safe place (master record). => This is the point from which we need to start our recovery.
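The checkpoint steps above can be sketched as follows. This is an illustrative sketch under the simplifying assumption that the log is a plain Python list (LSN = index) and `master_record` is a dict standing in for the safe place on disk.

```python
# Sketch of a fuzzy checkpoint: snapshot XT and DPT at begin_checkpoint,
# store the snapshot in end_checkpoint, record begin's LSN in the master
# record. Transactions keep running in between in a real system.
import copy

def take_checkpoint(log, xact_table, dirty_page_table, master_record):
    begin_lsn = len(log)
    log.append({"type": "begin_checkpoint"})
    # The snapshot reflects the tables as of begin_checkpoint.
    snapshot = (copy.deepcopy(xact_table), copy.deepcopy(dirty_page_table))
    # ... other transactions may append log records here ...
    log.append({"type": "end_checkpoint", "tables": snapshot})
    # Recovery will start from this LSN, read from the master record.
    master_record["begin_checkpoint_LSN"] = begin_lsn
```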
What should happen when the transaction commits?
We write a `commit` record to the log, and we want to get it to disk as quickly as possible: we flush the log up to and including the `commit` record. This guarantees that `flushedLSN` will be greater than or equal to the `lastLSN` of the committing transaction, making sure that the information about the commit the user relies on has made it to stable storage. After we flush the log, `Commit()` can return. Then the user gets the notification “Your transaction has been committed”, and can rely on it being durable.

Here I did not say anything about the actual data being written to disk; that happens in a background process.

Then, we need to update the local state (XT and DPT) of the algorithm. After that, we write the `end` record to the log.
Official Version
- Write the `commit` record to the log.
- All log records up to the Xact’s `lastLSN` are flushed: `flushedLSN >= lastLSN`.
- `Commit()` returns.
- Write the `end` record to the log. => We don’t need to flush it immediately.

The transaction is considered committed as soon as the commit record makes it to stable storage, i.e., as soon as it is among the flushed entries.
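The four commit steps can be sketched in code. This is a hedged illustration: the log is a plain list (LSN = index), and `flush_log` stands in for forcing the log tail to stable storage.

```python
# Sketch of the commit protocol listed above.

def flush_log(log, up_to):
    # Writing log[:up_to + 1] to stable storage is elided here;
    # return the new flushedLSN.
    return up_to

def commit(xid, log, xact_table):
    # 1. Write the commit record.
    lsn = len(log)
    log.append({"lsn": lsn, "xid": xid, "type": "commit"})
    # 2. Flush the log up to (and including) the commit record.
    flushedLSN = flush_log(log, up_to=lsn)
    assert flushedLSN >= lsn          # flushedLSN >= lastLSN now holds
    # 3. Commit() may return to the user here: the commit is durable.
    xact_table[xid]["status"] = "committed"
    # 4. Write the end record; it does not need an immediate flush.
    log.append({"lsn": len(log), "xid": xid, "type": "end"})
```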
Now consider a transaction abort. What we want to do is UNDO the effects of the transaction so far. We essentially play the log backwards, undoing the latest action of the transaction, then the second-to-last, and so on, until the very start.
So, where do we get the `lastLSN` of the transaction? We look it up in the transaction table.
Get the `lastLSN` of the Xact from the XT. We look it up to find the latest update. If it was a standard `update`, we UNDO it by taking the old data from the log record and writing it to memory; it gets flushed to disk at some later point. Once we’ve done that, we follow our linked list backwards by looking at the `prevLSN` field of the last log entry we’ve undone. Before we start this backwards traversal, we write an `Abort` log record. This is important if we crash during the UNDO phase: we remember that we wanted to abort this transaction.
First of all, we will be writing pages, and this is a running transaction. To perform UNDO, we must have a lock on the data. But magically, we already have that lock, because we are using S2PL: we only release locks once the transaction ends. => This would not work with C2PL, which releases locks gradually.
To perform UNDO, must have a lock on data!
Before restoring old value of a page, write a CLR log record during the UNDO:
You continue logging while you UNDO!!
The CLR has one extra field: `undoNextLSN` => another log sequence number. It points to the next LSN to undo (i.e., the `prevLSN` of the record we’re currently undoing).
CLRs are never undone (but they might be redone when repeating history: this guarantees atomicity!). During the Undo phase, when we encounter CLR records, we simply skip over them. But they might need to be redone when we replay the log.
At the end of UNDO, write an `end` log record. => We have reached the first action of the transaction; the first action of any transaction carries a `NULL` pointer instead of a concrete pointer in its `prevLSN`.
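The whole abort procedure can be sketched as a backward walk over the `prevLSN` chain, writing a CLR before restoring each old value. This is an illustration under the assumptions above: the log is a list (LSN = index), `pages` maps pageID to its current value, and the `abort` function is a hypothetical name.

```python
# Sketch of a simple transaction abort with CLRs.

def abort(xid, log, xact_table, pages):
    # Write the Abort record first, so a crash mid-undo remembers our intent.
    lsn = xact_table[xid]["lastLSN"]
    log.append({"xid": xid, "type": "abort", "prevLSN": lsn})
    # Walk the backward chain; the first action's prevLSN is None (NULL).
    while lsn is not None:
        rec = log[lsn]
        if rec["type"] == "update":
            # CLR before restoring: undoNextLSN = next record still to undo.
            log.append({"xid": xid, "type": "CLR",
                        "undoNextLSN": rec["prevLSN"]})
            pages[rec["pageID"]] = rec["old_data"]   # restore the old value
        lsn = rec["prevLSN"]
    # All actions undone: write the end record.
    log.append({"xid": xid, "type": "end"})
```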
Solution: below is my sketch of RAM, recording what we have in the data structures as we go.
Today was all about the normal execution of a transaction, including transaction commits and transaction aborts. In the next lecture, we will talk about Crash Recovery.