Here are the chapters of Experiments and Recovery.
Lecture 3 and Lecture 4 were all about achieving isolation, the I in ACID.
—> The focus now is on atomicity and durability.
Three main techniques for performance measurement and modeling.
READ and WRITE? What about page replacement? It is hidden behind READs and WRITEs, so the user does not need to think about the existence of the RAM or the disk. Users simply get a flat address space, say one that can address as much memory as there is on disk, and that’s it: from the outside, the abstraction consists of READs and WRITEs. This gives us a certain kind of atomicity, in the sense that we can consider the entire system as one that executes READs and WRITEs. READs and WRITEs are small operations; in the design that we did they are not isolated, and they are certainly not atomic and not durable in the sense of before-or-after atomicity and All-or-Nothing atomicity. So, following the previous example, we can introduce a new kind of question: given such a system, discuss its performance. The previous example was about latency.
However, let’s start with the simple model without any refinement.
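Spelled out (this is the standard two-level average-latency formula; the notation is mine, not copied from the lecture):

\[Latency_{avg} = HitRatio \cdot Latency_{RAM} + (1 - HitRatio) \cdot Latency_{disk}\]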
We want to obtain the HitRatio by measuring it. So we simulate a particular component of the system, say the virtual memory with its paging component, and we estimate certain known parameters such as the latency of the main memory and the latency of the disk; those are easier to get from specifications than the HitRatio, which is certainly workload-specific. Then we try different workloads and measure the HitRatio on our partial implementation that focuses on this particular aspect of the system. That’s a simulation.
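As a concrete illustration, here is a minimal simulation sketch in Python: an LRU buffer pool replayed over synthetic page-access traces. The trace shapes, frame count, and latency numbers are illustrative assumptions, not values from the lecture.

from collections import OrderedDict
import random

def simulate_hit_ratio(trace, num_frames):
    # Replay a page-access trace against an LRU buffer pool and count hits.
    pool = OrderedDict()
    hits = 0
    for page in trace:
        if page in pool:
            hits += 1
            pool.move_to_end(page)          # mark as most recently used
        else:
            if len(pool) >= num_frames:
                pool.popitem(last=False)    # evict the least recently used page
            pool[page] = None
    return hits / len(trace)

random.seed(0)
workloads = {
    "uniform": [random.randrange(1000) for _ in range(100000)],
    "skewed":  [random.randrange(50) if random.random() < 0.9
                else random.randrange(1000) for _ in range(100000)],
}
for name, trace in workloads.items():
    h = simulate_hit_ratio(trace, num_frames=100)
    avg = h * 100e-9 + (1 - h) * 10e-3      # latency model, assuming 100 ns RAM and 10 ms disk
    print(f"{name}: HitRatio = {h:.3f}, predicted avg latency = {avg * 1e3:.3f} ms")

Plugging the measured HitRatio back into the latency model above is exactly the point of the simulation: the hard, workload-specific parameter comes from the simulation, the easy ones from specifications.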
Why do we consider such things? It’s a way to estimate difficult parameters, and it may be easier to do this rather than performing a full-blown implementation of the entire system and measuring the actual HitRatio. => If you want to implement the whole system and measure the actual value, that is the so-called Experiment.
Pros
So, these are some advantages of simulations over experiments.
Cons
And remember not to mix them up: the pros and cons stay roughly the same, but they are relative, i.e., they describe how simulations compare to experiments.
This is how we can estimate parameters. How can we choose workloads?
And this method is perhaps the most common technique in practice.
How to set up certain experiments?
Picking the parameters for running the experiments is very difficult, and we should already come with some hypothesis to guide us.
Hint: Think about your measurement program!
for (i = 0; i < file.len; i++)
    read(f, i);   /* one sequential forward pass over file f */
We already came up with a very simple experiment with three different programs: reading forwards, reading backwards from the end, and reading at random positions. We can vary the file size, and we can measure different things (below). This gives us a multi-dimensional space, so maybe reading left to right works best; that is what we try to find out with these experiments. (A sketch of the three programs follows below.)
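A sketch of those three measurement programs in Python (the file name and block size are placeholders; a real experiment would also need to control for the OS page cache):

import os, random, time

def time_reads(path, pattern, block=4096):
    # One pass over the file, touching every block once in the given order.
    size = os.path.getsize(path)
    offsets = list(range(0, size, block))
    if pattern == "backward":
        offsets.reverse()
    elif pattern == "random":
        random.shuffle(offsets)
    start = time.perf_counter()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(block)
    return time.perf_counter() - start

for pattern in ("forward", "backward", "random"):
    print(pattern, time_reads("testfile.bin", pattern))   # also vary the file size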
What would we measure?
When executing the experiments, select your data and workloads carefully. For example, when you are counting things, be careful what you’re counting. Be careful not to introduce overhead by running the experiments themselves; that can also lead you to wrong conclusions, obviously.
Think about how you measure things. For example, with sampling you run the system and take measurements periodically at different points in time, or you let the system log certain things with a monitor, and you decide what it logs. Always be careful when interpreting statistics, especially when other people tell you the conclusions they have drawn from statistics. Don’t be tricked by one-sided arguments.
Two systems, with throughput-oriented measurements $R_2$ and $R_1$
Speedup
Relative change
Example statements
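For completeness, the standard definitions for throughput-style metrics where larger is better (the lecture’s exact wording may differ):

\[Speedup = \frac{R_2}{R_1}, \qquad Relative\ change = \frac{R_2 - R_1}{R_1}\]

For example, with hypothetical numbers $R_1 = 100$ tx/s and $R_2 = 125$ tx/s, the speedup is $1.25$ and the relative change is $0.25$: “system 2 is 1.25 times as fast as system 1”, or “system 2 is 25% faster”.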
The next topic is the ACID properties, and here are the different notions of atomicity that we have discussed before. In Lecture 3 and Lecture 4 about Concurrency Control, we talked about before-or-after atomicity (== Isolation). Right now, for recovery, we focus on All-or-Nothing atomicity (== A: Atomicity + D: Durability).
This is what we want to achieve. And what would be the problems?
Atomicity => Transactions may abort (“Rollback”)
Durability => What if the system stops running? (Causes?) The system can stop running for whatever reason.
This is something that could happen: $T1$, $T2$, and $T3$ have successfully committed, so these transactions have definitely been executed. $T4$ and $T5$ were still running at the point when the crash happened, even though they might already have performed certain WRITEs on the database. Since they modified the state of the database, these transactions should be undone; we do not want to see any effects or modifications performed by them in our database. Hence, the transactions that were running at the moment of the crash should be aborted.
Official Statement:
What are our assumptions?
We assume that we have isolation, i.e., a working concurrency control; for simplicity, S2PL is in effect rather than an optimistic algorithm. => The isolation property is ensured.
READ / WRITE interface. => We assume the memory abstraction, so we use the READ / WRITE interface to access memory.

Failures are always considered to be fail-stop, meaning that some part of the system completely breaks. It’s not that one bit was flipped; we will not consider such failure modes. Rather, the entire system stops, or maybe needs to restart, so all the information that was stored in the system may be gone. We will then talk about which parts are gone.
To talk about which parts disappear after a crash and which parts survive, we need to distinguish the different kinds of storage that we have in such a system.
Stable storage survives all crashes. It is impossible to implement perfectly, or possible only with a lot of redundancy.
Now the central question is: when we have such a crash and the volatile memory is lost, how can we restore it, i.e., restore the information that we need to have stored there?
And the two basic strategies are called force and steal.
Force means that everything that is written to this buffer also gets written to the disk. The downside is, of course, that we may need to wait for the WRITEs, as we already remarked when discussing the two-level virtual memory abstraction. On the other hand, we get durability across crashes, because everything is written to the disk.
However, because we don’t want to wait for every WRITE to be propagated to the disk (as with the two-level memory abstraction, we want the latency to be close to that of RAM), we want to lean towards the No Force side.
And now, given that, there will be quite a lot of contention for memory, because everybody will want to write to the fast medium, and things will then live in memory.
The different transactions compete for space in the Buffer-Pool. And the question is: if there is too little space, what should happen? Can a transaction that wants to write something steal a frame or page from another transaction that may not even be committed yet?
So if we don’t steal pages, our memory becomes the critical resource. Transactions will block if they want to access a page but cannot get a frame in memory. This impacts our throughput, which means we want to lean towards the Steal side, where a transaction is allowed to steal a page that belongs to a potentially uncommitted transaction.
In the “No Steal” and “Force” corner, atomicity and durability are easy to ensure. However, in the “Desired” corner, both questions become open.
When do you need to UNDO changes? => STEAL
So for this problem, we need to consider what happens to the disk.
If we steal frames or pages, we might end up with pages of uncommitted transactions written to the disk, and there is always the danger that such a transaction gets aborted, for example when a crash occurs, so we might need to UNDO. So, on the Steal side of things, we need UNDO.
When do you need to REDO changes?
With No Force, we might have committed transactions that did not write to the disk immediately. => We might have committed transactions whose modifications only live in memory. If a crash occurs at that point, the changes that lived in memory are lost, so we need to REDO those actions. So whenever we don’t force things, we need to REDO things.
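Putting the two observations together, the matrix (filled in as described above) looks like this:

              No Steal             Steal
Force         no UNDO, no REDO     UNDO, no REDO
No Force      no UNDO, REDO        UNDO, REDO  (Desired)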
STEAL is essentially the reason why enforcing atomicity is hard.
Suppose the frame or page that we are about to steal has already been written to the disk, and in particular some transaction still holds the lock on this page. Now we need to consider what happens if this transaction aborts.
We need to remember the old value that was stored on this page at the moment of the steal to UNDO the write that we just performed to this disk.
This is just the same case as what we mentioned when filling in the matrix. All of this enforces that we cannot have partially executed transactions and then still see their effects.
NO FORCE (Why enforcing Durability is hard) => We are not necessarily writing modifications by committed transactions immediately to disk.
In the case of NO FORCE, we need to REDO things. In the case of STEAL, we need to UNDO things. So how do we support the desired, hardest corner?
To approach this corner, we can use logging: we log all the operations that we perform on the database. For every update, we record information that is relevant for both REDOing and UNDOing the action.
We store these records in the so-called log. We write sequentially to the log; that determines the order of things. We have S2PL, so the actions are ordered anyway.
We also put it on a separate disk if possible, hopefully on stable storage, since we don’t want to lose the log.
We try to write only the minimal information required to replay our changes; we do not want to duplicate the database in the log. In particular, we compress things so that multiple updates fit on a single log page. So whenever we write to the log, a single log write effectively covers many more updates than an actual data write to disk would.
Record REDO and UNDO information, for every update, in a log.
Thus, what is a log?
The log can be thought of as an abstraction of the sequence of REDO and UNDO actions. There are different standard approaches to logging: Logical and Physical Logging. An example physical log record contains:

\[<XID,\ pageID,\ offset,\ length,\ old\ data,\ new\ data>\]

There are also compromises: for certain parts you log physically, i.e., you store both the old and the new value, while for other parts, or other types of databases, you store some logical representation of the difference that you can then both REDO and UNDO. => A good compromise is physiological logging.
But for simplicity, we only consider physical logging. => UNDO is then just writing the old data back, and REDO is just writing the new data again.
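A minimal sketch of this in Python, mirroring the record format above (the field and function names are mine; UNDO writes the before-image, REDO the after-image):

from dataclasses import dataclass

@dataclass
class LogRecord:
    xid: int         # transaction id
    page_id: int     # page that was modified
    offset: int      # byte offset within the page
    length: int      # number of bytes changed
    old_data: bytes  # before-image, used by UNDO
    new_data: bytes  # after-image, used by REDO

def undo(page: bytearray, rec: LogRecord) -> None:
    page[rec.offset:rec.offset + rec.length] = rec.old_data   # write the old data back

def redo(page: bytearray, rec: LogRecord) -> None:
    page[rec.offset:rec.offset + rec.length] = rec.new_data   # write the new data again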
The fundamental principle that makes it work is called Write-Ahead Logging.
Golden Rule: never modify the only copy of your data. You always have a log, and you modify (append to) the log before you actually modify the data.
WAL consists of two principles:
1. Force the log record for an update before the corresponding data page gets written to disk. (Makes UNDO possible => atomicity under STEAL.)
2. Force all log records of a transaction before it commits. (Makes REDO possible => durability under NO FORCE.)
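A toy write path illustrating both principles, reusing the LogRecord/redo sketch from above (flush_log stands in for forcing the log tail to stable storage; in this sketch 'pages' plays the role of the disk directly, with no separate buffer pool):

class SimpleWAL:
    def __init__(self):
        self.log = []           # in reality: a sequential file on stable storage
        self.flushed_upto = 0   # records before this index are assumed durable

    def flush_log(self):
        self.flushed_upto = len(self.log)    # stand-in for forcing the log tail

    def update(self, pages, rec):
        self.log.append(rec)                 # principle 1: append the log record...
        self.flush_log()                     # ...and force it before the page write
        redo(pages[rec.page_id], rec)        # only then modify the data page

    def commit(self, xid):
        self.flush_log()                     # principle 2: all records durable before commit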
Exactly how is logging (and recovery!) done? We will study the ARIES algorithms.
What is stored, and how far do we need to go back when we recover? In the picture below, one data structure is volatile and therefore lost in a crash: the buffer pool.
When we recover from a crash, we have lost the volatile memory (the buffer pool). What we want to do is recover the state from the current state of the database on disk plus the log that we have here.
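To make that concrete, here is a drastically simplified recovery sketch, again reusing the LogRecord/undo/redo sketch from above. This is not ARIES itself (no LSNs, no checkpoints, no compensation log records), just the basic REDO-then-UNDO idea:

def recover(pages, log, committed_xids):
    # committed_xids would be collected from commit records in an analysis pass.
    # Phase 1: REDO history in log order, because committed changes may never
    # have reached the disk (No Force).
    for rec in log:
        redo(pages[rec.page_id], rec)
    # Phase 2: UNDO loser transactions in reverse log order, because
    # uncommitted changes may already be on disk (Steal).
    for rec in reversed(log):
        if rec.xid not in committed_xids:
            undo(pages[rec.page_id], rec)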