mirror of
https://github.com/adulau/aha.git
synced 2024-12-27 03:06:10 +00:00
ring-buffer: add design document
This adds the design document for the ring buffer and also explains how it is designed to have lockless writes. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
parent
77ae365eca
commit
8b2c70d1e4
1 changed files with 955 additions and 0 deletions
955
Documentation/trace/ring-buffer-design.txt
Normal file
|
@ -0,0 +1,955 @@
|
|||
Lockless Ring Buffer Design
===========================

Copyright 2009 Red Hat Inc.
   Author:   Steven Rostedt <srostedt@redhat.com>
  License:   The GNU Free Documentation License, Version 1.2
               (dual licensed under the GPL v2)
Reviewers:   Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
             and Frederic Weisbecker.


Written for: 2.6.31
|
||||
|
||||
Terminology used in this Document
---------------------------------

tail - where new writes happen in the ring buffer.

head - where new reads happen in the ring buffer.

producer - the task that writes into the ring buffer (same as writer)

writer - same as producer

consumer - the task that reads from the buffer (same as reader)

reader - same as consumer.

reader_page - A page outside the ring buffer used solely (for the most part)
    by the reader.

head_page - a pointer to the page that the reader will use next

tail_page - a pointer to the page that will be written to next

commit_page - a pointer to the page with the last finished non-nested write.

cmpxchg - hardware-assisted atomic transaction that performs the following:

   A = B iff previous A == C

   R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
      current A is equal to C, and we put the old (current) A into R

   R gets the previous A regardless of whether A is updated with B or not.

   To see if the update was successful a compare of R == C may be used.
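
The cmpxchg semantics described above can be sketched in portable C11.
This is only an illustration and not the kernel's cmpxchg(): the name
cmpxchg_sketch() is made up here, and it leans on the C11 call
atomic_compare_exchange_strong(), which reports the previous value
through its "expected" argument.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Sketch of R = cmpxchg(A, C, B): store b into *a only if *a == c,
 * and return the value *a held before the call in either case.
 * (cmpxchg_sketch is a hypothetical name used for illustration.)
 */
static unsigned long cmpxchg_sketch(atomic_ulong *a, unsigned long c,
                                    unsigned long b)
{
    unsigned long expected = c;

    /* On failure, C11 writes the current value of *a into 'expected'. */
    atomic_compare_exchange_strong(a, &expected, b);

    /* Equal to c when the update happened, the other value otherwise. */
    return expected;
}
```

A caller checks for success the same way the text does, by comparing
the returned value with C.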
|
||||
|
||||
The Generic Ring Buffer
-----------------------

The ring buffer can be used in either an overwrite mode or in
producer/consumer mode.

Producer/consumer mode is where, if the producer were to fill up the
buffer before the consumer could free up anything, the producer
will stop writing to the buffer. This will lose the most recent events.

Overwrite mode is where, if the producer were to fill up the buffer
before the consumer could free up anything, the producer will
overwrite the older data. This will lose the oldest events.
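
The difference between the two modes can be seen in a toy,
single-threaded sketch. It is not how the kernel ring buffer is built
(that one is page based and lockless); the array layout, RING_SLOTS and
the ring_* names are assumptions made only to show which event is lost
when the buffer is full.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SLOTS 4   /* hypothetical capacity, counted in events */

struct ring_sketch {
    int slot[RING_SLOTS];
    unsigned head;      /* index of the oldest unread event */
    unsigned count;     /* number of unread events */
    bool overwrite;     /* true: overwrite mode, false: producer/consumer */
};

/* Returns false when the new event had to be dropped. */
static bool ring_write(struct ring_sketch *r, int event)
{
    assert(r->count <= RING_SLOTS);

    if (r->count == RING_SLOTS) {
        if (!r->overwrite)
            return false;                      /* newest event is lost */
        r->head = (r->head + 1) % RING_SLOTS;  /* oldest event is lost */
        r->count--;
    }
    r->slot[(r->head + r->count) % RING_SLOTS] = event;
    r->count++;
    return true;
}

/* Returns false when there is nothing to read. */
static bool ring_read(struct ring_sketch *r, int *event)
{
    if (r->count == 0)
        return false;
    *event = r->slot[r->head];
    r->head = (r->head + 1) % RING_SLOTS;
    r->count--;
    return true;
}
```

Writing five events into a four-slot buffer drops the fifth event in
producer/consumer mode and the first event in overwrite mode.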
|
||||
|
||||
No two writers can write at the same time (on the same per-cpu buffer).
A writer may interrupt another writer, but it must finish writing
before the previous writer may continue. This is very important to the
algorithm. The writers act like a "stack". The way interrupts work
enforces this behavior.


  writer1 start
     <preempted> writer2 start
         <preempted> writer3 start
                     writer3 finishes
                 writer2 finishes
  writer1 finishes

This is very much like a writer being preempted by an interrupt and
the interrupt doing a write as well.

Readers can happen at any time. But no two readers may run at the
same time, nor can a reader preempt/interrupt another reader. A reader
cannot preempt/interrupt a writer, but it may read/consume from the
buffer at the same time as a writer is writing; the reader must be
on another processor to do so. A reader may read on its own processor
and can be preempted by a writer.

A writer can preempt a reader, but a reader cannot preempt a writer.
But a reader can read the buffer at the same time (on another processor)
as a writer.

The ring buffer is made up of a list of pages held together by a linked list.

At initialization a reader page is allocated for the reader that is not
part of the ring buffer.

The head_page, tail_page and commit_page are all initialized to point
to the same page.

The reader page is initialized to have its next pointer pointing to
the head page, and its previous pointer pointing to a page before
the head page.

The reader has its own page to use. At start up time, this page is
allocated but is not attached to the list. When the reader wants
to read from the buffer, if its page is empty (like it is on start up)
it will swap its page with the head_page. The old reader page will
become part of the ring buffer and the head_page will be removed.
The page after the inserted page (old reader_page) will become the
new head page.

Once the new page is given to the reader, the reader could do what
it wants with it, as long as a writer has left that page.

A sample of how the reader page is swapped: Note this does not
show the head page in the buffer, it is for demonstrating a swap
only.
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |
|
||||
+------+
|
||||
+---+ +---+ +---+
|
||||
| |-->| |-->| |
|
||||
| |<--| |<--| |
|
||||
+---+ +---+ +---+
|
||||
^ | ^ |
|
||||
| +-------------+ |
|
||||
+-----------------+
|
||||
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |-------------------+
|
||||
+------+ v
|
||||
| +---+ +---+ +---+
|
||||
| | |-->| |-->| |
|
||||
| | |<--| |<--| |<-+
|
||||
| +---+ +---+ +---+ |
|
||||
| ^ | ^ | |
|
||||
| | +-------------+ | |
|
||||
| +-----------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |-------------------+
|
||||
+------+ <---------------+ v
|
||||
| ^ +---+ +---+ +---+
|
||||
| | | |-->| |-->| |
|
||||
| | | | | |<--| |<-+
|
||||
| | +---+ +---+ +---+ |
|
||||
| | | ^ | |
|
||||
| | +-------------+ | |
|
||||
| +-----------------------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
+------+
|
||||
|buffer| RING BUFFER
|
||||
|page |-------------------+
|
||||
+------+ <---------------+ v
|
||||
| ^ +---+ +---+ +---+
|
||||
| | | | | |-->| |
|
||||
| | New | | | |<--| |<-+
|
||||
| | Reader +---+ +---+ +---+ |
|
||||
| | page ----^ | |
|
||||
| | | |
|
||||
| +-----------------------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
|
||||
|
||||
It is possible that the page swapped is the commit page and the tail page,
|
||||
if what is in the ring buffer is less than what is held in a buffer page.
|
||||
|
||||
|
||||
reader page commit page tail page
|
||||
| | |
|
||||
v | |
|
||||
+---+ | |
|
||||
| |<----------+ |
|
||||
| |<------------------------+
|
||||
| |------+
|
||||
+---+ |
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
This case is still valid for this algorithm.
|
||||
When the writer leaves the page, it simply goes into the ring buffer
|
||||
since the reader page still points to the next location in the ring
|
||||
buffer.
|
||||
|
||||
|
||||
The main pointers:
|
||||
|
||||
reader page - The page used solely by the reader and is not part
|
||||
of the ring buffer (may be swapped in)
|
||||
|
||||
head page - the next page in the ring buffer that will be swapped
|
||||
with the reader page.
|
||||
|
||||
tail page - the page where the next write will take place.
|
||||
|
||||
commit page - the page that last finished a write.
|
||||
|
||||
The commit page is only updated by the outermost writer in the
writer stack. A writer that preempts another writer will not move the
commit page.

When data is written into the ring buffer, a position is reserved
in the ring buffer and passed back to the writer. When the writer
is finished writing data into that position, it commits the write.

Another write (or a read) may take place at any time during this
transaction. If another write happens it must finish before continuing
with the previous write.
|
||||
|
||||
|
||||
Write reserve:
|
||||
|
||||
Buffer page
|
||||
+---------+
|
||||
|written |
|
||||
+---------+ <--- given back to writer (current commit)
|
||||
|reserved |
|
||||
+---------+ <--- tail pointer
|
||||
| empty |
|
||||
+---------+
|
||||
|
||||
Write commit:
|
||||
|
||||
Buffer page
|
||||
+---------+
|
||||
|written |
|
||||
+---------+
|
||||
|written |
|
||||
+---------+ <--- next position for write (current commit)
|
||||
| empty |
|
||||
+---------+
|
||||
|
||||
|
||||
If a write happens after the first reserve:
|
||||
|
||||
Buffer page
|
||||
+---------+
|
||||
|written |
|
||||
+---------+ <-- current commit
|
||||
|reserved |
|
||||
+---------+ <--- given back to second writer
|
||||
|reserved |
|
||||
+---------+ <--- tail pointer
|
||||
|
||||
After second writer commits:
|
||||
|
||||
|
||||
Buffer page
|
||||
+---------+
|
||||
|written |
|
||||
+---------+ <--(last full commit)
|
||||
|reserved |
|
||||
+---------+
|
||||
|pending |
|
||||
|commit |
|
||||
+---------+ <--- tail pointer
|
||||
|
||||
When the first writer commits:
|
||||
|
||||
Buffer page
|
||||
+---------+
|
||||
|written |
|
||||
+---------+
|
||||
|written |
|
||||
+---------+
|
||||
|written |
|
||||
+---------+ <--(last full commit and tail pointer)
|
||||
|
||||
|
||||
The commit pointer points to the last write location that was
|
||||
committed without preempting another write. When a write that
|
||||
preempted another write is committed, it only becomes a pending commit
|
||||
and will not be a full commit till all writes have been committed.
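
The reserve/commit rule above can be sketched as a single-threaded
simulation. The nesting counter, the field names and the rb_* function
names are assumptions made for illustration; real code would need atomic
updates, because a nested writer can interrupt at any point.

```c
#include <assert.h>
#include <stddef.h>

struct page_sketch {
    size_t write;      /* tail pointer: end of the last reservation */
    size_t commit;     /* end of the last full commit */
    unsigned nesting;  /* writers between reserve and commit (assumed) */
};

/* Reserve len bytes and return the offset handed back to the writer. */
static size_t rb_reserve(struct page_sketch *p, size_t len)
{
    size_t pos = p->write;

    p->write += len;
    p->nesting++;
    return pos;
}

/* Commit the innermost outstanding reservation. */
static void rb_commit(struct page_sketch *p)
{
    assert(p->nesting > 0);

    /* A nested writer only leaves a pending commit behind. */
    if (--p->nesting > 0)
        return;

    /*
     * The outermost writer finishes last, so everything reserved so
     * far has been written: move the commit up to the tail.
     */
    p->commit = p->write;
}
```

With two nested reservations, the first call to rb_commit() (from the
nested writer) leaves the commit where it was, and the second call
(from the outermost writer) moves it to the tail pointer.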
|
||||
|
||||
The commit page points to the page that has the last full commit.
|
||||
The tail page points to the page with the last write (before
|
||||
committing).
|
||||
|
||||
The tail page is always equal to or after the commit page. It may
be several pages ahead. If the tail page catches up to the commit
page then no more writes may take place (regardless of the mode
of the ring buffer: overwrite or producer/consumer).

The order of pages is:
|
||||
|
||||
head page
|
||||
commit page
|
||||
tail page
|
||||
|
||||
Possible scenario:
|
||||
tail page
|
||||
head page commit page |
|
||||
| | |
|
||||
v v v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
There is a special case where the head page is after the commit page
and possibly the tail page. That is when the commit (and tail) page has been
swapped with the reader page. This is because the head page is always
part of the ring buffer, but the reader page is not. Whenever less than
a full page has been committed inside the ring buffer
and a reader swaps out a page, it will be swapping out the commit page.
|
||||
|
||||
|
||||
reader page commit page tail page
|
||||
| | |
|
||||
v | |
|
||||
+---+ | |
|
||||
| |<----------+ |
|
||||
| |<------------------------+
|
||||
| |------+
|
||||
+---+ |
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
^
|
||||
|
|
||||
head page
|
||||
|
||||
|
||||
In this case, the head page will not move when the tail and commit
|
||||
move back into the ring buffer.
|
||||
|
||||
The reader cannot swap a page into the ring buffer if the commit page
is still on that page. If the reader meets the last commit (real commit,
not pending or reserved), then there is nothing more to read.
The buffer is considered empty until another full commit finishes.

When the tail meets the head page, if the buffer is in overwrite mode,
the head page will be pushed ahead one. If the buffer is in producer/consumer
mode, the write will fail.
|
||||
|
||||
Overwrite mode:
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
^
|
||||
|
|
||||
head page
|
||||
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
^
|
||||
|
|
||||
head page
|
||||
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
^
|
||||
|
|
||||
head page
|
||||
|
||||
Note, the reader page will still point to the previous head page.
|
||||
But when a swap takes place, it will use the most recent head page.
|
||||
|
||||
|
||||
Making the Ring Buffer Lockless:
|
||||
--------------------------------
|
||||
|
||||
The main idea behind the lockless algorithm is to combine the moving
of the head_page pointer with the swapping of pages with the reader.
State flags are placed inside the pointer to the page. To do this,
each page must be aligned in memory by 4 bytes. This will allow the 2
least significant bits of the address to be used as flags, since
they will always be zero for the address. To get the address,
simply mask out the flags.
|
||||
|
||||
MASK = ~3
|
||||
|
||||
address & MASK
|
||||
|
||||
Two flags will be kept by these two bits:
|
||||
|
||||
HEADER - the page being pointed to is a head page
|
||||
|
||||
UPDATE - the page being pointed to is being updated by a writer
|
||||
and was or is about to be a head page.
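
The flag-in-pointer trick can be sketched as follows. The macro names,
the helper names and the choice of bit 0 for HEADER and bit 1 for UPDATE
are assumptions made for this illustration; the text above only requires
that the two flags live in the two least significant bits.

```c
#include <assert.h>
#include <stdint.h>

#define RB_FLAG_NORMAL ((uintptr_t)0)
#define RB_FLAG_HEADER ((uintptr_t)1)  /* pointed-to page is the head page */
#define RB_FLAG_UPDATE ((uintptr_t)2)  /* a writer is moving the head page */
#define RB_FLAG_MASK   ((uintptr_t)3)

struct page_sketch {
    uintptr_t next;              /* address of the next page | flag bits */
    struct page_sketch *prev;
};

/* Combine a page address with a flag; the low two bits must be free. */
static uintptr_t rb_tag(struct page_sketch *p, uintptr_t flag)
{
    assert(((uintptr_t)p & RB_FLAG_MASK) == 0);
    return (uintptr_t)p | flag;
}

/* address & MASK, with MASK = ~3 */
static struct page_sketch *rb_ptr(uintptr_t v)
{
    return (struct page_sketch *)(v & ~RB_FLAG_MASK);
}

static uintptr_t rb_flag(uintptr_t v)
{
    return v & RB_FLAG_MASK;
}
```

Because the flag and the address share one word, a single cmpxchg can
test the flag and replace the pointer in the same atomic step.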
|
||||
|
||||
|
||||
reader page
|
||||
|
|
||||
v
|
||||
+---+
|
||||
| |------+
|
||||
+---+ |
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
|
||||
The above pointer "-H->" would have the HEADER flag set. That is,
the page it points to is the next page to be swapped out by the reader.
This pointer means the next page is the head page.
|
||||
|
||||
When the tail page meets the head pointer, it will use cmpxchg to
|
||||
change the pointer to the UPDATE state:
|
||||
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
"-U->" represents a pointer in the UPDATE state.
|
||||
|
||||
Any access to the reader will need to take some sort of lock to serialize
|
||||
the readers. But the writers will never take a lock to write to the
|
||||
ring buffer. This means we only need to worry about a single reader,
|
||||
and writes only preempt in "stack" formation.
|
||||
|
||||
When the reader tries to swap the page with the ring buffer, it
will also use cmpxchg. If the flag bit in the pointer to the
head page does not have the HEADER flag set, the compare will fail
and the reader will need to look for the new head page and try again.
Note, the flags UPDATE and HEADER are never set at the same time.
|
||||
|
||||
The reader swaps the reader page as follows:
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |
|
||||
+------+
|
||||
+---+ +---+ +---+
|
||||
| |--->| |--->| |
|
||||
| |<---| |<---| |
|
||||
+---+ +---+ +---+
|
||||
^ | ^ |
|
||||
| +---------------+ |
|
||||
+-----H-------------+
|
||||
|
||||
The reader sets the reader page next pointer as HEADER to the page after
|
||||
the head page.
|
||||
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |-------H-----------+
|
||||
+------+ v
|
||||
| +---+ +---+ +---+
|
||||
| | |--->| |--->| |
|
||||
| | |<---| |<---| |<-+
|
||||
| +---+ +---+ +---+ |
|
||||
| ^ | ^ | |
|
||||
| | +---------------+ | |
|
||||
| +-----H-------------+ |
|
||||
+--------------------------------------+
|
||||
|
||||
It does a cmpxchg with the pointer to the previous head page to make it
|
||||
point to the reader page. Note that the new pointer does not have the HEADER
|
||||
flag set. This action atomically moves the head page forward.
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |-------H-----------+
|
||||
+------+ v
|
||||
| ^ +---+ +---+ +---+
|
||||
| | | |-->| |-->| |
|
||||
| | | |<--| |<--| |<-+
|
||||
| | +---+ +---+ +---+ |
|
||||
| | | ^ | |
|
||||
| | +-------------+ | |
|
||||
| +-----------------------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
After the new head page is set, the previous pointer of the head page is
|
||||
updated to the reader page.
|
||||
|
||||
+------+
|
||||
|reader| RING BUFFER
|
||||
|page |-------H-----------+
|
||||
+------+ <---------------+ v
|
||||
| ^ +---+ +---+ +---+
|
||||
| | | |-->| |-->| |
|
||||
| | | | | |<--| |<-+
|
||||
| | +---+ +---+ +---+ |
|
||||
| | | ^ | |
|
||||
| | +-------------+ | |
|
||||
| +-----------------------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
+------+
|
||||
|buffer| RING BUFFER
|
||||
|page |-------H-----------+ <--- New head page
|
||||
+------+ <---------------+ v
|
||||
| ^ +---+ +---+ +---+
|
||||
| | | | | |-->| |
|
||||
| | New | | | |<--| |<-+
|
||||
| | Reader +---+ +---+ +---+ |
|
||||
| | page ----^ | |
|
||||
| | | |
|
||||
| +-----------------------------+ |
|
||||
+------------------------------------+
|
||||
|
||||
Another important point: the page that the reader page points back to
by its previous pointer (the one that now points to the new head page)
never points back to the reader page. That is because the reader page is
not part of the ring buffer. Traversing the ring buffer via the next pointers
will always stay in the ring buffer. Traversing the ring buffer via the
prev pointers may not.
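
The reader swap walked through above can be sketched in C11. This is a
minimal sketch, not the kernel implementation: struct bpage, page_of(),
reader_swap() and the flag values are hypothetical, and the sketch
assumes the caller already found the head page and holds the reader lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_HEADER ((uintptr_t)1)   /* assumed bit assignment */
#define FLAG_MASK   ((uintptr_t)3)

struct bpage {
    _Atomic uintptr_t next;   /* address of the next page | flag bits */
    struct bpage *prev;
};

static struct bpage *page_of(uintptr_t v)
{
    return (struct bpage *)(v & ~FLAG_MASK);
}

/*
 * Try to swap 'reader' in for the head page 'head'. Returns false if
 * the pointer to 'head' no longer carries HEADER, in which case the
 * caller would look up the new head page and try again.
 */
static bool reader_swap(struct bpage *reader, struct bpage *head)
{
    struct bpage *before = head->prev;
    struct bpage *after = page_of(atomic_load(&head->next));
    uintptr_t expected = (uintptr_t)head | FLAG_HEADER;

    /* 1. Point the reader page at the page after the head, as HEADER. */
    atomic_store(&reader->next, (uintptr_t)after | FLAG_HEADER);
    reader->prev = before;

    /* 2. Redirect before->next from head|HEADER to reader (no flag). */
    if (!atomic_compare_exchange_strong(&before->next, &expected,
                                        (uintptr_t)reader))
        return false;

    /* 3. The new head page now points back to the inserted page. */
    after->prev = reader;
    return true;
}
```

After a successful call, 'head' is outside the ring and serves as the
new reader page, and the old reader page has taken its place.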
|
||||
|
||||
Note, the way to determine a reader page is simply by examining the previous
|
||||
pointer of the page. If the next pointer of the previous page does not
|
||||
point back to the original page, then the original page is a reader page:
|
||||
|
||||
|
||||
+--------+
|
||||
| reader | next +----+
|
||||
| page |-------->| |<====== (buffer page)
|
||||
+--------+ +----+
|
||||
| | ^
|
||||
| v | next
|
||||
prev | +----+
|
||||
+------------->| |
|
||||
+----+
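
The reader page test shown in the diagram comes down to one comparison.
A minimal sketch, with hypothetical type and function names and with
the flag bits masked off before comparing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_MASK ((uintptr_t)3)

struct bpage {
    uintptr_t next;        /* address of the next page | flag bits */
    struct bpage *prev;
};

static struct bpage *page_of(uintptr_t v)
{
    return (struct bpage *)(v & ~FLAG_MASK);
}

/*
 * A page inside the ring is pointed to by the next pointer of its
 * previous page. The reader page is not.
 */
static bool is_reader_page(const struct bpage *page)
{
    return page_of(page->prev->next) != page;
}
```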
|
||||
|
||||
The way the head page moves forward:
|
||||
|
||||
When the tail page meets the head page and the buffer is in overwrite mode
and more writes take place, the head page must be moved forward before the
writer may move the tail page. The way this is done is that the writer
performs a cmpxchg to change the flag on the pointer to the head page from
HEADER to UPDATE. Once this is done, the reader will
not be able to swap the head page from the buffer, nor will it be able to
move the head page, until the writer is finished with the move.

This eliminates any races that the reader can have on the writer. The reader
must spin, and this is why the reader cannot preempt the writer.
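
The simple (non-nested) case of this head page move, which the diagrams
that follow step through, can be sketched in C11. All names and flag
values here are hypothetical, and the sketch deliberately ignores the
nested-writer cases covered later in the document.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_NORMAL ((uintptr_t)0)
#define FLAG_HEADER ((uintptr_t)1)   /* assumed bit assignment */
#define FLAG_UPDATE ((uintptr_t)2)
#define FLAG_MASK   ((uintptr_t)3)

struct bpage {
    _Atomic uintptr_t next;   /* address of the next page | flag bits */
};

static struct bpage *page_of(uintptr_t v)
{
    return (struct bpage *)(v & ~FLAG_MASK);
}

/* cmpxchg the link prev->next from target|old_flag to target|new_flag. */
static bool change_flag(struct bpage *prev, struct bpage *target,
                        uintptr_t old_flag, uintptr_t new_flag)
{
    uintptr_t expected = (uintptr_t)target | old_flag;

    return atomic_compare_exchange_strong(&prev->next, &expected,
                                          (uintptr_t)target | new_flag);
}

/* The writer on 'tail' pushes the head page (tail's next page) forward. */
static bool push_head(struct bpage *tail)
{
    struct bpage *head = page_of(atomic_load(&tail->next));
    struct bpage *new_head;

    /* HEADER -> UPDATE: from here on the reader's cmpxchg will fail. */
    if (!change_flag(tail, head, FLAG_HEADER, FLAG_UPDATE))
        return false;   /* the next page was not a plain head page */

    /* The following page becomes the new head page: NORMAL -> HEADER. */
    new_head = page_of(atomic_load(&head->next));
    change_flag(head, new_head, FLAG_NORMAL, FLAG_HEADER);

    /* Only the writer that set UPDATE resets it: UPDATE -> NORMAL. */
    change_flag(tail, head, FLAG_UPDATE, FLAG_NORMAL);
    return true;
}
```

Once push_head() returns true the writer is free to move the tail page
onto the old head page.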
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The following page will be made into the new head page.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
After the new head page has been set, we can set the old head page
|
||||
pointer back to NORMAL.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
After the head page has been moved, the tail page may now move forward.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
|
||||
The above are the trivial updates. Now for the more complex scenarios.
|
||||
|
||||
|
||||
As stated before, if enough writes preempt the first write, the
|
||||
tail page may make it all the way around the buffer and meet the commit
|
||||
page. At this time, we must start dropping writes (usually with some kind
|
||||
of warning to the user). But what happens if the commit was still on the
|
||||
reader page? The commit page is not part of the ring buffer. The tail page
|
||||
must account for this.
|
||||
|
||||
|
||||
reader page commit page
|
||||
| |
|
||||
v |
|
||||
+---+ |
|
||||
| |<----------+
|
||||
| |
|
||||
| |------+
|
||||
+---+ |
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
^
|
||||
|
|
||||
tail page
|
||||
|
||||
If the tail page were to simply push the head page forward, the commit when
|
||||
leaving the reader page would not be pointing to the correct page.
|
||||
|
||||
The solution to this is to test if the commit page is on the reader page
|
||||
before pushing the head page. If it is, then it can be assumed that the
|
||||
tail page wrapped the buffer, and we must drop new writes.
|
||||
|
||||
This is not a race condition, because the commit page can only be moved
by the outermost writer (the writer that was preempted).
This means that the commit will not move while a writer is moving the
tail page. The reader cannot swap the reader page if it is also being
used as the commit page. The reader can simply check that the commit
is off the reader page. Once the commit page leaves the reader page
it will never go back on it unless a reader does another swap with the
buffer page that is also the commit page.
|
||||
|
||||
|
||||
Nested writes
|
||||
-------------
|
||||
|
||||
In the pushing forward of the tail page we must first push forward
|
||||
the head page if the head page is the next page. If the head page
|
||||
is not the next page, the tail page is simply updated with a cmpxchg.
|
||||
|
||||
Only writers move the tail page. This must be done atomically to protect
|
||||
against nested writers.
|
||||
|
||||
temp_page = tail_page
|
||||
next_page = temp_page->next
|
||||
cmpxchg(tail_page, temp_page, next_page)
|
||||
|
||||
The above will update the tail page if it is still pointing to the expected
page. If this fails, a nested write pushed it forward, so the current write
does not need to push it.
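
The three-line pseudocode above maps directly onto a C11 compare and
exchange. The struct and function names in this sketch are made up for
illustration; flag bits on the next pointer are left out to keep the
focus on the tail page update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct bpage {
    struct bpage *next;
};

struct cpu_buffer_sketch {
    _Atomic(struct bpage *) tail_page;
};

/*
 * Move the tail page from temp_page to temp_page->next. Returns false
 * when a nested writer already moved it, in which case there is
 * nothing left to push and the caller retries on the new tail page.
 */
static bool advance_tail(struct cpu_buffer_sketch *cb,
                         struct bpage *temp_page)
{
    struct bpage *next_page = temp_page->next;
    struct bpage *expected = temp_page;

    return atomic_compare_exchange_strong(&cb->tail_page, &expected,
                                          next_page);
}
```

If an outer writer samples the tail page, gets interrupted by a nested
writer that advances it, and then calls advance_tail() with its stale
sample, the call fails and the tail page has moved exactly once.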
|
||||
|
||||
|
||||
temp page
|
||||
|
|
||||
v
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
Nested write comes in and moves the tail page forward:
|
||||
|
||||
tail page (moved by nested writer)
|
||||
temp page |
|
||||
| |
|
||||
v v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The above would fail the cmpxchg, but since the tail page has already
|
||||
been moved forward, the writer will just try again to reserve storage
|
||||
on the new tail page.
|
||||
|
||||
But the moving of the head page is a bit more complex.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The write converts the head page pointer to UPDATE.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
But if a nested writer preempts here, it will see that the next
page is a head page, but it is also nested. It will detect that
it is nested and will save that information. The detection is the
fact that it sees the UPDATE flag instead of a HEADER or NORMAL
pointer.
|
||||
|
||||
The nested writer will set the new head page pointer.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
But it will not reset the update back to normal. Only the writer
that converted a pointer from HEADER to UPDATE will convert it back
to NORMAL.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
After the nested writer finishes, the outer most writer will convert
|
||||
the UPDATE pointer to NORMAL.
|
||||
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
|
||||
It can be even more complex if several nested writes came in and moved
|
||||
the tail page ahead several pages:
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-H->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The write converts the head page pointer to UPDATE.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
Next writer comes in, and sees the update and sets up the new
|
||||
head page.
|
||||
|
||||
(second writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The nested writer moves the tail page forward, but does not set the old
update page to NORMAL because it is not the outermost writer.
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
Another writer preempts and sees the page after the tail page is a head page.
It changes it from HEADER to UPDATE.
|
||||
|
||||
(third writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-U->| |--->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The writer will move the head page forward:
|
||||
|
||||
|
||||
(third writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-U->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
But now that the third writer did change the HEADER flag to UPDATE it
will convert it to NORMAL:
|
||||
|
||||
|
||||
(third writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
|
||||
Then it will move the tail page, and return back to the second writer.
|
||||
|
||||
|
||||
(second writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
|
||||
The second writer will fail to move the tail page because it was already
|
||||
moved, so it will try again and add its data to the new tail page.
|
||||
It will return to the first writer.
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
The first writer cannot atomically test whether the tail page moved
while it updates the head page. It will then update the head page to
what it thinks is the new head page.
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
tail page
|
||||
|
|
||||
v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
Since the cmpxchg returns the old value of the pointer the first writer
will see it succeeded in updating the pointer from NORMAL to HEADER.
But as we can see, this is not good enough. It must also check to see
if the tail page is either where it used to be or on the next page:
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
A B tail page
|
||||
| | |
|
||||
v v v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |-H->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
If tail page != A and tail page != B, then it must reset the
pointer back to NORMAL. Because it only needs to worry about
nested writers, it only needs to check this after setting the head page.
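
The check described above can be sketched in C11. The names are
hypothetical and the flag values are assumed; 'a' is the page the first
writer saw as the tail page and 'b' is the page after it. The sketch
only covers the step of setting the new head page and undoing it when
nested writers have moved the tail page further along.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_NORMAL ((uintptr_t)0)
#define FLAG_HEADER ((uintptr_t)1)   /* assumed bit assignment */
#define FLAG_UPDATE ((uintptr_t)2)
#define FLAG_MASK   ((uintptr_t)3)

struct bpage {
    _Atomic uintptr_t next;   /* address of the next page | flag bits */
};

static struct bpage *page_of(uintptr_t v)
{
    return (struct bpage *)(v & ~FLAG_MASK);
}

/* Returns true if the page after 'b' was left marked as the head page. */
static bool set_new_head(struct bpage *a, struct bpage *b,
                         _Atomic(struct bpage *) *tail_page)
{
    struct bpage *after_b = page_of(atomic_load(&b->next));
    uintptr_t expected = (uintptr_t)after_b | FLAG_NORMAL;
    struct bpage *tail;

    /* NORMAL -> HEADER on the pointer leaving B. */
    if (!atomic_compare_exchange_strong(&b->next, &expected,
                                        (uintptr_t)after_b | FLAG_HEADER))
        return false;   /* pointer was not NORMAL, nothing to undo */

    /* Nested writers may have moved the tail page beyond A and B. */
    tail = atomic_load(tail_page);
    if (tail != a && tail != b) {
        /* The head page is further along: HEADER -> NORMAL again. */
        expected = (uintptr_t)after_b | FLAG_HEADER;
        atomic_compare_exchange_strong(&b->next, &expected,
                                       (uintptr_t)after_b | FLAG_NORMAL);
        return false;
    }
    return true;
}
```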
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
A B tail page
|
||||
| | |
|
||||
v v v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |-U->| |--->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|
||||
Now the writer can update the head page. This is also why the head page must
remain in UPDATE and only be reset by the outermost writer. This prevents
the reader from seeing the incorrect head page.
|
||||
|
||||
|
||||
(first writer)
|
||||
|
||||
A B tail page
|
||||
| | |
|
||||
v v v
|
||||
+---+ +---+ +---+ +---+
|
||||
<---| |--->| |--->| |--->| |-H->
|
||||
--->| |<---| |<---| |<---| |<---
|
||||
+---+ +---+ +---+ +---+
|
||||
|