aha/fs/reiserfs
Frederic Weisbecker 8ebc423238 reiserfs: kill-the-BKL
This patch is an attempt to remove the BKL-based locking scheme from
reiserfs.

It is loosely inspired by an earlier attempt by Peter Zijlstra:

   http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/2174.html

The BKL is heavily used in this filesystem to prevent concurrent
write accesses to the filesystem.

Reiserfs relies heavily on two specific properties of the BKL:

- It can be acquired recursively by the same task
- It is released on schedule() calls and reacquired when schedule() returns

These two properties shape the whole reiserfs write locking scheme, so it's
very hard to simply replace the BKL with an ordinary mutex.

- We need a lock that can be taken recursively, unless we want to restructure
  several blocks of the code.
- We need to identify the sites where the BKL was implicitly relaxed
  (schedule, wait, sync, etc...) so that we can in turn release and
  reacquire our new lock explicitly.
  Such implicit releases of the lock are often required to let other
  resource producers/consumers do their job; without them we can suffer
  unexpected starvation or deadlocks.

So the new lock that replaces the BKL here is a per-superblock mutex with a
specific property: it can be acquired recursively by the same task, like the
BKL.

For this purpose, we add a lock owner field and a lock depth field to the
superblock information structure.
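
As a rough illustration of the owner/depth idea, here is a minimal userspace
sketch using pthreads. It is not the code from this patch: the structure and
function names (sketch_write_lock, sketch_write_unlock, self_token) are
hypothetical, and where the kernel can compare against the current task, this
sketch assumes a per-thread token stored atomically.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* The address of this per-thread variable identifies the calling thread. */
static _Thread_local char self_token;

/* Hypothetical stand-in for the lock fields of the superblock info. */
struct sketch_write_lock {
    pthread_mutex_t mutex;      /* the real lock */
    _Atomic(void *) owner;      /* token of the holder, NULL when free */
    int depth;                  /* nesting level, touched only by the owner */
};

#define SKETCH_WRITE_LOCK_INIT { PTHREAD_MUTEX_INITIALIZER, NULL, 0 }

static void sketch_write_lock(struct sketch_write_lock *l)
{
    /* Only take the mutex on the outermost acquisition. */
    if (atomic_load(&l->owner) != (void *)&self_token) {
        pthread_mutex_lock(&l->mutex);
        atomic_store(&l->owner, (void *)&self_token);
    }
    l->depth++;
}

static void sketch_write_unlock(struct sketch_write_lock *l)
{
    /* Unlocking a lock we do not own is a caller bug. */
    assert(atomic_load(&l->owner) == (void *)&self_token);

    /* Only release the mutex when the outermost level is left. */
    if (--l->depth == 0) {
        atomic_store(&l->owner, NULL);
        pthread_mutex_unlock(&l->mutex);
    }
}
```

A thread that calls sketch_write_lock() twice must call sketch_write_unlock()
twice before another thread can get in, which mimics the recursive behaviour
of the BKL described above.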

The first axis of this patch is to turn the reiserfs_write_(un)lock()
functions into wrappers that manage this mutex. Also, some explicit calls to
lock_kernel() have been converted to the reiserfs_write_lock() helpers.

The second axis is to find the important blocking sites (schedule...(),
wait_on_buffer(), sync_dirty_buffer(), etc...) and then apply an explicit
release of the write lock on these locations before blocking. Then we can
safely wait for those who can give us resources or those who need some.
Typically this is a fight between the current writer, the reiserfs workqueue
(aka the async committer) and the pdflush threads.
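
The pattern at such a blocking site can be sketched in userspace as follows.
This is only an illustration under assumptions: the write lock is modelled as
a plain pthread mutex, and call_without_write_lock() and
probe_write_lock_free() are hypothetical names, not helpers from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Hypothetical helper: run a blocking operation with the write lock dropped,
 * then take the lock back before returning to the caller.
 */
static int call_without_write_lock(pthread_mutex_t *write_lock,
                                   int (*blocking_op)(void *), void *arg)
{
    int ret;

    assert(write_lock != NULL && blocking_op != NULL);

    pthread_mutex_unlock(write_lock);   /* let other tasks make progress */
    ret = blocking_op(arg);             /* may sleep for a long time */
    pthread_mutex_lock(write_lock);     /* reacquire before touching fs state */
    return ret;
}

/*
 * Stand-in for a blocking site such as waiting on a buffer. It does not
 * block; it only reports whether the write lock was free while it ran.
 */
static int probe_write_lock_free(void *arg)
{
    pthread_mutex_t *write_lock = arg;

    if (pthread_mutex_trylock(write_lock) != 0)
        return 0;                       /* still held by someone */
    pthread_mutex_unlock(write_lock);
    return 1;
}
```

With the BKL the scheduler did the release and reacquire implicitly; with a
mutex every such site has to do it by hand, which is why they must all be
located first.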

The third axis is a consequence of the second. The write lock is usually
at the top of a lock dependency chain which can include the journal lock, the
flush lock or the commit lock. So it's dangerous to release and then try to
reacquire the write lock while we still hold other locks.

This is fine with the bkl:

      T1                       T2

lock_kernel()
    mutex_lock(A)
    unlock_kernel()
    // do something
                            lock_kernel()
                                mutex_lock(A) -> already locked by T1
                                schedule() (and then unlock_kernel())
    lock_kernel()
    mutex_unlock(A)
    ....

This is not fine with a mutex:

      T1                       T2

mutex_lock(write)
    mutex_lock(A)
    mutex_unlock(write)
    // do something
                           mutex_lock(write)
                              mutex_lock(A) -> already locked by T1
                              schedule()

    mutex_lock(write) -> already locked by T2
    deadlock

The solution in this patch is to provide a helper which releases the write
lock and sleeps a bit if we can't lock a mutex that depends on it. It's another
simulation of the BKL behaviour.
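
A userspace sketch of such a helper, replaying the T1/T2 scenario above, could
look like this. It is an illustration under assumptions, not the helper from
this patch: both locks are plain pthread mutexes, and mutex_lock_safe(),
task1(), task2() and run_two_task_demo() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER; /* "write" */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;     /* "A" */
static int completed;                   /* protected by lock_a */

/* Take m without ever sleeping on it while the write lock is held. */
static void mutex_lock_safe(pthread_mutex_t *m, pthread_mutex_t *wl)
{
    assert(m != wl);
    if (pthread_mutex_trylock(m) == 0)
        return;                         /* uncontended: keep the write lock */
    pthread_mutex_unlock(wl);           /* contended: relax the write lock */
    pthread_mutex_lock(m);              /* sleep until m is available */
    pthread_mutex_lock(wl);             /* then take the write lock back */
}

static void *task1(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&write_lock);
    mutex_lock_safe(&lock_a, &write_lock);
    pthread_mutex_unlock(&write_lock);  /* relax the write lock, keep A */
    /* ... do something ... */
    pthread_mutex_lock(&write_lock);    /* safe: T2 never sleeps on A with it */
    completed++;
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&write_lock);
    return NULL;
}

static void *task2(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&write_lock);
    mutex_lock_safe(&lock_a, &write_lock);
    completed++;
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&write_lock);
    return NULL;
}

/* Runs both tasks to completion and returns how many finished. */
static int run_two_task_demo(void)
{
    pthread_t t1, t2;

    if (pthread_create(&t1, NULL, task1, NULL) != 0)
        return -1;
    if (pthread_create(&t2, NULL, task2, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return completed;
}
```

If task2() called pthread_mutex_lock(&lock_a) directly instead of
mutex_lock_safe(), the interleaving shown in the second diagram could deadlock.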

The last axis is to locate the fs callbacks that are called with the BKL held,
according to Documentation/filesystems/Locking.

Those are:

- reiserfs_remount
- reiserfs_fill_super
- reiserfs_put_super

Reiserfs didn't need to lock explicitly in these callbacks because the caller
already held the BKL. But now we must take the new lock there ourselves.

After this patch, reiserfs suffers from a slight performance regression (for now).
On UP, a high volume write with dd reports an average of 27 MB/s instead
of 30 MB/s without the patch applied.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Bron Gondwana <brong@fastmail.fm>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
LKML-Reference: <1239070789-13354-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-14 07:17:59 +02:00
..
bitmap.c reiserfs: kill-the-BKL 2009-09-14 07:17:59 +02:00
dir.c reiserfs: kill-the-BKL 2009-09-14 07:17:59 +02:00
do_balan.c reiserfs: fix warnings with gcc 4.4 2009-06-18 13:03:46 -07:00
file.c reiserfs: rename [cn]_* variables 2009-03-30 12:16:40 -07:00
fix_node.c reiserfs: kill-the-BKL 2009-09-14 07:17:59 +02:00
hashes.c reiserfs: strip trailing whitespace 2009-03-30 12:16:39 -07:00
ibalance.c reiserfs: strip trailing whitespace 2009-03-30 12:16:39 -07:00
inode.c reiserfs: kill-the-BKL 2009-09-14 07:17:59 +02:00
ioctl.c reiserfs: kill-the-BKL 2009-09-14 07:17:59 +02:00
item_ops.c reiserfs: rework reiserfs_panic 2009-03-30 12:16:36 -07:00
journal.c reiserfs: kill-the-BKL 2009-09-14 07:17:59 +02:00
Kconfig fs/reiserfs: return f_fsid for statfs(2) 2009-04-02 19:05:10 -07:00
lbalance.c reiserfs: fix warnings with gcc 4.4 2009-06-18 13:03:46 -07:00
lock.c reiserfs: kill-the-BKL 2009-09-14 07:17:59 +02:00
Makefile reiserfs: kill-the-BKL 2009-09-14 07:17:59 +02:00
namei.c reiserfs: remove privroot hiding in lookup 2009-05-09 10:49:39 -04:00
objectid.c reiserfs: strip trailing whitespace 2009-03-30 12:16:39 -07:00
prints.c reiserfs: strip trailing whitespace 2009-03-30 12:16:39 -07:00
procfs.c proc 2/2: remove struct proc_dir_entry::owner 2009-03-31 01:14:44 +04:00
README reiserfs: strip trailing whitespace 2009-03-30 12:16:39 -07:00
resize.c reiserfs: kill-the-BKL 2009-09-14 07:17:59 +02:00
stree.c reiserfs: kill-the-BKL 2009-09-14 07:17:59 +02:00
super.c reiserfs: kill-the-BKL 2009-09-14 07:17:59 +02:00
tail_conversion.c reiserfs: rename [cn]_* variables 2009-03-30 12:16:40 -07:00
xattr.c headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
xattr_acl.c helpers for acl caching + switch to those 2009-06-24 08:17:07 -04:00
xattr_security.c reiserfs: dont associate security.* with xattr files 2009-05-09 10:49:39 -04:00
xattr_trusted.c reiserfs: use generic xattr handlers 2009-03-30 12:16:38 -07:00
xattr_user.c reiserfs: use generic xattr handlers 2009-03-30 12:16:38 -07:00

[LICENSING]

ReiserFS is hereby licensed under the GNU General
Public License version 2.

Source code files that contain the phrase "licensing governed by
reiserfs/README" are "governed files" throughout this file.  Governed
files are licensed under the GPL.  The portions of them owned by Hans
Reiser, or authorized to be licensed by him, have been in the past,
and likely will be in the future, licensed to other parties under
other licenses.  If you add your code to governed files, and don't
want it to be owned by Hans Reiser, put your copyright label on that
code so the poor blight and his customers can keep things straight.
All portions of governed files not labeled otherwise are owned by Hans
Reiser, and by adding your code to it, widely distributing it to
others or sending us a patch, and leaving the sentence in stating that
licensing is governed by the statement in this file, you accept this.
It will be a kindness if you identify whether Hans Reiser is allowed
to license code labeled as owned by you on your behalf other than
under the GPL, because he wants to know if it is okay to do so and put
a check in the mail to you (for non-trivial improvements) when he
makes his next sale.  He makes no guarantees as to the amount if any,
though he feels motivated to motivate contributors, and you can surely
discuss this with him before or after contributing.  You have the
right to decline to allow him to license your code contribution other
than under the GPL.

Further licensing options are available for commercial and/or other
interests directly from Hans Reiser: hans@reiser.to.  If you interpret
the GPL as not allowing those additional licensing options, you read
it wrongly, and Richard Stallman agrees with me, when carefully read
you can see that those restrictions on additional terms do not apply
to the owner of the copyright, and my interpretation of this shall
govern for this license.

Finally, nothing in this license shall be interpreted to allow you to
fail to fairly credit me, or to remove my credits, without my
permission, unless you are an end user not redistributing to others.
If you have doubts about how to properly do that, or about what is
fair, ask.  (Last I spoke with him Richard was contemplating how best
to address the fair crediting issue in the next GPL version.)

[END LICENSING]

Reiserfs is a file system based on balanced tree algorithms, which is
described at http://devlinux.com/namesys.

Stop reading here.  Go there, then return.

Send bug reports to yura@namesys.botik.ru.

mkreiserfs and other utilities are in reiserfs/utils, or wherever your
Linux provider put them.  There is some disagreement about how useful
it is for users to get their fsck and mkreiserfs out of sync with the
version of reiserfs that is in their kernel, with many important
distributors wanting them out of sync.:-) Please try to remember to
recompile and reinstall fsck and mkreiserfs with every update of
reiserfs, this is a common source of confusion.  Note that some of the
utilities cannot be compiled without accessing the balancing code
which is in the kernel code, and relocating the utilities may require
you to specify where that code can be found.

Yes, if you update your reiserfs kernel module you do have to
recompile your kernel, most of the time.  The errors you get will be
quite cryptic if you forget to do so.

Real users, as opposed to folks who want to hack and then understand
what went wrong, will want REISERFS_CHECK off.

Hideous Commercial Pitch: Spread your development costs across other OS
vendors.  Select from the best in the world, not the best in your
building, by buying from third party OS component suppliers.  Leverage
the software component development power of the internet.  Be the most
aggressive in taking advantage of the commercial possibilities of
decentralized internet development, and add value through your branded
integration that you sell as an operating system.  Let your competitors
be the ones to compete against the entire internet by themselves.  Be
hip, get with the new economic trend, before your competitors do.  Send
email to hans@reiser.to.

To understand the code, after reading the website, start reading the
code by reading reiserfs_fs.h first.

Hans Reiser was the project initiator, primary architect, source of all
funding for the first 5.5 years, and one of the programmers.  He owns
the copyright.

Vladimir Saveljev was one of the programmers, and he worked long hours
writing the cleanest code.  He always made the effort to be the best he
could be, and to make his code the best that it could be.  What resulted
was quite remarkable. I don't think that money can ever motivate someone
to work the way he did, he is one of the most selfless men I know.

Yura helps with benchmarking, coding hashes, and block pre-allocation
code.

Anatoly Pinchuk is a former member of our team who worked closely with
Vladimir throughout the project's development.  He wrote a quite
substantial portion of the total code.  He realized that there was a
space problem with packing tails of files for files larger than a node
that start on a node aligned boundary (there are reasons to want to node
align files), and he invented and implemented indirect items and
unformatted nodes as the solution.

Konstantin Shvachko, with the help of the Russian version of a VC,
tried to put me in a position where I was forced into giving control
of the project to him.  (Fortunately, as the person paying the money
for all salaries from my dayjob I owned all copyrights, and you can't
really force takeovers of sole proprietorships.)  This was something
curious, because he never really understood the value of our project,
why we should do what we do, or why innovation was possible in
general, but he was sure that he ought to be controlling it.  Every
innovation had to be forced past him while he was with us.  He added
two years to the time required to complete reiserfs, and was a net
loss for me.  Mikhail Gilula was a brilliant innovator who also left
in a destructive way that erased the value of his contributions, and
that he was shown much generosity just makes it more painful.

Grigory Zaigralin was an extremely effective system administrator for
our group.

Igor Krasheninnikov was wonderful at hardware procurement, repair, and
network installation.

Jeremy Fitzhardinge wrote the teahash.c code, and he gives credit to a
textbook he got the algorithm from in the code.  Note that his analysis
of how we could use the hashing code in making 32 bit NFS cookies work
was probably more important than the actual algorithm.  Colin Plumb also
contributed to it.

Chris Mason dived right into our code, and in just a few months produced
the journaling code that dramatically increased the value of ReiserFS.
He is just an amazing programmer.

Igor Zagorovsky is writing much of the new item handler and extent code
for our next major release.

Alexander Zarochentcev (sometimes known as zam, or sasha), wrote the
resizer, and is hard at work on implementing allocate on flush.  SGI
implemented allocate on flush before us for XFS, and generously took
the time to convince me we should do it also.  They are great people,
and a great company.

Yuri Shevchuk and Nikita Danilov are doing squid cache optimization.

Vitaly Fertman is doing fsck.

Jeff Mahoney, of SuSE, contributed a few cleanup fixes, most notably
the endian safe patches which allow ReiserFS to run on any platform
supported by the Linux kernel.

SuSE, IntegratedLinux.com, Ecila, MP3.com, bigstorage.com, and the
Alpha PC Company made it possible for me to not have a day job
anymore, and to dramatically increase our staffing.  Ecila funded
hypertext feature development, MP3.com funded journaling, SuSE funded
core development, IntegratedLinux.com funded squid web cache
appliances, bigstorage.com funded HSM, and the alpha PC company funded
the alpha port.  Many of these tasks were helped by sponsors other
than the ones just named.  SuSE has helped in much more than just
funding....