
When writing concurrent code, a lock is one of the most critical tools
programmers have in their toolbox. For the majority of applications adding
a lock to protect access to some data is all you need. However, for some
performance-critical pieces of code, a lock may introduce significant
overhead and its use has to be carefully optimized; for example, lock
granularity is frequently tweaked to make sure threads hold locks for the
minimal amount of time necessary.

A common optimization for locks is using reader-writer locks, and this post
discusses the concept and provides multiple implementations. I’m using Go as
the language because of its great support for concurrency and the large number
of building blocks available in the standard library, but the concepts are
easily translatable to any programming language [1].

All the code for this post is available on GitHub. See the
appendix at the end of the post for some notes on testing and benchmarking
methodology.

The purpose of this post is purely educational; Go already has an excellent
reader-writer lock implementation with sync.RWMutex. In fact, we’ll be
looking at how it works in this post.

[Image: a lock with several read keys and one large write key]

Motivation for having reader-writer locks

Reader-writer locks (RW locks from here on) were created from the observation
that multiple threads can read shared data concurrently, as long as no
one is modifying that data while it’s being read. Regular locks don’t
distinguish between “lock for reading” and “lock for writing”, so when multiple
threads are reading data, each will still have to lock it, producing needless
serialization. In other words, necessary exclusion between readers and writers
leads to unnecessary exclusion between multiple readers.

RW locks fix this issue. Instead of having a single lock method, they have
two – one for readers and one for writers. When readers enter the critical
section they invoke the reader lock (and then reader unlock on exit); when
writers enter the critical section they invoke the writer lock (and then
writer unlock on exit). With this scheme, as long as there are no writers
in the critical section, multiple readers can interact with the
data simultaneously.
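
To make the calling pattern concrete, here is a minimal sketch using the
standard library’s sync.RWMutex mentioned earlier; counter, readCounter and
incrementCounter are names made up for this example, and the sync import is
assumed:

var (
    mu      sync.RWMutex
    counter int
)

// readCounter takes the reader lock; any number of readers can hold it
// at the same time.
func readCounter() int {
    mu.RLock()
    defer mu.RUnlock()
    return counter
}

// incrementCounter takes the writer lock; it excludes all readers and
// all other writers.
func incrementCounter() {
    mu.Lock()
    defer mu.Unlock()
    counter++
}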

A simple implementation

Let’s start with a simple implementation that uses a counter. All RW locks in
this post will implement this interface:

type RWLocker interface {
    RLock()
    RUnlock()
    WLock()
    WUnlock()
}

Having a single interface is very useful for testing – see the appendix for more
details on that. The names of the methods in the interface should be
self-explanatory: reader lock/unlock, writer lock/unlock.
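
For instance, a helper like the following (readSum is a name made up for
illustration) works unchanged with every implementation in this post, because
it only depends on the RWLocker interface:

// readSum reads shared data under the reader lock of any RWLocker
// implementation.
func readSum(l RWLocker, data []int) int {
    l.RLock()
    defer l.RUnlock()

    sum := 0
    for _, v := range data {
        sum += v
    }
    return sum
}

A test can take an RWLocker parameter the same way and be run once per
implementation.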

The simple counter-based implementation uses a mutex and a counter:

type ReaderCountRWLock struct {
    m           sync.Mutex
    readerCount int
}

The counter keeps track of the number of readers that are holding the lock. The
reader locking/unlocking is straightforward [2]:

func (l *ReaderCountRWLock) RLock() {
    l.m.Lock()
    l.readerCount++
    l.m.Unlock()
}

func (l *ReaderCountRWLock) RUnlock() {
    l.m.Lock()
    l.readerCount--
    l.m.Unlock()
}

The writer is a bit trickier. A writer attempting to take the lock will have
to wait until there are no readers inside. Here’s a simple, though inefficient,
way to do this:

func (l *ReaderCountRWLock) WLock() {
    for {
        l.m.Lock()
        if l.readerCount > 0 {
            l.m.Unlock()
        } else {
            break
        }
    }
}

The writer locks the mutex and checks if there are readers inside the lock. If
there are, it releases the mutex and tries again – this is called spinning.
If there are no readers, WLock returns with the mutex still locked, so
readers won’t be able to take the lock. A writer unlock is trivial:

func (l *ReaderCountRWLock) WUnlock() {
    l.m.Unlock()
}

This implementation is the simplest I could come up with. Its performance could
be better, though. If readers are inside the lock, a writer will spin taking and
releasing the mutex and “burn” CPU cycles. We could optimize this lock
significantly if only we had some way for a writer to wait more efficiently.

Efficient waiting with a condition variable

Condition variables are exactly what we need here, since the pattern of “take
mutex, check something, release it back if not ready” is precisely the pattern
they are designed to optimize. A more advanced version of the RW lock will use
one:

type ReaderCountCondRWLock struct {
    readerCount int
    c           *sync.Cond
}

If you’re wondering where the mutex is here, don’t worry. In Go, a sync.Cond
has a mutex embedded in it, so we get both. This struct is going to need a
constructor:

func NewReaderCountCondRWLock() *ReaderCountCondRWLock {
    return &ReaderCountCondRWLock{0, sync.NewCond(new(sync.Mutex))}
}

Here’s the reader for this lock:

func (l *ReaderCountCondRWLock) RLock() {
    l.c.L.Lock()
    l.readerCount++
    l.c.L.Unlock()
}

func (l *ReaderCountCondRWLock) RUnlock() {
    l.c.L.Lock()
    l.readerCount--
    if l.readerCount == 0 {
        l.c.Signal()
    }
    l.c.L.Unlock()
}

The only difference from the simple RW lock is that we signal the condition
variable when the last reader is releasing the lock. Now the writer lock can be
implemented as follows:

func (l *ReaderCountCondRWLock) WLock() {
    l.c.L.Lock()
    for l.readerCount > 0 {
        l.c.Wait()
    }
}

The Wait is still in a loop because it’s entirely possible that between the
time a reader signaled the condition variable and the time we got the lock,
some other reader got to the lock first.

The unlock is:

func (l *ReaderCountCondRWLock) WUnlock() {
    l.c.Signal()
    l.c.L.Unlock()
}

Can you figure out why the Signal is needed here? Without it, we’ll get
a deadlock under moderately heavy load. The answer is in footnote [3].

This implementation is much more efficient than the first one, because the
spin-loop is avoided. While there’s still a loop around Wait, it’s of
a very different nature – Wait still blocks, and the loop re-runs only in
rare cases when there’s a true race to the lock.

Here a note on benchmarking methodology is apt. I’m saying “more efficient”
without providing evidence, because benchmarking such code is extremely hard
and use-case dependent. Results will vary dramatically based on the load and
the specific workload you’re using. The appendix describes one benchmarking
methodology I used for this code and you should be able to reproduce my results.

Counting semaphore

Another variation of the RW lock which I find particularly elegant and simple is
using a counting semaphore. In Go, package golang.org/x/sync/semaphore
provides the Weighted type which implements this – a semaphore where you can
acquire and release a numeric “weight”. Here’s our type, a constructor and
reader methods:

const maxWeight int64 = 1 << 30

type SemaRWLock struct {
    s *semaphore.Weighted
}

// NewSemaRWLock creates a new SemaRWLock.
func NewSemaRWLock() *SemaRWLock {
    return &SemaRWLock{semaphore.NewWeighted(maxWeight)}
}

func (l *SemaRWLock) RLock() {
    l.s.Acquire(context.Background(), 1)
}

func (l *SemaRWLock) RUnlock() {
    l.s.Release(1)
}

A writer simply grabs the whole maxWeight, guaranteeing that only a single
writer can hold the semaphore at a time, mutually exclusive with any other
writers or readers:

func (l *SemaRWLock) WLock() {
    l.s.Acquire(context.Background(), maxWeight)
}

func (l *SemaRWLock) WUnlock() {
    l.s.Release(maxWeight)
}

While this is a really simple and clean solution, I found its performance to be
wanting. It could be that the implementation of semaphore.Weighted is just
not a good fit for my specific benchmark, but this approach performed vastly
worse than even the simple spin-loop implementation.

Reader preference vs. Writer preference

You may have noticed a common theme for all the implementations presented so
far. They all make it rather difficult for writers to get in when many readers
are active. For example, the very first implementation requires readerCount
to be at 0 for a writer to get in. Imagine there are 2 active readers and a
waiting writer; the writer waits for both readers to release the lock, but while
it’s waiting other readers can come in and grab it. It’s like waiting at a stop
sign to cross a very busy street where every car has the right of way over you –
it may take you quite a while.

This is called reader preference, or – more dramatically – writer starvation.
It’s often not what we want since it can add significant delay to
updates in the system. It’s great to have many readers working concurrently
without blocking each other, but it’s not as appealing if they are mostly
reading stale data.

Let’s now look at a couple of writer-preferring implementations.

A simple writer-preferring RW lock

This implementation is adapted from Wikipedia. We’ll start with
the struct and the constructor:

type WritePreferRWLock struct {
    readerCount int
    hasWriter   bool
    c           *sync.Cond
}

func NewWritePreferRWLock() *WritePreferRWLock {
    return &WritePreferRWLock{0, false, sync.NewCond(new(sync.Mutex))}
}

Here readerCount is still the number of readers holding the lock, but we’re
adding a new field – hasWriter; this one is true whenever there’s a
writer waiting to take the lock. c is a condition variable and an associated
mutex that implement the actual exclusion as well as efficient waiting. Let’s
start with the reader:

func (l *WritePreferRWLock) RLock() {
    l.c.L.Lock()
    for l.hasWriter {
        l.c.Wait()
    }
    l.readerCount++
    l.c.L.Unlock()
}

func (l *WritePreferRWLock) RUnlock() {
    l.c.L.Lock()
    l.readerCount--
    if l.readerCount == 0 {
        l.c.Broadcast()
    }
    l.c.L.Unlock()
}

As before, it’s critical for correctness to only examine the shared data when a
mutex is held. When a reader comes in, it first checks if writers are waiting to
grab the lock; if yes, it will wait on the condition variable. This is where the
writer preference is embodied – unlike previous implementations, in this one
readers give writers the right of way. When no writers are waiting, the reader
increments the reader count.

When releasing the lock, the last reader out broadcasts on l.c
to wake up any writers that could be waiting [4].

Here’s the writer side:

func (l *WritePreferRWLock) WLock() {
    l.c.L.Lock()
    for l.hasWriter {
        l.c.Wait()
    }
    l.hasWriter = true
    for l.readerCount > 0 {
        l.c.Wait()
    }
    l.c.L.Unlock()
}

func (l *WritePreferRWLock) WUnlock() {
    l.c.L.Lock()
    l.hasWriter = false
    l.c.Broadcast()
    l.c.L.Unlock()
}

A writer starts by checking that no other writer is active. Note that unlike in
previous implementations, here writers don’t hold the mutex between WLock
and WUnlock; instead, the mutex is only used to control access to the shared
struct, and the hasWriter field plays the double role of “writer waiting to
get lock” and “writer is using lock”. Once there are no more writers, it flips
hasWriter and waits for existing readers to clear out. Recall that setting
hasWriter to true guarantees that no new readers will grab the lock.

In WUnlock, the writer unmarks hasWriter and broadcasts on
the condition variable. We have to use Broadcast rather than Signal here
because multiple readers can be waiting and we want to release all of them.

A more efficient writer-preferring RW lock

While relatively easy to understand, the implementation we’ve just seen does
not perform very well – at least in my benchmarks. Therefore, I went looking for
a more sophisticated writer-preferring implementation, and ended up exploring
what Go itself uses for its sync.RWMutex [5]. This is the most complicated
implementation in this post, so don’t be discouraged if you don’t get
it on the first try. Feel free to experiment with the code and add
some logging if you want to get a better grasp of what’s going on.

The goal is to make lock taking – especially in readers – as fast as possible,
while at the same time giving writers the right of way.

As usual, we’ll start with the struct and constructor:

type WritePreferFastRWLock struct {
    w                sync.Mutex
    writerWait       chan struct{}
    readerWait       chan struct{}
    numPending       int32
    readersDeparting int32
}

const maxReaders int32 = 1 << 30

func NewWritePreferFastRWLock() *WritePreferFastRWLock {
    var l WritePreferFastRWLock
    l.writerWait = make(chan struct{})
    l.readerWait = make(chan struct{})
    return &l
}

Now I’ll present the reader methods and explain the relevant fields along the
way:

func (l *WritePreferFastRWLock) RLock() {
    if atomic.AddInt32(&l.numPending, 1) < 0 {
        <-l.readerWait
    }
}

func (l *WritePreferFastRWLock) RUnlock() {
    if r := atomic.AddInt32(&l.numPending, -1); r < 0 {
        if atomic.AddInt32(&l.readersDeparting, -1) == 0 {
            l.writerWait <- struct{}{}
        }
    }
}

The mutex w is not used by readers at all. Its sole purpose is to provide
mutual exclusion between writers, so we’ll get to it later. The most critical
field in this implementation is numPending. It’s used to mark the number
of readers that are using the lock (like readerCount), but is sneakily used
by writers as well. Writers subtract maxReaders from this field, so a
negative value means a writer is using the lock. Access to the field is done
with atomics – so no locking required.

The reader adds 1 to numPending (atomically). If the value of numPending
is not negative, there are no waiting/active writers and the reader can proceed.
This path is expected to be the most common; hence, it’s extremely quick – only
a single atomic add followed by a branch.

If numPending is negative it means a writer is waiting on the lock or using
it, so the reader will give the writer the right of way. This is done by waiting
on an unbuffered channel.

When a reader is done, it decrements numPending. If there are no writers,
it’s done – again, very quick for the common path. If writers are waiting,
the readersDeparting field is used to ensure that only a single reader
releases one waiting writer. readersDeparting is set by a waiting writer
as we’ll soon see.

Writer lock:

func (l *WritePreferFastRWLock) WLock() {
    l.w.Lock()
    r := atomic.AddInt32(&l.numPending, -maxReaders) + maxReaders
    if r != 0 && atomic.AddInt32(&l.readersDeparting, r) != 0 {
        <-l.writerWait
    }
}

The w mutex is used here to ensure that only one writer is
inside the lock at any given time. The second line of the function is a bit
tricky, since it does two things:

  1. It announces to readers that a writer is pending by subtracting
    maxReaders from numPending.
  2. It recovers the number of active readers by adding maxReaders back,
    storing the result in the local variable r.

Then, if there are any active readers (r != 0), it adds their number to
readersDeparting – this lets departing readers know how many of them remain
before the last one releases writerWait. For example, if three readers are
active when the writer arrives, numPending goes from 3 to 3 - maxReaders (a
negative value, so new readers will block), r comes out as 3, and the writer
sleeps on writerWait until the third departing reader sends to it. WLock
returns (holding the lock) once the last departing reader is done. At this
point the writer holds the lock exclusively: other writers are excluded by w,
and new readers see a negative numPending and block on readerWait.

Writer unlock:

func (l *WritePreferFastRWLock) WUnlock() {
    r := atomic.AddInt32(&l.numPending, maxReaders)
    for i := 0; i < int(r); i++ {
        l.readerWait <- struct{}{}
    }
    l.w.Unlock()
}

Once again, the clever usage of numPending takes a bit of time to decipher
here. By adding maxReaders, the writer tells future readers that there are
no more writers using the lock. The remaining r is the number of readers
that accumulated waiting for the lock while this writer was active (review
RLock again). The writer now releases all of them by sending r dummy
objects into readerWait. Finally it unlocks the writer exclusion lock.

Phew, that was quite a journey! While it’s not necessary to fully understand
this version of RW locking to get a good sense of how RW locks work, it can be
quite a rewarding experience – so I do encourage you to play with the code a bit
and study it more.

Appendix: Testing and benchmarking methodology

The testing and benchmarking code can be found here.
Its emphasis is on testing correctness, so it runs a stress test with many
readers and many writers running concurrently. They all have access to shared
data – a long slice of integers. The slice begins in increasing sequence (0, 1,
2, 3, …) and writers increment each element. Readers check for consistency –
verifying that the elements in the slice are in the right order. If the lock
implementation is wrong, readers will observe an inconsistent intermediate
state and fail with an
error. Since all the locks here implement RWLocker, testing them in a
uniform way is very easy – interfaces are great for testing!
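
Here is a rough sketch of that kind of stress test; it is not the actual test
code from the repository. The name stressTest and the specific counts are made
up, and the sync and testing imports are assumed:

// stressTest hammers an RWLocker with concurrent readers and writers over a
// shared slice; readers verify that the slice always stays consistent.
func stressTest(t *testing.T, l RWLocker) {
    shared := make([]int, 100)
    for i := range shared {
        shared[i] = i // increasing sequence: 0, 1, 2, ...
    }

    var wg sync.WaitGroup
    for w := 0; w < 4; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for iter := 0; iter < 1000; iter++ {
                l.WLock()
                for i := range shared {
                    shared[i]++ // preserves the increasing-sequence invariant
                }
                l.WUnlock()
            }
        }()
    }
    for r := 0; r < 16; r++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for iter := 0; iter < 1000; iter++ {
                l.RLock()
                for i := 1; i < len(shared); i++ {
                    if shared[i] != shared[i-1]+1 {
                        t.Error("slice is in an inconsistent state")
                    }
                }
                l.RUnlock()
            }
        }()
    }
    wg.Wait()
}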

Each reader and writer does many iterations, so this is a fairly good stress
test. I verified that minimal changes in the lock implementations cause the
tests to fail immediately, and ran all tests with -count 100 and separately
with -race.

Benchmarking is more tricky. Each reader and each writer record the time it
takes them to acquire the lock, and averages are printed out at the end of the
test. The tricky part here is representing a realistic workload; while my
measurements seem reasonable for my specific testing scenario, there are many
variables that represent workloads – number of writers, number of readers,
how often they access locks, how long their work takes in the critical section
and so on. In short, YMMV.
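
For illustration, the per-reader measurement is conceptually along these lines;
this is only a sketch with the time import assumed, and the real benchmark also
measures writers and reports averages across all goroutines:

// timedReader accumulates the time spent acquiring the reader lock over
// many iterations, so an average acquisition latency can be computed.
func timedReader(l RWLocker, iters int) time.Duration {
    var total time.Duration
    for i := 0; i < iters; i++ {
        start := time.Now()
        l.RLock()
        total += time.Since(start) // time spent waiting for the lock
        // ... read the shared data here ...
        l.RUnlock()
    }
    return total
}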


[1]

Quick disclaimer: in Go, channels are the first thing programmers
reach for when coordinating between multiple goroutines, so locks are
used much less frequently than in other languages. Nevertheless, they
are still useful in many scenarios.

In addition, I’ll be using the term threads to discuss multiple
concurrent execution paths to keep things more familiar for non-Go
programmers.

[2] Here and elsewhere I’m skipping some extra error checking / assertions
to keep the code shorter. Take a look at the accompanying
code sample
for the full version that also includes comments.
[3] If there are two writers waiting on the condition variable when readers
are present, one of them will take the lock and the other may still be
blocked in Wait. Without additional readers signaling the condition
variable, this other writer will be stuck in Wait forever. Therefore
it’s important for writers to signal the condition variable too, when
they’re done.
[4]

The Wikipedia sample I took this from says we should signal the
condition variable, but this could be subtly wrong. While one writer
could be waiting on l.c.Wait for readerCount > 0, another writer
could be waiting for l.hasWriter; in addition, multiple readers can
be waiting on l.c.Wait for l.hasWriter. With a single signal it’s
not defined which goroutine will be woken up.

Interestingly, if we replace the Broadcast by Signal in
RUnlock, it happens to work out fine in Go. That’s because Go’s
sync.Cond maintains a FIFO queue of waiting goroutines, and the
writer that got there first (i.e. the one who toggled hasWriter and
is now waiting for readerCount > 0) will be woken up. However, the
documentation doesn’t guarantee this, so it’s safer (albeit somewhat less
efficient) to use Broadcast.

[5] The following code sample is modified from the standard library
implementation (Go version 1.12), which uses Go runtime-internal
functions and types like runtime_SemacquireMutex; instead, I’ll be
using channels for a similar purpose. While the resulting implementation
is slightly slower than sync.RWMutex, it’s easier to understand.