Original post

Or: How adding goroutines can keep your CPU busy shuffling things around.

So when a cached value changes, the whole cache line gets synchronized back to main memory. Likewise, the caches of all other CPU cores that contain this cache line must now also sync this cache line to avoid operating on outdated data.


How does this affect our code? Remember that the concurrent loop uses a global slice to store the intermediate results. The elements of a slice are stored in a contiguous space. With high probability, two adjacent slice elements will share the same cache line.

And now the drama begins.

n CPU cores with n caches repeatedly read from and write to slice elements that are all in the same cache line. So whenever one CPU core updates “its” slice element with a new sum, the cache lines of all other CPU’s get invalidated. The changed cache line must be written back to main memory, and all other caches must update their respective cache line with new data. Even though each core accesses a different part of the slice!

This consumes precious time—more than the time that the serial loop needs for updating its single sum variable.

This is why our concurrent loop needs more time than the serial loop. All the concurrent updates to the slice cause a hectic cache line sync dance.

Please enable JavaScript to view the animation.

Spread the word data!

Now that we know the reason for the surprsing slowdown, the cure is obvious. We have to turn the slice into n individual variables that hopefully will be stored far enough apart from each other so that they do not share the same cache line.

So let’s change our concurrent loop so that each goroutine stores its intermediate sum in a goroutine-local variable. In order to pass the result back to the main goroutine, we also have to add a channel. This in turn allows us to remove the wait group, because channels are not only a means for communication but also an elegant synchronization mechanism.

Concurrent loop with local variables

ChannelSum() spawns n goroutines that store their intermediate sums locally, then pass the result back through a channel.

func ChannelSum() int {
	n := runtime.GOMAXPROCS(0)

A channel of ints will collect all intermediate sums.

	res := make(chan int)

	for i := 0; i < n; i++ {

The goroutine now receives a second parameter, the result channel. The arrow pointing “into” the chan keyword turns this channel into a send-only channel inside this function.

		 func(i int, r chan<- int) {

This local variable replaces the global slice.

			sum := 0

As before, we divide the input into n chunks of equal size.

			start := (limit / n) * i
			end := start + (limit / n)

Calculate the intermediate sum.

			for j := start; j < end; j += 1 {
				sum += j

Pass the final sum into the channel.

			r <- sum

Call the goroutine and pass the CPU index and the channel.

		}(i, res)

	sum := 0

This loop reads n values from the channel. We know exactly how many elements we will receive through the channel, hence we need no

	for i := 0; i < n; i++ {

Read a value from the channel and add it to sum.

The channel blocks when there are no elements to read. This provides a “natural” synchronization mechanism. The loop must wait until there is an element to read, and does not finish before all n elements have been passed through the channel.

		sum += <-res
	return sum

After adding a third benchmark function, BenchmarkChannelSum, to our test file, we can now run the benchmark on all three variants of our loop.

$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/appliedgo/concurrencyslower
BenchmarkSerialSum-4          1       6022493632 ns/op
BenchmarkConcurrentSum-4      1       15828807312 ns/op
BenchmarkChannelSum-4         1       1948465461 ns/op
ok      github.com/appliedgo/concurrencyslower  23.807s

Spreading the intermediate sums across individual local variables, rather than having them in a single slice, definitely helped us escaping the cache sync problem.

However, how can we be sure that the individual variables never share the same cache line? Well, starting a new goroutine allocates between 2KB and 8KB of data on the stack, which is way more than the typical cache line size of 64 bytes. And since the intermediate sum variable is not referenced from anywhere outside the goroutine that creates it, it does not escape to the heap (where it could end up near to one of the other intermediate sum variables). So we can be pretty sure that no two intermediate sum variables will end up in the same cache line.

How to get and run the code

Step 1: go get the code. Note the -d flag that prevents auto-installing
the binary into $GOPATH/bin.

go get -d github.com/appliedgo/concurrencyslower

Step 2: cd to the source code directory.

cd $GOPATH/src/github.com/appliedgo/concurrencyslower

Step 3. Run the benchmark.

go test -bench .

NOTE: As of this writing, Go 1.12 is the current version of Go. If you have the experimental Go modules enabled, go get will fetch the sources into $GOPATH/pkg/mod/github.com/appliedgo/concurrencyslower@v<n,n.n>-<timestamp>-<hash> instead of the aforementioned path.

If $GOPATH is missing, go get uses ~/go or %USERPROFILE%go by default.

Odds and ends

Future CPU architectures and/or future Go runtimes might be able to alleviate this problem. so if you run this code, the benchmarks might not show the same effect as in this article. In this case, consider yourself lucky.

In general, it is not a good idea to have goroutines update a global variable. Remember the Go proverb: Don’t communicate by sharing memory, share memory by communicating.

This blog post was inspired by a discussion thread on Reddit.


Cache coherence on Wikipedia

Happy coding!