Original post

Suggested title: “Beating” C with 400 lines of unoptimized


Earlier this week I ran into a fun quick blog post named
Beating C with 70 lines of Go,
which reimplements the basic functionality of wc in using various
approaches and compares their performance. Apparently it’s inspired by an
earlier Haskell-based post and several
other offshoots.

This reminded me of my earlier post about reimplementing wc in pure x64
assembly
, where I
also measured the performance of my program against wc.

The optimized approach taken in the Go implementation is very similar to what I
did in assembly, so it seemed like an interesting comparison. I started by
generating a ~580 MiB file using xmlgen
and ran the various versions against each other:

  • LC_TYPE=POSIX wc: 2.13 sec
  • wc-naive.go: 3.53 sec
  • wc-chunks.go: 1.37 sec
  • wcx64: 1.2 sec

Note the LC_TYPE setting for the system’s wc. This is important for a
fair comparison, because without this wc will attempt to do utf-8
decoding on all bytes in the file, which results in significant slowdowns. Since
the Go versions use byte-counts and so does my wcx64, I force the comparison
to be fair. In fact, this isn’t a bad result for Go – the straightforward
solution is almost as fast as the same approach direct-coded in assembly!

The Go blog post follows with parallelized versions which are much faster than
the serial one, but I’m excluding it here because all the other competitors are
single-threaded. This is not a serious benchmark anyway. If you prefer to
be serious, this response using SIMD-optimized C blows everything out of the water:

  • fastlwc: 0.11 sec

The conclusion? Well, there’s no real conclusion here, beyond that coding
exercises like this are fun in any language 🙂