Original post

Now we can tell which characters are tokens and which are simply whitespace, let’s step up a level and talk about breaking up those tokens.

// Next returns a []byte referencing the the next lexical token in the stream.
// The []byte is valid until Next is called again.
// If the stream is at its end, or an error has occured, Next returns a zero
// length []byte slice.
//
// A valid token begins with one of the following:
//
//  { Object start
//  [ Array start
//  } Object end
//  ] Array End
//  , Literal comma
//  : Literal colon
//  t JSON true
//  f JSON false
//  n JSON null
//  " A string, possibly containing backslash escaped entites.
//  -, 0-9 A number
func (s *Scanner) Next() []byte {
        // release the previous token
        s.br.release(s.pos)
        s.pos = 0

        c := s.token()
        length := 0
        switch c {
        case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
                length = 1
                s.pos = 1
        case True:
                length = validateToken(&s.br, "true")
                s.pos = length
        case False:
                length = validateToken(&s.br, "false")
                s.pos = length
        case Null:
                length = validateToken(&s.br, "null")
                s.pos = length
        case String:
                // string
                length = parseString(&s.br)
                if length < 2 {
                        return nil
                }
                s.pos = length
        case 0:
                // eof
                return nil
        default:
                // ensure the number is correct.
                length = s.parseNumber()
                if length < 0 {
                        return nil
                }
        }
        return s.br.window()[:length]
}

This is the core loop of Scanner.Next. Scanner.Next skips over any intermediate whitespace, determines the token from the first character in the window, then continues to read until the token is read or we hit the end of the input.

Let’s look at how token works, then we’ll talk about some optimisations

func (s *Scanner) token() byte {
        w := s.br.window()
        for {
                for _, c := range w {
                        if whitespace[c] {
                                s.pos++
                                continue
                        }

                        // release whitespace
                        s.br.release(s.pos)
                        s.pos = 0
                        return c
                }
                if s.br.extend() == 0 {
                        // eof
                        return 0
                }
                w = s.br.window()[s.pos:]
        }
}

var whitespace = [256]bool{
        ' ':  true,
        'r': true,
        'n': true,
        't': true,
}
  • We start by getting the current window from the byteReader. This is a []byte slice of all the data that is yet to be read.

  • We’re looking for the first non whitespace character. If the character is a whitespace we increment s.pos to ignore the character and loop around.

  • If we do find a non whitespace character, we release s.pos characters from the front of the window. Now the start of the window is properly aligned with the first character of the token.

  • It turns out that we also get the first character of the token for free, it’s in c, so we can return that as a hint to Scanner.Next.

  • If we run out of characters without hitting a token then we called extend() to grow the window.

  • If we couldn’t grow, then we’ve run out of input and haven’t got a token, so give up.

  • Otherwise update w with a new window.

This is the basic operation of byteReader, we’ll see that pattern repeated across the scanner. Some things to note:

  • Note the lack of error handling, its not part of the inner loop, it only happens when we have to read more data from the underlying reader.

  • extend hides the process of reading into, growing, refilling the buffer, it makes the loop above it–Scanner.token simpler; if there is data in the window, process it, extend if you need too, give up if you can’t extend.

  • release is similar, it shrinks the start of the window to exclude data that we don’t care about. No need to copy or move data around, no need to

  • extend is not in the hot path, so there is no need to optimise it, its performance is a function of the buffer it is given. In practice a buffer of 8k is sufficient.

Let’s talk about the performance of this code.

name                     time/op
Scanner/canada-16         4.41ms ± 2%
Scanner/citm_catalog-16   2.55ms ± 3%
Scanner/twitter-16        1.03ms ± 1%
Scanner/code-16           4.21ms ± 1%
Scanner/example-16        21.4µs ± 1%
Scanner/sample-16          822µs ± 1%

name                     speed
Scanner/canada-16        510MB/s ± 2%
Scanner/citm_catalog-16  677MB/s ± 3%
Scanner/twitter-16       615MB/s ± 1%
Scanner/code-16          461MB/s ± 1%
Scanner/example-16       608MB/s ± 1%
Scanner/sample-16        837MB/s ± 1%

This is a benchmark you saw earlier, minus a few optimisations we’ll talk about next.

Comparing the performance of Scanner.Next to our whitespace benchmark we can see that we’re between 1/4 and 2/5ths of our baseline.

Let’s talk about the first improvement we can make to the code. Note the amount of work being spent to keep s.pos up to date. We know that s.pos is set to 0 just before Scanner.Next calls this function, and we set s.pos to zero on the way out of the function, so the changes we make to s.pos within the function are invisible—​its zero on entry, and zero on exit.

We can rewrite the function to keep a local pos value, which has an impressive effect on token.

func (s *Scanner) token() byte {
        w := s.br.window()
        pos := 0
        for {
                for _, c := range w {
                        if whitespace[c] {
                                pos++
                                continue
                        }

                        // release whitespace
                        s.br.release(pos)
                        return c
                }
                if s.br.extend() == 0 {
                        // eof
                        return 0
                }
                w = s.br.window()[pos:]
        }
}

var whitespace = [256]bool{
        ' ':  true,
        'r': true,
        'n': true,
        't': true,
}
name                     old time/op    new time/op    delta
Scanner/canada-16          4.39ms ± 1%    4.43ms ± 4%     ~     (p=1.000 n=5+5)
Scanner/citm_catalog-16    2.52ms ± 1%    1.80ms ± 4%  -28.46%  (p=0.008 n=5+5)
Scanner/twitter-16         1.03ms ± 2%    0.95ms ± 3%   -7.41%  (p=0.008 n=5+5)
Scanner/code-16            4.24ms ± 2%    4.18ms ± 1%     ~     (p=0.095 n=5+5)
Scanner/example-16         21.4µs ± 1%    18.9µs ± 2%  -11.68%  (p=0.008 n=5+5)
Scanner/sample-16           828µs ± 2%     528µs ± 2%  -36.24%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
Scanner/canada-16         512MB/s ± 1%   509MB/s ± 4%     ~     (p=1.000 n=5+5)
Scanner/citm_catalog-16   685MB/s ± 1%   958MB/s ± 4%  +39.84%  (p=0.008 n=5+5)
Scanner/twitter-16        616MB/s ± 2%   665MB/s ± 3%   +8.01%  (p=0.008 n=5+5)
Scanner/code-16           458MB/s ± 2%   465MB/s ± 1%     ~     (p=0.095 n=5+5)
Scanner/example-16        608MB/s ± 1%   688MB/s ± 2%  +13.23%  (p=0.008 n=5+5)
Scanner/sample-16         831MB/s ± 2%  1303MB/s ± 2%  +56.84%  (p=0.008 n=5+5)

By keeping pos local the compiler avoided those temporary writes back to memory.

The question I have for you is why did this improve some inputs and not others?

The answer, I think, is different inputs have different amounts of whitespace. For example canada only has 33 whitespace characters whereas citm has 1,227,563.

There is a larger improvement we can make for the runtime of this code, and it relates to inlining. Inlining is the process of automatically (or manually) copying the body of a function into, in line with, its caller. This avoids the overhead of the function call.

Usually inlining is performed automatically by the compiler according to a set of rules it controls. The compiler has reasonable support for inlining, but has a number of limitations.

 % go build -gcflags=-m=2 2>&1 | grep cannot | grep -v decoder
./reader.go:31:6: cannot inline (*byteReader).extend: function too complex: cost 198 exceeds budget 80
./scanner.go:99:6: cannot inline (*Scanner).token: unhandled op FOR
./scanner.go:130:6: cannot inline validateToken: unhandled op FOR
./scanner.go:153:6: cannot inline parseString: unhandled op FOR
./scanner.go:182:6: cannot inline (*Scanner).parseNumber: unhandled op FOR
./scanner.go:56:6: cannot inline (*Scanner).Next: function too complex: cost 476 exceeds budget 80

The first is the size of the function, byteReader.extend cannot be inlined because it is too complex. The second is statements within the fuction, Scanner.token cannot be inlined because it contains a for statement. Also note that Scanner.Next, the caller of Scanner.token cannot be inlined because it also too complex.

Let’s go back to the constraints. Scanner.Next is called for each token in the input. This means that Scanner.token is called for each token in the input. Scanner.token cannot be automatically inlined into its called because it is too complex. Therefore we’re paying for an extra function call for each token.

We can remove this overhead by manually inlining Scanner.token into its caller.

func (s *Scanner) Next() []byte {
        // release the previous token
        s.br.release(s.pos)
        w := s.br.window()
        pos := 0
        for {
                for _, c := range w {
                        if whitespace[c] {
                                pos++
                                continue
                        }

                        // release whitespace
                        s.br.release(pos)

                        length := 0
                        switch c {
                        case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
                                length = 1
                                s.pos = 1
                        case True:
                                length = validateToken(&s.br, "true")
                                s.pos = length
                        case False:
                                length = validateToken(&s.br, "false")
                                s.pos = length
                        case Null:
                                length = validateToken(&s.br, "null")
                                s.pos = length
                        case String:
                                // string
                                length = parseString(&s.br)
                                if length < 2 {
                                        return nil
                                }
                                s.pos = length
                        default:
                                // ensure the number is correct.
                                length = s.parseNumber()
                                if length < 0 {
                                        return nil
                                }
                        }
                        return s.br.window()[:length]
                }
                if s.br.extend() == 0 {
                        // eof
                        return nil
                }
                w = s.br.window()[pos:]
        }
}

The results support our thesis:

name                     old time/op    new time/op    delta
Scanner/canada-16          4.36ms ± 1%    3.50ms ± 0%  -19.68%  (p=0.008 n=5+5)
Scanner/citm_catalog-16    1.80ms ± 1%    1.56ms ± 2%  -13.16%  (p=0.008 n=5+5)
Scanner/twitter-16          965µs ± 2%     833µs ± 2%  -13.75%  (p=0.008 n=5+5)
Scanner/code-16            4.15ms ± 1%    3.61ms ± 1%  -12.82%  (p=0.008 n=5+5)
Scanner/example-16         18.9µs ± 2%    16.6µs ± 1%  -12.42%  (p=0.008 n=5+5)
Scanner/sample-16           515µs ± 1%     472µs ± 2%   -8.34%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
Scanner/canada-16         516MB/s ± 1%   642MB/s ± 0%  +24.50%  (p=0.008 n=5+5)
Scanner/citm_catalog-16   960MB/s ± 1%  1105MB/s ± 2%  +15.16%  (p=0.008 n=5+5)
Scanner/twitter-16        654MB/s ± 2%   759MB/s ± 1%  +15.94%  (p=0.008 n=5+5)
Scanner/code-16           468MB/s ± 1%   537MB/s ± 1%  +14.69%  (p=0.008 n=5+5)
Scanner/example-16        689MB/s ± 2%   787MB/s ± 1%  +14.17%  (p=0.008 n=5+5)
Scanner/sample-16        1.33GB/s ± 1%  1.46GB/s ± 1%   +9.11%  (p=0.008 n=5+5)

By saving the function call we’ve improved throughput by 9-24%. The largest improvement comes from canada, which basically contained no whitespace, so the call to Scanner.token almost always returned immediately having done no work, while also paying for all the s.pos and release overhead.

This is where the inner loop of the Scanner stands today. Note that citm is over 50% of the baseline, sample is nearly 75%. To recap the major optimisations were:

  • whitespace[c]

  • Avoiding s.pos updates. They cannot be registerised, CPU has to do a write on every iteration. s.pos updates reduced from one per byte to one per token.

  • Scanner.Next and Scanner.token were effectively one function spread over two. Each are too large to be inlined, so we’re paying for an extra function call per token. Manually inlining them increased the indentation depth of the function, but delivered substantial speedups.

  • Most JSON contains some whitespace, it’s moderately optimised for human readability. It turns out, the more whitespace, the faster pkg/json decodes.