Original post

Jan 6, 2020 at 12:11AM

Caleb Doxsey

Recently a blog post entitled “Hello World” by Drew DeVault has been making the rounds. He shows a “Hello World” program in several languages, how long those programs take to execute, how large of a binary they produce and how many syscalls they execute.

Some of his conclusions:

Most languages do a whole lot of other crap other than printing out “hello world”, even if that’s all you asked for.

This is more complexity that someone has to debug, more time your users are sitting there waiting for your program, less disk space available for files which actually matter to the user.

This was not an objective test, this is just an approximation that I hope will encourage readers to be more aware of the consequences of their abstractions, and their exponential growth as more layers are added.

I suppose there’s a lot that could be said in response to this, but in this blog post I will make the following arguments:

  1. Performance improvements of interactive programs below a certain threshold are imperceptible and therefore of negligible value.
  2. A minimalistic program does not always lead to better performance.
  3. Compilation involves trade-offs and binary size and the number of syscalls executed are not the only things that matter.
  4. The syscalls emitted by these programming languages are there for a reason and provide significant value to developers and users.

Perceptible Performance Improvements

“Hello World” is a rather strange program to optimize, but since that was what was tested in the original blog post, we’ll run with it. Presumably the purpose of such a program is to print “Hello World” to the screen. It is therefore an interactive program used by people. (As opposed to a program intended to be used by other programs, which we’ll cover in a bit)

Drew suggests that there are several downsides to users for programs written in high-level languages. As we’ll see, those downsides don’t hold up, so let’s take each in turn.

1. Programs are more complex and harder to debug

Certainly the “Hello World” programs produced by higher level programming languages are more complex. But does that mean they’re harder to debug?

No.

In fact a great deal of the “other crap” produced by higher level programming languages is precisely to help with debugging. For example DWARF for ELF binaries:

It is possible to debug programs by inserting code that prints values of selected interesting variables. Indeed, in some situations, such as debugging kernel drivers, this may be the preferred method. There are low­level debuggers that allow you to step through the executable program, instruction by instruction, displaying registers and memory contents in binary.

But it is much easier to use a source­level debugger which allows you to step through a program’s source, set breakpoints, print variable values, and perhaps a few other functions such as allowing you to call a function in your program while in the debugger. The problem is how to coordinate two completely different programs, the compiler and the debugger, so that the program can be debugged.

Introduction to the DWARF Debugging Format

This kind of information is also needed for stack traces. When your program crashes in production a stack trace is exceedingly useful. Without it debugging is much more difficult. (Incidentally all the examples were stripped of debugging information. Good luck debugging!)

Furthermore, utilizing a common programming language means there’s a very high probability that the kinds of problems or bugs you run into have been experienced by other developers. Stack Overflow is an incredibly powerful debugging tool – one you won’t have access to if you write everything in assembly or roll your own “simple” programming language.

2. Users spend more time waiting for your program to complete

Computers are fast – a lot faster than we can possibly perceive. By some estimates human beings perceive two events as instantaneous at around 100ms. To Drew’s point, this is an eternity for a computer. In that time a modern CPU could execute 10 billion instructions.

But regardless of what our computer is doing, if it takes less than 100ms to execute our program, we simply won’t be able to tell the difference between executing 1 instruction or all 10 billion. For an interactive program, this sets a minimum perception threshold under which any further improvements are imperceptible and therefore of negligible value. If there are good reasons to execute the extra instructions, then we can bank on this perception threshold as a way of improving some other aspect of our program, and no one will be the worse off.

All of the example programs except for “Python 3 (PyPy)” are basically within this threshold. No user will notice the difference.

3. Programs use too much disk space

In this day and age a 5MB binary hardly seems like a big deal. Disk space is ubiquitious, even on small hardware.

But suppose we do care about these sizes. Most of the examples given can leverage dynamic libraries (like glibc) or interpreters (like python). These can be shared across multiple programs. Say we write 2 dozen programs in python. Each one of those programs won’t be 15MB. We’ll pay the bulk of that 15MB once, then each program is just the python source code. The way it’s currently presented is disengenous.

Improving Performance by Introducing Complexity

The few milliseconds we save in the assembly version of the program are meaningless in the context of an interactive program. No user will ever notice the difference. But programs are sometimes run by other programs, in which case the differences may be significant because they can add up.

For example we might repeat the execution of our “Hello World” program several thousand times:

seq 10000 | xargs -P1 -n1 /asm/v1 > /dev/null

Since I’m most familiar with we’ll compare the assembly version with the version. The assembly version is significantly faster in this test:

Time for Number of Iterations

The assembly version on my computer takes about 5 seconds to run our “Hello World” program 25,000 times. The Go version takes nearly 15 seconds to do the same thing. So it would appear that our high level programming language is pretty inefficient. A “perfect compiler” should be producing a program 3 times as fast.

But this way of looking at performance is simplistic. Printing “Hello World” 25,000 times is really not very much work. It’s roughly the equivalent of transferring a few hundred KBs of data. Why is it taking 5 seconds to do this?

The reason it’s slow is because there’s a great deal of overhead in creating a process. That is to say, there’s a hidden cost we’ve forgotten to consider, and this cost greatly outweighs anything our program does. Our running time is 25,000 * (Tcreate process + Trun hello world). And because Tcreate process is so high, focusing on improving Trun hello world is fruitless. We need a different strategy.

First we can try hoisting:

Loop-invariant expressions can be hoisted out of loops, thus improving run-time performance by executing the expression only once rather than at each iteration.

Let’s change our programs so we only create one process regardless of how many times we want to print “hello world”.

func main() {
    n, _ := strconv.Atoi(os.Args[1])
    for i := 0; i < n; i++ {
        os.Stdout.Write([]byte("hello world"))
    }
}

And a similar program in assembly:

_start:
    mov rdi, [rsp + 16]
    call atoi

    mov r12, rax

.helloloop:
    mov rdx, len
    mov rsi, msg
    mov rdi, 1
    mov rax, 1
    syscall
    dec r12
    cmp r12, 0
    jg .helloloop

    mov rdi, 0
    mov rax, 60
    syscall

I’m a terrible assembly programmer, so buyer beware.

Now running this code we see dramatically better performance.

Time for Number of Iterations

The assembly version is still outperforming the Go version, but the number of syscalls is roughly the same. Assembly:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.021705           2     10000           write
  0.00    0.000000           0         1           execve
------ ----------- ----------- --------- --------- ----------------
100.00    0.021705                 10001           total

Go:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.012307           1     10000           write
...
------ ----------- ----------- --------- --------- ----------------
100.00    0.012307                 10170           total

It turns out all those extra syscalls only happen when the program first starts up. After that there’s pretty much no difference. I suspect the same will hold with most other programming languages.

Now as it happens strace has turned up another avenue for optimization. Syscalls are relatively expensive operations. I bet if we reduce the number of syscalls we make we can further improve the performance of our program.

So how do we do that?

Our second optimization is to buffer small writes. Memory operations are faster than syscalls, so we can trade a small amount of memory for less syscalls overall. Each write syscall will include multiple copies of “hello world” and the output will be identical.

I’m not going to write this code in assembly – it would involve a lot of tricky memory manipulation that’s easy to get wrong. In Go however it’s two lines of code:

func main() {
    n, _ := strconv.Atoi(os.Args[1])
    w := bufio.NewWriter(os.Stdout)
    defer w.Flush()
    for i := 0; i < n; i++ {
        w.Write([]byte("hello world"))
    }
}

This program is much faster:

Time for Number of Iterations

And the reason why is obvious with strace:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 18.86    0.000083           3        27           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.000440                   196           total

Introducing the complexity and overhead of a high level programming language not only produces more maintainable software, improves security and increases productivity, it can actually improve performance.

If I had taken the time to try and find out what the distinction between the original assembly code and the Go version I wrote was, I would’ve missed much more significant avenues for optimization.

Now this is a toy example, but there are plenty of examples where adding complexity not only isn’t wasteful, but actually improves program performance.

For example many of the Go syscalls are related to handling multiple threads. Nearly all modern CPUs have multiple cores. Are you using them?

One kind of waste is unnecessary software bloat. But another kind of waste is underutilized hardware. 1 core pegged at 100% usage while 15 sit idle is inefficient.

And multithreaded programming is anything but easy. If a high level framework with a runtime makes it easier to write this kind of software than the minute expense of a few milliseconds of startup time seems like an entirely reasonable trade-off.

Compilation Trade-Offs

Implied in the original blog post was the idea that an ideal compiler would produce a minimalistic “Hello World” program. We’ve already seen why some of the “other crap” included in the binary is useful for debugging, and in the next section we’ll take a look at what the rest of the syscalls are doing, but before we move on, there’s another aspect of this which is also important.

Optimization is not free.

Go was born out of the frustration of waiting for large C++ projects to compile.

The consequence of these uncontrolled dependencies and massive scale is that it is impractical to build Google server binaries on a single computer, so a large distributed compilation system was created. With this system, involving many machines, much caching, and much complexity (the build system is a large program in its own right), builds at Google are practical, if still cumbersome.

Even with the distributed build system, a large Google build can still take many minutes. That 2007 binary took 45 minutes using a precursor distributed build system; today’s version of the same program takes 27 minutes, but of the program and its dependencies have grown in the interim. The engineering effort required to scale up the build system has barely been able to stay ahead of the growth of the software it is constructing.

Go at Google: Language Design in the Service of Software Engineering

When Go programs import packages, those packages are compiled into object files and it is not necessary to consult the source code for those packages. Everything needed to compile the final binary can be found by consulting the main program’s source code and the already-compiled dependencies.

A full-program, aggressive optimization step might lead to smaller binaries, but it would do so at a great cost to projects with many dependencies. Incremental compilation is a boon to developer productivity which is also important.

But if the compiler can’t help us here, surely Go’s linker should eliminate code that’s not used in the end. Right?

It does. But it’s hampered by several language features:

  1. Go packages can have init functions which are run the first time a package is imported. Those init functions can in turn use all kinds of functionality within the imported package. As a consequence many imported, unused functions remain because it’s difficult to determine whether or not the init function can be safely removed.

  2. Go has runtime reflection capabilities in the reflect package. An overly aggressive dead-code elimination might ruin our ability to use reflection. (Which in turn could break things like JSON encoding)

Compiler performance and these language features are examples of trade-offs. Perhaps they’ve led to bloated binaries, but when that bloat has so little real-world impact, the trade-offs absolutely make sense.

An Analysis of Syscalls

I’m reminded of Chesterton’s fence:

In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it. The Thing, G. K. Chesterton

Perhaps all that “other crap” in these binaries is there for a reason.

Let’s take a look at some of the system calls produced by a Go “Hello World” program to see what they’re for.

strace: Process 11112 attached
strace: Process 11111 attached
strace: Process 11113 attached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.34    0.000313          78         4           futex
 12.33    0.000071           1       114           rt_sigaction
 11.81    0.000068          34         2           nanosleep
  9.55    0.000055           7         8           sigaltstack
  4.34    0.000025          25         1           readlinkat
  1.91    0.000011           1        11           rt_sigprocmask
  1.39    0.000008           1         7           gettid
  1.04    0.000006           6         1           write
  1.04    0.000006           1         8           mmap
  1.04    0.000006           2         3           fcntl
  0.69    0.000004           1         3           clone
  0.52    0.000003           1         4           arch_prctl
  0.00    0.000000           0         1           read
  0.00    0.000000           0         1           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000576                   171           total

Scheduling

The first thing we notice is that Go programs use multiple threads. Go includes a runtime with a sophisticated userspace M:N work-stealing scheduler.

Although there are certainly trade-offs with the approach Go has taken, it does bring several benefits. Chief among them is a unified programming model where the scheduler can know the intricacies of channel operations, timers/tickers, syscalls, asynchronous network frameworks, etc… and use the appropriate mechanism in the appropriate circumstance.

Having the runtime scheduler available means we don’t have two different programming APIs (blocking world and async world), or to put it another way, two different colors.

Now theoretically the Go compiler could ditch this functionality if it determined it wasn’t going to be used, but Go is a concurrent programming language designed to use multiple threads. A common, uniform implementation is more important than optimizing “Hello World”. (though if you really want to go down this route, check out TinyGo)

Many of the above syscalls are related to scheduling:

  • sched_getaffinity is used to determine how many CPUs are available. The scheduler needs this information to determine how many threads to start.

  • clone, arch_prctl, rt_sig_procmask are used as part of thread creation:

    sigprocmask(_SIG_SETMASK, &sigset_all, &oset)
    ret := clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
    sigprocmask(_SIG_SETMASK, &oset, nil)
    
  • gettid gets a thread’s id. The comments state:

    // Cgo-created threads and the bootstrap m are missing a
    // procid. We need this for asynchronous preemption and its
    // useful in debuggers.
    
  • nanosleep is used in several places to put a thread to sleep momentarily

Garbage Collection

The Go runtime includes an automatic garbage collector. Many data structures in Go are allocated dynamically on the heap and when no longer used need to be cleaned up to free memory.

Automatic garbage collection has many benefits, but I’ll just mention one: It prevents many types of common bugs found in software that uses manual memory management. These bugs are a frequent source of security issues:

Since 2004, the Microsoft Security Response Centre (MSRC) has triaged every reported Microsoft security vulnerability. From all that triage one astonishing fact sticks out: as Matt Miller discussed in his 2019 presentation at BlueHat IL, the majority of vulnerabilities fixed and with a CVE assigned are caused by developers inadvertently inserting memory corruption bugs into their C and C++ code. As Microsoft increases its code base and uses more Open Source Software in its code, this problem isn’t getting better, it’s getting worse. And Microsoft isn’t the only one exposed to memory corruption bugs—those are just the ones that come to MSRC.

~70% of the vulnerabilities Microsoft assigns a CVE each year continue to be memory safety issues

A proactive approach to more secure code

Here are some of the syscalls related to garbage collection:

  • openat, close and read are used to determine the transparent huge page size in linux. Apparently openat is used because open doesn’t work properly in Android. Also there’s this comment:

    // By default, Linux's "transparent huge page" support will
    // merge pages into a huge page if there's even a single
    // present regular page, undoing the effects of madvise(adviseUnused)
    // below. On amd64, that means khugepaged can turn a single
    // 4KB page to 2MB, bloating the process's RSS by as much as
    // 512X. (See issue #8832 and Linux kernel bug
    // https://bugzilla.kernel.org/show_bug.cgi?id=93111)
    

    I have no idea what that means, but it’s one less thing I have to worry about.

Stdin, Stdout, Stderr

The 3 fnctl come from:

var (
    Stdin  = NewFile(uintptr(syscall.Stdin), "/dev/stdin")
    Stdout = NewFile(uintptr(syscall.Stdout), "/dev/stdout")
    Stderr = NewFile(uintptr(syscall.Stderr), "/dev/stderr")
)

In Unix NewFile checks to see if the file descriptor is in non-blocking mode, as per the comment:

// On Unix systems, if the file descriptor is in
// non-blocking mode, NewFile will attempt to return a pollable File
// (one for which the SetDeadline methods work).

That seems useful.

Signals

The rt_sigaction and sigaltstack syscalls are related to signal handling. The docs state:

By default, a synchronous signal is converted into a run-time panic. A SIGHUP, SIGINT, or SIGTERM signal causes the program to exit. A SIGQUIT, SIGILL, SIGTRAP, SIGABRT, SIGSTKFLT, SIGEMT, or SIGSYS signal causes the program to exit with a stack dump. A SIGTSTP, SIGTTIN, or SIGTTOU signal gets the system default behavior (these signals are used by the shell for job control). The SIGPROF signal is handled directly by the Go runtime to implement runtime.CPUProfile. Other signals will be caught but no action will be taken.

If the Go program is started with either SIGHUP or SIGINT ignored (signal handler set to SIG_IGN), they will remain ignored.

If the Go program is started with a non-empty signal mask, that will generally be honored. However, some signals are explicitly unblocked: the synchronous signals, SIGILL, SIGTRAP, SIGSTKFLT, SIGCHLD, SIGPROF, and, on GNU/Linux, signals 32 (SIGCANCEL) and 33 (SIGSETXID) (SIGCANCEL and SIGSETXID are used internally by glibc). Subprocesses started by os.Exec, or by the os/exec package, will inherit the modified signal mask.

os/signal

I completely forgot to mention profiling earlier. Profiling is also useful.

Executable Path

The readlinkat syscall appears to be happening from an initializer in the os package:

// We query the executable path at init time to avoid the problem of
// readlink returns a path appended with " (deleted)" when the original
// binary gets deleted.
var executablePath, executablePathErr = func() (string, error) {
    var procfn string
    switch runtime.GOOS {
    default:
        return "", errors.New("Executable not implemented for " + runtime.GOOS)
    case "linux", "android":
        procfn = "/proc/self/exe"
    case "netbsd":
        procfn = "/proc/curproc/exe"
    case "dragonfly":
        procfn = "/proc/curproc/file"
    }
    return Readlink(procfn)
}()

Kinda neat that it handles the binary getting deleted. This apparently came as a result of this issue.

And that should just about do it for the syscall investigation. As expect all these syscalls are here for good reasons.

Conclusion

So to recap:

  1. Decreasing binary size and the number of syscalls executed in a “Hello World” program results in imperceptible performance improvements.
  2. The tunnel vision produced by a relentless pursuit of software minimalism can blind us to better ways of improving performance.
  3. Compilation involves trade-offs in many dimensions.
  4. The extra syscalls in a Go “Hello World” program have good reasons for being there and provide meaningful value for developers and users.