
A short talk, with code, on creating and using network namespaces from Go.

This post begins with a question Max Weber asked his students in 1917:

“Does it mean that we, today, for instance, everyone sitting in this hall, have a greater knowledge of the conditions of life under which we exist than has an American Indian or a [Khoisan]? Hardly. Unless he is a physicist, one who rides on the streetcar has no idea how the car happened to get into motion. And he does not need to know. He is satisfied that he may ‘count’ on the behavior of the streetcar, and he orients his conduct according to this expectation; but he knows nothing about what it takes to produce such a car so that it can move.”

Modernity, but software in particular, requires us to ‘count’ on a good many under-understood streetcars. And as software engineers, we’re usually just one twist away from having to fix, to misuse, or to rebuild any one of them.

So in this post, the streetcar is Docker or any other container runtime. And the motor we’ll be misusing, somewhere inside it all, is the network namespace.

Code/TLDR: github.com/hblanks/sketches/2020-02-21-netns

The author/ex-CDN engineer’s work on the hobbyist control plane has continued, reaching the predictable point where one Go program (the “agent”) must create interfaces with arbitrary addresses, and it must keep these addresses separated by tenant. Because the control plane deploys on Linux, we’ll need to use Linux namespaces to do this – the same technology we use every time we start a Docker container or deploy a Kubernetes pod.

Namespaces are the fundamental primitive for ensuring a container has its own filesystem root, its own process ID space, and (barring things like --net host) its own network addresses and interfaces. But, how do you actually work with them? If, instead of riding on the Docker streetcar, you had to create one from your own program, and to make it persist, how would you do it?

I began again with a sketch. The goal: a simple tool that would look like this:

$ ./setns -h
Usage: ./build/setns NAMESPACE COMMAND ARG...
Runs COMMAND in the given named network NAMESPACE,
creating the namespace if it doesn't already exist.

Which is to say, something entirely analogous to the ip netns tool offered by iproute2 (ip-netns(8)). But written in Go, since that’s what the control plane’s written in.

The first step for the tool, assuming the namespace doesn’t already exist, is to create it. Linux provides two system calls for this, both called (as they must be) from an existing process:

  1. clone(2) copies your process into a new namespace and new process, and
  2. unshare(2) moves your existing process into a new namespace.

clone(2) turns out to be a fascinating streetcar of its own: for Linux, it’s the fundamental primitive for creating light-weight processes, generally known as threads. But, because the control plane agent is written in Go, it’s not a good fit: in Go, the Go runtime alone manages threads, scheduling goroutines on them as it sees fit. It thus won’t work for us to start a new thread every time we want to operate on a different namespace.
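For comparison, Go does expose clone(2)'s namespace flags, but only at the moment a child process is started, through syscall.SysProcAttr.Cloneflags. The following is a minimal sketch of that route; the helper name commandInNewNetNS is made up for illustration and is not part of the sketch repository, and actually running the child needs root (or CAP_SYS_ADMIN).

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// commandInNewNetNS (a hypothetical helper) prepares a child process
// that will be cloned into a fresh network namespace. Nothing runs
// until the caller invokes cmd.Run() or cmd.Start().
func commandInNewNetNS(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Ask for a new network namespace when the child is created.
		Cloneflags: syscall.CLONE_NEWNET,
	}
	return cmd
}

func main() {
	// Lists interfaces inside the new namespace (typically just "lo").
	cmd := commandInNewNetNS("ip", "link")
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run:", err)
		os.Exit(1)
	}
}
```

This only places a child in the new namespace, and the namespace disappears when the child exits. It doesn't let the agent itself operate inside the namespace, which is the gap unshare(2) fills.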

In contrast, unshare(2) works within an existing process. A working Go example follows, including the right flags for creating a new network namespace, plus saving off a file descriptor to the original namespace (more on that in a minute):

// Log any new error. For use when closing a file during
// error handling.
func closeFile(f *os.File) {
  if err := f.Close(); err != nil {
    log.Printf("close file error: %v", err)
  }
}

// Unshare into a new namespace, returning the original
// namespace.
func unshare() (*os.File, error) {
  f, err := os.Open("/proc/self/ns/net")
  if err != nil {
    return nil, err
  }
  _, _, e1 := syscall.Syscall(
    syscall.SYS_UNSHARE, syscall.CLONE_NEWNET, 0, 0)
  if e1 != 0 {
    closeFile(f)
    return nil, e1
  }
  return f, nil
}

If instead the network namespace already exists, the tool needs to specify and to join that namespace instead of creating one. For this, Linux provides a slightly different system call, setns(2), which takes two arguments:

  1. “A file descriptor referring to a namespace,” and
  2. A flag specifying which namespaces to change (network, pid, user, cgroup, mount, etc.).

A working Go example for changing the network namespace, given an open file, is:

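// Sets namespace to the given open file.
//
// SYS_SETNS is assumed to be a constant defined elsewhere in
// this package: the frozen syscall package does not export it
// for every architecture.
func setns(f *os.File) error {
  _, _, e1 := syscall.Syscall(
    SYS_SETNS, f.Fd(), syscall.CLONE_NEWNET, 0)
  if e1 != 0 {
    return e1
  }
  return nil
}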

So far, we’ve seen how to create namespaces and how to join existing namespaces, assuming we have a file descriptor to that namespace. But how do we get the file descriptor?

For any namespace attached to a process, it’s very simple: you open the corresponding file under /proc/${PID}/ns, or for your own process, the corresponding file under /proc/self/ns/. (In fact, that’s what we did above with /proc/self/ns/net.)
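The files under /proc/${PID}/ns are symbolic links whose targets look like net:[4026531992], and per namespaces(7) two processes are in the same namespace when those links resolve to the same device and inode. Here is a small sketch for inspecting them; nsPath and parseNSLink are names invented for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// nsPath returns the /proc path for one namespace of one process,
// e.g. nsPath(1234, "net") == "/proc/1234/ns/net".
func nsPath(pid int, kind string) string {
	return fmt.Sprintf("/proc/%d/ns/%s", pid, kind)
}

// parseNSLink splits a link target such as "net:[4026531992]" into
// the namespace type and its inode number.
func parseNSLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return kind, inode, nil
}

func main() {
	// Reading the link needs no special privileges for our own process.
	target, err := os.Readlink(nsPath(os.Getpid(), "net"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "readlink:", err)
		os.Exit(1)
	}
	kind, inode, err := parseNSLink(target)
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse:", err)
		os.Exit(1)
	}
	fmt.Printf("namespace type %s, inode %d\n", kind, inode)
}
```

Printing the inode before and after a call to unshare or setns is a cheap way to confirm that a switch really happened.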

But if we don’t have a process to refer to, we need to do something more complicated: after we create and enter a namespace, we need to save off that namespace to a different path, so we can refer to it later.

File descriptors aren’t something we generally just “save off”. But sudo strace ip netns add ns0 shows us the way. With comments to explicate the text:

# Create a new, empty file, /var/run/netns/ns0
openat(AT_FDCWD, "/var/run/netns/ns0",
  O_RDONLY|O_CREAT|O_EXCL, 000) = 5
close(5) = 0

# Create a new namespace
unshare(CLONE_NEWNET) = 0

# Bind mount the new namespace to /var/run/netns/ns0
mount("/proc/self/ns/net", "/var/run/netns/ns0",
  0x562c7c23d9a5, MS_BIND, NULL) = 0

Normally, a namespace is only supposed to last as long as either (1) there’s at least one process in it, or (2) there’s at least one open file descriptor pointing to that namespace. Bind-mounting the namespace’s file, however, lets us persist it even when there are no processes present and no open file descriptors pointing to it.
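Persistence cuts both ways: a bind-mounted namespace stays around until something unmounts it. A hedged sketch of the reverse operation follows, assuming the same one-file-per-namespace layout; deleteNamespace is a name chosen for this example and does not appear in the sketch repository, and it needs root to succeed.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// deleteNamespace undoes the bind mount at dir/name and then removes
// the placeholder file that served as the mount point.
func deleteNamespace(dir, name string) error {
	nsPath := filepath.Join(dir, name)

	// MNT_DETACH detaches the mount right away, even if it is busy.
	// The namespace itself goes away once nothing else references it.
	if err := syscall.Unmount(nsPath, syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount %s: %w", nsPath, err)
	}
	if err := os.Remove(nsPath); err != nil {
		return fmt.Errorf("remove %s: %w", nsPath, err)
	}
	return nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintf(os.Stderr, "Usage: %s NAMESPACE\n", os.Args[0])
		os.Exit(2)
	}
	// /var/run/netns is the directory used elsewhere in this post.
	if err := deleteNamespace("/var/run/netns", os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Any process still running inside the namespace keeps it alive after the unmount, by rule (1) above.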

The working example, then, for not just creating a namespace, but for “naming” it by bind mounting it into /var/run/netns (an arbitrary path, but the same path used by iproute2), is:

// Enters a new namespace and bind mounts it to
// /var/run/netns/${name}, returning an open file to the
// original namespace.
func createNamespace(name string) (*os.File, error) {
  origFile, err := unshare()
  if err != nil {
    return nil, err
  }

  // mounts /var/run/netns as tmpfs if needed
  if err := mountNamespaceDir(); err != nil {
    closeFile(origFile)
    return nil, err
  }

  // Create an empty file to serve as the bind mount target.
  // Use a separate variable so origFile is not overwritten.
  nsPath := filepath.Join(nsDir, name)
  nsFile, err := os.Create(nsPath)
  if err != nil {
    closeFile(origFile)
    return nil, err
  }
  if err := nsFile.Close(); err != nil {
    closeFile(origFile)
    return nil, err
  }

  err = syscall.Mount("/proc/self/ns/net", nsPath,
    "", syscall.MS_BIND, "")
  if err != nil {
    closeFile(origFile)
    return nil, err
  }
  return origFile, nil
}

And our top level function, which joins a named network namespace (and creates it if necessary) is:

// Opens the file for a given namespace
func openNamespace(name string) (*os.File, error) {
  return os.Open(filepath.Join(nsDir, name))
}

// Sets namespace to /var/run/netns/${name}, creating
// that namespace if necessary.
//
// Returns an open file pointing to the original namespace.
func setNamespace(name string) (*os.File, error) {
  newFile, err := openNamespace(name)
  if os.IsNotExist(err) {
    origFile, err := createNamespace(name)
    if err != nil {
      return nil, err
    }
    return origFile, nil
  } else if err != nil {
    return nil, err
  }

  origFile, err := os.Open("/proc/self/ns/net")
  if err != nil {
    closeFile(newFile)
    return nil, err
  }

  if err := setns(newFile); err != nil {
    closeFile(newFile)
    closeFile(origFile)
    return nil, err
  }

  // The namespace file is no longer needed once we've joined it.
  closeFile(newFile)
  return origFile, nil
}

One last and Go-specific concern: unshare(2) and setns(2) change the namespace of the calling thread only, so the system calls above must all be made from the same OS thread. But in Go, the scheduler can and does move goroutines transparently from one thread to another. So by default, we have no guarantee that our syscalls will all happen on the same thread.

Thankfully, runtime.LockOSThread lets us pin a goroutine to a single thread, and since Go 1.10 the runtime has treated locked threads as unsuitable for reuse, which makes this safe for thread-local state like namespaces. Thus, the top-level function in our example obtains this lock, joins the namespace, and calls exec(2) to replace our process with the user-supplied command:

// Executes a command in a given namespace.
func execNamespace(name string, args []string) error {
  runtime.LockOSThread()
  defer runtime.UnlockOSThread()

  f, err := setNamespace(name)
  if err != nil {
    return err
  }

  arg0, err := exec.LookPath(args[0])
  if err != nil {
    closeFile(f)
    return err
  }
  return syscall.Exec(arg0, args, os.Environ())
}
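The functions above all return a file for the original namespace, but execNamespace never needs it because exec(2) replaces the process. A long-running agent would want to switch back instead. Below is a sketch of that pattern, under the assumption that the enter and restore arguments behave like setNamespace and setns from this post. To keep the example runnable without root, it substitutes stand-ins (fakeEnter, fakeRestore) that only record calls; those names, and withNamespace itself, are invented here.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// withNamespace runs fn between enter and restore on one locked OS
// thread. enter is expected to switch namespaces and return a file for
// the original namespace; restore is expected to switch back to it.
func withNamespace(
	enter func() (*os.File, error),
	restore func(*os.File) error,
	fn func() error,
) (err error) {
	runtime.LockOSThread()
	orig, err := enter()
	if err != nil {
		runtime.UnlockOSThread()
		return err
	}
	defer func() {
		rerr := restore(orig)
		_ = orig.Close()
		if rerr != nil {
			// Deliberately stay locked: the thread is still in the
			// wrong namespace and must not run other goroutines.
			if err == nil {
				err = rerr
			}
			return
		}
		runtime.UnlockOSThread()
	}()
	return fn()
}

// Stand-ins for setNamespace and setns, so the sketch runs unprivileged.
var calls []string

func fakeEnter() (*os.File, error) {
	calls = append(calls, "enter")
	return os.Open(os.DevNull)
}

func fakeRestore(f *os.File) error {
	calls = append(calls, "restore")
	return nil
}

func main() {
	err := withNamespace(fakeEnter, fakeRestore, func() error {
		calls = append(calls, "work")
		return nil
	})
	fmt.Println(calls, err)
}
```

With the real functions, the call would look something like withNamespace(func() (*os.File, error) { return setNamespace("ns0") }, setns, work), where work is whatever the agent needs to do inside the namespace.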

So ends this short sketch and discussion of network namespaces, finished during one more winter visit to London, a home for some three years. Working code remains in:

hblanks/sketches/2020-02-21-netns,

and all comments or corrections are welcome as issues there, by email, or wherever this ends up on lobste.rs.