
Welcome to LWN.net


AWK is a text-processing language with a history spanning more than 40 years. It has a POSIX standard, several conforming implementations, and is still surprisingly relevant in 2020 — both for simple text processing tasks and for wrangling “big data”. The recent release of GNU Awk 5.1 seems like a good reason to survey the AWK landscape, see what GNU Awk has been up to, and look at where AWK is being used these days.

The language was created at Bell Labs in 1977. Its name comes from the initials of the original authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. A Unix tool to the core, AWK is designed to do one thing well: to filter and transform lines of text. It’s commonly used to parse fields from log files, transform output from other tools, and count occurrences of words and fields. Aho summarized AWK’s functionality succinctly:

AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.
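To make that description concrete, here is a minimal sketch of a program with several pattern-action pairs. The three input lines and the variable names are made up for illustration; the point is only that every pattern is tested against every line, so a single line can trigger more than one action.

```shell
# Three made-up input lines are piped into a program with three pattern-action pairs.
counts=$(printf 'error disk full\ninfo started\nerror timeout\n' | awk '
    /error/ { errors++ }     # runs for lines containing "error"
    NF == 2 { short++ }      # runs for lines with exactly two fields
            { lines++ }      # no pattern: runs for every line
    END     { print lines " lines, " errors " errors, " short " two-field lines" }
')
echo "$counts"
```

The last input line matches all three patterns, so all three counters are incremented for it.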

AWK programs are often one-liners executed directly from the command line. For example, to calculate the average response time of GET requests from some hypothetical web server log, you might type:

    $ awk '/GET/ { total += $6; n++ } END { print total/n }' server.log 
    0.0186667

This means: for all lines matching the regular expression /GET/, add up the response time (the sixth field or $6) and count the line; at the end, print out the arithmetic mean of the response times.
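A slightly more defensive variant might look like the sketch below. The log layout (method in the third field, response time in the sixth) and the sample lines are assumptions for illustration, since the article's log is hypothetical. Comparing the method field exactly avoids counting lines that merely contain "GET" somewhere, and the check in END avoids a division by zero when nothing matched.

```shell
# Build a small sample log; the field layout is an assumption for this sketch.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - GET /index.html 200 0.012
10.0.0.2 - POST /GETTING-started 201 0.300
10.0.0.3 - GET /about.html 200 0.020
10.0.0.1 - GET /missing 404 0.024
EOF

# Match the method field exactly instead of searching the whole line,
# and only divide when at least one line was counted.
avg=$(awk '
    $3 == "GET" { total += $6; n++ }
    END { if (n > 0) print total/n; else print "no GET requests" }
' "$log")
echo "$avg"
rm -f "$log"
```

With the plain /GET/ pattern, the POST line above would also be counted because its path contains "GET".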

The various AWK versions

There are three main versions of AWK in use today, and all of them conform to the POSIX standard (closely enough, at least, for the vast majority of use cases). The first is classic awk, the version of AWK described by Aho, Weinberger, and Kernighan in their book The AWK Programming Language. It’s sometimes called “new AWK” (nawk) or “one true AWK”, and it’s now hosted on GitHub. This is the version pre-installed on many BSD-based systems, including macOS (though the version that comes with macOS is out of date, and worth upgrading).

The second is GNU Awk (gawk), which is by far the most featureful and actively maintained version. Gawk is usually pre-installed on Linux systems and is often the default awk. It is easy to install on macOS using Homebrew and Windows binaries are available as well. Arnold Robbins has been the primary maintainer of gawk since 1994, and continues to shepherd the language (he has also contributed many fixes to the classic awk version). Gawk has many features not present in awk or the POSIX standard, including new functions, networking facilities, a C extension API, a profiler and debugger, and most recently, namespaces.

The third common version is mawk, written by Michael Brennan. It is the default awk on Ubuntu and Debian Linux, and is still the fastest version of AWK, with a bytecode compiler and a more memory-efficient value representation. (Gawk has also had a bytecode compiler since 4.0, so it’s now much closer to mawk’s speed.)

If you want to use AWK for one-liners and basic text processing, any of the above are fine variants. If you’re thinking of using it for a larger script or program, Gawk’s features make it the sensible choice.

There are also several other implementations of AWK with varying levels of maturity and maintenance, notably the size-optimized BusyBox version used in embedded Linux environments, a Java rewrite with runtime access to Java language features, and my own GoAWK, a POSIX-compliant version written in Go. The three main AWKs and the BusyBox version are all written in C.

Gawk changes since 4.0

It’s been almost 10 years since LWN covered the release of gawk 4.0. It would be tempting to say “much has changed since 2011”, but the truth is that things move relatively slowly in the AWK world. I’ll describe the notable features since 4.0 here, but for more details you can read the full 4.x and 5.x changelogs. Gawk 5.1.0 came out just over a month ago on April 14.

The biggest user-facing feature is the introduction of namespaces in 5.0. Most modern languages have some concept of namespaces to make it easier to ship large projects and libraries without name clashes. Gawk 5.0 adds namespaces in a backward-compatible way, allowing developers to create libraries, such as this toy math library:

    # area.awk
    @namespace "area"

    BEGIN {
        pi = 3.14159  # namespaced "constant"
    }

    function circle(radius) {
        return pi*radius*radius
    }

To refer to variables or functions in the library, use the namespace::name syntax, similar to C++:

    $ gawk -f area.awk -e 'BEGIN { print area::pi, area::circle(10) }'
    3.14159 314.159

Robbins believes that AWK’s lack of namespaces is one of the key reasons it hasn’t caught on as a larger-scale programming language and that this feature in gawk 5.0 may help resolve that. The other major issue Robbins believes is holding AWK back is the lack of a good C extension interface. Gawk’s dynamic extension interface was completely revamped in 4.1; it now has a defined API and allows wrapping existing C and C++ libraries so they can be easily called from AWK.

The following code snippet from the example C-code wrapper in the user manual populates an AWK array (a string-keyed hash table) with a filename and values from a stat() system call:

    /* empty out the array */
    clear_array(array);

    /* fill in the array */
    array_set(array, "name", make_const_string(name, strlen(name), &tmp));
    array_set_numeric(array, "dev", sbuf->st_dev);
    array_set_numeric(array, "ino", sbuf->st_ino);
    array_set_numeric(array, "mode", sbuf->st_mode);

Another change in the 4.2 release (and continued in 5.0) was an overhauled source code pretty-printer. Gawk’s pretty-printer enables its use as a standardized AWK code formatter, similar to Go’s go fmt tool and Python’s Black formatter. For example, to pretty-print the area.awk file from above:

    $ gawk --pretty-print -f area.awk

which results in the following output:

    @namespace "area"

    BEGIN {
        pi = 3.14159    # namespaced "constant"
    }


    function circle(radius)
    {
        return (pi * radius * radius)
    }

You may question the tool’s choices: why does “BEGIN {” not have a line break before the “{” when the function does? (It turns out AWK syntax doesn’t allow that.) Why two blank lines before the function and parentheses around the return expression? But at least it’s consistent and may help avoid code-style debates.

Gawk has long allowed a limited amount of runtime type inspection, and it extended that with the addition of the typeof() function in 4.2. typeof() returns a string constant like “string”, “number”, or “array” depending on the input type. This kind of type check is important for code that recursively walks every item of a nested array, for example (which is something that POSIX AWK can’t do).
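As a rough sketch of such a recursive walk, the following uses typeof() to decide whether to descend into an element. It needs gawk 4.2 or later; the walk() function, the data array, and its contents are invented for this example, and the shell guard simply skips the demonstration when gawk is not installed.

```shell
# Requires gawk 4.2 or later for typeof(); skipped when gawk is not installed.
walk_out=""
if command -v gawk >/dev/null 2>&1; then
    walk_out=$(gawk '
        # Print every scalar in a possibly nested array, one "path = value" per line.
        function walk(arr, prefix,    key) {
            for (key in arr) {
                if (typeof(arr[key]) == "array")
                    walk(arr[key], prefix key ".")   # descend into the subarray
                else
                    print prefix key " = " arr[key]
            }
        }
        BEGIN {
            data["name"] = "gawk"
            data["version"]["major"] = 5
            data["version"]["minor"] = 1
            PROCINFO["sorted_in"] = "@ind_str_asc"   # deterministic traversal order
            walk(data, "")
        }
    ')
    printf '%s\n' "$walk_out"
fi
```

Setting PROCINFO["sorted_in"] is only there to make the output order predictable; without it, the for-in loop visits keys in an unspecified order.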

With 4.2, gawk also supports regular expression constants as a first-class data type using the syntax @/foo/. Previously you could not store a regular expression constant in a variable; now you can, and typeof(@/foo/) returns the string “regexp”. In terms of performance, gawk 4.2 brings a significant improvement on Linux systems by using fwrite_unlocked() when it’s available. As gawk is single-threaded, it can use the non-locking stdio functions, giving a 7-18% increase in raw output speed, for example when running gawk '{ print }' on a large file.
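A small sketch of what that looks like in practice, again assuming gawk 4.2 or later (the variable names are made up, and the guard skips the demonstration when gawk is missing):

```shell
# Requires gawk 4.2 or later; skipped when gawk is not installed.
re_out=""
if command -v gawk >/dev/null 2>&1; then
    re_out=$(gawk 'BEGIN {
        digits = @/^[0-9]+$/        # a regexp constant stored in a variable
        print typeof(digits)
        all_digits = ("42" ~ digits)
        has_letter = ("4x2" ~ digits)
        print all_digits, has_letter
    }')
    printf '%s\n' "$re_out"
fi
```

The variable can then be used on the right-hand side of the ~ operator like any other regular expression.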

The GNU Awk User’s Guide has always been a thorough reference, but it was substantially updated in 4.1 and again in the 5.x releases, including new examples, summary sections, and exercises, along with some major copy editing.

Last (and also least), a subtle change in 4.0 that I found amusing was the reverted handling of backslash in sub() and gsub(). Robbins writes:

The default handling of backslash in sub() and gsub() has been reverted to the behavior of 3.1. It was silly to think I could break compatibility that way, even for standards compliance.

The sub and gsub functions are core regular expression substitution functions, and even a small “fix” to the complicated handling of backslash broke people’s code:

When version 4.0.0 was released, the gawk maintainer made the POSIX rules the default, breaking well over a decade’s worth of backward compatibility. Needless to say, this was a bad idea, and as of version 4.0.1, gawk resumed its historical behavior, and only follows the POSIX rules when --posix is given.

Robbins may have had a small slip in judgment with the original change, but it’s obvious he takes backward compatibility seriously. Especially for a popular tool like gawk, sometimes it is better to continue breaking the specification than to change how something has always worked.
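For readers who haven’t run into this corner of the language, the sketch below shows the two simple cases that, as far as I know, all of the common AWK versions agree on: a bare & in the replacement stands for the matched text, and a backslash in front of it produces a literal ampersand. The input string is made up, and the disputed behavior involves longer runs of backslashes, which this example deliberately avoids.

```shell
sub_out=$(echo 'a.b.c' | awk '{
    wrapped = $0
    gsub(/\./, "[&]", wrapped)    # & is replaced by the matched text
    literal = $0
    gsub(/\./, "\\&", literal)    # the string \& yields a literal ampersand
    print wrapped
    print literal
}')
printf '%s\n' "$sub_out"
```

Note that the replacement is written "\\&" in the source because the string itself goes through one round of backslash processing before gsub() sees it.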

Is AWK still relevant?

Asking if AWK is still relevant is a bit like asking if air is still relevant: you may not see it, but it’s all around you. Many Linux administrators and DevOps engineers use it to transform data or diagnose issues via log files. A version of AWK is installed on almost all Unix-based machines. In addition to ad-hoc usage, many large open-source projects use AWK somewhere in their build or documentation tooling. To name just a few examples: the Linux kernel uses it in the x86 tooling to check and reformat objdump files, Neovim uses it to generate documentation, and FFmpeg uses it for building and testing.

AWK build scripts are surprisingly hard to kill, even when people want to: in 2018 LWN wrote about GCC contributors wanting to replace AWK with Python in the scripts that generate its option-parsing code. There was some support for this proposal at the time, but apparently no one volunteered to do the actual porting, and the AWK scripts live on.

Robbins argues in his 2018 paper for the use of AWK (specifically gawk) as a “systems programming language”, in this context meaning a language for writing larger tools and programs. He outlines the reasons he thinks it has not caught on, but Kernighan is “not 100% convinced” that the lack of an extension mechanism is the main reason AWK isn’t widely used for larger programs. He suggested that it might be due to the lack of built-in support for access to system calls and the like. But none of that has stopped several people from building larger tools: Robbins’ own TexiWeb Jr. literate programming tool (1300 lines of AWK), Werner Stoop’s d.awk tool that generates documentation from Markdown comments in source code (800 lines), and Translate Shell, a 6000-line AWK tool that provides a fairly powerful command-line interface to cloud-based translation APIs.

Several developers in the last few years have written about using AWK in their “big data” toolkit as a much simpler (and sometimes faster) tool than heavy distributed computing systems such as Spark and Hadoop. Nick Strayer wrote about using AWK and R to parse 25 terabytes of data across multiple cores. Other big data examples are the tantalizingly-titled article by Adam Drake, “Command-line Tools can be 235x Faster than your Hadoop Cluster”, and Brendan O’Connor’s “Don’t MAWK AWK – the fastest and most elegant big data munging language!”
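The general shape of that approach can be sketched in a few lines of shell. This is not taken from any of those articles; the file names and data are invented, and a real pipeline would split a large input into chunks first. Each chunk is summarized by its own awk process running in the background, and a final awk pass merges the partial sums.

```shell
# Two tiny made-up "chunks" of key/value data stand in for a large split input.
work=$(mktemp -d)
printf 'a 1\nb 2\n' > "$work/part1"
printf 'a 3\nc 4\n' > "$work/part2"

# "Map" step: one background awk process per chunk, each writing partial sums.
for part in "$work"/part*; do
    awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' "$part" > "$part.out" &
done
wait    # block until every background job has finished

# "Reduce" step: merge the partial sums with the same program, then sort for stable output.
totals=$(cat "$work"/part*.out |
    awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' | sort)
printf '%s\n' "$totals"
rm -rf "$work"
```

Because summing is associative, the same AWK program serves as both the per-chunk step and the merge step.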

Between ad-hoc text munging, build tooling, “systems programming”, and big data processing — not to mention text-mode first person shooters — it seems that AWK is alive and well in 2020.

[Thanks to Arnold Robbins for reviewing a draft of this article.]

Index entries for this article
GuestArticles Hoyt, Ben