Awk: Difference between revisions

From Alpine Linux
(→‎Statements: linebreaks ok after colon)
m (add some line breaks)
 
(19 intermediate revisions by 4 users not shown)
Line 1: Line 1:
This page compares BusyBox's implementation of awk (using [http://git.alpinelinux.org/cgit/aports/tree/main/busybox/busyboxconfig?id=v2.3.6 this] config file) with gawk (versions >= 3.1.8) and FreeBSD 9's nawk, which is based on [http://www.cs.bell-labs.com/cm/cs/awkbook/index.html Bell Labs/Brian Kernighan's 2007 version of awk]. <!-- his newer site is at http://www.cs.princeton.edu/~bwk/btl.mirror/ -->
{{DISPLAYTITLE:awk}}
 
This page compares BusyBox's implementation of awk (using [https://git.alpinelinux.org/cgit/aports/tree/main/busybox/busyboxconfig?id=v2.3.6 this] config file) with gawk (versions >= 3.1.8) and FreeBSD 9's nawk, which is based on [https://ia903404.us.archive.org/0/items/pdfy-MgN0H1joIoDVoIC7/The_AWK_Programming_Language.pdf Bell Labs/Brian Kernighan's 2007 version of awk].  


It's not intended as a tutorial, but rather as a technical summary and list of "gotchas" (places where different implementations may behave in different or unexpected ways).
It's not intended as a tutorial, but rather as a technical summary and list of "gotchas" (places where different implementations may behave in different or unexpected ways).
{{note|This summary was drawn up back when we were using uClibc. Some of the behavior of Alpine's version of BusyBox awk reported here may have changed with the switch to musl.}}


__TOC__
__TOC__
Line 40: Line 44:
:or like this:
:or like this:
:{{Cat|bar.awk|
:{{Cat|bar.awk|
#!/usr/bin/runawk -f /usr/share/awkenough/library.awk -f
#!/usr/bin/runawk -f /usr/share/awkenough/library.awk
...
...
}}
}}
Line 51: Line 55:
** <code>--traditional</code> or <code>--compat</code>: makes gawk behave like nawk
** <code>--traditional</code> or <code>--compat</code>: makes gawk behave like nawk
** <code>--posix</code>: like <code>--traditional --re-interval</code>, with some further restrictions
** <code>--posix</code>: like <code>--traditional --re-interval</code>, with some further restrictions
** <code>--exec scriptfile</code>: stops option processing and disables any further <code>-v<code>s; meant to be used in shebang line for scripts in untrusted environments (cgi)
** <code>--exec scriptfile</code>: stops option processing and disables any further <code>-v</code>s; meant to be used in shebang line for scripts in untrusted environments (cgi)
** <code>--sandbox</code>: disable <code>system("foo")</code>, <code>"foo"|getline</code>, <code>getline <"file"</code>, <code>print|"foo"</code>, <code>print >"file"</code>, and gawk's <code>extension(obj, func)</code>; so only local resources explicitly specified in ARGV will be accessible. This feature is only available on recent versions of gawk.
** <code>--sandbox</code>: disable <code>system("foo")</code>, <code>"foo"|getline</code>, <code>getline <"file"</code>, <code>print|"foo"</code>, <code>print >"file"</code>, and gawk's <code>extension(obj, func)</code>; so only local resources explicitly specified in ARGV will be accessible. This feature is only available on recent versions of gawk.


Line 62: Line 66:
; function definitions
; function definitions


* Functions have global scope, and their invocations can syntactically precede their definitions.
* Functions have global scope, and they can be called from positions that syntactically precede their definitions.


* Scalars (that is, non-arrays) are passed by value, arrays are passed by reference.
* Scalars (that is, non-arrays) are passed by value, arrays are passed by reference.


* Function calls can be nested or recursive; upon return, values of the scalar parameters the caller supplied shall be unchanged (array parameters may have been mutated). So each invocation of the function has its own local environment.
* Function calls can be nested or recursive. Each invocation of the function has its own local environment: so upon return, values of the scalar parameters the caller supplied shall be unchanged (array parameters may have been mutated).


<ul><li>
<ul><li>
Line 84: Line 88:
Here the variables <code>x</code> and <code>y</code> are local to foo, and the variable <code>z</code> is local to bar. All other variable references are ''global'': contrast the shell, where non-local variable references are instead ''dynamic''. That is, in the shell, when foo calls bar, bar will be incrementing  foo's local variable <code>y</code>. In awk, on the other hand, bar always only increments a global variable <code>y</code>.
Here the variables <code>x</code> and <code>y</code> are local to foo, and the variable <code>z</code> is local to bar. All other variable references are ''global'': contrast the shell, where non-local variable references are instead ''dynamic''. That is, in the shell, when foo calls bar, bar will be incrementing  foo's local variable <code>y</code>. In awk, on the other hand, bar always only increments a global variable <code>y</code>.


Note also that the number of arguments in a function call needn't match the number of parameters in the function's definition; if more arguments are supplied, they will be ignored; if fewer are supplied (as in foo's invocation of bar), the missing variables are assigned either "" or an empty array, depending on how they're used in the function.
Note also that the number of arguments in a function call needn't match the number of parameters in the function's definition; if more arguments are supplied, they will be ignored; if fewer are supplied (as in foo's call of bar), the missing variables are assigned either "" or an empty array, depending on how they're used in the function.


Note also that the <code>return</code> statement needn't be given an explicit return value (it will default to ""). The <code>return</code> statement can also be omitted altogether.
Note also that the <code>return</code> statement needn't be given an explicit return value (it will default to ""). The <code>return</code> statement can also be omitted altogether.
Line 98: Line 102:
* If there are only BEGIN blocks (no main loop or END blocks), and getline isn't used, then POSIX requires awk to terminate without reading stdin or any file operands. The implementations I checked honor this, but some historical implementations discarded input in this case. For portability, you can explicitly append <code>< /dev/null</code> to an invocation of awk you know shouldn't consume stdin.
* If there are only BEGIN blocks (no main loop or END blocks), and getline isn't used, then POSIX requires awk to terminate without reading stdin or any file operands. The implementations I checked honor this, but some historical implementations discarded input in this case. For portability, you can explicitly append <code>< /dev/null</code> to an invocation of awk you know shouldn't consume stdin.


* If there are END block(s), and awk's processing of stdin hasn't been suppressed by encountering files in its ARGV list, then stdin will be processed before executing the END block(s).
* If there is a main loop or END blocks, and awk's processing of stdin hasn't been suppressed by encountering files in its ARGV list, then stdin will be processed before executing any END block(s).


* If there are END block(s), but <code>exit <var>status</var></code> is invoked outside them, then the main loop stops and control passes immediately to the first END block. If <code>exit <var>status</var></code> is invoked within an END block, or if there are no end blocks, the script terminates with result <var>status</var>.
* If there are END block(s), but <code>exit <var>status</var></code> is invoked outside them, then the main loop stops and control passes immediately to the first END block. If <code>exit <var>status</var></code> is invoked within an END block, or if there are no end blocks, the script terminates with result <var>status</var>.
Line 149: Line 153:
  @include "filename"
  @include "filename"


This is only available in some versions of gawk. See also Aleksey Cheusov's [http://runawk.sourceforge.net/ Runawk], which uses a different approach.
This is only available in some versions of gawk. See also Aleksey Cheusov's [https://runawk.sourceforge.net/ Runawk], which uses a different approach.


=== Statements ===
=== Statements ===
Line 168: Line 172:


  <var>lvalue</var> = <var>expr</var>
  <var>lvalue</var> = <var>expr</var>
  <var>lvalue</var> += 1
  <var>lvalue</var> += 1 (similarly for -= *= /= %= ^=)
  <var>lvalue</var> ++
  <var>lvalue</var> ++ (similarly for --)
  ++ <var>lvalue</var>
  ++ <var>lvalue</var> (similarly for --)


<var>lvalue</var> can take any of the forms:
<var>lvalue</var> can take any of the forms:
Line 178: Line 182:
* <code>$<var>expr</var></code>
* <code>$<var>expr</var></code>


Like function-invocations, assignments always evaluate to a value, so can be used as expressions, but also can occur in statement contexts, even in nawk.
Like function calls, assignments always evaluate to a value, so can be used as expressions, but also can occur in statement contexts, even in nawk.


  delete <var>arrayname</var>[<var>expr</var>]
  delete <var>arrayname</var>[<var>expr</var>]
Line 229: Line 233:
This reads the next line of input and restarts the main loop from the first rule. For example, to treat the first line of awk's input specially, you can do this:
This reads the next line of input and restarts the main loop from the first rule. For example, to treat the first line of awk's input specially, you can do this:


  NR=1 {...<var>handle first line</var>...; next }
  NR==1 {...<var>handle first line</var>...; next }
      {...<var>handle other lines</var>...}
      {...<var>handle other lines</var>...}


POSIX doesn't specify the behavior of <code>next</code> in BEGIN or END blocks.
POSIX doesn't specify the behavior of <code>next</code> in BEGIN or END blocks.
Line 283: Line 287:
  print (<var>expr</var>, ...) >> "path"
  print (<var>expr</var>, ...) >> "path"


print to the designated path. (See also TODO [[#built-in_filenames]], below.) The output file is not closed until <code>close "path"</code> is executed, or awk exits.
print to the designated path. (See also [[#Built-in filenames]], below.) The output file is not closed until <code>close "path"</code> is executed, or awk exits.




Line 302: Line 306:
* <code>\r</code> (0xd) which on Unix backspaces to the start of the current line
* <code>\r</code> (0xd) which on Unix backspaces to the start of the current line
* <code>\v</code> and <code>\f</code> (0xb and 0xc) which on Unix stay in the current column but go to the next screen line
* <code>\v</code> and <code>\f</code> (0xb and 0xc) which on Unix stay in the current column but go to the next screen line
* <code>\n</code> and <code>\t</code> and <code>\'</code> and <code>\\</code>
* <code>\n</code> and <code>\t</code> and <code>\"</code> and <code>\\</code>
* <code>\a</code> (0x7)
* <code>\a</code> (0x7)
* <code>\c</code> which ignores the rest of the string (none of my awk implementations honored this)
* <code>\c</code> which ignores the rest of the string (none of my awk implementations honored this)
Line 373: Line 377:




The expression `k in array` doesn't create an array entry, but the reference `array[k]` will create an entry with an uninitialized value. (`k in array` will then be true.)
The expression `k in array` doesn't create an array entry, but the reference `array[k]` will create an entry with an uninitialized value. (`k in array`
will then be true.)


References to fields > NF, on the other hand, don't create new fields. Assignments to these fields do create them (increasing NF and setting any intermediate fields to uninitialized values). Assignments to any field (even <= NF) cause $0 to be recomputed using OFS, but do not cause $0 to be reparsed. (So the modified/new field may contain embedded FS characters.)
References to fields > NF, on the other hand, don't create new fields. Assignments to these fields do create them (increasing NF and setting any  
intermediate fields to uninitialized values). Assignments to any field (even <= NF) cause $0 to be recomputed using OFS, but do not cause $0 to be  
reparsed. (So the modified/new field may contain embedded FS characters.)


Assignments to $0 do cause NF, $1, ... to be recomputed.
Assignments to $0 do cause NF, $1, ... to be recomputed.
Line 387: Line 394:
<div style="white-space:pre; font-family:monospace;"><nowiki>
<div style="white-space:pre; font-family:monospace;"><nowiki>


Uninitialized values include unset variables and array elements, invalid fields (> NF) or valid fields of length 0, and unassigned function parms (which can be used as scalars or arrays). Uninitialized scalars have value "", which math operators treat as 0.
Uninitialized values include unset variables and array elements, invalid fields (> NF) or valid fields of length 0, and unassigned function parms 
(which can be used as scalars or arrays). Uninitialized scalars have value "", which math operators treat as 0.
Can force string interpretation by `var ""`, or force numeric interpretation by `var + 0`. "1text" is coerced as 1.
Can force string interpretation by `var ""`, or force numeric interpretation by `var + 0`. "1text" is coerced as 1.
In boolean contexts: false when expr evaluates to 0 or "", else true.
In boolean contexts: false when expr evaluates to 0 or "", else true.
Line 501: Line 509:
close("file" or "pipe command") ~~> 0 if successful
close("file" or "pipe command") ~~> 0 if successful
     # awk doesn't automatically close files, pipes, sockets, or co-processes when they return EOF.
     # awk doesn't automatically close files, pipes, sockets, or co-processes when they return EOF.
     # the return value of close() is unspecified; gawk uses value from fclose(3) or pclose(3), or -1 if the named file, pipe, or co-process was not opened with a redirection.
     # the return value of close() is unspecified; gawk uses value from fclose(3) or pclose(3), or -1 if the named file, pipe, or co-process was not 
      opened with a redirection.
     { print ... | "sort>tmpfile" } END { close("sort>tmpfile"); while ((getline < "tmpfile")>0) ... close("tmpfile") }
     { print ... | "sort>tmpfile" } END { close("sort>tmpfile"); while ((getline < "tmpfile")>0) ... close("tmpfile") }


Line 512: Line 521:
</nowiki></div>
</nowiki></div>


=== Built-in filenames and variables ===
=== Built-in filenames ===
 
Only gawk handles these filenames internally:
 
* <code>"/dev/tty"</code>
* <code>"/dev/stdin"</code>
* <code>"/dev/stdout"</code>
* <code>"/dev/stderr"</code>
* <code>"/dev/fd/<var>n</var>"</code>
 
but in many Unix systems they're available externally anyway. As an alternative to <code>getline &lt; "/dev/stdin"</code>, you can also use <code>getline &lt; "-"</code>. As an alternative to <code>print ... &gt; "/dev/stderr"</code>, you can also use: <code>print ... | "cat 1&gt;&2"</code>.
 
=== Built-in variables ===
{{Draft|}}
{{Draft|}}


<div style="white-space:pre; font-family:monospace;"><nowiki>
; ENVIRON
: This is an associative array holding the current environment variables. The array is mutable, but POSIX does not specify whether changes to it must be visible to child processes spawned from awk. (In none of the implementations I checked were such changes visible.)
 
; ARGC and ARGV
: The number of command-line arguments to awk, and an array holding them. ARGV[0] will usually be "awk", and ARGV[1]...ARGV[ARGC-1] will be the arguments. Awk handles empty arguments specially, and also arguments of the form <code><var>var</var>=<var>value</var></code>.
 
: If your script only has a BEGIN block, and no main-loop rules or END block, then awk won't attempt to automatically read or process the ARGV arguments, or stdin, in any way.
 
: You can decrement ARGC; awk then won't process any arguments that have been lost. You can also set <code>ARGV[<var>n</var>] = ""</code>, or alternatively, <code>delete ARGV[<var>n</var>]</code>. Similarly, you can add additional ARGV entries, though if you do so, you need to increment ARGC manually.
 
: Arguments that awk does process should not refer to directories or nonexistent files (unless you are going to scan the ARGV list manually and remove such). Some versions of some awk implementations will warn if asked to process a directory; others will raise an error.
 
; ARGIND
: Indicates which ARGV entry was last processed by awk. This is present only in gawk and BusyBox awk. The variable is mutable; but changes to it have no side-effect in gawk. In BusyBox, changes to it affect which ARGV entry will be processed next, after the current one is finished.
 
; FILENAME
: Indicates the name of the ARGV entry currently being processed. When stdin is being processed, this will be <code>"-"</code>.
 
 
 
 


# only in gawk, though all but /dev/fd/n (n>2) will generally still work
"/dev/tty"
"/dev/stdin", can also use: "-"
"/dev/stdout"
"/dev/stderr", can also use: | "cat 1>&2"
"/dev/fd/n"




ENVIRON associative array of environment variables
<div style="white-space:pre; font-family:monospace;"><nowiki>
    array is mutable, but changes don't affect environment of child processes (POSIX leaves implementation-dependent)


ARGC Number of arguments on command line
ARGIND  0 when processing stdin (and ARGC=1), 1 when processing ARGV[1], ...
        only in gawk and BusyBox; in BusyBox, mutating this affects which file is read next       
ARGV An array from 0...ARGC-1 containing the command-line arguments
        Entries of form "var=value" handled specially, entries of form "" skipped
        Entries that are directories: implementations and versions differ in whether they'll warn/error (gawk --traditional will error)
        You can: decrement ARGC or set ARGV[...]="" or delete ARGV[...] to suppress argument processing
        You can: append to ARGV and increment ARGC, then argument will be processed later


ERRNO  In gawk and BusyBox, contains string error message after getline or close fails
ERRNO  In gawk and BusyBox, contains string error message after getline or close fails
FILENAME Current filename


NR         Number of current record
NR         Number of current record
Line 562: Line 587:
     POSIX requires that assigning a new value to FS has no effect on the current input line; it only affects the next input line.
     POSIX requires that assigning a new value to FS has no effect on the current input line; it only affects the next input line.
     Only gawk conforms; BusyBox and FreeBSD awk will also use the new FS for current line if no fields have yet been referenced.
     Only gawk conforms; BusyBox and FreeBSD awk will also use the new FS for current line if no fields have yet been referenced.
     FS=space (default): strips leading and trailing space/tabs, fields are separated by spans of space/tabs/newlines (in gawk --posix, only spans of spaces/tabs)
     FS=space (default): strips leading and trailing space/tabs, fields are separated by spans of space/tabs/newlines (in gawk --posix, only spans of  
      spaces/tabs)
     FS=":" or "\t": each occurrence of the char separates another field
     FS=":" or "\t": each occurrence of the char separates another field
     FS="pat": separator is leftmost longest non-null and non-overlapping match of pattern
     FS="pat": separator is leftmost longest non-null and non-overlapping match of pattern
Line 651: Line 677:
These links may also be of interest:
These links may also be of interest:


* http://www.pement.org/awk/awk1line.txt
* https://www.pement.org/awk/awk1line.txt
* http://www.catonmat.net/series/awk-one-liners-explained
* https://www.catonmat.net/series/awk-one-liners-explained
* http://awk.freeshell.org/HomePage
* http://awk.freeshell.org/HomePage
* http://awk.freeshell.org/AwkFeatureComparison
* http://awk.freeshell.org/AwkFeatureComparison

Latest revision as of 07:42, 9 November 2024


This page compares BusyBox's implementation of awk (using this config file) with gawk (versions >= 3.1.8) and FreeBSD 9's nawk, which is based on Bell Labs/Brian Kernighan's 2007 version of awk.

It's not intended as a tutorial, but rather as a technical summary and list of "gotchas" (places where different implementations may behave in different or unexpected ways).

Note: This summary was drawn up back when we were using uClibc. Some of the behavior of Alpine's version of BusyBox awk reported here may have changed with the switch to musl.

Invoking awk

awk [-F char_or_regex] [-v var=value ...] [--] 'awk script...' [ ARGV[1] ... ARGV[ARGC-1] ]
awk [-F char_or_regex] [-v var=value ...] -f scriptfile ... [--] [ ARGV[1] ... ARGV[ARGC-1] ]

Gawk makes a third invocation pattern possible, which mixes -f scriptfile options with an 'awk script...' on the command-line:

gawk [-F char_or_regex] [-v var=value ...] -f scriptfile ... -e 'awk script...' [--] [ ARGV[1] ... ARGV[ARGC-1] ]

Awkenough is a small set of Awk utility routines and a C stub that makes it easier to write shell scripts with awk shebang lines. (I'll make an Alpine package for this soon.) It also permits mixing -f scriptfile and command-line scripts, using the same -e option gawk provides. In both cases, the long-form option --source works the same as -e.

Notes
  • all implementations honor \t and so on in -v FS=expr. BusyBox doesn't honor \t in -F expr; perhaps this is a bug. You can use the $'\t' escapes of BusyBox ash to work around this.
  • nawk and gawk --traditional (but not gawk --posix) interpret -F t as -F '\t'. This is a weird special-case preserved for historical compatibility.
  • scriptfile can be - for stdin
  • ARGVs can be of any form, but awk only knows how to automatically process arguments whose form is:
    • var=unquoted_value: assignments will be made when the main loop reaches that ARGV
    • "": will be skipped
    • filename
    • -: will use stdin
If no files are processed (nor -), stdin will be processed after all command-line assignments.
  • standalone scripts should look like this:

Contents of foo.awk

#!/usr/bin/awk -f
# will receive the script's ARGC, ARGV with ARGV[0]=awk
BEGIN { ... }
...
or like this:

Contents of bar.awk

#!/usr/bin/runawk -f /usr/share/awkenough/library.awk
...
  • gawk places any unrecognized short options that precede -- into ARGV; other implementations (and gawk --traditional) fail/warn
  • some gawk-only options:
    • --lint: warns about nonportable constructs
    • --re-interval: enables /pat{1,3}/ regexes; some versions of gawk enable by default, others don't
    • --traditional or --compat: makes gawk behave like nawk
    • --posix: like --traditional --re-interval, with some further restrictions
    • --exec scriptfile: stops option processing and disables any further -vs; meant to be used in shebang line for scripts in untrusted environments (cgi)
    • --sandbox: disable system("foo"), "foo"|getline, getline <"file", print|"foo", print >"file", and gawk's extension(obj, func); so only local resources explicitly specified in ARGV will be accessible. This feature is only available on recent versions of gawk.
  • when handling -f path options where path contains no /s, gawk will first search in directories specified in an AWKPATH environment variable
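
As a rough illustration of the ARGV handling described in the notes above, the sketch below shows a var=value argument taking effect only when the main loop reaches it. The scratch directory, the file names f1 and f2, and the variable name tag are invented for the example; mktemp is assumed to be available.

```shell
# Hypothetical scratch files; any two readable files would do.
tmp=$(mktemp -d)
printf 'a\n' > "$tmp/f1"
printf 'b\n' > "$tmp/f2"

# "tag=X" is assigned just before f1 is read, "tag=Y" just before f2.
argv_out=$(awk '{ print tag ":" $0 }' tag=X "$tmp/f1" tag=Y "$tmp/f2")
rm -r "$tmp"
printf '%s\n' "$argv_out"
```

Use -v tag=X instead when the value has to be visible inside BEGIN, since ARGV assignments have not happened yet at that point.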

Awk grammar

Blocks

function definitions
  • Functions have global scope, and they can be called from positions that syntactically precede their definitions.
  • Scalars (that is, non-arrays) are passed by value, arrays are passed by reference.
  • Function calls can be nested or recursive. Each invocation of the function has its own local environment: so upon return, values of the scalar parameters the caller supplied shall be unchanged (array parameters may have been mutated).
  • Consider functions defined like this:
    function foo(x,y) {
        ...
        bar()
        return "result"
    }
    
    function bar(z) {
        y += 1
        return
    }
    

    Here the variables x and y are local to foo, and the variable z is local to bar. All other variable references are global: contrast the shell, where non-local variable references are instead dynamic. That is, in the shell, when foo calls bar, bar will be incrementing foo's local variable y. In awk, on the other hand, bar always only increments a global variable y.

    Note also that the number of arguments in a function call needn't match the number of parameters in the function's definition; if more arguments are supplied, they will be ignored; if fewer are supplied (as in foo's call of bar), the missing variables are assigned either "" or an empty array, depending on how they're used in the function.

    Note also that the return statement needn't be given an explicit return value (it will default to ""). The return statement can also be omitted altogether.
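
A small sketch of the parameter rules above, using made-up names (bump, n, arr, tmp): the scalar argument is copied, the array argument is shared, and an extra parameter that the caller never supplies behaves as a local variable.

```shell
scope_out=$(awk '
function bump(n, arr,    tmp) {   # tmp is an extra parameter used as a local
    tmp = n + 1                   # leaves the global named tmp alone
    n = tmp                       # changes only the local copy of the scalar
    arr["k"]++                    # arrays are shared with the caller
}
BEGIN {
    x = 1; a["k"] = 1; tmp = "global"
    bump(x, a)
    print x, a["k"], tmp
}')
printf '%s\n' "$scope_out"
```

Separating the extra parameters with additional spaces is only a convention for readers; awk itself treats them like any other parameter.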

BEGIN and END
BEGIN { ... } ...
END { ... } ...
  • Multiple BEGIN blocks are merged; so too multiple END blocks.
  • If there are only BEGIN blocks (no main loop or END blocks), and getline isn't used, then POSIX requires awk to terminate without reading stdin or any file operands. The implementations I checked honor this, but some historical implementations discarded input in this case. For portability, you can explicitly append < /dev/null to an invocation of awk you know shouldn't consume stdin.
  • If there is a main loop or END blocks, and awk's processing of stdin hasn't been suppressed by encountering files in its ARGV list, then stdin will be processed before executing any END block(s).
  • If there are END block(s), but exit status is invoked outside them, then the main loop stops and control passes immediately to the first END block. If exit status is invoked within an END block, or if there are no end blocks, the script terminates with result status.
  • POSIX requires that inside END, all of FILENAME NR FNR NF retain their most-recent values, so long as getline is not invoked. My implementations comply, and also retain the most-recent fields $0 ..., but earlier versions of nawk did not; and Darwin's awk may not.
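
The exit and END behavior listed above can be checked with a short experiment along these lines (the variable names end_out and end_status are just for the example). The exit in the main loop skips the remaining input, END still runs, and NR keeps its last value there.

```shell
end_out=$(printf '1\n2\n3\n' | awk '
NR == 2 { exit 3 }          # stop the main loop at the second record
        { print }
END     { print "END sees NR=" NR }') && end_status=0 || end_status=$?
printf '%s\n' "$end_out"
echo "status: $end_status"
```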
main loop blocks

These are of the form:

pattern         { action ... }

or:

pattern,pattern { action ... }

where action defaults to print $0; and pattern defaults to 1 (that is, to true).

  • Patterns can take any of the forms:
    • /regex/
    • relational
    • pattern && pattern
    • pattern || pattern
    • ! pattern
    • ( pattern )
    • pattern ? pattern : pattern
    The first form is interpreted the same as the relational $0 ~ /regex/.
  • Relationals can take any of the forms:
    • expr ~ RE
    • expr !~ RE
    • expr eqop expr
    • expr in arrayname
    • (expr, ...) in arrayname
    • expr
    The schemas RE can be regex literals like /a[bc]+/ or any expression that evaluates to a string, like "a[bc]+". Note that strings require an extra level of \s: you can write /\w+/ or "\\w+". The schemas eqop can be any of: == != <= < > >=. A bare expression, as in the last form, is interpreted as false when the expression evaluates to 0 or "", else to true.
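
Two of the pattern forms above, sketched with throwaway input: a pattern,pattern range with the default action, and a relational whose RE is a string rather than a regex literal.

```shell
# Range pattern: prints from the first matching line through the closing one.
range_out=$(printf 'a\nSTART\nb\nEND\nc\n' | awk '/START/,/END/')

# String used as a regex: keeps only lines made up entirely of digits.
num_out=$(printf '12\nx7\n300\n' | awk '$0 ~ "^[0-9]+$"')

printf '%s\n' "$range_out" "$num_out"
```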
other block forms
BEGINFILE { ... }
ENDFILE { ... }

These are only available in (recent versions of) gawk.

@include "filename"

This is only available in some versions of gawk. See also Aleksey Cheusov's Runawk, which uses a different approach.

Statements

Lines can be broken after any of: \ , { && || ? : (The last two only in BusyBox and gawk, and not in gawk --posix.)


function calls
func(expr, ...)

always evaluates to a value, so can be used as an expression, but also can occur in statement contexts. (Some awks like nawk don't allow arbitrary expressions to occur in statement contexts; but BusyBox awk does.)

When calling user-defined functions, no space can separate the function name and the (.


assignments
lvalue = expr
lvalue += 1 (similarly for -= *= /= %= ^=)
lvalue ++ (similarly for --)
++ lvalue (similarly for --)

lvalue can take any of the forms:

  • var
  • arrayname[expr]
  • arrayname[expr, ...]
  • $expr

Like function calls, assignments always evaluate to a value, so can be used as expressions, but also can occur in statement contexts, even in nawk.

delete arrayname[expr]
delete arrayname[expr, ...]
delete arrayname

The last form is not available in gawk --traditional. More portably, you can get the same effect with:

split("", arrayname)

After an array has been deleted, it no longer has any elements; however, the array name remains unavailable for use as a scalar variable.
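
A minimal check of the portable split("", arrayname) idiom mentioned above; the array name a and the counter n are arbitrary.

```shell
clear_out=$(awk 'BEGIN {
    a["x"] = 1; a["y"] = 2
    split("", a)                 # portable way to empty the array
    n = 0
    for (k in a) n++             # count whatever elements remain
    print n
}')
printf '%s\n' "$clear_out"
```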


control operators
if (test) action
if (test) action; else action
if (test) action; else if (test) action
if (test) action; else if (test) action; else action
if (test) { ... }
if (test) { ... } else action

and so on.


while (test) action
...
do action while (test)
...

The do...while form will execute action at least once.


for (i=1; i<=NF; i++) action
...
for (k in arrayname) action
...

The last form has an unspecified iteration order. Also, the effect of adding or removing elements from an array while iterating over it is undefined.
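
Because for (k in arrayname) has no guaranteed order, one common workaround (sketched here with an invented array name, line) is to index the array by consecutive integers and walk it with the counting form of for instead:

```shell
order_out=$(printf 'b\na\nc\n' | awk '
    { line[NR] = $0 }                       # index records by arrival number
END { for (i = 1; i <= NR; i++) print i ": " line[i] }')
printf '%s\n' "$order_out"
```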


break
continue

Historic awk implementations interpreted break and continue outside of while/do/for-loops as next. BusyBox supports this usage; as do some versions of gawk --traditional.


next

This reads the next line of input and restarts the main loop from the first rule. For example, to treat the first line of awk's input specially, you can do this:

NR==1 {...handle first line...; next }
      {...handle other lines...}

POSIX doesn't specify the behavior of next in BEGIN or END blocks.


nextfile

This aborts processing the current file and starts processing the next file (if any) from the first rule. It is not available in gawk --traditional.


return [value]

value defaults to "". POSIX doesn't specify the behavior of a return statement outside of a function.


exit status

This will begin executing END rules, or exit immediately if invoked inside an END block. status defaults to 0. In gawk and nawk, sequences like this:

    { ... exit n }
END { exit }

will preserve the n status code. In BusyBox awk, the second exit will instead revert to the default of 0.
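
One way to sidestep this difference, offered as a sketch rather than a tested recipe for every implementation, is to keep the status in a variable (rc here is an arbitrary name) and pass it explicitly to every exit:

```shell
awk 'BEGIN { rc = 4; exit rc }   # first exit: control passes to END
     END   { exit rc }           # explicit status, so nothing relies on the default
' </dev/null && exit_status=0 || exit_status=$?
echo "$exit_status"
```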


printing

Parentheses are optional for print and printf, but are mandatory for sprintf. Also, parentheses should be used if any of the arguments to print or printf contains a >.

Though print and printf accept parenthesized argument lists, their invocations are statements, not expressions. They return no value and cannot appear in non-statement contexts. (sprintf on the other hand is a function; and its invocations can appear in both statement and expression contexts.)

The first of these:

print "a" "b"
print "a","b"

invokes print with a single argument, which is the simple concatenation "ab" of the expressions "a" and "b". On the other hand, the second form invokes print with two arguments. They will be separated in the output by the current value of OFS---which may, but need not, be "". The default value of OFS is a single space.
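
The difference is easiest to see with a non-default OFS, as in this small sketch:

```shell
print_out=$(awk 'BEGIN {
    OFS = "-"
    print "a" "b"      # one argument: the concatenation
    print "a", "b"     # two arguments, joined by OFS
}')
printf '%s\n' "$print_out"
```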


print

is interpreted the same as print $0. (Note also that a block of the form pattern with no {...} is interpreted as having the body { print $0 }.)


print ""
print("")

both print a newline.


print (expr, ...) > "path"
print (expr, ...) >> "path"

print to the designated path. (See also #Built-in filenames, below.) The output file is not closed until close("path") is executed, or awk exits.


print (expr, ...) | "shell pipeline"

prints to the designated pipeline. The pipeline is not closed until close("shell pipeline") is executed, or awk exits.
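
A sketch of why closing matters, assuming a sort command is on the PATH: the string passed to close() must be the same one used after the |, and closing makes awk wait for the pipeline to finish before it carries on.

```shell
pipe_out=$(awk 'BEGIN {
    cmd = "sort"
    print "b" | cmd
    print "a" | cmd
    close(cmd)          # waits for sort to finish and emit its output
    print "done"
}')
printf '%s\n' "$pipe_out"
```

Keeping the command in a variable, as here, avoids the easy mistake of closing a string that differs slightly from the one the pipe was opened with.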


printf ( format, expr, ... )

can be followed by > "path" or >> "path" or | "shell pipeline", just like print.

This usually implements a subset of the functionality of printf(3). The format string can contain:

  • normal Awk string escape codes, such as:
    • \b (0x8) which backspaces one space
    • \r (0xd) which on Unix backspaces to the start of the current line
    • \v and \f (0xb and 0xc) which on Unix stay in the current column but go to the next screen line
    • \n and \t and \" and \\
    • \a (0x7)
    • \c which ignores the rest of the string (none of my awk implementations honored this)
    • \000 for up to three octal digits
    Of the implementations I tried, only gawk would printf or sprintf "\0"; others would take it to terminate the string. (A different way to print "\0" is to use printf "%c", 0. This works with gawk's printf and sprintf and nawk's printf.)
  • formatting codes, of the form:
    %[n$][#][-|0][+| ]['][minwidth][.precision][wordsize specifier][format specifier]
    

    where:

    • n$ means use argv[n] instead of the next argv in sequence. The count is 1-based, and only gawk honors this.
    • # forces octals to have an initial 0, hex to have initial 0x, floats to always have a decimal point.
    • - means align to the left
    • 0 means right-align as usual but pad with zeros instead of spaces
    • + forces the inclusion of a sign
    • (a space character) leads with a sign if negative, or space if positive
    • ' means write a million as 1,000,000. None of my awks honor this.
    • minwidth or precision can be *, in which case the values are supplied by the next argv. Only gawk and nawk honor this.
    • precision gives the number of digits after the decimal for a number, or the maxwidth for a string.
    • wordsize specifier: only nawk honors these, and then only: hh h l ll
    • format specifier can be any of the signed formats fegdi (the last two are equivalent), or any of the unsigned formats xuo, or any of the text formats sc. On BusyBox, the format %c only processes values up to 0x7fff, and the results are always masked to the range "\x00"..."\xff".

    None of my awk implementations honored %b (which is like echo -e) or any other format.
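A quick sketch exercising the flags and format specifiers listed above that should behave the same everywhere. It deliberately avoids n$, ', * and wordsize specifiers, since support for those varies as described. The wrapper name demo_printf_flags is made up for this example.

```shell
# demo_printf_flags is a hypothetical wrapper name, used only for illustration.
demo_printf_flags() {
    awk 'BEGIN {
        printf "[%-5s]\n", "ab"         # left-align in a field of width 5
        printf "[%05d]\n", 42           # zero-padded to width 5
        printf "[%+d]\n", 7             # forced sign
        printf "[% d]\n", 7             # space in place of a plus sign
        printf "[%.2f]\n", 3.14159      # precision: digits after the decimal point
        printf "[%.3s]\n", "abcdef"     # precision: maximum width for a string
        printf "[%x] [%o]\n", 255, 8    # unsigned hex and octal
        printf "[%c]\n", 65             # numeric argument to %c gives that character
    }'
}
demo_printf_flags
# prints:
# [ab   ]
# [00042]
# [+7]
# [ 7]
# [3.14]
# [abc]
# [ff] [10]
# [A]
```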

Expressions

This material is work-in-progress ...


(Last edited by Dubiousjim on 9 Nov 2024.)

/pat/           # when this is a complete expr, interpreted as `$0 ~ /pat/`
    Differences from other EREs:
        {m,n} treated as literal by nawk and gawk --traditional
        In some versions of gawk --traditional, [[:classes:]] not available.
        dot and [^x] do match newline
        ^ $ always anchors and only match against start-of-buffer and end-of-buffer (like \` and \')
        \1 is \x01; no backrefs available in match patterns
        In gawk and nawk, [a\]1] matches a,],1. In BusyBox, it matches a,\ followed by literal 1 then ].
        In BusyBox and gawk, /ab\52c/ is interpreted as /ab*c/ not /ab\*c/. In nawk and gawk --traditional, it's interpreted as latter.
        Literal newlines not allowed in /pat/, but ok in strings interpreted as patterns.
        /pat\nmore/ and "pat\\nmore" are both ok.

x=$1 $2         # string concatenation
$one
$(one+two)      # any numeric-valued expression can follow $
x=1,2           # assigns 1 to x
x=(1,2)         # assigns 1 SUBSEP 2 to x
== != <= < > >= ~ !~
&& has higher precedence than ||
!(k in A)
test ? expr : expr
in BusyBox and gawk, 0[0-7]+ is interpreted as octal and 0x[[:xdigit:]]+ is interpreted as hex
+ - * / % ^     (often, as in BusyBox, ** also does exponentiation)
++ -- lvalue | lvalue ++ --
lvalue += ... %= ^= value

C escapes in "string" and /pat/:
    "\OOO"      up to 3 octal digits
    "\\ \" \n \t \r \f \v \b \a"
    "\xff"      # not in gawk --posix; BusyBox restricts to two digits
    unspecified what is behavior of \m for other m; my awks use literal m (gawk gives warning)
    only gawk suppresses \<newlines> inside string literals

The expression `k in array` doesn't create an array entry, but the reference `array[k]` will create an entry with an uninitialized value. (`k in array` will then be true.)
References to fields > NF, on the other hand, don't create new fields. Assignments to these fields do create them (increasing NF and setting any intermediate fields to uninitialized values).
Assignments to any field (even <= NF) cause $0 to be recomputed using OFS, but do not cause $0 to be reparsed. (So the modified/new field may contain embedded FS characters.)
Assignments to $0 do cause NF, $1, ... to be recomputed.
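The array-entry and field-assignment rules above can be watched in action with the sketch below. The wrapper name demo_in_vs_reference is made up for this example, and the array and variable names are arbitrary.

```shell
# demo_in_vs_reference is a hypothetical wrapper name, used only for illustration.
demo_in_vs_reference() {
    awk 'BEGIN {
        print (("k" in A) ? "present" : "absent")   # the test alone creates nothing
        x = A["k"]                                  # a mere reference creates the entry
        print (("k" in A) ? "present" : "absent")
    }'
    # Assigning past NF extends the record; $4 is created empty and $0 is rebuilt with OFS.
    echo "a b c" | awk -v OFS=- '{ $5 = "e"; print NF; print }'
}
demo_in_vs_reference
# prints:
# absent
# present
# 5
# a-b-c--e
```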

String vs numeric

This material is work-in-progress ...



Uninitialized values include unset variables and array elements, invalid fields (> NF) or valid fields of length 0, and unassigned function parms (which can be used as scalars or arrays).
Uninitialized scalars have value "", which math operators treat as 0.
Can force string interpretation by `var ""`, or force numeric interpretation by `var + 0`.
"1text" is coerced as 1.
In boolean contexts: false when expr evaluates to 0 or "", else true.
$(expr) is not required to convert uninitialized or string expr, but my implementations do so.

"Numeric string" := value is of the form / *[+-]?NUMBER */, and it was supplied in a way that doesn't distinguish strings from numbers:
    n=$1 or split("string",n[]) or getline n
    n=FILENAME or ARGV[1] or ENVIRON["foo"]
    awk -v 'n=...' ...

Comparisons are numeric iff:
    both operands are numeric, or
    one is numeric and other is a "numeric string" or is uninitialized
        my implementations don't require that at least one operand be numeric
Otherwise (one of the operands is an explicit "string"), comparison is stringwise.

awk 'BEGIN {print "00" == 0}'   # false, first operand is not a "numeric string"

awk -v'x=00' -v'n=000' 'BEGIN {
    y="0000"
    print(x=="0", x==0, x==u, x==n, y=="0", y==0, y==u, y==n)
    # only second, third and fourth will be true, all other comparisons are stringwise
    x+y
    print(x=="0", x==0, x==u, x==n, y=="0", y==0, y==u, y==n)
    # in nawk, the previous arithmetic reference to y converts its value to a "numeric string" too
    # so now the last three comparisons are also true
    # even numeric constants are susceptible to this side-effect
    # (print the constant and it's converted to a numeric string, and ignores further changes to OFMT)
    # however, stringwise evaluation of y still gives "0000"
}'
In POSIX and my other implementations, referencing doesn't have that side-effect.
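A short sketch of the comparison rules, restricted to cases where the implementations discussed here should agree. The wrapper name demo_comparisons is made up for this example; u is simply a variable that is never assigned.

```shell
# demo_comparisons is a hypothetical wrapper name, used only for illustration.
demo_comparisons() {
    awk 'BEGIN {
        print ("00" == 0)     # string constant vs number: compared as strings
        print ("1text" + 0)   # numeric coercion uses the leading 1
        print (u == 0)        # uninitialized compares equal to 0 ...
        print (u == "")       # ... and also to the empty string
    }'
    echo "00 0" | awk '{
        a = ($1 == $2)                # both fields are numeric strings: numeric comparison
        b = ($1 == "0")               # a string constant forces a stringwise comparison
        c = (($1 "") == ($2 ""))      # appending "" forces string interpretation
        print a, b, c
    }'
}
demo_comparisons
# prints:
# 0
# 1
# 1
# 1
# 1 0 0
```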

Arrays

This material is work-in-progress ...



flavor[1]="cherry"
...
print flavor[1]
for (k in flavor) action
nfields = split(string, larray, [sep=FS])    # sep can be single char or regex
delete array[key]                            # BusyBox also honors "delete array"

multiarray[k1, k2]="foo"
(k1,k2) in multiarray
for (kk in multiarray) { split(kk,k,SUBSEP) ... }    # now k[1],k[2]
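The multi-subscript idiom above, as a runnable sketch. The wrapper name demo_multidim and the array names are made up for this example.

```shell
# demo_multidim is a hypothetical wrapper name, used only for illustration.
demo_multidim() {
    awk 'BEGIN {
        m["x", "y"] = "foo"                  # stored under the single key "x" SUBSEP "y"
        if (("x", "y") in m) print "found"
        for (kk in m) {
            split(kk, k, SUBSEP)             # recover the two subscripts
            print k[1], k[2], m[kk]
        }
        delete m["x", "y"]
        n = 0
        for (kk in m) n++
        print n                              # no keys are left
    }'
}
demo_multidim
# prints:
# found
# x y foo
# 0
```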

Built-in functions

This material is work-in-progress ...



cos(x), sin(x), atan2(y,x), exp(x), log(x), sqrt(x)
int(x)                    # just truncates towards 0, use `printf "%.0f" ...` to round
0 <= rand() < 1
    # 1 + int(rand() * TOPNUM)    # returns number in range [1..TOPNUM]
srand(seed=time_of_day) ~~> oldseed

# bit arith are only gawk and BusyBox
and(x,y)
or(x,y)
xor(x,y)
lshift(x,count)
rshift(x,count)
compl(x)                  # 32-bit base 0xffffffff on BusyBox, may be 53-bit on x86_64 gawk

# time functions are only gawk and BusyBox
systime()                 # seconds since epoch
strftime("format",systime()=current)
mktime("YYYY MM DD HH MM SS [dst?=-1]") ~~> seconds since epoch
    # HH etc values can be non-standard, e.g. -1
    # uses local TZ; dst? 1 yes, 0 no, <0 auto

tolower("string"), toupper("string")
sprintf("format", expr...)    # does require parens

# gawk is multibyte-aware, so length, substr, index, and match all count by chars, not bytes
length(string=$0)
    # nawk and gawk support length without ()
    # nawk and gawk (but not gawk --posix) support length(array)
    # gawk has length(A), returns count of keys rather than maxkey
substr("string", 1-based-start, maxlength=until_end) ~~> "substring"
index("haystack","needle") ~~> 1-based, or 0 if fails
match(string, /pat/) ~~> 1-based starting index, or 0 if fails
    # sets RSTART and RLENGTH (when match fails, they are 0 and -1)
    # ... match($0, /pat/) { print substr($0, RSTART, RLENGTH) } ...    # prints first occurrence of /pat/
    # gawk has `match(string,/pat/,groups)`: fills groups[0,0start,0length,1,1start,1length,...] with text and positions of \\0, \\1, ...
split("string", larray, sep=FS) ~~> nitems
    # splits into larray[1]...larray[nitems]
    # some versions of gawk permit fourth argument: lsep, to receive actual FS values (lsep[0] [1] lsep[1] ... [nitems] lsep[nitems])
    # BusyBox and gawk accept "" and // as sep; with // BusyBox makes an extra empty field at end
sub(/pat/, "subst", lstring=$0) ~~> 1 if successful, 0 if failed
gsub(/pat/, "subst", lstring=$0) ~~> nsubs
gensub(/pat/, "subst", "g" or which match, string=$0) ~~> newstring
    # gawk and BusyBox only
    # string is not mutated
    # "subst" honors not only & (same as \\0), but also \\1, \\2 ...
system("command") ~~> exit code, output not captured

# getline is a function though it doesn't use parens
# all forms read one line each time they're evaluated; like "next" without restarting from first rule
# returns 1 on success, 0 on EOF, -1 on error; file is not closed automatically
# updates NF when assigning to $0
getline [var]             # updates NR and FNR when file not supplied
getline [var] < "-"       # read from stdin instead of ARGV[ARGIND]
    # loop until we get a name
    while (!name) { printf("Enter a name? "); getline name < "-" }
getline [var] < "file"
"pipe command" | getline [var]
close("file" or "pipe command") ~~> 0 if successful
    # awk doesn't automatically close files, pipes, sockets, or co-processes when they return EOF.
    # the return value of close() is unspecified; gawk uses value from fclose(3) or pclose(3), or -1 if the named file, pipe, or co-process was not opened with a redirection.
    { print ... | "sort>tmpfile" }
    END { close("sort>tmpfile"); while ((getline < "tmpfile")>0) ... close("tmpfile") }
fflush(output "file" or "pipe command") ~~> 0 if all requested buffers successfully flushed, else -1
    # not in gawk --posix
    # if no argument is supplied, flushes stdout
    # if "" is supplied, flushes all open output files and pipes
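A sketch that exercises a few of the portable functions from this summary (match, substr, gsub, index, toupper, length, and the pipe form of getline). It uses ASCII input only, so the byte-versus-character distinction does not arise. The wrapper name demo_string_functions is made up for this example.

```shell
# demo_string_functions is a hypothetical wrapper name, used only for illustration.
demo_string_functions() {
    echo "foo=bar baz" | awk '{
        if (match($0, /[a-z]+=/))
            print substr($0, RSTART, RLENGTH)   # text of the first match
        n = gsub(/a/, "A")                      # mutates $0, returns the number of substitutions
        print n, $0
        print index($0, "bAz")                  # 1-based position
        print toupper(substr($0, 1, 3)), length($0)
    }'
    awk 'BEGIN {
        cmd = "echo x; echo y"
        while ((cmd | getline line) > 0)        # 1 per line read, 0 at EOF, -1 on error
            print "got:", line
        close(cmd)                              # not closed automatically at EOF
    }'
}
demo_string_functions
# prints:
# foo=
# 2 foo=bAr bAz
# 9
# FOO 11
# got: x
# got: y
```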

Built-in filenames

Only gawk handles these filenames internally:

  • "/dev/tty"
  • "/dev/stdin"
  • "/dev/stdout"
  • "/dev/stderr"
  • "/dev/fd/n"

but in many Unix systems they're available externally anyway. As an alternative to getline < "/dev/stdin", you can also use getline < "-". As an alternative to print ... > "/dev/stderr", you can also use: print ... | "cat 1>&2".
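The portable stderr alternative can be sketched as follows; stderr is discarded when the function is called so that only the stdout line remains visible. The wrapper name demo_stderr is made up for this example.

```shell
# demo_stderr is a hypothetical wrapper name, used only for illustration.
demo_stderr() {
    awk 'BEGIN {
        print "to stdout"
        print "to stderr" | "cat 1>&2"   # works without awk knowing about /dev/stderr
        close("cat 1>&2")
    }'
}
demo_stderr 2>/dev/null
# prints:
# to stdout
```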

Built-in variables

This material is work-in-progress ...



ENVIRON
This is an associative array holding the current environment variables. The array is mutable, but POSIX does not specify whether changes to it must be visible to child processes spawned from awk. (In none of the implementations I checked were such changes visible.)
ARGC and ARGV
The number of command-line arguments to awk, and an array holding them. ARGV[0] will usually be "awk", and ARGV[1]...ARGV[ARGC-1] will be the arguments. Awk handles empty arguments specially, and also arguments of the form var=value.
If your script only has a BEGIN block, and no main-loop rules or END block, then awk won't attempt to automatically read or process the ARGV arguments, or stdin, in any way.
You can decrement ARGC; awk then won't process any arguments that have been lost. You can also set ARGV[n] = "", or alternatively, delete ARGV[n]. Similarly, you can add additional ARGV entries, though if you do so, you need to increment ARGC manually.
Arguments that awk does process should not refer to directories or nonexisting files (unless you are going to scan the ARGV list manually and remove such). Some versions of some awk implementations will warn if asked to process a directory; others will raise an error.
ARGIND
Indicates which ARGV entry was last processed by awk. This is present only in gawk and BusyBox awk. The variable is mutable; but changes to it have no side-effect in gawk. In BusyBox, changes to it affect which ARGV entry will be processed next, after the current one is finished.
FILENAME
Indicates the name of the ARGV entry currently being processed. When stdin is being processed, this will be "-".




ERRNO
In gawk and BusyBox, contains string error message after getline or close fails
NR
Number of current record
NR is mutable but changing it has no side-effects.
FNR
Like NR, but relative to the current file
RS
Record separator (defaults to newline)
If RS is "", it's interpreted as a span of empty lines, and FS will then always implicitly include "\n".
nawk only honors first character of RS
BusyBox and gawk permit RS to be a regex, and set RT to the actual terminating text (will be "" at EOF):
    awk -v 'RS=[\t\n]' 'BEGIN {ORS=""} {print $0 RT} END {print "\n"}'
    awk -v 'RS=oldpat' -v 'ORS=newtext' '{
        if (RT=="") printf "%s", $0 # EOF
        else print
    }' # print text between matches
gawk --traditional still honors regex RS, but doesn't set RT
BusyBox permits FS to have 0-length matches:
    echo aaba|awk -F 'b*' '{print NF}' # yields 5
ORS
Output record separator (defaults to newline)
NF
Number of fields in current record
gawk and BusyBox recalculate $0 when NF is either increased or decreased (in latter case, fields are discarded); nawk does neither.
FS
Field separator (defaults to a space)
POSIX requires that assigning a new value to FS has no effect on the current input line; it only affects the next input line. Only gawk conforms; BusyBox and FreeBSD awk will also use the new FS for current line if no fields have yet been referenced.
FS=space (default): strips leading and trailing space/tabs, fields are separated by spans of space/tabs/newlines (in gawk --posix, only spans of spaces/tabs)
FS=":" or "\t": each occurrence of the char separates another field
FS="pat": separator is leftmost longest non-null and non-overlapping match of pattern
FS="" (gawk and BusyBox only): do character-wise splitting
They also accept "" and // as args to split (latter can't be assigned as value to FS)
When splitting on //, BusyBox generates an additional empty field
All my gawks permit sub(//,...) and gsub(//,...): matches each inter-character position, starting before first char
OFS
Output field separator (defaults to a space)
SUBSEP
Separator character for array subscripts (\034)
OFMT
Output format for numbers (%.6g)
CONVFMT
String conversion format for numbers (%.6g).
used for comparisons and array indexing, defaults to whatever value OFMT has
integer strings are always converted using "%d"
results are unspecified if CONVFMT isn't a floating-point format spec
RSTART
First position in the string matched by match() function
RLENGTH
Length of the string matched by match() function
IGNORECASE=1
in gawk and BusyBox only
Affects all pattern-matching and string comparison/indexing; doesn't affect array indexing
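Two of the behaviors described in this section, sketched in a form that should be portable: a BEGIN-only script can inspect ARGV without awk trying to open the arguments as files, and assigning to any field rebuilds $0 with OFS. The wrapper name demo_argv_and_ofs is made up for this example, and "one" and "two" are arbitrary arguments, not real files.

```shell
# demo_argv_and_ofs is a hypothetical wrapper name, used only for illustration.
demo_argv_and_ofs() {
    # Only a BEGIN block, so the arguments are never read as input files.
    awk 'BEGIN {
        for (i = 1; i < ARGC; i++) print i, ARGV[i]
    }' one two
    # Split on ":"; the no-op assignment $1 = $1 forces $0 to be rebuilt with OFS.
    echo "a:b:c" | awk -F: -v OFS=- '{ $1 = $1; print NF; print }'
}
demo_argv_and_ofs
# prints:
# 1 one
# 2 two
# 3
# a-b-c
```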

Gawk-only extensions

This material is work-in-progress ...



Some versions of gawk have:
    switch (expression) {
    case value|regex : statement
    ...
    [ default: statement ]
    }

Only gawk supports FIELDWIDTHS="2 4 2"
    # use FS=FS to revert to regular field-splitting

Only some versions of gawk support FPAT="a pat that every field satisfies"
    # use FS=FS to revert to regular field-splitting
    also patsplit("string",larray,[pat=FPAT,lsep])

Only gawk has PROCINFO array, with fields:
    version:3.1.8
    pid:17171
    ppid:17139
    pgrpid:17171
    uid:0
    euid:0
    gid:0
    egid:0
    group1:0    # group1..groupn are supplementary groups
    FS:FS or FIELDWIDTHS or FPAT
    [strftime]
    [sorted_in]

Only some versions of gawk permit calling functions indirectly:
    ...funcname="foo"; @funcname(1,2)...

Only gawk has:
    asort(a, [a.backup]) ~~> nelems, string "10" sorts lexicographically but numeric 10 sorts arithmetically
    asorti(a, [a.backup]) ~~> nelems, this discards original values and sorts the original keys lexicographically
    strtonum("string")    # more portably, use "string"+0
    isarray(a) in gawk4
    command-or-socket |& getline    # co-process
    print ... |& command-or-socket

The following special filenames may be used with the |& co-process operator for creating TCP/IP network connections:
    /inet{,4,6}/tcp/lport/rhost/rport
        Files for a TCP/IP connection on local port lport to remote host rhost on remote port rport. Use a port of 0 to have the system pick a port. Use /inet4 to force an IPv4 connection, and /inet6 to force an IPv6 connection. Plain /inet uses the system default (most likely IPv4).
    /inet{,4,6}/udp/lport/rhost/rport
        Similar, but use UDP/IP instead of TCP/IP.

optional second argument to the close() function.
ability to use positional specifiers with printf and sprintf().
Localizable strings.
The bindtextdomain(), dcgettext(), dcngettext() functions.
BINMODE, LINT, TEXTDOMAIN
adding new built-in functions dynamically with the extension() function.

Useful links

These links may also be of interest:

Awk and Lua

See summary of Lua regex.

Commands in brown are from awkenough.

in lua | in awk
string.find(str, pattern, [startpos]) --> (start,stop) or nil | match(str, pat) --> returns and sets RSTART, also sets RLENGTH
 | index(str, needle) --> start or 0
 | nthindex(str, needle, [nth=1, permits -1]) --> start or 0
string.match(str, pattern, [start=1]) --> (%0) or nil or (%1,%2,...) | matchstr(str, pat, [nth=1]) --> \\0, sets RSTART and RLENGTH
 | gawk's match only matches nth=1; it returns RSTART and fills a third array argument with 0,0start,0length,1,1start...
string.gmatch(str, pattern) --> an iterator over all matches or sets of matching groups | gmatch(str, pat, MATCHES, STARTS) --> nmatches
The iteration sequence will look like: (%0), (%0), (%0) ...; or like: (%1,%2...), (%1,%2...),... |
With this function, one can't specify the position to start matching. |
string.gsub(str, pattern, replacement, [max#repls]) --> (newstr,nrepls) | gensub(pat, repl, nth/"g", str): is closest to Lua's gsub
 | sub(pat, repl, str): mutate first match in str
 | gsub(pat, repl, str): mutate all matches in str
string:len | length(str)
string:lower | tolower
string:upper | toupper
string.rep(str, count,[5.2 adds sep]) | rep(str,count,[sep])
string.sub(str, start,[stop]) | substr(str,start,[len])
 | has_prefix
 | has_suffix
 | split(str,ITEMS,[seppat],[gawk's SEPS]) --> nitems
 | gsplit(str,ITEMS,[seppat],[SEPS]) --> nitems
 | asplit(str, PAIRS, ["="], [" "]) --> nitems
table.concat(tbl,[sep],[start],[stop]) --> string | concat([start=1], [len=to_end], [fs=OFS], [A]) --> string
 | if you want to preserve existing FS, need to use gsplit
 | to concat an array without specifying len: concat(start, uninitialized, OFS, A)
string:reverse | reverse([A])
table.remove(tbl, [pos=from end]) --> value formerly at tbl[pos] | pop([start=from_end], [len=to_end], [A]) --> values sep by SUBSEP
table.insert(tbl,[valpos=insert at end],value) --> nil | insert(value, [start=after_end], [A]) --> new length of array
 | extend(VALS, [start=after_end], [A]) --> new length of array
table.sort(tbl, [lessthan]) | sort(A)
 | hsort(A)
 | qsort(A, 1, length(A))
 | isempty(A)
 | has_value(A, value)
 | includes(A, B, [onlykeys?]): is B <= A?
 | union(A, B, [conflicts]) --> mutates A
 | intersect(A, B, [conflicts]) --> mutates A
 | subtract(A, B, [conflicts]) --> mutates A