Awk: Difference between revisions

Revision as of 18:26, 4 May 2012

This page compares BusyBox's implementation of awk (using this config file) with gawk (versions >= 3.1.8) and FreeBSD 9's nawk, which is based on Bell Labs/Brian Kernighan's 2007 version of awk.

It's not intended as a tutorial, but rather as a technical summary and list of "gotchas" (places where different implementations may behave in different or unexpected ways).

Invoking awk

Awkenough is a small set of Awk utility routines and a C stub that makes it easier to write shell scripts with #!/usr/bin/awk -f lines. (I'll make an Alpine package for this soon.)

awk [-F char_or_regex] [-v var=value ...] [--] 'script...' [ ARGV[1] ... ARGV[ARGC-1] ]
awk [-F char_or_regex] [-v var=value ...] -f scriptfile ... [--] [ ARGV[1] ... ARGV[ARGC-1] ]

* all implementations honor \t and so on in `-v FS=expr`. BusyBox doesn't honor \t in `-F expr`. nawk and gawk --traditional (but not --posix) interpret `-F t` as `-F '\t'`.

* scriptfile can be "-" for stdin

* ARGVs can be of any form, but awk only knows how to automatically process arguments whose form is:
    * var=unquoted_value   # assignments will be made when the main loop reaches that ARGV
    * ""                   # will be skipped
    * filename
    * -                    # will use stdin
  If no files are processed (nor "-"),  stdin will be processed after all command-line assignments.
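
A minimal sketch of how var=value operands interleave with file operands. The temporary files and the variable name `tag` are made up for this illustration.

```sh
# Two throwaway input files (hypothetical names, created only for this demo).
tmpdir=$(mktemp -d)
printf 'one\n' > "$tmpdir/a.txt"
printf 'two\n' > "$tmpdir/b.txt"

# Each var=value operand is assigned only when the main loop reaches it,
# so a.txt is read with tag=first and b.txt with tag=second.
argv_out=$(awk '{ print tag ": " $0 }' tag=first "$tmpdir/a.txt" tag=second "$tmpdir/b.txt")
printf '%s\n' "$argv_out"
rm -rf "$tmpdir"
```

This should print `first: one` followed by `second: two`.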

* standalone scripts:
    -----------------
    #!/usr/bin/awk -f
    # will receive the script's ARGC, ARGV with ARGV[0]=awk
    BEGIN { ... }
    ...
    -----------------

* gawk includes unrecognized -options that precede -- in ARGV; other implementations (and gawk --traditional) fail/warn

* gawk-only switches:

    --lint: warns about nonportable constructs [--lint=fatal, --lint=invalid]
    --re-interval: enables /pat{1,3}/ ; some versions of gawk enable by default, others don't
    --traditional/compat: makes gawk behave like nawk
    --posix: like --traditional --re-interval, with some further restrictions

    AWKPATH=.:/usr/local/share/awk # used by -f slashless
    -f scriptfile1 -f scriptfile2 --source 'script' ... # for mixing -f with commandline script
    --exec scriptfile: is last option processed and disables all -v; use for shebang line for scripts in untrusted environments (cgi)
    some versions of gawk have --sandbox to disable system("foo"), "foo"|getline, getline <"file", print|"foo", print >"file", gawk's extension(obj, func); then only local resources explicitly specified in ARGV will be accessible

Awk grammar

Blocks

Lines can be broken after any of: \ , { && || ? : (last two only in BusyBox and gawk, not in gawk --posix)

# functions have global scope, and their invocations can syntactically precede their definitions
# scalars are passed by value, arrays are passed by reference
# function calls can be nested or recursive; upon return, values of callers parms shall be unchanged (except for contents of array parms)

function foo(x,y) {
    ... # x,y are local; other vars are global (not dynamic, as in shell)
        # number of arguments in function call needn't match the number of parms in function definition
    return [value=""]
}
        # insertion sort
        function sort(array, len,   temp, i, j) {
          for (i = 2; i <= len; ++i) {
            for (j = i; j > 1 && array[j-1] > array[j]; --j) {
              temp = array[j]
              array[j] = array[j-1]
              array[j-1] = temp
            }
          }
          return
        }
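
A hedged usage sketch for an insertion sort of this shape: it reads one number per line, sorts them, and prints them on one line. The version here includes a `j > 1` guard so that the inner loop never looks at array[0]; the input values are arbitrary.

```sh
sorted=$(printf '3\n-1\n2\n' | awk '
# temp, i and j are extra parms used as locals
function sort(array, len,   temp, i, j) {
  for (i = 2; i <= len; ++i)
    for (j = i; j > 1 && array[j-1] > array[j]; --j) {
      temp = array[j]; array[j] = array[j-1]; array[j-1] = temp
    }
}
{ a[NR] = $1 }     # values read from input compare numerically
END {
  sort(a, NR)      # the array is passed by reference, so a is sorted in place
  for (i = 1; i <= NR; i++) printf "%s%s", a[i], (i < NR ? " " : "\n")
}')
printf '%s\n' "$sorted"
```

With the input above this should print `-1 2 3`.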



BEGIN { ... } ...
END { ... } ...

    # Multiple BEGIN and END blocks are merged.

    # If there are only BEGIN blocks and getline isn't used, then POSIX requires awk to terminate
    #   without reading stdin or any file operands. My implementations honor this, but some historical
    #   implementations discard input in this case. For portability, you can explicitly append `< /dev/null` to the invocation of awk.
    # If there is an END block, stdin will be consumed before the END actions are executed.

    # POSIX requires that inside END, FILENAME NR FNR NF retain their most-recent values, so long as `getline` is not invoked. My implementations comply, and also for $0 ..., but earlier versions of nawk did not; and Darwin's awk may not.



BEGINFILE { ... }   # only in some versions of gawk
ENDFILE { ... }     # only in some versions of gawk

@include "filename" # only in some versions of gawk


pattern         { action ... }      # action defaults to `print $0`; pattern defaults to true
pattern,pattern { action ... }

where pattern := /regex/            # interpreted as `$0 ~ /regex/`
                 relational
                 pattern && pattern
                 pattern || pattern
                 ! pattern
                 (pattern)
                 pattern ? pattern : pattern

and relational := 
                 expr  ~ RE         # RE may be regex literal or any string-valued expression
                 expr !~ RE
                 expr EQ expr       # EQ may be: == != <= < > >=
                 expr in array_name
                 (expr,...) in array_name
                 expr               # interpreted as false when expr evaluates to 0 or "", else to true
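
A small sketch of a range pattern (`pattern,pattern`), assuming made-up marker lines START and END in the input. The action runs for every record from the first match through the closing match, inclusive.

```sh
range_out=$(printf 'a\nSTART\nb\nEND\nc\n' | awk '/START/,/END/ { print NR ": " $0 }')
printf '%s\n' "$range_out"
```

This should print records 2 through 4 only.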

Statements

func(expr,...)  # evaluates to a value but can also occur in statement contexts
                # when calling user-defined functions, no space can separate `func` and `(`

lvalue = expr   # evaluates to the value of expr but can also occur in statement contexts 
                # lvalue is `var`, `array[k]`, or `$numeric`
also: lvalue += 1
also: lvalue ++

expr            # nawk doesn't permit in statement contexts

break
continue
    # historic implementations interpret break/continue outside of a while, do, or for-loop as `next`
    # BusyBox supports this usage; as do some versions of gawk --traditional

return [expr=""]
    # POSIX doesn't specify behavior outside of a function definition


next            # read next line of input, restart from first rule
                    BEGIN {...}
                    NR==1 { ...; next } # get next record and start over
                    {...}
                # POSIX doesn't specify behavior in BEGIN or END blocks

nextfile        # not in gawk --traditional

exit [status=0] # will execute END actions, or exit immediately if invoked inside END
    in gawk and nawk, "... exit n ... END { exit; }" will preserve status=n
    in BusyBox awk, will revert to 0


delete array[k]
delete array    # not in gawk --traditional; more portably: split("",array)
                # even though it no longer has any elements, you cannot reuse the array name as a scalar variable
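
A sketch of the portable way to empty an array, using `split("", array)` instead of `delete array`. The array name `arr` is arbitrary.

```sh
cleared=$(awk 'BEGIN {
  n = split("a b c", arr)      # arr[1], arr[2], arr[3]
  split("", arr)               # portable way to remove every element
  count = 0
  for (k in arr) count++       # nothing left to iterate over
  print n, count
}')
printf '%s\n' "$cleared"
```

This should print `3 0`.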

if (test) action [; else if (test2) action] [; else action]
if (test) { ... } [else action]

while (test) action
do action while (test)  # will execute action at least once

for (i=1; i<=NF; i++) action
for (k in array) action
    # iteration order is unspecified; effect of adding new elements to array while iterating is undefined

# parens optional for print, printf; should be used if any of the arguments contains `>`
# arg arg arg : concatenated
# arg,arg,arg : separated by OFS
print                  # prints $0
print ""               # prints newline
print [(] expr,... [)] # statement, though it accepts (args,...)
print [(] expr,... [)] > "file"
print [(] expr,... [)] >> "file"
print [(] expr,... [)] | "pipe command"
    Some awks (not BusyBox) permit `for(;;print) ...`
printf [(] "format", expr... [)] [... as for print ...]  # statement, though it accepts (args,...)
    %[n$][#][-|0][+| ]['][minwidth][.precision][wordsize]f
        `n$` use argv[n] instead of next, 1-based: only gawk
        `#` forces octals to have initial 0, hex to have initial 0x, floats must have .
        `-|0` means align to left, or right-align padding with zeros instead of spaces
        `+| ` means force sign or leading space
        `'` means 1,000,000: no awks
        width and precision can be *, supplied by argv: only gawk and nawk
        precision defaults to 0, gives post-. digits or maxwidth for string
        wordsize = hh/char h/short l ll: only nawk honors these
        format specifier:
            signed: feg d/i
            unsigned: xuo s c(on BusyBox, only 0-0x7fff ~> "\xff")
            not in awks: b(like s with \t in argv)
                         p(pointer)
                         n(save number chars so far in &argv)

        \b and \r: backspace (to margin), 0x8 and 0xd
        \v \f: next line same column, 0xb and 0xc
        \n \t \' \\
        \a: alarm 0x7
        \c: ignore rest of string: no awks
        \nnn: up to three octal digits
            only gawk will printf or sprintf "\0", others terminate
            gawk's printf and sprintf and nawk's printf also work with "%c", 0
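
A short sketch of the more portable printf flags described above (width, `-`, `0`, `+`, precision). It deliberately avoids `n$`, `*` and the wordsize modifiers, since support for those varies.

```sh
fmt_out=$(awk 'BEGIN {
  printf "[%5d]\n", 42          # right-aligned in a field of width 5
  printf "[%-5d]\n", 42         # left-aligned
  printf "[%05d]\n", 42         # padded with zeros instead of spaces
  printf "[%+d]\n", 42          # forced sign
  printf "[%.2f]\n", 3.14159    # precision = digits after the point
  printf "[%.3s]\n", "abcdef"   # precision = maximum width for strings
  printf "[%x %o]\n", 255, 8    # hex and octal
}')
printf '%s\n' "$fmt_out"
```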

Expressions

/pat/            # when this is a complete expr, interpreted as `$0 ~ /pat/`
Differences from other EREs:
    {m,n} treated as literal by nawk and gawk --traditional
    In some versions of gawk --traditional, [[:classes:]] not available.
    dot and [^x] do match newline
    ^ $ always anchors and only match against start-of-buffer and end-of-buffer (like \` and \')
    \1 is \x01; no backrefs available in match patterns
    In gawk and nawk, [a\]1] matches a,],1. In BusyBox, it matches a,\ followed by literal 1 then ].
    In BusyBox and gawk, /ab\52c/ is interpreted as /ab*c/ not /ab\*c/. In nawk and gawk --traditional, it's interpreted as latter.
    Literal newlines not allowed in /pat/, but ok in strings interpreted as patterns. /pat\nmore/ and "pat\\nmore" are both ok.


x=$1 $2          # string concatenation
$one
$(one+two)       # any numeric-valued expression can follow $
x=1,2            # assigns 1 to x
x=(1,2)          # assigns 1 SUBSEP 2 to x
== != <= < > >= ~ !~
&& has higher precedence than ||
!(k in A)
test ? expr : expr


in BusyBox and gawk, 0[0-7]+ is interpreted as octal and 0x[[:xdigit:]]+ is interpreted as hex
+ - * / % ^ (often, as in BusyBox, ** also does exponentiation)
++ -- lvalue | lvalue ++ --
lvalue += ... %= ^= value


C escapes in "string" and /pat/:
    "\OOO" up to 3 octal digits
    "\\ \" \n \t \r \f \v \b \a"
    "\xff" # not in gawk --posix; BusyBox restricts to two digits
    behavior of \m for any other m is unspecified; my awks use literal m (gawk gives warning)
only gawk suppresses \<newlines> inside string literals


The expression `k in array` doesn't create an array entry, but the reference `array[k]` will create an entry with an uninitialized value. (`k in array` will then be true.)
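
A sketch of that difference between testing and referencing. The names `arr` and `x` are arbitrary.

```sh
in_out=$(awk 'BEGIN {
  print ("k" in arr)   # 0: the membership test creates nothing
  x = arr["k"]         # a plain reference is enough to create arr["k"]
  print ("k" in arr)   # 1: the entry now exists, with an uninitialized value
}')
printf '%s\n' "$in_out"
```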

References to fields > NF, on the other hand, don't create new fields. Assignments to these fields do create them (increasing NF and setting any intermediate fields to uninitialized values). Assignments to any field (even <= NF) cause $0 to be recomputed using OFS, but do not cause $0 to be reparsed. (So the modified/new field may contain embedded FS characters.)

Assignments to $0 do cause NF, $1, ... to be recomputed.
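
The field rules above can be sketched like this, assuming OFS is set to `-` so that the rebuilt $0 is easy to tell apart from the original.

```sh
field_out=$(echo 'a b c' | awk -v OFS=- '{
  $2 = "x y"           # $0 is rebuilt with OFS, but not re-split
  print $0; print NF   # the embedded space does not add a field
  $5 = "e"             # creates an empty $4 and a new $5
  print $0; print NF
  $0 = "p q"           # assigning $0 re-splits it
  print NF, $2
}')
printf '%s\n' "$field_out"
```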

String vs numeric

Uninitialized values include unset variables and array elements, invalid fields (> NF) or valid fields of length 0, and unassigned function parms (which can be used as scalars or arrays). Uninitialized scalars have value "", which math operators treat as 0.
Can force string interpretation by `var ""`, or force numeric interpretation by `var + 0`. "1text" is coerced as 1.
In boolean contexts: false when expr evaluates to 0 or "", else true.
$(expr) is not required to convert uninitialized or string expr, but my implementations do so.
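
A sketch of these coercions. The variable name `unset` is just a name that is never assigned.

```sh
coerce_out=$(awk 'BEGIN {
  print "1text" + 0        # the leading numeric prefix is used
  print "text" + 0         # no numeric prefix, so 0
  print unset + 0          # uninitialized acts as 0 in arithmetic
  print length(unset "")   # and as "" in string context
  if (!unset) print "unset is false"
}')
printf '%s\n' "$coerce_out"
```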


"Numeric string" := value is of the form / *[+-]?NUMBER */, and it was supplied in a way that doesn't distinguish strings from numbers:
    n=$1 or split("string",n[]) or getline n
    n=FILENAME or ARGV[1] or ENVIRON["foo"]
    awk -v 'n=...' ...

Comparisons are numeric iff:
    both operands are numeric, or one is numeric and other is a "numeric string" or is uninitialized
    my implementations don't require that at least one operand be numeric
Otherwise (one of the operands is an explicit "string"), comparison is stringwise.

awk 'BEGIN {print "00" == 0}' # false, first operand is not a "numeric string"

awk -v'x=00' -v'n=000' 'BEGIN {
    y="0000"
    print(x=="0", x==0, x==u, x==n, y=="0", y==0, y==u, y==n) # only second, third and fourth will be true, all other comparisons are stringwise
    x+y
    print(x=="0", x==0, x==u, x==n, y=="0", y==0, y==u, y==n) # in nawk, the previous arithmetic reference to y converts its value to a "numeric string" too
                                                  # so now the last three comparisons are also true
                                                  # even numeric constants are susceptible to this side-effect
                                                  # (print the constant and it's converted to a numeric string, and ignores further changes to OFMT)
                                                  # however, stringwise evaluation of y still gives "0000"
}'
In POSIX and my other implementations, referencing doesn't have that side-effect.
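
A smaller sketch of the same rules using fields, which are the most common source of "numeric strings". The input values 10 and 9 are arbitrary.

```sh
cmp_out=$(echo '10 9' | awk '{
  print ($1 > $2)          # both fields look numeric, so 10 > 9 numerically
  print ($1 "" > $2 "")    # both forced to strings, and "10" sorts before "9"
  print ($1 > "9")         # a string constant makes the comparison stringwise
}')
printf '%s\n' "$cmp_out"
```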

Arrays

flavor[1]="cherry" ... print flavor[1]
for (k in flavor) action
nfields = split(string, larray, [sep=FS]) # sep can be single char or regex
delete array[key] # BusyBox also honors "delete array"

multiarray[k1, k2]="foo"
(k1,k2) in multiarray
for (kk in multiarray) { split(kk,k,SUBSEP) ... } # now k[1],k[2]
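
A sketch of a two-subscript array, with `grid` as a made-up name. The key is really one string joined by SUBSEP, so it has to be split again to recover the parts.

```sh
multi_out=$(awk 'BEGIN {
  grid[1, 2] = "foo"
  if ((1, 2) in grid) print "found"
  for (kk in grid) {
    n = split(kk, k, SUBSEP)     # recover the two subscripts
    print n, k[1], k[2], grid[kk]
  }
}')
printf '%s\n' "$multi_out"
```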

Functions

cos(x), sin(x), atan2(y,x), exp(x), log(x), sqrt(x)
int(x)  # just truncates towards 0, use `printf "%.0f" ...` to round
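
A sketch contrasting truncation with rounding. The sample values avoid exact halves, whose rounding depends on the C library.

```sh
round_out=$(awk 'BEGIN {
  print int(2.7), int(-2.7)                          # truncates towards 0
  print sprintf("%.0f", 2.7), sprintf("%.0f", -2.7)  # rounds to nearest
}')
printf '%s\n' "$round_out"
```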

0 <= rand() < 1
    # 1 + int(rand() * TOPNUM) # returns number in range [1..TOPNUM]
srand(seed=time_of_day) ~~> oldseed

# bitwise functions are only in gawk and BusyBox
and(x,y) or(x,y) xor(x,y) lshift(x,count) rshift(x,count) compl(x)  # 32-bit base 0xffffffff on BusyBox, may be 53-bit on x86_64 gawk

# time functions are only in gawk and BusyBox
systime() # seconds since epoch
strftime("format",systime()=current)
mktime("YYYY MM DD HH MM SS [dst?=-1]") ~~> seconds since epoch
    # HH etc values can be non-standard, e.g. -1
    # uses local TZ; dst? 1 yes, 0 no, <0 auto

tolower("string"), toupper("string")
sprintf("format", expr...)  # does require parens

# gawk is multibyte-aware, so length, substr, index, and match all count by chars, not bytes
length(string=$0)
    # nawk and gawk support length without ()
    # nawk and gawk (but not gawk --posix) support length(array)
substr("string", 1-based-start, maxlength=until_end) ~~> "substring"
index("haystack","needle") ~~> 1-based, or 0 if fails
match(string, /pat/) ~~> 1-based starting index, or 0 if fails
    # sets RSTART and RLENGTH (when match fails, they are 0 and -1)
    # ... match($0, /pat/) { print substr($0, RSTART, RLENGTH) } ... # prints first occurrence of /pat/
    gawk has `match(string,/pat/,groups)`: fills groups[0,0start,0length,1,1start,1length,...] with text and positions of \\0, \\1, ...
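
A sketch of the two-argument match() and the values it leaves in RSTART and RLENGTH. The input line is made up.

```sh
match_out=$(echo 'id=12345 name=x' | awk '{
  if (match($0, /[0-9]+/))
    print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
  r = match($0, /zzz/)       # no match
  print r, RSTART, RLENGTH   # 0, 0 and -1 after a failed match
}')
printf '%s\n' "$match_out"
```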

split("string", larray, sep=FS) ~~> nitems
    # splits into larray[1]...larray[nitems]
    # some versions of gawk permit fourth argument: lsep, to receive actual FS values (lsep[0] [1] lsep[1] ... [nitems] lsep[nitems])
    # BusyBox and gawk accept "" and // as sep; with // BusyBox makes an extra empty field at end
sub(/pat/, "subst", lstring=$0) ~~> 1 if successful, 0 if failed
gsub(/pat/, "subst", lstring=$0) ~~> nsubs
gensub(/pat/, "subst", "g" or which match, string=$0) ~~> newstring
    # gawk and BusyBox only
    # string is not mutated
    # "subst" honors not only & (same as \\0), but also \\1, \\2 ...
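
A sketch of sub and gsub only, since gensub is not available everywhere. Working on copies (`s`, `t`) leaves $0 alone.

```sh
subst_out=$(echo 'foo bar foo' | awk '{
  s = $0
  n = gsub(/foo/, "<&>", s)   # & stands for the matched text
  print n, s
  t = $0
  r = sub(/foo/, "X", t)      # only the first match is replaced
  print r, t
  print $0                    # the record itself was not mutated
}')
printf '%s\n' "$subst_out"
```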

system("command") ~~> exit code, output not captured

# getline is a function though it doesn't use parens
# all forms read one line each time they're evaluated; like "next" without restarting from first rule
# returns 1 on success, 0 on EOF, -1 on error; file is not closed automatically
# updates NF when assigning to $0
getline [var]       # updates NR and FNR when file not supplied
getline [var] < "-" # read from stdin instead of ARGV[ARGIND]
        # loop until we get a name
        while (!name) { printf("Enter a name? "); getline name < "-" }
getline [var] < "file"
"pipe command" | getline [var]

close("file" or "pipe command") ~~> 0 if successful
    # awk doesn't automatically close files, pipes, sockets, or co-processes when they return EOF.
    # the return value of close() is unspecified; gawk uses value from fclose(3) or pclose(3), or -1 if the named file, pipe, or co-process was not opened with a redirection.
    { print ... | "sort>tmpfile" } END { close("sort>tmpfile"); while ((getline < "tmpfile")>0) ... close("tmpfile") }
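
A sketch of reading from a pipe with getline and then closing it. The command string and the path /nonexistent/file are placeholders. Comparing the result with `> 0` matters, because a bare `while (getline ...)` would loop forever on -1.

```sh
getline_out=$(awk 'BEGIN {
  cmd = "echo one; echo two"
  while ((cmd | getline line) > 0) out = out line ","
  close(cmd)                 # without this, the pipe stays at EOF
  cmd | getline first        # after close(), the command runs afresh
  close(cmd)
  print out, first
  print (getline x < "/nonexistent/file")   # -1 on error
}')
printf '%s\n' "$getline_out"
```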

fflush(output "file" or "pipe command") ~~> 0 if all requested buffers successfully flushed, else -1
    # not in gawk --posix
    # if no argument is supplied, flushes stdout
    # if "" is supplied, flushes all open output files and pipes

Special filenames

# only in gawk, though all but /dev/fd/n (n>2) will generally still work
"/dev/tty"
"/dev/stdin", can also use: "-"
"/dev/stdout"
"/dev/stderr", can also use: | "cat 1>&2" 
"/dev/fd/n"

System variables

ENVIRON	associative array of environment variables
    array is mutable, but changes don't affect environment of child processes (POSIX leaves implementation-dependent)

ARGC	Number of arguments on command line
ARGIND  0 when processing stdin (and ARGC=1), 1 when processing ARGV[1], ...
        only in gawk and BusyBox; in BusyBox, mutating this affects which file is read next        
ARGV	An array from 0...ARGC-1 containing the command-line arguments
        Entries of form "var=value" handled specially, entries of form "" skipped
        Entries that are directories: implementations and versions differ in whether they'll warn/error (gawk --traditional will error)
        You can: decrement ARGC or set ARGV[...]="" or delete ARGV[...] to suppress argument processing
        You can: append to ARGV and increment ARGC, then argument will be processed later
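
A sketch of both tricks in a BEGIN block: blanking an operand so it is skipped, and appending one so it is read later. The file names and the variable `extra` are made up.

```sh
tmpdir=$(mktemp -d)
printf 'A\n' > "$tmpdir/a.txt"
printf 'B\n' > "$tmpdir/b.txt"

argvmod_out=$(awk -v extra="$tmpdir/b.txt" '
BEGIN {
  ARGV[1] = ""           # the first operand will be skipped
  ARGV[ARGC] = extra     # queue one more file
  ARGC++
}
{ print }' /nonexistent-file "$tmpdir/a.txt")
printf '%s\n' "$argvmod_out"
rm -rf "$tmpdir"
```

This should print `A` then `B`, without complaining about the missing first operand.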

ERRNO   In gawk and BusyBox, contains string error message after getline or close fails
FILENAME	Current filename

NR	        Number of current record
    NR is mutable but changing it has no side-effects.
FNR	        Like NR, but relative to the current file
RS	        Record separator (defaults to newline)
    If RS is "", it's interpreted as a span of empty lines, and FS will then always implicitly include "\n".
    nawk only honors first character of RS
    BusyBox and gawk permit RS to be a regex, and set RT to the actual terminating text (will be "" at EOF):
        awk -v 'RS=[\t\n]' 'BEGIN {ORS=""} {print $0 RT} END {print "\n"}'
        awk -v 'RS=oldpat' -v 'ORS=newtext' '{ if (RT=="") printf "%s", $0 # EOF
                                               else print }' # print text between matches 
    gawk --traditional still honors regex RS, but doesn't set RT
    BusyBox permits FS to have 0-length matches:
        echo aaba|awk -F 'b*' '{print NF}' # yields 5
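
A sketch of the RS="" case (paragraph mode) described above. Records are separated by one or more empty lines, and newlines inside a record act as field separators.

```sh
para_out=$(printf 'a b\nc\n\n\nd e\n' |
  awk 'BEGIN { RS = "" } { print NR ": NF=" NF ", first=" $1 }')
printf '%s\n' "$para_out"
```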

ORS	Output record separator (defaults to newline)


NF	Number of fields in current record
    gawk and BusyBox recalculate $0 when NF is either increased or decreased (in latter case, fields are discarded); nawk does neither.
FS	Field separator (defaults to a space)
    POSIX requires that assigning a new value to FS has no effect on the current input line; it only affects the next input line.
    Only gawk conforms; BusyBox and FreeBSD awk will also use the new FS for current line if no fields have yet been referenced.
    FS=space (default): strips leading and trailing space/tabs, fields are separated by spans of space/tabs/newlines (in gawk --posix, only spans of spaces/tabs)
    FS=":" or "\t": each occurrence of the char separates another field
    FS="pat": separator is leftmost longest non-null and non-overlapping match of pattern
    FS="" (gawk and BusyBox only): do character-wise splitting
          They also accept "" and // as args to split (latter can't be assigned as value to FS)
          When splitting on //, BusyBox generates an additional empty field
          All my gawks permit sub(//,...) and gsub(//,...): matches each inter-character position, starting before first char
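
A sketch of the three portable separator styles. It uses split() as a stand-in for field splitting, on the assumption that a separator argument is treated the way FS would be; the array names are arbitrary.

```sh
fs_out=$(awk 'BEGIN {
  print split("  a   b  ", f1, " ")     # single space: strip and split on runs
  print split("a::b", f2, ":")          # single char: every occurrence counts
  print split("a1b22c", f3, /[0-9]+/)   # regex: longest match separates
}')
printf '%s\n' "$fs_out"
```

This should print 2, 3 and 3; the middle field of the second split is empty.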
OFS	Output field separator (defaults to a space)

SUBSEP	Separator character for array subscripts (\034)

OFMT	Output format for numbers (%.6g)
CONVFMT	String conversion format for numbers (%.6g).
    used for comparisons and array indexing, defaults to whatever value OFMT has
    integer strings are always converted using "%d"
    results are unspecified if CONVFMT isn't a floating-point format spec
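
A sketch that sets OFMT and CONVFMT to different values so their separate roles are visible. The formats chosen here are arbitrary.

```sh
convfmt_out=$(awk 'BEGIN {
  OFMT = "%.2f"; CONVFMT = "%.3f"
  x = 3.14159
  print x           # print converts numbers with OFMT
  s = x ""          # concatenation converts with CONVFMT
  print s
  print 17, 17 ""   # integer values are converted as integers either way
}')
printf '%s\n' "$convfmt_out"
```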

RSTART	First position in the string matched by match() function
RLENGTH	Length of the string matched by match() function

IGNORECASE=1 in gawk and BusyBox only
             Affects all pattern-matching and string comparison/indexing; doesn't affect array indexing

Gawk-only extensions

Some versions of gawk have:
    switch (expression) {
        case value|regex : statement
        ...
        [ default: statement ]
    }

Only gawk supports FIELDWIDTHS="2 4 2" # use FS=FS to revert to regular field-splitting
Only some versions of gawk support FPAT="a pat that every field satisfies"  # use FS=FS to revert to regular field-splitting
        also patsplit("string",larray,[pat=FPAT,lsep])
Only gawk has PROCINFO array, with fields:
        version:3.1.8
        pid:17171
        ppid:17139
        pgrpid:17171
        uid:0
        euid:0
        gid:0
        egid:0
        group1:0 # group1..groupn are supplementary groups
        FS:FS or FIELDWIDTHS or FPAT
        [strftime]
        [sorted_in]
Only some versions of gawk permit calling functions indirectly: ...funcname="foo"; @funcname(1,2)...

Only gawk has:
asort(a, [a.backup]) ~~> nelems, string "10" sorts lexicographically but numeric 10 sorts arithmetically
asorti(a, [a.backup]) ~~> nelems, this discards original values and sorts the original keys lexicographically
strtonum("string") # more portably, use "string"+0
isarray(a) in gawk4
 

command-or-socket |& getline # co-process
print ... |& command-or-socket

The following special filenames may be used with the |& co-process operator for creating TCP/IP network connections:
/inet{,4,6}/tcp/lport/rhost/rport
      Files for a TCP/IP connection on local port lport to remote host
      rhost on remote port rport. Use a port of 0 to have the system
      pick a port. Use /inet4 to force an IPv4 connection, and /inet6
      to force an IPv6 connection. Plain /inet uses the system
      default (most likely IPv4).
/inet{,4,6}/udp/lport/rhost/rport
      Similar, but use UDP/IP instead of TCP/IP.


optional second argument to the close() function.
ability to use positional specifiers with printf and sprintf().
Localizable strings. The bindtextdomain(), dcgettext(), dcngettext() functions.
BINMODE, LINT, TEXTDOMAIN
adding new built-in functions dynamically with the extension()  function.

Useful links

These links may also be of interest:

Awk and Lua

See summary of Lua regex.

Commands in brown are from awkenough.

in lua:  `string.find(str, pattern, [startpos])` --> (start,stop) or nil
in awk:  `match(str, pat)` --> returns and sets RSTART, also sets RLENGTH
         `index(str, needle)` --> start or 0
         `nthindex(str, needle, [nth=1, permits -1])` --> start or 0

in lua:  `string.match(str, pattern, [start=1])` --> (%0) or nil or (%1,%2,...)
in awk:  `matchstr(str, pat, [nth=1])` --> \\0, sets RSTART and RLENGTH
         gawk's `match` only matches nth=1; it returns RSTART and fills a third array argument with 0,0start,0length,1,1start...

in lua:  `string.gmatch(str, pattern)` --> an iterator over all matches or sets of matching groups
         The iteration sequence will look like: (%0), (%0), (%0) ...; or like: (%1,%2...), (%1,%2...),...
         With this function, one can't specify the position to start matching.
in awk:  `gmatch(str, pat, MATCHES, STARTS)` --> nmatches

in lua:  `string.gsub(str, pattern, replacement, [max#repls])` --> (newstr,nrepls)
in awk:  `gensub(pat, repl, nth/"g", str)`: is closest to Lua's gsub
         `sub(pat, repl, str)`: mutate first match in str
         `gsub(pat, repl, str)`: mutate all matches in str

in lua:  `string:len`
in awk:  `length(str)`

in lua:  `string:lower` `string:upper`
in awk:  `tolower` `toupper`

in lua:  `string.rep(str, count,[5.2 adds sep])`
in awk:  `rep(str,count,[sep])`

in lua:  `string.sub(str, start,[stop])`
in awk:  `substr(str,start,[len])`
         `has_prefix`
         `has_suffix`

in lua:  (none listed)
in awk:  `split(str,ITEMS,[seppat],[gawk's SEPS])` --> nitems
         `gsplit(str,ITEMS,[seppat],[SEPS])` --> nitems
         `asplit(str, PAIRS, ["="], [" "])` --> nitems

in lua:  `table.concat(tbl,[sep],[start],[stop])` --> string
in awk:  `concat([start=1], [len=to_end], [fs=OFS], [A])` --> string
         if you want to preserve existing FS, need to use `gsplit`
         to concat an array without specifying len: `concat(start, uninitialized, OFS, A)`

in lua:  `string:reverse`
in awk:  `reverse([A])`

in lua:  `table.remove(tbl, [pos=from end])` --> value formerly at tbl[pos]
in awk:  `pop([start=from_end], [len=to_end], [A])` --> values sep by SUBSEP

in lua:  `table.insert(tbl,[valpos=insert at end],value)` --> nil
in awk:  `insert(value, [start=after_end], [A])` --> new length of array
         `extend(VALS, [start=after_end], [A])` --> new length of array

in lua:  `table.sort(tbl, [lessthan])`
in awk:  `sort(A)`
         `hsort(A)`
         `qsort(A, 1, length(A))`

in lua:  (none listed)
in awk:  `isempty(A)`
         `has_value(A, value)`

in lua:  (none listed)
in awk:  `includes(A, B, [onlykeys?])`: is B <= A?
         `union(A, B, [conflicts])` --> mutates A
         `intersect(A, B, [conflicts])` --> mutates A
         `subtract(A, B, [conflicts])` --> mutates A
