Awk: Difference between revisions
Latest revision as of 02:44, 25 August 2023
This page compares BusyBox's implementation of awk (using this config file) with gawk (versions >= 3.1.8) and FreeBSD 9's nawk, which is based on Bell Labs/Brian Kernighan's 2007 version of awk.
It's not intended as a tutorial, but rather as a technical summary and list of "gotchas" (places where different implementations may behave in different or unexpected ways).
Invoking awk
awk [-F char_or_regex] [-v var=value ...] [--] 'awk script...' [ ARGV[1] ... ARGV[ARGC-1] ]
awk [-F char_or_regex] [-v var=value ...] -f scriptfile ... [--] [ ARGV[1] ... ARGV[ARGC-1] ]
Gawk makes a third invocation pattern possible, which mixes -f scriptfile options with an 'awk script...' on the command-line:
gawk [-F char_or_regex] [-v var=value ...] -f scriptfile ... -e 'awk script...' [--] [ ARGV[1] ... ARGV[ARGC-1] ]
Awkenough is a small set of Awk utility routines and a C stub that makes it easier to write shell scripts with awk shebang lines. (I'll make an Alpine package for this soon.) It also permits mixing -f scriptfile and command-line scripts, using the same -e option gawk provides. In both cases, the long-form option --source works the same as -e.
- Notes
- all implementations honor \t and so on in -v FS=expr. BusyBox doesn't honor \t in -F expr; perhaps this is a bug. You can use the $'\t' escapes of BusyBox ash to work around this.
- nawk and gawk --traditional (but not gawk --posix) interpret -F t as -F '\t'. This is a weird special-case preserved for historical compatibility.
- scriptfile can be - for stdin
- ARGVs can be of any form, but awk only knows how to automatically process arguments whose form is:
  - var=unquoted_value: assignments will be made when the main loop reaches that ARGV
  - "": will be skipped
  - filename
  - -: will use stdin
- If no files are processed (nor -), stdin will be processed after all command-line assignments.
- standalone scripts should look like this:
  Contents of foo.awk
  #!/usr/bin/awk -f
  # will receive the script's ARGC, ARGV with ARGV[0]=awk
  BEGIN { ... }
  ...
- or like this:
  Contents of bar.awk
  #!/usr/bin/runawk -f /usr/share/awkenough/library.awk
  ...
- gawk places any unrecognized short options that precede -- into ARGV; other implementations (and gawk --traditional) fail/warn
- some gawk-only options:
  - --lint: warns about nonportable constructs
  - --re-interval: enables /pat{1,3}/ regexes; some versions of gawk enable by default, others don't
  - --traditional or --compat: makes gawk behave like nawk
  - --posix: like --traditional --re-interval, with some further restrictions
  - --exec scriptfile: stops option processing and disables any further -vs; meant to be used in shebang line for scripts in untrusted environments (cgi)
  - --sandbox: disable system("foo"), "foo"|getline, getline <"file", print|"foo", print >"file", and gawk's extension(obj, func); so only local resources explicitly specified in ARGV will be accessible. This feature is only available on recent versions of gawk.
- when handling -f path options where path contains no /s, gawk will first search in directories specified in an AWKPATH environment variable
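As a rough illustration of the var=value note above, the following sketch shows an assignment taking effect only when the main loop reaches it in ARGV. The scratch directory and the file names one and two are made up for the example; nothing here is specific to any one awk implementation.

```sh
# Hypothetical scratch files, created just for this demonstration.
tmp=$(mktemp -d)
printf 'a\n' > "$tmp/one"
printf 'b\n' > "$tmp/two"

# x=1 is applied before "one" is read, x=2 before "two" is read.
out=$(awk '{ print x ":" $0 }' x=1 "$tmp/one" x=2 "$tmp/two")
printf '%s\n' "$out"

rm -rf "$tmp"
```

Because the assignments are processed in ARGV order, the same variable can carry a different value for each input file.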
Awk grammar
Blocks
- function definitions
- Functions have global scope, and they can be called from positions that syntactically precede their definitions.
- Scalars (that is, non-arrays) are passed by value, arrays are passed by reference.
- Function calls can be nested or recursive. Each invocation of the function has its own local environment: so upon return, values of the scalar parameters the caller supplied shall be unchanged (array parameters may have been mutated).
- Consider functions defined like this:
  function foo(x,y) {
      ...
      bar()
      return "result"
  }
  function bar(z) {
      y += 1
      return
  }
  Here the variables x and y are local to foo, and the variable z is local to bar. All other variable references are global: contrast the shell, where non-local variable references are instead dynamic. That is, in the shell, when foo calls bar, bar will be incrementing foo's local variable y. In awk, on the other hand, bar always only increments a global variable y.
  Note also that the number of arguments in a function call needn't match the number of parameters in the function's definition. If fewer are supplied (as in foo's call of bar), the missing variables are assigned either "" or an empty array, depending on how they're used in the function. If more arguments are supplied, POSIX leaves the behavior undefined; some implementations ignore the extras, while others (gawk, for instance) reject the call.
  Note also that the return statement needn't be given an explicit return value (it will default to ""). The return statement can also be omitted altogether.
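The scoping and argument-passing rules above can be sketched as follows. The function fill and the variable names are invented for this example; foo and bar mirror the definitions discussed above.

```sh
out=$(awk '
function foo(x, y) {
    y = 10      # y is a parameter, so this is local to foo
    bar()       # called with fewer arguments than bar declares
    return y
}
function bar(z) {
    y += 1      # not a parameter of bar: this is the global y
}
function fill(arr, n) {
    arr["k"] = "set"    # arrays are passed by reference
    n = 99              # scalars are passed by value
}
BEGIN {
    r = foo()
    print r, y
    A["x"] = 1; n = 1
    fill(A, n)
    print A["k"], n
}')
printf '%s\n' "$out"
```

The first line of output shows that bar left foo's local y alone and bumped the global y instead; the second shows the array change surviving the call while the scalar change does not.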
- BEGIN and END
BEGIN { ... }
...
END { ... }
...
- Multiple BEGIN blocks are merged; so too multiple END blocks.
- If there are only BEGIN blocks (no main loop or END blocks), and getline isn't used, then POSIX requires awk to terminate without reading stdin or any file operands. The implementations I checked honor this, but some historical implementations discarded input in this case. For portability, you can explicitly append < /dev/null to an invocation of awk you know shouldn't consume stdin.
- If there is a main loop or END blocks, and awk's processing of stdin hasn't been suppressed by encountering files in its ARGV list, then stdin will be processed before executing any END block(s).
- If there are END block(s), but exit status is invoked outside them, then the main loop stops and control passes immediately to the first END block. If exit status is invoked within an END block, or if there are no END blocks, the script terminates with result status.
- POSIX requires that inside END, all of FILENAME NR FNR NF retain their most-recent values, so long as getline is not invoked. My implementations comply, and also retain the most-recent fields $0 ..., but earlier versions of nawk did not; and Darwin's awk may not.
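A minimal sketch of the exit and END rules above, assuming three lines of made-up input on stdin: exit in the main loop hands control to END, NR keeps its most-recent value there, and the status given to exit becomes awk's exit status because END does not call exit again.

```sh
status=0
out=$(printf '1\n2\n3\n' | awk '
NR == 2 { exit 3 }          # leaves the main loop, jumps to END
{ print }
END { print "END reached at NR=" NR }
') || status=$?
printf '%s\n(exit status %s)\n' "$out" "$status"
```

Only the first record is printed by the main loop; the second triggers the exit before the print rule is reached.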
- main loop blocks
These are of the form:
pattern { action ... }
or:
pattern,pattern { action ... }
where action defaults to print $0; and pattern defaults to 1 (that is, to true).
- Patterns can take any of the forms:
  - /regex/
  - relational
  - pattern && pattern
  - pattern || pattern
  - ! pattern
  - ( pattern )
  - pattern ? pattern : pattern
  A bare /regex/ pattern is shorthand for $0 ~ /regex/.
- Relationals can take any of the forms:
  - expr ~ RE
  - expr !~ RE
  - expr eqop expr
  - expr in arrayname
  - (expr, ...) in arrayname
  - expr
  RE can be a regex literal like /a[bc]+/ or any expression that evaluates to a string, like "a[bc]+". Note that strings require an extra level of \s: you can write /\w+/ or "\\w+". The schema eqop can be any of: == != <= < > >=. A bare expression, as in the last form, is interpreted as false when the expression evaluates to 0 or "", else to true.
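The pattern forms above can be tried out with a couple of one-liners. This is only a sketch on made-up input: the first command uses a range pattern with the default action, and the other two write the same dot-matching regex once as a literal and once as a string.

```sh
# Range pattern with the default action (print $0).
range=$(printf 'a\nstart\nb\nstop\nc\n' | awk '/start/,/stop/')
printf '%s\n' "$range"

# The same regex as a literal and as a string:
# the string form needs the extra backslash.
lit=$(printf 'a.b\naxb\n' | awk '$0 ~ /a\.b/')
str=$(printf 'a.b\naxb\n' | awk '$0 ~ "a\\.b"')
printf '%s %s\n' "$lit" "$str"
```

Both the literal and the string form should select only the line containing a real dot.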
- other block forms
BEGINFILE { ... }
ENDFILE { ... }
These are only available in (recent versions of) gawk.
@include "filename"
This is only available in some versions of gawk. See also Aleksey Cheusov's Runawk, which uses a different approach.
Statements
Lines can be broken after any of: \ , { && || ? : (The last two only in BusyBox and gawk, and not in gawk --posix.)
- function calls
func(expr, ...)
always evaluates to a value, so can be used as an expression, but also can occur in statement contexts. (Some awks like nawk don't allow arbitrary expressions to occur in statement contexts; but BusyBox awk does.)
When calling user-defined functions, no space can separate the function name and the (.
- assignments
lvalue = expr
lvalue += 1    (similarly for -= *= /= %= ^=)
lvalue ++      (similarly for --)
++ lvalue      (similarly for --)
lvalue can take any of the forms:
var
arrayname[expr]
arrayname[expr, ...]
$expr
Like function calls, assignments always evaluate to a value, so can be used as expressions, but also can occur in statement contexts, even in nawk.
delete arrayname[expr]
delete arrayname[expr, ...]
delete arrayname
The last form is not available in gawk --traditional. More portably, you can get the same effect with:
split("", arrayname)
After an array has been deleted, it no longer has any elements; however, the array name remains unavailable for use as a scalar variable.
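A small sketch of the portable array-clearing idiom mentioned above. The helper count is made up for the example; it uses the common convention of declaring extra parameters (k, n) to get local variables.

```sh
out=$(awk '
function count(arr,    k, n) { n = 0; for (k in arr) n++; return n }
BEGIN {
    A["x"] = 1; A["y"] = 2
    before = count(A)
    split("", A)            # portable way to empty the whole array
    print before, count(A)
}')
printf '%s\n' "$out"
```

After the split call, A is still an array, just one with no elements.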
- control operators
if (test) action
if (test) action; else action
if (test) action; else if (test) action
if (test) action; else if (test) action; else action
if (test) { ... }
if (test) { ... } else action
and so on.
while (test) action
...
do action while (test)
...
The do...while form will execute action at least once.
for (i=1; i<=NF; i++) action
...
for (k in arrayname) action
...
The last form has an unspecified iteration order. Also, the effect of adding or removing elements from an array while iterating over it is undefined.
break
continue
Historic awk implementations interpreted break and continue outside of while/do/for-loops as next. BusyBox supports this usage, as do some versions of gawk --traditional.
next
This reads the next line of input and restarts the main loop from the first rule. For example, to treat the first line of awk's input specially, you can do this:
NR==1 {...handle first line...; next }
{...handle other lines...}
POSIX doesn't specify the behavior of next in BEGIN or END blocks.
nextfile
This aborts processing the current file and starts processing the next file (if any) from the first rule. It is not available in gawk --traditional.
return [value]
value defaults to "". POSIX doesn't specify the behavior of a return statement outside of a function.
exit status
This will begin executing END rules, or exit immediately if invoked inside an END block. status defaults to 0.
In gawk and nawk, sequences like this:
{ ... exit n }
END { exit }
will preserve the n status code. In BusyBox awk, the second exit will instead revert to the default of 0.
- printing
Parentheses are optional for print and printf, but are mandatory for sprintf. Also, parentheses should be used if any of the arguments to print or printf contains a >.
Though print and printf accept parenthesized argument lists, their invocations are statements, not expressions. They return no value and cannot appear in non-statement contexts. (sprintf, on the other hand, is a function, and its invocations can appear in both statement and expression contexts.)
The first of these:
print "a" "b"
print "a","b"
invokes print with a single argument, which is the simple concatenation "ab" of the expressions "a" and "b". On the other hand, the second form invokes print with two arguments. They will be separated in the output by the current value of OFS, which may, but need not, be "". The default value of OFS is a single space.
A bare print is interpreted the same as print $0. (Note also that a block of the form pattern with no {...} is interpreted as having the body { print $0 }.)
print ""
print("")
both print a newline.
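The difference between concatenation and a two-argument print can be seen in a short sketch like this one, run on one line of made-up input with OFS set to a visible character:

```sh
out=$(printf 'x y\n' | awk '
{
    OFS = "-"
    print "a" "b"       # one argument: the concatenation
    print "a", "b"      # two arguments, joined by OFS
    print               # same as print $0
}')
printf '%s\n' "$out"
```

The last line is printed unchanged because no field was assigned, so $0 is not rebuilt with the new OFS.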
print (expr, ...) > "path"
print (expr, ...) >> "path"
print to the designated path. (See also Built-in filenames, below.) The output file is not closed until close("path") is executed, or awk exits.
print (expr, ...) | "shell pipeline"
prints to the designated pipeline. The pipeline is not closed until close("shell pipeline") is executed, or awk exits.
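A minimal sketch of printing to a pipeline and closing it explicitly. It assumes a sort command is available on the PATH; the variable cmd is just a convenient way to make sure the same string is used for printing and for close.

```sh
out=$(awk 'BEGIN {
    cmd = "sort"
    print "b" | cmd
    print "a" | cmd
    close(cmd)          # sort sees end-of-input and emits its output now
    print "done"
}')
printf '%s\n' "$out"
```

Without the close call, the pipeline would only be closed when awk exits, so sort's output would typically appear after "done" rather than before it.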
printf ( format, expr, ... )
can be followed by > "path" or >> "path" or | "shell pipeline", just like print.
This usually implements a subset of the functionality of printf(3). The format string can contain:
- normal Awk string escape codes, such as:
  - \b (0x8) which backspaces one space
  - \r (0xd) which on Unix backspaces to the start of the current line
  - \v and \f (0xb and 0xc) which on Unix stay in the current column but go to the next screen line
  - \n and \t and \" and \\
  - \a (0x7)
  - \c which ignores the rest of the string (none of my awk implementations honored this)
  - \000 for up to three octal digits
  Some implementations accept an embedded "\0"; others would take it to terminate the string. (A different way to print "\0" is to use printf "%c", 0. This works with gawk's printf and sprintf and nawk's printf.)
- formatting codes, of the form:
  %[n$][#][-|0][+| ]['][minwidth][.precision][wordsize specifier][format specifier]
  where:
  - n$ means use argv[n] instead of the next argv in sequence. The count is 1-based, and only gawk honors this.
  - # forces octals to have an initial 0, hex to have initial 0x, floats to always have a decimal point.
  - - means align to the left
  - 0 means right-align as usual but pad with zeros instead of spaces
  - + forces the inclusion of a sign
  - ' means write a million as 1,000,000. None of my awks honor this.
  - minwidth or precision can be *, in which case the values are supplied by the next argv. Only gawk and nawk honor this.
  - precision gives the number of digits after the decimal for a number, or the maxwidth for a string.
  - wordsize specifier: only nawk honors these, and then only: hh h l ll
  - format specifier can be any of the signed formats f e g d i (the last two are equivalent), or any of the unsigned formats x u o, or any of the text formats s c. On BusyBox, the format %c only processes values up to 0x7fff, and the results are always masked to the range "\x00"..."\xff".
  None of my awk implementations honored %b (which is like echo -e) or any other format.
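A few of the formatting codes above, limited to the ones described as widely honored, can be exercised with a sketch like the following. LC_ALL=C is set here only as a precaution so that the decimal point is a period regardless of the caller's locale.

```sh
out=$(LC_ALL=C awk 'BEGIN {
    # left-align, right-align, zero-pad, forced sign,
    # float precision, string maxwidth, hex, octal
    printf "%-5s|%5s|%05d|%+d|%.2f|%.3s|%x|%o\n", "ab", "ab", 42, 7, 3.14159, "abcdef", 255, 8
}')
printf '%s\n' "$out"
```

The example deliberately avoids n$, the ' flag, * widths and wordsize specifiers, since the notes above report that support for those varies between implementations.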
Expressions
This material is work-in-progress ...
String vs numeric
This material is work-in-progress ...
Arrays
This material is work-in-progress ...
Built-in functions
This material is work-in-progress ...
Built-in filenames
Only gawk handles these filenames internally:
"/dev/tty"
"/dev/stdin"
"/dev/stdout"
"/dev/stderr"
"/dev/fd/n"
but in many Unix systems they're available externally anyway. As an alternative to getline < "/dev/stdin", you can also use getline < "-". As an alternative to print ... > "/dev/stderr", you can also use: print ... | "cat 1>&2".
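The stderr alternative above can be checked with a short sketch. Here stderr is discarded by the shell, so only the line sent to stdout should be captured:

```sh
out=$(awk 'BEGIN {
    print "to stdout"
    print "to stderr" | "cat 1>&2"   # works without /dev/stderr support
}' 2>/dev/null)
printf '%s\n' "$out"
```

The cat process inherits awk's stderr, so redirecting awk's stderr also redirects the piped message.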
Built-in variables
This material is work-in-progress ...
- ENVIRON
- This is an associative array holding the current environment variables. The array is mutable, but POSIX does not specify whether changes to it must be visible to child processes spawned from awk. (In none of the implementations I checked were such changes visible.)
- ARGC and ARGV
- The number of command-line arguments to awk, and an array holding them. ARGV[0] will usually be "awk", and ARGV[1]...ARGV[ARGC-1] will be the arguments. Awk handles empty arguments specially, and also arguments of the form var=value.
- If your script only has a BEGIN block, and no main-loop rules or END block, then awk won't attempt to automatically read or process the ARGV arguments, or stdin, in any way.
- You can decrement ARGC; awk then won't process any arguments that have been lost. You can also set ARGV[n] = "", or alternatively, delete ARGV[n]. Similarly, you can add additional ARGV entries, though if you do so, you need to increment ARGC manually.
- Arguments that awk does process should not refer to directories or nonexistent files (unless you are going to scan the ARGV list manually and remove such entries). Some versions of some awk implementations will warn if asked to process a directory; others will raise an error.
- ARGIND
- Indicates which ARGV entry was last processed by awk. This is present only in gawk and BusyBox awk. The variable is mutable; but changes to it have no side-effect in gawk. In BusyBox, changes to it affect which ARGV entry will be processed next, after the current one is finished.
- FILENAME
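- Indicates the name of the ARGV entry currently being processed. When stdin is being processed, this will be "-".
- ERRNO
- In gawk and BusyBox, contains a string error message after getline or close fails.
- NR
- Number of the current record.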
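A minimal sketch of ENVIRON and of editing ARGV from a BEGIN block. The environment variable GREETING and the operand no-such-file are made up for the example; the operand is blanked out before the main loop starts, so awk falls back to reading stdin.

```sh
out=$(printf 'from stdin\n' | GREETING=hello awk '
BEGIN {
    print ENVIRON["GREETING"]
    n = ARGC - 1
    print n " operand(s)"
    ARGV[1] = ""        # drop the operand; awk falls back to stdin
}
{ print }' no-such-file)
printf '%s\n' "$out"
```

Blanking the entry rather than decrementing ARGC keeps the positions of any later operands unchanged.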
Gawk-only extensions
This material is work-in-progress ...
Useful links
These links may also be of interest:
- https://www.pement.org/awk/awk1line.txt
- https://www.catonmat.net/series/awk-one-liners-explained
- http://awk.freeshell.org/HomePage
- http://awk.freeshell.org/AwkFeatureComparison
Awk and Lua
See summary of Lua regex.
Commands in brown are from awkenough.
| in lua | in awk |
|---|---|
| string.find(str, pattern, [startpos]) --> (start,stop) or nil | match(str, pat) --> returns and sets RSTART, also sets RLENGTH |
| string.match(str, pattern, [start=1]) --> (%0) or nil or (%1,%2,...) | matchstr(str, pat, [nth=1]) --> \\0, sets RSTART and RLENGTH gawk's |
| string.gmatch(str, pattern) --> an iterator over all matches or sets of matching groups. The iteration sequence will look like: (%0), (%0), (%0) ...; or like: (%1,%2...), (%1,%2...),... | gmatch(str, pat, MATCHES, STARTS) --> nmatches |
| string.gsub(str, pattern, replacement, [max#repls]) --> (newstr,nrepls) | gensub(pat, repl, nth/"g", str) : is closest to Lua's gsub |
| string:len | length(str) |
| string:lower | tolower |
| string.rep(str, count,[5.2 adds sep]) | rep(str,count,[sep]) |
| string.sub(str, start,[stop]) | substr(str,start,[len]) |
| | split(str,ITEMS,[seppat],[gawk's SEPS]) --> nitems |
| table.concat(tbl,[sep],[start],[stop]) --> string | concat([start=1], [len=to_end], [fs=OFS], [A]) --> string if you want to preserve existing FS, need to use |
| string:reverse | reverse([A]) |
| table.remove(tbl, [pos=from end]) --> value formerly at tbl[pos] | pop([start=from_end], [len=to_end], [A]) --> values sep by SUBSEP |
| table.insert(tbl,[valpos=insert at end],value) --> nil | insert(value, [start=after_end], [A]) --> new length of array |
| table.sort(tbl, [lessthan]) | sort(A) |
| | isempty(A) |
| | includes(A, B, [onlykeys?]) : is B <= A? |