Latest revision as of 18:19, 20 September 2023
This page summarizes some gritty regex details. It's not intended as a tutorial on any of these regex languages, but rather as a technical summary and list of "gotchas" (places where different implementations of a tool may behave in different or unexpected ways).
Glob patterns
These are used in shell expansion and pattern-matching in case expressions.
glob*
gl?bbing
[a-z] and [!-aeiou]
[!-aeiou] can also be expressed in some shells (including BusyBox ash) as [^-aeiou] but the [!...] format is more portable.
- If the pattern contains an invalid bracket expression or does not match any existing filenames or pathnames, the pattern string will be interpreted literally.
- A leading period can only be matched literally, not by [!a] or ? or * or [%-0] or [[:punct:]]. In BusyBox ash, it's not matched by [.a]; other shell implementations may differ.
- / can only be matched literally, and has higher parsing precedence than [...], so a[b/c]d only matches file c]d in directory a[b.
- Given the pattern /foo/bar/x*/bam, search permission is needed for directories / and foo, search and read permissions are needed for directory bar, and search permission is needed for each x* directory.
- If set -f (or set -o noglob) is in effect, glob-expansion is disabled.
Todo: What contexts automatically suppress glob-expansion?
Q. How do I construct a shell glob-pattern that matches all files except "." and ".." ? (from Unix FAQ 2.11)
Pattern | matches . ? | matches .. ? | matches .a ? | matches .ab ? | matches ..pdq ? | matches xyz ?
---|---|---|---|---|---|---
* | no | no | no | no | no | yes
.* | yes | yes | yes | yes | yes | no
.[!.]* | no | no | yes | yes | no | no
.??* | no | no | no | yes | yes | no
Hence, to match all of the four rightmost columns, but neither of the two leftmost ones, you need to combine three glob patterns: * .[!.]* .??*
If you don't have any length-2 filenames like .a, you can go with the simpler: * .??*
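As a quick check of the table, here is a minimal sketch that builds a scratch directory holding the four file names from the rightmost columns and expands the combined pattern there. It assumes mktemp -d is available (it isn't specified by POSIX, though most systems have it); the variable names dir and matched are only for illustration. Since .ab is matched by both .[!.]* and .??*, the result is piped through sort -u to drop the repeat.

```shell
# Scratch directory holding the example file names from the table.
dir=$(mktemp -d)
touch "$dir/.a" "$dir/.ab" "$dir/..pdq" "$dir/xyz"

# Expand the three patterns inside the scratch directory (in a subshell, so
# the caller's working directory is untouched). .ab is produced twice, so
# sort -u removes the duplicate; LC_ALL=C keeps the ordering predictable.
matched=$(cd "$dir" && printf '%s\n' * .[!.]* .??* | LC_ALL=C sort -u)
printf '%s\n' "$matched"

rm -rf "$dir"
```

With those four files present, matched should list ..pdq, .a, .ab and xyz, and neither . nor .. should appear.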
POSIX regex
POSIX defines two classes of regex languages: basic regular expressions (BREs) and extended regular expressions (EREs). The first is historically implemented by utilities like grep, sed, and ed. The second is implemented by tools like egrep (grep -E), awk, lex, and emacs. EREs are usually also available as an option to sed; sometimes this is expressed using sed -r, other times sed -E. (BusyBox sed uses the first.)
In practice most implementations of these regex languages go beyond the specification, for example including Gnu extensions like \w, or including features that were only historically available in (and are only specified for) the other regex language: thus most implementations of BREs will also honor \+ and \|, and many implementations of EREs will also honor backreferences like \1. Additionally, awk EREs obey somewhat different rules than other EREs.
One difference between naive implementations of DFA and NFA engines is that NFAs are more "eager." Both types of engines will return the leftmost of several possible matches in the source text; but the greater eagerness of an NFA would show up in which of several alternative patterns it used to do the matching. For any regex engine which can handle alternative patterns, printf "NFA, no I mean DFA" | regex_match "/NFA|NFA, no I mean DFA/" would match the whole source text if the engine were a DFA; but would match only "NFA", using the left pattern, if the engine were a (naive) NFA.
However, this is nowadays complicated by the fact that POSIX specifies that the longest possible leftmost stretch of source text be matched: so a POSIX-conformant grep (which accepted pattern alternation as an extension) would have to match "NFA, no I mean DFA", too, just as egrep would.
POSIX specifies for both BREs and EREs:
- matches should be longest match leftmost in text
- subpatterns match greedily (with matching the empty string "" better than not matching)
- nulls (\0) are not permitted in text nor in patterns
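The leftmost-longest rule can be observed from the shell. This is only a sketch: it assumes a grep that supports -E together with the -o option (-o is an extension, not part of POSIX), and the variable name longest is mine.

```shell
# Both alternatives can match starting at the first character; a
# leftmost-longest engine reports the longer one.
longest=$(printf 'NFA, no I mean DFA\n' | grep -E -o 'NFA|NFA, no I mean DFA')
printf '%s\n' "$longest"
```

An engine that simply took the first alternative to succeed would report only "NFA" here.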
For the following, I've compared the BusyBox tools (grep, egrep, sed with and without -r, and awk) to the Gnu core tools, and to the versions of these tools in a base FreeBSD 9 system. I refer to the Gnu implementation of awk as "gawk" (as Gnu itself does); and I refer to the FreeBSD implementation of awk as "nawk" (though properly speaking, this is just one among several implementations of nawk).
BREs
This material is work-in-progress ...
EREs
This material is work-in-progress ...
Gnu extensions
These were present in all the [e]greps I tested with, and in all the seds and awks except nawk (and gawk --traditional). They are only treated specially outside of bracket expressions, even in awks, which do still treat \t and so on specially there.
- \w and \W for [[:alnum:]_] and [^[:alnum:]_]
- \s and \S for [[:space:]] and [^[:space:]], matches any of: space tab \n \r \v \f
  - (BusyBox tools and some versions of gawk lack these.)
- \b \B \< and \>, zero-width matches at word boundaries (non-word-boundaries for \B)
  - (In awks, \b instead means "\x08"; \y substitutes for \b in gawk, nothing substitutes for \b in BusyBox awk.)
  - FreeBSD's [e]grep -o \b... and [e]grep -o \<... are currently buggy; and BusyBox's sed and awk are buggy with \< \b \B at start of words.
- \` \' start-of-buffer and end-of-buffer anchors (some regex engines not surveyed here use \A \Z \z for these instead)
  - (In awks, ^ and $ already have this behavior, even against source texts containing newlines.)
  - FreeBSD's [e]grep currently wrongly matches these against start-of-line and end-of-line, rather than start-of-buffer and end-of-buffer. Also, FreeBSD's grep -o '\`...' is buggy in ways grep -o '...'\' isn't. BusyBox sed and awk are also buggy with \` in ways they aren't with \'. All these bugs have been reported.
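To see the difference between \w and the plain POSIX class, here is a small sketch. It assumes a grep that honors \w and -o, as all the greps I tested did; neither is guaranteed by POSIX, and the variable names words and alnum are only for illustration.

```shell
# \w also matches the underscore, so foo_bar stays in one piece.
words=$(printf 'foo_bar baz\n' | grep -E -o '\w+')

# [[:alnum:]] does not match the underscore, so foo_bar splits in two.
alnum=$(printf 'foo_bar baz\n' | grep -E -o '[[:alnum:]]+')

printf '%s\n' "$words" "--" "$alnum"
```

The first command should report foo_bar and baz; the second should report foo, bar and baz.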
C escapes
\n \t \r \x09 \f \v \a \c
- These are handled specially by awks (though nawk only honors up to \f), even inside brackets.
- They are also handled specially by Gnu's sed. BusyBox's sed honors \n \t \r; and FreeBSD's sed honors \n. The other escapes aren't honored by those seds, and none are honored by any grep I tested.
These escapes may also be handled specially by your shell in $'...' constructs; as too may be \OOO (up to 3 octal digits) \uXXXX (4 hex digits) \e \E \b \'. Awk engines handle some of these latter forms, too. As noted above, the last two are handled differently by some regex engines.
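Since support for these escapes varies so much between tools, one way to sidestep the question is to let printf build the literal character and splice it into the pattern. A minimal sketch, where the variable names tab, replaced and count are my own:

```shell
# Build a literal tab with printf instead of relying on sed or grep to decode \t.
tab=$(printf '\t')

# The pattern handed to sed and grep now contains a real tab character.
replaced=$(printf 'a\tb\n' | sed "s/$tab/X/")
count=$(printf 'a\tb\nab\n' | grep -c "$tab")

printf '%s\n' "$replaced" "$count"
```

This relies only on printf understanding \t in its format string, which POSIX does specify.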
Notes
- All of these regex engines treat \d as literal "d", not as [[:digit:]].
- BusyBox sed will match only one occurrence of "" (empty string); others will match several of them if the /g modifier is on.
- Grep engines will treat newlines in patterns as equivalent to \|; sed and awk engines will reject them as an error.
- FreeBSD's sed will force the presence of a terminal \n, even if it wasn't present in the input. So too will some other FreeBSD tools like cut; others like tr won't. In Gnu's sed, the command q also forces a terminal \n.
- BusyBox's grep -oz suffixes each result with "\0" (nul); other greps suffix with \n.
- Nongreedy quantifiers pat?? pat*? pat+? pat{m,n}? aren't provided in the POSIX specification, nor by any of the POSIX-conforming tools discussed here.
- FreeBSD grep and egrep match the empty string at position 0 in: printf 'cba' | egrep -o '[ba]*'. None of the other [e]grep implementations I checked (such as BusyBox's or Gnu's) do this.
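Two of these notes are easy to try out with grep. The sketch below uses made-up input lines, and the variable names nl, either and digits are only for illustration. Since \d isn't a digit class in these tools, it uses the POSIX class instead.

```shell
# A newline inside a grep pattern separates alternatives, much as \| would.
nl='
'
either=$(printf 'apple\nbanana\ncherry\n' | grep "apple${nl}cherry")

# The portable spelling of "a digit" is the POSIX class, not \d.
digits=$(printf 'd1\nxyz\n42\n' | grep '[[:digit:]]')

printf '%s\n' "$either" "--" "$digits"
```

The first grep should select the apple and cherry lines, and the second should select d1 and 42.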
Regex in Lua
String escapes
In Lua, regex patterns are always supplied as strings, so will honor all the normal escapes on strings:
\\ \" \'
\a for bell, \x07
\b for backspace, \x08
\t for \x09
\n for \x0a
\v for \x0b
\f for \x0c
\r for \x0d
Lua strings also accept \ddd for decimal digits d. (Note: not octal digits.) Starting with Lua 5.2, \xhh for hex digits is also accepted.
Strings can be written inside matching single or double quotes. They can also be written inside:
[[constructs like this]]
[=[ or [[like]] \this]=]
In these constructs, escape sequences like \t aren't expanded: the \t in \this and the embedded [[like]] are both treated literally. Also, the first character of the string is ignored when it's a newline.
Regex engines
The basic Lua regex engine is more limited than the Posix- or PCRE-style languages, though still quite capable. In fact some things are more easily done with the Lua engine than with the more familiar ones. If the basic Lua engine is nonetheless too limited for your purposes, you should look into the LPEG or Lrexlib libraries. The former is more powerful and widely-used in the Lua community; the latter interfaces to more familiar regex engine libraries and languages.
Regex specials
The following sequences have special meaning to the basic Lua regex engine.
- Character classes
  . Matches any character
  %z In Lua 5.1, the regex engine wouldn't read past an embedded \0 in the pattern string, so this special sequence was provided instead to match \0s in the source text (and permit the pattern string to continue). In Lua 5.2, embedded \0s can now be used directly. With the default compilation settings, %z is still honored, but it's deprecated.
  %a and %A Like POSIX [[:alpha:]] and [^[:alpha:]]
  %l and %L Like POSIX [[:lower:]] and [^[:lower:]]
  %u and %U Like POSIX [[:upper:]] and [^[:upper:]]
  %w and %W Like POSIX [[:alnum:]] and [^[:alnum:]]. (Note that unlike the Gnu regex extension \w, the patterns %w in Lua and [[:alnum:]] in POSIX do not match underscores.)
  %d and %D Like POSIX [[:digit:]] and [^[:digit:]]
  %x and %X Like POSIX [[:xdigit:]] and [^[:xdigit:]]
  %s and %S Like POSIX [[:space:]] and [^[:space:]]. (%s and POSIX [[:space:]] also match vertical space (\n \r \f \v), whereas POSIX [[:blank:]] matches only \x20 and tab.)
  %p and %P Like POSIX [[:punct:]] and [^[:punct:]], excludes space and alnum and cntrl
  %c and %C Like POSIX [[:cntrl:]] and [^[:cntrl:]]
  %g and %G Like POSIX [[:graph:]] and [^[:graph:]], all visible characters except space; only so interpreted in Lua 5.2
  POSIX [[:print:]], all visible characters plus space, isn't directly available. Use [%p%w ] or [%g ].
- Bracket expressions
  [class] and [^class] can include sequences of:
  - single characters like a or \t
  - ranges like a-m
  - character class specials like %a
- Groups and backreferences
  (pat) Capture the source text matching pat into a group. Unlike in other regex engines, these constructions cannot be followed by quantifiers like * or +
  %1 Backreference to a captured group (contrast \1, which is the character "\x01" and matches it, and \\1, which matches a literal character "\" then "1")
  () Instead of text, capture the current position in the matched source into a group
- Quantifiers
  ? and * and + are the familiar greedy quantifiers
  - is a nongreedy variant of *
  These can follow single characters, regex specials like . and %a, and [class] expressions; as noted under groups, they cannot follow (pat).
- Anchors ^ $
  As in POSIX BREs, these are treated as literal characters when not in anchoring positions (as in the pattern ab^cd)
- Literal characters
  %c This is a literal c, for an arbitrary non-alphanumeric character c. It cancels the special meaning of the characters ( ) . % + - * ? [ ] ^ $
Alternation (expressed in other regex languages using |) is not available in Lua; it can only be approximated using [class] constructions.
Two of the nifty primitives that Lua has and other engines lack are:
- %b() Text inside (and including) balanced (...); other characters can be used in place of ( and ).
- %f[class] The zero-width "frontier" between (the start-of-source or) text not matching class and (the end-of-source or) text which does match class. This is a generalization of the Gnu regex specials \< and \>. Examples:
  %f[%x] matches the source text "123-567 9ab" before positions 1, 5, and 9
  %f[%X] matches the same source text before positions 4, 8, and 12.