Regex: Difference between revisions

Latest revision as of 18:19, 20 September 2023

This page summarizes some gritty regex details. It's not intended as a tutorial on any of these regex languages, but rather as a technical summary and list of "gotchas" (places where different implementations of a tool may behave in different or unexpected ways).

Warning: This summary was drawn up back when we were using uClibc. Some of the behavior of Alpine's utilities reported here may have changed with the switch to musl.

Glob patterns

These are used in shell expansion and pattern-matching in case expressions.

glob*
gl?bbing
[a-z] and [!-aeiou]

[!-aeiou] can also be expressed in some shells (including BusyBox ash) as [^-aeiou] but the [!...] format is more portable.

If the pattern contains an invalid bracket expression or does not match any existing filenames or pathnames, the pattern string will be interpreted literally.
A leading period can only be matched literally, not by [!a] or ? or * or [%-0] or [[:punct:]]. In BusyBox ash, it's not matched by [.a]; other shell implementations may differ.
/ can only be matched literally, and has higher parsing precedence than [...], so a[b/c]d only matches file c]d in directory a[b.
Given the pattern /foo/bar/x*/bam, search permission is needed for directories / and foo, search and read permissions are needed for directory bar, and search permission is needed for each x* directory.
If set -f / set -o noglob, glob-expansion is disabled.
Todo: What contexts automatically suppress glob-expansion?

Q. How do I construct a shell glob-pattern that matches all files except "." and ".." ? (from Unix FAQ 2.11)

Pattern	matches . ?	matches .. ?	matches .a ?	matches .ab ?	matches ..pdq ?	matches xyz ?
`*`	no	no	no	no	no	yes
`.*`	yes	yes	yes	yes	yes	no
`.[!.]*`	no	no	yes	yes	no	no
`.??*`	no	no	no	yes	yes	no

Hence, to match all of the four rightmost columns, but neither of the two leftmost ones, you need to combine three glob patterns: * .[!.]* .??*

If you don't have any length-2 filenames like .a, you can go with the simpler: * .??*

POSIX regex

POSIX defines two classes of regex languages: basic regular expressions (BREs) and extended regular expressions (EREs). The first is historically implemented by utilities like grep, sed, and ed. The second by tools like egrep (grep -E), awk, lex, and emacs. EREs are usually also available as an option to sed; sometimes this is expressed using sed -r, other times sed -E. (BusyBox sed uses the first.)

In practice most implementations of these regex languages go beyond the specification, for example including Gnu extensions like \w, or including features that were only historically available (and are only specified for) the other regex language: thus most implementations of BREs will also honor \+ and \|, and many implementations of EREs will also honor backreferences like \1. Additionally, awk EREs obeys somewhat different rules than other EREs.

Note: Historically, BRE-using regex engines are implemented using NFAs or "pattern-directed" regex engines. EREs were often implemented using more efficient DFAs or "text-directed" regex engines. There are theoretical translations between these, but in practice devices like backreferences and non-greedy quantification can only be efficiently implemented with NFAs---so although DFAs are faster, they are also more limited.

One difference between naive implementations of these is that NFAs are more "eager." Both types of engines will return the leftmost of several possible matches in the source text; but the greater eagerness of a NFA would show up in which of several alternative patterns it used to do the matching. For any regex engine which can handle alternative patterns, printf "NFA, no I mean DFA" | regex_match "/NFA|NFA, no I mean DFA/" would match the whole source text if the engine were a DFA; but would match only "NFA", using the left pattern, if the engine were a (naive) NFA.

However, this is nowadays complicated by the fact that POSIX specifies the longest possible leftmost stretch of source text be matched: so a POSIX-conformant grep (which accepted pattern alternation as an extension) would have to match "NFA, no I mean DFA", too, just as egrep would.

It's also complicated by the fact that many ERE engines nowadays will honor backreferences, which can't efficiently be matched with DFA/"text-directed" techniques.

POSIX specifies for both BREs and EREs:

matches should be longest match leftmost in text
subpatterns match greedily (with matching the empty string "" better than not matching)
nulls (\0) are not permitted in text nor in patterns

For the following, I've compared the BusyBox tools (grep, egrep, sed with and without -r, and awk) to the Gnu core tools, and to the versions of these tools in a base FreeBSD 9 system. I refer to the Gnu implementation of awk as "gawk" (as Gnu itself does); and I refer to the FreeBSD implementation of awk as "nawk" (though properly speaking, this is just one among several implementations of nawk).

BREs

This material is work-in-progress ...

(Last edited by Sertonix on 20 Sep 2023.)

Special characters: . * [] ^ $ \(\) \{\}
    
Dot and [^x] should match newline; but on no grep -z implementation do they do so.

Anchors ^ $
    POSIX says interpreted literally when not at start of pattern/group/alternate (FreeBSD grep is buggy with ^)
    All [e]greps match start-of-lines and end-of-lines even with multiline buffers (though FreeBSD's [e]grep -o ^... is buggy)
    Sed engines, on the other hand, only ever match these anchors against start-of-buffer and end-of-buffer.

Alternation with \|
    Not in POSIX for BREs, but is honored by many <code>grep</code>s and <code>sed</code>s
    FreeBSD tools don't handle \| in these contexts: "_ ... | _ ... ( _ ... _ ) ... _"

Intervals \{m,n\}
    \? for \{0,1\} and \+ for \{1,\} aren't in POSIX, but are often honored.
    All greps and and all seds except FreeBSD's do support \| \? \+ and other Gnu extensions
    Unmatched \{ rejected as error
    Unmatched \} usually treated as literal; only FreeBSD sed without -r rejects
    Gnu sed accepts \{,1\}; others reject and all reject \{\}, \{2,1\}, and \{1,2,3\}

Perversely-placed quantifiers
    Treated as literals at start of a pattern/group/alternate (after ^, if any)
    This is true for *, \?, and \+ where they're available
    But \{1\} is instead rejected as an error by FreeBSD and Gnu sed

    FreeBSD and Gnu grep, BusyBox grep and sed: accept adjacent quantifiers without error
        Gnu and FreeBSD sed: reject \{...\} and * when they follow another quantifier
        Gnu sed: permits \? and \+ to do so; FreeBSD doesn't recognize them as quantifiers

Groups \( \) and backreferences \1-\9
    ^\(ab*\)*\1$ matches ababbabb but not ababbab (backref to last match of the antecedent pattern)
        unmatched \( or \) rejected as error

\<var>c</var>
     POSIX doesn't define the behavior of this for orinary characters <var>c</var>. All of the implementations I 
     checked interpret it as matching a literal <var>c</var>.

Bracket expressions [...]
    inside brackets, ^ is literal except at start, and .*[\ are all literal
    The following are all required by POSIX (''except in italics''):
    ] in []...] and [^]...] treated as literal
    - in [-...] and [^-...] and [...-] treated as literal
        so [b\-a] is the set of b,\,],^,_,`,a; not the set of b,-,a
    \ in [...\...] treated as literal
        so [a\]b] is [a\] then literal b]; not the set of a,],b
    [:<var>classes</var>:]
        [:alpha:] [:alnum:] [:upper:] [:lower:]
        [:xdigit:] [:digit:] but not \d
        [:punct:]
        [:graph:] visible chars except space
        [:blank:] space or tab
        [:space:] space or tab or \n \r \v \f
        [:print:] visible chars plus space
        [:cntrl:] ASCII < 32 or 127
    unmatched [ ''rejected as error''
    unmatched ] treated as literal
    [m-ax] ''BusyBox [e]grep and nawk treat as just x, others reject''
    [a-m-xy] ''BusyBox [e]grep and nawk treat as [a-my], others reject''

EREs

This material is work-in-progress ...

(Last edited by Sertonix on 20 Sep 2023.)

    . * ? + [] ^ $ | () {}
    POSIX requires "long|longest" should match all of "longest", so implementations must at least simulate DFAs in that way
    dot and [^x] should match newline; but on no egrep -z implementation do they do so
    ^ $ are always anchors
        all [e]greps match start-of-lines and end-of-lines even with multiline buffers (though FreeBSD's [e]grep -o ^... is buggy)
        sed and awk engines only match these anchors against start-of-buffer and end-of-buffer
        so ".^b" should match last two chars of $'a\nb' in grep (if only dot matched newline there), but not in sed; this is what we observe, except that while Gnu sed and gawk will only match '^pat' against start-of-buffer, they match '.^pat' against start-of-line (same with 'pat$.'); also, nawk rejects '.^b' but not 'a$.'
    |
        FreeBSD sed and nawk reject | in these contexts: "_ ... | _ ... ( _ ... _ ) ... _"; FreeBSD egrep silently fails to match
    {m,n}
        nawk and gawk --traditional treat as literal
        BusyBox and gawk --re-interval handle (this is default for some versions of gawk)
        unmatched {: BusyBox tools, Gnu sed and gawk reject as error; FreeBSD tools and Gnu egrep treat as literal
        unmatched }: treated as literal (only FreeBSD sed without -r rejects)
        Gnu sed and gawk accept {,1}; others reject and all reject {}, {2,1}, and {1,2,3}; except FreeBSD egrep handles the third strangely, and FreeBSD sed treats the first as literal
    quantifiers at start of pattern/group/alternate (after ^, if any): egreps silently drop (BusyBox whole construct, others only first char); seds with -r and awks usually reject (though gawk treats such quantifiers and any preceding ^ as literal; nawk accepts ^* ^+ ^? but hard to tell what they match)
    BusyBox and Gnu egrep and sed and awk, FreeBSD egrep and nawk: accept adjacent quantifiers without error; FreeBSD sed rejects all

    ( )
        all my implementations of egrep provide backrefs (so with those patterns, not running a DFA)
        ^(ab*)*\1$ matches ababbabb but not ababbab (backref to last match of the antecedent pattern)
        FreeBSD sed with -r: \1 is literal "1" rather than backref; other seds provided backrefs
        awks: \1 is \x01 rather than backref
        unmatched (: rejected as error
        unmatched ): rejected as error, except FreeBSD and BusyBox sed with -r, and BusyBox awk and gawk, treat as literal
    \ordinary undefined (all my egrep and sed implementations treat as literal; awks honor C escapes but will otherwise treat as literal, perhaps with warning)

    inside brackets, ^ is literal except at start, and .*[\?+|$ are all literal
    otherwise, as for BREs

    In gawk and nawk, [a\]1] matches a,],1. In BusyBox awk and egrep/sed, it matches a,\ followed by literal 1 then ].

Gnu extensions

These were present in all the [e]greps I tested with, and in all the seds and awks except nawk (and gawk --traditional). They are only treated specially outside of bracket expressions, even in awks, which do still treat \t and so on specially there.

\w and \W for [[:alnum:]_] and [^[:alnum:]_]
\s and \S for [[:space:]] and [^[:space:]], matches any of: space tab \n \r \v \f
(BusyBox tools and some versions of gawk lack.)
\b \B \< and \>, zero-width matches at word boundaries (non-word-boundaries for \B)
(In awks, \b instead means "\x08"; \y substitutes for \b in gawk, nothing substitutes for \b in BusyBox awk.)

FreeBSD's [e]grep -o \b... and [e]grep -o \<... are currently buggy; and BusyBox's sed and awk are buggy with \< \b \B at start of words.
\` \' start-of-buffer and end-of-buffer anchors (some regex engines not surveyed here use \A \Z \z for these instead)
(In awks, ^ and $ already have this behavior, even against source texts containing newlines.)

FreeBSD's [e]grep currently wrongly match these against start-of-line and end-of-line, rather than start-of-buffer and end-of-buffer. Also, FreeBSD's grep -o '\`...' is buggy in ways grep -o '...'\' isn't. BusyBox sed and awk are also buggy with \` in ways they aren't with \'. All these bugs have been reported.

C escapes

\n \t \r \x09 \f \v \a \c: These are handled specially by awks (though nawk only honors up to \f), even inside brackets.; They are also handled specially by Gnu's sed. BusyBox's sed honors \n \t \r; and FreeBSD's sed honors \n. The other escapes aren't honored by those seds, and none are honored by any grep I tested.

These escapes may also be handled specially by your shell in $'...' constructs; as too may be \OOO (up to 3 octal digits) \uXXXX (4 hex digits) \e \E \b \'. Awk engines handle some of these latter forms, too. As noted above, the last two are handled differently by some regex engines.

Notes

All of these regex engines treat \d as literal "d", not as [[:digit:]].

BusyBox sed will match only one occurrence of "" (empty string); others will match several of them if the /g modifier is on.

Grep engines will treat newlines in patterns as equivalent to \|; sed and awk engines will reject as error.

FreeBSD's sed will force the presence of terminal \n, even if it wasn't present in the input. So too will some other FreeBSD tools like cut; others like tr won't. In Gnu's sed, the command q also forces a terminal \n.

BusyBox's grep -oz suffixes each result with "\0" (nul); other greps suffix with \n.

Nongreedy quantifiers pat?? pat*? pat+? pat{m,n}? aren't provided in the POSIX specification, nor by any of the POSIX-conforming tools discussed here.

FreeBSD grep and egrep match the empty string at position 0 in: printf 'cba' | egrep -o '[ba]*'. None of the other [e]grep implementations I checked (such as BusyBox's or Gnu's) do this.

Regex in Lua

String escapes

In Lua, regex patterns are always supplied as strings, so will honor all the normal escapes on strings:

\\
\"
\'
\a for bell, \x07
\b for backspace, \x08
\t for \x09
\n for \x0a
\v for \x0b
\f for \x0c
\r for \x0d

Lua strings also accept \ddd for decimal digits d. (Note: not octal digits.) Starting with Lua 5.2, \xhh for hex digits is also accepted.

Strings can be written inside matching single or double quotes. They can also be written inside:

[[constructs
like this]]

[=[
or [[like]]
\this]=]

In these constructs, escape sequences like \t aren't expanded. It and the embedded [[like]] are both treated literally. Also, the first character of the string is ignored when it's a newline.

Regex engines

The basic Lua regex engine is more limited than the Posix- or PCRE-style languages, though still quite capable. In fact some things are more easily done with the Lua engine than with the more familiar ones. If the basic Lua engine is nonetheless too limited for your purposes, you should look into the LPEG or Lrexlib libraries. The former is more powerful and widely-used in the Lua community; the latter interfaces to more familiar regex engine libraries and languages.

Regex specials

The following sequences have special meaning to the basic Lua regex engine.

Character classes

. Matches any character
%z In Lua 5.1, the regex engine wouldn't read past an embedded \0 in the pattern string, so this special sequence was provided instead to match \0s in the source text (and permit the pattern string to continue). In Lua 5.2, embedded \0s can now be used directly. With the default compilation settings, %z is still honored, but it's deprecated.
%a and %A Like POSIX [[:alpha:]] and [^[:alpha:]] %l and %L Like POSIX [[:lower:]] and [^[:lower:]] %u and %U Like POSIX [[:upper:]] and [^[:upper:]]
%w and %W Like POSIX [[:alnum:]] and [^[:alnum:]]. (Note that unlike the Gnu regex extension \w, the patterns %w in Lua and [[:alnum:]] in POSIX do not match underscores.) %d and %D Like POSIX [[:digit:]] and [^[:digit:]] %x and %X Like POSIX [[:xdigit:]] and [^[:xdigit:]]
%s and %S Like POSIX [[:space:]] and [^[:space:]]. (%s and POSIX [[:space:]] do also match vertical space (\n \r \f \v, whereas POSIX [[:blank:]] matches only \x20 and tab.) %p and %P Like POSIX [[:punct:]] and [^[:punct:]], excludes space and alnum and cntrl %c and %C Like POSIX [[:cntrl:]] and [^[:cntrl:]] %g and %G Like POSIX [[:graph:]] and [^[:graph:]], all visible characters except space; only so interpreted in Lua 5.2 POSIX [[:print:]], all visible characters plus space, isn't directly available. Use [%p%w ] or [%g ].

Bracket expressions

[class] and [^class] can include sequences of:

single characters like a or \t
ranges like a-m
character class specials like %a

Groups and backreferences

(pat) Capture the source text matching pat into a group. Unlike other regex engines, these constructions can not be followed by quantifiers like * or +
%1 Backreference to a captured group (contrast \1 which is and matches "\x01", and \\1 which matches a literal character "\" then "1")
() Instead of text, capture the current position in the matched source into a group

Quantifiers

? and * and + are the familiar greedy quantifiers
- is a nongreedy variant of *

Note that in Lua quantifiers can only follow:

single characters
regex specials like . and %a
[class] expressions

They cannot follow arbitrary (pat)

Anchors ^ $

As in POSIX BREs, these are treated as literal characters when not in anchoring positions (as in the pattern ab^cd)

Literal characters

%c This is a literal c, for arbitrary character c. It cancels the special meaning of characters ( ) . % + - * ? [ ] ^ $

Alternation (expressed in other regex languages using |) is not available in Lua; it can only be approximated using [class] constructions.

Two of the nifty primitives that Lua has and other engines lack are:

%b() Text inside (and including) balanced (...); other characters can be used in place of ( and ).
%f[class] The zero-width "frontier" between (the start-of-source or) text not matching class and (the end-of-source or) text which does match class. This is a generalization of the Gnu regex specials \< and \>. Examples:
```
%f[%x] matches the source text "123-567 9ab" before positions 1, 5, and 9
%f[%X] matches the same source text before positions 4, 8, and 12.
```

@@ Line 1: / Line 1: @@
 This page summarizes some gritty regex details. It's not intended as a tutorial on any of these regex languages, but rather as a technical summary and list of "gotchas" (places where different implementations of a tool may behave in different or unexpected ways).
+{{Warning|This summary was drawn up back when we were using uClibc. Some of the behavior of Alpine's utilities reported here may have changed with the switch to musl.}}
 __TOC__
@@ Line 73: / Line 75: @@
 == POSIX regex ==
-POSIX defines two classes of regex languages: '''basic regular expressions (BREs)''' and '''extended regular expressions (EREs)'''. The first is historically implemented by utilities like <code>grep</code>, <code>sed</code>, and <code>ed</code>. The second by tools like <code>egrep</code> (<code>grep -E</code>), <code>awk</code>, <code>lex</code>, and <code>emacs</code>. EREs are usually also available as an option to <code>sed</code>; sometimes this is expressed using <code>sed -r</code>, other times <code>sed -E</code>. (BusyBox <code>sed</code> uses the first.)
+POSIX defines two classes of regex languages: '''basic regular expressions (BREs)''' and '''extended regular expressions (EREs)'''. The first is historically implemented by utilities like <code>grep</code>, <code>sed</code>, and <code>ed</code>. The second by tools like <code>egrep</code> (<code>grep -E</code>), <code>[[awk]]</code>, <code>lex</code>, and <code>emacs</code>. EREs are usually also available as an option to <code>sed</code>; sometimes this is expressed using <code>sed -r</code>, other times <code>sed -E</code>. (BusyBox <code>sed</code> uses the first.)
 In practice most implementations of these regex languages go beyond the specification, for example including Gnu extensions like <code>\w</code>, or including features that were only historically available (and are only specified for) the other regex language: thus most implementations of BREs will also honor <code>\+</code> and <code>\|</code>, and many implementations of EREs will also honor backreferences like <code>\1</code>. Additionally, <code>awk</code> EREs obeys somewhat different rules than other EREs.
@@ Line 96: / Line 98: @@
 === BREs ===
+{{Draft|}}
 <div style="white-space:pre; font-family:monospace;"><nowiki>
@@ Line 159: / Line 162: @@
 === EREs ===
+{{Draft|}}
 <div style="white-space:pre; font-family:monospace;"><nowiki>
      . * ? + [] ^ $ | () {}
@@ Line 263: / Line 268: @@
 === Regex engines ===
-The basic Lua regex engine is more limited than the Posix- or PCRE-style languages, though still quite capable. In fact some things are more easily done with the Lua engine than with the more familiar ones. If the basic Lua engine is nonetheless too limited for your purposes, you should look into the [http://www.inf.puc-rio.br/~roberto/lpeg/ LPEG] or [http://rrthomas.github.com/lrexlib/ Lrexlib] libraries. The former is more powerful and widely-used in the Lua community; the latter interfaces to more familiar regex engine libraries and languages.
+The basic Lua regex engine is more limited than the Posix- or PCRE-style languages, though still quite capable. In fact some things are more easily done with the Lua engine than with the more familiar ones. If the basic Lua engine is nonetheless too limited for your purposes, you should look into the [https://www.inf.puc-rio.br/~roberto/lpeg/ LPEG] or [https://github.com/rrthomas/lrexlib Lrexlib] libraries. The former is more powerful and widely-used in the Lua community; the latter interfaces to more familiar regex engine libraries and languages.
 === Regex specials ===
@@ Line 290: / Line 295: @@
   '''%s''' and '''%S'''    Like POSIX &#91;&#91;:space:]] and [^[:space:]].
                (%s and POSIX &#91;&#91;:space:]] do also match vertical space (\n \r \f \v, whereas POSIX &#91;&#91;:blank:]] matches only \x20 and tab.)
-  '''%p''' and '''%P'''    Like POSIX &#91;&#91;:punct:]] and [^[:punct:]], excludes space and alnum
+  '''%p''' and '''%P'''    Like POSIX &#91;&#91;:punct:]] and [^[:punct:]], excludes space and alnum and cntrl
   '''%c''' and '''%C'''    Like POSIX &#91;&#91;:cntrl:]] and [^[:cntrl:]]
   '''%g''' and '''%G'''    Like POSIX &#91;&#91;:graph:]] and [^[:graph:]], all visible characters except space; only so interpreted in Lua 5.2
-POSIX <code>&#91;&#91;:print:]]</code>, all visible characters plus space, isn't directly available. Use <code>[%p%a ]</code> or <code>[%g ]</code>.
+POSIX <code>&#91;&#91;:print:]]</code>, all visible characters plus space, isn't directly available. Use <code>[%p%w ]</code> or <code>[%g ]</code>.
 <dt>Bracket expressions
@@ Line 336: / Line 341: @@
 <li>'''<code>%f[<var>class</var>]</code>'''    The zero-width "frontier" between (the start-of-source or) text not matching <var>class</var> and (the end-of-source or) text which does match <var>class</var>. This is a generalization of the Gnu regex specials <code>\&lt;</code> and <code>\&gt;</code>. Examples:
 <pre>
-%f[%x] matches the source text "123-567 9ab" at positions 1, 5, and 9
+%f[%x] matches the source text "123-567 9ab" before positions 1, 5, and 9
-%f[%X] matches the same source text at positions 4, 7, and 12.
+%f[%X] matches the same source text before positions 4, 8, and 12.
 </pre>
 </ul>