Regex: Difference between revisions

Revision as of 16:22, 3 May 2012

This page summarizes some gritty regex details.

Glob patterns

These are used in shell expansion and "case" pattern-matching.

glob*
gl?bbing
[a-z] and [!-aeiou]  Note: [^abc] is also supported by some shells (including BusyBox ash) but not others

/ can only be matched literally, and has higher parsing precedence than [...], so a[b/c]d only matches file c]d in directory a[b.
a leading period can only be matched literally, not by [!a] or ? or * or [%-0] or [[:punct:]]. In BusyBox ash, it's not matched by [.a]; other shell implementations may differ.
given the pattern /foo/bar/x*/bam, search permission is needed for directories / and foo, search and read permissions are needed for directory bar, and search permission is needed for each x* directory.
if the pattern contains an invalid bracket expression or does not match any existing filenames or pathnames, the pattern string will be interpreted literally.
If set -f / set -o noglob, glob-expansion is disabled.
Todo: What contexts automatically suppress glob-expansion?

Q. How do I construct a shell glob-pattern that matches all files except "." and ".." ? (from Unix FAQ 2.11)

*        matches all files that don't begin with a ".";
.*       matches all files that do begin with a ".", but this includes the special
         entries "." and "..", which often you don't want; so instead we use:
.[!.]*   matches all files that begin with a "." and are followed by a non-".";
         unfortunately this will miss "..foo";
.??*     matches files that begin with a "." and which are at least 3 characters long.
         This neatly avoids "." and "..", but also misses ".a" .

So to match all files except "." and ".." safely you have to use 3 patterns (if you don't have
length-2 filenames like ".a" you can leave out the first):

.[!.]* .??* *

POSIX Regex

POSIX defines two classes of regex languages: basic regular expressions (BREs) and extended regular expressions (EREs). The first is historically implemented by utilities like grep, sed, and ed. The second by tools like egrep (grep -E), awk, lex, and emacs. EREs are usually also available as an option to sed; sometimes this is expressed using sed -r, other times sed -E. (BusyBox sed uses the first.)

In practice most implementations of these regex languages go beyond the specification, for example including Gnu extensions like \w, or including features that were only historically available (and are only specified for) the other regex language: thus most implementations of BREs will also honor \+ and \|, and many implementations of EREs will also honor backreferences like \1.

Additionally, awk EREs obeys somewhat different rules than other EREs.

Note: Historically, EREs were often implemented using DFAs or "text-directed" regex engines. BREs were often implemented using NFAs or "pattern-directed" regex engines. There are theoretical translations between these, but in practice devices like backreferences and non-greedy quantification are only efficiently implementable with NFAs. Naive implementations of these differ in that NFAs are more "eager" so would always return, of several alternate patterns, the leftmost matching one, even if another pattern might match more of the source text. So for a hypothetical regex engine, printf "NFA, no DFA" | regex_match "/NFA|NFA, no DFA/" would match "NFA" if the engine was a NFA, and "NFA, no DFA" if the engine was a DFA. (Both types of engines will return the leftmost of several possible matches in the source text; this contrast is only between which of several alternate patterns is used for the match.)

However, this is nowadays complicated by the fact that POSIX specifies the longest possible stretch of source text be matched: so a POSIX-conformant grep (which accepted pattern alternation as an extension) would have to match "NFA, no DFA", too, just as egrep would.

It's also complicated by the fact that many ERE engines will honor backreferences, which can't efficiently be handled with a DFA/"text-directed" engine.

POSIX specifies for both BREs and EREs:

matches should be longest match leftmost in text
subpatterns match greedily (with matching "" better than not matching)
\0 is not permitted in text nor in patterns

For the following, I've compared the BusyBox tools (grep, egrep, sed with and without -r, and awk) to the Gnu core tools, and to the versions of these tools in a base FreeBSD 9 system.

BREs

Special characters: . * [] ^ $ \(\) \{\}
    
Dot and [^x] should match newline; but on no grep -z implementation do they do so.

Anchors ^ $
    POSIX says interpreted literally when not at start of pattern/group/alternate (FreeBSD grep is buggy with ^)
    All [e]greps match start-of-lines and end-of-lines even with multiline buffers (though FreeBSD's [e]grep -o ^... is buggy)
    Sed engines, on the other hand, only ever match these anchors against start-of-buffer and end-of-buffer.

Alternation with \|
    Not in POSIX for BREs, but is honored by many <code>grep</code>s and <code>sed</code>s
    FreeBSD tools don't handle \| in these contexts: "_ ... | _ ... ( _ ... _ ) ... _"

Intervals \{m,n\}
    \? for \{0,1\} and \+ for \{1,\} aren't in POSIX, but are often honored.
    All greps and and all seds except FreeBSD's do support \| \? \+ and other Gnu extensions
    Unmatched \{ rejected as error
    Unmatched \} usually treated as literal; only FreeBSD sed without -r rejects
    Gnu sed accepts \{,1\}; others reject and all reject \{\}, \{2,1\}, and \{1,2,3\}

Perversely-placed quantifiers
    Treated as literals at start of a pattern/group/alternate (after ^, if any)
    This is true for *, \?, and \+ where they're available
    But \{1\} is instead rejected as an error by FreeBSD and Gnu sed

    FreeBSD and Gnu grep, BusyBox grep and sed: accept adjacent quantifiers without error
        Gnu and FreeBSD sed: reject \{...\} and * when they follow another quantifier
        Gnu sed: permits \? and \+ to do so; FreeBSD doesn't recognize them as quantifiers

Groups \( \) and backreferences \1-\9
    ^\(ab*\)*\1$ matches ababbabb but not ababbab (backref to last match of the antecedent pattern)
        unmatched \( or \) rejected as error

\<var>c</var>
     POSIX doesn't define the behavior of this for orinary characters <var>c</var>. All of the implementations I 
     checked interpret it as matching a literal <var>c</var>.

Bracket expressions [...]
    inside brackets, ^ is literal except at start, and .*[\ are all literal
    The following are all required by POSIX (''except in italics''):
    ] in []...] and [^]...] treated as literal
    - in [-...] and [^-...] and [...-] treated as literal
        so [b\-a] is the set of b,\,],^,_,`,a; not the set of b,-,a
    \ in [...\...] treated as literal
        so [a\]b] is [a\] then literal b]; not the set of a,],b
    [:<var>classes</var>:]
        [:alpha:] [:alnum:] [:upper:] [:lower:]
        [:xdigit:] [:digit:] but not \d
        [:punct:]
        [:graph:] visible chars except space
        [:blank:] space or tab
        [:space:] space or tab or \n \r \v \f
        [:print:] visible chars plus space
        [:cntrl:] ASCII < 32 or 127
    unmatched [ ''rejected as error''
    unmatched ] treated as literal
    [m-ax] ''BusyBox [e]grep and nawk treat as just x, others reject''
    [a-m-xy] ''BusyBox [e]grep and nawk treat as [a-my], others reject''

Gnu extensions

These are present in all [e]greps and in all seds and awks except nawk (and gawk --traditional); even in awks, they're only treated special outside of brackets (though awk does honor \t \n and so on there)

\w \W for [[:alnum:]_] and [^[:alnum:]_]

\s \S or [[:space:]] and [^[:space:]], matches any of: space tab \n \r \v \f (BusyBox tools and some versions of gawk lack)

\b \B \< \>, zero-width matches (though FreeBSD's [e]grep -o \b... and -o \<... are buggy; and BusyBox's sed and awk are buggy with \< \b \B at start of words)
    in awks, \b instead means \x08; \y substitutes for \b in gawk, nothing substitutes for \b in BusyBox awk 

\` \' start-of-buffer and end-of-buffer anchors (some regex engines use \A \Z \z for these instead)
    in awks, ^ and $ already have this behavior
    FreeBSD's [e]grep matches against start-of-line and end-of-line instead (and FreeBSD's grep -o '\`...' is buggy); (and BusyBox sed and awk are also buggy with \`)

C escapes

\n \t \r \x09 \f \v \a \c:
    handled specially by awk engines (though nawk only honors up to \f) and by Gnu's sed engine. Awk engines honor these even inside brackets.
    BusyBox's sed engine only honors \n \t \r
    FreeBSD's sed engine only honors \n
Otherwise, none of these are handled specially by the regex engines. They may be handled specially by your shell in $'...' constructs; as too may be \OOO (up to 3 octal digits) \uXXXX (4 hex digits) \e \E \b \'. (Awk engines handle some of these latter forms, too. As noted above, the last two are handled differently by regex engines.)

EREs

    . * ? + [] ^ $ | () {}
    POSIX requires "long|longest" should match all of "longest", so implementations must at least simulate DFAs in that way
    dot and [^x] should match newline; but on no egrep -z implementation do they do so
    ^ $ are always anchors
        all [e]greps match start-of-lines and end-of-lines even with multiline buffers (though FreeBSD's [e]grep -o ^... is buggy)
        sed and awk engines only match these anchors against start-of-buffer and end-of-buffer
        so ".^b" should match last two chars of $'a\nb' in grep (if only dot matched newline there), but not in sed; this is what we observe, except that while Gnu sed and gawk will only match '^pat' against start-of-buffer, they match '.^pat' against start-of-line (same with 'pat$.'); also, nawk rejects '.^b' but not 'a$.'
    |
        FreeBSD sed and nawk reject | in these contexts: "_ ... | _ ... ( _ ... _ ) ... _"; FreeBSD egrep silently fails to match
    {m,n}
        nawk and gawk --traditional treat as literal
        BusyBox and gawk --re-interval handle (this is default for some versions of gawk)
        unmatched {: BusyBox tools, Gnu sed and gawk reject as error; FreeBSD tools and Gnu egrep treat as literal
        unmatched }: treated as literal (only FreeBSD sed without -r rejects)
        Gnu sed and gawk accept {,1}; others reject and all reject {}, {2,1}, and {1,2,3}; except FreeBSD egrep handles the third strangely, and FreeBSD sed treats the first as literal
    quantifiers at start of pattern/group/alternate (after ^, if any): egreps silently drop (BusyBox whole construct, others only first char); seds with -r and awks usually reject (though gawk treats such quantifiers and any preceding ^ as literal; nawk accepts ^* ^+ ^? but hard to tell what they match)
    BusyBox and Gnu egrep and sed and awk, FreeBSD egrep and nawk: accept adjacent quantifiers without error; FreeBSD sed rejects all

    ( )
        all my implementations of egrep provide backrefs (so with those patterns, not running a DFA)
        ^(ab*)*\1$ matches ababbabb but not ababbab (backref to last match of the antecedent pattern)
        FreeBSD sed with -r: \1 is literal "1" rather than backref; other seds provided backrefs
        awks: \1 is \x01 rather than backref
        unmatched (: rejected as error
        unmatched ): rejected as error, except FreeBSD and BusyBox sed with -r, and BusyBox awk and gawk, treat as literal
    \ordinary undefined (all my egrep and sed implementations treat as literal; awks honor C escapes but will otherwise treat as literal, perhaps with warning)

    inside brackets, ^ is literal except at start, and .*[\?+|$ are all literal
    otherwise, as for BREs

    In gawk and nawk, [a\]1] matches a,],1. In BusyBox awk and egrep/sed, it matches a,\ followed by literal 1 then ].

Gnu extensions and C escapes as for BREs.

Notes

All the regex engines treat \d as literal d, not as [[:digit:]]
BusyBox sed will match only a single "", others will match multiple if /g
Grep engines will treat newlines in patterns as \|; sed and awk engines will reject as error
FreeBSD's sed will force a \n to end of text, as too will some other FreeBSD tools like cut; others like tr won't; in Gnu sed, 'q' forces a \n to end of text.
BusyBox's grep -oz suffixes each result with \0; other greps suffix with \n

Nongreedy quantifiers '??' '*?' '+?' '{m,n}?' unavailable in POSIX, and in all the POSIX-regex-implementing tools discussed here.

FreeBSD grep and egrep match the empty string at position 0 in:
    printf 'cba' | egrep -o '[ba]*'
None of the other [e]grep implementations I checked (such as BusyBox's or Gnu's) do this.

Regex in Lua

String escapes

In Lua, regex patterns are always supplied as strings, so will honor all the normal escapes on strings:

\\
\"
\'
\a for bell, \x07
\b for backspace, \x08
\t for \x09
\n for \x0a
\v for \x0b
\f for \x0c
\r for \x0d

Lua strings also accept \ddd for decimal digits d. (Note: not octal digits.) Starting with Lua 5.2, \xhh for hex digits is also accepted.

Strings can be written inside matching single or double quotes. They can also be written inside:

[[constructs
like this]]

[=[
or [[like]]
\this]=]

In these constructs, escape sequences like \t aren't expanded. It and the embedded [[like]] are both treated literally. Also, the first character of the string is ignored when it's a newline.

Regex engines

The basic Lua regex engine is more limited than the Posix- or PCRE-style languages, though still quite capable. In fact some things are more easily done with the Lua engine than with the more familiar ones. If the basic Lua engine is nonetheless too limited for your purposes, you should look into the LPEG or Lrexlib libraries. The former is more powerful and widely-used in the Lua community; the latter interfaces to more familiar regex engine libraries and languages.

Regex Specials

The following sequences have special meaning to the basic Lua regex engine.

%z           In Lua 5.1, the regex engine wouldn't read past an embedded \0 in the pattern string, so this escape sequence
             was provided instead to match \0s in the source text (and permit the pattern string to continue).
             In Lua 5.2, embedded \0s can now be used directly. With the default compilation settings, %z is still honored,
             but it's deprecated.
%a and %A    Like POSIX [[:alpha:]] and [^[:alpha:]]
%l and %L    Like POSIX [[:lower:]] and [^[:lower:]]
%u and %U    Like POSIX [[:upper:]] and [^[:upper:]]
%w and %W    Like POSIX [[:alnum:]] and [^[:alnum:]].
             Note that unlike the Gnu regex extension \w, the patterns %w in Lua and
             [[:alnum:]] in POSIX do not include underscores.
%d and %D    Like POSIX [[:digit:]] and [^[:digit:]]
%x and %X    Like POSIX [[:xdigit:]] and [^[:xdigit:]]
%s and %S    Like POSIX [[:space:]] and [^[:space:]].
             %s and POSIX [[:space:]] match vertical space (\n \r \f \v) as well, whereas POSIX [[:blank:]] matches only \x20 and tab.
%p and %P    Like POSIX [[:punct:]] and [^[:punct:]], excludes space and alnum
%c and %C    Like POSIX [[:cntrl:]] and [^[:cntrl:]]
%g and %G    Like POSIX [[:graph:]] and [^[:graph:]], all visible characters except space; only so interpreted in Lua 5.2

POSIX [[:print:]], all visible characters plus space, isn't directly available. Use [%p%a ] or [%g ].

.                Any character
^ and $          Anchors; treated as literal characters when not in anchoring position in the pattern string
[class] and [^class] character classes, can include sequences of:
                   single characters like a or \t
                   ranges like a-m
                   regex specials like %a 
(pat)            Capture the source text matching pat into a group
                 Unlike other regex engines, these constructions can not be followed by quantifiers like * or +
()               Instead of text, capture the current position in the source into a group
%1               Backreference (contrast \1 which is \x01, and \\1 which is a literal character \ then 1)
? and * and +    The familiar greedy quantifiers
                 Note that in Lua these can only follow:
                   single characters, regex specials like . and %a, and [class] constructions
                   not arbitrary (pat)
-                This is a nongreedy form of *
%c               This is a literal c, for arbitrary character c
                 It cancels the special meaning of characters ( ) . % + - * ? [ ] ^ $

Alternation (expressed in other regex languages using |) is not available; it can only be approximated using [class] constructions.

Two of the nifty primitives that Lua has and other engines lack are:

%b()         Text inside (and including) balanced (...); other characters can be used in place of ( and ).
%f[class]    The transition from ^ or [^class] to [class] or $.
             This is a generalization of the Gnu regex specials \< and \>.

Examples:

%f[%x] matches the source text "123-567 9ab" at positions 1, 5, and 9
%f[%X] matches the same source text at positions 4, 7, and 12.