Regex: Difference between revisions
Dubiousjim (talk | contribs) (→Glob patterns: reorganize) |
Dubiousjim (talk | contribs) (→POSIX Regex: tweaks) |
||
Line 66: | Line 66: | ||
</pre> | </pre> | ||
== POSIX | == POSIX regex == | ||
POSIX defines two classes of regex languages: '''basic regular expressions (BREs)''' and '''extended regular expressions (EREs)'''. The first is historically implemented by utilities like <code>grep</code>, <code>sed</code>, and <code>ed</code>. The second by tools like <code>egrep</code> (<code>grep -E</code>), <code>awk</code>, <code>lex</code>, and <code>emacs</code>. EREs are usually also available as an option to sed; sometimes this is expressed using <code>sed -r</code>, other times <code>sed -E</code>. (BusyBox <code>sed</code> uses the first.) | POSIX defines two classes of regex languages: '''basic regular expressions (BREs)''' and '''extended regular expressions (EREs)'''. The first is historically implemented by utilities like <code>grep</code>, <code>sed</code>, and <code>ed</code>. The second by tools like <code>egrep</code> (<code>grep -E</code>), <code>awk</code>, <code>lex</code>, and <code>emacs</code>. EREs are usually also available as an option to <code>sed</code>; sometimes this is expressed using <code>sed -r</code>, other times <code>sed -E</code>. (BusyBox <code>sed</code> uses the first.) | ||
In practice most implementations of these regex languages go beyond the specification, for example including Gnu extensions like <code>\w</code>, or including features that were only historically available (and are only specified for) the other regex language: thus most implementations of BREs will also honor <code>\+</code> and <code>\|</code>, and many implementations of EREs will also honor backreferences like <code>\1</code>. | In practice most implementations of these regex languages go beyond the specification, for example including Gnu extensions like <code>\w</code>, or including features that were only historically available (and are only specified for) the other regex language: thus most implementations of BREs will also honor <code>\+</code> and <code>\|</code>, and many implementations of EREs will also honor backreferences like <code>\1</code>. Additionally, <code>awk</code> EREs obeys somewhat different rules than other EREs. | ||
{{Note|Historically, EREs were often implemented using [https://en.wikipedia.org/wiki/Deterministic_finite_automaton DFAs] or "text-directed" regex engines. BREs were often implemented using [https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton NFAs] or "pattern-directed" regex engines. There are theoretical translations between these, but in practice devices like backreferences and non-greedy quantification are only efficiently implementable with NFAs. Naive implementations of these differ in that NFAs are more "eager" so would always return, of several alternate patterns, the leftmost matching one, even if another pattern might match more of the source text. So for a hypothetical regex engine with alternation, <code>printf "NFA, no DFA" {{!}} regex_match "/NFA{{!}}NFA, no DFA/"</code> would match "NFA" if the engine was a NFA, and "NFA, no DFA" if the engine was a DFA. (Both types of engines will return the leftmost of several possible matches ''in the source text''; this contrast is only between which of several alternate ''patterns'' is used for the match.) | |||
However, this is nowadays complicated by the fact that POSIX specifies the ''longest'' possible leftmost stretch of source text be matched: so a POSIX-conformant <code>grep</code> (which accepted pattern alternation as an extension) would have to match "NFA, no DFA", too, just as <code>egrep</code> would. | |||
However, this is nowadays complicated by the fact that POSIX specifies the ''longest'' possible stretch of source text be matched: so a POSIX-conformant <code>grep</code> (which accepted pattern alternation as an extension) would have to match "NFA, no DFA", too, just as <code>egrep</code> would. | |||
It's also complicated by the fact that many ERE engines will honor backreferences, which can't efficiently be handled with a DFA/"text-directed" engine.}} | It's also complicated by the fact that many ERE engines will honor backreferences, which can't efficiently be handled with a DFA/"text-directed" engine.}} | ||
Line 85: | Line 83: | ||
* matches should be longest match leftmost in text | * matches should be longest match leftmost in text | ||
* subpatterns match greedily (with matching "" better than not matching) | * subpatterns match greedily (with matching "" better than not matching) | ||
* \0 | * nulls (\0) are not permitted in text nor in patterns | ||
For the following, I've compared the BusyBox tools (<code>grep</code>, <code>egrep</code>, <code>sed</code> with and without <code>-r</code>, and <code>awk</code>) to the Gnu core tools, and to the versions of these tools in a base FreeBSD 9 system. | For the following, I've compared the BusyBox tools (<code>grep</code>, <code>egrep</code>, <code>sed</code> with and without <code>-r</code>, and <code>awk</code>) to the Gnu core tools, and to the versions of these tools in a base FreeBSD 9 system. | ||
Line 234: | Line 232: | ||
</pre> | </pre> | ||
== Regex in Lua == | == Regex in Lua == |
Revision as of 16:34, 3 May 2012
This page summarizes some gritty regex details. It's not intended as a tutorial on any of these regex languages, but rather as a technical summary and list of "gotchas" (places where different tools may behave in different or unexpected ways).
Glob patterns
These are used in shell expansion and "case" pattern-matching.
glob* gl?bbing [a-z] and [!-aeiou] The latter can also be expressed in some shells (including BusyBox ash) as [^aeiou] but the [!...] format is more portable.
- if the pattern contains an invalid bracket expression or does not match any existing filenames or pathnames, the pattern string will be interpreted literally.
- a leading period can only be matched literally, not by
[!a]
or?
or*
or[%-0]
or[[:punct:]]
. In BusyBox ash, it's not matched by[.a]
; other shell implementations may differ. /
can only be matched literally, and has higher parsing precedence than[...]
, soa[b/c]d
only matches file c]d in directory a[b.- given the pattern
/foo/bar/x*/bam
, search permission is needed for directories / and foo, search and read permissions are needed for directory bar, and search permission is needed for each x* directory. - If
set -f
/set -o noglob
, glob-expansion is disabled.Todo: What contexts automatically suppress glob-expansion?
Q. How do I construct a shell glob-pattern that matches all files except "." and ".." ? (from Unix FAQ 2.11)
* matches all files that don't begin with a "."; .* matches all files that do begin with a ".", but this includes the special entries "." and "..", which often you don't want; so instead we use: .[!.]* matches all files that begin with a "." and are followed by a non-"."; unfortunately this will miss "..foo"; .??* matches files that begin with a "." and which are at least 3 characters long. This neatly avoids "." and "..", but also misses ".a" . So to match all files except "." and ".." safely you have to use 3 patterns (if you don't have length-2 filenames like ".a" you can leave out the first): .[!.]* .??* *
POSIX regex
POSIX defines two classes of regex languages: basic regular expressions (BREs) and extended regular expressions (EREs). The first is historically implemented by utilities like grep
, sed
, and ed
. The second by tools like egrep
(grep -E
), awk
, lex
, and emacs
. EREs are usually also available as an option to sed
; sometimes this is expressed using sed -r
, other times sed -E
. (BusyBox sed
uses the first.)
In practice most implementations of these regex languages go beyond the specification, for example including Gnu extensions like \w
, or including features that were only historically available (and are only specified for) the other regex language: thus most implementations of BREs will also honor \+
and \|
, and many implementations of EREs will also honor backreferences like \1
. Additionally, awk
EREs obeys somewhat different rules than other EREs.
printf "NFA, no DFA" | regex_match "/NFA|NFA, no DFA/"
would match "NFA" if the engine was a NFA, and "NFA, no DFA" if the engine was a DFA. (Both types of engines will return the leftmost of several possible matches in the source text; this contrast is only between which of several alternate patterns is used for the match.)
However, this is nowadays complicated by the fact that POSIX specifies the longest possible leftmost stretch of source text be matched: so a POSIX-conformant grep
(which accepted pattern alternation as an extension) would have to match "NFA, no DFA", too, just as egrep
would.
POSIX specifies for both BREs and EREs:
- matches should be longest match leftmost in text
- subpatterns match greedily (with matching "" better than not matching)
- nulls (\0) are not permitted in text nor in patterns
For the following, I've compared the BusyBox tools (grep
, egrep
, sed
with and without -r
, and awk
) to the Gnu core tools, and to the versions of these tools in a base FreeBSD 9 system.
BREs
Special characters: . * [] ^ $ \(\) \{\} Dot and [^x] should match newline; but on no grep -z implementation do they do so. Anchors ^ $ POSIX says interpreted literally when not at start of pattern/group/alternate (FreeBSD grep is buggy with ^) All [e]greps match start-of-lines and end-of-lines even with multiline buffers (though FreeBSD's [e]grep -o ^... is buggy) Sed engines, on the other hand, only ever match these anchors against start-of-buffer and end-of-buffer. Alternation with \| Not in POSIX for BREs, but is honored by many <code>grep</code>s and <code>sed</code>s FreeBSD tools don't handle \| in these contexts: "_ ... | _ ... ( _ ... _ ) ... _" Intervals \{m,n\} \? for \{0,1\} and \+ for \{1,\} aren't in POSIX, but are often honored. All greps and and all seds except FreeBSD's do support \| \? \+ and other Gnu extensions Unmatched \{ rejected as error Unmatched \} usually treated as literal; only FreeBSD sed without -r rejects Gnu sed accepts \{,1\}; others reject and all reject \{\}, \{2,1\}, and \{1,2,3\} Perversely-placed quantifiers Treated as literals at start of a pattern/group/alternate (after ^, if any) This is true for *, \?, and \+ where they're available But \{1\} is instead rejected as an error by FreeBSD and Gnu sed FreeBSD and Gnu grep, BusyBox grep and sed: accept adjacent quantifiers without error Gnu and FreeBSD sed: reject \{...\} and * when they follow another quantifier Gnu sed: permits \? and \+ to do so; FreeBSD doesn't recognize them as quantifiers Groups \( \) and backreferences \1-\9 ^\(ab*\)*\1$ matches ababbabb but not ababbab (backref to last match of the antecedent pattern) unmatched \( or \) rejected as error \<var>c</var> POSIX doesn't define the behavior of this for orinary characters <var>c</var>. All of the implementations I checked interpret it as matching a literal <var>c</var>. Bracket expressions [...] inside brackets, ^ is literal except at start, and .*[\ are all literal The following are all required by POSIX (''except in italics''): ] in []...] and [^]...] treated as literal - in [-...] and [^-...] and [...-] treated as literal so [b\-a] is the set of b,\,],^,_,`,a; not the set of b,-,a \ in [...\...] treated as literal so [a\]b] is [a\] then literal b]; not the set of a,],b [:<var>classes</var>:] [:alpha:] [:alnum:] [:upper:] [:lower:] [:xdigit:] [:digit:] but not \d [:punct:] [:graph:] visible chars except space [:blank:] space or tab [:space:] space or tab or \n \r \v \f [:print:] visible chars plus space [:cntrl:] ASCII < 32 or 127 unmatched [ ''rejected as error'' unmatched ] treated as literal [m-ax] ''BusyBox [e]grep and nawk treat as just x, others reject'' [a-m-xy] ''BusyBox [e]grep and nawk treat as [a-my], others reject''
Gnu extensions
These are present in all [e]greps and in all seds and awks except nawk (and gawk --traditional); even in awks, they're only treated special outside of brackets (though awk does honor \t \n and so on there)
\w \W for [[:alnum:]_] and [^[:alnum:]_] \s \S or [[:space:]] and [^[:space:]], matches any of: space tab \n \r \v \f (BusyBox tools and some versions of gawk lack) \b \B \< \>, zero-width matches (though FreeBSD's [e]grep -o \b... and -o \<... are buggy; and BusyBox's sed and awk are buggy with \< \b \B at start of words) in awks, \b instead means \x08; \y substitutes for \b in gawk, nothing substitutes for \b in BusyBox awk \` \' start-of-buffer and end-of-buffer anchors (some regex engines use \A \Z \z for these instead) in awks, ^ and $ already have this behavior FreeBSD's [e]grep matches against start-of-line and end-of-line instead (and FreeBSD's grep -o '\`...' is buggy); (and BusyBox sed and awk are also buggy with \`)
C escapes
\n \t \r \x09 \f \v \a \c: handled specially by awk engines (though nawk only honors up to \f) and by Gnu's sed engine. Awk engines honor these even inside brackets. BusyBox's sed engine only honors \n \t \r FreeBSD's sed engine only honors \n Otherwise, none of these are handled specially by the regex engines. They may be handled specially by your shell in $'...' constructs; as too may be \OOO (up to 3 octal digits) \uXXXX (4 hex digits) \e \E \b \'. (Awk engines handle some of these latter forms, too. As noted above, the last two are handled differently by regex engines.)
EREs
. * ? + [] ^ $ | () {} POSIX requires "long|longest" should match all of "longest", so implementations must at least simulate DFAs in that way dot and [^x] should match newline; but on no egrep -z implementation do they do so ^ $ are always anchors all [e]greps match start-of-lines and end-of-lines even with multiline buffers (though FreeBSD's [e]grep -o ^... is buggy) sed and awk engines only match these anchors against start-of-buffer and end-of-buffer so ".^b" should match last two chars of $'a\nb' in grep (if only dot matched newline there), but not in sed; this is what we observe, except that while Gnu sed and gawk will only match '^pat' against start-of-buffer, they match '.^pat' against start-of-line (same with 'pat$.'); also, nawk rejects '.^b' but not 'a$.' | FreeBSD sed and nawk reject | in these contexts: "_ ... | _ ... ( _ ... _ ) ... _"; FreeBSD egrep silently fails to match {m,n} nawk and gawk --traditional treat as literal BusyBox and gawk --re-interval handle (this is default for some versions of gawk) unmatched {: BusyBox tools, Gnu sed and gawk reject as error; FreeBSD tools and Gnu egrep treat as literal unmatched }: treated as literal (only FreeBSD sed without -r rejects) Gnu sed and gawk accept {,1}; others reject and all reject {}, {2,1}, and {1,2,3}; except FreeBSD egrep handles the third strangely, and FreeBSD sed treats the first as literal quantifiers at start of pattern/group/alternate (after ^, if any): egreps silently drop (BusyBox whole construct, others only first char); seds with -r and awks usually reject (though gawk treats such quantifiers and any preceding ^ as literal; nawk accepts ^* ^+ ^? but hard to tell what they match) BusyBox and Gnu egrep and sed and awk, FreeBSD egrep and nawk: accept adjacent quantifiers without error; FreeBSD sed rejects all ( ) all my implementations of egrep provide backrefs (so with those patterns, not running a DFA) ^(ab*)*\1$ matches ababbabb but not ababbab (backref to last match of the antecedent pattern) FreeBSD sed with -r: \1 is literal "1" rather than backref; other seds provided backrefs awks: \1 is \x01 rather than backref unmatched (: rejected as error unmatched ): rejected as error, except FreeBSD and BusyBox sed with -r, and BusyBox awk and gawk, treat as literal \ordinary undefined (all my egrep and sed implementations treat as literal; awks honor C escapes but will otherwise treat as literal, perhaps with warning) inside brackets, ^ is literal except at start, and .*[\?+|$ are all literal otherwise, as for BREs In gawk and nawk, [a\]1] matches a,],1. In BusyBox awk and egrep/sed, it matches a,\ followed by literal 1 then ].
Gnu extensions and C escapes as for BREs.
Notes
All the regex engines treat \d as literal d, not as [[:digit:]] BusyBox sed will match only a single "", others will match multiple if /g Grep engines will treat newlines in patterns as \|; sed and awk engines will reject as error FreeBSD's sed will force a \n to end of text, as too will some other FreeBSD tools like cut; others like tr won't; in Gnu sed, 'q' forces a \n to end of text. BusyBox's grep -oz suffixes each result with \0; other greps suffix with \n Nongreedy quantifiers '??' '*?' '+?' '{m,n}?' unavailable in POSIX, and in all the POSIX-regex-implementing tools discussed here. FreeBSD grep and egrep match the empty string at position 0 in: printf 'cba' | egrep -o '[ba]*' None of the other [e]grep implementations I checked (such as BusyBox's or Gnu's) do this.
Regex in Lua
String escapes
In Lua, regex patterns are always supplied as strings, so will honor all the normal escapes on strings:
\\ \" \' \a for bell, \x07 \b for backspace, \x08 \t for \x09 \n for \x0a \v for \x0b \f for \x0c \r for \x0d
Lua strings also accept \ddd
for decimal digits d. (Note: not octal digits.) Starting with Lua 5.2, \xhh
for hex digits is also accepted.
Strings can be written inside matching single or double quotes. They can also be written inside:
[[constructs like this]] [=[ or [[like]] \this]=]
In these constructs, escape sequences like \t
aren't expanded. It and the embedded [[like]]
are both treated literally. Also, the first character of the string is ignored when it's a newline.
Regex engines
The basic Lua regex engine is more limited than the Posix- or PCRE-style languages, though still quite capable. In fact some things are more easily done with the Lua engine than with the more familiar ones. If the basic Lua engine is nonetheless too limited for your purposes, you should look into the LPEG or Lrexlib libraries. The former is more powerful and widely-used in the Lua community; the latter interfaces to more familiar regex engine libraries and languages.
Regex Specials
The following sequences have special meaning to the basic Lua regex engine.
%z In Lua 5.1, the regex engine wouldn't read past an embedded \0 in the pattern string, so this escape sequence was provided instead to match \0s in the source text (and permit the pattern string to continue). In Lua 5.2, embedded \0s can now be used directly. With the default compilation settings, %z is still honored, but it's deprecated. %a and %A Like POSIX [[:alpha:]] and [^[:alpha:]] %l and %L Like POSIX [[:lower:]] and [^[:lower:]] %u and %U Like POSIX [[:upper:]] and [^[:upper:]] %w and %W Like POSIX [[:alnum:]] and [^[:alnum:]]. Note that unlike the Gnu regex extension\w
, the patterns%w
in Lua and[[:alnum:]]
in POSIX do not include underscores. %d and %D Like POSIX [[:digit:]] and [^[:digit:]] %x and %X Like POSIX [[:xdigit:]] and [^[:xdigit:]] %s and %S Like POSIX [[:space:]] and [^[:space:]]. %s and POSIX [[:space:]] match vertical space (\n \r \f \v) as well, whereas POSIX [[:blank:]] matches only \x20 and tab. %p and %P Like POSIX [[:punct:]] and [^[:punct:]], excludes space and alnum %c and %C Like POSIX [[:cntrl:]] and [^[:cntrl:]] %g and %G Like POSIX [[:graph:]] and [^[:graph:]], all visible characters except space; only so interpreted in Lua 5.2
POSIX [[:print:]]
, all visible characters plus space, isn't directly available. Use [%p%a ]
or [%g ]
.
. Any character ^ and $ Anchors; treated as literal characters when not in anchoring position in the pattern string [class] and [^class] character classes, can include sequences of: single characters likea
or\t
ranges likea-m
regex specials like%a
(pat) Capture the source text matching pat into a group Unlike other regex engines, these constructions can not be followed by quantifiers like * or + () Instead of text, capture the current position in the source into a group %1 Backreference (contrast \1 which is \x01, and \\1 which is a literal character \ then 1) ? and * and + The familiar greedy quantifiers Note that in Lua these can only follow: single characters, regex specials like.
and%a
, and [class] constructions not arbitrary (pat) - This is a nongreedy form of * %c This is a literal c, for arbitrary character c It cancels the special meaning of characters( ) . % + - * ? [ ] ^ $
Alternation (expressed in other regex languages using |
) is not available; it can only be approximated using [class]
constructions.
Two of the nifty primitives that Lua has and other engines lack are:
%b() Text inside (and including) balanced (...); other characters can be used in place of ( and ). %f[class] The transition from ^ or [^class] to [class] or $. This is a generalization of the Gnu regex specials\<
and\>
.
Examples:
%f[%x] matches the source text "123-567 9ab" at positions 1, 5, and 9 %f[%X] matches the same source text at positions 4, 7, and 12.