Regex: Difference between revisions
Dubiousjim (talk | contribs) (Created page with Lua info) |
Dubiousjim (talk | contribs) (Tweaks to Lua content) |
Revision as of 13:55, 3 May 2012
This page summarizes some gritty regex details.
Regex in Lua
String escapes
In Lua, regex patterns are always supplied as strings, so they honor all the normal string escapes:
\\ \" \'
\a for bell, \x07
\b for backspace, \x08
\t for \x09
\n for \x0a
\v for \x0b
\f for \x0c
\r for \x0d
Lua strings also accept \ddd for decimal digits d. (Note: not octal digits.) Starting with Lua 5.2, \xhh for hex digits is also accepted.
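A minimal sketch of the decimal escapes, assuming an ASCII platform (the variable name is only for illustration). The \xhh form is left out so the snippet also runs on Lua 5.1:

```lua
-- \ddd is read as DECIMAL: \065 is byte 65 ("A"), not octal 065 (decimal 53).
local abc = "\065\066\067"
assert(abc == "ABC")

-- \10 is the same byte as \n.
assert(string.byte("\10") == string.byte("\n"))

-- At most three digits are consumed; a fourth digit is an ordinary character.
assert("\0651" == "A1")

print(abc)
```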
Strings can be written inside matching single or double quotes. They can also be written inside:
[[constructs like this]] [=[ or [[like]] \this]=]
In these constructs, escape sequences like \t aren't expanded: the \t in \this and the embedded [[like]] are both treated literally. Also, the first character of the string is ignored when it's a newline.
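The behavior of the bracket constructs can be checked with a short sketch like the following (variable names are illustrative only):

```lua
-- Inside [[...]] a backslash is just a backslash: \t stays two characters.
local raw = [[a\tb]]
assert(raw == "a\\tb" and #raw == 4)

-- With [=[ ... ]=], an embedded [[like]] does not end the string.
local nested = [=[ or [[like]] \this]=]
assert(nested == " or [[like]] \\this")

-- A newline directly after the opening brackets is dropped.
local multi = [[
first line]]
assert(multi == "first line")
```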
Regex engines
The basic Lua regex engine is more limited than POSIX- or PCRE-style regex languages, though still quite capable. In fact, some things are easier to do with the Lua engine than with the more familiar ones. If the basic Lua engine is nonetheless too limited for your purposes, look into the LPeg or Lrexlib libraries. The former is more powerful and widely used in the Lua community; the latter provides bindings to more familiar regex engine libraries and languages.
Regex Specials
The following sequences have special meaning to the basic Lua regex engine.
%z In Lua 5.1, the regex engine wouldn't read past an embedded \0 in the pattern string, so this escape sequence was provided instead to match \0s in the source text (and permit the pattern string to continue). In Lua 5.2, embedded \0s can now be used directly. With the default compilation settings, %z is still honored, but it's deprecated.
%a and %A Like POSIX [[:alpha:]] and [^[:alpha:]]
%l and %L Like POSIX [[:lower:]] and [^[:lower:]]
%u and %U Like POSIX [[:upper:]] and [^[:upper:]]
%w and %W Like POSIX [[:alnum:]] and [^[:alnum:]]. Note that unlike the GNU regex extension \w, the patterns %w in Lua and [[:alnum:]] in POSIX do not include underscores.
%d and %D Like POSIX [[:digit:]] and [^[:digit:]]
%x and %X Like POSIX [[:xdigit:]] and [^[:xdigit:]]
%s and %S Like POSIX [[:space:]] and [^[:space:]]. %s and POSIX [[:space:]] match vertical space (\n \r \f \v) as well, whereas POSIX [[:blank:]] matches only \x20 and tab.
%p and %P Like POSIX [[:punct:]] and [^[:punct:]], excludes space and alnum
%c and %C Like POSIX [[:cntrl:]] and [^[:cntrl:]]
%g and %G Like POSIX [[:graph:]] and [^[:graph:]], all visible characters except space; only so interpreted in Lua 5.2
POSIX [[:print:]], all visible characters plus space, isn't directly available. Use [%w%p ] (alphanumerics, punctuation, and space) or [%g ].
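A few of these classes in action, as a rough sketch for the default C locale (names are illustrative). The printable-character approximation uses [%w%p ] here so that digits are covered and the snippet does not depend on %g, which needs Lua 5.2:

```lua
-- %w excludes the underscore, so identifier-like text needs it added explicitly.
assert(("foo_bar"):match("^%w+") == "foo")
assert(("foo_bar"):match("^[%w_]+") == "foo_bar")

-- The uppercase form is the complement: %A is any non-letter.
assert(("abc123"):match("%A+") == "123")

-- %s covers vertical space (here \n) as well as space and tab.
assert(("a \t\n b"):match("%s+") == " \t\n ")

-- Approximating POSIX [[:print:]]: alphanumerics, punctuation, and space.
local printable = "^[%w%p ]+$"
assert(("Room 101, ok!"):find(printable) ~= nil)
assert(("tab\there"):find(printable) == nil)
```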
. Any character
^ and $ Anchors; treated as literal characters when not in anchoring position in the pattern string
[class] and [^class] Character classes, can include sequences of:
  single characters like a or \t
  ranges like a-m
  regex specials like %a
(pat) Capture the source text matching pat into a group
  Unlike in other regex engines, these constructions cannot be followed by quantifiers like * or +
() Instead of text, capture the current position in the source into a group
%1 Backreference (contrast \1 which is \x01, and \\1 which is a literal character \ then 1)
? and * and + The familiar greedy quantifiers
  Note that in Lua these can only follow: single characters, regex specials like . and %a, and [class] constructions; not arbitrary (pat)
- This is a nongreedy form of *
%c This is a literal c, for any non-alphanumeric character c
  It cancels the special meaning of the characters ( ) . % + - * ? [ ] ^ $
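The following sketch exercises the constructs listed above: captures, position captures, backreferences, greedy and nongreedy quantifiers, and % escapes. The sample strings and variable names are made up for illustration:

```lua
-- (pat) captures text; () captures a position in the source.
local key, val = ("name = Lua"):match("(%w+)%s*=%s*(%w+)")
assert(key == "name" and val == "Lua")
local pos = ("hello world"):match("()world")
assert(pos == 7)

-- %1 must repeat whatever the first group captured (here, the opening quote).
local quote, inner = ("say 'hi' now"):match("(['\"])(.-)%1")
assert(quote == "'" and inner == "hi")

-- * is greedy; - is its nongreedy counterpart.
assert(("<a><b>"):match("<.*>") == "<a><b>")
assert(("<a><b>"):match("<.->") == "<a>")

-- A + after a group does not quantify the group: it is a literal plus sign.
assert(("abab"):match("(ab)+") == nil)
assert(("ab+"):match("(ab)+") == "ab")

-- % cancels the special meaning of a punctuation character.
assert(("v3.14"):match("%d+%.%d+") == "3.14")
assert(("v3x14"):match("%d+%.%d+") == nil)
```

Because a group cannot be repeated, repetition of a multi-character unit usually has to be restructured, for example by looping over matches in Lua code rather than inside the pattern.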
Alternation (expressed in other regex languages using |) is not available; it can only be approximated using [class] constructions.
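A sketch of both workarounds: a [class] covers alternatives that differ in a single character, and for whole-word alternatives one option is to try several patterns in turn from Lua code. The helper match_any is hypothetical, not part of the Lua standard library:

```lua
-- Single-character alternatives fit in a [class].
assert(("grey"):match("gr[ae]y") == "grey")
assert(("gray"):match("gr[ae]y") == "gray")

-- Hypothetical helper: try each pattern in order and return the first match.
local function match_any(s, patterns)
  for _, p in ipairs(patterns) do
    local m = s:match(p)
    if m then return m end
  end
  return nil
end

assert(match_any("a dog barked", {"cat", "dog"}) == "dog")
assert(match_any("a bird sang", {"cat", "dog"}) == nil)
```

Note that this is not equivalent to true alternation: the helper prefers the earliest pattern in the list, whereas cat|dog in other engines finds the leftmost occurrence of either word.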
Two of the nifty primitives that Lua has and other engines lack are:
%b() Text inside (and including) balanced (...); other characters can be used in place of ( and ).
%f[class] The transition from ^ or [^class] to [class] or $. This is a generalization of the GNU regex specials \< and \>.
Examples:
%f[%x] matches the source text "123-567 9ab" at positions 1, 5, and 9
%f[%X] matches the same source text at positions 4, 8, and 12.
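Both primitives can be tried out with a sketch like this one. The helper positions is hypothetical; it pairs each frontier with a () position capture to record where the pattern matches:

```lua
-- %b() grabs a balanced, possibly nested, delimited span.
assert(("f(a(b)c) + g(d)"):match("%b()") == "(a(b)c)")
assert(("x = {1, {2, 3}}"):match("%b{}") == "{1, {2, 3}}")

-- Hypothetical helper: collect every position captured by pat.
local function positions(s, pat)
  local out = {}
  for p in s:gmatch(pat) do out[#out + 1] = p end
  return table.concat(out, ",")
end

local src = "123-567 9ab"
-- Transitions into a run of hex digits.
assert(positions(src, "()%f[%x]") == "1,5,9")
-- Transitions out of a run of hex digits; the space at position 8 is the
-- first non-hex character after 567, and 12 is just past the end.
assert(positions(src, "()%f[%X]") == "4,8,12")

-- Whole-word search, in the spirit of \<the\>.
assert(("other the then"):find("%f[%a]the%f[%A]") == 7)
```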