Regex

From Alpine Linux
Revision as of 13:55, 3 May 2012 by Dubiousjim (Tweaks to Lua content)

This page summarizes some gritty regex details.

Regex in Lua

String escapes

In Lua, regex patterns are always supplied as strings, so they honor all the normal string escapes:

\\
\"
\'
\a for bell, \x07
\b for backspace, \x08
\t for \x09
\n for \x0a
\v for \x0b
\f for \x0c
\r for \x0d

Lua strings also accept \ddd, where ddd is a sequence of up to three decimal digits. (Note: decimal, not octal.) Starting with Lua 5.2, \xhh with exactly two hex digits is also accepted.
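
As a quick check of the decimal escapes, the following sketch compares a few escaped strings with their literal equivalents. It deliberately avoids \xhh so that it should also run on Lua 5.1; the variable name a_then_one is only illustrative.

```lua
-- \ddd takes up to three DECIMAL digits, so \65 is byte 65 ("A"), not octal 65.
assert("\65" == "A")
assert("\065" == "A")

-- \10 is byte 10, i.e. a newline.
assert("\10" == "\n")

-- At most three digits are consumed, so this is "A" followed by "1".
local a_then_one = "\0651"
assert(a_then_one == "A1")

-- string.byte confirms the values of some of the named escapes.
assert(string.byte("\a") == 7)
assert(string.byte("\t") == 9)
assert(string.byte("\r") == 13)
```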

Strings can be written inside matching single or double quotes. They can also be written inside:

[[constructs
like this]]

[=[
or [[like]]
\this]=]

In these constructs, escape sequences aren't expanded: the \t in \this and the embedded [[like]] are both treated literally. Also, a newline that immediately follows the opening bracket is skipped and does not become part of the string.
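
The behavior of the second construct can be checked with a short sketch like this one, which compares the long-bracket string with an equivalent quoted string (the variable name s is arbitrary):

```lua
-- The newline right after [=[ is dropped, and neither \t nor the
-- inner [[like]] is interpreted.
local s = [=[
or [[like]]
\this]=]

assert(s == "or [[like]]\n\\this")
assert(s:sub(1, 2) == "or")          -- no leading newline
assert(not s:find("\t", 1, true))    -- no tab character was produced
```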

Regex engines

The basic Lua regex engine is more limited than POSIX- or PCRE-style regex languages, though still quite capable. In fact, some things are easier to do with the Lua engine than with the more familiar ones. If the basic Lua engine is nonetheless too limited for your purposes, look into the LPEG or Lrexlib libraries. The former is more powerful and widely used in the Lua community; the latter provides bindings to more familiar regex libraries and languages.

Regex Specials

The following sequences have special meaning to the basic Lua regex engine.

%z           In Lua 5.1, the regex engine wouldn't read past an embedded \0 in the pattern string, so this escape sequence
             was provided instead to match \0s in the source text (and permit the pattern string to continue).
             In Lua 5.2, embedded \0s can now be used directly. With the default compilation settings, %z is still honored,
             but it's deprecated.
%a and %A    Like POSIX [[:alpha:]] and [^[:alpha:]]
%l and %L    Like POSIX [[:lower:]] and [^[:lower:]]
%u and %U    Like POSIX [[:upper:]] and [^[:upper:]]
%w and %W    Like POSIX [[:alnum:]] and [^[:alnum:]].
             Note that unlike the GNU regex extension \w, the patterns %w in Lua and
             [[:alnum:]] in POSIX do not include underscores.
%d and %D    Like POSIX [[:digit:]] and [^[:digit:]]
%x and %X    Like POSIX [[:xdigit:]] and [^[:xdigit:]]
%s and %S    Like POSIX [[:space:]] and [^[:space:]].
             %s and POSIX [[:space:]] match vertical space (\n \r \f \v) as well, whereas POSIX [[:blank:]] matches only \x20 and tab.
%p and %P    Like POSIX [[:punct:]] and [^[:punct:]], excludes space and alnum
%c and %C    Like POSIX [[:cntrl:]] and [^[:cntrl:]]
%g and %G    Like POSIX [[:graph:]] and [^[:graph:]], all visible characters except space; available only in Lua 5.2 and later

POSIX [[:print:]], all visible characters plus space, isn't directly available. Use [%p%w ] or, in Lua 5.2 and later, [%g ].
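
A few of these classes can be exercised as follows. This is only a sketch, and it assumes the default C locale, since the classes are locale-dependent. The variable name printable is illustrative.

```lua
-- %w is alphanumeric only; the underscore has to be added explicitly.
assert(("foo_bar"):match("^%w+") == "foo")
assert(("foo_bar"):match("^[%w_]+") == "foo_bar")

-- %s covers vertical whitespace too; [ \t] approximates POSIX [[:blank:]].
assert(("a \t\n b"):match("%s+") == " \t\n ")
assert(("a \t\n b"):match("[ \t]+") == " \t")

-- A [[:print:]] substitute that needs no %g. It uses %w rather than %a
-- so that digits are included.
local printable = "[%p%w ]"
assert(("Tab\there 42!"):match("^" .. printable .. "+") == "Tab")
assert(("x = 42!"):match("^" .. printable .. "+$") == "x = 42!")
```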

.                Any character
^ and $          Anchors; ^ anchors only at the very start of the pattern string and $ only at the very end.
                 Elsewhere in the pattern they are treated as literal characters
[class] and [^class]
                 Character classes; can include sequences of:
                   single characters like a or \t
                   ranges like a-m
                   regex specials like %a 
(pat)            Capture the source text matching pat into a group
                 Unlike in other regex engines, a capture group cannot be followed by a quantifier such as * or +
()               Instead of text, capture the current position in the source into a group
%1               Backreference (contrast \1 which is \x01, and \\1 which is a literal character \ then 1)
? and * and +    The familiar greedy quantifiers
                 Note that in Lua these can only follow:
                   single characters, regex specials like . and %a, and [class] constructions
                   not arbitrary (pat)
-                This is a nongreedy form of *
%c               A literal c, for any non-alphanumeric character c
                 (with an alphanumeric c, the sequence is one of the classes or specials listed above)
                 It cancels the special meaning of the characters ( ) . % + - * ? [ ] ^ $
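
The following sketch exercises several of the specials above with string.match and string.find. The sample strings and variable names are made up for illustration.

```lua
-- (pat) captures text; () captures a position.
local key, value = ("name = Alpine"):match("^(%w+)%s*=%s*(%w+)$")
assert(key == "name" and value == "Alpine")

local start_pos, word, end_pos = ("  hello  "):match("()(%a+)()")
assert(start_pos == 3 and word == "hello" and end_pos == 8)

-- %1 refers back to the first capture: here, the matching closing quote.
local quote, body = ([[say "it's fine" now]]):match("([\"'])(.-)%1")
assert(quote == '"' and body == "it's fine")

-- * is greedy; - is its non-greedy counterpart.
assert(("<a><b>"):match("<.*>") == "<a><b>")
assert(("<a><b>"):match("<.->") == "<a>")

-- %. is a literal dot; a bare . matches any character.
assert(("a.b"):find("%.") == 2)
assert(("a.b"):find(".") == 1)
```

Note that the backreference is written %1 inside the Lua string; "\1" would instead put the single byte \x01 into the pattern.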

Alternation (expressed in other regex languages using |) is not available; it can only be approximated using [class] constructions.
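
One common workaround for longer alternatives is to try several patterns in turn. The sketch below assumes a small helper, match_any, which is hypothetical and not part of the Lua standard library.

```lua
-- Single-character alternatives fit in a class: [ch]at behaves like (c|h)at.
assert(("hat"):match("^[ch]at$") == "hat")

-- For longer alternatives, try each pattern in order and return the
-- first successful match. match_any is a hypothetical helper.
local function match_any(s, patterns)
  for _, pat in ipairs(patterns) do
    local m = s:match(pat)
    if m then return m end
  end
  return nil
end

assert(match_any("image.jpeg", { "%.png$", "%.jpe?g$" }) == ".jpeg")
assert(match_any("notes.txt", { "%.png$", "%.jpe?g$" }) == nil)
```

This is not equivalent to true alternation: the helper returns the match of the first pattern in the list that succeeds, not the leftmost match in the source text, and it cannot be embedded in the middle of a larger pattern.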

Two of the nifty primitives that Lua has and other engines lack are:

%b()         Text inside (and including) balanced (...); other characters can be used in place of ( and ).
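%f[class]    A "frontier": an empty match at any position where the previous character is not in [class]
             and the next character is. The start and end of the source text count as \0 for this test.
             This is a generalization of the GNU regex specials \< and \>.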

Examples:

%f[%x] matches the source text "123-567 9ab" at positions 1, 5, and 9.
%f[%X] matches the same source text at positions 4, 8, and 12.
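
These positions can be checked by combining the frontier with a position capture. In the sketch below, positions is a hypothetical helper written for this illustration; the last lines also show %b() and a word-boundary use of %f.

```lua
-- Collect every position at which a pattern matches, as a comma-separated
-- string. positions is a hypothetical helper, not a library function.
local function positions(s, pat)
  local found = {}
  for pos in s:gmatch("()" .. pat) do
    found[#found + 1] = pos
  end
  return table.concat(found, ",")
end

local text = "123-567 9ab"
assert(positions(text, "%f[%x]") == "1,5,9")
-- Position 8 is the space after 567; position 12 is just past the final b.
assert(positions(text, "%f[%X]") == "4,8,12")

-- %f[%w] and %f[%W] act like the GNU word boundaries \< and \>.
assert(("the cat sat"):gsub("%f[%w]cat%f[%W]", "dog") == "the dog sat")
assert(("concat"):match("%f[%w]cat%f[%W]") == nil)

-- %b() matches a balanced parenthesized run, nested pairs included.
assert(("f(a, g(b)) + 1"):match("%b()") == "(a, g(b))")
```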