mkAlign accepts the same regular expressions as Perl, so the
following text is an extract from the Perl documentation.
The following metacharacters have their standard
*egrep*-ish meanings:
- \
- Quote the next metacharacter
- ^
- Match the beginning of the string
- .
- Match any character (except newline)
- $
- Match the end of the string (or before newline at the end)
- |
- Alternation
- ()
- Grouping
- []
- Character class
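The metacharacters above can be tried out in a few lines. This is only a minimal sketch: it uses Python's `re` module as a stand-in, on the assumption that these basic metacharacters behave there as they do in Perl, and the helper name `found` is purely illustrative.

```python
import re

def found(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# "." matches any single character except a newline.
dot_plain = found(r"a.c", "abc")
dot_newline = found(r"a.c", "a\nc")

# "\" quotes the next metacharacter: "\." is a literal dot.
quoted_dot = found(r"a\.c", "a.c")
quoted_dot_other = found(r"a\.c", "abc")

# "^" and "$" anchor the pattern to the start and end of the string.
anchored = found(r"^cat$", "cat")
anchored_inside = found(r"^cat$", "concat")

# "|" alternates, "()" groups, "[]" is a character class.
grouped = found(r"^(cat|dog)[0-9]$", "dog7")
grouped_other = found(r"^(cat|dog)[0-9]$", "cow7")

print(dot_plain, dot_newline, quoted_dot, quoted_dot_other)
print(anchored, anchored_inside, grouped, grouped_other)
```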
The following standard quantifiers are recognized:
- *
- Match 0 or more times
- +
- Match 1 or more times
- ?
- Match 1 or 0 times
- {n}
- Match exactly n times
- {n,}
- Match at least n times
- {n,m}
- Match at least n but not more than m times
(If a curly bracket occurs in any other context, it is treated as a
regular character.) The "*" quantifier is equivalent to `{0,}', the "+"
quantifier to `{1,}', and the "?" quantifier to `{0,1}'. n and m are
limited to integral values less than 65536.
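The equivalences between the short quantifiers and their curly-bracket forms can be checked directly. The sketch below assumes Python's `re` module as a stand-in for Perl; `SAMPLES` and `accepted` are illustrative names.

```python
import re

SAMPLES = ["", "a", "aa", "aaa", "aaaa"]

def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s) is not None]

# Each short quantifier accepts exactly the same samples as its {n,m} form.
star_same = accepted(r"a*") == accepted(r"a{0,}")
plus_same = accepted(r"a+") == accepted(r"a{1,}")
opt_same = accepted(r"a?") == accepted(r"a{0,1}")

exactly_two = accepted(r"a{2}")
two_or_more = accepted(r"a{2,}")
two_to_three = accepted(r"a{2,3}")

print(star_same, plus_same, opt_same)
print(exactly_two, two_or_more, two_to_three)
```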
By default, a quantified subpattern is "greedy", that is, it will match
as many times as possible (given a particular starting location) while
still allowing the rest of the pattern to match. If you want it to
match the minimum number of times possible, follow the quantifier with a
"?". Note that the meanings don't change, just the "greediness":
- *?
- Match 0 or more times
- +?
- Match 1 or more times
- ??
- Match 0 or 1 time
- {n}?
- Match exactly n times
- {n,}?
- Match at least n times
- {n,m}?
- Match at least n but not more than m times
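The difference between greedy and minimal matching is easiest to see side by side. A minimal sketch, again assuming Python's `re` module as a stand-in for Perl; the sample strings are made up for illustration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" takes as much as it can while still allowing ">" to match.
greedy = re.search(r"<.+>", text).group(0)

# Minimal: ".+?" stops at the first ">" that lets the pattern succeed.
minimal = re.search(r"<.+?>", text).group(0)

# The same holds for counted quantifiers.
greedy_count = re.search(r"a{2,4}", "aaaaa").group(0)
minimal_count = re.search(r"a{2,4}?", "aaaaa").group(0)

print(greedy)
print(minimal)
print(greedy_count, minimal_count)
```

Both forms accept the same set of repetitions; only the one that is tried first differs.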
Because patterns are processed as double quoted strings, the following
also work:
- \t
- tab (HT, TAB)
- \n
- newline (LF, NL)
- \r
- return (CR)
- \f
- form feed (FF)
- \a
- alarm (bell) (BEL)
- \e
- escape (think troff) (ESC)
- \033
- octal char (think of a PDP-11)
- \x1B
- hex char
- \c[
- control char
- \l
- lowercase next char
- \u
- uppercase next char
- \L
- lowercase till \E
- \U
- uppercase till \E
- \E
- end case modification
- \Q
- quote regexp metacharacters till \E
If UseCzechLocales is set to a non-zero value in TrEd's
configuration, the alphabetic characters recognized by
`\l', `\u', `\L', and `\U' are taken from the Czech locale.
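A few of the character escapes can be illustrated as follows. This sketch assumes Python's `re` module as a stand-in and therefore covers only the escapes that Python shares with Perl (`\t', `\a', octal and hex); `\e', `\c[' and the case-modification escapes are specific to Perl and are not shown.

```python
import re

ESC = chr(27)  # the escape character: decimal 27, octal 033, hex 1B
BEL = chr(7)   # the bell character

tab_found = re.search(r"a\tb", "a\tb") is not None
octal_esc = re.fullmatch(r"\033", ESC) is not None
hex_esc = re.fullmatch(r"\x1B", ESC) is not None
bell = re.fullmatch(r"\a", BEL) is not None

print(tab_found, octal_esc, hex_esc, bell)
```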
In addition, the following characters are defined:
- \w
- Match a "word" character (alphanumeric plus "_")
- \W
- Match a non-word character
- \s
- Match a whitespace character
- \S
- Match a non-whitespace character
- \d
- Match a digit character
- \D
- Match a non-digit character
Note that `\w' matches a single alphanumeric character, not a whole
word. To match a word you'd need to say `\w+'.
If UseCzechLocales is set to a non-zero value in TrEd's
configuration, the alphabetic characters matched by `\w' are taken
from the Czech locale. You may use
`\w', `\W', `\s', `\S', `\d', and `\D' within character classes (though
not as either end of a range).
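The point that `\w' matches one character, while `\w+' matches a whole word, can be shown briefly. A minimal sketch with Python's `re` module as a stand-in for Perl; the sample text is invented, and locale-dependent behaviour such as UseCzechLocales is not modelled.

```python
import re

text = "Order 66 shipped to dock_7"

# "\w" alone matches one character at a time.
single_chars = re.findall(r"\w", "ab_1")

# "\w+" matches whole words (letters, digits and "_").
words = re.findall(r"\w+", text)

# "\d+" picks out runs of digits.
numbers = re.findall(r"\d+", text)

# "\d" and "\s" may appear inside a character class.
digits_and_spaces = re.findall(r"[\d\s]+", "a1 2b")

print(single_chars)
print(words)
print(numbers)
print(digits_and_spaces)
```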
The following zero-width assertions are defined:
- \b
- Match a word boundary
- \B
- Match a non-(word boundary)
- \A
- Match only at beginning of string
- \Z
- Match only at end of string (or before newline at the end)
A word boundary (`\b') is defined as a spot between two characters that
has a `\w' on one side of it and a `\W' on the other side of it (in
either order), counting the imaginary characters off the beginning and
end of the string as matching a `\W'. (Within character classes `\b'
represents backspace rather than a word boundary.) The `\A' and `\Z'
are just like "^" and "$".
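The word-boundary rule can be checked on a short string. This is a sketch that assumes Python's `re` module as a stand-in for Perl; `\Z' is left out because Python treats a trailing newline differently there.

```python
import re

text = "cat concat cats"

# "\bcat\b": "cat" with a word boundary on both sides.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]

# "\bcat": "cat" at the start of a word.
word_start = [m.start() for m in re.finditer(r"\bcat", text)]

# "\Bcat": "cat" that does not start at a word boundary.
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

# "\A" matches only at the beginning of the string.
at_start = re.search(r"\Acat", text) is not None
not_at_start = re.search(r"\Acon", text) is not None

# Inside a character class, "\b" is a backspace character.
backspace = re.search(r"[\b]", "a\x08b") is not None

print(whole_word, word_start, inside_word)
print(at_start, not_at_start, backspace)
```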
When the bracketing construct `( ... )' is used, \<digit> matches the
digit'th substring. If you want to use parentheses to delimit a
subpattern (e.g., a set of alternatives) without saving it as a
subpattern, follow the ( with a ?:. You may have as many parentheses
as you wish.
Within the pattern, \10, \11, etc. refer back to substrings
if there have been at least that many left parentheses before the
backreference. Otherwise (for backward compatibility) \10 is the same
as \010, a backspace, and \11 the same as \011, a tab. And so on.
(\1 through \9 are always backreferences.)
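Backreferences and non-saving parentheses can be illustrated with two short patterns. A minimal sketch, assuming Python's `re` module as a stand-in for Perl; the backward-compatibility reading of \10 as an octal escape is not shown because Python handles that case differently.

```python
import re

# "\1" must repeat whatever the first pair of parentheses matched,
# so this pattern finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
doubled_word = doubled.group(1)

# Ordinary parentheses save their substring ...
capturing = re.search(r"(ab|cd)(\d)", "cd5").groups()

# ... while "(?:" groups the alternatives without saving them.
non_capturing = re.search(r"(?:ab|cd)(\d)", "cd5").groups()

print(doubled_word)
print(capturing)
print(non_capturing)
```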
$+ returns whatever the last bracket match matched. $& returns the
entire matched string. ($0 used to return the same thing, but not any
more.) `$`' returns everything before the matched string. `$'' returns
everything after the matched string.
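The four pieces described above can be pictured on a concrete match. Python has no such variables, so the sketch below computes rough counterparts from a match object. The mapping in the comments is an assumption made for illustration, and the sample text is invented.

```python
import re

text = "set alpha=42 now"
m = re.search(r"(\w+)=(\d+)", text)

matched = m.group(0)               # counterpart of $&: the entire match
before = text[:m.start()]          # counterpart of $`: text before the match
after = text[m.end():]             # counterpart of $': text after the match
last_group = m.group(m.lastindex)  # counterpart of $+ in this simple case

print(repr(before), repr(matched), repr(after), repr(last_group))
```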
Perl defines a consistent extension syntax for regular expressions. The
syntax is a pair of parentheses with a question mark as the first thing
within the parentheses (this was a syntax error in older versions of
Perl). The character after the question mark gives the function of the
extension. Several extensions are already supported:
- (?#text)
- A comment. The text is ignored. If the `/x' switch is used to
enable whitespace formatting, a simple `#' will suffice.
- (?:regexp)
- This groups things like "()" but doesn't make backreferences like
"()" does.
- (?!regexp)
- A zero-width negative lookahead assertion. For example
`foo(?!bar)' matches any occurrence of "foo" that isn't followed
by "bar". Note however that lookahead and lookbehind are NOT the
same thing. You cannot use this for lookbehind: `/(?!foo)bar/'
will not find an occurrence of "bar" that is preceded by something
which is not "foo". That's because the `(?!foo)' is just saying
that the next thing cannot be "foo"--and it's not, it's a "bar", so
"foobar" will match. You would have to do something like
`(?!foo)...bar' for that. We say "like" because there's the case
of your "bar" not having three characters before it. You could
cover that this way: `(?:(?!foo)...|^.{0,2})bar'.
- (?ix)
- One or more embedded pattern-match modifiers. If you need a
case-insensitive pattern you only need to include
`(?i)' at the front of the pattern.
- x
- Use extended regular expressions (whitespace in the pattern is ignored and `#' starts a comment).
- i
- Do case-insensitive pattern matching.
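The extensions listed above can be tried together in a short script. This is a sketch only: it assumes Python's `re` module as a stand-in for Perl, and all names are illustrative. The guarded lookahead pattern uses `^.{0,2}` for the short-prefix alternative, so that a "bar" at the very start of the string is covered as well.

```python
import re

# (?#text): the comment is ignored by the matcher.
comment_ok = re.search(r"ab(?#this is ignored)c", "abc") is not None

# foo(?!bar): "foo" that is not followed by "bar".
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar is NOT a lookbehind: it still matches inside "foobar".
naive = re.search(r"(?!foo)bar", "foobar") is not None

def guarded(text):
    """True if text contains "bar" not immediately preceded by "foo"."""
    return re.search(r"(?:(?!foo)...|^.{0,2})bar", text) is not None

# (?i) at the front makes the whole pattern case-insensitive.
case_insensitive = re.search(r"(?i)hello", "HeLLo") is not None

# (?ix): also ignore whitespace in the pattern and allow "#" comments.
size_pattern = re.compile(r"""(?ix)   # i: ignore case, x: extended syntax
    ^ size \s* = \s*   # the key and the equals sign
    (\d+) $            # the numeric value
""")
size_value = size_pattern.search("SIZE = 10").group(1)

print(comment_ok, foo_not_bar, naive)
print(guarded("foobar"), guarded("bazbar"), guarded("xbar"), guarded("bar"))
print(case_insensitive, size_value)
```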