This section describes the syntax of the regular expressions that can be used in Datex. Enter the regular expression you want to use into the "Search pattern", and use parentheses to define the matches you want to capture. Enter your output pattern into the "Output pattern", and use $1, $2, $3… to reference those matches.
Pattern |
Description |
|
. |
Matches any character except newline. |
|
[a-z0-9] |
Matches any single character of set. |
|
[^a-z0-9] |
Matches any single character not in set. |
|
\d |
Matches a digit. Same as [0-9]. |
|
\D |
Matches a non-digit. Same as [^0-9]. |
|
\w |
Matches an alphanumeric (word) character -- [a-zA-Z0-9_]. |
|
\W |
Matches a non-word character [^a-zA-Z0-9_]. |
|
\s |
Matches a whitespace character (space, tab, newline, etc.). |
|
\S |
Matches a non-whitespace character. |
|
\n |
Matches a newline (line feed). |
|
\r |
Matches a return. |
|
\t |
Matches a tab. |
|
\f |
Matches a formfeed. |
|
\b |
Matches a backspace (inside [] only). |
|
\0 |
Matches a null character. |
|
\nnn |
Matches an ASCII character of that octal value. |
|
\xnn |
Matches an ASCII character of that hexadecimal value. |
|
\cX |
Matches an ASCII control character. |
|
\metachar |
Matches the meta-character literally (e.g., \\, \., \|). |
|
(abc) |
Used to create sub-expressions. Remembers the match for later back references. Referenced by output patterns that use $1, $2, etc. |
|
$1, $2,… |
Matches whatever the first (second, and so on) set of parentheses matched. |
|
x? |
Matches 0 or 1 x's, where x is any of the above. |
|
x* |
Matches 0 or more x's. |
|
x+ |
Matches 1 or more x's. |
|
x{m,n} |
Matches at least m x's, but no more than n. |
|
abc |
Matches all of a, b, and c in order. |
|
a|b|c |
Matches one of a, b, or c. |
|
\b |
Matches a word boundary (outside [] only). |
|
\B |
Matches a non-word boundary. |
|
^ |
Anchors match to the beginning of a line or string. |
|
$ |
Anchors match to the end of a line or string. |
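As a quick illustration of two entries from the table, the sketch below uses Python's re module as a stand-in for Datex's engine. That substitution is an assumption: Python follows the same Perl-style syntax for these constructs, but Datex itself is driven through its "Search pattern" and "Output pattern" rather than through code. The example combines x{m,n} with the word-boundary anchors \b and \B.

```python
import re

# \b\d{2,4}\b: a run of 2 to 4 digits that stands alone as a word.
standalone = re.findall(r"\b\d{2,4}\b", "7 42 2024 123456")
print(standalone)  # ['42', '2024']

# \Bcat: "cat" that does NOT start at a word boundary.
inner = [m.start() for m in re.finditer(r"\Bcat", "cat concat")]
print(inner)  # [7]
```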
Some special characters are used to match a class of characters:
Wildcard |
Matches |
|
. |
Any single character except a line break, including a space. |
If you use "." as the search pattern, you will select the first character in the target string and, if you repeat the search, you will find each successive character, except for Return characters.
The following wildcards match by position in a line:
Wildcard |
Matches |
Example |
|
^ |
Beginning of a line (unless used in a character class; see below) |
^Door: Finds lines that begin with "Door". |
|
$ |
End of a line (unless used in a character class) |
$: Finds the end of the current line (a zero-width position; use .$ to match the last character). |
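The positional wildcards can be tried out with a short sketch in Python's re module, used here only as a stand-in for Datex (an assumption; Datex is not scripted this way). In Python, ^ and $ match at every line only when the re.MULTILINE flag is set, so the flag is passed to mimic the line-by-line behavior described in the table.

```python
import re

text = "Door: oak\nWindow: glass\nDoor: pine"

# ^Door anchors the match to the start of a line.
door_lines = re.findall(r"^Door.*$", text, flags=re.MULTILINE)
print(door_lines)  # ['Door: oak', 'Door: pine']

# $ itself is zero-width; .$ selects the last character of each line.
last_chars = re.findall(r".$", text, flags=re.MULTILINE)
print(last_chars)  # ['k', 's', 'e']
```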
A character class allows you to specify a set or range of characters. The set of characters is enclosed in brackets. You can choose either to match any character in the class or to match any character that is not in it. To exclude the class instead of matching it, place a caret (^) immediately after the opening bracket. Here are some examples:
|
Character Class |
Matches |
|
[aeiou] |
Any one of the characters a, e, i, o, u. |
|
[^aeiou] |
Any character except a, e, i, o, u. |
|
[a-e] |
Any character in the range a-e, inclusive. |
|
[a-zA-Z0-9] |
Any alphanumeric character. Note: Case-sensitivity is controlled by the Case-Sensitive option. |
|
[[] |
Finds a [. |
|
[]] |
Finds a ]. To find a closing bracket, place it immediately after the opening bracket. |
|
[a-e^] |
Finds a character in the range a-e or the caret character. To find the caret character, place it anywhere except as the first character after the opening bracket. |
|
[a-c-] |
Finds a character in the range a-c or the - sign. To match a -, place it at the beginning or end of the set. |
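The character classes above behave the same way in most Perl-style engines. The following sketch checks several of them with Python's re module, which is assumed here to be a reasonable stand-in for Datex's engine; the sample strings are made up for illustration.

```python
import re

vowels = re.findall(r"[aeiou]", "regular")
print(vowels)  # ['e', 'u', 'a']

non_vowels = re.findall(r"[^aeiou]", "regular")
print(non_vowels)  # ['r', 'g', 'l', 'r']

sample = "a-c [x] e^2"

# A closing bracket placed right after the opening bracket is literal.
closing = re.findall(r"[]]", sample)
print(closing)  # [']']

# A caret anywhere but first in the set is a literal caret.
range_or_caret = re.findall(r"[a-e^]", sample)
print(range_or_caret)  # ['a', 'c', 'e', '^']

# A hyphen at the end of the set is a literal hyphen.
range_or_hyphen = re.findall(r"[a-c-]", sample)
print(range_or_hyphen)  # ['a', '-', 'c']
```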
You can use the following notation to find non-printing characters:
|
Special Character |
Matches |
|
\r |
Line break (return) |
|
\n |
Newline (line feed) |
|
\t |
Tab |
|
\f |
Formfeed (page break) |
|
\xNN |
Hex code NN. |
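A minimal sketch of these escapes, written with Python's re module as an assumed stand-in for Datex's engine. It splits text on any of the three common line endings, replaces a tab, and matches a character by its hex code (41 is "A").

```python
import re

mixed = "a\r\nb\rc\nd"

# Try \r\n first so a Windows line ending counts as one break.
lines = re.split(r"\r\n|\r|\n", mixed)
print(lines)  # ['a', 'b', 'c', 'd']

untabbed = re.sub(r"\t", ", ", "x\ty")
print(untabbed)  # x, y

hex_hits = re.findall(r"\x41", "ABA")
print(hex_hits)  # ['A', 'A']
```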
The following patterns are wildcards that each match a whole class of characters:
Special Character |
Matches |
|
\s |
Any whitespace character (space, tab, return, linefeed, form feed) |
|
\S |
Any non-whitespace character. |
|
\w |
Any "word" character (a-z, A-Z, 0-9, and _) |
|
\W |
Any "non-word" character (All characters not included by \w). |
|
\d |
Any digit [0-9]. |
|
\D |
Any non-digit character. |
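The class wildcards can be checked with the sketch below, which uses Python's re module as an assumed stand-in for Datex. One difference worth noting: in Python 3, \w and \d also match non-ASCII letters and digits by default, so the re.ASCII flag is passed to get the ranges listed in the table.

```python
import re

line = "Win32\tbuild 90404"

digits = re.findall(r"\d", "Win32", flags=re.ASCII)
print(digits)  # ['3', '2']

non_digits = re.findall(r"\D", "Win32", flags=re.ASCII)
print(non_digits)  # ['W', 'i', 'n']

whitespace = re.findall(r"\s", line)
print(whitespace)  # ['\t', ' ']

# The underscore counts as a word character, so only "-" and " " are non-word.
non_word = re.findall(r"\W", "a_b-c d", flags=re.ASCII)
print(non_word)  # ['-', ' ']
```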
Repetition characters are modifiers that allow you to repeat a specified pattern.
Repetition Character |
Matches |
Examples |
|
* |
Zero or more occurrences of the preceding pattern. |
d* finds no characters, or one or more consecutive "d"s. .* finds an entire line of text, up to but not including the return character. |
|
+ |
One or more occurrences of the preceding pattern. |
d+ finds one or more consecutive "d"s. [0-9]+ finds a string of one or more consecutive digits, such as "90404", "1938", the "32" in "Win32", etc. |
|
? |
Zero or one occurrence of the preceding pattern. |
d? finds no characters or one "d". |
Please note that, since * and ? match zero instances of the pattern, they always succeed but may not select any text. You can use them to specify an optional character, as in the examples in the following section.
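The behavior of the repetition characters, including the zero-length match, can be seen in a short sketch. Python's re module is used as an assumed stand-in for Datex's engine, and the sample strings are invented for illustration.

```python
import re

# d* always succeeds; here it matches zero d's at the start of "add".
zero_or_more = re.search(r"d*", "add").group()
print(repr(zero_or_more))  # ''

one_or_more = re.search(r"d+", "add").group()
print(one_or_more)  # dd

# u? makes the "u" optional.
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)  # ['color', 'colour']

numbers = re.findall(r"[0-9]+", "Win32 shipped 90404 units in 1938")
print(numbers)  # ['32', '90404', '1938']

# .* stops at the line break.
first_line = re.search(r".*", "first line\nsecond line").group()
print(first_line)  # first line
```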
Datex supports "?" as a greediness modifier for a sub-pattern in a regular expression. Placing a "?" directly after a * or + in the ‘Search pattern’ reverses the greediness of that quantifier relative to the global setting, which is controlled by the ‘Greedy’ option on the main window.
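The effect of the "?" modifier is easiest to see side by side. The sketch below uses Python's re module as an assumed stand-in; Python has no global switch comparable to the ‘Greedy’ option, and its quantifiers are greedy by default, so the example corresponds to running Datex with ‘Greedy’ turned on.

```python
import re

markup = "<b>one</b> and <b>two</b>"

# Greedy: .* grabs as much as possible, spanning both tags.
greedy = re.findall(r"<b>.*</b>", markup)
print(greedy)  # ['<b>one</b> and <b>two</b>']

# Non-greedy: .*? stops at the first closing tag it can reach.
lazy = re.findall(r"<b>.*?</b>", markup)
print(lazy)  # ['<b>one</b>', '<b>two</b>']
```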
Datex supports the regular expression extension mechanism used in Perl. For instance:
(?#text) |
Comment |
|
(?:pattern) |
For grouping without creating back references |
|
(?=pattern) |
A zero-width positive look-ahead assertion. For example, \w+(?=\t) matches a word followed by a tab, without including the tab in $&. |
|
(?!pattern) |
A zero-width negative look-ahead assertion. For example, foo(?!bar) matches any occurrence of "foo" that isn't followed by "bar". |
|
(?<=pattern) |
A zero-width positive look-behind assertion. For example, (?<=\t)\w+ matches a word that follows a tab, without including the tab in $&. Works only for fixed-width look-behind. |
|
(?<!pattern) |
A zero-width negative look-behind assertion. For example, (?<!bar)foo matches any occurrence of "foo" that does not follow "bar". Works only for fixed-width look-behind. |
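The four assertions from the table can be exercised with the sketch below. It relies on Python's re module, assumed here to treat these constructs the same way as Datex's engine (Python also requires fixed-width look-behind). The tab-separated row and the other sample strings are made up.

```python
import re

row = "id\tname\tcity"

# Words followed by a tab; the tab is not part of the match.
before_tab = re.findall(r"\w+(?=\t)", row)
print(before_tab)  # ['id', 'name']

# Words preceded by a tab; again the tab is excluded.
after_tab = re.findall(r"(?<=\t)\w+", row)
print(after_tab)  # ['name', 'city']

# "foo" not followed by "bar": skips the one at index 0.
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]
print(not_followed)  # [7, 14]

# "foo" not preceded by "bar": skips the one inside "barfoo".
not_preceded = [m.start() for m in re.finditer(r"(?<!bar)foo", "barfoo bazfoo")]
print(not_preceded)  # [10]
```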
You can use parentheses within your search patterns to isolate portions of the matched string. Do this when you need to refer to subsections of the matched string in your output pattern, for example to replace only a portion of the matched string or to insert other text into it.
Here is an example. To match any date followed by the letters "B.C.", you can use the pattern "(\d+\sB\.C\.)" (one or more digits, followed by a space character, followed by the letters "B.C."). This matches dates such as 33 B.C., 1742 B.C., etc. However, if you want your output pattern to get the year and the letters separately, use two sets of parentheses, as in the search pattern "(\d+)\s(B\.C\.)".
When you write this output pattern, you can refer to the year with the variable $1 and the letters with $2.
If you write "(\d+)\s(B\.C\.|A\.D\.|BC|AD)" in the search pattern, then $2 contains whichever letters matched.
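To see how a search pattern and an output pattern work together, here is a rough sketch in Python. Python's re module writes back references as \g<1> rather than $1, so the hypothetical helper to_python_template (not part of Datex or Python) translates the Datex-style notation first. It is deliberately simplified and assumes the output pattern contains no backslashes.

```python
import re

def to_python_template(output_pattern: str) -> str:
    """Rewrite Datex-style $1, $2 references as Python's \\g<1>, \\g<2>."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", output_pattern)

search_pattern = r"(\d+)\s(B\.C\.|A\.D\.|BC|AD)"
output_pattern = "$2 $1"  # letters first, then the year

text = "Built in 1742 B.C. and restored in 215 A.D."
result = re.sub(search_pattern, to_python_template(output_pattern), text)
print(result)  # Built in B.C. 1742 and restored in A.D. 215
```

Here $1 holds the digits and $2 holds the letters, so swapping them in the output pattern reorders each date.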
Much of the power of regular expressions comes from combining these elementary patterns to make up complex searches. Here are some examples:
Pattern |
Matches |
|
\$?[0-9,]+\.?\d* |
Matches dollar amounts with an optional dollar sign. |
|
\d+\sB\.C\. |
One or more digits followed by a space, followed by "B.C." |
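The dollar-amount pattern is intentionally loose, which a quick test makes visible. The sketch uses Python's re module as an assumed stand-in for Datex, with invented sample text.

```python
import re

amount = r"\$?[0-9,]+\.?\d*"

found = re.findall(amount, "Paid $1,250.75 then 300 then $4 more")
print(found)  # ['$1,250.75', '300', '$4']

# Because every part except [0-9,]+ is optional, a lone comma also matches.
stray = re.findall(amount, "yes, no")
print(stray)  # [',']
```

If stray commas are a problem in your data, a stricter pattern that requires a leading digit is one possible refinement.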
The alternation operator (|) allows you to match any of a number of patterns using the logical "or" operator. Place it between two existing patterns to match either pattern. You can use more than one alternation operator in a pattern:
Pattern |
Matches |
|
\she\s|\sshe\s |
" he " or " she " |
|
cat|dog|possum |
"cat", "dog", or "possum" |
|
([0-9,]+\sB\.C\.)|([0-9,]+\sA\.D\.) or [0-9,]+\s((B\.C\.)|(A\.D\.)) |
Years of the form "yearNum B.C. or A.D." e.g., "2,175 B.C." or "215 A.D." |
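The alternation examples can be tried with the sketch below, again using Python's re module as an assumed stand-in for Datex. The year pattern is written with a non-capturing group, (?:...), so that Python's findall returns the whole match rather than just the group; that is a choice made for this example, not something the table requires.

```python
import re

animals = re.findall(r"cat|dog|possum", "The dog chased a possum past the cat")
print(animals)  # ['dog', 'possum', 'cat']

pronouns = re.findall(r"\she\s|\sshe\s", "Then he left and she stayed")
print(pronouns)  # [' he ', ' she ']

years = re.findall(r"[0-9,]+\s(?:B\.C\.|A\.D\.)", "From 2,175 B.C. to 215 A.D.")
print(years)  # ['2,175 B.C.', '215 A.D.']
```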
Datex uses a modified version of the PCRE library package, which is open source software written by Philip Hazel and copyrighted by the University of Cambridge, England.
The source to this library is available at: ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/