Regular Expressions

This section describes the regular expression syntax that Datex supports. Enter the regular expression you want to use into the "Search pattern" field, and use parentheses to define the matches you want to capture. Enter your output pattern into the "Output pattern" field, and use $1, $2, $3… to reference those matches.

 

Pattern

Description

.

Matches any character except newline.

[a-z0-9]

Matches any single character of set.

[^a-z0-9]

Matches any single character not in set.

\d

Matches a digit. Same as [0-9].

\D

Matches a non-digit. Same as [^0-9].

\w

Matches an alphanumeric (word) character -- [a-zA-Z0-9_].

\W

Matches a non-word character -- [^a-zA-Z0-9_].

\s

Matches a whitespace character (space, tab, newline, etc.).

\S

Matches a non-whitespace character.

\n

Matches a newline (line feed).

\r

Matches a return.

\t

Matches a tab.

\f

Matches a formfeed.

\b

Matches a backspace (inside [] only).

\0

Matches a null character.

\nnn

Matches an ASCII character of that octal value.

\xnn

Matches an ASCII character of that hexadecimal value.

\cX

Matches an ASCII control character.

\metachar

Matches the meta-character (e.g., \, ., |).

(abc)

Used to create sub-expressions. Remembers the match for later back references. Referenced by output patterns that use $1, $2, etc.

$1, $2,…

Matches whatever the first (second, and so on) set of parentheses matched.

x?

Matches 0 or 1 x's, where x is any of the above.

x*

Matches 0 or more x's.

x+

Matches 1 or more x's.

x{m,n}

Matches at least m x's, but no more than n.

abc

Matches all of a, b, and c in order.

a|b|c

Matches one of a, b, or c.

\b

Matches a word boundary (outside [] only).

\B

Matches a non-word boundary.

^

Anchors match to the beginning of a line or string.

$

Anchors match to the end of a line or string.

Wildcards

Some special characters are used to match a class of characters:

Wildcard

Matches

.

Any single character except a line break, including a space.

If you use "." as the search pattern, you will select the first character in the target string. If you repeat the search, you will find each successive character, except for Return characters.

The following wildcards match by position in a line:

Wildcard

Matches

Example

^

Beginning of a line (unless used in a character class; see below)

^Door: Finds lines that begin with "Door".

$

End of a line (unless used in a character class)

$: Finds the end of the current line. $ matches a position rather than a character; .$ finds the last character in the line.
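The positional wildcards can be tried out with a short sketch. It uses Python's re module as a stand-in, which is an assumption: Datex is PCRE-based, and its line-by-line behaviour may be controlled by its own options rather than by a flag such as re.MULTILINE. The sample text is made up for illustration.

```python
import re

# Illustrative sample text (not from Datex).
text = "Door: oak\nWindow: glass\nDoor: steel"

# re.MULTILINE lets ^ and $ match at every line, not only at the ends of the string.
door_lines = re.findall(r"^Door:.*$", text, flags=re.MULTILINE)

# $ is a zero-width position; .$ selects the character just before it.
last_chars = re.findall(r".$", text, flags=re.MULTILINE)

print(door_lines)
print(last_chars)
```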

Character Classes

A character class allows you to specify a set or range of characters, enclosed in brackets. You can choose either to match any character in the class or to match any character that is not in it. To exclude the class instead of matching it, place a caret (^) immediately after the opening bracket. Here are some examples:

Character Class

Matches

[aeiou]

Any one of the characters a, e, i, o, u.

[^aeiou]

Any character except a, e, i, o, u.

[a-e]

Any character in the range a-e, inclusive.

[a-zA-Z0-9]

Any alphanumeric character. Note: Case-sensitivity is controlled by the Case-Sensitive option.

[[]

Finds a [.

[]]

Finds a ]. To find a closing bracket, place it immediately after the opening bracket.

[a-e^]

Finds a character in the range a-e or the caret character. To find the caret character, place it anywhere except as the first character after the opening bracket.

[a-c-]

Finds a character in the range a-c or the - sign. To match a -, place it at the beginning or end of the set.
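The character classes in the table above can be checked with a minimal sketch. Python's re module is used here as an assumed stand-in for Datex's engine; the sample strings are invented.

```python
import re

word = "regular"
symbols = "a^e-[x]"

vowels = re.findall(r"[aeiou]", word)        # any one of a, e, i, o, u
non_vowels = re.findall(r"[^aeiou]", word)   # caret first: everything except the set
closing = re.findall(r"[]]", symbols)        # ] placed right after the opening bracket
with_caret = re.findall(r"[a-e^]", symbols)  # caret not first: a literal ^
with_dash = re.findall(r"[a-c-]", symbols)   # dash at the end: a literal -

print(vowels, non_vowels, closing, with_caret, with_dash)
```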

Non-printing Characters

You can use the following notation to find non-printing characters:

Special Character

Matches

\r

Line break (return)

\n

Newline (line feed)

\t

Tab

\f

Formfeed (page break)

\xNN

Hex code NN.
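As a rough illustration of the notation above, the following sketch looks for non-printing characters in an invented string. It assumes Python's re module, which accepts the same escapes; Datex's handling of line endings may differ.

```python
import re

# Illustrative text containing a tab, a return + line feed pair, and a formfeed.
text = "name\tvalue\r\nnext\x0cpage"

tab_count = len(re.findall(r"\t", text))   # tab written as \t
hex_tabs = re.findall(r"\x09", text)       # the same tab written as a hex code
line_ends = re.findall(r"\r\n", text)      # return followed by line feed
formfeed_at = re.search(r"\f", text).start()

print(tab_count, hex_tabs, line_ends, formfeed_at)
```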

Other Special Characters

The following patterns each match a whole class of characters:

Special Character

Matches

\s

Any whitespace character (space, tab, return, linefeed, form feed)

\S

Any non-whitespace character.

\w

Any "word" character (a-z, A-Z, 0-9, and _)

\W

Any "non-word" character (All characters not included by \w).

\d

Any digit [0-9].

\D

Any non-digit character.
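The shorthand classes can be compared side by side in a small sketch. It assumes Python's re module as a stand-in for Datex; re.ASCII is passed so that \d, \w, and \s are limited to the ASCII sets listed in the table.

```python
import re

text = "Win32 build_7\tok"

digits = re.findall(r"\d", text, flags=re.ASCII)      # single digits
words = re.findall(r"\w+", text, flags=re.ASCII)      # runs of word characters
spaces = re.findall(r"\s", text, flags=re.ASCII)      # whitespace characters
non_words = re.findall(r"\W", text, flags=re.ASCII)   # everything \w excludes

print(digits, words, spaces, non_words)
```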

Repetition Characters

Repetition characters are modifiers that allow you to repeat a specified pattern.

Repetition Character

Matches

Examples

*

Zero or more occurrences of the preceding pattern.

d* finds no characters, or one or more consecutive "d"s. .* finds an entire line of text, up to but not including the return character.

+

One or more occurrences of the preceding pattern.

d+ finds one or more consecutive "d"s. [0-9]+ finds a string of one or more consecutive digits, such as "90404", "1938", the "32" in "Win32", etc.

?

Zero or one occurrence of the preceding pattern.

d? finds no characters or one "d".

Note that, because * and ? can match zero instances of the pattern, they always succeed but may not select any text. You can use them to specify an optional character, as in the examples under Combining Patterns.
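The behaviour of the repetition characters, including the empty match that * can produce, is sketched below with Python's re module as an assumed stand-in for Datex. The "colou?r" pattern is an invented example of an optional character.

```python
import re

# d* succeeds at the very start of "add" by matching zero "d"s.
empty_match = re.search(r"d*", "add")
# d+ needs at least one "d", so it skips ahead to "dd".
one_or_more = re.search(r"d+", "add").group()

numbers = re.findall(r"[0-9]+", "Win32, 90404, 1938")
optional_u = re.findall(r"colou?r", "color colour")

# .* stops at the line break because "." does not match a newline.
first_line = re.search(r".*", "line one\nline two").group()

print(repr(empty_match.group()), one_or_more, numbers, optional_u, first_line)
```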

Greediness

Datex supports "?" as a greediness modifier for a sub-pattern in a regular expression. The global greediness setting is controlled by the ‘Greedy’ option on the main window. To override it in the ‘Search pattern’, place a "?" directly after a * or + to reverse the greediness setting for that repetition.
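The difference between a greedy and a non-greedy repetition can be seen in a short sketch. It assumes Python's re module, where repetitions are always greedy by default; that corresponds to running Datex with the ‘Greedy’ option switched on, and Python has no equivalent of the global toggle. The sample text is invented.

```python
import re

text = "<b>one</b> <b>two</b>"

# Greedy: .* runs to the last </b> it can reach.
greedy = re.findall(r"<b>.*</b>", text)

# Adding ? after * reverses the greediness: .*? stops at the first </b>.
lazy = re.findall(r"<b>.*?</b>", text)

print(greedy)
print(lazy)
```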

Extensions

Datex supports the regular expression extension mechanism used in Perl. For instance:

(?#text)

Comment

(?:pattern)

For grouping without creating back references

(?=pattern)

A zero-width positive look-ahead assertion. For example, \w+(?=\t) matches a word followed by a tab, without including the tab in $&.

(?!pattern)

A zero-width negative look-ahead assertion. For example, foo(?!bar) matches any occurrence of "foo" that isn't followed by "bar".

(?<=pattern)

A zero-width positive look-behind assertion. For example, (?<=\t)\w+ matches a word that follows a tab, without including the tab in $&. Works only for fixed-width look-behind.

(?<!pattern)

A zero-width negative look-behind assertion. For example, (?<!bar)foo matches any occurrence of "foo" that does not follow "bar". Works only for fixed-width look-behind.
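The extensions above can be exercised with a minimal sketch. It assumes Python's re module, which supports the same (?:...), (?=...), (?!...), (?<=...), and (?<!...) forms; the sample strings are invented, and the whole match (what the text calls $&) is what findall returns here.

```python
import re

text = "alpha\tbeta gamma\tdelta"

# Look-ahead: words followed by a tab; the tab is not part of the match.
before_tab = re.findall(r"\w+(?=\t)", text)
# Look-behind: words that follow a tab.
after_tab = re.findall(r"(?<=\t)\w+", text)

# Negative look-ahead: positions of "foo" not followed by "bar".
not_before_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]
# Negative look-behind: positions of "foo" not preceded by "bar".
not_after_bar = [m.start() for m in re.finditer(r"(?<!bar)foo", "barfoo bazfoo")]

# Non-capturing group: groups "ab" for repetition without creating a back reference.
repeated = re.findall(r"(?:ab)+", "ababab x ab")

print(before_tab, after_tab, not_before_bar, not_after_bar, repeated)
```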

Sub-expressions

You can use parentheses within your search patterns to isolate portions of the matched string. You do this when you need to refer to subsections of the matched text in your output pattern, for example to output only a portion of the matched string or to insert other text into it.

Here is an example. If you want to match any date followed by the letters "B.C.", you can use the pattern "(\d+\sB\.C\.)" (one or more digits, followed by a whitespace character, followed by the letters "B.C."). This will match dates such as 33 B.C., 1742 B.C., etc. However, if you want your output pattern to get the year and the letters separately, use two sets of parentheses. The search pattern "(\d+)\s(B\.C\.)" demonstrates this.

When you write the output pattern for this search pattern, you can refer to the year with the variable $1 and the letters with $2.

If you write "(\d+)\s(B\.C\.|A\.D\.|BC|AD)" in the search pattern, then $2 contains whichever form of the letters was matched.
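The interplay between a search pattern and an output pattern can be sketched as follows. This is an illustration only: apply_output_pattern is a hypothetical helper written for this example, not part of Datex, and it uses Python's re module as a stand-in for the PCRE-based engine. It expands $1, $2, and so on for every match found in the input.

```python
import re

def apply_output_pattern(search, output, text):
    """Hypothetical helper: expand $1, $2, ... in `output` for each match of `search`."""
    results = []
    for match in re.finditer(search, text):
        # Replace each $n with the text captured by the n-th set of parentheses.
        expanded = re.sub(
            r"\$(\d+)",
            lambda ref: match.group(int(ref.group(1))) or "",
            output,
        )
        results.append(expanded)
    return results

dates = apply_output_pattern(
    r"(\d+)\s(B\.C\.|A\.D\.|BC|AD)",   # search pattern with two sub-expressions
    "year=$1 era=$2",                  # output pattern referring to both
    "33 B.C., 1742 BC and 215 A.D.",
)
print(dates)
```

The helper returns one expanded string per match, so the year and the letters can be rearranged or combined with other text in the output pattern.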

Combining Patterns

Much of the power of regular expressions comes from combining these elementary patterns to make up complex searches. Here are some examples:

Pattern

Matches

\$?[0-9,]+\.?\d*

Matches dollar amounts with an optional dollar sign.

\d+\sB\.C\.

One or more digits followed by a space, followed by "B.C."
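The two combined patterns above can be tried on invented sample text. The sketch assumes Python's re module as a stand-in for Datex.

```python
import re

amount = re.compile(r"\$?[0-9,]+\.?\d*")   # optional $, digits and commas, optional decimals
bc_date = re.compile(r"\d+\sB\.C\.")       # digits, one whitespace character, "B.C."

amounts = amount.findall("Paid $1,250.75 then 40 and $3.00 more")
dates = bc_date.findall("33 B.C. and 1742 B.C.")

print(amounts)
print(dates)
```

Because [0-9,]+ accepts commas as well as digits, the amount pattern is permissive: a comma standing on its own in the text would also be reported as a match.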

The Alternation Operator

The alternation operator (|) allows you to match any of a number of patterns using the logical "or" operator. Place it between two existing patterns to match either pattern. You can use more than one alternation operator in a pattern:

Pattern

Matches

\she\s|\sshe\s

" he " or " she "

cat|dog|possum

"cat", "dog", or "possum"

([0-9,]+\sB\.C\.)|([0-9,]+\sA\.D\.) or [0-9,]+\s((B\.C\.)|(A\.D\.))

Years of the form "yearNum B.C." or "yearNum A.D.", e.g., "2,175 B.C." or "215 A.D."
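The alternation examples in the table can be checked with a short sketch, again assuming Python's re module as a stand-in for Datex and using invented sample sentences. The pattern is written without spaces around the | because a space inside a pattern is matched literally.

```python
import re

animals = re.findall(r"cat|dog|possum", "dog chased cat, possum watched")
pronouns = re.findall(r"\she\s|\sshe\s", "then he and she left")

# This pattern contains sub-expressions, so take the whole match of each result.
year_pattern = r"[0-9,]+\s((B\.C\.)|(A\.D\.))"
years = [m.group(0) for m in re.finditer(year_pattern, "Built in 2,175 B.C. and rebuilt in 215 A.D.")]

print(animals, pronouns, years)
```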

Credits

Datex uses a modified version of the PCRE library package, which is open source software, written by Philip Hazel, and copyright by the University of Cambridge, England.

The source to this library is available at: ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/