Regular Expressions - Tutorial Part 5: Character groups and classes

Type: Tutorial for Beginners
Language: English
Author: Stefan Trost Media
Date: 13.07.2011

Important Note

To try out the examples and to test your own regular expressions, you can use the software Text Converter in its free Basic version. The first part of this tutorial explains how to use the application.

Character groups

Until now, we have written each character explicitly. If we wanted to search for an "a", "b" or "c", we used the character selection [abc]. But what can we do if we want to search for a whole group of characters? For the digits, [1234567890] is still manageable, but what about all letters? The developers of regular expressions also thought that writing out every letter is cumbersome (particularly if you need the letters more than once), so they invented character groups.

Example 1

Search for | Replace with | Original | After replacement
[a-z] | X | ABCTZU abcolm 2317 | ABCTZU XXXXXX 2317
[A-Z] | X | ABCTZU abcolm 2317 | XXXXXX abcolm 2317
[0-9] | X | ABCTZU abcolm 2317 | ABCTZU abcolm XXXX
[a-d] | X | ABCTZU abcolm 2317 | ABCTZU XXXolm 2317
[U-Z] | X | ABCTZU abcolm 2317 | ABCTXX abcolm 2317
[3-9] | X | ABCTZU abcolm 2317 | ABCTZU abcolm 2X1X
[A-Ya-z0-9] | X | ABCTZU abcolm 2317 | XXXXZX XXXXXX XXXX

A group of characters can be defined simply with a hyphen (-). So [a-z] stands for all lowercase letters from a to z, [A-Z] for all uppercase letters from A to Z and [0-9] for all digits. Example 1 shows some applications.

Smaller ranges within such a selection are also possible. For instance, [a-d] stands for the letters a, b, c or d; [abcd] would be an expression with the same meaning. The next lines of Example 1 illustrate this.

You can also combine several groups of characters within one character selection. In the last line of Example 1, we have combined the character groups 0 to 9, a to z and A to Y. With this, all digits and all letters except the uppercase "Z" will be replaced. Concretely, this regular expression means: find one character that is contained in the selection. In the original, the first character is an "A". It is included in the selection, so this "A" will be replaced. After that, we look at the next character, a "B". This character is also in the selection and will be replaced. And so on.
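The rows of Example 1 can be reproduced with a short sketch. It assumes Python and its standard `re` module instead of the Text Converter used in this tutorial; the helper name `replace_matches` is made up for this illustration.

```python
import re

TEXT = "ABCTZU abcolm 2317"


def replace_matches(pattern: str, text: str, replacement: str = "X") -> str:
    """Replace every single match of `pattern` in `text` with `replacement`."""
    return re.sub(pattern, replacement, text)


# Each pattern is one character selection, so every match is one character.
PATTERNS = ["[a-z]", "[A-Z]", "[0-9]", "[a-d]", "[U-Z]", "[3-9]", "[A-Ya-z0-9]"]

for pattern in PATTERNS:
    print(f"{pattern:<12} {replace_matches(pattern, TEXT)}")
```

Because the text is checked character by character, a run of six matching letters turns into six "X" characters, not one.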

Example 2

Search for | Replace with | Original | After replacement
[-a-z] | X | abcdefgh-ijz | XXXXXXXXXXXX
[a-z-] | X | abcdefgh-ijz | XXXXXXXXXXXX
[-az] | X | abcdefgh-ijz | XbcdefghXijX
[a\-z] | X | abcdefgh-ijz | XbcdefghXijX
[a-z] | X | abcdefgh-ijz | XXXXXXXX-XXX

In Example 2, we can see some particularities. If a meta character stands at a position where it has no special meaning, it is interpreted as a normal character. In the first three lines, the hyphen stands at the very beginning or the very end of the character selection, so it cannot define a range there. It is interpreted as a literal "-" next to the character group "a" to "z" (lines one and two) or next to the letters "a" and "z" (line three). Therefore, the expression [-az] finds only the characters "-", "a" and "z" and nothing else. In the expression [a\-z], the meta character "-" is escaped. Again, only "-", "a" and "z" will be found, and the expression is not interpreted as a group. In the last line, we have used a normal character group [a-z] to show the difference. With this expression, only the letters a to z will be replaced and the hyphen in the text stays untouched.
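The hyphen rules from Example 2 can be checked with the following sketch. It assumes Python's `re` module; other regular expression engines usually treat these cases the same way, but that is not guaranteed for every program.

```python
import re

text = "abcdefgh-ijz"

# Hyphen first or last in the selection: a literal "-" plus the range a-z.
leading = re.sub(r"[-a-z]", "X", text)
trailing = re.sub(r"[a-z-]", "X", text)

# Hyphen first, followed by single characters: only "-", "a" and "z" match.
literal_only = re.sub(r"[-az]", "X", text)

# Escaped hyphen between "a" and "z": again only "a", "-" and "z" match.
escaped = re.sub(r"[a\-z]", "X", text)

# Unescaped hyphen between "a" and "z": a range, so the "-" in the text stays.
plain_range = re.sub(r"[a-z]", "X", text)

for name, result in [
    ("[-a-z]", leading),
    ("[a-z-]", trailing),
    ("[-az]", literal_only),
    (r"[a\-z]", escaped),
    ("[a-z]", plain_range),
]:
    print(f"{name:<8} {result}")
```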

Negation of characters

Example 3

Search for | Replace with | Original | After replacement
[^a] | X | abcdefga-ijz | aXXXXXXaXXXX
^[a] | X | abcdefga-ijz | Xbcdefga-ijz

Sometimes it is easier to define only the characters that should not be found instead of listing all characters that may appear. This is also possible with a regular expression. In the first line of Example 3, we are searching for all characters except an "a". For this, we use the meta character ^ again. Outside of a character selection, the ^ stands for the beginning of a string. As the first character inside the square brackets, however, the beginning of a string would make no sense, so there the meta character means "not". To show the difference, in the second line we have written the ^ in front of the brackets. Now the regular expression means "search for an a at the beginning of the string" instead of "search for all characters except an a". So in the first line, all characters except the two "a" are replaced, and in the second line, only the "a" at the beginning is found and replaced.
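The two meanings of ^ can be compared side by side in a small sketch, again assuming Python's `re` module. The third pattern is an addition that is not part of Example 3: it shows that in Python, a ^ that is not the first character inside the brackets is treated as an ordinary character.

```python
import re

text = "abcdefga-ijz"

# "^" as the first character inside the brackets negates the selection.
not_a = re.sub(r"[^a]", "X", text)

# "^" in front of the brackets anchors the match to the start of the string.
a_at_start = re.sub(r"^[a]", "X", text)

# "^" later inside the brackets is a literal caret (addition, not in Example 3).
a_or_caret = re.sub(r"[a^]", "X", "a^b")

print(not_a)       # aXXXXXXaXXXX
print(a_at_start)  # Xbcdefga-ijz
print(a_or_caret)  # XXb
```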

Example 4

Search for | Replace with | Original | After replacement
[^a]bc | X | abc zbc -bc | abc X X
[a]bc | X | abc zbc -bc | X zbc -bc

Let us look at another example. Example 4 means the following: we are searching for a group of characters ending with "bc". Before "bc", we have defined a character selection that tells us what the character before "bc" must look like. The selection says: "not an a". So this regular expression will match "zbc" and "-bc" but not "abc". In the second line, we have removed the negation. Now the expression means: before "bc" there has to be a character from the character selection "a". The character selection consists of exactly one character, so there must be an "a" before the "bc". As you can see, only "abc" will be replaced.
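Example 4 can be sketched in the same way, assuming Python's `re` module. The last call is an extra case that is not in the example table: a negated selection still has to match one character, so a bare "bc" with nothing in front of it is not found.

```python
import re

text = "abc zbc -bc"

# Any single character except "a", followed by "bc".
negated = re.sub(r"[^a]bc", "X", text)

# Exactly an "a", followed by "bc".
single = re.sub(r"[a]bc", "X", text)

# Extra case: there is no character before "bc", so nothing matches.
nothing_in_front = re.sub(r"[^a]bc", "X", "bc")

print(negated)           # abc X X
print(single)            # X zbc -bc
print(nothing_in_front)  # bc
```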

Character classes

Some character groups are used very often. This applies, for example, to letters and digits. For these groups of characters, there are shortcuts, which are listed in the table below. You can use such a shortcut directly in a regular expression. The column "Meaning" shows which other regular expression or character group the shortcut stands for.

Shortcut | Meaning
\d | [0-9]
\D | [^0-9] or [^\d]
\w | [A-Za-z0-9_]
\W | [^A-Za-z0-9_] or [^\w]
\s | Whitespace such as spaces, tabs or line breaks
\S | [^\s], anything that is not whitespace

Example 5

Search for | Replace with | Original | After replacement
\d | X | abc 123 456 | abc XXX XXX
\s | X | abc zbc -bc | abcXzbcX-bc

The shortcut \d (digit) stands for digits, the shortcut \D stands for all characters that are not digits. \w (word) stands for letters, digits and underscores, while \W finds all characters that are not letters, digits or underscores.

With the help of \s (space), you can search for whitespace, so \s finds spaces, line breaks, tabs and so on. \S is exactly the opposite of \s and matches all characters that are not whitespace.

Of course, the definition of some of these groups is not unambiguous. Take some Unicode characters, for example: do they belong to one of the classes mentioned or not? For this reason, programs like the Text Converter put the usual characters into the classes by default but also let you modify the classes. So you can define your own character classes for \s or \w.

Finally, we want to look at another small example. Example 5 shows that \d does the same as [0-9]: all digits will be replaced. Likewise, we can use \s to replace all spaces in a text.
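How much the class definitions can differ is easy to see in Python, which is used here only as an assumed example and says nothing about the defaults of the Text Converter. For text patterns, Python's `re` module counts Unicode letters and digits as part of \w and \d by default; the `re.ASCII` flag restricts the classes to the definitions in the table above.

```python
import re

word = "Gr\u00f6\u00dfe"   # "Größe", written with escapes for ö and ß
digits = "7 \u0663"        # ASCII digit 7 and ARABIC-INDIC DIGIT THREE

# Default for str patterns: Unicode-aware classes.
unicode_word = re.sub(r"\w", "X", word)
unicode_digit = re.sub(r"\d", "X", digits)

# With re.ASCII: \w is [A-Za-z0-9_] and \d is [0-9].
ascii_word = re.sub(r"\w", "X", word, flags=re.ASCII)
ascii_digit = re.sub(r"\d", "X", digits, flags=re.ASCII)

print(unicode_word, ascii_word)
print(unicode_digit, ascii_digit)
```

In the ASCII variant, "ö", "ß" and the Arabic-Indic digit are left untouched, while the default variant replaces them.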
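The shortcuts can be tried out with a few lines of Python (an assumption; the tutorial itself uses the Text Converter). The sketch covers the two rows of Example 5 and adds the negated classes \D and \W on made-up input for comparison.

```python
import re

# \d and [0-9] give the same result on ASCII text: every digit is one match.
with_shortcut = re.sub(r"\d", "X", "abc 123 456")
with_range = re.sub(r"[0-9]", "X", "abc 123 456")

# \s matches each whitespace character, here the two spaces.
spaces = re.sub(r"\s", "X", "abc zbc -bc")

# Negated classes: everything that is not a digit / not a word character.
non_digits = re.sub(r"\D", "X", "abc 123 456")
non_word = re.sub(r"\W", "X", "a_b-c d")

print(with_shortcut)  # abc XXX XXX
print(spaces)         # abcXzbcX-bc
print(non_digits)     # XXXX123X456
print(non_word)       # a_bXcXd
```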

Begin and end of a word

We have seen that we can use the meta characters ^ and $ to search for the beginning or the end of a string. In practice, it is often more interesting to search for the beginning, the middle or the end of a word. You can use \b and \B for this, as the following examples show. Note that \b and \B are not supported by all programs, but the Text Converter understands them.

Example 6

Search for | Replace with | Original | After replacement
\bab | X | ab abc babc | X Xc babc
bc\b | X | ab abc babc | ab aX baX
\bab\b | X | ab abc babc | X abc babc

The shortcut \b stands for the position at the beginning or the end of a word. If we use the regular expression "\bab", replacements will be carried out in words that begin with "ab". If we write "bc\b" instead, we are searching for words ending with "bc". You can see this in the second line. We can also combine both and write "\bab\b" to search for a word that begins with "ab" and ends with "ab". So with this expression, we are searching for exactly the word "ab". The third line shows what happens: only the first "ab" is replaced, because it is the only standalone word "ab" in the string.
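A sketch of Example 6, assuming Python's `re` module. One Python-specific detail matters here: the patterns are written as raw strings (`r"..."`), because in a normal Python string literal `"\b"` would be a backspace character and not the word boundary.

```python
import re

text = "ab abc babc"

# "ab" at the beginning of a word.
starts_with_ab = re.sub(r"\bab", "X", text)

# "bc" at the end of a word.
ends_with_bc = re.sub(r"bc\b", "X", text)

# "ab" as a complete word: boundary on both sides.
whole_word_ab = re.sub(r"\bab\b", "X", text)

print(starts_with_ab)  # X Xc babc
print(ends_with_bc)    # ab aX baX
print(whole_word_ab)   # X abc babc
```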

Example 7

Search for | Replace with | Original | After replacement
\Bbc | X | abc babcd bcd | aX baXd bcd
bc\B | X | abc babcd bcd | abc baXd Xd
\Bbc\B | X | abc babcd bcd | abc baXd bcd

The shortcut \B stands for a position that is not at the beginning or the end of a word. Example 7 collects some cases. In the first line, we are searching for a word containing "bc", but the word should not begin with "bc". Two positions will be found. In the second line, we go the other way around with "bc\B". Now we are searching for a word containing "bc", but the "bc" should not stand at the end. With this, "abc" is no longer found, but "bcd" matches. In the last line, we look at a combination. We are searching for a word containing "bc", but "bc" should appear neither at the beginning nor at the end of the word. The expression matches only the "bc" in the middle of the word "babcd".
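Example 7 can be checked with the same kind of sketch, again assuming Python's `re` module and raw strings for the patterns.

```python
import re

text = "abc babcd bcd"

# "bc" that does not start a word.
not_at_start = re.sub(r"\Bbc", "X", text)

# "bc" that does not end a word.
not_at_end = re.sub(r"bc\B", "X", text)

# "bc" strictly inside a word: no boundary on either side.
inside_only = re.sub(r"\Bbc\B", "X", text)

print(not_at_start)  # aX baXd bcd
print(not_at_end)    # abc baXd Xd
print(inside_only)   # abc baXd bcd
```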

Summary

With the hyphen, you can define groups of characters such as [a-z] for all lowercase letters
If the meta character ^ is the first character inside a character selection, it negates the selection; outside the brackets it stands for the beginning of the string
Common character groups such as digits (\d), letters, digits and underscores (\w) or whitespace (\s) can be written with the shortcuts given in parentheses
You can define a position at the beginning, the middle or the end of a word with \b and \B. So you can search for words beginning with, containing or ending with a particular string
