Regular Expressions - Tutorial Part 5: Character groups and classes

 


 

Type: Tutorial for Beginners
Category: Tools & Utilities
Language: English
Author: Stefan Trost Media
Date: 13.07.2011



In this part of the tutorial, we look at character groups and character classes. They are not mandatory, because everything they express can also be written in other ways, but they make regular expressions much easier to write and read. Above all, they save a lot of typing. Up to now, the following parts are published:

Part 1: Basics | Part 2: Normal strings, grouping and repetitions | Part 3: Meta characters and combinations | Part 4: Selections of characters and alternatives | Part 5: Character groups and classes | Part 6: Reusing and backward references | Part 7: Modifiers | Part 8: Usage and Examples

Important Note

To try out the examples and to test your own regular expressions, you can use the software Text Converter in its free Basic version. The first part of this tutorial explains how to use the application.

Character groups

Until now, we have written each character explicitly. If we wanted to search for an "a", "b" or "c", we used the character selection [abc]. But what can we do if we want to search for a whole group of characters? For the digits, [1234567890] may still be acceptable, but what about all letters? The developers of regular expressions also found it too cumbersome to write out every letter (particularly if you need them more than once). For this reason, they invented character groups.

Example 1

Search for:   Replace with:  Original:           After replacement:

[a-z]         X              ABCTZU abcolm 2317  ABCTZU XXXXXX 2317

[A-Z]         X              ABCTZU abcolm 2317  XXXXXX abcolm 2317

[0-9]         X              ABCTZU abcolm 2317  ABCTZU abcolm XXXX

 

[a-d]         X              ABCTZU abcolm 2317  ABCTZU XXXolm 2317

[U-Z]         X              ABCTZU abcolm 2317  ABCTXX abcolm 2317

[3-9]         X              ABCTZU abcolm 2317  ABCTZU abcolm 2X1X

[A-Ya-z0-9]   X              ABCTZU abcolm 2317  XXXXZX XXXXXX XXXX

A group of characters can be defined simply with a hyphen (-). So, [a-z] stands for all lowercase letters, [A-Z] stands for all uppercase letters and [0-9] for all digits. The first three lines of example 1 show these applications.

Smaller ranges are also possible. For instance, [a-d] stands for the letters a, b, c or d. [abcd] would be an expression with the same meaning. The next three lines of example 1 illustrate this.

You can also combine several groups of characters within one character selection. In the last line of example 1, we have combined the character groups 0 to 9, a to z and A to Y. With this, all digits and all letters except the uppercase Z will be replaced. Concretely, this regular expression means: find one character that is contained in the selection. In the original, the first character is an "A". It is included in the selection, so this "A" will be replaced. After that, we look at the next character, a "B". This character is also in the selection and will be replaced. And so on.
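The rows of example 1 can also be reproduced in code. The following is a minimal sketch using Python's standard re module, which is an assumption of this sketch: the tutorial itself works with Text Converter. The variable names are only illustrative.

```python
import re

original = "ABCTZU abcolm 2317"

# The search patterns from example 1. Each one matches exactly one character
# per hit, so every matching character is replaced by its own "X".
patterns = ["[a-z]", "[A-Z]", "[0-9]", "[a-d]", "[U-Z]", "[3-9]", "[A-Ya-z0-9]"]

# Map each pattern to the text after replacement.
results = {pattern: re.sub(pattern, "X", original) for pattern in patterns}

for pattern, replaced in results.items():
    print(f"{pattern:<12} {replaced}")
```

Because a character group matches a single character, the number of "X" characters in each result equals the number of characters that were found.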

Example 2

Search for:   Replace with:  Original:         After replacement:

[-a-z]        X              abcdefgh-ijz      XXXXXXXXXXXX

[a-z-]        X              abcdefgh-ijz      XXXXXXXXXXXX

[-az]         X              abcdefgh-ijz      XbcdefghXijX

[a\-z]        X              abcdefgh-ijz      XbcdefghXijX

[a-z]         X              abcdefgh-ijz      XXXXXXXX-XXX

In example 2, we can see some particularities. If a meta character stands at a position at which it has no special meaning, it is interpreted as a normal character. In the first three lines, the hyphen is the first or the last character inside the square brackets. There it cannot connect two characters, so it is interpreted as a literal "-" next to the character group "a" to "z" or next to the letters "a" and "z". Therefore, the expression [-az] finds only the characters "-", "a" and "z" and nothing else. In the expression [a\-z], the meta character "-" is escaped. Therefore, again only "-", "a" and "z" will be found, and the expression is not interpreted as a group. In the last line, we have used a normal character group with [a-z] to show the difference. With this expression, only the letters a to z will be replaced and the hyphen remains.
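The different roles of the hyphen can be checked with a short sketch. It assumes Python's re module as the regex engine (not the tool used in this tutorial); the variable names are made up for illustration.

```python
import re

original = "abcdefgh-ijz"

# Hyphen at the start or end of the selection: a literal "-" plus the range a-z.
hyphen_first = re.sub(r"[-a-z]", "X", original)
hyphen_last = re.sub(r"[a-z-]", "X", original)

# Hyphen at the start, no range at all: only "-", "a" and "z" are matched.
literal_set = re.sub(r"[-az]", "X", original)

# Escaped hyphen between "a" and "z": again only "-", "a" and "z".
escaped = re.sub(r"[a\-z]", "X", original)

# Unescaped hyphen between "a" and "z": the range a-z, the "-" itself stays.
plain_range = re.sub(r"[a-z]", "X", original)

print(hyphen_first, hyphen_last, literal_set, escaped, plain_range)
```

If a literal hyphen is needed, escaping it as in [a\-z] is the least ambiguous choice, because it does not depend on the position inside the brackets.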

Negation of characters

Example 3

Search for:   Replace with:  Original:          After replacement:

[^a]          X              abcdefga-ijz       aXXXXXXaXXXX

^[a]          X              abcdefga-ijz       Xbcdefga-ijz

Sometimes, it is easier to define only the characters that should not be found instead of all characters that can appear. This is also possible with a regular expression. In example 3, we search for all characters except "a". For this, we use the meta character ^ again. Outside of a character selection, the ^ stands for the begin of a string. But if it is the first character inside the square brackets, it means "not": the selection then matches every character that is not listed. In the second line, we have written the ^ in front of the brackets instead. Now, the regular expression means "search for an a at the begin of the string" instead of "search for all characters except a". So, in the first line, all characters except the two "a" will be replaced, and in the second line, only the "a" at the begin will be found and replaced.
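The two meanings of ^ can be compared side by side. This is a small sketch assuming Python's re module; the third case (a caret that is not the first character inside the brackets) is an extra illustration that does not appear in example 3.

```python
import re

original = "abcdefga-ijz"

# ^ as the first character inside the brackets: negation ("every character except a").
negated = re.sub(r"[^a]", "X", original)

# ^ in front of the brackets: anchor ("an a at the begin of the string").
anchored = re.sub(r"^[a]", "X", original)

# ^ inside the brackets but not in first position: an ordinary literal caret.
literal_caret = re.sub(r"[a^]", "X", "a^b")

print(negated)        # aXXXXXXaXXXX
print(anchored)       # Xbcdefga-ijz
print(literal_caret)  # XXb
```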

Example 4

Search for:   Replace with:  Original:          After replacement:

[^a]bc        X              abc zbc -bc        abc X X

[a]bc         X              abc zbc -bc        X zbc -bc

Let us look at another example. Example 4 means the following: we are searching for a sequence of characters ending with "bc". Before "bc", we have defined a character selection that tells us what the character before "bc" may be. The selection says: "not an a". So this regular expression will match "zbc" and "-bc" but not "abc". In the second line, we have removed the negation. Now the expression means: before "bc", there has to be a character from the character selection "a". The character selection consists of exactly one character, so there must be an "a" before the "bc". As you can see, only the "abc" will be replaced.
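Example 4 can be sketched as follows, again assuming Python's re module. The last check is an addition to the example: a negated selection still has to consume one character, so a "bc" with nothing in front of it is not matched.

```python
import re

original = "abc zbc -bc"

# One character that is not "a", followed by "bc".
negated = re.sub(r"[^a]bc", "X", original)

# One character that is "a", followed by "bc" (same as searching for "abc").
plain = re.sub(r"[a]bc", "X", original)

# [^a] must match an actual character; "bc" alone has none before it.
no_leading_char = re.search(r"[^a]bc", "bc")

print(negated)          # abc X X
print(plain)            # X zbc -bc
print(no_leading_char)  # None
```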

Character classes

Some character groups are used very often. For example, this applies to letters and digits. Thus, for these groups of characters, there are shortcuts, which are listed in the following table. You can use such a shortcut directly in a regular expression. In the column "Meaning", you can see which other regular expression or character group the shortcut stands for.

Shortcut           Meaning 

\d                 [0-9]

\D                 [^0-9] or [^\d]

\w                 [A-Za-z0-9_]

\W                 [^A-Za-z0-9_] or [^\w]

\s                 Whitespace like spaces or line breaks

\S                 [^\s] No whitespace

Example 5

Search for:   Replace with:  Original:          After replacement:

\d            X              abc 123 456        abc XXX XXX

\s            X              abc zbc -bc        abcXzbcX-bc

The shortcut \d (digits) stands for digits, the shortcut \D stands for all characters that are not digits. \w (word) stands for letters, digits and underscores, while \W finds all characters that are not letters, digits or underscores.

With the help of \s (space), you can search for whitespace, so \s finds spaces, line breaks, tabs and so on. \S is exactly the opposite of \s and matches all characters that are not whitespace.

Of course, the definition of some of these groups is not fixed. Take Unicode characters, for example: do they belong to one of the mentioned classes or not? Because of that, programs like the Text Converter put the usual characters into the classes by default, but also allow you to modify the classes. So, you can define your own character classes for \s or \w.
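How much the classes can differ between settings is visible in Python's re module, which is used here only as an illustration and is not the tool described in this tutorial. For text patterns, Python counts Unicode letters and digits as members of \w and \d by default, while the re.ASCII flag restricts the classes to the ranges given in the table above. The sample text is made up for this sketch.

```python
import re

# "a", "ä" (U+00E4), a space, "1" and the Arabic-Indic digit three (U+0663).
text = "a\u00e4 1\u0663"

# Default behaviour for str patterns: Unicode-aware classes.
unicode_word = re.findall(r"\w", text)
unicode_digit = re.findall(r"\d", text)

# With re.ASCII, \w is [A-Za-z0-9_] and \d is [0-9].
ascii_word = re.findall(r"\w", text, flags=re.ASCII)
ascii_digit = re.findall(r"\d", text, flags=re.ASCII)

print(unicode_word, ascii_word)
print(unicode_digit, ascii_digit)
```

Other programs and regex engines may draw the line differently, so it is worth testing the classes with non-ASCII input before relying on them.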

As a last step, let us look at another small example. Example 5 shows that \d does the same as [0-9]: every single digit will be replaced. Likewise, we can use \s to replace all spaces in a text.
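The shortcuts can be tried out with a few lines of Python (the re module is an assumption of this sketch; the tutorial uses Text Converter). The \d+ line is an extra illustration that uses the repetitions from part 2: it replaces each run of digits as a whole instead of digit by digit.

```python
import re

# \d matches one digit at a time, so each digit becomes its own "X".
single_digits = re.sub(r"\d", "X", "abc 123 456")

# \d+ matches a whole run of digits, so each number becomes one "X".
digit_runs = re.sub(r"\d+", "X", "abc 123 456")

# \D is the opposite of \d: everything that is not a digit.
non_digits = re.sub(r"\D", "X", "abc 123 456")

# \s matches whitespace, \W matches everything outside of \w.
whitespace = re.sub(r"\s", "X", "abc zbc -bc")
non_word = re.sub(r"\W", "X", "abc zbc -bc")

print(single_digits)  # abc XXX XXX
print(digit_runs)     # abc X X
print(non_digits)     # XXXX123X456
print(whitespace)     # abcXzbcX-bc
print(non_word)       # abcXzbcXXbc
```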

Begin and end of a word

We have seen that we can use the meta characters ^ and $ to search for the begin or the end of a string. In practice, it is often much more interesting to search for the begin, the middle or the end of a word. You can use \b and \B for this. Examples 6 and 7 show how. Note that \b and \B are not supported by all programs, but the Text Converter understands them.

Example 6

Search for:   Replace with:  Original:          After replacement:

\bab          X              ab abc babc        X Xc babc

bc\b          X              ab abc babc        ab aX baX

\bab\b        X              ab abc babc        X abc babc

The shortcut \b stands for the position at the begin or the end of a word. If we use the regular expression "\bab", replacements will be carried out in words beginning with "ab". If we write "bc\b" instead, we are searching for words ending with "bc". You can see this in the second line. We can also combine both and write "\bab\b" to search for a word that begins with "ab" and also ends with this "ab". So, with this expression, we are searching for exactly the word "ab". In the third line, we can see what happens. Only the first "ab" will be replaced, because this is the only standalone word "ab" in the string.
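Python's re module is one of the engines that support \b, so example 6 can be sketched with it (the choice of Python is an assumption of this sketch). The patterns are written as raw strings (r"..."), because in an ordinary Python string "\b" would be read as a backspace character and never reach the regex engine as a word boundary.

```python
import re

original = "ab abc babc"

# "ab" at the begin of a word.
word_start = re.sub(r"\bab", "X", original)

# "bc" at the end of a word.
word_end = re.sub(r"bc\b", "X", original)

# "ab" as a complete word: boundary on both sides.
whole_word = re.sub(r"\bab\b", "X", original)

print(word_start)  # X Xc babc
print(word_end)    # ab aX baX
print(whole_word)  # X abc babc
```

Note that \b itself does not consume any character. It only checks the position, which is why the spaces between the words remain unchanged.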

Example 7

Search for:   Replace with:  Original:           After replacement:

\Bbc          X              abc babcd bcd       aX baXd bcd

bc\B          X              abc babcd bcd       abc baXd Xd

\Bbc\B        X              abc babcd bcd       abc baXd bcd

The shortcut \B stands for a position that is not at the begin or the end of a word. Example 7 collects some applications. In the first line, we are searching for a word containing "bc", but the word should not begin with "bc". Two positions will be found. In the second line, we go the other way around with "bc\B". Now, we are searching for a word containing "bc", but the "bc" should not stand at the end. With this, "abc" will not be found any more, but "bcd" matches. In the last line, we look at a combination. We are searching for a word containing "bc", but "bc" should appear neither at the begin nor at the end of the word. The expression only matches the "bc" in the middle of "babcd".
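Example 7 can be reproduced in the same way. This is again a sketch that assumes Python's re module rather than the program used in this tutorial.

```python
import re

original = "abc babcd bcd"

# "bc" that is not at the begin of a word.
not_at_start = re.sub(r"\Bbc", "X", original)

# "bc" that is not at the end of a word.
not_at_end = re.sub(r"bc\B", "X", original)

# "bc" strictly inside a word: neither at the begin nor at the end.
inside_only = re.sub(r"\Bbc\B", "X", original)

print(not_at_start)  # aX baXd bcd
print(not_at_end)    # abc baXd Xd
print(inside_only)   # abc baXd bcd
```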

Summary

  • With the hyphen, you can define groups of characters like [a-z] for all lowercase letters
  • If the meta character ^ is the first character inside a character selection like [^a], it negates the selection; in front of the brackets, it stands for the begin of the string
  • Common character groups like digits (\d), letters, digits and underscores (\w) or whitespace (\s) can be written by using the shortcuts in the brackets
  • You can define a position at the begin, the middle or the end of a word with \b and \B. So, you can search for words beginning with, containing or ending with a certain string


© Stefan Trost - The usage of this tutorial, even in parts, is prohibited without prior written consent of Stefan Trost. But of course, you are welcome to link to this tutorial.
