Regular Expressions - Tutorial Part 2: Grouping and repetitions

 


Type

Tutorial for Beginners

Category

Tools & Utilities

Language

English

Author

Stefan Trost Media

Date

13.07.2011

In this second part of the tutorial, we look at normal strings, grouping and repetitions of characters. In the first part, we discussed the usage of regular expressions and the meta characters asterisk *, point . and plus +. So far, the following parts have been published:

Part 1: Basics | Part 2: Normal strings, grouping and repetitions | Part 3: Meta characters and combinations | Part 4: Selections of characters and alternatives | Part 5: Character groups and classes | Part 6: Reusing and backward references | Part 7: Modifiers | Part 8: Usage and Examples

Important Note

To try out the examples and to test your own regular expressions, you can use the software Text Converter in its free Basic version. The first part of this tutorial explains how to use the application.

Normal strings

The normal string "abc" from example 1 in the box below can also be used as a regular expression. This expression does not contain any meta characters, only normal letters, so all occurrences of "abc" are simply replaced with an "X".

Example 1

Search for:       abc         Original:      aba abcdef

Replace with:     X           Replaced:      aba Xdef

 

Example 2

Search for:       abc+        Original:      aba abcdef abccc

Replace with:     X           Replaced:      aba Xdef X

 

Example 3

Search for:       abc*        Original:      aba abcdef abccc

Replace with:     X           Replaced:      Xa Xdef X

 

Example 4

Search for:       abc.        Original:      aba abcdef abccc

Replace with:     X           Replaced:      aba Xef Xc

In a way, regular expressions and normal string replacement overlap here. It only becomes interesting when we combine "abc" with the meta characters we already know. We have done this in examples 2 to 4.

In example 2, we combine "abc" with the plus. Note that the plus only refers to the last character before it. In this case, that is only the "c" and not the complete string "abc". So, for a string to match our regular expression, "ab" must be followed by at least one "c", and further "c" characters may follow. Exactly that can be seen in the example: in the original, the "abc" in "abcdef" as well as the "abccc" at the end match. Both parts are replaced with an "X" and we get the string "aba Xdef X" as the result.

In example 3, we can see what happens if we use an asterisk * instead of the plus + from example 2. The asterisk means that the character in front of it may appear any number of times, including not at all. This search pattern again matches "abc" and "abccc" as in the second example, but this time it also matches the "ab" in "aba" at the beginning. The "c" can appear, but it does not have to. "Xa Xdef X" is our result.

The third meta character we have met so far is the point, used in example 4. The point stands for an arbitrary character. So, we are searching for a string that begins with "abc" and is followed by one arbitrary fourth character. In the original, "abcd" and "abcc" match this pattern, so two replacements are made.
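Examples 1 to 4 can also be reproduced outside the Text Converter. The following is a minimal sketch using Python's standard re module, which is an assumption of this illustration and not the tool used in the tutorial; the helper name replace_all is made up for the example.

```python
import re

# (pattern, original text, expected text after replacing each match with "X")
EXAMPLES = [
    (r"abc",  "aba abcdef",       "aba Xdef"),    # example 1: plain string
    (r"abc+", "aba abcdef abccc", "aba Xdef X"),  # example 2: one or more "c"
    (r"abc*", "aba abcdef abccc", "Xa Xdef X"),   # example 3: zero or more "c"
    (r"abc.", "aba abcdef abccc", "aba Xef Xc"),  # example 4: any fourth character
]


def replace_all(pattern, text, replacement="X"):
    """Replace every non-overlapping match of pattern in text."""
    return re.sub(pattern, replacement, text)


for pattern, original, expected in EXAMPLES:
    result = replace_all(pattern, original)
    print(f"{pattern:5} {original!r} -> {result!r}")
```

In every pattern, the meta character only affects the single character directly in front of it; the leading "ab" always has to be present literally.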

Grouping

In the first section, we have seen that meta characters only apply to the last character in front of them. But often we have several characters or groups of characters that should be repeated. This is where grouping helps.

Example 5

Search for:       (ab)+      Original:      ababab abcde ccabc

Replace with:     X          Replaced:      X Xcde ccXc

 

Example 6

Search for:       (ab)*      Original:      aba abcdef abccc

Replace with:     X          Replaced:      XXaX XXcXdXeXfX XXcXcXcX

If you want to group several characters, you can do that with round brackets ( and ). In examples 5 and 6, we have built a group around "ab". So, the meta characters + and * are not applied to the "b" only but to the whole group "ab".

In example 5, "ab" has to appear at least once and may be repeated several times. "ababab" and the two single "ab" are found, and each is replaced with an "X". In example 6, "ab" can appear any number of times, including not at all. Perhaps the result is surprising. First of all, every "ab" is found as in example 5, but every position where no "ab" occurs also matches (as an empty string) and receives an "X". Exactly where these additional "X" characters are placed can differ slightly between programs. This shows that you have to be careful when using regular expressions, because you might have thought that there is no big difference between examples 5 and 6.
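The difference between a grouped and an ungrouped repetition, and the surprising behaviour of (ab)*, can be checked with a short sketch. It assumes Python's re module rather than the Text Converter. Because engines differ in how they place replacements for empty matches, the sketch only counts the replacements for example 6 instead of relying on one exact output string.

```python
import re

text_5 = "ababab abcde ccabc"

# With the group, "+" repeats the whole "ab" (example 5).
grouped = re.sub(r"(ab)+", "X", text_5)
# Without the group, "+" would only repeat the "b".
ungrouped = re.sub(r"ab+", "X", text_5)

print(grouped)    # X Xcde ccXc
print(ungrouped)  # XXX Xcde ccXc

text_6 = "aba abcdef abccc"

# subn returns (new_text, number_of_replacements).
_, plus_count = re.subn(r"(ab)+", "X", text_6)
_, star_count = re.subn(r"(ab)*", "X", text_6)

# (ab)* is satisfied by zero repetitions, so it matches the empty string.
matches_empty = re.fullmatch(r"(ab)*", "") is not None

print(plus_count)               # 3
print(star_count > plus_count)  # True
print(matches_empty)            # True
```

If only real occurrences of "ab" should be replaced, (ab)+ is usually the pattern that is wanted.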

Repetitions

Up to now, we have only seen simple kinds of repetitions with the meta characters plus and asterisk. Can you also say that a string should be repeated a defined number of times? Yes, you can. You can use the curly brackets { and } for that. Again, the brackets refer only to the last character in front of them, and inside the brackets you write how often this character has to be repeated.

Example 7

Search for:      ab{2}        Original:    abbcd fababcde abababg

Replace with:    X            Replaced:    Xcd fababcde abababg

 

Example 8

Search for:      (ab){2}      Original:    abbcd fababcde abababg

Replace with:    X            Replaced:    abbcd fXcde Xabg

 

In example 7, we are using the regular expression ab{2}. This means that we first want an "a", and after that the "b" should appear exactly twice. And exactly that happens in the example: the part "abb" is replaced, the parts "abab" and "ababab" are not.

How can we replace "abab"? Example 8 shows this. Here we have grouped "ab" with round brackets, so that the curly brackets apply to the whole group. Now the first "abb" does not match any more, but the "abab" in the middle does. At the end, "ab" repeats three times, but only the first two occurrences are replaced, because we are searching for an "ab" that is repeated exactly twice.
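Examples 7 and 8 can be compared side by side in a small sketch, again assuming Python's re module as a stand-in for the Text Converter. The variable names are chosen for this illustration only.

```python
import re

text = "abbcd fababcde abababg"

# {2} without a group: an "a" followed by exactly two "b" (example 7).
last_char_only = re.sub(r"ab{2}", "X", text)

# {2} applied to the group: "ab" exactly twice, i.e. "abab" (example 8).
whole_group = re.sub(r"(ab){2}", "X", text)

# The complete matched texts of example 8, in order of appearance.
found = [m.group(0) for m in re.finditer(r"(ab){2}", text)]

print(last_char_only)  # Xcd fababcde abababg
print(whole_group)     # abbcd fXcde Xabg
print(found)           # ['abab', 'abab']
```

The third "ab" in "abababg" is left over because a single "ab" is not enough for a further match of (ab){2}.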

Example 9

Search for:      a{2,4}       Original:    abc aabcaaad aaaajaaaaa

Replace with:    X            Replaced:    abc XbcXd XjXa

 

Example 10

Search for:      a{2,}        Original:    abc aabcaaad aaaajaaaaa

Replace with:    X            Replaced:    abc XbcXd XjX

 

Example 11

Search for:      a{,4}        Original:    abc aabcaaad aaaajaaaaa

Replace with:    X            Replaced:    Xbc XbcXd XjXX

We can also be a little more general. In example 9, we use a{2,4}. This means that an "a" has to appear at least twice and at most four times. A single "a" is not matched, and of a longer run at most four "a" are matched at once. "abc" does not match this search pattern, but "aa" and "aaa" in the middle of the string do. At the end, only the first four of the five "a" in "jaaaaa" are replaced, so that "jaaaaa" becomes "jXa".

You can also leave the first or last number empty. In example 10, the regular expression a{2,} means: the "a" must appear at least twice and may appear any number of times beyond that, because no upper limit is defined. Accordingly, the replacement looks similar to example 9, but here all five "a" at the end are matched and replaced with a single "X".

What happens if we leave the first number empty can be seen in example 11. The regular expression a{,4} means: the "a" can appear at most four times, but it can also appear less often. The example reads it as {1,4}, so all occurrences of one "a", two "aa", three "aaa" and four "aaaa" are replaced. Some programs, including the Text Converter, do not support this expression, because it is better and clearer to write {0,4} or {1,4} instead.
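The three range forms can be tried out with the following sketch, which assumes Python's re module. One point is specific to that assumption: Python treats a missing lower bound as 0, so a{,4} there behaves like a{0,4} and also matches the empty string. To get the "one to four" reading used in example 11, the sketch therefore writes a{1,4} explicitly.

```python
import re

text = "abc aabcaaad aaaajaaaaa"

two_to_four = re.sub(r"a{2,4}", "X", text)  # example 9
two_or_more = re.sub(r"a{2,}", "X", text)   # example 10
one_to_four = re.sub(r"a{1,4}", "X", text)  # example 11, read as {1,4}

print(two_to_four)  # abc XbcXd XjXa
print(two_or_more)  # abc XbcXd XjX
print(one_to_four)  # Xbc XbcXd XjXX

# In Python, an omitted lower bound means 0, so the empty string matches.
open_lower_bound_matches_empty = re.fullmatch(r"a{,4}", "") is not None
print(open_lower_bound_matches_empty)  # True
```

With a{1,4}, the single "a" that remains after the first four of the five "a" at the end is matched as well, which is why the last word becomes "XjXX". Writing the lower bound out avoids any doubt about how a particular program interprets {,4}.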

Counterparts of previously known meta characters

Perhaps you have noticed that the meta characters plus and asterisk, which we looked at before, can also be expressed with the curly brackets { and }. The plus + corresponds to the expression {1,} and the asterisk * is equivalent to the expression {0,}. So, in both cases the preceding character can appear any number of times, but with the plus at least once and with the asterisk also not at all.
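These equivalences can be checked empirically. The sketch below assumes Python's re module and compares where each pair of patterns matches in a few sample strings taken from the examples above; the helper same_matches is a name invented for this illustration.

```python
import re

SAMPLES = ["aba abcdef abccc", "ababab abcde ccabc", "ccc", ""]


def same_matches(pattern_a, pattern_b, text):
    """Return True if both patterns match at exactly the same positions."""
    spans_a = [m.span() for m in re.finditer(pattern_a, text)]
    spans_b = [m.span() for m in re.finditer(pattern_b, text)]
    return spans_a == spans_b


# "+" behaves like {1,} and "*" behaves like {0,}.
plus_equivalent = all(same_matches(r"abc+", r"abc{1,}", s) for s in SAMPLES)
star_equivalent = all(same_matches(r"abc*", r"abc{0,}", s) for s in SAMPLES)

print(plus_equivalent)  # True
print(star_equivalent)  # True
```

A handful of samples is of course no proof, only a quick plausibility check of the rule stated above.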

Summary

  • Characters that are not meta characters, like the letters a, b or c, can simply be used in regular expressions.
  • With round brackets, we can group a part of a regular expression. This way, you can for example apply meta characters like plus or asterisk to a complete group. Otherwise, these meta characters are only applied to the last character before them.
  • Curly brackets specify how often the preceding character or group should be repeated. {2} means exactly two times, {2,5} means at least two and at most five times (thus 2, 3, 4 or 5 times), {2,} means at least 2 times and {,5} means a maximum of 5 times.


© Stefan Trost - The usage of this tutorial, even in parts, is prohibited without prior written consent of Stefan Trost. But of course, you are welcome to link to this tutorial.

 
  
 
