Getting Started with Regular Expressions : Page 2

Regular expressions, also referred to as "regex" in the developer community, are extremely powerful pattern matching and substitution tools. This article introduces you to regular expressions, what they are, why you would want to use them, and finally, how you can begin putting them to work in Visual Studio .NET.


Character Matching
Character matching is the heart of working with regular expressions. It allows you to search for and match specific characters in a string.

Let's say you want to search a string for every occurrence of the string "takenote." The regular expression would be:

   takenote
It doesn't get any easier than that so I'm sure I haven't lost you so far. Let's take a look at some of the character matching metacharacters available to us. We'll start with the period (.).

The Period (.) Character
The period matches any single character, except the newline character. So the pattern:

   tak.note 
will match strings like takenote, takznote, tak1note, and so on. Seems simple enough so far.
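If you want to try these first two patterns yourself, here is a minimal sketch. It uses Python's re module purely for illustration (an assumption on my part, since the article targets .NET), and the sample strings are made up.

```python
import re

# A literal pattern matches wherever the same characters occur.
literal = re.compile(r"takenote")
print(literal.findall("takenote, then takenote again"))

# The period stands for any one character except a newline.
dot = re.compile(r"tak.note")
for candidate in ["takenote", "takznote", "tak1note", "taknote", "tak\nnote"]:
    # search() returns a match object, or None when nothing matches
    print(repr(candidate), dot.search(candidate) is not None)
```

The last two candidates should fail: taknote has no character for the period to consume, and the period will not match a newline.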

Square Brackets [ ]
You use brackets to match any one of the enclosed characters. For example:



   takenote[123]
This expression will only match takenote1, takenote2, and takenote3 and will not match a similar string like takenote0 because it does not have a 1, 2, or 3 after the "e" in note. Ranges are also supported within brackets.

   takenote[1-3]
This expression will only match takenote1, takenote2, and takenote3.

Commonly used ranges include [0-9], [a-z], and [A-Z]. You can also combine ranges like this [0-9a-zA-Z]. This range means that any single digit or letter is acceptable.

The [a-z] and [A-Z] ranges probably piqued your curiosity about case sensitivity. Yes, regular expressions are case sensitive. I'll discuss settings that instruct the regular expression engine to ignore case later in the article.

   [0-9]takeNote
This expression will match strings like 4takeNote, 7takeNote, and 9takeNote. The pattern only calls for a single digit before the word "take," so in similar strings like 420takeNote or 01takeNote it matches just the last digit and the word that follows (0takeNote and 1takeNote), not the whole string.

You can also specify a negative condition by placing a ^ as the first character inside the brackets. This will cause any single character to be matched EXCEPT those specified.

   [^0-5]takeNote
This expression will match 6takeNote and ztakeNote but will not match a similar string like 4takeNote because it starts with a 4, or 5takenote because the "n" in note is not capitalized.
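Here is a small sketch of the bracket patterns in action. It uses Python's re module as a stand-in (my assumption, not the article's .NET environment), and found is a hypothetical helper name of my own.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

print(found(r"takenote[1-3]", "takenote2"))   # digit is inside the range
print(found(r"takenote[1-3]", "takenote0"))   # digit is outside the range
print(found(r"[^0-5]takeNote", "6takeNote"))  # 6 is not in 0-5
print(found(r"[^0-5]takeNote", "4takeNote"))  # 4 is excluded
print(found(r"[^0-5]takeNote", "5takenote"))  # excluded digit and lowercase "n"

# A search looks anywhere in the string, so an unanchored pattern can
# match just part of a longer string.
print(re.search(r"[0-9]takeNote", "420takeNote").group())
```

Notice the last line: a search-anywhere function finds 0takeNote inside 420takeNote, which is worth keeping in mind until you get to position matching.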

The Vertical Bar (|)
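You use | to implement an either-or type matching pattern. Place the alternatives inside parentheses so the | applies only to them.

   (T|t)ake(N|n)ote
This expression will match the strings TakeNote, Takenote, takeNote, or takenote but will not match similar strings like wakenote because it starts with a "w," or takeote because it is missing the "n." Don't put the | inside square brackets: there it is treated as a literal character, so [T|t] would also match a "|". For single characters, [Tt] is the bracket equivalent of (T|t).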
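One subtlety is worth checking for yourself: inside square brackets the | is just another character in the set, while inside parentheses it separates alternatives. The sketch below compares the two forms using Python's re module (chosen only for illustration), and the test strings are made up.

```python
import re

grouped = re.compile(r"(T|t)ake(N|n)ote")    # alternation inside a group
bracketed = re.compile(r"[T|t]ake[N|n]ote")  # "|" is a literal member of the set

for text in ["TakeNote", "takenote", "wakenote", "takeote", "|ake|ote"]:
    print(text, grouped.search(text) is not None, bracketed.search(text) is not None)
```

Both forms agree on the first four strings. Only the bracketed form accepts |ake|ote, which is rarely what you want.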

Position Matching
Regular expressions support two metacharacters that you can use to force the string you want to search for to be at the beginning or the end of the searched string. The metacharacters involved with implementing this capability are the ^ and the $.

The caret (^)
The ^ matches the beginning of a string.

   ^take
This expression will match takenote and take a note but will not match I think I've been taken by the Nigerian email scam! Why not? Because the word "take" must be at the beginning of the string.

The Dollar Sign ($)
The $ matches the end of a string.

   [Tt]ake[Nn]ote$
This expression will match I work for TakeNote but will not match TakeNote is who I work for because the pattern calls for the word "takenote" to be the final characters in the string.

Developers often use the ^ and the $ together to define a pattern. For example, let's take a look at a fictitious inventory part number. Assume that in order for the part number to be valid it must be two letters followed by two digits.

   ^[A-Za-z][A-Za-z][0-9][0-9]$
This pattern, where each set of brackets represents a single character, matches AB12, BR54, and ZZ22, but does not match ABC1, S194, or ABCD. Even though the three non-matching examples are all four characters in length, they do not have letters in the first two positions and digits in the last two.

If you're thinking that the following would have worked as well:

   [A-Za-z][A-Za-z][0-9][0-9]
You are correct, sort of. Yes, AB12 and BR54 would still match, but so would ABCD1234 because the pattern is represented in the string. That's why you needed the ^ in the beginning and the $ on the end to define the beginning and ending structure of the pattern.
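To see the difference the anchors make, here is a quick sketch that runs both versions of the part number pattern against a few sample values. Python's re module is my choice for illustration, and the sample part numbers are invented.

```python
import re

anchored = re.compile(r"^[A-Za-z][A-Za-z][0-9][0-9]$")
unanchored = re.compile(r"[A-Za-z][A-Za-z][0-9][0-9]")

for part in ["AB12", "BR54", "ABC1", "S194", "ABCD1234"]:
    print(part, anchored.search(part) is not None, unanchored.search(part) is not None)

# The unanchored pattern finds a valid-looking part number buried in the middle.
print(unanchored.search("ABCD1234").group())
```

Only the anchored pattern rejects ABCD1234. The unanchored one finds CD12 inside it and reports a match.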

   ^[A-Za-z][A-Za-z][0-9][0-9]$
You now know that this pattern defines a 4-character string with two initial letters followed by two ending digits. What if the pattern called for four letters followed by six digits? All those brackets might get kind of confusing. Luckily there is regular expression notation for that type of character repetition.

Repetition Matching
Regular expressions provide a way to quantify how many occurrences of a previous expression will match a pattern. The metacharacters involved with implementing this capability are ?, +, *, ( ), and { }. This section also covers the \, which lets you match a metacharacter literally.

The Question Mark (?)
Now things start to get interesting. The ? matches 0 or 1 instances of the preceding character.

   ^takenotes?$
This expression will match the strings takenote and takenotes because the pattern calls for 0 or 1 instances of the letter "s." With the $ on the end, the pattern will not match a similar string like takenotess because it has too many occurrences of the letter "s."

The Plus Sign (+)
The + metacharacter, similar to the ? metacharacter, matches 1 or more instances of the preceding character.

   ^takenote+
This expression will match strings like takenote, takenotee, and takenoteeeeee but it will not match similar strings like takenot because there has to be at least one "e" after the "t" in note.

One thing to watch out for is making sure you don't subconsciously read the + as a concatenation operator. It's an easy mistake to make.
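Here is a short sketch of ? and + that you can run to check your understanding. It uses Python's re module for illustration (an assumption, since the article is about .NET), and it adds a $ to the ? pattern so that extra trailing characters are rejected.

```python
import re

optional_s = re.compile(r"^takenotes?$")  # "$" added so trailing text is rejected
one_or_more_e = re.compile(r"^takenote+")

print(optional_s.search("takenote") is not None)     # zero "s" characters
print(optional_s.search("takenotes") is not None)    # one "s"
print(optional_s.search("takenotess") is not None)   # two "s" characters

# Without "$" the prefix "takenotes" is enough for a match.
print(re.search(r"^takenotes?", "takenotess") is not None)

print(one_or_more_e.search("takenoteeeeee") is not None)
print(one_or_more_e.search("takenot") is not None)   # no "e" at all
```

The fourth line is the one to watch: without the $, takenotess still matches because the pattern is satisfied by the start of the string.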

The Backslash (\)
What if you need to match a string that contains one of the metacharacters, "?" for example? Since the ? is a metacharacter, you need to precede it with a \ to indicate that it should be treated as a literal character rather than as a metacharacter.

   ^takenote\?
This expression will match takenote? but will not match the similar string takenote because the "?" is missing at the end of the string.
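The sketch below shows the escaped question mark at work, again using Python's re module as an illustrative stand-in for .NET. It also shows re.escape, a Python helper that adds the backslashes for you when the text to match comes from somewhere else.

```python
import re

escaped = re.compile(r"^takenote\?")
print(escaped.search("takenote?") is not None)  # literal "?" is present
print(escaped.search("takenote") is not None)   # literal "?" is missing

# re.escape backslash-escapes any metacharacters in a plain string.
print(re.escape("takenote?"))
```

The raw string prefix (r"...") keeps Python from interpreting the backslash before the regular expression engine sees it.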

Let's try combining what you've learned so far. Take a look at this regular expression:

   [0-9][Tt]ak.[Nn]otes?$
This expression will match strings like 0Takenote, 25takzNote, and 9TakkNotes but will not match this similar string, 4TakeNotes9, because it ends with a "9" and the $ requires the match to run to the end of the string. You still with me? Did the 25takzNote throw you off with its leading two digits? Because the pattern does not use a ^ to specify that the string must begin with a single digit, 25takzNote matches the pattern.

The Asterisk (*)
The * matches 0 or more instances of the preceding character. It works just like the ? except that the ? allows at most 1 instance of the preceding character, whereas the * allows any number.

   ^takenotes*$
This expression will match takenote because it has zero "s" characters, takenotes because it has one "s," and takenotessss because it has more than one "s."

   ^take*note$
This expression will match taknote because it has no middle "e," takenote because it has 1 middle "e," and takeeeenote and takeeeeeeenote because they have more than 1 middle "e."

   ^takenote[0-9]*$
This expression will match takenote, takenote6, takenote49, takenote100, and so on because the pattern describes a character string with 0 or more digits on the end.
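If you'd like to confirm how the * behaves, here is a minimal sketch using Python's re module (my choice for illustration, not something the article prescribes). The test strings are made up.

```python
import re

middle_e = re.compile(r"^take*note$")
trailing_digits = re.compile(r"^takenote[0-9]*$")

for text in ["taknote", "takenote", "takeeeenote", "tanote"]:
    print(text, middle_e.search(text) is not None)

for text in ["takenote", "takenote6", "takenote100", "takenote4x"]:
    print(text, trailing_digits.search(text) is not None)
```

The * applies to whatever immediately precedes it: the single letter "e" in the first pattern, and the whole [0-9] class in the second.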

Grouping and Parentheses ( )
You can also group characters with ( ).

   ^take(note)*$
This expression will match take, takenote, takenotenote, and takenotenotenote because the pattern calls for 0 or more of "note" on the end of the string.
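The sketch below contrasts the grouped pattern with an ungrouped one so you can see what the parentheses change. It uses Python's re module for illustration, and the ungrouped pattern is my own addition for comparison.

```python
import re

grouped = re.compile(r"^take(note)*$")  # "*" repeats the whole group "note"
ungrouped = re.compile(r"^takenote*$")  # "*" repeats only the final "e"

for text in ["take", "takenote", "takenotenote", "takenoteee", "takenot"]:
    print(text, grouped.search(text) is not None, ungrouped.search(text) is not None)
```

Without the parentheses, the * attaches only to the last "e," so takenotenote stops matching and takenoteee starts matching.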

Braces { }
The braces allow you to match a specified number of instances based on the values in the braces. So {n} will match exactly n instances of the preceding character.

   ^takenote{3}
This expression will match takenoteee but does not match takenote3, because the braces are a repetition count, not literal text.

{n,} will match at least n instances of the preceding character.

   ^takenote{3,}
This expression will match takenoteee, takenoteeee, and takenoteeeeee but will not match takenotee because it only has two trailing "e's".

{n,m} will match at least n and at most m instances of the preceding character.

   ^takenote{2,4}$
This expression will match takenotee, takenoteee, and takenoteeee because they all have at least two but no more than four trailing "e's". It will not match takenote or takenoteeeee because they don't fit the pattern of at least two "e's" but no more than four. The $ is what rules out takenoteeeee; without it, the first four "e's" would be enough for a match.
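Here is a sketch that runs all three brace forms against strings with one to five trailing "e's". It uses Python's re module for illustration, and I've added a $ to each pattern so the upper limits are actually enforced.

```python
import re

exactly_three = re.compile(r"^takenote{3}$")
at_least_three = re.compile(r"^takenote{3,}$")
two_to_four = re.compile(r"^takenote{2,4}$")

for e_count in range(1, 6):
    text = "takenot" + "e" * e_count
    print(
        text,
        exactly_three.search(text) is not None,
        at_least_three.search(text) is not None,
        two_to_four.search(text) is not None,
    )

# Without "$", five trailing e's still match: the first four satisfy {2,4}.
print(re.search(r"^takenote{2,4}", "takenoteeeee").group())
```

The final line shows why the $ matters. An unanchored {2,4} happily matches the first four "e's" and ignores the fifth.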

Special Characters
A number of special characters correspond to the non-text characters embedded in a typical character string. I will introduce you to the three most common.

The \s metacharacter matches a single white space character, including space, tab, form feed, and line feed. For plain ASCII text it is the same thing as specifying [ \f\n\r\t\v] (note the leading space inside the brackets).

   ^take\snote$
This expression will match take note but not takenote because the pattern calls for a space between "take" and "note."

The \w metacharacter matches any alphanumeric character, including the underscore. You can use this in place of [A-Za-z0-9_].

   ^take\wote$
This expression will match takenote, takevote, take8ote, and so on because the pattern states that the fifth position can be any alphanumeric character.

The \d metacharacter matches a digit character. You can use this in place of [0-9].

   ^\dtakenote$
This expression will match 3takenote, 7takenote, 9takenote and so on.
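To round things out, here is a sketch of the three special characters using Python's re module (an illustrative assumption, not the article's .NET setting). One caveat: in Python, \w and \d are Unicode-aware by default, so they accept more than the ASCII ranges shown above unless you pass the re.ASCII flag.

```python
import re

print(re.search(r"^take\snote$", "take note") is not None)   # space
print(re.search(r"^take\snote$", "take\tnote") is not None)  # tab also counts
print(re.search(r"^take\snote$", "takenote") is not None)    # no white space

for text in ["takenote", "takevote", "take8ote", "take_ote", "take-ote"]:
    print(text, re.search(r"^take\wote$", text) is not None)

for text in ["3takenote", "9takenote", "xtakenote"]:
    print(text, re.search(r"^\dtakenote$", text) is not None)

# re.ASCII limits \d to 0-9 (and \w to A-Za-z0-9_).
ascii_digit = re.compile(r"^\dtakenote$", re.ASCII)
print(ascii_digit.search("3takenote") is not None)
```

The hyphen in take-ote is the one failure in the \w loop, since a hyphen is neither alphanumeric nor an underscore.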

That wraps up the section on regular expression pattern development. Let's take this new knowledge and apply it to a number of common patterns.


