Getting Started with Regular Expressions : Page 3

Regular expressions, also referred to as "regex" in the developer community, are extremely powerful pattern matching and substitution tools. This article introduces you to regular expressions, what they are, why you would want to use them, and finally, how you can begin putting them to work in Visual Studio .NET.



An Analysis of Some Common Patterns
With the fundamentals of pattern building under your belt, consider some of the more popular regular expressions in use today.

US Zip Code (5-digit)

   \d{5}
This will match exactly five digits.

US Zip Code (5- or 9-digit)

   \d{5}(-\d{4})?
As with the above pattern, \d{5} will match exactly five digits. The key to this pattern is (-\d{4})?. Working from the inside out, you can see that it requires four digits preceded by a hyphen. That pattern is then grouped, and a ? quantifier is applied to the group, which says that zero or one occurrences of the hyphen-plus-four-digits suffix are allowed. With this pattern, 27624 and 27624-1234 are both valid. Slick, huh?
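As a minimal sketch of the optional group in action, the following uses Python's re module (an assumption of this example; the article itself targets Visual Studio .NET). The helper name is_zip is hypothetical, and fullmatch is used because the pattern in the text carries no ^ or $ anchors of its own.

```python
import re

# Pattern copied from the text; the optional group adds the ZIP+4 suffix.
# re.ASCII limits \d to 0-9 (Python's \d otherwise accepts other Unicode digits).
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?", re.ASCII)


def is_zip(text: str) -> bool:
    """Return True only when the whole string is a 5- or 9-digit ZIP code."""
    # fullmatch requires the pattern to cover the entire string.
    return ZIP_PATTERN.fullmatch(text) is not None


for candidate in ["27624", "27624-1234", "2762", "27624-12"]:
    print(candidate, is_zip(candidate))
```

Without whole-string matching, the same pattern would also find a five-digit run inside a longer string such as 276245.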

U.S. Phone Number (999) 999-9999



   ((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}
This is one of the patterns I used in the opening paragraphs. It looked daunting at the time. It may still look a bit confusing, but at least you should recognize all of the metacharacters I used to construct it. Let's break it down into more manageable parts.

If you start at the end of the pattern you will find \d{3}-\d{4}. This pattern will match exactly three digits followed by a hyphen and then exactly four digits. That part is responsible for matching the 999-9999 portion of the phone number. Now let's focus on the ((\(\d{3}\) ?)|(\d{3}-))? subpattern. My eye is drawn to the ? on the end of the subpattern. It will match 0 or 1 instances of the pattern grouped by the outer parentheses. This means that a phone number will be a valid match with or without an area code.

On the right side of the | (or) metacharacter is the (\d{3}-) pattern. It will match exactly three digits followed by a hyphen. On the left side of the | is the (\(\d{3}\) ?) pattern. It uses the \ metacharacter to specify that a left parenthesis \( and a right parenthesis \) are part of the pattern and should not be considered grouping metacharacters. Between the \( and the \) is \d{3}, which will match exactly three digits. The space followed by ? after the \) allows an optional space between the area code and the rest of the number.

So, (555) 123-4567, 555-123-4567, and 123-4567 all match and will be considered valid U.S. phone numbers. See, that wasn't so bad after all.
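The three accepted forms can be checked with a short sketch. It assumes Python's re module rather than the .NET classes the article targets, and the helper name is_us_phone is hypothetical.

```python
import re

# Pattern copied from the text: an optional area code in either
# "(999) " or "999-" form, followed by 999-9999.
PHONE_PATTERN = re.compile(r"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}", re.ASCII)


def is_us_phone(text: str) -> bool:
    """Return True when the whole string has one of the accepted shapes."""
    return PHONE_PATTERN.fullmatch(text) is not None


for candidate in ["(555) 123-4567", "555-123-4567", "123-4567", "555 123-4567"]:
    print(candidate, is_us_phone(candidate))
```

The last candidate fails because an area code without parentheses must be followed by a hyphen, not a space.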

U.S. Social Security Number

   \d{3}-\d{2}-\d{4}
After that U.S. phone number this one should be easy. It specifies a pattern of exactly three digits followed by a hyphen then exactly two digits followed by a hyphen and exactly four digits. Strings that match would include 123-45-6789, 000-00-0000, and 555-55-5555. Notice, these may not be valid U.S. Social Security numbers but they do match the pattern.
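Because this pattern has no anchors, it can also be used to pull matching substrings out of longer text. The sketch below assumes Python's re module, and the sample text is invented for illustration.

```python
import re

# Pattern copied from the text: 3 digits, hyphen, 2 digits, hyphen, 4 digits.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}", re.ASCII)

# Hypothetical sample text; the last number has the wrong digit grouping.
sample = "IDs on file: 123-45-6789 and 000-00-0000, ref 12-345-6789."

# findall returns every non-overlapping substring that fits the shape.
found = SSN_PATTERN.findall(sample)
print(found)
```

As the text notes, a hit only means the digits are grouped correctly, not that the number was ever issued.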

Date

   ^\d{1,2}\/\d{1,2}\/\d{4}$
This pattern will match a date in the 99/99/9999 format. Starting from the left, the ^\d{1,2} subpattern specifies that a number at least one digit in length but not longer than two digits must be at the beginning of the string. Next comes a \/, which makes the / act as a literal character. Next comes \d{1,2} again, followed by another \/. Lastly, \d{4}$ specifies that exactly four digits must be at the end of the string.
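The pattern checks the shape of the string only, so 99/99/9999 passes. A minimal sketch of pairing it with a calendar check follows, assuming Python's re and datetime modules and a month/day/year order (the text does not say which order is intended). Both helper names are hypothetical.

```python
import re
from datetime import datetime

# Pattern copied from the text.
DATE_PATTERN = re.compile(r"^\d{1,2}\/\d{1,2}\/\d{4}$", re.ASCII)


def has_date_shape(text: str) -> bool:
    """True when the string looks like 9/9/9999 or 99/99/9999."""
    return DATE_PATTERN.fullmatch(text) is not None


def is_real_date(text: str) -> bool:
    """True when the string has the right shape and names a real calendar day.

    Assumes month/day/year order, which is this sketch's choice.
    """
    if not has_date_shape(text):
        return False
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        return False
    return True


for candidate in ["12/25/2024", "2/29/2023", "99/99/9999", "1/5/24"]:
    print(candidate, has_date_shape(candidate), is_real_date(candidate))
```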

Offensive Words

   (\bBadWord1\b)|(\bBadWord2\b)|...|(\bBadWordN\b)
This pattern will match any of the words you specify as offensive, which makes it easy to keep unwanted words from making their way into your data. The \b metacharacter matches a word boundary: the zero-width position between a word character and a non-word character (such as a space or punctuation mark), or the start or end of the string.

   (\bdrats\b)|(\bshoot\b)|(\bdarn\b)
This pattern will match any text stream that contains the word drats or shoot or darn.
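One way to assemble such a pattern from a word list is sketched below, assuming Python's re module. The helper names build_word_filter and contains_blocked_word are hypothetical, and the match is case-sensitive, as in the pattern above.

```python
import re


def build_word_filter(words):
    """Join the words into (\\bword1\\b)|(\\bword2\\b)|... as in the text."""
    # re.escape guards against words that contain regex metacharacters.
    alternatives = "|".join(rf"(\b{re.escape(word)}\b)" for word in words)
    return re.compile(alternatives)


word_filter = build_word_filter(["drats", "shoot", "darn"])


def contains_blocked_word(text: str) -> bool:
    """True when any listed word appears as a whole word in the text."""
    return word_filter.search(text) is not None


for line in ["Oh darn, missed it", "darning socks", "Darn it"]:
    print(repr(line), contains_blocked_word(line))
```

The word boundaries keep "darning" from triggering the filter, and "Darn" passes only because the pattern does not ignore case.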

Table 2 contains a few more commonly used regular expression patterns.

Table 2: Common patterns

Description: E-mail address
Pattern:     \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*

Description: Internet URL
Pattern:     http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?

Description: Real number (including +/-)
Pattern:     ^[-+]?\d+(\.\d+)?$

Description: Password (first character is a letter, 4 to 15 characters, nothing but numbers, letters and underscore)
Pattern:     ^[a-zA-Z]\w{3,14}$
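Three of the Table 2 patterns can be exercised with a short sketch, assuming Python's re module. The URL pattern is left out because Python rejects its [\w- ./?%&=] character class as written. The dictionary and helper names are hypothetical.

```python
import re

# Patterns copied from Table 2. re.ASCII keeps \w and \d to ASCII characters.
PATTERNS = {
    "email": re.compile(r"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*", re.ASCII),
    "real": re.compile(r"^[-+]?\d+(\.\d+)?$", re.ASCII),
    "password": re.compile(r"^[a-zA-Z]\w{3,14}$", re.ASCII),
}


def matches(kind: str, text: str) -> bool:
    """True when the whole string fits the named Table 2 pattern."""
    return PATTERNS[kind].fullmatch(text) is not None


print(matches("email", "jane.doe+news@example.co.uk"))
print(matches("real", "-3.14"), matches("real", "3."))
print(matches("password", "a_b1"), matches("password", "abc"))
```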


Basic Credit Card

   ^((\d{4}[- ]){3}\d{4}|\d{16})$
This pattern will match a credit card number in the format of 9999-9999-9999-9999, 9999 9999 9999 9999, or 9999999999999999. The outer parentheses group the two alternatives so that the ^ and $ anchors apply to both of them; without that group, the | would tie ^ only to the left alternative and $ only to the right one. Let's break the rest of the pattern down from right to left. On the right side of the | (or) we see \d{16}, which specifies sixteen digits. That's pretty straightforward. The left side of the | (or) looks a bit more complicated. I'll start with (\d{4}[- ]). This specifies a grouped string of four digits followed by a hyphen or a space. Next is {3}, which specifies that this group must occur exactly three times. Next is \d{4}, which specifies the final four digits.

You may have noticed that this pattern doesn't validate the number at all or categorize it by type of card. 4999-9999-9999-9999 is just as valid as 1999-9999-9999-9999 even though no credit card starts with the number 1.
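The effect of grouping the alternation can be seen in a small comparison, sketched here with Python's re module (an assumption; the article targets .NET). UNGROUPED leaves the | at the top level, while GROUPED wraps both branches so that ^ and $ apply to each. Both names are hypothetical.

```python
import re

# Alternation left ungrouped: ^ binds only to the left branch,
# $ only to the right branch.
UNGROUPED = re.compile(r"^(\d{4}[- ]){3}\d{4}|\d{16}$", re.ASCII)

# Both branches wrapped in one group so the anchors apply to each.
GROUPED = re.compile(r"^((\d{4}[- ]){3}\d{4}|\d{16})$", re.ASCII)

samples = [
    "1234-5678-9012-3456",
    "1234 5678 9012 3456",
    "1234567890123456",
    "1234-5678-9012-3456 and more",  # trailing text
    "ref 1234567890123456",          # leading text
]

for sample in samples:
    ungrouped_hit = UNGROUPED.search(sample) is not None
    grouped_hit = GROUPED.search(sample) is not None
    print(f"{sample!r}: ungrouped={ungrouped_hit} grouped={grouped_hit}")
```

The last two samples are accepted only by the ungrouped form, which is why the grouped form is the safer choice for validation.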

Advanced Credit Card

   ^(((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[47]\d{13})$
This is another credit card pattern, but this time we're specifying that the card number must start with a 3, 4, 5, or 6. This pattern matches all the major credit cards, including Visa, which has a length of 16 and a prefix of 4; MasterCard, which has a length of 16 and a prefix of 51-55; Discover, which has a length of 16 and a prefix of 6011; and finally American Express, which has a length of 15 and a prefix of 34 or 37. All of the 16-digit formats (Visa, MasterCard, and Discover) accept an optional hyphen between each group of four digits. The outermost parentheses group the 16-digit and American Express alternatives so that the ^ and $ anchors apply to both.

Let's start with ((4\d{3})|(5[1-5]\d{2})|(6011)). It's not as bad as it looks. The first thing to notice is that it is one big group with two OR conditions inside. This group is going to be the definition for the first four digits of the card. The string must start with a group comprised of a "4" followed by exactly three digits (4\d{3}), OR a group comprised of a "5" followed by a 1, 2, 3, 4, or 5, followed by exactly two digits (5[1-5]\d{2}), OR a group comprised of a 6011 (6011).

Next is -?, which means that there can be 0 or 1 hyphens following the initial set of four digits.

Next is \d{4}-?, which refers to the second group of four digits. It means that exactly four digits followed by 0 or 1 hyphens are acceptable.

Next is another \d{4}-?, which refers to the third group of four digits. It, too, means that exactly four digits followed by 0 or 1 hyphens are acceptable.

Next is \d{4}, which refers to the fourth group of four digits. It means that exactly four digits are acceptable.

Next is the | (or), which signals the end of the 16-digit pattern and the beginning of the American Express pattern. This pattern, 3[47]\d{13}, means that the number must start with "34" or "37" followed by 13 digits. The character class is written [47] with no comma, because a comma inside the brackets would itself be accepted as a character. With this pattern, spaces and hyphens are not acceptable in American Express card numbers.

You can find out more about the Regex class in the Visual Studio .NET help.
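The prefix rules described above can also be split into one pattern per card type, which makes it possible to report which brand a number belongs to. This is a sketch under stated assumptions: it uses Python's re module, writes the American Express class as [47] (a comma inside a character class is a literal comma), and the names SIXTEEN_TAIL, CARD_PATTERNS, and card_type are hypothetical. It checks prefix and length only, not whether the number is valid.

```python
import re

# Shared tail for the 16-digit cards: three more groups of four digits,
# each optionally preceded by a hyphen.
SIXTEEN_TAIL = r"-?\d{4}-?\d{4}-?\d{4}"

CARD_PATTERNS = {
    "Visa": re.compile(r"4\d{3}" + SIXTEEN_TAIL, re.ASCII),
    "MasterCard": re.compile(r"5[1-5]\d{2}" + SIXTEEN_TAIL, re.ASCII),
    "Discover": re.compile(r"6011" + SIXTEEN_TAIL, re.ASCII),
    "American Express": re.compile(r"3[47]\d{13}", re.ASCII),
}


def card_type(number: str):
    """Return the card type whose pattern covers the whole string, else None."""
    for name, pattern in CARD_PATTERNS.items():
        # fullmatch plays the role of the ^ and $ anchors for every branch.
        if pattern.fullmatch(number):
            return name
    return None


for number in ["4111-1111-1111-1111", "5500000000000004", "1999-9999-9999-9999"]:
    print(number, card_type(number))
```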


