Saturday, August 11, 2007

Regular Expressions - Part3

In the previous post, Regular Expressions - Part2, we discussed regular expression engines and the concept of backtracking. In this post we will discuss the basic constructs of regular expressions and how to use them to match simple words and lines.

Meta Characters

Meta characters are a subset of the ASCII character set that take part in building a regular expression, e.g. +, $, ^ etc. These characters instruct the regex engine to perform specific operations. If we want the regex engine to treat them as normal characters instead of meta characters, we need to escape them with a backslash '\'. For example, the regex "firstname\.lastname" instructs the engine to ignore the special meaning of '.' and to treat it as a literal '.' character.
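The effect of escaping can be checked with a short sketch. It assumes Python's re module (the post itself does not name a language for its examples), and the second sample string is made up for illustration.

```python
import re

# Hypothetical sample inputs; only the first one appears in the post.
samples = ["firstname.lastname", "firstnameXlastname"]

# Unescaped: '.' is a meta character and matches any single character.
unescaped = [s for s in samples if re.fullmatch(r"firstname.lastname", s)]

# Escaped: '\.' matches only a literal dot.
escaped = [s for s in samples if re.fullmatch(r"firstname\.lastname", s)]

print(unescaped)  # ['firstname.lastname', 'firstnameXlastname']
print(escaped)    # ['firstname.lastname']
```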

The following list gives an overview of the most commonly used meta characters.

. Dot - any character in a line

In a normal search we use ? to specify a single wildcard character and * to specify a sequence of characters up to the next character, e.g. to search for a file we use IAdb*.dll. But in regex, * is used for repetition. Also note the phrase "in a line". This means that the behavior of ‘.’ can be altered with a mode setting: in SingleLine mode ‘.’ also matches a newline (\n), otherwise it stops at the end of the line. MultiLine mode does not affect ‘.’; it changes the behavior of ^ and $. e.g.:

Search String:
using System;
using System.IO;
using System.Text;
RegEx: “System.*”
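Explanation: Without SingleLine mode this matches every reference to System and its descendants up to the end of each line.
Matches in non-SingleLine mode:
a)System;
b)System.IO;
c)System.Text;
Matches in SingleLine mode (a single match that spans the line breaks, from the first System to the end of the text):
a)System;
  using System.IO;
  using System.Text;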

Note that the semicolon is also matched in each line.
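The two behaviors of the dot can be reproduced with a small sketch. It assumes Python's re module, where the re.DOTALL flag plays the role that SingleLine mode plays in the description above; the flag name is Python's, not the post's.

```python
import re

text = "using System;\nusing System.IO;\nusing System.Text;"

# Default: '.' stops at a newline, so there is one match per line.
per_line = re.findall(r"System.*", text)

# re.DOTALL (assumed equivalent of SingleLine mode): '.' also matches '\n',
# so the first match runs to the end of the text.
whole = re.findall(r"System.*", text, flags=re.DOTALL)

print(per_line)  # ['System;', 'System.IO;', 'System.Text;']
print(whole)     # ['System;\nusing System.IO;\nusing System.Text;']
```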

\ - backslash
As already mentioned, the backslash instructs the regex engine to treat the meta character that follows it as a normal character. When used with a number, like \1 or \2, it specifies a back reference number. Back references will be covered separately.

[ ] - opening and closing square bracket

The set of characters to be matched is specified within these brackets; the engine matches any one character from that set. Examples are mentioned below.

( ) - round brackets
These are used to hold sub-expressions and to capture text for back references. Back references will be covered later. Sub-expressions are similar to sub-expressions in a programming language.

{ } - curly brackets
These are used with iterators. We saw this in Part 2 for matching a five-digit length. The format is specifier{x,y}, where specifier is whatever needs to be repeated. Here x is mandatory and can take any value from 0 up to the value of y. y is optional and, when given, must be an integer.
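A minimal sketch of the {x,y} forms, assuming Python's re module. The input strings are hypothetical and chosen only to show which lengths are accepted.

```python
import re

numbers = "7 42 123 4567 89012"  # hypothetical sample text

# {5}: exactly five repetitions of the preceding \d.
exactly_five = re.fullmatch(r"\d{5}", "12345") is not None
four_digits = re.fullmatch(r"\d{5}", "1234") is not None

# {2,4}: at least two and at most four digits between word boundaries.
ranged = re.findall(r"\b\d{2,4}\b", numbers)

# {3,}: y omitted, so three or more digits.
open_ended = re.findall(r"\b\d{3,}\b", numbers)

print(exactly_five, four_digits)  # True False
print(ranged)                     # ['42', '123', '4567']
print(open_ended)                 # ['123', '4567', '89012']
```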

* Iterator that repeats the preceding item zero or more times.

? Iterator that repeats the preceding item zero or one time only.

+ Iterator that repeats the preceding item one or more times.
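The difference between the three iterators is easiest to see side by side. The sketch below assumes Python's re module, and the word list is invented for illustration.

```python
import re

words = ["clr", "color", "colour", "colouur"]  # hypothetical inputs

# 'u?' allows zero or one 'u'.
optional = [w for w in words if re.fullmatch(r"colou?r", w)]

# 'u*' allows zero or more 'u'.
star = [w for w in words if re.fullmatch(r"colou*r", w)]

# 'u+' requires at least one 'u'.
plus = [w for w in words if re.fullmatch(r"colou+r", w)]

print(optional)  # ['color', 'colour']
print(star)      # ['color', 'colour', 'colouur']
print(plus)      # ['colour', 'colouur']
```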

\w alphanumeric character including underscore

\W non-alphanumeric

\d numeric character

\D non-numeric

\s any white space; includes space, tab, and newline


\S non white space
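The shorthand classes above can be tried on one small string. This is a sketch assuming Python's re module; the sample text is made up.

```python
import re

text = "id_7 = 42;\tok\n"  # hypothetical sample text

word_runs = re.findall(r"\w+", text)   # letters, digits and underscore
non_word = re.findall(r"\W", text)     # everything \w does not match
digits = re.findall(r"\d+", text)      # runs of digits
spaces = re.findall(r"\s", text)       # space, tab, newline
non_space = re.findall(r"\S+", text)   # runs without white space

print(word_runs)  # ['id_7', '42', 'ok']
print(non_word)   # [' ', '=', ' ', ';', '\t', '\n']
print(digits)     # ['7', '42']
print(spaces)     # [' ', ' ', '\t', '\n']
print(non_space)  # ['id_7', '=', '42;', 'ok']
```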

| - pipe character
This is used to match alternatives. We used this in an earlier part to test the regex engine. Its syntax is matcha|matchb, which matches either matcha or matchb.
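A short sketch of alternation, assuming Python's re module. The sample words are hypothetical; the second pattern shows round brackets limiting how far the | reaches.

```python
import re

text = "cat dog bird catalog"  # hypothetical sample text

# Either alternative may match at any position in the text.
either = re.findall(r"cat|dog", text)

# Round brackets limit the scope of '|': 'gr', then 'a' or 'e', then 'y'.
candidates = ["gray", "grey", "groy"]
grouped = [w for w in candidates if re.fullmatch(r"gr(a|e)y", w)]

print(either)   # ['cat', 'dog', 'cat']
print(grouped)  # ['gray', 'grey']
```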

\b - word boundary
We used this in Part2.

\B - any non-word boundary
example:
Search String:
ITSHUFFormAbc.htm
ITSCorporateFormBcd.htm
ITSSelfEmployedFormCde.htm

regex: ^[A-Z]{3}\B[a-zA-Z0-9]+\B

Match (in MultiLine mode, so that ^ matches at the start of each line):
ITSHUFFormAb
ITSCorporateFormBc
ITSSelfEmployedFormCd
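The \B example above can be reproduced with the following sketch, which assumes Python's re module and its re.MULTILINE flag as the equivalent of MultiLine mode. The trailing \B is what drops the last letter: the position between the final letter and the '.' is a word boundary, so the engine backtracks one character to a position that is not.

```python
import re

text = ("ITSHUFFormAbc.htm\n"
        "ITSCorporateFormBcd.htm\n"
        "ITSSelfEmployedFormCde.htm")

# re.MULTILINE lets '^' match at the start of every line.
matches = re.findall(r"^[A-Z]{3}\B[a-zA-Z0-9]+\B", text, flags=re.MULTILINE)

print(matches)
# ['ITSHUFFormAb', 'ITSCorporateFormBc', 'ITSSelfEmployedFormCd']
```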


^ - beginning of a line

It matches at the start of a line. By default it matches only at the start of the text; in MultiLine mode it matches at the beginning of each new line. e.g.

Search String:
thisofLine1 of thisstring100
thisofLine2 of thisstring200
thisofLine3 of thisstring300
RegEx: ^this[a-zA-Z0-9]+
This matches the word starting with this at the beginning of each line.
Matches in MultiLine mode:
a) thisofLine1
b) thisofLine2
c) thisofLine3
Matches without MultiLine mode:
thisofLine1

If you take out the initial caret (^) in the above regex, it will match both instances of this in each line.

^ takes on the meaning of negation when used as the first character within square brackets []. We will cover this in character classes.
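The caret example above can be checked with a short sketch. It assumes Python's re module, where re.MULTILINE corresponds to MultiLine mode and the default (no flag) anchors ^ only at the start of the text.

```python
import re

text = ("thisofLine1 of thisstring100\n"
        "thisofLine2 of thisstring200\n"
        "thisofLine3 of thisstring300")

pattern = r"^this[a-zA-Z0-9]+"

# MultiLine mode: '^' matches at the start of every line.
multi = re.findall(pattern, text, flags=re.MULTILINE)

# Default mode: '^' matches only at the start of the whole text.
default = re.findall(pattern, text)

# Without the caret, both words starting with 'this' match on each line.
no_caret = re.findall(r"this[a-zA-Z0-9]+", text)

print(multi)     # ['thisofLine1', 'thisofLine2', 'thisofLine3']
print(default)   # ['thisofLine1']
print(no_caret)  # six matches, two per line
```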

$ - end of line
It matches at the end of a line. By default it matches only at the end of the text; in MultiLine mode it matches at the end of each line. e.g.
Search String: Let us use the same string as above
thisofLine1 of thisstring100
thisofLine2 of thisstring200
thisofLine3 of thisstring300
RegEx: this[a-zA-Z0-9]+.$
Note the '.' before $.
Matches in MultiLine mode:
thisstring100
thisstring200
thisstring300
Matches without MultiLine mode:
thisstring300
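The same check for the dollar sign, again as a sketch assuming Python's re module and \n line endings. With these assumptions the '.' before $ simply consumes the last character of each line.

```python
import re

text = ("thisofLine1 of thisstring100\n"
        "thisofLine2 of thisstring200\n"
        "thisofLine3 of thisstring300")

pattern = r"this[a-zA-Z0-9]+.$"

# MultiLine mode: '$' matches at the end of every line.
multi = re.findall(pattern, text, flags=re.MULTILINE)

# Default mode: '$' matches only at the end of the whole text.
default = re.findall(pattern, text)

print(multi)    # ['thisstring100', 'thisstring200', 'thisstring300']
print(default)  # ['thisstring300']
```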

\A - start of text
Start of text matches only at the start of the string, irrespective of the mode setting.

\Z - End of text
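End of text matches only at the end of the string, irrespective of the mode setting. In .NET, \Z also matches just before a newline that ends the text; \z matches only at the absolute end.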
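The contrast between the line anchors and the text anchors can be sketched as follows, assuming Python's re module. Note that Python's \Z matches only at the very end of the string; the sample text is hypothetical and has no trailing newline, so that difference between engines does not come into play.

```python
import re

text = "line1 this\nline2 this"  # hypothetical sample text

# '^' and '$' follow MultiLine mode ...
caret = re.findall(r"^line\d", text, flags=re.MULTILINE)
dollar = re.findall(r"this$", text, flags=re.MULTILINE)

# ... while '\A' and '\Z' ignore it and anchor to the whole text.
start = re.findall(r"\Aline\d", text, flags=re.MULTILINE)
end = re.findall(r"this\Z", text, flags=re.MULTILINE)

print(caret)   # ['line1', 'line2']
print(dollar)  # ['this', 'this']
print(start)   # ['line1']
print(end)     # ['this']
```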
