Thursday, August 09, 2007

Regular Expressions - Part2

In my last post Regular Expressions - Part1, I gave some introduction about regular expressions and presented code to build a tool that can be used to test basic regex constructs.

In this part I will discuss about how regular expression engines work. Before going there I just want to make a point about non-printable characters. Just have a look at ASCII chart once. Look for control characters, printable characters, non- printable characters etc..

Brief on Regex Engines

Regex engine is a library that can process regular expressions for search and replace operations. These engines operate on each character and thus composed of complex algorithms.

Editors supporting regex provide front end for this library. You can use this library with programming languages. As such there is a native support for regular expressions in some languages like Pearl, Ruby etc.. Microsoft got regex support from vbscript days. vbscript supports regular expressions with RegExp object. And look for class library reference of System.Text.RegularExpressions namespace for regex support in .Net. We used this in last example.

As these engines operate on each character and try to match with patterns; performance plays a major role. There are two kinds of engines to chose from. One is called Text Directed Engine while other one is Regex Directed. I will describe them in brief.
Of course no programming language offers a choice to select a particular regex engine. Its matter of what it supports.

In case of editors there is a simple way to test what kind of engine it is. Just apply regular expression "regex | regex not" on string regex not. If search result is regex then it is regex engine otherwise if the search result is regex not then it is text directed engine. There are couple of points to note here.

  • Expression you applied tries to match string "regex" or "regex not". '|' is a meta character in regular expressions character set. You can think this as an instruction for 'or' operation.
  • When you apply this regex on string regex not it got option to chose just the first word or both words. Regex engines characterizes eagerness, mean they are eager to return results to be quick and faster. Where as text based engines look for longest possible match. Thus former returns regexregex not.
Thus one of the differences between regex engine and text engine is eagerness.

In simple terms, one can categorize them based on search time. Say that with text directed engines, search time depends on length of the search string. And with regex directed engines search time depends on length of the regex.

Back Tracking

Another important concept is back tracking. As the name suggests back tracking is about tracking how we traversed while coming back. Did I confuse you? That is not intentional; let us see this with an example.

Suppose you are looking at financial report. And you are interested in values between 10K to 99K.

Now let us take figure "INR75063.37" . You know that it starts with INR and it is in the format of INR(x).(y) where x >= 0 and y = (00 to 99).

So you are looking for five digits before paise separator '.' This can be expressed as

"\bINR[0-9]{5}\.[0-9]{2}\b"

Let us apply this regular expression on following text

The amounts are Auto INR30.00 Onward Flight INR450454.34
Taxi INR435.65 and Return Flight INR34543.43 etc.

Observe result with five digits INR34543.43 the figure in desired range.

Now let us walk through what happened here.

First with regular expression - \bINR[0-9]{5}\.[0-9]{2}\b. Outer \b at the beginning and at the end specify word boundary.

Obviously all amounts are within a word boundary. (To check about word boundary click at the start of search string, then press Ctrl+[right arrow ->]. Of course this includes space, forget space for this discussion.). So here goes the part wise description of this regex

  1. INR will be matched literally.
  2. Then [0-9]{5} specifies to look for 5 digits(0-9).
  3. Then we have a dot '\.' dot is a meta character in Regular expression character set, thus we need to escape it. It might work without escaping too, but just follow it as a good practice.
  4. Then we have [0-9]{2} which is obvious for two digit of paise. And closing of boundary with '\b'

Ok we understood the regular expression, now let us understand how regex engine might have done this search.

It starts with text string 'T' in The of search string, and looks at part1 of regular expression which is expecting I of INR, it is invalid and thus moves to next word.Then it finds 'a' in amount, which is invalid and moves ahead.

When it comes to INR30.00, it matches INR then moves regex to [0-9]{5}.

Here regex can match it in two ways.

(a) Collect characters till paise separator '.' then verify that each one is a digit(0-9) and number of digits is 5. If paise separator is not found till end of word (\b) skip this word

(b) Collect each character and then check it to be a digit and add to count, if the count is 5 then check for paise separator as sixth digit. If sixth character is a paise separator search passes else search fails am moves to next word.

Here first option (a) assumes that if paise separator is not there, then no need to check each character to be a digit and to add to counter. If it finds a paise separator then it will track back each character and then does necessary operations.

In case of (b) engine anticipates paise separator and does character validation and counter operation for each character.

Assume a case where we are searching for 10 digits. In that case performance penalty with option (b) would be very high.

Option (a) describes case of backtracking. Backtracking is like leaving a pile of bread crumbs at every fork in the road.

If the path we choose turns out to be a dead end, then we can retrace our steps giving up ground until we come across a pile of crumbs that indicates an untried path. Should that path, too, turn out to be a dead end, we can continue to backtrack, retracing our steps to the next pile of crumbs, and so on, until we eventually find a path that leads to our goal or until we run out of untried paths [1].

With text directed engine each character is looked at only once, while with regex directed engines each characters might be looked at many times. Regex directed engines support back tracking and Text Directed engines doesn't.

While at back tracking I should mention about meta character dot '.' in Regular expressions. Dot means any character (except new line). Back tracking plays major role when working with dot. We will look into dot and all other meta characters in next post.

Next -> Regex Meta Characters

No comments: