Sriharsha Vardhan: August 2007

Saturday, August 11, 2007

Regular Expressions - Part3

In previous post Regular Expressions - Part2, we discussed about regular expression engines and concept of backtracking. In this post we will discuss about the basis constructs of regular expressions , how to use them t o match simple words and lines.

Meta Characters

Meta characters are a sub-set of ASCII character set which take part in building a regular expression. e.g. +,$,^ etc.. Thus these characters instruct regex engines to perform specific operations. If we want to instruct regex engine to deal with thm as normal characters instead of meta characters we need to escape them with backward slash '\'.e.g regex "firstname\.lastname" instructs the engine to ignore special meaning of '.' and to consider as a '.' character.

Following list gives overview of mostly used meta characters.

. Dot - any character in a line

In normal search we use ? to specify a single wildcard character and * to specify sequence characters till next character. e.g. to search a file we use IAdb*.dll But in regex * is used for repetition. Also note that in a line phrase. This mean that the behavior of ‘.’ can be altered using mode settings like SingleLine or MultiLine mode to notify regex engine whether to match a newline (\n or \r\n) with a ‘.’ or to stop at new line. e.g:

Search String:
using System;
using System.IO;
using System.Text;
RegEx: “System.*”
Explanation: In MultiLine mode this matches all references with System and its decedents till end of line.
Matches in non-Single Line mode:
a)System;
b)System.IO;
c)System.Text;
Matches in Single Line mode:
a)System;System.IO;System.Text;

Note that semicolon is also matched in each line

\ - back slash
It is already mentioned that these are used to instruct regex engine to consider them as
normal characters. And when used with a number like \1 or \2, this specifies a back reference number. Back references will be covered seperately.

[ ] - opening and closing square bracket

Any group of characters to be matched are specified within these brackets. Examples are mentioned below.

( ) - round brackets
These are used to ho sub-expressions or back references. Back references will be covered later. Sub-expressions are similar to programming language sub expressions.

{ } curly brackets
These are used with iterators. We have seen this in Part 2 for five digit length. Its format is like {x,y}. where specifier is anything that need to repeated. Here x is
mandatory and can contain values from 0 to value of y. And y is optional to specify. and has to be any integer.

* Iterator to iterate for zero or more times. (0 or more times)

? Iterator to iterate for zero or one time only. (0 or 1 times)

+ Iterator to iterate for at least once or more times. (1 ore more times)

\w alphanumeric character including underscore

\W non-alphanumeric

\d numeric character

\D non-numeric

\s any white space; include , , and

\S non white space

| - pipe character
This is used to match alternatives. We used this in Part to test regex engine. Its syntax goes like [matcha | matchb]

\b - word boundary
We used this in part2 .

\B - any non- word boundary
example:
Search String:
ITSHUFFormAbc.htm
ITSCorporateFormBcd.htm
ITSSelfEmployedFormCde.htm

regex: ^[A-Z]{3}\B[a-zA-Z0-9]+\B

Match:
ITSHUFFormAb
ITSCorporateFormBc
ITSSelfEmployedFormCd

^ - beginning of a line

It matches from start of the line. In MultiLine mode it matches at the beginning of each new line. e.g.

Search String:
thisofLine1 of thisstring100
thisofLine2 of thisstring200
thisofLine3 of thisstring300
RegEx: ^this[a-zA-Z0-9]+
This matches this from the start of each line.
Matches in MultiLine Mode:
a) thisofLine1
b) thisofLine2
c) thisofLine3
Matches in SingleLine Mode:
thisofLine1

f you take out initial caret(^) in above regex, it will match both instances of this in each line

^ got a meaning of inversion when used within square brackets []. We will cover this in character classes.

$ - end of line
It matches at the end line. In MultiLine mode it matches at the end of each new line. e.g.
Search String: Let us the same string as above
thisofline1 of thisstring100
thisofline2 of thisstring200
thisofline3 of thisstring300
RegEx: this[a-za-z0-9]+.$
note '.' before $.
Matches in MultiLineMode:
thisstring100
thisstring200
thisstring300
Matches in SingleLineMode:
thisstring300

\A - start of text
Start of text matches only at the start of string irrespective mode setting

\Z - End of text
End of text matches only at the end of string irrespective mode setting

Thursday, August 09, 2007

Regular Expressions - Part2

In my last post Regular Expressions - Part1, I gave some introduction about regular expressions and presented code to build a tool that can be used to test basic regex constructs.

In this part I will discuss about how regular expression engines work. Before going there I just want to make a point about non-printable characters. Just have a look at ASCII chart once. Look for control characters, printable characters, non- printable characters etc..

Brief on Regex Engines

Regex engine is a library that can process regular expressions for search and replace operations. These engines operate on each character and thus composed of complex algorithms.

Editors supporting regex provide front end for this library. You can use this library with programming languages. As such there is a native support for regular expressions in some languages like Pearl, Ruby etc.. Microsoft got regex support from vbscript days. vbscript supports regular expressions with RegExp object. And look for class library reference of System.Text.RegularExpressions namespace for regex support in .Net. We used this in last example.

As these engines operate on each character and try to match with patterns; performance plays a major role. There are two kinds of engines to chose from. One is called Text Directed Engine while other one is Regex Directed. I will describe them in brief.
Of course no programming language offers a choice to select a particular regex engine. Its matter of what it supports.

In case of editors there is a simple way to test what kind of engine it is. Just apply regular expression "regex | regex not" on string regex not. If search result is regex then it is regex engine otherwise if the search result is regex not then it is text directed engine. There are couple of points to note here.

Expression you applied tries to match string "regex" or "regex not". '|' is a meta character in regular expressions character set. You can think this as an instruction for 'or' operation.
When you apply this regex on string regex not it got option to chose just the first word or both words. Regex engines characterizes eagerness, mean they are eager to return results to be quick and faster. Where as text based engines look for longest possible match. Thus former returns regexregex not.

Thus one of the differences between regex engine and text engine is eagerness.

In simple terms, one can categorize them based on search time. Say that with text directed engines, search time depends on length of the search string. And with regex directed engines search time depends on length of the regex.

Back Tracking

Another important concept is back tracking. As the name suggests back tracking is about tracking how we traversed while coming back. Did I confuse you? That is not intentional; let us see this with an example.

Suppose you are looking at financial report. And you are interested in values between 10K to 99K.

Now let us take figure "INR75063.37" . You know that it starts with INR and it is in the format of INR(x).(y) where x >= 0 and y = (00 to 99).

So you are looking for five digits before paise separator '.' This can be expressed as

"\bINR[0-9]{5}\.[0-9]{2}\b"

Let us apply this regular expression on following text

The amounts are Auto INR30.00 Onward Flight INR450454.34
Taxi INR435.65 and Return Flight INR34543.43 etc.

Observe result with five digits INR34543.43 the figure in desired range.

Now let us walk through what happened here.

First with regular expression - \bINR[0-9]{5}\.[0-9]{2}\b. Outer \b at the beginning and at the end specify word boundary.

Obviously all amounts are within a word boundary. (To check about word boundary click at the start of search string, then press Ctrl+[right arrow ->]. Of course this includes space, forget space for this discussion.). So here goes the part wise description of this regex

INR will be matched literally.
Then [0-9]{5} specifies to look for 5 digits(0-9).
Then we have a dot '\.' dot is a meta character in Regular expression character set, thus we need to escape it. It might work without escaping too, but just follow it as a good practice.
Then we have [0-9]{2} which is obvious for two digit of paise. And closing of boundary with '\b'

Ok we understood the regular expression, now let us understand how regex engine might have done this search.

It starts with text string 'T' in The of search string, and looks at part1 of regular expression which is expecting I of INR, it is invalid and thus moves to next word.Then it finds 'a' in amount, which is invalid and moves ahead.

When it comes to INR30.00, it matches INR then moves regex to [0-9]{5}.

Here regex can match it in two ways.

(a) Collect characters till paise separator '.' then verify that each one is a digit(0-9) and number of digits is 5. If paise separator is not found till end of word (\b) skip this word

(b) Collect each character and then check it to be a digit and add to count, if the count is 5 then check for paise separator as sixth digit. If sixth character is a paise separator search passes else search fails am moves to next word.

Here first option (a) assumes that if paise separator is not there, then no need to check each character to be a digit and to add to counter. If it finds a paise separator then it will track back each character and then does necessary operations.

In case of (b) engine anticipates paise separator and does character validation and counter operation for each character.

Assume a case where we are searching for 10 digits. In that case performance penalty with option (b) would be very high.

Option (a) describes case of backtracking. Backtracking is like leaving a pile of bread crumbs at every fork in the road.

If the path we choose turns out to be a dead end, then we can retrace our steps giving up ground until we come across a pile of crumbs that indicates an untried path. Should that path, too, turn out to be a dead end, we can continue to backtrack, retracing our steps to the next pile of crumbs, and so on, until we eventually find a path that leads to our goal or until we run out of untried paths [1].

With text directed engine each character is looked at only once, while with regex directed engines each characters might be looked at many times. Regex directed engines support back tracking and Text Directed engines doesn't.

While at back tracking I should mention about meta character dot '.' in Regular expressions. Dot means any character (except new line). Back tracking plays major role when working with dot. We will look into dot and all other meta characters in next post.

Next -> Regex Meta Characters

Wednesday, August 08, 2007

Regular Expressions - Part1

Regular expressions. Most (u|li)n(i|u)x programmers might have used them with grep. What are they and why they are so different? By this time you might have got a doubt about (u|li)n(i|u)x. If you interpretedit as unix or linux, then you know about Regular expressions.

If you don't know yet you could interpret it; then regular expressions are not difficult for you. It is just a matter of time you pick a book and understand the rules of this new game. If you got puzzled about those words then just think that you are learning a new technique.

What is this all about
Regular expressions can be used to search and replace text patterns in more structure manner. I am not defining the word regex here. But just bringing you to the context of text search / replace and structure. Just remember structured text search and replace.

Some taste and smell
\b[A-Za-z0-9._%\+-]+\@[A-Za-z0-9-]+\.[A-Za-z0-9-]+(\.[a-z0-9-]+)?\b

Above regular expression matches an email id with regular tld like .com or country specific tld like .co.in. DONT try too hard to match it to an email address. I assure you, by end of this session you will find lot many problems with this regex, and you will optimize it with a much better one.

Before we start:

1) It takes time and requires dedicated time of at least 10 hours before you gain momentum with regular expressions. But here I want to make this process simple and easy. Thus I want to stretch these 10 hours over 10 days, one hour on each day.

2) need a regex editor to test. You can pick some from google. But I feel it is better to write your own with minimal effort.

Build MyRegex test tool (Option 2 as mentioned above):

At the core, this tool is going to have three text boxes. (a) to enter text to be searched (b) to enter regex pattern (c) to show results.

Optionally we can have some check boxes to select few options and labels to address text boxes. I used .Net 2.0 and C# windows application.

I attached screen shot here. I used a context menu on regex textbox to avoid a button click.

In designer code just add this.tbRegex.ContextMenuStrip = this.contextMenuStrip1;

I named my regex text box as tbRegex. You can also find two check boxes to select

Singleline mode and Multiline mode. Don't bother about these things now.

Just add them in a frame for better looks. Then just add references for RegEx and event handler on

Find Context menu click. See code below for form.cs.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;

namespace RegExApp
{
public partial class Form1 : Form
{
RegexOptions regexoptions;

public Form1()
{
InitializeComponent();
}

private void findToolStripMenuItem_Click(object sender, EventArgs e)
{
tbreults.Text = "";
string results = "";


string txtstr = tbintext.Text;
string regex = @tbRegex.Text;

if (cbSingleLine.Checked)
 regexoptions = RegexOptions.Singleline;

if (cbMultiLine.Checked)
 regexoptions = RegexOptions.Multiline;

MatchCollection matches = Regex.Matches(txtstr, regex, regexoptions);
foreach (Match m in matches)
 results += m.Value + Environment.NewLine;

tbreults.Text = results;
}

}
}

Now we can test this tool. Just copy and paste following text in top text box i.e. serach string box,

my email id is firstname.last@gmail.com online
send me mails to FIRSTNAME.LAST@GMAIL.COM collections
yahoo id is firstname_last@yahoo.com on check
also aliased to Firstname.Last@yahoo.com checked often
and hotmail ID is firstname-last@hotmail.com least
or name@net.co.in is also fine

then copy and paste following regex in regex textbox (use Ctrl+v as right click opens context menu)

\b[A-Za-z0-9._%\+-]+\@[A-Za-z0-9-]+\.[A-Za-z0-9-]+(\.[a-z0-9-]+)?\b

right click on regex text box to trigger context menu and click on Find.

Check following results in results text box

firstname.last@gmail.com
FIRSTNAME.LAST@GMAIL.COM
firstname_last@yahoo.com
Firstname.Last@yahoo.com
firstname-last@hotmail.com
name@net.co.in

Experiment by adding new email ids, extra tld's etc..

Next -> Non printable characters, Regex engines and how it works?

Wednesday, August 01, 2007

Programming assemblies and modules in MSIL

Introduction

This is about building assemblies and modules using MSIL language. To have a reasonable insight about internals of .Net Framework understanding of IL is must. There are many articles that describe how CLR loader loads an assembly how JIT compiler compiles and points the the method to newly created memory block with compiled code etc. But I couldn't find a simple guide to build basic units of execution modules using IL like

a net module

a multi file library

an assembly with library reference

This article describes this process in step by step manner. We are going to create two net modules. Then we link them together to form a library. Then we build an application assembly which refers to methods defined in net modules of library. Before going there I would like to describe some theory in one paragraph describing theoretical aspects of this subject. If you cant bare it, just skip following section. You can always come back to have peep if needed.

Theoretical background

CLR CTS CLS - quick focus. CLS defines minimum set of rules needed to offer interoperatability among .NET application written and compiled in different languages and targetting CLR. In .NET's context IL,MSIL and CIL are synonyms. Intermediate Language (IL) ~= Microsoft Intermediate Language (MSIL) ~= Common Intermediate Language (CIL). CLS compliance ensures interoperatable in terms of naming conventions, the data types, the function types etc. But CLS compliance is not mandatory but a recommendation. CLR doesn't doesn't impose any restrictions for an application even it is non-compliant with respect to CLS. But in such cases interoperatability can not be achieved with modules/assemblies developed in different languages. On the other hand Common Type System enforces its rules on each every construct of any language that is targeted for CLR. For examples assemblies and modules are standard types. Managed .NET application are called assemblied and managed executables are referred as modules. An assembly ca contain many modules. Each module contains MetaData and IL. And an assembly contains a Manifest too. CLR offers a safe execution environment above operating system. Safety is achieved by enforcing type control, structured exception handling, garbage collection etc.

With that let us jump to files and code. Recollect that we are going to create two net modules. Then we link them together to form a library. Then we build an application assembly which refers to methods defined in net modules of library. Copy code for these three source files nm01.il,nm02.il and hello.il. And we will build mylib.dll by linking net modules nm01 and nm02.

Files and Source

net module 01 - [nm01.il]

I could have taken more IL files to build this net modules.
This got just a constructor and two static methods.
There is no need for them to be static, but they are there without any strict reason.

.assembly extern mscorlib{}
.class public mymath01
{
 .method public void .ctor()
 {
  .maxstack 1
   ldarg.0  //push "this" instance onto the stack
   call instance void [mscorlib]System.Object::.ctor()
   ret
 }
 .method static public int32  mysum(int32 i1,int32 i2)
 {
  .maxstack  2
   ldarg.0
   ldarg.1
   add
   ret
 }
 .method static public int32  mysub(int32 i1,int32 i2)
 {
  .maxstack  2
   ldarg.1
   ldarg.2
   sub
   ret
 }
}

net module 02 - [nm02.il]

.assembly extern mscorlib{}
.class public mymath02
{
.method public void .ctor()
{
.maxstack 1
ldarg.0 //push "this" instance onto the stack
call instance void [mscorlib]System.Object::.ctor()
ret
}

.method static public int32 mymul(int32 i1,int32 i2)
{
.maxstack 2
ldarg.1
ldarg.2
mul
ret
}

.method static public int32 mydiv(int32 i1,int32 i2)
{
.maxstack 2
ldarg.1
ldarg.2
div
ret
}
}

Application - [hello.il]

.assembly extern mscorlib {}
.assembly extern mylib {}
.assembly hello {}
.method static public void main() cil managed
{
 .entrypoint
  .maxstack 4
  .locals init (int32 first,
    int32 second,
    int32 result)


  ldstr "First number:"

call void [mscorlib]System.Console::WriteLine(string)
call string [mscorlib]System.Console::ReadLine()
call int32 [mscorlib]System.Int32::Parse(string)
stloc first

ldstr "Second number:"
call void [mscorlib]System.Console::WriteLine(string)
call string [mscorlib]System.Console::ReadLine()
call int32 [mscorlib]System.Int32::Parse(string)
stloc second

ldloc first
ldloc second
call int32 [mylib]mymath01::mysum(int32,int32)
stloc result

ldstr "{0} + {1} = {2}"
ldloc first
box int32
ldloc second
box int32
ldloc result
box int32
call void [mscorlib]System.Console::WriteLine(string,object,object,object)

ldstr "Hello, World!"
call void [mscorlib]System.Console::WriteLine(string)
ret
}

I guess most of this stuff is pretty obvious except for maxstack.

maxstack

At the IL level, the CLR needs to know the maximum stack depth of each method. This is not a plain arithmetic. It depends on code flow and depends on max number of stack slots required for any operation that is performed within this method. Though it is not simple it is not dynamic too. Compilers targeting for CLR can calculate this before producing executables. Compilers of high level languages like csc (CSharp ) and vbc (VB) does calculate this number embed it before producing executable code. But for some reason ILAsm doesn't do this. If you don't specify .maxstack, stack defaults to 8. But if methods more stack slots program will crash at runtime.

Partial Classes

Some might have thought that why I created a new class in each net module; rather I would have used a partial class. This makes sense because I am actually building an arithmetic class. But this is not possible here. Because IL doesn't know anything about partial classes. This feature is offered by high level language compilers with syntactic sugar. That makes it clear about why partial classes can not be spanned across assemblies. Compilers merge code in partial classes to make a single class while building. Thus all references for that class should be resolved before building the assembly.

Build commands

I guess you have .Net Framework and SDK(or Visual Studio 2003/2005) installed on your machine. The command I am describing here are for local host. If you are building for other targets than build machine, specify flags for respective targets. Now start Visual Studio Command prompt or SDK command prompt. Navigate to location of source files.

Building net modules
ilasm /dll /output=nm01.netmodule nm01.il
ilasm /dll /output=nm02.netmodule nm02.il

Building library - we use assembly linker for this purpose
al /t:lib /out:mylib.dll nm01.netmodule nm02.netmodule

Building application
ilasm hello.il

IL dis-assembler - ildasm
For the purpose of this article use command line options of ildasm instead of GUI.
ildasm /All hello.exe /out=hello.dil
This produces helo.dil file which contains complete information

dumpbin
You may also want to use dumpbin if you are interested to see headers and sections of PE files.

Please note that there is mention of linking mylib.dll while building hello.il. Actually this linking is specified in source; ".assembly extern mylib {}".

Now run hello and check for results like this
$hello
First number:
3
Second number:
7
3 + 7 = 10
Hello, World!

How to Debug

Debugging a managed code itself becomes a separate topic. This is because CLR won't execute IL code , it actually executes native(machine) code compiled by jitter. And any loss of fidelity from IL to Native will confuse the debugger. This entry ran quite long. I will keep Debugging for another entry.

Sriharsha Vardhan