Wednesday, August 08, 2007

Regular Expressions - Part 1

Regular expressions. Most (u|li)n(i|u)x programmers have probably used them with grep. What are they, and why are they so different? By now you may be wondering about (u|li)n(i|u)x. If you interpreted it as unix or linux, then you already know something about regular expressions.

If you do not know regular expressions yet but could still interpret it, then they will not be difficult for you. It is only a matter of time: pick up a book and learn the rules of this new game. If those words puzzled you, just think of it as learning a new technique.

What is this all about
Regular expressions can be used to search and replace text patterns in a more structured manner. I am not defining the word regex here, only bringing you to the context of text search / replace and structure. Just remember: structured text search and replace.

Some taste and smell
\b[A-Za-z0-9._%\+-]+\@[A-Za-z0-9-]+\.[A-Za-z0-9-]+(\.[a-z0-9-]+)?\b

The regular expression above matches an email id with a regular TLD like .com or a country-specific TLD like .co.in. Don't try too hard to map it onto an email address yet. I assure you that by the end of this session you will find many problems with this regex, and you will replace it with a much better one.

Before we start:

1) It takes at least 10 hours of dedicated time before you gain momentum with regular expressions. I want to make this process simple and easy, so I will stretch these 10 hours over 10 days, one hour each day.

2) You need a regex editor to test with. You can pick one from Google, but I feel it is better to write your own with minimal effort.

Build MyRegex test tool (Option 2 as mentioned above):

At the core, this tool has three text boxes: (a) one to enter the text to be searched, (b) one to enter the regex pattern, and (c) one to show the results.

Optionally, we can add some check boxes to select a few options, and labels to identify the text boxes. I used .NET 2.0 and a C# Windows application.

I have attached a screenshot here. I used a context menu on the regex text box to avoid a button click.

In the designer code, just add this.tbRegex.ContextMenuStrip = this.contextMenuStrip1;

I named my regex text box tbRegex. You can also find two check boxes to select Singleline mode and Multiline mode. Don't bother about these now; just add them in a frame for better looks. Then add a using directive for System.Text.RegularExpressions and an event handler for the Find context menu click. See the code below for form.cs.

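using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;

namespace RegExApp
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        // Runs when "Find" is chosen from the context menu of the regex text box.
        private void findToolStripMenuItem_Click(object sender, EventArgs e)
        {
            tbreults.Text = "";

            string txtstr = tbintext.Text;
            string regex = tbRegex.Text;

            // Start from no options on every click so that an unchecked
            // box does not keep its effect from an earlier search.
            RegexOptions regexoptions = RegexOptions.None;

            // Combine with |= so that both modes can be active together.
            if (cbSingleLine.Checked)
                regexoptions |= RegexOptions.Singleline;

            if (cbMultiLine.Checked)
                regexoptions |= RegexOptions.Multiline;

            try
            {
                // Collect every match, one per line, for the results box.
                StringBuilder results = new StringBuilder();
                foreach (Match m in Regex.Matches(txtstr, regex, regexoptions))
                    results.AppendLine(m.Value);

                tbreults.Text = results.ToString();
            }
            catch (ArgumentException ex)
            {
                // Regex.Matches throws ArgumentException for an invalid pattern.
                tbreults.Text = "Invalid regex: " + ex.Message;
            }
        }
    }
}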

Now we can test this tool. Copy and paste the following text into the top text box, i.e. the search string box:

my email id is firstname.last@gmail.com online
send me mails to FIRSTNAME.LAST@GMAIL.COM collections
yahoo id is firstname_last@yahoo.com on check
also aliased to Firstname.Last@yahoo.com checked often
and hotmail ID is firstname-last@hotmail.com least
or name@net.co.in is also fine

Then copy and paste the following regex into the regex text box (use Ctrl+V, as a right click opens the context menu):

\b[A-Za-z0-9._%\+-]+\@[A-Za-z0-9-]+\.[A-Za-z0-9-]+(\.[a-z0-9-]+)?\b

Right click on the regex text box to open the context menu and click Find.

Check for the following results in the results text box:

firstname.last@gmail.com
FIRSTNAME.LAST@GMAIL.COM
firstname_last@yahoo.com
Firstname.Last@yahoo.com
firstname-last@hotmail.com
name@net.co.in
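
If you would rather check these results before building the form, the same search can be run from a console program. This is only a minimal sketch under my own assumptions: the class name EmailRegexDemo and the helper FindMatches are made up for illustration and are not part of the tool above. It uses the same Regex.Matches call as the Find handler, with no options set.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical console stand-in for the test tool (names are illustrative).
public static class EmailRegexDemo
{
    // The pattern from this post, unchanged.
    public const string Pattern =
        @"\b[A-Za-z0-9._%\+-]+\@[A-Za-z0-9-]+\.[A-Za-z0-9-]+(\.[a-z0-9-]+)?\b";

    // The sample text from this post, one array entry per line.
    public static readonly string[] SampleLines =
    {
        "my email id is firstname.last@gmail.com online",
        "send me mails to FIRSTNAME.LAST@GMAIL.COM collections",
        "yahoo id is firstname_last@yahoo.com on check",
        "also aliased to Firstname.Last@yahoo.com checked often",
        "and hotmail ID is firstname-last@hotmail.com least",
        "or name@net.co.in is also fine"
    };

    // Returns every match of Pattern in the text, in order of appearance.
    public static List<string> FindMatches(string text)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(text, Pattern))
            found.Add(m.Value);
        return found;
    }

    public static void Main()
    {
        string text = string.Join(Environment.NewLine, SampleLines);

        // Prints one match per line, like the results text box.
        foreach (string hit in FindMatches(text))
            Console.WriteLine(hit);
    }
}
```

Running it should print the same six addresses listed above, one per line.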

Experiment by adding new email ids, extra TLDs, etc.
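
As a starting point for such experiments, here is a small sketch that feeds a few extra inputs to the same pattern. The three addresses are made up by me for illustration, and the class EmailRegexExperiments and helper FirstMatch are hypothetical names, not part of the tool. No RegexOptions are set, so matching is case-sensitive.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical experiment harness (names and inputs are illustrative).
public static class EmailRegexExperiments
{
    // The pattern from this post, unchanged.
    public const string Pattern =
        @"\b[A-Za-z0-9._%\+-]+\@[A-Za-z0-9-]+\.[A-Za-z0-9-]+(\.[a-z0-9-]+)?\b";

    // Returns the first match in the input, or "(no match)" if there is none.
    public static string FirstMatch(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Value : "(no match)";
    }

    public static void Main()
    {
        // Made-up addresses for experimenting; none come from the sample text.
        string[] inputs =
        {
            "name@mail.example.co.uk",  // one more dot-separated part
            "name@NET.CO.IN",           // upper-case country TLD
            "name@localhost"            // no dot after the @
        };

        foreach (string input in inputs)
            Console.WriteLine(input + " -> " + FirstMatch(input));
    }
}
```

With the pattern as written, the first input is cut short at name@mail.example.co, because the pattern allows at most three dot-separated parts after the @. The second is cut at name@NET.CO, because the optional last group only accepts lower-case letters. The third gives no match at all. These are some of the problems promised earlier.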

Next -> Non-printable characters, regex engines and how they work
