Thursday, June 11, 2009

Regular expressions interleave

Today we had a requirement to write regular expressions for domain users and machines. While doing this I expected there would be some scope for making the expressions more succinct. I will start with the expressions I wrote and then come to the succinctness part.

=====================

Regular expression to match hostnames in an intranet scenario.

The requirements here are:

1) Machine names can contain letters, numbers, underscores or hyphens

2) Names should not start or end with an underscore or hyphen

3) Labels and the TLD part follow the machine name, with a ‘.’ in between

4) The labels follow the same format as the machine name

5) Every label is followed by a dot

6) The TLD is just com

^[A-Za-z0-9][A-Za-z0-9_-]+[A-Za-z0-9]\.([A-Za-z0-9][A-Za-z0-9_-]+[A-Za-z0-9]\.){1,3}(com|COM)$
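To check the expression against the requirements, here is a minimal Python sketch. It assumes the dots are meant as literal dots (escaped as `\.`) and that the TLD part is the alternation `com|COM`; the names `LABEL`, `HOSTNAME_RE` and `is_valid_hostname` are only for illustration.

```python
import re

# One name or label: starts and ends with a letter or digit,
# with letters, digits, underscores or hyphens in between.
LABEL = r"[A-Za-z0-9][A-Za-z0-9_-]+[A-Za-z0-9]"

# Machine name, a dot, one to three dot-terminated labels, then the TLD.
HOSTNAME_RE = re.compile(r"^" + LABEL + r"\.(" + LABEL + r"\.){1,3}(com|COM)$")


def is_valid_hostname(name: str) -> bool:
    """Return True if name matches the intranet hostname pattern."""
    return HOSTNAME_RE.match(name) is not None


print(is_valid_hostname("web01.corp.example.com"))
print(is_valid_hostname("-web01.corp.com"))
```

One thing to note: because of the `+` in the middle, every name and label needs at least three characters, so a name like a.b.com is rejected.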

The special requirement here is to have a configurable number of labels (1 to 3 in the expression above).

With a minor change, a similar expression can also be used to match fully qualified usernames and email addresses.

^[A-Za-z0-9][A-Za-z0-9_-]+\.[A-Za-z0-9_-]+[A-Za-z0-9]@([A-Za-z0-9][A-Za-z0-9_-]+[A-Za-z0-9]\.){1,3}(com|COM)$
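The same kind of quick check works for the username expression. This is only a sketch, again assuming literal dots and the `com|COM` alternation; `USER_RE` and `is_valid_user` are illustrative names.

```python
import re

# Local part: exactly one dot, not at the start or the end.
# Domain part: one to three dot-terminated labels, then the TLD.
USER_RE = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9_-]+\.[A-Za-z0-9_-]+[A-Za-z0-9]"
    r"@([A-Za-z0-9][A-Za-z0-9_-]+[A-Za-z0-9]\.){1,3}(com|COM)$"
)


def is_valid_user(address: str) -> bool:
    """Return True if address matches the user@domain pattern."""
    return USER_RE.match(address) is not None


print(is_valid_user("john.smith@corp.example.com"))
print(is_valid_user("johnsmith@corp.com"))
```

Here the dot before the @ is mandatory, so an address without it does not match.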

Now for a little bit of explanation.

^ : start of the string; [A-Za-z0-9] : the name must start with a letter or digit, which rules out a leading hyphen or underscore; the rest of the expression is self-explanatory.
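For readers who want the remaining pieces spelled out, the hostname expression can be written in Python's verbose mode with one comment per part. This is a sketch with the dots escaped and the TLD written as `com|COM`; the name `HOSTNAME_VERBOSE` is illustrative.

```python
import re

HOSTNAME_VERBOSE = re.compile(
    r"""
    ^                     # start of the string
    [A-Za-z0-9]           # first character: letter or digit only
    [A-Za-z0-9_-]+        # middle: letters, digits, underscore or hyphen
    [A-Za-z0-9]           # last character of the machine name
    \.                    # literal dot after the machine name
    (?:
        [A-Za-z0-9]       # a label follows the same format
        [A-Za-z0-9_-]+
        [A-Za-z0-9]
        \.                # every label is followed by a dot
    ){1,3}                # between one and three labels
    (?:com|COM)           # the TLD is just com
    $                     # end of the string
    """,
    re.VERBOSE,
)

print(bool(HOSTNAME_VERBOSE.match("build-7.lab.example.com")))
```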

=====================

With that, I am curious about possible ways of writing expressions with interleaved characters of specifiable multiplicity.

I know that is a very confusing statement. Sorry, I am not good at expressing things in short, crisp sentences.

What I mean is that these expressions can be looked at as

<character set>.<character set>@<character set>.<com>. Here the dot and @ are mere interleaves between character sets. They carry restrictions: they cannot be placed at the beginning or end of a character set. That means they are always interleaved.

Now I want to know the possible means of specifying such interleaves in the language syntax. I should also be able to specify the multiplicity within a definable part of the expression. For example, I can allow only 1 dot before the @, and it must be interleaved between valid character sets.

So I wish to write this expression as ^[A-Za-z]<interleave([0-9]+[.])>@([A-Za-z]<interleave([0-9]+)>.){1,3}(com|COM)$
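I am not aware of a built-in interleave operator in the common regex dialects, but the idea can be approximated by generating the usual idiom element(separator element){m,n} from a small helper. The sketch below is one way to do that in Python; `interleave`, `WORD`, `LOCAL`, `DOMAIN` and `ADDRESS_RE` are hypothetical names, and the patterns are a variation on the expressions above rather than exact equivalents (one-character names are allowed, and two separators in a row are not).

```python
import re
from typing import Optional


def interleave(element: str, separator: str,
               min_sep: int = 0, max_sep: Optional[int] = None) -> str:
    """Build a regex fragment where separator appears only between elements.

    min_sep and max_sep bound how many separators are allowed;
    max_sep=None means no upper bound.
    """
    upper = "" if max_sep is None else str(max_sep)
    return f"(?:{element})(?:(?:{separator})(?:{element})){{{min_sep},{upper}}}"


# Runs of letters or digits, with single '_' or '-' characters only in between.
WORD = interleave(r"[A-Za-z0-9]+", r"[_-]")

# Exactly one dot before the @, interleaved between two words.
LOCAL = interleave(WORD, r"\.", 1, 1)

# One to three labels (zero to two dots between them), then ".com".
DOMAIN = interleave(WORD, r"\.", 0, 2) + r"\.(?:com|COM)"

ADDRESS_RE = re.compile(LOCAL + "@" + DOMAIN)

print(bool(ADDRESS_RE.fullmatch("john.smith@corp.example.com")))
print(bool(ADDRESS_RE.fullmatch("john..smith@corp.com")))
```

Because the separator is only ever emitted between two copies of the element, it can never land at the beginning or end, and the multiplicity is controlled per part of the expression through min_sep and max_sep.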

I request the regular expression gurus to help me: is such a feature available?
