Thursday, February 24, 2011

Regular Expressions

When programming, one often needs to match input text against a pattern. I will quickly summarize the basic terminology involved in using regular expressions.
1. Input text - the text that is searched.
2. Regular expression - also called the search pattern. There is a whole lot of syntax for constructing this pattern, which can be found in the Regular Expression Reference.
3. Regular expression engine - the program that finds the parts of the input text that match the regular expression. It can then perform operations on the matches it finds, such as replacing them with some other text. So it can be thought of as consisting of two main components: a matcher and a replacer.
4. Match - the part of the input text that conforms exactly to the specification of the regular expression.
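
To make the four terms concrete, here is a minimal JavaScript sketch. The input string and the digit pattern are made up purely for illustration; only the standard String methods match and replace are used.

```javascript
// 1. Input text: the string that is searched (made-up example).
const inputText = "Order 66 shipped, order 7 pending";

// 2. Regular expression (search pattern): one or more digits, matched globally.
const pattern = /\d+/g;

// 3. The engine's matcher finds every match in the input text...
const matches = inputText.match(pattern);

// ...and its replacer substitutes each match with some other text.
const replaced = inputText.replace(pattern, "#");

// 4. Each element of `matches` is a match.
console.log(matches);  // [ '66', '7' ]
console.log(replaced); // Order # shipped, order # pending
```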

I will take a regex I recently modified (Original regex) to explain the concepts involved. This regex matches URLs inside input text. I was working on showing tweets on a page and I needed to replace each plain URL inside a tweet with an HTML anchor tag pointing to that URL.

Here I am using JavaScript regex syntax to show how one can use this regex to convert the text 'This is the URL of current webpage http://wearmyhat.blogspot.com' to 'This is the URL of current webpage <a target=_blank href=http://wearmyhat.blogspot.com>http://wearmyhat.blogspot.com</a>'.
My modification (item 7 in the list below) is a slight change to avoid matching the ellipsis that occurs after the actual URL text. For example, the original regex was matching the ellipsis occurring after the following URL: http://wearmyhat.blogspot.com... This was not acceptable for my application.
Modified Regex:
text = text.replace(/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:(?:[^\s()<>.]+[.]?)+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi, "<a target=_blank href=$1>$1</a>");
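
As a rough sketch of how the call above behaves, the snippet below wraps it in a function and runs it on the example sentence with a trailing ellipsis. The function name linkify and the variable names are my own and purely illustrative; the regex and the replacement string are copied unchanged from the line above.

```javascript
// Hypothetical helper: wraps the replace call shown above, unchanged.
function linkify(text) {
  return text.replace(
    /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:(?:[^\s()<>.]+[.]?)+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi,
    "<a target=_blank href=$1>$1</a>"
  );
}

// Example tweet text ending in an ellipsis.
const tweet = "This is the URL of current webpage http://wearmyhat.blogspot.com...";
const html = linkify(tweet);

// The trailing dots stay outside the anchor tag.
console.log(html);
// This is the URL of current webpage <a target=_blank href=http://wearmyhat.blogspot.com>http://wearmyhat.blogspot.com</a>...
```

Note that the replacement string leaves the href attribute unquoted, exactly as in the original call; quoting it as href="$1" would be the more defensive choice if the output is fed to a strict parser.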

I will now take the main constructs used in this regex, in order, and explain them one by one.
1. \b Matches a word boundary, i.e. the zero-width position between a word character and a non-word character. In other words, it matches at the start or end of a word without consuming a character.
2. () Capturing group [whatever it matches is captured by the matcher in its entirety]. The above regex has only one outer capturing group, with multiple non-capturing groups placed inside it. Capturing groups are the only groups that can be backreferenced.
3. (?:) Non-capturing group [whatever it matches is grouped but not captured by the matcher]. This also means that these groups are not available for backreferencing. Any capturing groups nested inside a non-capturing group are still treated as capturing groups by the regex engine (which is actually non-intuitive).
4. [\w] matches a word character (a letter, a digit or an underscore). The regex uses [\w-], which also allows a hyphen.
5. [^\s] matches a non-whitespace character, [a-z] matches any character from a to z.
6. \/{1,3} matches a forward slash 1-3 times, \d{0,3} a digit 0-3 times, [a-z]{2,4} any character from a to z 2-4 times.
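7. My change: (?:[^\s()<>.]+[.]?)+ and the corresponding group in the original regex is [^\s()<>]+
Here I force '.' to occur only one at a time in the input text: each dot must be preceded by at least one non-dot character, so this group cannot match a run of consecutive dots such as an ellipsis.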
8. $1 in the replacement string refers back to the first captured group. It is needed to construct the anchor tag properly.
9. + means the preceding expression can occur once or more, ? means the preceding expression can occur 0 or 1 times, * means the preceding expression can occur 0 or more times. These are called quantifiers.
10. The g and i at the end are JavaScript flags: g matches globally (every occurrence, not just the first) and i matches in a case-insensitive manner.
11. The second parameter of the replace function is the string that replaces each match found by the regex engine. We replace every URL with an anchor tag pointing to that URL.
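
The constructs in the list above can be tried out in isolation. The snippet below is only a sketch: every input string and variable name is made up for illustration, and the small patterns are simplified stand-ins for the corresponding pieces of the URL regex, not the regex itself.

```javascript
// \b is zero-width: it matches a position, so "cat" inside "concat" is skipped.
const boundary = "cat concat cat".replace(/\bcat\b/g, "dog");
console.log(boundary); // dog concat dog

// Only ( ) groups are numbered; the (?: ) group around the scheme is not.
const groups = /(?:https?):\/\/([a-z.]+)/i.exec("see http://www.example.com now");
console.log(groups[0]); // http://www.example.com
console.log(groups[1]); // www.example.com

// A capturing group nested inside a non-capturing group still captures.
const nested = /(?:a(b))c/.exec("abc");
console.log(nested[1]); // b

// Counted quantifiers taken from the URL regex.
const wwwOk = /^www\d{0,3}[.]/.test("www2.");
const tldTooLong = /^[a-z]{2,4}$/.test("museum");
console.log(wwwOk);      // true
console.log(tldTooLong); // false (6 letters, more than {2,4} allows)

// Item 7: the original fragment runs across consecutive dots,
// the modified fragment stops after a single dot.
const originalPart = /^[^\s()<>]+/.exec("blogspot.com...more")[0];
const modifiedPart = /^(?:[^\s()<>.]+[.]?)+/.exec("blogspot.com...more")[0];
console.log(originalPart); // blogspot.com...more
console.log(modifiedPart); // blogspot.com.

// Flags: without g only the first match is replaced, without i case matters.
const firstOnly = "A a".replace(/a/, "x");
const allAnyCase = "A a".replace(/a/gi, "x");
console.log(firstOnly);  // A x
console.log(allAnyCase); // x x

// $1 and $2 in the replacement string refer to the captured groups.
const swapped = "2011-02".replace(/(\d{4})-(\d{2})/, "$2/$1");
console.log(swapped); // 02/2011
```

In the full regex, the single dot that the modified fragment can still pick up is then rejected by the final character class, which excludes '.', so the match ends on the last non-punctuation character of the URL.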
