Jeff Atwood (Coding Horror) wrote an interesting blog post about detecting hyperlinks in blocks of "regular" text. I was surprised by the wave of negative comments he received, most of them complaining that Jeff tried to solve a complex problem with a "simple" regular expression. I don't agree with these comments at all. Software developers are too obsessed with edge cases. What's wrong with "good enough"? What's wrong with solving 99.99% of all possible situations? After all, what's the worst that can happen if the algorithm is not 100% correct? You'll see a bad hyperlink. Big deal...
Anyway, the thing is, as Jeff points out, URLs can contain some pretty weird characters you wouldn't expect, like '(', ')' and ','. This can become quite frustrating when trying to parse something like this:
My website (at http://www.mysite.com/coolpage) is becoming very popular, especially because of traffic coming from http://www.othersite.com/coolestpage, which is a cool site I recently found.
For humans, it's pretty obvious where the hyperlinks are, but what if the links contain some of the non-standard characters I mentioned above? For example:
http://www.mysite.com/coolpage(nice).aspx or http://www.othersite.com/coolest,wildestpage. Both of these are valid URLs.
If you implemented a trivial URL detector, it would extract the following URLs from the first text snippet above:
"http://www.mysite.com/coolpage)" and "http://www.othersite.com/coolestpage,".
Note the trailing parenthesis and comma. Obviously that is not what we want. The fact is that no fixed set of rules can reliably extract URLs from text, not even when a human applies them. We humans "parse" URLs by looking at the context, and that is a pretty hard, if not impossible, task for software.
What we can do is use some heuristics to handle at least 99.99% of all cases. I started thinking about this a little and came up with a solution that uses regular expressions and requires no other code at all.
Let's start with the basics. For our purposes, a URL contains three distinct parts, of which the last is optional:
protocol://host/path
"host" is a hostname (or IP address), optionally followed by a port number, so this is the regex we can use for this:
([a-zA-Z-:@.0-9]+)
Of course, this will not validate the host name part of the URL, but that's not really our goal now.
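To see what "will not validate" means in practice, here is a minimal Python sketch that applies the host pattern to a few strings. The helper name looks_like_host and the sample inputs are made up for illustration.

```python
import re

# Host pattern from the text: letters, digits, '-', ':', '@' and '.'.
HOST = re.compile(r"[a-zA-Z-:@.0-9]+")

def looks_like_host(text):
    """Return True if the whole string consists of allowed host characters."""
    return HOST.fullmatch(text) is not None

print(looks_like_host("www.mysite.com"))   # True
print(looks_like_host("localhost:8080"))   # True
print(looks_like_host("192.168.0.1"))      # True
print(looks_like_host("my site.com"))      # False: a space is not allowed
print(looks_like_host("..."))              # True: accepted, although it is not a real host
```

The last line shows the trade-off: the pattern only filters characters, it does not check that the result is a meaningful host name.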
Next we should specify what the path looks like. According to the specs, the path can contain letters, digits and any of these characters: -;:@&=$_.+!*',()
In this example, we will only try to match http and https, and we also allow '?' so that a query string is picked up. Our trivial regex would then look like:
\b(https|http)://([a-zA-Z-:@.0-9]+)(/[-;:@&=?a-zA-Z0-9$_.+!*',()]*)?
That will work in most cases, but it fails miserably in our text snippet example.
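The failure is easy to reproduce. The following Python sketch runs the trivial pattern over the first text snippet; it assumes the path character class is repeated with "*" so that a whole path is consumed. The names TRIVIAL_URL, SNIPPET and find_urls are only for illustration.

```python
import re

# Trivial pattern, assuming the path character class is repeated with "*".
TRIVIAL_URL = re.compile(
    r"\b(https|http)://([a-zA-Z-:@.0-9]+)(/[-;:@&=?a-zA-Z0-9$_.+!*',()]*)?"
)

SNIPPET = (
    "My website (at http://www.mysite.com/coolpage) is becoming very popular, "
    "especially because of traffic coming from "
    "http://www.othersite.com/coolestpage, which is a cool site I recently found."
)

def find_urls(text):
    """Return the full text of every match, left to right."""
    return [m.group(0) for m in TRIVIAL_URL.finditer(text)]

for url in find_urls(SNIPPET):
    print(url)
# http://www.mysite.com/coolpage)
# http://www.othersite.com/coolestpage,
```

Because ')' and ',' are legal path characters, the greedy match swallows the punctuation that belongs to the sentence.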
Let's make an assumption: we assume that a URL containing parentheses always has matching parentheses, as in "http://www.mysite.com/Some(Nice)Page". This would exclude URLs with a single parenthesis, so "http://www.mysite.com/Some(NicePage" would not be recognized in full: the match would stop just before the unmatched parenthesis. That's a sacrifice I'm willing to make.
There are several ways of writing a regex like this, but a simple one is:
\b(https|http)://([a-zA-Z-:@.0-9]+)(/((\([-;:@&=a-zA-Z0-9$_.+!*',]*?\))|[-;:@&=?a-zA-Z0-9$_.+!*',]|%\d\d)+)?
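Note that I also added a pattern for escaped characters (%nn). As written, \d\d only accepts decimal digits, so an escape containing hex letters, such as %2F, is not matched; [0-9a-fA-F] would cover those. Nested parentheses will not be matched.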
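Here is a small Python sketch of the balanced-parentheses pattern, copied unchanged from above, applied to a few inputs. The names BALANCED_URL and first_url and the test strings are assumptions made for this example.

```python
import re

# Pattern with the "matching parentheses" rule, split over two lines for readability.
BALANCED_URL = re.compile(
    r"\b(https|http)://([a-zA-Z-:@.0-9]+)"
    r"(/((\([-;:@&=a-zA-Z0-9$_.+!*',]*?\))|[-;:@&=?a-zA-Z0-9$_.+!*',]|%\d\d)+)?"
)

def first_url(text):
    """Return the first match in text, or None if there is none."""
    m = BALANCED_URL.search(text)
    return m.group(0) if m else None

# The closing parenthesis of the sentence is no longer swallowed.
print(first_url("(at http://www.mysite.com/coolpage) is"))
# http://www.mysite.com/coolpage

# A balanced pair inside the path is kept.
print(first_url("see http://www.mysite.com/coolpage(nice).aspx now"))
# http://www.mysite.com/coolpage(nice).aspx

# An unmatched "(" cuts the match short.
print(first_url("http://www.mysite.com/Some(NicePage"))
# http://www.mysite.com/Some

# The trailing comma is still part of the match at this stage.
print(first_url("http://www.othersite.com/coolestpage, which"))
# http://www.othersite.com/coolestpage,
```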
This would solve the first match in our text snippet, but not the second one with the trailing ','. Again, we will make an assumption: let's assume that a URL will never end with a period, a comma or a semicolon. I think that's a pretty safe assumption to make (although technically such URLs are allowed).
Our final regex will then look like:
\b(https|http)://([a-zA-Z-:@.0-9]+)(/((\([-;:@&=a-zA-Z0-9$_.+!*',]*?\))|[-;:@&=?a-zA-Z0-9$_.+!*',]|%\d\d)+)?(?<![,.;])
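This regex will match most http and https URLs as they typically appear in running text. It cuts a URL short at an unmatched parenthesis, and it leaves a period, comma or semicolon at the end out of the match. It also stops at the second '/' in a path, because '/' is not part of the path character class; adding it there is an obvious first tweak.
Again, it's not a perfect solution, but apart from those limitations I bet it will be hard to find a real-world example where this regex would fail to extract the URL from a piece of text. The regex can still use some tweaking, but you get the picture.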
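To try the final pattern, here is a short Python sketch that runs it over both text snippets from this post. The pattern is copied unchanged; the names FINAL_URL, extract_urls, SNIPPET_1 and SNIPPET_2 are only for illustration. Python's re module supports the (?<!...) lookbehind used at the end.

```python
import re

# Final pattern from the text, split over three lines for readability.
FINAL_URL = re.compile(
    r"\b(https|http)://([a-zA-Z-:@.0-9]+)"
    r"(/((\([-;:@&=a-zA-Z0-9$_.+!*',]*?\))|[-;:@&=?a-zA-Z0-9$_.+!*',]|%\d\d)+)?"
    r"(?<![,.;])"  # the match may not end in ',', '.' or ';'
)

SNIPPET_1 = (
    "My website (at http://www.mysite.com/coolpage) is becoming very popular, "
    "especially because of traffic coming from "
    "http://www.othersite.com/coolestpage, which is a cool site I recently found."
)
SNIPPET_2 = (
    "http://www.mysite.com/coolpage(nice).aspx or "
    "http://www.othersite.com/coolest,wildestpage. Both of these are valid URLs."
)

def extract_urls(text):
    """Return every URL the final pattern finds in text, left to right."""
    return [m.group(0) for m in FINAL_URL.finditer(text)]

print(extract_urls(SNIPPET_1))
# ['http://www.mysite.com/coolpage', 'http://www.othersite.com/coolestpage']
print(extract_urls(SNIPPET_2))
# ['http://www.mysite.com/coolpage(nice).aspx', 'http://www.othersite.com/coolest,wildestpage']
```

The lookbehind works by forcing the regex engine to backtrack: the greedy path first includes the trailing comma or period, the lookbehind rejects that ending, and the engine gives the character back.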
