# Thursday, October 30, 2008

Jeff Atwood (Coding Horror) wrote an interesting blog post about detecting hyperlinks in blocks of "regular" text. I was surprised by the wave of negative comments he received, most of them objecting that Jeff tried to solve a complex problem with a "simple" regular expression. I don't agree with these comments at all. Software developers are too obsessed with borderline cases. What's wrong with "good enough"? What's wrong with handling 99.99% of all possible situations? After all, what's the worst that can happen if the algorithm is not 100% correct? You'll see a bad hyperlink. Big deal...

Anyway, the thing is, as Jeff points out, URLs can contain some pretty weird characters you wouldn't expect, like '(', ')' and ','. This can become quite frustrating when trying to parse something like this:

My website (at http://www.mysite.com/coolpage) is becoming very popular, especially because of traffic coming from http://www.othersite.com/coolestpage, which is a cool site I recently found.

For humans, it's pretty obvious where the hyperlinks are, but what if the links contain some of the non-standard characters I mentioned above? For example:

http://www.mysite.com/coolpage(nice).aspx or http://www.othersite.com/coolest,wildestpage. Both of these are valid URLs.

If you implemented a trivial URL detector that accepts every character a URL may legally contain, it would extract the following URLs from the first text snippet:

"http://www.mysite.com/coolpage)" and "http://www.othersite.com/coolestpage,".

Note the trailing parenthesis and comma. Obviously that is not what we want. The fact is that there is no fixed set of rules that will always extract the right URLs, not even for humans. We humans "parse" the URLs by looking at the context, but that's a pretty hard, if not impossible, task for software.

What we can do is use some heuristics to satisfy at least 99.99% of all cases, so I started thinking about this a little bit and came up with a solution that uses a regular expression and requires no other code at all.

Let's start with the basics. A URL contains 3 distinct parts, of which the last part is optional:

protocol://host/path

"host" is a hostname (or IP address), optionally followed by a port number, so this is the regex we can use for this:

([a-zA-Z-:@.0-9]+)

Of course, this will not validate the host name part of the URL, but that's not really our goal now.
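As a quick illustration of what this character class accepts, here is a minimal Python sketch. Python and the helper name `host_part` are assumptions made for this example, not something the regex itself requires. The hyphen is moved to the end of the class so it is unambiguously a literal character.

```python
import re

# Host pattern from the text: letters, digits, '-', ':', '@' and '.'.
# The hyphen sits at the end of the class so it is read as a literal.
HOST = re.compile(r"[a-zA-Z0-9:@.-]+")


def host_part(url_tail: str) -> str:
    """Return the leading run of host characters (hypothetical helper)."""
    match = HOST.match(url_tail)
    return match.group(0) if match else ""


# A host name with a port number is picked up as one piece.
print(host_part("www.mysite.com:8080/coolpage"))  # www.mysite.com:8080

# Nothing is validated: a nonsense host is accepted just as happily.
print(host_part("not..a..real..host/path"))       # not..a..real..host
```

The second call shows the point made above: the pattern only delimits the host part, it does not check that the host is well formed.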

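Next we should specify what the path looks like. According to the specs (RFC 1738), a path segment can contain any of these characters: letters, digits and any of -;:@&=$_.+!*',() plus escaped characters (%nn). Segments are separated by '/', and a '?' introduces the query string.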

In this example, we will only try to match http and https, so our trivial regex would look like:

\b(https|http)://([a-zA-Z-:@.0-9]+)(/[-;:@&=?/a-zA-Z0-9$_.+!*',()]*)?

That will work in most cases, but it fails miserably in our text snippet example.
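To make the failure concrete, the sketch below runs a trivial pattern of this kind over the first text snippet. It is written in Python as an assumption for illustration; the names `TRIVIAL`, `TEXT` and `trivial_urls` are hypothetical, and the path class here accepts any number of path characters, including '/'.

```python
import re

# Trivial pattern: scheme, host, then any run of legal path characters.
TRIVIAL = re.compile(
    r"\b(https|http)://([a-zA-Z0-9:@.-]+)"
    r"(/[-;:@&=?/a-zA-Z0-9$_.+!*',()]*)?"
)

TEXT = (
    "My website (at http://www.mysite.com/coolpage) is becoming very popular, "
    "especially because of traffic coming from "
    "http://www.othersite.com/coolestpage, which is a cool site I recently found."
)

# group(0) is the whole match; findall() would return tuples of groups instead.
trivial_urls = [m.group(0) for m in TRIVIAL.finditer(TEXT)]

for url in trivial_urls:
    print(url)
# http://www.mysite.com/coolpage)
# http://www.othersite.com/coolestpage,
```

Because ')' and ',' are legal path characters, the greedy match swallows the punctuation that belongs to the sentence.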

Let's make an assumption: we assume that a URL containing parentheses always has matching parentheses, as in "http://www.mysite.com/Some(Nice)Page". This means a URL with a single, unmatched parenthesis, such as "http://www.mysite.com/Some(NicePage", would not be recognized in full: the match stops just before the unmatched parenthesis. That's a sacrifice I'm willing to make.

There are several ways of writing a regex like this, but a simple one is:

\b(https|http)://([a-zA-Z-:@.0-9]+)(/((\([-;:@&=a-zA-Z0-9$_.+!*',]*?\))|[-;:@&=?/a-zA-Z0-9$_.+!*',]|%[0-9a-fA-F]{2})+)?

Note that I also added some regex code for escaped characters (a % followed by two hexadecimal digits). Nested parentheses will not be matched.
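The effect of the matching-parentheses assumption can be checked with a short Python sketch. The language and the names `BALANCED` and `balanced_urls` are assumptions for illustration; the path part accepts '/' and two-hex-digit escapes, which is also a choice made for this sketch.

```python
import re

BALANCED = re.compile(
    r"\b(https|http)://([a-zA-Z0-9:@.-]+)"
    r"(/((\([-;:@&=a-zA-Z0-9$_.+!*',]*?\))"  # a complete "(...)" group
    r"|[-;:@&=?/a-zA-Z0-9$_.+!*',]"          # or one ordinary path character
    r"|%[0-9a-fA-F]{2})+)?"                  # or one escaped character
)


def balanced_urls(text: str) -> list[str]:
    """Return every match of BALANCED in text (hypothetical helper)."""
    return [m.group(0) for m in BALANCED.finditer(text)]


# The sentence's closing parenthesis is no longer swallowed.
print(balanced_urls("My website (at http://www.mysite.com/coolpage) is popular"))
# ['http://www.mysite.com/coolpage']

# Matching parentheses inside the URL are kept.
print(balanced_urls("http://www.mysite.com/coolpage(nice).aspx"))
# ['http://www.mysite.com/coolpage(nice).aspx']

# An unmatched parenthesis cuts the match short.
print(balanced_urls("http://www.mysite.com/Some(NicePage"))
# ['http://www.mysite.com/Some']
```

A bare '(' or ')' is not part of the ordinary character class, so a parenthesis can only be consumed as part of a complete pair.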

This would solve the first match in our text snippet, but not the second one with the trailing ','. Again, we will make an assumption: let's assume that a URL will never end with a period, a comma or a semicolon. I think that's a pretty safe assumption to make (although technically it is allowed). A negative lookbehind at the end of the regex enforces it.

Our final regex will then look like:

\b(https|http)://([a-zA-Z-:@.0-9]+)(/((\([-;:@&=a-zA-Z0-9$_.+!*',]*?\))|[-;:@&=?/a-zA-Z0-9$_.+!*',]|%[0-9a-fA-F]{2})+)?(?<![,.;])

This regex will match most http and https URLs. A URL with an unmatched parenthesis is cut off at that parenthesis, and a period, comma or semicolon at the end is left out of the match.
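Putting the pieces together, here is a minimal Python sketch of a detector built on this kind of pattern. Python, the constant `URL_PATTERN` and the function `find_urls` are assumptions for illustration, and the path part accepts '/' and two-hex-digit escapes as a choice made for this sketch. The trailing lookbehind forces the regex engine to back off one character when the match would otherwise end in '.', ',' or ';'.

```python
import re

URL_PATTERN = re.compile(
    r"\b(https|http)://([a-zA-Z0-9:@.-]+)"
    r"(/((\([-;:@&=a-zA-Z0-9$_.+!*',]*?\))"  # a complete "(...)" group
    r"|[-;:@&=?/a-zA-Z0-9$_.+!*',]"          # or one ordinary path character
    r"|%[0-9a-fA-F]{2})+)?"                  # or one escaped character
    r"(?<![,.;])"                            # must not end in '.', ',' or ';'
)


def find_urls(text: str) -> list[str]:
    """Return the URLs detected in text (hypothetical helper)."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


TEXT = (
    "My website (at http://www.mysite.com/coolpage) is becoming very popular, "
    "especially because of traffic coming from "
    "http://www.othersite.com/coolestpage, which is a cool site I recently found."
)

for url in find_urls(TEXT):
    print(url)
# http://www.mysite.com/coolpage
# http://www.othersite.com/coolestpage

# A sentence-ending period is dropped even when there is no path.
print(find_urls("See http://example.com."))  # ['http://example.com']
```

The lookbehind works on the host part too, because '.' is a legal host character and would otherwise be matched at the end of a sentence.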

Again, it's not a perfect solution, but I bet it will be very hard to find a real-world example where this regex would fail to extract the URL from a piece of text. The regex can still use some tweaking, but you get the picture.

Thursday, October 30, 2008 6:36:30 PM (W. Europe Standard Time, UTC+01:00)

# Wednesday, October 29, 2008

Most people believe dogfooding is the perfect way of ensuring the quality of the software you're creating. I couldn't agree more.

The only problem is that when you use your own software, you always have the feeling that it's not quite "done". There's always something that you can do better, always something that can be improved, always some features that you feel would be nice.

Especially when building a framework, this can become a real problem because you never feel done. Another problem is making sure your framework maintains backwards compatibility with previous releases.

Well, I just mentioned this because I finally decided to freeze version 2.0 of ProMesh.NET and get it ready for release. Release candidate 3 has just been published on CodePlex.

## Thoughts on open source frameworks

A few weeks ago, a fellow developer asked me why I chose to make my projects open source. That was a pretty good question, because compared to just building a framework for your own use, it makes life more complicated: you have to write documentation, support users, and deal with more headaches.

I didn't have to think long before I could answer that question: releasing a custom framework as open source forces you to be disciplined about your code and documentation. After all, you can't afford to write code you are ashamed of, can you? If you write software for your own use, you tend to write code that... well... sucks.

And of course, let's not forget the whole point of open source software: letting other developers contribute. It creates a dynamic that you would never get when you're just building and using your own little framework.

If all of the above sounds like a bunch of crap, it's probably because I'm not a native English speaker, or ... it is just crap.

Wednesday, October 29, 2008 10:32:35 PM (W. Europe Standard Time, UTC+01:00)