# Thursday, October 30, 2008

Jeff Atwood (Coding Horror) wrote an interesting blog post about detecting hyperlinks in blocks of "regular" text. I was surprised by the wave of negative comments he received, most of them because Jeff tried to solve a complex problem with a "simple" regular expression. I don't agree with these comments at all. Software developers are too obsessed with borderline cases. What's wrong with "good enough"? What's wrong with solving 99.99% of all possible situations? After all, what's the worst that can happen if the algorithm is not 100% correct? You'll see a bad hyperlink. Big deal...

Anyway, the thing is, as Jeff points out, URLs can contain some pretty weird characters you wouldn't expect, like '(', ')' and ','. This can become quite frustrating when trying to parse something like this:

My website (at http://www.mysite.com/coolpage) is becoming very popular, especially because of traffic coming from http://www.othersite.com/coolestpage, which is a cool site I recently found.

For humans, it's pretty obvious where the hyperlinks are, but what if the links contain some of the non-standard characters I mentioned above? For example:

http://www.mysite.com/coolpage(nice).aspx or http://www.othersite.com/coolest,wildestpage. Both of these are valid URLs.

If you implemented a trivial URL detector, it would extract the following URLs from the first text snippet:

"http://www.mysite.com/coolpage)" and "http://www.othersite.com/coolestpage,".

Note the trailing parenthesis and comma. Obviously that is not what we want. The fact is that there is no way we can extract valid URLs by following a set of fixed rules, not even for humans. We humans will "parse" the URLs by looking at the context, but that's a pretty hard, if not impossible task for software.
What we can do is use some heuristics to satisfy at least 99.99% of all cases, so I started thinking about this a little bit and came up with a solution that uses regular expressions and requires no code at all.

Let's start with the basics. A URL contains 3 distinct parts, of which the last part is optional:

protocol://host/path

"host" is a hostname (or IP address), optionally followed by a port number, so this is the regex we can use for this:

([a-zA-Z-:@.0-9]+)

Of course, this will not validate the host name part of the URL, but that's not really our goal now.

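Next we should specify what the path looks like. According to the specs, each path segment can contain any of these characters: letters, digits and any of -;:@&=$_.+!*',() with segments separated by '/'. A '?' starts the query string, so we allow that character as well.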

In this example, we will only try to match http and https, so our trivial regex would look like:

\b(https|http)://([a-zA-Z-:@.0-9]+)(/[-;:@&=?a-zA-Z0-9$_.+!*',()/]*)?

That will work in most cases, but it fails miserably in our text snippet example.
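To see the failure for yourself, here is a minimal sketch in Python (my choice for illustration only; any regex engine with similar syntax should behave the same way). The names `TRIVIAL_URL`, `text` and `trivial_matches` are made up for this example, and the pattern assumes a `*` quantifier on the path character class and `/` as an allowed path character.

```python
import re

# Trivial detector: protocol, host, then an optional path built from the
# allowed characters. Assumption: "*" on the path class and "/" inside it.
TRIVIAL_URL = re.compile(
    r"\b(https|http)://([a-zA-Z-:@.0-9]+)(/[-;:@&=?a-zA-Z0-9$_.+!*',()/]*)?"
)

text = (
    "My website (at http://www.mysite.com/coolpage) is becoming very popular, "
    "especially because of traffic coming from "
    "http://www.othersite.com/coolestpage, which is a cool site I recently found."
)

# group(0) is the whole match, regardless of the capture groups in the pattern.
trivial_matches = [m.group(0) for m in TRIVIAL_URL.finditer(text)]
print(trivial_matches)
# → ['http://www.mysite.com/coolpage)', 'http://www.othersite.com/coolestpage,']
```

The closing parenthesis and the comma are both legal path characters, so the greedy match swallows them.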

Let's make an assumption: we assume that a URL containing parentheses always has matching parentheses, as in "http://www.mysite.com/Some(Nice)Page". This would exclude URLs with a single parenthesis, so "http://www.mysite.com/Some(NicePage" would not be recognized as a URL. That's a sacrifice I'm willing to make.

There are several ways of writing a regex like this, but a simple one is:

\b(https|http)://([a-zA-Z-:@.0-9]+)(/((\([-;:@&=a-zA-Z0-9$_.+!*',]*?\))|[-;:@&=?a-zA-Z0-9$_.+!*',/]|%[0-9a-fA-F]{2})+)?

Note that I also added some regex code for escaped characters (%nn, where nn is two hex digits). Nested parentheses will not be matched.

This would solve the first match in our text snippet, but not the second one with the trailing ','. Again, we will make an assumption: let's assume that a URL will never end with a period, a comma or a semicolon. I think that's a pretty safe assumption to make (although technically it is allowed). A negative lookbehind at the end of the regex enforces this.

Our final regex will then look like:

\b(https|http)://([a-zA-Z-:@.0-9]+)(/((\([-;:@&=a-zA-Z0-9$_.+!*',]*?\))|[-;:@&=?a-zA-Z0-9$_.+!*',/]|%[0-9a-fA-F]{2})+)?(?<![,.;])

This regex will match most http and https URLs, except those with unmatched parentheses. A trailing period, comma or semicolon is left out of the match.

Again, it's not a perfect solution, but I bet it will be very hard to find a real-world example where this regex would fail to extract the URL from a piece of text. The regex can still use some tweaking, but you get the picture.
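If you want to try the final regex on the snippets from this post, here is a minimal sketch in Python (again just for illustration). `URL_PATTERN` and `extract_urls` are hypothetical names, and the pattern assumes that `/` is allowed inside the path and that a `%nn` escape uses two hex digits.

```python
import re

# Final pattern, split over several lines for readability.
URL_PATTERN = re.compile(
    r"\b(https|http)://([a-zA-Z-:@.0-9]+)"      # protocol and host[:port]
    r"(/((\([-;:@&=a-zA-Z0-9$_.+!*',]*?\))"     # a balanced (...) group
    r"|[-;:@&=?a-zA-Z0-9$_.+!*',/]"             # an ordinary path character
    r"|%[0-9a-fA-F]{2})+)?"                     # a %nn escape (assumed hex)
    r"(?<![,.;])"                               # must not end in , . or ;
)


def extract_urls(text):
    """Return every URL-like match found in text, in order of appearance."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


print(extract_urls("My website (at http://www.mysite.com/coolpage) is popular."))
# → ['http://www.mysite.com/coolpage']
print(extract_urls("See http://www.othersite.com/coolest,wildestpage. It is valid."))
# → ['http://www.othersite.com/coolest,wildestpage']
```

The lookbehind does not reject the URL. It only forces the engine to backtrack one character, so the trailing punctuation ends up outside the match.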

Thursday, October 30, 2008 6:36:30 PM (W. Europe Standard Time, UTC+01:00)

# Wednesday, October 29, 2008

Most people believe dogfooding is the perfect way of ensuring the quality of the software you're creating. I couldn't agree more.

The only problem is that when you use your own software, you always have the feeling that it's not quite "done". There's always something that you can do better, always something that can be improved, always some features that you feel would be nice.

Especially when building a framework this can become a real problem because you never feel done. Another problem is making sure your framework maintains backwards compatibility with previous releases.

Well, I just mentioned this because I finally decided to freeze version 2.0 of ProMesh.NET and get it ready for release. Release candidate 3 has just been published on CodePlex.

Thoughts on open source frameworks

A few weeks ago, a fellow developer asked me why I chose to make my projects open source. That was a pretty good question, because compared to just building a framework for your own use, it makes life more complicated: you have to write documentation, support users, and deal with more headaches.

I didn't have to think long before I could answer that question: releasing a custom framework as open source forces you to be disciplined about your code and documentation. After all, you can't afford to write code you are ashamed of, can you? If you write software for your own use, you tend to write code that... well... sucks.

And of course, let's not forget the whole point of open source software: letting other developers contribute. It creates a dynamic that you would never get when you're just building and using your own little framework.

If all of the above sounds like a bunch of crap, it's probably because I'm not a native English speaker, or ... it is just crap.

Wednesday, October 29, 2008 10:32:35 PM (W. Europe Standard Time, UTC+01:00)

# Thursday, September 25, 2008

A few days ago, I stumbled upon a problem that I never really noticed before Google Chrome was released. It seems that Google Chrome chokes on embedded JavaScript if some weird bytes are present in the HTML.

It took a while to figure out what was going on. It seems the weird bytes were the byte order mark for UTF-8 documents (hex EF BB BF). When Google Chrome finds these bytes within a <script> block, it will simply stop parsing JavaScript. No errors, no warnings.

You can argue that this is a problem with Google Chrome, but that still doesn't explain why these bytes were in my JavaScript blocks to start with.

The problem is this: I inject JavaScript in the HTML output from embedded resources. The embedded resources are just .js files created in Visual Studio, marked as "embedded resource". The resource is then read from the assembly, decoded as UTF-8 into a string and inserted in the HTML script block. The problem is that Visual Studio ALWAYS adds the UTF-8 "byte order mark" when you save a text file. These bytes are also embedded in the embedded resource, which is... annoying.

Of course, you could tell Visual Studio to save your file without signature ("byte order mark"), but you have to do that EVERY time you've created a new JavaScript file. There's no way to make that the default.

Bummer...

On a positive note, this gives me the perfect excuse to announce the release of Release Candidate 2 of ProMesh.NET 2.0, which has built-in detection for BOM bytes on embedded resources :-)
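For the curious, this kind of detection boils down to checking the first three bytes of the resource before decoding it. Below is a minimal sketch in Python, purely as an illustration. It is not the actual ProMesh.NET code, and `read_script_text` is a made-up name.

```python
import codecs


def read_script_text(resource_bytes: bytes) -> str:
    """Decode an embedded .js resource as UTF-8, dropping a leading BOM."""
    # codecs.BOM_UTF8 is the byte sequence EF BB BF.
    if resource_bytes.startswith(codecs.BOM_UTF8):
        resource_bytes = resource_bytes[len(codecs.BOM_UTF8):]
    return resource_bytes.decode("utf-8")


print(read_script_text(b"\xef\xbb\xbfalert('hi');"))
# → alert('hi');
```

Files saved without the signature pass through unchanged, so the check is safe to apply to every resource.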

Thursday, September 25, 2008 9:27:45 PM (W. Europe Daylight Time, UTC+02:00)

# Friday, August 29, 2008

What's the hardest part of creating any open-source project?

No, not writing the code.

No, not helping users in the support forum...

I'll tell you: writing documentation, especially when the documentation is not in your native language. So that's what I have been doing for the last few weeks (or should I say "months"?).

In fact, the documentation project was the only thing that held back the release of ProMesh.NET version 2.0, and today I am proud to announce the release of the first release candidate of ProMesh.NET 2.0, with full documentation online. The documentation still needs a lot of work, but at least (almost) all the features of the framework have been documented.

So, what's new in ProMesh.NET 2.0 (RC1)?

  • Built-in routing engine. Routing can be added at runtime (at application startup), or can be specified using attributes on the controller classes and action methods.
  • Support for calling/rendering view components from templates. View components are special controllers that render a view, which will be inserted in the calling view.
  • Support for any page extension, including none at all (recommended for IIS 7 deployment)
  • New template engine (SharpTemplate.NET) which supports full C# expressions in view logic
  • Ajax validation of forms
  • Performance improvements

Want to check it out? Why don't you start where all the new stuff is: The Official ProMesh.NET Website

Or if you want to go straight to the download: Go to CodePlex

Friday, August 29, 2008 10:55:08 PM (W. Europe Daylight Time, UTC+02:00)

# Thursday, August 21, 2008

I'm a smartphone addict, I admit. At the same time, I am very picky about the requirements of a smartphone:

  1. 3G (HSDPA)
  2. Support for push e-mail (MS Exchange Server)
  3. Web browsing (even if it's flaky)
  4. Small form factor

Looking at the list, I may not be that picky after all. Feature number 2, however, forces me to use Windows Mobile, and to be honest, I never think twice about that when deciding on a smartphone purchase. So these were my latest phones (in order of purchase): the HTC MteoR, the HTC S730 and the HTC Diamond.

What do these have in common? Of course, they're all HTC, but most importantly they're all Windows Mobile devices.

You would think progress means that every generation of a technology improves on features, power and responsiveness, but once you mix in products made by some huge company from Redmond, this universal law no longer applies.

Of the smartphones listed above, the fastest one was the MteoR, followed by the S730 and ... well, I almost threw the damn thing into a concrete wall several times since I bought the Diamond 2 months ago.

What the hell is wrong with Windows Mobile? Why, after so many years of development and so many releases, can't they get their act together and create a user interface that responds within, like, 2 seconds of acting upon it? Tapping something on the screen has become an act of blind faith because there's no immediate feedback. You just hope something is going to happen, and when you're in a hurry (you always are, otherwise you wouldn't need a smartphone), you tap again, only to find out that the "system" was already processing your first tap and is now sending your second tap to the button or link that will be in that same spot on the next screen within 5 seconds or so. If you're lucky, that button will delete an important file or hang up on an important client. Thank you so much, mighty Microsoft.

How does Microsoft expect to win the battle against Apple or Google (Android) in the smartphone world if they can't even get the basic stuff right? Don't tell me it can't be done on small devices. Apple did it, and from what I've seen so far, Google did it.

One small piece of advice for the Windows Mobile developers: focus on the immediate feedback of the user interface, NOT on features. Rewrite all the user interface handling from scratch, put the user interface code in a dedicated area of memory that can't be "virtualized" (is there such a thing in Windows Mobile?) and get your act together.

My iPhone 3G will arrive tomorrow. Amen. (thank you Apple for licensing MS Exchange for the iPhone 3G)

Thursday, August 21, 2008 10:37:56 PM (W. Europe Daylight Time, UTC+02:00)