“A nearly impenetrable thicket of geekitude…”

Cleaner URLs

Posted on January 19, 2018 at 09:47

One thing I’ve wanted to do for a long time is move this site further towards the use of clean URLs. I am currently migrating to a static-site generator and that seemed like the ideal time. Here are a couple of tricks I’ve used to get clean URLs for my older content without breaking bookmarks.

As I’ve described in the colophon, this site has a long history and some of that is mirrored in the URLs used for content. For example, a URL like this might date from a time when I was using Adobe GoLive on the Mac:

/foo/bar/baz.html

I’ve omitted the scheme and host name for clarity here.

If the content is even older — dating back before 2006, when I still worked mainly on Windows — it might have looked like this instead:

/foo/bar/baz.htm

In these modern times, it’s regarded as best practice to use a clean URL instead:

/foo/bar/baz

That’s the kind of URL that modern content management systems like Drupal use by default, and it’s also one of the standard ways of using static-site generators like Nanoc. In the latter case, the file system actually contains this file:

/foo/bar/baz/index.html

The web server will respond with the contents of that file if either /foo/bar/baz or /foo/bar/baz/ is requested; in the former case, it does so by first redirecting to the latter.

It’s easy, then, to generate a static site whose content has clean URLs. The interesting problem is to allow people to get to that content when they use the older URL with the extension. One possibility is simply to add a large number of statements like this to the site’s .htaccess file:

Redirect 301 /foo/bar/baz.htm /foo/bar/baz/
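If you do go down this road, the list need not be typed by hand. Here is a minimal sketch in Python, assuming the generated site lives in a local output directory and that every page directory once had a matching legacy file name. The function name redirect_lines and its parameters are hypothetical, not part of Nanoc or Apache:

```python
from pathlib import Path
from typing import List


def redirect_lines(output_dir: Path, old_extension: str = ".htm") -> List[str]:
    """Build one 'Redirect 301' line per page directory under output_dir.

    Assumption: each directory containing an index.html corresponds to a
    legacy URL with the same path plus old_extension.
    """
    lines = []
    for index_file in sorted(output_dir.rglob("index.html")):
        page_dir = index_file.parent.relative_to(output_dir)
        if page_dir == Path("."):
            # The site root has no legacy file name to redirect from.
            continue
        clean = "/" + page_dir.as_posix()
        lines.append(f"Redirect 301 {clean}{old_extension} {clean}/")
    return lines
```

Each line it returns has the same shape as the Redirect statement above, ready to be pasted into .htaccess.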

This assumes the site is served up by the Apache web server. Although I do use this approach in some places, I think the following approach to the clean URL problem is much more elegant:

RewriteCond "%{REQUEST_FILENAME}" !-f
RewriteCond "%{DOCUMENT_ROOT}/$1/index.html" -f
RewriteRule "^(.*)\.html?$" "/$1/" [Redirect=permanent,Last]

Those three lines (two conditions and one rule) don’t operate in quite the order you might expect, so here’s a quick breakdown:

  • The first step is, perhaps surprisingly, the pattern match in the RewriteRule on the third line. Here we check that the incoming path (stripped of its leading slash, because we’re in a directory context) matches the regular expression ^(.*)\.html?$. The anchors ^ and $ ensure that the whole path is matched, not just part of it.

    As a result, we will only attempt to rewrite paths ending in either .htm or .html. We collect the rest of the path as $1. In the case of our example /foo/bar/baz.htm, this means that $1 will become: foo/bar/baz

  • The next check comes from the first line: the request for /foo/bar/baz.htm must not resolve to a real file. If it does, we don’t perform the rewrite.

  • The final check is whether the clean URL corresponds to a directory in the file system in which an index.html file exists. If it doesn’t, there’s no point in rewriting to the clean URL.

  • If all of these checks succeed, then the right-hand side of the RewriteRule comes into play and we rewrite the request as /$1/; for our example, this will be /foo/bar/baz/, the clean URL we’re looking for.

    Note that we rewrite to the version of the clean URL with the trailing slash; this is just an optimisation to prevent the browser coming back with /foo/bar/baz and being immediately redirected to /foo/bar/baz/ anyway. It’s only one round trip, but there’s no reason not to be nice.

    After the rewrite has been performed, the flags [Redirect=permanent,Last] end rule processing for this request and send the requester a permanent (301) redirect to the rewritten URL. Apache takes care of filling in the appropriate scheme and host name, so we don’t need to.
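The order of evaluation described above can be sketched in Python. This is only a model of the decision logic, not of mod_rewrite itself: it assumes the .htaccess file sits at the document root, and the function name clean_url_redirect is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Same pattern as the RewriteRule: anything ending in .htm or .html.
PATTERN = re.compile(r"^(.*)\.html?$")


def clean_url_redirect(request_path: str, document_root: Path) -> Optional[str]:
    """Return the redirect target for a legacy URL, or None to leave it alone."""
    # Directory context: the leading slash is stripped before matching.
    relative = request_path.lstrip("/")

    # Step 1: the RewriteRule pattern is tried first and captures $1.
    match = PATTERN.match(relative)
    if match is None:
        return None
    stem = match.group(1)

    # Step 2: first RewriteCond, the request must not be a real file.
    if (document_root / relative).is_file():
        return None

    # Step 3: second RewriteCond, <root>/$1/index.html must exist.
    if not (document_root / stem / "index.html").is_file():
        return None

    # All checks passed: redirect to the clean URL with a trailing slash.
    return f"/{stem}/"
```

With a document root containing foo/bar/baz/index.html, a request for /foo/bar/baz.htm yields /foo/bar/baz/, while a request for a legacy file that still exists on disk yields None and is served as before.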

A second, related, rewrite takes care of anyone who has somehow acquired a reference to a clean URL’s underlying index file:

RewriteCond "%{DOCUMENT_ROOT}/$1/index.html" -f
RewriteRule "^(.*)/index\.html?$" "/$1/" [Redirect=permanent,Last]

Simply put: if the request ends with /index.htm or /index.html, and the directory it names really does contain an index.html file, redirect to the clean URL without the file name.
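The same kind of Python model works for this second rule. As before, this is a sketch under the assumption that the .htaccess file sits at the document root, and index_redirect is a hypothetical name:

```python
import re
from pathlib import Path
from typing import Optional

# Same pattern as the second RewriteRule: a path ending in /index.htm(l).
INDEX_PATTERN = re.compile(r"^(.*)/index\.html?$")


def index_redirect(request_path: str, document_root: Path) -> Optional[str]:
    """Return the clean URL for a request naming an index file, else None."""
    # Directory context: the leading slash is stripped before matching.
    relative = request_path.lstrip("/")

    # The pattern is tried first and captures the directory part as $1.
    match = INDEX_PATTERN.match(relative)
    if match is None:
        return None
    stem = match.group(1)

    # RewriteCond: only redirect if <root>/$1/index.html really exists.
    if not (document_root / stem / "index.html").is_file():
        return None

    return f"/{stem}/"
```

In this model a request for /foo/bar/baz/index.htm is sent to /foo/bar/baz/ even though only index.html exists on disk, because the condition always tests for index.html. A top-level /index.html is left alone, since once the leading slash is stripped there is no slash left for the pattern to match.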
