Cleaner URLs
One thing I’ve wanted to do for a long time is move this site further towards the use of clean URLs. I am currently migrating to a static-site generator and that seemed like the ideal time. Here are a couple of tricks I’ve used to get clean URLs for my older content without breaking bookmarks.
As I’ve described in the colophon, this site has a long history and some of that is mirrored in the URLs used for content. For example, a URL like this might date from a time when I was using Adobe GoLive on the Mac:
/foo/bar/baz.html
I’ve omitted the scheme and host name for clarity here.
If the content is even older — dating back before 2006, when I still worked mainly on Windows — it might have looked like this instead:
/foo/bar/baz.htm
In these modern times, it’s regarded as best practice to use a clean URL instead:
/foo/bar/baz
That’s the kind of URL that modern content management systems like Drupal use by default, and it’s also one of the standard ways of using static-site generators like Nanoc. In the latter case, the file system actually contains this file:
/foo/bar/baz/index.html
The web server will respond with the contents of that file if either /foo/bar/baz or /foo/bar/baz/ is requested; in the former case, it first redirects the browser to the latter.
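To make that behaviour concrete, here is a simplified Python model of how a request path maps onto the file system. It is only a sketch of the behaviour described above, not of how any particular web server is implemented; the function name respond and the doc_root parameter are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def respond(doc_root: Path, url_path: str) -> Tuple[str, str]:
    """Simplified model: return (action, detail) for a requested URL path."""
    target = doc_root / url_path.lstrip("/")
    if target.is_dir():
        if not url_path.endswith("/"):
            # Directory requested without the trailing slash: redirect.
            return ("redirect", url_path + "/")
        index = target / "index.html"
        if index.is_file():
            # Directory requested with the slash: serve its index file.
            return ("serve", index.relative_to(doc_root).as_posix())
        return ("not found", url_path)
    if target.is_file():
        return ("serve", target.relative_to(doc_root).as_posix())
    return ("not found", url_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "foo/bar/baz").mkdir(parents=True)
        (root / "foo/bar/baz/index.html").write_text("hello")
        print(respond(root, "/foo/bar/baz"))   # ('redirect', '/foo/bar/baz/')
        print(respond(root, "/foo/bar/baz/"))  # ('serve', 'foo/bar/baz/index.html')
```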
It’s easy, then, to generate a static site whose content has clean URLs.
The interesting problem is to allow people to get to that content when
they use the older URL with the extension. One possibility is simply to
add a large number of statements like this to the site’s .htaccess
file:
Redirect 301 /foo/bar/baz.htm /foo/bar/baz/
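If the list of clean URLs is already known, statements like that don’t have to be typed by hand. The following is a minimal Python sketch rather than anything this site actually uses; the function name redirect_lines, and the choice to emit one line per legacy extension, are assumptions made for illustration.

```python
from typing import Iterable, List


def redirect_lines(clean_paths: Iterable[str],
                   extensions: Iterable[str] = (".htm", ".html")) -> List[str]:
    """Build one 'Redirect 301' line per clean path and legacy extension."""
    extensions = tuple(extensions)  # allow reuse across every path
    lines = []
    for path in clean_paths:
        base = path.rstrip("/")
        if not base:
            # The site root has no legacy file name to redirect from.
            continue
        for ext in extensions:
            lines.append(f"Redirect 301 {base}{ext} {base}/")
    return lines


if __name__ == "__main__":
    # Hypothetical input: the clean URL used as the running example.
    for line in redirect_lines(["/foo/bar/baz/"]):
        print(line)
    # Redirect 301 /foo/bar/baz.htm /foo/bar/baz/
    # Redirect 301 /foo/bar/baz.html /foo/bar/baz/
```

The output can be pasted into .htaccess, but the file still grows with every page on the site.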
Redirect is an Apache directive, so this assumes the site is served by the Apache web server. Although I do use this approach in some places, I think the following approach to the clean URL problem is much more elegant:
RewriteCond "%{REQUEST_FILENAME}" !-f
RewriteCond "%{DOCUMENT_ROOT}/$1/index.html" -f
RewriteRule "^(.*)\.html?$" "/$1/" [Redirect=permanent,Last]
Those three lines (two conditions and one rule) don’t operate in quite the order you might expect, so here’s a quick breakdown:
- The first operation is, perhaps surprisingly, related to the first part of the RewriteRule on the third line. Here we check that the incoming path (stripped of its leading slash, because we’re in a directory context) matches the regular expression ^(.*)\.html?$. The anchors ^ and $ ensure that the whole request is being matched.

  As a result, we will only attempt to rewrite paths ending in either .htm or .html. We collect the rest of the path as $1. In the case of our example /foo/bar/baz.htm, this means that $1 will become foo/bar/baz.

- The next check is from the first line: check that the request for /foo/bar/baz.htm does not resolve to a real file. If it does, we don’t perform the rewrite.

- The final check is whether the clean URL corresponds to a directory in the file system in which an index.html file exists. If it doesn’t, there’s no point in rewriting to the clean URL.

- If all of these checks succeed, then the right-hand side of the RewriteRule comes into play and we rewrite the request as /$1/; for our example, this will be /foo/bar/baz/, the clean URL we’re looking for.

  Note that we rewrite to the version of the clean URL with the trailing slash; this is just an optimisation to prevent the browser coming back with /foo/bar/baz and being immediately redirected to /foo/bar/baz/ anyway. It’s only one round trip, but there’s no reason not to be nice.

  After performing the rewrite, the options [Redirect=permanent,Last] cause processing for this request to end, and for the requester to be redirected to the rewritten URL. Apache takes care of filling in the appropriate scheme and host name, so we don’t need to.
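The order of evaluation described in that breakdown can be modelled in a few lines of Python. This is only a rough sketch of the logic, not of how mod_rewrite is implemented: the function name redirect_target and the doc_root parameter are hypothetical, and REQUEST_FILENAME is approximated as the document root joined with the request path.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Same pattern as the RewriteRule: anything ending in .htm or .html.
PATTERN = re.compile(r"^(.*)\.html?$")


def redirect_target(doc_root: Path, request_path: str) -> Optional[str]:
    """Return the clean URL to redirect to, or None to leave the request alone."""
    # Directory context: the pattern sees the path without its leading slash.
    relative = request_path.lstrip("/")

    # Step 1: the RewriteRule pattern is tested first and captures $1.
    match = PATTERN.match(relative)
    if match is None:
        return None
    stem = match.group(1)

    # Step 2: first RewriteCond, the request must not name a real file.
    if (doc_root / relative).is_file():
        return None

    # Step 3: second RewriteCond, $1/index.html must exist.
    if not (doc_root / stem / "index.html").is_file():
        return None

    # Step 4: substitute, with the trailing slash.
    return f"/{stem}/"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "foo/bar/baz").mkdir(parents=True)
        (root / "foo/bar/baz/index.html").write_text("new page")
        (root / "legacy.html").write_text("still a real file")
        print(redirect_target(root, "/foo/bar/baz.htm"))  # /foo/bar/baz/
        print(redirect_target(root, "/legacy.html"))      # None
        print(redirect_target(root, "/missing.html"))     # None
```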
A second, related, rewrite takes care of anyone who has somehow acquired a reference to a clean URL’s underlying index file:
RewriteCond "%{DOCUMENT_ROOT}/$1/index.html" -f
RewriteRule "^(.*)/index\.html?$" "/$1/" [Redirect=permanent,Last]
Simply put: if the request ends with /index.htm or /index.html, and the directory it names really does contain an index.html file, redirect to the clean URL without the file name.
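As a rough Python model of this rule (a sketch only: the function name index_redirect_target and the doc_root parameter are hypothetical, and nothing here reflects how Apache implements the rewrite internally):

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Same pattern as the RewriteRule: a path ending in /index.htm or /index.html.
INDEX_PATTERN = re.compile(r"^(.*)/index\.html?$")


def index_redirect_target(doc_root: Path, request_path: str) -> Optional[str]:
    """Return the clean URL for an explicit index-file request, else None."""
    # Directory context: the pattern sees the path without its leading slash.
    relative = request_path.lstrip("/")
    match = INDEX_PATTERN.match(relative)
    if match is None:
        return None
    directory = match.group(1)

    # RewriteCond: the directory must really contain an index.html file.
    if not (doc_root / directory / "index.html").is_file():
        return None
    return f"/{directory}/"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "foo/bar/baz").mkdir(parents=True)
        (root / "foo/bar/baz/index.html").write_text("new page")
        print(index_redirect_target(root, "/foo/bar/baz/index.htm"))   # /foo/bar/baz/
        print(index_redirect_target(root, "/foo/bar/baz/index.html"))  # /foo/bar/baz/
        print(index_redirect_target(root, "/elsewhere/index.html"))    # None
```

In this model, a request for /index.html at the site root is not matched, because once the leading slash is stripped there is no slash left in front of index.html.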