Cruftless URLs without breaking links

Update, the second: Well, so it’s not perfect after all. I’d figured that some of the problems encountered by other people had been fixed in the current version of Movable Type. However, it looks like it still insists on sending trackbacks to “/index.php” and on sending you back to “/index.php” when you complete a comment. It also seems to break autodetection of trackback URLs. There is no reason for any of this to be the default behavior!

Update: Oh, crap, this seems to break trackback autodetection…

Ok, so at least 90% of the point of this blog so far has been as a tool for me to learn by doing how modern web apps and design work. Sure, this is also a great creative outlet and a great way to share little tidbits with friends and family that are far away. As a consequence I find myself being interested in things that are really the purview of much larger sites. Sites that aren’t so obscure that they don’t even get spammed.1 But because the whole point is for me to learn, and because who knows, if I keep at this for years (and I don’t see why I won’t) this site might grow, I want to set things up as much “the right way” as I can while it is still relatively easy to make changes.

Out with the cruft!

One big issue when thinking about the long-term viability of a site is the ability to maintain links with a stable structure that won’t limit you in the future. The idea is that you want to have a good chance of surviving a redesign, a reorganization, or a change of software, implementation, or server without breaking any links (internal or external). Perhaps the biggest threat to that stability is the filename ending: “.html,” “.php,” etc. I’ve already gone through one of these breaks when I switched my filename endings from “.html” to “.php.” Sure, PHP is a great, fast, flexible language, but will it still be my choice in five years? And with a server-side language like that, why would I even consider exposing the implementation in my links?

There are a number of articles out there about solving this problem, and about solving it with Movable Type specifically, but since anything on the internet is a moving target, many of them are somewhat out of date (as this will be very soon, no doubt). Many of them are also needlessly complicated (for me anyway), since the easiest way to get rid of cruft in your URLs is to use the old-fashioned index capability that has been around forever: link to the folder instead of the file, and let the server automatically serve the “index.whatever” file inside.

Movable Type 3.3 comes with almost clean URLs to begin with. The category and year archives default to generating clean URLs, but for some reason, although the option is there, it isn’t set for individual entry archives. The path for individual entry archives in Movable Type now defaults to “/Year/Month/base_name.extension”, so my first entry was published to “/2006/08/first_entry.html” before I transitioned to PHP and “/2006/08/first_entry.php” after.

However, all you have to do is pick a different default from the drop-down menu on the publishing settings page of Movable Type, so that individual entry archives are published to “/Year/Month/base_name/index.extension”. All links and trackbacks to my first entry will now go to “/2006/08/first_entry/”, allowing the server to select the correct index whether it is “index.php” as it is now or “index.html” as it used to be.
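To make the index mechanism a little more concrete, here is a rough Python sketch of what the server does with a directory request. The set of published files and the candidate order are assumptions for illustration only; on Apache the real list comes from the DirectoryIndex setting, and a request without the trailing slash is normally redirected rather than silently fixed up as it is here.

```python
# Sketch of directory-index resolution. The candidate list and the
# published file set are hypothetical; Apache reads the real list from
# its DirectoryIndex configuration.
INDEX_CANDIDATES = ["index.php", "index.html"]


def resolve_directory(request_path, published_files):
    """Return the index file served for a directory URL, or None."""
    if not request_path.endswith("/"):
        # Simplification: treat "first_entry" like "first_entry/".
        request_path += "/"
    for name in INDEX_CANDIDATES:
        candidate = request_path + name
        if candidate in published_files:
            return candidate
    return None


published = {"/2006/08/first_entry/index.php"}
print(resolve_directory("/2006/08/first_entry/", published))
```

The link itself never names the extension, so swapping “index.php” for “index.html” only changes which candidate is found.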

Yay! Except now, I’ve broken all my links again. For the last time, perhaps, but broken nonetheless.

In with the regular expressions: fixing the links

This is where the more complicated aspects of the articles that I linked to come in. Moving forward all of my links should be future proof,2 but there should be a way to prevent my old links from breaking, right? There is. Actually there are a few different ways, but I used mod_rewrite. The idea behind mod_rewrite is that, instead of bouncing the browser from a bad requested URL to a good one, Apache simply and transparently serves the correct file (or query), using regular-expression-based rules to transform the old request into the new one. Best of all, using “.htaccess” files you can apply the rules on a per-directory basis.

The next advantage that I have is that because I only need to apply the changes into the past and not into the future, I can know exactly what I have in the folders that I need to change. This comes in handy because I happen to know that I don’t have any images or anything else other than my individual entry files. Any other files sit in folders inside the month folder. So I can safely change any “something.anything” pattern into “something/index.php”. This is where I needed a crash course on regular expressions which I finally found here.

I made a .htaccess file for each of the past month folders (only three for me) which look something like this:

RewriteEngine On
RewriteBase /2006/08
RewriteRule ^([^/]+)\.(.*)$ $1/index.php [L]

The first line simply tells the server to turn on rewriting.

The second line tells where to start the rewriting. In this case I’m rewriting the URLs for the 08 directory so I don’t need to pay attention to the URL before that point.

The RewriteRule is where the action is and what looked like complete gibberish to me until yesterday. The “^” matches to the start of the line so that my patterns have to match from the beginning (as set by the RewriteBase).

The parentheses enclose a portion to limit scope (just like in normal math expressions) and in this case they also remember whatever gets matched inside them, so I can refer to it later as “$1” in the rewritten URL.

The brackets enclose a character class, and a caret inside them means anything but this character class, so in this case, anything but a “/”. Left as is, that would only match a single character, but I want to match any string of one or more characters, none of which can be a “/” (which will limit me to the current directory). So I follow the character class with a “+”, which means one or more of these.

The next thing I want is to look for a “.” (which I have to look for using “\.” because “.” alone means something special: any character) followed by anything (which I look for with “.*”, meaning zero or more of any character).

So this pattern will match anything inside the 08 directory that ends with “.anything”.
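If you want to poke at the pattern without touching the server, something like the following Python sketch works. It uses Python’s re module as a stand-in for Apache’s regular expression engine, which is an assumption on my part, although a pattern this simple should behave the same way in both. The sample paths are made up.

```python
import re

# The same pattern as in the RewriteRule, tried out with Python's re
# module (a stand-in for Apache's regex engine, for illustration only).
PATTERN = re.compile(r"^([^/]+)\.(.*)$")


def split_request(path):
    """Return (name, extension) if the path matches, else None."""
    m = PATTERN.match(path)
    return (m.group(1), m.group(2)) if m else None


print(split_request("first_entry.php"))    # ('first_entry', 'php')
print(split_request("first_entry.html"))   # ('first_entry', 'html')
print(split_request("first_entry"))        # None: there is no "." to match
print(split_request("images/photo.jpg"))   # None: the "/" comes before the "."
```

The last case is the character class at work: because “[^/]+” cannot cross a slash, nothing in a subfolder is ever matched.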

Now I have to tell it what to replace it with, and that’s what the next part of the line does. It says “$1/index.php”. Remember that the $1 refers to the first thing I enclosed in parentheses above, which will be the filename without the “.extension” and the “/index.php” simply appends on to it.

So there you go, now I can convert requests for any entry name format. If they just request the entry name without the trailing slash, the default behavior will be to try it with a slash and find the index inside. If they append the old “.php” extension, it will match the expression and automatically pull up the index inside the folder of the same name, and if they apply any incorrect extension, it will match just as well.
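Here is a small Python sketch of the substitution step, again with Python’s re module standing in for mod_rewrite and with made-up request paths. It only models the pattern and the “$1/index.php” target, not RewriteBase or the trailing-slash handling the server does on its own.

```python
import re

RULE = re.compile(r"^([^/]+)\.(.*)$")


def rewrite(path):
    """Apply the rule once; return the path unchanged if it does not match."""
    m = RULE.match(path)
    # m.group(1) plays the role of $1 in the RewriteRule target.
    return m.group(1) + "/index.php" if m else path


for requested in ["first_entry.php", "first_entry.html", "first_entry.xyz",
                  "first_entry/", "first_entry/index.php"]:
    print(requested, "->", rewrite(requested))
```

The last two requests come back unchanged. In particular, the rewritten path “first_entry/index.php” does not match the pattern again, so the rule cannot loop on its own output.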

Tidying up

Now I’m almost done. The only remaining files that I have to deal with are a few top-level pages (my main photography page and Sasha’s and my main pages) and the master archives. To fix these the idea is the same: I just renamed them “index.php” and saved them into folders with their original names. In order to avoid leaving broken links to those pages I could use a redirect, but I decided that I was on a roll with the rewrites, so I just made some specific rewrite rules to match the old requests to the new pages.

There are still a couple of things that keep this from being a perfect solution. The first is that this does mean that I have to pay more attention to my directories than some might like. Since every file is named index, all the identifying information is in the name of the directory that encloses it.

Secondly (because I’m not going to make rewrite rules for folders for new months), going forward, I’m not going to be protected from someone choosing, for some reason, to link to a new page in the old way or to link directly to the index file with the wrong extension. I’m not too worried about that, though, because someone would really have to go out of their way to do the wrong thing. I could just decide never to put any image files or other resources directly in my month folders (always making a folder like “images” or “resources” within the month folder)3, and just apply the rewrite rule globally. But I don’t see the need, and this way the server doesn’t have to expend the energy on trying to second-guess users’ mistakes.
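To see why the “no loose files in the month folder” discipline would matter if the rule were applied everywhere, here is a quick Python check, once more using Python’s re module as a stand-in and a hypothetical list of month-folder contents.

```python
import re

RULE = re.compile(r"^([^/]+)\.(.*)$")


def would_rewrite(path):
    """True if the rule would send this request to some index.php."""
    return RULE.match(path) is not None


# Hypothetical contents of a month folder, relative to that folder.
contents = ["new_entry/index.php", "photo.jpg", "images/photo.jpg"]
for path in contents:
    print(path, "->", "rewritten" if would_rewrite(path) else "left alone")
```

A loose “photo.jpg” would be rewritten to “photo/index.php” and stop loading, while the copy inside “images/” is left alone because of the slash.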

The only real remaining nitpick that I have is that footnotes using Textile link directly to the index file instead of applying the “#fn1” directly to the directory. That is, if I click on the footnote it will take me from “entry_name/” to “entry_name/index.php#fn1” instead of “entry_name/#fn1”. As of right now this is the only place that seems to expose the underlying index file. I’m not entirely pleased with the footnote implementation anyway, so I might come up with a custom solution in the future.
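One possible shape for such a custom solution is a post-processing step that strips the index filename out of a link. The Python sketch below is purely hypothetical: strip_index is a name I made up, and nothing like it ships with Textile or Movable Type.

```python
import re

# Hypothetical helper, not a Textile or Movable Type feature. It removes
# a final "index.<ext>" path segment that comes right before a fragment,
# a query string, or the end of the link.
INDEX_LINK = re.compile(r"(^|/)index\.[A-Za-z0-9]+(?=#|\?|$)")


def strip_index(href):
    """Return href with the exposed index file removed, if there is one."""
    return INDEX_LINK.sub(r"\1", href, count=1)


print(strip_index("entry_name/index.php#fn1"))  # entry_name/#fn1
print(strip_index("entry_name/#fn1"))           # unchanged
```

Keeping the leading slash in the replacement means the link still points at the directory, so the server picks the index as usual.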

1 Yes, it’s true. To date—apart from a brief spurt—I only receive about one spam trackback a week and I’m not sure if I’ve gotten more than one or two spam comments. All of them have been caught by the software so far. And believe me, I am not complaining about this!

2 Yes, there is the problem that it would still be technically possible for someone to link directly to the “index.php” file instead of the directory. But they’d have to go out of their way to do so, since all of the links that I will serve go to the directory, so I’m not going to worry about it. If I really had to, I could fix this as well, but it’s messier, and I like the simplicity of the index approach.

3 This wouldn’t be likely to be a problem for me since I use Amazon S3 to store and serve up my image resources, but who knows what I’ll want to do in the future.

You are visiting Tapirtype Blog. Unless otherwise noted, all content is © 2006-2008 by Sasha Kopf and Michael Boyle, some rights reserved. Site design by Michael Boyle modified from the standard Movable Type templates. I've made an attempt to generate standards compliant content which should look best in Safari or, otherwise, Firefox. Use of Internet Explorer may be harmful to your sanity and I've made little attempt to support it.

This page is published using Movable Type 4.1