Drupal and clean URLs

What do I mean by "clean" URLs? The URL is of course the string appearing in the browser location bar, and the name stands for 'Uniform Resource Locator'. In theory the URL is consumed by software, so maybe it doesn't matter to the user what the URL contains. However, people do share URLs with each other: they read them over the phone, write them down on note paper, post them on billboards, and so on. For those human uses it helps to make URLs readable by humans.

It's common in website management systems that depend on dynamic page generation for the generated URLs to be friendly to software rather than to people. Drupal is no different. Though the out-of-the-box Drupal URL format isn't terribly bad, it can be improved.

Out of the box, the default Drupal URL format is: example.com/?q=node/12345

Clean URLs

The "?q=" portion isn't terribly egregious, but it can be removed pretty simply. First, you must run your website under the Apache web server with the mod_rewrite module enabled. Those conditions are often the default in typical web hosting plans, and are easy to satisfy. I happen to think it's a bug in Drupal that it inserts "?q=" into the URL, but getting the Drupal community to change this detail isn't worth my time. Fortunately the Drupal community has made it trivial to work around the bug. Visit the Administer area, go to Clean URLs (yoursite.com/admin/settings/clean-urls) and you will find a button that tests whether your web server supports clean URLs. Once Drupal has tested your site for compatibility, it offers buttons to enable or disable clean URLs.

As I noted in "First impressions of Drupal 6", this workaround has been made even simpler in the upcoming Drupal 6.

With clean URLs enabled, the default URL format is: example.com/node/12345
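For the curious, the relationship between the two formats can be pictured in a few lines of Python. This is only a sketch under my own assumptions: Drupal is written in PHP and the real work is done by an Apache rewrite rule, and the function name to_internal_path is something I made up for illustration. The point is that both URL forms carry the same internal path.

```python
from urllib.parse import urlsplit, parse_qs


def to_internal_path(url: str) -> str:
    """Return the internal path (e.g. 'node/12345') for either URL form.

    Hypothetical helper for illustration only; not part of Drupal.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if "q" in query:
        # Default format: the path travels in the 'q' query parameter.
        return query["q"][0].strip("/")
    # Clean format: the path is simply the URL path itself.
    return parts.path.strip("/")


if __name__ == "__main__":
    print(to_internal_path("http://example.com/?q=node/12345"))
    print(to_internal_path("http://example.com/node/12345"))
```

Both calls yield the same internal path, which is why turning clean URLs on or off does not change which page is served.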

This clean format has the advantage of being short and sweet: it is easy to write on a piece of paper, post on a billboard, and so on. But we can go even further.

URL Aliases

Staying for a moment with out-of-the-box behavior, in the node create/edit form there is a section labeled 'URL path settings' (it appears when the core Path module is enabled). If you enter a value into this box, Drupal will store an alias for the standard URL of this node. The standard URL for any node is "node/12345", where 12345 is simply the node number. If there is an alias in the URL alias table for a node, Drupal will display that alias and translate it behind the scenes to the standard URL.

Normally you want to leave the "URL path settings" as-is, but it can make sense to supply a URL alias. As explained earlier, humans want to read and understand URLs, and having words in the URL can explain the purpose or topic of the page located there. Another benefit arises if you are converting a site that was built with normal webpage editing software: aliases make it easier to transition users from the old site to a new one built using Drupal. For example, you might have example.com/about.html as a page on an old site, and if you enter "about.html" in the URL path settings box then Drupal will present the page as if it were "about.html".
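The alias translation described above amounts to a two-way lookup. Here is a minimal Python sketch of the idea, assuming a tiny in-memory table. The table contents and the function names resolve_incoming and alias_for are hypothetical; Drupal keeps its aliases in the database and does this in PHP.

```python
# Hypothetical alias table: alias -> standard (system) path.
URL_ALIASES = {
    "about.html": "node/12345",
}


def resolve_incoming(path: str) -> str:
    """Translate a requested path to the standard path, if it is an alias."""
    return URL_ALIASES.get(path, path)


def alias_for(system_path: str) -> str:
    """Return the alias to display for a standard path, or the path itself."""
    for alias, source in URL_ALIASES.items():
        if source == system_path:
            return alias
    return system_path


if __name__ == "__main__":
    print(resolve_incoming("about.html"))  # the node behind the alias
    print(alias_for("node/12345"))         # the alias shown in links
```

A path with no alias passes through both functions unchanged, which matches the behavior of leaving the URL path settings box empty.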

Pathauto

The Pathauto module produces URLs which carry more information. It works by automatically generating a URL alias of the kind discussed above.

With Pathauto the URL format is highly configurable, and the module automatically generates a more readable URL. For example: example.com/blog/echuckj5/1872 or example.com/forum/motorcycles-and-large-scooters/1899. This sort of URL says a lot more to someone reading it, because it hints at the likely topic of the page.

It works by taking various parameters of the page and transforming them to form the URL. The Pathauto authors have provided a lot of flexibility so that your URLs can appear the way you want them.

To start, go into the Administer section and click on Pathauto. There are many sections to this page, but I want to start with the most critical one, labeled Node path settings. Click on it to expand the section.

You're shown a list of node types with some boxes containing strings like: [type]/[cat]/[title]/[nid]. So what does this mean? Pathauto uses a 'token' paradigm where '[type]' is a token. It doesn't appear in the generated URL as '[type]' but instead as 'blog' or 'forum' or whatever the node type is. These token strings let you specify how the URL alias is actually formulated.

First, it is helpful to make different URLs based on the type of node. This lets the reader know whether it's a forum posting, a blog posting, a website link, etc. In the above string '[type]' represents the node type, and when the URL is generated the token is replaced with the actual type name. But sometimes the type string is too long for my taste, such as 'blog-entry'. So what I do is specify a pattern in the box labeled Pattern for all Blog entry paths. The pattern starts with 'blog/' and says the same thing more succinctly than a URL pattern starting with 'blog-entry/'.
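To show what token substitution boils down to, here is a small Python sketch. It is my own illustration, not Pathauto's code: the function names slugify and expand_pattern are made up, and the cleanup rule (lowercase, anything that is not a letter or digit becomes a hyphen) is a simplifying assumption. The real module has many more options.

```python
import re


def slugify(text: str) -> str:
    """Lowercase the text and collapse non-alphanumeric runs into hyphens.

    Simplified assumption; the real cleanup rules are configurable.
    """
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def expand_pattern(pattern: str, values: dict) -> str:
    """Replace each [token] in the pattern with its cleaned-up value.

    Raises KeyError if the pattern uses a token missing from values.
    """
    def substitute(match: re.Match) -> str:
        return slugify(str(values[match.group(1)]))

    return re.sub(r"\[([a-z-]+)\]", substitute, pattern)


if __name__ == "__main__":
    node = {"type": "forum", "title": "Motorcycles and Large Scooters", "nid": 1899}
    print(expand_pattern("[type]/[title]/[nid]", node))
    # A literal prefix such as 'blog/' is copied through untouched.
    print(expand_pattern("blog/[author-name]/[nid]", {"author-name": "echuckj5", "nid": 1872}))
```

Literal text in the pattern, such as a leading 'blog/', is left alone, which is why a hand-written prefix can replace a long '[type]' value.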

The succinctness of generated URLs is important, and some of the Pathauto options can generate enormously long URLs. Succinct URLs are easier to use: easier to write on paper, post on a billboard, or read over a phone. For example, the '[bookpath]' or '[catpath]' tokens appear tempting because they really get down to the topical details of the page in question. But they make URLs that are too long for easy use.

The other sections are important but are probably best left to their default settings.

The Blog path settings section controls the URL for a given user's blog. The default is 'blog/123' (123 = the user ID) and Pathauto makes this 'blog/user-name', which is much friendlier. If you additionally set the pattern for blog entries to 'blog/[author-name]/[nid]-[title]', the URL for each blog posting appears to sit beneath the URL for the user's blog.

The Category path settings section controls the URLs for taxonomy (category) terms. By default the URLs are numerical in nature, and Pathauto's default pattern makes them very human readable.

The User path settings section controls the URL for the user profile page. By default the URLs are numerical in nature, and Pathauto's default pattern makes them very human readable.

Global Redirect

The last module to consider is Global Redirect. It does several useful but technically arcane things. I believe this module should be in Drupal core.

When there is a URL alias on a page, a user can visit that page using either its standard URL (node/12345) or its alias (blog/joeblow/12345). Some (or all?) search engines penalize sites which have duplicate content, and this situation is sure to make the search engine think there is duplicate content.

Global Redirect detects when the standard URL (node/12345) is requested and issues an HTTP redirect to the URL alias (if one exists). If there is no URL alias then Global Redirect does nothing.
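The decision Global Redirect makes can be sketched in Python as follows. This is an illustration under my own assumptions, not the module's code: the alias table, the function name redirect_for, and the choice of a 301 (permanent) status code are all mine, since the description above only says "an HTTP redirect".

```python
from typing import Optional, Tuple

# Hypothetical table: standard (system) path -> alias.
ALIAS_BY_SYSTEM_PATH = {
    "node/12345": "blog/joeblow/12345",
}


def redirect_for(requested_path: str) -> Optional[Tuple[int, str]]:
    """Return (status, location) if the request should be redirected.

    Returns None when the requested path has no alias, meaning the
    request is served as-is.
    """
    alias = ALIAS_BY_SYSTEM_PATH.get(requested_path.strip("/"))
    if alias is None:
        return None
    # 301 is assumed here so that search engines index only the alias.
    return (301, "/" + alias)


if __name__ == "__main__":
    print(redirect_for("node/12345"))          # redirected to the alias
    print(redirect_for("blog/joeblow/12345"))  # already the alias: no redirect
    print(redirect_for("node/999"))            # no alias: no redirect
```

Because only the standard path triggers a redirect, each page ends up reachable at a single public URL, which is the duplicate-content fix described above.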