The evolution of the World Wide Web (WWW) has seen URLs go from a very simple format to increasingly complicated ones with the advent of database-driven sites, and now back to simple. We look at how best to achieve search-engine-friendly URLs using mod_rewrite.
1. Background
In the early days of the WWW, websites were built by hand with each page saved as an individual HTML file. This allowed them to be easily indexed, copied and moved around as each element of the site was a unique file.
The next development was CGI scripts to serve dynamic content. This led to URIs such as:
http://www.example.net/cgi-bin/page.cgi
http://www.example.net/cgi-bin/script.pl?param1=val1&param2=val2
The addresses could be cleaned up a bit, with the GET parameters being incorporated into the URI:
http://www.example.net/cgi-bin/script/param1/param2
Then along came various scripting languages that allowed pages to be dynamic without having to sit in a specific directory:
http://www.example.net/script.php?param1=val1&param2=val2
http://www.example.net/script/param1/param2
The second example in this case would normally use Content Negotiation (Apache's MultiViews option) to recognise the address as a call to the script, with the remainder of the address (/param1/param2) accessible as the PATH_INFO environment variable.
It was around this time that search engines such as Google emerged, and suddenly everything had to be 'indexable' by search engine spiders with names such as Googlebot and Slurp (and now msnbot). The 'holy grail' was to make the site look as if it was made of hand-crafted HTML files. In other words, we've come full circle.
The examples below describe the process of achieving this goal of simple URLs for complex dynamic content.
2. Search Engine Friendly URLs
A typical component of a website is a 'latest news' database. Individual news items would be accessed as:
http://www.example.net/news.php?id=20050701
(a call to the news.php script passing a single GET parameter which identifies which item to display)
Now we introduce the RewriteRule:
RewriteRule ^news/([0-9]+) /news.php?id=$1
Translation:
- IF the request starts with news/ followed by one or more digits;
- THEN call the news.php script with the id parameter set to those digits.
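For context, here is a minimal sketch of where this rule might live. It assumes the rule is placed in an .htaccess file in the document root, that mod_rewrite is loaded, and that the server configuration allows rewrite directives there (AllowOverride FileInfo or broader). The [L] flag is an addition not shown in the rule above; it ends the current pass through the rule set once the rule matches.

```apache
# .htaccess in the document root (assumed location)

# Rewrite rules are ignored unless the engine is switched on
RewriteEngine On

# /news/<digits>  ->  /news.php?id=<digits>
# This is an internal rewrite: the address shown in the browser does not change
RewriteRule ^news/([0-9]+) /news.php?id=$1 [L]
```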
The URI then becomes:
http://www.example.net/news/20050701
Or, because we did not anchor the end of the regular expression with $, we can also use:
http://www.example.net/news/20050701.html
There is a slight problem here in that, if Content Negotiation is enabled, this URI could be taken as a call to the news.php script with the rest of the URL (/20050701.html) being unused. This is because there is no file called /news/20050701.html, and no file called /news, but /news.php does exist and Content Negotiation is all about finding that out. In that situation the correct news item won't be displayed.
The solution? We could re-name the script, change the format of the URI, or turn off Content Negotiation (which you might want to do in any case), but there's a simpler option:
RewriteRule ^news/([0-9]+) /scripts/news.php?id=$1
By moving the script into a sub-directory, which we can do now because it's no longer called 'in place', we avoid any chance of conflict. The /scripts/ directory can, and should, now be secured to avoid direct access.
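Two of the points above can be sketched in configuration. The first directive turns off MultiViews, the Apache option behind the Content Negotiation behaviour described earlier; in .htaccess it needs AllowOverride Options. The second part is one possible way to refuse direct requests for /scripts/ while still allowing internal rewrites, by testing the original request line in %{THE_REQUEST}. Treat this as a sketch that assumes the rules live in the document root's .htaccess; a server-level configuration would need different patterns.

```apache
# Stop /news/... from being negotiated to /news.php
Options -MultiViews

RewriteEngine On

# THE_REQUEST is the request line sent by the client, e.g.
# "GET /scripts/news.php?id=20050701 HTTP/1.1". Internal rewrites do not
# change it, so this rule only catches direct access and answers 403.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s+/scripts/
RewriteRule ^scripts/ - [F]

# Friendly address -> relocated script
RewriteRule ^news/([0-9]+) /scripts/news.php?id=$1 [L]
```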
One of the major benefits of using rewrite rules and 'hiding' the script is that ONLY requests matching the regular expression can reach it. In this case it's not possible for someone to pass a non-numeric parameter: a request such as /news/abc matches no rule and results, rightly, in a 404 Not Found response.
An even better rule in this case could be:
RewriteRule ^news/([0-9]{8}) /scripts/news.php?id=$1
or, if you want to be really strict and enforce the .html extension:
RewriteRule ^news/([0-9]{8})\.html$ /scripts/news.php?id=$1
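To make the difference between the three patterns concrete, the comments in the sketch below list a few request paths and what each pattern does with them. The sample paths are illustrative only, and the paths are written without a leading slash, as they appear to a rule in a document-root .htaccess (an assumption about where the rules are placed).

```apache
# Pattern                    Request path           Result
# ^news/([0-9]+)             news/20050701          match, id=20050701
#                            news/20050701.html     match, id=20050701
#                            news/123abc            match, id=123
#                            news/abc               no match
# ^news/([0-9]{8})           news/20050701.html     match, id=20050701
#                            news/123               no match
# ^news/([0-9]{8})\.html$    news/20050701.html     match, id=20050701
#                            news/20050701          no match
#                            news/20050701.htmlx    no match
RewriteRule ^news/([0-9]{8})\.html$ /scripts/news.php?id=$1 [L]
```

Only the fully anchored pattern rejects trailing junk, which also means each news item is reachable at exactly one address.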
Next we look at what to do if other sites or search engines are already linking to your dynamic pages.
3. Converting Dynamic to Search Engine Friendly URLs
If you follow the example in the previous section, you might see a lot of 404s in your logs because addresses that used to work no longer do.
Wouldn't it be great if we could set up a PURL (persistent URL) to handle this?
It's actually quite simple:
RewriteCond %{QUERY_STRING} ^id=([0-9]+)
RewriteRule ^news\.php /news/%1.html? [R=301,L]
Note: regular expression matches from a RewriteCond are referenced using % (e.g. %1) whereas those in a RewriteRule are referenced using $ (e.g. $1).
Translation:
- IF the query string starts with id=(one or more digits)
- AND the request is for /news.php
- THEN redirect (301) to the search-engine-friendly URL with no query string
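The redirect and the internal rewrite from the previous section can be combined as sketched below. Two assumptions are made here. First, the script has been moved to /scripts/news.php: if it stayed at /news.php, the internally rewritten request would likely match the redirect rule when the .htaccess rules are re-applied, causing a redirect loop. Second, ids are taken to be exactly eight digits so that both rules accept the same set of values.

```apache
RewriteEngine On

# 1. Old dynamic address -> 301 redirect to the friendly address.
#    The trailing "?" on the target drops the original query string
#    (Apache 2.4 also offers the QSD flag for this).
RewriteCond %{QUERY_STRING} ^id=([0-9]{8})(&|$)
RewriteRule ^news\.php$ /news/%1.html? [R=301,L]

# 2. Friendly address -> internal rewrite to the relocated script.
RewriteRule ^news/([0-9]{8})\.html$ /scripts/news.php?id=$1 [L]
```

With these rules, a request for /news.php?id=20050701 is answered with a redirect to /news/20050701.html, which is then served by the script without the address changing again.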
The reason for having a 301 Permanent redirect is that search engines such as Google will take that to mean that the previously indexed page now exists at the new location, and pass on any PageRank accumulated at the old address.
You also don't want there to be multiple ways to access the same content as that can trigger a duplicate content penalty with the search engines.
If you change your URL structure more than once over time, you might end up with a chain of 301 redirects leading from the oldest to the newest format, so it's a good idea to map everything out on paper or on a development server before going live.
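One way to avoid such a chain is to point every retired format straight at the newest one, rather than at its immediate successor. In the sketch below, the /article/ format is purely hypothetical and stands in for an intermediate URL scheme; the eight-digit id is likewise an assumption.

```apache
RewriteEngine On

# Oldest format: /news.php?id=20050701 -> newest format, in one hop
RewriteCond %{QUERY_STRING} ^id=([0-9]{8})(&|$)
RewriteRule ^news\.php$ /news/%1.html? [R=301,L]

# Intermediate format (hypothetical): /article/20050701 -> newest format,
# also in one hop rather than via any other retired address
RewriteRule ^article/([0-9]{8})$ /news/$1.html [R=301,L]
```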
4. Common Mistakes
Trying to match a query string in the RewriteRule
The query string is never visible to the rewrite rule - RewriteRule only sees the address portion of the request. As shown above you need to use a RewriteCond on %{QUERY_STRING} before your RewriteRule.
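A side-by-side sketch of the mistake and the fix, reusing the news example from earlier. The first rule is commented out because it can never match.

```apache
# Never matches: the pattern is tested against the path only ("news.php"),
# so the "?id=..." part is simply not there to match
# RewriteRule ^news\.php\?id=([0-9]+)$ /news/$1.html [R=301,L]

# Works: test the query string in a RewriteCond and reuse the capture as %1
RewriteCond %{QUERY_STRING} ^id=([0-9]+)
RewriteRule ^news\.php$ /news/%1.html? [R=301,L]
```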
Appending QUERY_STRING to the rewrite target
This probably seemed like a great solution at the time:
RewriteRule ^book/([0-9]+)\.html /scripts/book.html?id=$1&%{QUERY_STRING}
but you're much better off using the built-in QSA flag:
RewriteRule ^book/([0-9]+)\.html /scripts/book.html?id=$1 [QSA]
Note: QSA stands for Query String Append.
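The difference shows up when the request has no query string of its own. The sample requests in the comments below are illustrative, and the [L] flag is an addition.

```apache
# Request                    Manual %{QUERY_STRING}    With [QSA]
# /book/12.html?page=2       id=12&page=2              id=12&page=2
# /book/12.html              id=12&  (stray "&")       id=12
RewriteRule ^book/([0-9]+)\.html /scripts/book.html?id=$1 [QSA,L]
```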
Trying to match or redirect to page anchors
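Page anchors - the part of an address after # - are handled entirely on the client side and are never sent to the server, so no RewriteRule or RewriteCond can match them. This makes sense if you think about it, as you can go from one anchor to another in your browser without the page reloading. Redirecting to an anchor is a different matter: an external redirect (R flag) can include one in its target, but the NE flag is needed to stop the # being escaped.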
Missing images and style-sheets
After an internal rewrite such as:
RewriteRule ^book/([0-9]+)\.html /showbook.php?id=$1
relative paths to images will no longer work. That's because, as far as the web browser knows, you're currently in a directory called book. The fix is to use absolute paths when referencing images and other resources. For example, reference an image as /images/cover.png rather than images/cover.png (the file name here is only illustrative).

Comments (1)
The code to add www to your domain name:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^theitarticles\.com$ [NC]
RewriteRule (.*) http://www.theitarticles.com/$1 [R=301,L]