The Apache module mod_rewrite provides a powerful mechanism for hiding, redirecting, and reformatting request URLs. I just finished implementing a mod_rewrite scheme for timfanelli.com to accomplish 3 things:
- Redirect old URLs with a 301 redirect code.
- Hide certain parts of the URL from my readers.
- Optimize my Google PageRank.
My first goal was to redirect old URLs using 301 redirect codes. I migrated to pyBlosxom a long time ago, and it recently came to my attention that not only were there links to my old URLs on other people’s blogs, but Google was also turning up search results pointing to my old URLs! All of these references resulted in 404s, driving my PageRank down towards 0.
Using a simple rewrite rule, I was able to redirect my previous URLs, which began with http://www.timfanelli.com/index.cgi, to a static page, old.html, which provides links to my new URL, http://www.timfanelli.com/cgi-bin/blog.cgi:
RewriteEngine on
RewriteRule ^/index.cgi(.*) /old.html [R=301]
Now, any link followed to my page that starts with “/index.cgi” is redirected, and a 301 is issued to the requesting client indicating that the resource has been permanently relocated.
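To see which paths the pattern catches, here is a minimal Python sketch of the rule above. It is only a model, not Apache itself: the function name redirect_old and the sample paths are made up for illustration, and Python's re.search stands in for mod_rewrite's unanchored regex matching.

```python
import re

# Hypothetical stand-in for: RewriteRule ^/index.cgi(.*) /old.html [R=301]
OLD_URL = re.compile(r"^/index.cgi(.*)")

def redirect_old(path):
    """Return (status, location) for a legacy path, or None if the rule does not match."""
    if OLD_URL.search(path):
        return (301, "/old.html")
    return None

print(redirect_old("/index.cgi/2004/05/some-post"))  # (301, '/old.html')
print(redirect_old("/blog/"))                        # None
# The unescaped "." matches any single character, so this path is caught too:
print(redirect_old("/indexXcgi"))                    # (301, '/old.html')
```

The last call shows a small looseness in the pattern: writing the dot as \. would restrict the match to a literal "index.cgi".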
My second goal was to hide the /cgi-bin/blog.cgi portion of my URL. It’s ugly and it’s hard to remember. I wanted any request sent to http://www.timfanelli.com/blog/ to go directly to that CGI script. A passthrough rule and two 301 redirects accomplished this nicely:
RewriteRule ^/blog/(.*)$ /cgi-bin/blog.cgi/$1 [PT]
RewriteRule ^/$ /blog/ [R=301]
RewriteRule ^/blog$ /blog/ [R=301]
The first rule internally rewrites any request sent to /blog/ to /cgi-bin/blog.cgi/. Any extra characters in the URL string are copied into the rewritten URL using regular expression groupings. The second rule causes a 301 redirect from my base URL to the blog, and the third causes a 301 redirect if the URL is missing the trailing / character. We use a 301 redirect here instead of another passthrough rule to prevent having multiple “valid” URLs with the same content.
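A rough way to check how these three rules divide up incoming paths is to model them in Python. This sketch makes a few assumptions: the first substitution ends in a $1 backreference (written \1 in Python), as the description of regular expression groupings implies; processing stops at the first matching rule, which is a simplification of what Apache does; and the names RULES and apply_rules are made up for illustration.

```python
import re

# Hypothetical model of the three rules, tried in order.
# Each entry is (pattern, substitution, flag).
RULES = [
    (re.compile(r"^/blog/(.*)$"), r"/cgi-bin/blog.cgi/\1", "PT"),
    (re.compile(r"^/$"), "/blog/", "R=301"),
    (re.compile(r"^/blog$"), "/blog/", "R=301"),
]

def apply_rules(path):
    """Return (flag, target) for the first matching rule, else ("none", path)."""
    for pattern, substitution, flag in RULES:
        match = pattern.search(path)
        if match:
            # expand() fills the \1 backreference with the captured group
            return (flag, match.expand(substitution))
    return ("none", path)

print(apply_rules("/blog/2005/03/post"))  # ('PT', '/cgi-bin/blog.cgi/2005/03/post')
print(apply_rules("/"))                   # ('R=301', '/blog/')
print(apply_rules("/blog"))               # ('R=301', '/blog/')
```

Only the first rule serves content. The other two send the client back to the canonical /blog/ address.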
Having multiple “valid” URLs with the same content isn’t in and of itself a problem. Your website would work just fine, but I also wanted to optimize my site for Google PageRank. To this end, the astute reader will have noticed that there are now two ways to access my site: http://www.timfanelli.com/blog and http://www.timfanelli.com/cgi-bin/blog.cgi. We need to hide the /cgi-bin/blog.cgi URL from the outside world. This gets a little tricky, because we can’t just redirect /cgi-bin/blog.cgi to /blog/: that would cause an infinitely recursive rewrite, because /blog/ rewrites to /cgi-bin/blog.cgi! We’ll still use this rewrite rule, but we’ll protect it with a RewriteCond clause so it’s only evaluated when /cgi-bin/blog.cgi appears in the original request URL:
RewriteCond %{IS_SUBREQ} false
RewriteRule ^/cgi-bin/blog.cgi(.*)$ /blog/ [R=301]
IS_SUBREQ is “true” if the rule is being processed as a sub request of the original, and “false” otherwise. So when the rule is matching the user-entered URL, it is not a sub request, and the rewrite rule replaces /cgi-bin/blog.cgi with /blog/. This is done with a 301 redirect, so Google won’t see it as a valid URL. Later, when the rewrite engine rewrites /blog/ to /cgi-bin/blog.cgi, IS_SUBREQ is going to be “true”, and this rule won’t be executed again.
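The difference the guard makes is easy to trace with a toy model. The Python sketch below follows the behavior as described in this post, not Apache's internals: the is_subreq flag, the serve function, the hop limit, and the step strings are all invented for illustration.

```python
import re

CGI = re.compile(r"^/cgi-bin/blog.cgi(.*)$")
BLOG = re.compile(r"^/blog/(.*)$")

def serve(path, guarded=True, max_hops=6):
    """Trace one client request; return the steps until the script runs or hops run out."""
    steps = []
    is_subreq = False  # a URL typed by the user is not a sub request
    for _ in range(max_hops):
        cgi = CGI.search(path)
        blog = BLOG.search(path)
        if cgi and not (guarded and is_subreq):
            steps.append("301 -> /blog/")
            path, is_subreq = "/blog/", False  # the client issues a fresh request
        elif blog:
            path = "/cgi-bin/blog.cgi/" + blog.group(1)
            steps.append("rewrite -> " + path)
            is_subreq = True  # internal rewrite, treated as a sub request here
        elif cgi:
            steps.append("run " + path)
            return steps
        else:
            steps.append("no rule for " + path)
            return steps
    steps.append("gave up: redirect loop")
    return steps

print(serve("/cgi-bin/blog.cgi"))
# ['301 -> /blog/', 'rewrite -> /cgi-bin/blog.cgi/', 'run /cgi-bin/blog.cgi/']
print(serve("/cgi-bin/blog.cgi", guarded=False)[-1])
# gave up: redirect loop
```

With the guard, the redirect fires once for the user-entered URL and the script then runs. Without it, the redirect and the rewrite keep handing the request back to each other.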
So now the only valid way to access my site from the “outside” is via the URL http://www.timfanelli.com/blog/, even though all of the following URLs will appear to work as well:
- http://www.timfanelli.com/ (no “/blog/”)
- http://www.timfanelli.com/blog (no trailing slash)
- http://www.timfanelli.com/cgi-bin/blog.cgi
- http://www.timfanelli.com/cgi-bin/blog.cgi/
Many thanks to Pete for all his help!