Perishable Press 3G Blacklist and WP Super Cache

I’ve been fol­low­ing the Build­ing the 3G Black­list series on Per­ish­able Press for the last week or two and have been imple­ment­ing each of the rules as they were released. For the most part, there have been no prob­lems. I’ve seen a huge increase in 403 errors (For­bid­den Access) in my logs, which has been good. Judg­ing from my access.log, all of the requests have been bogus.

After the final list came out, I implemented the changes to the rules, tested the site in my default browser (Safari) and called it good. Several days later, however, I tried to pull up this site on my home PC using Firefox and was greeted with a big fat 403. Uh oh. I switched over to IE and got the same result. After some cursory checking, I switched over to my laptop and Safari and noticed that there was no problem there. Weird. Even weirder because I’m using Verizon DSL with a router, so as far as my server is concerned, both computers have the same IP.

Most weird: when I actually checked my access.log, I could see my own requests, the ones that had been answered with 403 errors in the browser. But instead of a 403, the log showed a single request with a 200 status for each time I tried to load a page.

IP - - [30/May/2008:10:20:02 -0700] "GET /about/ HTTP/1.1" 200 363 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7" 
IP - - [30/May/2008:10:20:07 -0700] "GET / HTTP/1.1" 200 364 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7" 

That’s a request for my about page, and then my main page. Both received a “Forbidden” in my browser, but both show a status of OK in my log. Additionally, if the requests had actually been successful, a bunch of other files would have been requested as well.

I quickly decided that it was more impor­tant to get the site up and run­ning again rather than spend a bunch of time trying to figure out what the prob­lem was and how to fix it. After some selec­tive com­ment­ing in my .htaccess file, I dis­cov­ered that the cul­prit rule was the fol­low­ing one from the 3G Blacklist:

RedirectMatch 403 \/\/

I commented out the rule for the time being so that I could test further later.

This particular rule returns a 403 for any request whose path contains a double slash after the http:// section. I thought it was very odd that this rule should break my site, because I can’t see any reason why a legitimate request would need a double slash. I was also concerned because, judging from my access.log, this is the rule that does the bulk of the work concerning 403 errors.
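To make the rule’s reach concrete, here is the same directive annotated with a few paths. The example paths are made up for illustration, and the notes reflect a reading of how mod_alias matches rather than anything stated in the 3G Blacklist itself.

```apache
# RedirectMatch tests its regex against the URL-path only (the part after
# the host name), so the "//" in "http://" is never what triggers it.
# The backslashes are harmless but not required, so this line is
# equivalent to "RedirectMatch 403 \/\/".
RedirectMatch 403 //

# Hypothetical paths, for illustration only:
#   /about/                 -> no match, served normally
#   /about//                -> match, 403 Forbidden
#   /some//made-up/path     -> match, 403 Forbidden
```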

I did some more scanning of my .htaccess file and arrived at the conclusion that the culprit must be within the rules for the WP Super Cache plugin I had recently installed. This plugin creates a static HTML page to serve instead of the normal WordPress PHP pages. Here’s an explanation from their site:

When a vis­i­tor who is not logged in, or who has not left a com­ment, visits they will be served a static HTML page out of the super­cache sub­di­rec­tory within the Word­Press cache direc­tory. If you nav­i­gate to that direc­tory you can view an exact replica of your perma­link struc­ture as well as the HTML files within the direc­to­ries. To deter­mine if a page has been served out of the Super Cache, view the source and the last line on the page should read or .

Hmm, I’d say we’re get­ting closer now. The sec­tion that WP Super Cache adds to my .htaccess file looks like this:

# WP SUPER CACHE
<IfModule mod_rewrite.c>
AddDefaultCharset UTF-8
RewriteCond %{REQUEST_METHOD} !=POST
RewriteCond %{QUERY_STRING} !.*s=.*
RewriteCond %{QUERY_STRING} !.*attachment_id=.*
RewriteCond %{HTTP_COOKIE} !^.*(comment_author_|wordpress|wp-postpass_).*$
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteCond %{DOCUMENT_ROOT}/wp-content/cache/supercache/%{HTTP_HOST}/$1index.html.gz -f
RewriteRule ^(.*) /wp-content/cache/supercache/%{HTTP_HOST}/$1/index.html.gz [L]

RewriteCond %{REQUEST_METHOD} !=POST
RewriteCond %{QUERY_STRING} !.*s=.*
RewriteCond %{QUERY_STRING} !.*attachment_id=.*
RewriteCond %{HTTP_COOKIE} !^.*(comment_author_|wordpress|wp-postpass_).*$
RewriteCond %{DOCUMENT_ROOT}/wp-content/cache/supercache/%{HTTP_HOST}/$1index.html -f
RewriteRule ^(.*) /wp-content/cache/supercache/%{HTTP_HOST}/$1/index.html [L]
</IfModule>
# END WPSuperCache

Now, perhaps a mod_rewrite ninja can see immediately what the problem is, but I was having trouble figuring out what was going on. Since I’m on a shared host, I don’t have access to httpd.conf and therefore cannot use the RewriteLog directive to see what’s happening in the rewrites.

After some research I discovered that adding an R flag to each of WP Super Cache’s RewriteRule directives would force a temporary redirect and therefore let me see in the browser what was actually being requested. I changed the RewriteRule directives to the following:

RewriteRule ^(.*) /wp-content/cache/supercache/%{HTTP_HOST}/$1/index.html.gz [R,L]
RewriteRule ^(.*) /wp-content/cache/supercache/%{HTTP_HOST}/$1/index.html [R,L]

Now after run­ning the same tests again, I could see in my browser how things were get­ting screwed up. I typed the address http://blog.nerdstargamer.com into Fire­fox and sure enough, got the 403 error. This time though, when I looked at the URL it showed the fol­low­ing redirect:

http://blog.nerdstargamer.com/wp-content/cache/supercache/blog.nerdstargamer.com//index.html

There’s the cul­prit right there. The double-​slash right before index.html. So, basi­cally, every time WP Super Cache serves a cached page, it’s serv­ing a URL with a double slash before the file name. I did a quick check by delet­ing the cache fold­ers of WP Super Cache and con­firmed that pages not cached loaded fine while cached pages always got redi­rected to a 403 error. Bingo.

So, now to fix the problem. Why on earth the .htaccess code for WP Super Cache does this in the first place, I’m not sure. It seems wrong to me, but I’ll defer to the experts on this one. Basically, what’s happening is that the variable $1 is replaced with the requested path, which includes a trailing slash. The next part of the substitution starts with a slash, thus the double slash problem.
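As a rough trace of that substitution, here is one of the original rules with the values filled in by hand. This is a sketch that assumes the .htaccess file sits in the site’s document root and that %{HTTP_HOST} is blog.nerdstargamer.com; the /about/ result is worked out on paper, not taken from a log.

```apache
# One of the original WP Super Cache rules, unchanged:
RewriteRule ^(.*) /wp-content/cache/supercache/%{HTTP_HOST}/$1/index.html [L]

# In a .htaccess file the leading slash is stripped before matching, so:
#
#   request /about/  ->  $1 = "about/"
#     result: /wp-content/cache/supercache/blog.nerdstargamer.com/about//index.html
#
#   request /        ->  $1 = ""  (empty)
#     result: /wp-content/cache/supercache/blog.nerdstargamer.com//index.html
```

The second result is the redirect URL that showed up in the browser. For the front page $1 is empty, so the two slashes come from the literal slash after %{HTTP_HOST} and the one before index.html.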

I was able to fix the con­flict in the WP Super Cache code by remov­ing one of the slashes like so:

# WP SUPER CACHE
<IfModule mod_rewrite.c>
AddDefaultCharset UTF-8
# not post
RewriteCond %{REQUEST_METHOD} !=POST
# not a search
RewriteCond %{QUERY_STRING} !.*s=.*
# not an attachment page
RewriteCond %{QUERY_STRING} !.*attachment_id=.*
RewriteCond %{HTTP_COOKIE} !^.*(comment_author_|wordpress|wp-postpass_).*$
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteCond %{DOCUMENT_ROOT}/wp-content/cache/supercache/%{HTTP_HOST}/$1index.html.gz -f
RewriteRule ^(.*) /wp-content/cache/supercache/%{HTTP_HOST}/$1index.html.gz [L]

RewriteCond %{REQUEST_METHOD} !=POST
RewriteCond %{QUERY_STRING} !.*s=.*
RewriteCond %{QUERY_STRING} !.*attachment_id=.*
RewriteCond %{HTTP_COOKIE} !^.*(comment_author_|wordpress|wp-postpass_).*$
RewriteCond %{DOCUMENT_ROOT}/wp-content/cache/supercache/%{HTTP_HOST}/$1index.html -f
RewriteRule ^(.*) /wp-content/cache/supercache/%{HTTP_HOST}/$1index.html [L]
</IfModule>
# END WPSuperCache

This cer­tainly looks funny but at least it works. I’m sure there is a more ele­gant way to do this, like say, rewrit­ing the orig­i­nal request to remove the trail­ing slash and then apply­ing the cache rules. Per­haps this is really a prob­lem with the way Word­Press is doing its perma­links (I’m on 2.5.1 by the way). Who knows? Ninjas chime in.
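For anyone weighing the same edit, here is a hand-worked trace of one of the edited rules. It is a sketch under the same assumptions as before (the .htaccess file in the document root, blog.nerdstargamer.com as the host), and the request paths are only examples.

```apache
# One of the edited rules from above, unchanged:
RewriteRule ^(.*) /wp-content/cache/supercache/%{HTTP_HOST}/$1index.html [L]

#   request /about/  ->  $1 = "about/"  ->  .../blog.nerdstargamer.com/about/index.html
#   request /        ->  $1 = ""        ->  .../blog.nerdstargamer.com/index.html
#   request /about   ->  $1 = "about"   ->  .../blog.nerdstargamer.com/aboutindex.html
#
# The last path should not name a real cache file, so the preceding
# "RewriteCond ... -f" test fails and the rule is skipped.
```

So a request without a trailing slash simply misses the cache and falls through to whatever rules follow, which is probably acceptable as long as the permalinks end in a slash.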

Getting Geeky With YSlow

I spent a good amount of time over the last couple of days attempting to make my site a little bit faster. I’ve been pretty negligent about it up until now, because I know that much of the slowness of my site can be directly attributed to my web hosting company. [1] Even so, I decided to spend some time doing what I could to speed things up.

The first thing that I did was run a test in YSlow to see how my site was doing. Yikes! I got an F right off the bat. After some fur­ther review and research, I real­ized that this wasn’t nec­es­sar­ily some­thing that should have me freak­ing out. If you’re not entirely famil­iar with YSlow and what it does, Jeff Atwood’s arti­cle, “YSlow: Yahoo’s Prob­lems Are Not Your Prob­lems” on Coding Horror is a must-​read. Basi­cally, YSlow offers a lot of good advice that should be taken, but with a grain of salt.

With that said, here are the steps that I’ve taken to speed up my site.

Make Fewer HTTP Requests

The first time I ran YSlow, I discovered that all of my pages were making a ridiculous number of HTTP requests for JavaScript and CSS files. I was requesting four CSS files: screen, print, IE hacks, and one for Lightbox 2. Unfortunately, the IE hacks stylesheet is still necessary. Obviously the screen and print ones are as well. After taking a look at the Lightbox 2 CSS file, I decided that it was small enough to simply tack on to the bottom of my existing screen stylesheet. That’s one down.

There were also quite a few JavaScript files being requested, includ­ing all of the files for Google Ana­lyt­ics, Mint, WP Stats and Light­box. What can I say? I like my track­ing software.

The first thing that I decided to do was to reduce the number of track­ing util­i­ties I was using to two. I love Mint and Google Ana­lyt­ics seems to be nec­es­sary, so I had to get rid of WP Stats. That wasn’t such a big deal for me. That’s another one down.

The next step was to take a long hard look at Light­box 2. I orig­i­nally installed this for my Gallery page, and then decided to include it on all my pages on the off chance that I might want to use it in a few posts. While it works and looks great, I’ve been decid­edly unhappy about how much bag­gage Light­box comes with. There are five JavaScript files that need to be included, just to have that neat little image trick. Even worse, the included Pro­to­type JavaScript library weighs in at a stag­ger­ing 124KB. What a waste.

I made a mental note to do some research to find a more lightweight solution for my image gallery. Smashing Magazine has a good list of them, which I will inspect later. For the time being, I compressed the JavaScript files and was able to bring their total size down to about 125KB from 196KB. I also decided to only include the scripts on my actual Gallery page. It seems like too much of a waste to require all those files when I rarely use them.

Put CSS At the Top and JS at the Bottom

When I first set up Light­box, I wanted to avoid using a Word­Press plugin for it, so I cooked up my own method of includ­ing it. Most of the work was simply trying to find a way around hard-​coding my tem­plate direc­tory in it and also using a func­tion to keep my header.php file clean and easy to read.

The first prob­lem with my orig­i­nal method was that all of those JavaScript files were at the top of the page, mean­ing that almost 200KB of JavaScript had to be loaded before any of the con­tent on my page started to load. That’s no good! The sim­plest thing to do was to move my func­tion down to the bottom of the page, right before the scripts for Google Ana­lyt­ics and Mint. The only other prob­lem was that the func­tion included the CSS file as well. Since I had already decided to merge the Light­box CSS with my main CSS, all I actu­ally had to do was remove the call to load the CSS.

Use Google’s APIs

Unless you’ve been living under a rock (or just don’t care), you’ve prob­a­bly heard that Google just released their AJAX Libraries API. This was pretty much per­fect timing for me since I was already look­ing at how Light­box used the Pro­to­type Frame­work and Scrip­tac­u­lous Effects Library. It makes a whole lot more sense to use a ver­sion hosted by Google than it does to require clients to down­load the same exact ver­sion of a stan­dard library from my slow web host. Ajax­ian has a good run­down of the fea­tures of this new API and why you would want to use it.

After doing a relatively quick setup, I was able to call the Prototype framework from the Google API. It came in from Google at only 29KB; that’s the same file that I was just complaining was 124KB. That’s a no-brainer. Scriptaculous was a bit more of a problem though, since it takes a modular approach. Lightbox 2 actually only uses two of the eight possible modules. As far as I can tell, with the Google-hosted version there is no way to use the standard type of script tag to include only the modules you want, like this:

<script type="text/javascript" src="http://blog.nerdstargamer.com/wp-content/themes/positiveGrey-v2.0/js/scriptaculous.js?load=effects,builder"></script>

One of the comments on Ajaxian, by jdalton, addresses this:

Another issue google will need to work out is that MooTools, Scrip­tac­u­lous, and Dojo are mod­u­lar (mean­ing you don’t have to load the kitchen sink and can just load the parts you want). This can effect the file size foot­print as well. This may be beyond the scope of a CDN though.

Because I couldn’t find a way to only include the mod­ules I needed, I decided to con­tinue serv­ing them locally instead. So, my func­tion to include Light­box now looks like this:

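// Print the script tags Lightbox 2 needs; call this near the bottom of the page.
function AKM_include_lightbox() {
    // URL of the active theme directory, so nothing is hard-coded
    $templateDir = get_bloginfo('template_directory');

    // Prototype comes from Google's AJAX Libraries API; Scriptaculous
    // (effects and builder modules only) and Lightbox are served locally.
    $output = <<<EOT
<script type="text/javascript" src="http://www.google.com/jsapi"></script>
<script type="text/javascript">
    var tplDir = "{$templateDir}/images/lightbox/";
    google.load('prototype', '1.6.0.2');
</script>
<script type="text/javascript" src="{$templateDir}/js/scriptaculous.js?load=effects,builder"></script>
<script type="text/javascript" src="{$templateDir}/js/lightbox.js"></script>
EOT;

    echo $output;
}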

Reorganize Template Directory

Although this doesn’t actu­ally have any­thing to do with the speed of my web­site, it seemed appro­pri­ate to take this oppor­tu­nity to reor­ga­nize my tem­plate direc­tory a little bit. I was striv­ing to create a more tra­di­tional web setup within my Word­Press tem­plate that included all CSS in a CSS folder, JavaScript in a JS folder, and images in an image folder.

The first issue to address was the WordPress default style.css file. This file is necessary for a WordPress theme to function properly, as explained on the WordPress Theme Development Codex page. What I decided to do was remove all of the actual styles from this file and simply leave the WordPress information:

/*
Theme Name:Positive Grey
Theme URI:http://nerdstargamer.com
Description:A simple theme using a fluid 2 column layout with green and grey
Version:2.0
Author:Alissa Miller
Author URI:http://nerdstargamer.com
*/

/* See css/screen-x.x.css for styles */

I then moved all of the styles to a new file called screen-x.x.css in the CSS folder. This allows me to have all stylesheets (with the excep­tion of style.css) in the CSS folder. It also allows me to use ver­sion­ing in the file­name, which as we will see, will be impor­tant after I’ve imple­mented better caching and expires headers.

I pre­vi­ously put all of the Light­box files in their own folder, to keep things neat. I’ve now decided to roll those files into the normal direc­tory struc­ture instead of keep­ing them sep­a­rate. The CSS file got merged with screen.css and all of the Light­box JavaScript files got moved into the js folder. Light­box also includes sev­eral images, which I decided to put in images/lightbox/ so as not to con­fuse them with my own tem­plate images.

Gzip Components, Improve Caching

One of the rules for YSlow is gzipping components. Some of my scripts are for Mint and Google Analytics, which I can’t really control. The others, however, along with my CSS, are fair game. I had a little bit of trouble figuring out how to do this, since I did not want to use any of the more common PHP methods to compress my pages on the fly and was looking at using either mod_gzip or mod_deflate. The YSlow page gives the following information:

Gzip­ping gen­er­ally reduces the response size by about 70%. Approx­i­mately 90% of today’s Inter­net traf­fic trav­els through browsers that claim to sup­port gzip. If you use Apache, the module con­fig­ur­ing gzip depends on your ver­sion: Apache 1.3 uses mod_gzip while Apache 2.x uses mod_deflate.

After some research, I fig­ured out that my web­site is hosted on Apache 2 (not ear­lier). I included this block in my root .htaccess file:

# GZIP CSS AND JS
<IfModule mod_deflate.c>
 <FilesMatch "\.(js|css)$">
  SetOutputFilter DEFLATE
 </FilesMatch>
</IfModule>
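A variation worth knowing about selects what to compress by MIME type instead of by file extension. This is only a sketch, not what is running on this site: the list of JavaScript types is an assumption about how a given server labels .js files, and on Apache 2.4 the AddOutputFilterByType directive is provided by mod_filter, which a shared host may or may not have loaded.

```apache
# Alternative sketch: compress by MIME type rather than by file extension.
# The JavaScript types listed are assumptions; servers differ in which one
# they send for .js files.
<IfModule mod_deflate.c>
  AddOutputFilterByType DEFLATE text/css
  AddOutputFilterByType DEFLATE application/javascript application/x-javascript text/javascript
</IfModule>
```

The extension-based block above is the safer choice when it is unclear which modules the host provides.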

I also decided to make the move to using WP Super Cache instead of WP Cache. WP Super Cache is much like WP Cache but does offer some per­for­mance ben­e­fits. Once I got WP Super Cache con­fig­ured and run­ning, it seemed to have an imme­di­ate effect on the speed of my blog. Of course, that could have just been wish­ful think­ing on my part.

Add an Expires Header

One of the last things I did was add an expires header in my root .htaccess file. This tells the client browsers not to look for a new ver­sion at all if the one in their cache hasn’t expired yet.

Now, I obvi­ously don’t want to do this to the dynamic Word­Press files (com­ments and posts would never update!), but that’s okay because WP Super Cache is taking care of those files already. What I do want to do is add the expires header to all of my images, JavaScript and CSS files. None of these will really change except for the CSS files. For­tu­nately, when I reor­ga­nized my tem­plate files, I gained the abil­ity to append ver­sion num­bers to my CSS files. So now I can go ahead and add that expires header to my CSS files, and then simply change the file name when I need to make changes in my CSS. The new file will down­load like normal, and it’s good prac­tice to get some sort of ver­sion­ing underway.

Here is the code that I put in my .htaccess file:

### ADD FAR OUT EXPIRES HEADER TO STATIC CONTENT ###
<IfModule mod_expires.c>
  <FilesMatch "\.(jpg|gif|png|css|js)$">
    ExpiresActive On
    ExpiresDefault "access plus 1 month"
  </FilesMatch>
</IfModule>

After some thought I decided that one month was an appro­pri­ate length for my pur­poses. This depends entirely on what type of con­tent it is, and how often you are going to change it.
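If different kinds of content need different lifetimes, mod_expires can also set the header per MIME type. The following is a sketch of the same one-month policy written that way, not the configuration used on this site; the JavaScript types are assumptions that depend on how the server labels .js files.

```apache
# Sketch: the same one-month policy, set per MIME type so that each
# line can be tuned separately later.
<IfModule mod_expires.c>
  ExpiresActive On
  ExpiresByType image/jpeg "access plus 1 month"
  ExpiresByType image/gif "access plus 1 month"
  ExpiresByType image/png "access plus 1 month"
  ExpiresByType text/css "access plus 1 month"
  ExpiresByType application/javascript "access plus 1 month"
  ExpiresByType application/x-javascript "access plus 1 month"
</IfModule>
```

Because each type has its own line, images that never change could be given a longer lifetime than stylesheets without touching the rest.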

After doing all of the pre­vi­ously men­tioned fixes, I had improved the page load time quite a bit for most of my site. The only remain­ing bot­tle­neck seemed to be my Gallery page. That wasn’t par­tic­u­larly sur­pris­ing con­sid­er­ing that the page includes 25 thumb­nail images. The total size of all of the images, at full size, weighs in at a hefty 2MB. This page was also still using the Light­box scripts.

One of the things I noticed while using YSlow was that some of my thumb­nail images seemed to be unnec­es­sar­ily large. Some of them were as big as 40KB for a 150×150 pixel image! Upon fur­ther inspec­tion I decided that all of the thumb­nails were too large. I had used WordPress’ fea­ture to auto­mat­i­cally create thumb­nails of images to set this up. I’m not sure exactly how Word­Press does this, but after taking a look at the file sizes I’m sure that it sucks. I recre­ated all of the thumb­nails in Pho­to­shop and the biggest one is now only 17.6KB.

I had also orig­i­nally set up the gallery page in WordPress’ admin screen (using the file browser and things like that). Once I was no longer using the dynam­i­cally gen­er­ated thumb­nails, it didn’t make sense to lay out the page in WordPress’ page sec­tion. Instead I cre­ated a page tem­plate called gallery.php, which includes all of the images and code for the page.

I also copied all of the full size and thumb­nail images into my tem­plate image folder. This way the links to the images are no longer being stored in my database.

Conclusion

After all of these changes my web­site does seem to be a little bit faster. These types of exer­cises are good prac­tice for any web designer/developer. Having a slow web host is no excuse for not doing what you can on your end to make the site faster.

As always, any tips or improve­ments from more expe­ri­enced devel­op­ers in this area are greatly appreciated.

  1. You get what you pay for, right?