Housekeeping

I’ve completed some general housekeeping here at Take the First Step. If I’ve done it correctly, then there will be fewer 404 - Not Found errors around here.

The first step for a budding webmaster is to register the site with Google Webmaster Tools. That gives you a window into how Google sees your site. I like to check once a week to make sure that Googlebot can find everything it is looking for.

The next step is to learn how to interpret your web server log files. Your mileage may vary, but here’s how I look for 404 errors:

$ zgrep -w 404 access_log.20080512.gz | cut -d ' ' -f7,11 | uniq -c
1 /blog/2004/09/27.html "-"
2 /blog/topic/software/2003/08/18.html "-"
1 /id/1319/jms-providers/ "-"
1 /id/1318/leopard-part-4/ "-"
1 /id/1317/brief-history/ "-"
1 /id/1316/march-drabness/ "-"
1 /id/1315/march-madness-08/ "-"

where:

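"zgrep -w 404" retrieves the lines that contain 404 as a whole word from my compressed access logs. It matches 404 anywhere on the line, not only in the status field.
"cut -d ' ' -f7,11" defines a space as the field delimiter and retrieves the 7th and 11th fields from those lines. In the combined log format these are the requested path and the referrer.
"uniq -c" collapses identical adjacent lines and precedes each with its instance count. Lines that repeat elsewhere in the log are only merged if the output is piped through "sort" first.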
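A slightly stricter variant of the pipeline above can be sketched as follows. It assumes the log is in Apache's combined format, where the status code is the 9th space-separated field. The function name count_404s is hypothetical and not something I actually run here. It compares the status field itself instead of searching the whole line, and it sorts before counting so that repeated misses are grouped together.

```shell
#!/bin/sh
# count_404s: hypothetical helper, not part of any standard tool.
# Reads a gzip-compressed access log, assumed to be in Apache
# "combined" format, and prints each 404 path/referrer pair with a count.
count_404s() {
    # gzip -dc decompresses to stdout. awk keeps only the lines whose
    # 9th field (the status code) is exactly 404 and prints the
    # requested path and the referrer (fields 7 and 11).
    gzip -dc "$1" |
        awk '$9 == 404 { print $7, $11 }' |
        sort |       # group identical lines so uniq can count them all
        uniq -c |
        sort -rn     # most frequent misses first
}

# Usage: count_404s access_log.20080512.gz
```

Matching on the status field avoids false hits such as a 200 response that happens to be 404 bytes long, and the final sort puts the worst offenders at the top.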

Here we see some failing Radio UserLand links and a misbehaving client that is adding a trailing ‘/’ to my page links. A little .htaccess magic and the 404s are cured.
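For the trailing-slash case, a fix along these lines is one possibility. This is only a sketch, not the exact rule used on this site: it assumes Apache with mod_alias enabled, and it assumes the real page lives at the same path without the final slash. The old Radio UserLand links would need their own mappings, which are not shown here.

```apache
# Hypothetical rule: redirect /id/<number>/<slug>/ to the same path
# without the trailing slash. Assumes mod_alias is available and that
# the slash-less URL is the page that actually exists.
RedirectMatch 301 ^/id/([0-9]+)/([^/]+)/$ /id/$1/$2
```

A permanent (301) redirect also tells crawlers to update their links, so the same request should stop showing up in the logs over time.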