Half my Traffic is Bots

I was checking my Cloudfront cache performance when I discovered that half my traffic is bots (50.77% if you’re keeping score).

  • My first thought was “!@(#)!”
  • My second thought was “What am I doing wrong?”
  • My third thought was “What does the internet say about this?”

And it turns out that pretty much half of everyone’s traffic is bots. Which doesn’t really make me feel that much better. Especially since I know that I wasn’t setting my cache headers until last month.

I’m hoping this will get better with max-age=2592000 (30 days) on my older content.
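
Older posts that are already in the bucket can be re-headered in place. A minimal sketch, assuming a recent s3cmd whose modify command accepts --recursive, with a hypothetical archive prefix standing in for the older content:

# hypothetical: set a 30 day max-age on everything under an archive prefix
s3cmd modify --recursive --add-header="Cache-Control: public, max-age=2592000" s3://[bucket_name]/[archive prefix]/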

s3cmd Revisited

Still a work in progress, but here is what I have:

# 24 hours
s3cmd sync --no-preserve --cf-invalidate --add-header="Cache-Control: public, max-age=86400" [local directory path] s3://[bucket_name]/
# 1 hour
s3cmd modify --add-header="Cache-Control: public, max-age=3600" s3://[bucket_name]/categories/sports/index.html
s3cmd modify --add-header="Cache-Control: public, max-age=3600" s3://[bucket_name]/categories/technology/index.html
s3cmd modify --add-header="Cache-Control: public, max-age=3600" s3://[bucket_name]/index.html
s3cmd modify --add-header="Cache-Control: public, max-age=3600" s3://[bucket_name]/atom.xml

Let’s run down the options to the sync command:

  • --no-preserve: do not save filesystem attributes in s3 metadata
  • --cf-invalidate: invalidate the uploaded file[s] in Cloudfront
  • --add-header=…: explicitly set the cache control headers on the uploaded files

The sync is followed by four explicit modify commands to set a shorter max-age on my RSS feed and index pages.

I had previously planned to control my Cloudfront cache behavior via the underlying s3 Cache-Control header. After reflection, I realized this wasn’t my best choice. I want a short Cache-Control max-age to instruct browsers to check for new content. But I want Cloudfront to cache files as long as possible.

My current plan is to use the Minimum TTL in Cloudfront to control the Cloudfront cache behavior and use the Cache-Control header to control browser cache behavior.
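
As a quick sanity check on that split, the TTLs on the default cache behavior can be read back with the AWS CLI. A sketch, assuming the CLI is installed and configured; the distribution id is a placeholder:

# inspect the TTLs that govern Cloudfront (not browser) caching
aws cloudfront get-distribution-config --id [distribution_id] --query 'DistributionConfig.DefaultCacheBehavior.{MinTTL:MinTTL,DefaultTTL:DefaultTTL,MaxTTL:MaxTTL}'

Cloudfront will then keep an object for at least the Minimum TTL even when the origin’s max-age is shorter, while browsers still honor the shorter max-age.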

17 Oct: s3cmd Revisited x2

Controlling Cloudfront

Cloudfront honors the source Cache-Control headers by default. That meant the easiest resolution to my Cloudfront caching issue was to set the Cache-Control headers in S3. After an hour or two of research, I decided that s3cmd was the right answer for me (I use s3cmd to sync files up to s3).

Let’s start with my front page.

s3cmd modify --add-header="Cache-Control: public, max-age=3600" s3://ideoplex.com/index.html

That seemed to do the trick (the first request is served by s3 directly and the second request goes via Cloudfront).

$ curl -I http://ideoplex.com.s3-website-us-east-1.amazonaws.com/index.html
HTTP/1.1 200 OK
x-amz-id-2: 4tfynzLtAyu1K0Pp3YqOEr7nlLwtuvNPX15PVmdXRa3f/7eDVF2yFRs4AJAnBCb2BjnawP5Aa3E=
x-amz-request-id: 71222B1E11C31C1E
Date: Sun, 13 Sep 2015 19:41:19 GMT
x-amz-meta-s3cmd-attrs: ...
Cache-Control: public, max-age=3600
Last-Modified: Sun, 13 Sep 2015 18:41:16 GMT
ETag: "358f032aa7f83bf14914430b80817c83"
Content-Type: text/html
Content-Length: 30702
Server: AmazonS3
$ curl -I http://ideoplex.com/
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 30702
Connection: keep-alive
Date: Sun, 13 Sep 2015 18:41:27 GMT
x-amz-meta-s3cmd-attrs: ...
Cache-Control: public, max-age=3600
Last-Modified: Sun, 13 Sep 2015 18:41:16 GMT
ETag: "358f032aa7f83bf14914430b80817c83"
Server: AmazonS3
X-Cache: RefreshHit from cloudfront
Via: 1.1 dbfc7fdca19a1a429546608a6a58a3d2.cloudfront.net (CloudFront)
X-Amz-Cf-Id: 2cgjVuUtuB0WKNa4A8NBjEF3Lwep5LZI_Y1_6of6yO75MYH2WjG2rw==

Unfortunately, the headers do not persist when the underlying file is updated. That complicates things - more to come.
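
The fix I eventually settled on (the newer s3cmd Revisited section above) is to pass the header on every sync, so that any replaced object is re-uploaded with Cache-Control already set:

# set the header at upload time, instead of patching objects afterwards
s3cmd sync --no-preserve --cf-invalidate --add-header="Cache-Control: public, max-age=86400" [local directory path] s3://[bucket_name]/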

Adding Cloudfront

You should now be reading Take the First Step via Amazon Cloudfront. There is no significant benefit at this point, but the move to Cloudfront is a necessary prelude to adding an SSL certificate.

The move was pretty easy:

  1. Create the distribution in Cloudfront
  2. Update Route53 to point to Cloudfront rather than s3
  3. Wait for the good news:
$ dig ideoplex.com

; <<>> DiG 9.8.3-P1 <<>> ideoplex.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 60651
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;ideoplex.com. IN A

;; ANSWER SECTION:
ideoplex.com. 46 IN A 54.230.52.82
ideoplex.com. 46 IN A 54.192.55.193
ideoplex.com. 46 IN A 54.230.53.75
ideoplex.com. 46 IN A 54.192.54.180
ideoplex.com. 46 IN A 54.192.55.198
ideoplex.com. 46 IN A 54.230.53.79
ideoplex.com. 46 IN A 54.230.52.175
ideoplex.com. 46 IN A 54.230.52.164

;; Query time: 30 msec
;; SERVER: 10.0.1.1#53(10.0.1.1)
;; WHEN: Sat Sep 12 15:06:15 2015
;; MSG SIZE rcvd: 158

Unfortunately, I didn’t think about Cloudfront caching. By default, Cloudfront will cache objects for 24 hours. That’s a long time to wait for new posts.
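
When a new post can’t wait out those 24 hours, an invalidation forces Cloudfront to refetch from s3. A sketch using the AWS CLI (the --cf-invalidate flag on s3cmd sync, shown above, achieves the same thing at upload time); the distribution id is a placeholder:

# force Cloudfront to refetch the pages that change with each post
aws cloudfront create-invalidation --distribution-id [distribution_id] --paths "/index.html" "/atom.xml"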

Not a Zero Sum Game

“If you ain’t cheating, then you ain’t trying.” - various

Cheating in sports is complicated. We look down on sports governed by subjective judgment (figure skating, rhythmic gymnastics, …). But most team sports have rules and officials to enforce those rules. And since the officials are human, there is always a level of subjectivity to their rulings.

In a sport with officials, some cheating isn’t really cheating. This quote reflects acceptance of the reality that there isn’t a bright shining line between legal and illegal, and you push the boundaries to find where the official is drawing the line today. If you’re never offsides, then you’re not being aggressive enough. Flopping is a cheat, but a little embellishment ensures that the official is paying attention.

I’d like to think that sports is not a zero-sum game - there has to be a winner, but there doesn’t have to be a loser. What disturbs me about deflategate is that both sides want the other side to be a loser.

I’d like to think that sports teaches us about life. But deflategate is teaching us the wrong lesson.

In a zero-sum world, you can convince yourself that favorable payment terms are just good business. But is it necessary to force suppliers to accept net 60 instead of net 30, and to pay on the 59th day to boot?

In a zero-sum world, you can convince yourself to never leave anything on the table. But is it necessary to squeeze so hard that your supplier’s long-term survival is at risk?

JQuery Ajax and Selenium

Today, I’ll update the Selenium automation test to wait on Ajax calls. We haven’t seen any problems so far because our web application is too fast - not surprising, as it runs with all data in memory on the local server.

Let’s start by slowing things down to see the problem. I’ll do this by adding a query parameter to the getUserMap method and using that parameter to request some sleep before returning the User set.
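
For illustration, a request against such a parameter might look like this (the endpoint and parameter name are hypothetical stand-ins; the real ones depend on the application):

# hypothetical: ask the server to sleep for 2000 ms before returning the User set
curl "http://localhost:8080/users?sleep=2000"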

Read More