May 31, 2011

First day down on our little adventure trip to Costa Rica.

Getting down here was already a bit of an adventure: a 2:30 wake-up call, both flights delayed with mechanical problems, and then we got here and it poured rain the rest of the evening.

We spent some time at a local mall killing off the evening so we could get a good rest before starting a two-day rafting trip. The mall had a little churro stand we couldn’t pass up without trying. Karen and I split a churro covered in white chocolate, filled with dulce de leche, and rolled in peanuts that could easily have counted as an entire meal.

By the time we had finished wandering around the mall and our completely lost taxi driver got us back to the hotel, it was time to head off for our first real meal of the day: dinner at the hotel’s house restaurant, El Rodeo Steak House. We of course got steak. It was okay, different, but it certainly wasn’t the steak au poivre we love to indulge in back home.


The rain has finally let up, but it’s time for some shut-eye. It’s been a long day, and I need rest for rafting!

Apr 26, 2011

Great WOD last night. I was really looking forward to this one, and I wasn’t disappointed:

5 Rounds for time:
10 Deadlift 225 lb./155 lb.
15 GHD situps

Time:  7:40
Workout total work output: 41,929 ft-lbs
Power: 91.15 ft-lbs/sec
Power: 123.51 Watts
Power: 0.1657 Horsepower
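
For anyone curious how the power numbers fall out of the total, here’s a quick back-of-the-envelope check (just a sketch, assuming the posted 41,929 ft-lbs figure and the 7:40 time; the standard conversion factors land within a fraction of a watt of the numbers above):

    # Quick check of the posted numbers (assumes the 41,929 ft-lbs total is right).
    total_work_ft_lbs = 41929
    time_s = 7 * 60 + 40                              # 7:40 -> 460 seconds

    power_ft_lbs_per_s = total_work_ft_lbs / time_s   # ~91.15 ft-lbs/sec
    power_watts = power_ft_lbs_per_s * 1.35582        # 1 ft-lb/s = 1.35582 W
    power_hp = power_watts / 745.7                    # 1 hp = 745.7 W

    print(f"{power_ft_lbs_per_s:.2f} ft-lbs/s, {power_watts:.2f} W, {power_hp:.4f} hp")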


Felt good.  I really wanted to go unbroken, but fell apart the last round…next time!

Apr 25, 2011

One of the many reasons I love the brains behind SmugMug. Check out our pictures site!

Amazon had a major outage last week, which took down some popular websites. Despite using a lot of Amazon services, SmugMug didn’t go down, because we spread across availability zones and designed for failure from the beginning, among other things.

We’ve known for quite some time that SkyNet was going to achieve sentience and attack us on April 21st, 2011. What we didn’t know is that Amazon’s Web Services platform (AWS) was going to be their first target, and that the attack would render many popular websites inoperable while Amazon battled the Terminators.

Sorry about that, that was probably our fault for deploying SkyNet there in the first place.

We’ve been getting a lot of questions about how we survived (SmugMug was minimally impacted, and all major services remained online during the AWS outage) and what we think of the whole situation. So here goes.

With last rays of the sun


We’re heavy AWS users with many petabytes of storage in their Simple Storage Service (S3) and lots of Elastic Compute Cloud (EC2) instances, load balancers, etc. If you’ve ever visited a SmugMug page or seen a photo or video embedded somewhere on the web (and you probably have), you’ve interacted with our AWS-powered services. Without AWS, we wouldn’t be where we are today – outages or not. We’re still very excited about AWS even after last week’s meltdown.

I wish I could say we had some sort of magic bullet that helped us stay alive. I’d certainly share it if I had one. In reality, our stability during this outage stemmed from four simple things:

First, all of our services in AWS are spread across multiple Availability Zones (AZs). We’d use 4 if we could, but one of our AZs is capacity constrained, so we’re mostly spread across three. (I say “one of our” because your “us-east-1b” is likely different from my “us-east-1b” – every customer is assigned to different AZs and the names don’t match up). When one AZ has a hiccup, we simply use the other AZs. Often this is graceful, but there can be hiccups – there are certainly tradeoffs.
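
As a rough illustration of what “spread across multiple AZs” means in practice, here’s a sketch that counts running EC2 instances per zone. This isn’t our actual tooling – boto3 is just a convenient modern client, and the region name and credentials are assumptions:

    # Sketch: count running EC2 instances per Availability Zone to verify the
    # fleet really is spread out. Region name and credentials are assumptions.
    from collections import Counter
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )

    az_counts = Counter()
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                az_counts[instance["Placement"]["AvailabilityZone"]] += 1

    for az, count in sorted(az_counts.items()):
        print(f"{az}: {count} running instances")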

Second, we designed for failure from day one. Any of our instances, or any group of instances in an AZ, can be “shot in the head” and our system will recover (with some caveats – but they’re known, understood, and tested). I wish we could say this about some of our services in our own datacenter, but we’ve learned from our earlier mistakes and made sure that every piece we’ve deployed to AWS is designed to fail and recover.

Third, we don’t use Elastic Block Storage (EBS), which is the main component that failed last week. We’ve never felt comfortable with the unpredictable performance and sketchy durability that EBS provides, so we’ve never taken the plunge. Everyone (well, except for a few notable exceptions) knows that you need to use some level of RAID across EBS volumes if you want some reasonable level of durability (just like you would with any other storage device like a hard disk), but even so, EBS just hasn’t seemed like a good fit for us. Which also rules out their Relational Database Service (RDS) for us – since I believe RDS is, under the hood, EC2 instances running MySQL on EBS. I’ll be the first to admit that EBS’ lack of predictable performance has been our main reason for staying away, rather than durability, but nonetheless, it’s not a perfect product for our use case.

Which brings us to fourth, we aren’t 100% cloud yet. We’re working as quickly as possible to get there, but the lack of a performant, predictable cloud database at our scale has kept us from going there 100%. As a result, the exact types of data that would have potentially been disabled by the EBS meltdown don’t actually live at AWS at all – it all still lives in our own datacenters, where we can provide predictable performance. This has its own downsides – we had two major outages ourselves this week (we lost a core router and its redundancy earlier, and a core master database server later). I wish I didn’t have to deal with routers or database hardware failures anymore, which is why we’re still marching towards the cloud.

Water On Fire © 2010 Colleen M. Griffith. This is the main lava flow as it empties into the ocean. You can’t really see the scale of the eruption in this shot, but about 15 feet of the “shore” is captured here. You can see the steam created when the molten rock hit the ocean – it would sometimes obscure the lava flows, but it was like a dance, when the steam cloud would part for a few minutes allowing us to glimpse the lava beneath it. Photo taken August 23, 2010 – it captures lava flowing from the Kilauea volcano on the Big Island of Hawaii.


So what did we see when AWS blew up? Honestly, not much. One of our Elastic Load Balancers (ELBs) on a non-critical service lost its mind and stopped behaving properly, especially with regards to communication with the affected AZs. We updated our own status board, and then I tried to work around the problem. We quickly discovered we could just launch another identical ELB, point it at the non-affected zones, and update our DNS. 5 minutes after we discovered this, DNS had propagated, and we were back in business. It’s interesting to note that the ELB itself was affected here – not the instances behind it. I don’t know much about how ELBs operate, but this leads me to believe that ELBs are constructed, like RDS, out of EC2 instances with EBS volumes. That seems like the most logical reason why an ELB would be affected by an EBS outage – but other things like network saturation, network component failures, split-brain, etc could easily cause it as well.
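
To make that workaround concrete, here’s a sketch of what “launch another identical ELB, point it at the non-affected zones, and update DNS” can look like. This isn’t the exact procedure we ran – every name, zone ID, and instance ID below is hypothetical, and the boto3 calls are just one way to do it:

    # Illustrative sketch (not our actual tooling): stand up a replacement
    # classic ELB in the healthy zones, register the same back-end instances,
    # and re-point a CNAME at the new ELB. All identifiers are hypothetical.
    import boto3

    elb = boto3.client("elb", region_name="us-east-1")
    route53 = boto3.client("route53")

    new_lb = elb.create_load_balancer(
        LoadBalancerName="replacement-elb",
        Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                    "InstanceProtocol": "HTTP", "InstancePort": 80}],
        AvailabilityZones=["us-east-1a", "us-east-1c"],  # healthy AZs only
    )

    elb.register_instances_with_load_balancer(
        LoadBalancerName="replacement-elb",
        Instances=[{"InstanceId": "i-0123456789abcdef0"}],
    )

    # Point the service hostname at the new ELB; with a short TTL the change
    # propagates in minutes.
    route53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE",
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "service.example.com.",
                "Type": "CNAME",
                "TTL": 300,
                "ResourceRecords": [{"Value": new_lb["DNSName"]}],
            },
        }]},
    )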

Probably the worst part about this whole thing is that the outage in question spread to more than one AZ. In theory, that’s not supposed to happen – I believe each AZ is totally isolated (physically in another building at the very least, if not on the other side of town), so there should be very few shared components. In practice, I’ve often wondered how AWS does capacity planning for total AZ failures. You could easily imagine people’s automated (and even non-automated) systems simply rapidly provisioning new capacity in another AZ if there’s a catastrophic event (like Terminators attacking your facility, say). And you could easily imagine that surge in capacity taking enough toll on one or more AZs to incapacitate them, even temporarily, which could cause a cascade effect. We’ll have to wait for the detailed post-mortem to see if something similar happened here, but I wouldn’t be surprised if a surge in EBS requests to a 2nd AZ had at least a detrimental effect. Getting that capacity planning done just right is just another crazy difficult problem that I’m glad I don’t have to deal with for all of our AWS-powered services.

30 May 2009 – We have been having pretty colorless sunsets of late due to the cloud cover every evening; this is one of the better ones. Have a good weekend!


This stuff sounds super simple, but it’s really pretty important. If I were starting anew today, I’d absolutely build 100% cloud, and here’s the approach I’d take:

  • Spread across as many AZs as you can. Use all four. Don’t be like this guy and put all of the monitoring for your poor cardiac arrest patients in one AZ (!!).
  • If your stuff is truly mission critical (banking, government, health, serious money maker, etc), spread across as many Regions as you can. This is difficult, time consuming, and expensive – so it doesn’t make sense for most of us. But for some of us, it’s a requirement. This might not even be live – it might just be for Disaster Recovery (DR).
  • Beyond mission critical? Spread across many providers. This is getting more and more difficult as AWS continues to put distance between themselves and their competitors, grow their platform and build services and interfaces that aren’t trivial to replicate, but if your stuff is that critical, you probably have the dough. Check out Eucalyptus and Rackspace Cloud for starters.
  • I should note that since spreading across multiple Regions and providers adds crazy amounts of extra complexity, and complex systems tend to be less stable, you could be shooting yourself in the foot unless you really know what you’re doing. Often redundancy has a serious cost – keep your eyes wide open.
  • Build for failure. Each component (EC2 instance, etc) should be able to die without affecting the whole system as much as possible. Your product or design may make that hard or impossible to do 100% – but I promise large portions of your system can be designed that way. Ideally, each portion of your system in a single AZ should be killable without long-term (data loss, prolonged outage, etc) side effects. One thing I mentally do sometimes is pretend that all my EC2 instances have to be Spot instances – someone else has their finger on the kill switch, not me. That’ll get you to build right. :)
  • Understand your components and how they fail. Use any component, such as EBS, only if you fully understand it. For mission-critical data using EBS, that means RAID1/5/6/10/etc locally, and some sort of replication or mirroring across AZs, with some sort of mechanism to get eventually consistent and/or re-instantiate after failure events. There’s a lot of work being done in modern scale-out databases, like Cassandra, for just this purpose. This is an area we’re still researching and experimenting in, but SimpleGeo didn’t seem affected and they use Cassandra on EC2, so I’d say that’s one big vote.
  • Try to componentize your system. Why take the entire thing offline if only a small portion is affected? During the EBS meltdown, a tiny portion of our site (custom on-the-fly rendered photo sizes) was affected. We didn’t have to take the whole site offline, just that one component for a short period to repair it. This is a big area of investment at SmugMug right now, and we now have a number of individual systems that are independent enough from each other to sustain partial outages but keep service online. (Incidentally, it’s AWS that makes this much easier to implement)
  • Test your components. I regularly kill off stuff on EC2 just to see what’ll happen. I found and fixed a rare bug related to this over the weekend, actually, that’d been live and in production for quite some time. Verify your slick new eventually consistent datastore is actually eventually consistent. Ensure your amazing replicator will actually replicate correctly or allow you to rebuild in a timely fashion. Start by doing these tests during maintenance windows so you know how it works. Then, once your system seems stable enough, start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you. (A rough sketch of this kind of random-kill testing follows this list.)
  • Relax. Your stuff is gonna break, and you’re gonna have outages. If you did all of the above, your outages will be shorter, less damaging, and less frequent – but they’ll still happen. Gmail has outages, Facebook has outages, your bank’s website has outages. They all have a lot more time, money, and experience than you do and they’re offline or degraded fairly frequently, considering. Your customers will understand that things happen, especially if you can honestly tell them these are things you understand and actively spend time testing and implementing. Accidents happen, whether they’re in your car, your datacenter, or your cloud.
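
Here’s a minimal sketch of the “kill stuff and see what happens” testing from the list above. Again, not our actual tooling – the tag filter and region are assumptions, and you should only point something like this at instances you’re genuinely willing to lose:

    # Minimal "kill something and see what happens" sketch. The chaos-eligible
    # tag and the region are assumptions; this is not SmugMug's real tooling.
    import random
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag:chaos-eligible", "Values": ["true"]},
        ]
    )

    candidates = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]

    if candidates:
        victim = random.choice(candidates)
        print(f"Terminating {victim} - the system should recover on its own.")
        ec2.terminate_instances(InstanceIds=[victim])
    else:
        print("No chaos-eligible instances found.")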

Best part? Most of that stuff isn’t difficult or expensive, in large part thanks to the on-demand pricing of cloud computing.

Clouds and windmill in full sun. Just a few of the great sunset and cloud photos in Colorado.


Amazon has some explaining to do about how this outage affected multiple AZs, no question. Even so, high volume sites like Netflix and SmugMug remained online, so there are clearly cloud strategies that worked. Many of the affected companies are probably taking good hard looks at their cloud architecture, as well they should. I know we are, even though we were minimally affected.

Still, SmugMug wouldn’t be where we are today without AWS. We had a monster outage (~8.5 hours of total downtime) with AWS a few years ago, where S3 went totally dark, but that’s been the only significant setback. Our datacenter related outages have all been far worse, for a wide range of reasons, as many of our loyal customers can attest. :( That’s one of the reasons we’re working so hard to get our remaining services out of our control and into Amazon’s – they’re still better at this than almost anyone else on earth.

Will we suffer outages in the future because of Amazon? Yes. I can guarantee it. Will we have fewer outages? Will we have less catastrophic outages? That’s my bet.

Time of the full moon


There’s a lot of noise on the net about how cloud computing is dead, stupid, flawed, makes no sense, is coming crashing down, etc. Anyone selling that stuff is simply trying to get page views and doesn’t know what on earth they’re talking about. Cloud computing is just a tool, like any other. Some companies, like Netflix and SimpleGeo, likely understand the tool better. It’s a new tool, so cut the companies that are still learning some slack.

Then send them to my blog. :)

Oh, and while you’re here, would you mind doing me a huge favor? If you use StackOverflow, ServerFault, or any other StackExchange sites – I could really use your help. Thanks!

And, of course, we’re always hiring. Come see what it’s like to love your job (especially if you’re into cloud computing).

How SmugMug survived the Amazonpocalypse
Don MacAskill
Mon, 25 Apr 2011 07:54:15 GMT

Apr 21, 2011

Jason Khalipa and Chris Spealler doing the 11.6 workout, via the CrossFit Games site.

I actually thought this might not be so bad, but I’m quickly becoming convinced that it is going to be bad. The tough part on this one is that it gets harder and harder, with no finish line other than a time limit.

I think I’ll try to do 2 or 3 of these workouts for 3 minutes each between today and next week, when we do the official workout. I’ll also try to do it at a heavier weight, at least as sort of a mind game for the 100 lbs. we’ll actually be using. Oh, and of course lots of chest-to-bar pull-ups; I think I’ll do at least a set or 2 each workout day until 2 days before the workout.

I haven’t tried to do much prep for one of these games workouts yet, but being the last one, I’d like to do what I can to go out with a bang as big as I’m capable of.

Apr 21, 2011

Oh man…what a workout that was.  I was actually on the fence about doing the workout today or waiting until the weekend after a day or 2 of rest/recovery. I wasn’t really feeling all that hot mentally or physically, but I bucked up and gave it a go anyway.

Right away the power cleans felt heavier than they should have, and I even failed my first clean. I have very little history with power cleans and am still trying to build the habit of doing them correctly. I was out of rhythm, out of order with each progression of the movement, and not throwing any power into it. After a few reps, I was starting to figure it out again. The good part was that there were only 5 per round to deal with, and they seemed to be over in no time.


On I went to the toes-to-bar. I was able to get a rhythm right away, but couldn’t get more than 5 or so at a time. As the rounds went on, sweat built up on my hands and the bar, my grip began to weaken, and I was only stringing 2-3 together at a time. This was definitely my weakest part of this workout.

The wall ball section of this workout that followed wasn’t too bad; it was just a matter of whether my cardio could keep me going. I was tossing the ball up to a 10′ mark at the top of a steel pole. The good part is that it was inches from my pull-up bar, not 50′ away across the gym, which helped save transition time.

Overall I felt good that I made it through all that, but I wish I had kept a better pace. The ‘breaks’ I took were just way too long, longer than they needed to be. But since I had no idea what score to expect or what pace to set, I’m OK with the score I got.


5 rounds plus 5 cleans and 8 toes-to-bar, for a score of 163 (30 reps per full round × 5 rounds, plus the extra 13 reps).

Apr 20, 2011

It’s been a while, to the tune of years, since I attempted a 1-rep max on bench press. If I remember correctly, it was 245 lbs at that time.

Last night as part of our WOD, we did a 10-8-5-5-1-1-1-1-1 rep set on bench, so I figured it would be a good time to at least get a ballpark of where my max was. When all was said and done, I was able to put up 215 lbs. It wasn’t a total struggle, and I may have been able to add 5 or so, but it was enough to call it a max. This would probably be another goal I should set for myself. I’ll mull it over and set one soon.

Apr 20, 2011


Another week of the CrossFit Games and another tough workout.

Workout 11.5 is a much more well-rounded set of movements; the hard part is that it’s 20 minutes long:

      Complete as many rounds and reps as possible in 20 minutes of:
      5 Power cleans (145 lbs / 65 kg)
      10 Toes to bar
      15 Wall balls (20 lbs to 10′ target)

This is going to be a fun one!…or something like that.

On the other hand, at least we have some time to get some good prep for Workout 11.6.  This is essentially ‘Fran’ but in a ladder scheme and slightly heavier (100 lbs. vs. 95 lbs.).

This’ll be a great wrap-up to the last 7 weeks. I’ve made a lot of progress since joining CrossFit just six months ago, and I never would have thought I’d be able to do these workouts at the prescribed weights. Looking forward to what the next 6 months will bring!

Apr 20, 2011

From the looks of these gameplay trailers, it’s going to be tough to wait until the end of this year for the actual release.

Hopefully I’ll decide to just pre-order and forget about it; maybe I’ll get an early surprise!

Apr 18, 2011

Another summer is coming, and I can’t wait to start up the time lapses again. We’ve got some great vacations lined up this year, so I plan to find some great subjects among the places we visit. I’m also hoping to have a wide-angle lens to use for some of them.

For now, I’ll use this stunning piece of work to keep me motivated!

The Mountain from Terje Sorgjerd on Vimeo.