Tuesday, July 20, 2010

Deploying at the speed of light

In a SaaS environment, it is very important to make sure the service is "always" available. Downtime in the service affects a wide range of users, especially those in a different time zone. Downtime between 3 AM and 6 AM US ET may not seem disruptive to anybody in the US. However, for somebody working in Asia or Australia, it might eat into a major chunk of their productive hours.

Anyway, the point I am getting at is that in order to keep the service up most of the time, it is important to minimize downtime. Downtime is usually needed for emergency patches or software refreshes, which are often the only way to roll out new features and services. In a SaaS environment, it can be a daunting task to roll out an upgrade to a number of services. The downtime might then range from a few minutes to hours, depending on the complexity of the environment.

The presentation by Larry Gadea demonstrates some of the challenges that are involved in deploying new builds to servers.

For starters, in the case of Twitter, the deployment time was around 40 minutes for a server. With a number of innovative ideas, the team was able to reduce it to a few seconds.

Here is the first general issue: if every server in the farm gets its update by copying it from one central server, that central server usually becomes the main bottleneck.
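To get a rough feel for why the central server becomes the bottleneck, here is a back-of-the-envelope sketch. It assumes the central server's uplink is the only limit and that every server pulls a full copy of the build. The function name and all the numbers are made up for illustration; they are not from the presentation.

```python
def central_copy_seconds(num_servers, build_megabytes, uplink_megabytes_per_sec):
    """Estimate the seconds needed for one central server to push a build
    to every server, assuming its uplink is the only limit (a simplification)."""
    total_megabytes = num_servers * build_megabytes
    return total_megabytes / uplink_megabytes_per_sec


# Hypothetical numbers: a 100 MB build and a 100 MB/s uplink.
for n in (10, 100, 1000):
    seconds = central_copy_seconds(n, build_megabytes=100, uplink_megabytes_per_sec=100)
    print(n, "servers:", seconds, "seconds")
```

Under these assumptions the total time grows linearly with the number of servers, no matter how fast each individual server could receive the build.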

One way to solve the issue would be to use tree distribution, where each server that receives the update passes it on to a few others. However, this runs into a problem if one or more nodes in the tree fail: every server below a failed node is left without the update.
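The failure problem with tree distribution can be seen in a small simulation. This is only a sketch: the numbering scheme (server 0 is the root, and server i forwards to servers i*fanout+1 through i*fanout+fanout) is an assumption chosen to keep the code short, not how any particular tool lays out its tree.

```python
def updated_servers(num_servers, fanout, failed):
    """Return the set of servers that receive the build in a tree rollout.

    Assumed layout: server 0 is the root and server i forwards to servers
    i*fanout+1 .. i*fanout+fanout. A failed server neither receives nor
    forwards the build, so its whole subtree is skipped.
    """
    updated = set()
    pending = [0]
    while pending:
        node = pending.pop()
        if node >= num_servers or node in failed:
            continue
        updated.add(node)
        first_child = node * fanout + 1
        pending.extend(range(first_child, first_child + fanout))
    return updated


everyone = set(range(15))
one_down = updated_servers(15, fanout=2, failed={1})
print(sorted(everyone - one_down))  # [1, 3, 4, 7, 8, 9, 10]
```

With 15 servers and a fanout of 2, a single failure directly under the root leaves 7 servers without the new build, even though only one of them is actually down.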


Twitter was able to use an approach that involved the techniques below.



The new approach, integrated with the existing deployment app, Capistrano, reduced the deployment time by about 99%. You can see it here