Here is a report on Velocity 2010, the Web Performance and Operations conference. Now in its third year, it grew to more than 1,100 attendees and sold out.
Tuesday was dedicated to workshops, although given the number of participants most of them turned out to be presentations with demos rather than hands-on sessions. Here is a small selection of talks I found interesting throughout the day:
Infrastructure automation with Chef
An overview of the Chef project, led by the high-energy and opinionated Adam Jacob from Opscode.
For me the most exciting part was that Chef provides a complete view of your infrastructure and lets you query it any way you want.
Adam gave a few high-impact principles:
Being able to reconstruct a business from a source code repository, a data backup and bare metal resources.
Another interesting feature of the knife tool is the ability to spawn new instances in EC2 from the command line. For example, the following command will give you an EC2 instance running your rails role within a few minutes:
knife ec2 server create 'role[rails]'
Protecting “Cloud” Secrets With Grendel
A technical overview of the Grendel project: OpenPGP as a software service.
The project makes it possible to share encrypted documents between multiple people. From a security perspective, each user's private key is stored in the cloud, encrypted with a passphrase that is known only to the user and transmitted via HTTP basic auth.
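To make the "transmitted via HTTP basic auth" part concrete, here is a minimal sketch of how a client would build the Authorization header that carries the passphrase. This is not Grendel's client code; the function name and parameters are my own, and the only thing it illustrates is the standard Basic scheme.

```python
import base64


def basic_auth_header(user_id: str, passphrase: str) -> str:
    """Build the value of an HTTP Basic Authorization header.

    Hypothetical helper for illustration only, not part of Grendel.
    """
    # The Basic scheme joins the two values with a colon and base64-encodes them.
    credentials = f"{user_id}:{passphrase}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Classic example from the HTTP specification:
# basic_auth_header("Aladdin", "open sesame") -> "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
```

Note that base64 is an encoding, not encryption, so the passphrase is only protected in transit if the request goes over HTTPS.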
Wednesday was the first day of the conference with keynotes in the morning and three tracks in the afternoon.
Datacenter Infrastructure Innovation
James Hamilton from Amazon Web Services gave an interesting overview of the different parts of building a data center.
An interesting point he made was that data centers should target 100% usage of their servers, while the industry standard is around 10 to 15% utilization on average. This objective led to the introduction of spot instances in EC2, so that resource usage can be maximized and the load on Amazon's cloud infrastructure flattened out. That reminds me of some comments from Google engineers stating that they try to pile as much work as possible onto each of their servers. At their scale, a server sitting powered off costs money.
He covered other topics:
- air conditioning: data centers could be run much hotter than they are now
- power: power is a small part of the total cost of running a data center, with server hardware accounting for more than half of the cost. This is an interesting point with regard to the whole green computing movement.
Urs Hölzle from Google covered the importance of having web pages that load fast and a range of improvements Google has been working on over the last few years, from the web browser (via Chrome) down to the infrastructure (such as DNS).
He also highlighted that Google's page ranking process now takes into account the speed at which a page loads. As was said multiple times during the conference, there is now empirical evidence that directly links page load speed to revenue: the faster a page loads, the more people stay on the web site.
Wednesday's lightning demos showcased a list of tools that highlight performance bottlenecks and help track down why pages load slowly and how to improve them.
Getting Fast: Moving Towards a Toolchain for Automated Operations
Lee Thompson and Alex Honor reported on the work of the devtools-toolchain group. The group formed a few months ago to share experiences and build up a set of best practices. Of the use cases they've outlined, KaChing's Continuous Deployment was the most interesting one:
Release is a marketing concern.
Tom Cook of Facebook gave a sneak peek at the life of operations at Facebook.
A very interesting talk about the development practices of one of the busiest websites on the internet. Facebook is running out of two data centers (one on the east coast, one on the west coast) while building its own data center in Oregon.
Their core OS is CentOS 5 with a customized kernel. For system management, cfengine is set to update every 15 minutes, with a cfengine run taking around 30 seconds. All of the changes are peer reviewed.
On the deployment front, bug fixes are pushed out once a day while new features are rolled out on a weekly basis. Code is pushed to tens of thousands of servers using BitTorrent swarms. Coordination is done via IRC, with the engineer available in case something goes wrong.
The developer is responsible for writing the code as well as testing and deploying it. New code is then exposed to a subset of real traffic. Ops are embedded in engineering teams and take part in design decisions. They effectively act as an interface to the rest of ops.
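The talk did not go into how Facebook picks the subset of traffic that sees new code, so the following is only a sketch of one common way to do it: hash a stable identifier into a bucket and compare it to a rollout percentage. The function name, the feature label and the bucket count are all assumptions of mine.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hypothetical helper: hashing the feature name together with the user id
    gives each user a stable bucket that differs from feature to feature.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to one of 10,000 buckets (0.01% steps).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Example: expose a (made-up) feature to roughly 5% of users.
exposed = [u for u in ("alice", "bob", "carol") if in_rollout(u, "new-feed", 5)]
```

Because the bucket depends only on the identifier and the feature name, a user who sees the new code at 5% still sees it when the percentage is raised to 10%.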
To summarize, Tom gave a few points:
- version control everything
- optimize early
- use configuration management
- plan to fail
- instrument everything
- don’t waste time on dumb stuff