Facebook Hires Chef to Juggle 150,000 Machines

Servers inside Facebook's Prineville, Oregon data center. Photo: Wired/Pete Erickson

Three years ago, Facebook's online empire sat atop a network of 30,000 computer servers. At the time, the company called this "massive scale," but today, it looks small. Judging from the enormous amounts of power pumped into the company's data centers, the Facebook empire now spans more than 150,000 machines -- and counting.

It's hard to even picture that many machines. But now imagine trying to load all 150,000 with each tiny piece of software needed to serve up that unending stream of photos, videos, messages, and news feeds to over a billion people across the globe. And then imagine repeating this task every time that software needs changing -- which may happen as often as twice an hour.

No, you can't do all that by yourself. Jeremy LaTrasse -- who once oversaw data center operations at Twitter and now runs an online email service called Message Bus -- compares the task to car washing. "If you have three cars in your garage, you can wash them by hand," he says. "But if you have a thousand cars, you're gonna run them through some kind of automated car wash. Washing them by hand isn't an option." The same goes for 150,000 cars.

Traditionally, the giants of the web -- Google and Amazon and Facebook -- built their own tools for automatically deploying and configuring software across a vast network of machines. But in recent years, a trio of publicly available tools have pushed their way across the web, and this week, one of them -- a tool called Chef -- landed at one of those giants, proving it can tackle even the largest of online operations. It landed at Facebook.

Known as a "DevOps" tool -- because it lets you develop all sorts of tiny software programs that can automate data center operations -- Chef is hardly a new thing, but it was recently reconstructed in an effort to accommodate operations the size of Facebook. In fact, Facebook helped design the tool's new incarnation, according to Adam Jacob, the creator of Chef and one of the founders of Opscode, the company that sells the tool.

Facebook declined to discuss Chef -- it plans to reveal the particulars at a conference later this year -- but the company confirms that it worked with Opscode on the new version of the tool, that this tool is now overseeing its network, and that it's "happy with how things are going so far."

Dreamhost, a Los Angeles-based cloud computing outfit, is also using the new incarnation of Chef -- across three data centers -- but Carl Perry, who oversees the design of the Dreamhost network, still sees Facebook's involvement as a milestone in the evolution of DevOps software. "It's pretty huge. We, ourselves, were pretty blown away by the announcement," he says. "It can handle a lot of nodes."

Luke Kanies -- the founder of Puppet Labs, whose Puppet platform competes with Chef and Opscode -- doesn't quite see it that way, arguing that Puppet is already tackling massive server farms. "It's impressive that Opscode has gotten to this scale," he says, "but we've been at this scale for the last three or four years." He says the tool is overseeing about 50,000 servers at online game outfit Zynga -- and that it's slated to run a 300,000-server cluster at the CERN nuclear physics research lab in Switzerland. But he can at least agree that commercial DevOps tools continue to reinvent the data center.

With Chef, data center administrators can write tiny programs in the popular Ruby programming language, and then a central set of Chef servers can deploy these programs across a network of machines. The programs can install software -- such as an operating system -- but they can also update and configure existing software. They can, say, patch an application or change a security setting.
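
For a flavor of what these programs look like, here's a minimal sketch of a Chef recipe, written in ordinary Ruby. The resource syntax is Chef's standard DSL, but the package and file names are purely illustrative -- they aren't drawn from Facebook's setup or anyone else's.

    # A minimal, hypothetical Chef recipe. Each block declares a "resource" --
    # a piece of state the Chef client should enforce on whatever machine it runs on.

    package 'nginx'                       # install the web server package

    template '/etc/nginx/nginx.conf' do   # render a config file from a template
      source 'nginx.conf.erb'             # template shipped alongside the recipe
      mode   '0644'
      notifies :reload, 'service[nginx]'  # reload the service when the config changes
    end

    service 'nginx' do
      action [:enable, :start]            # start it now, and on every boot
    end

Push a recipe like this out through the central Chef servers and every machine it touches converges on the same state -- whether that's a handful of boxes or 150,000 of them.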

Puppet works in much the same way -- except that you build programs with its own specialized programming language. Both tools can also be used to install software on virtual servers running atop cloud services such as Amazon EC2 -- or on desktops and laptops running on your office network. Jeremy LaTrasse and Message Bus use Chef to deploy software on virtual servers running across multiple cloud services.
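
That portability comes from the fact that a recipe describes a desired end state rather than a particular machine. Chef exposes facts about each machine through a "node" object, so a single recipe can adapt to whatever it lands on -- a cloud instance, a bare-metal server, or a laptop. A rough sketch, again with illustrative package names:

    # The node object reports details about the current machine, so one recipe
    # can pick the right package on a cloud image, a rack server, or a laptop.
    web_package = case node['platform_family']
                  when 'debian' then 'apache2'   # Ubuntu and Debian images
                  when 'rhel'   then 'httpd'     # CentOS and Red Hat servers
                  else               'apache2'   # reasonable default elsewhere
                  end

    package web_package

    service web_package do
      action [:enable, :start]
    end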

In the past, the Chef server was also built with Ruby, but in an effort to improve its ability to accommodate a large number of machines, Opscode rebuilt the server with Erlang, a very different programming language. And for much the same reason, the company is now using the PostgreSQL database to house all the information needed to deploy your software. For Carl Perry, this setup is indeed a significant improvement over the previous version of the platform.

"Ruby on the server is fine. But as you added more clients, you start running into, well, unpredictability on the server," he says. "But with Erlang, that went away. One of the great things about Erlang is that you can keep adding more servers and it will just distribute the load across those servers."

In other words, it can help run a site the size of Facebook.