Feb 8, 2013 6:30 AM

Meet the Detective Who Handcuffed Intel's Packet of Death



When Kristian Kielhofner first heard that his company's new servers were crashing last summer, he worried that he might be facing a product recall.

A few months prior, Kielhofner's company, Star2Star, had shipped Linux-based Voice over Internet Protocol (VoIP) servers to about 200 customers. These machines helped run the back-end phone systems at everything from doctor's offices to financial companies to daycares. But by September, something was clearly wrong. The new servers were unexpectedly crashing at 10 of the companies; for a handful of them, it was happening every day, almost at random.

And so Kielhofner started looking for the culprit. It took him a few weeks, about 150 hours of painstaking digital detective work, but he uncovered the source: a single packet of data that just one model of phone, the Yealink SIP-T22P, was sending to his VoIP servers.

This packet of death didn't simply knock the servers off the internet. It clobbered the system's Intel 82574L network controller, the chip that handles the server's Ethernet connection. Because this part didn't power off with the rest of the system, restarting the server wouldn't fix things, and neither would turning it off. After every crash, Kielhofner had to unplug the server, plug it back in, and then restart the machine to get back on the network.

On Thursday, Intel confirmed Kielhofner's findings but said the bug wasn't its fault. The blame, according to Intel, lies with the company that made Kielhofner's motherboards, Taiwan's Lex CompuTech. Even though the networking hardware is built by Intel, the bug is in the firmware that Lex ships with its motherboards. Specifically, Lex appears to have programmed the controller's Electrically Erasable Programmable Read-Only Memory (EEPROM) with the wrong image for the controller configuration it ships.

"Intel root caused the issue to the specific vendor’s mother board design where an incorrect EEPROM image was programmed during manufacturing," Intel said Thursday in an emailed statement. "It is Intel’s belief that this is an implementation issue isolated to a specific manufacturer, not a design problem with the Intel 82574L Gigabit Ethernet controller."

That's good news, because Intel's networking controller has been shipping for five years and is pretty widely used, typically on server motherboards. Intel says that it only knows of one type of motherboard that has a problem -- the one that Kielhofner blogged about.

Lex, which does business under the name Synertron Technology in the U.S., says that it's testing out a new firmware image, supplied by Intel.

But are others affected? We don't know. That's why Kielhofner has gone public with his detective work. "It's completely unknown as to how widespread the overall issue is," he says. On his website, he has given people a way to test whether their computers are vulnerable, and he's already fielding reports from users.

Even if this is an obscure bug, it's a pretty interesting one. It could, for example, be exploited by anyone who wanted to take down an entire rack of servers running the buggy firmware.

"You could send this packet and shut these Ethernet chips down one by one," Kielhofner says. "And not only would that organization be knocked offline completely, more than likely, the people in charge of administering those servers wouldn't be able to get into them to fix them."

Another weird thing: Kielhofner zeroed in on the packet of death and found the exact byte value that kicks off the problem. If a certain byte in the packet is set to "1," nothing happens. Set it to "2" or "3," and the controller crashes. Set it to "4" or any other value, and the controller is inoculated against further packets of death, an inoculation that lasts until the server is powered off.
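The trigger-byte behavior Kielhofner describes amounts to a small state machine. The Python sketch below is a toy model of those reported outcomes only, not Intel's or Lex's actual firmware logic. The class and method names are hypothetical, the packet is reduced to the single integer value of the trigger byte, and the byte's position within the packet is not modeled. It also simplifies power handling by treating a full unplug as the one reset for both the crashed and the inoculated state.

```python
from enum import Enum


class ControllerState(Enum):
    NORMAL = "normal"          # handling traffic as usual
    CRASHED = "crashed"        # off the network until power is physically removed
    INOCULATED = "inoculated"  # immune to further packets of death


class HypotheticalController:
    """Toy model of the reported behavior; not real firmware logic."""

    def __init__(self) -> None:
        self.state = ControllerState.NORMAL

    def receive(self, trigger_byte: int) -> ControllerState:
        # A crashed controller sees nothing; an inoculated one ignores the trigger.
        if self.state is not ControllerState.NORMAL:
            return self.state
        if trigger_byte == 1:
            pass  # reported as harmless
        elif trigger_byte in (2, 3):
            self.state = ControllerState.CRASHED
        else:
            # "4" or any other value inoculates the controller.
            self.state = ControllerState.INOCULATED
        return self.state

    def unplug(self) -> None:
        # Simplification (assumption): removing all power resets both states.
        self.state = ControllerState.NORMAL


nic = HypotheticalController()
print(nic.receive(1).value)  # normal
print(nic.receive(2).value)  # crashed
nic.unplug()
print(nic.receive(4).value)  # inoculated
print(nic.receive(3).value)  # inoculated: the trigger no longer works
```

The ordering matters in this model: whichever non-harmless value arrives first decides whether the controller goes down or becomes immune.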

The bug is reminiscent of the Ping of Death, a 15-year-old bug that crashed Windows 95 and other systems. But the Ping of Death crashed the operating system, not the networking hardware.

"I’ve been working with networks for over 15 years and I’ve never seen anything like this," Kielhofner wrote in a blog post describing how he figured out what was going wrong. "I doubt I’ll ever see anything like it again. At least I hope I don’t...."

It's also a bug that most people on the planet would never bother to dig up. Kielhofner, however, is not like most people. He started playing with Linux when he was 11. By the time he was 20, he'd shipped his very own Linux distribution, called AstLinux, which is now the brains of the VOIP appliances that Star2Star ships. He's a founder and the company's chief technology officer.


And it's earned him some serious geek cred. "I'm impressed with the work here," said CloudFlare Network Engineer Terry Rodery, in an e-mail interview. "Figuring out something as convoluted as that is not only incredibly rewarding and satisfying, but time consuming and frustrating as well."

Klint Finley contributed to this story.