The strange case of the missing servers

An install for a large company with offices in multiple locations drags on and on when tracking down the source of a problem proves to be an ordeal

I was working for a large IT services provider whose customers included many large corporations, and we supported them remotely. As a result, we often did not know the physical location of a customer's machines, and we typically knew only the people who worked directly with us. On top of that, all work was supposed to be pre-approved via tickets, but all too often somebody made changes without our knowledge.

Our team supported software that automated customers' crucial operations and kept them running. It had components on several machines at once: a server component communicated with so-called "agent" components on other computers. Communication took place over the customer's internal network via TCP, and the network link between components was essential to operation.


One customer asked us to install a new environment with a new server and about 20 agents. The environment was installed over the course of several days, then we went into testing -- a phase planned to last about two months. The testing environment was a near replica of the production environment already in place, so we could correct any problems before going live.


At first everything went well, until one day three agents stopped responding. Whatever we did, the server component could not communicate with them. We ran standard network connectivity tests to determine whether the problem lay with the agents themselves, with the machines they ran on, or with the network link. We couldn't get any response from the problem machines.
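For readers curious what those "standard network connectivity tests" amount to in practice, here is a minimal sketch of a TCP reachability check of the kind we ran, written in Python. The hostnames and port number are hypothetical placeholders; the story does not say which port the product used or what the customer's machines were named.

```python
import socket

# Hypothetical agent hosts and TCP port -- placeholders for illustration only.
AGENT_HOSTS = [
    "agent01.example.internal",
    "agent02.example.internal",
    "agent03.example.internal",
]
AGENT_PORT = 7001
TIMEOUT_SECONDS = 5


def check_agent(host, port, timeout=TIMEOUT_SECONDS):
    """Try to open a TCP connection to an agent and report the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return f"{host}:{port} -> reachable"
    except OSError as exc:  # covers DNS failures, timeouts, refused connections
        return f"{host}:{port} -> unreachable ({exc})"


if __name__ == "__main__":
    for host in AGENT_HOSTS:
        print(check_agent(host, AGENT_PORT))
```

A check like this distinguishes "the host answers but the agent's port is closed" (connection refused) from "nothing answers at all" (timeout or name-resolution failure), which is exactly the distinction that mattered in what followed.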

The customer was a huge company, and we soon learned that although the entire testing environment was on the same private network, the physical servers resided in several different states in the U.S. We didn't know which state the problem machines were located in, so we called the customer's Unix team to check the servers; they told us the machines were up and running and there was nothing wrong with them. So it had to be a network problem.

We asked the customer to diagnose the network problem, and they told us that network administration was the responsibility of another company. We promptly got in touch with them, and they assigned an engineer to troubleshoot the problem.

The engineer told us to run some tests. We had already done these tests, but did them again and got the same results as before. He promised to get back to us.

Days went by. We contacted the engineer again and received no reply. Our company's liaison with the client called the network service provider again. They assigned another engineer to diagnose the problem. He requested we perform the same tests as before. We did, and sent him the results -- and got no reply. This same process was repeated with yet another network engineer.

Meanwhile, all testing on the environment had been halted, and the deadline for going into production was pushed back. The customer's middle managers were furious and frantic -- the postponement meant an incredible headache for them, and they started begging their superiors to mobilize resources to solve the problem. But apparently their superiors didn't care much, and the problem persisted.

Finally, after about a month, the network service provider company gave us their answer: They had no idea what the problem was. All network switch links were operational, and testing showed no signs of a problem. It had to be a problem with the machines, but we'd been told by the customer that the machines were fine. We were baffled.

The customer's top managers still didn't do anything to help figure out what was going on. Our customer liaison was going way beyond the call of duty to get the customer to find out what the problem was with their own machines. By opening unofficial lines of communication with employees working for the customer and by gradually gaining the trust of people in several departments, she was finally able to solve the mystery.

It was simple: The problem machines had been moved from one physical location to another without our knowledge. They should have been up and running again after a two-day delay, but due to the customer's internal bureaucracy, they had been forgotten in a warehouse and were still sitting in their crates, waiting to be powered on and plugged into the network. Since the customer's Unix team had no problem tickets open for those machines, they assumed the machines were in working order, and that is what they told us without bothering to check.

By the time this was discovered, the customer's testing phase had gone completely off schedule, and services that their company was scheduled to provide to their own customers had to be postponed, risking cancellation or renegotiation of contracts -- all because of a combination of bureaucratic inefficiency and neglect.

From that point on, we became very wary and demanded from our customers proof at every stage of a job. And we learned not to trust anybody ever again, except the people who we knew were also suffering from the same problem. In short, we learned that many times in such large environments only the people directly affected by a problem are genuinely interested in solving it.


This story, "The strange case of the missing servers," was originally published at InfoWorld.com. Read more crazy-but-true stories in the anonymous Off the Record blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.
