
IBM's Anjul Bhambhri on What Is a Data Scientist?

This article is more than 10 years old.

Anjul Bhambhri, Vice President of Development for Big Data projects at IBM.

Solving contemporary business problems demands a big data strategy far more than it demands any one product. As I explained in my prior article, “Curing the Big Data Storage Fetish,” there is a growing understanding among enterprises that solving the big data conundrum can’t just be about acquiring more data warehousing technology. To fully exploit the opportunity presented by big data, a value chain must be created that addresses the challenges of acquiring data, evaluating its value, distilling it, building models both manually and automatically, analyzing the data, creating applications, and changing business processes based on what is discovered. Organizations have to figure out a way to increase analytical capacity, not just raw storage capacity.

“The enterprises that will achieve a competitive edge and win will have a blend of a healthy data-science culture, enterprising data scientists who can bend the ear of C-level decision makers, and the right combination of technology that will surface the data that make sense in the context of the business,” says Anjul Bhambhri, vice president of development for big data projects at IBM. To continue my series of articles on “What is a Data Scientist?” I interviewed Bhambhri about her vision for creating business value from big data. (For more on expanding big data capabilities and a list of all the stories in the “What is a Data Scientist?” series, see “Growing your Own Data Scientists” on CITOResearch.com.)

As the leader of the Big Data development initiative at IBM, Anjul Bhambhri defines the overall product strategy and leads the engineering team that delivers Big Data products. These products include IBM InfoSphere BigInsights and IBM InfoSphere Streams, which together perform both historical and real-time data analysis. In addition, Bhambhri leads specialized customer-focused teams that work with partners and customers in several vertical industries, including finance, retail, energy, utilities, healthcare, and telco. These teams provide critical aid to customers getting started in understanding and unleashing the power of the Big Data products, by defining proofs of concept, architectures, and solutions.

No Silos or Ivory Towers

One of the biggest obstacles to analytical productivity at an organization is the fact that data are often locked in different lines of business as “silos,” and are not analyzed effectively—or analyzed at all. So, parking big data in a new repository to remove these silos is a good thing. However, in doing so, there is a risk of introducing a new “Big Data silo” if the new repository is not effectively connected to the rest of the business intelligence (BI) infrastructure of an organization.

Many organizations today solve the “data silos” problem by storing large volumes of decision support data in warehouses. To leverage Big Data analytics, organizations are being challenged to glean information from new data sources that are difficult to incorporate into an existing warehouse, says Bhambhri. “If you don’t deal with this data, much of which is ‘noisy’ and unstructured, then you’re not really adding more value or more context to the information that you’re already storing,” she says. “The key to the Big Data approach is to be able to analyze all of this data, without moving it around, to gain better insights, and to be able to do it in near-real-time when necessary. The results of the analysis can enable a new class of applications for the enterprise, or can be used to enrich existing data warehouse or master-data implementations.”

Data scientists should always keep this in mind, in order to avoid creating new “silos” in the enterprise. “Ideally, every analyst would have all data in the company available to them, so they could analyze it and determine what would be of use in solving the problem at hand,” Bhambhri says. “For example, in the telco industry, we have seen the need to process and analyze billions of call data records a day for mediation, customer relationship management, etc. It would be cost-prohibitive to store the historical data in a traditional warehouse for trend analysis and deep data mining. In addition, many applications, such as fraud detection and billing reconciliation, will require real-time analysis at the point of arrival.”

IBM has been working with telco customers to expand their analytical capabilities in two ways. On the one hand, it has been building big-data systems that deploy connectors between traditional data warehouses and feeds of real-time transactional data. This accommodates more real-time decision-making while still correlating real-time data with historical data in traditional warehouses. On the other, it has been expanding the range of users beyond data scientists in an “ivory tower” to include data enthusiasts: business users who can get their hands dirty in the data using relatively familiar interfaces.
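To make the idea of correlating a real-time feed with historical warehouse context concrete, here is a minimal sketch in Python. It is purely illustrative, not IBM's actual connectors or fraud logic; the record fields, the `historical_intl_minutes` lookup, and the thresholds are all invented for the example.

```python
# Illustrative sketch (not IBM's actual product): flag possible fraud by
# checking each incoming call-data record (CDR) against a per-subscriber
# baseline precomputed from the historical warehouse.
from dataclasses import dataclass

@dataclass
class CallRecord:
    subscriber_id: str
    duration_sec: int
    is_international: bool

# Stand-in for a warehouse lookup: average international minutes per day,
# computed offline from historical data (values are made up).
historical_intl_minutes = {"alice": 2.0, "bob": 45.0}

def flag_suspicious(record: CallRecord, today_intl_minutes: dict) -> bool:
    """Alert when today's international usage far exceeds the subscriber's
    historical baseline (a deliberately crude heuristic)."""
    if not record.is_international:
        return False
    used = today_intl_minutes.get(record.subscriber_id, 0.0) + record.duration_sec / 60
    today_intl_minutes[record.subscriber_id] = used
    baseline = historical_intl_minutes.get(record.subscriber_id, 0.0)
    return used > max(10.0, 5 * baseline)  # thresholds are arbitrary

# A 60-minute international call from "alice" (baseline: 2 min/day) is
# flagged; the same call from heavy user "bob" is within his baseline.
```

The design point is the one Bhambhri makes: the expensive historical analysis happens offline in the warehouse, while each arriving record is scored in real time against that precomputed context.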

“We provide the data scientists the right set of tools so that they can explore the sources that they want to analyze, and ask the questions they want to ask, so that they can focus on their core competency,” Bhambhri says. “They ask the questions, and the answers are given to them in user interfaces that they are familiar with, such as a spreadsheet. They can then interpret the results of those questions, and ask more questions in an iterative way. So from our standpoint, we provide the platform and the tools to increase the analytic capabilities and the capacity for the business users to make use of all this data. We hope to increase the number of data scientists over time, through iterative cognition, and by rising up the ladder from data to information, without needing to understand every nuance of every analytical operation.”

Making Analytics Consumable

Part of the challenge of big data is making sure that data scientists and enthusiasts alike don’t have to spend hours creating new analytical models, or scouring huge datasets that are literally expanding by the second. The journey of making analytics consumable has begun, but is not complete, Bhambhri says. IBM is at work developing algorithms so that patterns in data can be detected, and analysts can be alerted if certain patterns occur. This serves a similar role to the time-tested scientific practice of sampling data, in that it makes it more digestible. But the key difference is, the algorithms poll all the data that has been collected, and make decisions about which patterns are meaningful, which can then be analyzed by humans.

At the University of Ontario Institute of Technology, sensors monitoring the health of newborn babies return almost 1,000 readings per second. Traditionally, those readings are compressed into a single composite reading every 30 to 60 minutes, so each composite summarizes 1.8 million to 3.6 million individual readings. If a reading appears normal, it is discarded after being stored for 72 hours. Under this approach, any telltale pattern that occurs within a one-second interval might be lost. With the new technology IBM developed, however, those patterns can be discovered, first by applying machine learning techniques to historical data, then detected in real time. When this happens, an alert is sent to the analyst. The end result is that the babies’ likelihood of developing infections is greatly reduced, without requiring analysts to individually scan millions of records looking for patterns, or risking missing patterns by viewing consolidated readings.
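The contrast between alerting on the full per-second stream and averaging it into composites can be sketched with a simple rolling-statistics check. This is an assumption-laden toy, not the model IBM deployed; the window size, threshold, and z-score heuristic are illustrative choices only.

```python
# Illustrative sketch (not the model IBM deployed): alert when a per-second
# vital-sign reading drifts far from the rolling statistics of the recent
# stream, rather than averaging 30-60 minutes of data into one composite.
from collections import deque
from statistics import mean, pstdev

def stream_alerts(readings, window=60, threshold=3.0):
    """Yield (index, value) for readings more than `threshold` standard
    deviations from the mean of the previous `window` readings."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
        recent.append(value)

# A steady signal with one sudden spike triggers a single alert:
readings = [120.0, 121.0] * 40 + [200.0]
print(list(stream_alerts(readings)))  # [(80, 200.0)]
```

A 30-minute average of this same series would barely move, which is exactly the information loss the article describes: the one-second anomaly survives only if the full-resolution stream is analyzed.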

“Another example of machine learning on Big Data was demonstrated in Big Blue’s Watson computer, which has 4 TB of structured and unstructured data, including the entire content of Wikipedia, at its disposal. Watson was able to improve its Jeopardy! score from losing to a 12-year-old to beating the two reigning human champions,” Bhambhri says. In case you’re concerned that there is no place for humans in data analysis any longer, fret not: at the University of Ontario Institute of Technology, the machine learning was accomplished through several trial runs, in which humans pointed out the spots the algorithm missed. Once the gaps were identified, the tool could tirelessly focus on patterns that would be useful to the human scientists.

Data Scientist as Change Agent

Another way to keep data science out of the silo and at the forefront of the enterprise’s mind is to make sure that the organization is set up so that data scientists can truly bend the ear of C-level executives, Bhambhri says. We’ve explored the skill set of data scientists before in this series. Add “change agent” to the list of resume must-haves.

“You need some change agent in the company who can really show the business decision-makers that if they are not transforming themselves to become data-driven decision makers, then they will lose out to their competition,” Bhambhri says. “If the business side is not convinced, then it’s difficult to just get the IT arm going. But if you get the business side convinced, and this data scientist or change agent is really tied to a C-level executive in the company, I think that could really help them get started.”

Bhambhri has seen customers desert companies that don’t take data seriously. The annual IBM Cyber Monday benchmark survey of 500 retailers revealed that the number of people who use a mobile device to visit a retail Web site jumped 11 percent between 2010 and 2011. A retailer that doesn’t run a promotions campaign for mobile phones, or optimize its Web site so that it can be easily accessed from a mobile device, is missing out on that growth. The data scientist can (and should) play a key role in advocating for a dynamic, information-focused view of business growth, Bhambhri says.

Enterprises will need to cast a wide net for these individuals, and once hired, those individuals will need to be empowered.

“Organizations have to identify people from within who have a track record of breaking the status quo, and who are open to exploring new sources of information on a regular basis,” Bhambhri says. “And if they don’t have them within the organization, they need to bring them from outside the organization.”

Once found, these data scientists/change agents will need to be empowered to uncover value throughout the organization, and strongly encouraged to communicate their findings, or their missions will fail, Bhambhri warns.

The level of advocacy and business focus is one of the characteristics that separate the old idea of “statistician” or “analyst” from the emerging role of “data scientist,” Bhambhri says.

“It’s not so much that they are designing new systems as that they are really championing these new sources of data,” she explains. “Of course, IT still has to build the system, but the new data scientists are the change agents who really help departments collaborate throughout the organization to create value.”

Balancing Spending Across the Big Data Value Chain

If buying ever-larger data warehouses only provides a partial solution, the question remains, what is the right way to invest in building big data capabilities? Bhambhri has several recommendations.

One approach is to work with an established player that offers a mix of integrated capabilities and business partner solutions, so that the enterprise doesn’t waste resources stringing together multiple solutions, effectively becoming a systems integrator in its own right. IBM has integrated partnerships across the value chain, such as InfoSphere BigInsights for Hadoop-based data organization, Datameer for visualization, and Karmasphere for application development.

“To get started, customers should not jump into an enterprise-wide big data deployment before they really know what data has useful information,” Bhambhri says. “It is part of a data scientist’s mission to understand the business needs and evaluate potential big-data solutions that can deliver return on investment to the business. So, through our customer engagements team, we work with customers to identify use cases and proofs of concept, where we identify the challenges to help them start on the journey. We take this journey with them so we can put capabilities in the product that will be useful to them. We’re not building the product in isolation, so that gives us ROI, and the customer is happy that they can see the value before they make the investment.” Even at the end of this proof period, the customer is under no obligation to make that investment with IBM, Bhambhri adds.

Through this approach, IBM helped a Danish windpower company build and use a Big Data solution to analyze weather data. The objective was to use that data to identify optimal locations for deploying wind turbines. To do so, the company needed to run complex data mining models on a large volume of data, which took at least three weeks of processing time, even with a subset of the 2.8-petabyte data set.

Organizations ignore data at their peril, and there will only be more of it in the future. The technology to understand it is now within reach, and it should be exploited, Bhambhri says. But ultimately, the competitive difference will be made by the level of organizational influence exerted by the data scientist to make the information from the data actionable.


Dan Woods is CTO and editor of CITO Research, a firm focused on advancing the craft of technology leadership. He consults for many of the companies he writes about. For more stories about how CIOs and CTOs can grow visit www.CITOResearch.com.