Terabyte terror: It takes special databases to lasso the Internet of Things

Non-relational databases can help take the pain out of corralling swarms of sensor data.

There are enough magnets in the back of a Pixel C that it can stick to a refrigerator.
Ron Amadeo

If you believe figures from the technology research firm Gartner, there will be 25 billion network-connected devices by 2020. The "Internet of Things" is embedding networked sensors in everyday objects all around us, from our refrigerators to our lights to our gas meters. These sensors collect "telemetry" and route that data out to… whoever's collecting it. "Precision agriculture," for instance, uses sensors (on kites or drones) that collect data on plant health based on an analysis of near-infrared light reflected by crops. Sensors can do things like measure soil moisture and chemistry and track micro-climate conditions over time to help farmers decide what, where, and when to plant.
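The near-infrared analysis mentioned above is typically done with the Normalized Difference Vegetation Index (NDVI), which compares near-infrared and red reflectance. Here's a minimal sketch of that calculation; the reflectance values are invented for illustration.

```python
def ndvi(nir: float, red: float) -> float:
    """NDVI = (NIR - Red) / (NIR + Red). Healthy vegetation reflects
    strongly in near-infrared and absorbs red light, so values close
    to 1.0 suggest healthy crops; bare soil sits much lower."""
    if nir + red == 0:
        return 0.0  # avoid division by zero on a dead pixel
    return (nir - red) / (nir + red)

# A healthy plant reflects far more NIR than red light:
print(round(ndvi(nir=0.50, red=0.08), 2))  # 0.72
```

A drone or kite camera produces one such value per pixel, which farmers then map to see where crops are stressed.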

Regardless of what they're used for, IoT sensors produce a massive amount of data. That volume, and the variety of formats it arrives in, can often defy being corralled by standard relational databases. As such, a slew of nontraditional, NoSQL databases have popped up to help companies tackle that mountain of information.

This is by no means to say relational databases can't be used to handle sensor data. Quite the contrary—lots of companies start, and many never leave, the comfort of this familiar, structured world. Others, like Temetra (which offers utility companies a way to collect and manage meter data), have found themselves pushed out of the world of relational database management systems (RDBMSes) because sensor data suddenly comes streaming at them like a school of piranha.

From a trickle to a torrent of IoT data

In 2002, Temetra was a small company operating out of Ireland. It employed just five people at the time, but the company was already storing data from hundreds of thousands of water meters, analyzing flow through customers' pipes. "Having more data allows you to do more analysis on the network," Temetra Managing Director Paul Barry said. "As you can imagine, water utilities don't have unlimited budgets. Say I've got a budget of $25 million to go fix leaks. I could spend a lot of time chasing them down. It's much better to address the least efficient parts of the network, where I get the most bang for my buck."

To that end, Temetra's not just working to give detailed information on each of its customers' meters. It is also aggregating that data into actionable results, as in "show me where all my leaking meters are." Over the course of 10 years, Temetra wound up collecting a flood of data from sensors to give customers that level of insight.
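A "show me my leaking meters" query is an aggregation over per-meter readings. One common heuristic is minimum nighttime flow: a meter that never drops to zero when nobody should be using water probably has a leak downstream. This sketch is illustrative only; the meter names, sample data, and threshold are all invented, not Temetra's actual logic.

```python
from collections import defaultdict

# Hypothetical overnight samples: (meter_id, litres_per_hour),
# taken in the small hours when legitimate usage should hit zero.
readings = [
    ("meter-17", 0.0), ("meter-17", 0.0),
    ("meter-42", 3.1), ("meter-42", 2.8),  # flow never stops
    ("meter-99", 0.0), ("meter-99", 1.2),  # drops to zero at least once
]

def leaking_meters(samples, threshold=0.5):
    """Flag meters whose *minimum* overnight flow stays above a threshold:
    water moving while everyone sleeps suggests a leak."""
    min_flow = defaultdict(lambda: float("inf"))
    for meter, flow in samples:
        min_flow[meter] = min(min_flow[meter], flow)
    return sorted(m for m, f in min_flow.items() if f > threshold)

print(leaking_meters(readings))  # ['meter-42']
```

Running this roll-up across hundreds of thousands of meters is exactly the kind of workload that strains a single relational database as the data grows.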

In 2002, the company was doing what everybody did back then—storing data in an RDBMS. "Everybody was just using SQL databases in 2002," Barry said. "Google had started to break the mold, but typically, everybody wrote apps in a monolithic way, with an RDBMS on the back end." Thus, Temetra's meter sensor data was pouring into PostgreSQL, the venerable relational database.

For a decade, things went swimmingly—then came 2012. The company, which was selling software as a service (SaaS), entered the UK market and started dealing with water utilities that were much bigger than those in Ireland. Leading up to the expansion, Temetra had started to see what Barry called "explosive growth." And since the volume of data was going up so significantly upon entering the UK, the company needed to look at different databases that could better store and help analyze it.

It's not that PostgreSQL wasn't up to the job of handling the expected spike in data volume, Barry said. The problem with the RDBMS was that the administrative burden would explode right along with the data volume. "Backups were getting very big," Barry said. "In that master/slave type of database [a database replication scheme wherein a master database is regarded as the authoritative source and the slave databases are synchronized to it], it takes longer to handle [replication] as the data grows and grows. The more data grew, the greater was the burden."

With only five employees, Temetra's driver was to find a data repository that would allow it to have low administrative costs and high database reliability. The team looked at a lot of options, including the non-traditional data stores MongoDB and CouchDB/Couchbase. It came down to a choice between Basho's Riak and Cassandra. And the main reason Temetra chose Riak was because the company got it running practically in the blink of an eye. "I had a test up and running very quickly. An hour with Riak, and I was up and storing data," Barry said. "I was very confident it would maintain that reliability, with a low administrative burden."

Cassandra has improved a lot since then, according to Barry. But back when he was looking for a data store that was easy on his tiny team, he found it "a little fiddly to set up and properly configure to get high availability."

That's not surprising, said Zach Altneu, CIO of the IoT car technology company VCARO. He told Ars that when looking at some of the NoSQL databases out there, his company picked DataStax Enterprise Cassandra over Riak. VCARO's decision happened in no small part because it already had staffers with Cassandra skills. As Altneu said, you've got to know how to implement Cassandra correctly to be successful with it, and that means knowing how to properly set up a data schema.
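"Setting up a data schema properly" in Cassandra usually means choosing partition keys so that no partition grows without bound. For time-series sensor data, a common pattern is to bucket rows by sensor and time period. The sketch below illustrates that idea in plain Python; the bucket granularity and the CQL table in the comment are assumptions for illustration, not VCARO's actual schema.

```python
from datetime import datetime

def partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Bucket each reading by (sensor, day) so a sensor's partition stays
    a bounded size; within a partition, rows are clustered by timestamp.
    Day-level buckets are an assumption -- granularity depends on write rate."""
    return (sensor_id, ts.strftime("%Y-%m-%d"))

# Equivalent CQL shape (illustrative only):
#   CREATE TABLE readings (
#       sensor_id text, day text, ts timestamp, value double,
#       PRIMARY KEY ((sensor_id, day), ts)
#   ) WITH CLUSTERING ORDER BY (ts DESC);

print(partition_key("car-7", datetime(2016, 4, 12, 9, 30)))
# ('car-7', '2016-04-12')
```

Get this wrong—say, partitioning by sensor alone—and a busy sensor's partition grows forever, which is one way teams without Cassandra experience come to grief.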

Barry said that when Temetra evaluated the data store, he "wasn't 100 percent sure I had Cassandra configured properly." Temetra started writing data to both Riak and Cassandra to put each solution through its paces. Barry tested the two using some of the standard tricks, including unplugging a node and making sure the cluster still worked as normal.
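Writing to both candidates at once is a standard evaluation technique: fan every incoming reading out to each store, keep the incumbent as the source of truth, and watch how the candidate behaves under real traffic. A minimal sketch of the idea, using plain dictionaries as stand-ins for the two stores (the function shape is an assumption, not Temetra's code):

```python
def dual_write(stores, key, value):
    """Fan one reading out to every store under evaluation and report,
    per store, whether the write could be read back. A store that errors
    or loses the write is marked False instead of failing the whole write."""
    results = {}
    for name, store in stores.items():
        try:
            store[key] = value
            results[name] = store.get(key) == value
        except Exception:
            results[name] = False
    return results

riak, cassandra = {}, {}  # stand-ins for the real clients
print(dual_write({"riak": riak, "cassandra": cassandra}, "meter-17", 3.1))
# {'riak': True, 'cassandra': True}
```

Node-pull tests then amount to checking that `results` stays all-True for the surviving cluster while a member is down.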

The data did not differ between Riak and Cassandra; it was more that the tools for Cassandra were a bit limited at the time. Barry couldn't find information on the running state of the cluster, and it wasn't easy to know if the cluster was healthy. By contrast, checking cluster health was easy in Riak. Right from the beginning, the service came with "nice tools," Barry said. "You could run one command and know that the cluster was in a healthy state."

For Altneu and VCARO, it was all about cost and keeping tight control over the setup, from setting up the database through all operational activities. There would be no database-as-a-service (DBaaS), where somebody like Google or Amazon lifts all that work off your shoulders (thank you very much). As far as costs go, Cassandra is open source and saves a heap of dough. Using cloud hosting for lightweight Linux servers, VCARO runs a cluster for less than $1,000/month.

By contrast, Temetra didn't want to be up to its elbows in database guts, and performance wasn't all that critical. The company had plenty of headroom left in the PostgreSQL RDBMS. Rather, the choice of where to go with all that sensor data was about reliability above all else, Barry said.

"It sounds reasonable, but some NoSQL databases don't give you such a strong contract for reliability," Barry said. "They'll trade off for performance. They'll give you high-speed queries, but one in 1 million may fail. We can't afford that. [With fast-flowing sensor data], we have one shot to store it and respond that we've successfully stored it. Once we have, it's our responsibility to [tell our clients] that it's been stored."
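Barry's "one shot to store it" contract boils down to: never acknowledge a write until the store has confirmed it, and retry rather than silently drop the reading. The toy store and retry loop below are invented to illustrate that contract; real systems enforce it with quorum writes rather than a read-back loop.

```python
class FlakyStore:
    """Toy store that silently drops the first write to each key -- a
    stand-in for the 'one in a million may fail' behavior Barry describes."""
    def __init__(self):
        self.data = {}
        self.seen = set()

    def put(self, key, value):
        if key not in self.seen:
            self.seen.add(key)   # drop the first attempt without error
            return
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


def store_with_ack(store, key, value, attempts=3):
    """Acknowledge success only after confirming the write is readable;
    otherwise retry. The sensor reading is one-shot, so an unacknowledged
    failure would lose it for good."""
    for _ in range(attempts):
        store.put(key, value)
        if store.get(key) == value:
            return True   # now it's safe to tell the client "stored"
    return False

s = FlakyStore()
print(store_with_ack(s, "meter-17/02:00", 0.0))  # True (succeeds on retry)
```

A database that acknowledges before the write is durable can return True here while the data is already gone, which is exactly the trade-off Temetra couldn't accept.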
