By 2020 there will be six connected devices for each person on the planet. Alongside this tremendous growth in connectivity, the Internet of Things (IoT) has steadily become an everyday reality rather than a trend of the distant future. The opportunities for both businesses and consumers to benefit from the data being produced are considerable, particularly as organisations become better at gleaning insights from the IoT.
We already have multiple use cases, particularly in industry, where data is constantly being produced on all manner of devices—ranging from factory machinery that outputs manufacturing quality markers, through to connected farm equipment that tracks a field’s microclimate. A study recently released by Beecham Research claimed that IoT could increase food production by 70%, providing a major boost to efforts to help feed the 9.6 billion people expected to inhabit the world by 2050.
However, despite the tremendous potential of IoT, how can it actually be delivered in data processing terms, and how fast can businesses or consumers gain insight from the data being produced? The answer lies in choosing the right platform to manage the entire data process.
Dealing with the data deluge
The massive volumes of data currently being generated by connected devices require a truly “big data” approach. With 50 billion connected devices set to be online by 2020, we will see a constant flow of information being created that carries significant value, if it can be effectively processed.
To make sense of this data, it’s key that businesses store it properly and build historical references that allow a depth of data to be established and trends to be detected. To do this with huge numbers of small individual files, a highly scalable IoT processing platform is essential. Can that platform scale to millions, billions, or even trillions of files? The answer has to be “yes” every time.
It’s also worth considering that simply storing the information in a data warehouse and generating reports every few days is not enough. For the vast majority of IoT use cases this isn’t effective: the data needs to be processed on a streaming basis, with the ability to identify and act on valuable information quickly and easily. In the IoT world, speed is essential.
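To make the streaming idea concrete, here is a minimal sketch of reading-by-reading processing: a hypothetical temperature feed is checked against a sliding window as each value arrives, rather than being batched up for a later report. The sensor values, window size, and threshold are all illustrative assumptions, not taken from any specific product.

```python
from collections import deque

def rolling_alerts(readings, window=5, threshold=3.0):
    """Flag readings that deviate sharply from the recent rolling mean.

    Processes the feed one reading at a time, as a stream processor
    would, instead of waiting for a complete batch.
    """
    recent = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            if abs(value - mean) > threshold:
                alerts.append((i, value))
        recent.append(value)
    return alerts

# Example: a stable sensor with one sudden spike.
feed = [20.1, 20.3, 20.0, 20.2, 20.1, 27.5, 20.2]
print(rolling_alerts(feed))  # the spike at index 5 is flagged
```

A production stream processor adds partitioning, fault tolerance, and backpressure on top of this loop, but the core contrast with warehouse-style reporting is the same: each event is evaluated the moment it arrives.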
With large volumes of both unstructured and structured data being produced by the IoT, a flexible platform is required that can properly store and process all types of data regardless of form. It should also support stream processing from the outset, with the functionality to handle low-latency queries against semi-structured data items, at scale.
Developing a structure for things
One challenge facing the existing IoT landscape is that there are no widely accepted reference architectures. Those that have been proposed consistently include a theme of polyglot processing: by combining different processing modes within a single platform, it becomes possible to deal with the varying formats the IoT produces.
In addition, privacy and security need to be covered by an IoT data processing platform. From data masking to provenance support and encryption, protecting the user and their privacy is crucial.
Collecting the data alone is also not enough. It is critical that the platform can combine deep predictive analytics on historical data with real-time events. Ensuring that some kind of low-latency, high-throughput database facility is integrated into the overall workflow is therefore extremely important. In addition to massive scale, IoT workflows require fast lookups, and sometimes updates, of data, without breakdowns in performance.
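The combination described above, historical context fetched per event via a fast lookup, can be sketched as follows. A plain dict stands in for the low-latency key-value facility, and the device names and baseline figures are hypothetical:

```python
# Hypothetical per-device baselines, as might be precomputed from
# historical data and held in a fast key-value store (a dict here).
historical_baseline = {
    "pump-01": {"mean_vibration": 1.2},
    "pump-02": {"mean_vibration": 0.8},
}

def enrich_event(event, store):
    """Join a real-time event with its stored historical context.

    Each live event triggers one keyed lookup; the workflow's
    throughput is bounded by how fast that lookup is.
    """
    baseline = store.get(event["device"], {})
    deviation = event["vibration"] - baseline.get("mean_vibration", 0.0)
    return {**event, "deviation": round(deviation, 2)}

live = {"device": "pump-01", "vibration": 2.0}
print(enrich_event(live, historical_baseline))
```

The design point is that the lookup sits inside the per-event path: if the database facility slows down under load, every event in the stream is delayed with it, which is why consistent low latency matters more than raw batch throughput here.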
Apache Hadoop represents the natural platform for such requirements. Designed for large-scale data intensive deployments, Hadoop is built to process huge amounts of data by connecting many commodity computers together to work in parallel.
Through MapReduce or the Apache Spark execution engine, Hadoop can take a query over a dataset, divide it, and run it in parallel over multiple nodes. This ability makes it ideal as a data processing platform for the volume of data produced by the different components in the IoT landscape. With data volumes only set to increase, it’s crucial that an effective data processing platform is in place from the very beginning so that results continue to be delivered quickly as the deployment grows.
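The divide-and-run-in-parallel pattern can be sketched in plain Python: a map phase emits key-value pairs from each partition of the input, a shuffle groups pairs by key, and a reduce phase aggregates each group. In Hadoop, each partition would be processed on a separate node; here the partitions and sensor records are illustrative, and the phases run sequentially for clarity.

```python
from collections import defaultdict
from itertools import chain

def map_phase(partition):
    # Emit (device, reading) pairs from one partition of the input.
    return [(rec["device"], rec["temp"]) for rec in partition]

def reduce_phase(key, values):
    # Aggregate all values for one key; here, the maximum temperature.
    return key, max(values)

def map_reduce(partitions):
    # Map each partition (on a cluster, these run on separate nodes).
    mapped = chain.from_iterable(map_phase(p) for p in partitions)
    # Shuffle: group intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce each key's group independently (also parallelisable).
    return dict(reduce_phase(k, v) for k, v in sorted(groups.items()))

partitions = [
    [{"device": "a", "temp": 21}, {"device": "b", "temp": 30}],
    [{"device": "a", "temp": 25}],
]
print(map_reduce(partitions))  # {'a': 25, 'b': 30}
```

Because the map calls touch disjoint partitions and the reduce calls touch disjoint keys, both phases scale out simply by adding nodes, which is what makes the model suited to IoT-scale inputs.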
Getting ready for IoT
Already there are areas where the IoT is starting to have a real and important impact, with plans in place to dramatically increase the volume of data being produced from initiatives such as smart cities. As an example, Bristol is set to become the UK’s first “programmable city” this year. Since March 2015 the city’s data on air quality, temperature, humidity, traffic movement and signal patterns has been collected and analysed to inform council decisions on areas like spending and legislation.
The IoT represents a tremendous opportunity and will continue to gain momentum through the course of the next few years. Integral to this development is that organisations, regardless of the sector they operate in, select the right data processing platform and support from the very beginning. If done effectively, the possibilities for IoT innovation are endless.
By Rob Anderson, vice president, systems engineering at MapR Technologies