Improving the reliability of the Internet of Things

Johan Kraft of Percepio

Over-the-air updates are enabling a dramatic change in the way systems in the Internet of Things (IoT) operate. Here, Johan Kraft, CEO and founder of Percepio, explains the benefits.

The obvious advantage is, of course, easier updates, often downloaded and installed transparently. When this is coupled with software tracing, it becomes a powerful mechanism for improving the quality and reliability of a wide range of embedded IoT systems.

Systems still deployed with bugs

Despite the best efforts of developers, these systems are still deployed with bugs remaining in their code. A development team introduces on average about 120 bugs per 1,000 lines of code during development and about 5%, or 6 bugs per 1,000 lines of code, typically remain in the shipped software. When there are thousands of IoT devices deployed in the field, relying on users to report the problems caused by these bugs is neither reliable nor scalable. User reports also tend to be vague and unhelpful for solving the problem. When there are millions of devices, this matters even more.
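To put those averages in concrete terms, the arithmetic can be sketched in a few lines of Python. The function name and the 50,000-line example are illustrative assumptions; only the 120-per-1,000-lines and 5% figures come from the text above.

```python
def estimate_shipped_bugs(lines_of_code, introduced_per_kloc=120, escape_percent=5):
    """Estimate bugs remaining at release, using the averages quoted above.

    The defaults are the article's figures; real projects will differ.
    """
    # Bugs introduced during development for this amount of code.
    introduced = lines_of_code * introduced_per_kloc / 1000
    # Share of those bugs that survive into the shipped software.
    return introduced * escape_percent / 100


if __name__ == "__main__":
    print(estimate_shipped_bugs(1_000))    # the per-1,000-lines case
    print(estimate_shipped_bugs(50_000))   # a hypothetical firmware image
```

By these averages, a hypothetical 50,000-line firmware image would ship with around 300 latent bugs, each of which is then replicated on every deployed device.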

These missed bugs probably won’t show up right away. They cause problems only under certain circumstances; otherwise they would have been found before the product shipped. While an over-the-air (OTA) update can solve the problem in the field, developers need some kind of feedback system to know about issues in the deployed devices, and they need to know quickly. This kind of feedback loop has long been standard in the development of mobile and cloud applications (DevOps), and it has now become viable for embedded development as well.

Identify new and important issues

The key to finding out about, and solving, problems in the field is the combination of software tracing, cloud management and OTA updates, but this is a complex challenge. The tracing code needs to be as efficient as possible in a system that is already constrained in resources. The link back to the cloud needs to be secure and transparent, and it must transfer the right data to help developers identify any problems quickly and easily. The cloud service has to identify which issues are new and important, and then notify the developers that there is a problem they need to fix. Once it’s fixed, the updated software must be distributed to all devices via an OTA update. And all of this needs to scale across millions of devices.

The information flow starts in the error handling code of the IoT device, such as already existing sanity checks and fault exception handlers. Using a software agent, firmware issues are uploaded as alerts to a customer’s cloud account. An alert may include an error message and any other information relevant to the specific issue, such as software state variables and hardware registers. Depending on the severity of the issue, the alert is either uploaded directly or after a device restart, once the cloud connection has been restored.
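The device-side flow described above can be modelled roughly as follows. This is a Python sketch for illustration only, not firmware code and not the actual agent: the `Alert` and `AlertAgent` names, the two severity levels, and the rule that fatal alerts wait for a restart are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    WARNING = 1   # device keeps running, cloud link assumed usable
    FATAL = 2     # assumed to force a restart before upload is possible


@dataclass
class Alert:
    message: str
    severity: Severity
    # Extra diagnostic data, e.g. state variables or register values.
    context: dict = field(default_factory=dict)


class AlertAgent:
    """Hypothetical agent that routes alerts by severity."""

    def __init__(self, upload):
        self._upload = upload   # callable that sends one alert to the cloud
        self._pending = []      # stands in for storage that survives a restart

    def report(self, alert):
        if alert.severity is Severity.FATAL:
            # Hold the alert until the device has restarted and reconnected.
            self._pending.append(alert)
        else:
            self._upload(alert)

    def on_restart(self):
        # Called once the cloud connection is back: flush held alerts in order.
        while self._pending:
            self._upload(self._pending.pop(0))
```

In use, an existing sanity check would call something like `agent.report(Alert("sensor timeout", Severity.WARNING, {"retries": 3}))`, while a fault exception handler would report a `FATAL` alert that is delivered by `on_restart()`.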

The alerts may also include a trace of the most recent software events in the device, which is recorded automatically by the agent. The trace provides both the details of the error and the context, making it easier for developers to identify the bug.
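A common way to keep "the most recent events" within a fixed memory budget is a ring buffer, where each new event overwrites the oldest one. The sketch below shows the idea in Python; the `TraceRecorder` class and its methods are hypothetical names, and the article does not state that the agent is implemented this way.

```python
from collections import deque


class TraceRecorder:
    """Keeps only the newest `capacity` events, discarding older ones."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry when a new one is appended.
        self._events = deque(maxlen=capacity)

    def record(self, timestamp, event):
        self._events.append((timestamp, event))

    def snapshot(self):
        # Copy of the buffered events, oldest first, to attach to an alert.
        return list(self._events)
```

When an error is detected, `snapshot()` gives the events leading up to it, which is the context a developer needs alongside the error message itself.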

The encoding efficiency is key here, to ensure that a minimal amount of memory is needed to store a trace that provides developers with the context they need to identify the real problem. This is important for three reasons: it allows traces of sufficient length to be collected even from memory-constrained IoT systems, it reduces the upload time to a fraction of a second, and it minimises the cloud-side operational costs of alert messaging and storage. This encoding efficiency makes it possible to use the trace technology out in the field, even in small IoT devices, bringing dramatic advantages.
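To illustrate why encoding matters, here is a sketch of one possible compact layout: each event is stored as a 16-bit timestamp delta, an 8-bit event ID and an 8-bit argument, four bytes in total. This layout is an assumption invented for the example and is not the trace format actually used; it also assumes consecutive events are never more than 65,535 ticks apart.

```python
import struct

# Hypothetical record: little-endian uint16 delta, uint8 event ID, uint8 argument.
EVENT_FORMAT = "<HBB"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)  # bytes per encoded event


def encode_events(events):
    """Pack (timestamp, event_id, arg) tuples, storing time as deltas."""
    out = bytearray()
    previous = None
    for timestamp, event_id, arg in events:
        delta = 0 if previous is None else timestamp - previous
        out += struct.pack(EVENT_FORMAT, delta, event_id, arg)
        previous = timestamp
    return bytes(out)


def decode_events(data, start_timestamp):
    """Rebuild absolute timestamps from the deltas, given the first timestamp."""
    events = []
    timestamp = start_timestamp
    for delta, event_id, arg in struct.iter_unpack(EVENT_FORMAT, data):
        timestamp += delta
        events.append((timestamp, event_id, arg))
    return events
```

With four bytes per event, a 2 KB buffer on the device would hold 512 events under this assumed layout, far more than the same buffer could hold as text log lines.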

Alerts from the firmware agent are uploaded to the customer’s cloud service, which is configured to store the alerts and to also notify an engine that handles classification, statistics and sending of notifications to the developers. It also offers configuration options, for example identifying the conditions under which notifications should be sent and to whom.
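The classification step can be sketched as a counter keyed by issue. The example below assumes each alert arrives with a `signature` string that identifies the issue, and that the configured condition is "notify on first occurrence"; the `AlertClassifier` name and the message text are hypothetical.

```python
from collections import Counter


class AlertClassifier:
    """Counts alerts per issue and notifies developers about new issues."""

    def __init__(self, notify):
        self._notify = notify      # callable that delivers a message to developers
        self._counts = Counter()   # occurrences seen per issue signature

    def handle(self, signature, device_id):
        self._counts[signature] += 1
        if self._counts[signature] == 1:
            # First time this issue has been seen anywhere in the fleet.
            self._notify(f"New issue {signature} first seen on {device_id}")
        return self._counts[signature]
```

Repeat occurrences only update the statistics, so developers are not flooded when the same bug fires on thousands of devices.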

Notifications received

When developers receive notification about a new issue, they can access alerts and traces to see what the problem is.

Privacy is also key here. The software trace never needs to leave the customer’s cloud account. Only an anonymised signature of the alert is required for the cloud processing, which can be performed in an external cloud service. This information can be made completely transparent, configurable, and meaningless on its own. Communication and storage are provided by the existing capabilities of the developer’s IoT platform, using best practices for authentication and encryption.
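One way such a signature could be derived is by hashing only the fields that identify the issue, leaving out device identity and trace data. The sketch below is an assumption for illustration: the choice of fields, the separator and the truncation to 16 hex characters are all invented here, and the article does not describe how the real signature is computed.

```python
import hashlib


def alert_signature(firmware_version, error_code, location):
    """Return a short digest identifying an issue, with no device or user data.

    The inputs are hypothetical: a firmware version string, an error code,
    and a code location such as "file.c:123".
    """
    material = "|".join([firmware_version, str(error_code), location])
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
    return digest[:16]  # shortened for readability; an arbitrary choice
```

Two devices hitting the same bug in the same firmware version produce the same signature, so the external service can count and classify issues without seeing the alert contents.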

Lab testing is not enough

Because of the complexity of today’s embedded IoT systems, testing in the lab just isn’t enough to eradicate all software issues. Real-time tracing and alerts can identify bugs in the field as they happen, with automatic notifications to the developers to speed up resolution.

Such a system has to be scalable, secure and transparent to the developers. Once in place, it can provide immediate awareness on the very first occurrence of an issue, before many users have been affected, and let developers take full advantage of OTA updates to rapidly improve their product.

The author is Dr. Johan Kraft, CEO and founder of Percepio AB.

About the author

Dr. Johan Kraft is CEO and founder of Percepio AB. Dr. Kraft is the original developer of Percepio Tracealyzer, a tool for visual trace diagnostics that provides insight into runtime systems to accelerate embedded software development. His applied academic research, in collaboration with industry, focused on embedded software timing analysis. Prior to founding Percepio in 2009, he worked in embedded software development at ABB Robotics. Dr. Kraft holds a PhD in computer science.

More information is available from Percepio about the DevAlert cloud service, which provides immediate feedback when something unexpected happens in the software of deployed IoT devices.
