Concepts for updating embedded systems

When we build an embedded system, we must plan for software updates from the start. There are many reasons to ship an update: a bug that slipped through testing, a security issue in a third-party component, or a new feature. This article describes the update concepts we can use to keep embedded devices up to date, without showing an actual implementation. Follow-up articles will (hopefully) show some implementations.

Naming Convention

In this article, we use different terms for how safe an update is. We differentiate between the following kinds of updates:

Naming | Description
Updatable | The system can be updated, but a power cut during the update might break it.
Failsafe | The system can be updated in a failsafe way. If there is a power cut during the update, the system continues running in its previous state.
Recoverable | The system can be updated in a failsafe way. Additionally, it can recover if an update was installed successfully but somehow breaks the device in the field.

We will go into more detail about updatable, failsafe, and recoverable in the next three sections. We also differentiate between updating only the Linux root file system and additionally updating the bootloader. This leads us to the following combinations:

U-Boot | Linux Rootfs | Naming | Example
Updatable | Updatable | Updatable | Package Manager
Updatable | Failsafe | Failsafe | Partition Swap
Updatable | Recoverable | Recoverable | Partition Swap + Bootcount
Failsafe | Failsafe | Double Failsafe | eMMC Boot Partitions + Partition Swap
Failsafe | Recoverable | Failsafe Recoverable | eMMC Boot Partitions + Partition Swap + Bootcount
Recoverable | Recoverable | Double Recoverable | SPL with recovery option + Partition Swap + Bootcount

Updatable

A system is updatable as soon as we can replace parts of the root filesystem. An example is a Linux distribution that provides a package manager. If a service technician performs the update, an updatable system might be enough. This could be, for example, a medical device that is not connected to the Internet and is updated by a technician during maintenance. The update can fail, and manual steps might be needed to make the system work again. If the update is meant to be installed by a non-technical person, or if the device updates automatically, we should not use this approach because a failed update might break the device.

Failsafe

A system is failsafe when it stays in a functional state even if the power fails while the update is being installed. An example is a home network router that the manufacturer updates to fix security issues. If the power fails, the router boots again without the update installed and tries to install it again. This should be the default for most embedded devices that install updates automatically or can be updated by end users. An update file normally contains a checksum and should also be signed, so that the update tool can verify that the new firmware comes from the same provider as the previous firmware (often the manufacturer of the device). If the update has been properly tested, this method will not break devices in the field. Unfortunately, exotic configurations exist that we never test, and new firmware that does not work with such an old configuration can still break the device in the field. To cover these cases, we can choose a recoverable system. However, one can argue that instead of investing in a recoverable system, we should invest in tests that catch such issues before they reach the field.
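The checksum part of that verification could look roughly like the following sketch. It assumes the expected SHA-256 digest is delivered alongside the update file; the function names are made up for illustration. A checksum only detects corruption, so it does not replace the signature check, which needs a cryptography library and is not shown here.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare the computed digest with the expected one (case-insensitive)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())
```

An update tool would refuse to install the file when `checksum_matches` returns False, and only then go on to verify the signature.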

Recoverable

In a recoverable system, we can detect that the system is no longer working and then start a recovery procedure. This matters most for really complex devices. An example is a home automation gateway that also allows scripting. Maybe the end customer used a function of the device in a way it was never meant to be used, and the update prohibits this use case. This leads to a broken system. In a recoverable system, the new firmware must explicitly mark the update as successful. If it detects a configuration issue or crashes while booting, it never does so. A watchdog or daemon detects that the system is in a bad state and triggers a reboot. After several tries, the bootloader switches back to the previous installation. Whether a recoverable update is required or a failsafe update is enough depends on the use case. The more use cases a device supports, the more likely it is that a configuration change breaks the system.
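The "mark as successful" step on the firmware side might be sketched as follows. The health checks and the `mark_successful` callback are hypothetical placeholders; on a real device the callback would talk to the bootloader in whatever way the platform provides.

```python
def confirm_if_healthy(checks, mark_successful):
    """Run every health check and confirm the update only if all of them pass.

    checks: mapping of name -> zero-argument callable returning True/False.
    mark_successful: callable that tells the bootloader the update is good
                     (hypothetical; e.g. it could reset a boot counter).
    Returns the list of failed check names (an empty list means confirmed).
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failed.append(name)
    if not failed:
        mark_successful()
    return failed
```

If any check fails, the update is simply never confirmed, and the watchdog and bootloader take care of the rollback.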

Linux Rootfs Update

In this section, we will have a look at different methods for updating a Linux application or system.

Package Manager

Most Linux distributions have a package manager installed. The package manager often indicates what the distribution is based on: Debian-based systems normally use apt-get, Arch-based distributions use pacman, and Fedora-based systems use dnf.

What all of these package managers have in common is that they are not failsafe. If a power cut happens while the system is updating, the system can end up in a state where manual intervention is required (e.g. running apt-get again). This is acceptable for desktop systems where an operator is sitting in front of the PC, but not for an embedded system that might be inaccessible.

Partition Swap

Partition swapping is used in a lot of embedded systems. The idea is shown in this image:

[Figure: Partition Swap]

Instead of one partition, we use two equivalent partitions and swap them after an update has been installed successfully. The swap is signaled to the bootloader by setting a flag or updating its environment. It is important that signaling the partition swap is atomic or failsafe (e.g. by having a backup environment). This mechanism is failsafe, and we can even make it recoverable. By enabling a boot counter in the bootloader, we can count how many times the system has tried to boot without success. If the updated system has been booted more than n times without success, we switch back to the previous boot partition. Because the update never touched that partition, the system can still boot from it. With this change, we have a recoverable system.
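The boot counter logic can be simulated in a few lines. This is only a model of the decision the bootloader makes, not bootloader code. The dictionary stands in for the bootloader environment, and the key names and the default limit of 3 are illustrative assumptions.

```python
def select_boot_slot(env, bootlimit=3):
    """Decide which slot to boot and update the boot counter.

    env is a dict standing in for the bootloader environment:
      "active"            -> slot to boot by default ("A" or "B")
      "upgrade_available" -> True while a new update is unconfirmed
      "bootcount"         -> boot attempts since the update was installed
    """
    if env["upgrade_available"]:
        env["bootcount"] += 1
        if env["bootcount"] > bootlimit:
            # Too many unconfirmed attempts: fall back to the untouched slot.
            env["active"] = "B" if env["active"] == "A" else "A"
            env["upgrade_available"] = False
            env["bootcount"] = 0
    return env["active"]


def confirm_update(env):
    """Called by the running system once it considers itself healthy."""
    env["upgrade_available"] = False
    env["bootcount"] = 0
```

With a limit of 3, a freshly updated slot gets three boot attempts. If `confirm_update` is never called, the fourth call to `select_boot_slot` returns the other slot.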

Recovery Image

Using a recovery image is another widely used method. The principle is comparable to partition swap and is shown in the image below:

[Figure: Recovery Image]

Instead of having two copies of the same partition, we use a small recovery partition alongside the production partition. This has the advantage that we have more space for the main root filesystem. The disadvantage is that we need to boot into the recovery image, run the update, and afterward switch back to the production system. Downloading the update can be done either in the main system or in the recovery system. If the recovery image is responsible for the download, it also needs access to the infrastructure (e.g. a connection via LTE), which makes the recovery image more complex. On the other hand, such a system can be made recoverable more easily because the recovery image is capable of downloading older and newer versions of the image. If, for example, a broken image was rolled out, it is still possible to fix it. If only the production system can receive updates, this is not so easy anymore.

As a recovery image, we often use a ramdisk image.

Besides the advantage of less space required there are also some disadvantages compared to partition swapping:

Ramdisk

A similar approach to the recovery image is to use a ramdisk that is part of the production system. It is comparable to what x86/amd64 distributions do: there, a ramdisk image is used to load drivers before the actual rootfs is mounted. If we want, we can use this ramdisk to check for a new version of the main system and then deploy the update. The principle is shown in this image:

[Figure: Recovery Image]

This method is often used in combination with other methods. For example, OSTree can use a ramdisk image to chroot into the deployed rootfs.

OSTree

OSTree uses a file repository to do updates. The mechanism is shown in the following image:

TODO

OSTree has a mechanism to create a file system from a file repository. It is comparable to git in that there is a repo (.git) and a currently checked-out version (the working directory). The version defines which file can be found under which path. To avoid copying files multiple times, it uses hardlinks. An update contains only the files that changed between the last version and the new version. This creates diff updates where only changes are sent to the device. Such mechanisms are powerful if the update is delivered over a mobile connection or a connection with limited bandwidth.
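The idea behind such diff updates can be illustrated with a small content-addressed model. This is not OSTree's actual repository format or API, only a sketch of the principle: each version maps paths to content hashes, and the device needs to fetch only the content it does not already have. All names are made up for illustration.

```python
import hashlib


def manifest(tree):
    """Map each path to the SHA-256 of its content (tree: path -> bytes)."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in tree.items()}


def objects_to_fetch(old_manifest, new_manifest):
    """Return the content hashes the device is missing for the new version."""
    have = set(old_manifest.values())
    return sorted(set(new_manifest.values()) - have)
```

A file that is unchanged, or that merely appears under a second path with identical content, causes no download because its hash is already present on the device.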

Filesystem

Some filesystems provide a mechanism to store snapshots. This feature can be used to restore an old version afterward. However, this normally can't guarantee that an update is failsafe. It has to be combined with a ramdisk so that it is possible to recover if an update goes wrong.

BTRFS

BTRFS allows us to store a snapshot of the file system. By creating a sub-volume we can back up the current state to a directory. If we modify a file afterward, BTRFS keeps the old copy of the file instead of overwriting it. To implement a failsafe mechanism, we create a snapshot of the current version and then use e.g. a package manager to update the system. If the update fails, we have to detect this in a ramdisk image and then revert to the previous snapshot. With this mechanism, we could also implement a recoverable system: we create a file that records that an update is pending, and a boot counter triggers a revert once the number of failed boots exceeds a limit.
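The snapshot step before the update could be scripted along these lines. The sketch only builds the command line for `btrfs subvolume snapshot` and does not execute it. The function name is made up, and how the snapshot is later restored depends on the subvolume layout, so that part is left out.

```python
def snapshot_command(source, destination, readonly=True):
    """Build the argv for 'btrfs subvolume snapshot [-r] <source> <dest>'."""
    argv = ["btrfs", "subvolume", "snapshot"]
    if readonly:
        argv.append("-r")  # a read-only snapshot cannot be modified by the update
    return argv + [source, destination]
```

The returned list can be passed to `subprocess.run` on a device with a BTRFS root, for example with `/` as the source and a hypothetical `/.snapshots/pre-update` as the destination.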

[Figure: BTRFS]

OverlayFS

Another method is to use OverlayFS, similar to what Docker does. When an update succeeds, we create a new overlay and mount it on top of the old overlays. All old overlays are mounted read-only, so changes only appear in the last overlay. If we end up in an unrecoverable state, we can just delete the last layer and create a new, empty one.
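Stacking the layers comes down to the options passed when mounting the overlay. The following sketch builds such a mount command from a list of read-only layers plus one writable layer. The function name and all paths are hypothetical, and the command is only constructed, not executed.

```python
def overlay_mount_command(layers_oldest_first, upperdir, workdir, target):
    """Build the argv for mounting read-only layers plus one writable layer."""
    if not layers_oldest_first:
        raise ValueError("at least one read-only layer is required")
    # OverlayFS stacks lowerdir entries from right to left, so the newest
    # read-only layer must come first in the colon-separated list.
    lowerdir = ":".join(reversed(layers_oldest_first))
    options = f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"
    return ["mount", "-t", "overlay", "overlay", "-o", options, target]
```

Discarding a broken update then means unmounting, deleting the contents of `upperdir` and `workdir`, and mounting again with the same read-only layers. Note that `workdir` has to live on the same filesystem as `upperdir`.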

[Figure: OverlayFS]

Container

By using containers we can move the responsibility for updates to the host system. This is shown in the following image:

[Figure: Container]

Inside the container we can use a non-failsafe method to do updates, e.g. a package manager. If the update goes wrong, the host system can initiate a recovery. The host system is the supervisor that decides when an update was successful. Of course, we also need a mechanism to keep the host system up to date. However, by using a solution like Torizon or Balena we don't have to care about that. The company providing the core system takes care of updates and also makes sure that they install in a failsafe way.

U-Boot Update

Updating U-Boot is relatively easy. For most systems, it just means we need to overwrite a certain area in the storage medium (e.g. eMMC or NAND flash). If U-Boot is stored in NAND flash, we might have to write U-Boot into several blocks. The BootROM calculates a checksum and compares it with the one stored in the image header. If it matches, the BootROM loads the bootloader; if not, it tries the next copy. This is necessary because NAND flash might have bad blocks, which can corrupt an image. By writing several copies of U-Boot into NAND flash we also have a failsafe update mechanism. The idea is shown in the following image:

[Figure]

With eMMC we luckily don't have to care about bad block management anymore; we just have to store the image at a certain position in the eMMC or on an eMMC boot partition. To implement a failsafe U-Boot update we need to check the reference manual of the SoC, because it depends on what the BootROM is capable of. Some BootROMs search at different addresses on the storage, others use eMMC boot partitions, which can be swapped.

[Figure]

Both methods implement a failsafe U-Boot update. To get a recoverable U-Boot we would have to implement a pre-bootloader that can initiate the recovery. However, normally we don't need that. U-Boot is easily testable and a failsafe mechanism is sufficient.
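The BootROM behavior described for NAND flash, where several copies are tried until one passes its checksum, can be simulated as follows. The header layout (payload length followed by a CRC32) is invented for this sketch. Real BootROMs define their own image formats, which are documented in the SoC reference manual.

```python
import struct
import zlib

# Made-up header: payload length and CRC32 of the payload, little-endian.
HEADER = struct.Struct("<II")


def make_image(payload):
    """Prefix the payload with the illustrative header."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def first_valid_copy(copies):
    """Return (index, payload) of the first copy whose checksum matches."""
    for index, blob in enumerate(copies):
        if len(blob) < HEADER.size:
            continue  # too short to even contain a header
        length, expected = HEADER.unpack_from(blob)
        payload = blob[HEADER.size:HEADER.size + length]
        if len(payload) == length and zlib.crc32(payload) == expected:
            return index, payload
    return None  # no bootable copy found
```

An interrupted write corrupts at most the copy that was being written, so as long as the copies are updated one after another, at least one of them stays bootable.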

Summary

This article gives an overview of different update methods. It is planned that follow-up articles will show some of these concepts in more detail.

Under this link you can find the slides of a talk at the QtDevCon 2022 about software updates: Keeping Qt based embedded devices up to date