Thoughts on Panicking

2026-03-24

As part of some work in collaboration with Open Device Partnership and Tweede golf, a question came up about how to think about panics in Embedded Rust code, and how different projects/ecosystems handle panicking.

This is a somewhat common question to answer in embedded systems, particularly when deciding what kinds of failures you intend your system to be robust against, and what strategies to employ when something inevitably goes wrong.

Like many systemic decisions, there are a lot of potential right answers, as well as a lot of wrong answers, and fairly often folks don't explicitly plan for or against any particular scenario. The following discussion is aimed at helping folks early in the design process to consider what kind of system they would or would not like to build.

Separate "Blast Domains"

The first question to answer is whether your system has a notion of separate "Blast Domains". I chose this name because it describes how far the "blast radius" of a single catastrophic failure extends. A system with more than one blast domain allows one part of your system to fail catastrophically without bringing down any other, separate blast domain.

NOTE: "Catastrophic Failures" are defined here as any event that will cause the system to cease working as intended. We will generally scope these to software failures for the purposes of this discussion, as robustness against hardware failures, such as single event upsets, typically needs to be addressed outside the scope of a single processor, for example using an external hardware watchdog or electrical protection circuit. Sources of software failures include unexpected behavior leading to panics or hardware faults (e.g. divide by zero), or unsafe code leading to undefined behavior, including memory corruption. This does not generally include "logic errors" directly, though logic errors can cause catastrophic failures.

Single Blast Domains

The minimum number of blast domains is one: if you have a statically compiled embedded application, for example a typical Embassy application, then this is a single blast domain. A panic anywhere in the system, or a busyloop lock-up, or an interrupt storm, will cause the entire system to fail, usually terminating in either a reboot (in the case of a watchdog or panic handler) or a complete lock-up.

Two Blast Domains

The next simplest approach is having two blast domains. Having more than one blast domain means that there are multiple parts of your system that exhibit two qualities: spatial partitioning, where one domain cannot modify the memory or other assets belonging to another; and temporal partitioning, where one domain cannot prevent another from executing, for example by refusing to yield the CPU.

These partitioning qualities may only be "one way": an OS kernel may be able to modify spatial assets of a process, or force the preemption of a running process, while the process is unable to do the same for the OS kernel. In this case, the OS kernel would be considered "independent" of the failures of the process, while the process would not be "independent" of the failures of the kernel: If the OS panics, the application process dies too!

NOTE: The concepts of spatial and temporal partitioning, as well as the concept of independence in this context, are based on definitions used in safety critical systems, such as those governed by DO-178 in aviation, or IEC 61508 in industrial or other applications. These areas require a more extensive level of rigor and analysis for compliance with standards, but the concepts can be applied on a broader level to less critical systems, including consumer devices.

This partitioning may also be "sequential" or "concurrent". A "sequential" partitioning might mean that a higher-privilege component is responsible for booting the system, then yields control to the application, reasserting control only if a watchdog timeout or panic event occurs. A "concurrent" partitioning means that the components regularly exchange control, for example an OS kernel handling hardware events or application process requests, then handing control back to the application.

Even at two blast domains, we begin to introduce the need for additional overhead and complexity over a single blast domain: We need to implement some form of hardware-level isolation between the domains, e.g. setting up processor modes, separate execution stacks, an MMU/MPU for isolation, and some kind of regular time slicing or watchdog preemption.

Even more invasively, we need to consider what resources are given or shared between the domains. If a higher-order domain (like an OS kernel) shares a resource (such as memory, or a UART) with the lower-order domain, and the lower-order domain fails, the higher-order domain cannot necessarily assume that the shared resource remains in a reasonable state. This typically means that shared resources must be forcefully reclaimed. For memory, this could involve treating the memory as uninitialized, or re-zeroing it before re-use; for peripherals like a UART, it may be necessary to re-initialize the peripheral, or at least any aspects that the lower-order domain had the ability to affect.
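As a sketch of the reclamation step, a higher-order domain might treat memory lent to a failed lower-order domain as untrusted and re-zero it before re-use. The names here are illustrative, not taken from any particular OS:

```rust
/// Memory region lent to a lower-order domain. Illustrative only; a real
/// kernel would track ownership through its own bookkeeping structures.
struct SharedRegion {
    buf: Vec<u8>,
}

impl SharedRegion {
    /// After the lower-order domain fails, the higher-order domain cannot
    /// trust the contents: treat them as garbage and re-zero before re-use.
    fn reclaim(&mut self) {
        self.buf.fill(0);
    }
}

fn main() {
    let mut region = SharedRegion { buf: vec![0xAA; 16] };
    // ... the lower-order domain fails catastrophically here ...
    region.reclaim();
    assert!(region.buf.iter().all(|&b| b == 0));
}
```

The same pattern applies to peripherals: re-run the initialization routine rather than trusting whatever configuration the failed domain left behind.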

In some cases, there may be additional details to consider when answering "how much reclamation is necessary". For example, it may be possible to prevent the lower-order domain from modifying the hardware configuration of a shared peripheral in the first place, without preventing regular use. In other cases, it might mean re-reading the current configuration after the lower-order domain has been terminated, rather than assuming the hardware remained in the state expected by the higher-order domain. This requires some context-specific judgement and planning for specific systems.

In order to guarantee that our two domains do not implicitly share any assets, it is most common to compile the two components separately, either as two completely standalone applications, or with the lower-order domain as a static library with no visibility into the assets of the higher-order domain. This is not always done, but failing to do so allows for implicit sharing, where unintended violations of independence may occur.

This may lead to redundancy and duplication: repeated formatting code, panic handlers, or async executors. However, this cost must be paid, as otherwise we would violate the requirements for independence.

N Blast Domains

The most common way of seeing an arbitrary number of blast domains would be operating systems such as Hubris or Tock-OS: there is a higher-order domain, the kernel, which is responsible for managing an arbitrary number of lower-order domains, "tasks" or "threads" or "processes". This requires that these "processes" are isolated from each other as well, i.e. there is inter-process independence.

In systems like these, individual processes, or in some cases even individual hardware drivers, operate with independence. This typically requires extra care to avoid implicit sharing, often requiring techniques like message passing in lieu of more typical direct access/sharing of resources.
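As a rough illustration of message passing in lieu of direct sharing, using std::sync::mpsc and threads as stand-ins for whatever IPC primitive and process model a given OS actually provides:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Each "domain" (modeled here as a thread) owns its own data; values
    // cross the boundary only as copies sent through the channel, so there
    // is no implicitly shared memory to reclaim if the sender dies.
    let (tx, rx) = mpsc::channel::<u32>();
    let sender_domain = thread::spawn(move || {
        tx.send(42).expect("receiver still alive");
    });
    assert_eq!(rx.recv().unwrap(), 42);
    sender_domain.join().unwrap();
}
```

On an actual embedded OS the channel would be a kernel-mediated IPC mechanism rather than a std type, but the ownership discipline is the same.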

Again, the more fully-independent domains that we have, the greater the potential cost due to redundancy.

Not all RTOSs are "N Blast Domains"

Although Hubris and Tock-OS were cited as N-blast-domain systems, many classic RTOSes, like FreeRTOS or Zephyr, are often still single- or two-blast-domain systems, where one task may be able to violate spatial partitioning through shared/static resources. Although these systems do often provide some additional protection for common failures like stack overflows or runaway code, they do not always offer full independence from catastrophic failures, such as a failed assertion or memory corruption.

Why can't Embassy's tasks be blast domains?

One may wonder: why can't we make something like the Task in Embassy (or any other async/await executor system) a point of recoverable failure? We can, but only for "in-band" failures, for example a task that terminates because an async function returned an Error. This only catches failures that we have planned for! If there is an unexpected panic, for example in application or library code, or a hardware-level fault, such as a divide-by-zero, there is no way in Rust to recover from this, short of unwinding the panic. Unwinding is not generally reasonable at the embedded systems level, and even in "desktop" Rust, a "double panic", e.g. panicking a second time while unwinding, still requires an abort.
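To make "in-band" concrete, here is a minimal sketch (synchronous for brevity, with hypothetical names): a failure reported through a Result is visible to the task and can be handled, while a panic or hardware fault bypasses this channel entirely.

```rust
#[derive(Debug, PartialEq)]
enum SensorError {
    Timeout,
}

// Hypothetical stand-in for an async driver call.
fn read_sensor(available: bool) -> Result<u32, SensorError> {
    if available {
        Ok(42)
    } else {
        Err(SensorError::Timeout) // in-band: the caller sees and handles this
    }
}

fn task_step(available: bool) -> Option<u32> {
    match read_sensor(available) {
        Ok(v) => Some(v),
        // A planned-for failure: the task recovers and keeps running.
        Err(SensorError::Timeout) => None,
    }
}

fn main() {
    assert_eq!(task_step(true), Some(42));
    assert_eq!(task_step(false), None);
    // An *unplanned* failure, e.g. an out-of-bounds index or a hardware
    // fault, panics instead: with panic=abort there is no in-band way for
    // the executor to catch it, so the whole domain goes down.
}
```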

Why are so many embedded systems single-blast-domains?

In many cases, a single blast domain is an acceptable design choice. It eliminates the need for redundancy between independent domains, and avoids any setup or runtime overhead for implementing isolation.

In Rust, this case is often more reasonable to make: if unsafe code is limited (and well audited for safety), and reasonable care is taken, then the expected occurrence of catastrophic failures may be low enough to justify a leaner application.

In addition, many systems are capable of "fast recovery": if the system can recover and resume operation in the blink of an eye, say 50ms from panic to resuming normal behavior, the panic itself may not even be noticeable unless someone was directly observing the system. A simple sensor, clock, or smart watch could panic many times per day without ever being noticed.

However, this is context specific: if a factory machine experienced catastrophic failures regularly, requiring a "hard stop" of an assembly line every time there was a panic, or leading to the loss of material, this would be a much "louder" panic, and it would become MUCH more important to limit the number of catastrophic failures of the system.

Panics are a tool, not (necessarily) a problem

It is important as a system designer to consider: "how do we respond when X happens?" Panics are a reasonable tool in this toolbox: they detect when something has gone wrong, and rather than continuing onward in an unreasonable state, they halt execution to avoid doing something "wrong".

They provide a categorical answer to many of these individual and context-specific questions: If something really bad happens, we perform in a predictable way, such as halting or resetting the system as a last-ditch attempt to recover into a reasonable state.

However, like any other tool, we must consider the cost of their use. In embedded Rust, there are two primary cost concerns to consider:

The first of these costs is the additional code size they can inadvertently introduce. Panicking in Rust often comes with a formatting cost: the panic machinery attempts to prepare a panic message, which often involves formatting the failed assertion's message, and the formatting/panicking machinery's use of dynamic dispatch makes it difficult for the optimizer to remove unused formatting code, due to imperfect static analysis. This cost leads to "formatting bloat", which is a longstanding limitation of embedded Rust.

This "formatting bloat" can often be mitigated through the use of the still-unstable "immediate abort" technique, which replaces all panicking branches with an invalid CPU instruction (a UDF on Cortex-M). This greatly simplifies the control flow analysis, leading to more aggressive culling of unused formatting code. Users can still define the behavior in this case by defining an appropriate hardware fault handler, but some "in-band" panic context is potentially lost. This approach also still requires a nightly toolchain.
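As a sketch of what enabling this looks like today (the target triple is an example, and these flags are unstable and may change):

```shell
# Rebuild core with the panic_immediate_abort feature so panicking branches
# lower to an abort (UDF on Cortex-M) instead of pulling in formatting code.
cargo +nightly build --release \
    --target thumbv7em-none-eabihf \
    -Z build-std=core \
    -Z build-std-features=panic_immediate_abort
```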

The second of these costs is that panicking is a "big hammer": it typically requires a full system halt or a full system reboot in order to recover from. As discussed above, for some systems this approach is acceptable, as recovery can be quick and relatively stealthy. For other systems, this approach is less acceptable, as it may lead to a louder or more invasive recovery procedure.

Panics in multi-blast-domain systems

In the context of blast domains, these two costs weigh differently depending on the number of independent blast domains.

Increased redundancy may mean that the first cost, code size, worsens if multiple domains contain redundant formatting code, for example duplicate routines for formatting common types such as integers, floats, or strings. That being said, if these domains are smaller in size, the optimizer may have a better chance at successfully performing the analysis that leads to dead-code elimination. Designers should be aware of the potential for duplication, though the actual impact requires case-by-case analysis rather than being predictable up front.

Increased independence may reduce the impact of the second cost, if the failure of one domain can be recovered from less invasively than a whole-system recovery procedure. This still requires some care to ensure that the necessary reclamation can occur successfully, but independence at least makes this option available to consider.

Feasibility of eliminating panics

In embedded Rust today, it is extremely challenging to categorically eliminate panics. There are often implicit sources of panics, such as []-based indexing or even poll-after-completion of async tasks, that lead to the presence of panicking branches. Categorically eliminating panics requires auditing both "first party" code and any "third party" dependencies, such as crates from crates.io.

There do exist tools, such as Clippy lints, that help to audit and eliminate sources of panics, replacing them with in-band methods such as get()/get_mut() for indexing, or checked_* methods for mathematical operations. However, there is not yet a way to systematically prevent the use of methods with panicking branches, including in dependencies, like a #![deny(panic)] directive.
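A small sketch of these in-band replacements; the function names are illustrative:

```rust
fn checked_lookup(data: &[u8], idx: usize) -> Option<u8> {
    // data[idx] carries a panicking branch; get() reports failure in-band.
    data.get(idx).copied()
}

fn checked_scale(x: u32, factor: u32) -> Option<u32> {
    // x * factor panics on overflow in debug builds (and silently wraps in
    // release); checked_mul reports the overflow in-band instead.
    x.checked_mul(factor)
}

fn main() {
    let data = [1u8, 2, 3];
    assert_eq!(checked_lookup(&data, 1), Some(2));
    assert_eq!(checked_lookup(&data, 9), None); // no panic, just None
    assert_eq!(checked_scale(2, 3), Some(6));
    assert_eq!(checked_scale(u32::MAX, 2), None); // overflow caught in-band
}
```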

There is also a pragmatic hack used by some developers, usually referred to as the "never panic" hack, which makes the panic handler an undefined external symbol. In cases where the optimizer is able to prove that this handler is never called, the undefined symbol is never referenced, and linking succeeds. If the optimizer is unable to prove this, either due to an actually-live panicking branch in the code, or a "false positive" where an unreachable branch cannot be eliminated, then linking the application fails. In general, this is still a useful tool, as it will never have a "false negative": it will never claim that no panics exist when in actuality they do.

This hack has a significant limitation: it is at the mercy of LLVM's heuristic-based optimizer to eliminate all dead panicking branches. The larger or more complex the application becomes, the greater the chance for a "false positive". This hack is not always feasible to use in full production systems.

There is often value to be had in minimizing panics, but similar to reaching 100% code coverage in testing, there are often diminishing returns in addressing the "long tail" of potential panic sources. Still, reduction of panics may be a worthy goal to pursue up to a point, to minimize the potential number of "expensive" recoveries required at scale.

Thoughts for Systems Designers

Based on the points outlined above, the following are a set of opinions by the author on what to consider in the context of the system you choose to build.

Increasing the number of Blast Domains

Said succinctly, it is incredibly challenging to increase the number of blast domains from one to two or more within an existing system once it has been developed. This requires re-establishing some of the core assumptions made while building the system, and often requires drastically different techniques to ensure independence. Moving from two to N domains is more variably expensive, depending on whether novel approaches are required to achieve the necessary levels of independence.

Care should be taken to decide on an approach as early as possible in the engineering process, as changing this decision may become practically infeasible the further into the development process you are. This is an area where "half measures" toward independence have significant cost and diminishing returns if not done completely, particularly when attempting to retrofit an existing system.

Adaptive Diligence

It is often pragmatic to assign differing levels of "criticality" to domains of your system, and perform differing levels of required diligence for these domains. For example, the "kernel" domain may have the highest criticality, and require that all dependencies are audited for panics, or even require that the "never panic" checks pass. These may be relaxed for "lower criticality" domains, allowing for increased productivity in application domains.
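One concrete way to enforce a higher diligence level is lint configuration. These are real Clippy lint names; a high-criticality crate would typically deny them crate-wide at the crate root:

```rust
// Deny panic-prone patterns for this item; a high-criticality crate would
// apply these crate-wide with #![deny(...)] instead. Under plain rustc the
// clippy:: tool lints are accepted but ignored; Clippy itself enforces them.
#[deny(clippy::unwrap_used)]
#[deny(clippy::expect_used)]
#[deny(clippy::indexing_slicing)]
fn first_byte(data: &[u8]) -> Option<u8> {
    // data[0] or data.first().unwrap() would trip the lints above; the
    // in-band version reports absence as None instead.
    data.first().copied()
}

fn main() {
    assert_eq!(first_byte(&[7, 8]), Some(7));
    assert_eq!(first_byte(&[]), None);
}
```

Lower-criticality application crates might allow these same lints, accepting panicking branches in exchange for productivity.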

These considerations can be made in the context of the differing blast domains, based on their potential negative costs of failure at runtime.

Fail Fast vs Fail Never

There are generally two primary schools of thought when it comes to catastrophic failures: either design and implement your system in a way that attempts to make them as rare as possible, or embrace them as an expected possibility and design the system to be tolerant to them. In essence, you can choose either to minimize their possibility, or you can choose to minimize their impact.

Neither approach is "right", both can be a reasonable approach to take in context. Even within a single system, you might choose both: for example a kernel that minimizes their possibility, with an application that minimizes their impact. The cost of either approach can only be fully considered in the context of the reality of the system itself, and the requirements placed upon that system.

Choosing to "fail fast" often simplifies implementation: it becomes unnecessary to develop custom recovery steps for potential issues, instead just saying "if we get here, we panic". In contrast, "fail never" simplifies considerations of what to do in the face of failure, as we expect it to occur only at statistically insignificant rates: "just replace the machine" becomes a palatable expense. However, both of these approaches REQUIRE that their assumptions hold: fail-fast systems must keep the cost of each failure low enough that frequent occurrences remain palatable, and fail-never systems must keep failure rates low enough that their exceptional costs remain statistically insignificant.

In the author's opinion, the biggest potential mistake is to be inconsistent in WHICH goal is prioritized. Doing so greatly complicates the cost estimates, as we must consider both frequent and costly failures. Although no system will be purely fail-fast or fail-never in practice, choosing the approach used will inform the development practices and diligence necessary, and empower engineers to make correct judgement calls at the "line" level.

Planning for failure

In designing a system, it is important to consider three questions about nearly every aspect of that system: what could go wrong here? How likely is it to go wrong? And what is the impact when it does go wrong?

At a macro level, these questions help to inform HOW to appropriately design a system. Having consistent answers to these questions across many parts of a system also aids in making a system more predictable to develop and operate. Documenting the answers to these questions aids in whole-system analysis, as well as to feed-back real world information about the expected failure rates and impact of classes of failures.

NOTE: These questions are a distilled form of "FMEA", or "Failure Modes and Effects Analysis", performed in safety critical systems. Again, we can use the tools to guide decision making, without requiring the formal rigor required of actual safety critical systems.

In some cases, we may decide that potential failure modes are low enough in impact, or rare enough in practice, to avoid additional consideration. Documenting that this determination has been made prevents re-litigating it later, though it may need to be re-evaluated if elevated failure rates or impact are experienced at scale.

"What are you going to do about it?"

As a part of considering failure, it is also important to consider what should be done in the face of that failure. For catastrophic failures, this often means determining how exceptional recovery should be performed, for example shutting down and potentially restarting an individual blast domain.

This aspect of planning applies to NON-catastrophic errors as well. It is often easy to consider this on the micro scale: if some error case occurs, or some expected data is missing, in Rust we will often return a Result::Err, or an Option::None, and return early from that function. However, it is important to zoom out: if we pass that error state back up the call stack to our caller, what does IT do with that error? Does it handle it directly and resolve the error, or does it pass it back up to ITS caller? This can only terminate in one of three potential outcomes: the error is resolved, with some code handling it and recovering; the error is ignored, discarded without further action; or the error is escalated, for example by unwrapping into a panic.

This list doesn't include the error being "propagated", e.g. with ? or an early return, as that doesn't resolve the error, it only passes the error up to another context.
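A sketch of the three terminal outcomes, with illustrative names and an illustrative error type:

```rust
#[derive(Debug, PartialEq)]
struct ReadError;

fn read_value(ok: bool) -> Result<u32, ReadError> {
    if ok { Ok(7) } else { Err(ReadError) }
}

// 1. Resolve: handle the error and recover, here by falling back to a default.
fn resolved(ok: bool) -> u32 {
    read_value(ok).unwrap_or(0)
}

// 2. Ignore: discard the error entirely; nobody downstream will ever know.
fn ignored(ok: bool) {
    let _ = read_value(ok);
}

// 3. Escalate: convert the error into a catastrophic failure.
fn escalated(ok: bool) -> u32 {
    read_value(ok).unwrap() // panics if the error actually occurs
}

fn main() {
    assert_eq!(resolved(false), 0);
    ignored(false);
    assert_eq!(escalated(true), 7);
    // escalated(false) would panic, terminating this blast domain.
}
```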

Although any of these three terminal cases can be correct, they are not always correct in context. For example if a piece of code currently unwraps a Result, as it has chosen the "escalation" approach, and that code is refactored to "ignore" the error to avoid a panicking branch, we MUST consider if that change was appropriate!

If it wasn't appropriate, we must now consider what could go wrong: if it was a "latching" error that requires recovery, it could lead to the malfunctioning or non-operation of a portion of our system, which will now become harder to diagnose because the symptoms are quieter and more subtle.

If it was appropriate, why were we checking the error in the first place? Could we simplify other parts of our system to also not be concerned with this error, as we have decided it was inconsequential in the first place?

This thought exercise is particularly important for designers choosing a "panic never" approach, often within a single blast domain. It can be extremely tempting to ignore errors rather than escalate them or properly address them, but this can lead to quiet bugs, instead of loud bugs, and quiet bugs are harder to catch and squash.

In contrast, it is also important to evaluate the cases where complexity or additional effort is being taken to attempt to address or escalate defects that we don't actually have the ability to resolve. Error handling often adds significant complexity and lines of code to a project, and if it isn't actually providing any value, it could be eliminated to reduce the scope of the system!