Watchdog Timer for Robust Embedded Systems
In a complex embedded system, a small bug may crash the whole system or, worse, put it into a dangerous operating mode. Bugs are not the only problem: even a well-designed, well-tested device running flawless code can still fail. A watchdog timer (WDT) is a safety mechanism that brings the system back to life when it crashes. For this reason, it must be designed and implemented carefully in any robust embedded system.
A WDT is a piece of hardware that contains a timing device and a clock source. The timing device is a free-running timer that is loaded with a certain value and decremented continuously. When the value reaches zero, the WDT circuitry generates a short pulse that resets and restarts the system.
Fig. 1: External watchdog timer
Fig. 2: Internal watchdog timer
It is the application’s responsibility to reload the WDT value before it reaches zero; otherwise the WDT circuitry will reset the system. Once reloaded, the timer starts decrementing again. In short, the WDT constantly watches the execution of the code and resets the system if the software hangs or is no longer executing the correct code sequence. Reloading the WDT value from software is called kicking the watchdog.
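The reload-and-decrement behaviour can be illustrated with a small host-side model. This is only a sketch: a real WDT does this in hardware, and the names `wdt_model_t`, `wdt_init`, `wdt_kick` and `wdt_tick` are hypothetical, chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a WDT down-counter (real parts do this in hardware). */
typedef struct {
    uint32_t reload;  /* value loaded on every kick */
    uint32_t count;   /* current down-counter value */
    bool     reset;   /* latched once the counter reaches zero */
} wdt_model_t;

void wdt_init(wdt_model_t *w, uint32_t reload)
{
    w->reload = reload;
    w->count  = reload;
    w->reset  = false;
}

/* Kicking the watchdog: reload the counter before it reaches zero. */
void wdt_kick(wdt_model_t *w)
{
    if (!w->reset) {
        w->count = w->reload;
    }
}

/* Advance the model by one clock tick; returns true once a reset is asserted. */
bool wdt_tick(wdt_model_t *w)
{
    if (w->reset) {
        return true;
    }
    if (w->count > 0) {
        w->count--;
    }
    if (w->count == 0) {
        w->reset = true;
    }
    return w->reset;
}
```

In this model, a kick that arrives after the counter has reached zero has no effect, mirroring the fact that the reset pulse has already been generated.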
Watchdog based design considerations
1. The clock source for the WDT must be separate; it should not share the system clock. Otherwise, if the system crystal stops during normal operation, say, in sleep mode, the watchdog will stop working too.
2. Once WDT initialisation is complete and the WDT has started, the software should not be able to disable the watchdog or modify its control registers, so that buggy code cannot accidentally disable it. Some processors provide this locking feature.
3. After a watchdog reset, the system must come back to a known state under any condition.
4. The watchdog reset sequence must ensure that all connected peripherals are also brought back to a known state.
Types of watchdog timers
WDTs can be divided into two general categories: external WDT and internal WDT. Most microcontrollers have an internal WDT. Various chip vendors also provide external WDT chips.
An external WDT has a physical reset pin for the processor. An I/O pin of the processor is used to kick the watchdog.
Non-watchdog based design problems
In 1994, a deep-space probe, Clementine, was launched to make observations of the Moon and a large asteroid, 1620 Geographos. After months of operation, a software exception caused a control thruster to fire for 11 minutes, which depleted most of the remaining fuel and caused the probe to rotate at 80rpm. Control was eventually regained, but it was too late to successfully complete the mission.
There can always be a bug present in an embedded system design, even if the code is written very carefully. Moreover, if we test our device in an electrically noisy environment, a high-voltage spike may corrupt the program counter or stack pointer. Cosmic rays are also harmful to digital systems and can alter the processor’s register bits.
Software can cause the system to hang indefinitely through an infinite loop, a buffer overflow or a deadlock. In a small embedded device, it is easy to find the exact root cause of the bug, but not so in a complex embedded system. However, by using a watchdog, we can ensure that the system will not hang indefinitely.
Hence, the system software should never hang indefinitely in any situation. A general solution, in case it does hang, is to reset the system, and this is where watchdogs in embedded systems come in handy.
Watchdog timer based system design
The software needs to kick the watchdog regularly. In some implementations, a specific sequence of bytes must be written to the watchdog register to kick the watchdog. This reduces the chance that errant code kicks the watchdog accidentally.
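A keyed kick of this kind can be sketched as follows. The two key bytes, the type `wdt_key_model_t` and the function `wdt_key_write` are assumptions made for illustration; real devices define their own sequences and register layouts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key bytes; a real part documents its own values. */
#define WDT_KEY_FIRST  0xAAu
#define WDT_KEY_SECOND 0x55u

/* Host-side model of a watchdog that only accepts a two-byte kick sequence. */
typedef struct {
    uint8_t  last_write;  /* previous byte written to the kick register */
    uint32_t kicks;       /* number of accepted kicks, kept for illustration */
} wdt_key_model_t;

/* Model one write to the kick register.
 * Returns true only when this write completes the FIRST, SECOND sequence. */
bool wdt_key_write(wdt_key_model_t *w, uint8_t value)
{
    bool accepted = (w->last_write == WDT_KEY_FIRST) &&
                    (value == WDT_KEY_SECOND);
    if (accepted) {
        w->kicks++;
    }
    w->last_write = value;
    return accepted;
}
```

A single stray write, or the right bytes in the wrong order, does not count as a kick in this model, which is the property the byte sequence is meant to provide.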
When the WDT times out, it asserts the processor reset line. Some processors and controllers can generate an interrupt before resetting the device, which acts as an early warning of the upcoming watchdog reset. In this interrupt, we can save useful information, such as the status register, to non-volatile memory and read it back after recovery. From these reset logs, we can debug the root cause of the reset.
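One possible shape for such a reset log is sketched below. It assumes a record that survives the reset; here a plain global variable stands in for non-volatile memory so the sketch can run on a host. The names `reset_log_t`, `wdt_early_warning_isr` and `reset_log_take`, the marker value, and the choice of saved fields are all hypothetical. A real ISR takes no arguments and would read the values from hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define RESET_LOG_MAGIC 0x57445431u  /* arbitrary marker meaning "record valid" */

typedef struct {
    uint32_t magic;            /* RESET_LOG_MAGIC when the record is valid */
    uint32_t status_reg;       /* status register captured before the reset */
    uint32_t program_counter;  /* where the code was executing */
} reset_log_t;

/* Stand-in for a record placed in non-volatile memory (assumption). */
reset_log_t nv_reset_log;

/* Early-warning handler: save context before the watchdog resets the device.
 * The values are passed as parameters only to keep the sketch host-testable. */
void wdt_early_warning_isr(uint32_t status_reg, uint32_t program_counter)
{
    nv_reset_log.status_reg      = status_reg;
    nv_reset_log.program_counter = program_counter;
    nv_reset_log.magic           = RESET_LOG_MAGIC;  /* written last */
}

/* Called during start-up: copies out the record, if any, and invalidates it. */
bool reset_log_take(reset_log_t *out)
{
    if (nv_reset_log.magic != RESET_LOG_MAGIC) {
        return false;
    }
    *out = nv_reset_log;
    nv_reset_log.magic = 0;
    return true;
}
```

Writing the marker last means a record interrupted by the reset itself is treated as absent rather than read back half-written.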
A watchdog can also be used to wake the device from sleep or idle mode. In sleep mode, a watchdog time-out does not reset the system but just wakes it up.
Simply enabling the WDT and kicking it regularly is not enough to ensure system reliability. To get the most benefit, the watchdog implementation itself must be designed carefully.
Watchdog time-out period
To select the watchdog time-out period, we must properly understand the latency of the software loop. An unusually large number of interrupts may occur during a single scan of the loop, and the extra time spent in interrupt service routines (ISRs) increases the main loop latency. Software delay routines also increase loop latency. A design with delays in various places in the code ends up controlling the watchdog from those places too, which can prove problematic.
For some time-critical applications, the recovery time from a watchdog reset is very important. In such systems, the time-out period needs to be very precise, and after a watchdog reset the system must boot up as fast as possible. For example, a pacemaker must boot up almost within a heartbeat. The initialisation after a watchdog reset should be much shorter than the power-on initialisation.
Very short time-out periods may lead to the system resetting unnecessarily. If the system is not time-critical, it is better to choose a time-out of the order of seconds.
Implementation of watchdog timer for single-thread software design
The traditional approach for a single-thread design is to kick WDT at the end of the main loop.
Fig. 3: Traditional watchdog kicking inside the main loop
In a single-thread design, we can use a state-machine-like architecture. Increment a state variable at three different sections of the code, each of which is certain to execute once per loop scan. At the end of the main loop, check the state value; if it is three, the code has executed in the proper sequence, so kick the watchdog and clear the state variable. If the state value is not three, there is some fault in the execution of the code. In this case, do not kick the watchdog, and the system will reset after the watchdog time-out.
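A minimal sketch of this checkpoint scheme follows. The names `loop_state`, `kick_watchdog` and `main_loop_scan` are hypothetical, the hardware kick is replaced by a counter so the sketch runs on a host, and the three boolean parameters exist only so that a skipped section can be imitated.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXPECTED_CHECKPOINTS 3u

uint8_t  loop_state;  /* incremented once at each checkpoint */
uint32_t kick_count;  /* stand-in for the hardware kick (assumption) */

void kick_watchdog(void)
{
    kick_count++;
}

/* One scan of the main loop. Each flag says whether that section runs;
 * passing false imitates a fault that skips the section. */
bool main_loop_scan(bool run_a, bool run_b, bool run_c)
{
    if (run_a) { /* ... section A work ... */ loop_state++; }
    if (run_b) { /* ... section B work ... */ loop_state++; }
    if (run_c) { /* ... section C work ... */ loop_state++; }

    if (loop_state != EXPECTED_CHECKPOINTS) {
        return false;  /* fault: no kick, so the WDT will time out */
    }
    kick_watchdog();
    loop_state = 0;    /* cleared only after a correct scan */
    return true;
}
```

Because the state variable is cleared only after a correct scan, one faulty scan leaves it at a wrong value and no later scan will kick the watchdog, so the reset is guaranteed in this sketch.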
On some microcontrollers, the built-in watchdog has a maximum time-out of the order of a few hundred milliseconds. If the main loop scan time is longer than the maximum watchdog time-out, we need to extend the time-out in software.
For example, with a main loop latency of 500ms and a maximum watchdog time-out period of 100ms (which means that the watchdog must be kicked within 100ms), kicking from the main loop is not possible. In this case, we can configure one of the processor’s internal timers as a 50ms free-running timer and define a state flag that the main loop sets to Alive at the end of each scan.
In every 50ms timer ISR, increment a count and check the state flag. Kick the watchdog only if the state is not Unknown. When the count reaches ten (500ms have elapsed), the ISR checks the state flag again. If the state is Alive, the program is running correctly; the count is restarted and the flag is cleared so that the main loop must set it again. Otherwise, the ISR sets the state to Unknown. This indicates that there is some problem in the execution of the code, so the ISR stops kicking the watchdog and the system restarts after the watchdog time-out of 100ms.
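The timer-ISR scheme can be sketched as below. The intermediate state `STATE_ASLEEP`, used to clear the flag between windows, is an assumption, as are the names `loop_health`, `main_loop_report_alive` and `timer_50ms_isr`; the hardware kick is again replaced by a counter.

```c
#include <stdint.h>

typedef enum { STATE_ASLEEP, STATE_ALIVE, STATE_UNKNOWN } loop_health_t;

#define TICKS_PER_WINDOW 10u  /* 10 x 50ms = 500ms supervision window */

volatile loop_health_t loop_health = STATE_ASLEEP;
uint32_t tick_count;  /* 50ms ticks seen in the current window */
uint32_t kick_count;  /* stand-in for the hardware kick (assumption) */

void kick_watchdog(void)
{
    kick_count++;
}

/* Called at the end of every main-loop scan. Unknown is latched. */
void main_loop_report_alive(void)
{
    if (loop_health != STATE_UNKNOWN) {
        loop_health = STATE_ALIVE;
    }
}

/* Body of the 50ms free-running timer ISR. */
void timer_50ms_isr(void)
{
    tick_count++;
    if (tick_count >= TICKS_PER_WINDOW) {
        tick_count = 0;
        /* The main loop must have reported in during the last window. */
        loop_health = (loop_health == STATE_ALIVE) ? STATE_ASLEEP
                                                   : STATE_UNKNOWN;
    }
    if (loop_health != STATE_UNKNOWN) {
        kick_watchdog();  /* conditional kick, never unconditional */
    }
}
```

In practice the supervision window would probably be chosen somewhat longer than the worst-case loop latency, since a loop that takes exactly 500ms could otherwise miss a 500ms window through jitter alone.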
Never kick the watchdog unconditionally in an ISR or devote an RTOS task solely to this activity. If the main code crashes, interrupts (and even the scheduler) may continue to run, so the watchdog never times out. Such an approach is not recommended because it tells us nothing about whether any code other than the timer ISR is working.
Implementation of watchdog timer for RTOS based application
In a multitasking environment, several independent loops, known as tasks, run in parallel. The scheduler schedules each task based on its priority. To validate that every task is running properly, each task must contribute to the decision to kick the watchdog.
To implement the watchdog mechanism in an RTOS environment, we can design a separate task that will monitor the status of all running tasks—we can call this the watchdog task. Only this task gets the privilege of kicking the watchdog.
Fig. 4: Watchdog design for RTOS, approach 1
Fig. 5: Watchdog design for RTOS, approach 2
Let us take an approach in which there is a status byte, each bit of which is associated with a task. For example, if our system has three tasks running, each task sets its corresponding bit in the status flag at the end of its body.
When the watchdog task wakes up, it checks whether all three bits are set, which means that all tasks are running properly. If so, it kicks the watchdog and clears the status flag. In this case, the priority of the watchdog task must be lower than that of the other system tasks. Once the watchdog task has finished executing, it sleeps for less than the watchdog time-out period.
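The status-byte check can be sketched without any particular RTOS. The bit assignments and the names `task_checkin` and `watchdog_task_check` are hypothetical, and the hardware kick is replaced by a counter. On a real RTOS the read-modify-write of the status byte would need to be made atomic, for example inside a critical section.

```c
#include <stdbool.h>
#include <stdint.h>

#define TASK_A_BIT (1u << 0)
#define TASK_B_BIT (1u << 1)
#define TASK_C_BIT (1u << 2)
#define ALL_TASKS  (TASK_A_BIT | TASK_B_BIT | TASK_C_BIT)

volatile uint8_t task_status;  /* one bit per monitored task */
uint32_t kick_count;           /* stand-in for the hardware kick (assumption) */

void kick_watchdog(void)
{
    kick_count++;
}

/* Each task calls this at the end of its body with its own bit. */
void task_checkin(uint8_t task_bit)
{
    task_status |= task_bit;
}

/* Run by the watchdog task each time it wakes up.
 * Kicks only if every task has checked in since the last kick. */
bool watchdog_task_check(void)
{
    if ((task_status & ALL_TASKS) != ALL_TASKS) {
        return false;  /* at least one task is missing: do not kick */
    }
    task_status = 0;
    kick_watchdog();
    return true;
}
```

Only `watchdog_task_check` ever calls `kick_watchdog`, which matches the rule that the watchdog task alone has the privilege of kicking.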
This approach to watchdog design for an RTOS (Fig. 4) works well if all tasks, including the watchdog task, execute once in less time than the watchdog reset period. But if any task sleeps for a couple of seconds, or has to wait for an event, this approach will not work.
We can implement it in a better way by using a message queue (Fig. 5), where each task blocks on the message queue. The watchdog task posts messages to all tasks and then sleeps for a specified time interval (less than the watchdog time-out period).
After a message arrives in the message queue, the tasks wake up one by one based on priority. Each task reads its message and, if it has been woken by the watchdog task, sets the corresponding bit in the status flag.
When the watchdog task wakes up, it checks the status flag. If all corresponding bits are set, it kicks the watchdog and clears the status flag. In this approach, the watchdog task must have a higher priority than all other system tasks.
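The message-based variant can be sketched in the same host-side style. A one-slot mailbox per task stands in for a real RTOS message queue, and blocking and scheduling are not modelled; the message types and the names `watchdog_post_pings`, `task_handle_message` and `watchdog_task_check` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 3u
#define ALL_TASKS ((1u << NUM_TASKS) - 1u)

typedef enum { MSG_NONE, MSG_DATA, MSG_WDT_PING } msg_t;

msg_t    mailbox[NUM_TASKS];  /* one-slot stand-in for each task's queue */
uint8_t  task_status;         /* one bit per task that answered a ping */
uint32_t kick_count;          /* stand-in for the hardware kick (assumption) */

void kick_watchdog(void)
{
    kick_count++;
}

/* Watchdog task: post a ping to every task, then go to sleep. */
void watchdog_post_pings(void)
{
    for (unsigned i = 0; i < NUM_TASKS; i++) {
        mailbox[i] = MSG_WDT_PING;
    }
}

/* What a task does once a message unblocks it. */
void task_handle_message(unsigned id)
{
    msg_t m = mailbox[id];
    mailbox[id] = MSG_NONE;
    if (m == MSG_WDT_PING) {
        task_status |= (uint8_t)(1u << id);  /* answer the watchdog task */
    }
    /* Any other message would be handled as normal application work. */
}

/* Watchdog task on wake-up: kick only if every task answered. */
bool watchdog_task_check(void)
{
    if ((task_status & ALL_TASKS) != ALL_TASKS) {
        return false;
    }
    task_status = 0;
    kick_watchdog();
    return true;
}
```

Because a task sets its bit only in response to a ping, a task that is merely waiting for an event still proves it can be scheduled, which is what the status-byte approach alone could not show.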
Selecting the priority of the watchdog task is very important, and the right choice depends on the design architecture of the system.