ESP32 watchdog timer: build firmware that recovers from crashes

Part 5 of 6

Your firmware has been running for six weeks. Then at 2 AM on a Tuesday, the network library hangs waiting for a response that never arrives. The task blocks. The device goes silent. Nobody notices until the morning shift starts and finds stale data in the dashboard. You SSH into the MQTT broker and see that the Last Will and Testament message was published at 2:17 AM.

A watchdog timer would have fixed this while you slept.

What a watchdog timer is

A watchdog timer is a hardware counter on the chip itself — not a software timer, not a FreeRTOS task, actual silicon. Firmware must periodically reset this counter. If the firmware fails to reset it within the configured timeout period, the hardware fires a reset and reboots the chip.

It’s a deadman’s switch. The hardware assumes that if firmware stopped resetting the counter, something went wrong. And it’s right far more often than it’s wrong.

The critical word is hardware. A software watchdog implemented in a FreeRTOS task can be killed by the same bug that killed everything else. A hardware watchdog cannot. It keeps ticking regardless of what the firmware is doing.
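
To make the counter-and-reset idea concrete, here is a minimal host-side model of a watchdog. It is only an illustration: `SimWatchdog` is a hypothetical type, not an ESP-IDF API, and a real hardware watchdog does this counting in silicon rather than in code.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of a watchdog counter (hypothetical, not an ESP-IDF type).
struct SimWatchdog {
    uint32_t timeoutMs;   // configured timeout
    uint32_t lastFeedMs;  // time of the most recent feed

    explicit SimWatchdog(uint32_t timeout, uint32_t nowMs = 0)
        : timeoutMs(timeout), lastFeedMs(nowMs) {
        assert(timeout > 0);
    }

    // Firmware calls this to prove it is still making progress.
    void feed(uint32_t nowMs) { lastFeedMs = nowMs; }

    // True once the timeout has elapsed since the last feed.
    // Unsigned subtraction stays correct when the millisecond counter wraps.
    bool expired(uint32_t nowMs) const {
        return (nowMs - lastFeedMs) >= timeoutMs;
    }
};
```

With a 5000 ms timeout and a last feed at t = 4000 ms, `expired()` stays false until t = 9000 ms. Everything else in this post is about deciding when the firmware has earned the right to call `feed()`.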

Why every production device needs one

This isn’t theoretical. Here are failure modes a watchdog catches:

Network library deadlock. WiFiClient.connect() or MQTT’s socket read can block indefinitely waiting for a response. If the router drops the TCP RST packet, the library waits forever. The watchdog catches this.

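FreeRTOS priority inversion. Low-priority Task A holds a lock that high-priority Task B is waiting for, while a medium-priority Task C keeps preempting Task A so it never gets to release the lock. FreeRTOS mutexes apply priority inheritance to prevent this, but a binary semaphore used as a lock does not, and Task B can stay blocked indefinitely. The watchdog catches this.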

Stack corruption. A buffer overflow overwrites the stack frame. The function returns to garbage. The CPU executes random instructions, typically landing in an infinite loop or a fault handler that itself loops. The watchdog catches this.

Memory corruption. A dangling pointer write corrupts code or data structures. The firmware ends up spinning in a section of code it was never supposed to reach. The watchdog catches this.

Cosmic rays. Not a joke. Single event upsets caused by background radiation can flip bits in SRAM or registers, and the rate rises with altitude and near radiation sources. A critical flag flips, control flow goes wrong, the chip hangs. The watchdog catches this too.

The watchdog is your safety net for hangs. Whatever caused the firmware to stop making progress, the device reboots and tries again.

ESP32 has two watchdog types

Interrupt Watchdog (IWDT)

The IWDT checks that interrupts keep being serviced. If an ISR runs too long, or code leaves interrupts disabled (for example inside a long critical section) for longer than the configured timeout (300 ms by default in ESP-IDF), the IWDT raises a high-priority interrupt, invokes the panic handler, which prints a register dump and backtrace to serial, and resets the chip.

You generally don’t interact with the IWDT directly. It’s enabled by default. The lesson it enforces is simple: ISRs must be fast. Don’t do I2C reads, don’t call printf, don’t allocate memory in an ISR. If you’re seeing IWDT resets, you have an ISR that’s doing too much or a critical section that’s held too long.
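
One way to keep an ISR short is to have it record that something happened and leave the slow work to a task. The sketch below shows that split on the host using `std::atomic`; the function names are hypothetical, and on the device the hand-off would more typically use a FreeRTOS primitive such as `vTaskNotifyGiveFromISR()` or a queue.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Events recorded by the ISR and not yet handled by the task.
static std::atomic<uint32_t> g_pendingEvents{0};

// Stand-in for an interrupt handler: record the event and return immediately.
// No I/O, no logging, no allocation.
void onSensorInterrupt() {
    g_pendingEvents.fetch_add(1, std::memory_order_relaxed);
}

// Called from task context, where slow work is allowed.
// Claims every pending event and returns how many it handled.
uint32_t drainSensorEvents() {
    uint32_t n = g_pendingEvents.exchange(0, std::memory_order_relaxed);
    for (uint32_t i = 0; i < n; ++i) {
        // Slow work (bus reads, logging) would go here, outside the ISR.
    }
    return n;
}
```

The handler's cost is a single atomic increment, so its run time no longer depends on how slow the sensor bus or the logger happens to be.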

Task Watchdog Timer (TWDT)

The TWDT monitors FreeRTOS tasks. Tasks explicitly subscribe to the TWDT, and each subscribed task must call esp_task_wdt_reset() within the timeout period. If any subscribed task fails to call reset in time, the TWDT prints which task failed and, when configured to panic, resets the chip.

This is the one you configure and use directly. The rest of this post is about the TWDT.

Configuring the TWDT

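// esp_task_wdt_config_t requires ESP-IDF v5.0+ (Arduino-ESP32 core 3.x)
#include "esp_task_wdt.h"
#include "esp_log.h"   // ESP_LOGE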

void initWatchdog() {
    esp_task_wdt_config_t wdt_config = {
        .timeout_ms    = 5000,     // Reset if not fed within 5 seconds
        .idle_core_mask = 0,       // Don't monitor idle tasks (explained below)
        .trigger_panic  = true     // Trigger panic (prints backtrace) before reset
    };

    esp_err_t err = esp_task_wdt_reconfigure(&wdt_config);
    if (err != ESP_OK) {
        // TWDT may not be initialized yet — init it first
        err = esp_task_wdt_init(&wdt_config);
        if (err != ESP_OK) {
            ESP_LOGE("WDT", "Failed to init TWDT: %s", esp_err_to_name(err));
        }
    }
}

On idle_core_mask: The idle task runs when no other task is ready to run. If you set idle_core_mask = (1 << 0) | (1 << 1) (both cores), the idle task on each core is automatically subscribed to the TWDT. This means if any task blocks the CPU continuously for longer than the timeout — preventing the idle task from running — the watchdog fires.

This sounds good in theory. In practice, it causes false positives. If you deliberately run a CPU-intensive task (like computing a hash or driving PWM via software), the idle task won’t get CPU time and the watchdog fires even though everything is fine. Leave idle_core_mask = 0 unless you specifically need idle task monitoring and have accounted for it.
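
The mask is a plain bit field, one bit per core. As a small worked example (the helper and constant names are made up for illustration), these are the values you would assign to idle_core_mask for the common cases:

```cpp
#include <cassert>
#include <cstdint>

// Bit N of idle_core_mask selects the idle task on core N.
constexpr uint32_t idleMaskForCore(unsigned core) { return 1u << core; }

// True if the mask asks the TWDT to watch the idle task on the given core.
constexpr bool monitorsCore(uint32_t mask, unsigned core) {
    return ((mask >> core) & 1u) != 0u;
}

constexpr uint32_t kNoIdleTasks = 0;
constexpr uint32_t kCore0Only   = idleMaskForCore(0);
constexpr uint32_t kBothCores   = idleMaskForCore(0) | idleMaskForCore(1);

static_assert(kCore0Only == 0x1, "bit 0 only");
static_assert(kBothCores == 0x3, "bits 0 and 1");
static_assert(!monitorsCore(kCore0Only, 1), "core 1 is not watched");
```

A possible middle ground, if you want some idle-task coverage, is to watch only the core that does not host your CPU-heavy task. Whether that fits depends on how your tasks are pinned.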

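On trigger_panic: Set this to true. When the TWDT fires, it triggers the ESP panic handler, which prints a full backtrace, a register dump, and, most importantly, the name of the task that failed to feed the watchdog, then (with the default panic behavior) reboots the chip. With trigger_panic set to false, a timeout only logs a warning naming the offending task and the firmware carries on in its hung state, so the device never recovers on its own. Always enable trigger_panic.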

Subscribing tasks

// Subscribe the currently running task to the TWDT
esp_task_wdt_add(NULL);  // NULL means "current task"

// Subscribe a specific task by handle
TaskHandle_t sensorTaskHandle = NULL;
xTaskCreate(sensorTask, "sensor", 4096, NULL, 5, &sensorTaskHandle);
esp_task_wdt_add(sensorTaskHandle);

// Unsubscribe when a task is done (before deleting it)
esp_task_wdt_delete(NULL);
vTaskDelete(NULL);

Only subscribe tasks that have a clear health signal — tasks that do meaningful work in a predictable loop. Don’t subscribe every task blindly. A task that legitimately waits on a blocking queue for extended periods shouldn’t be subscribed directly; instead, have a health monitor task (covered below) assess whether waiting is normal or stuck.

The right way to feed the watchdog

The watchdog is only useful if you feed it conditionally — only when the task actually completed its expected work. This is the part most tutorials get wrong.

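The wrong way: feeding unconditionally.

void sensorTask(void* param) {
    esp_task_wdt_add(NULL);

    while (true) {
        esp_task_wdt_reset();  // WRONG: fed every iteration, whether or not the work succeeds

        float temp = readSensorBlocking();  // May return an error value (e.g. NAN) on failure
        publishReading(temp);               // Result is never checked, so failures go unnoticed

        vTaskDelay(pdMS_TO_TICKS(1000));
    }
}

If readSensorBlocking() hangs outright, the loop never returns to the feed and the watchdog still fires after one timeout. The failures this pattern misses are the ones that return: a sensor that reports an error on every read, or a publish that fails every time. The loop keeps spinning, the watchdog keeps getting fed, and a device that has produced no valid data for hours looks perfectly healthy. For that whole class of failure, you’ve made the watchdog useless.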

The right way — feed only after successful work:

void sensorTask(void* param) {
    esp_task_wdt_add(NULL);

    while (true) {
        bool readOk = readSensorWithTimeout(&lastReading, 2000);

        if (readOk) {
            g_lastSensorReadMs = millis();  // Update health timestamp
            esp_task_wdt_reset();           // Feed ONLY after successful work
        } else {
            // Don't feed the watchdog — let it fire if this persists
            ESP_LOGW("SENSOR", "Read failed or timed out");
            // Optional: attempt recovery before the WDT fires
            reinitializeSensor();
        }

        vTaskDelay(pdMS_TO_TICKS(1000));
    }
}

Now the watchdog is actually watching something meaningful. If the read does not complete within its 2-second internal timeout, readSensorWithTimeout() returns false and esp_task_wdt_reset() is skipped for that iteration. If reads keep failing until 5 seconds (the TWDT timeout) have passed since the last feed, the chip resets.
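
The difference between the two feeding policies can be checked with a small host-side simulation. This is a sketch under simplifying assumptions: the loop period is fixed at 1 second, every sensor read returns promptly but reports failure, and `watchdogFires` and `FeedPolicy` are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>

enum class FeedPolicy { Unconditional, AfterSuccess };

// Simulates `iterations` passes of a 1-second task loop in which every
// sensor read returns promptly but reports failure.
// Returns true if the watchdog would have fired.
bool watchdogFires(FeedPolicy policy, uint32_t timeoutMs, int iterations) {
    uint32_t nowMs = 0;
    uint32_t lastFeedMs = 0;
    for (int i = 0; i < iterations; ++i) {
        const bool readOk = false;  // broken sensor: the read returns an error
        if (policy == FeedPolicy::Unconditional || readOk) {
            lastFeedMs = nowMs;     // feed the watchdog
        }
        nowMs += 1000;              // loop period
        if (nowMs - lastFeedMs >= timeoutMs) {
            return true;            // timeout elapsed without a feed
        }
    }
    return false;
}
```

With a 5000 ms timeout, the unconditional policy never trips no matter how many iterations run, while the feed-after-success policy trips on the fifth failed iteration.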

Warning: The interval between esp_task_wdt_reset() calls must be less than the configured timeout_ms. If your task loop takes 3 seconds and your timeout is 5 seconds, you have only 2 seconds of margin. Design your task loop timing with the watchdog timeout in mind — a comfortable rule is that the loop should complete in at most 50% of the timeout.
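
The margin arithmetic is easy to encode as compile-time checks so that a later change to the loop timing cannot silently eat the margin. The helper names below are hypothetical; the numbers are the ones from the warning above.

```cpp
#include <cassert>
#include <cstdint>

// Time left between the slowest expected loop pass and the watchdog timeout.
// Returns 0 when the loop can take as long as, or longer than, the timeout.
constexpr uint32_t watchdogMarginMs(uint32_t worstCaseLoopMs, uint32_t timeoutMs) {
    return worstCaseLoopMs < timeoutMs ? timeoutMs - worstCaseLoopMs : 0u;
}

// The 50% rule of thumb: the loop should use at most half the timeout.
constexpr bool meetsHalfTimeoutRule(uint32_t worstCaseLoopMs, uint32_t timeoutMs) {
    return static_cast<uint64_t>(worstCaseLoopMs) * 2u <= timeoutMs;
}

// Smallest timeout that satisfies the rule for a given worst-case loop time.
constexpr uint64_t minTimeoutForLoopMs(uint32_t worstCaseLoopMs) {
    return static_cast<uint64_t>(worstCaseLoopMs) * 2u;
}

static_assert(watchdogMarginMs(3000, 5000) == 2000, "3 s loop, 5 s timeout");
static_assert(!meetsHalfTimeoutRule(3000, 5000), "3 s is more than half of 5 s");
static_assert(meetsHalfTimeoutRule(2500, 5000), "2.5 s is exactly half of 5 s");
static_assert(minTimeoutForLoopMs(3000) == 6000, "a 3 s loop wants at least 6 s");
```

By this rule, a loop whose worst case is 3 seconds calls for a timeout of at least 6 seconds, or for a shorter loop.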

Detecting watchdog resets on next boot

A watchdog reset means something went wrong. You need to know about it.

#include "esp_system.h"

void checkResetReason() {
    esp_reset_reason_t reason = esp_reset_reason();

    switch (reason) {
        case ESP_RST_TASK_WDT:
            ESP_LOGW("BOOT", "Reset caused by Task Watchdog Timer");
            // Log to NVS, report via MQTT
            recordCrashEvent(CRASH_TASK_WDT);
            break;
        case ESP_RST_INT_WDT:
            ESP_LOGW("BOOT", "Reset caused by Interrupt Watchdog Timer");
            recordCrashEvent(CRASH_INT_WDT);
            break;
        case ESP_RST_PANIC:
            ESP_LOGW("BOOT", "Reset caused by panic (assertion, null deref, etc.)");
            recordCrashEvent(CRASH_PANIC);
            break;
        case ESP_RST_POWERON:
            ESP_LOGI("BOOT", "Power-on reset (normal startup)");
            break;
        case ESP_RST_SW:
            ESP_LOGI("BOOT", "Software reset (OTA or esp_restart())");
            break;
        default:
            ESP_LOGI("BOOT", "Reset reason: %d", reason);
            break;
    }
}

Call checkResetReason() early in app_main() or setup(), before any task is created. esp_reset_reason() describes only the most recent reset, so record the value during this boot; after the next reset it will report that reset instead.
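
The switch above boils down to a classification: which resets count as crashes, and which are planned. A host-testable sketch of that decision follows. `ResetKind` is a stand-in enum invented for the example; on the device you would map the ESP_RST_* values returned by esp_reset_reason() onto it.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the reset reasons discussed above (hypothetical enum).
enum class ResetKind { PowerOn, Software, Panic, IntWdt, TaskWdt, Other };

// Resets that indicate the firmware failed, as opposed to a planned restart.
constexpr bool isCrashReset(ResetKind k) {
    return k == ResetKind::Panic ||
           k == ResetKind::IntWdt ||
           k == ResetKind::TaskWdt;
}

// Returns the crash counter to carry into this boot.
// A power-on reset starts a fresh history; a crash adds one; anything
// else leaves the count unchanged.
constexpr uint32_t updatedCrashCount(uint32_t previous, ResetKind k) {
    return k == ResetKind::PowerOn ? 0u
         : isCrashReset(k)         ? previous + 1u
                                   : previous;
}
```

Keeping the policy in a pure function like this makes it easy to unit test on a PC, separately from the code that talks to the hardware.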

Storing crash information across resets

esp_reset_reason() tells you the reset type but not the context — which task was running, what state the firmware was in, or how many times this has happened. For that, you need storage that survives a watchdog reset.

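RTC slow memory is the answer, with one catch. Variables marked RTC_DATA_ATTR keep their values across deep sleep, but they are reloaded with their initial values on every other boot, including watchdog and software resets. To keep data across those resets, mark the variables RTC_NOINIT_ATTR instead: startup code never initializes them, so they hold whatever was last written. Their contents are undefined after a power-on reset, so clear them explicitly when esp_reset_reason() returns ESP_RST_POWERON.

#include "esp_attr.h"

// These survive watchdog reset, panic reset, and deep sleep.
// RTC_NOINIT_ATTR variables take no initializer; clear them on power-on reset.
RTC_NOINIT_ATTR uint32_t g_crashCount;
RTC_NOINIT_ATTR uint32_t g_lastCrashType;    // esp_reset_reason_t value
RTC_NOINIT_ATTR uint32_t g_lastState;        // application FSM state at crash
RTC_NOINIT_ATTR uint32_t g_lastErrorCode;    // last error encountered
RTC_NOINIT_ATTR char     g_lastTaskName[32]; // which task was active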

void recordCrashEvent(uint32_t crashType) {
    g_crashCount++;
    g_lastCrashType = crashType;
    // g_lastState is already set by the application as it transitions states
    // Report these on next MQTT connection
}

void reportCrashViaMQTT(PubSubClient& mqtt) {
    if (g_crashCount == 0) return;  // No crashes to report

    char buf[256];
    snprintf(buf, sizeof(buf),
        "{\"crash_count\":%lu,\"last_crash_type\":%lu,"
        "\"last_state\":%lu,\"last_error\":%lu}",
        (unsigned long)g_crashCount,
        (unsigned long)g_lastCrashType,
        (unsigned long)g_lastState,
        (unsigned long)g_lastErrorCode
    );

    if (mqtt.publish("devices/sensor-01/crashes", buf, true)) {
        ESP_LOGI("CRASH", "Reported %lu crashes to MQTT", (unsigned long)g_crashCount);
        // Don't clear g_crashCount — keep cumulative count
    }
}

Tip: Update g_lastState every time your application FSM transitions between states. When a watchdog reset happens, you’ll know exactly which state the firmware was in — invaluable for field debugging.
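
Retained memory is only useful if the firmware can tell a genuine record from leftover or uninitialized contents. One common defensive pattern, sketched here on the host, is to group the fields into a struct and guard it with a magic value. The struct, the constant, and the helper names are assumptions made for this example, not part of any ESP-IDF API.

```cpp
#include <cassert>
#include <cstdint>

// Crash context grouped into one record (hypothetical layout).
struct CrashRecord {
    uint32_t magic;          // marks the record as initialized by this firmware
    uint32_t crashCount;
    uint32_t lastCrashType;
    uint32_t lastState;
    uint32_t lastErrorCode;
};

constexpr uint32_t kCrashMagic = 0xC0DEC0DEu;  // arbitrary marker value

// Returns true if the record was already valid, false if it had to be wiped.
bool ensureCrashRecordValid(CrashRecord& rec) {
    if (rec.magic == kCrashMagic) {
        return true;
    }
    rec = CrashRecord{kCrashMagic, 0, 0, 0, 0};
    return false;
}

// Records one crash, validating the record first.
void noteCrash(CrashRecord& rec, uint32_t crashType) {
    ensureCrashRecordValid(rec);
    rec.crashCount++;
    rec.lastCrashType = crashType;
}
```

On the device, a single `CrashRecord` instance would live in retained memory and `ensureCrashRecordValid()` would run once at boot, before anything reads the counters. A magic value only catches the common case; adding a checksum over the fields would make the check stricter.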

Graceful recovery before resorting to reset

A full chip reset is disruptive. It means re-establishing WiFi, re-handshaking TLS, re-subscribing to MQTT topics. Sometimes the right response to a failure is partial recovery — fixing just what’s broken.

// Connection states for the MQTT manager
enum class ConnState {
    CONNECTED,
    WIFI_CONNECTING,
    MQTT_CONNECTING,
    RECONNECTING,
    FAILED
};

static uint32_t         g_reconnectAttempts = 0;  // Plain RAM on purpose: the count must restart from 0 after a reboot
static ConnState        s_connState = ConnState::WIFI_CONNECTING;

void mqttManagerTask(void* param) {
    const uint32_t MAX_RECONNECT_ATTEMPTS = 10;

    while (true) {
        switch (s_connState) {
            case ConnState::CONNECTED:
                if (!mqttClient.connected()) {
                    ESP_LOGW("CONN", "MQTT disconnected, entering reconnect");
                    s_connState = ConnState::RECONNECTING;
                    g_lastState = (uint32_t)ConnState::RECONNECTING;
                }
                break;

            case ConnState::RECONNECTING:
                g_reconnectAttempts++;
                if (g_reconnectAttempts > MAX_RECONNECT_ATTEMPTS) {
                    // Tried 10 times, partial recovery isn't working — hard reset
                    ESP_LOGE("CONN", "Max reconnect attempts reached, rebooting");
                    g_lastErrorCode = 0xDEAD0001;
                    esp_restart();
                }
                // Attempt partial recovery: just the MQTT layer
                mqttClient.disconnect();
                vTaskDelay(pdMS_TO_TICKS(2000));
                if (ensureMQTT()) {
                    ESP_LOGI("CONN", "Reconnected after %lu attempts",
                             (unsigned long)g_reconnectAttempts);
                    g_reconnectAttempts = 0;
                    s_connState = ConnState::CONNECTED;
                }
                break;

            case ConnState::WIFI_CONNECTING:
                if (ensureWiFi()) {
                    s_connState = ConnState::MQTT_CONNECTING;
                }
                break;

            case ConnState::MQTT_CONNECTING:
                if (ensureMQTT()) {
                    s_connState = ConnState::CONNECTED;
                    g_reconnectAttempts = 0;
                }
                break;

            case ConnState::FAILED:
                // Permanent failure — reset
                esp_restart();
                break;
        }

        vTaskDelay(pdMS_TO_TICKS(500));
    }
}

The FSM approach (covered in detail in the state machine post) makes this clean. Instead of sprinkling esp_restart() calls everywhere, you transition to a RECONNECTING state and let the FSM attempt recovery. Only when recovery fails repeatedly do you escalate to a full reset.
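
The escalation logic is easier to test when the transition rules are a pure function, separate from the WiFi and MQTT calls. The sketch below is a simplified variant of the task above, not a drop-in replacement: `nextState` and `ConnInputs` are hypothetical names, and the caller is expected to call esp_restart() when the result is FAILED.

```cpp
#include <cassert>
#include <cstdint>

enum class ConnState { CONNECTED, WIFI_CONNECTING, MQTT_CONNECTING, RECONNECTING, FAILED };

// Results of the connectivity checks for one pass of the loop.
struct ConnInputs {
    bool wifiUp;
    bool mqttUp;
};

// Pure transition function: no I/O, so it can be unit tested on a PC.
// `attempts` counts consecutive failed reconnects and is reset on success.
ConnState nextState(ConnState s, ConnInputs in, uint32_t& attempts, uint32_t maxAttempts) {
    switch (s) {
        case ConnState::WIFI_CONNECTING:
            return in.wifiUp ? ConnState::MQTT_CONNECTING : s;
        case ConnState::MQTT_CONNECTING:
            if (in.mqttUp) { attempts = 0; return ConnState::CONNECTED; }
            return s;
        case ConnState::CONNECTED:
            return in.mqttUp ? s : ConnState::RECONNECTING;
        case ConnState::RECONNECTING:
            if (in.mqttUp) { attempts = 0; return ConnState::CONNECTED; }
            return (++attempts >= maxAttempts) ? ConnState::FAILED : s;
        case ConnState::FAILED:
            return s;  // terminal: the caller performs the restart
    }
    return s;
}
```

With `maxAttempts` set to 3, three consecutive failed passes in RECONNECTING move the machine to FAILED, and any success along the way resets the count.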

The health monitor architecture

The most robust watchdog pattern uses a dedicated health monitor task that owns the TWDT subscription. Instead of every task subscribing and feeding the watchdog independently, one task checks health indicators from all other tasks and feeds the watchdog only if everything looks healthy.

┌──────────────┐     timestamp      ┌─────────────────────┐
│  Sensor Task │ ──────────────────▶│                     │
└──────────────┘                    │  Health Monitor     │──▶ esp_task_wdt_reset()
┌──────────────┐     timestamp      │  Task               │
│  MQTT Task   │ ──────────────────▶│                     │
└──────────────┘                    └─────────────────────┘
┌──────────────┐     queue depth           ▲
│  Event Queue │ ──────────────────────────┘
└──────────────┘

This is cleaner because:

Health logic is centralized in one place
Sensor and MQTT tasks don’t need to know about the watchdog at all
You can add health checks without modifying the tasks being monitored
The health monitor itself is simple enough to verify by inspection
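
The decision the health monitor makes can be isolated into one function that takes a snapshot of the indicators and returns a set of fault bits. This is a sketch with invented names (`HealthSnapshot`, `HealthLimits`, `evaluateHealth`); the monitor would feed the watchdog only when the result is zero.

```cpp
#include <cassert>
#include <cstdint>

// Indicators reported by the worker tasks at one moment in time.
struct HealthSnapshot {
    uint32_t lastSensorReadMs;
    uint32_t lastMqttPublishMs;
    uint32_t queueDepth;
};

// Thresholds the indicators are judged against.
struct HealthLimits {
    uint32_t sensorMaxAgeMs;
    uint32_t mqttMaxAgeMs;
    uint32_t queueCapacity;
};

enum HealthFault : uint32_t {
    kHealthy     = 0,
    kSensorStale = 1u << 0,
    kMqttStale   = 1u << 1,
    kQueueFull   = 1u << 2,
};

// Returns a bitmask of faults; 0 means every check passed.
// Ages use unsigned subtraction, which stays correct across counter wrap.
uint32_t evaluateHealth(const HealthSnapshot& s, const HealthLimits& lim, uint32_t nowMs) {
    uint32_t faults = kHealthy;
    if (nowMs - s.lastSensorReadMs > lim.sensorMaxAgeMs) faults |= kSensorStale;
    if (nowMs - s.lastMqttPublishMs > lim.mqttMaxAgeMs)  faults |= kMqttStale;
    if (s.queueDepth >= lim.queueCapacity)               faults |= kQueueFull;
    return faults;
}
```

Returning a bitmask rather than a single bool means the monitor can store the exact combination of faults before the reset, which is more useful in a crash report than "unhealthy".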

Complete example: health monitor with crash reporting

#include <Arduino.h>
#include <WiFi.h>
#include <PubSubClient.h>
#include "esp_task_wdt.h"
#include "esp_attr.h"
#include "esp_system.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"

// ── Configuration ───────────────────────────────────────────────────────────

const char* WIFI_SSID      = "your-ssid";
const char* WIFI_PASSWORD  = "your-password";
const char* MQTT_HOST      = "192.168.1.100";
const int   MQTT_PORT      = 1883;
const char* CLIENT_ID      = "esp32-health-01";
const char* TOPIC_TELEMETRY = "devices/health-01/telemetry";
const char* TOPIC_STATUS    = "devices/health-01/status";
const char* TOPIC_CRASHES   = "devices/health-01/crashes";

// Watchdog timeout — tasks must show health evidence within this window
const uint32_t WDT_TIMEOUT_MS       = 8000;
// Health thresholds
const uint32_t SENSOR_MAX_AGE_MS    = 10000;  // Sensor must update within 10s
const uint32_t MQTT_MAX_AGE_MS      = 60000;  // MQTT must publish within 60s
const uint32_t QUEUE_CAPACITY       = 20;
const float    QUEUE_HIGH_WATERMARK = 0.80f;  // Alert if queue > 80% full

// ── Crash info in RTC memory (survives watchdog reset) ───────────────────────

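// RTC_NOINIT_ATTR (not RTC_DATA_ATTR): startup code leaves these untouched, so they
// keep their values across watchdog and software resets. Contents are undefined after
// power-on; checkAndRecordResetReason() clears them in that case.
RTC_NOINIT_ATTR uint32_t g_crashCount;
RTC_NOINIT_ATTR uint32_t g_lastCrashType;
RTC_NOINIT_ATTR uint32_t g_lastState;
RTC_NOINIT_ATTR uint32_t g_lastErrorCode;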

// ── Shared health state (written by worker tasks, read by health monitor) ───

struct HealthState {
    volatile uint32_t lastSensorReadMs  = 0;
    volatile uint32_t lastMqttPublishMs = 0;
    volatile uint32_t sensorErrorCount  = 0;
    volatile uint32_t mqttErrorCount    = 0;
};

static HealthState g_health;
static QueueHandle_t g_sensorQueue = NULL;

struct SensorReading {
    float temperature;
    float humidity;
    uint32_t timestampMs;
};

// ── Shared MQTT client ───────────────────────────────────────────────────────

static WiFiClient  s_wifiClient;
static PubSubClient s_mqtt(s_wifiClient);
static SemaphoreHandle_t s_mqttMutex = NULL;

// ── WiFi helpers ─────────────────────────────────────────────────────────────

bool ensureWiFi() {
    if (WiFi.status() == WL_CONNECTED) return true;

    WiFi.disconnect(true);
    WiFi.begin(WIFI_SSID, WIFI_PASSWORD);

    uint32_t start = millis();
    while (WiFi.status() != WL_CONNECTED) {
        if (millis() - start > 15000) return false;
        vTaskDelay(pdMS_TO_TICKS(250));
    }
    return true;
}

bool ensureMQTT() {
    if (s_mqtt.connected()) return true;

    bool ok = s_mqtt.connect(
        CLIENT_ID,
        nullptr, nullptr,
        TOPIC_STATUS, 1, true,
        "{\"online\":false}"
    );

    if (!ok) return false;

    s_mqtt.publish(TOPIC_STATUS, "{\"online\":true}", true);
    return true;
}

// ── Simulated sensor read (replace with real BME280 / DHT22 / etc.) ─────────

bool readSensorWithTimeout(SensorReading* out, uint32_t timeoutMs) {
    // Simulate a sensor read that takes ~100ms normally
    // In production: read I2C, check for timeout
    uint32_t start = millis();
    vTaskDelay(pdMS_TO_TICKS(100));  // simulate I2C read time

    if (millis() - start > timeoutMs) {
        return false;
    }

    // Fake values — replace with actual sensor library calls
    out->temperature  = 22.5f + (float)(random(-10, 10)) / 10.0f;
    out->humidity     = 55.0f + (float)(random(-50, 50)) / 10.0f;
    out->timestampMs  = millis();
    return true;
}

// ── Sensor task ──────────────────────────────────────────────────────────────
// Does NOT subscribe to TWDT — health monitor handles that

void sensorTask(void* param) {
    ESP_LOGI("SENSOR", "Task started");

    while (true) {
        SensorReading reading;
        bool ok = readSensorWithTimeout(&reading, 2000);

        if (ok) {
            // Update health timestamp — health monitor reads this
            g_health.lastSensorReadMs = millis();

            // Push to queue — non-blocking; drop reading if queue full
            if (xQueueSend(g_sensorQueue, &reading, 0) != pdTRUE) {
                ESP_LOGW("SENSOR", "Queue full, dropping reading");
                g_health.sensorErrorCount++;
            }
        } else {
            g_health.sensorErrorCount++;
            ESP_LOGW("SENSOR", "Read timeout (total errors: %lu)",
                     (unsigned long)g_health.sensorErrorCount);
        }

        vTaskDelay(pdMS_TO_TICKS(5000));
    }
}

// ── MQTT publish task ─────────────────────────────────────────────────────────
// Waits on the sensor queue, publishes readings, does NOT feed the WDT

void mqttPublishTask(void* param) {
    ESP_LOGI("MQTT", "Publish task started");

    while (true) {
        // Ensure connectivity (ensureWiFi() can block for up to 15 s while connecting)
        if (!ensureWiFi() || !ensureMQTT()) {
            g_health.mqttErrorCount++;
            vTaskDelay(pdMS_TO_TICKS(5000));
            continue;
        }

        s_mqtt.loop();

        // Block waiting for a sensor reading (up to 10 seconds)
        SensorReading reading;
        if (xQueueReceive(g_sensorQueue, &reading, pdMS_TO_TICKS(10000)) == pdTRUE) {
            char buf[256];
            snprintf(buf, sizeof(buf),
                "{\"temperature\":%.1f,\"humidity\":%.1f,\"uptime_s\":%lu}",
                reading.temperature,
                reading.humidity,
                (unsigned long)(millis() / 1000)
            );

            if (s_mqtt.publish(TOPIC_TELEMETRY, buf)) {
                g_health.lastMqttPublishMs = millis();  // Update health timestamp
                ESP_LOGI("MQTT", "Published: %s", buf);
            } else {
                g_health.mqttErrorCount++;
                ESP_LOGW("MQTT", "Publish failed (total errors: %lu)",
                         (unsigned long)g_health.mqttErrorCount);
                // Force reconnect on next iteration
                s_mqtt.disconnect();
            }
        }
        // If xQueueReceive times out, loop continues and we check connectivity again
    }
}

// ── Crash reporting task ──────────────────────────────────────────────────────
// Runs once after boot to report any crash info stored in RTC memory

void crashReportTask(void* param) {
    // Wait for MQTT to be available (up to 30 seconds)
    uint32_t start = millis();
    while (!s_mqtt.connected() && (millis() - start < 30000)) {
        vTaskDelay(pdMS_TO_TICKS(1000));
    }

    if (g_crashCount > 0 && s_mqtt.connected()) {
        char buf[256];
        snprintf(buf, sizeof(buf),
            "{\"crash_count\":%lu,\"last_crash_type\":%lu,"
            "\"last_state\":%lu,\"last_error\":\"0x%08lX\"}",
            (unsigned long)g_crashCount,
            (unsigned long)g_lastCrashType,
            (unsigned long)g_lastState,
            (unsigned long)g_lastErrorCode
        );
        s_mqtt.publish(TOPIC_CRASHES, buf, true);
        ESP_LOGW("CRASH", "Reported crash history to MQTT: %s", buf);
    }

    vTaskDelete(NULL);  // One-shot task; delete self
}

// ── Health monitor task ───────────────────────────────────────────────────────
// Owns the TWDT subscription. Only feeds the watchdog when all health checks pass.

void healthMonitorTask(void* param) {
    ESP_LOGI("HEALTH", "Task started, subscribing to TWDT");

    // This task owns the watchdog subscription
    esp_task_wdt_add(NULL);

    // Wait a bit for other tasks to initialize before enforcing health checks
    // This prevents a false WDT fire on startup
    uint32_t bootGracePeriodMs = 30000;
    uint32_t bootTime = millis();

    while (true) {
        uint32_t now = millis();
        bool allHealthy = true;

        // ── Check 1: Sensor task updated within allowed window ──────────────
        uint32_t sensorAge = now - g_health.lastSensorReadMs;
        if (g_health.lastSensorReadMs == 0) {
            // Never read — only enforce after grace period
            if (now - bootTime > bootGracePeriodMs) {
                ESP_LOGE("HEALTH", "Sensor has never produced a reading");
                allHealthy = false;
            }
        } else if (sensorAge > SENSOR_MAX_AGE_MS) {
            ESP_LOGE("HEALTH", "Sensor stale: last read %lu ms ago (max %lu)",
                     (unsigned long)sensorAge, (unsigned long)SENSOR_MAX_AGE_MS);
            allHealthy = false;
        }

        // ── Check 2: MQTT task published within allowed window ───────────────
        uint32_t mqttAge = now - g_health.lastMqttPublishMs;
        if (g_health.lastMqttPublishMs == 0) {
            if (now - bootTime > bootGracePeriodMs + 30000) {
                // Extra grace for MQTT — WiFi + broker connection takes time
                ESP_LOGE("HEALTH", "MQTT has never successfully published");
                allHealthy = false;
            }
        } else if (mqttAge > MQTT_MAX_AGE_MS) {
            ESP_LOGE("HEALTH", "MQTT stale: last publish %lu ms ago (max %lu)",
                     (unsigned long)mqttAge, (unsigned long)MQTT_MAX_AGE_MS);
            allHealthy = false;
        }

        // ── Check 3: Queue depth below high-water mark ────────────────────────
        UBaseType_t queueDepth = uxQueueMessagesWaiting(g_sensorQueue);
        float queueLoad = (float)queueDepth / (float)QUEUE_CAPACITY;
        if (queueLoad >= QUEUE_HIGH_WATERMARK) {
            ESP_LOGW("HEALTH", "Queue at %.0f%% capacity (%u/%lu items)",
                     queueLoad * 100.0f, queueDepth, (unsigned long)QUEUE_CAPACITY);
            // High queue depth is a warning, not immediately fatal —
            // only fail health check if at 100%
            if (queueDepth >= QUEUE_CAPACITY) {
                allHealthy = false;
            }
        }

        // ── Feed watchdog only if all checks passed ───────────────────────────
        if (allHealthy) {
            esp_task_wdt_reset();
            ESP_LOGD("HEALTH", "All checks passed, watchdog fed");
        } else {
            // Don't feed — let the TWDT fire if this persists
            // Update RTC state so crash report has context
            g_lastState = (uint32_t)(now - bootTime);  // encode uptime at failure
            g_lastErrorCode = (sensorAge > SENSOR_MAX_AGE_MS) ? 0x0001 :
                              (mqttAge   > MQTT_MAX_AGE_MS)   ? 0x0002 :
                                                                 0x0003;
            ESP_LOGE("HEALTH", "Health check FAILED — watchdog NOT fed");
        }

        // Check every 2 seconds — well within the 8-second WDT timeout
        vTaskDelay(pdMS_TO_TICKS(2000));
    }
}

// ── Boot-time crash detection ─────────────────────────────────────────────────

void checkAndRecordResetReason() {
    esp_reset_reason_t reason = esp_reset_reason();

    if (reason == ESP_RST_TASK_WDT) {
        g_crashCount++;
        g_lastCrashType = (uint32_t)reason;
        ESP_LOGW("BOOT", "*** Task Watchdog reset detected (crash #%lu) ***",
                 (unsigned long)g_crashCount);
    } else if (reason == ESP_RST_INT_WDT) {
        g_crashCount++;
        g_lastCrashType = (uint32_t)reason;
        ESP_LOGW("BOOT", "*** Interrupt Watchdog reset detected (crash #%lu) ***",
                 (unsigned long)g_crashCount);
    } else if (reason == ESP_RST_PANIC) {
        g_crashCount++;
        g_lastCrashType = (uint32_t)reason;
        ESP_LOGW("BOOT", "*** Panic reset detected (crash #%lu) ***",
                 (unsigned long)g_crashCount);
    } else if (reason == ESP_RST_POWERON) {
        // Clean boot — clear accumulated crash state
        g_crashCount    = 0;
        g_lastCrashType = 0;
        g_lastState     = 0;
        g_lastErrorCode = 0;
        ESP_LOGI("BOOT", "Power-on reset: crash counters cleared");
    } else {
        ESP_LOGI("BOOT", "Reset reason: %d", (int)reason);
    }
}

// ── Watchdog initialization ───────────────────────────────────────────────────

void initWatchdog() {
    esp_task_wdt_config_t wdt_config = {
        .timeout_ms     = WDT_TIMEOUT_MS,
        .idle_core_mask = 0,      // Don't monitor idle tasks
        .trigger_panic  = true    // Print backtrace before resetting
    };

    esp_err_t err = esp_task_wdt_reconfigure(&wdt_config);
    if (err == ESP_ERR_INVALID_STATE) {
        // TWDT not yet initialized
        err = esp_task_wdt_init(&wdt_config);
    }

    if (err != ESP_OK) {
        ESP_LOGE("WDT", "TWDT init failed: %s", esp_err_to_name(err));
    } else {
        ESP_LOGI("WDT", "TWDT configured: %lu ms timeout",
                 (unsigned long)WDT_TIMEOUT_MS);
    }
}

// ── Entry point ───────────────────────────────────────────────────────────────

void setup() {
    Serial.begin(115200);
    ESP_LOGI("BOOT", "Firmware starting");

    // 1. Check why we rebooted before doing anything else
    checkAndRecordResetReason();

    // 2. Configure the watchdog timer
    initWatchdog();

    // 3. Create the shared queue
    g_sensorQueue = xQueueCreate(QUEUE_CAPACITY, sizeof(SensorReading));
    if (g_sensorQueue == NULL) {
        ESP_LOGE("BOOT", "Failed to create sensor queue — rebooting");
        esp_restart();
    }

    // 4. Configure MQTT client
    s_mqtt.setServer(MQTT_HOST, MQTT_PORT);
    s_mqtt.setBufferSize(1024);
    s_mqtt.setKeepAlive(30);

    // 5. Create tasks
    // Health monitor gets highest priority — it must always run
    xTaskCreatePinnedToCore(
        healthMonitorTask, "health_monitor",
        4096, NULL,
        configMAX_PRIORITIES - 1,  // Highest priority
        NULL, 1                     // Core 1
    );

    // Sensor task on core 0
    xTaskCreatePinnedToCore(
        sensorTask, "sensor",
        4096, NULL, 5,
        NULL, 0
    );

    // MQTT publish task on core 1 (alongside health monitor is fine)
    xTaskCreatePinnedToCore(
        mqttPublishTask, "mqtt_publish",
        8192, NULL, 4,
        NULL, 1
    );

    // Crash report task — runs once and deletes itself
    xTaskCreate(
        crashReportTask, "crash_report",
        4096, NULL, 3,
        NULL
    );

    ESP_LOGI("BOOT", "All tasks created");
}

void loop() {
    // All work is done in FreeRTOS tasks.
    // The Arduino loop task is not subscribed to the TWDT.
    vTaskDelay(pdMS_TO_TICKS(1000));
}

This is the complete architecture in one file. Let’s walk through what matters:

Health monitor priority. configMAX_PRIORITIES - 1 is the highest priority available to application tasks. The health monitor runs before everything else. If the system is so loaded that the health monitor can’t get CPU time, that’s a health failure and the watchdog should fire.

Boot grace period. Sensor tasks and MQTT tasks need time to initialize before health checks are meaningful. The 30-second grace period prevents false WDT fires on startup.

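Separate crash report task. It’s a one-shot task that deletes itself after reporting, so there is no ongoing overhead. One caveat: it publishes through the same s_mqtt object as the publish task, and PubSubClient is not thread-safe. The listing declares s_mqttMutex for this purpose but never creates or takes it, so add that locking around every s_mqtt call before using this in production.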

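g_health is not protected by a mutex. Each field is written by exactly one task and read by the health monitor. The volatile keyword forces the compiler to perform every read and write against memory instead of caching the value in a register; it does not make compound operations atomic or order accesses between cores. That is sufficient here because each field has a single writer and aligned 32-bit loads and stores on Xtensa are atomic, so the monitor always sees either the old or the new value, never a torn one.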
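
If a field ever gains a second writer, or you would rather state the intent explicitly than rely on volatile, `std::atomic` is one alternative. The sketch below shows what two of the fields might look like; `HealthStateAtomic` and the helper functions are hypothetical names for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomic variant of two of the shared health fields (hypothetical type).
struct HealthStateAtomic {
    std::atomic<uint32_t> lastSensorReadMs{0};
    std::atomic<uint32_t> sensorErrorCount{0};
};

// Called by the sensor task after a successful read.
void recordSensorRead(HealthStateAtomic& h, uint32_t nowMs) {
    h.lastSensorReadMs.store(nowMs, std::memory_order_release);
}

// Safe even with several writers: the increment is one atomic operation.
void recordSensorError(HealthStateAtomic& h) {
    h.sensorErrorCount.fetch_add(1, std::memory_order_relaxed);
}

// Called by the health monitor.
uint32_t sensorAgeMs(const HealthStateAtomic& h, uint32_t nowMs) {
    return nowMs - h.lastSensorReadMs.load(std::memory_order_acquire);
}
```

The main practical gain is on the counters: `fetch_add` remains correct if two tasks report errors at once, whereas `++` on a volatile variable is a separate read and write that can lose an update.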

Comparison: wrong vs right watchdog patterns

Pattern	Does it catch hangs?	Notes
Feed unconditionally every loop	Hard hangs only	Failures that return (errors, timeouts) still feed the WDT
Feed after each successful operation	Yes	Correct: WDT fires if the operation hangs or keeps failing
Every task self-feeds independently	Partially	Tightly coupled, harder to enforce global health
Dedicated health monitor task	Yes, with context	Centralized, testable, can express cross-task health
No watchdog at all	Never	The “it works in the lab” approach

What the watchdog can’t catch

The watchdog catches hangs. It doesn’t catch:

Silent data corruption — firmware runs fine but produces wrong values
Slow memory leaks — heap shrinks over days, not within a WDT timeout
Logic bugs — firmware does the wrong thing efficiently

For these you need: heap monitoring (heap_caps_get_free_size()), application-level sanity checks on data, and automated testing. The watchdog is one layer of a multi-layer reliability strategy, not the whole strategy.
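
As an illustration of the heap monitoring layer, here is a small tracker that records the lowest free-heap value seen and flags samples below a threshold. It is a host-side sketch: `HeapMonitor` is a made-up class, the threshold is an assumption you would tune per project, and on the device the samples would come from a call such as heap_caps_get_free_size(MALLOC_CAP_DEFAULT).

```cpp
#include <cassert>
#include <cstdint>

// Tracks free-heap samples and flags when they drop below an alarm level.
class HeapMonitor {
public:
    explicit HeapMonitor(uint32_t alarmBelowBytes) : alarmBelow_(alarmBelowBytes) {}

    // Records one free-heap sample; returns true when it is below the alarm level.
    bool sample(uint32_t freeBytes) {
        if (freeBytes < lowest_) {
            lowest_ = freeBytes;
        }
        return freeBytes < alarmBelow_;
    }

    // Lowest free-heap value seen so far (UINT32_MAX before the first sample).
    uint32_t lowestSeen() const { return lowest_; }

private:
    uint32_t alarmBelow_;
    uint32_t lowest_ = UINT32_MAX;
};
```

Publishing `lowestSeen()` alongside regular telemetry gives you a trend line for slow leaks long before the heap actually runs out.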

What’s next

You now have firmware that detects failures and recovers from them. The next challenge is structuring the code so failures and recoveries are expressed cleanly, without a tangle of if statements spread across tasks. The next post covers event-driven firmware on ESP32 — how to decouple your sensor tasks, MQTT tasks, and connection managers so that none of them know each other exists, and adding a new subscriber requires zero changes to existing code.