IoT Failure Modes — Battery, Signal, Scale, Security

hard

Why this matters

Things that work flawlessly on your desk fail at scale in ways that won't be visible until tens of thousands of units are in the field. Knowing the four canonical failure axes — power, radio, scaling, and security — gives you a checklist to design against from day one, instead of discovering them through angry customer reviews. Every veteran IoT engineer has seen each of these blow up a launch at least once.

Demo

The four failure axes and the symptoms each produces.

Try it yourself

For your own project, write down the worst-case answer to each of the four axes: 'how would this fail at 10,000 units, in a basement, in winter, after one of its components has been compromised?'
Read the Mirai botnet writeup (Krebs on Security, October 2016) and identify which of the four failure axes it exploited.
Estimate the energy budget of a device that wakes every 5 minutes to take a reading, transmits 10 bytes over Wi-Fi, and sleeps. Compare it to 2× AA batteries (≈ 2,500 mAh per alkaline cell: wired in series they supply ≈ 2,500 mAh at 3 V, and only in parallel do they reach ≈ 5,000 mAh, at 1.5 V). Predict the months of operation; verify against a real measurement if you have hardware.
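One way to set up the estimate in the last exercise is a time-weighted average of current over a single wake/sleep cycle. The sketch below is only a starting template: the `Phase` class and both functions are hypothetical names, and every current and duration is an assumed placeholder (a 3-second Wi-Fi connect-and-send at 120 mA, a 10 µA sleep current, two AA cells in series treated as 2,500 mAh with 80% usable). Replace them with your own measurements.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One active step in a wake cycle."""
    name: str
    current_ma: float  # average draw during the step, in mA
    duration_s: float  # time spent in the step per cycle, in seconds


def average_current_ma(phases, sleep_current_ma, period_s):
    """Time-weighted average current over one wake/sleep cycle, in mA."""
    active_s = sum(p.duration_s for p in phases)
    if active_s > period_s:
        raise ValueError("active time exceeds the wake period")
    active_charge = sum(p.current_ma * p.duration_s for p in phases)  # mA*s
    sleep_charge = sleep_current_ma * (period_s - active_s)           # mA*s
    return (active_charge + sleep_charge) / period_s


def battery_life_days(capacity_mah, avg_current_ma, usable_fraction=0.8):
    """Days of operation after derating capacity for cold, ageing and cut-off."""
    return capacity_mah * usable_fraction / avg_current_ma / 24.0


# All figures below are illustrative assumptions, not measurements.
PHASES = [
    Phase("read sensor", current_ma=20.0, duration_s=0.1),
    Phase("Wi-Fi connect + send 10 bytes", current_ma=120.0, duration_s=3.0),
]
avg = average_current_ma(PHASES, sleep_current_ma=0.010, period_s=300.0)
days = battery_life_days(capacity_mah=2500.0, avg_current_ma=avg)
print(f"average current: {avg:.3f} mA, estimated life: {days:.0f} days")
```

With these assumed numbers the estimate comes out at a little over two months (about 68 days), and the sleep current accounts for under 1% of the charge used per cycle. The time spent connecting to Wi-Fi dominates, not the 10-byte payload, so that is the figure most worth measuring on real hardware.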

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, describe the four canonical failure modes of an IoT fleet and explain why each one is hard to catch in the lab.

2. Why it works (the mechanism)

Walk me through how a 'thundering herd' problem can emerge from a synchronous wake-up time across an IoT fleet — and why a simple ±60-second random jitter usually fixes it.

3. Advanced — application & what's next

Given a fleet of 50,000 LoRaWAN agricultural sensors that worked fine for a year and now fail to acknowledge uplinks across one region — outline how you would distinguish between an RF issue, a backend scaling issue, and a firmware issue using only data the devices have already uploaded.

References

1. POWER / BATTERY

Symptoms:
• Device dies in 6 weeks; the datasheet promised 2 years.
• Sleep current is 100 µA instead of 5 µA because of a forgotten LED or pull-up.
• Battery temperature derating was missed, so winter kills units in the field.

Mitigations:
• Measure actual sleep current with a µCurrent / Joulescope.
• Budget energy per task (x mAh per hour) against the cell's capacity; a new CR2032 holds ≈ 235 mAh.
• Test in an environmental chamber at -20°C to +60°C.
• Build a battery-life dashboard from the fleet itself.

2. RADIO / SIGNAL

Symptoms:
• Works in the office, doesn't work in the customer's basement.
• Wi-Fi roams between APs every 30 s and hammers the reconnection logic.
• BLE pairs fine, then refuses to reconnect after the phone restarts.
• LoRa link is rated for 1 km but fails at 200 m in a city full of concrete.

Mitigations:
• Don't trust marketing range numbers; measure with attenuators.
• Implement exponential backoff and DON'T retry every 30 s forever.
• Log RSSI per packet, ship it to the backend, and build a coverage heatmap.

3. SCALE

Symptoms:
• 10 devices are fine; at 10,000 devices your MQTT broker melts.
• Every device wakes at 00:00:00 UTC and the thundering herd hammers the backend.
• One device's bad firmware crashes the broker, all devices reconnect at once, the broker can't keep up, and the failure cascades. (The effect of a Mirai-style DDoS, but self-inflicted.)

Mitigations:
• Randomise wake-up times with a per-device jitter.
• Load test the backend with simulated devices (e.g., k6 or a custom tool).
• Rate-limit per device. Set sane connection limits.
• Build a kill switch that turns off non-essential telemetry remotely.

4. SECURITY

Symptoms:
• Default credentials in firmware, which is how the Mirai botnet spread.
• Updates signed with a single key: a key leak means fleet compromise.
• Memory corruption in the parser leads to RCE (the root cause behind a large share of device CVEs).
• Cert hardcoded with a 1-year expiry: the device runs offline for 13 months and can't connect again.

Mitigations:
• Per-device certificates from day 1.
• Secure boot + signed firmware updates with key rotation.
• Threat-model the parser. Fuzz it.
• Plan certificate refresh logic before shipping.
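
Two of the mitigations listed above, per-device wake-up jitter and exponential backoff, are small enough to sketch in a few lines. The function names (`wake_offset_s`, `backoff_delay_s`, `peak_arrivals_per_second`), the 120-second window (equivalent to ±60 s around the nominal time), and the one-hour cap are all assumptions chosen for illustration, and the "full jitter" backoff variant shown is one common choice, not the only one.

```python
import hashlib
import random
from collections import Counter


def wake_offset_s(device_id: str, window_s: int = 120) -> int:
    """Stable per-device offset in [0, window_s), derived from a hash of the ID.

    Hashing the ID (instead of calling a random generator at boot) keeps the
    offset the same across reboots and needs no stored state.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % window_s


def backoff_delay_s(attempt: int, base_s: float = 1.0,
                    cap_s: float = 3600.0, rng=random) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    exponent = min(max(attempt, 0), 32)  # clamp so 2**exponent stays small
    ceiling = min(cap_s, base_s * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


def peak_arrivals_per_second(device_ids, window_s: int) -> int:
    """Largest number of devices that share the same one-second wake slot."""
    counts = Counter(wake_offset_s(d, window_s) for d in device_ids)
    return max(counts.values())


# Hypothetical fleet of 10,000 devices with made-up IDs.
fleet = [f"sensor-{n:05d}" for n in range(10_000)]
print("peak without jitter:", len(fleet))  # everyone wakes in the same second
print("peak with 120 s window:", peak_arrivals_per_second(fleet, 120))
print("retry delays:", [round(backoff_delay_s(a), 1) for a in range(6)])
```

Without jitter the backend sees the whole fleet in one second; with the window, the load per second drops to roughly the fleet size divided by the window length. The random component in the backoff matters for the same reason: devices that all lost their connection at the same moment would otherwise retry in lockstep.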