Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Things that work flawlessly on your desk fail at scale in ways that won't be visible until tens of thousands of units are in the field. Knowing the four canonical failure axes — power, radio, scaling, and security — gives you a checklist to design against from day one, instead of discovering them through angry customer reviews. Every veteran IoT engineer has seen each of these blow up a launch at least once.
The four failure axes and the symptoms each produces.
Use these three prompts in order. Each builds on the one before.
In one paragraph, describe the four canonical failure modes of an IoT fleet and explain why each one is hard to catch in the lab.
Walk me through how a 'thundering herd' problem can emerge when every device in an IoT fleet wakes at the same synchronized time, and why a simple ±60-second random jitter usually fixes it.
Given a fleet of 50,000 LoRaWAN agricultural sensors that worked fine for a year and now fail to acknowledge uplinks across one region — outline how you would distinguish between an RF issue, a backend scaling issue, and a firmware issue using only data the devices have already uploaded.
1. POWER / BATTERY
Symptoms:
• Device dies in 6 weeks, datasheet promised 2 years.
• Sleep current is 100 µA instead of 5 µA — a forgotten LED or pull-up.
• Battery temperature compensation missed: cold-weather capacity loss kills devices in the field over winter.
Mitigations:
• Measure actual sleep current with a µCurrent / Joulescope.
• Budget energy per task (mAh per hour of operation) against the cell's capacity: a fresh CR2032 holds about 235 mAh.
• Test in environmental chamber at -20°C to +60°C.
• Build a battery-life dashboard from the fleet itself.
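The energy budget above can be sketched as a short calculation. This is a minimal illustration, not a measured profile: the 10 mA active current, the 2 seconds of activity per hour, and the 80% usable-capacity derating are all hypothetical values chosen for the example. Only the 235 mAh capacity and the 5 µA / 100 µA sleep currents come from the list above.

```python
HOURS_PER_YEAR = 24 * 365


def average_current_ma(sleep_ua, active_ma, active_s_per_hour):
    """Time-weighted average current in mA over one hour."""
    active_fraction = active_s_per_hour / 3600.0
    sleep_ma = sleep_ua / 1000.0
    return active_ma * active_fraction + sleep_ma * (1.0 - active_fraction)


def battery_life_years(capacity_mah, sleep_ua, active_ma, active_s_per_hour,
                       usable_fraction=0.8):
    """Estimated battery life in years.

    usable_fraction is an assumed derating for self-discharge, cut-off
    voltage and temperature; replace it with data from your own cells.
    """
    avg_ma = average_current_ma(sleep_ua, active_ma, active_s_per_hour)
    return capacity_mah * usable_fraction / avg_ma / HOURS_PER_YEAR


if __name__ == "__main__":
    # Same hypothetical duty cycle, two different sleep currents.
    for sleep_ua in (5, 100):
        years = battery_life_years(235, sleep_ua, active_ma=10,
                                   active_s_per_hour=2)
        print(f"sleep current {sleep_ua} uA: about {years:.2f} years")
```

With a duty cycle this low, sleep current dominates the budget, which is why a forgotten LED or pull-up can turn a multi-year estimate into a few months.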
2. RADIO / SIGNAL
Symptoms:
• Works in office, doesn't work in customer's basement.
• Wi-Fi roams between APs every 30 s, hammers reconnection logic.
• BLE pairs fine, refuses to reconnect after the phone restarts.
• LoRa link works at 1 km in spec — fails at 200 m in a city with concrete.
Mitigations:
• Don't trust marketing range numbers — measure with attenuators.
• Implement exponential backoff and DON'T retry every 30 s forever.
• Log RSSI per packet, ship to backend, build a coverage heatmap.
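The backoff mitigation above can be sketched as follows. This is one common variant (capped exponential backoff with "full jitter"), not the only reasonable one. The function name and the 1 s base and 1 h cap are assumptions for illustration.

```python
import random


def backoff_delay_s(attempt, base_s=1.0, cap_s=3600.0, rng=random):
    """Delay before reconnect attempt number `attempt` (0-based).

    The ceiling doubles with each failed attempt up to cap_s, and the
    actual delay is drawn uniformly from [0, ceiling] so that devices
    that failed together do not retry together.
    """
    exponent = min(attempt, 32)  # keep 2**exponent small and overflow-free
    ceiling = min(cap_s, base_s * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(8):
        print(f"attempt {attempt}: wait {backoff_delay_s(attempt):.1f} s")
```

The cap matters as much as the doubling: without it a long outage pushes delays to days, and without the doubling every device keeps hammering the access point or gateway at a fixed interval.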
3. SCALE
Symptoms:
• 10 devices fine. 10,000 devices, your MQTT broker melts.
• Every device wakes at 00:00:00 UTC, thundering-herd hammers backend.
• One device's bad firmware crashes the broker, all devices reconnect, broker can't keep up, cascade. (A self-inflicted DDoS: Mirai-style, but accidental.)
Mitigations:
• Randomise wake-up times with a per-device jitter.
• Load test the backend with simulated devices (e.g., k6 or a custom harness).
• Rate-limit per device. Set sane connection limits.
• Build a kill switch that turns off non-essential telemetry remotely.
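Per-device jitter can be sketched as a stable offset derived from the device ID, so each device keeps the same wake slot across reboots without needing a good random source. The function names, the device ID format, and the one-hour window are hypothetical. A random offset chosen at each wake-up would serve the same purpose.

```python
import hashlib
from collections import Counter


def wake_offset_s(device_id, window_s=3600):
    """Deterministic offset in [0, window_s) derived from the device ID."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_s


def busiest_minute(device_ids, window_s=3600):
    """Return (minute_index, device_count) for the most crowded minute."""
    per_minute = Counter(wake_offset_s(d, window_s) // 60 for d in device_ids)
    return per_minute.most_common(1)[0]


if __name__ == "__main__":
    fleet = [f"dev-{i:05d}" for i in range(10_000)]  # simulated IDs
    minute, count = busiest_minute(fleet)
    print(f"busiest minute: {minute}, devices waking in it: {count}")
```

Instead of 10,000 devices arriving in the same second, the backend sees the fleet spread across the whole window, with each minute carrying a small fraction of the load.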
4. SECURITY
Symptoms:
• Default credentials in firmware. Mirai botnet.
• Updates signed with a single key — key leak = fleet compromise.
• Memory corruption in the parser leads to RCE (the root cause behind a large share of device CVEs).
• Cert hardcoded with 1-year expiry, device runs offline for 13 months, can't connect again.
Mitigations:
• Per-device certificates from day 1.
• Secure boot + signed firmware updates with key rotation.
• Threat-model the parser. Fuzz it.
• Plan certificate refresh logic before shipping.
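Certificate refresh logic can be sketched as a small state check that the device runs whenever it comes online. The function name and the "renew at half the lifetime" policy are assumptions for illustration, and the sketch presumes the device has a trustworthy clock. It only decides when to renew; the renewal protocol itself is out of scope here.

```python
from datetime import datetime, timedelta, timezone


def cert_state(not_before, not_after, now, renew_fraction=0.5):
    """Classify a certificate as 'valid', 'renew', or 'expired'.

    renew_fraction is the share of the lifetime after which the device
    should start requesting a replacement (an assumed policy value).
    """
    if now >= not_after:
        return "expired"
    renew_at = not_before + (not_after - not_before) * renew_fraction
    return "renew" if now >= renew_at else "valid"


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=365)  # 1-year certificate
    for days in (30, 200, 395):  # 395 days is roughly 13 months
        state = cert_state(issued, expires, issued + timedelta(days=days))
        print(f"day {days}: {state}")
```

Early renewal does not help the device that stays offline past expiry, so the "expired" branch still needs a planned recovery path, such as a separate long-lived bootstrap credential, decided before shipping.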