Reliability – as in real life, everybody has an intuitive idea of what reliable means. In software, we can understand reliability as meaning roughly “continuing to work correctly, even when things go wrong”.
Typical expectations also include:
- the application performs the action that the user requested
- it can tolerate the user using the software in unexpected ways
- its performance is good enough for the required use case
- the system prevents unauthorized access and abuse
When a system stops working as expected, it is important to distinguish between faults and failures:
- a fault is when one component of a system deviates from its spec
- a failure is when the system as a whole stops providing the required service
It’s probably impossible to reduce the probability of a fault to zero, but we can design fault-tolerance mechanisms that prevent faults from causing failures.
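As a rough illustration of such a mechanism, here is a minimal Python sketch of a retry-with-fallback wrapper. The names `ComponentFault`, `call_with_retries` and `make_flaky_replica` are made up for this example and do not come from the book or from any library. The idea is that a fault in one component (a replica that times out) is absorbed by retrying or by falling back, and only becomes a failure if every option is exhausted.

```python
class ComponentFault(Exception):
    """Hypothetical error: a single component deviated from its spec."""


def call_with_retries(operation, attempts=3, fallback=None):
    """Run operation, retrying on ComponentFault; use fallback if every attempt faults."""
    for _ in range(attempts):
        try:
            return operation()
        except ComponentFault:
            continue  # the fault is tolerated, so we simply try again
    if fallback is not None:
        return fallback()  # degraded answer, but the service is still provided
    # Nothing left to try: the fault has turned into a failure.
    raise RuntimeError("service unavailable")


def make_flaky_replica(faults_before_success):
    """Hypothetical component that faults a fixed number of times, then answers."""
    remaining = [faults_before_success]

    def read():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ComponentFault("replica timed out")
        return "value-from-replica"

    return read
```

For example, `call_with_retries(make_flaky_replica(2))` returns the replica's value on the third attempt, while `call_with_retries(make_flaky_replica(5), fallback=lambda: "cached")` gives up on the replica and returns the fallback. Real systems usually add backoff between attempts and a limit on total time, which are left out here to keep the sketch short.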
Many critical bugs are actually due to poor error handling. Deliberately introducing faults (e.g. randomly killing processes) ensures that the fault-tolerance machinery is continuously exercised and tested, which increases confidence that faults will be handled correctly when they occur naturally.
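To make the idea of deliberate fault injection more concrete, here is a small in-process simulation in Python. It is only a sketch under simplifying assumptions: `Worker`, `Pool`, `kill_random_worker` and `run_chaos_experiment` are hypothetical names, workers are plain objects rather than real OS processes, and "killing" one just flips a flag. Each round kills a random worker, sends a request, and then lets a stand-in supervisor restart whatever is dead.

```python
import random


class Worker:
    """Hypothetical worker; a real system would run this as a separate process."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Pool:
    """Fails over to the next worker when one is down."""

    def __init__(self, size):
        self.workers = [Worker(f"worker-{i}") for i in range(size)]

    def handle(self, request):
        for worker in self.workers:
            try:
                return worker.handle(request)
            except ConnectionError:
                continue  # fault in one worker: try the next one
        raise RuntimeError("no worker available")  # system-level failure

    def restart_dead_workers(self):
        # Stand-in for a supervisor that restarts crashed processes.
        for worker in self.workers:
            worker.alive = True


def kill_random_worker(pool, rng):
    """Inject a fault by marking one randomly chosen worker as dead."""
    rng.choice(pool.workers).alive = False


def run_chaos_experiment(rounds=100, seed=0):
    """Return (served, failed) request counts under continuous fault injection."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    pool = Pool(size=3)
    served = failed = 0
    for i in range(rounds):
        kill_random_worker(pool, rng)
        try:
            pool.handle(f"request-{i}")
            served += 1
        except RuntimeError:
            failed += 1
        pool.restart_dead_workers()
    return served, failed
```

Because only one of the three workers is down at any time, every request should still be served, so `run_chaos_experiment(rounds=50)` is expected to report no failed requests. If the failover loop in `Pool.handle` were removed, the same experiment would start reporting failures, which is exactly the kind of weakness that fault injection is meant to surface.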
Inspired by “Designing Data-Intensive Applications” by Martin Kleppmann