TDD at the System Scale
In Growing Object-Oriented Software, Guided by Tests, we advocate starting development by writing a system test (or tests). This caused more controversy than I expected. System tests have the reputation of being slow (not entirely avoidable, I admit), difficult to automate reliably and difficult to diagnose when they fail. However, I find that many teams follow a TDD process at the unit level but do post-hoc testing at the system level, and so don't use system tests as a source of design feedback.
For me, the key feature of the TDD process is that when something is hard to test, that's a driver to go back and adjust the design until it is easy to test.
At the system-test scale, tests become unreliable when they cannot observe the activity of the system well enough to reliably synchronise their execution with that of the system. Failures are hard to understand if the tests cannot gather and filter information about the activity of the system during execution of the test. Defects will break test isolation and cause cascading failures if the tests cannot wait until the system becomes passive and restore it to a known good state when the system fails a test.
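As a minimal sketch of the synchronisation problem, assume a hypothetical OrderSystem that does its work on a background thread and exposes an is_idle() hook. The names OrderSystem, is_idle and wait_until are invented for illustration. The test polls the observed state instead of sleeping for a guessed duration:

```python
import threading
import time


def wait_until(probe, timeout=2.0, interval=0.01):
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return probe()  # one last check at the deadline


class OrderSystem:
    """Hypothetical system that processes orders on a background thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self.processed = []

    def submit(self, order_id):
        # Count the order as in flight before the worker starts,
        # so is_idle() can never report True too early.
        with self._lock:
            self._in_flight += 1
        threading.Thread(target=self._process, args=(order_id,)).start()

    def _process(self, order_id):
        time.sleep(0.05)  # stand-in for real asynchronous work
        with self._lock:
            self.processed.append(order_id)
            self._in_flight -= 1

    def is_idle(self):
        """Observation hook: True when no work is in flight."""
        with self._lock:
            return self._in_flight == 0


system = OrderSystem()
system.submit("order-1")
# Synchronise on observed state rather than on a fixed sleep.
became_idle = wait_until(system.is_idle)
```

Without the is_idle() hook, the test would have to guess how long to wait, which is exactly where the flakiness comes from.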
To reliably test a system, we must be able to...
- determine what the system is doing.
- detect when it has stopped doing it.
- detect when the system has failed.
- explain why it has failed.
- restore the system to a known good state.
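One way to picture these five capabilities is as a small interface that the system exposes. The sketch below is only an assumption about how that might look: the Manageable protocol, ToyBatchSystem and clean_up_after_test are hypothetical names, and a real system would expose the same operations over whatever remote or management channel it already has.

```python
from typing import Optional, Protocol


class Manageable(Protocol):
    """Hypothetical interface covering the five capabilities listed above."""

    def current_activity(self) -> str: ...        # what is it doing?
    def is_idle(self) -> bool: ...                # has it stopped doing it?
    def has_failed(self) -> bool: ...             # has it failed?
    def failure_reason(self) -> Optional[str]: ...  # why has it failed?
    def reset(self) -> None: ...                  # restore a known good state


class ToyBatchSystem:
    """Toy in-memory system, driven step by step so the states are visible."""

    def __init__(self) -> None:
        self._activity = "idle"
        self._failure: Optional[str] = None

    def start(self, job: str) -> None:
        self._activity = f"running {job}"

    def crash(self, reason: str) -> None:
        self._failure = reason
        self._activity = "idle"

    def current_activity(self) -> str:
        return self._activity

    def is_idle(self) -> bool:
        return self._activity == "idle"

    def has_failed(self) -> bool:
        return self._failure is not None

    def failure_reason(self) -> Optional[str]:
        return self._failure

    def reset(self) -> None:
        self._activity = "idle"
        self._failure = None


def clean_up_after_test(system: Manageable) -> Optional[str]:
    """Capture a diagnosis if the system failed, then restore a known good state."""
    diagnosis = system.failure_reason() if system.has_failed() else None
    system.reset()
    return diagnosis


system = ToyBatchSystem()
system.start("nightly-report")
activity_during = system.current_activity()
system.crash("disk full")
diagnosis = clean_up_after_test(system)
```

Because the clean-up step resets the system whether or not it failed, one broken test does not leave the next test starting from an unknown state.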
If our system tests are unreliable, that's a sign that we need to add interfaces to our system through which tests can better observe, synchronise with and control the activity of the system.
Those changes turned out to be exactly what we needed to better manage the systems we built. We used the same interfaces that the system exposed to the tests to build automated and manual support tools.
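To illustrate the reuse, here is a rough sketch of two support tools written against the kind of hooks a system test would use. Everything named here (StubSystem, status_report, auto_recover) is hypothetical and not taken from any system we built; the point is only that the tools need nothing beyond what the tests already needed.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubSystem:
    """Hypothetical stand-in exposing the hooks a system test would use."""

    activity: str = "idle"
    failure: Optional[str] = None

    def current_activity(self) -> str:
        return self.activity

    def has_failed(self) -> bool:
        return self.failure is not None

    def failure_reason(self) -> Optional[str]:
        return self.failure

    def reset(self) -> None:
        self.activity, self.failure = "idle", None


def status_report(system) -> str:
    """Manual tool: render the observable state as JSON for an operator."""
    return json.dumps(
        {
            "activity": system.current_activity(),
            "failed": system.has_failed(),
            "reason": system.failure_reason(),
        },
        sort_keys=True,
    )


def auto_recover(system) -> bool:
    """Automated tool: reset a failed system; return True if a reset happened."""
    if system.has_failed():
        system.reset()
        return True
    return False


system = StubSystem(activity="importing feed", failure="upstream timeout")
report = status_report(system)
recovered = auto_recover(system)
```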
It seems obvious in hindsight: to reliably manage a system, we must be able to...
- determine what the system is doing.
- detect when it has stopped doing it.
- detect when the system has failed.
- explain why it has failed.
- restore the system to a known good state.
Design Feedback from TDD
TDD at the unit scale guides the design of the code to make the system easier to modify, because it is easier to test code that is organised into loosely coupled, cohesive units that have clear responsibilities.
My hunch is that TDD at the system scale works in a similar way, guiding the design of the architecture to make the system easier to manage, because it requires the system to have machine-readable interfaces through which tools can observe and control its activity.