Many tools for GUI testing use the capture-replay approach. The charm is easy to understand: a quick start, no tool to learn, accessibility for non-technical testers, and fast results. However, all of these points are generally only true at the beginning. Very quickly, tests become unmanageable, editable only by developers, and they may stop running altogether.
A short disclaimer in advance: our tools GUIdancer and Jubula do actually have an “observation mode”. We kind of had to put it in because so many people still look for it as a feature; it appears on a great many checklists. We always say, though, that there is a much better way of writing tests. What we don’t have (at the time of writing, but the climate shows we may have to add it for the same reasons as above) is a copy function. Instead, we say that if you need something twice, you should make a module and reuse it, so that any changes you need to make are kept in one place.
There are so many practical and philosophical problems with the capture-replay approach. I generally summarize them like this:
Recording my actions while I click through an application sounds very tempting, but an automated acceptance or regression test needs planning and structure.
If I record a short test to create and delete a project in Eclipse (done with our recorder so as not to step on any toes, but even this version looks better than recorded code, I think), it might look like this:
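As a stand-in for the recorded script, here is roughly the shape such output tends to take, sketched in Python against an invented in-memory driver (every widget name and method here is hypothetical; a real capture-replay tool would generate calls against its own API). Note the flat sequence, the hard-coded data, and the fixed delays:

```python
# Minimal stand-in driver so the recorded-style script below is runnable.
# All names are invented for illustration.
class RecordedGui:
    def __init__(self):
        self.actions = []          # every raw action, in recorded order

    def click(self, widget):
        self.actions.append(("click", widget))

    def type_text(self, widget, text):
        self.actions.append(("type", widget, text))

    def wait(self, ms):
        self.actions.append(("wait", ms))


gui = RecordedGui()

# --- recorded verbatim: create a project called "Demo" ---
gui.click("menu:File")
gui.click("menu:New")
gui.click("menu:Project...")
gui.wait(500)                                  # recorder inserted a fixed delay
gui.click("tree:General/Project")
gui.click("button:Next >")
gui.type_text("field:Project name", "Demo")    # data hard-coded into the script
gui.click("button:Finish")

# --- recorded verbatim: delete the project again ---
gui.click("tree:Package Explorer/Demo")        # the literal "Demo" repeated
gui.click("menu:Edit")
gui.click("menu:Delete")
gui.wait(500)
gui.click("button:OK")

print(len(gui.actions))  # 13 raw steps: no modules, no parameters, no checks
```

Everything is one long, literal list of steps, which is exactly what the observations below take apart.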
The first thing I’d notice is that creating and deleting are two separate logical modules and so they should be separated to be reused later.
Shortly afterwards, I notice that the text entry in the “create” part and the project selection in the “delete” part will be variable depending on what project I want to create or delete, so I should really parametrize that.
I then start to notice actions that don’t actually need to be there and should be cleaned up, so that I know exactly what my test is doing and why.
I start to notice redundancies, single actions or sequences that should be specified once and reused.
Then I start thinking about robustness. Before clicking or entering text, I’d really prefer to have a check that the component is enabled. After entering text, it would be nice to check that the text is actually there.
That’s not even all the things I could notice about this small example, but already I’m seeing that the work I have to do is pretty much to take what I have and redo it. To get an example with a good structure, I could have just written it from scratch.
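Pulling those observations together, a restructured version might look like the sketch below, again in Python with an invented in-memory driver (all names are hypothetical). The shape is the point: a reusable synchronization module that checks a component is enabled before clicking, a check that entered text actually arrived, and the project name as a parameter instead of hard-coded data:

```python
# Invented stand-in widgets and driver, just enough to run the sketch.
class Widget:
    def __init__(self, enabled=True):
        self.enabled = enabled
        self.text = ""


class Gui:
    """Stand-in driver: widgets looked up by name, clicks logged."""
    def __init__(self, widgets):
        self.widgets = widgets
        self.clicked = []

    def find(self, name):
        return self.widgets[name]

    def click(self, name):
        self.clicked.append(name)

    def type_text(self, name, text):
        self.find(name).text = text


def click_when_enabled(gui, name):
    """Reusable module: check the component is enabled before clicking."""
    assert gui.find(name).enabled, f"{name} is not enabled"
    gui.click(name)


def create_project(gui, project_name):
    """The 'create' module, parametrized over the project name."""
    click_when_enabled(gui, "menu:File/New/Project...")
    click_when_enabled(gui, "button:Next >")
    gui.type_text("field:Project name", project_name)
    # Robustness: verify the text really arrived before continuing.
    assert gui.find("field:Project name").text == project_name
    click_when_enabled(gui, "button:Finish")


widgets = {
    "menu:File/New/Project...": Widget(),
    "button:Next >": Widget(),
    "field:Project name": Widget(),
    "button:Finish": Widget(),
}
gui = Gui(widgets)
create_project(gui, "Demo")
create_project(Gui(dict(widgets)), "AnotherProject")  # same module, new data
```

A matching `delete_project` module would follow the same pattern, and a full test would then just compose the two.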
The example above shows just the create part for now. It is readable, reuses an intelligent module to synchronize clicks, and it has a parameter for the new project name, so the same test case can be used for different data.
So it is certainly possible to turn a recorded script into a decent test, but the effort involved in doing it is exactly the same (or even more, I would argue, if the script was generated as spaghetti code) as sitting down and automating a test based on a plan, with design and structure. You can’t get around the fact that effort has to come into the equation somewhere.
If you leave a badly designed test (be it recorded or created in any other way) as it is, then it may run against the software and new versions of it for a while. But at some point, something will change. If you have no structure (e.g. you have copies instead of references, unknown dependencies and unreadable tests), then the maintenance work to get the tests up and running again will be enormous. This danger is compounded by the fact that it is easy to record a great many tests very quickly. It wouldn’t take a big software change to affect all these tests to the point where none of them run. Without structure, there is no chance of getting them up and running again quickly. The chance of being able to adapt the tests before the change takes place (based on the requirements) is also rather low.
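The copies-versus-references point can be made concrete with a small sketch (Python, all names invented): many tests reference one shared module, so when the application changes, a single edit brings all of them up to date. With pasted copies, the same change would mean one edit per test.

```python
# The UI detail that will change one day, known in exactly one place.
FINISH_BUTTON = "button:Finish"


def finish_wizard(actions):
    """Shared module: the only code that knows the button's label."""
    actions.append(("click", FINISH_BUTTON))


def make_test(test_name):
    """Build a trivial 'test' that references the shared module."""
    def test():
        actions = []
        finish_wizard(actions)   # a reference, not a pasted copy
        return actions
    test.__name__ = test_name
    return test


tests = [make_test(f"test_{i}") for i in range(10)]

# The application changes: the button is renamed. One edit suffices,
# and all ten tests pick up the new label the next time they run.
FINISH_BUTTON = "button:Done"
assert all(t() == [("click", "button:Done")] for t in tests)
```

With ten recorded copies of the same click, that rename would have to be hunted down ten times, which is exactly the maintenance trap described above.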
As soon as tests start failing en masse and the effort to adapt them is too large, then we miss the goal that the tests were probably trying to achieve in the first place – to have continuous feedback about the chosen use cases over constantly changing versions of the software. Any time a regression test fails, it could be giving us valuable information about the application. We can analyze the information and react to it – saving time, effort and face. For this to really work (and so that one failed test doesn’t mask any (worse) errors in other tests), the status quo has to be that automated tests are running successfully. Errors in the software need to be fixed quickly and the tests need to always be kept up to date. If we get that, then we can at least say so much about the quality: the tests we run in the way we run them are not showing any problems. That may not be everything we want to be able to say about the quality, but if we’ve chosen tests that will fail if things go wrong, then the statement is nevertheless very helpful.
We should expect our tests to provide us with information about every new build. Whether that information is “something has changed and it affected the tests (is that a problem?)” or “whatever has changed has not affected the tests (is that a problem?)”, it’s important to have the feedback. Since we expect so much from our tests, we should treat testing with the same respect as we do development, which means we need the discipline and dedication to do it right.