Testing Clichés Part V: Testing needs a test coverage model
Rikard Edgren
I believe there is too much focus on test coverage; there is even an axiom about the need for it.
My reason is that no coverage model captures what is important.
Cem Kaner lists 101 possible coverage models (Software Negligence and Testing Coverage), and none of them is super-good to me (my favorite is an expansion of no. 89, Potential Usage, which is impossible to measure.)
A dangerous example is coverage by the number of planned tests performed, which easily leads to too little exploration and less ambitious testing efforts.
Test coverage is about planning, precision, measuring and control, which isn’t the best match for things that can be used in a variety of ways, with different data and environments, and for different needs.
Sure, you can make use of them, but if you rely too much on them, you will have problems in an industry of uncertainty like software development.
The over-emphasis shows in the following ISTQB quote:
“Experience-based tests utilize testers’ skill and intuition, along with their experience with similar applications or technologies. These tests are effective at finding defects, but not as appropriate as other techniques to achieve specific test coverage levels or producing reusable test procedures.”
(implying that you can’t really rely on these methods, which merely) find defects (and important information.)
I understand that coverage models can give confidence to the decision makers, but how often are these used in reality?
Aren’t release decisions rather made based on how you feel about the facts you are presented with – and isn’t it specific bugs that stop a release, and external factors that push one?
If so, isn’t a focus on coverage models sort of wasted?
And if it brings slower testing with fewer results, isn’t it something to try to get rid of?
As an alternative, I present my 95% Table:
The measurement used is anything you want it to be, and of course practically unusable.
– SO HOW ARE WE GONNA REPORT STATUS? I hear you shout.
In a different, and better way.
I’m not sure how, but I want to be close to what’s important, and far away from John von Neumann’s quote:
“There’s no sense in being precise when you don’t even know what you’re talking about.”
Test coverage is one of those things that stakeholders get itchy about if there are “unknowns” – even if they can’t themselves quantify what a “known” is, the fact that there might be untouched areas makes them uneasy.
Even if someone reports “good coverage” (in their view), I still want to know about the silent evidence – otherwise it’s of no value.
This reminds me of a post I should complete about why stakeholders look for “comfort blankets”…
I’m not sure I understand your point.
I agree that a number stating test coverage of xx% really does not say much, and in practice I have not heard of anyone using such a number as a strict threshold for making decisions.
Personally I would like to see decision makers in general be more comfortable with qualitative analyses and not always demand quantitative numbers, from architectural quality attributes to verification.
I think the cause of always wanting numbers is in many cases a mistrust of the skills of a developer or tester, since a qualitative analysis often relies on tacit knowledge which is not easily shared.
Secondly, I think it would be very hard to talk about coverage for a system in its context, since the inputs in theory are infinite.
Simon: yes, finding untouched areas that might be important is a good way to use models. We can look both inside and outside models.
I think the ongoing test activities should handle relevant “silent evidence”, and stakeholders should review the reasoning behind untouched areas, rather than a percentage.
Ulrik: I think you understand my point, since I fully agree with yours.
My post should have stated that it is the quantitative models I have problems with.
Qualitative, implicit models are used all the time.
I discussed test coverage with an agile coach about a year ago. His idea of full coverage was that 100% of the code was executed. Great, you have 100% code coverage – now let me give some examples of what other types of test coverage there are…
However, you can measure coverage of a SPECIFIC MODEL! If you have a list of 100 test ideas and execute 50 of them, you have 50% coverage of that specific model. If I create a state model and cover all the transitions, I have 100% transition coverage of that specific model. Since a model is by definition a simplification, it is possible to add more detail to any model or to change it entirely, so it does not in any way give a final quantitative coverage. But it tells you coverage in context. Since all models are wrong but some are useful (George Box), I do believe in using models to simplify reality and in using these models as inspiration for testing – for test design, and for finding problems in requirements and design. And yes, you can show coverage of that model!
One of my current ideas for exploratory testing is to use a test design model and cover that specific model during a session of testing. I did that yesterday with a state graph for an administrative system: covering all states and all transitions, finding some problems and a bunch of issues. Just because there are 100 more kinds of test coverage according to Kaner does not mean that it is pointless to use this model – quite the opposite. That is one of my ideas on test design: create a model, then cover it! (Did I just manage to refute your argument by finding one counter-example :-)?)
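To make the bookkeeping concrete, here is a minimal sketch in Python of measuring coverage against one specific model. The state graph is invented for illustration (a hypothetical log-in/edit flow), not taken from the administrative system mentioned above:

```python
# The model: states and transitions of a hypothetical admin system
# (invented for illustration; not any real system's).
TRANSITIONS = {
    ("logged_out", "logged_in"),   # log in
    ("logged_in", "editing"),      # open a record for editing
    ("editing", "logged_in"),      # save and close
    ("logged_in", "logged_out"),   # log out
}

def transition_coverage(executed):
    """Fraction of this model's transitions that the tests touched --
    coverage of this specific model, and of nothing else."""
    covered = TRANSITIONS & set(executed)
    return len(covered) / len(TRANSITIONS), TRANSITIONS - covered

# Suppose one session exercised these transitions:
session = [("logged_out", "logged_in"), ("logged_in", "editing")]
ratio, missed = transition_coverage(session)
print(f"{ratio:.0%} transition coverage of this model; not yet covered: {missed}")
```

The number it prints says nothing about the product as a whole – only how much of this particular graph the session touched, which is exactly the point about coverage in context.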
I think the low-level dashboard that James Bach speaks about is an excellent example of showing coverage in less detail but with more exactness. But that really is a more qualitative than quantitative model.
Yes, it is good to use models and try to cover them, but I think the results are more interesting than a number.
It is more important to see what you learned from the activities, what you found outside the models, what you want to investigate more, what information you have given to the project (and stakeholders.)
The “we have tested what we planned” is less important, to me.
And yes, you can specify coverage of a model, but what does it mean?
The 50% coverage of test ideas can have multiple answers hidden, and all are a lot more important than the number 50:
* we have found so many serious bugs that further testing is pointless
* we are running late because testers insist on investigating things they aren’t explicitly told to look for
* we have run the 50 most difficult test ideas, and we believe we will finish on schedule
* we have run the 50 easy tests on input data, and look forward to the results from the radically different test ideas
* we have run the first half, in alphabetical order, and are not really sure what we are doing
* we have investigated the 50 most important test ideas, and believe the implicit coverage is enough to go Beta
* we are halfway through, but have found a lot of things that are more important to test than our original assumptions
(qualitative is better than quantitative)
And even for ambitious test projects that use many models, you can, and should, wonder: what is outside the models? Another model?
If I was a stakeholder I would like to have testers I trust, that have a gut feeling that they have tested “enough”. And I would like the high level test ideas to have been reviewed by people with different background.
The coverage question is “how much testing have we done?”. That’s not an unreasonable question. Coverage models give us a foreground answer to the question “compared to what?”. I don’t see a problem there. Where I do see a problem is in mistaking the model for the reality, the map for the territory, or the test cases for the testing, as it were. The background, the stuff not covered by the model, isn’t specified by the model; or, if you prefer, models tend not to tell you what they’re leaving out. A second model can help to identify gaps in the first; a third can help to identify gaps in the first two, and so on. (This relates to Weick’s gloss on Ashby’s Law of Requisite Variety: “If you want to understand something complicated, you have to complicate yourself.” It seems to me that that advice is recursive.)
A model that is too narrow can be covered completely, but won’t yield very much in the way of information (and in testing, won’t yield many problems or bugs either, presumably). With a model that is too expansive (whether intentionally or inadvertently) you can’t get enough coverage of the model to make useful generalizations about a particular factor within the model.
I agree that we should take it easy on attempting to put things in terms of third- or even second-order models, when a first-order model affords enough information for an appropriate response like “change something” or “don’t worry about it” or “get more data”, as well as the appeal to a higher-order measurement like “get more quantitative data”. I’m proud of a handful of columns that I wrote on that: Got You Covered, Cover or Discover, A Map By Any Other Name, and Three Kinds of Measurement (And Two Ways to Use Them). You can find those here: http://www.developsense.com/publications.html. (I’d like to put up links to each article separately, but someone’s blog spam filter is too restrictive for that.)
Finding a model that’s too expansive can be easy. Sometimes all you have to do is look for the negative space. For example,
“Experience-based tests utilize testers’ skill and intuition, along with their experience with similar applications or technologies. These tests are effective at finding defects, but not as appropriate as other techniques to achieve specific test coverage levels or producing reusable test procedures.”
Can anyone tell me what kind of tests don’t utilize testers’ skill and intuition, along with their experience with similar applications and technologies? Insert your own joke here.
—Michael B.
Henrik Emilsson: I’m not sure which spam filter is doing this, but I’ll have a look at it.
Michael, the article on three orders of measurement made things clearer for me.
Rikard – I agree that first- and second-order measurements are often neglected. But I do not agree that the focus is on detailed third-order measurement either. In my world of testing I have had great use of third-order measurements, and at the same time I admit that I have been on lots of projects where the focus has been on covering details. This applies not only to testing but to the project as a whole.
The discussion we are having here is not only about testing – it is really about solving problems in general and more specifically on producing working software. The coverage you mention in the original post concerns not only testing but also business modeling, requirements, architecture, coding and whatever other parts you want to add.
That is why I use several models on several levels. A high-level favorite is the Effect Map with users and their needs (Effect Managing IT, Ottersten). I promise to dig into QSM part II after I am done with part I.
I’m not stating that coverage numbers and details always are bad.
I’m just saying it’s such a complex affair with software, so the detailed numbers don’t cover the whole and what is important (but they can still be useful.)
And even if clichés are over-used, they can still hold some truth…
I’m currently reading Gigerenzer’s excellent Gut Feelings, and it feels like intuition and skill beat objective facts in many situations.
Michael, your columns are great!
@Rikard, Michael and Tobbe:
Perhaps Coverage has been abused so many times that it therefore has become a cliché (in some parts of our society)?
For a long time I thought of Coverage as something bad, just because I recognized that some people abused the result of it. E.g. “We don’t have to test more because we have created test cases for each requirement and run them with success. We now have 100% Coverage, why aim for 110%?”
Nowadays I see that Coverage is a very helpful tool – if I know of (and communicate) the imperfection of the model it is based on.
And I am striving to get even better at modeling and at using this as a tool; it becomes even more important for me since I nowadays seldom test a known application, but rather see new ones for each project I get involved in.
Gut Feelings as well as Blink are really cool books. Just read them both a couple of months ago. A very interesting thought regarding decision making is that in order for intuition to work well you need to be really skilled. So when people like ourselves – who are very well educated in testing – make decisions based on intuition, they may be very valid; novices (or PMs) may draw totally different and erroneous conclusions from the same material. The Dunning-Kruger effect at work. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
And no, I do not count test cases or bug reports. I use the low-level dashboard for reporting, which is not only accepted but very appreciated by the rest of the project. The low-level models I use to communicate specific issues with developers or future users. They also work very well in their context.
Henrik – 100% coverage of test cases is a measure, but as Rikard said before, out of context it tells you little of value. Idea for a presentation at SWET2?
I’m not sure that abuse is the problem. I think the problem is that very, very few people understand what coverage means. I’m not sure why that is. To me, the problem is like pointers in C: you can’t understand them; then you can, but you can’t understand why you didn’t. I’m going to try to recall how things fell together for me; maybe that will be helpful to someone.
I too had the hardest time grasping coverage, for quite a while. I confused coverage with oracles; I confused coverage with quality criteria. Then James really threw me for a loop with the statement (in our course material) that “coverage is the proportion of the product that has been tested”. I couldn’t accept that, and I couldn’t figure out why, except that I knew that there was a problem with “proportion”. To my mind, proportion could only make sense in terms of a ratio, and the test space is infinite. So coverage asymptotically approaches zero, no matter how much testing you’ve done.
Well, yeah. That’s true. So, for me, “proportion” had to go, because I figured it would drive people to quantifying coverage in a way that was at best meaningless or unhelpful, and at worst misleading.
In one chat session or another, James pointed out that any testing at all obtained coverage. Any use of the product at all obtained coverage. That didn’t make sense to me either. “Can an end-user find bugs?” James asked. Ummm… yes. “If the end user can find a bug, isn’t he getting test coverage?” Ah… coverage might not be intentional; it might be incidental, or accidental. Even if we’re not consciously testing some factor of the product, we might be getting coverage of that factor. Hmmm.
Through several long transpection sessions (he can be very patient), James did agree to rephrase the troublesome sentence as “how much of the product we’ve tested”, which got over my quantification issue to some degree. Fortunately, “how much” also triggered the reflex introduced in me by Jerry Weinberg: compared to what? Well, test coverage can’t be expressed in terms of the whole product, since that would require knowing what complete coverage meant. To me, pretty much everyone (Myers, Kaner, Marick, Black) was getting stuck on code coverage, and getting me stuck in the process. (There is a passage early in Black’s Managing the Testing Process that hints at the idea of feature coverage, but it wasn’t quantified at all, and it didn’t click for me.)
Eventually I found something in the way Beizer talked about coverage that helped to make it clear: “any metric of completeness with respect to a test selection criterion.” It helped that, at the time, I was fascinated with Kaner and Bond’s paper “Software Engineering Metrics: What Do They Measure and How Do We Know?”, in which the authors said that a metric was a measurement function – a function that “applies a number according to a model or theory (my emphasis) to attributes or events with the intention of describing them.” Put those two things together, and expressing test coverage as a quantity in terms of a model of the product or the test space becomes possible.
When we’ve got a sufficiently understandable and quantifiable model, that model doesn’t even have to be very good, as long as we keep our comments about coverage relative to the model, rather than to the product overall or to the test space overall. The model stands a chance of being quantified; the product and the test space do not. For example, just as in Tobbe’s example above, some people might decide to write one test case for each of 500 requirements. In my point of view, that’s a terrible test strategy. Yet one could reasonably say that you’ve achieved 75% coverage when you’ve run 375 of them – 75% with respect to that suite of test cases. If you say that you’ve performed 5000 tests that exercised all 3000 valid values for a given input field, you can claim 100% coverage with respect to those 3000 values in that field. If you’ve observed the logs afterwards, you can also say that you’ve obtained some performance coverage for free: you’ve observed this range of response times over 5000 requests. That’s not expressed in terms of a completeness criterion – until someone says “If we were to do 5000 requests, that would be adequate coverage for the purpose of evaluating response times.” Not a great test strategy in my mind either, yet it’s a valid expression of how much was tested by some completeness criterion. In a subjective world, “completeness” is subject to the Relative Rule: complete to some person, at some time.
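A minimal sketch of that discipline, in Python: a function that refuses to express “how much” without naming the criterion the number is relative to. The figures simply repeat the hypothetical examples above; nothing here describes a real project.

```python
# A coverage figure is only meaningful together with the completeness
# criterion it is measured against, so report them as one unit.

def coverage(done, criterion_size, criterion):
    """Express 'how much' strictly relative to a named criterion."""
    return f"{done / criterion_size:.0%} coverage with respect to {criterion}"

print(coverage(375, 500, "a suite of 500 one-per-requirement test cases"))
# -> 75% coverage with respect to a suite of 500 one-per-requirement test cases
print(coverage(3000, 3000, "the 3000 valid values for one input field"))
# -> 100% coverage with respect to the 3000 valid values for one input field
```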
The trouble sneaks in when we quantify test coverage without qualifying what we mean by it. Exactly as you put it, Henrik: “Coverage is a very helpful tool – if I know of (and communicate) the imperfection of the model it is based on.”
Thank you for the compliments, Rikard.
Tobbe, just to be clear: I would argue that third-order measurement – the kind of measurement used to discover a natural law – is at best infeasible in software, and probably impossible. Second-order – used for the tuning of existing systems – is possible in certain domains, but possibly unreliable and probably very expensive when it comes to humans and their effects in systems. First-order measurement is pretty much always available. Let’s use it.
—Michael B.
Great thoughtful article and great thoughtful comments. It’s exchanges like these that help advance thinking in the software testing community.
I particularly appreciate Michael’s comment that “When we’ve got a sufficiently understandable and quantifiable model, that model doesn’t even have to be very good, as long as we keep our comments about coverage relative to the model, rather than to the product overall or to the test space overall.”
That really resonates with me. I’ve seen many people gloss over the distinctions (and have inadvertently been guilty of doing so myself). Our test design tool has a “coverage chart.” I’m going to improve the explanatory notes we have in our tool about what the coverage ratios mean (and what the numerical coverage records DO NOT MEAN) to better reflect the important point that Michael made here.
– Justin Hunter
Great explication Michael! And an interesting journey!
I wanna go back to something mentioned in the middle of the thread.
Regarding the models on several levels that Tobbe mentioned: it is also useful to have several models on the “same level” that interlace. Since the test space is infinite, it is good if we can carefully select models that combined become a mesh.
When I select a test strategy I begin by investigating what the stakeholder’s information objectives are, and then I form a test mission. When this becomes clearer, the next step is to come up with a strategy that fulfills it. And in doing this I try to diversify the approaches for fulfilling each information objective, thereby trying to get as good coverage as possible within the same time.
This means that the selected approaches/techniques will benefit from models that differ from each other, hence creating a more promising starting point for getting better coverage.
To give an example, if the two most important information objectives are:
1. Find important bugs that need to be fixed
2. Minimize the risk of safety-related lawsuits
Both could be addressed with Exploratory Risk-based Testing, but it would be better to diversify and use other techniques, and thereby other models. So in this example it might be good to select Exploratory Risk-based Testing for 1 and Specification-based Testing for 2, using existing safety-related legal frameworks.
Another approach: Scenario-based Testing for 1. and Exploratory Risk-based Testing for 2.
Or perhaps even better: These two combined – switching approach after a while.
By doing this, the models begin to form a mesh and thereby strengthen the overall coverage, in a way compensating for each model’s imperfect coverage.
On top of this, there could be other models on a higher or lower level that cover other important aspects.
Ok, I admit that referring to third order measurement was a mistake. It should be second order. And I need to read and think some more before I blog about it. Almost midnight, gotta leave before the enchantment wears off…
I guess we all use coverage in some way.
I want to cover what’s important about the product (might include looking at what is seemingly not important.)
These are qualitative judgments, and to put quantitative judgment on top of this doesn’t make sense.
I believe there are qualitative ways that offer more than gut feelings, stories, and “enough/bad/excellent” judgments –
but I haven’t found them, yet.