
A Rigorous Framework for Software Measurement

The software engineering literature abounds with software 'metrics' falling into one or more of the categories described above. Anyone new to the area seeking a small set of 'best' software metrics is therefore bound to be confused, especially as the literature presents conflicting views of what constitutes best practice. In this section we establish a framework relating different activities and different metrics. The framework enables readers to distinguish the applicability (and value) of many metrics. It also provides a simple set of guidelines for approaching any software measurement task.

Measurement definitions

Until relatively recently, a common criticism of much software metrics work was its lack of rigour. In particular, much work was criticised for its failure to adhere to the basic principles of measurement that are central to the physical and social sciences. Recent work has shown how to apply the theory of measurement to software metrics [Fenton 1991, Zuse 1991]. Central to this work is the following definition of measurement:

Measurement is the process by which numbers or symbols are assigned to attributes of entities in the real world in such a way as to characterise them according to clearly defined rules. The numerical assignment is called the measure.

The theory of measurement provides the rigorous framework for determining when a proposed measure really does characterise the attribute it is supposed to. The theory also provides rules for determining the scale types of measures, and hence which statistical analyses are relevant and meaningful. We make a distinction between a measure (in the above definition) and a metric. A metric is a proposed measure. Only when it really does characterise the attribute in question can it truly be called a measure of that attribute. For example, the number of Lines of Code (LOC) (defined on the set of entities 'programs') is not a measure of 'complexity' or even 'size' of programs (although it has been proposed as such), but it is clearly a measure of the attribute of length of programs.

To understand better the definition of measurement in the software context we need to identify the relevant entities and the attributes of these that we are interested in characterising numerically. First we identify three classes of entities:

· Processes: any specific activity, set of activities, or time period within the manufacturing or development project. Relevant examples include specific activities like requirements capture, designing, coding, and verification; also specific time periods like "the first three months of project X".

· Products: any artefact, deliverable or document arising out of a process. Relevant examples include source code, a design specification, a documented proof, a test plan, and a user manual.

· Resources: any item forming, or providing input to, a process. Relevant examples include a person or team of people, a compiler, and a software test tool.

We make a distinction between attributes of these entities which are internal and those which are external:

· Internal attributes of a product, process, or resource are those which can be measured purely in terms of the product, process, or resource itself. For example, length is an internal attribute of any software document, while elapsed time is an internal attribute of any software process.

· External attributes of a product, process, or resource are those which can only be measured with respect to how the product, process, or resource relates to other entities in its environment. For example, reliability of a program (a product attribute) is dependent not just on the program itself, but on the compiler, machine, and user. Productivity is an external attribute of a resource, namely people (either as individuals or groups); it is clearly dependent on many aspects of the process and the quality of products delivered.

Table 4.1: Examples of entities and their internal and external attributes

Products
· Specifications. Internal: size, reuse, modularity, redundancy, functionality, syntactic correctness, ... External: comprehensibility, maintainability, ...
· Designs. Internal: size, reuse, modularity, coupling, cohesiveness, functionality, ... External: quality, complexity, maintainability, ...
· Code. Internal: size, reuse, modularity, coupling, functionality, algorithmic complexity, control-flow structuredness, ... External: reliability, usability, maintainability, ...
· Test data. Internal: size, coverage level, ... External: quality, ...
· ...

Processes
· Constructing specification. Internal: time, effort, number of requirements changes, ... External: quality, cost, stability, ...
· Detailed design. Internal: time, effort, number of specification faults found, ... External: cost, cost-effectiveness, ...
· Testing. Internal: time, effort, number of coding faults found, ... External: cost, cost-effectiveness, stability, ...
· ...

Resources
· Personnel. Internal: age, price, ... External: productivity, experience, intelligence, ...
· Teams. Internal: size, communication level, structuredness, ... External: productivity, quality, ...
· Software. Internal: price, size, ... External: usability, reliability, ...
· Hardware. Internal: price, speed, memory size, ... External: reliability, ...
· Offices. Internal: size, temperature, light, ... External: comfort, quality, ...
· ...

Table 4.1 provides some examples of how this framework applies to software measurement activities. Software managers and software users would most like to measure external attributes. Unfortunately, these are necessarily only measurable indirectly. For example, we have already noted that productivity of personnel is most commonly measured as the ratio of the size of code delivered (an internal product attribute) to the effort involved in that delivery (an internal process attribute). For the purposes of the current discussion the most important attribute that we wish to measure is the 'quality' of a software system (a very high-level external product attribute). It is instructive next to consider in detail the most common way of doing this, since it puts into perspective much of the software metrics field.
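To make the indirect nature of such measurement concrete, here is a minimal sketch in Python of the productivity ratio just described; the function name and figures are purely hypothetical, not a standard definition:

    # Sketch only: productivity as delivered size (internal product attribute)
    # divided by effort (internal process attribute).
    def productivity(delivered_kloc: float, effort_person_months: float) -> float:
        return delivered_kloc / effort_person_months

    # e.g. 12 KLOC delivered for 30 person-months of effort:
    print(productivity(12, 30))   # 0.4 KLOC per person-month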

The defect density metric

The most commonly used means of measuring the quality of a piece of software code C is the defect density metric, defined by:

    defect density of C = (number of known defects in C) / (size of C)

where the size of C is normally given in KLOC (thousands of lines of code). Note that the external product attribute here is being measured indirectly in terms of an internal process attribute (the number of defects discovered during some testing or operational period) and an internal product attribute (size). Although it can be a useful indicator of quality when used consistently, defect density is not actually a measure of software quality in the formal sense of the above definition of measurement. There are a number of well documented problems with this metric. In particular:

· It fails to characterise much intuition about software quality, and may even be more an indicator of testing severity than of quality.

· There is no consensus on what constitutes a defect. Generally a defect is either a fault discovered during review and testing (which may potentially lead to an operational failure), or a failure that has been observed during software operation. In some studies 'defects' means just post-release failures; in others it means all known faults; in others it is the set of faults discovered after some arbitrary fixed point in the software life-cycle (e.g. after unit testing). The terminology also differs widely between organisations; fault rate, fault density and failure rate are used almost interchangeably.

· It is no coincidence that the terminology defect rate is often used instead of defect density. Size is used only as a surrogate measure of time (on the basis that the latter is normally too difficult to record). For example, for operational failures the defect rate should ideally be based on inter-failure times, in which case it would be an accurate measure of reliability. Reliability is what we would most like to measure and predict, since it most accurately represents the user's view of quality.

· There is no consensus about how to measure software size in a consistent and comparable way. Even when the most common size measure (LOC or KLOC) is used for the same programming language, deviations in counting rules can result in variations of up to a factor of five.
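As a simple illustration of the calculation defined above (not of any particular organisation's counting rules), consider the following minimal Python sketch; the defect count and LOC figure are hypothetical:

    # Defect density as defects per KLOC; the counting rules for 'defect' and
    # 'LOC' are assumed to have been fixed beforehand, as discussed above.
    def defect_density_per_kloc(known_defects: int, loc: int) -> float:
        return known_defects / (loc / 1000.0)

    # e.g. 150 known defects in a 50,000 LOC system:
    print(defect_density_per_kloc(150, 50000))   # 3.0 defects per KLOC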

Despite the serious problems listed above (and others that have been discussed extensively elsewhere), we accept that defect density has become the de facto industry standard measure of software quality. Commercial organisations argue that they avoid many of these problems by having formal definitions which are consistent within their own environment. In other words, it works for them, but comparisons should not be attempted outside the source environment. This is sensible advice. Nevertheless, it is inevitable that organisations are hungry both for benchmarking data on defect densities and for predictive models of defect density. In both of these applications we do have to make cross-project comparisons and inferences. It is therefore important, for broader QA issues, that we review what is known about defect density benchmarks.

Companies are (for obvious reasons) extremely reluctant to publish data about their own defect densities, even when these are relatively low. The few published references that we have found tend to report on anonymous third parties, and in a way that makes independent validation impossible. Nevertheless, company representatives seem happy to quote numbers at conferences and in the grey literature. Notwithstanding the difficulty of determining either the validity of the figures or exactly what was measured and how, there is some consensus on the following: in the USA and Europe the average defect density (based on the number of known post-release defects) appears to be between 5 and 10 per KLOC. Japanese figures seem to be significantly lower (usually below 4 per KLOC), but this may be because only the top companies report. A well known Business Week article of 11 February 1991 reported the results of an extensive study comparing similar 'state-of-the-art' US and Japanese software companies. The numbers of post-delivery defects per KLOC (first 12 months) were 4.44 for the USA and 1.96 for Japan. It is widely believed that a (delivered) defect density below 2 per KLOC is good going.

In one of the more revealing of the published papers, [Daskalantonakis 1992] reports that Motorola's six sigma quality goal is to have 'no more than 3.4 defects per million of output units from a project'. This translates to an exceptionally low defect density of 0.0034 per KLOC. The paper suggests that the actual defect density lay between 1 and 6 per KLOC on projects in 1990 (a figure which was decreasing sharply by 1992). Of course, even the holy grail of zero-defect software may not actually mean that very high quality has been achieved. For example, [Cox 1991] reports that at Hewlett-Packard a number of systems that recorded zero post-release defects turned out to be systems that were simply never used. A related phenomenon is the great variability of defect densities within the same system. In our own study of a major commercial system [Pfleeger et al 1994], the 1.7 million LOC system was divided into 28 sub-systems whose median size was 70 KLOC. There were 481 distinct user-reported faults in total for one year, yielding a very low overall defect density of around 0.3 per KLOC. However, 80 of those faults were concentrated in the sub-system which was by far the smallest (4 KLOC), and whose fault density was therefore a very high 20 per KLOC.
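The arithmetic behind these figures can be checked directly (a quick sketch using only the numbers quoted above):

    # Motorola six sigma goal: 1 million LOC is 1000 KLOC, so 3.4 per million
    # LOC is 3.4/1000 per KLOC.
    print(3.4 / 1000)    # 0.0034 defects per KLOC
    # Pfleeger et al study: 481 faults over 1.7 million LOC (1700 KLOC).
    print(481 / 1700)    # ~0.28 faults per KLOC, i.e. around 0.3
    # Smallest sub-system: 80 faults in 4 KLOC.
    print(80 / 4)        # 20 faults per KLOC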

Measuring size and complexity

In all of the key examples of software measurement seen so far, the notion of software 'size' has been a critical indirect factor. It is used as the normalising factor in the common measures of software quality (defect density) and programmer productivity. Product size is also the key parameter in models of software effort. It is not surprising, therefore, that the history of software metrics has been greatly influenced by the quest for good measures of size. The most common measure of size happens to be the simplest: Lines of Code (LOC). Other similar measures are the number of statements, the number of executable statements, and delivered source instructions (DSI). In addition to the problems with these measures already discussed, they all have the obvious drawback of being defined only on code; they offer no help in measuring the size of, say, a specification. Another critical problem (and the one which destroys the credibility of both the defect density metric and the productivity metric) is that they characterise only one specific view of size, namely length. Consequently there have been extensive efforts to characterise other internal product size attributes, notably complexity and functionality. In the next section we shall see how the history of software metrics has been massively influenced by this search.
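To illustrate why counting rules alone can produce such divergent 'sizes', the following sketch contrasts two plausible LOC conventions applied to the same source text; neither is a standard, and real counting tools differ further (for example over continuation lines and declarations):

    # One possible LOC counter: optionally skip blank lines and full-line
    # comments. Different conventions yield different 'sizes' for the same file.
    def count_loc(source: str, ignore_blank: bool = True, ignore_comments: bool = True) -> int:
        count = 0
        for line in source.splitlines():
            stripped = line.strip()
            if ignore_blank and not stripped:
                continue
            if ignore_comments and stripped.startswith("#"):
                continue
            count += 1
        return count

    example = "x = 1\n\n# a comment\ny = x + 1\n"
    print(count_loc(example))                # 2: non-blank, non-comment lines
    print(count_loc(example, False, False))  # 4: every physical line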

Next section - Key Metrics

