
A Rigorous Framework for Software Measurement

The software engineering literature abounds with software 'metrics' falling into one or more of the categories described above. Anyone new to the area seeking a small set of 'best' software metrics is therefore bound to be confused, especially as the literature presents conflicting views of what constitutes best practice. In this section we establish a framework relating different activities and different metrics. The framework enables readers to distinguish the applicability (and value) of many metrics. It also provides a simple set of guidelines for approaching any software measurement task.

Measurement definitions

Until relatively recently, a common criticism of much software metrics work was its lack of rigour. In particular, much work was criticised for its failure to adhere to the basic principles of measurement that are central to the physical and social sciences. Recent work has shown how to apply the theory of measurement to software metrics [Fenton 1991, Zuse 1991]. Central to this work is the following definition of measurement:

Measurement is the process by which numbers or symbols are assigned to attributes of entities in the real world in such a way as to characterise them according to clearly defined rules. The numerical assignment is called the measure.

The theory of measurement provides the rigorous framework for determining when a proposed measure really does characterise the attribute it is supposed to. The theory also provides rules for determining the scale types of measures, and hence which statistical analyses are relevant and meaningful. We make a distinction between a measure (in the above definition) and a metric. A metric is a proposed measure. Only when it really does characterise the attribute in question can it truly be called a measure of that attribute. For example, the number of Lines of Code (LOC) (defined on the set of entities 'programs') is not a measure of 'complexity' or even 'size' of programs (although it has been proposed as such), but it is clearly a measure of the attribute of length of programs.

To understand better the definition of measurement in the software context we need to identify the relevant entities and the attributes of these that we are interested in characterising numerically. First we identify three classes of entities:

· Processes: any specific activity, set of activities, or time period within the manufacturing or development project. Relevant examples include specific activities like requirements capture, designing, coding, and verification; also specific time periods like "the first three months of project X".

· Products: any artefact, deliverable or document arising out of a process. Relevant examples include source code, a design specification, a documented proof, a test plan, and a user manual.

· Resources: any item forming, or providing input to, a process. Relevant examples include a person or team of people, a compiler, and a software test tool.

We make a distinction between attributes of these entities which are internal and those which are external:

· Internal attributes of a product, process, or resource are those which can be measured purely in terms of the product, process, or resource itself. For example, length is an internal attribute of any software document, while elapsed time is an internal attribute of any software process.

· External attributes of a product, process, or resource are those which can only be measured with respect to how the product, process, or resource relates to other entities in its environment. For example, reliability of a program (a product attribute) is dependent not just on the program itself, but on the compiler, machine, and user. Productivity is an external attribute of a resource, namely people (either as individuals or groups); it is clearly dependent on many aspects of the process and the quality of products delivered.

Table 4.1: Examples of entities and their internal and external attributes

Products
· Specifications. Internal: size, reuse, modularity, redundancy, functionality, syntactic correctness, ... External: comprehensibility, maintainability, ...
· Designs. Internal: size, reuse, modularity, coupling, cohesiveness, functionality, ... External: quality, complexity, maintainability, ...
· Code. Internal: size, reuse, modularity, coupling, functionality, algorithmic complexity, control-flow structuredness, ... External: reliability, usability, maintainability, ...
· Test data. Internal: size, coverage level, ... External: quality, ...
· ...

Processes
· Constructing specification. Internal: time, effort, number of requirements changes, ... External: quality, cost, stability, ...
· Detailed design. Internal: time, effort, number of specification faults found, ... External: cost, cost-effectiveness, ...
· Testing. Internal: time, effort, number of coding faults found, ... External: cost, cost-effectiveness, stability, ...
· ...

Resources
· Personnel. Internal: age, price, ... External: productivity, experience, intelligence, ...
· Teams. Internal: size, communication level, structuredness, ... External: productivity, quality, ...
· Software. Internal: price, size, ... External: usability, reliability, ...
· Hardware. Internal: price, speed, memory size, ... External: reliability, ...
· Offices. Internal: size, temperature, light, ... External: comfort, quality, ...
· ...

Table 4.1 provides some examples of how this framework applies to software measurement activities. Software managers and software users would most like to measure external attributes. Unfortunately, these are necessarily only measurable indirectly. For example, we have already noted that productivity of personnel is most commonly measured as the ratio of the size of code delivered (an internal product attribute) to the effort involved in that delivery (an internal process attribute). For the purposes of the current discussion the most important attribute that we wish to measure is the 'quality' of a software system (a very high-level external product attribute). It is instructive next to consider in detail the most common way of doing this, since it puts into perspective much of the software metrics field.
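To make the indirect nature of such measurement concrete, here is a minimal sketch in Python of the productivity ratio just described; the function name and figures are purely hypothetical, not a standard definition:

    # Sketch only: productivity as delivered size (internal product attribute)
    # divided by effort (internal process attribute).
    def productivity(delivered_kloc: float, effort_person_months: float) -> float:
        return delivered_kloc / effort_person_months

    # e.g. 12 KLOC delivered for 30 person-months of effort:
    print(productivity(12, 30))   # 0.4 KLOC per person-month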

The defect density metric

The most commonly used means of measuring the quality of a piece of software code C is the defect density metric, defined by:

    defect density of C = (number of known defects in C) / (size of C)

where the size of C is normally given in KLOC (thousands of lines of code). Note that the external product attribute here is being measured indirectly in terms of an internal process attribute (the number of defects discovered during some testing or operational period) and an internal product attribute (size). Although it can be a useful indicator of quality when used consistently, defect density is not actually a measure of software quality in the formal sense of the above definition of measurement. There are a number of well documented problems with this metric. In particular:

· It fails to characterise much intuition about software quality, and may even be more an indicator of testing severity than of quality.

· There is no consensus on what constitutes a defect. Generally a defect is either a fault discovered during review and testing (which may potentially lead to an operational failure), or a failure that has been observed during software operation. In some studies 'defects' means just post-release failures; in others it means all known faults; in others it is the set of faults discovered after some arbitrary fixed point in the software life-cycle (e.g. after unit testing). The terminology also differs widely between organisations; fault rate, fault density and failure rate are used almost interchangeably.

· It is no coincidence that the terminology defect rate is often used instead of defect density. Size is used only as a surrogate measure of time (on the basis that the latter is normally too difficult to record). For example, for operational failures the defect rate should ideally be based on inter-failure times, in which case it would be an accurate measure of reliability. Reliability is what we would most like to measure and predict, since it most accurately represents the user's view of quality.

· There is no consensus about how to measure software size in a consistent and comparable way. Even when the most common size measure (LOC or KLOC) is used for the same programming language, deviations in counting rules can result in variations of up to a factor of five.
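As a simple illustration of the calculation defined above (not of any particular organisation's counting rules), consider the following minimal Python sketch; the defect count and LOC figure are hypothetical:

    # Defect density as defects per KLOC; the counting rules for 'defect' and
    # 'LOC' are assumed to have been fixed beforehand, as discussed above.
    def defect_density_per_kloc(known_defects: int, loc: int) -> float:
        return known_defects / (loc / 1000.0)

    # e.g. 150 known defects in a 50,000 LOC system:
    print(defect_density_per_kloc(150, 50000))   # 3.0 defects per KLOC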

Despite the serious problems listed above (and others that have been discussed extensively elsewhere), we accept that defect density has become the de facto industry standard measure of software quality. Commercial organisations argue that they avoid many of these problems by having formal definitions which are consistent within their own environment. In other words, it works for them, but comparisons should not be attempted outside the source environment. This is sensible advice. Nevertheless, it is inevitable that organisations are hungry both for benchmarking data on defect densities and for predictive models of defect density. In both of these applications we do have to make cross-project comparisons and inferences. It is therefore important, for broader QA issues, that we review what is known about defect density benchmarks.

Companies are (for obvious reasons) extremely reluctant to publish data about their own defect densities, even when these are relatively low. The few published references that we have found tend to report on anonymous third parties, and in a way that makes independent validation impossible. Nevertheless, company representatives seem happy to quote numbers at conferences and in the grey literature. Notwithstanding the difficulty of determining either the validity of the figures or exactly what was measured and how, there is some consensus on the following: in the USA and Europe the average defect density (based on the number of known post-release defects) appears to be between 5 and 10 per KLOC. Japanese figures seem to be significantly lower (usually below 4 per KLOC), but this may be because only the top companies report. A well known Business Week article of 11 February 1991 reported the results of an extensive study comparing similar 'state-of-the-art' US and Japanese software companies. The numbers of post-delivery defects per KLOC (first 12 months) were 4.44 for the USA and 1.96 for Japan. It is widely believed that a (delivered) defect density below 2 per KLOC is good going.

In one of the more revealing of the published papers, [Daskalantonakis 1992] reports that Motorola's six sigma quality goal is to have 'no more than 3.4 defects per million of output units from a project'. This translates to an exceptionally low defect density of 0.0034 per KLOC. The paper suggests that the actual defect density lay between 1 and 6 per KLOC on projects in 1990 (a figure which was decreasing sharply by 1992). Of course, even the holy grail of zero-defect software may not actually mean that very high quality has been achieved. For example, [Cox 1991] reports that at Hewlett-Packard a number of systems that recorded zero post-release defects turned out to be systems that were simply never used. A related phenomenon is the great variability of defect densities within the same system. In our own study of a major commercial system [Pfleeger et al 1994], the 1.7 million LOC system was divided into 28 sub-systems whose median size was 70 KLOC. There were 481 distinct user-reported faults in total for one year, yielding a very low overall defect density of around 0.3 per KLOC. However, 80 of those faults were concentrated in the sub-system which was by far the smallest (4 KLOC), and whose fault density was therefore a very high 20 per KLOC.
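The arithmetic behind these figures can be checked directly (a quick sketch using only the numbers quoted above):

    # Motorola six sigma goal: 1 million LOC is 1000 KLOC, so 3.4 per million
    # LOC is 3.4/1000 per KLOC.
    print(3.4 / 1000)    # 0.0034 defects per KLOC
    # Pfleeger et al study: 481 faults over 1.7 million LOC (1700 KLOC).
    print(481 / 1700)    # ~0.28 faults per KLOC, i.e. around 0.3
    # Smallest sub-system: 80 faults in 4 KLOC.
    print(80 / 4)        # 20 faults per KLOC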

Measuring size and complexity

In all of the key examples of software measurement seen so far, the notion of software 'size' has been a critical indirect factor. It is used as the normalising factor in the common measures of software quality (defect density) and programmer productivity. Product size is also the key parameter in models of software effort. It is not surprising, therefore, that the history of software metrics has been greatly influenced by the quest for good measures of size. The most common measure of size happens to be the simplest: Lines of Code (LOC). Other similar measures are the number of statements, the number of executable statements, and delivered source instructions (DSI). In addition to the problems with these measures already discussed, they all have the obvious drawback of being defined only on code; they offer no help in measuring the size of, say, a specification. Another critical problem (and the one which destroys the credibility of both the defect density metric and the productivity metric) is that they characterise only one specific view of size, namely length. Consequently there have been extensive efforts to characterise other internal product size attributes, notably complexity and functionality. In the next section we shall see how the history of software metrics has been massively influenced by this search.
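To illustrate why counting rules alone can produce such divergent 'sizes', the following sketch contrasts two plausible LOC conventions applied to the same source text; neither is a standard, and real counting tools differ further (for example over continuation lines and declarations):

    # One possible LOC counter: optionally skip blank lines and full-line
    # comments. Different conventions yield different 'sizes' for the same file.
    def count_loc(source: str, ignore_blank: bool = True, ignore_comments: bool = True) -> int:
        count = 0
        for line in source.splitlines():
            stripped = line.strip()
            if ignore_blank and not stripped:
                continue
            if ignore_comments and stripped.startswith("#"):
                continue
            count += 1
        return count

    example = "x = 1\n\n# a comment\ny = x + 1\n"
    print(count_loc(example))                # 2: non-blank, non-comment lines
    print(count_loc(example, False, False))  # 4: every physical line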

Next section - Key Metrics

