Successes and failures

First, the good news about software metrics and its related activities:

  • It has been a phenomenal success if we judge it by the number of books, journal articles, academic research projects, and dedicated conferences. The number of published metrics papers grows at an exponential rate (CSR maintains a database of such papers which currently numbers over 1600). There are now at least 40 books on software metrics, which is as many as there are on the general topic of software engineering. Most computing and software engineering courses world-wide now include some compulsory material on software metrics. In other words, the subject area has been accepted as part of the mainstream of software engineering.
  • There has been a significant increase in the level of software metrics activity actually taking place in industry. Most major IT companies, and even many minor ones, now have software metrics ‘programs’ in place, even if these are not widely recognised as such. This was not happening 10 years ago.

Now for the bad news:

  • There is little overlap in content between the increased metrics activity in academia and that in industry. As with many other core subjects within the software engineering mainstream (such as formal development methods, object-oriented design, and even structured analysis), the industrial take-up of most academic software metrics work has been woeful. Much of the increased metrics activity in industry is based almost entirely on metrics that were around in the early 1970s.
  • Much academic metrics research is inherently irrelevant to industrial needs. The irrelevance operates at two levels:
    1. irrelevance in scope: much academic work has focused on metrics which can only ever be applied or computed for small programs, whereas all the reasonable objectives for applying metrics are relevant primarily for large systems. Irrelevance in scope also applies to the many academic models which rely on parameters that could never be measured in practice.
    2. irrelevance in content: whereas the pressing industrial need is for metrics that are relevant for process improvement, much academic work has concentrated on detailed code metrics. In many cases these aim to measure properties that are of little practical interest. This kind of irrelevance prompted Glass to comment:

“What theory is doing with respect to measurement of software work and what practice is doing are on two different planes, planes that are shifting in different directions” [Glass 1994]

  • Much industrial metrics activity is poorly motivated. For all the warm feelings associated with the objectives of software metrics (improved assessment and prediction, quality control and assurance, etc.), it is not at all obvious that metrics are inevitably a ‘good thing’. The decision by a company to put some kind of metrics program in place is inevitably a ‘grudge purchase’: something done when things are bad or to satisfy some external assessment body. For example, in the US the single biggest trigger for industrial metrics activity has been the CMM [Humphreys 1989]; evidence of the use of metrics is intrinsic to achieving the higher CMM levels. Just as there is little empirical evidence about the effectiveness of specific software development methods [Fenton et al 1994], so equally little is known about the effectiveness of software metrics. Convincing success stories describing the long-term payback of metrics are almost non-existent. What we do know is that metrics will always be an overhead on current software projects (typically 4-8% in our experience [Hall and Fenton 1997]). When deadlines are tight and slipping, it is inevitable that the metrics activity will be one of the first things to suffer, or be stopped completely. Metrics are effective primarily as a management tool, yet good metrics data can be collected only with the commitment of the technical staff involved in development and testing; there are no easy ways to motivate such staff in this respect.
  • Much industrial metrics activity is poorly executed. We have come across numerous examples of industrial practice which ignore well-known guidelines on best-practice data collection and analysis, and which apply metrics in ways that were known to be invalid twenty years ago. For example, it is still common for companies to collect defect data which does not distinguish between defects discovered in operation and defects discovered during development [Fenton and Neil 1998]; a sketch of a defect record that does make this distinction is given after this list.
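
To illustrate the point about defect data, the following minimal sketch (in Python, with entirely hypothetical field names) shows the kind of record that keeps defects found during development separate from those found in operation; without a discovery-phase field of this kind, the two populations cannot be separated later.

    # A minimal sketch (hypothetical field names) of a defect record that
    # records where each defect was discovered, so that development and
    # operational defect counts are never conflated.
    from dataclasses import dataclass
    from enum import Enum

    class DiscoveryPhase(Enum):
        DESIGN_REVIEW = "design review"
        UNIT_TEST = "unit test"
        SYSTEM_TEST = "system test"
        OPERATION = "operation"        # found by users after release

    @dataclass
    class DefectRecord:
        defect_id: str
        module: str
        discovered_in: DiscoveryPhase
        severity: int                  # e.g. 1 (critical) .. 4 (cosmetic)

    def operational_defects(defects):
        """Count only the defects that escaped into operation."""
        return sum(1 for d in defects
                   if d.discovered_in is DiscoveryPhase.OPERATION)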

The raison d’être of software metrics is to improve the way we monitor, control or predict various attributes of software and the commercial software production process. A necessary criterion for judging the success of software metrics is therefore the extent to which they are routinely used across a wide cross-section of the software production industry. Using this criterion, and based solely on our own consulting experience and the published literature, the only metrics that qualify as candidates for success are:

  • The enduring LOC metric: despite the many convincing arguments about why this is a very poor ‘size’ measure [Fenton and Pfleeger 1996], it is still routinely used in its original applications: as a normalising measure for software quality (defects per KLOC); as a means of assessing productivity (LOC per programmer month); and as a means of providing crude cost/effort estimates (a simple sketch of these calculations is given after this list). Like it or not, LOC has been a ‘success’ because it is easy to compute and can be ‘visualised’, in the sense that we understand what a 100 LOC program looks like.
  • Metrics relating to defect counts: (see [Gilb and Graham 1994] and [Fenton and Pfleeger 1996] for extensive examples). Almost all industrial metrics programs incorporate some attempt to collect data on software defects discovered during development and testing. In many cases, such programs are not recognised explicitly as ‘metrics programs’ since they may be regarded simply as applying good engineering practice and configuration management.
  • McCabe’s cyclomatic number [McCabe 1976]: for all the many criticisms it has attracted, this remains exceptionally popular. It is easily computed by static analysis (unlike most of the more discriminating complexity metrics) and it is widely used for quality control purposes. For example, [Grady 1992] reports that, at Hewlett Packard, any module with a cyclomatic complexity higher than 16 must be re-designed.
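
As a simple illustration of the calculations behind these mainstream uses, the following sketch (in Python, with made-up figures) computes defect density normalised by KLOC and McCabe’s number v(G) = e - n + 2p for a module’s control-flow graph, checked against a quality-control threshold such as the 16 reported for Hewlett Packard.

    # A minimal sketch (hypothetical figures, not a production tool) of the
    # routine calculations described above.

    def defects_per_kloc(defects_found, loc):
        """Crude normalised quality measure: defects per thousand lines of code."""
        return defects_found / (loc / 1000.0)

    def cyclomatic_complexity(edges, nodes, components=1):
        """McCabe's number for a control-flow graph: v(G) = e - n + 2p."""
        return edges - nodes + 2 * components

    # Example: a module of 2,400 LOC with 7 known defects, whose flowgraph
    # has 21 edges and 15 nodes. With a threshold of 16 (the Hewlett Packard
    # rule cited above) the module would not be flagged for re-design.
    print(defects_per_kloc(7, 2400))           # ~2.9 defects per KLOC
    print(cyclomatic_complexity(21, 15) > 16)  # False: v(G) = 8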

To a lesser extent, we should also include:

  • Function Points [Albrecht 1979, Symons 1991]: function point analysis is very hard to do properly [Jeffery and Stathis 1996] and is unnecessarily complex if used for resource estimation [Kitchenham 1995]. Yet, despite these drawbacks, function points appear to have been widely accepted as a standard sizing measure, especially within the financial IT sector [Onvlee 1995]. A sketch of the basic, unadjusted count is given below.
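
As an illustration of the basic arithmetic involved, the sketch below (in Python, assuming the commonly quoted Albrecht/IFPUG component weights) shows the unadjusted weighted-sum count; it deliberately omits the detailed counting rules and the value adjustment factor that make full function point analysis so hard to do properly.

    # A minimal sketch of an unadjusted function point count, assuming the
    # commonly quoted Albrecht/IFPUG weights; real function point analysis
    # also involves detailed counting rules and an adjustment factor.

    FP_WEIGHTS = {
        "external_inputs":          {"low": 3, "average": 4, "high": 6},
        "external_outputs":         {"low": 4, "average": 5, "high": 7},
        "external_inquiries":       {"low": 3, "average": 4, "high": 6},
        "internal_logical_files":   {"low": 7, "average": 10, "high": 15},
        "external_interface_files": {"low": 5, "average": 7, "high": 10},
    }

    def unadjusted_function_points(counts):
        """Sum of (number of components x weight) over all component types.

        `counts` maps component type -> {complexity: number of components}.
        """
        return sum(n * FP_WEIGHTS[component][complexity]
                   for component, per_complexity in counts.items()
                   for complexity, n in per_complexity.items())

    # Example: 4 simple and 2 average inputs, 3 average outputs, 2 simple files.
    print(unadjusted_function_points({
        "external_inputs": {"low": 4, "average": 2},   # 4*3 + 2*4 = 20
        "external_outputs": {"average": 3},            # 3*5 = 15
        "internal_logical_files": {"low": 2},          # 2*7 = 14
    }))                                                # total = 49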

This is not an impressive list, and in Section 4 we will explain why the crude way in which these kinds of metrics have been used has lowered the credibility of metrics as a whole. However, as we will show in Section 5, it is the models and applications, rather than the metrics themselves, which have been fundamentally flawed.

