A "Grand Challenge" suggestion

Professor Les Hatton

University of Kent Computing Laboratory, L.Hatton@ukc.ac.uk

Proposed Grand Challenge

To understand the nature of injected defect in software systems, and to lay down an acceptable measurement basis for comparing different technologies and their effectiveness at reducing defect, based around the notion of control process feedback.

Overview

Software Engineering is gripped by unconstrained creativity. In the physical sciences, the laws of physics intervene in some of our madder moments, but in software engineering there is no such barrier, leading naturally to 50 years of erratic progress with an accelerating frequency of change. Ideas, languages and concepts come and go, often within a couple of years, without any kind of measurement support whatsoever, so we are unable to tell whether we are making progress or not, or indeed what to teach. Instead of the ruthless elimination of things which do not work, we have its diametrically opposed evil, backwards compatibility. We don't even have an agreed international definition of the concept of 'error'. The outcome is predictable. Software reliability is generally much lower than any reasonable person would hope or expect, the most basic of testing is frequently absent, nobody really knows which technologies are effective, and society wastes untold billions on projects which never even appear, let alone fulfil their promise; money which could be very well spent elsewhere.

For the last 20 years I have been slowly gathering software reliability data, and I am frankly tired of watching systems fall over for broadly the same reasons they fell over when I first started studying them. There are already some tantalisingly clear patterns of how to make progress using process control feedback, as many failures could have been avoided using technologies we already know how to apply, but the patterns are frequently surprising and often sit uncomfortably with conventional understanding. Much more research is necessary to understand why, but many of the questions are, I feel, both resolvable and essential to resolve. Perhaps I can flesh this out a little.

  1. Almost independently of programming language, injected defect rates tend to rise very slowly for components in mature systems as size increases (close to logarithmically) until, at a surprisingly large size, they start to rise much faster (approximately quadratically). Why do they have this pattern? Why is the quadratic increase delayed far longer than we would expect? Why is the effect seen in systems written in anything from assembler to Ada, as well as in systems as disparate as operating systems and telecommunication systems? There are a number of competing theories but as yet no experiments which can reject any of them. This result undermines belief in the structuring principle and needs to be resolved. (A toy model of the pattern is sketched after this list.)

  2. Scientific modelling software exhibits entropy in a most interesting way, with a continual and alarming decay of significance as more software is used. The net effect is that our simulation results, instead of having the 5 or 6 significant figures we hoped for, may have only 1-2. The vast proportion of this decay appears to be defect-related rather than algorithmic, yet the results of these simulations can affect people's lives dramatically. How can we detect it? How can we avoid or control it? (A minimal detection sketch follows this list.)

  3. One of the most influential ideas of recent years, the concept of objects and object interactions, appears to have some interesting problems. For example, one of its cornerstones, inheritance, appears to be a defect attractor. Why? The very ubiquity of this idea demands an answer.

  4. Many defects in software-related systems have astonishingly long lifetimes. Is this simply because they outlive testing lifetimes, or is there a deeper underlying mechanism? The importance of resolving this lies in the fact that most modern embedded control systems run up execution time very quickly in their users' hands owing to the large scale of their distribution (for example, a population of 100,000 deployed devices accumulates over 270 device-years of execution per day), so the defects are actually seen relatively quickly in the lifetime of the software but strongly resist efforts to remove them during development.

  5. Why is software so anisotropic? Change in one direction can be formidably defect-prone, whereas change in another can be trivial by comparison. How do we measure the axes of anisotropy? How do we build isotropic systems? Indeed, can we build isotropic systems?

  6. Many techniques appear to attack only certain types of defect. For example, formalism appears to be very effective on design defects, but in a system swamped by implementation defects its effects can be almost invisible. How do we categorise defect types and therefore target the relevant techniques? We still have no effective standard taxonomy.

  7. Defect injection is much less sensitive to programming language than we would expect. Is this because engineer-to-engineer variation greatly exceeds technology-to-technology variation, thus hiding the expected differences, or is there some deeper principle responsible? Decades of research into programming languages have so far failed to answer this, principally because very little objective comparison ever takes place.
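
To make question 1 concrete, the following is a minimal sketch, in Python, of the kind of piecewise empirical model the observations suggest: defect counts grow roughly logarithmically with component size up to some threshold and approximately quadratically beyond it. The constants and the threshold here are invented purely for illustration and are not fitted to any real data.

    import math

    A_LOG = 2.0    # scale of the logarithmic regime (assumed for illustration)
    S_STAR = 500   # component size at which growth changes character (assumed)
    B_QUAD = 1e-4  # scale of the quadratic regime (assumed)

    def expected_defects(size):
        """Illustrative expected injected defects for a component of 'size' lines."""
        if size <= S_STAR:
            return A_LOG * math.log(1 + size)
        # Continue continuously from the logarithmic regime, adding a
        # quadratic term in the excess size beyond the threshold.
        return A_LOG * math.log(1 + S_STAR) + B_QUAD * (size - S_STAR) ** 2

    for size in (50, 200, 500, 1000, 2000):
        print(size, round(expected_defects(size), 2))

An experiment designed to reject any of the competing theories would, at a minimum, need to estimate where such a threshold lies and how sharp the transition is.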
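
Question 2 can at least be detected by comparing independently written implementations of the same model on the same inputs: the number of significant figures on which two nominally identical results agree is approximately the negative base-10 logarithm of their relative disagreement. A minimal sketch, with invented sample values:

    import math

    def agreed_significant_figures(a, b):
        """Approximate number of significant decimal figures on which two
        nominally identical results agree."""
        scale = max(abs(a), abs(b))
        if a == b or scale == 0.0:
            return float("inf")  # perfect agreement
        return -math.log10(abs(a - b) / scale)

    # Invented outputs of two independent implementations of the same model:
    # they agree to only about 2.3 significant figures.
    print(agreed_significant_figures(101.37, 101.88))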

Resolving the above questions (and others like them) depends on the ability to measure, calibrate and therefore distinguish different techniques objectively by their ability to reduce defect. There are, of course, many grand theoretical challenges, but in a society increasingly dependent on the fruits of our research, reliable technologies are of paramount importance as we pursue greater and greater complexity.

Measurement and engineering progress are mutually iterative: improvements in one are reflected in improvements in the other. One of the main barriers to laying down an appropriate measurement basis, and therefore to resolving some of these questions, is that control process feedback is only really effective when applied to processes with a significantly longer time-scale than those we artificially induce in software engineering whilst pursuing our creative ends. Nevertheless, amidst all the technological swirl of modern computing science, a number of case studies have demonstrated that, with relatively slowly varying technologies, simple process and product defect data, and unsophisticated but careful root cause analysis, astonishingly satisfying and reliable software is possible relatively independently of the technology used. Alternatively, we need to find a way of applying such feedback to a rapidly moving process, but that too is a question for research.
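
As an indication of how unsophisticated such feedback can afford to be, here is a sketch (the categories and data are invented) of its simplest useful form: log each defect with a root-cause category and let the tallies direct where process improvement effort goes next.

    from collections import Counter

    # Hypothetical defect log entries: (defect id, root-cause category).
    DEFECT_LOG = [
        ("D-101", "requirements"),
        ("D-102", "implementation"),
        ("D-103", "implementation"),
        ("D-104", "interface"),
        ("D-105", "implementation"),
    ]

    def dominant_root_causes(log, top=3):
        """Tally defects by root cause; the largest tallies show where
        corrective process feedback should be applied first."""
        return Counter(cause for _, cause in log).most_common(top)

    print(dominant_root_causes(DEFECT_LOG))
    # -> [('implementation', 3), ('requirements', 1), ('interface', 1)]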

How does this meet the defined criteria for a 'Grand Challenge'?

It arises from scientific curiosity about the foundation, nature or limits of a scientific discipline.

Measurement lies at the heart of any scientific discipline as it is conventionally practised.

It gives scope for engineering ambition to build something that has never been built before.

Understanding the nature of injected defect is the key to building reliable systems. Such systems have been built before, but the reason for their reliability compared with less successful efforts is unclear.

It has enthusiastic support from (almost) the entire scientific community.

On this criterion it fails. Compared with the role they play in conventional science, measurement and the study of defect proneness are not widely pursued.

It has international scope: participation would increase the research profile of a nation

Software failure is endemic. Any systematic attempt to understand it better with the goal of reducing it would automatically be of international scope.

It is generally comprehensible and captures the imagination of the public

It succeeds admirably here. The general public's view of software is of something that fails a lot.

It was formulated long ago and still stands

There is nothing new about software failure. The need to understand it better is perhaps far more urgent now than in the past.

It goes beyond what is initially possible, and requires development of techniques and tools unknown at the start of the project

It probably fails this criterion. Although the underlying causes are unknown, many failures could have been avoided using techniques we already know how to apply. We simply did not apply them for one reason or another.

It calls for planned cooperation between identified research teams

There are groups around the world who have studied these problems, and cooperation through shared data and methodologies would be crucial, as experiments tend to be expensive.

It encourages and benefits from competition among identified teams ...

Cooperation and competition are not obvious travelling companions. Cooperation is of far more importance given the nature of this problem.

It decomposes into intermediate research goals whose achievement brings scientific and economic benefit even if the project as a whole fails

Any of the basic questions, if resolved, would bring benefit in how to produce more reliable systems. This is fortunate, as most ambitious projects fail as a whole.

It will be obvious how far and when the challenge will be met

As a research project concerned with measurement, it should by definition be clearer what progress is being made and how far the challenge can be met.

It will lead to a radical paradigm shift

Control process feedback would be a radical paradigm shift for most software engineering practice. It is rarely practised at present.

It is not likely to be met by commercially motivated evolutionary advance

This is a very difficult question to answer given the scale of commercial research in some companies. Many of the most significant advances in computing have come from commercially motivated research which was not at all evolutionary.

Summary

This challenge is quintessentially an engineering one, but it raises many questions for which there are simply no answers in the present state of knowledge, and yet those answers are probably amenable to the right kind of coordinated attack. Unfortunately, this Grand Challenge involves a fundamental change in thinking: just as in the best software, each idea must be accompanied inseparably by the means to test and calibrate it. At this we have been less than satisfactory over the past 30 or 40 years.

We probably don't need any more ideas; we just need to spend a little time improving or rejecting the ones we have got, to lay down a long overdue but ultimately more satisfying measurement-derived basis for our discipline. That is my Grand Challenge.

$Revision: 1.3 $, $Date: 2002/10/21 12:51:02 $.