Michael J. Harrison
I start by examining a fragment of code that is known to contain an error. I explain the error and show a simple change that will remove it. I then make a series of improvements to the code to make it easier to read and maintain. Finally I add a new feature. It's a trivially simple piece of code, much too small to really show the advantages of modularity and refactoring, but I believe it is big enough to illustrate the principles. Because the original fragment was written in C and uses native types, my changes and additions do also, and I have kept the original's K&R style because I feel it is, as Tony Hoare said about Algol-60, such an improvement on all its successors. For clarity, I have omitted the #include statements that would be required.
The error results from the coder's failure to see and appreciate the implication of C syntax: the break statement is in the scope of the switch statement, not that of the for statement, and consequently causes exit from the switch but not from the for. One is misled by the two continue statements, which are indeed in the scope of the for loop. So the for statement in fact has no termination condition, and thus after a potentially long series of character accesses an invalid address is generated.Listing One
/* Version 1: Original code from Gimpel ad. * Gimpel Bug #527, DDJ December 1999, p. 96 */ int count_vowels(char *s) { int sum = 0; for (;;) switch (*s++) { case 'a': case 'e': case 'i': case 'o': case 'u': sum++; continue; default: continue; case '\0': break; } return sum; }
Although Java's for and switch statements work in exactly the same way as C's and expose the programmer to the same error, a Java programmer almost certainly would not have made it. That is because Java strings use a length rather than a terminating character, and a Java programmer would automatically write the loop in the Java idiom [1]:
So, by using the termination condition directly in the loop, the bug can be fixed as shown in Listing Two.for (int i = 0; i < s.length(); i++)
Listing Two
/* Version 2: Use termination condition directly in loop definition. */ int count_vowels(char *s) { int sum = 0; for (; *s != '\0'; s++) switch (*s) { case 'a': case 'e': case 'i': case 'o': case 'u': sum++; continue; default: continue; } return sum; }
Listing Three
/* Version 3: With no real alternatives, switch statement is poor fit. Use if instead. */ int count_vowels(char *s) { int sum = 0; for (; *s != '\0'; s++) if (*s == 'a' | *s == 'e' | *s == 'i' | *s == 'o' | *s == 'u') sum++; return sum; }
Listing Four
/* Version 4: The if statement is ugly. Use a cleaner and faster test. */ int count_vowels(char *s) { static char vowels[] = "aeiuo"; int sum = 0; for (; *s != '\0'; s++) if (strchr(vowels, *s) != 0) sum++; return sum; }
Listing Five
/* Version 5: Deal with upper case as well. Note how easy this change is, * now that the logic is contained in a static declaration. */ int count_vowels(char *s) { static char vowels[] = "aeiouAEIOU"; int sum = 0; for (; *s != '\0'; s++) if (strchr(vowels, *s)) /* Implicit pointer comparison */ sum++; return sum; }
Although I will not demonstrate it here, the code in Listing Six is ready for further generalization, because any is_x function with the same signature as is_vowel could be passed as an argument to the string scanning function and so all sorts of different tests could be realized by writing only additional code (i.e., not changing existing code). In C++ or Java, instead of passing a function address (impossible in Java) one would wrap each function in its own class declaring is_x as a method and pass as the argument an object of the appropriate class (the functor idiom), and in C# one could use a delegate, but the principle is exactly the same.
Now that it is a separate function, is_vowel invites a more efficient implementation, for example, one based on a 256-byte array of character properties in the manner of the isalpha function and its companions [4]. This would be worth doing if profiling revealed that this function was an execution bottleneck.
The function is_vowel is a candidate for an independent unit test. Indeed, since there are only 256 possible values of c [5], this is an example of that rare case, a piece of code that can be exhaustively tested. We should take advantage of this and write and execute such a test in this instance; this would be particularly worthwhile if we planned to improve the implementation as suggested in the previous paragraph. I'll return to this point a little later.
Listing Six
/* Version 6: Refactor, to minimize the scope of the definition of "vowel". */ int is_vowel(char c) { static char vowels[] = "aeiouAEIOU"; return strchr(vowels, c) != 0; } int count_vowels(char *s) { int sum = 0; for (; *s != '\0'; s++) if (is_vowel(*s)) sum++; return sum; }
Listing Seven
/* Version 7: Implement a new feature - count vowel pairs. */ int is_vowelpair(const char *p) { return is_vowel(*p) && is_vowel(*(p+1)); } int count_vowelpairs(char *s) { int sum = 0; for (; *s != '\0'; s++) if (is_vowelpair(s)) sum++; return sum; }
That was easy — a little copy and paste, a minor edit, a one-line function, and we're done. Note how, because of the previous refactoring, we were able to use the is_vowel function to implement is_vowelpair. Note also that if in the original implementation of is_vowel we had passed a character pointer instead of a character, we could have implemented is_vowelpair as an is_x function and used the function pointer protocol mentioned previously to implement the new feature without having to implement count_vowelpairs as a separate new function. But we were not sufficiently farsighted, and the decision to pass a character was a wise one, based on the principle that it is safer not to pass a pointer if it can be avoided.
But are we really done? No, we have been a little cavalier in our approach, and the program contains an error, one that might well evade testing [7].
On hearing that, a programmer might first be drawn to the is_vowelpair function because it dereferences (p+1). Is that always a valid address? Yes, in fact it is, because the test in the for statement guarantees that p always points to a character in the string, so even when p is pointing to the last character in the string (p+1) is pointing at the null character terminating the string and so is still a valid address. We were lucky when we made the change, because we had not explicitly considered this. There is no way in C to protect is_vowelpair from being invoked with a pointer p such that (p+1) is an invalid address; we resolve to ensure that the specification for is_vowelpair contains a prominent notice of this precondition.
So what is the bug? It is this: if the string ends in a vowel x, the character pair "x\0" will be incorrectly identified as a vowel pair. The groundwork for this bug was laid in the improvement from Version 3 to Version 4; it turns out that the if test and the strchr test are not exactly equivalent. The function strchr considers the terminating null character to be part of the string for the purposes of the search, so searching for the null character will always succeed and return a non zero pointer. This is documented in the specification for strchr, but is not well known; it is not mentioned in the MSDN Library documentation for the Microsoft Foundation Classes method CString::Find (which itself calls strchr). Yet it is on such minutiæ that the correct execution of a program may hinge.
How did you answer the question about unit testing in the previous section? If you thought then it would not be worth the trouble, would you like to reconsider that decision now? If you would have chosen to perform unit testing, would you have tried the case when c is zero?
How should we fix this bug?
One way would be to replace the body of is_vowelpair with the statement:
return is_vowel(*p) && *(p+1) != '\0' && is_vowel(*(p+1));This will fix the bug and in one sense is quite correct, because it simply compensates for the slightly awkward behavior of strchr in our context. Another interpretation might be that the error lies in is_vowel, which classifies the null character as a vowel, and that this function should be changed to correct the error (I hear Bertrand Meyer in the background chiding us for an insufficiently precisely defined precondition for is_vowel) [8]. I maintain, however, that either of these changes fixes the symptom rather than the disease. Our original suspicion of (p+1) was justified; I contend that the proper precondition for is_vowelpair is that both p and (p+1) point to actual characters within the string, not to the terminator. Therefore a correct fix for the bug is to leave is_vowelpair as it is and replace the for statement in count_vowelpairs with this:
for (; *s != '\0' && *(s+1) != '\0'; s++)It is necessary to test both characters because testing only *(s+1) will fail if the string is null (for then *(s+1) would reference an invalid address). This is a bit clumsy, and we might want to consider testing for a null string before entering the loop or using a different termination condition, perhaps one based directly on the length of the string. We might also want to consider weakening the precondition for is_vowel by making it classify the null character as not a vowel. As the saying goes, these things are left as exercises for the reader.
There are four possibilities for the pointer passed in as argument:
Our contract imposes the precondition that the pointer must be valid as specified in the first case above. There is no reasonable way to detect the second case, and for the third case it is usual to delegate detection (and handling) to the exception mechanism of the operating system [10].
The fourth case is the interesting one. Null is a valid value for a pointer, but it is (by definition) an invalid address that will generally cause an addressing exception if dereferenced. It is, however, trivial for the function to detect that a null pointer argument has been passed. Therefore, we can propose a general rule that makes life easier for callers of a function: if a pointer to a null object is a valid argument, a null pointer should also be a valid argument with the same meaning. In the specific examples we are studying here, we should change the contract by weakening the precondition to allow the first and fourth cases, and implement the function to immediately return the value 0 if a null pointer is passed as the argument. The purpose of this rule is to relieve callers of the need to make the test for a null pointer; for a language like C the gain is not obvious, but for functional style languages like Lisp the gain is significant.
Programming by contract places responsibility squarely on the shoulders of the caller for ensuring that specified preconditions for a function are met, and relieves the function of any responsibility for verifying that preconditions are met. As intended, this has the desirable effect of eliminating redundant validity checking. In real life, however, it also has an undesirable side effect: if the caller makes a mistake and does not in fact meet the preconditions, the function is likely to fail but the cause of the failure may be difficult to track down. A pointer that is null or contains an invalid address is usually easy to diagnose, since an exception will be caused immediately an attempt is made to deference it, but other errors may cause failures far removed from the root cause. Therefore it is good practice for a function to perform reasonable validity checks for its preconditions, and in the C language the preferred technique is to use assertions. "Reasonable" is, of course, a slippery word; it would be reasonable, for example, to test whether an integer lies within a specified range, but not to test whether a binary search tree was indeed valid.
It is possible to get away with programming in poor style because, unlike with physical artifacts, the cost of destructive testing of a program is very small. Therefore it is economically feasible to build a program by an iterative process of construction and testing ("cut and try") in a way that would be out of the question for, say, an automobile or an aircraft. Just imagine what software development would be like if any processor exception caused the CPU to be destroyed.
Unfortunately, it is quite infeasible to perform an exhaustive test of even a small program, and so any program developed by cut and try is bound to contain errors, even serious ones. It's perfectly reasonable to argue that because destructive testing of software is very cheap the development process should take advantage of this; but the current sad state of software development is due in large part to driving this argument way past its logical conclusion to an irrational extreme. Testing is necessary and even reasonably effective, but we also have to go about developing code in such a way that for the most part we do not introduce errors in the first place [12].
When we are talking about component-based development and modularity, what we are really talking about is abstraction. What I see in the switch statement in the original piece of code is a total lack of abstraction. To me, the test of whether a given character is a vowel is a simple test of set membership, but I see no reflection of that insight in the original code, and I am forced to conclude that the programmer did not conceive of it in that way at all. Yet it's difficult to see how anyone can make effective use of component-based development without thinking in terms of programming abstractions.
Undoubtedly, part of the problem is the programming languages that are used. C is a fine language in its way, but its direct support for abstraction is limited to functions, macros and arrays. Pascal provides sets (though they are limited in cardinality), and it would be entirely natural for a Pascal programmer to write is_vowel as "c in ['a', 'e', 'i', 'o', 'u']"; thus a Pascal programmer is more used to thinking in terms of set operations, which are inherently more abstract than tests and assignments. My use of strchr is, of course, nothing but a C implementation of a set membership test, but how many C programmers would see it that way?
It seems to me that we must encourage and help developers to think in a more abstract way about the programming task. I suspect, however, that this would be an uphill battle. Languages such as APL, Lisp, or Prolog that encourage or require a more abstract view of programming have never been widely popular and are usually regarded as hard to learn. On the other hand, Python, which nicely combines aspects of procedural, object-oriented and functional languages, has proved quite popular, so there are some grounds for optimism.
In my series of transformations on the original program, I was guided, as much as anything, by an essentially aesthetic sense, most clearly in my feeling that there was something wrong about the nonterminating for loop, and my description of the if statement as ugly. There is a classic quote from the theoretical physicist Paul Dirac: "It seems that if one is working from the point of view of getting beauty in one's equations, and if one has really a sound insight, one is on a sure line of progress." Programs are very rarely beautiful in the Dirac sense [13], and we are usually limited to trying to get the ugliness out, rather than the beauty in, but the essence of Dirac's point still applies. We need to instill in developers an intuitive sense of what makes a program beautiful or ugly.
The final lesson I draw is how dangerous it is to make changes to code. It is so easy unwittingly to violate essential preconditions in the existing code, which are usually only implicit and often not completely known even by the original programmers. Furthermore, such violations are often not revealed by even quite thorough testing. The only other known countermeasures, which are not wholly effective, are extreme care on the part of the developers making changes, and thorough review, preferably by people intimately familiar with the existing code.
for (int i = 0, imax = s.length(); i < imax; i++)
© Copyright 2002-2024 Michael J. Harrison. All rights reserved.