Science is what we have learned about how to keep from fooling ourselves
Richard Feynman
The last post examined some of the psychological and statistical effects that can adversely affect the reliability of anecdotal evidence. It is important to keep these effects in mind when evaluating the importance of our colleagues' stories of success with particular software development techniques. But this doesn't mean that claims based upon empirical evidence are necessarily trustworthy. The experimental process is also susceptible to these forces, where they manifest as methodological errors and biased analysis of results. This post will illustrate some of these effects in action, using a well-known empirical study of pair programming as an example.
In the Fall of 1999, Laurie Williams conducted a pair programming experiment at the University of Utah, using students in the Senior Software Engineering course [1]. Here is a brief outline of her experiment:
The overall results were:
Ever since then, the phrase "a 15% increase in development cost for a 15% improvement in quality" has been cited many times, without qualification, as the value proposition for pair programming.
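For concreteness, here is the arithmetic that the "15% increase in development cost" figure is usually taken to summarize. The numbers below are hypothetical and chosen purely for illustration - they are not taken from Williams' data - but they show how a pair finishing in a bit more than half the elapsed time of an individual translates into roughly 15% more programmer-hours.

```python
# Hypothetical numbers, for illustration only - not measurements from the study.
individual_hours = 10.0                          # elapsed time for one programmer working alone
pair_elapsed_hours = 0.575 * individual_hours    # suppose the pair finishes in ~57.5% of that time
pair_effort_hours = 2 * pair_elapsed_hours       # both programmers are occupied for the whole period

extra_cost = pair_effort_hours / individual_hours - 1
print(f"Extra development cost: {extra_cost:.0%}")   # prints: Extra development cost: 15%
```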
In a subsequent presentation [3], Williams claimed that the only experimental variable at work was pair programming. However, a closer examination of the experimental method reveals that this was far from the case. There were numerous uncontrolled factors that could account, in whole or in part, for the results achieved. Many of these factors are mentioned only in passing in Williams' paper, and some are entirely unacknowledged.
The 28 students in the experimental group were not chosen randomly - they were picked from amongst the 35 who initially indicated a preference for working collaboratively. This means that 7 students who indicated that preference were rejected (on an unspecified basis) and put in the control group. This creates a population bias in several ways.
The mechanisms by which the population was selected may well bias them towards favorable responses to survey questions, for a few reasons:
As Robert Cialdini notes [8]:
Once we have made a choice or taken a stand, we will encounter personal and interpersonal pressures to behave consistently with that commitment. Those pressures will cause us to respond in ways that justify our earlier decision.
All these sources of population bias are given short shrift at the tail end of Williams' ACM paper [1]:
It must also be noted that the majority of those involved in the study and of those that agreed to do complete [sic] the survey are self-selected pair-programmers. Further study is needed to examine the eventual satisfaction of programmers who are forced to pair-program despite their adamant resistance.
The tendency towards better short-term performance when one knows one is under scrutiny is called the Hawthorne Effect. This experiment provides the typical circumstances under which the effect manifests. Williams acknowledges the observational pressure the students were under:
The students were aware of the importance of the experiment, the need to keep accurate information, and that each person (whether in the control or experimental group) was a very important part of the outcome.
The students could not be unaware of the vested interest that Williams has in achieving a pro-collaboration outcome. It is well known that collaborative software development is at the heart of her academic and professional identity. She is also a known champion of collaborative development in an educational context.
Consider also that the students timed their own work efforts and submitted them using a web-based tool. One can only presume that these submissions were not anonymous; otherwise, the results could not have been used to grade the students academically. Consequently, the students must have been aware that their performance was traceable back to them.
The experimental group is therefore particularly motivated (consciously or otherwise) to perform well so as to create an outcome that Williams considers favorable, knowing that they can be associated with this outcome. The control group is oppositely motivated (consciously or otherwise) knowing that high performance on their part could oppose the preferred hypothesis of an authority figure.
To appreciate the students' position, imagine your boss has instituted a new quality program in your workplace, of which he is an enthusiastic proponent. To evaluate its efficacy, he distributes questionnaires asking "Did this program work?" The questionnaires are not anonymous, and you know he can see the responses. Do you think he'll get an accurate indication of people's perceptions?
Williams claims to have mitigated against the Hawthorne Effect, but misses the point entirely:
Since both groups will receive the same information about the study and in all lecture materials, the Hawthorne effect should not pose a threat to this study.
A significant methodological flaw in the experiment is that it was not independently run. Williams notes:
All students attended the same classes, received the same instruction, and participated in class discussions on the pros and cons of pair programming.
The paper does not say who delivered this information and moderated these discussions, but if it was Williams herself then she would likely have communicated her pro-collaboration bias to her students well and truly by the time the experiment was over, thereby creating a sense of positive expectation in the experimental group and negative expectation in the control group. Even if it was not Williams, her history of research in collaborative software development education constitutes a contextual bias towards pro-collaboration outcomes.
The experimental group may well have been unintentionally primed to succumb to the placebo effect. Having been given an account of pair programming with a positive bias, they were predisposed to bring their own experience into alignment with it.
Note that technical problems prevented the accurate recording of completion times for Program 4, so it has been excluded from the result set.
In analyzing these results, Williams chooses to dismiss the data for Program 1 as atypical. She claims that the pairs were "jelling" over this time and adjusting to an unfamiliar work mode. She attempts to justify this by referring to Nosek's observation of a similar pattern, but there is no mention of such an effect in the Nosek paper [4]. In fact, Nosek's pairs performed immediately, and on a task of only 45 minutes' duration. Furthermore, she describes an earlier trial run of her CSP process, conducted in the Summer of 1999, in which all students in the class programmed collaboratively. She boasts:
... each of the ten collaborative groups turned in eight projects and all 80 were on time. Additionally, all projects were of very high quality. The average grade on all 80 assignments was 98%.
So why did this group not experience a markedly lower performance during their "jelling" period as well?
She also makes references to the rapidity with which teams jell that, if relevant, would seem to contradict the decision to dismiss the entirety of Program 1 as an adjustment period. She claims:
It doesn't take many victorious, clean compiles or declarations of "We just got through our test with no defects!" for the teams to celebrate their union - and to feel as one jelled, collaborative team.
And later:
In industry, this adjustment period has historically taken hours or days, depending upon the individuals.
It is inconsistent to both celebrate the rapidity with which pairs jell, and then dismiss the first quarter of one's data set to allow time for them to jell. This appears to be a clear case of observational bias - the post facto rejection of data that is not consistent with one's preferred hypothesis.
The bombshell is that the difference in completion times between individuals and pairs was not statistically significant after Program 1! Williams confesses:
... after the first program, the difference between the time for individuals and pairs was no longer statistically significant. (... For the difference in time values, p = 0.380, which indicates that there is almost a 40% chance the difference in time values would be observed by chance.)
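To put that figure in perspective, here is a minimal sketch of the kind of two-sample t-test that yields such a p-value. The completion times below are invented for illustration (they are not the experiment's measurements), and I am assuming a simple comparison of means. A p-value of 0.38 says that a difference at least this large would turn up in roughly 38% of repeated experiments even if pairing had no effect at all.

```python
# A minimal sketch, assuming SciPy is available. The times below are invented
# for illustration; they are not data from the Williams experiment.
from scipy import stats

individual_minutes = [171, 200, 185, 220, 160, 195, 210, 180, 205, 190]
pair_minutes       = [180, 165, 200, 175, 190, 210, 170, 185, 195, 205]

t_stat, p_value = stats.ttest_ind(individual_minutes, pair_minutes)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A p-value well above the conventional 0.05 threshold means the observed
# difference in means could easily have arisen by chance.
```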
In [2], the following conjecture is made based on the "Lines of Code" results:
... they [pairs] consistently implemented the same functionality as the individuals in fewer lines of code. ... We believe this is an indication that the pairs had better designs.
I find the leap from "lines of code" to "better designs" to be a little hasty, to say the least.
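To see why that leap is hasty, consider a contrived example of my own (it has nothing to do with the assignments in the study). Both functions below implement the same behaviour; the first uses fewer lines, but the line count alone tells us nothing about which embodies the better design.

```python
# Two functionally equivalent implementations - line count alone says little
# about which is the better design.

def total_paid_terse(orders):
    return sum(o["amount"] for o in orders if o.get("paid"))

def total_paid_explicit(orders):
    """Sum the amounts of all orders that have been marked as paid."""
    total = 0.0
    for order in orders:
        if order.get("paid"):
            total += order["amount"]
    return total

orders = [{"amount": 10.0, "paid": True}, {"amount": 5.0, "paid": False}]
assert total_paid_terse(orders) == total_paid_explicit(orders) == 10.0
```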
The results of empirical investigations tend to carry a certain weight of authority - and rightly so. But part of the reason that empirically derived results carry such weight is the assumption that they have been derived through well-controlled experimentation by researchers who are fully cognizant of the need to account for and mitigate against the effects of methodological and observational bias. Without such mitigation, the results have only the appearance of authenticity and none of the credibility that we would otherwise attribute to them.
The Williams experiment is so open to the effects of bias and so poorly controlled that the results are next to meaningless - generalization based upon them, even more so.
In brief, the following features of the experiment are in question:
What I want to know is - why, as a community, do software developers eat this stuff up so uncritically? Why are so many people, including Williams herself, quoting this experiment and its conclusions as if they were some sort of vindication of pair programming, when they are in fact meaningless? Where are the critical examinations of her experimental method? I've found only one other - that provided by Stephens & Rosenberg in "Extreme Programming Refactored". I wonder, have those who appeal to this experiment actually read the thesis describing the experimental method at all? In a scientific community, such pseudo-experimentation would be laughed at. In software development, we can't seem to be bothered even thinking about it.
In 1998 John Nosek published the results of an experiment in pair programming [4]. Williams' use of the word "strengthening" in the title of her paper is in part a reference to this earlier work, which she mentions in her paper's introduction. I mention it here because in several ways it is methodologically superior to the Williams experiment:
Nosek was also more conservative in his subsequent analysis. Unfortunately, little can be generalized from his results because the experiment involved only the completion of a single 45-minute task.
References