Econometric Replication: Lessons from the Experimental Sciences

From the Quarterly Journal of Business and Economics, Volume 23, Number 1
Robert A. Mittelstaedt
Nathan Gold Professor of Marketing, University of Nebraska-Lincoln
Thomas S. Zorn
Assistant Professor of Finance, University of Nebraska-Lincoln
I. Introduction
The case for replication has been made by many. The editor of the Journal of
Irreproducible Results puts it this way:
The glorious endeavor that we know today as science has grown out of the
murk of sorcery, religious ritual, and cooking. But while witches, priests,
and chefs were developing taller and taller hats, scientists worked out a
method for determining the validity of their experimental results: they
learned to ask “are they reproducible?” [7].
The experimental disciplines have developed a tradition of replication that Kane [4]
calls “improvisational.” The purpose of this paper is to examine more closely the
nature of replication in the experimental tradition; it argues that a similar
view is important in econometric research and suggests that a restructuring of
incentives for researchers is in order.
II. Replication in the Experimental Tradition
By their nature, experiments are reproducible events. That fact in itself might
lead to the expectation of replication in the experimental disciplines. The
experimenter’s emphasis on replication, however, grows less from the ease of doing
than from those disciplines’ attitude toward the relationship between data and
knowledge. The French physiologist Claude Bernard noted over 100 years ago:
The fact is nothing in itself, it has value only through the proof it supplies.
We have said elsewhere that, when one calls a new fact a discovery, the fact
itself is not the discovery, but rather the new idea derived from it; in the
same way, when a fact proves anything, the fact does not itself give the
proof, but only the rational relation which it establishes between the
phenomenon and its cause [1].
Shifting the focus of attention from the phenomenon to the relationship among
variables raises questions about the robustness and generalizability of those
relationships. This suggests replication is vital, and that it may be done for several
purposes.
The term “replication” has yet to be defined. Beyond the general statement that
it is an attempt to test the consistency of a relationship among two or more variables
under similar or predictably different conditions, the term nearly defies precise
definition. It is useful to describe several types of replication, as shown in Exhibit 1.
Exhibit 1—Types of Replication, Classified by Activity of Replicator Compared to Original
Study’s Methods and Data Sources

                                            Observations of      Observations of
                                            Same Phenomena       Different Phenomena
Same measures and methods
  of establishing relationships             Type I               Type III
Different measures and methods
  of establishing relationships             Type II              Type IV
A Type I replication is described by Kane [4] as an “econometric audit.” The
replicating researcher (RR) uses the same data sources, models, proxy variables, and
statistical methods as the original researcher (OR). In a Type II replication, RR uses the
same data sources, but employs different models, proxy variables, and/or statistical
methods. In a Type III replication, RR uses the same models, proxy variables, and
statistical methods, but applies them to different data than those used by OR. In a Type
IV replication, different models, proxy variables, and statistical methods are applied
to different data.1
Each type contributes to the growth of a discipline’s knowledge base. It is the
attempt to replicate that makes the contribution; something is learned whether the
original findings are confirmed or not.
To illustrate, imagine OR has hypothesized Y = f(x1, x2, x3) and shown this
relationship holds when tested by a multiple regression model on U.S. data for the
years 1953-1978. RR attempts to replicate using OR’s exact methodological recipe
on the same data, a Type I replication. If RR’s results match OR’s, it would be
beneficial to the research community to know of the confirmation; OR’s finding has
been “audited.” Unfortunately, RR’s confirmatory results will likely go unreported.
On the other hand, if RR’s results do not confirm OR’s findings, it may be reported
in the literature. While this information is very important to the community of
researchers, it adds almost nothing to the base of knowledge.
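A Type I audit of the kind described above can be sketched in a few lines. The data, variable names, and coefficients below are hypothetical stand-ins, not anything from the studies the paper discusses; the point is only that RR re-runs OR's exact recipe on the same data and checks for agreement.

```python
# A minimal sketch of a Type I "econometric audit": the replicating
# researcher (RR) re-runs the original researcher's (OR's) exact
# regression on the same data and compares the estimates.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical annual observations, standing in for a 26-year sample.
n = 26
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * x3 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])  # intercept + regressors


def ols(X, y):
    """Ordinary least squares, the method OR's recipe specifies."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


beta_or = ols(X, y)  # OR's published estimates
beta_rr = ols(X, y)  # RR's re-run: same data, same method

# A Type I confirmation: the audit reproduces OR's coefficients.
audit_passed = np.allclose(beta_or, beta_rr)
print(audit_passed)
```

Because nothing in the recipe varies, agreement here rules out only gross errors (transcription mistakes, coding bugs), which is exactly the limited but real value the text assigns to Type I work.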
1 Kane [4] refers to Type II and Type III replications as “improvisational replication.” Lykken
[6] calls Type I “literal replication,” Type II “operational replication,” and Types III and
IV “constructive replication.”
Type I confirmatory findings should be reported because they eliminate doubts
of gross errors in OR’s study. They provide important incentives to both OR and RR.
Economists are aware of the importance of incentives. If confirmatory findings are
typically not reported, few individuals are likely to undertake such efforts on the
small chance that an error of sufficient magnitude to warrant publication will be
uncovered. Moreover, if OR knows that his steps are likely to be retraced, his
incentives to avoid errors are affected. Type I confirmatory replication need not be
reported at any length (although the material should be available on request). Nor is
it suggested that RR’s work should be esteemed by the profession as highly as OR’s.
Without Type I replication, however, gross errors may persist unnoticed.
RR might attempt a Type II replication by following OR’s recipe on the same
data, but altering the specification of the model to improve its explanatory power. If
an improvement results, the research community would benefit. Alternatively, RR
might show OR’s findings were sensitive to the method used and lacked robustness.
In either case, it seems likely that RR will have considerable difficulty finding a
journal to publish these results; they lack the “glamour” of discovery of a major error
by OR. RR’s finding that OR’s work was insensitive to change in method would be
even less likely reported. This is unfortunate, as this form of confirmation would
probably advance knowledge further, for it shows that OR’s results possess
robustness.
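A Type II exercise of this sort can be sketched as follows. The data and the alternative specifications are hypothetical illustrations: RR keeps OR's data but varies the model, then asks whether the coefficient of interest survives the changes.

```python
# A minimal sketch of a Type II replication: same data, but RR varies
# the specification to see whether the estimate for x1 is robust.
import numpy as np

rng = np.random.default_rng(1)
n = 26
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * x3 + rng.normal(scale=0.1, size=n)


def ols(X, y):
    """Ordinary least squares via a least-squares solve."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


ones = np.ones(n)
specifications = {
    "original:  y ~ x1 + x2 + x3": np.column_stack([ones, x1, x2, x3]),
    "drop x3:   y ~ x1 + x2": np.column_stack([ones, x1, x2]),
    "add x2*x3: y ~ x1 + x2 + x3 + x2*x3": np.column_stack(
        [ones, x1, x2, x3, x2 * x3]
    ),
}

# If the coefficient on x1 is similar across specifications, the
# original finding is robust to (these) changes in method.
for name, X in specifications.items():
    beta = ols(X, y)
    print(f"{name}: coefficient on x1 = {beta[1]:+.3f}")
```

Note that this deliberately stops well short of trying "every conceivable specification"; as the next paragraph argues, distinguishing disciplined sensitivity analysis from data mining is a matter of judgment, not mechanics.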
There is a temptation to suppose that OR should do this type of sensitivity
analysis. Excessive “data mining” by OR, however, may be viewed as a fishing
expedition for “significant” results, a technique scorned by statisticians. Ideally OR
postulates a relationship or hypothesis, formulates an econometric model, and does
one computer run. OR should not be encouraged to try every conceivable
specification that current technology permits. It is difficult to specify the rules for
OR and RR with respect to Type II manipulation and replication. Only shrewd
judgments by qualified researchers can distinguish mindless variation and Type n
replication. Interesting examples of the latter were stimulated by Milton Friedman’s
research on the demand for money. Replication of his findings led to the discovery
of the sensitivity of money demand to the interest rate. A number of studies that
directly or indirectly addressed this issue were published because important policy
issues were at stake.2 Clearly, the importance of the original study affects the value
of Type II replications. This episode illustrates that Type II replications cannot be
dismissed out of hand; rather, they form an important basis for the tentative
acceptance or rejection of any hypothesized empirical relationship by the profession.
Valuable as confirmatory Type I and Type II replications may be, they
demonstrate little about the generalizability of the relationships. Major impediments
to generalizability can remain: (1) OR’s confirmed findings may be an artifact of
data which contain systematic error; (2) the data may be error free, but the observed
levels of X1, X2, and X3 are peculiar to the time period and environment observed;
and (3) Y is driven by an unobserved variable (X?) which, had it been measured,
would have explained a greater proportion of the variance than one or more of the
independent variables observed by OR and used by RR.

2 Laidler [5] cites and discusses a number of these studies.
The extension of knowledge involves the search for generalizability. In a Type
III replication, RR applies OR’s methods to different data. Confirmatory results help
to rule out the possibility that OR’s findings resulted from artifacts of error-ridden
data, observations made in unique circumstances, or simple data “massaging.” A
Type III replication, however, cannot rule out the possibility that some unobserved
variable, common to both sets, drives Y. Obviously, a failure to replicate in the Type
III sense suggests a limit to the generality of the relationship.
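The Type III case can be sketched in the same style. Here both data sets are hypothetical draws standing in for, say, two countries or two time periods; RR holds OR's model and method fixed and asks whether the relationship carries over.

```python
# A minimal sketch of a Type III replication: RR applies OR's exact
# model and method to a different (hypothetical) data set to probe
# the generalizability of the relationship.
import numpy as np


def fit_or_model(x1, x2, x3, y):
    """OR's recipe: OLS of y on an intercept, x1, x2, and x3."""
    X = np.column_stack([np.ones(len(y)), x1, x2, x3])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def draw_sample(seed, n=26):
    # Hypothetical data-generating process; in practice the two
    # samples would come from different periods or environments.
    rng = np.random.default_rng(seed)
    x1, x2, x3 = rng.normal(size=(3, n))
    y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * x3 + rng.normal(scale=0.1, size=n)
    return x1, x2, x3, y


beta_original = fit_or_model(*draw_sample(seed=0))  # OR's sample
beta_new_data = fit_or_model(*draw_sample(seed=1))  # RR's sample

# Confirmation in the Type III sense: the slope coefficients keep the
# same signs across data sets, helping rule out data-specific artifacts.
confirmed = np.all(np.sign(beta_original[1:]) == np.sign(beta_new_data[1:]))
print(confirmed)
```

Even agreement here cannot exclude an unobserved common driver of Y, which is exactly the limitation the text notes for Type III work.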
It should be noted that a Type III replication in econometrics comes the closest
to replication in the experimental sciences. Both Type I and II replications represent
RR peering over the shoulder of OR and critically evaluating the experiment. Type
III replication is a crucial step toward the generalizability of the results; it needs to
be institutionalized as part of the research process. The caution exhibited in the
experimental sciences, with respect to results that have not been confirmed by
independent researchers, should be noted.
It is arguable whether Type IV replications should be considered replications at all.
This depends on the claims made by OR (i.e., how general his results should be
considered) and, in part, on the intentions of RR. If RR replicates the study with
different data (a Type III replication), and is forced to use a different proxy for one
of the variables, such a study should be considered a Type IV replication. If RR uses
what he considers an improved econometric model, it could still fall within the
category of replication. A Type IV replication should be a skillful blend of Types II
and III. At some point, the changes become so significant that the study would not
be considered a replication. The current emphasis on originality, particularly by
journal editors and referees, probably encourages needless and wasteful product
differentiation by researchers who are understandably anxious to be published. More
might be learned if fewer such studies were attempted.
Many Type IV replications are not consciously undertaken. These Type IV
replications are not the result of RR’s empirical work, but the result of a scholarly
review of the work of others. A good literature review examines the generalizability
of a particular relationship; the developing field of meta-analysis formalizes the
process [3]. Testing the limits of generalizability can reach far beyond the
boundaries of one discipline. For example, confidence in the assertion that people
seek to maximize the return from their own efforts increases when it is learned that
Heinrich [2] demonstrated that pollen-gathering bumblebees exhibit the same
optimizing behavior.
III. The Gains from Replication
The number of known relationships that exhibit sufficient invariance and
universality to merit the label “law” is small. This does not relieve researchers from
the responsibility to test the limits of those observed relationships. “Original
research” tends to be rewarded, and “mere replication” tends to be downgraded. In
the end, however, if all that is known is that “X is related to Y by a particular set of
observations using a particular statistical technique,” little is known about the world
around us.
That which isn’t worth replicating isn’t worth knowing. Those who produce
“original research” thus have the most to gain from replication. Because both
confirmations and disconfirmations contribute to the base of knowledge, the research
community should have the opportunity to know of all attempts to replicate.
Confirmations extend the generalizability of knowledge; disconfirmations suggest
new approaches must be found. It is important that econometric replications be taken
as seriously as replications in the experimental sciences.
References
1. Claude Bernard, An Introduction to the Study of Experimental Medicine, translated by Henry C. Greene (New York: Collier Books, 1961).
2. Bernd Heinrich, Bumblebee Economics (Cambridge: Harvard University Press, 1979).
3. John E. Hunter, Frank L. Schmidt, and Gregg B. Jackson, Meta-Analysis: Cumulating Research Findings Across Studies (Beverly Hills: Sage Publications, 1982).
4. Edward J. Kane, “Why Journal Editors Should Encourage the Replication of Applied Econometric Research,” Quarterly Journal of Business and Economics, 23, No. 1 (Winter 1984), pp. 3-8.
5. David E. Laidler, The Demand for Money: Theories and Evidence (Scranton: International Textbook Company, 1969).
6. David T. Lykken, “Statistical Significance in Psychological Research,” Psychological Bulletin, 70 (February 1968), pp. 151-159.
7. George H. Scherr, “Irreproducible Science: Editor’s Introduction,” The Best of the Journal of Irreproducible Results (New York: Workman Publishing, 1983).