(Warning: serious, difficult content interspersed with ranting and exasperation)
A possible source of empiricism in psychology (i.e., overreliance on empirical results combined with a devaluation of, or at least an active ignoring of, theoretical analysis) is a phenomenon that often occurs in statistical analyses, and all the more so the more variables you put into the analysis: you won’t get a clear, unambiguous result. Got your attention? Read on!
That was kind of a presumptuous statement; there are (probably many) situations where statistical analysis yields results that are clear and don’t leave much room for interpretation, such as when you compare different medical treatments in a carefully controlled double-blind study with a precisely defined outcome. But this setting is empirical-science paradise. In reality, at least for psychologists and other behavioral scientists, the problems often start with the impossibility of double-blinding: a therapist knows what kind of intervention he is performing, and only rarely will it be possible to have therapists carry out an intervention other than the one they are used to and prefer. The consequence is that you cannot separate intervention effects from therapist effects.
You might argue that there is at least the broad domain of experimental psychology, where, by definition, experiments are carefully controlled. But what about matters of operationalization? For a well-known and excessively used experimental task such as the Stroop color-word task (color words printed in a different color, e.g. the word “red” printed in green, and you have to name the print color, not read the word), researchers don’t really agree on whether performance measures “inhibition” or “interference”. Still, I have to say that this is not really a statistical problem.
The real statistical problems turn up once you compare not just one group with another, but several groups. What do you say when you find that two experimental groups differ from the control group, but not from each other? And even getting that kind of result isn’t easy: there are many procedures for such multiple comparisons, and unfortunately they won’t all tell you the same thing. And it’s not only the different procedures (and there is never one that can be preferred over all the others…); there is the power issue: would you have gotten a different result if you had included more observations? If you make too few observations, you don’t have enough statistical power, and you won’t be able to defend your results against the Chance Explanation: “the results might only be due to a few odd cases”. With enough power, you can make a claim such as: “it is highly improbable that the difference between experimental and control groups is only due to a few odd cases”; in stats-speak this is termed “significance”. Unfortunately, there is also such a thing as over-powering, i.e. examining too many observations: in that case you might get a statistically significant result that doesn’t mean much, because the difference is tiny.
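The over-powering point is easy to demonstrate by simulation. Here is a minimal sketch (Python with numpy/scipy; the numbers are made up for illustration): a true difference of a twentieth of a standard deviation, which hardly anybody would call meaningful, comes out “significant” if you just collect enough observations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A deliberately "over-powered" comparison: a practically negligible true
# difference of 0.05 standard deviations, but 50,000 observations per group.
n = 50_000
control = rng.normal(0.0, 1.0, n)
treatment = rng.normal(0.05, 1.0, n)

t, p = stats.ttest_ind(treatment, control)
print(p < 0.05)  # True: the tiny difference is "significant" anyway
```

The test says “highly improbable that this is due to a few odd cases”, and it is right; it just says nothing about whether the difference matters.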
For two-group analyses, such as comparing medication against placebo, and for some several-group designs, there exists something like an optimal number of observations, one you can calculate beforehand so you won’t get into big trouble with power problems. However, even for this limited set of designs you must know beforehand what kind of effect you expect, as in “the difference between medication and placebo should be at least so-and-so”. Even if you are able to make such a prediction, which is far from easy, you will be left on your own whenever you not only compare groups on a single dimension, such as experimental vs. control, but include another factor you want to make comparisons on, e.g. evaluating the treatment effects for different age groups, genders, etc.
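For the simple two-group case, the a-priori calculation can be sketched with the usual normal-approximation formula (this is a sketch; proper software uses the noncentral t distribution and gives slightly larger numbers). The point to notice is how the answer depends entirely on the “so-and-so” you must specify in advance:

```python
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-group comparison.

    d is the expected standardized difference (Cohen's d), i.e. the
    "difference should be at least so-and-so" you must state beforehand.
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(0.5))  # a "medium" effect: 63 per group
print(n_per_group(0.2))  # a "small" effect: 393 per group
```

Halve the expected effect and the required sample size quadruples, which is exactly why getting the prediction wrong gets you into power trouble.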
For such complicated designs, which are nonetheless more the rule than the exception in the behavioral sciences, there is no unique power calculation, given that you could want to analyze several different kinds of comparisons, such as interaction effects (“the treatment generally is effective, with the exception of females above 50 years”): the optimal number of observations for an interaction effect will differ from that for simple group comparisons. But in most applications of such designs, people will want to analyze both kinds of comparisons. And oh, I forgot: most of the power calculations available have serious requirements, such as equal group sizes, equal variances, and so on; requirements like these are routinely violated in “real-life” research. (For general information about power analysis there are countless introductory textbooks and web pages; with the problems that arise when actually doing research, instead of just talking about it in a statistics textbook, you are left on your own…)
Leaving experimental designs, things get even worse. Scott Maxwell, himself author of a really good textbook on experimental designs, published two related articles (I based part of this rant on them…) in the prestigious journal Psychological Methods in 2000 and 2004. He first showed the difficulties of sample-size calculations for regression analysis (i.e. you don’t compare different groups, but instead collect metric measurements on different variables for all observed individuals and relate them to an outcome variable), which are even larger than for the multi-factor group comparisons described above. More important for the point I want to make, he then argued that a really large number of psychological studies are actually underpowered, yet people don’t do anything about it. Why is that? Because in multi-comparison designs or regression analyses with several variables, chances are high that you will get at least one significant result among the many possible comparisons. In the 2004 paper, Maxwell showed that for an experimental design with 2 factors, each with 2 levels (e.g. experimental vs. control and male vs. female), with all possible effects of medium size, the chance is 71% that at least one of the effects will be statistically significant with only 10 observations per cell of the design (i.e. 10 males in the control group, 10 males in the experimental group, …). So the researcher might contentedly stop collecting data at such a small sample size because he has found a significant effect, thereby failing to notice that all of the effects were really there, which he would have discovered with just a little more effort. For regression analyses things can get even worse; assume you wanted to predict academic success from several predictor variables such as academic competence, appearance, social competence, etc.
Even if “in reality”, that is, if you were able to examine all possible subjects, all predictor variables are related to academic success, chances are very high that in a small sample you will find only one significant predictor, and then you stop collecting data.
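This regression scenario can also be sketched by simulation (my own toy numbers: five uncorrelated predictors, each with a true standardized coefficient of 0.2, and only 50 subjects). The point is how rarely such a sample reveals the full picture:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k, reps = 50, 5, 2000
beta = np.full(k, 0.2)  # every predictor is truly related to the outcome

at_least_one = all_five = 0
for _ in range(reps):
    X = rng.normal(size=(n, k))
    y = X @ beta + rng.normal(0, np.sqrt(0.8), n)
    Xd = np.column_stack([np.ones(n), X])      # design matrix with intercept
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    b = XtX_inv @ Xd.T @ y                     # OLS estimates
    resid = y - Xd @ b
    sigma2 = resid @ resid / (n - k - 1)
    se = np.sqrt(np.diag(XtX_inv) * sigma2)
    t = b[1:] / se[1:]                         # drop the intercept
    sig = 2 * stats.t.sf(np.abs(t), n - k - 1) < 0.05
    at_least_one += sig.any()
    all_five += sig.all()

print(at_least_one / reps, all_five / reps)
```

In the large majority of these simulated studies at least one predictor is significant, so there is something to publish; the samples in which all five truly related predictors show up as significant are vanishingly rare.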
I want to call this effect “statistical self-immunization”, because it prevents people from doing really thorough analyses: even without a proper design, chances are very high that there will be at least one significant, and thus publishable, result whenever you observe many variables on a limited number of subjects. People would only start thinking in really novel ways if they kept on failing. The erratic one-significant-result-out-of-many-possible, however, immunizes against thinking outside the box.
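The “at least one significant” mechanism behind Maxwell’s 71% figure can be checked with a rough simulation (how I parameterize “medium” effects, as 0.25 SD contrast coefficients, is my own assumption, so the exact percentage need not match his):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 10, 2000            # 10 observations per cell, 2000 simulated studies
a = b = ab = 0.25             # assumed "medium" main and interaction effects
# Cell means for the 2x2 layout: (A+,B+), (A+,B-), (A-,B+), (A-,B-)
means = np.array([a + b + ab, a - b - ab, -a + b - ab, -a - b + ab])
contrasts = np.array([[1, 1, -1, -1],    # main effect of factor A
                      [1, -1, 1, -1],    # main effect of factor B
                      [1, -1, -1, 1]])   # A x B interaction

hits = 0
for _ in range(reps):
    cells = rng.normal(means[:, None], 1.0, size=(4, n))
    cell_means = cells.mean(axis=1)
    mse = cells.var(axis=1, ddof=1).mean()            # pooled error variance
    se = np.sqrt(mse * (contrasts ** 2 / n).sum(axis=1))
    t = contrasts @ cell_means / se
    p = 2 * stats.t.sf(np.abs(t), df=4 * (n - 1))
    hits += (p < 0.05).any()

print(hits / reps)  # roughly 0.7: some effect is usually "significant"
```

Each single effect has quite poor power at 10 per cell, yet the chance that at least one of the three comes out significant is around 70%, which is exactly the self-immunizing outcome.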
Now this was all about the serious kinds of analysis strategies. In factor analysis, you are in statistics hell. Exploratory factor analysis can be interpreted any way you like, depending on how many factors you want to find; and even if you use some objective-seeming criterion, chances are high you won’t find the true number of factors. And in “confirmatory” factor analysis, you have two paradoxically operating power problems: if you increase the number of observations enough to use the sophisticated estimation techniques, the model-that-should-be routinely does not fit, because of overpowering (and other problems, but that’s a different story). So people turn to all kinds of cheating: “minor modifications”, “item parceling”, “correlated disturbances”; they will ignore model equivalence, and they will turn to “multifaceted conceptions of model fit” in order to find the one conception that serves their purpose (publication).
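How badly an “objective-seeming criterion” can mislead is easy to show (a minimal sketch with numpy, my own toy example): apply the popular Kaiser “eigenvalue greater than 1” retention rule to pure noise, where the true number of factors is zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 20

# Pure noise: 20 uncorrelated variables, so there are zero true factors.
X = rng.normal(size=(n, p))
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))

# The Kaiser "eigenvalue > 1" retention rule, applied to factorless noise:
print((eigenvalues > 1).sum())  # still suggests retaining several "factors"
```

Sampling error alone pushes a good portion of the eigenvalues of the sample correlation matrix above 1, so the rule happily “finds” factors where there are none.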