Andrew Gelman (note that I am somewhat behind on reading, since I usually stay away from the computer on weekends) posted about convergence diagnostics in Gibbs sampling. He has an example showing that convergence diagnostics are not about the precision of estimates. Even if different chains converge in one run, another run (that also converges) might give rather different estimates:
Consider theta. R.hat is at 1.0, indicating that the 3 chains have mixed, but the two runs give different values for the posterior mean (10.9 or 12.2). (And it’s not just a problem with the mean; there’s a similar difference in the medians.)
Andrew then shows that this isn't a real problem: 1. the posterior standard deviations are rather large, and 2. the effective sample size is only 300, so the run-to-run difference in the means is well within Monte Carlo error.
Here's a point that's really important, and not only for those working with sophisticated methods like Gibbs sampling:
Sampling error is larger than you think.
I recently had a discussion with a student who had collected data on students' achievement with a standardized test. She feared that something had gone wrong during testing, because the mean results differed between semesters and courses, sometimes (she had several samples) even significantly. But that is not only the notoriously difficult problem of multiple comparisons; in my experience, people generally underestimate how much a statistic like the mean can differ from its population value in small samples. Again citing Andrew's example:
Let's do some quick simulations, again using the normal distribution:

> mean(rnorm(300, 10.9, 8.3))
10.1
> mean(rnorm(300, 10.9, 8.3))
12.4
> mean(rnorm(300, 10.9, 8.3))
11.3
As you can see, even completely random draws don't recover the posterior mean very precisely when you have only 300 of them.
And now think about sample sizes in psychology – the student I was talking about had sub-samples of about 40…
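To put a number on how much worse n = 40 is than n = 300, here is a small sketch of the same experiment in Python with NumPy (mirroring the `rnorm` call above; the sample sizes 10 000, and the mean 10.9 and standard deviation 8.3 are taken from the example, everything else is my own illustration). It draws many samples of each size and reports the empirical spread of the sample means next to the theoretical standard error sigma / sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 10.9, 8.3  # posterior mean and sd from Andrew's example

# Draw 10,000 samples of each size and see how much the sample mean wanders.
for n in (300, 40):
    means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:3d}: empirical sd of mean = {means.std():.2f}, "
          f"theory sigma/sqrt(n) = {sigma / n**0.5:.2f}")
```

With n = 40 the standard error of the mean is more than 1.3, so observed means a couple of points apart are entirely unremarkable, exactly the situation the student was worried about.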