Not to be found in any Methods section

On 27th April Ernö Teglas, Chris, Agnes Melinda Kovacs and I met up at the most OTT coffeehouse in Budapest, the New York. The décor puts you in a Rococo mood, but the pianist on the upper level suggests a 1920s jazz feeling. We are having outrageous layered coffee Melange with Chili, extravagant New York ice cream cups and pistachio cake, but Chris asks for a glass of Furmint and a plate of Mangalica ham.

The conversation quickly turned to how to do experiments with babies. Why do such experiments take forever to complete? What to do with non-replications?

– “But this is the same with any experiments we have ever done!” – Chris quips. “But why do you think your experiments take a long time to complete? Why are you skeptical about non-replications?”

ET: Here is my claim: The success of a baby experiment is decided by how you instruct the parents, how and where they hold their baby.

Instruction is particularly relevant because it is the parents who coordinate the data acquisition. Without them these experiments simply cannot happen, and they are very willing to help. The problem is that they have no experience with such situations. So, in a short session we have to turn them into “experimenters”. Every detail matters. It may seem surprising, but we really have long debates about how to hold the baby during tests, and what the optimal position is. The way a baby is held makes all the difference to their freedom to move. For example, when the baby is held just under the arms, the mother may exert more influence on the baby than necessary. Also, the baby can easily slip down just slightly. And then eye gaze slides too. It may no longer be on the target you display on the screen. The hold has to be on the hips of the baby.

CF: Experiments stand and fall by how the experimenter instructs the subject.

UF: But isn’t this against the idea that scientific experimentation has to be independent of the experimenter? The whole point is that experiments can be replicated by somebody else.

CF: Ah, there are critical aspects of instructions, which often don’t get spelled out in methods sections.

ET: The pity is that if a student doing a first experiment fails to replicate a previous result then this throws doubt on the previous finding, when in fact it throws doubt on the student who is still learning and doesn’t yet know how to do the experiment. Only once the student has succeeded in replicating a well known robust finding, can he or she be trusted to do a good job.

CF: When I first worked at the long disbanded MRC Clinical Research Centre, the biochemists (who, mysteriously, later turned into molecular biologists) often said things like: today the reaction just didn’t work. We have to try again.

UF: So it’s not necessarily the sign of an immature and still soft science that you have to be pernickety about exactly how an experiment is carried out. If an experiment doesn’t replicate, there are many possible reasons, and it does not necessarily mean that the previous results are not true.

ET: With infants we basically rely on measuring what they are looking at, and for how long they look at it: they look longer at something that surprises them; they get bored and look away when something is highly predictable to them. However, they also look away when something else attracts their attention, when there is some noise, or when they feel uncomfortable for some reason. All this makes us extremely careful to have completely soundproof labs, very relaxed mothers, and babies who have only recently woken up and have been fed before we even start the experiment.

– This was the moment when we found out that the coffee Melange with Chili is actually rather spicy, but the honey that served as a bottom layer softened this feeling. “Overall, each ingredient plays a role if their interaction is orchestrated by a hand sensitive to details,” says Ernö.

UF: Is it true that you often don’t get results that you know you should get?

ET: True! Then we go over every step of the procedure and find possible reasons.

UF: Is it okay to eliminate data when you believe there was a slip in the procedure for one particular baby?

ET: Actually, you have to do it. You have to eliminate data all the time. If you don’t, you include complete nonsense, for example when an infant no longer looks at the target. You need him or her to look in order to take notice of the scenario you have devised. We usually have to eliminate 20% of the data, if not more. Of course, it depends on the experimental protocol: in habituation studies the rejection rate can be as high as 50%. The procedure only lasts a few minutes because babies soon tire of watching a simple scenario on video. Sometimes we can only use a fraction of these few minutes.

UF: It makes me marvel not only at the fact that you are so incredibly scrupulous in your procedure but also that you have so many successes with your ingenious experiments.

8 thoughts on “Not to be found in any Methods section”

Dorothy Bishop says:

8 May 2014 at 06:22

Thanks for raising this topic, Uta. This is on the one hand important and on the other hand worrying!
The important bit is to draw attention to how specific aspects of methods that can crucially affect findings are often not mentioned when papers are written up, and how this can lead to problems when others try to replicate.
The worrying bit is that if experimenters decide to omit data post hoc then it opens the floodgates to the kind of problems with ‘False positive psychology’ that Simmons et al talk about: Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology. Psychological Science, 22(11), 1359-1366. doi: 10.1177/0956797611417632. It’s very clear that a lot of this goes on and in many cases it is not a case of deliberate data manipulation or fraud, but rather of experimenters deceiving themselves. As psychologists we should be all too well aware of how easily we can be biased in what we observe, and if we are to have a replicable science we must defend against that with scrupulous methods.
Of course you don’t want to include trials where the baby was not looking at the stimulus, for instance, but that is something that should (a) be mentioned in the methods, together with an indication of the criteria for deciding ‘not looking at’, and (b) be a decision made objectively – either by an automated procedure, or by someone unaware of how inclusion/exclusion of the trial affects the results.
Of course, it’s equally true that you should not dismiss a finding because an inexperienced person fails to replicate it – but I am uneasy with the argument that failures to replicate are typically due to incompetence. Surely it is up to those who have the expertise to describe the methods such that anyone should be able to replicate?
This account makes me even more convinced of the importance of Open Science. There are several steps that could be taken to help improve matters.
1. Researchers with expertise in the area need to share that expertise more, including the details that don’t get mentioned – publishing a video of the procedure is one way forward, e.g. through the Journal of Visualized Experiments (JoVE).
2. The notion of pre-registration of studies has been controversial, but could make a huge difference in this area. Together with a graduate student I’ve been going over the pre-registration requirements that are used by Cortex. On the one hand they are scarily stringent – you are expected to specify in advance criteria you will use for excluding data and what methods you will use in analysis. On the other hand, I’ve found this a very useful exercise to make explicit things that are otherwise implicit, and I am sure it will improve our experiments. You are not precluded, of course, from doing something different when your data are collected and you realise that your methods don’t capture something important, but it will at least be clear that this decision was made after data collection, which gives it a different status from a priori rules.
3. More generally, it would help if the entire dataset was made available; there are nowadays several options for depositing data on the web, e.g. Figshare or Open Science Framework. Authors who are adopting best practice for data coding and analysis could then use this as an opportunity to explain their procedures to others using the raw data as an example.
Andrew Derrington says:

8 May 2014 at 10:24

I think you are absolutely right Dorothy. I think it’s important to be clear that the ‘entire dataset’ includes rejected trials.
Bill Skaggs says:

8 May 2014 at 17:58

The question that arises, surely, is how many of these experiments are actually measuring reactions of the mother to the stimulus rather than reactions of the baby.

Best regards, Bill
Rosalind Ridley says:

11 May 2014 at 10:15

I’m not a great fan of the ‘everything needs to be replicated’ approach. If an experiment is well designed, well carried out, uses an adequate sample size (even if that number is quite small) and is significant using appropriate statistics, then the result should stand.

Excluding data points is difficult. If a data point is impossible (the subject weighed 1000kg) it must be eliminated, but really the data collected by that person should be investigated because they may have other non-obvious errors there too. Leaving out ‘iffy’ trials without published criteria of ‘iffiness’ is likely to become problematic, particularly if your method is taken up by other labs with weaker standards, ultimately bringing the whole field (including your lab) into disrepute.

In behavioural research I have always favoured choice response paradigms rather than response/no response paradigms, because choosing the wrong stimulus excludes the possibility of including a trial that was invalidated by distraction, tiredness, disinhibition, etc.

But I worry more about data where the statistical significance of a small sample is dependent on just one data point. Unless the result has been ‘pre-replicated’ because it is a well known robust effect, then that kind of result may need to be replicated.
Fred Cummins says:

13 May 2014 at 09:49

Heinz von Foerster tells a nice anecdote about an almost exact replication of Pavlov’s classical conditioning experiments, with one important alteration: no clapper in the bell. His conclusion: the sound of the bell was a stimulus, but for Pavlov, not for his dog!

https://www.youtube.com/watch?v=KM85u4AZpOU
Chris Frith says:

21 May 2014 at 21:19

The importance of non-replication

My recollection is that we used to talk about experimental control. Perhaps this was in the days of behaviourism. The idea was that the purpose of an experiment was to gain control over the behaviour of interest. A failure to replicate indicates that we don’t have control over the behaviour of interest, and is a sign that we should be doing more work in order to gain control.
A nice example of such work is found in a paper from Wolfgang Prinz’s group (Diefenbach et al. Front Psychol 4, 2013). This is a study of the priming of actions by sentences. For example, the sentence “Jakob hands you the book” should prime actions towards the body, while “You hand the book to Jakob” should prime responses away from the body. Their first experiment failed to replicate the results of a previous study by others. But they did not stop at this point. They carried out detailed analyses and ran several further experiments. By these means they regained ‘experimental control’ over the behaviour. They specified the circumstances under which positive and negative priming effects could be obtained. These effects depended on the precise timing of the processes underlying sentence comprehension and action preparation. Positive priming is observed if action planning starts at the same time as sentence presentation. Negative priming occurs if action planning is delayed for 500 msec after sentence presentation.
In this case, the failure to replicate led to important new knowledge. Unfortunately, in the current climate, failures to replicate are all too often taken as an excuse to berate the original researchers rather than as an opportunity for new developments. For example, much fuss was made about failures to replicate a much cited study reporting that people walked more slowly after being primed with words relating to old age (Bargh et al. 1996. J Pers Soc Psychol, 71, 230-244). In one of the studies (Doyen et al. 2012 PLoS ONE Jan 18) the authors showed that priming effects could be obtained when the person running the experiment expected certain results. Rather than berating Bargh and colleagues for not properly controlling for expectations, we should be exploring the mechanisms through which the expectations of the experimenter can alter the behaviour of the people participating in the experiment. Understanding such mechanisms is critical, not only for advancing the field of social cognition, but also for gaining experimental control in psychology.
Andy Bremner says:

1 June 2014 at 11:22

It’s very easy to bias infant looking time, and particularly so in the direction of a null effect. I personally think that the difficulty of understanding what infants’ looking behaviour actually tells us (and the related challenge of designing well controlled experiments) outweighs concerns about false positives. The good news is that whereas we used to be much too reliant on looking duration (~10 years ago) we’re now seeing a wider range of methods being more commonly adopted for use with infants (orienting responses, ERPs, EEG, fNIRS, pupil dilation). This helps us with concerns about false positives, as convergent evidence across measures strengthens confidence. It also helps in a range of ways with understanding what the findings mean.
Vincent Reid says:

2 June 2014 at 08:53

Very nicely put, Andy. Evidence from convergent methods is the most robust way forward.
